Welcome to vLLM!
================

.. figure:: ./assets/logos/vllm-logo-text-light.png
  :width: 60%
  :align: center
  :alt: vLLM
  :class: no-scaled-link

Easy, fast, and cheap LLM serving for everyone

vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

* State-of-the-art serving throughput
* Efficient management of attention key and value memory with **PagedAttention**
* Continuous batching of incoming requests
* Fast model execution with CUDA/HIP graphs
* Quantization: GPTQ, AWQ, INT4, INT8, and FP8
* Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
* Speculative decoding
* Chunked prefill

vLLM is flexible and easy to use with:

* Seamless integration with popular HuggingFace models
* High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
* Tensor parallelism and pipeline parallelism support for distributed inference
* Streaming outputs
* OpenAI-compatible API server
* Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, TPU, and AWS Neuron
* Prefix caching support
* Multi-LoRA support

For more information, check out the following:

* vLLM announcement blog post (intro to PagedAttention)
* vLLM paper (SOSP 2023)
* "How continuous batching enables 23x throughput in LLM inference while reducing p50 latency" by Cade Daniel et al.
* :doc:`vLLM Meetups <community/meetups>`

Documentation
-------------

.. toctree::
   :maxdepth: 1
   :caption: Getting Started

   getting_started/installation
   getting_started/amd-installation
   getting_started/openvino-installation
   getting_started/cpu-installation
   getting_started/neuron-installation
   getting_started/tpu-installation
   getting_started/xpu-installation
   getting_started/quickstart
   getting_started/debugging
   getting_started/examples/examples_index

.. toctree::
   :maxdepth: 1
   :caption: Serving

   serving/openai_compatible_server
   serving/deploying_with_docker
   serving/distributed_serving
   serving/metrics
   serving/env_vars
   serving/usage_stats
   serving/integrations
   serving/tensorizer
   serving/faq

.. toctree::
   :maxdepth: 1
   :caption: Models

   models/supported_models
   models/adding_model
   models/enabling_multimodal_inputs
   models/engine_args
   models/lora
   models/vlm
   models/spec_decode
   models/performance

.. toctree::
   :maxdepth: 1
   :caption: Quantization

   quantization/supported_hardware
   quantization/auto_awq
   quantization/bnb
   quantization/int8
   quantization/fp8
   quantization/fp8_e5m2_kvcache
   quantization/fp8_e4m3_kvcache

.. toctree::
   :maxdepth: 1
   :caption: Automatic Prefix Caching

   automatic_prefix_caching/apc
   automatic_prefix_caching/details

.. toctree::
   :maxdepth: 1
   :caption: Performance benchmarks

   performance_benchmark/benchmarks

.. toctree::
   :maxdepth: 2
   :caption: Developer Documentation

   dev/sampling_params
   dev/offline_inference/offline_index
   dev/engine/engine_index
   dev/kernel/paged_attention
   dev/input_processing/model_inputs_index
   dev/multimodal/multimodal_index
   dev/dockerfile/dockerfile
   dev/profiling/profiling_index

.. toctree::
   :maxdepth: 1
   :caption: Community

   community/meetups
   community/sponsors

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`