Introduction

vLLM is a high-throughput and memory-efficient inference and serving engine for large language models that has fundamentally transformed how organizations deploy LLM inference at scale. Originally developed at UC Berkeley’s Sky Computing Lab, vLLM was introduced in the PagedAttention paper published at ACM SOSP 2023, which brought a novel approach to KV cache management inspired by operating system virtual memory. Today, vLLM has over 2,000 contributors and supports more than 200 model architectures, making it one of the most widely adopted open-source LLM serving frameworks in the world. Its tagline – “Easy, fast, and cheap LLM serving for everyone” – captures the project’s mission to democratize access to efficient LLM deployment.

Before vLLM, traditional LLM serving systems suffered from severe GPU memory waste due to KV cache fragmentation. Pre-allocating a contiguous memory region for each request’s key-value cache, sized for the worst-case sequence length, meant that 60-80% of the memory reserved for the KV cache went unused, dramatically limiting the number of concurrent requests a single GPU could handle. vLLM’s PagedAttention mechanism eliminates this waste by applying virtual memory paging concepts to KV cache management, enabling near-full GPU memory utilization and achieving throughput improvements of 2-4x over earlier systems.

Key Insight: vLLM’s PagedAttention eliminates the 60-80% of KV cache memory traditionally wasted to fragmentation and over-allocation, enabling near-full GPU memory utilization for inference workloads.

Architecture Overview

[Figure: vLLM Architecture]

The vLLM architecture follows a layered design that cleanly separates the API serving layer from the core inference engine, enabling both online serving and offline batch inference through a unified codebase. At the top of the stack sits the API Server Layer, built on FastAPI and uvicorn, which exposes OpenAI-compatible endpoints for chat completions, text completions, embeddings, and more. This layer handles HTTP request parsing, authentication, and response streaming, making it trivial to migrate existing applications from OpenAI’s API to a self-hosted vLLM deployment.

Beneath the API server lies the AsyncLLM Engine, which serves as the asynchronous bridge between the API layer and the core inference engine. Communication between the API server process and the engine core process uses ZMQ IPC with Msgpack serialization, providing low-latency message passing without the overhead of HTTP. The AsyncLLM Engine manages a coroutine-based request queue, allowing it to handle thousands of concurrent connections efficiently while forwarding requests to the EngineCore for processing.

The EngineCore is the heart of vLLM, running a tight busy loop: schedule -> execute -> output. On each iteration, the scheduler determines which requests to advance, the model executor runs the forward pass on the GPU, and the output is processed to generate new tokens. This iteration-level scheduling granularity is what enables continuous batching – new requests can be injected into the running batch at any step, rather than waiting for the entire batch to complete.

The Scheduler operates at the token level rather than the request level, managing a token budget per step and deciding which requests from the RUNNING and WAITING queues should be scheduled. When GPU memory is insufficient for all pending requests, the scheduler preempts the lowest-priority requests by swapping their KV cache to CPU memory or recomputing their state later.

The KV Cache Manager maintains a BlockPool that tracks all physical GPU memory blocks available for KV cache storage. It supports prefix caching through block hashing (using sha256_cbor or xxhash_cbor algorithms), allowing requests that share common system prompts to reuse cached KV blocks via copy-on-write semantics. This dramatically reduces the computational cost of processing repeated prompt prefixes.

The Executor layer abstracts the hardware execution backend. vLLM supports three executor types: UniProcExecutor for single-GPU inference, MultiprocExecutor for multi-GPU tensor parallelism on a single node, and RayDistributedExecutor for distributed inference across multiple nodes. Each executor manages worker processes that run the actual model computation.

The GPU Model Runner handles the actual forward pass execution, managing block tables that map virtual KV cache blocks to physical memory locations, and selecting the appropriate attention backend (FlashAttention, FlashInfer, Triton, or ROCm AITER) based on the available hardware and model configuration.

vLLM supports a wide range of hardware platforms including NVIDIA GPUs (CUDA), AMD GPUs (ROCm), Intel accelerators (XPU), Google TPUs, and AWS Trainium/Inferentia, making it one of the most portable LLM serving engines available.

The following code snippet shows the core EngineCore step loop, which drives the entire inference pipeline on each iteration:

def step(self) -> tuple[dict[int, EngineCoreOutputs], bool]:
    # 1. Decide which requests advance this iteration (token-level scheduling).
    scheduler_output = self.scheduler.schedule()
    # 2. Launch the forward pass without blocking the CPU thread...
    future = self.model_executor.execute_model(scheduler_output, non_block=True)
    # 3. ...so CPU-side work (here, the structured-output grammar bitmask)
    #    overlaps with GPU execution.
    grammar_output = self.scheduler.get_grammar_bitmask(scheduler_output)
    # 4. Wait for the GPU, then fold the model output back into request state.
    model_output = future.result()
    engine_core_outputs = self.scheduler.update_from_output(
        scheduler_output, model_output
    )
    # The boolean tells the caller whether any work was actually scheduled.
    return engine_core_outputs, scheduler_output.total_num_scheduled_tokens > 0

PagedAttention: The Core Innovation

[Figure: PagedAttention Mechanism]

PagedAttention is the foundational innovation that first set vLLM apart from other LLM serving engines. The core insight is remarkably elegant: take the virtual memory paging concept that operating systems have used for decades to manage physical RAM, and apply it to the KV cache stored in GPU memory. In traditional LLM serving, each request’s KV cache must be stored in contiguous memory, leading to severe fragmentation because the maximum sequence length is unknown at allocation time. Systems typically over-provision memory for each request, reserving space for the worst-case sequence length, which wastes 60-80% of the memory set aside for the KV cache.

PagedAttention solves this by dividing the KV cache into fixed-size virtual blocks, each holding the keys and values for a fixed number of token positions (16 by default). These virtual blocks are mapped to physical blocks in GPU memory through a Block Table, analogous to the page table in an operating system. Because physical blocks do not need to be contiguous, the system can allocate them on demand as a request generates more tokens, eliminating the need for pre-allocation and dramatically reducing memory waste.

The Block Table is the central data structure that makes this mapping possible. Each request maintains its own block table, which maps its virtual block indices to physical block IDs in GPU memory. During the attention computation, the GPU kernel reads the block table to locate the physical memory addresses of each token’s key and value vectors, performing attention over non-contiguous memory regions efficiently.
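To make the translation concrete, here is a minimal sketch of the lookup, assuming a 16-token block size; the helper name locate_kv and the list-based block table are illustrative, not vLLM’s actual kernel code:

BLOCK_SIZE = 16  # tokens per block (vLLM's default)

def locate_kv(block_table: list[int], token_pos: int) -> tuple[int, int]:
    """Map a token's position in the sequence to (physical_block_id, offset)."""
    virtual_block = token_pos // BLOCK_SIZE   # index into the block table
    offset = token_pos % BLOCK_SIZE           # slot within that block
    return block_table[virtual_block], offset

# Example: with block table [7, 42, 3], token 35 lives in physical
# block 3 at offset 3, since 35 = 2 * 16 + 3.

The attention kernel performs exactly this indirection for every key and value it reads, which is why the block table is kept in GPU memory alongside the cache itself.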

Block sharing with reference counting enables powerful optimizations. When multiple requests share a common prompt prefix (such as a system prompt), they can share the same physical blocks by simply pointing their block table entries to the same physical block IDs. A reference counter on each physical block tracks how many requests are using it. When a request needs to modify a shared block, the system performs a copy-on-write: it allocates a new physical block, copies the data, and decrements the reference count on the original block. This closely mirrors how operating systems handle shared memory pages.
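A hedged sketch of the copy-on-write path follows; the Block type, the list-based free pool, and the function name are illustrative stand-ins for vLLM’s KV cache manager internals:

from dataclasses import dataclass, field

@dataclass
class Block:
    block_id: int
    ref_cnt: int = 1
    data: bytearray = field(default_factory=bytearray)

def write_block(block: Block, free_pool: list[Block], payload: bytes) -> Block:
    """Append to a block, copying it first if other requests still share it."""
    if block.ref_cnt > 1:
        new_block = free_pool.pop()            # allocate a private copy
        new_block.data = bytearray(block.data)
        block.ref_cnt -= 1                     # original keeps its other users
        block = new_block
    block.data.extend(payload)                 # now safe to mutate
    return block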

Prefix caching extends block sharing by computing a hash of each block’s content using either the sha256_cbor or xxhash_cbor algorithm. When a new request arrives, the KV Cache Manager computes hashes for each block of the prompt and checks whether matching blocks already exist in the cache. If a cache hit occurs, the request reuses the existing physical blocks without recomputing the key-value pairs, saving significant GPU compute time for repeated prompt prefixes.
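The sketch below illustrates the chained hashing idea: each block’s hash incorporates the parent block’s hash, so a hit on block N implies the entire prefix up to N matches. The function name and the use of hashlib.sha256 over a simple string encoding are our simplifications of vLLM’s sha256_cbor scheme:

import hashlib

BLOCK_SIZE = 16

def hash_prompt_blocks(token_ids: list[int]) -> list[str]:
    """Hash each full block of the prompt, chaining in the parent hash."""
    hashes: list[str] = []
    parent = ""
    for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        chunk = token_ids[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(
            (parent + "," + ",".join(map(str, chunk))).encode()
        ).hexdigest()
        hashes.append(digest)   # lookup key into the cached-block table
        parent = digest         # each hash commits to the whole prefix
    return hashes

Only full blocks are hashed and cached; a partially filled final block cannot be shared because its contents are still changing.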

The KVCacheBlock data structure is the fundamental unit of memory management in vLLM. Each block tracks its physical block ID, a reference count for copy-on-write semantics, a block hash for prefix caching lookups, and doubly-linked list pointers for the free block queue. When a block is freed (its reference count drops to zero), it is added to the tail of the free block queue, implementing a least-recently-used (LRU) eviction policy.

The BlockPool manages the pool of all physical blocks, maintaining a free block queue as a doubly-linked list (a simplified sketch follows the KVCacheBlock definition below). When a new block is needed, it is dequeued from the head of the free list. When a block is released, it is appended to the tail. This LRU ordering ensures that the most recently freed blocks are the last to be evicted, maximizing cache hit rates for prefix caching.

vLLM also supports KV cache quantization to further reduce memory consumption. Supported quantization modes include FP8 (both E4M3 and E5M2 formats with static or dynamic scaling), INT8, and NVFP4. These modes compress the key and value vectors stored in the KV cache, allowing longer sequences or more concurrent requests to fit within the same GPU memory budget, with minimal impact on model accuracy.

@dataclass(slots=True)
class KVCacheBlock:
    block_id: int           # Physical block ID
    ref_cnt: int = 0        # Reference count (copy-on-write)
    _block_hash: BlockHashWithGroupId | None = None  # Hash for prefix caching
    prev_free_block: "KVCacheBlock | None" = None     # Doubly linked list
    next_free_block: "KVCacheBlock | None" = None
    is_null: bool = False   # Null block placeholder
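Building on the KVCacheBlock above, here is an approximate BlockPool sketch. For brevity it uses collections.deque in place of the intrusive doubly-linked list that vLLM threads through prev_free_block/next_free_block, but the LRU behavior is the same:

from collections import deque

class BlockPoolSketch:
    def __init__(self, num_blocks: int):
        self.blocks = [KVCacheBlock(block_id=i) for i in range(num_blocks)]
        self.free_queue = deque(self.blocks)   # head = evicted first (LRU)

    def allocate(self) -> KVCacheBlock:
        block = self.free_queue.popleft()      # reuse the oldest free block
        block.ref_cnt = 1
        return block

    def free(self, block: KVCacheBlock) -> None:
        block.ref_cnt -= 1
        if block.ref_cnt == 0:                 # no request references it now
            self.free_queue.append(block)      # most recently freed -> tail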

Amazing: PagedAttention applies the same virtual memory paging concept used in operating systems to KV cache management, allowing non-contiguous memory allocation and block sharing across requests – reducing KV cache memory waste from 60-80% to near zero.

Continuous Batching

[Figure: Continuous Batching Flow]

Continuous batching is the scheduling paradigm that enables vLLM to achieve its remarkable throughput numbers. Unlike traditional LLM serving systems that operate in discrete phases – a prefill phase where all requests in a batch process their prompts, followed by a decode phase where all requests generate tokens – vLLM eliminates the rigid boundary between these phases entirely. Instead, it performs token-level scheduling at every iteration, allowing the scheduler to mix prefill and decode operations freely within a single step.

On each scheduling iteration, the scheduler first processes RUNNING requests – those that are already in the middle of generation. These include requests that are decoding (generating one new token per step) and requests that are in the middle of a chunked prefill (processing a long prompt in multiple steps). Running requests are given priority because they already have KV cache blocks allocated in GPU memory, and preempting them would waste the compute already invested.

Next, the scheduler considers WAITING requests – new arrivals that have not yet been scheduled. For each waiting request, the scheduler performs a prefix cache lookup to determine how many of the prompt tokens already have cached KV blocks. If a cache hit is found, the request only needs to process the uncached suffix, saving both compute and memory. The scheduler then checks whether there is sufficient token budget and available KV cache blocks to admit the waiting request into the running batch.

Token budget management is critical to maintaining stable performance. The scheduler maintains a maximum number of tokens that can be processed in a single step, which is determined by the GPU memory capacity and the model configuration. If adding a new request would exceed the token budget, the scheduler defers it to the next iteration. This prevents out-of-memory errors and ensures that each step completes within a predictable time bound.

When the number of available KV cache blocks is insufficient to accommodate all running and waiting requests, the scheduler performs preemption. It identifies the lowest-priority requests (typically the most recently admitted ones) and either swaps their KV cache to CPU memory or simply drops them, forcing recomputation when they are rescheduled. Preemption ensures that the highest-priority requests continue to make progress even under memory pressure.

Chunked prefill is another important optimization for handling long prompts. Rather than processing an entire long prompt in a single step (which would consume a large portion of the token budget and delay all other requests), the scheduler can split the prefill across multiple steps. Each step processes a chunk of the prompt tokens, interleaving the chunked prefill with ongoing decode operations. This prevents long prompts from causing latency spikes for shorter requests that are already generating tokens.
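The following toy function shows the budget arithmetic for one step; the 512-token chunk cap, the request dicts, and the function name are illustrative rather than vLLM’s actual scheduler internals:

def plan_step(running: list[dict], token_budget: int) -> dict[str, int]:
    """Assign this step's token budget across running requests."""
    plan: dict[str, int] = {}
    for req in running:
        if req["remaining_prompt"] == 0:
            need = 1                                  # decode: one new token
        else:
            need = min(req["remaining_prompt"], 512)  # chunked prefill cap
        take = min(need, token_budget)
        if take == 0:
            break                                     # budget exhausted
        plan[req["request_id"]] = take
        token_budget -= take
    return plan

A 10,000-token prompt under an 8,192-token budget is therefore prefilled over multiple steps, each of which still leaves room for the single-token decodes of requests already generating.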

The request lifecycle in vLLM follows a clear state machine: a new request enters the WAITING state, transitions to RUNNING when the scheduler admits it, and finally reaches FINISHED when the model generates an end-of-sequence token or the maximum token limit is reached. At each iteration, the scheduler evaluates all requests and updates their states accordingly, ensuring that GPU resources are always fully utilized.
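As a compact reference, the lifecycle can be written as a tiny state machine; vLLM’s actual RequestStatus enum is richer (it distinguishes preemption and several finished variants), so treat this as a simplification:

from enum import Enum, auto

class RequestState(Enum):
    WAITING = auto()    # queued; no KV cache blocks allocated yet
    RUNNING = auto()    # admitted; holds KV cache blocks on the GPU
    FINISHED = auto()   # emitted EOS or hit the max-token limit

# WAITING -> RUNNING on admission, RUNNING -> WAITING on preemption,
# RUNNING -> FINISHED when generation stops.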

def schedule(self) -> SchedulerOutput:
    # No separate "prefill phase" or "decode phase" – just token-level
    # scheduling on every iteration:
    #   1. Schedule RUNNING requests (decodes + ongoing chunked prefills).
    #   2. Schedule WAITING requests (new prefills) while budget remains.
    #   3. Preempt the lowest-priority requests if KV blocks run out.
    ...

Takeaway: vLLM’s continuous batching eliminates the rigid prefill-decode phase boundaries found in traditional serving systems, allowing new requests to enter the batch at any iteration – dramatically improving GPU utilization and reducing latency.

Distributed Inference

As language models grow larger – with models like Llama 3.1 405B and DeepSeek-V3 requiring hundreds of gigabytes of parameters – distributing inference across multiple GPUs and nodes becomes essential. vLLM provides a comprehensive set of parallelism strategies to handle models of any size.

Tensor Parallelism (TP) splits individual model weight matrices across multiple GPUs, with each GPU computing a portion of the matrix multiplication and then synchronizing the results via all-reduce operations. This is the most common parallelism strategy for LLM inference, as it reduces the per-GPU memory requirement and can accelerate computation for large layers. vLLM supports TP within a single node (using NCCL for communication) and across nodes (using Ray for process management).

Pipeline Parallelism (PP) divides the model layers across GPUs, with each GPU responsible for a contiguous group of layers. Requests flow through the pipeline stage by stage, with a batch queue managing the flow between stages. While PP introduces pipeline bubbles (periods where some GPUs are idle), it requires less inter-GPU communication than TP, making it suitable for multi-node deployments where network bandwidth is limited.
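As a concrete example, a two-node deployment might combine TP within each node and PP across nodes. The command below is a sketch; the model choice and the 8x2 split are illustrative:

# 2 nodes x 8 GPUs: TP=8 inside each node, PP=2 across nodes
vllm serve meta-llama/Llama-3.1-405B-Instruct \
    --tensor-parallel-size 8 \
    --pipeline-parallel-size 2 \
    --distributed-executor-backend ray

Placing TP inside the node keeps its frequent all-reduce traffic on fast intra-node links, while the lighter stage-to-stage PP traffic crosses the network.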

Data Parallelism (DP) replicates the entire model across multiple nodes, with each node processing a different subset of requests. This is the simplest form of parallelism and is ideal when the model fits on a single node but you need to increase overall throughput. vLLM’s DP implementation includes load balancing across replicas.

Expert Parallelism (EP) is designed specifically for Mixture-of-Experts (MoE) models, where different experts within a layer can be placed on different GPUs. Each token is routed to the appropriate expert based on the router’s output, with all-to-all communication to distribute tokens and gather results. EP is critical for serving large MoE models like Mixtral and DeepSeek-V3 efficiently.

Context Parallelism comes in two flavors: Prefill Context Parallelism (PCP) splits the prefill computation across ranks for very long prompts, and Decode Context Parallelism (DCP) splits the decode attention across ranks. These strategies are particularly useful for models with very long context windows.

Disaggregated Prefill/Decode is an advanced deployment pattern where separate vLLM instances handle prefill and decode operations. Prefill instances are optimized for compute throughput (processing long prompts quickly), while decode instances are optimized for low-latency token generation. KV cache blocks are transferred from prefill to decode instances over high-speed interconnects, allowing each instance type to be right-sized for its workload.

The executor backend selection determines how vLLM manages distributed processes:

@staticmethod
def get_class(vllm_config: VllmConfig) -> type["Executor"]:
    # Simplified: the backend name comes from the parallel config
    distributed_executor_backend = (
        vllm_config.parallel_config.distributed_executor_backend
    )
    if distributed_executor_backend == "ray":
        executor_class = RayDistributedExecutor    # multi-node via Ray
    elif distributed_executor_backend == "mp":
        executor_class = MultiprocExecutor         # single-node multi-GPU
    elif distributed_executor_backend == "uni":
        executor_class = UniProcExecutor           # single process / GPU
    return executor_class

Quantization Support

Quantization is essential for deploying large language models on limited GPU memory. vLLM supports over 20 quantization methods, covering both weight quantization (reducing model size) and KV cache quantization (reducing memory usage during inference). The following table summarizes the major quantization methods supported by vLLM:

| Method | Type | Bits | Notes |
|---|---|---|---|
| FP8 (E4M3/E5M2) | Weight + KV | 8 | Static or dynamic scaling per tensor/channel |
| GPTQ / GPTQ-Marlin | Weight | 4 | Marlin kernel for fast 4-bit inference |
| AWQ / AWQ-Marlin / AWQ-Triton | Weight | 4 | Activation-aware weight quantization |
| GGUF | Weight | 2-8 | Compatible with llama.cpp format |
| bitsandbytes | Weight | 4/8 | NF4 and INT4/INT8 quantization |
| compressed-tensors | Weight | Variable | Neural Magic format support |
| ModelOpt | Weight | 4/8 | NVIDIA Model Optimizer integration |
| TorchAO | Weight | Variable | PyTorch native quantization |
| MXFP4 / NVFP4 | Weight + KV | 4 | Microscaling formats for Blackwell GPUs |
| INT8 experts | Weight | 8 | Specialized for MoE expert layers |
| KV Cache FP8 | KV Cache | 8 | Reduces KV cache memory by 50% |
| KV Cache INT8 | KV Cache | 8 | Alternative 8-bit KV cache compression |
| KV Cache NVFP4 | KV Cache | 4 | Maximum KV cache compression for Blackwell |

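Enabling quantization is typically a one-flag change. The commands below are sketches; the AWQ checkpoint name is illustrative:

# Serve a pre-quantized AWQ checkpoint (4-bit weights)
vllm serve TheBloke/Llama-2-7B-AWQ --quantization awq

# Add FP8 KV cache compression to an unquantized model
vllm serve meta-llama/Llama-3.1-8B-Instruct --kv-cache-dtype fp8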
Important: vLLM supports 20+ quantization methods including weight quantization and KV cache quantization, enabling deployment of large models on limited GPU memory with minimal accuracy loss.

OpenAI-Compatible API Server

One of vLLM’s most practical features is its drop-in OpenAI-compatible API server. Any application built against the OpenAI API can be pointed at a vLLM server by changing only the base URL, with no other code modifications. The server supports the following endpoints:

  • /v1/chat/completions – Chat completion with conversation history
  • /v1/completions – Text completion for single prompts
  • /v1/models – List available models
  • /v1/embeddings – Text embedding generation

Additionally, vLLM provides extended endpoints for tokenization (/tokenize, /detokenize), scoring (/score), speech-to-text, and a realtime API for streaming interactions. For teams already using Anthropic’s API, vLLM also offers Anthropic Messages API compatibility. A gRPC endpoint is available for high-performance programmatic access.

Starting the server is straightforward:

# Start OpenAI-compatible server
vllm serve meta-llama/Llama-3.1-8B-Instruct

# Or with Python
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct

Once the server is running, you can use the standard OpenAI Python client to interact with it:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-api-key"  # any string works unless the server sets --api-key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

This compatibility layer makes vLLM an ideal choice for organizations that want to self-host LLM inference while maintaining the ability to switch between OpenAI’s cloud API and their own infrastructure without modifying application code.

Integration Ecosystem

[Figure: vLLM Ecosystem]

vLLM’s success is not just about its core inference engine – it is the rich ecosystem of integrations that makes it the de facto standard for LLM serving. The ecosystem spans framework integrations, hardware platforms, attention backends, speculative decoding strategies, structured output engines, and adapter support, creating a comprehensive platform that fits into virtually any AI infrastructure.

Framework integrations connect vLLM to the broader AI development ecosystem. LangChain and LlamaIndex, two of the most popular frameworks for building LLM-powered applications, both provide native vLLM integration, allowing developers to use vLLM as the inference backend within their agent and RAG pipelines. Ray integration enables distributed deployment and autoscaling of vLLM instances, making it easy to serve models at production scale. HuggingFace integration provides seamless model loading from the Hub, supporting automatic downloading, caching, and configuration of model weights and tokenizers.

Hardware platform support is one of vLLM’s strongest differentiators. While many LLM serving engines are tightly coupled to NVIDIA CUDA, vLLM supports NVIDIA GPUs (via CUDA), AMD GPUs (via ROCm), Intel accelerators (via XPU), Google TPUs, AWS Trainium and Inferentia, and even CPU-only inference. This breadth of hardware support means organizations are not locked into a single vendor and can deploy vLLM on whatever infrastructure they have available.

Attention backends are the low-level kernels that compute the attention operation, and vLLM supports multiple implementations optimized for different hardware and model configurations. FlashAttention-2 and FlashAttention-3 provide the highest performance on NVIDIA GPUs. FlashInfer offers flexible attention kernels with support for custom attention patterns. Triton-based attention provides a portable fallback that works on any CUDA GPU. For AMD GPUs, the ROCm AITER backend delivers optimized attention. MLA (Multi-head Latent Attention) support handles models like DeepSeek that use non-standard attention architectures.

Speculative decoding is a latency optimization technique that uses a small draft model to predict multiple tokens ahead, which are then verified in parallel by the large model. vLLM supports several speculative decoding strategies: EAGLE uses a lightweight draft model with feature-level drafting, Medusa adds multiple prediction heads to the model, N-gram matching uses statistical patterns from the prompt, and suffix decoding leverages suffix arrays for efficient matching. These techniques can reduce end-to-end latency by 2-3x for compatible workloads.
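As an example, n-gram speculation needs no separate draft model, since draft tokens are proposed by matching patterns already present in the prompt. The snippet below follows recent vLLM releases; the exact speculative_config keys have shifted across versions, so check the documentation for your version:

from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "ngram",
        "num_speculative_tokens": 5,  # draft tokens proposed per step
        "prompt_lookup_max": 4,       # longest n-gram matched in the prompt
    },
)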

Structured output support ensures that generated text conforms to a specified format, such as JSON schemas or regular expressions. vLLM integrates with xgrammar and guidance to enforce output structure during generation, which is essential for building reliable API-driven applications on top of LLM inference.
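A sketch of schema-constrained generation through the offline API follows. It uses GuidedDecodingParams, which recent vLLM versions expose (newer releases are renaming this toward “structured outputs”, so verify against your version; the schema is illustrative):

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}, "population": {"type": "integer"}},
    "required": ["city", "population"],
}

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(
    max_tokens=100,
    guided_decoding=GuidedDecodingParams(json=schema),  # constrain output to schema
)
outputs = llm.generate(["Give one fact about Tokyo as JSON."], params)
print(outputs[0].outputs[0].text)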

LoRA support enables serving multiple fine-tuned adapters from a single base model, dramatically reducing the memory and compute cost of serving many specialized models. vLLM supports multi-LoRA serving with dynamic loading and unloading of adapters, allowing a single vLLM instance to serve hundreds of LoRA adapters efficiently.
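Adapters are registered at startup (or loaded dynamically) and selected per request by name; the adapter names and paths below are illustrative:

# Serve a base model plus named LoRA adapters
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --enable-lora \
    --lora-modules sql-adapter=/path/to/sql-lora chat-adapter=/path/to/chat-lora

# Clients pick an adapter by passing its name as the model:
#   client.chat.completions.create(model="sql-adapter", ...)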

Installation and Quick Start

Getting started with vLLM is straightforward. The recommended installation method uses pip:

# Install vLLM
pip install vllm

# Or with uv for faster installation
uv pip install vllm

For offline batch inference (without starting a server), vLLM provides a simple Python API:

from vllm import LLM, SamplingParams

llm = LLM(model='meta-llama/Llama-3.1-8B-Instruct')
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(['Hello, how are you?'], params)
for output in outputs:
    print(output.outputs[0].text)

For online serving with the OpenAI-compatible API, start the server with:

vllm serve meta-llama/Llama-3.1-8B-Instruct --tensor-parallel-size 2

The --tensor-parallel-size flag enables multi-GPU tensor parallelism. Additional configuration options include --gpu-memory-utilization to control how much GPU memory vLLM uses (default 0.9), --max-model-len to set the maximum sequence length, and --quantization to specify a quantization method.
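Putting these together, a memory-constrained single-GPU deployment might look like the following; the values are illustrative starting points, not recommendations:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --gpu-memory-utilization 0.85 \
    --max-model-len 8192 \
    --quantization fp8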

Performance Optimizations

vLLM incorporates numerous performance optimizations at every level of the stack, from kernel-level tuning to system-level scheduling. Understanding these optimizations is key to getting the best performance from your vLLM deployment.

CUDA/HIP graphs reduce kernel launch overhead by capturing sequences of GPU operations into a single graph that can be replayed with minimal CPU-GPU synchronization. During the decode phase, where the same sequence of operations is repeated for every token, CUDA graphs eliminate per-kernel launch overhead (typically a few microseconds per launch), which adds up to significant savings at high iteration rates.
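Graph capture is enabled by default; if you need eager execution (for example, while debugging), it can be disabled at a throughput cost:

# --enforce-eager disables CUDA graph capture and runs eagerly
vllm serve meta-llama/Llama-3.1-8B-Instruct --enforce-eager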

Optimized attention kernels are critical for LLM inference performance. vLLM supports FlashAttention-2, FlashAttention-3, FlashInfer, and TRTLLM-GEN backends, each optimized for different GPU architectures and sequence length ranges. The attention backend is selected automatically based on the hardware and model configuration, but can also be manually specified for fine-tuning.
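The automatic choice can be overridden through an environment variable, which is useful for benchmarking backends against each other:

# Force the FlashInfer attention backend instead of the automatic pick
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve meta-llama/Llama-3.1-8B-Instruct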

Optimized GEMM and MoE kernels accelerate the matrix multiplications that dominate LLM inference time. vLLM uses CUTLASS-based kernels for standard dense GEMM, TRTLLM-GEN kernels for quantized inference, and CuTeDSL kernels for emerging GPU architectures. For MoE models, specialized kernels handle the sparse expert routing and computation efficiently.

torch.compile integration enables automatic kernel generation and optimization using PyTorch’s just-in-time compilation system. This can fuse multiple operations into a single kernel, reduce memory traffic, and take advantage of hardware-specific optimizations without manual kernel development.

Prefix caching provides significant speedups for workloads with shared system prompts. When multiple requests share the same prompt prefix, the KV cache blocks for that prefix are computed once and then shared across all subsequent requests via copy-on-write semantics. This can nearly eliminate prefill computation for the cached portion of the prompt, with savings that grow with the prefix length.

Speculative decoding reduces end-to-end latency by generating multiple candidate tokens in parallel and verifying them against the target model. When the draft model’s predictions are correct (which happens frequently for well-matched draft models), multiple tokens are produced in a single step, effectively multiplying the generation speed without any loss in output quality.

Conclusion

vLLM has established itself as the de facto standard for LLM inference serving by solving the fundamental memory management problem that plagued earlier systems. Its PagedAttention mechanism, inspired by operating system virtual memory, reduces KV cache memory waste from 60-80% to near zero, enabling dramatically higher throughput on the same hardware. Combined with continuous batching, which eliminates rigid phase boundaries and maximizes GPU utilization at every iteration, vLLM achieves 2-4x throughput improvements over alternative serving engines.

The project’s breadth of support – 200+ model architectures, 20+ quantization methods, 5+ hardware platforms, and a rich ecosystem of framework integrations – makes it the most versatile LLM serving engine available. Whether you are deploying a single GPU for development or a multi-node cluster for production, vLLM provides the tools and optimizations needed to serve LLMs efficiently and cost-effectively.

For organizations building LLM-powered applications, vLLM’s OpenAI-compatible API server provides a zero-friction migration path from cloud APIs to self-hosted infrastructure, while its distributed inference capabilities ensure that even the largest models can be served with low latency and high throughput.

Resources:

Watch PyShine on YouTube