LMCache: A KV Cache Management Layer for Scalable LLM Inference

What Is LMCache

LMCache is a KV cache management layer for LLM inference. It turns the KV cache – the attention keys and values generated during prefill – from a temporary, engine-bound state into reusable, persistent, AI-native knowledge that can be stored, reused across requests and sessions, monitored with a full observability stack, and transformed for better generation quality. The result is a significant reduction in time-to-first-token (TTFT) and a measurable improvement in throughput, especially for long-context agentic, multi-turn conversation, and knowledge-augmented workloads such as retrieval-augmented generation (RAG).

The project lives at github.com/LMCache/LMCache and has attracted over 9,500 GitHub stars. It is licensed under Apache 2.0, has joined the PyTorch Foundation, and can be installed with a single command: pip install lmcache. The codebase is written primarily in Python with C++ native extensions in csrc/ and Rust source in rust/ for performance-critical paths.

LMCache is vendor-neutral by design. It works as a KV cache layer across mainstream open-source serving engines including vLLM, SGLang, and TensorRT-LLM. It supports a wide range of storage backends – CPU RAM, local disk (SSD), Redis/Valkey, Mooncake, InfiniStore, S3-compatible object storage, NIXL, and GDS – allowing users to freely switch between serving engines and storage vendors while reusing the stored KV caches. This vendor neutrality is a core architectural principle, not an afterthought.

The core problem LMCache addresses is that every prefill recomputes attention keys and values from scratch. For long-context workloads, this prefill computation dominates TTFT and consumes GPU resources that could otherwise serve decode requests. By caching and reusing KV blocks, LMCache eliminates redundant computation, frees GPU memory, and enables cross-request, cross-session, and cross-engine reuse.

LMCache Architecture

Understanding the LMCache Architecture

The architecture diagram above illustrates the layered design of LMCache. It shows how serving engines, the core cache engine, tiered storage, transfer channels, and observability tools fit together.

Top Layer – LLM Serving Engines:

vLLM, SGLang, and TensorRT-LLM generate KV cache during prefill.
These engines do not manage KV cache persistence themselves.
They delegate that responsibility to LMCache through the integration layer.

Integration Layer:

Contains engine-specific connectors in integration/vllm/, integration/sglang/, and integration/tensorrt_llm/.
Each connector intercepts prefill and redirects to the cache engine for lookup.

LMCache v1 Core – the heart of the system:

cache_engine.py – central orchestration of all cache operations.
token_database.py – token-to-cache-block mapping with O(1) lookup.
manager.py – high-level coordination of all subsystems.
metadata.py – cache block metadata and lifecycle tracking.
protocol.py – internal communication between components.
When a request arrives, the Cache Engine coordinates with the Token Database to check for cached KV blocks.
If cached, the Manager retrieves them from the appropriate storage tier.

Storage Layer – L1 (local) and L2 (remote/distributed):

L1 local: CPU RAM (local_cpu_backend.py), local disk/SSD (local_disk_backend.py), GPU Direct Storage (gds_backend.py).
L2 remote: Redis/Valkey (redis_connector.py), S3 (s3_connector.py), Mooncake (mooncakestore_connector.py).
L2 also includes P2P (p2p_backend.py), NIXL (nixl_storage_backend.py), and Aerospike (aerospike_client.py).
The Manager automatically offloads KV blocks from L1 to L2 when local capacity is exhausted.
Retrieved blocks are promoted back to L1 when needed for decode.

Transfer Channels – enabling PD disaggregation:

nixl_channel.py – RDMA/NVLink-based transfer for maximum bandwidth.
py_socket_channel.py – TCP-based transfer for broader compatibility.
Prefill workers send KV cache to decode workers over high-speed interconnects.

Observability and CLI Layer:

Metrics collection for monitoring cache health and performance.
lmcache CLI for management operations.
lmcache_server for running the standalone daemon.
lmcache_controller for distributed control plane operations.

How It Works

LMCache Workflow

The KV Cache Lifecycle

The workflow diagram above traces the complete lifecycle of a KV cache request through LMCache, from the moment a user request arrives at the serving engine to the point where a response is delivered. The flow begins when a request arrives at the serving engine (vLLM, SGLang, or TensorRT-LLM). The engine integration layer intercepts the prefill phase before any computation begins, redirecting the request to LMCache for cache lookup.

The Token Database performs a lookup to determine whether the KV blocks for the incoming tokens already exist in the cache. This is the critical decision point. If the lookup results in a cache hit, the system retrieves the pre-computed KV blocks from the appropriate storage tier – L1 local storage (CPU RAM, SSD, or GDS) for the fastest access, or L2 remote storage (Redis, S3, P2P, or NIXL) for larger capacity. The retrieved KV blocks pass through the KV Codec for decompression via the SERDE interface, and the decompressed KV is delivered directly to the engine for decode, skipping the expensive prefill computation entirely. This is where LMCache delivers its most significant TTFT reduction.

If the lookup results in a cache miss, the engine computes the prefill normally, generating new KV blocks. These new blocks pass through the KV Codec for compression – the asymmetric K16/V8 quantization in asym_k16_v8.py reduces memory footprint while maintaining generation quality. The compressed blocks are then stored in the tiered cache, starting with L1 local storage and automatically offloading to L2 if local capacity is exceeded. At this point, the system checks whether PD disaggregation is in use. If yes, the KV blocks are transferred via NIXL/RDMA to a separate decode worker, which handles the token generation phase. If no, the local engine proceeds with decode directly.

A distinctive path in the workflow is the Cacheblend non-prefix reuse route. Traditional prefix caching only reuses KV blocks that match the beginning of a prompt. Cacheblend goes further: it identifies reusable KV blocks at any position in the prompt, not just the prefix. When a partial match is found, Cacheblend selectively recomputes only the tokens where the cached KV diverges from the new prompt, recovering generation quality while still avoiding most of the prefill computation. This is particularly valuable for RAG workloads where documents are inserted at different positions in the prompt, and for multi-turn conversations where earlier context can be reused even when the latest turn changes the prefix.

The engine-independent daemon design is a key architectural decision. LMCache runs as a standalone process, separate from the inference engine. This means the KV cache survives engine crashes – there is no fate-sharing between the cache and the engine. If vLLM crashes and restarts, the cached KV blocks are still available, and the engine can resume serving from cache without recomputing anything. This design also enables KV cache sharing across multiple engine instances, which is essential for distributed inference and for scenarios where different engines serve different request types.

Architecture Deep Dive

The LMCache v1 codebase is organized into several key subsystems, each responsible for a specific aspect of KV cache management. Understanding these subsystems is essential for contributors and for operators who need to configure or extend the system.

Core v1 Module

The core of LMCache lives in lmcache/v1/ and consists of the following key files:

cache_engine.py – The central orchestration point for all cache operations. It coordinates between the token database, storage manager, and GPU connectors to serve cache lookups and store new KV blocks.
token_database.py – Maintains the mapping between token sequences and cached KV blocks. This is the data structure that enables O(1) cache lookup, determining whether a given prompt’s KV is already cached.
metadata.py – Tracks cache block metadata including creation time, access frequency, size, and lifecycle state. This information drives eviction decisions and quota management.
manager.py – The high-level cache manager that coordinates all subsystems, including the storage manager, GPU connectors, and observability modules.
protocol.py – Defines the internal communication protocol used between LMCache components, especially in distributed and multiprocess deployments.

Storage Backend Hierarchy

The storage system in lmcache/v1/storage_backend/ implements a two-tier hierarchy with a unified interface:

L1 Local Storage provides the fastest access path:

local_cpu_backend.py – CPU RAM storage, the primary offload target for KV blocks that need fast retrieval.
local_disk_backend.py – Local disk/SSD storage for larger capacity with moderate latency.
gds_backend.py – GPU Direct Storage, which bypasses the CPU and enables direct data transfer between SSDs and GPU memory.

L2 Remote and Distributed Storage provides scalable capacity:

remote_backend.py – The abstract remote backend interface.
p2p_backend.py – Peer-to-peer backend for multi-node CPU memory sharing.
nixl_storage_backend.py – NIXL-based storage for RDMA-enabled remote access.
maru_backend.py – Maru backend for specialized storage scenarios.

The connector layer in storage_backend/connector/ provides adapters for specific storage systems: redis_connector.py and valkey_connector.py for Redis/Valkey, s3_connector.py for S3-compatible object storage, mooncakestore_connector.py for Mooncake, infinistore_connector.py for InfiniStore, aerospike_client.py for Aerospike, and fs_connector.py for filesystem-based storage. Each adapter implements the same unified interface, making it straightforward to add new storage backends.

KV Codec and Compression

The KV codec subsystem in lmcache/v1/kv_codec/ handles compression and serialization of KV blocks:

asym_k16_v8.py – Asymmetric quantization that uses 16-bit precision for keys and 8-bit precision for values. This asymmetric approach preserves the numerical accuracy needed for attention computation while reducing the memory footprint of the larger value tensors.
encoded_kv.py – The serialized KV representation format used for storage and transfer.

The SERDE (serialize/deserialize) interface in storage_backend/naive_serde/ provides a pluggable framework for researchers to implement custom compression, token dropping, and serialization strategies. The cachegen_encoder.py and cachegen_decoder.py implement the CacheGen compression scheme from the SIGCOMM 2024 paper, while kivi_serde.py implements KIVI quantization.

Transfer Channels

The transfer channel subsystem in lmcache/v1/transfer_channel/ enables PD disaggregation:

nixl_channel.py – NIXL-based transfer channel using RDMA/NVLink for maximum bandwidth and minimum latency. This is the preferred transfer mechanism for PD disaggregation in production deployments with compatible hardware.
py_socket_channel.py – TCP-based transfer channel using PySocket for broader compatibility when RDMA hardware is not available.
mock_memory_channel.py – In-memory mock channel for testing and development.

Multiprocess Architecture

The multiprocess subsystem in lmcache/v1/multiprocess/ implements the MP architecture that delivered a 10x MoE inference performance boost. Key files include engine_context.py for engine-level context management, engine_module.py for engine module coordination, session.py for session management, and posix_shm.py for POSIX shared memory operations. The MP architecture achieves process-level parallelism, allowing multiple engine processes to share KV cache through shared memory without the overhead of inter-process communication for every cache operation.

Distributed Coordination

The distributed subsystem in lmcache/v1/distributed/ provides the coordination layer for multi-node deployments. Key components include storage_controller.py for storage controller operations, l1_manager.py for L1 storage management, quota_manager.py for quota enforcement, eviction.py for cache eviction policies, and tiers.py for tier management. The cache policy module supports multiple eviction strategies: FIFO (fifo.py), LRU (lru.py), LFU (lfu.py), and MRU (mru.py).

Integration Layer

The integration layer in lmcache/integration/ provides engine-specific connectors:

vllm/ – vLLM integration, including the MP connector for multiprocess deployments.
sglang/ – SGLang integration for drop-in acceleration.
tensorrt_llm/ – TensorRT-LLM integration for NVIDIA-optimized inference.

CLI Tools and Deployment

LMCache provides three CLI entry points defined in pyproject.toml:

lmcache – The main CLI for cache management operations, mapped to lmcache.cli.main:main.
lmcache_server – The standalone daemon server, mapped to lmcache.v1.server.__main__:main.
lmcache_controller – The distributed control plane API server, mapped to lmcache.v1.api_server.__main__:main.

The Kubernetes operator in operator/ enables production deployment on Kubernetes clusters, with custom resource definitions for managing LMCache instances. Docker images are available in the docker/ directory for containerized deployment.

Installation

LMCache requires Python 3.10 through 3.13, a GPU environment, and a Linux (POSIX) operating system. The build system uses torch==2.11.0, ninja, and setuptools_scm as build dependencies.

Install from PyPI

The simplest way to install LMCache is from PyPI:

pip install lmcache

This installs the latest release along with all runtime dependencies. The current version on PyPI is 0.5.0, released in June 2026.

Install from Source

For development or for flexibility with the torch version, install from source with --no-build-isolation:

      
        git clone https://github.com/LMCache/LMCache.git
cd LMCache
pip install -e . --no-build-isolation

When using --no-build-isolation, ensure that torch, ninja, and setuptools_scm are installed in your environment before running the install command, since the build will not create an isolated environment to fetch them.

Docker and Kubernetes

Docker images are available in the docker/ directory of the repository for containerized deployment. The Kubernetes operator in operator/ provides custom resource definitions for deploying and managing LMCache on Kubernetes clusters, with built-in health monitoring and metrics collection.

Configuration

LMCache is configured via YAML config files that specify storage backend selection, cache policies, transfer channel settings, and observability options. Configuration examples for different storage backends are available in the documentation.

Usage

LMCache Features

Understanding the Feature Map

The features diagram above maps the seven key features of LMCache to the source components that implement each one. This mapping is useful for developers who want to understand the codebase structure and for operators who need to know which components to configure for specific use cases.

Engine-Independent Deployment is implemented through the standalone/ directory and the server/ module, which together run LMCache as a standalone daemon process. This daemon manages KV cache independently from the inference engine process, ensuring that cached KV survives engine crashes and can be shared across multiple engine instances. The key architectural principle is no fate-sharing: the cache lifecycle is decoupled from the engine lifecycle.

Tiered KV Cache Offloading is implemented through storage_manager.py and the L1/L2 storage backends. The local_cpu_backend.py provides CPU RAM storage, local_disk_backend.py provides SSD storage, and gds_backend.py provides GPU Direct Storage. The storage manager automatically moves KV blocks between tiers based on access patterns and capacity constraints, ensuring that frequently accessed blocks stay in the fastest tier while less frequently accessed blocks are offloaded to higher-capacity tiers.

KV Cache Observability is implemented through the health_monitor/ and mp_observability/ modules, which provide a rich set of metrics including Kubernetes health monitoring, performance diagnostics, request-level and token-level prefix cache hit rates, KV cache lifecycle tracking, and user-specific usage metrics. These metrics are exposed through standard interfaces for integration with monitoring dashboards.

Pluggable Backends are implemented through the connector layer in storage_backend/connector/, with adapters for Redis (redis_connector.py), S3 (s3_connector.py), Mooncake (mooncakestore_connector.py), InfiniStore (infinistore_connector.py), and Valkey (valkey_connector.py). Each adapter implements the same unified interface, making it straightforward to add new storage backends without modifying the core cache engine.

Non-Prefix KV Reuse is implemented through Cacheblend, with example workflows in the blend_kv/ and blend_kv_v1/ directories. Cacheblend identifies reusable KV blocks at any position in the prompt, not just the prefix, and selectively recomputes tokens where the cached KV diverges from the new prompt to recover generation quality.

PD Disaggregation is implemented through the transfer channel subsystem, with nixl_channel.py for RDMA/NVLink-based transfer and py_socket_channel.py for TCP-based transfer. The pd_backend.py and pd_backend_async.py modules coordinate the prefill-to-decode KV transfer, enabling prefill workers and decode workers to run on separate hardware.

KV Transformation is implemented through the kv_codec/ module and the SERDE interface in storage_backend/naive_serde/. The asym_k16_v8.py codec provides asymmetric K16/V8 quantization, while the SERDE interface allows researchers to implement custom compression, token dropping, and serialization strategies.

Example Workflows

The examples/ directory in the repository contains a comprehensive set of example workflows:

basic_check/ – Basic cache verification to confirm LMCache is working correctly with your engine setup.
blend_kv/ and blend_kv_v1/ – Cacheblend non-prefix reuse examples demonstrating how to reuse KV blocks at arbitrary prompt positions.
disagg_prefill/ and disagg_prefill_mp/ – PD disaggregation examples showing how to separate prefill and decode workers with KV transfer.
p2p/ – Multi-node P2P CPU memory sharing, aggregating memory across nodes into a shared pool.
multi_process/ – MP architecture examples for process-level parallelism and the 10x MoE inference boost.
kubernetes/ – Kubernetes deployment examples using the LMCache operator.
observability/ – Metrics and monitoring setup examples.
online_session/ – Session-based cache reuse for multi-turn conversations.
redis_lookup/ – Redis backend configuration and usage examples.

Basic Usage with vLLM

To use LMCache with vLLM, attach LMCache as a KV cache layer by configuring the integration in your vLLM deployment. The integration layer in lmcache/integration/vllm/ handles the connection, intercepting prefill operations and redirecting them through the cache engine for lookup and storage.

SGLang and TensorRT-LLM Integration

SGLang integration is provided through lmcache/integration/sglang/ for drop-in acceleration. TensorRT-LLM integration is provided through lmcache/integration/tensorrt_llm/ for NVIDIA-optimized inference. Both integrations follow the same pattern: the integration layer intercepts prefill, the cache engine performs lookup, and on a cache hit, the KV is delivered to the engine without recomputation.

Features

LMCache provides seven key features that together make it a comprehensive KV cache management layer for LLM inference:

Feature	Description	Key Components
Engine-Independent Deployment	Standalone daemon process that manages KV cache independently from the inference engine, ensuring no fate-sharing	`standalone/`, `server/`
Tiered KV Cache Offloading	Move KV caches from GPU memory into a tiered storage hierarchy spanning CPU, local disk, and remote backends	`storage_manager.py`, `local_cpu_backend.py`, `local_disk_backend.py`, `gds_backend.py`
KV Cache Observability	Rich set of metrics including Kubernetes health monitoring, KV-cache-specific metrics, and management metrics	`health_monitor/`, `mp_observability/`
Pluggable Storage Backends	Unified interface for integrating remote storage and KV transfer backends	`redis_connector.py`, `s3_connector.py`, `mooncakestore_connector.py`, `infinistore_connector.py`, `valkey_connector.py`
Non-Prefix KV Reuse	Reuse cached KV blocks at any position in the prompt via Cacheblend with selective recomputation	Cacheblend, `blend_kv/` examples
PD Disaggregation and KV Transfer	Transfer KV cache from prefill workers to decode workers over NVLink, RDMA, or TCP	`nixl_channel.py`, `py_socket_channel.py`, `pd_backend.py`
Pluggable KV Transformation	SERDE interface for compression, token dropping, and custom serialization	`asym_k16_v8.py`, `encoded_kv.py`, `naive_serde/serde.py`

Recent Milestones

LMCache has seen rapid development and adoption, with several significant milestones:

2026/05: Agentic workload benchmark on AMD MI300X, demonstrating LMCache performance for multi-turn agentic workloads.
2026/04: MP architecture release, delivering a 10x MoE inference performance boost through process-level parallelism.
2026/03: LMCache presented at GTC 2026, showcasing the system to the broader AI inference community.
2026/01: Multi-node P2P CPU memory sharing moved from experimental feature to production, enabling aggregate memory pools across nodes.
2025/10: LMCache joined the PyTorch Foundation, and Tensormesh was unveiled.
2025/09: NVIDIA Dynamo integrated LMCache, accelerating LLM inference in the Dynamo framework.
2025/07: Redis integration blog published, showing faster LLM inference and cheaper responses with LMCache and Redis.

Performance

LMCache Performance

Performance Tiers and Benchmarks

The performance diagram above illustrates the tiered storage hierarchy that LMCache uses to balance latency and capacity, along with benchmark highlights from recent releases. The tier pyramid shows five levels of storage, from the fastest and smallest at the top to the slowest and largest at the bottom.

Tier 0: GPU HBM is the fastest tier, with sub-microsecond latency and bandwidth in the hundreds of GB/s range. However, GPU memory is limited to approximately 40-80 GB per GPU, making it the most expensive and scarce resource. LMCache’s primary goal is to free GPU HBM by offloading KV cache to lower tiers, making that memory available for serving more concurrent requests.

Tier 1: CPU RAM provides microsecond-level latency with host memory capacity typically in the hundreds of GB. This is the primary offload target for KV blocks that need fast retrieval. The KV Codec compression layer, using K16/V8 asymmetric quantization, effectively extends the capacity of this tier by reducing the memory footprint of each KV block.

Tier 2: Local Disk/SSD provides millisecond-level latency with large capacity, suitable for KV blocks that are accessed less frequently. The eviction and quota management system (eviction.py, quota_manager.py) with policies including FIFO, LRU, LFU, and MRU ensures that the most valuable KV blocks stay in faster tiers while less valuable blocks are offloaded to SSD.

Tier 3: GDS (GPU Direct Storage) bypasses the CPU entirely, enabling direct data transfer between SSDs and GPU memory. This reduces the latency overhead of CPU-mediated transfers and is particularly valuable for large KV blocks that need to be loaded directly into GPU memory for decode.

Tier 4: Remote Backends including Redis/Valkey, S3, Mooncake, P2P, and NIXL/RDMA provide the largest capacity tier with network-level latency. The multi-node P2P feature aggregates memory across nodes into a shared pool, effectively creating a distributed cache that can hold far more KV blocks than any single node could manage. The CoreWeave x Cohere case study demonstrated how this tiered approach can break the memory barrier for large-scale LLM inference.

The benchmark callouts in the diagram highlight key performance results. The 10x MoE inference boost from the MP architecture (released April 2026) demonstrates the power of process-level parallelism for Mixture-of-Experts models, where the multiprocess design allows multiple engine processes to share KV cache through POSIX shared memory without inter-process communication overhead for every cache operation. The AMD MI300X agentic workload benchmark (May 2026) validated LMCache performance on AMD hardware for multi-turn agentic workloads. The TTFT reduction from cached prefill eliminates redundant computation, providing the most direct performance benefit. The multi-node P2P aggregate memory pool enables scaling beyond single-node memory limits.

The automatic tiering system moves KV blocks between tiers based on access patterns and capacity constraints. When a cache hit occurs at a lower tier, the system can promote the block to a faster tier for future access. When a tier reaches capacity, the eviction policy determines which blocks to evict or offload to a lower tier. The quota manager ensures fair resource allocation across users and workloads, preventing any single user from monopolizing cache capacity.

Practical Deployment Tips:

Start with L1 CPU RAM as the primary offload target for the best latency-to-capacity trade-off.
Enable K16/V8 quantization to roughly double the effective capacity of CPU RAM and SSD tiers.
Use GDS when your workload involves large KV blocks that must reach GPU memory with minimal CPU overhead.
Deploy Redis or Valkey as the L2 backend when you need shared, cross-instance cache across multiple engine processes.
Consider multi-node P2P when a single node’s memory is insufficient for your working set of KV blocks.
Use NIXL/RDMA transfer channels for PD disaggregation only when you have compatible RDMA or NVLink hardware.
Monitor prefix cache hit rates via the observability modules to tune eviction policies and tier thresholds.

Conclusion

LMCache is becoming the de-facto standard for KV cache management in LLM inference. Its vendor-neutral design allows users to switch between serving engines (vLLM, SGLang, TensorRT-LLM) and storage vendors (Redis, S3, Mooncake, NIXL, and more) freely while reusing cached KV, avoiding lock-in to any single provider. The growing ecosystem – with the PyTorch Foundation, NVIDIA Dynamo integration, CoreWeave partnership, Redis integration, and AMD benchmarks – demonstrates broad industry adoption.

The research foundation is solid. LMCache builds on CacheGen (SIGCOMM 2024) for KV cache compression and streaming, CacheBlend (EuroSys 2025) for non-prefix KV reuse with cached knowledge fusion, and the LMCache paper (arXiv 2510.09665) for the overall system design. The codebase is open source under Apache 2.0, with active community participation through Slack, community meetings, and a contributing guide.

For developers and operators building LLM inference systems, LMCache provides a production-ready, research-backed, vendor-neutral KV cache management layer that reduces TTFT, improves throughput, and enables scalable inference for long-context, agentic, and multi-turn workloads. Installation is a single command – pip install lmcache – and the documentation provides comprehensive guides for installation, quickstart, configuration, benchmarking, and production deployment.

Enjoyed this post? Never miss out on future posts by following us