DeepEP: Efficient Expert-Parallel Communication Library for MoE

Introduction

DeepEP is a communication library purpose-built for Mixture-of-Experts (MoE) and expert parallelism (EP) from DeepSeek. It delivers high-throughput and low-latency all-to-all GPU kernels—commonly known as MoE dispatch and combine—with support for low-precision operations including FP8. Designed to align with the group-limited gating algorithm in the DeepSeek-V3 paper, DeepEP is a critical building block for scaling MoE models across multiple GPUs and nodes.

What is DeepEP?

In MoE architectures, tokens must be routed (dispatched) to the appropriate expert GPUs, and the computed results must be gathered back (combined). This all-to-all communication pattern is the bottleneck in large-scale MoE training and inference. DeepEP solves this with two classes of optimized kernels:

Normal Kernels – optimized for training and inference prefilling, supporting both NVLink (intranode) and RDMA (internode) forwarding with asymmetric-domain bandwidth.
Low-Latency Kernels – designed for latency-sensitive inference decoding, using pure RDMA with a hook-based communication-computation overlap that occupies zero SM resources.

DeepEP Architecture

Architecture Overview

Normal Kernels (Training / Prefilling)

Normal kernels are optimized for throughput-heavy scenarios:

Feature	Description
NVLink Intranode	Up to ~153 GB/s dispatch / ~158 GB/s combine on H800
RDMA Internode	~43–58 GB/s across EP16–EP64 configurations
SM Number Control	Fine-grained control over Streaming Multiprocessor usage
FP8 Dispatch + BF16 Combine	Mixed-precision for memory and compute efficiency

Low-Latency Kernels (Inference Decoding)

For decoding scenarios where every microsecond counts:

Feature	Description
Pure RDMA	Direct NIC-to-NIC communication without CPU involvement
Ultra-Low Latency	77–194 μs dispatch, 114–369 μs combine
Hook-Based Overlap	Background RDMA traffic with zero SM occupation
CUDA Graph Compatible	Enables static graph capture for repeated inference

MoE Dispatch and Combine Flow

The core data flow in DeepEP follows the standard MoE pattern with highly optimized communication primitives:

MoE Data Flow

Dispatch Forward

      
    
      
        import torch
import torch.distributed as dist
from deep_ep import Buffer, EventOverlap

Buffer.set_num_sms(24)

def dispatch_forward(x, topk_idx, topk_weights, num_experts, previous_event=None):
    # Calculate dispatch layout
    num_tokens_per_rank, num_tokens_per_rdma_rank, num_tokens_per_expert, is_token_in_rank, previous_event = \
        _buffer.get_dispatch_layout(topk_idx, num_experts,
                                    previous_event=previous_event, async_finish=True,
                                    allocate_on_comm_stream=previous_event is not None)
    
    # Execute dispatch kernel
    recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list, handle, event = \
        _buffer.dispatch(x, topk_idx=topk_idx, topk_weights=topk_weights,
                         num_tokens_per_rank=num_tokens_per_rank,
                         num_tokens_per_rdma_rank=num_tokens_per_rdma_rank,
                         is_token_in_rank=is_token_in_rank,
                         num_tokens_per_expert=num_tokens_per_expert,
                         previous_event=previous_event, async_finish=True,
                         allocate_on_comm_stream=True)
    return recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list, handle, event

      
      
        

Combine Forward

      
        def combine_forward(x, handle, previous_event=None):
    combined_x, _, event = _buffer.combine(
        x, handle, async_finish=True, previous_event=previous_event,
        allocate_on_comm_stream=previous_event is not None
    )
    return combined_x, event

Low-Latency Dispatch for Decoding

      
        def low_latency_dispatch(hidden_states, topk_idx, num_max_dispatch_tokens_per_rank, num_experts):
    recv_hidden_states, recv_expert_count, handle, event, hook = \
        _buffer.low_latency_dispatch(
            hidden_states, topk_idx, num_max_dispatch_tokens_per_rank, num_experts,
            async_finish=False, return_recv_hook=True
        )
    # hook() triggers background RDMA receive without SM occupation
    return recv_hidden_states, recv_expert_count, handle, event, hook

Performance Benchmarks

DeepEP is benchmarked on H800 GPUs with CX7 InfiniBand 400 Gb/s network cards, following the DeepSeek-V3/R1 pretraining settings (4096 tokens per batch, 7168 hidden, top-4 groups, top-8 experts, FP8 dispatching and BF16 combining).

Performance Benchmarks

Normal Kernel Results

Type	EP Size	Dispatch Bandwidth	Combine Bandwidth
Intranode	8	153 GB/s (NVLink)	158 GB/s (NVLink)
Internode	16	43 GB/s (RDMA)	43 GB/s (RDMA)
Internode	32	58 GB/s (RDMA)	57 GB/s (RDMA)
Internode	64	51 GB/s (RDMA)	50 GB/s (RDMA)

Low-Latency Kernel Results

EP Size	Dispatch Latency	Dispatch BW	Combine Latency	Combine BW
8	77 μs	98 GB/s	114 μs	127 GB/s
16	118 μs	63 GB/s	195 μs	74 GB/s
32	155 μs	48 GB/s	273 μs	53 GB/s
64	173 μs	43 GB/s	314 μs	46 GB/s
128	192 μs	39 GB/s	369 μs	39 GB/s
256	194 μs	39 GB/s	360 μs	40 GB/s

News (2025.04.22): Tencent Network Platform Department contributed optimizations that improved normal kernel performance by up to 30%.
News (2025.06.05): Low-latency kernels now leverage NVLink for intranode communication where possible, further reducing latency.

Network Configuration

DeepEP is fully tested with InfiniBand and theoretically compatible with RoCE. Proper network configuration is essential for peak performance:

Network Configuration

Traffic Isolation

Use InfiniBand Virtual Lanes (VL) to segregate traffic:

Normal kernel workloads
Low-latency kernel workloads
Other cluster workloads

Control via the NVSHMEM_IB_SL environment variable.

Adaptive Routing

Load Condition	Recommendation
Heavy network load	Enable adaptive routing
Light network load	Use static routing

Congestion Control

Disabled in production environments—no significant congestion has been observed in DeepSeek’s deployments.

Installation

Requirements

GPU: Ampere (SM80), Hopper (SM90), or architectures with SM90 PTX ISA support
Python: 3.8+
CUDA: 11.0+ (SM80) or 12.3+ (SM90)
PyTorch: 2.1+
Network: NVLink (intranode) + RDMA (internode)

Build from Source

      
        # Install NVSHMEM dependency first (see third-party/README.md)

# Build and create symbolic links
NVSHMEM_DIR=/path/to/installed/nvshmem python setup.py build
ln -s build/lib.linux-x86_64-cpython-38/deep_ep_cpp.cpython-38-x86_64-linux-gnu.so

# Run tests
python tests/test_intranode.py
python tests/test_internode.py
python tests/test_low_latency.py

Install

      
        NVSHMEM_DIR=/path/to/installed/nvshmem python setup.py install

Environment Variables

Variable	Description
`NVSHMEM_DIR`	Path to NVSHMEM (disables internode/low-latency if unset)
`DISABLE_SM90_FEATURES`	Set to 1 to disable SM90-specific features
`TORCH_CUDA_ARCH_LIST`	Target architectures, e.g., `"9.0"`
`DISABLE_AGGRESSIVE_PTX_INSTRS`	Set to 1 to disable aggressive PTX load/store instructions

Experimental Branches

DeepEP has an active community contributing optimizations:

Branch	Description
Zero-copy	Removes PyTorch tensor-to-buffer copies, reducing SM usage significantly
Eager	Low-latency protocol removing extra RTT from RDMA atomic operations
Hybrid-EP	TMA instructions for minimal SM usage, PCIe kernel support, NVFP4 support
AntGroup-Opt	SM-free RDMA path, SBO overlap, rail-optimized forwarding
Mori-EP	ROCm/AMD GPU support via MORI backend

Community and Ecosystem

uccl/uccl-ep – Heterogeneous GPU and NIC support (Nvidia, AMD, EFA, Broadcom, CX7)
Infrawaves/DeepEP_ibrc_dual-ports_multiQP – Multi-QP and dual-port NIC support
antgroup/DeepXTrace – Diagnostic analyzer for slow rank localization
ROCm/mori – AMD’s next-gen communication library for AI workloads

Roadmap

AR support
A100 support (intranode only)
BF16 low-latency dispatch
NVLink for intranode low-latency kernels
TMA copy for internode and low-latency kernels
SM-free kernels and refactors
Fully remove undefined-behavior PTX instructions

Conclusion

DeepEP represents a significant advancement in MoE communication efficiency, providing both throughput-optimized and latency-optimized kernels for different phases of model training and inference. With support for FP8, fine-grained SM control, and zero-SM-overlap hooks, it enables scaling MoE models to hundreds of GPUs while maintaining high hardware utilization. The active community contributions from Tencent, AntGroup, and others continue to push its performance boundaries.

For more details, visit the DeepEP GitHub repository.

Citation

      
        @misc{deepep2025,
  title={DeepEP: an efficient expert-parallel communication library},
  author={Chenggang Zhao and Shangyan Zhou and Liyue Zhang and Chengqi Deng and Zhean Xu and Yuxuan Liu and Kuai Yu and Jiashi Li and Liang Zhao},
  year={2025},
  publisher={GitHub},
  howpublished={\url{https://github.com/deepseek-ai/DeepEP}},
}

Enjoyed this post? Never miss out on future posts by following us

DeepEP: Efficient Expert-Parallel Communication Library for MoE

Introduction

What is DeepEP?

Architecture Overview

Normal Kernels (Training / Prefilling)

Low-Latency Kernels (Inference Decoding)

MoE Dispatch and Combine Flow

Dispatch Forward

Combine Forward

Low-Latency Dispatch for Decoding

Performance Benchmarks

Normal Kernel Results

Low-Latency Kernel Results

Network Configuration

Traffic Isolation

Adaptive Routing

Congestion Control

Installation

Requirements

Build from Source

Install

Environment Variables

Experimental Branches

Community and Ecosystem

Roadmap

Conclusion

Citation

Related Posts

OpenCode: The Open Source Coding Agent with 148K+ Stars

Part 8: Soft Actor-Critic (SAC) - Maximum Entropy Reinfor...

RAG-Anything: All-in-One RAG Framework for Multi-Modal Re...

Academic Research Skills: Supercharging Claude Code for R...

Superpowers: The Complete Agentic Skills Framework for AI...

LibreChat: The Enhanced ChatGPT Clone with 35K+ Stars

Contents