VoxCPM: Tokenizer-Free TTS for Multilingual Speech Generation

Text-to-speech technology has undergone remarkable evolution over the past decade, moving from concatenative systems to neural approaches that can generate highly natural-sounding speech. However, most modern TTS systems still rely on discrete audio tokenization, which introduces information loss and limits expressiveness. VoxCPM, developed by OpenBMB, represents a paradigm shift with its tokenizer-free approach that directly generates continuous speech representations.

VoxCPM2, the latest release, is a 2 billion parameter model trained on over 2 million hours of multilingual speech data. It supports 30 languages, voice design from natural language descriptions, controllable voice cloning, and outputs studio-quality 48kHz audio. Built on the MiniCPM-4 backbone, this system achieves state-of-the-art performance on multiple benchmarks while remaining fully open-source under the Apache-2.0 license.

What Makes VoxCPM Different?

Traditional TTS systems typically follow a pipeline that involves converting audio into discrete tokens using neural audio codecs like EnCodec or SoundStream. These tokens are then predicted by a language model and decoded back to audio. While effective, this approach has inherent limitations:

Information Loss: Discrete tokenization quantizes continuous audio features, losing subtle acoustic details
Limited Expressiveness: Token-based systems struggle with prosody variation and emotional nuance
Quality Ceiling: The reconstruction quality is bounded by the codec’s resolution

VoxCPM bypasses these limitations entirely by operating in the continuous latent space of AudioVAE V2. Instead of predicting discrete tokens, the model directly generates continuous representations through a diffusion autoregressive architecture. This enables richer expressiveness and higher fidelity output.

VoxCPM Architecture

Understanding the VoxCPM Architecture

The architecture diagram above illustrates the core components and their interactions in the VoxCPM2 system. Let’s break down each component:

AudioVAE V2: The Foundation

AudioVAE V2 serves as the audio encoder-decoder backbone, providing asymmetric encode/decode capabilities. The encoder accepts 16kHz input audio and compresses it into a compact latent representation. The decoder then reconstructs audio at 48kHz, providing built-in super-resolution without requiring external upsamplers. This asymmetric design is crucial for achieving studio-quality output while maintaining computational efficiency.

The encoder uses causal convolutions with Snake activation functions, enabling it to capture temporal dependencies without looking ahead. The decoder incorporates sample rate conditioning, allowing it to adaptively generate audio at different quality levels based on the input context.

LocalEncoder (LocEnc)

The LocalEncoder processes audio features at a local level, encoding patches of audio into representations suitable for the language model. It uses a transformer-based architecture with special tokens to mark audio boundaries. This component is essential for voice cloning, as it encodes reference audio into a format the model can use to condition generation.

Text-Semantic Language Model (TSLM)

Built on MiniCPM-4, the TSLM is the core reasoning engine of VoxCPM2. With 2 billion parameters, it processes both text tokens and audio features in a unified representation space. The model uses RoPE (Rotary Position Embeddings) with LongRoPE scaling for extended context length, enabling it to handle long-form content effectively.

The TSLM performs context-aware synthesis, automatically inferring appropriate prosody and expressiveness from the text content. This means the model understands that a question should have rising intonation, or that exciting content should be delivered with more energy.

Scalar Quantization Layer (FSQ)

The Finite Scalar Quantization layer bridges the semantic and acoustic spaces. Unlike traditional vector quantization with codebooks, FSQ projects continuous representations through a learned quantization function. This maintains differentiability while providing structure to the latent space, enabling better generalization and more stable training.

Residual Acoustic Language Model (RALM)

The RALM operates in parallel to the TSLM, focusing on acoustic details that complement the semantic understanding. With 8 dedicated layers, it captures fine-grained acoustic patterns including voice characteristics, speaking rate variations, and emotional nuances. The fusion of TSLM and RALM outputs enables VoxCPM2 to produce speech that is both semantically accurate and acoustically natural.

LocalDiT (LocDiT)

The Local Diffusion Transformer generates audio features using flow matching, a modern alternative to traditional diffusion models. LocDiT takes the combined representations from TSLM and RALM, along with conditioning from previous audio patches, and generates the next audio patch autoregressively. The diffusion process allows for controllable generation through classifier-free guidance (CFG), enabling users to balance between diversity and fidelity.

The Tokenizer-Free TTS Pipeline

Tokenizer-Free TTS Pipeline

Understanding the Pipeline

The pipeline diagram above demonstrates how VoxCPM2 processes text to speech without traditional audio tokenization. This is a fundamental departure from conventional TTS systems.

Stage 1: Text Input

The process begins with raw text input. VoxCPM2 supports 30 languages without requiring language tags or special preprocessing. Users can simply provide text in any supported language, and the model automatically handles the rest.

Stage 2: Text Tokenization

VoxCPM2 uses LlamaTokenizerFast for text processing, which provides efficient subword tokenization. Critically, this is where the tokenizer-free nature becomes apparent: there is no audio tokenization step. Traditional systems would need to convert audio to discrete tokens at this stage, but VoxCPM2 operates entirely in continuous space.

Stage 3: Language Model Processing

The MiniCPM-4 backbone processes the tokenized text, building semantic understanding and context. This 2B parameter model was specifically trained for speech generation, learning to predict not just text semantics but also acoustic patterns. The model develops an understanding of how text should sound when spoken, including prosody, emphasis, and pacing.

Stage 4: Diffusion Autoregressive Generation

This is the heart of VoxCPM2’s innovation. Instead of predicting discrete tokens, the model generates continuous audio features through a diffusion process. Each step generates a patch of audio features conditioned on previous patches, creating a coherent audio stream. The flow matching approach provides stable training and high-quality samples.

Stage 5: Audio Decoding

The AudioVAE V2 decoder transforms the continuous latent representations into waveform audio. The decoder’s asymmetric design enables it to output 48kHz audio directly, providing studio-quality sound without external post-processing.

Key Innovation: Continuous Representation

The highlighted box in the diagram emphasizes the critical difference: VoxCPM2 never discretizes audio. By maintaining continuous representations throughout, the model preserves all acoustic information, enabling more natural and expressive speech synthesis. This approach also simplifies the architecture by removing the need for complex audio codecs with large codebooks.

Voice Cloning Capabilities

Voice Cloning Workflow

Understanding Voice Cloning Modes

VoxCPM2 introduces three distinct voice cloning modes, each designed for different use cases and quality requirements. The diagram above illustrates how these modes work and when to use each one.

Voice Design: Creating Voices from Scratch

Voice Design is a groundbreaking feature that allows users to create entirely new voices using natural language descriptions. No reference audio is required. Users can specify characteristics like:

Gender and age (e.g., “young woman”, “elderly man”)
Tone qualities (e.g., “gentle and sweet”, “deep and authoritative”)
Emotional style (e.g., “cheerful”, “melancholic”)
Speaking pace (e.g., “slow and deliberate”, “quick and energetic”)

The model interprets these descriptions and generates a voice that matches the specification. This is achieved through the model’s understanding of how acoustic properties relate to perceptual descriptions, learned during training on diverse speech data.

Controllable Cloning: Reference with Style Control

Controllable Cloning takes a reference audio sample and allows additional style instructions to modify the output. This mode is ideal when you have a target voice but want to adjust specific aspects:

Clone a voice but make it speak faster or slower
Preserve timbre while changing emotional expression
Maintain voice identity while adjusting energy level

The reference audio provides the base voice characteristics, while the control instructions guide the generation toward the desired style. This enables fine-grained control without sacrificing voice similarity.

Ultimate Cloning: Perfect Voice Reproduction

Ultimate Cloning is designed for scenarios where exact voice reproduction is critical. It requires both the reference audio and its exact transcript. The model uses audio continuation, where it learns from the reference and continues seamlessly, preserving every vocal nuance including:

Timbre and voice quality
Speaking rhythm and cadence
Emotional expression and style
Breathing patterns and pauses

This mode achieves the highest similarity scores and is recommended for applications like audiobook narration, voice preservation, and professional voice work.

Processing Pipeline

All three modes feed into the same VoxCPM2 processing pipeline. The AudioVAE V2 encoder extracts latent features from reference audio (when provided), and the model generates speech conditioned on these features. The 2B parameter model ensures high-quality output across all modes.

Multilingual Support

Understanding Multilingual Capabilities

VoxCPM2’s multilingual support is one of its most impressive features, enabling speech synthesis in 30 languages without requiring language tags or special configuration. The diagram above shows the breadth of language coverage.

European Languages

VoxCPM2 supports all major European languages including English, German, French, Spanish, Italian, Portuguese, Dutch, Russian, Polish, Swedish, Norwegian, Danish, Finnish, and Greek. The model handles the unique phonetic characteristics of each language, including:

German compound words and consonant clusters
French nasal vowels and liaison
Spanish rapid speech patterns
Russian palatalization and stress patterns

Asian Languages

Asian language support includes Chinese (with 9 dialects), Japanese, Korean, Thai, Vietnamese, Indonesian, Hindi, and more. These languages present unique challenges:

Chinese tonal system and dialectal variation
Japanese pitch accent and mora timing
Korean honorifics and speech levels
Thai tonal distinctions

Chinese Dialect Support

Beyond standard Mandarin, VoxCPM2 supports 9 Chinese dialects: Sichuanese, Cantonese, Wu, Northeastern, Henan, Shaanxi, Shandong, Tianjin, and Minnan. This extensive dialect coverage makes it valuable for regional content creation and accessibility applications.

Automatic Language Detection

The model automatically detects the input language without requiring explicit tags. This is achieved through the multilingual training on over 2 million hours of speech data. Users can simply input text in any supported language, and the model handles the rest.

Training Data Scale

The 2 million+ hours of training data ensures robust performance across all languages. This scale enables the model to learn language-specific patterns while maintaining cross-lingual transfer for improved generalization.

Installation and Quick Start

Requirements

Python 3.10 or higher (less than 3.13)
PyTorch 2.5.0 or higher
CUDA 12.0 or higher
8GB+ VRAM for VoxCPM2

Installation

pip install voxcpm

Basic Usage

      
    
      
        from voxcpm import VoxCPM
import soundfile as sf

# Load the model
model = VoxCPM.from_pretrained(
    "openbmb/VoxCPM2",
    load_denoiser=False,
)

# Generate speech
wav = model.generate(
    text="VoxCPM2 is the current recommended release for realistic multilingual speech synthesis.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("demo.wav", wav, model.tts_model.sample_rate)
print("Audio saved to demo.wav")

      
      
        

Voice Design Example

      
        # Create a voice from natural language description
wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)

Voice Cloning Example

      
        # Clone a voice from reference audio
wav = model.generate(
    text="This is a cloned voice generated by VoxCPM2.",
    reference_wav_path="path/to/voice.wav",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)

Ultimate Cloning Example

      
        # Ultimate cloning with transcript
wav = model.generate(
    text="This is an ultimate cloning demonstration using VoxCPM2.",
    prompt_wav_path="path/to/voice.wav",
    prompt_text="The transcript of the reference audio.",
    reference_wav_path="path/to/voice.wav",
)
sf.write("hifi_clone.wav", wav, model.tts_model.sample_rate)

Streaming API

For real-time applications, VoxCPM2 supports streaming generation:

      
        import numpy as np

chunks = []
for chunk in model.generate_streaming(
    text="Streaming text to speech is easy with VoxCPM!",
):
    chunks.append(chunk)
wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)

Features Comparison

Feature	VoxCPM2	VoxCPM1.5	VoxCPM-0.5B
Parameters	2B	0.6B	0.5B
Audio Quality	48kHz	44.1kHz	16kHz
Languages	30	2 (zh, en)	2 (zh, en)
Voice Design	Yes	No	No
Controllable Cloning	Yes	No	No
Ultimate Cloning	Yes	Yes	Yes
RTF (RTX 4090)	~0.30	~0.15	~0.17
RTF with Nano-VLLM	~0.13	~0.08	~0.10
VRAM Usage	~8GB	~6GB	~5GB
License	Apache-2.0	Apache-2.0	Apache-2.0

Performance Benchmarks

VoxCPM2 achieves state-of-the-art results on multiple public benchmarks:

Seed-TTS-eval Results

Model	test-EN WER	test-EN SIM	test-ZH CER	test-ZH SIM
VoxCPM2	1.84%	75.3%	0.97%	79.5%
CosyVoice3-1.5B	2.02%	71.8%	1.16%	78.0%
F5-TTS	2.00%	67.0%	1.53%	76.0%

Multilingual Performance

On the internal 30-language benchmark, VoxCPM2 achieves an average error rate of 1.68%, demonstrating robust multilingual capabilities across diverse language families.

Fine-Tuning

VoxCPM supports both full fine-tuning (SFT) and LoRA fine-tuning. With as little as 5-10 minutes of audio, you can adapt the model to a specific speaker, language, or domain.

LoRA Fine-Tuning

      
        python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml

Full Fine-Tuning

      
        python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml

WebUI for Training

python lora_ft_webui.py  # Open http://localhost:7860

Production Deployment

For high-throughput serving, use Nano-vLLM-VoxCPM:

pip install nano-vllm-voxcpm

      
        from nanovllm_voxcpm import VoxCPM
import numpy as np
import soundfile as sf

server = VoxCPM.from_pretrained(model="/path/to/VoxCPM", devices=[0])
chunks = list(server.generate(target_text="Hello from VoxCPM!"))
sf.write("out.wav", np.concatenate(chunks), 48000)
server.stop()

This achieves RTF as low as ~0.13 on NVIDIA RTX 4090 with support for batched concurrent requests.

Ecosystem

VoxCPM has a growing ecosystem of community projects:

Project	Description
Nano-vLLM	High-throughput GPU serving
VoxCPM.cpp	GGML/GGUF for CPU, CUDA, Vulkan
VoxCPM-ONNX	ONNX export for CPU inference
VoxCPMANE	Apple Neural Engine backend
voxcpm_rs	Rust re-implementation
ComfyUI-VoxCPM	ComfyUI node integration

Troubleshooting

Common Issues

Model loading fails with memory error

Ensure you have at least 8GB VRAM for VoxCPM2. You can try loading with load_denoiser=False to reduce memory usage.

Audio quality is poor

Increase inference_timesteps (default is 10, try 20-30 for better quality). Adjust cfg_value (default 2.0, higher values increase guidance strength).

Voice cloning similarity is low

For best results, use Ultimate Cloning mode with both reference audio and transcript. Ensure reference audio is clean and at least 3-10 seconds.

Streaming audio has artifacts

Increase streaming_prefix_len parameter (default is 4) to provide more context for smooth decoding.

Risks and Limitations

Potential for Misuse: VoxCPM’s voice cloning can generate highly realistic synthetic speech. It is strictly forbidden to use VoxCPM for impersonation, fraud, or disinformation. Always clearly mark AI-generated content.
Controllable Generation Stability: Voice Design and Controllable Voice Cloning results can vary between runs. You may need to generate 1-3 times to obtain the desired voice or style.
Language Coverage: VoxCPM2 officially supports 30 languages. For unsupported languages, you can test directly or try fine-tuning on your own data.

Conclusion

VoxCPM2 represents a significant advancement in text-to-speech technology. Its tokenizer-free approach enables more natural and expressive speech synthesis, while the 2B parameter model trained on 2 million+ hours of data ensures high quality across 30 languages. The voice design and controllable cloning features open new possibilities for creative applications, while the Apache-2.0 license makes it suitable for commercial use.

Whether you’re building voice assistants, creating audiobooks, or developing accessibility tools, VoxCPM2 provides a powerful and flexible foundation. The growing ecosystem of community projects ensures broad deployment options from edge devices to high-throughput servers.

References

Enjoyed this post? Never miss out on future posts by following us

VoxCPM: Tokenizer-Free TTS for Multilingual Speech Generation

VoxCPM: Tokenizer-Free TTS for Multilingual Speech Generation

What Makes VoxCPM Different?

Understanding the VoxCPM Architecture

The Tokenizer-Free TTS Pipeline

Understanding the Pipeline

Voice Cloning Capabilities

Understanding Voice Cloning Modes

Multilingual Support

Understanding Multilingual Capabilities

Installation and Quick Start

Requirements

Installation

Basic Usage

Voice Design Example

Voice Cloning Example

Ultimate Cloning Example

Streaming API

Features Comparison

Performance Benchmarks

Seed-TTS-eval Results

Multilingual Performance

Fine-Tuning

LoRA Fine-Tuning

Full Fine-Tuning

WebUI for Training

Production Deployment

Ecosystem

Troubleshooting

Common Issues

Risks and Limitations

Conclusion

Related Posts

References

Related Posts

How to easily stream picamera video over wifi with...

UDP Single server to multiple clients

Deploy Flask Video Processing App on PythonAnywhere

Add Text with Transparent Rectangle Background in OpenCV

Plot Real-Time Frame Rate of Webcam in Python with OpenCV

75+ Good Python Coding Examples for Software Development ...

Contents