VoxCPM: Tokenizer-Free TTS for Multilingual Speech Generation

Text-to-speech technology has undergone remarkable evolution over the past decade, moving from concatenative systems to neural approaches that can generate highly natural-sounding speech. However, most modern TTS systems still rely on discrete audio tokenization, which introduces information loss and limits expressiveness. VoxCPM, developed by OpenBMB, represents a paradigm shift with its tokenizer-free approach that directly generates continuous speech representations.

VoxCPM2, the latest release, is a 2 billion parameter model trained on over 2 million hours of multilingual speech data. It supports 30 languages, voice design from natural language descriptions, controllable voice cloning, and outputs studio-quality 48kHz audio. Built on the MiniCPM-4 backbone, this system achieves state-of-the-art performance on multiple benchmarks while remaining fully open-source under the Apache-2.0 license.

What Makes VoxCPM Different?

Traditional TTS systems typically follow a pipeline that involves converting audio into discrete tokens using neural audio codecs like EnCodec or SoundStream. These tokens are then predicted by a language model and decoded back to audio. While effective, this approach has inherent limitations:

  • Information Loss: Discrete tokenization quantizes continuous audio features, losing subtle acoustic details
  • Limited Expressiveness: Token-based systems struggle with prosody variation and emotional nuance
  • Quality Ceiling: The reconstruction quality is bounded by the codec’s resolution

VoxCPM bypasses these limitations entirely by operating in the continuous latent space of AudioVAE V2. Instead of predicting discrete tokens, the model directly generates continuous representations through a diffusion autoregressive architecture. This enables richer expressiveness and higher fidelity output.

VoxCPM Architecture

Understanding the VoxCPM Architecture

The architecture diagram above illustrates the core components and their interactions in the VoxCPM2 system. Let’s break down each component:

AudioVAE V2: The Foundation

AudioVAE V2 serves as the audio encoder-decoder backbone, providing asymmetric encode/decode capabilities. The encoder accepts 16kHz input audio and compresses it into a compact latent representation. The decoder then reconstructs audio at 48kHz, providing built-in super-resolution without requiring external upsamplers. This asymmetric design is crucial for achieving studio-quality output while maintaining computational efficiency.

The encoder uses causal convolutions with Snake activation functions, enabling it to capture temporal dependencies without looking ahead. The decoder incorporates sample rate conditioning, allowing it to adaptively generate audio at different quality levels based on the input context.

LocalEncoder (LocEnc)

The LocalEncoder processes audio features at a local level, encoding patches of audio into representations suitable for the language model. It uses a transformer-based architecture with special tokens to mark audio boundaries. This component is essential for voice cloning, as it encodes reference audio into a format the model can use to condition generation.

Text-Semantic Language Model (TSLM)

Built on MiniCPM-4, the TSLM is the core reasoning engine of VoxCPM2. With 2 billion parameters, it processes both text tokens and audio features in a unified representation space. The model uses RoPE (Rotary Position Embeddings) with LongRoPE scaling for extended context length, enabling it to handle long-form content effectively.

The TSLM performs context-aware synthesis, automatically inferring appropriate prosody and expressiveness from the text content. This means the model understands that a question should have rising intonation, or that exciting content should be delivered with more energy.

Scalar Quantization Layer (FSQ)

The Finite Scalar Quantization layer bridges the semantic and acoustic spaces. Unlike traditional vector quantization with codebooks, FSQ projects continuous representations through a learned quantization function. This maintains differentiability while providing structure to the latent space, enabling better generalization and more stable training.

Residual Acoustic Language Model (RALM)

The RALM operates in parallel to the TSLM, focusing on acoustic details that complement the semantic understanding. With 8 dedicated layers, it captures fine-grained acoustic patterns including voice characteristics, speaking rate variations, and emotional nuances. The fusion of TSLM and RALM outputs enables VoxCPM2 to produce speech that is both semantically accurate and acoustically natural.

LocalDiT (LocDiT)

The Local Diffusion Transformer generates audio features using flow matching, a modern alternative to traditional diffusion models. LocDiT takes the combined representations from TSLM and RALM, along with conditioning from previous audio patches, and generates the next audio patch autoregressively. The diffusion process allows for controllable generation through classifier-free guidance (CFG), enabling users to balance between diversity and fidelity.

The Tokenizer-Free TTS Pipeline

Tokenizer-Free TTS Pipeline

Understanding the Pipeline

The pipeline diagram above demonstrates how VoxCPM2 processes text to speech without traditional audio tokenization. This is a fundamental departure from conventional TTS systems.

Stage 1: Text Input

The process begins with raw text input. VoxCPM2 supports 30 languages without requiring language tags or special preprocessing. Users can simply provide text in any supported language, and the model automatically handles the rest.

Stage 2: Text Tokenization

VoxCPM2 uses LlamaTokenizerFast for text processing, which provides efficient subword tokenization. Critically, this is where the tokenizer-free nature becomes apparent: there is no audio tokenization step. Traditional systems would need to convert audio to discrete tokens at this stage, but VoxCPM2 operates entirely in continuous space.

Stage 3: Language Model Processing

The MiniCPM-4 backbone processes the tokenized text, building semantic understanding and context. This 2B parameter model was specifically trained for speech generation, learning to predict not just text semantics but also acoustic patterns. The model develops an understanding of how text should sound when spoken, including prosody, emphasis, and pacing.

Stage 4: Diffusion Autoregressive Generation

This is the heart of VoxCPM2’s innovation. Instead of predicting discrete tokens, the model generates continuous audio features through a diffusion process. Each step generates a patch of audio features conditioned on previous patches, creating a coherent audio stream. The flow matching approach provides stable training and high-quality samples.

Stage 5: Audio Decoding

The AudioVAE V2 decoder transforms the continuous latent representations into waveform audio. The decoder’s asymmetric design enables it to output 48kHz audio directly, providing studio-quality sound without external post-processing.

Key Innovation: Continuous Representation

The highlighted box in the diagram emphasizes the critical difference: VoxCPM2 never discretizes audio. By maintaining continuous representations throughout, the model preserves all acoustic information, enabling more natural and expressive speech synthesis. This approach also simplifies the architecture by removing the need for complex audio codecs with large codebooks.

Voice Cloning Capabilities

Voice Cloning Workflow

Understanding Voice Cloning Modes

VoxCPM2 introduces three distinct voice cloning modes, each designed for different use cases and quality requirements. The diagram above illustrates how these modes work and when to use each one.

Voice Design: Creating Voices from Scratch

Voice Design is a groundbreaking feature that allows users to create entirely new voices using natural language descriptions. No reference audio is required. Users can specify characteristics like:

  • Gender and age (e.g., “young woman”, “elderly man”)
  • Tone qualities (e.g., “gentle and sweet”, “deep and authoritative”)
  • Emotional style (e.g., “cheerful”, “melancholic”)
  • Speaking pace (e.g., “slow and deliberate”, “quick and energetic”)

The model interprets these descriptions and generates a voice that matches the specification. This is achieved through the model’s understanding of how acoustic properties relate to perceptual descriptions, learned during training on diverse speech data.

Controllable Cloning: Reference with Style Control

Controllable Cloning takes a reference audio sample and allows additional style instructions to modify the output. This mode is ideal when you have a target voice but want to adjust specific aspects:

  • Clone a voice but make it speak faster or slower
  • Preserve timbre while changing emotional expression
  • Maintain voice identity while adjusting energy level

The reference audio provides the base voice characteristics, while the control instructions guide the generation toward the desired style. This enables fine-grained control without sacrificing voice similarity.

Ultimate Cloning: Perfect Voice Reproduction

Ultimate Cloning is designed for scenarios where exact voice reproduction is critical. It requires both the reference audio and its exact transcript. The model uses audio continuation, where it learns from the reference and continues seamlessly, preserving every vocal nuance including:

  • Timbre and voice quality
  • Speaking rhythm and cadence
  • Emotional expression and style
  • Breathing patterns and pauses

This mode achieves the highest similarity scores and is recommended for applications like audiobook narration, voice preservation, and professional voice work.

Processing Pipeline

All three modes feed into the same VoxCPM2 processing pipeline. The AudioVAE V2 encoder extracts latent features from reference audio (when provided), and the model generates speech conditioned on these features. The 2B parameter model ensures high-quality output across all modes.

Multilingual Support

Multilingual Support

Understanding Multilingual Capabilities

VoxCPM2’s multilingual support is one of its most impressive features, enabling speech synthesis in 30 languages without requiring language tags or special configuration. The diagram above shows the breadth of language coverage.

European Languages

VoxCPM2 supports all major European languages including English, German, French, Spanish, Italian, Portuguese, Dutch, Russian, Polish, Swedish, Norwegian, Danish, Finnish, and Greek. The model handles the unique phonetic characteristics of each language, including:

  • German compound words and consonant clusters
  • French nasal vowels and liaison
  • Spanish rapid speech patterns
  • Russian palatalization and stress patterns

Asian Languages

Asian language support includes Chinese (with 9 dialects), Japanese, Korean, Thai, Vietnamese, Indonesian, Hindi, and more. These languages present unique challenges:

  • Chinese tonal system and dialectal variation
  • Japanese pitch accent and mora timing
  • Korean honorifics and speech levels
  • Thai tonal distinctions

Chinese Dialect Support

Beyond standard Mandarin, VoxCPM2 supports 9 Chinese dialects: Sichuanese, Cantonese, Wu, Northeastern, Henan, Shaanxi, Shandong, Tianjin, and Minnan. This extensive dialect coverage makes it valuable for regional content creation and accessibility applications.

Automatic Language Detection

The model automatically detects the input language without requiring explicit tags. This is achieved through the multilingual training on over 2 million hours of speech data. Users can simply input text in any supported language, and the model handles the rest.

Training Data Scale

The 2 million+ hours of training data ensures robust performance across all languages. This scale enables the model to learn language-specific patterns while maintaining cross-lingual transfer for improved generalization.

Installation and Quick Start

Requirements

  • Python 3.10 or higher (less than 3.13)
  • PyTorch 2.5.0 or higher
  • CUDA 12.0 or higher
  • 8GB+ VRAM for VoxCPM2

Installation

pip install voxcpm

Basic Usage

from voxcpm import VoxCPM
import soundfile as sf

# Load the model
model = VoxCPM.from_pretrained(
    "openbmb/VoxCPM2",
    load_denoiser=False,
)

# Generate speech
wav = model.generate(
    text="VoxCPM2 is the current recommended release for realistic multilingual speech synthesis.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("demo.wav", wav, model.tts_model.sample_rate)
print("Audio saved to demo.wav")

Voice Design Example

# Create a voice from natural language description
wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)

Voice Cloning Example

# Clone a voice from reference audio
wav = model.generate(
    text="This is a cloned voice generated by VoxCPM2.",
    reference_wav_path="path/to/voice.wav",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)

Ultimate Cloning Example

# Ultimate cloning with transcript
wav = model.generate(
    text="This is an ultimate cloning demonstration using VoxCPM2.",
    prompt_wav_path="path/to/voice.wav",
    prompt_text="The transcript of the reference audio.",
    reference_wav_path="path/to/voice.wav",
)
sf.write("hifi_clone.wav", wav, model.tts_model.sample_rate)

Streaming API

For real-time applications, VoxCPM2 supports streaming generation:

import numpy as np

chunks = []
for chunk in model.generate_streaming(
    text="Streaming text to speech is easy with VoxCPM!",
):
    chunks.append(chunk)
wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)

Features Comparison

Feature VoxCPM2 VoxCPM1.5 VoxCPM-0.5B
Parameters 2B 0.6B 0.5B
Audio Quality 48kHz 44.1kHz 16kHz
Languages 30 2 (zh, en) 2 (zh, en)
Voice Design Yes No No
Controllable Cloning Yes No No
Ultimate Cloning Yes Yes Yes
RTF (RTX 4090) ~0.30 ~0.15 ~0.17
RTF with Nano-VLLM ~0.13 ~0.08 ~0.10
VRAM Usage ~8GB ~6GB ~5GB
License Apache-2.0 Apache-2.0 Apache-2.0

Performance Benchmarks

VoxCPM2 achieves state-of-the-art results on multiple public benchmarks:

Seed-TTS-eval Results

Model test-EN WER test-EN SIM test-ZH CER test-ZH SIM
VoxCPM2 1.84% 75.3% 0.97% 79.5%
CosyVoice3-1.5B 2.02% 71.8% 1.16% 78.0%
F5-TTS 2.00% 67.0% 1.53% 76.0%

Multilingual Performance

On the internal 30-language benchmark, VoxCPM2 achieves an average error rate of 1.68%, demonstrating robust multilingual capabilities across diverse language families.

Fine-Tuning

VoxCPM supports both full fine-tuning (SFT) and LoRA fine-tuning. With as little as 5-10 minutes of audio, you can adapt the model to a specific speaker, language, or domain.

LoRA Fine-Tuning

python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml

Full Fine-Tuning

python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml

WebUI for Training

python lora_ft_webui.py  # Open http://localhost:7860

Production Deployment

For high-throughput serving, use Nano-vLLM-VoxCPM:

pip install nano-vllm-voxcpm
from nanovllm_voxcpm import VoxCPM
import numpy as np
import soundfile as sf

server = VoxCPM.from_pretrained(model="/path/to/VoxCPM", devices=[0])
chunks = list(server.generate(target_text="Hello from VoxCPM!"))
sf.write("out.wav", np.concatenate(chunks), 48000)
server.stop()

This achieves RTF as low as ~0.13 on NVIDIA RTX 4090 with support for batched concurrent requests.

Ecosystem

VoxCPM has a growing ecosystem of community projects:

Project Description
Nano-vLLM High-throughput GPU serving
VoxCPM.cpp GGML/GGUF for CPU, CUDA, Vulkan
VoxCPM-ONNX ONNX export for CPU inference
VoxCPMANE Apple Neural Engine backend
voxcpm_rs Rust re-implementation
ComfyUI-VoxCPM ComfyUI node integration

Troubleshooting

Common Issues

Model loading fails with memory error

Ensure you have at least 8GB VRAM for VoxCPM2. You can try loading with load_denoiser=False to reduce memory usage.

Audio quality is poor

Increase inference_timesteps (default is 10, try 20-30 for better quality). Adjust cfg_value (default 2.0, higher values increase guidance strength).

Voice cloning similarity is low

For best results, use Ultimate Cloning mode with both reference audio and transcript. Ensure reference audio is clean and at least 3-10 seconds.

Streaming audio has artifacts

Increase streaming_prefix_len parameter (default is 4) to provide more context for smooth decoding.

Risks and Limitations

  • Potential for Misuse: VoxCPM’s voice cloning can generate highly realistic synthetic speech. It is strictly forbidden to use VoxCPM for impersonation, fraud, or disinformation. Always clearly mark AI-generated content.

  • Controllable Generation Stability: Voice Design and Controllable Voice Cloning results can vary between runs. You may need to generate 1-3 times to obtain the desired voice or style.

  • Language Coverage: VoxCPM2 officially supports 30 languages. For unsupported languages, you can test directly or try fine-tuning on your own data.

Conclusion

VoxCPM2 represents a significant advancement in text-to-speech technology. Its tokenizer-free approach enables more natural and expressive speech synthesis, while the 2B parameter model trained on 2 million+ hours of data ensures high quality across 30 languages. The voice design and controllable cloning features open new possibilities for creative applications, while the Apache-2.0 license makes it suitable for commercial use.

Whether you’re building voice assistants, creating audiobooks, or developing accessibility tools, VoxCPM2 provides a powerful and flexible foundation. The growing ecosystem of community projects ensures broad deployment options from edge devices to high-throughput servers.

References

Watch PyShine on YouTube

Contents