# qwen3-tts-rs Profiling & Benchmarking
Run performance profiling and benchmarks for the qwen3-tts Rust TTS engine.
## Prerequisites

- Docker with `--gpus all` support
- `qwen3-tts:latest` Docker image (has Rust toolchain + CUDA)
- Model weights in `test_data/models/` (1.7B-CustomVoice is the default)
- `tokenizer.json` must be in the model directory
## Docker Execution Pattern

The CUDA toolchain lives inside the Docker container, so all cargo commands must
run there. The workspace is bind-mounted at `/workspace`:

```bash
docker run --rm --gpus all --entrypoint /bin/bash \
  -v "$(pwd):/workspace" -w /workspace \
  qwen3-tts:latest \
  -c 'export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:$PATH && <COMMAND>'
```
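If you script benchmarks from Python, a small helper can build that docker invocation instead of hand-quoting the command string each time. This is a hypothetical convenience wrapper (not part of the repo); the image name, mount point, and toolchain path are copied from the pattern above.

```python
# Hypothetical helper: build the docker argv from the README's execution
# pattern, so callers only supply the inner cargo command.
import os

TOOLCHAIN_BIN = "/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin"

def docker_argv(command: str, image: str = "qwen3-tts:latest") -> list[str]:
    """Wrap `command` in the docker run invocation used throughout this doc."""
    inner = f"export PATH={TOOLCHAIN_BIN}:$PATH && {command}"
    return [
        "docker", "run", "--rm", "--gpus", "all",
        "--entrypoint", "/bin/bash",
        "-v", f"{os.getcwd()}:/workspace", "-w", "/workspace",
        image,
        "-c", inner,
    ]
```

Usage would be e.g. `subprocess.run(docker_argv("cargo build --release"), check=True)`.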
## Profiling Modes

### 1. Chrome Trace (default — best for span hierarchy)

Produces `trace.json` for viewing in `chrome://tracing` or https://ui.perfetto.dev.
```bash
docker run --rm --gpus all --entrypoint /bin/bash \
  -v "$(pwd):/workspace" -w /workspace \
  qwen3-tts:latest \
  -c 'export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:$PATH && \
      cargo run --profile=profiling --features=profiling,cuda,cli --bin e2e_bench -- \
      --model-dir test_data/models/1.7B-CustomVoice --iterations 1 --warmup 1'
```
Output: `trace.json` (~12 MB for 3 sentences). Contains spans:

- `generate_frames` — full generation loop
- `code_predictor` / `code_predictor_inner` — per-frame acoustic code generation
- `talker_step` — per-frame transformer forward pass
- `sampling` / `top_k` / `top_p` — per-frame token sampling
- `gpu_sync` trace events — marks every `to_vec1()` GPU→CPU sync
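For quick aggregate numbers without opening Perfetto, you can total the time per span name directly from the trace. This is a sketch: it assumes the file is either a bare event array or a `{"traceEvents": [...]}` wrapper, with complete (`"X"`) events carrying `dur` and begin/end (`"B"`/`"E"`) pairs otherwise — the exact shape depends on the tracing layer that wrote `trace.json`.

```python
# Sketch: sum duration (microseconds) per span name from a Chrome trace file.
import json
from collections import defaultdict

def span_totals(path: str) -> dict[str, float]:
    with open(path) as f:
        data = json.load(f)
    events = data["traceEvents"] if isinstance(data, dict) else data
    totals = defaultdict(float)  # span name -> total microseconds
    open_begins = {}             # (pid, tid, name) -> begin timestamp
    for ev in events:
        name, ph = ev.get("name"), ev.get("ph")
        if ph == "X":                       # complete event with duration
            totals[name] += ev.get("dur", 0)
        elif ph == "B":                     # begin: remember the timestamp
            open_begins[(ev.get("pid"), ev.get("tid"), name)] = ev["ts"]
        elif ph == "E":                     # end: pair with the matching begin
            key = (ev.get("pid"), ev.get("tid"), name)
            if key in open_begins:
                totals[name] += ev["ts"] - open_begins.pop(key)
    return dict(totals)
```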
### 2. Per-Stage Timing (no profiling feature needed)

The e2e_bench binary reports stage breakdowns (prefill / generation / decode)
even without the profiling feature:
```bash
docker run --rm --gpus all --entrypoint /bin/bash \
  -v "$(pwd):/workspace" -w /workspace \
  qwen3-tts:latest \
  -c 'export PATH=/root/.rustup/toolchains/stable-aarch64-unknown-linux-gnu/bin:$PATH && \
      cargo run --release --features=cuda,cli --bin e2e_bench -- \
      --model-dir test_data/models/1.7B-CustomVoice --iterations 3 --warmup 1'
```
### 3. Streaming TTFA (Time to First Audio)

```bash
# Add the --streaming flag
... --bin e2e_bench -- --model-dir test_data/models/1.7B-CustomVoice \
    --iterations 3 --warmup 1 --streaming
```
### 4. JSON Output

```bash
... --bin e2e_bench -- --model-dir test_data/models/1.7B-CustomVoice \
    --json-output results.json --iterations 3
```
## GPU Sync Audit

List all `to_vec1()` GPU→CPU synchronization points:

```bash
bash scripts/audit-gpu-syncs.sh
```
## Interpreting Results

### Stage Breakdown Table

| Label  | Words | Wall (ms) | Audio (s) | RTF   | Tok/s | Mem (MB) | Prefill   | Generate      | Decode       |
| ------ | ----- | --------- | --------- | ----- | ----- | -------- | --------- | ------------- | ------------ |
| short  | 13    | 5235.2    | 3.68      | 1.423 | 8.8   | 858      | 21ms (1%) | 2724ms (71%)  | 1109ms (29%) |
| medium | 53    | 23786.3   | 34.00     | 0.700 | 17.9  | 859      | 20ms (0%) | 22694ms (95%) | 1057ms (4%)  |
| long   | 115   | 43797.4   | 60.96     | 0.718 | 17.4  | 864      | 19ms (0%) | 41861ms (96%) | 1886ms (4%)  |
Key metrics:

- RTF (wall time / audio duration): < 1.0 means faster than real-time
- Prefill: should be <50ms on GPU; if it is high, check embedding/attention
- Generation: dominates the wall time; ~18 GPU→CPU syncs per frame (16 in code_predictor + 2 in sampling)
- Decode: ConvNeXt decoder; scales with frame count (~4% for long text)
- Tok/s: semantic tokens per second; higher is better
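The RTF column can be sanity-checked directly from the wall-clock and audio columns (wall time is in milliseconds, audio in seconds):

```python
# RTF = wall-clock time / generated audio duration.
def rtf(wall_ms: float, audio_s: float) -> float:
    return (wall_ms / 1000.0) / audio_s

# The "short" row of the table above:
print(round(rtf(5235.2, 3.68), 3))  # 1.423
```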
### Chrome Trace Analysis

In Perfetto / chrome://tracing:

- Look for gaps between `talker_step` and `code_predictor` — that's CPU overhead
- Check whether `sampling` (top_k + top_p) is significant relative to the model forward passes
- The `gpu_sync` events mark each `to_vec1()` point where the pipeline stalls on a GPU→CPU copy
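The gap check above can also be done programmatically: walk the complete (`"X"`) events on one thread in timestamp order and report the idle time between consecutive spans. A sketch, assuming the trace events are already loaded as a list of dicts:

```python
# Sketch: idle gaps between consecutive complete ("X") events on one thread.
# Large gaps between talker_step and code_predictor indicate CPU overhead.
def thread_gaps(events: list[dict], tid: int) -> list[tuple[str, str, float]]:
    spans = sorted(
        (e["ts"], e["ts"] + e.get("dur", 0), e["name"])
        for e in events
        if e.get("ph") == "X" and e.get("tid") == tid
    )
    gaps = []
    for (_s0, e0, n0), (s1, _e1, n1) in zip(spans, spans[1:]):
        if s1 > e0:
            gaps.append((n0, n1, s1 - e0))  # (prev span, next span, gap in us)
    return gaps
```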
## Optimization Targets

The ~18 `to_vec1()` calls per frame are the main bottleneck:

- 16 in code_predictor (argmax per acoustic code group)
- 2 in sampling (reading the sampled token)

Batch these to reduce GPU→CPU round-trips.
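A back-of-envelope estimate of what batching could buy: if every sync costs roughly `sync_us` microseconds of stall (a hypothetical figure — measure the real cost from the `gpu_sync` events in your own trace), collapsing 18 syncs per frame into 1 would save about:

```python
# Hypothetical savings estimate: frames * (syncs eliminated) * cost per sync.
# sync_us is an assumed placeholder, not a measured value.
def est_savings_ms(frames: int, syncs_per_frame: int = 18,
                   sync_us: float = 50.0) -> float:
    return frames * (syncs_per_frame - 1) * sync_us / 1000.0
```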
## Model Variants
| Model | Dir | Notes |
| ---------------- | ----------------------------------- | ------------------------------- |
| 1.7B-CustomVoice | test_data/models/1.7B-CustomVoice | Default benchmark target |
| 1.7B-Base | test_data/models/1.7B-Base | Voice cloning (needs ref audio) |
| 1.7B-VoiceDesign | test_data/models/1.7B-VoiceDesign | Text-described voices |
## Reference Baseline (1.7B-CustomVoice, CUDA)
From January 2025 on DGX (A100):
- Short (13 words): RTF 1.42, 8.8 tok/s
- Medium (53 words): RTF 0.70, 17.9 tok/s
- Long (115 words): RTF 0.72, 17.4 tok/s
- Prefill: ~20ms, Decode: ~1-2s, Generation: 71-96%
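These baseline figures are natural inputs to a simple regression gate in CI. A sketch, assuming a results mapping of label → RTF has already been extracted from `results.json` (the actual schema written by `--json-output` may differ, so adjust the extraction to match):

```python
# Sketch: flag benchmark labels whose RTF regressed past a tolerance vs the
# January 2025 baseline above. Input shape (label -> rtf) is an assumption.
BASELINE_RTF = {"short": 1.42, "medium": 0.70, "long": 0.72}

def check_regression(results: dict[str, float],
                     tolerance: float = 0.10) -> list[str]:
    """Return labels whose RTF grew more than `tolerance` (fractional)."""
    return [
        label for label, base in BASELINE_RTF.items()
        if results.get(label, float("inf")) > base * (1 + tolerance)
    ]
```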