Local TTS Skill

Generate high-quality speech audio locally using Apple Silicon MLX acceleration and the Kokoro-82M model. No API keys or recurring costs.

Quick Start

# Generate MP3 from text
uv run --with mlx-audio --with pydub skills/local-tts/scripts/generate_audio.py \
    --text "Hello, this is a test." \
    --output ~/Desktop/test.mp3

# Generate from file
uv run --with mlx-audio --with pydub skills/local-tts/scripts/generate_audio.py \
    --file /tmp/script.txt \
    --voice af_heart \
    --output ~/Desktop/podcast.mp3

# List available voices
uv run --with mlx-audio skills/local-tts/scripts/list_voices.py

Parameters

| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| --text | One of text/file | - | Text to convert |
| --file | One of text/file | - | Path to text file |
| --voice | No | af_heart | Voice preset |
| --output | Yes | - | Output file path (.mp3, .wav) |
| --model | No | Kokoro-82M-bf16 | Model to use |
| --list-voices | No | - | Show available voices |
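
The parameter table maps fairly directly onto an argparse definition. The following is a minimal sketch of what such a parser could look like, not the actual contents of generate_audio.py. The names `build_parser` and `parse_args` are hypothetical, and the rule that `--list-voices` works without `--output` is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the parameter table (not the script's real code)."""
    parser = argparse.ArgumentParser(description="Generate speech audio locally.")
    # --text and --file are alternatives, so they share an exclusive group.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("--text", help="Text to convert")
    source.add_argument("--file", help="Path to text file")
    parser.add_argument("--voice", default="af_heart", help="Voice preset")
    parser.add_argument("--output", help="Output file path (.mp3, .wav)")
    parser.add_argument("--model", default="Kokoro-82M-bf16", help="Model to use")
    parser.add_argument("--list-voices", action="store_true", help="Show available voices")
    return parser


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse argv and enforce the 'Required' column of the table."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: listing voices needs neither an input nor an output path.
    if not args.list_voices:
        if args.text is None and args.file is None:
            parser.error("one of --text or --file is required")
        if args.output is None:
            parser.error("--output is required")
    return args
```

Calling `parse_args(["--text", "Hello", "--output", "out.mp3"])` yields a namespace with the default voice and model filled in. Omitting `--output` exits with an argparse usage error.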

Voice Presets

American English Female (prefix: af_)

af_heart - Warm, friendly (default)
af_bella - Soft, calm
af_nova - Clear, professional
af_river - Clear, confident
af_sarah - Soft, expressive

American English Male (prefix: am_)

am_adam - Clear, professional
am_echo - Deep, smooth
am_liam - Articulate, conversational
am_michael - Soft, measured

British English (prefix: bf_, bm_)

bf_emma - Clear, refined female
bm_daniel - Clear, professional male
bm_george - Distinguished male

See references/voices.md for full list.
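
The preset names follow a two-letter prefix convention: the first letter is the accent and the second is the gender. As an illustration, the sketch below decodes only the four prefixes listed in this section. The helper `describe_voice` is hypothetical and not part of the skill, and prefixes documented only in references/voices.md are not covered.

```python
# Prefix meanings taken from the preset headings above; other prefixes are not covered.
VOICE_PREFIXES = {
    "af": ("American English", "female"),
    "am": ("American English", "male"),
    "bf": ("British English", "female"),
    "bm": ("British English", "male"),
}


def describe_voice(voice: str) -> tuple[str, str, str]:
    """Split a preset such as 'af_heart' into (accent, gender, name)."""
    prefix, sep, name = voice.partition("_")
    if not sep or not name or prefix not in VOICE_PREFIXES:
        raise ValueError(f"unrecognized voice preset: {voice!r}")
    accent, gender = VOICE_PREFIXES[prefix]
    return accent, gender, name
```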

Output Format

{
  "success": true,
  "file": "/Users/hagelk/Desktop/podcast.mp3",
  "voice": "af_heart",
  "model": "Kokoro-82M-bf16",
  "characters": 9824,
  "chunks": 20,
  "duration_seconds": 612.5,
  "generation_time": 45.2
}
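
A caller can parse this JSON to check for success and derive a realtime factor (seconds of audio produced per second of compute). The sketch below assumes the JSON arrives as a single string, for example captured from the script's standard output, and that `generation_time` is measured in seconds. Neither is stated explicitly here. `summarize_result` is a hypothetical helper.

```python
import json


def summarize_result(raw: str) -> dict:
    """Parse the generator's JSON output and derive a few convenience figures."""
    result = json.loads(raw)
    if not result.get("success"):
        raise RuntimeError("generation failed")
    # Realtime factor: seconds of audio per second of generation time
    # (assumes generation_time is in seconds).
    realtime = result["duration_seconds"] / result["generation_time"]
    return {
        "file": result["file"],
        "minutes": result["duration_seconds"] / 60,
        "realtime_factor": realtime,
        "chars_per_chunk": result["characters"] / result["chunks"],
    }
```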

Performance

| Hardware | Speed | Notes |
|----------|-------|-------|
| M3 Pro 36GB | ~3-4x realtime | First run slower (model loading) |
| M1/M2 Mac Mini 8GB | ~1.5x realtime | Works well for briefings |
| M1/M2 Mac Mini 16GB | ~2x realtime | Comfortable headroom |

Technical Details

Model: Kokoro-82M-bf16 (~200MB download on first run)
Sample rate: 24kHz mono
Chunking: Text split at ~400 chars per chunk for quality
Concatenation: Chunks joined seamlessly via pydub
Formats: MP3, WAV, M4A, OGG
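
To illustrate the chunking step, here is a minimal sketch that packs whole sentences into chunks of at most about 400 characters. Splitting on sentence boundaries is an assumption, since the script's actual splitting rule is not described here, and `split_into_chunks` is a hypothetical name. Joining the generated audio with pydub is not shown.

```python
import re


def split_into_chunks(text: str, max_chars: int = 400) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept as its own oversized chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Keeping sentences intact avoids cutting a chunk mid-phrase, which is the usual reason for splitting on boundaries instead of at a fixed character offset.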

Important Notes

MUST use --with flags - Do not use PEP 723 inline deps. mlx-audio requires uv's cached environment.
First run is slower - Model downloads ~200MB and espeak dependencies initialize.
Model cached at: ~/.cache/huggingface/hub/models--mlx-community--Kokoro-82M-bf16/

Integration with Morning Briefing

The morning-briefing skill uses this for podcast generation:

uv run --with mlx-audio --with pydub skills/local-tts/scripts/generate_audio.py \
    --file /tmp/morning_briefing_podcast.txt \
    --voice af_heart \
    --output ~/Desktop/morning_briefing.mp3
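
Another skill could wrap the same invocation from Python, roughly as sketched below. The helpers `build_command` and `generate` are hypothetical and not part of either skill. The sketch assumes it runs from the directory that contains skills/ and that `uv` is on the PATH.

```python
import subprocess
from pathlib import Path

SCRIPT = "skills/local-tts/scripts/generate_audio.py"


def build_command(text_file: str, output: str, voice: str = "af_heart") -> list[str]:
    """Assemble the uv invocation shown above as an argument list."""
    return [
        "uv", "run", "--with", "mlx-audio", "--with", "pydub", SCRIPT,
        "--file", text_file,
        "--voice", voice,
        # No shell is involved, so "~" must be expanded explicitly.
        "--output", str(Path(output).expanduser()),
    ]


def generate(text_file: str, output: str, voice: str = "af_heart") -> str:
    """Run the generator and return its stdout; raises CalledProcessError on failure."""
    completed = subprocess.run(
        build_command(text_file, output, voice),
        check=True, capture_output=True, text=True,
    )
    return completed.stdout
```

Passing the command as a list avoids shell quoting problems when file paths contain spaces.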
