Agent Skills: Local TTS Skill

Local text-to-speech using MLX and Kokoro model

Category: Uncategorized
ID: krishagel/geoffrey/local-tts

Install this agent skill locally:

pnpm dlx add-skill https://github.com/krishagel/geoffrey/tree/HEAD/skills/local-tts

skills/local-tts/SKILL.md

Skill Metadata

Name: local-tts
Description: Local text-to-speech using MLX and Kokoro model

Local TTS Skill

Generate high-quality speech audio locally using Apple Silicon MLX acceleration and the Kokoro-82M model. No API keys or recurring costs.

Quick Start

# Generate MP3 from text
uv run --with mlx-audio --with pydub skills/local-tts/scripts/generate_audio.py \
    --text "Hello, this is a test." \
    --output ~/Desktop/test.mp3

# Generate from file
uv run --with mlx-audio --with pydub skills/local-tts/scripts/generate_audio.py \
    --file /tmp/script.txt \
    --voice af_heart \
    --output ~/Desktop/podcast.mp3

# List available voices
uv run --with mlx-audio skills/local-tts/scripts/list_voices.py

Parameters

| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| --text | One of text/file | - | Text to convert |
| --file | One of text/file | - | Path to text file |
| --voice | No | af_heart | Voice preset |
| --output | Yes | - | Output file path (.mp3, .wav) |
| --model | No | Kokoro-82M-bf16 | Model to use |
| --list-voices | No | - | Show available voices |
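The parameter table can be read as a small command-line contract. The following is a minimal sketch of how such a parser might look with `argparse`; it is not the actual `generate_audio.py`, and the function names `build_parser` and `parse_args` are hypothetical. It also assumes that `--output` and a text source are not needed when only `--list-voices` is requested, which the table does not state.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(description="Local TTS (illustrative sketch)")
    # --text and --file are alternatives, so they share an exclusive group.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("--text", help="Text to convert")
    source.add_argument("--file", help="Path to text file")
    parser.add_argument("--voice", default="af_heart", help="Voice preset")
    parser.add_argument("--output", help="Output file path (.mp3, .wav)")
    parser.add_argument("--model", default="Kokoro-82M-bf16", help="Model to use")
    parser.add_argument("--list-voices", action="store_true",
                        help="Show available voices")
    return parser


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse argv and enforce the 'Required' column of the table."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: listing voices needs neither a text source nor an output path.
    if not args.list_voices:
        if args.text is None and args.file is None:
            parser.error("one of --text or --file is required")
        if args.output is None:
            parser.error("--output is required")
    return args
```

Calling `parse_args(["--text", "Hello", "--output", "out.mp3"])` fills in the documented defaults for `--voice` and `--model`; omitting both `--text` and `--file` exits with a usage error.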

Voice Presets

American English Female (prefix: af_)

  • af_heart - Warm, friendly (default)
  • af_bella - Soft, calm
  • af_nova - Clear, professional
  • af_river - Clear, confident
  • af_sarah - Soft, expressive

American English Male (prefix: am_)

  • am_adam - Clear, professional
  • am_echo - Deep, smooth
  • am_liam - Articulate, conversational
  • am_michael - Soft, measured

British English (prefix: bf_, bm_)

  • bf_emma - Clear, refined female
  • bm_daniel - Clear, professional male
  • bm_george - Distinguished male

See references/voices.md for full list.
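The voice IDs above follow a two-letter prefix convention: the first letter encodes the accent and the second the gender. A small sketch of decoding that convention is shown below; `describe_voice` is a hypothetical helper, and it covers only the `a`/`b` and `f`/`m` codes listed in this section, not whatever else references/voices.md may contain.

```python
# Prefix codes taken from the presets listed above; other codes may exist.
ACCENTS = {"a": "American English", "b": "British English"}
GENDERS = {"f": "female", "m": "male"}


def describe_voice(voice_id: str) -> str:
    """Turn an ID such as 'bm_george' into a readable description."""
    prefix, sep, name = voice_id.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice ID format: {voice_id!r}")
    accent_code, gender_code = prefix
    if accent_code not in ACCENTS or gender_code not in GENDERS:
        raise ValueError(f"unknown voice prefix: {prefix!r}")
    return f"{name}: {ACCENTS[accent_code]}, {GENDERS[gender_code]}"


print(describe_voice("af_heart"))   # heart: American English, female
print(describe_voice("bm_george"))  # george: British English, male
```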

Output Format

{
  "success": true,
  "file": "/Users/hagelk/Desktop/podcast.mp3",
  "voice": "af_heart",
  "model": "Kokoro-82M-bf16",
  "characters": 9824,
  "chunks": 20,
  "duration_seconds": 612.5,
  "generation_time": 45.2
}
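A caller can derive a couple of useful figures from this report. The sketch below assumes the report is available as a JSON string and that `generation_time` is measured in seconds (the field name does not say); `summarize` is a hypothetical helper, not part of the skill.

```python
import json


def summarize(result_json: str) -> dict:
    """Derive speed and chunk-size figures from a generation report."""
    result = json.loads(result_json)
    if not result.get("success"):
        raise RuntimeError("generation reported failure")
    return {
        # Seconds of audio produced per second of processing
        # (assumes generation_time is in seconds).
        "realtime_factor": result["duration_seconds"] / result["generation_time"],
        # Average number of input characters per chunk.
        "chars_per_chunk": result["characters"] / result["chunks"],
    }


report = """{
  "success": true,
  "file": "/Users/hagelk/Desktop/podcast.mp3",
  "voice": "af_heart",
  "model": "Kokoro-82M-bf16",
  "characters": 9824,
  "chunks": 20,
  "duration_seconds": 612.5,
  "generation_time": 45.2
}"""
print(summarize(report))
```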

Performance

| Hardware | Speed | Notes |
|----------|-------|-------|
| M3 Pro 36GB | ~3-4x realtime | First run slower (model loading) |
| M1/M2 Mac Mini 8GB | ~1.5x realtime | Works well for briefings |
| M1/M2 Mac Mini 16GB | ~2x realtime | Comfortable headroom |
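Reading "Nx realtime" as N seconds of audio produced per second of processing, the speeds above translate into a rough wait time. This is a back-of-the-envelope sketch with a hypothetical helper name; it ignores the slower first run.

```python
def estimated_generation_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Estimate processing time for a clip of the given length."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


# A 10-minute briefing at ~1.5x realtime takes roughly 400 seconds.
print(estimated_generation_seconds(600, 1.5))  # 400.0
```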

Technical Details

  • Model: Kokoro-82M-bf16 (~200MB download on first run)
  • Sample rate: 24kHz mono
  • Chunking: Text split at ~400 chars per chunk for quality
  • Concatenation: Chunks joined seamlessly via pydub
  • Formats: MP3, WAV, M4A, OGG
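The chunking step can be illustrated with a simple sentence-packing routine. This is a sketch of one plausible approach, not the skill's actual implementation: it assumes chunks break at sentence boundaries, and `chunk_text` is a hypothetical name. Each resulting chunk would then be synthesized separately and the audio segments joined.

```python
import re


def chunk_text(text: str, max_chars: int = 400) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single sentence longer than max_chars is kept whole.
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("One. Two. Three.", max_chars=9))  # ['One. Two.', 'Three.']
```

Breaking at sentence boundaries rather than at a fixed character offset avoids cutting a word or phrase in half, which would be audible at the join.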

Important Notes

  1. MUST use --with flags - Pass dependencies on the command line with --with; do not use PEP 723 inline dependencies. mlx-audio requires uv's cached environment.

  2. First run is slower - The model (~200MB) is downloaded and the espeak dependencies are initialized.

  3. Model cached at: ~/.cache/huggingface/hub/models--mlx-community--Kokoro-82M-bf16/
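To tell in advance whether a run will incur the first-run download, a caller could check the cache location from note 3. The sketch below treats a non-empty directory at that path as "cached", which is an assumption; `model_is_cached` is a hypothetical helper.

```python
from pathlib import Path

# Cache location taken from note 3 above.
MODEL_CACHE = "~/.cache/huggingface/hub/models--mlx-community--Kokoro-82M-bf16"


def model_is_cached(cache_dir: str = MODEL_CACHE) -> bool:
    """Return True if the cache directory exists and is not empty."""
    path = Path(cache_dir).expanduser()
    return path.is_dir() and any(path.iterdir())


if not model_is_cached():
    print("Model not cached yet: expect a ~200MB download on the first run.")
```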

Integration with Morning Briefing

The morning-briefing skill uses this for podcast generation:

uv run --with mlx-audio --with pydub skills/local-tts/scripts/generate_audio.py \
    --file /tmp/morning_briefing_podcast.txt \
    --voice af_heart \
    --output ~/Desktop/morning_briefing.mp3
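When another skill calls this one from Python rather than from a shell, the same command can be assembled as an argument list. The sketch below assumes the script prints only the JSON report shown under Output Format to stdout; `build_command` and `generate` are hypothetical helpers. Note that `~` is expanded explicitly, because `subprocess` does not go through a shell.

```python
import json
import subprocess
from pathlib import Path

SCRIPT = "skills/local-tts/scripts/generate_audio.py"


def build_command(text_file: str, output: str, voice: str = "af_heart") -> list[str]:
    """Assemble the uv invocation documented above as an argv list."""
    return [
        "uv", "run", "--with", "mlx-audio", "--with", "pydub", SCRIPT,
        "--file", text_file,
        "--voice", voice,
        # No shell is involved, so expand '~' ourselves.
        "--output", str(Path(output).expanduser()),
    ]


def generate(text_file: str, output: str, voice: str = "af_heart") -> dict:
    """Run the script and parse its report (assumes stdout is pure JSON)."""
    completed = subprocess.run(
        build_command(text_file, output, voice),
        capture_output=True, text=True, check=True,
    )
    return json.loads(completed.stdout)
```

For example, `generate("/tmp/morning_briefing_podcast.txt", "~/Desktop/morning_briefing.mp3")` would run the command above and return the parsed report.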