Agent Skills: Whisper Audio Intelligibility Test

|

UncategorizedID: TrevorS/dot-claude/whisper-test

Install this agent skill to your local

pnpm dlx add-skill https://github.com/TrevorS/dot-claude/tree/HEAD/skills/whisper-test

Skill Files

Browse the full folder contents for whisper-test.

Download Skill

Loading file tree…

skills/whisper-test/SKILL.md

Skill Metadata

Name
whisper-test
Description
|

Whisper Audio Intelligibility Test

Transcribe WAV audio files using OpenAI Whisper and report whether the speech is intelligible. Optionally compare against expected text.

Setup

Whisper is installed as a uv tool: uv tool install openai-whisper.

Since this machine may lack ffmpeg, always use the Python API approach that loads WAV files with scipy (bypasses the ffmpeg requirement).

Running Transcription

Use uv run --no-project --with openai-whisper --with scipy --python 3.11 to execute the transcription script:

uv run --no-project --with openai-whisper --with scipy --python 3.11 \
  python3 ~/.claude/skills/whisper-test/transcribe.py \
  [--model tiny|base|small|medium|large-v3] \
  [--language en] \
  [--expected "expected text"] \
  [--json] \
  file1.wav [file2.wav ...]

Arguments

  • --model: Whisper model size (default: large-v3). See model selection guide below.
  • --language: Language hint (default: en).
  • --expected: Expected transcription text. When provided, calculates Word Error Rate (WER).
  • --json: Output results as JSON instead of human-readable text.
  • Positional: One or more WAV file paths.

Model Selection

Use large-v3 for TTS quality verification. Smaller models hallucinate or miss words in synthesized speech, making them unreliable for judging output quality.

| Model | VRAM | When to use | | ---------- | ------ | --------------------------------------------------------------- | | large-v3 | ~10 GB | Default. TTS evaluation, quality gating, regression testing | | medium | ~5 GB | GPU memory constrained, still decent accuracy | | small | ~2 GB | Quick smoke tests only | | base | ~1 GB | Not recommended for TTS — high hallucination rate | | tiny | ~1 GB | Not recommended for TTS — unreliable |

Observed with identical Qwen3-TTS 1.7B voice-cloned output:

  • large-v3: "That's one tank. Flash attention pipeline." (key phrase captured)
  • base: "That's one thing, flash attention pipeline." (close but hallucinated)

For poor-quality 0.6B output, base hallucinated "Charging Wheel" while large-v3 gave "Flat, splashes." — honest about the poor quality instead of confabulating plausible words.

Output Format

For each file, prints:

filename.wav:
  transcription: "Hello world, this is a test."
  duration: 2.96s
  rms: 0.0866
  peak: 0.6832
  silence: 49.2%
  [wer: 0.0%]  (if --expected provided)

Interpreting Results

| Transcription | Meaning | | --------------------- | ------------------------------------------------------------------------------- | | Matches expected text | Audio is intelligible and correct | | Partial match | Audio has some speech but quality issues | | Empty string "" | Audio is unintelligible (noise, silence, or garbage) | | Hallucinated text | Model heard something in noise (common with Whisper, especially smaller models) |

Audio Quality Indicators

  • RMS < 0.01: Essentially silent
  • silence > 80%: Mostly silence, likely no speech
  • peak < 0.05: Very quiet, may not contain useful audio

TTS-Specific Patterns

Voice-cloned TTS output often has these characteristics:

  • Garbled opening, clear ending: Common with ICL voice cloning on short references. The model needs a few frames to "lock in" to the target voice.
  • Key phrases preserved: Even when WER is high, domain-specific terms (e.g. "flash attention pipeline") often come through clearly.
  • Smaller models produce worse audio: 0.6B models produce significantly less intelligible output than 1.7B — expect Whisper to reflect this.

Batch Testing (TTS Variant Comparison)

When testing multiple TTS outputs against expected text:

uv run --no-project --with openai-whisper --with scipy --python 3.11 \
  python3 ~/.claude/skills/whisper-test/transcribe.py \
  --expected "Hello world, this is a test." \
  variant1.wav variant2.wav variant3.wav

This produces a comparison table showing which variants produce intelligible speech.

Docker / NGC Container Usage

When testing on a GPU box inside an NGC container (e.g. for CUDA flash-attn builds), ffmpeg isn't available and apt can be slow. Two workarounds:

  1. Static ffmpeg binary (fast, no apt):

    curl -sL https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-arm64-static.tar.xz \
      | tar xJ --strip-components=1 -C /usr/local/bin/ --wildcards "*/ffmpeg" "*/ffprobe"
    pip install openai-whisper
    
  2. Use scipy loader (this script's default — no ffmpeg needed):

    pip install openai-whisper scipy
    python3 ~/.claude/skills/whisper-test/transcribe.py --model large-v3 output.wav
    

The script loads WAV files directly via scipy, bypassing Whisper's ffmpeg dependency entirely. This works for WAV files (the standard TTS output format).