# Transcription

## Overview

Transcribe audio/video files (local or remote) using Whisper AI via `transcribe-anything`. Supports local files, YouTube URLs, and microphone input. Output formats: SRT, VTT, plain text, JSON.
## Installation

```bash
pip install transcribe-anything
```

Backends install automatically in isolated virtual environments.
## Usage

```bash
# Local file
transcribe-anything audio.mp3

# YouTube URL
transcribe-anything "https://www.youtube.com/watch?v=VIDEO_ID"

# With options
transcribe-anything audio.mp3 --model large-v3 --lang en --output_dir ./transcripts/

# GPU / device selection
transcribe-anything audio.mp3 --device cuda   # NVIDIA GPU
transcribe-anything audio.mp3 --device mlx    # Mac Apple Silicon (fastest on Mac)
transcribe-anything audio.mp3 --device groq   # Cloud API (fastest overall)

# Speaker diarization (requires HuggingFace token)
transcribe-anything audio.mp3 --device insane --hf_token YOUR_HF_TOKEN
```
## Key Options
| Option | Description | Default |
| ------------------ | ---------------------------------------------- | ------------ |
| --model | tiny, small, medium, large, large-v3 | large-v3 |
| --lang | Language code (en, fr, de) or auto | auto-detect |
| --device | cpu, cuda, mlx, insane, groq | auto-select |
| --output_dir | Directory to write transcript files | ./ |
| --task | transcribe or translate (→ English) | transcribe |
| --hf_token | HuggingFace token for speaker diarization | — |
| --initial_prompt | Domain vocabulary hint for technical terms | — |
## Backend Comparison
| Backend | Platform | Speed | Requires |
| ---------------- | ---------------------- | ------------------ | ------------------------------ |
| faster-whisper | Windows/Linux/Mac | Fast | No internet |
| mlx | Mac Apple Silicon only | 4x faster | No internet |
| insane | Windows/Linux GPU | Fastest local | No internet, optional HF token |
| groq | Cloud API | 189–250x real-time | Internet + Groq API key |
| cpu | Universal | Slowest | No internet |
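The auto-select behavior implied by the table can be sketched as a small helper. This is an illustrative approximation, not `transcribe-anything`'s actual selection logic; the `pick_device` function and its parameters are hypothetical.

```python
# Hypothetical device auto-selection mirroring the backend table above.
# NOT transcribe-anything's real logic -- an illustrative sketch only.
def pick_device(system: str, machine: str, has_cuda: bool) -> str:
    """Prefer the fastest local backend available for this machine."""
    if system == 'Darwin' and machine == 'arm64':
        return 'mlx'      # Apple Silicon: fastest on Mac
    if has_cuda:
        return 'insane'   # Windows/Linux GPU: fastest local
    return 'cpu'          # universal fallback

# e.g. pick_device(platform.system(), platform.machine(), torch.cuda.is_available())
print(pick_device('Linux', 'x86_64', True))
```

The `groq` backend is deliberately excluded from the sketch: it needs internet plus an API key, so a default should not silently pick it.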
## Output Files
| File | Format |
| -------------- | -------------------------------------------------- |
| .srt | SubRip subtitles with timestamps |
| .vtt | WebVTT subtitles |
| .txt | Plain text transcript |
| .json | Structured segments with timestamps and confidence |
| speaker.json | Speaker-partitioned dialogue (insane backend only) |
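The `.json` output is the most convenient file for downstream processing. A minimal reader might look like the following; the exact schema is an assumption here (segments with `start`, `end`, `text` keys), so verify it against a real output file first.

```python
# Read segments from the .json transcript.
# Assumed schema: {"segments": [{"start": ..., "end": ..., "text": ...}, ...]}
# -- check this against your actual output before relying on it.
import json

def load_segments(json_text: str) -> list[tuple[float, float, str]]:
    data = json.loads(json_text)
    return [(s['start'], s['end'], s['text'].strip()) for s in data['segments']]

sample = '{"segments": [{"start": 0.0, "end": 2.5, "text": " hello world"}]}'
print(load_segments(sample))  # [(0.0, 2.5, 'hello world')]
```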
## Agent Usage Pattern

- Identify input: local file path or URL
- Select model: `tiny`/`small` for speed, `large-v3` for accuracy
- Select device: omit for auto; `cuda` for GPU, `mlx` for Apple Silicon
- Run: `transcribe-anything <input> --model <model> --output_dir <dir>`
- Return: path to output directory + detected language from the `.json` output
## Batch Processing Large Audio Files

For audio files longer than 30 minutes, or when processing multiple files:
```bash
# Batch process all audio files in a directory
for f in audio/*.mp3; do
  transcribe-anything "$f" \
    --model large-v3 \
    --output_dir "transcripts/$(basename "$f" .mp3)/" \
    --device cuda
done
```

```python
# Process large files with chunking (split at silence boundaries)
# Install: pip install pydub
import os
import subprocess

from pydub import AudioSegment
from pydub.silence import split_on_silence

os.makedirs('chunks/output', exist_ok=True)

audio = AudioSegment.from_file('long_audio.mp3')
# keep_silence retains padding around cuts so edge words are not clipped
chunks = split_on_silence(audio, min_silence_len=1000,
                          silence_thresh=-40, keep_silence=200)
for i, chunk in enumerate(chunks):
    chunk_path = f'chunks/chunk_{i:04d}.mp3'
    chunk.export(chunk_path, format='mp3')
    # subprocess with an argument list avoids shell-injection via file paths
    subprocess.run(['transcribe-anything', chunk_path,
                    '--output_dir', 'chunks/output/'], check=True)
```
Performance targets:

| File Length | Backend          | Expected Speed |
| ----------- | ---------------- | -------------- |
| <10 min     | faster-whisper   | 1-2 min        |
| 10-60 min   | mlx (Mac) / cuda | 2-8 min        |
| >60 min     | groq (cloud)     | 1-3 min        |
| Real-time   | groq / insane    | <1x duration   |
## WhisperX and Speaker Diarization

WhisperX extends Whisper with word-level timestamps and speaker diarization:
```bash
# Install WhisperX (used by transcribe-anything --device insane)
pip install whisperx
```

```python
# Direct WhisperX usage for advanced control
import json

import whisperx

# Load model
device = 'cuda'
compute_type = 'float16'
model = whisperx.load_model('large-v3', device, compute_type=compute_type)

# Transcribe
audio = whisperx.load_audio('audio.mp3')
result = model.transcribe(audio, batch_size=16)

# Align timestamps (word-level)
model_a, metadata = whisperx.load_align_model(language_code=result['language'], device=device)
result = whisperx.align(result['segments'], model_a, metadata, audio, device)

# Speaker diarization (requires HuggingFace token)
diarize_model = whisperx.DiarizationPipeline(use_auth_token='YOUR_HF_TOKEN', device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

print(json.dumps(result['segments'], indent=2))
```
Speaker diarization output format:

```json
{
  "segments": [
    {
      "start": 0.5,
      "end": 4.2,
      "text": "Hello, welcome to the meeting.",
      "speaker": "SPEAKER_00",
      "words": [{ "word": "Hello", "start": 0.5, "end": 0.9, "speaker": "SPEAKER_00" }]
    }
  ]
}
```
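Segments in the format shown above can be flattened into readable dialogue. The sketch below merges consecutive segments from the same speaker; `to_dialogue` is a hypothetical helper, not part of WhisperX.

```python
# Turn diarized segments into "SPEAKER: text" dialogue lines,
# merging consecutive segments spoken by the same speaker.
def to_dialogue(segments: list[dict]) -> list[str]:
    lines: list[str] = []
    for seg in segments:
        speaker = seg.get('speaker', 'UNKNOWN')  # diarization may miss a segment
        text = seg['text'].strip()
        if lines and lines[-1].startswith(speaker + ':'):
            lines[-1] += ' ' + text  # same speaker continues
        else:
            lines.append(f'{speaker}: {text}')
    return lines

segs = [
    {'speaker': 'SPEAKER_00', 'text': 'Hello, welcome to the meeting.'},
    {'speaker': 'SPEAKER_00', 'text': 'Let us begin.'},
    {'speaker': 'SPEAKER_01', 'text': 'Thanks.'},
]
print(to_dialogue(segs))
# ['SPEAKER_00: Hello, welcome to the meeting. Let us begin.', 'SPEAKER_01: Thanks.']
```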
Requirements for speaker diarization:

- HuggingFace account + token (`--hf_token`)
- Accept the model license for `pyannote/speaker-diarization-3.1`
- GPU strongly recommended (CPU is 10-50x slower)
## Enforcement Hooks

Input is validated against `schemas/input.schema.json`. See `hooks/pre-execute.cjs` for the validation logic.
## References

- Package: https://github.com/aj47/transcribe-anything
- Whisper paper: https://arxiv.org/abs/2212.04356
## Memory Protocol (MANDATORY)

Before starting: read `.claude/context/memory/learnings.md` for prior transcription task context.

After completing:

- Performance findings -> `.claude/context/memory/learnings.md`
- Issues encountered -> `.claude/context/memory/issues.md`

ASSUME INTERRUPTION: If it's not in memory, it didn't happen.