fal.ai Audio Models Skill

Quick Reference

| STT Model | Endpoint | Speed | Accuracy |
|-----------|----------|-------|----------|
| Whisper | fal-ai/whisper | Medium | Highest |
| Whisper Turbo | fal-ai/whisper-turbo | Fast | High |
| Whisper Large v3 | fal-ai/whisper-large-v3 | Slow | Highest |

| TTS Model | Endpoint | Voice Clone | Quality |
|-----------|----------|-------------|---------|
| F5-TTS | fal-ai/f5-tts | Yes | High |
| ElevenLabs | fal-ai/elevenlabs/tts | Via API | Highest |
| Kokoro | fal-ai/kokoro/american-english | No | Good |
| XTTS | fal-ai/xtts | Yes | Good |

| Whisper Task | Use Case |
|--------------|----------|
| transcribe | Same language text |
| translate | Non-English → English |

| Whisper Parameter | Value |
|-------------------|-------|
| chunk_level | "segment" for timestamps |
| language | ISO code (e.g., "en") |

When to Use This Skill

Use for audio processing:

Transcribing audio/video to text
Generating subtitles with timestamps
Translating speech to English
Cloning voices from reference audio
Generating natural speech from text

Related skills:

For video with audio: see fal-text-to-video
For API integration: see fal-api-reference
For model comparison: see fal-model-guide

fal.ai Audio Models

Complete reference for speech-to-text (STT) and text-to-speech (TTS) models on fal.ai.

Speech-to-Text Models

Whisper (OpenAI)

Endpoint: fal-ai/whisper
Best For: Accurate transcription and translation

The industry-standard speech recognition model with support for 99+ languages.

import { fal } from "@fal-ai/client";

const result = await fal.subscribe("fal-ai/whisper", {
  input: {
    audio_url: "https://example.com/speech.mp3",
    task: "transcribe",
    language: "en",
    chunk_level: "segment"
  }
});

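// fal.subscribe resolves to { data, requestId }; the model output is under `data`
console.log(result.data.text);
console.log(result.data.chunks);  // Segments with [start, end] timestamps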

import fal_client

result = fal_client.subscribe(
    "fal-ai/whisper",
    arguments={
        "audio_url": "https://example.com/speech.mp3",
        "task": "transcribe",
        "language": "en",
        "chunk_level": "segment"
    }
)
print(result["text"])
for chunk in result["chunks"]:
    print(f"[{chunk['timestamp'][0]:.2f}-{chunk['timestamp'][1]:.2f}] {chunk['text']}")

Whisper Parameters:

| Parameter | Type | Values | Description |
|-----------|------|--------|-------------|
| audio_url | string | - | Audio file URL |
| task | string | "transcribe", "translate" | Transcribe or translate to English |
| language | string | ISO code | Source language (optional, auto-detected) |
| chunk_level | string | "segment" | Return timestamps |
| version | string | "3" | Whisper version |
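The parameters in the table above can be assembled with a small helper. This is a minimal sketch: `buildWhisperInput` and its `timestamps` option are hypothetical names introduced for illustration, not part of the fal.ai client, and the defaults it applies are assumptions.

```typescript
// Hypothetical helper (not part of @fal-ai/client): builds a Whisper input
// object from the documented parameters.
type WhisperTask = "transcribe" | "translate";

interface WhisperInput {
  audio_url: string;
  task: WhisperTask;
  language?: string;
  chunk_level?: "segment";
}

function buildWhisperInput(
  audioUrl: string,
  options: { task?: WhisperTask; language?: string; timestamps?: boolean } = {}
): WhisperInput {
  const input: WhisperInput = {
    audio_url: audioUrl,
    task: options.task ?? "transcribe", // assumed default for this sketch
  };
  // Leave `language` out entirely when unknown so it can be auto-detected
  if (options.language) input.language = options.language.toLowerCase();
  // Timestamps are requested through chunk_level: "segment"
  if (options.timestamps) input.chunk_level = "segment";
  return input;
}
```

The returned object is intended to be passed as the `input` field of a `fal.subscribe("fal-ai/whisper", { input })` call.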

Response Structure:

interface WhisperOutput {
  text: string;  // Full transcription
  chunks?: Array<{
    text: string;
    timestamp: [number, number];  // [start, end] in seconds
  }>;
}
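Because each chunk carries a `[start, end]` pair in seconds, simple lookups over the response are straightforward. The sketch below assumes the chunk shape shown above; `findChunkAt` and `spokenDuration` are hypothetical helper names.

```typescript
interface WhisperChunk {
  text: string;
  timestamp: [number, number]; // [start, end] in seconds
}

// Returns the chunk whose [start, end) interval contains `time`, if any.
function findChunkAt(chunks: WhisperChunk[], time: number): WhisperChunk | undefined {
  return chunks.find(({ timestamp: [start, end] }) => time >= start && time < end);
}

// Sums the length of all chunks, i.e. the seconds covered by transcribed speech.
function spokenDuration(chunks: WhisperChunk[]): number {
  return chunks.reduce((sum, { timestamp: [start, end] }) => sum + (end - start), 0);
}
```

A lookup like this could, for example, drive caption highlighting in a media player as playback time advances.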

Whisper Turbo

Endpoint: fal-ai/whisper-turbo
Best For: Fast transcription

const result = await fal.subscribe("fal-ai/whisper-turbo", {
  input: {
    audio_url: "https://example.com/podcast.mp3",
    task: "transcribe"
  }
});

Whisper Large v3

Endpoint: fal-ai/whisper-large-v3
Best For: Maximum accuracy

const result = await fal.subscribe("fal-ai/whisper-large-v3", {
  input: {
    audio_url: "https://example.com/meeting.mp3",
    task: "transcribe",
    language: "en"
  }
});

Whisper Usage Examples

Transcription with Timestamps:

const result = await fal.subscribe("fal-ai/whisper", {
  input: {
    audio_url: audioUrl,
    task: "transcribe",
    chunk_level: "segment"
  }
});

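// Format as SRT subtitles (the model output is under `data`;
// `chunks` may be absent if timestamps were not requested)
const chunks = result.data.chunks ?? [];
chunks.forEach((chunk, i) => {
  const start = formatTime(chunk.timestamp[0]);
  const end = formatTime(chunk.timestamp[1]);
  console.log(`${i + 1}\n${start} --> ${end}\n${chunk.text}\n`);
});

// Converts seconds to the SRT timestamp format HH:MM:SS,mmm
function formatTime(seconds: number): string {
  // Work in whole milliseconds to avoid floating-point drift (e.g. 1.001 % 1)
  const totalMs = Math.round(seconds * 1000);
  const h = Math.floor(totalMs / 3_600_000);
  const m = Math.floor((totalMs % 3_600_000) / 60_000);
  const s = Math.floor((totalMs % 60_000) / 1000);
  const ms = totalMs % 1000;
  const pad = (n: number, width = 2) => n.toString().padStart(width, '0');
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}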

Translation (Non-English to English):

const result = await fal.subscribe("fal-ai/whisper", {
  input: {
    audio_url: "https://example.com/french-speech.mp3",
    task: "translate",  // Translates to English
    language: "fr"
  }
});

console.log(result.data.text);  // English translation

Multi-Language Detection:

// Whisper auto-detects language if not specified
const result = await fal.subscribe("fal-ai/whisper", {
  input: {
    audio_url: "https://example.com/unknown-language.mp3",
    task: "transcribe"
    // language omitted - auto-detect
  }
});

Text-to-Speech Models

F5-TTS

Endpoint: fal-ai/f5-tts
Best For: Voice cloning from reference audio

const result = await fal.subscribe("fal-ai/f5-tts", {
  input: {
    gen_text: "Hello! Welcome to our product demonstration. We're excited to show you what we've built.",
    ref_audio_url: "https://example.com/voice-sample.wav",
    ref_text: "This is a sample of my voice for cloning purposes.",
    model_type: "F5-TTS"
  }
});

console.log(result.data.audio_url);  // Model output is under `data`

result = fal_client.subscribe(
    "fal-ai/f5-tts",
    arguments={
        "gen_text": "Hello! Welcome to our product.",
        "ref_audio_url": "https://example.com/voice-sample.wav",
        "ref_text": "This is a sample of my voice."
    }
)
print(result["audio_url"])

F5-TTS Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| gen_text | string | Text to synthesize |
| ref_audio_url | string | Reference voice audio URL |
| ref_text | string | Transcript of reference audio |
| model_type | string | "F5-TTS" or "E2-TTS" |
| remove_silence | boolean | Remove silence from output |

ElevenLabs TTS

Endpoint: fal-ai/elevenlabs/tts
Best For: Premium voice quality

const result = await fal.subscribe("fal-ai/elevenlabs/tts", {
  input: {
    text: "Welcome to fal.ai! Let me tell you about our amazing AI models.",
    voice_id: "21m00Tcm4TlvDq8ikWAM",  // ElevenLabs voice ID
    model_id: "eleven_multilingual_v2"
  }
});

console.log(result.data.audio.url);  // Model output is under `data`

ElevenLabs Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| text | string | Text to synthesize |
| voice_id | string | ElevenLabs voice ID |
| model_id | string | TTS model version |
| stability | number | Voice stability (0-1) |
| similarity_boost | number | Voice similarity (0-1) |
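Since `stability` and `similarity_boost` are documented as 0-1 values, it can help to clamp user-supplied settings before sending a request. The following is a sketch only; `withVoiceSettings` and `clamp01` are hypothetical helpers, and whether the endpoint itself rejects or clamps out-of-range values is not stated here.

```typescript
interface ElevenLabsInput {
  text: string;
  voice_id: string;
  model_id?: string;
  stability?: number;
  similarity_boost?: number;
}

// Restricts a number to the documented 0-1 range.
const clamp01 = (value: number): number => Math.min(1, Math.max(0, value));

// Returns a copy of `base` with both voice settings clamped into range.
function withVoiceSettings(
  base: ElevenLabsInput,
  stability: number,
  similarityBoost: number
): ElevenLabsInput {
  return {
    ...base,
    stability: clamp01(stability),
    similarity_boost: clamp01(similarityBoost),
  };
}
```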

ElevenLabs Voice IDs (examples):

21m00Tcm4TlvDq8ikWAM - Rachel (female)
AZnzlk1XvdvUeBnXmlld - Domi (female)
EXAVITQu4vr4xnSDxMaL - Bella (female)
ErXwobaYiN019PkySvjV - Antoni (male)
VR6AewLTigWG4xSOukaG - Arnold (male)

Kokoro TTS

Endpoint: fal-ai/kokoro/american-english
Best For: Multi-language, natural sounding

const result = await fal.subscribe("fal-ai/kokoro/american-english", {
  input: {
    text: "This is a test of the Kokoro text-to-speech system.",
    voice: "af_bella"  // Voice style
  }
});

console.log(result.data.audio.url);  // Model output is under `data`

Kokoro Variants:

fal-ai/kokoro/american-english - American English
fal-ai/kokoro/british-english - British English
fal-ai/kokoro/japanese - Japanese
fal-ai/kokoro/mandarin - Mandarin Chinese
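Because each Kokoro language has its own endpoint, a lookup table keeps the choice in one place. This sketch covers only the four variants listed above; `kokoroEndpoint` is a hypothetical helper name.

```typescript
// Variant name -> endpoint, limited to the variants listed in this document
const KOKORO_ENDPOINTS = {
  "american-english": "fal-ai/kokoro/american-english",
  "british-english": "fal-ai/kokoro/british-english",
  japanese: "fal-ai/kokoro/japanese",
  mandarin: "fal-ai/kokoro/mandarin",
} as const;

type KokoroVariant = keyof typeof KOKORO_ENDPOINTS;

// Resolves a variant name to its endpoint, failing loudly on unknown names.
function kokoroEndpoint(variant: string): string {
  // hasOwnProperty avoids matching inherited keys such as "toString"
  if (!Object.prototype.hasOwnProperty.call(KOKORO_ENDPOINTS, variant)) {
    throw new Error(`Unknown Kokoro variant: ${variant}`);
  }
  return KOKORO_ENDPOINTS[variant as KokoroVariant];
}
```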

Kokoro Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| text | string | Text to synthesize |
| voice | string | Voice style identifier |
| speed | number | Speech speed multiplier |

XTTS (Coqui)

Endpoint: fal-ai/xtts
Best For: Open-source voice cloning

const result = await fal.subscribe("fal-ai/xtts", {
  input: {
    text: "Hello, this is a cloned voice speaking.",
    audio_url: "https://example.com/voice-reference.wav",
    language: "en"
  }
});

XTTS Parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| text | string | Text to synthesize |
| audio_url | string | Reference audio for cloning |
| language | string | Target language |

Model Comparison

Speech-to-Text

| Model | Speed | Accuracy | Languages | Best For |
|-------|-------|----------|-----------|----------|
| Whisper | Medium | Highest | 99+ | Accuracy critical |
| Whisper Turbo | Fast | High | 99+ | Speed needed |
| Whisper Large v3 | Slow | Highest | 99+ | Maximum quality |

Text-to-Speech

| Model | Quality | Voice Clone | Languages | Best For |
|-------|---------|-------------|-----------|----------|
| F5-TTS | High | Yes | Multiple | Voice cloning |
| ElevenLabs | Highest | Via API | Many | Premium quality |
| Kokoro | Good | No | Multiple | Multi-language |
| XTTS | Good | Yes | 16 | Open-source |
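The comparison tables can be turned into a simple selection rule. The sketch below encodes one possible reading of the "Best For" columns; the function names and the priority order are assumptions made for illustration, not a recommendation from fal.ai.

```typescript
type SttPriority = "speed" | "accuracy";

// Picks an STT endpoint from the comparison table: Turbo for speed,
// standard Whisper when accuracy matters most.
function pickSttEndpoint(priority: SttPriority): string {
  return priority === "speed" ? "fal-ai/whisper-turbo" : "fal-ai/whisper";
}

interface TtsNeeds {
  voiceCloning: boolean;
  premiumQuality?: boolean;
}

// Picks a TTS endpoint. Assumed order: cloning first, then premium quality,
// otherwise a preset-voice model.
function pickTtsEndpoint(needs: TtsNeeds): string {
  if (needs.voiceCloning) return "fal-ai/f5-tts";
  if (needs.premiumQuality) return "fal-ai/elevenlabs/tts";
  return "fal-ai/kokoro/american-english";
}
```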

Workflow Examples

Transcribe and Translate Pipeline

async function processAudio(audioUrl: string, targetLanguage: string = 'en') {
  // 1. Transcribe
  const transcription = await fal.subscribe("fal-ai/whisper", {
    input: {
      audio_url: audioUrl,
      task: "transcribe",
      chunk_level: "segment"
    }
  });

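  // 2. Whisper's "translate" task only produces English, so run it
  //    only when English is the requested target language
  let translation = null;
  if (targetLanguage === 'en') {
    translation = await fal.subscribe("fal-ai/whisper", {
      input: {
        audio_url: audioUrl,
        task: "translate"
      }
    });
  }

  // fal.subscribe returns the model output under `data`
  return {
    original: transcription.data.text,
    translated: translation?.data.text,
    chunks: transcription.data.chunks
  };
}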

Voice Cloning Pipeline

async function cloneVoiceAndSpeak(
  referenceAudioUrl: string,
  referenceText: string,
  textToSpeak: string
) {
  // Use F5-TTS for voice cloning
  const result = await fal.subscribe("fal-ai/f5-tts", {
    input: {
      gen_text: textToSpeak,
      ref_audio_url: referenceAudioUrl,
      ref_text: referenceText,
      remove_silence: true
    }
  });

  // fal.subscribe returns the model output under `data`
  return result.data.audio_url;
}

Subtitle Generation

async function generateSubtitles(videoUrl: string): Promise<string> {
  // Extract audio and transcribe
  const result = await fal.subscribe("fal-ai/whisper", {
    input: {
      audio_url: videoUrl,  // Works with video URLs too
      task: "transcribe",
      chunk_level: "segment"
    }
  });

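  // Generate SRT format (the model output is under `data`)
  let srt = '';
  (result.data.chunks ?? []).forEach((chunk, i) => {
    srt += `${i + 1}\n`;
    srt += `${formatSrtTime(chunk.timestamp[0])} --> ${formatSrtTime(chunk.timestamp[1])}\n`;
    srt += `${chunk.text}\n\n`;
  });

  return srt;
}

// Converts seconds to HH:MM:SS,mmm. Relies on Date, so it wraps after 24 hours.
function formatSrtTime(seconds: number): string {
  const date = new Date(seconds * 1000);
  // ISO string is YYYY-MM-DDTHH:MM:SS.mmmZ; characters 11-22 hold HH:MM:SS.mmm
  return date.toISOString().slice(11, 23).replace('.', ',');
}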

Audio Book Generation

async function generateAudioBook(chapters: string[], voiceId: string): Promise<string[]> {
  const audioUrls: string[] = [];

  for (const chapter of chapters) {
    // Split into manageable chunks
    const chunks = splitText(chapter, 5000);

    for (const chunk of chunks) {
      const result = await fal.subscribe("fal-ai/elevenlabs/tts", {
        input: {
          text: chunk,
          voice_id: voiceId,
          model_id: "eleven_multilingual_v2"
        }
      });
      // fal.subscribe returns the model output under `data`
      audioUrls.push(result.data.audio.url);
    }
  }

  return audioUrls;
}

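// Groups whole sentences into chunks of at most maxLength characters.
// A single sentence longer than maxLength is kept as its own oversized chunk.
function splitText(text: string, maxLength: number): string[] {
  const chunks: string[] = [];
  let current = '';

  // Split after each period so every sentence keeps its own punctuation
  for (const sentence of text.split(/(?<=\.)\s+/)) {
    if (current && (current + ' ' + sentence).length > maxLength) {
      chunks.push(current);
      current = sentence;
    } else {
      current = current ? `${current} ${sentence}` : sentence;
    }
  }

  if (current) chunks.push(current);
  return chunks;
}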

Parameter Reference

Speech-to-Text Input

interface STTInput {
  audio_url: string;
  task?: "transcribe" | "translate";
  language?: string;  // ISO 639-1 code
  chunk_level?: "segment";
  version?: string;
}

Text-to-Speech Input

interface TTSInput {
  // Common
  text?: string;
  gen_text?: string;

  // Voice cloning
  ref_audio_url?: string;
  ref_text?: string;
  audio_url?: string;  // XTTS

  // Voice selection
  voice_id?: string;  // ElevenLabs
  voice?: string;     // Kokoro
  model_type?: string; // F5-TTS

  // Control
  speed?: number;
  stability?: number;
  similarity_boost?: number;
  language?: string;
  remove_silence?: boolean;
}
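The combined `TTSInput` interface above makes every field optional, so the compiler cannot catch a request that mixes fields from different models. One alternative is a discriminated union keyed by model. This is a sketch under stated assumptions: the `model` tag, `TtsRequest`, and `toFalCall` are hypothetical names, and the required/optional split is inferred from the parameter tables in this document.

```typescript
// One variant per model; `model` is a local tag, not an API field
type TtsRequest =
  | { model: "f5-tts"; gen_text: string; ref_audio_url: string; ref_text: string; remove_silence?: boolean }
  | { model: "elevenlabs"; text: string; voice_id: string; stability?: number; similarity_boost?: number }
  | { model: "kokoro"; text: string; voice: string; speed?: number }
  | { model: "xtts"; text: string; audio_url: string; language: string };

const TTS_ENDPOINTS: Record<TtsRequest["model"], string> = {
  "f5-tts": "fal-ai/f5-tts",
  elevenlabs: "fal-ai/elevenlabs/tts",
  kokoro: "fal-ai/kokoro/american-english",
  xtts: "fal-ai/xtts",
};

// Splits a tagged request into the endpoint and the input payload,
// dropping the local `model` tag so it is not sent to the API.
function toFalCall(request: TtsRequest): { endpoint: string; input: Record<string, unknown> } {
  const { model, ...input } = request;
  return { endpoint: TTS_ENDPOINTS[model], input };
}
```

With this shape, passing `voice_id` to a Kokoro request or omitting `ref_text` from an F5-TTS request becomes a compile-time error.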

Best Practices

Speech-to-Text

Audio Quality: Clean, low-noise audio produces more accurate transcriptions
Specify Language: Provide language hint when known
Use Timestamps: Request chunk_level: "segment" for subtitles
Handle Long Audio: Whisper handles long files automatically
Translation: Use task: "translate" for non-English to English
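When the language is known only as a locale tag (for example from a browser or user profile), it can be reduced to a two-letter hint before being passed as `language`. A minimal sketch, assuming the endpoint expects two-letter ISO 639-1 codes as noted in the parameter reference; `toLanguageHint` is a hypothetical helper.

```typescript
// Reduces a locale tag such as "pt-BR" or "en_US" to its two-letter
// primary subtag. Returns undefined when no two-letter code is found,
// in which case `language` can be omitted and auto-detection used.
function toLanguageHint(locale: string): string | undefined {
  const [primary = ""] = locale.trim().split(/[-_]/);
  const code = primary.toLowerCase();
  return /^[a-z]{2}$/.test(code) ? code : undefined;
}
```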

Text-to-Speech

Reference Quality: For voice cloning, use 10-30 second clear samples
Reference Transcript: Accurate transcript improves cloning quality
Text Length: Split very long text into chunks
Punctuation: Proper punctuation improves prosody
Emotion Hints: Use punctuation (!, ?) to convey emotion
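The reference-sample guidelines above can be checked before a cloning request is sent. The sketch below encodes only the 10-30 second range and the need for a transcript; `checkReferenceSample` is a hypothetical helper, and the sample duration is assumed to be known to the caller.

```typescript
interface ReferenceSample {
  durationSeconds: number;
  transcript: string;
}

// Returns a list of problems; an empty list means the sample follows
// the guidelines in this section.
function checkReferenceSample(sample: ReferenceSample): string[] {
  const problems: string[] = [];
  if (sample.durationSeconds < 10) problems.push("sample is shorter than 10 seconds");
  if (sample.durationSeconds > 30) problems.push("sample is longer than 30 seconds");
  if (sample.transcript.trim().length === 0) problems.push("transcript is empty");
  return problems;
}
```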

Common Supported Languages

| Language | Code | STT | TTS |
|----------|------|-----|-----|
| English | en | Yes | Yes |
| Spanish | es | Yes | Yes |
| French | fr | Yes | Yes |
| German | de | Yes | Yes |
| Italian | it | Yes | Yes |
| Portuguese | pt | Yes | Yes |
| Japanese | ja | Yes | Yes |
| Chinese | zh | Yes | Yes |
| Korean | ko | Yes | Yes |
| Russian | ru | Yes | Limited |

File Format Support

Input Formats (STT)

| Format | Extension | Supported |
|--------|-----------|-----------|
| MP3 | .mp3 | Yes |
| WAV | .wav | Yes |
| M4A | .m4a | Yes |
| FLAC | .flac | Yes |
| OGG | .ogg | Yes |
| WebM | .webm | Yes |
| Video | .mp4 | Yes (audio extracted) |
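A quick extension check against the table above can reject obviously unsupported files before an upload or API call. This is a sketch only: `hasSupportedSttExtension` is a hypothetical helper, it looks at the URL rather than the file contents, and URLs without an extension will be reported as unsupported even if the file itself is fine.

```typescript
// Extensions taken from the input format table in this document
const STT_EXTENSIONS = new Set([".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm", ".mp4"]);

function hasSupportedSttExtension(audioUrl: string): boolean {
  // Drop any query string or fragment before looking for the extension
  const [path = ""] = audioUrl.split(/[?#]/);
  const dot = path.lastIndexOf(".");
  return dot !== -1 && STT_EXTENSIONS.has(path.slice(dot).toLowerCase());
}
```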

Output Formats (TTS)

| Model | Output Format |
|-------|---------------|
| F5-TTS | WAV |
| ElevenLabs | MP3 |
| Kokoro | WAV |
| XTTS | WAV |

Error Handling

try {
  const result = await fal.subscribe("fal-ai/whisper", {
    input: { audio_url: audioUrl, task: "transcribe" }
  });
  console.log(result.data.text);
} catch (error) {
  // `error` is `unknown` in strict TypeScript, so narrow it before reading fields
  const { status, message } = error as { status?: number; message?: string };
  if (status === 400) {
    console.error("Invalid audio file or URL");
  } else if (status === 413) {
    console.error("Audio file too large");
  } else {
    console.error("Transcription failed:", message);
  }
}
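The status-to-message mapping above can be pulled into a pure function so it is reusable and easy to test. This sketch assumes, as the example does, that thrown errors may carry a numeric `status` and a `message`; `describeFalError` is a hypothetical helper name.

```typescript
// Maps an unknown thrown value to a user-facing message using the same
// status codes as the example above.
function describeFalError(error: unknown): string {
  const { status, message } = (error ?? {}) as { status?: number; message?: string };
  if (status === 400) return "Invalid audio file or URL";
  if (status === 413) return "Audio file too large";
  return `Transcription failed: ${message ?? "unknown error"}`;
}
```

Inside a `catch` block this reduces the handling to `console.error(describeFalError(error))`.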
