---
name: fal-ai
description: Generate images, videos, and speech using fal.ai's suite of AI models.
---

# fal.ai Media Generation

Generate images, videos, and speech using fal.ai's suite of AI models.

Important distinctions:

- **Text-to-Image** (`generate_image.py`): creates images from text prompts
- **Text-to-Video** (`generate_video_text.py`): creates videos from text prompts (no image needed)
- **Image-to-Video** (`generate_video.py`): animates an existing image into a video

## Setup

Set your fal.ai API key:

```bash
export FAL_API_KEY="your-api-key-here"
```

Get your API key from https://fal.ai/dashboard/keys.
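
The scripts read the key from the environment, so a wrapper script can fail fast when it is missing. A minimal sketch (plain POSIX shell, not part of the bundled scripts):

```bash
# Abort early if the key is not visible to this shell
if [ -z "$FAL_API_KEY" ]; then
  echo "FAL_API_KEY is not set" >&2
  exit 1
fi
```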

## Quick Reference

| Task | Command |
|------|---------|
| Generate image | `./scripts/generate_image.py "prompt"` |
| Generate video from text | `./scripts/generate_video_text.py "prompt"` |
| Generate video from image | `./scripts/generate_video.py image.png` |
| Convert text to speech | `./scripts/generate_speech.py "text"` |
| List image models | `./scripts/generate_image.py --list-models ""` |
| List text-to-video models | `./scripts/generate_video_text.py --list-models` |
| List image-to-video models | `./scripts/generate_video.py --list-models` |
| List TTS models | `./scripts/generate_speech.py --list-models` |

## Text-to-Image Generation

Generate images from text descriptions using state-of-the-art models.

### Available Models

| Model | Description | Speed | Quality |
|-------|-------------|-------|---------|
| `flux-schnell` (default) | Fast generation | Fast | Good |
| `flux-dev` | Development model | Medium | High |
| `flux-pro` | Production quality | Slow | Best |
| `flux-realism` | Photorealistic | Medium | High |
| `flux-kontext` | Context-aware editing | Medium | High |
| `nano-banana-pro` | Gemini-powered, web search | Medium | High |
| `recraft-v3` | Design/artistic | Medium | High |
| `stable-diffusion-xl` | Classic SD | Medium | Good |

### Usage

```bash
# Basic generation
uv run ./scripts/generate_image.py "A serene mountain landscape at sunset"

# Specify model and aspect ratio
uv run ./scripts/generate_image.py "A cyberpunk cityscape" --model flux-pro --aspect landscape_16_9

# Generate multiple images
uv run ./scripts/generate_image.py "A cute robot mascot" --num 4 --output ./robots/

# With negative prompt and seed
uv run ./scripts/generate_image.py "Professional headshot" --negative "cartoon, anime" --seed 42

# Open image after generation
uv run ./scripts/generate_image.py "A beautiful garden" --open
```

### Aspect Ratios

- `square` (default) - 1:1
- `square_hd` - 1:1 high resolution
- `portrait_4_3` - 3:4 portrait
- `portrait_16_9` - 9:16 tall portrait
- `landscape_4_3` - 4:3 landscape
- `landscape_16_9` - 16:9 widescreen
- `21_9` - ultra-wide
- `9_21` - ultra-tall
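
If you are unsure which framing suits a prompt, a short loop can render it at several ratios for comparison. A sketch, assuming `--output` accepts a directory as in the multi-image example above:

```bash
# Render the same prompt at three framings and compare the results
for ratio in square_hd landscape_16_9 portrait_16_9; do
  uv run ./scripts/generate_image.py "A lighthouse at dawn" --aspect "$ratio" --output ./ratios/
done
```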

## Text-to-Video Generation

Generate videos directly from text prompts (no input image required).

### Available Models

| Model | Description | Cost | Max Duration |
|-------|-------------|------|--------------|
| `veo3` | Google Veo 3, best quality + audio | ~$2.00/5s | ~8s |
| `kling-v2.6` | Kling 2.6 Pro, cinematic | ~$0.70/5s | ~10s |
| `ltx-v2-fast` (default) | LTX 2.0 Fast, good balance | ~$0.20/5s | ~10s |
| `ltx-v2` | LTX 2.0, higher quality | ~$0.20/5s | ~10s |
| `hunyuan` | Hunyuan, high visual quality | ~$0.38/5s | ~5s |
| `hunyuan-v1.5` | Hunyuan 1.5, improved motion | ~$0.38/5s | ~5s |
| `minimax` | MiniMax Video-01 | ~$0.50/video | ~5s |
| `wan` | Wan 2.1, fast | ~$0.25/5s | ~5s |

### Usage

```bash
# Basic text-to-video
uv run ./scripts/generate_video_text.py "a cat walking on the beach at sunset"

# Cinematic video with specific model
uv run ./scripts/generate_video_text.py "cinematic drone shot of mountains at sunrise" --model hunyuan

# Vertical video for social media
uv run ./scripts/generate_video_text.py "person dancing in studio" --aspect 9:16 --resolution 1080p

# With seed for reproducibility
uv run ./scripts/generate_video_text.py "ocean waves crashing" --seed 42 --open
```

### Tips for Text-to-Video

1. **Be descriptive** - include motion, camera angles, lighting
2. **Cinematic keywords** - "cinematic", "8k", "dramatic lighting" help quality
3. **Duration limits** - most models generate 5-10 second clips
4. **Resolution tradeoffs** - higher resolution = slower generation
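
Putting the first two tips together, a well-specified prompt might look like this (the prompt and flag values are illustrative, not recommendations):

```bash
# Motion, camera movement, and lighting spelled out explicitly
uv run ./scripts/generate_video_text.py \
  "slow dolly-in on a lighthouse at dusk, waves crashing below, dramatic lighting, cinematic, 8k" \
  --model ltx-v2 --resolution 720p --seed 7
```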

## Image-to-Video Generation

Animate static images into videos (requires an input image).

### Available Models

| Model | Description | Quality |
|-------|-------------|---------|
| `kling` (default) | Kling v1.5 Pro | Excellent |
| `kling-v2.6` | Kling v2.6 Pro + audio | Best |
| `minimax` | MiniMax video | Good |
| `luma` | Luma Dream Machine | Good |
| `runway-gen3` | Runway Gen-3 Turbo | Excellent |
| `hunyuan` | Hunyuan video | Good |

### Usage

```bash
# Basic video generation
uv run ./scripts/generate_video.py image.png

# With motion prompt
uv run ./scripts/generate_video.py portrait.jpg --prompt "person slowly smiles and nods"

# Different model and duration
uv run ./scripts/generate_video.py landscape.png --model runway-gen3 --duration 10

# Specify output path
uv run ./scripts/generate_video.py photo.jpg --output ./videos/animated.mp4 --open
```

### Tips for Image-to-Video

1. **Image quality matters** - use high-resolution, clear images
2. **Simple motion prompts** - describe the motion, not the scene
3. **Duration limits** - most models support 5-10 seconds
4. **Aspect ratio** - usually auto-detected from the input image
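
To give a whole folder of stills the same treatment, a small shell loop works. A sketch; the paths and motion prompt are placeholders:

```bash
# Animate every PNG in ./stills with one shared motion prompt
for img in ./stills/*.png; do
  uv run ./scripts/generate_video.py "$img" \
    --prompt "subtle parallax, camera drifts slowly right" \
    --output "./videos/$(basename "${img%.png}").mp4"
done
```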

## Text-to-Speech

Convert text to natural-sounding speech.

### Available Models

| Model | Description | Features |
|-------|-------------|----------|
| `f5-tts` (default) | F5-TTS | Voice cloning |
| `kokoro` | Kokoro | Multiple voices |
| `playht` | PlayHT v3 | High quality |
| `minimax-tts` | MiniMax TTS | Fast |

### Usage

```bash
# Basic text-to-speech
uv run ./scripts/generate_speech.py "Hello, welcome to our application!"

# Different model
uv run ./scripts/generate_speech.py "This is a test." --model kokoro

# Voice cloning with reference audio
uv run ./scripts/generate_speech.py "Clone this voice" --reference my_voice.mp3

# Adjust speed
uv run ./scripts/generate_speech.py "Speaking faster now" --speed 1.2

# Specify output and open
uv run ./scripts/generate_speech.py "Podcast intro" --output intro.wav --open
```

### Voice Cloning

To clone a voice, provide a reference audio sample:

```bash
uv run ./scripts/generate_speech.py "Text in cloned voice" --reference sample.mp3 --model f5-tts
```

Best practices for reference audio:

- 5-30 seconds of clear speech
- Minimal background noise
- Single speaker
- Natural speaking pace
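
If your source recording is long, stereo, or noisy, a quick preprocessing pass can bring it in line with these guidelines. A sketch assuming `ffmpeg` is installed; the 24 kHz sample rate is an assumption, so check what your chosen model expects:

```bash
# Trim to 20 seconds, downmix to mono, and resample the reference clip
ffmpeg -i raw_recording.wav -ss 0 -t 20 -ac 1 -ar 24000 sample.wav
```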

## Common Workflows

### Create a Video from Text (Easiest)

```bash
# Direct text-to-video - no image needed
uv run ./scripts/generate_video_text.py "A majestic eagle spreads its wings and takes flight from a cliff, cinematic, dramatic lighting" --model ltx-v2-fast --open
```

### Create a Video from an Image (More Control)

```bash
# 1. Generate the image
uv run ./scripts/generate_image.py "A majestic eagle perched on a cliff" --model flux-pro --output eagle.png

# 2. Animate it
uv run ./scripts/generate_video.py eagle.png --prompt "eagle spreads wings and takes flight" --open
```

### Generate Marketing Assets

```bash
# Product image variations
uv run ./scripts/generate_image.py "Minimalist product photo of headphones on white background" --num 4 --aspect square_hd

# Social media formats
uv run ./scripts/generate_image.py "Summer sale banner" --aspect landscape_16_9 --output banner_wide.png
uv run ./scripts/generate_image.py "Summer sale banner" --aspect portrait_16_9 --output banner_story.png
```

### Create a Voiceover

```bash
# Generate narration
uv run ./scripts/generate_speech.py "Welcome to our product demo. Today we'll explore the amazing features..." --output narration.wav

# With custom voice
uv run ./scripts/generate_speech.py "Welcome to our product demo." --reference brand_voice.mp3 --output narration.wav
```

## API Reference

All scripts are built on the shared `fal_helper.py` library. You can import it directly for programmatic use:

```python
from fal_helper import FalClient

client = FalClient()

# Generate images
images = client.generate_image(
    prompt="A beautiful sunset",
    model="flux-schnell",
    aspect_ratio="landscape_16_9",
    num_images=2,
)

# Generate video from text (no image needed)
video = client.generate_video_from_text(
    prompt="cinematic ocean waves at sunset",
    model="ltx-v2-fast",
    aspect_ratio="16:9",
    resolution="720p",
)

# Generate video from image
video = client.generate_video(
    image_path="image.png",
    prompt="camera slowly pans right",
    model="kling",
    duration=5.0,
)

# Text to speech
audio = client.text_to_speech(
    text="Hello world",
    model="f5-tts",
    reference_audio="voice_sample.mp3",  # optional
)

# Download results
client.download_file(images[0]["url"], "output.png")
client.download_file(video["url"], "output.mp4")
```
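
The same client can chain steps programmatically, mirroring the image-to-video workflow above. A minimal sketch, assuming the method signatures and return shapes shown in the example:

```python
from fal_helper import FalClient

client = FalClient()

# 1. Generate a still and save it locally
images = client.generate_image(
    prompt="A majestic eagle perched on a cliff",
    model="flux-pro",
)
client.download_file(images[0]["url"], "eagle.png")

# 2. Animate the saved still into a short clip
video = client.generate_video(
    image_path="eagle.png",
    prompt="eagle spreads wings and takes flight",
    model="kling",
    duration=5.0,
)
client.download_file(video["url"], "eagle.mp4")
```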

## Troubleshooting

### "FAL_API_KEY environment variable is not set"

Set your API key:

```bash
export FAL_API_KEY="your-key-here"
```
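
To avoid re-exporting the key in every new terminal, you can persist it in your shell profile (the path below assumes bash; adjust for zsh or other shells):

```bash
# Persist the key for future bash sessions
echo 'export FAL_API_KEY="your-key-here"' >> ~/.bashrc
source ~/.bashrc
```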

### Video generation is slow

Video generation typically takes 2-5 minutes. Video models are far more computationally demanding than image models, so some waiting is expected.
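
If you need several clips, you can start them concurrently from the shell instead of waiting on each one. A sketch; whether parallel requests are throttled depends on your fal.ai account limits:

```bash
# Launch two renders at once, then wait for both to finish
uv run ./scripts/generate_video_text.py "timelapse of clouds over a city" &
uv run ./scripts/generate_video_text.py "ocean waves at golden hour" &
wait
```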

### Image quality issues

- Try a higher-quality model (`flux-pro` instead of `flux-schnell`)
- Use more specific prompts
- Add negative prompts to avoid unwanted elements
- Use `square_hd` for higher resolution
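
Combining those suggestions into a single command (the prompt and negative terms are illustrative):

```bash
# Higher-quality model, high-res square output, and a negative prompt
uv run ./scripts/generate_image.py \
  "Studio portrait of a golden retriever, soft key light, shallow depth of field" \
  --model flux-pro --aspect square_hd \
  --negative "blurry, low quality, distorted"
```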

### Voice cloning sounds off

- Ensure reference audio is clear with no background noise
- Use 10-30 seconds of reference audio
- The reference should be natural speech, not singing or whispering
- Try the `f5-tts` model, which has the best voice cloning

## Notes

- All scripts auto-install dependencies via `uv run`
- Generated files include timestamps to avoid overwrites
- Use the `--open` flag to immediately view/play generated media
- Video generation consumes more API credits than images
- Some models may have content restrictions