Multimodal Models Skill

Multimodal Models

Pre-trained models for vision, audio, and cross-modal tasks.

Model Overview

| Model | Modality | Task |
|-------|----------|------|
| CLIP | Image + Text | Zero-shot classification, similarity |
| Whisper | Audio → Text | Transcription, translation |
| Stable Diffusion | Text → Image | Image generation, editing |

CLIP (Vision-Language)

Zero-shot image classification without training on specific labels.

CLIP Use Cases

| Task | How |
|------|-----|
| Zero-shot classification | Compare image to text label embeddings |
| Image search | Find images matching text query |
| Content moderation | Classify against safety categories |
| Image similarity | Compare image embeddings |

CLIP Models

| Model | Parameters | Trade-off |
|-------|------------|-----------|
| ViT-B/32 | 151M | Recommended balance |
| ViT-L/14 | 428M | Best quality, slower |
| RN50 | 102M | Fastest, lower quality |

CLIP Concepts

| Concept | Description |
|---------|-------------|
| Dual encoder | Separate encoders for image and text |
| Contrastive learning | Trained to match image-text pairs |
| Normalization | Always normalize embeddings before similarity |
| Descriptive labels | Better labels = better zero-shot accuracy |

Key concept: CLIP embeds images and text in the same vector space. Classification means finding the text embedding nearest to the image embedding.
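The nearest-text-embedding idea can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `normalize`, `classify`, and `embed_with_clip` are hypothetical helper names, the prompt template "a photo of a ..." is an illustrative choice, and `embed_with_clip` assumes the `clip` package from the OpenAI repository, PyTorch, and Pillow are installed.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def classify(image_emb, text_embs, labels):
    """Return (best_label, scores): the label whose text embedding is nearest."""
    image_emb = normalize(image_emb)
    scores = [
        sum(a * b for a, b in zip(image_emb, normalize(text_emb)))
        for text_emb in text_embs
    ]
    best = max(range(len(labels)), key=scores.__getitem__)
    return labels[best], scores


def embed_with_clip(image_path, labels, model_name="ViT-B/32"):
    """Embed one image and its candidate labels (assumes clip, torch, Pillow)."""
    import clip
    import torch
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load(model_name, device=device)
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    # Descriptive prompts tend to work better than bare class names.
    tokens = clip.tokenize([f"a photo of a {label}" for label in labels]).to(device)
    with torch.no_grad():
        image_emb = model.encode_image(image)[0].float().cpu().tolist()
        text_embs = model.encode_text(tokens).float().cpu().tolist()
    return image_emb, text_embs
```

A typical call would be `classify(*embed_with_clip("photo.jpg", labels), labels)`, where `photo.jpg` is a placeholder path. Keeping the comparison separate from the model call makes it easy to reuse stored embeddings.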

CLIP Limitations

Not for fine-grained classification
No spatial understanding (whole image only)
May reflect training data biases

Whisper (Speech Recognition)

Robust multilingual transcription supporting 99 languages.

Whisper Use Cases

| Task | Configuration |
|------|---------------|
| Transcription | Default transcribe task |
| Translation to English | task="translate" |
| Subtitles | Output format SRT/VTT |
| Word timestamps | word_timestamps=True |
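For the subtitle row, the conversion from segment timestamps to SRT can be sketched as below. `format_timestamp` and `segments_to_srt` are hypothetical helpers. The input is assumed to be a list of dicts with `start`, `end`, and `text` keys, which is the shape of `result["segments"]` returned by Whisper's `transcribe`.

```python
def format_timestamp(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments):
    """Render segments as SRT text: index, time range, then the caption."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```

The Whisper command-line tool can also write SRT and VTT files directly. A hand-rolled formatter like this is mainly useful when segments are filtered or merged first.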

Whisper Models

| Model | Size | Speed | Recommendation |
|-------|------|-------|----------------|
| turbo | 809M | Fast | Recommended |
| large | 1550M | Slow | Maximum quality |
| small | 244M | Medium | Good balance |
| base | 74M | Fast | Quick tests |
| tiny | 39M | Fastest | Prototyping only |

Whisper Concepts

| Concept | Description |
|---------|-------------|
| Language detection | Auto-detects, or specify for speed |
| Initial prompt | Improves accuracy on technical terms |
| Timestamps | Segment-level or word-level |
| faster-whisper | Alternative implementation, up to 4× faster |

Key concept: Specify the language when it is known; auto-detection adds latency.
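A minimal sketch of a transcription call that passes the language explicitly follows. `build_transcribe_options` and `transcribe` are hypothetical wrappers. The keyword names (`language`, `task`, `initial_prompt`, `word_timestamps`) are the ones accepted by `model.transcribe` in the `openai-whisper` package, which is assumed to be installed along with ffmpeg.

```python
def build_transcribe_options(language=None, translate=False,
                             initial_prompt=None, word_timestamps=False):
    """Collect keyword arguments for Whisper's transcribe call."""
    options = {
        "task": "translate" if translate else "transcribe",
        "word_timestamps": word_timestamps,
    }
    if language is not None:
        options["language"] = language  # skips automatic language detection
    if initial_prompt:
        options["initial_prompt"] = initial_prompt  # e.g. domain vocabulary
    return options


def transcribe(audio_path, model_name="turbo", **kwargs):
    """Load a Whisper model and transcribe one file (assumes openai-whisper)."""
    import whisper

    model = whisper.load_model(model_name)
    return model.transcribe(audio_path, **build_transcribe_options(**kwargs))
```

For example, `transcribe("meeting.mp3", language="en", initial_prompt="Kubernetes, gRPC")["text"]` would return the transcript, with `meeting.mp3` as a placeholder file. The Whisper README notes that turbo is not trained for translation, so a different multilingual model is the safer choice with `translate=True`.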

Whisper Limitations

May hallucinate on silence/noise
No speaker diarization (who said what)
Accuracy may degrade on audio longer than 30 minutes
Not suitable for real-time captioning
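One possible mitigation for hallucinated text on silence is to drop segments that the model itself flags as unlikely to contain speech. This is a heuristic sketch only: `drop_probable_silence` is a hypothetical helper, the 0.6 cut-off is an assumed value that would need tuning, and the code relies on the `no_speech_prob` field present in each entry of `result["segments"]`.

```python
def drop_probable_silence(segments, max_no_speech_prob=0.6):
    """Keep only segments whose no-speech probability is at or below the cut-off."""
    return [
        seg for seg in segments
        if seg.get("no_speech_prob", 0.0) <= max_no_speech_prob
    ]
```

Filtering after transcription does not stop the hallucination itself; it only hides segments the model was already unsure about.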

Stable Diffusion (Image Generation)

Text-to-image generation with various control methods.

SD Use Cases

| Task | Pipeline |
|------|----------|
| Text-to-image | DiffusionPipeline |
| Style transfer | Image2Image |
| Fill regions | Inpainting |
| Guided generation | ControlNet |
| Custom styles | LoRA adapters |

SD Models

| Model | Resolution | Quality |
|-------|------------|---------|
| SDXL | 1024×1024 | Best |
| SD 1.5 | 512×512 | Good, faster |
| SD 2.1 | 768×768 | Middle ground |

Key Parameters

| Parameter | Effect | Typical Value |
|-----------|--------|---------------|
| num_inference_steps | Quality vs speed | 20-50 |
| guidance_scale | Prompt adherence | 7-12 |
| negative_prompt | Avoid artifacts | "blurry, low quality" |
| strength (img2img) | How much to change | 0.5-0.8 |
| seed | Reproducibility | Fixed number |
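The parameters above map onto a text-to-image call roughly as follows. This is a sketch under stated assumptions: `build_generation_kwargs` and `generate` are hypothetical helpers, the default values are picked from the typical ranges in the table, the model id `stabilityai/stable-diffusion-xl-base-1.0` is one possible choice, and a CUDA GPU with the `diffusers` and `torch` packages is assumed.

```python
def build_generation_kwargs(prompt, negative_prompt="blurry, low quality",
                            steps=30, guidance_scale=7.5):
    """Assemble keyword arguments for a diffusers text-to-image pipeline call."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "num_inference_steps": steps,      # more steps: slower, often cleaner
        "guidance_scale": guidance_scale,  # higher: closer prompt adherence
    }


def generate(prompt, seed=42,
             model_id="stabilityai/stable-diffusion-xl-base-1.0", **kwargs):
    """Generate one image with a fixed seed (assumes diffusers, torch, CUDA)."""
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe = pipe.to("cuda")
    # A seeded generator makes the sampled noise, and so the image, repeatable.
    generator = torch.Generator(device="cuda").manual_seed(seed)
    result = pipe(generator=generator, **build_generation_kwargs(prompt, **kwargs))
    return result.images[0]
```

In diffusers the seed is supplied through a `torch.Generator` rather than a `seed` argument, which is why `generate` builds one explicitly.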

Control Methods

| Method | Input | Use Case |
|--------|-------|----------|
| ControlNet | Edge/depth/pose | Structural guidance |
| LoRA | Trained weights | Custom styles |
| Img2Img | Source image | Style transfer |
| Inpainting | Image + mask | Fill regions |

Memory Optimization

| Technique | Effect |
|-----------|--------|
| CPU offload | Reduces VRAM usage |
| Attention slicing | Trades speed for memory |
| VAE tiling | Large image support |
| xFormers | Faster attention |
| DPM scheduler | Fewer steps needed |
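Several of these techniques are single method calls on a diffusers pipeline. The sketch below assumes `pipe` is an already loaded Stable Diffusion pipeline and that `accelerate` is installed for CPU offload. `apply_memory_optimizations` and its flags are hypothetical names introduced for illustration.

```python
def apply_memory_optimizations(pipe, cpu_offload=True, attention_slicing=True,
                               vae_tiling=False, fast_scheduler=False):
    """Enable selected optimizations and return the names of those applied."""
    applied = []
    if cpu_offload:
        # Moves each sub-model to the GPU only while it is in use.
        pipe.enable_model_cpu_offload()
        applied.append("cpu_offload")
    if attention_slicing:
        # Computes attention in slices: lower peak memory, somewhat slower.
        pipe.enable_attention_slicing()
        applied.append("attention_slicing")
    if vae_tiling:
        # Decodes the image tile by tile, which helps at large resolutions.
        pipe.vae.enable_tiling()
        applied.append("vae_tiling")
    if fast_scheduler:
        from diffusers import DPMSolverMultistepScheduler

        # Swap in a DPM-Solver scheduler, which typically needs fewer steps.
        pipe.scheduler = DPMSolverMultistepScheduler.from_config(
            pipe.scheduler.config
        )
        applied.append("dpm_scheduler")
    return applied
```

When CPU offload is enabled, the pipeline should not also be moved to the GPU with `pipe.to("cuda")`, since offloading manages device placement itself.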

Key concept: Use SDXL for quality, SD 1.5 for speed. Always use negative prompts.

SD Limitations

GPU strongly recommended (CPU very slow)
Large VRAM requirements for SDXL
May generate anatomical errors
Prompt engineering matters

Common Patterns

Embedding and Similarity

All three models use embeddings:

CLIP: Image/text embeddings for similarity
Whisper: Audio embeddings for transcription
SD: Text embeddings for image conditioning

GPU Acceleration

| Model | VRAM Needed |
|-------|-------------|
| CLIP ViT-B/32 | ~2 GB |
| Whisper turbo | ~6 GB |
| SD 1.5 | ~6 GB |
| SDXL | ~10 GB |

Best Practices

| Practice | Why |
|----------|-----|
| Use recommended model sizes | Best quality/speed balance |
| Cache embeddings (CLIP) | Expensive to recompute |
| Specify language (Whisper) | Faster than auto-detect |
| Use negative prompts (SD) | Avoid common artifacts |
| Set seeds for reproducibility | Consistent results |
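Embedding caching can be as simple as a directory of files keyed by a content hash. The sketch below is one possible approach, not a prescribed one: `EmbeddingCache` is a hypothetical class, `embed_fn` stands for any callable that maps a file path to a list of floats (for instance a CLIP image encoder wrapper), and JSON is chosen only to keep the example dependency-free.

```python
import hashlib
import json
from pathlib import Path


class EmbeddingCache:
    """Cache embeddings on disk, keyed by the SHA-256 of the file contents."""

    def __init__(self, cache_dir, embed_fn):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.embed_fn = embed_fn

    def get(self, image_path):
        """Return the cached embedding, computing and storing it on a miss."""
        digest = hashlib.sha256(Path(image_path).read_bytes()).hexdigest()
        cache_file = self.cache_dir / f"{digest}.json"
        if cache_file.exists():
            return json.loads(cache_file.read_text())
        embedding = list(self.embed_fn(image_path))
        cache_file.write_text(json.dumps(embedding))
        return embedding
```

Hashing the contents rather than the path means renamed or copied files still hit the cache, at the cost of reading each file once per lookup.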

Resources

CLIP: https://github.com/openai/CLIP
Whisper: https://github.com/openai/whisper
Diffusers: https://huggingface.co/docs/diffusers
