# Groq Core Workflow B: Audio, Vision & Speech
## Overview

Beyond chat completions, Groq provides ultra-fast audio transcription (Whisper at up to 216x real-time), multimodal vision (Llama 4 Scout/Maverick), and text-to-speech. These endpoints use the same `groq-sdk` client.
## Prerequisites

- `groq-sdk` installed and `GROQ_API_KEY` set
- For audio: audio files in a supported format
- For vision: image URLs or base64-encoded images
## Audio Models

| Model ID | Languages | Speed | Best For |
|----------|-----------|-------|----------|
| `whisper-large-v3` | 100+ | 164x real-time | Best accuracy, multilingual |
| `whisper-large-v3-turbo` | 100+ | 216x real-time | Best speed/accuracy balance |

Supported audio formats: `flac`, `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `ogg`, `wav`, `webm`
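It can help to reject unsupported files before uploading rather than waiting for the API to do so. A minimal sketch; `isSupportedAudioFormat` is an illustrative helper (not part of the SDK), and the extension list mirrors the formats above:

```typescript
import path from "path";

// Extensions Groq's Whisper endpoints accept (from the list above)
const SUPPORTED_AUDIO = new Set([
  "flac", "mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm",
]);

// Returns true when the file extension is one Whisper can ingest
function isSupportedAudioFormat(filePath: string): boolean {
  const ext = path.extname(filePath).slice(1).toLowerCase();
  return SUPPORTED_AUDIO.has(ext);
}
```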
## Instructions
### Step 1: Audio Transcription (Whisper)

```typescript
import Groq from "groq-sdk";
import fs from "fs";

const groq = new Groq();

// Transcribe an audio file and return the plain text
async function transcribe(filePath: string): Promise<string> {
  const transcription = await groq.audio.transcriptions.create({
    file: fs.createReadStream(filePath),
    model: "whisper-large-v3-turbo",
    response_format: "json", // or "text" or "verbose_json"
    language: "en", // Optional: ISO 639-1 code
  });
  return transcription.text;
}

// With timestamps (verbose mode)
async function transcribeWithTimestamps(filePath: string) {
  const transcription = await groq.audio.transcriptions.create({
    file: fs.createReadStream(filePath),
    model: "whisper-large-v3-turbo",
    response_format: "verbose_json",
    timestamp_granularities: ["segment"],
  });
  return transcription; // Includes segments with start/end times
}
```
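The `verbose_json` segments can be turned into subtitles. A sketch, assuming each segment carries `start`/`end` in seconds plus `text` (the shape Whisper's verbose output returns); `segmentsToSrt` is an illustrative helper, not an SDK function:

```typescript
interface Segment { start: number; end: number; text: string; }

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms % 1000, 3)}`;
}

// Convert verbose_json segments into an SRT subtitle string
function segmentsToSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${srtTime(seg.start)} --> ${srtTime(seg.end)}\n${seg.text.trim()}\n`)
    .join("\n");
}
```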
### Step 2: Audio Translation (to English)

```typescript
// Translate audio in any language to English text
async function translateAudio(filePath: string): Promise<string> {
  const translation = await groq.audio.translations.create({
    file: fs.createReadStream(filePath),
    model: "whisper-large-v3",
  });
  return translation.text;
}
```
### Step 3: Vision (Image Understanding)

```typescript
// Analyze an image with Llama 4 Scout (up to 5 images per request)
async function analyzeImage(imageUrl: string, question: string) {
  const completion = await groq.chat.completions.create({
    model: "meta-llama/llama-4-scout-17b-16e-instruct",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: question },
          { type: "image_url", image_url: { url: imageUrl } },
        ],
      },
    ],
    max_tokens: 1024,
  });
  return completion.choices[0].message.content;
}

// Multiple images
async function compareImages(urls: string[], prompt: string) {
  const imageContent = urls.map((url) => ({
    type: "image_url" as const,
    image_url: { url },
  }));
  const completion = await groq.chat.completions.create({
    model: "meta-llama/llama-4-scout-17b-16e-instruct",
    messages: [{
      role: "user",
      content: [{ type: "text", text: prompt }, ...imageContent],
    }],
    max_tokens: 2048,
  });
  return completion.choices[0].message.content;
}

// Base64 image input
async function analyzeBase64Image(base64Data: string) {
  return groq.chat.completions.create({
    model: "meta-llama/llama-4-scout-17b-16e-instruct",
    messages: [{
      role: "user",
      content: [
        { type: "text", text: "Describe this image in detail." },
        {
          type: "image_url",
          image_url: { url: `data:image/jpeg;base64,${base64Data}` },
        },
      ],
    }],
  });
}
```
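To produce the base64 payload from a local file, a small helper can build the data URL in one step. A sketch; `fileToDataUrl` is an illustrative helper, and the default `mimeType` is an assumption you should match to the actual file type:

```typescript
import fs from "fs";

// Read a local image and wrap it as a data URL suitable for image_url.url.
// The mimeType default assumes a JPEG; pass the correct type for other formats.
function fileToDataUrl(filePath: string, mimeType = "image/jpeg"): string {
  const base64 = fs.readFileSync(filePath).toString("base64");
  return `data:${mimeType};base64,${base64}`;
}
```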
### Step 4: Text-to-Speech

```typescript
// Generate speech from text and save it to a file
async function textToSpeech(text: string, outputPath: string) {
  const response = await groq.audio.speech.create({
    model: "playai-tts", // or "playai-tts-arabic"
    input: text,
    voice: "Arista-PlayAI", // See Groq docs for voice options
    response_format: "wav", // wav, mp3, flac, opus, aac
  });
  const buffer = Buffer.from(await response.arrayBuffer());
  fs.writeFileSync(outputPath, buffer);
  console.log(`Audio saved to ${outputPath}`);
}
```
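TTS endpoints typically cap input length (check Groq's docs for the current limit). One approach is to split long text on sentence boundaries before synthesizing each piece; a sketch, where `maxChars` is an illustrative parameter rather than a documented API limit:

```typescript
// Split text into chunks no longer than maxChars, breaking on sentence
// boundaries so each chunk sounds natural when synthesized separately.
// A single sentence longer than maxChars is kept whole.
function chunkForTts(text: string, maxChars = 1000): string[] {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) ?? [text];
  const chunks: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    if (current && current.length + sentence.length > maxChars) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

Each chunk can then be passed to `textToSpeech` in turn, writing to numbered output files.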
### Step 5: Python Audio Transcription

```python
from groq import Groq

client = Groq()

# Transcribe with segment timestamps
with open("audio.mp3", "rb") as file:
    transcription = client.audio.transcriptions.create(
        file=("audio.mp3", file),
        model="whisper-large-v3-turbo",
        response_format="verbose_json",
    )

print(transcription.text)
for segment in transcription.segments:
    print(f"[{segment.start:.1f}s - {segment.end:.1f}s] {segment.text}")
```
### Step 6: Model Benchmarking

```typescript
// Compare models on the same prompt for speed vs. quality
async function benchmarkModels(prompt: string) {
  const models = [
    "llama-3.1-8b-instant",
    "llama-3.3-70b-versatile",
    "llama-3.3-70b-specdec",
  ];
  for (const model of models) {
    const start = performance.now();
    const result = await groq.chat.completions.create({
      model,
      messages: [{ role: "user", content: prompt }],
      max_tokens: 200,
    });
    const elapsed = performance.now() - start;
    // Groq includes server-side timing (completion_time, in seconds) in usage,
    // which the SDK types don't declare -- hence the cast
    const tps =
      result.usage!.completion_tokens /
      ((result.usage as any).completion_time || 1);
    console.log(
      `${model.padEnd(45)} | ${elapsed.toFixed(0)}ms | ${tps.toFixed(0)} tok/s | ${result.usage!.total_tokens} tokens`
    );
  }
}
```
## Vision Model Limits
- Maximum 5 images per request
- Supported formats: JPEG, PNG, GIF, WebP
- Images fetched from URL or embedded as base64
- Vision models also support tool use, JSON mode, and streaming
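The image cap above can be enforced client-side before a request ever leaves the process. A minimal sketch; `assertImageLimit` and `ContentPart` are illustrative names for checking the message content shape used in Step 3:

```typescript
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

const MAX_IMAGES = 5; // per the limits above

// Throws before sending a request the vision endpoint would reject
function assertImageLimit(content: ContentPart[]): void {
  const images = content.filter((p) => p.type === "image_url").length;
  if (images > MAX_IMAGES) {
    throw new Error(`Request has ${images} images; the limit is ${MAX_IMAGES}.`);
  }
}
```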
## Error Handling
| Error | Cause | Solution |
|-------|-------|----------|
| Invalid file format | Unsupported audio type | Convert to mp3/wav/flac first |
| File too large | Audio exceeds 25MB | Split into smaller chunks |
| `model_not_found` | Vision model ID wrong | Use the full path: `meta-llama/llama-4-scout-17b-16e-instruct` |
| `max_images_exceeded` | More than 5 images in request | Reduce to 5 or fewer images |
| 429 on Whisper | Audio RPM limit hit | Queue transcription requests |
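For the 429 case, queuing can be as simple as retrying with exponential backoff. A sketch; the status check assumes the thrown error carries a numeric `status` field, as `groq-sdk`'s API errors do:

```typescript
// Retry an async call on 429s with exponential backoff; rethrow anything else
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      if (err?.status !== 429 || attempt >= maxAttempts - 1) throw err;
      const delay = baseDelayMs * 2 ** attempt; // 500ms, 1s, 2s, ...
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Usage: `await withRetry(() => transcribe("call.mp3"))`.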
## Next Steps

For common errors and troubleshooting, see `groq-common-errors`.