Groq Observability
Overview
Monitor Groq LPU inference for latency, token throughput, rate limit utilization, and cost. Groq's defining advantage is speed (280-560 tok/s), so latency degradation is the highest-priority signal. Every response carries timing metadata in its usage object (queue_time, prompt_time, completion_time, all in seconds) along with rate limit headers.
Key Metrics to Track
| Metric | Type | Source | Why |
|--------|------|--------|-----|
| TTFT (time to first token) | Histogram | Client-side timing | Groq's main value prop |
| Tokens/second | Gauge | usage.completion_tokens / usage.completion_time | Throughput degradation |
| Total latency | Histogram | Client-side timing | End-to-end performance |
| Rate limit remaining | Gauge | x-ratelimit-remaining-* headers | Prevent 429s |
| Token usage | Counter | usage.total_tokens | Cost attribution |
| Error rate by code | Counter | Error handler | Availability |
| Estimated cost | Counter | Tokens * model price | Budget tracking |
Instructions
Step 1: Instrumented Groq Client
import Groq from "groq-sdk";

const groq = new Groq();

interface GroqMetrics {
  model: string;
  latencyMs: number;
  ttftMs: number;
  tokensPerSec: number;
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
  queueTimeMs: number;
  estimatedCostUsd: number;
}

const PRICE_PER_1M: Record<string, { input: number; output: number }> = {
  "llama-3.1-8b-instant": { input: 0.05, output: 0.08 },
  "llama-3.3-70b-versatile": { input: 0.59, output: 0.79 },
  "llama-3.3-70b-specdec": { input: 0.59, output: 0.99 },
  "meta-llama/llama-4-scout-17b-16e-instruct": { input: 0.11, output: 0.34 },
};
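async function trackedCompletion(
  model: string,
  messages: any[],
  options?: { maxTokens?: number; temperature?: number }
): Promise<{ result: any; metrics: GroqMetrics }> {
  const start = performance.now();
  const result = await groq.chat.completions
    .create({
      model,
      messages,
      max_tokens: options?.maxTokens ?? 1024,
      temperature: options?.temperature ?? 0.7,
    })
    .catch((err: unknown) => {
      // Count failures by HTTP status so groq_errors_total (Step 2) is populated.
      const status =
        err instanceof Groq.APIError ? String(err.status ?? "unknown") : "unknown";
      groqErrors.labels(model, status).inc();
      throw err;
    });
  const latencyMs = performance.now() - start;
  const usage = result.usage!;
  const timing = usage as any; // Groq timing fields, reported in seconds
  // Unlisted models fall back to a placeholder price; add real prices for accurate cost.
  const pricing = PRICE_PER_1M[model] ?? { input: 0.10, output: 0.10 };
  const metrics: GroqMetrics = {
    model,
    latencyMs: Math.round(latencyMs),
    // A non-streaming response has no observable first token, so TTFT is
    // approximated server-side as queue time plus prompt processing time.
    ttftMs: Math.round(((timing.queue_time ?? 0) + (timing.prompt_time ?? 0)) * 1000),
    // Prefer server-reported generation time; fall back to wall-clock latency.
    tokensPerSec: Math.round(
      usage.completion_tokens / (timing.completion_time || latencyMs / 1000)
    ),
    promptTokens: usage.prompt_tokens,
    completionTokens: usage.completion_tokens,
    totalTokens: usage.total_tokens,
    queueTimeMs: Math.round((timing.queue_time ?? 0) * 1000),
    estimatedCostUsd:
      (usage.prompt_tokens / 1_000_000) * pricing.input +
      (usage.completion_tokens / 1_000_000) * pricing.output,
  };
  emitMetrics(metrics);
  return { result, metrics };
}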
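Client-side TTFT, as listed in the metrics table, can only be observed on a streamed response. The sketch below times the first chunk of any async-iterable stream. `measureTtft` and its injectable `now` clock are hypothetical names, and the sketch assumes the object returned by `groq.chat.completions.create({ ..., stream: true })` can be consumed with `for await`.

```typescript
// Hypothetical helper: times the first chunk of an async-iterable stream.
// `now` is injectable so the function can be exercised with a fake clock.
async function measureTtft<T>(
  stream: AsyncIterable<T>,
  now: () => number = Date.now
): Promise<{ ttftMs: number | null; totalMs: number; chunks: T[] }> {
  const start = now();
  let ttftMs: number | null = null;
  const chunks: T[] = [];
  for await (const chunk of stream) {
    // Record the elapsed time only once, when the first chunk arrives.
    if (ttftMs === null) ttftMs = now() - start;
    chunks.push(chunk);
  }
  // ttftMs stays null if the stream ended without yielding anything.
  return { ttftMs, totalMs: now() - start, chunks };
}
```

If this is wired in, `ttftMs` is probably best recorded in its own histogram so it is not mixed with the server-side estimate from the non-streaming path.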
Step 2: Prometheus Metrics
import { Histogram, Counter, Gauge } from "prom-client";

const groqLatency = new Histogram({
  name: "groq_latency_ms",
  help: "Groq API latency in milliseconds",
  labelNames: ["model"],
  buckets: [50, 100, 200, 500, 1000, 2000, 5000],
});

const groqTokens = new Counter({
  name: "groq_tokens_total",
  help: "Total tokens processed",
  labelNames: ["model", "direction"],
});

const groqThroughput = new Gauge({
  name: "groq_tokens_per_second",
  help: "Current tokens per second",
  labelNames: ["model"],
});

const groqRateLimitRemaining = new Gauge({
  name: "groq_ratelimit_remaining",
  help: "Remaining rate limit quota",
  labelNames: ["type"],
});

const groqCost = new Counter({
  name: "groq_cost_usd",
  help: "Estimated cost in USD",
  labelNames: ["model"],
});

const groqErrors = new Counter({
  name: "groq_errors_total",
  help: "API errors by status code",
  labelNames: ["model", "status_code"],
});

function emitMetrics(m: GroqMetrics) {
  groqLatency.labels(m.model).observe(m.latencyMs);
  groqTokens.labels(m.model, "input").inc(m.promptTokens);
  groqTokens.labels(m.model, "output").inc(m.completionTokens);
  groqThroughput.labels(m.model).set(m.tokensPerSec);
  groqCost.labels(m.model).inc(m.estimatedCostUsd);
}
Step 3: Rate Limit Header Tracking
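// Parse rate limit headers from any Groq response.
// A missing or malformed header is skipped rather than recorded as 0,
// which would falsely trip the GroqRateLimitCritical alert.
function trackRateLimitHeaders(headers: Record<string, string>) {
  const parse = (name: string): number | undefined => {
    const value = Number.parseInt(headers[name] ?? "", 10);
    return Number.isNaN(value) ? undefined : value;
  };
  const remaining = {
    requests: parse("x-ratelimit-remaining-requests"),
    tokens: parse("x-ratelimit-remaining-tokens"),
  };
  if (remaining.requests !== undefined) {
    groqRateLimitRemaining.labels("requests").set(remaining.requests);
  }
  if (remaining.tokens !== undefined) {
    groqRateLimitRemaining.labels("tokens").set(remaining.tokens);
  }
  return remaining;
}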
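Remaining quota is easier to chart and alert on as a percentage of the limit. The sketch below assumes the response also carries `x-ratelimit-limit-requests` and `x-ratelimit-limit-tokens` headers alongside the remaining-quota ones. `rateLimitUtilization` is a hypothetical helper, not part of the SDK.

```typescript
// Hypothetical helper: percentage of a rate limit already consumed (0-100).
// Returns undefined when either header is missing or unusable, so callers
// can skip the update instead of recording a misleading value.
function rateLimitUtilization(
  headers: Record<string, string | undefined>,
  kind: "requests" | "tokens"
): number | undefined {
  const limit = Number.parseInt(headers[`x-ratelimit-limit-${kind}`] ?? "", 10);
  const remaining = Number.parseInt(headers[`x-ratelimit-remaining-${kind}`] ?? "", 10);
  if (!Number.isFinite(limit) || !Number.isFinite(remaining) || limit <= 0) {
    return undefined;
  }
  // Clamp so an out-of-range header cannot push the result outside 0-100.
  const used = Math.min(Math.max(limit - remaining, 0), limit);
  return (used * 100) / limit;
}
```

The result could feed a separate gauge for a utilization panel, which avoids hard-coding per-model limits in alert thresholds.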
Step 4: Prometheus Alert Rules
# prometheus/groq-alerts.yml
groups:
  - name: groq
    rules:
      - alert: GroqLatencyHigh
        expr: histogram_quantile(0.95, rate(groq_latency_ms_bucket[5m])) > 1000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Groq P95 latency > 1s (normally < 200ms)"
      - alert: GroqRateLimitCritical
        expr: groq_ratelimit_remaining{type="requests"} < 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Groq rate limit nearly exhausted (< 5 requests remaining)"
      - alert: GroqThroughputDrop
        expr: groq_tokens_per_second < 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Groq throughput dropped below 100 tok/s (expected 280+)"
      - alert: GroqErrorRateHigh
        # Failed calls as a fraction of all calls. groq_latency_ms_count is
        # only observed on success, so errors + successes is the total.
        expr: |
          sum(rate(groq_errors_total[5m]))
            / (sum(rate(groq_errors_total[5m])) + sum(rate(groq_latency_ms_count[5m]))) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Groq API error rate elevated (> 5% of requests)"
      - alert: GroqCostSpike
        expr: increase(groq_cost_usd[1h]) > 10
        labels:
          severity: warning
        annotations:
          summary: "Groq spend exceeded $10 in the past hour"
Step 5: Structured Request Logging
// Structured JSON log for each Groq request
function logGroqRequest(metrics: GroqMetrics, requestId?: string) {
  const logEntry = {
    ts: new Date().toISOString(),
    service: "groq",
    model: metrics.model,
    latency_ms: metrics.latencyMs,
    ttft_ms: metrics.ttftMs,
    tokens_per_sec: metrics.tokensPerSec,
    prompt_tokens: metrics.promptTokens,
    completion_tokens: metrics.completionTokens,
    queue_time_ms: metrics.queueTimeMs,
    cost_usd: metrics.estimatedCostUsd.toFixed(6),
    request_id: requestId,
  };
  // Output as structured JSON for log aggregation
  console.log(JSON.stringify(logEntry));
}
Step 6: Dashboard Panels
Key Grafana/dashboard panels for Groq monitoring:
- TTFT Distribution (histogram) -- Groq's main value; alert if > 500ms
- Tokens/Second by Model (time series) -- should be 280-560 range
- Rate Limit Utilization (gauge, 0-100%) -- alert at 90%
- Request Volume (counter rate) -- by model
- Error Rate (counter rate) -- by status code (429, 5xx)
- Cumulative Cost (counter) -- by model, daily/weekly/monthly
- Queue Time (histogram) -- Groq-specific, should be < 50ms
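The thresholds in the panel list can also be checked in code, for example in a smoke test or a health endpoint. This sketch hard-codes the values quoted above (TTFT 500 ms, a 280 tok/s floor, 90% rate limit utilization, 50 ms queue time). The `PanelSample` type and `checkThresholds` function are hypothetical names used only for illustration.

```typescript
// Hypothetical snapshot of the values behind the dashboard panels.
interface PanelSample {
  ttftMs: number;
  tokensPerSec: number;
  rateLimitUtilizationPct: number;
  queueTimeMs: number;
}

// Returns the names of the panels whose threshold is breached.
function checkThresholds(s: PanelSample): string[] {
  const breaches: string[] = [];
  if (s.ttftMs > 500) breaches.push("ttft"); // alert if > 500ms
  if (s.tokensPerSec < 280) breaches.push("throughput"); // expected 280-560 tok/s
  if (s.rateLimitUtilizationPct >= 90) breaches.push("rate_limit"); // alert at 90%
  if (s.queueTimeMs >= 50) breaches.push("queue_time"); // should be < 50ms
  return breaches;
}
```

An empty array means every sampled value is inside its expected range.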
Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
| 429 with high retry-after | RPM or TPM exhausted | Implement request queuing |
| Latency spike > 2s | Model overloaded or large prompt | Reduce prompt size or switch to lighter model |
| 503 Service Unavailable | Groq capacity issue | Enable fallback to alternative provider |
| Tokens/sec drop | Streaming disabled or large prompts | Enable streaming for better perceived performance |
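Request queuing for 429s usually starts with deciding how long to wait before the next attempt. This is a minimal sketch, assuming `retry-after` arrives as a number of seconds and falling back to capped exponential backoff when the header is absent or unparsable. `retryDelayMs` and its default values are illustrative, not Groq recommendations.

```typescript
// Hypothetical helper: milliseconds to wait before retrying a rate-limited call.
// `attempt` is zero-based: 0 for the first retry, 1 for the second, and so on.
function retryDelayMs(
  retryAfter: string | undefined,
  attempt: number,
  baseMs = 500,
  maxMs = 30000
): number {
  if (retryAfter !== undefined && retryAfter.trim() !== "") {
    const seconds = Number(retryAfter);
    // Honor the server's hint in full when it is a valid non-negative number.
    if (Number.isFinite(seconds) && seconds >= 0) {
      return Math.ceil(seconds * 1000);
    }
  }
  // No usable hint: exponential backoff, capped at maxMs.
  return Math.min(baseMs * 2 ** attempt, maxMs);
}
```

Adding random jitter to the fallback delay would help keep many clients from retrying in lockstep.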
Next Steps
For incident response procedures, see groq-incident-runbook.