Agent Skills: Perplexity Load & Scale

Install this agent skill locally:

pnpm dlx add-skill https://github.com/jeremylongshore/claude-code-plugins-plus-skills/tree/HEAD/plugins/saas-packs/perplexity-pack/skills/perplexity-load-scale

Perplexity Load & Scale

Overview

Load testing and capacity planning for the Perplexity Sonar API. Key constraint: the default tier is rate-limited to 50 requests per minute (RPM), and every request performs a live web search with variable latency. Load tests must stay within this limit, both to avoid 429 responses and to avoid burning through credits.

Capacity Constraints

| Constraint | Default Limit | Impact |
|------------|---------------|--------|
| RPM (requests per minute) | 50 | Hard ceiling on throughput |
| Context window | 127K tokens | Limits conversation history |
| sonar latency | 1-3s | ~20-60 requests/min per sequential worker (capped at 50 by the RPM limit) |
| sonar-pro latency | 3-8s | ~7-20 requests/min per sequential worker |
| search_domain_filter | 20 domains max | Per-request limit |
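
The latency rows convert to throughput with simple arithmetic. The following is a minimal sketch, assuming each worker issues requests strictly one after another; the helper names are illustrative and not part of any Perplexity SDK.

```typescript
// Hypothetical helpers for reading the table above; not part of any SDK.

/** Requests per minute one sequential worker can issue at a given latency. */
function perWorkerRpm(latencySeconds: number): number {
  return 60 / latencySeconds;
}

/** Average number of in-flight requests needed to sustain `rpm` (Little's law). */
function inFlightAt(rpm: number, latencySeconds: number): number {
  return (rpm / 60) * latencySeconds;
}

const RPM_LIMIT = 50; // default-tier limit quoted above

// sonar at 3s latency: one worker manages 20 requests/min, and sustaining
// the full 50 RPM needs ceil(2.5) = 3 requests in flight at once.
console.log(perWorkerRpm(3)); // 20
console.log(Math.ceil(inFlightAt(RPM_LIMIT, 3))); // 3
```

The same two functions apply to sonar-pro: slower responses mean each worker contributes fewer requests per minute, so more concurrency is needed to reach the same ceiling.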

Prerequisites

  • k6 load testing tool installed
  • Separate Perplexity API key for load testing
  • Budget approval (load tests cost money)
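
For the budget request, a rough upper bound on the cost of a single run can be worked out ahead of time. This sketch assumes the same per-request figures used in Step 3 ($0.005 for sonar, $0.02 for sonar-pro); these are estimates, so check current Perplexity pricing before relying on them. The function name is hypothetical.

```typescript
// Rough cost ceiling for one load-test run. The per-request prices are
// assumptions for illustration, not official pricing.
const ASSUMED_COST_USD = { sonar: 0.005, "sonar-pro": 0.02 } as const;

function estimateTestCost(
  model: keyof typeof ASSUMED_COST_USD,
  durationMinutes: number,
  rpm: number
): number {
  // A run can never send more than durationMinutes * rpm requests.
  const maxRequests = durationMinutes * rpm;
  return maxRequests * ASSUMED_COST_USD[model];
}

// A 10-minute run pinned at the 50 RPM limit sends at most 500 requests.
console.log(estimateTestCost("sonar", 10, 50).toFixed(2)); // "2.50"
```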

Instructions

Step 1: k6 Load Test Script

// perplexity-load-test.js
import http from "k6/http";
import { check, sleep } from "k6";
import { Rate, Trend } from "k6/metrics";

const errorRate = new Rate("perplexity_errors");
const citationCount = new Trend("perplexity_citations");

export const options = {
  stages: [
    { duration: "1m", target: 5 },    // Ramp to 5 VUs
    { duration: "3m", target: 5 },    // Steady at 5 VUs
    { duration: "1m", target: 15 },   // Ramp to 15 VUs
    { duration: "3m", target: 15 },   // Steady at 15 VUs
    { duration: "1m", target: 0 },    // Ramp down
  ],
  thresholds: {
    http_req_duration: ["p(95)<10000"],  // 10s P95 for sonar
    perplexity_errors: ["rate<0.05"],    // <5% error rate
  },
};

const queries = [
  "What is TypeScript?",
  "Latest Node.js features",
  "Python vs JavaScript for web development",
  "Current state of AI in healthcare",
  "Best practices for REST API design",
];

export default function () {
  const query = queries[Math.floor(Math.random() * queries.length)];

  const response = http.post(
    "https://api.perplexity.ai/chat/completions",
    JSON.stringify({
      model: "sonar",
      messages: [{ role: "user", content: query }],
      max_tokens: 200,
    }),
    {
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${__ENV.PERPLEXITY_API_KEY}`,
      },
      timeout: "15s",
    }
  );

  const success = check(response, {
    "status is 200": (r) => r.status === 200,
    "has content": (r) => {
      try { return JSON.parse(r.body).choices[0].message.content.length > 0; }
      catch { return false; }
    },
  });

  errorRate.add(!success);

  if (response.status === 200) {
    try {
      const body = JSON.parse(response.body);
      citationCount.add(body.citations?.length || 0);
    } catch {}
  }

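  // Critical: stay within 50 RPM across all VUs combined. Each VU sends at
  // most one request per pause, so at the 15-VU peak a 20-22s pause caps
  // the aggregate rate at 15 * 60 / 20 = 45 RPM.
  sleep(20 + Math.random() * 2);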
}
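
Whether a given combination of VUs and pause stays under the rate limit can be checked before spending any credits. The sketch below assumes closed-loop virtual users, where each VU waits for its response and then sleeps; the function names are illustrative.

```typescript
// Aggregate requests/min for closed-loop virtual users: each VU completes
// one iteration every (latency + sleep) seconds.
function aggregateRpm(vus: number, latencySeconds: number, sleepSeconds: number): number {
  return (vus * 60) / (latencySeconds + sleepSeconds);
}

/**
 * Minimum per-iteration sleep that keeps `vus` users under `targetRpm`.
 * Leaving latencySeconds at 0 gives a conservative bound, because real
 * latency only slows each VU down further.
 */
function minSleepSeconds(vus: number, targetRpm: number, latencySeconds = 0): number {
  return Math.max(0, (vus * 60) / targetRpm - latencySeconds);
}

// 15 VUs with 2s latency and a 2s pause would send about 225 requests/min.
console.log(aggregateRpm(15, 2, 2)); // 225
// Holding 15 VUs under 45 RPM takes a pause of at least 20s per iteration.
console.log(minSleepSeconds(15, 45)); // 20
```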

Step 2: Run Load Test

set -euo pipefail
# Smoke test (1 VU for 30s, just enough to verify the script and API key)
k6 run --vus 1 --duration 30s \
  --env PERPLEXITY_API_KEY="$PERPLEXITY_API_KEY" \
  perplexity-load-test.js

# Full test (respecting 50 RPM)
k6 run --env PERPLEXITY_API_KEY="$PERPLEXITY_API_KEY" \
  perplexity-load-test.js

Step 3: Capacity Estimation

interface CapacityEstimate {
  maxRPM: number;
  avgLatencyMs: number;
  maxConcurrent: number;
  dailyCapacity: number;
  estimatedDailyCost: number;
}

function estimateCapacity(
  rpm: number,
  avgLatency: number,
  model: "sonar" | "sonar-pro"
): CapacityEstimate {
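  // Rough per-request cost assumptions in USD; verify against current pricing.
  const costPerRequest = model === "sonar-pro" ? 0.02 : 0.005;

  return {
    maxRPM: rpm,
    avgLatencyMs: avgLatency,
    // Little's law: in-flight requests = arrival rate (req/s) * latency (s).
    // Round up, since a fractional slot still needs a whole connection.
    maxConcurrent: Math.ceil((rpm / 60) * (avgLatency / 1000)),
    dailyCapacity: rpm * 60 * 24,
    estimatedDailyCost: rpm * 60 * 24 * costPerRequest,
  };
}

// Example: 50 RPM, 2s avg latency, sonar
const capacity = estimateCapacity(50, 2000, "sonar");
// { maxRPM: 50, avgLatencyMs: 2000, maxConcurrent: 2, dailyCapacity: 72000, estimatedDailyCost: 360 }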

Step 4: Request Queue for Scale

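import OpenAI from "openai";
import PQueue from "p-queue";

// Assumption: the Sonar API is called through the OpenAI-compatible SDK,
// pointed at Perplexity's base URL.
const perplexity = new OpenAI({
  apiKey: process.env.PERPLEXITY_API_KEY,
  baseURL: "https://api.perplexity.ai",
});

// Queue that respects 50 RPM
const searchQueue = new PQueue({
  concurrency: 5,
  interval: 60_000,
  intervalCap: 45,  // 45 RPM (safety margin below 50)
});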

async function scalableSearch(query: string) {
  return searchQueue.add(() =>
    perplexity.chat.completions.create({
      model: "sonar",
      messages: [{ role: "user", content: query }],
      max_tokens: 500,
    })
  );
}

// Queue status for monitoring
function queueStatus() {
  return {
    pending: searchQueue.pending,
    size: searchQueue.size,
    isPaused: searchQueue.isPaused,
  };
}
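
The queue counters become more useful for monitoring when converted into a wait-time estimate. This is a minimal sketch, assuming the queue drains at a steady rate equal to its interval cap; `estimateDrainMinutes` is a hypothetical helper, not a p-queue API.

```typescript
// Hypothetical monitoring helper: minutes needed to clear a backlog when
// requests leave the queue at a fixed rate.
function estimateDrainMinutes(backlog: number, rpm: number): number {
  if (rpm <= 0) throw new RangeError("rpm must be positive");
  return Math.ceil(backlog / rpm);
}

// 900 queued queries at 45 RPM take about 20 minutes to clear.
console.log(estimateDrainMinutes(900, 45)); // 20
```

In practice the backlog would be the sum of the queue's `size` and `pending` values, and an alert threshold on the result gives early warning that demand is outrunning the rate limit.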

Step 5: Scaling Strategy

| Scale | Queries/Day | Architecture | Cost/Day |
|-------|-------------|--------------|----------|
| Small | <1,000 | Direct API calls | <$5 |
| Medium | 1K-10K | Queue + cache (30%+ hit rate) | $5-$50 |
| Large | 10K-100K | Multi-key + cache + queue | $50-$500 |
| Enterprise | 100K+ | Contact Perplexity for custom limits | Custom |

For Medium+ scale, caching is mandatory. A 50% cache hit rate halves your API costs and doubles effective throughput, because only cache misses count against the rate limit.
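
The caching claim can be made concrete with a small in-memory cache. This is a sketch under stated assumptions: answers are cached by normalized query text with a fixed TTL, and the `fetcher` argument stands in for the real API call. `QueryCache` and `effectiveRpm` are illustrative names, not an existing library.

```typescript
type Fetcher = (query: string) => Promise<string>;
interface CacheEntry { value: string; expiresAt: number }

/** Queries/min served when only cache misses reach a rate-limited API. */
function effectiveRpm(apiRpm: number, hitRate: number): number {
  if (hitRate < 0 || hitRate >= 1) throw new RangeError("hitRate must be in [0, 1)");
  return apiRpm / (1 - hitRate);
}

// Minimal in-memory TTL cache keyed by normalized query text.
class QueryCache {
  private readonly entries = new Map<string, CacheEntry>();
  private readonly fetcher: Fetcher;
  private readonly ttlMs: number;
  hits = 0;
  misses = 0;

  constructor(fetcher: Fetcher, ttlMs: number) {
    this.fetcher = fetcher;
    this.ttlMs = ttlMs;
  }

  async lookup(query: string): Promise<string> {
    const key = query.trim().toLowerCase();
    const cached = this.entries.get(key);
    if (cached && cached.expiresAt > Date.now()) {
      this.hits++;
      return cached.value;
    }
    this.misses++;
    const value = await this.fetcher(query); // only misses reach the API
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    return value;
  }

  get hitRate(): number {
    const total = this.hits + this.misses;
    return total === 0 ? 0 : this.hits / total;
  }
}

// At a 50% hit rate, a 50 RPM key serves about 100 queries/min.
console.log(effectiveRpm(50, 0.5)); // 100
```

Because every Sonar answer reflects a live web search, the TTL is a trade-off between freshness and cost; time-sensitive queries likely warrant a much shorter TTL than evergreen ones.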

Benchmark Results Template

## Perplexity Load Test Report
**Date:** YYYY-MM-DD | **Model:** sonar | **Duration:** 10 min

| Metric | Value |
|--------|-------|
| Total Requests | |
| Success Rate | |
| P50 Latency | |
| P95 Latency | |
| P99 Latency | |
| Avg Citations/Response | |
| Max Sustained RPM | |
| Estimated Cost | |
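
The latency and rate rows can be filled in from raw measurements. The sketch below assumes latencies of successful requests are available in milliseconds and uses the nearest-rank percentile definition; k6's own summary may interpolate, so its figures can differ slightly. The function names are hypothetical.

```typescript
// Nearest-rank percentile over latencies already sorted ascending.
function percentile(sortedMs: number[], p: number): number {
  if (sortedMs.length === 0) throw new RangeError("no samples");
  const rank = Math.ceil((p / 100) * sortedMs.length);
  return sortedMs[Math.max(0, rank - 1)];
}

// Builds the template's numeric rows from successful-request latencies,
// a count of failed requests, and the test duration.
function summarize(latenciesMs: number[], failures: number, durationMinutes: number) {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const total = latenciesMs.length + failures;
  return {
    totalRequests: total,
    successRate: total === 0 ? 0 : latenciesMs.length / total,
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
    sustainedRpm: total / durationMinutes,
  };
}

const exampleReport = summarize(
  [1200, 1500, 1800, 2100, 2400, 3000, 3500, 4000, 5200, 9000],
  0,
  1
);
console.log(exampleReport.p50, exampleReport.p95); // 2400 9000
```

With only ten samples the P95 and P99 both land on the slowest request, which is one reason a run needs enough requests before its tail percentiles mean much.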

Error Handling

| Issue | Cause | Solution |
|-------|-------|----------|
| 429 during load test | Exceeding 50 RPM | Reduce VUs, increase sleep |
| Inconsistent latency | Web search variability | Normal; use P95 not avg |
| k6 timeout | sonar-pro queries >15s | Increase timeout to 30s |
| High cost from test | Too many queries | Use max_tokens: 50 for load tests |

Output

  • k6 load test script calibrated for Perplexity rate limits
  • Capacity estimation calculator
  • Request queue for sustained throughput
  • Scaling strategy by volume tier

Next Steps

For reliability patterns, see perplexity-reliability-patterns.