Agent Skills: Clerk Incident Runbook

|

UncategorizedID: jeremylongshore/claude-code-plugins-plus-skills/clerk-incident-runbook

Install this agent skill to your local

pnpm dlx add-skill https://github.com/jeremylongshore/claude-code-plugins-plus-skills/tree/HEAD/plugins/saas-packs/clerk-pack/skills/clerk-incident-runbook

Skill Files

Browse the full folder contents for clerk-incident-runbook.

Download Skill

Loading file tree…

plugins/saas-packs/clerk-pack/skills/clerk-incident-runbook/SKILL.md

Skill Metadata

Name
clerk-incident-runbook
Description
|

Clerk Incident Runbook

Overview

Procedures for responding to Clerk-related incidents in production. Covers triage, emergency auth bypass, recovery scripts, and post-incident review.

Prerequisites

  • Access to Clerk Dashboard (dashboard.clerk.com)
  • Access to application logs and monitoring
  • Emergency contact list for on-call team
  • Rollback procedures documented

Instructions

Step 1: Triage — Identify Incident Category

| Category | Symptoms | Severity | |----------|----------|----------| | Clerk outage | status.clerk.com shows incident, all auth fails | Critical | | Key compromise | Unauthorized access detected | Critical | | Middleware failure | All routes return 500 | High | | Session issues | Users randomly logged out | Medium | | Webhook backlog | User sync falling behind | Low |

Quick diagnostic:

#!/bin/bash
# scripts/clerk-triage.sh
set -euo pipefail

echo "=== Clerk Incident Triage ==="
echo "Time: $(date -u +%Y-%m-%dT%H:%M:%SZ)"

# 1. Check Clerk status
echo -e "\n--- Clerk Status ---"
curl -s https://status.clerk.com/api/v2/status.json | python3 -c "
import json,sys
d=json.load(sys.stdin)
print(f\"Status: {d['status']['description']}\")" 2>/dev/null || echo "Cannot reach status API"

# 2. Check API connectivity
echo -e "\n--- API Connectivity ---"
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" -H "Authorization: Bearer ${CLERK_SECRET_KEY}" \
  https://api.clerk.com/v1/users?limit=1 2>/dev/null)
echo "API response: HTTP $HTTP_CODE"

# 3. Check app health
echo -e "\n--- App Health ---"
curl -s http://localhost:3000/api/clerk-health 2>/dev/null | python3 -m json.tool || echo "App not reachable"

Step 2: Emergency Auth Bypass (Clerk Outage Only)

// middleware.ts — emergency bypass mode
import { clerkMiddleware, createRouteMatcher } from '@clerk/nextjs/server'
import { NextResponse } from 'next/server'

const EMERGENCY_BYPASS = process.env.CLERK_EMERGENCY_BYPASS === 'true'

const isPublicRoute = createRouteMatcher(['/', '/sign-in(.*)', '/sign-up(.*)'])

export default clerkMiddleware(async (auth, req) => {
  // Emergency bypass: allow all requests when Clerk is down
  if (EMERGENCY_BYPASS) {
    console.warn('[EMERGENCY] Auth bypass active — all requests allowed')
    const response = NextResponse.next()
    response.headers.set('X-Auth-Bypass', 'true')
    return response
  }

  if (!isPublicRoute(req)) {
    await auth.protect()
  }
})

Activate bypass:

# Vercel: set env var and redeploy
vercel env add CLERK_EMERGENCY_BYPASS production  # Set to "true"
vercel deploy --prod

# After Clerk recovers: remove bypass
vercel env rm CLERK_EMERGENCY_BYPASS production
vercel deploy --prod

Step 3: Key Rotation (Compromised Secret Key)

#!/bin/bash
# scripts/rotate-clerk-keys.sh
set -euo pipefail

echo "=== Clerk Key Rotation ==="
echo "1. Go to dashboard.clerk.com > API Keys"
echo "2. Generate new Secret Key"
echo "3. Update all environments:"

# Update production
echo "Updating production..."
# vercel env rm CLERK_SECRET_KEY production
# vercel env add CLERK_SECRET_KEY production  # Paste new key
# vercel deploy --prod

echo "4. Verify all endpoints still work"
echo "5. Monitor for unauthorized access attempts"
echo "6. File incident report"

Step 4: Session Recovery (Mass Logout Fix)

// app/api/admin/refresh-sessions/route.ts
import { auth, clerkClient } from '@clerk/nextjs/server'

export async function POST() {
  const { has } = await auth()
  if (!has({ role: 'org:admin' })) {
    return Response.json({ error: 'Admin only' }, { status: 403 })
  }

  // Force-revoke all sessions (users will need to re-authenticate)
  const client = await clerkClient()
  const users = await client.users.getUserList({ limit: 500 })
  let revoked = 0

  for (const user of users.data) {
    const sessions = await client.sessions.getSessionList({ userId: user.id })
    for (const session of sessions.data) {
      if (session.status === 'active') {
        await client.sessions.revokeSession(session.id)
        revoked++
      }
    }
  }

  return Response.json({ revoked, message: `Revoked ${revoked} sessions` })
}

Step 5: Webhook Replay (Missed Events)

# Check for missed webhooks in Clerk Dashboard:
# Dashboard > Webhooks > Select endpoint > Message Logs
# Click "Retry" on failed messages

# Or replay from your audit log:
echo "Check database for missing user records:"
echo "SELECT clerk_id FROM users WHERE created_at > NOW() - INTERVAL '1 hour'"

Step 6: Post-Incident Review Template

## Incident Report

**Date:** YYYY-MM-DD HH:MM UTC
**Duration:** X hours Y minutes
**Severity:** Critical / High / Medium / Low
**Category:** Clerk Outage / Key Compromise / Config Error / Middleware Failure

### Timeline
- HH:MM — Incident detected (how: monitoring alert / user report / manual)
- HH:MM — Triage started, category identified
- HH:MM — Mitigation applied (emergency bypass / key rotation / rollback)
- HH:MM — Service restored
- HH:MM — Post-incident review completed

### Root Cause
[Description of what caused the incident]

### Impact
- Users affected: X
- Duration of auth downtime: Y minutes
- Data loss: None / Partial / Details

### Action Items
- [ ] Add monitoring for [specific check]
- [ ] Update runbook with [new procedure]
- [ ] Implement [preventive measure]

Output

  • Triage script identifying incident category and severity
  • Emergency auth bypass middleware (activate via env var)
  • Key rotation procedure for compromised credentials
  • Session revocation endpoint for mass-logout recovery
  • Post-incident review template

Error Handling

| Scenario | Response | |----------|----------| | Clerk API completely down | Activate emergency bypass, monitor status.clerk.com | | Secret key compromised | Rotate keys immediately, revoke all sessions, audit logs | | Middleware 500 errors | Check middleware.ts syntax, verify Clerk SDK version | | Webhook delivery failures | Retry from Dashboard, check endpoint accessibility | | Users randomly logged out | Check session lifetime settings, verify domain config |

Examples

Quick Status Check One-Liner

curl -s https://status.clerk.com/api/v2/status.json | python3 -c "import json,sys; print(json.load(sys.stdin)['status']['description'])"

Resources

Next Steps

After resolving incident, review clerk-observability for improved monitoring.