Agent Skills: Canva Incident Runbook

|

UncategorizedID: jeremylongshore/claude-code-plugins-plus-skills/canva-incident-runbook

Install this agent skill to your local

pnpm dlx add-skill https://github.com/jeremylongshore/claude-code-plugins-plus-skills/tree/HEAD/plugins/saas-packs/canva-pack/skills/canva-incident-runbook

Skill Files

Browse the full folder contents for canva-incident-runbook.

Download Skill

Loading file tree…

plugins/saas-packs/canva-pack/skills/canva-incident-runbook/SKILL.md

Skill Metadata

Name
canva-incident-runbook
Description
'Execute Canva Connect API incident response with triage, mitigation,

Canva Incident Runbook

Overview

Rapid incident response for Canva Connect API integration failures. Covers triage, mitigation, escalation, and postmortem.

Quick Triage (First 5 Minutes)

#!/bin/bash
# canva-triage.sh — Run immediately when incident detected

echo "=== Canva Triage ==="

# 1. Is it Canva or us?
echo -n "Canva API: "
curl -s -o /dev/null -w "HTTP %{http_code} (%{time_total}s)\n" \
  -H "Authorization: Bearer $CANVA_ACCESS_TOKEN" \
  "https://api.canva.com/rest/v1/users/me"

# 2. Check our health endpoint
echo -n "Our health: "
curl -s -o /dev/null -w "HTTP %{http_code}\n" \
  "https://api.ourapp.com/health"

# 3. Error rate (if Prometheus available)
echo "Error rate (5min):"
curl -s "localhost:9090/api/v1/query?query=rate(canva_api_errors_total[5m])" \
  | python3 -c "import sys,json; d=json.load(sys.stdin); print(d['data']['result'])" 2>/dev/null \
  || echo "Prometheus not available"

# 4. Rate limit status
echo -n "Rate limit remaining: "
curl -sD - -o /dev/null -H "Authorization: Bearer $CANVA_ACCESS_TOKEN" \
  "https://api.canva.com/rest/v1/designs?limit=1" 2>&1 \
  | grep -i "x-ratelimit-remaining" || echo "unknown"

Decision Tree

API returning errors?
├── YES → What HTTP status?
│   ├── 401 → Token expired → Refresh token, check rotation
│   ├── 403 → Scope issue → Verify integration permissions
│   ├── 429 → Rate limited → Enable backoff, check Retry-After
│   ├── 5xx → Canva outage → Enable fallback, monitor status page
│   └── Other → Check request format against API docs
└── NO → Is our integration healthy?
    ├── YES → Likely resolved or intermittent → Monitor
    └── NO → Check our infra (pods, memory, DNS, TLS)

Severity Levels

| Level | Definition | Response Time | Example | |-------|------------|---------------|---------| | P1 | All design operations broken | < 15 min | All API calls returning 5xx | | P2 | Degraded — some operations fail | < 1 hour | Exports failing, designs work | | P3 | Minor — non-critical feature down | < 4 hours | Webhooks delayed | | P4 | No user impact | Next business day | Monitoring gap |

Immediate Mitigation by Error Type

401 — Token Expired / Revoked

# Check if token is valid
curl -s -H "Authorization: Bearer $TOKEN" \
  https://api.canva.com/rest/v1/users/me | python3 -m json.tool

# If expired: refresh all affected users' tokens
# If revoked: users must re-authorize via OAuth flow

429 — Rate Limited

# Check how long to wait
curl -sD - -o /dev/null -H "Authorization: Bearer $TOKEN" \
  "https://api.canva.com/rest/v1/designs" 2>&1 \
  | grep -i "retry-after"

# Immediate: reduce request rate
# Enable queue-based rate limiting

5xx — Canva Service Error

# Check Canva status page (no official status.canva.com for API)
# Check Canva developer community for reported outages

# Enable graceful degradation
# Return cached data where possible
# Show "Design features temporarily unavailable" to users

Communication Templates

Internal (Slack)

P[1-4] INCIDENT: Canva Integration
Status: INVESTIGATING | MITIGATING | RESOLVED
Impact: [Describe user impact]
API Response: HTTP [status code]
Current action: [What you're doing]
Next update: [Time]
IC: @[name]

External (Status Page)

Canva Design Features — Degraded Performance

We are experiencing issues with our design integration.
Users may see delays or errors when creating/exporting designs.

We are actively working with our design platform provider to resolve this.

Last updated: [ISO 8601 timestamp]

Post-Incident

Evidence Collection

# Collect logs for the incident window
kubectl logs -l app=canva-integration --since=2h > incident-canva-logs.txt

# Export metrics
curl "localhost:9090/api/v1/query_range?query=canva_api_errors_total&start=$(date -d '2 hours ago' +%s)&end=$(date +%s)&step=60" > metrics.json

Postmortem Template

## Incident: Canva API [Error Type]
**Date:** YYYY-MM-DD HH:MM UTC
**Duration:** X hours Y minutes
**Severity:** P[1-4]

### Summary
[1-2 sentence description]

### Timeline (UTC)
- HH:MM — [First alert / error detected]
- HH:MM — [Investigation started]
- HH:MM — [Root cause identified]
- HH:MM — [Mitigation applied]
- HH:MM — [Confirmed resolved]

### Root Cause
[Was it Canva-side or our integration? Token issue? Rate limit? Code bug?]

### Impact
- Users affected: N
- Failed operations: N designs / N exports

### Action Items
- [ ] [Preventive measure] — Owner — Due date

Error Handling

| Issue | Cause | Solution | |-------|-------|----------| | Can't determine if Canva is down | No status page API | Test with known-good token | | Token refresh fails | Revoked integration | Re-authorize user | | All users affected | Integration-level issue | Check client credentials | | Single user affected | User-level token issue | Refresh that user's token |

Resources

Next Steps

For data handling, see canva-data-handling.