Agent Skills: Canva Incident Runbook

|

UncategorizedID: jeremylongshore/claude-code-plugins-plus-skills/canva-incident-runbook

Install this agent skill to your local

pnpm dlx add-skill https://github.com/jeremylongshore/claude-code-plugins-plus-skills/tree/HEAD/plugins/saas-packs/canva-pack/skills/canva-incident-runbook

Skill Files

Browse the full folder contents for canva-incident-runbook.

Download Skill

Loading file tree…

plugins/saas-packs/canva-pack/skills/canva-incident-runbook/SKILL.md

Skill Metadata

Name
canva-incident-runbook
Description
|

Canva Incident Runbook

Overview

Rapid incident response for Canva Connect API integration failures. Covers triage, mitigation, escalation, and postmortem.

Quick Triage (First 5 Minutes)

#!/bin/bash
# canva-triage.sh — Run immediately when incident detected

echo "=== Canva Triage ==="

# 1. Is it Canva or us?
echo -n "Canva API: "
curl -s -o /dev/null -w "HTTP %{http_code} (%{time_total}s)\n" \
  -H "Authorization: Bearer $CANVA_ACCESS_TOKEN" \
  "https://api.canva.com/rest/v1/users/me"

# 2. Check our health endpoint
echo -n "Our health: "
curl -s -o /dev/null -w "HTTP %{http_code}\n" \
  "https://api.ourapp.com/health"

# 3. Error rate (if Prometheus available)
echo "Error rate (5min):"
curl -s "localhost:9090/api/v1/query?query=rate(canva_api_errors_total[5m])" \
  | python3 -c "import sys,json; d=json.load(sys.stdin); print(d['data']['result'])" 2>/dev/null \
  || echo "Prometheus not available"

# 4. Rate limit status
echo -n "Rate limit remaining: "
curl -sD - -o /dev/null -H "Authorization: Bearer $CANVA_ACCESS_TOKEN" \
  "https://api.canva.com/rest/v1/designs?limit=1" 2>&1 \
  | grep -i "x-ratelimit-remaining" || echo "unknown"

Decision Tree

API returning errors?
├── YES → What HTTP status?
│   ├── 401 → Token expired → Refresh token, check rotation
│   ├── 403 → Scope issue → Verify integration permissions
│   ├── 429 → Rate limited → Enable backoff, check Retry-After
│   ├── 5xx → Canva outage → Enable fallback, monitor status page
│   └── Other → Check request format against API docs
└── NO → Is our integration healthy?
    ├── YES → Likely resolved or intermittent → Monitor
    └── NO → Check our infra (pods, memory, DNS, TLS)

Severity Levels

| Level | Definition | Response Time | Example | |-------|------------|---------------|---------| | P1 | All design operations broken | < 15 min | All API calls returning 5xx | | P2 | Degraded — some operations fail | < 1 hour | Exports failing, designs work | | P3 | Minor — non-critical feature down | < 4 hours | Webhooks delayed | | P4 | No user impact | Next business day | Monitoring gap |

Immediate Mitigation by Error Type

401 — Token Expired / Revoked

# Check if token is valid
curl -s -H "Authorization: Bearer $TOKEN" \
  https://api.canva.com/rest/v1/users/me | python3 -m json.tool

# If expired: refresh all affected users' tokens
# If revoked: users must re-authorize via OAuth flow

429 — Rate Limited

# Check how long to wait
curl -sD - -o /dev/null -H "Authorization: Bearer $TOKEN" \
  "https://api.canva.com/rest/v1/designs" 2>&1 \
  | grep -i "retry-after"

# Immediate: reduce request rate
# Enable queue-based rate limiting

5xx — Canva Service Error

# Check Canva status page (no official status.canva.com for API)
# Check Canva developer community for reported outages

# Enable graceful degradation
# Return cached data where possible
# Show "Design features temporarily unavailable" to users

Communication Templates

Internal (Slack)

P[1-4] INCIDENT: Canva Integration
Status: INVESTIGATING | MITIGATING | RESOLVED
Impact: [Describe user impact]
API Response: HTTP [status code]
Current action: [What you're doing]
Next update: [Time]
IC: @[name]

External (Status Page)

Canva Design Features — Degraded Performance

We are experiencing issues with our design integration.
Users may see delays or errors when creating/exporting designs.

We are actively working with our design platform provider to resolve this.

Last updated: [ISO 8601 timestamp]

Post-Incident

Evidence Collection

# Collect logs for the incident window
kubectl logs -l app=canva-integration --since=2h > incident-canva-logs.txt

# Export metrics
curl "localhost:9090/api/v1/query_range?query=canva_api_errors_total&start=$(date -d '2 hours ago' +%s)&end=$(date +%s)&step=60" > metrics.json

Postmortem Template

## Incident: Canva API [Error Type]
**Date:** YYYY-MM-DD HH:MM UTC
**Duration:** X hours Y minutes
**Severity:** P[1-4]

### Summary
[1-2 sentence description]

### Timeline (UTC)
- HH:MM — [First alert / error detected]
- HH:MM — [Investigation started]
- HH:MM — [Root cause identified]
- HH:MM — [Mitigation applied]
- HH:MM — [Confirmed resolved]

### Root Cause
[Was it Canva-side or our integration? Token issue? Rate limit? Code bug?]

### Impact
- Users affected: N
- Failed operations: N designs / N exports

### Action Items
- [ ] [Preventive measure] — Owner — Due date

Error Handling

| Issue | Cause | Solution | |-------|-------|----------| | Can't determine if Canva is down | No status page API | Test with known-good token | | Token refresh fails | Revoked integration | Re-authorize user | | All users affected | Integration-level issue | Check client credentials | | Single user affected | User-level token issue | Refresh that user's token |

Resources

Next Steps

For data handling, see canva-data-handling.