Adobe Incident Runbook Skill

Overview

Rapid incident response procedures for Adobe API-related outages, covering IMS authentication failures, Firefly/Photoshop API downtime, PDF Services quota exhaustion, and I/O Events delivery failures.

Prerequisites

Access to Adobe Developer Console and Admin Console
Access to application monitoring (Grafana, Datadog, etc.)
kubectl access to production cluster (if applicable)
Communication channels (Slack, PagerDuty)

Severity Matrix

| Level | Definition | Response Time | Example |
|-------|------------|---------------|---------|
| P1 | Complete Adobe integration failure | < 15 min | IMS auth broken, all APIs down |
| P2 | Single API degraded | < 1 hour | Firefly 429s, Photoshop timeouts |
| P3 | Minor impact | < 4 hours | Webhook delays, slow PDF extraction |
| P4 | No user impact | Next business day | Monitoring gap, metric anomaly |

Quick Triage (Run These First)

# 1. Is the Adobe status page reachable? (200 only means the page loads; open it to check for active incidents)
curl -s -o /dev/null -w "Adobe Status: %{http_code}\n" https://status.adobe.com

# 2. Can we generate an access token?
curl -s -o /dev/null -w "IMS Auth: %{http_code}\n" -X POST \
  'https://ims-na1.adobelogin.com/ims/token/v3' \
  -d "client_id=${ADOBE_CLIENT_ID}&client_secret=${ADOBE_CLIENT_SECRET}&grant_type=client_credentials&scope=${ADOBE_SCOPES}"

# 3. Can we reach each API endpoint?
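for endpoint in firefly-api.adobe.io image.adobe.io pdf-services.adobe.io; do
  # Any HTTP code means the host answered; curl exits non-zero (and prints 000) when it cannot connect
  if CODE=$(curl -s -o /dev/null -w "%{http_code}" --connect-timeout 5 "https://$endpoint" 2>/dev/null); then
    echo "$endpoint: $CODE"
  else
    echo "$endpoint: UNREACHABLE"
  fi
done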

# 4. Check our app health
curl -sf https://your-app.com/health | python3 -m json.tool

# 5. Recent errors in our logs (last 5 min)
kubectl logs -l app=adobe-service --since=5m 2>/dev/null | grep -iE "error|failed|429|401|500" | tail -20
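
The log check in step 5 prints raw lines, which can be hard to read under pressure. As a rough sketch, the helper below tallies status codes from log lines on stdin. The function name `count_status_codes` is hypothetical, and the word-boundary match is an assumption about the log format: any standalone number such as a millisecond field of `429` would also be counted.

```bash
# Count occurrences of selected HTTP status codes in log lines read from stdin.
# Assumes codes appear as standalone tokens, e.g. "status=429" or " 429 ".
# Prints "<code> <count>" per line, most frequent first.
count_status_codes() {
  grep -owE '401|403|429|500|503' \
    | sort | uniq -c | sort -rn \
    | awk '{ printf "%s %s\n", $2, $1 }'
}
```

It could be fed from the same source as step 5, for example `kubectl logs -l app=adobe-service --since=5m | count_status_codes`.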

Decision Tree

Adobe APIs returning errors?
├── YES: Is status.adobe.com reporting an incident?
│   ├── YES → Adobe-side outage. Enable fallback mode. Monitor status page.
│   └── NO → Check our credentials and config.
│       ├── 401/403 errors → Credentials expired/rotated. See "Auth Recovery" below.
│       ├── 429 errors → Rate limited. See "Rate Limit Recovery" below.
│       └── 500/503 errors → Adobe server issue (unreported). Open support ticket.
└── NO: Is our application healthy?
    ├── YES → Likely resolved or intermittent. Continue monitoring.
    └── NO → Our infrastructure issue. Check pods, memory, network.

Recovery Procedures

Auth Recovery (401/403)

# 1. Verify credentials are still valid in Developer Console
#    https://developer.adobe.com/console → Your Project → Credentials

# 2. Test credential directly
curl -v -X POST 'https://ims-na1.adobelogin.com/ims/token/v3' \
  -d "client_id=${ADOBE_CLIENT_ID}&client_secret=${ADOBE_CLIENT_SECRET}&grant_type=client_credentials&scope=${ADOBE_SCOPES}" 2>&1 | grep -E "HTTP|error"

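# 3. If credentials were rotated, update in secret manager
#    (printf avoids the trailing newline that a here-string would append to the stored secret)
printf '%s' "new_p8_secret" | gcloud secrets versions add adobe-client-secret --data-file=-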
# OR
aws secretsmanager update-secret --secret-id adobe/production/credentials \
  --secret-string '{"client_id":"...","client_secret":"new_secret"}'

# 4. Restart application to clear cached token
kubectl rollout restart deployment/adobe-service

# 5. Verify recovery
curl -sf https://your-app.com/health | jq '.services.adobe'
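
A single health check right after the restart can fail simply because the new pods are not ready yet. One way to handle that is a bounded retry loop, sketched below. Both `retry_until` and `adobe_health_ok` are hypothetical helper names, and the health URL is the runbook's own placeholder.

```bash
# Retry a command until it succeeds or the attempts run out.
# Usage: retry_until <max_attempts> <delay_seconds> <command> [args...]
retry_until() {
  max_attempts=$1 delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the final failure.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

# Hypothetical probe: succeeds when the app's health endpoint returns 2xx.
adobe_health_ok() {
  curl -sf https://your-app.com/health >/dev/null
}
```

A possible sequence after step 4 is `kubectl rollout status deployment/adobe-service --timeout=120s && retry_until 10 6 adobe_health_ok`, which waits for the rollout and then polls health for up to about a minute.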

Rate Limit Recovery (429)

# 1. Check if rate limiting is transient or sustained
# Look at 429 error rate over last 30 min

# 2. Reduce throughput immediately
# Option A: Scale down workers
kubectl scale deployment/adobe-batch-worker --replicas=1

# Option B: Enable rate limit queue mode
kubectl set env deployment/adobe-service ADOBE_RATE_LIMIT_MODE=queue

# 3. For sustained rate limiting, contact Adobe for limit increase
# Include: client_id, typical request volume, business justification
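
Reducing throughput usually goes together with spacing out retries on the client side. The sketch below computes a wait time with capped exponential backoff and prefers a `Retry-After` value when the 429 response supplies one as a number of seconds. The function name `backoff_delay` and the 60-second cap are assumptions, not values from Adobe's documentation; check the specific API's rate limit guidance before relying on them.

```bash
# Compute how many seconds to wait before retry number <attempt> (1-based).
# Usage: backoff_delay <attempt> [retry_after_seconds]
backoff_delay() {
  attempt=$1 retry_after=${2:-}
  max_delay=60  # assumed cap; tune to your workload

  # Prefer the server's hint when it is a plain number of seconds.
  # An HTTP-date Retry-After value falls through to exponential backoff.
  case "$retry_after" in
    ''|*[!0-9]*) ;;
    *) echo "$retry_after"; return 0 ;;
  esac

  # 1, 2, 4, 8, ... seconds, doubling per attempt until the cap is reached.
  delay=1
  i=1
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$max_delay" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt "$max_delay" ]; then
    delay=$max_delay
  fi
  echo "$delay"
}
```

A caller would then run something like `sleep "$(backoff_delay "$attempt" "$retry_after")"` between attempts. Adding random jitter on top is a common refinement when many workers retry at once.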

Fallback Mode

# Enable fallback mode (app continues working without Adobe)
kubectl set env deployment/adobe-service ADOBE_FALLBACK_MODE=true

# Verify fallback is working
curl -sf https://your-app.com/health | jq '.services.adobe'
# Should return { "status": "degraded", "mode": "fallback" }
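
For scripted verification, the expected response can be checked rather than read by eye. The helper below is a crude sketch: `fallback_active` is a hypothetical name, and the text match assumes the health payload contains a `"mode": "fallback"` field shaped like the example above. A `jq` filter would be more robust if the exact JSON structure is known.

```bash
# Succeeds when the health JSON on stdin reports fallback mode.
# Text match only; tolerates whitespace around the colon.
fallback_active() {
  grep -Eq '"mode"[[:space:]]*:[[:space:]]*"fallback"'
}
```

It could be used as `curl -sf https://your-app.com/health | fallback_active && echo "fallback confirmed"`.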

Communication Templates

Internal (Slack)

P[1-4] INCIDENT: Adobe [API Name] Integration
Status: INVESTIGATING / IDENTIFIED / MONITORING / RESOLVED
Impact: [User-facing description]
Root cause: [Adobe outage / credential issue / rate limit / our bug]
Current action: [What you're doing right now]
Next update: [Time]
Commander: @[name]
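
To keep updates consistent across responders, the template can be rendered by a small function instead of being typed by hand. This is only a sketch: `incident_update` is a hypothetical helper, and posting the result to Slack is left to whatever tooling the team already uses.

```bash
# Render the internal incident update from positional arguments.
# Usage: incident_update <severity> <api> <status> <impact> \
#                        <root_cause> <current_action> <next_update> <commander>
incident_update() {
  printf '%s INCIDENT: Adobe %s Integration\n' "$1" "$2"
  printf 'Status: %s\n' "$3"
  printf 'Impact: %s\n' "$4"
  printf 'Root cause: %s\n' "$5"
  printf 'Current action: %s\n' "$6"
  printf 'Next update: %s\n' "$7"
  printf 'Commander: @%s\n' "$8"
}
```

Example call: `incident_update P2 Firefly INVESTIGATING "Image generation failing" "rate limit" "Scaling down batch workers" "14:30 UTC" alice`.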

Postmortem Template

## Incident: Adobe [API] [Error Type]
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P[1-4]

### Summary
[1-2 sentence description of what happened]

### Timeline
- HH:MM UTC — Alert fired: adobe_api_errors_total spike
- HH:MM UTC — On-call acknowledged, began triage
- HH:MM UTC — Root cause identified: [description]
- HH:MM UTC — Mitigation applied: [action taken]
- HH:MM UTC — Full recovery confirmed

### Root Cause
[Technical explanation — was it Adobe-side, credential issue, our bug?]

### Impact
- Users affected: N
- API calls failed: N
- Revenue impact: $X (if applicable)

### Action Items
- [ ] [Preventive measure] — Owner — Due date
- [ ] [Monitoring improvement] — Owner — Due date
- [ ] [Documentation update] — Owner — Due date

Output

Incident severity classified
Root cause identified via decision tree
Recovery procedure executed
Stakeholders notified with template
Evidence collected for postmortem

Error Handling

| Issue | Cause | Solution |
|-------|-------|----------|
| Can't reach status.adobe.com | Network issue | Use mobile data or check @AdobeCare on Twitter |
| kubectl auth expired | Token timeout | Re-authenticate with cloud provider |
| Secret manager access denied | IAM policy | Use break-glass admin account |
| Fallback mode not implemented | Missing code path | Return cached/default data |

Next Steps

For data handling, see adobe-data-handling.
