Replit Incident Runbook
Overview
Rapid incident response for Replit deployment failures, database issues, and platform outages. Covers triage, diagnosis, remediation, rollback, and communication.
Prerequisites
- Access to Replit Workspace and Deployment settings
- Deployment URL for health checks
- Communication channel (Slack, email)
- Familiarity with rollback via Deployment History
Severity Levels
| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| P1 | Complete outage | < 15 min | App returns 5xx, DB down |
| P2 | Degraded service | < 1 hour | Slow responses, intermittent errors |
| P3 | Minor impact | < 4 hours | Non-critical feature broken |
| P4 | No user impact | Next business day | Monitoring gap |
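For a first-pass label during triage, the P1/P2 rows above can be approximated from a single health-check result. This is a minimal sketch, not part of Replit's tooling: `classify_severity` is a hypothetical helper, and the 2-second "slow" threshold is an assumption to tune for your app. P3 and P4 still need human judgment.

```bash
# Hypothetical helper: map one health-check result to a first-pass severity.
# Args: $1 = HTTP status code (000 when curl could not connect)
#       $2 = total response time in seconds (e.g. 0.42)
# The 2.0 s "slow" threshold is an assumption, not a Replit default.
classify_severity() {
  code="$1"
  seconds="$2"
  case "$code" in
    000|5??) echo "P1"; return 0 ;;   # unreachable or server error
  esac
  # awk does the floating-point comparison portably
  if awk -v t="$seconds" 'BEGIN { exit !(t > 2.0) }'; then
    echo "P2"                          # responding, but degraded
  else
    echo "OK"
  fi
}
```

Feed it the status code and `time_total` values that the triage script prints, for example `classify_severity 503 0.10`.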
Quick Triage (First 5 Minutes)
#!/usr/bin/env bash
set -euo pipefail
DEPLOY_URL="https://your-app.replit.app"
echo "=== TRIAGE ==="
# 1. Check Replit platform status
echo -n "Replit Status: "
curl -s https://status.replit.com/api/v2/summary.json | \
python3 -c "import sys,json;print(json.load(sys.stdin)['status']['description'])" 2>/dev/null || \
echo "Check https://status.replit.com"
# 2. Check your deployment health
echo -n "App Health: "
curl -s -o /dev/null -w "HTTP %{http_code} (%{time_total}s)" "$DEPLOY_URL/health" 2>/dev/null || echo "UNREACHABLE"
echo ""
# 3. Get health details
echo "Health Response:"
curl -s "$DEPLOY_URL/health" 2>/dev/null | python3 -m json.tool 2>/dev/null || echo "No response"
# 4. Repeat the request: a much faster second response suggests a cold start (Autoscale)
echo -n "Second request: "
curl -s -o /dev/null -w "HTTP %{http_code} (%{time_total}s)\n" "$DEPLOY_URL/health" 2>/dev/null || echo "UNREACHABLE"
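The comparison between the first and second request can be made explicit. This is a rough sketch: `looks_like_cold_start` is a hypothetical helper, and the 1-second floor and 3x ratio are illustrative assumptions rather than documented Replit behavior.

```bash
# Hypothetical helper: decide whether two timings look like a cold start.
# Args: $1 = first request time (s), $2 = second request time (s)
# The 1 s floor and 3x ratio are illustrative assumptions; tune them.
looks_like_cold_start() {
  awk -v a="$1" -v b="$2" 'BEGIN { exit !(a > 1.0 && a > 3 * b) }'
}

# Usage with the two timings printed by the triage script:
# looks_like_cold_start 4.8 0.3 && echo "Likely cold start"
```

If both requests are slow, the function returns non-zero, which points toward the database or memory checks rather than a cold start.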
Decision Tree
App not responding?
├─ YES: Is status.replit.com reporting an incident?
│   ├─ YES → Platform issue. Wait for Replit. Communicate to users.
│   └─ NO → Your deployment issue. Continue below.
│
│   Can you access the Replit Workspace?
│   ├─ YES → Check deployment logs:
│   │   ├─ Build error → Fix code, redeploy
│   │   ├─ Runtime crash → Check logs, fix, redeploy
│   │   └─ Secret missing → Add to Secrets tab, redeploy
│   └─ NO → Network/browser issue. Try incognito window.
│
└─ App responds but with errors?
    ├─ 5xx errors → Check logs for crash/exception
    ├─ Slow responses → Check database, cold start, memory
    └─ Auth not working → Verify deployment domain, not dev URL
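The top-level branches of the tree can be encoded as a small function so that on-call notes record which path was taken. This is a sketch only: `next_step` and its yes/no arguments are hypothetical, and the answers still come from a human checking each item.

```bash
# Hypothetical helper encoding the first branches of the decision tree.
# Args: $1 = "yes"/"no"  the app responds at all
#       $2 = "yes"/"no"  status.replit.com reports an incident
#       $3 = "yes"/"no"  the Replit Workspace is reachable
next_step() {
  responds="$1"; incident="$2"; workspace="$3"
  if [ "$responds" = "yes" ]; then
    echo "Check logs for errors, latency, or auth domain"
  elif [ "$incident" = "yes" ]; then
    echo "Platform issue: wait for Replit and notify users"
  elif [ "$workspace" = "yes" ]; then
    echo "Check deployment logs"
  else
    echo "Network or browser issue: try an incognito window"
  fi
}
```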
Remediation by Error Type
Deployment Crash (5xx / App Unreachable)
1. Open Replit Workspace
2. Go to Deployment Settings > Logs
3. Look for the crash reason:
   - "Error: Cannot find module..." → Missing dependency
   - "FATAL: Missing secrets..." → Add to Secrets tab
   - "EADDRINUSE" → Port conflict in .replit config
   - "JavaScript heap out of memory" → Increase VM size or fix memory leak
4. Fix the issue in code
5. Click "Deploy" to redeploy
6. If the fix is unclear, roll back:
   - Deployment Settings > History
   - Click "Rollback" on the last known-good version
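When logs are long, the crash signatures listed above can be matched mechanically. This is a minimal sketch: `diagnose_log_line` is a hypothetical helper, and it knows only the four patterns from this runbook.

```bash
# Hypothetical helper: map a log line to the remediation listed above.
# Arg: $1 = one line copied from the deployment logs
diagnose_log_line() {
  case "$1" in
    *"Cannot find module"*)            echo "Missing dependency: install it and redeploy" ;;
    *"Missing secrets"*)               echo "Add the secret in the Secrets tab and redeploy" ;;
    *EADDRINUSE*)                      echo "Port conflict: check the port settings in .replit" ;;
    *"JavaScript heap out of memory"*) echo "Increase VM size or fix the memory leak" ;;
    *)                                 echo "Unrecognized: read the full log or roll back" ;;
  esac
}

# Usage on logs pasted into a file (deploy.log is a placeholder name):
# while IFS= read -r line; do diagnose_log_line "$line"; done < deploy.log
```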
Database Connection Failure
1. Check database status in Database pane
2. Verify DATABASE_URL is set in Secrets
3. Test connection:
# From Replit Shell
node -e "
const {Pool} = require('pg');
const pool = new Pool({connectionString: process.env.DATABASE_URL, ssl:{rejectUnauthorized:false}});
pool.query('SELECT NOW()').then(r => console.log('OK:', r.rows[0])).catch(e => console.error('FAIL:', e.message)).finally(() => pool.end());
"
4. If connection fails:
   - Check if PostgreSQL is provisioned (Database pane)
   - Try creating a new database
   - Check for connection pool exhaustion (max connections)
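Before testing the connection, it can help to confirm that `DATABASE_URL` is set and well-formed without echoing the password into a shared terminal. This is a sketch under assumptions: `check_database_url` is a hypothetical helper, and it validates only the scheme and prints the host portion.

```bash
# Hypothetical helper: sanity-check a connection string without printing
# the password. Arg: $1 = the value of DATABASE_URL.
check_database_url() {
  url="$1"
  if [ -z "$url" ]; then
    echo "DATABASE_URL is empty or unset"
    return 1
  fi
  case "$url" in
    postgres://*|postgresql://*) ;;
    *) echo "Unexpected scheme (want postgres:// or postgresql://)"; return 1 ;;
  esac
  # Strip the scheme and credentials, then keep only host[:port]
  hostpart="${url#*://}"
  hostpart="${hostpart##*@}"
  hostpart="${hostpart%%/*}"
  echo "OK: host=$hostpart"
}

# Usage in the Replit Shell: check_database_url "${DATABASE_URL:-}"
```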
Cold Start Too Slow (Autoscale)
If cold starts exceed acceptable latency:
1. Check the deployment type: Autoscale scales to zero when idle, so the first request afterward waits for a fresh instance
2. Options:
   a. Switch to Reserved VM (always-on, no cold starts)
   b. Set up external keep-alive (ping /health every 4 min)
   c. Optimize startup: lazy imports, defer DB connection
3. To switch to Reserved VM:
   - Update .replit: deploymentTarget = "gce" ("cloudrun" is the Autoscale target)
   - Redeploy
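The external keep-alive option can be sketched as a bounded loop run from a machine outside Replit. `keep_alive` is a hypothetical helper; the 240-second interval matches the 4-minute ping above, and the ping count is an assumption chosen to cover one day.

```bash
# Hypothetical keep-alive loop for the external keep-alive option.
# Args: $1 = health URL, $2 = seconds between pings, $3 = number of pings
# Run it from a machine outside Replit (cron host, CI runner, etc.).
keep_alive() {
  url="$1"; interval="$2"; remaining="$3"
  while [ "$remaining" -gt 0 ]; do
    # Capture only the status code; 000 means the request failed
    code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$url") || code="000"
    echo "ping $url -> HTTP $code"
    remaining=$((remaining - 1))
    if [ "$remaining" -gt 0 ]; then
      sleep "$interval"
    fi
  done
}

# Usage: 360 pings at 240 s apart covers 24 hours
# keep_alive "https://your-app.replit.app/health" 240 360
```

Each ping is a real request to the deployment, so weigh the added traffic against simply switching to Reserved VM.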
Secrets Missing After Deploy
1. Open Secrets tab (lock icon in sidebar)
2. Verify all required secrets are present
3. Check Deployment Settings > Environment Variables
4. Workspace secrets should sync to deployments automatically (2025 onward); if one is still missing:
   - Remove and re-add the secret
   - Redeploy
5. For Account-level secrets:
   - Account Settings > Secrets
   - These apply to ALL Repls
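A startup check makes a missing secret fail loudly instead of surfacing later as a runtime crash. This is a minimal sketch: `check_required_secrets` is a hypothetical helper, and `SESSION_SECRET` in the usage line is a placeholder name, not something this runbook defines.

```bash
# Hypothetical startup check: list required secrets that are unset or empty.
# Args: names of environment variables. Returns non-zero if any are missing.
check_required_secrets() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable named by $name (POSIX-compatible)
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "MISSING: $name"
      missing=1
    fi
  done
  return "$missing"
}

# Usage before starting the app:
# check_required_secrets DATABASE_URL SESSION_SECRET || exit 1
```

It prints only variable names, never values, so the output is safe to paste into an incident channel.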
Rollback Procedure
Replit supports one-click rollback to any previous deployment:
1. Deployment Settings > History
2. Find the last successful deployment
3. Click "Rollback to this version"
4. Verify health endpoint
5. Investigate root cause before redeploying fix
Rollback restores:
- Code at that deployment's commit
- Deployment configuration at that time
Rollback does NOT revert database changes; handle schema or data fixes separately.
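The "verify health endpoint" step is more convincing when it requires several consecutive successes rather than a single 200. This is a sketch under assumptions: `verify_health` is a hypothetical helper, and the streak length and retry counts are yours to choose.

```bash
# Hypothetical post-rollback check: require N consecutive HTTP 200s.
# Args: $1 = health URL, $2 = required consecutive successes,
#       $3 = seconds between attempts, $4 = maximum attempts
verify_health() {
  url="$1"; needed="$2"; pause="$3"; attempts="$4"
  streak=0
  code="000"
  while [ "$attempts" -gt 0 ]; do
    code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$url") || code="000"
    if [ "$code" = "200" ]; then
      streak=$((streak + 1))
    else
      streak=0                      # any failure resets the streak
    fi
    if [ "$streak" -ge "$needed" ]; then
      echo "HEALTHY: $streak consecutive 200s"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$pause"
  done
  echo "UNHEALTHY: last status $code"
  return 1
}

# Usage: verify_health "https://your-app.replit.app/health" 3 5 12
```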
Communication Templates
Internal (Slack)
P[1-4] INCIDENT: [App Name] on Replit
Status: INVESTIGATING / IDENTIFIED / MONITORING / RESOLVED
Impact: [What users are experiencing]
Cause: [If known]
Action: [What we're doing]
ETA: [When we expect resolution]
Next update: [Time]
External (Status Page)
[App Name] Service Disruption
We are experiencing issues with [specific feature/service].
[Describe user impact].
We have identified the cause and are working on a fix.
Estimated resolution: [time].
Last updated: [timestamp]
Post-Incident
Evidence Collection
#!/usr/bin/env bash
set -euo pipefail
# Capture deployment logs manually:
# Deployment Settings > Logs > copy the relevant entries
# Start a timeline skeleton; replace each [time] with the actual timestamp
cat > incident-report.md <<'EOF'
Timeline of events:
- [time] Issue detected
- [time] Investigation started
- [time] Root cause identified
- [time] Fix deployed / rollback executed
- [time] Service restored
EOF
Postmortem Template
## Incident: [Title]
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P[1-4]
### Summary
[1-2 sentence description of what happened]
### Root Cause
[Technical explanation]
### Timeline
- HH:MM — First alert
- HH:MM — Investigation started
- HH:MM — Root cause found
- HH:MM — Fix deployed / rollback
- HH:MM — Service restored
### Impact
- Users affected: [N]
- Downtime: [duration]
### Action Items
- [ ] [Prevention measure] — Owner — Due date
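The Duration field can be computed from the first and last timeline entries rather than estimated. This is a small sketch: `incident_duration` is a hypothetical helper that assumes both times are two-digit HH:MM values on the same day.

```bash
# Hypothetical helper: format the gap between two same-day HH:MM times.
# Args: $1 = start (HH:MM), $2 = end (HH:MM). Assumes end is not before start.
incident_duration() {
  start_h=${1%%:*}; start_m=${1##*:}
  end_h=${2%%:*};   end_m=${2##*:}
  # Strip one leading zero so values like "08" are not parsed as octal
  start_total=$(( ${start_h#0} * 60 + ${start_m#0} ))
  end_total=$(( ${end_h#0} * 60 + ${end_m#0} ))
  minutes=$(( end_total - start_total ))
  echo "$(( minutes / 60 )) hours $(( minutes % 60 )) minutes"
}

# Usage: incident_duration 09:05 11:47
```

Incidents that cross midnight need full timestamps instead; this helper does not handle them.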
Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
| Can't access Workspace | Replit outage | Use status.replit.com, wait |
| Rollback not available | No previous deployments | Fix forward, deploy fix |
| Logs too short | Container restarted | Set up external log aggregator |
| DB rollback needed | Bad migration | Restore from Replit DB snapshot |
Next Steps
For data handling patterns, see replit-data-handling.