Agent Skills: Scalability Advisor

Guidance for scaling systems from startup to enterprise scale. Use when planning for growth, diagnosing bottlenecks, or designing systems that need to handle 10x-1000x current load.

ID: alirezarezvani/claude-cto-team/scalability-advisor

skills/scalability-advisor/SKILL.md

Skill Metadata

Name
scalability-advisor
Description
Guidance for scaling systems from startup to enterprise scale. Use when planning for growth, diagnosing bottlenecks, or designing systems that need to handle 10x-1000x current load.

Scalability Advisor

Provides systematic guidance for scaling systems at different growth stages, identifying bottlenecks, and designing for horizontal scalability.

When to Use

  • Planning for 10x, 100x, or 1000x growth
  • Diagnosing current performance bottlenecks
  • Designing new systems for scale
  • Evaluating scaling strategies (vertical vs. horizontal)
  • Capacity planning and infrastructure sizing

Scaling Stages Framework

Stage Overview

┌─────────────────────────────────────────────────────────────────────┐
│                    SCALING JOURNEY                                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Stage 1        Stage 2         Stage 3         Stage 4             │
│  Startup        Growth          Scale           Enterprise          │
│  0-10K users    10K-100K        100K-1M         1M+ users           │
│                                                                     │
│  Single         Add caching,    Horizontal      Global,             │
│  server         read replicas   scaling         multi-region        │
│                                                                     │
│  $100/mo        $1K/mo          $10K/mo         $100K+/mo           │
└─────────────────────────────────────────────────────────────────────┘

Stage 1: Startup (0-10K Users)

Architecture

┌────────────────────────────────────────┐
│           Single Server                │
│  ┌──────────────────────────────────┐  │
│  │  App Server (Node/Python/etc)    │  │
│  │  + Database (PostgreSQL)         │  │
│  │  + File Storage (local/S3)       │  │
│  └──────────────────────────────────┘  │
└────────────────────────────────────────┘

Key Metrics

| Metric | Target | Warning |
|--------|--------|---------|
| Response time (P95) | < 500ms | > 1s |
| Database queries/request | < 10 | > 20 |
| Server CPU | < 70% | > 85% |
| Database connections | < 50% pool | > 80% pool |

What to Focus On

DO:

  • Write clean, maintainable code
  • Use database indexes on frequently queried columns
  • Implement basic monitoring (uptime, errors)
  • Keep architecture simple (monolith is fine)

DON'T:

  • Over-engineer for scale you don't have
  • Add caching before you need it
  • Split into microservices prematurely
  • Worry about multi-region yet

When to Move to Stage 2

  • Database CPU consistently > 70%
  • Response times degrading
  • Single queries taking > 100ms
  • Server resources maxed

Stage 2: Growth (10K-100K Users)

Architecture

┌─────────────────────────────────────────────────────────────┐
│                                                             │
│    ┌─────────┐      ┌─────────────────────────────────┐     │
│    │   CDN   │      │      Load Balancer              │     │
│    └────┬────┘      └──────────────┬──────────────────┘     │
│         │                          │                        │
│         │           ┌──────────────┼──────────────┐         │
│         │           │              │              │         │
│         ▼           ▼              ▼              ▼         │
│    ┌─────────┐ ┌─────────┐   ┌─────────┐   ┌─────────┐      │
│    │ Static  │ │ App 1   │   │ App 2   │   │ App 3   │      │
│    │ Assets  │ └────┬────┘   └────┬────┘   └────┬────┘      │
│    └─────────┘      │             │             │           │
│                     └──────────────┼────────────┘           │
│                                    │                        │
│                     ┌──────────────┼──────────────┐         │
│                     │              │              │         │
│                     ▼              ▼              ▼         │
│               ┌─────────┐   ┌─────────┐   ┌─────────┐       │
│               │ Primary │   │  Read   │   │  Redis  │       │
│               │   DB    │───│ Replica │   │  Cache  │       │
│               └─────────┘   └─────────┘   └─────────┘       │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Key Additions

| Component | Purpose | When to Add |
|-----------|---------|-------------|
| CDN | Static asset caching | Images, JS, CSS taking > 20% bandwidth |
| Load Balancer | Distribute traffic | Single server CPU > 70% |
| Read Replicas | Offload reads | > 80% database ops are reads |
| Redis Cache | Application caching | Same queries repeated frequently |
| Job Queue | Async processing | Background tasks blocking requests |

Caching Strategy

Request Flow with Caching:

1. Check CDN (static assets)          ─► HIT: return cached response
          │ MISS
          ▼
2. Check application cache (Redis)    ─► HIT: return cached response
          │ MISS
          ▼
3. Query the database                 ─► return result and cache it

What to Cache:

  • Session data (TTL: session duration)
  • User profile data (TTL: 5-15 minutes)
  • API responses (TTL: varies by freshness needs)
  • Database query results (TTL: 1-5 minutes)
  • Computed values (TTL: based on computation cost)
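A minimal cache-aside sketch of the flow above in Python, assuming a Redis instance reachable via redis-py; the key format, TTL, and the fetch_from_db callable are illustrative placeholders:

import json
import redis

cache = redis.Redis(host="localhost", port=6379, db=0)  # assumed connection details

USER_PROFILE_TTL = 600  # seconds; "5-15 minutes" for profile data, per the list above

def get_user_profile(user_id, fetch_from_db):
    """Cache-aside read: try Redis first, fall back to the database on a miss."""
    key = f"user:profile:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                       # cache hit
    profile = fetch_from_db(user_id)                    # cache miss: query the database
    cache.setex(key, USER_PROFILE_TTL, json.dumps(profile))
    return profile

def invalidate_user_profile(user_id):
    """Call after writes so readers don't serve stale data for a full TTL."""
    cache.delete(f"user:profile:{user_id}")

The same pattern applies to API responses and computed values; only the key scheme and TTL change.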

Database Optimization

-- Find slow queries (requires the pg_stat_statements extension;
-- on PostgreSQL 12 and earlier use mean_time / total_time instead)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

-- Find tables that are sequentially scanned far more than indexed
-- (likely missing indexes)
SELECT schemaname, relname, seq_scan, seq_tup_read, idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > 1000 AND seq_scan > COALESCE(idx_scan, 0)
ORDER BY seq_tup_read DESC;

When to Move to Stage 3

  • Write traffic overwhelming single primary
  • Cache hit rate plateauing despite optimization
  • Replication lag growing because read replicas can't keep up with write volume
  • Need independent scaling of components

Stage 3: Scale (100K-1M Users)

Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                           CDN / Edge                                 │
└──────────────────────────────────────────────────────────────────────┘
                                    │
┌──────────────────────────────────────────────────────────────────────┐
│                        API Gateway                                   │
│              (Rate limiting, Auth, Routing)                          │
└──────────────────────────────────────────────────────────────────────┘
                                    │
        ┌───────────────────────────┼───────────────────────────┐
        │                           │                           │
        ▼                           ▼                           ▼
┌───────────────┐          ┌───────────────┐          ┌───────────────┐
│   Service A   │          │   Service B   │          │   Service C   │
│   (Users)     │          │   (Orders)    │          │   (Search)    │
│   Auto-scale  │          │   Auto-scale  │          │   Auto-scale  │
└───────┬───────┘          └───────┬───────┘          └───────┬───────┘
        │                          │                          │
        ▼                          ▼                          ▼
┌───────────────┐          ┌───────────────┐          ┌───────────────┐
│   User DB     │          │   Order DB    │          │ Elasticsearch │
│   (Sharded)   │          │   (Sharded)   │          │   (Cluster)   │
└───────────────┘          └───────────────┘          └───────────────┘
                                    │
                                    ▼
                    ┌───────────────────────────┐
                    │     Message Queue         │
                    │     (Kafka / SQS)         │
                    └───────────────────────────┘

Key Patterns

Database Sharding

Sharding Strategies:

1. Hash-based (user_id % num_shards)
   PRO: Even distribution
   CON: Hard to add shards

2. Range-based (user_id 1-1M → shard 1)
   PRO: Easy to add shards
   CON: Hotspots possible

3. Directory-based (lookup table)
   PRO: Flexible
   CON: Lookup overhead
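
A minimal sketch of hash-based routing (strategy 1 above) in Python. The shard count and DSNs are placeholders, and a stable hash is used instead of Python's built-in hash(), which is salted per process and would route the same key differently on different servers:

import hashlib

NUM_SHARDS = 4                        # example value; changing it remaps most keys
SHARD_DSNS = [                        # placeholder connection strings
    "postgres://shard0.internal/app",
    "postgres://shard1.internal/app",
    "postgres://shard2.internal/app",
    "postgres://shard3.internal/app",
]

def shard_for(user_id: int) -> int:
    """Map a user_id to a shard index with a stable hash (the user_id % num_shards idea)."""
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def dsn_for(user_id: int) -> str:
    return SHARD_DSNS[shard_for(user_id)]

This also shows the listed CON: raising NUM_SHARDS remaps most keys, which is why consistent hashing or a directory scheme is preferred when the shard count changes often.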

Event-Driven Architecture

Synchronous → Asynchronous

Before:
  API → Service A → Service B → Service C → Response (slow)

After:
  API → Service A → Queue → Response (fast)
                      ↓
              Service B, C process async
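
A minimal sketch of the enqueue-and-return pattern in Python, using a Redis list as a stand-in for the queue (Kafka, SQS, or a task framework would fill the same role; the queue name and payload shape are illustrative):

import json
import redis

queue = redis.Redis(host="localhost", port=6379, db=0)
QUEUE_KEY = "orders:events"           # illustrative queue name

def handle_create_order(order):
    """API path: persist the order, enqueue follow-up work, respond immediately."""
    # ...write the order to the database here...
    event = {"type": "order_created", "order_id": order["id"]}
    queue.lpush(QUEUE_KEY, json.dumps(event))
    return {"status": "accepted", "order_id": order["id"]}

def worker_loop():
    """Services B and C: consume events off the request path."""
    while True:
        item = queue.brpop(QUEUE_KEY, timeout=5)      # blocking pop; (key, value) or None
        if item is None:
            continue
        event = json.loads(item[1])
        # ...send confirmation email, update analytics, index for search...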

Scaling Checklist

  • [ ] Stateless application servers (no local state)
  • [ ] Database read/write separation
  • [ ] Asynchronous processing for non-critical paths
  • [ ] Circuit breakers between services (see the sketch after this checklist)
  • [ ] Distributed tracing implemented
  • [ ] Auto-scaling configured with proper metrics
  • [ ] Database connection pooling (PgBouncer, ProxySQL)
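
A minimal circuit-breaker sketch in Python; the threshold and cool-down values are illustrative, and production systems usually rely on a maintained resilience library rather than hand-rolling this:

import time

class CircuitBreaker:
    """Trip after N consecutive failures, fail fast while open, retry after a cool-down."""

    def __init__(self, failure_threshold=5, reset_timeout_seconds=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout_seconds = reset_timeout_seconds
        self.failure_count = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_seconds:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None         # cool-down elapsed: allow a trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failure_count = 0
        return result

Wrap each outbound call to a downstream service, e.g. breaker.call(fetch_orders, user_id), so a struggling dependency fails fast instead of tying up request threads.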

When to Move to Stage 4

  • Need geographic distribution for latency
  • Regulatory requirements (data residency)
  • A single region can't provide sufficient failover protection
  • Global user base with latency requirements

Stage 4: Enterprise (1M+ Users)

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                          Global Load Balancer                           │
│                       (GeoDNS, Anycast, Route53)                        │
└─────────────────────────────────────────────────────────────────────────┘
                    │                                │
           ┌────────┴────────┐              ┌───────┴────────┐
           │                 │              │                │
           ▼                 ▼              ▼                ▼
    ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
    │  US-East     │  │  US-West     │  │  EU-West     │  │  AP-South    │
    │  Region      │  │  Region      │  │  Region      │  │  Region      │
    │  ┌────────┐  │  │  ┌────────┐  │  │  ┌────────┐  │  │  ┌────────┐  │
    │  │Services│  │  │  │Services│  │  │  │Services│  │  │  │Services│  │
    │  └────────┘  │  │  └────────┘  │  │  └────────┘  │  │  └────────┘  │
    │  ┌────────┐  │  │  ┌────────┐  │  │  ┌────────┐  │  │  ┌────────┐  │
    │  │Database│  │  │  │Database│  │  │  │Database│  │  │  │Database│  │
    │  │(Primary)│ │  │  │(Replica)│ │  │  │(Primary)│ │  │  │(Replica)│ │
    │  └────────┘  │  │  └────────┘  │  │  └────────┘  │  │  └────────┘  │
    └──────────────┘  └──────────────┘  └──────────────┘  └──────────────┘
                              │                   │
                              └─────────┬─────────┘
                                        │
                              Cross-Region Replication

Multi-Region Patterns

| Pattern | Consistency | Latency | Complexity |
|---------|-------------|---------|------------|
| Active-Passive | Strong | High during failover | Low |
| Active-Active | Eventual | Low | High |
| Follow-the-Sun | Strong per region | Medium | Medium |

Data Consistency Strategies

CAP Theorem Trade-offs:

Strong Consistency (CP):
- All regions see same data
- Higher latency for writes
- Use for: Financial transactions, inventory

Eventual Consistency (AP):
- Regions may have stale data briefly
- Low latency always
- Use for: Social feeds, analytics, non-critical

Causal Consistency:
- Related operations ordered correctly
- Balance of latency and correctness
- Use for: Messaging, collaboration

Enterprise Checklist

  • [ ] Multi-region deployment
  • [ ] Cross-region data replication
  • [ ] Global CDN with edge caching
  • [ ] Disaster recovery tested
  • [ ] Compliance (SOC 2, GDPR, data residency)
  • [ ] 99.99% SLA architecture
  • [ ] Zero-downtime deployments
  • [ ] Chaos engineering practice

Bottleneck Diagnosis Guide

Finding the Bottleneck

Systematic Diagnosis:

1. Where is time spent?
   └─► Distributed tracing (Jaeger, Datadog)

2. Is it the database?
   └─► Check slow query logs, connection pool

3. Is it the application?
   └─► CPU profiling, memory analysis

4. Is it the network?
   └─► Latency between services, DNS resolution

5. Is it external services?
   └─► Third-party API latency, rate limits

Common Bottlenecks by Layer

| Layer | Symptoms | Solutions |
|-------|----------|-----------|
| Database | Slow queries, high CPU | Indexing, read replicas, caching |
| Application | High CPU, memory | Optimize code, scale horizontally |
| Network | High latency, timeouts | CDN, edge caching, connection pooling |
| Storage | Slow I/O, high wait | SSD, object storage, caching |
| External APIs | Timeouts, rate limits | Circuit breakers, caching, fallbacks |

Database Bottleneck Checklist

## Quick Database Health Check

1. Connection Pool
   - Current connections vs max?
   - Connection wait time?
   - Pool exhaustion events?

2. Query Performance
   - Slowest queries (pg_stat_statements)?
   - Missing indexes (seq scans > 10K)?
   - Lock contention?

3. Replication
   - Replica lag?
   - Write throughput?
   - Read distribution?

4. Storage
   - Disk I/O wait?
   - Table/index bloat?
   - WAL write latency?
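
A small Python sketch covering items 1-3 of the checklist for PostgreSQL. It assumes psycopg2 and a DSN in a DATABASE_URL environment variable, and the slow-query check requires the pg_stat_statements extension:

import os
import psycopg2

conn = psycopg2.connect(os.environ["DATABASE_URL"])    # assumed environment variable

CHECKS = {
    # 1. Connection pool pressure: fraction of max_connections in use
    "connection_usage": """
        SELECT count(*)::float / current_setting('max_connections')::float
        FROM pg_stat_activity
    """,
    # 2. Query performance: slowest statements (PostgreSQL 13+ column names)
    "slowest_queries": """
        SELECT left(query, 60) AS query, calls, mean_exec_time
        FROM pg_stat_statements
        ORDER BY total_exec_time DESC
        LIMIT 5
    """,
    # 3. Replication: lag in bytes per replica (run against the primary)
    "replica_lag_bytes": """
        SELECT application_name,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
        FROM pg_stat_replication
    """,
}

with conn, conn.cursor() as cur:
    for name, sql in CHECKS.items():
        cur.execute(sql)
        print(name, cur.fetchall())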

Scaling Calculations

Capacity Planning Formula

Required Capacity = Peak Traffic × Growth Factor × Safety Margin

Example:
- Current peak: 1,000 req/sec
- Expected growth: 3x in 12 months
- Safety margin: 1.5x

Required: 1,000 × 3 × 1.5 = 4,500 req/sec capacity

Database Sizing

Connection Pool Size:
  connections = (num_cores × 2) + effective_spindle_count

  Example: 8 cores, SSD
  connections = (8 × 2) + 1 = 17 connections per instance

Read Replica Sizing:
  replicas = ceiling(read_traffic / single_replica_capacity)

  Example: 10,000 reads/sec, 3,000/replica capacity
  replicas = ceiling(10,000 / 3,000) = 4 replicas

Cache Sizing

Cache Size:
  memory = working_set_size × overhead_multiplier

  Working set = frequently accessed data (usually 10-20% of total)
  Overhead multiplier ≈ 1.5x for Redis data structures

  Example: 10GB working set
  Redis memory = 10GB × 1.5 = 15GB
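
The formulas in this section as a small Python helper; the numbers in the calls mirror the worked examples above, and the defaults are starting points rather than guarantees:

import math

def required_capacity(peak_rps, growth_factor, safety_margin=1.5):
    """Required capacity = peak traffic x growth factor x safety margin."""
    return peak_rps * growth_factor * safety_margin

def connection_pool_size(num_cores, effective_spindle_count=1):
    """connections = (num_cores x 2) + effective_spindle_count (SSD ≈ 1)."""
    return num_cores * 2 + effective_spindle_count

def read_replicas_needed(read_rps, per_replica_rps):
    """replicas = ceiling(read traffic / single-replica capacity)."""
    return math.ceil(read_rps / per_replica_rps)

def redis_memory_gb(working_set_gb, overhead_multiplier=1.5):
    """memory = working set x overhead multiplier (~1.5x for Redis structures)."""
    return working_set_gb * overhead_multiplier

print(required_capacity(1_000, 3))           # 4500.0 req/sec
print(connection_pool_size(8))               # 17 connections per instance
print(read_replicas_needed(10_000, 3_000))   # 4 replicas
print(redis_memory_gb(10))                   # 15.0 GB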

Quick Reference

Scaling Decision Matrix

| Symptom | First Try | Then Try | Finally |
|---------|-----------|----------|---------|
| Slow page loads | Add caching | CDN | Edge compute |
| Database slow | Add indexes | Read replicas | Sharding |
| API timeouts | Async processing | Circuit breakers | Event-driven |
| High server CPU | Vertical scale | Horizontal scale | Optimize code |
| High memory | Increase RAM | Fix memory leaks | Redesign data structures |

Infrastructure Cost at Scale

| Users | Architecture | Monthly Cost |
|-------|--------------|--------------|
| 10K | Single server | $100-300 |
| 100K | Load balanced + cache | $1,000-3,000 |
| 1M | Microservices + sharding | $10,000-30,000 |
| 10M | Multi-region | $100,000+ |

