Software Backend Engineering Skill

Use this skill to design, implement, and review production-grade backend services: API boundaries, data layer, auth, caching, observability, error handling, testing, and deployment.

Default biases: type-safe boundaries (validation at the edge), OpenTelemetry for observability, zero-trust assumptions, idempotency for retries, RFC 9457 errors, Postgres + pooling, structured logs, timeouts, and rate limiting.

Scaffolding rule: When scaffolding a new project, show full working implementations for all domain logic — fraud rules, audit logging, webhook handlers, validation pipelines, background jobs. Don't just reference file names or stub functions; show the actual code so the user can run it immediately.

Quick Reference

| Task | Default Picks | Notes |
|------|---------------|-------|
| REST API | Fastify / Express / NestJS | Prefer typed boundaries + explicit timeouts |
| Edge API | Hono / platform-native handlers | Keep work stateless, CPU-light |
| Type-Safe API | tRPC | Prefer for TS monorepos and internal APIs |
| GraphQL API | Apollo Server / Pothos | Prefer for complex client-driven queries |
| Database | PostgreSQL | Use pooling + migrations + query budgets |
| ORM / Query Layer | Prisma / Drizzle / SQLAlchemy / GORM / SeaORM / EF Core | Prefer explicit transactions |
| Authentication | OIDC/OAuth + sessions/JWT | Prefer httpOnly cookies for browsers |
| Validation | Zod / Pydantic / validator libs | Validate at the boundary, not deep inside |
| Caching | Redis (or managed) | Use TTLs + invalidation strategy |
| Background Jobs | BullMQ / platform queues | Make jobs idempotent + retry-safe |
| Testing | Unit + integration + contract/E2E | Keep most tests below the UI layer |
| Observability | Structured logs + OpenTelemetry | Correlation IDs end-to-end |

Scope

Use this skill to:

Design and implement REST/GraphQL/tRPC APIs
Model data schemas and run safe migrations
Implement authentication/authorization (OIDC/OAuth, sessions/JWT)
Add validation, error handling, rate limiting, caching, and background jobs
Ship production readiness (timeouts, observability, deploy/runbooks)

When NOT to Use This Skill

Use a different skill when:

Frontend-only concerns -> See software-frontend
Infrastructure provisioning (Terraform, K8s manifests) -> See ops-devops-platform
API design patterns only (no implementation) -> See dev-api-design
SQL query optimization and indexing -> See data-sql-optimization
Security audits and threat modeling -> See software-security-appsec
System architecture (beyond single service) -> See software-architecture-design

Technology Selection

Pick based on the strongest constraint, not feature lists:

| Constraint | Default Pick | Why |
|-----------|-------------|-----|
| Team knows TypeScript only | Fastify/Hono + Prisma/Drizzle | Ecosystem depth, hiring ease |
| Need <50ms P95, CPU-bound work | Go (net/http + sqlc/pgx) | Goroutines isolate CPU work; no event-loop risk |
| Data-heavy / ML integration | Python (FastAPI + SQLAlchemy) | Best ecosystem for numpy/pandas/ML pipelines |
| Memory-safety critical | Rust (Axum + SeaORM/SQLx) | Zero-cost abstractions, no GC |
| Enterprise/.NET team | C# (ASP.NET Core + EF Core) | Azure integration, mature tooling |
| Edge/serverless | Hono / platform-native handlers | Stateless, CPU-light, fast cold starts |
| Fintech/audit-sensitive | Go + sqlc (or raw SQL) | ORM magic is a liability; you need auditable SQL |

For detailed framework/ORM/auth/caching selection trees, see references/edge-deployment-guide.md and language-specific references. See assets/ for starter templates per language.

API Design Patterns (Dec 2025)

Idempotency Patterns

All mutating operations MUST support idempotency for retry safety.

Implementation:

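// Idempotency key header (assumes an ioredis-style `redis` client and a Fastify-style `reply`)
async function handleMutation(request, reply) {
  const idempotencyKey = request.headers['idempotency-key'];
  if (!idempotencyKey) {
    // Without this check, every keyless request would share the entry `idem:undefined`
    return reply.code(400).send({ title: 'Idempotency-Key header is required', status: 400 });
  }

  const cached = await redis.get(`idem:${idempotencyKey}`);
  if (cached) return JSON.parse(cached);

  const result = await processOperation();
  await redis.set(`idem:${idempotencyKey}`, JSON.stringify(result), 'EX', 86400); // 24h TTL
  return result;
}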

| Do | Avoid |
|----|-------|
| Store idempotency keys with TTL (24h typical) | Processing duplicate requests |
| Return cached response for duplicate keys | Different responses for same key |
| Use client-generated UUIDs | Server-generated keys |
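
A get-then-set flow leaves a window in which two concurrent retries with the same key both miss the cache and both run the operation. The following is a minimal sketch of one way to close that window, assuming a single process. `IdempotencyStore` is a hypothetical in-memory stand-in for Redis, not part of any library; with Redis, an atomic `SET ... NX` reservation would play the same role.

```typescript
// Hypothetical in-memory stand-in for Redis; class and method names are illustrative.
type Entry = { expiresAt: number; result: Promise<unknown> };

class IdempotencyStore {
  private entries = new Map<string, Entry>();
  private ttlMs: number;

  constructor(ttlMs: number = 24 * 60 * 60 * 1000) {
    this.ttlMs = ttlMs; // 24h default
  }

  // Runs `operation` at most once per key within the TTL.
  // Concurrent and later duplicates receive the first call's result.
  run<T>(key: string, operation: () => Promise<T>, now: number = Date.now()): Promise<T> {
    const existing = this.entries.get(key);
    if (existing && existing.expiresAt > now) {
      return existing.result as Promise<T>;
    }
    const result = operation();
    // Store the promise before it settles so in-flight duplicates share it.
    this.entries.set(key, { expiresAt: now + this.ttlMs, result });
    // Forget failed attempts so the client can retry with the same key.
    result.catch(() => {
      const current = this.entries.get(key);
      if (current && current.result === result) this.entries.delete(key);
    });
    return result;
  }
}
```

The `now` parameter exists only to make expiry testable. Across multiple instances the reservation has to live in a shared store, since an in-process map cannot see duplicates handled elsewhere.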

Pagination Patterns

| Pattern | Use When | Example |
|---------|----------|---------|
| Cursor-based | Large datasets, real-time data | ?cursor=abc123&limit=20 |
| Offset-based | Small datasets, random access | ?page=3&per_page=20 |
| Keyset | Sorted data, high performance | ?after_id=1000&limit=20 |

Prefer cursor-based pagination for APIs with frequent inserts.
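
To illustrate why cursors stay stable under inserts, here is a small sketch that pages over an in-memory array. The `Item` shape, the `createdAt:id` cursor encoding, and the function names are assumptions made for the example; a real service would push the same comparison into SQL, as noted in the comment.

```typescript
type Item = { id: number; createdAt: number };
type Page = { items: Item[]; nextCursor: string | null };

// Cursor = sort key of the last row returned (createdAt, then id as tie-breaker).
function encodeCursor(item: Item): string {
  return `${item.createdAt}:${item.id}`;
}

function decodeCursor(cursor: string): { createdAt: number; id: number } {
  const [createdAt, id] = cursor.split(":").map(Number);
  if (!Number.isFinite(createdAt) || !Number.isFinite(id)) {
    throw new Error("invalid cursor");
  }
  return { createdAt, id };
}

// Rough SQL equivalent:
//   WHERE (created_at, id) > ($1, $2) ORDER BY created_at, id LIMIT $3
function listPage(rows: Item[], limit: number, cursor?: string): Page {
  const size = Math.max(1, limit);
  const sorted = [...rows].sort((a, b) => a.createdAt - b.createdAt || a.id - b.id);
  const after = cursor ? decodeCursor(cursor) : null;
  const remaining = after
    ? sorted.filter(
        (r) =>
          r.createdAt > after.createdAt ||
          (r.createdAt === after.createdAt && r.id > after.id),
      )
    : sorted;
  const items = remaining.slice(0, size);
  const nextCursor =
    remaining.length > size ? encodeCursor(items[items.length - 1]) : null;
  return { items, nextCursor };
}
```

Because the cursor encodes a position in the sort order, not a row count, a row inserted before the current position does not shift or duplicate later pages, which is the failure mode of offset pagination.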

Error Response Standard (Problem Details)

Use a consistent machine-readable error format (RFC 9457 Problem Details): https://www.rfc-editor.org/rfc/rfc9457

{
  "type": "https://example.com/problems/invalid-request",
  "title": "Invalid request",
  "status": 400,
  "detail": "email is required",
  "instance": "/v1/users"
}

Health Check Patterns

// Liveness: Is the process running?
app.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'ok' });
});

// Readiness: Can the service handle traffic?
app.get('/health/ready', async (req, res) => {
  const dbOk = await checkDatabase();
  const cacheOk = await checkRedis();
  if (dbOk && cacheOk) {
    res.status(200).json({ status: 'ready', db: 'ok', cache: 'ok' });
  } else {
    res.status(503).json({ status: 'not ready', db: dbOk ? 'ok' : 'fail', cache: cacheOk ? 'ok' : 'fail' });
  }
});
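
A readiness handler that awaits each dependency check directly will surface a thrown error as an unhandled failure instead of a 503. A possible framework-agnostic sketch is shown here; `evaluateReadiness` and the `Check` type are hypothetical names, and a check that throws is treated as "not ready".

```typescript
type Check = () => Promise<boolean>;
type CheckStatus = "ok" | "fail";
type Readiness = {
  status: number;
  body: { status: "ready" | "not ready"; checks: Record<string, CheckStatus> };
};

// Runs all checks in parallel; a rejected or false check marks the service not ready.
async function evaluateReadiness(checks: Record<string, Check>): Promise<Readiness> {
  const names = Object.keys(checks);
  const settled = await Promise.allSettled(names.map((name) => checks[name]()));
  const results: Record<string, CheckStatus> = {};
  settled.forEach((outcome, i) => {
    results[names[i]] =
      outcome.status === "fulfilled" && outcome.value === true ? "ok" : "fail";
  });
  const ready = names.every((name) => results[name] === "ok");
  return {
    status: ready ? 200 : 503,
    body: { status: ready ? "ready" : "not ready", checks: results },
  };
}
```

A route handler can then forward `status` and `body` unchanged, for example `res.status(r.status).json(r.body)` in Express.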

Common Mistakes (Non-Obvious)

| Avoid | Instead | Why |
|-------|---------|-----|
| N+1 queries | include/select or DataLoader | 10-100x perf hit; easy to miss in ORM code |
| No request timeouts | Timeouts on HTTP clients, DB, handlers | Hung deps cascade; see Production Hardening below |
| Missing connection pooling | Prisma pool / PgBouncer / pgx pool | Exhaustion under load on shared DB tiers |
| Catching errors silently | Log + rethrow or handle explicitly | Hidden failures, impossible to debug |
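
The N+1 row is easiest to see by counting round trips. The sketch below uses a hypothetical `FakeDb` class with made-up data that increments a counter per "query"; it stands in for a real database client and only illustrates the batching idea behind `include` and DataLoader.

```typescript
type User = { id: number; name: string };
type Post = { id: number; authorId: number; title: string };

// Hypothetical in-memory data source that counts round trips.
class FakeDb {
  queries = 0;
  private users: User[] = [
    { id: 1, name: "Ada" },
    { id: 2, name: "Linus" },
  ];

  findUserById(id: number): User | undefined {
    this.queries += 1;
    return this.users.find((u) => u.id === id);
  }

  findUsersByIds(ids: number[]): User[] {
    this.queries += 1; // one round trip, e.g. WHERE id = ANY($1)
    return this.users.filter((u) => ids.includes(u.id));
  }
}

// N+1: one author lookup per post.
function authorsNPlusOne(db: FakeDb, posts: Post[]): Map<number, string> {
  const names = new Map<number, string>();
  for (const post of posts) {
    const user = db.findUserById(post.authorId);
    if (user) names.set(post.id, user.name);
  }
  return names;
}

// Batched: collect distinct author IDs, fetch once, join in memory.
function authorsBatched(db: FakeDb, posts: Post[]): Map<number, string> {
  const ids = Array.from(new Set(posts.map((p) => p.authorId)));
  const byId = new Map<number, User>();
  for (const user of db.findUsersByIds(ids)) byId.set(user.id, user);
  const names = new Map<number, string>();
  for (const post of posts) {
    const user = byId.get(post.authorId);
    if (user) names.set(post.id, user.name);
  }
  return names;
}
```

Both functions return the same mapping; the difference is that the first issues one lookup per post while the second issues one lookup in total, regardless of how many posts there are.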

Production Hardening: Patterns Models Skip

These are the patterns that separate "works in dev" from "survives production." Models tend to skip them unless explicitly prompted — add them to every service.

Request & Query Timeouts

Every outbound call needs a timeout. Without one, a hung dependency leaks connections and cascades failures.

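// HTTP client timeout
const response = await fetch(url, { signal: AbortSignal.timeout(5000) });

// Database query timeout (Prisma). A plain SET only affects whichever pooled connection
// runs it, so scope the limit to one transaction; set_config(..., true) acts like SET LOCAL.
const rows = await prisma.$transaction(async (tx) => {
  await tx.$queryRaw`SELECT set_config('statement_timeout', '3000', true)`; // milliseconds
  return tx.user.findMany({ select: { id: true, email: true } });
});

// Fastify server timeouts (factory options). requestTimeout bounds receiving the request;
// it does not cancel a slow handler, so long-running handler work still needs its own deadline.
const server = Fastify({ connectionTimeout: 30000, requestTimeout: 30000 });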

| Layer | Default Timeout | Rationale |
|-------|----------------|-----------|
| HTTP client calls | 5s | External APIs shouldn't block you |
| Database queries | 3s | Slow queries = missing index or bad plan |
| Request handler | 30s | Safety net for the whole request lifecycle |
| Background jobs | 5min | Jobs that run longer need chunking |
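
Where a dependency offers no timeout option of its own, a generic deadline wrapper can enforce the budget from the caller's side. This is a minimal sketch; `withTimeout` and `TimeoutError` are hypothetical helper names. Note the limitation stated in the comment: the wrapper stops waiting, but it cannot cancel the underlying work.

```typescript
class TimeoutError extends Error {
  constructor(label: string, ms: number) {
    super(`${label} timed out after ${ms}ms`);
    this.name = "TimeoutError";
  }
}

// Rejects with TimeoutError if `work` has not settled within `ms`.
// The caller stops waiting; the underlying work is NOT cancelled, so prefer a
// native timeout or AbortSignal on the dependency when one is available.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(label, ms)), ms);
  });
  // Clear the timer either way so it does not keep the process alive.
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}
```

A call site might look like `await withTimeout(loadReport(id), 3000, "report query")`, where `loadReport` is whatever promise-returning call needs the budget.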

Field-Level Selection (Don't `SELECT *`)

ORMs default to fetching all columns. On wide tables this wastes bandwidth and hides performance problems.

// BAD: fetches all 30 columns
const allColumns = await prisma.user.findMany({ include: { posts: true } });

// GOOD: fetch only what the endpoint needs.
// Prisma rejects `select` and `include` at the same level, so nest the relation inside `select`.
const users = await prisma.user.findMany({
  select: {
    id: true,
    name: true,
    email: true,
    posts: { select: { id: true, title: true } },
  },
});

For Go (sqlc): write explicit column lists in SQL queries; sqlc generates code from exactly the SQL you write, so the column list stays visible in review. For Python (SQLAlchemy): use load_only() or explicit column selection.

Structured Error Responses (RFC 9457)

Return machine-readable errors from day one. Clients shouldn't have to regex-parse error messages.

{
  "type": "https://api.example.com/problems/validation-error",
  "title": "Validation failed",
  "status": 422,
  "detail": "email must be a valid email address",
  "instance": "/v1/users",
  "errors": [{ "field": "email", "message": "invalid format" }]
}

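Set Content-Type: application/problem+json. The core members (type, title, status, detail, instance) are defined by RFC 9457; errors is an extension member, which the RFC permits, so generic clients can still parse the response.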
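
One way to keep every error response in this shape is to build it in a single helper. The sketch below is framework-agnostic; `validationProblem`, `toHttpResponse`, and the `FieldError` type are hypothetical names, and the `type` URL is the placeholder from the example above.

```typescript
type FieldError = { field: string; message: string };

type ProblemDetails = {
  type: string;
  title: string;
  status: number;
  detail?: string;
  instance?: string;
  errors?: FieldError[]; // extension member, not defined by RFC 9457 itself
};

const PROBLEM_CONTENT_TYPE = "application/problem+json";

// Builds a 422 problem document from per-field validation failures.
function validationProblem(instance: string, errors: FieldError[]): ProblemDetails {
  return {
    type: "https://api.example.com/problems/validation-error",
    title: "Validation failed",
    status: 422,
    detail: errors.map((e) => `${e.field}: ${e.message}`).join("; "),
    instance,
    errors,
  };
}

// Framework-agnostic response shape; adapt to your framework's reply API.
function toHttpResponse(problem: ProblemDetails) {
  return {
    status: problem.status,
    headers: { "content-type": PROBLEM_CONTENT_TYPE },
    body: JSON.stringify(problem),
  };
}
```

Keeping the HTTP status code and the `status` member derived from the same value avoids the common drift where the body says 422 and the response line says 400.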

Query Plan Verification

Before shipping any new query to production, verify its execution plan:

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT ... FROM ... WHERE ...;

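Red flags in the output: Seq Scan on large tables, Nested Loop with high row estimates, Sort without index. Add indexes or rewrite the query before deploying. Note that ANALYZE actually executes the statement, so run INSERT/UPDATE/DELETE plans inside a transaction and roll it back.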

Performance Debugging Workflow

When a service is slow, work through these layers in order. Fix the cheapest layer first — don't add caching before fixing N+1 queries.

| Step | What to Check | Fix |
|------|--------------|-----|
| 1. Query analysis | Enable query logging, find N+1s and slow queries | Rewrite with include/joins, add select for field-level optimization |
| 2. Indexing | Run EXPLAIN ANALYZE on slow queries | Add composite indexes matching WHERE + ORDER BY patterns |
| 3. Connection pooling | Check connection count vs. pool size | Configure pool limits (Prisma connection_limit, PgBouncer, pgx pool) |
| 4. Caching | Identify read-heavy, rarely-changing data | Add Redis/in-memory cache with TTL + invalidation strategy |
| 5. Timeouts | Check for missing timeouts on DB, HTTP, handlers | Add timeouts at every layer (see Production Hardening above) |
| 6. Platform tuning | Shared DB limits, cold starts, memory | Upgrade tier, add read replicas, tune runtime settings |

Key principle: always measure before and after. Use structured logging with request IDs to trace specific slow requests end-to-end.
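
As a rough illustration of measuring with request IDs, the helper below times one step of a request and emits a single JSON log line. `timed`, the `LogEntry` fields, and the 500ms slow threshold are assumptions for this sketch; in a real service the line would go through the structured logger already in use.

```typescript
type LogEntry = {
  level: "info" | "warn";
  requestId: string;
  step: string;
  durationMs: number;
};

// Times one step of a request and emits one JSON log line tagged with the request ID.
// `sink` and `clock` are injectable so the helper can be tested deterministically.
async function timed<T>(
  requestId: string,
  step: string,
  work: () => Promise<T>,
  slowMs: number = 500,
  sink: (line: string) => void = console.log,
  clock: () => number = Date.now,
): Promise<T> {
  const start = clock();
  try {
    return await work();
  } finally {
    // Runs on success and failure, so failed steps are measured too.
    const durationMs = clock() - start;
    const entry: LogEntry = {
      level: durationMs > slowMs ? "warn" : "info",
      requestId,
      step,
      durationMs,
    };
    sink(JSON.stringify(entry));
  }
}
```

Wrapping the same step before and after a change, with the same request ID scheme, gives directly comparable durations for a specific slow request.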

Infrastructure Economics

Backend architecture decisions directly impact cost and revenue. See references/infrastructure-economics.md for detailed cost modeling, SLA-to-revenue mapping, unit economics checklists, and FinOps practices.

Navigation

Resources

references/backend-best-practices.md - Template authoring guide, quality checklist, and shared utilities pointers
references/edge-deployment-guide.md - Edge computing patterns, Cloudflare Workers vs Vercel Edge, tRPC, Hono, Bun
references/infrastructure-economics.md - Cost modeling, performance SLAs -> revenue, FinOps practices, cloud optimization
references/go-best-practices.md - Go idioms, concurrency, error handling, GORM usage, testing, profiling
references/rust-best-practices.md - Ownership, async, Axum, SeaORM, error handling, testing
references/python-best-practices.md - FastAPI, SQLAlchemy, async patterns, validation, testing, performance
references/nodejs-best-practices.md - Event loop, async patterns, Express/Fastify/NestJS/Hono, error handling, memory management, security, profiling
references/csharp-best-practices.md - C# 14 / .NET 10 LTS, extension members, field keyword, ASP.NET Core 10 (validation, SSE, OpenAPI 3.1), EF Core 10 (LeftJoin, named filters), HybridCache, Polly v8 resilience
references/database-patterns.md - PostgreSQL patterns (JSONB, CTEs, partitioning), connection pooling, migration strategies, ORM comparison, index design
references/message-queues-background-jobs.md - BullMQ patterns, broker comparison (Redis/SQS/Kafka/RabbitMQ), idempotent jobs, DLQ, scheduling, delivery guarantees
data/sources.json - External references per language/runtime
Shared checklists: ../software-clean-code-standard/assets/checklists/backend-api-review-checklist.md, ../software-clean-code-standard/assets/checklists/secure-code-review-checklist.md

Shared Utilities (Centralized patterns - extract, don't duplicate)

../software-clean-code-standard/utilities/auth-utilities.md - Argon2id, jose JWT, OAuth 2.1/PKCE
../software-clean-code-standard/utilities/error-handling.md - Effect Result types, correlation IDs
../software-clean-code-standard/utilities/config-validation.md - Zod 3.24+, Valibot, secrets management
../software-clean-code-standard/utilities/resilience-utilities.md - p-retry v6, opossum v8, OTel spans
../software-clean-code-standard/utilities/logging-utilities.md - pino v9 + OpenTelemetry integration
../software-clean-code-standard/utilities/testing-utilities.md - Vitest, MSW v2, factories, fixtures
../software-clean-code-standard/utilities/observability-utilities.md - OpenTelemetry SDK, tracing, metrics
../software-clean-code-standard/references/clean-code-standard.md - Canonical clean code rules (CC-*) for citation

Templates

assets/nodejs/template-nodejs-prisma-postgres.md - Node.js + Prisma + PostgreSQL
assets/go/template-go-fiber-gorm.md - Go + Fiber + GORM + PostgreSQL
assets/rust/template-rust-axum-seaorm.md - Rust + Axum + SeaORM + PostgreSQL
assets/python/template-python-fastapi-sqlalchemy.md - Python + FastAPI + SQLAlchemy + PostgreSQL
assets/csharp/template-csharp-aspnet-efcore.md - C# + ASP.NET Core + Entity Framework Core + PostgreSQL

Related Skills

../software-architecture-design/SKILL.md - System decomposition, SLAs, and data flows
../software-security-appsec/SKILL.md - Authentication/authorization and secure API design
../ops-devops-platform/SKILL.md - CI/CD, infrastructure, and deployment safety
../qa-resilience/SKILL.md - Resilience, retries, and failure playbooks
../software-code-review/SKILL.md - Review checklists and standards for backend changes
../qa-testing-strategy/SKILL.md - Testing strategies, test pyramids, and coverage goals
../dev-api-design/SKILL.md - RESTful design, GraphQL, and API versioning patterns
../data-sql-optimization/SKILL.md - SQL optimization, indexing, and query tuning patterns

Freshness Protocol

When users ask version-sensitive recommendation questions, do a quick freshness check before asserting "best" choices or quoting versions.

Trigger Conditions

"What's the best backend framework for [use case]?"
"What should I use for [API design/auth/database]?"
"What's the latest in Node.js/Go/Rust?"
"Current best practices for [REST/GraphQL/tRPC]?"
"Is [framework/runtime] still relevant in 2026?"
"[Express] vs [Fastify] vs [Hono]?"
"Best ORM for [database/use case]?"

How to Freshness-Check

Start from data/sources.json (official docs, release notes, support policies).
Run a targeted web search for the specific component and open release notes/support policy pages.
Prefer official sources over blogs for versions and support windows.

What to Report

Current landscape: what is stable and widely used now
Emerging trends: what is gaining traction (and why)
Deprecated/declining: what is falling out of favor (and why)
Recommendation: default choice + 1-2 alternatives, with trade-offs

Example Topics (verify with fresh search)

Node.js LTS support window and major changes
Bun vs Deno vs Node.js
Hono, Elysia, and edge-first frameworks
Drizzle vs Prisma for TypeScript
tRPC and end-to-end type safety
Edge computing and serverless patterns
.NET 10 LTS (Nov 2025) and C# 14 adoption
ASP.NET Core 10 built-in validation vs FluentValidation
EF Core 10 vs Dapper for C# data access
HybridCache vs manual IMemoryCache + IDistributedCache

Operational Playbooks

references/operational-playbook.md - Full backend architecture patterns, checklists, TypeScript notes, and decision tables

Fact-Checking

Use web search/web fetch to verify current external facts, versions, pricing, deadlines, regulations, or platform behavior before final answers.
Prefer primary sources; report source links and dates for volatile information.
If web access is unavailable, state the limitation and mark guidance as unverified.
