Agent Skills: Database Architect Skill

Database design and optimization specialist. Schema design, query optimization, indexing strategies, data modeling, and migration planning for relational and NoSQL databases.

ID: oimiragieo/agent-studio/database-architect

Install this agent skill to your local project:

pnpm dlx add-skill https://github.com/oimiragieo/agent-studio/tree/HEAD/.claude/skills/database-architect

Skill Files

Browse the full folder contents for database-architect.


.claude/skills/database-architect/SKILL.md

Skill Metadata

Name
database-architect
Description
Database design and optimization specialist. Schema design, query optimization, indexing strategies, data modeling, and migration planning for relational and NoSQL databases.

Database Architect Skill

<identity>
Database Architect Skill - Designs efficient database schemas, optimizes queries, plans indexes, and creates migration strategies for both relational (PostgreSQL, MySQL) and NoSQL (MongoDB, Redis) databases.
</identity>

<capabilities>

  • Designing normalized and denormalized schemas
  • Query optimization and execution plan analysis
  • Index strategy planning
  • Data modeling (ER diagrams, relationships)
  • Migration planning and versioning
  • Performance troubleshooting

</capabilities>

<instructions>
<execution_process>

Step 1: Understand Data Requirements

Gather requirements:

  1. Entities: What data needs to be stored?
  2. Relationships: How do entities relate (1:1, 1:N, N:M)?
  3. Access Patterns: How will data be queried?
  4. Volume: Expected data size and growth rate
  5. Consistency: ACID requirements vs eventual consistency

Step 2: Design Schema

For Relational Databases:

  1. Normalize: Start with 3NF to reduce redundancy
  2. Define Primary Keys: Use surrogate keys (UUID/SERIAL) or natural keys
  3. Define Foreign Keys: Establish referential integrity
  4. Consider Denormalization: Only for proven performance needs
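As a compact sketch of these steps (hypothetical tables; a repeating phone-number group is split out to satisfy 3NF):

-- Sketch only: surrogate primary key, foreign key, and a 3NF split
CREATE TABLE customers (
    id BIGSERIAL PRIMARY KEY,          -- surrogate key
    email TEXT UNIQUE NOT NULL         -- natural candidate key, kept unique
);

CREATE TABLE customer_phones (
    customer_id BIGINT NOT NULL REFERENCES customers(id),  -- referential integrity
    phone TEXT NOT NULL,
    PRIMARY KEY (customer_id, phone)   -- repeating group moved to its own table
);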

For NoSQL Databases:

  1. Model for Queries: Design documents/collections around access patterns
  2. Embed vs Reference: Embed for 1:1/1:few, reference for 1:many
  3. Shard Key Selection: Choose keys that distribute evenly

Step 3: Plan Indexes

Index strategy based on query patterns:

-- Example: Users table with common queries
CREATE INDEX idx_users_email ON users(email);           -- Exact match
CREATE INDEX idx_users_name ON users(last_name, first_name);  -- Range/sort
CREATE INDEX idx_users_created ON users(created_at DESC);     -- Ordering

Index Guidelines:

  • Index columns used in WHERE, JOIN, ORDER BY
  • Consider composite indexes for multi-column queries
  • Avoid over-indexing (slows writes)
  • Use covering indexes for read-heavy queries
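For the last point, a hedged sketch of a covering index using PostgreSQL's INCLUDE clause (the users columns and index name here are illustrative assumptions):

-- Extra columns stored in the index leaf pages make it "covering" for this query
CREATE INDEX idx_users_email_inc ON users (email) INCLUDE (first_name, last_name);

-- Candidate for an index-only scan: every selected column lives in the index
SELECT first_name, last_name FROM users WHERE email = 'a@example.com';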

Step 4: Plan Migrations

Create versioned migrations:

migrations/
  001_create_users.sql
  002_add_email_index.sql
  003_create_orders.sql

Migration Best Practices:

  • Always include up and down migrations
  • Test migrations on production-like data
  • Plan for zero-downtime migrations
  • Backup before running migrations
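A sketch of what 002_add_email_index.sql might contain; the up/down marker comments are tool-dependent conventions, not PostgreSQL syntax:

-- 002_add_email_index.sql (marker style varies by migration tool)
-- +up
-- CONCURRENTLY avoids blocking writes, but cannot run inside a transaction block
CREATE INDEX CONCURRENTLY idx_users_email ON users (email);

-- +down
DROP INDEX CONCURRENTLY IF EXISTS idx_users_email;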

Step 5: Optimize Queries

Analyze and improve slow queries:

  1. Use EXPLAIN ANALYZE: Understand execution plans
  2. Identify Table Scans: Replace with index scans
  3. Optimize JOINs: Ensure indexes on join columns
  4. Batch Operations: Use bulk inserts/updates
  5. Connection Pooling: Reduce connection overhead
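A minimal workflow sketch, reusing the users and orders tables from the example later in this skill:

-- Look for Seq Scan nodes on large tables in the plan output
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, o.total_amount
FROM orders o
JOIN users u ON u.id = o.user_id
WHERE u.email = 'a@example.com';

-- If the join column is scanned sequentially, index it and re-run EXPLAIN
CREATE INDEX IF NOT EXISTS idx_orders_user ON orders (user_id);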

Step 6: PostgreSQL 17 Features (2024–2026)

Leverage PostgreSQL 17 capabilities where applicable:

Performance improvements:

  • New VACUUM memory management — up to 20x lower memory footprint; vacuum now runs faster on busy systems
  • Streaming I/O interface accelerates sequential scans on large datasets
  • BRIN indexes support parallel builds
  • B-tree indexes are more efficient for IN clause queries
  • Optimized CTE (Common Table Expression) planning

SQL/JSON enhancements (PG 17):

  • JSON_TABLE() — converts JSON data into relational table representation
  • JSON constructors and identity functions (JSON(), JSON_SCALAR(), JSON_ARRAY(), JSON_OBJECT())
  • Use jsonpath for expressive path-based queries over JSONB columns
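A short JSON_TABLE() sketch (PostgreSQL 17); the orders table and its payload shape are hypothetical:

-- Flatten a JSONB array of line items into relational rows
SELECT o.id, jt.sku, jt.qty
FROM orders o,
     JSON_TABLE(o.payload, '$.items[*]'
         COLUMNS (
             sku TEXT PATH '$.sku',
             qty INT  PATH '$.qty'
         )) AS jt;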

Incremental backups:

  • pg_basebackup supports incremental backups; pg_combinebackup reconstructs a full, restorable backup from a full backup plus its chain of incrementals
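A hedged command sketch (paths are placeholders; the server must have summarize_wal = on for incremental backups to work):

# Full base backup, then an incremental one referencing its manifest (PG 17+)
pg_basebackup -D /backups/full
pg_basebackup -D /backups/incr1 --incremental=/backups/full/backup_manifest

# Reconstruct a restorable full backup from the chain
pg_combinebackup /backups/full /backups/incr1 -o /backups/restored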

Logical replication improvements:

  • Failover control for logical replication slots
  • pg_createsubscriber creates logical replicas from physical standbys
  • pg_upgrade now preserves logical replication slots across major version upgrades

Security:

  • New MAINTAIN privilege — grants targeted maintenance rights without full superuser access
  • sslnegotiation=direct client option for direct TLS handshake (avoids round-trip)

COPY improvements:

  • COPY ... WITH (ON_ERROR ignore) continues an import past malformed rows instead of aborting the whole load
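A minimal sketch (table and file path are hypothetical):

-- Skip malformed rows during a CSV import instead of failing the whole COPY
COPY users (id, email)
FROM '/tmp/users.csv'
WITH (FORMAT csv, HEADER true, ON_ERROR ignore);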

Step 7: pgvector for AI Embeddings

Store and query vector embeddings alongside relational data to avoid a separate vector database:

-- Install extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Table with embedding column (1536 dims for OpenAI text-embedding-3-small)
CREATE TABLE documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    content TEXT NOT NULL,
    embedding vector(1536),
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- IVFFlat index for approximate nearest neighbor (ANN) search
-- Start with lists ≈ rows/1000; use sqrt(rows) above roughly 1M rows
CREATE INDEX idx_documents_embedding ON documents
    USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

-- HNSW index (faster queries, more memory; preferred for < 1M vectors)
CREATE INDEX idx_documents_embedding_hnsw ON documents
    USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);

-- Similarity search (cosine distance)
SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity
FROM documents
ORDER BY embedding <=> $1::vector
LIMIT 10;

When to use pgvector vs. dedicated vector DB:

  • Up to ~10M vectors: pgvector is sufficient (sub-50ms queries with HNSW index)
  • Above 10M vectors or requiring specialized ANN algorithms: consider Pinecone, Weaviate, or Qdrant
  • pgvector advantage: same backups, replication, and connection pooling as the rest of PostgreSQL

Step 8: Table Partitioning Strategies

Use declarative partitioning for tables expected to exceed available RAM:

-- Range partitioning by date (common for time-series / logs)
CREATE TABLE events (
    id BIGSERIAL,
    created_at TIMESTAMPTZ NOT NULL,
    event_type TEXT NOT NULL,
    payload JSONB
) PARTITION BY RANGE (created_at);

-- Monthly partitions
CREATE TABLE events_2025_01 PARTITION OF events
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE events_2025_02 PARTITION OF events
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');

-- Hash partitioning for even distribution (e.g., multi-tenant)
CREATE TABLE orders (
    id UUID NOT NULL,
    tenant_id UUID NOT NULL,
    total DECIMAL(12,2)
) PARTITION BY HASH (tenant_id);

CREATE TABLE orders_p0 PARTITION OF orders FOR VALUES WITH (modulus 4, remainder 0);
CREATE TABLE orders_p1 PARTITION OF orders FOR VALUES WITH (modulus 4, remainder 1);
CREATE TABLE orders_p2 PARTITION OF orders FOR VALUES WITH (modulus 4, remainder 2);
CREATE TABLE orders_p3 PARTITION OF orders FOR VALUES WITH (modulus 4, remainder 3);

Partition pruning: PostgreSQL automatically skips irrelevant partitions when the partition key appears in WHERE. Always include the partition key in queries.
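To verify pruning, check the plan for a query constrained on the partition key (dates hypothetical):

-- The plan should mention only events_2025_01, not every partition
EXPLAIN SELECT count(*)
FROM events
WHERE created_at >= '2025-01-10' AND created_at < '2025-01-20';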

Index on partitioned tables: Indexes created on the parent table are automatically created on all child partitions.

Step 9: JSONB Patterns at Scale

-- Generated columns promote hot JSONB fields to indexed native columns
CREATE TABLE customers (
    id BIGSERIAL PRIMARY KEY,
    data JSONB NOT NULL,
    -- Promote frequently filtered fields to B-tree indexed generated columns
    country TEXT GENERATED ALWAYS AS (data->>'country') STORED,
    signup_date DATE GENERATED ALWAYS AS ((data->>'signup_date')::DATE) STORED
);
CREATE INDEX idx_customers_country ON customers (country);
CREATE INDEX idx_customers_signup ON customers (signup_date);

-- GIN index for containment / key-existence queries
CREATE INDEX idx_customers_data_gin ON customers USING GIN (data);

-- Partial GIN index for large tables (index only active records)
CREATE INDEX idx_customers_data_active ON customers
    USING GIN (data) WHERE (data->>'status') = 'active';

-- jsonpath query (jsonpath operators are available since PG 12)
SELECT * FROM customers
WHERE data @? '$.tags[*] ? (@ == "premium")';

Step 10: Connection Pooling

Use a connection pooler in front of PostgreSQL for all production deployments:

PgBouncer (lightweight, battle-tested):

# pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction       ; transaction pooling for stateless apps
max_client_conn = 1000
default_pool_size = 25
min_pool_size = 5             ; keep a few server connections warm

Supavisor (cloud-native, multi-tenant, Elixir-based):

  • Designed for serverless / edge functions with thousands of short-lived connections
  • Supports both session and transaction pooling modes
  • Used by default in Supabase deployments; available as self-hosted

Pooling modes:

| Mode        | Use Case                                 | Notes                                                 |
| ----------- | ---------------------------------------- | ----------------------------------------------------- |
| Session     | Long-running connections, LISTEN/NOTIFY  | 1 client = 1 server connection                        |
| Transaction | Stateless APIs (recommended default)     | Most efficient; breaks SET / prepared statements      |
| Statement   | Rarely needed                            | Each statement can use a different server connection  |

</execution_process>

<best_practices>

  1. Normalize First: Optimize later based on data
  2. Index Thoughtfully: Based on actual query patterns
  3. Use Migrations: Never modify schema directly
  4. Monitor Performance: Use database profiling tools
  5. Plan for Scale: Consider partitioning for large tables
  6. Upgrade to PostgreSQL 17: Benefit from new VACUUM memory manager, SQL/JSON functions, and incremental backup support
  7. Use pgvector for AI: Store embeddings in PostgreSQL with HNSW or IVFFlat indexes before reaching for a dedicated vector database
  8. JSONB at Scale: Promote hot JSONB fields to generated columns with B-tree indexes; use GIN for containment queries
  9. Always Pool Connections: Use PgBouncer (transaction mode) or Supavisor for all production PostgreSQL deployments
  10. Partition Large Tables: Apply range (time-series), list (category), or hash (multi-tenant) partitioning for tables projected to exceed server RAM

</best_practices> </instructions>

<examples>
<usage_example>

Schema Design Request:
Design a schema for an e-commerce orders system

Example Response:

-- Users table
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email VARCHAR(255) UNIQUE NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Orders table
CREATE TABLE orders (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id UUID REFERENCES users(id),
    status VARCHAR(20) DEFAULT 'pending',
    total_amount DECIMAL(10,2),
    created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Order items (1:N relationship)
CREATE TABLE order_items (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    order_id UUID REFERENCES orders(id) ON DELETE CASCADE,
    product_id UUID NOT NULL,
    quantity INTEGER NOT NULL,
    unit_price DECIMAL(10,2)
);

-- Indexes for common queries
CREATE INDEX idx_orders_user ON orders(user_id);
CREATE INDEX idx_orders_status ON orders(status);
CREATE INDEX idx_order_items_order ON order_items(order_id);

</usage_example> </examples>

Rules

  • Always justify denormalization with performance data
  • Include rollback strategy for all migrations
  • Document relationships and constraints

Related Workflow

This skill has a corresponding workflow for complex multi-agent scenarios:

  • Workflow: .claude/workflows/database-architect-skill-workflow.md
  • When to use workflow: For comprehensive database design including requirements analysis, schema design, query optimization, migration planning, and testing (multi-phase, multi-agent)
  • When to use skill directly: For quick schema reviews or single-agent database tasks

Iron Laws

  1. NEVER make schema changes without versioned migrations that include both UP and DOWN scripts — manual DDL in production is not recoverable.
  2. ALWAYS normalize to at least 3NF before considering denormalization — never prematurely optimize without measured performance evidence.
  3. ALWAYS plan indexes based on actual query patterns from EXPLAIN ANALYZE — never add indexes speculatively before profiling real workloads.
  4. NEVER deploy a migration without first testing it against production-like data; schema issues surface under realistic volume, not on empty tables.
  5. ALWAYS use connection pooling (Supavisor or PgBouncer) in production — direct connections from serverless functions exhaust the database connection limit under load.

Anti-Patterns

| Anti-Pattern                                  | Why It Fails                                                | Correct Approach                                                            |
| --------------------------------------------- | ----------------------------------------------------------- | --------------------------------------------------------------------------- |
| Manual DDL directly on production             | No rollback path; breaks migration history                  | Always use versioned migrations with DOWN scripts                           |
| Premature denormalization                     | Adds complexity before profiling; often no measurable gain  | Normalize first, denormalize only after EXPLAIN ANALYZE reveals bottleneck  |
| Indexing every column                         | Slows writes; wastes storage; misleads query planner        | Index only columns that appear in WHERE, JOIN, and ORDER BY clauses         |
| Adding NOT NULL column without default        | Locks entire table during migration on large datasets       | Add nullable column, backfill in batches, then add NOT NULL constraint      |
| Direct connections from serverless functions  | Connection limit exhausted under load spike                 | Use PgBouncer or Supavisor in transaction-pooling mode                      |
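As a sketch of the NOT NULL row above (table, column, and batch size are hypothetical):

-- Step 1: add the column nullable; no rewrite, only a brief metadata lock
ALTER TABLE users ADD COLUMN locale TEXT;

-- Step 2: backfill in small batches (repeat until zero rows are updated)
UPDATE users SET locale = 'en'
WHERE id IN (SELECT id FROM users WHERE locale IS NULL LIMIT 10000);

-- Step 3: enforce the constraint once every row is populated
ALTER TABLE users ALTER COLUMN locale SET NOT NULL;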

Memory Protocol (MANDATORY)

Before starting:

cat .claude/context/memory/learnings.md

After completing:

  • New pattern -> .claude/context/memory/learnings.md
  • Issue found -> .claude/context/memory/issues.md
  • Decision made -> .claude/context/memory/decisions.md

ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.