Data Systems Architecture Skill

Data Systems Architecture

Overview

Core principle: Good data system architecture balances reliability (correct operation under faults), scalability (handling growth gracefully), and maintainability (enabling productive change over time). Every architectural decision involves trade-offs between these concerns.

This skill synthesizes knowledge from three foundational texts:

Designing Data-Intensive Applications (Kleppmann) - distributed systems, storage engines, scaling
The Art of PostgreSQL (Fontaine) - PostgreSQL-specific patterns, SQL as programming
PostgreSQL Query Optimization (Dombrovskaya et al.) - execution plans, performance tuning

When to Use

| Symptom | Start With | |---------|------------| | Designing a new database/schema | 01-foundational-principles.md | | Normalization vs denormalization decisions | 02-data-modeling.md | | Need to understand OLTP vs OLAP | 03-storage-engines.md | | Slow queries, index selection | 04-indexing.md | | Planning for growth, read replicas | 05-scaling-patterns.md | | Race conditions, deadlocks, isolation issues | 06-transactions-concurrency.md | | N+1 queries, ORM problems, application integration | 07-application-integration.md |

Navigation

Reference Files (Load as needed)

01-foundational-principles.md    - Reliability/Scalability/Maintainability, load parameters
02-data-modeling.md              - Normalization, denormalization, schema design patterns
03-storage-engines.md            - B-trees, LSM-trees, OLTP vs OLAP, PostgreSQL internals
04-indexing.md                   - Index types, compound indexes, covering indexes, maintenance
05-scaling-patterns.md           - Replication, partitioning, sharding strategies
06-transactions-concurrency.md   - ACID, isolation levels, MVCC, locking patterns
07-application-integration.md    - ORM pitfalls, N+1, business logic placement, batch processing

Quick Decision Framework

New system design?
├─ Yes → Read 01, then 02 for data model
└─ No → What's the problem?
         ├─ "Queries are slow" → Read 04 (indexing) + 03 (storage patterns)
         ├─ "Data is inconsistent" → Read 02 (modeling) + 06 (transactions)
         ├─ "Can't handle the load" → Read 05 (scaling) + 03 (OLTP vs OLAP)
         ├─ "App makes too many queries" → Read 07 (N+1, ORM patterns)
         └─ "Race conditions/deadlocks" → Read 06 (concurrency)

Core Concepts (Quick Reference)

The Three Pillars

| Concern | Definition | Key Question | |---------|------------|--------------| | Reliability | System works correctly under faults | What happens when things fail? | | Scalability | Handles growth gracefully | What's 10x load look like? | | Maintainability | Easy to operate and evolve | Can new engineers understand this? |

Data Model Selection

| Model | Best For | Avoid When | |-------|----------|------------| | Relational | Many-to-many relationships, joins, consistency | Highly hierarchical data, constant schema changes | | Document | Self-contained docs, tree structures | Need for joins, many-to-many | | Graph | Highly connected data, recursive queries | Simple CRUD, no relationship traversal |

OLTP vs OLAP

| Aspect | OLTP | OLAP | |--------|------|------| | Query pattern | Point lookups, few rows | Aggregates, many rows | | Optimization | Index everything used in WHERE | Fewer indexes, full scans OK | | Storage | Row-oriented | Consider column-oriented |

Index Type Quick Reference

| Type | Use Case | PostgreSQL | |------|----------|------------| | B-tree | Equality, range, sorting | Default, most queries | | Hash | Equality only | Faster for exact match | | GIN | Arrays, JSONB, full-text | @>, @@ operators | | GiST | Geometric, range types | PostGIS, nearest-neighbor | | BRIN | Large, naturally ordered tables | Time-series data |

Isolation Levels

| Level | Prevents | PostgreSQL Default? | |-------|----------|-------------------| | Read Committed | Dirty reads | Yes | | Repeatable Read | + Non-repeatable reads | No | | Serializable | All anomalies | No (uses SSI) |

Design Checklist

Before finalizing a data architecture:

[ ] Identified load parameters (read/write ratio, data volume, latency requirements)
[ ] Chose appropriate data model (relational/document/graph hybrid?)
[ ] Normalized to 3NF first, denormalized only with measured justification
[ ] Designed indexes for actual query patterns (not hypothetical)
[ ] Considered 10x growth scenario
[ ] Established isolation level requirements
[ ] Defined where business logic lives (app vs DB vs both)
[ ] Planned for operations (backups, monitoring, migrations)

References

Kleppmann, M. Designing Data-Intensive Applications (O'Reilly, 2017)
Fontaine, D. The Art of PostgreSQL (2nd ed., 2020)
Dombrovskaya, H., Novikov, B., Bailliekova, A. PostgreSQL Query Optimization (Apress, 2021)

Agent Skills: Data Systems Architecture

Install this agent skill to your local

Skill Files