Data Substrate Analysis
Analyzes a codebase's fundamental data units and its state management patterns: how data is typed, mutated, and serialized.
Process
- Locate type files: find types.py, schema.py, models.py, state.py
- Classify typing: strict (Pydantic), structural (TypedDict), loose (dict)
- Analyze mutation: in-place modification vs. copy-on-write
- Document serialization: model_dump_json()/model_dump() (json()/dict() in Pydantic V1), pickle, custom methods
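The first step can be sketched with the standard library alone. This is a minimal sketch, assuming the type files use exactly the filenames listed above; `locate_type_files` and `TYPE_FILE_NAMES` are hypothetical names introduced here for illustration.

```python
from pathlib import Path

# Filenames taken from the process step above; extend for other conventions.
TYPE_FILE_NAMES = {"types.py", "schema.py", "models.py", "state.py"}


def locate_type_files(root: str) -> list[Path]:
    """Return all files under `root` whose name matches a known type-file name."""
    return sorted(p for p in Path(root).rglob("*.py") if p.name in TYPE_FILE_NAMES)
```

A real codebase may keep its types elsewhere (for example in a `schemas/` package), so an empty result means "look further", not "no types".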
Typing Strategy Classification
Detection Patterns
| Strategy | Indicators | Files to Check |
|----------|-----------|----------------|
| Pydantic | BaseModel, Field(), validator (V1), field_validator (V2) | models.py, schema.py |
| Dataclass | @dataclass, field() | types.py, models.py |
| TypedDict | TypedDict, Required[], NotRequired[] | types.py |
| NamedTuple | typing.NamedTuple, collections.namedtuple | types.py |
| Loose | Dict[str, Any], plain dict | Throughout |
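The detection table can be turned into a rough text scan. The sketch below is deliberately naive: it counts regex hits per strategy in one file's source and does not parse the code, so comments and strings can produce false positives. `INDICATORS` and `classify_typing` are hypothetical names.

```python
import re

# Indicator patterns adapted from the detection table (a subset, for illustration).
INDICATORS = {
    "Pydantic": re.compile(r"\bBaseModel\b|\bField\("),
    "Dataclass": re.compile(r"@dataclass\b"),
    "TypedDict": re.compile(r"\bTypedDict\b"),
    "NamedTuple": re.compile(r"\bNamedTuple\b"),
    "Loose": re.compile(r"\bDict\[str,\s*Any\]|\bdict\[str,\s*Any\]"),
}


def classify_typing(source: str) -> dict[str, int]:
    """Count indicator hits per strategy; strategies with no hits are omitted."""
    counts = {name: len(rx.findall(source)) for name, rx in INDICATORS.items()}
    return {name: n for name, n in counts.items() if n}
```

The strategy with the most hits across the type files is a reasonable first guess for the primary approach, to be confirmed by reading the code.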
Analysis Questions
- Are boundaries validated (API ingress/egress)?
- Is nesting depth reasonable (<3 levels)?
- Are optional fields declared explicitly, or do they default to None implicitly?
- Is there a version migration path (e.g., Pydantic V1 → V2)?
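For the nesting-depth question, a sample payload can be measured directly. This is a minimal sketch that assumes the data is made of plain dicts, lists, and tuples; `nesting_depth` is a hypothetical helper, and the "<3 levels" threshold comes from the question above.

```python
def nesting_depth(value) -> int:
    """Depth of nested dict/list/tuple containers; a scalar has depth 0."""
    if isinstance(value, dict):
        return 1 + max((nesting_depth(v) for v in value.values()), default=0)
    if isinstance(value, (list, tuple)):
        return 1 + max((nesting_depth(v) for v in value), default=0)
    return 0


payload = {"a": {"b": [1, {"c": 2}]}}
depth = nesting_depth(payload)  # dict -> dict -> list -> dict
```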
Immutability Analysis
Mutable Patterns (Risk Indicators)
# In-place list modification
state.messages.append(msg)
state.history.extend(new_items)
# Direct dict mutation
state['key'] = value
state.update(new_data)
# Object attribute mutation
state.status = 'complete'
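Why these patterns count as risk indicators can be shown with a small aliasing example. This is an illustrative sketch; `AgentState` and `add_message` are hypothetical names, not taken from any particular framework.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:  # hypothetical mutable state object
    messages: list[str] = field(default_factory=list)


def add_message(state: AgentState, msg: str) -> AgentState:
    state.messages.append(msg)  # in-place: every holder of `state` sees the change
    return state


original = AgentState(messages=["hello"])
snapshot = original  # an alias, not a copy
updated = add_message(original, "world")
# `snapshot` was meant to preserve the earlier state, but it now contains "world" too.
```

Any code that kept a reference to the earlier state (a checkpoint, a retry handler, another thread) observes the mutation, which is what makes in-place updates hard to reason about.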
Immutable Patterns (Safer)
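# Pydantic V2 copy (V1 equivalent: state.copy(update=...))
new_state = state.model_copy(update={'key': value})
# Dataclass replace (requires: from dataclasses import replace)
new_state = replace(state, messages=[*state.messages, msg])
# Dict unpacking (spread-operator style); produces a shallow copy
new_state = {**state, 'key': value}
# Frozen dataclass (requires: from dataclasses import dataclass)
@dataclass(frozen=True)
class State: ...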
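The frozen-dataclass and `replace` patterns combine into a copy-on-write update. The sketch below uses only the standard library; `FrozenState` and `with_message` are hypothetical names. It stores messages in a tuple because `frozen=True` only blocks attribute assignment, so a list field could still be mutated in place.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class FrozenState:  # hypothetical immutable state object
    status: str = "pending"
    messages: tuple[str, ...] = ()


def with_message(state: FrozenState, msg: str) -> FrozenState:
    """Return a new state with `msg` appended; the input is left untouched."""
    return replace(state, messages=(*state.messages, msg))


s0 = FrozenState()
s1 = with_message(s0, "hi")

try:
    s0.status = "complete"  # attribute assignment is rejected on frozen instances
    mutation_blocked = False
except FrozenInstanceError:
    mutation_blocked = True
```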
Serialization Strategy
Common Patterns
| Method | Code Pattern | Trade-offs |
|--------|-------------|------------|
| Pydantic JSON | .model_dump_json() | Type-safe, automatic |
| Pydantic Dict | .model_dump() | Python dict for internal use; may hold non-JSON types (e.g. datetime) |
| Dataclass | asdict(obj) | Manual, no validation |
| Custom | to_dict(), from_dict() | Full control |
| Pickle | pickle.dumps() | Fast, fragile, security risk |
| JSON | json.dumps(obj, default=...) | Requires encoder |
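The "Requires encoder" trade-off in the last row can be illustrated with a fallback function passed as `default=`. This is a minimal sketch with a hypothetical `Event` dataclass; `json.dumps` calls `encode` only for objects it cannot serialize natively.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass
from datetime import datetime, timezone


@dataclass
class Event:  # hypothetical payload type
    name: str
    at: datetime


def encode(obj):
    """Fallback for json.dumps: convert dataclass instances and datetimes."""
    if is_dataclass(obj) and not isinstance(obj, type):
        return asdict(obj)
    if isinstance(obj, datetime):
        return obj.isoformat()
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


event = Event("start", datetime(2024, 1, 1, tzinfo=timezone.utc))
payload = json.dumps(event, default=encode)
```

Note that this direction is lossy: the output carries no type information, so reading it back into an `Event` needs a matching, explicitly written decoder.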
Questions to Answer
- Is serialization implicit (automatic) or explicit (manual)?
- How are nested objects handled?
- Is deserialization validated?
- What happens with unknown fields?
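One possible answer to the last two questions is a strict `from_dict`-style constructor that validates input and rejects unknown fields. The sketch below is a standard-library illustration with a hypothetical `Message` type; frameworks using Pydantic would typically rely on its model configuration instead.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Message:  # hypothetical primitive
    role: str
    content: str


def message_from_dict(data: dict) -> Message:
    """Build a Message, rejecting unknown keys, missing keys, and non-string values."""
    allowed = {f.name for f in fields(Message)}
    unknown = set(data) - allowed
    if unknown:
        raise ValueError(f"Unknown fields: {sorted(unknown)}")
    missing = allowed - set(data)
    if missing:
        raise ValueError(f"Missing fields: {sorted(missing)}")
    for key, value in data.items():
        if not isinstance(value, str):
            raise TypeError(f"{key} must be str, got {type(value).__name__}")
    return Message(**data)
```

Whether unknown fields should raise, be dropped silently, or be preserved is a design decision worth recording in the analysis; this sketch picks the strictest option.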
Output Template
## Data Substrate Analysis: [Framework Name]
### Typing Strategy
- **Primary Approach**: [Pydantic/Dataclass/TypedDict/NamedTuple/Loose]
- **Key Files**: [List of files]
- **Nesting Depth**: [Shallow/Medium/Deep]
- **Validation**: [At boundaries/Everywhere/None]
### Core Primitives
| Type | Location | Purpose | Mutability |
|------|----------|---------|------------|
| Message | schema.py:L15 | Chat message | Immutable |
| State | state.py:L42 | Agent state | Mutable ⚠️ |
| Result | types.py:L78 | Tool output | Immutable |
### Mutation Analysis
- **Pattern**: [In-place/Copy-on-write/Mixed]
- **Risk Areas**: [List of mutable state locations]
- **Concurrency Safe**: [Yes/No/Partial]
### Serialization
- **Method**: [Pydantic/Custom/JSON]
- **Implicit/Explicit**: [Description]
- **Round-trip Tested**: [Yes/No/Unknown]
Integration
- Prerequisite: codebase-mapping to identify type files
- Feeds into: comparative-matrix for typing decisions
- Related: resilience-analysis for error handling in serialization