# Creating DB Analytics Analysis

## Overview
This skill guides creating new database analyses. Each analysis follows a structured pipeline: explore schema → collect data → analyze cross-subscriber → create per-subscriber reports → integrate with CI.
All changes MUST be made on a feature branch.
## When to Use
- User asks to "create a new analysis" or "analyze {topic}"
- User wants to understand a new domain in the clearvisio database
- User needs cross-subscriber insights on a topic
## Prerequisites

Before starting, verify:

- **Git status is clean:** Run `git status`. If uncommitted changes exist, ask Viktor how to handle them before proceeding.
- **database_keys.parquet exists:** Check that `data/raw/database_keys.parquet` exists (needed for multi-subscriber analysis).
## Naming Convention

Convert the topic to slug format for filenames:

- "Returning Customers" → `returning_customers`
- "SMS Patterns" → `sms_patterns`
- Use lowercase with underscores, no spaces or special characters
## Step 0: Ask Questions and Create Todos (MANDATORY)
You MUST complete this step before any other work.
### 0.1 Ask Initial Questions
Use AskUserQuestion to gather:
- Topic: What domain/topic to analyze? (e.g., "returning customers", "SMS patterns")
- Template: Do you need per-subscriber reports? (yes/no)
- Key questions: What specific questions should the analysis answer?
### 0.2 Create TodoWrite Todos
After receiving answers, IMMEDIATELY create todos:
TodoWrite with these items:
- [ ] Create feature branch for {topic}
- [ ] Explore database schema (subagent)
- [ ] Create data collection notebook (subagent)
- [ ] Verify data collection notebook runs
- [ ] Push the branch to GitLab and create MR
- [ ] Ask the user to review the data collection notebook
- [ ] Create panel analysis notebook (subagent)
- [ ] Verify panel analysis notebook runs
- [ ] Create template notebook (subagent) [if needed]
- [ ] Verify template notebook runs [if needed]
- [ ] Add CI jobs (subagent)
- [ ] Final review and commit
## Step 1: Create Feature Branch

Before any file changes, create a feature branch:

```bash
git checkout -b analysis/{topic_slug}
```

Example: `git checkout -b analysis/returning_customers`

**Success criteria:** Branch created; `git branch` shows you are on the new branch.
## Step 2: Database Exploration (Subagent)

Dispatch a subagent using the Task tool with subagent_type: "general-purpose":

**Prompt:**

Explore the clearvisio database schema to understand tables and relationships for analyzing {TOPIC}.

Working directory: /Users/nagyv/Projects/Clearvisio/db-analytics

Tasks:
1. Connect to the clearvisio database using:
   ```python
   from src.database.connector import get_engine
   engine = get_engine('clearvisio')
   ```
2. Identify relevant tables using INFORMATION_SCHEMA:
   ```sql
   SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES
   WHERE TABLE_SCHEMA = 'clearvisio' AND TABLE_NAME LIKE '{prefix}_%'
   ```
3. Examine table structures with DESCRIBE
4. Find foreign key relationships via KEY_COLUMN_USAGE

Output:
- Describe the database structure identified in `docs/db-structure/{TOPIC}.md`
- List of relevant tables with their purposes
- Key columns for the analysis
- Sample SQL query that will work across subscriber databases
- Any data quality concerns
**Success criteria:** You have a clear understanding of which tables and columns to query, and you documented your findings in a Markdown file.
**Error handling:** If no relevant tables found, ask Viktor to clarify the topic or suggest alternative approaches.
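As a generic illustration of this kind of schema exploration (not the project's actual connector or database), SQLAlchemy's `inspect()` exposes tables, columns, and foreign keys through one API. The sketch below runs against an in-memory SQLite stand-in rather than the real `get_engine('clearvisio')` MySQL connection:

```python
from sqlalchemy import create_engine, inspect, text

# Illustration only: an in-memory SQLite stand-in for the real
# clearvisio engine described in the prompt above.
engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)"))
    conn.execute(text(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
        "customer_id INTEGER REFERENCES customers(id))"
    ))

insp = inspect(engine)
for table in insp.get_table_names():                     # ~ INFORMATION_SCHEMA.TABLES
    cols = [c["name"] for c in insp.get_columns(table)]  # ~ DESCRIBE
    fks = insp.get_foreign_keys(table)                   # ~ KEY_COLUMN_USAGE
    print(table, cols, fks)
```

The inspector works uniformly across dialects, so the same calls apply when pointed at the real MySQL engine.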
## Step 3: Data Collection Notebook (Subagent)
Dispatch a subagent to create `notebooks/exploratory/multi_subscriber_{topic_slug}.ipynb`:
**Prompt:**
Create a multi-subscriber data collection notebook.
Working directory: /Users/nagyv/Projects/Clearvisio/db-analytics
Output file: notebooks/exploratory/multi_subscriber_{TOPIC_SLUG}.ipynb
The notebook MUST have these cells in order:
CELL 1 (Markdown):
# Multi-Subscriber {Topic} Analysis
Brief description of what this notebook analyzes.

CELL 2 (Code - Imports):
```python
import sys
sys.path.append('../..')

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from src.database.connector import subscriber_database
from src.data.multi_subscriber import analyze_all_subscribers, load_subscribers

pd.set_option('display.max_columns', None)
sns.set_style('whitegrid')
```
CELL 3 (Markdown):
## Analysis Function

CELL 4 (Code - Analysis function):
```python
def analyze_{topic_slug}(engine, subscriber_name):
    """Analyze {topic} for a subscriber database.

    Args:
        engine: SQLAlchemy engine connected to subscriber database
        subscriber_name: Name of the subscriber

    Returns:
        dict with metrics and optional 'data' DataFrame
    """
    query = '''
    {SQL_QUERY_FROM_EXPLORATION}
    '''
    try:
        df = pd.read_sql(query, engine)
        if len(df) == 0:
            return {{'error': 'No data found'}}
        return {{
            'metric1': value1,
            'data': df,
            'error': None
        }}
    except Exception as e:
        return {{'error': str(e)}}
```
CELL 5 (Markdown):
## Test with Small Subset

CELL 6 (Code - Test):
```python
test_results = analyze_all_subscribers(analyze_{topic_slug}, limit=5)

# Display results
pd.DataFrame.from_dict(
    {{k: {{key: val for key, val in v.items() if key != 'data'}}
     for k, v in test_results.items()}},
    orient='index'
)
```
CELL 7 (Markdown):
## Run Across All Subscribers

CELL 8 (Code - Full run):
```python
all_results = analyze_all_subscribers(analyze_{topic_slug})
```
CELL 9 (Markdown):
## Aggregate and Save

CELL 10 (Code - Save):
```python
import os
os.makedirs('../../data/interim', exist_ok=True)

# Combine all data
all_data = []
for subscriber, result in all_results.items():
    if result.get('error') is None and result.get('data') is not None:
        df = result['data'].copy()
        df['subscriber'] = subscriber
        all_data.append(df)

combined_df = pd.concat(all_data, ignore_index=True)
combined_df.to_parquet('../../data/interim/{topic_slug}.parquet', index=False)
print(f"Saved {{len(combined_df)}} rows to data/interim/{topic_slug}.parquet")
```
Tables to use: {TABLES_FROM_STEP_2}
Key questions: {KEY_QUESTIONS}
**Success criteria:** Notebook created, runs without errors with limit=5.
**Verification:** After the subagent completes, run the notebook locally:
```bash
uv run jupyter nbconvert --to notebook --execute notebooks/exploratory/multi_subscriber_{topic_slug}.ipynb --output /tmp/test.ipynb
```

**Error handling:** If some subscribers fail, that's expected. Check that most succeed and that errors are properly captured.
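Assuming `analyze_all_subscribers` returns a dict mapping subscriber name to a result dict with an `'error'` key (as the cell layout above suggests), a failure-rate check might look like this sketch with made-up results; the 10% threshold echoes the Troubleshooting table:

```python
# Sketch: checking the failure rate in an analyze_all_subscribers-style
# result dict (subscriber name -> result dict with an 'error' key).
# The subscriber names and metrics here are fabricated for illustration.
results = {
    "sub_a": {"metric1": 10, "error": None},
    "sub_b": {"error": "Table not found"},
    "sub_c": {"metric1": 7, "error": None},
}

failed = {name: r["error"] for name, r in results.items() if r.get("error")}
failure_rate = len(failed) / len(results)
print(f"{len(failed)}/{len(results)} subscribers failed ({failure_rate:.0%})")
if failure_rate > 0.10:  # threshold from the Troubleshooting table
    print("High failure rate - investigate before aggregating:", failed)
```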
## Step 4: Interim Review and Commit

After subagents complete:

1. Commit with a descriptive message:
   ```bash
   git add notebooks/exploratory/multi_subscriber_{topic_slug}.ipynb
   git commit -m "Add {topic} analysis data collection"
   ```
2. Push for review:
   ```bash
   git push -u origin analysis/{topic_slug}
   ```

**Success criteria:** All files committed, branch pushed, ready for user review before continuing.
## Step 5: Panel Analysis Notebook (Subagent)

Dispatch a subagent to create `notebooks/panel/{topic_slug}_analysis.ipynb`:

**Prompt:**
Create a panel analysis notebook for cross-subscriber {TOPIC} analysis.
Working directory: /Users/nagyv/Projects/Clearvisio/db-analytics
Output file: notebooks/panel/{TOPIC_SLUG}_analysis.ipynb
Input data: data/interim/{TOPIC_SLUG}.parquet
The notebook MUST:
1. Load data:
   ```python
   import sys
   sys.path.append('../..')
   import pandas as pd
   import numpy as np
   import matplotlib.pyplot as plt
   import seaborn as sns
   from sklearn.cluster import KMeans  # if clustering needed

   df = pd.read_parquet('../../data/interim/{topic_slug}.parquet')
   ```
2. Include sections for:
- Data overview and summary statistics
- Distribution visualizations (histograms, box plots)
- Cross-subscriber comparisons
- Clustering analysis (if appropriate)
- Outlier identification
- Key insights summary
3. Answer these questions: {KEY_QUESTIONS}
4. Use clear titles, labels, and formatting for all visualizations.
**Success criteria:** Notebook created, loads the parquet file, produces visualizations.

**Verification:** Run the notebook:
```bash
uv run jupyter nbconvert --to notebook --execute notebooks/panel/{topic_slug}_analysis.ipynb --output /tmp/test.ipynb
```
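A minimal sketch of the cross-subscriber comparison and outlier flagging the panel notebook performs, on a synthetic frame shaped like the interim parquet (a `subscriber` column plus metric columns; the `visits` metric is purely illustrative):

```python
import pandas as pd

# Synthetic stand-in for data/interim/{topic_slug}.parquet:
# one row per observation, tagged with the subscriber it came from.
df = pd.DataFrame({
    "subscriber": ["a", "a", "b", "b", "b", "c"],
    "visits":     [3,   5,   2,   2,   4,   9],
})

# Per-subscriber summary vs. the overall distribution.
per_sub = df.groupby("subscriber")["visits"].agg(["count", "mean", "median"])
overall_mean = df["visits"].mean()
per_sub["vs_overall"] = per_sub["mean"] - overall_mean
print(per_sub)

# Simple outlier flag: subscriber means more than 2 std devs
# from the mean of the per-subscriber means.
std = per_sub["mean"].std()
outliers = per_sub[(per_sub["mean"] - per_sub["mean"].mean()).abs() > 2 * std]
print("Outlier subscribers:", list(outliers.index))
```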
## Step 6: Template Notebook - Optional (Subagent)

Only create if the user requested per-subscriber reports.

Dispatch a subagent to create `notebooks/templates/{topic_slug}.ipynb`:

**Prompt:**
Create a per-subscriber template notebook for {TOPIC} analysis.
Working directory: /Users/nagyv/Projects/Clearvisio/db-analytics
Output file: notebooks/templates/{TOPIC_SLUG}.ipynb
Input data: data/interim/{TOPIC_SLUG}.parquet
CRITICAL REQUIREMENT: The first code cell MUST be tagged 'parameters' and contain:
```python
# parameters for papermill, default values
subscriber_name = "default_subscriber"
```
The notebook MUST:
1. Print the subscriber name header:
   ```python
   print(f"# {TOPIC} analysis for `{{subscriber_name}}`")
   ```
2. Load and filter data:
   ```python
   import sys
   sys.path.append('../..')
   import pandas as pd
   import numpy as np
   import matplotlib.pyplot as plt
   import seaborn as sns
   from pathlib import Path
   from src.database.connector import get_project_root

   parquet_path = get_project_root() / 'data/interim/{topic_slug}.parquet'
   all_df = pd.read_parquet(parquet_path)
   current_df = all_df[all_df['subscriber'] == subscriber_name].copy()
   ```
3. Provide subscriber-specific analysis:
- Key metrics for this subscriber
- Visualizations
- Comparison to overall averages
- Summary findings
Run command: `python main.py notebooks/templates/{topic_slug}.ipynb`

**Success criteria:** Notebook created with a parameters cell, runs for a single subscriber.

**Verification:**
```bash
uv run python main.py notebooks/templates/{topic_slug}.ipynb --limit 1
```
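The parameters-cell pattern can be illustrated on synthetic data: papermill overwrites the tagged default, and the notebook filters the shared frame down to one subscriber and compares it to the overall average. All names and numbers below are illustrative:

```python
import pandas as pd

# parameters for papermill, default values (in the real template this
# cell would carry the 'parameters' tag so papermill can override it)
subscriber_name = "sub_a"

# Synthetic stand-in for the shared interim parquet.
all_df = pd.DataFrame({
    "subscriber": ["sub_a", "sub_a", "sub_b"],
    "visits":     [4, 6, 10],
})

# Filter to the current subscriber, keeping the full frame for comparison.
current_df = all_df[all_df["subscriber"] == subscriber_name].copy()
print(f"# analysis for `{subscriber_name}`: {len(current_df)} rows")
print("subscriber mean:", current_df["visits"].mean())
print("overall mean:   ", all_df["visits"].mean())
```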
## Step 7: CI Integration (Subagent)

Dispatch a subagent to update `.gitlab/ci/analytics.yml`:

**Prompt:**
Add CI jobs to .gitlab/ci/analytics.yml for {TOPIC} analysis.
Working directory: /Users/nagyv/Projects/Clearvisio/db-analytics
File to edit: .gitlab/ci/analytics.yml
Add these jobs:
1. Data collection job in the 'analyze' stage:
   ```yaml
   load_{topic_slug}:
     extends: .notebook_base
     stage: analyze
     needs: ["generate_database_keys"]
     script:
       - uv run jupyter nbconvert --to notebook --execute --inplace notebooks/exploratory/multi_subscriber_{topic_slug}.ipynb
     artifacts:
       paths:
         - data/interim/{topic_slug}.parquet
         - notebooks/exploratory/multi_subscriber_{topic_slug}.ipynb
       expire_in: 30 days
   ```
2. Panel analysis job in the 'templates' stage:
   ```yaml
   analyse_{topic_slug}:
     extends: .notebook_base
     stage: templates
     needs:
       - generate_database_keys
       - load_{topic_slug}
     script:
       - uv run jupyter nbconvert --to notebook --execute --inplace notebooks/panel/{topic_slug}_analysis.ipynb
     artifacts:
       paths:
         - notebooks/panel/{topic_slug}_analysis.ipynb
       expire_in: 30 days
   ```
3. If the template notebook exists, add to the run_template job's script:
   ```yaml
   - uv run python main.py notebooks/templates/{topic_slug}.ipynb
   ```
Also update run_template's 'needs' to include load_{topic_slug} if template uses that data.
**Success criteria:** CI jobs added with correct stages and dependencies.
## Step 8: Final Review and Commit

After all subagents complete:

1. Review created files:
   - notebooks/exploratory/multi_subscriber_{topic_slug}.ipynb
   - notebooks/panel/{topic_slug}_analysis.ipynb
   - notebooks/templates/{topic_slug}.ipynb (if created)
   - .gitlab/ci/analytics.yml
2. Verify notebooks run locally (if not already done in verification steps)
3. Commit with a descriptive message:
   ```bash
   git add notebooks/exploratory/multi_subscriber_{topic_slug}.ipynb
   git add notebooks/panel/{topic_slug}_analysis.ipynb
   git add notebooks/templates/{topic_slug}.ipynb  # if exists
   git add .gitlab/ci/analytics.yml
   git commit -m "Add {topic} analysis notebooks and CI integration"
   ```
4. Push for final review:
   ```bash
   git push -u origin analysis/{topic_slug}
   ```

**Success criteria:** All files committed, branch pushed, ready for merge request.
## Common Mistakes
| Mistake | Correct Approach |
|---------|------------------|
| Skipping database exploration | ALWAYS explore schema first - you need to know the tables |
| Wrong notebook location | exploratory → notebooks/exploratory/, panel → notebooks/panel/, templates → notebooks/templates/ |
| Missing parameters cell in template | FIRST code cell MUST have subscriber_name = "default_subscriber" |
| Not testing with limit=5 | ALWAYS test with limit=5 before running on all subscribers |
| Wrong parquet path | From notebooks: ../../data/interim/{topic}.parquet. From templates: use get_project_root() |
| Missing CI integration | ALWAYS add jobs to analytics.yml |
| Working on main branch | ALWAYS create feature branch first |
| Not creating todos | ALWAYS create TodoWrite todos at the start |
| Skipping verification | ALWAYS verify each notebook runs before proceeding |
## Troubleshooting

| Problem | Solution |
|---------|----------|
| Some subscribers fail in analysis | Expected - check the error rate; if >10% fail, investigate |
| Parquet file not created | Check the notebook ran fully and the path is correct |
| Template notebook can't find subscriber | Verify the subscriber exists in the parquet file's 'subscriber' column |
| CI job fails | Check dependencies in 'needs'; verify artifact paths |
## Checklist (Create as TodoWrite todos)
- [ ] Asked initial questions (topic, template needed, key questions)
- [ ] Created feature branch
- [ ] Explored database schema (subagent)
- [ ] Documented database schema structure (subagent)
- [ ] Created data collection notebook (subagent)
- [ ] Verified data collection notebook runs
- [ ] Created panel analysis notebook (subagent)
- [ ] Verified panel analysis notebook runs
- [ ] Created template notebook if needed (subagent)
- [ ] Verified template notebook runs (if created)
- [ ] Added CI jobs (subagent)
- [ ] Final review and commit
- [ ] Branch pushed for review