Data Science Expert Skill

Data Science Expert

Comprehensive data science frameworks for analytics, machine learning, and data-driven decision making.

Data Strategy

Data Maturity Model

| Level | Name | Characteristics | | ----- | ------------------- | ----------------------------------------- | | 1 | Ad Hoc | Manual, inconsistent, siloed | | 2 | Opportunistic | Some automation, point solutions | | 3 | Systematic | Defined processes, governance emerging | | 4 | Differentiating | Data-driven decisions, advanced analytics | | 5 | Transformative | AI-first, competitive advantage |

Analytics Value Chain

DATA → INFORMATION → INSIGHT → ACTION → VALUE

PROGRESSION:
Descriptive: What happened?
Diagnostic: Why did it happen?
Predictive: What will happen?
Prescriptive: What should we do?
Autonomous: Self-optimizing systems

Statistical Analysis

Descriptive Statistics

CENTRAL TENDENCY:
- Mean: Sum / Count (sensitive to outliers)
- Median: Middle value (robust to outliers)
- Mode: Most frequent value

DISPERSION:
- Range: Max - Min
- Variance: Average squared deviation
- Standard Deviation: √Variance
- IQR: Q3 - Q1 (robust)

DISTRIBUTION SHAPE:
- Skewness: Asymmetry (0 = symmetric)
- Kurtosis: Tail heaviness (3 = normal)

For detailed inferential statistics and hypothesis testing, see Statistical Methods Reference.

Machine Learning

Algorithm Selection

| Task | Algorithms | When to Use | | ---------------------------- | ------------------------------------------------------------ | -------------------------------- | | Classification | Logistic Regression, Random Forest, XGBoost, Neural Networks | Categorical outcomes | | Regression | Linear Regression, Ridge/Lasso, Random Forest, XGBoost | Continuous outcomes | | Clustering | K-Means, Hierarchical, DBSCAN | Group discovery | | Dimensionality Reduction | PCA, t-SNE, UMAP | Feature reduction, visualization | | Anomaly Detection | Isolation Forest, One-Class SVM, Autoencoders | Outlier detection | | Time Series | ARIMA, Prophet, LSTM | Sequential data | | Recommendation | Collaborative Filtering, Content-Based, Matrix Factorization | Personalization | | NLP | Transformers, BERT, GPT | Text understanding/generation |

For detailed ML pipelines, feature engineering, and model monitoring, see ML Pipelines Reference.

Data Governance

Data Governance Framework

GOVERNANCE PILLARS:

POLICIES:
- Data ownership
- Data classification
- Data retention
- Data access
- Data quality standards

ROLES:
- Data Owner: Accountable for data domain
- Data Steward: Day-to-day quality management
- Data Custodian: Technical implementation
- Data Consumer: End user

PROCESSES:
- Data cataloging
- Metadata management
- Data lineage
- Issue resolution
- Change management

METRICS:
- Data quality scores
- Policy compliance
- Data access requests
- Issue resolution time

Data Quality Dimensions

| Dimension | Definition | Measurement | | ---------------- | --------------------------------- | ------------------------- | | Accuracy | Correct representation of reality | % records matching source | | Completeness | All required data present | % non-null values | | Consistency | Same across systems | % matching across sources | | Timeliness | Available when needed | Latency, freshness | | Validity | Conforms to format/rules | % passing validation | | Uniqueness | No unwanted duplicates | Duplicate rate |

Business Intelligence

BI Architecture

ARCHITECTURE LAYERS:

DATA SOURCES:
- Operational systems
- External data
- IoT/streaming

DATA INTEGRATION:
- ETL/ELT pipelines
- Data lakes
- Data warehouses

SEMANTIC LAYER:
- Business definitions
- Calculated metrics
- Hierarchies
- Relationships

PRESENTATION:
- Dashboards
- Reports
- Ad-hoc analysis
- Embedded analytics

Dashboard Design Principles

DESIGN PRINCIPLES:

PURPOSE:
- One clear objective per dashboard
- Know your audience
- Enable decisions

LAYOUT:
- Most important top-left
- Related items grouped
- Progressive disclosure
- Whitespace for clarity

VISUALS:
- Right chart for data type
- Consistent formatting
- Minimal decoration
- Color with purpose

INTERACTIVITY:
- Filters for exploration
- Drill-down capability
- Cross-filtering
- Tooltip details

Metric Design

METRIC DEFINITION TEMPLATE:

NAME: [Metric name]
DEFINITION: [Clear business definition]
FORMULA: [Precise calculation]
OWNER: [Responsible person]
DATA SOURCE: [Where it comes from]
GRAIN: [Level of detail]
FREQUENCY: [Update cadence]
DIMENSIONS: [Slicing attributes]
TARGETS: [Goals/benchmarks]
RELATED: [Related metrics]

Predictive Modeling

Use Case Framework

| Use Case | Business Application | Approach | | ------------------------- | -------------------- | ----------------------- | | Churn Prediction | Retention programs | Classification | | Demand Forecasting | Inventory planning | Time series | | Lead Scoring | Sales prioritization | Classification | | Price Optimization | Revenue management | Regression/RL | | Fraud Detection | Risk mitigation | Anomaly detection | | Recommendation | Personalization | Collaborative filtering | | Customer Segmentation | Marketing targeting | Clustering | | Lifetime Value | Customer investment | Regression |

Data Ethics & Privacy

Ethical AI Framework

PRINCIPLES:

FAIRNESS:
- No discriminatory outcomes
- Bias testing across groups
- Regular auditing

ACCOUNTABILITY:
- Clear ownership
- Decision audit trails
- Escalation process

TRANSPARENCY:
- Explainable decisions
- Clear documentation
- User communication

PRIVACY:
- Data minimization
- Consent management
- Security controls

Bias Detection

BIAS TYPES:

HISTORICAL: Reflects past discrimination
REPRESENTATION: Training data not representative
MEASUREMENT: Proxy variables correlate with protected attributes
AGGREGATION: Single model for diverse populations
EVALUATION: Inappropriate benchmarks

FAIRNESS METRICS:
- Demographic Parity: Equal positive rates
- Equalized Odds: Equal TPR and FPR
- Individual Fairness: Similar inputs, similar outputs
- Calibration: Equal accuracy across groups

Analytics Team Structure

Team Roles

| Role | Focus | Skills | | ---------------------- | -------------------------- | --------------------------- | | Data Engineer | Pipelines, infrastructure | SQL, Python, Spark, Cloud | | Data Analyst | Reporting, ad-hoc analysis | SQL, BI tools, Statistics | | Data Scientist | Modeling, ML | Python/R, ML, Statistics | | ML Engineer | Model deployment | MLOps, Software Engineering | | Analytics Engineer | Data modeling | dbt, SQL, Data Modeling |

Operating Models

| Model | Description | Best For | | ----------------- | ----------------------------- | ----------------------- | | Centralized | Single analytics team | Consistency, efficiency | | Decentralized | Embedded in business units | Business alignment | | Hub & Spoke | Central CoE + embedded | Balance of both | | Federated | Shared platform, domain teams | Scale with autonomy |

References

ML Pipelines Reference - Detailed ML pipeline, feature engineering, model development
Statistical Methods Reference - Inferential statistics, hypothesis testing, evaluation metrics

Agent Skills: Data Science Expert

Install this agent skill to your local

Skill Files