Agent Skills: Data Analyst Workflow

A professional analysis workflow that uses the Python ecosystem (Jupyter, Pandas, Scikit-learn) to draw deep insights from data.

ID: lastdays03/claude_skills/data-analyst

Install this agent skill locally:

pnpm dlx add-skill https://github.com/lastdays03/claude_skills/tree/HEAD/.agent/skills/data-analyst

Skill Files

Browse the full folder contents for data-analyst.


.agent/skills/data-analyst/SKILL.md

Skill Metadata

Name
data-analyst
Description
A professional analysis workflow that uses the Python ecosystem (Jupyter, Pandas, Scikit-learn) to draw deep insights from data.

Data Analyst Workflow

A professional analysis workflow that uses the Python ecosystem (Jupyter, Pandas, Scikit-learn) to draw deep insights from data. It follows the OSEMN methodology and the standards in SKILL.md.

Step 0: Core Principles

  1. Explanation First: Before writing any code, describe in Markdown what you are analyzing and why.
  2. Interpretation Mandatory: Below every code cell's output, add a Markdown cell that interprets in detail (in Korean) what the statistics or graph mean.
    • Bad: a graph left on its own with no commentary.
    • Good: "The histogram's long right tail (skewness > 1) suggests a log transform is needed."

Step 1: Define the Analysis Environment & Goal

  1. Context Loading:
    • Load this document and review the **'Core Principles'**.
    • Methodology Screening:
      • Scan the **'Methodology Master List'** in SKILL.md.
      • Compare the current dataset's characteristics (Type, Shape, Quality) against the analysis goal and filter out three or more candidates.
      • [Check]: Did you state a one-line rationale for each candidate? (e.g., "Many missing values, so adopt Random Forest")
  2. Define Objective:
    • Use resources/plan-template.md to write docs/plans/ANALYSIS_[Topic].md.
    • Define **business impact (KPIs)** as the success criterion, beyond plain statistics or accuracy.
  3. Notebook Setup:
    • Create or open docs/notebooks/[Topic]_Analysis.ipynb.
    • AI Readability: every statistic (describe, p-value) and graph must remain in the notebook's output cells (No Hiding).

Step 2: Methodology Screening

"The Right Tool for the Job"

Match the data's characteristics against the **Methodology Master List** in SKILL.md, select the best-fit techniques, and record them in the Plan.

  • Metric Selection: Choose evaluation metrics based on the business goal and whether the data is imbalanced. (e.g., Imbalanced → F1/Recall/PR-AUC, Regression → MAPE/RMSE)
  • Preprocessing:
    • Outliers: If outliers are plentiful, choose RobustScaler.
    • High Cardinality: If a feature has too many categories, consider Target Encoding.
    • Imbalance: Decide whether to apply SMOTE or Class Weights.
  • Modeling:
    • Baseline: Logistic/Linear Regression, Decision Tree.
    • Advanced: Choose among CatBoost (Categorical), XGBoost/LightGBM (Large Data), Isolation Forest (Anomaly), etc.
  • Validation Strategy (see the sketch after this list):
    • Time Series: Always use Time Series Split.
    • Imbalanced: Stratified K-Fold is mandatory.
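
A minimal sketch of how these two strategies map onto scikit-learn splitters; the 100-sample array and the 90:10 class ratio are hypothetical stand-ins for real project data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

# Hypothetical data: 100 ordered samples with a 90:10 class imbalance.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# Imbalanced classification: StratifiedKFold keeps the class ratio in every fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    print("minority ratio in fold:", round(y[test_idx].mean(), 2))

# Time series: TimeSeriesSplit only ever trains on the past.
tss = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tss.split(X):
    assert train_idx.max() < test_idx.min()  # no future leakage
```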

Step 3: Load the Data & Validate Quality (Obtain & Scrub)

This step secures data reliability to prevent Garbage In, Garbage Out.

  1. Smart Loading (see the sketch after this list):
    • Excel: Use openpyxl when formulas must be preserved, and take care not to hardcode calculated values.
    • CSV: Auto-detect the encoding (utf-8, cp949) and the delimiter.
  2. Data Profiling:
    • Check basic statistics (describe()), missing values (isnull().sum()), and data types (dtypes).
    • [Markdown]: Summarize the observed missing values and data types in text.
  3. Deep Sanity Check:
    • Consult the **'Logical Failures'** item in SKILL.md for a close inspection.
    • Impossible-value detection: Flag data that defies common sense, such as negative ages (Age < 0) or future dates.
  4. Strategic Cleaning:
    • When handling missing values and outliers, apply the imputation methodology screened earlier (Simple, KNN, Iterative).
    • State the treatment criterion (drop vs. impute) and its rationale.
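
A minimal sketch of the loading and profiling steps above, assuming a hypothetical data/sales.csv and an age column; the encoding fallback order (utf-8, then cp949) follows the Smart Loading guidance.

```python
import csv
import pandas as pd

PATH = "data/sales.csv"  # hypothetical input file

# Sniff the delimiter from a raw sample, trying utf-8 before cp949.
with open(PATH, "rb") as f:
    sample = f.read(4096)

df = None
for enc in ("utf-8", "cp949"):
    try:
        dialect = csv.Sniffer().sniff(sample.decode(enc))
        df = pd.read_csv(PATH, encoding=enc, sep=dialect.delimiter)
        break
    except (UnicodeDecodeError, csv.Error):
        continue

# Profiling: keep these outputs visible in the notebook (No Hiding).
print(df.describe())
print(df.isnull().sum())
print(df.dtypes)

# Deep sanity check: flag impossible values such as negative ages.
if "age" in df.columns:
    print("rows with impossible age:", (df["age"] < 0).sum())
```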

Step 4: Hypothesis-Driven EDA (Explore)

Avoid a mere parade of graphs; keep to a Question → Visualization → Finding flow.

  1. Univariate Analysis: Examine each variable's distribution, skewness, and kurtosis.
    • [Markdown]: Describe each variable's distribution and whether it contains outliers.
  2. Bivariate Analysis: Visualize relationships between variables with a correlation matrix and scatter plots.
    • [Markdown]: Interpret the linear/nonlinear relationships between variables.
  3. Statistical Validation (see the sketch after this list):
    • Back visual findings with statistical tests (Normality Test, T-test, ANOVA) to show they are not coincidence.
    • [Markdown]: State whether the null hypothesis is rejected, citing the p-value.
  4. Insight Logging:
    • Immediately record how any significant mid-analysis finding could affect business KPIs.
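
A minimal sketch of the statistical-validation step on synthetic samples; the lognormal variable and the two groups are illustrative stand-ins for real columns.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical samples: a right-skewed variable and two groups to compare.
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)
group_a = rng.normal(10.0, 2.0, size=200)
group_b = rng.normal(10.8, 2.0, size=200)

# Univariate: skewness > 1 suggests a log transform (see Core Principles).
print("skewness:", stats.skew(x))

# Normality test before choosing a parametric test.
stat, p = stats.shapiro(group_a)
print(f"Shapiro-Wilk p-value: {p:.4f}")

# Significance: two-sample t-test for a group difference.
stat, p = stats.ttest_ind(group_a, group_b)
print(f"t-test p-value: {p:.4f}  ->  reject H0 at alpha=0.05: {p < 0.05}")
```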

Step 5: Model & Interpret [Optional]

  1. Baseline Model:
    • Set a performance floor with the simplest model (Dummy, Linear) and confirm later models beat it.
  2. Feature Engineering:
    • Apply techniques from the SKILL.md list (Encoding, Scaling, PCA).
  3. Advanced Modeling & Tuning:
    • Model Selection: Train the selected high-performance model (XGBoost, CatBoost, RF, etc.).
    • Hyperparameter Tuning: Go beyond plain Grid Search and optimize efficiently with tools such as Optuna.
    • Clustering: When using K-Means or similar, determine the optimal number of clusters (K) with the **Elbow Method** and **Silhouette Score**.
  4. Rigorous Validation (see the sketch after this list):
    • Use Stratified K-Fold to prevent overfitting and assess generalization.
    • Compare the Cross Validation score against the Hold-out Test score to detect overfitting (warn when the gap exceeds 5%).
  5. Interpretation & Error Analysis:
    • Feature Importance: Explain the model's reasoning with SHAP, Permutation Importance, etc.
    • Error Analysis: Manually inspect the 'Top 10 Worst Errors' (the ten samples the model got most wrong) to diagnose the cause.
  6. Reporting:
    • Translate technical metrics (Accuracy, MSE) into business language (expected revenue, risk reduction).
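
A minimal baseline-versus-advanced sketch with the CV/hold-out gap check from item 4; the dataset is synthetic and Random Forest is just one possible "advanced" choice.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic imbalanced dataset standing in for real project data.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Baseline first: a model that fails to beat this is not worth reporting.
baseline = DummyClassifier(strategy="stratified", random_state=42)
base_f1 = cross_val_score(baseline, X_tr, y_tr, cv=cv, scoring="f1").mean()

model = RandomForestClassifier(random_state=42)
cv_f1 = cross_val_score(model, X_tr, y_tr, cv=cv, scoring="f1").mean()
test_f1 = f1_score(y_te, model.fit(X_tr, y_tr).predict(X_te))

print(f"baseline F1={base_f1:.3f}  CV F1={cv_f1:.3f}  hold-out F1={test_f1:.3f}")
if cv_f1 - test_f1 > 0.05:  # the >5% gap warning from Rigorous Validation
    print("WARNING: CV/hold-out gap suggests overfitting")
```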

Standards & Rules

Data Analyst Standards (OSEMN)

Purpose

To transform raw data into actionable business insights using a rigorous, hypothesis-driven approach.

Core Philosophy

"No Hiding" AI Readability: All statistical outputs (describe, corr, p-value) and graphs must remain in the notebook output. This ensures tools like NotebookLM can contextually understand the analysis.

Quality Standards (Tier 1 Best Practices)

1. Data Integrity (Obtain & Scrub)

"Garbage In, Garbage Out"

  • [ ] Data Source Verification: Check file extension, size, and metadata.
  • [ ] Data Quality Check:
    • Missing Values: Identify mechanism (MCAR, MAR, MNAR) before imputing.
    • Logical Failures: Check for impossible values (e.g., age < 0, future dates), Data Leakage, and Overfitting indicators.
    • Data Types: Ensure numeric cols are not strings, etc.
  • [ ] File Handling Standards:
    • Excel (.xlsx): Preserve existing formatting/formulas. Use openpyxl for editing, pandas for reading. Zero hardcoding of calculated values (see the sketch after this list).
    • CSV (.csv): Detect delimiter automatically (csv.Sniffer). Handle encoding errors (utf-8 vs cp949) explicitly.
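
A minimal sketch of the read-with-pandas, edit-with-openpyxl split described above; the workbook path and cell address are hypothetical.

```python
import openpyxl
import pandas as pd

PATH = "data/report.xlsx"  # hypothetical workbook

# Read calculated values with pandas for analysis.
df = pd.read_excel(PATH)
print(df.head())

# Edit with openpyxl; the default data_only=False keeps formulas intact,
# so nothing is replaced by hardcoded calculated values.
wb = openpyxl.load_workbook(PATH)
ws = wb.active
ws["B2"] = 42  # write only literal inputs; leave formula cells alone
wb.save("data/report_edited.xlsx")
```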

2. Hypothesis Driven EDA (Explore)

"Ask, Don't just Plot"

  • [ ] Univariate Analysis: Distribution of each key variable (Histogram/Boxplot). Check for Skewness/Kurtosis.
  • [ ] Bivariate Analysis: Correlation matrix, Scatter plots for relationships.
  • [ ] Statistical Validation:
    • Normality Test: Shapiro-Wilk or K-S test.
    • Significance: T-test/ANOVA for group differences.
  • [ ] Insight Logging: Record the implication of every finding immediately.

3. Rigorous Modeling (Model)

"Trust but Verify"

  • [ ] Baseline First: Compare complex models against a Dummy/Logistic Baseline.
  • [ ] Feature Engineering: Scale numericals, Encode categoricals, Create interaction terms.
  • [ ] Cross-Validation: Use Stratified K-Fold to prevent overfitting.
  • [ ] Metric Selection: Optimize for business KPI (not just Accuracy).
  • [ ] Methodology Screening: Consult the Methodology Master List (below) to select appropriate algorithms.

4. Interpretation (Interpret)

"Why did it predict that?"

  • [ ] Feature Importance: SHAP values or Permutation Importance.
  • [ ] Error Analysis: Manually inspect the "Top 10 Worst Errors" (see the sketch below).
  • [ ] Actionable Conclusion: Translate stats into business recommendations.
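
A minimal sketch of both checklist items on synthetic regression data, using scikit-learn's model-agnostic permutation importance; all names and sizes are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the real problem.
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

# Feature importance: permutation importance on the hold-out set.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
print("importances:", result.importances_mean.round(3))

# Error analysis: surface the "Top 10 Worst Errors" for manual inspection.
errors = pd.DataFrame({"y_true": y_te, "y_pred": model.predict(X_te)})
errors["abs_error"] = (errors["y_true"] - errors["y_pred"]).abs()
print(errors.nlargest(10, "abs_error"))
```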

Checklist (Quality Gate)

Before finalizing:

  • [ ] Reproducibility: Can the notebook run from top to bottom without error?
  • [ ] Storytelling: Does the notebook flow like a narrative?
  • [ ] Visuals: Are all graphs labeled (Title, Axis, Legend)?

Appendix: Methodology Master List (Reference)

Scan these tables to select the most appropriate methodology for your data and goal.

1. Preprocessing & Data Cleaning

| Methodology | Usage / Purpose | Data Constraints |
| :--- | :--- | :--- |
| Simple Imputation | Missing Value Imputation (Simple Replacement) | Mean/Median (Numeric), Mode (Categorical) |
| KNN Imputation | Missing Value Imputation (Similarity-based) | Mainly Numeric, useful when correlations exist |
| Iterative Imputation | Missing Value Imputation (Model-based) | High variable correlation, assumes MAR |
| One-Hot Encoding | Categorical to Numeric | Nominal data, Low Cardinality |
| Label Encoding | Categorical to Numeric | Ordinal data |
| Target Encoding | Categorical to Numeric | High Cardinality features, Risk of Overfitting |
| Standard Scaler | Scaling (Standardization) | Sensitive to outliers, assumes Gaussian distribution |
| MinMax Scaler | Scaling (Normalization) | Bounded data, distribution agnostic |
| Robust Scaler | Scaling (Robust to Outliers) | Data with many outliers (Uses Median/IQR) |
| SMOTE | Oversampling (Imbalanced Data) | Synthesize minority class samples (Training set ONLY) |
| PCA | Dimensionality Reduction, Multicollinearity Removal | Continuous variables, assumes linear relationships |
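
A minimal sketch of the three imputation rows above on a tiny frame with missing values; note that IterativeImputer still requires scikit-learn's experimental enable import.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Tiny numeric frame with missing values in both columns.
df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0], "b": [10.0, np.nan, 30.0, 40.0]})

print(SimpleImputer(strategy="median").fit_transform(df))  # simple replacement
print(KNNImputer(n_neighbors=2).fit_transform(df))         # similarity-based
print(IterativeImputer(random_state=0).fit_transform(df))  # model-based, assumes MAR
```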

2. Machine Learning Models

| Methodology | Type | Usage / Purpose | Constraints / Notes |
| :--- | :--- | :--- | :--- |
| Linear Regression | Regression | Baseline for regression | Linear relationship assumption |
| Logistic Regression | Classification | Baseline for classification | Linear separation assumption, large sparse data OK |
| SVM / SVR | Class/Reg | High accuracy in high dimensional spaces | Computationally expensive (O(n^3)), Scale-sensitive |
| K-Nearest Neighbors | Class/Reg | Instance-based learning, Simple | Scale-sensitive, Small data |
| Random Forest | Ensemble | Robust Classification/Regression | Handles Mixed types, Robust to outliers/missing values |
| XGBoost / LightGBM | Ensemble | High Performance | Large datasets, handles missing values internally |
| CatBoost | Ensemble | Best for Categorical Features | Handles categories automatically, Slower training |
| Isolation Forest | Anomaly Detection | Outlier/Anomaly Detection | High dimensional data, efficiency |
| K-Means | Clustering | Partitioning into K clusters | Spherical Clusters, Sensitive to outliers, Scale-sensitive |
| DBSCAN | Clustering | Density-based clustering, Detects Outliers | Arbitrary shapes, Scale-sensitive, finding epsilon is hard |
| Hierarchical | Clustering | Dendrogram visualization | Computationally expensive for large data |

3. Deep Learning Models

| Methodology | Usage / Purpose | Data Constraints |
| :--- | :--- | :--- |
| CNN | Image/Pattern Recognition | Grid-like data (Images, etc.) |
| RNN / LSTM | Sequence/Time-Series Prediction | Sequential data |
| Transformer | NLP, Complex Pattern Matching | Long sequences, Large-scale data |

4. Validation & Optimization

| Methodology | Type | Usage / Purpose | Notes |
| :--- | :--- | :--- | :--- |
| Stratified K-Fold | Validation | Cross Validation (Generalization) | Essential for Imbalanced Class distribution |
| K-Fold CV | Validation | Cross Validation | Sufficient data, Balanced classes |
| Time Series Split | Validation | Cross Validation (Temporal) | No future data leakage (essential for time-series) |
| Grid Search | Tuning | Hyperparameter Optimization | Small search space (Exhaustive) |
| Bayesian Optimization | Tuning | Hyperparameter Optimization | Large search space, High evaluation cost |
| Optuna | Tuning | Next-gen Hyperparameter Optimization | Efficient, Define-by-run, Pruning capabilities |
| L1 (Lasso) | Regularization | Sparse Model, Feature Selection | When sparse solution is needed |
| L2 (Ridge) | Regularization | Prevent Overfitting, Weight Decay | When high multicollinearity exists |
| ElasticNet | Regularization | Combination of L1 and L2 | When both feature selection and regularization needed |
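
A minimal Optuna sketch illustrating the define-by-run style noted in the table; the Random Forest search space is an arbitrary illustrative choice.

```python
import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def objective(trial):
    # Define-by-run: the search space is declared inside the objective itself.
    model = RandomForestClassifier(
        n_estimators=trial.suggest_int("n_estimators", 50, 300),
        max_depth=trial.suggest_int("max_depth", 2, 10),
        random_state=0,
    )
    return cross_val_score(model, X, y, cv=3, scoring="f1").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)
```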

5. Interpretation

| Methodology | Usage / Purpose | Notes |
| :--- | :--- | :--- |
| SHAP | Explain Model Predictions | Specialized for Tree-based models |

Appendix: Evaluation Metrics Guide

Select metrics based on your problem type and business goal.

Classification Metrics

| Metric | Focus | When to use |
| :--- | :--- | :--- |
| Accuracy | Overall Correctness | Balanced datasets only. Misleading for imbalanced data. |
| Precision | False Positive Reduction | When FP is costly (e.g., Spam Filter). |
| Recall | False Negative Reduction | When FN is critical (e.g., Cancer Diagnosis, Fraud). |
| F1 Score | Balance | When you need a balance between Precision and Recall. |
| ROC-AUC | Ranking Quality | When you need robust performance across thresholds. |
| Log Loss | Probability Confidence | When the predicted probability value itself matters. |
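
A small numeric illustration of the Accuracy caveat in the first row: on a hypothetical 95:5 split, a model that always predicts the majority class scores 95% accuracy while recall and F1 collapse to zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical 95:5 imbalance; the "model" always predicts the majority class.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)

print("accuracy :", accuracy_score(y_true, y_pred))                    # 0.95
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("recall   :", recall_score(y_true, y_pred, zero_division=0))     # 0.0
print("f1       :", f1_score(y_true, y_pred, zero_division=0))         # 0.0
```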

Regression Metrics

| Metric | Focus | When to use |
| :--- | :--- | :--- |
| MSE | Large Error Penalty | When outliers/large errors should be heavily penalized. |
| RMSE | Interpretability | When you need error in the same unit as the target. |
| MAE | Robustness | When you want to be robust against outliers. |
| R2 Score | Explainability | To see how much variance is explained by the model. |
| MAPE | Business Interpretability | Error in Percentage (%). Easy for stakeholders. |

Clustering Metrics (Unsupervised)

| Metric | Focus | When to use |
| :--- | :--- | :--- |
| Silhouette Score | Cluster Separation | To measure how similar an object is to its own cluster compared to other clusters. |
| Davies-Bouldin | Cluster Compactness | Lower is better. Good for comparing clustering algorithms. |
| Elbow Method | Optimal K | To find the inflection point (optimal K) in K-Means. |
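
A minimal sketch sweeping K with both the Elbow Method (via K-Means inertia) and the Silhouette Score; the blob data and K range are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Sweep K: falling inertia locates the "elbow"; silhouette peaks at a good K.
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sil = silhouette_score(X, km.labels_)
    print(f"K={k}  inertia={km.inertia_:.0f}  silhouette={sil:.3f}")
```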