Data Analyst Workflow
A professional analysis workflow that uses the Python ecosystem (Jupyter, Pandas, Scikit-learn) to draw deep insights from data. It follows the OSEMN methodology and the standards in SKILL.md.
Step 0: Core Principles
- Explanation First: Before writing any code, describe in Markdown what you are analyzing and why.
- Interpretation Mandatory: Below every code cell's output, always add a Markdown cell that interprets in detail what the statistics or graphs mean.
  - Bad: a graph left standing alone with no commentary.
  - Good: "The histogram's long right tail (skewness > 1) suggests a log transform is needed."
Step 1: Environment & Goal Definition
- Context Loading: Load this document and review the **Core Principles**.
- Methodology Screening: Scan the **Methodology Master List** in SKILL.md. Compare the dataset's characteristics (type, shape, quality) against the analysis goal and shortlist at least three candidate methods.
  - [Check]: Is each candidate's selection rationale stated in one line? (e.g., "Many missing values, so Random Forest was chosen")
- Define Objective: Use resources/plan-template.md to write docs/plans/ANALYSIS_[Topic].md. Define **business impact (KPIs)**, not just statistics or accuracy, as the success criteria.
- Notebook Setup: Create or open docs/notebooks/[Topic]_Analysis.ipynb.
  - AI Readability: All statistics (`describe`, p-values) and graphs must remain in the notebook's output cells (No Hiding).
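The Notebook Setup step above can be sketched as a typical first cell. The seed value and display options here are illustrative assumptions, not mandated by SKILL.md:

```python
# Typical first cell of [Topic]_Analysis.ipynb: imports, seed, display options.
import numpy as np
import pandas as pd

RANDOM_STATE = 42            # fixed seed so the notebook reruns reproducibly
np.random.seed(RANDOM_STATE)

# Keep wide describe() output fully visible in the output cell (No Hiding).
pd.set_option("display.max_columns", 100)
pd.set_option("display.float_format", "{:.3f}".format)
```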
Step 2: Methodology Screening
"The Right Tool for the Job"
Select the best-fit techniques from the Methodology Master List in SKILL.md based on the data's characteristics, and record them in the Plan.
- Metric Selection: Choose evaluation metrics based on the business goal and whether the data is imbalanced. (e.g., Imbalanced → F1/Recall/PR-AUC; Regression → MAPE/RMSE)
- Preprocessing:
  - Outliers: Choose `RobustScaler` when there are many outliers.
  - High Cardinality: Consider `Target Encoding` when there are too many categories.
  - Imbalance: Decide whether to apply `SMOTE` or class weights.
- Modeling:
  - Baseline: Logistic/Linear Regression, Decision Tree.
  - Advanced: Choose among `CatBoost` (categorical), `XGBoost`/`LightGBM` (large data), `Isolation Forest` (anomaly), etc.
- Validation Strategy:
  - Time Series: Always use `Time Series Split`.
  - Imbalanced: `Stratified K-Fold` is mandatory.
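The validation-strategy rules above can be sketched with scikit-learn's splitters; the arrays below are toy assumptions for illustration:

```python
# Sketch: choosing a CV splitter by data type (toy data).
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 15 + [1] * 5)           # imbalanced labels (25% positive)

# Imbalanced classification: stratify so every fold keeps the class ratio.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    assert y[test_idx].sum() == 1           # each fold holds 1 of the 5 positives

# Time series: train indices always precede test indices (no future leakage).
tss = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tss.split(X):
    assert train_idx.max() < test_idx.min()
```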
Step 3: Data Loading & Quality Verification (Obtain & Scrub)
This step secures data reliability to prevent Garbage In, Garbage Out.
- Smart Loading:
  - Excel: Use `openpyxl` when formulas must be preserved, and take care not to hardcode calculated values.
  - CSV: Use automatic detection of encoding (`utf-8`, `cp949`) and delimiter.
- Data Profiling:
  - Check basic statistics (`describe()`), missing values (`isnull().sum()`), and data types (`dtypes`).
  - [Markdown]: Summarize the observed missing-value status and data types in text.
- Deep Sanity Check:
  - Inspect closely against the 'Logical Failures' items in SKILL.md.
  - Detect impossible values: identify data that defies common sense, such as negative ages (Age < 0) or future dates.
- Strategic Cleaning:
  - When handling missing values and outliers, apply the 'Imputation Methodology' screened earlier (Simple, KNN, Iterative).
  - State the treatment rule (drop vs. impute) and its rationale.
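A minimal sketch of the profiling, sanity-check, and cleaning steps above, on an assumed toy DataFrame (the `Age`/`SignupDate` columns are hypothetical, not from a real dataset):

```python
# Toy data: -3 is an impossible age, 2099-12-31 is a future date.
import pandas as pd

df = pd.DataFrame({
    "Age": [25, -3, 41, None],
    "SignupDate": pd.to_datetime(
        ["2020-01-01", "2021-06-15", "2099-12-31", "2019-03-02"]),
})

# Data Profiling: stats, missing counts, and dtypes stay in the output cell.
print(df.describe(include="all"))
print(df.isnull().sum())
print(df.dtypes)

# Deep Sanity Check: flag impossible values instead of silently dropping them.
bad_age = df["Age"] < 0
future_date = df["SignupDate"] > pd.Timestamp.today()
print(f"impossible ages: {bad_age.sum()}, future dates: {future_date.sum()}")

# Strategic Cleaning: document the rule (here: set to NaN, then median-impute).
df.loc[bad_age, "Age"] = None
df["Age"] = df["Age"].fillna(df["Age"].median())
```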
Step 4: Hypothesis-Driven EDA
Avoid a mere parade of graphs; maintain the flow Question → Visualization → Finding.
- Univariate Analysis: Check each variable's distribution, skewness, and kurtosis.
  - [Markdown]: Describe each variable's distribution and whether it has outliers.
- Bivariate Analysis: Visualize relationships between variables with a correlation matrix and scatter plots.
  - [Markdown]: Interpret the linear/non-linear relationships between variables.
- Statistical Validation:
  - Verify visual findings with statistical tests (normality test, t-test, ANOVA) to prove they are not mere coincidence.
  - [Markdown]: State whether the null hypothesis is rejected, citing the p-value.
- Insight Logging:
  - Immediately record how each important fact discovered mid-analysis may affect business KPIs.
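The Statistical Validation step above can be sketched with SciPy on synthetic data (the group means, spread, and sizes here are assumptions):

```python
# Sketch: backing a visual finding with statistical tests (synthetic groups).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=100, scale=10, size=200)   # e.g. control group
group_b = rng.normal(loc=105, scale=10, size=200)   # e.g. treatment group

# Normality Test (Shapiro-Wilk) decides parametric vs non-parametric tests.
_, p_norm = stats.shapiro(group_a)

# Significance: two-sample t-test for a difference in group means.
t_stat, p_val = stats.ttest_ind(group_a, group_b)
print(f"normality p={p_norm:.3f}, t-test p={p_val:.2e}")
# [Markdown] If p < 0.05, reject H0: the group difference is unlikely to be chance.
```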
Step 5: Model & Interpret [Optional]
- Baseline Model:
  - Set a performance floor with the simplest model (Dummy, Linear) and confirm later models beat it.
- Feature Engineering:
  - Apply techniques from the SKILL.md list (Encoding, Scaling, PCA).
- Advanced Modeling & Tuning:
  - Model Selection: Train the selected high-performance models (XGBoost, CatBoost, RF, etc.).
  - Hyperparameter Tuning: Go beyond plain Grid Search; use tools such as `Optuna` for efficient optimization.
  - Clustering: When using K-Means or similar, determine the optimal number of clusters (K) with the `Elbow Method` or `Silhouette Score`.
- Rigorous Validation:
  - Compare the `Cross Validation` score with the `Hold-out Test` score to detect overfitting. (Warn if Gap > 5%)
  - Use Stratified K-Fold to prevent overfitting and assess generalization performance.
- Interpretation & Error Analysis:
  - Feature Importance: Explain the basis of the model's decisions with `SHAP`, `Permutation Importance`, etc.
  - Error Analysis: Manually inspect the 'Top 10 Worst Errors' (the 10 samples the model got most wrong) to find the causes.
- Reporting:
  - Translate technical metrics (Accuracy, MSE) into business language (expected revenue, risk reduction) in the report.
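The baseline-first and overfitting gap-check rules above can be sketched on a synthetic dataset (the 5% gap threshold follows the text; the data and model choices are illustrative assumptions):

```python
# Sketch: baseline floor + CV-vs-hold-out gap check (toy imbalanced data).
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)

X, y = make_classification(n_samples=500, weights=[0.8], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Baseline Model: any real model must beat this floor.
base = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
floor = base.score(X_te, y_te)

# Advanced model with stratified CV; compare CV mean vs hold-out score.
model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
cv = cross_val_score(model, X_tr, y_tr,
                     cv=StratifiedKFold(n_splits=5, shuffle=True,
                                        random_state=42))
holdout = model.score(X_te, y_te)
gap = cv.mean() - holdout
print(f"baseline={floor:.2f}, cv={cv.mean():.2f}, "
      f"holdout={holdout:.2f}, gap={gap:.2%}")
if abs(gap) > 0.05:                        # Gap > 5% → possible overfitting
    print("WARNING: CV/hold-out gap exceeds 5%")
```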
Standards & Rules
Data Analyst Standards (OSEMN)
Purpose
To transform raw data into actionable business insights using a rigorous, hypothesis-driven approach.
Core Principles (Core Philosophy)
"No Hiding"
AI Readability: All statistical outputs (describe, corr, p-value) and graphs must remain in the notebook output. This ensures tools like NotebookLM can contextually understand the analysis.
Quality Standards (Tier 1 Best Practices)
1. Data Integrity (Obtain & Scrub)
"Garbage In, Garbage Out"
- [ ] Data Source Verification: Check file extension, size, and metadata.
- [ ] Data Quality Check:
- Missing Values: Identify mechanism (MCAR, MAR, MNAR) before imputing.
- Logical Failures: Check for impossible values (e.g., age < 0, future dates), Data Leakage, and Overfitting indicators.
- Data Types: Ensure numeric cols are not strings, etc.
- [ ] File Handling Standards:
  - Excel (.xlsx): Preserve existing formatting/formulas. Use `openpyxl` for editing, `pandas` for reading. Zero hardcoding of calculated values.
  - CSV (.csv): Detect the delimiter automatically (`csv.Sniffer`). Handle encoding errors (utf-8 vs cp949) explicitly.
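A sketch of the CSV standard above, wrapped in a helper whose name (`read_csv_robust`) is illustrative, not part of SKILL.md:

```python
# Sketch: delimiter sniffing plus explicit encoding fallback for CSV loading.
import csv
import tempfile

import pandas as pd

def read_csv_robust(path: str) -> pd.DataFrame:
    """Try utf-8 first, then cp949; sniff the delimiter from a small sample."""
    for enc in ("utf-8", "cp949"):
        try:
            with open(path, encoding=enc) as f:
                sep = csv.Sniffer().sniff(f.read(4096)).delimiter
                f.seek(0)
                return pd.read_csv(f, sep=sep)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"cannot decode {path} as utf-8 or cp949")

# Demo on a semicolon-separated file (toy data).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("a;b\n1;2\n3;4\n")
df = read_csv_robust(tmp.name)
print(df.shape)   # (2, 2)
```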
2. Hypothesis Driven EDA (Explore)
"Ask, Don't just Plot"
- [ ] Univariate Analysis: Distribution of each key variable (Histogram/Boxplot). Check for Skewness/Kurtosis.
- [ ] Bivariate Analysis: Correlation matrix, Scatter plots for relationships.
- [ ] Statistical Validation:
- Normality Test: Shapiro-Wilk or K-S test.
- Significance: T-test/ANOVA for group differences.
- [ ] Insight Logging: Record the implication of every finding immediately.
3. Rigorous Modeling (Model)
"Trust but Verify"
- [ ] Baseline First: Compare complex models against a Dummy/Logistic Baseline.
- [ ] Feature Engineering: Scale numericals, Encode categoricals, Create interaction terms.
- [ ] Cross-Validation: Use Stratified K-Fold to prevent overfitting.
- [ ] Metric Selection: Optimize for business KPI (not just Accuracy).
- [ ] Methodology Screening: Consult the Methodology Master List (below) to select appropriate algorithms.
4. Interpretation (Interpret)
"Why did it predict that?"
- [ ] Feature Importance: SHAP values or Permutation Importance.
- [ ] Error Analysis: Manually inspect the "Top 10 Worst Errors".
- [ ] Actionable Conclusion: Translate stats into business recommendations.
Checklist (Quality Gate)
Before finalizing:
- [ ] Reproducibility: Can the notebook run from top to bottom without error?
- [ ] Storytelling: Does the notebook flow like a narrative?
- [ ] Visuals: Are all graphs labeled (Title, Axis, Legend)?
Appendix: Methodology Master List (Reference)
Scan these tables to select the most appropriate methodology for your data and goal.
1. Preprocessing & Data Cleaning
| Methodology | Usage / Purpose | Data Constraints |
| :--- | :--- | :--- |
| Simple Imputation | Missing value imputation (simple replacement) | Mean/Median (numeric), Mode (categorical) |
| KNN Imputation | Missing value imputation (similarity-based) | Mainly numeric; useful when correlations exist |
| Iterative Imputation | Missing value imputation (model-based) | High variable correlation; assumes MAR |
| One-Hot Encoding | Categorical to numeric | Nominal data, low cardinality |
| Label Encoding | Categorical to numeric | Ordinal data |
| Target Encoding | Categorical to numeric | High-cardinality features; risk of overfitting |
| Standard Scaler | Scaling (standardization) | Sensitive to outliers; assumes Gaussian distribution |
| MinMax Scaler | Scaling (normalization) | Bounded data; distribution-agnostic |
| Robust Scaler | Scaling (robust to outliers) | Data with many outliers (uses median/IQR) |
| SMOTE | Oversampling (imbalanced data) | Synthesizes minority-class samples (training set ONLY) |
| PCA | Dimensionality reduction, multicollinearity removal | Continuous variables; assumes linear relationships |
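The Standard Scaler vs. Robust Scaler rows above can be illustrated with an assumed outlier-laden column:

```python
# Sketch: StandardScaler vs RobustScaler on data with one extreme outlier.
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # 100 is an outlier

std = StandardScaler().fit_transform(X)   # uses mean and standard deviation
rob = RobustScaler().fit_transform(X)     # uses median and IQR

# The outlier inflates the mean/std, squashing the inliers under StandardScaler;
# RobustScaler keeps the inliers spread out and maps the median to 0.
print(std.ravel().round(2))
print(rob.ravel().round(2))
```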
2. Machine Learning Models
| Methodology | Type | Usage / Purpose | Constraints / Notes |
| :--- | :--- | :--- | :--- |
| Linear Regression | Regression | Baseline for regression | Linear relationship assumption |
| Logistic Regression | Classification | Baseline for classification | Linear separation assumption; large sparse data OK |
| SVM / SVR | Class/Reg | High accuracy in high-dimensional spaces | Computationally expensive (O(n^3)); scale-sensitive |
| K-Nearest Neighbors | Class/Reg | Instance-based learning, simple | Scale-sensitive; small data |
| Random Forest | Ensemble | Robust classification/regression | Handles mixed types; robust to outliers/missing values |
| XGBoost / LightGBM | Ensemble | High performance | Large datasets; handle missing values internally |
| CatBoost | Ensemble | Best for categorical features | Handles categories automatically; slower training |
| Isolation Forest | Anomaly Detection | Outlier/anomaly detection | High-dimensional data; efficient |
| K-Means | Clustering | Partitioning into K clusters | Spherical clusters; sensitive to outliers; scale-sensitive |
| DBSCAN | Clustering | Density-based clustering, detects outliers | Arbitrary shapes; scale-sensitive; choosing epsilon is hard |
| Hierarchical | Clustering | Dendrogram visualization | Computationally expensive for large data |
3. Deep Learning Models
| Methodology | Usage / Purpose | Data Constraints |
| :--- | :--- | :--- |
| CNN | Image/pattern recognition | Grid-like data (images, etc.) |
| RNN / LSTM | Sequence/time-series prediction | Sequential data |
| Transformer | NLP, complex pattern matching | Long sequences; large-scale data |
4. Validation & Optimization
| Methodology | Type | Usage / Purpose | Notes |
| :--- | :--- | :--- | :--- |
| Stratified K-Fold | Validation | Cross-validation (generalization) | Essential for imbalanced class distributions |
| K-Fold CV | Validation | Cross-validation | Sufficient data, balanced classes |
| Time Series Split | Validation | Cross-validation (temporal) | No future data leakage (essential for time series) |
| Grid Search | Tuning | Hyperparameter optimization | Small search space (exhaustive) |
| Bayesian Optimization | Tuning | Hyperparameter optimization | Large search space, high evaluation cost |
| Optuna | Tuning | Next-gen hyperparameter optimization | Efficient; define-by-run; pruning capabilities |
| L1 (Lasso) | Regularization | Sparse model, feature selection | When a sparse solution is needed |
| L2 (Ridge) | Regularization | Prevent overfitting, weight decay | When high multicollinearity exists |
| ElasticNet | Regularization | Combination of L1 and L2 | When both feature selection and regularization are needed |
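The L1 (Lasso) row above can be illustrated with a toy regression where only two of five features matter (the data is synthetic):

```python
# Sketch: L1 (Lasso) zeroing out uninformative coefficients (toy data).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# True model uses only features 0 and 1; the other three are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

coef = Lasso(alpha=0.1).fit(X, y).coef_
print(coef.round(2))   # sparse: the three noise features collapse to ~0
```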
5. Interpretation
| Methodology | Usage / Purpose | Notes |
| :--- | :--- | :--- |
| SHAP | Explain model predictions | Model-agnostic; TreeSHAP is especially fast for tree-based models |
Appendix: Evaluation Metrics Guide
Select metrics based on your problem type and business goal.
Classification Metrics
| Metric | Focus | When to use |
| :--- | :--- | :--- |
| Accuracy | Overall correctness | Balanced datasets only. Misleading for imbalanced data. |
| Precision | False positive reduction | When FP is costly (e.g., spam filter). |
| Recall | False negative reduction | When FN is critical (e.g., cancer diagnosis, fraud). |
| F1 Score | Balance | When you need a balance between Precision and Recall. |
| ROC-AUC | Ranking quality | When you need robust performance across thresholds. |
| Log Loss | Probability confidence | When the predicted probability value itself matters. |
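The warning in the Accuracy row can be demonstrated with assumed toy labels:

```python
# Sketch: why Accuracy misleads on imbalanced data (toy labels).
from sklearn.metrics import accuracy_score, f1_score, recall_score

y_true = [0] * 95 + [1] * 5        # 5% positive class (e.g. fraud)
y_pred = [0] * 100                 # a model that never flags anything

print(accuracy_score(y_true, y_pred))   # 0.95 — looks great
print(recall_score(y_true, y_pred))     # 0.0  — misses every positive case
print(f1_score(y_true, y_pred))         # 0.0
```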
Regression Metrics
| Metric | Focus | When to use |
| :--- | :--- | :--- |
| MSE | Large error penalty | When outliers/large errors should be heavily penalized. |
| RMSE | Interpretability | When you need error in the same unit as the target. |
| MAE | Robustness | When you want to be robust against outliers. |
| R2 Score | Explainability | To see how much variance is explained by the model. |
| MAPE | Business interpretability | Error in percentage (%). Easy for stakeholders. |
Clustering Metrics (Unsupervised)
| Metric | Focus | When to use |
| :--- | :--- | :--- |
| Silhouette Score | Cluster separation | To measure how similar an object is to its own cluster compared to other clusters. |
| Davies-Bouldin | Cluster compactness | Lower is better. Good for comparing clustering algorithms. |
| Elbow Method | Optimal K | To find the "elbow" point (optimal K) in K-Means. |