Data Analyst Workflow
A professional analysis workflow that uses the Python ecosystem (Jupyter, Pandas, Scikit-learn) to draw deep insights from data. It follows the OSEMN methodology and the standards in SKILL.md.
Step 0: Core Principles
- Explanation First: Before writing any code, describe in Markdown what you are analyzing and why.
- Interpretation Mandatory: Below every code cell's output, always add a Markdown cell that interprets in detail what the statistics or graphs mean.
  - Bad: a graph left standing alone with no commentary.
  - Good: "The histogram's long right tail (skewness > 1) suggests a log transform is needed."
Step 1: Environment & Goal Definition
- Context Loading: Load this document and review the **Core Principles**.
- Methodology Screening: Scan the **Methodology Master List** in SKILL.md. Compare the dataset's characteristics (type, shape, quality) against the analysis goal and shortlist at least three candidate methods.
  - [Check]: Is each candidate's selection rationale stated in one line? (e.g., "Many missing values, so Random Forest was chosen")
- Define Objective: Use resources/plan-template.md to write docs/plans/ANALYSIS_[Topic].md. Define **business impact (KPIs)**, not just statistics or accuracy, as the success criteria.
- Notebook Setup: Create or open docs/notebooks/[Topic]_Analysis.ipynb.
  - AI Readability: All statistics (`describe`, p-values) and graphs must remain in the notebook's output cells (No Hiding).
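The Notebook Setup step above can be sketched as a typical first cell. The seed value and display options here are illustrative assumptions, not mandated by SKILL.md:

```python
# Typical first cell of [Topic]_Analysis.ipynb: imports, seed, display options.
import numpy as np
import pandas as pd

RANDOM_STATE = 42            # fixed seed so the notebook reruns reproducibly
np.random.seed(RANDOM_STATE)

# Keep wide describe() output fully visible in the output cell (No Hiding).
pd.set_option("display.max_columns", 100)
pd.set_option("display.float_format", "{:.3f}".format)
```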
Step 2: Methodology Screening
"The Right Tool for the Job"
Select the best-fit techniques from the Methodology Master List in SKILL.md based on the data's characteristics, and record them in the Plan.
- Metric Selection: Choose evaluation metrics based on the business goal and whether the data is imbalanced. (e.g., Imbalanced → F1/Recall/PR-AUC; Regression → MAPE/RMSE)
- Preprocessing:
  - Outliers: Choose `RobustScaler` when there are many outliers.
  - High Cardinality: Consider `Target Encoding` when there are too many categories.
  - Imbalance: Decide whether to apply `SMOTE` or class weights.
- Modeling:
  - Baseline: Logistic/Linear Regression, Decision Tree.
  - Advanced: Choose among `CatBoost` (categorical), `XGBoost`/`LightGBM` (large data), `Isolation Forest` (anomaly), etc.
- Validation Strategy:
  - Time Series: Always use `Time Series Split`.
  - Imbalanced: `Stratified K-Fold` is mandatory.
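The validation-strategy rules above can be sketched with scikit-learn's splitters; the arrays below are toy assumptions for illustration:

```python
# Sketch: choosing a CV splitter by data type (toy data).
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 15 + [1] * 5)           # imbalanced labels (25% positive)

# Imbalanced classification: stratify so every fold keeps the class ratio.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    assert y[test_idx].sum() == 1           # each fold holds 1 of the 5 positives

# Time series: train indices always precede test indices (no future leakage).
tss = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tss.split(X):
    assert train_idx.max() < test_idx.min()
```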
Step 3: Data Loading & Quality Verification (Obtain & Scrub)
This step secures data reliability to prevent Garbage In, Garbage Out.
- Smart Loading:
  - Excel: Use `openpyxl` when formulas must be preserved, and take care not to hardcode calculated values.
  - CSV: Use automatic detection of encoding (`utf-8`, `cp949`) and delimiter.
- Data Profiling:
  - Check basic statistics (`describe()`), missing values (`isnull().sum()`), and data types (`dtypes`).
  - [Markdown]: Summarize the observed missing-value status and data types in text.
- Deep Sanity Check:
  - Inspect closely against the 'Logical Failures' items in SKILL.md.
  - Detect impossible values: identify data that defies common sense, such as negative ages (Age < 0) or future dates.
- Strategic Cleaning:
  - When handling missing values and outliers, apply the 'Imputation Methodology' screened earlier (Simple, KNN, Iterative).
  - State the treatment rule (drop vs. impute) and its rationale.
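A minimal sketch of the profiling, sanity-check, and cleaning steps above, on an assumed toy DataFrame (the `Age`/`SignupDate` columns are hypothetical, not from a real dataset):

```python
# Toy data: -3 is an impossible age, 2099-12-31 is a future date.
import pandas as pd

df = pd.DataFrame({
    "Age": [25, -3, 41, None],
    "SignupDate": pd.to_datetime(
        ["2020-01-01", "2021-06-15", "2099-12-31", "2019-03-02"]),
})

# Data Profiling: stats, missing counts, and dtypes stay in the output cell.
print(df.describe(include="all"))
print(df.isnull().sum())
print(df.dtypes)

# Deep Sanity Check: flag impossible values instead of silently dropping them.
bad_age = df["Age"] < 0
future_date = df["SignupDate"] > pd.Timestamp.today()
print(f"impossible ages: {bad_age.sum()}, future dates: {future_date.sum()}")

# Strategic Cleaning: document the rule (here: set to NaN, then median-impute).
df.loc[bad_age, "Age"] = None
df["Age"] = df["Age"].fillna(df["Age"].median())
```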
Step 4: Hypothesis-Driven EDA
Avoid a mere parade of graphs; maintain the flow Question → Visualization → Finding.
- Univariate Analysis: Check each variable's distribution, skewness, and kurtosis.
  - [Markdown]: Describe each variable's distribution and whether it has outliers.
- Bivariate Analysis: Visualize relationships between variables with a correlation matrix and scatter plots.
  - [Markdown]: Interpret the linear/non-linear relationships between variables.
- Statistical Validation:
  - Verify visual findings with statistical tests (normality test, t-test, ANOVA) to prove they are not mere coincidence.
  - [Markdown]: State whether the null hypothesis is rejected, citing the p-value.
- Insight Logging:
  - Immediately record how each important fact discovered mid-analysis may affect business KPIs.
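The Statistical Validation step above can be sketched with SciPy on synthetic data (the group means, spread, and sizes here are assumptions):

```python
# Sketch: backing a visual finding with statistical tests (synthetic groups).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_a = rng.normal(loc=100, scale=10, size=200)   # e.g. control group
group_b = rng.normal(loc=105, scale=10, size=200)   # e.g. treatment group

# Normality Test (Shapiro-Wilk) decides parametric vs non-parametric tests.
_, p_norm = stats.shapiro(group_a)

# Significance: two-sample t-test for a difference in group means.
t_stat, p_val = stats.ttest_ind(group_a, group_b)
print(f"normality p={p_norm:.3f}, t-test p={p_val:.2e}")
# [Markdown] If p < 0.05, reject H0: the group difference is unlikely to be chance.
```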
Step 5: Model & Interpret [Optional]
- Baseline Model:
  - Set a performance floor with the simplest model (Dummy, Linear) and confirm later models beat it.
- Feature Engineering:
  - Apply techniques from the SKILL.md list (Encoding, Scaling, PCA).
- Advanced Modeling & Tuning:
  - Model Selection: Train the selected high-performance models (XGBoost, CatBoost, RF, etc.).
  - Hyperparameter Tuning: Go beyond plain Grid Search; use tools such as `Optuna` for efficient optimization.
  - Clustering: When using K-Means or similar, determine the optimal number of clusters (K) with the `Elbow Method` or `Silhouette Score`.
- Rigorous Validation:
  - Compare the `Cross Validation` score with the `Hold-out Test` score to detect overfitting. (Warn if Gap > 5%)
  - Use Stratified K-Fold to prevent overfitting and assess generalization performance.
- Interpretation & Error Analysis:
  - Feature Importance: Explain the basis of the model's decisions with `SHAP`, `Permutation Importance`, etc.
  - Error Analysis: Manually inspect the 'Top 10 Worst Errors' (the 10 samples the model got most wrong) to find the causes.
- Reporting:
  - Translate technical metrics (Accuracy, MSE) into business language (expected revenue, risk reduction) in the report.
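The baseline-first and overfitting gap-check rules above can be sketched on a synthetic dataset (the 5% gap threshold follows the text; the data and model choices are illustrative assumptions):

```python
# Sketch: baseline floor + CV-vs-hold-out gap check (toy imbalanced data).
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)

X, y = make_classification(n_samples=500, weights=[0.8], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Baseline Model: any real model must beat this floor.
base = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
floor = base.score(X_te, y_te)

# Advanced model with stratified CV; compare CV mean vs hold-out score.
model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
cv = cross_val_score(model, X_tr, y_tr,
                     cv=StratifiedKFold(n_splits=5, shuffle=True,
                                        random_state=42))
holdout = model.score(X_te, y_te)
gap = cv.mean() - holdout
print(f"baseline={floor:.2f}, cv={cv.mean():.2f}, "
      f"holdout={holdout:.2f}, gap={gap:.2%}")
if abs(gap) > 0.05:                        # Gap > 5% → possible overfitting
    print("WARNING: CV/hold-out gap exceeds 5%")
```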
Standards & Rules
Data Analyst Standards (OSEMN)
Purpose
To transform raw data into actionable business insights using a rigorous, hypothesis-driven approach.
Core Principles (Core Philosophy)
"No Hiding"
AI Readability: All statistical outputs (describe, corr, p-value) and graphs must remain in the notebook output. This ensures tools like NotebookLM can contextually understand the analysis.
Quality Standards (Tier 1 Best Practices)
1. Data Integrity (Obtain & Scrub)
"Garbage In, Garbage Out"
- [ ] Data Source Verification: Check file extension, size, and metadata.
- [ ] Data Quality Check:
- Missing Values: Identify mechanism (MCAR, MAR, MNAR) before imputing.
- Logical Failures: Check for impossible values (e.g., age < 0, future dates), Data Leakage, and Overfitting indicators.
- Data Types: Ensure numeric cols are not strings, etc.
- [ ] File Handling Standards:
  - Excel (.xlsx): Preserve existing formatting/formulas. Use `openpyxl` for editing, `pandas` for reading. Zero hardcoding of calculated values.
  - CSV (.csv): Detect the delimiter automatically (`csv.Sniffer`). Handle encoding errors (utf-8 vs cp949) explicitly.
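A sketch of the CSV standard above, wrapped in a helper whose name (`read_csv_robust`) is illustrative, not part of SKILL.md:

```python
# Sketch: delimiter sniffing plus explicit encoding fallback for CSV loading.
import csv
import tempfile

import pandas as pd

def read_csv_robust(path: str) -> pd.DataFrame:
    """Try utf-8 first, then cp949; sniff the delimiter from a small sample."""
    for enc in ("utf-8", "cp949"):
        try:
            with open(path, encoding=enc) as f:
                sep = csv.Sniffer().sniff(f.read(4096)).delimiter
                f.seek(0)
                return pd.read_csv(f, sep=sep)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"cannot decode {path} as utf-8 or cp949")

# Demo on a semicolon-separated file (toy data).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("a;b\n1;2\n3;4\n")
df = read_csv_robust(tmp.name)
print(df.shape)   # (2, 2)
```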
2. Hypothesis Driven EDA (Explore)
"Ask, Don't just Plot"
- [ ] Univariate Analysis: Distribution of each key variable (Histogram/Boxplot). Check for Skewness/Kurtosis.
- [ ] Bivariate Analysis: Correlation matrix, Scatter plots for relationships.
- [ ] Statistical Validation:
- Normality Test: Shapiro-Wilk or K-S test.
- Significance: T-test/ANOVA for group differences.
- [ ] Insight Logging: Record the implication of every finding immediately.
3. Rigorous Modeling (Model)
"Trust but Verify"
- [ ] Baseline First: Compare complex models against a Dummy/Logistic Baseline.
- [ ] Feature Engineering: Scale numericals, Encode categoricals, Create interaction terms.
- [ ] Cross-Validation: Use Stratified K-Fold to prevent overfitting.
- [ ] Metric Selection: Optimize for business KPI (not just Accuracy).
- [ ] Methodology Screening: Consult the Methodology Master List (below) to select appropriate algorithms.
4. Interpretation (Interpret)
"Why did it predict that?"
- [ ] Feature Importance: SHAP values or Permutation Importance.
- [ ] Error Analysis: Manually inspect the "Top 10 Worst Errors".
- [ ] Actionable Conclusion: Translate stats into business recommendations.
Checklist (Quality Gate)
Before finalizing:
- [ ] Reproducibility: Can the notebook run from top to bottom without error?
- [ ] Storytelling: Does the notebook flow like a narrative?
- [ ] Visuals: Are all graphs labeled (Title, Axis, Legend)?
Appendix: Methodology Master List (Reference)
Scan these tables to select the most appropriate methodology for your data and goal.
1. Preprocessing & Data Cleaning
| Methodology | Usage / Purpose | Data Constraints |
| :--- | :--- | :--- |
| Simple Imputation | Missing value imputation (simple replacement) | Mean/Median (numeric), Mode (categorical) |
| KNN Imputation | Missing value imputation (similarity-based) | Mainly numeric; useful when correlations exist |
| Iterative Imputation | Missing value imputation (model-based) | High variable correlation; assumes MAR |
| One-Hot Encoding | Categorical to numeric | Nominal data, low cardinality |
| Label Encoding | Categorical to numeric | Ordinal data |
| Target Encoding | Categorical to numeric | High-cardinality features; risk of overfitting |
| Standard Scaler | Scaling (standardization) | Sensitive to outliers; assumes Gaussian distribution |
| MinMax Scaler | Scaling (normalization) | Bounded data; distribution-agnostic |
| Robust Scaler | Scaling (robust to outliers) | Data with many outliers (uses median/IQR) |
| SMOTE | Oversampling (imbalanced data) | Synthesizes minority-class samples (training set ONLY) |
| PCA | Dimensionality reduction, multicollinearity removal | Continuous variables; assumes linear relationships |
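The Standard Scaler vs. Robust Scaler rows above can be illustrated with an assumed outlier-laden column:

```python
# Sketch: StandardScaler vs RobustScaler on data with one extreme outlier.
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # 100 is an outlier

std = StandardScaler().fit_transform(X)   # uses mean and standard deviation
rob = RobustScaler().fit_transform(X)     # uses median and IQR

# The outlier inflates the mean/std, squashing the inliers under StandardScaler;
# RobustScaler keeps the inliers spread out and maps the median to 0.
print(std.ravel().round(2))
print(rob.ravel().round(2))
```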
2. Machine Learning Models
| Methodology | Type | Usage / Purpose | Constraints / Notes |
| :--- | :--- | :--- | :--- |
| Linear Regression | Regression | Baseline for regression | Linear relationship assumption |
| Logistic Regression | Classification | Baseline for classification | Linear separation assumption; large sparse data OK |
| SVM / SVR | Class/Reg | High accuracy in high-dimensional spaces | Computationally expensive (O(n^3)); scale-sensitive |
| K-Nearest Neighbors | Class/Reg | Instance-based learning, simple | Scale-sensitive; small data |
| Random Forest | Ensemble | Robust classification/regression | Handles mixed types; robust to outliers/missing values |
| XGBoost / LightGBM | Ensemble | High performance | Large datasets; handle missing values internally |
| CatBoost | Ensemble | Best for categorical features | Handles categories automatically; slower training |
| Isolation Forest | Anomaly Detection | Outlier/anomaly detection | High-dimensional data; efficient |
| K-Means | Clustering | Partitioning into K clusters | Spherical clusters; sensitive to outliers; scale-sensitive |
| DBSCAN | Clustering | Density-based clustering, detects outliers | Arbitrary shapes; scale-sensitive; choosing epsilon is hard |
| Hierarchical | Clustering | Dendrogram visualization | Computationally expensive for large data |
3. Deep Learning Models
| Methodology | Usage / Purpose | Data Constraints |
| :--- | :--- | :--- |
| CNN | Image/pattern recognition | Grid-like data (images, etc.) |
| RNN / LSTM | Sequence/time-series prediction | Sequential data |
| Transformer | NLP, complex pattern matching | Long sequences; large-scale data |
4. Validation & Optimization
| Methodology | Type | Usage / Purpose | Notes |
| :--- | :--- | :--- | :--- |
| Stratified K-Fold | Validation | Cross-validation (generalization) | Essential for imbalanced class distributions |
| K-Fold CV | Validation | Cross-validation | Sufficient data, balanced classes |
| Time Series Split | Validation | Cross-validation (temporal) | No future data leakage (essential for time series) |
| Grid Search | Tuning | Hyperparameter optimization | Small search space (exhaustive) |
| Bayesian Optimization | Tuning | Hyperparameter optimization | Large search space, high evaluation cost |
| Optuna | Tuning | Next-gen hyperparameter optimization | Efficient; define-by-run; pruning capabilities |
| L1 (Lasso) | Regularization | Sparse model, feature selection | When a sparse solution is needed |
| L2 (Ridge) | Regularization | Prevent overfitting, weight decay | When high multicollinearity exists |
| ElasticNet | Regularization | Combination of L1 and L2 | When both feature selection and regularization are needed |
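The L1 (Lasso) row above can be illustrated with a toy regression where only two of five features matter (the data is synthetic):

```python
# Sketch: L1 (Lasso) zeroing out uninformative coefficients (toy data).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# True model uses only features 0 and 1; the other three are pure noise.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

coef = Lasso(alpha=0.1).fit(X, y).coef_
print(coef.round(2))   # sparse: the three noise features collapse to ~0
```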
5. Interpretation
| Methodology | Usage / Purpose | Notes |
| :--- | :--- | :--- |
| SHAP | Explain model predictions | Model-agnostic; TreeSHAP is especially fast for tree-based models |
Appendix: Evaluation Metrics Guide
Select metrics based on your problem type and business goal.
Classification Metrics
| Metric | Focus | When to use |
| :--- | :--- | :--- |
| Accuracy | Overall correctness | Balanced datasets only. Misleading for imbalanced data. |
| Precision | False positive reduction | When FP is costly (e.g., spam filter). |
| Recall | False negative reduction | When FN is critical (e.g., cancer diagnosis, fraud). |
| F1 Score | Balance | When you need a balance between Precision and Recall. |
| ROC-AUC | Ranking quality | When you need robust performance across thresholds. |
| Log Loss | Probability confidence | When the predicted probability value itself matters. |
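The warning in the Accuracy row can be demonstrated with assumed toy labels:

```python
# Sketch: why Accuracy misleads on imbalanced data (toy labels).
from sklearn.metrics import accuracy_score, f1_score, recall_score

y_true = [0] * 95 + [1] * 5        # 5% positive class (e.g. fraud)
y_pred = [0] * 100                 # a model that never flags anything

print(accuracy_score(y_true, y_pred))   # 0.95 — looks great
print(recall_score(y_true, y_pred))     # 0.0  — misses every positive case
print(f1_score(y_true, y_pred))         # 0.0
```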
Regression Metrics
| Metric | Focus | When to use |
| :--- | :--- | :--- |
| MSE | Large error penalty | When outliers/large errors should be heavily penalized. |
| RMSE | Interpretability | When you need error in the same unit as the target. |
| MAE | Robustness | When you want to be robust against outliers. |
| R2 Score | Explainability | To see how much variance is explained by the model. |
| MAPE | Business interpretability | Error in percentage (%). Easy for stakeholders. |
Clustering Metrics (Unsupervised)
| Metric | Focus | When to use |
| :--- | :--- | :--- |
| Silhouette Score | Cluster separation | To measure how similar an object is to its own cluster compared to other clusters. |
| Davies-Bouldin | Cluster compactness | Lower is better. Good for comparing clustering algorithms. |
| Elbow Method | Optimal K | To find the "elbow" point (optimal K) in K-Means. |