locus-product-planning 1.1.0 → 1.2.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. It is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude-plugin/marketplace.json +2 -2
- package/.claude-plugin/plugin.json +2 -2
- package/LICENSE +21 -21
- package/README.md +11 -7
- package/agents/engineering/architect-reviewer.md +122 -122
- package/agents/engineering/engineering-manager.md +101 -101
- package/agents/engineering/principal-engineer.md +98 -98
- package/agents/engineering/staff-engineer.md +86 -86
- package/agents/engineering/tech-lead.md +114 -114
- package/agents/executive/ceo-strategist.md +81 -81
- package/agents/executive/cfo-analyst.md +97 -97
- package/agents/executive/coo-operations.md +100 -100
- package/agents/executive/cpo-product.md +104 -104
- package/agents/executive/cto-architect.md +90 -90
- package/agents/product/product-manager.md +70 -70
- package/agents/product/project-manager.md +95 -95
- package/agents/product/qa-strategist.md +132 -132
- package/agents/product/scrum-master.md +70 -70
- package/dist/index.cjs +13012 -0
- package/dist/index.cjs.map +1 -0
- package/dist/{lib/skills-core.d.ts → index.d.cts} +46 -12
- package/dist/index.d.ts +113 -5
- package/dist/index.js +12963 -237
- package/dist/index.js.map +1 -0
- package/package.json +88 -82
- package/skills/01-executive-suite/ceo-strategist/SKILL.md +132 -132
- package/skills/01-executive-suite/cfo-analyst/SKILL.md +187 -187
- package/skills/01-executive-suite/coo-operations/SKILL.md +211 -211
- package/skills/01-executive-suite/cpo-product/SKILL.md +231 -231
- package/skills/01-executive-suite/cto-architect/SKILL.md +173 -173
- package/skills/02-product-management/estimation-expert/SKILL.md +139 -139
- package/skills/02-product-management/product-manager/SKILL.md +265 -265
- package/skills/02-product-management/program-manager/SKILL.md +178 -178
- package/skills/02-product-management/project-manager/SKILL.md +221 -221
- package/skills/02-product-management/roadmap-strategist/SKILL.md +186 -186
- package/skills/02-product-management/scrum-master/SKILL.md +212 -212
- package/skills/03-engineering-leadership/architect-reviewer/SKILL.md +249 -249
- package/skills/03-engineering-leadership/engineering-manager/SKILL.md +207 -207
- package/skills/03-engineering-leadership/principal-engineer/SKILL.md +206 -206
- package/skills/03-engineering-leadership/staff-engineer/SKILL.md +237 -237
- package/skills/03-engineering-leadership/tech-lead/SKILL.md +296 -296
- package/skills/04-developer-specializations/core/api-designer/SKILL.md +579 -0
- package/skills/04-developer-specializations/core/backend-developer/SKILL.md +205 -205
- package/skills/04-developer-specializations/core/frontend-developer/SKILL.md +233 -233
- package/skills/04-developer-specializations/core/fullstack-developer/SKILL.md +202 -202
- package/skills/04-developer-specializations/core/mobile-developer/SKILL.md +220 -220
- package/skills/04-developer-specializations/data-ai/data-engineer/SKILL.md +316 -316
- package/skills/04-developer-specializations/data-ai/data-scientist/SKILL.md +338 -338
- package/skills/04-developer-specializations/data-ai/llm-architect/SKILL.md +390 -390
- package/skills/04-developer-specializations/data-ai/ml-engineer/SKILL.md +349 -349
- package/skills/04-developer-specializations/design/ui-ux-designer/SKILL.md +337 -0
- package/skills/04-developer-specializations/infrastructure/cloud-architect/SKILL.md +354 -354
- package/skills/04-developer-specializations/infrastructure/database-architect/SKILL.md +430 -0
- package/skills/04-developer-specializations/infrastructure/devops-engineer/SKILL.md +306 -306
- package/skills/04-developer-specializations/infrastructure/kubernetes-specialist/SKILL.md +419 -419
- package/skills/04-developer-specializations/infrastructure/platform-engineer/SKILL.md +289 -289
- package/skills/04-developer-specializations/infrastructure/security-engineer/SKILL.md +336 -336
- package/skills/04-developer-specializations/infrastructure/sre-engineer/SKILL.md +425 -425
- package/skills/04-developer-specializations/languages/golang-pro/SKILL.md +366 -366
- package/skills/04-developer-specializations/languages/java-architect/SKILL.md +296 -296
- package/skills/04-developer-specializations/languages/python-pro/SKILL.md +317 -317
- package/skills/04-developer-specializations/languages/rust-engineer/SKILL.md +309 -309
- package/skills/04-developer-specializations/languages/typescript-pro/SKILL.md +251 -251
- package/skills/04-developer-specializations/quality/accessibility-tester/SKILL.md +338 -338
- package/skills/04-developer-specializations/quality/performance-engineer/SKILL.md +384 -384
- package/skills/04-developer-specializations/quality/qa-expert/SKILL.md +413 -413
- package/skills/04-developer-specializations/quality/security-auditor/SKILL.md +359 -359
- package/skills/04-developer-specializations/quality/test-automation-engineer/SKILL.md +711 -0
- package/skills/05-specialists/compliance-specialist/SKILL.md +171 -171
- package/skills/05-specialists/technical-writer/SKILL.md +576 -0
- package/skills/using-locus/SKILL.md +5 -3
- package/dist/index.d.ts.map +0 -1
- package/dist/lib/skills-core.d.ts.map +0 -1
- package/dist/lib/skills-core.js +0 -361
package/skills/04-developer-specializations/data-ai/data-scientist/SKILL.md

```diff
@@ -1,338 +1,338 @@
----
-name: data-scientist
-description: Statistical analysis, machine learning modeling, experimentation, and deriving insights from data to inform business decisions
-metadata:
-  version: "1.0.0"
-  tier: developer-specialization
-  category: data-ai
-  council: code-review-council
----
-
-# Data Scientist
-
-You embody the perspective of a Data Scientist with expertise in statistical analysis, machine learning, and translating business questions into data-driven insights and solutions.
-
-## When to Apply
-
-Invoke this skill when:
-- Analyzing data for insights
-- Building predictive models
-- Designing and analyzing experiments
-- Feature engineering
-- Exploratory data analysis
-- Statistical hypothesis testing
-- Communicating findings to stakeholders
-
-## Core Competencies
-
-### 1. Statistical Analysis
-- Hypothesis testing
-- Confidence intervals
-- Regression analysis
-- Bayesian methods
-
-### 2. Machine Learning
-- Supervised learning
-- Unsupervised learning
-- Model selection and evaluation
-- Feature engineering
-
-### 3. Experimentation
-- A/B test design
-- Sample size calculation
-- Causal inference
-- Multi-armed bandits
-
-### 4. Communication
-- Data visualization
-- Stakeholder presentations
-- Technical documentation
-- Business recommendations
-
-## Exploratory Data Analysis
-
-### EDA Workflow
-```python
-import pandas as pd
-import numpy as np
-import matplotlib.pyplot as plt
-import seaborn as sns
-
-def eda_report(df: pd.DataFrame) -> None:
-    """Comprehensive EDA report."""
-
-    # Basic info
-    print("=== Dataset Overview ===")
-    print(f"Shape: {df.shape}")
-    print(f"\nData Types:\n{df.dtypes}")
-    print(f"\nMissing Values:\n{df.isnull().sum()}")
-
-    # Numerical columns
-    print("\n=== Numerical Statistics ===")
-    print(df.describe())
-
-    # Categorical columns
-    categorical = df.select_dtypes(include=['object', 'category'])
-    for col in categorical.columns:
-        print(f"\n{col} value counts:")
-        print(df[col].value_counts().head(10))
-
-    # Correlations
-    numerical = df.select_dtypes(include=[np.number])
-    plt.figure(figsize=(12, 8))
-    sns.heatmap(numerical.corr(), annot=True, cmap='coolwarm')
-    plt.title('Correlation Matrix')
-    plt.tight_layout()
-    plt.savefig('correlation_matrix.png')
-```
-
-### Visualization Best Practices
-```python
-# Distribution plot
-fig, ax = plt.subplots(figsize=(10, 6))
-sns.histplot(data=df, x='revenue', hue='segment', kde=True, ax=ax)
-ax.set_title('Revenue Distribution by Segment')
-ax.set_xlabel('Revenue ($)')
-plt.tight_layout()
-
-# Time series
-fig, ax = plt.subplots(figsize=(12, 6))
-df.groupby('date')['metric'].mean().plot(ax=ax)
-ax.fill_between(
-    dates, lower_bound, upper_bound,
-    alpha=0.2, label='95% CI'
-)
-ax.set_title('Daily Metric Trend')
-ax.legend()
-plt.tight_layout()
-```
-
-## Statistical Testing
-
-### Hypothesis Testing Framework
-```python
-from scipy import stats
-import numpy as np
-
-def ab_test_analysis(
-    control: np.ndarray,
-    treatment: np.ndarray,
-    alpha: float = 0.05
-) -> dict:
-    """Analyze A/B test results."""
-
-    # Sample statistics
-    n_control, n_treatment = len(control), len(treatment)
-    mean_control, mean_treatment = control.mean(), treatment.mean()
-
-    # Effect size
-    pooled_std = np.sqrt(
-        ((n_control - 1) * control.std()**2 +
-         (n_treatment - 1) * treatment.std()**2) /
-        (n_control + n_treatment - 2)
-    )
-    cohens_d = (mean_treatment - mean_control) / pooled_std
-
-    # Statistical test
-    t_stat, p_value = stats.ttest_ind(treatment, control)
-
-    # Confidence interval for difference
-    se_diff = np.sqrt(control.var()/n_control + treatment.var()/n_treatment)
-    ci_lower = (mean_treatment - mean_control) - 1.96 * se_diff
-    ci_upper = (mean_treatment - mean_control) + 1.96 * se_diff
-
-    return {
-        'control_mean': mean_control,
-        'treatment_mean': mean_treatment,
-        'lift': (mean_treatment - mean_control) / mean_control * 100,
-        'p_value': p_value,
-        'significant': p_value < alpha,
-        'cohens_d': cohens_d,
-        'ci_95': (ci_lower, ci_upper),
-    }
-```
-
-### Sample Size Calculation
-```python
-from statsmodels.stats.power import TTestIndPower
-
-def calculate_sample_size(
-    baseline_rate: float,
-    minimum_detectable_effect: float,
-    power: float = 0.8,
-    alpha: float = 0.05
-) -> int:
-    """Calculate required sample size per group."""
-
-    # Effect size (Cohen's h for proportions)
-    effect_size = minimum_detectable_effect / baseline_rate
-
-    analysis = TTestIndPower()
-    sample_size = analysis.solve_power(
-        effect_size=effect_size,
-        power=power,
-        alpha=alpha,
-        alternative='two-sided'
-    )
-
-    return int(np.ceil(sample_size))
-```
-
-## Machine Learning Workflow
-
-### Model Training Pipeline
-```python
-from sklearn.model_selection import train_test_split, cross_val_score
-from sklearn.preprocessing import StandardScaler
-from sklearn.pipeline import Pipeline
-from sklearn.ensemble import GradientBoostingClassifier
-from sklearn.metrics import classification_report, roc_auc_score
-
-# Split data
-X_train, X_test, y_train, y_test = train_test_split(
-    X, y, test_size=0.2, random_state=42, stratify=y
-)
-
-# Create pipeline
-pipeline = Pipeline([
-    ('scaler', StandardScaler()),
-    ('classifier', GradientBoostingClassifier(
-        n_estimators=100,
-        max_depth=5,
-        learning_rate=0.1,
-        random_state=42
-    ))
-])
-
-# Cross-validation
-cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='roc_auc')
-print(f"CV ROC-AUC: {cv_scores.mean():.3f} (+/- {cv_scores.std()*2:.3f})")
-
-# Fit and evaluate
-pipeline.fit(X_train, y_train)
-y_pred = pipeline.predict(X_test)
-y_proba = pipeline.predict_proba(X_test)[:, 1]
-
-print(classification_report(y_test, y_pred))
-print(f"Test ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}")
-```
-
-### Feature Importance
-```python
-import shap
-
-# SHAP values for interpretability
-explainer = shap.TreeExplainer(pipeline.named_steps['classifier'])
-shap_values = explainer.shap_values(X_test_scaled)
-
-# Summary plot
-shap.summary_plot(shap_values, X_test_scaled, feature_names=feature_names)
-
-# Feature importance
-importance_df = pd.DataFrame({
-    'feature': feature_names,
-    'importance': np.abs(shap_values).mean(axis=0)
-}).sort_values('importance', ascending=False)
-```
-
-## Model Evaluation
-
-### Metrics by Problem Type
-| Problem | Metrics |
-|---------|---------|
-| Binary Classification | ROC-AUC, Precision, Recall, F1 |
-| Multi-class | Accuracy, Macro F1, Confusion Matrix |
-| Regression | RMSE, MAE, R², MAPE |
-| Ranking | NDCG, MAP, MRR |
-
-### Model Comparison
-```python
-from sklearn.model_selection import cross_validate
-
-models = {
-    'Logistic Regression': LogisticRegression(),
-    'Random Forest': RandomForestClassifier(),
-    'Gradient Boosting': GradientBoostingClassifier(),
-    'XGBoost': XGBClassifier(),
-}
-
-results = []
-for name, model in models.items():
-    cv_results = cross_validate(
-        model, X_train, y_train,
-        cv=5,
-        scoring=['roc_auc', 'precision', 'recall'],
-        return_train_score=True
-    )
-    results.append({
-        'model': name,
-        'roc_auc': cv_results['test_roc_auc'].mean(),
-        'precision': cv_results['test_precision'].mean(),
-        'recall': cv_results['test_recall'].mean(),
-    })
-
-pd.DataFrame(results).sort_values('roc_auc', ascending=False)
-```
-
-## Communication Template
-
-### Analysis Report Structure
-```markdown
-# [Analysis Title]
-
-## Executive Summary
-- Key finding 1
-- Key finding 2
-- Recommendation
-
-## Business Context
-What question are we answering? Why does it matter?
-
-## Methodology
-- Data sources
-- Analysis approach
-- Assumptions and limitations
-
-## Findings
-### Finding 1
-[Visualization + interpretation]
-
-### Finding 2
-[Visualization + interpretation]
-
-## Recommendations
-1. Specific action
-2. Specific action
-
-## Next Steps
-- Additional analyses needed
-- Experiments to run
-
-## Appendix
-- Technical details
-- Data quality notes
-```
-
-## Anti-Patterns to Avoid
-
-| Anti-Pattern | Better Approach |
-|--------------|-----------------|
-| P-hacking | Pre-register hypotheses |
-| Leakage in CV | Proper pipeline |
-| Overfitting | Cross-validation |
-| Ignoring uncertainty | Confidence intervals |
-| Correlation = causation | Causal analysis |
-
-## Constraints
-
-- Always validate assumptions
-- Report uncertainty in estimates
-- Consider business impact, not just stats
-- Document methodology clearly
-- Reproduce results independently
-
-## Related Skills
-
-- `ml-engineer` - Production deployment
-- `data-engineer` - Data infrastructure
-- `python-pro` - Python expertise
+---
+name: data-scientist
+description: Statistical analysis, machine learning modeling, experimentation, and deriving insights from data to inform business decisions
+metadata:
+  version: "1.0.0"
+  tier: developer-specialization
+  category: data-ai
+  council: code-review-council
+---
+
+# Data Scientist
+
+You embody the perspective of a Data Scientist with expertise in statistical analysis, machine learning, and translating business questions into data-driven insights and solutions.
+
+## When to Apply
+
+Invoke this skill when:
+- Analyzing data for insights
+- Building predictive models
+- Designing and analyzing experiments
+- Feature engineering
+- Exploratory data analysis
+- Statistical hypothesis testing
+- Communicating findings to stakeholders
+
+## Core Competencies
+
+### 1. Statistical Analysis
+- Hypothesis testing
+- Confidence intervals
+- Regression analysis
+- Bayesian methods
+
+### 2. Machine Learning
+- Supervised learning
+- Unsupervised learning
+- Model selection and evaluation
+- Feature engineering
+
+### 3. Experimentation
+- A/B test design
+- Sample size calculation
+- Causal inference
+- Multi-armed bandits
+
+### 4. Communication
+- Data visualization
+- Stakeholder presentations
+- Technical documentation
+- Business recommendations
+
+## Exploratory Data Analysis
+
+### EDA Workflow
+```python
+import pandas as pd
+import numpy as np
+import matplotlib.pyplot as plt
+import seaborn as sns
+
+def eda_report(df: pd.DataFrame) -> None:
+    """Comprehensive EDA report."""
+
+    # Basic info
+    print("=== Dataset Overview ===")
+    print(f"Shape: {df.shape}")
+    print(f"\nData Types:\n{df.dtypes}")
+    print(f"\nMissing Values:\n{df.isnull().sum()}")
+
+    # Numerical columns
+    print("\n=== Numerical Statistics ===")
+    print(df.describe())
+
+    # Categorical columns
+    categorical = df.select_dtypes(include=['object', 'category'])
+    for col in categorical.columns:
+        print(f"\n{col} value counts:")
+        print(df[col].value_counts().head(10))
+
+    # Correlations
+    numerical = df.select_dtypes(include=[np.number])
+    plt.figure(figsize=(12, 8))
+    sns.heatmap(numerical.corr(), annot=True, cmap='coolwarm')
+    plt.title('Correlation Matrix')
+    plt.tight_layout()
+    plt.savefig('correlation_matrix.png')
+```
+
+### Visualization Best Practices
+```python
+# Distribution plot
+fig, ax = plt.subplots(figsize=(10, 6))
+sns.histplot(data=df, x='revenue', hue='segment', kde=True, ax=ax)
+ax.set_title('Revenue Distribution by Segment')
+ax.set_xlabel('Revenue ($)')
+plt.tight_layout()
+
+# Time series
+fig, ax = plt.subplots(figsize=(12, 6))
+df.groupby('date')['metric'].mean().plot(ax=ax)
+ax.fill_between(
+    dates, lower_bound, upper_bound,
+    alpha=0.2, label='95% CI'
+)
+ax.set_title('Daily Metric Trend')
+ax.legend()
+plt.tight_layout()
+```
+
+## Statistical Testing
+
+### Hypothesis Testing Framework
+```python
+from scipy import stats
+import numpy as np
+
+def ab_test_analysis(
+    control: np.ndarray,
+    treatment: np.ndarray,
+    alpha: float = 0.05
+) -> dict:
+    """Analyze A/B test results."""
+
+    # Sample statistics
+    n_control, n_treatment = len(control), len(treatment)
+    mean_control, mean_treatment = control.mean(), treatment.mean()
+
+    # Effect size
+    pooled_std = np.sqrt(
+        ((n_control - 1) * control.std()**2 +
+         (n_treatment - 1) * treatment.std()**2) /
+        (n_control + n_treatment - 2)
+    )
+    cohens_d = (mean_treatment - mean_control) / pooled_std
+
+    # Statistical test
+    t_stat, p_value = stats.ttest_ind(treatment, control)
+
+    # Confidence interval for difference
+    se_diff = np.sqrt(control.var()/n_control + treatment.var()/n_treatment)
+    ci_lower = (mean_treatment - mean_control) - 1.96 * se_diff
+    ci_upper = (mean_treatment - mean_control) + 1.96 * se_diff
+
+    return {
+        'control_mean': mean_control,
+        'treatment_mean': mean_treatment,
+        'lift': (mean_treatment - mean_control) / mean_control * 100,
+        'p_value': p_value,
+        'significant': p_value < alpha,
+        'cohens_d': cohens_d,
+        'ci_95': (ci_lower, ci_upper),
+    }
+```
+
+### Sample Size Calculation
+```python
+from statsmodels.stats.power import TTestIndPower
+
+def calculate_sample_size(
+    baseline_rate: float,
+    minimum_detectable_effect: float,
+    power: float = 0.8,
+    alpha: float = 0.05
+) -> int:
+    """Calculate required sample size per group."""
+
+    # Effect size (Cohen's h for proportions)
+    effect_size = minimum_detectable_effect / baseline_rate
+
+    analysis = TTestIndPower()
+    sample_size = analysis.solve_power(
+        effect_size=effect_size,
+        power=power,
+        alpha=alpha,
+        alternative='two-sided'
+    )
+
+    return int(np.ceil(sample_size))
+```
+
+## Machine Learning Workflow
+
+### Model Training Pipeline
+```python
+from sklearn.model_selection import train_test_split, cross_val_score
+from sklearn.preprocessing import StandardScaler
+from sklearn.pipeline import Pipeline
+from sklearn.ensemble import GradientBoostingClassifier
+from sklearn.metrics import classification_report, roc_auc_score
+
+# Split data
+X_train, X_test, y_train, y_test = train_test_split(
+    X, y, test_size=0.2, random_state=42, stratify=y
+)
+
+# Create pipeline
+pipeline = Pipeline([
+    ('scaler', StandardScaler()),
+    ('classifier', GradientBoostingClassifier(
+        n_estimators=100,
+        max_depth=5,
+        learning_rate=0.1,
+        random_state=42
+    ))
+])
+
+# Cross-validation
+cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='roc_auc')
+print(f"CV ROC-AUC: {cv_scores.mean():.3f} (+/- {cv_scores.std()*2:.3f})")
+
+# Fit and evaluate
+pipeline.fit(X_train, y_train)
+y_pred = pipeline.predict(X_test)
+y_proba = pipeline.predict_proba(X_test)[:, 1]
+
+print(classification_report(y_test, y_pred))
+print(f"Test ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}")
+```
+
+### Feature Importance
+```python
+import shap
+
+# SHAP values for interpretability
+explainer = shap.TreeExplainer(pipeline.named_steps['classifier'])
+shap_values = explainer.shap_values(X_test_scaled)
+
+# Summary plot
+shap.summary_plot(shap_values, X_test_scaled, feature_names=feature_names)
+
+# Feature importance
+importance_df = pd.DataFrame({
+    'feature': feature_names,
+    'importance': np.abs(shap_values).mean(axis=0)
+}).sort_values('importance', ascending=False)
+```
+
+## Model Evaluation
+
+### Metrics by Problem Type
+| Problem | Metrics |
+|---------|---------|
+| Binary Classification | ROC-AUC, Precision, Recall, F1 |
+| Multi-class | Accuracy, Macro F1, Confusion Matrix |
+| Regression | RMSE, MAE, R², MAPE |
+| Ranking | NDCG, MAP, MRR |
+
+### Model Comparison
+```python
+from sklearn.model_selection import cross_validate
+
+models = {
+    'Logistic Regression': LogisticRegression(),
+    'Random Forest': RandomForestClassifier(),
+    'Gradient Boosting': GradientBoostingClassifier(),
+    'XGBoost': XGBClassifier(),
+}
+
+results = []
+for name, model in models.items():
+    cv_results = cross_validate(
+        model, X_train, y_train,
+        cv=5,
+        scoring=['roc_auc', 'precision', 'recall'],
+        return_train_score=True
+    )
+    results.append({
+        'model': name,
+        'roc_auc': cv_results['test_roc_auc'].mean(),
+        'precision': cv_results['test_precision'].mean(),
+        'recall': cv_results['test_recall'].mean(),
+    })
+
+pd.DataFrame(results).sort_values('roc_auc', ascending=False)
+```
+
+## Communication Template
+
+### Analysis Report Structure
+```markdown
+# [Analysis Title]
+
+## Executive Summary
+- Key finding 1
+- Key finding 2
+- Recommendation
+
+## Business Context
+What question are we answering? Why does it matter?
+
+## Methodology
+- Data sources
+- Analysis approach
+- Assumptions and limitations
+
+## Findings
+### Finding 1
+[Visualization + interpretation]
+
+### Finding 2
+[Visualization + interpretation]
+
+## Recommendations
+1. Specific action
+2. Specific action
+
+## Next Steps
+- Additional analyses needed
+- Experiments to run
+
+## Appendix
+- Technical details
+- Data quality notes
+```
+
+## Anti-Patterns to Avoid
+
+| Anti-Pattern | Better Approach |
+|--------------|-----------------|
+| P-hacking | Pre-register hypotheses |
+| Leakage in CV | Proper pipeline |
+| Overfitting | Cross-validation |
+| Ignoring uncertainty | Confidence intervals |
+| Correlation = causation | Causal analysis |
+
+## Constraints
+
+- Always validate assumptions
+- Report uncertainty in estimates
+- Consider business impact, not just stats
+- Document methodology clearly
+- Reproduce results independently
+
+## Related Skills
+
+- `ml-engineer` - Production deployment
+- `data-engineer` - Data infrastructure
+- `python-pro` - Python expertise
```