tech-hub-skills 1.2.0 → 1.5.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. It is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/{LICENSE → .claude/LICENSE} +21 -21
- package/.claude/README.md +291 -0
- package/.claude/bin/cli.js +266 -0
- package/{bin → .claude/bin}/copilot.js +182 -182
- package/{bin → .claude/bin}/postinstall.js +42 -42
- package/{tech_hub_skills/skills → .claude/commands}/README.md +336 -336
- package/{tech_hub_skills/skills → .claude/commands}/ai-engineer.md +104 -104
- package/{tech_hub_skills/skills → .claude/commands}/aws.md +143 -143
- package/{tech_hub_skills/skills → .claude/commands}/azure.md +149 -149
- package/{tech_hub_skills/skills → .claude/commands}/backend-developer.md +108 -108
- package/{tech_hub_skills/skills → .claude/commands}/code-review.md +399 -399
- package/{tech_hub_skills/skills → .claude/commands}/compliance-automation.md +747 -747
- package/{tech_hub_skills/skills → .claude/commands}/compliance-officer.md +108 -108
- package/{tech_hub_skills/skills → .claude/commands}/data-engineer.md +113 -113
- package/{tech_hub_skills/skills → .claude/commands}/data-governance.md +102 -102
- package/{tech_hub_skills/skills → .claude/commands}/data-scientist.md +123 -123
- package/{tech_hub_skills/skills → .claude/commands}/database-admin.md +109 -109
- package/{tech_hub_skills/skills → .claude/commands}/devops.md +160 -160
- package/{tech_hub_skills/skills → .claude/commands}/docker.md +160 -160
- package/{tech_hub_skills/skills → .claude/commands}/enterprise-dashboard.md +613 -613
- package/{tech_hub_skills/skills → .claude/commands}/finops.md +184 -184
- package/{tech_hub_skills/skills → .claude/commands}/frontend-developer.md +108 -108
- package/{tech_hub_skills/skills → .claude/commands}/gcp.md +143 -143
- package/{tech_hub_skills/skills → .claude/commands}/ml-engineer.md +115 -115
- package/{tech_hub_skills/skills → .claude/commands}/mlops.md +187 -187
- package/{tech_hub_skills/skills → .claude/commands}/network-engineer.md +109 -109
- package/{tech_hub_skills/skills → .claude/commands}/optimization-advisor.md +329 -329
- package/{tech_hub_skills/skills → .claude/commands}/orchestrator.md +623 -623
- package/{tech_hub_skills/skills → .claude/commands}/platform-engineer.md +102 -102
- package/{tech_hub_skills/skills → .claude/commands}/process-automation.md +226 -226
- package/{tech_hub_skills/skills → .claude/commands}/process-changelog.md +184 -184
- package/{tech_hub_skills/skills → .claude/commands}/process-documentation.md +484 -484
- package/{tech_hub_skills/skills → .claude/commands}/process-kanban.md +324 -324
- package/{tech_hub_skills/skills → .claude/commands}/process-versioning.md +214 -214
- package/{tech_hub_skills/skills → .claude/commands}/product-designer.md +104 -104
- package/{tech_hub_skills/skills → .claude/commands}/project-starter.md +443 -443
- package/{tech_hub_skills/skills → .claude/commands}/qa-engineer.md +109 -109
- package/{tech_hub_skills/skills → .claude/commands}/security-architect.md +135 -135
- package/{tech_hub_skills/skills → .claude/commands}/sre.md +109 -109
- package/{tech_hub_skills/skills → .claude/commands}/system-design.md +126 -126
- package/{tech_hub_skills/skills → .claude/commands}/technical-writer.md +101 -101
- package/.claude/package.json +46 -0
- package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/01-prompt-engineering/README.md +252 -252
- package/.claude/roles/ai-engineer/skills/01-prompt-engineering/prompt_ab_tester.py +356 -0
- package/.claude/roles/ai-engineer/skills/01-prompt-engineering/prompt_template_manager.py +274 -0
- package/.claude/roles/ai-engineer/skills/01-prompt-engineering/token_cost_estimator.py +324 -0
- package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/02-rag-pipeline/README.md +448 -448
- package/.claude/roles/ai-engineer/skills/02-rag-pipeline/document_chunker.py +336 -0
- package/.claude/roles/ai-engineer/skills/02-rag-pipeline/rag_pipeline.sql +213 -0
- package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/03-agent-orchestration/README.md +599 -599
- package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/04-llm-guardrails/README.md +735 -735
- package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/05-vector-embeddings/README.md +711 -711
- package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/06-llm-evaluation/README.md +777 -777
- package/{tech_hub_skills → .claude}/roles/azure/skills/01-infrastructure-fundamentals/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/azure/skills/02-data-factory/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/azure/skills/03-synapse-analytics/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/azure/skills/04-databricks/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/azure/skills/05-functions/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/azure/skills/06-kubernetes-service/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/azure/skills/07-openai-service/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/azure/skills/08-machine-learning/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/azure/skills/09-storage-adls/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/azure/skills/10-networking/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/azure/skills/11-sql-cosmos/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/azure/skills/12-event-hubs/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/code-review/skills/01-automated-code-review/README.md +394 -394
- package/{tech_hub_skills → .claude}/roles/code-review/skills/02-pr-review-workflow/README.md +427 -427
- package/{tech_hub_skills → .claude}/roles/code-review/skills/03-code-quality-gates/README.md +518 -518
- package/{tech_hub_skills → .claude}/roles/code-review/skills/04-reviewer-assignment/README.md +504 -504
- package/{tech_hub_skills → .claude}/roles/code-review/skills/05-review-analytics/README.md +540 -540
- package/{tech_hub_skills → .claude}/roles/data-engineer/skills/01-lakehouse-architecture/README.md +550 -550
- package/.claude/roles/data-engineer/skills/01-lakehouse-architecture/bronze_ingestion.py +337 -0
- package/.claude/roles/data-engineer/skills/01-lakehouse-architecture/medallion_queries.sql +300 -0
- package/{tech_hub_skills → .claude}/roles/data-engineer/skills/02-etl-pipeline/README.md +580 -580
- package/{tech_hub_skills → .claude}/roles/data-engineer/skills/03-data-quality/README.md +579 -579
- package/{tech_hub_skills → .claude}/roles/data-engineer/skills/04-streaming-pipelines/README.md +608 -608
- package/{tech_hub_skills → .claude}/roles/data-engineer/skills/05-performance-optimization/README.md +547 -547
- package/{tech_hub_skills → .claude}/roles/data-governance/skills/01-data-catalog/README.md +112 -112
- package/{tech_hub_skills → .claude}/roles/data-governance/skills/02-data-lineage/README.md +129 -129
- package/{tech_hub_skills → .claude}/roles/data-governance/skills/03-data-quality-framework/README.md +182 -182
- package/{tech_hub_skills → .claude}/roles/data-governance/skills/04-access-control/README.md +39 -39
- package/{tech_hub_skills → .claude}/roles/data-governance/skills/05-master-data-management/README.md +40 -40
- package/{tech_hub_skills → .claude}/roles/data-governance/skills/06-compliance-privacy/README.md +46 -46
- package/{tech_hub_skills → .claude}/roles/data-scientist/skills/01-eda-automation/README.md +230 -230
- package/.claude/roles/data-scientist/skills/01-eda-automation/eda_generator.py +446 -0
- package/{tech_hub_skills → .claude}/roles/data-scientist/skills/02-statistical-modeling/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/data-scientist/skills/03-feature-engineering/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/data-scientist/skills/04-predictive-modeling/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/data-scientist/skills/05-customer-analytics/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/data-scientist/skills/06-campaign-analysis/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/data-scientist/skills/07-experimentation/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/data-scientist/skills/08-data-visualization/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/devops/skills/01-cicd-pipeline/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/devops/skills/02-container-orchestration/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/devops/skills/03-infrastructure-as-code/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/devops/skills/04-gitops/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/devops/skills/05-environment-management/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/devops/skills/06-automated-testing/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/devops/skills/07-release-management/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/devops/skills/08-monitoring-alerting/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/devops/skills/09-devsecops/README.md +265 -265
- package/{tech_hub_skills → .claude}/roles/finops/skills/01-cost-visibility/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/finops/skills/02-resource-tagging/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/finops/skills/03-budget-management/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/finops/skills/04-reserved-instances/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/finops/skills/05-spot-optimization/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/finops/skills/06-storage-tiering/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/finops/skills/07-compute-rightsizing/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/finops/skills/08-chargeback/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/ml-engineer/skills/01-mlops-pipeline/README.md +566 -566
- package/{tech_hub_skills → .claude}/roles/ml-engineer/skills/02-feature-engineering/README.md +655 -655
- package/{tech_hub_skills → .claude}/roles/ml-engineer/skills/03-model-training/README.md +704 -704
- package/{tech_hub_skills → .claude}/roles/ml-engineer/skills/04-model-serving/README.md +845 -845
- package/{tech_hub_skills → .claude}/roles/ml-engineer/skills/05-model-monitoring/README.md +874 -874
- package/{tech_hub_skills → .claude}/roles/mlops/skills/01-ml-pipeline-orchestration/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/mlops/skills/02-experiment-tracking/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/mlops/skills/03-model-registry/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/mlops/skills/04-feature-store/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/mlops/skills/05-model-deployment/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/mlops/skills/06-model-observability/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/mlops/skills/07-data-versioning/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/mlops/skills/08-ab-testing/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/mlops/skills/09-automated-retraining/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/01-internal-developer-platform/README.md +153 -153
- package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/02-self-service-infrastructure/README.md +57 -57
- package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/03-slo-sli-management/README.md +59 -59
- package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/04-developer-experience/README.md +57 -57
- package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/05-incident-management/README.md +73 -73
- package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/06-capacity-management/README.md +59 -59
- package/{tech_hub_skills → .claude}/roles/product-designer/skills/01-requirements-discovery/README.md +407 -407
- package/{tech_hub_skills → .claude}/roles/product-designer/skills/02-user-research/README.md +382 -382
- package/{tech_hub_skills → .claude}/roles/product-designer/skills/03-brainstorming-ideation/README.md +437 -437
- package/{tech_hub_skills → .claude}/roles/product-designer/skills/04-ux-design/README.md +496 -496
- package/{tech_hub_skills → .claude}/roles/product-designer/skills/05-product-market-fit/README.md +376 -376
- package/{tech_hub_skills → .claude}/roles/product-designer/skills/06-stakeholder-management/README.md +412 -412
- package/{tech_hub_skills → .claude}/roles/security-architect/skills/01-pii-detection/README.md +319 -319
- package/{tech_hub_skills → .claude}/roles/security-architect/skills/02-threat-modeling/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/security-architect/skills/03-infrastructure-security/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/security-architect/skills/04-iam/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/security-architect/skills/05-application-security/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/security-architect/skills/06-secrets-management/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/security-architect/skills/07-security-monitoring/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/system-design/skills/01-architecture-patterns/README.md +337 -337
- package/{tech_hub_skills → .claude}/roles/system-design/skills/02-requirements-engineering/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/system-design/skills/03-scalability/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/system-design/skills/04-high-availability/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/system-design/skills/05-cost-optimization-design/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/system-design/skills/06-api-design/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/system-design/skills/07-observability-architecture/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/system-design/skills/08-process-automation/PROCESS_TEMPLATE.md +336 -336
- package/{tech_hub_skills → .claude}/roles/system-design/skills/08-process-automation/README.md +521 -521
- package/.claude/roles/system-design/skills/08-process-automation/ai_prompt_generator.py +744 -0
- package/.claude/roles/system-design/skills/08-process-automation/automation_recommender.py +688 -0
- package/.claude/roles/system-design/skills/08-process-automation/plan_generator.py +679 -0
- package/.claude/roles/system-design/skills/08-process-automation/process_analyzer.py +528 -0
- package/.claude/roles/system-design/skills/08-process-automation/process_parser.py +684 -0
- package/.claude/roles/system-design/skills/08-process-automation/role_matcher.py +615 -0
- package/.claude/skills/README.md +336 -0
- package/.claude/skills/ai-engineer.md +104 -0
- package/.claude/skills/aws.md +143 -0
- package/.claude/skills/azure.md +149 -0
- package/.claude/skills/backend-developer.md +108 -0
- package/.claude/skills/code-review.md +399 -0
- package/.claude/skills/compliance-automation.md +747 -0
- package/.claude/skills/compliance-officer.md +108 -0
- package/.claude/skills/data-engineer.md +113 -0
- package/.claude/skills/data-governance.md +102 -0
- package/.claude/skills/data-scientist.md +123 -0
- package/.claude/skills/database-admin.md +109 -0
- package/.claude/skills/devops.md +160 -0
- package/.claude/skills/docker.md +160 -0
- package/.claude/skills/enterprise-dashboard.md +613 -0
- package/.claude/skills/finops.md +184 -0
- package/.claude/skills/frontend-developer.md +108 -0
- package/.claude/skills/gcp.md +143 -0
- package/.claude/skills/ml-engineer.md +115 -0
- package/.claude/skills/mlops.md +187 -0
- package/.claude/skills/network-engineer.md +109 -0
- package/.claude/skills/optimization-advisor.md +329 -0
- package/.claude/skills/orchestrator.md +623 -0
- package/.claude/skills/platform-engineer.md +102 -0
- package/.claude/skills/process-automation.md +226 -0
- package/.claude/skills/process-changelog.md +184 -0
- package/.claude/skills/process-documentation.md +484 -0
- package/.claude/skills/process-kanban.md +324 -0
- package/.claude/skills/process-versioning.md +214 -0
- package/.claude/skills/product-designer.md +104 -0
- package/.claude/skills/project-starter.md +443 -0
- package/.claude/skills/qa-engineer.md +109 -0
- package/.claude/skills/security-architect.md +135 -0
- package/.claude/skills/sre.md +109 -0
- package/.claude/skills/system-design.md +126 -0
- package/.claude/skills/technical-writer.md +101 -0
- package/.gitattributes +2 -0
- package/GITHUB_COPILOT.md +106 -0
- package/README.md +192 -291
- package/package.json +16 -46
- package/bin/cli.js +0 -241
package/{tech_hub_skills → .claude}/roles/ml-engineer/skills/03-model-training/README.md (+704 −704: every removed line is re-added unchanged at the new path, so the content is shown once below)

@@ -1,704 +1,704 @@

# Skill 3: Model Training & Hyperparameter Tuning

## 🎯 Overview
Implement scalable model training pipelines with automated hyperparameter optimization and experiment tracking.

## 🔗 Connections
- **Data Scientist**: Productionizes experimental models (ds-01, ds-02, ds-08)
- **ML Engineer**: Feeds from feature store, deploys to serving (ml-02, ml-04, ml-07)
- **MLOps**: Experiment tracking and model versioning (mo-01, mo-03)
- **FinOps**: Optimizes training costs and compute usage (fo-06, fo-07)
- **DevOps**: Automates training pipelines with CI/CD (do-01, do-03, do-08)
- **Security Architect**: Ensures secure training environments (sa-02, sa-08)
- **System Design**: Distributed training architecture (sd-03, sd-05)
- **Data Engineer**: Consumes validated training data (de-03)

## 🛠️ Tools Included

### 1. `model_trainer.py`
Unified training interface for sklearn, XGBoost, LightGBM, PyTorch.

### 2. `hyperparameter_optimizer.py`
Automated hyperparameter tuning with Optuna/Ray Tune.

### 3. `experiment_tracker.py`
MLflow integration for comprehensive experiment tracking.

### 4. `model_evaluator.py`
Model evaluation with business metrics and validation.

### 5. `training_config.yaml`
Configuration templates for training pipelines.

## 🏗️ Training Pipeline Architecture

```
Feature Store → Data Preparation → Model Training → Evaluation → Registry
                      ↓                  ↓               ↓            ↓
                 Validation       Experiment Track    Metrics     Versioning
                 Splitting        HPO                 Comparison  Lineage
                 Augmentation     Checkpointing       Validation  Promotion
```

## 🚀 Quick Start

```python
from model_trainer import ModelTrainer
from hyperparameter_optimizer import HPOptimizer
from experiment_tracker import ExperimentTracker

# Initialize tracker
tracker = ExperimentTracker(experiment_name="churn_prediction_v2")

# Configure training
trainer = ModelTrainer(
    model_type="xgboost",
    objective="binary:logistic",
    eval_metric="auc"
)

# Load features from the feature store (feature_store and training_entities
# are assumed to be configured elsewhere)
features = feature_store.get_historical_features(
    feature_refs=["customer_behavior:v1"],
    entity_df=training_entities
)

# Train with experiment tracking
with tracker.start_run():
    # Log parameters
    tracker.log_params({
        "model_type": "xgboost",
        "max_depth": 6,
        "n_estimators": 100
    })

    # Train model
    model = trainer.train(
        X_train=features.drop(columns=["target"]),
        y_train=features["target"],
        validation_split=0.2
    )

    # Evaluate (X_test/y_test, and X_val/y_val below, come from held-out splits)
    metrics = trainer.evaluate(X_test, y_test)
    tracker.log_metrics(metrics)

    # Save model
    tracker.log_model(model, "churn_predictor")

# Hyperparameter optimization
optimizer = HPOptimizer(
    estimator=trainer,
    param_space={
        "max_depth": [3, 5, 7, 9],
        "n_estimators": [50, 100, 200],
        "learning_rate": [0.01, 0.05, 0.1],
        "subsample": [0.7, 0.8, 0.9, 1.0]
    },
    optimization_metric="auc",
    n_trials=50
)

best_params = optimizer.optimize(X_train, y_train, X_val, y_val)
```

## 📚 Best Practices

### Training Cost Optimization (FinOps Integration)

1. **Use Spot/Preemptible Instances**
   - 60-90% cost savings for interruptible training
   - Implement checkpointing for fault tolerance
   - Automatic job resumption after preemption
   - Best for batch training jobs
   - Reference: FinOps fo-06 (Compute Optimization), fo-07 (AI/ML Cost)

2. **Right-Size Training Compute**
   - Profile training jobs to determine optimal instance size
   - Use CPU for tree-based models (XGBoost, LightGBM)
   - Reserve GPUs for deep learning only
   - Monitor GPU/CPU utilization
   - Auto-scale compute clusters
   - Reference: FinOps fo-06

3. **Optimize Hyperparameter Tuning**
   - Use Bayesian optimization over grid search (10x fewer trials)
   - Implement early stopping to terminate poor trials
   - Parallelize trials efficiently
   - Track cost per hyperparameter trial
   - Set budget limits per optimization run
   - Reference: FinOps fo-07, ML best practices

4. **Training Time Optimization**
   - Use early stopping to prevent overtraining
   - Implement learning rate schedules
   - Use mixed precision training (2x faster on GPUs)
   - Profile and optimize data loading
   - Cache preprocessed data
   - Reference: ML Engineer best practices

5. **Track Training Costs Per Experiment**
   - Log compute costs with experiments
   - Track training duration and resource usage
   - Monitor cost vs accuracy trade-offs
   - Set budget alerts for runaway experiments
   - Reference: FinOps fo-01 (Cost Monitoring), fo-03 (Budget Management)

### MLOps Integration for Training

6. **Comprehensive Experiment Tracking**
   - Track all hyperparameters and configurations
   - Log all metrics (training, validation, test)
   - Version datasets used for training
   - Save model artifacts and checkpoints
   - Track training duration and resource usage
   - Reference: MLOps mo-01 (Experiment Tracking)

7. **Model Versioning & Lineage**
   - Version all trained models
   - Track complete lineage (data + code + config)
   - Link models to training runs
   - Document model architecture and purpose
   - Reference: MLOps mo-03 (Model Versioning), mo-06 (Lineage)

8. **Reproducible Training**
   - Set random seeds for reproducibility (see the sketch after this list)
   - Version control training code
   - Pin dependency versions
   - Document training environment
   - Store training configurations
   - Reference: MLOps mo-01, DevOps do-01

9. **Model Validation & Testing**
   - Validate on held-out test set
   - Test model performance on edge cases
   - Verify model fairness and check for bias
   - Test inference latency
   - Reference: MLOps mo-07 (Testing), Data Scientist ds-08

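To make item 8 concrete, here is a minimal seed-pinning sketch for a PyTorch-based run; which flags matter depends on your framework and hardware, so treat the exact set below as an assumption rather than part of the packaged tools.

```python
import os
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Pin the usual sources of nondeterminism for a reproducible training run."""
    os.environ["PYTHONHASHSEED"] = str(seed)  # inherited by worker subprocesses
    random.seed(seed)       # Python stdlib RNG
    np.random.seed(seed)    # NumPy RNG
    torch.manual_seed(seed) # seeds CPU and all CUDA RNGs
    # Prefer deterministic cuDNN kernels over autotuned (faster) ones
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


set_seed(42)  # call once, before data splitting and model construction
```
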
### DevOps Integration for Training

10. **Automated Training Pipelines**
    - Trigger training on data updates
    - Automate model evaluation and comparison
    - Implement automatic model promotion
    - Schedule periodic retraining
    - Reference: DevOps do-01 (CI/CD), ML Engineer ml-01

11. **Containerized Training**
    - Package training code in Docker containers
    - Use multi-stage builds for efficiency
    - Version control container definitions
    - Test containers locally before deployment
    - Reference: DevOps do-03 (Containerization)

12. **Infrastructure as Code for Training**
    - Define training infrastructure in Terraform
    - Automate compute cluster provisioning
    - Version control all infrastructure
    - Implement environment parity (dev/staging/prod)
    - Reference: DevOps do-04 (IaC)

13. **Monitoring & Observability**
    - Monitor training job status and health
    - Track training metrics in real-time
    - Set up alerts for training failures
    - Log training errors and exceptions
    - Reference: DevOps do-08 (Monitoring)

### Data Quality for Training

14. **Training Data Validation**
    - Validate data schema before training (see the sketch after this section)
    - Check for data drift vs previous training
    - Verify label distribution
    - Detect data quality issues early
    - Reference: Data Engineer de-03 (Data Quality)

15. **Data Versioning**
    - Version training datasets
    - Track data lineage for reproducibility
    - Document data collection and labeling
    - Reference: MLOps mo-06 (Lineage)

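A minimal sketch of the pre-training checks described in item 14, written with pandas; the expected schema, the drift tolerance, and the entity column are illustrative assumptions, not part of the packaged tools.

```python
import pandas as pd

# Illustrative schema; in practice this would be versioned alongside the pipeline
EXPECTED_SCHEMA = {"customer_id": "int64", "tenure_months": "float64", "target": "int64"}


def validate_training_data(df: pd.DataFrame, prev_target_rate: float | None = None) -> None:
    """Fail fast on schema, label-distribution, and basic quality problems."""
    # 1. Schema: every expected column present with the expected dtype
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            raise ValueError(f"missing column: {col}")
        if str(df[col].dtype) != dtype:
            raise ValueError(f"{col}: expected {dtype}, got {df[col].dtype}")

    # 2. Label distribution: reject degenerate labels and large drift vs the last run
    target_rate = df["target"].mean()
    if not 0.0 < target_rate < 1.0:
        raise ValueError("target column contains a single class")
    if prev_target_rate is not None and abs(target_rate - prev_target_rate) > 0.10:
        raise ValueError(f"label drift: {prev_target_rate:.2f} -> {target_rate:.2f}")

    # 3. Basic quality: no all-null columns, no duplicated entities
    if df.isna().all().any():
        raise ValueError("found an all-null column")
    if df.duplicated(subset=["customer_id"]).any():
        raise ValueError("duplicate customer rows")
```
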
### Security & Compliance

16. **Secure Training Environments**
    - Train in isolated network environments
    - Use managed identities for authentication
    - Encrypt data at rest and in transit
    - Audit training job access
    - Reference: Security Architect sa-02 (IAM), sa-03 (Network)

17. **Model Security**
    - Scan training dependencies for vulnerabilities
    - Implement model access controls
    - Audit model artifacts
    - Document model security posture
    - Reference: Security Architect sa-08 (LLM Security)

### Azure-Specific Best Practices

18. **Azure Machine Learning**
    - Use managed compute clusters
    - Enable auto-scaling for training jobs
    - Leverage Azure ML pipelines
    - Use Azure ML environments for reproducibility
    - Reference: Azure az-04 (AI/ML Services)

19. **Cost Management in Azure ML**
    - Use low-priority compute for training (70-80% savings)
    - Enable compute auto-shutdown
    - Monitor compute utilization
    - Set workspace spending limits
    - Reference: Azure az-04, FinOps fo-06

20. **Model Training Best Practices**
    - Use early stopping callbacks
    - Implement cross-validation for robust estimates (see the sketch below)
    - Track learning curves for overfitting detection
    - Save best model checkpoints
    - Reference: ML Engineer best practices

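As an illustration of the cross-validation bullet in item 20, a self-contained scikit-learn sketch; the synthetic dataset and the XGBoost parameters are placeholders for your own features and tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# Stand-in data so the sketch runs end to end; replace with real features
X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)

# Stratified folds keep the class balance stable across splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

model = XGBClassifier(max_depth=6, n_estimators=100, learning_rate=0.1)

# Per-fold AUC on held-out data gives a more robust estimate than a single split
scores = cross_val_score(model, X, y, scoring="roc_auc", cv=cv)
print(f"AUC: {scores.mean():.4f} +/- {scores.std():.4f}")
```
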
## 💰 Cost Optimization Examples

### Spot Instance Training with Checkpointing
```python
from azure.ai.ml import command, Input
from azure.ai.ml.entities import AmlCompute
from model_trainer import CheckpointedTrainer

# ml_client is an authenticated azure.ai.ml MLClient, assumed to exist already

# Create spot compute cluster (60-90% savings)
spot_cluster = AmlCompute(
    name="spot-training-cluster",
    size="Standard_NC6s_v3",  # GPU instance
    min_instances=0,
    max_instances=4,
    tier="LowPriority",  # Spot pricing!
    idle_time_before_scale_down=300
)

ml_client.compute.begin_create_or_update(spot_cluster).result()

# Checkpointed trainer (automatically resumes after preemption)
trainer = CheckpointedTrainer(
    model_type="pytorch",
    checkpoint_dir="./checkpoints",
    checkpoint_frequency=100,  # Save every 100 steps
    resume_from_checkpoint=True
)

# Training script with checkpointing
training_script = """
import torch
from model_trainer import CheckpointedTrainer

trainer = CheckpointedTrainer(
    model_type="pytorch",
    checkpoint_dir="./outputs/checkpoints"
)

# Load checkpoint if exists (after preemption)
start_epoch = trainer.load_checkpoint_if_exists()

# Training loop with automatic checkpointing
for epoch in range(start_epoch, num_epochs):
    train_loss = trainer.train_epoch(model, train_loader)
    val_loss = trainer.validate(model, val_loader)

    # Automatic checkpoint saving
    trainer.save_checkpoint(
        epoch=epoch,
        model=model,
        optimizer=optimizer,
        loss=val_loss
    )
"""

# Submit to spot cluster
job = command(
    code="./src",
    command="python train.py",
    environment="azureml:pytorch-training:1",
    compute="spot-training-cluster",
    inputs={
        "training_data": Input(path="azureml://datasets/training_data/labels/latest")
    }
)

# Job automatically resumes from checkpoint if preempted
run = ml_client.jobs.create_or_update(job)

# Track savings
from finops_tracker import TrainingCostTracker
cost_tracker = TrainingCostTracker()
savings_report = cost_tracker.calculate_spot_savings(run.name)
print(f"Cost with spot: ${savings_report.spot_cost:.2f}")
print(f"Cost with dedicated: ${savings_report.dedicated_cost:.2f}")
print(f"Total savings: ${savings_report.savings:.2f} ({savings_report.savings_percent}%)")
```

### Bayesian Hyperparameter Optimization (10x Fewer Trials)
```python
from hyperparameter_optimizer import BayesianOptimizer
from finops_tracker import HPOCostTracker
from model_trainer import ModelTrainer
import optuna

cost_tracker = HPOCostTracker()


def objective(trial):
    """Objective function with cost tracking"""

    # Suggest hyperparameters
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 7),
        'gamma': trial.suggest_float('gamma', 0, 0.5)
    }

    # Track trial cost
    with cost_tracker.track_trial(trial.number):
        # Train model
        trainer = ModelTrainer(model_type="xgboost", **params)
        trainer.train(X_train, y_train)

        # Evaluate
        score = trainer.evaluate(X_val, y_val)['auc']

    # Report cost
    cost_tracker.report_trial_cost(trial.number, score)

    return score


# Bayesian optimization with early stopping
study = optuna.create_study(
    direction='maximize',
    sampler=optuna.samplers.TPESampler(seed=42),  # Bayesian optimization
    pruner=optuna.pruners.MedianPruner(  # Early stopping for poor trials
        n_startup_trials=5,
        n_warmup_steps=10,
        interval_steps=1
    )
)

# Run optimization with budget limit
study.optimize(
    objective,
    n_trials=50,    # vs 1000+ for grid search
    timeout=3600,   # 1 hour max
    callbacks=[cost_tracker.budget_callback(max_budget=100.00)]
)

# Results with cost analysis
print(f"Best AUC: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

cost_report = cost_tracker.generate_hpo_report()
print("\nHPO Cost Analysis:")
print(f"Total trials: {cost_report.total_trials}")
print(f"Completed trials: {cost_report.completed_trials}")
print(f"Pruned trials: {cost_report.pruned_trials} (cost savings!)")
print(f"Total cost: ${cost_report.total_cost:.2f}")
print(f"Average cost per trial: ${cost_report.avg_cost_per_trial:.2f}")
print(f"Estimated grid search cost: ${cost_report.grid_search_cost_estimate:.2f}")
print(f"Savings vs grid search: ${cost_report.savings:.2f} ({cost_report.savings_percent}%)")

# Visualize optimization
from optuna.visualization import plot_optimization_history, plot_param_importances
plot_optimization_history(study).show()
plot_param_importances(study).show()
```

### Mixed Precision Training (2x Speed on GPUs)
```python
import torch
from torch.cuda.amp import autocast, GradScaler
from model_trainer import PyTorchTrainer
from finops_tracker import TrainingCostTracker


class MixedPrecisionTrainer(PyTorchTrainer):
    """Mixed precision training for 2x speed and 50% memory reduction"""

    def __init__(self, model, optimizer, **kwargs):
        super().__init__(model, optimizer, **kwargs)
        self.scaler = GradScaler()  # For numerical stability
        self.cost_tracker = TrainingCostTracker()

    def train_epoch(self, train_loader):
        self.model.train()
        total_loss = 0

        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.cuda(), target.cuda()

            # Mixed precision forward pass
            with autocast():  # Automatic mixed precision
                output = self.model(data)
                loss = self.criterion(output, target)

            # Scaled backpropagation
            self.optimizer.zero_grad()
            self.scaler.scale(loss).backward()
            self.scaler.step(self.optimizer)
            self.scaler.update()

            total_loss += loss.item()

        return total_loss / len(train_loader)


# Usage (MyModel, num_epochs and the data loaders are assumed to be defined)
model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

trainer = MixedPrecisionTrainer(model, optimizer)
cost_tracker = trainer.cost_tracker

# Track training time and cost
with cost_tracker.track_training("mixed_precision_training"):
    for epoch in range(num_epochs):
        train_loss = trainer.train_epoch(train_loader)
        val_loss = trainer.validate(val_loader)

# Compare with FP32 training
report = cost_tracker.compare_precision_modes()
print(f"Mixed precision training time: {report.mixed_time:.2f}s")
print(f"FP32 training time: {report.fp32_time:.2f}s")
print(f"Speedup: {report.speedup:.2f}x")
print(f"Cost savings: ${report.cost_savings:.2f}")
```

### Cost-Tracked Experiment Management
```python
from experiment_tracker import ExperimentTracker
from finops_tracker import ExperimentCostTracker


class CostAwareExperimentTracker:
    """Experiment tracker with integrated cost monitoring"""

    def __init__(self, experiment_name: str):
        self.tracker = ExperimentTracker(experiment_name=experiment_name)
        self.cost_tracker = ExperimentCostTracker()

    def start_run(self, run_name: str, compute_type: str = "cpu"):
        """Start a new run with cost tracking"""

        # Start MLflow run
        self.run = self.tracker.start_run(run_name=run_name)

        # Start cost tracking
        self.cost_tracker.start_tracking(
            run_id=self.run.info.run_id,
            compute_type=compute_type
        )

        return self

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Calculate and log costs
        cost_metrics = self.cost_tracker.stop_tracking()

        # Log cost metrics to MLflow
        self.tracker.log_metrics({
            "compute_cost_usd": cost_metrics.compute_cost,
            "storage_cost_usd": cost_metrics.storage_cost,
            "total_cost_usd": cost_metrics.total_cost,
            "training_duration_seconds": cost_metrics.duration,
            "cost_per_hour": cost_metrics.cost_per_hour
        })

        # Add cost tags
        self.tracker.set_tags({
            "compute_type": cost_metrics.compute_type,
            "instance_type": cost_metrics.instance_type,
            "cost_optimized": str(cost_metrics.compute_type == "spot")
        })

        # End run
        self.tracker.end_run()


# Usage (train_xgboost, train_neural_network, evaluate and the data splits
# are assumed to be defined elsewhere)
tracker = CostAwareExperimentTracker("customer_churn")

# Experiment 1: Standard training
with tracker.start_run("baseline_xgboost", compute_type="cpu"):
    model = train_xgboost(X_train, y_train)
    metrics = evaluate(model, X_test, y_test)
    tracker.tracker.log_metrics(metrics)
    tracker.tracker.log_model(model, "xgboost_baseline")

# Experiment 2: GPU training with spot instances
with tracker.start_run("deep_learning_spot", compute_type="spot_gpu"):
    model = train_neural_network(X_train, y_train)
    metrics = evaluate(model, X_test, y_test)
    tracker.tracker.log_metrics(metrics)
    tracker.tracker.log_model(model, "nn_spot")

# Compare experiments by cost and performance
experiments_df = tracker.cost_tracker.compare_experiments()
print(experiments_df[['run_name', 'accuracy', 'total_cost_usd', 'cost_per_point_accuracy']])

# Find most cost-efficient model
best_value = experiments_df['cost_per_point_accuracy'].idxmin()
print(f"\nMost cost-efficient model: {experiments_df.loc[best_value, 'run_name']}")
print(f"Accuracy: {experiments_df.loc[best_value, 'accuracy']:.4f}")
print(f"Cost: ${experiments_df.loc[best_value, 'total_cost_usd']:.2f}")
```

## 🚀 CI/CD for Model Training

### Automated Training Pipeline
```yaml
# .github/workflows/model-training.yml
name: Model Training Pipeline

on:
  push:
    paths:
      - 'models/**'
      - 'training/**'
    branches:
      - main
  schedule:
    - cron: '0 3 * * 0'  # Weekly retraining

jobs:
  train-and-evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Azure Login
        uses: azure/login@v1
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov

      - name: Run unit tests
        run: pytest tests/unit/ --cov=src

      - name: Validate training data
        run: python scripts/validate_training_data.py

      - name: Train model on spot instances
        run: |
          python training/train_model.py \
            --experiment-name "churn_weekly_${{ github.run_number }}" \
            --compute-type spot \
            --max-cost 50.00 \
            --enable-early-stopping

      - name: Run hyperparameter optimization
        if: github.ref == 'refs/heads/main'
        run: |
          python training/optimize_hyperparameters.py \
            --n-trials 30 \
            --optimization-method bayesian \
            --max-cost 100.00

      - name: Evaluate model
        run: |
          python training/evaluate_model.py \
            --min-accuracy 0.85 \
            --min-auc 0.90 \
            --test-fairness

      - name: Generate model card
        run: python scripts/generate_model_card.py

      - name: Register model
        if: success()
        run: python scripts/register_model.py --stage Staging

      - name: Run integration tests
        run: pytest tests/integration/

      - name: Generate training report
        run: python scripts/generate_training_report.py

      - name: Upload artifacts
        uses: actions/upload-artifact@v3
        with:
          name: training-artifacts
          path: |
            outputs/
            reports/
```

## 📊 Metrics & Monitoring

| Metric Category | Metric | Target | Tool |
|-----------------|--------|--------|------|
| **Training Costs** | Cost per training run | <$25 | FinOps tracker |
| | Monthly training budget | <$2000 | Azure Cost Management |
| | Spot instance savings | >70% | Cost tracker |
| | GPU utilization | >80% | Azure Monitor |
| **HPO Costs** | Cost per HPO run | <$100 | HPO cost tracker |
| | Trials pruned (savings) | >40% | Optuna |
| | Bayesian vs grid savings | >80% | Cost comparison |
| **Training Performance** | Training time | <2 hours | Experiment tracker |
| | Model accuracy | >0.85 | MLflow |
| | AUC score | >0.90 | Evaluation metrics |
| **Resource Utilization** | CPU utilization | >70% | Azure Monitor |
| | Memory utilization | >60% | Azure Monitor |
| | Data loading time | <10% total | Profiler |
| **Pipeline Reliability** | Training success rate | >95% | Pipeline metrics |
| | Experiment reproducibility | 100% | Seed + versioning |
| **Model Quality** | Validation score | >baseline | Experiment tracker |
| | Test set performance | >0.85 | Model evaluator |

## 🔄 Integration Workflow

### End-to-End Training Pipeline
```
1. Feature Store Access (ml-02)
        ↓
2. Data Validation (de-03)
        ↓
3. Training Data Preparation (ml-03)
        ↓
4. Experiment Initialization (mo-01)
        ↓
5. Hyperparameter Optimization (ml-03)
        ↓
6. Model Training with Cost Tracking (ml-03, fo-07)
        ↓
7. Model Evaluation (ml-03, ds-08)
        ↓
8. Model Versioning (mo-03)
        ↓
9. Model Security Scan (sa-08)
        ↓
10. Model Registration (ml-07)
        ↓
11. Lineage Tracking (mo-06)
        ↓
12. Model Deployment (ml-04)
```

## 🎯 Quick Wins

1. **Switch to spot instances** - 60-90% training cost reduction
2. **Use Bayesian optimization** - 80-90% fewer hyperparameter trials
3. **Enable mixed precision training** - 2x speed on GPUs
4. **Implement early stopping** - 20-40% faster training
5. **Cache preprocessed data** - 30-50% faster data loading
6. **Set up experiment tracking** - Better model comparison
7. **Implement checkpointing** - Resilience to preemption
8. **Use learning rate schedules** - Better convergence (see the sketch below)
9. **Track training costs** - Visibility into spending
10. **Automate model evaluation** - Consistent quality gates

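To ground Quick Wins 4 and 8, a minimal PyTorch sketch that combines a cosine learning-rate schedule with patience-based early stopping; `model`, `optimizer`, `trainer`, and the data loaders are assumed to follow the mixed-precision example above, and the patience of 5 is an illustrative choice.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

# Assumes model, optimizer, trainer, train_loader, val_loader as defined above
scheduler = CosineAnnealingLR(optimizer, T_max=50)  # anneal the LR over 50 epochs

best_val_loss, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(50):
    train_loss = trainer.train_epoch(train_loader)
    val_loss = trainer.validate(val_loader)
    scheduler.step()  # advance the schedule once per epoch

    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best_model.pt")  # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # no improvement for 5 epochs: stop early
            print(f"Early stopping at epoch {epoch}")
            break
```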
1
|
+
# Skill 3: Model Training & Hyperparameter Tuning
|
|
2
|
+
|
|
3
|
+
## 🎯 Overview
|
|
4
|
+
Implement scalable model training pipelines with automated hyperparameter optimization and experiment tracking.
|
|
5
|
+
|
|
6
|
+
## 🔗 Connections
|
|
7
|
+
- **Data Scientist**: Productionizes experimental models (ds-01, ds-02, ds-08)
|
|
8
|
+
- **ML Engineer**: Feeds from feature store, deploys to serving (ml-02, ml-04, ml-07)
|
|
9
|
+
- **MLOps**: Experiment tracking and model versioning (mo-01, mo-03)
|
|
10
|
+
- **FinOps**: Optimizes training costs and compute usage (fo-06, fo-07)
|
|
11
|
+
- **DevOps**: Automates training pipelines with CI/CD (do-01, do-03, do-08)
|
|
12
|
+
- **Security Architect**: Ensures secure training environments (sa-02, sa-08)
|
|
13
|
+
- **System Design**: Distributed training architecture (sd-03, sd-05)
|
|
14
|
+
- **Data Engineer**: Consumes validated training data (de-03)
|
|
15
|
+
|
|
16
|
+
## 🛠️ Tools Included
|
|
17
|
+
|
|
18
|
+
### 1. `model_trainer.py`
|
|
19
|
+
Unified training interface for sklearn, XGBoost, LightGBM, PyTorch.
|
|
20
|
+
|
|
21
|
+
### 2. `hyperparameter_optimizer.py`
|
|
22
|
+
Automated hyperparameter tuning with Optuna/Ray Tune.
|
|
23
|
+
|
|
24
|
+
### 3. `experiment_tracker.py`
|
|
25
|
+
MLflow integration for comprehensive experiment tracking.
|
|
26
|
+
|
|
27
|
+
### 4. `model_evaluator.py`
|
|
28
|
+
Model evaluation with business metrics and validation.
|
|
29
|
+
|
|
30
|
+
### 5. `training_config.yaml`
|
|
31
|
+
Configuration templates for training pipelines.
|
|
32
|
+
|
|
33
|
+
## 🏗️ Training Pipeline Architecture
|
|
34
|
+
|
|
35
|
+
```
|
|
36
|
+
Feature Store → Data Preparation → Model Training → Evaluation → Registry
|
|
37
|
+
↓ ↓ ↓ ↓
|
|
38
|
+
Validation Experiment Track Metrics Versioning
|
|
39
|
+
Splitting HPO Comparison Lineage
|
|
40
|
+
Augmentation Checkpointing Validation Promotion
|
|
41
|
+
```
|
|
42
|
+
|
|
43
|
+
## 🚀 Quick Start
|
|
44
|
+
|
|
45
|
+
```python
|
|
46
|
+
from model_trainer import ModelTrainer
|
|
47
|
+
from hyperparameter_optimizer import HPOptimizer
|
|
48
|
+
from experiment_tracker import ExperimentTracker
|
|
49
|
+
|
|
50
|
+
# Initialize tracker
|
|
51
|
+
tracker = ExperimentTracker(experiment_name="churn_prediction_v2")
|
|
52
|
+
|
|
53
|
+
# Configure training
|
|
54
|
+
trainer = ModelTrainer(
|
|
55
|
+
model_type="xgboost",
|
|
56
|
+
objective="binary:logistic",
|
|
57
|
+
eval_metric="auc"
|
|
58
|
+
)
|
|
59
|
+
|
|
60
|
+
# Load features from feature store
|
|
61
|
+
features = feature_store.get_historical_features(
|
|
62
|
+
feature_refs=["customer_behavior:v1"],
|
|
63
|
+
entity_df=training_entities
|
|
64
|
+
)
|
|
65
|
+
|
|
66
|
+
# Train with experiment tracking
|
|
67
|
+
with tracker.start_run():
|
|
68
|
+
# Log parameters
|
|
69
|
+
tracker.log_params({
|
|
70
|
+
"model_type": "xgboost",
|
|
71
|
+
"max_depth": 6,
|
|
72
|
+
"n_estimators": 100
|
|
73
|
+
})
|
|
74
|
+
|
|
75
|
+
# Train model
|
|
76
|
+
model = trainer.train(
|
|
77
|
+
X_train=features.drop(columns=["target"]),
|
|
78
|
+
y_train=features["target"],
|
|
79
|
+
validation_split=0.2
|
|
80
|
+
)
|
|
81
|
+
|
|
82
|
+
# Evaluate
|
|
83
|
+
metrics = trainer.evaluate(X_test, y_test)
|
|
84
|
+
tracker.log_metrics(metrics)
|
|
85
|
+
|
|
86
|
+
# Save model
|
|
87
|
+
tracker.log_model(model, "churn_predictor")
|
|
88
|
+
|
|
89
|
+
# Hyperparameter optimization
|
|
90
|
+
optimizer = HPOptimizer(
|
|
91
|
+
estimator=trainer,
|
|
92
|
+
param_space={
|
|
93
|
+
"max_depth": [3, 5, 7, 9],
|
|
94
|
+
"n_estimators": [50, 100, 200],
|
|
95
|
+
"learning_rate": [0.01, 0.05, 0.1],
|
|
96
|
+
"subsample": [0.7, 0.8, 0.9, 1.0]
|
|
97
|
+
},
|
|
98
|
+
optimization_metric="auc",
|
|
99
|
+
n_trials=50
|
|
100
|
+
)
|
|
101
|
+
|
|
102
|
+
best_params = optimizer.optimize(X_train, y_train, X_val, y_val)
|
|
103
|
+
```
|
|
104
|
+
|
|
105
|
+
## 📚 Best Practices
|
|
106
|
+
|
|
107
|
+
### Training Cost Optimization (FinOps Integration)
|
|
108
|
+
|
|
109
|
+
1. **Use Spot/Preemptible Instances**
|
|
110
|
+
- 60-90% cost savings for interruptible training
|
|
111
|
+
- Implement checkpointing for fault tolerance
|
|
112
|
+
- Automatic job resumption after preemption
|
|
113
|
+
- Best for batch training jobs
|
|
114
|
+
- Reference: FinOps fo-06 (Compute Optimization), fo-07 (AI/ML Cost)
|
|
115
|
+
|
|
116
|
+
2. **Right-Size Training Compute**
|
|
117
|
+
- Profile training jobs to determine optimal instance size
|
|
118
|
+
- Use CPU for tree-based models (XGBoost, LightGBM)
|
|
119
|
+
- Reserve GPUs for deep learning only
|
|
120
|
+
- Monitor GPU/CPU utilization
|
|
121
|
+
- Auto-scale compute clusters
|
|
122
|
+
- Reference: FinOps fo-06
|
|
123
|
+
|
|
124
|
+
3. **Optimize Hyperparameter Tuning**
|
|
125
|
+
- Use Bayesian optimization over grid search (10x fewer trials)
|
|
126
|
+
- Implement early stopping to terminate poor trials
|
|
127
|
+
- Parallelize trials efficiently
|
|
128
|
+
- Track cost per hyperparameter trial
|
|
129
|
+
- Set budget limits per optimization run
|
|
130
|
+
- Reference: FinOps fo-07, ML best practices
|
|
131
|
+
|
|
132
|
+
4. **Training Time Optimization**
|
|
133
|
+
- Use early stopping to prevent overtraining
|
|
134
|
+
- Implement learning rate schedules
|
|
135
|
+
- Use mixed precision training (2x faster on GPUs)
|
|
136
|
+
- Profile and optimize data loading
|
|
137
|
+
- Cache preprocessed data
|
|
138
|
+
- Reference: ML Engineer best practices
|
|
139
|
+
|
|
140
|
+
5. **Track Training Costs Per Experiment**
|
|
141
|
+
- Log compute costs with experiments
|
|
142
|
+
- Track training duration and resource usage
|
|
143
|
+
- Monitor cost vs accuracy trade-offs
|
|
144
|
+
- Set budget alerts for runaway experiments
|
|
145
|
+
- Reference: FinOps fo-01 (Cost Monitoring), fo-03 (Budget Management)
|
|
146
|
+
|
|
147
|
+
### MLOps Integration for Training
|
|
148
|
+
|
|
149
|
+
6. **Comprehensive Experiment Tracking**
|
|
150
|
+
- Track all hyperparameters and configurations
|
|
151
|
+
- Log all metrics (training, validation, test)
|
|
152
|
+
- Version datasets used for training
|
|
153
|
+
- Save model artifacts and checkpoints
|
|
154
|
+
- Track training duration and resource usage
|
|
155
|
+
- Reference: MLOps mo-01 (Experiment Tracking)
|
|
156
|
+
|
|
157
|
+
7. **Model Versioning & Lineage**
|
|
158
|
+
- Version all trained models
|
|
159
|
+
- Track complete lineage (data + code + config)
|
|
160
|
+
- Link models to training runs
|
|
161
|
+
- Document model architecture and purpose
|
|
162
|
+
- Reference: MLOps mo-03 (Model Versioning), mo-06 (Lineage)
|
|
163
|
+
|
|
164
|
+
8. **Reproducible Training**
|
|
165
|
+
- Set random seeds for reproducibility
|
|
166
|
+
- Version control training code
|
|
167
|
+
- Pin dependency versions
|
|
168
|
+
- Document training environment
|
|
169
|
+
- Store training configurations
|
|
170
|
+
- Reference: MLOps mo-01, DevOps do-01
|
|
171
|
+
|
|
172
|
+
9. **Model Validation & Testing**
|
|
173
|
+
- Validate on held-out test set
|
|
174
|
+
- Test model performance on edge cases
|
|
175
|
+
- Verify model fairness and bias
|
|
176
|
+
- Test inference latency
|
|
177
|
+
- Reference: MLOps mo-07 (Testing), Data Scientist ds-08
|
|
178
|
+
|
|
179
|
+
### DevOps Integration for Training
|
|
180
|
+
|
|
181
|
+
10. **Automated Training Pipelines**
|
|
182
|
+
- Trigger training on data updates
|
|
183
|
+
- Automate model evaluation and comparison
|
|
184
|
+
- Implement automatic model promotion
|
|
185
|
+
- Schedule periodic retraining
|
|
186
|
+
- Reference: DevOps do-01 (CI/CD), ML Engineer ml-01
|
|
187
|
+
|
|
188
|
+
11. **Containerized Training**
|
|
189
|
+
- Package training code in Docker containers
|
|
190
|
+
- Use multi-stage builds for efficiency
|
|
191
|
+
- Version control container definitions
|
|
192
|
+
- Test containers locally before deployment
|
|
193
|
+
- Reference: DevOps do-03 (Containerization)
|
|
194
|
+
|
|
195
|
+
12. **Infrastructure as Code for Training**
|
|
196
|
+
- Define training infrastructure in Terraform
|
|
197
|
+
- Automate compute cluster provisioning
|
|
198
|
+
- Version control all infrastructure
|
|
199
|
+
- Implement environment parity (dev/staging/prod)
|
|
200
|
+
- Reference: DevOps do-04 (IaC)
|
|
201
|
+
|
|
202
|
+
13. **Monitoring & Observability**
|
|
203
|
+
- Monitor training job status and health
|
|
204
|
+
- Track training metrics in real-time
|
|
205
|
+
- Set up alerts for training failures
|
|
206
|
+
- Log training errors and exceptions
|
|
207
|
+
- Reference: DevOps do-08 (Monitoring)
|
|
208
|
+
|
|
209
|
+
### Data Quality for Training
|
|
210
|
+
|
|
211
|
+
14. **Training Data Validation**
|
|
212
|
+
- Validate data schema before training
|
|
213
|
+
- Check for data drift vs previous training
|
|
214
|
+
- Verify label distribution
|
|
215
|
+
- Detect data quality issues early
|
|
216
|
+
- Reference: Data Engineer de-03 (Data Quality)
|
|
217
|
+
|
|
218
|
+
15. **Data Versioning**
|
|
219
|
+
- Version training datasets
|
|
220
|
+
- Track data lineage for reproducibility
|
|
221
|
+
- Document data collection and labeling
|
|
222
|
+
- Reference: MLOps mo-06 (Lineage)
|
|
223
|
+
|
|
224
|
+
### Security & Compliance
|
|
225
|
+
|
|
226
|
+
16. **Secure Training Environments**
|
|
227
|
+
- Train in isolated network environments
|
|
228
|
+
- Use managed identities for authentication
|
|
229
|
+
- Encrypt data at rest and in transit
|
|
230
|
+
- Audit training job access
|
|
231
|
+
- Reference: Security Architect sa-02 (IAM), sa-03 (Network)
|
|
232
|
+
|
|
233
|
+
17. **Model Security**
|
|
234
|
+
- Scan training dependencies for vulnerabilities
|
|
235
|
+
- Implement model access controls
|
|
236
|
+
- Audit model artifacts
|
|
237
|
+
- Document model security posture
|
|
238
|
+
- Reference: Security Architect sa-08 (LLM Security)
|
|
239
|
+
|
|
240
|
+
### Azure-Specific Best Practices
|
|
241
|
+
|
|
242
|
+
18. **Azure Machine Learning**
|
|
243
|
+
- Use managed compute clusters
|
|
244
|
+
- Enable auto-scaling for training jobs
|
|
245
|
+
- Leverage Azure ML pipelines
|
|
246
|
+
- Use Azure ML environments for reproducibility
|
|
247
|
+
- Reference: Azure az-04 (AI/ML Services)
|
|
248
|
+
|
|
249
|
+
19. **Cost Management in Azure ML**
|
|
250
|
+
- Use low-priority compute for training (70-80% savings)
|
|
251
|
+
- Enable compute auto-shutdown
|
|
252
|
+
- Monitor compute utilization
|
|
253
|
+
- Set workspace spending limits
|
|
254
|
+
- Reference: Azure az-04, FinOps fo-06
|
|
255
|
+
|
|
256
|
+
20. **Model Training Best Practices**
   - Use early stopping callbacks (see the sketch below)
   - Implement cross-validation for robust performance estimates
   - Track learning curves to detect overfitting
   - Save best model checkpoints
   - Reference: ML Engineer best practices

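A minimal PyTorch early-stopping sketch implementing the callback mentioned above; the patience value and checkpoint path are illustrative.

```python
import torch

class EarlyStopping:
    """Stop training when validation loss stops improving; keep the best checkpoint."""

    def __init__(self, patience: int = 5, checkpoint_path: str = "best_model.pt"):
        self.patience = patience
        self.checkpoint_path = checkpoint_path
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float, model: torch.nn.Module) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_epochs = 0
            torch.save(model.state_dict(), self.checkpoint_path)  # save best checkpoint
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Usage inside a training loop:
#   stopper = EarlyStopping(patience=5)
#   if stopper.step(val_loss, model):
#       break
```
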
## 💰 Cost Optimization Examples

### Spot Instance Training with Checkpointing
```python
from azure.ai.ml import MLClient, command, Input
from azure.ai.ml.entities import AmlCompute
from azure.identity import DefaultAzureCredential

# Authenticated workspace client (workspace details come from config.json)
ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Create spot compute cluster (60-90% savings)
spot_cluster = AmlCompute(
    name="spot-training-cluster",
    size="Standard_NC6s_v3",  # GPU instance
    min_instances=0,
    max_instances=4,
    tier="LowPriority",  # Spot pricing!
    idle_time_before_scale_down=300
)

ml_client.compute.begin_create_or_update(spot_cluster).result()

# Excerpt of ./src/train.py: a checkpointed trainer that automatically resumes
# after spot preemption (model, optimizer, data loaders and num_epochs are
# defined elsewhere in the full script)
training_script = """
import torch
from model_trainer import CheckpointedTrainer

trainer = CheckpointedTrainer(
    model_type="pytorch",
    checkpoint_dir="./outputs/checkpoints",
    checkpoint_frequency=100,  # Save every 100 steps
    resume_from_checkpoint=True
)

# Load checkpoint if it exists (after preemption)
start_epoch = trainer.load_checkpoint_if_exists()

# Training loop with automatic checkpointing
for epoch in range(start_epoch, num_epochs):
    train_loss = trainer.train_epoch(model, train_loader)
    val_loss = trainer.validate(model, val_loader)

    # Automatic checkpoint saving
    trainer.save_checkpoint(
        epoch=epoch,
        model=model,
        optimizer=optimizer,
        loss=val_loss
    )
"""

# Submit to spot cluster
job = command(
    code="./src",
    command="python train.py",
    environment="azureml:pytorch-training:1",
    compute="spot-training-cluster",
    inputs={
        "training_data": Input(path="azureml://datasets/training_data/labels/latest")
    }
)

# Job automatically resumes from checkpoint if preempted
run = ml_client.jobs.create_or_update(job)

# Track savings
from finops_tracker import TrainingCostTracker
cost_tracker = TrainingCostTracker()
savings_report = cost_tracker.calculate_spot_savings(run.name)
print(f"Cost with spot: ${savings_report.spot_cost:.2f}")
print(f"Cost with dedicated: ${savings_report.dedicated_cost:.2f}")
print(f"Total savings: ${savings_report.savings:.2f} ({savings_report.savings_percent}%)")
```

### Bayesian Hyperparameter Optimization (10x Fewer Trials)
```python
import optuna

from finops_tracker import HPOCostTracker
from model_trainer import ModelTrainer  # provides the xgboost wrapper used below

# X_train, y_train, X_val, y_val are assumed to be loaded already
cost_tracker = HPOCostTracker()

def objective(trial):
    """Objective function with cost tracking"""

    # Suggest hyperparameters
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 7),
        'gamma': trial.suggest_float('gamma', 0, 0.5)
    }

    # Track trial cost
    with cost_tracker.track_trial(trial.number):
        # Train model
        trainer = ModelTrainer(model_type="xgboost", **params)
        trainer.train(X_train, y_train)

        # Evaluate
        score = trainer.evaluate(X_val, y_val)['auc']

    # Report cost
    cost_tracker.report_trial_cost(trial.number, score)

    return score

# Bayesian optimization with early stopping
study = optuna.create_study(
    direction='maximize',
    sampler=optuna.samplers.TPESampler(seed=42),  # Bayesian optimization
    pruner=optuna.pruners.MedianPruner(  # Early stopping for poor trials
        n_startup_trials=5,
        n_warmup_steps=10,
        interval_steps=1
    )
)

# Run optimization with budget limit
study.optimize(
    objective,
    n_trials=50,  # vs 1000+ for grid search
    timeout=3600,  # 1 hour max
    callbacks=[cost_tracker.budget_callback(max_budget=100.00)]
)

# Results with cost analysis
print(f"Best AUC: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

cost_report = cost_tracker.generate_hpo_report()
print("\nHPO Cost Analysis:")
print(f"Total trials: {cost_report.total_trials}")
print(f"Completed trials: {cost_report.completed_trials}")
print(f"Pruned trials: {cost_report.pruned_trials} (cost savings!)")
print(f"Total cost: ${cost_report.total_cost:.2f}")
print(f"Average cost per trial: ${cost_report.avg_cost_per_trial:.2f}")
print(f"Estimated grid search cost: ${cost_report.grid_search_cost_estimate:.2f}")
print(f"Savings vs grid search: ${cost_report.savings:.2f} ({cost_report.savings_percent}%)")

# Visualize optimization
from optuna.visualization import plot_optimization_history, plot_param_importances
plot_optimization_history(study).show()
plot_param_importances(study).show()
```

### Mixed Precision Training (2x Speed on GPUs)
```python
import torch
from torch.cuda.amp import autocast, GradScaler

from finops_tracker import TrainingCostTracker
from model_trainer import PyTorchTrainer

class MixedPrecisionTrainer(PyTorchTrainer):
    """Mixed precision training for 2x speed and 50% memory reduction"""

    def __init__(self, model, optimizer, **kwargs):
        super().__init__(model, optimizer, **kwargs)
        self.scaler = GradScaler()  # For numerical stability
        self.cost_tracker = TrainingCostTracker()

    def train_epoch(self, train_loader):
        self.model.train()
        total_loss = 0

        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.cuda(), target.cuda()

            # Mixed precision forward pass
            with autocast():  # Automatic mixed precision
                output = self.model(data)
                loss = self.criterion(output, target)  # criterion comes from PyTorchTrainer

            # Scaled backpropagation
            self.optimizer.zero_grad()
            self.scaler.scale(loss).backward()
            self.scaler.step(self.optimizer)
            self.scaler.update()

            total_loss += loss.item()

        return total_loss / len(train_loader)

# Usage (MyModel, num_epochs and the data loaders are defined elsewhere)
model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

trainer = MixedPrecisionTrainer(model, optimizer)

# Track training time and cost
cost_tracker = trainer.cost_tracker
with cost_tracker.track_training("mixed_precision_training"):
    for epoch in range(num_epochs):
        train_loss = trainer.train_epoch(train_loader)
        val_loss = trainer.validate(val_loader)

# Compare with FP32 training
report = cost_tracker.compare_precision_modes()
print(f"Mixed precision training time: {report.mixed_time:.2f}s")
print(f"FP32 training time: {report.fp32_time:.2f}s")
print(f"Speedup: {report.speedup:.2f}x")
print(f"Cost savings: ${report.cost_savings:.2f}")
```

### Cost-Tracked Experiment Management
```python
from experiment_tracker import ExperimentTracker
from finops_tracker import ExperimentCostTracker

class CostAwareExperimentTracker:
    """Experiment tracker with integrated cost monitoring"""

    def __init__(self, experiment_name: str):
        self.tracker = ExperimentTracker(experiment_name=experiment_name)
        self.cost_tracker = ExperimentCostTracker()

    def start_run(self, run_name: str, compute_type: str = "cpu"):
        """Start a new run with cost tracking"""

        # Start MLflow run
        self.run = self.tracker.start_run(run_name=run_name)

        # Start cost tracking
        self.cost_tracker.start_tracking(
            run_id=self.run.info.run_id,
            compute_type=compute_type
        )

        return self

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Calculate and log costs
        cost_metrics = self.cost_tracker.stop_tracking()

        # Log cost metrics to MLflow
        self.tracker.log_metrics({
            "compute_cost_usd": cost_metrics.compute_cost,
            "storage_cost_usd": cost_metrics.storage_cost,
            "total_cost_usd": cost_metrics.total_cost,
            "training_duration_seconds": cost_metrics.duration,
            "cost_per_hour": cost_metrics.cost_per_hour
        })

        # Add cost tags
        self.tracker.set_tags({
            "compute_type": cost_metrics.compute_type,
            "instance_type": cost_metrics.instance_type,
            "cost_optimized": str(cost_metrics.compute_type == "spot")
        })

        # End run
        self.tracker.end_run()

# Usage (train_xgboost, train_neural_network and evaluate are project helpers)
tracker = CostAwareExperimentTracker("customer_churn")

# Experiment 1: Standard training
with tracker.start_run("baseline_xgboost", compute_type="cpu"):
    model = train_xgboost(X_train, y_train)
    metrics = evaluate(model, X_test, y_test)
    tracker.tracker.log_metrics(metrics)
    tracker.tracker.log_model(model, "xgboost_baseline")

# Experiment 2: GPU training with spot instances
with tracker.start_run("deep_learning_spot", compute_type="spot_gpu"):
    model = train_neural_network(X_train, y_train)
    metrics = evaluate(model, X_test, y_test)
    tracker.tracker.log_metrics(metrics)
    tracker.tracker.log_model(model, "nn_spot")

# Compare experiments by cost and performance
experiments_df = tracker.cost_tracker.compare_experiments()
print(experiments_df[['run_name', 'accuracy', 'total_cost_usd', 'cost_per_point_accuracy']])

# Find the most cost-efficient model
best_value = experiments_df['cost_per_point_accuracy'].idxmin()
print(f"\nMost cost-efficient model: {experiments_df.loc[best_value, 'run_name']}")
print(f"Accuracy: {experiments_df.loc[best_value, 'accuracy']:.4f}")
print(f"Cost: ${experiments_df.loc[best_value, 'total_cost_usd']:.2f}")
```

## 🚀 CI/CD for Model Training

### Automated Training Pipeline
```yaml
# .github/workflows/model-training.yml
name: Model Training Pipeline

on:
  push:
    paths:
      - 'models/**'
      - 'training/**'
    branches:
      - main
  schedule:
    - cron: '0 3 * * 0'  # Weekly retraining

jobs:
  train-and-evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Azure Login
        uses: azure/login@v1
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov

      - name: Run unit tests
        run: pytest tests/unit/ --cov=src

      - name: Validate training data
        run: python scripts/validate_training_data.py

      - name: Train model on spot instances
        run: |
          python training/train_model.py \
            --experiment-name "churn_weekly_${{ github.run_number }}" \
            --compute-type spot \
            --max-cost 50.00 \
            --enable-early-stopping

      - name: Run hyperparameter optimization
        if: github.ref == 'refs/heads/main'
        run: |
          python training/optimize_hyperparameters.py \
            --n-trials 30 \
            --optimization-method bayesian \
            --max-cost 100.00

      - name: Evaluate model
        run: |
          python training/evaluate_model.py \
            --min-accuracy 0.85 \
            --min-auc 0.90 \
            --test-fairness

      - name: Generate model card
        run: python scripts/generate_model_card.py

      - name: Register model
        if: success()
        run: python scripts/register_model.py --stage Staging

      - name: Run integration tests
        run: pytest tests/integration/

      - name: Generate training report
        run: python scripts/generate_training_report.py

      - name: Upload artifacts
        uses: actions/upload-artifact@v3
        with:
          name: training-artifacts
          path: |
            outputs/
            reports/
```

## 📊 Metrics & Monitoring

| Metric Category | Metric | Target | Tool |
|-----------------|--------|--------|------|
| **Training Costs** | Cost per training run | <$25 | FinOps tracker |
| | Monthly training budget | <$2000 | Azure Cost Management |
| | Spot instance savings | >70% | Cost tracker |
| | GPU utilization | >80% | Azure Monitor |
| **HPO Costs** | Cost per HPO run | <$100 | HPO cost tracker |
| | Trials pruned (savings) | >40% | Optuna |
| | Bayesian vs grid savings | >80% | Cost comparison |
| **Training Performance** | Training time | <2 hours | Experiment tracker |
| | Model accuracy | >0.85 | MLflow |
| | AUC score | >0.90 | Evaluation metrics |
| **Resource Utilization** | CPU utilization | >70% | Azure Monitor |
| | Memory utilization | >60% | Azure Monitor |
| | Data loading time | <10% of total | Profiler |
| **Pipeline Reliability** | Training success rate | >95% | Pipeline metrics |
| | Experiment reproducibility | 100% | Seed + versioning |
| **Model Quality** | Validation score | >baseline | Experiment tracker |
| | Test set performance | >0.85 | Model evaluator |

## 🔄 Integration Workflow

### End-to-End Training Pipeline
```
1. Feature Store Access (ml-02)
        ↓
2. Data Validation (de-03)
        ↓
3. Training Data Preparation (ml-03)
        ↓
4. Experiment Initialization (mo-01)
        ↓
5. Hyperparameter Optimization (ml-03)
        ↓
6. Model Training with Cost Tracking (ml-03, fo-07)
        ↓
7. Model Evaluation (ml-03, ds-08)
        ↓
8. Model Versioning (mo-03)
        ↓
9. Model Security Scan (sa-08)
        ↓
10. Model Registration (ml-07)
        ↓
11. Lineage Tracking (mo-06)
        ↓
12. Model Deployment (ml-04)
```

## 🎯 Quick Wins

1. **Switch to spot instances** - 60-90% training cost reduction
2. **Use Bayesian optimization** - 80-90% fewer hyperparameter trials
3. **Enable mixed precision training** - 2x speed on GPUs
4. **Implement early stopping** - 20-40% faster training
5. **Cache preprocessed data** - 30-50% faster data loading (see the sketch below)
6. **Set up experiment tracking** - Better model comparison
7. **Implement checkpointing** - Resilience to preemption
8. **Use learning rate schedules** - Better convergence
9. **Track training costs** - Visibility into spending
10. **Automate model evaluation** - Consistent quality gates
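
For quick win 5, a minimal sketch using joblib's on-disk cache so preprocessing runs once per dataset version; the file path is an assumption and the preprocessing body is a placeholder.

```python
import pandas as pd
from joblib import Memory

# On-disk cache: repeated runs skip preprocessing entirely (30-50% faster loads)
memory = Memory(location="./cache", verbose=0)

@memory.cache
def load_and_preprocess(path: str) -> pd.DataFrame:
    df = pd.read_parquet(path)
    # ... expensive feature engineering goes here (placeholder) ...
    return df

train_df = load_and_preprocess("data/train_v12.parquet")  # cached after the first call
```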
|