tech-hub-skills 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (133)
  1. package/LICENSE +21 -0
  2. package/README.md +250 -0
  3. package/bin/cli.js +241 -0
  4. package/bin/copilot.js +182 -0
  5. package/bin/postinstall.js +42 -0
  6. package/package.json +46 -0
  7. package/tech_hub_skills/roles/ai-engineer/skills/01-prompt-engineering/README.md +252 -0
  8. package/tech_hub_skills/roles/ai-engineer/skills/02-rag-pipeline/README.md +448 -0
  9. package/tech_hub_skills/roles/ai-engineer/skills/03-agent-orchestration/README.md +599 -0
  10. package/tech_hub_skills/roles/ai-engineer/skills/04-llm-guardrails/README.md +735 -0
  11. package/tech_hub_skills/roles/ai-engineer/skills/05-vector-embeddings/README.md +711 -0
  12. package/tech_hub_skills/roles/ai-engineer/skills/06-llm-evaluation/README.md +777 -0
  13. package/tech_hub_skills/roles/azure/skills/01-infrastructure-fundamentals/README.md +264 -0
  14. package/tech_hub_skills/roles/azure/skills/02-data-factory/README.md +264 -0
  15. package/tech_hub_skills/roles/azure/skills/03-synapse-analytics/README.md +264 -0
  16. package/tech_hub_skills/roles/azure/skills/04-databricks/README.md +264 -0
  17. package/tech_hub_skills/roles/azure/skills/05-functions/README.md +264 -0
  18. package/tech_hub_skills/roles/azure/skills/06-kubernetes-service/README.md +264 -0
  19. package/tech_hub_skills/roles/azure/skills/07-openai-service/README.md +264 -0
  20. package/tech_hub_skills/roles/azure/skills/08-machine-learning/README.md +264 -0
  21. package/tech_hub_skills/roles/azure/skills/09-storage-adls/README.md +264 -0
  22. package/tech_hub_skills/roles/azure/skills/10-networking/README.md +264 -0
  23. package/tech_hub_skills/roles/azure/skills/11-sql-cosmos/README.md +264 -0
  24. package/tech_hub_skills/roles/azure/skills/12-event-hubs/README.md +264 -0
  25. package/tech_hub_skills/roles/code-review/skills/01-automated-code-review/README.md +394 -0
  26. package/tech_hub_skills/roles/code-review/skills/02-pr-review-workflow/README.md +427 -0
  27. package/tech_hub_skills/roles/code-review/skills/03-code-quality-gates/README.md +518 -0
  28. package/tech_hub_skills/roles/code-review/skills/04-reviewer-assignment/README.md +504 -0
  29. package/tech_hub_skills/roles/code-review/skills/05-review-analytics/README.md +540 -0
  30. package/tech_hub_skills/roles/data-engineer/skills/01-lakehouse-architecture/README.md +550 -0
  31. package/tech_hub_skills/roles/data-engineer/skills/02-etl-pipeline/README.md +580 -0
  32. package/tech_hub_skills/roles/data-engineer/skills/03-data-quality/README.md +579 -0
  33. package/tech_hub_skills/roles/data-engineer/skills/04-streaming-pipelines/README.md +608 -0
  34. package/tech_hub_skills/roles/data-engineer/skills/05-performance-optimization/README.md +547 -0
  35. package/tech_hub_skills/roles/data-governance/skills/01-data-catalog/README.md +112 -0
  36. package/tech_hub_skills/roles/data-governance/skills/02-data-lineage/README.md +129 -0
  37. package/tech_hub_skills/roles/data-governance/skills/03-data-quality-framework/README.md +182 -0
  38. package/tech_hub_skills/roles/data-governance/skills/04-access-control/README.md +39 -0
  39. package/tech_hub_skills/roles/data-governance/skills/05-master-data-management/README.md +40 -0
  40. package/tech_hub_skills/roles/data-governance/skills/06-compliance-privacy/README.md +46 -0
  41. package/tech_hub_skills/roles/data-scientist/skills/01-eda-automation/README.md +230 -0
  42. package/tech_hub_skills/roles/data-scientist/skills/02-statistical-modeling/README.md +264 -0
  43. package/tech_hub_skills/roles/data-scientist/skills/03-feature-engineering/README.md +264 -0
  44. package/tech_hub_skills/roles/data-scientist/skills/04-predictive-modeling/README.md +264 -0
  45. package/tech_hub_skills/roles/data-scientist/skills/05-customer-analytics/README.md +264 -0
  46. package/tech_hub_skills/roles/data-scientist/skills/06-campaign-analysis/README.md +264 -0
  47. package/tech_hub_skills/roles/data-scientist/skills/07-experimentation/README.md +264 -0
  48. package/tech_hub_skills/roles/data-scientist/skills/08-data-visualization/README.md +264 -0
  49. package/tech_hub_skills/roles/devops/skills/01-cicd-pipeline/README.md +264 -0
  50. package/tech_hub_skills/roles/devops/skills/02-container-orchestration/README.md +264 -0
  51. package/tech_hub_skills/roles/devops/skills/03-infrastructure-as-code/README.md +264 -0
  52. package/tech_hub_skills/roles/devops/skills/04-gitops/README.md +264 -0
  53. package/tech_hub_skills/roles/devops/skills/05-environment-management/README.md +264 -0
  54. package/tech_hub_skills/roles/devops/skills/06-automated-testing/README.md +264 -0
  55. package/tech_hub_skills/roles/devops/skills/07-release-management/README.md +264 -0
  56. package/tech_hub_skills/roles/devops/skills/08-monitoring-alerting/README.md +264 -0
  57. package/tech_hub_skills/roles/devops/skills/09-devsecops/README.md +265 -0
  58. package/tech_hub_skills/roles/finops/skills/01-cost-visibility/README.md +264 -0
  59. package/tech_hub_skills/roles/finops/skills/02-resource-tagging/README.md +264 -0
  60. package/tech_hub_skills/roles/finops/skills/03-budget-management/README.md +264 -0
  61. package/tech_hub_skills/roles/finops/skills/04-reserved-instances/README.md +264 -0
  62. package/tech_hub_skills/roles/finops/skills/05-spot-optimization/README.md +264 -0
  63. package/tech_hub_skills/roles/finops/skills/06-storage-tiering/README.md +264 -0
  64. package/tech_hub_skills/roles/finops/skills/07-compute-rightsizing/README.md +264 -0
  65. package/tech_hub_skills/roles/finops/skills/08-chargeback/README.md +264 -0
  66. package/tech_hub_skills/roles/ml-engineer/skills/01-mlops-pipeline/README.md +566 -0
  67. package/tech_hub_skills/roles/ml-engineer/skills/02-feature-engineering/README.md +655 -0
  68. package/tech_hub_skills/roles/ml-engineer/skills/03-model-training/README.md +704 -0
  69. package/tech_hub_skills/roles/ml-engineer/skills/04-model-serving/README.md +845 -0
  70. package/tech_hub_skills/roles/ml-engineer/skills/05-model-monitoring/README.md +874 -0
  71. package/tech_hub_skills/roles/mlops/skills/01-ml-pipeline-orchestration/README.md +264 -0
  72. package/tech_hub_skills/roles/mlops/skills/02-experiment-tracking/README.md +264 -0
  73. package/tech_hub_skills/roles/mlops/skills/03-model-registry/README.md +264 -0
  74. package/tech_hub_skills/roles/mlops/skills/04-feature-store/README.md +264 -0
  75. package/tech_hub_skills/roles/mlops/skills/05-model-deployment/README.md +264 -0
  76. package/tech_hub_skills/roles/mlops/skills/06-model-observability/README.md +264 -0
  77. package/tech_hub_skills/roles/mlops/skills/07-data-versioning/README.md +264 -0
  78. package/tech_hub_skills/roles/mlops/skills/08-ab-testing/README.md +264 -0
  79. package/tech_hub_skills/roles/mlops/skills/09-automated-retraining/README.md +264 -0
  80. package/tech_hub_skills/roles/platform-engineer/skills/01-internal-developer-platform/README.md +153 -0
  81. package/tech_hub_skills/roles/platform-engineer/skills/02-self-service-infrastructure/README.md +57 -0
  82. package/tech_hub_skills/roles/platform-engineer/skills/03-slo-sli-management/README.md +59 -0
  83. package/tech_hub_skills/roles/platform-engineer/skills/04-developer-experience/README.md +57 -0
  84. package/tech_hub_skills/roles/platform-engineer/skills/05-incident-management/README.md +73 -0
  85. package/tech_hub_skills/roles/platform-engineer/skills/06-capacity-management/README.md +59 -0
  86. package/tech_hub_skills/roles/product-designer/skills/01-requirements-discovery/README.md +407 -0
  87. package/tech_hub_skills/roles/product-designer/skills/02-user-research/README.md +382 -0
  88. package/tech_hub_skills/roles/product-designer/skills/03-brainstorming-ideation/README.md +437 -0
  89. package/tech_hub_skills/roles/product-designer/skills/04-ux-design/README.md +496 -0
  90. package/tech_hub_skills/roles/product-designer/skills/05-product-market-fit/README.md +376 -0
  91. package/tech_hub_skills/roles/product-designer/skills/06-stakeholder-management/README.md +412 -0
  92. package/tech_hub_skills/roles/security-architect/skills/01-pii-detection/README.md +319 -0
  93. package/tech_hub_skills/roles/security-architect/skills/02-threat-modeling/README.md +264 -0
  94. package/tech_hub_skills/roles/security-architect/skills/03-infrastructure-security/README.md +264 -0
  95. package/tech_hub_skills/roles/security-architect/skills/04-iam/README.md +264 -0
  96. package/tech_hub_skills/roles/security-architect/skills/05-application-security/README.md +264 -0
  97. package/tech_hub_skills/roles/security-architect/skills/06-secrets-management/README.md +264 -0
  98. package/tech_hub_skills/roles/security-architect/skills/07-security-monitoring/README.md +264 -0
  99. package/tech_hub_skills/roles/system-design/skills/01-architecture-patterns/README.md +337 -0
  100. package/tech_hub_skills/roles/system-design/skills/02-requirements-engineering/README.md +264 -0
  101. package/tech_hub_skills/roles/system-design/skills/03-scalability/README.md +264 -0
  102. package/tech_hub_skills/roles/system-design/skills/04-high-availability/README.md +264 -0
  103. package/tech_hub_skills/roles/system-design/skills/05-cost-optimization-design/README.md +264 -0
  104. package/tech_hub_skills/roles/system-design/skills/06-api-design/README.md +264 -0
  105. package/tech_hub_skills/roles/system-design/skills/07-observability-architecture/README.md +264 -0
  106. package/tech_hub_skills/roles/system-design/skills/08-process-automation/PROCESS_TEMPLATE.md +336 -0
  107. package/tech_hub_skills/roles/system-design/skills/08-process-automation/README.md +521 -0
  108. package/tech_hub_skills/skills/README.md +336 -0
  109. package/tech_hub_skills/skills/ai-engineer.md +104 -0
  110. package/tech_hub_skills/skills/azure.md +149 -0
  111. package/tech_hub_skills/skills/code-review.md +399 -0
  112. package/tech_hub_skills/skills/compliance-automation.md +747 -0
  113. package/tech_hub_skills/skills/data-engineer.md +113 -0
  114. package/tech_hub_skills/skills/data-governance.md +102 -0
  115. package/tech_hub_skills/skills/data-scientist.md +123 -0
  116. package/tech_hub_skills/skills/devops.md +160 -0
  117. package/tech_hub_skills/skills/docker.md +160 -0
  118. package/tech_hub_skills/skills/enterprise-dashboard.md +613 -0
  119. package/tech_hub_skills/skills/finops.md +184 -0
  120. package/tech_hub_skills/skills/ml-engineer.md +115 -0
  121. package/tech_hub_skills/skills/mlops.md +187 -0
  122. package/tech_hub_skills/skills/optimization-advisor.md +329 -0
  123. package/tech_hub_skills/skills/orchestrator.md +497 -0
  124. package/tech_hub_skills/skills/platform-engineer.md +102 -0
  125. package/tech_hub_skills/skills/process-automation.md +226 -0
  126. package/tech_hub_skills/skills/process-changelog.md +184 -0
  127. package/tech_hub_skills/skills/process-documentation.md +484 -0
  128. package/tech_hub_skills/skills/process-kanban.md +324 -0
  129. package/tech_hub_skills/skills/process-versioning.md +214 -0
  130. package/tech_hub_skills/skills/product-designer.md +104 -0
  131. package/tech_hub_skills/skills/project-starter.md +443 -0
  132. package/tech_hub_skills/skills/security-architect.md +135 -0
  133. package/tech_hub_skills/skills/system-design.md +126 -0
# Skill 3: Model Training & Hyperparameter Tuning

## 🎯 Overview
Implement scalable model training pipelines with automated hyperparameter optimization and experiment tracking.

## 🔗 Connections
- **Data Scientist**: Productionizes experimental models (ds-01, ds-02, ds-08)
- **ML Engineer**: Feeds from feature store, deploys to serving (ml-02, ml-04, ml-07)
- **MLOps**: Experiment tracking and model versioning (mo-01, mo-03)
- **FinOps**: Optimizes training costs and compute usage (fo-06, fo-07)
- **DevOps**: Automates training pipelines with CI/CD (do-01, do-03, do-08)
- **Security Architect**: Ensures secure training environments (sa-02, sa-08)
- **System Design**: Distributed training architecture (sd-03, sd-05)
- **Data Engineer**: Consumes validated training data (de-03)

## 🛠️ Tools Included

### 1. `model_trainer.py`
Unified training interface for sklearn, XGBoost, LightGBM, PyTorch.

### 2. `hyperparameter_optimizer.py`
Automated hyperparameter tuning with Optuna/Ray Tune.

### 3. `experiment_tracker.py`
MLflow integration for comprehensive experiment tracking.

### 4. `model_evaluator.py`
Model evaluation with business metrics and validation.

### 5. `training_config.yaml`
Configuration templates for training pipelines.

## 🏗️ Training Pipeline Architecture

```
Feature Store → Data Preparation → Model Training → Evaluation → Registry
                       ↓                 ↓              ↓           ↓
                  Validation      Experiment Track   Metrics    Versioning
                  Splitting       HPO                Comparison Lineage
                  Augmentation    Checkpointing      Validation Promotion
```

## 🚀 Quick Start

```python
from model_trainer import ModelTrainer
from hyperparameter_optimizer import HPOptimizer
from experiment_tracker import ExperimentTracker

# Initialize tracker
tracker = ExperimentTracker(experiment_name="churn_prediction_v2")

# Configure training
trainer = ModelTrainer(
    model_type="xgboost",
    objective="binary:logistic",
    eval_metric="auc"
)

# Load features from the feature store
# (feature_store and training_entities come from the ml-02 setup)
features = feature_store.get_historical_features(
    feature_refs=["customer_behavior:v1"],
    entity_df=training_entities
)

# Train with experiment tracking
with tracker.start_run():
    # Log parameters
    tracker.log_params({
        "model_type": "xgboost",
        "max_depth": 6,
        "n_estimators": 100
    })

    # Train model
    model = trainer.train(
        X_train=features.drop(columns=["target"]),
        y_train=features["target"],
        validation_split=0.2
    )

    # Evaluate on a held-out test split (X_test/y_test prepared upstream)
    metrics = trainer.evaluate(X_test, y_test)
    tracker.log_metrics(metrics)

    # Save model
    tracker.log_model(model, "churn_predictor")

# Hyperparameter optimization
optimizer = HPOptimizer(
    estimator=trainer,
    param_space={
        "max_depth": [3, 5, 7, 9],
        "n_estimators": [50, 100, 200],
        "learning_rate": [0.01, 0.05, 0.1],
        "subsample": [0.7, 0.8, 0.9, 1.0]
    },
    optimization_metric="auc",
    n_trials=50
)

best_params = optimizer.optimize(X_train, y_train, X_val, y_val)
```

## 📚 Best Practices

### Training Cost Optimization (FinOps Integration)

1. **Use Spot/Preemptible Instances**
   - 60-90% cost savings for interruptible training
   - Implement checkpointing for fault tolerance
   - Automatic job resumption after preemption
   - Best for batch training jobs
   - Reference: FinOps fo-06 (Compute Optimization), fo-07 (AI/ML Cost)

2. **Right-Size Training Compute**
   - Profile training jobs to determine optimal instance size
   - Use CPU for tree-based models (XGBoost, LightGBM)
   - Reserve GPUs for deep learning only
   - Monitor GPU/CPU utilization
   - Auto-scale compute clusters
   - Reference: FinOps fo-06

3. **Optimize Hyperparameter Tuning**
   - Use Bayesian optimization over grid search (10x fewer trials)
   - Implement early stopping to terminate poor trials
   - Parallelize trials efficiently
   - Track cost per hyperparameter trial
   - Set budget limits per optimization run
   - Reference: FinOps fo-07, ML best practices

4. **Training Time Optimization**
   - Use early stopping to prevent overtraining
   - Implement learning rate schedules (see the sketch after this list)
   - Use mixed precision training (2x faster on GPUs)
   - Profile and optimize data loading
   - Cache preprocessed data
   - Reference: ML Engineer best practices

5. **Track Training Costs Per Experiment**
   - Log compute costs with experiments
   - Track training duration and resource usage
   - Monitor cost vs accuracy trade-offs
   - Set budget alerts for runaway experiments
   - Reference: FinOps fo-01 (Cost Monitoring), fo-03 (Budget Management)

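To make the early-stopping and learning-rate-schedule practices in item 4 concrete, here is a minimal PyTorch sketch. `train_one_epoch` and `validate` are assumed helpers (e.g., thin wrappers over this skill's `model_trainer.py`), not part of the documented API:

```python
import torch

def fit_with_early_stopping(model, optimizer, train_loader, val_loader,
                            max_epochs=100, patience=5):
    # Halve the learning rate when validation loss plateaus for 2 epochs
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=2
    )
    best_val, stale_epochs = float("inf"), 0

    for epoch in range(max_epochs):
        train_one_epoch(model, optimizer, train_loader)  # assumed helper
        val_loss = validate(model, val_loader)           # assumed helper
        scheduler.step(val_loss)  # schedule reacts to validation loss

        if val_loss < best_val:
            best_val, stale_epochs = val_loss, 0
            torch.save(model.state_dict(), "best_model.pt")  # keep best checkpoint
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break  # early stop: no improvement for `patience` epochs
    return best_val
```
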
### MLOps Integration for Training

6. **Comprehensive Experiment Tracking**
   - Track all hyperparameters and configurations
   - Log all metrics (training, validation, test)
   - Version datasets used for training
   - Save model artifacts and checkpoints
   - Track training duration and resource usage
   - Reference: MLOps mo-01 (Experiment Tracking)

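Since `experiment_tracker.py` is an MLflow integration, the same practice can be sketched directly against the MLflow API (the `train` call and metric values are placeholders):

```python
import mlflow
import mlflow.sklearn

mlflow.set_experiment("churn_prediction_v2")

with mlflow.start_run(run_name="xgboost_baseline"):
    # Hyperparameters and configuration
    mlflow.log_params({"model_type": "xgboost", "max_depth": 6, "n_estimators": 100})
    # Dataset version, so the run is traceable to its training data
    mlflow.set_tag("dataset_version", "customer_behavior:v1")

    model = train(X_train, y_train)  # placeholder training step

    # Metrics for every split (values illustrative)
    mlflow.log_metrics({"train_auc": 0.93, "val_auc": 0.91, "test_auc": 0.90})
    # Model artifact stored alongside the run
    mlflow.sklearn.log_model(model, artifact_path="model")
```
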
7. **Model Versioning & Lineage**
   - Version all trained models
   - Track complete lineage (data + code + config)
   - Link models to training runs
   - Document model architecture and purpose
   - Reference: MLOps mo-03 (Model Versioning), mo-06 (Lineage)

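A minimal sketch of this step with MLflow's model registry (names are illustrative; `run_id` is assumed to come from the training run):

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register the model artifact logged under the given training run
result = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",  # run_id from the training run
    name="churn_predictor"
)

# Record lineage (data + code + config) on the new version
client = MlflowClient()
client.update_model_version(
    name="churn_predictor",
    version=result.version,
    description="XGBoost churn model; data=customer_behavior:v1, code=<git sha>"
)
```
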
8. **Reproducible Training**
   - Set random seeds for reproducibility
   - Version control training code
   - Pin dependency versions
   - Document training environment
   - Store training configurations
   - Reference: MLOps mo-01, DevOps do-01

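One common seed-pinning recipe, assuming PyTorch (full determinism can additionally require `torch.use_deterministic_algorithms(True)` depending on the ops used):

```python
import os
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    """Pin every relevant RNG so reruns of a training job are comparable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Trade a little speed for deterministic cuDNN kernels
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```
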
9. **Model Validation & Testing**
   - Validate on held-out test set
   - Test model performance on edge cases
   - Verify model fairness and bias
   - Test inference latency
   - Reference: MLOps mo-07 (Testing), Data Scientist ds-08

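For the latency check, a rough harness like this is usually enough at training time (assumes a scikit-learn-style `predict`; the 50 ms budget is illustrative):

```python
import time
import numpy as np

def measure_latency_ms(model, X_sample, n_runs=200, warmup=20):
    """Return (p50, p95) single-batch inference latency in milliseconds."""
    for _ in range(warmup):           # warm up caches / lazy initialization
        model.predict(X_sample)
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        model.predict(X_sample)
        timings.append((time.perf_counter() - start) * 1000)
    return np.percentile(timings, 50), np.percentile(timings, 95)

# Example gate: fail the run if p95 exceeds the serving budget
p50, p95 = measure_latency_ms(model, X_test[:32])
assert p95 < 50, f"p95 latency {p95:.1f} ms exceeds budget"
```
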
### DevOps Integration for Training

10. **Automated Training Pipelines**
    - Trigger training on data updates
    - Automate model evaluation and comparison
    - Implement automatic model promotion
    - Schedule periodic retraining (see the CI/CD workflow later in this document)
    - Reference: DevOps do-01 (CI/CD), ML Engineer ml-01

11. **Containerized Training**
    - Package training code in Docker containers
    - Use multi-stage builds for efficiency
    - Version control container definitions
    - Test containers locally before deployment
    - Reference: DevOps do-03 (Containerization)

12. **Infrastructure as Code for Training**
    - Define training infrastructure in Terraform
    - Automate compute cluster provisioning
    - Version control all infrastructure
    - Implement environment parity (dev/staging/prod)
    - Reference: DevOps do-04 (IaC)

13. **Monitoring & Observability**
    - Monitor training job status and health
    - Track training metrics in real-time
    - Set up alerts for training failures
    - Log training errors and exceptions
    - Reference: DevOps do-08 (Monitoring)

### Data Quality for Training

14. **Training Data Validation**
    - Validate data schema before training
    - Check for data drift vs previous training
    - Verify label distribution
    - Detect data quality issues early (a minimal sketch follows this list)
    - Reference: Data Engineer de-03 (Data Quality)

15. **Data Versioning**
    - Version training datasets
    - Track data lineage for reproducibility
    - Document data collection and labeling
    - Reference: MLOps mo-06 (Lineage)

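As referenced in item 14, a minimal pre-training validation sketch using plain pandas (column names, dtypes, and thresholds are illustrative; in practice the de-03 data-quality tooling would own these checks):

```python
import pandas as pd

def validate_training_frame(df: pd.DataFrame) -> list:
    """Lightweight schema, label, and completeness checks before training."""
    issues = []
    expected = {"customer_id": "int64", "tenure_months": "float64", "target": "int64"}
    for col, dtype in expected.items():
        if col not in df.columns:
            issues.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            issues.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    # Label distribution sanity check: catch silent class collapse
    if "target" in df.columns:
        positive_rate = df["target"].mean()
        if not 0.01 <= positive_rate <= 0.99:
            issues.append(f"suspicious positive rate: {positive_rate:.3f}")
    # Completeness: flag heavy missingness anywhere in the frame
    if df.isna().mean().max() > 0.2:
        issues.append("a column exceeds 20% missing values")
    return issues

issues = validate_training_frame(features)
if issues:
    raise ValueError(f"Training data failed validation: {issues}")
```
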
### Security & Compliance

16. **Secure Training Environments**
    - Train in isolated network environments
    - Use managed identities for authentication (see the sketch after this list)
    - Encrypt data at rest and in transit
    - Audit training job access
    - Reference: Security Architect sa-02 (IAM), sa-03 (Network)

17. **Model Security**
    - Scan training dependencies for vulnerabilities
    - Implement model access controls
    - Audit model artifacts
    - Document model security posture
    - Reference: Security Architect sa-08 (LLM Security)

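A minimal sketch of the managed-identity bullet in item 16 using the Azure SDK (IDs are placeholders); the resulting `ml_client` is what the Azure ML training examples below assume:

```python
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

# DefaultAzureCredential resolves to a managed identity when running inside
# Azure compute, so no secrets live in training code or configuration.
credential = DefaultAzureCredential()
ml_client = MLClient(
    credential,
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)
```
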
### Azure-Specific Best Practices

18. **Azure Machine Learning**
    - Use managed compute clusters
    - Enable auto-scaling for training jobs
    - Leverage Azure ML pipelines
    - Use Azure ML environments for reproducibility
    - Reference: Azure az-04 (AI/ML Services)

19. **Cost Management in Azure ML**
    - Use low-priority compute for training (70-80% savings)
    - Enable compute auto-shutdown
    - Monitor compute utilization
    - Set workspace spending limits
    - Reference: Azure az-04, FinOps fo-06

20. **Model Training Best Practices**
    - Use early stopping callbacks
    - Implement cross-validation for robust estimates (see the sketch after this list)
    - Track learning curves for overfitting detection
    - Save best model checkpoints
    - Reference: ML Engineer best practices

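To illustrate the cross-validation bullet in item 20, a minimal scikit-learn sketch (assumes XGBoost's scikit-learn wrapper and X_train/y_train prepared upstream):

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# 5-fold stratified CV gives a far more robust estimate than a single split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = XGBClassifier(max_depth=6, n_estimators=100, eval_metric="auc")
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
print(f"AUC: {scores.mean():.4f} ± {scores.std():.4f}")
```
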
## 💰 Cost Optimization Examples

### Spot Instance Training with Checkpointing
```python
from azure.ai.ml import command, Input
from azure.ai.ml.entities import AmlCompute
from model_trainer import CheckpointedTrainer

# Assumes ml_client is an authenticated azure.ai.ml MLClient
# (see the managed-identity sketch in the Security section above)

# Create spot compute cluster (60-90% savings)
spot_cluster = AmlCompute(
    name="spot-training-cluster",
    size="Standard_NC6s_v3",  # GPU instance
    min_instances=0,
    max_instances=4,
    tier="LowPriority",  # Spot pricing!
    idle_time_before_scale_down=300
)

ml_client.compute.begin_create_or_update(spot_cluster).result()

# Checkpointed trainer (automatically resumes after preemption)
trainer = CheckpointedTrainer(
    model_type="pytorch",
    checkpoint_dir="./checkpoints",
    checkpoint_frequency=100,  # Save every 100 steps
    resume_from_checkpoint=True
)

# Training script with checkpointing
training_script = """
import torch
from model_trainer import CheckpointedTrainer

trainer = CheckpointedTrainer(
    model_type="pytorch",
    checkpoint_dir="./outputs/checkpoints"
)

# Load checkpoint if exists (after preemption)
start_epoch = trainer.load_checkpoint_if_exists()

# Training loop with automatic checkpointing
for epoch in range(start_epoch, num_epochs):
    train_loss = trainer.train_epoch(model, train_loader)
    val_loss = trainer.validate(model, val_loader)

    # Automatic checkpoint saving
    trainer.save_checkpoint(
        epoch=epoch,
        model=model,
        optimizer=optimizer,
        loss=val_loss
    )
"""

# Submit to spot cluster
job = command(
    code="./src",
    command="python train.py",
    environment="azureml:pytorch-training:1",
    compute="spot-training-cluster",
    inputs={
        "training_data": Input(path="azureml://datasets/training_data/labels/latest")
    }
)

# Job automatically resumes from checkpoint if preempted
run = ml_client.jobs.create_or_update(job)

# Track savings
from finops_tracker import TrainingCostTracker
cost_tracker = TrainingCostTracker()
savings_report = cost_tracker.calculate_spot_savings(run.name)
print(f"Cost with spot: ${savings_report.spot_cost:.2f}")
print(f"Cost with dedicated: ${savings_report.dedicated_cost:.2f}")
print(f"Total savings: ${savings_report.savings:.2f} ({savings_report.savings_percent}%)")
```

### Bayesian Hyperparameter Optimization (10x Fewer Trials)
```python
from model_trainer import ModelTrainer
from finops_tracker import HPOCostTracker
import optuna

cost_tracker = HPOCostTracker()

def objective(trial):
    """Objective function with cost tracking (X_train/X_val prepared upstream)"""

    # Suggest hyperparameters
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 7),
        'gamma': trial.suggest_float('gamma', 0, 0.5)
    }

    # Track trial cost
    with cost_tracker.track_trial(trial.number):
        # Train model
        trainer = ModelTrainer(model_type="xgboost", **params)
        trainer.train(X_train, y_train)

        # Evaluate
        score = trainer.evaluate(X_val, y_val)['auc']

        # Report cost
        cost_tracker.report_trial_cost(trial.number, score)

    return score

# Bayesian optimization with early stopping
study = optuna.create_study(
    direction='maximize',
    sampler=optuna.samplers.TPESampler(seed=42),  # Bayesian optimization
    pruner=optuna.pruners.MedianPruner(  # Early stopping for poor trials; note that
        n_startup_trials=5,              # pruning only fires if the objective reports
        n_warmup_steps=10,               # intermediate values via trial.report() and
        interval_steps=1                 # checks trial.should_prune()
    )
)

# Run optimization with budget limit
study.optimize(
    objective,
    n_trials=50,    # vs 1000+ for grid search
    timeout=3600,   # 1 hour max
    callbacks=[cost_tracker.budget_callback(max_budget=100.00)]
)

# Results with cost analysis
print(f"Best AUC: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

cost_report = cost_tracker.generate_hpo_report()
print(f"\nHPO Cost Analysis:")
print(f"Total trials: {cost_report.total_trials}")
print(f"Completed trials: {cost_report.completed_trials}")
print(f"Pruned trials: {cost_report.pruned_trials} (cost savings!)")
print(f"Total cost: ${cost_report.total_cost:.2f}")
print(f"Average cost per trial: ${cost_report.avg_cost_per_trial:.2f}")
print(f"Estimated grid search cost: ${cost_report.grid_search_cost_estimate:.2f}")
print(f"Savings vs grid search: ${cost_report.savings:.2f} ({cost_report.savings_percent}%)")

# Visualize optimization
from optuna.visualization import plot_optimization_history, plot_param_importances
plot_optimization_history(study).show()
plot_param_importances(study).show()
```

### Mixed Precision Training (2x Speed on GPUs)
```python
import torch
from torch.cuda.amp import autocast, GradScaler
from model_trainer import PyTorchTrainer
from finops_tracker import TrainingCostTracker

class MixedPrecisionTrainer(PyTorchTrainer):
    """Mixed precision training for 2x speed and 50% memory reduction"""

    def __init__(self, model, optimizer, **kwargs):
        super().__init__(model, optimizer, **kwargs)
        self.scaler = GradScaler()  # For numerical stability
        self.cost_tracker = TrainingCostTracker()

    def train_epoch(self, train_loader):
        self.model.train()
        total_loss = 0

        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.cuda(), target.cuda()

            # Mixed precision forward pass
            with autocast():  # Automatic mixed precision
                output = self.model(data)
                loss = self.criterion(output, target)  # criterion set by PyTorchTrainer

            # Scaled backpropagation
            self.optimizer.zero_grad()
            self.scaler.scale(loss).backward()
            self.scaler.step(self.optimizer)
            self.scaler.update()

            total_loss += loss.item()

        return total_loss / len(train_loader)

# Usage (MyModel, train_loader, val_loader, num_epochs defined elsewhere)
model = MyModel().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

trainer = MixedPrecisionTrainer(model, optimizer)
cost_tracker = trainer.cost_tracker

# Track training time and cost
with cost_tracker.track_training("mixed_precision_training"):
    for epoch in range(num_epochs):
        train_loss = trainer.train_epoch(train_loader)
        val_loss = trainer.validate(val_loader)

# Compare with FP32 training
report = cost_tracker.compare_precision_modes()
print(f"Mixed precision training time: {report.mixed_time:.2f}s")
print(f"FP32 training time: {report.fp32_time:.2f}s")
print(f"Speedup: {report.speedup:.2f}x")
print(f"Cost savings: ${report.cost_savings:.2f}")
```

### Cost-Tracked Experiment Management
```python
from experiment_tracker import ExperimentTracker
from finops_tracker import ExperimentCostTracker

class CostAwareExperimentTracker:
    """Experiment tracker with integrated cost monitoring"""

    def __init__(self, experiment_name: str):
        self.tracker = ExperimentTracker(experiment_name=experiment_name)
        self.cost_tracker = ExperimentCostTracker()

    def start_run(self, run_name: str, compute_type: str = "cpu"):
        """Start a new run with cost tracking"""

        # Start MLflow run
        self.run = self.tracker.start_run(run_name=run_name)

        # Start cost tracking
        self.cost_tracker.start_tracking(
            run_id=self.run.info.run_id,
            compute_type=compute_type
        )

        return self

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Calculate and log costs
        cost_metrics = self.cost_tracker.stop_tracking()

        # Log cost metrics to MLflow
        self.tracker.log_metrics({
            "compute_cost_usd": cost_metrics.compute_cost,
            "storage_cost_usd": cost_metrics.storage_cost,
            "total_cost_usd": cost_metrics.total_cost,
            "training_duration_seconds": cost_metrics.duration,
            "cost_per_hour": cost_metrics.cost_per_hour
        })

        # Add cost tags
        self.tracker.set_tags({
            "compute_type": cost_metrics.compute_type,
            "instance_type": cost_metrics.instance_type,
            "cost_optimized": str(cost_metrics.compute_type == "spot")
        })

        # End run
        self.tracker.end_run()

# Usage (train_xgboost, train_neural_network, evaluate are assumed helpers)
tracker = CostAwareExperimentTracker("customer_churn")

# Experiment 1: Standard training
with tracker.start_run("baseline_xgboost", compute_type="cpu"):
    model = train_xgboost(X_train, y_train)
    metrics = evaluate(model, X_test, y_test)
    tracker.tracker.log_metrics(metrics)
    tracker.tracker.log_model(model, "xgboost_baseline")

# Experiment 2: GPU training with spot instances
with tracker.start_run("deep_learning_spot", compute_type="spot_gpu"):
    model = train_neural_network(X_train, y_train)
    metrics = evaluate(model, X_test, y_test)
    tracker.tracker.log_metrics(metrics)
    tracker.tracker.log_model(model, "nn_spot")

# Compare experiments by cost and performance
experiments_df = tracker.cost_tracker.compare_experiments()
print(experiments_df[['run_name', 'accuracy', 'total_cost_usd', 'cost_per_point_accuracy']])

# Find the most cost-efficient model
best_idx = experiments_df['cost_per_point_accuracy'].idxmin()
print(f"\nMost cost-efficient model: {experiments_df.loc[best_idx, 'run_name']}")
print(f"Accuracy: {experiments_df.loc[best_idx, 'accuracy']:.4f}")
print(f"Cost: ${experiments_df.loc[best_idx, 'total_cost_usd']:.2f}")
```

## 🚀 CI/CD for Model Training

### Automated Training Pipeline
```yaml
# .github/workflows/model-training.yml
name: Model Training Pipeline

on:
  push:
    paths:
      - 'models/**'
      - 'training/**'
    branches:
      - main
  schedule:
    - cron: '0 3 * * 0'  # Weekly retraining

jobs:
  train-and-evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Azure Login
        uses: azure/login@v1
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov

      - name: Run unit tests
        run: pytest tests/unit/ --cov=src

      - name: Validate training data
        run: python scripts/validate_training_data.py

      - name: Train model on spot instances
        run: |
          python training/train_model.py \
            --experiment-name "churn_weekly_${{ github.run_number }}" \
            --compute-type spot \
            --max-cost 50.00 \
            --enable-early-stopping

      - name: Run hyperparameter optimization
        if: github.ref == 'refs/heads/main'
        run: |
          python training/optimize_hyperparameters.py \
            --n-trials 30 \
            --optimization-method bayesian \
            --max-cost 100.00

      - name: Evaluate model
        run: |
          python training/evaluate_model.py \
            --min-accuracy 0.85 \
            --min-auc 0.90 \
            --test-fairness

      - name: Generate model card
        run: python scripts/generate_model_card.py

      - name: Register model
        if: success()
        run: python scripts/register_model.py --stage Staging

      - name: Run integration tests
        run: pytest tests/integration/

      - name: Generate training report
        run: python scripts/generate_training_report.py

      - name: Upload artifacts
        uses: actions/upload-artifact@v3
        with:
          name: training-artifacts
          path: |
            outputs/
            reports/
```

## 📊 Metrics & Monitoring

| Metric Category | Metric | Target | Tool |
|-----------------|--------|--------|------|
| **Training Costs** | Cost per training run | <$25 | FinOps tracker |
| | Monthly training budget | <$2000 | Azure Cost Management |
| | Spot instance savings | >70% | Cost tracker |
| | GPU utilization | >80% | Azure Monitor |
| **HPO Costs** | Cost per HPO run | <$100 | HPO cost tracker |
| | Trials pruned (savings) | >40% | Optuna |
| | Bayesian vs grid savings | >80% | Cost comparison |
| **Training Performance** | Training time | <2 hours | Experiment tracker |
| | Model accuracy | >0.85 | MLflow |
| | AUC score | >0.90 | Evaluation metrics |
| **Resource Utilization** | CPU utilization | >70% | Azure Monitor |
| | Memory utilization | >60% | Azure Monitor |
| | Data loading time | <10% of total | Profiler |
| **Pipeline Reliability** | Training success rate | >95% | Pipeline metrics |
| | Experiment reproducibility | 100% | Seed + versioning |
| **Model Quality** | Validation score | >baseline | Experiment tracker |
| | Test set performance | >0.85 | Model evaluator |

## 🔄 Integration Workflow

### End-to-End Training Pipeline
```
1. Feature Store Access (ml-02)

2. Data Validation (de-03)

3. Training Data Preparation (ml-03)

4. Experiment Initialization (mo-01)

5. Hyperparameter Optimization (ml-03)

6. Model Training with Cost Tracking (ml-03, fo-07)

7. Model Evaluation (ml-03, ds-08)

8. Model Versioning (mo-03)

9. Model Security Scan (sa-08)

10. Model Registration (ml-07)

11. Lineage Tracking (mo-06)

12. Model Deployment (ml-04)
```

## 🎯 Quick Wins

1. **Switch to spot instances** - 60-90% training cost reduction
2. **Use Bayesian optimization** - 80-90% fewer hyperparameter trials
3. **Enable mixed precision training** - 2x speed on GPUs
4. **Implement early stopping** - 20-40% faster training
5. **Cache preprocessed data** - 30-50% faster data loading (see the caching sketch below)
6. **Set up experiment tracking** - Better model comparison
7. **Implement checkpointing** - Resilience to preemption
8. **Use learning rate schedules** - Better convergence
9. **Track training costs** - Visibility into spending
10. **Automate model evaluation** - Consistent quality gates
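
To illustrate quick win 5, here is a minimal caching sketch; `build_features` stands in for whatever expensive preprocessing produces the training frame (an assumption, not part of this package):

```python
from pathlib import Path
import pandas as pd

def load_features(version: str, cache_dir: str = "cache") -> pd.DataFrame:
    """Cache preprocessed features as Parquet so repeat runs skip recomputation."""
    cache_path = Path(cache_dir) / f"features_{version}.parquet"
    if cache_path.exists():
        return pd.read_parquet(cache_path)  # fast path: reuse the cache
    df = build_features(version)            # assumed expensive preprocessing
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_parquet(cache_path)
    return df
```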