tech-hub-skills 1.2.0 → 1.5.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (198)
  1. package/{LICENSE → .claude/LICENSE} +21 -21
  2. package/.claude/README.md +291 -0
  3. package/.claude/bin/cli.js +266 -0
  4. package/{bin → .claude/bin}/copilot.js +182 -182
  5. package/{bin → .claude/bin}/postinstall.js +42 -42
  6. package/{tech_hub_skills/skills → .claude/commands}/README.md +336 -336
  7. package/{tech_hub_skills/skills → .claude/commands}/ai-engineer.md +104 -104
  8. package/{tech_hub_skills/skills → .claude/commands}/aws.md +143 -143
  9. package/{tech_hub_skills/skills → .claude/commands}/azure.md +149 -149
  10. package/{tech_hub_skills/skills → .claude/commands}/backend-developer.md +108 -108
  11. package/{tech_hub_skills/skills → .claude/commands}/code-review.md +399 -399
  12. package/{tech_hub_skills/skills → .claude/commands}/compliance-automation.md +747 -747
  13. package/{tech_hub_skills/skills → .claude/commands}/compliance-officer.md +108 -108
  14. package/{tech_hub_skills/skills → .claude/commands}/data-engineer.md +113 -113
  15. package/{tech_hub_skills/skills → .claude/commands}/data-governance.md +102 -102
  16. package/{tech_hub_skills/skills → .claude/commands}/data-scientist.md +123 -123
  17. package/{tech_hub_skills/skills → .claude/commands}/database-admin.md +109 -109
  18. package/{tech_hub_skills/skills → .claude/commands}/devops.md +160 -160
  19. package/{tech_hub_skills/skills → .claude/commands}/docker.md +160 -160
  20. package/{tech_hub_skills/skills → .claude/commands}/enterprise-dashboard.md +613 -613
  21. package/{tech_hub_skills/skills → .claude/commands}/finops.md +184 -184
  22. package/{tech_hub_skills/skills → .claude/commands}/frontend-developer.md +108 -108
  23. package/{tech_hub_skills/skills → .claude/commands}/gcp.md +143 -143
  24. package/{tech_hub_skills/skills → .claude/commands}/ml-engineer.md +115 -115
  25. package/{tech_hub_skills/skills → .claude/commands}/mlops.md +187 -187
  26. package/{tech_hub_skills/skills → .claude/commands}/network-engineer.md +109 -109
  27. package/{tech_hub_skills/skills → .claude/commands}/optimization-advisor.md +329 -329
  28. package/{tech_hub_skills/skills → .claude/commands}/orchestrator.md +623 -623
  29. package/{tech_hub_skills/skills → .claude/commands}/platform-engineer.md +102 -102
  30. package/{tech_hub_skills/skills → .claude/commands}/process-automation.md +226 -226
  31. package/{tech_hub_skills/skills → .claude/commands}/process-changelog.md +184 -184
  32. package/{tech_hub_skills/skills → .claude/commands}/process-documentation.md +484 -484
  33. package/{tech_hub_skills/skills → .claude/commands}/process-kanban.md +324 -324
  34. package/{tech_hub_skills/skills → .claude/commands}/process-versioning.md +214 -214
  35. package/{tech_hub_skills/skills → .claude/commands}/product-designer.md +104 -104
  36. package/{tech_hub_skills/skills → .claude/commands}/project-starter.md +443 -443
  37. package/{tech_hub_skills/skills → .claude/commands}/qa-engineer.md +109 -109
  38. package/{tech_hub_skills/skills → .claude/commands}/security-architect.md +135 -135
  39. package/{tech_hub_skills/skills → .claude/commands}/sre.md +109 -109
  40. package/{tech_hub_skills/skills → .claude/commands}/system-design.md +126 -126
  41. package/{tech_hub_skills/skills → .claude/commands}/technical-writer.md +101 -101
  42. package/.claude/package.json +46 -0
  43. package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/01-prompt-engineering/README.md +252 -252
  44. package/.claude/roles/ai-engineer/skills/01-prompt-engineering/prompt_ab_tester.py +356 -0
  45. package/.claude/roles/ai-engineer/skills/01-prompt-engineering/prompt_template_manager.py +274 -0
  46. package/.claude/roles/ai-engineer/skills/01-prompt-engineering/token_cost_estimator.py +324 -0
  47. package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/02-rag-pipeline/README.md +448 -448
  48. package/.claude/roles/ai-engineer/skills/02-rag-pipeline/document_chunker.py +336 -0
  49. package/.claude/roles/ai-engineer/skills/02-rag-pipeline/rag_pipeline.sql +213 -0
  50. package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/03-agent-orchestration/README.md +599 -599
  51. package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/04-llm-guardrails/README.md +735 -735
  52. package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/05-vector-embeddings/README.md +711 -711
  53. package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/06-llm-evaluation/README.md +777 -777
  54. package/{tech_hub_skills → .claude}/roles/azure/skills/01-infrastructure-fundamentals/README.md +264 -264
  55. package/{tech_hub_skills → .claude}/roles/azure/skills/02-data-factory/README.md +264 -264
  56. package/{tech_hub_skills → .claude}/roles/azure/skills/03-synapse-analytics/README.md +264 -264
  57. package/{tech_hub_skills → .claude}/roles/azure/skills/04-databricks/README.md +264 -264
  58. package/{tech_hub_skills → .claude}/roles/azure/skills/05-functions/README.md +264 -264
  59. package/{tech_hub_skills → .claude}/roles/azure/skills/06-kubernetes-service/README.md +264 -264
  60. package/{tech_hub_skills → .claude}/roles/azure/skills/07-openai-service/README.md +264 -264
  61. package/{tech_hub_skills → .claude}/roles/azure/skills/08-machine-learning/README.md +264 -264
  62. package/{tech_hub_skills → .claude}/roles/azure/skills/09-storage-adls/README.md +264 -264
  63. package/{tech_hub_skills → .claude}/roles/azure/skills/10-networking/README.md +264 -264
  64. package/{tech_hub_skills → .claude}/roles/azure/skills/11-sql-cosmos/README.md +264 -264
  65. package/{tech_hub_skills → .claude}/roles/azure/skills/12-event-hubs/README.md +264 -264
  66. package/{tech_hub_skills → .claude}/roles/code-review/skills/01-automated-code-review/README.md +394 -394
  67. package/{tech_hub_skills → .claude}/roles/code-review/skills/02-pr-review-workflow/README.md +427 -427
  68. package/{tech_hub_skills → .claude}/roles/code-review/skills/03-code-quality-gates/README.md +518 -518
  69. package/{tech_hub_skills → .claude}/roles/code-review/skills/04-reviewer-assignment/README.md +504 -504
  70. package/{tech_hub_skills → .claude}/roles/code-review/skills/05-review-analytics/README.md +540 -540
  71. package/{tech_hub_skills → .claude}/roles/data-engineer/skills/01-lakehouse-architecture/README.md +550 -550
  72. package/.claude/roles/data-engineer/skills/01-lakehouse-architecture/bronze_ingestion.py +337 -0
  73. package/.claude/roles/data-engineer/skills/01-lakehouse-architecture/medallion_queries.sql +300 -0
  74. package/{tech_hub_skills → .claude}/roles/data-engineer/skills/02-etl-pipeline/README.md +580 -580
  75. package/{tech_hub_skills → .claude}/roles/data-engineer/skills/03-data-quality/README.md +579 -579
  76. package/{tech_hub_skills → .claude}/roles/data-engineer/skills/04-streaming-pipelines/README.md +608 -608
  77. package/{tech_hub_skills → .claude}/roles/data-engineer/skills/05-performance-optimization/README.md +547 -547
  78. package/{tech_hub_skills → .claude}/roles/data-governance/skills/01-data-catalog/README.md +112 -112
  79. package/{tech_hub_skills → .claude}/roles/data-governance/skills/02-data-lineage/README.md +129 -129
  80. package/{tech_hub_skills → .claude}/roles/data-governance/skills/03-data-quality-framework/README.md +182 -182
  81. package/{tech_hub_skills → .claude}/roles/data-governance/skills/04-access-control/README.md +39 -39
  82. package/{tech_hub_skills → .claude}/roles/data-governance/skills/05-master-data-management/README.md +40 -40
  83. package/{tech_hub_skills → .claude}/roles/data-governance/skills/06-compliance-privacy/README.md +46 -46
  84. package/{tech_hub_skills → .claude}/roles/data-scientist/skills/01-eda-automation/README.md +230 -230
  85. package/.claude/roles/data-scientist/skills/01-eda-automation/eda_generator.py +446 -0
  86. package/{tech_hub_skills → .claude}/roles/data-scientist/skills/02-statistical-modeling/README.md +264 -264
  87. package/{tech_hub_skills → .claude}/roles/data-scientist/skills/03-feature-engineering/README.md +264 -264
  88. package/{tech_hub_skills → .claude}/roles/data-scientist/skills/04-predictive-modeling/README.md +264 -264
  89. package/{tech_hub_skills → .claude}/roles/data-scientist/skills/05-customer-analytics/README.md +264 -264
  90. package/{tech_hub_skills → .claude}/roles/data-scientist/skills/06-campaign-analysis/README.md +264 -264
  91. package/{tech_hub_skills → .claude}/roles/data-scientist/skills/07-experimentation/README.md +264 -264
  92. package/{tech_hub_skills → .claude}/roles/data-scientist/skills/08-data-visualization/README.md +264 -264
  93. package/{tech_hub_skills → .claude}/roles/devops/skills/01-cicd-pipeline/README.md +264 -264
  94. package/{tech_hub_skills → .claude}/roles/devops/skills/02-container-orchestration/README.md +264 -264
  95. package/{tech_hub_skills → .claude}/roles/devops/skills/03-infrastructure-as-code/README.md +264 -264
  96. package/{tech_hub_skills → .claude}/roles/devops/skills/04-gitops/README.md +264 -264
  97. package/{tech_hub_skills → .claude}/roles/devops/skills/05-environment-management/README.md +264 -264
  98. package/{tech_hub_skills → .claude}/roles/devops/skills/06-automated-testing/README.md +264 -264
  99. package/{tech_hub_skills → .claude}/roles/devops/skills/07-release-management/README.md +264 -264
  100. package/{tech_hub_skills → .claude}/roles/devops/skills/08-monitoring-alerting/README.md +264 -264
  101. package/{tech_hub_skills → .claude}/roles/devops/skills/09-devsecops/README.md +265 -265
  102. package/{tech_hub_skills → .claude}/roles/finops/skills/01-cost-visibility/README.md +264 -264
  103. package/{tech_hub_skills → .claude}/roles/finops/skills/02-resource-tagging/README.md +264 -264
  104. package/{tech_hub_skills → .claude}/roles/finops/skills/03-budget-management/README.md +264 -264
  105. package/{tech_hub_skills → .claude}/roles/finops/skills/04-reserved-instances/README.md +264 -264
  106. package/{tech_hub_skills → .claude}/roles/finops/skills/05-spot-optimization/README.md +264 -264
  107. package/{tech_hub_skills → .claude}/roles/finops/skills/06-storage-tiering/README.md +264 -264
  108. package/{tech_hub_skills → .claude}/roles/finops/skills/07-compute-rightsizing/README.md +264 -264
  109. package/{tech_hub_skills → .claude}/roles/finops/skills/08-chargeback/README.md +264 -264
  110. package/{tech_hub_skills → .claude}/roles/ml-engineer/skills/01-mlops-pipeline/README.md +566 -566
  111. package/{tech_hub_skills → .claude}/roles/ml-engineer/skills/02-feature-engineering/README.md +655 -655
  112. package/{tech_hub_skills → .claude}/roles/ml-engineer/skills/03-model-training/README.md +704 -704
  113. package/{tech_hub_skills → .claude}/roles/ml-engineer/skills/04-model-serving/README.md +845 -845
  114. package/{tech_hub_skills → .claude}/roles/ml-engineer/skills/05-model-monitoring/README.md +874 -874
  115. package/{tech_hub_skills → .claude}/roles/mlops/skills/01-ml-pipeline-orchestration/README.md +264 -264
  116. package/{tech_hub_skills → .claude}/roles/mlops/skills/02-experiment-tracking/README.md +264 -264
  117. package/{tech_hub_skills → .claude}/roles/mlops/skills/03-model-registry/README.md +264 -264
  118. package/{tech_hub_skills → .claude}/roles/mlops/skills/04-feature-store/README.md +264 -264
  119. package/{tech_hub_skills → .claude}/roles/mlops/skills/05-model-deployment/README.md +264 -264
  120. package/{tech_hub_skills → .claude}/roles/mlops/skills/06-model-observability/README.md +264 -264
  121. package/{tech_hub_skills → .claude}/roles/mlops/skills/07-data-versioning/README.md +264 -264
  122. package/{tech_hub_skills → .claude}/roles/mlops/skills/08-ab-testing/README.md +264 -264
  123. package/{tech_hub_skills → .claude}/roles/mlops/skills/09-automated-retraining/README.md +264 -264
  124. package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/01-internal-developer-platform/README.md +153 -153
  125. package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/02-self-service-infrastructure/README.md +57 -57
  126. package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/03-slo-sli-management/README.md +59 -59
  127. package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/04-developer-experience/README.md +57 -57
  128. package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/05-incident-management/README.md +73 -73
  129. package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/06-capacity-management/README.md +59 -59
  130. package/{tech_hub_skills → .claude}/roles/product-designer/skills/01-requirements-discovery/README.md +407 -407
  131. package/{tech_hub_skills → .claude}/roles/product-designer/skills/02-user-research/README.md +382 -382
  132. package/{tech_hub_skills → .claude}/roles/product-designer/skills/03-brainstorming-ideation/README.md +437 -437
  133. package/{tech_hub_skills → .claude}/roles/product-designer/skills/04-ux-design/README.md +496 -496
  134. package/{tech_hub_skills → .claude}/roles/product-designer/skills/05-product-market-fit/README.md +376 -376
  135. package/{tech_hub_skills → .claude}/roles/product-designer/skills/06-stakeholder-management/README.md +412 -412
  136. package/{tech_hub_skills → .claude}/roles/security-architect/skills/01-pii-detection/README.md +319 -319
  137. package/{tech_hub_skills → .claude}/roles/security-architect/skills/02-threat-modeling/README.md +264 -264
  138. package/{tech_hub_skills → .claude}/roles/security-architect/skills/03-infrastructure-security/README.md +264 -264
  139. package/{tech_hub_skills → .claude}/roles/security-architect/skills/04-iam/README.md +264 -264
  140. package/{tech_hub_skills → .claude}/roles/security-architect/skills/05-application-security/README.md +264 -264
  141. package/{tech_hub_skills → .claude}/roles/security-architect/skills/06-secrets-management/README.md +264 -264
  142. package/{tech_hub_skills → .claude}/roles/security-architect/skills/07-security-monitoring/README.md +264 -264
  143. package/{tech_hub_skills → .claude}/roles/system-design/skills/01-architecture-patterns/README.md +337 -337
  144. package/{tech_hub_skills → .claude}/roles/system-design/skills/02-requirements-engineering/README.md +264 -264
  145. package/{tech_hub_skills → .claude}/roles/system-design/skills/03-scalability/README.md +264 -264
  146. package/{tech_hub_skills → .claude}/roles/system-design/skills/04-high-availability/README.md +264 -264
  147. package/{tech_hub_skills → .claude}/roles/system-design/skills/05-cost-optimization-design/README.md +264 -264
  148. package/{tech_hub_skills → .claude}/roles/system-design/skills/06-api-design/README.md +264 -264
  149. package/{tech_hub_skills → .claude}/roles/system-design/skills/07-observability-architecture/README.md +264 -264
  150. package/{tech_hub_skills → .claude}/roles/system-design/skills/08-process-automation/PROCESS_TEMPLATE.md +336 -336
  151. package/{tech_hub_skills → .claude}/roles/system-design/skills/08-process-automation/README.md +521 -521
  152. package/.claude/roles/system-design/skills/08-process-automation/ai_prompt_generator.py +744 -0
  153. package/.claude/roles/system-design/skills/08-process-automation/automation_recommender.py +688 -0
  154. package/.claude/roles/system-design/skills/08-process-automation/plan_generator.py +679 -0
  155. package/.claude/roles/system-design/skills/08-process-automation/process_analyzer.py +528 -0
  156. package/.claude/roles/system-design/skills/08-process-automation/process_parser.py +684 -0
  157. package/.claude/roles/system-design/skills/08-process-automation/role_matcher.py +615 -0
  158. package/.claude/skills/README.md +336 -0
  159. package/.claude/skills/ai-engineer.md +104 -0
  160. package/.claude/skills/aws.md +143 -0
  161. package/.claude/skills/azure.md +149 -0
  162. package/.claude/skills/backend-developer.md +108 -0
  163. package/.claude/skills/code-review.md +399 -0
  164. package/.claude/skills/compliance-automation.md +747 -0
  165. package/.claude/skills/compliance-officer.md +108 -0
  166. package/.claude/skills/data-engineer.md +113 -0
  167. package/.claude/skills/data-governance.md +102 -0
  168. package/.claude/skills/data-scientist.md +123 -0
  169. package/.claude/skills/database-admin.md +109 -0
  170. package/.claude/skills/devops.md +160 -0
  171. package/.claude/skills/docker.md +160 -0
  172. package/.claude/skills/enterprise-dashboard.md +613 -0
  173. package/.claude/skills/finops.md +184 -0
  174. package/.claude/skills/frontend-developer.md +108 -0
  175. package/.claude/skills/gcp.md +143 -0
  176. package/.claude/skills/ml-engineer.md +115 -0
  177. package/.claude/skills/mlops.md +187 -0
  178. package/.claude/skills/network-engineer.md +109 -0
  179. package/.claude/skills/optimization-advisor.md +329 -0
  180. package/.claude/skills/orchestrator.md +623 -0
  181. package/.claude/skills/platform-engineer.md +102 -0
  182. package/.claude/skills/process-automation.md +226 -0
  183. package/.claude/skills/process-changelog.md +184 -0
  184. package/.claude/skills/process-documentation.md +484 -0
  185. package/.claude/skills/process-kanban.md +324 -0
  186. package/.claude/skills/process-versioning.md +214 -0
  187. package/.claude/skills/product-designer.md +104 -0
  188. package/.claude/skills/project-starter.md +443 -0
  189. package/.claude/skills/qa-engineer.md +109 -0
  190. package/.claude/skills/security-architect.md +135 -0
  191. package/.claude/skills/sre.md +109 -0
  192. package/.claude/skills/system-design.md +126 -0
  193. package/.claude/skills/technical-writer.md +101 -0
  194. package/.gitattributes +2 -0
  195. package/GITHUB_COPILOT.md +106 -0
  196. package/README.md +192 -291
  197. package/package.json +16 -46
  198. package/bin/cli.js +0 -241
@@ -1,874 +1,874 @@
1
- # Skill 5: Model Monitoring & Drift Detection
2
-
3
- ## 🎯 Overview
4
- Implement comprehensive model monitoring with data drift, concept drift, and performance degradation detection for production ML systems.
5
-
6
- ## 🔗 Connections
7
- - **MLOps**: Core monitoring and drift detection capabilities (mo-04, mo-05)
8
- - **ML Engineer**: Monitors deployed models and triggers retraining (ml-04, ml-09)
9
- - **Data Scientist**: Analyzes model degradation patterns (ds-08)
10
- - **DevOps**: Integrates with observability platforms (do-08)
11
- - **FinOps**: Monitors model performance vs cost trade-offs (fo-07)
12
- - **Security Architect**: Detects anomalous predictions (sa-08)
13
- - **Data Engineer**: Monitors data quality for features (de-03)
14
- - **System Design**: Scalable monitoring architecture (sd-08)
15
-
16
- ## 🛠️ Tools Included
17
-
18
- ### 1. `drift_detector.py`
19
- Statistical drift detection for features and predictions.
20
-
21
- ### 2. `model_monitor.py`
22
- Comprehensive model performance monitoring with alerting.
23
-
24
- ### 3. `prediction_analyzer.py`
25
- Prediction distribution analysis and anomaly detection.
26
-
27
- ### 4. `monitoring_dashboard.py`
28
- Real-time monitoring dashboards with Grafana/Azure Monitor.
29
-
30
- ### 5. `alert_manager.py`
31
- Intelligent alerting system for model degradation.
32
-
33
- ## 🏗️ Monitoring Architecture
34
-
35
- ```
36
- Production Traffic → Logging → Analysis → Drift Detection → Alerting
-                         ↓           ↓              ↓             ↓
-                    Predictions   Metrics      Statistics    Notifications
-                    Features      Business     Comparison    Auto-retrain
-                    Metadata      Technical    Baseline      Dashboards
41
- ```
42
-
43
- ## 🚀 Quick Start
44
-
45
- ```python
46
- from datetime import datetime
- from drift_detector import DriftDetector, KSTest, PSICalculator
47
- from model_monitor import ModelMonitor
48
- from alert_manager import AlertManager
49
-
50
- # Initialize drift detector
51
- drift_detector = DriftDetector(
52
- baseline_data=training_data,
53
- detection_methods=[
54
- KSTest(significance_level=0.05),
55
- PSICalculator(threshold=0.2)
56
- ]
57
- )
58
-
59
- # Initialize model monitor
60
- monitor = ModelMonitor(
61
- model_name="churn_predictor_v2",
62
- metrics=["accuracy", "auc", "precision", "recall"],
63
- alert_thresholds={
64
- "accuracy": 0.85,
65
- "auc": 0.90,
66
- "data_drift_score": 0.2,
67
- "prediction_drift_score": 0.15
68
- }
69
- )
-
- # Initialize alert manager
- alert_manager = AlertManager()
-
71
- # Monitor predictions in production
72
- @monitor.track_predictions
73
- async def predict(features):
74
- prediction = model.predict(features)
75
-
76
- # Log prediction for monitoring
77
- monitor.log_prediction(
78
- features=features,
79
- prediction=prediction,
80
- timestamp=datetime.now(),
81
- metadata={"customer_id": features["customer_id"]}
82
- )
83
-
84
- return prediction
85
-
86
- # Run drift detection (scheduled job)
87
- def check_drift():
88
- """Daily drift detection job"""
89
-
90
- # Get recent production data
91
- production_data = monitor.get_recent_predictions(days=7)
92
-
93
- # Detect feature drift
94
- feature_drift = drift_detector.detect_feature_drift(
95
- production_data=production_data,
96
- features=feature_list
97
- )
98
-
99
- # Detect prediction drift
100
- prediction_drift = drift_detector.detect_prediction_drift(
101
- production_predictions=production_data["predictions"],
102
- baseline_predictions=training_predictions
103
- )
104
-
105
- # Alert if drift detected
106
- if feature_drift.has_drift or prediction_drift.has_drift:
107
- alert_manager.send_alert(
108
- severity="warning",
109
- title="Model Drift Detected",
110
- message=f"Drifted features: {feature_drift.drifted_features}\n"
111
- f"Prediction drift score: {prediction_drift.score:.3f}",
112
- actions=["Review model", "Trigger retraining"]
113
- )
114
-
115
- # Generate drift report
116
- drift_report = drift_detector.generate_report(
117
- feature_drift=feature_drift,
118
- prediction_drift=prediction_drift
119
- )
120
-
121
- return drift_report
122
- ```
123
-
124
- ## 📚 Best Practices
125
-
126
- ### Drift Detection & Monitoring (MLOps Integration)
127
-
128
- 1. **Multi-Level Drift Detection**
129
- - Monitor data drift (feature distribution changes)
130
- - Monitor concept drift (relationship between features and target)
131
- - Monitor prediction drift (output distribution changes)
132
- - Monitor performance drift (metric degradation)
133
- - Reference: MLOps mo-05 (Drift Detection)
134
-
135
- 2. **Statistical Drift Tests**
136
- - Use Kolmogorov-Smirnov test for continuous features
137
- - Use Chi-square test for categorical features
138
- - Calculate Population Stability Index (PSI)
139
- - Track Jensen-Shannon divergence
140
- - Set appropriate significance levels
141
- - Reference: MLOps mo-05, Data Scientist ds-08
142
-
143
- 3. **Baseline Comparison**
144
- - Maintain reference datasets (training data)
145
- - Update baselines periodically
146
- - Track distribution shifts over time
147
- - Document baseline versions
148
- - Reference: MLOps mo-05, mo-06 (Lineage)
149
-
150
- 4. **Monitoring Cadence**
151
- - Real-time monitoring for critical models
152
- - Hourly/daily drift checks for most models
153
- - Weekly deep-dive analysis
154
- - Monthly model review
155
- - Reference: MLOps mo-04 (Monitoring)
156
-
157
- 5. **Comprehensive Model Metrics**
158
- - Track business metrics (revenue impact, user engagement)
159
- - Monitor technical metrics (accuracy, AUC, F1)
160
- - Track operational metrics (latency, throughput)
161
- - Monitor data quality metrics
162
- - Reference: MLOps mo-04
163
-
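The statistical tests listed in practices 1-2 can be combined into a single helper. A minimal sketch, assuming `numpy`/`scipy` are available and reusing the 0.05 significance level and 0.2 PSI threshold used elsewhere in this skill; the function name and return shape are illustrative, not part of this package:

```python
import numpy as np
from scipy import stats
from scipy.spatial.distance import jensenshannon

def drift_tests(baseline: np.ndarray, production: np.ndarray, bins: int = 10) -> dict:
    """KS test, PSI, and Jensen-Shannon distance for one continuous feature."""
    # Two-sample Kolmogorov-Smirnov test
    ks_stat, ks_p = stats.ks_2samp(baseline, production)

    # Bin both samples on baseline quantiles for PSI / JS (mirrors the PSI helper below)
    edges = np.percentile(baseline, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    b_dist = np.histogram(baseline, bins=edges)[0] / len(baseline)
    p_dist = np.histogram(production, bins=edges)[0] / len(production)

    psi = float(np.sum((p_dist - b_dist) * np.log((p_dist + 1e-10) / (b_dist + 1e-10))))
    js = float(jensenshannon(b_dist + 1e-10, p_dist + 1e-10))

    return {
        "ks_p_value": float(ks_p),
        "psi": psi,
        "js_distance": js,
        "has_drift": ks_p < 0.05 or psi > 0.2,
    }
```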
164
- ### DevOps Integration for Monitoring
165
-
166
- 6. **Observability Integration**
167
- - Integrate with Azure Monitor / App Insights
168
- - Use OpenTelemetry for instrumentation
169
- - Centralize logs and metrics
170
- - Implement distributed tracing
171
- - Reference: DevOps do-08 (Monitoring)
172
-
173
- 7. **Alerting & Incident Response**
174
- - Set up intelligent alerting (avoid alert fatigue)
175
- - Define alert severity levels
176
- - Implement escalation policies
177
- - Automate incident response
178
- - Reference: DevOps do-08
179
-
180
- 8. **Monitoring Dashboards**
181
- - Build real-time monitoring dashboards
182
- - Visualize drift metrics over time
183
- - Track model performance trends
184
- - Enable team collaboration
185
- - Reference: DevOps do-08, MLOps mo-04
186
-
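For practice 6, OpenTelemetry instrumentation of drift and latency metrics can look roughly like the sketch below; the meter and metric names are illustrative, and the console exporter would normally be swapped for the Azure Monitor or OTLP exporter in production:

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

# Export every 60s to the console (replace with Azure Monitor / OTLP exporter in production)
reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=60_000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("model-monitoring")
drift_score = meter.create_histogram("feature_drift_score", description="Per-feature drift score")
latency_ms = meter.create_histogram("prediction_latency_ms", unit="ms", description="Prediction latency")

# Record values from the drift job and the serving path
drift_score.record(0.18, attributes={"model_name": "churn_predictor_v2", "feature": "tenure"})
latency_ms.record(42.0, attributes={"model_name": "churn_predictor_v2"})
```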
187
- ### Cost Optimization for Monitoring (FinOps Integration)
188
-
189
- 9. **Efficient Logging Strategy**
190
- - Sample predictions for monitoring (not 100%)
191
- - Implement tiered logging (critical vs routine)
192
- - Compress and archive old logs
193
- - Monitor log storage costs
194
- - Reference: FinOps fo-05 (Storage), fo-07 (AI/ML Cost)
195
-
196
- 10. **Optimize Monitoring Compute**
197
- - Run drift detection on scheduled batches
198
- - Use serverless for event-driven monitoring
199
- - Right-size monitoring infrastructure
200
- - Cache expensive drift calculations
201
- - Reference: FinOps fo-06 (Compute Optimization)
202
-
203
- 11. **Monitoring Cost Tracking**
204
- - Track monitoring infrastructure costs
205
- - Monitor log ingestion costs
206
- - Optimize retention policies
207
- - Balance cost vs visibility
208
- - Reference: FinOps fo-01 (Cost Monitoring)
209
-
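A minimal sketch of the "compress and archive old logs" idea from practice 9, assuming prediction logs are written as local `.jsonl` files before being shipped to cold storage; the paths and retention window are illustrative:

```python
import gzip
import shutil
import time
from pathlib import Path

def archive_old_prediction_logs(log_dir: str, archive_dir: str, max_age_days: int = 30) -> None:
    """Compress prediction logs older than max_age_days and stage them for cold storage."""
    cutoff = time.time() - max_age_days * 86_400
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)

    for log_file in Path(log_dir).glob("*.jsonl"):
        if log_file.stat().st_mtime < cutoff:
            compressed = archive / (log_file.name + ".gz")
            with open(log_file, "rb") as src, gzip.open(compressed, "wb") as dst:
                shutil.copyfileobj(src, dst)
            log_file.unlink()  # keep only the compressed copy
```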
210
- ### Automated Response & Retraining
211
-
212
- 12. **Automated Drift Response**
213
- - Auto-alert when drift exceeds thresholds
214
- - Trigger model investigation workflows
215
- - Initiate automated retraining pipelines
216
- - Implement automatic rollback if needed
217
- - Reference: MLOps mo-05, ML Engineer ml-09
218
-
219
- 13. **Model Retraining Triggers**
220
- - Performance degradation thresholds
221
- - Significant data drift detected
222
- - Concept drift indicators
223
- - Scheduled periodic retraining
224
- - Reference: ML Engineer ml-09 (Continuous Retraining)
225
-
226
- ### Data Quality Monitoring
227
-
228
- 14. **Feature Quality Checks**
229
- - Monitor feature completeness
230
- - Detect feature value range violations
231
- - Track feature correlation changes
232
- - Alert on missing features
233
- - Reference: Data Engineer de-03 (Data Quality)
234
-
235
- 15. **Input Validation Monitoring**
236
- - Track invalid input rates
237
- - Monitor schema violations
238
- - Detect data type mismatches
239
- - Alert on data quality issues
240
- - Reference: Data Engineer de-03
241
-
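A small sketch of the feature quality checks in practices 14-15 using pandas; the feature names and expected ranges are illustrative, not part of this package:

```python
import pandas as pd

# Illustrative feature constraints (replace with your own feature schema)
EXPECTED_RANGES = {"tenure_months": (0, 600), "monthly_charges": (0.0, 10_000.0)}

def feature_quality_report(batch: pd.DataFrame) -> dict:
    """Completeness and range checks for a batch of feature rows."""
    report = {"completeness": {}, "range_violations": {}, "missing_columns": []}
    for column, (low, high) in EXPECTED_RANGES.items():
        if column not in batch.columns:
            report["missing_columns"].append(column)
            continue
        values = batch[column]
        report["completeness"][column] = float(values.notna().mean())
        report["range_violations"][column] = int(((values < low) | (values > high)).sum())
    return report
```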
242
- ### Security & Anomaly Detection
243
-
244
- 16. **Prediction Anomaly Detection**
245
- - Detect unusual prediction patterns
246
- - Identify potential model attacks
247
- - Monitor for adversarial inputs
248
- - Alert on suspicious behavior
249
- - Reference: Security Architect sa-08 (LLM Security)
250
-
251
- 17. **Model Behavior Monitoring**
252
- - Track prediction confidence scores
253
- - Monitor prediction uncertainty
254
- - Detect model degradation patterns
255
- - Identify edge cases
256
- - Reference: MLOps mo-04, Security Architect sa-08
257
-
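One way to implement the prediction anomaly detection in practices 16-17 is an Isolation Forest over prediction scores; a sketch assuming scikit-learn, with an illustrative 1% contamination rate:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def fit_score_anomaly_detector(baseline_scores: np.ndarray) -> IsolationForest:
    """Fit an anomaly detector on baseline prediction scores (e.g., confidences)."""
    detector = IsolationForest(contamination=0.01, random_state=42)
    detector.fit(baseline_scores.reshape(-1, 1))
    return detector

def anomalous_prediction_indices(detector: IsolationForest, recent_scores: np.ndarray) -> np.ndarray:
    """Return indices of recent predictions flagged as anomalous (IsolationForest labels them -1)."""
    labels = detector.predict(recent_scores.reshape(-1, 1))
    return np.where(labels == -1)[0]
```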
258
- ### Azure-Specific Best Practices
259
-
260
- 18. **Azure Monitor Integration**
261
- - Use Azure Monitor for metrics
262
- - Enable Application Insights
263
- - Set up Log Analytics workspaces
264
- - Configure metric alerts
265
- - Reference: Azure az-04 (AI/ML Services)
266
-
267
- 19. **Azure ML Model Monitoring**
268
- - Enable model data collection
269
- - Configure data drift detection
270
- - Use built-in monitoring dashboards
271
- - Integrate with Azure Monitor
272
- - Reference: Azure az-04
273
-
274
- 20. **Cost-Effective Monitoring**
275
- - Use log sampling for high-volume models
276
- - Implement retention policies
277
- - Archive to cold storage
278
- - Monitor monitoring costs
279
- - Reference: Azure az-04, FinOps fo-05
280
-
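For practices 18-19, predictions logged to Application Insights / Log Analytics can be queried with the `azure-monitor-query` SDK. A sketch; the workspace ID, custom event name, and KQL are placeholders, not values defined by this package:

```python
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Placeholder KQL: assumes predictions are logged as a custom "prediction_logged" event
query = """
customEvents
| where name == "prediction_logged"
| summarize prediction_count = count() by bin(timestamp, 1h)
"""

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",
    query=query,
    timespan=timedelta(days=1),
)

for table in response.tables:
    for row in table.rows:
        print(row)
```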
281
- ## 💰 Cost Optimization Examples
282
-
283
- ### Intelligent Prediction Logging
284
- ```python
285
- from model_monitor import SmartLogger
286
- from finops_tracker import MonitoringCostTracker
287
- import random
- from datetime import datetime
288
-
289
- class CostOptimizedMonitor:
290
- """Cost-optimized prediction logging with sampling"""
291
-
292
- def __init__(self, model_name: str, sampling_rate: float = 0.1):
293
- self.model_name = model_name
294
- self.sampling_rate = sampling_rate # Log 10% of predictions
295
- self.logger = SmartLogger(model_name)
296
- self.cost_tracker = MonitoringCostTracker()
297
-
298
- # Always log certain predictions
299
- self.always_log_conditions = [
300
- lambda pred: pred["confidence"] < 0.5, # Low confidence
301
- lambda pred: pred["value"] > 0.9, # High risk
302
- lambda pred: pred.get("is_edge_case", False) # Edge cases
303
- ]
304
-
305
- def should_log_prediction(self, prediction: dict) -> bool:
306
- """Intelligent sampling decision"""
307
-
308
- # Always log important predictions
309
- for condition in self.always_log_conditions:
310
- if condition(prediction):
311
- return True
312
-
313
- # Sample remaining predictions
314
- return random.random() < self.sampling_rate
315
-
316
- async def log_prediction(
317
- self,
318
- features: dict,
319
- prediction: dict,
320
- metadata: dict
321
- ):
322
- """Log prediction with cost optimization"""
323
-
324
- if not self.should_log_prediction(prediction):
325
- self.cost_tracker.record_skipped_log()
326
- return
327
-
328
- # Log to monitoring system
329
- with self.cost_tracker.track_logging_cost():
330
- await self.logger.log(
331
- timestamp=datetime.now(),
332
- features=features,
333
- prediction=prediction,
334
- metadata=metadata,
335
- # Compress large payloads
336
- compress=len(str(features)) > 1000
337
- )
338
-
339
- self.cost_tracker.record_logged_prediction()
340
-
341
- def get_cost_report(self):
342
- """Monitoring cost analysis"""
343
- report = self.cost_tracker.generate_report()
344
-
345
- print(f"Monitoring Cost Report:")
346
- print(f"Total predictions: {report.total_predictions:,}")
347
- print(f"Logged predictions: {report.logged_predictions:,}")
348
- print(f"Sampling rate: {report.actual_sampling_rate:.1%}")
349
- print(f"Log storage cost: ${report.storage_cost:.2f}")
350
- print(f"Log ingestion cost: ${report.ingestion_cost:.2f}")
351
- print(f"Total monitoring cost: ${report.total_cost:.2f}")
352
- print(f"Cost per logged prediction: ${report.cost_per_log:.4f}")
353
- print(f"Savings from sampling: ${report.sampling_savings:.2f}")
354
-
355
- return report
356
-
357
- # Usage
358
- monitor = CostOptimizedMonitor(
359
- model_name="churn_predictor_v2",
360
- sampling_rate=0.1 # Log 10% + important predictions
361
- )
362
-
363
- # In production
364
- for prediction_request in prediction_stream:
365
- features = prediction_request.features
366
- prediction = model.predict(features)
367
-
368
- await monitor.log_prediction(
369
- features=features,
370
- prediction=prediction,
371
- metadata={"customer_id": prediction_request.customer_id}
372
- )
373
-
374
- # Monitor costs
375
- monthly_report = monitor.get_cost_report()
376
-
377
- # Expected results:
378
- # - 90% reduction in logging costs
379
- # - Still captures all important events
380
- # - Sufficient data for drift detection
381
- ```
382
-
383
- ### Efficient Drift Detection
384
- ```python
385
- from drift_detector import BatchDriftDetector
386
- from scipy import stats
387
- import numpy as np
- import pandas as pd
388
- from finops_tracker import DriftCostTracker
389
-
390
- class CostOptimizedDriftDetection:
391
- """Efficient drift detection with cost optimization"""
392
-
393
- def __init__(self, baseline_data: pd.DataFrame):
394
- self.baseline_data = baseline_data
395
- self.detector = BatchDriftDetector()
396
- self.cost_tracker = DriftCostTracker()
397
-
398
- # Pre-compute baseline statistics (one-time cost)
399
- self.baseline_stats = self._compute_baseline_stats()
400
-
401
- def _compute_baseline_stats(self):
402
- """Pre-compute baseline statistics for efficiency"""
403
- stats = {}
404
-
405
- for column in self.baseline_data.columns:
406
- if self.baseline_data[column].dtype in ['float64', 'int64']:
407
- stats[column] = {
408
- 'mean': self.baseline_data[column].mean(),
409
- 'std': self.baseline_data[column].std(),
410
- 'min': self.baseline_data[column].min(),
411
- 'max': self.baseline_data[column].max(),
412
- 'quantiles': self.baseline_data[column].quantile([0.25, 0.5, 0.75]).to_dict(),
413
- 'distribution': self.baseline_data[column].values
414
- }
415
- else:
416
- stats[column] = {
417
- 'value_counts': self.baseline_data[column].value_counts().to_dict(),
418
- 'unique_values': set(self.baseline_data[column].unique())
419
- }
420
-
421
- return stats
422
-
423
- def detect_drift_efficient(
424
- self,
425
- production_data: pd.DataFrame,
426
- features: list = None,
427
- method: str = "ks_test"
428
- ) -> dict:
429
- """Efficient drift detection using pre-computed statistics"""
430
-
431
- with self.cost_tracker.track_drift_detection():
432
- drift_results = {}
433
- features = features or production_data.columns
434
-
435
- for feature in features:
436
- if feature not in self.baseline_stats:
437
- continue
438
-
439
- baseline_values = self.baseline_stats[feature]['distribution']
440
- production_values = production_data[feature].values
441
-
442
- # Use cached baseline statistics
443
- if method == "ks_test":
444
- # Kolmogorov-Smirnov test (efficient)
445
- statistic, p_value = stats.ks_2samp(
446
- baseline_values,
447
- production_values
448
- )
449
- has_drift = p_value < 0.05
450
-
451
- elif method == "psi":
452
- # Population Stability Index (very efficient)
453
- psi_score = self._calculate_psi_efficient(
454
- baseline_values,
455
- production_values
456
- )
457
- has_drift = psi_score > 0.2
458
- statistic = psi_score
459
- p_value = None
460
-
461
- drift_results[feature] = {
462
- 'has_drift': has_drift,
463
- 'statistic': statistic,
464
- 'p_value': p_value,
465
- 'drift_magnitude': abs(
466
- production_values.mean() - baseline_values.mean()
467
- ) / baseline_values.std() if baseline_values.std() > 0 else 0
468
- }
469
-
470
- # Cost report
471
- cost_report = self.cost_tracker.get_detection_cost()
472
- print(f"Drift detection cost: ${cost_report.cost:.4f}")
473
- print(f"Detection time: {cost_report.duration_ms:.2f}ms")
474
-
475
- return {
476
- 'drift_results': drift_results,
477
- 'drifted_features': [f for f, r in drift_results.items() if r['has_drift']],
478
- 'drift_score': np.mean([r['drift_magnitude'] for r in drift_results.values()]),
479
- 'cost': cost_report.cost
480
- }
481
-
482
- def _calculate_psi_efficient(
483
- self,
484
- baseline: np.ndarray,
485
- production: np.ndarray,
486
- bins: int = 10
487
- ) -> float:
488
- """Efficient PSI calculation"""
489
-
490
- # Create bins from baseline
491
- bin_edges = np.percentile(baseline, np.linspace(0, 100, bins + 1))
492
- bin_edges[0] = -np.inf
493
- bin_edges[-1] = np.inf
494
-
495
- # Calculate distributions
496
- baseline_dist = np.histogram(baseline, bins=bin_edges)[0] / len(baseline)
497
- production_dist = np.histogram(production, bins=bin_edges)[0] / len(production)
498
-
499
- # PSI calculation
500
- psi = np.sum(
501
- (production_dist - baseline_dist) * np.log(
502
- (production_dist + 1e-10) / (baseline_dist + 1e-10)
503
- )
504
- )
505
-
506
- return psi
507
-
508
- def incremental_drift_monitoring(
509
- self,
510
- data_stream,
511
- window_size: int = 1000,
512
- check_frequency: int = 100
513
- ):
514
- """Incremental drift detection for streaming data"""
515
-
516
- buffer = []
517
- check_count = 0
518
-
519
- for record in data_stream:
520
- buffer.append(record)
521
-
522
- # Check drift every N records
523
- if len(buffer) >= check_frequency:
524
- check_count += 1
525
-
526
- # Convert to DataFrame
527
- production_sample = pd.DataFrame(buffer[-window_size:])
528
-
529
- # Run efficient drift detection
530
- drift_result = self.detect_drift_efficient(
531
- production_data=production_sample,
532
- method="psi" # Faster than KS test
533
- )
534
-
535
- # Alert if drift detected
536
- if drift_result['drifted_features']:
537
- print(f"\nDrift detected at check #{check_count}:")
538
- print(f"Drifted features: {drift_result['drifted_features']}")
539
- print(f"Drift score: {drift_result['drift_score']:.3f}")
540
-
541
- # Clear old records from buffer
542
- buffer = buffer[-window_size:]
543
-
544
- # Usage
545
- drift_detector = CostOptimizedDriftDetection(
546
- baseline_data=training_data
547
- )
548
-
549
- # Batch drift detection (daily job)
550
- production_data = get_recent_predictions(days=7)
551
- drift_result = drift_detector.detect_drift_efficient(
552
- production_data=production_data,
553
- method="psi" # 10x faster than KS test
554
- )
555
-
556
- print(f"Drift detection completed in {drift_result['cost']:.4f}s")
557
- print(f"Drifted features: {drift_result['drifted_features']}")
558
-
559
- # Streaming drift detection
560
- drift_detector.incremental_drift_monitoring(
561
- data_stream=prediction_stream,
562
- window_size=1000,
563
- check_frequency=100
564
- )
565
- ```
566
-
567
- ### Automated Drift Response
568
- ```python
569
- from alert_manager import AlertManager, AlertSeverity
570
- from model_retrainer import AutoRetrainer
571
- from drift_detector import DriftAnalyzer
572
-
573
- class AutomatedDriftResponse:
574
- """Automated drift detection and response system"""
575
-
576
- def __init__(self, model_name: str):
577
- self.model_name = model_name
578
- self.drift_analyzer = DriftAnalyzer(model_name)
579
- self.alert_manager = AlertManager()
580
- self.retrainer = AutoRetrainer(model_name)
581
-
582
- # Configure drift response thresholds
583
- self.thresholds = {
584
- "minor_drift": 0.1, # Warning alert
585
- "moderate_drift": 0.2, # Trigger investigation
586
- "severe_drift": 0.3 # Auto-retrain
587
- }
588
-
589
- async def monitor_and_respond(self):
590
- """Continuous drift monitoring with automated response"""
591
-
592
- # Get recent production data
593
- production_data = await self.get_production_data(days=7)
594
-
595
- # Detect drift
596
- drift_result = self.drift_analyzer.analyze_drift(
597
- production_data=production_data,
598
- features=feature_list
599
- )
600
-
601
- drift_score = drift_result['drift_score']
602
- drifted_features = drift_result['drifted_features']
603
-
604
- # Automated response based on severity
605
- if drift_score >= self.thresholds["severe_drift"]:
606
- # Severe drift - auto-retrain
607
- await self._handle_severe_drift(drift_result)
608
-
609
- elif drift_score >= self.thresholds["moderate_drift"]:
610
- # Moderate drift - trigger investigation
611
- await self._handle_moderate_drift(drift_result)
612
-
613
- elif drift_score >= self.thresholds["minor_drift"]:
614
- # Minor drift - warning alert
615
- await self._handle_minor_drift(drift_result)
616
-
617
- return drift_result
618
-
619
- async def _handle_severe_drift(self, drift_result):
620
- """Handle severe drift with auto-retraining"""
621
-
622
- # Send critical alert
623
- await self.alert_manager.send_alert(
624
- severity=AlertSeverity.CRITICAL,
625
- title=f"Severe Drift Detected - Auto-Retraining Initiated",
626
- message=f"Model: {self.model_name}\n"
627
- f"Drift score: {drift_result['drift_score']:.3f}\n"
628
- f"Drifted features: {drift_result['drifted_features']}\n"
629
- f"Action: Automatic retraining initiated",
630
- channels=["slack", "email", "pagerduty"]
631
- )
632
-
633
- # Trigger automated retraining
634
- retrain_job = await self.retrainer.trigger_retraining(
635
- reason="severe_drift_detected",
636
- drift_score=drift_result['drift_score'],
637
- drifted_features=drift_result['drifted_features'],
638
- priority="high"
639
- )
640
-
641
- print(f"Retraining job initiated: {retrain_job.id}")
642
-
643
- async def _handle_moderate_drift(self, drift_result):
644
- """Handle moderate drift with investigation workflow"""
645
-
646
- # Send warning alert
647
- await self.alert_manager.send_alert(
648
- severity=AlertSeverity.WARNING,
649
- title=f"Moderate Drift Detected - Investigation Required",
650
- message=f"Model: {self.model_name}\n"
651
- f"Drift score: {drift_result['drift_score']:.3f}\n"
652
- f"Drifted features: {drift_result['drifted_features']}\n"
653
- f"Action: Please investigate and determine if retraining is needed",
654
- channels=["slack", "email"],
655
- actions=[
656
- {"label": "Trigger Retraining", "action": "retrain"},
657
- {"label": "Acknowledge", "action": "ack"},
658
- {"label": "View Dashboard", "action": "dashboard"}
659
- ]
660
- )
661
-
662
- # Create investigation ticket
663
- await self.alert_manager.create_investigation_ticket(
664
- title=f"Model Drift Investigation - {self.model_name}",
665
- description=drift_result,
666
- assignee="ml-ops-team"
667
- )
668
-
669
- async def _handle_minor_drift(self, drift_result):
670
- """Handle minor drift with monitoring"""
671
-
672
- # Send info alert
673
- await self.alert_manager.send_alert(
674
- severity=AlertSeverity.INFO,
675
- title=f"Minor Drift Detected",
676
- message=f"Model: {self.model_name}\n"
677
- f"Drift score: {drift_result['drift_score']:.3f}\n"
678
- f"Drifted features: {drift_result['drifted_features']}\n"
679
- f"Action: Continuing to monitor",
680
- channels=["slack"]
681
- )
682
-
683
- # Log to dashboard
684
- await self.drift_analyzer.log_drift_event(drift_result)
685
-
686
- # Usage (scheduled job)
687
- drift_responder = AutomatedDriftResponse(model_name="churn_predictor_v2")
688
-
689
- # Run daily
690
- async def daily_drift_check():
691
- result = await drift_responder.monitor_and_respond()
692
- print(f"Drift check completed. Score: {result['drift_score']:.3f}")
693
-
694
- # Schedule with APScheduler or similar
695
- from apscheduler.schedulers.asyncio import AsyncIOScheduler
696
-
697
- scheduler = AsyncIOScheduler()
698
- scheduler.add_job(daily_drift_check, 'cron', hour=2) # Run at 2 AM daily
699
- scheduler.start()
700
- ```
701
-
702
- ## 🚀 Monitoring Dashboards
703
-
704
- ### Grafana Dashboard Configuration
705
- ```python
706
- # monitoring_dashboard.py
707
- from grafana_api import GrafanaDashboard
708
- from azure.monitor import MetricsClient
709
-
710
- def create_model_monitoring_dashboard(model_name: str):
711
- """Create comprehensive model monitoring dashboard"""
712
-
713
- dashboard = GrafanaDashboard(
714
- title=f"Model Monitoring - {model_name}",
715
- refresh="30s"
716
- )
717
-
718
- # Row 1: Model Performance Metrics
719
- dashboard.add_row("Model Performance")
720
- dashboard.add_panel(
721
- title="Model Accuracy (7d rolling)",
722
- query="avg(model_accuracy{model_name='" + model_name + "'})",
723
- panel_type="graph",
724
- threshold=0.85,
725
- alert_condition="below"
726
- )
727
- dashboard.add_panel(
728
- title="AUC Score",
729
- query="avg(model_auc{model_name='" + model_name + "'})",
730
- panel_type="gauge",
731
- thresholds=[0.80, 0.90, 0.95]
732
- )
733
-
734
- # Row 2: Drift Metrics
735
- dashboard.add_row("Drift Detection")
736
- dashboard.add_panel(
737
- title="Feature Drift Score",
738
- query="max(feature_drift_score{model_name='" + model_name + "'})",
739
- panel_type="graph",
740
- threshold=0.2,
741
- alert_condition="above"
742
- )
743
- dashboard.add_panel(
744
- title="Prediction Drift Score",
745
- query="max(prediction_drift_score{model_name='" + model_name + "'})",
746
- panel_type="graph",
747
- threshold=0.15
748
- )
749
- dashboard.add_panel(
750
- title="Drifted Features Count",
751
- query="count(drifted_features{model_name='" + model_name + "'})",
752
- panel_type="stat"
753
- )
754
-
755
- # Row 3: Operational Metrics
756
- dashboard.add_row("Operational Metrics")
757
- dashboard.add_panel(
758
- title="Prediction Latency (p95)",
759
- query="histogram_quantile(0.95, prediction_latency{model_name='" + model_name + "'})",
760
- panel_type="graph",
761
- threshold=100,
762
- unit="ms"
763
- )
764
- dashboard.add_panel(
765
- title="Requests per Minute",
766
- query="rate(prediction_requests{model_name='" + model_name + "'}[1m])",
767
- panel_type="graph"
768
- )
769
- dashboard.add_panel(
770
- title="Error Rate",
771
- query="rate(prediction_errors{model_name='" + model_name + "'}[5m])",
772
- panel_type="graph",
773
- threshold=0.01,
774
- alert_condition="above"
775
- )
776
-
777
- # Row 4: Data Quality
778
- dashboard.add_row("Data Quality")
779
- dashboard.add_panel(
780
- title="Feature Completeness",
781
- query="avg(feature_completeness{model_name='" + model_name + "'})",
782
- panel_type="gauge",
783
- thresholds=[0.95, 0.99, 1.0]
784
- )
785
- dashboard.add_panel(
786
- title="Invalid Inputs Rate",
787
- query="rate(invalid_inputs{model_name='" + model_name + "'}[5m])",
788
- panel_type="graph"
789
- )
790
-
791
- # Row 5: Cost Metrics
792
- dashboard.add_row("Cost & Resource Usage")
793
- dashboard.add_panel(
794
- title="Daily Serving Cost",
795
- query="sum(serving_cost_usd{model_name='" + model_name + "'})",
796
- panel_type="stat",
797
- unit="currencyUSD"
798
- )
799
- dashboard.add_panel(
800
- title="Cost per 1000 Predictions",
801
- query="serving_cost_usd{model_name='" + model_name + "'} / prediction_count * 1000",
802
- panel_type="graph",
803
- unit="currencyUSD"
804
- )
805
-
806
- # Save dashboard
807
- dashboard.save()
808
- return dashboard
809
-
810
- # Create dashboard
811
- dashboard = create_model_monitoring_dashboard("churn_predictor_v2")
812
- print(f"Dashboard created: {dashboard.url}")
813
- ```
814
-
815
- ## 📊 Metrics & Monitoring
816
-
817
- | Metric Category | Metric | Target | Tool |
818
- |-----------------|--------|--------|------|
819
- | **Drift Detection** | Feature drift score | <0.2 | Drift detector |
820
- | | Prediction drift score | <0.15 | Drift detector |
821
- | | Drifted features count | <3 | KS/PSI tests |
822
- | | Drift check frequency | Daily | Scheduler |
823
- | **Model Performance** | Production accuracy | >0.85 | Model monitor |
824
- | | Production AUC | >0.90 | Model monitor |
825
- | | Performance vs baseline | >95% | Comparison |
826
- | **Data Quality** | Feature completeness | >99% | Quality checker |
827
- | | Invalid input rate | <1% | Validator |
828
- | | Missing feature rate | <0.1% | Monitor |
829
- | **Monitoring Costs** | Log storage cost | <$100/mo | FinOps tracker |
830
- | | Monitoring compute | <$50/mo | Cost tracker |
831
- | | Alert notification cost | <$20/mo | Alert manager |
832
- | **Operational** | Alert response time | <30 min | SLA monitor |
833
- | | False alert rate | <5% | Alert tuning |
834
- | | Dashboard load time | <2s | Performance |
835
-
836
- ## 🔄 Integration Workflow
837
-
838
- ### End-to-End Monitoring Pipeline
839
- ```
840
- 1. Production Predictions (ml-04)
841
-
842
- 2. Intelligent Logging (ml-05)
843
-
844
- 3. Data Aggregation (ml-05)
845
-
846
- 4. Drift Detection (mo-05)
847
-
848
- 5. Performance Monitoring (mo-04)
849
-
850
- 6. Anomaly Detection (ml-05)
851
-
852
- 7. Alert Generation (ml-05)
853
-
854
- 8. Dashboard Updates (do-08)
855
-
856
- 9. Investigation Workflow (ml-05)
857
-
858
- 10. Auto-Retraining Trigger (ml-09)
859
-
860
- 11. Cost Tracking (fo-07)
861
- ```
862
-
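The stages above can be chained into a single scheduled coroutine. A sketch reusing the illustrative objects defined earlier in this skill (`monitor`, `drift_detector`, `alert_manager`, and an `AutoRetrainer` instance named `retrainer`); these are the sketched APIs from this README, not guaranteed package interfaces:

```python
async def monitoring_cycle():
    """One end-to-end pass: aggregate logs, detect drift, alert, and trigger retraining."""
    # 2-3. Aggregate recently logged predictions
    production_data = monitor.get_recent_predictions(days=1)

    # 4-5. Drift detection + performance monitoring (PSI is the cheaper default here)
    drift = drift_detector.detect_drift_efficient(production_data, method="psi")

    # 6-8. Alerting and dashboard updates
    if drift["drifted_features"]:
        await alert_manager.send_alert(
            severity="warning",
            title="Drift detected in scheduled monitoring cycle",
            message=f"Drifted features: {drift['drifted_features']}",
        )

    # 9-10. Investigation / automated retraining for severe drift
    if drift["drift_score"] > 0.3:
        await retrainer.trigger_retraining(reason="severe_drift_detected")

    return drift
```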
863
- ## 🎯 Quick Wins
864
-
865
- 1. **Enable prediction logging** - Visibility into production behavior
866
- 2. **Set up drift detection** - Early warning for model degradation
867
- 3. **Create monitoring dashboards** - Real-time visibility
868
- 4. **Implement intelligent sampling** - 90% reduction in logging costs
869
- 5. **Configure performance alerts** - Proactive issue detection
870
- 6. **Use PSI for drift** - 10x faster than KS test
871
- 7. **Pre-compute baseline stats** - Faster drift detection
872
- 8. **Set up automated alerts** - Faster incident response
873
- 9. **Track monitoring costs** - Optimize monitoring spend
874
- 10. **Implement auto-retraining** - Automated drift response
1
+ # Skill 5: Model Monitoring & Drift Detection
2
+
3
+ ## 🎯 Overview
4
+ Implement comprehensive model monitoring with data drift, concept drift, and performance degradation detection for production ML systems.
5
+
6
+ ## 🔗 Connections
7
+ - **MLOps**: Core monitoring and drift detection capabilities (mo-04, mo-05)
8
+ - **ML Engineer**: Monitors deployed models and triggers retraining (ml-04, ml-09)
9
+ - **Data Scientist**: Analyzes model degradation patterns (ds-08)
10
+ - **DevOps**: Integrates with observability platforms (do-08)
11
+ - **FinOps**: Monitors model performance vs cost trade-offs (fo-07)
12
+ - **Security Architect**: Detects anomalous predictions (sa-08)
13
+ - **Data Engineer**: Monitors data quality for features (de-03)
14
+ - **System Design**: Scalable monitoring architecture (sd-08)
15
+
16
+ ## 🛠️ Tools Included
17
+
18
+ ### 1. `drift_detector.py`
19
+ Statistical drift detection for features and predictions.
20
+
21
+ ### 2. `model_monitor.py`
22
+ Comprehensive model performance monitoring with alerting.
23
+
24
+ ### 3. `prediction_analyzer.py`
25
+ Prediction distribution analysis and anomaly detection.
26
+
27
+ ### 4. `monitoring_dashboard.py`
28
+ Real-time monitoring dashboards with Grafana/Azure Monitor.
29
+
30
+ ### 5. `alert_manager.py`
31
+ Intelligent alerting system for model degradation.
32
+
33
+ ## 🏗️ Monitoring Architecture
34
+
35
+ ```
36
+ Production Traffic → Logging → Analysis → Drift Detection → Alerting
37
+ ↓ ↓ ↓ ↓
38
+ Predictions Metrics Statistics Notifications
39
+ Features Business Comparison Auto-retrain
40
+ Metadata Technical Baseline Dashboards
41
+ ```
42
+
43
+ ## 🚀 Quick Start
44
+
45
+ ```python
46
+ from drift_detector import DriftDetector, KSTest, PSICalculator
47
+ from model_monitor import ModelMonitor
48
+ from alert_manager import AlertManager
49
+
50
+ # Initialize drift detector
51
+ drift_detector = DriftDetector(
52
+ baseline_data=training_data,
53
+ detection_methods=[
54
+ KSTest(significance_level=0.05),
55
+ PSICalculator(threshold=0.2)
56
+ ]
57
+ )
58
+
59
+ # Initialize model monitor
60
+ monitor = ModelMonitor(
61
+ model_name="churn_predictor_v2",
62
+ metrics=["accuracy", "auc", "precision", "recall"],
63
+ alert_thresholds={
64
+ "accuracy": 0.85,
65
+ "auc": 0.90,
66
+ "data_drift_score": 0.2,
67
+ "prediction_drift_score": 0.15
68
+ }
69
+ )
70
+
71
+ # Monitor predictions in production
72
+ @monitor.track_predictions
73
+ async def predict(features):
74
+ prediction = model.predict(features)
75
+
76
+ # Log prediction for monitoring
77
+ monitor.log_prediction(
78
+ features=features,
79
+ prediction=prediction,
80
+ timestamp=datetime.now(),
81
+ metadata={"customer_id": features["customer_id"]}
82
+ )
83
+
84
+ return prediction
85
+
86
+ # Run drift detection (scheduled job)
87
+ def check_drift():
88
+ """Daily drift detection job"""
89
+
90
+ # Get recent production data
91
+ production_data = monitor.get_recent_predictions(days=7)
92
+
93
+ # Detect feature drift
94
+ feature_drift = drift_detector.detect_feature_drift(
95
+ production_data=production_data,
96
+ features=feature_list
97
+ )
98
+
99
+ # Detect prediction drift
100
+ prediction_drift = drift_detector.detect_prediction_drift(
101
+ production_predictions=production_data["predictions"],
102
+ baseline_predictions=training_predictions
103
+ )
104
+
105
+ # Alert if drift detected
106
+ if feature_drift.has_drift or prediction_drift.has_drift:
107
+ alert_manager.send_alert(
108
+ severity="warning",
109
+ title="Model Drift Detected",
110
+ message=f"Drifted features: {feature_drift.drifted_features}\n"
111
+ f"Prediction drift score: {prediction_drift.score:.3f}",
112
+ actions=["Review model", "Trigger retraining"]
113
+ )
114
+
115
+ # Generate drift report
116
+ drift_report = drift_detector.generate_report(
117
+ feature_drift=feature_drift,
118
+ prediction_drift=prediction_drift
119
+ )
120
+
121
+ return drift_report
122
+ ```
123
+
124
+ ## 📚 Best Practices
125
+
126
+ ### Drift Detection & Monitoring (MLOps Integration)
127
+
128
+ 1. **Multi-Level Drift Detection**
129
+ - Monitor data drift (feature distribution changes)
130
+ - Monitor concept drift (relationship between features and target)
131
+ - Monitor prediction drift (output distribution changes)
132
+ - Monitor performance drift (metric degradation)
133
+ - Reference: MLOps mo-05 (Drift Detection)
134
+
135
+ 2. **Statistical Drift Tests**
136
+ - Use Kolmogorov-Smirnov test for continuous features
137
+ - Use Chi-square test for categorical features
138
+ - Calculate Population Stability Index (PSI)
139
+ - Track Jensen-Shannon divergence
140
+ - Set appropriate significance levels
141
+ - Reference: MLOps mo-05, Data Scientist ds-08
142
+
143
+ 3. **Baseline Comparison**
144
+ - Maintain reference datasets (training data)
145
+ - Update baselines periodically
146
+ - Track distribution shifts over time
147
+ - Document baseline versions
148
+ - Reference: MLOps mo-05, mo-06 (Lineage)
149
+
150
+ 4. **Monitoring Cadence**
151
+ - Real-time monitoring for critical models
152
+ - Hourly/daily drift checks for most models
153
+ - Weekly deep-dive analysis
154
+ - Monthly model review
155
+ - Reference: MLOps mo-04 (Monitoring)
156
+
157
+ 5. **Comprehensive Model Metrics**
158
+ - Track business metrics (revenue impact, user engagement)
159
+ - Monitor technical metrics (accuracy, AUC, F1)
160
+ - Track operational metrics (latency, throughput)
161
+ - Monitor data quality metrics
162
+ - Reference: MLOps mo-04
163
+
164
+ ### DevOps Integration for Monitoring
165
+
166
+ 6. **Observability Integration**
167
+ - Integrate with Azure Monitor / App Insights
168
+ - Use OpenTelemetry for instrumentation
169
+ - Centralize logs and metrics
170
+ - Implement distributed tracing
171
+ - Reference: DevOps do-08 (Monitoring)
172
+
173
+ 7. **Alerting & Incident Response**
174
+ - Set up intelligent alerting (avoid alert fatigue)
175
+ - Define alert severity levels
176
+ - Implement escalation policies
177
+ - Automate incident response
178
+ - Reference: DevOps do-08
179
+
180
+ 8. **Monitoring Dashboards**
181
+ - Build real-time monitoring dashboards
182
+ - Visualize drift metrics over time
183
+ - Track model performance trends
184
+ - Enable team collaboration
185
+ - Reference: DevOps do-08, MLOps mo-04
186
+
187
+ ### Cost Optimization for Monitoring (FinOps Integration)
188
+
189
+ 9. **Efficient Logging Strategy**
190
+ - Sample predictions for monitoring (not 100%)
191
+ - Implement tiered logging (critical vs routine)
192
+ - Compress and archive old logs
193
+ - Monitor log storage costs
194
+ - Reference: FinOps fo-05 (Storage), fo-07 (AI/ML Cost)
195
+
196
+ 10. **Optimize Monitoring Compute**
197
+ - Run drift detection on scheduled batches
198
+ - Use serverless for event-driven monitoring
199
+ - Right-size monitoring infrastructure
200
+ - Cache expensive drift calculations
201
+ - Reference: FinOps fo-06 (Compute Optimization)
202
+
203
+ 11. **Monitoring Cost Tracking**
204
+ - Track monitoring infrastructure costs
205
+ - Monitor log ingestion costs
206
+ - Optimize retention policies
207
+ - Balance cost vs visibility
208
+ - Reference: FinOps fo-01 (Cost Monitoring)
209
+
210
+ ### Automated Response & Retraining
211
+
212
+ 12. **Automated Drift Response**
213
+ - Auto-alert when drift exceeds thresholds
214
+ - Trigger model investigation workflows
215
+ - Initiate automated retraining pipelines
216
+ - Implement automatic rollback if needed
217
+ - Reference: MLOps mo-05, ML Engineer ml-09
218
+
219
+ 13. **Model Retraining Triggers**
220
+ - Performance degradation thresholds
221
+ - Significant data drift detected
222
+ - Concept drift indicators
223
+ - Scheduled periodic retraining
224
+ - Reference: ML Engineer ml-09 (Continuous Retraining)
225
+
226
+ ### Data Quality Monitoring
227
+
228
+ 14. **Feature Quality Checks**
229
+ - Monitor feature completeness
230
+ - Detect feature value range violations
231
+ - Track feature correlation changes
232
+ - Alert on missing features
233
+ - Reference: Data Engineer de-03 (Data Quality)
234
+
235
+ 15. **Input Validation Monitoring**
236
+ - Track invalid input rates
237
+ - Monitor schema violations
238
+ - Detect data type mismatches
239
+ - Alert on data quality issues
240
+ - Reference: Data Engineer de-03
241
+
242
+ ### Security & Anomaly Detection
243
+
244
+ 16. **Prediction Anomaly Detection**
245
+ - Detect unusual prediction patterns
246
+ - Identify potential model attacks
247
+ - Monitor for adversarial inputs
248
+ - Alert on suspicious behavior
249
+ - Reference: Security Architect sa-08 (LLM Security)
250
+
251
+ 17. **Model Behavior Monitoring**
252
+ - Track prediction confidence scores
253
+ - Monitor prediction uncertainty
254
+ - Detect model degradation patterns
255
+ - Identify edge cases
256
+ - Reference: MLOps mo-04, Security Architect sa-08
257
+
258
+ ### Azure-Specific Best Practices
259
+
260
+ 18. **Azure Monitor Integration**
261
+ - Use Azure Monitor for metrics
262
+ - Enable Application Insights
263
+ - Set up Log Analytics workspaces (queried in the sketch after this list)
264
+ - Configure metric alerts
265
+ - Reference: Azure az-04 (AI/ML Services)
266
+
267
+ 19. **Azure ML Model Monitoring**
268
+ - Enable model data collection
269
+ - Configure data drift detection
270
+ - Use built-in monitoring dashboards
271
+ - Integrate with Azure Monitor
272
+ - Reference: Azure az-04
273
+
274
+ 20. **Cost-Effective Monitoring**
275
+ - Use log sampling for high-volume models
276
+ - Implement retention policies
277
+ - Archive to cold storage
278
+ - Monitor monitoring costs
279
+ - Reference: Azure az-04, FinOps fo-05
280
+
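+ A hedged sketch for the Azure items above: pulling recent model metrics out of a Log Analytics workspace with the `azure-monitor-query` SDK. The workspace ID, the custom table name (`ModelPredictions_CL`), and the column names are assumptions about how prediction logs were ingested, not defaults of this package.
+ ```python
+ from datetime import timedelta
+ 
+ from azure.identity import DefaultAzureCredential
+ from azure.monitor.query import LogsQueryClient, LogsQueryStatus
+ 
+ client = LogsQueryClient(DefaultAzureCredential())
+ 
+ # Assumed custom table/columns created by the prediction-logging pipeline
+ query = """
+ ModelPredictions_CL
+ | where ModelName_s == 'churn_predictor_v2'
+ | summarize avg(Confidence_d), percentile(LatencyMs_d, 95) by bin(TimeGenerated, 1h)
+ """
+ 
+ response = client.query_workspace(
+     workspace_id="<log-analytics-workspace-id>",  # placeholder
+     query=query,
+     timespan=timedelta(days=1),
+ )
+ 
+ if response.status == LogsQueryStatus.SUCCESS:
+     for table in response.tables:
+         for row in table.rows:
+             print(row)
+ ```
+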
281
+ ## 💰 Cost Optimization Examples
282
+
283
+ ### Intelligent Prediction Logging
284
+ ```python
285
+ from model_monitor import SmartLogger
286
+ from finops_tracker import MonitoringCostTracker
287
+ import random
+ from datetime import datetime
288
+
289
+ class CostOptimizedMonitor:
290
+ """Cost-optimized prediction logging with sampling"""
291
+
292
+ def __init__(self, model_name: str, sampling_rate: float = 0.1):
293
+ self.model_name = model_name
294
+ self.sampling_rate = sampling_rate # Log 10% of predictions
295
+ self.logger = SmartLogger(model_name)
296
+ self.cost_tracker = MonitoringCostTracker()
297
+
298
+ # Always log certain predictions
299
+ self.always_log_conditions = [
300
+ lambda pred: pred["confidence"] < 0.5, # Low confidence
301
+ lambda pred: pred["value"] > 0.9, # High risk
302
+ lambda pred: pred.get("is_edge_case", False) # Edge cases
303
+ ]
304
+
305
+ def should_log_prediction(self, prediction: dict) -> bool:
306
+ """Intelligent sampling decision"""
307
+
308
+ # Always log important predictions
309
+ for condition in self.always_log_conditions:
310
+ if condition(prediction):
311
+ return True
312
+
313
+ # Sample remaining predictions
314
+ return random.random() < self.sampling_rate
315
+
316
+ async def log_prediction(
317
+ self,
318
+ features: dict,
319
+ prediction: dict,
320
+ metadata: dict
321
+ ):
322
+ """Log prediction with cost optimization"""
323
+
324
+ if not self.should_log_prediction(prediction):
325
+ self.cost_tracker.record_skipped_log()
326
+ return
327
+
328
+ # Log to monitoring system
329
+ with self.cost_tracker.track_logging_cost():
330
+ await self.logger.log(
331
+ timestamp=datetime.now(),
332
+ features=features,
333
+ prediction=prediction,
334
+ metadata=metadata,
335
+ # Compress large payloads
336
+ compress=len(str(features)) > 1000
337
+ )
338
+
339
+ self.cost_tracker.record_logged_prediction()
340
+
341
+ def get_cost_report(self):
342
+ """Monitoring cost analysis"""
343
+ report = self.cost_tracker.generate_report()
344
+
345
+ print(f"Monitoring Cost Report:")
346
+ print(f"Total predictions: {report.total_predictions:,}")
347
+ print(f"Logged predictions: {report.logged_predictions:,}")
348
+ print(f"Sampling rate: {report.actual_sampling_rate:.1%}")
349
+ print(f"Log storage cost: ${report.storage_cost:.2f}")
350
+ print(f"Log ingestion cost: ${report.ingestion_cost:.2f}")
351
+ print(f"Total monitoring cost: ${report.total_cost:.2f}")
352
+ print(f"Cost per logged prediction: ${report.cost_per_log:.4f}")
353
+ print(f"Savings from sampling: ${report.sampling_savings:.2f}")
354
+
355
+ return report
356
+
357
+ # Usage
358
+ monitor = CostOptimizedMonitor(
359
+ model_name="churn_predictor_v2",
360
+ sampling_rate=0.1 # Log 10% + important predictions
361
+ )
362
+
363
+ # In production (inside an async serving loop; prediction_stream and model are assumed)
364
+ for prediction_request in prediction_stream:
365
+ features = prediction_request.features
366
+ prediction = model.predict(features)
367
+
368
+ await monitor.log_prediction(
369
+ features=features,
370
+ prediction=prediction,
371
+ metadata={"customer_id": prediction_request.customer_id}
372
+ )
373
+
374
+ # Monitor costs
375
+ monthly_report = monitor.get_cost_report()
376
+
377
+ # Expected results:
378
+ # - 90% reduction in logging costs
379
+ # - Still captures all important events
380
+ # - Sufficient data for drift detection
381
+ ```
382
+
383
+ ### Efficient Drift Detection
384
+ ```python
385
+ from drift_detector import BatchDriftDetector
386
+ from scipy import stats
387
+ import numpy as np
+ import pandas as pd
388
+ from finops_tracker import DriftCostTracker
389
+
390
+ class CostOptimizedDriftDetection:
391
+ """Efficient drift detection with cost optimization"""
392
+
393
+ def __init__(self, baseline_data: pd.DataFrame):
394
+ self.baseline_data = baseline_data
395
+ self.detector = BatchDriftDetector()
396
+ self.cost_tracker = DriftCostTracker()
397
+
398
+ # Pre-compute baseline statistics (one-time cost)
399
+ self.baseline_stats = self._compute_baseline_stats()
400
+
401
+ def _compute_baseline_stats(self):
402
+ """Pre-compute baseline statistics for efficiency"""
403
+ stats = {}
404
+
405
+ for column in self.baseline_data.columns:
406
+ if self.baseline_data[column].dtype in ['float64', 'int64']:
407
+ stats[column] = {
408
+ 'mean': self.baseline_data[column].mean(),
409
+ 'std': self.baseline_data[column].std(),
410
+ 'min': self.baseline_data[column].min(),
411
+ 'max': self.baseline_data[column].max(),
412
+ 'quantiles': self.baseline_data[column].quantile([0.25, 0.5, 0.75]).to_dict(),
413
+ 'distribution': self.baseline_data[column].values
414
+ }
415
+ else:
416
+ stats[column] = {
417
+ 'value_counts': self.baseline_data[column].value_counts().to_dict(),
418
+ 'unique_values': set(self.baseline_data[column].unique())
419
+ }
420
+
421
+ return stats
422
+
423
+ def detect_drift_efficient(
424
+ self,
425
+ production_data: pd.DataFrame,
426
+ features: list = None,
427
+ method: str = "ks_test"
428
+ ) -> dict:
429
+ """Efficient drift detection using pre-computed statistics"""
430
+
431
+ with self.cost_tracker.track_drift_detection():
432
+ drift_results = {}
433
+ features = features or production_data.columns
434
+
435
+ for feature in features:
436
+ if feature not in self.baseline_stats:
437
+ continue
438
+
439
+ baseline_values = self.baseline_stats[feature]['distribution']
440
+ production_values = production_data[feature].values
441
+
442
+ # Use cached baseline statistics
443
+ if method == "ks_test":
444
+ # Kolmogorov-Smirnov test (efficient)
445
+ statistic, p_value = stats.ks_2samp(
446
+ baseline_values,
447
+ production_values
448
+ )
449
+ has_drift = p_value < 0.05
450
+
451
+ elif method == "psi":
452
+ # Population Stability Index (very efficient)
453
+ psi_score = self._calculate_psi_efficient(
454
+ baseline_values,
455
+ production_values
456
+ )
457
+ has_drift = psi_score > 0.2
458
+ statistic = psi_score
459
+ p_value = None
460
+
461
+ drift_results[feature] = {
462
+ 'has_drift': has_drift,
463
+ 'statistic': statistic,
464
+ 'p_value': p_value,
465
+ 'drift_magnitude': abs(
466
+ production_values.mean() - baseline_values.mean()
467
+ ) / baseline_values.std() if baseline_values.std() > 0 else 0
468
+ }
469
+
470
+ # Cost report
471
+ cost_report = self.cost_tracker.get_detection_cost()
472
+ print(f"Drift detection cost: ${cost_report.cost:.4f}")
473
+ print(f"Detection time: {cost_report.duration_ms:.2f}ms")
474
+
475
+ return {
476
+ 'drift_results': drift_results,
477
+ 'drifted_features': [f for f, r in drift_results.items() if r['has_drift']],
478
+ 'drift_score': np.mean([r['drift_magnitude'] for r in drift_results.values()]),
479
+ 'cost': cost_report.cost
480
+ }
481
+
482
+ def _calculate_psi_efficient(
483
+ self,
484
+ baseline: np.ndarray,
485
+ production: np.ndarray,
486
+ bins: int = 10
487
+ ) -> float:
488
+ """Efficient PSI calculation"""
489
+
490
+ # Create bins from baseline
491
+ bin_edges = np.percentile(baseline, np.linspace(0, 100, bins + 1))
492
+ bin_edges[0] = -np.inf
493
+ bin_edges[-1] = np.inf
494
+
495
+ # Calculate distributions
496
+ baseline_dist = np.histogram(baseline, bins=bin_edges)[0] / len(baseline)
497
+ production_dist = np.histogram(production, bins=bin_edges)[0] / len(production)
498
+
499
+ # PSI calculation
500
+ psi = np.sum(
501
+ (production_dist - baseline_dist) * np.log(
502
+ (production_dist + 1e-10) / (baseline_dist + 1e-10)
503
+ )
504
+ )
505
+
506
+ return psi
507
+
508
+ def incremental_drift_monitoring(
509
+ self,
510
+ data_stream,
511
+ window_size: int = 1000,
512
+ check_frequency: int = 100
513
+ ):
514
+ """Incremental drift detection for streaming data"""
515
+
516
+ buffer = []
517
+ check_count = 0
518
+
519
+ for record in data_stream:
520
+ buffer.append(record)
521
+
522
+ # Check drift every N records
523
+ if len(buffer) >= check_frequency:
524
+ check_count += 1
525
+
526
+ # Convert to DataFrame
527
+ production_sample = pd.DataFrame(buffer[-window_size:])
528
+
529
+ # Run efficient drift detection
530
+ drift_result = self.detect_drift_efficient(
531
+ production_data=production_sample,
532
+ method="psi" # Faster than KS test
533
+ )
534
+
535
+ # Alert if drift detected
536
+ if drift_result['drifted_features']:
537
+ print(f"\nDrift detected at check #{check_count}:")
538
+ print(f"Drifted features: {drift_result['drifted_features']}")
539
+ print(f"Drift score: {drift_result['drift_score']:.3f}")
540
+
541
+ # Clear old records from buffer
542
+ buffer = buffer[-window_size:]
543
+
544
+ # Usage
545
+ drift_detector = CostOptimizedDriftDetection(
546
+ baseline_data=training_data
547
+ )
548
+
549
+ # Batch drift detection (daily job)
550
+ production_data = get_recent_predictions(days=7)
551
+ drift_result = drift_detector.detect_drift_efficient(
552
+ production_data=production_data,
553
+ method="psi" # 10x faster than KS test
554
+ )
555
+
556
+ print(f"Drift detection completed in {drift_result['cost']:.4f}s")
557
+ print(f"Drifted features: {drift_result['drifted_features']}")
558
+
559
+ # Streaming drift detection
560
+ drift_detector.incremental_drift_monitoring(
561
+ data_stream=prediction_stream,
562
+ window_size=1000,
563
+ check_frequency=100
564
+ )
565
+ ```
566
+
567
+ ### Automated Drift Response
568
+ ```python
569
+ from alert_manager import AlertManager, AlertSeverity
570
+ from model_retrainer import AutoRetrainer
571
+ from drift_detector import DriftAnalyzer
572
+
573
+ class AutomatedDriftResponse:
574
+ """Automated drift detection and response system"""
575
+
576
+ def __init__(self, model_name: str):
577
+ self.model_name = model_name
578
+ self.drift_analyzer = DriftAnalyzer(model_name)
579
+ self.alert_manager = AlertManager()
580
+ self.retrainer = AutoRetrainer(model_name)
581
+
582
+ # Configure drift response thresholds
583
+ self.thresholds = {
584
+ "minor_drift": 0.1, # Warning alert
585
+ "moderate_drift": 0.2, # Trigger investigation
586
+ "severe_drift": 0.3 # Auto-retrain
587
+ }
588
+
589
+ async def monitor_and_respond(self):
590
+ """Continuous drift monitoring with automated response"""
591
+
592
+ # Get recent production data
593
+ production_data = await self.get_production_data(days=7)  # data-access helper assumed to exist
594
+
595
+ # Detect drift
596
+ drift_result = self.drift_analyzer.analyze_drift(
597
+ production_data=production_data,
598
+ features=feature_list  # assumed: configured list of monitored feature names
599
+ )
600
+
601
+ drift_score = drift_result['drift_score']
602
+ drifted_features = drift_result['drifted_features']
603
+
604
+ # Automated response based on severity
605
+ if drift_score >= self.thresholds["severe_drift"]:
606
+ # Severe drift - auto-retrain
607
+ await self._handle_severe_drift(drift_result)
608
+
609
+ elif drift_score >= self.thresholds["moderate_drift"]:
610
+ # Moderate drift - trigger investigation
611
+ await self._handle_moderate_drift(drift_result)
612
+
613
+ elif drift_score >= self.thresholds["minor_drift"]:
614
+ # Minor drift - warning alert
615
+ await self._handle_minor_drift(drift_result)
616
+
617
+ return drift_result
618
+
619
+ async def _handle_severe_drift(self, drift_result):
620
+ """Handle severe drift with auto-retraining"""
621
+
622
+ # Send critical alert
623
+ await self.alert_manager.send_alert(
624
+ severity=AlertSeverity.CRITICAL,
625
+ title=f"Severe Drift Detected - Auto-Retraining Initiated",
626
+ message=f"Model: {self.model_name}\n"
627
+ f"Drift score: {drift_result['drift_score']:.3f}\n"
628
+ f"Drifted features: {drift_result['drifted_features']}\n"
629
+ f"Action: Automatic retraining initiated",
630
+ channels=["slack", "email", "pagerduty"]
631
+ )
632
+
633
+ # Trigger automated retraining
634
+ retrain_job = await self.retrainer.trigger_retraining(
635
+ reason="severe_drift_detected",
636
+ drift_score=drift_result['drift_score'],
637
+ drifted_features=drift_result['drifted_features'],
638
+ priority="high"
639
+ )
640
+
641
+ print(f"Retraining job initiated: {retrain_job.id}")
642
+
643
+ async def _handle_moderate_drift(self, drift_result):
644
+ """Handle moderate drift with investigation workflow"""
645
+
646
+ # Send warning alert
647
+ await self.alert_manager.send_alert(
648
+ severity=AlertSeverity.WARNING,
649
+ title=f"Moderate Drift Detected - Investigation Required",
650
+ message=f"Model: {self.model_name}\n"
651
+ f"Drift score: {drift_result['drift_score']:.3f}\n"
652
+ f"Drifted features: {drift_result['drifted_features']}\n"
653
+ f"Action: Please investigate and determine if retraining is needed",
654
+ channels=["slack", "email"],
655
+ actions=[
656
+ {"label": "Trigger Retraining", "action": "retrain"},
657
+ {"label": "Acknowledge", "action": "ack"},
658
+ {"label": "View Dashboard", "action": "dashboard"}
659
+ ]
660
+ )
661
+
662
+ # Create investigation ticket
663
+ await self.alert_manager.create_investigation_ticket(
664
+ title=f"Model Drift Investigation - {self.model_name}",
665
+ description=drift_result,
666
+ assignee="ml-ops-team"
667
+ )
668
+
669
+ async def _handle_minor_drift(self, drift_result):
670
+ """Handle minor drift with monitoring"""
671
+
672
+ # Send info alert
673
+ await self.alert_manager.send_alert(
674
+ severity=AlertSeverity.INFO,
675
+ title=f"Minor Drift Detected",
676
+ message=f"Model: {self.model_name}\n"
677
+ f"Drift score: {drift_result['drift_score']:.3f}\n"
678
+ f"Drifted features: {drift_result['drifted_features']}\n"
679
+ f"Action: Continuing to monitor",
680
+ channels=["slack"]
681
+ )
682
+
683
+ # Log to dashboard
684
+ await self.drift_analyzer.log_drift_event(drift_result)
685
+
686
+ # Usage (scheduled job)
687
+ drift_responder = AutomatedDriftResponse(model_name="churn_predictor_v2")
688
+
689
+ # Run daily
690
+ async def daily_drift_check():
691
+ result = await drift_responder.monitor_and_respond()
692
+ print(f"Drift check completed. Score: {result['drift_score']:.3f}")
693
+
694
+ # Schedule with APScheduler or similar
695
+ from apscheduler.schedulers.asyncio import AsyncIOScheduler
696
+
697
+ scheduler = AsyncIOScheduler()
698
+ scheduler.add_job(daily_drift_check, 'cron', hour=2) # Run at 2 AM daily
699
+ scheduler.start()
700
+ ```
701
+
702
+ ## 🚀 Monitoring Dashboards
703
+
704
+ ### Grafana Dashboard Configuration
705
+ ```python
706
+ # monitoring_dashboard.py
707
+ from grafana_api import GrafanaDashboard
708
709
+
710
+ def create_model_monitoring_dashboard(model_name: str):
711
+ """Create comprehensive model monitoring dashboard"""
712
+
713
+ dashboard = GrafanaDashboard(
714
+ title=f"Model Monitoring - {model_name}",
715
+ refresh="30s"
716
+ )
717
+
718
+ # Row 1: Model Performance Metrics
719
+ dashboard.add_row("Model Performance")
720
+ dashboard.add_panel(
721
+ title="Model Accuracy (7d rolling)",
722
+ query="avg(model_accuracy{model_name='" + model_name + "'})",
723
+ panel_type="graph",
724
+ threshold=0.85,
725
+ alert_condition="below"
726
+ )
727
+ dashboard.add_panel(
728
+ title="AUC Score",
729
+ query="avg(model_auc{model_name='" + model_name + "'})",
730
+ panel_type="gauge",
731
+ thresholds=[0.80, 0.90, 0.95]
732
+ )
733
+
734
+ # Row 2: Drift Metrics
735
+ dashboard.add_row("Drift Detection")
736
+ dashboard.add_panel(
737
+ title="Feature Drift Score",
738
+ query="max(feature_drift_score{model_name='" + model_name + "'})",
739
+ panel_type="graph",
740
+ threshold=0.2,
741
+ alert_condition="above"
742
+ )
743
+ dashboard.add_panel(
744
+ title="Prediction Drift Score",
745
+ query="max(prediction_drift_score{model_name='" + model_name + "'})",
746
+ panel_type="graph",
747
+ threshold=0.15
748
+ )
749
+ dashboard.add_panel(
750
+ title="Drifted Features Count",
751
+ query="count(drifted_features{model_name='" + model_name + "'})",
752
+ panel_type="stat"
753
+ )
754
+
755
+ # Row 3: Operational Metrics
756
+ dashboard.add_row("Operational Metrics")
757
+ dashboard.add_panel(
758
+ title="Prediction Latency (p95)",
759
+ query="histogram_quantile(0.95, prediction_latency{model_name='" + model_name + "'})",
760
+ panel_type="graph",
761
+ threshold=100,
762
+ unit="ms"
763
+ )
764
+ dashboard.add_panel(
765
+ title="Requests per Minute",
766
+ query="rate(prediction_requests{model_name='" + model_name + "'}[1m])",
767
+ panel_type="graph"
768
+ )
769
+ dashboard.add_panel(
770
+ title="Error Rate",
771
+ query="rate(prediction_errors{model_name='" + model_name + "'}[5m])",
772
+ panel_type="graph",
773
+ threshold=0.01,
774
+ alert_condition="above"
775
+ )
776
+
777
+ # Row 4: Data Quality
778
+ dashboard.add_row("Data Quality")
779
+ dashboard.add_panel(
780
+ title="Feature Completeness",
781
+ query="avg(feature_completeness{model_name='" + model_name + "'})",
782
+ panel_type="gauge",
783
+ thresholds=[0.95, 0.99, 1.0]
784
+ )
785
+ dashboard.add_panel(
786
+ title="Invalid Inputs Rate",
787
+ query="rate(invalid_inputs{model_name='" + model_name + "'}[5m])",
788
+ panel_type="graph"
789
+ )
790
+
791
+ # Row 5: Cost Metrics
792
+ dashboard.add_row("Cost & Resource Usage")
793
+ dashboard.add_panel(
794
+ title="Daily Serving Cost",
795
+ query="sum(serving_cost_usd{model_name='" + model_name + "'})",
796
+ panel_type="stat",
797
+ unit="currencyUSD"
798
+ )
799
+ dashboard.add_panel(
800
+ title="Cost per 1000 Predictions",
801
+ query="serving_cost_usd{model_name='" + model_name + "'} / prediction_count * 1000",
802
+ panel_type="graph",
803
+ unit="currencyUSD"
804
+ )
805
+
806
+ # Save dashboard
807
+ dashboard.save()
808
+ return dashboard
809
+
810
+ # Create dashboard
811
+ dashboard = create_model_monitoring_dashboard("churn_predictor_v2")
812
+ print(f"Dashboard created: {dashboard.url}")
813
+ ```
814
+
815
+ ## 📊 Metrics & Monitoring
816
+
817
+ | Metric Category | Metric | Target | Tool |
818
+ |-----------------|--------|--------|------|
819
+ | **Drift Detection** | Feature drift score | <0.2 | Drift detector |
820
+ | | Prediction drift score | <0.15 | Drift detector |
821
+ | | Drifted features count | <3 | KS/PSI tests |
822
+ | | Drift check frequency | Daily | Scheduler |
823
+ | **Model Performance** | Production accuracy | >0.85 | Model monitor |
824
+ | | Production AUC | >0.90 | Model monitor |
825
+ | | Performance vs baseline | >95% | Comparison |
826
+ | **Data Quality** | Feature completeness | >99% | Quality checker |
827
+ | | Invalid input rate | <1% | Validator |
828
+ | | Missing feature rate | <0.1% | Monitor |
829
+ | **Monitoring Costs** | Log storage cost | <$100/mo | FinOps tracker |
830
+ | | Monitoring compute | <$50/mo | Cost tracker |
831
+ | | Alert notification cost | <$20/mo | Alert manager |
832
+ | **Operational** | Alert response time | <30 min | SLA monitor |
833
+ | | False alert rate | <5% | Alert tuning |
834
+ | | Dashboard load time | <2s | Performance |
835
+
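+ As a companion to the table, a small sketch of how a scheduled job might check observed values against these targets; the observed values below are placeholders and the targets mirror the table above.
+ ```python
+ # Targets taken from the table above; "direction" is the side considered healthy
+ TARGETS = {
+     "feature_drift_score": {"target": 0.20, "direction": "below"},
+     "prediction_drift_score": {"target": 0.15, "direction": "below"},
+     "production_accuracy": {"target": 0.85, "direction": "above"},
+     "feature_completeness": {"target": 0.99, "direction": "above"},
+     "invalid_input_rate": {"target": 0.01, "direction": "below"},
+ }
+ 
+ def evaluate_targets(observed: dict) -> list:
+     """Return human-readable descriptions of every metric that is out of target."""
+     breaches = []
+     for name, spec in TARGETS.items():
+         value = observed.get(name)
+         if value is None:
+             continue
+         healthy = value < spec["target"] if spec["direction"] == "below" else value > spec["target"]
+         if not healthy:
+             breaches.append(f"{name}={value} (target {spec['direction']} {spec['target']})")
+     return breaches
+ 
+ # Placeholder observations
+ print(evaluate_targets({"feature_drift_score": 0.25, "production_accuracy": 0.88}))
+ ```
+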
836
+ ## 🔄 Integration Workflow
837
+
838
+ ### End-to-End Monitoring Pipeline
839
+ ```
840
+ 1. Production Predictions (ml-04)
841
+
842
+ 2. Intelligent Logging (ml-05)
843
+
844
+ 3. Data Aggregation (ml-05)
845
+
846
+ 4. Drift Detection (mo-05)
847
+
848
+ 5. Performance Monitoring (mo-04)
849
+
850
+ 6. Anomaly Detection (ml-05)
851
+
852
+ 7. Alert Generation (ml-05)
853
+
854
+ 8. Dashboard Updates (do-08)
855
+
856
+ 9. Investigation Workflow (ml-05)
857
+
858
+ 10. Auto-Retraining Trigger (ml-09)
859
+
860
+ 11. Cost Tracking (fo-07)
861
+ ```
862
+
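+ The steps above could be wired together as one scheduled job. A rough sketch, reusing the illustrative objects from the earlier examples (`monitor`, `drift_detector`, `drift_responder`, `get_recent_predictions`); their APIs are assumptions carried over from those snippets.
+ ```python
+ async def end_to_end_monitoring_run():
+     """Daily run tying the pipeline steps together (ml-05 / mo-04 / mo-05 / fo-07)."""
+     # Steps 1-3: predictions are already being logged via CostOptimizedMonitor
+     production_data = get_recent_predictions(days=1)
+ 
+     # Steps 4-5: drift and performance checks on the recent batch
+     drift = drift_detector.detect_drift_efficient(production_data, method="psi")
+ 
+     # Steps 6-10: anomaly checks, alerting, dashboards and (if needed) auto-retraining
+     response = await drift_responder.monitor_and_respond()
+ 
+     # Step 11: cost tracking for the monitoring stack itself
+     monitor.get_cost_report()
+ 
+     return {"drift": drift, "response": response}
+ ```
+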
863
+ ## 🎯 Quick Wins
864
+
865
+ 1. **Enable prediction logging** - Visibility into production behavior
866
+ 2. **Set up drift detection** - Early warning for model degradation
867
+ 3. **Create monitoring dashboards** - Real-time visibility
868
+ 4. **Implement intelligent sampling** - 90% reduction in logging costs
869
+ 5. **Configure performance alerts** - Proactive issue detection
870
+ 6. **Use PSI for drift** - 10x faster than KS test
871
+ 7. **Pre-compute baseline stats** - Faster drift detection
872
+ 8. **Set up automated alerts** - Faster incident response
873
+ 9. **Track monitoring costs** - Optimize monitoring spend
874
+ 10. **Implement auto-retraining** - Automated drift response