tech-hub-skills 1.2.0 → 1.5.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/{LICENSE → .claude/LICENSE} +21 -21
- package/.claude/README.md +291 -0
- package/.claude/bin/cli.js +266 -0
- package/{bin → .claude/bin}/copilot.js +182 -182
- package/{bin → .claude/bin}/postinstall.js +42 -42
- package/{tech_hub_skills/skills → .claude/commands}/README.md +336 -336
- package/{tech_hub_skills/skills → .claude/commands}/ai-engineer.md +104 -104
- package/{tech_hub_skills/skills → .claude/commands}/aws.md +143 -143
- package/{tech_hub_skills/skills → .claude/commands}/azure.md +149 -149
- package/{tech_hub_skills/skills → .claude/commands}/backend-developer.md +108 -108
- package/{tech_hub_skills/skills → .claude/commands}/code-review.md +399 -399
- package/{tech_hub_skills/skills → .claude/commands}/compliance-automation.md +747 -747
- package/{tech_hub_skills/skills → .claude/commands}/compliance-officer.md +108 -108
- package/{tech_hub_skills/skills → .claude/commands}/data-engineer.md +113 -113
- package/{tech_hub_skills/skills → .claude/commands}/data-governance.md +102 -102
- package/{tech_hub_skills/skills → .claude/commands}/data-scientist.md +123 -123
- package/{tech_hub_skills/skills → .claude/commands}/database-admin.md +109 -109
- package/{tech_hub_skills/skills → .claude/commands}/devops.md +160 -160
- package/{tech_hub_skills/skills → .claude/commands}/docker.md +160 -160
- package/{tech_hub_skills/skills → .claude/commands}/enterprise-dashboard.md +613 -613
- package/{tech_hub_skills/skills → .claude/commands}/finops.md +184 -184
- package/{tech_hub_skills/skills → .claude/commands}/frontend-developer.md +108 -108
- package/{tech_hub_skills/skills → .claude/commands}/gcp.md +143 -143
- package/{tech_hub_skills/skills → .claude/commands}/ml-engineer.md +115 -115
- package/{tech_hub_skills/skills → .claude/commands}/mlops.md +187 -187
- package/{tech_hub_skills/skills → .claude/commands}/network-engineer.md +109 -109
- package/{tech_hub_skills/skills → .claude/commands}/optimization-advisor.md +329 -329
- package/{tech_hub_skills/skills → .claude/commands}/orchestrator.md +623 -623
- package/{tech_hub_skills/skills → .claude/commands}/platform-engineer.md +102 -102
- package/{tech_hub_skills/skills → .claude/commands}/process-automation.md +226 -226
- package/{tech_hub_skills/skills → .claude/commands}/process-changelog.md +184 -184
- package/{tech_hub_skills/skills → .claude/commands}/process-documentation.md +484 -484
- package/{tech_hub_skills/skills → .claude/commands}/process-kanban.md +324 -324
- package/{tech_hub_skills/skills → .claude/commands}/process-versioning.md +214 -214
- package/{tech_hub_skills/skills → .claude/commands}/product-designer.md +104 -104
- package/{tech_hub_skills/skills → .claude/commands}/project-starter.md +443 -443
- package/{tech_hub_skills/skills → .claude/commands}/qa-engineer.md +109 -109
- package/{tech_hub_skills/skills → .claude/commands}/security-architect.md +135 -135
- package/{tech_hub_skills/skills → .claude/commands}/sre.md +109 -109
- package/{tech_hub_skills/skills → .claude/commands}/system-design.md +126 -126
- package/{tech_hub_skills/skills → .claude/commands}/technical-writer.md +101 -101
- package/.claude/package.json +46 -0
- package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/01-prompt-engineering/README.md +252 -252
- package/.claude/roles/ai-engineer/skills/01-prompt-engineering/prompt_ab_tester.py +356 -0
- package/.claude/roles/ai-engineer/skills/01-prompt-engineering/prompt_template_manager.py +274 -0
- package/.claude/roles/ai-engineer/skills/01-prompt-engineering/token_cost_estimator.py +324 -0
- package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/02-rag-pipeline/README.md +448 -448
- package/.claude/roles/ai-engineer/skills/02-rag-pipeline/document_chunker.py +336 -0
- package/.claude/roles/ai-engineer/skills/02-rag-pipeline/rag_pipeline.sql +213 -0
- package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/03-agent-orchestration/README.md +599 -599
- package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/04-llm-guardrails/README.md +735 -735
- package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/05-vector-embeddings/README.md +711 -711
- package/{tech_hub_skills → .claude}/roles/ai-engineer/skills/06-llm-evaluation/README.md +777 -777
- package/{tech_hub_skills → .claude}/roles/azure/skills/01-infrastructure-fundamentals/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/azure/skills/02-data-factory/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/azure/skills/03-synapse-analytics/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/azure/skills/04-databricks/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/azure/skills/05-functions/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/azure/skills/06-kubernetes-service/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/azure/skills/07-openai-service/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/azure/skills/08-machine-learning/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/azure/skills/09-storage-adls/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/azure/skills/10-networking/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/azure/skills/11-sql-cosmos/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/azure/skills/12-event-hubs/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/code-review/skills/01-automated-code-review/README.md +394 -394
- package/{tech_hub_skills → .claude}/roles/code-review/skills/02-pr-review-workflow/README.md +427 -427
- package/{tech_hub_skills → .claude}/roles/code-review/skills/03-code-quality-gates/README.md +518 -518
- package/{tech_hub_skills → .claude}/roles/code-review/skills/04-reviewer-assignment/README.md +504 -504
- package/{tech_hub_skills → .claude}/roles/code-review/skills/05-review-analytics/README.md +540 -540
- package/{tech_hub_skills → .claude}/roles/data-engineer/skills/01-lakehouse-architecture/README.md +550 -550
- package/.claude/roles/data-engineer/skills/01-lakehouse-architecture/bronze_ingestion.py +337 -0
- package/.claude/roles/data-engineer/skills/01-lakehouse-architecture/medallion_queries.sql +300 -0
- package/{tech_hub_skills → .claude}/roles/data-engineer/skills/02-etl-pipeline/README.md +580 -580
- package/{tech_hub_skills → .claude}/roles/data-engineer/skills/03-data-quality/README.md +579 -579
- package/{tech_hub_skills → .claude}/roles/data-engineer/skills/04-streaming-pipelines/README.md +608 -608
- package/{tech_hub_skills → .claude}/roles/data-engineer/skills/05-performance-optimization/README.md +547 -547
- package/{tech_hub_skills → .claude}/roles/data-governance/skills/01-data-catalog/README.md +112 -112
- package/{tech_hub_skills → .claude}/roles/data-governance/skills/02-data-lineage/README.md +129 -129
- package/{tech_hub_skills → .claude}/roles/data-governance/skills/03-data-quality-framework/README.md +182 -182
- package/{tech_hub_skills → .claude}/roles/data-governance/skills/04-access-control/README.md +39 -39
- package/{tech_hub_skills → .claude}/roles/data-governance/skills/05-master-data-management/README.md +40 -40
- package/{tech_hub_skills → .claude}/roles/data-governance/skills/06-compliance-privacy/README.md +46 -46
- package/{tech_hub_skills → .claude}/roles/data-scientist/skills/01-eda-automation/README.md +230 -230
- package/.claude/roles/data-scientist/skills/01-eda-automation/eda_generator.py +446 -0
- package/{tech_hub_skills → .claude}/roles/data-scientist/skills/02-statistical-modeling/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/data-scientist/skills/03-feature-engineering/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/data-scientist/skills/04-predictive-modeling/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/data-scientist/skills/05-customer-analytics/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/data-scientist/skills/06-campaign-analysis/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/data-scientist/skills/07-experimentation/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/data-scientist/skills/08-data-visualization/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/devops/skills/01-cicd-pipeline/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/devops/skills/02-container-orchestration/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/devops/skills/03-infrastructure-as-code/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/devops/skills/04-gitops/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/devops/skills/05-environment-management/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/devops/skills/06-automated-testing/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/devops/skills/07-release-management/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/devops/skills/08-monitoring-alerting/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/devops/skills/09-devsecops/README.md +265 -265
- package/{tech_hub_skills → .claude}/roles/finops/skills/01-cost-visibility/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/finops/skills/02-resource-tagging/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/finops/skills/03-budget-management/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/finops/skills/04-reserved-instances/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/finops/skills/05-spot-optimization/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/finops/skills/06-storage-tiering/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/finops/skills/07-compute-rightsizing/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/finops/skills/08-chargeback/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/ml-engineer/skills/01-mlops-pipeline/README.md +566 -566
- package/{tech_hub_skills → .claude}/roles/ml-engineer/skills/02-feature-engineering/README.md +655 -655
- package/{tech_hub_skills → .claude}/roles/ml-engineer/skills/03-model-training/README.md +704 -704
- package/{tech_hub_skills → .claude}/roles/ml-engineer/skills/04-model-serving/README.md +845 -845
- package/{tech_hub_skills → .claude}/roles/ml-engineer/skills/05-model-monitoring/README.md +874 -874
- package/{tech_hub_skills → .claude}/roles/mlops/skills/01-ml-pipeline-orchestration/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/mlops/skills/02-experiment-tracking/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/mlops/skills/03-model-registry/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/mlops/skills/04-feature-store/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/mlops/skills/05-model-deployment/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/mlops/skills/06-model-observability/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/mlops/skills/07-data-versioning/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/mlops/skills/08-ab-testing/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/mlops/skills/09-automated-retraining/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/01-internal-developer-platform/README.md +153 -153
- package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/02-self-service-infrastructure/README.md +57 -57
- package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/03-slo-sli-management/README.md +59 -59
- package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/04-developer-experience/README.md +57 -57
- package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/05-incident-management/README.md +73 -73
- package/{tech_hub_skills → .claude}/roles/platform-engineer/skills/06-capacity-management/README.md +59 -59
- package/{tech_hub_skills → .claude}/roles/product-designer/skills/01-requirements-discovery/README.md +407 -407
- package/{tech_hub_skills → .claude}/roles/product-designer/skills/02-user-research/README.md +382 -382
- package/{tech_hub_skills → .claude}/roles/product-designer/skills/03-brainstorming-ideation/README.md +437 -437
- package/{tech_hub_skills → .claude}/roles/product-designer/skills/04-ux-design/README.md +496 -496
- package/{tech_hub_skills → .claude}/roles/product-designer/skills/05-product-market-fit/README.md +376 -376
- package/{tech_hub_skills → .claude}/roles/product-designer/skills/06-stakeholder-management/README.md +412 -412
- package/{tech_hub_skills → .claude}/roles/security-architect/skills/01-pii-detection/README.md +319 -319
- package/{tech_hub_skills → .claude}/roles/security-architect/skills/02-threat-modeling/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/security-architect/skills/03-infrastructure-security/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/security-architect/skills/04-iam/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/security-architect/skills/05-application-security/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/security-architect/skills/06-secrets-management/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/security-architect/skills/07-security-monitoring/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/system-design/skills/01-architecture-patterns/README.md +337 -337
- package/{tech_hub_skills → .claude}/roles/system-design/skills/02-requirements-engineering/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/system-design/skills/03-scalability/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/system-design/skills/04-high-availability/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/system-design/skills/05-cost-optimization-design/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/system-design/skills/06-api-design/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/system-design/skills/07-observability-architecture/README.md +264 -264
- package/{tech_hub_skills → .claude}/roles/system-design/skills/08-process-automation/PROCESS_TEMPLATE.md +336 -336
- package/{tech_hub_skills → .claude}/roles/system-design/skills/08-process-automation/README.md +521 -521
- package/.claude/roles/system-design/skills/08-process-automation/ai_prompt_generator.py +744 -0
- package/.claude/roles/system-design/skills/08-process-automation/automation_recommender.py +688 -0
- package/.claude/roles/system-design/skills/08-process-automation/plan_generator.py +679 -0
- package/.claude/roles/system-design/skills/08-process-automation/process_analyzer.py +528 -0
- package/.claude/roles/system-design/skills/08-process-automation/process_parser.py +684 -0
- package/.claude/roles/system-design/skills/08-process-automation/role_matcher.py +615 -0
- package/.claude/skills/README.md +336 -0
- package/.claude/skills/ai-engineer.md +104 -0
- package/.claude/skills/aws.md +143 -0
- package/.claude/skills/azure.md +149 -0
- package/.claude/skills/backend-developer.md +108 -0
- package/.claude/skills/code-review.md +399 -0
- package/.claude/skills/compliance-automation.md +747 -0
- package/.claude/skills/compliance-officer.md +108 -0
- package/.claude/skills/data-engineer.md +113 -0
- package/.claude/skills/data-governance.md +102 -0
- package/.claude/skills/data-scientist.md +123 -0
- package/.claude/skills/database-admin.md +109 -0
- package/.claude/skills/devops.md +160 -0
- package/.claude/skills/docker.md +160 -0
- package/.claude/skills/enterprise-dashboard.md +613 -0
- package/.claude/skills/finops.md +184 -0
- package/.claude/skills/frontend-developer.md +108 -0
- package/.claude/skills/gcp.md +143 -0
- package/.claude/skills/ml-engineer.md +115 -0
- package/.claude/skills/mlops.md +187 -0
- package/.claude/skills/network-engineer.md +109 -0
- package/.claude/skills/optimization-advisor.md +329 -0
- package/.claude/skills/orchestrator.md +623 -0
- package/.claude/skills/platform-engineer.md +102 -0
- package/.claude/skills/process-automation.md +226 -0
- package/.claude/skills/process-changelog.md +184 -0
- package/.claude/skills/process-documentation.md +484 -0
- package/.claude/skills/process-kanban.md +324 -0
- package/.claude/skills/process-versioning.md +214 -0
- package/.claude/skills/product-designer.md +104 -0
- package/.claude/skills/project-starter.md +443 -0
- package/.claude/skills/qa-engineer.md +109 -0
- package/.claude/skills/security-architect.md +135 -0
- package/.claude/skills/sre.md +109 -0
- package/.claude/skills/system-design.md +126 -0
- package/.claude/skills/technical-writer.md +101 -0
- package/.gitattributes +2 -0
- package/GITHUB_COPILOT.md +106 -0
- package/README.md +192 -291
- package/package.json +16 -46
- package/bin/cli.js +0 -241
|
@@ -1,874 +1,874 @@
|
|
|
1
|
-
# Skill 5: Model Monitoring & Drift Detection
|
|
2
|
-
|
|
3
|
-
## 🎯 Overview
|
|
4
|
-
Implement comprehensive model monitoring with data drift, concept drift, and performance degradation detection for production ML systems.
|
|
5
|
-
|
|
6
|
-
## 🔗 Connections
|
|
7
|
-
- **MLOps**: Core monitoring and drift detection capabilities (mo-04, mo-05)
|
|
8
|
-
- **ML Engineer**: Monitors deployed models and triggers retraining (ml-04, ml-09)
|
|
9
|
-
- **Data Scientist**: Analyzes model degradation patterns (ds-08)
|
|
10
|
-
- **DevOps**: Integrates with observability platforms (do-08)
|
|
11
|
-
- **FinOps**: Monitors model performance vs cost trade-offs (fo-07)
|
|
12
|
-
- **Security Architect**: Detects anomalous predictions (sa-08)
|
|
13
|
-
- **Data Engineer**: Monitors data quality for features (de-03)
|
|
14
|
-
- **System Design**: Scalable monitoring architecture (sd-08)
|
|
15
|
-
|
|
16
|
-
## 🛠️ Tools Included
|
|
17
|
-
|
|
18
|
-
### 1. `drift_detector.py`
|
|
19
|
-
Statistical drift detection for features and predictions.
|
|
20
|
-
|
|
21
|
-
### 2. `model_monitor.py`
|
|
22
|
-
Comprehensive model performance monitoring with alerting.
|
|
23
|
-
|
|
24
|
-
### 3. `prediction_analyzer.py`
|
|
25
|
-
Prediction distribution analysis and anomaly detection.
|
|
26
|
-
|
|
27
|
-
### 4. `monitoring_dashboard.py`
|
|
28
|
-
Real-time monitoring dashboards with Grafana/Azure Monitor.
|
|
29
|
-
|
|
30
|
-
### 5. `alert_manager.py`
|
|
31
|
-
Intelligent alerting system for model degradation.
|
|
32
|
-
|
|
33
|
-
## 🏗️ Monitoring Architecture
|
|
34
|
-
|
|
35
|
-
```
|
|
36
|
-
Production Traffic → Logging → Analysis → Drift Detection → Alerting
|
|
37
|
-
↓ ↓ ↓ ↓
|
|
38
|
-
Predictions Metrics Statistics Notifications
|
|
39
|
-
Features Business Comparison Auto-retrain
|
|
40
|
-
Metadata Technical Baseline Dashboards
|
|
41
|
-
```
|
|
42
|
-
|
|
43
|
-
## 🚀 Quick Start
|
|
44
|
-
|
|
45
|
-
```python
|
|
46
|
-
from drift_detector import DriftDetector, KSTest, PSICalculator
|
|
47
|
-
from model_monitor import ModelMonitor
|
|
48
|
-
from alert_manager import AlertManager
|
|
49
|
-
|
|
50
|
-
# Initialize drift detector
|
|
51
|
-
drift_detector = DriftDetector(
|
|
52
|
-
baseline_data=training_data,
|
|
53
|
-
detection_methods=[
|
|
54
|
-
KSTest(significance_level=0.05),
|
|
55
|
-
PSICalculator(threshold=0.2)
|
|
56
|
-
]
|
|
57
|
-
)
|
|
58
|
-
|
|
59
|
-
# Initialize model monitor
|
|
60
|
-
monitor = ModelMonitor(
|
|
61
|
-
model_name="churn_predictor_v2",
|
|
62
|
-
metrics=["accuracy", "auc", "precision", "recall"],
|
|
63
|
-
alert_thresholds={
|
|
64
|
-
"accuracy": 0.85,
|
|
65
|
-
"auc": 0.90,
|
|
66
|
-
"data_drift_score": 0.2,
|
|
67
|
-
"prediction_drift_score": 0.15
|
|
68
|
-
}
|
|
69
|
-
)
|
|
70
|
-
|
|
71
|
-
# Monitor predictions in production
|
|
72
|
-
@monitor.track_predictions
|
|
73
|
-
async def predict(features):
|
|
74
|
-
prediction = model.predict(features)
|
|
75
|
-
|
|
76
|
-
# Log prediction for monitoring
|
|
77
|
-
monitor.log_prediction(
|
|
78
|
-
features=features,
|
|
79
|
-
prediction=prediction,
|
|
80
|
-
timestamp=datetime.now(),
|
|
81
|
-
metadata={"customer_id": features["customer_id"]}
|
|
82
|
-
)
|
|
83
|
-
|
|
84
|
-
return prediction
|
|
85
|
-
|
|
86
|
-
# Run drift detection (scheduled job)
|
|
87
|
-
def check_drift():
|
|
88
|
-
"""Daily drift detection job"""
|
|
89
|
-
|
|
90
|
-
# Get recent production data
|
|
91
|
-
production_data = monitor.get_recent_predictions(days=7)
|
|
92
|
-
|
|
93
|
-
# Detect feature drift
|
|
94
|
-
feature_drift = drift_detector.detect_feature_drift(
|
|
95
|
-
production_data=production_data,
|
|
96
|
-
features=feature_list
|
|
97
|
-
)
|
|
98
|
-
|
|
99
|
-
# Detect prediction drift
|
|
100
|
-
prediction_drift = drift_detector.detect_prediction_drift(
|
|
101
|
-
production_predictions=production_data["predictions"],
|
|
102
|
-
baseline_predictions=training_predictions
|
|
103
|
-
)
|
|
104
|
-
|
|
105
|
-
# Alert if drift detected
|
|
106
|
-
if feature_drift.has_drift or prediction_drift.has_drift:
|
|
107
|
-
alert_manager.send_alert(
|
|
108
|
-
severity="warning",
|
|
109
|
-
title="Model Drift Detected",
|
|
110
|
-
message=f"Drifted features: {feature_drift.drifted_features}\n"
|
|
111
|
-
f"Prediction drift score: {prediction_drift.score:.3f}",
|
|
112
|
-
actions=["Review model", "Trigger retraining"]
|
|
113
|
-
)
|
|
114
|
-
|
|
115
|
-
# Generate drift report
|
|
116
|
-
drift_report = drift_detector.generate_report(
|
|
117
|
-
feature_drift=feature_drift,
|
|
118
|
-
prediction_drift=prediction_drift
|
|
119
|
-
)
|
|
120
|
-
|
|
121
|
-
return drift_report
|
|
122
|
-
```
|
|
123
|
-
|
|
124
|
-
## 📚 Best Practices
|
|
125
|
-
|
|
126
|
-
### Drift Detection & Monitoring (MLOps Integration)
|
|
127
|
-
|
|
128
|
-
1. **Multi-Level Drift Detection**
|
|
129
|
-
- Monitor data drift (feature distribution changes)
|
|
130
|
-
- Monitor concept drift (relationship between features and target)
|
|
131
|
-
- Monitor prediction drift (output distribution changes)
|
|
132
|
-
- Monitor performance drift (metric degradation)
|
|
133
|
-
- Reference: MLOps mo-05 (Drift Detection)
|
|
134
|
-
|
|
135
|
-
2. **Statistical Drift Tests**
|
|
136
|
-
- Use Kolmogorov-Smirnov test for continuous features
|
|
137
|
-
- Use Chi-square test for categorical features
|
|
138
|
-
- Calculate Population Stability Index (PSI)
|
|
139
|
-
- Track Jensen-Shannon divergence
|
|
140
|
-
- Set appropriate significance levels
|
|
141
|
-
- Reference: MLOps mo-05, Data Scientist ds-08
|
|
142
|
-
|
|
143
|
-
3. **Baseline Comparison**
|
|
144
|
-
- Maintain reference datasets (training data)
|
|
145
|
-
- Update baselines periodically
|
|
146
|
-
- Track distribution shifts over time
|
|
147
|
-
- Document baseline versions
|
|
148
|
-
- Reference: MLOps mo-05, mo-06 (Lineage)
|
|
149
|
-
|
|
150
|
-
4. **Monitoring Cadence**
|
|
151
|
-
- Real-time monitoring for critical models
|
|
152
|
-
- Hourly/daily drift checks for most models
|
|
153
|
-
- Weekly deep-dive analysis
|
|
154
|
-
- Monthly model review
|
|
155
|
-
- Reference: MLOps mo-04 (Monitoring)
|
|
156
|
-
|
|
157
|
-
5. **Comprehensive Model Metrics**
|
|
158
|
-
- Track business metrics (revenue impact, user engagement)
|
|
159
|
-
- Monitor technical metrics (accuracy, AUC, F1)
|
|
160
|
-
- Track operational metrics (latency, throughput)
|
|
161
|
-
- Monitor data quality metrics
|
|
162
|
-
- Reference: MLOps mo-04
|
|
163
|
-
|
|
164
|
-
### DevOps Integration for Monitoring
|
|
165
|
-
|
|
166
|
-
6. **Observability Integration**
|
|
167
|
-
- Integrate with Azure Monitor / App Insights
|
|
168
|
-
- Use OpenTelemetry for instrumentation
|
|
169
|
-
- Centralize logs and metrics
|
|
170
|
-
- Implement distributed tracing
|
|
171
|
-
- Reference: DevOps do-08 (Monitoring)
|
|
172
|
-
|
|
173
|
-
7. **Alerting & Incident Response**
|
|
174
|
-
- Set up intelligent alerting (avoid alert fatigue)
|
|
175
|
-
- Define alert severity levels
|
|
176
|
-
- Implement escalation policies
|
|
177
|
-
- Automate incident response
|
|
178
|
-
- Reference: DevOps do-08
|
|
179
|
-
|
|
180
|
-
8. **Monitoring Dashboards**
|
|
181
|
-
- Build real-time monitoring dashboards
|
|
182
|
-
- Visualize drift metrics over time
|
|
183
|
-
- Track model performance trends
|
|
184
|
-
- Enable team collaboration
|
|
185
|
-
- Reference: DevOps do-08, MLOps mo-04
|
|
186
|
-
|
|
187
|
-
### Cost Optimization for Monitoring (FinOps Integration)
|
|
188
|
-
|
|
189
|
-
9. **Efficient Logging Strategy**
|
|
190
|
-
- Sample predictions for monitoring (not 100%)
|
|
191
|
-
- Implement tiered logging (critical vs routine)
|
|
192
|
-
- Compress and archive old logs
|
|
193
|
-
- Monitor log storage costs
|
|
194
|
-
- Reference: FinOps fo-05 (Storage), fo-07 (AI/ML Cost)
|
|
195
|
-
|
|
196
|
-
10. **Optimize Monitoring Compute**
|
|
197
|
-
- Run drift detection on scheduled batches
|
|
198
|
-
- Use serverless for event-driven monitoring
|
|
199
|
-
- Right-size monitoring infrastructure
|
|
200
|
-
- Cache expensive drift calculations
|
|
201
|
-
- Reference: FinOps fo-06 (Compute Optimization)
|
|
202
|
-
|
|
203
|
-
11. **Monitoring Cost Tracking**
|
|
204
|
-
- Track monitoring infrastructure costs
|
|
205
|
-
- Monitor log ingestion costs
|
|
206
|
-
- Optimize retention policies
|
|
207
|
-
- Balance cost vs visibility
|
|
208
|
-
- Reference: FinOps fo-01 (Cost Monitoring)
|
|
209
|
-
|
|
210
|
-
### Automated Response & Retraining
|
|
211
|
-
|
|
212
|
-
12. **Automated Drift Response**
|
|
213
|
-
- Auto-alert when drift exceeds thresholds
|
|
214
|
-
- Trigger model investigation workflows
|
|
215
|
-
- Initiate automated retraining pipelines
|
|
216
|
-
- Implement automatic rollback if needed
|
|
217
|
-
- Reference: MLOps mo-05, ML Engineer ml-09
|
|
218
|
-
|
|
219
|
-
13. **Model Retraining Triggers**
|
|
220
|
-
- Performance degradation thresholds
|
|
221
|
-
- Significant data drift detected
|
|
222
|
-
- Concept drift indicators
|
|
223
|
-
- Scheduled periodic retraining
|
|
224
|
-
- Reference: ML Engineer ml-09 (Continuous Retraining)
|
|
225
|
-
|
|
226
|
-
### Data Quality Monitoring
|
|
227
|
-
|
|
228
|
-
14. **Feature Quality Checks**
|
|
229
|
-
- Monitor feature completeness
|
|
230
|
-
- Detect feature value range violations
|
|
231
|
-
- Track feature correlation changes
|
|
232
|
-
- Alert on missing features
|
|
233
|
-
- Reference: Data Engineer de-03 (Data Quality)
|
|
234
|
-
|
|
235
|
-
15. **Input Validation Monitoring**
|
|
236
|
-
- Track invalid input rates
|
|
237
|
-
- Monitor schema violations
|
|
238
|
-
- Detect data type mismatches
|
|
239
|
-
- Alert on data quality issues
|
|
240
|
-
- Reference: Data Engineer de-03
|
|
241
|
-
|
|
242
|
-
### Security & Anomaly Detection
|
|
243
|
-
|
|
244
|
-
16. **Prediction Anomaly Detection**
|
|
245
|
-
- Detect unusual prediction patterns
|
|
246
|
-
- Identify potential model attacks
|
|
247
|
-
- Monitor for adversarial inputs
|
|
248
|
-
- Alert on suspicious behavior
|
|
249
|
-
- Reference: Security Architect sa-08 (LLM Security)
|
|
250
|
-
|
|
251
|
-
17. **Model Behavior Monitoring**
|
|
252
|
-
- Track prediction confidence scores
|
|
253
|
-
- Monitor prediction uncertainty
|
|
254
|
-
- Detect model degradation patterns
|
|
255
|
-
- Identify edge cases
|
|
256
|
-
- Reference: MLOps mo-04, Security Architect sa-08
|
|
257
|
-
|
|
258
|
-
### Azure-Specific Best Practices
|
|
259
|
-
|
|
260
|
-
18. **Azure Monitor Integration**
|
|
261
|
-
- Use Azure Monitor for metrics
|
|
262
|
-
- Enable Application Insights
|
|
263
|
-
- Set up Log Analytics workspaces
|
|
264
|
-
- Configure metric alerts
|
|
265
|
-
- Reference: Azure az-04 (AI/ML Services)
|
|
266
|
-
|
|
267
|
-
19. **Azure ML Model Monitoring**
|
|
268
|
-
- Enable model data collection
|
|
269
|
-
- Configure data drift detection
|
|
270
|
-
- Use built-in monitoring dashboards
|
|
271
|
-
- Integrate with Azure Monitor
|
|
272
|
-
- Reference: Azure az-04
|
|
273
|
-
|
|
274
|
-
20. **Cost-Effective Monitoring**
|
|
275
|
-
- Use log sampling for high-volume models
|
|
276
|
-
- Implement retention policies
|
|
277
|
-
- Archive to cold storage
|
|
278
|
-
- Monitor monitoring costs
|
|
279
|
-
- Reference: Azure az-04, FinOps fo-05
|
|
280
|
-
|
|
281
|
-
## 💰 Cost Optimization Examples
|
|
282
|
-
|
|
283
|
-
### Intelligent Prediction Logging
|
|
284
|
-
```python
|
|
285
|
-
from model_monitor import SmartLogger
|
|
286
|
-
from finops_tracker import MonitoringCostTracker
|
|
287
|
-
import random
|
|
288
|
-
|
|
289
|
-
class CostOptimizedMonitor:
|
|
290
|
-
"""Cost-optimized prediction logging with sampling"""
|
|
291
|
-
|
|
292
|
-
def __init__(self, model_name: str, sampling_rate: float = 0.1):
|
|
293
|
-
self.model_name = model_name
|
|
294
|
-
self.sampling_rate = sampling_rate # Log 10% of predictions
|
|
295
|
-
self.logger = SmartLogger(model_name)
|
|
296
|
-
self.cost_tracker = MonitoringCostTracker()
|
|
297
|
-
|
|
298
|
-
# Always log certain predictions
|
|
299
|
-
self.always_log_conditions = [
|
|
300
|
-
lambda pred: pred["confidence"] < 0.5, # Low confidence
|
|
301
|
-
lambda pred: pred["value"] > 0.9, # High risk
|
|
302
|
-
lambda pred: pred.get("is_edge_case", False) # Edge cases
|
|
303
|
-
]
|
|
304
|
-
|
|
305
|
-
def should_log_prediction(self, prediction: dict) -> bool:
|
|
306
|
-
"""Intelligent sampling decision"""
|
|
307
|
-
|
|
308
|
-
# Always log important predictions
|
|
309
|
-
for condition in self.always_log_conditions:
|
|
310
|
-
if condition(prediction):
|
|
311
|
-
return True
|
|
312
|
-
|
|
313
|
-
        # Sample remaining predictions
        return random.random() < self.sampling_rate

    async def log_prediction(
        self,
        features: dict,
        prediction: dict,
        metadata: dict
    ):
        """Log prediction with cost optimization"""

        if not self.should_log_prediction(prediction):
            self.cost_tracker.record_skipped_log()
            return

        # Log to monitoring system
        with self.cost_tracker.track_logging_cost():
            await self.logger.log(
                timestamp=datetime.now(),
                features=features,
                prediction=prediction,
                metadata=metadata,
                # Compress large payloads
                compress=len(str(features)) > 1000
            )

            self.cost_tracker.record_logged_prediction()

    def get_cost_report(self):
        """Monitoring cost analysis"""
        report = self.cost_tracker.generate_report()

        print("Monitoring Cost Report:")
        print(f"Total predictions: {report.total_predictions:,}")
        print(f"Logged predictions: {report.logged_predictions:,}")
        print(f"Sampling rate: {report.actual_sampling_rate:.1%}")
        print(f"Log storage cost: ${report.storage_cost:.2f}")
        print(f"Log ingestion cost: ${report.ingestion_cost:.2f}")
        print(f"Total monitoring cost: ${report.total_cost:.2f}")
        print(f"Cost per logged prediction: ${report.cost_per_log:.4f}")
        print(f"Savings from sampling: ${report.sampling_savings:.2f}")

        return report

# Usage
monitor = CostOptimizedMonitor(
    model_name="churn_predictor_v2",
    sampling_rate=0.1  # Log 10% + important predictions
)

# In production
for prediction_request in prediction_stream:
    features = prediction_request.features
    prediction = model.predict(features)

    await monitor.log_prediction(
        features=features,
        prediction=prediction,
        metadata={"customer_id": prediction_request.customer_id}
    )

# Monitor costs
monthly_report = monitor.get_cost_report()

# Expected results:
# - 90% reduction in logging costs
# - Still captures all important events
# - Sufficient data for drift detection
```
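The expected cost reduction follows from simple arithmetic: the effective logged fraction is the always-log share plus the sampled share of the remainder. A minimal sketch — the 2% always-log share is an assumed, illustrative figure, not measured from the example above:

```python
def effective_log_fraction(always_log_share: float, sampling_rate: float) -> float:
    """Fraction of predictions actually logged under intelligent sampling."""
    # Important predictions are always logged; the rest are sampled.
    return always_log_share + (1 - always_log_share) * sampling_rate

# With an assumed 2% always-log share and the 10% sampling rate above,
# roughly 11.8% of predictions are logged, i.e. ~88% fewer log writes.
fraction = effective_log_fraction(0.02, 0.1)
```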

### Efficient Drift Detection
```python
from drift_detector import BatchDriftDetector
from scipy import stats
import numpy as np
import pandas as pd
from finops_tracker import DriftCostTracker

class CostOptimizedDriftDetection:
    """Efficient drift detection with cost optimization"""

    def __init__(self, baseline_data: pd.DataFrame):
        self.baseline_data = baseline_data
        self.detector = BatchDriftDetector()
        self.cost_tracker = DriftCostTracker()

        # Pre-compute baseline statistics (one-time cost)
        self.baseline_stats = self._compute_baseline_stats()

    def _compute_baseline_stats(self):
        """Pre-compute baseline statistics for efficiency"""
        baseline_stats = {}  # local name avoids shadowing scipy's `stats`

        for column in self.baseline_data.columns:
            if self.baseline_data[column].dtype in ['float64', 'int64']:
                baseline_stats[column] = {
                    'mean': self.baseline_data[column].mean(),
                    'std': self.baseline_data[column].std(),
                    'min': self.baseline_data[column].min(),
                    'max': self.baseline_data[column].max(),
                    'quantiles': self.baseline_data[column].quantile([0.25, 0.5, 0.75]).to_dict(),
                    'distribution': self.baseline_data[column].values
                }
            else:
                baseline_stats[column] = {
                    'value_counts': self.baseline_data[column].value_counts().to_dict(),
                    'unique_values': set(self.baseline_data[column].unique())
                }

        return baseline_stats

    def detect_drift_efficient(
        self,
        production_data: pd.DataFrame,
        features: list = None,
        method: str = "ks_test"
    ) -> dict:
        """Efficient drift detection using pre-computed statistics"""

        with self.cost_tracker.track_drift_detection():
            drift_results = {}
            features = features or production_data.columns

            for feature in features:
                if feature not in self.baseline_stats:
                    continue
                # KS and PSI apply to numeric features only; categorical
                # features need a frequency-based test instead
                if 'distribution' not in self.baseline_stats[feature]:
                    continue

                baseline_values = self.baseline_stats[feature]['distribution']
                production_values = production_data[feature].values

                # Use cached baseline statistics
                if method == "ks_test":
                    # Kolmogorov-Smirnov test (efficient)
                    statistic, p_value = stats.ks_2samp(
                        baseline_values,
                        production_values
                    )
                    has_drift = p_value < 0.05

                elif method == "psi":
                    # Population Stability Index (very efficient)
                    psi_score = self._calculate_psi_efficient(
                        baseline_values,
                        production_values
                    )
                    has_drift = psi_score > 0.2
                    statistic = psi_score
                    p_value = None

                drift_results[feature] = {
                    'has_drift': has_drift,
                    'statistic': statistic,
                    'p_value': p_value,
                    'drift_magnitude': abs(
                        production_values.mean() - baseline_values.mean()
                    ) / baseline_values.std() if baseline_values.std() > 0 else 0
                }

        # Cost report
        cost_report = self.cost_tracker.get_detection_cost()
        print(f"Drift detection cost: ${cost_report.cost:.4f}")
        print(f"Detection time: {cost_report.duration_ms:.2f}ms")

        return {
            'drift_results': drift_results,
            'drifted_features': [f for f, r in drift_results.items() if r['has_drift']],
            'drift_score': np.mean([r['drift_magnitude'] for r in drift_results.values()]),
            'cost': cost_report.cost
        }

    def _calculate_psi_efficient(
        self,
        baseline: np.ndarray,
        production: np.ndarray,
        bins: int = 10
    ) -> float:
        """Efficient PSI calculation"""

        # Create bins from baseline
        bin_edges = np.percentile(baseline, np.linspace(0, 100, bins + 1))
        bin_edges[0] = -np.inf
        bin_edges[-1] = np.inf

        # Calculate distributions
        baseline_dist = np.histogram(baseline, bins=bin_edges)[0] / len(baseline)
        production_dist = np.histogram(production, bins=bin_edges)[0] / len(production)

        # PSI calculation
        psi = np.sum(
            (production_dist - baseline_dist) * np.log(
                (production_dist + 1e-10) / (baseline_dist + 1e-10)
            )
        )

        return psi

    def incremental_drift_monitoring(
        self,
        data_stream,
        window_size: int = 1000,
        check_frequency: int = 100
    ):
        """Incremental drift detection for streaming data"""

        buffer = []
        check_count = 0

        for record in data_stream:
            buffer.append(record)

            # Check drift every N records (not on every record past N)
            if len(buffer) % check_frequency == 0:
                check_count += 1

                # Convert to DataFrame
                production_sample = pd.DataFrame(buffer[-window_size:])

                # Run efficient drift detection
                drift_result = self.detect_drift_efficient(
                    production_data=production_sample,
                    method="psi"  # Faster than KS test
                )

                # Alert if drift detected
                if drift_result['drifted_features']:
                    print(f"\nDrift detected at check #{check_count}:")
                    print(f"Drifted features: {drift_result['drifted_features']}")
                    print(f"Drift score: {drift_result['drift_score']:.3f}")

                # Clear old records from buffer
                buffer = buffer[-window_size:]

# Usage
drift_detector = CostOptimizedDriftDetection(
    baseline_data=training_data
)

# Batch drift detection (daily job)
production_data = get_recent_predictions(days=7)
drift_result = drift_detector.detect_drift_efficient(
    production_data=production_data,
    method="psi"  # 10x faster than KS test
)

print(f"Drift detection cost: ${drift_result['cost']:.4f}")
print(f"Drifted features: {drift_result['drifted_features']}")

# Streaming drift detection
drift_detector.incremental_drift_monitoring(
    data_stream=prediction_stream,
    window_size=1000,
    check_frequency=100
)
```
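`detect_drift_efficient` only covers numeric features, even though `_compute_baseline_stats` also caches categorical value counts. The usual companion check is a chi-square test on category frequencies; a self-contained sketch under that assumption — the `categorical_drift` helper is illustrative and not part of the `drift_detector` package:

```python
from collections import Counter

import numpy as np
from scipy.stats import chi2_contingency

def categorical_drift(baseline, production, alpha=0.05):
    """Chi-square two-sample test on category frequencies."""
    cats = sorted(set(baseline) | set(production))
    b, p = Counter(baseline), Counter(production)
    # 2 x k contingency table: one row per sample, one column per category
    table = np.array([[b.get(c, 0) for c in cats],
                      [p.get(c, 0) for c in cats]])
    statistic, p_value, _, _ = chi2_contingency(table)
    return {"has_drift": p_value < alpha,
            "statistic": statistic,
            "p_value": p_value}
```

This slots into the per-feature loop as the `else` branch for features whose baseline stats hold `value_counts` rather than `distribution`.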

### Automated Drift Response
```python
from alert_manager import AlertManager, AlertSeverity
from model_retrainer import AutoRetrainer
from drift_detector import DriftAnalyzer

class AutomatedDriftResponse:
    """Automated drift detection and response system"""

    def __init__(self, model_name: str):
        self.model_name = model_name
        self.drift_analyzer = DriftAnalyzer(model_name)
        self.alert_manager = AlertManager()
        self.retrainer = AutoRetrainer(model_name)

        # Configure drift response thresholds
        self.thresholds = {
            "minor_drift": 0.1,     # Warning alert
            "moderate_drift": 0.2,  # Trigger investigation
            "severe_drift": 0.3     # Auto-retrain
        }

    async def monitor_and_respond(self):
        """Continuous drift monitoring with automated response"""

        # Get recent production data
        production_data = await self.get_production_data(days=7)

        # Detect drift
        drift_result = self.drift_analyzer.analyze_drift(
            production_data=production_data,
            features=feature_list
        )

        drift_score = drift_result['drift_score']
        drifted_features = drift_result['drifted_features']

        # Automated response based on severity
        if drift_score >= self.thresholds["severe_drift"]:
            # Severe drift - auto-retrain
            await self._handle_severe_drift(drift_result)

        elif drift_score >= self.thresholds["moderate_drift"]:
            # Moderate drift - trigger investigation
            await self._handle_moderate_drift(drift_result)

        elif drift_score >= self.thresholds["minor_drift"]:
            # Minor drift - warning alert
            await self._handle_minor_drift(drift_result)

        return drift_result

    async def _handle_severe_drift(self, drift_result):
        """Handle severe drift with auto-retraining"""

        # Send critical alert
        await self.alert_manager.send_alert(
            severity=AlertSeverity.CRITICAL,
            title="Severe Drift Detected - Auto-Retraining Initiated",
            message=f"Model: {self.model_name}\n"
                    f"Drift score: {drift_result['drift_score']:.3f}\n"
                    f"Drifted features: {drift_result['drifted_features']}\n"
                    f"Action: Automatic retraining initiated",
            channels=["slack", "email", "pagerduty"]
        )

        # Trigger automated retraining
        retrain_job = await self.retrainer.trigger_retraining(
            reason="severe_drift_detected",
            drift_score=drift_result['drift_score'],
            drifted_features=drift_result['drifted_features'],
            priority="high"
        )

        print(f"Retraining job initiated: {retrain_job.id}")

    async def _handle_moderate_drift(self, drift_result):
        """Handle moderate drift with investigation workflow"""

        # Send warning alert
        await self.alert_manager.send_alert(
            severity=AlertSeverity.WARNING,
            title="Moderate Drift Detected - Investigation Required",
            message=f"Model: {self.model_name}\n"
                    f"Drift score: {drift_result['drift_score']:.3f}\n"
                    f"Drifted features: {drift_result['drifted_features']}\n"
                    f"Action: Please investigate and determine if retraining is needed",
            channels=["slack", "email"],
            actions=[
                {"label": "Trigger Retraining", "action": "retrain"},
                {"label": "Acknowledge", "action": "ack"},
                {"label": "View Dashboard", "action": "dashboard"}
            ]
        )

        # Create investigation ticket
        await self.alert_manager.create_investigation_ticket(
            title=f"Model Drift Investigation - {self.model_name}",
            description=drift_result,
            assignee="ml-ops-team"
        )

    async def _handle_minor_drift(self, drift_result):
        """Handle minor drift with monitoring"""

        # Send info alert
        await self.alert_manager.send_alert(
            severity=AlertSeverity.INFO,
            title="Minor Drift Detected",
            message=f"Model: {self.model_name}\n"
                    f"Drift score: {drift_result['drift_score']:.3f}\n"
                    f"Drifted features: {drift_result['drifted_features']}\n"
                    f"Action: Continuing to monitor",
            channels=["slack"]
        )

        # Log to dashboard
        await self.drift_analyzer.log_drift_event(drift_result)

# Usage (scheduled job)
drift_responder = AutomatedDriftResponse(model_name="churn_predictor_v2")

# Run daily
async def daily_drift_check():
    result = await drift_responder.monitor_and_respond()
    print(f"Drift check completed. Score: {result['drift_score']:.3f}")

# Schedule with APScheduler or similar
from apscheduler.schedulers.asyncio import AsyncIOScheduler

scheduler = AsyncIOScheduler()
scheduler.add_job(daily_drift_check, 'cron', hour=2)  # Run at 2 AM daily
scheduler.start()
```
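The branch order in `monitor_and_respond` matters: thresholds must be checked from most to least severe, or a score of 0.35 would fall through to the minor branch. The same decision, isolated as a small pure function (illustrative, reusing the threshold values above):

```python
def classify_drift(score: float) -> str:
    """Map a drift score to a response tier; most severe checked first."""
    thresholds = {"severe": 0.3, "moderate": 0.2, "minor": 0.1}
    for level, cutoff in thresholds.items():  # insertion order: severe first
        if score >= cutoff:
            return level
    return "none"
```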

## 🚀 Monitoring Dashboards

### Grafana Dashboard Configuration
```python
# monitoring_dashboard.py
from grafana_api import GrafanaDashboard
from azure.monitor import MetricsClient

def create_model_monitoring_dashboard(model_name: str):
    """Create comprehensive model monitoring dashboard"""

    dashboard = GrafanaDashboard(
        title=f"Model Monitoring - {model_name}",
        refresh="30s"
    )

    # Row 1: Model Performance Metrics
    dashboard.add_row("Model Performance")
    dashboard.add_panel(
        title="Model Accuracy (7d rolling)",
        query=f"avg(model_accuracy{{model_name='{model_name}'}})",
        panel_type="graph",
        threshold=0.85,
        alert_condition="below"
    )
    dashboard.add_panel(
        title="AUC Score",
        query=f"avg(model_auc{{model_name='{model_name}'}})",
        panel_type="gauge",
        thresholds=[0.80, 0.90, 0.95]
    )

    # Row 2: Drift Metrics
    dashboard.add_row("Drift Detection")
    dashboard.add_panel(
        title="Feature Drift Score",
        query=f"max(feature_drift_score{{model_name='{model_name}'}})",
        panel_type="graph",
        threshold=0.2,
        alert_condition="above"
    )
    dashboard.add_panel(
        title="Prediction Drift Score",
        query=f"max(prediction_drift_score{{model_name='{model_name}'}})",
        panel_type="graph",
        threshold=0.15
    )
    dashboard.add_panel(
        title="Drifted Features Count",
        query=f"count(drifted_features{{model_name='{model_name}'}})",
        panel_type="stat"
    )

    # Row 3: Operational Metrics
    dashboard.add_row("Operational Metrics")
    dashboard.add_panel(
        title="Prediction Latency (p95)",
        query=f"histogram_quantile(0.95, prediction_latency{{model_name='{model_name}'}})",
        panel_type="graph",
        threshold=100,
        unit="ms"
    )
    dashboard.add_panel(
        title="Requests per Minute",
        query=f"rate(prediction_requests{{model_name='{model_name}'}}[1m])",
        panel_type="graph"
    )
    dashboard.add_panel(
        title="Error Rate",
        query=f"rate(prediction_errors{{model_name='{model_name}'}}[5m])",
        panel_type="graph",
        threshold=0.01,
        alert_condition="above"
    )

    # Row 4: Data Quality
    dashboard.add_row("Data Quality")
    dashboard.add_panel(
        title="Feature Completeness",
        query=f"avg(feature_completeness{{model_name='{model_name}'}})",
        panel_type="gauge",
        thresholds=[0.95, 0.99, 1.0]
    )
    dashboard.add_panel(
        title="Invalid Inputs Rate",
        query=f"rate(invalid_inputs{{model_name='{model_name}'}}[5m])",
        panel_type="graph"
    )

    # Row 5: Cost Metrics
    dashboard.add_row("Cost & Resource Usage")
    dashboard.add_panel(
        title="Daily Serving Cost",
        query=f"sum(serving_cost_usd{{model_name='{model_name}'}})",
        panel_type="stat",
        unit="currencyUSD"
    )
    dashboard.add_panel(
        title="Cost per 1000 Predictions",
        query=f"serving_cost_usd{{model_name='{model_name}'}} / prediction_count * 1000",
        panel_type="graph",
        unit="currencyUSD"
    )

    # Save dashboard
    dashboard.save()
    return dashboard

# Create dashboard
dashboard = create_model_monitoring_dashboard("churn_predictor_v2")
print(f"Dashboard created: {dashboard.url}")
```

## 📊 Metrics & Monitoring

| Metric Category | Metric | Target | Tool |
|-----------------|--------|--------|------|
| **Drift Detection** | Feature drift score | <0.2 | Drift detector |
| | Prediction drift score | <0.15 | Drift detector |
| | Drifted features count | <3 | KS/PSI tests |
| | Drift check frequency | Daily | Scheduler |
| **Model Performance** | Production accuracy | >0.85 | Model monitor |
| | Production AUC | >0.90 | Model monitor |
| | Performance vs baseline | >95% | Comparison |
| **Data Quality** | Feature completeness | >99% | Quality checker |
| | Invalid input rate | <1% | Validator |
| | Missing feature rate | <0.1% | Monitor |
| **Monitoring Costs** | Log storage cost | <$100/mo | FinOps tracker |
| | Monitoring compute | <$50/mo | Cost tracker |
| | Alert notification cost | <$20/mo | Alert manager |
| **Operational** | Alert response time | <30 min | SLA monitor |
| | False alert rate | <5% | Alert tuning |
| | Dashboard load time | <2s | Performance |
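The targets in the table are easy to enforce in code; a hedged sketch of a threshold checker (the metric names and limits mirror a few rows of the table, the helper itself is illustrative):

```python
# (limit, kind): "max" means the value must stay below the limit,
# "min" means it must stay above it.
TARGETS = {
    "feature_drift_score": (0.2, "max"),
    "prediction_drift_score": (0.15, "max"),
    "production_accuracy": (0.85, "min"),
    "production_auc": (0.90, "min"),
    "invalid_input_rate": (0.01, "max"),
}

def breached(metrics: dict) -> list:
    """Return the names of metrics that are outside their target."""
    out = []
    for name, value in metrics.items():
        if name not in TARGETS:
            continue
        limit, kind = TARGETS[name]
        if (kind == "max" and value > limit) or (kind == "min" and value < limit):
            out.append(name)
    return out
```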

## 🔄 Integration Workflow

### End-to-End Monitoring Pipeline
```
1. Production Predictions (ml-04)
        ↓
2. Intelligent Logging (ml-05)
        ↓
3. Data Aggregation (ml-05)
        ↓
4. Drift Detection (mo-05)
        ↓
5. Performance Monitoring (mo-04)
        ↓
6. Anomaly Detection (ml-05)
        ↓
7. Alert Generation (ml-05)
        ↓
8. Dashboard Updates (do-08)
        ↓
9. Investigation Workflow (ml-05)
        ↓
10. Auto-Retraining Trigger (ml-09)
        ↓
11. Cost Tracking (fo-07)
```

## 🎯 Quick Wins

1. **Enable prediction logging** - Visibility into production behavior
2. **Set up drift detection** - Early warning for model degradation
3. **Create monitoring dashboards** - Real-time visibility
4. **Implement intelligent sampling** - 90% reduction in logging costs
5. **Configure performance alerts** - Proactive issue detection
6. **Use PSI for drift** - 10x faster than KS test
7. **Pre-compute baseline stats** - Faster drift detection
8. **Set up automated alerts** - Faster incident response
9. **Track monitoring costs** - Optimize monitoring spend
10. **Implement auto-retraining** - Automated drift response

# Skill 5: Model Monitoring & Drift Detection

## 🎯 Overview
Implement comprehensive model monitoring with data drift, concept drift, and performance degradation detection for production ML systems.

## 🔗 Connections
- **MLOps**: Core monitoring and drift detection capabilities (mo-04, mo-05)
- **ML Engineer**: Monitors deployed models and triggers retraining (ml-04, ml-09)
- **Data Scientist**: Analyzes model degradation patterns (ds-08)
- **DevOps**: Integrates with observability platforms (do-08)
- **FinOps**: Monitors model performance vs cost trade-offs (fo-07)
- **Security Architect**: Detects anomalous predictions (sa-08)
- **Data Engineer**: Monitors data quality for features (de-03)
- **System Design**: Scalable monitoring architecture (sd-08)

## 🛠️ Tools Included

### 1. `drift_detector.py`
Statistical drift detection for features and predictions.

### 2. `model_monitor.py`
Comprehensive model performance monitoring with alerting.

### 3. `prediction_analyzer.py`
Prediction distribution analysis and anomaly detection.

### 4. `monitoring_dashboard.py`
Real-time monitoring dashboards with Grafana/Azure Monitor.

### 5. `alert_manager.py`
Intelligent alerting system for model degradation.

## 🏗️ Monitoring Architecture

```
Production Traffic → Logging → Analysis → Drift Detection → Alerting
                        ↓          ↓            ↓               ↓
                  Predictions   Metrics    Statistics    Notifications
                  Features      Business   Comparison    Auto-retrain
                  Metadata      Technical  Baseline      Dashboards
```

## 🚀 Quick Start

```python
from datetime import datetime

from drift_detector import DriftDetector, KSTest, PSICalculator
from model_monitor import ModelMonitor
from alert_manager import AlertManager

# Initialize drift detector
drift_detector = DriftDetector(
    baseline_data=training_data,
    detection_methods=[
        KSTest(significance_level=0.05),
        PSICalculator(threshold=0.2)
    ]
)

# Initialize model monitor
monitor = ModelMonitor(
    model_name="churn_predictor_v2",
    metrics=["accuracy", "auc", "precision", "recall"],
    alert_thresholds={
        "accuracy": 0.85,
        "auc": 0.90,
        "data_drift_score": 0.2,
        "prediction_drift_score": 0.15
    }
)

# Alert manager for drift notifications
alert_manager = AlertManager()

# Monitor predictions in production
@monitor.track_predictions
async def predict(features):
    prediction = model.predict(features)

    # Log prediction for monitoring
    monitor.log_prediction(
        features=features,
        prediction=prediction,
        timestamp=datetime.now(),
        metadata={"customer_id": features["customer_id"]}
    )

    return prediction

# Run drift detection (scheduled job)
def check_drift():
    """Daily drift detection job"""

    # Get recent production data
    production_data = monitor.get_recent_predictions(days=7)

    # Detect feature drift
    feature_drift = drift_detector.detect_feature_drift(
        production_data=production_data,
        features=feature_list
    )

    # Detect prediction drift
    prediction_drift = drift_detector.detect_prediction_drift(
        production_predictions=production_data["predictions"],
        baseline_predictions=training_predictions
    )

    # Alert if drift detected
    if feature_drift.has_drift or prediction_drift.has_drift:
        alert_manager.send_alert(
            severity="warning",
            title="Model Drift Detected",
            message=f"Drifted features: {feature_drift.drifted_features}\n"
                    f"Prediction drift score: {prediction_drift.score:.3f}",
            actions=["Review model", "Trigger retraining"]
        )

    # Generate drift report
    drift_report = drift_detector.generate_report(
        feature_drift=feature_drift,
        prediction_drift=prediction_drift
    )

    return drift_report
```

## 📚 Best Practices

### Drift Detection & Monitoring (MLOps Integration)

1. **Multi-Level Drift Detection**
   - Monitor data drift (feature distribution changes)
   - Monitor concept drift (relationship between features and target)
   - Monitor prediction drift (output distribution changes)
   - Monitor performance drift (metric degradation)
   - Reference: MLOps mo-05 (Drift Detection)

2. **Statistical Drift Tests**
   - Use Kolmogorov-Smirnov test for continuous features
   - Use Chi-square test for categorical features
   - Calculate Population Stability Index (PSI)
   - Track Jensen-Shannon divergence
   - Set appropriate significance levels
   - Reference: MLOps mo-05, Data Scientist ds-08

3. **Baseline Comparison**
   - Maintain reference datasets (training data)
   - Update baselines periodically
   - Track distribution shifts over time
   - Document baseline versions
   - Reference: MLOps mo-05, mo-06 (Lineage)

4. **Monitoring Cadence**
   - Real-time monitoring for critical models
   - Hourly/daily drift checks for most models
   - Weekly deep-dive analysis
   - Monthly model review
   - Reference: MLOps mo-04 (Monitoring)

5. **Comprehensive Model Metrics**
   - Track business metrics (revenue impact, user engagement)
   - Monitor technical metrics (accuracy, AUC, F1)
   - Track operational metrics (latency, throughput)
   - Monitor data quality metrics
   - Reference: MLOps mo-04

### DevOps Integration for Monitoring

6. **Observability Integration**
   - Integrate with Azure Monitor / App Insights
   - Use OpenTelemetry for instrumentation
   - Centralize logs and metrics
   - Implement distributed tracing
   - Reference: DevOps do-08 (Monitoring)

7. **Alerting & Incident Response**
   - Set up intelligent alerting (avoid alert fatigue)
   - Define alert severity levels
   - Implement escalation policies
   - Automate incident response
   - Reference: DevOps do-08

8. **Monitoring Dashboards**
   - Build real-time monitoring dashboards
   - Visualize drift metrics over time
   - Track model performance trends
   - Enable team collaboration
   - Reference: DevOps do-08, MLOps mo-04

### Cost Optimization for Monitoring (FinOps Integration)

9. **Efficient Logging Strategy**
   - Sample predictions for monitoring (not 100%)
   - Implement tiered logging (critical vs routine)
   - Compress and archive old logs
   - Monitor log storage costs
   - Reference: FinOps fo-05 (Storage), fo-07 (AI/ML Cost)

10. **Optimize Monitoring Compute**
    - Run drift detection on scheduled batches
    - Use serverless for event-driven monitoring
    - Right-size monitoring infrastructure
    - Cache expensive drift calculations
    - Reference: FinOps fo-06 (Compute Optimization)

11. **Monitoring Cost Tracking**
    - Track monitoring infrastructure costs
    - Monitor log ingestion costs
    - Optimize retention policies
    - Balance cost vs visibility
    - Reference: FinOps fo-01 (Cost Monitoring)

### Automated Response & Retraining

12. **Automated Drift Response**
    - Auto-alert when drift exceeds thresholds
    - Trigger model investigation workflows
    - Initiate automated retraining pipelines
    - Implement automatic rollback if needed
    - Reference: MLOps mo-05, ML Engineer ml-09

13. **Model Retraining Triggers**
    - Performance degradation thresholds
    - Significant data drift detected
    - Concept drift indicators
    - Scheduled periodic retraining
    - Reference: ML Engineer ml-09 (Continuous Retraining)

### Data Quality Monitoring

14. **Feature Quality Checks**
    - Monitor feature completeness
    - Detect feature value range violations
    - Track feature correlation changes
    - Alert on missing features
    - Reference: Data Engineer de-03 (Data Quality)

15. **Input Validation Monitoring**
    - Track invalid input rates
    - Monitor schema violations
    - Detect data type mismatches
    - Alert on data quality issues
    - Reference: Data Engineer de-03

### Security & Anomaly Detection

16. **Prediction Anomaly Detection**
    - Detect unusual prediction patterns
    - Identify potential model attacks
    - Monitor for adversarial inputs
    - Alert on suspicious behavior
    - Reference: Security Architect sa-08 (LLM Security)

17. **Model Behavior Monitoring**
    - Track prediction confidence scores
    - Monitor prediction uncertainty
    - Detect model degradation patterns
    - Identify edge cases
    - Reference: MLOps mo-04, Security Architect sa-08

### Azure-Specific Best Practices

18. **Azure Monitor Integration**
    - Use Azure Monitor for metrics
    - Enable Application Insights
    - Set up Log Analytics workspaces
    - Configure metric alerts
    - Reference: Azure az-04 (AI/ML Services)

19. **Azure ML Model Monitoring**
    - Enable model data collection
    - Configure data drift detection
    - Use built-in monitoring dashboards
    - Integrate with Azure Monitor
    - Reference: Azure az-04

20. **Cost-Effective Monitoring**
    - Use log sampling for high-volume models
    - Implement retention policies
    - Archive to cold storage
    - Monitor monitoring costs
    - Reference: Azure az-04, FinOps fo-05
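Best practice 2 above mentions Jensen-Shannon divergence, which none of the worked examples implement. A minimal sketch using SciPy — the bin count and the 0.1 alert threshold are illustrative choices, not values prescribed by this skill:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_drift(baseline, production, bins=10, threshold=0.1):
    """Jensen-Shannon distance between binned feature distributions."""
    # Bin both samples on edges derived from the baseline
    edges = np.histogram_bin_edges(baseline, bins=bins)
    b = np.histogram(baseline, bins=edges)[0] + 1e-10   # smooth empty bins
    p = np.histogram(production, bins=edges)[0] + 1e-10
    # jensenshannon() returns the JS *distance* (sqrt of the divergence)
    dist = jensenshannon(b / b.sum(), p / p.sum())
    return {"distance": float(dist), "has_drift": dist > threshold}
```

Like PSI, this needs only pre-binned counts, so it fits the pre-computed-baseline pattern used throughout this skill.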
## 💰 Cost Optimization Examples

### Intelligent Prediction Logging
```python
import random
from datetime import datetime

from model_monitor import SmartLogger
from finops_tracker import MonitoringCostTracker

class CostOptimizedMonitor:
    """Cost-optimized prediction logging with sampling"""

    def __init__(self, model_name: str, sampling_rate: float = 0.1):
        self.model_name = model_name
        self.sampling_rate = sampling_rate  # Log 10% of predictions
        self.logger = SmartLogger(model_name)
        self.cost_tracker = MonitoringCostTracker()

        # Always log certain predictions
        self.always_log_conditions = [
            lambda pred: pred["confidence"] < 0.5,        # Low confidence
            lambda pred: pred["value"] > 0.9,             # High risk
            lambda pred: pred.get("is_edge_case", False)  # Edge cases
        ]

    def should_log_prediction(self, prediction: dict) -> bool:
        """Intelligent sampling decision"""

        # Always log important predictions
        for condition in self.always_log_conditions:
            if condition(prediction):
                return True

        # Sample remaining predictions
        return random.random() < self.sampling_rate

    async def log_prediction(
        self,
        features: dict,
        prediction: dict,
        metadata: dict
    ):
        """Log prediction with cost optimization"""

        if not self.should_log_prediction(prediction):
            self.cost_tracker.record_skipped_log()
            return

        # Log to monitoring system
        with self.cost_tracker.track_logging_cost():
            await self.logger.log(
                timestamp=datetime.now(),
                features=features,
                prediction=prediction,
                metadata=metadata,
                # Compress large payloads
                compress=len(str(features)) > 1000
            )

        self.cost_tracker.record_logged_prediction()

    def get_cost_report(self):
        """Monitoring cost analysis"""
        report = self.cost_tracker.generate_report()

        print("Monitoring Cost Report:")
        print(f"Total predictions: {report.total_predictions:,}")
        print(f"Logged predictions: {report.logged_predictions:,}")
        print(f"Sampling rate: {report.actual_sampling_rate:.1%}")
        print(f"Log storage cost: ${report.storage_cost:.2f}")
        print(f"Log ingestion cost: ${report.ingestion_cost:.2f}")
        print(f"Total monitoring cost: ${report.total_cost:.2f}")
        print(f"Cost per logged prediction: ${report.cost_per_log:.4f}")
        print(f"Savings from sampling: ${report.sampling_savings:.2f}")

        return report

# Usage
monitor = CostOptimizedMonitor(
    model_name="churn_predictor_v2",
    sampling_rate=0.1  # Log 10% + important predictions
)

# In production (inside an async serving loop)
for prediction_request in prediction_stream:
    features = prediction_request.features
    prediction = model.predict(features)

    await monitor.log_prediction(
        features=features,
        prediction=prediction,
        metadata={"customer_id": prediction_request.customer_id}
    )

# Monitor costs
monthly_report = monitor.get_cost_report()

# Expected results:
# - 90% reduction in logging costs
# - Still captures all important events
# - Sufficient data for drift detection
```

### Efficient Drift Detection
```python
from drift_detector import BatchDriftDetector
from finops_tracker import DriftCostTracker
from scipy import stats
import numpy as np
import pandas as pd

class CostOptimizedDriftDetection:
    """Efficient drift detection with cost optimization"""

    def __init__(self, baseline_data: pd.DataFrame):
        self.baseline_data = baseline_data
        self.detector = BatchDriftDetector()
        self.cost_tracker = DriftCostTracker()

        # Pre-compute baseline statistics (one-time cost)
        self.baseline_stats = self._compute_baseline_stats()

    def _compute_baseline_stats(self):
        """Pre-compute baseline statistics for efficiency"""
        feature_stats = {}  # named to avoid shadowing scipy.stats

        for column in self.baseline_data.columns:
            if self.baseline_data[column].dtype in ['float64', 'int64']:
                feature_stats[column] = {
                    'mean': self.baseline_data[column].mean(),
                    'std': self.baseline_data[column].std(),
                    'min': self.baseline_data[column].min(),
                    'max': self.baseline_data[column].max(),
                    'quantiles': self.baseline_data[column].quantile([0.25, 0.5, 0.75]).to_dict(),
                    'distribution': self.baseline_data[column].values
                }
            else:
                feature_stats[column] = {
                    'value_counts': self.baseline_data[column].value_counts().to_dict(),
                    'unique_values': set(self.baseline_data[column].unique())
                }

        return feature_stats

    def detect_drift_efficient(
        self,
        production_data: pd.DataFrame,
        features: list = None,
        method: str = "ks_test"
    ) -> dict:
        """Efficient drift detection using pre-computed statistics"""

        with self.cost_tracker.track_drift_detection():
            drift_results = {}
            features = features or production_data.columns

            for feature in features:
                cached = self.baseline_stats.get(feature)
                if not cached or 'distribution' not in cached:
                    continue  # unseen or categorical feature

                # Use cached baseline statistics
                baseline_values = cached['distribution']
                production_values = production_data[feature].values

                if method == "ks_test":
                    # Kolmogorov-Smirnov test (efficient)
                    statistic, p_value = stats.ks_2samp(
                        baseline_values,
                        production_values
                    )
                    has_drift = p_value < 0.05

                elif method == "psi":
                    # Population Stability Index (very efficient)
                    psi_score = self._calculate_psi_efficient(
                        baseline_values,
                        production_values
                    )
                    has_drift = psi_score > 0.2
                    statistic = psi_score
                    p_value = None

                else:
                    raise ValueError(f"Unknown drift method: {method}")

                drift_results[feature] = {
                    'has_drift': has_drift,
                    'statistic': statistic,
                    'p_value': p_value,
                    'drift_magnitude': abs(
                        production_values.mean() - baseline_values.mean()
                    ) / baseline_values.std() if baseline_values.std() > 0 else 0
                }

        # Cost report
        cost_report = self.cost_tracker.get_detection_cost()
        print(f"Drift detection cost: ${cost_report.cost:.4f}")
        print(f"Detection time: {cost_report.duration_ms:.2f}ms")

        return {
            'drift_results': drift_results,
            'drifted_features': [f for f, r in drift_results.items() if r['has_drift']],
            'drift_score': np.mean([r['drift_magnitude'] for r in drift_results.values()]),
            'cost': cost_report.cost
        }

    def _calculate_psi_efficient(
        self,
        baseline: np.ndarray,
        production: np.ndarray,
        bins: int = 10
    ) -> float:
        """Efficient PSI calculation"""

        # Create bins from baseline
        bin_edges = np.percentile(baseline, np.linspace(0, 100, bins + 1))
        bin_edges[0] = -np.inf
        bin_edges[-1] = np.inf

        # Calculate distributions
        baseline_dist = np.histogram(baseline, bins=bin_edges)[0] / len(baseline)
        production_dist = np.histogram(production, bins=bin_edges)[0] / len(production)

        # PSI calculation
        psi = np.sum(
            (production_dist - baseline_dist) * np.log(
                (production_dist + 1e-10) / (baseline_dist + 1e-10)
            )
        )

        return psi

    def incremental_drift_monitoring(
        self,
        data_stream,
        window_size: int = 1000,
        check_frequency: int = 100
    ):
        """Incremental drift detection for streaming data"""

        buffer = []
        records_since_check = 0
        check_count = 0

        for record in data_stream:
            buffer.append(record)
            records_since_check += 1

            # Check drift every N records
            if records_since_check >= check_frequency:
                records_since_check = 0
                check_count += 1

                # Convert to DataFrame
                production_sample = pd.DataFrame(buffer[-window_size:])

                # Run efficient drift detection
                drift_result = self.detect_drift_efficient(
                    production_data=production_sample,
                    method="psi"  # Faster than KS test
                )

                # Alert if drift detected
                if drift_result['drifted_features']:
                    print(f"\nDrift detected at check #{check_count}:")
                    print(f"Drifted features: {drift_result['drifted_features']}")
                    print(f"Drift score: {drift_result['drift_score']:.3f}")

                # Keep only the most recent window
                buffer = buffer[-window_size:]

# Usage
drift_detector = CostOptimizedDriftDetection(
    baseline_data=training_data
)

# Batch drift detection (daily job)
production_data = get_recent_predictions(days=7)
drift_result = drift_detector.detect_drift_efficient(
    production_data=production_data,
    method="psi"  # ~10x faster than the KS test
)

print(f"Drift detection cost: ${drift_result['cost']:.4f}")
print(f"Drifted features: {drift_result['drifted_features']}")

# Streaming drift detection
drift_detector.incremental_drift_monitoring(
    data_stream=prediction_stream,
    window_size=1000,
    check_frequency=100
)
```

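As a sanity check on the PSI thresholds used in this guide (PSI > 0.2 signals drift), the metric can be exercised on synthetic data. This standalone sketch mirrors the binning logic of `_calculate_psi_efficient`, with finite sentinel edges in place of the infinite outer bins:

```python
import numpy as np

def psi(baseline: np.ndarray, production: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a production sample."""
    edges = np.percentile(baseline, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -1e12, 1e12  # open-ended outer bins (finite sentinels)
    expected = np.histogram(baseline, bins=edges)[0] / len(baseline)
    actual = np.histogram(production, bins=edges)[0] / len(production)
    return float(np.sum((actual - expected) * np.log((actual + 1e-10) / (expected + 1e-10))))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 10_000)
stable = rng.normal(0.0, 1.0, 10_000)    # same distribution -> PSI near 0
shifted = rng.normal(1.0, 1.0, 10_000)   # one-sigma mean shift -> PSI well above 0.2

print(f"PSI stable:  {psi(baseline, stable):.3f}")
print(f"PSI shifted: {psi(baseline, shifted):.3f}")
```

A fresh sample from the baseline distribution stays well under the 0.2 alert line, while a one-sigma mean shift blows past it, which is why PSI works as a cheap daily gate.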
### Automated Drift Response
```python
from alert_manager import AlertManager, AlertSeverity
from model_retrainer import AutoRetrainer
from drift_detector import DriftAnalyzer

class AutomatedDriftResponse:
    """Automated drift detection and response system"""

    def __init__(self, model_name: str, feature_list: list = None):
        self.model_name = model_name
        self.feature_list = feature_list  # None = analyze all features
        self.drift_analyzer = DriftAnalyzer(model_name)
        self.alert_manager = AlertManager()
        self.retrainer = AutoRetrainer(model_name)

        # Configure drift response thresholds
        self.thresholds = {
            "minor_drift": 0.1,     # Warning alert
            "moderate_drift": 0.2,  # Trigger investigation
            "severe_drift": 0.3     # Auto-retrain
        }

    async def monitor_and_respond(self):
        """Continuous drift monitoring with automated response"""

        # Get recent production data (helper assumed to query the prediction log)
        production_data = await self.get_production_data(days=7)

        # Detect drift
        drift_result = self.drift_analyzer.analyze_drift(
            production_data=production_data,
            features=self.feature_list
        )

        drift_score = drift_result['drift_score']

        # Automated response based on severity
        if drift_score >= self.thresholds["severe_drift"]:
            # Severe drift - auto-retrain
            await self._handle_severe_drift(drift_result)

        elif drift_score >= self.thresholds["moderate_drift"]:
            # Moderate drift - trigger investigation
            await self._handle_moderate_drift(drift_result)

        elif drift_score >= self.thresholds["minor_drift"]:
            # Minor drift - warning alert
            await self._handle_minor_drift(drift_result)

        return drift_result

    async def _handle_severe_drift(self, drift_result):
        """Handle severe drift with auto-retraining"""

        # Send critical alert
        await self.alert_manager.send_alert(
            severity=AlertSeverity.CRITICAL,
            title="Severe Drift Detected - Auto-Retraining Initiated",
            message=f"Model: {self.model_name}\n"
                    f"Drift score: {drift_result['drift_score']:.3f}\n"
                    f"Drifted features: {drift_result['drifted_features']}\n"
                    f"Action: Automatic retraining initiated",
            channels=["slack", "email", "pagerduty"]
        )

        # Trigger automated retraining
        retrain_job = await self.retrainer.trigger_retraining(
            reason="severe_drift_detected",
            drift_score=drift_result['drift_score'],
            drifted_features=drift_result['drifted_features'],
            priority="high"
        )

        print(f"Retraining job initiated: {retrain_job.id}")

    async def _handle_moderate_drift(self, drift_result):
        """Handle moderate drift with investigation workflow"""

        # Send warning alert
        await self.alert_manager.send_alert(
            severity=AlertSeverity.WARNING,
            title="Moderate Drift Detected - Investigation Required",
            message=f"Model: {self.model_name}\n"
                    f"Drift score: {drift_result['drift_score']:.3f}\n"
                    f"Drifted features: {drift_result['drifted_features']}\n"
                    f"Action: Please investigate and determine if retraining is needed",
            channels=["slack", "email"],
            actions=[
                {"label": "Trigger Retraining", "action": "retrain"},
                {"label": "Acknowledge", "action": "ack"},
                {"label": "View Dashboard", "action": "dashboard"}
            ]
        )

        # Create investigation ticket
        await self.alert_manager.create_investigation_ticket(
            title=f"Model Drift Investigation - {self.model_name}",
            description=drift_result,
            assignee="ml-ops-team"
        )

    async def _handle_minor_drift(self, drift_result):
        """Handle minor drift with monitoring"""

        # Send info alert
        await self.alert_manager.send_alert(
            severity=AlertSeverity.INFO,
            title="Minor Drift Detected",
            message=f"Model: {self.model_name}\n"
                    f"Drift score: {drift_result['drift_score']:.3f}\n"
                    f"Drifted features: {drift_result['drifted_features']}\n"
                    f"Action: Continuing to monitor",
            channels=["slack"]
        )

        # Log to dashboard
        await self.drift_analyzer.log_drift_event(drift_result)

# Usage (scheduled job)
drift_responder = AutomatedDriftResponse(model_name="churn_predictor_v2")

# Run daily
async def daily_drift_check():
    result = await drift_responder.monitor_and_respond()
    print(f"Drift check completed. Score: {result['drift_score']:.3f}")

# Schedule with APScheduler or similar
from apscheduler.schedulers.asyncio import AsyncIOScheduler

scheduler = AsyncIOScheduler()
scheduler.add_job(daily_drift_check, 'cron', hour=2)  # Run at 2 AM daily
scheduler.start()
```

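The threshold ladder above reduces to a pure function, which makes the escalation policy easy to unit-test in isolation (a sketch mirroring the thresholds in `AutomatedDriftResponse`; the tier names are illustrative):

```python
def drift_response(drift_score: float, thresholds: dict = None) -> str:
    """Map a drift score to the automated response tier."""
    thresholds = thresholds or {
        "minor_drift": 0.1,     # warning alert
        "moderate_drift": 0.2,  # trigger investigation
        "severe_drift": 0.3,    # auto-retrain
    }
    if drift_score >= thresholds["severe_drift"]:
        return "auto_retrain"
    if drift_score >= thresholds["moderate_drift"]:
        return "investigate"
    if drift_score >= thresholds["minor_drift"]:
        return "warn"
    return "none"

print(drift_response(0.05))  # none
print(drift_response(0.15))  # warn
print(drift_response(0.25))  # investigate
print(drift_response(0.42))  # auto_retrain
```

Keeping the policy separate from the I/O (alerts, tickets, retraining calls) means the boundary conditions can be asserted without mocking any external system.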
## 🚀 Monitoring Dashboards

### Grafana Dashboard Configuration
```python
# monitoring_dashboard.py
from grafana_api import GrafanaDashboard

def create_model_monitoring_dashboard(model_name: str):
    """Create comprehensive model monitoring dashboard"""

    dashboard = GrafanaDashboard(
        title=f"Model Monitoring - {model_name}",
        refresh="30s"
    )

    # Row 1: Model Performance Metrics
    dashboard.add_row("Model Performance")
    dashboard.add_panel(
        title="Model Accuracy (7d rolling)",
        query=f"avg(model_accuracy{{model_name='{model_name}'}})",
        panel_type="graph",
        threshold=0.85,
        alert_condition="below"
    )
    dashboard.add_panel(
        title="AUC Score",
        query=f"avg(model_auc{{model_name='{model_name}'}})",
        panel_type="gauge",
        thresholds=[0.80, 0.90, 0.95]
    )

    # Row 2: Drift Metrics
    dashboard.add_row("Drift Detection")
    dashboard.add_panel(
        title="Feature Drift Score",
        query=f"max(feature_drift_score{{model_name='{model_name}'}})",
        panel_type="graph",
        threshold=0.2,
        alert_condition="above"
    )
    dashboard.add_panel(
        title="Prediction Drift Score",
        query=f"max(prediction_drift_score{{model_name='{model_name}'}})",
        panel_type="graph",
        threshold=0.15
    )
    dashboard.add_panel(
        title="Drifted Features Count",
        query=f"count(drifted_features{{model_name='{model_name}'}})",
        panel_type="stat"
    )

    # Row 3: Operational Metrics
    dashboard.add_row("Operational Metrics")
    dashboard.add_panel(
        title="Prediction Latency (p95)",
        query=f"histogram_quantile(0.95, prediction_latency{{model_name='{model_name}'}})",
        panel_type="graph",
        threshold=100,
        unit="ms"
    )
    dashboard.add_panel(
        title="Requests per Minute",
        query=f"rate(prediction_requests{{model_name='{model_name}'}}[1m])",
        panel_type="graph"
    )
    dashboard.add_panel(
        title="Error Rate",
        query=f"rate(prediction_errors{{model_name='{model_name}'}}[5m])",
        panel_type="graph",
        threshold=0.01,
        alert_condition="above"
    )

    # Row 4: Data Quality
    dashboard.add_row("Data Quality")
    dashboard.add_panel(
        title="Feature Completeness",
        query=f"avg(feature_completeness{{model_name='{model_name}'}})",
        panel_type="gauge",
        thresholds=[0.95, 0.99, 1.0]
    )
    dashboard.add_panel(
        title="Invalid Inputs Rate",
        query=f"rate(invalid_inputs{{model_name='{model_name}'}}[5m])",
        panel_type="graph"
    )

    # Row 5: Cost Metrics
    dashboard.add_row("Cost & Resource Usage")
    dashboard.add_panel(
        title="Daily Serving Cost",
        query=f"sum(serving_cost_usd{{model_name='{model_name}'}})",
        panel_type="stat",
        unit="currencyUSD"
    )
    dashboard.add_panel(
        title="Cost per 1000 Predictions",
        query=f"serving_cost_usd{{model_name='{model_name}'}} / prediction_count * 1000",
        panel_type="graph",
        unit="currencyUSD"
    )

    # Save dashboard
    dashboard.save()
    return dashboard

# Create dashboard
dashboard = create_model_monitoring_dashboard("churn_predictor_v2")
print(f"Dashboard created: {dashboard.url}")
```

## 📊 Metrics & Monitoring

| Metric Category | Metric | Target | Tool |
|-----------------|--------|--------|------|
| **Drift Detection** | Feature drift score | <0.2 | Drift detector |
| | Prediction drift score | <0.15 | Drift detector |
| | Drifted features count | <3 | KS/PSI tests |
| | Drift check frequency | Daily | Scheduler |
| **Model Performance** | Production accuracy | >0.85 | Model monitor |
| | Production AUC | >0.90 | Model monitor |
| | Performance vs baseline | >95% | Comparison |
| **Data Quality** | Feature completeness | >99% | Quality checker |
| | Invalid input rate | <1% | Validator |
| | Missing feature rate | <0.1% | Monitor |
| **Monitoring Costs** | Log storage cost | <$100/mo | FinOps tracker |
| | Monitoring compute | <$50/mo | Cost tracker |
| | Alert notification cost | <$20/mo | Alert manager |
| **Operational** | Alert response time | <30 min | SLA monitor |
| | False alert rate | <5% | Alert tuning |
| | Dashboard load time | <2s | Performance |

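The targets in the table can double as an automated quality gate (a sketch; the metric keys and the naive `<`/`>` target parsing are assumptions for illustration, and only a few rows are shown):

```python
import operator

# A subset of the targets from the table above, keyed by metric name
TARGETS = {
    "feature_drift_score": "<0.2",
    "prediction_drift_score": "<0.15",
    "production_accuracy": ">0.85",
    "production_auc": ">0.90",
    "feature_completeness": ">0.99",
}

def check_targets(metrics: dict, targets: dict = TARGETS) -> dict:
    """Return pass/fail per reported metric against its '<x' / '>x' target."""
    ops = {"<": operator.lt, ">": operator.gt}
    results = {}
    for name, target in targets.items():
        if name not in metrics:
            continue  # metric not reported this run
        op, value = target[0], float(target[1:])
        results[name] = ops[op](metrics[name], value)
    return results

status = check_targets({
    "feature_drift_score": 0.12,
    "production_accuracy": 0.88,
    "production_auc": 0.87,   # below the 0.90 target
})
print(status)
# {'feature_drift_score': True, 'production_accuracy': True, 'production_auc': False}
```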
## 🔄 Integration Workflow

### End-to-End Monitoring Pipeline
```
1. Production Predictions (ml-04)
   ↓
2. Intelligent Logging (ml-05)
   ↓
3. Data Aggregation (ml-05)
   ↓
4. Drift Detection (mo-05)
   ↓
5. Performance Monitoring (mo-04)
   ↓
6. Anomaly Detection (ml-05)
   ↓
7. Alert Generation (ml-05)
   ↓
8. Dashboard Updates (do-08)
   ↓
9. Investigation Workflow (ml-05)
   ↓
10. Auto-Retraining Trigger (ml-09)
    ↓
11. Cost Tracking (fo-07)
```

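The stages above compose naturally into a single scheduled job. A sketch with stubbed stages shows the shape; each stub stands in for the corresponding component, and all the values are hypothetical:

```python
from typing import Callable

def run_monitoring_pipeline(stages: list[Callable[[dict], dict]]) -> dict:
    """Thread a shared context dict through each pipeline stage in order."""
    context = {}
    for stage in stages:
        context = stage(context)
    return context

# Stubbed stages mirroring the flow above; real versions would call the
# logger, drift detector, alert manager, retrainer, and cost tracker.
def collect(ctx): ctx["predictions"] = 1000; return ctx
def detect_drift(ctx): ctx["drift_score"] = 0.27; return ctx
def alert(ctx): ctx["alerted"] = ctx["drift_score"] >= 0.2; return ctx
def maybe_retrain(ctx): ctx["retrain"] = ctx["drift_score"] >= 0.3; return ctx
def track_cost(ctx): ctx["cost_usd"] = 0.04; return ctx

result = run_monitoring_pipeline(
    [collect, detect_drift, alert, maybe_retrain, track_cost]
)
print(result)
# {'predictions': 1000, 'drift_score': 0.27, 'alerted': True, 'retrain': False, 'cost_usd': 0.04}
```

Passing one context dict between stages keeps each stage independently testable and makes it trivial to reorder or disable stages per model.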
## 🎯 Quick Wins

1. **Enable prediction logging** - Visibility into production behavior
2. **Set up drift detection** - Early warning for model degradation
3. **Create monitoring dashboards** - Real-time visibility
4. **Implement intelligent sampling** - 90% reduction in logging costs
5. **Configure performance alerts** - Proactive issue detection
6. **Use PSI for drift** - 10x faster than KS test
7. **Pre-compute baseline stats** - Faster drift detection
8. **Set up automated alerts** - Faster incident response
9. **Track monitoring costs** - Optimize monitoring spend
10. **Implement auto-retraining** - Automated drift response