omgkit 2.20.0 → 2.21.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +125 -10
- package/package.json +1 -1
- package/plugin/agents/ai-architect-agent.md +282 -0
- package/plugin/agents/data-scientist-agent.md +221 -0
- package/plugin/agents/experiment-analyst-agent.md +318 -0
- package/plugin/agents/ml-engineer-agent.md +165 -0
- package/plugin/agents/mlops-engineer-agent.md +324 -0
- package/plugin/agents/model-optimizer-agent.md +287 -0
- package/plugin/agents/production-engineer-agent.md +360 -0
- package/plugin/agents/research-scientist-agent.md +274 -0
- package/plugin/commands/omgdata/augment.md +86 -0
- package/plugin/commands/omgdata/collect.md +81 -0
- package/plugin/commands/omgdata/label.md +83 -0
- package/plugin/commands/omgdata/split.md +83 -0
- package/plugin/commands/omgdata/validate.md +76 -0
- package/plugin/commands/omgdata/version.md +85 -0
- package/plugin/commands/omgdeploy/ab.md +94 -0
- package/plugin/commands/omgdeploy/cloud.md +89 -0
- package/plugin/commands/omgdeploy/edge.md +93 -0
- package/plugin/commands/omgdeploy/package.md +91 -0
- package/plugin/commands/omgdeploy/serve.md +92 -0
- package/plugin/commands/omgfeature/embed.md +93 -0
- package/plugin/commands/omgfeature/extract.md +93 -0
- package/plugin/commands/omgfeature/select.md +85 -0
- package/plugin/commands/omgfeature/store.md +97 -0
- package/plugin/commands/omgml/init.md +60 -0
- package/plugin/commands/omgml/status.md +82 -0
- package/plugin/commands/omgops/drift.md +87 -0
- package/plugin/commands/omgops/monitor.md +99 -0
- package/plugin/commands/omgops/pipeline.md +102 -0
- package/plugin/commands/omgops/registry.md +109 -0
- package/plugin/commands/omgops/retrain.md +91 -0
- package/plugin/commands/omgoptim/distill.md +90 -0
- package/plugin/commands/omgoptim/profile.md +92 -0
- package/plugin/commands/omgoptim/prune.md +81 -0
- package/plugin/commands/omgoptim/quantize.md +83 -0
- package/plugin/commands/omgtrain/baseline.md +78 -0
- package/plugin/commands/omgtrain/compare.md +99 -0
- package/plugin/commands/omgtrain/evaluate.md +85 -0
- package/plugin/commands/omgtrain/train.md +81 -0
- package/plugin/commands/omgtrain/tune.md +89 -0
- package/plugin/registry.yaml +252 -2
- package/plugin/skills/ml-systems/SKILL.md +65 -0
- package/plugin/skills/ml-systems/ai-accelerators/SKILL.md +342 -0
- package/plugin/skills/ml-systems/data-eng/SKILL.md +126 -0
- package/plugin/skills/ml-systems/deep-learning-primer/SKILL.md +143 -0
- package/plugin/skills/ml-systems/deployment-paradigms/SKILL.md +148 -0
- package/plugin/skills/ml-systems/dnn-architectures/SKILL.md +128 -0
- package/plugin/skills/ml-systems/edge-deployment/SKILL.md +366 -0
- package/plugin/skills/ml-systems/efficient-ai/SKILL.md +316 -0
- package/plugin/skills/ml-systems/feature-engineering/SKILL.md +151 -0
- package/plugin/skills/ml-systems/ml-frameworks/SKILL.md +187 -0
- package/plugin/skills/ml-systems/ml-serving-optimization/SKILL.md +371 -0
- package/plugin/skills/ml-systems/ml-systems-fundamentals/SKILL.md +103 -0
- package/plugin/skills/ml-systems/ml-workflow/SKILL.md +162 -0
- package/plugin/skills/ml-systems/mlops/SKILL.md +386 -0
- package/plugin/skills/ml-systems/model-deployment/SKILL.md +350 -0
- package/plugin/skills/ml-systems/model-dev/SKILL.md +160 -0
- package/plugin/skills/ml-systems/model-optimization/SKILL.md +339 -0
- package/plugin/skills/ml-systems/robust-ai/SKILL.md +395 -0
- package/plugin/skills/ml-systems/training-data/SKILL.md +152 -0
- package/plugin/workflows/ml-systems/data-preparation-workflow.md +276 -0
- package/plugin/workflows/ml-systems/edge-deployment-workflow.md +413 -0
- package/plugin/workflows/ml-systems/full-ml-lifecycle-workflow.md +405 -0
- package/plugin/workflows/ml-systems/hyperparameter-tuning-workflow.md +352 -0
- package/plugin/workflows/ml-systems/mlops-pipeline-workflow.md +384 -0
- package/plugin/workflows/ml-systems/model-deployment-workflow.md +392 -0
- package/plugin/workflows/ml-systems/model-development-workflow.md +218 -0
- package/plugin/workflows/ml-systems/model-evaluation-workflow.md +416 -0
- package/plugin/workflows/ml-systems/model-optimization-workflow.md +390 -0
- package/plugin/workflows/ml-systems/monitoring-drift-workflow.md +446 -0
- package/plugin/workflows/ml-systems/retraining-workflow.md +401 -0
- package/plugin/workflows/ml-systems/training-pipeline-workflow.md +382 -0
package/plugin/workflows/ml-systems/monitoring-drift-workflow.md
@@ -0,0 +1,446 @@
---
name: Monitoring & Drift Workflow
description: Production monitoring workflow for detecting data drift, model degradation, and triggering appropriate responses.
category: ml-systems
complexity: medium
agents:
  - mlops-engineer-agent
  - experiment-analyst-agent
---

# Monitoring & Drift Workflow

Monitor production models, detect data and concept drift, and trigger the appropriate response.

## Overview

```
┌──────────────────────────────────────────────────────────────┐
│                 MONITORING & DRIFT WORKFLOW                  │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  1. COLLECT          2. DETECT           3. ANALYZE          │
│     DATA                DRIFT               ROOT CAUSE       │
│     ↓                   ↓                   ↓                │
│  Predictions         Statistical         Feature analysis    │
│  Features            tests               Data investigation  │
│  Ground truth        Thresholds          Model behavior      │
│                                                              │
│  4. ALERT            5. RESPOND          6. DOCUMENT         │
│     ↓                   ↓                   ↓                │
│  PagerDuty           Switch model        Incident report     │
│  Slack               Trigger retrain     Lessons learned     │
│  Dashboard           Fallback            Model card update   │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```

## Steps

### Step 1: Data Collection
**Agent**: mlops-engineer-agent

**Inputs**:
- Prediction logs
- Input features
- Ground truth (when available)

**Actions**:
```bash
# Set up monitoring
/omgops:monitor --config monitoring.yaml
```

```python
import numpy as np
import pandas as pd
from datetime import datetime, timezone

# Prediction logging (MODEL_VERSION and `metrics` are assumed to be set at module level)
class PredictionLogger:
    def __init__(self, storage_client):
        self.storage = storage_client

    def log_prediction(self, request_id, features, prediction, confidence):
        record = {
            'request_id': request_id,
            'timestamp': datetime.now(timezone.utc).isoformat(),
            'features': features,
            'prediction': prediction,
            'confidence': confidence,
            'model_version': MODEL_VERSION
        }

        # Log to streaming storage
        self.storage.append('predictions', record)

        # Update real-time metrics
        metrics.log_prediction(prediction, confidence)

    def log_feedback(self, request_id, ground_truth):
        # Join delayed ground truth back onto the logged prediction
        self.storage.update('predictions', request_id, {
            'ground_truth': ground_truth,
            'correct': self.was_correct(request_id, ground_truth)
        })

# Reference data storage
class ReferenceDataManager:
    def __init__(self, reference_path):
        self.reference = pd.read_parquet(reference_path)
        self.stats = self.compute_statistics()

    def compute_statistics(self):
        return {
            col: {
                'mean': self.reference[col].mean(),
                'std': self.reference[col].std(),
                'quantiles': self.reference[col].quantile([0.25, 0.5, 0.75]).to_dict()
            }
            for col in self.reference.select_dtypes(include=[np.number]).columns
        }
```
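
The logged records feed the drift checks in Step 2, which expect a dataframe of recent feature values. A minimal sketch of that hand-off, assuming records are stored as dicts shaped like `record` above (the 24-hour window and helper name are illustrative, not part of the plugin):

```python
import pandas as pd
from datetime import datetime, timedelta, timezone

def build_current_window(records, hours=24):
    """Turn raw prediction-log records into a feature dataframe for drift checks."""
    cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
    recent = [r for r in records if datetime.fromisoformat(r['timestamp']) >= cutoff]
    # One row per request, one column per feature
    return pd.DataFrame([r['features'] for r in recent])
```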

**Outputs**:
- Prediction logs
- Feature distributions
- Reference statistics

### Step 2: Drift Detection
**Agent**: experiment-analyst-agent

**Actions**:
```bash
# Check for drift
/omgops:drift --reference reference.parquet --current current.parquet
```

```python
import numpy as np
from scipy import stats
from evidently.metrics import DataDriftTable  # Evidently reports can complement the manual tests below

class DriftDetector:
    def __init__(self, reference_data, significance=0.05):
        self.reference = reference_data
        self.significance = significance

    def detect_data_drift(self, current_data):
        results = {}

        for col in self.reference.columns:
            if self.reference[col].dtype in ['float64', 'int64']:
                # Kolmogorov-Smirnov test for numerical features
                stat, p_value = stats.ks_2samp(
                    self.reference[col],
                    current_data[col]
                )
                method = 'ks'
            else:
                # Chi-square test for categorical features:
                # align categories and scale expected counts to the current sample size
                ref_counts = self.reference[col].value_counts()
                cur_counts = current_data[col].value_counts().reindex(ref_counts.index, fill_value=0)
                expected = ref_counts / ref_counts.sum() * cur_counts.sum()
                stat, p_value = stats.chisquare(cur_counts, expected)
                method = 'chi2'

            results[col] = {
                'method': method,
                'statistic': stat,
                'p_value': p_value,
                'drift_detected': p_value < self.significance
            }

        # Overall decision from the per-feature tests, plus PSI as a global drift score
        results['drift_detected'] = any(r['drift_detected'] for r in results.values())
        results['overall_psi'] = self.calculate_psi(current_data)

        return results

    def calculate_psi(self, current_data):
        psi_values = []
        for col in self.reference.select_dtypes(include=[np.number]).columns:
            ref_hist, bins = np.histogram(self.reference[col], bins=10)
            cur_hist, _ = np.histogram(current_data[col], bins=bins)

            # Small constant keeps empty bins from producing log(0) or division by zero
            ref_pct = ref_hist / len(self.reference) + 0.0001
            cur_pct = cur_hist / len(current_data) + 0.0001

            psi = np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct))
            psi_values.append(psi)

        return np.mean(psi_values)

class ConceptDriftDetector:
    def __init__(self, window_size=1000):
        self.window_size = window_size
        self.performance_history = []

    def update(self, y_true, y_pred):
        is_correct = y_true == y_pred
        self.performance_history.append(is_correct)

        if len(self.performance_history) < self.window_size * 2:
            return None

        # Compare recent performance to historical
        recent = self.performance_history[-self.window_size:]
        historical = self.performance_history[-self.window_size*2:-self.window_size]

        recent_acc = np.mean(recent)
        historical_acc = np.mean(historical)

        # A significant accuracy drop indicates concept drift
        return {
            'recent_accuracy': recent_acc,
            'historical_accuracy': historical_acc,
            'drift_detected': (historical_acc - recent_acc) > 0.05
        }
```
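
A usage sketch tying the detector to the reference/current files passed to `/omgops:drift` above. The 0.1/0.2 PSI bands are a common rule of thumb (and line up with the `ml_data_drift_psi > 0.2` alert threshold used in Step 4); adjust them to the model's risk profile:

```python
import pandas as pd

reference = pd.read_parquet('reference.parquet')
current = pd.read_parquet('current.parquet')

detector = DriftDetector(reference, significance=0.05)
results = detector.detect_data_drift(current)

# Per-feature results are dicts; the summary keys ('drift_detected', 'overall_psi') are not
drifted = [col for col, r in results.items()
           if isinstance(r, dict) and r.get('drift_detected')]
psi = results['overall_psi']
band = 'stable' if psi < 0.1 else 'moderate shift' if psi < 0.2 else 'significant shift'
print(f"PSI={psi:.3f} ({band}); drifted features: {drifted}")
```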

**Outputs**:
- Drift scores
- Affected features
- Detection alerts

### Step 3: Root Cause Analysis
**Agent**: experiment-analyst-agent

**Actions**:
```python
class RootCauseAnalyzer:
    # check_data_quality, check_new_categories, check_temporal_patterns,
    # calculate_severity and generate_recommendations are project-specific helpers (not shown)
    def analyze_drift(self, drift_results, current_data, reference_data):
        causes = []

        # 1. Check feature shifts (skip the summary keys added by the drift detector)
        for feature, result in drift_results.items():
            if isinstance(result, dict) and result.get('drift_detected'):
                shift = self.analyze_feature_shift(
                    reference_data[feature],
                    current_data[feature]
                )
                causes.append({
                    'type': 'feature_shift',
                    'feature': feature,
                    'details': shift
                })

        # 2. Check data quality
        quality_issues = self.check_data_quality(current_data)
        if quality_issues:
            causes.append({
                'type': 'data_quality',
                'issues': quality_issues
            })

        # 3. Check for new categories
        new_categories = self.check_new_categories(
            reference_data, current_data
        )
        if new_categories:
            causes.append({
                'type': 'new_categories',
                'details': new_categories
            })

        # 4. Check temporal patterns
        temporal = self.check_temporal_patterns(current_data)
        if temporal['anomaly_detected']:
            causes.append({
                'type': 'temporal_anomaly',
                'details': temporal
            })

        return {
            'causes': causes,
            'severity': self.calculate_severity(causes),
            'recommendations': self.generate_recommendations(causes)
        }

    def analyze_feature_shift(self, reference, current):
        return {
            'ref_mean': reference.mean(),
            'cur_mean': current.mean(),
            'shift': (current.mean() - reference.mean()) / reference.std(),
            'ref_dist': reference.describe().to_dict(),
            'cur_dist': current.describe().to_dict()
        }
```
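
The helper methods called by `analyze_drift` are project-specific. As one example of what they might contain, here is a sketch of `check_new_categories` (written as a free function; inside the class it would take `self` as its first argument):

```python
def check_new_categories(reference_data, current_data):
    """Report categorical values seen in production but absent from the reference set."""
    new_values = {}
    for col in reference_data.select_dtypes(include=['object', 'category']).columns:
        unseen = set(current_data[col].dropna().unique()) - set(reference_data[col].dropna().unique())
        if unseen:
            new_values[col] = list(unseen)
    return new_values
```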

**Outputs**:
- Root causes identified
- Severity assessment
- Remediation options

### Step 4: Alerting
**Agent**: mlops-engineer-agent

**Actions**:
```python
class AlertManager:
    # SlackClient and PagerDutyClient are thin wrappers around the respective APIs (not shown)
    def __init__(self, config):
        self.slack = SlackClient(config['slack_webhook'])
        self.pagerduty = PagerDutyClient(config['pagerduty_key'])

    def alert(self, severity, message, details):
        if severity == 'critical':
            # Page the on-call engineer
            self.pagerduty.trigger(
                description=message,
                severity='critical',
                details=details
            )

        # Always send to Slack
        self.slack.send(
            channel='#ml-alerts',
            text=self.format_message(severity, message, details),
            attachments=self.format_attachments(details)
        )

    def format_message(self, severity, message, details):
        emoji = {'critical': '🚨', 'warning': '⚠️', 'info': 'ℹ️'}
        return f"{emoji[severity]} *{severity.upper()}*: {message}"

# Prometheus alerting rules
alerting_rules = """
groups:
  - name: ml-drift-alerts
    rules:
      - alert: DataDriftDetected
        expr: ml_data_drift_psi > 0.2
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Data drift detected in production"

      - alert: ModelAccuracyDrop
        expr: ml_model_accuracy < 0.85
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Model accuracy below threshold"

      - alert: HighPredictionLatency
        expr: histogram_quantile(0.99, rate(ml_prediction_latency_bucket[5m])) > 0.5
        for: 15m
        labels:
          severity: warning
"""
```
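
A sketch of how the Step 3 analysis might be routed through `AlertManager`; the `notify_drift` helper and its severity-to-alert-level mapping are illustrative, not part of the plugin:

```python
def notify_drift(alert_manager, drift_analysis):
    severity = drift_analysis['severity']  # e.g. 'critical' / 'high' / 'medium'
    level = 'critical' if severity == 'critical' else 'warning'
    causes = ', '.join(c['type'] for c in drift_analysis['causes']) or 'unknown'
    alert_manager.alert(
        level,
        f"Drift detected (severity: {severity})",
        {'causes': causes, 'recommendations': drift_analysis['recommendations']}
    )
```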

**Outputs**:
- Alerts sent
- Incident created
- Team notified

### Step 5: Response
**Agent**: mlops-engineer-agent

**Actions**:
```python
class DriftResponseManager:
    def __init__(self, model_manager, pipeline_manager):
        self.models = model_manager
        self.pipelines = pipeline_manager

    def respond(self, drift_analysis):
        severity = drift_analysis['severity']
        response = {'actions_taken': []}

        if severity == 'critical':
            # Immediate fallback
            self.switch_to_fallback()
            response['actions_taken'].append('switched_to_fallback')

            # Trigger immediate retraining
            run_id = self.pipelines.trigger('emergency_retraining')
            response['actions_taken'].append(f'triggered_retraining:{run_id}')

        elif severity == 'high':
            # Schedule retraining
            self.pipelines.schedule('retraining', priority='high')
            response['actions_taken'].append('scheduled_retraining')

            # Increase monitoring
            self.models.increase_monitoring_frequency()
            response['actions_taken'].append('increased_monitoring')

        elif severity == 'medium':
            # Log for review
            self.log_for_review(drift_analysis)
            response['actions_taken'].append('logged_for_review')

        return response

    def switch_to_fallback(self):
        # Use a simpler, more robust model version
        fallback_version = self.models.get_fallback_version()
        self.models.deploy(fallback_version)
```
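
How the pieces fit together in one pass, sketched with the classes defined in the previous steps (the `run_monitoring_cycle` wrapper and the way PSI is carried into the analysis are illustrative):

```python
def run_monitoring_cycle(detector, analyzer, alert_manager, response_manager,
                         reference, current):
    """One monitoring pass: detect, analyze, alert, respond."""
    drift_results = detector.detect_data_drift(current)
    if not drift_results['drift_detected']:
        return None

    analysis = analyzer.analyze_drift(drift_results, current, reference)
    analysis['overall_psi'] = drift_results['overall_psi']  # surfaced in the incident report (Step 6)
    notify_drift(alert_manager, analysis)                    # from the Step 4 sketch
    return response_manager.respond(analysis)
```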

**Outputs**:
- Response actions taken
- Model switched if needed
- Retraining triggered

### Step 6: Documentation
**Agent**: experiment-analyst-agent

**Actions**:
```python
from datetime import datetime

def create_incident_report(drift_analysis, response):
    report = f"""
# Drift Incident Report

## Summary
- **Date**: {datetime.now().isoformat()}
- **Severity**: {drift_analysis['severity']}
- **Type**: {drift_analysis['causes'][0]['type'] if drift_analysis['causes'] else 'Unknown'}

## Detection
- **PSI Score**: {drift_analysis.get('overall_psi', 'N/A')}
- **Affected Features**: {len([c for c in drift_analysis['causes'] if c['type'] == 'feature_shift'])}
- **Detection Method**: Statistical tests (KS, chi-square)

## Root Cause Analysis
{format_causes(drift_analysis['causes'])}

## Actions Taken
{format_actions(response['actions_taken'])}

## Recommendations
{format_recommendations(drift_analysis['recommendations'])}

## Lessons Learned
- [To be filled post-incident]

## Follow-up Actions
- [ ] Review data pipeline
- [ ] Update monitoring thresholds
- [ ] Retrain model with new data
"""
    return report
```
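
The `format_*` helpers used in the report template are not defined in this file; a minimal sketch of what they might look like:

```python
def format_causes(causes):
    lines = [f"- **{c['type']}**: {c.get('feature', c.get('details', ''))}" for c in causes]
    return '\n'.join(lines) or '- None identified'

def format_actions(actions):
    return '\n'.join(f"- {a}" for a in actions) or '- None'

def format_recommendations(recommendations):
    return '\n'.join(f"- {r}" for r in recommendations) or '- None'
```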

**Outputs**:
- Incident report
- Lessons learned
- Updated documentation

## Artifacts

- `monitoring/` - Monitoring configurations
- `alerts/` - Alert definitions
- `incidents/` - Incident reports
- `dashboards/` - Grafana dashboards
- `runbooks/` - Response procedures

## Next Workflows

After drift detection:
- → **retraining-workflow** for model updates
- → **model-evaluation-workflow** for post-retraining validation

## Quality Gates

- [ ] All steps completed successfully
- [ ] Metrics meet defined thresholds
- [ ] Documentation updated
- [ ] Artifacts versioned and stored
- [ ] Stakeholder approval obtained