omgkit 2.20.0 → 2.21.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (73)
  1. package/README.md +125 -10
  2. package/package.json +1 -1
  3. package/plugin/agents/ai-architect-agent.md +282 -0
  4. package/plugin/agents/data-scientist-agent.md +221 -0
  5. package/plugin/agents/experiment-analyst-agent.md +318 -0
  6. package/plugin/agents/ml-engineer-agent.md +165 -0
  7. package/plugin/agents/mlops-engineer-agent.md +324 -0
  8. package/plugin/agents/model-optimizer-agent.md +287 -0
  9. package/plugin/agents/production-engineer-agent.md +360 -0
  10. package/plugin/agents/research-scientist-agent.md +274 -0
  11. package/plugin/commands/omgdata/augment.md +86 -0
  12. package/plugin/commands/omgdata/collect.md +81 -0
  13. package/plugin/commands/omgdata/label.md +83 -0
  14. package/plugin/commands/omgdata/split.md +83 -0
  15. package/plugin/commands/omgdata/validate.md +76 -0
  16. package/plugin/commands/omgdata/version.md +85 -0
  17. package/plugin/commands/omgdeploy/ab.md +94 -0
  18. package/plugin/commands/omgdeploy/cloud.md +89 -0
  19. package/plugin/commands/omgdeploy/edge.md +93 -0
  20. package/plugin/commands/omgdeploy/package.md +91 -0
  21. package/plugin/commands/omgdeploy/serve.md +92 -0
  22. package/plugin/commands/omgfeature/embed.md +93 -0
  23. package/plugin/commands/omgfeature/extract.md +93 -0
  24. package/plugin/commands/omgfeature/select.md +85 -0
  25. package/plugin/commands/omgfeature/store.md +97 -0
  26. package/plugin/commands/omgml/init.md +60 -0
  27. package/plugin/commands/omgml/status.md +82 -0
  28. package/plugin/commands/omgops/drift.md +87 -0
  29. package/plugin/commands/omgops/monitor.md +99 -0
  30. package/plugin/commands/omgops/pipeline.md +102 -0
  31. package/plugin/commands/omgops/registry.md +109 -0
  32. package/plugin/commands/omgops/retrain.md +91 -0
  33. package/plugin/commands/omgoptim/distill.md +90 -0
  34. package/plugin/commands/omgoptim/profile.md +92 -0
  35. package/plugin/commands/omgoptim/prune.md +81 -0
  36. package/plugin/commands/omgoptim/quantize.md +83 -0
  37. package/plugin/commands/omgtrain/baseline.md +78 -0
  38. package/plugin/commands/omgtrain/compare.md +99 -0
  39. package/plugin/commands/omgtrain/evaluate.md +85 -0
  40. package/plugin/commands/omgtrain/train.md +81 -0
  41. package/plugin/commands/omgtrain/tune.md +89 -0
  42. package/plugin/registry.yaml +252 -2
  43. package/plugin/skills/ml-systems/SKILL.md +65 -0
  44. package/plugin/skills/ml-systems/ai-accelerators/SKILL.md +342 -0
  45. package/plugin/skills/ml-systems/data-eng/SKILL.md +126 -0
  46. package/plugin/skills/ml-systems/deep-learning-primer/SKILL.md +143 -0
  47. package/plugin/skills/ml-systems/deployment-paradigms/SKILL.md +148 -0
  48. package/plugin/skills/ml-systems/dnn-architectures/SKILL.md +128 -0
  49. package/plugin/skills/ml-systems/edge-deployment/SKILL.md +366 -0
  50. package/plugin/skills/ml-systems/efficient-ai/SKILL.md +316 -0
  51. package/plugin/skills/ml-systems/feature-engineering/SKILL.md +151 -0
  52. package/plugin/skills/ml-systems/ml-frameworks/SKILL.md +187 -0
  53. package/plugin/skills/ml-systems/ml-serving-optimization/SKILL.md +371 -0
  54. package/plugin/skills/ml-systems/ml-systems-fundamentals/SKILL.md +103 -0
  55. package/plugin/skills/ml-systems/ml-workflow/SKILL.md +162 -0
  56. package/plugin/skills/ml-systems/mlops/SKILL.md +386 -0
  57. package/plugin/skills/ml-systems/model-deployment/SKILL.md +350 -0
  58. package/plugin/skills/ml-systems/model-dev/SKILL.md +160 -0
  59. package/plugin/skills/ml-systems/model-optimization/SKILL.md +339 -0
  60. package/plugin/skills/ml-systems/robust-ai/SKILL.md +395 -0
  61. package/plugin/skills/ml-systems/training-data/SKILL.md +152 -0
  62. package/plugin/workflows/ml-systems/data-preparation-workflow.md +276 -0
  63. package/plugin/workflows/ml-systems/edge-deployment-workflow.md +413 -0
  64. package/plugin/workflows/ml-systems/full-ml-lifecycle-workflow.md +405 -0
  65. package/plugin/workflows/ml-systems/hyperparameter-tuning-workflow.md +352 -0
  66. package/plugin/workflows/ml-systems/mlops-pipeline-workflow.md +384 -0
  67. package/plugin/workflows/ml-systems/model-deployment-workflow.md +392 -0
  68. package/plugin/workflows/ml-systems/model-development-workflow.md +218 -0
  69. package/plugin/workflows/ml-systems/model-evaluation-workflow.md +416 -0
  70. package/plugin/workflows/ml-systems/model-optimization-workflow.md +390 -0
  71. package/plugin/workflows/ml-systems/monitoring-drift-workflow.md +446 -0
  72. package/plugin/workflows/ml-systems/retraining-workflow.md +401 -0
  73. package/plugin/workflows/ml-systems/training-pipeline-workflow.md +382 -0
@@ -0,0 +1,446 @@
---
name: Monitoring & Drift Workflow
description: Production monitoring workflow for detecting data drift, model degradation, and triggering appropriate responses.
category: ml-systems
complexity: medium
agents:
- mlops-engineer-agent
- experiment-analyst-agent
---

# Monitoring & Drift Workflow

Monitor production models and detect drift.

## Overview

```
┌─────────────────────────────────────────────────────────────┐
│                 MONITORING & DRIFT WORKFLOW                 │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│   1. COLLECT        2. DETECT         3. ANALYZE            │
│      DATA              DRIFT             ROOT CAUSE         │
│        ↓                 ↓                  ↓               │
│   Predictions       Statistical       Feature analysis      │
│   Features          tests             Data investigation    │
│   Ground truth      Thresholds        Model behavior        │
│                                                             │
│   4. ALERT          5. RESPOND        6. DOCUMENT           │
│        ↓                 ↓                  ↓               │
│   PagerDuty         Switch model      Incident report       │
│   Slack             Trigger retrain   Lessons learned       │
│   Dashboard         Fallback          Model card update     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

## Steps

### Step 1: Data Collection
**Agent**: mlops-engineer-agent

**Inputs**:
- Prediction logs
- Input features
- Ground truth (when available)

**Actions**:
```bash
# Set up monitoring
/omgops:monitor --config monitoring.yaml
```

```python
from datetime import datetime, timezone

import numpy as np
import pandas as pd

# Prediction logging
class PredictionLogger:
    def __init__(self, storage_client):
        self.storage = storage_client

    def log_prediction(self, request_id, features, prediction, confidence):
        record = {
            'request_id': request_id,
            'timestamp': datetime.now(timezone.utc).isoformat(),
            'features': features,
            'prediction': prediction,
            'confidence': confidence,
            'model_version': MODEL_VERSION  # module-level constant set at deploy time
        }

        # Log to streaming storage
        self.storage.append('predictions', record)

        # Update real-time metrics
        metrics.log_prediction(prediction, confidence)

    def log_feedback(self, request_id, ground_truth):
        # Join delayed ground truth back onto the logged prediction
        self.storage.update('predictions', request_id, {
            'ground_truth': ground_truth,
            'correct': self.was_correct(request_id, ground_truth)
        })

# Reference data storage
class ReferenceDataManager:
    def __init__(self, reference_path):
        self.reference = pd.read_parquet(reference_path)
        self.stats = self.compute_statistics()

    def compute_statistics(self):
        # Summary statistics per numeric column, used as the drift baseline
        return {
            col: {
                'mean': self.reference[col].mean(),
                'std': self.reference[col].std(),
                'quantiles': self.reference[col].quantile([0.25, 0.5, 0.75]).to_dict()
            }
            for col in self.reference.select_dtypes(include=[np.number]).columns
        }
```

**Outputs**:
- Prediction logs
- Feature distributions
- Reference statistics

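To make the logging contract concrete, here is a minimal in-memory stand-in for the storage client that `PredictionLogger` assumes. The `InMemoryStore` class and its record fields are illustrative only, not part of omgkit:

```python
# Hypothetical in-memory storage client; production would append to a
# stream or warehouse table instead.
class InMemoryStore:
    def __init__(self):
        self.tables = {}

    def append(self, table, record):
        # Keyed by request_id so delayed feedback can be joined later
        self.tables.setdefault(table, {})[record['request_id']] = record

    def update(self, table, request_id, fields):
        self.tables[table][request_id].update(fields)

store = InMemoryStore()
store.append('predictions', {'request_id': 'r1', 'prediction': 1, 'confidence': 0.93})

# Ground truth arrives later and is joined onto the original record
store.update('predictions', 'r1', {'ground_truth': 0, 'correct': False})

record = store.tables['predictions']['r1']
print(record['correct'])  # False
```

The request-id key is what makes delayed-feedback joins cheap; without it, matching ground truth to predictions requires a timestamp-window join.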
### Step 2: Drift Detection
**Agent**: experiment-analyst-agent

**Actions**:
```bash
# Check for drift
/omgops:drift --reference reference.parquet --current current.parquet
```

```python
import numpy as np
from scipy import stats

class DriftDetector:
    def __init__(self, reference_data, significance=0.05):
        self.reference = reference_data
        self.significance = significance

    def detect_data_drift(self, current_data):
        results = {}

        for col in self.reference.columns:
            if self.reference[col].dtype in ['float64', 'int64']:
                # KS test for numerical features
                stat, p_value = stats.ks_2samp(
                    self.reference[col],
                    current_data[col]
                )
                method = 'ks'
            else:
                # Chi-square for categorical features: align the category
                # sets, with add-one smoothing so unseen categories don't
                # produce zero expected counts
                categories = self.reference[col].value_counts().index.union(
                    current_data[col].value_counts().index
                )
                ref_counts = self.reference[col].value_counts().reindex(categories, fill_value=0)
                cur_counts = current_data[col].value_counts().reindex(categories, fill_value=0)
                smoothed = ref_counts + 1
                expected = smoothed / smoothed.sum() * cur_counts.sum()
                stat, p_value = stats.chisquare(cur_counts, expected)
                method = 'chi2'

            results[col] = {
                'method': method,
                'statistic': stat,
                'p_value': p_value,
                'drift_detected': p_value < self.significance
            }

        # Aggregate the per-feature verdicts while results still holds only
        # dicts, then attach the scalar PSI summary
        results['drift_detected'] = any(r['drift_detected'] for r in results.values())
        results['overall_psi'] = self.calculate_psi(current_data)

        return results

    def calculate_psi(self, current_data):
        psi_values = []
        for col in self.reference.select_dtypes(include=[np.number]).columns:
            ref_hist, bins = np.histogram(self.reference[col], bins=10)
            cur_hist, _ = np.histogram(current_data[col], bins=bins)

            # Small epsilon keeps the log and division finite for empty bins
            ref_pct = ref_hist / len(self.reference) + 0.0001
            cur_pct = cur_hist / len(current_data) + 0.0001

            psi = np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct))
            psi_values.append(psi)

        return np.mean(psi_values)

class ConceptDriftDetector:
    def __init__(self, window_size=1000):
        self.window_size = window_size
        self.performance_history = []

    def update(self, y_true, y_pred):
        is_correct = y_true == y_pred
        self.performance_history.append(is_correct)

        if len(self.performance_history) < self.window_size * 2:
            return None

        # Compare recent performance to historical
        recent = self.performance_history[-self.window_size:]
        historical = self.performance_history[-self.window_size*2:-self.window_size]

        recent_acc = np.mean(recent)
        historical_acc = np.mean(historical)

        # A significant accuracy drop indicates concept drift
        return {
            'recent_accuracy': recent_acc,
            'historical_accuracy': historical_acc,
            'drift_detected': (historical_acc - recent_acc) > 0.05
        }
```

**Outputs**:
- Drift scores
- Affected features
- Detection alerts

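The PSI formula used in `calculate_psi` can be checked by hand. Here is a sketch in plain Python on a made-up two-bin distribution (the bin percentages are assumptions for illustration):

```python
import math

def psi(ref_pct, cur_pct, eps=1e-4):
    # Population Stability Index over pre-binned percentages;
    # eps keeps the log finite when a bin is empty
    return sum((c - r) * math.log((c + eps) / (r + eps))
               for r, c in zip(ref_pct, cur_pct))

# Reference traffic split 50/50 across two bins; current traffic 80/20
score = psi([0.5, 0.5], [0.8, 0.2])
print(round(score, 3))  # 0.416
```

A score of roughly 0.416 is well above the common 0.2 rule of thumb for a significant shift, so this example would trip the `ml_data_drift_psi > 0.2` alert defined later in the workflow.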
### Step 3: Root Cause Analysis
**Agent**: experiment-analyst-agent

**Actions**:
```python
class RootCauseAnalyzer:
    def analyze_drift(self, drift_results, current_data, reference_data):
        causes = []

        # 1. Check feature shifts (skip the scalar summary entries that
        #    detect_data_drift stores alongside the per-feature dicts)
        for feature, result in drift_results.items():
            if isinstance(result, dict) and result.get('drift_detected'):
                shift = self.analyze_feature_shift(
                    reference_data[feature],
                    current_data[feature]
                )
                causes.append({
                    'type': 'feature_shift',
                    'feature': feature,
                    'details': shift
                })

        # 2. Check data quality
        quality_issues = self.check_data_quality(current_data)
        if quality_issues:
            causes.append({
                'type': 'data_quality',
                'issues': quality_issues
            })

        # 3. Check for new categories
        new_categories = self.check_new_categories(
            reference_data, current_data
        )
        if new_categories:
            causes.append({
                'type': 'new_categories',
                'details': new_categories
            })

        # 4. Check temporal patterns
        temporal = self.check_temporal_patterns(current_data)
        if temporal['anomaly_detected']:
            causes.append({
                'type': 'temporal_anomaly',
                'details': temporal
            })

        return {
            'causes': causes,
            'severity': self.calculate_severity(causes),
            'recommendations': self.generate_recommendations(causes)
        }

    def analyze_feature_shift(self, reference, current):
        return {
            'ref_mean': reference.mean(),
            'cur_mean': current.mean(),
            'shift': (current.mean() - reference.mean()) / reference.std(),
            'ref_dist': reference.describe().to_dict(),
            'cur_dist': current.describe().to_dict()
        }
```

**Outputs**:
- Root causes identified
- Severity assessment
- Remediation options

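As a hand-checkable instance of the standardized shift that `analyze_feature_shift` reports, consider one feature's reference and current windows (the sample values below are made up):

```python
import statistics

ref = [10.0, 12.0, 11.0, 13.0, 9.0]   # reference window for one feature
cur = [15.0, 17.0, 16.0, 18.0, 14.0]  # current window, shifted upward

# Shift expressed in reference standard deviations, matching
# (cur_mean - ref_mean) / ref_std in analyze_feature_shift
shift = (statistics.mean(cur) - statistics.mean(ref)) / statistics.stdev(ref)
print(round(shift, 2))  # 3.16
```

A shift of about 3 reference standard deviations is far outside normal sampling noise for a stable feature, which is exactly the kind of cause the analyzer records as `feature_shift`. Note that `statistics.stdev` uses the same sample (ddof=1) convention as pandas' `Series.std`.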
### Step 4: Alerting
**Agent**: mlops-engineer-agent

**Actions**:
```python
class AlertManager:
    def __init__(self, config):
        self.slack = SlackClient(config['slack_webhook'])
        self.pagerduty = PagerDutyClient(config['pagerduty_key'])

    def alert(self, severity, message, details):
        if severity == 'critical':
            # Page on-call
            self.pagerduty.trigger(
                description=message,
                severity='critical',
                details=details
            )

        # Always send to Slack
        self.slack.send(
            channel='#ml-alerts',
            text=self.format_message(severity, message, details),
            attachments=self.format_attachments(details)
        )

    def format_message(self, severity, message, details):
        emoji = {'critical': '🚨', 'warning': '⚠️', 'info': 'ℹ️'}
        return f"{emoji[severity]} *{severity.upper()}*: {message}"

# Prometheus alerting rules
alerting_rules = """
groups:
  - name: ml-drift-alerts
    rules:
      - alert: DataDriftDetected
        expr: ml_data_drift_psi > 0.2
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Data drift detected in production"

      - alert: ModelAccuracyDrop
        expr: ml_model_accuracy < 0.85
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Model accuracy below threshold"

      - alert: HighPredictionLatency
        expr: histogram_quantile(0.99, rate(ml_prediction_latency_bucket[5m])) > 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "p99 prediction latency above 500ms"
"""
```

**Outputs**:
- Alerts sent
- Incident created
- Team notified

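The dispatch in `AlertManager.alert` boils down to a routing rule: everything mirrors to Slack, and only critical alerts also page on-call. A sketch (`route` is a hypothetical helper; the channel names mirror the class above):

```python
def route(severity):
    # Every alert goes to Slack; only critical additionally pages on-call
    targets = ['#ml-alerts']
    if severity == 'critical':
        targets.insert(0, 'pagerduty')
    return targets

print(route('warning'))   # ['#ml-alerts']
print(route('critical'))  # ['pagerduty', '#ml-alerts']
```

Keeping the routing rule this small makes it easy to audit that no severity level can silently drop an alert.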
### Step 5: Response
**Agent**: mlops-engineer-agent

**Actions**:
```python
class DriftResponseManager:
    def __init__(self, model_manager, pipeline_manager):
        self.models = model_manager
        self.pipelines = pipeline_manager

    def respond(self, drift_analysis):
        severity = drift_analysis['severity']
        response = {'actions_taken': []}

        if severity == 'critical':
            # Immediate fallback
            self.switch_to_fallback()
            response['actions_taken'].append('switched_to_fallback')

            # Trigger immediate retraining
            run_id = self.pipelines.trigger('emergency_retraining')
            response['actions_taken'].append(f'triggered_retraining:{run_id}')

        elif severity == 'high':
            # Schedule retraining
            self.pipelines.schedule('retraining', priority='high')
            response['actions_taken'].append('scheduled_retraining')

            # Increase monitoring
            self.models.increase_monitoring_frequency()
            response['actions_taken'].append('increased_monitoring')

        elif severity == 'medium':
            # Log for review
            self.log_for_review(drift_analysis)
            response['actions_taken'].append('logged_for_review')

        return response

    def switch_to_fallback(self):
        # Deploy the simpler, more robust fallback model
        fallback_version = self.models.get_fallback_version()
        self.models.deploy(fallback_version)
```

**Outputs**:
- Response actions taken
- Model switched if needed
- Retraining triggered

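The branching in `DriftResponseManager.respond` amounts to a severity-to-actions table, which is also a useful runbook summary. A sketch (action names follow the class above; the table itself is an illustration, not omgkit API):

```python
RESPONSE_PLAN = {
    # severity: actions, in execution order
    'critical': ['switch_to_fallback', 'trigger_retraining'],
    'high':     ['schedule_retraining', 'increase_monitoring'],
    'medium':   ['log_for_review'],
}

def planned_actions(severity):
    # Severities below 'medium' (or unknown ones) take no automatic action
    return RESPONSE_PLAN.get(severity, [])

print(planned_actions('high'))  # ['schedule_retraining', 'increase_monitoring']
```

Encoding the plan as data rather than branches makes it easier to review with on-call engineers and to keep the runbook and the code in sync.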
### Step 6: Documentation
**Agent**: experiment-analyst-agent

**Actions**:
```python
from datetime import datetime

def create_incident_report(drift_analysis, response):
    report = f"""
# Drift Incident Report

## Summary
- **Date**: {datetime.now().isoformat()}
- **Severity**: {drift_analysis['severity']}
- **Type**: {drift_analysis['causes'][0]['type'] if drift_analysis['causes'] else 'Unknown'}

## Detection
- **PSI Score**: {drift_analysis.get('overall_psi', 'N/A')}
- **Affected Features**: {len([c for c in drift_analysis['causes'] if c['type'] == 'feature_shift'])}
- **Detection Method**: Statistical tests (KS, Chi-square)

## Root Cause Analysis
{format_causes(drift_analysis['causes'])}

## Actions Taken
{format_actions(response['actions_taken'])}

## Recommendations
{format_recommendations(drift_analysis['recommendations'])}

## Lessons Learned
- [To be filled post-incident]

## Follow-up Actions
- [ ] Review data pipeline
- [ ] Update monitoring thresholds
- [ ] Retrain model with new data
"""
    return report
```

**Outputs**:
- Incident report
- Lessons learned
- Updated documentation

## Artifacts

- `monitoring/` - Monitoring configurations
- `alerts/` - Alert definitions
- `incidents/` - Incident reports
- `dashboards/` - Grafana dashboards
- `runbooks/` - Response procedures

## Next Workflows

After drift detection:
- → **retraining-workflow** for model updates
- → **model-evaluation-workflow** for post-retraining validation

## Quality Gates

- [ ] All steps completed successfully
- [ ] Metrics meet defined thresholds
- [ ] Documentation updated
- [ ] Artifacts versioned and stored
- [ ] Stakeholder approval obtained