omgkit 2.13.0 → 2.16.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +129 -10
- package/package.json +2 -2
- package/plugin/agents/api-designer.md +5 -0
- package/plugin/agents/architect.md +8 -0
- package/plugin/agents/brainstormer.md +4 -0
- package/plugin/agents/cicd-manager.md +6 -0
- package/plugin/agents/code-reviewer.md +6 -0
- package/plugin/agents/copywriter.md +2 -0
- package/plugin/agents/data-engineer.md +255 -0
- package/plugin/agents/database-admin.md +10 -0
- package/plugin/agents/debugger.md +10 -0
- package/plugin/agents/devsecops.md +314 -0
- package/plugin/agents/docs-manager.md +4 -0
- package/plugin/agents/domain-decomposer.md +181 -0
- package/plugin/agents/embedded-systems.md +397 -0
- package/plugin/agents/fullstack-developer.md +12 -0
- package/plugin/agents/game-systems-designer.md +375 -0
- package/plugin/agents/git-manager.md +10 -0
- package/plugin/agents/journal-writer.md +2 -0
- package/plugin/agents/ml-engineer.md +284 -0
- package/plugin/agents/observability-engineer.md +353 -0
- package/plugin/agents/oracle.md +9 -0
- package/plugin/agents/performance-engineer.md +290 -0
- package/plugin/agents/pipeline-architect.md +6 -0
- package/plugin/agents/planner.md +12 -0
- package/plugin/agents/platform-engineer.md +325 -0
- package/plugin/agents/project-manager.md +3 -0
- package/plugin/agents/researcher.md +5 -0
- package/plugin/agents/scientific-computing.md +426 -0
- package/plugin/agents/scout.md +3 -0
- package/plugin/agents/security-auditor.md +7 -0
- package/plugin/agents/sprint-master.md +17 -0
- package/plugin/agents/tester.md +10 -0
- package/plugin/agents/ui-ux-designer.md +12 -0
- package/plugin/agents/vulnerability-scanner.md +6 -0
- package/plugin/commands/data/pipeline.md +47 -0
- package/plugin/commands/data/quality.md +49 -0
- package/plugin/commands/domain/analyze.md +34 -0
- package/plugin/commands/domain/map.md +41 -0
- package/plugin/commands/game/balance.md +56 -0
- package/plugin/commands/game/optimize.md +62 -0
- package/plugin/commands/iot/provision.md +58 -0
- package/plugin/commands/ml/evaluate.md +47 -0
- package/plugin/commands/ml/train.md +48 -0
- package/plugin/commands/perf/benchmark.md +54 -0
- package/plugin/commands/perf/profile.md +49 -0
- package/plugin/commands/platform/blueprint.md +56 -0
- package/plugin/commands/security/audit.md +54 -0
- package/plugin/commands/security/scan.md +55 -0
- package/plugin/commands/sre/dashboard.md +53 -0
- package/plugin/registry.yaml +787 -0
- package/plugin/skills/ai-ml/experiment-tracking/SKILL.md +338 -0
- package/plugin/skills/ai-ml/feature-stores/SKILL.md +340 -0
- package/plugin/skills/ai-ml/llm-ops/SKILL.md +454 -0
- package/plugin/skills/ai-ml/ml-pipelines/SKILL.md +390 -0
- package/plugin/skills/ai-ml/model-monitoring/SKILL.md +398 -0
- package/plugin/skills/ai-ml/model-serving/SKILL.md +386 -0
- package/plugin/skills/event-driven/cqrs-patterns/SKILL.md +348 -0
- package/plugin/skills/event-driven/event-sourcing/SKILL.md +334 -0
- package/plugin/skills/event-driven/kafka-deep/SKILL.md +252 -0
- package/plugin/skills/event-driven/saga-orchestration/SKILL.md +335 -0
- package/plugin/skills/event-driven/schema-registry/SKILL.md +328 -0
- package/plugin/skills/event-driven/stream-processing/SKILL.md +313 -0
- package/plugin/skills/game/game-audio/SKILL.md +446 -0
- package/plugin/skills/game/game-networking/SKILL.md +490 -0
- package/plugin/skills/game/godot-patterns/SKILL.md +413 -0
- package/plugin/skills/game/shader-programming/SKILL.md +492 -0
- package/plugin/skills/game/unity-patterns/SKILL.md +488 -0
- package/plugin/skills/iot/device-provisioning/SKILL.md +405 -0
- package/plugin/skills/iot/edge-computing/SKILL.md +369 -0
- package/plugin/skills/iot/industrial-protocols/SKILL.md +438 -0
- package/plugin/skills/iot/mqtt-deep/SKILL.md +418 -0
- package/plugin/skills/iot/ota-updates/SKILL.md +426 -0
- package/plugin/skills/microservices/api-gateway-patterns/SKILL.md +201 -0
- package/plugin/skills/microservices/circuit-breaker-patterns/SKILL.md +246 -0
- package/plugin/skills/microservices/contract-testing/SKILL.md +284 -0
- package/plugin/skills/microservices/distributed-tracing/SKILL.md +246 -0
- package/plugin/skills/microservices/service-discovery/SKILL.md +304 -0
- package/plugin/skills/microservices/service-mesh/SKILL.md +181 -0
- package/plugin/skills/mobile-advanced/mobile-ci-cd/SKILL.md +407 -0
- package/plugin/skills/mobile-advanced/mobile-security/SKILL.md +403 -0
- package/plugin/skills/mobile-advanced/offline-first/SKILL.md +473 -0
- package/plugin/skills/mobile-advanced/push-notifications/SKILL.md +494 -0
- package/plugin/skills/mobile-advanced/react-native-deep/SKILL.md +374 -0
- package/plugin/skills/simulation/numerical-methods/SKILL.md +434 -0
- package/plugin/skills/simulation/parallel-computing/SKILL.md +382 -0
- package/plugin/skills/simulation/physics-engines/SKILL.md +377 -0
- package/plugin/skills/simulation/validation-verification/SKILL.md +479 -0
- package/plugin/skills/simulation/visualization-scientific/SKILL.md +365 -0
- package/plugin/stdrules/ALIGNMENT_PRINCIPLE.md +240 -0
- package/plugin/workflows/ai-engineering/agent-development.md +3 -3
- package/plugin/workflows/ai-engineering/fine-tuning.md +3 -3
- package/plugin/workflows/ai-engineering/model-evaluation.md +3 -3
- package/plugin/workflows/ai-engineering/prompt-engineering.md +2 -2
- package/plugin/workflows/ai-engineering/rag-development.md +4 -4
- package/plugin/workflows/ai-ml/data-pipeline.md +188 -0
- package/plugin/workflows/ai-ml/experiment-cycle.md +203 -0
- package/plugin/workflows/ai-ml/feature-engineering.md +208 -0
- package/plugin/workflows/ai-ml/model-deployment.md +199 -0
- package/plugin/workflows/ai-ml/monitoring-setup.md +227 -0
- package/plugin/workflows/api/api-design.md +1 -1
- package/plugin/workflows/api/api-testing.md +2 -2
- package/plugin/workflows/content/technical-docs.md +1 -1
- package/plugin/workflows/database/migration.md +1 -1
- package/plugin/workflows/database/optimization.md +1 -1
- package/plugin/workflows/database/schema-design.md +3 -3
- package/plugin/workflows/development/bug-fix.md +3 -3
- package/plugin/workflows/development/code-review.md +2 -1
- package/plugin/workflows/development/feature.md +3 -3
- package/plugin/workflows/development/refactor.md +2 -2
- package/plugin/workflows/event-driven/consumer-groups.md +190 -0
- package/plugin/workflows/event-driven/event-storming.md +172 -0
- package/plugin/workflows/event-driven/replay-testing.md +186 -0
- package/plugin/workflows/event-driven/saga-implementation.md +206 -0
- package/plugin/workflows/event-driven/schema-evolution.md +173 -0
- package/plugin/workflows/fullstack/authentication.md +4 -4
- package/plugin/workflows/fullstack/full-feature.md +4 -4
- package/plugin/workflows/game-dev/content-pipeline.md +218 -0
- package/plugin/workflows/game-dev/platform-submission.md +263 -0
- package/plugin/workflows/game-dev/playtesting.md +237 -0
- package/plugin/workflows/game-dev/prototype-to-production.md +205 -0
- package/plugin/workflows/microservices/contract-first.md +151 -0
- package/plugin/workflows/microservices/distributed-tracing.md +166 -0
- package/plugin/workflows/microservices/domain-decomposition.md +123 -0
- package/plugin/workflows/microservices/integration-testing.md +149 -0
- package/plugin/workflows/microservices/service-mesh-setup.md +153 -0
- package/plugin/workflows/microservices/service-scaffolding.md +151 -0
- package/plugin/workflows/omega/1000x-innovation.md +2 -2
- package/plugin/workflows/omega/100x-architecture.md +2 -2
- package/plugin/workflows/omega/10x-improvement.md +2 -2
- package/plugin/workflows/quality/performance-optimization.md +2 -2
- package/plugin/workflows/research/best-practices.md +1 -1
- package/plugin/workflows/research/technology-research.md +1 -1
- package/plugin/workflows/security/penetration-testing.md +3 -3
- package/plugin/workflows/security/security-audit.md +3 -3
- package/plugin/workflows/sprint/sprint-execution.md +2 -2
- package/plugin/workflows/sprint/sprint-retrospective.md +1 -1
- package/plugin/workflows/sprint/sprint-setup.md +1 -1
package/plugin/skills/ai-ml/model-monitoring/SKILL.md
@@ -0,0 +1,398 @@
# Model Monitoring

Data drift detection, model performance monitoring, explainability dashboards, and alerting systems.

## Overview

Model monitoring ensures ML models perform correctly in production by detecting drift, tracking performance, and alerting on anomalies.

## Core Concepts

### Types of Drift
- **Data Drift**: Input distribution changes
- **Concept Drift**: Relationship between X→Y changes
- **Prediction Drift**: Output distribution changes
- **Label Drift**: Ground truth distribution changes

### Monitoring Dimensions
- **Data Quality**: Missing values, outliers, schema
- **Model Performance**: Accuracy, latency, throughput
- **Feature Health**: Statistical properties over time
- **Business Metrics**: Revenue impact, user engagement

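The difference between data drift and concept drift can be made concrete with a small simulation. This is an illustrative sketch (variable names and thresholds are ours, not part of the skill):

```python
import numpy as np

rng = np.random.default_rng(0)

# Reference period: x ~ N(0, 1), and the concept is y = 2x + noise
x_ref = rng.normal(0, 1, 10_000)

# Data drift: the input distribution shifts; the X->Y relationship is unchanged
x_drifted = rng.normal(1.5, 1, 10_000)

# Concept drift: inputs look identical, but y|x has flipped sign
x_same = rng.normal(0, 1, 10_000)
y_new_concept = -2 * x_same + rng.normal(0, 0.1, 10_000)

input_shifted = abs(x_drifted.mean() - x_ref.mean()) > 0.5      # data drift signal
inputs_stable = abs(x_same.mean() - x_ref.mean()) < 0.1         # no data drift...
concept_flipped = np.corrcoef(x_same, y_new_concept)[0, 1] < 0  # ...yet y|x changed
```

Input-level tests (KS, PSI) catch the first case but are blind to the second, which is why ground-truth comparison is covered separately below.
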
## Data Drift Detection

### Statistical Tests
```python
from scipy import stats
import numpy as np
from typing import Any, Dict

class DriftDetector:
    def __init__(self, reference_data: np.ndarray):
        self.reference = reference_data

    def detect_drift(
        self,
        current_data: np.ndarray,
        method: str = "ks"
    ) -> Dict[str, Any]:
        if method == "ks":
            # Kolmogorov-Smirnov test for continuous features
            statistic, p_value = stats.ks_2samp(self.reference, current_data)
        elif method == "chi2":
            # Chi-squared test for categorical features: both arrays must be
            # per-category frequency counts; scale expected to the observed total
            expected = self.reference * current_data.sum() / self.reference.sum()
            statistic, p_value = stats.chisquare(current_data, expected)
        elif method == "psi":
            # Population Stability Index (no p-value; use a threshold instead)
            statistic = self._calculate_psi(current_data)
            p_value = None
        else:
            raise ValueError(f"Unknown drift method: {method}")

        return {
            "statistic": statistic,
            "p_value": p_value,
            "drift_detected": p_value < 0.05 if p_value is not None else statistic > 0.25
        }

    def _calculate_psi(self, current_data: np.ndarray) -> float:
        # Bin both samples using edges derived from the reference distribution
        bins = np.histogram_bin_edges(self.reference, bins=10)
        ref_counts = np.histogram(self.reference, bins=bins)[0]
        cur_counts = np.histogram(current_data, bins=bins)[0]

        # PSI = sum((cur% - ref%) * ln(cur% / ref%)); epsilon avoids log(0)
        ref_pct = ref_counts / len(self.reference) + 0.0001
        cur_pct = cur_counts / len(current_data) + 0.0001

        psi = np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct))
        return float(psi)
```

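To show how PSI behaves against its conventional thresholds, here is a self-contained sketch using the same binning scheme as `_calculate_psi` above (synthetic data; the 0.1/0.25 cutoffs follow common practice):

```python
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, n_bins: int = 10) -> float:
    # Same computation as DriftDetector._calculate_psi
    bins = np.histogram_bin_edges(reference, bins=n_bins)
    ref_pct = np.histogram(reference, bins=bins)[0] / len(reference) + 0.0001
    cur_pct = np.histogram(current, bins=bins)[0] / len(current) + 0.0001
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(42)
reference = rng.normal(0, 1, 5_000)

stable = psi(reference, rng.normal(0, 1, 5_000))   # same distribution
shifted = psi(reference, rng.normal(1, 1, 5_000))  # mean shifted by one sigma

# Common convention: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant
```

A one-sigma mean shift lands well past the 0.25 threshold, while sampling noise on a stable feature stays far below 0.1.
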
### Evidently AI Integration
```python
from evidently import ColumnMapping
from evidently.report import Report
from evidently.metrics import (
    DataDriftTable,
    DatasetDriftMetric,
    ColumnDriftMetric
)

# Define column mapping
column_mapping = ColumnMapping(
    target="label",
    prediction="prediction",
    numerical_features=["feature1", "feature2", "feature3"],
    categorical_features=["category1", "category2"]
)

# Create drift report (DataDriftPreset from evidently.metric_preset
# bundles these metrics if you prefer a one-liner)
report = Report(metrics=[
    DatasetDriftMetric(),
    DataDriftTable(),
    ColumnDriftMetric(column_name="feature1"),
    ColumnDriftMetric(column_name="feature2")
])

# reference_df and current_df are pandas DataFrames with the mapped columns
report.run(
    reference_data=reference_df,
    current_data=current_df,
    column_mapping=column_mapping
)

# Export results
report.save_html("drift_report.html")
drift_metrics = report.as_dict()
```

## Performance Monitoring

### Metrics Tracking
```python
from dataclasses import dataclass
from typing import Any, Optional
from datetime import datetime
import prometheus_client as prom

# Define metrics
PREDICTION_LATENCY = prom.Histogram(
    "model_prediction_latency_seconds",
    "Time spent processing prediction",
    ["model_name", "model_version"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]
)

PREDICTION_COUNT = prom.Counter(
    "model_predictions_total",
    "Total number of predictions",
    ["model_name", "model_version", "prediction_class"]
)

PREDICTION_ERROR = prom.Counter(
    "model_prediction_errors_total",
    "Total prediction errors",
    ["model_name", "model_version", "error_type"]
)

@dataclass
class PredictionLog:
    request_id: str
    model_name: str
    model_version: str
    features: dict
    prediction: Any
    probability: Optional[float]
    latency_ms: float
    timestamp: datetime

class ModelMonitor:
    def __init__(self, model_name: str, model_version: str):
        self.model_name = model_name
        self.model_version = model_version

    def log_prediction(self, log: PredictionLog):
        # Record latency
        PREDICTION_LATENCY.labels(
            model_name=self.model_name,
            model_version=self.model_version
        ).observe(log.latency_ms / 1000)

        # Record prediction class
        PREDICTION_COUNT.labels(
            model_name=self.model_name,
            model_version=self.model_version,
            prediction_class=str(log.prediction)
        ).inc()

        # Store for offline analysis (drift detection, ground truth joins)
        self._store_log(log)

    def log_error(self, error_type: str):
        PREDICTION_ERROR.labels(
            model_name=self.model_name,
            model_version=self.model_version,
            error_type=error_type
        ).inc()

    def _store_log(self, log: PredictionLog):
        # Persist to your prediction store (e.g. a predictions table keyed
        # by request_id so ground truth can be joined later)
        ...
```

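Prometheus histograms give approximate percentiles from bucket counts; for exact offline percentiles over stored prediction logs, a small stdlib-only helper is enough (names and sample values are illustrative):

```python
import statistics

def latency_percentiles(latencies_ms: list) -> dict:
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# 96 typical requests at 20ms, a few fast ones, one 400ms outlier
observed = [12.0, 15.0, 11.0, 400.0] + [20.0] * 96
pcts = latency_percentiles(observed)
```

Note how the p99 responds to the single outlier while p50 does not, which is why the checklist below tracks p50/p95/p99 rather than the mean.
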
### Ground Truth Comparison
```python
import pandas as pd
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, confusion_matrix
)
from datetime import datetime

class PerformanceMonitor:
    def __init__(self, db_connection):
        self.db = db_connection

    def calculate_metrics(
        self,
        start_date: datetime,
        end_date: datetime
    ) -> dict:
        # Join predictions with ground truth collected after the fact
        query = """
            SELECT p.prediction, p.probability, g.actual
            FROM predictions p
            JOIN ground_truth g ON p.request_id = g.request_id
            WHERE p.timestamp BETWEEN %s AND %s
        """

        df = pd.read_sql(query, self.db, params=[start_date, end_date])

        if len(df) == 0:
            return {}

        return {
            "accuracy": accuracy_score(df["actual"], df["prediction"]),
            "precision": precision_score(df["actual"], df["prediction"], average="weighted"),
            "recall": recall_score(df["actual"], df["prediction"], average="weighted"),
            "f1": f1_score(df["actual"], df["prediction"], average="weighted"),
            "auc": roc_auc_score(df["actual"], df["probability"]) if "probability" in df.columns else None,
            "sample_count": len(df),
            "confusion_matrix": confusion_matrix(df["actual"], df["prediction"]).tolist()
        }

    def detect_performance_degradation(
        self,
        current_metrics: dict,
        baseline_metrics: dict,
        threshold: float = 0.05
    ) -> bool:
        for metric in ["accuracy", "precision", "recall", "f1"]:
            if current_metrics[metric] < baseline_metrics[metric] - threshold:
                return True
        return False
```

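The degradation check is just a thresholded comparison against a baseline. A standalone version of the same logic, with hypothetical metric snapshots:

```python
def degraded(current: dict, baseline: dict, threshold: float = 0.05) -> bool:
    # Mirrors detect_performance_degradation above
    return any(
        current[m] < baseline[m] - threshold
        for m in ["accuracy", "precision", "recall", "f1"]
    )

baseline = {"accuracy": 0.91, "precision": 0.90, "recall": 0.89, "f1": 0.90}
healthy  = {"accuracy": 0.90, "precision": 0.88, "recall": 0.87, "f1": 0.88}
drifted  = {"accuracy": 0.84, "precision": 0.88, "recall": 0.87, "f1": 0.88}
```

Small fluctuations stay inside the 0.05 tolerance; a single metric falling further than that (here accuracy, down 0.07) flags degradation.
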
## Alerting System

### Alert Configuration
```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List

class AlertSeverity(Enum):
    INFO = "info"
    WARNING = "warning"
    CRITICAL = "critical"

@dataclass
class AlertRule:
    name: str
    condition: Callable[[dict], bool]
    severity: AlertSeverity
    message_template: str

class AlertManager:
    def __init__(self, rules: List[AlertRule]):
        self.rules = rules
        self.channels = []

    def add_channel(self, channel):
        self.channels.append(channel)

    def evaluate(self, metrics: dict):
        for rule in self.rules:
            if rule.condition(metrics):
                alert = {
                    "name": rule.name,
                    "severity": rule.severity,
                    "message": rule.message_template.format(**metrics),
                    "metrics": metrics
                }
                self._send_alert(alert)

    def _send_alert(self, alert: dict):
        for channel in self.channels:
            channel.send(alert)

# Define rules
rules = [
    AlertRule(
        name="accuracy_drop",
        condition=lambda m: m.get("accuracy", 1.0) < 0.80,
        severity=AlertSeverity.CRITICAL,
        message_template="Model accuracy dropped to {accuracy:.2%}"
    ),
    AlertRule(
        name="high_latency",
        condition=lambda m: m.get("p99_latency_ms", 0) > 500,
        severity=AlertSeverity.WARNING,
        message_template="P99 latency is {p99_latency_ms}ms"
    ),
    AlertRule(
        name="data_drift",
        condition=lambda m: m.get("drift_score", 0) > 0.25,
        severity=AlertSeverity.WARNING,
        message_template="Data drift detected: PSI={drift_score:.3f}"
    )
]
```

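Evaluating the same three conditions against a metrics snapshot can be sketched standalone (the snapshot values are invented for illustration):

```python
# Same conditions as the AlertRule definitions above
conditions = {
    "accuracy_drop": lambda m: m.get("accuracy", 1.0) < 0.80,
    "high_latency": lambda m: m.get("p99_latency_ms", 0) > 500,
    "data_drift": lambda m: m.get("drift_score", 0) > 0.25,
}

metrics = {"accuracy": 0.76, "p99_latency_ms": 120, "drift_score": 0.31}
fired = [name for name, cond in conditions.items() if cond(metrics)]
```

With this snapshot, accuracy and drift fire while latency does not; `AlertManager.evaluate` would then fan each fired alert out to every registered channel.
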
### Slack Integration
```python
import requests
from typing import Dict

class SlackChannel:
    # AlertSeverity is the enum defined in the alert configuration above
    def __init__(self, webhook_url: str):
        self.webhook_url = webhook_url

    def send(self, alert: Dict):
        color = {
            AlertSeverity.INFO: "#36a64f",
            AlertSeverity.WARNING: "#ff9800",
            AlertSeverity.CRITICAL: "#f44336"
        }[alert["severity"]]

        payload = {
            "attachments": [{
                "color": color,
                "title": f":warning: {alert['name']}",
                "text": alert["message"],
                "fields": [
                    {"title": k, "value": str(v), "short": True}
                    for k, v in alert["metrics"].items()
                ],
                "footer": "ML Monitoring System"
            }]
        }

        requests.post(self.webhook_url, json=payload, timeout=5)
```

## Monitoring Dashboard

### Grafana Queries
```promql
# Prediction latency percentiles
histogram_quantile(0.99,
  sum(rate(model_prediction_latency_seconds_bucket[5m])) by (le, model_name)
)

# Predictions per second
sum(rate(model_predictions_total[1m])) by (model_name, prediction_class)

# Error rate
sum(rate(model_prediction_errors_total[5m]))
/
sum(rate(model_predictions_total[5m]))

# Feature drift over time (custom metric)
model_feature_drift_psi{feature_name=~"feature.*"}
```

## Best Practices

1. **Baseline Metrics**: Establish clear baselines before deployment
2. **Granular Monitoring**: Analyze per segment, not just global averages
3. **Alert Fatigue**: Tune thresholds carefully to keep alerts actionable
4. **Root Cause Analysis**: Correlate drift, quality, and performance metrics
5. **Automated Remediation**: Trigger retraining when degradation is confirmed

## Monitoring Checklist

```
□ Data quality checks (schema, nulls, ranges)
□ Feature distribution monitoring
□ Prediction distribution tracking
□ Latency monitoring (p50, p95, p99)
□ Error rate tracking
□ Ground truth collection pipeline
□ Performance metrics computation
□ Drift detection (statistical tests)
□ Alert rules configured
□ Dashboard created
□ Runbook documented
```

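The checklist items compose into a periodic job. A minimal scheduling sketch, where every callable is a hypothetical placeholder you would wire to the components above:

```python
from typing import Callable

def run_monitoring_cycle(
    fetch_metrics: Callable[[], dict],
    check_drift: Callable[[dict], bool],
    check_degradation: Callable[[dict], bool],
    alert: Callable[[str, dict], None],
) -> dict:
    # One pass of the checklist: gather metrics, run tests, alert on failures
    metrics = fetch_metrics()
    if check_drift(metrics):
        alert("data_drift", metrics)
    if check_degradation(metrics):
        alert("performance_degradation", metrics)
    return metrics

# Stubbed run with canned values: drift fires, degradation does not
fired = []
run_monitoring_cycle(
    fetch_metrics=lambda: {"psi": 0.3, "accuracy": 0.9},
    check_drift=lambda m: m["psi"] > 0.25,
    check_degradation=lambda m: m["accuracy"] < 0.85,
    alert=lambda name, m: fired.append(name),
)
```

In production the same function would run on a scheduler (cron, Airflow), with `alert` backed by the `AlertManager` defined earlier.
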
## Anti-Patterns

- Monitoring only accuracy
- Ignoring data quality
- Overly sensitive alert thresholds
- Missing ground truth pipeline
- No historical comparison

## When to Use

- Production ML models
- High-stakes predictions
- Regulatory requirements
- Continuous learning systems
- Multi-model environments

## When NOT to Use

- Development/testing only
- Batch jobs with manual review
- Very stable domains
- Low-volume predictions