omgkit 2.20.0 → 2.21.1
- package/README.md +125 -10
- package/package.json +1 -1
- package/plugin/agents/ai-architect-agent.md +282 -0
- package/plugin/agents/data-scientist-agent.md +221 -0
- package/plugin/agents/experiment-analyst-agent.md +318 -0
- package/plugin/agents/ml-engineer-agent.md +165 -0
- package/plugin/agents/mlops-engineer-agent.md +324 -0
- package/plugin/agents/model-optimizer-agent.md +287 -0
- package/plugin/agents/production-engineer-agent.md +360 -0
- package/plugin/agents/research-scientist-agent.md +274 -0
- package/plugin/commands/omgdata/augment.md +86 -0
- package/plugin/commands/omgdata/collect.md +81 -0
- package/plugin/commands/omgdata/label.md +83 -0
- package/plugin/commands/omgdata/split.md +83 -0
- package/plugin/commands/omgdata/validate.md +76 -0
- package/plugin/commands/omgdata/version.md +85 -0
- package/plugin/commands/omgdeploy/ab.md +94 -0
- package/plugin/commands/omgdeploy/cloud.md +89 -0
- package/plugin/commands/omgdeploy/edge.md +93 -0
- package/plugin/commands/omgdeploy/package.md +91 -0
- package/plugin/commands/omgdeploy/serve.md +92 -0
- package/plugin/commands/omgfeature/embed.md +93 -0
- package/plugin/commands/omgfeature/extract.md +93 -0
- package/plugin/commands/omgfeature/select.md +85 -0
- package/plugin/commands/omgfeature/store.md +97 -0
- package/plugin/commands/omgml/init.md +60 -0
- package/plugin/commands/omgml/status.md +82 -0
- package/plugin/commands/omgops/drift.md +87 -0
- package/plugin/commands/omgops/monitor.md +99 -0
- package/plugin/commands/omgops/pipeline.md +102 -0
- package/plugin/commands/omgops/registry.md +109 -0
- package/plugin/commands/omgops/retrain.md +91 -0
- package/plugin/commands/omgoptim/distill.md +90 -0
- package/plugin/commands/omgoptim/profile.md +92 -0
- package/plugin/commands/omgoptim/prune.md +81 -0
- package/plugin/commands/omgoptim/quantize.md +83 -0
- package/plugin/commands/omgtrain/baseline.md +78 -0
- package/plugin/commands/omgtrain/compare.md +99 -0
- package/plugin/commands/omgtrain/evaluate.md +85 -0
- package/plugin/commands/omgtrain/train.md +81 -0
- package/plugin/commands/omgtrain/tune.md +89 -0
- package/plugin/registry.yaml +252 -2
- package/plugin/skills/ml-systems/SKILL.md +65 -0
- package/plugin/skills/ml-systems/ai-accelerators/SKILL.md +342 -0
- package/plugin/skills/ml-systems/data-eng/SKILL.md +126 -0
- package/plugin/skills/ml-systems/deep-learning-primer/SKILL.md +143 -0
- package/plugin/skills/ml-systems/deployment-paradigms/SKILL.md +148 -0
- package/plugin/skills/ml-systems/dnn-architectures/SKILL.md +128 -0
- package/plugin/skills/ml-systems/edge-deployment/SKILL.md +366 -0
- package/plugin/skills/ml-systems/efficient-ai/SKILL.md +316 -0
- package/plugin/skills/ml-systems/feature-engineering/SKILL.md +151 -0
- package/plugin/skills/ml-systems/ml-frameworks/SKILL.md +187 -0
- package/plugin/skills/ml-systems/ml-serving-optimization/SKILL.md +371 -0
- package/plugin/skills/ml-systems/ml-systems-fundamentals/SKILL.md +103 -0
- package/plugin/skills/ml-systems/ml-workflow/SKILL.md +162 -0
- package/plugin/skills/ml-systems/mlops/SKILL.md +386 -0
- package/plugin/skills/ml-systems/model-deployment/SKILL.md +350 -0
- package/plugin/skills/ml-systems/model-dev/SKILL.md +160 -0
- package/plugin/skills/ml-systems/model-optimization/SKILL.md +339 -0
- package/plugin/skills/ml-systems/robust-ai/SKILL.md +395 -0
- package/plugin/skills/ml-systems/training-data/SKILL.md +152 -0
- package/plugin/workflows/ml-systems/data-preparation-workflow.md +276 -0
- package/plugin/workflows/ml-systems/edge-deployment-workflow.md +413 -0
- package/plugin/workflows/ml-systems/full-ml-lifecycle-workflow.md +405 -0
- package/plugin/workflows/ml-systems/hyperparameter-tuning-workflow.md +352 -0
- package/plugin/workflows/ml-systems/mlops-pipeline-workflow.md +384 -0
- package/plugin/workflows/ml-systems/model-deployment-workflow.md +392 -0
- package/plugin/workflows/ml-systems/model-development-workflow.md +218 -0
- package/plugin/workflows/ml-systems/model-evaluation-workflow.md +416 -0
- package/plugin/workflows/ml-systems/model-optimization-workflow.md +390 -0
- package/plugin/workflows/ml-systems/monitoring-drift-workflow.md +446 -0
- package/plugin/workflows/ml-systems/retraining-workflow.md +401 -0
- package/plugin/workflows/ml-systems/training-pipeline-workflow.md +382 -0
@@ -0,0 +1,382 @@
---
name: Training Pipeline Workflow
description: Automated training pipeline workflow for reproducible model training with experiment tracking and model registration.
category: ml-systems
complexity: medium
agents:
  - ml-engineer-agent
  - mlops-engineer-agent
---

# Training Pipeline Workflow

Automated pipeline for reproducible model training.

## Overview

```
┌─────────────────────────────────────────────────────────────┐
│                    TRAINING PIPELINE WORKFLOW               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  TRIGGER             DATA PREP           TRAINING           │
│  ────────            ─────────           ────────           │
│  Schedule            Load features       Train model        │
│  Manual              Validate            Log metrics        │
│  Drift detect        Split               Save checkpoint    │
│                                                             │
│  EVALUATION          REGISTRATION        NOTIFICATION       │
│  ──────────          ────────────        ────────────       │
│  Test metrics        Model registry      Slack/Email        │
│  Comparison          Version tag         Dashboard update   │
│  Quality gates       Artifacts           Next steps         │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

## Pipeline Configuration

```yaml
# pipeline_config.yaml
pipeline:
  name: model_training_pipeline
  schedule: "0 2 * * 0"  # Weekly at 2 AM Sunday
  timeout: 3600  # 1 hour
  retries: 2

data:
  source: feature_store
  features:
    - user_features
    - transaction_features
  target: is_churned
  split:
    train: 0.7
    val: 0.15
    test: 0.15

training:
  model_type: xgboost
  hyperparameters:
    max_depth: 6
    learning_rate: 0.1
    n_estimators: 100
  early_stopping:
    patience: 10
    metric: val_auc

evaluation:
  metrics:
    - accuracy
    - precision
    - recall
    - f1
    - auc
  thresholds:
    auc: 0.85
    precision: 0.80

registration:
  model_name: churn_predictor
  auto_promote: false
```
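A configuration like this is easy to get subtly wrong, for example split ratios that do not sum to 1 or a threshold on a metric that is never computed. The sketch below shows one way to check for those two mistakes before a run starts. It is only an illustration: `CONFIG` is a hand-written dict mirroring part of the YAML, and `check_config` is a hypothetical helper, not part of omgkit.

```python
import math

# Hypothetical in-memory mirror of pipeline_config.yaml (only the keys checked here).
CONFIG = {
    "data": {"split": {"train": 0.7, "val": 0.15, "test": 0.15}},
    "evaluation": {
        "metrics": ["accuracy", "precision", "recall", "f1", "auc"],
        "thresholds": {"auc": 0.85, "precision": 0.80},
    },
}


def check_config(config):
    """Return a list of readable problems; an empty list means no issue was found."""
    problems = []

    # The three partitions should cover the whole dataset exactly once.
    split = config["data"]["split"]
    total = sum(split.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        problems.append(f"split ratios sum to {total}, expected 1.0")
    if any(ratio <= 0 for ratio in split.values()):
        problems.append("every split ratio must be positive")

    # A quality gate only makes sense for a metric that is actually reported.
    reported = set(config["evaluation"]["metrics"])
    for name, threshold in config["evaluation"]["thresholds"].items():
        if name not in reported:
            problems.append(f"threshold set for unreported metric '{name}'")
        if not 0.0 <= threshold <= 1.0:
            problems.append(f"threshold for '{name}' must be within [0, 1]")

    return problems
```

Calling `check_config(CONFIG)` on the values shown above returns an empty list. Returning a list instead of raising on the first problem lets a pipeline report every configuration issue in one pass.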

## Steps

### Step 1: Pipeline Trigger
**Agent**: mlops-engineer-agent

**Triggers**:
- Scheduled (cron)
- Manual trigger
- Drift detection alert
- New data arrival
- CI/CD push

**Actions**:
```bash
# Create/update pipeline
/omgops:pipeline --config pipeline_config.yaml --action create

# Manual trigger
/omgops:pipeline --name model_training_pipeline --action run
```
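The scheduled trigger uses the cron expression `0 2 * * 0` from the configuration, where the last field `0` means Sunday. As a rough illustration of what that schedule resolves to, the following sketch computes the next run time with the standard library only. `next_weekly_run` is a hypothetical helper, it handles only this weekly pattern (not general cron syntax), and it ignores time zones.

```python
from datetime import datetime, timedelta


def next_weekly_run(now, weekday=6, hour=2, minute=0):
    """Return the next run strictly after `now` for a weekly schedule.

    `weekday` follows datetime.weekday(), where Monday is 0 and Sunday is 6.
    Cron writes Sunday as 0, so "0 2 * * 0" maps to weekday=6 here.
    """
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        # The slot for this week has already passed, so move to next week.
        candidate += timedelta(days=7)
    return candidate
```

For example, `next_weekly_run(datetime(2024, 1, 3, 12, 0))` (a Wednesday) gives Sunday 7 January 2024 at 02:00.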

### Step 2: Data Preparation
**Agent**: ml-engineer-agent

**Inputs**:
- Feature store reference
- Data version
- Split configuration

**Actions**:
```python
# Pipeline step: data_preparation
# feature_store, validate_data, split_data and DataValidationError are
# project-level helpers that this workflow assumes are already defined.
def prepare_data(config, entity_df):
    # Load features from feature store (entity_df holds the entity keys to look up)
    features = feature_store.get_historical_features(
        entity_df=entity_df,
        features=config['data']['features']
    )

    # Validate data; pipeline_config.yaml defines no data.schema key,
    # so fall back to None instead of raising KeyError
    validation_result = validate_data(features, config['data'].get('schema'))
    if not validation_result.passed:
        raise DataValidationError(validation_result.errors)

    # Split data, stratified on the target column
    train, val, test = split_data(
        features,
        ratios=config['data']['split'],
        stratify=config['data']['target']
    )

    return train, val, test
```

**Outputs**:
- Prepared datasets
- Data validation report
- Split statistics
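The step above relies on a `split_data` helper that the workflow does not define. A minimal sketch of what a stratified 70/15/15 split could look like is shown below, using only the standard library. The name `stratified_split`, the list-of-dicts row format, and the fixed seed are assumptions made for illustration; a real pipeline would more likely use its own data frame tooling.

```python
import random
from collections import defaultdict


def stratified_split(rows, target, ratios, seed=42):
    """Split `rows` (a list of dicts) into train/val/test partitions while
    keeping the class balance of `row[target]` similar in each partition."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible

    # Group rows by class so each class is split with the same ratios.
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)

    train, val, test = [], [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_train = round(len(members) * ratios["train"])
        n_val = round(len(members) * ratios["val"])
        train.extend(members[:n_train])
        val.extend(members[n_train:n_train + n_val])
        test.extend(members[n_train + n_val:])  # remainder goes to test
    return train, val, test
```

Splitting per class keeps the churn rate roughly equal across the three partitions, which matters when the positive class is rare.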

### Step 3: Model Training
**Agent**: ml-engineer-agent

**Inputs**:
- Training data
- Hyperparameters
- Training configuration

**Actions**:
```bash
# Execute training
/omgtrain:train --config pipeline_config.yaml --experiment-name weekly_training
```

```python
# Pipeline step: train_model
from datetime import datetime

import mlflow
import mlflow.xgboost
from xgboost import XGBClassifier


def train_model(train_data, val_data, config):
    with mlflow.start_run(run_name=f"train_{datetime.now().isoformat()}") as run:
        # Log parameters
        mlflow.log_params(config['training']['hyperparameters'])

        # Initialize model; XGBoost 1.6+ accepts early-stopping settings in the
        # constructor, and 2.0 removed them from fit()
        model = XGBClassifier(
            **config['training']['hyperparameters'],
            eval_metric='auc',
            early_stopping_rounds=config['training']['early_stopping']['patience']
        )

        # Train with early stopping on the validation set
        model.fit(
            train_data.X, train_data.y,
            eval_set=[(val_data.X, val_data.y)]
        )

        # Log training metrics, one value per boosting round
        for metric, values in model.evals_result()['validation_0'].items():
            for i, v in enumerate(values):
                mlflow.log_metric(f"val_{metric}", v, step=i)

        # Save model checkpoint
        mlflow.xgboost.log_model(model, "model")

    # Read the run ID from the captured run; no run is active once the block exits
    return model, run.info.run_id
```

**Outputs**:
- Trained model
- Training metrics
- MLflow run ID
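The `patience` setting controls early stopping: training halts once the validation metric has gone that many rounds without improving. The sketch below illustrates that rule on a plain list of per-round scores. It is a simplified model of the idea, not XGBoost's implementation, and `best_round_with_patience` is a hypothetical name; it assumes a higher-is-better metric such as AUC and a non-empty score list.

```python
def best_round_with_patience(scores, patience):
    """Return (best_round, stopped_round) for a higher-is-better metric.

    Training stops at the first round where `patience` rounds have passed
    without beating the best score seen so far. If that never happens,
    it runs through the last round.
    """
    best_round, best_score = 0, float("-inf")
    for i, score in enumerate(scores):
        if score > best_score:
            best_round, best_score = i, score
        elif i - best_round >= patience:
            return best_round, i
    return best_round, len(scores) - 1
```

With scores `[0.70, 0.75, 0.80, 0.79, 0.80, 0.78, 0.81]` and `patience=3`, the best round is 2 and training stops at round 5, so the later 0.81 is never reached. That trade-off is why the configured patience of 10 is more forgiving than a small value.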

### Step 4: Evaluation
**Agent**: experiment-analyst-agent

**Inputs**:
- Trained model
- Test dataset
- Evaluation thresholds

**Actions**:
```bash
# Evaluate model
/omgtrain:evaluate --run-id <run_id> --data test.csv --thresholds thresholds.yaml
```

```python
# Pipeline step: evaluate_model
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
)


def evaluate_model(model, test_data, config):
    predictions = model.predict(test_data.X)
    # Probability of the positive class, needed for AUC
    probabilities = model.predict_proba(test_data.X)[:, 1]

    metrics = {
        'accuracy': accuracy_score(test_data.y, predictions),
        'precision': precision_score(test_data.y, predictions),
        'recall': recall_score(test_data.y, predictions),
        'f1': f1_score(test_data.y, predictions),
        'auc': roc_auc_score(test_data.y, probabilities)
    }

    # Check quality gates
    quality_passed = all(
        metrics[metric] >= threshold
        for metric, threshold in config['evaluation']['thresholds'].items()
    )
    # Step 5 reads this flag from the metrics dict
    metrics['quality_passed'] = quality_passed

    return metrics, quality_passed
```

**Outputs**:
- Evaluation metrics
- Quality gate results
- Error analysis
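The quality-gate check in this step reduces everything to one boolean, which says nothing about which gate failed. One possible refinement, sketched below, returns the failing metrics together with their observed and required values. `failed_gates` is a hypothetical helper that is not part of the workflow, and treating a missing metric as a failure is a design assumption.

```python
def failed_gates(metrics, thresholds):
    """Map each failing metric to (observed, required).

    An empty dict means every gate passed. A metric that has a threshold
    but no observed value is treated as failing.
    """
    failures = {}
    for name, required in thresholds.items():
        observed = metrics.get(name)
        if observed is None or observed < required:
            failures[name] = (observed, required)
    return failures
```

A caller could then derive the overall flag as `not failed_gates(metrics, thresholds)` and include the returned dict in the evaluation report.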

### Step 5: Model Registration
**Agent**: mlops-engineer-agent

**Inputs**:
- Trained model
- Evaluation results
- Registration configuration

**Actions**:
```bash
# Register model
/omgops:registry --run-id <run_id> --model-name churn_predictor --stage staging
```

```python
# Pipeline step: register_model
import json
import logging

import mlflow
from mlflow.tracking import MlflowClient


def register_model(run_id, metrics, config):
    # Expects the quality-gate flag to be stored inside the metrics dict
    if not metrics['quality_passed']:
        logging.warning("Quality gates not passed, skipping registration")
        return None

    # Register model version from the artifact logged during training
    model_version = mlflow.register_model(
        f"runs:/{run_id}/model",
        config['registration']['model_name']
    )

    # Add metadata (tag values are strings, so serialize the metrics as JSON)
    client = MlflowClient()
    client.set_model_version_tag(
        name=config['registration']['model_name'],
        version=model_version.version,
        key="metrics",
        value=json.dumps(metrics)
    )

    # Auto-promote if configured; note that recent MLflow releases deprecate
    # model stages in favor of aliases
    if config['registration']['auto_promote']:
        client.transition_model_version_stage(
            name=config['registration']['model_name'],
            version=model_version.version,
            stage="Staging"
        )

    return model_version
```

**Outputs**:
- Registered model version
- Model artifacts
- Promotion status
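The "promotion status" output follows from two inputs: whether the quality gates passed and whether `auto_promote` is enabled. As a small sketch of that decision, assuming three status labels that are invented here for illustration:

```python
def promotion_status(quality_passed, auto_promote):
    """Summarize the registration outcome as one of three hypothetical labels."""
    if not quality_passed:
        return "skipped"      # gates failed, nothing was registered
    if auto_promote:
        return "staging"      # registered and moved to the Staging stage
    return "registered"       # registered, waiting for manual promotion
```

With the configuration shown earlier (`auto_promote: false`), a passing model ends up as `"registered"` and waits for a manual promotion.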

### Step 6: Notification
**Agent**: mlops-engineer-agent

**Inputs**:
- Pipeline results
- Metrics
- Status

**Actions**:
```python
# Pipeline step: notify
# send_slack_notification and update_training_dashboard are project-level
# helpers that this workflow assumes are already defined.
def notify_completion(results):
    message = f"""
    🤖 Training Pipeline Complete

    Model: {results['model_name']}
    Version: {results['version']}
    Status: {'✅ Passed' if results['quality_passed'] else '❌ Failed'}

    Metrics:
    - AUC: {results['metrics']['auc']:.4f}
    - F1: {results['metrics']['f1']:.4f}

    Next: {'Ready for review' if results['quality_passed'] else 'Investigate failures'}
    """

    # Send to Slack
    send_slack_notification(message, channel="#ml-alerts")

    # Update dashboard
    update_training_dashboard(results)
```

**Outputs**:
- Notifications sent
- Dashboard updated
- Logs archived
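The notification step calls `send_slack_notification` without showing how the message reaches Slack. One common route is a Slack incoming webhook, which accepts a JSON body with a `text` field. The sketch below only builds the HTTP request with the standard library so that it can be inspected without network access. `build_slack_request` is a hypothetical helper and the URL used in the example is a placeholder, not a real webhook.

```python
import json
import urllib.request


def build_slack_request(webhook_url, message):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    payload = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it is then a single call such as `urllib.request.urlopen(build_slack_request(url, message), timeout=10)`, ideally wrapped in error handling so that a failed notification does not fail the whole pipeline.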

## Airflow DAG

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# The step functions defined earlier take arguments (config, datasets, model, ...);
# supply them through op_kwargs or XCom before running this DAG.
with DAG(
    'model_training_pipeline',
    schedule_interval='0 2 * * 0',  # Weekly at 2 AM Sunday; named `schedule` from Airflow 2.4 on
    start_date=datetime(2024, 1, 1),  # Placeholder; Airflow needs a start date to schedule runs
    catchup=False
) as dag:

    prepare = PythonOperator(
        task_id='prepare_data',
        python_callable=prepare_data
    )

    train = PythonOperator(
        task_id='train_model',
        python_callable=train_model
    )

    evaluate = PythonOperator(
        task_id='evaluate_model',
        python_callable=evaluate_model
    )

    register = PythonOperator(
        task_id='register_model',
        python_callable=register_model
    )

    notify = PythonOperator(
        task_id='notify',
        python_callable=notify_completion
    )

    # Run the steps strictly in sequence
    prepare >> train >> evaluate >> register >> notify
```
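The DAG fixes the order of the steps but does not show how each step's output reaches the next one. The plain-Python sketch below makes that data flow explicit, which can help when deciding what to pass through `op_kwargs` or XCom. `run_pipeline` is a hypothetical driver, the step callables are passed in as a dict, and their signatures are simplified assumptions (extra arguments can be bound with `functools.partial`).

```python
def run_pipeline(config, steps):
    """Run the five steps in DAG order, threading each result into the next.

    `steps` maps step names to callables; the keys of the returned dict
    are the ones the notification step reads.
    """
    train, val, test = steps["prepare_data"](config)
    model, run_id = steps["train_model"](train, val, config)
    metrics, quality_passed = steps["evaluate_model"](model, test, config)
    model_version = steps["register_model"](run_id, metrics, config)

    results = {
        "model_name": config["registration"]["model_name"],
        # register_model returns None when the quality gates fail
        "version": getattr(model_version, "version", None),
        "quality_passed": quality_passed,
        "metrics": metrics,
    }
    steps["notify"](results)
    return results
```

Passing the steps in as callables keeps the driver testable with stubs, without a tracking server or a scheduler.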

## Artifacts

- `pipeline_config.yaml` - Pipeline configuration
- `mlflow/` - Experiment tracking
- `models/` - Model artifacts
- `logs/` - Pipeline logs
- `reports/` - Evaluation reports

## Next Workflows

After training pipeline:
- → **model-evaluation-workflow** for detailed analysis
- → **model-deployment-workflow** for production

## Quality Gates

- [ ] All steps completed successfully
- [ ] Metrics meet defined thresholds
- [ ] Documentation updated
- [ ] Artifacts versioned and stored
- [ ] Stakeholder approval obtained