npm - @jetrabbits/agentic - Versions diffs - 0.0.1 - Mend

@jetrabbits/agentic 0.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (440)

package/areas/software/mlops/prompts/model-incident.md ADDED

@@ -0,0 +1,87 @@
+---
+workflow: model-incident
+---
+# Prompt: `/model-incident`
+Use when: responding to model degradation, drift, bias, or endpoint outages that require rollback, diagnosis, and scoped remediation.
+---
+## Example 1 — Prediction drift after feature pipeline change
+**EN:**
+```
+/model-incident
+Model name: churn-predictor-v9
+Incident type: drift
+Symptoms:
+- PSI on feature monthly_sessions = 0.34 (baseline alert threshold 0.2)
+- probability distribution shifted heavily upward after 2026-03-20 09:00 UTC
+- no serving error spike, but retention team reports doubled intervention volume
+Recent changes: feature store job for monthly_sessions moved from daily batch to hourly incremental pipeline
+Immediate need:
+- decide whether to tolerate degraded predictions or rollback to previous champion
+- scope the affected prediction window
+- identify whether feature semantics changed or only data freshness changed
+Output: severity classification, rollback recommendation, affected window, and remediation path
+```
+**RU:**
+```
+/model-incident
+Model name: churn-predictor-v9
+Тип инцидента: drift
+Симптомы:
+- PSI по признаку monthly_sessions = 0.34 (порог алерта baseline = 0.2)
+- распределение вероятностей сильно сместилось вверх после 2026-03-20 09:00 UTC
+- всплеска serving error нет, но retention team сообщает о двукратном росте intervention volume
+Недавние изменения: job feature store для monthly_sessions переведён с daily batch на hourly incremental pipeline
+Нужно немедленно:
+- решить, терпим ли деградированные предсказания или откатываемся к предыдущему champion
+- определить окно затронутых предсказаний
+- понять, изменилась ли семантика признака или только data freshness
+Результат: классификация severity, рекомендация по rollback, affected window и путь remediation
+```
+---
+## Example 2 — Inference outage after container update
+**EN:**
+```
+/model-incident
+Model name: recommendations-transformer-v4
+Incident type: outage
+Symptoms:
+- endpoint 5xx rate = 18% for 12 minutes
+- pod logs show "CUDA out of memory" after new serving image deploy
+- HPA scaled from 4 to 8 replicas, but latency still above 2.5s and readiness probes flap
+Current impact: homepage recommendation widgets empty for 30% of sessions
+Required response:
+- stabilize service within 5 minutes, including rollback if needed
+- determine whether the issue is model artifact size, batch size config, or infrastructure sizing
+- notify downstream product analytics about affected prediction interval
+Output: immediate response plan, rollback or mitigation action, and post-incident prevention changes
+```
+**RU:**
+```
+/model-incident
+Model name: recommendations-transformer-v4
+Тип инцидента: outage
+Симптомы:
+- 5xx rate endpoint'а = 18% в течение 12 минут
+- логи pod'ов показывают "CUDA out of memory" после деплоя нового serving image
+- HPA масштабировался с 4 до 8 реплик, но latency всё ещё выше 2.5s и readiness probes флапают
+Текущее влияние: homepage widgets с рекомендациями пустые у 30% сессий
+Требуемый ответ:
+- стабилизировать сервис в течение 5 минут, включая rollback при необходимости
+- определить, связана ли проблема с размером model artifact, batch size config или sizing инфраструктуры
+- уведомить downstream product analytics о затронутом интервале предсказаний
+Результат: план немедленного реагирования, rollback или mitigation action и изменения для предотвращения повторения
+```

package/areas/software/mlops/prompts/train-experiment.md ADDED

@@ -0,0 +1,83 @@
+---
+workflow: train-experiment
+---
+# Prompt: `/train-experiment`
+Use when: running a reproducible model training experiment with a pinned environment, tracked artifacts, and automatic evaluation against the champion.
+---
+## Example 1 — Retrain fraud model on fresh quarterly data
+**EN:**
+```
+/train-experiment
+Model name: fraud-detection-xgb
+Training config: configs/fraud_xgb_q2_2026.yaml
+Data version: dv_2026_03_15_fraud_training
+Compute budget: GPU not required, max 16 CPU cores, training must finish in < 3 hours
+Requirements:
+- snapshot git commit, Docker image digest, and feature store version in MLflow before training
+- run hyperparameter search over max_depth, eta, min_child_weight, and scale_pos_weight
+- stop immediately if validation loss diverges or NaN metrics appear
+- after training, run /evaluate-model automatically against current champion fraud-detection-xgb-v12
+Output: MLflow run link, model artifact URI, best config summary, and evaluation handoff
+```
+**RU:**
+```
+/train-experiment
+Model name: fraud-detection-xgb
+Training config: configs/fraud_xgb_q2_2026.yaml
+Версия данных: dv_2026_03_15_fraud_training
+Бюджет compute: GPU не требуется, максимум 16 CPU cores, обучение должно завершиться < 3 часов
+Требования:
+- до обучения зафиксировать git commit, Docker image digest и версию feature store в MLflow
+- запустить hyperparameter search по max_depth, eta, min_child_weight и scale_pos_weight
+- немедленно остановить обучение, если validation loss расходится или появляются NaN метрики
+- после обучения автоматически запустить /evaluate-model против текущего champion fraud-detection-xgb-v12
+Результат: ссылка на MLflow run, URI model artifact, summary лучшей конфигурации и handoff в evaluation
+```
+---
+## Example 2 — Experiment after data quality recovery
+**EN:**
+```
+/train-experiment
+Model name: demand-forecast-lstm
+Training config: configs/demand_lstm_recovery.yaml
+Context: previous two runs used corrupted holiday calendar features; data quality incident fixed yesterday
+Prerequisites:
+- confirm corrected feature table passed quality checks and matches data version dq_fix_2026_03_24
+- compare new run not only with champion but also with the two bad runs to prove recovery
+- keep full environment reproducibility because finance team may audit this retrain
+Success criteria:
+- training loss decreases monotonically across all epochs
+- evaluation scorecard includes top-3 previous runs comparison
+- run metadata clearly marks this as post-incident remediation training
+Output: reproducibility record, training summary, and recommendation whether to proceed to deploy-endpoint or continue tuning
+```
+**RU:**
+```
+/train-experiment
+Model name: demand-forecast-lstm
+Training config: configs/demand_lstm_recovery.yaml
+Контекст: предыдущие два запуска использовали повреждённые holiday calendar features; data quality incident исправлен вчера
+Предусловия:
+- подтвердить, что исправленная feature table прошла quality checks и соответствует версии данных dq_fix_2026_03_24
+- сравнить новый run не только с champion, но и с двумя неудачными run'ами, чтобы доказать восстановление
+- сохранить полную воспроизводимость окружения, потому что finance team может аудировать этот retrain
+Критерии успеха:
+- training loss монотонно уменьшается на всех эпохах
+- evaluation scorecard включает сравнение с top-3 previous runs
+- metadata run'а явно помечает его как post-incident remediation training
+Результат: запись о воспроизводимости, summary обучения и рекомендация, переходить ли к deploy-endpoint или продолжать tuning
+```

package/areas/software/mlops/rules/data-integrity.md ADDED

@@ -0,0 +1,9 @@
+# Rule: Data Integrity for ML
+**Priority**: P0 — Data leakage produces models that appear excellent but fail in production.
+1. **Strict train/val/test split**: Test set touched exactly once — for final candidate evaluation. Using test set for hyperparameter decisions is data leakage.
+2. **Temporal splits for time-series**: Splits must respect temporal ordering. No random shuffling that puts future data in training set.
+3. **Training-serving feature parity**: Features computed during training must be identical in definition to features available at inference time. Any divergence is training-serving skew.
+4. **No target leakage**: Features must be available at prediction time. Features derived from the target variable are forbidden.
+5. **Feature provenance documented**: Every feature traceable to data source and computation logic. Undocumented features cannot be promoted to production.
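Rule 2 (temporal splits) can be illustrated with a minimal sketch. This is an editorial example, not part of the package: the `ts` field, the row layout, and the cutoff dates are assumptions chosen for illustration.

```python
from datetime import datetime


def temporal_split(rows, train_end, val_end):
    """Split rows by event time so no future data reaches the training set.

    train: ts < train_end
    val:   train_end <= ts < val_end
    test:  ts >= val_end
    """
    ordered = sorted(rows, key=lambda r: r["ts"])
    train = [r for r in ordered if r["ts"] < train_end]
    val = [r for r in ordered if train_end <= r["ts"] < val_end]
    test = [r for r in ordered if r["ts"] >= val_end]
    return train, val, test


# Hypothetical daily records for January 2026.
rows = [{"ts": datetime(2026, 1, day), "y": day % 2} for day in range(1, 31)]
train, val, test = temporal_split(rows, datetime(2026, 1, 21), datetime(2026, 1, 26))
```

Because the cutoffs are timestamps rather than row fractions, every validation and test row is strictly later than every training row.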

package/areas/software/mlops/rules/model-governance.md ADDED

@@ -0,0 +1,9 @@
+# Rule: Model Governance
+**Priority**: P0 — Ungoverned models in production are a compliance and reliability risk.
+1. **No promotion without evaluation**: A model cannot be deployed without a documented evaluation scorecard comparing it to the current champion. Evaluation on held-out test set only.
+2. **Human approval gate**: Model promotion requires sign-off from: ML engineer + product owner. Fairness reviewer for high-stakes models.
+3. **Model registry is the gate**: All production models registered in MLflow Model Registry with stage transitions: `Staging → Production`. Ad-hoc deployment scripts forbidden.
+4. **Deployment audit trail**: Every production deployment records: who, when, which scorecard, rollback procedure.
+5. **Model cards required**: Every production model must have a model card: intended use, performance by subgroup, known limitations, bias assessment.
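Rules 1 and 2 describe a gate that could be checked mechanically before any registry transition. The sketch below is an editorial illustration under stated assumptions: the `PromotionRequest` type and the role labels (`ml_engineer`, `product_owner`, `fairness_reviewer`) are hypothetical names, not identifiers from the package.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Hypothetical labels for the sign-offs named in rule 2.
BASE_APPROVERS = {"ml_engineer", "product_owner"}


@dataclass
class PromotionRequest:
    model_name: str
    scorecard_uri: Optional[str] = None
    approvals: Set[str] = field(default_factory=set)
    high_stakes: bool = False


def promotion_blockers(req: PromotionRequest) -> List[str]:
    """Return the reasons a promotion must be refused (empty list = may proceed)."""
    blockers = []
    if not req.scorecard_uri:
        blockers.append("missing evaluation scorecard")
    required = set(BASE_APPROVERS)
    if req.high_stakes:
        # High-stakes models additionally need a fairness reviewer.
        required.add("fairness_reviewer")
    for role in sorted(required - req.approvals):
        blockers.append(f"missing approval: {role}")
    return blockers
```

Returning the list of blockers, rather than a boolean, gives the audit trail in rule 4 something concrete to record when a promotion is refused.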

package/areas/software/mlops/rules/production-safety.md ADDED

@@ -0,0 +1,9 @@
+# Rule: Production Safety
+**Priority**: P1 — Required before any model serves real traffic.
+1. **Fallback mechanism**: Every model endpoint must have a defined fallback: rule-based baseline, previous model version, or graceful degradation.
+2. **Latency SLO**: Inference endpoints must define and enforce a latency SLO (e.g., p99 < 200ms). Models that cannot meet SLO must be optimized before deploy.
+3. **Prediction monitoring**: All production models log predictions with input features and timestamps. Monitoring must be active before go-live.
+4. **Shadow mode**: High-stakes models (credit, medical, fraud) must run in shadow mode ≥ 2 weeks before live traffic.
+5. **Drift alerting**: Alerts configured for input feature drift and output prediction drift vs. training baseline.
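The latency SLO in rule 2 (for example p99 < 200ms) could be checked along these lines. This is a minimal editorial sketch: it uses the nearest-rank percentile definition, which is one of several common conventions, and the function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_latency_slo(latencies_ms, slo_ms=200.0):
    """True when the observed p99 latency is strictly below the SLO."""
    return percentile(latencies_ms, 99) < slo_ms
```

A monitoring system would normally compute percentiles over a sliding window; the point here is only that the SLO is a statement about the tail, not the mean.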

package/areas/software/mlops/rules/reproducibility.md ADDED

@@ -0,0 +1,9 @@
+# Rule: Reproducibility
+**Priority**: P0 — A model that cannot be reproduced cannot be trusted in production.
+1. **Every training run versioned**: Training code (git commit), data snapshot, hyperparameters, and environment (Docker digest) logged in MLflow before training starts.
+2. **Random seeds fixed**: All randomness seeded — `random.seed()`, `numpy.random.seed()`, `torch.manual_seed()`. Seed is a versioned hyperparameter.
+3. **Environment pinned**: Training and inference environments fully specified via pinned `requirements.txt` or `conda.yaml`. No `>=` constraints in production specs.
+4. **Data version immutable**: Training datasets are snapshots, not live views. After evaluation begins, training data cannot be modified.
+5. **Re-trainable from scratch**: Same code + data + hyperparameters must produce a model within acceptable variance of the original.
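Rules 2 and 3 lend themselves to small automated checks. The sketch below is an editorial illustration, not part of the package: `unpinned_requirements` is a hypothetical helper that flags any requirement line not pinned with `==`, and `seeded_sample` only demonstrates that a fixed seed reproduces the same draws. The regular expression is deliberately strict and does not cover every valid requirement syntax (URLs, environment markers).

```python
import random
import re

# Accepts only "name==version" (optionally with extras such as name[extra]).
PINNED = re.compile(r"^[A-Za-z0-9._\[\]-]+==[^=<>~!,\s]+$")


def unpinned_requirements(text):
    """Return requirement lines that are not pinned to an exact version."""
    bad = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and not PINNED.match(line):
            bad.append(line)
    return bad


def seeded_sample(seed, n=3):
    """Draw n floats from a generator seeded with a versioned seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]
```

A check like this could run in CI against the production `requirements.txt`, failing the build when the returned list is not empty.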

package/areas/software/mlops/skills/experiment-tracking/SKILL.md ADDED

@@ -0,0 +1,29 @@
+# Skill: Experiment Tracking (MLflow)
+## When to load
+When running training experiments, comparing runs, or reproducing a historical experiment.
+## MLflow Tracking Pattern
+```python
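+import subprocess
+
+import mlflow
+
+# dataset_version, train_model, X_train, and y_train are assumed to be defined by the caller.
+hyperparams = {"learning_rate": 0.01, "max_depth": 6}
+with mlflow.start_run(run_name="xgboost-lr-0.01-depth-6") as run: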
+    mlflow.log_params({
+        "model_type": "xgboost",
+        "learning_rate": 0.01,
+        "max_depth": 6,
+        "data_version": dataset_version,
+        "random_seed": 42,
+    })
+    mlflow.set_tags({
+        "git_commit": subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip(),
+    })
+    model = train_model(X_train, y_train, hyperparams)
+    mlflow.log_metrics({"test_auc_roc": 0.847, "test_f1": 0.731})
+    signature = mlflow.models.infer_signature(X_train, model.predict(X_train))
+    mlflow.xgboost.log_model(model, "model", signature=signature)
+    mlflow.log_artifact("evaluation_scorecard.json")
+```

package/areas/software/mlops/skills/feature-engineering/SKILL.md ADDED

@@ -0,0 +1,44 @@
+# Skill: Feature Engineering
+## When to load
+When building training datasets, designing feature pipelines, or debugging training-serving skew.
+## Declarative Feature Pipeline
+```python
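+from sklearn.compose import ColumnTransformer
+from sklearn.impute import SimpleImputer
+from sklearn.pipeline import Pipeline
+from sklearn.preprocessing import OneHotEncoder, StandardScaler
+
+# numeric_features and categorical_features are lists of column names defined by the caller.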
+preprocessor = ColumnTransformer(transformers=[
+    ('num', Pipeline([
+        ('imputer', SimpleImputer(strategy='median')),
+        ('scaler', StandardScaler()),
+    ]), numeric_features),
+    ('cat', Pipeline([
+        ('imputer', SimpleImputer(strategy='constant', fill_value='unknown')),
+        ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
+    ]), categorical_features),
+])
+# ✅ Fit ONLY on training data
+preprocessor.fit(X_train)
+X_train_processed = preprocessor.transform(X_train)
+X_test_processed = preprocessor.transform(X_test)  # Uses train statistics
+```
+## Training-Serving Skew Prevention
+```python
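+from datetime import datetime
+
+# Single feature definition used in BOTH training and inference.
+# db, count_in_window, and avg_in_window are assumed to be provided by the project.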
+def compute_user_features(user_id: str, reference_date: datetime) -> dict:
+    """
+    Used by: training pipeline (historical dates) AND inference API (current date).
+    Identical computation guarantees no skew.
+    """
+    orders = db.query("SELECT * FROM orders WHERE user_id = %s AND created_at < %s", (user_id, reference_date))
+    return {
+        "order_count_30d": count_in_window(orders, reference_date, days=30),
+        "avg_order_value_90d": avg_in_window(orders, reference_date, days=90),
+    }
+```
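The snippet above calls `count_in_window` and `avg_in_window` without defining them. One possible shape for these helpers is sketched below as an editorial illustration. It assumes each order is a dict with `created_at` (a `datetime`) and `amount` keys, which is a hypothetical schema, and it treats the window as the half-open interval `[reference_date - days, reference_date)` so that nothing at or after the reference date leaks into the feature.

```python
from datetime import datetime, timedelta


def _in_window(orders, reference_date, days):
    """Orders with start <= created_at < reference_date (point-in-time correct)."""
    start = reference_date - timedelta(days=days)
    return [o for o in orders if start <= o["created_at"] < reference_date]


def count_in_window(orders, reference_date, days):
    return len(_in_window(orders, reference_date, days))


def avg_in_window(orders, reference_date, days):
    window = _in_window(orders, reference_date, days)
    if not window:
        return 0.0  # assumption: empty windows default to 0.0 rather than NaN
    return sum(o["amount"] for o in window) / len(window)
```

Whatever default is chosen for empty windows, it has to be the same in training and serving, otherwise the helper reintroduces the skew it is meant to prevent.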

package/areas/software/mlops/skills/inference-serving/SKILL.md ADDED

@@ -0,0 +1,35 @@
+# Skill: Inference Serving
+## When to load
+When deploying a model to an API endpoint or optimizing inference latency.
+## FastAPI Inference Endpoint
+```python
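+import mlflow.pyfunc
+from fastapi import FastAPI
+
+# PredictionRequest, PredictionResponse, load_preprocessor, log_prediction,
+# logger, and FALLBACK_PROBABILITY are assumed to be defined elsewhere in the service.
+app = FastAPI()
+
+@app.on_event("startup")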
+def load_model():
+    app.state.model = mlflow.pyfunc.load_model("models:/churn-predictor/Production")
+    app.state.preprocessor = load_preprocessor()
+@app.post("/predict", response_model=PredictionResponse)
+def predict(request: PredictionRequest):
+    try:
+        features = app.state.preprocessor.transform([request.features])
+        probability = app.state.model.predict(features)[0]
+        log_prediction(request.user_id, request.features, float(probability))
+        return PredictionResponse(
+            user_id=request.user_id,
+            churn_probability=float(probability),
+        )
+    except Exception as e:
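+        # Keyword fields assume a structlog-style logger; stdlib logging rejects arbitrary keyword arguments.
+        logger.error("Inference failed", error=str(e))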
+        return PredictionResponse(user_id=request.user_id, churn_probability=FALLBACK_PROBABILITY)
+```
+## Latency Checklist
+- [ ] Model loaded at startup, not per request
+- [ ] Input preprocessing vectorized (batch)
+- [ ] ONNX conversion for framework-agnostic optimization
+- [ ] Batch inference enabled for high-throughput

package/areas/software/mlops/skills/model-evaluation/SKILL.md ADDED

@@ -0,0 +1,40 @@
+# Skill: Model Evaluation
+## When to load
+When evaluating a trained model, comparing versions, or performing fairness analysis.
+## Threshold Selection
+```python
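+import numpy as np
+from sklearn.metrics import precision_recall_curve
+
+def select_optimal_threshold(y_true, y_prob, business_objective: str):
+    """
+    business_objective:
+    - 'max_f1': balanced precision/recall
+    - 'high_precision': minimize false positives (fraud)
+    - 'high_recall': minimize false negatives (screening)
+    """
+    precisions, recalls, thresholds = precision_recall_curve(y_true, y_prob)
+    if business_objective == 'max_f1':
+        # precisions and recalls carry one trailing point with no matching threshold; drop it before argmax.
+        f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-8)
+        return thresholds[np.argmax(f1_scores)]
+    # 'high_precision' and 'high_recall' are documented above but not implemented in this snippet.
+    raise NotImplementedError(f"Unsupported business_objective: {business_objective!r}")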
+```
+## Subgroup Fairness Analysis (Required for People-Affecting Models)
+```python
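+from sklearn.metrics import recall_score
+
+# y_true, y_pred, and sensitive_attribute are assumed to be aligned pandas Series; logger is defined elsewhere.
+def evaluate_fairness(y_true, y_pred, sensitive_attribute):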
+    groups = sensitive_attribute.unique()
+    results = {g: {
+        "n": (sensitive_attribute == g).sum(),
+        "positive_rate": y_pred[sensitive_attribute == g].mean(),
+        "tpr": recall_score(y_true[sensitive_attribute == g], y_pred[sensitive_attribute == g]),
+    } for g in groups}
+    pos_rates = [r["positive_rate"] for r in results.values()]
+    dp_diff = max(pos_rates) - min(pos_rates)
+    if dp_diff > 0.1:
+        logger.warning(f"Demographic parity difference {dp_diff:.3f} exceeds 0.1 threshold")
+    return results, dp_diff
+```

package/areas/software/mlops/skills/model-monitoring/SKILL.md ADDED

@@ -0,0 +1,32 @@
+# Skill: Model Monitoring
+## When to load
+When setting up monitoring for a deployed model or responding to drift alerts.
+## Monitoring Dimensions
+```
+1. Operational health
+   - Latency: p50, p95, p99
+   - Error rate: prediction failures, input validation failures
+2. Data drift (vs training baseline)
+   - PSI (Population Stability Index) per feature
+   - PSI > 0.2 = significant shift → retrain likely needed
+3. Model quality (when labels available)
+   - Accuracy metrics after ground truth arrives
+   - Business outcome correlation
+```
+## PSI Drift Detection
+```python
+import numpy as np
+
+def calculate_psi(expected: np.ndarray, actual: np.ndarray, buckets: int = 10) -> float:
+    """PSI < 0.1: stable. 0.1-0.2: monitor. > 0.2: retrain."""
+    breakpoints = np.percentile(expected, np.linspace(0, 100, buckets + 1))
+    # Open the outer bins so production values outside the training range are still counted.
+    breakpoints[0], breakpoints[-1] = -np.inf, np.inf
+    exp_pct = np.clip(np.histogram(expected, breakpoints)[0] / len(expected), 1e-4, None)
+    act_pct = np.clip(np.histogram(actual, breakpoints)[0] / len(actual), 1e-4, None)
+    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))
+```
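To make the PSI arithmetic and its thresholds concrete, here is a dependency-free editorial sketch that works on bucket fractions that have already been computed. The function names and the example fractions are hypothetical; the thresholds are the ones stated in the docstring above, with the boundary value 0.2 treated as "monitor" by assumption.

```python
import math


def psi_from_fractions(expected, actual, floor=1e-4):
    """PSI = sum((a - e) * ln(a / e)) over buckets, with a floor to avoid log(0)."""
    if len(expected) != len(actual):
        raise ValueError("bucket counts differ")
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, floor), max(a, floor)
        total += (a - e) * math.log(a / e)
    return total


def classify_psi(psi):
    """Map a PSI value onto the stable / monitor / retrain bands."""
    if psi < 0.1:
        return "stable"
    if psi <= 0.2:
        return "monitor"
    return "retrain"


# Hypothetical example: four equal training buckets vs. a skewed production sample.
psi = psi_from_fractions([0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10])
status = classify_psi(psi)
```

With these bands, the PSI of 0.34 reported in the `/model-incident` drift example would fall in the "retrain" band.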

package/areas/software/mlops/workflows/champion-challenger.md ADDED

@@ -0,0 +1,65 @@
+---
+name: champion-challenger
+type: workflow
+trigger: /champion-challenger
+description: Run a statistically valid A/B experiment between champion and challenger models with guardrail-based auto-rollback.
+inputs:
+  - champion_model
+  - challenger_model
+  - experiment_duration
+outputs:
+  - experiment_report
+  - promotion_decision
+roles:
+  - developer
+  - qa
+  - team-lead
+execution:
+  initiator: developer
+related-rules:
+  - production-safety.md
+  - model-governance.md
+  - reproducibility.md
+uses-skills:
+  - model-evaluation
+  - model-monitoring
+quality-gates:
+  - sample size calculated before experiment starts
+  - guardrail metrics monitored daily with auto-rollback
+  - promotion decision based on statistical significance
+---
+## Steps
+### 1. Experiment Design — `@developer`
+- **Input:** champion, challenger, duration
+- **Actions:** define primary metric (business outcome); calculate required sample size (80% power, α=0.05); define guardrail metrics (latency, error rate); document in experiment tracker
+- **Output:** experiment design doc with sample size and guardrails
+- **Done when:** `@team-lead` approves design
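The sample size calculation in step 1 (80% power, α=0.05) could look like the sketch below when the primary metric is a conversion-style rate. This is an editorial illustration: it uses the standard normal-approximation formula for comparing two proportions, which is one common choice and not necessarily the method the package authors intend, and the baseline and expected rates are hypothetical inputs.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_baseline, p_expected, alpha=0.05, power=0.80):
    """Users needed in EACH arm to detect p_baseline -> p_expected (two-sided test)."""
    effect = p_expected - p_baseline
    if effect == 0:
        raise ValueError("expected rate must differ from baseline")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = p_baseline * (1 - p_baseline) + p_expected * (1 - p_expected)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical design: detect a lift from 10% to 12%.
n_per_arm = sample_size_per_arm(0.10, 0.12)
```

Smaller expected effects drive the required sample up quadratically, which is why the design doc fixes the minimum effect of interest before the experiment starts.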
+### 2. Configure Traffic Split — `@developer`
+- **Input:** approved design
+- **Actions:** hash `user_id` for consistent assignment; 50% champion / 50% challenger; log assignment to experiment tracker
+- **Output:** traffic split active
+- **Done when:** split verified in logs; both models receiving traffic
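Hashing `user_id` for consistent assignment, as step 2 describes, can be sketched as follows. This is an editorial example under stated assumptions: the experiment name used as a salt and the function name `assign_arm` are hypothetical, and SHA-256 is just one suitable stable hash.

```python
import hashlib


def assign_arm(user_id, experiment="champion-challenger-demo", challenger_share=0.5):
    """Deterministically map a user to 'champion' or 'challenger'.

    The same (experiment, user_id) pair always lands in the same arm, and
    salting with the experiment name decorrelates assignments across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "challenger" if bucket < challenger_share else "champion"
```

Python's built-in `hash()` is not suitable here because string hashing is randomized per process, so assignments would change between restarts.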
+### 3. Run & Monitor — `@qa`
+- **Input:** active experiment
+- **Actions:** monitor guardrail metrics daily; if guardrail breached → auto-rollback to champion; run for full planned duration unless guardrail breach
+- **Output:** daily monitoring logs; guardrail status
+- **Done when:** experiment duration complete or guardrail triggered rollback
+### 4. Analyze Results — `@qa`
+- **Input:** experiment logs
+- **Actions:** compute statistical significance of primary metric; segment analysis: is challenger better for ALL segments?; document practical significance alongside statistical
+- **Output:** `experiment_results.md` — p-value, effect size, segment breakdown
+- **Done when:** analysis complete; results ready for decision
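For a rate-based primary metric, the significance computation in step 4 could be sketched with a pooled two-proportion z-test. This is an editorial illustration, not the package's prescribed test; the function name and the counts in the usage example are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (effect, z, two-sided p-value) for rate_b - rate_a using a pooled z-test."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both arms are all-success or all-failure: no evidence of a difference.
        return p_b - p_a, 0.0, 1.0
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value


# Hypothetical outcome: champion 400/4000 vs. challenger 480/4000 conversions.
effect, z, p_value = two_proportion_z_test(400, 4000, 480, 4000)
```

The effect size is returned alongside the p-value because the step asks for practical significance to be documented as well as statistical significance; running the same test per segment supports the segment breakdown.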
+### 5. Promotion Decision — `@team-lead`
+- **Input:** experiment results
+- **Actions:** p < 0.05 AND practical significance AND no harm to any segment → PROMOTE challenger; otherwise → KEEP champion; route 100% to winner; archive loser; write experiment report for model registry
+- **Output:** experiment report in registry; traffic routed to winner
+- **Done when:** winner in production; loser archived; report complete
+## Exit
+Decision recorded in registry + winner at 100% traffic + report published = experiment closed.

package/areas/software/mlops/workflows/deploy-endpoint.md ADDED

@@ -0,0 +1,70 @@
+---
+name: deploy-endpoint
+type: workflow
+trigger: /deploy-endpoint
+description: Deploy a model endpoint using shadow → canary → full rollout with automatic rollback on SLO breach.
+inputs:
+  - model_name
+  - run_id
+  - deployment_strategy
+outputs:
+  - live_endpoint
+  - deployment_report
+roles:
+  - team-lead
+  - developer
+  - qa
+execution:
+  initiator: team-lead
+related-rules:
+  - production-safety.md
+  - model-governance.md
+uses-skills:
+  - inference-serving
+  - model-monitoring
+quality-gates:
+  - PROMOTE recommendation confirmed in model registry
+  - canary passes latency and error rate SLOs
+  - monitoring dashboards updated post-deploy
+---
+## Steps
+### 1. Pre-flight — `@team-lead`
+- **Input:** model run ID
+- **Actions:** confirm model passed `/evaluate-model` with PROMOTE recommendation; verify human approval recorded in model registry; check production endpoint health; confirm no active P0/P1 incidents
+- **Output:** pre-flight sign-off
+- **Done when:** all checks pass; deployment may proceed
+### 2. Shadow Deployment — `@developer` (if `--shadow`)
+- **Input:** pre-flight sign-off
+- **Actions:** deploy alongside current champion; 100% traffic to champion; mirror requests to challenger (no user-facing response from challenger); run shadow ≥ 48 hours (≥ 2 weeks for high-stakes models, per `production-safety.md`); compare predictions for distribution drift
+- **Output:** shadow comparison report
+- **Done when:** no significant prediction distribution drift detected
+### 3. Canary Rollout — `@developer`
+- **Input:** shadow report (or pre-flight if skipping shadow)
+- **Actions:** serve challenger to 5% of traffic; monitor 30 minutes:
+  - latency p99 > SLO → AUTO-ROLLBACK
+  - error rate > 1% → AUTO-ROLLBACK
+  - gradually increase: 5% → 20% → 50% → 100%
+- **Output:** canary metrics per traffic split
+- **Done when:** 100% traffic on challenger with no SLO breaches
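The canary logic in step 3 amounts to a small decision function evaluated after each monitoring interval. The sketch below is an editorial illustration: the traffic steps and the 1% error-rate limit come from the step above, while the function name, return shape, and the 200 ms SLO in the usage note are assumptions.

```python
TRAFFIC_STEPS = [5, 20, 50, 100]  # percent of traffic served by the challenger
MAX_ERROR_RATE = 0.01             # error rate > 1% triggers rollback


def next_canary_action(current_pct, p99_ms, error_rate, slo_p99_ms):
    """Decide what to do after one monitoring interval at current_pct traffic.

    Returns (action, challenger_traffic_pct).
    """
    if current_pct not in TRAFFIC_STEPS:
        raise ValueError(f"unknown traffic step: {current_pct}")
    if p99_ms > slo_p99_ms or error_rate > MAX_ERROR_RATE:
        return ("rollback", 0)
    idx = TRAFFIC_STEPS.index(current_pct)
    if idx == len(TRAFFIC_STEPS) - 1:
        return ("promote", 100)
    return ("advance", TRAFFIC_STEPS[idx + 1])
```

For example, with an assumed SLO of 200 ms, a healthy interval at 5% advances to 20%, while a p99 of 250 ms at any step sends all traffic back to the champion.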
+### 4. Promote Champion — `@developer`
+- **Input:** successful canary
+- **Actions:** transition challenger: Staging → Production in registry; demote old champion: Production → Archived
+- **Output:** registry updated; old champion archived
+- **Done when:** registry state reflects new champion
+### 5. Post-Deploy Monitoring — `@qa`
+- **Input:** live endpoint
+- **Actions:** establish new baseline for drift monitoring; confirm monitoring dashboards updated; observe first 24 hours for anomalies
+- **Output:** `deployment_report.md` — timeline, traffic split results, new baselines
+- **Done when:** stable 24 hours; report complete
+## Iteration Loop
+Auto-rollback on SLO breach returns to Step 2 (shadow) or Step 1 (full re-review).
+## Exit
+100% traffic on new champion + stable monitoring + deployment report = endpoint promoted.

package/areas/software/mlops/workflows/evaluate-model.md ADDED

@@ -0,0 +1,63 @@
+---
+name: evaluate-model
+type: workflow
+trigger: /evaluate-model
+description: Compute metrics, fairness analysis, and business impact scorecard to produce a promotion recommendation.
+inputs:
+  - run_id
+  - champion_reference
+outputs:
+  - evaluation_scorecard
+  - promotion_recommendation
+roles:
+  - qa
+  - team-lead
+execution:
+  initiator: qa
+related-rules:
+  - model-governance.md
+  - data-integrity.md
+  - reproducibility.md
+uses-skills:
+  - model-evaluation
+  - experiment-tracking
+quality-gates:
+  - test set was not used during any training iteration
+  - fairness disparity checked for people-affecting models
+  - champion comparison statistically significant
+---
+## Steps
+### 1. Load Model & Test Data — `@qa`
+- **Input:** MLflow run ID
+- **Actions:** retrieve model artifact from MLflow run; load held-out test set from the data version recorded in the run; confirm test set was NOT used during any training iteration
+- **Output:** model and test data loaded
+- **Done when:** data provenance confirmed; no leakage
+### 2. Compute Core Metrics — `@qa`
+- **Input:** model + test data
+- **Actions:** classification: AUC-ROC, F1, Precision, Recall, PR-AUC; regression: MAE, RMSE, R², MAPE; save raw metric values to MLflow run
+- **Output:** core metrics logged
+- **Done when:** all applicable metrics computed
+### 3. Business Impact Translation — `@qa`
+- **Input:** core metrics
+- **Actions:** translate statistical metrics to business impact (e.g. "At 80% precision, identifies 62% of churners — est. $120K saved/month"); document assumptions in scorecard
+- **Output:** business impact statement in scorecard
+- **Done when:** at least one business metric derived
+### 4. Fairness Analysis — `@qa` (if model affects people)
+- **Input:** model predictions + protected group labels
+- **Actions:** compute demographic parity difference across protected groups; flag if disparity > 0.1 — requires `@team-lead` human review before promotion
+- **Output:** fairness report; flag if human review needed
+- **Done when:** fairness check complete; no unreviewed disparity > 0.1
+### 5. Champion Comparison — `@team-lead`
+- **Input:** challenger metrics + champion from Production stage in registry
+- **Actions:** run statistical significance test; review scorecard; make promotion decision: PROMOTE / DO_NOT_PROMOTE / NEEDS_REVIEW
+- **Output:** `evaluation_scorecard.json` with recommendation; visualizations (confusion matrix, ROC, feature importance)
+- **Done when:** recommendation recorded in model registry
+## Exit
+Signed scorecard + promotion recommendation = evaluation complete; feed into `/deploy-endpoint` or `/champion-challenger`.

package/areas/software/mlops/workflows/model-incident.md ADDED

@@ -0,0 +1,64 @@
+---
+name: model-incident
+type: workflow
+trigger: /model-incident
+description: Respond to a model degradation, drift, bias, or outage incident with structured triage, rollback, and postmortem.
+inputs:
+  - model_name
+  - incident_type
+outputs:
+  - resolved_incident
+  - postmortem
+roles:
+  - qa
+  - developer
+  - team-lead
+execution:
+  initiator: team-lead
+related-rules:
+  - production-safety.md
+  - model-governance.md
+  - data-integrity.md
+uses-skills:
+  - model-monitoring
+  - model-evaluation
+quality-gates:
+  - rollback executed within 5 minutes for critical incidents
+  - affected prediction window scoped and logged
+  - postmortem published with monitoring improvement
+---
+## Steps
+### 1. Immediate Response — `@team-lead`
+- **Input:** incident alert
+- **Actions:** assess impact (users affected? incorrect decisions made?); decide: tolerate degraded predictions or rollback NOW?; if critical → rollback to previous champion (< 5 min target)
+- **Output:** rollback decision; incident severity classification
+- **Done when:** system stabilized (rollback applied or consciously tolerated)
+### 2. Diagnose — `@qa`
+- **Input:** incident type
+- **Actions:** drift → compare input distributions to training baseline (PSI); degradation → compare business metrics to post-deployment baseline; outage → check endpoint health, container logs, resource utilization; bias → compute fairness metrics for affected period
+- **Output:** diagnosis report with evidence
+- **Done when:** root cause category identified
+### 3. Scope Affected Predictions — `@developer`
+- **Input:** diagnosis report
+- **Actions:** identify time window of degradation; log which predictions were made during affected window; notify downstream systems consuming model output
+- **Output:** affected prediction log; downstream teams notified
+- **Done when:** full impact window documented
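Scoping the affected predictions in step 3 is, at its core, a time-window filter over the prediction log. The sketch below is an editorial illustration: the log record layout (`ts`, `user_id`, `score`) and the sample values are hypothetical, and the window start reuses the 2026-03-20 09:00 UTC timestamp from the drift prompt example.

```python
from datetime import datetime, timezone


def affected_predictions(prediction_log, window_start, window_end):
    """Return log records with window_start <= ts < window_end."""
    return [r for r in prediction_log if window_start <= r["ts"] < window_end]


# Hypothetical prediction log with timezone-aware timestamps.
log = [
    {"ts": datetime(2026, 3, 20, 8, 30, tzinfo=timezone.utc), "user_id": "u1", "score": 0.21},
    {"ts": datetime(2026, 3, 20, 9, 15, tzinfo=timezone.utc), "user_id": "u2", "score": 0.83},
    {"ts": datetime(2026, 3, 20, 11, 0, tzinfo=timezone.utc), "user_id": "u3", "score": 0.91},
    {"ts": datetime(2026, 3, 20, 13, 0, tzinfo=timezone.utc), "user_id": "u4", "score": 0.30},
]
affected = affected_predictions(
    log,
    datetime(2026, 3, 20, 9, 0, tzinfo=timezone.utc),
    datetime(2026, 3, 20, 12, 0, tzinfo=timezone.utc),
)
```

Keeping all timestamps timezone-aware avoids off-by-hours errors when the window is later shared with downstream teams.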
+### 4. Root Cause & Remediation — `@developer`
+- **Input:** affected window + diagnosis
+- **Actions:** data drift → schedule retraining with `/train-experiment`; model rot → `/train-experiment` with recent data; infrastructure → fix pipeline, verify feature consistency; code bug → implement fix, run `/evaluate-model` before re-deploying
+- **Output:** remediation action taken
+- **Done when:** root cause fixed; new model validated or pipeline restored
+### 5. Post-Incident — `@team-lead`
+- **Input:** resolved incident
+- **Actions:** add monitoring rule to catch pattern earlier; write postmortem; update model card with known failure modes
+- **Output:** postmortem at `.mlops/incidents/<date>-<model>-incident.md`; monitoring updated; model card updated
+- **Done when:** postmortem reviewed; prevention measures in place
+## Exit
+System restored + postmortem published + monitoring improved = incident closed.