claude-turing 2.2.1 → 2.4.0

@@ -1,7 +1,7 @@
1
1
  {
2
2
  "name": "turing",
3
- "version": "2.2.1",
4
- "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 30 commands, 2 specialized agents, experiment orchestration (batch queue + smart retry + branching), literature integration + paper drafting, production model export, performance profiling, smart checkpoints, experiment intelligence, statistical rigor, tree-search hypothesis exploration, cost-performance frontier, model cards, model registry, hypothesis database with novelty guard, anti-cheating guardrails, and the taste-leverage loop. Inspired by Karpathy's autoresearch and the scientific method itself.",
3
+ "version": "2.4.0",
4
+ "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 36 commands, 2 specialized agents, model composition (ensemble + pipeline stitch + warm-start), deep analysis (experiment diff + live training monitor + regression gate), experiment orchestration (batch queue + smart retry + branching), literature integration + paper drafting, production model export, performance profiling, smart checkpoints, experiment intelligence, statistical rigor, tree-search hypothesis exploration, cost-performance frontier, model cards, model registry, hypothesis database with novelty guard, anti-cheating guardrails, and the taste-leverage loop. Inspired by Karpathy's autoresearch and the scientific method itself.",
5
5
  "author": {
6
6
  "name": "pragnition"
7
7
  },
package/README.md CHANGED
@@ -341,6 +341,12 @@ The index (`hypotheses.yaml`) is the lightweight queue. The detail files (`hypot
341
341
  | `/turing:report` | Generate research report |
342
342
  | `/turing:poster` | Generate research poster |
343
343
  | `/turing:preflight` | Pre-release validation checks |
344
+ | `/turing:diff <a> <b>` | Deep experiment comparison — config diffs, metric significance, per-class regressions, curve divergence |
345
+ | `/turing:watch [--analyze]` | Live training monitor — loss spikes, NaN detection, overfitting, plateau alerts |
346
+ | `/turing:regress [--tolerance]` | Performance regression gate — verify metrics haven't degraded after changes |
347
+ | `/turing:ensemble [--top-k]` | Automated ensemble — voting, stacking, blending from top-K models |
348
+ | `/turing:stitch <action>` | Pipeline composition — show, swap, cache, and run stages independently |
349
+ | `/turing:warm <exp-id>` | Warm-start from prior model — load checkpoint, freeze layers, adjust LR |
344
350
 
345
351
  And for fully hands-off operation:
346
352
 
@@ -525,11 +531,11 @@ Each project gets independent config, data, experiments, models, and agent memor
525
531
 
526
532
  ## Architecture of Turing Itself
527
533
 
528
- 30 commands, 2 agents, 9 config files, 49 template scripts, model registry, artifact contract, cost-performance frontier, model cards, tree-search exploration, statistical rigor, experiment intelligence, performance profiling, smart checkpoints, production model export, literature integration, paper section drafting, experiment orchestration (queue + retry + fork), 778 tests, 16 ADRs. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full codemap.
534
+ 36 commands, 2 agents, 10 config files, 55 template scripts, model registry, artifact contract, cost-performance frontier, model cards, tree-search exploration, statistical rigor, experiment intelligence, performance profiling, smart checkpoints, production model export, literature integration, paper section drafting, experiment orchestration (queue + retry + fork), deep analysis (diff + watch + regress), model composition (ensemble + stitch + warm), 16 ADRs. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full codemap.
529
535
 
530
536
  ```
531
537
  turing/
532
- ├── commands/ 29 skill files (core + taste-leverage + reporting + exploration + statistical rigor + experiment intelligence + performance + deployment + research workflow + orchestration)
538
+ ├── commands/ 35 skill files (core + taste-leverage + reporting + exploration + statistical rigor + experiment intelligence + performance + deployment + research workflow + orchestration + deep analysis + model composition)
533
539
  ├── agents/ 2 agents (researcher: read/write, evaluator: read-only)
534
540
  ├── config/ 8 files (lifecycle, taxonomy, archetypes, novelty aliases)
535
541
  ├── templates/ Scaffolded into user projects by /turing:init
@@ -0,0 +1,48 @@
1
+ ---
2
+ name: diff
3
+ description: Deep experiment comparison — config diffs, metric significance, per-class regressions, training curve divergence, feature importance shifts.
4
+ disable-model-invocation: true
5
+ argument-hint: "<exp-a> <exp-b> [--code]"
6
+ allowed-tools: Read, Bash(*), Grep, Glob
7
+ ---
8
+
9
+ Deep diagnostic comparison of two experiments. Goes beyond "which metric is higher" to show where, when, and why two experiments diverge.
10
+
11
+ ## Steps
12
+
13
+ 1. **Activate environment:**
14
+ ```bash
15
+ source .venv/bin/activate
16
+ ```
17
+
18
+ 2. **Parse arguments from `$ARGUMENTS`:**
19
+ - First two arguments are experiment IDs (required), e.g. `exp-042 exp-053`
20
+ - `--code` includes git diff of train.py between the two experiments' commits
21
+ - `--json` outputs raw JSON instead of markdown
22
+
23
+ 3. **Run deep comparison:**
24
+ ```bash
25
+ python scripts/experiment_diff.py $ARGUMENTS
26
+ ```
27
+
28
+ 4. **Report results — the diff includes:**
29
+ - **Config diff:** which hyperparameters changed, with magnitude (e.g., `max_depth: 6 → 8 (+33%)`)
30
+ - **Metric diff:** all metrics with deltas and statistical significance (if seed studies exist)
31
+ - **Per-class diff:** which classes improved/regressed — flags regressions hidden by aggregate improvement
32
+ - **Training curve divergence:** the epoch where the two experiments' loss/metric curves separate
33
+ - **Feature importance shifts:** which features gained/lost importance
34
+ - **Code diff (--code):** git diff of train.py between the two commits
35
+
36
+ 5. **Saved output:** report written to `experiments/diffs/<exp-a>-vs-<exp-b>.yaml`
37
+
38
+ 6. **If experiment ID not found:** list available experiment IDs from `experiments/log.jsonl`
39
+
40
+ 7. **If no training pipeline exists:** suggest `/turing:init` first.
41
+
42
+ ## Examples
43
+
44
+ ```
45
+ /turing:diff exp-042 exp-053 # Full diagnostic comparison
46
+ /turing:diff exp-042 exp-053 --code # Include train.py code changes
47
+ /turing:diff exp-001 exp-010 --json # Raw JSON output
48
+ ```
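Reviewer note: the config diff and curve-divergence items in step 4 can be illustrated with a small sketch. This is not the contents of `scripts/experiment_diff.py`, which the diff does not show. The function names, the flat-dict config shape, and the `threshold` parameter are all assumptions made for illustration.

```python
from typing import Any, Optional


def config_diff(a: dict[str, Any], b: dict[str, Any]) -> list[str]:
    """Describe hyperparameters that differ between two flat config dicts."""
    lines = []
    for key in sorted(set(a) | set(b)):
        old, new = a.get(key), b.get(key)
        if old == new:
            continue
        line = f"{key}: {old} → {new}"
        # Only attach a relative magnitude when both values are real numbers.
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in (old, new)
        )
        if numeric and old != 0:
            line += f" ({(new - old) / abs(old):+.0%})"
        lines.append(line)
    return lines


def divergence_epoch(curve_a: list[float], curve_b: list[float],
                     threshold: float = 0.05) -> Optional[int]:
    """First 1-based epoch where the curves differ by more than threshold (hypothetical rule)."""
    for epoch, (x, y) in enumerate(zip(curve_a, curve_b), start=1):
        if abs(x - y) > threshold:
            return epoch
    return None


print(config_diff({"max_depth": 6, "lr": 0.1}, {"max_depth": 8, "lr": 0.1}))
print(divergence_epoch([1.0, 0.8, 0.6], [1.0, 0.79, 0.4]))
```

A fixed absolute threshold is the simplest possible divergence rule; a real implementation might instead compare against seed-to-seed variance.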
@@ -0,0 +1,54 @@
1
+ ---
2
+ name: ensemble
3
+ description: Automated ensemble construction — combines top-K models via voting, stacking, and blending for zero-cost improvement.
4
+ disable-model-invocation: true
5
+ argument-hint: "[--top-k 5] [--methods voting,stacking,blending]"
6
+ allowed-tools: Read, Bash(*), Grep, Glob
7
+ ---
8
+
9
+ Build ensembles from your best experiments automatically. Often yields 1-3% improvement without retraining the base models.
10
+
11
+ ## Steps
12
+
13
+ 1. **Activate environment:**
14
+ ```bash
15
+ source .venv/bin/activate
16
+ ```
17
+
18
+ 2. **Parse arguments from `$ARGUMENTS`:**
19
+ - `--top-k 5` — number of top models to include (default: 5)
20
+ - `--methods voting,stacking,blending` — ensemble methods to try
21
+ - `--predictions-dir experiments/predictions` — directory with saved predictions
22
+ - `--json` — raw JSON output
23
+
24
+ 3. **Run ensemble construction:**
25
+ ```bash
26
+ python scripts/build_ensemble.py $ARGUMENTS
27
+ ```
28
+
29
+ 4. **Report results:**
30
+ - Table of all ensemble methods tried with metric deltas vs best single model
31
+ - Best ensemble method highlighted with improvement amount
32
+ - Diversity analysis: prediction correlation matrix, diversity assessment
33
+ - Base model summary: which experiments were combined
34
+
35
+ 5. **Ensemble methods:**
36
+ - **Voting:** majority vote (classification) or mean (regression)
37
+ - **Weighted voting:** weights proportional to individual model performance
38
+ - **Stacking:** cross-validated meta-learner (ridge/logistic) on out-of-fold predictions
39
+ - **Blending:** holdout-based meta-learner (simpler, less data-efficient)
40
+
41
+ 6. **Prerequisites:** experiments must have saved predictions in `experiments/predictions/`. Each experiment needs `<exp-id>-predictions.npy` and a shared `labels.npy`.
42
+
43
+ 7. **If no predictions exist:** suggest saving predictions during training by adding prediction logging to `evaluate.py`.
44
+
45
+ 8. **Saved output:** report written to `experiments/ensembles/ensemble-*.yaml`
46
+
47
+ ## Examples
48
+
49
+ ```
50
+ /turing:ensemble # Default: top-5, all methods
51
+ /turing:ensemble --top-k 3 # Top-3 models only
52
+ /turing:ensemble --methods voting,stacking # Specific methods
53
+ /turing:ensemble --json # Machine-readable output
54
+ ```
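Reviewer note: the voting and weighted-voting methods listed in step 5 can be sketched in plain Python. This is an illustration only; `scripts/build_ensemble.py` is not part of this diff, and the function names and list-of-lists input shape are assumptions (the skill itself reads `.npy` prediction files).

```python
from collections import Counter


def majority_vote(predictions: list[list[int]]) -> list[int]:
    """Combine per-model class labels by majority vote (one inner list per model)."""
    combined = []
    for sample_votes in zip(*predictions):
        # Ties go to the label seen first, i.e. the earliest model in the list.
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined


def weighted_mean(predictions: list[list[float]], scores: list[float]) -> list[float]:
    """Average regression outputs with weights proportional to each model's score."""
    total = sum(scores)
    weights = [s / total for s in scores]
    return [
        sum(w * p for w, p in zip(weights, sample))
        for sample in zip(*predictions)
    ]


print(majority_vote([[0, 1, 1], [0, 1, 0], [1, 1, 0]]))
print(weighted_mean([[1.0, 2.0], [3.0, 4.0]], [1.0, 3.0]))
```

Stacking and blending differ from these two in that they fit a meta-learner on the base models' predictions, so they need held-out labels in addition to the predictions themselves.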
@@ -0,0 +1,53 @@
1
+ ---
2
+ name: regress
3
+ description: Performance regression gate — re-run best experiment after code/dependency changes and verify metrics haven't degraded.
4
+ disable-model-invocation: true
5
+ argument-hint: "[--tolerance 0.01] [--against exp-id] [--quick]"
6
+ allowed-tools: Read, Bash(*), Grep, Glob
7
+ ---
8
+
9
+ CI for your model. After any change to code, dependencies, or data, verify metrics haven't silently regressed.
10
+
11
+ ## Steps
12
+
13
+ 1. **Activate environment:**
14
+ ```bash
15
+ source .venv/bin/activate
16
+ ```
17
+
18
+ 2. **Parse arguments from `$ARGUMENTS`:**
19
+ - `--tolerance 0.01` sets the relative tolerance (default 1%)
20
+ - `--against exp-042` checks against a specific experiment (default: best)
21
+ - `--quick` runs 1 seed instead of 3 for fast checks
22
+ - `--runs 5` sets number of regression runs (default 3)
23
+ - `--json` outputs raw JSON
24
+
25
+ 3. **Run regression gate:**
26
+ ```bash
27
+ python scripts/regression_gate.py $ARGUMENTS
28
+ ```
29
+
30
+ 4. **Report results:**
31
+ - **PASS:** all metrics within tolerance — no regression
32
+ - **WARNING:** some metrics degraded within 2x tolerance — investigate
33
+ - **FAIL:** REGRESSION DETECTED — at least one metric degraded beyond tolerance
34
+ - Shows per-metric comparison with deltas and relative differences
35
+ - Shows environment diff if library versions changed (may explain regression)
36
+
37
+ 5. **Saved output:** report written to `experiments/regressions/check-YYYY-MM-DD.yaml`
38
+
39
+ 6. **If no experiments exist:** suggest running `/turing:train` first.
40
+
41
+ 7. **On FAIL verdict:** suggest investigating with:
42
+ - `/turing:diff <baseline> <latest>` to see what changed
43
+ - `pip freeze` comparison to identify library version changes
44
+ - `git diff` to review code changes
45
+
46
+ ## Examples
47
+
48
+ ```
49
+ /turing:regress # Default: check best, 1% tolerance, 3 runs
50
+ /turing:regress --quick # Fast check: 1 run
51
+ /turing:regress --against exp-042 # Check specific experiment
52
+ /turing:regress --tolerance 0.005 --runs 5 # Strict: 0.5% tolerance, 5 runs
53
+ ```
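Reviewer note: the WARNING and FAIL bands in step 4 overlap as written (a metric can be both "within 2x tolerance" and "beyond tolerance"). The sketch below shows one possible reading, labeled as an assumption: PASS up to the tolerance, WARNING between one and two times the tolerance, FAIL beyond that. It also assumes higher-is-better metrics with nonzero baselines. None of this is taken from `scripts/regression_gate.py`, which the diff does not include.

```python
def relative_degradation(baseline: float, current: float) -> float:
    """Fractional drop from baseline; positive means worse (higher-is-better metric)."""
    return (baseline - current) / abs(baseline)


def verdict(baseline: dict[str, float], current: dict[str, float],
            tolerance: float = 0.01) -> str:
    """Classify the worst per-metric degradation (assumed band boundaries)."""
    worst = max(relative_degradation(baseline[m], current[m]) for m in baseline)
    if worst <= tolerance:
        return "PASS"
    if worst <= 2 * tolerance:
        return "WARNING"
    return "FAIL"


baseline = {"acc": 0.90, "f1": 0.80}
print(verdict(baseline, {"acc": 0.899, "f1": 0.80}))
print(verdict(baseline, {"acc": 0.885, "f1": 0.80}))
print(verdict(baseline, {"acc": 0.85, "f1": 0.80}))
```

If the package actually fails on any degradation beyond one times the tolerance, the WARNING branch would need a different definition, so the skill text is worth tightening either way.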
@@ -0,0 +1,49 @@
1
+ ---
2
+ name: stitch
3
+ description: Pipeline composition — decompose ML pipelines into swappable stages. Show, swap, cache, and run stages independently.
4
+ disable-model-invocation: true
5
+ argument-hint: "<show|swap|cache|run> [stage] [--from exp-id]"
6
+ allowed-tools: Read, Bash(*), Grep, Glob
7
+ ---
8
+
9
+ Decompose your ML pipeline into stages that can be independently varied, cached, and reused across experiments.
10
+
11
+ ## Steps
12
+
13
+ 1. **Activate environment:**
14
+ ```bash
15
+ source .venv/bin/activate
16
+ ```
17
+
18
+ 2. **Parse arguments from `$ARGUMENTS`:**
19
+ - First argument is the action: `show`, `swap`, `cache`, `run`
20
+ - `show` — display pipeline stages with hash and cache status
21
+ - `swap <stage> --from <exp-id>` — replace a stage with one from another experiment
22
+ - `cache` — save intermediate stage outputs to disk
23
+ - `run` — execute pipeline, skipping cached stages
24
+
25
+ 3. **Run pipeline manager:**
26
+ ```bash
27
+ python scripts/pipeline_manager.py $ARGUMENTS
28
+ ```
29
+
30
+ 4. **Report results:**
31
+ - **show:** numbered stage list with description, content hash, and cache status
32
+ - **swap:** what changed, old vs new stage config, updated pipeline
33
+ - **cache:** per-stage cache paths and status
34
+ - **run:** which stages will be skipped (cached) vs re-run
35
+
36
+ 5. **Stage types:** preprocess, features, model, postprocess (configurable in `config.yaml` under `pipeline.stages`)
37
+
38
+ 6. **Cache benefit:** when only the model stage changes, preprocessing and feature engineering are skipped — experiments run faster
39
+
40
+ 7. **If no pipeline config:** falls back to default 4-stage pipeline
41
+
42
+ ## Examples
43
+
44
+ ```
45
+ /turing:stitch show # Display pipeline stages
46
+ /turing:stitch swap model --from exp-031 # Keep features, swap model
47
+ /turing:stitch cache # Cache intermediate outputs
48
+ /turing:stitch run # Run with cached stages
49
+ ```
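Reviewer note: the "content hash" and "cache benefit" items suggest stage outputs are keyed by a hash of the stage's inputs. A minimal sketch of that idea follows, assuming each stage's hash chains in the upstream stage's hash so that a change invalidates everything downstream. The hashing scheme, function names, and config shape are assumptions; `scripts/pipeline_manager.py` is not shown in this diff.

```python
import hashlib
import json

# Default stage order named in the skill file.
STAGES = ["preprocess", "features", "model", "postprocess"]


def stage_hashes(config: dict[str, dict]) -> dict[str, str]:
    """Hash each stage's config together with the hash of the stage before it."""
    hashes, upstream = {}, ""
    for name in STAGES:
        payload = json.dumps(config.get(name, {}), sort_keys=True) + upstream
        upstream = hashlib.sha256(payload.encode()).hexdigest()[:12]
        hashes[name] = upstream
    return hashes


def stages_to_run(cached: dict[str, str], current: dict[str, str]) -> list[str]:
    """Stages whose hash changed must re-run; the rest can reuse cached output."""
    return [name for name in STAGES if cached.get(name) != current.get(name)]


old = {"preprocess": {"scale": True}, "features": {"n": 20}, "model": {"max_depth": 6}}
new = {"preprocess": {"scale": True}, "features": {"n": 20}, "model": {"max_depth": 8}}
print(stages_to_run(stage_hashes(old), stage_hashes(new)))
```

With this chaining, changing only the model stage re-runs `model` and `postprocess` while `preprocess` and `features` are served from cache, which matches the benefit described in step 6.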
@@ -39,6 +39,12 @@ You are the Turing ML research router. Detect the user's intent and route to the
39
39
  | "fork", "branch", "try both", "parallel experiments", "A or B" | `/turing:fork` | Orchestrate |
40
40
  | "profile", "profiling", "bottleneck", "slow training", "why is it slow", "timing" | `/turing:profile` | Check |
41
41
  | "checkpoint", "checkpoints", "prune checkpoints", "disk space", "resume training" | `/turing:checkpoint` | Check |
42
+ | "diff", "deep compare", "what changed", "why did it diverge", "experiment diff" | `/turing:diff` | Analyze |
43
+ | "watch", "monitor", "live training", "loss spike", "is it overfitting", "training progress" | `/turing:watch` | Monitor |
44
+ | "regress", "regression", "did metrics degrade", "check for regression", "CI gate", "stability check" | `/turing:regress` | Validate |
45
+ | "ensemble", "combine models", "voting", "stacking", "blending", "merge models" | `/turing:ensemble` | Compose |
46
+ | "stitch", "pipeline", "swap stage", "cache stage", "pipeline composition" | `/turing:stitch` | Compose |
47
+ | "warm", "warm start", "fine-tune", "continue training", "transfer learning", "from checkpoint" | `/turing:warm` | Compose |
42
48
 
43
49
  ## Sub-commands
44
50
 
@@ -74,6 +80,12 @@ You are the Turing ML research router. Detect the user's intent and route to the
74
80
  | `/turing:fork <exp-id> --branches` | Experiment branching: run parallel tracks, report winner | (inline) |
75
81
  | `/turing:profile [exp-id]` | Computational profiling: timing, memory, throughput, bottleneck detection | (inline) |
76
82
  | `/turing:checkpoint <action>` | Smart checkpoint management: list, prune (Pareto), average, resume, stats | (inline) |
83
+ | `/turing:diff <exp-a> <exp-b>` | Deep experiment comparison: config diff, metric significance, per-class regressions, curve divergence | (inline) |
84
+ | `/turing:watch [--analyze]` | Live training monitor with early-warning alerts (loss spike, NaN, overfitting, plateau) | (inline) |
85
+ | `/turing:regress [--tolerance]` | Performance regression gate: re-run best experiment, verify metrics haven't degraded | (inline) |
86
+ | `/turing:ensemble [--top-k] [--methods]` | Automated ensemble: voting, weighted voting, stacking, blending from top-K models | (inline) |
87
+ | `/turing:stitch <action> [stage]` | Pipeline composition: show/swap/cache/run stages independently | (inline) |
88
+ | `/turing:warm <exp-id>` | Warm-start from prior model: load checkpoint, freeze layers, adjust LR | (inline) |
77
89
 
78
90
  ## Proactive Detection
79
91
 
@@ -0,0 +1,53 @@
1
+ ---
2
+ name: warm
3
+ description: Warm-start from a prior model — load checkpoint, optionally freeze layers, adjust learning rate, and continue training.
4
+ disable-model-invocation: true
5
+ argument-hint: "<exp-id> [--freeze-layers encoder] [--unfreeze-after 5]"
6
+ allowed-tools: Read, Bash(*), Grep, Glob
7
+ ---
8
+
9
+ Take a trained checkpoint and use it as initialization for a new experiment. Automates the "start from here but change X" pattern.
10
+
11
+ ## Steps
12
+
13
+ 1. **Activate environment:**
14
+ ```bash
15
+ source .venv/bin/activate
16
+ ```
17
+
18
+ 2. **Parse arguments from `$ARGUMENTS`:**
19
+ - First argument is the source experiment ID (required)
20
+ - `--freeze-layers encoder decoder` — layer names to freeze (neural only)
21
+ - `--unfreeze-after 5` — unfreeze all layers after N epochs (gradual unfreezing)
22
+ - `--lr-factor 0.1` — learning rate reduction factor (default: 0.1x)
23
+ - `--json` — raw JSON output
24
+
25
+ 3. **Run warm-start planner:**
26
+ ```bash
27
+ python scripts/warm_start.py $ARGUMENTS
28
+ ```
29
+
30
+ 4. **Report results:**
31
+ - Model type detection (tree, neural, sklearn)
32
+ - Strategy: continue_boosting, load_weights, or warm_start_param
33
+ - Numbered step-by-step instructions
34
+ - Config changes to apply
35
+ - Checkpoint info (path, format, size)
36
+
37
+ 5. **Strategies by model type:**
38
+ - **Tree models (XGBoost/LightGBM):** continue boosting from existing trees with more estimators
39
+ - **Neural networks:** load weights, optionally freeze layers, reset optimizer, reduce LR
40
+ - **scikit-learn:** use `warm_start=True` parameter for incremental learning
41
+
42
+ 6. **If no checkpoint found:** plan is still generated, but warns that checkpoint is needed
43
+
44
+ 7. **Saved output:** report written to `experiments/warm_starts/warm-<exp-id>.yaml`
45
+
46
+ ## Examples
47
+
48
+ ```
49
+ /turing:warm exp-042 # Auto-detect strategy
50
+ /turing:warm exp-042 --freeze-layers encoder # Freeze encoder layers
51
+ /turing:warm exp-042 --freeze-layers encoder --unfreeze-after 5 # Gradual unfreezing
52
+ /turing:warm exp-042 --lr-factor 0.01 # Very small fine-tuning LR
53
+ ```
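Reviewer note: the strategy table in step 5 and the `--lr-factor` / `--unfreeze-after` flags can be sketched as a small planning function. The strategy names come from the skill file; the function names, the plan dictionary layout, and the 1-based epoch convention are assumptions, and `scripts/warm_start.py` itself is not shown in this diff.

```python
from typing import Optional

# Strategy names as listed in the skill file, keyed by detected model type.
STRATEGIES = {
    "tree": "continue_boosting",
    "neural": "load_weights",
    "sklearn": "warm_start_param",
}


def warm_start_plan(model_type: str, base_lr: float, lr_factor: float = 0.1,
                    freeze_layers: tuple = (), unfreeze_after: Optional[int] = None) -> dict:
    """Build a hypothetical plan; LR scaling and freezing apply to neural models only."""
    plan = {"strategy": STRATEGIES[model_type]}
    if model_type == "neural":
        plan["learning_rate"] = base_lr * lr_factor
        plan["frozen_layers"] = list(freeze_layers)
        plan["unfreeze_after"] = unfreeze_after
    return plan


def frozen_at(plan: dict, epoch: int) -> list:
    """Layers still frozen at a given 1-based epoch (gradual unfreezing)."""
    limit = plan.get("unfreeze_after")
    if limit is not None and epoch > limit:
        return []
    return plan.get("frozen_layers", [])


plan = warm_start_plan("neural", 0.001, freeze_layers=("encoder",), unfreeze_after=5)
print(plan)
print(frozen_at(plan, 5), frozen_at(plan, 6))
```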
@@ -0,0 +1,60 @@
1
+ ---
2
+ name: watch
3
+ description: Live training monitor with early-warning alerts for loss spikes, NaN, overfitting, and metric plateaus.
4
+ disable-model-invocation: true
5
+ argument-hint: "[--alerts] [--interval 10] [--analyze run.log]"
6
+ allowed-tools: Read, Bash(*), Grep, Glob
7
+ ---
8
+
9
+ Stream metrics during training with early-warning alerts. Catches problems mid-run instead of at the end.
10
+
11
+ ## Steps
12
+
13
+ 1. **Activate environment:**
14
+ ```bash
15
+ source .venv/bin/activate
16
+ ```
17
+
18
+ 2. **Parse arguments from `$ARGUMENTS`:**
19
+ - `--analyze run.log` — post-hoc analysis of a completed log (non-blocking)
20
+ - `--alerts` — show only alert lines, suppress normal output
21
+ - `--interval 10` — check interval in seconds (default: 10)
22
+ - `--alerts-config config/watch_alerts.yaml` — custom alert rules
23
+ - `--json` — raw JSON output (for `--analyze` mode)
24
+
25
+ 3. **For post-hoc analysis:**
26
+ ```bash
27
+ python scripts/training_monitor.py --analyze run.log
28
+ ```
29
+
30
+ 4. **For live monitoring (inform user):**
31
+ Live monitoring requires a running training process. Suggest the user run in a separate terminal:
32
+ ```bash
33
+ python scripts/training_monitor.py --log run.log --interval 10
34
+ ```
35
+
36
+ 5. **Alert types:**
37
+ - **Loss spike:** loss > 3x rolling mean (configurable multiplier)
38
+ - **NaN detected:** any metric is NaN — CRITICAL, suggests pausing
39
+ - **Overfitting onset:** train/val gap widening for 3+ consecutive epochs
40
+ - **Plateau:** metric improvement < 0.001 for 5+ consecutive epochs
41
+
42
+ 6. **Dashboard line format:**
43
+ ```
44
+ Epoch 23/100 | loss: 0.342 ↓ | acc: 0.865 ↑ | gap: 0.018 | ⚠ plateau
45
+ ```
46
+
47
+ 7. **Alert config:** rules are in `config/watch_alerts.yaml` — users can customize thresholds.
48
+
49
+ 8. **Saved output:** analysis report written to `experiments/monitors/analysis-*.yaml`
50
+
51
+ 9. **If no training log exists:** suggest running `/turing:train` first.
52
+
53
+ ## Examples
54
+
55
+ ```
56
+ /turing:watch --analyze run.log # Analyze completed training
57
+ /turing:watch --analyze run.log --json # JSON output for scripting
58
+ /turing:watch --alerts # Live: show only alerts
59
+ /turing:watch --interval 5 # Live: check every 5 seconds
60
+ ```
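Reviewer note: the four alert types in step 5 can be expressed as simple checks over per-epoch metric histories. The sketch below uses the thresholds stated in the skill file (3x multiplier, 0.001 minimum improvement, 5 and 3 consecutive epochs). The rolling `window` size, the function names, and the exact definition of "gap" as validation loss minus training loss are assumptions; `scripts/training_monitor.py` is not included in this diff.

```python
import math


def loss_spike(losses: list[float], multiplier: float = 3.0, window: int = 5) -> bool:
    """True if the latest loss exceeds multiplier times the mean of the preceding window."""
    history = losses[-window - 1:-1]
    if not history:
        return False
    return losses[-1] > multiplier * (sum(history) / len(history))


def has_nan(metrics: dict[str, float]) -> bool:
    """True if any reported metric is NaN."""
    return any(isinstance(v, float) and math.isnan(v) for v in metrics.values())


def plateaued(values: list[float], min_improvement: float = 0.001,
              consecutive: int = 5) -> bool:
    """True if each of the last `consecutive` epoch-to-epoch changes is below the minimum."""
    if len(values) < consecutive + 1:
        return False
    recent = values[-(consecutive + 1):]
    return all(abs(b - a) < min_improvement for a, b in zip(recent, recent[1:]))


def overfitting(train_losses: list[float], val_losses: list[float],
                consecutive: int = 3) -> bool:
    """True if the val-minus-train gap widened in each of the last `consecutive` epochs."""
    gaps = [v - t for t, v in zip(train_losses, val_losses)]
    if len(gaps) < consecutive + 1:
        return False
    recent = gaps[-(consecutive + 1):]
    return all(b > a for a, b in zip(recent, recent[1:]))


print(loss_spike([0.5, 0.5, 0.5, 0.5, 2.0]))
print(overfitting([0.5, 0.4, 0.3, 0.2], [0.6, 0.6, 0.6, 0.6]))
```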
@@ -0,0 +1,36 @@
1
+ # Training monitor alert rules for /turing:watch
2
+ #
3
+ # Each alert has:
4
+ # condition: alert type (loss_spike, nan_detected, overfitting, plateau)
5
+ # severity: info | warning | critical
6
+ # action: optional action on trigger (e.g., "pause")
7
+ # message: alert message template with {epoch}, {value}, {mean}, etc.
8
+ #
9
+ # Customize thresholds to match your training dynamics.
10
+
11
+ alerts:
12
+ loss_spike:
13
+ condition: loss_spike
14
+ multiplier: 3.0 # Trigger if loss > N * rolling_mean
15
+ severity: warning
16
+ message: "Loss spike at epoch {epoch}: {value} vs rolling mean {mean:.4f}"
17
+
18
+ nan_detected:
19
+ condition: nan_detected
20
+ severity: critical
21
+ action: pause # Suggest pausing training on NaN
22
+ message: "NaN detected in {metric} at epoch {epoch}"
23
+
24
+ overfitting_onset:
25
+ condition: overfitting
26
+ gap_ratio: 0.5 # train_loss / val_loss ratio threshold
27
+ consecutive: 3 # N consecutive epochs of widening gap
28
+ severity: warning
29
+ message: "Overfitting detected — train/val gap widening since epoch {onset}"
30
+
31
+ plateau:
32
+ condition: plateau
33
+ min_improvement: 0.001 # Minimum metric change per epoch
34
+ consecutive: 5 # N consecutive flat epochs
35
+ severity: info
36
+ message: "Metric plateaued — consider early stopping or learning rate reduction"
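Reviewer note: the `message` templates above use placeholders such as `{epoch}` and `{mean:.4f}`, which follow Python's format-string syntax. A minimal sketch of how such a template could be rendered is shown below. It assumes rendering via `str.format`, which the diff does not confirm, and the `render_alert` helper and the bracketed severity prefix are hypothetical. The rules are inlined as a dict rather than loaded from the YAML file to keep the example dependency-free.

```python
# Two rules copied from watch_alerts.yaml, inlined for a self-contained example.
ALERTS = {
    "loss_spike": {
        "severity": "warning",
        "message": "Loss spike at epoch {epoch}: {value} vs rolling mean {mean:.4f}",
    },
    "nan_detected": {
        "severity": "critical",
        "action": "pause",
        "message": "NaN detected in {metric} at epoch {epoch}",
    },
}


def render_alert(name: str, **context) -> str:
    """Fill a rule's message template and prefix it with the severity (hypothetical format)."""
    rule = ALERTS[name]
    text = rule["message"].format(**context)
    return f"[{rule['severity'].upper()}] {text}"


print(render_alert("loss_spike", epoch=23, value=1.8, mean=0.51234))
print(render_alert("nan_detected", metric="loss", epoch=7))
```

Under this assumption, a custom template that references a placeholder the monitor does not supply would raise `KeyError`, so anyone customizing messages should stick to the placeholders listed in the file header.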
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "claude-turing",
3
- "version": "2.2.1",
3
+ "version": "2.4.0",
4
4
  "type": "module",
5
5
  "description": "Autonomous ML research harness for Claude Code. The autoresearch loop as a formal protocol — iteratively trains, evaluates, and improves ML models with structured experiment tracking, convergence detection, immutable evaluation infrastructure, and safety guardrails.",
6
6
  "bin": {
package/src/install.js CHANGED
@@ -25,6 +25,8 @@ const SUB_COMMANDS = [
25
25
  "report", "mode", "preflight", "card", "seed", "reproduce",
26
26
  "diagnose", "ablate", "frontier", "profile", "checkpoint", "export",
27
27
  "lit", "paper", "queue", "retry", "fork",
28
+ "diff", "watch", "regress",
29
+ "ensemble", "stitch", "warm",
28
30
  ];
29
31
 
30
32
  export async function install(opts = {}) {
@@ -79,6 +81,7 @@ export async function install(opts = {}) {
79
81
  "experiment_archetypes.yaml", "novelty_aliases.yaml",
80
82
  "relationships.toml", "state.toml", "task_taxonomy.yaml",
81
83
  "failure_modes.yaml",
84
+ "watch_alerts.yaml",
82
85
  ];
83
86
  for (const file of CONFIG_FILES) {
84
87
  await copyFile(
package/src/verify.js CHANGED
@@ -44,6 +44,12 @@ const EXPECTED_COMMANDS = [
44
44
  "queue/SKILL.md",
45
45
  "retry/SKILL.md",
46
46
  "fork/SKILL.md",
47
+ "diff/SKILL.md",
48
+ "watch/SKILL.md",
49
+ "regress/SKILL.md",
50
+ "ensemble/SKILL.md",
51
+ "stitch/SKILL.md",
52
+ "warm/SKILL.md",
47
53
  ];
48
54
 
49
55
  const EXPECTED_AGENTS = ["ml-researcher.md", "ml-evaluator.md"];
@@ -53,6 +59,7 @@ const EXPECTED_CONFIG = [
53
59
  "experiment_archetypes.yaml", "novelty_aliases.yaml",
54
60
  "relationships.toml", "state.toml", "task_taxonomy.yaml",
55
61
  "failure_modes.yaml",
62
+ "watch_alerts.yaml",
56
63
  ];
57
64
 
58
65
  async function fileExists(path) {