claude-turing 1.2.0 → 1.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (35)
  1. package/.claude-plugin/plugin.json +2 -2
  2. package/README.md +33 -2
  3. package/commands/ablate.md +47 -0
  4. package/commands/diagnose.md +52 -0
  5. package/commands/frontier.md +45 -0
  6. package/commands/reproduce.md +48 -0
  7. package/commands/seed.md +47 -0
  8. package/commands/turing.md +10 -0
  9. package/package.json +1 -1
  10. package/src/install.js +2 -1
  11. package/src/verify.js +5 -0
  12. package/templates/config.yaml +10 -0
  13. package/templates/program.md +5 -0
  14. package/templates/scripts/__pycache__/ablation_study.cpython-314.pyc +0 -0
  15. package/templates/scripts/__pycache__/diagnose_errors.cpython-314.pyc +0 -0
  16. package/templates/scripts/__pycache__/generate_brief.cpython-314.pyc +0 -0
  17. package/templates/scripts/__pycache__/generate_model_card.cpython-314.pyc +0 -0
  18. package/templates/scripts/__pycache__/pareto_frontier.cpython-314.pyc +0 -0
  19. package/templates/scripts/__pycache__/reproduce_experiment.cpython-314.pyc +0 -0
  20. package/templates/scripts/__pycache__/scaffold.cpython-314.pyc +0 -0
  21. package/templates/scripts/__pycache__/seed_runner.cpython-314.pyc +0 -0
  22. package/templates/scripts/__pycache__/turing_io.cpython-314.pyc +0 -0
  23. package/templates/scripts/__pycache__/update_state.cpython-314.pyc +0 -0
  24. package/templates/scripts/ablation_study.py +487 -0
  25. package/templates/scripts/diagnose_errors.py +601 -0
  26. package/templates/scripts/generate_brief.py +117 -0
  27. package/templates/scripts/generate_model_card.py +25 -0
  28. package/templates/scripts/leaderboard.py +10 -0
  29. package/templates/scripts/pareto_frontier.py +470 -0
  30. package/templates/scripts/reproduce_experiment.py +548 -0
  31. package/templates/scripts/scaffold.py +11 -0
  32. package/templates/scripts/seed_runner.py +414 -0
  33. package/templates/scripts/show_metrics.py +17 -0
  34. package/templates/scripts/turing_io.py +36 -0
  35. package/templates/scripts/update_state.py +13 -0
package/.claude-plugin/plugin.json CHANGED
@@ -1,7 +1,7 @@
  {
  "name": "turing",
- "version": "1.2.0",
- "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 17 commands, 2 specialized agents, tree-search hypothesis exploration (TreeQuest AB-MCTS), cost-performance frontier analysis, model cards, model registry, hypothesis database with novelty guard, anti-cheating guardrails, and the taste-leverage loop. Inspired by Karpathy's autoresearch and the scientific method itself.",
+ "version": "1.4.0",
+ "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 22 commands, 2 specialized agents, experiment intelligence (error analysis, ablation studies, Pareto frontiers), statistical rigor (multi-seed studies, reproducibility verification), tree-search hypothesis exploration (TreeQuest AB-MCTS), cost-performance frontier analysis, model cards, model registry, hypothesis database with novelty guard, anti-cheating guardrails, and the taste-leverage loop. Inspired by Karpathy's autoresearch and the scientific method itself.",
  "author": {
  "name": "pragnition"
  },
package/README.md CHANGED
@@ -323,6 +323,11 @@ The index (`hypotheses.yaml`) is the lightweight queue. The detail files (`hypot
  | Command | What it does |
  |---------|-------------|
  | `/turing:validate [--auto]` | Check metric stability — auto-configure multi-run if noisy |
+ | `/turing:seed [N] [--quick]` | Multi-seed study — mean/std/CI, flag seed-sensitive results |
+ | `/turing:reproduce <exp-id>` | Reproducibility verification — re-run and check tolerance |
+ | `/turing:diagnose [exp-id]` | Error analysis — failure modes, confused pairs, feature-range bias |
+ | `/turing:ablate [--components]` | Ablation study — remove components, measure impact, flag dead weight |
+ | `/turing:frontier [--metrics]` | Pareto frontier — multi-objective tradeoff visualization |
  | `/turing:card` | Generate a model card — performance, limitations, intended use, artifact contract |
  | `/turing:logbook` | Generate HTML experiment logbook |
  | `/turing:report` | Generate research report |
@@ -392,6 +397,32 @@ After N experiments with no meaningful improvement, the agent stops and reports

  For noisy metrics, `/turing:validate` runs the pipeline multiple times and measures variance. If the coefficient of variation exceeds 5%, it auto-configures multi-run evaluation so the agent can't be rewarded for lucky single runs.

+ ## Statistical Rigor
+
+ > *"Stop publishing lucky seeds. Start publishing distributions."*
+
+ Before claiming a result, run a seed study:
+
+ ```
+ /turing:seed         # 5 seeds on best experiment
+ /turing:seed --quick # 3 seeds for fast check
+ /turing:seed 10      # 10 seeds for thorough study
+ ```
+
+ This runs the same experiment across multiple random seeds and reports mean ± std with 95% confidence intervals. If the coefficient of variation exceeds 5%, the result is flagged as **seed-sensitive** — meaning you should report the distribution, not a single number.
+
+ To verify an experiment can be reproduced:
+
+ ```
+ /turing:reproduce exp-042                  # Default: 3 runs, 2% tolerance
+ /turing:reproduce exp-042 --strict         # Exact match required
+ /turing:reproduce exp-042 --tolerance 0.05 # Custom tolerance
+ ```
+
+ This re-runs the experiment from the logged config and checks that metrics fall within tolerance. It also detects environment drift — if library versions have changed since the original run, you'll know before a reviewer tells you.
+
+ Seed study results automatically appear in `/turing:brief` and `/turing:card`.
+
  ## Tree-Search Hypothesis Exploration

  > *"The learned coin-flipper weaves through the quadrillion-coin room with a preternatural air."*
@@ -486,11 +517,11 @@ Each project gets independent config, data, experiments, models, and agent memor

  ## Architecture of Turing Itself

- 17 commands, 2 agents, 8 config files, 31 template scripts, model registry, artifact contract, cost-performance frontier, model cards, tree-search exploration, 379 tests, 16 ADRs. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full codemap.
+ 22 commands, 2 agents, 8 config files, 37 template scripts, model registry, artifact contract, cost-performance frontier, model cards, tree-search exploration, statistical rigor, experiment intelligence (error analysis + ablation + Pareto frontier), 487 tests, 16 ADRs. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full codemap.

  ```
  turing/
- ├── commands/   16 skill files (core + taste-leverage + reporting + exploration)
+ ├── commands/   21 skill files (core + taste-leverage + reporting + exploration + statistical rigor + experiment intelligence)
  ├── agents/     2 agents (researcher: read/write, evaluator: read-only)
  ├── config/     8 files (lifecycle, taxonomy, archetypes, novelty aliases)
  ├── templates/  Scaffolded into user projects by /turing:init
package/commands/ablate.md ADDED
@@ -0,0 +1,47 @@
+ ---
+ name: ablate
+ description: Run a systematic ablation study — remove components one at a time, measure impact, and produce a publication-ready table with dead-weight flagging.
+ disable-model-invocation: true
+ argument-hint: "[exp-id] [--components \"X,Y\"] [--seeds 3] [--latex]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Run a systematic ablation study to measure the contribution of each model component.
+
+ ## Steps
+
+ 1. **Activate environment:**
+    ```bash
+    source .venv/bin/activate
+    ```
+
+ 2. **Parse arguments from `$ARGUMENTS`:**
+    - First argument can be an experiment ID (e.g., `exp-042`); defaults to best
+    - `--components "dropout,feature_X,regularization"` specifies the components to ablate
+    - `--seeds 3` runs each ablation 3 times for statistical robustness (uses the seed runner)
+    - `--latex` outputs a LaTeX-formatted table instead of markdown
+
+ 3. **Run ablation study:**
+    ```bash
+    python scripts/ablation_study.py $ARGUMENTS
+    ```
+
+ 4. **Report results:**
+    - Show the ablation table: Configuration | Metric | Δ from Full | % Change
+    - Rank by impact (largest Δ first)
+    - Flag **dead-weight** components (removing them improves the metric)
+    - If `--latex`, the output is ready for copy-paste into a paper
+
+ 5. **Saved output:** results written to `experiments/ablations/exp-NNN-ablation.yaml`
+
+ 6. **If no ablatable components are detected:** suggest using `--components` explicitly.
+
+ ## Examples
+
+ ```
+ /turing:ablate                                  # Auto-detect components
+ /turing:ablate exp-042                          # Specific experiment
+ /turing:ablate --components "dropout,subsample" # Specific components
+ /turing:ablate --seeds 3                        # Multi-seed for robustness
+ /turing:ablate --latex                          # LaTeX table output
+ ```
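Editorial aside: the Δ / % Change arithmetic that step 4 describes reduces to a few lines. A minimal sketch under stated assumptions (hypothetical helper name and result shape; the shipped `scripts/ablation_study.py` in this release is 487 lines and does much more):

```python
# Hypothetical sketch of the ablation table: delta from the full model,
# percent change, and dead-weight flagging. Not the package's actual API.
def ablation_table(full_score, ablated_scores, higher_is_better=True):
    """ablated_scores maps component name -> metric with that component removed.

    A component is dead weight if removing it *improves* the metric.
    Rows are ranked by |delta|, largest impact first.
    """
    rows = []
    for name, score in ablated_scores.items():
        delta = score - full_score
        pct = 100.0 * delta / abs(full_score) if full_score else float("nan")
        improved = delta > 0 if higher_is_better else delta < 0
        rows.append({"config": f"full - {name}", "metric": score,
                     "delta": delta, "pct_change": pct, "dead_weight": improved})
    return sorted(rows, key=lambda r: abs(r["delta"]), reverse=True)
```

For example, ablating `dropout` from a 0.90-accuracy model down to 0.85 ranks first (largest |Δ|), while a component whose removal nudges the metric up gets the dead-weight flag.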
package/commands/diagnose.md ADDED
@@ -0,0 +1,52 @@
+ ---
+ name: diagnose
+ description: Error analysis — cluster failure cases, identify systematic failure modes, and suggest targeted fixes with auto-queued hypotheses.
+ disable-model-invocation: true
+ argument-hint: "[exp-id] [--auto-queue] [--top 5]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Analyze where and why the model fails, beyond aggregate metrics.
+
+ ## Steps
+
+ 1. **Activate environment:**
+    ```bash
+    source .venv/bin/activate
+    ```
+
+ 2. **Generate predictions if needed:**
+    Check whether `experiments/predictions/exp-NNN-preds.yaml` exists. If not, run:
+    ```bash
+    python train.py --predict-only --output experiments/predictions/
+    ```
+    The predictions file must contain `y_true`, `y_pred`, `task_type`, and optionally `features`.
+
+ 3. **Parse arguments from `$ARGUMENTS`:**
+    - First argument can be an experiment ID (e.g., `exp-042`); defaults to best
+    - `--auto-queue` auto-queues hypotheses from failure modes into `hypotheses.yaml`
+    - `--top 5` limits output to the top N failure modes (default 5)
+
+ 4. **Run error analysis:**
+    ```bash
+    python scripts/diagnose_errors.py $ARGUMENTS
+    ```
+
+ 5. **Report results:**
+    - **Classification:** confusion matrix, most-confused pairs, per-class P/R/F1, low-recall classes
+    - **Regression:** residual stats, P90/P95 errors, feature-range bias, systematic bias
+    - **Failure modes:** ranked by impact, with suggested fixes
+    - **Auto-hypotheses:** if `--auto-queue`, show the queued hypotheses targeting weaknesses
+
+ 6. **Saved output:** report written to `experiments/diagnoses/exp-NNN-diagnosis.yaml`
+
+ 7. **If no predictions file exists:** instruct the user to run the model on the validation set first.
+
+ ## Examples
+
+ ```
+ /turing:diagnose              # Analyze best experiment
+ /turing:diagnose exp-042      # Specific experiment
+ /turing:diagnose --auto-queue # Queue fix hypotheses
+ /turing:diagnose --top 10     # Top 10 failure modes
+ ```
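Editorial aside: the "most-confused pairs" item in step 5 is the simplest of these diagnostics and is worth seeing concretely. A minimal sketch (hypothetical helper; the shipped `scripts/diagnose_errors.py` covers far more, including regression residuals and feature-range bias):

```python
# Hypothetical sketch: rank off-diagonal (true, predicted) pairs by frequency.
from collections import Counter

def most_confused_pairs(y_true, y_pred, top=5):
    """Return the top N most frequent misclassification pairs."""
    pairs = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return pairs.most_common(top)
```

Each returned pair like `(("cat", "dog"), 2)` reads "cat was predicted as dog twice", which is exactly the kind of systematic confusion a fix hypothesis can target.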
package/commands/frontier.md ADDED
@@ -0,0 +1,45 @@
+ ---
+ name: frontier
+ description: Visualize the Pareto frontier across multiple objectives — answers "which model is actually best?" when there are tradeoffs.
+ disable-model-invocation: true
+ argument-hint: "[--metrics \"accuracy,train_seconds,n_params\"] [--ascii]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Visualize the Pareto frontier across multiple objectives from experiment history.
+
+ ## Steps
+
+ 1. **Activate environment:**
+    ```bash
+    source .venv/bin/activate
+    ```
+
+ 2. **Parse arguments from `$ARGUMENTS`:**
+    - `--metrics "accuracy,train_seconds,n_params"` specifies the metrics to analyze
+    - Without `--metrics`, uses the primary metric + train_seconds from config
+    - `--ascii` generates an ASCII scatter plot (2D projection)
+
+ 3. **Run Pareto analysis:**
+    ```bash
+    python scripts/pareto_frontier.py $ARGUMENTS
+    ```
+
+ 4. **Report results:**
+    - **Pareto-optimal experiments:** table with all metrics and what each is best at
+    - **Dominated experiments:** with their nearest Pareto neighbor
+    - **ASCII scatter plot** (if `--ascii`): 2D projection with `*` for Pareto, `·` for dominated
+    - Summary: "N Pareto-optimal of M experiments across K metrics"
+
+ 5. **Saved output:** results written to `experiments/frontiers/frontier-YYYY-MM-DD.yaml`
+
+ 6. **If no experiments have all requested metrics:** suggest which metrics are available.
+
+ ## Examples
+
+ ```
+ /turing:frontier                                             # Default: metric vs time
+ /turing:frontier --metrics "accuracy,train_seconds"          # 2D frontier
+ /turing:frontier --metrics "accuracy,train_seconds,n_params" # 3D frontier
+ /turing:frontier --ascii                                     # With scatter plot
+ ```
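Editorial aside: the dominance test behind step 4 is short enough to state exactly. A minimal sketch assuming every metric has been oriented so that higher is better (e.g. negate `train_seconds`); hypothetical helper, not the shipped `scripts/pareto_frontier.py`:

```python
# Hypothetical sketch of Pareto filtering. A point p dominates q if p is
# at least as good on every metric and strictly better on at least one.
def pareto_optimal(points):
    """Return indices of points not dominated by any other point."""
    def dominates(p, q):
        return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))
    return [i for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]
```

Note that ties survive: two experiments with identical metrics are both Pareto-optimal, since neither strictly beats the other.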
package/commands/reproduce.md ADDED
@@ -0,0 +1,48 @@
+ ---
+ name: reproduce
+ description: Verify reproducibility of a specific experiment by re-running it from the logged config and checking that metrics fall within tolerance.
+ disable-model-invocation: true
+ argument-hint: "<exp-id> [--tolerance 0.02] [--strict] [--runs 3]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Verify that a logged experiment can be reproduced with consistent results.
+
+ ## Steps
+
+ 1. **Activate environment:**
+    ```bash
+    source .venv/bin/activate
+    ```
+
+ 2. **Parse arguments from `$ARGUMENTS`:**
+    - First argument is the experiment ID (required), e.g. `exp-042`
+    - `--tolerance 0.02` sets the relative tolerance (default 2%)
+    - `--strict` requires an exact float match (1e-6) and overrides tolerance
+    - `--runs 3` sets the number of reproduction runs (default 3; 1 for strict)
+
+ 3. **Run reproducibility verification:**
+    ```bash
+    python scripts/reproduce_experiment.py $ARGUMENTS
+    ```
+
+ 4. **Report results:**
+    - **reproducible:** metrics match exactly (deterministic algorithm)
+    - **approximately_reproducible:** metrics within tolerance, or the original falls in the 95% CI
+    - **not_reproducible:** metrics outside both tolerance and CI
+    - **environment_changed:** metrics diverge AND library versions differ
+    - Show the environment diff if present (Python version, package versions)
+
+ 5. **Saved output:** report written to `experiments/reproductions/exp-NNN-repro.yaml`
+
+ 6. **If the experiment ID is not found:** list available experiment IDs from `experiments/log.jsonl`
+
+ 7. **If no training pipeline exists:** suggest `/turing:init` first.
+
+ ## Examples
+
+ ```
+ /turing:reproduce exp-042                           # Default: 3 runs, 2% tolerance
+ /turing:reproduce exp-042 --strict                  # Exact match required
+ /turing:reproduce exp-042 --tolerance 0.05 --runs 5 # Lenient, more runs
+ ```
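Editorial aside: the tolerance half of the verdict in step 4 can be sketched in a few lines. This is a hypothetical helper under simplifying assumptions; the shipped `scripts/reproduce_experiment.py` also checks the 95% CI and the environment diff before choosing between `not_reproducible` and `environment_changed`:

```python
# Hypothetical sketch of the tolerance check: compare the mean of the
# reproduction runs against the originally logged metric.
def reproduction_verdict(original, rerun_values, tolerance=0.02, strict=False):
    mean = sum(rerun_values) / len(rerun_values)
    if strict:  # --strict: exact float match to 1e-6
        return "reproducible" if abs(mean - original) < 1e-6 else "not_reproducible"
    rel_err = abs(mean - original) / abs(original) if original else abs(mean)
    return "approximately_reproducible" if rel_err <= tolerance else "not_reproducible"
```

With the defaults (3 runs, 2% tolerance), a logged 0.90 that re-runs at 0.89 to 0.91 passes; a re-run around 0.80 does not.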
package/commands/seed.md ADDED
@@ -0,0 +1,47 @@
+ ---
+ name: seed
+ description: Run a multi-seed study on an experiment to compute mean/std/CI and flag seed-sensitive results. Prevents publishing lucky seeds.
+ disable-model-invocation: true
+ argument-hint: "[N] [--quick] [--exp-id <id>]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Run a multi-seed study to verify that experiment results are robust across random seeds.
+
+ ## Steps
+
+ 1. **Activate environment:**
+    ```bash
+    source .venv/bin/activate
+    ```
+
+ 2. **Parse arguments from `$ARGUMENTS`:**
+    - A bare number (e.g., `5`) sets the seed count
+    - `--quick` runs 3 seeds instead of 5
+    - `--exp-id exp-042` targets a specific experiment (defaults to best)
+    - `--seed-list 42,123,456` uses specific seed values
+
+ 3. **Run seed study:**
+    ```bash
+    python scripts/seed_runner.py $ARGUMENTS
+    ```
+
+ 4. **Report results:**
+    - Show the per-seed results table
+    - Show mean ± std with 95% CI
+    - **STABLE (CV < 5%):** result is robust, safe to report
+    - **SEED-SENSITIVE (CV ≥ 5%):** result varies too much across seeds — do not report single-seed numbers
+    - If seed-sensitive, recommend reporting mean ± std over N seeds
+
+ 5. **Saved output:** results written to `experiments/seed_studies/exp-NNN-seeds.yaml`
+
+ 6. **If no training pipeline exists:** suggest `/turing:init` first.
+
+ ## Examples
+
+ ```
+ /turing:seed                  # 5 seeds on best experiment
+ /turing:seed --quick          # 3 seeds for fast check
+ /turing:seed 10               # 10 seeds for thorough study
+ /turing:seed --exp-id exp-042 # Specific experiment
+ ```
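Editorial aside: the statistics behind step 4 fit in one function. A minimal sketch (hypothetical helper; the shipped `scripts/seed_runner.py` is 414 lines, and this sketch assumes a normal approximation for the 95% CI, which may differ from the package's method):

```python
# Hypothetical sketch: mean, sample std, 95% CI half-width on the mean,
# coefficient of variation, and the STABLE / SEED-SENSITIVE verdict.
import statistics

def seed_stats(scores, cv_threshold=5.0):
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)           # sample std (n - 1 denominator)
    ci95 = 1.96 * std / len(scores) ** 0.5   # normal-approximation CI half-width
    cv = 100.0 * std / abs(mean)             # coefficient of variation, in %
    return {"mean": mean, "std": std, "ci95": ci95, "cv": cv,
            "verdict": "STABLE" if cv < cv_threshold else "SEED-SENSITIVE"}
```

Five seeds landing at 0.89 to 0.91 give a CV well under 5% (STABLE); scores spread across 0.5 to 0.9 trip the SEED-SENSITIVE flag, which is the "lucky seed" failure this command exists to catch.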
package/commands/turing.md CHANGED
@@ -20,12 +20,17 @@ You are the Turing ML research router. Detect the user's intent and route to the
  | "poster", "presentation", "one-pager", "visual summary" | `/turing:poster` | Document |
  | "report", "write-up", "findings", "document results" | `/turing:report` | Document |
  | "validate", "stability", "check variance", "noisy" | `/turing:validate` | Validate |
+ | "seed", "seed study", "multi-seed", "lucky seed", "seed sensitivity" | `/turing:seed` | Validate |
+ | "reproduce", "reproducibility", "verify results", "re-run experiment", "repro" | `/turing:reproduce` | Validate |
  | "suggest", "what model", "recommend", "which architecture", "literature" | `/turing:suggest` | Research |
  | "explore hypotheses", "tree search", "treequest", "search hypothesis space", "MCTS" | `/turing:explore` | Research |
  | "design", "plan experiment", "how should I test", "experiment design" | `/turing:design` | Design |
  | "mode", "explore", "exploit", "replicate", "strategy" | `/turing:mode` | Strategy |
  | "preflight", "resources", "VRAM", "memory", "can I run", "OOM", "GPU" | `/turing:preflight` | Check |
  | "card", "model card", "document model", "model documentation" | `/turing:card` | Document |
+ | "diagnose", "error analysis", "failure modes", "where does it fail", "confusion matrix" | `/turing:diagnose` | Analyze |
+ | "ablate", "ablation", "remove component", "which features matter", "component impact" | `/turing:ablate` | Analyze |
+ | "frontier", "pareto", "tradeoff", "tradeoffs", "multi-objective", "which model is best" | `/turing:frontier` | Analyze |

  ## Sub-commands

@@ -39,6 +44,8 @@ You are the Turing ML research router. Detect the user's intent and route to the
  | `/turing:brief` | Generate structured research intelligence report | @ml-evaluator |
  | `/turing:init` | Scaffold a new ML project | (inline) |
  | `/turing:validate` | Check metric stability, auto-fix if noisy | (inline) |
+ | `/turing:seed [N] [--quick]` | Multi-seed study: mean/std/CI, flag seed-sensitive results | (inline) |
+ | `/turing:reproduce <exp-id>` | Reproducibility verification with tolerance checking | (inline) |
  | `/turing:suggest` | Literature-grounded model architecture suggestions | (inline, uses WebSearch) |
  | `/turing:explore` | Tree-search hypothesis exploration via AB-MCTS | (inline) |
  | `/turing:design <hyp-id>` | Generate structured experiment design from hypothesis | (inline, uses WebSearch) |
@@ -48,6 +55,9 @@ You are the Turing ML research router. Detect the user's intent and route to the
  | `/turing:mode <mode>` | Set research strategy (explore/exploit/replicate) | (inline) |
  | `/turing:preflight` | Pre-flight resource check (VRAM/RAM/disk) | (inline) |
  | `/turing:card` | Generate standardized model card (type, performance, data, limitations, contract) | (inline) |
+ | `/turing:diagnose [exp-id]` | Error analysis: failure modes, confused pairs, feature-range bias | (inline) |
+ | `/turing:ablate [--components]` | Ablation study: remove components, measure impact, flag dead weight | (inline) |
+ | `/turing:frontier [--metrics]` | Pareto frontier: multi-objective tradeoff visualization | (inline) |

  ## Proactive Detection

package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "claude-turing",
- "version": "1.2.0",
+ "version": "1.4.0",
  "type": "module",
  "description": "Autonomous ML research harness for Claude Code. The autoresearch loop as a formal protocol — iteratively trains, evaluates, and improves ML models with structured experiment tracking, convergence detection, immutable evaluation infrastructure, and safety guardrails.",
  "bin": {
package/src/install.js CHANGED
@@ -22,7 +22,8 @@ const PLUGIN_ROOT = join(__dirname, "..");
  const SUB_COMMANDS = [
    "init", "train", "status", "compare", "sweep", "validate",
    "try", "brief", "suggest", "explore", "design", "logbook", "poster",
-   "report", "mode", "preflight", "card",
+   "report", "mode", "preflight", "card", "seed", "reproduce",
+   "diagnose", "ablate", "frontier",
  ];

  export async function install(opts = {}) {
package/src/verify.js CHANGED
@@ -31,6 +31,11 @@ const EXPECTED_COMMANDS = [
    "mode/SKILL.md",
    "preflight/SKILL.md",
    "card/SKILL.md",
+   "seed/SKILL.md",
+   "reproduce/SKILL.md",
+   "diagnose/SKILL.md",
+   "ablate/SKILL.md",
+   "frontier/SKILL.md",
  ];

  const EXPECTED_AGENTS = ["ml-researcher.md", "ml-evaluator.md"];
package/templates/config.yaml CHANGED
@@ -19,6 +19,16 @@ evaluation:
    # Set to false for metrics where higher is better (accuracy, f1, auc)
    lower_is_better: false # {{METRIC_DIRECTION}} -- change to true if lower is better

+   # Multi-seed configuration (Phase 10.1: /turing:seed)
+   # Seeds used for seed studies — diverse values for good coverage
+   seed_seeds: [42, 123, 456, 789, 1024, 1337, 2048, 3141, 4096, 7919]
+   seed_study_n_runs: 5            # Default number of seeds for /turing:seed
+   seed_sensitivity_threshold: 5.0 # CV% above this = seed-sensitive
+
+   # Reproducibility configuration (Phase 10.2: /turing:reproduce)
+   reproduce_tolerance: 0.02 # 2% relative tolerance for approximate match
+   reproduce_n_runs: 3       # Default reproduction runs for stochastic algorithms
+
  convergence:
    patience: 3                      # Consecutive non-improvements before stopping
    improvement_threshold: 0.005     # 0.5% relative improvement required
package/templates/program.md CHANGED
@@ -170,6 +170,11 @@ The autoresearch experiment loop. Each iteration is one experiment — one hypot
  - N consecutive non-improvements (`config.yaml` → `convergence.patience`) = STOP
  - `max_iterations` reached = STOP
  - Report final best model and recommend next steps
+ - **Before declaring final results**, run a seed study to verify robustness:
+   ```bash
+   python scripts/seed_runner.py --quick
+   ```
+   If CV > 5%, the result is seed-sensitive — report mean ± std, not a single-seed number.

  10. **REPEAT** — return to step 1.