claude-turing 1.1.0 → 1.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (34)
  1. package/.claude-plugin/plugin.json +2 -2
  2. package/README.md +67 -3
  3. package/commands/explore.md +107 -0
  4. package/commands/reproduce.md +48 -0
  5. package/commands/seed.md +47 -0
  6. package/commands/suggest.md +68 -4
  7. package/commands/turing.md +6 -0
  8. package/package.json +1 -1
  9. package/src/claude-md.js +1 -0
  10. package/src/install.js +2 -2
  11. package/src/verify.js +3 -0
  12. package/templates/config.yaml +10 -0
  13. package/templates/program.md +5 -0
  14. package/templates/requirements.txt +4 -0
  15. package/templates/scripts/__pycache__/generate_brief.cpython-314.pyc +0 -0
  16. package/templates/scripts/__pycache__/generate_model_card.cpython-314.pyc +0 -0
  17. package/templates/scripts/__pycache__/manage_hypotheses.cpython-314.pyc +0 -0
  18. package/templates/scripts/__pycache__/reproduce_experiment.cpython-314.pyc +0 -0
  19. package/templates/scripts/__pycache__/scaffold.cpython-314.pyc +0 -0
  20. package/templates/scripts/__pycache__/seed_runner.cpython-314.pyc +0 -0
  21. package/templates/scripts/__pycache__/treequest_suggest.cpython-314.pyc +0 -0
  22. package/templates/scripts/__pycache__/turing_io.cpython-314.pyc +0 -0
  23. package/templates/scripts/__pycache__/update_state.cpython-314.pyc +0 -0
  24. package/templates/scripts/generate_brief.py +85 -3
  25. package/templates/scripts/generate_model_card.py +25 -0
  26. package/templates/scripts/leaderboard.py +10 -0
  27. package/templates/scripts/manage_hypotheses.py +2 -2
  28. package/templates/scripts/reproduce_experiment.py +548 -0
  29. package/templates/scripts/scaffold.py +5 -0
  30. package/templates/scripts/seed_runner.py +414 -0
  31. package/templates/scripts/show_metrics.py +17 -0
  32. package/templates/scripts/treequest_suggest.py +520 -0
  33. package/templates/scripts/turing_io.py +36 -0
  34. package/templates/scripts/update_state.py +13 -0
package/.claude-plugin/plugin.json CHANGED
@@ -1,7 +1,7 @@
  {
    "name": "turing",
-   "version": "1.1.0",
-   "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 16 commands, 2 specialized agents, cost-performance frontier analysis, model cards, model registry, hypothesis database with novelty guard, anti-cheating guardrails, and the taste-leverage loop. Inspired by Karpathy's autoresearch and the scientific method itself.",
+   "version": "1.3.0",
+   "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 19 commands, 2 specialized agents, statistical rigor (multi-seed studies, reproducibility verification), tree-search hypothesis exploration (TreeQuest AB-MCTS), cost-performance frontier analysis, model cards, model registry, hypothesis database with novelty guard, anti-cheating guardrails, and the taste-leverage loop. Inspired by Karpathy's autoresearch and the scientific method itself.",
    "author": {
      "name": "pragnition"
    },
package/README.md CHANGED
@@ -313,6 +313,8 @@ The index (`hypotheses.yaml`) is the lightweight queue. The detail files (`hypot
  | `/turing:try <hypothesis>` | Inject a hypothesis — free text or `archetype:model_comparison` |
  | `/turing:brief [--deep]` | Research briefing — campaign summary, failure patterns, literature-grounded suggestions |
  | `/turing:suggest` | Literature-grounded model architecture suggestions with citations |
+ | `/turing:suggest --strategy treequest` | Tree-search hypothesis exploration (alias for `/turing:explore`) |
+ | `/turing:explore` | AB-MCTS tree search over critique-scored hypothesis space |
  | `/turing:design <hyp-id>` | Generate structured experiment design from a hypothesis |
  | `/turing:mode <explore\|exploit\|replicate>` | Set research strategy — drives novelty guard policy |
 
@@ -321,6 +323,8 @@ The index (`hypotheses.yaml`) is the lightweight queue. The detail files (`hypot
  | Command | What it does |
  |---------|-------------|
  | `/turing:validate [--auto]` | Check metric stability — auto-configure multi-run if noisy |
+ | `/turing:seed [N] [--quick]` | Multi-seed study — mean/std/CI, flag seed-sensitive results |
+ | `/turing:reproduce <exp-id>` | Reproducibility verification — re-run and check tolerance |
  | `/turing:card` | Generate a model card — performance, limitations, intended use, artifact contract |
  | `/turing:logbook` | Generate HTML experiment logbook |
  | `/turing:report` | Generate research report |
@@ -390,6 +394,65 @@ After N experiments with no meaningful improvement, the agent stops and reports
 
  For noisy metrics, `/turing:validate` runs the pipeline multiple times and measures variance. If the coefficient of variation exceeds 5%, it auto-configures multi-run evaluation so the agent can't be rewarded for lucky single runs.
 
+ ## Statistical Rigor
+
+ > *"Stop publishing lucky seeds. Start publishing distributions."*
+
+ Before claiming a result, run a seed study:
+
+ ```
+ /turing:seed            # 5 seeds on best experiment
+ /turing:seed --quick    # 3 seeds for fast check
+ /turing:seed 10         # 10 seeds for thorough study
+ ```
+
+ This runs the same experiment across multiple random seeds and reports mean +/- std with 95% confidence intervals. If the coefficient of variation exceeds 5%, the result is flagged as **seed-sensitive** — meaning you should report the distribution, not a single number.
+
+ To verify an experiment can be reproduced:
+
+ ```
+ /turing:reproduce exp-042                   # Default: 3 runs, 2% tolerance
+ /turing:reproduce exp-042 --strict          # Exact match required
+ /turing:reproduce exp-042 --tolerance 0.05  # Custom tolerance
+ ```
+
+ This re-runs the experiment from the logged config and checks that metrics fall within tolerance. It also detects environment drift — if library versions have changed since the original run, you'll know before a reviewer tells you.
+
+ Seed study results automatically appear in `/turing:brief` and `/turing:card`.
+
+ ## Tree-Search Hypothesis Exploration
+
+ > *"The learned coin-flipper weaves through the quadrillion-coin room with a preternatural air."*
+
+ Sometimes the best experiment to try next isn't obvious from the literature or the agent's memory. `/turing:explore` uses [TreeQuest](https://github.com/SakanaAI/treequest)'s AB-MCTS (Adaptive Branching Monte Carlo Tree Search) to search the space of experiment *ideas* as a tree, scored by the critique engine (novelty x feasibility x impact).
+
+ ```
+ /turing:explore                          # Run MCTS over hypothesis space
+ /turing:explore --strategy greedy        # Greedy fallback (no TreeQuest needed)
+ /turing:explore --iterations 50 --top 8  # Deeper search, more results
+ /turing:suggest --strategy treequest     # Same thing via suggest
+ ```
+
+ How it works:
+
+ ```
+        Seeds                 MCTS expands best-scoring branches
+
+    ┌──────┼──────┐           Each node is a hypothesis scored by:
+    ▼      ▼      ▼             - Novelty (vs experiment history)
+ LightGBM Reg  Features         - Feasibility (hardware, deps)
+    │      │      │             - Expected impact (type success rate)
+    ▼      ▼      ▼
+  +dart   +L1   +poly         Top-K results queued as hypotheses
+    │      │                  for the next /turing:train run
+    ▼      ▼
+ +subsamp +target-enc
+ ```
+
+ Unlike `/turing:suggest` (which searches the web for papers), `/turing:explore` searches the space of *refinement chains* — combinations and sequences of modifications that score well together. It discovers non-obvious experiment strategies that independent suggestions cannot find.
+
+ Falls back to greedy best-first search when TreeQuest is not installed.
+
 
  ## Cost-Performance Frontier
 
  > *"This model is 2% better but takes 10x longer to train. Is that worth it?"*
@@ -451,11 +514,11 @@ Each project gets independent config, data, experiments, models, and agent memor
 
  ## Architecture of Turing Itself
 
- 16 commands, 2 agents, 8 config files, 30 template scripts, model registry, artifact contract, cost-performance frontier, model cards, 345 tests, 16 ADRs. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full codemap.
+ 19 commands, 2 agents, 8 config files, 34 template scripts, model registry, artifact contract, cost-performance frontier, model cards, tree-search exploration, statistical rigor (seed studies + reproducibility), 407 tests, 16 ADRs. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full codemap.
 
  ```
  turing/
- ├── commands/     15 skill files (core + taste-leverage + reporting)
+ ├── commands/     18 skill files (core + taste-leverage + reporting + exploration + statistical rigor)
  ├── agents/       2 agents (researcher: read/write, evaluator: read-only)
  ├── config/       8 files (lifecycle, taxonomy, archetypes, novelty aliases)
  ├── templates/    Scaffolded into user projects by /turing:init
@@ -464,7 +527,7 @@ turing/
  │   ├── train.py             Training code (AGENT-EDITABLE)
  │   ├── model_contract.md    Artifact schema for production consumers
  │   ├── model_registry.yaml  Available model architectures + hyperparams
- │   └── scripts/             25 Python scripts (core loop + analysis + infra)
+ │   └── scripts/             26 Python scripts (core loop + analysis + infra + tree search)
  ├── tests/        338 tests (unit + integration + anti-pattern + manifest)
  ├── src/          5 JS installer files (npm deployment)
  ├── bin/          CLI entry points
@@ -482,6 +545,7 @@ turing/
  - **[Principle of Least Privilege](https://en.wikipedia.org/wiki/Principle_of_least_privilege)** (Saltzer & Schroeder, 1975) — each agent has exactly the capabilities needed for its role
  - **[Early Stopping](https://en.wikipedia.org/wiki/Early_stopping)** (Prechelt, 1998) — convergence detection as discrete early stopping
  - **[Multi-Armed Bandits](https://en.wikipedia.org/wiki/Multi-armed_bandit)** — the explore-exploit tradeoff
+ - **[TreeQuest](https://github.com/SakanaAI/treequest)** (Sakana AI, 2025) — AB-MCTS for inference-time scaling; repurposed here for hypothesis-space exploration
  - **[Version Control as Lab Notebook](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004668)** (Ram, 2013) — git as a scientific record-keeping system
  - **[Reproducibility Crisis](https://en.wikipedia.org/wiki/Replication_crisis)** — if the measurement can change between experiments, results are not reproducible
 
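The seed-study arithmetic described in the Statistical Rigor section above is small enough to sketch. A minimal illustration — not the package's `seed_runner.py` — assuming per-seed metric values are already collected; the normal-approximation 95% CI (1.96) is a simplification where a real implementation might use a Student-t critical value for small n:

```python
# Illustration of the seed-study statistics described above — not the
# package's seed_runner.py.
from statistics import mean, stdev


def summarize_seed_study(scores: list[float], cv_threshold: float = 5.0) -> dict:
    m = mean(scores)
    s = stdev(scores)                      # sample std (n - 1 denominator)
    half = 1.96 * s / len(scores) ** 0.5   # normal-approximation CI half-width
    cv = 100.0 * s / abs(m)                # coefficient of variation, in %
    return {
        "mean": m,
        "std": s,
        "ci95": (m - half, m + half),
        "cv_percent": cv,
        "seed_sensitive": cv >= cv_threshold,  # the documented 5% flag
    }


print(summarize_seed_study([0.841, 0.836, 0.848, 0.839, 0.844]))
```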
package/commands/explore.md ADDED
@@ -0,0 +1,107 @@
+ ---
+ name: explore
+ description: Tree-search-guided hypothesis exploration using AB-MCTS. Explores the space of experiment ideas as a search tree, scored by the critique engine. Discovers non-obvious refinement chains that linear suggestion cannot find.
+ disable-model-invocation: true
+ argument-hint: "[ml/project] [--iterations N] [--top N] [--strategy abmcts-a|abmcts-m|greedy]"
+ allowed-tools: Read, Write, Bash(python scripts/*:*, source .venv/bin/activate:*), Grep, Glob
+ ---
+
+ Explore the hypothesis space using tree search. Instead of suggesting independent ideas, this builds and searches a tree of refinement chains — each node is a hypothesis scored by novelty, feasibility, and expected impact.
+
+ ## Project Detection
+
+ 0. **Detect project directory:**
+    - If `$ARGUMENTS` contains a path (e.g., `ml/coding`), use that as the project directory
+    - Else if cwd contains `config.yaml` and `train.py`, use cwd
+    - Else search for `ml/*/` subdirectories containing `config.yaml`
+      - If exactly one found, use it
+      - If multiple found, list them and ask the user which to target
+    - All subsequent commands run from the detected project directory
+
+ ## Parse Options
+
+ Extract from `$ARGUMENTS`:
+ - `--iterations N` — search depth (default: 30)
+ - `--top N` — number of results to return (default: 5)
+ - `--strategy` — algorithm choice: `abmcts-a` (default), `abmcts-m` (Bayesian), or `greedy` (no TreeQuest needed)
+ - `--seeds-only` — just show generated seeds without running search
+ - `--json` — output as JSON for programmatic use
+
+ ## Steps
+
+ ### 1. Assess Current State
+
+ ```bash
+ source .venv/bin/activate && python scripts/show_metrics.py --last 10 2>/dev/null || echo "No experiments yet"
+ ```
+
+ Read `config.yaml` to understand the current model and metric.
+
+ ### 2. Run Tree Search
+
+ ```bash
+ source .venv/bin/activate && python scripts/treequest_suggest.py \
+   --log experiments/log.jsonl \
+   --config config.yaml \
+   --top <N> \
+   --iterations <N> \
+   --strategy <strategy>
+ ```
+
+ The script will:
+ - Generate seed hypotheses from config and experiment history
+ - Run AB-MCTS (or greedy fallback) over the hypothesis tree
+ - Score each node using the critique engine
+ - Return top-K ranked, deduplicated hypotheses
+
+ ### 3. Queue Best Hypotheses
+
+ For each result, add to the hypothesis queue:
+
+ ```bash
+ source .venv/bin/activate && python scripts/manage_hypotheses.py add "<description>" \
+   --priority medium --source treequest
+ ```
+
+ ### 4. Show Results
+
+ Display the search output and confirm queuing:
+
+ ```
+ TreeQuest Hypothesis Exploration (AB-MCTS-A)
+ ============================================
+ Nodes explored: 35
+ Top 5 hypotheses by critique score:
+
+ 1. [PROCEED] (score: 7.8/10)
+    Switch to LightGBM with dart boosting; additionally add polynomial features
+    Novelty: 8  Feasibility: 9  Impact: 7
+    -> Queued as hyp-NNN
+
+ 2. [PROCEED] (score: 7.2/10)
+    Use low learning rate (0.01) with 2000 estimators; additionally add L2 regularization
+    Novelty: 7  Feasibility: 8  Impact: 7
+    Depth: 1 (refined from parent)
+    -> Queued as hyp-NNN
+
+ ...
+
+ Queued N hypotheses. Run /turing:train to test them.
+ ```
+
+ ## How It Differs From /turing:suggest
+
+ | | `/turing:suggest` | `/turing:explore` |
+ |---|---|---|
+ | **Source** | Web literature search | Tree search over critique scores |
+ | **Strategy** | Independent suggestions | Refinement chains (parent -> child) |
+ | **Requires internet** | Yes | No |
+ | **Discovers** | What papers recommend | What combinations score well |
+ | **Best for** | Early-stage exploration | Mid-experiment optimization |
+
+ ## Integration
+
+ - Results feed into `hypotheses.yaml` — the next `/turing:train` picks them up
+ - `/turing:brief` shows queued treequest-sourced hypotheses
+ - `/turing:suggest --strategy treequest` is an alias for this command
+ - Human can override priority: `/turing:try` always takes precedence
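The greedy fallback this command documents can be sketched as best-first search over refinement chains. Everything below — the `critique` scores, the `refine` rules, the function names — is a hypothetical placeholder to show the shape of the search, not the package's `treequest_suggest.py`:

```python
# Sketch of a greedy best-first fallback over refinement chains.
# critique() and refine() are placeholders; real implementations would
# score novelty/feasibility/impact against experiment history and config.
import heapq


def critique(hypothesis: str) -> float:
    """Placeholder critique engine: novelty x feasibility x impact, 0-10 scale."""
    novelty, feasibility, impact = 8.0, 9.0, 7.0  # would be computed per hypothesis
    return (novelty * feasibility * impact) ** (1 / 3)


def refine(hypothesis: str) -> list[str]:
    """Placeholder refinement rules: each child chains one more modification."""
    mods = ["add polynomial features", "add L2 regularization", "use dart boosting"]
    return [f"{hypothesis}; additionally {m}" for m in mods]


def greedy_explore(seeds: list[str], iterations: int = 30, top: int = 5):
    # Max-heap via negated scores: always expand the best-scoring node so far.
    frontier = [(-critique(h), h) for h in seeds]
    heapq.heapify(frontier)
    seen, results = set(seeds), []
    for _ in range(iterations):
        if not frontier:
            break
        neg_score, hyp = heapq.heappop(frontier)
        results.append((-neg_score, hyp))
        for child in refine(hyp):
            if child not in seen:  # dedupe refinement chains
                seen.add(child)
                heapq.heappush(frontier, (-critique(child), child))
    return sorted(results, reverse=True)[:top]


print(greedy_explore(["Switch to LightGBM"], iterations=10))
```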
package/commands/reproduce.md ADDED
@@ -0,0 +1,48 @@
+ ---
+ name: reproduce
+ description: Verify reproducibility of a specific experiment by re-running from logged config and checking metrics fall within tolerance.
+ disable-model-invocation: true
+ argument-hint: "<exp-id> [--tolerance 0.02] [--strict] [--runs 3]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Verify that a logged experiment can be reproduced with consistent results.
+
+ ## Steps
+
+ 1. **Activate environment:**
+    ```bash
+    source .venv/bin/activate
+    ```
+
+ 2. **Parse arguments from `$ARGUMENTS`:**
+    - First argument is the experiment ID (required), e.g. `exp-042`
+    - `--tolerance 0.02` sets the relative tolerance (default 2%)
+    - `--strict` requires exact float match (1e-6), overrides tolerance
+    - `--runs 3` sets number of reproduction runs (default 3, 1 for strict)
+
+ 3. **Run reproducibility verification:**
+    ```bash
+    python scripts/reproduce_experiment.py $ARGUMENTS
+    ```
+
+ 4. **Report results:**
+    - **reproducible:** metrics match exactly (deterministic algorithm)
+    - **approximately_reproducible:** metrics within tolerance or original falls in 95% CI
+    - **not_reproducible:** metrics outside tolerance and CI
+    - **environment_changed:** metrics diverge AND library versions differ
+    - Show environment diff if present (Python version, package versions)
+
+ 5. **Saved output:** report written to `experiments/reproductions/exp-NNN-repro.yaml`
+
+ 6. **If experiment ID not found:** list available experiment IDs from `experiments/log.jsonl`
+
+ 7. **If no training pipeline exists:** suggest `/turing:init` first.
+
+ ## Examples
+
+ ```
+ /turing:reproduce exp-042                            # Default: 3 runs, 2% tolerance
+ /turing:reproduce exp-042 --strict                   # Exact match required
+ /turing:reproduce exp-042 --tolerance 0.05 --runs 5  # Lenient, more runs
+ ```
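The verdict logic in step 4 can be sketched as follows. This is an illustration of the documented categories, not the package's `reproduce_experiment.py`; the thresholds mirror the stated defaults, and the normal-approximation CI over the reruns is an assumption:

```python
# Sketch of the four documented verdicts: reproducible /
# approximately_reproducible / not_reproducible / environment_changed.
from statistics import mean, stdev


def classify(original: float, reruns: list[float], tolerance: float = 0.02,
             strict: bool = False, env_changed: bool = False) -> str:
    new = mean(reruns)
    if strict:
        # --strict: exact float match within 1e-6, no tolerance band
        return "reproducible" if abs(new - original) <= 1e-6 else "not_reproducible"
    if abs(new - original) <= 1e-6:
        return "reproducible"  # exact match: deterministic algorithm
    within_tol = abs(new - original) / abs(original) <= tolerance
    # 95% CI over the reruns (normal approximation; needs >= 2 runs)
    half = 1.96 * stdev(reruns) / len(reruns) ** 0.5
    within_ci = new - half <= original <= new + half
    if within_tol or within_ci:
        return "approximately_reproducible"
    return "environment_changed" if env_changed else "not_reproducible"


print(classify(0.842, [0.839, 0.845, 0.841]))                    # approximately_reproducible
print(classify(0.842, [0.780, 0.778, 0.781], env_changed=True))  # environment_changed
```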
package/commands/seed.md ADDED
@@ -0,0 +1,47 @@
+ ---
+ name: seed
+ description: Run multi-seed study on an experiment to compute mean/std/CI and flag seed-sensitive results. Prevents publishing lucky seeds.
+ disable-model-invocation: true
+ argument-hint: "[N] [--quick] [--exp-id <id>]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Run a multi-seed study to verify that experiment results are robust across random seeds.
+
+ ## Steps
+
+ 1. **Activate environment:**
+    ```bash
+    source .venv/bin/activate
+    ```
+
+ 2. **Parse arguments from `$ARGUMENTS`:**
+    - A bare number (e.g., `5`) sets the seed count
+    - `--quick` runs 3 seeds instead of 5
+    - `--exp-id exp-042` targets a specific experiment (defaults to best)
+    - `--seed-list 42,123,456` uses specific seed values
+
+ 3. **Run seed study:**
+    ```bash
+    python scripts/seed_runner.py $ARGUMENTS
+    ```
+
+ 4. **Report results:**
+    - Show the per-seed results table
+    - Show mean +/- std with 95% CI
+    - **STABLE (CV < 5%):** result is robust, safe to report
+    - **SEED-SENSITIVE (CV >= 5%):** result varies too much across seeds — do not report single-seed numbers
+    - If seed-sensitive, recommend reporting as mean +/- std over N seeds
+
+ 5. **Saved output:** results are written to `experiments/seed_studies/exp-NNN-seeds.yaml`
+
+ 6. **If no training pipeline exists:** suggest `/turing:init` first.
+
+ ## Examples
+
+ ```
+ /turing:seed                   # 5 seeds on best experiment
+ /turing:seed --quick           # 3 seeds for fast check
+ /turing:seed 10                # 10 seeds for thorough study
+ /turing:seed --exp-id exp-042  # Specific experiment
+ ```
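The orchestration behind step 3 is a loop over seeds. A sketch under two explicit assumptions — that `train.py` accepts a `--seed` flag and prints the metric as its last output line; neither is confirmed by this diff:

```python
# Sketch of the per-seed loop only — not the package's seed_runner.py.
# ASSUMPTIONS: train.py accepts --seed and prints the metric as the last
# line of stdout.
import subprocess

SEEDS = [42, 123, 456, 789, 1024]  # first five of config.yaml's seed_seeds


def run_seed_study(seeds: list[int]) -> dict[int, float]:
    results = {}
    for seed in seeds:
        proc = subprocess.run(
            ["python", "train.py", "--seed", str(seed)],
            capture_output=True, text=True, check=True,
        )
        # assumed: last stdout line carries the metric value
        results[seed] = float(proc.stdout.strip().splitlines()[-1])
    return results
```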
package/commands/suggest.md CHANGED
@@ -6,9 +6,16 @@ argument-hint: "[task description override]"
  allowed-tools: Read, Write, Bash(python scripts/*:*, source .venv/bin/activate:*), Grep, Glob, WebSearch, WebFetch
  ---
 
- Suggest model architectures for the current ML task, grounded in recent literature. Hypotheses backed by papers, not vibes.
+ Suggest model architectures for the current ML task. Supports two strategies:
 
- ## Steps
+ - **literature** (default): Web search for recent papers, synthesize grounded suggestions with citations.
+ - **treequest**: Tree-search-guided hypothesis exploration using AB-MCTS over the critique scoring function. Explores refinement chains that literature search cannot find.
+
+ ## Strategy Detection
+
+ If `$ARGUMENTS` contains `--strategy treequest` or `treequest`, use the TreeQuest strategy below. Otherwise use the default literature strategy.
+
+ ## Steps (Literature Strategy — default)
 
  ### 1. Understand the Task
 
@@ -84,12 +91,69 @@ Sources consulted: <N papers/articles>
  Queued N hypotheses. Run /turing:train to test them.
  ```
 
- ## Fallback
+ ## Fallback (Literature Strategy)
 
  If web search returns insufficient results, suggest model families from `config/taxonomy.toml` based on what hasn't been tried yet. Note that suggestions are taxonomy-based, not literature-backed, and queue with `--source taxonomy`.
 
+ ## Steps (TreeQuest Strategy)
+
+ When using `--strategy treequest`:
+
+ ### 1. Detect Project Directory
+
+ Same detection logic as the literature strategy — find `config.yaml` + `train.py`.
+
+ ### 2. Run Tree Search
+
+ ```bash
+ source .venv/bin/activate && python scripts/treequest_suggest.py \
+   --log experiments/log.jsonl \
+   --config config.yaml \
+   --top 5 \
+   --iterations 30 \
+   --strategy abmcts-a
+ ```
+
+ If TreeQuest is not installed, the script automatically falls back to greedy best-first search.
+
+ ### 3. Queue Results
+
+ For each result from the tree search, queue as a hypothesis:
+
+ ```bash
+ source .venv/bin/activate && python scripts/manage_hypotheses.py add "<description>" --priority medium --source treequest
+ ```
+
+ ### 4. Show Results
+
+ Display the tree search output and confirm hypotheses were queued:
+
+ ```
+ TreeQuest Hypothesis Exploration (AB-MCTS-A)
+ ============================================
+ Nodes explored: 35
+ Top 5 hypotheses by critique score:
+
+ 1. [PROCEED] (score: 7.8/10)
+    Switch to LightGBM with dart boosting; additionally add polynomial features
+    Novelty: 8  Feasibility: 9  Impact: 7
+
+ ...
+
+ Queued N hypotheses. Run /turing:train to test them.
+ ```
+
+ ### TreeQuest Options
+
+ Pass additional flags via `$ARGUMENTS`:
+ - `--iterations N` — search depth (default: 30)
+ - `--top N` — number of results (default: 5)
+ - `--strategy abmcts-m` — use Bayesian mixed model variant (requires PyMC)
+ - `--greedy` — force greedy fallback without TreeQuest
+
  ## Integration
 
  - Suggestions feed into `hypotheses.yaml` — the next `/turing:train` picks them up
- - `/turing:brief` shows queued literature-sourced hypotheses
+ - `/turing:brief` shows queued literature-sourced and treequest-sourced hypotheses
+ - `/turing:explore` runs the TreeQuest search as a standalone command
  - Human can override priority: `/turing:try` always takes precedence
package/commands/turing.md CHANGED
@@ -20,7 +20,10 @@ You are the Turing ML research router. Detect the user's intent and route to the
  | "poster", "presentation", "one-pager", "visual summary" | `/turing:poster` | Document |
  | "report", "write-up", "findings", "document results" | `/turing:report` | Document |
  | "validate", "stability", "check variance", "noisy" | `/turing:validate` | Validate |
+ | "seed", "seed study", "multi-seed", "lucky seed", "seed sensitivity" | `/turing:seed` | Validate |
+ | "reproduce", "reproducibility", "verify results", "re-run experiment", "repro" | `/turing:reproduce` | Validate |
  | "suggest", "what model", "recommend", "which architecture", "literature" | `/turing:suggest` | Research |
+ | "explore hypotheses", "tree search", "treequest", "search hypothesis space", "MCTS" | `/turing:explore` | Research |
  | "design", "plan experiment", "how should I test", "experiment design" | `/turing:design` | Design |
  | "mode", "explore", "exploit", "replicate", "strategy" | `/turing:mode` | Strategy |
  | "preflight", "resources", "VRAM", "memory", "can I run", "OOM", "GPU" | `/turing:preflight` | Check |
@@ -38,7 +41,10 @@ You are the Turing ML research router. Detect the user's intent and route to the
  | `/turing:brief` | Generate structured research intelligence report | @ml-evaluator |
  | `/turing:init` | Scaffold a new ML project | (inline) |
  | `/turing:validate` | Check metric stability, auto-fix if noisy | (inline) |
+ | `/turing:seed [N] [--quick]` | Multi-seed study: mean/std/CI, flag seed-sensitive results | (inline) |
+ | `/turing:reproduce <exp-id>` | Reproducibility verification with tolerance checking | (inline) |
  | `/turing:suggest` | Literature-grounded model architecture suggestions | (inline, uses WebSearch) |
+ | `/turing:explore` | Tree-search hypothesis exploration via AB-MCTS | (inline) |
  | `/turing:design <hyp-id>` | Generate structured experiment design from hypothesis | (inline, uses WebSearch) |
  | `/turing:logbook` | HTML/markdown logbook with trajectory chart | (inline) |
  | `/turing:poster` | Single-page HTML research poster | (inline) |
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "claude-turing",
-   "version": "1.1.0",
+   "version": "1.3.0",
    "type": "module",
    "description": "Autonomous ML research harness for Claude Code. The autoresearch loop as a formal protocol — iteratively trains, evaluates, and improves ML models with structured experiment tracking, convergence detection, immutable evaluation infrastructure, and safety guardrails.",
    "bin": {
package/src/claude-md.js CHANGED
@@ -21,6 +21,7 @@ Autonomous ML research harness. The autoresearch loop as a formal protocol.
  | \`/turing:validate\` | Check metric stability, auto-fix if noisy |
  | \`/turing:try <hypothesis>\` | Inject a hypothesis into the experiment queue |
  | \`/turing:brief\` | Generate research intelligence report |
+ | \`/turing:explore\` | Tree-search hypothesis exploration (AB-MCTS) |
  | \`/turing:preflight\` | Pre-flight resource check (VRAM/RAM/disk) |
 
  ### Agents
package/src/install.js CHANGED
@@ -21,8 +21,8 @@ const PLUGIN_ROOT = join(__dirname, "..");
  // Single source of truth for sub-commands (DRY — used for dirs and file copy)
  const SUB_COMMANDS = [
    "init", "train", "status", "compare", "sweep", "validate",
-   "try", "brief", "suggest", "design", "logbook", "poster",
-   "report", "mode", "preflight", "card",
+   "try", "brief", "suggest", "explore", "design", "logbook", "poster",
+   "report", "mode", "preflight", "card", "seed", "reproduce",
  ];
 
  export async function install(opts = {}) {
package/src/verify.js CHANGED
@@ -23,6 +23,7 @@ const EXPECTED_COMMANDS = [
    "try/SKILL.md",
    "brief/SKILL.md",
    "suggest/SKILL.md",
+   "explore/SKILL.md",
    "design/SKILL.md",
    "logbook/SKILL.md",
    "poster/SKILL.md",
@@ -30,6 +31,8 @@ const EXPECTED_COMMANDS = [
    "mode/SKILL.md",
    "preflight/SKILL.md",
    "card/SKILL.md",
+   "seed/SKILL.md",
+   "reproduce/SKILL.md",
  ];
 
  const EXPECTED_AGENTS = ["ml-researcher.md", "ml-evaluator.md"];
package/templates/config.yaml CHANGED
@@ -19,6 +19,16 @@ evaluation:
    # Set to false for metrics where higher is better (accuracy, f1, auc)
    lower_is_better: false  # {{METRIC_DIRECTION}} -- change to true if lower is better
 
+   # Multi-seed configuration (Phase 10.1: /turing:seed)
+   # Seeds used for seed studies — diverse values for good coverage
+   seed_seeds: [42, 123, 456, 789, 1024, 1337, 2048, 3141, 4096, 7919]
+   seed_study_n_runs: 5             # Default number of seeds for /turing:seed
+   seed_sensitivity_threshold: 5.0  # CV% above this = seed-sensitive
+
+   # Reproducibility configuration (Phase 10.2: /turing:reproduce)
+   reproduce_tolerance: 0.02        # 2% relative tolerance for approximate match
+   reproduce_n_runs: 3              # Default reproduction runs for stochastic algorithms
+
  convergence:
    patience: 3                      # Consecutive non-improvements before stopping
    improvement_threshold: 0.005     # 0.5% relative improvement required
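A minimal sketch of reading these keys, with fallbacks matching the documented defaults. The key names, nesting under `evaluation:`, and values come from the diff; the reader itself is illustrative, not a script shipped by the package:

```python
# Illustrative reader for the new evaluation-level keys.
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

evaluation = cfg.get("evaluation", {})
seeds = evaluation.get("seed_seeds", [42, 123, 456, 789, 1024])
n_runs = evaluation.get("seed_study_n_runs", 5)
cv_threshold = evaluation.get("seed_sensitivity_threshold", 5.0)  # CV%
tolerance = evaluation.get("reproduce_tolerance", 0.02)           # 2% relative
repro_runs = evaluation.get("reproduce_n_runs", 3)

# Assumed convention: an N-seed study takes the first N listed seeds.
study_seeds = seeds[:n_runs]
```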
package/templates/program.md CHANGED
@@ -170,6 +170,11 @@ The autoresearch experiment loop. Each iteration is one experiment — one hypot
    - N consecutive non-improvements (`config.yaml` → `convergence.patience`) = STOP
    - `max_iterations` reached = STOP
    - Report final best model and recommend next steps
+   - **Before declaring final results**, run a seed study to verify robustness:
+     ```bash
+     python scripts/seed_runner.py --quick
+     ```
+     If CV > 5%, the result is seed-sensitive — report mean ± std, not a single-seed number.
 
  10. **REPEAT** — return to step 1.
 
package/templates/requirements.txt CHANGED
@@ -6,3 +6,7 @@ numpy>=2.0
  joblib>=1.4
  pyyaml>=6.0
  pytest>=8.0
+
+ # Optional: tree-search-guided hypothesis exploration
+ # Install with: pip install "treequest[all]"
+ # treequest>=0.1
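The commented-out dependency matches the documented fallback behavior: the tree-search script works without TreeQuest installed. A sketch of the usual optional-import guard (illustrative; not copied from `treequest_suggest.py`):

```python
# Optional-dependency guard: degrade to the greedy strategy when the
# treequest package (pip install "treequest[all]") is not available.
try:
    import treequest  # noqa: F401
    HAS_TREEQUEST = True
except ImportError:
    HAS_TREEQUEST = False


def pick_strategy(requested: str = "abmcts-a") -> str:
    if requested != "greedy" and not HAS_TREEQUEST:
        print("TreeQuest not installed — falling back to greedy best-first search.")
        return "greedy"
    return requested
```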