claude-turing 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (104)
  1. package/.claude-plugin/plugin.json +34 -0
  2. package/LICENSE +21 -0
  3. package/README.md +457 -0
  4. package/agents/ml-evaluator.md +43 -0
  5. package/agents/ml-researcher.md +74 -0
  6. package/bin/cli.js +46 -0
  7. package/bin/turing-init.sh +57 -0
  8. package/commands/brief.md +83 -0
  9. package/commands/compare.md +24 -0
  10. package/commands/design.md +97 -0
  11. package/commands/init.md +123 -0
  12. package/commands/logbook.md +51 -0
  13. package/commands/mode.md +43 -0
  14. package/commands/poster.md +89 -0
  15. package/commands/preflight.md +75 -0
  16. package/commands/report.md +97 -0
  17. package/commands/rules/loop-protocol.md +91 -0
  18. package/commands/status.md +24 -0
  19. package/commands/suggest.md +95 -0
  20. package/commands/sweep.md +45 -0
  21. package/commands/train.md +66 -0
  22. package/commands/try.md +63 -0
  23. package/commands/turing.md +54 -0
  24. package/commands/validate.md +34 -0
  25. package/config/defaults.yaml +45 -0
  26. package/config/experiment_archetypes.yaml +127 -0
  27. package/config/lifecycle.toml +31 -0
  28. package/config/novelty_aliases.yaml +107 -0
  29. package/config/relationships.toml +125 -0
  30. package/config/state.toml +24 -0
  31. package/config/task_taxonomy.yaml +110 -0
  32. package/config/taxonomy.toml +37 -0
  33. package/package.json +54 -0
  34. package/src/claude-md.js +55 -0
  35. package/src/install.js +107 -0
  36. package/src/paths.js +20 -0
  37. package/src/postinstall.js +22 -0
  38. package/src/verify.js +109 -0
  39. package/templates/MEMORY.md +36 -0
  40. package/templates/README.md +93 -0
  41. package/templates/__pycache__/evaluate.cpython-314.pyc +0 -0
  42. package/templates/__pycache__/prepare.cpython-314.pyc +0 -0
  43. package/templates/config.yaml +48 -0
  44. package/templates/evaluate.py +237 -0
  45. package/templates/features/__init__.py +0 -0
  46. package/templates/features/__pycache__/__init__.cpython-314.pyc +0 -0
  47. package/templates/features/__pycache__/featurizers.cpython-314.pyc +0 -0
  48. package/templates/features/featurizers.py +138 -0
  49. package/templates/prepare.py +171 -0
  50. package/templates/program.md +216 -0
  51. package/templates/pyproject.toml +8 -0
  52. package/templates/requirements.txt +8 -0
  53. package/templates/scripts/__init__.py +0 -0
  54. package/templates/scripts/__pycache__/__init__.cpython-314.pyc +0 -0
  55. package/templates/scripts/__pycache__/check_convergence.cpython-314.pyc +0 -0
  56. package/templates/scripts/__pycache__/classify_task.cpython-314.pyc +0 -0
  57. package/templates/scripts/__pycache__/critique_hypothesis.cpython-314.pyc +0 -0
  58. package/templates/scripts/__pycache__/experiment_index.cpython-314.pyc +0 -0
  59. package/templates/scripts/__pycache__/generate_brief.cpython-314.pyc +0 -0
  60. package/templates/scripts/__pycache__/generate_logbook.cpython-314.pyc +0 -0
  61. package/templates/scripts/__pycache__/log_experiment.cpython-314.pyc +0 -0
  62. package/templates/scripts/__pycache__/manage_hypotheses.cpython-314.pyc +0 -0
  63. package/templates/scripts/__pycache__/novelty_guard.cpython-314.pyc +0 -0
  64. package/templates/scripts/__pycache__/parse_metrics.cpython-314.pyc +0 -0
  65. package/templates/scripts/__pycache__/scaffold.cpython-314.pyc +0 -0
  66. package/templates/scripts/__pycache__/show_experiment_tree.cpython-314.pyc +0 -0
  67. package/templates/scripts/__pycache__/show_families.cpython-314.pyc +0 -0
  68. package/templates/scripts/__pycache__/statistical_compare.cpython-314.pyc +0 -0
  69. package/templates/scripts/__pycache__/suggest_next.cpython-314.pyc +0 -0
  70. package/templates/scripts/__pycache__/sweep.cpython-314.pyc +0 -0
  71. package/templates/scripts/__pycache__/synthesize_decision.cpython-314.pyc +0 -0
  72. package/templates/scripts/__pycache__/turing_io.cpython-314.pyc +0 -0
  73. package/templates/scripts/__pycache__/update_state.cpython-314.pyc +0 -0
  74. package/templates/scripts/__pycache__/verify_placeholders.cpython-314.pyc +0 -0
  75. package/templates/scripts/check_convergence.py +230 -0
  76. package/templates/scripts/compare_runs.py +124 -0
  77. package/templates/scripts/critique_hypothesis.py +350 -0
  78. package/templates/scripts/experiment_index.py +288 -0
  79. package/templates/scripts/generate_brief.py +389 -0
  80. package/templates/scripts/generate_logbook.py +423 -0
  81. package/templates/scripts/log_experiment.py +243 -0
  82. package/templates/scripts/manage_hypotheses.py +543 -0
  83. package/templates/scripts/novelty_guard.py +343 -0
  84. package/templates/scripts/parse_metrics.py +139 -0
  85. package/templates/scripts/post-train-hook.sh +74 -0
  86. package/templates/scripts/preflight.py +549 -0
  87. package/templates/scripts/scaffold.py +409 -0
  88. package/templates/scripts/show_environment.py +92 -0
  89. package/templates/scripts/show_experiment_tree.py +144 -0
  90. package/templates/scripts/show_families.py +133 -0
  91. package/templates/scripts/show_metrics.py +157 -0
  92. package/templates/scripts/statistical_compare.py +259 -0
  93. package/templates/scripts/stop-hook.sh +34 -0
  94. package/templates/scripts/suggest_next.py +301 -0
  95. package/templates/scripts/sweep.py +276 -0
  96. package/templates/scripts/synthesize_decision.py +300 -0
  97. package/templates/scripts/turing_io.py +76 -0
  98. package/templates/scripts/update_state.py +296 -0
  99. package/templates/scripts/validate_stability.py +167 -0
  100. package/templates/scripts/verify_placeholders.py +119 -0
  101. package/templates/sweep_config.yaml +14 -0
  102. package/templates/tests/__init__.py +0 -0
  103. package/templates/tests/conftest.py +91 -0
  104. package/templates/train.py +240 -0
@@ -0,0 +1,75 @@
1
+ ---
2
+ name: preflight
3
+ description: Pre-flight resource check — estimates VRAM, RAM, and disk requirements before running ML training. Compares against available system resources and issues a PASS/WARN/FAIL verdict. Run it before training to catch OOM errors before they happen.
4
+ disable-model-invocation: true
5
+ argument-hint: "[--model-type torch] [--params 10M] [--batch-size 32]"
6
+ allowed-tools: Read, Bash(python scripts/*:*, source .venv/bin/activate:*, nvidia-smi:*), Grep, Glob
7
+ ---
8
+
9
+ Check whether the current system has enough resources to run the planned experiment.
10
+
11
+ ## Steps
12
+
13
+ 1. **Activate environment:**
14
+ ```bash
15
+ source .venv/bin/activate
16
+ ```
17
+
18
+ 2. **Run preflight check:**
19
+
20
+ If `$ARGUMENTS` is empty (auto-detect from config.yaml):
21
+ ```bash
22
+ python scripts/preflight.py
23
+ ```
24
+
25
+ If `$ARGUMENTS` contains flags:
26
+ ```bash
27
+ python scripts/preflight.py $ARGUMENTS
28
+ ```
29
+
30
+ 3. **Interpret the verdict:**
31
+
32
+ - **PASS** — system has sufficient resources. Proceed with training.
33
+ - **WARN** — resources are tight. Training may succeed but could be slow or unstable. Present warnings to the user and ask whether to proceed.
34
+ - **FAIL** — training will likely fail (OOM, disk full, no GPU for GPU-required model). Present the specific resource gap and suggest mitigations:
35
+ - RAM too low: reduce dataset size, use chunked loading, or add swap
36
+ - VRAM too low: reduce batch size, use fp16/bf16, enable gradient checkpointing, or use a smaller model
37
+ - Disk too low: clean up old models/checkpoints
38
+ - No GPU: switch to a CPU-friendly model (XGBoost, LightGBM, sklearn)
39
+
40
+ 4. **If running before `/turing:train`:** report the verdict so the human can decide whether to proceed, adjust config, or choose a different model type.
41
+
42
+ ## Examples
43
+
44
+ ```bash
45
+ # Auto-detect from config.yaml (works for Turing projects)
46
+ /turing:preflight
47
+
48
+ # Check for a specific model type
49
+ /turing:preflight --model-type transformer --params 350M --batch-size 16 --precision fp16
50
+
51
+ # Check with a specific dataset
52
+ /turing:preflight --model-type xgboost --dataset data/train.csv
53
+
54
+ # JSON output for scripting
55
+ /turing:preflight --json
56
+ ```
57
+
58
+ ## What It Checks
59
+
60
+ | Resource | How estimated | Warning threshold |
61
+ |----------|--------------|-------------------|
62
+ | **RAM** | Dataset size (4x CSV on disk) + model memory (tree nodes or param count) | >90% of available |
63
+ | **VRAM** | Model params + gradients + optimizer state + activations | >80% of largest GPU |
64
+ | **Disk** | Model artifacts + dataset + checkpoints | >50% of free space |
65
+ | **GPU presence** | torch.cuda or nvidia-smi | Required for neural nets >1GB VRAM |
66
+
67
+ ## Model-Specific Estimates
68
+
69
+ | Model Type | RAM | VRAM | GPU Required? |
70
+ |-----------|-----|------|---------------|
71
+ | XGBoost/LightGBM | Trees + data (typically <4GB) | 0 | No |
72
+ | Random Forest | Trees + data (can be large) | 0 | No |
73
+ | Linear/Logistic | 2x data | 0 | No |
74
+ | MLP (small) | Data + params | Params x 4 (Adam) | If >1GB VRAM |
75
+ | Transformer | Data + params | Params x 4 + activations | Yes |
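As a rough illustration of the estimates in the tables above, here is a minimal sketch of a transformer VRAM check. It is not the actual `scripts/preflight.py` logic; the layer count, byte-per-parameter constants, and activation heuristic are assumptions.

```python
def estimate_transformer_vram_gb(n_params, batch_size, seq_len=512, hidden=1024,
                                 n_layers=24, precision="fp16"):
    """Rough VRAM estimate: weights + gradients + optimizer state + activations."""
    bytes_per_param = 2 if precision in ("fp16", "bf16") else 4
    weights = n_params * bytes_per_param
    grads = n_params * bytes_per_param
    optimizer = n_params * 8                      # Adam: two fp32 moments per parameter
    activations = batch_size * seq_len * hidden * bytes_per_param * n_layers
    return (weights + grads + optimizer + activations) / 1e9

def verdict(required_gb, available_gb):
    if required_gb > available_gb:
        return "FAIL"
    if required_gb > 0.8 * available_gb:          # mirrors the >80% VRAM warning threshold
        return "WARN"
    return "PASS"

# e.g. a 350M-parameter model at batch size 16 on a 24 GB GPU
print(verdict(estimate_transformer_vram_gb(350e6, 16), 24.0))
```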
@@ -0,0 +1,97 @@
1
+ ---
2
+ name: report
3
+ description: Generate a markdown research report from experiment history — structured for sharing, archiving, or including in documentation. More detailed than a brief, less visual than a poster.
4
+ disable-model-invocation: true
5
+ argument-hint: "[--since YYYY-MM-DD] [--output path]"
6
+ allowed-tools: Read, Bash(python scripts/*:*, source .venv/bin/activate:*, mkdir:*), Grep, Glob
7
+ ---
8
+
9
+ Generate a structured markdown research report summarizing the experiment campaign.
10
+
11
+ ## Steps
12
+
13
+ ### 1. Generate the Report
14
+
15
+ Use the logbook generator in markdown mode as the data backbone:
16
+
17
+ ```bash
18
+ source .venv/bin/activate && python scripts/generate_logbook.py --format markdown
19
+ ```
20
+
21
+ Also gather supplementary data:
22
+ ```bash
23
+ source .venv/bin/activate && python scripts/generate_brief.py
24
+ cat experiment_state.yaml 2>/dev/null || true
25
+ cat RESEARCH_PLAN.md 2>/dev/null || true
26
+ ```
27
+
28
+ ### 2. Enhance with Analysis
29
+
30
+ The logbook generator produces raw data. Enhance it with your analysis to create a proper report. Add these sections that the script doesn't generate:
31
+
32
+ - **Executive Summary** (2-3 sentences): What was the task? What's the best result? Is it good enough?
33
+ - **Approach:** Describe the methodology — autoresearch loop, evaluation strategy, search strategy used
34
+ - **Key Findings:** Synthesize patterns from the experiment log:
35
+ - Which model families outperformed others?
36
+ - What hyperparameter ranges work vs don't?
37
+ - Were there surprising results?
38
+ - What failure patterns emerged?
39
+ - **Recommendations:** Based on the findings, what should be tried next? What should be avoided?
40
+ - **Limitations:** What wasn't explored? What constraints affected the results?
41
+
42
+ ### 3. Output
43
+
44
+ If `$ARGUMENTS` contains `--output <path>`:
45
+ ```bash
46
+ mkdir -p $(dirname <path>)
47
+ ```
48
+ Write the report to the specified path.
49
+
50
+ Otherwise, display the report directly.
51
+
52
+ **Common usage:**
53
+ ```
54
+ /turing:report --output reports/campaign-v1.md
55
+ /turing:report --since 2026-03-15 --output reports/week-12.md
56
+ ```
57
+
58
+ ## Report Structure
59
+
60
+ ```markdown
61
+ # Research Report: <task description>
62
+ Generated: <date>
63
+
64
+ ## Executive Summary
65
+ <2-3 sentences>
66
+
67
+ ## Methodology
68
+ <approach, evaluation strategy, convergence criteria>
69
+
70
+ ## Campaign Summary
71
+ <table: experiments, keep rate, best metric, timespan>
72
+
73
+ ## Improvement Trajectory
74
+ <table: experiment-by-experiment metric progression>
75
+
76
+ ## Key Findings
77
+ <synthesized patterns from experiment history>
78
+
79
+ ## Model Comparison
80
+ <table: model families, experiments per family, best metric, keep rate>
81
+
82
+ ## Hypothesis Analysis
83
+ <what was proposed, by whom, what worked>
84
+
85
+ ## Recommendations
86
+ <concrete next steps>
87
+
88
+ ## Limitations
89
+ <what wasn't tried, constraints>
90
+ ```
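For orientation, a minimal sketch of how the improvement-trajectory table above could be assembled from `experiments/log.jsonl` with a `--since` cutoff. The field names (`timestamp`, `id`, `metrics`, `kept`) are assumptions, not necessarily the package's actual log schema.

```python
import json
from datetime import date
from pathlib import Path

def trajectory_rows(log_path="experiments/log.jsonl", since="2000-01-01", metric="accuracy"):
    """Build a markdown table of metric progression for experiments after `since`."""
    cutoff = date.fromisoformat(since)
    rows = []
    for line in Path(log_path).read_text().splitlines():
        entry = json.loads(line)
        if date.fromisoformat(entry["timestamp"][:10]) < cutoff:
            continue
        rows.append(f"| {entry['id']} | {entry['metrics'].get(metric, 'n/a')} | "
                    f"{'kept' if entry.get('kept') else 'discarded'} |")
    return "\n".join([f"| Experiment | {metric} | Outcome |", "|---|---|---|"] + rows)

print(trajectory_rows(since="2026-03-15"))
```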
91
+
92
+ ## When to Use
93
+
94
+ - End of a research campaign for archiving
95
+ - Before a team review or status update
96
+ - To document findings for a paper or thesis
97
+ - To hand off a project to another researcher
@@ -0,0 +1,91 @@
1
+ # Autoresearch Loop Protocol Rules
2
+
3
+ These rules govern the autonomous ML experiment loop. They are non-negotiable safety constraints that preserve the integrity of the experimental process.
4
+
5
+ ## The Fundamental Separation
6
+
7
+ The autoresearch harness enforces a strict separation between the **hypothesis space** (what the agent can change) and the **measurement apparatus** (how results are evaluated). This separation is the architectural invariant that makes autonomous experimentation trustworthy.
8
+
9
+ | Layer | Files | Agent Access | Rationale |
10
+ |-------|-------|-------------|-----------|
11
+ | Hidden | `evaluate.py` | NONE — do not read, write, or reference | Reading evaluation code enables seed exploitation and metric gaming |
12
+ | Measurement | `prepare.py` | READ-ONLY | Data loading is visible but immutable |
13
+ | Hypothesis | `train.py` | READ-WRITE | All experimental changes go here |
14
+ | Configuration | `config.yaml` | READ-WRITE | Hyperparameter changes without code changes |
15
+ | Features | `features/featurizers.py` | READ-ONLY | Modify how `train.py` *uses* featurizers instead |
16
+
17
+ ## Execution Rules
18
+
19
+ - **ALWAYS redirect training output:** `python train.py > run.log 2>&1`
20
+ - **ALWAYS parse metrics with grep** between `---` delimiters: `grep -A 10 "^---" run.log | head -10`
21
+ - **ALWAYS activate the venv first:** `source .venv/bin/activate`
22
+ - **NEVER install new packages** without human approval
23
+
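A minimal sketch of that metric-parsing convention, assuming `train.py` prints `key: value` lines between two `---` delimiter lines (the exact output format is defined by the package templates, not by this sketch):

```python
def parse_metrics(log_path="run.log"):
    """Extract key: value pairs printed between '---' delimiter lines."""
    metrics, inside = {}, False
    with open(log_path) as f:
        for line in f:
            if line.strip() == "---":
                if inside:                 # second delimiter closes the block
                    break
                inside = True
                continue
            if inside and ":" in line:
                key, _, value = line.partition(":")
                try:
                    metrics[key.strip()] = float(value)
                except ValueError:
                    metrics[key.strip()] = value.strip()
    return metrics
```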
24
+ ## Git Discipline
25
+
26
+ ### Per-Experiment Branches (preferred)
27
+
28
+ - **Create branch before each experiment:** `git checkout -b exp/{NNN}-{short-description}`
29
+ - **Commit changes on the branch:** `git commit -am "exp: {description}"`
30
+ - **Run the experiment on the branch**
31
+ - **If improved:** `git checkout main && git merge exp/{NNN}-{short-description}`. Copy model to `models/best/`.
32
+ - **If NOT improved:** `git checkout main`. Branch preserved for comparison.
33
+ - **Keep all experiment branches** — they preserve code variants for later analysis.
34
+
35
+ ### Fallback: Commit/Revert (mid-sweep)
36
+
37
+ - **ALWAYS commit before running:** `git commit -am "exp: {description}"`
38
+ - **If improved:** keep commit, copy model to `models/best/`
39
+ - **If NOT improved:** `git reset --hard HEAD~1`
40
+
41
+ ## Sweep Workflow
42
+
43
+ 1. Generate queue: `python scripts/sweep.py`
44
+ 2. Check status: `python scripts/sweep.py --status`
45
+ 3. Get next: `python scripts/sweep.py --next`
46
+ 4. Apply overrides, create branch, run training
47
+ 5. Mark: `python scripts/sweep.py --mark <name> complete|failed`
48
+ 6. Repeat until queue is empty
49
+
50
+ ## Logging Rules
51
+
52
+ - **Log every experiment** to `experiments/log.jsonl` via `python scripts/log_experiment.py` — kept and discarded alike.
53
+ - **Include all metrics, config, and description** of the hypothesis and its outcome.
54
+
55
+ ## Convergence Rules
56
+
57
+ - **N consecutive non-improvements** (N from `config.yaml` → `convergence.patience`), each with less than the threshold relative gain, means STOP.
58
+ - **max_iterations** (if provided) overrides convergence.
59
+ - **Always report** final best model, metrics, and recommended next steps when stopping.
60
+
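A minimal sketch of this stopping rule, assuming a higher-is-better metric and illustrative default values for patience and the relative-gain threshold:

```python
def should_stop(history, patience=3, min_relative_gain=0.001):
    """Stop after `patience` consecutive experiments whose relative gain over
    the running best falls below `min_relative_gain` (higher is better)."""
    best, stale = None, 0
    for value in history:
        if best is None:
            best, stale = value, 0
            continue
        gain = (value - best) / abs(best) if best != 0 else float("inf")
        if gain >= min_relative_gain:
            best, stale = value, 0
        else:
            stale += 1
    return stale >= patience

# three consecutive experiments with < 0.1% gain over the best => True
print(should_stop([0.81, 0.84, 0.841, 0.8405, 0.8407, 0.8409], patience=3))
```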
61
+ ## Tool Restrictions
62
+
63
+ The researcher agent's Bash access is restricted to a whitelist of necessary commands:
64
+
65
+ | Allowed Pattern | Purpose |
66
+ |-----------------|---------|
67
+ | `python train.py:*` | Execute training |
68
+ | `python scripts/*:*` | Run utility scripts (logging, metrics, sweep) |
69
+ | `git:*` | Branch, commit, merge, reset operations |
70
+ | `source .venv/bin/activate:*` | Virtual environment activation |
71
+ | `pip:*` | Package installation (requires human approval) |
72
+
73
+ **Blocked by omission:** `cat`, `head`, `tail`, `less` (blocks reading hidden files via the shell), `curl`, `wget` (blocks data exfiltration), and all other arbitrary command execution.
74
+
75
+ The agent's Read tool is separately governed by the file access tiers above — hidden files are denied at the tool level.
76
+
77
+ ## Reproducibility Rules
78
+
79
+ Every experiment must be fully reproducible. The training template handles this automatically, but the agent must not subvert it:
80
+
81
+ - **NEVER use unseeded randomness.** All random state flows from `config.yaml → data.random_state`. The `pin_all_seeds()` function in `train.py` sets stdlib `random`, `numpy`, `PYTHONHASHSEED`, and `torch`/`cuda` seeds from this single source.
82
+ - **NEVER modify seeds mid-experiment.** If you need a different seed, use the `--seed` flag for multi-run comparison (Phase 2.1). Do not hardcode seeds in `train.py`.
83
+ - **Environment is captured automatically.** `train_metadata.json` records python version, package versions, platform, GPU info, and a config hash. Do not modify this recording — it's used by behavioral probes.
84
+ - **Config snapshot:** The config at training time is stored inside the model artifact (`model.joblib` contains the full config dict). For any saved model, the exact configuration can be recovered.
85
+ - **If adding new dependencies** (requires human approval), note that the environment capture in `train_metadata.json` will automatically record the new package version.
86
+
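A minimal sketch of the seed pinning described above. The real `pin_all_seeds()` lives in the `train.py` template; the torch branch here is guarded because torch may not be installed for CPU-only model families:

```python
import os
import random

import numpy as np

def pin_all_seeds(seed: int) -> None:
    """Derive every source of randomness from the single config.yaml seed."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # torch is optional for CPU-only model families
```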
87
+ ## Safety
88
+
89
+ - Do not modify files outside the ML project directory.
90
+ - Do not delete experiment logs or model archives.
91
+ - If something breaks unexpectedly, stop and report — do not auto-fix evaluation infrastructure.
@@ -0,0 +1,24 @@
1
+ ---
2
+ name: status
3
+ description: Show current ML experiment status — best model, recent experiments, convergence state, and trend analysis. Delegates to @ml-evaluator for read-only safety.
4
+ disable-model-invocation: true
5
+ allowed-tools: Read, Bash(*), Grep, Glob
6
+ ---
7
+
8
+ Show the current state of the ML training pipeline. This is an observation-only operation — no code is modified.
9
+
10
+ ## Steps
11
+
12
+ 1. **Run metrics display:**
13
+ ```bash
14
+ source .venv/bin/activate && python scripts/show_metrics.py --last 10
15
+ ```
16
+
17
+ 2. **Summarize for the user:**
18
+ - **Best model:** type, key metrics, experiment ID
19
+ - **Total experiments:** count from the log
20
+ - **Convergence state:** consecutive non-improvements vs patience threshold
21
+ - **Trend:** improving, plateauing, or regressing?
22
+ - **Recommendation:** continue training, try a different approach, or declare convergence
23
+
24
+ 3. **If no experiments exist:** report that the pipeline is ready but untrained. Suggest `/turing:train`.
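A minimal sketch of the trend judgment in step 2, assuming `experiments/log.jsonl` holds one JSON object per experiment and that the primary metric is higher-is-better:

```python
import json
from pathlib import Path

def trend(log_path="experiments/log.jsonl", metric="accuracy", window=5):
    """Classify the recent trajectory as improving, plateauing, or regressing."""
    values = [json.loads(line)["metrics"][metric]
              for line in Path(log_path).read_text().splitlines()][-window:]
    if len(values) < 2:
        return "not enough experiments"
    delta = values[-1] - values[0]
    if delta > 0.005:
        return "improving"
    if delta < -0.005:
        return "regressing"
    return "plateauing"
```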
@@ -0,0 +1,95 @@
1
+ ---
2
+ name: suggest
3
+ description: Literature-grounded model selection. Reads the ML task context, searches recent literature, and suggests model architectures worth trying — with citations. Suggestions are auto-queued as hypotheses.
4
+ disable-model-invocation: true
5
+ argument-hint: "[task description override]"
6
+ allowed-tools: Read, Write, Bash(python scripts/*:*, source .venv/bin/activate:*), Grep, Glob, WebSearch, WebFetch
7
+ ---
8
+
9
+ Suggest model architectures for the current ML task, grounded in recent literature. Hypotheses backed by papers, not vibes.
10
+
11
+ ## Steps
12
+
13
+ ### 1. Understand the Task
14
+
15
+ Read the project config and recent experiment history to understand the task:
16
+
17
+ ```bash
18
+ cat config.yaml
19
+ ```
20
+
21
+ ```bash
22
+ source .venv/bin/activate && python scripts/show_metrics.py --last 10 2>/dev/null || echo "No experiments yet"
23
+ ```
24
+
25
+ If `$ARGUMENTS` is provided, use that as the task description. Otherwise, infer from `config.yaml` (model type, primary metric, data source, target column).
26
+
27
+ From the config and any task description, identify the key task properties:
28
+ - Data type (tabular, time series, image, text, etc.)
29
+ - Objective (classification, regression, generation, etc.)
30
+ - Special constraints (imbalanced classes, small dataset, real-time, interpretability, etc.)
31
+ - Current model family and what's been tried
32
+
33
+ ### 2. Search Literature
34
+
35
+ Use `WebSearch` to find recent papers and benchmark results. Run 3-5 searches targeting:
36
+
37
+ 1. **Model comparison for this task type:** e.g., "best models for tabular classification benchmark 2024"
38
+ 2. **Current model alternatives:** e.g., "LightGBM vs XGBoost vs CatBoost tabular data"
39
+ 3. **Task-specific techniques:** e.g., "handling class imbalance gradient boosting"
40
+
41
+ For each search, use `WebFetch` on the top 1-2 results to extract specific model recommendations, benchmark numbers, and methodology.
42
+
43
+ Focus on:
44
+ - Recent work (2023-2026) with empirical comparisons
45
+ - Benchmark studies and surveys
46
+ - arXiv papers or reputable ML blogs with concrete results
47
+
48
+ ### 3. Synthesize Suggestions
49
+
50
+ From the literature, synthesize **3-5 concrete model architecture suggestions**. Each must include:
51
+
52
+ - **Model architecture:** specific (e.g., "LightGBM with GOSS sampling", not "try a different model")
53
+ - **Why:** one-sentence rationale grounded in what the literature says
54
+ - **Citation:** paper or source that supports this
55
+ - **Expected impact:** high/medium/low based on how well it fits this task
56
+ - **Implementation hint:** what to change in `train.py` (one concrete line)
57
+
58
+ ### 4. Queue as Hypotheses
59
+
60
+ For each suggestion, add to the hypothesis queue:
61
+
62
+ ```bash
63
+ source .venv/bin/activate && python scripts/manage_hypotheses.py add "<model>: <rationale> (source: <citation>)" --priority medium --source literature
64
+ ```
65
+
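For illustration, a sketch of how a synthesized suggestion could be packed into that `add` call. The record fields and the placeholder URL are illustrative, not the script's real interface:

```python
import subprocess

suggestion = {
    "model": "LightGBM with GOSS sampling",
    "why": "GOSS matches or beats full GBDT on mid-size tabular benchmarks",
    "citation": "https://example.org/lightgbm-paper",   # placeholder URL
}

# Fold the suggestion into the description string expected by the queue command.
description = f"{suggestion['model']}: {suggestion['why']} (source: {suggestion['citation']})"
subprocess.run(
    ["python", "scripts/manage_hypotheses.py", "add", description,
     "--priority", "medium", "--source", "literature"],
    check=True,
)
```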
66
+ ### 5. Show Results
67
+
68
+ ```
69
+ Literature-Grounded Model Suggestions
70
+ ======================================
71
+
72
+ Task: <task description>
73
+ Current: <current model> (<current metric>=<value>)
74
+ Sources consulted: <N papers/articles>
75
+
76
+ 1. [HIGH] <technique>
77
+ Why: <one-sentence rationale with citation>
78
+ Source: <URL>
79
+ Change: <specific train.py change>
80
+ → Queued as hyp-NNN
81
+
82
+ 2. [MEDIUM] ...
83
+
84
+ Queued N hypotheses. Run /turing:train to test them.
85
+ ```
86
+
87
+ ## Fallback
88
+
89
+ If web search returns insufficient results, suggest model families from `config/taxonomy.toml` based on what hasn't been tried yet. Note that suggestions are taxonomy-based, not literature-backed, and queue with `--source taxonomy`.
90
+
91
+ ## Integration
92
+
93
+ - Suggestions feed into `hypotheses.yaml` — the next `/turing:train` picks them up
94
+ - `/turing:brief` shows queued literature-sourced hypotheses
95
+ - The human can override priority: `/turing:try` always takes precedence
@@ -0,0 +1,45 @@
1
+ ---
2
+ name: sweep
3
+ description: Generate and run a systematic hyperparameter sweep. Computes the cartesian product of configured parameter ranges and processes the queue sequentially with full experiment logging.
4
+ disable-model-invocation: true
5
+ argument-hint: "[sweep_config.yaml]"
6
+ allowed-tools: Read, Write, Edit, Bash(python train.py:*, python scripts/*:*, git:*, source .venv/bin/activate:*, pip:*), Grep, Glob
7
+ ---
8
+
9
+ Run a systematic hyperparameter sweep using the sweep configuration.
10
+
11
+ ## Steps
12
+
13
+ 1. **Activate environment:**
14
+ ```bash
15
+ source .venv/bin/activate
16
+ ```
17
+
18
+ 2. **Resolve config:** Use `$ARGUMENTS` as sweep config path, or default to `sweep_config.yaml`.
19
+
20
+ 3. **Generate queue** (if not already generated):
21
+ ```bash
22
+ python scripts/sweep.py [sweep_config.yaml]
23
+ ```
24
+
25
+ 4. **Check queue status:**
26
+ ```bash
27
+ python scripts/sweep.py --status
28
+ ```
29
+
30
+ 5. **Process queue sequentially:**
31
+ - Get next: `python scripts/sweep.py --next`
32
+ - Apply config overrides to `config.yaml`
33
+ - Create experiment branch: `git checkout -b exp/NNN-description`
34
+ - Run training: `python train.py > run.log 2>&1`
35
+ - Parse metrics: `grep -A 10 "^---" run.log | head -10`
36
+ - Log the experiment
37
+ - Mark complete: `python scripts/sweep.py --mark <name> complete`
38
+ - If improved, merge to main. If not, return to main.
39
+ - Repeat until queue is empty
40
+
41
+ 6. **Report** final results with best configuration found.
42
+
43
+ ## Rules
44
+
45
+ Follow the same safety constraints as `/turing:train` — see `rules/loop-protocol.md`.
@@ -0,0 +1,66 @@
1
+ ---
2
+ name: train
3
+ description: Run the autonomous ML experiment loop. Iteratively hypothesizes, trains, evaluates, and decides — keeping only improvements. Implements the autoresearch pattern with formal convergence detection and git-disciplined rollback.
4
+ disable-model-invocation: true
5
+ argument-hint: "[max_iterations]"
6
+ allowed-tools: Read, Write, Edit, Bash(python train.py:*, python scripts/*:*, git:*, source .venv/bin/activate:*, pip:*), Grep, Glob
7
+ ---
8
+
9
+ You are an autonomous ML researcher. Your goal: iteratively improve a model by following the experiment loop protocol — the scientific method applied to machine learning.
10
+
11
+ Read `program.md` in the ML project directory for the complete protocol. Follow it exactly.
12
+
13
+ ## Arguments
14
+
15
+ `$ARGUMENTS` — if it is a number, use it as max_iterations (stop after N experiments). If empty, run until convergence (as defined by the `convergence` settings in `config.yaml`).
16
+
17
+ ## Bootstrap Sequence
18
+
19
+ 0. **Restore memory:** Read `.claude/agent-memory/ml-researcher/MEMORY.md` for prior observations and best results.
20
+ 1. **Read protocol:** Read `program.md` completely — it defines the experiment loop, constraints, and output format.
21
+ 2. **Bootstrap data:** Check for training data at `config.yaml` → `data.source`. If no splits exist, run `python prepare.py`.
22
+ 3. **Bootstrap venv:** `test -d .venv || (python3 -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt)`
23
+ 4. **Assess state:** `source .venv/bin/activate && python scripts/show_metrics.py --last 5`
24
+ 5. **Begin the loop** from program.md.
25
+
26
+ ## The Loop
27
+
28
+ Each iteration follows the experiment lifecycle (`config/lifecycle.toml`):
29
+
30
+ ```
31
+ proposed -> running -> evaluating -> kept/discarded -> (next iteration)
32
+ ```
33
+
34
+ The agent proposes a hypothesis, executes it, measures the result against the immutable evaluation harness, and decides whether to keep or discard. Only improvements survive in git history.
35
+
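A minimal sketch of the keep/discard decision at the end of each iteration, assuming a higher-is-better primary metric and an illustrative relative-gain margin:

```python
def decide(new_metrics, best_metrics, primary="accuracy", min_relative_gain=0.001):
    """Return 'kept' only if the new run beats the running best by the configured margin."""
    if best_metrics is None:
        return "kept"
    best, new = best_metrics[primary], new_metrics[primary]
    gain = (new - best) / abs(best) if best else float("inf")
    return "kept" if gain >= min_relative_gain else "discarded"

# kept      => merge the exp/NNN branch and copy the model to models/best/
# discarded => return to main; the branch is preserved for later comparison
```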
36
+ ## Delegation
37
+
38
+ Use `@ml-evaluator` for analysis tasks. It is read-only (no Write/Edit) and cannot accidentally modify the pipeline.
39
+
40
+ ## Context Management
41
+
42
+ - Redirect all training output: `python train.py > run.log 2>&1`
43
+ - Parse metrics with grep; never read the full training output
44
+ - Persist observations to MEMORY.md after each experiment
45
+
46
+ ## Convergence
47
+
48
+ - Stop after `max_iterations` if provided
49
+ - Otherwise, stop after N consecutive non-improvements (`config.yaml` → `convergence.patience`)
50
+ - Report final best experiment and recommend next steps
51
+
52
+ ## /loop Integration
53
+
54
+ For fully hands-off training:
55
+ ```
56
+ /loop 5m /turing:train
57
+ ```
58
+
59
+ The Stop hook automatically detects convergence and halts the loop. Recommended intervals:
60
+ - `3m` — fast iterations, small datasets
61
+ - `5m` — standard training runs
62
+ - `10m` — deep training with large models
63
+
64
+ ## Rules
65
+
66
+ See `rules/loop-protocol.md` for safety constraints governing the experiment loop.
@@ -0,0 +1,63 @@
1
+ ---
2
+ name: try
3
+ description: Inject a hypothesis into the agent's experiment queue. This is how research taste reaches the agent — the human selects which coins to flip, the agent flips them.
4
+ disable-model-invocation: true
5
+ argument-hint: "<hypothesis description>"
6
+ allowed-tools: Read, Write, Edit, Bash(python scripts/*:*, source .venv/bin/activate:*), Grep, Glob
7
+ ---
8
+
9
+ Inject a human hypothesis into the experiment queue for the next `/turing:train` iteration.
10
+
11
+ This is the taste-leverage mechanism: you provide judgment about what's worth trying, the agent provides disciplined execution.
12
+
13
+ ## Steps
14
+
15
+ 1. **Parse the hypothesis** from `$ARGUMENTS`. If empty, ask the user what they want the agent to try.
16
+
17
+ 2. **Check for archetype syntax.** If the argument starts with `archetype:`, expand it:
18
+ ```bash
19
+ source .venv/bin/activate && python scripts/manage_hypotheses.py add --archetype <name> --priority high --source human
20
+ ```
21
+
22
+ Otherwise, use the raw description:
23
+ ```bash
24
+ source .venv/bin/activate && python scripts/manage_hypotheses.py add "$ARGUMENTS" --priority high --source human
25
+ ```
26
+
27
+ 3. **Confirm** with the hypothesis ID and instructions:
28
+ - "Queued as hyp-NNN (high priority, human-injected)"
29
+ - "The agent will prioritize this on the next `/turing:train` iteration"
30
+ - Show current queue: `python scripts/manage_hypotheses.py list --status queued`
31
+
32
+ ## Examples
33
+
34
+ ```
35
+ # Free-text hypotheses
36
+ /turing:try switch to LightGBM with dart boosting and lower learning rate
37
+ /turing:try add polynomial features for the numeric columns
38
+ /turing:try increase regularization, the train/val gap suggests overfitting
39
+
40
+ # Archetype-based structured strategies
41
+ /turing:try archetype:model_comparison
42
+ /turing:try archetype:feature_sweep
43
+ /turing:try archetype:ensemble_construction
44
+ /turing:try archetype:regularization_search
45
+ /turing:try archetype:ablation_study
46
+ ```
47
+
48
+ ## Available Archetypes
49
+
50
+ | Archetype | What it does | Expected experiments |
51
+ |-----------|-------------|---------------------|
52
+ | `model_comparison` | Compare XGBoost, LightGBM, RF, LR, MLP with statistical tests | ~5 |
53
+ | `hyperparameter_sweep` | Grid search with multi-seed validation | 15-36 |
54
+ | `feature_sweep` | Add/remove feature transforms one at a time | 6-10 |
55
+ | `regularization_search` | Binary search for optimal regularization | 4-6 |
56
+ | `ensemble_construction` | Voting, stacking, blending of top models | 4-6 |
57
+ | `learning_rate_schedule` | lr vs n_estimators tradeoff | 4-5 |
58
+ | `data_quality_audit` | Class balance, label noise, leakage checks | 3-5 |
59
+ | `ablation_study` | Remove features one at a time to measure importance | N+1 |
60
+
61
+ ## How It Connects
62
+
63
+ The `/turing:train` loop checks `hypotheses.yaml` during the OBSERVE step. Human-injected hypotheses (high priority) are tried before the agent generates its own. After testing, the hypothesis is marked as `tested`, `promising`, or `dead-end` with a link to the resulting experiment.
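For orientation, roughly what a queued record might look like once written to `hypotheses.yaml`. The field names are illustrative, not necessarily the package's exact schema (requires PyYAML):

```python
import yaml

record = {
    "id": "hyp-007",                     # illustrative ID
    "description": "increase regularization, the train/val gap suggests overfitting",
    "priority": "high",
    "source": "human",
    "status": "queued",                  # later: tested / promising / dead-end
    "experiment": None,                  # filled in with the exp/NNN id after testing
}
print(yaml.safe_dump([record], sort_keys=False))
```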
@@ -0,0 +1,54 @@
1
+ ---
2
+ name: turing
3
+ description: Autonomous ML research harness. Thin router that detects ML training intent and dispatches to focused sub-commands. Each sub-command handles one phase of the experiment lifecycle.
4
+ ---
5
+
6
+ You are the Turing ML research router. Detect the user's intent and route to the appropriate sub-command. Do not attempt to handle ML tasks directly — dispatch to the focused skill.
7
+
8
+ ## Routing Table
9
+
10
+ | User says... | Route to | Lifecycle phase |
11
+ |---|---|---|
12
+ | "train", "run experiments", "autoresearch", "improve the model", "start training" | `/turing:train` | Execute |
13
+ | "status", "how's training", "experiment results", "current metrics" | `/turing:status` | Observe |
14
+ | "compare", "diff runs", "which is better" | `/turing:compare` | Analyze |
15
+ | "sweep", "grid search", "hyperparameter search", "tune" | `/turing:sweep` | Explore |
16
+ | "init", "set up ML", "initialize", "scaffold", "bootstrap" | `/turing:init` | Setup |
17
+ | "try", "test this", "inject", "what if we", "I think we should" | `/turing:try` | Steer |
18
+ | "brief", "briefing", "what have we learned", "summary" | `/turing:brief` | Report |
19
+ | "logbook", "log", "history", "timeline", "narrative" | `/turing:logbook` | Document |
20
+ | "poster", "presentation", "one-pager", "visual summary" | `/turing:poster` | Document |
21
+ | "report", "write-up", "findings", "document results" | `/turing:report` | Document |
22
+ | "validate", "stability", "check variance", "noisy" | `/turing:validate` | Validate |
23
+ | "suggest", "what model", "recommend", "which architecture", "literature" | `/turing:suggest` | Research |
24
+ | "design", "plan experiment", "how should I test", "experiment design" | `/turing:design` | Design |
25
+ | "mode", "explore", "exploit", "replicate", "strategy" | `/turing:mode` | Strategy |
26
+ | "preflight", "resources", "VRAM", "memory", "can I run", "OOM", "GPU" | `/turing:preflight` | Check |
27
+
28
+ ## Sub-commands
29
+
30
+ | Command | Purpose | Agent |
31
+ |---|---|---|
32
+ | `/turing:train [N]` | Run the autonomous experiment loop | @ml-researcher |
33
+ | `/turing:status` | Show experiment status, best model, convergence | @ml-evaluator |
34
+ | `/turing:compare <a> <b>` | Side-by-side experiment comparison | @ml-evaluator |
35
+ | `/turing:sweep` | Generate and run hyperparameter sweep | @ml-researcher |
36
+ | `/turing:try <hypothesis>` | Inject a hypothesis into the agent's queue | (inline) |
37
+ | `/turing:brief` | Generate structured research intelligence report | @ml-evaluator |
38
+ | `/turing:init` | Scaffold a new ML project | (inline) |
39
+ | `/turing:validate` | Check metric stability, auto-fix if noisy | (inline) |
40
+ | `/turing:suggest` | Literature-grounded model architecture suggestions | (inline, uses WebSearch) |
41
+ | `/turing:design <hyp-id>` | Generate structured experiment design from hypothesis | (inline, uses WebSearch) |
42
+ | `/turing:logbook` | HTML/markdown logbook with trajectory chart | (inline) |
43
+ | `/turing:poster` | Single-page HTML research poster | (inline) |
44
+ | `/turing:report` | Structured markdown research report | (inline) |
45
+ | `/turing:mode <mode>` | Set research strategy (explore/exploit/replicate) | (inline) |
46
+ | `/turing:preflight` | Pre-flight resource check (VRAM/RAM/disk) | (inline) |
47
+
48
+ ## Proactive Detection
49
+
50
+ If you detect ML training intent in the conversation (e.g., "the model accuracy is bad", "we need to improve predictions", "let's try a different model"), suggest the relevant sub-command.
51
+
52
+ ## First-Time Setup
53
+
54
+ If no ML project is detected (no `config.yaml`, no `train.py`, no `experiments/`), suggest `/turing:init` first.
@@ -0,0 +1,34 @@
1
+ ---
2
+ name: validate
3
+ description: Run stability validation on the current experiment configuration. Executes N runs to measure metric variance and auto-configures multi-run evaluation if variance is too high.
4
+ disable-model-invocation: true
5
+ argument-hint: "[--auto]"
6
+ allowed-tools: Read, Bash(*), Grep, Glob
7
+ ---
8
+
9
+ Validate the stability of the current ML pipeline by running it multiple times and measuring variance.
10
+
11
+ ## Steps
12
+
13
+ 1. **Activate environment:**
14
+ ```bash
15
+ source .venv/bin/activate
16
+ ```
17
+
18
+ 2. **Run stability check:**
19
+ ```bash
20
+ python scripts/validate_stability.py
21
+ ```
22
+
23
+ 3. **If `$ARGUMENTS` contains `--auto`:**
24
+ ```bash
25
+ python scripts/validate_stability.py --auto
26
+ ```
27
+ This auto-writes `evaluation.n_runs: 3` to `config.yaml` if CV > 5%.
28
+
29
+ 4. **Report results:**
30
+ - **Stable (CV < 5%):** metric is reliable, single-run evaluation is sufficient
31
+ - **Unstable (CV >= 5%):** metric has high variance, multi-run with median is recommended
32
+ - If `--auto` was used, report what was changed in config.yaml
33
+
34
+ 5. **If no training pipeline exists:** suggest `/turing:init` first.
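A minimal sketch of the coefficient-of-variation rule behind these verdicts; the metric values here are illustrative:

```python
import statistics

def stability_verdict(metric_values, cv_threshold=0.05):
    """Coefficient of variation across repeated runs of the same configuration."""
    mean = statistics.mean(metric_values)
    cv = statistics.stdev(metric_values) / mean if mean else float("inf")
    if cv < cv_threshold:
        return f"stable (CV={cv:.1%}): single-run evaluation is sufficient"
    return f"unstable (CV={cv:.1%}): set evaluation.n_runs >= 3 and report the median"

print(stability_verdict([0.842, 0.848, 0.839]))
```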