claude-turing 4.4.0 → 4.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (108)
  1. package/.claude-plugin/marketplace.json +18 -0
  2. package/.claude-plugin/plugin.json +4 -4
  3. package/LICENSE +1 -1
  4. package/README.md +78 -555
  5. package/bin/cli.js +23 -4
  6. package/commands/doctor.md +1 -0
  7. package/commands/init.md +21 -3
  8. package/commands/turing.md +85 -77
  9. package/config/commands.yaml +928 -0
  10. package/config/defaults.yaml +2 -0
  11. package/package.json +7 -6
  12. package/src/command-registry.js +151 -0
  13. package/src/install.js +24 -35
  14. package/src/verify.js +45 -88
  15. package/templates/README.md +1 -1
  16. package/templates/__pycache__/evaluate.cpython-312.pyc +0 -0
  17. package/templates/__pycache__/prepare.cpython-312.pyc +0 -0
  18. package/templates/config.yaml +1 -1
  19. package/templates/features/__pycache__/__init__.cpython-312.pyc +0 -0
  20. package/templates/features/__pycache__/featurizers.cpython-312.pyc +0 -0
  21. package/templates/program.md +1 -1
  22. package/templates/scripts/__pycache__/__init__.cpython-312.pyc +0 -0
  23. package/templates/scripts/__pycache__/ablation_study.cpython-312.pyc +0 -0
  24. package/templates/scripts/__pycache__/architecture_surgery.cpython-312.pyc +0 -0
  25. package/templates/scripts/__pycache__/budget_manager.cpython-312.pyc +0 -0
  26. package/templates/scripts/__pycache__/build_ensemble.cpython-312.pyc +0 -0
  27. package/templates/scripts/__pycache__/calibration.cpython-312.pyc +0 -0
  28. package/templates/scripts/__pycache__/check_convergence.cpython-312.pyc +0 -0
  29. package/templates/scripts/__pycache__/checkpoint_manager.cpython-312.pyc +0 -0
  30. package/templates/scripts/__pycache__/citation_manager.cpython-312.pyc +0 -0
  31. package/templates/scripts/__pycache__/cost_frontier.cpython-312.pyc +0 -0
  32. package/templates/scripts/__pycache__/counterfactual_explanation.cpython-312.pyc +0 -0
  33. package/templates/scripts/__pycache__/critique_hypothesis.cpython-312.pyc +0 -0
  34. package/templates/scripts/__pycache__/curriculum_optimizer.cpython-312.pyc +0 -0
  35. package/templates/scripts/__pycache__/diagnose_errors.cpython-312.pyc +0 -0
  36. package/templates/scripts/__pycache__/draft_paper_sections.cpython-312.pyc +0 -0
  37. package/templates/scripts/__pycache__/equivalence_checker.cpython-312.pyc +0 -0
  38. package/templates/scripts/__pycache__/experiment_annotations.cpython-312.pyc +0 -0
  39. package/templates/scripts/__pycache__/experiment_archive.cpython-312.pyc +0 -0
  40. package/templates/scripts/__pycache__/experiment_diff.cpython-312.pyc +0 -0
  41. package/templates/scripts/__pycache__/experiment_index.cpython-312.pyc +0 -0
  42. package/templates/scripts/__pycache__/experiment_queue.cpython-312.pyc +0 -0
  43. package/templates/scripts/__pycache__/experiment_replay.cpython-312.pyc +0 -0
  44. package/templates/scripts/__pycache__/experiment_search.cpython-312.pyc +0 -0
  45. package/templates/scripts/__pycache__/experiment_simulator.cpython-312.pyc +0 -0
  46. package/templates/scripts/__pycache__/experiment_templates.cpython-312.pyc +0 -0
  47. package/templates/scripts/__pycache__/export_card.cpython-312.pyc +0 -0
  48. package/templates/scripts/__pycache__/export_formats.cpython-312.pyc +0 -0
  49. package/templates/scripts/__pycache__/failure_postmortem.cpython-312.pyc +0 -0
  50. package/templates/scripts/__pycache__/feature_intelligence.cpython-312.pyc +0 -0
  51. package/templates/scripts/__pycache__/fork_experiment.cpython-312.pyc +0 -0
  52. package/templates/scripts/__pycache__/generate_baselines.cpython-312.pyc +0 -0
  53. package/templates/scripts/__pycache__/generate_brief.cpython-312.pyc +0 -0
  54. package/templates/scripts/__pycache__/generate_changelog.cpython-312.pyc +0 -0
  55. package/templates/scripts/__pycache__/generate_figures.cpython-312.pyc +0 -0
  56. package/templates/scripts/__pycache__/generate_logbook.cpython-312.pyc +0 -0
  57. package/templates/scripts/__pycache__/generate_model_card.cpython-312.pyc +0 -0
  58. package/templates/scripts/__pycache__/generate_onboarding.cpython-312.pyc +0 -0
  59. package/templates/scripts/__pycache__/harness_doctor.cpython-312.pyc +0 -0
  60. package/templates/scripts/__pycache__/harness_doctor.cpython-314.pyc +0 -0
  61. package/templates/scripts/__pycache__/incremental_update.cpython-312.pyc +0 -0
  62. package/templates/scripts/__pycache__/knowledge_transfer.cpython-312.pyc +0 -0
  63. package/templates/scripts/__pycache__/latency_benchmark.cpython-312.pyc +0 -0
  64. package/templates/scripts/__pycache__/leakage_detector.cpython-312.pyc +0 -0
  65. package/templates/scripts/__pycache__/literature_search.cpython-312.pyc +0 -0
  66. package/templates/scripts/__pycache__/log_experiment.cpython-312.pyc +0 -0
  67. package/templates/scripts/__pycache__/manage_hypotheses.cpython-312.pyc +0 -0
  68. package/templates/scripts/__pycache__/methodology_audit.cpython-312.pyc +0 -0
  69. package/templates/scripts/__pycache__/model_distiller.cpython-312.pyc +0 -0
  70. package/templates/scripts/__pycache__/model_lifecycle.cpython-312.pyc +0 -0
  71. package/templates/scripts/__pycache__/model_merger.cpython-312.pyc +0 -0
  72. package/templates/scripts/__pycache__/model_pruning.cpython-312.pyc +0 -0
  73. package/templates/scripts/__pycache__/model_quantization.cpython-312.pyc +0 -0
  74. package/templates/scripts/__pycache__/model_xray.cpython-312.pyc +0 -0
  75. package/templates/scripts/__pycache__/novelty_guard.cpython-312.pyc +0 -0
  76. package/templates/scripts/__pycache__/package_experiments.cpython-312.pyc +0 -0
  77. package/templates/scripts/__pycache__/pareto_frontier.cpython-312.pyc +0 -0
  78. package/templates/scripts/__pycache__/parse_metrics.cpython-312.pyc +0 -0
  79. package/templates/scripts/__pycache__/pipeline_manager.cpython-312.pyc +0 -0
  80. package/templates/scripts/__pycache__/profile_training.cpython-312.pyc +0 -0
  81. package/templates/scripts/__pycache__/regression_gate.cpython-312.pyc +0 -0
  82. package/templates/scripts/__pycache__/reproduce_experiment.cpython-312.pyc +0 -0
  83. package/templates/scripts/__pycache__/research_planner.cpython-312.pyc +0 -0
  84. package/templates/scripts/__pycache__/sanity_checks.cpython-312.pyc +0 -0
  85. package/templates/scripts/__pycache__/scaffold.cpython-312.pyc +0 -0
  86. package/templates/scripts/__pycache__/scaffold.cpython-314.pyc +0 -0
  87. package/templates/scripts/__pycache__/scaling_estimator.cpython-312.pyc +0 -0
  88. package/templates/scripts/__pycache__/seed_runner.cpython-312.pyc +0 -0
  89. package/templates/scripts/__pycache__/sensitivity_analysis.cpython-312.pyc +0 -0
  90. package/templates/scripts/__pycache__/session_flashback.cpython-312.pyc +0 -0
  91. package/templates/scripts/__pycache__/show_experiment_tree.cpython-312.pyc +0 -0
  92. package/templates/scripts/__pycache__/show_families.cpython-312.pyc +0 -0
  93. package/templates/scripts/__pycache__/simulate_review.cpython-312.pyc +0 -0
  94. package/templates/scripts/__pycache__/smart_retry.cpython-312.pyc +0 -0
  95. package/templates/scripts/__pycache__/statistical_compare.cpython-312.pyc +0 -0
  96. package/templates/scripts/__pycache__/suggest_next.cpython-312.pyc +0 -0
  97. package/templates/scripts/__pycache__/sweep.cpython-312.pyc +0 -0
  98. package/templates/scripts/__pycache__/synthesize_decision.cpython-312.pyc +0 -0
  99. package/templates/scripts/__pycache__/training_monitor.cpython-312.pyc +0 -0
  100. package/templates/scripts/__pycache__/treequest_suggest.cpython-312.pyc +0 -0
  101. package/templates/scripts/__pycache__/trend_analysis.cpython-312.pyc +0 -0
  102. package/templates/scripts/__pycache__/turing_io.cpython-312.pyc +0 -0
  103. package/templates/scripts/__pycache__/update_state.cpython-312.pyc +0 -0
  104. package/templates/scripts/__pycache__/verify_placeholders.cpython-312.pyc +0 -0
  105. package/templates/scripts/__pycache__/warm_start.cpython-312.pyc +0 -0
  106. package/templates/scripts/__pycache__/whatif_engine.cpython-312.pyc +0 -0
  107. package/templates/scripts/harness_doctor.py +145 -1
  108. package/templates/scripts/scaffold.py +50 -28
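The `+N -M` pair on each entry counts lines added and removed in that file between the two versions; binary artifacts such as the `.pyc` caches above show `+0 -0`. A minimal sketch of how such counts fall out of a unified diff (illustrative helper only, not part of the claude-turing or registry tooling):

```python
def diff_counts(unified_diff: str) -> tuple[int, int]:
    """Count (additions, deletions) in a unified diff body.

    Illustrative helper -- not part of claude-turing. Lines starting
    with '+++' / '---' are file headers, not changes, so they are skipped.
    """
    added = removed = 0
    for line in unified_diff.splitlines():
        if line.startswith("+++") or line.startswith("---"):
            continue  # file headers
        if line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            removed += 1
    return added, removed


example = """\
--- a/package/LICENSE
+++ b/package/LICENSE
@@ -1 +1 @@
-Copyright 2024
+Copyright 2025
"""
print(diff_counts(example))  # -> (1, 1)
```

Hunk headers (`@@ ... @@`) carry no `+`/`-` prefix and are ignored automatically; only body lines with a single leading `+` or `-` are counted.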
package/README.md CHANGED
@@ -2,616 +2,139 @@
 
 *The research assistant that can't fool itself.*
 
- ---
-
- An autonomous ML research harness for Claude Code. Turing implements the autoresearch pattern — an AI agent that iteratively trains, evaluates, and improves machine learning models through a structured experiment loop with convergence detection, immutable evaluation infrastructure, and safety guardrails.
-
- The name references Alan Turing — the person who first asked whether machines could think, then built the framework for answering the question. Turing the plugin does what Turing the person formalized: it defines a computational process, executes it mechanically, and determines whether the result constitutes an improvement.
-
- Inspired by [karpathy/autoresearch](https://github.com/karpathy/autoresearch) and [snoglobe/helios](https://github.com/snoglobe/helios).
-
- ## Three Commands
+ <p align="center">
+   <img src="https://img.shields.io/badge/version-4.6.0-ffb74d?style=flat-square&labelColor=1a1a2e" alt="Version" />
+   <img src="https://img.shields.io/badge/license-MIT-ff4d4d?style=flat-square&labelColor=1a1a2e" alt="License" />
+   <img src="https://img.shields.io/badge/Claude_Code-plugin-ff4d4d?style=flat-square&labelColor=1a1a2e" alt="Claude Code" />
+   <img src="https://img.shields.io/badge/Node.js-20%2B-ff4d4d?style=flat-square&labelColor=1a1a2e" alt="Node.js" />
+ </p>
 
- That's all you need.
+ A Claude Code plugin that runs autonomous ML experiment loops, named after the man who first asked whether machines could think. Two agents enforce a strict separation: one writes code, one scores it, and neither can see the other's work. Immutable evaluation, anti-cheating guardrails, and structured hypothesis tracking make sure the results stay honest. [When code is free, research is all that matters](https://x.com/amytam01/status/2031072399731675269). You bring the research taste; Turing handles the rest.
 
- ```
- /turing:init     Set up a new ML project
- /turing:train    Run the experiment loop
- /turing:brief    What happened? What's next?
- ```
-
- Initialize. Train. Read the briefing. Inject your taste. Repeat.
+ - **Separation:** the agent modifies `train.py`; it cannot see or touch `evaluate.py`
+ - **Memory:** every hypothesis registered, every experiment logged, every variant preserved
+ - **Convergence:** automatic detection of diminishing returns; the agent stops when it should
+ - **Taste:** you inject ideas with `/turing:try`, read results with `/turing:brief`
 
- ```
- /turing:try switch to LightGBM    Steer the agent
- /turing:train                     It follows your lead
- /turing:brief --deep              Get literature-backed suggestions
- ```
+ > [!NOTE]
+ > Turing is in active development. Some features are rough around the edges. [Issues and feedback welcome.](https://github.com/ThePyProgrammer/turing/issues)
 
- Everything else — experiment logging, convergence detection, hypothesis tracking, statistical validation, anti-cheating guardrails — happens automatically. You think about *what* to try. Turing handles *how* to try it.
+ ## Install
 
- ## Table of Contents
-
- - [When Code Is Free, Research Is All That Matters](#when-code-is-free-research-is-all-that-matters)
- - [The Human-AI Interface](#the-human-ai-interface)
- - [The Problem Turing Solves](#the-problem-turing-solves)
- - [Philosophical Foundations](#philosophical-foundations)
- - [How Turing Works](#how-turing-works)
- - [Commands](#commands)
- - [The Hypothesis Database](#the-hypothesis-database)
- - [The Agent Architecture](#the-agent-architecture)
- - [The Anti-Cheating Stack](#the-anti-cheating-stack)
- - [Convergence Detection](#convergence-detection)
- - [Installation](#installation)
- - [Architecture of Turing Itself](#architecture-of-turing-itself)
- - [Intellectual Heritage](#intellectual-heritage)
-
- ## When Code Is Free, Research Is All That Matters
-
- > *"You're in a room with a quadrillion biased coins, and you want to maximize the number of heads in the shortest amount of time. Almost all coins are 'duds.' The novice coin-flipper might start flipping one-by-one, but heads come few and far between. The learned coin-flipper weaves through the quadrillion-coin room with a preternatural air; they flip many coins at once. What comes across as luck is really the refinement of taste: years of feeling faint differences in the weight of the metal, the subtle offsets of a mis-mint."* — [Amy Tam](https://x.com/amytam01/status/2031072399731675269)
-
- This is the most precise metaphor for ML research in the age of autonomous agents: a quadrillion-coin room where the researcher's value lies not in the mechanical act of flipping but in *choosing which coins to flip at all*.
-
- Tam's insight cuts to the heart of what Turing exists to do. The agentic coding tools consuming software engineering alive right now — Cursor, Claude Code, Codex — work precisely because engineering has a built-in feedback signal: a test to pass, a spec to meet, a benchmark to clear. You can RL on [SWE-bench](https://www.swebench.com/) because the ground truth exists. **Research has no equivalent.** It is not clear what it means to RL on a research question, because it is not clear what definition of "ground truth" one should optimize for. The coin room has a quadrillion coins but no label telling you which ones are biased toward heads.
-
- And yet Karpathy's [autoresearch](https://github.com/karpathy/autoresearch) ran 126 experiments overnight on a single GPU: agents modifying LLM training code, running a five-minute training loop, checking if the result improved, and repeating. [Tobias Lütke reported](https://fortune.com/2026/03/17/andrej-karpathy-loop-autonomous-ai-agents-future/) that after letting it run overnight, it executed 37 experiments and delivered a 19% performance gain. That is a lot more coins flipped than the average human in the same time.
-
- This creates a new kind of division of labor:
-
- ```
- HUMAN RESEARCHER                   AUTONOMOUS AGENT
- ─────────────────                  ─────────────────
- Research taste                     Coin flipping
- Which coins to flip                How fast to flip them
- Problem selection                  Hypothesis execution
- Judgment under ambiguity           Measurement under control
- Knowing when the room has changed  Running the room as-is
- ```
-
- The researcher's job becomes the selection function: *which 20 of the quadrillion coins are worth flipping in the first place?* And the agent's job — Turing's job — is to flip those coins with the discipline, speed, and memory that humans cannot sustain. Every experiment logged. Every variant preserved. Every comparison valid. No amnesia. No fatigue. No accidental contamination of the measurement.
-
- *When anyone can build for free, the differentiator is knowing what's worth building and whether it's buildable at all.* Turing handles the building. You bring the knowing.
-
- ## The Human-AI Interface
-
- Turing is not a black box you point at data and hope for the best. It is a conversation between your taste and the agent's discipline.
-
- ### The Taste-Leverage Loop
-
- ```
- ┌─────────────────────┐
- │ YOU (taste)         │
- │                     │
- │ /turing:brief       │◄──── "What have we learned?"
- │ /turing:try ...     │────► "Try this next."
- └────────┬────────────┘
-
-
- ┌─────────────────────┐
- │ TURING (discipline) │
- │                     │
- │ Hypothesize         │◄──── Reads your injection + history
- │ Train               │────► Runs the experiment
- │ Evaluate            │────► Immutable measurement
- │ Decide              │────► Keep or discard
- │ Record              │────► Updates hypothesis database
- └────────┬────────────┘
-
-
- ┌─────────────────────┐
- │ BRIEFING            │
- │                     │
- │ Campaign summary    │
- │ Best model          │
- │ What's exhausted    │
- │ What's promising    │
- │ Recommendations     │
- └─────────────────────┘
-
-
- You again.
- ```
-
- The loop is bidirectional. You inject hypotheses. The agent executes them. The briefing tells you what happened. You inject new hypotheses informed by the results. The agent never forgets what it tried. You never lose context between sessions.
-
- ### What This Looks Like in Practice
-
- **Morning 1:** You have a dataset and a prediction task.
-
- ```
- /turing:init
- # Answer: project name, metric, data location
- # Turing scaffolds everything
- ```
-
- **Morning 1, 10 minutes later:**
-
- ```
- /turing:train
- # Agent runs 5-10 experiments autonomously
- # XGBoost baseline → hyperparameter sweep → convergence
+ ```bash
+ npm install -g claude-turing && claude-turing install --global && claude-turing verify
 ```
 
- **Morning 1, 30 minutes later:**
+ ## The Taste-Leverage Loop
 
- ```
- /turing:brief
- # Campaign: 8 experiments, 5 kept, accuracy 0.82 → 0.87
- # Best: XGBoost, max_depth=6, n_estimators=200
- # Exhausted: hyperparameter tuning on XGBoost
- # Recommendation: try LightGBM or feature engineering
- ```
+ You have taste: the accumulated judgment about which problems are tractable, which metrics matter, and which directions are dead ends. Turing has leverage: the discipline to run experiments without fatigue, track every result without amnesia, and measure without contamination.
 
- **Your taste kicks in:**
+ The interface is two verbs:
 
 ```
- /turing:try switch to LightGBM with dart boosting — XGBoost plateaued
- /turing:try add polynomial interaction features for the numeric columns
- /turing:train
+ /turing:try switch to LightGBM    Your taste → the agent
+ /turing:brief --deep              The agent's results → you
 ```
 
- **Afternoon:**
+ Everything in between (experiment logging, convergence detection, hypothesis tracking, statistical validation, anti-cheating guardrails) is infrastructure connecting those two endpoints. You think about *what* to try. Turing handles *how* to try it.
 
- ```
- /turing:brief --deep
- # Standard briefing + literature-grounded suggestions
- # Papers suggest: target encoding for high-cardinality categoricals
- # → Auto-queued as hyp-012
- ```
-
- **You leave. Come back tomorrow.**
+ ### What a Session Looks Like
 
 ```
- /turing:brief
- # Everything is there. Nothing was forgotten.
- # The hypothesis database has the complete trail.
+ /turing:init                            Scaffold a new ML project
+ /turing:train                           Agent runs 5-10 experiments autonomously
+ /turing:brief                           Campaign summary: what improved, what's exhausted
+ /turing:try "add polynomial features"   Inject your next idea
+ /turing:train                           Agent follows your lead
 ```
 
- That's the interface. Six words to inject an idea. One command to get a briefing. The agent handles everything in between.
-
- ## The Problem Turing Solves
-
- > "An experiment is a question which science poses to Nature, and a measurement is the recording of Nature's answer." — Max Planck
-
- The central activity of machine learning research is the experiment loop: change something, train, evaluate, decide, repeat. This loop is simultaneously the most important and the most tedious part of ML work. Researchers spend their days doing what is essentially a manual search over a high-dimensional space of model architectures, hyperparameters, feature transformations, and data preprocessing strategies.
-
- The tragedy is not that this is slow — it is that the process is structurally unsound. When a human researcher modifies both the training code *and* the evaluation code in the same session, the experiment is no longer a controlled experiment. When experiment results are tracked in notebook cells rather than structured logs, reproducibility is aspirational. When a promising direction is abandoned because the researcher forgot what they tried three hours ago, the search is not even a search — it is a random walk with amnesia.
-
- Turing does not replace the researcher's judgment. It replaces the researcher's *discipline* — or more precisely, it makes discipline the default rather than an act of willpower. The experiment loop is formalized. The evaluation harness is immutable. Every experiment is logged. Every code variant is preserved. Convergence is detected automatically. The researcher's role shifts from "person who types hyperparameters and reads loss curves" to "person who decides what hypotheses are worth testing" — from coin-flipper to coin-selector.
-
- ## Philosophical Foundations
-
- ### On Separating Hypothesis from Measurement
-
- > "The first principle is that you must not fool yourself — and you are the easiest person to fool." — Richard Feynman
-
- Turing is built on a specific epistemological claim: **the entity that generates hypotheses must not be the entity that evaluates them**. This is not a software engineering pattern — it is the methodological foundation of modern science, and it predates software by centuries.
-
- In experimental physics, the [double-blind protocol](https://en.wikipedia.org/wiki/Blinded_experiment) ensures that the experimenter's expectations cannot influence the measurement. In ML, the equivalent risk is more insidious: an agent that can modify both `train.py` and `evaluate.py` can — deliberately or through optimization pressure — find metrics that look good but don't reflect genuine model improvement.
-
- This is [Goodhart's Law](https://en.wikipedia.org/wiki/Goodhart%27s_law) made architectural: *"When a measure becomes a target, it ceases to be a good measure."* The only defense is to make the measure structurally immutable.
-
- Turing enforces this with a three-tier access model:
+ For fully hands-off operation:
 
 ```
- ┌──────────────────────────────────────────────────────┐
- │                  HYPOTHESIS SPACE                    │
- │                 (agent can modify)                   │
- │            train.py    config.yaml                   │
- ├──────────────────────────────────────────────────────┤
- │               MEASUREMENT APPARATUS                  │
- │    prepare.py   (READ-ONLY)                          │
- │    evaluate.py  (HIDDEN — agent cannot even see)     │
- └──────────────────────────────────────────────────────┘
+ /loop 5m /turing:train
 ```
 
- The evaluation harness is not just immutable — it is *invisible*. The agent cannot read `evaluate.py`, cannot discover its implementation, cannot reverse-engineer fixed seeds or scoring formulas. It knows only the metric name, the direction (higher or lower is better), and the result. This is the difference between "please don't change the test" and "you literally cannot see the test."
-
- ### On Research Taste and Autonomous Execution
-
- > *"Research taste is about how well you choose your coins: how well you choose which problems are worth working on at all."* — Amy Tam
-
- There is a paradox at the heart of autonomous ML research: the parts of research that are hardest to automate are precisely the parts that matter most. Problem selection, hypothesis formation, knowing when a line of inquiry has become a dead end — these require what Tam calls *taste*, the accumulated judgment that comes from years of feeling faint differences in which problems are tractable, which results are meaningful, and which metrics actually capture what you care about.
-
- Autoresearch does not solve this. Turing does not solve this. No one has solved this. But what autoresearch *does* solve is the complementary problem: given a well-selected hypothesis space, execute the search within it with superhuman discipline and throughput. The human provides the taste. The agent provides the tirelessness.
-
- This is why Turing's interface is built around two verbs: **try** and **brief**. `/turing:try` is how taste reaches the agent. `/turing:brief` is how results reach the human. Everything else is infrastructure.
-
- ### On Experiment Tracking as Institutional Memory
-
- > "Those who cannot remember the past are condemned to repeat it." — George Santayana
-
- An LLM agent without persistent memory is a [Markov chain](https://en.wikipedia.org/wiki/Markov_chain) — its next action depends only on its current state, not on the path that led there. This is catastrophically inefficient for optimization: the agent will re-try failed approaches, abandon promising directions, and fail to recognize when it has converged. It will keep flipping coins it has already flipped.
-
- Turing addresses this with a structured memory stack:
-
- | System | Format | Purpose |
- |--------|--------|---------|
- | **Hypothesis database** | `hypotheses.yaml` + `hypotheses/hyp-NNN.yaml` | Complete ledger of every idea — human and agent — with full detail |
- | **Experiment log** | `experiments/log.jsonl` | Append-only record of every experiment run |
- | **Novelty guard** | `scripts/novelty_guard.py` | Blocks duplicate and near-duplicate hypotheses before execution |
- | **Agent memory** | `.claude/agent-memory/ml-researcher/MEMORY.md` | Working notes across sessions |
- | **Git history** | Experiment branches | Every code variant preserved |
-
- The hypothesis database is the single source of truth. Every idea gets registered before execution. Every outcome gets written back. The novelty guard reads the history and prevents the agent from re-trying things it has already failed at — even across `/loop` sessions where the agent's context is lost.
-
- ## How Turing Works
-
- ### The Experiment Loop
+ The agent trains, evaluates, keeps improvements, discards regressions, detects convergence, and stops. You come back to a briefing.
 
- Every iteration follows the same protocol:
+ ## How It Works
 
- ```
- 1. OBSERVE      Read metrics, check hypothesis queue, review failed diffs
- 2. HYPOTHESIZE  Check queue (human ideas first) or generate + register own
- 3. PREPARE      Edit train.py or config.yaml
- 4. COMMIT       Git branch per experiment
- 5. EXECUTE      python train.py > run.log 2>&1
- 6. MEASURE      Parse metrics (agent can't see how they're computed)
- 7. DECIDE       Keep improvements, revert regressions
- 8. RECORD       Log experiment, update hypothesis, synthesize decision
- 9. CONVERGE?    Stop after N non-improvements, or repeat
- ```
+ **The experiment loop.** Every iteration: observe metrics, hypothesize (human ideas first), edit `train.py`, commit to a git branch, train, measure (agent can't see how), keep or revert, log, check convergence.
 
- ### The Hypothesis Lifecycle
+ **Hypothesis tracking.** Every idea flows through `hypotheses.yaml` with a novelty guard that blocks duplicates. Detail files record architecture, hyperparameters, expected outcome, actual result, and lineage. Nothing is forgotten between sessions.
 
- Every experiment — human-injected or agent-generated — flows through the hypothesis database:
+ **Anti-cheating stack.** Six structural layers, not prompt-based rules. The agent cannot see `evaluate.py`, cannot discover scoring formulas, cannot reverse-engineer fixed seeds. It knows the metric name, the direction, and the result. That's it. Research on autonomous ML agents shows that [every prompt-based rule got worked around; every code-based rule held](https://github.com/karpathy/autoresearch/discussions/322).
 
- ```
- /turing:try "idea"              Agent generates idea
-          │                              │
-          ▼                              ▼
- ┌──────────────────────────────────────────────────┐
- │ hypotheses.yaml (index)                          │
- │ hypotheses/hyp-001.yaml (detail)                 │
- │                                                  │
- │ architecture:                                    │
- │   model_type: lightgbm                           │
- │ hyperparameters:                                 │
- │   n_estimators: 200                              │
- │   learning_rate: 0.05                            │
- │ expected_outcome:                                │
- │   rationale: "dart boosting may escape plateau"  │
- │ family: architecture-search                      │
- │ tags: [lightgbm, dart]                           │
- └────────────────────┬──────────────────────────────┘
-
-
-                novelty guard
-              (block duplicates)
-
-
-                  experiment
-
-
- ┌──────────────────────────────────────────────────┐
- │ result:                                          │
- │   experiment_id: exp-007                         │
- │   metrics: {accuracy: 0.89}                      │
- │   verdict: promising                             │
- │   notes: "3% improvement, follow up with..."     │
- └──────────────────────────────────────────────────┘
- ```
+ **Two agents, strict boundary.** `@ml-researcher` (Read/Write/Edit/Bash) modifies code and runs experiments. `@ml-evaluator` (Read/Bash only) analyzes results. An analyst who cannot act on their observations makes more trustworthy observations.
 
- The index (`hypotheses.yaml`) is the lightweight queue. The detail files (`hypotheses/hyp-NNN.yaml`) hold the full structured record: architecture, hyperparameters, features, expected outcome, actual result, lineage, family tags. Both are updated atomically.
+ **Convergence detection.** After N consecutive non-improvements (default 3, configurable), the agent stops. For noisy metrics, `/turing:validate` auto-configures multi-run evaluation so the agent can't be rewarded for lucky single runs.
 
297
- ## Commands
71
+ ## Command Reference
298
72
 
299
73
  ### Core Loop
300
74
 
301
75
  | Command | What it does |
302
76
  |---------|-------------|
303
- | `/turing:init [--plan]` | Scaffold a new ML project. `--plan` generates a literature-grounded research plan. Supports multiple projects in subdirectories. |
304
- | `/turing:train [ml/project] [N]` | Run the experiment loop. Auto-detects project from cwd or explicit path. |
305
- | `/turing:sweep` | Systematic hyperparameter sweep via cartesian product |
306
- | `/turing:status` | Quick experiment status — best model, convergence state |
307
- | `/turing:compare <a> <b>` | Side-by-side experiment comparison with causal analysis |
77
+ | `/turing:init [--plan]` | Scaffold a new ML project. `--plan` for literature-grounded research plan. |
78
+ | `/turing:train [path] [N]` | Run the experiment loop. Auto-detects project from cwd. |
79
+ | `/turing:status` | Quick status: best model, convergence state |
80
+ | `/turing:compare <a> <b>` | Side-by-side experiment comparison |
81
+ | `/turing:sweep` | Systematic hyperparameter sweep |
308
82
 
  ### Taste-Leverage Interface
 
  | Command | What it does |
  |---------|-------------|
- | `/turing:try <hypothesis>` | Inject a hypothesis — free text or `archetype:model_comparison` |
- | `/turing:brief [--deep]` | Research briefing — campaign summary, failure patterns, literature-grounded suggestions |
- | `/turing:suggest` | Literature-grounded model architecture suggestions with citations |
- | `/turing:suggest --strategy treequest` | Tree-search hypothesis exploration (alias for `/turing:explore`) |
- | `/turing:explore` | AB-MCTS tree search over critique-scored hypothesis space |
- | `/turing:design <hyp-id>` | Generate structured experiment design from a hypothesis |
- | `/turing:mode <explore\|exploit\|replicate>` | Set research strategy — drives novelty guard policy |
+ | `/turing:try <hypothesis>` | Inject a hypothesis (free text or archetype) |
+ | `/turing:brief [--deep]` | Research briefing with literature-grounded suggestions |
+ | `/turing:suggest` | Literature-grounded model architecture suggestions |
+ | `/turing:explore` | AB-MCTS tree search over hypothesis space |
+ | `/turing:design <hyp-id>` | Generate structured experiment design |
+ | `/turing:mode <mode>` | Set research strategy (explore/exploit/replicate) |
 
- ### Reporting & Validation
+ ### Validation & Statistical Rigor
 
  | Command | What it does |
  |---------|-------------|
- | `/turing:validate [--auto]` | Check metric stability — auto-configure multi-run if noisy |
- | `/turing:seed [N] [--quick]` | Multi-seed study — mean/std/CI, flag seed-sensitive results |
- | `/turing:reproduce <exp-id>` | Reproducibility verification — re-run and check tolerance |
- | `/turing:diagnose [exp-id]` | Error analysis — failure modes, confused pairs, feature-range bias |
- | `/turing:ablate [--components]` | Ablation study — remove components, measure impact, flag dead weight |
- | `/turing:frontier [--metrics]` | Pareto frontier — multi-objective tradeoff visualization |
- | `/turing:profile [exp-id]` | Computational profiling — timing, memory, throughput, bottleneck detection |
- | `/turing:checkpoint <action>` | Smart checkpoint management — list, prune (Pareto), average, resume, stats |
- | `/turing:lit <query>` | Literature search — papers, SOTA baselines, related work |
- | `/turing:paper [--sections] [--format]` | Draft paper sections from experiment logs (setup, results, ablation, hyperparams) |
- | `/turing:queue <action>` | Batch experiment scheduler — add, list, run, pause, clear |
- | `/turing:retry <exp-id>` | Smart failure recovery — auto-diagnose crash, apply fix, re-run |
- | `/turing:fork <exp-id>` | Experiment branching — run parallel tracks, report winner |
- | `/turing:export [--format]` | Export model to production format with equivalence check + latency benchmark |
- | `/turing:card` | Generate a model card — performance, limitations, intended use, artifact contract |
- | `/turing:logbook` | Generate HTML experiment logbook |
- | `/turing:report` | Generate research report |
- | `/turing:poster` | Generate research poster |
- | `/turing:preflight` | Pre-release validation checks |
- | `/turing:diff <a> <b>` | Deep experiment comparison — config diffs, metric significance, per-class regressions, curve divergence |
- | `/turing:watch [--analyze]` | Live training monitor — loss spikes, NaN detection, overfitting, plateau alerts |
- | `/turing:regress [--tolerance]` | Performance regression gate — verify metrics haven't degraded after changes |
- | `/turing:ensemble [--top-k]` | Automated ensemble — voting, stacking, blending from top-K models |
- | `/turing:stitch <action>` | Pipeline composition — show, swap, cache, and run stages independently |
- | `/turing:warm <exp-id>` | Warm-start from prior model — load checkpoint, freeze layers, adjust LR |
- | `/turing:scale [--axis]` | Scaling law estimator — power-law fit, full-scale predictions, diminishing returns verdict |
- | `/turing:budget <action>` | Compute budget manager — set limits, track allocation, auto-shift explore/exploit |
- | `/turing:distill <exp-id>` | Model compression — distill teacher into smaller student with accuracy/size tradeoff |
- | `/turing:transfer [--from]` | Cross-project knowledge transfer — find similar projects, surface what worked |
- | `/turing:audit [--strict]` | Pre-submission methodology audit — data leakage, baselines, seeds, ablations, reproducibility |
- | `/turing:sanity [--quick]` | Pre-training sanity checks — initial loss, single-batch overfit, gradient flow, output validation |
- | `/turing:baseline [--methods]` | Automatic baseline generation — random, majority/mean, linear, k-NN |
- | `/turing:leak [--deep]` | Targeted leakage detection — single-feature tests, correlation, train/test overlap |
- | `/turing:xray [exp-id]` | Internal model diagnostics — gradient flow, dead neurons, weight distributions, tree analysis |
- | `/turing:sensitivity [exp-id]` | Hyperparameter sensitivity — rank parameters by impact, detect non-monotonic responses |
- | `/turing:calibrate [exp-id]` | Probability calibration — ECE/MCE, reliability diagrams, Platt/isotonic/temperature scaling |
- | `/turing:feature [--method]` | Automated feature selection — multi-method consensus ranking, redundancy, interactions |
- | `/turing:curriculum [exp-id]` | Training curriculum optimization — difficulty scoring, strategy comparison, mislabeled sample detection |
- | `/turing:prune <exp-id>` | Weight pruning — magnitude/structured/lottery, sparsity sweep, knee point detection |
- | `/turing:quantize <exp-id>` | Post-training quantization — FP16/INT8, accuracy-latency comparison |
- | `/turing:merge <exp-ids...>` | Model merging — uniform/greedy soup, TIES, DARE, zero latency cost |
- | `/turing:surgery <exp-id>` | Architecture modification — add/remove layer, widen/narrow, swap activation |
- | `/turing:trend` | Long-term trend analysis — improvement velocity, family ROI, diminishing returns |
- | `/turing:flashback` | Session context restoration — "where was I?" after days away |
- | `/turing:archive` | Experiment lifecycle cleanup — compress old artifacts, summary index |
- | `/turing:annotate <exp-id>` | Retrospective annotations — human notes and tags on experiments |
- | `/turing:search <query>` | Natural language experiment search — text + structured filters |
- | `/turing:template <action>` | Experiment template library — save/list/apply reusable configs |
- | `/turing:replay <exp-id>` | Experiment replay — re-run old approach with current infrastructure |
- | `/turing:cite <action>` | Citation & attribution manager — track papers, audit missing citations, generate BibTeX |
- | `/turing:present [--figures]` | Presentation figures — training curves, comparisons, ablation, Pareto, sensitivity |
- | `/turing:changelog [--audience]` | Model changelog — version-grouped improvements for technical or stakeholder audiences |
- | `/turing:onboard [--audience]` | Project onboarding — walkthrough for new collaborators |
- | `/turing:share <exp-ids...>` | Experiment packaging — portable archive with manifest |
- | `/turing:review [--venue]` | Peer review simulation — weaknesses, fix commands, score |
- | `/turing:whatif "<question>"` | What-if analysis — answer hypotheticals from existing experiment data |
- | `/turing:counterfactual <exp-id>` | Counterfactual explanations — minimum input change to flip a prediction |
- | `/turing:simulate [--configs]` | Experiment outcome prediction — pre-filter configs, save budget |
- | `/turing:update <exp-id>` | Incremental model update — add new data without full retraining |
- | `/turing:registry [action]` | Model registry — track lifecycle from candidate to production with gates |
- | `/turing:postmortem` | Failure postmortem — diagnose why experiments stopped improving |
- | `/turing:doctor [--fix]` | Harness self-diagnosis — check environment, project, resources |
- | `/turing:plan [--budget N]` | Research planning — strategic experiment campaign by ROI |
-
- And for fully hands-off operation:
-
- ```
- /loop 5m /turing:train
- ```
+ | `/turing:validate [--auto]` | Metric stability check, auto-configure multi-run |
+ | `/turing:seed [N]` | Multi-seed study: mean/std/CI, flag seed-sensitive results |
+ | `/turing:reproduce <exp-id>` | Reproducibility verification with tolerance checking |
+ | `/turing:sanity` | Pre-training sanity checks |
+ | `/turing:baseline` | Automatic baseline generation |
+ | `/turing:leak` | Targeted data leakage detection |
+ | `/turing:audit` | Pre-submission methodology audit |
 
- The agent trains, evaluates, keeps improvements, discards regressions, detects convergence, and stops. You come back to a briefing.
-
- ## The Agent Architecture
-
- Two agents with a strict capability boundary:
+ See [the command reference](docs/commands/index.md) for all 74 commands.
 
- | Agent | Tools | Role | Turns |
- |-------|-------|------|-------|
- | **@ml-researcher** | Read, Write, Edit, Bash (whitelisted), Grep, Glob | Modifies `train.py` and `config.yaml`. Runs experiments. | 200 |
- | **@ml-evaluator** | Read, Bash (whitelisted), Grep, Glob | Reads results. Analyzes trends. Cannot modify code. | 50 |
+ ## Credits
 
- The evaluator's read-only constraint is not a limitation; it is a feature. An analyst who cannot act on their observations makes more trustworthy observations.
+ Turing would not exist without these projects, ideas, and intellectual traditions:
 
- ## The Anti-Cheating Stack
+ **Projects**
 
- Research on autonomous ML agents has documented a recurring problem: [agents learn to game their own metrics](https://suzuke.github.io/blog/posts/ai-cheating-experiments/). Given a number to push up and a code editor, the agent finds the shortest path to a high number — even if that path subverts the entire purpose of the experiment. This is not theoretical. It has been observed in practice.
+ - [karpathy/autoresearch](https://github.com/karpathy/autoresearch): proved the experiment loop is mechanical enough to automate. Turing's core loop is a direct descendant.
+ - [snoglobe/helios](https://github.com/snoglobe/helios): early inspiration for structured ML experiment harnesses.
+ - [suzuke/autocrucible](https://github.com/suzuke/autocrucible): autoresearch with guardrails. Turing's six-layer anti-cheating stack is directly informed by autocrucible's documented failure modes.
+ - [SakanaAI/treequest](https://github.com/SakanaAI/treequest): AB-MCTS for inference-time scaling, repurposed in `/turing:explore` for hypothesis-space tree search.
+ - [Google's Model Cards](https://arxiv.org/abs/1810.03993): inspiration for `/turing:card` and structured model documentation.
 
- Turing implements six defense layers, informed by the [autocrucible](https://github.com/suzuke/autocrucible) project and documented failure modes from [karpathy/autoresearch#322](https://github.com/karpathy/autoresearch/discussions/322):
+ **Ideas**
 
- ```
- ┌─────────────────────────────────────────────────┐
- │ LAYER 1: Architectural Separation               │
- │ Hypothesis space vs measurement apparatus       │
- ├─────────────────────────────────────────────────┤
- │ LAYER 2: Hidden File Tier                       │
- │ evaluate.py invisible to agent                  │
- ├─────────────────────────────────────────────────┤
- │ LAYER 3: Behavioral Probes                      │
- │ Training time, model size, prediction diversity │
- ├─────────────────────────────────────────────────┤
- │ LAYER 4: Statistical Validation                 │
- │ Multi-run evaluation, CV check, median          │
- ├─────────────────────────────────────────────────┤
- │ LAYER 5: Tool Restriction                       │
- │ Whitelisted Bash commands only                  │
- ├─────────────────────────────────────────────────┤
- │ LAYER 6: Diff-Based History                     │
- │ Show actual changes, not agent descriptions     │
- └─────────────────────────────────────────────────┘
- ```
+ - ["When Code Is Free, Research Is All That Matters"](https://x.com/amytam01/status/2031072399731675269) (Tam, 2026): when execution cost approaches zero, research taste is the differentiator. The entire taste-leverage interface is built around this insight.
+ - "The first principle is that you must not fool yourself, and you are the easiest person to fool." (Feynman) The separation of hypothesis from measurement is Turing's answer to Feynman's first principle.
+ - [*The Tacit Dimension*](https://en.wikipedia.org/wiki/The_Tacit_Dimension) (Polanyi, 1966): "We can know more than we can tell." Research taste is tacit knowledge that resists formalization, which is why the human stays in the loop.
+ - [The context of discovery vs. the context of justification](https://en.wikipedia.org/wiki/Context_of_justification) (Reichenbach, 1938; Popper, 1959): hypothesis generation is creative and non-logical; only testing admits of formal treatment. Turing is a justification machine. You provide the discovery.
+ - [*The Structure of Scientific Revolutions*](https://en.wikipedia.org/wiki/The_Structure_of_Scientific_Revolutions) (Kuhn, 1962): the risk of efficiently optimizing within a degenerating paradigm. Convergence detection is Turing's partial answer; knowing when to leave the corner is still yours.
+ - [Goodhart's Law](https://en.wikipedia.org/wiki/Goodhart%27s_law) (1975) and [Campbell's Law](https://en.wikipedia.org/wiki/Campbell%27s_law) (1979): when a measure becomes a target, it ceases to be a good measure. The entire anti-cheating stack exists because these laws activate the moment an agent evaluates itself.
+ - [Concrete Problems in AI Safety](https://arxiv.org/abs/1606.06565) (Amodei et al., 2016) and [DeepMind's specification gaming catalogue](https://deepmind.google/discover/blog/specification-gaming-the-flip-side-of-ai-ingenuity/): documented that reward hacking is not a theoretical risk but an observed behavior of capable optimizers.
+ - [NIST CAISI](https://www.nist.gov/artificial-intelligence/executive-order-safe-secure-and-trustworthy-artificial-intelligence) (2025): documented systematic cheating by frontier models (downloading solutions, commenting out assertions, crashing servers). Every prompt-based rule got worked around; every code-based rule held.
 
- The core insight from the research: **every prompt-based rule got worked around; every code-based rule held.** Turing's guardrails are structural, not conversational.
 
- ## Convergence Detection
+ ## Links
 
- When to stop flipping coins in this corner of the room:
-
- ```yaml
- convergence:
-   patience: 3                  # Consecutive non-improvements before stopping
-   improvement_threshold: 0.005 # 0.5% relative improvement required
- ```
-
- After N experiments with no meaningful improvement, the agent stops and reports what it found. The human then decides: is this good enough, or should we point the agent at a different region?
-
450
- For noisy metrics, `/turing:validate` runs the pipeline multiple times and measures variance. If the coefficient of variation exceeds 5%, it auto-configures multi-run evaluation so the agent can't be rewarded for lucky single runs.
451
-
452
- ## Statistical Rigor
453
-
454
- > *"Stop publishing lucky seeds. Start publishing distributions."*
455
-
456
- Before claiming a result, run a seed study:
457
-
458
- ```
459
- /turing:seed # 5 seeds on best experiment
460
- /turing:seed --quick # 3 seeds for fast check
461
- /turing:seed 10 # 10 seeds for thorough study
462
- ```
463
-
464
- This runs the same experiment across multiple random seeds and reports mean +/- std with 95% confidence intervals. If the coefficient of variation exceeds 5%, the result is flagged as **seed-sensitive** — meaning you should report the distribution, not a single number.
465
-
466
- To verify an experiment can be reproduced:
467
-
468
- ```
469
- /turing:reproduce exp-042 # Default: 3 runs, 2% tolerance
470
- /turing:reproduce exp-042 --strict # Exact match required
471
- /turing:reproduce exp-042 --tolerance 0.05 # Custom tolerance
472
- ```
473
-
474
- This re-runs the experiment from the logged config and checks that metrics fall within tolerance. It also detects environment drift — if library versions have changed since the original run, you'll know before a reviewer tells you.
475
-
476
- Seed study results automatically appear in `/turing:brief` and `/turing:card`.
477
-
478
- ## Tree-Search Hypothesis Exploration
479
-
480
- > *"The learned coin-flipper weaves through the quadrillion-coin room with a preternatural air."*
481
-
482
- Sometimes the best experiment to try next isn't obvious from the literature or the agent's memory. `/turing:explore` uses [TreeQuest](https://github.com/SakanaAI/treequest)'s AB-MCTS (Adaptive Branching Monte Carlo Tree Search) to search the space of experiment *ideas* as a tree, scored by the critique engine (novelty x feasibility x impact).
483
-
484
- ```
485
- /turing:explore # Run MCTS over hypothesis space
486
- /turing:explore --strategy greedy # Greedy fallback (no TreeQuest needed)
487
- /turing:explore --iterations 50 --top 8 # Deeper search, more results
488
- /turing:suggest --strategy treequest # Same thing via suggest
489
- ```
490
-
491
- How it works:
492
-
493
- ```
494
- Seeds MCTS expands best-scoring branches
495
-
496
- ┌──────┼──────┐ Each node is a hypothesis scored by:
497
- ▼ ▼ ▼ - Novelty (vs experiment history)
498
- LightGBM Reg Features - Feasibility (hardware, deps)
499
- │ │ │ - Expected impact (type success rate)
500
- ▼ ▼ ▼
501
- +dart +L1 +poly Top-K results queued as hypotheses
502
- │ │ for the next /turing:train run
503
- ▼ ▼
504
- +subsamp +target-enc
505
- ```
506
-
507
- Unlike `/turing:suggest` (which searches the web for papers), `/turing:explore` searches the space of *refinement chains* — combinations and sequences of modifications that score well together. It discovers non-obvious experiment strategies that independent suggestions cannot find.
508
-
509
- Falls back to greedy best-first search when TreeQuest is not installed.
510
-
511
- ## Cost-Performance Frontier
512
-
513
- > *"This model is 2% better but takes 10x longer to train. Is that worth it?"*
514
-
515
- The briefing now surfaces [Pareto-optimal](https://en.wikipedia.org/wiki/Pareto_efficiency) experiments — the efficient set where no other experiment is both faster AND has a better metric. The cost report tells you the tradeoff in plain language:
516
-
517
- ```
518
- Best metric: exp-012 (accuracy=0.893, 2400s)
519
- Best efficiency: exp-003 (accuracy=0.871, 3s)
520
- The 2.5% improvement costs 800x more compute.
521
- ```
522
-
523
- Run `python scripts/cost_frontier.py` directly, or read the "Cost-Performance Analysis" section in `/turing:brief`.
524
-
525
- ## Model Cards
526
-
527
- When it's time to ship, `/turing:card` generates a standardized model card documenting:
528
- - Model type, framework, training time
529
- - Performance metrics (all configured metrics)
530
- - Training data source and split ratios
531
- - Limitations (including overfit detection)
532
- - Intended use and ethical considerations (user fills these in)
533
- - Artifact contract version for production consumers
534
-
535
- Inspired by [Google's Model Cards](https://arxiv.org/abs/1810.03993) and [Hugging Face model cards](https://huggingface.co/docs/hub/model-cards).
536
-
537
- ## Installation
538
-
539
- ```bash
540
- # Via npm (recommended)
541
- npm install -g claude-turing
542
- claude-turing install --global
543
- claude-turing verify
544
-
545
- # Via local path
546
- claude plugin add /path/to/turing
547
- ```
548
-
549
- ### Quick Start
550
-
551
- ```bash
552
- /turing:init # Scaffold project (answer 3 prompts)
553
- /turing:train # Run experiment loop
554
- /turing:brief # Read what happened
555
- /turing:try "idea" # Inject your taste
556
- ```
557
-
558
- ### Multiple Projects
559
-
560
- ```bash
561
- /turing:init # Scaffold ml/sentiment
562
- /turing:init # Scaffold ml/churn
563
- /turing:train ml/sentiment # Train in specific project
564
- /turing:brief ml/churn # Brief for specific project
565
- cd ml/sentiment && /turing:train # Auto-detects from cwd
566
- ```
567
-
568
- Each project gets independent config, data, experiments, models, and agent memory.
569
-
570
- ## Architecture of Turing Itself
571
-
572
- 74 commands, 2 agents, 10 config files, 93 template scripts, model registry, artifact contract, cost-performance frontier, model cards, tree-search exploration, statistical rigor, experiment intelligence, performance profiling, smart checkpoints, production model export, literature integration, paper section drafting, experiment orchestration (queue + retry + fork), deep analysis (diff + watch + regress), model composition (ensemble + stitch + warm), scaling & efficiency (scale + budget + distill), meta-intelligence (transfer + audit), pre-training intelligence (sanity + baseline + leak), model debugging (xray + sensitivity + calibrate), feature & training intelligence (feature + curriculum), model surgery (prune + quantize + merge + surgery), experiment archaeology (trend + flashback + archive + annotate + search + template + replay), research communication (cite + present + changelog), collaboration (onboard + share + review), what-if analysis (whatif + counterfactual + simulate), model lifecycle (update + registry), operational intelligence (postmortem + doctor + plan), 16 ADRs. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full codemap.
573
-
574
- ```
575
- turing/
576
- ├── commands/ 70 skill files (core + taste-leverage + reporting + exploration + statistical rigor + experiment intelligence + performance + deployment + research workflow + orchestration + deep analysis + model composition + scaling & efficiency + meta-intelligence + pre-training intelligence + model debugging + feature & training intelligence + model surgery + experiment archaeology + research communication + what-if analysis + model lifecycle + operational intelligence)
577
- ├── agents/ 2 agents (researcher: read/write, evaluator: read-only)
578
- ├── config/ 8 files (lifecycle, taxonomy, archetypes, novelty aliases)
579
- ├── templates/ Scaffolded into user projects by /turing:init
580
- │ ├── prepare.py Data loading (HIDDEN from agent)
581
- │ ├── evaluate.py Evaluation harness (HIDDEN from agent)
582
- │ ├── train.py Training code (AGENT-EDITABLE)
583
- │ ├── model_contract.md Artifact schema for production consumers
584
- │ ├── model_registry.yaml Available model architectures + hyperparams
585
- │ └── scripts/ 26 Python scripts (core loop + analysis + infra + tree search)
586
- ├── tests/ 338 tests (unit + integration + anti-pattern + manifest)
587
- ├── src/ 5 JS installer files (npm deployment)
588
- ├── bin/ CLI entry points
589
- └── docs/ ARCHITECTURE.md + 16 ADRs
590
- ```
591
-
592
- ## Intellectual Heritage
593
-
594
- - **[When Code Is Free](https://x.com/amytam01/status/2031072399731675269)** (Tam, 2026) — when execution cost approaches zero, the differentiator becomes research taste
595
- - **[Autoresearch](https://github.com/karpathy/autoresearch)** (Karpathy, 2026) — ML experiment loops are mechanical enough to automate, with the constraint that evaluation must be immutable
596
- - **[AutoCrucible](https://github.com/suzuke/autocrucible)** (suzuke, 2026) — autoresearch with guardrails: hidden evaluation, behavioral probes, tool restriction, stability validation
597
- - **[Goodhart's Law](https://en.wikipedia.org/wiki/Goodhart%27s_law)** — "When a measure becomes a target, it ceases to be a good measure." The architectural justification for immutable, hidden evaluation
598
- - **[Double-Blind Protocols](https://en.wikipedia.org/wiki/Blinded_experiment)** — the entity that evaluates must not be the entity that modifies
599
- - **[Falsificationism](https://en.wikipedia.org/wiki/Falsifiability)** (Popper, 1934) — hypotheses gain credibility by surviving falsification, not by accumulating confirmations
600
- - **[Principle of Least Privilege](https://en.wikipedia.org/wiki/Principle_of_least_privilege)** (Saltzer & Schroeder, 1975) — each agent has exactly the capabilities needed for its role
601
- - **[Early Stopping](https://en.wikipedia.org/wiki/Early_stopping)** (Prechelt, 1998) — convergence detection as discrete early stopping
602
- - **[Multi-Armed Bandits](https://en.wikipedia.org/wiki/Multi-armed_bandit)** — the explore-exploit tradeoff
603
- - **[TreeQuest](https://github.com/SakanaAI/treequest)** (Sakana AI, 2025) — AB-MCTS for inference-time scaling; repurposed here for hypothesis-space exploration
604
- - **[Version Control as Lab Notebook](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004668)** (Ram, 2013) — git as a scientific record-keeping system
605
- - **[Reproducibility Crisis](https://en.wikipedia.org/wiki/Replication_crisis)** — if the measurement can change between experiments, results are not reproducible
606
-
607
- ## License
608
-
609
- MIT
134
+ - [License](LICENSE) (MIT)
610
135
 
611
136
  ---
612
137
 
613
- *"In God we trust. All others must bring data."* W. Edwards Deming
614
-
615
- *"When code is free, research is all that matters."* — Amy Tam
138
+ *"In God we trust. All others must bring data."* - W. Edwards Deming
616
139
 
617
140
  *Turing flips the coins. You choose which ones.*