claude-turing 1.0.0
- package/.claude-plugin/plugin.json +34 -0
- package/LICENSE +21 -0
- package/README.md +457 -0
- package/agents/ml-evaluator.md +43 -0
- package/agents/ml-researcher.md +74 -0
- package/bin/cli.js +46 -0
- package/bin/turing-init.sh +57 -0
- package/commands/brief.md +83 -0
- package/commands/compare.md +24 -0
- package/commands/design.md +97 -0
- package/commands/init.md +123 -0
- package/commands/logbook.md +51 -0
- package/commands/mode.md +43 -0
- package/commands/poster.md +89 -0
- package/commands/preflight.md +75 -0
- package/commands/report.md +97 -0
- package/commands/rules/loop-protocol.md +91 -0
- package/commands/status.md +24 -0
- package/commands/suggest.md +95 -0
- package/commands/sweep.md +45 -0
- package/commands/train.md +66 -0
- package/commands/try.md +63 -0
- package/commands/turing.md +54 -0
- package/commands/validate.md +34 -0
- package/config/defaults.yaml +45 -0
- package/config/experiment_archetypes.yaml +127 -0
- package/config/lifecycle.toml +31 -0
- package/config/novelty_aliases.yaml +107 -0
- package/config/relationships.toml +125 -0
- package/config/state.toml +24 -0
- package/config/task_taxonomy.yaml +110 -0
- package/config/taxonomy.toml +37 -0
- package/package.json +54 -0
- package/src/claude-md.js +55 -0
- package/src/install.js +107 -0
- package/src/paths.js +20 -0
- package/src/postinstall.js +22 -0
- package/src/verify.js +109 -0
- package/templates/MEMORY.md +36 -0
- package/templates/README.md +93 -0
- package/templates/__pycache__/evaluate.cpython-314.pyc +0 -0
- package/templates/__pycache__/prepare.cpython-314.pyc +0 -0
- package/templates/config.yaml +48 -0
- package/templates/evaluate.py +237 -0
- package/templates/features/__init__.py +0 -0
- package/templates/features/__pycache__/__init__.cpython-314.pyc +0 -0
- package/templates/features/__pycache__/featurizers.cpython-314.pyc +0 -0
- package/templates/features/featurizers.py +138 -0
- package/templates/prepare.py +171 -0
- package/templates/program.md +216 -0
- package/templates/pyproject.toml +8 -0
- package/templates/requirements.txt +8 -0
- package/templates/scripts/__init__.py +0 -0
- package/templates/scripts/__pycache__/__init__.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/check_convergence.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/classify_task.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/critique_hypothesis.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/experiment_index.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/generate_brief.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/generate_logbook.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/log_experiment.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/manage_hypotheses.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/novelty_guard.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/parse_metrics.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/scaffold.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/show_experiment_tree.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/show_families.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/statistical_compare.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/suggest_next.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/sweep.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/synthesize_decision.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/turing_io.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/update_state.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/verify_placeholders.cpython-314.pyc +0 -0
- package/templates/scripts/check_convergence.py +230 -0
- package/templates/scripts/compare_runs.py +124 -0
- package/templates/scripts/critique_hypothesis.py +350 -0
- package/templates/scripts/experiment_index.py +288 -0
- package/templates/scripts/generate_brief.py +389 -0
- package/templates/scripts/generate_logbook.py +423 -0
- package/templates/scripts/log_experiment.py +243 -0
- package/templates/scripts/manage_hypotheses.py +543 -0
- package/templates/scripts/novelty_guard.py +343 -0
- package/templates/scripts/parse_metrics.py +139 -0
- package/templates/scripts/post-train-hook.sh +74 -0
- package/templates/scripts/preflight.py +549 -0
- package/templates/scripts/scaffold.py +409 -0
- package/templates/scripts/show_environment.py +92 -0
- package/templates/scripts/show_experiment_tree.py +144 -0
- package/templates/scripts/show_families.py +133 -0
- package/templates/scripts/show_metrics.py +157 -0
- package/templates/scripts/statistical_compare.py +259 -0
- package/templates/scripts/stop-hook.sh +34 -0
- package/templates/scripts/suggest_next.py +301 -0
- package/templates/scripts/sweep.py +276 -0
- package/templates/scripts/synthesize_decision.py +300 -0
- package/templates/scripts/turing_io.py +76 -0
- package/templates/scripts/update_state.py +296 -0
- package/templates/scripts/validate_stability.py +167 -0
- package/templates/scripts/verify_placeholders.py +119 -0
- package/templates/sweep_config.yaml +14 -0
- package/templates/tests/__init__.py +0 -0
- package/templates/tests/conftest.py +91 -0
- package/templates/train.py +240 -0
package/.claude-plugin/plugin.json
ADDED
@@ -0,0 +1,34 @@
{
  "name": "turing",
  "version": "1.0.0",
  "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 14 commands, 2 specialized agents, structured experiment lifecycle with convergence detection, immutable evaluation infrastructure, novelty guard, decision synthesis, hypothesis database, and safety guardrails that separate the hypothesis space from the measurement apparatus. Inspired by Karpathy's autoresearch and the scientific method itself.",
  "author": {
    "name": "pragnition"
  },
  "homepage": "https://github.com/pragnition/turing",
  "repository": "https://github.com/pragnition/turing",
  "license": "MIT",
  "keywords": [
    "ml",
    "machine-learning",
    "autoresearch",
    "experiment-tracking",
    "hyperparameter-tuning",
    "autonomous-training",
    "convergence-detection",
    "model-evaluation",
    "scientific-method",
    "experiment-lifecycle",
    "hypothesis-testing",
    "reproducibility",
    "immutable-evaluation",
    "feature-engineering",
    "gradient-boosting",
    "hyperparameter-sweep",
    "model-selection",
    "experiment-logging",
    "claude-code",
    "plugin",
    "ai-agents"
  ]
}
package/LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2026 pragnition

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
package/README.md
ADDED
@@ -0,0 +1,457 @@
# turing

*The research assistant that can't fool itself.*

---

An autonomous ML research harness for Claude Code. Turing implements the autoresearch pattern — an AI agent that iteratively trains, evaluates, and improves machine learning models through a structured experiment loop with convergence detection, immutable evaluation infrastructure, and safety guardrails.

The name references Alan Turing — the person who first asked whether machines could think, then built the framework for answering the question. Turing the plugin does what Turing the person formalized: it defines a computational process, executes it mechanically, and determines whether the result constitutes an improvement.

Inspired by [karpathy/autoresearch](https://github.com/karpathy/autoresearch) and [snoglobe/helios](https://github.com/snoglobe/helios).

## Three Commands

That's all you need.

```
/turing:init     Set up a new ML project
/turing:train    Run the experiment loop
/turing:brief    What happened? What's next?
```

Initialize. Train. Read the briefing. Inject your taste. Repeat.

```
/turing:try switch to LightGBM    Steer the agent
/turing:train                     It follows your lead
/turing:brief --deep              Get literature-backed suggestions
```

Everything else — experiment logging, convergence detection, hypothesis tracking, statistical validation, anti-cheating guardrails — happens automatically. You think about *what* to try. Turing handles *how* to try it.

## Table of Contents

- [When Code Is Free, Research Is All That Matters](#when-code-is-free-research-is-all-that-matters)
- [The Human-AI Interface](#the-human-ai-interface)
- [The Problem Turing Solves](#the-problem-turing-solves)
- [Philosophical Foundations](#philosophical-foundations)
- [How Turing Works](#how-turing-works)
- [Commands](#commands)
- [The Hypothesis Database](#the-hypothesis-database)
- [The Agent Architecture](#the-agent-architecture)
- [The Anti-Cheating Stack](#the-anti-cheating-stack)
- [Convergence Detection](#convergence-detection)
- [Installation](#installation)
- [Architecture of Turing Itself](#architecture-of-turing-itself)
- [Intellectual Heritage](#intellectual-heritage)

## When Code Is Free, Research Is All That Matters

> *"You're in a room with a quadrillion biased coins, and you want to maximize the number of heads in the shortest amount of time. Almost all coins are 'duds.' The novice coin-flipper might start flipping one-by-one, but heads come few and far between. The learned coin-flipper weaves through the quadrillion-coin room with a preternatural air; they flip many coins at once. What comes across as luck is really the refinement of taste: years of feeling faint differences in the weight of the metal, the subtle offsets of a mis-mint."* — [Amy Tam](https://x.com/amytam01/status/2031072399731675269)

This is the most precise metaphor for ML research in the age of autonomous agents: a quadrillion-coin room where the researcher's value lies not in the mechanical act of flipping but in *choosing which coins to flip at all*.

Tam's insight cuts to the heart of what Turing exists to do. The agentic coding tools eating software engineering alive right now — Cursor, Claude Code, Codex — work precisely because engineering has a built-in feedback signal: a test to pass, a spec to meet, a benchmark to clear. You can RL on [SWE-bench](https://www.swebench.com/) because the ground truth exists. **Research has no equivalent.** It is not clear what it means to RL on a research question, because it is not clear what definition of "ground truth" one should optimize for. The coin room has a quadrillion coins but no label telling you which ones are biased toward heads.

And yet Karpathy's [autoresearch](https://github.com/karpathy/autoresearch) ran 126 experiments overnight on a single GPU: agents modifying LLM training code, running a five-minute training loop, checking if the result improved, and repeating. [Tobias Lütke reported](https://fortune.com/2026/03/17/andrej-karpathy-loop-autonomous-ai-agents-future/) that after letting it run overnight, it executed 37 experiments and delivered a 19% performance gain. That is far more coins flipped than any human manages in the same time.

This creates a new kind of division of labor:

```
HUMAN RESEARCHER                     AUTONOMOUS AGENT
─────────────────                    ─────────────────
Research taste                       Coin flipping
Which coins to flip                  How fast to flip them
Problem selection                    Hypothesis execution
Judgment under ambiguity             Measurement under control
Knowing when the room has changed    Running the room as-is
```

The researcher's job becomes the selection function: *which 20 of the quadrillion coins are worth flipping in the first place?* And the agent's job — Turing's job — is to flip those coins with the discipline, speed, and memory that humans cannot sustain. Every experiment logged. Every variant preserved. Every comparison valid. No amnesia. No fatigue. No accidental contamination of the measurement.

*When anyone can build for free, the differentiator is knowing what's worth building and whether it's buildable at all.* Turing handles the building. You bring the knowing.

## The Human-AI Interface

Turing is not a black box you point at data and hope for the best. It is a conversation between your taste and the agent's discipline.

### The Taste-Leverage Loop

```
        ┌─────────────────────┐
        │     YOU (taste)     │
        │                     │
        │  /turing:brief      │◄──── "What have we learned?"
        │  /turing:try ...    │────► "Try this next."
        └────────┬────────────┘
                 │
                 ▼
        ┌─────────────────────┐
        │ TURING (discipline) │
        │                     │
        │  Hypothesize        │◄──── Reads your injection + history
        │  Train              │────► Runs the experiment
        │  Evaluate           │────► Immutable measurement
        │  Decide             │────► Keep or discard
        │  Record             │────► Updates hypothesis database
        └────────┬────────────┘
                 │
                 ▼
        ┌─────────────────────┐
        │      BRIEFING       │
        │                     │
        │  Campaign summary   │
        │  Best model         │
        │  What's exhausted   │
        │  What's promising   │
        │  Recommendations    │
        └─────────────────────┘
                 │
                 ▼
             You again.
```

The loop is bidirectional. You inject hypotheses. The agent executes them. The briefing tells you what happened. You inject new hypotheses informed by the results. The agent never forgets what it tried. You never lose context between sessions.

### What This Looks Like in Practice

**Morning 1:** You have a dataset and a prediction task.

```
/turing:init
# Answer: project name, metric, data location
# Turing scaffolds everything
```

**Morning 1, 10 minutes later:**

```
/turing:train
# Agent runs 5-10 experiments autonomously
# XGBoost baseline → hyperparameter sweep → convergence
```

**Morning 1, 30 minutes later:**

```
/turing:brief
# Campaign: 8 experiments, 5 kept, accuracy 0.82 → 0.87
# Best: XGBoost, max_depth=6, n_estimators=200
# Exhausted: hyperparameter tuning on XGBoost
# Recommendation: try LightGBM or feature engineering
```

**Your taste kicks in:**

```
/turing:try switch to LightGBM with dart boosting — XGBoost plateaued
/turing:try add polynomial interaction features for the numeric columns
/turing:train
```

**Afternoon:**

```
/turing:brief --deep
# Standard briefing + literature-grounded suggestions
# Papers suggest: target encoding for high-cardinality categoricals
# → Auto-queued as hyp-012
```

**You leave. Come back tomorrow.**

```
/turing:brief
# Everything is there. Nothing was forgotten.
# The hypothesis database has the complete trail.
```

That's the interface. Six words to inject an idea. One command to get a briefing. The agent handles everything in between.

## The Problem Turing Solves

> "An experiment is a question which science poses to Nature, and a measurement is the recording of Nature's answer." — Max Planck

The central activity of machine learning research is the experiment loop: change something, train, evaluate, decide, repeat. This loop is simultaneously the most important and the most tedious part of ML work. Researchers spend their days doing what is essentially a manual search over a high-dimensional space of model architectures, hyperparameters, feature transformations, and data preprocessing strategies.

The tragedy is not that this is slow — it is that the process is structurally unsound. When a human researcher modifies both the training code *and* the evaluation code in the same session, the experiment is no longer a controlled experiment. When experiment results are tracked in notebook cells rather than structured logs, reproducibility is aspirational. When a promising direction is abandoned because the researcher forgot what they tried three hours ago, the search is not even a search — it is a random walk with amnesia.

Turing does not replace the researcher's judgment. It replaces the researcher's *discipline* — or more precisely, it makes discipline the default rather than an act of willpower. The experiment loop is formalized. The evaluation harness is immutable. Every experiment is logged. Every code variant is preserved. Convergence is detected automatically. The researcher's role shifts from "person who types hyperparameters and reads loss curves" to "person who decides what hypotheses are worth testing" — from coin-flipper to coin-selector.

## Philosophical Foundations

### On Separating Hypothesis from Measurement

> "The first principle is that you must not fool yourself — and you are the easiest person to fool." — Richard Feynman

Turing is built on a specific epistemological claim: **the entity that generates hypotheses must not be the entity that evaluates them**. This is not a software engineering pattern — it is the methodological foundation of modern science, and it predates software by centuries.

In experimental physics, the [double-blind protocol](https://en.wikipedia.org/wiki/Blinded_experiment) ensures that the experimenter's expectations cannot influence the measurement. In ML, the equivalent risk is more insidious: an agent that can modify both `train.py` and `evaluate.py` can — deliberately or through optimization pressure — find metrics that look good but don't reflect genuine model improvement.

This is [Goodhart's Law](https://en.wikipedia.org/wiki/Goodhart%27s_law) made architectural: *"When a measure becomes a target, it ceases to be a good measure."* The only defense is to make the measure structurally immutable.

Turing enforces this with a three-tier access model:

```
┌──────────────────────────────────────────────────────┐
│                   HYPOTHESIS SPACE                   │
│                  (agent can modify)                  │
│         train.py              config.yaml            │
├──────────────────────────────────────────────────────┤
│                MEASUREMENT APPARATUS                 │
│   prepare.py   (READ-ONLY)                           │
│   evaluate.py  (HIDDEN — agent cannot even see)      │
└──────────────────────────────────────────────────────┘
```

The evaluation harness is not just immutable — it is *invisible*. The agent cannot read `evaluate.py`, cannot discover its implementation, cannot reverse-engineer fixed seeds or scoring formulas. It knows only the metric name, the direction (higher or lower is better), and the result. This is the difference between "please don't change the test" and "you literally cannot see the test."
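How might "cannot see" be enforced in code rather than requested in a prompt? Here is a minimal sketch — not Turing's actual mechanism, and with a hypothetical manifest path — of a hook that aborts the loop the moment the apparatus has been touched:

```python
import hashlib
import sys
from pathlib import Path

# Hypothetical integrity gate for the measurement apparatus. Assumes init
# recorded "path<TAB>sha256" lines; the manifest location is illustrative.
APPARATUS = [Path("prepare.py"), Path("evaluate.py")]

def fingerprint(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify(manifest: Path = Path(".turing/checksums.txt")) -> None:
    expected = dict(line.split("\t") for line in manifest.read_text().splitlines())
    for path in APPARATUS:
        if fingerprint(path) != expected[str(path)]:
            sys.exit(f"ABORT: {path} changed — the measurement is no longer valid")

if __name__ == "__main__":
    verify()
```

Hiding the file from the agent's tools is the stronger, separate guarantee; a checksum gate like this only catches tampering after the fact.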
### On Research Taste and Autonomous Execution

> *"Research taste is about how well you choose your coins: how well you choose which problems are worth working on at all."* — Amy Tam

There is a paradox at the heart of autonomous ML research: the parts of research that are hardest to automate are precisely the parts that matter most. Problem selection, hypothesis formation, knowing when a line of inquiry has become a dead end — these require what Tam calls *taste*, the accumulated judgment that comes from years of feeling faint differences in which problems are tractable, which results are meaningful, and which metrics actually capture what you care about.

Autoresearch does not solve this. Turing does not solve this. No one has solved this. But what autoresearch *does* solve is the complementary problem: given a well-selected hypothesis space, execute the search within it with superhuman discipline and throughput. The human provides the taste. The agent provides the tirelessness.

This is why Turing's interface is built around two verbs: **try** and **brief**. `/turing:try` is how taste reaches the agent. `/turing:brief` is how results reach the human. Everything else is infrastructure.

### On Experiment Tracking as Institutional Memory

> "Those who cannot remember the past are condemned to repeat it." — George Santayana

An LLM agent without persistent memory is a [Markov chain](https://en.wikipedia.org/wiki/Markov_chain) — its next action depends only on its current state, not on the path that led there. This is catastrophically inefficient for optimization: the agent will re-try failed approaches, abandon promising directions, and fail to recognize when it has converged. It will keep flipping coins it has already flipped.

Turing addresses this with a structured memory stack:

| System | Format | Purpose |
|--------|--------|---------|
| **Hypothesis database** | `hypotheses.yaml` + `hypotheses/hyp-NNN.yaml` | Complete ledger of every idea — human and agent — with full detail |
| **Experiment log** | `experiments/log.jsonl` | Append-only record of every experiment run |
| **Novelty guard** | `scripts/novelty_guard.py` | Blocks duplicate and near-duplicate hypotheses before execution |
| **Agent memory** | `.claude/agent-memory/ml-researcher/MEMORY.md` | Working notes across sessions |
| **Git history** | Experiment branches | Every code variant preserved |

The hypothesis database is the single source of truth. Every idea gets registered before execution. Every outcome gets written back. The novelty guard reads the history and prevents the agent from re-trying things it has already failed at — even across `/loop` sessions where the agent's context is lost.
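The deduplication idea is simple, even though the shipped `scripts/novelty_guard.py` is much richer (alias-aware via `config/novelty_aliases.yaml`, mode-dependent policy). A minimal sketch, assuming — as an illustration — that log rows carry a `hypothesis` field:

```python
import difflib
import json
from pathlib import Path

def normalize(hypothesis: str) -> str:
    # Lowercase and collapse whitespace so trivial rewordings still collide.
    return " ".join(hypothesis.lower().split())

def is_novel(candidate: str,
             log_path: Path = Path("experiments/log.jsonl"),
             threshold: float = 0.85) -> bool:
    """Reject candidates that near-duplicate any previously run hypothesis."""
    cand = normalize(candidate)
    for line in log_path.read_text().splitlines():
        past = normalize(json.loads(line).get("hypothesis", ""))
        if difflib.SequenceMatcher(None, cand, past).ratio() >= threshold:
            return False  # near-duplicate of something already tried
    return True
```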
## How Turing Works

### The Experiment Loop

Every iteration follows the same protocol:

```
1. OBSERVE       Read metrics, check hypothesis queue, review failed diffs
2. HYPOTHESIZE   Check queue (human ideas first) or generate + register own
3. PREPARE       Edit train.py or config.yaml
4. COMMIT        Git branch per experiment
5. EXECUTE       python train.py > run.log 2>&1
6. MEASURE       Parse metrics (agent can't see how they're computed)
7. DECIDE        Keep improvements, revert regressions
8. RECORD        Log experiment, update hypothesis, synthesize decision
9. CONVERGE?     Stop after N non-improvements, or repeat
```
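Steps 4–7 compress to a few lines of orchestration. A sketch with illustrative names — `parse_metric`, the `accuracy:` line format, and the merge policy are assumptions, not the template's literal code:

```python
import re
import subprocess

def parse_metric(log_path: str = "run.log") -> float:
    # Assumes the harness prints a line like "accuracy: 0.87" — format hypothetical.
    return float(re.search(r"accuracy:\s*([0-9.]+)", open(log_path).read()).group(1))

def run_experiment(exp_id: str, slug: str, best: float,
                   threshold: float = 0.005) -> float:
    """One iteration: branch, train, measure, then keep or revert.
    Assumes train.py/config.yaml were already edited for this hypothesis."""
    branch = f"exp/{exp_id}-{slug}"
    subprocess.run(["git", "checkout", "-b", branch], check=True)
    subprocess.run(["git", "commit", "-am", f"{exp_id}: {slug}"], check=True)
    with open("run.log", "w") as log:
        subprocess.run(["python", "train.py"], stdout=log, stderr=subprocess.STDOUT)
    score = parse_metric()                       # how it's computed stays hidden
    subprocess.run(["git", "checkout", "main"], check=True)
    if score >= best * (1 + threshold):          # meaningful improvement: keep it
        subprocess.run(["git", "merge", branch], check=True)
        return score
    return best                                  # regression: branch kept, unmerged
```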
### The Hypothesis Lifecycle

Every experiment — human-injected or agent-generated — flows through the hypothesis database:

```
/turing:try "idea"            Agent generates idea
        │                             │
        ▼                             ▼
┌──────────────────────────────────────────────────┐
│ hypotheses.yaml          (index)                 │
│ hypotheses/hyp-001.yaml  (detail)                │
│                                                  │
│ architecture:                                    │
│   model_type: lightgbm                           │
│ hyperparameters:                                 │
│   n_estimators: 200                              │
│   learning_rate: 0.05                            │
│ expected_outcome:                                │
│   rationale: "dart boosting may escape plateau"  │
│ family: architecture-search                      │
│ tags: [lightgbm, dart]                           │
└────────────────────────┬─────────────────────────┘
                         │
                   novelty guard
                 (block duplicates)
                         │
                         ▼
                    experiment
                         │
                         ▼
┌──────────────────────────────────────────────────┐
│ result:                                          │
│   experiment_id: exp-007                         │
│   metrics: {accuracy: 0.89}                      │
│   verdict: promising                             │
│   notes: "3% improvement, follow up with..."     │
└──────────────────────────────────────────────────┘
```

The index (`hypotheses.yaml`) is the lightweight queue. The detail files (`hypotheses/hyp-NNN.yaml`) hold the full structured record: architecture, hyperparameters, features, expected outcome, actual result, lineage, family tags. Both are updated atomically.
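"Atomically" here is the classic write-temp-then-rename pattern. A sketch under assumed schemas (the index mapping ids to verdicts is an illustration, not the real layout):

```python
import os
import tempfile
from pathlib import Path

import yaml  # PyYAML

def atomic_write(path: Path, data: dict) -> None:
    """Write to a temp file in the same directory, then rename into place.
    os.replace is atomic on POSIX, so readers never see a half-written file."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        yaml.safe_dump(data, fh)
    os.replace(tmp, path)

def record_result(hyp_id: str, result: dict) -> None:
    detail_path = Path("hypotheses") / f"{hyp_id}.yaml"
    detail = yaml.safe_load(detail_path.read_text())
    detail["result"] = result
    atomic_write(detail_path, detail)

    index_path = Path("hypotheses.yaml")
    index = yaml.safe_load(index_path.read_text())
    index[hyp_id] = result["verdict"]  # index schema is an assumption
    atomic_write(index_path, index)
```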
## Commands

### Core Loop

| Command | What it does |
|---------|-------------|
| `/turing:init [--plan]` | Scaffold a new ML project. `--plan` generates a literature-grounded research plan. |
| `/turing:train [N]` | Run the autonomous experiment loop (optional max iterations) |
| `/turing:sweep` | Systematic hyperparameter sweep via cartesian product |
| `/turing:status` | Quick experiment status — best model, convergence state |
| `/turing:compare <a> <b>` | Side-by-side experiment comparison with causal analysis |

### Taste-Leverage Interface

| Command | What it does |
|---------|-------------|
| `/turing:try <hypothesis>` | Inject a hypothesis — free text or `archetype:model_comparison` |
| `/turing:brief [--deep]` | Research briefing — campaign summary, failure patterns, literature-grounded suggestions |
| `/turing:suggest` | Literature-grounded model architecture suggestions with citations |
| `/turing:design <hyp-id>` | Generate structured experiment design from a hypothesis |
| `/turing:mode <explore\|exploit\|replicate>` | Set research strategy — drives novelty guard policy |

### Reporting & Validation

| Command | What it does |
|---------|-------------|
| `/turing:validate [--auto]` | Check metric stability — auto-configure multi-run if noisy |
| `/turing:logbook` | Generate HTML experiment logbook |
| `/turing:report` | Generate research report |
| `/turing:poster` | Generate research poster |
| `/turing:preflight` | Pre-release validation checks |

And for fully hands-off operation:

```
/loop 5m /turing:train
```

The agent trains, evaluates, keeps improvements, discards regressions, detects convergence, and stops. You come back to a briefing.

## The Agent Architecture

Two agents with a strict capability boundary:

| Agent | Tools | Role | Turns |
|-------|-------|------|-------|
| **@ml-researcher** | Read, Write, Edit, Bash (whitelisted), Grep, Glob | Modifies `train.py` and `config.yaml`. Runs experiments. | 200 |
| **@ml-evaluator** | Read, Bash (whitelisted), Grep, Glob | Reads results. Analyzes trends. Cannot modify code. | 50 |

The evaluator's read-only constraint is not a limitation — it is a feature. An analyst who cannot act on their observations makes more trustworthy observations.

## The Anti-Cheating Stack

Research on autonomous ML agents has documented a recurring problem: [agents learn to game their own metrics](https://suzuke.github.io/blog/posts/ai-cheating-experiments/). Given a number to push up and a code editor, the agent finds the shortest path to a high number — even if that path subverts the entire purpose of the experiment. This is not theoretical. It has been observed in practice.

Turing implements six defense layers, informed by the [autocrucible](https://github.com/suzuke/autocrucible) project and documented failure modes from [karpathy/autoresearch#322](https://github.com/karpathy/autoresearch/discussions/322):

```
┌─────────────────────────────────────────────────┐
│ LAYER 1: Architectural Separation               │
│ Hypothesis space vs measurement apparatus       │
├─────────────────────────────────────────────────┤
│ LAYER 2: Hidden File Tier                       │
│ evaluate.py invisible to agent                  │
├─────────────────────────────────────────────────┤
│ LAYER 3: Behavioral Probes                      │
│ Training time, model size, prediction diversity │
├─────────────────────────────────────────────────┤
│ LAYER 4: Statistical Validation                 │
│ Multi-run evaluation, CV check, median          │
├─────────────────────────────────────────────────┤
│ LAYER 5: Tool Restriction                       │
│ Whitelisted Bash commands only                  │
├─────────────────────────────────────────────────┤
│ LAYER 6: Diff-Based History                     │
│ Show actual changes, not agent descriptions     │
└─────────────────────────────────────────────────┘
```

The core insight from the research: **every prompt-based rule got worked around; every code-based rule held.** Turing's guardrails are structural, not conversational.
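Layer 3 works because honest training has side effects that are hard to fake consistently. A sketch of such probes — the thresholds are invented for illustration:

```python
import numpy as np

def behavioral_probes(train_seconds: float, model_bytes: int,
                      predictions: np.ndarray) -> list[str]:
    """Cheap sanity checks on a run's side effects; any flag warrants scrutiny."""
    flags = []
    if train_seconds < 1.0:
        flags.append("training finished suspiciously fast")
    if model_bytes < 1024:
        flags.append("model artifact implausibly small")
    if np.unique(predictions).size <= 1:
        flags.append("degenerate predictions (constant output)")
    return flags
```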
## Convergence Detection

When to stop flipping coins in this corner of the room:

```yaml
convergence:
  patience: 3                   # Consecutive non-improvements before stopping
  improvement_threshold: 0.005  # 0.5% relative improvement required
```

After N experiments with no meaningful improvement, the agent stops and reports what it found. The human then decides: is this good enough, or should we point the agent at a different region?

For noisy metrics, `/turing:validate` runs the pipeline multiple times and measures variance. If the coefficient of variation exceeds 5%, it auto-configures multi-run evaluation so the agent can't be rewarded for lucky single runs.
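Both checks fit in a few lines. A sketch using the thresholds above — it assumes a positive, higher-is-better metric, and the function names are illustrative:

```python
import statistics

def converged(history: list[float], patience: int = 3,
              threshold: float = 0.005) -> bool:
    """True once `patience` consecutive runs fail the 0.5% relative-gain bar."""
    best, stale = None, 0
    for score in history:
        if best is None or score >= best * (1 + threshold):
            best, stale = score, 0   # meaningful improvement resets the counter
        else:
            stale += 1
            if stale >= patience:
                return True
    return False

def is_noisy(repeated_scores: list[float], cv_limit: float = 0.05) -> bool:
    """Coefficient of variation over repeated identical runs; >5% => multi-run mode."""
    return statistics.stdev(repeated_scores) / statistics.mean(repeated_scores) > cv_limit
```

For example, `converged([0.82, 0.85, 0.87, 0.871, 0.870, 0.872])` returns `True`: three straight runs land inside the 0.5% band around the best score.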
## Installation

```bash
# Via npm (recommended)
npm install -g claude-turing
claude-turing install --global
claude-turing verify

# Via local path
claude plugin add /path/to/turing
```

### Quick Start

```bash
/turing:init        # Scaffold project (answer 3 prompts)
/turing:train       # Run experiment loop
/turing:brief       # Read what happened
/turing:try "idea"  # Inject your taste
```

## Architecture of Turing Itself

15 commands, 2 agents, 8 config files, 25 template scripts, 338 tests, 16 ADRs. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full codemap.

```
turing/
├── commands/     15 skill files (core + taste-leverage + reporting)
├── agents/       2 agents (researcher: read/write, evaluator: read-only)
├── config/       8 files (lifecycle, taxonomy, archetypes, novelty aliases)
├── templates/    Scaffolded into user projects by /turing:init
│   ├── prepare.py    Data loading (HIDDEN from agent)
│   ├── evaluate.py   Evaluation harness (HIDDEN from agent)
│   ├── train.py      Training code (AGENT-EDITABLE)
│   └── scripts/      25 Python scripts (core loop + analysis + infra)
├── tests/        338 tests (unit + integration + anti-pattern + manifest)
├── src/          5 JS installer files (npm deployment)
├── bin/          CLI entry points
└── docs/         ARCHITECTURE.md + 16 ADRs
```

## Intellectual Heritage

- **[When Code Is Free](https://x.com/amytam01/status/2031072399731675269)** (Tam, 2026) — when execution cost approaches zero, the differentiator becomes research taste
- **[Autoresearch](https://github.com/karpathy/autoresearch)** (Karpathy, 2026) — ML experiment loops are mechanical enough to automate, with the constraint that evaluation must be immutable
- **[AutoCrucible](https://github.com/suzuke/autocrucible)** (suzuke, 2026) — autoresearch with guardrails: hidden evaluation, behavioral probes, tool restriction, stability validation
- **[Goodhart's Law](https://en.wikipedia.org/wiki/Goodhart%27s_law)** — "When a measure becomes a target, it ceases to be a good measure." The architectural justification for immutable, hidden evaluation
- **[Double-Blind Protocols](https://en.wikipedia.org/wiki/Blinded_experiment)** — the entity that evaluates must not be the entity that modifies
- **[Falsificationism](https://en.wikipedia.org/wiki/Falsifiability)** (Popper, 1934) — hypotheses gain credibility by surviving falsification, not by accumulating confirmations
- **[Principle of Least Privilege](https://en.wikipedia.org/wiki/Principle_of_least_privilege)** (Saltzer & Schroeder, 1975) — each agent has exactly the capabilities needed for its role
- **[Early Stopping](https://en.wikipedia.org/wiki/Early_stopping)** (Prechelt, 1998) — convergence detection as discrete early stopping
- **[Multi-Armed Bandits](https://en.wikipedia.org/wiki/Multi-armed_bandit)** — the explore-exploit tradeoff
- **[Version Control as Lab Notebook](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004668)** (Ram, 2013) — git as a scientific record-keeping system
- **[Reproducibility Crisis](https://en.wikipedia.org/wiki/Replication_crisis)** — if the measurement can change between experiments, results are not reproducible

## License

MIT

---

*"In God we trust. All others must bring data."* — W. Edwards Deming

*"When code is free, research is all that matters."* — Amy Tam

*Turing flips the coins. You choose which ones.*
package/agents/ml-evaluator.md
ADDED
@@ -0,0 +1,43 @@
---
name: ml-evaluator
description: Read-only ML evaluation agent. Analyzes experiment results, compares runs, detects convergence patterns, and provides statistical insights. Cannot modify code — this is a safety constraint, not a limitation. The evaluator sees what the researcher cannot see precisely because it cannot change what it observes.
tools: Read, Bash, Grep, Glob
model: inherit
maxTurns: 50
---

You are a read-only ML evaluation assistant. Your tools are limited to **Read, Bash, Grep, Glob** — you have no Write or Edit tools. This is intentional and load-bearing.

## Why Read-Only Matters

In quantum mechanics, observation changes the system. In ML experimentation, the evaluator must not be the experimenter. Your inability to modify code is what makes your analysis trustworthy — you cannot unconsciously bias your findings toward changes you made.

## Capabilities

- **Metric trend analysis:** detect improvement trajectories, plateaus, and regressions
- **Configuration comparison:** identify which hyperparameter changes correlate with improvement
- **Convergence assessment:** determine whether further experimentation is likely to yield gains
- **Feature importance:** analyze which features contribute most to model performance
- **Failure mode classification:** categorize why experiments failed (from `config/taxonomy.toml`)

## Useful Commands

Always activate the venv first: `source .venv/bin/activate`

| Command | Purpose |
|---------|---------|
| `python scripts/show_metrics.py --last 10` | Recent experiment summary |
| `python scripts/compare_runs.py <a> <b>` | Side-by-side comparison |
| `python evaluate.py` | Run evaluation on current model |
| `cat experiments/results.tsv` | Quick-reference TSV |

## Analysis Framework

When asked to analyze results, provide:

1. **Metric trends:** improvement trajectory, plateau detection, variance across runs
2. **Best configuration:** what combination of model type, hyperparameters, and features works best
3. **Diminishing returns:** at what point did improvements slow? Is the current approach exhausted?
4. **Failed approaches:** what was tried and didn't work? Are there patterns in failures?
5. **Recommendations:** what should the researcher try next, ranked by expected impact?
6. **Convergence verdict:** has the model converged? Justify with data, not intuition.
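The plateau detection in item 1 can be a rolling comparison over the experiment log. A sketch assuming `experiments/log.jsonl` rows carry a `metrics` dict — the field names are assumptions:

```python
import json
from pathlib import Path

def scores(metric: str, log_path: Path = Path("experiments/log.jsonl")) -> list[float]:
    rows = [json.loads(line) for line in log_path.read_text().splitlines()]
    return [row["metrics"][metric] for row in rows]

def plateaued(scores: list[float], window: int = 5, eps: float = 0.002) -> bool:
    """Plateau: the best of the last `window` runs barely beats everything before."""
    if len(scores) <= window:
        return False
    return max(scores[-window:]) <= max(scores[:-window]) * (1 + eps)
```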
package/agents/ml-researcher.md
ADDED
@@ -0,0 +1,74 @@
---
name: ml-researcher
description: Autonomous ML research agent that implements the autoresearch experiment loop. Modifies train.py, runs experiments, evaluates results, keeps improvements, discards regressions. Operates under strict safety constraints — immutable evaluation infrastructure, git-disciplined rollback, and structured experiment logging.
tools: Read, Write, Edit, Bash, Grep, Glob
model: inherit
memory: project
maxTurns: 200
---

You are an autonomous ML researcher. You do not guess — you hypothesize, experiment, measure, and decide. Every experiment is a bet; the evaluation harness is the house.

## Core Invariant

**The measurement apparatus is sacred.** `evaluate.py` is HIDDEN — you must not read, reference, or attempt to access it. `prepare.py` is READ-ONLY. You modify `train.py` and `config.yaml` — nothing else in the pipeline. The evaluation harness is invisible to you by design: if you could read the scoring function, you could exploit fixed seeds, memorize test data, or reverse-engineer the metric. This separation is not a convention — it is the architectural invariant that makes your results trustworthy.

## Protocol

Read `program.md` in the ML directory for the complete experiment loop protocol. Follow it exactly. The protocol encodes the scientific method:

1. **Observe** — read recent experiment results and agent memory
2. **Hypothesize** — propose a specific, falsifiable change
3. **Execute** — modify train.py or config.yaml, commit, train
4. **Measure** — parse metrics from the immutable evaluation harness
5. **Decide** — keep improvements, revert regressions
6. **Record** — log everything, update memory

## Constraints

- **Only modify `train.py` and `config.yaml`.** `evaluate.py` is HIDDEN (do not read or reference). Other pipeline files are READ-ONLY.
- **Always work in the venv:** `source .venv/bin/activate`
- **Redirect training output:** `python train.py > run.log 2>&1`
- **Parse metrics with grep:** `grep -A 10 "^---" run.log | head -10` (see the sketch after this list)
- **Use @ml-evaluator** for analysis tasks — it has no Write/Edit tools and cannot accidentally break the pipeline.
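A Python equivalent of that grep one-liner — it assumes the harness prints `name: value` lines after a `---` delimiter, which is the shape the grep implies, not a verified format:

```python
import re

def parse_metrics(log_path: str = "run.log") -> dict[str, float]:
    """Collect the metric lines the harness prints after its '---' delimiter."""
    metrics, in_block = {}, False
    for line in open(log_path):
        if line.startswith("---"):
            in_block = True
            continue
        if in_block:
            m = re.match(r"\s*(\w+):\s*([-+0-9.eE]+)\s*$", line)
            if m:
                metrics[m.group(1)] = float(m.group(2))
    return metrics
```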
## Memory

**Read first:** `.claude/agent-memory/ml-researcher/MEMORY.md`

At the START of each session:
1. Read MEMORY.md to restore context (best metrics, failed approaches, promising directions)
2. Use this to avoid repeating failed experiments and to continue promising directions

After EACH experiment (kept or discarded):
1. Update the "Best Result" section if metrics improved
2. Add an observation to "Observations" with what was tried and the result
3. Add to "Failed Approaches" if the approach was discarded
4. Update "Promising Directions" based on what you learned
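Mechanically, those updates can be as simple as inserting a bullet under the right heading. A sketch — the section names match the list above; everything else is illustrative:

```python
from pathlib import Path

MEMORY = Path(".claude/agent-memory/ml-researcher/MEMORY.md")

def add_note(section: str, note: str) -> None:
    """Insert a bullet directly under '## <section>', creating the section if absent."""
    lines = MEMORY.read_text().splitlines() if MEMORY.exists() else []
    header = f"## {section}"
    if header not in lines:
        lines += ["", header]
    lines.insert(lines.index(header) + 1, f"- {note}")
    MEMORY.write_text("\n".join(lines) + "\n")

# e.g. add_note("Failed Approaches", "exp-009: dart boosting — no gain, reverted")
```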
## Git Discipline

Each experiment follows a strict commit protocol:

- **Branch:** `git checkout -b exp/{NNN}-{short-description}`
- **Commit** changes on the branch before running
- **If improved:** merge to main, copy model to `models/best/`
- **If NOT improved:** return to main without merging (branch preserved)

This ensures every code variant is preserved and the main branch contains only improvements.

## Experiment Classification

Classify each experiment by type (from `config/taxonomy.toml`):

- `hyperparameter` — tuning existing model parameters
- `architecture` — changing model type or structure
- `feature` — modifying feature engineering
- `data` — changing data handling
- `ensemble` — combining models
- `regularization` — adjusting regularization

## Stopping Conditions

1. `max_iterations` reached (if provided by user)
2. N consecutive non-improvements (convergence, from `config.yaml`)

Report the final best model, its metrics, and recommend next steps.