claude-turing 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude-plugin/plugin.json +34 -0
- package/LICENSE +21 -0
- package/README.md +457 -0
- package/agents/ml-evaluator.md +43 -0
- package/agents/ml-researcher.md +74 -0
- package/bin/cli.js +46 -0
- package/bin/turing-init.sh +57 -0
- package/commands/brief.md +83 -0
- package/commands/compare.md +24 -0
- package/commands/design.md +97 -0
- package/commands/init.md +123 -0
- package/commands/logbook.md +51 -0
- package/commands/mode.md +43 -0
- package/commands/poster.md +89 -0
- package/commands/preflight.md +75 -0
- package/commands/report.md +97 -0
- package/commands/rules/loop-protocol.md +91 -0
- package/commands/status.md +24 -0
- package/commands/suggest.md +95 -0
- package/commands/sweep.md +45 -0
- package/commands/train.md +66 -0
- package/commands/try.md +63 -0
- package/commands/turing.md +54 -0
- package/commands/validate.md +34 -0
- package/config/defaults.yaml +45 -0
- package/config/experiment_archetypes.yaml +127 -0
- package/config/lifecycle.toml +31 -0
- package/config/novelty_aliases.yaml +107 -0
- package/config/relationships.toml +125 -0
- package/config/state.toml +24 -0
- package/config/task_taxonomy.yaml +110 -0
- package/config/taxonomy.toml +37 -0
- package/package.json +54 -0
- package/src/claude-md.js +55 -0
- package/src/install.js +107 -0
- package/src/paths.js +20 -0
- package/src/postinstall.js +22 -0
- package/src/verify.js +109 -0
- package/templates/MEMORY.md +36 -0
- package/templates/README.md +93 -0
- package/templates/__pycache__/evaluate.cpython-314.pyc +0 -0
- package/templates/__pycache__/prepare.cpython-314.pyc +0 -0
- package/templates/config.yaml +48 -0
- package/templates/evaluate.py +237 -0
- package/templates/features/__init__.py +0 -0
- package/templates/features/__pycache__/__init__.cpython-314.pyc +0 -0
- package/templates/features/__pycache__/featurizers.cpython-314.pyc +0 -0
- package/templates/features/featurizers.py +138 -0
- package/templates/prepare.py +171 -0
- package/templates/program.md +216 -0
- package/templates/pyproject.toml +8 -0
- package/templates/requirements.txt +8 -0
- package/templates/scripts/__init__.py +0 -0
- package/templates/scripts/__pycache__/__init__.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/check_convergence.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/classify_task.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/critique_hypothesis.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/experiment_index.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/generate_brief.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/generate_logbook.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/log_experiment.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/manage_hypotheses.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/novelty_guard.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/parse_metrics.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/scaffold.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/show_experiment_tree.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/show_families.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/statistical_compare.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/suggest_next.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/sweep.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/synthesize_decision.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/turing_io.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/update_state.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/verify_placeholders.cpython-314.pyc +0 -0
- package/templates/scripts/check_convergence.py +230 -0
- package/templates/scripts/compare_runs.py +124 -0
- package/templates/scripts/critique_hypothesis.py +350 -0
- package/templates/scripts/experiment_index.py +288 -0
- package/templates/scripts/generate_brief.py +389 -0
- package/templates/scripts/generate_logbook.py +423 -0
- package/templates/scripts/log_experiment.py +243 -0
- package/templates/scripts/manage_hypotheses.py +543 -0
- package/templates/scripts/novelty_guard.py +343 -0
- package/templates/scripts/parse_metrics.py +139 -0
- package/templates/scripts/post-train-hook.sh +74 -0
- package/templates/scripts/preflight.py +549 -0
- package/templates/scripts/scaffold.py +409 -0
- package/templates/scripts/show_environment.py +92 -0
- package/templates/scripts/show_experiment_tree.py +144 -0
- package/templates/scripts/show_families.py +133 -0
- package/templates/scripts/show_metrics.py +157 -0
- package/templates/scripts/statistical_compare.py +259 -0
- package/templates/scripts/stop-hook.sh +34 -0
- package/templates/scripts/suggest_next.py +301 -0
- package/templates/scripts/sweep.py +276 -0
- package/templates/scripts/synthesize_decision.py +300 -0
- package/templates/scripts/turing_io.py +76 -0
- package/templates/scripts/update_state.py +296 -0
- package/templates/scripts/validate_stability.py +167 -0
- package/templates/scripts/verify_placeholders.py +119 -0
- package/templates/sweep_config.yaml +14 -0
- package/templates/tests/__init__.py +0 -0
- package/templates/tests/conftest.py +91 -0
- package/templates/train.py +240 -0
package/commands/preflight.md
ADDED

@@ -0,0 +1,75 @@
---
name: preflight
description: Pre-flight resource check — estimates VRAM, RAM, and disk requirements before running ML training. Compares against available system resources and issues a PASS/WARN/FAIL verdict. Use before training to catch OOM errors before they happen.
disable-model-invocation: true
argument-hint: "[--model-type torch] [--params 10M] [--batch-size 32]"
allowed-tools: Read, Bash(python scripts/*:*, source .venv/bin/activate:*, nvidia-smi:*), Grep, Glob
---

Check whether the current system has enough resources to run the planned experiment.

## Steps

1. **Activate environment:**
   ```bash
   source .venv/bin/activate
   ```

2. **Run preflight check:**

   If `$ARGUMENTS` is empty (auto-detect from config.yaml):
   ```bash
   python scripts/preflight.py
   ```

   If `$ARGUMENTS` contains flags:
   ```bash
   python scripts/preflight.py $ARGUMENTS
   ```

3. **Interpret the verdict:**

   - **PASS** — the system has sufficient resources. Proceed with training.
   - **WARN** — resources are tight. Training may succeed but could be slow or unstable. Present the warnings to the user and ask whether to proceed.
   - **FAIL** — training will likely fail (OOM, disk full, no GPU for a GPU-required model). Present the specific resource gap and suggest mitigations:
     - RAM too low: reduce dataset size, use chunked loading, or add swap
     - VRAM too low: reduce batch size, use fp16/bf16, enable gradient checkpointing, or use a smaller model
     - Disk too low: clean up old models/checkpoints
     - No GPU: switch to a CPU-friendly model (XGBoost, LightGBM, sklearn)

4. **If running before `/turing:train`:** report the verdict so the human can decide whether to proceed, adjust the config, or choose a different model type.

## Examples

```bash
# Auto-detect from config.yaml (works for Turing projects)
/turing:preflight

# Check for a specific model type
/turing:preflight --model-type transformer --params 350M --batch-size 16 --precision fp16

# Check with a specific dataset
/turing:preflight --model-type xgboost --dataset data/train.csv

# JSON output for scripting
/turing:preflight --json
```

## What It Checks

| Resource | How estimated | Warning threshold |
|----------|---------------|-------------------|
| **RAM** | Dataset size (4x CSV on disk) + model memory (tree nodes or param count) | >90% of available |
| **VRAM** | Model params + gradients + optimizer state + activations | >80% of largest GPU |
| **Disk** | Model artifacts + dataset + checkpoints | >50% of free space |
| **GPU presence** | torch.cuda or nvidia-smi | Required for neural nets >1GB VRAM |

## Model-Specific Estimates

| Model Type | RAM | VRAM | GPU Required? |
|------------|-----|------|---------------|
| XGBoost/LightGBM | Trees + data (typically <4GB) | 0 | No |
| Random Forest | Trees + data (can be large) | 0 | No |
| Linear/Logistic | 2x data | 0 | No |
| MLP (small) | Data + params | Params x 4 (Adam) | If >1GB VRAM |
| Transformer | Data + params | Params x 4 + activations | Yes |
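The "Params x 4 (Adam)" rule of thumb in the tables above can be sketched numerically. This is an illustrative floor estimate, not the actual logic of `scripts/preflight.py`: weights, gradients, and Adam's two moment buffers each cost one copy of the parameters, and activations (workload-dependent) come on top.

```python
def estimate_vram_gb(params: int, bytes_per_param: int = 4) -> float:
    """Floor VRAM estimate for Adam training: weights + gradients + two moments.

    Activations are excluded, so treat the result as a lower bound.
    """
    copies = 4  # weights, gradients, exp_avg, exp_avg_sq
    return params * bytes_per_param * copies / 1e9

# A 350M-parameter transformer in fp32 needs at least ~5.6 GB before activations;
# fp16 (bytes_per_param=2) halves that floor.
print(round(estimate_vram_gb(350_000_000), 1))
```

This is why the FAIL mitigations lead with precision reduction and batch-size cuts: precision scales the whole floor, while batch size only affects the activation term excluded here.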
package/commands/report.md
ADDED

@@ -0,0 +1,97 @@
---
name: report
description: Generate a markdown research report from experiment history — structured for sharing, archiving, or including in documentation. More detailed than a brief, less visual than a poster.
disable-model-invocation: true
argument-hint: "[--since YYYY-MM-DD] [--output path]"
allowed-tools: Read, Bash(python scripts/*:*, source .venv/bin/activate:*, mkdir:*), Grep, Glob
---

Generate a structured markdown research report summarizing the experiment campaign.

## Steps

### 1. Generate the Report

Use the logbook generator in markdown mode as the data backbone:

```bash
source .venv/bin/activate && python scripts/generate_logbook.py --format markdown
```

Also gather supplementary data:
```bash
source .venv/bin/activate && python scripts/generate_brief.py
cat experiment_state.yaml 2>/dev/null || true
cat RESEARCH_PLAN.md 2>/dev/null || true
```

### 2. Enhance with Analysis

The logbook generator produces raw data. Enhance it with your analysis to create a proper report. Add these sections that the script doesn't generate:

- **Executive Summary** (2-3 sentences): What was the task? What's the best result? Is it good enough?
- **Approach:** Describe the methodology — autoresearch loop, evaluation strategy, search strategy used
- **Key Findings:** Synthesize patterns from the experiment log:
  - Which model families outperformed others?
  - What hyperparameter ranges work vs don't?
  - Were there surprising results?
  - What failure patterns emerged?
- **Recommendations:** Based on the findings, what should be tried next? What should be avoided?
- **Limitations:** What wasn't explored? What constraints affected the results?

### 3. Output

If `$ARGUMENTS` contains `--output <path>`:
```bash
mkdir -p $(dirname <path>)
```
Write the report to the specified path.

Otherwise, display the report directly.

**Common usage:**
```
/turing:report --output reports/campaign-v1.md
/turing:report --since 2026-03-15 --output reports/week-12.md
```

## Report Structure

```markdown
# Research Report: <task description>
Generated: <date>

## Executive Summary
<2-3 sentences>

## Methodology
<approach, evaluation strategy, convergence criteria>

## Campaign Summary
<table: experiments, keep rate, best metric, timespan>

## Improvement Trajectory
<table: experiment-by-experiment metric progression>

## Key Findings
<synthesized patterns from experiment history>

## Model Comparison
<table: model families, experiments per family, best metric, keep rate>

## Hypothesis Analysis
<what was proposed, by whom, what worked>

## Recommendations
<concrete next steps>

## Limitations
<what wasn't tried, constraints>
```

## When to Use

- End of a research campaign for archiving
- Before a team review or status update
- To document findings for a paper or thesis
- To hand off a project to another researcher
package/commands/rules/loop-protocol.md
ADDED

@@ -0,0 +1,91 @@
# Autoresearch Loop Protocol Rules

These rules govern the autonomous ML experiment loop. They are non-negotiable safety constraints that preserve the integrity of the experimental process.

## The Fundamental Separation

The autoresearch harness enforces a strict separation between the **hypothesis space** (what the agent can change) and the **measurement apparatus** (how results are evaluated). This separation is the architectural invariant that makes autonomous experimentation trustworthy.

| Layer | Files | Agent Access | Rationale |
|-------|-------|--------------|-----------|
| Hidden | `evaluate.py` | NONE — do not read, write, or reference | Reading evaluation code enables seed exploitation and metric gaming |
| Measurement | `prepare.py` | READ-ONLY | Data loading is visible but immutable |
| Hypothesis | `train.py` | READ-WRITE | All experimental changes go here |
| Configuration | `config.yaml` | READ-WRITE | Hyperparameter changes without code changes |
| Features | `features/featurizers.py` | READ-ONLY | Modify how `train.py` *uses* featurizers instead |

## Execution Rules

- **ALWAYS redirect training output:** `python train.py > run.log 2>&1`
- **ALWAYS parse metrics with grep** between `---` delimiters: `grep -A 10 "^---" run.log | head -10`
- **ALWAYS activate the venv first:** `source .venv/bin/activate`
- **NEVER install new packages** without human approval

## Git Discipline

### Per-Experiment Branches (preferred)

- **Create branch before each experiment:** `git checkout -b exp/{NNN}-{short-description}`
- **Commit changes on the branch:** `git commit -am "exp: {description}"`
- **Run the experiment on the branch**
- **If improved:** `git checkout main && git merge exp/{NNN}-{short-description}`. Copy model to `models/best/`.
- **If NOT improved:** `git checkout main`. Branch preserved for comparison.
- **Keep all experiment branches** — they preserve code variants for later analysis.

### Fallback: Commit/Revert (mid-sweep)

- **ALWAYS commit before running:** `git commit -am "exp: {description}"`
- **If improved:** keep commit, copy model to `models/best/`
- **If NOT improved:** `git reset --hard HEAD~1`

## Sweep Workflow

1. Generate queue: `python scripts/sweep.py`
2. Check status: `python scripts/sweep.py --status`
3. Get next: `python scripts/sweep.py --next`
4. Apply overrides, create branch, run training
5. Mark: `python scripts/sweep.py --mark <name> complete|failed`
6. Repeat until queue is empty

## Logging Rules

- **Log every experiment** to `experiments/log.jsonl` via `python scripts/log_experiment.py` — kept and discarded alike.
- **Include all metrics, config, and description** of the hypothesis and its outcome.

## Convergence Rules

- **N consecutive non-improvements** (from `config.yaml` `convergence.patience`) with less than threshold relative gain = STOP.
- **max_iterations** (if provided) overrides convergence.
- **Always report** final best model, metrics, and recommended next steps when stopping.

## Tool Restrictions

The researcher agent's Bash access is restricted to a whitelist of necessary commands:

| Allowed Pattern | Purpose |
|-----------------|---------|
| `python train.py:*` | Execute training |
| `python scripts/*:*` | Run utility scripts (logging, metrics, sweep) |
| `git:*` | Branch, commit, merge, reset operations |
| `source .venv/bin/activate:*` | Virtual environment activation |
| `pip:*` | Package installation (requires human approval) |

**Blocked by omission:** `cat`, `head`, `tail`, `less` (prevents reading hidden files via shell); `curl`, `wget` (prevents data exfiltration); arbitrary command execution.

The agent's Read tool is separately governed by the file access tiers above — hidden files are denied at the tool level.

## Reproducibility Rules

Every experiment must be fully reproducible. The training template handles this automatically, but the agent must not subvert it:

- **NEVER use unseeded randomness.** All random state flows from `config.yaml → data.random_state`. The `pin_all_seeds()` function in `train.py` sets stdlib `random`, `numpy`, `PYTHONHASHSEED`, and `torch`/`cuda` seeds from this single source.
- **NEVER modify seeds mid-experiment.** If you need a different seed, use the `--seed` flag for multi-run comparison (Phase 2.1). Do not hardcode seeds in `train.py`.
- **Environment is captured automatically.** `train_metadata.json` records the Python version, package versions, platform, GPU info, and a config hash. Do not modify this recording — it's used by behavioral probes.
- **Config snapshot:** The config at training time is stored inside the model artifact (`model.joblib` contains the full config dict). For any saved model, the exact configuration can be recovered.
- **If adding new dependencies** (requires human approval), note that the environment capture in `train_metadata.json` will automatically record the new package version.

## Safety

- Do not modify files outside the ML project directory.
- Do not delete experiment logs or model archives.
- If something breaks unexpectedly, stop and report — do not auto-fix evaluation infrastructure.
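The convergence rule above reads as pseudocode almost directly. A minimal sketch — the real check lives in `scripts/check_convergence.py`, and the function name and defaults here are illustrative:

```python
def converged(metrics, patience=3, min_rel_gain=0.001):
    """Stop when `patience` consecutive results each fail to beat the
    running best by at least `min_rel_gain` relative improvement."""
    best, stalls = metrics[0], 0
    for m in metrics[1:]:
        if (m - best) / abs(best) >= min_rel_gain:
            best, stalls = m, 0   # real improvement resets the stall counter
        else:
            stalls += 1
            if stalls >= patience:
                return True
    return False

# Three sub-threshold results in a row after the jump to 0.85 trigger the stop.
print(converged([0.80, 0.85, 0.8502, 0.8503, 0.8501]))
```

Note that gains are measured against the running best, not the previous run — a sequence of tiny steps that never individually clears the threshold still counts as a plateau.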
package/commands/status.md
ADDED

@@ -0,0 +1,24 @@
---
name: status
description: Show current ML experiment status — best model, recent experiments, convergence state, and trend analysis. Delegates to @ml-evaluator for read-only safety.
disable-model-invocation: true
allowed-tools: Read, Bash(*), Grep, Glob
---

Show the current state of the ML training pipeline. This is an observation-only operation — no code is modified.

## Steps

1. **Run metrics display:**
   ```bash
   source .venv/bin/activate && python scripts/show_metrics.py --last 10
   ```

2. **Summarize for the user:**
   - **Best model:** type, key metrics, experiment ID
   - **Total experiments:** count from the log
   - **Convergence state:** consecutive non-improvements vs patience threshold
   - **Trend:** improving, plateauing, or regressing?
   - **Recommendation:** continue training, try a different approach, or declare convergence

3. **If no experiments exist:** report that the pipeline is ready but untrained. Suggest `/turing:train`.
package/commands/suggest.md
ADDED

@@ -0,0 +1,95 @@
---
name: suggest
description: Literature-grounded model selection. Reads the ML task context, searches recent literature, and suggests model architectures worth trying — with citations. Suggestions are auto-queued as hypotheses.
disable-model-invocation: true
argument-hint: "[task description override]"
allowed-tools: Read, Write, Bash(python scripts/*:*, source .venv/bin/activate:*), Grep, Glob, WebSearch, WebFetch
---

Suggest model architectures for the current ML task, grounded in recent literature. Hypotheses backed by papers, not vibes.

## Steps

### 1. Understand the Task

Read the project config and recent experiment history to understand the task:

```bash
cat config.yaml
```

```bash
source .venv/bin/activate && python scripts/show_metrics.py --last 10 2>/dev/null || echo "No experiments yet"
```

If `$ARGUMENTS` is provided, use it as the task description. Otherwise, infer the task from `config.yaml` (model type, primary metric, data source, target column).

From the config and any task description, identify the key task properties:
- Data type (tabular, time series, image, text, etc.)
- Objective (classification, regression, generation, etc.)
- Special constraints (imbalanced classes, small dataset, real-time, interpretability, etc.)
- Current model family and what's been tried

### 2. Search Literature

Use `WebSearch` to find recent papers and benchmark results. Run 3-5 searches targeting:

1. **Model comparisons for this task type:** e.g., "best models for tabular classification benchmark 2024"
2. **Alternatives to the current model:** e.g., "LightGBM vs XGBoost vs CatBoost tabular data"
3. **Task-specific techniques:** e.g., "handling class imbalance gradient boosting"

For each search, use `WebFetch` on the top 1-2 results to extract specific model recommendations, benchmark numbers, and methodology.

Focus on:
- Recent work (2023-2026) with empirical comparisons
- Benchmark studies and surveys
- arXiv papers or reputable ML blogs with concrete results

### 3. Synthesize Suggestions

From the literature, synthesize **3-5 concrete model architecture suggestions**. Each must include:

- **Model architecture:** specific (e.g., "LightGBM with GOSS sampling", not "try a different model")
- **Why:** a one-sentence rationale grounded in what the literature says
- **Citation:** the paper or source that supports it
- **Expected impact:** high/medium/low, based on how well it fits this task
- **Implementation hint:** what to change in `train.py` (one concrete line)

### 4. Queue as Hypotheses

For each suggestion, add to the hypothesis queue:

```bash
source .venv/bin/activate && python scripts/manage_hypotheses.py add "<model>: <rationale> (source: <citation>)" --priority medium --source literature
```

### 5. Show Results

```
Literature-Grounded Model Suggestions
======================================

Task: <task description>
Current: <current model> (<current metric>=<value>)
Sources consulted: <N papers/articles>

1. [HIGH] <technique>
   Why: <one-sentence rationale with citation>
   Source: <URL>
   Change: <specific train.py change>
   → Queued as hyp-NNN

2. [MEDIUM] ...

Queued N hypotheses. Run /turing:train to test them.
```

## Fallback

If web search returns insufficient results, suggest model families from `config/taxonomy.toml` based on what hasn't been tried yet. Note that the suggestions are taxonomy-based, not literature-backed, and queue them with `--source taxonomy`.

## Integration

- Suggestions feed into `hypotheses.yaml` — the next `/turing:train` picks them up
- `/turing:brief` shows queued literature-sourced hypotheses
- The human can override priority: `/turing:try` always takes precedence
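The queue entries created in step 4 can be pictured as small records. The field names below are illustrative only — the actual schema is owned by `scripts/manage_hypotheses.py` and `hypotheses.yaml`:

```python
def make_hypothesis(hyp_id: int, description: str,
                    priority: str = "medium", source: str = "literature") -> dict:
    """Hypothetical shape of one queued hypothesis entry."""
    return {
        "id": f"hyp-{hyp_id:03d}",   # matches the hyp-NNN IDs shown in the output
        "description": description,
        "priority": priority,
        "source": source,            # literature | taxonomy | human
        "status": "queued",
    }

h = make_hypothesis(7, "LightGBM with GOSS sampling (source: arXiv survey)")
print(h["id"], h["priority"], h["status"])  # hyp-007 medium queued
```

Keeping `source` on every record is what lets `/turing:brief` distinguish literature-backed suggestions from taxonomy fallbacks and human injections.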
package/commands/sweep.md
ADDED

@@ -0,0 +1,45 @@
---
name: sweep
description: Generate and run a systematic hyperparameter sweep. Computes the cartesian product of the configured parameter ranges and processes the queue sequentially with full experiment logging.
disable-model-invocation: true
argument-hint: "[sweep_config.yaml]"
allowed-tools: Read, Write, Edit, Bash(python train.py:*, python scripts/*:*, git:*, source .venv/bin/activate:*, pip:*), Grep, Glob
---

Run a systematic hyperparameter sweep using the sweep configuration.

## Steps

1. **Activate environment:**
   ```bash
   source .venv/bin/activate
   ```

2. **Resolve config:** Use `$ARGUMENTS` as the sweep config path, or default to `sweep_config.yaml`.

3. **Generate queue** (if not already generated):
   ```bash
   python scripts/sweep.py [sweep_config.yaml]
   ```

4. **Check queue status:**
   ```bash
   python scripts/sweep.py --status
   ```

5. **Process queue sequentially:**
   - Get next: `python scripts/sweep.py --next`
   - Apply config overrides to `config.yaml`
   - Create experiment branch: `git checkout -b exp/NNN-description`
   - Run training: `python train.py > run.log 2>&1`
   - Parse metrics: `grep -A 10 "^---" run.log | head -10`
   - Log the experiment
   - Mark complete: `python scripts/sweep.py --mark <name> complete`
   - If improved, merge to main. If not, return to main.
   - Repeat until queue is empty

6. **Report** final results with the best configuration found.

## Rules

Follow the same safety constraints as `/turing:train` — see `rules/loop-protocol.md`.
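The cartesian-product expansion the sweep performs can be sketched in a few lines. Parameter names are illustrative; `scripts/sweep.py` owns the real queue format:

```python
from itertools import product

def expand_sweep(ranges: dict) -> list:
    """Expand {param: [values]} into one config-override dict per combination."""
    keys = list(ranges)
    return [dict(zip(keys, combo)) for combo in product(*ranges.values())]

queue = expand_sweep({"learning_rate": [0.01, 0.1], "max_depth": [4, 8, 12]})
print(len(queue))   # 2 x 3 = 6 queued configurations
print(queue[0])     # {'learning_rate': 0.01, 'max_depth': 4}
```

Queue size grows multiplicatively with each added range, which is why the workflow processes sequentially and logs every run rather than launching everything at once.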
package/commands/train.md
ADDED

@@ -0,0 +1,66 @@
---
name: train
description: Run the autonomous ML experiment loop. Iteratively hypothesizes, trains, evaluates, and decides — keeping only improvements. Implements the autoresearch pattern with formal convergence detection and git-disciplined rollback.
disable-model-invocation: true
argument-hint: "[max_iterations]"
allowed-tools: Read, Write, Edit, Bash(python train.py:*, python scripts/*:*, git:*, source .venv/bin/activate:*, pip:*), Grep, Glob
---

You are an autonomous ML researcher. Your goal: iteratively improve a model by following the experiment loop protocol — the scientific method applied to machine learning.

Read `program.md` in the ML project directory for the complete protocol. Follow it exactly.

## Arguments

`$ARGUMENTS` — if a number, use as max_iterations (stop after N experiments). If empty, run until convergence (as defined in `config.yaml` convergence settings).

## Bootstrap Sequence

0. **Restore memory:** Read `.claude/agent-memory/ml-researcher/MEMORY.md` for prior observations and best results.
1. **Read protocol:** Read `program.md` completely — it defines the experiment loop, constraints, and output format.
2. **Bootstrap data:** Check for training data at `config.yaml` → `data.source`. If no splits exist, run `python prepare.py`.
3. **Bootstrap venv:** `test -d .venv || (python3 -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt)`
4. **Assess state:** `source .venv/bin/activate && python scripts/show_metrics.py --last 5`
5. **Begin the loop** from `program.md`.

## The Loop

Each iteration follows the experiment lifecycle (`config/lifecycle.toml`):

```
proposed -> running -> evaluating -> kept/discarded -> (next iteration)
```

The agent proposes a hypothesis, executes it, measures the result against the immutable evaluation harness, and decides whether to keep or discard. Only improvements survive in git history.

## Delegation

Use `@ml-evaluator` for analysis tasks. It is read-only (no Write/Edit) and cannot accidentally modify the pipeline.

## Context Management

- Redirect all training output: `python train.py > run.log 2>&1`
- Parse metrics with grep, never read full output
- Persist observations to MEMORY.md after each experiment

## Convergence

- Stop after `max_iterations` if provided
- Otherwise, stop after N consecutive non-improvements (`config.yaml` → `convergence.patience`)
- Report final best experiment and recommend next steps

## /loop Integration

For fully hands-off training:
```
/loop 5m /turing:train
```

The Stop hook automatically detects convergence and halts the loop. Recommended intervals:
- `3m` — fast iterations, small datasets
- `5m` — standard training runs
- `10m` — deep training with large models

## Rules

See `rules/loop-protocol.md` for safety constraints governing the experiment loop.
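The context-management rule above (redirect output, parse metrics with grep, never read the full log) can also be consumed from Python. A sketch assuming a simple `key: value` block between `---` delimiters — an assumption about the log layout, not the exact output format of `train.py`:

```python
def parse_metrics_block(log_text: str) -> dict:
    """Extract key: value pairs from the first block between --- delimiters."""
    metrics, inside = {}, False
    for line in log_text.splitlines():
        if line.strip() == "---":
            if inside:
                break           # closing delimiter: only the first block counts
            inside = True
        elif inside and ":" in line:
            key, value = line.split(":", 1)
            metrics[key.strip()] = float(value)
    return metrics

log = "epoch 3 done\n---\naccuracy: 0.91\nf1: 0.88\n---\ntail noise\n"
print(parse_metrics_block(log))  # {'accuracy': 0.91, 'f1': 0.88}
```

The delimiter convention keeps the agent's context small: a multi-megabyte `run.log` collapses to a handful of numbers regardless of how verbose training was.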
package/commands/try.md
ADDED

@@ -0,0 +1,63 @@

---
name: try
description: Inject a hypothesis into the agent's experiment queue. This is how research taste reaches the agent: the human selects which coins to flip, the agent flips them.
disable-model-invocation: true
argument-hint: "<hypothesis description>"
allowed-tools: Read, Write, Edit, Bash(python scripts/*:*, source .venv/bin/activate:*), Grep, Glob
---

Inject a human hypothesis into the experiment queue for the next `/turing:train` iteration.

This is the taste-leverage mechanism: you provide the judgment about what's worth trying; the agent provides disciplined execution.

## Steps

1. **Parse the hypothesis** from `$ARGUMENTS`. If it is empty, ask the user what they want the agent to try.

2. **Check for archetype syntax.** If the argument starts with `archetype:`, expand it:

   ```bash
   source .venv/bin/activate && python scripts/manage_hypotheses.py add --archetype <name> --priority high --source human
   ```

   Otherwise, use the raw description:

   ```bash
   source .venv/bin/activate && python scripts/manage_hypotheses.py add "$ARGUMENTS" --priority high --source human
   ```

3. **Confirm** with the hypothesis ID and instructions:
   - "Queued as hyp-NNN (high priority, human-injected)"
   - "The agent will prioritize this on the next `/turing:train` iteration"
   - Show the current queue: `python scripts/manage_hypotheses.py list --status queued`

## Examples

```
# Free-text hypotheses
/turing:try switch to LightGBM with dart boosting and lower learning rate
/turing:try add polynomial features for the numeric columns
/turing:try increase regularization, the train/val gap suggests overfitting

# Archetype-based structured strategies
/turing:try archetype:model_comparison
/turing:try archetype:feature_sweep
/turing:try archetype:ensemble_construction
/turing:try archetype:regularization_search
/turing:try archetype:ablation_study
```

## Available Archetypes

| Archetype | What it does | Expected experiments |
|-----------|--------------|----------------------|
| `model_comparison` | Compare XGBoost, LightGBM, RF, LR, MLP with statistical tests | ~5 |
| `hyperparameter_sweep` | Grid search with multi-seed validation | 15-36 |
| `feature_sweep` | Add/remove feature transforms one at a time | 6-10 |
| `regularization_search` | Binary search for optimal regularization | 4-6 |
| `ensemble_construction` | Voting, stacking, and blending of top models | 4-6 |
| `learning_rate_schedule` | Learning rate vs. n_estimators tradeoff | 4-5 |
| `data_quality_audit` | Class balance, label noise, and leakage checks | 3-5 |
| `ablation_study` | Remove features one at a time to measure importance | N+1 |

## How It Connects

The `/turing:train` loop checks `hypotheses.yaml` during the OBSERVE step. Human-injected hypotheses (high priority) are tried before the agent generates its own. After testing, the hypothesis is marked `tested`, `promising`, or `dead-end`, with a link to the resulting experiment.
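
A queue entry might look like the sketch below. The field names are illustrative assumptions; only the `hyp-NNN` id, the high priority, the human source, and the status values are stated by this document:

```yaml
# Illustrative hypotheses.yaml entry (field names are assumptions)
- id: hyp-007
  description: switch to LightGBM with dart boosting and lower learning rate
  priority: high      # human-injected hypotheses are queued at high priority
  source: human       # as opposed to agent-generated hypotheses
  status: queued      # becomes tested, promising, or dead-end after the run
  experiment: null    # linked to the resulting experiment once tested
```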

package/commands/turing.md
ADDED

@@ -0,0 +1,54 @@

---
name: turing
description: Autonomous ML research harness. Thin router that detects ML training intent and dispatches to focused sub-commands. Each sub-command handles one phase of the experiment lifecycle.
---

You are the Turing ML research router. Detect the user's intent and route to the appropriate sub-command. Do not attempt to handle ML tasks directly; dispatch to the focused skill.

## Routing Table

| User says... | Route to | Lifecycle phase |
|---|---|---|
| "train", "run experiments", "autoresearch", "improve the model", "start training" | `/turing:train` | Execute |
| "status", "how's training", "experiment results", "current metrics" | `/turing:status` | Observe |
| "compare", "diff runs", "which is better" | `/turing:compare` | Analyze |
| "sweep", "grid search", "hyperparameter search", "tune" | `/turing:sweep` | Explore |
| "init", "set up ML", "initialize", "scaffold", "bootstrap" | `/turing:init` | Setup |
| "try", "test this", "inject", "what if we", "I think we should" | `/turing:try` | Steer |
| "brief", "briefing", "what have we learned", "summary" | `/turing:brief` | Report |
| "logbook", "log", "history", "timeline", "narrative" | `/turing:logbook` | Document |
| "poster", "presentation", "one-pager", "visual summary" | `/turing:poster` | Document |
| "report", "write-up", "findings", "document results" | `/turing:report` | Document |
| "validate", "stability", "check variance", "noisy" | `/turing:validate` | Validate |
| "suggest", "what model", "recommend", "which architecture", "literature" | `/turing:suggest` | Research |
| "design", "plan experiment", "how should I test", "experiment design" | `/turing:design` | Design |
| "mode", "explore", "exploit", "replicate", "strategy" | `/turing:mode` | Strategy |
| "preflight", "resources", "VRAM", "memory", "can I run", "OOM", "GPU" | `/turing:preflight` | Check |

## Sub-commands

| Command | Purpose | Agent |
|---|---|---|
| `/turing:train [N]` | Run the autonomous experiment loop | @ml-researcher |
| `/turing:status` | Show experiment status, best model, convergence | @ml-evaluator |
| `/turing:compare <a> <b>` | Side-by-side experiment comparison | @ml-evaluator |
| `/turing:sweep` | Generate and run a hyperparameter sweep | @ml-researcher |
| `/turing:try <hypothesis>` | Inject a hypothesis into the agent's queue | (inline) |
| `/turing:brief` | Generate a structured research intelligence report | @ml-evaluator |
| `/turing:init` | Scaffold a new ML project | (inline) |
| `/turing:validate` | Check metric stability, auto-fix if noisy | (inline) |
| `/turing:suggest` | Literature-grounded model architecture suggestions | (inline, uses WebSearch) |
| `/turing:design <hyp-id>` | Generate a structured experiment design from a hypothesis | (inline, uses WebSearch) |
| `/turing:logbook` | HTML/markdown logbook with trajectory chart | (inline) |
| `/turing:poster` | Single-page HTML research poster | (inline) |
| `/turing:report` | Structured markdown research report | (inline) |
| `/turing:mode <mode>` | Set research strategy (explore/exploit/replicate) | (inline) |
| `/turing:preflight` | Pre-flight resource check (VRAM/RAM/disk) | (inline) |

## Proactive Detection

If you detect ML training intent in the conversation (e.g., "the model accuracy is bad", "we need to improve predictions", "let's try a different model"), suggest the relevant sub-command.

## First-Time Setup

If no ML project is detected (no `config.yaml`, no `train.py`, no `experiments/`), suggest `/turing:init` first.

package/commands/validate.md
ADDED

@@ -0,0 +1,34 @@

---
name: validate
description: Run stability validation on the current experiment configuration. Executes N runs to measure metric variance and auto-configures multi-run evaluation if variance is too high.
disable-model-invocation: true
argument-hint: "[--auto]"
allowed-tools: Read, Bash(*), Grep, Glob
---

Validate the stability of the current ML pipeline by running it multiple times and measuring variance.

## Steps

1. **Activate the environment:**

   ```bash
   source .venv/bin/activate
   ```

2. **Run the stability check:**

   ```bash
   python scripts/validate_stability.py
   ```

3. **If `$ARGUMENTS` contains `--auto`:**

   ```bash
   python scripts/validate_stability.py --auto
   ```

   This auto-writes `evaluation.n_runs: 3` to `config.yaml` if the coefficient of variation (CV) exceeds 5%.

4. **Report results:**
   - **Stable (CV < 5%):** the metric is reliable; single-run evaluation is sufficient
   - **Unstable (CV >= 5%):** the metric has high variance; multi-run evaluation with the median is recommended
   - If `--auto` was used, report what was changed in `config.yaml`

5. **If no training pipeline exists:** suggest `/turing:init` first.
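
The stable/unstable decision amounts to a coefficient-of-variation check over repeated runs. A minimal sketch follows; the function name and return shape are assumptions, not `validate_stability.py`'s actual interface:

```python
import statistics

def stability_report(scores, threshold=0.05):
    """Classify metric stability from repeated runs of one config.

    CV = sample stddev / mean; below `threshold` (5%) the metric is
    treated as stable and a single evaluation run suffices.
    """
    mean = statistics.mean(scores)
    cv = statistics.stdev(scores) / mean
    if cv < threshold:
        return {"cv": cv, "stable": True, "n_runs": 1}
    # High variance: recommend multi-run evaluation with the median,
    # matching the evaluation.n_runs: 3 written by --auto.
    return {"cv": cv, "stable": False, "n_runs": 3}
```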