claude-turing 4.8.0 → 4.8.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude-plugin/plugin.json +1 -1
- package/README.md +1 -1
- package/agents/ml-evaluator.md +4 -4
- package/agents/ml-researcher.md +2 -2
- package/bin/turing-init.sh +2 -2
- package/commands/ablate.md +3 -3
- package/commands/annotate.md +2 -2
- package/commands/archive.md +2 -2
- package/commands/audit.md +3 -3
- package/commands/baseline.md +3 -3
- package/commands/brief.md +5 -5
- package/commands/budget.md +3 -3
- package/commands/calibrate.md +3 -3
- package/commands/card.md +3 -3
- package/commands/changelog.md +2 -2
- package/commands/checkpoint.md +3 -3
- package/commands/cite.md +2 -2
- package/commands/compare.md +1 -1
- package/commands/counterfactual.md +2 -2
- package/commands/curriculum.md +3 -3
- package/commands/design.md +3 -3
- package/commands/diagnose.md +4 -4
- package/commands/diff.md +3 -3
- package/commands/distill.md +3 -3
- package/commands/doctor.md +2 -2
- package/commands/ensemble.md +3 -3
- package/commands/explore.md +4 -4
- package/commands/export.md +3 -3
- package/commands/feature.md +3 -3
- package/commands/flashback.md +2 -2
- package/commands/fork.md +3 -3
- package/commands/frontier.md +3 -3
- package/commands/init.md +5 -5
- package/commands/leak.md +3 -3
- package/commands/lit.md +3 -3
- package/commands/logbook.md +5 -5
- package/commands/merge.md +2 -2
- package/commands/mode.md +1 -1
- package/commands/onboard.md +2 -2
- package/commands/paper.md +3 -3
- package/commands/plan.md +2 -2
- package/commands/poster.md +3 -3
- package/commands/postmortem.md +2 -2
- package/commands/preflight.md +5 -5
- package/commands/present.md +2 -2
- package/commands/profile.md +3 -3
- package/commands/prune.md +2 -2
- package/commands/quantize.md +2 -2
- package/commands/queue.md +3 -3
- package/commands/registry.md +2 -2
- package/commands/regress.md +3 -3
- package/commands/replay.md +2 -2
- package/commands/report.md +3 -3
- package/commands/reproduce.md +3 -3
- package/commands/retry.md +3 -3
- package/commands/review.md +2 -2
- package/commands/rules/loop-protocol.md +11 -11
- package/commands/sanity.md +3 -3
- package/commands/scale.md +4 -4
- package/commands/search.md +2 -2
- package/commands/seed.md +3 -3
- package/commands/sensitivity.md +3 -3
- package/commands/share.md +2 -2
- package/commands/simulate.md +2 -2
- package/commands/status.md +1 -1
- package/commands/stitch.md +3 -3
- package/commands/suggest.md +5 -5
- package/commands/surgery.md +2 -2
- package/commands/sweep.md +8 -8
- package/commands/template.md +2 -2
- package/commands/train.md +5 -5
- package/commands/transfer.md +3 -3
- package/commands/trend.md +2 -2
- package/commands/try.md +4 -4
- package/commands/update.md +2 -2
- package/commands/validate.md +4 -4
- package/commands/warm.md +3 -3
- package/commands/watch.md +4 -4
- package/commands/whatif.md +2 -2
- package/commands/xray.md +3 -3
- package/config/commands.yaml +1 -1
- package/package.json +1 -1
- package/skills/turing/ablate/SKILL.md +3 -3
- package/skills/turing/annotate/SKILL.md +2 -2
- package/skills/turing/archive/SKILL.md +2 -2
- package/skills/turing/audit/SKILL.md +3 -3
- package/skills/turing/baseline/SKILL.md +3 -3
- package/skills/turing/brief/SKILL.md +5 -5
- package/skills/turing/budget/SKILL.md +3 -3
- package/skills/turing/calibrate/SKILL.md +3 -3
- package/skills/turing/card/SKILL.md +3 -3
- package/skills/turing/changelog/SKILL.md +2 -2
- package/skills/turing/checkpoint/SKILL.md +3 -3
- package/skills/turing/cite/SKILL.md +2 -2
- package/skills/turing/compare/SKILL.md +1 -1
- package/skills/turing/counterfactual/SKILL.md +2 -2
- package/skills/turing/curriculum/SKILL.md +3 -3
- package/skills/turing/design/SKILL.md +3 -3
- package/skills/turing/diagnose/SKILL.md +4 -4
- package/skills/turing/diff/SKILL.md +3 -3
- package/skills/turing/distill/SKILL.md +3 -3
- package/skills/turing/doctor/SKILL.md +2 -2
- package/skills/turing/ensemble/SKILL.md +3 -3
- package/skills/turing/explore/SKILL.md +4 -4
- package/skills/turing/export/SKILL.md +3 -3
- package/skills/turing/feature/SKILL.md +3 -3
- package/skills/turing/flashback/SKILL.md +2 -2
- package/skills/turing/fork/SKILL.md +3 -3
- package/skills/turing/frontier/SKILL.md +3 -3
- package/skills/turing/init/SKILL.md +5 -5
- package/skills/turing/leak/SKILL.md +3 -3
- package/skills/turing/lit/SKILL.md +3 -3
- package/skills/turing/logbook/SKILL.md +5 -5
- package/skills/turing/merge/SKILL.md +2 -2
- package/skills/turing/mode/SKILL.md +1 -1
- package/skills/turing/onboard/SKILL.md +2 -2
- package/skills/turing/paper/SKILL.md +3 -3
- package/skills/turing/plan/SKILL.md +2 -2
- package/skills/turing/poster/SKILL.md +3 -3
- package/skills/turing/postmortem/SKILL.md +2 -2
- package/skills/turing/preflight/SKILL.md +5 -5
- package/skills/turing/present/SKILL.md +2 -2
- package/skills/turing/profile/SKILL.md +3 -3
- package/skills/turing/prune/SKILL.md +2 -2
- package/skills/turing/quantize/SKILL.md +2 -2
- package/skills/turing/queue/SKILL.md +3 -3
- package/skills/turing/registry/SKILL.md +2 -2
- package/skills/turing/regress/SKILL.md +3 -3
- package/skills/turing/replay/SKILL.md +2 -2
- package/skills/turing/report/SKILL.md +3 -3
- package/skills/turing/reproduce/SKILL.md +3 -3
- package/skills/turing/retry/SKILL.md +3 -3
- package/skills/turing/review/SKILL.md +2 -2
- package/skills/turing/rules/loop-protocol.md +11 -11
- package/skills/turing/sanity/SKILL.md +3 -3
- package/skills/turing/scale/SKILL.md +4 -4
- package/skills/turing/search/SKILL.md +2 -2
- package/skills/turing/seed/SKILL.md +3 -3
- package/skills/turing/sensitivity/SKILL.md +3 -3
- package/skills/turing/share/SKILL.md +2 -2
- package/skills/turing/simulate/SKILL.md +2 -2
- package/skills/turing/status/SKILL.md +1 -1
- package/skills/turing/stitch/SKILL.md +3 -3
- package/skills/turing/suggest/SKILL.md +5 -5
- package/skills/turing/surgery/SKILL.md +2 -2
- package/skills/turing/sweep/SKILL.md +8 -8
- package/skills/turing/template/SKILL.md +2 -2
- package/skills/turing/train/SKILL.md +5 -5
- package/skills/turing/transfer/SKILL.md +3 -3
- package/skills/turing/trend/SKILL.md +2 -2
- package/skills/turing/try/SKILL.md +4 -4
- package/skills/turing/update/SKILL.md +2 -2
- package/skills/turing/validate/SKILL.md +4 -4
- package/skills/turing/warm/SKILL.md +3 -3
- package/skills/turing/watch/SKILL.md +4 -4
- package/skills/turing/whatif/SKILL.md +2 -2
- package/skills/turing/xray/SKILL.md +3 -3
- package/templates/README.md +5 -8
- package/templates/program.md +18 -18
- package/templates/pyproject.toml +10 -0
- package/templates/requirements.txt +4 -1
- package/templates/scripts/generate_onboarding.py +1 -1
- package/templates/scripts/post-train-hook.sh +7 -8
- package/templates/scripts/scaffold.py +24 -26
- package/templates/scripts/stop-hook.sh +2 -3
- package/templates/scripts/turing-run-python.sh +9 -0
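The changed commands and skills below all follow one substitution: scaffolded scripts now sync and run through uv instead of assuming an already-activated virtualenv. A minimal before/after sketch of that pattern (the script shown is just one illustrative example of the files touched below):

```bash
# 4.8.0: assumes an activated virtual environment
python scripts/show_metrics.py --last 10

# 4.8.1: resolve the environment first, then run through uv
uv sync
uv run python scripts/show_metrics.py --last 10
```
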
package/commands/status.md
CHANGED

@@ -10,7 +10,7 @@ Show the current state of the ML training pipeline. This is an observation-only
 
 1. **Run metrics display:**
 ```bash
-
+uv run python scripts/show_metrics.py --last 10
 ```
 
 2. **Summarize for the user:**

package/commands/stitch.md
CHANGED

@@ -9,9 +9,9 @@ Decompose your ML pipeline into stages that can be independently varied, cached,
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**

@@ -23,7 +23,7 @@ Decompose your ML pipeline into stages that can be independently varied, cached,
 
 3. **Run pipeline manager:**
 ```bash
-python scripts/pipeline_manager.py $ARGUMENTS
+uv run python scripts/pipeline_manager.py $ARGUMENTS
 ```
 
 4. **Report results:**

package/commands/suggest.md
CHANGED

@@ -2,7 +2,7 @@
 name: suggest
 description: Literature-grounded model selection. Reads the ML task context, searches recent literature, and suggests model architectures worth trying — with citations. Suggestions are auto-queued as hypotheses.
 argument-hint: "[task description override]"
-allowed-tools: Read, Write, Bash(python scripts/*:*,
+allowed-tools: Read, Write, Bash(uv run python scripts/*:*, uv sync:*), Grep, Glob, WebSearch, WebFetch
 ---
 
 Suggest model architectures for the current ML task. Supports two strategies:

@@ -25,7 +25,7 @@ cat config.yaml
 ```
 
 ```bash
-
+uv run python scripts/show_metrics.py --last 10 2>/dev/null || echo "No experiments yet"
 ```
 
 If `$ARGUMENTS` is provided, use that as the task description. Otherwise, infer from `config.yaml` (model type, primary metric, data source, target column).

@@ -66,7 +66,7 @@ From the literature, synthesize **3-5 concrete model architecture suggestions**.
 For each suggestion, add to the hypothesis queue:
 
 ```bash
-
+uv run python scripts/manage_hypotheses.py add "<model>: <rationale> (source: <citation>)" --priority medium --source literature
 ```
 
 ### 5. Show Results

@@ -105,7 +105,7 @@ Same detection logic as the literature strategy — find `config.yaml` + `train.
 ### 2. Run Tree Search
 
 ```bash
-
+uv run python scripts/treequest_suggest.py \
   --log experiments/log.jsonl \
   --config config.yaml \
   --top 5 \

@@ -120,7 +120,7 @@ If TreeQuest is not installed, the script automatically falls back to greedy bes
 For each result from the tree search, queue as a hypothesis:
 
 ```bash
-
+uv run python scripts/manage_hypotheses.py add "<description>" --priority medium --source treequest
 ```
 
 ### 4. Show Results

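Taken together, the TreeQuest strategy above reduces to a short shell sequence; a sketch assuming the scaffolded `scripts/` directory and an existing `experiments/log.jsonl` (the quoted hypothesis text is an illustrative placeholder, not package output):

```bash
uv sync
# rank candidate next experiments from the logged history
uv run python scripts/treequest_suggest.py \
  --log experiments/log.jsonl \
  --config config.yaml \
  --top 5
# queue one of the returned suggestions as a hypothesis
uv run python scripts/manage_hypotheses.py add \
  "wider ensemble: top-2 candidates disagreed on depth" --priority medium --source treequest
```
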
package/commands/surgery.md
CHANGED

@@ -9,8 +9,8 @@ Programmatic architecture changes with auto warm-start from existing weights.
 
 ## Steps
 
-1. **
-2. **Run:** `python scripts/architecture_surgery.py $ARGUMENTS`
+1. **Sync environment:** `uv sync`
+2. **Run:** `uv run python scripts/architecture_surgery.py $ARGUMENTS`
 3. **Operations:** add-layer, remove-layer, widen, narrow, swap-activation, add-skip, add-norm, deepen, swap-objective
 4. **For tree models:** deepen (increase max_depth), widen (more estimators), swap-objective
 5. **Report:** operation details, config changes, parameter count delta, warm-start source

package/commands/sweep.md
CHANGED

@@ -2,38 +2,38 @@
 name: sweep
 description: Generate and run a systematic hyperparameter sweep. Computes the cartesian product of configured parameter ranges and processes the queue sequentially with full experiment logging.
 argument-hint: "[sweep_config.yaml]"
-allowed-tools: Read, Write, Edit, Bash(python train.py:*, python scripts/*:*, git:*,
+allowed-tools: Read, Write, Edit, Bash(uv run python train.py:*, uv run python scripts/*:*, git:*, uv sync:*, uv add:*), Grep, Glob
 ---
 
 Run a systematic hyperparameter sweep using the sweep configuration.
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Resolve config:** Use `$ARGUMENTS` as sweep config path, or default to `sweep_config.yaml`.
 
 3. **Generate queue** (if not already generated):
 ```bash
-python scripts/sweep.py [sweep_config.yaml]
+uv run python scripts/sweep.py [sweep_config.yaml]
 ```
 
 4. **Check queue status:**
 ```bash
-python scripts/sweep.py --status
+uv run python scripts/sweep.py --status
 ```
 
 5. **Process queue sequentially:**
-- Get next: `python scripts/sweep.py --next`
+- Get next: `uv run python scripts/sweep.py --next`
 - Apply config overrides to `config.yaml`
 - Create experiment branch: `git checkout -b exp/NNN-description`
-- Run training: `python train.py > run.log 2>&1`
+- Run training: `uv run python train.py > run.log 2>&1`
 - Parse metrics: `grep -A 10 "^---" run.log | head -10`
 - Log the experiment
-- Mark complete: `python scripts/sweep.py --mark <name> complete`
+- Mark complete: `uv run python scripts/sweep.py --mark <name> complete`
 - If improved, merge to main. If not, return to main.
 - Repeat until queue is empty
 
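For orientation, one pass of the queue-processing loop in step 5 chains the new uv invocations together roughly like this (a sketch; the branch and experiment names are placeholders, and experiment logging is elided):

```bash
uv sync
uv run python scripts/sweep.py sweep_config.yaml        # generate the queue
uv run python scripts/sweep.py --status                 # inspect it
uv run python scripts/sweep.py --next                   # fetch the next config to apply
git checkout -b exp/001-example                         # placeholder branch name
uv run python train.py > run.log 2>&1                   # train, keep output out of context
grep -A 10 "^---" run.log | head -10                    # parse the metrics block
uv run python scripts/sweep.py --mark exp-001 complete  # placeholder experiment name
```
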
package/commands/template.md
CHANGED

@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 Turn your best experiment configs into reusable recipes that persist across projects.
 
 ## Steps
-1. **
-2. **Run:** `python scripts/experiment_templates.py $ARGUMENTS`
+1. **Sync environment:** `uv sync`
+2. **Run:** `uv run python scripts/experiment_templates.py $ARGUMENTS`
 3. **Operations:** save (from experiment), list (all templates), apply (to current project), share (export)
 4. **Stored at:** `~/.turing/templates/` (cross-project)
 
package/commands/train.md
CHANGED

@@ -2,7 +2,7 @@
 name: train
 description: Run the autonomous ML experiment loop. Iteratively hypothesizes, trains, evaluates, and decides — keeping only improvements. Implements the autoresearch pattern with formal convergence detection and git-disciplined rollback.
 argument-hint: "[max_iterations]"
-allowed-tools: Read, Write, Edit, Bash(python train.py:*, python scripts/*:*, git:*,
+allowed-tools: Read, Write, Edit, Bash(uv run python train.py:*, uv run python scripts/*:*, git:*, uv sync:*, uv add:*), Grep, Glob
 ---
 
 You are an autonomous ML researcher. Your goal: iteratively improve a model by following the experiment loop protocol — the scientific method applied to machine learning.

@@ -26,9 +26,9 @@ Read `program.md` in the ML project directory for the complete protocol. Follow
 
 1. **Restore memory:** Read `.claude/agent-memory/ml-researcher-{project_name}/MEMORY.md` for prior observations and best results.
 2. **Read protocol:** Read `program.md` completely — it defines the experiment loop, constraints, and output format.
-3. **Bootstrap data:** Check for training data at `config.yaml` → `data.source`. If no splits exist, run `python prepare.py`.
-4. **Bootstrap
-5. **Assess state:** `
+3. **Bootstrap data:** Check for training data at `config.yaml` → `data.source`. If no splits exist, run `uv run python prepare.py`.
+4. **Bootstrap uv environment:** `uv sync`
+5. **Assess state:** `uv run python scripts/show_metrics.py --last 5`
 6. **Begin the loop** from program.md.
 
 ## The Loop

@@ -47,7 +47,7 @@ Use `@ml-evaluator` for analysis tasks. It is read-only (no Write/Edit) and cann
 
 ## Context Management
 
-- Redirect all training output: `python train.py > run.log 2>&1`
+- Redirect all training output: `uv run python train.py > run.log 2>&1`
 - Parse metrics with grep, never read full output
 - Persist observations to MEMORY.md after each experiment
 
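The bootstrap steps and the context-management rule above combine into a short shell preamble for each loop iteration; a sketch assuming the scaffolded project layout (`prepare.py`, `train.py`, `scripts/`):

```bash
uv sync                                          # bootstrap the uv environment
uv run python prepare.py                         # only if no data splits exist yet
uv run python scripts/show_metrics.py --last 5   # assess current state
uv run python train.py > run.log 2>&1            # run an experiment, redirect all output
# parse metrics from run.log with grep rather than reading the full log
```
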
package/commands/transfer.md
CHANGED

@@ -9,9 +9,9 @@ Find similar prior projects and surface what worked. "Last time you had tabular
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**

@@ -22,7 +22,7 @@ Find similar prior projects and surface what worked. "Last time you had tabular
 
 3. **Run knowledge transfer:**
 ```bash
-python scripts/knowledge_transfer.py $ARGUMENTS
+uv run python scripts/knowledge_transfer.py $ARGUMENTS
 ```
 
 4. **Report includes:**

package/commands/trend.md
CHANGED

@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 See the arc of your research, not just the latest results. Strategic view over 100+ experiments.
 
 ## Steps
-1. **
-2. **Run:** `python scripts/trend_analysis.py $ARGUMENTS`
+1. **Sync environment:** `uv sync`
+2. **Run:** `uv run python scripts/trend_analysis.py $ARGUMENTS`
 3. **Report:** improvement velocity over time windows, family ROI ranking, diminishing returns prediction, phase transitions
 4. **Saved output:** `experiments/trends/trend-*.yaml`
 
package/commands/try.md
CHANGED

@@ -2,7 +2,7 @@
 name: try
 description: Inject a hypothesis into the agent's experiment queue. This is how research taste reaches the agent — the human selects which coins to flip, the agent flips them.
 argument-hint: "<hypothesis description>"
-allowed-tools: Read, Write, Edit, Bash(python scripts/*:*,
+allowed-tools: Read, Write, Edit, Bash(uv run python scripts/*:*, uv sync:*), Grep, Glob
 ---
 
 Inject a human hypothesis into the experiment queue for the next `/turing:train` iteration.

@@ -15,18 +15,18 @@ This is the taste-leverage mechanism: you provide judgment about what's worth tr
 
 2. **Check for archetype syntax.** If the argument starts with `archetype:`, expand it:
 ```bash
-
+uv run python scripts/manage_hypotheses.py add --archetype <name> --priority high --source human
 ```
 
 Otherwise, use the raw description:
 ```bash
-
+uv run python scripts/manage_hypotheses.py add "$ARGUMENTS" --priority high --source human
 ```
 
 3. **Confirm** with the hypothesis ID and instructions:
 - "Queued as hyp-NNN (high priority, human-injected)"
 - "The agent will prioritize this on the next `/turing:train` iteration"
-- Show current queue: `python scripts/manage_hypotheses.py list --status queued`
+- Show current queue: `uv run python scripts/manage_hypotheses.py list --status queued`
 
 ## Examples
 
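The two injection forms and the queue check above look like this in practice (a sketch; the hypothesis text and archetype name are illustrative placeholders):

```bash
# free-form hypothesis, human-injected at high priority
uv run python scripts/manage_hypotheses.py add "try target encoding on high-cardinality columns" --priority high --source human
# archetype expansion (placeholder archetype name)
uv run python scripts/manage_hypotheses.py add --archetype regularization --priority high --source human
# confirm what is queued
uv run python scripts/manage_hypotheses.py list --status queued
```
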
package/commands/update.md
CHANGED

@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 Add new data to an existing model without starting from scratch. Detects catastrophic forgetting.
 
 ## Steps
-1. `
-2. `python scripts/incremental_update.py $ARGUMENTS`
+1. `uv sync`
+2. `uv run python scripts/incremental_update.py $ARGUMENTS`
 3. **Saved:** `experiments/updates/`
 
 ## Model-specific strategies

package/commands/validate.md
CHANGED

@@ -9,19 +9,19 @@ Validate the stability of the current ML pipeline by running it multiple times a
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Run stability check:**
 ```bash
-python scripts/validate_stability.py
+uv run python scripts/validate_stability.py
 ```
 
 3. **If `$ARGUMENTS` contains `--auto`:**
 ```bash
-python scripts/validate_stability.py --auto
+uv run python scripts/validate_stability.py --auto
 ```
 This auto-writes `evaluation.n_runs: 3` to `config.yaml` if CV > 5%.
 
package/commands/warm.md
CHANGED

@@ -9,9 +9,9 @@ Take a trained checkpoint and use it as initialization for a new experiment. Aut
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**

@@ -23,7 +23,7 @@ Take a trained checkpoint and use it as initialization for a new experiment. Aut
 
 3. **Run warm-start planner:**
 ```bash
-python scripts/warm_start.py $ARGUMENTS
+uv run python scripts/warm_start.py $ARGUMENTS
 ```
 
 4. **Report results:**

package/commands/watch.md
CHANGED

@@ -9,9 +9,9 @@ Stream metrics during training with early-warning alerts. Catches problems mid-r
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**

@@ -23,13 +23,13 @@ Stream metrics during training with early-warning alerts. Catches problems mid-r
 
 3. **For post-hoc analysis:**
 ```bash
-python scripts/training_monitor.py --analyze run.log
+uv run python scripts/training_monitor.py --analyze run.log
 ```
 
 4. **For live monitoring (inform user):**
 Live monitoring requires a running training process. Suggest the user run in a separate terminal:
 ```bash
-python scripts/training_monitor.py --log run.log --interval 10
+uv run python scripts/training_monitor.py --log run.log --interval 10
 ```
 
 5. **Alert types:**

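The live-monitoring suggestion in step 4 assumes training is already writing to `run.log`; a sketch of the two-terminal setup it describes (backgrounding with `&` is one way to stand in for the second terminal):

```bash
# terminal 1 (or backgrounded): training writes to run.log
uv run python train.py > run.log 2>&1 &
# terminal 2: stream metrics with early-warning alerts every 10 seconds
uv run python scripts/training_monitor.py --log run.log --interval 10
```
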
package/commands/whatif.md
CHANGED

@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 Answer "what if?" questions using existing experiment data. Routes to the right estimator automatically.
 
 ## Steps
-1. `
-2. `python scripts/whatif_engine.py $ARGUMENTS`
+1. `uv sync`
+2. `uv run python scripts/whatif_engine.py $ARGUMENTS`
 3. **Saved:** `experiments/whatif/`
 
 ## Supported question types

package/commands/xray.md
CHANGED

@@ -9,9 +9,9 @@ See inside the model. When it underperforms, the fix depends on *why*.
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**

@@ -22,7 +22,7 @@ See inside the model. When it underperforms, the fix depends on *why*.
 
 3. **Run model diagnostics:**
 ```bash
-python scripts/model_xray.py $ARGUMENTS
+uv run python scripts/model_xray.py $ARGUMENTS
 ```
 
 4. **Diagnostics by model type:**

package/config/commands.yaml
CHANGED

@@ -332,7 +332,7 @@ commands:
 - Glob
 argument_hint: '[--metrics "accuracy,train_seconds,n_params"] [--ascii]'
 init:
-description: "Initialize a new ML project with the Turing autoresearch harness. Scaffolds the full experiment infrastructure \u2014 immutable evaluation pipeline, agent-editable training code, structured logging, convergence detection hooks, and a Python
+description: "Initialize a new ML project with the Turing autoresearch harness. Scaffolds the full experiment infrastructure \u2014 immutable evaluation pipeline, agent-editable training code, structured logging, convergence detection hooks, and a uv-managed Python environment. Use --plan to generate a research plan."
 lifecycle: setup
 invocation_mode: slash_only
 model_invocation: enabled

package/package.json
CHANGED

@@ -1,6 +1,6 @@
 {
 "name": "claude-turing",
-"version": "4.8.0",
+"version": "4.8.1",
 "type": "module",
 "description": "Autonomous ML research harness for Claude Code. The autoresearch loop as a formal protocol — iteratively trains, evaluates, and improves ML models with structured experiment tracking, convergence detection, immutable evaluation infrastructure, and safety guardrails.",
 "bin": {

package/skills/turing/ablate/SKILL.md
CHANGED

@@ -9,9 +9,9 @@ Run a systematic ablation study to measure the contribution of each model compon
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**

@@ -22,7 +22,7 @@ Run a systematic ablation study to measure the contribution of each model compon
 
 3. **Run ablation study:**
 ```bash
-python scripts/ablation_study.py $ARGUMENTS
+uv run python scripts/ablation_study.py $ARGUMENTS
 ```
 
 4. **Report results:**

package/skills/turing/annotate/SKILL.md
CHANGED

@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 Add context that experiment logs can't capture. "This only worked because the data was pre-sorted."
 
 ## Steps
-1. **
-2. **Run:** `python scripts/experiment_annotations.py $ARGUMENTS`
+1. **Sync environment:** `uv sync`
+2. **Run:** `uv run python scripts/experiment_annotations.py $ARGUMENTS`
 3. **Operations:** add (text + tags), list (per-experiment or all), search (keyword or tag)
 4. **Stored in:** `experiments/annotations.yaml`
 
package/skills/turing/archive/SKILL.md
CHANGED

@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 Keep your project directory manageable after 200+ experiments.
 
 ## Steps
-1. **
-2. **Run:** `python scripts/experiment_archive.py $ARGUMENTS`
+1. **Sync environment:** `uv sync`
+2. **Run:** `uv run python scripts/experiment_archive.py $ARGUMENTS`
 3. **Protected experiments:** Pareto-optimal, current best, recent, top-N by metric
 4. **Report:** archived count, preserved count, space reclaimed
 5. **Saved output:** `experiments/archive/index.yaml`

package/skills/turing/audit/SKILL.md
CHANGED

@@ -9,9 +9,9 @@ A reviewer checklist you run before submitting. Catches methodology mistakes tha
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**

@@ -21,7 +21,7 @@ A reviewer checklist you run before submitting. Catches methodology mistakes tha
 
 3. **Run methodology audit:**
 ```bash
-python scripts/methodology_audit.py $ARGUMENTS
+uv run python scripts/methodology_audit.py $ARGUMENTS
 ```
 
 4. **Checks performed:**

package/skills/turing/baseline/SKILL.md
CHANGED

@@ -9,9 +9,9 @@ Generate trivial baselines so you always know if your model is meaningfully bett
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**

@@ -21,7 +21,7 @@ Generate trivial baselines so you always know if your model is meaningfully bett
 
 3. **Run baseline generation:**
 ```bash
-python scripts/generate_baselines.py $ARGUMENTS
+uv run python scripts/generate_baselines.py $ARGUMENTS
 ```
 
 4. **Baselines generated:**

package/skills/turing/brief/SKILL.md
CHANGED

@@ -2,7 +2,7 @@
 name: brief
 description: Generate a structured research intelligence report from experiment history — what's been learned, what's promising, what's exhausted, and what the human should consider next. Use --deep for literature-grounded suggestions.
 argument-hint: "[ml/project] [--deep]"
-allowed-tools: Read, Bash(python scripts/*:*,
+allowed-tools: Read, Bash(uv run python scripts/*:*, uv sync:*), Grep, Glob, WebSearch, WebFetch
 ---
 
 Generate a research briefing that a human can read in 2 minutes and immediately decide what to inject next.

@@ -23,14 +23,14 @@ Before generating the briefing, detect which project to report on:
 
 1. **Generate the briefing:**
 ```bash
-
+uv run python scripts/generate_brief.py
 ```
 
 2. **Self-critique the briefing** before presenting. Review the generated output and check:
 - **Recommendations specificity:** Are they concrete enough to act on? "Try a different model" is bad. "Try LightGBM with leaf-wise growth because exp-004 showed depth sensitivity" is good. If vague, rewrite them with specific model/hyperparameter suggestions grounded in the experiment data.
 - **Exhausted directions coverage:** Cross-reference the "Model Types Explored" section against `experiments/log.jsonl`. Are there discarded experiments missing from the summary? If so, add them.
 - **Convergence estimate grounding:** If the briefing says "close to convergence" or "further improvement possible", verify against the actual metric trajectory. Is the claim supported by the numbers?
-- **Metric accuracy:** Spot-check that the "Current Best" metrics match the actual log. Run `python scripts/show_metrics.py --last 1` if uncertain.
+- **Metric accuracy:** Spot-check that the "Current Best" metrics match the actual log. Run `uv run python scripts/show_metrics.py --last 1` if uncertain.
 
 If any section fails the check, regenerate just that section. Max 1 revision round — don't over-polish.
 

@@ -75,7 +75,7 @@ When `--deep` is requested, add a 7th section: **Literature-Grounded Suggestions
 
 4. **Queue suggestions** as hypotheses:
 ```bash
-
+uv run python scripts/manage_hypotheses.py add "<technique>: <rationale> (source: <citation>)" --priority medium --source literature
 ```
 
 5. **Format as a section** appended to the briefing.

@@ -83,7 +83,7 @@ When `--deep` is requested, add a 7th section: **Literature-Grounded Suggestions
 ## Saving Briefs
 
 ```bash
-mkdir -p briefs && python scripts/generate_brief.py > briefs/brief-$(date +%Y-%m-%d).md
+mkdir -p briefs && uv run python scripts/generate_brief.py > briefs/brief-$(date +%Y-%m-%d).md
 ```
 
 ## When to Use

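End to end, the briefing flow above reduces to generating the brief, saving it, and queueing any literature-backed follow-ups; a sketch (the quoted technique string and citation are illustrative placeholders):

```bash
uv sync
mkdir -p briefs && uv run python scripts/generate_brief.py > briefs/brief-$(date +%Y-%m-%d).md
uv run python scripts/show_metrics.py --last 1   # spot-check the "Current Best" numbers
uv run python scripts/manage_hypotheses.py add \
  "quantile loss: brief flagged heavy-tailed residuals (source: placeholder)" --priority medium --source literature
```
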
package/skills/turing/budget/SKILL.md
CHANGED

@@ -9,9 +9,9 @@ Set a compute ceiling and let the system optimize within it. Prevents runaway ex
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**

@@ -22,7 +22,7 @@ Set a compute ceiling and let the system optimize within it. Prevents runaway ex
 
 3. **Run budget manager:**
 ```bash
-python scripts/budget_manager.py $ARGUMENTS
+uv run python scripts/budget_manager.py $ARGUMENTS
 ```
 
 4. **Actions:**

package/skills/turing/calibrate/SKILL.md
CHANGED

@@ -9,9 +9,9 @@ Make model probabilities trustworthy. Does 80% confidence actually mean 80% corr
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**

@@ -21,7 +21,7 @@ Make model probabilities trustworthy. Does 80% confidence actually mean 80% corr
 
 3. **Run calibration:**
 ```bash
-python scripts/calibration.py $ARGUMENTS
+uv run python scripts/calibration.py $ARGUMENTS
 ```
 
 4. **Report includes:**

package/skills/turing/card/SKILL.md
CHANGED

@@ -1,7 +1,7 @@
 ---
 name: card
 description: Generate a standardized model card documenting the trained model — type, performance, training data, limitations, intended use, and artifact contract.
-allowed-tools: Read, Bash(python scripts/*:*,
+allowed-tools: Read, Bash(uv run python scripts/*:*, uv sync:*), Grep, Glob
 ---
 
 You generate a standardized model card from the experiment log, model contract, and config.

@@ -10,12 +10,12 @@ You generate a standardized model card from the experiment log, model contract,
 
 1. **Activate the virtual environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Run the model card generator:**
 ```bash
-python scripts/generate_model_card.py --config config.yaml --log experiments/log.jsonl --contract model_contract.md --output MODEL_CARD.md
+uv run python scripts/generate_model_card.py --config config.yaml --log experiments/log.jsonl --contract model_contract.md --output MODEL_CARD.md
 ```
 
 3. **Read and present the generated card:**

package/skills/turing/changelog/SKILL.md
CHANGED

@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 Translate experiment logs into a narrative that PMs and stakeholders can read in 2 minutes.
 
 ## Steps
-1. **
-2. **Run:** `python scripts/generate_changelog.py $ARGUMENTS`
+1. **Sync environment:** `uv sync`
+2. **Run:** `uv run python scripts/generate_changelog.py $ARGUMENTS`
 3. **Audience:** technical (experiment IDs, configs), stakeholder (plain English, percentages)
 4. **Saved output:** `paper/CHANGELOG.md`
 
package/skills/turing/checkpoint/SKILL.md
CHANGED

@@ -9,9 +9,9 @@ Manage model checkpoints intelligently using Pareto dominance.
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**

@@ -22,7 +22,7 @@ Manage model checkpoints intelligently using Pareto dominance.
 
 3. **Run checkpoint manager:**
 ```bash
-python scripts/checkpoint_manager.py $ARGUMENTS
+uv run python scripts/checkpoint_manager.py $ARGUMENTS
 ```
 
 4. **Report results by action:**

package/skills/turing/cite/SKILL.md
CHANGED

@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 Track which papers and methods influenced each experiment. Catch missing citations before submission.
 
 ## Steps
-1. **
-2. **Run:** `python scripts/citation_manager.py $ARGUMENTS`
+1. **Sync environment:** `uv sync`
+2. **Run:** `uv run python scripts/citation_manager.py $ARGUMENTS`
 3. **Operations:** add (associate citation with experiment), list (group by type), check (audit missing), bib (BibTeX)
 4. **Stored in:** `experiments/citations.yaml`
 
package/skills/turing/compare/SKILL.md
CHANGED

@@ -11,7 +11,7 @@ Compare two ML experiment runs side-by-side to understand what changed and why o
 
 1. **Run comparison:**
 ```bash
-
+uv run python scripts/compare_runs.py $0 $1
 ```
 
 2. **Analyze the delta:**
|