claude-turing 4.8.0 → 4.8.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude-plugin/plugin.json +1 -1
- package/README.md +1 -1
- package/agents/ml-evaluator.md +4 -4
- package/agents/ml-researcher.md +2 -2
- package/bin/turing-init.sh +2 -2
- package/commands/ablate.md +3 -3
- package/commands/annotate.md +2 -2
- package/commands/archive.md +2 -2
- package/commands/audit.md +3 -3
- package/commands/baseline.md +3 -3
- package/commands/brief.md +5 -5
- package/commands/budget.md +3 -3
- package/commands/calibrate.md +3 -3
- package/commands/card.md +3 -3
- package/commands/changelog.md +2 -2
- package/commands/checkpoint.md +3 -3
- package/commands/cite.md +2 -2
- package/commands/compare.md +1 -1
- package/commands/counterfactual.md +2 -2
- package/commands/curriculum.md +3 -3
- package/commands/design.md +3 -3
- package/commands/diagnose.md +4 -4
- package/commands/diff.md +3 -3
- package/commands/distill.md +3 -3
- package/commands/doctor.md +2 -2
- package/commands/ensemble.md +3 -3
- package/commands/explore.md +4 -4
- package/commands/export.md +3 -3
- package/commands/feature.md +3 -3
- package/commands/flashback.md +2 -2
- package/commands/fork.md +3 -3
- package/commands/frontier.md +3 -3
- package/commands/init.md +5 -5
- package/commands/leak.md +3 -3
- package/commands/lit.md +3 -3
- package/commands/logbook.md +5 -5
- package/commands/merge.md +2 -2
- package/commands/mode.md +1 -1
- package/commands/onboard.md +2 -2
- package/commands/paper.md +3 -3
- package/commands/plan.md +2 -2
- package/commands/poster.md +3 -3
- package/commands/postmortem.md +2 -2
- package/commands/preflight.md +5 -5
- package/commands/present.md +2 -2
- package/commands/profile.md +3 -3
- package/commands/prune.md +2 -2
- package/commands/quantize.md +2 -2
- package/commands/queue.md +3 -3
- package/commands/registry.md +2 -2
- package/commands/regress.md +3 -3
- package/commands/replay.md +2 -2
- package/commands/report.md +3 -3
- package/commands/reproduce.md +3 -3
- package/commands/retry.md +3 -3
- package/commands/review.md +2 -2
- package/commands/rules/loop-protocol.md +11 -11
- package/commands/sanity.md +3 -3
- package/commands/scale.md +4 -4
- package/commands/search.md +2 -2
- package/commands/seed.md +3 -3
- package/commands/sensitivity.md +3 -3
- package/commands/share.md +2 -2
- package/commands/simulate.md +2 -2
- package/commands/status.md +1 -1
- package/commands/stitch.md +3 -3
- package/commands/suggest.md +5 -5
- package/commands/surgery.md +2 -2
- package/commands/sweep.md +8 -8
- package/commands/template.md +2 -2
- package/commands/train.md +5 -5
- package/commands/transfer.md +3 -3
- package/commands/trend.md +2 -2
- package/commands/try.md +4 -4
- package/commands/update.md +2 -2
- package/commands/validate.md +4 -4
- package/commands/warm.md +3 -3
- package/commands/watch.md +4 -4
- package/commands/whatif.md +2 -2
- package/commands/xray.md +3 -3
- package/config/commands.yaml +1 -1
- package/package.json +1 -1
- package/skills/turing/ablate/SKILL.md +3 -3
- package/skills/turing/annotate/SKILL.md +2 -2
- package/skills/turing/archive/SKILL.md +2 -2
- package/skills/turing/audit/SKILL.md +3 -3
- package/skills/turing/baseline/SKILL.md +3 -3
- package/skills/turing/brief/SKILL.md +5 -5
- package/skills/turing/budget/SKILL.md +3 -3
- package/skills/turing/calibrate/SKILL.md +3 -3
- package/skills/turing/card/SKILL.md +3 -3
- package/skills/turing/changelog/SKILL.md +2 -2
- package/skills/turing/checkpoint/SKILL.md +3 -3
- package/skills/turing/cite/SKILL.md +2 -2
- package/skills/turing/compare/SKILL.md +1 -1
- package/skills/turing/counterfactual/SKILL.md +2 -2
- package/skills/turing/curriculum/SKILL.md +3 -3
- package/skills/turing/design/SKILL.md +3 -3
- package/skills/turing/diagnose/SKILL.md +4 -4
- package/skills/turing/diff/SKILL.md +3 -3
- package/skills/turing/distill/SKILL.md +3 -3
- package/skills/turing/doctor/SKILL.md +2 -2
- package/skills/turing/ensemble/SKILL.md +3 -3
- package/skills/turing/explore/SKILL.md +4 -4
- package/skills/turing/export/SKILL.md +3 -3
- package/skills/turing/feature/SKILL.md +3 -3
- package/skills/turing/flashback/SKILL.md +2 -2
- package/skills/turing/fork/SKILL.md +3 -3
- package/skills/turing/frontier/SKILL.md +3 -3
- package/skills/turing/init/SKILL.md +5 -5
- package/skills/turing/leak/SKILL.md +3 -3
- package/skills/turing/lit/SKILL.md +3 -3
- package/skills/turing/logbook/SKILL.md +5 -5
- package/skills/turing/merge/SKILL.md +2 -2
- package/skills/turing/mode/SKILL.md +1 -1
- package/skills/turing/onboard/SKILL.md +2 -2
- package/skills/turing/paper/SKILL.md +3 -3
- package/skills/turing/plan/SKILL.md +2 -2
- package/skills/turing/poster/SKILL.md +3 -3
- package/skills/turing/postmortem/SKILL.md +2 -2
- package/skills/turing/preflight/SKILL.md +5 -5
- package/skills/turing/present/SKILL.md +2 -2
- package/skills/turing/profile/SKILL.md +3 -3
- package/skills/turing/prune/SKILL.md +2 -2
- package/skills/turing/quantize/SKILL.md +2 -2
- package/skills/turing/queue/SKILL.md +3 -3
- package/skills/turing/registry/SKILL.md +2 -2
- package/skills/turing/regress/SKILL.md +3 -3
- package/skills/turing/replay/SKILL.md +2 -2
- package/skills/turing/report/SKILL.md +3 -3
- package/skills/turing/reproduce/SKILL.md +3 -3
- package/skills/turing/retry/SKILL.md +3 -3
- package/skills/turing/review/SKILL.md +2 -2
- package/skills/turing/rules/loop-protocol.md +11 -11
- package/skills/turing/sanity/SKILL.md +3 -3
- package/skills/turing/scale/SKILL.md +4 -4
- package/skills/turing/search/SKILL.md +2 -2
- package/skills/turing/seed/SKILL.md +3 -3
- package/skills/turing/sensitivity/SKILL.md +3 -3
- package/skills/turing/share/SKILL.md +2 -2
- package/skills/turing/simulate/SKILL.md +2 -2
- package/skills/turing/status/SKILL.md +1 -1
- package/skills/turing/stitch/SKILL.md +3 -3
- package/skills/turing/suggest/SKILL.md +5 -5
- package/skills/turing/surgery/SKILL.md +2 -2
- package/skills/turing/sweep/SKILL.md +8 -8
- package/skills/turing/template/SKILL.md +2 -2
- package/skills/turing/train/SKILL.md +5 -5
- package/skills/turing/transfer/SKILL.md +3 -3
- package/skills/turing/trend/SKILL.md +2 -2
- package/skills/turing/try/SKILL.md +4 -4
- package/skills/turing/update/SKILL.md +2 -2
- package/skills/turing/validate/SKILL.md +4 -4
- package/skills/turing/warm/SKILL.md +3 -3
- package/skills/turing/watch/SKILL.md +4 -4
- package/skills/turing/whatif/SKILL.md +2 -2
- package/skills/turing/xray/SKILL.md +3 -3
- package/templates/README.md +5 -8
- package/templates/program.md +18 -18
- package/templates/pyproject.toml +10 -0
- package/templates/requirements.txt +4 -1
- package/templates/scripts/generate_onboarding.py +1 -1
- package/templates/scripts/post-train-hook.sh +7 -8
- package/templates/scripts/scaffold.py +24 -26
- package/templates/scripts/stop-hook.sh +2 -3
- package/templates/scripts/turing-run-python.sh +9 -0
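Nearly every change below is the same mechanical migration: bare `python`/`python3` invocations become `uv run python`, with a `uv sync` step added before script execution. A minimal sketch of that rewrite rule (the helper name and regex are illustrative, not part of the package):

```python
import re

def migrate_to_uv(line: str) -> str:
    """Rewrite a bare `python`/`python3` invocation to go through uv's
    managed environment, mirroring the 4.8.0 -> 4.8.1 edit applied
    across the command docs. Illustrative helper; not shipped."""
    if "uv run" in line:
        return line  # already migrated
    return re.sub(r"\bpython3?(?= )", "uv run python", line)
```

For example, `migrate_to_uv("python scripts/sweep.py --status")` yields `uv run python scripts/sweep.py --status`.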
package/commands/init.md
CHANGED

@@ -1,6 +1,6 @@
 ---
 name: init
-description: Initialize a new ML project with the Turing autoresearch harness. Scaffolds the full experiment infrastructure — immutable evaluation pipeline, agent-editable training code, structured logging, convergence detection hooks, and a Python
+description: Initialize a new ML project with the Turing autoresearch harness. Scaffolds the full experiment infrastructure — immutable evaluation pipeline, agent-editable training code, structured logging, convergence detection hooks, and a uv-managed Python environment. Use --plan to generate a research plan.
 argument-hint: "[project_name] [--plan]"
 allowed-tools: Read, Write, Edit, Bash(*), Grep, Glob, WebSearch, WebFetch
 ---
@@ -23,7 +23,7 @@ Ask the user for the following (or accept from `$ARGUMENTS` if provided as JSON)
 Once you have all 6 values, delegate to the unified scaffolding script:
 
 ```bash
-
+uv run python <templates_dir>/scripts/scaffold.py \
   --project-name "<project_name>" \
   --target-metric "<target_metric>" \
   --metric-direction "<metric_direction>" \
@@ -38,7 +38,7 @@ The scaffold script handles everything in a single atomic operation:
 - Creates data/, experiments/, models/ directories
 - Sets up agent memory at `.claude/agent-memory/ml-researcher-{project_name}/MEMORY.md`
 - Configures Claude Code hooks in `.claude/settings.local.json`
--
+- Runs `uv sync` from the ML directory when uv is available
 - Verifies all placeholders were replaced (fails loudly if any remain)
 
 ## Locating Templates
@@ -57,7 +57,7 @@ node_modules/claude-turing/templates/
 Example command:
 
 ```bash
-
+uv run python ~/.claude/commands/turing/templates/scripts/scaffold.py \
   --project-name "<project_name>" \
   --target-metric "<target_metric>" \
   --metric-direction "<metric_direction>" \
@@ -71,7 +71,7 @@ python3 ~/.claude/commands/turing/templates/scripts/scaffold.py \
 
 Report what was created:
 - The separation: READ-ONLY (`prepare.py`, `evaluate.py`) vs AGENT-EDITABLE (`train.py`)
-- Next steps: add data to the configured data source path, run `python prepare.py`, then `/turing:train`
+- Next steps: add data to the configured data source path, run `uv run python prepare.py`, then `/turing:train`
 - The taste-leverage loop: `/turing:try` to inject hypotheses, `/turing:brief` for intelligence reports
 
 ## Research Plan Generation (--plan flag)

package/commands/leak.md
CHANGED

@@ -9,9 +9,9 @@ Actively probe for data leakage. The #1 cause of "too good to be true" results.
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**
@@ -21,7 +21,7 @@ Actively probe for data leakage. The #1 cause of "too good to be true" results.
 
 3. **Run leakage scan:**
 ```bash
-python scripts/leakage_detector.py $ARGUMENTS
+uv run python scripts/leakage_detector.py $ARGUMENTS
 ```
 
 4. **Checks performed:**

package/commands/lit.md
CHANGED

@@ -9,9 +9,9 @@ Search the literature for papers, baselines, and related work.
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**
@@ -23,7 +23,7 @@ Search the literature for papers, baselines, and related work.
 
 3. **Run literature search:**
 ```bash
-python scripts/literature_search.py $ARGUMENTS
+uv run python scripts/literature_search.py $ARGUMENTS
 ```
 
 4. **Report results:**

package/commands/logbook.md
CHANGED

@@ -2,7 +2,7 @@
 name: logbook
 description: Generate a research logbook showing the full experiment narrative — hypotheses proposed, experiments run, decisions made, and progress over time. Outputs HTML (with interactive chart) or markdown.
 argument-hint: "[--since YYYY-MM-DD] [--format html|markdown] [--output path]"
-allowed-tools: Read, Bash(python scripts/*:*,
+allowed-tools: Read, Bash(uv run python scripts/*:*, uv sync:*, mkdir:*), Grep, Glob
 ---
 
 Generate a research logbook that captures the full narrative of the experiment campaign.
@@ -11,7 +11,7 @@ Generate a research logbook that captures the full narrative of the experiment c
 
 1. **Generate the logbook:**
 ```bash
-
+uv run python scripts/generate_logbook.py
 ```
 
 **With options from `$ARGUMENTS`:**
@@ -22,13 +22,13 @@ Generate a research logbook that captures the full narrative of the experiment c
 **Common usage:**
 ```bash
 # HTML logbook with interactive trajectory chart
-
+uv run python scripts/generate_logbook.py --output logbook.html
 
 # Markdown for embedding in docs or READMEs
-
+uv run python scripts/generate_logbook.py --format markdown --output logbook.md
 
 # Last week's activity
-
+uv run python scripts/generate_logbook.py --since 2026-03-24 --output logbook.html
 ```
 
 2. **Present the result:**

package/commands/merge.md
CHANGED

@@ -9,8 +9,8 @@ Combine model weights (not predictions) into a single, better model with no late
 
 ## Steps
 
-1. **
-2. **Run:** `python scripts/model_merger.py $ARGUMENTS`
+1. **Sync environment:** `uv sync`
+2. **Run:** `uv run python scripts/model_merger.py $ARGUMENTS`
 3. **Methods:** uniform soup (simple average), greedy soup (include only if improves), TIES (trim+elect+merge), DARE (drop+rescale)
 4. **Report:** compatibility check, per-model metrics, method comparison, improvement delta
 5. **Saved output:** `experiments/merges/merge-*.yaml`

package/commands/mode.md
CHANGED
package/commands/onboard.md
CHANGED

@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 5-minute read that replaces a 1-hour onboarding meeting.
 
 ## Steps
-1. `
-2. `python scripts/generate_onboarding.py $ARGUMENTS`
+1. `uv sync`
+2. `uv run python scripts/generate_onboarding.py $ARGUMENTS`
 3. **Saved:** `ONBOARDING.md`
 
 ## Examples

package/commands/paper.md
CHANGED

@@ -9,9 +9,9 @@ Draft paper sections directly from experiment data.
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**
@@ -20,7 +20,7 @@ Draft paper sections directly from experiment data.
 
 3. **Run paper drafting:**
 ```bash
-python scripts/draft_paper_sections.py $ARGUMENTS
+uv run python scripts/draft_paper_sections.py $ARGUMENTS
 ```
 
 4. **Report results:**

package/commands/plan.md
CHANGED

@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 Design the next N experiments strategically, not randomly. Allocates budget by expected ROI.
 
 ## Steps
-1. `
-2. `python scripts/research_planner.py $ARGUMENTS`
+1. `uv sync`
+2. `uv run python scripts/research_planner.py $ARGUMENTS`
 3. **Saved:** `experiments/plans/`
 
 ## How it works

package/commands/poster.md
CHANGED

@@ -2,7 +2,7 @@
 name: poster
 description: Generate a single-page HTML research poster summarizing the experiment campaign — best result, trajectory, key findings, and methodology. Adapted from posterskill's self-contained HTML architecture.
 argument-hint: "[title override]"
-allowed-tools: Read, Write, Edit, Bash(python scripts/*:*,
+allowed-tools: Read, Write, Edit, Bash(uv run python scripts/*:*, uv sync:*, mkdir:*, open:*), Grep, Glob
 ---
 
 Generate a research poster summarizing the experiment campaign as a single self-contained HTML file. Adapted from [posterskill](https://github.com/ethanweber/posterskill)'s architecture — no build step, works when opened as `file://`.
@@ -15,8 +15,8 @@ Read the experiment history and project context:
 
 ```bash
 cat config.yaml
-
-
+uv run python scripts/generate_brief.py
+uv run python scripts/show_metrics.py --last 20
 cat experiment_state.yaml 2>/dev/null || true
 cat RESEARCH_PLAN.md 2>/dev/null || true
 ```

package/commands/postmortem.md
CHANGED

@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 When experiments stop improving, find out why. Diagnoses search space exhaustion, config errors, data issues, metric ceilings, and noise floors.
 
 ## Steps
-1. `
-2. `python scripts/failure_postmortem.py $ARGUMENTS`
+1. `uv sync`
+2. `uv run python scripts/failure_postmortem.py $ARGUMENTS`
 3. **Saved:** `experiments/postmortems/`
 
 ## Diagnosis categories

package/commands/preflight.md
CHANGED

@@ -2,28 +2,28 @@
 name: preflight
 description: Pre-flight resource check — estimates VRAM, RAM, and disk requirements before running ML training. Compares against available system resources and issues PASS/WARN/FAIL verdict. Use before training to catch OOM errors before they happen.
 argument-hint: "[--model-type torch] [--params 10M] [--batch-size 32]"
-allowed-tools: Read, Bash(python scripts/*:*,
+allowed-tools: Read, Bash(uv run python scripts/*:*, uv sync:*, nvidia-smi:*), Grep, Glob
 ---
 
 Check whether the current system has enough resources to run the planned experiment.
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Run preflight check:**
 
 If `$ARGUMENTS` is empty (auto-detect from config.yaml):
 ```bash
-python scripts/preflight.py
+uv run python scripts/preflight.py
 ```
 
 If `$ARGUMENTS` contains flags:
 ```bash
-python scripts/preflight.py $ARGUMENTS
+uv run python scripts/preflight.py $ARGUMENTS
 ```
 
 3. **Interpret the verdict:**

package/commands/present.md
CHANGED

@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 Generate presentation-ready figure specifications from experiment data in seconds.
 
 ## Steps
-1. **
-2. **Run:** `python scripts/generate_figures.py $ARGUMENTS`
+1. **Sync environment:** `uv sync`
+2. **Run:** `uv run python scripts/generate_figures.py $ARGUMENTS`
 3. **Figure types:** training, comparison, ablation, pareto, sensitivity
 4. **Styles:** light (papers), dark (demos), poster (large fonts)
 5. **Saved output:** `paper/figures/`

package/commands/profile.md
CHANGED

@@ -9,9 +9,9 @@ Profile a training run to identify performance bottlenecks.
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**
@@ -20,7 +20,7 @@ Profile a training run to identify performance bottlenecks.
 
 3. **Run profiling:**
 ```bash
-python scripts/profile_training.py $ARGUMENTS
+uv run python scripts/profile_training.py $ARGUMENTS
 ```
 
 4. **Report results:**

package/commands/prune.md
CHANGED

@@ -9,8 +9,8 @@ Remove redundant weights for faster inference and smaller models.
 
 ## Steps
 
-1. **
-2. **Run:** `python scripts/model_pruning.py $ARGUMENTS`
+1. **Sync environment:** `uv sync`
+2. **Run:** `uv run python scripts/model_pruning.py $ARGUMENTS`
 3. **Methods:** magnitude (zero small weights), structured (remove neurons), lottery (iterative with rewind)
 4. **For tree models:** progressively reduces n_estimators
 5. **Report:** sparsity sweep table, knee point, recommended sparsity

package/commands/quantize.md
CHANGED

@@ -9,8 +9,8 @@ Quantize for production. Lowest-effort optimization: 2-4x speedup, 2-4x memory r
 
 ## Steps
 
-1. **
-2. **Run:** `python scripts/model_quantization.py $ARGUMENTS`
+1. **Sync environment:** `uv sync`
+2. **Run:** `uv run python scripts/model_quantization.py $ARGUMENTS`
 3. **Precision levels:** FP32 (baseline), FP16 (GPU), INT8 dynamic (simplest), INT8 static (best accuracy)
 4. **Report:** precision comparison table, recommended level, QAT suggestion if needed
 5. **Saved output:** `experiments/quantization/<exp-id>-quantization.yaml`

package/commands/queue.md
CHANGED

@@ -9,9 +9,9 @@ Manage the experiment queue for unattended batch execution.
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**
@@ -23,7 +23,7 @@ Manage the experiment queue for unattended batch execution.
 
 3. **Run queue manager:**
 ```bash
-python scripts/experiment_queue.py $ARGUMENTS
+uv run python scripts/experiment_queue.py $ARGUMENTS
 ```
 
 4. **Report results by action:**

package/commands/registry.md
CHANGED

@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 Track which model is production, staging, candidate, or archived. Promotion requires passing gates.
 
 ## Steps
-1. `
-2. `python scripts/model_lifecycle.py $ARGUMENTS`
+1. `uv sync`
+2. `uv run python scripts/model_lifecycle.py $ARGUMENTS`
 3. **Registry:** `experiments/registry.yaml`
 
 ## Promotion gates

package/commands/regress.md
CHANGED

@@ -9,9 +9,9 @@ CI for your model. After any change to code, dependencies, or data, verify metri
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**
@@ -23,7 +23,7 @@ CI for your model. After any change to code, dependencies, or data, verify metri
 
 3. **Run regression gate:**
 ```bash
-python scripts/regression_gate.py $ARGUMENTS
+uv run python scripts/regression_gate.py $ARGUMENTS
 ```
 
 4. **Report results:**

package/commands/replay.md
CHANGED

@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 Should you revisit old ideas? Infrastructure changes may make failed approaches work now.
 
 ## Steps
-1. **
-2. **Run:** `python scripts/experiment_replay.py $ARGUMENTS`
+1. **Sync environment:** `uv sync`
+2. **Run:** `uv run python scripts/experiment_replay.py $ARGUMENTS`
 3. **Modes:** default (current code+data), --with-current-data, --with-current-preprocessing
 4. **Report:** original vs replayed metrics, delta, verdict
 5. **Saved output:** `experiments/replays/`

package/commands/report.md
CHANGED

@@ -2,7 +2,7 @@
 name: report
 description: Generate a markdown research report from experiment history — structured for sharing, archiving, or including in documentation. More detailed than a brief, less visual than a poster.
 argument-hint: "[--since YYYY-MM-DD] [--output path]"
-allowed-tools: Read, Bash(python scripts/*:*,
+allowed-tools: Read, Bash(uv run python scripts/*:*, uv sync:*, mkdir:*), Grep, Glob
 ---
 
 Generate a structured markdown research report summarizing the experiment campaign.
@@ -14,12 +14,12 @@ Generate a structured markdown research report summarizing the experiment campai
 Use the logbook generator in markdown mode as the data backbone:
 
 ```bash
-
+uv run python scripts/generate_logbook.py --format markdown
 ```
 
 Also gather supplementary data:
 ```bash
-
+uv run python scripts/generate_brief.py
 cat experiment_state.yaml 2>/dev/null || true
 cat RESEARCH_PLAN.md 2>/dev/null || true
 ```

package/commands/reproduce.md
CHANGED

@@ -9,9 +9,9 @@ Verify that a logged experiment can be reproduced with consistent results.
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**
@@ -22,7 +22,7 @@ Verify that a logged experiment can be reproduced with consistent results.
 
 3. **Run reproducibility verification:**
 ```bash
-python scripts/reproduce_experiment.py $ARGUMENTS
+uv run python scripts/reproduce_experiment.py $ARGUMENTS
 ```
 
 4. **Report results:**

package/commands/retry.md
CHANGED

@@ -9,9 +9,9 @@ Auto-diagnose and recover from experiment failures.
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**
@@ -21,7 +21,7 @@ Auto-diagnose and recover from experiment failures.
 
 3. **Run smart retry:**
 ```bash
-python scripts/smart_retry.py $ARGUMENTS
+uv run python scripts/smart_retry.py $ARGUMENTS
 ```
 
 4. **Report results:**

package/commands/review.md
CHANGED

@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 Simulate a conference reviewer before you submit. Each weakness links to the command that fixes it.
 
 ## Steps
-1. `
-2. `python scripts/simulate_review.py $ARGUMENTS`
+1. `uv sync`
+2. `uv run python scripts/simulate_review.py $ARGUMENTS`
 3. **Saved:** `experiments/reviews/`
 
 ## Examples

package/commands/rules/loop-protocol.md
CHANGED

@@ -16,9 +16,9 @@ The autoresearch harness enforces a strict separation between the **hypothesis s
 
 ## Execution Rules
 
-- **ALWAYS redirect training output:** `python train.py > run.log 2>&1`
+- **ALWAYS redirect training output:** `uv run python train.py > run.log 2>&1`
 - **ALWAYS parse metrics with grep** between `---` delimiters: `grep -A 10 "^---" run.log | head -10`
-- **ALWAYS
+- **ALWAYS run Python through uv:** `uv run python ...`
 - **NEVER install new packages** without human approval
 
 ## Git Discipline
@@ -40,16 +40,16 @@ The autoresearch harness enforces a strict separation between the **hypothesis s
 
 ## Sweep Workflow
 
-1. Generate queue: `python scripts/sweep.py`
-2. Check status: `python scripts/sweep.py --status`
-3. Get next: `python scripts/sweep.py --next`
+1. Generate queue: `uv run python scripts/sweep.py`
+2. Check status: `uv run python scripts/sweep.py --status`
+3. Get next: `uv run python scripts/sweep.py --next`
 4. Apply overrides, create branch, run training
-5. Mark: `python scripts/sweep.py --mark <name> complete|failed`
+5. Mark: `uv run python scripts/sweep.py --mark <name> complete|failed`
 6. Repeat until queue is empty
 
 ## Logging Rules
 
-- **Log every experiment** to `experiments/log.jsonl` via `python scripts/log_experiment.py` — kept and discarded alike.
+- **Log every experiment** to `experiments/log.jsonl` via `uv run python scripts/log_experiment.py` — kept and discarded alike.
 - **Include all metrics, config, and description** of the hypothesis and its outcome.
 
 ## Convergence Rules
@@ -64,11 +64,11 @@ The researcher agent's Bash access is restricted to a whitelist of necessary com
 
 | Allowed Pattern | Purpose |
 |-----------------|---------|
-| `python train.py:*` | Execute training |
-| `python scripts/*:*` | Run utility scripts (logging, metrics, sweep) |
+| `uv run python train.py:*` | Execute training |
+| `uv run python scripts/*:*` | Run utility scripts (logging, metrics, sweep) |
 | `git:*` | Branch, commit, merge, reset operations |
-| `
-| `
+| `uv sync:*` | Virtual environment activation |
+| `uv add:*` | Package installation (requires human approval) |
 
 **Blocked by omission:** `cat`, `head`, `tail`, `less` (prevents reading hidden files via shell), `curl`, `wget` (prevents data exfiltration), arbitrary command execution.

|
package/commands/sanity.md
CHANGED
|
@@ -9,9 +9,9 @@ Run a battery of fast checks before committing to a full training run. Catches w
|
|
|
9
9
|
|
|
10
10
|
## Steps
|
|
11
11
|
|
|
12
|
-
1. **
|
|
12
|
+
1. **Sync environment:**
|
|
13
13
|
```bash
|
|
14
|
-
|
|
14
|
+
uv sync
|
|
15
15
|
```
|
|
16
16
|
|
|
17
17
|
2. **Parse arguments from `$ARGUMENTS`:**
|
|
@@ -21,7 +21,7 @@ Run a battery of fast checks before committing to a full training run. Catches w
|
|
|
21
21
|
|
|
22
22
|
3. **Run sanity checks:**
|
|
23
23
|
```bash
|
|
24
|
-
python scripts/sanity_checks.py $ARGUMENTS
|
|
24
|
+
uv run python scripts/sanity_checks.py $ARGUMENTS
|
|
25
25
|
```
|
|
26
26
|
|
|
27
27
|
4. **Checks performed:**
|
package/commands/scale.md
CHANGED
|
@@ -9,9 +9,9 @@ Predict full-scale performance from a handful of small experiments. Answers "is
|
|
|
9
9
|
|
|
10
10
|
## Steps
|
|
11
11
|
|
|
12
|
-
1. **
|
|
12
|
+
1. **Sync environment:**
|
|
13
13
|
```bash
|
|
14
|
-
|
|
14
|
+
uv sync
|
|
15
15
|
```
|
|
16
16
|
|
|
17
17
|
2. **Parse arguments from `$ARGUMENTS`:**
|
|
@@ -24,11 +24,11 @@ Predict full-scale performance from a handful of small experiments. Answers "is
|
|
|
24
24
|
3. **Plan or analyze:**
|
|
25
25
|
- **Plan mode (default):** generates scale point configs to run
|
|
26
26
|
```bash
|
|
27
|
-
python scripts/scaling_estimator.py --axis data --points 4
|
|
27
|
+
uv run python scripts/scaling_estimator.py --axis data --points 4
|
|
28
28
|
```
|
|
29
29
|
- **Analyze mode:** fits power law to completed results
|
|
30
30
|
```bash
|
|
31
|
-
python scripts/scaling_estimator.py --analyze experiments/scaling/results.yaml
|
|
31
|
+
uv run python scripts/scaling_estimator.py --analyze experiments/scaling/results.yaml
|
|
32
32
|
```
|
|
33
33
|
|
|
34
34
|
4. **Scaling axes:**
|
package/commands/search.md
CHANGED

@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 Find specific experiments in a large history with natural language and structured filters.
 
 ## Steps
-1. **
-2. **Run:** `python scripts/experiment_search.py $ARGUMENTS`
+1. **Sync environment:** `uv sync`
+2. **Run:** `uv run python scripts/experiment_search.py $ARGUMENTS`
 3. **Filters:** `accuracy>0.85`, `status:kept`, `family:baseline`, `date:last-week`
 4. **Report:** ranked table of matching experiments

package/commands/seed.md
CHANGED

@@ -9,9 +9,9 @@ Run a multi-seed study to verify that experiment results are robust across rando
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**
@@ -22,7 +22,7 @@ Run a multi-seed study to verify that experiment results are robust across rando
 
 3. **Run seed study:**
 ```bash
-python scripts/seed_runner.py $ARGUMENTS
+uv run python scripts/seed_runner.py $ARGUMENTS
 ```
 
 4. **Report results:**

|
package/commands/sensitivity.md
CHANGED
|
@@ -9,9 +9,9 @@ Which hyperparameters actually matter? Stop wasting time on the ones that don't.
|
|
|
9
9
|
|
|
10
10
|
## Steps
|
|
11
11
|
|
|
12
|
-
1. **
|
|
12
|
+
1. **Sync environment:**
|
|
13
13
|
```bash
|
|
14
|
-
|
|
14
|
+
uv sync
|
|
15
15
|
```
|
|
16
16
|
|
|
17
17
|
2. **Parse arguments from `$ARGUMENTS`:**
|
|
@@ -21,7 +21,7 @@ Which hyperparameters actually matter? Stop wasting time on the ones that don't.
|
|
|
21
21
|
|
|
22
22
|
3. **Run sensitivity analysis:**
|
|
23
23
|
```bash
|
|
24
|
-
python scripts/sensitivity_analysis.py $ARGUMENTS
|
|
24
|
+
uv run python scripts/sensitivity_analysis.py $ARGUMENTS
|
|
25
25
|
```
|
|
26
26
|
|
|
27
27
|
4. **Report includes:**
|
package/commands/share.md
CHANGED

@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 Package experiments for collaborator handoff or paper supplementary material.
 
 ## Steps
-1. `
-2. `python scripts/package_experiments.py $ARGUMENTS`
+1. `uv sync`
+2. `uv run python scripts/package_experiments.py $ARGUMENTS`
 3. **Saved:** `exports/packages/<name>/`
 
 ## Examples

package/commands/simulate.md
CHANGED

@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 Predict outcomes before spending compute. Ranks proposed configs and recommends which to run vs skip.
 
 ## Steps
-1. `
-2. `python scripts/experiment_simulator.py $ARGUMENTS`
+1. `uv sync`
+2. `uv run python scripts/experiment_simulator.py $ARGUMENTS`
 3. **Saved:** `experiments/simulations/`
 
 ## How it works