claude-turing 4.8.0 → 4.8.1
- package/.claude-plugin/plugin.json +1 -1
- package/README.md +1 -1
- package/agents/ml-evaluator.md +4 -4
- package/agents/ml-researcher.md +2 -2
- package/bin/turing-init.sh +2 -2
- package/commands/ablate.md +3 -3
- package/commands/annotate.md +2 -2
- package/commands/archive.md +2 -2
- package/commands/audit.md +3 -3
- package/commands/baseline.md +3 -3
- package/commands/brief.md +5 -5
- package/commands/budget.md +3 -3
- package/commands/calibrate.md +3 -3
- package/commands/card.md +3 -3
- package/commands/changelog.md +2 -2
- package/commands/checkpoint.md +3 -3
- package/commands/cite.md +2 -2
- package/commands/compare.md +1 -1
- package/commands/counterfactual.md +2 -2
- package/commands/curriculum.md +3 -3
- package/commands/design.md +3 -3
- package/commands/diagnose.md +4 -4
- package/commands/diff.md +3 -3
- package/commands/distill.md +3 -3
- package/commands/doctor.md +2 -2
- package/commands/ensemble.md +3 -3
- package/commands/explore.md +4 -4
- package/commands/export.md +3 -3
- package/commands/feature.md +3 -3
- package/commands/flashback.md +2 -2
- package/commands/fork.md +3 -3
- package/commands/frontier.md +3 -3
- package/commands/init.md +5 -5
- package/commands/leak.md +3 -3
- package/commands/lit.md +3 -3
- package/commands/logbook.md +5 -5
- package/commands/merge.md +2 -2
- package/commands/mode.md +1 -1
- package/commands/onboard.md +2 -2
- package/commands/paper.md +3 -3
- package/commands/plan.md +2 -2
- package/commands/poster.md +3 -3
- package/commands/postmortem.md +2 -2
- package/commands/preflight.md +5 -5
- package/commands/present.md +2 -2
- package/commands/profile.md +3 -3
- package/commands/prune.md +2 -2
- package/commands/quantize.md +2 -2
- package/commands/queue.md +3 -3
- package/commands/registry.md +2 -2
- package/commands/regress.md +3 -3
- package/commands/replay.md +2 -2
- package/commands/report.md +3 -3
- package/commands/reproduce.md +3 -3
- package/commands/retry.md +3 -3
- package/commands/review.md +2 -2
- package/commands/rules/loop-protocol.md +11 -11
- package/commands/sanity.md +3 -3
- package/commands/scale.md +4 -4
- package/commands/search.md +2 -2
- package/commands/seed.md +3 -3
- package/commands/sensitivity.md +3 -3
- package/commands/share.md +2 -2
- package/commands/simulate.md +2 -2
- package/commands/status.md +1 -1
- package/commands/stitch.md +3 -3
- package/commands/suggest.md +5 -5
- package/commands/surgery.md +2 -2
- package/commands/sweep.md +8 -8
- package/commands/template.md +2 -2
- package/commands/train.md +5 -5
- package/commands/transfer.md +3 -3
- package/commands/trend.md +2 -2
- package/commands/try.md +4 -4
- package/commands/update.md +2 -2
- package/commands/validate.md +4 -4
- package/commands/warm.md +3 -3
- package/commands/watch.md +4 -4
- package/commands/whatif.md +2 -2
- package/commands/xray.md +3 -3
- package/config/commands.yaml +1 -1
- package/package.json +1 -1
- package/skills/turing/ablate/SKILL.md +3 -3
- package/skills/turing/annotate/SKILL.md +2 -2
- package/skills/turing/archive/SKILL.md +2 -2
- package/skills/turing/audit/SKILL.md +3 -3
- package/skills/turing/baseline/SKILL.md +3 -3
- package/skills/turing/brief/SKILL.md +5 -5
- package/skills/turing/budget/SKILL.md +3 -3
- package/skills/turing/calibrate/SKILL.md +3 -3
- package/skills/turing/card/SKILL.md +3 -3
- package/skills/turing/changelog/SKILL.md +2 -2
- package/skills/turing/checkpoint/SKILL.md +3 -3
- package/skills/turing/cite/SKILL.md +2 -2
- package/skills/turing/compare/SKILL.md +1 -1
- package/skills/turing/counterfactual/SKILL.md +2 -2
- package/skills/turing/curriculum/SKILL.md +3 -3
- package/skills/turing/design/SKILL.md +3 -3
- package/skills/turing/diagnose/SKILL.md +4 -4
- package/skills/turing/diff/SKILL.md +3 -3
- package/skills/turing/distill/SKILL.md +3 -3
- package/skills/turing/doctor/SKILL.md +2 -2
- package/skills/turing/ensemble/SKILL.md +3 -3
- package/skills/turing/explore/SKILL.md +4 -4
- package/skills/turing/export/SKILL.md +3 -3
- package/skills/turing/feature/SKILL.md +3 -3
- package/skills/turing/flashback/SKILL.md +2 -2
- package/skills/turing/fork/SKILL.md +3 -3
- package/skills/turing/frontier/SKILL.md +3 -3
- package/skills/turing/init/SKILL.md +5 -5
- package/skills/turing/leak/SKILL.md +3 -3
- package/skills/turing/lit/SKILL.md +3 -3
- package/skills/turing/logbook/SKILL.md +5 -5
- package/skills/turing/merge/SKILL.md +2 -2
- package/skills/turing/mode/SKILL.md +1 -1
- package/skills/turing/onboard/SKILL.md +2 -2
- package/skills/turing/paper/SKILL.md +3 -3
- package/skills/turing/plan/SKILL.md +2 -2
- package/skills/turing/poster/SKILL.md +3 -3
- package/skills/turing/postmortem/SKILL.md +2 -2
- package/skills/turing/preflight/SKILL.md +5 -5
- package/skills/turing/present/SKILL.md +2 -2
- package/skills/turing/profile/SKILL.md +3 -3
- package/skills/turing/prune/SKILL.md +2 -2
- package/skills/turing/quantize/SKILL.md +2 -2
- package/skills/turing/queue/SKILL.md +3 -3
- package/skills/turing/registry/SKILL.md +2 -2
- package/skills/turing/regress/SKILL.md +3 -3
- package/skills/turing/replay/SKILL.md +2 -2
- package/skills/turing/report/SKILL.md +3 -3
- package/skills/turing/reproduce/SKILL.md +3 -3
- package/skills/turing/retry/SKILL.md +3 -3
- package/skills/turing/review/SKILL.md +2 -2
- package/skills/turing/rules/loop-protocol.md +11 -11
- package/skills/turing/sanity/SKILL.md +3 -3
- package/skills/turing/scale/SKILL.md +4 -4
- package/skills/turing/search/SKILL.md +2 -2
- package/skills/turing/seed/SKILL.md +3 -3
- package/skills/turing/sensitivity/SKILL.md +3 -3
- package/skills/turing/share/SKILL.md +2 -2
- package/skills/turing/simulate/SKILL.md +2 -2
- package/skills/turing/status/SKILL.md +1 -1
- package/skills/turing/stitch/SKILL.md +3 -3
- package/skills/turing/suggest/SKILL.md +5 -5
- package/skills/turing/surgery/SKILL.md +2 -2
- package/skills/turing/sweep/SKILL.md +8 -8
- package/skills/turing/template/SKILL.md +2 -2
- package/skills/turing/train/SKILL.md +5 -5
- package/skills/turing/transfer/SKILL.md +3 -3
- package/skills/turing/trend/SKILL.md +2 -2
- package/skills/turing/try/SKILL.md +4 -4
- package/skills/turing/update/SKILL.md +2 -2
- package/skills/turing/validate/SKILL.md +4 -4
- package/skills/turing/warm/SKILL.md +3 -3
- package/skills/turing/watch/SKILL.md +4 -4
- package/skills/turing/whatif/SKILL.md +2 -2
- package/skills/turing/xray/SKILL.md +3 -3
- package/templates/README.md +5 -8
- package/templates/program.md +18 -18
- package/templates/pyproject.toml +10 -0
- package/templates/requirements.txt +4 -1
- package/templates/scripts/generate_onboarding.py +1 -1
- package/templates/scripts/post-train-hook.sh +7 -8
- package/templates/scripts/scaffold.py +24 -26
- package/templates/scripts/stop-hook.sh +2 -3
- package/templates/scripts/turing-run-python.sh +9 -0
package/.claude-plugin/plugin.json
CHANGED
@@ -1,6 +1,6 @@
 {
 "name": "turing",
-"version": "4.8.
+"version": "4.8.1",
 "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 74 commands, 2 specialized agents, skills/turing source layout, operational intelligence (postmortem + doctor + plan), model lifecycle (update + registry), what-if analysis (whatif + counterfactual + simulate), collaboration (onboard + share + review), research communication (cite + present + changelog), experiment archaeology (trend + flashback + archive + annotate + search + template + replay), model surgery (prune + quantize + merge + surgery), feature & training intelligence, model debugging, pre-training intelligence, meta-intelligence, scaling & efficiency, model composition, deep analysis, experiment orchestration, literature + paper, model export, profiling, checkpoints, experiment intelligence, statistical rigor, tree-search, cost-performance, model cards, hypothesis database, novelty guard, anti-cheating, taste-leverage loop. Inspired by Karpathy's autoresearch and the scientific method itself.",
 "author": {
 "name": "Prannaya Gupta"
package/README.md
CHANGED
@@ -3,7 +3,7 @@
 *The research assistant that can't fool itself.*
 
 <p align="center">
-<img src="https://img.shields.io/badge/version-4.8.
+<img src="https://img.shields.io/badge/version-4.8.1-ffb74d?style=flat-square&labelColor=1a1a2e" alt="Version" />
 <img src="https://img.shields.io/badge/license-MIT-ff4d4d?style=flat-square&labelColor=1a1a2e" alt="License" />
 <img src="https://img.shields.io/badge/Claude_Code-plugin-ff4d4d?style=flat-square&labelColor=1a1a2e" alt="Claude Code" />
 <img src="https://img.shields.io/badge/Node.js-20%2B-ff4d4d?style=flat-square&labelColor=1a1a2e" alt="Node.js" />
package/agents/ml-evaluator.md
CHANGED
@@ -22,13 +22,13 @@ In quantum mechanics, observation changes the system. In ML experimentation, the
 
 ## Useful Commands
 
-Always
+Always run Python through uv from the ML directory.
 
 | Command | Purpose |
 |---------|---------|
-| `python scripts/show_metrics.py --last 10` | Recent experiment summary |
-| `python scripts/compare_runs.py <a> <b>` | Side-by-side comparison |
-| `python evaluate.py` | Run evaluation on current model |
+| `uv run python scripts/show_metrics.py --last 10` | Recent experiment summary |
+| `uv run python scripts/compare_runs.py <a> <b>` | Side-by-side comparison |
+| `uv run python evaluate.py` | Run evaluation on current model |
 | `cat experiments/results.tsv` | Quick-reference TSV |
 
 ## Analysis Framework
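The command table above replaces bare `python` invocations with `uv run python`. A minimal sketch of that prefixing rule, assuming a wrapper that builds argument lists for later execution; `uv_python_command` and `as_shell` are hypothetical helper names, not part of the package:

```python
import shlex


def uv_python_command(script, *args):
    """Build the argv for running a script via `uv run python` (hypothetical helper)."""
    return ["uv", "run", "python", script, *args]


def as_shell(argv):
    """Render an argv list as a single shell line, quoting where needed."""
    return shlex.join(argv)


cmd = uv_python_command("scripts/show_metrics.py", "--last", "10")
print(as_shell(cmd))
```

Building an argv list rather than a string keeps arguments with spaces intact if the list is later handed to `subprocess.run`.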
package/agents/ml-researcher.md
CHANGED
@@ -27,8 +27,8 @@ Read `program.md` in the ML directory for the complete experiment loop protocol.
 ## Constraints
 
 - **Only modify `train.py` and `config.yaml`.** `evaluate.py` is HIDDEN (do not read or reference). Other pipeline files are READ-ONLY.
-- **Always
-- **Redirect training output:** `python train.py > run.log 2>&1`
+- **Always run Python through uv:** `uv run python ...`
+- **Redirect training output:** `uv run python train.py > run.log 2>&1`
 - **Parse metrics with grep:** `grep -A 10 "^---" run.log | head -10`
 - **Use @ml-evaluator** for analysis tasks — it has no Write/Edit tools and cannot accidentally break the pipeline.
 
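The metric-parsing constraint above uses `grep -A 10 "^---" run.log | head -10`, which keeps the first line starting with `---` plus the nine lines that follow it. A rough Python equivalent, assuming `run.log` is plain text; the sample log content and the name `metrics_block` are illustrative assumptions:

```python
def metrics_block(log_text, marker="---", limit=10):
    """Return the first line starting with `marker` plus following lines, `limit` lines in total."""
    lines = log_text.splitlines()
    for index, line in enumerate(lines):
        if line.startswith(marker):
            return lines[index:index + limit]
    return []  # no marker line found, like grep printing nothing


# Hypothetical log: the real layout of run.log is not shown in this diff.
sample = "epoch 1\n---\naccuracy: 0.91\nf1: 0.88\n"
print(metrics_block(sample))
```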
package/bin/turing-init.sh
CHANGED
@@ -33,8 +33,8 @@ echo ""
 if [[ $# -eq 0 ]] || [[ "${1:-}" == "--interactive" ]]; then
 python3 "$SCAFFOLD_SCRIPT" --interactive --templates-dir "$TEMPLATES_DIR" --no-venv
 echo ""
-echo " To set up the
-echo " cd <ml_dir> &&
+echo " To set up the uv environment:"
+echo " cd <ml_dir> && uv sync"
 exit 0
 fi
 
package/commands/ablate.md
CHANGED
@@ -9,9 +9,9 @@ Run a systematic ablation study to measure the contribution of each model compon
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**
@@ -22,7 +22,7 @@ Run a systematic ablation study to measure the contribution of each model compon
 
 3. **Run ablation study:**
 ```bash
-python scripts/ablation_study.py $ARGUMENTS
+uv run python scripts/ablation_study.py $ARGUMENTS
 ```
 
 4. **Report results:**
package/commands/annotate.md
CHANGED
@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 Add context that experiment logs can't capture. "This only worked because the data was pre-sorted."
 
 ## Steps
-1. **
-2. **Run:** `python scripts/experiment_annotations.py $ARGUMENTS`
+1. **Sync environment:** `uv sync`
+2. **Run:** `uv run python scripts/experiment_annotations.py $ARGUMENTS`
 3. **Operations:** add (text + tags), list (per-experiment or all), search (keyword or tag)
 4. **Stored in:** `experiments/annotations.yaml`
 
package/commands/archive.md
CHANGED
@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 Keep your project directory manageable after 200+ experiments.
 
 ## Steps
-1. **
-2. **Run:** `python scripts/experiment_archive.py $ARGUMENTS`
+1. **Sync environment:** `uv sync`
+2. **Run:** `uv run python scripts/experiment_archive.py $ARGUMENTS`
 3. **Protected experiments:** Pareto-optimal, current best, recent, top-N by metric
 4. **Report:** archived count, preserved count, space reclaimed
 5. **Saved output:** `experiments/archive/index.yaml`
package/commands/audit.md
CHANGED
@@ -9,9 +9,9 @@ A reviewer checklist you run before submitting. Catches methodology mistakes tha
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**
@@ -21,7 +21,7 @@ A reviewer checklist you run before submitting. Catches methodology mistakes tha
 
 3. **Run methodology audit:**
 ```bash
-python scripts/methodology_audit.py $ARGUMENTS
+uv run python scripts/methodology_audit.py $ARGUMENTS
 ```
 
 4. **Checks performed:**
package/commands/baseline.md
CHANGED
@@ -9,9 +9,9 @@ Generate trivial baselines so you always know if your model is meaningfully bett
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**
@@ -21,7 +21,7 @@ Generate trivial baselines so you always know if your model is meaningfully bett
 
 3. **Run baseline generation:**
 ```bash
-python scripts/generate_baselines.py $ARGUMENTS
+uv run python scripts/generate_baselines.py $ARGUMENTS
 ```
 
 4. **Baselines generated:**
package/commands/brief.md
CHANGED
@@ -2,7 +2,7 @@
 name: brief
 description: Generate a structured research intelligence report from experiment history — what's been learned, what's promising, what's exhausted, and what the human should consider next. Use --deep for literature-grounded suggestions.
 argument-hint: "[ml/project] [--deep]"
-allowed-tools: Read, Bash(python scripts/*:*,
+allowed-tools: Read, Bash(uv run python scripts/*:*, uv sync:*), Grep, Glob, WebSearch, WebFetch
 ---
 
 Generate a research briefing that a human can read in 2 minutes and immediately decide what to inject next.
@@ -23,14 +23,14 @@ Before generating the briefing, detect which project to report on:
 
 1. **Generate the briefing:**
 ```bash
-
+uv run python scripts/generate_brief.py
 ```
 
 2. **Self-critique the briefing** before presenting. Review the generated output and check:
 - **Recommendations specificity:** Are they concrete enough to act on? "Try a different model" is bad. "Try LightGBM with leaf-wise growth because exp-004 showed depth sensitivity" is good. If vague, rewrite them with specific model/hyperparameter suggestions grounded in the experiment data.
 - **Exhausted directions coverage:** Cross-reference the "Model Types Explored" section against `experiments/log.jsonl`. Are there discarded experiments missing from the summary? If so, add them.
 - **Convergence estimate grounding:** If the briefing says "close to convergence" or "further improvement possible", verify against the actual metric trajectory. Is the claim supported by the numbers?
-- **Metric accuracy:** Spot-check that the "Current Best" metrics match the actual log. Run `python scripts/show_metrics.py --last 1` if uncertain.
+- **Metric accuracy:** Spot-check that the "Current Best" metrics match the actual log. Run `uv run python scripts/show_metrics.py --last 1` if uncertain.
 
 If any section fails the check, regenerate just that section. Max 1 revision round — don't over-polish.
 
@@ -75,7 +75,7 @@ When `--deep` is requested, add a 7th section: **Literature-Grounded Suggestions
 
 4. **Queue suggestions** as hypotheses:
 ```bash
-
+uv run python scripts/manage_hypotheses.py add "<technique>: <rationale> (source: <citation>)" --priority medium --source literature
 ```
 
 5. **Format as a section** appended to the briefing.
@@ -83,7 +83,7 @@ When `--deep` is requested, add a 7th section: **Literature-Grounded Suggestions
 ## Saving Briefs
 
 ```bash
-mkdir -p briefs && python scripts/generate_brief.py > briefs/brief-$(date +%Y-%m-%d).md
+mkdir -p briefs && uv run python scripts/generate_brief.py > briefs/brief-$(date +%Y-%m-%d).md
 ```
 
 ## When to Use
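The `Saving Briefs` command above names its output file after the current date through `$(date +%Y-%m-%d)`. A small sketch of the same naming scheme in Python, assuming the `briefs/` directory from that command; `brief_path` is a hypothetical helper, not part of the package:

```python
from datetime import date
from pathlib import Path


def brief_path(day=None, directory="briefs"):
    """Return <directory>/brief-YYYY-MM-DD.md for the given day (default: today)."""
    day = day or date.today()
    return Path(directory) / f"brief-{day:%Y-%m-%d}.md"


print(brief_path(date(2025, 1, 31)).as_posix())
```

Passing the date explicitly makes the file name reproducible, which can help when regenerating a brief for an earlier session.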
package/commands/budget.md
CHANGED
@@ -9,9 +9,9 @@ Set a compute ceiling and let the system optimize within it. Prevents runaway ex
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**
@@ -22,7 +22,7 @@ Set a compute ceiling and let the system optimize within it. Prevents runaway ex
 
 3. **Run budget manager:**
 ```bash
-python scripts/budget_manager.py $ARGUMENTS
+uv run python scripts/budget_manager.py $ARGUMENTS
 ```
 
 4. **Actions:**
package/commands/calibrate.md
CHANGED
@@ -9,9 +9,9 @@ Make model probabilities trustworthy. Does 80% confidence actually mean 80% corr
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**
@@ -21,7 +21,7 @@ Make model probabilities trustworthy. Does 80% confidence actually mean 80% corr
 
 3. **Run calibration:**
 ```bash
-python scripts/calibration.py $ARGUMENTS
+uv run python scripts/calibration.py $ARGUMENTS
 ```
 
 4. **Report includes:**
package/commands/card.md
CHANGED
@@ -1,7 +1,7 @@
 ---
 name: card
 description: Generate a standardized model card documenting the trained model — type, performance, training data, limitations, intended use, and artifact contract.
-allowed-tools: Read, Bash(python scripts/*:*,
+allowed-tools: Read, Bash(uv run python scripts/*:*, uv sync:*), Grep, Glob
 ---
 
 You generate a standardized model card from the experiment log, model contract, and config.
@@ -10,12 +10,12 @@ You generate a standardized model card from the experiment log, model contract,
 
 1. **Activate the virtual environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Run the model card generator:**
 ```bash
-python scripts/generate_model_card.py --config config.yaml --log experiments/log.jsonl --contract model_contract.md --output MODEL_CARD.md
+uv run python scripts/generate_model_card.py --config config.yaml --log experiments/log.jsonl --contract model_contract.md --output MODEL_CARD.md
 ```
 
 3. **Read and present the generated card:**
package/commands/changelog.md
CHANGED
@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 Translate experiment logs into a narrative that PMs and stakeholders can read in 2 minutes.
 
 ## Steps
-1. **
-2. **Run:** `python scripts/generate_changelog.py $ARGUMENTS`
+1. **Sync environment:** `uv sync`
+2. **Run:** `uv run python scripts/generate_changelog.py $ARGUMENTS`
 3. **Audience:** technical (experiment IDs, configs), stakeholder (plain English, percentages)
 4. **Saved output:** `paper/CHANGELOG.md`
 
package/commands/checkpoint.md
CHANGED
@@ -9,9 +9,9 @@ Manage model checkpoints intelligently using Pareto dominance.
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**
@@ -22,7 +22,7 @@ Manage model checkpoints intelligently using Pareto dominance.
 
 3. **Run checkpoint manager:**
 ```bash
-python scripts/checkpoint_manager.py $ARGUMENTS
+uv run python scripts/checkpoint_manager.py $ARGUMENTS
 ```
 
 4. **Report results by action:**
package/commands/cite.md
CHANGED
@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 Track which papers and methods influenced each experiment. Catch missing citations before submission.
 
 ## Steps
-1. **
-2. **Run:** `python scripts/citation_manager.py $ARGUMENTS`
+1. **Sync environment:** `uv sync`
+2. **Run:** `uv run python scripts/citation_manager.py $ARGUMENTS`
 3. **Operations:** add (associate citation with experiment), list (group by type), check (audit missing), bib (BibTeX)
 4. **Stored in:** `experiments/citations.yaml`
 
package/commands/compare.md
CHANGED
@@ -11,7 +11,7 @@ Compare two ML experiment runs side-by-side to understand what changed and why o
 
 1. **Run comparison:**
 ```bash
-
+uv run python scripts/compare_runs.py $0 $1
 ```
 
 2. **Analyze the delta:**
package/commands/counterfactual.md
CHANGED
@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 What would need to change to flip this prediction? Minimum-change counterfactual for individual predictions.
 
 ## Steps
-1. `
-2. `python scripts/counterfactual_explanation.py $ARGUMENTS`
+1. `uv sync`
+2. `uv run python scripts/counterfactual_explanation.py $ARGUMENTS`
 3. **Saved:** `experiments/counterfactuals/`
 
 ## Methods
package/commands/curriculum.md
CHANGED
@@ -9,9 +9,9 @@ Does the order your model sees data matter? Find out systematically.
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**
@@ -21,7 +21,7 @@ Does the order your model sees data matter? Find out systematically.
 
 3. **Run curriculum analysis:**
 ```bash
-python scripts/curriculum_optimizer.py $ARGUMENTS
+uv run python scripts/curriculum_optimizer.py $ARGUMENTS
 ```
 
 4. **Strategies tested:**
package/commands/design.md
CHANGED
@@ -2,7 +2,7 @@
 name: design
 description: Generate a structured experiment design for a hypothesis. Reads experiment history, searches literature for methodology, produces a scored design document at experiments/designs/.
 argument-hint: "<hypothesis-id or description>"
-allowed-tools: Read, Write, Bash(python scripts/*:*,
+allowed-tools: Read, Write, Bash(uv run python scripts/*:*, uv sync:*, mkdir:*), Grep, Glob, WebSearch, WebFetch
 ---
 
 Front-load the thinking before the coding. Given a hypothesis, produce a structured experiment design grounded in methodology from the literature.
@@ -13,7 +13,7 @@ Front-load the thinking before the coding. Given a hypothesis, produce a structu
 
 If `$ARGUMENTS` matches `hyp-NNN`, load the hypothesis:
 ```bash
-
+uv run python scripts/manage_hypotheses.py show $ARGUMENTS
 ```
 
 If freeform text, use it directly as the hypothesis description.
@@ -23,7 +23,7 @@ Read the current config and experiment state:
 cat config.yaml
 ```
 ```bash
-
+uv run python scripts/show_metrics.py --last 10 2>/dev/null || echo "No experiments yet"
 ```
 ```bash
 cat experiment_state.yaml 2>/dev/null || echo "No experiment state yet"
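The design command above branches on whether `$ARGUMENTS` matches `hyp-NNN` or is freeform text. A sketch of that branch, assuming `NNN` stands for one or more digits (the diff does not spell out the exact pattern); `classify_argument` is a hypothetical name used only for illustration:

```python
import re

# Assumption: "hyp-NNN" means the literal prefix "hyp-" followed by digits only.
HYPOTHESIS_ID = re.compile(r"hyp-\d+")


def classify_argument(arguments):
    """Return ("id", text) for a hypothesis ID, otherwise ("description", text)."""
    text = arguments.strip()
    if HYPOTHESIS_ID.fullmatch(text):
        return ("id", text)
    return ("description", text)


print(classify_argument("hyp-042"))
print(classify_argument("try label smoothing"))
```

Using `fullmatch` keeps a description that merely mentions an ID, such as "extend hyp-042 with dropout", on the freeform path.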
package/commands/diagnose.md
CHANGED
@@ -9,15 +9,15 @@ Analyze where and why the model fails, beyond aggregate metrics.
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Generate predictions if needed:**
 Check if `experiments/predictions/exp-NNN-preds.yaml` exists. If not, run:
 ```bash
-python train.py --predict-only --output experiments/predictions/
+uv run python train.py --predict-only --output experiments/predictions/
 ```
 The predictions file must contain `y_true`, `y_pred`, `task_type`, and optionally `features`.
 
@@ -28,7 +28,7 @@ Analyze where and why the model fails, beyond aggregate metrics.
 
 4. **Run error analysis:**
 ```bash
-python scripts/diagnose_errors.py $ARGUMENTS
+uv run python scripts/diagnose_errors.py $ARGUMENTS
 ```
 
 5. **Report results:**
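The diagnose steps above require the predictions file to contain `y_true`, `y_pred`, and `task_type`, with `features` optional. A minimal validation sketch, assuming the YAML has already been loaded into a dictionary; the equal-length check and the function names are illustrative assumptions, not the package's own code:

```python
REQUIRED_KEYS = ("y_true", "y_pred", "task_type")


def missing_prediction_keys(predictions):
    """List the required keys that are absent from a loaded predictions mapping."""
    return [key for key in REQUIRED_KEYS if key not in predictions]


def check_predictions(predictions):
    """Raise ValueError if required keys are missing or the label lists differ in length."""
    missing = missing_prediction_keys(predictions)
    if missing:
        raise ValueError(f"predictions file is missing: {', '.join(missing)}")
    # Assumption: labels and predictions should line up one-to-one.
    if len(predictions["y_true"]) != len(predictions["y_pred"]):
        raise ValueError("y_true and y_pred differ in length")
    return True


# Hypothetical payload; `features` is left out because it is optional.
example = {"y_true": [1, 0, 1], "y_pred": [1, 1, 1], "task_type": "classification"}
print(check_predictions(example))
```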
package/commands/diff.md
CHANGED
@@ -9,9 +9,9 @@ Deep diagnostic comparison of two experiments. Goes beyond "which metric is high
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**
@@ -21,7 +21,7 @@ Deep diagnostic comparison of two experiments. Goes beyond "which metric is high
 
 3. **Run deep comparison:**
 ```bash
-python scripts/experiment_diff.py $ARGUMENTS
+uv run python scripts/experiment_diff.py $ARGUMENTS
 ```
 
 4. **Report results — the diff includes:**
package/commands/distill.md
CHANGED
@@ -9,9 +9,9 @@ Compress a large model into a smaller, faster one for production. Measures the a
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**
@@ -23,7 +23,7 @@ Compress a large model into a smaller, faster one for production. Measures the a
 
 3. **Run distillation planner:**
 ```bash
-python scripts/model_distiller.py $ARGUMENTS
+uv run python scripts/model_distiller.py $ARGUMENTS
 ```
 
 4. **Report includes:**
package/commands/doctor.md
CHANGED
@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 Is Turing healthy? Check everything and get a score.
 
 ## Steps
-1. `
-2. `python scripts/harness_doctor.py $ARGUMENTS`
+1. `uv sync`
+2. `uv run python scripts/harness_doctor.py $ARGUMENTS`
 3. **Saved:** `experiments/doctor/`
 
 ## Checks
package/commands/ensemble.md
CHANGED
@@ -9,9 +9,9 @@ Build ensembles from your best experiments automatically. Often yields 1-3% impr
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**
@@ -22,7 +22,7 @@ Build ensembles from your best experiments automatically. Often yields 1-3% impr
 
 3. **Run ensemble construction:**
 ```bash
-python scripts/build_ensemble.py $ARGUMENTS
+uv run python scripts/build_ensemble.py $ARGUMENTS
 ```
 
 4. **Report results:**
package/commands/explore.md
CHANGED
@@ -2,7 +2,7 @@
 name: explore
 description: Tree-search-guided hypothesis exploration using AB-MCTS. Explores the space of experiment ideas as a search tree, scored by the critique engine. Discovers non-obvious refinement chains that linear suggestion cannot find.
 argument-hint: "[ml/project] [--iterations N] [--top N] [--strategy abmcts-a|abmcts-m|greedy]"
-allowed-tools: Read, Write, Bash(python scripts/*:*,
+allowed-tools: Read, Write, Bash(uv run python scripts/*:*, uv sync:*), Grep, Glob
 ---
 
 Explore the hypothesis space using tree search. Instead of suggesting independent ideas, this builds and searches a tree of refinement chains — each node is a hypothesis scored by novelty, feasibility, and expected impact.
@@ -31,7 +31,7 @@ Extract from `$ARGUMENTS`:
 ### 1. Assess Current State
 
 ```bash
-
+uv run python scripts/show_metrics.py --last 10 2>/dev/null || echo "No experiments yet"
 ```
 
 Read `config.yaml` to understand the current model and metric.
@@ -39,7 +39,7 @@ Read `config.yaml` to understand the current model and metric.
 ### 2. Run Tree Search
 
 ```bash
-
+uv run python scripts/treequest_suggest.py \
 --log experiments/log.jsonl \
 --config config.yaml \
 --top <N> \
@@ -58,7 +58,7 @@ The script will:
 For each result, add to the hypothesis queue:
 
 ```bash
-
+uv run python scripts/manage_hypotheses.py add "<description>" \
 --priority medium --source treequest
 ```
 
package/commands/export.md
CHANGED
@@ -9,9 +9,9 @@ Export a trained model to a production-ready format.
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**
@@ -23,7 +23,7 @@ Export a trained model to a production-ready format.
 
 3. **Run export pipeline:**
 ```bash
-python scripts/export_model.py $ARGUMENTS
+uv run python scripts/export_model.py $ARGUMENTS
 ```
 
 4. **Report results:**
package/commands/feature.md
CHANGED
@@ -9,9 +9,9 @@ Systematically evaluate which features matter and which are noise.
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**
@@ -21,7 +21,7 @@ Systematically evaluate which features matter and which are noise.
 
 3. **Run feature analysis:**
 ```bash
-python scripts/feature_intelligence.py $ARGUMENTS
+uv run python scripts/feature_intelligence.py $ARGUMENTS
 ```
 
 4. **Report includes:**
package/commands/flashback.md
CHANGED
@@ -8,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
 Come back to a project after a week and start working in 10 seconds instead of 30 minutes.
 
 ## Steps
-1. **
-2. **Run:** `python scripts/session_flashback.py $ARGUMENTS`
+1. **Sync environment:** `uv sync`
+2. **Run:** `uv run python scripts/session_flashback.py $ARGUMENTS`
 3. **Report:** current best, last session experiments, pending hypotheses, annotations, budget, suggested next action
 4. **Saved output:** `experiments/flashbacks/flashback-*.yaml`
 
package/commands/fork.md
CHANGED
@@ -9,9 +9,9 @@ Fork an experiment into parallel branches and compare results.
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**
@@ -21,7 +21,7 @@ Fork an experiment into parallel branches and compare results.
 
 3. **Run fork:**
 ```bash
-python scripts/fork_experiment.py $ARGUMENTS
+uv run python scripts/fork_experiment.py $ARGUMENTS
 ```
 
 4. **Report results:**
package/commands/frontier.md
CHANGED
@@ -9,9 +9,9 @@ Visualize the Pareto frontier across multiple objectives from experiment history
 
 ## Steps
 
-1. **
+1. **Sync environment:**
 ```bash
-
+uv sync
 ```
 
 2. **Parse arguments from `$ARGUMENTS`:**
@@ -21,7 +21,7 @@ Visualize the Pareto frontier across multiple objectives from experiment history
 
 3. **Run Pareto analysis:**
 ```bash
-python scripts/pareto_frontier.py $ARGUMENTS
+uv run python scripts/pareto_frontier.py $ARGUMENTS
 ```
 
 4. **Report results:**
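The frontier command above works with the Pareto frontier across multiple objectives. A sketch of the underlying dominance test, assuming every objective is to be maximized and each experiment is a tuple of scores; the data and function names are hypothetical, and nothing here is taken from `scripts/pareto_frontier.py`:

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (accuracy, throughput) pairs, both to be maximized.
runs = [(0.90, 120), (0.92, 100), (0.88, 110), (0.92, 90)]
print(pareto_front(runs))
```

A metric that should be minimized, such as latency, can be negated before the comparison so the same test applies.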