claude-turing 4.7.0 → 4.8.1

Files changed (172)
  1. package/.claude-plugin/plugin.json +2 -2
  2. package/README.md +1 -1
  3. package/agents/ml-evaluator.md +4 -4
  4. package/agents/ml-researcher.md +2 -2
  5. package/bin/turing-init.sh +2 -2
  6. package/commands/ablate.md +3 -4
  7. package/commands/annotate.md +2 -3
  8. package/commands/archive.md +2 -3
  9. package/commands/audit.md +3 -4
  10. package/commands/baseline.md +3 -4
  11. package/commands/brief.md +5 -6
  12. package/commands/budget.md +3 -4
  13. package/commands/calibrate.md +3 -4
  14. package/commands/card.md +3 -4
  15. package/commands/changelog.md +2 -3
  16. package/commands/checkpoint.md +3 -4
  17. package/commands/cite.md +2 -3
  18. package/commands/compare.md +1 -2
  19. package/commands/counterfactual.md +2 -3
  20. package/commands/curriculum.md +3 -4
  21. package/commands/design.md +3 -4
  22. package/commands/diagnose.md +4 -5
  23. package/commands/diff.md +3 -4
  24. package/commands/distill.md +3 -4
  25. package/commands/doctor.md +2 -3
  26. package/commands/ensemble.md +3 -4
  27. package/commands/explore.md +4 -5
  28. package/commands/export.md +3 -4
  29. package/commands/feature.md +3 -4
  30. package/commands/flashback.md +2 -3
  31. package/commands/fork.md +3 -4
  32. package/commands/frontier.md +3 -4
  33. package/commands/init.md +5 -6
  34. package/commands/leak.md +3 -4
  35. package/commands/lit.md +3 -4
  36. package/commands/logbook.md +5 -6
  37. package/commands/merge.md +2 -3
  38. package/commands/mode.md +1 -2
  39. package/commands/onboard.md +2 -3
  40. package/commands/paper.md +3 -4
  41. package/commands/plan.md +2 -3
  42. package/commands/poster.md +3 -4
  43. package/commands/postmortem.md +2 -3
  44. package/commands/preflight.md +5 -6
  45. package/commands/present.md +2 -3
  46. package/commands/profile.md +3 -4
  47. package/commands/prune.md +2 -3
  48. package/commands/quantize.md +2 -3
  49. package/commands/queue.md +3 -4
  50. package/commands/registry.md +2 -3
  51. package/commands/regress.md +3 -4
  52. package/commands/replay.md +2 -3
  53. package/commands/report.md +3 -4
  54. package/commands/reproduce.md +3 -4
  55. package/commands/retry.md +3 -4
  56. package/commands/review.md +2 -3
  57. package/commands/rules/loop-protocol.md +11 -11
  58. package/commands/sanity.md +3 -4
  59. package/commands/scale.md +4 -5
  60. package/commands/search.md +2 -3
  61. package/commands/seed.md +3 -4
  62. package/commands/sensitivity.md +3 -4
  63. package/commands/share.md +2 -3
  64. package/commands/simulate.md +2 -3
  65. package/commands/status.md +1 -2
  66. package/commands/stitch.md +3 -4
  67. package/commands/suggest.md +5 -6
  68. package/commands/surgery.md +2 -3
  69. package/commands/sweep.md +8 -9
  70. package/commands/template.md +2 -3
  71. package/commands/train.md +5 -6
  72. package/commands/transfer.md +3 -4
  73. package/commands/trend.md +2 -3
  74. package/commands/try.md +4 -5
  75. package/commands/turing.md +3 -3
  76. package/commands/update.md +2 -3
  77. package/commands/validate.md +4 -5
  78. package/commands/warm.md +3 -4
  79. package/commands/watch.md +4 -5
  80. package/commands/whatif.md +2 -3
  81. package/commands/xray.md +3 -4
  82. package/config/commands.yaml +75 -75
  83. package/package.json +3 -2
  84. package/skills/turing/SKILL.md +3 -3
  85. package/skills/turing/ablate/SKILL.md +3 -4
  86. package/skills/turing/annotate/SKILL.md +2 -3
  87. package/skills/turing/archive/SKILL.md +2 -3
  88. package/skills/turing/audit/SKILL.md +3 -4
  89. package/skills/turing/baseline/SKILL.md +3 -4
  90. package/skills/turing/brief/SKILL.md +5 -6
  91. package/skills/turing/budget/SKILL.md +3 -4
  92. package/skills/turing/calibrate/SKILL.md +3 -4
  93. package/skills/turing/card/SKILL.md +3 -4
  94. package/skills/turing/changelog/SKILL.md +2 -3
  95. package/skills/turing/checkpoint/SKILL.md +3 -4
  96. package/skills/turing/cite/SKILL.md +2 -3
  97. package/skills/turing/compare/SKILL.md +1 -2
  98. package/skills/turing/counterfactual/SKILL.md +2 -3
  99. package/skills/turing/curriculum/SKILL.md +3 -4
  100. package/skills/turing/design/SKILL.md +3 -4
  101. package/skills/turing/diagnose/SKILL.md +4 -5
  102. package/skills/turing/diff/SKILL.md +3 -4
  103. package/skills/turing/distill/SKILL.md +3 -4
  104. package/skills/turing/doctor/SKILL.md +2 -3
  105. package/skills/turing/ensemble/SKILL.md +3 -4
  106. package/skills/turing/explore/SKILL.md +4 -5
  107. package/skills/turing/export/SKILL.md +3 -4
  108. package/skills/turing/feature/SKILL.md +3 -4
  109. package/skills/turing/flashback/SKILL.md +2 -3
  110. package/skills/turing/fork/SKILL.md +3 -4
  111. package/skills/turing/frontier/SKILL.md +3 -4
  112. package/skills/turing/init/SKILL.md +5 -6
  113. package/skills/turing/leak/SKILL.md +3 -4
  114. package/skills/turing/lit/SKILL.md +3 -4
  115. package/skills/turing/logbook/SKILL.md +5 -6
  116. package/skills/turing/merge/SKILL.md +2 -3
  117. package/skills/turing/mode/SKILL.md +1 -2
  118. package/skills/turing/onboard/SKILL.md +2 -3
  119. package/skills/turing/paper/SKILL.md +3 -4
  120. package/skills/turing/plan/SKILL.md +2 -3
  121. package/skills/turing/poster/SKILL.md +3 -4
  122. package/skills/turing/postmortem/SKILL.md +2 -3
  123. package/skills/turing/preflight/SKILL.md +5 -6
  124. package/skills/turing/present/SKILL.md +2 -3
  125. package/skills/turing/profile/SKILL.md +3 -4
  126. package/skills/turing/prune/SKILL.md +2 -3
  127. package/skills/turing/quantize/SKILL.md +2 -3
  128. package/skills/turing/queue/SKILL.md +3 -4
  129. package/skills/turing/registry/SKILL.md +2 -3
  130. package/skills/turing/regress/SKILL.md +3 -4
  131. package/skills/turing/replay/SKILL.md +2 -3
  132. package/skills/turing/report/SKILL.md +3 -4
  133. package/skills/turing/reproduce/SKILL.md +3 -4
  134. package/skills/turing/retry/SKILL.md +3 -4
  135. package/skills/turing/review/SKILL.md +2 -3
  136. package/skills/turing/rules/loop-protocol.md +11 -11
  137. package/skills/turing/sanity/SKILL.md +3 -4
  138. package/skills/turing/scale/SKILL.md +4 -5
  139. package/skills/turing/search/SKILL.md +2 -3
  140. package/skills/turing/seed/SKILL.md +3 -4
  141. package/skills/turing/sensitivity/SKILL.md +3 -4
  142. package/skills/turing/share/SKILL.md +2 -3
  143. package/skills/turing/simulate/SKILL.md +2 -3
  144. package/skills/turing/status/SKILL.md +1 -2
  145. package/skills/turing/stitch/SKILL.md +3 -4
  146. package/skills/turing/suggest/SKILL.md +5 -6
  147. package/skills/turing/surgery/SKILL.md +2 -3
  148. package/skills/turing/sweep/SKILL.md +8 -9
  149. package/skills/turing/template/SKILL.md +2 -3
  150. package/skills/turing/train/SKILL.md +5 -6
  151. package/skills/turing/transfer/SKILL.md +3 -4
  152. package/skills/turing/trend/SKILL.md +2 -3
  153. package/skills/turing/try/SKILL.md +4 -5
  154. package/skills/turing/update/SKILL.md +2 -3
  155. package/skills/turing/validate/SKILL.md +4 -5
  156. package/skills/turing/warm/SKILL.md +3 -4
  157. package/skills/turing/watch/SKILL.md +4 -5
  158. package/skills/turing/whatif/SKILL.md +2 -3
  159. package/skills/turing/xray/SKILL.md +3 -4
  160. package/src/command-registry.js +12 -0
  161. package/src/install.js +4 -3
  162. package/src/sync-commands-layout.js +149 -0
  163. package/src/sync-skills-layout.js +4 -133
  164. package/templates/README.md +5 -8
  165. package/templates/program.md +18 -18
  166. package/templates/pyproject.toml +10 -0
  167. package/templates/requirements.txt +4 -1
  168. package/templates/scripts/generate_onboarding.py +1 -1
  169. package/templates/scripts/post-train-hook.sh +7 -8
  170. package/templates/scripts/scaffold.py +24 -26
  171. package/templates/scripts/stop-hook.sh +2 -3
  172. package/templates/scripts/turing-run-python.sh +9 -0
@@ -1,7 +1,6 @@
1
1
  ---
2
2
  name: replay
3
3
  description: Experiment replay — re-run a historical experiment with current infrastructure to test if old approaches do better now.
4
- disable-model-invocation: true
5
4
  argument-hint: "<exp-id> [--with-current-data] [--with-current-preprocessing]"
6
5
  allowed-tools: Read, Bash(*), Grep, Glob
7
6
  ---
@@ -9,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
9
8
  Should you revisit old ideas? Infrastructure changes may make failed approaches work now.
10
9
 
11
10
  ## Steps
12
- 1. **Activate environment:** `source .venv/bin/activate`
13
- 2. **Run:** `python scripts/experiment_replay.py $ARGUMENTS`
11
+ 1. **Sync environment:** `uv sync`
12
+ 2. **Run:** `uv run python scripts/experiment_replay.py $ARGUMENTS`
14
13
  3. **Modes:** default (current code+data), --with-current-data, --with-current-preprocessing
15
14
  4. **Report:** original vs replayed metrics, delta, verdict
16
15
  5. **Saved output:** `experiments/replays/`
@@ -1,9 +1,8 @@
1
1
  ---
2
2
  name: report
3
3
  description: Generate a markdown research report from experiment history — structured for sharing, archiving, or including in documentation. More detailed than a brief, less visual than a poster.
4
- disable-model-invocation: true
5
4
  argument-hint: "[--since YYYY-MM-DD] [--output path]"
6
- allowed-tools: Read, Bash(python scripts/*:*, source .venv/bin/activate:*, mkdir:*), Grep, Glob
5
+ allowed-tools: Read, Bash(uv run python scripts/*:*, uv sync:*, mkdir:*), Grep, Glob
7
6
  ---
8
7
 
9
8
  Generate a structured markdown research report summarizing the experiment campaign.
@@ -15,12 +14,12 @@ Generate a structured markdown research report summarizing the experiment campai
15
14
  Use the logbook generator in markdown mode as the data backbone:
16
15
 
17
16
  ```bash
18
- source .venv/bin/activate && python scripts/generate_logbook.py --format markdown
17
+ uv run python scripts/generate_logbook.py --format markdown
19
18
  ```
20
19
 
21
20
  Also gather supplementary data:
22
21
  ```bash
23
- source .venv/bin/activate && python scripts/generate_brief.py
22
+ uv run python scripts/generate_brief.py
24
23
  cat experiment_state.yaml 2>/dev/null || true
25
24
  cat RESEARCH_PLAN.md 2>/dev/null || true
26
25
  ```
@@ -1,7 +1,6 @@
1
1
  ---
2
2
  name: reproduce
3
3
  description: Verify reproducibility of a specific experiment by re-running from logged config and checking metrics fall within tolerance.
4
- disable-model-invocation: true
5
4
  argument-hint: "<exp-id> [--tolerance 0.02] [--strict] [--runs 3]"
6
5
  allowed-tools: Read, Bash(*), Grep, Glob
7
6
  ---
@@ -10,9 +9,9 @@ Verify that a logged experiment can be reproduced with consistent results.
10
9
 
11
10
  ## Steps
12
11
 
13
- 1. **Activate environment:**
12
+ 1. **Sync environment:**
14
13
  ```bash
15
- source .venv/bin/activate
14
+ uv sync
16
15
  ```
17
16
 
18
17
  2. **Parse arguments from `$ARGUMENTS`:**
@@ -23,7 +22,7 @@ Verify that a logged experiment can be reproduced with consistent results.
23
22
 
24
23
  3. **Run reproducibility verification:**
25
24
  ```bash
26
- python scripts/reproduce_experiment.py $ARGUMENTS
25
+ uv run python scripts/reproduce_experiment.py $ARGUMENTS
27
26
  ```
28
27
 
29
28
  4. **Report results:**
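The tolerance check described for `reproduce` can be sketched as follows. This is a minimal illustration, not the contents of `scripts/reproduce_experiment.py`: the function names are hypothetical, and treating `--tolerance 0.02` as a relative difference is an assumption (the diff does not say whether it is relative or absolute).

```python
def within_tolerance(original: float, reproduced: float, tolerance: float = 0.02) -> bool:
    """Return True if the reproduced metric is close enough to the logged one.

    Assumption: tolerance is a relative difference; it falls back to an
    absolute comparison when the logged metric is exactly zero.
    """
    if original == 0:
        return abs(reproduced) <= tolerance
    return abs(reproduced - original) / abs(original) <= tolerance


def verify_runs(original: float, runs: list, tolerance: float = 0.02) -> bool:
    """A run set counts as reproduced only if every re-run is within tolerance."""
    return all(within_tolerance(original, value, tolerance) for value in runs)


if __name__ == "__main__":
    # Three hypothetical re-runs of an experiment logged at 0.90.
    print(verify_runs(0.90, [0.91, 0.895, 0.905]))
```

Requiring every run to pass is the stricter of the obvious choices; comparing the mean of the runs instead would be more forgiving of a single outlier.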
package/commands/retry.md CHANGED
@@ -1,7 +1,6 @@
1
1
  ---
2
2
  name: retry
3
3
  description: Smart failure recovery — auto-diagnose crash type and retry with targeted fix. OOM → halve batch. NaN → add clipping.
4
- disable-model-invocation: true
5
4
  argument-hint: "<exp-id> [--max-attempts 3]"
6
5
  allowed-tools: Read, Bash(*), Grep, Glob
7
6
  ---
@@ -10,9 +9,9 @@ Auto-diagnose and recover from experiment failures.
10
9
 
11
10
  ## Steps
12
11
 
13
- 1. **Activate environment:**
12
+ 1. **Sync environment:**
14
13
  ```bash
15
- source .venv/bin/activate
14
+ uv sync
16
15
  ```
17
16
 
18
17
  2. **Parse arguments from `$ARGUMENTS`:**
@@ -22,7 +21,7 @@ Auto-diagnose and recover from experiment failures.
22
21
 
23
22
  3. **Run smart retry:**
24
23
  ```bash
25
- python scripts/smart_retry.py $ARGUMENTS
24
+ uv run python scripts/smart_retry.py $ARGUMENTS
26
25
  ```
27
26
 
28
27
  4. **Report results:**
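The recovery rules named in the `retry` description (OOM leads to a halved batch, NaN leads to added clipping) can be sketched roughly like this. The log phrases matched, the config keys `batch_size` and `grad_clip_norm`, and the clip value are all assumptions made for illustration; the real `scripts/smart_retry.py` is not shown in this diff.

```python
import re
from typing import Optional


def diagnose(log_text: str) -> str:
    """Classify a crash from its log text (hypothetical heuristics)."""
    lowered = log_text.lower()
    if "out of memory" in lowered or re.search(r"\boom\b", lowered):
        return "oom"
    if re.search(r"\bnan\b", lowered):
        return "nan"
    return "unknown"


def propose_fix(config: dict, failure: str) -> Optional[dict]:
    """Return a modified copy of the config, or None if no targeted fix applies."""
    fixed = dict(config)
    if failure == "oom":
        # Halve the batch size, never going below 1.
        fixed["batch_size"] = max(1, config.get("batch_size", 1) // 2)
        return fixed
    if failure == "nan":
        # Assumed key and value for gradient clipping.
        fixed["grad_clip_norm"] = 1.0
        return fixed
    return None


if __name__ == "__main__":
    failure = diagnose("RuntimeError: CUDA out of memory")
    print(failure, propose_fix({"batch_size": 64}, failure))
```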
@@ -1,7 +1,6 @@
1
1
  ---
2
2
  name: review
3
3
  description: Peer review simulation — generate likely reviewer objections with severity ratings and fix commands.
4
- disable-model-invocation: true
5
4
  argument-hint: "[--venue neurips|icml|general] [--harsh]"
6
5
  allowed-tools: Read, Bash(*), Grep, Glob
7
6
  ---
@@ -9,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
9
8
  Simulate a conference reviewer before you submit. Each weakness links to the command that fixes it.
10
9
 
11
10
  ## Steps
12
- 1. `source .venv/bin/activate`
13
- 2. `python scripts/simulate_review.py $ARGUMENTS`
11
+ 1. `uv sync`
12
+ 2. `uv run python scripts/simulate_review.py $ARGUMENTS`
14
13
  3. **Saved:** `experiments/reviews/`
15
14
 
16
15
  ## Examples
@@ -16,9 +16,9 @@ The autoresearch harness enforces a strict separation between the **hypothesis s
16
16
 
17
17
  ## Execution Rules
18
18
 
19
- - **ALWAYS redirect training output:** `python train.py > run.log 2>&1`
19
+ - **ALWAYS redirect training output:** `uv run python train.py > run.log 2>&1`
20
20
  - **ALWAYS parse metrics with grep** between `---` delimiters: `grep -A 10 "^---" run.log | head -10`
21
- - **ALWAYS activate the venv first:** `source .venv/bin/activate`
21
+ - **ALWAYS run Python through uv:** `uv run python ...`
22
22
  - **NEVER install new packages** without human approval
23
23
 
24
24
  ## Git Discipline
@@ -40,16 +40,16 @@ The autoresearch harness enforces a strict separation between the **hypothesis s
40
40
 
41
41
  ## Sweep Workflow
42
42
 
43
- 1. Generate queue: `python scripts/sweep.py`
44
- 2. Check status: `python scripts/sweep.py --status`
45
- 3. Get next: `python scripts/sweep.py --next`
43
+ 1. Generate queue: `uv run python scripts/sweep.py`
44
+ 2. Check status: `uv run python scripts/sweep.py --status`
45
+ 3. Get next: `uv run python scripts/sweep.py --next`
46
46
  4. Apply overrides, create branch, run training
47
- 5. Mark: `python scripts/sweep.py --mark <name> complete|failed`
47
+ 5. Mark: `uv run python scripts/sweep.py --mark <name> complete|failed`
48
48
  6. Repeat until queue is empty
49
49
 
50
50
  ## Logging Rules
51
51
 
52
- - **Log every experiment** to `experiments/log.jsonl` via `python scripts/log_experiment.py` — kept and discarded alike.
52
+ - **Log every experiment** to `experiments/log.jsonl` via `uv run python scripts/log_experiment.py` — kept and discarded alike.
53
53
  - **Include all metrics, config, and description** of the hypothesis and its outcome.
54
54
 
55
55
  ## Convergence Rules
@@ -64,11 +64,11 @@ The researcher agent's Bash access is restricted to a whitelist of necessary com
64
64
 
65
65
  | Allowed Pattern | Purpose |
66
66
  |-----------------|---------|
67
- | `python train.py:*` | Execute training |
68
- | `python scripts/*:*` | Run utility scripts (logging, metrics, sweep) |
67
+ | `uv run python train.py:*` | Execute training |
68
+ | `uv run python scripts/*:*` | Run utility scripts (logging, metrics, sweep) |
69
69
  | `git:*` | Branch, commit, merge, reset operations |
70
- | `source .venv/bin/activate:*` | Virtual environment activation |
71
- | `pip:*` | Package installation (requires human approval) |
70
+ | `uv sync:*` | Virtual environment activation |
71
+ | `uv add:*` | Package installation (requires human approval) |
72
72
 
73
73
  **Blocked by omission:** `cat`, `head`, `tail`, `less` (prevents reading hidden files via shell), `curl`, `wget` (prevents data exfiltration), arbitrary command execution.
74
74
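To make the effect of the updated whitelist concrete, here is a small sketch of how such patterns might be checked against a command line. The matching semantics are an assumption: a trailing `:*` is read as "this prefix, optionally followed by arguments", and `*` inside the prefix is treated as a glob wildcard. The real permission matcher may behave differently, and this sketch is not a security boundary (it does not guard against shell chaining, for example).

```python
from fnmatch import fnmatchcase

# Patterns copied from the 4.8.1 side of the table above.
ALLOWED_PATTERNS = [
    "uv run python train.py:*",
    "uv run python scripts/*:*",
    "git:*",
    "uv sync:*",
    "uv add:*",
]


def is_allowed(command: str, patterns=ALLOWED_PATTERNS) -> bool:
    """Return True if the command matches any allow-list pattern."""
    command = command.strip()
    for pattern in patterns:
        # Assumed meaning of ":*": the prefix alone, or the prefix plus arguments.
        prefix = pattern[:-2] if pattern.endswith(":*") else pattern
        if fnmatchcase(command, prefix) or fnmatchcase(command, prefix + " *"):
            return True
    return False


if __name__ == "__main__":
    for cmd in ["uv run python scripts/sweep.py --status", "python train.py", "cat run.log"]:
        print(cmd, "->", is_allowed(cmd))
```

Under these assumptions, a bare `python train.py` is no longer permitted after the upgrade, because only the `uv run` forms are listed.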
 
@@ -1,7 +1,6 @@
1
1
  ---
2
2
  name: sanity
3
3
  description: Pre-training sanity checks — catch broken data loaders, misconfigured losses, and dead gradients in 30 seconds before wasting hours.
4
- disable-model-invocation: true
5
4
  argument-hint: "[--quick] [--verbose]"
6
5
  allowed-tools: Read, Bash(*), Grep, Glob
7
6
  ---
@@ -10,9 +9,9 @@ Run a battery of fast checks before committing to a full training run. Catches w
10
9
 
11
10
  ## Steps
12
11
 
13
- 1. **Activate environment:**
12
+ 1. **Sync environment:**
14
13
  ```bash
15
- source .venv/bin/activate
14
+ uv sync
16
15
  ```
17
16
 
18
17
  2. **Parse arguments from `$ARGUMENTS`:**
@@ -22,7 +21,7 @@ Run a battery of fast checks before committing to a full training run. Catches w
22
21
 
23
22
  3. **Run sanity checks:**
24
23
  ```bash
25
- python scripts/sanity_checks.py $ARGUMENTS
24
+ uv run python scripts/sanity_checks.py $ARGUMENTS
26
25
  ```
27
26
 
28
27
  4. **Checks performed:**
package/commands/scale.md CHANGED
@@ -1,7 +1,6 @@
1
1
  ---
2
2
  name: scale
3
3
  description: Scaling law estimator — run small experiments at different sizes, fit a power law, and predict full-scale performance before committing compute.
4
- disable-model-invocation: true
5
4
  argument-hint: "[--axis data|compute|params] [--points 4] [--analyze results.yaml]"
6
5
  allowed-tools: Read, Bash(*), Grep, Glob
7
6
  ---
@@ -10,9 +9,9 @@ Predict full-scale performance from a handful of small experiments. Answers "is
10
9
 
11
10
  ## Steps
12
11
 
13
- 1. **Activate environment:**
12
+ 1. **Sync environment:**
14
13
  ```bash
15
- source .venv/bin/activate
14
+ uv sync
16
15
  ```
17
16
 
18
17
  2. **Parse arguments from `$ARGUMENTS`:**
@@ -25,11 +24,11 @@ Predict full-scale performance from a handful of small experiments. Answers "is
25
24
  3. **Plan or analyze:**
26
25
  - **Plan mode (default):** generates scale point configs to run
27
26
  ```bash
28
- python scripts/scaling_estimator.py --axis data --points 4
27
+ uv run python scripts/scaling_estimator.py --axis data --points 4
29
28
  ```
30
29
  - **Analyze mode:** fits power law to completed results
31
30
  ```bash
32
- python scripts/scaling_estimator.py --analyze experiments/scaling/results.yaml
31
+ uv run python scripts/scaling_estimator.py --analyze experiments/scaling/results.yaml
33
32
  ```
34
33
 
35
34
  4. **Scaling axes:**
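The "fit a power law" step in analyze mode can be illustrated with an ordinary least-squares fit in log-log space. This is a sketch under simplifying assumptions: it fits `score = a * size**b` with no irreducible-error offset, requires strictly positive values, and uses hypothetical function names. It is not the implementation in `scripts/scaling_estimator.py`.

```python
import math


def fit_power_law(sizes, scores):
    """Fit score = a * size**b by linear regression on log(size), log(score)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in scores]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope in log-log space is the exponent
    a = math.exp(mean_y - b * mean_x)   # intercept in log space gives the scale
    return a, b


def predict(a: float, b: float, size: float) -> float:
    """Extrapolate the fitted curve to a new size."""
    return a * size ** b


if __name__ == "__main__":
    # Four small-scale points (matching --points 4) generated from a known curve.
    sizes = [1000, 2000, 4000, 8000]
    errors = [2.0 * s ** -0.5 for s in sizes]
    a, b = fit_power_law(sizes, errors)
    print(a, b, predict(a, b, 64000))
```

With only a handful of points the extrapolation is sensitive to noise, so the fitted exponent is best read as a rough guide rather than a guarantee.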
@@ -1,7 +1,6 @@
1
1
  ---
2
2
  name: search
3
3
  description: Natural language experiment search — query with text + structured filters over 200+ experiments.
4
- disable-model-invocation: true
5
4
  argument-hint: "<query> [--filter \"accuracy>0.85\"] [--limit 10]"
6
5
  allowed-tools: Read, Bash(*), Grep, Glob
7
6
  ---
@@ -9,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
9
8
  Find specific experiments in a large history with natural language and structured filters.
10
9
 
11
10
  ## Steps
12
- 1. **Activate environment:** `source .venv/bin/activate`
13
- 2. **Run:** `python scripts/experiment_search.py $ARGUMENTS`
11
+ 1. **Sync environment:** `uv sync`
12
+ 2. **Run:** `uv run python scripts/experiment_search.py $ARGUMENTS`
14
13
  3. **Filters:** `accuracy>0.85`, `status:kept`, `family:baseline`, `date:last-week`
15
14
  4. **Report:** ranked table of matching experiments
16
15
 
package/commands/seed.md CHANGED
@@ -1,7 +1,6 @@
1
1
  ---
2
2
  name: seed
3
3
  description: Run multi-seed study on an experiment to compute mean/std/CI and flag seed-sensitive results. Prevents publishing lucky seeds.
4
- disable-model-invocation: true
5
4
  argument-hint: "[N] [--quick] [--exp-id <id>]"
6
5
  allowed-tools: Read, Bash(*), Grep, Glob
7
6
  ---
@@ -10,9 +9,9 @@ Run a multi-seed study to verify that experiment results are robust across rando
10
9
 
11
10
  ## Steps
12
11
 
13
- 1. **Activate environment:**
12
+ 1. **Sync environment:**
14
13
  ```bash
15
- source .venv/bin/activate
14
+ uv sync
16
15
  ```
17
16
 
18
17
  2. **Parse arguments from `$ARGUMENTS`:**
@@ -23,7 +22,7 @@ Run a multi-seed study to verify that experiment results are robust across rando
23
22
 
24
23
  3. **Run seed study:**
25
24
  ```bash
26
- python scripts/seed_runner.py $ARGUMENTS
25
+ uv run python scripts/seed_runner.py $ARGUMENTS
27
26
  ```
28
27
 
29
28
  4. **Report results:**
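The mean/std/CI summary that the `seed` description mentions could look roughly like the sketch below. The normal-approximation interval (`z = 1.96`), the relative-spread threshold, and the function names are assumptions for illustration; with very few seeds a t-distribution interval would be wider. The actual `scripts/seed_runner.py` is not part of this diff.

```python
import math
import statistics


def summarize_seeds(scores, z: float = 1.96) -> dict:
    """Summarize one metric measured across several random seeds."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)  # sample standard deviation
    half_width = z * std / math.sqrt(len(scores))
    return {
        "mean": mean,
        "std": std,
        "ci_low": mean - half_width,
        "ci_high": mean + half_width,
    }


def is_seed_sensitive(summary: dict, max_rel_std: float = 0.05) -> bool:
    """Flag a result whose spread is large relative to its mean (assumed rule)."""
    return summary["std"] > max_rel_std * abs(summary["mean"])


if __name__ == "__main__":
    summary = summarize_seeds([0.80, 0.82, 0.84])
    print(summary, is_seed_sensitive(summary))
```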
@@ -1,7 +1,6 @@
1
1
  ---
2
2
  name: sensitivity
3
3
  description: Hyperparameter sensitivity analysis — rank parameters by impact, identify which matter and which are noise.
4
- disable-model-invocation: true
5
4
  argument-hint: "[exp-id] [--params learning_rate,max_depth]"
6
5
  allowed-tools: Read, Bash(*), Grep, Glob
7
6
  ---
@@ -10,9 +9,9 @@ Which hyperparameters actually matter? Stop wasting time on the ones that don't.
10
9
 
11
10
  ## Steps
12
11
 
13
- 1. **Activate environment:**
12
+ 1. **Sync environment:**
14
13
  ```bash
15
- source .venv/bin/activate
14
+ uv sync
16
15
  ```
17
16
 
18
17
  2. **Parse arguments from `$ARGUMENTS`:**
@@ -22,7 +21,7 @@ Which hyperparameters actually matter? Stop wasting time on the ones that don't.
22
21
 
23
22
  3. **Run sensitivity analysis:**
24
23
  ```bash
25
- python scripts/sensitivity_analysis.py $ARGUMENTS
24
+ uv run python scripts/sensitivity_analysis.py $ARGUMENTS
26
25
  ```
27
26
 
28
27
  4. **Report includes:**
package/commands/share.md CHANGED
@@ -1,7 +1,6 @@
1
1
  ---
2
2
  name: share
3
3
  description: Experiment packaging — portable archive with config, metrics, seed study, annotations, reproduction instructions.
4
- disable-model-invocation: true
5
4
  argument-hint: "<exp-ids...> [--include model,figures,code]"
6
5
  allowed-tools: Read, Bash(*), Grep, Glob
7
6
  ---
@@ -9,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
9
8
  Package experiments for collaborator handoff or paper supplementary material.
10
9
 
11
10
  ## Steps
12
- 1. `source .venv/bin/activate`
13
- 2. `python scripts/package_experiments.py $ARGUMENTS`
11
+ 1. `uv sync`
12
+ 2. `uv run python scripts/package_experiments.py $ARGUMENTS`
14
13
  3. **Saved:** `exports/packages/<name>/`
15
14
 
16
15
  ## Examples
@@ -1,7 +1,6 @@
1
1
  ---
2
2
  name: simulate
3
3
  description: Experiment outcome prediction — predict which configs will beat the current best before running them.
4
- disable-model-invocation: true
5
4
  argument-hint: "[--configs configs.yaml] [--top-k 5] [--threshold 0.001]"
6
5
  allowed-tools: Read, Bash(*), Grep, Glob
7
6
  ---
@@ -9,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
9
8
  Predict outcomes before spending compute. Ranks proposed configs and recommends which to run vs skip.
10
9
 
11
10
  ## Steps
12
- 1. `source .venv/bin/activate`
13
- 2. `python scripts/experiment_simulator.py $ARGUMENTS`
11
+ 1. `uv sync`
12
+ 2. `uv run python scripts/experiment_simulator.py $ARGUMENTS`
14
13
  3. **Saved:** `experiments/simulations/`
15
14
 
16
15
  ## How it works
@@ -1,7 +1,6 @@
1
1
  ---
2
2
  name: status
3
3
  description: Show current ML experiment status — best model, recent experiments, convergence state, and trend analysis. Delegates to @ml-evaluator for read-only safety.
4
- disable-model-invocation: true
5
4
  allowed-tools: Read, Bash(*), Grep, Glob
6
5
  ---
7
6
 
@@ -11,7 +10,7 @@ Show the current state of the ML training pipeline. This is an observation-only
11
10
 
12
11
  1. **Run metrics display:**
13
12
  ```bash
14
- source .venv/bin/activate && python scripts/show_metrics.py --last 10
13
+ uv run python scripts/show_metrics.py --last 10
15
14
  ```
16
15
 
17
16
  2. **Summarize for the user:**
@@ -1,7 +1,6 @@
1
1
  ---
2
2
  name: stitch
3
3
  description: Pipeline composition — decompose ML pipelines into swappable stages. Show, swap, cache, and run stages independently.
4
- disable-model-invocation: true
5
4
  argument-hint: "<show|swap|cache|run> [stage] [--from exp-id]"
6
5
  allowed-tools: Read, Bash(*), Grep, Glob
7
6
  ---
@@ -10,9 +9,9 @@ Decompose your ML pipeline into stages that can be independently varied, cached,
10
9
 
11
10
  ## Steps
12
11
 
13
- 1. **Activate environment:**
12
+ 1. **Sync environment:**
14
13
  ```bash
15
- source .venv/bin/activate
14
+ uv sync
16
15
  ```
17
16
 
18
17
  2. **Parse arguments from `$ARGUMENTS`:**
@@ -24,7 +23,7 @@ Decompose your ML pipeline into stages that can be independently varied, cached,
24
23
 
25
24
  3. **Run pipeline manager:**
26
25
  ```bash
27
- python scripts/pipeline_manager.py $ARGUMENTS
26
+ uv run python scripts/pipeline_manager.py $ARGUMENTS
28
27
  ```
29
28
 
30
29
  4. **Report results:**
@@ -1,9 +1,8 @@
1
1
  ---
2
2
  name: suggest
3
3
  description: Literature-grounded model selection. Reads the ML task context, searches recent literature, and suggests model architectures worth trying — with citations. Suggestions are auto-queued as hypotheses.
4
- disable-model-invocation: true
5
4
  argument-hint: "[task description override]"
6
- allowed-tools: Read, Write, Bash(python scripts/*:*, source .venv/bin/activate:*), Grep, Glob, WebSearch, WebFetch
5
+ allowed-tools: Read, Write, Bash(uv run python scripts/*:*, uv sync:*), Grep, Glob, WebSearch, WebFetch
7
6
  ---
8
7
 
9
8
  Suggest model architectures for the current ML task. Supports two strategies:
@@ -26,7 +25,7 @@ cat config.yaml
26
25
  ```
27
26
 
28
27
  ```bash
29
- source .venv/bin/activate && python scripts/show_metrics.py --last 10 2>/dev/null || echo "No experiments yet"
28
+ uv run python scripts/show_metrics.py --last 10 2>/dev/null || echo "No experiments yet"
30
29
  ```
31
30
 
32
31
  If `$ARGUMENTS` is provided, use that as the task description. Otherwise, infer from `config.yaml` (model type, primary metric, data source, target column).
@@ -67,7 +66,7 @@ From the literature, synthesize **3-5 concrete model architecture suggestions**.
67
66
  For each suggestion, add to the hypothesis queue:
68
67
 
69
68
  ```bash
70
- source .venv/bin/activate && python scripts/manage_hypotheses.py add "<model>: <rationale> (source: <citation>)" --priority medium --source literature
69
+ uv run python scripts/manage_hypotheses.py add "<model>: <rationale> (source: <citation>)" --priority medium --source literature
71
70
  ```
72
71
 
73
72
  ### 5. Show Results
@@ -106,7 +105,7 @@ Same detection logic as the literature strategy — find `config.yaml` + `train.
106
105
  ### 2. Run Tree Search
107
106
 
108
107
  ```bash
109
- source .venv/bin/activate && python scripts/treequest_suggest.py \
108
+ uv run python scripts/treequest_suggest.py \
110
109
  --log experiments/log.jsonl \
111
110
  --config config.yaml \
112
111
  --top 5 \
@@ -121,7 +120,7 @@ If TreeQuest is not installed, the script automatically falls back to greedy bes
121
120
  For each result from the tree search, queue as a hypothesis:
122
121
 
123
122
  ```bash
124
- source .venv/bin/activate && python scripts/manage_hypotheses.py add "<description>" --priority medium --source treequest
123
+ uv run python scripts/manage_hypotheses.py add "<description>" --priority medium --source treequest
125
124
  ```
126
125
 
127
126
  ### 4. Show Results
@@ -1,7 +1,6 @@
1
1
  ---
2
2
  name: surgery
3
3
  description: Architecture modification — add/remove layers, widen/narrow, swap activations, inject skip connections. Specify what to change, system handles how.
4
- disable-model-invocation: true
5
4
  argument-hint: "<exp-id> --op <operation> [args...]"
6
5
  allowed-tools: Read, Bash(*), Grep, Glob
7
6
  ---
@@ -10,8 +9,8 @@ Programmatic architecture changes with auto warm-start from existing weights.
10
9
 
11
10
  ## Steps
12
11
 
13
- 1. **Activate environment:** `source .venv/bin/activate`
14
- 2. **Run:** `python scripts/architecture_surgery.py $ARGUMENTS`
12
+ 1. **Sync environment:** `uv sync`
13
+ 2. **Run:** `uv run python scripts/architecture_surgery.py $ARGUMENTS`
15
14
  3. **Operations:** add-layer, remove-layer, widen, narrow, swap-activation, add-skip, add-norm, deepen, swap-objective
16
15
  4. **For tree models:** deepen (increase max_depth), widen (more estimators), swap-objective
17
16
  5. **Report:** operation details, config changes, parameter count delta, warm-start source
package/commands/sweep.md CHANGED
@@ -1,40 +1,39 @@
1
1
  ---
2
2
  name: sweep
3
3
  description: Generate and run a systematic hyperparameter sweep. Computes the cartesian product of configured parameter ranges and processes the queue sequentially with full experiment logging.
4
- disable-model-invocation: true
5
4
  argument-hint: "[sweep_config.yaml]"
6
- allowed-tools: Read, Write, Edit, Bash(python train.py:*, python scripts/*:*, git:*, source .venv/bin/activate:*, pip:*), Grep, Glob
5
+ allowed-tools: Read, Write, Edit, Bash(uv run python train.py:*, uv run python scripts/*:*, git:*, uv sync:*, uv add:*), Grep, Glob
7
6
  ---
8
7
 
9
8
  Run a systematic hyperparameter sweep using the sweep configuration.
10
9
 
11
10
  ## Steps
12
11
 
13
- 1. **Activate environment:**
12
+ 1. **Sync environment:**
14
13
  ```bash
15
- source .venv/bin/activate
14
+ uv sync
16
15
  ```
17
16
 
18
17
  2. **Resolve config:** Use `$ARGUMENTS` as sweep config path, or default to `sweep_config.yaml`.
19
18
 
20
19
  3. **Generate queue** (if not already generated):
21
20
  ```bash
22
- python scripts/sweep.py [sweep_config.yaml]
21
+ uv run python scripts/sweep.py [sweep_config.yaml]
23
22
  ```
24
23
 
25
24
  4. **Check queue status:**
26
25
  ```bash
27
- python scripts/sweep.py --status
26
+ uv run python scripts/sweep.py --status
28
27
  ```
29
28
 
30
29
  5. **Process queue sequentially:**
31
- - Get next: `python scripts/sweep.py --next`
30
+ - Get next: `uv run python scripts/sweep.py --next`
32
31
  - Apply config overrides to `config.yaml`
33
32
  - Create experiment branch: `git checkout -b exp/NNN-description`
34
- - Run training: `python train.py > run.log 2>&1`
33
+ - Run training: `uv run python train.py > run.log 2>&1`
35
34
  - Parse metrics: `grep -A 10 "^---" run.log | head -10`
36
35
  - Log the experiment
37
- - Mark complete: `python scripts/sweep.py --mark <name> complete`
36
+ - Mark complete: `uv run python scripts/sweep.py --mark <name> complete`
38
37
  - If improved, merge to main. If not, return to main.
39
38
  - Repeat until queue is empty
40
39
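The queue generation that the `sweep` description refers to (the cartesian product of configured parameter ranges) can be sketched with `itertools.product`. The shape of the ranges dictionary is an assumption, since the schema of `sweep_config.yaml` is not shown in this diff, and `build_queue` is a hypothetical name.

```python
from itertools import product


def build_queue(param_ranges: dict) -> list:
    """Expand {name: [values...]} into one config override per combination."""
    names = list(param_ranges)
    combos = product(*(param_ranges[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


if __name__ == "__main__":
    # Hypothetical ranges; the parameter names are only examples.
    ranges = {"learning_rate": [0.01, 0.1], "max_depth": [3, 5, 7]}
    for overrides in build_queue(ranges):
        print(overrides)
```

The queue grows multiplicatively with each added parameter, which is worth keeping in mind because the steps above process it sequentially.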
 
@@ -1,7 +1,6 @@
1
1
  ---
2
2
  name: template
3
3
  description: Experiment template library — save winning configs as reusable templates, apply to new projects.
4
- disable-model-invocation: true
5
4
  argument-hint: "<save|list|apply|share> [--name name] [--from exp-id]"
6
5
  allowed-tools: Read, Bash(*), Grep, Glob
7
6
  ---
@@ -9,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
9
8
  Turn your best experiment configs into reusable recipes that persist across projects.
10
9
 
11
10
  ## Steps
12
- 1. **Activate environment:** `source .venv/bin/activate`
13
- 2. **Run:** `python scripts/experiment_templates.py $ARGUMENTS`
11
+ 1. **Sync environment:** `uv sync`
12
+ 2. **Run:** `uv run python scripts/experiment_templates.py $ARGUMENTS`
14
13
  3. **Operations:** save (from experiment), list (all templates), apply (to current project), share (export)
15
14
  4. **Stored at:** `~/.turing/templates/` (cross-project)
16
15
 
package/commands/train.md CHANGED
@@ -1,9 +1,8 @@
1
1
  ---
2
2
  name: train
3
3
  description: Run the autonomous ML experiment loop. Iteratively hypothesizes, trains, evaluates, and decides — keeping only improvements. Implements the autoresearch pattern with formal convergence detection and git-disciplined rollback.
4
- disable-model-invocation: true
5
4
  argument-hint: "[max_iterations]"
6
- allowed-tools: Read, Write, Edit, Bash(python train.py:*, python scripts/*:*, git:*, source .venv/bin/activate:*, pip:*), Grep, Glob
5
+ allowed-tools: Read, Write, Edit, Bash(uv run python train.py:*, uv run python scripts/*:*, git:*, uv sync:*, uv add:*), Grep, Glob
7
6
  ---
8
7
 
9
8
  You are an autonomous ML researcher. Your goal: iteratively improve a model by following the experiment loop protocol — the scientific method applied to machine learning.
@@ -27,9 +26,9 @@ Read `program.md` in the ML project directory for the complete protocol. Follow
 
  1. **Restore memory:** Read `.claude/agent-memory/ml-researcher-{project_name}/MEMORY.md` for prior observations and best results.
  2. **Read protocol:** Read `program.md` completely — it defines the experiment loop, constraints, and output format.
- 3. **Bootstrap data:** Check for training data at `config.yaml` → `data.source`. If no splits exist, run `python prepare.py`.
- 4. **Bootstrap venv:** `test -d .venv || (python3 -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt)`
- 5. **Assess state:** `source .venv/bin/activate && python scripts/show_metrics.py --last 5`
+ 3. **Bootstrap data:** Check for training data at `config.yaml` → `data.source`. If no splits exist, run `uv run python prepare.py`.
+ 4. **Bootstrap uv environment:** `uv sync`
+ 5. **Assess state:** `uv run python scripts/show_metrics.py --last 5`
  6. **Begin the loop** from program.md.
 
  ## The Loop
@@ -48,7 +47,7 @@ Use `@ml-evaluator` for analysis tasks. It is read-only (no Write/Edit) and cann
 
  ## Context Management
 
- - Redirect all training output: `python train.py > run.log 2>&1`
+ - Redirect all training output: `uv run python train.py > run.log 2>&1`
  - Parse metrics with grep, never read full output
  - Persist observations to MEMORY.md after each experiment
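
The redirect-and-grep pattern in the Context Management bullets above can be sketched as follows. This is only an illustration: the `metric: value` log-line format and the `last_metric` helper are assumptions, since the diff does not show what `train.py` actually prints.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: the "<metric>: <number>" log format is an assumption,
# not something taken from train.py.

# Print the last reported value of a metric without reading the whole log.
last_metric() {
  local metric="$1" log="$2"
  grep -E "^${metric}: " "$log" | tail -n 1 | cut -d' ' -f2
}

# Stand-in for: uv run python train.py > run.log 2>&1
log="$(mktemp)"
printf '%s\n' 'epoch 1' 'val_accuracy: 0.81' 'epoch 2' 'val_accuracy: 0.87' > "$log"

result="$(last_metric val_accuracy "$log")"
echo "$result"
rm -f "$log"
```

Only the matching lines ever reach the caller, which is the point of the "never read full output" rule: the full training log stays on disk and out of the agent's context.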
 
@@ -1,7 +1,6 @@
  ---
  name: transfer
  description: Cross-project knowledge transfer — find similar prior projects and surface what worked. Builds institutional ML memory.
- disable-model-invocation: true
  argument-hint: "[--from project-path] [--auto]"
  allowed-tools: Read, Bash(*), Grep, Glob
  ---
@@ -10,9 +9,9 @@ Find similar prior projects and surface what worked. "Last time you had tabular
 
  ## Steps
 
- 1. **Activate environment:**
+ 1. **Sync environment:**
  ```bash
- source .venv/bin/activate
+ uv sync
  ```
 
  2. **Parse arguments from `$ARGUMENTS`:**
@@ -23,7 +22,7 @@ Find similar prior projects and surface what worked. "Last time you had tabular
 
  3. **Run knowledge transfer:**
  ```bash
- python scripts/knowledge_transfer.py $ARGUMENTS
+ uv run python scripts/knowledge_transfer.py $ARGUMENTS
  ```
 
  4. **Report includes:**
package/commands/trend.md CHANGED
@@ -1,7 +1,6 @@
  ---
  name: trend
  description: Long-term trend analysis — improvement velocity, family ROI, diminishing returns detection, strategic research direction.
- disable-model-invocation: true
  argument-hint: "[--window 30d] [--metric accuracy]"
  allowed-tools: Read, Bash(*), Grep, Glob
  ---
@@ -9,8 +8,8 @@ allowed-tools: Read, Bash(*), Grep, Glob
  See the arc of your research, not just the latest results. Strategic view over 100+ experiments.
 
  ## Steps
- 1. **Activate environment:** `source .venv/bin/activate`
- 2. **Run:** `python scripts/trend_analysis.py $ARGUMENTS`
+ 1. **Sync environment:** `uv sync`
+ 2. **Run:** `uv run python scripts/trend_analysis.py $ARGUMENTS`
  3. **Report:** improvement velocity over time windows, family ROI ranking, diminishing returns prediction, phase transitions
  4. **Saved output:** `experiments/trends/trend-*.yaml`
 
package/commands/try.md CHANGED
@@ -1,9 +1,8 @@
  ---
  name: try
  description: Inject a hypothesis into the agent's experiment queue. This is how research taste reaches the agent — the human selects which coins to flip, the agent flips them.
- disable-model-invocation: true
  argument-hint: "<hypothesis description>"
- allowed-tools: Read, Write, Edit, Bash(python scripts/*:*, source .venv/bin/activate:*), Grep, Glob
+ allowed-tools: Read, Write, Edit, Bash(uv run python scripts/*:*, uv sync:*), Grep, Glob
  ---
 
  Inject a human hypothesis into the experiment queue for the next `/turing:train` iteration.
@@ -16,18 +15,18 @@ This is the taste-leverage mechanism: you provide judgment about what's worth tr
 
  2. **Check for archetype syntax.** If the argument starts with `archetype:`, expand it:
  ```bash
- source .venv/bin/activate && python scripts/manage_hypotheses.py add --archetype <name> --priority high --source human
+ uv run python scripts/manage_hypotheses.py add --archetype <name> --priority high --source human
  ```
 
  Otherwise, use the raw description:
  ```bash
- source .venv/bin/activate && python scripts/manage_hypotheses.py add "$ARGUMENTS" --priority high --source human
+ uv run python scripts/manage_hypotheses.py add "$ARGUMENTS" --priority high --source human
  ```
 
  3. **Confirm** with the hypothesis ID and instructions:
  - "Queued as hyp-NNN (high priority, human-injected)"
  - "The agent will prioritize this on the next `/turing:train` iteration"
- - Show current queue: `python scripts/manage_hypotheses.py list --status queued`
+ - Show current queue: `uv run python scripts/manage_hypotheses.py list --status queued`
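
The archetype check in step 2 above amounts to a prefix test on the argument. A minimal sketch, assuming the archetype name is everything after the `archetype:` prefix with no extra trimming (the diff does not specify this); `build_try_command` and the sample archetype name `wider-network` are hypothetical, and the function only prints the command instead of running it:

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the manage_hypotheses.py command that step 2
# would run for a given /turing:try argument. It does not execute anything.
build_try_command() {
  local arg="$1"
  case "$arg" in
    archetype:*)
      # Strip the "archetype:" prefix to get the archetype name.
      printf 'uv run python scripts/manage_hypotheses.py add --archetype %s --priority high --source human\n' "${arg#archetype:}"
      ;;
    *)
      # No prefix: pass the raw description through as one quoted argument.
      printf 'uv run python scripts/manage_hypotheses.py add "%s" --priority high --source human\n' "$arg"
      ;;
  esac
}

build_try_command "archetype:wider-network"   # "wider-network" is a made-up name
build_try_command "try label smoothing 0.1"
```

In both branches the hypothesis is added with `--priority high --source human`, matching the two commands in the hunk; only the way the hypothesis itself is specified differs.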
 
  ## Examples