claude-turing 2.4.0 → 3.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,7 +1,7 @@
  {
  "name": "turing",
- "version": "2.4.0",
- "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 36 commands, 2 specialized agents, model composition (ensemble + pipeline stitch + warm-start), deep analysis (experiment diff + live training monitor + regression gate), experiment orchestration (batch queue + smart retry + branching), literature integration + paper drafting, production model export, performance profiling, smart checkpoints, experiment intelligence, statistical rigor, tree-search hypothesis exploration, cost-performance frontier, model cards, model registry, hypothesis database with novelty guard, anti-cheating guardrails, and the taste-leverage loop. Inspired by Karpathy's autoresearch and the scientific method itself.",
+ "version": "3.0.0",
+ "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 41 commands, 2 specialized agents, meta-intelligence (cross-project knowledge transfer + methodology audit), scaling & efficiency (scaling laws + compute budget + model distillation), model composition (ensemble + pipeline stitch + warm-start), deep analysis (experiment diff + live training monitor + regression gate), experiment orchestration (batch queue + smart retry + branching), literature integration + paper drafting, production model export, performance profiling, smart checkpoints, experiment intelligence, statistical rigor, tree-search hypothesis exploration, cost-performance frontier, model cards, model registry, hypothesis database with novelty guard, anti-cheating guardrails, and the taste-leverage loop. Inspired by Karpathy's autoresearch and the scientific method itself.",
  "author": {
  "name": "pragnition"
  },
package/README.md CHANGED
@@ -347,6 +347,11 @@ The index (`hypotheses.yaml`) is the lightweight queue. The detail files (`hypot
  | `/turing:ensemble [--top-k]` | Automated ensemble — voting, stacking, blending from top-K models |
  | `/turing:stitch <action>` | Pipeline composition — show, swap, cache, and run stages independently |
  | `/turing:warm <exp-id>` | Warm-start from prior model — load checkpoint, freeze layers, adjust LR |
+ | `/turing:scale [--axis]` | Scaling law estimator — power-law fit, full-scale predictions, diminishing returns verdict |
+ | `/turing:budget <action>` | Compute budget manager — set limits, track allocation, auto-shift explore/exploit |
+ | `/turing:distill <exp-id>` | Model compression — distill teacher into smaller student with accuracy/size tradeoff |
+ | `/turing:transfer [--from]` | Cross-project knowledge transfer — find similar projects, surface what worked |
+ | `/turing:audit [--strict]` | Pre-submission methodology audit — data leakage, baselines, seeds, ablations, reproducibility |

  And for fully hands-off operation:

@@ -531,11 +536,11 @@ Each project gets independent config, data, experiments, models, and agent memor

  ## Architecture of Turing Itself

- 36 commands, 2 agents, 10 config files, 55 template scripts, model registry, artifact contract, cost-performance frontier, model cards, tree-search exploration, statistical rigor, experiment intelligence, performance profiling, smart checkpoints, production model export, literature integration, paper section drafting, experiment orchestration (queue + retry + fork), deep analysis (diff + watch + regress), model composition (ensemble + stitch + warm), 16 ADRs. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full codemap.
+ 41 commands, 2 agents, 10 config files, 60 template scripts, model registry, artifact contract, cost-performance frontier, model cards, tree-search exploration, statistical rigor, experiment intelligence, performance profiling, smart checkpoints, production model export, literature integration, paper section drafting, experiment orchestration (queue + retry + fork), deep analysis (diff + watch + regress), model composition (ensemble + stitch + warm), scaling & efficiency (scale + budget + distill), meta-intelligence (transfer + audit), 16 ADRs. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full codemap.

  ```
  turing/
- ├── commands/ 35 skill files (core + taste-leverage + reporting + exploration + statistical rigor + experiment intelligence + performance + deployment + research workflow + orchestration + deep analysis + model composition)
+ ├── commands/ 40 skill files (core + taste-leverage + reporting + exploration + statistical rigor + experiment intelligence + performance + deployment + research workflow + orchestration + deep analysis + model composition + scaling & efficiency + meta-intelligence)
  ├── agents/ 2 agents (researcher: read/write, evaluator: read-only)
  ├── config/ 8 files (lifecycle, taxonomy, archetypes, novelty aliases)
  ├── templates/ Scaffolded into user projects by /turing:init
@@ -0,0 +1,56 @@
+ ---
+ name: audit
+ description: Pre-submission methodology audit — catch data leakage, missing baselines, cherry-picked seeds, and incomplete ablations before a reviewer does.
+ disable-model-invocation: true
+ argument-hint: "[--strict] [--checklist neurips]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ A reviewer checklist you run before submitting. Catches methodology mistakes that cause desk rejections.
+
+ ## Steps
+
+ 1. **Activate environment:**
+ ```bash
+ source .venv/bin/activate
+ ```
+
+ 2. **Parse arguments from `$ARGUMENTS`:**
+ - `--strict` — treat warnings as failures
+ - `--checklist neurips|icml|iclr` — add venue-specific checks
+ - `--json` — raw JSON output
+
+ 3. **Run methodology audit:**
+ ```bash
+ python scripts/methodology_audit.py $ARGUMENTS
+ ```
+
+ 4. **Checks performed:**
+ - **Data leakage** (critical): verify prepare.py/evaluate.py separation
+ - **CV strategy** (critical): verify appropriate cross-validation for data type
+ - **Seed sensitivity** (high): seed studies exist for best experiments
+ - **Ablation completeness** (high): ablation studies performed
+ - **Baseline comparison** (high): simple baselines in experiment log
+ - **Reproducibility** (high): best result successfully reproduced
+ - **Hyperparameter budget** (medium): total tuning cost documented
+ - **Regression stability** (medium): regression checks performed
+
+ 5. **Verdicts:**
+ - **PASS** — ready for submission
+ - **PASS (with warnings)** — address before submission
+ - **NEEDS WORK** — fix failures first
+ - **FAIL** — critical issues found
+
+ 6. **Actions:** each failure suggests the `/turing:` command to fix it
+
+ 7. **Venue checklists:** `--checklist neurips` adds NeurIPS-specific checks (broader impact, reproducibility checklist, code availability)
+
+ 8. **Saved output:** report in `experiments/audits/audit-YYYY-MM-DD.yaml`
+
+ ## Examples
+
+ ```
+ /turing:audit # Standard audit
+ /turing:audit --strict # Warnings become failures
+ /turing:audit --checklist neurips # NeurIPS submission checklist
+ ```
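
As a rough illustration of how the severity-tiered checks in steps 4 and 5 could roll up into a single verdict, here is a minimal Python sketch. The check tuple format, severity labels, and `--strict` handling are assumptions for illustration; `scripts/methodology_audit.py` may implement this differently.

```python
# Illustrative sketch only; not part of the package. Assumes a hypothetical
# list of (check_name, severity, passed) results.
from typing import List, Tuple

def audit_verdict(results: List[Tuple[str, str, bool]], strict: bool = False) -> str:
    """Map per-check results to the PASS / NEEDS WORK / FAIL verdicts described above."""
    failed = [(name, sev) for name, sev, ok in results if not ok]
    if any(sev == "critical" for _, sev in failed):
        return "FAIL"                # critical issues found
    if any(sev == "high" for _, sev in failed):
        return "NEEDS WORK"          # fix high-severity failures first
    if failed:                       # only medium-severity warnings remain
        return "NEEDS WORK" if strict else "PASS (with warnings)"
    return "PASS"
```
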
@@ -0,0 +1,52 @@
+ ---
+ name: budget
+ description: Compute budget manager — set experiment/time limits, track allocation across explore/exploit phases, auto-shift modes, hard stop.
+ disable-model-invocation: true
+ argument-hint: "<set|status|reset> [--experiments 50] [--hours 8]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Set a compute ceiling and let the system optimize within it. Prevents runaway experiment loops.
+
+ ## Steps
+
+ 1. **Activate environment:**
+ ```bash
+ source .venv/bin/activate
+ ```
+
+ 2. **Parse arguments from `$ARGUMENTS`:**
+ - First argument is action: `set`, `status`, `reset`, or `check`
+ - `--experiments 50` — max experiment count
+ - `--hours 8` — max wall-clock hours
+ - `--json` — raw JSON output
+
+ 3. **Run budget manager:**
+ ```bash
+ python scripts/budget_manager.py $ARGUMENTS
+ ```
+
+ 4. **Actions:**
+ - **set:** create a budget with experiment and/or time constraints
+ - **status:** show usage, burn rate, projected exhaustion, allocation breakdown
+ - **reset:** deactivate the current budget
+ - **check:** returns whether another experiment is allowed (used by `/turing:train`)
+
+ 5. **Budget allocation policy:**
+ - **0-50% budget:** EXPLORE — try diverse hypotheses
+ - **50-80% budget:** MIXED — explore promising, exploit best
+ - **80-100% budget:** EXPLOIT ONLY — refine the winner
+ - **100% budget:** HARD STOP — `/turing:train` refuses new experiments
+
+ 6. **Budget state** stored in `experiment_state.yaml` under the `budget` key.
+
+ 7. **If no budget exists:** `/turing:train` runs without limits.
+
+ ## Examples
+
+ ```
+ /turing:budget set --experiments 50 --hours 8 # Set both constraints
+ /turing:budget set --experiments 30 # Experiment count only
+ /turing:budget status # Show usage and projections
+ /turing:budget reset # Remove budget limits
+ ```
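
The allocation policy in step 5 amounts to a threshold function on the fraction of budget consumed. A minimal sketch, assuming the budget is tracked as an experiment count; `scripts/budget_manager.py` may also weigh wall-clock hours and track state differently.

```python
# Illustrative sketch only; not part of the package. Thresholds mirror the
# allocation policy in step 5 of the budget skill above.
def budget_phase(used_experiments: int, max_experiments: int) -> str:
    """Return the explore/exploit phase for the current budget usage."""
    frac = used_experiments / max_experiments
    if frac >= 1.0:
        return "HARD STOP"      # /turing:train refuses new experiments
    if frac >= 0.8:
        return "EXPLOIT ONLY"   # refine the winner
    if frac >= 0.5:
        return "MIXED"          # explore promising, exploit best
    return "EXPLORE"            # try diverse hypotheses
```
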
@@ -0,0 +1,56 @@
+ ---
+ name: distill
+ description: Model compression via distillation — train a smaller student model to match a larger teacher's predictions.
+ disable-model-invocation: true
+ argument-hint: "<teacher-exp-id> [--compression 4] [--method soft-labels]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Compress a large model into a smaller, faster one for production. Measures the accuracy/size/latency tradeoff.
+
+ ## Steps
+
+ 1. **Activate environment:**
+ ```bash
+ source .venv/bin/activate
+ ```
+
+ 2. **Parse arguments from `$ARGUMENTS`:**
+ - First argument is teacher experiment ID (required)
+ - `--compression 4` — compression ratio (default: 4x)
+ - `--method soft_labels|feature_matching|dataset_distillation` — distillation method
+ - `--target-latency 5` — auto-adjust compression to meet latency target (ms)
+ - `--json` — raw JSON output
+
+ 3. **Run distillation planner:**
+ ```bash
+ python scripts/model_distiller.py $ARGUMENTS
+ ```
+
+ 4. **Report includes:**
+ - Teacher model metrics
+ - Auto-selected student architecture (fewer trees/layers/width)
+ - Estimated size reduction and latency improvement
+ - Distillation configuration (temperature, alpha, loss function)
+ - Verdict: EXCELLENT / ACCEPTABLE / MARGINAL / TOO MUCH LOSS
+
+ 5. **Student selection by model type:**
+ - **Tree models:** fewer estimators, shallower depth
+ - **Neural networks:** fewer layers, narrower hidden dims
+ - **scikit-learn:** simpler model family (RandomForest → DecisionTree)
+
+ 6. **Distillation methods:**
+ - **soft_labels:** train on teacher's probability outputs with temperature scaling
+ - **feature_matching:** align intermediate representations (neural only)
+ - **dataset_distillation:** train on teacher-labeled synthetic data
+
+ 7. **Saved output:** report written to `experiments/distillations/distill-<exp-id>.yaml`
+
+ ## Examples
+
+ ```
+ /turing:distill exp-042 # 4x compression, soft labels
+ /turing:distill exp-042 --compression 8 # Aggressive compression
+ /turing:distill exp-042 --method feature_matching # Neural feature alignment
+ /turing:distill exp-042 --target-latency 5 # Meet 5ms latency target
+ ```
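
For the `soft_labels` method, the standard formulation blends a temperature-scaled distillation term with the ordinary hard-label loss. A minimal PyTorch sketch of that loss for the neural-network case only; the tree and scikit-learn paths described above, and the actual `scripts/model_distiller.py`, will differ, and the argument names and defaults here are assumptions.

```python
# Illustrative sketch only; not part of the package. Hinton-style soft-label
# distillation loss with temperature scaling.
import torch
import torch.nn.functional as F

def soft_label_distillation_loss(
    student_logits: torch.Tensor,
    teacher_logits: torch.Tensor,
    labels: torch.Tensor,
    temperature: float = 4.0,
    alpha: float = 0.7,
) -> torch.Tensor:
    # Soften both distributions with the temperature, then match them with KL divergence.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_term = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    # Keep a hard-label term so the student still fits the true labels.
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1 - alpha) * ce_term
```
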
@@ -0,0 +1,55 @@
+ ---
+ name: scale
+ description: Scaling law estimator — run small experiments at different sizes, fit a power law, and predict full-scale performance before committing compute.
+ disable-model-invocation: true
+ argument-hint: "[--axis data|compute|params] [--points 4] [--analyze results.yaml]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Predict full-scale performance from a handful of small experiments. Answers "is it worth training on the full dataset?" in 30 minutes instead of 3 days.
+
+ ## Steps
+
+ 1. **Activate environment:**
+ ```bash
+ source .venv/bin/activate
+ ```
+
+ 2. **Parse arguments from `$ARGUMENTS`:**
+ - `--axis data|compute|params` — scaling axis (default: data)
+ - `--points 4` — number of scale points (default: 4)
+ - `--analyze results.yaml` — analyze existing results instead of planning
+ - `--plot` — include ASCII scaling plot
+ - `--json` — raw JSON output
+
+ 3. **Plan or analyze:**
+ - **Plan mode (default):** generates scale point configs to run
+ ```bash
+ python scripts/scaling_estimator.py --axis data --points 4
+ ```
+ - **Analyze mode:** fits power law to completed results
+ ```bash
+ python scripts/scaling_estimator.py --analyze experiments/scaling/results.yaml
+ ```
+
+ 4. **Scaling axes:**
+ - **data:** train on 10%, 25%, 50%, 75% of dataset
+ - **compute:** train for 10%, 25%, 50%, 75% of max epochs
+ - **params:** scale model size (fewer estimators, shallower depth)
+
+ 5. **After planning:** run each scale point experiment, record results in YAML, then use `--analyze` to fit the curve
+
+ 6. **Report includes:**
+ - Power law fit: `metric = a × n^b` with R²
+ - Predictions for 100%, 150%, 200% scale
+ - Verdict: DIMINISHING RETURNS / MARGINAL GAINS / WORTH SCALING
+
+ 7. **Saved output:** report written to `experiments/scaling/scale-YYYY-MM-DD.yaml`
+
+ ## Examples
+
+ ```
+ /turing:scale # Plan: data axis, 4 points
+ /turing:scale --axis compute --points 3 # Plan: compute axis, 3 points
+ /turing:scale --analyze results.yaml --plot # Analyze with ASCII plot
+ ```
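
The power-law fit in step 6 (`metric = a × n^b`) can be estimated by ordinary least squares in log-log space. A minimal sketch, with made-up accuracies in the usage lines; `scripts/scaling_estimator.py` may use a different fitting routine.

```python
# Illustrative sketch only; not part of the package. Fits metric = a * n^b by
# linear regression on log-transformed values.
import numpy as np

def fit_power_law(fractions, metrics):
    """Fit metric = a * n^b to (scale fraction, metric) pairs; return a, b, R^2."""
    x, y = np.log(np.asarray(fractions)), np.log(np.asarray(metrics))
    b, log_a = np.polyfit(x, y, 1)              # slope b, intercept log(a)
    pred = log_a + b * x
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return np.exp(log_a), b, 1.0 - ss_res / ss_tot

# Hypothetical scale-point accuracies, for illustration only.
a, b, r2 = fit_power_law([0.10, 0.25, 0.50, 0.75], [0.71, 0.78, 0.83, 0.86])
pred_full = a * 1.0 ** b                        # predicted metric at 100% scale
pred_double = a * 2.0 ** b                      # predicted metric at 200% scale
```
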
@@ -0,0 +1,54 @@
+ ---
+ name: transfer
+ description: Cross-project knowledge transfer — find similar prior projects and surface what worked. Builds institutional ML memory.
+ disable-model-invocation: true
+ argument-hint: "[--from project-path] [--auto]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Find similar prior projects and surface what worked. "Last time you had tabular classification with class imbalance, LightGBM beat everything by 3%."
+
+ ## Steps
+
+ 1. **Activate environment:**
+ ```bash
+ source .venv/bin/activate
+ ```
+
+ 2. **Parse arguments from `$ARGUMENTS`:**
+ - `--from ~/projects/fraud-detection` — transfer from a specific project
+ - `--auto` — auto-queue hypotheses from recommendations
+ - `--index ~/.turing/project_index.yaml` — custom index path
+ - `--json` — raw JSON output
+
+ 3. **Run knowledge transfer:**
+ ```bash
+ python scripts/knowledge_transfer.py $ARGUMENTS
+ ```
+
+ 4. **Report includes:**
+ - Similar prior projects ranked by similarity score
+ - Per project: task type, winner model, key insights
+ - Suggested hypotheses from winning strategies
+ - Auto-queued hypotheses (with `--auto`)
+
+ 5. **Similarity matching** uses:
+ - Task type (classification/regression) — highest weight
+ - Dataset size (log-scale comparison)
+ - Feature types (tabular/image/text)
+ - Class balance characteristics
+ - Dimensionality
+
+ 6. **Project index** at `~/.turing/project_index.yaml` — local only, never uploaded
+
+ 7. **If no similar projects found:** suggest running on more projects first or specifying one with `--from`
+
+ 8. **Saved output:** report in `experiments/transfers/transfer-*.yaml`
+
+ ## Examples
+
+ ```
+ /turing:transfer # Search index for similar projects
+ /turing:transfer --from ~/projects/fraud-detection # Transfer from specific project
+ /turing:transfer --auto # Auto-queue hypotheses
+ ```
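
A rough sketch of the weighted similarity scoring described in step 5. The weights, field names, and normalization below are illustrative assumptions, not the actual `scripts/knowledge_transfer.py` logic.

```python
# Illustrative sketch only; not part of the package. Weighted similarity over
# the project profile features listed in step 5 of the transfer skill above.
import math

WEIGHTS = {"task_type": 0.4, "dataset_size": 0.2, "feature_types": 0.2,
           "class_balance": 0.1, "dimensionality": 0.1}

def project_similarity(a: dict, b: dict) -> float:
    """Weighted similarity between two project profiles, in [0, 1]."""
    score = 0.0
    score += WEIGHTS["task_type"] * (a["task_type"] == b["task_type"])        # highest weight
    size_gap = abs(math.log10(a["n_rows"]) - math.log10(b["n_rows"]))         # log-scale size comparison
    score += WEIGHTS["dataset_size"] * max(0.0, 1.0 - size_gap / 3.0)
    score += WEIGHTS["feature_types"] * (a["modality"] == b["modality"])      # tabular / image / text
    score += WEIGHTS["class_balance"] * (1.0 - abs(a["minority_frac"] - b["minority_frac"]))
    dim_gap = abs(math.log10(a["n_features"]) - math.log10(b["n_features"]))
    score += WEIGHTS["dimensionality"] * max(0.0, 1.0 - dim_gap / 2.0)
    return score
```
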
@@ -45,6 +45,11 @@ You are the Turing ML research router. Detect the user's intent and route to the
  | "ensemble", "combine models", "voting", "stacking", "blending", "merge models" | `/turing:ensemble` | Compose |
  | "stitch", "pipeline", "swap stage", "cache stage", "pipeline composition" | `/turing:stitch` | Compose |
  | "warm", "warm start", "fine-tune", "continue training", "transfer learning", "from checkpoint" | `/turing:warm` | Compose |
+ | "scale", "scaling law", "how much data", "is more data worth it", "power law", "data efficiency" | `/turing:scale` | Analyze |
+ | "budget", "compute budget", "how many experiments", "spending limit", "stop after" | `/turing:budget` | Manage |
+ | "distill", "compress", "smaller model", "student model", "knowledge distillation", "model compression" | `/turing:distill` | Deploy |
+ | "transfer", "what worked before", "similar project", "cross-project", "institutional knowledge", "prior projects" | `/turing:transfer` | Research |
+ | "audit", "methodology check", "pre-submission", "reviewer checklist", "data leakage", "missing baselines" | `/turing:audit` | Validate |

  ## Sub-commands

@@ -86,6 +91,11 @@ You are the Turing ML research router. Detect the user's intent and route to the
  | `/turing:ensemble [--top-k] [--methods]` | Automated ensemble: voting, weighted voting, stacking, blending from top-K models | (inline) |
  | `/turing:stitch <action> [stage]` | Pipeline composition: show/swap/cache/run stages independently | (inline) |
  | `/turing:warm <exp-id>` | Warm-start from prior model: load checkpoint, freeze layers, adjust LR | (inline) |
+ | `/turing:scale [--axis]` | Scaling law estimator: fit power law, predict full-scale performance | (inline) |
+ | `/turing:budget <action>` | Compute budget manager: set limits, track allocation, auto-shift modes | (inline) |
+ | `/turing:distill <exp-id>` | Model compression: distill teacher into smaller student model | (inline) |
+ | `/turing:transfer [--from]` | Cross-project knowledge transfer: find similar prior projects, surface what worked | (inline) |
+ | `/turing:audit [--strict]` | Pre-submission methodology audit: data leakage, baselines, seeds, ablations, reproducibility | (inline) |

  ## Proactive Detection

package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "claude-turing",
- "version": "2.4.0",
+ "version": "3.0.0",
  "type": "module",
  "description": "Autonomous ML research harness for Claude Code. The autoresearch loop as a formal protocol — iteratively trains, evaluates, and improves ML models with structured experiment tracking, convergence detection, immutable evaluation infrastructure, and safety guardrails.",
  "bin": {
package/src/install.js CHANGED
@@ -27,6 +27,8 @@ const SUB_COMMANDS = [
  "lit", "paper", "queue", "retry", "fork",
  "diff", "watch", "regress",
  "ensemble", "stitch", "warm",
+ "scale", "budget", "distill",
+ "transfer", "audit",
  ];

  export async function install(opts = {}) {
package/src/verify.js CHANGED
@@ -50,6 +50,11 @@ const EXPECTED_COMMANDS = [
  "ensemble/SKILL.md",
  "stitch/SKILL.md",
  "warm/SKILL.md",
+ "scale/SKILL.md",
+ "budget/SKILL.md",
+ "distill/SKILL.md",
+ "transfer/SKILL.md",
+ "audit/SKILL.md",
  ];

  const EXPECTED_AGENTS = ["ml-researcher.md", "ml-evaluator.md"];