claude-turing 3.0.0 → 3.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,7 +1,7 @@
  {
  "name": "turing",
- "version": "3.0.0",
- "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 41 commands, 2 specialized agents, meta-intelligence (cross-project knowledge transfer + methodology audit), scaling & efficiency (scaling laws + compute budget + model distillation), model composition (ensemble + pipeline stitch + warm-start), deep analysis (experiment diff + live training monitor + regression gate), experiment orchestration (batch queue + smart retry + branching), literature integration + paper drafting, production model export, performance profiling, smart checkpoints, experiment intelligence, statistical rigor, tree-search hypothesis exploration, cost-performance frontier, model cards, model registry, hypothesis database with novelty guard, anti-cheating guardrails, and the taste-leverage loop. Inspired by Karpathy's autoresearch and the scientific method itself.",
+ "version": "3.2.0",
+ "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 47 commands, 2 specialized agents, model debugging (xray + sensitivity + calibration), pre-training intelligence (sanity checks + baseline generation + leakage detection), meta-intelligence (cross-project knowledge transfer + methodology audit), scaling & efficiency (scaling laws + compute budget + model distillation), model composition (ensemble + pipeline stitch + warm-start), deep analysis (experiment diff + live training monitor + regression gate), experiment orchestration (batch queue + smart retry + branching), literature integration + paper drafting, production model export, performance profiling, smart checkpoints, experiment intelligence, statistical rigor, tree-search hypothesis exploration, cost-performance frontier, model cards, model registry, hypothesis database with novelty guard, anti-cheating guardrails, and the taste-leverage loop. Inspired by Karpathy's autoresearch and the scientific method itself.",
  "author": {
  "name": "pragnition"
  },
package/README.md CHANGED
@@ -352,6 +352,12 @@ The index (`hypotheses.yaml`) is the lightweight queue. The detail files (`hypot
  | `/turing:distill <exp-id>` | Model compression — distill teacher into smaller student with accuracy/size tradeoff |
  | `/turing:transfer [--from]` | Cross-project knowledge transfer — find similar projects, surface what worked |
  | `/turing:audit [--strict]` | Pre-submission methodology audit — data leakage, baselines, seeds, ablations, reproducibility |
+ | `/turing:sanity [--quick]` | Pre-training sanity checks — initial loss, single-batch overfit, gradient flow, output validation |
+ | `/turing:baseline [--methods]` | Automatic baseline generation — random, majority/mean, linear, k-NN |
+ | `/turing:leak [--deep]` | Targeted leakage detection — single-feature tests, correlation, train/test overlap |
+ | `/turing:xray [exp-id]` | Internal model diagnostics — gradient flow, dead neurons, weight distributions, tree analysis |
+ | `/turing:sensitivity [exp-id]` | Hyperparameter sensitivity — rank parameters by impact, detect non-monotonic responses |
+ | `/turing:calibrate [exp-id]` | Probability calibration — ECE/MCE, reliability diagrams, Platt/isotonic/temperature scaling |

  And for fully hands-off operation:

@@ -536,11 +542,11 @@ Each project gets independent config, data, experiments, models, and agent memor

  ## Architecture of Turing Itself

- 41 commands, 2 agents, 10 config files, 60 template scripts, model registry, artifact contract, cost-performance frontier, model cards, tree-search exploration, statistical rigor, experiment intelligence, performance profiling, smart checkpoints, production model export, literature integration, paper section drafting, experiment orchestration (queue + retry + fork), deep analysis (diff + watch + regress), model composition (ensemble + stitch + warm), scaling & efficiency (scale + budget + distill), meta-intelligence (transfer + audit), 16 ADRs. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full codemap.
+ 47 commands, 2 agents, 10 config files, 66 template scripts, model registry, artifact contract, cost-performance frontier, model cards, tree-search exploration, statistical rigor, experiment intelligence, performance profiling, smart checkpoints, production model export, literature integration, paper section drafting, experiment orchestration (queue + retry + fork), deep analysis (diff + watch + regress), model composition (ensemble + stitch + warm), scaling & efficiency (scale + budget + distill), meta-intelligence (transfer + audit), pre-training intelligence (sanity + baseline + leak), model debugging (xray + sensitivity + calibrate), 16 ADRs. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full codemap.

  ```
  turing/
- ├── commands/ 40 skill files (core + taste-leverage + reporting + exploration + statistical rigor + experiment intelligence + performance + deployment + research workflow + orchestration + deep analysis + model composition + scaling & efficiency + meta-intelligence)
+ ├── commands/ 46 skill files (core + taste-leverage + reporting + exploration + statistical rigor + experiment intelligence + performance + deployment + research workflow + orchestration + deep analysis + model composition + scaling & efficiency + meta-intelligence + pre-training intelligence + model debugging)
  ├── agents/ 2 agents (researcher: read/write, evaluator: read-only)
  ├── config/ 8 files (lifecycle, taxonomy, archetypes, novelty aliases)
  ├── templates/ Scaffolded into user projects by /turing:init
@@ -0,0 +1,45 @@
+ ---
+ name: baseline
+ description: Automatic baseline generation — random, majority/mean, linear, k-NN baselines in 60 seconds. Every experiment needs an "is this better than dumb?" reference.
+ disable-model-invocation: true
+ argument-hint: "[--methods all|simple|linear] [--data data.npz]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Generate trivial baselines so you always know if your model is meaningfully better than simple approaches.
+
+ ## Steps
+
+ 1. **Activate environment:**
+ ```bash
+ source .venv/bin/activate
+ ```
+
+ 2. **Parse arguments from `$ARGUMENTS`:**
+ - `--methods all|simple|linear` — baseline group (default: all)
+ - `--data data.npz` — data file with X and y arrays
+ - `--json` — raw JSON output
+
+ 3. **Run baseline generation:**
+ ```bash
+ python scripts/generate_baselines.py $ARGUMENTS
+ ```
+
+ 4. **Baselines generated:**
+ - **Classification:** Random, Majority class, Stratified random, Logistic Regression, k-NN
+ - **Regression:** Random, Mean predictor, Median predictor, Ridge Regression, k-NN
+ - Each evaluated with the same protocol as real experiments
+
+ 5. **Report includes:** comparison table with metric values and notes (floor, ceiling, reference)
+
+ 6. **Integration:** satisfies the "baseline comparison" check in `/turing:audit`
+
+ 7. **Saved output:** report in `experiments/baselines/baselines-*.yaml`
+
+ ## Examples
+
+ ```
+ /turing:baseline # All baselines
+ /turing:baseline --methods simple # Just random + majority
+ /turing:baseline --data data/processed.npz # With actual data
+ ```
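The baselines in step 4 map directly onto scikit-learn's stock estimators. A minimal sketch of the classification branch — a hypothetical reimplementation, since the packaged `scripts/generate_baselines.py` itself is not shown in this diff:

```python
# Hypothetical sketch of classification baseline generation.
# All estimator names and the split protocol are illustrative assumptions.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

def classification_baselines(X, y, seed=0):
    """Fit the trivial baselines and return {name: test accuracy}."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=seed)
    models = {
        "random": DummyClassifier(strategy="uniform", random_state=seed),
        "majority": DummyClassifier(strategy="most_frequent"),
        "stratified": DummyClassifier(strategy="stratified", random_state=seed),
        "logistic": LogisticRegression(max_iter=1000),
        "knn": KNeighborsClassifier(n_neighbors=5),
    }
    return {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
```

On an easy, linearly separable dataset the logistic baseline should clearly beat the majority-class floor; if your real model does not, that gap is the point of the command.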
@@ -0,0 +1,47 @@
+ ---
+ name: calibrate
+ description: Probability calibration — measure ECE, plot reliability diagrams, apply Platt scaling or isotonic regression.
+ disable-model-invocation: true
+ argument-hint: "[exp-id] [--method platt|isotonic|temperature|auto]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Make model probabilities trustworthy. Does 80% confidence actually mean 80% correct?
+
+ ## Steps
+
+ 1. **Activate environment:**
+ ```bash
+ source .venv/bin/activate
+ ```
+
+ 2. **Parse arguments from `$ARGUMENTS`:**
+ - Optional experiment ID
+ - `--method platt|isotonic|temperature|auto` — calibration method (default: auto)
+ - `--json` — raw JSON output
+
+ 3. **Run calibration:**
+ ```bash
+ python scripts/calibration.py $ARGUMENTS
+ ```
+
+ 4. **Report includes:**
+ - ECE/MCE before calibration
+ - Reliability diagram (predicted vs actual per bin)
+ - Calibration method comparison table
+ - Verdict: ALREADY CALIBRATED / IMPROVED / NO IMPROVEMENT
+
+ 5. **Methods:**
+ - **Platt:** logistic regression on logits
+ - **Isotonic:** non-parametric (more flexible, needs more data)
+ - **Temperature:** single scalar T parameter
+ - **Auto:** tries all, picks lowest ECE
+
+ 6. **Saved output:** report in `experiments/calibration/<exp-id>-calibration.yaml`
+
+ ## Examples
+
+ ```
+ /turing:calibrate exp-042 # Auto-select best method
+ /turing:calibrate exp-042 --method platt # Platt scaling only
+ ```
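The ECE number this report leads with has a standard definition: bin predictions by confidence, then sum the per-bin gap between accuracy and mean confidence, weighted by bin size. A minimal sketch with equal-width bins (a hypothetical helper, not the packaged `scripts/calibration.py`):

```python
# Hypothetical expected-calibration-error helper; bin scheme is an assumption.
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE = sum over bins of |accuracy - mean confidence|, weighted by bin mass."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    conf = probs.max(axis=1)      # confidence = probability of the predicted class
    pred = probs.argmax(axis=1)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            acc = (pred[mask] == labels[mask]).mean()
            ece += mask.mean() * abs(acc - conf[mask].mean())
    return ece
```

A model that says 90% but is right only half the time lands squarely in "overconfident": its ECE is the 0.4 gap between stated confidence and realized accuracy.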
@@ -0,0 +1,47 @@
+ ---
+ name: leak
+ description: Targeted leakage detection — probe for data leakage with single-feature tests, correlation checks, and train/test overlap detection.
+ disable-model-invocation: true
+ argument-hint: "[--deep] [--features feature_1,feature_2]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Actively probe for data leakage. The #1 cause of "too good to be true" results.
+
+ ## Steps
+
+ 1. **Activate environment:**
+ ```bash
+ source .venv/bin/activate
+ ```
+
+ 2. **Parse arguments from `$ARGUMENTS`:**
+ - `--deep` — run full single-feature analysis (slow but thorough)
+ - `--features "feat_1,feat_2"` — check specific features
+ - `--json` — raw JSON output
+
+ 3. **Run leakage scan:**
+ ```bash
+ python scripts/leakage_detector.py $ARGUMENTS
+ ```
+
+ 4. **Checks performed:**
+ - **Feature-target correlation:** flag features with >0.95 correlation to target
+ - **Single-feature predictiveness (--deep):** train on each feature alone, flag any that achieve >80% of full model performance
+ - **Train/test overlap:** hash-based deduplication across splits
+
+ 5. **Verdicts:**
+ - **CLEAN** — no leakage detected
+ - **SUSPICIOUS** — warnings to review
+ - **LEAKAGE DETECTED** — critical flags found
+
+ 6. **Integration:** satisfies the "data leakage" check in `/turing:audit`
+
+ 7. **Saved output:** report in `experiments/leakage/leak-*.yaml`
+
+ ## Examples
+
+ ```
+ /turing:leak # Standard correlation + overlap checks
+ /turing:leak --deep # Full single-feature analysis
+ ```
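The "hash-based deduplication across splits" check from step 4 can be sketched as follows — a hypothetical helper (rounding before hashing is an assumption, there to keep float noise from hiding duplicates; the packaged `scripts/leakage_detector.py` may differ):

```python
# Hypothetical train/test overlap check via row hashing.
import hashlib
import numpy as np

def _row_hash(row, decimals=8):
    """Stable digest of one feature row, rounded so 0.30000000004 == 0.3."""
    return hashlib.sha256(np.round(np.asarray(row, dtype=float), decimals).tobytes()).hexdigest()

def train_test_overlap(X_train, X_test):
    """Fraction of test rows that appear verbatim in the training set."""
    train_hashes = {_row_hash(r) for r in np.asarray(X_train, dtype=float)}
    test_hashes = [_row_hash(r) for r in np.asarray(X_test, dtype=float)]
    return sum(h in train_hashes for h in test_hashes) / max(len(test_hashes), 1)
```

Any nonzero overlap inflates test metrics; a threshold on this fraction is a natural trigger for the SUSPICIOUS vs LEAKAGE DETECTED verdicts above.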
@@ -0,0 +1,48 @@
+ ---
+ name: sanity
+ description: Pre-training sanity checks — catch broken data loaders, misconfigured losses, and dead gradients in 30 seconds before wasting hours.
+ disable-model-invocation: true
+ argument-hint: "[--quick] [--verbose]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Run a battery of fast checks before committing to a full training run. Catches wiring bugs in seconds.
+
+ ## Steps
+
+ 1. **Activate environment:**
+ ```bash
+ source .venv/bin/activate
+ ```
+
+ 2. **Parse arguments from `$ARGUMENTS`:**
+ - `--quick` — skip single-batch overfit test (fastest, ~5 seconds)
+ - `--verbose` — show detailed check output
+ - `--json` — raw JSON output
+
+ 3. **Run sanity checks:**
+ ```bash
+ python scripts/sanity_checks.py $ARGUMENTS
+ ```
+
+ 4. **Checks performed:**
+ - **Data pipeline** (critical): first batch loads, shapes match, no NaN/Inf
+ - **Initial loss** (high): loss at initialization matches theory (e.g., -log(1/C) for cross-entropy)
+ - **Gradient flow** (high): all parameters have non-zero, non-exploding gradients
+ - **Single-batch overfit** (critical): model can memorize 1 batch in 50 steps — if not, something is broken
+ - **Output validation** (high): predictions are non-NaN, non-constant, reasonable range
+ - **Config consistency** (medium): learning rate, batch size in reasonable ranges
+
+ 5. **Verdicts:**
+ - **PASS** — safe to proceed
+ - **PASS (with warnings)** — review before training
+ - **FAIL** — do not proceed, fix issues first
+
+ 6. **Saved output:** report in `experiments/sanity/sanity-*.yaml`
+
+ ## Examples
+
+ ```
+ /turing:sanity # Full check (~30 seconds)
+ /turing:sanity --quick # Skip overfit test (~5 seconds)
+ ```
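The "initial loss" check rests on a small piece of theory: an uninformed C-class classifier assigns probability 1/C everywhere, so its cross-entropy loss at initialization should sit near -log(1/C) = log(C). A hypothetical sketch of that check (not the packaged `scripts/sanity_checks.py`; the 0.5 tolerance is an assumption):

```python
# Hypothetical initial-loss sanity check for softmax cross-entropy.
import numpy as np

def initial_loss_check(logits, labels, n_classes, tolerance=0.5):
    """Return (loss, expected, ok) for a batch of pre-training logits."""
    z = logits - logits.max(axis=1, keepdims=True)            # numerically stable softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(len(labels)), labels].mean()  # mean cross-entropy
    expected = np.log(n_classes)
    return loss, expected, abs(loss - expected) < tolerance
```

Zero logits give exactly log(C); a loss far above it at step 0 usually means mislabeled targets, a wrong loss function, or a broken head, which is exactly what this check is meant to catch before hours of training.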
@@ -0,0 +1,41 @@
+ ---
+ name: sensitivity
+ description: Hyperparameter sensitivity analysis — rank parameters by impact, identify which matter and which are noise.
+ disable-model-invocation: true
+ argument-hint: "[exp-id] [--params learning_rate,max_depth]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Which hyperparameters actually matter? Stop wasting time on the ones that don't.
+
+ ## Steps
+
+ 1. **Activate environment:**
+ ```bash
+ source .venv/bin/activate
+ ```
+
+ 2. **Parse arguments from `$ARGUMENTS`:**
+ - Optional experiment ID
+ - `--params "learning_rate,max_depth"` — specific parameters to analyze
+ - `--json` — raw JSON output
+
+ 3. **Run sensitivity analysis:**
+ ```bash
+ python scripts/sensitivity_analysis.py $ARGUMENTS
+ ```
+
+ 4. **Report includes:**
+ - Per-parameter sensitivity ranking: HIGH / MED / LOW / NONE
+ - Metric range for each parameter sweep
+ - Monotonicity detection (is there a sweet spot?)
+ - Recommendations: focus tuning on X, stop tuning Y
+
+ 5. **Saved output:** report in `experiments/sensitivity/<exp-id>-sensitivity.yaml`
+
+ ## Examples
+
+ ```
+ /turing:sensitivity exp-042 # All tunable params
+ /turing:sensitivity --params "learning_rate,max_depth" # Specific params
+ ```
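The HIGH / MED / LOW / NONE ranking in the report can be driven by something as simple as the metric range each parameter's sweep produced. A hypothetical sketch (the thresholds are invented for illustration; the packaged `scripts/sensitivity_analysis.py` may use a different statistic):

```python
# Hypothetical sensitivity ranking by metric spread per parameter sweep.
def rank_sensitivity(sweeps):
    """sweeps: {param: [metric values across that param's sweep]} -> {param: label}."""
    def label(values):
        spread = max(values) - min(values)   # how much the metric moved
        if spread >= 0.05:
            return "HIGH"
        if spread >= 0.01:
            return "MED"
        if spread >= 0.002:
            return "LOW"
        return "NONE"
    return {param: label(values) for param, values in sweeps.items()}
```

Anything ranked NONE moves the metric less than run-to-run noise typically does, which is the "stop tuning Y" recommendation in concrete form.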
@@ -50,6 +50,12 @@ You are the Turing ML research router. Detect the user's intent and route to the
  | "distill", "compress", "smaller model", "student model", "knowledge distillation", "model compression" | `/turing:distill` | Deploy |
  | "transfer", "what worked before", "similar project", "cross-project", "institutional knowledge", "prior projects" | `/turing:transfer` | Research |
  | "audit", "methodology check", "pre-submission", "reviewer checklist", "data leakage", "missing baselines" | `/turing:audit` | Validate |
+ | "sanity", "sanity check", "pre-training", "is it broken", "before training", "quick check" | `/turing:sanity` | Check |
+ | "baseline", "baselines", "trivial baseline", "majority class", "is it better than random" | `/turing:baseline` | Analyze |
+ | "leak", "leakage", "data leakage scan", "suspicious feature", "train test overlap" | `/turing:leak` | Validate |
+ | "xray", "model internals", "dead neurons", "gradient flow", "weight distribution", "inside the model" | `/turing:xray` | Analyze |
+ | "sensitivity", "which params matter", "hyperparameter importance", "parameter ranking" | `/turing:sensitivity` | Analyze |
+ | "calibrate", "calibration", "ECE", "reliability diagram", "overconfident", "probability calibration" | `/turing:calibrate` | Analyze |

  ## Sub-commands

@@ -96,6 +102,12 @@ You are the Turing ML research router. Detect the user's intent and route to the
  | `/turing:distill <exp-id>` | Model compression: distill teacher into smaller student model | (inline) |
  | `/turing:transfer [--from]` | Cross-project knowledge transfer: find similar prior projects, surface what worked | (inline) |
  | `/turing:audit [--strict]` | Pre-submission methodology audit: data leakage, baselines, seeds, ablations, reproducibility | (inline) |
+ | `/turing:sanity [--quick]` | Pre-training sanity checks: initial loss, overfit test, gradient flow, output validation | (inline) |
+ | `/turing:baseline [--methods]` | Automatic baseline generation: random, majority/mean, linear, k-NN | (inline) |
+ | `/turing:leak [--deep]` | Targeted leakage detection: single-feature tests, correlation, train/test overlap | (inline) |
+ | `/turing:xray [exp-id]` | Internal model diagnostics: gradient flow, dead neurons, weight distributions, tree analysis | (inline) |
+ | `/turing:sensitivity [exp-id]` | Hyperparameter sensitivity analysis: rank parameters by impact, detect non-monotonic responses | (inline) |
+ | `/turing:calibrate [exp-id]` | Probability calibration: ECE/MCE, reliability diagrams, Platt/isotonic/temperature scaling | (inline) |

  ## Proactive Detection

@@ -0,0 +1,43 @@
+ ---
+ name: xray
+ description: Internal model diagnostics — gradient flow, dead neurons, activation stats, weight distributions, tree depth analysis.
+ disable-model-invocation: true
+ argument-hint: "[exp-id] [--layer encoder.layer.2] [--compare exp-a exp-b]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ See inside the model. When it underperforms, the fix depends on *why*.
+
+ ## Steps
+
+ 1. **Activate environment:**
+ ```bash
+ source .venv/bin/activate
+ ```
+
+ 2. **Parse arguments from `$ARGUMENTS`:**
+ - Optional experiment ID
+ - `--layer "name"` — focus on specific layer
+ - `--compare exp-a exp-b` — side-by-side diagnostics
+ - `--json` — raw JSON output
+
+ 3. **Run model diagnostics:**
+ ```bash
+ python scripts/model_xray.py $ARGUMENTS
+ ```
+
+ 4. **Diagnostics by model type:**
+ - **Neural networks:** gradient magnitudes, activation stats, dead neuron %, weight distributions, gradient-to-weight ratio
+ - **Tree models:** depth utilization, leaf purity, feature split dominance
+ - **scikit-learn:** coefficient magnitudes, feature importance concentration
+
+ 5. **Issues detected:** dead gradients, vanishing/exploding gradients, dead neurons, sparse weights, feature dominance, overfitting risk
+
+ 6. **Saved output:** report in `experiments/xrays/<exp-id>-xray.yaml`
+
+ ## Examples
+
+ ```
+ /turing:xray exp-042 # Full diagnostics
+ /turing:xray # Best experiment
+ ```
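Of the neural-network diagnostics in step 4, "dead neuron %" is the easiest to pin down: a ReLU unit is dead if it never activates on a probe batch. A hypothetical sketch (not the packaged `scripts/model_xray.py`):

```python
# Hypothetical dead-neuron check on post-ReLU activations from a probe batch.
import numpy as np

def dead_neuron_fraction(activations, eps=1e-8):
    """activations: (batch, units) array of post-ReLU outputs.
    Returns the fraction of units that are ~zero on every example."""
    acts = np.asarray(activations, dtype=float)
    dead = (np.abs(acts) < eps).all(axis=0)   # True where a unit never fired
    return dead.mean()
```

A large fraction here points toward too-high a learning rate or a bad initialization, which is why the report pairs this number with gradient magnitudes rather than reporting it alone.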
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "claude-turing",
- "version": "3.0.0",
+ "version": "3.2.0",
  "type": "module",
  "description": "Autonomous ML research harness for Claude Code. The autoresearch loop as a formal protocol — iteratively trains, evaluates, and improves ML models with structured experiment tracking, convergence detection, immutable evaluation infrastructure, and safety guardrails.",
  "bin": {
package/src/install.js CHANGED
@@ -29,6 +29,8 @@ const SUB_COMMANDS = [
  "ensemble", "stitch", "warm",
  "scale", "budget", "distill",
  "transfer", "audit",
+ "sanity", "baseline", "leak",
+ "xray", "sensitivity", "calibrate",
  ];

  export async function install(opts = {}) {
package/src/verify.js CHANGED
@@ -55,6 +55,12 @@ const EXPECTED_COMMANDS = [
  "distill/SKILL.md",
  "transfer/SKILL.md",
  "audit/SKILL.md",
+ "sanity/SKILL.md",
+ "baseline/SKILL.md",
+ "leak/SKILL.md",
+ "xray/SKILL.md",
+ "sensitivity/SKILL.md",
+ "calibrate/SKILL.md",
  ];

  const EXPECTED_AGENTS = ["ml-researcher.md", "ml-evaluator.md"];
  const EXPECTED_AGENTS = ["ml-researcher.md", "ml-evaluator.md"];