claude-turing 3.1.0 → 3.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude-plugin/plugin.json +2 -2
- package/README.md +7 -2
- package/commands/calibrate.md +47 -0
- package/commands/curriculum.md +43 -0
- package/commands/feature.md +42 -0
- package/commands/sensitivity.md +41 -0
- package/commands/turing.md +10 -0
- package/commands/xray.md +43 -0
- package/package.json +1 -1
- package/src/install.js +2 -0
- package/src/verify.js +5 -0
- package/templates/scripts/__pycache__/calibration.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/curriculum_optimizer.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/feature_intelligence.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/model_xray.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/scaffold.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/sensitivity_analysis.cpython-314.pyc +0 -0
- package/templates/scripts/calibration.py +364 -0
- package/templates/scripts/curriculum_optimizer.py +337 -0
- package/templates/scripts/feature_intelligence.py +369 -0
- package/templates/scripts/model_xray.py +317 -0
- package/templates/scripts/scaffold.py +10 -0
- package/templates/scripts/sensitivity_analysis.py +335 -0
package/.claude-plugin/plugin.json
CHANGED

````diff
@@ -1,7 +1,7 @@
 {
   "name": "turing",
-  "version": "3.1.0",
-  "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol.
+  "version": "3.3.0",
+  "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 49 commands, 2 specialized agents, feature & training intelligence (feature selection + curriculum optimization), model debugging (xray + sensitivity + calibration), pre-training intelligence (sanity checks + baseline generation + leakage detection), meta-intelligence (cross-project knowledge transfer + methodology audit), scaling & efficiency (scaling laws + compute budget + model distillation), model composition (ensemble + pipeline stitch + warm-start), deep analysis (experiment diff + live training monitor + regression gate), experiment orchestration (batch queue + smart retry + branching), literature integration + paper drafting, production model export, performance profiling, smart checkpoints, experiment intelligence, statistical rigor, tree-search hypothesis exploration, cost-performance frontier, model cards, model registry, hypothesis database with novelty guard, anti-cheating guardrails, and the taste-leverage loop. Inspired by Karpathy's autoresearch and the scientific method itself.",
   "author": {
     "name": "pragnition"
   },
````
package/README.md
CHANGED

````diff
@@ -355,6 +355,11 @@ The index (`hypotheses.yaml`) is the lightweight queue. The detail files (`hypot
 | `/turing:sanity [--quick]` | Pre-training sanity checks — initial loss, single-batch overfit, gradient flow, output validation |
 | `/turing:baseline [--methods]` | Automatic baseline generation — random, majority/mean, linear, k-NN |
 | `/turing:leak [--deep]` | Targeted leakage detection — single-feature tests, correlation, train/test overlap |
+| `/turing:xray [exp-id]` | Internal model diagnostics — gradient flow, dead neurons, weight distributions, tree analysis |
+| `/turing:sensitivity [exp-id]` | Hyperparameter sensitivity — rank parameters by impact, detect non-monotonic responses |
+| `/turing:calibrate [exp-id]` | Probability calibration — ECE/MCE, reliability diagrams, Platt/isotonic/temperature scaling |
+| `/turing:feature [--method]` | Automated feature selection — multi-method consensus ranking, redundancy, interactions |
+| `/turing:curriculum [exp-id]` | Training curriculum optimization — difficulty scoring, strategy comparison, mislabeled sample detection |
 
 And for fully hands-off operation:
 
@@ -539,11 +544,11 @@ Each project gets independent config, data, experiments, models, and agent memor
 
 ## Architecture of Turing Itself
 
-
+49 commands, 2 agents, 10 config files, 68 template scripts, model registry, artifact contract, cost-performance frontier, model cards, tree-search exploration, statistical rigor, experiment intelligence, performance profiling, smart checkpoints, production model export, literature integration, paper section drafting, experiment orchestration (queue + retry + fork), deep analysis (diff + watch + regress), model composition (ensemble + stitch + warm), scaling & efficiency (scale + budget + distill), meta-intelligence (transfer + audit), pre-training intelligence (sanity + baseline + leak), model debugging (xray + sensitivity + calibrate), feature & training intelligence (feature + curriculum), 16 ADRs. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full codemap.
 
 ```
 turing/
-├── commands/
+├── commands/    48 skill files (core + taste-leverage + reporting + exploration + statistical rigor + experiment intelligence + performance + deployment + research workflow + orchestration + deep analysis + model composition + scaling & efficiency + meta-intelligence + pre-training intelligence + model debugging + feature & training intelligence)
 ├── agents/      2 agents (researcher: read/write, evaluator: read-only)
 ├── config/      8 files (lifecycle, taxonomy, archetypes, novelty aliases)
 ├── templates/   Scaffolded into user projects by /turing:init
````
package/commands/calibrate.md
ADDED

````diff
@@ -0,0 +1,47 @@
+---
+name: calibrate
+description: Probability calibration — measure ECE, plot reliability diagrams, apply Platt scaling or isotonic regression.
+disable-model-invocation: true
+argument-hint: "[exp-id] [--method platt|isotonic|temperature|auto]"
+allowed-tools: Read, Bash(*), Grep, Glob
+---
+
+Make model probabilities trustworthy. Does 80% confidence actually mean 80% correct?
+
+## Steps
+
+1. **Activate environment:**
+   ```bash
+   source .venv/bin/activate
+   ```
+
+2. **Parse arguments from `$ARGUMENTS`:**
+   - Optional experiment ID
+   - `--method platt|isotonic|temperature|auto` — calibration method (default: auto)
+   - `--json` — raw JSON output
+
+3. **Run calibration:**
+   ```bash
+   python scripts/calibration.py $ARGUMENTS
+   ```
+
+4. **Report includes:**
+   - ECE/MCE before calibration
+   - Reliability diagram (predicted vs actual per bin)
+   - Calibration method comparison table
+   - Verdict: ALREADY CALIBRATED / IMPROVED / NO IMPROVEMENT
+
+5. **Methods:**
+   - **Platt:** logistic regression on logits
+   - **Isotonic:** non-parametric (more flexible, needs more data)
+   - **Temperature:** single scalar T parameter
+   - **Auto:** tries all, picks lowest ECE
+
+6. **Saved output:** report in `experiments/calibration/<exp-id>-calibration.yaml`
+
+## Examples
+
+```
+/turing:calibrate exp-042                 # Auto-select best method
+/turing:calibrate exp-042 --method platt  # Platt scaling only
+```
````
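The ECE metric reported in step 4 of `calibrate` can be sketched in a few lines. This is an illustrative binned-ECE implementation under assumed conventions (equal-width bins, function name ours), not the actual `scripts/calibration.py`:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin.

    confidences: predicted probability of the chosen class per sample.
    correct: 1/0 per sample, whether the prediction was right.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(o for _, o in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece
```

A well-calibrated model scores near 0; a model that says 0.8 but is never right scores 0.8.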
package/commands/curriculum.md
ADDED

````diff
@@ -0,0 +1,43 @@
+---
+name: curriculum
+description: Training curriculum optimization — order data by difficulty, compare easy-to-hard vs hard-to-easy vs self-paced strategies.
+disable-model-invocation: true
+argument-hint: "[exp-id] [--strategies easy-to-hard,random]"
+allowed-tools: Read, Bash(*), Grep, Glob
+---
+
+Does the order your model sees data matter? Find out systematically.
+
+## Steps
+
+1. **Activate environment:**
+   ```bash
+   source .venv/bin/activate
+   ```
+
+2. **Parse arguments from `$ARGUMENTS`:**
+   - Optional experiment ID
+   - `--strategies "easy_to_hard,hard_to_easy,self_paced,random"` — strategies to test
+   - `--json` — raw JSON output
+
+3. **Run curriculum analysis:**
+   ```bash
+   python scripts/curriculum_optimizer.py $ARGUMENTS
+   ```
+
+4. **Strategies tested:**
+   - **Random:** standard shuffling (control)
+   - **Easy-to-hard:** classic curriculum learning
+   - **Hard-to-easy:** anti-curriculum
+   - **Self-paced:** start easy, gradually include harder samples
+
+5. **Report includes:** strategy comparison table with metric, convergence epoch, and speedup vs random; impossible sample detection (likely mislabeled)
+
+6. **Saved output:** report in `experiments/curriculum/<exp-id>-curriculum.yaml`
+
+## Examples
+
+```
+/turing:curriculum exp-042                          # All strategies
+/turing:curriculum --strategies easy_to_hard,random # Specific strategies
+```
````
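The ordering behind these strategies can be sketched with per-sample loss as the difficulty proxy. A common choice, but an assumption here; the actual `curriculum_optimizer.py` may score difficulty differently:

```python
import random

def curriculum_order(sample_losses, strategy="easy_to_hard"):
    """Return training indices ordered by a difficulty proxy (per-sample loss).

    Low loss = easy. "random" is the shuffled control strategy.
    """
    idx = sorted(range(len(sample_losses)), key=lambda i: sample_losses[i])
    if strategy == "easy_to_hard":
        return idx
    if strategy == "hard_to_easy":
        return list(reversed(idx))
    if strategy == "random":
        random.shuffle(idx)
        return idx
    raise ValueError(f"unknown strategy: {strategy}")
```

Samples whose loss stays maximal under every ordering are the "impossible sample" candidates the report flags as likely mislabeled.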
package/commands/feature.md
ADDED

````diff
@@ -0,0 +1,42 @@
+---
+name: feature
+description: Automated feature selection — multi-method importance consensus, redundancy detection, and interaction feature generation.
+disable-model-invocation: true
+argument-hint: "[--method all|importance] [--top-k 20]"
+allowed-tools: Read, Bash(*), Grep, Glob
+---
+
+Systematically evaluate which features matter and which are noise.
+
+## Steps
+
+1. **Activate environment:**
+   ```bash
+   source .venv/bin/activate
+   ```
+
+2. **Parse arguments from `$ARGUMENTS`:**
+   - `--method all|importance|selection|generation` — analysis type (default: all)
+   - `--top-k 20` — number of top features to consider
+   - `--json` — raw JSON output
+
+3. **Run feature analysis:**
+   ```bash
+   python scripts/feature_intelligence.py $ARGUMENTS
+   ```
+
+4. **Report includes:**
+   - Consensus ranking: features ranked by number of methods placing them in top-K
+   - Per-method ranks: mutual information, L1, tree-based
+   - Redundant pairs: features with |r| > 0.95
+   - Candidate interaction features from top consensus set
+   - Drop recommendation for zero-consensus features
+
+5. **Saved output:** report in `experiments/features/features-*.yaml`
+
+## Examples
+
+```
+/turing:feature             # Full analysis
+/turing:feature --top-k 10  # Top-10 consensus
+```
````
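The consensus ranking in step 4 is simple to sketch: count how many methods place each feature in their top-K. Function name and tie-breaking are our assumptions, not the actual `feature_intelligence.py`:

```python
from collections import Counter

def consensus_ranking(method_rankings, top_k):
    """Rank features by how many methods place them in their top-k.

    method_rankings: {method_name: [features, best first]}
    Returns (feature, votes) pairs, most votes first (ties alphabetical);
    zero-vote features are the drop candidates.
    """
    votes = Counter()
    for ranking in method_rankings.values():
        votes.update(ranking[:top_k])
    all_feats = {f for r in method_rankings.values() for f in r}
    return sorted(((f, votes[f]) for f in all_feats), key=lambda t: (-t[1], t[0]))
```

With three methods (e.g. mutual information, L1, tree importance), a feature in all three top-K lists gets 3 votes; one no method ranks gets 0.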
package/commands/sensitivity.md
ADDED

````diff
@@ -0,0 +1,41 @@
+---
+name: sensitivity
+description: Hyperparameter sensitivity analysis — rank parameters by impact, identify which matter and which are noise.
+disable-model-invocation: true
+argument-hint: "[exp-id] [--params learning_rate,max_depth]"
+allowed-tools: Read, Bash(*), Grep, Glob
+---
+
+Which hyperparameters actually matter? Stop wasting time on the ones that don't.
+
+## Steps
+
+1. **Activate environment:**
+   ```bash
+   source .venv/bin/activate
+   ```
+
+2. **Parse arguments from `$ARGUMENTS`:**
+   - Optional experiment ID
+   - `--params "learning_rate,max_depth"` — specific parameters to analyze
+   - `--json` — raw JSON output
+
+3. **Run sensitivity analysis:**
+   ```bash
+   python scripts/sensitivity_analysis.py $ARGUMENTS
+   ```
+
+4. **Report includes:**
+   - Per-parameter sensitivity ranking: HIGH / MED / LOW / NONE
+   - Metric range for each parameter sweep
+   - Monotonicity detection (is there a sweet spot?)
+   - Recommendations: focus tuning on X, stop tuning Y
+
+5. **Saved output:** report in `experiments/sensitivity/<exp-id>-sensitivity.yaml`
+
+## Examples
+
+```
+/turing:sensitivity exp-042                             # All tunable params
+/turing:sensitivity --params "learning_rate,max_depth"  # Specific params
+```
````
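The core of the step-4 ranking is the metric range across each one-dimensional sweep. A minimal sketch; the HIGH/MED/LOW/NONE thresholds below are illustrative assumptions, not the tool's actual cutoffs:

```python
def rank_sensitivity(sweeps, high=0.05, med=0.01):
    """Rank hyperparameters by metric range across a 1-D sweep.

    sweeps: {param: [(value, metric), ...]}.
    Returns (param, range, label) rows, widest range first.
    """
    def label(spread):
        if spread >= high:
            return "HIGH"
        if spread >= med:
            return "MED"
        return "LOW" if spread > 0 else "NONE"

    rows = []
    for param, points in sweeps.items():
        metrics = [m for _, m in points]
        spread = round(max(metrics) - min(metrics), 6)
        rows.append((param, spread, label(spread)))
    return sorted(rows, key=lambda row: -row[1])
```

A parameter whose sweep barely moves the metric earns a "stop tuning this" recommendation.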
package/commands/turing.md
CHANGED

````diff
@@ -53,6 +53,11 @@ You are the Turing ML research router. Detect the user's intent and route to the
 | "sanity", "sanity check", "pre-training", "is it broken", "before training", "quick check" | `/turing:sanity` | Check |
 | "baseline", "baselines", "trivial baseline", "majority class", "is it better than random" | `/turing:baseline` | Analyze |
 | "leak", "leakage", "data leakage scan", "suspicious feature", "train test overlap" | `/turing:leak` | Validate |
+| "xray", "model internals", "dead neurons", "gradient flow", "weight distribution", "inside the model" | `/turing:xray` | Analyze |
+| "sensitivity", "which params matter", "hyperparameter importance", "parameter ranking" | `/turing:sensitivity` | Analyze |
+| "calibrate", "calibration", "ECE", "reliability diagram", "overconfident", "probability calibration" | `/turing:calibrate` | Analyze |
+| "feature", "features", "feature selection", "feature importance", "which features matter", "redundant features" | `/turing:feature` | Analyze |
+| "curriculum", "training order", "easy to hard", "data ordering", "curriculum learning" | `/turing:curriculum` | Optimize |
 
 ## Sub-commands
 
@@ -102,6 +107,11 @@ You are the Turing ML research router. Detect the user's intent and route to the
 | `/turing:sanity [--quick]` | Pre-training sanity checks: initial loss, overfit test, gradient flow, output validation | (inline) |
 | `/turing:baseline [--methods]` | Automatic baseline generation: random, majority/mean, linear, k-NN | (inline) |
 | `/turing:leak [--deep]` | Targeted leakage detection: single-feature tests, correlation, train/test overlap | (inline) |
+| `/turing:xray [exp-id]` | Internal model diagnostics: gradient flow, dead neurons, weight distributions, tree analysis | (inline) |
+| `/turing:sensitivity [exp-id]` | Hyperparameter sensitivity analysis: rank parameters by impact, detect non-monotonic responses | (inline) |
+| `/turing:calibrate [exp-id]` | Probability calibration: ECE/MCE, reliability diagrams, Platt/isotonic/temperature scaling | (inline) |
+| `/turing:feature [--method]` | Automated feature selection: multi-method consensus ranking, redundancy, interaction generation | (inline) |
+| `/turing:curriculum [exp-id]` | Training curriculum optimization: difficulty scoring, strategy comparison, impossible sample detection | (inline) |
 
 ## Proactive Detection
 
````
package/commands/xray.md
ADDED

````diff
@@ -0,0 +1,43 @@
+---
+name: xray
+description: Internal model diagnostics — gradient flow, dead neurons, activation stats, weight distributions, tree depth analysis.
+disable-model-invocation: true
+argument-hint: "[exp-id] [--layer encoder.layer.2] [--compare exp-a exp-b]"
+allowed-tools: Read, Bash(*), Grep, Glob
+---
+
+See inside the model. When it underperforms, the fix depends on *why*.
+
+## Steps
+
+1. **Activate environment:**
+   ```bash
+   source .venv/bin/activate
+   ```
+
+2. **Parse arguments from `$ARGUMENTS`:**
+   - Optional experiment ID
+   - `--layer "name"` — focus on specific layer
+   - `--compare exp-a exp-b` — side-by-side diagnostics
+   - `--json` — raw JSON output
+
+3. **Run model diagnostics:**
+   ```bash
+   python scripts/model_xray.py $ARGUMENTS
+   ```
+
+4. **Diagnostics by model type:**
+   - **Neural networks:** gradient magnitudes, activation stats, dead neuron %, weight distributions, gradient-to-weight ratio
+   - **Tree models:** depth utilization, leaf purity, feature split dominance
+   - **scikit-learn:** coefficient magnitudes, feature importance concentration
+
+5. **Issues detected:** dead gradients, vanishing/exploding gradients, dead neurons, sparse weights, feature dominance, overfitting risk
+
+6. **Saved output:** report in `experiments/xrays/<exp-id>-xray.yaml`
+
+## Examples
+
+```
+/turing:xray exp-042  # Full diagnostics
+/turing:xray          # Best experiment
+```
````
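The dead-neuron percentage in the neural-network diagnostics can be sketched as follows. An illustrative definition under assumed conventions (a unit is "dead" if it never activates above a small threshold on any probe sample), not the actual `model_xray.py`:

```python
def dead_neuron_fraction(activations, threshold=1e-6):
    """Fraction of units that never activate above `threshold`.

    activations: list of rows (one per probe sample), one column per
    neuron — e.g. post-ReLU outputs of a layer. A unit that stays
    (near-)zero on every sample is counted as dead.
    """
    n_units = len(activations[0])
    dead = sum(
        1 for j in range(n_units)
        if all(abs(row[j]) <= threshold for row in activations)
    )
    return dead / n_units
```

A high fraction suggests dying-ReLU problems (learning rate too high, bad initialization) rather than, say, underfitting, which is why the fix depends on *why* the model underperforms.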
package/package.json
CHANGED

```diff
@@ -1,6 +1,6 @@
 {
   "name": "claude-turing",
-  "version": "3.1.0",
+  "version": "3.3.0",
   "type": "module",
   "description": "Autonomous ML research harness for Claude Code. The autoresearch loop as a formal protocol — iteratively trains, evaluates, and improves ML models with structured experiment tracking, convergence detection, immutable evaluation infrastructure, and safety guardrails.",
   "bin": {
```
package/src/install.js
CHANGED
package/src/verify.js
CHANGED

```diff
@@ -58,6 +58,11 @@ const EXPECTED_COMMANDS = [
   "sanity/SKILL.md",
   "baseline/SKILL.md",
   "leak/SKILL.md",
+  "xray/SKILL.md",
+  "sensitivity/SKILL.md",
+  "calibrate/SKILL.md",
+  "feature/SKILL.md",
+  "curriculum/SKILL.md",
 ];
 
 const EXPECTED_AGENTS = ["ml-researcher.md", "ml-evaluator.md"];
```
Binary files (the `__pycache__/*.pyc` entries above) — contents not shown.