claude-turing 2.3.0 → 2.4.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude-plugin/plugin.json +2 -2
- package/README.md +5 -2
- package/commands/ensemble.md +54 -0
- package/commands/stitch.md +49 -0
- package/commands/turing.md +6 -0
- package/commands/warm.md +53 -0
- package/package.json +1 -1
- package/src/install.js +1 -0
- package/src/verify.js +3 -0
- package/templates/scripts/__pycache__/build_ensemble.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/generate_brief.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/pipeline_manager.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/scaffold.cpython-314.pyc +0 -0
- package/templates/scripts/__pycache__/warm_start.cpython-314.pyc +0 -0
- package/templates/scripts/build_ensemble.py +696 -0
- package/templates/scripts/generate_brief.py +35 -0
- package/templates/scripts/pipeline_manager.py +457 -0
- package/templates/scripts/scaffold.py +6 -0
- package/templates/scripts/warm_start.py +493 -0
package/.claude-plugin/plugin.json
CHANGED
@@ -1,7 +1,7 @@
 {
 "name": "turing",
-"version": "2.
-"description": "Autonomous ML research harness — the autoresearch loop as a formal protocol.
+"version": "2.4.0",
+"description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 36 commands, 2 specialized agents, model composition (ensemble + pipeline stitch + warm-start), deep analysis (experiment diff + live training monitor + regression gate), experiment orchestration (batch queue + smart retry + branching), literature integration + paper drafting, production model export, performance profiling, smart checkpoints, experiment intelligence, statistical rigor, tree-search hypothesis exploration, cost-performance frontier, model cards, model registry, hypothesis database with novelty guard, anti-cheating guardrails, and the taste-leverage loop. Inspired by Karpathy's autoresearch and the scientific method itself.",
 "author": {
 "name": "pragnition"
 },
package/README.md
CHANGED
@@ -344,6 +344,9 @@ The index (`hypotheses.yaml`) is the lightweight queue. The detail files (`hypot
 | `/turing:diff <a> <b>` | Deep experiment comparison — config diffs, metric significance, per-class regressions, curve divergence |
 | `/turing:watch [--analyze]` | Live training monitor — loss spikes, NaN detection, overfitting, plateau alerts |
 | `/turing:regress [--tolerance]` | Performance regression gate — verify metrics haven't degraded after changes |
+| `/turing:ensemble [--top-k]` | Automated ensemble — voting, stacking, blending from top-K models |
+| `/turing:stitch <action>` | Pipeline composition — show, swap, cache, and run stages independently |
+| `/turing:warm <exp-id>` | Warm-start from prior model — load checkpoint, freeze layers, adjust LR |
 
 And for fully hands-off operation:
 
@@ -528,11 +531,11 @@ Each project gets independent config, data, experiments, models, and agent memor
 
 ## Architecture of Turing Itself
 
-
+36 commands, 2 agents, 10 config files, 55 template scripts, model registry, artifact contract, cost-performance frontier, model cards, tree-search exploration, statistical rigor, experiment intelligence, performance profiling, smart checkpoints, production model export, literature integration, paper section drafting, experiment orchestration (queue + retry + fork), deep analysis (diff + watch + regress), model composition (ensemble + stitch + warm), 16 ADRs. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full codemap.
 
 ```
 turing/
-├── commands/
+├── commands/ 35 skill files (core + taste-leverage + reporting + exploration + statistical rigor + experiment intelligence + performance + deployment + research workflow + orchestration + deep analysis + model composition)
 ├── agents/ 2 agents (researcher: read/write, evaluator: read-only)
 ├── config/ 8 files (lifecycle, taxonomy, archetypes, novelty aliases)
 ├── templates/ Scaffolded into user projects by /turing:init
package/commands/ensemble.md
ADDED
@@ -0,0 +1,54 @@
+---
+name: ensemble
+description: Automated ensemble construction — combines top-K models via voting, stacking, and blending for zero-cost improvement.
+disable-model-invocation: true
+argument-hint: "[--top-k 5] [--methods voting,stacking,blending]"
+allowed-tools: Read, Bash(*), Grep, Glob
+---
+
+Build ensembles from your best experiments automatically. Often yields 1-3% improvement with zero additional training.
+
+## Steps
+
+1. **Activate environment:**
+```bash
+source .venv/bin/activate
+```
+
+2. **Parse arguments from `$ARGUMENTS`:**
+- `--top-k 5` — number of top models to include (default: 5)
+- `--methods voting,stacking,blending` — ensemble methods to try
+- `--predictions-dir experiments/predictions` — directory with saved predictions
+- `--json` — raw JSON output
+
+3. **Run ensemble construction:**
+```bash
+python scripts/build_ensemble.py $ARGUMENTS
+```
+
+4. **Report results:**
+- Table of all ensemble methods tried with metric deltas vs best single model
+- Best ensemble method highlighted with improvement amount
+- Diversity analysis: prediction correlation matrix, diversity assessment
+- Base model summary: which experiments were combined
+
+5. **Ensemble methods:**
+- **Voting:** majority vote (classification) or mean (regression)
+- **Weighted voting:** weights proportional to individual model performance
+- **Stacking:** cross-validated meta-learner (ridge/logistic) on out-of-fold predictions
+- **Blending:** holdout-based meta-learner (simpler, less data-efficient)
+
+6. **Prerequisites:** experiments must have saved predictions in `experiments/predictions/`. Each experiment needs `<exp-id>-predictions.npy` and a shared `labels.npy`.
+
+7. **If no predictions exist:** suggest saving predictions during training by adding prediction logging to `evaluate.py`.
+
+8. **Saved output:** report written to `experiments/ensembles/ensemble-*.yaml`
+
+## Examples
+
+```
+/turing:ensemble # Default: top-5, all methods
+/turing:ensemble --top-k 3 # Top-3 models only
+/turing:ensemble --methods voting,stacking # Specific methods
+/turing:ensemble --json # Machine-readable output
+```
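Note on the voting methods listed in commands/ensemble.md: the 696-line `build_ensemble.py` is not shown in this diff, so the following is only a minimal sketch of what "majority vote" and "weights proportional to individual model performance" could look like. The function names `majority_vote` and `weighted_mean` are hypothetical and are not taken from the package; the sketch uses plain Python lists rather than the `.npy` arrays the command expects.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Hard voting for classification (hypothetical helper).

    Each element of `predictions` holds one model's class labels,
    aligned sample by sample with the other models.
    """
    voted = []
    for sample_preds in zip(*predictions):
        # most_common(1) resolves ties in first-seen order,
        # so on a tie the earliest model in the list wins.
        voted.append(Counter(sample_preds).most_common(1)[0][0])
    return voted


def weighted_mean(
    predictions: Sequence[Sequence[float]], scores: Sequence[float]
) -> List[float]:
    """Weighted voting for regression (hypothetical helper).

    Weights are each model's score divided by the sum of scores,
    assuming higher scores are better and all scores are positive.
    """
    total = sum(scores)
    weights = [s / total for s in scores]
    return [
        sum(w * p for w, p in zip(weights, sample_preds))
        for sample_preds in zip(*predictions)
    ]


if __name__ == "__main__":
    labels = majority_vote([[0, 1, 1, 0], [0, 1, 0, 0], [1, 1, 1, 0]])
    print(labels)
    print(weighted_mean([[1.0, 2.0], [3.0, 4.0]], scores=[1.0, 3.0]))
```

Stacking and blending are omitted here because they need a fitted meta-learner; the command description only states that ridge or logistic models are used for that step.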
package/commands/stitch.md
ADDED
@@ -0,0 +1,49 @@
+---
+name: stitch
+description: Pipeline composition — decompose ML pipelines into swappable stages. Show, swap, cache, and run stages independently.
+disable-model-invocation: true
+argument-hint: "<show|swap|cache|run> [stage] [--from exp-id]"
+allowed-tools: Read, Bash(*), Grep, Glob
+---
+
+Decompose your ML pipeline into stages that can be independently varied, cached, and reused across experiments.
+
+## Steps
+
+1. **Activate environment:**
+```bash
+source .venv/bin/activate
+```
+
+2. **Parse arguments from `$ARGUMENTS`:**
+- First argument is the action: `show`, `swap`, `cache`, `run`
+- `show` — display pipeline stages with hash and cache status
+- `swap <stage> --from <exp-id>` — replace a stage with one from another experiment
+- `cache` — save intermediate stage outputs to disk
+- `run` — execute pipeline, skipping cached stages
+
+3. **Run pipeline manager:**
+```bash
+python scripts/pipeline_manager.py $ARGUMENTS
+```
+
+4. **Report results:**
+- **show:** numbered stage list with description, content hash, and cache status
+- **swap:** what changed, old vs new stage config, updated pipeline
+- **cache:** per-stage cache paths and status
+- **run:** which stages will be skipped (cached) vs re-run
+
+5. **Stage types:** preprocess, features, model, postprocess (configurable in `config.yaml` under `pipeline.stages`)
+
+6. **Cache benefit:** when only the model stage changes, preprocessing and feature engineering are skipped — experiments run faster
+
+7. **If no pipeline config:** falls back to default 4-stage pipeline
+
+## Examples
+
+```
+/turing:stitch show # Display pipeline stages
+/turing:stitch swap model --from exp-031 # Keep features, swap model
+/turing:stitch cache # Cache intermediate outputs
+/turing:stitch run # Run with cached stages
+```
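Note on the cache behaviour described in commands/stitch.md: `pipeline_manager.py` itself is not included in this diff, so the sketch below only illustrates one plausible way a content hash per stage can drive the "skip cached stages" decision. It assumes each stage's hash chains in the hash of the stage before it, which is an assumption and not something the diff confirms. The names `stage_hashes` and `plan_run` and the config keys are hypothetical.

```python
import hashlib
import json
from typing import Dict, Sequence, Set

# Default stage order named in stitch.md.
DEFAULT_STAGES = ("preprocess", "features", "model", "postprocess")


def stage_hashes(
    stage_configs: Dict[str, dict], order: Sequence[str] = DEFAULT_STAGES
) -> Dict[str, str]:
    """Hash each stage's config together with the upstream hash (assumed scheme)."""
    hashes = {}
    upstream = ""
    for name in order:
        # sort_keys makes the hash independent of dict insertion order.
        payload = json.dumps(stage_configs.get(name, {}), sort_keys=True)
        upstream = hashlib.sha256((upstream + payload).encode("utf-8")).hexdigest()
        hashes[name] = upstream
    return hashes


def plan_run(
    stage_configs: Dict[str, dict],
    cached_hashes: Set[str],
    order: Sequence[str] = DEFAULT_STAGES,
) -> Dict[str, str]:
    """Mark a stage "skip" when its hash is already cached, otherwise "run"."""
    hashes = stage_hashes(stage_configs, order)
    return {
        name: ("skip" if hashes[name] in cached_hashes else "run") for name in order
    }


if __name__ == "__main__":
    base = {
        "preprocess": {"scaler": "standard"},
        "features": {"top_n": 20},
        "model": {"type": "xgboost"},
        "postprocess": {"threshold": 0.5},
    }
    cache = set(stage_hashes(base).values())
    swapped = dict(base, model={"type": "lightgbm"})
    print(plan_run(swapped, cache))
```

Under this chained scheme, swapping only the model stage leaves preprocess and features cached while the model and everything downstream of it re-run, which matches the cache benefit the command file describes.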
package/commands/turing.md
CHANGED
@@ -42,6 +42,9 @@ You are the Turing ML research router. Detect the user's intent and route to the
 | "diff", "deep compare", "what changed", "why did it diverge", "experiment diff" | `/turing:diff` | Analyze |
 | "watch", "monitor", "live training", "loss spike", "is it overfitting", "training progress" | `/turing:watch` | Monitor |
 | "regress", "regression", "did metrics degrade", "check for regression", "CI gate", "stability check" | `/turing:regress` | Validate |
+| "ensemble", "combine models", "voting", "stacking", "blending", "merge models" | `/turing:ensemble` | Compose |
+| "stitch", "pipeline", "swap stage", "cache stage", "pipeline composition" | `/turing:stitch` | Compose |
+| "warm", "warm start", "fine-tune", "continue training", "transfer learning", "from checkpoint" | `/turing:warm` | Compose |
 
 ## Sub-commands
 
@@ -80,6 +83,9 @@ You are the Turing ML research router. Detect the user's intent and route to the
 | `/turing:diff <exp-a> <exp-b>` | Deep experiment comparison: config diff, metric significance, per-class regressions, curve divergence | (inline) |
 | `/turing:watch [--analyze]` | Live training monitor with early-warning alerts (loss spike, NaN, overfitting, plateau) | (inline) |
 | `/turing:regress [--tolerance]` | Performance regression gate: re-run best experiment, verify metrics haven't degraded | (inline) |
+| `/turing:ensemble [--top-k] [--methods]` | Automated ensemble: voting, weighted voting, stacking, blending from top-K models | (inline) |
+| `/turing:stitch <action> [stage]` | Pipeline composition: show/swap/cache/run stages independently | (inline) |
+| `/turing:warm <exp-id>` | Warm-start from prior model: load checkpoint, freeze layers, adjust LR | (inline) |
 
 ## Proactive Detection
 
package/commands/warm.md
ADDED
@@ -0,0 +1,53 @@
+---
+name: warm
+description: Warm-start from a prior model — load checkpoint, optionally freeze layers, adjust learning rate, and continue training.
+disable-model-invocation: true
+argument-hint: "<exp-id> [--freeze-layers encoder] [--unfreeze-after 5]"
+allowed-tools: Read, Bash(*), Grep, Glob
+---
+
+Take a trained checkpoint and use it as initialization for a new experiment. Automates the "start from here but change X" pattern.
+
+## Steps
+
+1. **Activate environment:**
+```bash
+source .venv/bin/activate
+```
+
+2. **Parse arguments from `$ARGUMENTS`:**
+- First argument is the source experiment ID (required)
+- `--freeze-layers encoder decoder` — layer names to freeze (neural only)
+- `--unfreeze-after 5` — unfreeze all layers after N epochs (gradual unfreezing)
+- `--lr-factor 0.1` — learning rate reduction factor (default: 0.1x)
+- `--json` — raw JSON output
+
+3. **Run warm-start planner:**
+```bash
+python scripts/warm_start.py $ARGUMENTS
+```
+
+4. **Report results:**
+- Model type detection (tree, neural, sklearn)
+- Strategy: continue_boosting, load_weights, or warm_start_param
+- Numbered step-by-step instructions
+- Config changes to apply
+- Checkpoint info (path, format, size)
+
+5. **Strategies by model type:**
+- **Tree models (XGBoost/LightGBM):** continue boosting from existing trees with more estimators
+- **Neural networks:** load weights, optionally freeze layers, reset optimizer, reduce LR
+- **scikit-learn:** use `warm_start=True` parameter for incremental learning
+
+6. **If no checkpoint found:** plan is still generated, but warns that checkpoint is needed
+
+7. **Saved output:** report written to `experiments/warm_starts/warm-<exp-id>.yaml`
+
+## Examples
+
+```
+/turing:warm exp-042 # Auto-detect strategy
+/turing:warm exp-042 --freeze-layers encoder # Freeze encoder layers
+/turing:warm exp-042 --freeze-layers encoder --unfreeze-after 5 # Gradual unfreezing
+/turing:warm exp-042 --lr-factor 0.01 # Very small fine-tuning LR
+```
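Note on the warm-start options in commands/warm.md: `warm_start.py` is not shown in this diff, so this is only a rough sketch of how the three documented knobs could combine into a plan. The strategy names come from the command file; the pairing of each model type with a strategy follows the order in which the file lists them. The helper names `warm_start_lr` and `frozen_layers`, and the choice to treat epochs as zero-indexed, are assumptions made for illustration.

```python
from typing import List, Optional

# Strategy names as listed in warm.md, keyed by detected model type.
STRATEGY_BY_MODEL_TYPE = {
    "tree": "continue_boosting",
    "neural": "load_weights",
    "sklearn": "warm_start_param",
}


def warm_start_lr(base_lr: float, lr_factor: float = 0.1) -> float:
    """Scale the original learning rate by --lr-factor (documented default 0.1)."""
    return base_lr * lr_factor


def frozen_layers(
    epoch: int, freeze_layers: List[str], unfreeze_after: Optional[int] = None
) -> List[str]:
    """Return the layer names still frozen at a zero-indexed epoch.

    Assumption: with --unfreeze-after N, the first N epochs (0..N-1) train
    with the listed layers frozen, and every later epoch trains all layers.
    """
    if unfreeze_after is not None and epoch >= unfreeze_after:
        return []
    return list(freeze_layers)


if __name__ == "__main__":
    print(STRATEGY_BY_MODEL_TYPE["neural"])
    print(warm_start_lr(1e-3))
    for epoch in (0, 4, 5):
        print(epoch, frozen_layers(epoch, ["encoder"], unfreeze_after=5))
```

Without `--unfreeze-after`, the listed layers stay frozen for the whole run in this sketch; whether the real planner behaves the same way cannot be confirmed from the diff.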
package/package.json
CHANGED
@@ -1,6 +1,6 @@
 {
 "name": "claude-turing",
-"version": "2.
+"version": "2.4.0",
 "type": "module",
 "description": "Autonomous ML research harness for Claude Code. The autoresearch loop as a formal protocol — iteratively trains, evaluates, and improves ML models with structured experiment tracking, convergence detection, immutable evaluation infrastructure, and safety guardrails.",
 "bin": {
package/src/install.js
CHANGED
package/src/verify.js
CHANGED
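package/templates/scripts/__pycache__/build_ensemble.cpython-314.pyc
Binary file
package/templates/scripts/__pycache__/generate_brief.cpython-314.pyc
Binary file
package/templates/scripts/__pycache__/pipeline_manager.cpython-314.pyc
Binary file
package/templates/scripts/__pycache__/scaffold.cpython-314.pyc
Binary file
package/templates/scripts/__pycache__/warm_start.cpython-314.pyc
Binary file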