claude-turing 1.0.0 → 1.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (31)
  1. package/.claude-plugin/plugin.json +2 -2
  2. package/README.md +48 -7
  3. package/commands/brief.md +13 -1
  4. package/commands/card.md +36 -0
  5. package/commands/init.md +13 -0
  6. package/commands/train.md +16 -7
  7. package/commands/turing.md +4 -2
  8. package/package.json +1 -1
  9. package/src/install.js +1 -1
  10. package/src/verify.js +1 -0
  11. package/templates/model_contract.md +49 -0
  12. package/templates/model_registry.yaml +69 -0
  13. package/templates/program.md +2 -0
  14. package/templates/scripts/__pycache__/cost_frontier.cpython-314.pyc +0 -0
  15. package/templates/scripts/__pycache__/generate_brief.cpython-314.pyc +0 -0
  16. package/templates/scripts/__pycache__/generate_model_card.cpython-314.pyc +0 -0
  17. package/templates/scripts/__pycache__/scaffold.cpython-314.pyc +0 -0
  18. package/templates/scripts/cleanup.py +599 -0
  19. package/templates/scripts/cost_frontier.py +292 -0
  20. package/templates/scripts/diff_configs.py +534 -0
  21. package/templates/scripts/export_results.py +457 -0
  22. package/templates/scripts/generate_brief.py +54 -0
  23. package/templates/scripts/generate_model_card.py +342 -0
  24. package/templates/scripts/leaderboard.py +508 -0
  25. package/templates/scripts/plot_trajectory.py +611 -0
  26. package/templates/scripts/scaffold.py +9 -0
  27. package/templates/scripts/show_metrics.py +23 -2
  28. package/templates/tests/__pycache__/__init__.cpython-314.pyc +0 -0
  29. package/templates/tests/__pycache__/conftest.cpython-314-pytest-9.0.2.pyc +0 -0
  30. package/templates/tests/__pycache__/test_cost_frontier.cpython-314-pytest-9.0.2.pyc +0 -0
  31. package/templates/tests/test_cost_frontier.py +222 -0
package/.claude-plugin/plugin.json CHANGED
@@ -1,7 +1,7 @@
  {
  "name": "turing",
- "version": "1.0.0",
- "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 14 commands, 2 specialized agents, structured experiment lifecycle with convergence detection, immutable evaluation infrastructure, novelty guard, decision synthesis, hypothesis database, and safety guardrails that separate the hypothesis space from the measurement apparatus. Inspired by Karpathy's autoresearch and the scientific method itself.",
+ "version": "1.1.0",
+ "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 16 commands, 2 specialized agents, cost-performance frontier analysis, model cards, model registry, hypothesis database with novelty guard, anti-cheating guardrails, and the taste-leverage loop. Inspired by Karpathy's autoresearch and the scientific method itself.",
  "author": {
  "name": "pragnition"
  },
package/README.md CHANGED
@@ -300,8 +300,8 @@ The index (`hypotheses.yaml`) is the lightweight queue. The detail files (`hypot
 
  | Command | What it does |
  |---------|-------------|
- | `/turing:init [--plan]` | Scaffold a new ML project. `--plan` generates a literature-grounded research plan. |
- | `/turing:train [N]` | Run the autonomous experiment loop (optional max iterations) |
+ | `/turing:init [--plan]` | Scaffold a new ML project. `--plan` generates a literature-grounded research plan. Supports multiple projects in subdirectories. |
+ | `/turing:train [ml/project] [N]` | Run the experiment loop. Auto-detects project from cwd or explicit path. |
  | `/turing:sweep` | Systematic hyperparameter sweep via cartesian product |
  | `/turing:status` | Quick experiment status — best model, convergence state |
  | `/turing:compare <a> <b>` | Side-by-side experiment comparison with causal analysis |
@@ -321,6 +321,7 @@ The index (`hypotheses.yaml`) is the lightweight queue. The detail files (`hypot
  | Command | What it does |
  |---------|-------------|
  | `/turing:validate [--auto]` | Check metric stability — auto-configure multi-run if noisy |
+ | `/turing:card` | Generate a model card — performance, limitations, intended use, artifact contract |
  | `/turing:logbook` | Generate HTML experiment logbook |
  | `/turing:report` | Generate research report |
  | `/turing:poster` | Generate research poster |
@@ -389,6 +390,32 @@ After N experiments with no meaningful improvement, the agent stops and reports
 
  For noisy metrics, `/turing:validate` runs the pipeline multiple times and measures variance. If the coefficient of variation exceeds 5%, it auto-configures multi-run evaluation so the agent can't be rewarded for lucky single runs.
 
+ ## Cost-Performance Frontier
+
+ > *"This model is 2% better but takes 10x longer to train. Is that worth it?"*
+
+ The briefing now surfaces [Pareto-optimal](https://en.wikipedia.org/wiki/Pareto_efficiency) experiments — the efficient set where no other experiment is both faster AND has a better metric. The cost report tells you the tradeoff in plain language:
+
+ ```
+ Best metric: exp-012 (accuracy=0.893, 2400s)
+ Best efficiency: exp-003 (accuracy=0.871, 3s)
+ The 2.5% improvement costs 800x more compute.
+ ```
+
+ Run `python scripts/cost_frontier.py` directly, or read the "Cost-Performance Analysis" section in `/turing:brief`.
+
+ ## Model Cards
+
+ When it's time to ship, `/turing:card` generates a standardized model card documenting:
+ - Model type, framework, training time
+ - Performance metrics (all configured metrics)
+ - Training data source and split ratios
+ - Limitations (including overfit detection)
+ - Intended use and ethical considerations (user fills these in)
+ - Artifact contract version for production consumers
+
+ Inspired by [Google's Model Cards](https://arxiv.org/abs/1810.03993) and [Hugging Face model cards](https://huggingface.co/docs/hub/model-cards).
+
  ## Installation
 
  ```bash
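The Pareto filter described in the "Cost-Performance Frontier" section above boils down to a dominance check. A minimal sketch of the idea behind `scripts/cost_frontier.py` (the record keys `id`, `metric`, and `seconds` are assumptions for illustration, not the script's actual schema):

```python
def pareto_frontier(experiments):
    """Keep experiments that no other run dominates: dominated means
    another run is at-least-as-fast AND at-least-as-good, with one
    of the two strictly better."""
    frontier = []
    for a in experiments:
        dominated = any(
            b["seconds"] <= a["seconds"]
            and b["metric"] >= a["metric"]
            and (b["seconds"] < a["seconds"] or b["metric"] > a["metric"])
            for b in experiments
        )
        if not dominated:
            frontier.append(a)
    return sorted(frontier, key=lambda e: e["seconds"])


runs = [
    {"id": "exp-003", "metric": 0.871, "seconds": 3},
    {"id": "exp-007", "metric": 0.860, "seconds": 900},  # dominated by exp-003
    {"id": "exp-012", "metric": 0.893, "seconds": 2400},
]
print([e["id"] for e in pareto_frontier(runs)])  # → ['exp-003', 'exp-012']
```

The example numbers mirror the cost report above: exp-012 is best on metric, exp-003 best on efficiency, and anything slower AND worse than a frontier point drops out.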
@@ -404,15 +431,27 @@ claude plugin add /path/to/turing
  ### Quick Start
 
  ```bash
- /turing:init # Scaffold project (answer 3 prompts)
- /turing:train # Run experiment loop
- /turing:brief # Read what happened
- /turing:try "idea" # Inject your taste
+ /turing:init                     # Scaffold project (answer 3 prompts)
+ /turing:train                    # Run experiment loop
+ /turing:brief                    # Read what happened
+ /turing:try "idea"               # Inject your taste
  ```
 
+ ### Multiple Projects
+
+ ```bash
+ /turing:init                     # Scaffold ml/sentiment
+ /turing:init                     # Scaffold ml/churn
+ /turing:train ml/sentiment       # Train in specific project
+ /turing:brief ml/churn           # Brief for specific project
+ cd ml/sentiment && /turing:train # Auto-detects from cwd
+ ```
+
+ Each project gets independent config, data, experiments, models, and agent memory.
+
  ## Architecture of Turing Itself
 
- 15 commands, 2 agents, 8 config files, 25 template scripts, 338 tests, 16 ADRs. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full codemap.
+ 16 commands, 2 agents, 8 config files, 30 template scripts, model registry, artifact contract, cost-performance frontier, model cards, 345 tests, 16 ADRs. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full codemap.
 
  ```
  turing/
@@ -423,6 +462,8 @@ turing/
  │ ├── prepare.py Data loading (HIDDEN from agent)
  │ ├── evaluate.py Evaluation harness (HIDDEN from agent)
  │ ├── train.py Training code (AGENT-EDITABLE)
+ │ ├── model_contract.md Artifact schema for production consumers
+ │ ├── model_registry.yaml Available model architectures + hyperparams
  │ └── scripts/ 25 Python scripts (core loop + analysis + infra)
  ├── tests/ 338 tests (unit + integration + anti-pattern + manifest)
  ├── src/ 5 JS installer files (npm deployment)
package/commands/brief.md CHANGED
@@ -2,12 +2,24 @@
  name: brief
  description: Generate a structured research intelligence report from experiment history — what's been learned, what's promising, what's exhausted, and what the human should consider next. Use --deep for literature-grounded suggestions.
  disable-model-invocation: true
- argument-hint: "[--deep]"
+ argument-hint: "[ml/project] [--deep]"
  allowed-tools: Read, Bash(python scripts/*:*, source .venv/bin/activate:*), Grep, Glob, WebSearch, WebFetch
  ---
 
  Generate a research briefing that a human can read in 2 minutes and immediately decide what to inject next.
 
+ ## Project Detection
+
+ Before generating the briefing, detect which project to report on:
+
+ 0. **Detect project directory:**
+    - If `$ARGUMENTS` contains a path (e.g., `ml/coding`), use that as the project directory
+    - Else if cwd contains `config.yaml` and `train.py`, use cwd
+    - Else search for `ml/*/` subdirectories containing `config.yaml`
+      - If exactly one found, use it
+      - If multiple found, list them and ask the user which to report on
+    - All subsequent commands run from the detected project directory
+
  ## Steps
 
  1. **Generate the briefing:**
package/commands/card.md ADDED
@@ -0,0 +1,36 @@
+ ---
+ name: card
+ description: Generate a standardized model card documenting the trained model — type, performance, training data, limitations, intended use, and artifact contract.
+ disable-model-invocation: true
+ allowed-tools: Read, Bash(python scripts/*:*, source .venv/bin/activate:*), Grep, Glob
+ ---
+
+ You generate a standardized model card from the experiment log, model contract, and config.
+
+ ## Steps
+
+ 1. **Activate the virtual environment:**
+    ```bash
+    source .venv/bin/activate
+    ```
+
+ 2. **Run the model card generator:**
+    ```bash
+    python scripts/generate_model_card.py --config config.yaml --log experiments/log.jsonl --contract model_contract.md --output MODEL_CARD.md
+    ```
+
+ 3. **Read and present the generated card:**
+    - Read `MODEL_CARD.md` and display it to the user.
+    - If no experiments exist yet, inform the user and show the skeleton card.
+
+ 4. **Suggest next steps:**
+    - Review the **Ethical Considerations** section and fill in bias, fairness, and impact notes.
+    - Review the **Intended Use** section and document what the model is NOT intended for.
+    - If limitations mention overfitting, suggest running `/turing:validate` for stability checks.
+    - If the card looks complete, suggest committing it to version control.
+
+ ## Error Handling
+
+ - If `config.yaml` is missing, tell the user to run `/turing:init` first.
+ - If `experiments/log.jsonl` is missing or empty, generate a skeleton card and note that training is needed.
+ - If `.venv` doesn't exist, try `python3 scripts/generate_model_card.py` directly.
package/commands/init.md CHANGED
@@ -121,3 +121,16 @@ If `$ARGUMENTS` contains `--plan`, generate a research plan AFTER scaffolding. T
  ### Integration
 
  The agent's `program.md` OBSERVE step reads `RESEARCH_PLAN.md` (if it exists) for strategic direction. The plan is advisory — the agent can deviate but should note why in `experiment_state.yaml`.
+
+ ## Multiple Projects
+
+ You can scaffold multiple ML projects in the same repository:
+
+ ```bash
+ /turing:init # First project: prompts for ml_dir (e.g., ml/sentiment)
+ /turing:init # Second project: prompts for ml_dir (e.g., ml/churn)
+ ```
+
+ Each project gets its own directory with independent config, data, experiments, and models. `/turing:train ml/sentiment` or `/turing:train ml/churn` targets a specific project. If you `cd ml/sentiment` first, `/turing:train` auto-detects from cwd.
+
+ Agent memory is scoped per project: `.claude/agent-memory/ml-researcher-{project_name}/MEMORY.md`
package/commands/train.md CHANGED
@@ -12,16 +12,25 @@ Read `program.md` in the ML project directory for the complete protocol. Follow
 
  ## Arguments
 
- `$ARGUMENTS` — if a number, use as max_iterations (stop after N experiments). If empty, run until convergence (as defined in `config.yaml` convergence settings).
+ `$ARGUMENTS` — accepts a project path (e.g., `ml/coding`), a number for max_iterations, or both (e.g., `ml/coding 10`). If no number, run until convergence (as defined in `config.yaml` convergence settings).
 
  ## Bootstrap Sequence
 
- 0. **Restore memory:** Read `.claude/agent-memory/ml-researcher/MEMORY.md` for prior observations and best results.
- 1. **Read protocol:** Read `program.md` completely it defines the experiment loop, constraints, and output format.
- 2. **Bootstrap data:** Check for training data at `config.yaml` `data.source`. If no splits exist, run `python prepare.py`.
- 3. **Bootstrap venv:** `test -d .venv || (python3 -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt)`
- 4. **Assess state:** `source .venv/bin/activate && python scripts/show_metrics.py --last 5`
- 5. **Begin the loop** from program.md.
+ 0. **Detect project directory:**
+    - If `$ARGUMENTS` contains a path (e.g., `ml/coding`), use that as the project directory
+    - Else if cwd contains `config.yaml` and `train.py`, use cwd
+    - Else search for `ml/*/` subdirectories containing `config.yaml`
+      - If exactly one found, use it
+      - If multiple found, list them and ask the user which to target
+    - All subsequent commands run from the detected project directory
+    - Memory path: `.claude/agent-memory/ml-researcher-{project_name}/MEMORY.md`
+
+ 1. **Restore memory:** Read `.claude/agent-memory/ml-researcher-{project_name}/MEMORY.md` for prior observations and best results.
+ 2. **Read protocol:** Read `program.md` completely — it defines the experiment loop, constraints, and output format.
+ 3. **Bootstrap data:** Check for training data at `config.yaml` → `data.source`. If no splits exist, run `python prepare.py`.
+ 4. **Bootstrap venv:** `test -d .venv || (python3 -m venv .venv && source .venv/bin/activate && pip install -r requirements.txt)`
+ 5. **Assess state:** `source .venv/bin/activate && python scripts/show_metrics.py --last 5`
+ 6. **Begin the loop** from program.md.
 
  ## The Loop
 
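The step-0 detection cascade in the new train.md (and brief.md) hunk can be sketched as plain Python. This is an illustration of the logic, not code from the package; the helper name `detect_project_dir` is hypothetical:

```python
from pathlib import Path


def detect_project_dir(arguments: str, cwd: Path) -> list[Path]:
    """Return candidate project directories per the step-0 cascade.

    One element means an unambiguous project; several mean the command
    should list them and ask the user which to target.
    """
    # 1. An explicit path token in $ARGUMENTS wins (e.g., "ml/coding 10").
    for token in arguments.split():
        if "/" in token:
            return [Path(token)]
    # 2. cwd is itself a project if it holds config.yaml and train.py.
    if (cwd / "config.yaml").exists() and (cwd / "train.py").exists():
        return [cwd]
    # 3. Otherwise scan ml/*/ for directories containing config.yaml.
    return sorted(p.parent for p in cwd.glob("ml/*/config.yaml"))
```

With one project under `ml/`, the result is unambiguous; with several, the caller gets the full list back and must disambiguate, matching the "list them and ask" step.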
package/commands/turing.md CHANGED
@@ -9,7 +9,7 @@ You are the Turing ML research router. Detect the user's intent and route to the
 
  | User says... | Route to | Lifecycle phase |
  |---|---|---|
- | "train", "run experiments", "autoresearch", "improve the model", "start training" | `/turing:train` | Execute |
+ | "train", "train ml/coding", "train ml/claims", "run experiments", "run experiments in ml/X", "autoresearch", "improve the model", "start training" | `/turing:train` | Execute |
  | "status", "how's training", "experiment results", "current metrics" | `/turing:status` | Observe |
  | "compare", "diff runs", "which is better" | `/turing:compare` | Analyze |
  | "sweep", "grid search", "hyperparameter search", "tune" | `/turing:sweep` | Explore |
@@ -24,12 +24,13 @@ You are the Turing ML research router. Detect the user's intent and route to the
  | "design", "plan experiment", "how should I test", "experiment design" | `/turing:design` | Design |
  | "mode", "explore", "exploit", "replicate", "strategy" | `/turing:mode` | Strategy |
  | "preflight", "resources", "VRAM", "memory", "can I run", "OOM", "GPU" | `/turing:preflight` | Check |
+ | "card", "model card", "document model", "model documentation" | `/turing:card` | Document |
 
  ## Sub-commands
 
  | Command | Purpose | Agent |
  |---|---|---|
- | `/turing:train [N]` | Run the autonomous experiment loop | @ml-researcher |
+ | `/turing:train [ml/project] [N]` | Run the autonomous experiment loop (auto-detects project from path or cwd) | @ml-researcher |
  | `/turing:status` | Show experiment status, best model, convergence | @ml-evaluator |
  | `/turing:compare <a> <b>` | Side-by-side experiment comparison | @ml-evaluator |
  | `/turing:sweep` | Generate and run hyperparameter sweep | @ml-researcher |
@@ -44,6 +45,7 @@ You are the Turing ML research router. Detect the user's intent and route to the
  | `/turing:report` | Structured markdown research report | (inline) |
  | `/turing:mode <mode>` | Set research strategy (explore/exploit/replicate) | (inline) |
  | `/turing:preflight` | Pre-flight resource check (VRAM/RAM/disk) | (inline) |
+ | `/turing:card` | Generate standardized model card (type, performance, data, limitations, contract) | (inline) |
 
  ## Proactive Detection
 
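The routing tables above amount to phrase matching over the user's utterance. A toy sketch with abbreviated phrase lists (the `route` helper is hypothetical — the real router is the Claude command itself, not a Python function):

```python
# Ordered (phrases, command) pairs; first match wins, so the more
# specific "card" intents are checked before the broad "train" ones.
ROUTES = [
    (("model card", "card", "document model"), "/turing:card"),
    (("train", "run experiments", "autoresearch"), "/turing:train"),
    (("status", "current metrics", "how's training"), "/turing:status"),
]


def route(utterance: str):
    """Return the sub-command for an utterance, or None if no phrase matches."""
    text = utterance.lower()
    for phrases, command in ROUTES:
        if any(p in text for p in phrases):
            return command
    return None


print(route("generate a model card"))  # → /turing:card
```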
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "claude-turing",
- "version": "1.0.0",
+ "version": "1.1.0",
  "type": "module",
  "description": "Autonomous ML research harness for Claude Code. The autoresearch loop as a formal protocol — iteratively trains, evaluates, and improves ML models with structured experiment tracking, convergence detection, immutable evaluation infrastructure, and safety guardrails.",
  "bin": {
package/src/install.js CHANGED
@@ -22,7 +22,7 @@ const PLUGIN_ROOT = join(__dirname, "..");
  const SUB_COMMANDS = [
  "init", "train", "status", "compare", "sweep", "validate",
  "try", "brief", "suggest", "design", "logbook", "poster",
- "report", "mode", "preflight",
+ "report", "mode", "preflight", "card",
  ];
 
  export async function install(opts = {}) {
package/src/verify.js CHANGED
@@ -29,6 +29,7 @@ const EXPECTED_COMMANDS = [
  "report/SKILL.md",
  "mode/SKILL.md",
  "preflight/SKILL.md",
+ "card/SKILL.md",
  ];
 
  const EXPECTED_AGENTS = ["ml-researcher.md", "ml-evaluator.md"];
package/templates/model_contract.md ADDED
@@ -0,0 +1,49 @@
+ # Model Artifact Contract
+
+ Version: 1
+ Last updated: {{PROJECT_NAME}} initial scaffold
+
+ ## Bundle Format
+
+ The trained model is saved as a joblib bundle at `models/best/model.joblib` containing:
+
+ ```python
+ {
+     "model": <fitted model object>,
+     "featurizer": <fitted CompositeFeaturizer>,
+     "config": <dict of training config>,
+     "contract_version": 1
+ }
+ ```
+
+ ## Metadata
+
+ `models/best/metadata.json` contains:
+
+ ```json
+ {
+     "contract_version": 1,
+     "model_type": "xgboost",
+     "experiment_id": "exp-001",
+     "metrics": {"{{TARGET_METRIC}}": 0.0},
+     "feature_names": [],
+     "created_at": "ISO-8601"
+ }
+ ```
+
+ ## Consumer Contract
+
+ Any service loading this model expects:
+ - `bundle["model"]` has a `.predict()` method accepting a feature matrix
+ - `bundle["featurizer"]` has a `.transform(df)` method returning a DataFrame
+ - `bundle.get("contract_version", 0)` must equal 1
+
+ If `contract_version` doesn't match, the consumer should log a warning and fall back to a default/rules-based approach.
+
+ ## Breaking Changes
+
+ Increment `contract_version` when changing:
+ - Feature schema (different featurizer output shape)
+ - Label encoding (different label_map)
+ - Bundle key names
+ - Model input/output format
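A consumer honoring the version check in this new contract template might look like the following sketch — the `check_contract` helper is illustrative, not part of the package:

```python
import logging

EXPECTED_CONTRACT_VERSION = 1


def check_contract(bundle: dict):
    """Return the bundle if its contract_version matches, else None.

    Per the contract, a mismatch is logged as a warning and the
    consumer falls back to a default/rules-based approach.
    """
    found = bundle.get("contract_version", 0)
    if found != EXPECTED_CONTRACT_VERSION:
        logging.warning("contract_version %s != expected %s; falling back",
                        found, EXPECTED_CONTRACT_VERSION)
        return None
    return bundle


# Typical consumer flow (joblib load and the rules-based fallback elided):
#   bundle = check_contract(joblib.load("models/best/model.joblib"))
#   if bundle is not None:
#       preds = bundle["model"].predict(bundle["featurizer"].transform(df))
```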
package/templates/model_registry.yaml ADDED
@@ -0,0 +1,69 @@
+ # Model Registry for {{PROJECT_NAME}}
+ #
+ # Catalog of available model architectures. The agent reads this
+ # during /turing:suggest and archetype:model_comparison to know
+ # what models are available and how to configure them.
+ #
+ # Add your domain-specific models here.
+
+ models:
+   xgboost:
+     name: "XGBoost Classifier"
+     family: "gradient_boosting"
+     task: "{{TARGET_METRIC}}"
+     notes: "Default model. Good for tabular data with mixed feature types."
+     paper: "https://arxiv.org/abs/1603.02754"
+     default_hyperparams:
+       n_estimators: 100
+       max_depth: 4
+       learning_rate: 0.1
+       objective: "multi:softmax"
+
+   lightgbm:
+     name: "LightGBM Classifier"
+     family: "gradient_boosting"
+     task: "{{TARGET_METRIC}}"
+     notes: "Often faster than XGBoost. Leaf-wise growth. Try dart boosting for regularization."
+     paper: "https://papers.nips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html"
+     default_hyperparams:
+       n_estimators: 100
+       max_depth: -1
+       learning_rate: 0.1
+       num_leaves: 31
+
+   random_forest:
+     name: "Random Forest Classifier"
+     family: "ensemble"
+     task: "{{TARGET_METRIC}}"
+     notes: "Bagging ensemble. Good baseline. Less prone to overfitting than single trees."
+     default_hyperparams:
+       n_estimators: 100
+       max_depth: null
+       min_samples_split: 2
+
+   logistic_regression:
+     name: "Logistic Regression"
+     family: "linear"
+     task: "{{TARGET_METRIC}}"
+     notes: "Simple linear baseline. Always try this first — if it works well, your features are strong."
+     default_hyperparams:
+       C: 1.0
+       max_iter: 1000
+
+   mlp:
+     name: "Multi-Layer Perceptron"
+     family: "neural_network"
+     task: "{{TARGET_METRIC}}"
+     notes: "Simple neural network. Try when samples > 2000. Needs feature scaling."
+     default_hyperparams:
+       hidden_layer_sizes: [100, 50]
+       learning_rate_init: 0.001
+       max_iter: 200
+
+ # Add your domain-specific models below:
+ # e.g., for NLP tasks:
+ #   bert-base:
+ #     name: "BERT Base"
+ #     family: "transformer"
+ #     hf_id: "bert-base-uncased"
+ #     ...
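A script (or the agent) consuming this registry might look up a model's defaults with PyYAML — a sketch under the assumption that PyYAML is available; the `model_defaults` helper is hypothetical:

```python
import yaml  # PyYAML


def model_defaults(registry_text: str, model_key: str) -> dict:
    """Return a model's default_hyperparams from registry YAML text."""
    registry = yaml.safe_load(registry_text)
    return dict(registry["models"][model_key].get("default_hyperparams", {}))


# Trimmed-down registry in the same shape as the template above.
registry_text = """
models:
  logistic_regression:
    name: "Logistic Regression"
    family: "linear"
    default_hyperparams:
      C: 1.0
      max_iter: 1000
"""
print(model_defaults(registry_text, "logistic_regression"))
# → {'C': 1.0, 'max_iter': 1000}
```

In a real project the text would come from `Path("model_registry.yaml").read_text()` inside the project directory.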
package/templates/program.md CHANGED
@@ -71,6 +71,8 @@ The autoresearch experiment loop. Each iteration is one experiment — one hypot
  cat RESEARCH_PLAN.md 2>/dev/null || true
  ```
 
+ Read `model_registry.yaml` to know what model architectures are available, their default hyperparameters, and family groupings. Use this to inform model selection during suggest and model_comparison archetypes.
+
  If `RESEARCH_PLAN.md` exists, use it for strategic direction (which model families to explore, in what order, what budget). The plan is advisory — deviate if evidence warrants, but note why.
 
  For the most recent discarded experiments, read the actual git diff to understand what was tried and failed — do NOT rely on your own memory of what you changed: