npm - claude-turing - Versions diffs - 4.6.0 → 4.7.0 - Mend

claude-turing 4.6.0 → 4.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (255)

package/.claude-plugin/plugin.json CHANGED

@@ -1,7 +1,7 @@
 {
   "name": "turing",
-  "version": "4.6.0",
-  "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 74 commands, 2 specialized agents, operational intelligence (postmortem + doctor + plan), model lifecycle (update + registry), what-if analysis (whatif + counterfactual + simulate), collaboration (onboard + share + review), research communication (cite + present + changelog), experiment archaeology (trend + flashback + archive + annotate + search + template + replay), model surgery (prune + quantize + merge + surgery), feature & training intelligence, model debugging, pre-training intelligence, meta-intelligence, scaling & efficiency, model composition, deep analysis, experiment orchestration, literature + paper, model export, profiling, checkpoints, experiment intelligence, statistical rigor, tree-search, cost-performance, model cards, hypothesis database, novelty guard, anti-cheating, taste-leverage loop. Inspired by Karpathy's autoresearch and the scientific method itself.",
+  "version": "4.7.0",
+  "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 74 commands, 2 specialized agents, modern skills/turing package mirror, operational intelligence (postmortem + doctor + plan), model lifecycle (update + registry), what-if analysis (whatif + counterfactual + simulate), collaboration (onboard + share + review), research communication (cite + present + changelog), experiment archaeology (trend + flashback + archive + annotate + search + template + replay), model surgery (prune + quantize + merge + surgery), feature & training intelligence, model debugging, pre-training intelligence, meta-intelligence, scaling & efficiency, model composition, deep analysis, experiment orchestration, literature + paper, model export, profiling, checkpoints, experiment intelligence, statistical rigor, tree-search, cost-performance, model cards, hypothesis database, novelty guard, anti-cheating, taste-leverage loop. Inspired by Karpathy's autoresearch and the scientific method itself.",
   "author": {
     "name": "Prannaya Gupta"
   },

package/README.md CHANGED

@@ -3,7 +3,7 @@
 *The research assistant that can't fool itself.*
 <p align="center">
-  <img src="https://img.shields.io/badge/version-4.6.0-ffb74d?style=flat-square&labelColor=1a1a2e" alt="Version" />
+  <img src="https://img.shields.io/badge/version-4.7.0-ffb74d?style=flat-square&labelColor=1a1a2e" alt="Version" />
   <img src="https://img.shields.io/badge/license-MIT-ff4d4d?style=flat-square&labelColor=1a1a2e" alt="License" />
   <img src="https://img.shields.io/badge/Claude_Code-plugin-ff4d4d?style=flat-square&labelColor=1a1a2e" alt="Claude Code" />
   <img src="https://img.shields.io/badge/Node.js-20%2B-ff4d4d?style=flat-square&labelColor=1a1a2e" alt="Node.js" />

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "claude-turing",
-  "version": "4.6.0",
+  "version": "4.7.0",
   "type": "module",
   "description": "Autonomous ML research harness for Claude Code. The autoresearch loop as a formal protocol — iteratively trains, evaluates, and improves ML models with structured experiment tracking, convergence detection, immutable evaluation infrastructure, and safety guardrails.",
   "bin": {
@@ -8,16 +8,22 @@
     "claude-turing": "./bin/cli.js"
   },
   "scripts": {
-    "postinstall": "node src/postinstall.js"
+    "postinstall": "node src/postinstall.js",
+    "sync:skills": "node src/sync-skills-layout.js"
   },
   "files": [
     "bin/",
     "src/",
     ".claude-plugin/",
     "commands/",
+    "skills/",
     "agents/",
     "templates/",
-    "config/"
+    "config/",
+    "!src/**/*.egg-info/**",
+    "!templates/**/__pycache__/**",
+    "!templates/**/*.pyc",
+    "!templates/**/.pytest_cache/**"
   ],
   "keywords": [
     "ml",

package/skills/turing/SKILL.md ADDED

@@ -0,0 +1,180 @@
+---
+name: turing
+description: Autonomous ML research harness. Thin router that detects ML training intent and identifies the matching Turing sub-command execution path. Each sub-command handles one phase of the experiment lifecycle.
+---
+You are the Turing ML research router. Detect the user's intent and identify the matching Turing sub-command execution path.
+## Execution Contract
+Turing sub-commands are explicit slash-command skills. Current sub-commands are `slash_only` and use `disable-model-invocation: true`, so router handling must not claim model dispatch into those skills.
+- If the user explicitly invokes `/turing:<cmd>`, Claude Code runtime handles that slash command.
+- If the user invokes `/turing` as a router and the detected command is `slash_only`, give the exact slash command to run.
+- If a command has a documented safe equivalent script, the assistant may execute those documented steps inline when safe and appropriate.
+## Routing Table
+| User says... | Route to | Lifecycle phase |
+|---|---|---|
+| "train", "train ml/coding", "train ml/claims", "run experiments", "run experiments in ml/X", "autoresearch", "improve the model", "start training" | `/turing:train` | Execute |
+| "status", "how's training", "experiment results", "current metrics" | `/turing:status` | Observe |
+| "compare", "diff runs", "which is better" | `/turing:compare` | Analyze |
+| "sweep", "grid search", "hyperparameter search", "tune" | `/turing:sweep` | Explore |
+| "init", "set up ML", "initialize", "scaffold", "bootstrap" | `/turing:init` | Setup |
+| "try", "test this", "inject", "what if we", "I think we should" | `/turing:try` | Steer |
+| "brief", "briefing", "what have we learned", "summary" | `/turing:brief` | Report |
+| "logbook", "log", "history", "timeline", "narrative" | `/turing:logbook` | Document |
+| "poster", "presentation", "one-pager", "visual summary" | `/turing:poster` | Document |
+| "report", "write-up", "findings", "document results" | `/turing:report` | Document |
+| "validate", "stability", "check variance", "noisy" | `/turing:validate` | Validate |
+| "seed", "seed study", "multi-seed", "lucky seed", "seed sensitivity" | `/turing:seed` | Validate |
+| "reproduce", "reproducibility", "verify results", "re-run experiment", "repro" | `/turing:reproduce` | Validate |
+| "suggest", "what model", "recommend", "which architecture", "literature" | `/turing:suggest` | Research |
+| "explore hypotheses", "tree search", "treequest", "search hypothesis space", "MCTS" | `/turing:explore` | Research |
+| "design", "plan experiment", "how should I test", "experiment design" | `/turing:design` | Design |
+| "mode", "explore", "exploit", "replicate", "strategy" | `/turing:mode` | Strategy |
+| "preflight", "resources", "VRAM", "memory", "can I run", "OOM", "GPU" | `/turing:preflight` | Check |
+| "card", "model card", "document model", "model documentation" | `/turing:card` | Document |
+| "diagnose", "error analysis", "failure modes", "where does it fail", "confusion matrix" | `/turing:diagnose` | Analyze |
+| "ablate", "ablation", "remove component", "which features matter", "component impact" | `/turing:ablate` | Analyze |
+| "frontier", "pareto", "tradeoff", "tradeoffs", "multi-objective", "which model is best" | `/turing:frontier` | Analyze |
+| "lit", "literature", "papers", "SOTA", "baseline", "related work", "citations" | `/turing:lit` | Research |
+| "paper", "draft paper", "write paper", "results table", "latex", "experimental setup" | `/turing:paper` | Document |
+| "export", "deploy", "production", "onnx", "torchscript", "tflite", "ship model" | `/turing:export` | Deploy |
+| "queue", "batch", "overnight", "schedule experiments", "run queue" | `/turing:queue` | Orchestrate |
+| "retry", "retry experiment", "crashed", "OOM", "fix and rerun" | `/turing:retry` | Orchestrate |
+| "fork", "branch", "try both", "parallel experiments", "A or B" | `/turing:fork` | Orchestrate |
+| "profile", "profiling", "bottleneck", "slow training", "why is it slow", "timing" | `/turing:profile` | Check |
+| "checkpoint", "checkpoints", "prune checkpoints", "disk space", "resume training" | `/turing:checkpoint` | Check |
+| "diff", "deep compare", "what changed", "why did it diverge", "experiment diff" | `/turing:diff` | Analyze |
+| "watch", "monitor", "live training", "loss spike", "is it overfitting", "training progress" | `/turing:watch` | Monitor |
+| "regress", "regression", "did metrics degrade", "check for regression", "CI gate", "stability check" | `/turing:regress` | Validate |
+| "ensemble", "combine models", "voting", "stacking", "blending", "merge models" | `/turing:ensemble` | Compose |
+| "stitch", "pipeline", "swap stage", "cache stage", "pipeline composition" | `/turing:stitch` | Compose |
+| "warm", "warm start", "fine-tune", "continue training", "transfer learning", "from checkpoint" | `/turing:warm` | Compose |
+| "scale", "scaling law", "how much data", "is more data worth it", "power law", "data efficiency" | `/turing:scale` | Analyze |
+| "budget", "compute budget", "how many experiments", "spending limit", "stop after" | `/turing:budget` | Manage |
+| "distill", "compress", "smaller model", "student model", "knowledge distillation", "model compression" | `/turing:distill` | Deploy |
+| "transfer", "what worked before", "similar project", "cross-project", "institutional knowledge", "prior projects" | `/turing:transfer` | Research |
+| "audit", "methodology check", "pre-submission", "reviewer checklist", "data leakage", "missing baselines" | `/turing:audit` | Validate |
+| "sanity", "sanity check", "pre-training", "is it broken", "before training", "quick check" | `/turing:sanity` | Check |
+| "baseline", "baselines", "trivial baseline", "majority class", "is it better than random" | `/turing:baseline` | Analyze |
+| "leak", "leakage", "data leakage scan", "suspicious feature", "train test overlap" | `/turing:leak` | Validate |
+| "xray", "model internals", "dead neurons", "gradient flow", "weight distribution", "inside the model" | `/turing:xray` | Analyze |
+| "sensitivity", "which params matter", "hyperparameter importance", "parameter ranking" | `/turing:sensitivity` | Analyze |
+| "calibrate", "calibration", "ECE", "reliability diagram", "overconfident", "probability calibration" | `/turing:calibrate` | Analyze |
+| "feature", "features", "feature selection", "feature importance", "which features matter", "redundant features" | `/turing:feature` | Analyze |
+| "curriculum", "training order", "easy to hard", "data ordering", "curriculum learning" | `/turing:curriculum` | Optimize |
+| "prune", "pruning", "sparsity", "remove weights", "smaller model", "weight pruning" | `/turing:prune` | Optimize |
+| "quantize", "quantization", "int8", "fp16", "reduce precision", "faster inference" | `/turing:quantize` | Optimize |
+| "merge", "model soup", "merge weights", "average models", "TIES", "DARE" | `/turing:merge` | Compose |
+| "surgery", "architecture", "add layer", "widen", "modify model", "swap activation" | `/turing:surgery` | Modify |
+| "cite", "citation", "bibliography", "bibtex", "attribution", "references" | `/turing:cite` | Record |
+| "present", "figures", "slides", "presentation", "charts", "plots" | `/turing:present` | Document |
+| "changelog", "model changelog", "progress summary", "what improved" | `/turing:changelog` | Document |
+| "onboard", "onboarding", "walkthrough", "new collaborator", "project overview" | `/turing:onboard` | Document |
+| "share", "package", "export experiments", "send results", "portable" | `/turing:share` | Share |
+| "review", "peer review", "reviewer", "simulate review", "weakness" | `/turing:review` | Validate |
+| "trend", "trends", "research direction", "improvement rate", "diminishing returns", "what's working" | `/turing:trend` | Analyze |
+| "flashback", "where was I", "context", "resume", "catch up", "what happened" | `/turing:flashback` | Recall |
+| "archive", "cleanup", "compress old", "disk space", "archive experiments" | `/turing:archive` | Manage |
+| "annotate", "note", "tag experiment", "add note", "experiment note" | `/turing:annotate` | Record |
+| "search", "find experiment", "query experiments", "which experiments" | `/turing:search` | Query |
+| "template", "recipe", "save config", "reusable config", "starting point" | `/turing:template` | Manage |
+| "replay", "re-run", "revisit", "retry old", "would it work now" | `/turing:replay` | Validate |
+| "what if", "what-if", "hypothetical", "estimate impact", "would it help" | `/turing:whatif` | Analyze |
+| "counterfactual", "flip prediction", "why this prediction", "minimum change", "explanation" | `/turing:counterfactual` | Explain |
+| "simulate", "predict outcome", "pre-filter", "which configs will work", "forecast" | `/turing:simulate` | Predict |
+| "update", "incremental", "new data", "add data", "fine-tune existing", "partial update" | `/turing:update` | Update |
+| "registry", "promote", "demote", "staging", "production", "which model is deployed", "model lifecycle" | `/turing:registry` | Govern |
+| "postmortem", "why failing", "failure streak", "why no improvement", "what went wrong" | `/turing:postmortem` | Diagnose |
+| "doctor", "health check", "is it broken", "diagnose harness", "self-check" | `/turing:doctor` | Check |
+| "plan", "research plan", "campaign", "what next", "allocate budget", "strategic plan" | `/turing:plan` | Plan |
+## Sub-commands
+| Command | Purpose | Invocation |
+|---|---|---|
+| `/turing:train [ml/project] [N]` | Run the autonomous experiment loop (auto-detects project from path or cwd) | slash_only |
+| `/turing:status` | Show experiment status, best model, convergence | slash_only |
+| `/turing:compare <a> <b>` | Side-by-side experiment comparison | slash_only |
+| `/turing:sweep` | Generate and run hyperparameter sweep | slash_only |
+| `/turing:try <hypothesis>` | Inject a hypothesis into the agent's queue | slash_only |
+| `/turing:brief` | Generate structured research intelligence report | slash_only |
+| `/turing:init` | Scaffold a new ML project | slash_only |
+| `/turing:validate` | Check metric stability, auto-fix if noisy | slash_only |
+| `/turing:seed [N] [--quick]` | Multi-seed study: mean/std/CI, flag seed-sensitive results | slash_only |
+| `/turing:reproduce <exp-id>` | Reproducibility verification with tolerance checking | slash_only |
+| `/turing:suggest` | Literature-grounded model architecture suggestions | slash_only |
+| `/turing:explore` | Tree-search hypothesis exploration via AB-MCTS | slash_only |
+| `/turing:design <hyp-id>` | Generate structured experiment design from hypothesis | slash_only |
+| `/turing:logbook` | HTML/markdown logbook with trajectory chart | slash_only |
+| `/turing:poster` | Single-page HTML research poster | slash_only |
+| `/turing:report` | Structured markdown research report | slash_only |
+| `/turing:mode <mode>` | Set research strategy (explore/exploit/replicate) | slash_only |
+| `/turing:preflight` | Pre-flight resource check (VRAM/RAM/disk) | slash_only |
+| `/turing:card` | Generate standardized model card (type, performance, data, limitations, contract) | slash_only |
+| `/turing:diagnose [exp-id]` | Error analysis: failure modes, confused pairs, feature-range bias | slash_only |
+| `/turing:ablate [--components]` | Ablation study: remove components, measure impact, flag dead weight | slash_only |
+| `/turing:frontier [--metrics]` | Pareto frontier: multi-objective tradeoff visualization | slash_only |
+| `/turing:lit <query>` | Literature search: papers, SOTA baselines, related work | slash_only |
+| `/turing:paper [--sections] [--format]` | Draft paper sections from experiment logs (setup, results, ablation, hyperparams) | slash_only |
+| `/turing:export [exp-id] [--format]` | Export model to production format with equivalence check + latency benchmark | slash_only |
+| `/turing:queue <action>` | Batch experiment scheduler: add, list, run, pause, clear | slash_only |
+| `/turing:retry <exp-id>` | Smart failure recovery: auto-diagnose crash, apply fix, re-run | slash_only |
+| `/turing:fork <exp-id> --branches` | Experiment branching: run parallel tracks, report winner | slash_only |
+| `/turing:profile [exp-id]` | Computational profiling: timing, memory, throughput, bottleneck detection | slash_only |
+| `/turing:checkpoint <action>` | Smart checkpoint management: list, prune (Pareto), average, resume, stats | slash_only |
+| `/turing:diff <exp-a> <exp-b>` | Deep experiment comparison: config diff, metric significance, per-class regressions, curve divergence | slash_only |
+| `/turing:watch [--analyze]` | Live training monitor with early-warning alerts (loss spike, NaN, overfitting, plateau) | slash_only |
+| `/turing:regress [--tolerance]` | Performance regression gate: re-run best experiment, verify metrics haven't degraded | slash_only |
+| `/turing:ensemble [--top-k] [--methods]` | Automated ensemble: voting, weighted voting, stacking, blending from top-K models | slash_only |
+| `/turing:stitch <action> [stage]` | Pipeline composition: show/swap/cache/run stages independently | slash_only |
+| `/turing:warm <exp-id>` | Warm-start from prior model: load checkpoint, freeze layers, adjust LR | slash_only |
+| `/turing:scale [--axis]` | Scaling law estimator: fit power law, predict full-scale performance | slash_only |
+| `/turing:budget <action>` | Compute budget manager: set limits, track allocation, auto-shift modes | slash_only |
+| `/turing:distill <exp-id>` | Model compression: distill teacher into smaller student model | slash_only |
+| `/turing:transfer [--from]` | Cross-project knowledge transfer: find similar prior projects, surface what worked | slash_only |
+| `/turing:audit [--strict]` | Pre-submission methodology audit: data leakage, baselines, seeds, ablations, reproducibility | slash_only |
+| `/turing:sanity [--quick]` | Pre-training sanity checks: initial loss, overfit test, gradient flow, output validation | slash_only |
+| `/turing:baseline [--methods]` | Automatic baseline generation: random, majority/mean, linear, k-NN | slash_only |
+| `/turing:leak [--deep]` | Targeted leakage detection: single-feature tests, correlation, train/test overlap | slash_only |
+| `/turing:xray [exp-id]` | Internal model diagnostics: gradient flow, dead neurons, weight distributions, tree analysis | slash_only |
+| `/turing:sensitivity [exp-id]` | Hyperparameter sensitivity analysis: rank parameters by impact, detect non-monotonic responses | slash_only |
+| `/turing:calibrate [exp-id]` | Probability calibration: ECE/MCE, reliability diagrams, Platt/isotonic/temperature scaling | slash_only |
+| `/turing:feature [--method]` | Automated feature selection: multi-method consensus ranking, redundancy, interaction generation | slash_only |
+| `/turing:curriculum [exp-id]` | Training curriculum optimization: difficulty scoring, strategy comparison, impossible sample detection | slash_only |
+| `/turing:prune <exp-id>` | Weight pruning: magnitude/structured/lottery, sparsity sweep, knee point detection | slash_only |
+| `/turing:quantize <exp-id>` | Post-training quantization: FP16/INT8, accuracy-latency comparison, QAT suggestion | slash_only |
+| `/turing:merge <exp-ids...>` | Model merging: uniform/greedy soup, TIES, DARE — free accuracy, zero latency cost | slash_only |
+| `/turing:surgery <exp-id>` | Architecture modification: add/remove layer, widen/narrow, swap activation, skip connections | slash_only |
+| `/turing:trend` | Long-term trend analysis: improvement velocity, family ROI, diminishing returns detection | slash_only |
+| `/turing:flashback` | Session context restoration: "where was I?" after days away from the project | slash_only |
+| `/turing:archive` | Experiment lifecycle cleanup: compress old artifacts, prune checkpoints, summary index | slash_only |
+| `/turing:annotate <exp-id>` | Retrospective annotations: add human notes, tags, search by content | slash_only |
+| `/turing:search <query>` | Natural language experiment search with structured filters | slash_only |
+| `/turing:template <action>` | Experiment template library: save/list/apply reusable configs across projects | slash_only |
+| `/turing:replay <exp-id>` | Experiment replay: re-run old experiment with current infrastructure | slash_only |
+| `/turing:cite <action>` | Citation manager: add/list/check/bib for papers, datasets, methods | slash_only |
+| `/turing:present [--figures]` | Presentation figures: training curves, comparisons, ablation, Pareto, sensitivity | slash_only |
+| `/turing:changelog [--audience]` | Model changelog: version-grouped improvements for technical or stakeholder audiences | slash_only |
+| `/turing:onboard [--audience]` | Project onboarding: full walkthrough for new collaborators | slash_only |
+| `/turing:share <exp-ids...>` | Experiment packaging: portable archive with manifest and README | slash_only |
+| `/turing:review [--venue]` | Peer review simulation: weaknesses, questions, fix commands, score | slash_only |
+| `/turing:whatif "<question>"` | What-if analysis: route hypotheticals to existing estimators (scaling, ablation, sensitivity, ensemble, pruning) | slash_only |
+| `/turing:counterfactual <exp-id> --sample <index>` | Input-level counterfactual explanations: minimum input change to flip a prediction | slash_only |
+| `/turing:simulate [--configs] [--top-k]` | Experiment outcome prediction: pre-filter configs using surrogate model, save budget | slash_only |
+| `/turing:update <exp-id> --new-data <path>` | Incremental model update: add new data without full retraining, forgetting detection | slash_only |
+| `/turing:registry [list\|register\|promote\|demote\|history]` | Model registry: stage lifecycle (candidate → staging → production) with promotion gates | slash_only |
+| `/turing:postmortem [--window N]` | Failure postmortem: diagnose why experiments stopped improving (exhaustion, config error, data issue, ceiling, noise) | slash_only |
+| `/turing:doctor [--fix]` | Harness self-diagnosis: environment, dependencies, config, log integrity, scripts, disk, git state, Claude hooks | slash_only |
+| `/turing:plan [--budget N] [--goal]` | Research planning assistant: strategic campaign design with budget-aware ROI allocation | slash_only |
+## Proactive Detection
+If you detect ML training intent in the conversation (e.g., "the model accuracy is bad", "we need to improve predictions", "let's try a different model"), suggest the relevant sub-command.
+## First-Time Setup
+If no ML project is detected (no `config.yaml`, no `train.py`, no `experiments/`), suggest `/turing:init` first.
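
The router described in this file can be sketched as a phrase lookup followed by the first-time setup check. This is a minimal illustration, not the plugin's implementation: `ROUTES` is a hand-picked subset of the routing table, and `route`, `needs_init`, and `respond` are hypothetical names. Longest-phrase matching is an assumption made here so that "how's training" wins over the shorter "train".

```python
from pathlib import Path
from typing import Optional

# Hypothetical subset of the routing table; the real table is much longer.
ROUTES = {
    "/turing:train": ["train", "run experiments", "autoresearch"],
    "/turing:status": ["status", "how's training", "current metrics"],
    "/turing:sweep": ["sweep", "grid search", "hyperparameter search"],
    "/turing:init": ["init", "set up ml", "scaffold"],
}


def route(message: str) -> Optional[str]:
    """Return the command whose longest trigger phrase occurs in the message."""
    text = message.lower()
    best_cmd, best_len = None, 0
    for cmd, phrases in ROUTES.items():
        for phrase in phrases:
            if phrase in text and len(phrase) > best_len:
                best_cmd, best_len = cmd, len(phrase)
    return best_cmd


def needs_init(project_dir: Path) -> bool:
    """First-time setup: true when none of the three project markers exist."""
    markers = ["config.yaml", "train.py", "experiments"]
    return not any((project_dir / m).exists() for m in markers)


def respond(message: str, project_dir: Path) -> str:
    if needs_init(project_dir):
        return "Run /turing:init first."
    cmd = route(message)
    if cmd is None:
        return "No matching Turing sub-command."
    # Sub-commands are slash_only: name the command, never claim to dispatch it.
    return f"Run {cmd}"
```

Returning the slash command as text, rather than invoking anything, mirrors the Execution Contract for `slash_only` commands.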

package/skills/turing/ablate/SKILL.md ADDED

@@ -0,0 +1,47 @@
+---
+name: ablate
+description: Run systematic ablation study — remove components one at a time, measure impact, produce publication-ready table with dead-weight flagging.
+disable-model-invocation: true
+argument-hint: "[exp-id] [--components \"X,Y\"] [--seeds 3] [--latex]"
+allowed-tools: Read, Bash(*), Grep, Glob
+---
+Run a systematic ablation study to measure the contribution of each model component.
+## Steps
+1. **Activate environment:**
+   ```bash
+   source .venv/bin/activate
+   ```
+2. **Parse arguments from `$ARGUMENTS`:**
+   - First argument can be an experiment ID (e.g., `exp-042`); defaults to best
+   - `--components "dropout,feature_X,regularization"` specifies components to ablate
+   - `--seeds 3` runs each ablation 3 times for statistical robustness (uses seed runner)
+   - `--latex` outputs a LaTeX-formatted table instead of markdown
+3. **Run ablation study:**
+   ```bash
+   python scripts/ablation_study.py $ARGUMENTS
+   ```
+4. **Report results:**
+   - Show the ablation table: Configuration | Metric | Δ from Full | % Change
+   - Rank by impact (largest Δ first)
+   - Flag **dead-weight** components (removing them improves the metric)
+   - If `--latex`, output ready for copy-paste into a paper
+5. **Saved output:** results written to `experiments/ablations/exp-NNN-ablation.yaml`
+6. **If no ablatable components detected:** suggest using `--components` explicitly.
+## Examples
+```
+/turing:ablate                                    # Auto-detect components
+/turing:ablate exp-042                            # Specific experiment
+/turing:ablate --components "dropout,subsample"   # Specific components
+/turing:ablate --seeds 3                          # Multi-seed for robustness
+/turing:ablate --latex                            # LaTeX table output
+```
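
The report step above (Configuration | Metric | Δ from Full | % Change, ranked by impact, with dead-weight flagging) could be computed roughly as follows. This is a sketch under stated assumptions: `ablation_table` and its row keys are hypothetical, the metric values are made up for illustration, and `scripts/ablation_study.py` may work differently.

```python
def ablation_table(full_metric, ablated, higher_is_better=True):
    """Build ablation rows sorted by absolute impact, largest first.

    ablated maps a removed component name to the metric measured without it.
    """
    rows = []
    for name, metric in ablated.items():
        delta = metric - full_metric
        pct = 100.0 * delta / full_metric if full_metric else float("nan")
        # Dead weight: the metric gets better once the component is removed.
        improved = delta > 0 if higher_is_better else delta < 0
        rows.append({
            "configuration": f"without {name}",
            "metric": metric,
            "delta": delta,
            "pct_change": pct,
            "dead_weight": improved,
        })
    rows.sort(key=lambda r: abs(r["delta"]), reverse=True)
    return rows


# Illustrative values only, not results from any real experiment.
rows = ablation_table(0.90, {"dropout": 0.91, "feature_X": 0.80, "regularization": 0.88})
for r in rows:
    flag = "  <- dead weight" if r["dead_weight"] else ""
    print(f'{r["configuration"]:<24} {r["metric"]:.2f} {r["delta"]:+.2f} {r["pct_change"]:+.1f}%{flag}')
```

The `higher_is_better` switch matters for loss-style metrics, where a negative Δ is the improvement.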

package/skills/turing/annotate/SKILL.md ADDED

@@ -0,0 +1,23 @@
+---
+name: annotate
+description: Retrospective experiment annotations — add human notes, tags, and context that automated metrics can't capture.
+disable-model-invocation: true
+argument-hint: "<exp-id> \"note\" [--tag fragile] | --list | --search \"keyword\""
+allowed-tools: Read, Bash(*), Grep, Glob
+---
+Add context that experiment logs can't capture. "This only worked because the data was pre-sorted."
+## Steps
+1. **Activate environment:** `source .venv/bin/activate`
+2. **Run:** `python scripts/experiment_annotations.py $ARGUMENTS`
+3. **Operations:** add (text + tags), list (per-experiment or all), search (keyword or tag)
+4. **Stored in:** `experiments/annotations.yaml`
+## Examples
+```
+/turing:annotate exp-042 "Fragile — only works with specific preprocessing"
+/turing:annotate exp-042 "Reviewer 2 requested this" --tag reviewer-requested
+/turing:annotate --list
+/turing:annotate --search "fragile"
+```
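
The three operations listed above (add, list, search) can be illustrated with an in-memory store. This is only a sketch: `Annotation` and `AnnotationStore` are hypothetical names, and persistence to `experiments/annotations.yaml` is omitted because the file's schema is not shown in this diff.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Annotation:
    exp_id: str
    note: str
    tags: List[str] = field(default_factory=list)


class AnnotationStore:
    """In-memory stand-in for the annotations file."""

    def __init__(self) -> None:
        self._items: List[Annotation] = []

    def add(self, exp_id: str, note: str, tags: Optional[List[str]] = None) -> Annotation:
        item = Annotation(exp_id, note, list(tags or []))
        self._items.append(item)
        return item

    def list_for(self, exp_id: Optional[str] = None) -> List[Annotation]:
        """All annotations, or only those for one experiment."""
        return [a for a in self._items if exp_id is None or a.exp_id == exp_id]

    def search(self, query: str) -> List[Annotation]:
        """Match a keyword inside the note text, or an exact tag."""
        q = query.lower()
        return [
            a for a in self._items
            if q in a.note.lower() or q in (t.lower() for t in a.tags)
        ]
```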

package/skills/turing/archive/SKILL.md ADDED

@@ -0,0 +1,23 @@
+---
+name: archive
+description: Experiment lifecycle cleanup — compress old artifacts, prune checkpoints, create queryable summary index. Reclaim disk space.
+disable-model-invocation: true
+argument-hint: "[--older-than 30d] [--keep-best 10] [--dry-run]"
+allowed-tools: Read, Bash(*), Grep, Glob
+---
+Keep your project directory manageable after 200+ experiments.
+## Steps
+1. **Activate environment:** `source .venv/bin/activate`
+2. **Run:** `python scripts/experiment_archive.py $ARGUMENTS`
+3. **Protected experiments:** Pareto-optimal, current best, recent, top-N by metric
+4. **Report:** archived count, preserved count, space reclaimed
+5. **Saved output:** `experiments/archive/index.yaml`
+## Examples
+```
+/turing:archive --dry-run                    # Preview what would be archived
+/turing:archive --older-than 30 --keep-best 10  # Archive old, keep top 10
+/turing:archive                              # Default: 30 days, keep 10
+```
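
The protection rules above (Pareto-optimal, current best, recent, top-N by metric) can be sketched as a set union followed by a filter. The record fields `id`, `metric`, `latency`, and `date`, and the choice of latency as the second Pareto objective, are assumptions for illustration; `scripts/experiment_archive.py` may use different objectives and fields.

```python
from datetime import date, timedelta


def pareto_front(exps):
    """IDs of experiments not dominated on (higher metric, lower latency)."""
    front = set()
    for e in exps:
        dominated = any(
            o["metric"] >= e["metric"] and o["latency"] <= e["latency"]
            and (o["metric"] > e["metric"] or o["latency"] < e["latency"])
            for o in exps
        )
        if not dominated:
            front.add(e["id"])
    return front


def select_for_archive(exps, today, older_than_days=30, keep_best=10):
    """Return IDs that are safe to archive under the four protection rules."""
    ranked = sorted(exps, key=lambda e: e["metric"], reverse=True)
    protected = {e["id"] for e in ranked[:1]}            # current best
    protected |= {e["id"] for e in ranked[:keep_best]}   # top-N by metric
    protected |= pareto_front(exps)                      # Pareto-optimal
    cutoff = today - timedelta(days=older_than_days)
    protected |= {e["id"] for e in exps if e["date"] >= cutoff}  # recent
    return [e["id"] for e in exps if e["id"] not in protected]
```

Computing the protected set first and archiving only the remainder keeps a `--dry-run` preview trivial: print the returned list without touching any files.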

package/skills/turing/audit/SKILL.md ADDED

@@ -0,0 +1,56 @@
+---
+name: audit
+description: Pre-submission methodology audit — catch data leakage, missing baselines, cherry-picked seeds, and incomplete ablations before a reviewer does.
+disable-model-invocation: true
+argument-hint: "[--strict] [--checklist neurips]"
+allowed-tools: Read, Bash(*), Grep, Glob
+---
+A reviewer checklist you run before submitting. Catches methodology mistakes that cause desk rejections.
+## Steps
+1. **Activate environment:**
+   ```bash
+   source .venv/bin/activate
+   ```
+2. **Parse arguments from `$ARGUMENTS`:**
+   - `--strict` — treat warnings as failures
+   - `--checklist neurips|icml|iclr` — add venue-specific checks
+   - `--json` — raw JSON output
+3. **Run methodology audit:**
+   ```bash
+   python scripts/methodology_audit.py $ARGUMENTS
+   ```
+4. **Checks performed:**
+   - **Data leakage** (critical): verify prepare.py/evaluate.py separation
+   - **CV strategy** (critical): verify appropriate cross-validation for data type
+   - **Seed sensitivity** (high): seed studies exist for best experiments
+   - **Ablation completeness** (high): ablation studies performed
+   - **Baseline comparison** (high): simple baselines in experiment log
+   - **Reproducibility** (high): best result successfully reproduced
+   - **Hyperparameter budget** (medium): total tuning cost documented
+   - **Regression stability** (medium): regression checks performed
+5. **Verdicts:**
+   - **PASS** — ready for submission
+   - **PASS (with warnings)** — address before submission
+   - **NEEDS WORK** — fix failures first
+   - **FAIL** — critical issues found
+6. **Actions:** each failure suggests the `/turing:` command to fix it
+7. **Venue checklists:** `--checklist neurips` adds NeurIPS-specific checks (broader impact, reproducibility checklist, code availability)
+8. **Saved output:** report in `experiments/audits/audit-YYYY-MM-DD.yaml`
+## Examples
+```
+/turing:audit                          # Standard audit
+/turing:audit --strict                 # Warnings become failures
+/turing:audit --checklist neurips      # NeurIPS submission checklist
+```
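
The skill file lists four verdicts and a `--strict` flag but does not spell out how individual check results roll up into a verdict. One plausible mapping, offered purely as an assumption, is: any failed critical check yields FAIL, any other failure yields NEEDS WORK, warnings alone yield PASS (with warnings), and `--strict` promotes warnings to failures. The `verdict` function and the tuple layout below are hypothetical.

```python
def verdict(results, strict=False):
    """Roll up check results into one of the four audit verdicts.

    results: iterable of (check_name, severity, status), where severity is
    "critical", "high", or "medium" and status is "pass", "warn", or "fail".
    The roll-up rules are an assumed mapping, not the script's documented logic.
    """
    normalized = []
    for _name, severity, status in results:
        if strict and status == "warn":
            status = "fail"  # --strict: treat warnings as failures
        normalized.append((severity, status))

    if any(sev == "critical" and st == "fail" for sev, st in normalized):
        return "FAIL"
    if any(st == "fail" for _sev, st in normalized):
        return "NEEDS WORK"
    if any(st == "warn" for _sev, st in normalized):
        return "PASS (with warnings)"
    return "PASS"
```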

package/skills/turing/baseline/SKILL.md ADDED

@@ -0,0 +1,45 @@
+---
+name: baseline
+description: Automatic baseline generation — random, majority/mean, linear, k-NN baselines in 60 seconds. Every experiment needs a "is this better than dumb?" reference.
+disable-model-invocation: true
+argument-hint: "[--methods all|simple|linear] [--data data.npz]"
+allowed-tools: Read, Bash(*), Grep, Glob
+---
+Generate trivial baselines so you always know if your model is meaningfully better than simple approaches.
+## Steps
+1. **Activate environment:**
+   ```bash
+   source .venv/bin/activate
+   ```
+2. **Parse arguments from `$ARGUMENTS`:**
+   - `--methods all|simple|linear` — baseline group (default: all)
+   - `--data data.npz` — data file with X and y arrays
+   - `--json` — raw JSON output
+3. **Run baseline generation:**
+   ```bash
+   python scripts/generate_baselines.py $ARGUMENTS
+   ```
+4. **Baselines generated:**
+   - **Classification:** Random, Majority class, Stratified random, Logistic Regression, k-NN
+   - **Regression:** Random, Mean predictor, Median predictor, Ridge Regression, k-NN
+   - Each evaluated with the same protocol as real experiments
+5. **Report includes:** comparison table with metric values and notes (floor, ceiling, reference)
+6. **Integration:** satisfies the "baseline comparison" check in `/turing:audit`
+7. **Saved output:** report in `experiments/baselines/baselines-*.yaml`
+## Examples
+```
+/turing:baseline                           # All baselines
+/turing:baseline --methods simple          # Just random + majority
+/turing:baseline --data data/processed.npz # With actual data
+```
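
Two of the baselines named above, the majority-class classifier and the mean predictor, are simple enough to sketch in plain Python. The function names are hypothetical and the data is illustrative; `scripts/generate_baselines.py` presumably relies on a library and the project's own evaluation protocol.

```python
from collections import Counter
from statistics import mean


def majority_class_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    majority = Counter(y_train).most_common(1)[0][0]
    return sum(1 for y in y_test if y == majority) / len(y_test)


def mean_predictor_mse(y_train, y_test):
    """Mean squared error of always predicting the training mean."""
    mu = mean(y_train)
    return sum((y - mu) ** 2 for y in y_test) / len(y_test)


# Illustrative data only.
print(majority_class_accuracy([0, 0, 1], [0, 1, 0, 0]))   # 0.75
print(mean_predictor_mse([1.0, 2.0, 3.0], [2.0, 4.0]))    # 2.0
```

A model that cannot beat these numbers on the same split has not learned anything useful from the features.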

package/skills/turing/brief/SKILL.md ADDED

@@ -0,0 +1,95 @@
+---
+name: brief
+description: Generate a structured research intelligence report from experiment history — what's been learned, what's promising, what's exhausted, and what the human should consider next. Use --deep for literature-grounded suggestions.
+disable-model-invocation: true
+argument-hint: "[ml/project] [--deep]"
+allowed-tools: Read, Bash(python scripts/*:*, source .venv/bin/activate:*), Grep, Glob, WebSearch, WebFetch
+---
+Generate a research briefing that a human can read in 2 minutes and immediately decide what to inject next.
+## Project Detection
+Before generating the briefing, detect which project to report on:
+0. **Detect project directory:**
+   - If `$ARGUMENTS` contains a path (e.g., `ml/coding`), use that as the project directory
+   - Else if cwd contains `config.yaml` and `train.py`, use cwd
+   - Else search for `ml/*/` subdirectories containing `config.yaml`
+     - If exactly one found, use it
+     - If multiple found, list them and ask the user which to report on
+   - All subsequent commands run from the detected project directory
+## Steps
+1. **Generate the briefing:**
+   ```bash
+   source .venv/bin/activate && python scripts/generate_brief.py
+   ```
+2. **Self-critique the briefing** before presenting. Review the generated output and check:
+   - **Recommendations specificity:** Are they concrete enough to act on? "Try a different model" is bad. "Try LightGBM with leaf-wise growth because exp-004 showed depth sensitivity" is good. If vague, rewrite them with specific model/hyperparameter suggestions grounded in the experiment data.
+   - **Exhausted directions coverage:** Cross-reference the "Model Types Explored" section against `experiments/log.jsonl`. Are there discarded experiments missing from the summary? If so, add them.
+   - **Convergence estimate grounding:** If the briefing says "close to convergence" or "further improvement possible", verify against the actual metric trajectory. Is the claim supported by the numbers?
+   - **Metric accuracy:** Spot-check that the "Current Best" metrics match the actual log. Run `python scripts/show_metrics.py --last 1` if uncertain.
+   If any section fails the check, regenerate just that section. Max 1 revision round — don't over-polish.
+3. **Present the output** to the user. The briefing has 6 sections:
+   - **Campaign Summary** — total experiments, keep rate, timespan
+   - **Current Best** — model type, metrics, experiment ID, configuration
+   - **Improvement Trajectory** — metric over time, rate of improvement
+   - **Model Types Explored** — which approaches have been tried and their hit rates
+   - **Hypothesis Queue** — pending and completed hypotheses
+   - **Recommendations** — data-driven next steps
+4. **If `$ARGUMENTS` contains `--deep`:** run the Literature-Grounded Suggestions step below.
+5. **Prompt for action:**
+   - "Want to inject a hypothesis? Use `/turing:try <idea>`"
+   - "Want to continue training? Use `/turing:train`"
+   - "Want literature-backed suggestions? Use `/turing:brief --deep`"
+## Literature-Grounded Suggestions (--deep flag)
+When `--deep` is requested, add a 7th section: **Literature-Grounded Suggestions**.
+### Steps:
+1. **Read context:** Read `config.yaml` and the briefing output to understand:
+   - What task type this is (tabular classification, time series, etc.)
+   - Which model families have been exhausted (from "Model Types Explored")
+   - Where improvement has plateaued (from "Improvement Trajectory")
+   - What failure patterns keep recurring
+2. **Search literature** with `WebSearch` for techniques that address the specific stagnation:
+   - If plateaued: "improve [task type] accuracy beyond [current metric] 2024"
+   - If overfitting: "regularization techniques [model family] [task type]"
+   - If all models tried: "state of the art [task type] benchmark 2024 2025"
+3. **Distill 3-5 suggestions** from the literature, each with:
+   - **Technique:** specific and actionable
+   - **Source:** paper or article URL
+   - **Why now:** how it addresses the specific stagnation point
+   - **Impact estimate:** high/medium/low
+   - **Complexity:** low/medium/high
+4. **Queue suggestions** as hypotheses:
+   ```bash
+   source .venv/bin/activate && python scripts/manage_hypotheses.py add "<technique>: <rationale> (source: <citation>)" --priority medium --source literature
+   ```
+5. **Format as a section** appended to the briefing.
+## Saving Briefs
+```bash
+mkdir -p briefs && python scripts/generate_brief.py > briefs/brief-$(date +%Y-%m-%d).md
+```
+## When to Use
+- After a training session completes or converges
+- Before injecting new hypotheses (to understand what's already been tried)
+- When returning to a project after time away
+- **With `--deep`:** when the agent seems stuck and you want evidence-based direction
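
Reader's note on the Project Detection rules in this file: the lookup order can be sketched as a small function. This is a minimal sketch, not code from the package. The name `detect_project`, the return shape, and the rule that any non-flag token in the arguments is treated as the project path are all assumptions.

```python
from pathlib import Path


def detect_project(arguments, cwd):
    """Return (project_dir, candidates) using the detection order from the skill.

    project_dir is None when the caller has to ask the user to choose
    (several candidates) or when nothing was found (no candidates).
    """
    cwd = Path(cwd)

    # 1. An explicit path in the arguments wins. Assumption: the first token
    #    that is not a flag (such as --deep) is the project path.
    for token in arguments.split():
        if not token.startswith("--"):
            return cwd / token, []

    # 2. The working directory itself is a project if it holds both files.
    if (cwd / "config.yaml").is_file() and (cwd / "train.py").is_file():
        return cwd, []

    # 3. Otherwise look for ml/*/ subdirectories that contain config.yaml.
    candidates = sorted(p.parent for p in cwd.glob("ml/*/config.yaml"))
    if len(candidates) == 1:
        return candidates[0], []
    return None, candidates
```

Returning the candidate list, instead of raising on ambiguity, leaves room for the "list them and ask the user" branch described in the skill.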

package/skills/turing/budget/SKILL.md ADDED

@@ -0,0 +1,52 @@
+---
+name: budget
+description: Compute budget manager — set experiment/time limits, track allocation across explore/exploit phases, auto-shift modes, hard stop.
+disable-model-invocation: true
+argument-hint: "<set|status|reset> [--experiments 50] [--hours 8]"
+allowed-tools: Read, Bash(*), Grep, Glob
+---
+Set a compute ceiling and let the system optimize within it. Prevents runaway experiment loops.
+## Steps
+1. **Activate environment:**
+   ```bash
+   source .venv/bin/activate
+   ```
+2. **Parse arguments from `$ARGUMENTS`:**
+   - First argument is action: `set`, `status`, `reset`, or `check`
+   - `--experiments 50` — max experiment count
+   - `--hours 8` — max wall-clock hours
+   - `--json` — raw JSON output
+3. **Run budget manager:**
+   ```bash
+   python scripts/budget_manager.py $ARGUMENTS
+   ```
+4. **Actions:**
+   - **set:** create a budget with experiment and/or time constraints
+   - **status:** show usage, burn rate, projected exhaustion, allocation breakdown
+   - **reset:** deactivate the current budget
+   - **check:** returns whether another experiment is allowed (used by `/turing:train`)
+5. **Budget allocation policy:**
+   - **0-50% budget:** EXPLORE — try diverse hypotheses
+   - **50-80% budget:** MIXED — explore promising, exploit best
+   - **80-100% budget:** EXPLOIT ONLY — refine the winner
+   - **100% budget:** HARD STOP — `/turing:train` refuses new experiments
+6. **Budget state** stored in `experiment_state.yaml` under the `budget` key.
+7. **If no budget exists:** `/turing:train` runs without limits.
+## Examples
+```
+/turing:budget set --experiments 50 --hours 8   # Set both constraints
+/turing:budget set --experiments 30             # Experiment count only
+/turing:budget status                           # Show usage and projections
+/turing:budget reset                            # Remove budget limits
+```
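
Reader's note on the budget allocation policy in this file: the mapping from budget usage to phase can be sketched as below. This is not taken from `scripts/budget_manager.py`, which this diff does not show. The function names are hypothetical, and two choices are assumptions: when both limits are set, the more exhausted one governs, and a boundary value such as exactly 50% falls into the later phase.

```python
def budget_fraction(experiments_used, max_experiments=None,
                    hours_used=0.0, max_hours=None):
    """Fraction of the budget consumed, or None when no limit is set."""
    fractions = []
    if max_experiments:
        fractions.append(experiments_used / max_experiments)
    if max_hours:
        fractions.append(hours_used / max_hours)
    # Assumption: with two constraints, the tighter one decides the phase.
    return max(fractions) if fractions else None


def budget_phase(fraction):
    """Map a consumed fraction to the phase names used in the policy."""
    if fraction is None:
        return "NO BUDGET"      # training runs without limits
    if fraction >= 1.0:
        return "HARD STOP"      # new experiments are refused
    if fraction >= 0.8:
        return "EXPLOIT ONLY"   # refine the winner
    if fraction >= 0.5:
        return "MIXED"          # explore promising, exploit best
    return "EXPLORE"            # try diverse hypotheses
```

For example, 25 of 50 experiments used after 7 of 8 hours gives fractions 0.5 and 0.875, so under these assumptions the time limit puts the run in the exploit-only phase.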

package/skills/turing/calibrate/SKILL.md ADDED

@@ -0,0 +1,47 @@
+---
+name: calibrate
+description: Probability calibration — measure ECE, plot reliability diagrams, apply Platt scaling or isotonic regression.
+disable-model-invocation: true
+argument-hint: "[exp-id] [--method platt|isotonic|temperature|auto]"
+allowed-tools: Read, Bash(*), Grep, Glob
+---
+Make model probabilities trustworthy. Does 80% confidence actually mean 80% correct?
+## Steps
+1. **Activate environment:**
+   ```bash
+   source .venv/bin/activate
+   ```
+2. **Parse arguments from `$ARGUMENTS`:**
+   - Optional experiment ID
+   - `--method platt|isotonic|temperature|auto` — calibration method (default: auto)
+   - `--json` — raw JSON output
+3. **Run calibration:**
+   ```bash
+   python scripts/calibration.py $ARGUMENTS
+   ```
+4. **Report includes:**
+   - ECE/MCE before calibration
+   - Reliability diagram (predicted vs actual per bin)
+   - Calibration method comparison table
+   - Verdict: ALREADY CALIBRATED / IMPROVED / NO IMPROVEMENT
+5. **Methods:**
+   - **Platt:** logistic regression on logits
+   - **Isotonic:** non-parametric (more flexible, needs more data)
+   - **Temperature:** single scalar T parameter
+   - **Auto:** tries all, picks lowest ECE
+6. **Saved output:** report in `experiments/calibration/<exp-id>-calibration.yaml`
+## Examples
+```
+/turing:calibrate exp-042                  # Auto-select best method
+/turing:calibrate exp-042 --method platt   # Platt scaling only
+```
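
Reader's note on the ECE/MCE figures in the report: the sketch below shows the standard equal-width-bin definition of expected and maximum calibration error. It is not the contents of `scripts/calibration.py`, which this diff does not show. The function name, the default of 10 bins, and the input format (one confidence and one correctness flag per prediction) are assumptions.

```python
def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) computed over equal-width confidence bins.

    confidences: predicted probability of the chosen class, in [0, 1]
    correct:     booleans, True where the prediction was right
    """
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 belongs to the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    mce = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        gap = abs(accuracy - avg_conf)
        ece += (len(members) / total) * gap  # weight by bin population
        mce = max(mce, gap)                  # worst single bin
    return ece, mce
```

Five predictions at 0.8 confidence with four of them correct give a gap of zero, which matches the "does 80% confidence actually mean 80% correct?" question posed at the top of the skill.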