claude-turing 1.4.0 → 1.5.0

@@ -1,7 +1,7 @@
  {
    "name": "turing",
-   "version": "1.4.0",
-   "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 22 commands, 2 specialized agents, experiment intelligence (error analysis, ablation studies, Pareto frontiers), statistical rigor (multi-seed studies, reproducibility verification), tree-search hypothesis exploration (TreeQuest AB-MCTS), cost-performance frontier analysis, model cards, model registry, hypothesis database with novelty guard, anti-cheating guardrails, and the taste-leverage loop. Inspired by Karpathy's autoresearch and the scientific method itself.",
+   "version": "1.5.0",
+   "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 24 commands, 2 specialized agents, performance profiling, smart Pareto-based checkpoint management, experiment intelligence (error analysis, ablation studies, Pareto frontiers), statistical rigor (multi-seed studies, reproducibility verification), tree-search hypothesis exploration (TreeQuest AB-MCTS), cost-performance frontier analysis, model cards, model registry, hypothesis database with novelty guard, anti-cheating guardrails, and the taste-leverage loop. Inspired by Karpathy's autoresearch and the scientific method itself.",
    "author": {
      "name": "pragnition"
    },
package/README.md CHANGED
@@ -328,6 +328,8 @@ The index (`hypotheses.yaml`) is the lightweight queue. The detail files (`hypot
  | `/turing:diagnose [exp-id]` | Error analysis — failure modes, confused pairs, feature-range bias |
  | `/turing:ablate [--components]` | Ablation study — remove components, measure impact, flag dead weight |
  | `/turing:frontier [--metrics]` | Pareto frontier — multi-objective tradeoff visualization |
+ | `/turing:profile [exp-id]` | Computational profiling — timing, memory, throughput, bottleneck detection |
+ | `/turing:checkpoint <action>` | Smart checkpoint management — list, prune (Pareto), average, resume, stats |
  | `/turing:card` | Generate a model card — performance, limitations, intended use, artifact contract |
  | `/turing:logbook` | Generate HTML experiment logbook |
  | `/turing:report` | Generate research report |
@@ -517,11 +519,11 @@ Each project gets independent config, data, experiments, models, and agent memor

  ## Architecture of Turing Itself

- 22 commands, 2 agents, 8 config files, 37 template scripts, model registry, artifact contract, cost-performance frontier, model cards, tree-search exploration, statistical rigor, experiment intelligence (error analysis + ablation + Pareto frontier), 487 tests, 16 ADRs. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full codemap.
+ 24 commands, 2 agents, 8 config files, 39 template scripts, model registry, artifact contract, cost-performance frontier, model cards, tree-search exploration, statistical rigor, experiment intelligence, performance profiling + smart checkpoints, 542 tests, 16 ADRs. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full codemap.

  ```
  turing/
- ├── commands/    21 skill files (core + taste-leverage + reporting + exploration + statistical rigor + experiment intelligence)
+ ├── commands/    23 skill files (core + taste-leverage + reporting + exploration + statistical rigor + experiment intelligence + performance)
  ├── agents/      2 agents (researcher: read/write, evaluator: read-only)
  ├── config/      8 files (lifecycle, taxonomy, archetypes, novelty aliases)
  ├── templates/   Scaffolded into user projects by /turing:init
@@ -0,0 +1,47 @@
+ ---
+ name: checkpoint
+ description: Smart checkpoint management — list, prune (Pareto-based), average top-K, resume from any point, disk usage stats.
+ disable-model-invocation: true
+ argument-hint: "<list|prune|average|resume|stats> [exp-id] [--top 3] [--dry-run]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Manage model checkpoints intelligently using Pareto dominance.
+
+ ## Steps
+
+ 1. **Activate environment:**
+    ```bash
+    source .venv/bin/activate
+    ```
+
+ 2. **Parse arguments from `$ARGUMENTS`:**
+    - First word is the action: `list`, `prune`, `average`, `resume`, `stats`
+    - `resume` requires an experiment ID as second argument
+    - `--top 3` sets the number of checkpoints for averaging
+    - `--dry-run` previews pruning without deleting
+
+ 3. **Run checkpoint manager:**
+    ```bash
+    python scripts/checkpoint_manager.py $ARGUMENTS
+    ```
+
+ 4. **Report results by action:**
+    - **list:** Table of all checkpoints with metrics, size, and Pareto status
+    - **prune:** Removes dominated checkpoints, reports space saved
+    - **average:** Lists top-K checkpoints for weight averaging
+    - **resume:** Locates checkpoint for a specific experiment
+    - **stats:** Disk usage summary by total, average, and model type
+
+ 5. **Saved output:** report written to `experiments/checkpoints/checkpoint-report.yaml`
+
+ ## Examples
+
+ ```
+ /turing:checkpoint list                # Show all checkpoints
+ /turing:checkpoint stats               # Disk usage summary
+ /turing:checkpoint prune --dry-run     # Preview what would be pruned
+ /turing:checkpoint prune               # Remove dominated checkpoints
+ /turing:checkpoint average --top 5     # Top 5 for averaging
+ /turing:checkpoint resume exp-042      # Resume from checkpoint
+ ```
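
For reference, the `list`, `prune`, and `stats` actions get their metrics from a `metadata.yaml` (or `metadata.json`) sitting inside each checkpoint directory; checkpoints without one fall back to the experiment log. Below is a minimal sketch of writing that file for a hypothetical `exp-042` checkpoint. The field names match what `scan_checkpoints` in `checkpoint_manager.py` (added further down) reads; the values are invented:

```python
# Sketch: write the metadata.yaml the checkpoint scanner looks for.
# exp-042 and its numbers are hypothetical.
from datetime import datetime, timezone
from pathlib import Path

import yaml

ckpt_dir = Path("experiments/checkpoints/exp-042")
ckpt_dir.mkdir(parents=True, exist_ok=True)

metadata = {
    "experiment_id": "exp-042",  # joined against the experiment log
    "metrics": {"accuracy": 0.91, "train_seconds": 38.2},  # basis for Pareto dominance
    "config": {"model_type": "lightgbm"},  # used by the stats breakdown
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

with open(ckpt_dir / "metadata.yaml", "w") as f:
    yaml.safe_dump(metadata, f, sort_keys=False)
```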
@@ -0,0 +1,43 @@
+ ---
+ name: profile
+ description: Profile a training run — timing breakdown, memory usage, throughput, bottleneck detection with actionable recommendations.
+ disable-model-invocation: true
+ argument-hint: "[exp-id] [--seed 42]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Profile a training run to identify performance bottlenecks.
+
+ ## Steps
+
+ 1. **Activate environment:**
+    ```bash
+    source .venv/bin/activate
+    ```
+
+ 2. **Parse arguments from `$ARGUMENTS`:**
+    - First argument can be an experiment ID (e.g., `exp-042`); defaults to best
+    - `--seed 42` sets the random seed for the profiling run
+
+ 3. **Run profiling:**
+    ```bash
+    python scripts/profile_training.py $ARGUMENTS
+    ```
+
+ 4. **Report results:**
+    - **Timing:** total time, training time, overhead breakdown
+    - **Memory:** peak RSS, Python peak, GPU peak (if applicable)
+    - **Throughput:** samples/sec
+    - **Bottleneck:** identified bottleneck type and severity
+    - **Recommendations:** actionable fixes for the detected bottleneck
+
+ 5. **Saved output:** results written to `experiments/profiles/exp-NNN-profile.yaml`
+
+ 6. **If no training pipeline exists:** suggest `/turing:init` first.
+
+ ## Examples
+
+ ```
+ /turing:profile            # Profile best experiment config
+ /turing:profile exp-042    # Profile specific experiment
+ ```
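
For reference, the profiler does not instrument `train.py` internally; it parses a `key: value` block delimited by two `---` lines from the script's stdout (see `profile_training_run` in `profile_training.py`, added further down). A minimal sketch of what the tail of a compatible `train.py` might print, where `train_seconds` and `n_samples` are the keys the profiler actually uses and the values are invented:

```python
# Hypothetical tail of train.py: emit a "---"-delimited metrics block on stdout.
# The profiler parses each "key: value" line, coercing values to float when possible.
print("---")
print("accuracy: 0.91")        # stored verbatim in the profile's metrics dict
print("train_seconds: 38.2")   # used for the timing breakdown
print("n_samples: 120000")     # used to compute samples/sec throughput
print("---")
```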
@@ -31,6 +31,8 @@ You are the Turing ML research router. Detect the user's intent and route to the
  | "diagnose", "error analysis", "failure modes", "where does it fail", "confusion matrix" | `/turing:diagnose` | Analyze |
  | "ablate", "ablation", "remove component", "which features matter", "component impact" | `/turing:ablate` | Analyze |
  | "frontier", "pareto", "tradeoff", "tradeoffs", "multi-objective", "which model is best" | `/turing:frontier` | Analyze |
+ | "profile", "profiling", "bottleneck", "slow training", "why is it slow", "timing" | `/turing:profile` | Check |
+ | "checkpoint", "checkpoints", "prune checkpoints", "disk space", "resume training" | `/turing:checkpoint` | Check |

  ## Sub-commands

@@ -58,6 +60,8 @@ You are the Turing ML research router. Detect the user's intent and route to the
  | `/turing:diagnose [exp-id]` | Error analysis: failure modes, confused pairs, feature-range bias | (inline) |
  | `/turing:ablate [--components]` | Ablation study: remove components, measure impact, flag dead weight | (inline) |
  | `/turing:frontier [--metrics]` | Pareto frontier: multi-objective tradeoff visualization | (inline) |
+ | `/turing:profile [exp-id]` | Computational profiling: timing, memory, throughput, bottleneck detection | (inline) |
+ | `/turing:checkpoint <action>` | Smart checkpoint management: list, prune (Pareto), average, resume, stats | (inline) |

  ## Proactive Detection

package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "claude-turing",
-   "version": "1.4.0",
+   "version": "1.5.0",
    "type": "module",
    "description": "Autonomous ML research harness for Claude Code. The autoresearch loop as a formal protocol — iteratively trains, evaluates, and improves ML models with structured experiment tracking, convergence detection, immutable evaluation infrastructure, and safety guardrails.",
    "bin": {
package/src/install.js CHANGED
@@ -23,7 +23,7 @@ const SUB_COMMANDS = [
    "init", "train", "status", "compare", "sweep", "validate",
    "try", "brief", "suggest", "explore", "design", "logbook", "poster",
    "report", "mode", "preflight", "card", "seed", "reproduce",
-   "diagnose", "ablate", "frontier",
+   "diagnose", "ablate", "frontier", "profile", "checkpoint",
  ];

  export async function install(opts = {}) {
package/src/verify.js CHANGED
@@ -36,6 +36,8 @@ const EXPECTED_COMMANDS = [
    "diagnose/SKILL.md",
    "ablate/SKILL.md",
    "frontier/SKILL.md",
+   "profile/SKILL.md",
+   "checkpoint/SKILL.md",
  ];

  const EXPECTED_AGENTS = ["ml-researcher.md", "ml-evaluator.md"];
@@ -0,0 +1,449 @@
+ #!/usr/bin/env python3
+ """Smart checkpoint manager for ML experiments.
+
+ Manages model checkpoints based on Pareto dominance rather than
+ simple "keep last K". Supports listing, Pareto-based pruning,
+ checkpoint averaging, and resume from any point.
+
+ Usage:
+     python scripts/checkpoint_manager.py list              # List checkpoints
+     python scripts/checkpoint_manager.py prune             # Remove dominated
+     python scripts/checkpoint_manager.py average [--top 3] # Average top-K
+     python scripts/checkpoint_manager.py resume exp-042    # Resume from checkpoint
+     python scripts/checkpoint_manager.py stats             # Disk usage summary
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import shutil
+ import sys
+ from datetime import datetime, timezone
+ from pathlib import Path
+
+ import yaml
+
+ from scripts.turing_io import load_config, load_experiments
+
+
+ DEFAULT_CHECKPOINT_DIR = "experiments/checkpoints"
+
+
+ def scan_checkpoints(checkpoint_dir: str = DEFAULT_CHECKPOINT_DIR) -> list[dict]:
+     """Scan checkpoint directory and return metadata for each checkpoint.
+
+     Returns list of dicts with path, experiment_id, metrics, size_bytes, created.
+     """
+     ckpt_path = Path(checkpoint_dir)
+     if not ckpt_path.exists():
+         return []
+
+     checkpoints = []
+     for entry in sorted(ckpt_path.iterdir()):
+         if entry.is_dir():
+             # Look for metadata.yaml or metadata.json in the checkpoint dir
+             meta_path = entry / "metadata.yaml"
+             meta_json = entry / "metadata.json"
+
+             metadata = {}
+             if meta_path.exists():
+                 with open(meta_path) as f:
+                     metadata = yaml.safe_load(f) or {}
+             elif meta_json.exists():
+                 with open(meta_json) as f:
+                     metadata = json.load(f)
+
+             # Compute directory size
+             size_bytes = sum(f.stat().st_size for f in entry.rglob("*") if f.is_file())
+
+             checkpoints.append({
+                 "path": str(entry),
+                 "name": entry.name,
+                 "experiment_id": metadata.get("experiment_id", entry.name),
+                 "metrics": metadata.get("metrics", {}),
+                 "config": metadata.get("config", {}),
+                 "size_bytes": size_bytes,
+                 "size_mb": round(size_bytes / 1024**2, 1),
+                 "created": metadata.get("timestamp", metadata.get("created_at")),
+             })
+         elif entry.is_file() and entry.suffix in (".joblib", ".pkl", ".pt", ".pth", ".h5", ".onnx"):
+             # Single-file checkpoint
+             size_bytes = entry.stat().st_size
+             # Try to extract exp ID from filename
+             exp_id = entry.stem.replace("-checkpoint", "").replace("_checkpoint", "")
+
+             checkpoints.append({
+                 "path": str(entry),
+                 "name": entry.name,
+                 "experiment_id": exp_id,
+                 "metrics": {},
+                 "config": {},
+                 "size_bytes": size_bytes,
+                 "size_mb": round(size_bytes / 1024**2, 1),
+                 "created": None,
+             })
+
+     return checkpoints
+
+
+ def enrich_checkpoints_from_log(
+     checkpoints: list[dict],
+     experiments: list[dict],
+ ) -> list[dict]:
+     """Enrich checkpoint metadata with metrics from experiment log."""
+     exp_map = {e.get("experiment_id"): e for e in experiments}
+
+     for ckpt in checkpoints:
+         exp_id = ckpt["experiment_id"]
+         if exp_id in exp_map and not ckpt["metrics"]:
+             exp = exp_map[exp_id]
+             ckpt["metrics"] = exp.get("metrics", {})
+             ckpt["config"] = exp.get("config", {})
+             if not ckpt["created"]:
+                 ckpt["created"] = exp.get("timestamp")
+
+     return checkpoints
+
+
+ def compute_pareto_checkpoints(
+     checkpoints: list[dict],
+     metrics: list[str],
+     directions: dict[str, str],
+ ) -> tuple[list[dict], list[dict]]:
+     """Separate checkpoints into Pareto-optimal and dominated sets.
+
+     Returns (pareto_optimal, dominated) tuple.
+     """
+     if not checkpoints or not metrics:
+         return checkpoints, []
+
+     # Filter to checkpoints that have all requested metrics
+     complete = []
+     incomplete = []
+     for ckpt in checkpoints:
+         if all(ckpt["metrics"].get(m) is not None for m in metrics):
+             complete.append(ckpt)
+         else:
+             incomplete.append(ckpt)
+
+     if not complete:
+         return checkpoints, []
+
+     pareto = []
+     dominated = []
+
+     for i, candidate in enumerate(complete):
+         is_dominated = False
+         for j, other in enumerate(complete):
+             if i == j:
+                 continue
+             all_at_least = True
+             strictly_better = False
+             for m in metrics:
+                 c_val = float(candidate["metrics"][m])
+                 o_val = float(other["metrics"][m])
+                 direction = directions.get(m, "higher")
+                 if direction == "higher":
+                     if o_val < c_val:
+                         all_at_least = False
+                         break
+                     if o_val > c_val:
+                         strictly_better = True
+                 else:
+                     if o_val > c_val:
+                         all_at_least = False
+                         break
+                     if o_val < c_val:
+                         strictly_better = True
+             if all_at_least and strictly_better:
+                 is_dominated = True
+                 break
+         if is_dominated:
+             dominated.append(candidate)
+         else:
+             pareto.append(candidate)
+
+     # Incomplete checkpoints are kept (can't determine dominance)
+     pareto.extend(incomplete)
+     return pareto, dominated
+
+
+ def prune_dominated(
+     checkpoints: list[dict],
+     dominated: list[dict],
+     keep_latest: bool = True,
+     dry_run: bool = False,
+ ) -> dict:
+     """Remove dominated checkpoints from disk.
+
+     Args:
+         checkpoints: All checkpoints.
+         dominated: Dominated checkpoints to remove.
+         keep_latest: Always keep the most recent checkpoint (for resume safety).
+         dry_run: If True, report what would be pruned without deleting.
+
+     Returns:
+         Dict with pruned count, bytes saved, and details.
+     """
+     # Protect the latest checkpoint (str() so datetime and string timestamps compare)
+     if keep_latest and checkpoints:
+         latest = max(checkpoints, key=lambda c: str(c.get("created") or ""))
+         dominated = [d for d in dominated if d["path"] != latest["path"]]
+
+     pruned = []
+     bytes_saved = 0
+
+     for ckpt in dominated:
+         path = Path(ckpt["path"])
+         if not path.exists():
+             continue
+         pruned.append({
+             "name": ckpt["name"],
+             "experiment_id": ckpt["experiment_id"],
+             "size_mb": ckpt["size_mb"],
+         })
+         bytes_saved += ckpt["size_bytes"]
+         if not dry_run:
+             if path.is_dir():
+                 shutil.rmtree(path)
+             else:
+                 path.unlink()
+
+     return {
+         "pruned_count": len(pruned),
+         "bytes_saved": bytes_saved,
+         "mb_saved": round(bytes_saved / 1024**2, 1),
+         "pruned": pruned,
+         "dry_run": dry_run,
+     }
+
+
+ def compute_disk_stats(checkpoints: list[dict]) -> dict:
+     """Compute disk usage statistics for checkpoints."""
+     if not checkpoints:
+         return {"total_count": 0, "total_size_mb": 0, "avg_size_mb": 0}
+
+     total_bytes = sum(c["size_bytes"] for c in checkpoints)
+     return {
+         "total_count": len(checkpoints),
+         "total_size_mb": round(total_bytes / 1024**2, 1),
+         "total_size_gb": round(total_bytes / 1024**3, 2),
+         "avg_size_mb": round(total_bytes / len(checkpoints) / 1024**2, 1),
+         "largest": max(checkpoints, key=lambda c: c["size_bytes"])["name"] if checkpoints else None,
+         "largest_mb": max(c["size_mb"] for c in checkpoints) if checkpoints else 0,
+     }
+
+
+ def format_checkpoint_list(
+     checkpoints: list[dict],
+     pareto_ids: set[str],
+     primary_metric: str,
+ ) -> str:
+     """Format checkpoint list as markdown table."""
+     if not checkpoints:
+         return "No checkpoints found."
+
+     lines = [
+         "# Checkpoints",
+         "",
+         f"| Name | Experiment | {primary_metric} | Size | Pareto |",
+         f"|------|------------|{'---' * max(len(primary_metric) // 3, 1)}--|------|--------|",
+     ]
+
+     for ckpt in checkpoints:
+         metric_val = ckpt["metrics"].get(primary_metric)
+         metric_str = f"{metric_val:.4f}" if isinstance(metric_val, (int, float)) else "N/A"
+         pareto_marker = "YES" if ckpt["path"] in pareto_ids or ckpt["name"] in pareto_ids else ""
+         lines.append(
+             f"| {ckpt['name']} | {ckpt['experiment_id']} "
+             f"| {metric_str} | {ckpt['size_mb']:.1f} MB | {pareto_marker} |"
+         )
+
+     return "\n".join(lines)
+
+
+ def format_prune_report(prune_result: dict, stats_before: dict, stats_after: dict) -> str:
+     """Format pruning report."""
+     lines = [
+         "# Checkpoint Pruning",
+         "",
+         f"Before: {stats_before['total_count']} checkpoints, {stats_before.get('total_size_gb', 0):.1f} GB",
+     ]
+
+     if prune_result["dry_run"]:
+         lines.append(f"Would prune: {prune_result['pruned_count']} dominated checkpoints ({prune_result['mb_saved']:.1f} MB)")
+         lines.append("")
+         lines.append("*Dry run — no files deleted. Run without --dry-run to prune.*")
+     else:
+         lines.append(f"Pruned: {prune_result['pruned_count']} dominated checkpoints ({prune_result['mb_saved']:.1f} MB)")
+         lines.append(f"After: {stats_after['total_count']} checkpoints, {stats_after.get('total_size_gb', 0):.1f} GB")
+
+     if prune_result["pruned"]:
+         lines.extend(["", "## Removed", ""])
+         for p in prune_result["pruned"]:
+             lines.append(f"- {p['name']} ({p['experiment_id']}, {p['size_mb']:.1f} MB)")
+
+     return "\n".join(lines)
+
+
+ def format_stats_report(stats: dict, checkpoints: list[dict]) -> str:
+     """Format disk usage statistics."""
+     lines = [
+         "# Checkpoint Storage",
+         "",
+         f"- **Total checkpoints:** {stats['total_count']}",
+         f"- **Total size:** {stats.get('total_size_gb', 0):.2f} GB ({stats['total_size_mb']:.1f} MB)",
+         f"- **Average size:** {stats['avg_size_mb']:.1f} MB",
+     ]
+     if stats.get("largest"):
+         lines.append(f"- **Largest:** {stats['largest']} ({stats['largest_mb']:.1f} MB)")
+
+     # Size by model type
+     by_type: dict[str, list[dict]] = {}
+     for ckpt in checkpoints:
+         mt = ckpt.get("config", {}).get("model_type", "unknown")
+         by_type.setdefault(mt, []).append(ckpt)
+
+     if len(by_type) > 1:
+         lines.extend(["", "## By Model Type", ""])
+         for mt, ckpts in sorted(by_type.items()):
+             total_mb = sum(c["size_mb"] for c in ckpts)
+             lines.append(f"- **{mt}:** {len(ckpts)} checkpoints, {total_mb:.1f} MB")
+
+     return "\n".join(lines)
+
+
+ def save_checkpoint_report(report: dict, output_dir: str = "experiments/checkpoints") -> Path:
+     """Save checkpoint management report."""
+     out_path = Path(output_dir)
+     out_path.mkdir(parents=True, exist_ok=True)
+     filepath = out_path / "checkpoint-report.yaml"
+     with open(filepath, "w") as f:
+         yaml.dump(report, f, default_flow_style=False, sort_keys=False)
+     return filepath
+
+
+ def main() -> None:
+     """CLI entry point."""
+     parser = argparse.ArgumentParser(description="Smart checkpoint manager")
+     parser.add_argument(
+         "action",
+         choices=["list", "prune", "average", "resume", "stats"],
+         help="Action to perform",
+     )
+     parser.add_argument("exp_id", nargs="?", default=None, help="Experiment ID (for resume)")
+     parser.add_argument("--checkpoint-dir", default=DEFAULT_CHECKPOINT_DIR, help="Checkpoint directory")
+     parser.add_argument("--config", default="config.yaml", help="Path to config.yaml")
+     parser.add_argument("--log", default="experiments/log.jsonl", help="Path to experiment log")
+     parser.add_argument("--top", type=int, default=3, help="Top K for averaging")
+     parser.add_argument("--dry-run", action="store_true", help="Don't actually delete (prune)")
+     parser.add_argument("--json", action="store_true", help="Output raw JSON")
+     args = parser.parse_args()
+
+     config = load_config(args.config)
+     eval_cfg = config.get("evaluation", {})
+     primary_metric = eval_cfg.get("primary_metric", "accuracy")
+     lower_is_better = eval_cfg.get("lower_is_better", False)
+
+     experiments = load_experiments(args.log)
+     checkpoints = scan_checkpoints(args.checkpoint_dir)
+     checkpoints = enrich_checkpoints_from_log(checkpoints, experiments)
+
+     # Determine metric directions
+     lower_metrics = {"train_seconds", "latency", "mse", "rmse", "mae", "loss"}
+     metrics_to_check = [primary_metric]
+     if "train_seconds" in set().union(*(c["metrics"].keys() for c in checkpoints if c["metrics"])):
+         metrics_to_check.append("train_seconds")
+
+     directions = {}
+     for m in metrics_to_check:
+         if m == primary_metric:
+             directions[m] = "lower" if lower_is_better else "higher"
+         elif m in lower_metrics:
+             directions[m] = "lower"
+         else:
+             directions[m] = "higher"
+
+     pareto, dominated = compute_pareto_checkpoints(checkpoints, metrics_to_check, directions)
+     pareto_paths = {c["path"] for c in pareto}
+
+     if args.action == "list":
+         if args.json:
+             print(json.dumps(checkpoints, indent=2, default=str))
+         else:
+             print(format_checkpoint_list(checkpoints, pareto_paths, primary_metric))
+
+     elif args.action == "stats":
+         stats = compute_disk_stats(checkpoints)
+         if args.json:
+             print(json.dumps(stats, indent=2))
+         else:
+             print(format_stats_report(stats, checkpoints))
+
+     elif args.action == "prune":
+         stats_before = compute_disk_stats(checkpoints)
+         result = prune_dominated(checkpoints, dominated, dry_run=args.dry_run)
+
+         if not args.dry_run:
+             remaining = scan_checkpoints(args.checkpoint_dir)
+             stats_after = compute_disk_stats(remaining)
+         else:
+             stats_after = {
+                 "total_count": stats_before["total_count"] - result["pruned_count"],
+                 "total_size_mb": stats_before["total_size_mb"] - result["mb_saved"],
+                 "total_size_gb": round((stats_before["total_size_mb"] - result["mb_saved"]) / 1024, 2),
+             }
+
+         if args.json:
+             print(json.dumps({"prune": result, "before": stats_before, "after": stats_after}, indent=2))
+         else:
+             print(format_prune_report(result, stats_before, stats_after))
+
+     elif args.action == "average":
+         # Sort by primary metric and take top K
+         with_metric = [c for c in checkpoints if primary_metric in c["metrics"]]
+         with_metric.sort(
+             key=lambda c: c["metrics"][primary_metric],
+             reverse=not lower_is_better,
+         )
+         top_k = with_metric[:args.top]
+
+         if not top_k:
+             print("No checkpoints with metric data found.", file=sys.stderr)
+             sys.exit(1)
+
+         print(f"Top {len(top_k)} checkpoints for averaging:", file=sys.stderr)
+         for c in top_k:
+             print(f" {c['experiment_id']}: {primary_metric}={c['metrics'][primary_metric]:.4f}", file=sys.stderr)
+         print("\nNote: Weight averaging requires model-specific implementation.", file=sys.stderr)
+         print("The checkpoint paths are:", file=sys.stderr)
+         for c in top_k:
+             print(f" {c['path']}", file=sys.stderr)
+
+     elif args.action == "resume":
+         if not args.exp_id:
+             print("Specify experiment ID: checkpoint resume <exp-id>", file=sys.stderr)
+             sys.exit(1)
+
+         target = None
+         for c in checkpoints:
+             if c["experiment_id"] == args.exp_id:
+                 target = c
+                 break
+
+         if not target:
+             print(f"No checkpoint found for {args.exp_id}", file=sys.stderr)
+             available = [c["experiment_id"] for c in checkpoints]
+             if available:
+                 print(f"Available: {', '.join(available[:10])}", file=sys.stderr)
+             sys.exit(1)
+
+         print(f"Resume checkpoint: {target['path']}", file=sys.stderr)
+         print(f"Experiment: {target['experiment_id']}", file=sys.stderr)
+         if target["metrics"]:
+             print(f"Metrics: {target['metrics']}", file=sys.stderr)
+
+
+ if __name__ == "__main__":
+     main()
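
The `average` action stops at listing the top-K paths because weight averaging is model-specific. As an illustration only (not part of the package), a minimal sketch for PyTorch checkpoints, assuming each `.pt` file holds a plain `state_dict` with identical keys and shapes:

```python
# Sketch: uniform averaging of PyTorch state dicts (an assumption; the package
# itself does not implement averaging). Paths come from `checkpoint average`.
import torch


def average_state_dicts(paths: list[str]) -> dict:
    state_dicts = [torch.load(p, map_location="cpu") for p in paths]
    avg = {}
    for key in state_dicts[0]:
        # Stack the same tensor across checkpoints and take the element-wise mean
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg


# torch.save(average_state_dicts(top_k_paths), "experiments/checkpoints/averaged.pt")
```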
@@ -212,6 +212,23 @@ def detect_environment_drift(experiments: list[dict]) -> list[str]:
      return warnings


+ def load_profiles(profile_dir: str = "experiments/profiles") -> list[dict]:
+     """Load all profiling results from YAML files."""
+     path = Path(profile_dir)
+     if not path.exists():
+         return []
+     profiles = []
+     for f in sorted(path.glob("*-profile.yaml")):
+         try:
+             with open(f) as fh:
+                 profile = yaml.safe_load(fh)
+                 if profile and isinstance(profile, dict):
+                     profiles.append(profile)
+         except (yaml.YAMLError, OSError):
+             continue
+     return profiles
+
+
  def load_diagnoses(diag_dir: str = "experiments/diagnoses") -> list[dict]:
      """Load all diagnosis reports from YAML files."""
      path = Path(diag_dir)
@@ -278,6 +295,7 @@ def format_brief(
      seed_studies: list[dict] | None = None,
      reproductions: list[dict] | None = None,
      diagnoses: list[dict] | None = None,
+     profiles: list[dict] | None = None,
  ) -> str:
      """Format the research briefing as markdown."""
      direction = "lower" if lower_is_better else "higher"
@@ -454,6 +472,23 @@ def format_brief(
      if failed:
          lines.extend(["", f"*{len(failed)} experiment(s) failed reproducibility checks.*"])

+     # Profiles
+     if profiles:
+         lines.extend(["", "## Performance Profile", ""])
+         for prof in profiles[-1:]:  # Show most recent
+             exp_id = prof.get("experiment_id", "?")
+             p = prof.get("profile", {})
+             bn = prof.get("bottleneck", {})
+             lines.append(f"**{exp_id}:** {p.get('total_time_sec', 0):.1f}s total")
+             mem = p.get("memory", {})
+             if mem.get("peak_rss_mb"):
+                 lines.append(f"- Peak memory: {mem['peak_rss_mb']:.0f} MB")
+             if bn.get("type") and bn["type"] != "none_detected":
+                 lines.append(f"- Bottleneck: **{bn['type']}** ({bn.get('severity', 'unknown')})")
+             recs = prof.get("recommendations", [])
+             if recs:
+                 lines.append(f"- Top recommendation: {recs[0]}")
+
      # Diagnoses (error analysis)
      if diagnoses:
          lines.extend(["", "## Error Analysis", ""])
@@ -529,10 +564,11 @@ def generate_brief(
      cost_records = load_cost_data(log_path, metric)
      pareto = compute_pareto_frontier(cost_records, lower_is_better) if cost_records else []

-     # Load seed studies, reproduction reports, and diagnoses
+     # Load seed studies, reproduction reports, diagnoses, and profiles
      seed_studies = load_seed_studies()
      reproductions = load_reproductions()
      diagnoses = load_diagnoses()
+     profiles = load_profiles()

      return format_brief(
          campaign, best, trajectory, model_types, hypotheses,
@@ -542,6 +578,7 @@ def generate_brief(
          seed_studies=seed_studies if seed_studies else None,
          reproductions=reproductions if reproductions else None,
          diagnoses=diagnoses if diagnoses else None,
+         profiles=profiles if profiles else None,
      )


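For reference, the briefing section above only needs a handful of keys from each `*-profile.yaml`. A minimal sketch of the shape it reads, with invented values (the keys mirror what `format_brief` accesses):

```python
# Invented example of the profile dict load_profiles() yields; only these keys
# are read by the "Performance Profile" section of the brief.
example_profile = {
    "experiment_id": "exp-042",
    "profile": {
        "total_time_sec": 41.7,
        "memory": {"peak_rss_mb": 1843.0},
    },
    "bottleneck": {"type": "data_loading", "severity": "high"},
    "recommendations": [
        "Cache preprocessed data to disk — data loading dominates training time",
    ],
}
```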
@@ -0,0 +1,533 @@
+ #!/usr/bin/env python3
+ """Computational profiling for ML training runs.
+
+ Measures timing breakdown, memory usage, throughput, and identifies
+ bottlenecks. Maps bottleneck patterns to known fixes.
+
+ Usage:
+     python scripts/profile_training.py                   # Profile best config
+     python scripts/profile_training.py --exp-id exp-042  # Specific experiment
+     python scripts/profile_training.py --seed 7          # Different profiling seed
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import os
+ import re
+ import subprocess
+ import sys
+ import time
+ import tracemalloc
+ from datetime import datetime, timezone
+ from pathlib import Path
+
+ import yaml
+
+ from scripts.turing_io import load_config, load_experiments
+
+
+ # Known bottleneck patterns -> recommendation mappings
+ BOTTLENECK_RECOMMENDATIONS = {
+     "data_loading": [
+         "Cache preprocessed data to disk — data loading dominates training time",
+         "Use memory-mapped files or HDF5 for large datasets",
+         "Check for unnecessary file I/O in the data pipeline",
+     ],
+     "preprocessing": [
+         "Move preprocessing to a one-time step before training",
+         "Cache feature transformations (encoders, scalers) to avoid recomputation",
+         "Consider using vectorized operations instead of row-by-row processing",
+     ],
+     "model_training": [
+         "Reduce model complexity (fewer estimators, smaller depth)",
+         "Try a faster model type (LightGBM is typically faster than XGBoost)",
+         "Enable GPU acceleration if available and model supports it",
+     ],
+     "evaluation": [
+         "Reduce validation set size for intermediate checks",
+         "Evaluate less frequently (every N iterations instead of every iteration)",
+     ],
+     "memory": [
+         "Process data in batches to reduce peak memory",
+         "Use float32 instead of float64 for features",
+         "Release intermediate dataframes after use",
+     ],
+ }
+
+
+ def find_experiment(experiments: list[dict], exp_id: str | None, metric: str, lower_is_better: bool) -> dict | None:
+     """Find experiment by ID or return best kept."""
+     if exp_id:
+         for exp in experiments:
+             if exp.get("experiment_id") == exp_id:
+                 return exp
+         return None
+     best = None
+     best_val = float("inf") if lower_is_better else float("-inf")
+     for exp in experiments:
+         if exp.get("status") != "kept":
+             continue
+         val = exp.get("metrics", {}).get(metric)
+         if val is None:
+             continue
+         if (lower_is_better and val < best_val) or (not lower_is_better and val > best_val):
+             best_val = val
+             best = exp
+     return best
+
+
+ def get_system_memory_mb() -> float | None:
+     """Get current process RSS memory in MB from /proc/self/status."""
+     try:
+         with open("/proc/self/status") as f:
+             for line in f:
+                 if line.startswith("VmRSS:"):
+                     return int(line.split()[1]) / 1024  # kB -> MB
+     except (FileNotFoundError, ValueError, IndexError):
+         pass
+     return None
+
+
+ def get_peak_memory_mb() -> float | None:
+     """Get peak RSS memory in MB from /proc/self/status."""
+     try:
+         with open("/proc/self/status") as f:
+             for line in f:
+                 if line.startswith("VmPeak:"):
+                     return int(line.split()[1]) / 1024  # kB -> MB
+     except (FileNotFoundError, ValueError, IndexError):
+         pass
+     return None
+
+
+ def check_gpu_available() -> dict | None:
+     """Check for GPU availability and return info."""
+     # Try torch
+     try:
+         import torch
+         if torch.cuda.is_available():
+             return {
+                 "type": "cuda",
+                 "device": torch.cuda.get_device_name(0),
+                 "memory_total_mb": round(torch.cuda.get_device_properties(0).total_memory / 1024**2),
+             }
+     except ImportError:
+         pass
+     # Try nvidia-smi
+     try:
+         result = subprocess.run(
+             ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
+             capture_output=True, text=True, timeout=5,
+         )
+         if result.returncode == 0 and result.stdout.strip():
+             parts = result.stdout.strip().split(",")
+             return {
+                 "type": "cuda",
+                 "device": parts[0].strip(),
+                 "memory_total_mb": int(parts[1].strip()) if len(parts) > 1 else None,
+             }
+     except (FileNotFoundError, subprocess.TimeoutExpired):
+         pass
+     return None
+
+
+ def profile_training_run(seed: int = 42, timeout: int = 600) -> dict:
+     """Run train.py with profiling instrumentation.
+
+     Wraps the training run and captures timing, memory, and throughput.
+     """
+     # Start memory tracking (measures this profiler process; train.py runs in a subprocess)
+     tracemalloc.start()
+     mem_before = get_system_memory_mb()
+
+     start_time = time.perf_counter()
+
+     # Run training
+     cmd = ["python", "train.py", "--seed", str(seed)]
+     try:
+         proc = subprocess.run(
+             cmd, capture_output=True, text=True, timeout=timeout,
+         )
+     except subprocess.TimeoutExpired:
+         return {"error": f"Training timed out after {timeout}s"}
+
+     end_time = time.perf_counter()
+     total_time = end_time - start_time
+
+     # Memory snapshot
+     current, peak = tracemalloc.get_traced_memory()
+     tracemalloc.stop()
+     mem_after = get_system_memory_mb()
+     peak_rss = get_peak_memory_mb()
+
+     # Parse metrics from output
+     metrics = {}
+     timing_data = {}
+     in_block = False
+     for line in proc.stdout.splitlines():
+         line_stripped = line.strip()
+         if line_stripped == "---":
+             if in_block:
+                 break
+             in_block = True
+             continue
+         if in_block and ":" in line_stripped:
+             key, value = line_stripped.split(":", 1)
+             key = key.strip()
+             value = value.strip()
+             try:
+                 metrics[key] = float(value)
+             except ValueError:
+                 metrics[key] = value
+
+     # Extract timing if train.py reports it
+     train_seconds = metrics.get("train_seconds")
+     if isinstance(train_seconds, str):
+         try:
+             train_seconds = float(train_seconds)
+         except ValueError:
+             train_seconds = None
+
+     # Build profile
+     profile = {
+         "total_time_sec": round(total_time, 2),
+         "train_time_sec": round(float(train_seconds), 2) if train_seconds else round(total_time, 2),
+         "overhead_sec": round(total_time - float(train_seconds), 2) if train_seconds else 0,
+         "memory": {
+             "peak_rss_mb": round(peak_rss, 1) if peak_rss else None,
+             "python_peak_mb": round(peak / 1024**2, 1),
+             "rss_before_mb": round(mem_before, 1) if mem_before else None,
+             "rss_after_mb": round(mem_after, 1) if mem_after else None,
+         },
+         "metrics": metrics,
+         "returncode": proc.returncode,
+     }
+
+     # Check GPU
+     gpu_info = check_gpu_available()
+     if gpu_info:
+         profile["gpu"] = gpu_info
+         # Try to get GPU memory usage
+         try:
+             import torch
+             if torch.cuda.is_available():
+                 profile["memory"]["peak_gpu_mb"] = round(torch.cuda.max_memory_allocated() / 1024**2, 1)
+         except ImportError:
+             pass
+
+     return profile
+
+
+ def estimate_timing_breakdown(profile: dict) -> dict:
+     """Estimate timing breakdown from available data.
+
+     Since we can't instrument inside train.py without modifying it,
+     we estimate based on total time and known patterns.
+     """
+     total = profile.get("train_time_sec", profile.get("total_time_sec", 0))
+     if total <= 0:
+         return {}
+
+     train_secs = profile.get("metrics", {}).get("train_seconds")
+     if isinstance(train_secs, str):
+         try:
+             train_secs = float(train_secs)
+         except ValueError:
+             train_secs = None
+
+     overhead = profile.get("overhead_sec", 0)
+
+     breakdown = {
+         "total_sec": round(total, 2),
+         "overhead_sec": round(overhead, 2),
+         "overhead_pct": round(overhead / total * 100, 1) if total > 0 else 0,
+     }
+
+     if train_secs:
+         breakdown["training_sec"] = round(train_secs, 2)
+         breakdown["training_pct"] = round(train_secs / total * 100, 1)
+
+     return breakdown
+
+
+ def detect_bottleneck(profile: dict, breakdown: dict) -> dict:
+     """Identify the primary bottleneck from profiling data."""
+     bottlenecks = []
+
+     total = profile.get("total_time_sec", 0)
+     overhead = profile.get("overhead_sec", 0)
+
+     # Check if overhead (data loading + setup) is dominant
+     if total > 0 and overhead > 0:
+         overhead_pct = overhead / total * 100
+         if overhead_pct > 50:
+             bottlenecks.append({
+                 "type": "data_loading",
+                 "severity": "high",
+                 "description": f"Non-training overhead is {overhead_pct:.0f}% of total time ({overhead:.1f}s of {total:.1f}s)",
+                 "pct_of_total": round(overhead_pct, 1),
+             })
+
+     # Check memory pressure
+     peak_rss = profile.get("memory", {}).get("peak_rss_mb")
+     if peak_rss and peak_rss > 4096:
+         bottlenecks.append({
+             "type": "memory",
+             "severity": "medium" if peak_rss < 8192 else "high",
+             "description": f"Peak memory usage is {peak_rss:.0f} MB",
+             "pct_of_total": 0,
+         })
+
+     # Check GPU utilization
+     gpu = profile.get("gpu")
+     peak_gpu = profile.get("memory", {}).get("peak_gpu_mb")
+     if gpu and peak_gpu:
+         gpu_total = gpu.get("memory_total_mb", 0)
+         if gpu_total > 0:
+             util_pct = peak_gpu / gpu_total * 100
+             if util_pct < 30:
+                 bottlenecks.append({
+                     "type": "gpu_underutilized",
+                     "severity": "medium",
+                     "description": f"GPU memory utilization is only {util_pct:.0f}% — consider larger batch size",
+                     "pct_of_total": 0,
+                 })
+
+     # Check training time relative to expectations
+     train_secs = profile.get("metrics", {}).get("train_seconds")
+     if isinstance(train_secs, (int, float)) and train_secs > 300:
+         bottlenecks.append({
+             "type": "model_training",
+             "severity": "medium",
+             "description": f"Training takes {train_secs:.0f}s — consider model simplification",
+             "pct_of_total": 0,
+         })
+
+     if not bottlenecks:
+         return {
+             "type": "none_detected",
+             "severity": "low",
+             "description": "No significant bottleneck detected",
+             "pct_of_total": 0,
+         }
+
+     # Return most severe
+     severity_order = {"high": 3, "medium": 2, "low": 1}
+     bottlenecks.sort(key=lambda b: -severity_order.get(b["severity"], 0))
+     return bottlenecks[0]
+
+
+ def generate_recommendations(bottleneck: dict, profile: dict) -> list[str]:
+     """Generate actionable recommendations based on detected bottleneck."""
+     bt = bottleneck.get("type", "none_detected")
+     recs = BOTTLENECK_RECOMMENDATIONS.get(bt, [])
+
+     # Add context-specific recommendations
+     extra = []
+     peak_rss = profile.get("memory", {}).get("peak_rss_mb")
+     if peak_rss and peak_rss > 2048 and bt != "memory":
+         extra.append(f"Memory usage is {peak_rss:.0f} MB — monitor for OOM risk on smaller machines")
+
+     gpu = profile.get("gpu")
+     if not gpu:
+         extra.append("No GPU detected — GPU acceleration could significantly speed up training")
+
+     total = profile.get("total_time_sec", 0)
+     if total < 10:
+         extra.append("Training is very fast — profiling overhead may distort results")
+
+     return list(recs) + extra
+
+
+ def compute_throughput(profile: dict) -> dict:
+     """Compute throughput metrics from profile data."""
+     total_time = profile.get("train_time_sec", profile.get("total_time_sec", 0))
+     metrics = profile.get("metrics", {})
+
+     n_samples = metrics.get("n_samples") or metrics.get("n_train_samples")
+     if isinstance(n_samples, str):
+         try:
+             n_samples = int(n_samples)
+         except ValueError:
+             n_samples = None
+
+     throughput = {}
+     if n_samples and total_time > 0:
+         throughput["samples_per_sec"] = round(n_samples / total_time, 1)
+
+     return throughput
+
+
+ def save_profile(profile_data: dict, output_dir: str = "experiments/profiles") -> Path:
+     """Save profile results to YAML file."""
+     out_path = Path(output_dir)
+     out_path.mkdir(parents=True, exist_ok=True)
+     exp_id = profile_data.get("experiment_id", "unknown")
+     filepath = out_path / f"{exp_id}-profile.yaml"
+     with open(filepath, "w") as f:
+         yaml.dump(profile_data, f, default_flow_style=False, sort_keys=False)
+     return filepath
+
+
+ def format_profile_report(profile_data: dict) -> str:
+     """Format profile results as human-readable markdown."""
+     if "error" in profile_data:
+         return f"ERROR: {profile_data['error']}"
+
+     exp_id = profile_data.get("experiment_id", "?")
+     profile = profile_data.get("profile", {})
+     breakdown = profile_data.get("breakdown", {})
+     bottleneck = profile_data.get("bottleneck", {})
+     recommendations = profile_data.get("recommendations", [])
+     throughput = profile_data.get("throughput", {})
+
+     lines = [
+         f"# Training Profile: {exp_id}",
+         "",
+     ]
+
+     # Timing
+     lines.extend([
+         "## Timing",
+         "",
+         f"- **Total time:** {profile.get('total_time_sec', 0):.1f}s",
+     ])
+     if breakdown.get("training_sec"):
+         lines.append(f"- **Training:** {breakdown['training_sec']:.1f}s ({breakdown.get('training_pct', 0):.0f}%)")
+     if breakdown.get("overhead_sec", 0) > 0:
+         lines.append(f"- **Overhead:** {breakdown['overhead_sec']:.1f}s ({breakdown.get('overhead_pct', 0):.0f}%)")
+
+     # Memory
+     mem = profile.get("memory", {})
+     lines.extend(["", "## Memory", ""])
+     if mem.get("peak_rss_mb"):
+         lines.append(f"- **Peak RSS:** {mem['peak_rss_mb']:.0f} MB")
+     lines.append(f"- **Python peak:** {mem.get('python_peak_mb', 0):.1f} MB")
+     if mem.get("peak_gpu_mb"):
+         lines.append(f"- **GPU peak:** {mem['peak_gpu_mb']:.0f} MB")
+
+     # GPU
+     gpu = profile.get("gpu")
+     if gpu:
+         lines.extend(["", "## GPU", ""])
+         lines.append(f"- **Device:** {gpu.get('device', 'unknown')}")
+         if gpu.get("memory_total_mb"):
+             lines.append(f"- **Total VRAM:** {gpu['memory_total_mb']} MB")
+
+     # Throughput
+     if throughput:
+         lines.extend(["", "## Throughput", ""])
+         if throughput.get("samples_per_sec"):
+             lines.append(f"- **Samples/sec:** {throughput['samples_per_sec']:.1f}")
+
+     # Bottleneck
+     lines.extend([
+         "",
+         "## Bottleneck",
+         "",
+         f"**{bottleneck.get('type', 'none')}** ({bottleneck.get('severity', 'low')})",
+         "",
+         bottleneck.get("description", "No bottleneck detected."),
+     ])
+
+     # Recommendations
+     if recommendations:
+         lines.extend(["", "## Recommendations", ""])
+         for rec in recommendations:
+             lines.append(f"- {rec}")
+
+     return "\n".join(lines)
+
+
+ def run_profile(
+     exp_id: str | None = None,
+     config_path: str = "config.yaml",
+     log_path: str = "experiments/log.jsonl",
+     seed: int = 42,
+     timeout: int = 600,
+ ) -> dict:
+     """Run a complete training profile.
+
+     Args:
+         exp_id: Experiment ID (defaults to best).
+         config_path: Path to config.yaml.
+         log_path: Path to experiment log.
+         seed: Random seed for the profiling run.
+         timeout: Training timeout in seconds.
+
+     Returns:
+         Complete profile result dict.
+     """
+     config = load_config(config_path)
+     eval_cfg = config.get("evaluation", {})
+     primary_metric = eval_cfg.get("primary_metric", "accuracy")
+     lower_is_better = eval_cfg.get("lower_is_better", False)
+
+     experiments = load_experiments(log_path)
+     target_exp = find_experiment(experiments, exp_id, primary_metric, lower_is_better)
+
+     if not target_exp:
+         return {"error": f"No experiment found{f' with ID {exp_id}' if exp_id else ''}"}
+
+     target_id = target_exp.get("experiment_id", "unknown")
+
+     print(f"Profiling {target_id}...", file=sys.stderr)
+
+     # Run profiled training
+     profile = profile_training_run(seed=seed, timeout=timeout)
+
+     if "error" in profile:
+         return {"error": profile["error"], "experiment_id": target_id}
+
+     # Analyze results
+     breakdown = estimate_timing_breakdown(profile)
+     bottleneck = detect_bottleneck(profile, breakdown)
+     recommendations = generate_recommendations(bottleneck, profile)
+     throughput = compute_throughput(profile)
+
+     result = {
+         "experiment_id": target_id,
+         "timestamp": datetime.now(timezone.utc).isoformat(),
+         "profile": profile,
+         "breakdown": breakdown,
+         "bottleneck": bottleneck,
+         "recommendations": recommendations,
+         "throughput": throughput,
+     }
+
+     return result
+
+
+ def main() -> None:
+     """CLI entry point."""
+     parser = argparse.ArgumentParser(description="Computational profiling for ML training")
+     parser.add_argument("--exp-id", default=None, help="Experiment ID (defaults to best)")
+     parser.add_argument("--config", default="config.yaml", help="Path to config.yaml")
+     parser.add_argument("--log", default="experiments/log.jsonl", help="Path to experiment log")
+     parser.add_argument("--seed", type=int, default=42, help="Random seed")
+     parser.add_argument("--timeout", type=int, default=600, help="Training timeout in seconds")
+     parser.add_argument("--json", action="store_true", help="Output raw JSON")
+     args = parser.parse_args()
+
+     result = run_profile(
+         exp_id=args.exp_id,
+         config_path=args.config,
+         log_path=args.log,
+         seed=args.seed,
+         timeout=args.timeout,
+     )
+
+     if "error" not in result:
+         filepath = save_profile(result)
+         print(f"Saved to {filepath}", file=sys.stderr)
+
+     if args.json:
+         print(json.dumps(result, indent=2, default=str))
+     else:
+         print(format_profile_report(result))
+
+
+ if __name__ == "__main__":
+     main()
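
A quick way to sanity-check the bottleneck heuristics without running training is to feed `detect_bottleneck` a synthetic profile. A minimal sketch, assuming the project root is on `sys.path` so the `scripts` imports resolve (the same assumption the script itself makes); the numbers are invented to trip the >50% overhead rule:

```python
# Synthetic profile: 65s of non-training overhead out of 100s total should be
# flagged as a high-severity data_loading bottleneck by detect_bottleneck.
from scripts.profile_training import detect_bottleneck, estimate_timing_breakdown

profile = {
    "total_time_sec": 100.0,
    "train_time_sec": 35.0,
    "overhead_sec": 65.0,
    "memory": {"peak_rss_mb": 900.0},
    "metrics": {"train_seconds": 35.0},
}
breakdown = estimate_timing_breakdown(profile)
print(detect_bottleneck(profile, breakdown))
# expected: {'type': 'data_loading', 'severity': 'high', ...}
```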
@@ -95,6 +95,8 @@ TEMPLATE_DIRS = {
      "diagnose_errors.py",
      "ablation_study.py",
      "pareto_frontier.py",
+     "profile_training.py",
+     "checkpoint_manager.py",
    ],
    "tests": ["__init__.py", "conftest.py"],
  }
@@ -108,6 +110,8 @@ DIRECTORIES_TO_CREATE = [
    "experiments/ablations",
    "experiments/frontiers",
    "experiments/predictions",
+   "experiments/profiles",
+   "experiments/checkpoints",
    "models/best",
    "models/archive",
  ]