claude-turing 1.4.0 → 1.5.0

@@ -1,7 +1,7 @@
  {
    "name": "turing",
-   "version": "1.4.0",
-   "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 22 commands, 2 specialized agents, experiment intelligence (error analysis, ablation studies, Pareto frontiers), statistical rigor (multi-seed studies, reproducibility verification), tree-search hypothesis exploration (TreeQuest AB-MCTS), cost-performance frontier analysis, model cards, model registry, hypothesis database with novelty guard, anti-cheating guardrails, and the taste-leverage loop. Inspired by Karpathy's autoresearch and the scientific method itself.",
+   "version": "1.5.0",
+   "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 24 commands, 2 specialized agents, performance profiling, smart Pareto-based checkpoint management, experiment intelligence (error analysis, ablation studies, Pareto frontiers), statistical rigor (multi-seed studies, reproducibility verification), tree-search hypothesis exploration (TreeQuest AB-MCTS), cost-performance frontier analysis, model cards, model registry, hypothesis database with novelty guard, anti-cheating guardrails, and the taste-leverage loop. Inspired by Karpathy's autoresearch and the scientific method itself.",
    "author": {
      "name": "pragnition"
    },
package/README.md CHANGED
@@ -328,6 +328,8 @@ The index (`hypotheses.yaml`) is the lightweight queue. The detail files (`hypot
  | `/turing:diagnose [exp-id]` | Error analysis — failure modes, confused pairs, feature-range bias |
  | `/turing:ablate [--components]` | Ablation study — remove components, measure impact, flag dead weight |
  | `/turing:frontier [--metrics]` | Pareto frontier — multi-objective tradeoff visualization |
+ | `/turing:profile [exp-id]` | Computational profiling — timing, memory, throughput, bottleneck detection |
+ | `/turing:checkpoint <action>` | Smart checkpoint management — list, prune (Pareto), average, resume, stats |
  | `/turing:card` | Generate a model card — performance, limitations, intended use, artifact contract |
  | `/turing:logbook` | Generate HTML experiment logbook |
  | `/turing:report` | Generate research report |
@@ -517,11 +519,11 @@ Each project gets independent config, data, experiments, models, and agent memor

  ## Architecture of Turing Itself

- 22 commands, 2 agents, 8 config files, 37 template scripts, model registry, artifact contract, cost-performance frontier, model cards, tree-search exploration, statistical rigor, experiment intelligence (error analysis + ablation + Pareto frontier), 487 tests, 16 ADRs. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full codemap.
+ 24 commands, 2 agents, 8 config files, 39 template scripts, model registry, artifact contract, cost-performance frontier, model cards, tree-search exploration, statistical rigor, experiment intelligence, performance profiling + smart checkpoints, 542 tests, 16 ADRs. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full codemap.

  ```
  turing/
- ├── commands/    21 skill files (core + taste-leverage + reporting + exploration + statistical rigor + experiment intelligence)
+ ├── commands/    23 skill files (core + taste-leverage + reporting + exploration + statistical rigor + experiment intelligence + performance)
  ├── agents/      2 agents (researcher: read/write, evaluator: read-only)
  ├── config/      8 files (lifecycle, taxonomy, archetypes, novelty aliases)
  ├── templates/   Scaffolded into user projects by /turing:init
@@ -0,0 +1,47 @@
+ ---
+ name: checkpoint
+ description: Smart checkpoint management — list, prune (Pareto-based), average top-K, resume from any point, disk usage stats.
+ disable-model-invocation: true
+ argument-hint: "<list|prune|average|resume|stats> [exp-id] [--top 3] [--dry-run]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Manage model checkpoints intelligently using Pareto dominance.
+
+ ## Steps
+
+ 1. **Activate environment:**
+    ```bash
+    source .venv/bin/activate
+    ```
+
+ 2. **Parse arguments from `$ARGUMENTS`:**
+    - First word is the action: `list`, `prune`, `average`, `resume`, `stats`
+    - `resume` requires an experiment ID as second argument
+    - `--top 3` sets the number of checkpoints for averaging
+    - `--dry-run` previews pruning without deleting
+
+ 3. **Run checkpoint manager:**
+    ```bash
+    python scripts/checkpoint_manager.py $ARGUMENTS
+    ```
+
+ 4. **Report results by action:**
+    - **list:** Table of all checkpoints with metrics, size, and Pareto status
+    - **prune:** Removes dominated checkpoints, reports space saved
+    - **average:** Lists top-K checkpoints for weight averaging
+    - **resume:** Locates checkpoint for a specific experiment
+    - **stats:** Disk usage summary by total, average, and model type
+
+ 5. **Saved output:** report written to `experiments/checkpoints/checkpoint-report.yaml`
+
+ ## Examples
+
+ ```
+ /turing:checkpoint list                # Show all checkpoints
+ /turing:checkpoint stats               # Disk usage summary
+ /turing:checkpoint prune --dry-run     # Preview what would be pruned
+ /turing:checkpoint prune               # Remove dominated checkpoints
+ /turing:checkpoint average --top 5     # Top 5 for averaging
+ /turing:checkpoint resume exp-042      # Resume from checkpoint
+ ```
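
For reference, the `list`, `prune`, and `stats` actions get their metrics from a `metadata.yaml` (or `metadata.json`) sitting inside each checkpoint directory; checkpoints without one fall back to the experiment log. Below is a minimal sketch of writing that file for a hypothetical `exp-042` checkpoint. The field names match what `scan_checkpoints` in `checkpoint_manager.py` (added further down) reads; the values are invented:

```python
# Sketch: write the metadata.yaml the checkpoint scanner looks for.
# exp-042 and its numbers are hypothetical.
from datetime import datetime, timezone
from pathlib import Path

import yaml

ckpt_dir = Path("experiments/checkpoints/exp-042")
ckpt_dir.mkdir(parents=True, exist_ok=True)

metadata = {
    "experiment_id": "exp-042",  # joined against the experiment log
    "metrics": {"accuracy": 0.91, "train_seconds": 38.2},  # basis for Pareto dominance
    "config": {"model_type": "lightgbm"},  # used by the stats breakdown
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

with open(ckpt_dir / "metadata.yaml", "w") as f:
    yaml.safe_dump(metadata, f, sort_keys=False)
```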
@@ -0,0 +1,43 @@
+ ---
+ name: profile
+ description: Profile a training run — timing breakdown, memory usage, throughput, bottleneck detection with actionable recommendations.
+ disable-model-invocation: true
+ argument-hint: "[exp-id] [--seed 42]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Profile a training run to identify performance bottlenecks.
+
+ ## Steps
+
+ 1. **Activate environment:**
+    ```bash
+    source .venv/bin/activate
+    ```
+
+ 2. **Parse arguments from `$ARGUMENTS`:**
+    - First argument can be an experiment ID (e.g., `exp-042`); defaults to best
+    - `--seed 42` sets the random seed for the profiling run
+
+ 3. **Run profiling:**
+    ```bash
+    python scripts/profile_training.py $ARGUMENTS
+    ```
+
+ 4. **Report results:**
+    - **Timing:** total time, training time, overhead breakdown
+    - **Memory:** peak RSS, Python peak, GPU peak (if applicable)
+    - **Throughput:** samples/sec
+    - **Bottleneck:** identified bottleneck type and severity
+    - **Recommendations:** actionable fixes for the detected bottleneck
+
+ 5. **Saved output:** results written to `experiments/profiles/exp-NNN-profile.yaml`
+
+ 6. **If no training pipeline exists:** suggest `/turing:init` first.
+
+ ## Examples
+
+ ```
+ /turing:profile            # Profile best experiment config
+ /turing:profile exp-042    # Profile specific experiment
+ ```
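
For reference, the profiler does not instrument `train.py` internally; it parses a `key: value` block delimited by two `---` lines from the script's stdout (see `profile_training_run` in `profile_training.py`, added further down). A minimal sketch of what the tail of a compatible `train.py` might print, where `train_seconds` and `n_samples` are the keys the profiler actually uses and the values are invented:

```python
# Hypothetical tail of train.py: emit a "---"-delimited metrics block on stdout.
# The profiler parses each "key: value" line, coercing values to float when possible.
print("---")
print("accuracy: 0.91")        # stored verbatim in the profile's metrics dict
print("train_seconds: 38.2")   # used for the timing breakdown
print("n_samples: 120000")     # used to compute samples/sec throughput
print("---")
```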
@@ -31,6 +31,8 @@ You are the Turing ML research router. Detect the user's intent and route to the
  | "diagnose", "error analysis", "failure modes", "where does it fail", "confusion matrix" | `/turing:diagnose` | Analyze |
  | "ablate", "ablation", "remove component", "which features matter", "component impact" | `/turing:ablate` | Analyze |
  | "frontier", "pareto", "tradeoff", "tradeoffs", "multi-objective", "which model is best" | `/turing:frontier` | Analyze |
+ | "profile", "profiling", "bottleneck", "slow training", "why is it slow", "timing" | `/turing:profile` | Check |
+ | "checkpoint", "checkpoints", "prune checkpoints", "disk space", "resume training" | `/turing:checkpoint` | Check |

  ## Sub-commands

@@ -58,6 +60,8 @@ You are the Turing ML research router. Detect the user's intent and route to the
  | `/turing:diagnose [exp-id]` | Error analysis: failure modes, confused pairs, feature-range bias | (inline) |
  | `/turing:ablate [--components]` | Ablation study: remove components, measure impact, flag dead weight | (inline) |
  | `/turing:frontier [--metrics]` | Pareto frontier: multi-objective tradeoff visualization | (inline) |
+ | `/turing:profile [exp-id]` | Computational profiling: timing, memory, throughput, bottleneck detection | (inline) |
+ | `/turing:checkpoint <action>` | Smart checkpoint management: list, prune (Pareto), average, resume, stats | (inline) |

  ## Proactive Detection

package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "claude-turing",
-   "version": "1.4.0",
+   "version": "1.5.0",
    "type": "module",
    "description": "Autonomous ML research harness for Claude Code. The autoresearch loop as a formal protocol — iteratively trains, evaluates, and improves ML models with structured experiment tracking, convergence detection, immutable evaluation infrastructure, and safety guardrails.",
    "bin": {
package/src/install.js CHANGED
@@ -23,7 +23,7 @@ const SUB_COMMANDS = [
    "init", "train", "status", "compare", "sweep", "validate",
    "try", "brief", "suggest", "explore", "design", "logbook", "poster",
    "report", "mode", "preflight", "card", "seed", "reproduce",
-   "diagnose", "ablate", "frontier",
+   "diagnose", "ablate", "frontier", "profile", "checkpoint",
  ];

  export async function install(opts = {}) {
package/src/verify.js CHANGED
@@ -36,6 +36,8 @@ const EXPECTED_COMMANDS = [
    "diagnose/SKILL.md",
    "ablate/SKILL.md",
    "frontier/SKILL.md",
+   "profile/SKILL.md",
+   "checkpoint/SKILL.md",
  ];

  const EXPECTED_AGENTS = ["ml-researcher.md", "ml-evaluator.md"];
@@ -0,0 +1,449 @@
+ #!/usr/bin/env python3
+ """Smart checkpoint manager for ML experiments.
+
+ Manages model checkpoints based on Pareto dominance rather than
+ simple "keep last K". Supports listing, Pareto-based pruning,
+ checkpoint averaging, and resume from any point.
+
+ Usage:
+     python scripts/checkpoint_manager.py list              # List checkpoints
+     python scripts/checkpoint_manager.py prune             # Remove dominated
+     python scripts/checkpoint_manager.py average [--top 3] # Average top-K
+     python scripts/checkpoint_manager.py resume exp-042    # Resume from checkpoint
+     python scripts/checkpoint_manager.py stats             # Disk usage summary
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import shutil
+ import sys
+ from datetime import datetime, timezone
+ from pathlib import Path
+
+ import yaml
+
+ from scripts.turing_io import load_config, load_experiments
+
+
+ DEFAULT_CHECKPOINT_DIR = "experiments/checkpoints"
+
+
+ def scan_checkpoints(checkpoint_dir: str = DEFAULT_CHECKPOINT_DIR) -> list[dict]:
+     """Scan checkpoint directory and return metadata for each checkpoint.
+
+     Returns list of dicts with path, experiment_id, metrics, size_bytes, created.
+     """
+     ckpt_path = Path(checkpoint_dir)
+     if not ckpt_path.exists():
+         return []
+
+     checkpoints = []
+     for entry in sorted(ckpt_path.iterdir()):
+         if entry.is_dir():
+             # Look for metadata.yaml or metadata.json in the checkpoint dir
+             meta_path = entry / "metadata.yaml"
+             meta_json = entry / "metadata.json"
+
+             metadata = {}
+             if meta_path.exists():
+                 with open(meta_path) as f:
+                     metadata = yaml.safe_load(f) or {}
+             elif meta_json.exists():
+                 with open(meta_json) as f:
+                     metadata = json.load(f)
+
+             # Compute directory size
+             size_bytes = sum(f.stat().st_size for f in entry.rglob("*") if f.is_file())
+
+             checkpoints.append({
+                 "path": str(entry),
+                 "name": entry.name,
+                 "experiment_id": metadata.get("experiment_id", entry.name),
+                 "metrics": metadata.get("metrics", {}),
+                 "config": metadata.get("config", {}),
+                 "size_bytes": size_bytes,
+                 "size_mb": round(size_bytes / 1024**2, 1),
+                 "created": metadata.get("timestamp", metadata.get("created_at")),
+             })
+         elif entry.is_file() and entry.suffix in (".joblib", ".pkl", ".pt", ".pth", ".h5", ".onnx"):
+             # Single-file checkpoint
+             size_bytes = entry.stat().st_size
+             # Try to extract exp ID from filename
+             exp_id = entry.stem.replace("-checkpoint", "").replace("_checkpoint", "")
+
+             checkpoints.append({
+                 "path": str(entry),
+                 "name": entry.name,
+                 "experiment_id": exp_id,
+                 "metrics": {},
+                 "config": {},
+                 "size_bytes": size_bytes,
+                 "size_mb": round(size_bytes / 1024**2, 1),
+                 "created": None,
+             })
+
+     return checkpoints
+
+
+ def enrich_checkpoints_from_log(
+     checkpoints: list[dict],
+     experiments: list[dict],
+ ) -> list[dict]:
+     """Enrich checkpoint metadata with metrics from experiment log."""
+     exp_map = {e.get("experiment_id"): e for e in experiments}
+
+     for ckpt in checkpoints:
+         exp_id = ckpt["experiment_id"]
+         if exp_id in exp_map and not ckpt["metrics"]:
+             exp = exp_map[exp_id]
+             ckpt["metrics"] = exp.get("metrics", {})
+             ckpt["config"] = exp.get("config", {})
+             if not ckpt["created"]:
+                 ckpt["created"] = exp.get("timestamp")
+
+     return checkpoints
+
+
+ def compute_pareto_checkpoints(
+     checkpoints: list[dict],
+     metrics: list[str],
+     directions: dict[str, str],
+ ) -> tuple[list[dict], list[dict]]:
+     """Separate checkpoints into Pareto-optimal and dominated sets.
+
+     Returns (pareto_optimal, dominated) tuple.
+     """
+     if not checkpoints or not metrics:
+         return checkpoints, []
+
+     # Filter to checkpoints that have all requested metrics
+     complete = []
+     incomplete = []
+     for ckpt in checkpoints:
+         if all(ckpt["metrics"].get(m) is not None for m in metrics):
+             complete.append(ckpt)
+         else:
+             incomplete.append(ckpt)
+
+     if not complete:
+         return checkpoints, []
+
+     pareto = []
+     dominated = []
+
+     for i, candidate in enumerate(complete):
+         is_dominated = False
+         for j, other in enumerate(complete):
+             if i == j:
+                 continue
+             all_at_least = True
+             strictly_better = False
+             for m in metrics:
+                 c_val = float(candidate["metrics"][m])
+                 o_val = float(other["metrics"][m])
+                 direction = directions.get(m, "higher")
+                 if direction == "higher":
+                     if o_val < c_val:
+                         all_at_least = False
+                         break
+                     if o_val > c_val:
+                         strictly_better = True
+                 else:
+                     if o_val > c_val:
+                         all_at_least = False
+                         break
+                     if o_val < c_val:
+                         strictly_better = True
+             if all_at_least and strictly_better:
+                 is_dominated = True
+                 break
+         if is_dominated:
+             dominated.append(candidate)
+         else:
+             pareto.append(candidate)
+
+     # Incomplete checkpoints are kept (can't determine dominance)
+     pareto.extend(incomplete)
+     return pareto, dominated
+
+
+ def prune_dominated(
+     checkpoints: list[dict],
+     dominated: list[dict],
+     keep_latest: bool = True,
+     dry_run: bool = False,
+ ) -> dict:
+     """Remove dominated checkpoints from disk.
+
+     Args:
+         checkpoints: All checkpoints.
+         dominated: Dominated checkpoints to remove.
+         keep_latest: Always keep the most recent checkpoint (for resume safety).
+         dry_run: If True, report what would be pruned without deleting.
+
+     Returns:
+         Dict with pruned count, bytes saved, and details.
+     """
+     # Protect the latest checkpoint (str() so datetime and string timestamps compare)
+     if keep_latest and checkpoints:
+         latest = max(checkpoints, key=lambda c: str(c.get("created") or ""))
+         dominated = [d for d in dominated if d["path"] != latest["path"]]
+
+     pruned = []
+     bytes_saved = 0
+
+     for ckpt in dominated:
+         path = Path(ckpt["path"])
+         if not path.exists():
+             continue
+         pruned.append({
+             "name": ckpt["name"],
+             "experiment_id": ckpt["experiment_id"],
+             "size_mb": ckpt["size_mb"],
+         })
+         bytes_saved += ckpt["size_bytes"]
+         if not dry_run:
+             if path.is_dir():
+                 shutil.rmtree(path)
+             else:
+                 path.unlink()
+
+     return {
+         "pruned_count": len(pruned),
+         "bytes_saved": bytes_saved,
+         "mb_saved": round(bytes_saved / 1024**2, 1),
+         "pruned": pruned,
+         "dry_run": dry_run,
+     }
+
+
+ def compute_disk_stats(checkpoints: list[dict]) -> dict:
+     """Compute disk usage statistics for checkpoints."""
+     if not checkpoints:
+         return {"total_count": 0, "total_size_mb": 0, "avg_size_mb": 0}
+
+     total_bytes = sum(c["size_bytes"] for c in checkpoints)
+     return {
+         "total_count": len(checkpoints),
+         "total_size_mb": round(total_bytes / 1024**2, 1),
+         "total_size_gb": round(total_bytes / 1024**3, 2),
+         "avg_size_mb": round(total_bytes / len(checkpoints) / 1024**2, 1),
+         "largest": max(checkpoints, key=lambda c: c["size_bytes"])["name"] if checkpoints else None,
+         "largest_mb": max(c["size_mb"] for c in checkpoints) if checkpoints else 0,
+     }
+
+
+ def format_checkpoint_list(
+     checkpoints: list[dict],
+     pareto_ids: set[str],
+     primary_metric: str,
+ ) -> str:
+     """Format checkpoint list as markdown table."""
+     if not checkpoints:
+         return "No checkpoints found."
+
+     lines = [
+         "# Checkpoints",
+         "",
+         f"| Name | Experiment | {primary_metric} | Size | Pareto |",
+         f"|------|------------|{'---' * max(len(primary_metric) // 3, 1)}--|------|--------|",
+     ]
+
+     for ckpt in checkpoints:
+         metric_val = ckpt["metrics"].get(primary_metric)
+         metric_str = f"{metric_val:.4f}" if isinstance(metric_val, (int, float)) else "N/A"
+         pareto_marker = "YES" if ckpt["path"] in pareto_ids or ckpt["name"] in pareto_ids else ""
+         lines.append(
+             f"| {ckpt['name']} | {ckpt['experiment_id']} "
+             f"| {metric_str} | {ckpt['size_mb']:.1f} MB | {pareto_marker} |"
+         )
+
+     return "\n".join(lines)
+
+
+ def format_prune_report(prune_result: dict, stats_before: dict, stats_after: dict) -> str:
+     """Format pruning report."""
+     lines = [
+         "# Checkpoint Pruning",
+         "",
+         f"Before: {stats_before['total_count']} checkpoints, {stats_before.get('total_size_gb', 0):.1f} GB",
+     ]
+
+     if prune_result["dry_run"]:
+         lines.append(f"Would prune: {prune_result['pruned_count']} dominated checkpoints ({prune_result['mb_saved']:.1f} MB)")
+         lines.append("")
+         lines.append("*Dry run — no files deleted. Run without --dry-run to prune.*")
+     else:
+         lines.append(f"Pruned: {prune_result['pruned_count']} dominated checkpoints ({prune_result['mb_saved']:.1f} MB)")
+         lines.append(f"After: {stats_after['total_count']} checkpoints, {stats_after.get('total_size_gb', 0):.1f} GB")
+
+     if prune_result["pruned"]:
+         lines.extend(["", "## Removed", ""])
+         for p in prune_result["pruned"]:
+             lines.append(f"- {p['name']} ({p['experiment_id']}, {p['size_mb']:.1f} MB)")
+
+     return "\n".join(lines)
+
+
+ def format_stats_report(stats: dict, checkpoints: list[dict]) -> str:
+     """Format disk usage statistics."""
+     lines = [
+         "# Checkpoint Storage",
+         "",
+         f"- **Total checkpoints:** {stats['total_count']}",
+         f"- **Total size:** {stats.get('total_size_gb', 0):.2f} GB ({stats['total_size_mb']:.1f} MB)",
+         f"- **Average size:** {stats['avg_size_mb']:.1f} MB",
+     ]
+     if stats.get("largest"):
+         lines.append(f"- **Largest:** {stats['largest']} ({stats['largest_mb']:.1f} MB)")
+
+     # Size by model type
+     by_type: dict[str, list[dict]] = {}
+     for ckpt in checkpoints:
+         mt = ckpt.get("config", {}).get("model_type", "unknown")
+         by_type.setdefault(mt, []).append(ckpt)
+
+     if len(by_type) > 1:
+         lines.extend(["", "## By Model Type", ""])
+         for mt, ckpts in sorted(by_type.items()):
+             total_mb = sum(c["size_mb"] for c in ckpts)
+             lines.append(f"- **{mt}:** {len(ckpts)} checkpoints, {total_mb:.1f} MB")
+
+     return "\n".join(lines)
+
+
+ def save_checkpoint_report(report: dict, output_dir: str = "experiments/checkpoints") -> Path:
+     """Save checkpoint management report."""
+     out_path = Path(output_dir)
+     out_path.mkdir(parents=True, exist_ok=True)
+     filepath = out_path / "checkpoint-report.yaml"
+     with open(filepath, "w") as f:
+         yaml.dump(report, f, default_flow_style=False, sort_keys=False)
+     return filepath
+
+
+ def main() -> None:
+     """CLI entry point."""
+     parser = argparse.ArgumentParser(description="Smart checkpoint manager")
+     parser.add_argument(
+         "action",
+         choices=["list", "prune", "average", "resume", "stats"],
+         help="Action to perform",
+     )
+     parser.add_argument("exp_id", nargs="?", default=None, help="Experiment ID (for resume)")
+     parser.add_argument("--checkpoint-dir", default=DEFAULT_CHECKPOINT_DIR, help="Checkpoint directory")
+     parser.add_argument("--config", default="config.yaml", help="Path to config.yaml")
+     parser.add_argument("--log", default="experiments/log.jsonl", help="Path to experiment log")
+     parser.add_argument("--top", type=int, default=3, help="Top K for averaging")
+     parser.add_argument("--dry-run", action="store_true", help="Don't actually delete (prune)")
+     parser.add_argument("--json", action="store_true", help="Output raw JSON")
+     args = parser.parse_args()
+
+     config = load_config(args.config)
+     eval_cfg = config.get("evaluation", {})
+     primary_metric = eval_cfg.get("primary_metric", "accuracy")
+     lower_is_better = eval_cfg.get("lower_is_better", False)
+
+     experiments = load_experiments(args.log)
+     checkpoints = scan_checkpoints(args.checkpoint_dir)
+     checkpoints = enrich_checkpoints_from_log(checkpoints, experiments)
+
+     # Determine metric directions
+     lower_metrics = {"train_seconds", "latency", "mse", "rmse", "mae", "loss"}
+     metrics_to_check = [primary_metric]
+     if "train_seconds" in set().union(*(c["metrics"].keys() for c in checkpoints if c["metrics"])):
+         metrics_to_check.append("train_seconds")
+
+     directions = {}
+     for m in metrics_to_check:
+         if m == primary_metric:
+             directions[m] = "lower" if lower_is_better else "higher"
+         elif m in lower_metrics:
+             directions[m] = "lower"
+         else:
+             directions[m] = "higher"
+
+     pareto, dominated = compute_pareto_checkpoints(checkpoints, metrics_to_check, directions)
+     pareto_paths = {c["path"] for c in pareto}
+
+     if args.action == "list":
+         if args.json:
+             print(json.dumps(checkpoints, indent=2, default=str))
+         else:
+             print(format_checkpoint_list(checkpoints, pareto_paths, primary_metric))
+
+     elif args.action == "stats":
+         stats = compute_disk_stats(checkpoints)
+         if args.json:
+             print(json.dumps(stats, indent=2))
+         else:
+             print(format_stats_report(stats, checkpoints))
+
+     elif args.action == "prune":
+         stats_before = compute_disk_stats(checkpoints)
+         result = prune_dominated(checkpoints, dominated, dry_run=args.dry_run)
+
+         if not args.dry_run:
+             remaining = scan_checkpoints(args.checkpoint_dir)
+             stats_after = compute_disk_stats(remaining)
+         else:
+             stats_after = {
+                 "total_count": stats_before["total_count"] - result["pruned_count"],
+                 "total_size_mb": stats_before["total_size_mb"] - result["mb_saved"],
+                 "total_size_gb": round((stats_before["total_size_mb"] - result["mb_saved"]) / 1024, 2),
+             }
+
+         if args.json:
+             print(json.dumps({"prune": result, "before": stats_before, "after": stats_after}, indent=2))
+         else:
+             print(format_prune_report(result, stats_before, stats_after))
+
+     elif args.action == "average":
+         # Sort by primary metric and take top K
+         with_metric = [c for c in checkpoints if primary_metric in c["metrics"]]
+         with_metric.sort(
+             key=lambda c: c["metrics"][primary_metric],
+             reverse=not lower_is_better,
+         )
+         top_k = with_metric[:args.top]
+
+         if not top_k:
+             print("No checkpoints with metric data found.", file=sys.stderr)
+             sys.exit(1)
+
+         print(f"Top {len(top_k)} checkpoints for averaging:", file=sys.stderr)
+         for c in top_k:
+             print(f" {c['experiment_id']}: {primary_metric}={c['metrics'][primary_metric]:.4f}", file=sys.stderr)
+         print("\nNote: Weight averaging requires model-specific implementation.", file=sys.stderr)
+         print("The checkpoint paths are:", file=sys.stderr)
+         for c in top_k:
+             print(f" {c['path']}", file=sys.stderr)
+
+     elif args.action == "resume":
+         if not args.exp_id:
+             print("Specify experiment ID: checkpoint resume <exp-id>", file=sys.stderr)
+             sys.exit(1)
+
+         target = None
+         for c in checkpoints:
+             if c["experiment_id"] == args.exp_id:
+                 target = c
+                 break
+
+         if not target:
+             print(f"No checkpoint found for {args.exp_id}", file=sys.stderr)
+             available = [c["experiment_id"] for c in checkpoints]
+             if available:
+                 print(f"Available: {', '.join(available[:10])}", file=sys.stderr)
+             sys.exit(1)
+
+         print(f"Resume checkpoint: {target['path']}", file=sys.stderr)
+         print(f"Experiment: {target['experiment_id']}", file=sys.stderr)
+         if target["metrics"]:
+             print(f"Metrics: {target['metrics']}", file=sys.stderr)
+
+
+ if __name__ == "__main__":
+     main()
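
The `average` action stops at listing the top-K paths because weight averaging is model-specific. As an illustration only (not part of the package), a minimal sketch for PyTorch checkpoints, assuming each `.pt` file holds a plain `state_dict` with identical keys and shapes:

```python
# Sketch: uniform averaging of PyTorch state dicts (an assumption; the package
# itself does not implement averaging). Paths come from `checkpoint average`.
import torch


def average_state_dicts(paths: list[str]) -> dict:
    state_dicts = [torch.load(p, map_location="cpu") for p in paths]
    avg = {}
    for key in state_dicts[0]:
        # Stack the same tensor across checkpoints and take the element-wise mean
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg


# torch.save(average_state_dicts(top_k_paths), "experiments/checkpoints/averaged.pt")
```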
@@ -212,6 +212,23 @@ def detect_environment_drift(experiments: list[dict]) -> list[str]:
      return warnings


+ def load_profiles(profile_dir: str = "experiments/profiles") -> list[dict]:
+     """Load all profiling results from YAML files."""
+     path = Path(profile_dir)
+     if not path.exists():
+         return []
+     profiles = []
+     for f in sorted(path.glob("*-profile.yaml")):
+         try:
+             with open(f) as fh:
+                 profile = yaml.safe_load(fh)
+                 if profile and isinstance(profile, dict):
+                     profiles.append(profile)
+         except (yaml.YAMLError, OSError):
+             continue
+     return profiles
+
+
  def load_diagnoses(diag_dir: str = "experiments/diagnoses") -> list[dict]:
      """Load all diagnosis reports from YAML files."""
      path = Path(diag_dir)
@@ -278,6 +295,7 @@ def format_brief(
      seed_studies: list[dict] | None = None,
      reproductions: list[dict] | None = None,
      diagnoses: list[dict] | None = None,
+     profiles: list[dict] | None = None,
  ) -> str:
      """Format the research briefing as markdown."""
      direction = "lower" if lower_is_better else "higher"
@@ -454,6 +472,23 @@ def format_brief(
      if failed:
          lines.extend(["", f"*{len(failed)} experiment(s) failed reproducibility checks.*"])

+     # Profiles
+     if profiles:
+         lines.extend(["", "## Performance Profile", ""])
+         for prof in profiles[-1:]:  # Show most recent
+             exp_id = prof.get("experiment_id", "?")
+             p = prof.get("profile", {})
+             bn = prof.get("bottleneck", {})
+             lines.append(f"**{exp_id}:** {p.get('total_time_sec', 0):.1f}s total")
+             mem = p.get("memory", {})
+             if mem.get("peak_rss_mb"):
+                 lines.append(f"- Peak memory: {mem['peak_rss_mb']:.0f} MB")
+             if bn.get("type") and bn["type"] != "none_detected":
+                 lines.append(f"- Bottleneck: **{bn['type']}** ({bn.get('severity', 'unknown')})")
+             recs = prof.get("recommendations", [])
+             if recs:
+                 lines.append(f"- Top recommendation: {recs[0]}")
+
      # Diagnoses (error analysis)
      if diagnoses:
          lines.extend(["", "## Error Analysis", ""])
@@ -529,10 +564,11 @@ def generate_brief(
      cost_records = load_cost_data(log_path, metric)
      pareto = compute_pareto_frontier(cost_records, lower_is_better) if cost_records else []

-     # Load seed studies, reproduction reports, and diagnoses
+     # Load seed studies, reproduction reports, diagnoses, and profiles
      seed_studies = load_seed_studies()
      reproductions = load_reproductions()
      diagnoses = load_diagnoses()
+     profiles = load_profiles()

      return format_brief(
          campaign, best, trajectory, model_types, hypotheses,
@@ -542,6 +578,7 @@ def generate_brief(
          seed_studies=seed_studies if seed_studies else None,
          reproductions=reproductions if reproductions else None,
          diagnoses=diagnoses if diagnoses else None,
+         profiles=profiles if profiles else None,
      )


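For reference, the briefing section above only needs a handful of keys from each `*-profile.yaml`. A minimal sketch of the shape it reads, with invented values (the keys mirror what `format_brief` accesses):

```python
# Invented example of the profile dict load_profiles() yields; only these keys
# are read by the "Performance Profile" section of the brief.
example_profile = {
    "experiment_id": "exp-042",
    "profile": {
        "total_time_sec": 41.7,
        "memory": {"peak_rss_mb": 1843.0},
    },
    "bottleneck": {"type": "data_loading", "severity": "high"},
    "recommendations": [
        "Cache preprocessed data to disk — data loading dominates training time",
    ],
}
```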
@@ -0,0 +1,533 @@
+ #!/usr/bin/env python3
+ """Computational profiling for ML training runs.
+
+ Measures timing breakdown, memory usage, throughput, and identifies
+ bottlenecks. Maps bottleneck patterns to known fixes.
+
+ Usage:
+     python scripts/profile_training.py                   # Profile best config
+     python scripts/profile_training.py --exp-id exp-042  # Specific experiment
+     python scripts/profile_training.py --seed 7          # Different profiling seed
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import os
+ import re
+ import subprocess
+ import sys
+ import time
+ import tracemalloc
+ from datetime import datetime, timezone
+ from pathlib import Path
+
+ import yaml
+
+ from scripts.turing_io import load_config, load_experiments
+
+
+ # Known bottleneck patterns -> recommendation mappings
+ BOTTLENECK_RECOMMENDATIONS = {
+     "data_loading": [
+         "Cache preprocessed data to disk — data loading dominates training time",
+         "Use memory-mapped files or HDF5 for large datasets",
+         "Check for unnecessary file I/O in the data pipeline",
+     ],
+     "preprocessing": [
+         "Move preprocessing to a one-time step before training",
+         "Cache feature transformations (encoders, scalers) to avoid recomputation",
+         "Consider using vectorized operations instead of row-by-row processing",
+     ],
+     "model_training": [
+         "Reduce model complexity (fewer estimators, smaller depth)",
+         "Try a faster model type (LightGBM is typically faster than XGBoost)",
+         "Enable GPU acceleration if available and model supports it",
+     ],
+     "evaluation": [
+         "Reduce validation set size for intermediate checks",
+         "Evaluate less frequently (every N iterations instead of every iteration)",
+     ],
+     "memory": [
+         "Process data in batches to reduce peak memory",
+         "Use float32 instead of float64 for features",
+         "Release intermediate dataframes after use",
+     ],
+ }
+
+
+ def find_experiment(experiments: list[dict], exp_id: str | None, metric: str, lower_is_better: bool) -> dict | None:
+     """Find experiment by ID or return best kept."""
+     if exp_id:
+         for exp in experiments:
+             if exp.get("experiment_id") == exp_id:
+                 return exp
+         return None
+     best = None
+     best_val = float("inf") if lower_is_better else float("-inf")
+     for exp in experiments:
+         if exp.get("status") != "kept":
+             continue
+         val = exp.get("metrics", {}).get(metric)
+         if val is None:
+             continue
+         if (lower_is_better and val < best_val) or (not lower_is_better and val > best_val):
+             best_val = val
+             best = exp
+     return best
+
+
+ def get_system_memory_mb() -> float | None:
+     """Get current process RSS memory in MB from /proc/self/status."""
+     try:
+         with open("/proc/self/status") as f:
+             for line in f:
+                 if line.startswith("VmRSS:"):
+                     return int(line.split()[1]) / 1024  # kB -> MB
+     except (FileNotFoundError, ValueError, IndexError):
+         pass
+     return None
+
+
+ def get_peak_memory_mb() -> float | None:
+     """Get peak RSS memory in MB from /proc/self/status."""
+     try:
+         with open("/proc/self/status") as f:
+             for line in f:
+                 if line.startswith("VmPeak:"):
+                     return int(line.split()[1]) / 1024  # kB -> MB
+     except (FileNotFoundError, ValueError, IndexError):
+         pass
+     return None
+
+
+ def check_gpu_available() -> dict | None:
+     """Check for GPU availability and return info."""
+     # Try torch
+     try:
+         import torch
+         if torch.cuda.is_available():
+             return {
+                 "type": "cuda",
+                 "device": torch.cuda.get_device_name(0),
+                 "memory_total_mb": round(torch.cuda.get_device_properties(0).total_memory / 1024**2),
+             }
+     except ImportError:
+         pass
+     # Try nvidia-smi
+     try:
+         result = subprocess.run(
+             ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
+             capture_output=True, text=True, timeout=5,
+         )
+         if result.returncode == 0 and result.stdout.strip():
+             parts = result.stdout.strip().split(",")
+             return {
+                 "type": "cuda",
+                 "device": parts[0].strip(),
+                 "memory_total_mb": int(parts[1].strip()) if len(parts) > 1 else None,
+             }
+     except (FileNotFoundError, subprocess.TimeoutExpired):
+         pass
+     return None
+
+
+ def profile_training_run(seed: int = 42, timeout: int = 600) -> dict:
+     """Run train.py with profiling instrumentation.
+
+     Wraps the training run and captures timing, memory, and throughput.
+     """
+     # Start memory tracking (measures this profiler process; train.py runs in a subprocess)
+     tracemalloc.start()
+     mem_before = get_system_memory_mb()
+
+     start_time = time.perf_counter()
+
+     # Run training
+     cmd = ["python", "train.py", "--seed", str(seed)]
+     try:
+         proc = subprocess.run(
+             cmd, capture_output=True, text=True, timeout=timeout,
+         )
+     except subprocess.TimeoutExpired:
+         return {"error": f"Training timed out after {timeout}s"}
+
+     end_time = time.perf_counter()
+     total_time = end_time - start_time
+
+     # Memory snapshot
+     current, peak = tracemalloc.get_traced_memory()
+     tracemalloc.stop()
+     mem_after = get_system_memory_mb()
+     peak_rss = get_peak_memory_mb()
+
+     # Parse metrics from output
+     metrics = {}
+     timing_data = {}
+     in_block = False
+     for line in proc.stdout.splitlines():
+         line_stripped = line.strip()
+         if line_stripped == "---":
+             if in_block:
+                 break
+             in_block = True
+             continue
+         if in_block and ":" in line_stripped:
+             key, value = line_stripped.split(":", 1)
+             key = key.strip()
+             value = value.strip()
+             try:
+                 metrics[key] = float(value)
+             except ValueError:
+                 metrics[key] = value
+
+     # Extract timing if train.py reports it
+     train_seconds = metrics.get("train_seconds")
+     if isinstance(train_seconds, str):
+         try:
+             train_seconds = float(train_seconds)
+         except ValueError:
+             train_seconds = None
+
+     # Build profile
+     profile = {
+         "total_time_sec": round(total_time, 2),
+         "train_time_sec": round(float(train_seconds), 2) if train_seconds else round(total_time, 2),
+         "overhead_sec": round(total_time - float(train_seconds), 2) if train_seconds else 0,
+         "memory": {
+             "peak_rss_mb": round(peak_rss, 1) if peak_rss else None,
+             "python_peak_mb": round(peak / 1024**2, 1),
+             "rss_before_mb": round(mem_before, 1) if mem_before else None,
+             "rss_after_mb": round(mem_after, 1) if mem_after else None,
+         },
+         "metrics": metrics,
+         "returncode": proc.returncode,
+     }
+
+     # Check GPU
+     gpu_info = check_gpu_available()
+     if gpu_info:
+         profile["gpu"] = gpu_info
+         # Try to get GPU memory usage
+         try:
+             import torch
+             if torch.cuda.is_available():
+                 profile["memory"]["peak_gpu_mb"] = round(torch.cuda.max_memory_allocated() / 1024**2, 1)
+         except ImportError:
+             pass
+
+     return profile
+
+
+ def estimate_timing_breakdown(profile: dict) -> dict:
+     """Estimate timing breakdown from available data.
+
+     Since we can't instrument inside train.py without modifying it,
+     we estimate based on total time and known patterns.
+     """
+     total = profile.get("train_time_sec", profile.get("total_time_sec", 0))
+     if total <= 0:
+         return {}
+
+     train_secs = profile.get("metrics", {}).get("train_seconds")
+     if isinstance(train_secs, str):
+         try:
+             train_secs = float(train_secs)
+         except ValueError:
+             train_secs = None
+
+     overhead = profile.get("overhead_sec", 0)
+
+     breakdown = {
+         "total_sec": round(total, 2),
+         "overhead_sec": round(overhead, 2),
+         "overhead_pct": round(overhead / total * 100, 1) if total > 0 else 0,
+     }
+
+     if train_secs:
+         breakdown["training_sec"] = round(train_secs, 2)
+         breakdown["training_pct"] = round(train_secs / total * 100, 1)
+
+     return breakdown
+
+
+ def detect_bottleneck(profile: dict, breakdown: dict) -> dict:
+     """Identify the primary bottleneck from profiling data."""
+     bottlenecks = []
+
+     total = profile.get("total_time_sec", 0)
+     overhead = profile.get("overhead_sec", 0)
+
+     # Check if overhead (data loading + setup) is dominant
+     if total > 0 and overhead > 0:
+         overhead_pct = overhead / total * 100
+         if overhead_pct > 50:
+             bottlenecks.append({
+                 "type": "data_loading",
+                 "severity": "high",
+                 "description": f"Non-training overhead is {overhead_pct:.0f}% of total time ({overhead:.1f}s of {total:.1f}s)",
+                 "pct_of_total": round(overhead_pct, 1),
+             })
+
+     # Check memory pressure
+     peak_rss = profile.get("memory", {}).get("peak_rss_mb")
+     if peak_rss and peak_rss > 4096:
+         bottlenecks.append({
+             "type": "memory",
+             "severity": "medium" if peak_rss < 8192 else "high",
+             "description": f"Peak memory usage is {peak_rss:.0f} MB",
+             "pct_of_total": 0,
+         })
+
+     # Check GPU utilization
+     gpu = profile.get("gpu")
+     peak_gpu = profile.get("memory", {}).get("peak_gpu_mb")
+     if gpu and peak_gpu:
+         gpu_total = gpu.get("memory_total_mb", 0)
+         if gpu_total > 0:
+             util_pct = peak_gpu / gpu_total * 100
+             if util_pct < 30:
+                 bottlenecks.append({
+                     "type": "gpu_underutilized",
+                     "severity": "medium",
+                     "description": f"GPU memory utilization is only {util_pct:.0f}% — consider larger batch size",
+                     "pct_of_total": 0,
+                 })
+
+     # Check training time relative to expectations
+     train_secs = profile.get("metrics", {}).get("train_seconds")
+     if isinstance(train_secs, (int, float)) and train_secs > 300:
+         bottlenecks.append({
+             "type": "model_training",
+             "severity": "medium",
+             "description": f"Training takes {train_secs:.0f}s — consider model simplification",
+             "pct_of_total": 0,
+         })
+
+     if not bottlenecks:
+         return {
+             "type": "none_detected",
+             "severity": "low",
+             "description": "No significant bottleneck detected",
+             "pct_of_total": 0,
+         }
+
+     # Return most severe
+     severity_order = {"high": 3, "medium": 2, "low": 1}
+     bottlenecks.sort(key=lambda b: -severity_order.get(b["severity"], 0))
+     return bottlenecks[0]
+
+
+ def generate_recommendations(bottleneck: dict, profile: dict) -> list[str]:
+     """Generate actionable recommendations based on detected bottleneck."""
+     bt = bottleneck.get("type", "none_detected")
+     recs = BOTTLENECK_RECOMMENDATIONS.get(bt, [])
+
+     # Add context-specific recommendations
+     extra = []
+     peak_rss = profile.get("memory", {}).get("peak_rss_mb")
+     if peak_rss and peak_rss > 2048 and bt != "memory":
+         extra.append(f"Memory usage is {peak_rss:.0f} MB — monitor for OOM risk on smaller machines")
+
+     gpu = profile.get("gpu")
+     if not gpu:
+         extra.append("No GPU detected — GPU acceleration could significantly speed up training")
+
+     total = profile.get("total_time_sec", 0)
+     if total < 10:
+         extra.append("Training is very fast — profiling overhead may distort results")
+
+     return list(recs) + extra
+
+
+ def compute_throughput(profile: dict) -> dict:
+     """Compute throughput metrics from profile data."""
+     total_time = profile.get("train_time_sec", profile.get("total_time_sec", 0))
+     metrics = profile.get("metrics", {})
+
+     n_samples = metrics.get("n_samples") or metrics.get("n_train_samples")
+     if isinstance(n_samples, str):
+         try:
+             n_samples = int(n_samples)
+         except ValueError:
+             n_samples = None
+
+     throughput = {}
+     if n_samples and total_time > 0:
+         throughput["samples_per_sec"] = round(n_samples / total_time, 1)
+
+     return throughput
+
+
+ def save_profile(profile_data: dict, output_dir: str = "experiments/profiles") -> Path:
+     """Save profile results to YAML file."""
+     out_path = Path(output_dir)
+     out_path.mkdir(parents=True, exist_ok=True)
+     exp_id = profile_data.get("experiment_id", "unknown")
+     filepath = out_path / f"{exp_id}-profile.yaml"
+     with open(filepath, "w") as f:
+         yaml.dump(profile_data, f, default_flow_style=False, sort_keys=False)
+     return filepath
+
+
+ def format_profile_report(profile_data: dict) -> str:
+     """Format profile results as human-readable markdown."""
+     if "error" in profile_data:
+         return f"ERROR: {profile_data['error']}"
+
+     exp_id = profile_data.get("experiment_id", "?")
+     profile = profile_data.get("profile", {})
+     breakdown = profile_data.get("breakdown", {})
+     bottleneck = profile_data.get("bottleneck", {})
+     recommendations = profile_data.get("recommendations", [])
+     throughput = profile_data.get("throughput", {})
+
+     lines = [
+         f"# Training Profile: {exp_id}",
+         "",
+     ]
+
+     # Timing
+     lines.extend([
+         "## Timing",
+         "",
+         f"- **Total time:** {profile.get('total_time_sec', 0):.1f}s",
+     ])
+     if breakdown.get("training_sec"):
+         lines.append(f"- **Training:** {breakdown['training_sec']:.1f}s ({breakdown.get('training_pct', 0):.0f}%)")
+     if breakdown.get("overhead_sec", 0) > 0:
+         lines.append(f"- **Overhead:** {breakdown['overhead_sec']:.1f}s ({breakdown.get('overhead_pct', 0):.0f}%)")
+
+     # Memory
+     mem = profile.get("memory", {})
+     lines.extend(["", "## Memory", ""])
+     if mem.get("peak_rss_mb"):
+         lines.append(f"- **Peak RSS:** {mem['peak_rss_mb']:.0f} MB")
+     lines.append(f"- **Python peak:** {mem.get('python_peak_mb', 0):.1f} MB")
+     if mem.get("peak_gpu_mb"):
+         lines.append(f"- **GPU peak:** {mem['peak_gpu_mb']:.0f} MB")
+
+     # GPU
+     gpu = profile.get("gpu")
+     if gpu:
+         lines.extend(["", "## GPU", ""])
+         lines.append(f"- **Device:** {gpu.get('device', 'unknown')}")
+         if gpu.get("memory_total_mb"):
+             lines.append(f"- **Total VRAM:** {gpu['memory_total_mb']} MB")
+
+     # Throughput
+     if throughput:
+         lines.extend(["", "## Throughput", ""])
+         if throughput.get("samples_per_sec"):
+             lines.append(f"- **Samples/sec:** {throughput['samples_per_sec']:.1f}")
+
+     # Bottleneck
+     lines.extend([
+         "",
+         "## Bottleneck",
+         "",
+         f"**{bottleneck.get('type', 'none')}** ({bottleneck.get('severity', 'low')})",
+         "",
+         bottleneck.get("description", "No bottleneck detected."),
+     ])
+
+     # Recommendations
+     if recommendations:
+         lines.extend(["", "## Recommendations", ""])
+         for rec in recommendations:
+             lines.append(f"- {rec}")
+
+     return "\n".join(lines)
+
+
+ def run_profile(
+     exp_id: str | None = None,
+     config_path: str = "config.yaml",
+     log_path: str = "experiments/log.jsonl",
+     seed: int = 42,
+     timeout: int = 600,
+ ) -> dict:
+     """Run a complete training profile.
+
+     Args:
+         exp_id: Experiment ID (defaults to best).
+         config_path: Path to config.yaml.
+         log_path: Path to experiment log.
+         seed: Random seed for the profiling run.
+         timeout: Training timeout in seconds.
+
+     Returns:
+         Complete profile result dict.
+     """
+     config = load_config(config_path)
+     eval_cfg = config.get("evaluation", {})
+     primary_metric = eval_cfg.get("primary_metric", "accuracy")
+     lower_is_better = eval_cfg.get("lower_is_better", False)
+
+     experiments = load_experiments(log_path)
+     target_exp = find_experiment(experiments, exp_id, primary_metric, lower_is_better)
+
+     if not target_exp:
+         return {"error": f"No experiment found{f' with ID {exp_id}' if exp_id else ''}"}
+
+     target_id = target_exp.get("experiment_id", "unknown")
+
+     print(f"Profiling {target_id}...", file=sys.stderr)
+
+     # Run profiled training
+     profile = profile_training_run(seed=seed, timeout=timeout)
+
+     if "error" in profile:
+         return {"error": profile["error"], "experiment_id": target_id}
+
+     # Analyze results
+     breakdown = estimate_timing_breakdown(profile)
+     bottleneck = detect_bottleneck(profile, breakdown)
+     recommendations = generate_recommendations(bottleneck, profile)
+     throughput = compute_throughput(profile)
+
+     result = {
+         "experiment_id": target_id,
+         "timestamp": datetime.now(timezone.utc).isoformat(),
+         "profile": profile,
+         "breakdown": breakdown,
+         "bottleneck": bottleneck,
+         "recommendations": recommendations,
+         "throughput": throughput,
+     }
+
+     return result
+
+
+ def main() -> None:
+     """CLI entry point."""
+     parser = argparse.ArgumentParser(description="Computational profiling for ML training")
+     parser.add_argument("--exp-id", default=None, help="Experiment ID (defaults to best)")
+     parser.add_argument("--config", default="config.yaml", help="Path to config.yaml")
+     parser.add_argument("--log", default="experiments/log.jsonl", help="Path to experiment log")
+     parser.add_argument("--seed", type=int, default=42, help="Random seed")
+     parser.add_argument("--timeout", type=int, default=600, help="Training timeout in seconds")
+     parser.add_argument("--json", action="store_true", help="Output raw JSON")
+     args = parser.parse_args()
+
+     result = run_profile(
+         exp_id=args.exp_id,
+         config_path=args.config,
+         log_path=args.log,
+         seed=args.seed,
+         timeout=args.timeout,
+     )
+
+     if "error" not in result:
+         filepath = save_profile(result)
+         print(f"Saved to {filepath}", file=sys.stderr)
+
+     if args.json:
+         print(json.dumps(result, indent=2, default=str))
+     else:
+         print(format_profile_report(result))
+
+
+ if __name__ == "__main__":
+     main()
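
A quick way to sanity-check the bottleneck heuristics without running training is to feed `detect_bottleneck` a synthetic profile. A minimal sketch, assuming the project root is on `sys.path` so the `scripts` imports resolve (the same assumption the script itself makes); the numbers are invented to trip the >50% overhead rule:

```python
# Synthetic profile: 65s of non-training overhead out of 100s total should be
# flagged as a high-severity data_loading bottleneck by detect_bottleneck.
from scripts.profile_training import detect_bottleneck, estimate_timing_breakdown

profile = {
    "total_time_sec": 100.0,
    "train_time_sec": 35.0,
    "overhead_sec": 65.0,
    "memory": {"peak_rss_mb": 900.0},
    "metrics": {"train_seconds": 35.0},
}
breakdown = estimate_timing_breakdown(profile)
print(detect_bottleneck(profile, breakdown))
# expected: {'type': 'data_loading', 'severity': 'high', ...}
```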
@@ -95,6 +95,8 @@ TEMPLATE_DIRS = {
      "diagnose_errors.py",
      "ablation_study.py",
      "pareto_frontier.py",
+     "profile_training.py",
+     "checkpoint_manager.py",
    ],
    "tests": ["__init__.py", "conftest.py"],
  }
@@ -108,6 +110,8 @@ DIRECTORIES_TO_CREATE = [
    "experiments/ablations",
    "experiments/frontiers",
    "experiments/predictions",
+   "experiments/profiles",
+   "experiments/checkpoints",
    "models/best",
    "models/archive",
  ]