claude-turing 3.3.0 → 3.4.0

@@ -1,7 +1,7 @@
  {
    "name": "turing",
-   "version": "3.3.0",
-   "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 49 commands, 2 specialized agents, feature & training intelligence (feature selection + curriculum optimization), model debugging (xray + sensitivity + calibration), pre-training intelligence (sanity checks + baseline generation + leakage detection), meta-intelligence (cross-project knowledge transfer + methodology audit), scaling & efficiency (scaling laws + compute budget + model distillation), model composition (ensemble + pipeline stitch + warm-start), deep analysis (experiment diff + live training monitor + regression gate), experiment orchestration (batch queue + smart retry + branching), literature integration + paper drafting, production model export, performance profiling, smart checkpoints, experiment intelligence, statistical rigor, tree-search hypothesis exploration, cost-performance frontier, model cards, model registry, hypothesis database with novelty guard, anti-cheating guardrails, and the taste-leverage loop. Inspired by Karpathy's autoresearch and the scientific method itself.",
+   "version": "3.4.0",
+   "description": "Autonomous ML research harness — the autoresearch loop as a formal protocol. 53 commands, 2 specialized agents, model surgery (pruning + quantization + merging + architecture modification), feature & training intelligence (feature selection + curriculum optimization), model debugging (xray + sensitivity + calibration), pre-training intelligence (sanity checks + baseline generation + leakage detection), meta-intelligence (cross-project knowledge transfer + methodology audit), scaling & efficiency (scaling laws + compute budget + model distillation), model composition (ensemble + pipeline stitch + warm-start), deep analysis (experiment diff + live training monitor + regression gate), experiment orchestration (batch queue + smart retry + branching), literature integration + paper drafting, production model export, performance profiling, smart checkpoints, experiment intelligence, statistical rigor, tree-search hypothesis exploration, cost-performance frontier, model cards, model registry, hypothesis database with novelty guard, anti-cheating guardrails, and the taste-leverage loop. Inspired by Karpathy's autoresearch and the scientific method itself.",
    "author": {
      "name": "pragnition"
    },
package/README.md CHANGED
@@ -360,6 +360,10 @@ The index (`hypotheses.yaml`) is the lightweight queue. The detail files (`hypot
  | `/turing:calibrate [exp-id]` | Probability calibration — ECE/MCE, reliability diagrams, Platt/isotonic/temperature scaling |
  | `/turing:feature [--method]` | Automated feature selection — multi-method consensus ranking, redundancy, interactions |
  | `/turing:curriculum [exp-id]` | Training curriculum optimization — difficulty scoring, strategy comparison, mislabeled sample detection |
+ | `/turing:prune <exp-id>` | Weight pruning — magnitude/structured/lottery, sparsity sweep, knee point detection |
+ | `/turing:quantize <exp-id>` | Post-training quantization — FP16/INT8, accuracy-latency comparison |
+ | `/turing:merge <exp-ids...>` | Model merging — uniform/greedy soup, TIES, DARE, zero latency cost |
+ | `/turing:surgery <exp-id>` | Architecture modification — add/remove layer, widen/narrow, swap activation |
 
  And for fully hands-off operation:
 
@@ -544,11 +548,11 @@ Each project gets independent config, data, experiments, models, and agent memor
 
  ## Architecture of Turing Itself
 
- 49 commands, 2 agents, 10 config files, 68 template scripts, model registry, artifact contract, cost-performance frontier, model cards, tree-search exploration, statistical rigor, experiment intelligence, performance profiling, smart checkpoints, production model export, literature integration, paper section drafting, experiment orchestration (queue + retry + fork), deep analysis (diff + watch + regress), model composition (ensemble + stitch + warm), scaling & efficiency (scale + budget + distill), meta-intelligence (transfer + audit), pre-training intelligence (sanity + baseline + leak), model debugging (xray + sensitivity + calibrate), feature & training intelligence (feature + curriculum), 16 ADRs. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full codemap.
+ 53 commands, 2 agents, 10 config files, 72 template scripts, model registry, artifact contract, cost-performance frontier, model cards, tree-search exploration, statistical rigor, experiment intelligence, performance profiling, smart checkpoints, production model export, literature integration, paper section drafting, experiment orchestration (queue + retry + fork), deep analysis (diff + watch + regress), model composition (ensemble + stitch + warm), scaling & efficiency (scale + budget + distill), meta-intelligence (transfer + audit), pre-training intelligence (sanity + baseline + leak), model debugging (xray + sensitivity + calibrate), feature & training intelligence (feature + curriculum), model surgery (prune + quantize + merge + surgery), 16 ADRs. See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for the full codemap.
 
  ```
  turing/
- ├── commands/ 48 skill files (core + taste-leverage + reporting + exploration + statistical rigor + experiment intelligence + performance + deployment + research workflow + orchestration + deep analysis + model composition + scaling & efficiency + meta-intelligence + pre-training intelligence + model debugging + feature & training intelligence)
+ ├── commands/ 52 skill files (core + taste-leverage + reporting + exploration + statistical rigor + experiment intelligence + performance + deployment + research workflow + orchestration + deep analysis + model composition + scaling & efficiency + meta-intelligence + pre-training intelligence + model debugging + feature & training intelligence + model surgery)
  ├── agents/ 2 agents (researcher: read/write, evaluator: read-only)
  ├── config/ 8 files (lifecycle, taxonomy, archetypes, novelty aliases)
  ├── templates/ Scaffolded into user projects by /turing:init
@@ -0,0 +1,24 @@
+ ---
+ name: merge
+ description: Model merging — average weights from multiple checkpoints into a single model (soups, TIES, DARE). Free accuracy, zero latency cost.
+ disable-model-invocation: true
+ argument-hint: "<exp-ids...> [--method uniform|greedy|ties|dare]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Combine model weights (not predictions) into a single, better model with no latency overhead.
+
+ ## Steps
+
+ 1. **Activate environment:** `source .venv/bin/activate`
+ 2. **Run:** `python scripts/model_merger.py $ARGUMENTS`
+ 3. **Methods:** uniform soup (simple average), greedy soup (include only if improves), TIES (trim+elect+merge), DARE (drop+rescale)
+ 4. **Report:** compatibility check, per-model metrics, method comparison, improvement delta
+ 5. **Saved output:** `experiments/merges/merge-*.yaml`
+
+ ## Examples
+
+ ```
+ /turing:merge exp-042 exp-053 exp-067 # All methods
+ /turing:merge exp-042 exp-053 --method greedy # Greedy soup only
+ ```
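To make the uniform and greedy soup methods from the steps above concrete, here is a minimal pure-Python sketch. It is not the code in `scripts/model_merger.py`: the flat `name -> list of floats` checkpoint format, the function names, and the toy `evaluate` scorer are all assumptions made for illustration.

```python
from typing import Callable

# Hypothetical flat checkpoint format: parameter name -> list of values.
Weights = dict[str, list[float]]


def uniform_soup(checkpoints: list[Weights]) -> Weights:
    """Average every parameter element-wise across all checkpoints."""
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(c[name] for c in checkpoints))]
        for name in checkpoints[0]
    }


def greedy_soup(
    checkpoints: list[Weights],
    evaluate: Callable[[Weights], float],
) -> tuple[Weights, list[int]]:
    """Add checkpoints best-first; keep one only if the soup's score does not drop."""
    order = sorted(range(len(checkpoints)), key=lambda i: evaluate(checkpoints[i]), reverse=True)
    kept = [order[0]]
    best = evaluate(checkpoints[order[0]])
    for i in order[1:]:
        score = evaluate(uniform_soup([checkpoints[j] for j in kept + [i]]))
        if score >= best:
            kept.append(i)
            best = score
    return uniform_soup([checkpoints[j] for j in kept]), kept


# Toy scorer (assumption): higher is better, peak at w == [1.0, 2.0].
def evaluate(w: Weights) -> float:
    return -sum((x - t) ** 2 for x, t in zip(w["w"], [1.0, 2.0]))


a, b, c = {"w": [0.5, 2.5]}, {"w": [1.5, 1.5]}, {"w": [5.0, 5.0]}
soup, kept = greedy_soup([a, b, c], evaluate)
print(soup, kept)  # the outlier checkpoint c is left out of the soup
```

In a real run `evaluate` would score the merged weights on a held-out validation set, which is why a greedy soup costs one evaluation per candidate while a uniform soup needs none.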
@@ -0,0 +1,26 @@
+ ---
+ name: prune
+ description: Weight pruning — measure accuracy at different sparsity levels, find the knee point, produce a smaller/faster model.
+ disable-model-invocation: true
+ argument-hint: "<exp-id> [--sparsity 0.5,0.75,0.9] [--method magnitude|structured|lottery]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Remove redundant weights for faster inference and smaller models.
+
+ ## Steps
+
+ 1. **Activate environment:** `source .venv/bin/activate`
+ 2. **Run:** `python scripts/model_pruning.py $ARGUMENTS`
+ 3. **Methods:** magnitude (zero small weights), structured (remove neurons), lottery (iterative with rewind)
+ 4. **For tree models:** progressively reduces n_estimators
+ 5. **Report:** sparsity sweep table, knee point, recommended sparsity
+ 6. **Saved output:** `experiments/pruning/<exp-id>-pruning.yaml`
+
+ ## Examples
+
+ ```
+ /turing:prune exp-042 # Default: magnitude, 5 levels
+ /turing:prune exp-042 --method structured # Remove entire neurons
+ /turing:prune exp-042 --sparsity 0.5,0.75,0.9 # Custom levels
+ ```
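The magnitude method and the knee point named in the steps above can be sketched as follows. `scripts/model_pruning.py` is not shown in this diff, so the knee rule used here (the highest sparsity whose metric stays within a tolerance of the dense baseline) is an assumption, as are the function names and the sample numbers.

```python
def magnitude_prune(weights: list[float], sparsity: float) -> list[float]:
    """Zero the `sparsity` fraction of weights with the smallest absolute value."""
    k = int(len(weights) * sparsity)  # how many weights to zero
    drop = set(sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def knee_point(sweep: list[tuple[float, float]], tolerance: float = 0.01) -> float:
    """Highest sparsity whose metric stays within `tolerance` of the dense baseline.

    `sweep` holds (sparsity, metric) pairs; the first entry is the unpruned baseline.
    """
    baseline = sweep[0][1]
    return max(s for s, metric in sweep if baseline - metric <= tolerance)


weights = [0.05, -0.9, 0.3, -0.01, 0.7, 0.2, -0.4, 0.02]
pruned = magnitude_prune(weights, 0.5)

# Made-up (sparsity, accuracy) pairs purely to exercise knee_point; not measured results.
sweep = [(0.0, 0.912), (0.5, 0.910), (0.75, 0.905), (0.9, 0.861)]
print(pruned, knee_point(sweep))
```

Zeroed weights only speed up inference when the runtime exploits sparsity, which is one reason a structured method that removes whole neurons can be preferable in practice.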
@@ -0,0 +1,24 @@
+ ---
+ name: quantize
+ description: Post-training quantization — FP32→INT8/FP16, measure accuracy loss, 2-4x speedup with <0.5% accuracy loss.
+ disable-model-invocation: true
+ argument-hint: "<exp-id> [--precision int8|fp16|dynamic]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Quantize for production. Lowest-effort optimization: 2-4x speedup, 2-4x memory reduction.
+
+ ## Steps
+
+ 1. **Activate environment:** `source .venv/bin/activate`
+ 2. **Run:** `python scripts/model_quantization.py $ARGUMENTS`
+ 3. **Precision levels:** FP32 (baseline), FP16 (GPU), INT8 dynamic (simplest), INT8 static (best accuracy)
+ 4. **Report:** precision comparison table, recommended level, QAT suggestion if needed
+ 5. **Saved output:** `experiments/quantization/<exp-id>-quantization.yaml`
+
+ ## Examples
+
+ ```
+ /turing:quantize exp-042 # Compare all precision levels
+ /turing:quantize exp-042 --precision int8 # INT8 specifically
+ ```
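As a rough illustration of what the FP32 to INT8 step involves, the sketch below applies symmetric per-tensor quantization to a list of floats and measures the round-trip error. It is a simplified stand-in, not the contents of `scripts/model_quantization.py`; the function names and sample values are assumptions.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Map the integers back to approximate float values."""
    return [x * scale for x in q]


values = [0.82, -1.5, 0.03, 0.4, -0.27]
q, scale = quantize_int8(values)
restored = dequantize(q, scale)

# Rounding to the nearest step bounds the error by half a step (scale / 2).
max_error = max(abs(v - r) for v, r in zip(values, restored))
print(q, max_error)
```

Each value shrinks from four bytes to one, which is where the upper end of the memory reduction comes from; whether the accuracy loss is acceptable still has to be measured per model, as the report step does.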
@@ -0,0 +1,27 @@
+ ---
+ name: surgery
+ description: Architecture modification — add/remove layers, widen/narrow, swap activations, inject skip connections. Specify what to change, system handles how.
+ disable-model-invocation: true
+ argument-hint: "<exp-id> --op <operation> [args...]"
+ allowed-tools: Read, Bash(*), Grep, Glob
+ ---
+
+ Programmatic architecture changes with auto warm-start from existing weights.
+
+ ## Steps
+
+ 1. **Activate environment:** `source .venv/bin/activate`
+ 2. **Run:** `python scripts/architecture_surgery.py $ARGUMENTS`
+ 3. **Operations:** add-layer, remove-layer, widen, narrow, swap-activation, add-skip, add-norm, deepen, swap-objective
+ 4. **For tree models:** deepen (increase max_depth), widen (more estimators), swap-objective
+ 5. **Report:** operation details, config changes, parameter count delta, warm-start source
+ 6. **Saved output:** `experiments/surgery/<exp-id>-<op>.yaml`
+
+ ## Examples
+
+ ```
+ /turing:surgery exp-042 --op widen 2 # 2x wider hidden layers
+ /turing:surgery exp-042 --op add-layer # Insert a layer
+ /turing:surgery exp-042 --op swap-activation relu gelu # ReLU → GELU
+ /turing:surgery exp-042 --op deepen # Deeper trees
+ ```
@@ -58,6 +58,10 @@ You are the Turing ML research router. Detect the user's intent and route to the
  | "calibrate", "calibration", "ECE", "reliability diagram", "overconfident", "probability calibration" | `/turing:calibrate` | Analyze |
  | "feature", "features", "feature selection", "feature importance", "which features matter", "redundant features" | `/turing:feature` | Analyze |
  | "curriculum", "training order", "easy to hard", "data ordering", "curriculum learning" | `/turing:curriculum` | Optimize |
+ | "prune", "pruning", "sparsity", "remove weights", "smaller model", "weight pruning" | `/turing:prune` | Optimize |
+ | "quantize", "quantization", "int8", "fp16", "reduce precision", "faster inference" | `/turing:quantize` | Optimize |
+ | "merge", "model soup", "merge weights", "average models", "TIES", "DARE" | `/turing:merge` | Compose |
+ | "surgery", "architecture", "add layer", "widen", "modify model", "swap activation" | `/turing:surgery` | Modify |
 
  ## Sub-commands
 
@@ -112,6 +116,10 @@ You are the Turing ML research router. Detect the user's intent and route to the
  | `/turing:calibrate [exp-id]` | Probability calibration: ECE/MCE, reliability diagrams, Platt/isotonic/temperature scaling | (inline) |
  | `/turing:feature [--method]` | Automated feature selection: multi-method consensus ranking, redundancy, interaction generation | (inline) |
  | `/turing:curriculum [exp-id]` | Training curriculum optimization: difficulty scoring, strategy comparison, impossible sample detection | (inline) |
+ | `/turing:prune <exp-id>` | Weight pruning: magnitude/structured/lottery, sparsity sweep, knee point detection | (inline) |
+ | `/turing:quantize <exp-id>` | Post-training quantization: FP16/INT8, accuracy-latency comparison, QAT suggestion | (inline) |
+ | `/turing:merge <exp-ids...>` | Model merging: uniform/greedy soup, TIES, DARE — free accuracy, zero latency cost | (inline) |
+ | `/turing:surgery <exp-id>` | Architecture modification: add/remove layer, widen/narrow, swap activation, skip connections | (inline) |
 
  ## Proactive Detection
 
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "claude-turing",
-   "version": "3.3.0",
+   "version": "3.4.0",
    "type": "module",
    "description": "Autonomous ML research harness for Claude Code. The autoresearch loop as a formal protocol — iteratively trains, evaluates, and improves ML models with structured experiment tracking, convergence detection, immutable evaluation infrastructure, and safety guardrails.",
    "bin": {
package/src/install.js CHANGED
@@ -32,6 +32,7 @@ const SUB_COMMANDS = [
    "sanity", "baseline", "leak",
    "xray", "sensitivity", "calibrate",
    "feature", "curriculum",
+   "prune", "quantize", "merge", "surgery",
  ];
 
  export async function install(opts = {}) {
package/src/verify.js CHANGED
@@ -63,6 +63,10 @@ const EXPECTED_COMMANDS = [
    "calibrate/SKILL.md",
    "feature/SKILL.md",
    "curriculum/SKILL.md",
+   "prune/SKILL.md",
+   "quantize/SKILL.md",
+   "merge/SKILL.md",
+   "surgery/SKILL.md",
  ];
 
  const EXPECTED_AGENTS = ["ml-researcher.md", "ml-evaluator.md"];
@@ -0,0 +1,238 @@
+ #!/usr/bin/env python3
+ """Architecture modification for the autoresearch pipeline.
+
+ Programmatic architecture changes: add/remove layers, widen/narrow,
+ swap activation functions, inject skip connections, change normalization.
+ Produces a modified config and instructions for the modified experiment.
+
+ Usage:
+     python scripts/architecture_surgery.py exp-042 --op widen 2
+     python scripts/architecture_surgery.py exp-042 --op add-layer
+     python scripts/architecture_surgery.py exp-042 --op swap-activation relu gelu
+     python scripts/architecture_surgery.py exp-042 --op add-skip --json
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import math
+ import sys
+ from datetime import datetime, timezone
+ from pathlib import Path
+
+ import yaml
+
+ from scripts.turing_io import load_config, load_experiments
+
+ DEFAULT_LOG_PATH = "experiments/log.jsonl"
+ OPERATIONS = ["add-layer", "remove-layer", "widen", "narrow", "swap-activation",
+               "add-skip", "add-norm", "deepen", "swap-objective"]
+
+
+ def plan_operation(
+     operation: str,
+     config: dict,
+     hyperparams: dict,
+     model_type: str,
+     args: list[str] | None = None,
+ ) -> dict:
+     """Plan an architecture modification.
+
+     Returns a plan dict with new config, parameter count change, and instructions.
+     """
+     args = args or []
+     plan = {
+         "operation": operation,
+         "model_type": model_type,
+         "original_config": hyperparams.copy(),
+         "new_config": hyperparams.copy(),
+         "instructions": [],
+         "param_change": None,
+     }
+
+     is_tree = any(t in model_type.lower() for t in ("xgboost", "lightgbm", "forest", "gbm", "catboost"))
+     is_neural = any(t in model_type.lower() for t in ("mlp", "nn", "pytorch", "tensorflow", "transformer"))
+
+     if operation == "widen":
+         factor = float(args[0]) if args else 2.0
+         if is_neural:
+             hs = hyperparams.get("hidden_size", 256)
+             new_hs = int(hs * factor)
+             plan["new_config"]["hidden_size"] = new_hs
+             plan["instructions"].append(f"Multiply hidden dimensions: {hs} → {new_hs} ({factor}x)")
+             plan["param_change"] = f"+{(factor**2 - 1)*100:.0f}% parameters (quadratic in width)"
+         elif is_tree:
+             n = hyperparams.get("n_estimators", 100)
+             new_n = int(n * factor)
+             plan["new_config"]["n_estimators"] = new_n
+             plan["instructions"].append(f"Increase estimators: {n} → {new_n}")
+             plan["param_change"] = f"+{(factor - 1)*100:.0f}% trees"
+         else:
+             plan["instructions"].append(f"Widen by {factor}x — adjust model-specific width parameters")
+
+     elif operation == "narrow":
+         factor = float(args[0]) if args else 0.5
+         if is_neural:
+             hs = hyperparams.get("hidden_size", 256)
+             new_hs = max(8, int(hs * factor))
+             plan["new_config"]["hidden_size"] = new_hs
+             plan["instructions"].append(f"Reduce hidden dimensions: {hs} → {new_hs} ({factor}x)")
+         elif is_tree:
+             n = hyperparams.get("n_estimators", 100)
+             new_n = max(1, int(n * factor))
+             plan["new_config"]["n_estimators"] = new_n
+             plan["instructions"].append(f"Reduce estimators: {n} → {new_n}")
+
+     elif operation == "add-layer":
+         if is_neural:
+             n_layers = hyperparams.get("n_layers", hyperparams.get("layers", 3))
+             plan["new_config"]["n_layers"] = n_layers + 1
+             plan["instructions"].extend([
+                 f"Add layer: {n_layers} → {n_layers + 1}",
+                 "New layer initialized with default weights",
+                 "Auto warm-start: existing layers loaded from source",
+             ])
+             plan["param_change"] = f"+1 layer ({n_layers} → {n_layers + 1})"
+         else:
+             plan["instructions"].append("add-layer not applicable for non-neural models")
+
+     elif operation == "remove-layer":
+         if is_neural:
+             n_layers = hyperparams.get("n_layers", hyperparams.get("layers", 3))
+             if n_layers > 1:
+                 plan["new_config"]["n_layers"] = n_layers - 1
+                 plan["instructions"].append(f"Remove layer: {n_layers} → {n_layers - 1}")
+             else:
+                 plan["instructions"].append("Cannot remove — only 1 layer remaining")
+         else:
+             plan["instructions"].append("remove-layer not applicable for non-neural models")
+
+     elif operation == "deepen":
+         if is_tree:
+             depth = hyperparams.get("max_depth", 6)
+             new_depth = depth + 2
+             plan["new_config"]["max_depth"] = new_depth
+             plan["instructions"].append(f"Increase max depth: {depth} → {new_depth}")
+         elif is_neural:
+             n_layers = hyperparams.get("n_layers", 3)
+             plan["new_config"]["n_layers"] = n_layers + 2
+             plan["instructions"].append(f"Add 2 layers: {n_layers} → {n_layers + 2}")
+
+     elif operation == "swap-activation":
+         if len(args) >= 2:
+             from_act, to_act = args[0], args[1]
+         else:
+             from_act, to_act = "relu", "gelu"
+         plan["new_config"]["activation"] = to_act
+         plan["instructions"].append(f"Swap activation: {from_act} → {to_act}")
+
+     elif operation == "add-skip":
+         plan["new_config"]["skip_connections"] = True
+         plan["instructions"].append("Inject residual/skip connections between layers")
+
+     elif operation == "add-norm":
+         norm_type = args[0] if args else "batch_norm"
+         plan["new_config"]["normalization"] = norm_type
+         plan["instructions"].append(f"Add {norm_type} after each layer")
+
+     elif operation == "swap-objective":
+         if len(args) >= 2:
+             from_obj, to_obj = args[0], args[1]
+         else:
+             from_obj, to_obj = hyperparams.get("objective", "logloss"), "focal"
+         plan["new_config"]["objective"] = to_obj
+         plan["instructions"].append(f"Swap objective: {from_obj} → {to_obj}")
+
+     else:
+         plan["instructions"].append(f"Unknown operation: {operation}")
+
+     return plan
+
+
+ def surgery_report(
+     exp_id: str,
+     operation: str,
+     op_args: list[str] | None = None,
+     config_path: str = "config.yaml",
+     log_path: str = DEFAULT_LOG_PATH,
+ ) -> dict:
+     """Generate a surgery report."""
+     experiments = load_experiments(log_path)
+     exp = next((e for e in experiments if e.get("experiment_id") == exp_id), None)
+
+     if not exp:
+         return {"error": f"Experiment {exp_id} not found"}
+
+     config = exp.get("config", {})
+     model_type = config.get("model_type", "unknown")
+     hyperparams = config.get("hyperparams", {})
+
+     plan = plan_operation(operation, config, hyperparams, model_type, op_args)
+
+     return {
+         "generated_at": datetime.now(timezone.utc).isoformat(),
+         "experiment_id": exp_id,
+         "plan": plan,
+         "warm_start_from": exp_id,
+     }
+
+
+ def save_surgery_report(report: dict, output_dir: str = "experiments/surgery") -> Path:
+     out = Path(output_dir); out.mkdir(parents=True, exist_ok=True)
+     exp_id = report.get("experiment_id", "unknown")
+     op = report.get("plan", {}).get("operation", "unknown")
+     fp = out / f"{exp_id}-{op}.yaml"
+     with open(fp, "w") as f: yaml.dump(report, f, default_flow_style=False, sort_keys=False)
+     return fp
+
+
+ def format_surgery_report(report: dict) -> str:
+     if "error" in report: return f"ERROR: {report['error']}"
+
+     plan = report.get("plan", {})
+     exp_id = report.get("experiment_id", "?")
+     op = plan.get("operation", "?")
+
+     lines = [f"# Surgery: {op} ({exp_id})", "",
+              f"*Generated {report.get('generated_at', 'N/A')[:19]}*",
+              f"**Model:** {plan.get('model_type', '?')}", ""]
+
+     lines.extend(["## Instructions", ""])
+     for i, inst in enumerate(plan.get("instructions", []), 1):
+         lines.append(f"{i}. {inst}")
+     lines.append("")
+
+     if plan.get("param_change"):
+         lines.append(f"**Parameter change:** {plan['param_change']}")
+         lines.append("")
+
+     orig = plan.get("original_config", {})
+     new = plan.get("new_config", {})
+     changed = {k: (orig.get(k), new[k]) for k in new if orig.get(k) != new.get(k)}
+     if changed:
+         lines.extend(["## Config Changes", ""])
+         for k, (old, new_v) in changed.items():
+             lines.append(f"- `{k}`: {old} → {new_v}")
+         lines.append("")
+
+     lines.append(f"**Warm-start from:** {report.get('warm_start_from', '?')}")
+
+     return "\n".join(lines)
+
+
+ def main() -> None:
+     parser = argparse.ArgumentParser(description="Architecture modification")
+     parser.add_argument("exp_id")
+     parser.add_argument("--op", required=True, help="Operation name")
+     parser.add_argument("op_args", nargs="*", help="Operation arguments")
+     parser.add_argument("--config", default="config.yaml")
+     parser.add_argument("--log", default=DEFAULT_LOG_PATH)
+     parser.add_argument("--json", action="store_true")
+     args = parser.parse_args()
+     report = surgery_report(args.exp_id, args.op, args.op_args, args.config, args.log)
+     if "error" not in report:
+         fp = save_surgery_report(report); print(f"Saved to {fp}", file=sys.stderr)
+     print(json.dumps(report, indent=2, default=str) if args.json else format_surgery_report(report))
+
+ if __name__ == "__main__": main()
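The `widen` branch reports `+(factor**2 - 1)*100%` parameters, which treats every layer as hidden-to-hidden. A quick check on a toy MLP shows how close that estimate is once the input and output layers, which grow only linearly with width, are counted. The layer sizes and the `mlp_param_count` helper are assumptions for illustration and do not appear in the package.

```python
def mlp_param_count(n_in: int, hidden: int, n_layers: int, n_out: int) -> int:
    """Weights plus biases of an MLP with `n_layers` equal-width hidden layers."""
    dims = [n_in] + [hidden] * n_layers + [n_out]
    return sum(a * b + b for a, b in zip(dims, dims[1:]))


factor = 2.0
before = mlp_param_count(n_in=32, hidden=256, n_layers=3, n_out=10)
after = mlp_param_count(n_in=32, hidden=int(256 * factor), n_layers=3, n_out=10)

estimate = factor**2 - 1     # the figure plan_operation reports for factor 2
actual = after / before - 1  # exact growth for this toy MLP
print(f"estimate +{estimate:.0%}, actual +{actual:.1%}")
```

The quadratic figure is an upper bound that gets tighter as the hidden-to-hidden layers dominate the parameter count, so it is reasonable for deep or wide networks and looser for shallow ones.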
@@ -0,0 +1,277 @@
+ #!/usr/bin/env python3
+ """Model merging for the autoresearch pipeline.
+
+ Average or merge weights from multiple fine-tuned checkpoints into a
+ single model (model soups, TIES, DARE, greedy soup). Often beats any
+ individual model with zero additional training cost and no latency overhead.
+
+ Usage:
+     python scripts/model_merger.py exp-042 exp-053 exp-067
+     python scripts/model_merger.py exp-042 exp-053 --method greedy
+     python scripts/model_merger.py --json
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import sys
+ from datetime import datetime, timezone
+ from pathlib import Path
+
+ import numpy as np
+ import yaml
+
+ from scripts.turing_io import load_config, load_experiments
+
+ DEFAULT_LOG_PATH = "experiments/log.jsonl"
+ MERGE_METHODS = ["uniform", "greedy", "ties", "dare"]
+
+
+ def check_compatibility(experiments: list[dict]) -> dict:
+     """Check that all models share the same architecture."""
+     model_types = {e.get("config", {}).get("model_type", "?") for e in experiments}
+     compatible = len(model_types) == 1
+     return {
+         "compatible": compatible,
+         "model_types": list(model_types),
+         "n_models": len(experiments),
+         "reason": "All models share same architecture" if compatible else f"Mixed architectures: {model_types}",
+     }
+
+
+ def plan_uniform_merge(
+     experiments: list[dict],
+     primary_metric: str,
+ ) -> dict:
+     """Plan uniform weight averaging (model soup)."""
+     metrics = [e.get("metrics", {}).get(primary_metric, 0) for e in experiments]
+     return {
+         "method": "uniform",
+         "description": "Simple average of all model weights",
+         "n_models": len(experiments),
+         "individual_metrics": [{"exp_id": e.get("experiment_id"), primary_metric: m} for e, m in zip(experiments, metrics)],
+         "weights": [round(1.0 / len(experiments), 4)] * len(experiments),
+     }
+
+
+ def plan_greedy_merge(
+     experiments: list[dict],
+     primary_metric: str,
+     merge_results: list[dict] | None = None,
+ ) -> dict:
+     """Plan greedy soup — iteratively add models only if they improve the merge."""
+     sorted_exps = sorted(experiments, key=lambda e: e.get("metrics", {}).get(primary_metric, 0), reverse=True)
+     included = [sorted_exps[0].get("experiment_id")]
+     excluded = []
+
+     if merge_results:
+         # Use actual results to determine inclusion
+         for r in merge_results[1:]:
+             if r.get("improved", True):
+                 included.append(r.get("exp_id"))
+             else:
+                 excluded.append(r.get("exp_id"))
+     else:
+         # Plan: include all by default, actual filtering done at execution
+         included = [e.get("experiment_id") for e in sorted_exps]
+
+     return {
+         "method": "greedy",
+         "description": "Iteratively add models only if they improve the merged result",
+         "included": included,
+         "excluded": excluded,
+         "n_included": len(included),
+         "n_excluded": len(excluded),
+     }
+
+
+ def plan_ties_merge(experiments: list[dict]) -> dict:
+     """Plan TIES merging (Trim, Elect sign, disjoint Merge)."""
+     return {
+         "method": "ties",
+         "description": "Trim redundant params, elect sign consensus, disjoint merge",
+         "n_models": len(experiments),
+         "steps": [
+             "1. Compute task vectors (delta from base) for each model",
+             "2. Trim: zero out smallest magnitude deltas",
+             "3. Elect: per parameter, pick the sign with the larger total magnitude across models",
+             "4. Merge: average the surviving, sign-consistent deltas",
+         ],
+     }
+
+
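The four TIES steps returned by `plan_ties_merge` are only descriptions; nothing in this file executes them. A minimal pure-Python sketch of the procedure is given below, assuming flat weight lists and a shared base checkpoint. The `keep` fraction, the `lam` scaling factor, and the function name are assumptions, and the sketch elects each parameter's sign by total magnitude (the sign of the summed trimmed deltas).

```python
def ties_merge(
    base: list[float],
    models: list[list[float]],
    keep: float = 0.2,
    lam: float = 1.0,
) -> list[float]:
    """Merge fine-tuned `models` that share the same `base` weights."""
    n = len(base)
    k = max(1, int(n * keep))  # entries kept per task vector
    trimmed = []
    for m in models:
        delta = [m[i] - base[i] for i in range(n)]  # task vector
        top = set(sorted(range(n), key=lambda i: abs(delta[i]), reverse=True)[:k])
        trimmed.append([delta[i] if i in top else 0.0 for i in range(n)])  # trim

    merged = []
    for col in zip(*trimmed):
        total = sum(col)  # elect: the sign carrying the larger total magnitude wins
        agree = [v for v in col if v * total > 0]
        merged.append(sum(agree) / len(agree) if agree else 0.0)  # disjoint mean
    return [b + lam * d for b, d in zip(base, merged)]


base = [1.0, 1.0, 1.0, 1.0]
m1 = [5.0, 0.0, 1.5, 1.0]  # deltas: +4.0, -1.0, +0.5,  0.0
m2 = [3.0, 4.0, 1.0, 0.5]  # deltas: +2.0, +3.0,  0.0, -0.5
merged = ties_merge(base, [m1, m2], keep=0.5)
print(merged)
```

In the second position the two models disagree in sign; the positive delta carries more magnitude, so only it contributes to the merged value instead of being averaged against the opposing one.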
+ def plan_dare_merge(experiments: list[dict]) -> dict:
+     """Plan DARE merging (Drop And REscale)."""
+     return {
+         "method": "dare",
+         "description": "Randomly drop parameters and rescale survivors to reduce interference",
+         "n_models": len(experiments),
+         "drop_rate": 0.5,
+         "steps": [
+             "1. Compute task vectors for each model",
+             "2. Randomly drop 50% of each task vector's entries",
+             "3. Rescale surviving parameters by 1/(1-drop_rate)",
+             "4. Average the rescaled task vectors",
+         ],
+     }
+
+
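As with TIES, `plan_dare_merge` only lists the steps. The sketch below shows one way they could be carried out on flat weight lists; it is an assumption-laden illustration rather than the package's implementation. Entries are dropped independently with probability `drop_rate`, so the dropped fraction is approximately, not exactly, that rate, and the `seed` parameter is added here only to make the example repeatable.

```python
import random


def dare_merge(
    base: list[float],
    models: list[list[float]],
    drop_rate: float = 0.5,
    seed: int = 0,
) -> list[float]:
    """Drop task-vector entries at random, rescale survivors, then average."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    rng = random.Random(seed)
    rescaled = []
    for m in models:
        delta = [w - b for w, b in zip(m, base)]  # task vector
        rescaled.append([
            0.0 if rng.random() < drop_rate else d / (1.0 - drop_rate)
            for d in delta
        ])
    mean_delta = [sum(col) / len(models) for col in zip(*rescaled)]
    return [b + d for b, d in zip(base, mean_delta)]


base = [0.0] * 8
model = [1.0] * 8
merged = dare_merge(base, [model], drop_rate=0.5, seed=1)
print(merged)  # each entry is either 0.0 (dropped) or 2.0 (kept and rescaled)
```

Rescaling by `1 / (1 - drop_rate)` keeps the expected value of each delta unchanged, which is why the merge can discard half the entries without systematically shrinking the update.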
120
+ def compare_merge_methods(
121
+ method_results: dict[str, dict] | None = None,
122
+ experiments: list[dict] | None = None,
123
+ primary_metric: str = "accuracy",
124
+ ) -> dict:
125
+ """Compare merge method results."""
126
+ if not experiments:
127
+ return {"error": "No experiments provided"}
128
+
129
+ # Best single model
130
+ best_single = max(experiments, key=lambda e: e.get("metrics", {}).get(primary_metric, 0))
131
+ best_metric = best_single.get("metrics", {}).get(primary_metric, 0)
132
+
133
+ results = [{
134
+ "method": "best_single",
135
+ "metric_value": best_metric,
136
+ "delta": 0.0,
137
+ "experiment_id": best_single.get("experiment_id"),
138
+ }]
139
+
140
+ if method_results:
141
+ for method_name, data in method_results.items():
142
+ metric = data.get("metric_value", data.get(primary_metric, 0))
143
+ results.append({
144
+ "method": method_name,
145
+ "metric_value": metric,
146
+ "delta": round(metric - best_metric, 6),
147
+ })
148
+
149
+ best_merge = max(results, key=lambda r: r.get("metric_value", 0))
150
+
151
+ return {
152
+ "results": results,
153
+ "best_method": best_merge.get("method"),
154
+ "best_metric": best_merge.get("metric_value"),
155
+ "improvement": best_merge.get("delta", 0),
156
+ }
157
+
158
+
159
+ def merge_analysis(
160
+ exp_ids: list[str] | None = None,
161
+ method_results: dict[str, dict] | None = None,
162
+ config_path: str = "config.yaml",
163
+ log_path: str = DEFAULT_LOG_PATH,
164
+ ) -> dict:
165
+ """Run merge analysis."""
166
+ config = load_config(config_path)
167
+ primary_metric = config.get("evaluation", {}).get("primary_metric", "accuracy")
168
+ experiments = load_experiments(log_path)
169
+
170
+ if exp_ids:
171
+ selected = [e for e in experiments if e.get("experiment_id") in exp_ids]
172
+ else:
173
+ # Default: top 3 kept experiments
174
+ kept = sorted(
175
+ [e for e in experiments if e.get("status") == "kept"],
176
+ key=lambda e: e.get("metrics", {}).get(primary_metric, 0), reverse=True,
177
+ )
178
+ selected = kept[:3]
179
+
180
+ if len(selected) < 2:
181
+ return {"error": "Need at least 2 experiments for model merging"}
182
+
183
+ compat = check_compatibility(selected)
184
+
185
+ plans = {
186
+ "uniform": plan_uniform_merge(selected, primary_metric),
187
+ "greedy": plan_greedy_merge(selected, primary_metric),
188
+ "ties": plan_ties_merge(selected),
189
+ "dare": plan_dare_merge(selected),
190
+ }
191
+
192
+ comparison = compare_merge_methods(method_results, selected, primary_metric) if method_results else None
193
+
194
+ return {
195
+ "generated_at": datetime.now(timezone.utc).isoformat(),
196
+ "primary_metric": primary_metric,
197
+ "compatibility": compat,
198
+ "base_models": [{"exp_id": e.get("experiment_id"),
199
+ "model_type": e.get("config", {}).get("model_type"),
200
+ primary_metric: e.get("metrics", {}).get(primary_metric)}
201
+ for e in selected],
202
+ "plans": plans,
203
+ "comparison": comparison,
204
+ }
205
+
206
+
207
+ def save_merge_report(report: dict, output_dir: str = "experiments/merges") -> Path:
208
+ out = Path(output_dir); out.mkdir(parents=True, exist_ok=True)
209
+ ts = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
210
+ fp = out / f"merge-{ts}.yaml"
211
+ with open(fp, "w") as f: yaml.dump(json.loads(json.dumps(report, default=str)), f, default_flow_style=False, sort_keys=False)
212
+ return fp
213
+
214
+
215
+ def format_merge_report(report: dict) -> str:
216
+ if "error" in report: return f"ERROR: {report['error']}"
217
+
218
+ metric = report.get("primary_metric", "metric")
219
+ lines = ["# Model Merge Analysis", "",
220
+ f"*Generated {report.get('generated_at', 'N/A')[:19]}*", ""]
221
+
222
+ # Compatibility
223
+ compat = report.get("compatibility", {})
224
+ lines.append(f"**Compatibility:** {'✓' if compat.get('compatible') else '✗'} {compat.get('reason', '')}")
225
+ lines.append("")
226
+
227
+ # Base models
228
+ lines.extend(["## Base Models", "",
229
+ f"| Experiment | Model Type | {metric} |",
230
+ "|------------|------------|--------|"])
231
+ for m in report.get("base_models", []):
232
+ val = m.get(metric, "N/A")
233
+ val_str = f"{val:.4f}" if isinstance(val, float) else str(val)
234
+ lines.append(f"| {m.get('exp_id', '?')} | {m.get('model_type', '?')} | {val_str} |")
235
+ lines.append("")
236
+
237
+ # Methods
238
+ plans = report.get("plans", {})
239
+ if plans:
240
+ lines.extend(["## Available Methods", ""])
241
+ for name, plan in plans.items():
242
+ lines.append(f"- **{name}:** {plan.get('description', '')}")
243
+ lines.append("")
244
+
245
+ # Comparison (if results available)
246
+ comparison = report.get("comparison")
247
+ if comparison:
248
+ lines.extend(["## Results", "",
249
+ f"| Method | {metric} | Δ vs Best Single |",
250
+ "|--------|--------|------------------|"])
251
+ for r in comparison.get("results", []):
252
+ val = f"{r.get('metric_value', 0):.4f}"
253
+ delta = f"{r.get('delta', 0):+.4f}" if r.get("delta") is not None else "—"
254
+ marker = " ← BEST" if r["method"] == comparison.get("best_method") and r["method"] != "best_single" else ""
255
+ lines.append(f"| {r['method']} | {val} | {delta} |{marker}")
256
+ lines.append("")
257
+ imp = comparison.get("improvement") or 0
258
+ if imp > 0:
259
+ lines.append(f"**{comparison['best_method']} improves by {imp:+.4f} over best single model — zero latency cost.**")
260
+
261
+ return "\n".join(lines)
262
+
263
+
264
+ def main() -> None:
265
+ parser = argparse.ArgumentParser(description="Model merging")
266
+ parser.add_argument("exp_ids", nargs="*", help="Experiment IDs to merge")
267
+ parser.add_argument("--method", choices=MERGE_METHODS, help="Specific merge method")
268
+ parser.add_argument("--config", default="config.yaml")
269
+ parser.add_argument("--log", default=DEFAULT_LOG_PATH)
270
+ parser.add_argument("--json", action="store_true")
271
+ args = parser.parse_args()
272
+ report = merge_analysis(exp_ids=args.exp_ids or None, config_path=args.config, log_path=args.log)
273
+ if "error" not in report:
274
+ fp = save_merge_report(report); print(f"Saved to {fp}", file=sys.stderr)
275
+ print(json.dumps(report, indent=2, default=str) if args.json else format_merge_report(report))
276
+
277
+ if __name__ == "__main__": main()
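The merger script above only plans merges; the "uniform" plan is commonly understood as an element-wise average of matching parameters. The following is a minimal sketch under that assumption. The function name `uniform_merge` and the dict-of-arrays model representation are hypothetical and are not part of the package.

```python
from __future__ import annotations

import numpy as np


def uniform_merge(state_dicts: list[dict[str, np.ndarray]]) -> dict[str, np.ndarray]:
    """Average matching parameters element-wise (hypothetical helper)."""
    if len(state_dicts) < 2:
        raise ValueError("Need at least 2 models for merging")
    keys = set(state_dicts[0])
    for sd in state_dicts[1:]:
        if set(sd) != keys:
            raise ValueError("Parameter names differ; models are not compatible")
    merged = {}
    for name in state_dicts[0]:
        shapes = {sd[name].shape for sd in state_dicts}
        if len(shapes) != 1:
            raise ValueError(f"Shape mismatch for {name!r}: {shapes}")
        # Stack the per-model tensors and take the mean over the model axis.
        merged[name] = np.mean([sd[name] for sd in state_dicts], axis=0)
    return merged


# Toy example: two "models" with identical parameter names and shapes.
model_a = {"w": np.array([1.0, 2.0]), "b": np.array([0.0])}
model_b = {"w": np.array([3.0, 4.0]), "b": np.array([1.0])}
merged = uniform_merge([model_a, model_b])
```

The name and shape checks mirror the idea of a compatibility check before merging: averaging only makes sense when every model exposes the same parameters with the same shapes.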
@@ -0,0 +1,182 @@
1
+ #!/usr/bin/env python3
2
+ """Weight pruning for the autoresearch pipeline.
3
+
4
+ Structured and unstructured weight pruning. Measures accuracy at different
5
+ sparsity levels, finds the knee point, and plans pruned model production.
6
+
7
+ Usage:
8
+ python scripts/model_pruning.py exp-042
9
+ python scripts/model_pruning.py exp-042 --sparsity 0.5,0.75,0.9
10
+ python scripts/model_pruning.py exp-042 --method magnitude
11
+ python scripts/model_pruning.py --json
12
+ """
13
+
14
+ from __future__ import annotations
15
+
16
+ import argparse
17
+ import json
18
+ import math
19
+ import sys
20
+ from datetime import datetime, timezone
21
+ from pathlib import Path
22
+
23
+ import numpy as np
24
+ import yaml
25
+
26
+ from scripts.turing_io import load_config, load_experiments
27
+
28
+ DEFAULT_LOG_PATH = "experiments/log.jsonl"
29
+ DEFAULT_SPARSITY_LEVELS = [0.0, 0.50, 0.75, 0.90, 0.95]
30
+ PRUNING_METHODS = ["magnitude", "structured", "lottery"]
31
+
32
+
33
+ def plan_sparsity_sweep(
34
+ sparsity_levels: list[float] | None = None,
35
+ ) -> list[dict]:
36
+ if sparsity_levels is None:
37
+ sparsity_levels = DEFAULT_SPARSITY_LEVELS
38
+ return [{"sparsity": s, "description": f"{s*100:.0f}% weights removed"} for s in sparsity_levels]
39
+
40
+
41
+ def compute_pruning_plan(
42
+ model_type: str,
43
+ hyperparams: dict,
44
+ method: str,
45
+ sparsity: float,
46
+ ) -> dict:
47
+ plan = {"method": method, "sparsity": sparsity, "config_changes": {}}
48
+ if "xgboost" in model_type.lower() or "lightgbm" in model_type.lower() or "forest" in model_type.lower():
49
+ n_est = hyperparams.get("n_estimators", 100)
50
+ plan["config_changes"]["n_estimators"] = max(1, int(n_est * (1 - sparsity)))
51
+ plan["strategy"] = "reduce_estimators"
52
+ elif method == "magnitude":
53
+ plan["strategy"] = "zero_small_weights"
54
+ plan["description"] = f"Zero out smallest {sparsity*100:.0f}% of weights by absolute value"
55
+ elif method == "structured":
56
+ plan["strategy"] = "remove_neurons"
57
+ plan["description"] = f"Remove {sparsity*100:.0f}% of neurons/filters by importance"
58
+ elif method == "lottery":
59
+ plan["strategy"] = "iterative_magnitude_with_rewind"
60
+ plan["description"] = f"Iterative pruning to {sparsity*100:.0f}% with weight rewinding"
61
+ return plan
62
+
63
+
64
+ def find_knee_point(sweep_results: list[dict], metric_key: str = "accuracy") -> dict | None:
65
+ if len(sweep_results) < 3:
66
+ return None
67
+ sparsities = [r["sparsity"] for r in sweep_results]
68
+ metrics = [r.get(metric_key, 0) for r in sweep_results]
69
+ max_drop = 0
70
+ knee_idx = None
71
+ for i in range(1, len(metrics)):
72
+ drop = metrics[i - 1] - metrics[i]
73
+ if drop > max_drop:
74
+ max_drop = drop
75
+ knee_idx = i
76
+ if knee_idx is not None:
77
+ return {"sparsity": sparsities[knee_idx - 1],
78
+ "metric_before_knee": round(metrics[knee_idx - 1], 6),
79
+ "metric_after_knee": round(metrics[knee_idx], 6),
80
+ "drop_at_knee": round(max_drop, 6)}
81
+ return None
82
+
83
+
84
+ def estimate_speedup(sparsity: float) -> float:
85
+ if sparsity <= 0:
86
+ return 1.0
87
+ return round(1.0 / (1.0 - sparsity * 0.7), 2)
88
+
89
+
90
+ def estimate_size_reduction(sparsity: float) -> float:
91
+ return round(sparsity * 100, 1)
92
+
93
+
94
+ def analyze_pruning(
95
+ sweep_results: list[dict] | None = None,
96
+ exp_id: str | None = None,
97
+ method: str = "magnitude",
98
+ config_path: str = "config.yaml",
99
+ log_path: str = DEFAULT_LOG_PATH,
100
+ ) -> dict:
101
+ config = load_config(config_path)
102
+ primary_metric = config.get("evaluation", {}).get("primary_metric", "accuracy")
103
+
104
+ if sweep_results:
105
+ knee = find_knee_point(sweep_results, primary_metric)
106
+ for r in sweep_results:
107
+ r["speedup"] = estimate_speedup(r["sparsity"])
108
+ r["size_reduction_pct"] = estimate_size_reduction(r["sparsity"])
109
+ recommended = None
110
+ for r in sweep_results:
111
+ delta = abs(r.get(primary_metric, 0) - sweep_results[0].get(primary_metric, 0))
112
+ if delta < 0.005 and r["sparsity"] > 0:
113
+ recommended = r
114
+ return {
115
+ "generated_at": datetime.now(timezone.utc).isoformat(),
116
+ "experiment_id": exp_id, "method": method, "primary_metric": primary_metric,
117
+ "sweep_results": sweep_results, "knee_point": knee,
118
+ "recommended": recommended,
119
+ }
120
+
121
+ experiments = load_experiments(log_path)
122
+ exp = next((e for e in experiments if e.get("experiment_id") == exp_id), None) if exp_id else None
123
+ model_type = exp.get("config", {}).get("model_type", "unknown") if exp else "unknown"
124
+ hyperparams = exp.get("config", {}).get("hyperparams", {}) if exp else {}
125
+
126
+ levels = plan_sparsity_sweep()
127
+ plans = [compute_pruning_plan(model_type, hyperparams, method, s["sparsity"]) for s in levels]
128
+ return {
129
+ "action": "plan", "generated_at": datetime.now(timezone.utc).isoformat(),
130
+ "experiment_id": exp_id, "model_type": model_type, "method": method,
131
+ "sparsity_levels": levels, "plans": plans,
132
+ "message": f"Run {len(levels)} experiments at sparsity levels: {', '.join(s['description'] for s in levels)}",
133
+ }
134
+
135
+
136
+ def save_pruning_report(report: dict, output_dir: str = "experiments/pruning") -> Path:
137
+ out = Path(output_dir); out.mkdir(parents=True, exist_ok=True)
138
+ exp_id = report.get("experiment_id") or "unknown"
139
+ fp = out / f"{exp_id}-pruning.yaml"
140
+ with open(fp, "w") as f: yaml.dump(json.loads(json.dumps(report, default=str)), f, default_flow_style=False, sort_keys=False)
141
+ return fp
142
+
143
+
144
+ def format_pruning_report(report: dict) -> str:
145
+ if "error" in report: return f"ERROR: {report['error']}"
146
+ if report.get("action") == "plan":
147
+ lines = ["# Pruning Plan", "", f"**Model:** {report.get('model_type', '?')}", f"**Method:** {report.get('method', '?')}", ""]
148
+ for p in report.get("plans", []):
149
+ lines.append(f"- {p.get('sparsity', 0)*100:.0f}%: {p.get('strategy', '?')}")
150
+ return "\n".join(lines)
151
+
152
+ metric = report.get("primary_metric", "metric")
153
+ lines = [f"# Pruning Results: {report.get('experiment_id', '?')}", "",
154
+ f"| Sparsity | {metric} | Speedup | Size Reduction |",
155
+ "|----------|--------|---------|----------------|"]
156
+ for r in report.get("sweep_results", []):
157
+ val = f"{r.get(metric, 0):.4f}" if isinstance(r.get(metric), (int, float)) else "N/A"
158
+ lines.append(f"| {r['sparsity']*100:.0f}% | {val} | {r.get('speedup', '?')}x | {r.get('size_reduction_pct', '?')}% |")
159
+ knee = report.get("knee_point")
160
+ if knee:
161
+ lines.extend(["", f"**Knee point:** {knee['sparsity']*100:.0f}% sparsity ({metric} drops {knee['drop_at_knee']:.4f})"])
162
+ rec = report.get("recommended")
163
+ if rec:
164
+ lines.extend(["", f"**Recommended:** {rec['sparsity']*100:.0f}% sparsity ({rec.get('speedup', '?')}x speedup, <0.005 change in {metric})"])
165
+ return "\n".join(lines)
166
+
167
+
168
+ def main() -> None:
169
+ parser = argparse.ArgumentParser(description="Weight pruning")
170
+ parser.add_argument("exp_id", nargs="?")
171
+ parser.add_argument("--sparsity", help="Comma-separated sparsity levels")
172
+ parser.add_argument("--method", choices=PRUNING_METHODS, default="magnitude")
173
+ parser.add_argument("--config", default="config.yaml")
174
+ parser.add_argument("--log", default=DEFAULT_LOG_PATH)
175
+ parser.add_argument("--json", action="store_true")
176
+ args = parser.parse_args()
177
+ report = analyze_pruning(exp_id=args.exp_id, method=args.method, config_path=args.config, log_path=args.log)
178
+ if "error" not in report:
179
+ fp = save_pruning_report(report); print(f"Saved to {fp}", file=sys.stderr)
180
+ print(json.dumps(report, indent=2, default=str) if args.json else format_pruning_report(report))
181
+
182
+ if __name__ == "__main__": main()
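The pruning script above describes the "magnitude" strategy as zeroing the smallest fraction of weights by absolute value, but it only emits a plan. The following is a minimal sketch of that operation on a single NumPy array. The function name `magnitude_prune` is hypothetical, and global (per-tensor) thresholding is an assumption, not something the package specifies.

```python
import numpy as np


def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a copy with the smallest-magnitude fraction of weights zeroed."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    k = int(round(sparsity * weights.size))  # number of weights to zero
    if k == 0:
        return weights.copy()
    magnitudes = np.abs(weights).ravel()
    # argpartition places the k smallest magnitudes in the first k slots.
    smallest = np.argpartition(magnitudes, k - 1)[:k]
    pruned = weights.copy().ravel()
    pruned[smallest] = 0.0
    return pruned.reshape(weights.shape)


# Toy example: prune half of a 2x2 weight matrix.
w = np.array([[0.1, -2.0], [0.05, 3.0]])
pruned = magnitude_prune(w, 0.5)
achieved_sparsity = float(np.mean(pruned == 0.0))
```

Unstructured pruning like this leaves the tensor shape unchanged, so any real speedup depends on sparse kernels or hardware support. That is consistent with the script's `estimate_speedup` being a heuristic rather than a measurement.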
@@ -0,0 +1,177 @@
1
+ #!/usr/bin/env python3
2
+ """Post-training quantization for the autoresearch pipeline.
3
+
4
+ Quantize model weights from FP32 to INT8/FP16, measure accuracy loss
5
+ per precision level, and plan quantization-aware training if needed.
6
+
7
+ Usage:
8
+ python scripts/model_quantization.py exp-042
9
+ python scripts/model_quantization.py exp-042 --precision int8
10
+ python scripts/model_quantization.py --json
11
+ """
12
+
13
+ from __future__ import annotations
14
+
15
+ import argparse
16
+ import json
17
+ import math
18
+ import sys
19
+ from datetime import datetime, timezone
20
+ from pathlib import Path
21
+
22
+ import numpy as np
23
+ import yaml
24
+
25
+ from scripts.turing_io import load_config, load_experiments
26
+
27
+ DEFAULT_LOG_PATH = "experiments/log.jsonl"
28
+ PRECISION_LEVELS = ["fp32", "fp16", "int8_dynamic", "int8_static"]
29
+ QAT_THRESHOLD = 0.01 # If PTQ accuracy loss > 1%, suggest QAT
30
+
31
+
32
+ def compute_quantization_plan(
33
+ precision: str,
34
+ model_size_bytes: int | None = None,
35
+ latency_ms: float | None = None,
36
+ ) -> dict:
37
+ size_factors = {"fp32": 1.0, "fp16": 0.5, "int8_dynamic": 0.25, "int8_static": 0.25}
38
+ latency_factors = {"fp32": 1.0, "fp16": 0.58, "int8_dynamic": 0.39, "int8_static": 0.37}
39
+
40
+ factor_s = size_factors.get(precision, 1.0)
41
+ factor_l = latency_factors.get(precision, 1.0)
42
+
43
+ plan = {
44
+ "precision": precision,
45
+ "size_factor": factor_s,
46
+ "latency_factor": factor_l,
47
+ "estimated_size_bytes": int(model_size_bytes * factor_s) if model_size_bytes else None,
48
+ "estimated_latency_ms": round(latency_ms * factor_l, 2) if latency_ms else None,
49
+ "size_reduction_pct": round((1 - factor_s) * 100, 1),
50
+ "speedup": round(1 / factor_l, 2) if factor_l > 0 else None,
51
+ }
52
+
53
+ if precision == "fp16":
54
+ plan["description"] = "Half-precision floating point — GPU inference"
55
+ plan["method"] = "cast_to_fp16"
56
+ elif precision == "int8_dynamic":
57
+ plan["description"] = "Dynamic INT8 — weights quantized, activations at runtime"
58
+ plan["method"] = "dynamic_quantization"
59
+ elif precision == "int8_static":
60
+ plan["description"] = "Static INT8 — calibrated activation ranges, best accuracy"
61
+ plan["method"] = "static_quantization"
62
+ plan["requires_calibration"] = True
63
+ else:
64
+ plan["description"] = "Full precision (baseline)"
65
+ plan["method"] = "none"
66
+
67
+ return plan
68
+
69
+
70
+ def compare_precision_levels(
71
+ sweep_results: list[dict] | None = None,
72
+ model_size_bytes: int | None = None,
73
+ latency_ms: float | None = None,
74
+ primary_metric: str = "accuracy",
75
+ ) -> dict:
76
+ """Compare quantization results across precision levels."""
77
+ if sweep_results:
78
+ baseline = next((r for r in sweep_results if r.get("precision") == "fp32"), sweep_results[0])
79
+ baseline_metric = baseline.get(primary_metric, 0)
80
+
81
+ for r in sweep_results:
82
+ r["delta"] = round(r.get(primary_metric, 0) - baseline_metric, 6)
83
+ plan = compute_quantization_plan(r["precision"], model_size_bytes, latency_ms)
84
+ r.update({k: v for k, v in plan.items() if k not in r})
85
+
86
+ best = min(
87
+ [r for r in sweep_results if r["precision"] != "fp32"],
88
+ key=lambda r: abs(r.get("delta", 0)) + (1 - r.get("speedup", 1)) * 0.1,
89
+ default=None,
90
+ )
91
+
92
+ needs_qat = any(abs(r.get("delta", 0)) > QAT_THRESHOLD for r in sweep_results if "int8" in r.get("precision", ""))
93
+
94
+ return {
95
+ "sweep_results": sweep_results,
96
+ "recommended": best,
97
+ "needs_qat": needs_qat,
98
+ }
99
+
100
+ # Plan mode
101
+ plans = [compute_quantization_plan(p, model_size_bytes, latency_ms) for p in PRECISION_LEVELS]
102
+ return {"action": "plan", "plans": plans}
103
+
104
+
105
+ def analyze_quantization(
106
+ sweep_results: list[dict] | None = None,
107
+ exp_id: str | None = None,
108
+ config_path: str = "config.yaml",
109
+ log_path: str = DEFAULT_LOG_PATH,
110
+ ) -> dict:
111
+ config = load_config(config_path)
112
+ primary_metric = config.get("evaluation", {}).get("primary_metric", "accuracy")
113
+
114
+ experiments = load_experiments(log_path)
115
+ exp = next((e for e in experiments if e.get("experiment_id") == exp_id), None) if exp_id else None
116
+
117
+ model_size = exp.get("metrics", {}).get("model_size_bytes") if exp else None
118
+ latency = exp.get("metrics", {}).get("latency_ms", exp.get("metrics", {}).get("inference_ms")) if exp else None
119
+
120
+ comparison = compare_precision_levels(sweep_results, model_size, latency, primary_metric)
121
+
122
+ return {
123
+ "generated_at": datetime.now(timezone.utc).isoformat(),
124
+ "experiment_id": exp_id,
125
+ "primary_metric": primary_metric,
126
+ **comparison,
127
+ }
128
+
129
+
130
+ def save_quantization_report(report: dict, output_dir: str = "experiments/quantization") -> Path:
131
+ out = Path(output_dir); out.mkdir(parents=True, exist_ok=True)
132
+ exp_id = report.get("experiment_id") or "unknown"
133
+ fp = out / f"{exp_id}-quantization.yaml"
134
+ with open(fp, "w") as f: yaml.dump(json.loads(json.dumps(report, default=str)), f, default_flow_style=False, sort_keys=False)
135
+ return fp
136
+
137
+
138
+ def format_quantization_report(report: dict) -> str:
139
+ if "error" in report: return f"ERROR: {report['error']}"
140
+
141
+ if report.get("action") == "plan":
142
+ lines = ["# Quantization Plan", ""]
143
+ for p in report.get("plans", []):
144
+ lines.append(f"- **{p['precision']}**: {p['description']} (size: {p['size_reduction_pct']}% reduction, speedup: {p.get('speedup', '?')}x)")
145
+ return "\n".join(lines)
146
+
147
+ metric = report.get("primary_metric", "metric")
148
+ lines = [f"# Quantization Results: {report.get('experiment_id', '?')}", "",
149
+ f"| Precision | {metric} | Delta | Speedup | Size Reduction |",
150
+ "|-----------|--------|-------|---------|----------------|"]
151
+ for r in report.get("sweep_results", []):
152
+ val = f"{r.get(metric, 0):.4f}" if isinstance(r.get(metric), (int, float)) else "N/A"
153
+ delta = f"{r.get('delta', 0):+.4f}" if r.get("delta") is not None else "—"
154
+ lines.append(f"| {r['precision']} | {val} | {delta} | {r.get('speedup', '?')}x | {r.get('size_reduction_pct', '?')}% |")
155
+
156
+ rec = report.get("recommended")
157
+ if rec:
158
+ lines.extend(["", f"**Recommended:** {rec['precision']} ({rec.get('delta', 0):+.4f} {metric}, {rec.get('speedup', '?')}x speedup)"])
159
+ if report.get("needs_qat"):
160
+ lines.extend(["", "**Note:** INT8 accuracy loss > 1% — consider quantization-aware training (QAT)"])
161
+ return "\n".join(lines)
162
+
163
+
164
+ def main() -> None:
165
+ parser = argparse.ArgumentParser(description="Post-training quantization")
166
+ parser.add_argument("exp_id", nargs="?")
167
+ parser.add_argument("--precision", help="Specific precision level")
168
+ parser.add_argument("--config", default="config.yaml")
169
+ parser.add_argument("--log", default=DEFAULT_LOG_PATH)
170
+ parser.add_argument("--json", action="store_true")
171
+ args = parser.parse_args()
172
+ report = analyze_quantization(exp_id=args.exp_id, config_path=args.config, log_path=args.log)
173
+ if "error" not in report:
174
+ fp = save_quantization_report(report); print(f"Saved to {fp}", file=sys.stderr)
175
+ print(json.dumps(report, indent=2, default=str) if args.json else format_quantization_report(report))
176
+
177
+ if __name__ == "__main__": main()
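The quantization script above works from fixed size and latency factors (for example 0.25 for INT8) rather than quantizing weights itself. To show where the 0.25 size factor comes from, here is a minimal sketch of symmetric per-tensor INT8 quantization. The function names and the symmetric max-abs scaling scheme are assumptions for illustration, not the package's method.

```python
from __future__ import annotations

import numpy as np


def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float weights to int8 with a single symmetric scale (hypothetical helper)."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from int8 values."""
    return q.astype(np.float32) * scale


# Toy example: round-trip a small FP32 tensor through INT8.
w = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_error = float(np.max(np.abs(w - w_hat)))
size_factor = q.nbytes / w.nbytes  # 1 byte per weight vs 4 bytes per weight
```

The round-trip error is bounded by about half the scale per weight. Whether that error affects the model's metric is what a precision sweep would measure.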
@@ -126,6 +126,10 @@ TEMPLATE_DIRS = {
126
126
  "calibration.py",
127
127
  "feature_intelligence.py",
128
128
  "curriculum_optimizer.py",
129
+ "model_pruning.py",
130
+ "model_quantization.py",
131
+ "model_merger.py",
132
+ "architecture_surgery.py",
129
133
  ],
130
134
  "tests": ["__init__.py", "conftest.py"],
131
135
  }
@@ -164,6 +168,10 @@ DIRECTORIES_TO_CREATE = [
164
168
  "experiments/calibration",
165
169
  "experiments/features",
166
170
  "experiments/curriculum",
171
+ "experiments/pruning",
172
+ "experiments/quantization",
173
+ "experiments/merges",
174
+ "experiments/surgery",
167
175
  "experiments/logs",
168
176
  "models/best",
169
177
  "models/archive",