@tryhamster/gerbil 1.0.0-rc.8 → 1.0.0

Files changed (179)
  1. package/LICENSE +1 -1
  2. package/README.md +247 -84
  3. package/dist/architectures-C1I5V3Dt.mjs +6070 -0
  4. package/dist/architectures-C1I5V3Dt.mjs.map +1 -0
  5. package/dist/browser/index.d.ts +264 -588
  6. package/dist/browser/index.d.ts.map +1 -1
  7. package/dist/browser/index.js +585 -2334
  8. package/dist/browser/index.js.map +1 -1
  9. package/dist/cli.mjs +625 -1098
  10. package/dist/cli.mjs.map +1 -1
  11. package/dist/defaults-9komdrbY.mjs +24 -0
  12. package/dist/defaults-9komdrbY.mjs.map +1 -0
  13. package/dist/frameworks/express.d.mts +1 -3
  14. package/dist/frameworks/express.d.mts.map +1 -1
  15. package/dist/frameworks/express.mjs +7 -7
  16. package/dist/frameworks/express.mjs.map +1 -1
  17. package/dist/frameworks/fastify.d.mts +1 -1
  18. package/dist/frameworks/fastify.d.mts.map +1 -1
  19. package/dist/frameworks/fastify.mjs +3 -3
  20. package/dist/frameworks/fastify.mjs.map +1 -1
  21. package/dist/frameworks/hono.d.mts +1 -1
  22. package/dist/frameworks/hono.d.mts.map +1 -1
  23. package/dist/frameworks/hono.mjs +4 -4
  24. package/dist/frameworks/hono.mjs.map +1 -1
  25. package/dist/frameworks/next.d.mts +3 -2
  26. package/dist/frameworks/next.d.mts.map +1 -1
  27. package/dist/frameworks/next.mjs +4 -4
  28. package/dist/frameworks/next.mjs.map +1 -1
  29. package/dist/frameworks/react.d.mts +1 -1
  30. package/dist/frameworks/trpc.d.mts +1 -1
  31. package/dist/frameworks/trpc.d.mts.map +1 -1
  32. package/dist/frameworks/trpc.mjs +4 -4
  33. package/dist/frameworks/trpc.mjs.map +1 -1
  34. package/dist/gerbil-BHrJJIa4.mjs +1656 -0
  35. package/dist/gerbil-BHrJJIa4.mjs.map +1 -0
  36. package/dist/gerbil-BT9fCydo.d.mts +488 -0
  37. package/dist/gerbil-BT9fCydo.d.mts.map +1 -0
  38. package/dist/gerbil-DomNfIr1.mjs +4 -0
  39. package/dist/gpu/hooks.d.mts +520 -0
  40. package/dist/gpu/hooks.d.mts.map +1 -0
  41. package/dist/gpu/hooks.mjs +1188 -0
  42. package/dist/gpu/hooks.mjs.map +1 -0
  43. package/dist/gpu/index.d.mts +2 -0
  44. package/dist/gpu/index.mjs +6 -0
  45. package/dist/gpu-33qCAtHW.mjs +3615 -0
  46. package/dist/gpu-33qCAtHW.mjs.map +1 -0
  47. package/dist/index-Dgmb2kE3.d.mts +245 -0
  48. package/dist/index-Dgmb2kE3.d.mts.map +1 -0
  49. package/dist/index-jEAL2s-A.d.mts +2022 -0
  50. package/dist/index-jEAL2s-A.d.mts.map +1 -0
  51. package/dist/index.d.mts +22 -487
  52. package/dist/index.d.mts.map +1 -1
  53. package/dist/index.mjs +13 -8
  54. package/dist/index.mjs.map +1 -1
  55. package/dist/indexeddb-store-BWIMtxxH.mjs +103 -0
  56. package/dist/indexeddb-store-BWIMtxxH.mjs.map +1 -0
  57. package/dist/indexeddb-store-ClH12Xnl.mjs +4 -0
  58. package/dist/integrations/ai-sdk.d.mts +75 -6
  59. package/dist/integrations/ai-sdk.d.mts.map +1 -1
  60. package/dist/integrations/ai-sdk.mjs +131 -15
  61. package/dist/integrations/ai-sdk.mjs.map +1 -1
  62. package/dist/integrations/langchain.d.mts +1 -1
  63. package/dist/integrations/langchain.d.mts.map +1 -1
  64. package/dist/integrations/langchain.mjs +5 -5
  65. package/dist/integrations/langchain.mjs.map +1 -1
  66. package/dist/integrations/llamaindex.d.mts +1 -1
  67. package/dist/integrations/llamaindex.d.mts.map +1 -1
  68. package/dist/integrations/llamaindex.mjs +5 -5
  69. package/dist/integrations/llamaindex.mjs.map +1 -1
  70. package/dist/integrations/mcp-client.mjs +3 -3
  71. package/dist/integrations/mcp-client.mjs.map +1 -1
  72. package/dist/integrations/mcp.d.mts +3 -2
  73. package/dist/integrations/mcp.d.mts.map +1 -1
  74. package/dist/integrations/mcp.mjs +5 -5
  75. package/dist/{mcp-BvbriaBy.mjs → mcp-1DaMsaBc.mjs} +4 -4
  76. package/dist/mcp-1DaMsaBc.mjs.map +1 -0
  77. package/dist/memory/index.d.mts +3 -0
  78. package/dist/memory/index.mjs +6 -0
  79. package/dist/memory-D1P7Tmda.mjs +4 -0
  80. package/dist/memory-DVN0MnIG.mjs +132 -0
  81. package/dist/memory-DVN0MnIG.mjs.map +1 -0
  82. package/dist/memory-Dj0J1v88.mjs +294 -0
  83. package/dist/memory-Dj0J1v88.mjs.map +1 -0
  84. package/dist/moonshine-stt-BLyVoRpB.mjs +4 -0
  85. package/dist/moonshine-stt-v_P_Ci_m.mjs +11936 -0
  86. package/dist/moonshine-stt-v_P_Ci_m.mjs.map +1 -0
  87. package/dist/{one-liner-s-lD8rCC.mjs → one-liner-DnQn7HJK.mjs} +14 -16
  88. package/dist/one-liner-DnQn7HJK.mjs.map +1 -0
  89. package/dist/repl-jV5gcJFA.mjs +9 -0
  90. package/dist/skills/index.d.mts +270 -320
  91. package/dist/skills/index.d.mts.map +1 -1
  92. package/dist/skills/index.mjs +5 -5
  93. package/dist/{skills-CD3Orlex.mjs → skills-DX8D59UH.mjs} +187 -32
  94. package/dist/skills-DX8D59UH.mjs.map +1 -0
  95. package/dist/{tools-Bi1P7Xoy.mjs → tools-DQ1mPUw5.mjs} +34 -22
  96. package/dist/tools-DQ1mPUw5.mjs.map +1 -0
  97. package/dist/{types-CiTc7ez3.d.mts → types-D6FiR_oh.d.mts} +106 -12
  98. package/dist/types-D6FiR_oh.d.mts.map +1 -0
  99. package/dist/types-DQBe2lFo.d.mts +165 -0
  100. package/dist/types-DQBe2lFo.d.mts.map +1 -0
  101. package/dist/{utils-CZBZ8dgR.mjs → utils-DKO55ZmZ.mjs} +1 -1
  102. package/dist/{utils-CZBZ8dgR.mjs.map → utils-DKO55ZmZ.mjs.map} +1 -1
  103. package/dist/vector-B0panuy6.mjs +95 -0
  104. package/dist/vector-B0panuy6.mjs.map +1 -0
  105. package/docs/PROJECT-STATE.md +321 -0
  106. package/docs/adding-a-model-family.md +280 -0
  107. package/docs/ai-sdk.md +70 -61
  108. package/docs/architecture/overview.md +17 -7
  109. package/docs/browser.md +203 -8
  110. package/docs/embeddings.md +156 -0
  111. package/docs/gerbil-site-native-migration.md +217 -0
  112. package/docs/gpu-engine/architectures.md +398 -0
  113. package/docs/gpu-engine/ir.md +372 -0
  114. package/docs/gpu-engine/kernels.md +718 -0
  115. package/docs/gpu-engine/paper.html +1759 -0
  116. package/docs/gpu-engine/paper.md +2109 -0
  117. package/docs/gpu-engine/safetensors.md +312 -0
  118. package/docs/gpu-engine/tokenizer.md +302 -0
  119. package/docs/memory-rag.md +91 -0
  120. package/docs/metal-safari-intel.md +190 -0
  121. package/docs/mobile-failure-diagnosis.md +124 -0
  122. package/docs/mobile.md +99 -0
  123. package/docs/observability.md +230 -0
  124. package/docs/onnx-removal-plan.md +339 -0
  125. package/docs/research/autoresearch-portable.md +904 -0
  126. package/docs/research/dispatch-reduction-hivemind.md +84 -0
  127. package/docs/research/ios-safari-model-caching.md +117 -0
  128. package/docs/research/mobile-webgpu-speed-fusion.md +135 -0
  129. package/docs/research/native-stt-model-selection.md +49 -0
  130. package/docs/research/native-tts-model-selection.md +90 -0
  131. package/docs/research/native-vs-chromium-decision.md +152 -0
  132. package/docs/research/nemotron-mamba2-inference.md +910 -0
  133. package/docs/research/qwen35-multimodal.md +293 -0
  134. package/docs/research/qwen36-gemma4-targets.md +337 -0
  135. package/docs/research/sota-embedding-models.md +179 -0
  136. package/docs/research/sota-mobile-models-2026.md +263 -0
  137. package/docs/research/sota-modality-models.md +202 -0
  138. package/docs/research/tps-baselines.md +71 -0
  139. package/docs/research/webgpu-m4-reference.md +104 -0
  140. package/docs/site-update-plan.md +155 -0
  141. package/docs/structured-output.md +123 -0
  142. package/docs/stt.md +63 -446
  143. package/docs/tts.md +77 -499
  144. package/docs/vision.md +100 -338
  145. package/package.json +22 -7
  146. package/dist/chrome-backend-CORwaIyC.mjs +0 -1212
  147. package/dist/chrome-backend-CORwaIyC.mjs.map +0 -1
  148. package/dist/chrome-backend-DIKYoWj-.mjs +0 -3
  149. package/dist/gerbil-CJ3ifloF.mjs +0 -4
  150. package/dist/gerbil-Dw4Qj77e.mjs +0 -1631
  151. package/dist/gerbil-Dw4Qj77e.mjs.map +0 -1
  152. package/dist/gerbil-qOTe1nl2.d.mts +0 -431
  153. package/dist/gerbil-qOTe1nl2.d.mts.map +0 -1
  154. package/dist/kokoro-BNTb6egA.mjs +0 -20210
  155. package/dist/kokoro-BNTb6egA.mjs.map +0 -1
  156. package/dist/kokoro-DFRQ1OeM.js +0 -20212
  157. package/dist/kokoro-DFRQ1OeM.js.map +0 -1
  158. package/dist/mcp-BvbriaBy.mjs.map +0 -1
  159. package/dist/one-liner-s-lD8rCC.mjs.map +0 -1
  160. package/dist/repl-DveXw36T.mjs +0 -9
  161. package/dist/skills-CD3Orlex.mjs.map +0 -1
  162. package/dist/stt-CpLYbGFd.mjs +0 -433
  163. package/dist/stt-CpLYbGFd.mjs.map +0 -1
  164. package/dist/stt-DRPLEEHB.mjs +0 -3
  165. package/dist/stt-Te8Qz-Ay.js +0 -433
  166. package/dist/stt-Te8Qz-Ay.js.map +0 -1
  167. package/dist/tools-Bi1P7Xoy.mjs.map +0 -1
  168. package/dist/transformers.web-DokyH3rP.js +0 -3
  169. package/dist/transformers.web-M6mCnEYJ.js +0 -30382
  170. package/dist/transformers.web-M6mCnEYJ.js.map +0 -1
  171. package/dist/tts-C0xx3CtE.js +0 -724
  172. package/dist/tts-C0xx3CtE.js.map +0 -1
  173. package/dist/tts-DXgsKGCe.mjs +0 -3
  174. package/dist/tts-DeGANMNV.mjs +0 -730
  175. package/dist/tts-DeGANMNV.mjs.map +0 -1
  176. package/dist/types-CiTc7ez3.d.mts.map +0 -1
  177. /package/dist/{auto-update-S9s5-g0C.mjs → auto-update-BVaLXcDE.mjs} +0 -0
  178. /package/dist/{chunk-CkXuGtQK.mjs → chunk-B9cbKln6.mjs} +0 -0
  179. /package/dist/{microphone-DaMZFRuR.mjs → microphone-Bqmoz9_K.mjs} +0 -0
@@ -0,0 +1,904 @@
1
+ # Autoresearch: Complete Technical Reference
2
+
3
+ A portable reference for implementing autonomous research loops. Based on [Karpathy's autoresearch](https://github.com/karpathy/autoresearch) and the [MLX port](https://github.com/trevin-creator/autoresearch-mlx).
4
+
5
+ ## Core Concept
6
+
7
+ An AI agent autonomously iterates on a single mutable file, running fixed-time experiments, keeping improvements and reverting failures. Git provides memory. A single metric provides the keep/revert signal. Humans sleep; the agent works.
8
+
9
+ Karpathy's first run: 83 experiments over ~2 days, 15 kept improvements, 11% speedup on the GPT-2 leaderboard. The agent found real improvements that a domain expert missed after years of manual tuning. All improvements were additive and transferred to larger models.
10
+
11
+ The key insight is NOT that AI can tune hyperparameters — it's that the loop structure turns any measurable problem into an autonomous hill-climbing search where the agent builds on its own history, reads its own failures, and discovers things humans don't think to try.
12
+
13
+ ## Architecture
14
+
15
+ ```
16
+ program.md (immutable) — agent instructions, goals, constraints
17
+ prepare.py (immutable) — data, evaluation, constants
18
+ constants.py (immutable) — TIME_BUDGET, MAX_SEQ_LEN, EVAL_TOKENS
19
+ train.py (MUTABLE) — the experiment subject
20
+ results.tsv (append) — complete experiment log (all attempts)
21
+ git branch (ratcheted) — only successful commits survive
22
+ run.log (transient) — stdout/stderr from current experiment
23
+ ```
24
+
25
+ ### The Three-File Contract
26
+
27
+ 1. **program.md** — The agent's operating manual. Defines what to optimize, what rules to follow, what files are off-limits. The ONLY way humans influence the loop after launch. This is the "prompt" — the entire behavior of the system is determined by what's in this file. It has six sections: monorepo safety, setup protocol, experimentation rules, output format, logging rules, and the experiment loop.
28
+
29
+ 2. **prepare.py** — Fixed infrastructure. Data loading, tokenization, the evaluation function. Agent CANNOT modify this. This prevents metric gaming and ensures all experiments are comparable. It also contains the immutable constants (`TIME_BUDGET`, `MAX_SEQ_LEN`, `EVAL_TOKENS`). The evaluation function is the oracle — if the agent could change it, the system would be meaningless.
30
+
31
+ 3. **train.py** — The single mutable file. Agent can change anything: architecture, optimizer, hyperparameters, training loop, batch size, model size, activation functions, attention mechanisms, initialization schemes, learning rate schedules. The only constraint is it must run without crashing and produce output within the time budget.
32
+
33
+ ### System Parameters
34
+
35
+ | Parameter | Value | Purpose |
36
+ |-----------|-------|---------|
37
+ | `MAX_SEQ_LEN` | 2048 tokens | Fixed sequence length for all experiments |
38
+ | `TIME_BUDGET` | 300 seconds | Training duration (wall-clock, excludes init/eval) |
39
+ | `EVAL_TOKENS` | 20,971,520 (H100) / 1,572,864 (MLX) | Validation evaluation budget |
40
+ | `VOCAB_SIZE` | 8,192 | BPE tokenizer vocabulary |
41
+ | `VAL_SHARD` | 6542 | Pinned validation shard (never trained on) |
42
+ | `MAX_SHARD` | 6542 | Highest shard index |
43
+
44
+ ### Why This Works
45
+
46
+ - **Fixed time budget** — every experiment costs the same wall-clock time (5 min default). This is the crucial design choice. It eliminates the explore/exploit tradeoff around compute allocation. The agent can try radical changes freely because the worst case is 5 wasted minutes. It also means the system naturally discovers the right tradeoff between model size and training steps — a bigger model gets fewer steps, a smaller one gets more.
47
+
48
+ - **Single mutable file** — constrains search space, produces reviewable diffs, prevents the agent from accidentally breaking infrastructure. Every experiment is a single-file diff. You can `git log --stat` and see exactly what changed.
49
+
50
+ - **Single metric** — val_bpb (validation bits per byte). Unambiguous: lower is better. No composite scores, no human judgment needed, no weighting decisions. The agent never has to ask "is this better?" — it just compares two numbers.
51
+
52
+ - **Git as memory** — the commit log IS the research journal. The agent reads its own history to plan next experiments. Kept commits show what works; reverted commits (logged in results.tsv) show what doesn't. This dual-tracking is essential: the git branch is the clean history of validated improvements; results.tsv is the complete exploration log including all failures.
53
+
54
+ - **Ratchet mechanism** — the branch only advances on improvement. You can never regress. The current HEAD is always the best-known configuration. This is a monotonic improvement guarantee that makes the system safe to run unattended.
55
+
56
+ ### Architectural Invariants
57
+
58
+ These are the properties that make the system trustworthy:
59
+
60
+ | Invariant | Enforcement | Rationale |
61
+ |-----------|-------------|-----------|
62
+ | Single mutable file | Agent instructions in program.md | Isolated experimental changes, reviewable diffs |
63
+ | Fixed time budget | `TIME_BUDGET` constant in immutable prepare.py | Hardware-specific optimization, equal cost per experiment |
64
+ | Fixed evaluation | `evaluate_bpb()` in immutable prepare.py | All experiments are comparable, no metric gaming |
65
+ | No new dependencies | Agent instructions | Prevents scope creep, environment remains stable |
66
+ | Single scalar metric | `val_bpb` only | Eliminates multi-objective complexity, unambiguous decisions |
67
+ | Git-based ratcheting | Keep via commit, discard via reset | Monotonic improvement, clean history |
68
+ | Immutable constants | `MAX_SEQ_LEN`, `VOCAB_SIZE`, `EVAL_TOKENS` | Consistent data processing across all experiments |
69
+ | Pinned validation data | Shard 6542 never used for training | No data leakage, stable evaluation |
70
+
71
+ ## The Experiment Loop
72
+
73
+ ### Verbatim Protocol (from program.md)
74
+
75
+ ```
76
+ LOOP FOREVER:
77
+ 1. Look at the git state: the current branch/commit we're on
78
+ 2. Tune train.py with an experimental idea by directly hacking the code
79
+ 3. git commit
80
+ 4. Run the experiment: uv run train.py > run.log 2>&1
81
+ 5. Read out the results: grep "^val_bpb:\|^peak_vram_mb:" run.log
82
+ 6. If grep output is empty, the run crashed. Read tail -n 50 run.log
83
+ 7. Record the results in results.tsv
84
+ 8. If val_bpb improved (lower), keep the git commit
85
+ 9. If val_bpb is equal or worse, git reset HEAD~1
86
+ ```
87
+
88
+ ### Exact Git Commands
89
+
90
+ ```bash
91
+ # Setup (once, at start of run)
92
+ git checkout -b autoresearch/<tag> # e.g., autoresearch/mar5
93
+
94
+ # Each experiment
95
+ git add train.py
96
+ git commit -m "experiment: <description>"
97
+ uv run train.py > run.log 2>&1
98
+ grep "^val_bpb:\|^peak_vram_mb:" run.log
99
+
100
+ # If KEEP (val_bpb improved):
101
+ git add results.tsv
102
+ git commit --amend --no-edit # fold results.tsv into experiment commit
103
+
104
+ # If DISCARD (val_bpb equal or worse):
105
+ git reset --hard <previous_kept_commit> # revert to last known-good state
106
+ ```
107
+
108
+ The amend-on-keep pattern is elegant: each kept commit contains both the code change AND the results that prove it worked. The branch reads as a clean research log.
109
+
110
+ ### Cycle Timing
111
+
112
+ | Phase | H100 | Apple Silicon |
113
+ |-------|------|---------------|
114
+ | Training | 300s (5 min) | 300s (5 min) |
115
+ | Model init + compilation | ~30s | ~11s |
116
+ | Evaluation | ~30s | ~52s |
117
+ | Agent analysis + code edit | ~60s | ~60s |
118
+ | **Total per experiment** | ~7 min | ~7 min |
119
+ | **Experiments per hour** | ~8 | ~8-9 |
120
+ | **Overnight (10h)** | ~80 | ~80 |
121
+
122
+ ### The Setup Phase
123
+
124
+ Before the autonomous loop begins, there's a critical interactive setup:
125
+
126
+ 1. **Propose a date-based run tag** (e.g., `mar5`, `jun12-m4max`)
127
+ 2. **Create the branch**: `git checkout -b autoresearch/<tag>`
128
+ 3. **Read all files**: README.md, prepare.py, train.py — build full context
129
+ 4. **Verify data**: Confirm `~/.cache/autoresearch/` contains shards and tokenizer
130
+ 5. **Create results.tsv** with header row
131
+ 6. **Run baseline**: Execute unmodified train.py, record as first entry
132
+ 7. **Get human approval** before starting autonomous loop
133
+
134
+ **Critical**: The agent must establish its own baseline on the current hardware. A baseline from a different machine is invalid — the time budget produces different step counts on different hardware, which means different optimal configurations.
135
+
136
+ ## Decision Framework
137
+
138
+ ### The Primary Signal: val_bpb Comparison
139
+
140
+ ```
141
+ IF val_bpb_new < val_bpb_best:
142
+     KEEP     # improvement detected
143
+ ELIF val_bpb_new == val_bpb_best AND code_is_simpler:
144
+     KEEP     # simplification win
145
+ ELSE:
146
+     DISCARD
147
+ ```
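The comparison above, combined with the grep-based crash check from the experiment loop, can be sketched in Python. This is a minimal sketch under assumptions: `parse_metrics` and `decide` are hypothetical helper names, not functions from the repository, and the simplicity tie-break is passed in as a boolean because judging it is left to the agent.

```python
import re

# Matches lines such as "val_bpb: 0.997900" at the start of a line in run.log.
METRIC_RE = re.compile(r"^(val_bpb|peak_vram_mb):\s*([0-9.]+)\s*$", re.MULTILINE)


def parse_metrics(log_text):
    """Return the metrics found in run.log text; an empty dict means a crash."""
    return {name: float(value) for name, value in METRIC_RE.findall(log_text)}


def decide(log_text, best_bpb, code_is_simpler=False):
    """Hypothetical helper: map one run's log to keep / discard / crash."""
    metrics = parse_metrics(log_text)
    if "val_bpb" not in metrics:
        return "crash"      # same signal as an empty grep result
    new_bpb = metrics["val_bpb"]
    if new_bpb < best_bpb:
        return "keep"       # strictly lower is better
    if new_bpb == best_bpb and code_is_simpler:
        return "keep"       # equal score with simpler code
    return "discard"
```

The sketch leaves out the softer judgment calls shown in the decision table (for example, rejecting a marginal gain that adds ugly complexity); those would need input that a log file alone does not carry.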
148
+
149
+ ### Concrete Decision Examples
150
+
151
+ | Baseline val_bpb | New val_bpb | Code Change | Decision | Why |
152
+ |-------------------|------------|-------------|----------|-----|
153
+ | 0.997 | 0.993 | +5 clean lines | **Keep** | Clear improvement, reasonable complexity |
154
+ | 0.997 | 0.996 | +20 hacky lines | Discard | Marginal gain, ugly complexity |
155
+ | 0.997 | 0.997 | -15 lines (deletion) | **Keep** | Equal performance, simpler = win |
156
+ | 0.997 | 0.992 | +10 clean lines | **Keep** | Significant gain, clean code |
157
+ | 0.997 | 0.000 | Any | Crash | Log as crash, diagnose |
158
+ | 0.997 | 1.005 | Any | Discard | Regression |
159
+ | 0.997 | 0.996 | Removed entire subsystem | **Keep** | Great simplification plus a small gain |
160
+
161
+ ### The Simplicity Criterion
162
+
163
+ "All else being equal, simpler is better. A small improvement that adds ugly complexity is not worth it."
164
+
165
+ This is a soft constraint, not a hard rule. The agent applies judgment:
166
+ - Code deletions achieving equal performance are explicit wins
167
+ - Minor gains requiring significant complexity warrant rejection
168
+ - Simplifications achieving equal performance are encouraged
169
+ - Removing something and getting equal/better results is a great outcome — it means the thing was unnecessary
170
+
171
+ ### VRAM / Memory Management
172
+
173
+ Memory is a soft constraint. The tradeoff is situational:
174
+
175
+ | Memory Change | val_bpb Change | Decision |
176
+ |---------------|----------------|----------|
177
+ | +2 GB | -0.050 | **Keep** — meaningful gain |
178
+ | +10 GB | -0.002 | Discard — not worth the memory |
179
+ | -5 GB | +0.000 | **Keep** — efficiency win |
180
+ | +1 GB | -0.010 | **Keep** — acceptable tradeoff |
181
+
182
+ The principle: memory shouldn't "blow up dramatically." Some increase is fine for meaningful gains, but the agent shouldn't chase marginal improvements at the cost of 2x memory usage.
183
+
184
+ ## The Metric: val_bpb
185
+
186
+ ### What It Measures
187
+
188
+ Validation bits per byte (val_bpb) is the primary optimization target. Lower = better.
189
+
190
+ **Formula:**
191
+ ```
192
+ val_bpb = total_nats / (ln(2) * total_bytes)
193
+ ```
194
+
195
+ Where:
196
+ - `total_nats` = sum of cross-entropy losses in natural log units
197
+ - `total_bytes` = sum of UTF-8 byte lengths for all target tokens
198
+ - `ln(2)` = conversion factor from nats to bits (≈0.6931)
199
+
200
+ Special tokens (byte length 0) are excluded from the BPB calculation.
201
+
202
+ ### Why BPB Over Perplexity
203
+
204
+ BPB is vocabulary-size-independent. If you change the tokenizer vocabulary (say from 8K to 32K tokens), perplexity becomes incomparable, but BPB remains valid because it normalizes by bytes, not tokens. This is critical for a system where the agent might want to experiment with different vocabulary sizes.
205
+
206
+ ### BPB Scale
207
+
208
+ | BPB Value | Interpretation | Compression Ratio |
209
+ |-----------|----------------|-------------------|
210
+ | 8.0 | No compression (random) | 1:1 |
211
+ | 2.0 | 4x compression | 4:1 |
212
+ | 1.5 | 5.3x compression | 5.3:1 |
213
+ | 1.0 | 8x compression (near SOTA) | 8:1 |
214
+
215
+ Karpathy's H100 baseline: `val_bpb = 0.998` (8x compression in 5 min of training).
216
+ Best MLX result: `val_bpb = 1.295` (6.2x compression on M4 Max in 5 min).
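As a worked check of the formula and the scale table, the conversion can be written out directly. This is an illustrative sketch; `bits_per_byte` and `compression_ratio` are names chosen here, and the ratio uses the table's convention that uncompressed text costs 8 bits per byte.

```python
import math


def bits_per_byte(total_nats, total_bytes):
    """val_bpb = total_nats / (ln(2) * total_bytes)."""
    return total_nats / (math.log(2) * total_bytes)


def compression_ratio(bpb):
    """Raw text costs 8 bits per byte, so the ratio is 8 / bpb."""
    return 8.0 / bpb


# A model that spends ln(2) nats (exactly 1 bit) per byte scores 1.0 bpb.
example_bpb = bits_per_byte(total_nats=1000 * math.log(2), total_bytes=1000)
print(example_bpb, compression_ratio(example_bpb))
```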
217
+
218
+ ### Evaluation Details
219
+
220
+ ```python
221
+ def evaluate_bpb(model, tokenizer, batch_size):
222
+     """Fixed evaluation on pinned validation shard."""
223
+     steps = EVAL_TOKENS // (batch_size * MAX_SEQ_LEN)
224
+     total_nats = 0.0
225
+     total_bytes = 0
226
+
227
+     for _, (inputs, targets) in zip(range(steps), validation_batches):  # stop after the fixed token budget
228
+         # Forward pass with per-token loss (reduction='none')
229
+         per_token_loss = model(inputs, targets)  # cross-entropy, nats
230
+
231
+         # Look up byte count for each target token (token_bytes: byte length per token id)
232
+         token_byte_lengths = token_bytes[targets]
233
+
234
+         # Mask out special tokens (byte length 0)
235
+         mask = token_byte_lengths > 0
236
+
237
+         total_nats += (per_token_loss * mask).sum()
238
+         total_bytes += token_byte_lengths[mask].sum()
239
+
240
+     return total_nats / (math.log(2) * total_bytes)  # nats -> bits per byte; needs `import math`
241
+ ```
242
+
243
+ Key properties:
244
+ - Fixed validation shard (6542) — never used for training
245
+ - Fixed token count — exactly 20,971,520 tokens (H100) or 1,572,864 (MLX)
246
+ - Per-token loss with byte-length weighting — not averaged over tokens
247
+ - Special token masking — tokens with 0 byte length excluded
248
+ - Deterministic — same model always produces the same score
249
+
250
+ ## Key Rules
251
+
252
+ ### NEVER STOP
253
+
254
+ The agent runs indefinitely until manually interrupted. No "should I keep going?" — the human might be asleep for 8+ hours. If the agent runs out of ideas, it should:
255
+
256
+ 1. Re-read train.py, prepare.py, and results.tsv
257
+ 2. Look at near-misses (experiments that almost improved)
258
+ 3. Try combining two near-miss ideas
259
+ 4. Try radical architectural changes
260
+ 5. Read papers (if the agent has access)
261
+ 6. Try the opposite of what's been working
262
+ 7. Try parameter sweeps in unexplored ranges
263
+ 8. Try removing things
264
+
265
+ The design assumes overnight operation: 60-80 experiments without any human interaction.
266
+
267
+ ### Crash Handling Protocol
268
+
269
+ ```
270
+ IF grep "^val_bpb:" run.log returns empty:
271
+   1. Read tail -n 50 run.log
272
+   2. Diagnose the failure
273
+   3. IF simple bug (typo, import, off-by-one):
274
+        Fix it, recommit, retry
275
+   4. IF fundamental issue (OOM, architectural impossibility):
276
+        Log as crash (val_bpb=0.000000, memory_gb=0.0)
277
+        Revert and move on
278
+   5. Record in results.tsv with status=crash
279
+ ```
280
+
281
+ The agent uses judgment. A missing import is worth fixing. An idea that causes OOM on every variation is worth abandoning. The key is: don't get stuck. Log it, learn from it, move on.
282
+
283
+ ### Timeout Handling
284
+
285
+ | Phase | Expected | Timeout Threshold | Action |
286
+ |-------|----------|-------------------|--------|
287
+ | Training | 300s | 600s (H100) / 900s (MLX) | Kill process, treat as failure |
288
+ | Full cycle | ~7 min | 15 min | Kill, revert, log as crash |
289
+
290
+ The training loop has a built-in fast-fail: if `train_loss > 100` at any point, exit immediately with code 1. This catches divergence early instead of wasting 5 minutes.
291
+
292
+ ### Results Logging
293
+
294
+ `results.tsv` — tab-separated, NOT comma. Commas are explicitly prohibited in descriptions.
295
+
296
+ ```
297
+ commit val_bpb memory_gb status description
298
+ a1b2c3d 0.997900 44.0 keep baseline
299
+ b2c3d4e 0.993200 44.2 keep increase LR to 0.04
300
+ c3d4e5f 1.005000 44.0 discard switch to GeLU activation
301
+ d4e5f6a 0.000000 0.0 crash double model width (OOM)
302
+ e5f6a7b 0.995100 44.1 discard add residual scaling (marginal + complex)
303
+ ```
304
+
305
+ **Dual tracking**: results.tsv logs ALL experiments (keep/discard/crash). Git branch contains ONLY kept commits. This means you have both:
306
+ - The **clean improvement history** (git log) — what the system converged to
307
+ - The **full exploration log** (results.tsv) — what was tried, what failed, why
308
+
309
+ The agent reads BOTH to plan next experiments. The failures are as informative as the successes.
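Reading the exploration log back can be sketched with the standard `csv` module. This assumes the five-column header shown above and real tab delimiters; `load_results`, `best_kept`, `near_misses`, and the `margin` threshold are hypothetical names and values chosen for illustration, not part of the original tooling.

```python
import csv
import io


def load_results(tsv_text):
    """Parse results.tsv content into a list of dict rows (tab-delimited)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        row["val_bpb"] = float(row["val_bpb"])
        row["memory_gb"] = float(row["memory_gb"])
        rows.append(row)
    return rows


def best_kept(rows):
    """Lowest val_bpb among kept experiments (the current ratchet position)."""
    kept = [r for r in rows if r["status"] == "keep"]
    return min(kept, key=lambda r: r["val_bpb"])


def near_misses(rows, margin=0.003):
    """Discarded runs within `margin` of the best kept score (assumed threshold)."""
    best = best_kept(rows)["val_bpb"]
    return [r for r in rows
            if r["status"] == "discard" and r["val_bpb"] - best <= margin]
```

Crash rows carry `val_bpb=0.000000`, so filtering on `status` rather than on the score keeps them from being mistaken for the best result.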
310
+
311
+ ## Model Architecture (train.py)
312
+
313
+ The mutable code starts as a GPT-2-style transformer. Everything below is the default starting point — the agent can change any of it.
314
+
315
+ ### GPTConfig
316
+
317
+ ```python
318
+ @dataclass
319
+ class GPTConfig:
320
+     n_layer: int                  # computed from DEPTH (fields without defaults must come first)
321
+     n_head: int                   # computed from n_embd / HEAD_DIM
322
+     n_embd: int                   # computed from DEPTH * ASPECT_RATIO
323
+     vocab_size: int = 8192
324
+     max_seq_len: int = 2048       # matches MAX_SEQ_LEN in constants
325
+     head_dim: int = 64            # per-head dimension
326
+     window_pattern: str = "SSSL"  # attention pattern per layer
328
+
329
+ Model dimension is computed: `n_embd = ((DEPTH * ASPECT_RATIO + HEAD_DIM - 1) // HEAD_DIM) * HEAD_DIM` — rounded up to head_dim boundary.
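The rounding rule can be checked with a few lines of Python. This is a sketch; the function names are illustrative, and the `ASPECT_RATIO` values used in the example calls are assumptions, since the source does not state the default.

```python
def compute_n_embd(depth, aspect_ratio, head_dim=64):
    """Round depth * aspect_ratio up to the next multiple of head_dim."""
    return ((depth * aspect_ratio + head_dim - 1) // head_dim) * head_dim


def compute_n_head(n_embd, head_dim=64):
    """Number of attention heads implied by the model width."""
    return n_embd // head_dim


# Already a multiple of 64: unchanged. Otherwise: rounded up.
print(compute_n_embd(8, 96))    # 768
print(compute_n_embd(8, 100))   # 800 rounds up to 832
```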
330
+
331
+ ### Attention Mechanism
332
+
333
+ - **Separate Q/K/V projections** (not fused)
334
+ - **RoPE positional encoding** applied post-projection. Base theta defaults to 10K (agent discovered 100K is better)
335
+ - **QK normalization** — queries and keys normalized before attention. Agent discovered post-norm scaling (`q, k *= 1.15`) helps for "sharper attention"
336
+ - **Value Embeddings (VE)** — alternating layers add gated embeddings: `v = v + gate * ve` where `gate = 2 * sigmoid(linear(x))`. The gate channels and scale range are tunable
337
+ - **Sliding window attention** — configurable per layer via `window_pattern` string. `"S"` = short-range (causal window), `"L"` = long-range (full context). Default `"SSSL"` = three short + one long. Agent discovered the window was too conservative
338
+ - **Mask caching** by `(seq_len, window_size)` tuple to avoid recomputation
339
+ - **Logit softcap**: `logits = cap * tanh(logits / cap)` where cap defaults to 20 (agent discovered 15 is better)
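The logit softcap in the last bullet is easy to see on scalars. A minimal sketch, using plain floats rather than tensors, with `softcap` as an illustrative name:

```python
import math


def softcap(logit, cap=20.0):
    """Smoothly bound a logit: cap * tanh(logit / cap)."""
    return cap * math.tanh(logit / cap)


# Small logits pass through almost unchanged; large ones saturate near +/- cap.
for value in (0.0, 1.0, 10.0, 50.0, -1000.0):
    print(value, softcap(value))
```

A smaller cap, such as the 15 the agent settled on, begins compressing logits at smaller magnitudes.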
340
+
341
+ ### MLP
342
+
343
+ - Configurable expansion factor (4x on H100, 3x on MLX — agent discovered 3x beats 4x)
344
+ - **Squared ReLU** activation: `max(x, 0)^2`
345
+ - No bias terms
346
+ - Agent discovered initializing `c_fc` weights 0.5x smaller improves training
347
+
348
+ ### Block Structure
349
+
350
+ ```python
351
+ x = x + attn(norm(x), ve, mask) # pre-norm attention with value embeddings
352
+ x = x + mlp(norm(x)) # pre-norm MLP
353
+ ```
354
+
355
+ ### Residual Interpolation (MLX variant)
356
+
357
+ Learnable per-layer interpolation between residual stream and initial embedding:
358
+
359
+ ```python
360
+ x = resid_lambdas[i] * x + x0_lambdas[i] * x0
361
+ ```
362
+
363
+ Where `resid_lambdas` init to 1.0 and `x0_lambdas` init to 0.1. This allows direct paths from the input embedding to any layer, which helps with gradient flow in deeper models.
364
+
365
+ ### Weight Initialization
366
+
367
+ | Parameter | Initialization | Scale |
368
+ |-----------|----------------|-------|
369
+ | Token embeddings | Normal(0, std) | std=1.0 (agent found 0.8 better) |
370
+ | Output projection (lm_head) | Normal(0, 0.001) | Very small |
371
+ | Q/K/V/MLP input weights | Uniform(-s, s) | s = sqrt(3) * n_embd^(-0.5) |
372
+ | Projection output weights | Zero | Zero-init for residual |
373
+ | resid_lambdas | 1.0 | Full residual |
374
+ | x0_lambdas | 0.1 | Slight skip connection |
375
+ | Value embeddings | Uniform(-s, s) | Same as Q/K/V |
376
+
377
+ ### Default Model Sizes
378
+
379
+ | Config | DEPTH=8 (H100 default) | DEPTH=4 (MLX optimal) |
380
+ |--------|------------------------|----------------------|
381
+ | Parameters | ~50M | ~21M |
382
+ | Steps in 5 min | ~46 | ~92 |
383
+ | n_embd | ~768 | ~512 |
384
+
385
+ The MLX agents universally discovered that DEPTH=4 beats DEPTH=8: less than half the parameters (~21M vs ~50M) but about 2x the training steps. **More optimizer steps beat more parameters when compute time is fixed.**
386
+
387
+ ### Output Format
388
+
389
+ Training produces structured output parsed by the agent:
390
+
391
+ ```
392
+ ---
393
+ val_bpb: 0.997900
394
+ training_seconds: 300.1
395
+ total_seconds: 325.9
396
+ peak_vram_mb: 45060.2
397
+ mfu_percent: 39.80
398
+ total_tokens_M: 499.6
399
+ num_steps: 953
400
+ num_params_M: 50.3
401
+ depth: 8
402
+ ```
403
+
404
+ The agent extracts metrics via `grep "^val_bpb:\|^peak_vram_mb:" run.log`.
405
+
406
+ ## Optimizer System
407
+
408
+ ### Parameter Groups (4 groups with differentiated learning rates)
409
+
410
+ The optimizer doesn't treat all parameters equally. This is one of the most important design choices:
411
+
412
+ | Group | Which Parameters | Learning Rate | Weight Decay | Why Separate |
413
+ |-------|-----------------|---------------|-------------|--------------|
414
+ | **Matrix** | 2D weight matrices (not embed/unembed) | `MATRIX_LR` (0.04) | 0.2 | These are the bulk of the model; benefit from Muon |
415
+ | **Embedding** | `tok_embed`, `embed_v` (value embeddings) | `EMBEDDING_LR` (0.6) | None | Embeddings need high LR, no decay |
416
+ | **Unembedding** | `lm_head` | `UNEMBEDDING_LR` (0.004) | None | Output projection is sensitive |
417
+ | **Scalar** | 1D parameters (norms, biases) | `SCALAR_LR` (0.5) | Custom | Small parameters, high LR |
418
+
419
+ **Dimension scaling**: `dmodel_lr_scale = (model_dim / 768) ** -0.5`, so learning rates scale with the inverse square root of model width.
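A short sketch of the dimension scaling factor follows. The text does not say which parameter groups the factor is applied to, so `scale_lr` below is a hypothetical helper that scales whatever base rate it is given.

```python
def dmodel_lr_scale(model_dim, ref_dim=768):
    """Scale factor relative to the 768-wide reference model."""
    return (model_dim / ref_dim) ** -0.5


def scale_lr(base_lr, model_dim):
    """Hypothetical helper: apply the width-dependent factor to one base LR."""
    return base_lr * dmodel_lr_scale(model_dim)


print(dmodel_lr_scale(768))    # reference width: factor 1.0
print(dmodel_lr_scale(3072))   # 4x wider: factor 0.5
print(dmodel_lr_scale(192))    # 4x narrower: factor 2.0
```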
420
+
421
+ The agent discovered that using per-group Adam betas and weight decay (instead of shared global values) was a significant improvement. Karpathy: "It found that AdamW betas were all messed up."
422
+
423
+ ### MuonAdamW Hybrid Optimizer
424
+
425
+ Two update strategies in one optimizer:
426
+
427
+ **AdamW** (for embeddings, unembedding, scalars):
428
+ ```python
429
+ # Standard Adam with decoupled weight decay
430
+ m = beta1 * m + (1 - beta1) * grad # first moment
431
+ v = beta2 * v + (1 - beta2) * grad**2 # second moment
432
+ m_hat = m / (1 - beta1**t) # bias correction
433
+ v_hat = v / (1 - beta2**t)
434
+ param = param * (1 - lr * weight_decay) # decoupled decay
435
+ param = param - lr * m_hat / (sqrt(v_hat) + eps) # Adam step
436
+ ```
437
+
438
+ Default AdamW hyperparameters:
439
+ | Parameter | Default |
440
+ |-----------|---------|
441
+ | beta1 | 0.8 |
442
+ | beta2 | 0.95 |
443
+ | weight_decay | 0.2 |
444
+ | eps | 1e-8 |
445
+
446
+ **Muon** (for weight matrices — the bulk of the model):
447
+
448
+ An optimizer that uses Newton-Schulz iterations to orthogonalize (precondition) the update for each weight matrix. This is what makes the H100 version fast: Muon gets more signal per gradient step than AdamW alone.
449
+
450
+ ```python
+ # Newton-Schulz preconditioning (simplified cubic form)
+ X = G / (norm(G) + eps)            # scale the gradient so all singular values are <= 1
+ for i in range(NS_STEPS):
+     X = X @ (3*I - X.T @ X) / 2    # iterative polar decomposition (I is the identity)
+ # Apply the preconditioned gradient X with momentum
+ ```
457
+
458
+ The Newton-Schulz iteration approximately orthogonalizes the gradient matrix: it pushes all singular values toward 1 while keeping the singular vectors, so every direction in the update is scaled equally (a form of whitening). The effect resembles a second-order preconditioner, but it costs only a few matrix multiplications per step.
459
+
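The sketch below runs the cubic iteration from the pseudocode above on a small matrix, using plain Python lists so it has no dependencies. It is an illustration under stated assumptions, not Muon's actual kernel: the helper names and the example matrix are hypothetical, the input is scaled by its Frobenius norm first so the iteration converges, and production implementations typically use tuned higher-degree polynomial coefficients instead of this textbook form.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz(g, steps=5):
    """Approximate the orthogonal polar factor of g with the cubic iteration."""
    frob = sum(x * x for row in g for x in row) ** 0.5
    x = [[v / (frob + 1e-7) for v in row] for row in g]   # singular values now <= 1
    n = len(x[0])
    for _ in range(steps):
        xtx = matmul(transpose(x), x)
        # factor = (3I - X^T X) / 2
        factor = [[(3.0 * (i == j) - xtx[i][j]) / 2 for j in range(n)] for i in range(n)]
        x = matmul(x, factor)
    return x

def orthogonality_error(x):
    """Largest entry of |X^T X - I|; zero means the columns are orthonormal."""
    xtx = matmul(transpose(x), x)
    n = len(xtx)
    return max(abs(xtx[i][j] - (1.0 if i == j else 0.0)) for i in range(n) for j in range(n))

g = [[2.0, 1.0], [0.0, 1.0], [1.0, -1.0]]   # illustrative 3x2 "gradient"
err_5 = orthogonality_error(newton_schulz(g, steps=5))
err_3 = orthogonality_error(newton_schulz(g, steps=3))
```

For this matrix, five iterations leave the columns orthonormal to well within 1e-4, while three iterations leave a visibly larger error. That is the accuracy-versus-speed tradeoff behind the `NS_STEPS` setting discussed below.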
460
+ Key Muon parameters:
461
+ | Parameter | H100 Default | MLX Optimal | Notes |
462
+ |-----------|-------------|-------------|-------|
463
+ | NS_STEPS | 5 | **3** | Fewer iterations = faster steps = more updates in 5 min |
464
+ | momentum | 0.95 | 0.95 | Baseline momentum |
465
+ | beta2 | 0.95 | — | Agent found 0.9 better on H100 |
466
+ | momentum_warmup | 0.95→0.97 | — | Over 400 steps on H100 |
467
+
468
+ **MLX discovery**: NS_STEPS=3 outperforms NS_STEPS=5 on Apple Silicon. This is the first documented tuning of Muon on this hardware. The reasoning: fewer iterations per step = faster steps = more total gradient updates in the fixed 5-minute budget.
469
+
470
+ **Hardware-dependent optimizer choice**: On the Mac Mini (constrained compute), Muon was a breakthrough. On M4 Max (more headroom), plain AdamW won. The system discovers hardware-appropriate configurations, not a single "best" configuration.
471
+
472
+ ### Learning Rate Schedule
473
+
474
+ Three phases controlled by `progress = total_training_time / TIME_BUDGET`:
475
+
476
+ ```
477
+ Phase 1: Warmup [0, WARMUP_RATIO] → linear ramp from 0 to 1.0
478
+ Phase 2: Steady state [WARMUP_RATIO, 1-WARMDOWN_RATIO] → constant at 1.0
479
+ Phase 3: Warmdown [1-WARMDOWN_RATIO, 1.0] → linear decay to FINAL_LR_FRAC
480
+ ```
481
+
482
+ | Parameter | H100 Default | H100 Optimized | MLX Default |
483
+ |-----------|-------------|----------------|-------------|
484
+ | WARMUP_RATIO | ratio-based | 40 absolute steps | 0.0 |
485
+ | WARMDOWN_RATIO | 0.5 | 0.65 | 0.5 |
486
+ | FINAL_LR_FRAC | 0.0 | 0.05 | 0.0 |
487
+
488
+ The agent discovered:
489
+ - **Non-zero FINAL_LR_FRAC (0.05)** — don't decay LR all the way to zero. Keep a small residual learning rate
490
+ - **Longer warmdown (0.65 vs 0.5)** — more gradual cooldown helps
491
+ - **Weight decay schedule**: linear → cosine decay was an improvement
492
+
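The three phases can be sketched as a single piecewise-linear function. The name `get_lr_multiplier` appears in the training loop, but this body is an assumption reconstructed from the phase description above, not the original source.

```python
def get_lr_multiplier(progress, warmup_ratio=0.0, warmdown_ratio=0.5, final_lr_frac=0.0):
    """Piecewise-linear LR multiplier for progress in [0, 1]."""
    progress = min(max(progress, 0.0), 1.0)
    if progress < warmup_ratio:
        return progress / warmup_ratio               # Phase 1: linear ramp from 0 to 1
    if progress <= 1.0 - warmdown_ratio:
        return 1.0                                   # Phase 2: constant
    # Phase 3: fraction of the warmdown still remaining (1.0 at its start, 0.0 at the end)
    remaining = (1.0 - progress) / warmdown_ratio
    return final_lr_frac + (1.0 - final_lr_frac) * remaining

mid_warmup = get_lr_multiplier(0.05, warmup_ratio=0.1)
mid_warmdown = get_lr_multiplier(0.75)
end_optimized = get_lr_multiplier(1.0, warmdown_ratio=0.65, final_lr_frac=0.05)
```

With the optimized settings the multiplier ends at 0.05 instead of 0, so the final steps still make small updates.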
493
+ ## Training Loop Internals
494
+
495
+ ### Time Budget Enforcement
496
+
497
+ ```python
+ import time
+
+ TIME_BUDGET = 300            # seconds of actual training
+ STARTUP_EXCLUDE_STEPS = 1    # exclude first step from timing (compilation overhead)
+
+ t0 = None
+ for step in range(max_steps):
+     # ... forward, backward, optimizer step ...
+
+     if step == STARTUP_EXCLUDE_STEPS:
+         t0 = time.perf_counter()  # start timing AFTER compilation
+
+     if t0 is not None:
+         total_training_time = time.perf_counter() - t0
+         if step >= STARTUP_EXCLUDE_STEPS and total_training_time >= TIME_BUDGET:
+             break
+ ```
513
+
514
+ Wall-clock measurement EXCLUDES: model initialization, first-step compilation, data loading setup, final evaluation.
515
+ Wall-clock measurement INCLUDES: forward passes, backward passes, optimizer steps, gradient accumulation.
516
+
517
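+ This means every experiment gets the same budget of actual training compute, regardless of initialization overhead. The run stops at the first check after the budget is reached, so the measured training time can slightly exceed `TIME_BUDGET`.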
518
+
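A scaled-down, runnable version of the loop shows the effect of excluding startup steps. This is a sketch only: `run_timed`, `fake_step`, the sleep durations, and the 0.1-second budget are hypothetical stand-ins for real training.

```python
import time

def run_timed(train_step, time_budget, startup_exclude_steps=1, max_steps=10**9):
    """Run train_step until time_budget seconds of timed work have elapsed."""
    t0 = None
    total_training_time = 0.0
    steps_done = 0
    for step in range(max_steps):
        train_step(step)
        steps_done += 1
        if step == startup_exclude_steps:
            t0 = time.perf_counter()          # the clock starts after the startup steps
        if t0 is not None:
            total_training_time = time.perf_counter() - t0
            if total_training_time >= time_budget:
                break
    return steps_done, total_training_time

def fake_step(step):
    # Step 0 stands in for slow first-step compilation; later steps are fast.
    time.sleep(0.2 if step == 0 else 0.01)

steps, timed = run_timed(fake_step, time_budget=0.1)
```

The slow first step takes longer than the whole budget, yet the run still performs several timed steps because that startup cost never reaches the clock.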
519
+ ### Progress-Based Scheduling
520
+
521
+ The LR schedule is driven by wall-clock progress, not step count:
522
+
523
+ ```python
524
+ progress = min(total_training_time / TIME_BUDGET, 1.0)
525
+ lr_multiplier = get_lr_multiplier(progress)
526
+ ```
527
+
528
+ This is important because different configurations produce different step counts in the same 5 minutes. A model that takes roughly 2x longer per step (for example, one with about 2x the parameters) gets about half the steps, but the LR schedule still spans the full training run proportionally.
529
+
530
+ ### Fast-Fail Detection
531
+
532
+ ```python
+ if train_loss > 100:
+     sys.exit(1)  # divergence detected, abort immediately
+ ```
536
+
537
+ Don't waste 5 minutes on a diverged run. If loss explodes, bail immediately.
538
+
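One detail worth knowing when adapting this check: in Python, `float("nan") > 100` evaluates to `False`, so a threshold comparison alone does not catch a NaN loss. The sketch below adds a finiteness test. `should_abort` is a hypothetical helper, and whether the original script handles NaN separately is not stated here.

```python
import math

def should_abort(train_loss, threshold=100.0):
    """True when the loss has diverged: not finite (NaN or inf) or above the threshold."""
    return (not math.isfinite(train_loss)) or train_loss > threshold

checks = {
    "healthy": should_abort(2.5),
    "exploded": should_abort(101.0),
    "nan": should_abort(float("nan")),
    "inf": should_abort(float("inf")),
}
```

A caller would invoke `sys.exit(1)` when the helper returns `True`, matching the fast-fail behavior above.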
539
+ ### Smoothed Loss Tracking
540
+
541
+ ```python
+ smoothed_loss = beta * smoothed_loss + (1 - beta) * loss     # EMA with beta=0.9
+ smoothed_loss_debiased = smoothed_loss / (1 - beta**step)    # early-step correction (step starts at 1)
+ ```
545
+
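The debiasing term matters because the EMA starts at zero and would otherwise under-report the loss for the first few dozen steps. A minimal sketch, assuming a 1-based step counter and a zero initial value (the `smooth` helper and the constant loss are illustrative):

```python
def smooth(losses, beta=0.9):
    """Return (raw EMA values, bias-corrected EMA values) for a loss sequence."""
    smoothed = 0.0
    raw, debiased = [], []
    for step, loss in enumerate(losses, start=1):
        smoothed = beta * smoothed + (1 - beta) * loss
        raw.append(smoothed)
        debiased.append(smoothed / (1 - beta ** step))
    return raw, debiased

# With a constant loss of 4.0, the corrected value is right from step 1.
raw, debiased = smooth([4.0] * 5)
```

On a constant loss the raw EMA starts near 0.4 and climbs slowly toward 4.0, while the corrected value reads 4.0 at every step.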
546
+ ### Memory Management (MLX-specific)
547
+
548
+ ```python
549
+ gc.collect()
550
+ gc.freeze() # freeze all existing objects (exclude from GC)
551
+ gc.disable() # disable GC during training
552
+
553
+ # Every 5000 steps:
554
+ gc.collect() # manual collection to prevent memory drift
555
+ ```
556
+
557
+ Aggressive GC management is critical on unified memory hardware where training and system share the same pool.
558
+
559
+ ### Batch Size Configuration
560
+
561
+ | Parameter | H100 Default | MLX Default | MLX Optimal |
562
+ |-----------|-------------|-------------|-------------|
563
+ | TOTAL_BATCH_SIZE | 2^17 | 2^16 (65,536 tokens) | 2^14 (16,384 tokens) |
564
+ | DEVICE_BATCH_SIZE | — | 16 sequences | — |
565
+ | GRAD_ACCUM_STEPS | — | computed | — |
566
+
567
+ ```python
568
+ grad_accum_steps = TOTAL_BATCH_SIZE // (DEVICE_BATCH_SIZE * MAX_SEQ_LEN)
569
+ ```
570
+
571
+ The agents discovered smaller batch sizes outperform: `TOTAL_BATCH_SIZE 2^14-2^13` beat `2^17` by fitting more gradient steps in the time budget. This is the same insight as DEPTH=4 vs DEPTH=8: **in a fixed time budget, more steps > more tokens per step.**
572
+
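As a worked example of the accumulation formula, the sketch below computes the number of micro-batches for two configurations. `MAX_SEQ_LEN = 2048` and the device batch size of 8 in the second case are assumptions for illustration; neither is stated in this section.

```python
def grad_accum_steps(total_batch_size, device_batch_size, max_seq_len):
    """Micro-batches to accumulate so one optimizer step sees total_batch_size tokens."""
    tokens_per_micro_batch = device_batch_size * max_seq_len
    if total_batch_size % tokens_per_micro_batch != 0:
        raise ValueError("TOTAL_BATCH_SIZE must be a multiple of DEVICE_BATCH_SIZE * MAX_SEQ_LEN")
    return total_batch_size // tokens_per_micro_batch

# MLX default from the table: 2^16 tokens per step, 16 sequences per device batch.
default_accum = grad_accum_steps(2**16, device_batch_size=16, max_seq_len=2048)
# A 2^14-token step only divides evenly with a smaller device batch (assumed 8 here).
small_accum = grad_accum_steps(2**14, device_batch_size=8, max_seq_len=2048)
```

Under these assumptions the default needs two micro-batches per optimizer step and the smaller configuration needs one, so each optimizer step costs less wall-clock time.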
573
+ ## Data Pipeline (prepare.py — Immutable)
574
+
575
+ ### Dataset
576
+
577
+ `karpathy/climbmix-400b-shuffle` from HuggingFace:
578
+ - 6,543 parquet shards total
579
+ - Training: shards 0-6,541
580
+ - Validation: shard 6,542 (pinned, never used for training)
581
+
582
+ ### Download System
583
+
584
+ ```python
+ def download_data():
+     """Parallel download with retry logic."""
+     pool = multiprocessing.Pool(processes=8)
+     # For each shard:
+     #   1. Download to .tmp file
+     #   2. Atomic rename on success
+     #   3. Skip if already exists (resumable)
+     #   4. Retry with exponential backoff (3-5 attempts, wait 2^attempt seconds)
+ ```
594
+
595
+ Key properties:
596
+ - **Atomic writes**: download to `.tmp`, rename on completion — prevents corruption from interrupted downloads
597
+ - **Resumable**: skips already-downloaded files
598
+ - **Parallel**: 8 workers for throughput
599
+ - **Cached**: `~/.cache/autoresearch/data/`
600
+
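The per-shard logic behind these properties can be sketched as follows. This is not the code from `prepare.py`: `download_with_retry` is a hypothetical helper, the `fetch` callable stands in for the real HTTP download, and the injectable `sleep` argument exists only so the backoff can be exercised without waiting.

```python
import os
import time

def download_with_retry(fetch, dest_path, max_attempts=5, sleep=time.sleep):
    """Download one shard atomically. fetch(tmp_path) must write the file (hypothetical interface)."""
    if os.path.exists(dest_path):
        return False                              # resumable: already downloaded
    tmp_path = dest_path + ".tmp"
    for attempt in range(1, max_attempts + 1):
        try:
            fetch(tmp_path)
            os.replace(tmp_path, dest_path)       # atomic rename on the same filesystem
            return True
        except OSError:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)               # never leave a partial file behind
            if attempt == max_attempts:
                raise
            sleep(2 ** attempt)                   # exponential backoff: 2, 4, 8, ... seconds
```

Because the final name only appears after a successful rename, a reader of the cache directory can treat any file without the `.tmp` suffix as complete. Parallelism would sit one level up, for example by mapping this function over shard indices with a worker pool.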
601
+ ### Tokenizer
602
+
603
+ - **BPE** via `rustbpe` library (Rust-based, fast)
604
+ - **Vocabulary size**: 8,192 (minus 4 special tokens = 8,188 mergeable ranks)
605
+ - Trained on ~1 billion characters from training shards
606
+ - Integrated with `tiktoken` for encoding/decoding
607
+ - Produces: `tokenizer.pkl` (tiktoken Encoding) and `token_bytes.npy`/`token_bytes.pt` (maps token IDs → UTF-8 byte lengths)
608
+ - Special tokens have byte length 0 (excluded from BPB)
609
+
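The byte-length table is what turns a per-token loss into bits per byte (BPB). A minimal sketch of that calculation follows, assuming the per-token losses are negative log-likelihoods in nats and that zero-length (special) tokens are dropped from both the loss sum and the byte count. The `bits_per_byte` helper and the toy vocabulary are hypothetical.

```python
import math

def bits_per_byte(token_nlls, target_ids, token_bytes):
    """BPB = total loss in nats / (ln 2 * total UTF-8 bytes of the target tokens)."""
    total_nats = 0.0
    total_bytes = 0
    for nll, tok in zip(token_nlls, target_ids):
        n_bytes = token_bytes[tok]
        if n_bytes > 0:                  # special tokens have byte length 0 and are skipped
            total_nats += nll
            total_bytes += n_bytes
    return total_nats / (math.log(2) * total_bytes)

# Toy vocabulary: token 0 is a special token, token 1 is 1 byte, token 2 is 4 bytes.
token_bytes = [0, 1, 4]
targets = [1, 2, 0]
nlls = [math.log(2), 4 * math.log(2), 5.0]    # the special token's loss is ignored
bpb = bits_per_byte(nlls, targets, token_bytes)
```

Normalizing by bytes rather than tokens is what keeps the metric comparable across tokenizers: a tokenizer with longer tokens pays a higher loss per token but covers more bytes with each one.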
610
+ ### BOS-Aligned Best-Fit Packing
611
+
612
+ The dataloader fills every training row completely, so no positions are spent on padding:
613
+
614
+ ```python
+ def make_dataloader(tokenizer, batch_size, seq_len, split, buffer_size=1000):
+     """
+     1. Tokenize documents, prepend BOS to each
+     2. Buffer 1000 tokenized documents
+     3. Best-fit selection: pack documents into rows of exactly seq_len+1 tokens
+     4. Yield (inputs, targets, epoch) where:
+        - inputs  = positions [0, seq_len)
+        - targets = positions [1, seq_len+1)
+     """
+ ```
625
+
626
+ Properties:
627
+ - BOS token prepended to each document — every document boundary is marked
628
+ - Best-fit selection minimizes wasted tokens (unlike fixed-length chunking)
629
+ - Deterministic: same shard order produces identical batches
630
+ - Row group processing: reads parquet by row group for memory efficiency
631
+ - Infinite iterator with epoch tracking
632
+
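One plausible reading of the best-fit step is sketched below: for each row, repeatedly take the largest buffered document that still fits, and crop a document when nothing fits. The `pack_rows` helper and the cropping rule are assumptions for illustration; the actual `prepare.py` logic may differ in its details.

```python
def pack_rows(docs, row_len):
    """Greedy best-fit packing of token lists into rows of exactly row_len tokens.

    docs: list of token-id lists, each already starting with BOS.
    Returns complete rows only; a trailing partial row is dropped.
    """
    buffer = [list(d) for d in docs]
    rows = []
    while buffer:
        row = []
        while len(row) < row_len and buffer:
            remaining = row_len - len(row)
            fitting = [i for i, d in enumerate(buffer) if len(d) <= remaining]
            if fitting:
                # Best fit: the largest document that still fits entirely
                best = max(fitting, key=lambda i: len(buffer[i]))
                row.extend(buffer.pop(best))
            else:
                # Nothing fits: crop the shortest document to fill the row exactly
                shortest = min(range(len(buffer)), key=lambda i: len(buffer[i]))
                row.extend(buffer.pop(shortest)[:remaining])
        if len(row) == row_len:
            rows.append(row)
    return rows

BOS = 0
docs = [[BOS, 1, 2], [BOS, 3], [BOS, 4, 5, 6, 7], [BOS, 8]]
rows = pack_rows(docs, row_len=5)           # row_len plays the role of seq_len + 1
inputs, targets = rows[0][:-1], rows[0][1:]
```

In this sketch every position in a row holds a real token, but cropping discards the tail of a document, so not every source token is used.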
633
+ ### Cache Structure
634
+
635
+ ```
+ ~/.cache/autoresearch/
+ ├── data/
+ │   ├── shard_00000.parquet    (training)
+ │   ├── shard_00001.parquet
+ │   ├── ...
+ │   ├── shard_06541.parquet    (training)
+ │   └── shard_06542.parquet    (validation, pinned)
+ └── tokenizer/
+     ├── tokenizer.pkl          (tiktoken Encoding object)
+     └── token_bytes.npy/.pt    (vocab_size x int32, byte lengths)
+ ```
647
+
648
+ One-time setup: `uv run prepare.py` (~5 minutes, downloads data and trains tokenizer).
649
+
650
+ ## Agent Prompt Engineering (program.md)
651
+
652
+ The program.md is the most important file in the system. It's what turns a general-purpose AI agent into an autonomous researcher. Key design choices:
653
+
654
+ ### Structure (6 sections)
655
+
656
+ 1. **Monorepo Safety** — If in a monorepo, stage only the experiment directory paths. Never `git add -A`.
657
+
658
+ 2. **Setup Protocol** — Interactive initialization: branch creation, file reading, data verification, baseline establishment, human approval. This prevents the agent from running blind.
659
+
660
+ 3. **Experimentation Rules** — Hard constraints: only modify train.py, no new dependencies, no modifying prepare.py or constants, no changing evaluation function or time/sequence length constants.
661
+
662
+ 4. **Output Format** — What to grep for. The structured `---` block with `val_bpb:`, `peak_vram_mb:`, etc.
663
+
664
+ 5. **Logging Rules** — TSV format, no commas, status values (keep/discard/crash), crash logging convention (`val_bpb=0.000000`).
665
+
666
+ 6. **The Experiment Loop** — The autonomous cycle, verbatim.
667
+
668
+ ### Key Prompt Engineering Decisions
669
+
670
+ **The NEVER STOP principle** is repeated and emphasized. This is deliberate — without it, agents naturally pause to ask for confirmation, which defeats overnight operation.
671
+
672
+ **The simplicity criterion** is stated as a tiebreaker, not a primary objective. This prevents the agent from refusing to add code — it can add complexity if the improvement justifies it.
673
+
674
+ **The crash handling** is judgment-based ("fix simple bugs, skip fundamentally broken ideas"). This avoids both extremes: an agent that gives up at the first error, or an agent that retries the same broken idea forever.
675
+
676
+ **The idea exhaustion strategy** ("re-read files, try combining near-misses, try radical changes") prevents the agent from stalling when it runs out of obvious ideas. The instruction to "think harder" is deliberate — agents can often find ideas if they're told not to give up.
677
+
678
+ ## Real Results
679
+
680
+ ### Karpathy's 2-Day Run (H100)
681
+
682
+ 83 experiments, 15 kept improvements. Baseline: `val_bpb = 0.998`, 45.1 GB VRAM.
683
+
684
+ **Optimizer & schedule changes:**
685
+ - Unembedding LR: 0.004 → 0.008, weight decay: 0.2 → 0.28
686
+ - Per-group Adam betas and weight decay (instead of shared global)
687
+ - Muon beta2: 0.95 → 0.9, momentum warmup target: 0.95 → 0.97 over 400 steps
688
+ - Warmup: ratio-based → absolute steps (40)
689
+ - Warmdown ratio: 0.5 → 0.65, final LR fraction: 0.0 → 0.05
690
+ - Weight decay schedule: linear → cosine decay
691
+ - Polar express norm factor: 1.02 → 1.01
692
+
693
+ **Architecture & init changes:**
694
+ - VE gate: channels 32 → 12, scale range 2x → 3x, init small positive
695
+ - Post-QK-norm scaling (q,k *= 1.15) for sharper attention
696
+ - Embedding init std: 1.0 → 0.8, MLP c_fc init 0.5x smaller
697
+ - RoPE base theta: 10K → 100K
698
+ - Short attention window: seq_len/2 → ~seq_len/3 (ceil to 128 tile)
699
+ - Logit softcap: 20 → 15
700
+
701
+ Result: "Time to GPT-2" dropped from 2.02 hours to 1.80 hours (11% improvement).
702
+
703
+ Key quote: "The agent found multipliers to sharpen attention, pointing to future work. It found that Value Embeddings really like regularization and I wasn't applying any (oops). It found that my banded attention was too conservative (I forgot to tune it). It found that AdamW betas were all messed up."
704
+
705
+ What this means: the agent found bugs and missed tuning opportunities in code written by one of the world's foremost ML researchers. The improvements were real, not artifacts — they transferred to larger models and stacked additively.
706
+
707
+ ### MLX Port Overnight Results (Apple Silicon)
708
+
709
+ Three machines ran autonomously for 6-12 hours:
710
+
711
+ | Machine | Optimizer | Experiments | Best val_bpb | Improvement |
712
+ |---|---|---|---|---|
713
+ | M4 Max 128GB | AdamW | ~50 | 1.295 | 19% |
714
+ | M4 Max 128GB (#2) | AdamW + surface gates | ~30 | 1.339 | 17% |
715
+ | Mac Mini | Muon + AdamW | 30 | 1.462 | 24% |
716
+
717
+ Upstream H100 reference: val_bpb 0.998 in the same 5-minute budget.
718
+
719
+ ### Universal Discoveries (all machines converged)
720
+
721
+ - **DEPTH=4 over DEPTH=8**: Half the parameters, 2x training steps. Every machine found this independently — "more optimizer steps beats more parameters when compute time is fixed"
722
+ - **Smaller batch sizes**: 2^14-2^13 beat 2^17 — more gradient updates matter more than more tokens per update
723
+ - **Lean MLP**: 3x expansion beat 4x. On Mac Mini (most constrained), 2x was better
724
+ - **Schedule tuning**: WARMDOWN_RATIO and FINAL_LR_FRAC were significant everywhere
725
+
726
+ ### Hardware-Specific Discoveries
727
+
728
+ - **Muon is hardware-dependent**: breakthrough on Mac Mini (constrained compute), but plain AdamW won on M4 Max. The hypothesis: when you have plenty of memory/compute, AdamW's simplicity wins; when compute is tight, Muon's better gradient signal per step matters more
729
+ - **NS_STEPS=3 over NS_STEPS=5**: First documented Muon tuning on Apple Silicon. Fewer Newton-Schulz iterations = faster steps = more total updates
730
+ - **Same loop + different hardware = genuinely different optimal configurations**. That's the point — the system finds what's best for YOUR hardware, not a universal recipe
731
+
732
+ ### Anti-Patterns Discovered
733
+
734
+ These consistently failed across machines:
735
+
736
+ 1. **Increasing model size beyond optimal depth** — fewer training steps in fixed budget, net negative
737
+ 2. **Large batch sizes (2^17+)** — fewer gradient updates, optimizer progress stalls
738
+ 3. **Complex architectural changes with tiny gains** — failed simplicity criterion
739
+ 4. **Over-expanding MLP (4x+)** — computation cost not worth the extra capacity
740
+ 5. **Any change that reduces step count significantly** — the time budget makes step count critical
741
+
742
+ ### Common Successful Parameter Ranges
743
+
744
+ These are the ranges where agents found improvements across runs:
745
+
746
+ | Parameter | Explored Range | Typical Optimal |
747
+ |-----------|---------------|-----------------|
748
+ | DEPTH | 4-8 | 4 (universal on MLX) |
749
+ | WINDOW_PATTERN | "SL", "SSSL", "SSSSL" | "SSSL" |
750
+ | MLP expansion | 2x-4x | 3x (2x on constrained hw) |
751
+ | HEAD_DIM | 64-192 | 64-128 |
752
+ | TOTAL_BATCH_SIZE | 2^13-2^17 | 2^14 |
753
+ | MATRIX_LR | 0.01-0.1 | 0.04 |
754
+ | EMBEDDING_LR | 0.3-1.2 | 0.6 |
755
+ | WARMUP_RATIO | 0.0-0.1 | 0.0 |
756
+ | WARMDOWN_RATIO | 0.2-0.5 | 0.3-0.5 (the H100 run settled on 0.65) |
757
+
758
+ ## Monitoring
759
+
760
+ ### During a Run
761
+
762
+ Real-time progress in `run.log`:
763
+ ```
764
+ step 1 | loss 11.2345 | lr 1.2e-04 | 2.3k tokens/s
765
+ step 2 | loss 10.8734 | lr 2.4e-04 | 2.4k tokens/s
766
+ ...
767
+ step 92 | loss 2.1234 | lr 3.8e-05 | 2.3k tokens/s
768
+ ---
769
+ val_bpb: 1.534000
770
+ training_seconds: 312.4
771
+ total_seconds: 405.7
772
+ peak_vram_mb: 27528.9
773
+ mfu_percent: 0.00
774
+ total_tokens_M: 39.8
775
+ num_steps: 92
776
+ num_params_M: 21.3
777
+ depth: 4
778
+ ```
779
+
780
+ ### Simple Monitoring Script
781
+
782
+ ```bash
+ # Watch results accumulate
+ while true; do
+     echo "--- $(date) ---"
+     tail -5 results.tsv | column -t -s $'\t'
+     echo "Total: $(wc -l < results.tsv) experiments"
+     sleep 60
+ done
+ ```
791
+
792
+ ### Multi-Machine Runs
793
+
794
+ Branch naming convention: `autoresearch/<date>-<machine>`
795
+
796
+ ```bash
797
+ # Machine 1
798
+ git checkout -b autoresearch/mar5-m4max
799
+
800
+ # Machine 2
801
+ git checkout -b autoresearch/mar5-mini
802
+ ```
803
+
804
+ After overnight runs, compare branches. Both machines will discover universal improvements (DEPTH=4) and hardware-specific ones (optimizer choice). Cross-pollinate: try Machine 1's best config on Machine 2 and vice versa.
805
+
806
+ ## Adapting to Other Domains
807
+
808
+ The pattern generalizes to any optimization problem with:
809
+ 1. A mutable configuration/code (the "train.py")
810
+ 2. An objective metric that's efficient to evaluate
811
+ 3. A fixed budget per experiment
812
+ 4. A keep/revert mechanism (git)
813
+
814
+ ### Template
815
+
816
+ ```yaml
817
+ # autoresearch-config.yaml
818
+ name: "my-project"
819
+ metric: "the_metric_name" # what to optimize
820
+ metric_direction: "minimize" # or "maximize"
821
+
822
+ mutable_files:
823
+ - "the_file_agent_can_edit.py"
824
+
825
+ immutable_files:
826
+ - "evaluation.py" # metric computation, cannot be gamed
827
+ - "data_loader.py" # fixed data pipeline
828
+
829
+ run_command: "python train.py > run.log 2>&1"
830
+ eval_command: "grep '^metric:' run.log"
831
+
832
+ budget_minutes: 5 # fixed time per experiment
833
+ branch_prefix: "autoresearch" # git branch naming
834
+
835
+ rules:
836
+ - "Only modify files listed in mutable_files"
837
+ - "Do not install new dependencies"
838
+ - "Simpler solutions preferred over complex ones"
839
+ - "Run indefinitely until interrupted"
840
+ ```
841
+
842
+ ### Example Domains
843
+
844
+ **ML model training** (the original):
845
+ - Mutable: train.py (architecture, optimizer, hyperparams)
846
+ - Metric: val_bpb or val_loss
847
+ - Budget: 5 min per experiment
848
+
849
+ **Inference optimization** (like Gerbil's kernel optimizer):
850
+ - Mutable: config.toml, shader code, kernel parameters
851
+ - Metric: tokens/second, latency_p99
852
+ - Budget: 2 min per benchmark run
853
+
854
+ **Compiler/codegen optimization**:
855
+ - Mutable: optimization passes, code generation rules
856
+ - Metric: benchmark suite runtime
857
+ - Budget: 10 min (compile + bench)
858
+
859
+ **Growth/marketing**:
860
+ - Mutable: landing page copy, ad targeting config
861
+ - Metric: conversion rate
862
+ - Budget: hours (need traffic for statistical significance)
863
+
864
+ **Fine-tuning pipeline**:
865
+ - Mutable: training config (hyperparams, data mix, LoRA settings)
866
+ - Metric: composite eval score (pass rate + preference rate)
867
+ - Budget: 30-90 min per cloud training run
868
+
869
+ ### Key Considerations for Adaptation
870
+
871
+ 1. **Cycle time is king**. Karpathy gets ~80 experiments overnight because each takes ~7 minutes total. If your cycle is 90 minutes, you get ~16/day. Find proxy metrics that correlate with your true objective but evaluate faster.
872
+
873
+ 2. **The metric must be automatable**. Human judgment doesn't scale. Either automate evaluation entirely (val_bpb) or use an AI judge (Opus scoring responses). The metric must be a single scalar that the agent can compare with `<`.
874
+
875
+ 3. **The mutable surface should be small**. One file, or a small set of config values. If the agent can change everything, the search space explodes and improvements don't stack reliably. One-file diffs are reviewable; multi-file changes are opaque.
876
+
877
+ 4. **Git ratchet prevents regression**. This is critical. You never go backwards: a commit is kept only if its measured metric beats the previous best, so the branch head always holds the best result seen so far. This is what makes overnight operation safe.
878
+
879
+ 5. **The agent needs memory**. results.tsv + git log give the agent context on what worked and what didn't. Without this, the agent repeats failed experiments. The dual tracking (git for successes, TSV for everything) is essential.
880
+
881
+ 6. **Establish your own baseline**. Never use someone else's baseline numbers. Run the unmodified code on your hardware and measure. Different hardware, different step counts, different optimal configurations.
882
+
883
+ 7. **The time budget creates natural tradeoffs**. You don't need to manually balance model size vs. training steps — the fixed budget does it automatically. A bigger model gets fewer steps; the metric tells you which tradeoff wins.
884
+
885
+ 8. **Hardware-specific optimization is a feature, not a bug**. The same loop on different hardware discovers different optimal configurations. This is correct behavior — the best config for an H100 is not the best config for a Mac Mini.
886
+
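The keep/discard ratchet from point 4 can be sketched in a few lines. The `ratchet` helper and the sequence of experiment values are hypothetical; the crash value of 0.0 follows the `val_bpb=0.000000` logging convention described earlier, and the git commit or reset that would accompany each decision is left out.

```python
def ratchet(best, new_value, crashed=False, minimize=True):
    """Return (status, new_best) for one experiment."""
    if crashed:
        return "crash", best          # a crash never replaces the best, whatever value was logged
    improved = new_value < best if minimize else new_value > best
    return ("keep", new_value) if improved else ("discard", best)

best = 0.998                          # baseline val_bpb from the H100 run above
statuses = []
experiments = [(1.010, False), (0.990, False), (0.0, True), (0.995, False), (0.985, False)]
for value, crashed in experiments:
    status, best = ratchet(best, value, crashed)
    statuses.append(status)
```

The explicit crash branch matters when minimizing: a crashed run logged as 0.0 would otherwise look like the best result ever recorded.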
887
+ ## The Vision
888
+
889
+ From Karpathy:
890
+
891
+ > "All LLM frontier labs will do this. It's the final boss battle. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges."
892
+
893
+ > "Any metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm."
894
+
895
+ > "One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of 'group meeting'. That era is long gone."
896
+
897
+ ## References
898
+
899
+ - [karpathy/autoresearch](https://github.com/karpathy/autoresearch) — original repo
900
+ - [karpathy/nanochat](https://github.com/karpathy/nanochat) — the training codebase being optimized
901
+ - [nanochat commit 6ed7d1d](https://github.com/karpathy/nanochat/commit/6ed7d1d82cee16c2e26f45d559ad3338447a6c1b) — the stacked improvements from round 1
902
+ - [trevin-creator/autoresearch-mlx](https://github.com/trevin-creator/autoresearch-mlx) — Apple Silicon port
903
+ - [DeepWiki: autoresearch](https://deepwiki.com/karpathy/autoresearch) — detailed system documentation
904
+ - [DeepWiki: autoresearch-mlx](https://deepwiki.com/trevin-creator/autoresearch-mlx) — MLX port documentation