@tryhamster/gerbil 1.0.0-rc.9 → 1.0.0
- package/LICENSE +1 -1
- package/README.md +247 -84
- package/dist/architectures-C1I5V3Dt.mjs +6070 -0
- package/dist/architectures-C1I5V3Dt.mjs.map +1 -0
- package/dist/browser/index.d.ts +264 -588
- package/dist/browser/index.d.ts.map +1 -1
- package/dist/browser/index.js +585 -2334
- package/dist/browser/index.js.map +1 -1
- package/dist/cli.mjs +625 -1098
- package/dist/cli.mjs.map +1 -1
- package/dist/defaults-9komdrbY.mjs +24 -0
- package/dist/defaults-9komdrbY.mjs.map +1 -0
- package/dist/frameworks/express.d.mts +1 -3
- package/dist/frameworks/express.d.mts.map +1 -1
- package/dist/frameworks/express.mjs +7 -7
- package/dist/frameworks/express.mjs.map +1 -1
- package/dist/frameworks/fastify.d.mts +1 -1
- package/dist/frameworks/fastify.d.mts.map +1 -1
- package/dist/frameworks/fastify.mjs +3 -3
- package/dist/frameworks/fastify.mjs.map +1 -1
- package/dist/frameworks/hono.d.mts +1 -1
- package/dist/frameworks/hono.d.mts.map +1 -1
- package/dist/frameworks/hono.mjs +4 -4
- package/dist/frameworks/hono.mjs.map +1 -1
- package/dist/frameworks/next.d.mts +3 -2
- package/dist/frameworks/next.d.mts.map +1 -1
- package/dist/frameworks/next.mjs +4 -4
- package/dist/frameworks/next.mjs.map +1 -1
- package/dist/frameworks/react.d.mts +1 -1
- package/dist/frameworks/trpc.d.mts +1 -1
- package/dist/frameworks/trpc.d.mts.map +1 -1
- package/dist/frameworks/trpc.mjs +4 -4
- package/dist/frameworks/trpc.mjs.map +1 -1
- package/dist/gerbil-BHrJJIa4.mjs +1656 -0
- package/dist/gerbil-BHrJJIa4.mjs.map +1 -0
- package/dist/gerbil-BT9fCydo.d.mts +488 -0
- package/dist/gerbil-BT9fCydo.d.mts.map +1 -0
- package/dist/gerbil-DomNfIr1.mjs +4 -0
- package/dist/gpu/hooks.d.mts +520 -0
- package/dist/gpu/hooks.d.mts.map +1 -0
- package/dist/gpu/hooks.mjs +1188 -0
- package/dist/gpu/hooks.mjs.map +1 -0
- package/dist/gpu/index.d.mts +2 -0
- package/dist/gpu/index.mjs +6 -0
- package/dist/gpu-33qCAtHW.mjs +3615 -0
- package/dist/gpu-33qCAtHW.mjs.map +1 -0
- package/dist/index-Dgmb2kE3.d.mts +245 -0
- package/dist/index-Dgmb2kE3.d.mts.map +1 -0
- package/dist/index-jEAL2s-A.d.mts +2022 -0
- package/dist/index-jEAL2s-A.d.mts.map +1 -0
- package/dist/index.d.mts +22 -487
- package/dist/index.d.mts.map +1 -1
- package/dist/index.mjs +13 -8
- package/dist/index.mjs.map +1 -1
- package/dist/indexeddb-store-BWIMtxxH.mjs +103 -0
- package/dist/indexeddb-store-BWIMtxxH.mjs.map +1 -0
- package/dist/indexeddb-store-ClH12Xnl.mjs +4 -0
- package/dist/integrations/ai-sdk.d.mts +75 -6
- package/dist/integrations/ai-sdk.d.mts.map +1 -1
- package/dist/integrations/ai-sdk.mjs +131 -15
- package/dist/integrations/ai-sdk.mjs.map +1 -1
- package/dist/integrations/langchain.d.mts +1 -1
- package/dist/integrations/langchain.d.mts.map +1 -1
- package/dist/integrations/langchain.mjs +5 -5
- package/dist/integrations/langchain.mjs.map +1 -1
- package/dist/integrations/llamaindex.d.mts +1 -1
- package/dist/integrations/llamaindex.d.mts.map +1 -1
- package/dist/integrations/llamaindex.mjs +5 -5
- package/dist/integrations/llamaindex.mjs.map +1 -1
- package/dist/integrations/mcp-client.mjs +3 -3
- package/dist/integrations/mcp-client.mjs.map +1 -1
- package/dist/integrations/mcp.d.mts +3 -2
- package/dist/integrations/mcp.d.mts.map +1 -1
- package/dist/integrations/mcp.mjs +5 -5
- package/dist/{mcp-BvbriaBy.mjs → mcp-1DaMsaBc.mjs} +4 -4
- package/dist/mcp-1DaMsaBc.mjs.map +1 -0
- package/dist/memory/index.d.mts +3 -0
- package/dist/memory/index.mjs +6 -0
- package/dist/memory-D1P7Tmda.mjs +4 -0
- package/dist/memory-DVN0MnIG.mjs +132 -0
- package/dist/memory-DVN0MnIG.mjs.map +1 -0
- package/dist/memory-Dj0J1v88.mjs +294 -0
- package/dist/memory-Dj0J1v88.mjs.map +1 -0
- package/dist/moonshine-stt-BLyVoRpB.mjs +4 -0
- package/dist/moonshine-stt-v_P_Ci_m.mjs +11936 -0
- package/dist/moonshine-stt-v_P_Ci_m.mjs.map +1 -0
- package/dist/{one-liner-s-lD8rCC.mjs → one-liner-DnQn7HJK.mjs} +14 -16
- package/dist/one-liner-DnQn7HJK.mjs.map +1 -0
- package/dist/repl-jV5gcJFA.mjs +9 -0
- package/dist/skills/index.d.mts +270 -320
- package/dist/skills/index.d.mts.map +1 -1
- package/dist/skills/index.mjs +5 -5
- package/dist/{skills-CD3Orlex.mjs → skills-DX8D59UH.mjs} +187 -32
- package/dist/skills-DX8D59UH.mjs.map +1 -0
- package/dist/{tools-Bi1P7Xoy.mjs → tools-DQ1mPUw5.mjs} +34 -22
- package/dist/tools-DQ1mPUw5.mjs.map +1 -0
- package/dist/{types-CiTc7ez3.d.mts → types-D6FiR_oh.d.mts} +106 -12
- package/dist/types-D6FiR_oh.d.mts.map +1 -0
- package/dist/types-DQBe2lFo.d.mts +165 -0
- package/dist/types-DQBe2lFo.d.mts.map +1 -0
- package/dist/{utils-CZBZ8dgR.mjs → utils-DKO55ZmZ.mjs} +1 -1
- package/dist/{utils-CZBZ8dgR.mjs.map → utils-DKO55ZmZ.mjs.map} +1 -1
- package/dist/vector-B0panuy6.mjs +95 -0
- package/dist/vector-B0panuy6.mjs.map +1 -0
- package/docs/PROJECT-STATE.md +321 -0
- package/docs/adding-a-model-family.md +280 -0
- package/docs/ai-sdk.md +70 -61
- package/docs/architecture/overview.md +17 -7
- package/docs/browser.md +203 -8
- package/docs/embeddings.md +156 -0
- package/docs/gerbil-site-native-migration.md +217 -0
- package/docs/gpu-engine/architectures.md +398 -0
- package/docs/gpu-engine/ir.md +372 -0
- package/docs/gpu-engine/kernels.md +718 -0
- package/docs/gpu-engine/paper.html +1759 -0
- package/docs/gpu-engine/paper.md +2109 -0
- package/docs/gpu-engine/safetensors.md +312 -0
- package/docs/gpu-engine/tokenizer.md +302 -0
- package/docs/memory-rag.md +91 -0
- package/docs/metal-safari-intel.md +190 -0
- package/docs/mobile-failure-diagnosis.md +124 -0
- package/docs/mobile.md +99 -0
- package/docs/observability.md +230 -0
- package/docs/onnx-removal-plan.md +339 -0
- package/docs/research/autoresearch-portable.md +904 -0
- package/docs/research/dispatch-reduction-hivemind.md +84 -0
- package/docs/research/ios-safari-model-caching.md +117 -0
- package/docs/research/mobile-webgpu-speed-fusion.md +135 -0
- package/docs/research/native-stt-model-selection.md +49 -0
- package/docs/research/native-tts-model-selection.md +90 -0
- package/docs/research/native-vs-chromium-decision.md +152 -0
- package/docs/research/nemotron-mamba2-inference.md +910 -0
- package/docs/research/qwen35-multimodal.md +293 -0
- package/docs/research/qwen36-gemma4-targets.md +337 -0
- package/docs/research/sota-embedding-models.md +179 -0
- package/docs/research/sota-mobile-models-2026.md +263 -0
- package/docs/research/sota-modality-models.md +202 -0
- package/docs/research/tps-baselines.md +71 -0
- package/docs/research/webgpu-m4-reference.md +104 -0
- package/docs/site-update-plan.md +155 -0
- package/docs/structured-output.md +123 -0
- package/docs/stt.md +63 -446
- package/docs/tts.md +77 -499
- package/docs/vision.md +100 -338
- package/package.json +22 -7
- package/dist/chrome-backend-CORwaIyC.mjs +0 -1212
- package/dist/chrome-backend-CORwaIyC.mjs.map +0 -1
- package/dist/chrome-backend-DIKYoWj-.mjs +0 -3
- package/dist/gerbil-CJ3ifloF.mjs +0 -4
- package/dist/gerbil-Dw4Qj77e.mjs +0 -1631
- package/dist/gerbil-Dw4Qj77e.mjs.map +0 -1
- package/dist/gerbil-qOTe1nl2.d.mts +0 -431
- package/dist/gerbil-qOTe1nl2.d.mts.map +0 -1
- package/dist/kokoro-BNTb6egA.mjs +0 -20210
- package/dist/kokoro-BNTb6egA.mjs.map +0 -1
- package/dist/kokoro-CMOGDSgT.js +0 -20212
- package/dist/kokoro-CMOGDSgT.js.map +0 -1
- package/dist/mcp-BvbriaBy.mjs.map +0 -1
- package/dist/one-liner-s-lD8rCC.mjs.map +0 -1
- package/dist/repl-DveXw36T.mjs +0 -9
- package/dist/skills-CD3Orlex.mjs.map +0 -1
- package/dist/stt-Bu-E23Sc.js +0 -433
- package/dist/stt-Bu-E23Sc.js.map +0 -1
- package/dist/stt-CpLYbGFd.mjs +0 -433
- package/dist/stt-CpLYbGFd.mjs.map +0 -1
- package/dist/stt-DRPLEEHB.mjs +0 -3
- package/dist/tools-Bi1P7Xoy.mjs.map +0 -1
- package/dist/transformers.web-DiD1gTwk.js +0 -44695
- package/dist/transformers.web-DiD1gTwk.js.map +0 -1
- package/dist/transformers.web-u34VxRFM.js +0 -3
- package/dist/tts-CqroPaSK.js +0 -724
- package/dist/tts-CqroPaSK.js.map +0 -1
- package/dist/tts-DXgsKGCe.mjs +0 -3
- package/dist/tts-DeGANMNV.mjs +0 -730
- package/dist/tts-DeGANMNV.mjs.map +0 -1
- package/dist/types-CiTc7ez3.d.mts.map +0 -1
- /package/dist/{auto-update-S9s5-g0C.mjs → auto-update-BVaLXcDE.mjs} +0 -0
- /package/dist/{chunk-CkXuGtQK.mjs → chunk-B9cbKln6.mjs} +0 -0
- /package/dist/{microphone-DaMZFRuR.mjs → microphone-Bqmoz9_K.mjs} +0 -0
@@ -0,0 +1,904 @@
# Autoresearch: Complete Technical Reference

A portable reference for implementing autonomous research loops. Based on [Karpathy's autoresearch](https://github.com/karpathy/autoresearch) and the [MLX port](https://github.com/trevin-creator/autoresearch-mlx).

## Core Concept

An AI agent autonomously iterates on a single mutable file, running fixed-time experiments, keeping improvements and reverting failures. Git provides memory. A single metric provides the keep/revert signal. Humans sleep; the agent works.

Karpathy's first run: 83 experiments over ~2 days, 15 kept improvements, 11% speedup on the GPT-2 leaderboard. The agent found real improvements that a domain expert missed after years of manual tuning. All improvements were additive and transferred to larger models.

The key insight is NOT that AI can tune hyperparameters — it's that the loop structure turns any measurable problem into an autonomous hill-climbing search where the agent builds on its own history, reads its own failures, and discovers things humans don't think to try.

## Architecture

```
program.md    (immutable)  — agent instructions, goals, constraints
prepare.py    (immutable)  — data, evaluation, constants
constants.py  (immutable)  — TIME_BUDGET, MAX_SEQ_LEN, EVAL_TOKENS
train.py      (MUTABLE)    — the experiment subject
results.tsv   (append)     — complete experiment log (all attempts)
git branch    (ratcheted)  — only successful commits survive
run.log       (transient)  — stdout/stderr from current experiment
```

### The Three-File Contract

1. **program.md** — The agent's operating manual. Defines what to optimize, what rules to follow, what files are off-limits. The ONLY way humans influence the loop after launch. This is the "prompt" — the entire behavior of the system is determined by what's in this file. It has six sections: monorepo safety, setup protocol, experimentation rules, output format, logging rules, and the experiment loop.

2. **prepare.py** — Fixed infrastructure. Data loading, tokenization, the evaluation function. Agent CANNOT modify this. This prevents metric gaming and ensures all experiments are comparable. It also contains the immutable constants (`TIME_BUDGET`, `MAX_SEQ_LEN`, `EVAL_TOKENS`). The evaluation function is the oracle — if the agent could change it, the system would be meaningless.

3. **train.py** — The single mutable file. Agent can change anything: architecture, optimizer, hyperparameters, training loop, batch size, model size, activation functions, attention mechanisms, initialization schemes, learning rate schedules. The only constraint is it must run without crashing and produce output within the time budget.

### System Parameters

| Parameter | Value | Purpose |
|-----------|-------|---------|
| `MAX_SEQ_LEN` | 2048 tokens | Fixed sequence length for all experiments |
| `TIME_BUDGET` | 300 seconds | Training duration (wall-clock, excludes init/eval) |
| `EVAL_TOKENS` | 20,971,520 (H100) / 1,572,864 (MLX) | Validation evaluation budget |
| `VOCAB_SIZE` | 8,192 | BPE tokenizer vocabulary |
| `VAL_SHARD` | 6542 | Pinned validation shard (never trained on) |
| `MAX_SHARD` | 6542 | Highest shard index |

### Why This Works

- **Fixed time budget** — every experiment costs the same wall-clock time (5 min default). This is the crucial design choice. It eliminates the explore/exploit tradeoff around compute allocation. The agent can try radical changes freely because the worst case is 5 wasted minutes. It also means the system naturally discovers the right tradeoff between model size and training steps — a bigger model gets fewer steps, a smaller one gets more.

- **Single mutable file** — constrains search space, produces reviewable diffs, prevents the agent from accidentally breaking infrastructure. Every experiment is a single-file diff. You can `git log --stat` and see exactly what changed.

- **Single metric** — val_bpb (validation bits per byte). Unambiguous: lower is better. No composite scores, no human judgment needed, no weighting decisions. The agent never has to ask "is this better?" — it just compares two numbers.

- **Git as memory** — the commit log IS the research journal. The agent reads its own history to plan next experiments. Kept commits show what works; reverted commits (logged in results.tsv) show what doesn't. This dual-tracking is essential: the git branch is the clean history of validated improvements; results.tsv is the complete exploration log including all failures.

- **Ratchet mechanism** — the branch only advances on improvement. You can never regress. The current HEAD is always the best-known configuration. This is a monotonic improvement guarantee that makes the system safe to run unattended.

### Architectural Invariants

These are the properties that make the system trustworthy:

| Invariant | Enforcement | Rationale |
|-----------|-------------|-----------|
| Single mutable file | Agent instructions in program.md | Isolated experimental changes, reviewable diffs |
| Fixed time budget | `TIME_BUDGET` constant in immutable prepare.py | Hardware-specific optimization, equal cost per experiment |
| Fixed evaluation | `evaluate_bpb()` in immutable prepare.py | All experiments are comparable, no metric gaming |
| No new dependencies | Agent instructions | Prevents scope creep, environment remains stable |
| Single scalar metric | `val_bpb` only | Eliminates multi-objective complexity, unambiguous decisions |
| Git-based ratcheting | Keep via commit, discard via reset | Monotonic improvement, clean history |
| Immutable constants | `MAX_SEQ_LEN`, `VOCAB_SIZE`, `EVAL_TOKENS` | Consistent data processing across all experiments |
| Pinned validation data | Shard 6542 never used for training | No data leakage, stable evaluation |

## The Experiment Loop

### Verbatim Protocol (from program.md)

```
LOOP FOREVER:
  1. Look at the git state: the current branch/commit we're on
  2. Tune train.py with an experimental idea by directly hacking the code
  3. git commit
  4. Run the experiment: uv run train.py > run.log 2>&1
  5. Read out the results: grep "^val_bpb:\|^peak_vram_mb:" run.log
  6. If grep output is empty, the run crashed. Read tail -n 50 run.log
  7. Record the results in results.tsv
  8. If val_bpb improved (lower), keep the git commit
  9. If val_bpb is equal or worse, git reset HEAD~1
```

### Exact Git Commands

```bash
# Setup (once, at start of run)
git checkout -b autoresearch/<tag>        # e.g., autoresearch/mar5

# Each experiment
git add train.py
git commit -m "experiment: <description>"
uv run train.py > run.log 2>&1
grep "^val_bpb:\|^peak_vram_mb:" run.log

# If KEEP (val_bpb improved):
git add results.tsv
git commit --amend --no-edit              # fold results.tsv into experiment commit

# If DISCARD (val_bpb equal or worse):
git reset --hard <previous_kept_commit>   # revert to last known-good state
```

The amend-on-keep pattern is elegant: each kept commit contains both the code change AND the results that prove it worked. The branch reads as a clean research log.

### Cycle Timing

| Phase | H100 | Apple Silicon |
|-------|------|---------------|
| Training | 300s (5 min) | 300s (5 min) |
| Model init + compilation | ~30s | ~11s |
| Evaluation | ~30s | ~52s |
| Agent analysis + code edit | ~60s | ~60s |
| **Total per experiment** | ~7 min | ~7 min |
| **Experiments per hour** | ~8 | ~8-9 |
| **Overnight (10h)** | ~80 | ~80 |
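
The per-hour figures follow from summing the phase durations. The short check below uses the H100 and Apple Silicon values from the table; the function name is illustrative and not part of the original project.

```python
def experiments_per_hour(train_s, init_s, eval_s, agent_s):
    """Return (cycle length in seconds, experiments per hour)."""
    cycle_s = train_s + init_s + eval_s + agent_s
    return cycle_s, 3600 / cycle_s

# Phase durations taken from the Cycle Timing table
h100_cycle, h100_rate = experiments_per_hour(300, 30, 30, 60)
mlx_cycle, mlx_rate = experiments_per_hour(300, 11, 52, 60)

print(f"H100: {h100_cycle}s per cycle, {h100_rate:.1f} experiments/hour")
print(f"MLX:  {mlx_cycle}s per cycle, {mlx_rate:.1f} experiments/hour")
```

Both cycles come out close to 7 minutes, so the slower evaluation on Apple Silicon is roughly offset by its faster model init.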

### The Setup Phase

Before the autonomous loop begins, there's a critical interactive setup:

1. **Propose a date-based run tag** (e.g., `mar5`, `jun12-m4max`)
2. **Create the branch**: `git checkout -b autoresearch/<tag>`
3. **Read all files**: README.md, prepare.py, train.py — build full context
4. **Verify data**: Confirm `~/.cache/autoresearch/` contains shards and tokenizer
5. **Create results.tsv** with header row
6. **Run baseline**: Execute unmodified train.py, record as first entry
7. **Get human approval** before starting autonomous loop

**Critical**: The agent must establish its own baseline on the current hardware. A baseline from a different machine is invalid — the time budget produces different step counts on different hardware, which means different optimal configurations.

## Decision Framework

### The Primary Signal: val_bpb Comparison

```
IF val_bpb_new < val_bpb_best:
    KEEP — improvement detected
ELIF val_bpb_new == val_bpb_best AND code_is_simpler:
    KEEP — simplification win
ELSE:
    DISCARD
```
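
The rule above translates directly into a small function. This is a minimal sketch, not code from the original project: `decide` and `code_is_simpler` are hypothetical names, crashes are assumed to be handled before this point, and the softer complexity judgment is left to the agent.

```python
def decide(val_bpb_new, val_bpb_best, code_is_simpler=False):
    """Return 'keep' or 'discard' following the primary keep/revert rule."""
    if val_bpb_new < val_bpb_best:
        return "keep"       # strict improvement (lower is better)
    if val_bpb_new == val_bpb_best and code_is_simpler:
        return "keep"       # same score with less code
    return "discard"        # equal-but-not-simpler, or worse

print(decide(0.993, 0.997))                        # improvement
print(decide(0.997, 0.997, code_is_simpler=True))  # simplification
print(decide(1.005, 0.997))                        # regression
```

Comparing floats with `==` is only reasonable here if both values are rounded to the logged precision (six decimals in the output format) before the comparison.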

### Concrete Decision Examples

| Baseline val_bpb | New val_bpb | Code Change | Decision | Why |
|------------------|-------------|-------------|----------|-----|
| 0.997 | 0.993 | +5 clean lines | **Keep** | Clear improvement, reasonable complexity |
| 0.997 | 0.996 | +20 hacky lines | Discard | Marginal gain, ugly complexity |
| 0.997 | 0.997 | -15 lines (deletion) | **Keep** | Equal performance, simpler = win |
| 0.997 | 0.992 | +10 clean lines | **Keep** | Significant gain, clean code |
| 0.997 | 0.000 | Any | Crash | Log as crash, diagnose |
| 0.997 | 1.005 | Any | Discard | Regression |
| 0.997 | 0.996 | Removed entire subsystem | **Keep** | Large simplification plus a small gain |

### The Simplicity Criterion

"All else being equal, simpler is better. A small improvement that adds ugly complexity is not worth it."

This is a soft constraint, not a hard rule. The agent applies judgment:
- Code deletions achieving equal performance are explicit wins
- Minor gains requiring significant complexity warrant rejection
- Simplifications achieving equal performance are encouraged
- Removing something and getting equal/better results is a great outcome — it means the thing was unnecessary

### VRAM / Memory Management

Memory is a soft constraint. The tradeoff is situational:

| Memory Change | val_bpb Change | Decision |
|---------------|----------------|----------|
| +2 GB | -0.050 | **Keep** — meaningful gain |
| +10 GB | -0.002 | Discard — not worth the memory |
| -5 GB | +0.000 | **Keep** — efficiency win |
| +1 GB | -0.010 | **Keep** — acceptable tradeoff |

The principle: memory shouldn't "blow up dramatically." Some increase is fine for meaningful gains, but the agent shouldn't chase marginal improvements at the cost of 2x memory usage.

## The Metric: val_bpb

### What It Measures

Validation bits per byte (val_bpb) is the primary optimization target. Lower = better.

**Formula:**
```
val_bpb = total_nats / (ln(2) * total_bytes)
```

Where:
- `total_nats` = sum of cross-entropy losses in natural log units
- `total_bytes` = sum of UTF-8 byte lengths for all target tokens
- `ln(2)` = conversion factor from nats to bits (≈0.6931)

Special tokens (byte length 0) are excluded from the BPB calculation.
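
To make the formula concrete, here is a small pure-Python sketch that applies it to a handful of tokens. The function name and the sample losses and byte lengths are made up for illustration; the real evaluation works on tensors over the full validation shard.

```python
import math

def bits_per_byte(token_nats, token_byte_lengths):
    """Compute bits per byte from per-token losses (nats) and UTF-8 byte lengths."""
    total_nats = 0.0
    total_bytes = 0
    for nats, n_bytes in zip(token_nats, token_byte_lengths):
        if n_bytes == 0:        # special token: excluded from both sums
            continue
        total_nats += nats
        total_bytes += n_bytes
    return total_nats / (math.log(2) * total_bytes)

# Three ordinary tokens and one special token (byte length 0)
losses = [2.0, 1.5, 3.0, 9.9]
byte_lengths = [4, 3, 5, 0]
print(bits_per_byte(losses, byte_lengths))
```

The special token's large loss (9.9 nats) has no effect on the result, since both its loss and its zero byte length are skipped.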

### Why BPB Over Perplexity

BPB is vocabulary-size-independent. If you change the tokenizer vocabulary (say from 8K to 32K tokens), perplexity becomes incomparable, but BPB remains valid because it normalizes by bytes, not tokens. This is critical for a system where the agent might want to experiment with different vocabulary sizes.

### BPB Scale

| BPB Value | Interpretation | Compression Ratio |
|-----------|----------------|-------------------|
| 8.0 | No compression (random) | 1:1 |
| 2.0 | 4x compression | 4:1 |
| 1.5 | 5.3x compression | 5.3:1 |
| 1.0 | 8x compression (near SOTA) | 8:1 |

Karpathy's H100 baseline: `val_bpb = 0.998` (8x compression in 5 min of training).
Best MLX result: `val_bpb = 1.295` (6.2x compression on M4 Max in 5 min).
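
The compression ratios quoted here all follow from one relation: uncompressed text costs 8 bits per byte, so the ratio is 8 divided by the BPB value. A quick check (the helper name is illustrative):

```python
def compression_ratio(bpb):
    """Raw text costs 8 bits per byte, so the ratio is 8 / bpb."""
    return 8.0 / bpb

# Values from the BPB Scale table plus the two quoted results
for bpb in (8.0, 2.0, 1.5, 1.0, 0.998, 1.295):
    print(f"{bpb:.3f} bpb -> {compression_ratio(bpb):.1f}x")
```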

### Evaluation Details

```python
import math

def evaluate_bpb(model, tokenizer, batch_size):
    """Fixed evaluation on pinned validation shard."""
    # EVAL_TOKENS, MAX_SEQ_LEN, validation_batches and token_bytes are assumed
    # to be defined elsewhere in prepare.py; token_bytes maps each token id to
    # its UTF-8 byte length.
    steps = EVAL_TOKENS // (batch_size * MAX_SEQ_LEN)
    total_nats = 0.0
    total_bytes = 0

    # Consume exactly `steps` batches so every run evaluates the same tokens
    for _, (inputs, targets) in zip(range(steps), validation_batches):
        # Forward pass with per-token loss (reduction='none')
        per_token_loss = model(inputs, targets)  # cross-entropy, nats

        # Look up byte count for each target token
        token_byte_lengths = token_bytes[targets]

        # Mask out special tokens (byte length 0)
        mask = token_byte_lengths > 0

        total_nats += (per_token_loss * mask).sum()
        total_bytes += token_byte_lengths[mask].sum()

    return total_nats / (math.log(2) * total_bytes)
```

Key properties:
- Fixed validation shard (6542) — never used for training
- Fixed token count — exactly 20,971,520 tokens (H100) or 1,572,864 (MLX)
- Per-token loss with byte-length weighting — not averaged over tokens
- Special token masking — tokens with 0 byte length excluded
- Deterministic — same model always produces the same score

## Key Rules

### NEVER STOP

The agent runs indefinitely until manually interrupted. No "should I keep going?" — the human might be asleep for 8+ hours. If the agent runs out of ideas, it should:

1. Re-read train.py, prepare.py, and results.tsv
2. Look at near-misses (experiments that almost improved)
3. Try combining two near-miss ideas
4. Try radical architectural changes
5. Read papers (if the agent has access)
6. Try the opposite of what's been working
7. Try parameter sweeps in unexplored ranges
8. Try removing things

The design assumes overnight operation: 60-80 experiments without any human interaction.

### Crash Handling Protocol

```
IF grep "^val_bpb:" run.log returns empty:
  1. Read tail -n 50 run.log
  2. Diagnose the failure
  3. IF simple bug (typo, import, off-by-one):
       Fix it, recommit, retry
  4. IF fundamental issue (OOM, architectural impossibility):
       Log as crash (val_bpb=0.000000, memory_gb=0.0)
       Revert and move on
  5. Record in results.tsv with status=crash
```

The agent uses judgment. A missing import is worth fixing. An idea that causes OOM on every variation is worth abandoning. The key is: don't get stuck. Log it, learn from it, move on.

### Timeout Handling

| Phase | Expected | Timeout Threshold | Action |
|-------|----------|-------------------|--------|
| Training | 300s | 600s (H100) / 900s (MLX) | Kill process, treat as failure |
| Full cycle | ~7 min | 15 min | Kill, revert, log as crash |

The training loop has a built-in fast-fail: if `train_loss > 100` at any point, exit immediately with code 1. This catches divergence early instead of wasting 5 minutes.
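
One way a harness could enforce the training timeout is Python's `subprocess.run` with its `timeout` argument. The sketch below is an assumption about how such a wrapper might look, not the project's actual mechanism (the source describes an agent running shell commands); `run_experiment` and its return values are hypothetical.

```python
import subprocess

def run_experiment(cmd, log_path, timeout_s):
    """Run one experiment, writing stdout and stderr to log_path.

    Returns 'ok', 'failed' (non-zero exit, e.g. the loss fast-fail),
    or 'timeout' (process killed after timeout_s seconds).
    """
    with open(log_path, "w") as log:
        try:
            proc = subprocess.run(
                cmd, stdout=log, stderr=subprocess.STDOUT, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child process before raising
            return "timeout"
    return "ok" if proc.returncode == 0 else "failed"
```

With the thresholds from the table, a call on an H100 might look like `run_experiment(["uv", "run", "train.py"], "run.log", timeout_s=600)`. Both `'failed'` and `'timeout'` would then be logged as a crash and reverted.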

### Results Logging

`results.tsv` — tab-separated, NOT comma. Commas are explicitly prohibited in descriptions.

```
commit	val_bpb	memory_gb	status	description
a1b2c3d	0.997900	44.0	keep	baseline
b2c3d4e	0.993200	44.2	keep	increase LR to 0.04
c3d4e5f	1.005000	44.0	discard	switch to GeLU activation
d4e5f6a	0.000000	0.0	crash	double model width (OOM)
e5f6a7b	0.995100	44.1	discard	add residual scaling (marginal + complex)
```

**Dual tracking**: results.tsv logs ALL experiments (keep/discard/crash). Git branch contains ONLY kept commits. This means you have both:
- The **clean improvement history** (git log) — what the system converged to
- The **full exploration log** (results.tsv) — what was tried, what failed, why

The agent reads BOTH to plan next experiments. The failures are as informative as the successes.
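
A row in this format can be produced with a few lines of Python. This is a sketch under stated assumptions: the helper names are hypothetical, the number formatting is inferred from the example rows, and rejecting tabs in descriptions is an added safeguard that the source does not mention.

```python
def format_result_row(commit, val_bpb, memory_gb, status, description):
    """Build one tab-separated results.tsv row (without trailing newline)."""
    if status not in ("keep", "discard", "crash"):
        raise ValueError(f"unknown status: {status}")
    if "," in description or "\t" in description:
        raise ValueError("description must not contain commas or tabs")
    return "\t".join(
        [commit, f"{val_bpb:.6f}", f"{memory_gb:.1f}", status, description]
    )

def append_result(path, row):
    """Append a row to the log; earlier rows are never rewritten."""
    with open(path, "a") as f:
        f.write(row + "\n")

print(format_result_row("a1b2c3d", 0.9979, 44.0, "keep", "baseline"))
```

Opening the file in append mode matches the "append" role given to results.tsv in the architecture listing: failed experiments stay in the log even after their commits are reverted.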

## Model Architecture (train.py)

The mutable code starts as a GPT-2-style transformer. Everything below is the default starting point — the agent can change any of it.

### GPTConfig

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    # Fields without defaults must come first in a dataclass
    n_layer: int                   # computed from DEPTH
    n_head: int                    # computed from n_embd / HEAD_DIM
    n_embd: int                    # computed from DEPTH * ASPECT_RATIO
    vocab_size: int = 8192
    max_seq_len: int = 2048        # matches MAX_SEQ_LEN in constants
    head_dim: int = 64             # per-head dimension
    window_pattern: str = "SSSL"   # attention pattern per layer
```

Model dimension is computed: `n_embd = ((DEPTH * ASPECT_RATIO + HEAD_DIM - 1) // HEAD_DIM) * HEAD_DIM` — rounded up to head_dim boundary.
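
The rounding expression is the usual integer "ceiling to a multiple" idiom. The sketch below wraps it in a hypothetical helper; the `aspect_ratio` values are illustrative only, since the project's actual `ASPECT_RATIO` is not stated here.

```python
def compute_n_embd(depth, aspect_ratio, head_dim):
    """Round depth * aspect_ratio up to the next multiple of head_dim."""
    return ((depth * aspect_ratio + head_dim - 1) // head_dim) * head_dim

# 8 * 90 = 720 is not a multiple of 64, so it rounds up to 768
n_embd = compute_n_embd(depth=8, aspect_ratio=90, head_dim=64)
n_head = n_embd // 64

# 8 * 64 = 512 is already a multiple of 64, so it is left unchanged
exact = compute_n_embd(depth=8, aspect_ratio=64, head_dim=64)

print(n_embd, n_head, exact)
```

Rounding up guarantees that `n_embd` divides evenly into heads, so `n_head = n_embd // head_dim` is always exact.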

### Attention Mechanism

- **Separate Q/K/V projections** (not fused)
- **RoPE positional encoding** applied post-projection. Base theta defaults to 10K (agent discovered 100K is better)
- **QK normalization** — queries and keys normalized before attention. Agent discovered post-norm scaling (`q, k *= 1.15`) helps for "sharper attention"
- **Value Embeddings (VE)** — alternating layers add gated embeddings: `v = v + gate * ve` where `gate = 2 * sigmoid(linear(x))`. The gate channels and scale range are tunable
- **Sliding window attention** — configurable per layer via `window_pattern` string. `"S"` = short-range (causal window), `"L"` = long-range (full context). Default `"SSSL"` = three short + one long. Agent discovered the window was too conservative
- **Mask caching** by `(seq_len, window_size)` tuple to avoid recomputation
- **Logit softcap**: `logits = cap * tanh(logits / cap)` where cap defaults to 20 (agent discovered 15 is better)
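
Three of the formulas in this list are easy to check on scalars. The sketch below is illustrative only: the real code operates on tensors, the function names are hypothetical, and repeating `window_pattern` across layers is an assumed tiling rule that the source does not spell out.

```python
import math

def softcap(logit, cap=20.0):
    """Squash a logit smoothly so its magnitude never exceeds cap."""
    return cap * math.tanh(logit / cap)

def ve_gate(z):
    """Value-embedding gate 2 * sigmoid(z); equals 1.0 at z = 0."""
    return 2.0 / (1.0 + math.exp(-z))

def layer_windows(window_pattern, n_layer):
    """Assign 'S' or 'L' to each layer by repeating the pattern (assumed rule)."""
    return [window_pattern[i % len(window_pattern)] for i in range(n_layer)]

print(softcap(1.0), softcap(1000.0))   # near-identity for small logits, capped for large
print(ve_gate(0.0))                    # neutral gate
print(layer_windows("SSSL", 8))
```

The softcap is close to the identity for logits much smaller than `cap`, so lowering the cap from 20 to 15 mainly affects the largest logits.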

### MLP

- Configurable expansion factor (4x on H100, 3x on MLX — agent discovered 3x beats 4x)
- **Squared ReLU** activation: `max(x, 0)^2`
- No bias terms
- Agent discovered initializing `c_fc` weights 0.5x smaller improves training

### Block Structure

```python
x = x + attn(norm(x), ve, mask)  # pre-norm attention with value embeddings
x = x + mlp(norm(x))             # pre-norm MLP
```

### Residual Interpolation (MLX variant)

Learnable per-layer interpolation between residual stream and initial embedding:

```python
x = resid_lambdas[i] * x + x0_lambdas[i] * x0
```

Where `resid_lambdas` init to 1.0 and `x0_lambdas` init to 0.1. This allows direct paths from the input embedding to any layer, which helps with gradient flow in deeper models.

### Weight Initialization

| Parameter | Initialization | Scale |
|-----------|----------------|-------|
| Token embeddings | Normal(0, std) | std=1.0 (agent found 0.8 better) |
| Output projection (lm_head) | Normal(0, 0.001) | Very small |
| Q/K/V/MLP input weights | Uniform(-s, s) | s = sqrt(3) * n_embd^(-0.5) |
| Projection output weights | Zero | Zero-init for residual |
| resid_lambdas | 1.0 | Full residual |
| x0_lambdas | 0.1 | Slight skip connection |
| Value embeddings | Uniform(-s, s) | Same as Q/K/V |
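
The `sqrt(3)` factor in the uniform scale has a simple reading. The derivation below is a standard property of the uniform distribution rather than something stated in the source: with this choice of `s`, the weights have a standard deviation of exactly `n_embd^(-0.5)`.

```latex
\begin{align*}
W \sim \mathrm{Uniform}(-s, s)
  &\;\Rightarrow\; \operatorname{Var}(W) = \frac{(2s)^2}{12} = \frac{s^2}{3} \\
s = \sqrt{3}\, n_{\mathrm{embd}}^{-1/2}
  &\;\Rightarrow\; \operatorname{Std}(W) = \frac{s}{\sqrt{3}} = n_{\mathrm{embd}}^{-1/2}
\end{align*}
```

So the uniform init matches the spread of a `Normal(0, n_embd^(-0.5))` init while keeping every weight inside a hard bound.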

### Default Model Sizes

| Config | DEPTH=8 (H100 default) | DEPTH=4 (MLX optimal) |
|--------|------------------------|-----------------------|
| Parameters | ~50M | ~21M |
| Steps in 5 min | ~46 | ~92 |
| n_embd | ~768 | ~512 |

The MLX agents universally discovered that DEPTH=4 beats DEPTH=8: half the parameters but 2x the training steps. **More optimizer steps beats more parameters when compute time is fixed.**

### Output Format

Training produces structured output parsed by the agent:

```
---
val_bpb: 0.997900
training_seconds: 300.1
total_seconds: 325.9
peak_vram_mb: 45060.2
mfu_percent: 39.80
total_tokens_M: 499.6
num_steps: 953
num_params_M: 50.3
depth: 8
```

The agent extracts metrics via `grep "^val_bpb:\|^peak_vram_mb:" run.log`.
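
The same extraction can be done in Python when the loop is driven by a script rather than a shell. This is a minimal sketch that mirrors the grep pattern; `parse_metrics` is a hypothetical helper, not part of the original project.

```python
import re

# Same two keys as the grep pattern, anchored at the start of a line
METRIC_LINE = re.compile(
    r"^(val_bpb|peak_vram_mb):\s*([-+]?\d+(?:\.\d+)?)\s*$", re.MULTILINE
)

def parse_metrics(log_text):
    """Return the metrics found in a run log as a dict of floats.

    An empty dict mirrors empty grep output, which signals a crashed run.
    """
    return {key: float(value) for key, value in METRIC_LINE.findall(log_text)}

sample = "---\nval_bpb: 0.997900\ntraining_seconds: 300.1\npeak_vram_mb: 45060.2\n"
metrics = parse_metrics(sample)
print(metrics)
print(round(metrics["peak_vram_mb"] / 1024, 1))  # memory in GB for results.tsv
```

Dividing `peak_vram_mb` by 1024 reproduces the `memory_gb` value of 44.0 that the baseline row in results.tsv reports.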
|
|
405
|
+
|
|
406
|
+
## Optimizer System
|
|
407
|
+
|
|
408
|
+
### Parameter Groups (4 groups with differentiated learning rates)
|
|
409
|
+
|
|
410
|
+
The optimizer doesn't treat all parameters equally. This is one of the most important design choices:
|
|
411
|
+
|
|
412
|
+
| Group | Which Parameters | Learning Rate | Weight Decay | Why Separate |
|
|
413
|
+
|-------|-----------------|---------------|-------------|--------------|
|
|
414
|
+
| **Matrix** | 2D weight matrices (not embed/unembed) | `MATRIX_LR` (0.04) | 0.2 | These are the bulk of the model; benefit from Muon |
|
|
415
|
+
| **Embedding** | `tok_embed`, `embed_v` (value embeddings) | `EMBEDDING_LR` (0.6) | None | Embeddings need high LR, no decay |
|
|
416
|
+
| **Unembedding** | `lm_head` | `UNEMBEDDING_LR` (0.004) | None | Output projection is sensitive |
|
|
417
|
+
| **Scalar** | 1D parameters (norms, biases) | `SCALAR_LR` (0.5) | Custom | Small parameters, high LR |
|
|
418
|
+
|
|
419
|
+
**Dimension scaling**: `dmodel_lr_scale = (model_dim / 768) ** -0.5`, so learning rates scale with the inverse square root of model width.
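
To get a feel for the size of this correction, the sketch below evaluates the formula at a few widths. The `reference_dim` parameter name is an assumption added for readability; the source only gives the constant 768.

```python
def dmodel_lr_scale(model_dim: int, reference_dim: int = 768) -> float:
    """LR multiplier relative to the 768-wide reference model."""
    return (model_dim / reference_dim) ** -0.5

for dim in (512, 768, 1536):
    print(dim, round(dmodel_lr_scale(dim), 3))
```

Narrower models get a multiplier above 1 and wider models a multiplier below 1; doubling the width lowers the learning rate by a factor of `sqrt(2)`.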
|
|
420
|
+
|
|
421
|
+
The agent discovered that per-group Adam betas and weight decay (instead of shared global values) were a significant improvement. Karpathy: "It found that AdamW betas were all messed up."
|
|
422
|
+
|
|
423
|
+
### MuonAdamW Hybrid Optimizer
|
|
424
|
+
|
|
425
|
+
Two update strategies in one optimizer:
|
|
426
|
+
|
|
427
|
+
**AdamW** (for embeddings, unembedding, scalars):
|
|
428
|
+
```python
|
|
429
|
+
# Standard Adam with decoupled weight decay
|
|
430
|
+
m = beta1 * m + (1 - beta1) * grad # first moment
|
|
431
|
+
v = beta2 * v + (1 - beta2) * grad**2 # second moment
|
|
432
|
+
m_hat = m / (1 - beta1**t) # bias correction
|
|
433
|
+
v_hat = v / (1 - beta2**t)
|
|
434
|
+
param = param * (1 - lr * weight_decay) # decoupled decay
|
|
435
|
+
param = param - lr * m_hat / (sqrt(v_hat) + eps) # Adam step
|
|
436
|
+
```
|
|
437
|
+
|
|
438
|
+
Default AdamW hyperparameters:
|
|
439
|
+
| Parameter | Default |
|
|
440
|
+
|-----------|---------|
|
|
441
|
+
| beta1 | 0.8 |
|
|
442
|
+
| beta2 | 0.95 |
|
|
443
|
+
| weight_decay | 0.2 |
|
|
444
|
+
| eps | 1e-8 |
|
|
445
|
+
|
|
446
|
+
**Muon** (for weight matrices — the bulk of the model):
|
|
447
|
+
|
|
448
|
+
Muon preconditions the gradient of each weight matrix with Newton-Schulz iterations. It behaves like a cheap second-order method, although it never computes a Hessian. This is what makes the H100 version fast: Muon gets more signal per gradient step than AdamW alone.
|
|
449
|
+
|
|
450
|
+
```python
|
|
451
|
+
# Newton-Schulz preconditioning
|
|
452
|
+
G = gradient / (gradient.norm() + 1e-7)  # normalize first so the iteration converges
|
|
453
|
+
for i in range(NS_STEPS):
|
|
454
|
+
    G = G @ (3*I - G.T @ G) / 2  # iterative polar decomposition
|
|
455
|
+
# Apply preconditioned gradient with momentum
|
|
456
|
+
```
|
|
457
|
+
|
|
458
|
+
The Newton-Schulz iteration approximately orthogonalizes the gradient matrix: it pushes all of its singular values toward 1, so every direction in the update is scaled equally. The effect resembles a second-order method (such as Newton's method), at a fraction of the cost.
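
The cubic iteration from the pseudocode can be run end to end on a small matrix. The sketch below uses plain Python lists so it has no dependencies; the helper names are illustrative, and the Frobenius-norm scaling is an assumption added so the iteration starts inside its convergence region. The actual train.py may use a different polynomial or normalization.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def newton_schulz(grad, steps=5):
    """Approximate the orthogonal polar factor of `grad`."""
    # Scale by the Frobenius norm so every singular value is <= 1.
    norm = math.sqrt(sum(v * v for row in grad for v in row)) or 1.0
    x = [[v / norm for v in row] for row in grad]
    n = len(x[0])
    for _ in range(steps):
        xtx = matmul(transpose(x), x)
        # X <- X (3I - X^T X) / 2
        correction = [[(3.0 if i == j else 0.0) - xtx[i][j] for j in range(n)]
                      for i in range(n)]
        x = [[0.5 * v for v in row] for row in matmul(x, correction)]
    return x

grad = [[3.0, 1.0], [1.0, 2.0]]
q = newton_schulz(grad, steps=20)
gram = matmul(transpose(q), q)  # close to the identity once converged
print([[round(v, 6) for v in row] for row in gram])
```

With only a handful of steps the result is approximately, not exactly, orthogonal. That is the trade-off behind tuning `NS_STEPS`: fewer iterations give a rougher update but a faster step.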
|
|
459
|
+
|
|
460
|
+
Key Muon parameters:
|
|
461
|
+
| Parameter | H100 Default | MLX Optimal | Notes |
|
|
462
|
+
|-----------|-------------|-------------|-------|
|
|
463
|
+
| NS_STEPS | 5 | **3** | Fewer iterations = faster steps = more updates in 5 min |
|
|
464
|
+
| momentum | 0.95 | 0.95 | Baseline momentum |
|
|
465
|
+
| beta2 | 0.95 | — | Agent found 0.9 better on H100 |
|
|
466
|
+
| momentum_warmup | 0.95→0.97 | — | Over 400 steps on H100 |
|
|
467
|
+
|
|
468
|
+
**MLX discovery**: NS_STEPS=3 outperforms NS_STEPS=5 on Apple Silicon. This is the first documented tuning of Muon on this hardware. The reasoning: fewer iterations per step = faster steps = more total gradient updates in the fixed 5-minute budget.
|
|
469
|
+
|
|
470
|
+
**Hardware-dependent optimizer choice**: On the Mac Mini (constrained compute), Muon was a breakthrough. On M4 Max (more headroom), plain AdamW won. The system discovers hardware-appropriate configurations, not a single "best" configuration.
|
|
471
|
+
|
|
472
|
+
### Learning Rate Schedule
|
|
473
|
+
|
|
474
|
+
Three phases controlled by `progress = total_training_time / TIME_BUDGET`:
|
|
475
|
+
|
|
476
|
+
```
|
|
477
|
+
Phase 1: Warmup [0, WARMUP_RATIO] → linear ramp from 0 to 1.0
|
|
478
|
+
Phase 2: Steady state [WARMUP_RATIO, 1-WARMDOWN_RATIO] → constant at 1.0
|
|
479
|
+
Phase 3: Warmdown [1-WARMDOWN_RATIO, 1.0] → linear decay to FINAL_LR_FRAC
|
|
480
|
+
```
|
|
481
|
+
|
|
482
|
+
| Parameter | H100 Default | H100 Optimized | MLX Default |
|
|
483
|
+
|-----------|-------------|----------------|-------------|
|
|
484
|
+
| WARMUP_RATIO | ratio-based | 40 absolute steps | 0.0 |
|
|
485
|
+
| WARMDOWN_RATIO | 0.5 | 0.65 | 0.5 |
|
|
486
|
+
| FINAL_LR_FRAC | 0.0 | 0.05 | 0.0 |
|
|
487
|
+
|
|
488
|
+
The agent discovered:
|
|
489
|
+
- **Non-zero FINAL_LR_FRAC (0.05)** — don't decay LR all the way to zero. Keep a small residual learning rate
|
|
490
|
+
- **Longer warmdown (0.65 vs 0.5)** — more gradual cooldown helps
|
|
491
|
+
- **Weight decay schedule**: linear → cosine decay was an improvement
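
The three phases can be written as one function of `progress`. This is a sketch, not the code from train.py: the name `get_lr_multiplier` appears in the source, but the keyword parameters and their defaults here are assumptions, and the H100-optimized warmup of 40 absolute steps is not modeled because it depends on step count rather than progress.

```python
def get_lr_multiplier(progress, warmup_ratio=0.0, warmdown_ratio=0.5,
                      final_lr_frac=0.0):
    """Map progress in [0, 1] to a multiplier on the base learning rate."""
    # Phase 1: linear warmup from 0 to 1.
    if warmup_ratio > 0 and progress < warmup_ratio:
        return progress / warmup_ratio
    # Phase 2: constant at 1.
    if warmdown_ratio <= 0 or progress < 1.0 - warmdown_ratio:
        return 1.0
    # Phase 3: linear decay from 1 down to final_lr_frac.
    remaining = max(0.0, min(1.0, (1.0 - progress) / warmdown_ratio))
    return final_lr_frac + (1.0 - final_lr_frac) * remaining

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, get_lr_multiplier(p, warmdown_ratio=0.65, final_lr_frac=0.05))
```

With `final_lr_frac=0.05` the multiplier ends at 0.05 instead of 0, which is the "small residual learning rate" the agent found helpful.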
|
|
492
|
+
|
|
493
|
+
## Training Loop Internals
|
|
494
|
+
|
|
495
|
+
### Time Budget Enforcement
|
|
496
|
+
|
|
497
|
+
```python
|
|
498
|
+
import time

TIME_BUDGET = 300  # seconds of actual training
|
|
499
|
+
STARTUP_EXCLUDE_STEPS = 1 # exclude first step from timing (compilation overhead)
|
|
500
|
+
|
|
501
|
+
t0 = None
|
|
502
|
+
for step in range(max_steps):
|
|
503
|
+
    # ... forward, backward, optimizer step ...
|
|
504
|
+
|
|
505
|
+
    if step == STARTUP_EXCLUDE_STEPS:
|
|
506
|
+
        t0 = time.perf_counter()  # start timing AFTER compilation
|
|
507
|
+
|
|
508
|
+
    if t0 is not None:
|
|
509
|
+
        total_training_time = time.perf_counter() - t0
|
|
510
|
+
    if step >= STARTUP_EXCLUDE_STEPS and total_training_time >= TIME_BUDGET:
|
|
511
|
+
        break
|
|
512
|
+
```
|
|
513
|
+
|
|
514
|
+
Wall-clock measurement EXCLUDES: model initialization, first-step compilation, data loading setup, final evaluation.
|
|
515
|
+
Wall-clock measurement INCLUDES: forward passes, backward passes, optimizer steps, gradient accumulation.
|
|
516
|
+
|
|
517
|
+
This means every experiment gets the same wall-clock training budget, regardless of initialization overhead. The loop only checks the clock between steps, so a run can overshoot `TIME_BUDGET` by up to one step.
|
|
518
|
+
|
|
519
|
+
### Progress-Based Scheduling
|
|
520
|
+
|
|
521
|
+
The LR schedule is driven by wall-clock progress, not step count:
|
|
522
|
+
|
|
523
|
+
```python
|
|
524
|
+
progress = min(total_training_time / TIME_BUDGET, 1.0)
|
|
525
|
+
lr_multiplier = get_lr_multiplier(progress)
|
|
526
|
+
```
|
|
527
|
+
|
|
528
|
+
This is important because different configurations produce different step counts in the same 5 minutes. A model with 2x the parameters takes roughly 2x longer per step, so it gets about half the steps, but the LR schedule still spans the full training run proportionally.
|
|
529
|
+
|
|
530
|
+
### Fast-Fail Detection
|
|
531
|
+
|
|
532
|
+
```python
|
|
533
|
+
if train_loss > 100:
|
|
534
|
+
    sys.exit(1)  # divergence detected, abort immediately
|
|
535
|
+
```
|
|
536
|
+
|
|
537
|
+
Don't waste 5 minutes on a diverged run. If loss explodes, bail immediately.
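
One subtlety of a plain threshold check: in Python, `nan > 100` evaluates to `False`, so a loss that has become NaN would not trip it. The sketch below shows a variant that also treats non-finite values as divergence. The helper `has_diverged` is hypothetical, and the original train.py may already handle NaN elsewhere.

```python
import math

LOSS_LIMIT = 100.0  # threshold from the snippet above

def has_diverged(train_loss: float, limit: float = LOSS_LIMIT) -> bool:
    """True if the loss is NaN, infinite, or above the limit."""
    return not math.isfinite(train_loss) or train_loss > limit

print(has_diverged(2.5))           # healthy run
print(has_diverged(float("nan")))  # caught even though nan > 100 is False
```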
|
|
538
|
+
|
|
539
|
+
### Smoothed Loss Tracking
|
|
540
|
+
|
|
541
|
+
```python
|
|
542
|
+
smoothed_loss = beta * smoothed_loss + (1 - beta) * loss # EMA with beta=0.9
|
|
543
|
+
smoothed_loss_debiased = smoothed_loss / (1 - beta**step)  # early-step correction; requires step >= 1
|
|
544
|
+
```
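
A self-contained version of the same two lines is sketched below, assuming the EMA starts at 0 and the step counter starts at 1. The class name `SmoothedLoss` is illustrative only.

```python
class SmoothedLoss:
    """Exponential moving average of the loss with bias correction."""

    def __init__(self, beta: float = 0.9):
        self.beta = beta
        self.ema = 0.0
        self.step = 0

    def update(self, loss: float) -> float:
        self.step += 1
        self.ema = self.beta * self.ema + (1 - self.beta) * loss
        # Without this division the first values are biased toward 0.
        return self.ema / (1 - self.beta ** self.step)

tracker = SmoothedLoss()
for loss in (4.0, 4.0, 3.0):
    print(tracker.update(loss))
```

The correction matters most in the first few steps: with a constant input the debiased value equals that input from step 1 onward, whereas the raw EMA would start at a tenth of it.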
|
|
545
|
+
|
|
546
|
+
### Memory Management (MLX-specific)
|
|
547
|
+
|
|
548
|
+
```python
|
|
549
|
+
gc.collect()
|
|
550
|
+
gc.freeze() # freeze all existing objects (exclude from GC)
|
|
551
|
+
gc.disable() # disable GC during training
|
|
552
|
+
|
|
553
|
+
# Every 5000 steps:
|
|
554
|
+
gc.collect() # manual collection to prevent memory drift
|
|
555
|
+
```
|
|
556
|
+
|
|
557
|
+
Aggressive GC management is critical on unified memory hardware where training and system share the same pool.
|
|
558
|
+
|
|
559
|
+
### Batch Size Configuration
|
|
560
|
+
|
|
561
|
+
| Parameter | H100 Default | MLX Default | MLX Optimal |
|
|
562
|
+
|-----------|-------------|-------------|-------------|
|
|
563
|
+
| TOTAL_BATCH_SIZE | 2^17 | 2^16 (65,536 tokens) | 2^14 (16,384 tokens) |
|
|
564
|
+
| DEVICE_BATCH_SIZE | — | 16 sequences | — |
|
|
565
|
+
| GRAD_ACCUM_STEPS | — | computed | — |
|
|
566
|
+
|
|
567
|
+
```python
|
|
568
|
+
grad_accum_steps = TOTAL_BATCH_SIZE // (DEVICE_BATCH_SIZE * MAX_SEQ_LEN)
|
|
569
|
+
```
|
|
570
|
+
|
|
571
|
+
The agents discovered smaller batch sizes outperform: `TOTAL_BATCH_SIZE 2^14-2^13` beat `2^17` by fitting more gradient steps in the time budget. This is the same insight as DEPTH=4 vs DEPTH=8: **in a fixed time budget, more steps > more tokens per step.**
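
The accumulation formula only works when the total batch is a whole multiple of one micro-batch. The sketch below makes that explicit. `MAX_SEQ_LEN` is not stated in this section, so the value 2048 is an assumption chosen purely for illustration.

```python
def grad_accum_steps(total_batch_size: int, device_batch_size: int,
                     max_seq_len: int) -> int:
    """Number of micro-batches accumulated per optimizer step."""
    tokens_per_micro_batch = device_batch_size * max_seq_len
    if total_batch_size % tokens_per_micro_batch != 0:
        raise ValueError("TOTAL_BATCH_SIZE must be a multiple of "
                         "DEVICE_BATCH_SIZE * MAX_SEQ_LEN")
    return total_batch_size // tokens_per_micro_batch

MAX_SEQ_LEN = 2048  # assumed value, not taken from the source

print(grad_accum_steps(2**16, 16, MAX_SEQ_LEN))  # 65,536-token batch
print(grad_accum_steps(2**14, 8, MAX_SEQ_LEN))   # 16,384-token batch
```

Under that assumed sequence length, shrinking the total batch to 2^14 tokens also requires a smaller device batch, because 16 sequences would already exceed the total.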
|
|
572
|
+
|
|
573
|
+
## Data Pipeline (prepare.py — Immutable)
|
|
574
|
+
|
|
575
|
+
### Dataset
|
|
576
|
+
|
|
577
|
+
`karpathy/climbmix-400b-shuffle` from HuggingFace:
|
|
578
|
+
- 6,543 parquet shards total
|
|
579
|
+
- Training: shards 0-6,541
|
|
580
|
+
- Validation: shard 6,542 (pinned, never used for training)
|
|
581
|
+
|
|
582
|
+
### Download System
|
|
583
|
+
|
|
584
|
+
```python
|
|
585
|
+
import multiprocessing

def download_data():
|
|
586
|
+
    """Parallel download with retry logic."""
|
|
587
|
+
    pool = multiprocessing.Pool(processes=8)
|
|
588
|
+
    # For each shard:
|
|
589
|
+
    #   1. Download to .tmp file
|
|
590
|
+
    #   2. Atomic rename on success
|
|
591
|
+
    #   3. Skip if already exists (resumable)
|
|
592
|
+
    #   4. Retry with exponential backoff (3-5 attempts, wait 2^attempt seconds)
|
|
593
|
+
```
|
|
594
|
+
|
|
595
|
+
Key properties:
|
|
596
|
+
- **Atomic writes**: download to `.tmp`, rename on completion — prevents corruption from interrupted downloads
|
|
597
|
+
- **Resumable**: skips already-downloaded files
|
|
598
|
+
- **Parallel**: 8 workers for throughput
|
|
599
|
+
- **Cached**: `~/.cache/autoresearch/data/`
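
The atomic-write, resume, and backoff properties above fit in one small function. The sketch below is not the code from prepare.py: `fetch` is a caller-supplied, hypothetical function that writes one shard to the given path, and `sleep` is injectable only so the backoff can be tested without waiting.

```python
import os
import time

def download_shard(fetch, dest_path, max_attempts=5, sleep=time.sleep):
    """Download one shard atomically. Returns False if it already existed."""
    if os.path.exists(dest_path):
        return False  # resumable: skip finished shards
    tmp_path = dest_path + ".tmp"
    for attempt in range(1, max_attempts + 1):
        try:
            fetch(tmp_path)                  # write to the temporary file
            os.replace(tmp_path, dest_path)  # atomic rename on success
            return True
        except OSError:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)          # never leave partial files
            if attempt == max_attempts:
                raise
            sleep(2 ** attempt)              # exponential backoff
```

Because the final name only appears after the rename, an interrupted run leaves behind at most a `.tmp` file, and the existence check at the top is enough to resume safely.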
|
|
600
|
+
|
|
601
|
+
### Tokenizer
|
|
602
|
+
|
|
603
|
+
- **BPE** via `rustbpe` library (Rust-based, fast)
|
|
604
|
+
- **Vocabulary size**: 8,192 (minus 4 special tokens = 8,188 mergeable ranks)
|
|
605
|
+
- Trained on ~1 billion characters from training shards
|
|
606
|
+
- Integrated with `tiktoken` for encoding/decoding
|
|
607
|
+
- Produces: `tokenizer.pkl` (tiktoken Encoding) and `token_bytes.npy`/`token_bytes.pt` (maps token IDs → UTF-8 byte lengths)
|
|
608
|
+
- Special tokens have byte length 0 (excluded from BPB)
|
|
609
|
+
|
|
610
|
+
### BOS-Aligned Best-Fit Packing
|
|
611
|
+
|
|
612
|
+
The dataloader achieves 100% token utilization with no padding:
|
|
613
|
+
|
|
614
|
+
```python
|
|
615
|
+
def make_dataloader(tokenizer, batch_size, seq_len, split, buffer_size=1000):
|
|
616
|
+
    """
|
|
617
|
+
    1. Tokenize documents, prepend BOS to each
|
|
618
|
+
    2. Buffer 1000 tokenized documents
|
|
619
|
+
    3. Best-fit selection: pack documents into rows of exactly seq_len+1 tokens
|
|
620
|
+
    4. Yield (inputs, targets, epoch) where:
|
|
621
|
+
       - inputs = positions [0, seq_len)
|
|
622
|
+
       - targets = positions [1, seq_len+1)
|
|
623
|
+
    """
|
|
624
|
+
```
|
|
625
|
+
|
|
626
|
+
Properties:
|
|
627
|
+
- BOS token prepended to each document — every document boundary is marked
|
|
628
|
+
- Best-fit selection minimizes wasted tokens (unlike fixed-length chunking)
|
|
629
|
+
- Deterministic: same shard order produces identical batches
|
|
630
|
+
- Row group processing: reads parquet by row group for memory efficiency
|
|
631
|
+
- Infinite iterator with epoch tracking
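
The best-fit step can be illustrated on a finite list of tokenized documents. The sketch below makes several assumptions that are not confirmed by the source: when nothing fits, the shortest buffered document is cropped and its remainder dropped, and a trailing partial row is discarded instead of waiting for a buffer refill. `pack_rows` is a hypothetical name.

```python
def pack_rows(docs, row_len):
    """Pack token lists into rows of exactly row_len tokens, no padding."""
    buffer = [list(d) for d in docs if d]
    rows = []
    while buffer:
        row = []
        while len(row) < row_len and buffer:
            space = row_len - len(row)
            fitting = [i for i, d in enumerate(buffer) if len(d) <= space]
            if fitting:
                # Best fit: the largest document that still fits whole.
                idx = max(fitting, key=lambda i: len(buffer[i]))
                row.extend(buffer.pop(idx))
            else:
                # Nothing fits: crop the shortest document to fill the row.
                idx = min(range(len(buffer)), key=lambda i: len(buffer[i]))
                row.extend(buffer.pop(idx)[:space])
        if len(row) == row_len:
            rows.append(row)
    return rows

BOS = 0  # illustrative token id
docs = [[BOS, 1, 2, 3, 4], [BOS, 5, 6], [BOS, 7], [BOS, 8, 9, 10]]
for row in pack_rows(docs, row_len=5):  # row_len plays the role of seq_len + 1
    inputs, targets = row[:-1], row[1:]
    print(inputs, targets)
```

Every position in an emitted row holds a real token, which is what "no padding" means here; the cost is that some document tails are cropped.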
|
|
632
|
+
|
|
633
|
+
### Cache Structure
|
|
634
|
+
|
|
635
|
+
```
|
|
636
|
+
~/.cache/autoresearch/
|
|
637
|
+
├── data/
|
|
638
|
+
│ ├── shard_00000.parquet (training)
|
|
639
|
+
│ ├── shard_00001.parquet
|
|
640
|
+
│ ├── ...
|
|
641
|
+
│ ├── shard_06541.parquet (training)
|
|
642
|
+
│ └── shard_06542.parquet (validation — pinned)
|
|
643
|
+
└── tokenizer/
|
|
644
|
+
    ├── tokenizer.pkl (tiktoken Encoding object)
|
|
645
|
+
    └── token_bytes.npy/.pt (vocab_size x int32, byte lengths)
|
|
646
|
+
```
|
|
647
|
+
|
|
648
|
+
One-time setup: `uv run prepare.py` (~5 minutes, downloads data and trains tokenizer).
|
|
649
|
+
|
|
650
|
+
## Agent Prompt Engineering (program.md)
|
|
651
|
+
|
|
652
|
+
The program.md is the most important file in the system. It's what turns a general-purpose AI agent into an autonomous researcher. Key design choices:
|
|
653
|
+
|
|
654
|
+
### Structure (6 sections)
|
|
655
|
+
|
|
656
|
+
1. **Monorepo Safety** — If in a monorepo, stage only the experiment directory paths. Never `git add -A`.
|
|
657
|
+
|
|
658
|
+
2. **Setup Protocol** — Interactive initialization: branch creation, file reading, data verification, baseline establishment, human approval. This prevents the agent from running blind.
|
|
659
|
+
|
|
660
|
+
3. **Experimentation Rules** — Hard constraints: only modify train.py, no new dependencies, no modifying prepare.py or constants, no changing evaluation function or time/sequence length constants.
|
|
661
|
+
|
|
662
|
+
4. **Output Format** — What to grep for. The structured `---` block with `val_bpb:`, `peak_vram_mb:`, etc.
|
|
663
|
+
|
|
664
|
+
5. **Logging Rules** — TSV format, no commas, status values (keep/discard/crash), crash logging convention (`val_bpb=0.000000`).
|
|
665
|
+
|
|
666
|
+
6. **The Experiment Loop** — The autonomous cycle, verbatim.
|
|
667
|
+
|
|
668
|
+
### Key Prompt Engineering Decisions
|
|
669
|
+
|
|
670
|
+
**The NEVER STOP principle** is repeated and emphasized. This is deliberate — without it, agents naturally pause to ask for confirmation, which defeats overnight operation.
|
|
671
|
+
|
|
672
|
+
**The simplicity criterion** is stated as a tiebreaker, not a primary objective. This prevents the agent from refusing to add code — it can add complexity if the improvement justifies it.
|
|
673
|
+
|
|
674
|
+
**The crash handling** is judgment-based ("fix simple bugs, skip fundamentally broken ideas"). This avoids both extremes: an agent that gives up at the first error, or an agent that retries the same broken idea forever.
|
|
675
|
+
|
|
676
|
+
**The idea exhaustion strategy** ("re-read files, try combining near-misses, try radical changes") prevents the agent from stalling when it runs out of obvious ideas. The instruction to "think harder" is deliberate — agents can often find ideas if they're told not to give up.
|
|
677
|
+
|
|
678
|
+
## Real Results
|
|
679
|
+
|
|
680
|
+
### Karpathy's 2-Day Run (H100)
|
|
681
|
+
|
|
682
|
+
83 experiments, 15 kept improvements. Baseline: `val_bpb = 0.998`, 45.1 GB VRAM.
|
|
683
|
+
|
|
684
|
+
**Optimizer & schedule changes:**
|
|
685
|
+
- Unembedding LR: 0.004 → 0.008, weight decay: 0.2 → 0.28
|
|
686
|
+
- Per-group Adam betas and weight decay (instead of shared global)
|
|
687
|
+
- Muon beta2: 0.95 → 0.9, momentum warmup target: 0.95 → 0.97 over 400 steps
|
|
688
|
+
- Warmup: ratio-based → absolute steps (40)
|
|
689
|
+
- Warmdown ratio: 0.5 → 0.65, final LR fraction: 0.0 → 0.05
|
|
690
|
+
- Weight decay schedule: linear → cosine decay
|
|
691
|
+
- Polar express norm factor: 1.02 → 1.01
|
|
692
|
+
|
|
693
|
+
**Architecture & init changes:**
|
|
694
|
+
- VE gate: channels 32 → 12, scale range 2x → 3x, init small positive
|
|
695
|
+
- Post-QK-norm scaling (q,k *= 1.15) for sharper attention
|
|
696
|
+
- Embedding init std: 1.0 → 0.8, MLP c_fc init 0.5x smaller
|
|
697
|
+
- RoPE base theta: 10K → 100K
|
|
698
|
+
- Short attention window: seq_len/2 → ~seq_len/3 (ceil to 128 tile)
|
|
699
|
+
- Logit softcap: 20 → 15
|
|
700
|
+
|
|
701
|
+
Result: "Time to GPT-2" dropped from 2.02 hours to 1.80 hours (11% improvement).
|
|
702
|
+
|
|
703
|
+
Key quote: "The agent found multipliers to sharpen attention, pointing to future work. It found that Value Embeddings really like regularization and I wasn't applying any (oops). It found that my banded attention was too conservative (I forgot to tune it). It found that AdamW betas were all messed up."
|
|
704
|
+
|
|
705
|
+
What this means: the agent found bugs and missed tuning opportunities in code written by one of the world's foremost ML researchers. The improvements were real, not artifacts — they transferred to larger models and stacked additively.
|
|
706
|
+
|
|
707
|
+
### MLX Port Overnight Results (Apple Silicon)
|
|
708
|
+
|
|
709
|
+
Three machines ran autonomously for 6-12 hours:
|
|
710
|
+
|
|
711
|
+
| Machine | Optimizer | Experiments | Best val_bpb | Improvement |
|
|
712
|
+
|---|---|---|---|---|
|
|
713
|
+
| M4 Max 128GB | AdamW | ~50 | 1.295 | 19% |
|
|
714
|
+
| M4 Max 128GB (#2) | AdamW + surface gates | ~30 | 1.339 | 17% |
|
|
715
|
+
| Mac Mini | Muon + AdamW | 30 | 1.462 | 24% |
|
|
716
|
+
|
|
717
|
+
Upstream H100 reference: val_bpb 0.998 in the same 5-minute budget.
|
|
718
|
+
|
|
719
|
+
### Universal Discoveries (all machines converged)
|
|
720
|
+
|
|
721
|
+
- **DEPTH=4 over DEPTH=8**: Half the parameters, 2x training steps. Every machine found this independently — "more optimizer steps beats more parameters when compute time is fixed"
|
|
722
|
+
- **Smaller batch sizes**: 2^14-2^13 beat 2^17 — more gradient updates matter more than more tokens per update
|
|
723
|
+
- **Lean MLP**: 3x expansion beat 4x. On Mac Mini (most constrained), 2x was better
|
|
724
|
+
- **Schedule tuning**: WARMDOWN_RATIO and FINAL_LR_FRAC were significant everywhere
|
|
725
|
+
|
|
726
|
+
### Hardware-Specific Discoveries
|
|
727
|
+
|
|
728
|
+
- **Muon is hardware-dependent**: breakthrough on Mac Mini (constrained compute), but plain AdamW won on M4 Max. The hypothesis: when you have plenty of memory/compute, AdamW's simplicity wins; when compute is tight, Muon's better gradient signal per step matters more
|
|
729
|
+
- **NS_STEPS=3 over NS_STEPS=5**: First documented Muon tuning on Apple Silicon. Fewer Newton-Schulz iterations = faster steps = more total updates
|
|
730
|
+
- **Same loop + different hardware = genuinely different optimal configurations**. That's the point — the system finds what's best for YOUR hardware, not a universal recipe
|
|
731
|
+
|
|
732
|
+
### Anti-Patterns Discovered
|
|
733
|
+
|
|
734
|
+
These consistently failed across machines:
|
|
735
|
+
|
|
736
|
+
1. **Increasing model size beyond optimal depth** — fewer training steps in fixed budget, net negative
|
|
737
|
+
2. **Large batch sizes (2^17+)** — fewer gradient updates, optimizer progress stalls
|
|
738
|
+
3. **Complex architectural changes with tiny gains** — failed simplicity criterion
|
|
739
|
+
4. **Over-expanding MLP (4x+)** — computation cost not worth the extra capacity
|
|
740
|
+
5. **Any change that reduces step count significantly** — the time budget makes step count critical
|
|
741
|
+
|
|
742
|
+
### Common Successful Parameter Ranges
|
|
743
|
+
|
|
744
|
+
These are the ranges where agents found improvements across runs:
|
|
745
|
+
|
|
746
|
+
| Parameter | Explored Range | Typical Optimal |
|
|
747
|
+
|-----------|---------------|-----------------|
|
|
748
|
+
| DEPTH | 4-8 | 4 (universal on MLX) |
|
|
749
|
+
| WINDOW_PATTERN | "SL", "SSSL", "SSSSL" | "SSSL" |
|
|
750
|
+
| MLP expansion | 2x-4x | 3x (2x on constrained hw) |
|
|
751
|
+
| HEAD_DIM | 64-192 | 64-128 |
|
|
752
|
+
| TOTAL_BATCH_SIZE | 2^13-2^17 | 2^14 |
|
|
753
|
+
| MATRIX_LR | 0.01-0.1 | 0.04 |
|
|
754
|
+
| EMBEDDING_LR | 0.3-1.2 | 0.6 |
|
|
755
|
+
| WARMUP_RATIO | 0.0-0.1 | 0.0 |
|
|
756
|
+
| WARMDOWN_RATIO | 0.2-0.5 | 0.3-0.5 |
|
|
757
|
+
|
|
758
|
+
## Monitoring
|
|
759
|
+
|
|
760
|
+
### During a Run
|
|
761
|
+
|
|
762
|
+
Real-time progress in `run.log`:
|
|
763
|
+
```
|
|
764
|
+
step 1 | loss 11.2345 | lr 1.2e-04 | 2.3k tokens/s
|
|
765
|
+
step 2 | loss 10.8734 | lr 2.4e-04 | 2.4k tokens/s
|
|
766
|
+
...
|
|
767
|
+
step 92 | loss 2.1234 | lr 3.8e-05 | 2.3k tokens/s
|
|
768
|
+
---
|
|
769
|
+
val_bpb: 1.534000
|
|
770
|
+
training_seconds: 312.4
|
|
771
|
+
total_seconds: 405.7
|
|
772
|
+
peak_vram_mb: 27528.9
|
|
773
|
+
mfu_percent: 0.00
|
|
774
|
+
total_tokens_M: 39.8
|
|
775
|
+
num_steps: 92
|
|
776
|
+
num_params_M: 21.3
|
|
777
|
+
depth: 4
|
|
778
|
+
```
|
|
779
|
+
|
|
780
|
+
### Simple Monitoring Script
|
|
781
|
+
|
|
782
|
+
```bash
|
|
783
|
+
# Watch results accumulate
|
|
784
|
+
while true; do
|
|
785
|
+
echo "--- $(date) ---"
|
|
786
|
+
tail -5 results.tsv | column -t -s $'\t'
|
|
787
|
+
echo "Total: $(wc -l < results.tsv) experiments"
|
|
788
|
+
sleep 60
|
|
789
|
+
done
|
|
790
|
+
```
|
|
791
|
+
|
|
792
|
+
### Multi-Machine Runs
|
|
793
|
+
|
|
794
|
+
Branch naming convention: `autoresearch/<date>-<machine>`
|
|
795
|
+
|
|
796
|
+
```bash
|
|
797
|
+
# Machine 1
|
|
798
|
+
git checkout -b autoresearch/mar5-m4max
|
|
799
|
+
|
|
800
|
+
# Machine 2
|
|
801
|
+
git checkout -b autoresearch/mar5-mini
|
|
802
|
+
```
|
|
803
|
+
|
|
804
|
+
After overnight runs, compare branches. In the runs reported above, the machines converged on universal improvements (DEPTH=4) and diverged on hardware-specific ones (optimizer choice). Cross-pollinate: try Machine 1's best config on Machine 2 and vice versa.
|
|
805
|
+
|
|
806
|
+
## Adapting to Other Domains
|
|
807
|
+
|
|
808
|
+
The pattern generalizes to any optimization problem with:
|
|
809
|
+
1. A mutable configuration/code (the "train.py")
|
|
810
|
+
2. An objective metric that's efficient to evaluate
|
|
811
|
+
3. A fixed budget per experiment
|
|
812
|
+
4. A keep/revert mechanism (git)
|
|
813
|
+
|
|
814
|
+
### Template
|
|
815
|
+
|
|
816
|
+
```yaml
|
|
817
|
+
# autoresearch-config.yaml
|
|
818
|
+
name: "my-project"
|
|
819
|
+
metric: "the_metric_name" # what to optimize
|
|
820
|
+
metric_direction: "minimize" # or "maximize"
|
|
821
|
+
|
|
822
|
+
mutable_files:
|
|
823
|
+
- "the_file_agent_can_edit.py"
|
|
824
|
+
|
|
825
|
+
immutable_files:
|
|
826
|
+
- "evaluation.py" # metric computation, cannot be gamed
|
|
827
|
+
- "data_loader.py" # fixed data pipeline
|
|
828
|
+
|
|
829
|
+
run_command: "python train.py > run.log 2>&1"
|
|
830
|
+
eval_command: "grep '^metric:' run.log"
|
|
831
|
+
|
|
832
|
+
budget_minutes: 5 # fixed time per experiment
|
|
833
|
+
branch_prefix: "autoresearch" # git branch naming
|
|
834
|
+
|
|
835
|
+
rules:
|
|
836
|
+
- "Only modify files listed in mutable_files"
|
|
837
|
+
- "Do not install new dependencies"
|
|
838
|
+
- "Simpler solutions preferred over complex ones"
|
|
839
|
+
- "Run indefinitely until interrupted"
|
|
840
|
+
```
|
|
841
|
+
|
|
842
|
+
### Example Domains
|
|
843
|
+
|
|
844
|
+
**ML model training** (the original):
|
|
845
|
+
- Mutable: train.py (architecture, optimizer, hyperparams)
|
|
846
|
+
- Metric: val_bpb or val_loss
|
|
847
|
+
- Budget: 5 min per experiment
|
|
848
|
+
|
|
849
|
+
**Inference optimization** (like Gerbil's kernel optimizer):
|
|
850
|
+
- Mutable: config.toml, shader code, kernel parameters
|
|
851
|
+
- Metric: tokens/second, latency_p99
|
|
852
|
+
- Budget: 2 min per benchmark run
|
|
853
|
+
|
|
854
|
+
**Compiler/codegen optimization**:
|
|
855
|
+
- Mutable: optimization passes, code generation rules
|
|
856
|
+
- Metric: benchmark suite runtime
|
|
857
|
+
- Budget: 10 min (compile + bench)
|
|
858
|
+
|
|
859
|
+
**Growth/marketing**:
|
|
860
|
+
- Mutable: landing page copy, ad targeting config
|
|
861
|
+
- Metric: conversion rate
|
|
862
|
+
- Budget: hours (need traffic for statistical significance)
|
|
863
|
+
|
|
864
|
+
**Fine-tuning pipeline**:
|
|
865
|
+
- Mutable: training config (hyperparams, data mix, LoRA settings)
|
|
866
|
+
- Metric: composite eval score (pass rate + preference rate)
|
|
867
|
+
- Budget: 30-90 min per cloud training run
|
|
868
|
+
|
|
869
|
+
### Key Considerations for Adaptation
|
|
870
|
+
|
|
871
|
+
1. **Cycle time is king**. Karpathy gets ~80 experiments overnight because each takes ~7 minutes total. If your cycle is 90 minutes, you get ~16/day. Find proxy metrics that correlate with your true objective but evaluate faster.
|
|
872
|
+
|
|
873
|
+
2. **The metric must be automatable**. Human judgment doesn't scale. Either automate evaluation entirely (val_bpb) or use an AI judge (Opus scoring responses). The metric must be a single scalar that the agent can compare with `<`.
|
|
874
|
+
|
|
875
|
+
3. **The mutable surface should be small**. One file, or a small set of config values. If the agent can change everything, the search space explodes and improvements don't stack reliably. One-file diffs are reviewable; multi-file changes are opaque.
|
|
876
|
+
|
|
877
|
+
4. **Git ratchet prevents regression**. This is critical. You never go backwards: a commit is kept only if it measured better than the previous best, so the branch tip always holds the best result so far. This is what makes overnight operation safe.
|
|
878
|
+
|
|
879
|
+
5. **The agent needs memory**. results.tsv + git log give the agent context on what worked and what didn't. Without this, the agent repeats failed experiments. The dual tracking (git for successes, TSV for everything) is essential.
|
|
880
|
+
|
|
881
|
+
6. **Establish your own baseline**. Never use someone else's baseline numbers. Run the unmodified code on your hardware and measure. Different hardware, different step counts, different optimal configurations.
|
|
882
|
+
|
|
883
|
+
7. **The time budget creates natural tradeoffs**. You don't need to manually balance model size vs. training steps — the fixed budget does it automatically. A bigger model gets fewer steps; the metric tells you which tradeoff wins.
|
|
884
|
+
|
|
885
|
+
8. **Hardware-specific optimization is a feature, not a bug**. The same loop on different hardware discovers different optimal configurations. This is correct behavior — the best config for an H100 is not the best config for a Mac Mini.
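
The ratchet and memory points above reduce to a small decision rule. The sketch below replays a list of metric values and labels each experiment with the keep/discard/crash statuses used in the log. It is an illustration under assumptions: crashes are represented as `None` rather than the logged `0.000000`, and the git commit and revert steps are left out.

```python
def is_improvement(candidate, best, direction="minimize"):
    """Compare one scalar metric against the best value so far."""
    if direction == "minimize":
        return candidate < best
    if direction == "maximize":
        return candidate > best
    raise ValueError("direction must be 'minimize' or 'maximize'")

def run_ratchet(results, direction="minimize"):
    """results[0] is the baseline; None marks a crashed run."""
    best = results[0]
    statuses = ["keep"]
    for value in results[1:]:
        if value is None:
            statuses.append("crash")
        elif is_improvement(value, best, direction):
            best = value
            statuses.append("keep")
        else:
            statuses.append("discard")
    return statuses, best

print(run_ratchet([0.998, 1.010, 0.990, None, 0.992]))
```

Treating crashes separately matters when minimizing: a crash logged as `0.000000` would otherwise look like the best result ever recorded.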
|
|
886
|
+
|
|
887
|
+
## The Vision
|
|
888
|
+
|
|
889
|
+
From Karpathy:
|
|
890
|
+
|
|
891
|
+
> "All LLM frontier labs will do this. It's the final boss battle. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges."
|
|
892
|
+
|
|
893
|
+
> "Any metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm."
|
|
894
|
+
|
|
895
|
+
> "One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of 'group meeting'. That era is long gone."
|
|
896
|
+
|
|
897
|
+
## References
|
|
898
|
+
|
|
899
|
+
- [karpathy/autoresearch](https://github.com/karpathy/autoresearch) — original repo
|
|
900
|
+
- [karpathy/nanochat](https://github.com/karpathy/nanochat) — the training codebase being optimized
|
|
901
|
+
- [nanochat commit 6ed7d1d](https://github.com/karpathy/nanochat/commit/6ed7d1d82cee16c2e26f45d559ad3338447a6c1b) — the stacked improvements from round 1
|
|
902
|
+
- [trevin-creator/autoresearch-mlx](https://github.com/trevin-creator/autoresearch-mlx) — Apple Silicon port
|
|
903
|
+
- [DeepWiki: autoresearch](https://deepwiki.com/karpathy/autoresearch) — detailed system documentation
|
|
904
|
+
- [DeepWiki: autoresearch-mlx](https://deepwiki.com/trevin-creator/autoresearch-mlx) — MLX port documentation
|