nanogpt 0.1.2 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: cf308fcec8ccec074200361b2327a2381a5412b92497391bb71ec5f154cd1283
- data.tar.gz: 87dcc389df03af0ac59fc0e75bbef7be7071f00613a2662cecf365ba38bc853e
+ metadata.gz: f3563f92bd2b986ae27476c80c596cca77438d899b53d017f6cb708179fd7ad4
+ data.tar.gz: d15ed5effb810c0fa022b4fa3d6708fc2623332b0d3f4de5057a66343a9a3b2f
  SHA512:
- metadata.gz: 2b6ceeb10236b639c82398c94d3c1a876eff549a17289ff45abae489e480a3d6a1db45f95a4b2e56f1d1df6afde28159965b188342d1e38c2482359dfb11e061
- data.tar.gz: 0fd4653c2719d1d3c339a904437fd0657f17e49fa0a9d4361a3a259806fab8dc798ed7f179722b41913d3471e8799abf78823f207988987d86cba680d7f90f03
+ metadata.gz: 030bd007f053f6400afec46810e10d1c8a96b881df932c291fc7efb637a4c00424cd9304facac71e56783a861417dcd6ec247b2f6df3926a443d2540ecf6adcf
+ data.tar.gz: c0bcf4eae610c2475c841c186ee8a2a6aebc2bb22ecf2d0fbb9d65c7218152699598d320cc6972262ff918d0991c4cc4c951a68552bec46b24bd4fee10aa1b07
data/CHANGELOG.md ADDED
@@ -0,0 +1,45 @@
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.2.0] - 2025-12-08

### Added

- **Custom text file training**: New `nanogpt prepare textfile <path>` command to train on any text file with character-level tokenization
  - Streams through large files without loading everything into memory
  - Auto-detects file encoding (UTF-8 or Windows-1252)
  - Configurable output directory name (`--output=NAME`)
  - Configurable train/validation split ratio (`--val_ratio=F`, default 0.1)
- Updated README with documentation for training on custom text files

## [0.1.2] - 2025-12-07

### Fixed

- Fixed `prepare` command to output files to current working directory

## [0.1.1] - 2025-12-07

### Fixed

- Fixed typo in documentation

## [0.1.0] - 2025-12-07

### Added

- Initial release
- Full GPT-2 architecture implementation in Ruby
- MPS (Metal) and CUDA GPU acceleration via torch.rb
- Flash attention support when dropout=0
- Character-level and GPT-2 BPE tokenizers
- Cosine learning rate schedule with warmup
- Gradient accumulation for larger effective batch sizes
- Checkpointing and training resumption
- Shakespeare character-level dataset
- OpenWebText dataset support
- CLI commands: `train`, `sample`, `bench`, `prepare`
data/Gemfile.lock CHANGED
@@ -1,17 +1,35 @@
  PATH
    remote: .
    specs:
-     nanogpt (0.1.0)
+     nanogpt (0.3.0)
        numo-narray (~> 0.9)
+       rackup (~> 2.0)
+       sinatra (~> 4.0)
+       sqlite3 (~> 2.0)
        tiktoken_ruby (~> 0.0)
        torch-rb (~> 0.14)
+       webrick (~> 1.8)

  GEM
    remote: https://rubygems.org/
    specs:
+     base64 (0.3.0)
      diff-lcs (1.6.2)
+     logger (1.7.0)
+     mustermann (3.0.4)
+       ruby2_keywords (~> 0.0.1)
      numo-narray (0.9.2.1)
      parquet (0.7.3-arm64-darwin)
+     rack (3.2.5)
+     rack-protection (4.2.1)
+       base64 (>= 0.1.0)
+       logger (>= 1.6.0)
+       rack (>= 3.0.0, < 4)
+     rack-session (2.1.1)
+       base64 (>= 0.1.0)
+       rack (>= 3.0.0)
+     rackup (2.3.1)
+       rack (>= 3)
      rice (4.7.1)
      rspec (3.13.2)
        rspec-core (~> 3.13.0)
@@ -26,12 +44,24 @@ GEM
        diff-lcs (>= 1.2.0, < 2.0)
        rspec-support (~> 3.13.0)
      rspec-support (3.13.6)
+     ruby2_keywords (0.0.5)
+     sinatra (4.2.1)
+       logger (>= 1.6.0)
+       mustermann (~> 3.0)
+       rack (>= 3.0.0, < 4)
+       rack-protection (= 4.2.1)
+       rack-session (>= 2.0.0, < 3)
+       tilt (~> 2.0)
+     sqlite3 (2.9.0-arm64-darwin)
      tiktoken_ruby (0.0.13-arm64-darwin)
+     tilt (2.7.0)
      torch-rb (0.22.2)
        rice (>= 4.7)
+     webrick (1.9.2)

  PLATFORMS
    arm64-darwin-24
+   arm64-darwin-25

  DEPENDENCIES
    nanogpt!
data/README.md CHANGED
@@ -1,5 +1,7 @@
  # nanoGPT

+ [![Gem Version](https://badge.fury.io/rb/nanogpt.svg)](https://rubygems.org/gems/nanogpt)
+
  A Ruby port of Karpathy's [nanoGPT](https://github.com/karpathy/nanoGPT). Train GPT-2 style language models from scratch using [torch.rb](https://github.com/ankane/torch.rb).

  Built for Ruby developers who want to understand how LLMs work by building one.
@@ -13,7 +15,7 @@ gem install nanogpt
  nanogpt prepare shakespeare_char

  # Train (use MPS on Apple Silicon for 17x speedup)
- nanogpt train --dataset=shakespeare_char --device=mps
+ nanogpt train --dataset=shakespeare_char --device=mps --max_iters=2000

  # Generate text
  nanogpt sample --dataset=shakespeare_char
@@ -30,7 +32,7 @@ bundle install
  bundle exec ruby data/shakespeare_char/prepare.rb

  # Train
- bundle exec exe/nanogpt train --dataset=shakespeare_char --device=mps
+ bundle exec exe/nanogpt train --dataset=shakespeare_char --device=mps --max_iters=2000

  # Sample
  bundle exec exe/nanogpt sample --dataset=shakespeare_char
@@ -81,6 +83,53 @@ nanogpt bench [options] # Run performance benchmarks
  --top_k=N # Top-k sampling (default: 200)
  ```

+ ## Training on Your Own Text
+
+ You can train on any text file using the `textfile` command:
+
+ ```bash
+ # Prepare your text file (creates char-level tokenizer)
+ nanogpt prepare textfile /path/to/mybook.txt --output=mybook
+
+ # Train a model
+ nanogpt train --dataset=mybook --device=mps --max_iters=2000
+
+ # Generate text
+ nanogpt sample --dataset=mybook --start="Once upon a time"
+ ```
+
+ ### Options
+
+ ```bash
+ --output=NAME   # Output directory name (default: derived from filename)
+ --val_ratio=F   # Validation split ratio (default: 0.1)
+ ```
+
+ ### Example: Training on a Novel
+
+ ```bash
+ # Download a book
+ curl -o lotr.txt "https://example.com/fellowship.txt"
+
+ # Prepare (handles UTF-8 and Windows-1252 encodings)
+ nanogpt prepare textfile lotr.txt --output=lotr
+
+ # Train a larger model for better results
+ nanogpt train --dataset=lotr --device=mps \
+   --max_iters=2000 \
+   --n_layer=6 --n_head=6 --n_embd=384 \
+   --block_size=256 --batch_size=32
+
+ # Sample with a prompt
+ nanogpt sample --dataset=lotr --start="Frodo" --max_new_tokens=500
+ ```
+
+ The `textfile` command:
+ - Streams through large files without loading everything into memory
+ - Auto-detects encoding (UTF-8 or Windows-1252)
+ - Creates a character-level vocabulary from your text
+ - Splits into train/validation sets
+
  ## Features

  - Full GPT-2 architecture (attention, MLP, layer norm, embeddings)
@@ -0,0 +1,429 @@
# nanoGPT Architecture

This document explains the architecture of nanoGPT, a Ruby implementation of GPT-2 style language models.

## Overview

nanoGPT implements a **decoder-only transformer** architecture for autoregressive language modeling. The model learns to predict the next token in a sequence given all previous tokens.

```
Input Tokens → Embeddings → Transformer Blocks → Output Logits → Next Token Prediction
```

## Core Components

### 1. Token & Position Embeddings

The model converts discrete tokens into continuous vectors:

```
Token IDs [0, 42, 7, ...] → Token Embeddings (vocab_size × n_embd)
Position  [0, 1, 2, ...]  → Position Embeddings (block_size × n_embd)

Final Embedding = TokenEmbed(token) + PosEmbed(position)
```

**Files:** `lib/nano_gpt/model.rb:16-18`

The position embeddings are learned (not sinusoidal) and allow the model to understand token order.
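As a toy illustration in plain Ruby (hypothetical tiny sizes, with plain arrays standing in for torch.rb tensors), the lookup-and-add works like this:

```ruby
# Toy token + position embeddings. Hypothetical sizes:
# vocab_size = 4, block_size = 3, n_embd = 2.
vocab_size, block_size, n_embd = 4, 3, 2

# "Learned" tables, here filled with small deterministic values.
tok_embed = Array.new(vocab_size) { |t| Array.new(n_embd) { |d| t + d * 0.1 } }
pos_embed = Array.new(block_size) { |p| Array.new(n_embd) { |d| p * 0.01 } }

tokens = [2, 0, 3] # one input sequence of token IDs

# Final embedding at each position = token vector + position vector.
x = tokens.each_with_index.map do |tok, pos|
  tok_embed[tok].zip(pos_embed[pos]).map { |a, b| a + b }
end
```

Every position now carries both *what* the token is and *where* it sits in the sequence.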

### 2. Transformer Block

Each transformer block contains two sub-layers with residual connections:

```
┌─────────────────────────────────────────────────┐
│                Transformer Block                │
├─────────────────────────────────────────────────┤
│                                                 │
│  x ─→ LayerNorm ─→ Attention ─→ (+) ─┐          │
│  └───────────────────────────────────┘          │
│                                                 │
│  x ─→ LayerNorm ─→ MLP ─→ (+) ─→ output         │
│  └────────────────────────────────────┘         │
│                                                 │
└─────────────────────────────────────────────────┘
```

**File:** `lib/nano_gpt/layers/block.rb`

This is "pre-norm" architecture (LayerNorm before sub-layer, not after) which improves training stability.
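The wiring can be sketched in plain Ruby, with stand-in lambdas for the attention and MLP sub-layers (a structural sketch only; the real sub-layers are torch.rb modules):

```ruby
# Layer normalization over a plain array: zero mean, unit variance.
def layer_norm(vec, eps: 1e-5)
  mean = vec.sum / vec.size.to_f
  var  = vec.map { |v| (v - mean)**2 }.sum / vec.size.to_f
  vec.map { |v| (v - mean) / Math.sqrt(var + eps) }
end

# Stand-in sub-layers; in nanoGPT these are CausalSelfAttention and MLP.
attn = ->(x) { x.map { |v| v * 0.5 } }
mlp  = ->(x) { x.map { |v| v + 1.0 } }

# Pre-norm block: normalize *before* each sub-layer, then add the residual.
def block(x, attn, mlp)
  x = x.zip(attn.call(layer_norm(x))).map { |a, b| a + b }
  x.zip(mlp.call(layer_norm(x))).map { |a, b| a + b }
end

y = block([1.0, 2.0, 3.0], attn, mlp)
```

The residual additions (`a + b`) are what let gradients flow straight through a deep stack of blocks.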

### 3. Causal Self-Attention

The attention mechanism allows each token to attend to all previous tokens (but not future ones):

```
Q = x @ W_q  (Query: what am I looking for?)
K = x @ W_k  (Key: what do I contain?)
V = x @ W_v  (Value: what information do I provide?)

Attention(Q,K,V) = softmax(Q @ K^T / sqrt(d_k)) @ V
```

The "causal" part means we mask future positions:

```
Position:  0  1  2  3
Token 0:  [1  0  0  0]  ← can only see itself
Token 1:  [1  1  0  0]  ← can see positions 0,1
Token 2:  [1  1  1  0]  ← can see positions 0,1,2
Token 3:  [1  1  1  1]  ← can see all
```

**File:** `lib/nano_gpt/layers/causal_self_attention.rb`

Multi-head attention splits the embedding into `n_head` parallel attention operations, allowing the model to attend to different aspects simultaneously.
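The mask-then-softmax step can be sketched in plain Ruby over toy score arrays (the gem does this with tensor ops):

```ruby
# Softmax over an array of scores.
def softmax(scores)
  exps = scores.map { |s| Math.exp(s - scores.max) } # subtract max for stability
  sum = exps.sum
  exps.map { |e| e / sum }
end

# Toy attention scores for a 4-token sequence: scores[i][j] = Q_i · K_j / sqrt(d_k).
scores = Array.new(4) { Array.new(4) { 1.0 } }

# Causal mask: position i may only attend to positions 0..i.
weights = scores.each_with_index.map do |row, i|
  masked = row.each_with_index.map { |s, j| j <= i ? s : -Float::INFINITY }
  softmax(masked)
end
```

Setting masked scores to negative infinity means they become exactly zero after softmax, so each row matches the triangular pattern above: row 0 attends only to itself, row 3 to everything.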

### 4. MLP (Feed-Forward Network)

After attention, each token passes through an MLP:

```
x → Linear(n_embd → 4×n_embd) → GELU → Linear(4×n_embd → n_embd) → output
```

**File:** `lib/nano_gpt/layers/mlp.rb`

The 4× expansion provides additional capacity for computation.
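The GELU activation itself is a one-liner with Ruby's built-in `Math.erf` (this is the exact form; GPT-2 historically also used a tanh approximation):

```ruby
# Exact GELU: x * Φ(x), where Φ is the standard normal CDF.
def gelu(x)
  0.5 * x * (1.0 + Math.erf(x / Math.sqrt(2.0)))
end

gelu(0.0)  # => 0.0
gelu(3.0)  # close to 3.0: large positive inputs pass through almost unchanged
gelu(-3.0) # close to 0.0: large negative inputs are suppressed
```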

### 5. Output Layer (Weight Tying)

The final output uses the transposed token embedding matrix:

```
logits = LayerNorm(x) @ TokenEmbed.T
```

This "weight tying" reduces parameters and improves performance.

## Data Flow

```
┌────────────────────────────────────────────────────────────────────┐
│                          GPT Forward Pass                          │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  Input: token_ids [batch, seq_len]                                 │
│                      ↓                                             │
│  Token Embedding + Position Embedding                              │
│                      ↓                                             │
│  Dropout                                                           │
│                      ↓                                             │
│  ┌──────────────────────────────────────┐                          │
│  │  Transformer Block 1                 │                          │
│  │    LayerNorm → Attention → Residual  │                          │
│  │    LayerNorm → MLP → Residual        │                          │
│  └──────────────────────────────────────┘                          │
│                      ↓                                             │
│        ... (repeat n_layer times) ...                              │
│                      ↓                                             │
│  Final LayerNorm                                                   │
│                      ↓                                             │
│  Linear projection (tied weights)                                  │
│                      ↓                                             │
│  Output: logits [batch, seq_len, vocab_size]                       │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
```

## Training Pipeline

```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Prepare   │ →  │    Load     │ →  │   Forward   │ →  │  Backward   │
│    Data     │    │    Batch    │    │    Pass     │    │    Pass     │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
      │                  │                  │                  │
  train.bin         DataLoader        GPT.forward       loss.backward
  val.bin           get_batch()            ↓                  ↓
  meta.json              ↓           Cross-entropy        Gradients
                  [x, y] tensors          loss            computed


┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Update    │ ←  │    Clip     │ ←  │ Accumulate  │ ←  (gradients from
│   Weights   │    │  Gradients  │    │  Gradients  │     backward pass)
└─────────────┘    └─────────────┘    └─────────────┘
      │
    AdamW
  optimizer
```

**File:** `lib/nano_gpt/trainer.rb`

## Generation (Inference)

```
Start: "Hello"

Tokenize: [15496]

┌─────────────────────────────────────┐
│ Loop until max_tokens:              │
│   1. Forward pass → logits          │
│   2. Apply temperature              │
│   3. Apply top-k filtering          │
│   4. Sample from distribution       │
│   5. Append to sequence             │
└─────────────────────────────────────┘

Decode: "Hello, how are you today?"
```

**File:** `lib/nano_gpt/model.rb:110-147`
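Steps 2-4 of the loop can be sketched in plain Ruby over toy logits (a sketch only, not the gem's tensor-based implementation in `model.rb`):

```ruby
# Softmax over an array of scores.
def softmax(scores)
  exps = scores.map { |s| Math.exp(s - scores.max) }
  sum = exps.sum
  exps.map { |e| e / sum }
end

# One decoding step: temperature scaling, top-k filtering, then sampling.
def sample_step(logits, temperature: 1.0, top_k: nil, rng: Random.new)
  scaled = logits.map { |l| l / temperature }

  # Top-k: keep only the k largest logits, mask the rest out.
  if top_k
    cutoff = scaled.sort.last(top_k).min
    scaled = scaled.map { |l| l >= cutoff ? l : -Float::INFINITY }
  end

  probs = softmax(scaled)

  # Sample an index from the resulting distribution.
  r = rng.rand
  cum = 0.0
  probs.each_with_index do |p, i|
    cum += p
    return i if r < cum
  end
  probs.size - 1
end

token = sample_step([2.0, 1.0, 0.5, -1.0, -2.0], temperature: 0.8, top_k: 2)
```

With `top_k: 2` only the two highest-scoring tokens can ever be emitted, and lowering the temperature sharpens the distribution toward the argmax.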

## File Structure

```
lib/nano_gpt/
├── model.rb                      # GPT model class
├── config.rb                     # GPTConfig (model hyperparameters)
├── train_config.rb               # Training/sampling/bench configs
├── trainer.rb                    # Training loop
├── data_loader.rb                # Batch loading from binary files
├── tokenizer.rb                  # Character & GPT-2 BPE tokenizers
├── lr_scheduler.rb               # Cosine annealing with warmup
├── device.rb                     # CPU/CUDA/MPS device detection
├── textfile_preparer.rb          # Custom dataset preparation
└── layers/
    ├── block.rb                  # Transformer block
    ├── causal_self_attention.rb  # Multi-head attention
    ├── mlp.rb                    # Feed-forward network
    └── layer_norm.rb             # Layer normalization
```

---

# Glossary

## A

### Attention
A mechanism that allows each position in a sequence to dynamically focus on other positions. Computes weighted sums of values based on query-key similarities.

### Autoregressive
A model that generates outputs one step at a time, where each step depends on all previous steps. GPT generates text left-to-right, predicting one token at a time.

### AdamW
An optimizer that combines Adam (adaptive learning rates) with decoupled weight decay. Used for training the model.

## B

### Batch Size
The number of independent sequences processed simultaneously. Larger batches provide more stable gradients but require more memory.

### Block Size
The maximum sequence length (context window) the model can process. Also called "context length". Default: 256 tokens.

### BPE (Byte Pair Encoding)
A tokenization algorithm that builds a vocabulary by iteratively merging frequent character pairs. GPT-2 uses BPE with ~50k tokens.

## C

### Causal Mask
A triangular mask that prevents attention from seeing future tokens. Essential for autoregressive generation.

### Checkpoint
A saved snapshot of model weights and training state. Allows resuming training or using the model for inference.

### Context Window
See "Block Size". The number of tokens the model can "see" when making predictions.

### Cross-Entropy Loss
The training objective. Measures how well the model's predicted probability distribution matches the actual next token.

## D

### Decoder-Only
A transformer architecture that only uses the decoder stack (with causal masking). GPT models are decoder-only, unlike encoder-decoder models like T5.

### Dropout
A regularization technique that randomly zeros out activations during training. Prevents overfitting. Set to 0 during inference.

## E

### Embedding
A learned vector representation of a discrete token. Maps vocabulary indices to continuous vectors of dimension `n_embd`.

### Embedding Dimension (n_embd)
The size of token representations. Larger dimensions allow more expressive representations but require more computation. Default: 384.

## F

### Flash Attention
An optimized attention implementation that's faster and more memory-efficient. Used when dropout is 0.

### Forward Pass
Computing the model's output from input. Propagates activations through all layers to produce logits.

## G

### GELU (Gaussian Error Linear Unit)
An activation function: `GELU(x) = x * Φ(x)` where Φ is the Gaussian CDF. Smoother than ReLU.

### Gradient Accumulation
Simulating larger batch sizes by accumulating gradients over multiple forward-backward passes before updating weights.

### Gradient Clipping
Limiting gradient magnitudes to prevent training instability. Default max norm: 1.0.

## H

### Head (Attention Head)
One parallel attention computation. Multi-head attention runs `n_head` heads in parallel, each with dimension `n_embd/n_head`.

### Head Size
The dimension of each attention head: `head_size = n_embd / n_head`.

## I

### Inference
Using a trained model to generate predictions. Also called "sampling" or "generation".

### Iteration
One update step in training. May include multiple micro-batches if using gradient accumulation.

## K

### Key (K)
In attention, the vector that represents "what information this position contains". Compared against queries.

## L

### Layer Normalization
Normalizes activations across the feature dimension. Stabilizes training. Applied before attention and MLP in pre-norm architecture.

### Learning Rate
Controls how much weights change per update. Typically uses warmup + cosine decay schedule.
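The schedule this entry names (linear warmup, then cosine decay to a floor) can be written in a few lines of Ruby. This is a generic sketch with assumed parameter names and defaults, not the gem's `lr_scheduler.rb` code:

```ruby
# Warmup + cosine decay learning rate schedule (generic sketch).
def learning_rate(it, max_lr: 1e-3, min_lr: 1e-4, warmup_iters: 100, decay_iters: 2000)
  # 1) Linear warmup from ~0 up to max_lr.
  return max_lr * (it + 1) / warmup_iters.to_f if it < warmup_iters
  # 2) After the decay horizon, hold at min_lr.
  return min_lr if it > decay_iters
  # 3) Cosine decay from max_lr down to min_lr in between.
  progress = (it - warmup_iters) / (decay_iters - warmup_iters).to_f
  coeff = 0.5 * (1.0 + Math.cos(Math::PI * progress))
  min_lr + coeff * (max_lr - min_lr)
end
```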

### Logits
Raw (unnormalized) model outputs before softmax. Shape: [batch, seq_len, vocab_size].

### Loss
The training objective to minimize. Cross-entropy between predicted and actual next tokens.

## M

### MFU (Model FLOPs Utilization)
Percentage of theoretical GPU compute being used. Higher is better. Measures hardware efficiency.

### MLP (Multi-Layer Perceptron)
The feed-forward network in each transformer block. Expands to 4× dimension then projects back.

### MPS (Metal Performance Shaders)
Apple's GPU compute framework. Used for acceleration on Apple Silicon Macs.

## N

### n_embd
Embedding dimension. The size of token vectors throughout the model. Default: 384.

### n_head
Number of attention heads. Must divide n_embd evenly. Default: 6.

### n_layer
Number of transformer blocks stacked. More layers = more capacity but slower. Default: 6.

## O

### Optimizer
Algorithm that updates model weights based on gradients. nanoGPT uses AdamW.

## P

### Parameters
The learnable weights of the model. Measured in millions (M) or billions (B).

### Position Embedding
Learned vectors that encode token position. Added to token embeddings so the model knows token order.

### Pre-norm
Architecture where LayerNorm is applied before (not after) attention/MLP. Improves training stability.

## Q

### Query (Q)
In attention, the vector that represents "what this position is looking for". Compared against keys.

## R

### Residual Connection
Adding the input directly to the output: `output = layer(x) + x`. Helps gradient flow and enables deeper networks.

## S

### Sampling
Generating text by repeatedly predicting and appending tokens. See "Generation".

### Softmax
Converts logits to probabilities: `softmax(x)_i = exp(x_i) / sum(exp(x_j))`. Output sums to 1.

### State Dict
A dictionary mapping parameter names to their tensor values. Used for saving/loading models.

## T

### Temperature
Scaling factor for logits before softmax during generation. Lower = more deterministic, higher = more random.

### Token
The atomic unit of text. Can be a character, word piece, or subword depending on the tokenizer.

### Tokenizer
Converts text to token IDs and vice versa. nanoGPT supports character-level and GPT-2 BPE tokenization.

### Top-k Sampling
Restricting sampling to the k most probable tokens. Reduces incoherent outputs.

### Transformer
The neural network architecture based on self-attention. Introduced in "Attention Is All You Need" (2017).

## V

### Value (V)
In attention, the vector containing "the information to retrieve". Weighted sum of values is the attention output.

### Validation Loss
Loss computed on held-out data. Used to detect overfitting and decide when to save checkpoints.

### Vocab Size
Number of unique tokens. Character-level: ~65-100. GPT-2 BPE: 50,257.

## W

### Warmup
Gradually increasing learning rate at the start of training. Prevents early instability.

### Weight Decay
L2 regularization applied to weights (not biases/layernorm). Prevents overfitting. Default: 0.1.

### Weight Tying
Sharing the token embedding matrix with the output projection. Reduces parameters.

---

## Model Size Formula

```
Parameters ≈ 12 × n_layer × n_embd²

Example (default config):
12 × 6 × 384² ≈ 10.6M parameters
```
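Checking the arithmetic in Ruby:

```ruby
# Approximate parameter count: 12 × n_layer × n_embd² (default config).
n_layer, n_embd = 6, 384
params = 12 * n_layer * n_embd**2
params # => 10_616_832, i.e. roughly 10.6M
```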

## Compute Requirements

Training one iteration processes:

```
tokens = batch_size × block_size × gradient_accumulation_steps

FLOPs ≈ 6 × parameters × tokens  (forward + backward)
```
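Plugging in example values (batch_size 32, block_size 256, a single accumulation step, and the ~10.6M-parameter default model; these settings are illustrative, not the gem's mandated defaults):

```ruby
batch_size, block_size, grad_accum = 32, 256, 1
params = 10_616_832

tokens = batch_size * block_size * grad_accum # tokens processed per iteration
flops  = 6 * params * tokens                  # forward + backward estimate

tokens # => 8192
```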

## References

- [Attention Is All You Need](https://arxiv.org/abs/1706.03762) - Original transformer paper
- [GPT-2](https://openai.com/research/better-language-models) - OpenAI's GPT-2 model
- [nanoGPT](https://github.com/karpathy/nanoGPT) - Original Python implementation
- [PaLM](https://arxiv.org/abs/2204.02311) - MFU calculation reference