nanogpt 0.2.0 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: c2d0853148f473bb23f4ffffaa5a7b9e45f3faff570cc0a85253d40b6b63b80a
-   data.tar.gz: 68f31e52460273d97d1a97de844fb79c9bfab22b939a897482b6def4c77bdecc
+   metadata.gz: f3563f92bd2b986ae27476c80c596cca77438d899b53d017f6cb708179fd7ad4
+   data.tar.gz: d15ed5effb810c0fa022b4fa3d6708fc2623332b0d3f4de5057a66343a9a3b2f
  SHA512:
-   metadata.gz: 42a617a95e04d7727cc011a033ea039b8763a9103093dbec7ad614eb74a9a4f4ed818d82ffedeb8e627494dd73fd7950fdb30f97dd5ff63866123a0e892befc1
-   data.tar.gz: af2916e7179830cc91fc516be16b765e46a3c6f37cb732583900bf71591abf16d8bbdc250bbe07a0b83eac124a5ec5b7241684017b4c166ebe3fcb8611702592
+   metadata.gz: 030bd007f053f6400afec46810e10d1c8a96b881df932c291fc7efb637a4c00424cd9304facac71e56783a861417dcd6ec247b2f6df3926a443d2540ecf6adcf
+   data.tar.gz: c0bcf4eae610c2475c841c186ee8a2a6aebc2bb22ecf2d0fbb9d65c7218152699598d320cc6972262ff918d0991c4cc4c951a68552bec46b24bd4fee10aa1b07
data/Gemfile.lock CHANGED
@@ -1,17 +1,35 @@
  PATH
    remote: .
    specs:
-     nanogpt (0.1.2)
+     nanogpt (0.3.0)
        numo-narray (~> 0.9)
+       rackup (~> 2.0)
+       sinatra (~> 4.0)
+       sqlite3 (~> 2.0)
        tiktoken_ruby (~> 0.0)
        torch-rb (~> 0.14)
+       webrick (~> 1.8)

  GEM
    remote: https://rubygems.org/
    specs:
+     base64 (0.3.0)
      diff-lcs (1.6.2)
+     logger (1.7.0)
+     mustermann (3.0.4)
+       ruby2_keywords (~> 0.0.1)
      numo-narray (0.9.2.1)
      parquet (0.7.3-arm64-darwin)
+     rack (3.2.5)
+     rack-protection (4.2.1)
+       base64 (>= 0.1.0)
+       logger (>= 1.6.0)
+       rack (>= 3.0.0, < 4)
+     rack-session (2.1.1)
+       base64 (>= 0.1.0)
+       rack (>= 3.0.0)
+     rackup (2.3.1)
+       rack (>= 3)
      rice (4.7.1)
      rspec (3.13.2)
        rspec-core (~> 3.13.0)
@@ -26,9 +44,20 @@ GEM
        diff-lcs (>= 1.2.0, < 2.0)
        rspec-support (~> 3.13.0)
      rspec-support (3.13.6)
+     ruby2_keywords (0.0.5)
+     sinatra (4.2.1)
+       logger (>= 1.6.0)
+       mustermann (~> 3.0)
+       rack (>= 3.0.0, < 4)
+       rack-protection (= 4.2.1)
+       rack-session (>= 2.0.0, < 3)
+       tilt (~> 2.0)
+     sqlite3 (2.9.0-arm64-darwin)
      tiktoken_ruby (0.0.13-arm64-darwin)
+     tilt (2.7.0)
      torch-rb (0.22.2)
        rice (>= 4.7)
+     webrick (1.9.2)

  PLATFORMS
    arm64-darwin-24
@@ -0,0 +1,429 @@
# nanoGPT Architecture

This document explains the architecture of nanoGPT, a Ruby implementation of GPT-2 style language models.

## Overview

nanoGPT implements a **decoder-only transformer** architecture for autoregressive language modeling. The model learns to predict the next token in a sequence given all previous tokens.

```
Input Tokens → Embeddings → Transformer Blocks → Output Logits → Next Token Prediction
```

## Core Components

### 1. Token & Position Embeddings

The model converts discrete tokens into continuous vectors:

```
Token IDs [0, 42, 7, ...] → Token Embeddings    (vocab_size × n_embd)
Position  [0, 1, 2, ...]  → Position Embeddings (block_size × n_embd)

Final Embedding = TokenEmbed(token) + PosEmbed(position)
```

**Files:** `lib/nano_gpt/model.rb:16-18`

The position embeddings are learned (not sinusoidal) and allow the model to understand token order.
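As an illustrative sketch of the lookup-and-add (plain Ruby arrays with toy sizes; the gem itself does this with torch-rb embedding layers):

```ruby
# Toy embedding tables as nested arrays: one row per token id / position.
# Sizes here are toy values, not the gem's defaults.
vocab_size, block_size, n_embd = 8, 4, 3

rng = Random.new(42)
token_embed = Array.new(vocab_size) { Array.new(n_embd) { rng.rand(-0.1..0.1) } }
pos_embed   = Array.new(block_size) { Array.new(n_embd) { rng.rand(-0.1..0.1) } }

token_ids = [0, 5, 7]

# Final embedding for position i = token row + position row, element-wise.
x = token_ids.each_with_index.map do |tok, pos|
  token_embed[tok].zip(pos_embed[pos]).map { |t, p| t + p }
end
# x is [seq_len][n_embd], ready to feed into the transformer blocks.
```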

### 2. Transformer Block

Each transformer block contains two sub-layers with residual connections:

```
┌──────────────────────────────────────────────┐
│              Transformer Block               │
├──────────────────────────────────────────────┤
│                                              │
│   x ──→ LayerNorm ──→ Attention ──→ (+)      │
│   │                                  ↑       │
│   └──────────── residual ────────────┘       │
│                                              │
│   x ──→ LayerNorm ──→ MLP ──→ (+) ──→ output │
│   │                            ↑             │
│   └───────── residual ─────────┘             │
│                                              │
└──────────────────────────────────────────────┘
```

**File:** `lib/nano_gpt/layers/block.rb`

This is a "pre-norm" architecture (LayerNorm before each sub-layer, not after), which improves training stability.

### 3. Causal Self-Attention

The attention mechanism allows each token to attend to all previous tokens (but not future ones):

```
Q = x @ W_q   (Query: what am I looking for?)
K = x @ W_k   (Key:   what do I contain?)
V = x @ W_v   (Value: what information do I provide?)

Attention(Q, K, V) = softmax(Q @ K^T / sqrt(d_k)) @ V
```

The "causal" part means we mask future positions:

```
Position:  0 1 2 3
Token 0:  [1 0 0 0]  ← can only see itself
Token 1:  [1 1 0 0]  ← can see positions 0,1
Token 2:  [1 1 1 0]  ← can see positions 0,1,2
Token 3:  [1 1 1 1]  ← can see all
```

**File:** `lib/nano_gpt/layers/causal_self_attention.rb`

Multi-head attention splits the embedding into `n_head` parallel attention operations, allowing the model to attend to different aspects simultaneously.
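A minimal single-head version of this computation in plain Ruby (an illustrative sketch on nested arrays; the gem's real implementation in `causal_self_attention.rb` runs on torch-rb tensors with multiple heads):

```ruby
# Single-head scaled dot-product attention with a causal mask.

def softmax(row)
  m = row.max
  exps = row.map { |v| Math.exp(v - m) }
  s = exps.sum
  exps.map { |e| e / s }
end

def causal_attention(q, k, v)
  d_k = q.first.length.to_f
  t = q.length
  scores = Array.new(t) do |i|
    Array.new(t) do |j|
      if j > i
        -Float::INFINITY # causal mask: never attend to future positions
      else
        q[i].zip(k[j]).sum { |a, b| a * b } / Math.sqrt(d_k)
      end
    end
  end
  weights = scores.map { |row| softmax(row) }
  # Output i = weighted sum of the value rows, using row i's weights.
  weights.map do |w|
    (0...v.first.length).map { |d| w.each_with_index.sum { |wt, j| wt * v[j][d] } }
  end
end
```

Because position 0's row is masked everywhere except itself, its attention weights are `[1, 0, ...]` and its output is exactly its own value vector.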

### 4. MLP (Feed-Forward Network)

After attention, each token passes through an MLP:

```
x → Linear(n_embd → 4×n_embd) → GELU → Linear(4×n_embd → n_embd) → output
```

**File:** `lib/nano_gpt/layers/mlp.rb`

The 4× expansion provides additional capacity for computation.
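The same expand–activate–project shape in a runnable toy form (plain Ruby, exact GELU via `Math.erf`; the weights are made-up constants, not trained values):

```ruby
# GELU activation: GELU(x) = x * Φ(x), with Φ the standard normal CDF.
def gelu(x)
  0.5 * x * (1.0 + Math.erf(x / Math.sqrt(2.0)))
end

# Toy matrix-vector product standing in for a Linear layer.
def matvec(w, x)
  w.map { |row| row.zip(x).sum { |a, b| a * b } }
end

n_embd = 2
w_up   = Array.new(4 * n_embd) { Array.new(n_embd, 0.5) }  # n_embd -> 4*n_embd
w_down = Array.new(n_embd) { Array.new(4 * n_embd, 0.25) } # 4*n_embd -> n_embd

x      = [1.0, 0.5]
hidden = matvec(w_up, x).map { |h| gelu(h) } # expanded to length 8
out    = matvec(w_down, hidden)              # projected back to length 2
```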

### 5. Output Layer (Weight Tying)

The final output uses the transposed token embedding matrix:

```
logits = LayerNorm(x) @ TokenEmbed.T
```

This "weight tying" reduces parameters and improves performance.
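Some quick arithmetic on what tying saves, using the GPT-2 vocabulary size (50,257) and the default `n_embd` of 384 quoted elsewhere in this document:

```ruby
vocab_size, n_embd = 50_257, 384

# Untied: separate input-embedding and output-projection matrices.
untied = 2 * vocab_size * n_embd
# Tied: one matrix shared between embedding lookup and output projection.
tied = vocab_size * n_embd

saved = untied - tied # => 19_298_688 parameters saved
```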

## Data Flow

```
┌──────────────────────────────────────────────────┐
│                 GPT Forward Pass                 │
├──────────────────────────────────────────────────┤
│                                                  │
│   Input: token_ids [batch, seq_len]              │
│                    ↓                             │
│   Token Embedding + Position Embedding           │
│                    ↓                             │
│   Dropout                                        │
│                    ↓                             │
│   ┌────────────────────────────────────┐         │
│   │ Transformer Block 1                │         │
│   │ LayerNorm → Attention → Residual   │         │
│   │ LayerNorm → MLP → Residual         │         │
│   └────────────────────────────────────┘         │
│                    ↓                             │
│   ... (repeat n_layer times) ...                 │
│                    ↓                             │
│   Final LayerNorm                                │
│                    ↓                             │
│   Linear projection (tied weights)               │
│                    ↓                             │
│   Output: logits [batch, seq_len, vocab_size]    │
│                                                  │
└──────────────────────────────────────────────────┘
```

## Training Pipeline

```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Prepare   │ →  │    Load     │ →  │   Forward   │ →  │  Backward   │
│    Data     │    │    Batch    │    │    Pass     │    │    Pass     │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
      │                  │                  │                  │
  train.bin         DataLoader         GPT.forward       loss.backward
  val.bin           get_batch()             ↓                  ↓
  meta.json              ↓            Cross-entropy        Gradients
                   [x, y] tensors         loss              computed

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Update    │ ←  │    Clip     │ ←  │ Accumulate  │
│   Weights   │    │  Gradients  │    │  Gradients  │
└─────────────┘    └─────────────┘    └─────────────┘
      │
    AdamW
  optimizer
```

**File:** `lib/nano_gpt/trainer.rb`
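The accumulate → clip → update stages can be seen on a toy one-parameter problem (a pure-Ruby sketch with a hand-computed gradient and plain SGD; the real loop in `trainer.rb` drives torch-rb autograd and AdamW):

```ruby
# Minimize loss(w) = (w - 3)^2, showing accumulate -> clip -> update.
w  = 0.0
lr = 0.1
grad_accum_steps = 4
max_norm = 1.0

60.times do
  # Accumulate: average gradients over several "micro-batches".
  grad = 0.0
  grad_accum_steps.times do
    grad += 2.0 * (w - 3.0) / grad_accum_steps # analytic d/dw of (w-3)^2
  end

  # Clip: cap the gradient magnitude at max_norm.
  grad = grad.clamp(-max_norm, max_norm)

  # Update: plain SGD step here (the trainer uses AdamW instead).
  w -= lr * grad
end
# w ends up close to the minimum at 3.0
```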

## Generation (Inference)

```
Start:    "Hello"

Tokenize: [15496]

┌─────────────────────────────────────┐
│ Loop until max_tokens:              │
│   1. Forward pass → logits          │
│   2. Apply temperature              │
│   3. Apply top-k filtering          │
│   4. Sample from distribution       │
│   5. Append to sequence             │
└─────────────────────────────────────┘

Decode:   "Hello, how are you today?"
```

**File:** `lib/nano_gpt/model.rb:110-147`
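Steps 2–4 of the loop, sketched on a plain logit array (illustrative only; `model.rb` does this on torch-rb tensors, and the method name here is made up):

```ruby
# Temperature scaling, top-k filtering, and sampling from the result.
def sample_next(logits, temperature: 1.0, top_k: nil, rng: Random.new)
  # Temperature: divide logits before softmax. <1 sharpens, >1 flattens.
  scaled = logits.map { |l| l / temperature }

  # Top-k: keep only the k largest logits, mask the rest to -infinity.
  if top_k
    cutoff = scaled.sort.last(top_k).min
    scaled = scaled.map { |l| l >= cutoff ? l : -Float::INFINITY }
  end

  # Softmax, then sample an index from the distribution.
  m = scaled.max
  exps = scaled.map { |l| Math.exp(l - m) }
  probs = exps.map { |e| e / exps.sum }

  r = rng.rand
  cum = 0.0
  probs.each_with_index do |p, i|
    cum += p
    return i if r < cum
  end
  probs.each_index.max_by { |i| probs[i] } # float-rounding fallback
end
```

With `top_k: 1` this degenerates to greedy decoding: only the argmax survives the filter.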

## File Structure

```
lib/nano_gpt/
├── model.rb                     # GPT model class
├── config.rb                    # GPTConfig (model hyperparameters)
├── train_config.rb              # Training/sampling/bench configs
├── trainer.rb                   # Training loop
├── data_loader.rb               # Batch loading from binary files
├── tokenizer.rb                 # Character & GPT-2 BPE tokenizers
├── lr_scheduler.rb              # Cosine annealing with warmup
├── device.rb                    # CPU/CUDA/MPS device detection
├── textfile_preparer.rb         # Custom dataset preparation
└── layers/
    ├── block.rb                 # Transformer block
    ├── causal_self_attention.rb # Multi-head attention
    ├── mlp.rb                   # Feed-forward network
    └── layer_norm.rb            # Layer normalization
```

---

# Glossary

## A

### Attention
A mechanism that allows each position in a sequence to dynamically focus on other positions. Computes weighted sums of values based on query-key similarities.

### Autoregressive
A model that generates outputs one step at a time, where each step depends on all previous steps. GPT generates text left-to-right, predicting one token at a time.

### AdamW
An optimizer that combines Adam (adaptive learning rates) with decoupled weight decay. Used for training the model.

## B

### Batch Size
The number of independent sequences processed simultaneously. Larger batches provide more stable gradients but require more memory.

### Block Size
The maximum sequence length (context window) the model can process. Also called "context length". Default: 256 tokens.

### BPE (Byte Pair Encoding)
A tokenization algorithm that builds a vocabulary by iteratively merging frequent character pairs. GPT-2 uses BPE with ~50k tokens.

## C

### Causal Mask
A triangular mask that prevents attention from seeing future tokens. Essential for autoregressive generation.

### Checkpoint
A saved snapshot of model weights and training state. Allows resuming training or using the model for inference.

### Context Window
See "Block Size". The number of tokens the model can "see" when making predictions.

### Cross-Entropy Loss
The training objective. Measures how well the model's predicted probability distribution matches the actual next token.
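For a single position this is just the negative log-probability the model assigned to the true next token (toy numbers, not real model output):

```ruby
# Cross-entropy at one position of the sequence.
probs  = [0.1, 0.7, 0.2] # model's softmax output over a 3-token vocabulary
target = 1               # the token that actually came next

loss = -Math.log(probs[target]) # ≈ 0.357; a perfect model (p = 1.0) gives 0.0
```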

## D

### Decoder-Only
A transformer architecture that only uses the decoder stack (with causal masking). GPT models are decoder-only, unlike encoder-decoder models like T5.

### Dropout
A regularization technique that randomly zeros out activations during training. Prevents overfitting. Set to 0 during inference.

## E

### Embedding
A learned vector representation of a discrete token. Maps vocabulary indices to continuous vectors of dimension `n_embd`.

### Embedding Dimension (n_embd)
The size of token representations. Larger dimensions allow more expressive representations but require more computation. Default: 384.

## F

### Flash Attention
An optimized attention implementation that's faster and more memory-efficient. Used when dropout is 0.

### Forward Pass
Computing the model's output from input. Propagates activations through all layers to produce logits.

## G

### GELU (Gaussian Error Linear Unit)
An activation function: `GELU(x) = x * Φ(x)` where Φ is the Gaussian CDF. Smoother than ReLU.

### Gradient Accumulation
Simulating larger batch sizes by accumulating gradients over multiple forward-backward passes before updating weights.

### Gradient Clipping
Limiting gradient magnitudes to prevent training instability. Default max norm: 1.0.

## H

### Head (Attention Head)
One parallel attention computation. Multi-head attention runs `n_head` heads in parallel, each with dimension `n_embd / n_head`.

### Head Size
The dimension of each attention head: `head_size = n_embd / n_head`.

## I

### Inference
Using a trained model to generate predictions. Also called "sampling" or "generation".

### Iteration
One update step in training. May include multiple micro-batches if using gradient accumulation.

## K

### Key (K)
In attention, the vector that represents "what information this position contains". Compared against queries.

## L

### Layer Normalization
Normalizes activations across the feature dimension. Stabilizes training. Applied before attention and MLP in the pre-norm architecture.

### Learning Rate
Controls how much weights change per update. Typically uses a warmup + cosine decay schedule.

### Logits
Raw (unnormalized) model outputs before softmax. Shape: `[batch, seq_len, vocab_size]`.

### Loss
The training objective to minimize. Cross-entropy between predicted and actual next tokens.

## M

### MFU (Model FLOPs Utilization)
Percentage of theoretical GPU compute being used. Higher is better. Measures hardware efficiency.

### MLP (Multi-Layer Perceptron)
The feed-forward network in each transformer block. Expands to 4× dimension then projects back.

### MPS (Metal Performance Shaders)
Apple's GPU compute framework. Used for acceleration on Apple Silicon Macs.

## N

### n_embd
Embedding dimension. The size of token vectors throughout the model. Default: 384.

### n_head
Number of attention heads. Must divide n_embd evenly. Default: 6.

### n_layer
Number of transformer blocks stacked. More layers = more capacity but slower. Default: 6.

## O

### Optimizer
Algorithm that updates model weights based on gradients. nanoGPT uses AdamW.

## P

### Parameters
The learnable weights of the model. Measured in millions (M) or billions (B).

### Position Embedding
Learned vectors that encode token position. Added to token embeddings so the model knows token order.

### Pre-norm
Architecture where LayerNorm is applied before (not after) attention/MLP. Improves training stability.

## Q

### Query (Q)
In attention, the vector that represents "what this position is looking for". Compared against keys.

## R

### Residual Connection
Adding the input directly to the output: `output = layer(x) + x`. Helps gradient flow and enables deeper networks.

## S

### Sampling
Generating text by repeatedly predicting and appending tokens. See "Generation".

### Softmax
Converts logits to probabilities: `softmax(x)_i = exp(x_i) / sum(exp(x_j))`. Output sums to 1.

### State Dict
A dictionary mapping parameter names to their tensor values. Used for saving/loading models.

## T

### Temperature
Scaling factor for logits before softmax during generation. Lower = more deterministic, higher = more random.

### Token
The atomic unit of text. Can be a character, word piece, or subword depending on the tokenizer.

### Tokenizer
Converts text to token IDs and vice versa. nanoGPT supports character-level and GPT-2 BPE tokenization.

### Top-k Sampling
Restricting sampling to the k most probable tokens. Reduces incoherent outputs.

### Transformer
The neural network architecture based on self-attention. Introduced in "Attention Is All You Need" (2017).

## V

### Value (V)
In attention, the vector containing "the information to retrieve". Weighted sum of values is the attention output.

### Validation Loss
Loss computed on held-out data. Used to detect overfitting and decide when to save checkpoints.

### Vocab Size
Number of unique tokens. Character-level: ~65-100. GPT-2 BPE: 50,257.

## W

### Warmup
Gradually increasing learning rate at the start of training. Prevents early instability.
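In `lr_scheduler.rb` warmup is paired with cosine decay (per the "Cosine annealing with warmup" note in the file listing above). A sketch with illustrative parameter names, not the gem's actual API:

```ruby
# Linear warmup to max_lr, then cosine decay down to min_lr.
def learning_rate(step, max_lr:, min_lr:, warmup_steps:, max_steps:)
  # Warmup: ramp linearly from ~0 up to max_lr over warmup_steps.
  return max_lr * (step + 1).fdiv(warmup_steps) if step < warmup_steps
  # Past the schedule: hold at the floor.
  return min_lr if step >= max_steps

  # Cosine decay from max_lr to min_lr over the remaining steps.
  progress = (step - warmup_steps).fdiv(max_steps - warmup_steps)
  min_lr + 0.5 * (max_lr - min_lr) * (1 + Math.cos(Math::PI * progress))
end
```

The two pieces meet continuously: the last warmup step and the first decay step both evaluate to `max_lr`.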

### Weight Decay
L2 regularization applied to weights (not biases/layernorm). Prevents overfitting. Default: 0.1.

### Weight Tying
Sharing the token embedding matrix with the output projection. Reduces parameters.

---

## Model Size Formula

```
Parameters ≈ 12 × n_layer × n_embd²

Example (default config):
12 × 6 × 384² ≈ 10.6M parameters
```
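The same estimate as runnable arithmetic:

```ruby
# Parameter-count estimate for the default config.
n_layer, n_embd = 6, 384
params = 12 * n_layer * n_embd**2 # => 10_616_832, i.e. ~10.6M parameters
```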

## Compute Requirements

Training one iteration processes:
```
tokens = batch_size × block_size × gradient_accumulation_steps

FLOPs ≈ 6 × parameters × tokens   (forward + backward)
```
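Plugging in the default block_size (256) and the default-config parameter estimate; note that `batch_size` and `grad_accumulation_steps` below are made-up illustrative values, not documented defaults:

```ruby
batch_size = 12             # illustrative value, not a documented default
block_size = 256            # default context length from this document
grad_accumulation_steps = 4 # illustrative value

params = 10_616_832 # 12 * 6 * 384**2, the default-config estimate

tokens_per_iter = batch_size * block_size * grad_accumulation_steps # => 12_288
flops_per_iter  = 6 * params * tokens_per_iter # forward + backward
```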

## References

- [Attention Is All You Need](https://arxiv.org/abs/1706.03762) - original transformer paper
- [GPT-2](https://openai.com/research/better-language-models) - OpenAI's GPT-2 model
- [nanoGPT](https://github.com/karpathy/nanoGPT) - original Python implementation
- [PaLM](https://arxiv.org/abs/2204.02311) - MFU calculation reference