nanogpt 0.2.0 → 0.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/Gemfile.lock +30 -1
- data/docs/ARCHITECTURE.md +429 -0
- data/exe/nanogpt +210 -233
- data/lib/nano_gpt/bpe_textfile_preparer.rb +105 -0
- data/lib/nano_gpt/data_loader.rb +5 -20
- data/lib/nano_gpt/layers/block.rb +6 -1
- data/lib/nano_gpt/layers/causal_self_attention.rb +11 -1
- data/lib/nano_gpt/model.rb +1 -7
- data/lib/nano_gpt/textfile_preparer.rb +189 -0
- data/lib/nano_gpt/train_config.rb +80 -146
- data/lib/nano_gpt/trainer.rb +21 -48
- data/lib/nano_gpt/version.rb +1 -1
- data/lib/nano_gpt/web/metrics_store.rb +136 -0
- data/lib/nano_gpt/web/server.rb +294 -0
- data/lib/nano_gpt/web/sse_notifier.rb +37 -0
- data/lib/nano_gpt/web/training_state.rb +56 -0
- data/lib/nano_gpt/web/training_worker.rb +153 -0
- data/lib/nano_gpt/web/views/layout.erb +78 -0
- data/lib/nano_gpt/web/views/run_detail.erb +432 -0
- data/lib/nano_gpt/web/views/runs.erb +434 -0
- data/lib/nano_gpt/web/web_trainer.rb +210 -0
- data/lib/nano_gpt/web.rb +9 -0
- data/lib/nano_gpt.rb +1 -0
- data/nanogpt.gemspec +4 -0
- metadata +71 -2
checksums.yaml
CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: f3563f92bd2b986ae27476c80c596cca77438d899b53d017f6cb708179fd7ad4
+  data.tar.gz: d15ed5effb810c0fa022b4fa3d6708fc2623332b0d3f4de5057a66343a9a3b2f
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 030bd007f053f6400afec46810e10d1c8a96b881df932c291fc7efb637a4c00424cd9304facac71e56783a861417dcd6ec247b2f6df3926a443d2540ecf6adcf
+  data.tar.gz: c0bcf4eae610c2475c841c186ee8a2a6aebc2bb22ecf2d0fbb9d65c7218152699598d320cc6972262ff918d0991c4cc4c951a68552bec46b24bd4fee10aa1b07
data/Gemfile.lock
CHANGED

@@ -1,17 +1,35 @@
 PATH
   remote: .
   specs:
-    nanogpt (0.
+    nanogpt (0.3.0)
       numo-narray (~> 0.9)
+      rackup (~> 2.0)
+      sinatra (~> 4.0)
+      sqlite3 (~> 2.0)
       tiktoken_ruby (~> 0.0)
       torch-rb (~> 0.14)
+      webrick (~> 1.8)

 GEM
   remote: https://rubygems.org/
   specs:
+    base64 (0.3.0)
     diff-lcs (1.6.2)
+    logger (1.7.0)
+    mustermann (3.0.4)
+      ruby2_keywords (~> 0.0.1)
     numo-narray (0.9.2.1)
     parquet (0.7.3-arm64-darwin)
+    rack (3.2.5)
+    rack-protection (4.2.1)
+      base64 (>= 0.1.0)
+      logger (>= 1.6.0)
+      rack (>= 3.0.0, < 4)
+    rack-session (2.1.1)
+      base64 (>= 0.1.0)
+      rack (>= 3.0.0)
+    rackup (2.3.1)
+      rack (>= 3)
     rice (4.7.1)
     rspec (3.13.2)
       rspec-core (~> 3.13.0)
@@ -26,9 +44,20 @@ GEM
       diff-lcs (>= 1.2.0, < 2.0)
       rspec-support (~> 3.13.0)
     rspec-support (3.13.6)
+    ruby2_keywords (0.0.5)
+    sinatra (4.2.1)
+      logger (>= 1.6.0)
+      mustermann (~> 3.0)
+      rack (>= 3.0.0, < 4)
+      rack-protection (= 4.2.1)
+      rack-session (>= 2.0.0, < 3)
+      tilt (~> 2.0)
+    sqlite3 (2.9.0-arm64-darwin)
     tiktoken_ruby (0.0.13-arm64-darwin)
+    tilt (2.7.0)
     torch-rb (0.22.2)
       rice (>= 4.7)
+    webrick (1.9.2)

 PLATFORMS
   arm64-darwin-24
data/docs/ARCHITECTURE.md
ADDED

@@ -0,0 +1,429 @@
# nanoGPT Architecture

This document explains the architecture of nanoGPT, a Ruby implementation of GPT-2-style language models.

## Overview

nanoGPT implements a **decoder-only transformer** architecture for autoregressive language modeling. The model learns to predict the next token in a sequence given all previous tokens.

```
Input Tokens → Embeddings → Transformer Blocks → Output Logits → Next Token Prediction
```

## Core Components

### 1. Token & Position Embeddings

The model converts discrete tokens into continuous vectors:

```
Token IDs [0, 42, 7, ...] → Token Embeddings (vocab_size × n_embd)
Position  [0, 1, 2, ...]  → Position Embeddings (block_size × n_embd)

Final Embedding = TokenEmbed(token) + PosEmbed(position)
```

**File:** `lib/nano_gpt/model.rb:16-18`

The position embeddings are learned (not sinusoidal) and allow the model to understand token order.
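The lookup-and-add step can be sketched in plain Ruby with toy sizes. The embedding tables here are made-up numbers chosen so the arithmetic is easy to follow; the gem itself holds these tables in torch-rb embedding layers.

```ruby
# Toy illustration of Final Embedding = TokenEmbed(token) + PosEmbed(position).
vocab_size, block_size, n_embd = 4, 3, 2

# Each row is the learned vector for one token id / one position (made-up values).
token_embed = Array.new(vocab_size) { |t| Array.new(n_embd) { |d| t + d * 0.1 } }
pos_embed   = Array.new(block_size) { |p| Array.new(n_embd) { |d| p * 0.01 } }

token_ids = [2, 0, 1]
x = token_ids.each_with_index.map do |tok, pos|
  # Element-wise sum of the token's row and the position's row.
  token_embed[tok].zip(pos_embed[pos]).map { |a, b| a + b }
end
# x has shape [seq_len, n_embd]; the same token id at a different
# position would produce a (slightly) different vector.
```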
### 2. Transformer Block

Each transformer block contains two sub-layers with residual connections:

```
┌─────────────────────────────────────────────────┐
│                Transformer Block                │
├─────────────────────────────────────────────────┤
│                                                 │
│  x ─→ LayerNorm ─→ Attention ─→ (+) ─┐          │
│  └───────────────────────────────────┘          │
│                                                 │
│  x ─→ LayerNorm ─→ MLP ─→ (+) ─→ output         │
│  └──────────────────────────────┘               │
│                                                 │
└─────────────────────────────────────────────────┘
```

**File:** `lib/nano_gpt/layers/block.rb`

This is a "pre-norm" architecture (LayerNorm before each sub-layer, not after), which improves training stability.
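The wiring above can be shown in miniature: the sub-layer sees a normalized copy of `x`, and its output is added back onto the raw `x`. The `attention` and `mlp` lambdas below are placeholder functions, not the real layers, so the whole thing runs without torch-rb.

```ruby
# Pre-norm residual wiring: x = x + sublayer(layer_norm(x)), twice per block.
layer_norm = ->(x) do
  mean = x.sum / x.length.to_f
  var  = x.map { |v| (v - mean)**2 }.sum / x.length.to_f
  x.map { |v| (v - mean) / Math.sqrt(var + 1e-5) }
end
attention = ->(x) { x.map { |v| v * 0.5 } }  # placeholder sub-layer
mlp       = ->(x) { x.map { |v| v + 1.0 } }  # placeholder sub-layer

block = ->(x) do
  x = x.zip(attention.call(layer_norm.call(x))).map { |a, b| a + b }
  x.zip(mlp.call(layer_norm.call(x))).map { |a, b| a + b }
end

out = block.call([1.0, 2.0, 3.0])
```

Because the input always flows through unchanged on the residual path, gradients can reach early layers even when the sub-layers are deep or poorly initialized.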
### 3. Causal Self-Attention

The attention mechanism allows each token to attend to all previous tokens (but not future ones):

```
Q = x @ W_q   (Query: what am I looking for?)
K = x @ W_k   (Key:   what do I contain?)
V = x @ W_v   (Value: what information do I provide?)

Attention(Q,K,V) = softmax(Q @ K^T / sqrt(d_k)) @ V
```

The "causal" part means we mask future positions:

```
Position:  0 1 2 3
Token 0:  [1 0 0 0]  ← can only see itself
Token 1:  [1 1 0 0]  ← can see positions 0,1
Token 2:  [1 1 1 0]  ← can see positions 0,1,2
Token 3:  [1 1 1 1]  ← can see all
```

**File:** `lib/nano_gpt/layers/causal_self_attention.rb`

Multi-head attention splits the embedding into `n_head` parallel attention operations, allowing the model to attend to different aspects simultaneously.
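A single head of the formula above, on tiny hand-written matrices, looks like this in plain Ruby (the real layer works on batched torch-rb tensors; masking is done by setting future scores to negative infinity, which softmax turns into zero weight):

```ruby
def softmax(row)
  m = row.max
  exps = row.map { |v| Math.exp(v - m) }  # subtract max for numerical stability
  s = exps.sum
  exps.map { |e| e / s }
end

def causal_attention(q, k, v)
  d_k = q.first.length.to_f
  q.each_with_index.map do |qi, i|
    # Scaled dot-product scores; future positions (j > i) are masked out.
    scores = k.each_with_index.map do |kj, j|
      j <= i ? qi.zip(kj).map { |a, b| a * b }.sum / Math.sqrt(d_k) : -Float::INFINITY
    end
    weights = softmax(scores)  # masked positions get exactly 0 weight
    # Output row = weighted sum of value rows.
    v.first.length.times.map { |d| weights.each_with_index.map { |w, j| w * v[j][d] }.sum }
  end
end

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
out = causal_attention(q, k, v)
# Row 0 can only attend to itself, so out[0] equals v[0];
# row 1 mixes v[0] and v[1], weighted toward its matching key.
```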
### 4. MLP (Feed-Forward Network)

After attention, each token passes through an MLP:

```
x → Linear(n_embd → 4×n_embd) → GELU → Linear(4×n_embd → n_embd) → output
```

**File:** `lib/nano_gpt/layers/mlp.rb`

The 4× expansion provides additional capacity for computation.
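The shape arithmetic of the expand-then-project pattern, plus the common tanh approximation of GELU, can be checked in plain Ruby. This is illustrative only; the gem's `layers/mlp.rb` builds the real thing from torch-rb modules.

```ruby
# Tanh approximation of GELU (the exact form uses the Gaussian CDF).
def gelu(x)
  0.5 * x * (1.0 + Math.tanh(Math.sqrt(2.0 / Math::PI) * (x + 0.044715 * x**3)))
end

n_embd = 384
hidden = 4 * n_embd                        # 4x expansion: 384 -> 1536 -> 384

fc_params   = n_embd * hidden + hidden     # weight + bias of the expanding Linear
proj_params = hidden * n_embd + n_embd     # weight + bias of the projecting Linear
# Together the two Linears account for roughly 8 * n_embd^2 weights per block.
```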
### 5. Output Layer (Weight Tying)

The final output uses the transposed token embedding matrix:

```
logits = LayerNorm(x) @ TokenEmbed.T
```

This "weight tying" reduces parameters and improves performance.
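The savings are easy to quantify. At the GPT-2 BPE vocabulary size and the default embedding width used elsewhere in this document:

```ruby
vocab_size, n_embd = 50_257, 384

# One matrix serves as both the input embedding and the output projection.
shared = vocab_size * n_embd
# Without tying, a separate output head would add this many weights again:
saved = shared   # ~19.3M parameters, larger than the rest of the default model
```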
## Data Flow

```
┌────────────────────────────────────────────────────────────────────┐
│                          GPT Forward Pass                          │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  Input: token_ids [batch, seq_len]                                 │
│    ↓                                                               │
│  Token Embedding + Position Embedding                              │
│    ↓                                                               │
│  Dropout                                                           │
│    ↓                                                               │
│  ┌──────────────────────────────────────┐                          │
│  │ Transformer Block 1                  │                          │
│  │   LayerNorm → Attention → Residual   │                          │
│  │   LayerNorm → MLP → Residual         │                          │
│  └──────────────────────────────────────┘                          │
│    ↓                                                               │
│  ... (repeat n_layer times) ...                                    │
│    ↓                                                               │
│  Final LayerNorm                                                   │
│    ↓                                                               │
│  Linear projection (tied weights)                                  │
│    ↓                                                               │
│  Output: logits [batch, seq_len, vocab_size]                       │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
```
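What those final logits are used for during training is the cross-entropy loss, shown here for one position over a toy 4-token vocabulary in plain Ruby (the real loss runs over every position of every sequence in the batch):

```ruby
# -log(softmax(logits)[target]), computed via the log-sum-exp trick.
def cross_entropy(logits, target)
  m = logits.max
  log_sum_exp = m + Math.log(logits.map { |l| Math.exp(l - m) }.sum)
  log_sum_exp - logits[target]
end

confident = cross_entropy([5.0, 0.0, 0.0, 0.0], 0)  # right token, high logit
uncertain = cross_entropy([0.0, 0.0, 0.0, 0.0], 0)  # uniform guess
# A uniform guess over 4 tokens costs log(4) ≈ 1.386 nats; a confident
# correct prediction costs much less.
```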
## Training Pipeline

```
┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│   Prepare   │ → │    Load     │ → │   Forward   │ → │  Backward   │
│    Data     │   │    Batch    │   │    Pass     │   │    Pass     │
└─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘
      │                 │                 │                 │
  train.bin        DataLoader       GPT.forward      loss.backward
  val.bin          get_batch()           ↓                 ↓
  meta.json             ↓          Cross-entropy       Gradients
                  [x, y] tensors       loss             computed
                                                           │
                                                           ↓
┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
│   Update    │ ← │    Clip     │ ← │ Accumulate  │ ← │             │
│   Weights   │   │  Gradients  │   │  Gradients  │   │             │
└─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘
      │
    AdamW
  optimizer
```

**File:** `lib/nano_gpt/trainer.rb`
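The accumulate → clip → step rhythm of the bottom row can be sketched on a single scalar "gradient", so it runs without torch-rb. All numbers here are hypothetical; the real loop in `trainer.rb` operates on tensors and uses AdamW rather than the plain SGD step shown.

```ruby
accum_steps, grad_clip, lr = 4, 1.0, 0.1
weight = 0.0

# Accumulate: average the gradient over several micro-batches
# (each micro-batch here "contributes" a gradient of 2.0).
grad = 0.0
accum_steps.times { grad += 2.0 / accum_steps }

# Clip: rescale so the gradient norm never exceeds grad_clip.
norm = grad.abs
grad *= grad_clip / norm if norm > grad_clip

# Update: one optimizer step (AdamW additionally keeps per-weight moments).
weight -= lr * grad
```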
## Generation (Inference)

```
Start: "Hello"
  ↓
Tokenize: [15496]
  ↓
┌─────────────────────────────────────┐
│  Loop until max_tokens:             │
│   1. Forward pass → logits          │
│   2. Apply temperature              │
│   3. Apply top-k filtering          │
│   4. Sample from distribution       │
│   5. Append to sequence             │
└─────────────────────────────────────┘
  ↓
Decode: "Hello, how are you today?"
```

**File:** `lib/nano_gpt/model.rb:110-147`
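Steps 2-4 of the loop can be shown in plain Ruby: scale the logits by the temperature, keep only the top-k, then sample from the renormalized distribution. The logits are toy values; the gem's generate method does the same thing on torch-rb tensors.

```ruby
def sample_next(logits, temperature: 1.0, top_k: nil, rng: Random.new(42))
  scaled = logits.map { |l| l / temperature }   # <1.0 sharpens, >1.0 flattens
  if top_k
    cutoff = scaled.sort.last(top_k).min        # k-th largest value
    scaled = scaled.map { |l| l >= cutoff ? l : -Float::INFINITY }
  end
  m = scaled.max
  exps = scaled.map { |l| Math.exp(l - m) }
  probs = exps.map { |e| e / exps.sum }
  # Inverse-CDF sampling from the categorical distribution.
  r = rng.rand
  cumulative = 0.0
  probs.each_with_index do |p, i|
    cumulative += p
    return i if r <= cumulative
  end
  probs.length - 1
end

token = sample_next([2.0, 1.0, 0.1, -1.0], temperature: 0.8, top_k: 2)
# With top_k = 2 only the two highest logits (indices 0 and 1) can be chosen.
```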
## File Structure

```
lib/nano_gpt/
├── model.rb                      # GPT model class
├── config.rb                     # GPTConfig (model hyperparameters)
├── train_config.rb               # Training/sampling/bench configs
├── trainer.rb                    # Training loop
├── data_loader.rb                # Batch loading from binary files
├── tokenizer.rb                  # Character & GPT-2 BPE tokenizers
├── lr_scheduler.rb               # Cosine annealing with warmup
├── device.rb                     # CPU/CUDA/MPS device detection
├── textfile_preparer.rb          # Custom dataset preparation
└── layers/
    ├── block.rb                  # Transformer block
    ├── causal_self_attention.rb  # Multi-head attention
    ├── mlp.rb                    # Feed-forward network
    └── layer_norm.rb             # Layer normalization
```

---
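The "cosine annealing with warmup" noted for `lr_scheduler.rb` follows a standard shape: ramp linearly to the peak rate, decay along a half cosine, then hold at the floor. The constants below are illustrative, not the gem's defaults.

```ruby
def learning_rate(it, max_lr: 1e-3, min_lr: 1e-4, warmup: 100, decay_iters: 5000)
  return max_lr * (it + 1) / warmup.to_f if it < warmup  # linear warmup
  return min_lr if it > decay_iters                      # floor after decay
  ratio = (it - warmup).to_f / (decay_iters - warmup)
  coeff = 0.5 * (1.0 + Math.cos(Math::PI * ratio))       # goes 1 -> 0
  min_lr + coeff * (max_lr - min_lr)
end
```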

# Glossary

## A

### AdamW
An optimizer that combines Adam (adaptive learning rates) with decoupled weight decay. Used for training the model.

### Attention
A mechanism that allows each position in a sequence to dynamically focus on other positions. Computes weighted sums of values based on query-key similarities.

### Autoregressive
A model that generates outputs one step at a time, where each step depends on all previous steps. GPT generates text left-to-right, predicting one token at a time.

## B

### Batch Size
The number of independent sequences processed simultaneously. Larger batches provide more stable gradients but require more memory.

### Block Size
The maximum sequence length (context window) the model can process. Also called "context length". Default: 256 tokens.

### BPE (Byte Pair Encoding)
A tokenization algorithm that builds a vocabulary by iteratively merging frequent character pairs. GPT-2 uses BPE with ~50k tokens.

## C

### Causal Mask
A triangular mask that prevents attention from seeing future tokens. Essential for autoregressive generation.

### Checkpoint
A saved snapshot of model weights and training state. Allows resuming training or using the model for inference.

### Context Window
See "Block Size". The number of tokens the model can "see" when making predictions.

### Cross-Entropy Loss
The training objective. Measures how well the model's predicted probability distribution matches the actual next token.

## D

### Decoder-Only
A transformer architecture that only uses the decoder stack (with causal masking). GPT models are decoder-only, unlike encoder-decoder models like T5.

### Dropout
A regularization technique that randomly zeros out activations during training. Prevents overfitting. Set to 0 during inference.

## E

### Embedding
A learned vector representation of a discrete token. Maps vocabulary indices to continuous vectors of dimension `n_embd`.

### Embedding Dimension (n_embd)
The size of token representations. Larger dimensions allow more expressive representations but require more computation. Default: 384.

## F

### Flash Attention
An optimized attention implementation that's faster and more memory-efficient. Used when dropout is 0.

### Forward Pass
Computing the model's output from input. Propagates activations through all layers to produce logits.

## G

### GELU (Gaussian Error Linear Unit)
An activation function: `GELU(x) = x * Φ(x)` where Φ is the Gaussian CDF. Smoother than ReLU.

### Gradient Accumulation
Simulating larger batch sizes by accumulating gradients over multiple forward-backward passes before updating weights.

### Gradient Clipping
Limiting gradient magnitudes to prevent training instability. Default max norm: 1.0.

## H

### Head (Attention Head)
One parallel attention computation. Multi-head attention runs `n_head` heads in parallel, each with dimension `n_embd/n_head`.

### Head Size
The dimension of each attention head: `head_size = n_embd / n_head`.

## I

### Inference
Using a trained model to generate predictions. Also called "sampling" or "generation".

### Iteration
One update step in training. May include multiple micro-batches if using gradient accumulation.

## K

### Key (K)
In attention, the vector that represents "what information this position contains". Compared against queries.

## L

### Layer Normalization
Normalizes activations across the feature dimension. Stabilizes training. Applied before attention and MLP in pre-norm architecture.

### Learning Rate
Controls how much weights change per update. Typically uses a warmup + cosine decay schedule.

### Logits
Raw (unnormalized) model outputs before softmax. Shape: [batch, seq_len, vocab_size].

### Loss
The training objective to minimize. Cross-entropy between predicted and actual next tokens.

## M

### MFU (Model FLOPs Utilization)
Percentage of theoretical GPU compute being used. Higher is better. Measures hardware efficiency.
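A back-of-the-envelope MFU calculation, using the 6 × parameters FLOPs-per-token estimate from the "Compute Requirements" section below. The throughput and peak-FLOPs numbers here are assumed for illustration, not measured values.

```ruby
params         = 10_600_000   # ~10.6M, the default model size
tokens_per_sec = 20_000.0     # assumed measured training throughput
peak_flops     = 10.0e12      # assumed hardware peak, 10 TFLOP/s

achieved_flops = 6 * params * tokens_per_sec
mfu = achieved_flops / peak_flops   # fraction of peak actually used
```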

### MLP (Multi-Layer Perceptron)
The feed-forward network in each transformer block. Expands to 4× dimension then projects back.

### MPS (Metal Performance Shaders)
Apple's GPU compute framework. Used for acceleration on Apple Silicon Macs.

## N

### n_embd
Embedding dimension. The size of token vectors throughout the model. Default: 384.

### n_head
Number of attention heads. Must divide n_embd evenly. Default: 6.

### n_layer
Number of transformer blocks stacked. More layers = more capacity but slower. Default: 6.

## O

### Optimizer
Algorithm that updates model weights based on gradients. nanoGPT uses AdamW.

## P

### Parameters
The learnable weights of the model. Measured in millions (M) or billions (B).

### Position Embedding
Learned vectors that encode token position. Added to token embeddings so the model knows token order.

### Pre-norm
Architecture where LayerNorm is applied before (not after) attention/MLP. Improves training stability.

## Q

### Query (Q)
In attention, the vector that represents "what this position is looking for". Compared against keys.

## R

### Residual Connection
Adding the input directly to the output: `output = layer(x) + x`. Helps gradient flow and enables deeper networks.

## S

### Sampling
Generating text by repeatedly predicting and appending tokens. See "Generation".

### Softmax
Converts logits to probabilities: `softmax(x)_i = exp(x_i) / sum(exp(x_j))`. Output sums to 1.

### State Dict
A dictionary mapping parameter names to their tensor values. Used for saving/loading models.

## T

### Temperature
Scaling factor for logits before softmax during generation. Lower = more deterministic, higher = more random.

### Token
The atomic unit of text. Can be a character, word piece, or subword depending on the tokenizer.

### Tokenizer
Converts text to token IDs and vice versa. nanoGPT supports character-level and GPT-2 BPE tokenization.

### Top-k Sampling
Restricting sampling to the k most probable tokens. Reduces incoherent outputs.

### Transformer
The neural network architecture based on self-attention. Introduced in "Attention Is All You Need" (2017).

## V

### Value (V)
In attention, the vector containing "the information to retrieve". The weighted sum of values is the attention output.

### Validation Loss
Loss computed on held-out data. Used to detect overfitting and decide when to save checkpoints.

### Vocab Size
Number of unique tokens. Character-level: ~65-100. GPT-2 BPE: 50,257.

## W

### Warmup
Gradually increasing the learning rate at the start of training. Prevents early instability.

### Weight Decay
L2 regularization applied to weights (not biases/layernorm). Prevents overfitting. Default: 0.1.

### Weight Tying
Sharing the token embedding matrix with the output projection. Reduces parameters.

---

## Model Size Formula

```
Parameters ≈ 12 × n_layer × n_embd²

Example (default config):
  12 × 6 × 384² ≈ 10.6M parameters
```
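The estimate evaluates directly in Ruby for the default configuration:

```ruby
n_layer, n_embd = 6, 384
params = 12 * n_layer * n_embd**2
# 10_616_832 parameters, i.e. about 10.6M
```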

## Compute Requirements

Training one iteration processes:

```
tokens = batch_size × block_size × gradient_accumulation_steps

FLOPs ≈ 6 × parameters × tokens   (forward + backward)
```
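Plugging in numbers makes the scale concrete. Batch size 8 and 4 accumulation steps are assumed values for illustration; block size 256 and the ~10.6M-parameter count are the defaults quoted elsewhere in this document.

```ruby
batch_size, block_size, grad_accum = 8, 256, 4
params = 10_616_832

tokens = batch_size * block_size * grad_accum  # tokens per iteration
flops  = 6 * params * tokens                   # ~5.2e11 FLOPs per iteration
```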

## References

- [Attention Is All You Need](https://arxiv.org/abs/1706.03762) - original transformer paper
- [GPT-2](https://openai.com/research/better-language-models) - OpenAI's GPT-2 model
- [nanoGPT](https://github.com/karpathy/nanoGPT) - original Python implementation
- [PaLM](https://arxiv.org/abs/2204.02311) - MFU calculation reference