transformers-from-scratch 0.1.0__tar.gz

@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 etfrer-yi
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,98 @@
1
+ Metadata-Version: 2.4
2
+ Name: transformers-from-scratch
3
+ Version: 0.1.0
4
+ Summary: Encoder, decoder, and encoder-decoder Transformer architectures in PyTorch
5
+ License-Expression: MIT
6
+ Project-URL: Repository, https://github.com/etfrer-yi/Transformer-Variants
7
+ Requires-Python: >=3.9
8
+ Description-Content-Type: text/markdown
9
+ License-File: LICENSE
10
+ Requires-Dist: torch>=2.0
11
+ Dynamic: license-file
12
+
13
+ # Transformer Variants
14
+
15
+ Implementations of encoder, decoder, and encoder-decoder Transformer architectures in PyTorch, following "Attention Is All You Need" (Vaswani et al., 2017).
16
+
17
+ ## Installation
18
+
19
+ ```bash
20
+ pip install transformers-from-scratch
21
+ ```
22
+
23
+ PyTorch (`torch>=2.0`) is a declared dependency, so `pip` installs it automatically if it is missing. To choose a specific CUDA build, install PyTorch first from [pytorch.org](https://pytorch.org/get-started/locally/).
24
+
25
+ ## Development Setup
26
+
27
+ ```bash
28
+ python3 -m venv venv
29
+ source venv/bin/activate
30
+ pip install -r requirements.txt
31
+ ```
32
+
33
+ ## Modules
34
+
35
+ `model.py` provides three stand-alone models and the building blocks they are composed of.
36
+
37
+ | Class | Description |
38
+ |---|---|
39
+ | `EncoderTransformer` | BERT-style bidirectional encoder |
40
+ | `DecoderOnlyTransformer` | GPT-style autoregressive decoder |
41
+ | `EncoderDecoderTransformer` | Sequence-to-sequence encoder-decoder |
42
+ | `DecoderTransformer` | Decoder component — not for stand-alone use |
43
+
44
+ ## Usage
45
+
46
+ ```python
47
+ import torch
48
+ from transformer_variants import EncoderTransformer, DecoderOnlyTransformer, EncoderDecoderTransformer
49
+
50
+ VOCAB_SIZE, MAX_SEQ_LEN = 32000, 512
51
+ D_MODEL, D_FF, N_HEADS, N_BLOCKS = 512, 2048, 8, 6
52
+ ```
53
+
54
+ ### Encoder
55
+
56
+ ```python
57
+ model = EncoderTransformer(VOCAB_SIZE, MAX_SEQ_LEN, D_MODEL, D_FF, N_HEADS, N_BLOCKS)
58
+
59
+ src = torch.randint(0, VOCAB_SIZE, (B, T_src)) # [B, T_src]
60
+ src_mask = (src == pad_id).unsqueeze(1) # [B, 1, T_src] True = padding
61
+
62
+ out = model(src, src_mask) # [B, T_src, d_model]
63
+ ```
64
+
65
+ ### Decoder-only
66
+
67
+ ```python
68
+ model = DecoderOnlyTransformer(VOCAB_SIZE, MAX_SEQ_LEN, D_MODEL, D_FF, N_HEADS, N_BLOCKS)
69
+
70
+ tgt = torch.randint(0, VOCAB_SIZE, (B, T)) # [B, T]
71
+ pad_mask = (tgt == pad_id).unsqueeze(1) # [B, 1, T] True = padding (optional)
72
+
73
+ logits = model(tgt, pad_mask) # [B, T, vocab_size]
74
+ ```
75
+
76
+ A causal mask is generated internally — no need to pass one.
77
+
78
+ ### Encoder-decoder
79
+
80
+ ```python
81
+ model = EncoderDecoderTransformer(VOCAB_SIZE, MAX_SEQ_LEN, D_MODEL, D_FF, N_HEADS, N_BLOCKS, N_BLOCKS)
82
+
83
+ src = torch.randint(0, VOCAB_SIZE, (B, T_src)) # [B, T_src]
84
+ tgt = torch.randint(0, VOCAB_SIZE, (B, T_tgt)) # [B, T_tgt]
85
+
86
+ src_mask = (src == pad_id).unsqueeze(1) # [B, 1, T_src]
87
+ T_tgt = tgt.size(1)
88
+ causal = torch.ones(T_tgt, T_tgt, dtype=torch.bool, device=tgt.device).triu(1).unsqueeze(0) # [1, T_tgt, T_tgt]
89
+ tgt_mask = causal | (tgt == pad_id).unsqueeze(1) # [B, T_tgt, T_tgt]
90
+
91
+ logits = model(src, tgt, src_mask, tgt_mask) # [B, T_tgt, vocab_size]
92
+ ```
93
+
94
+ ## Running tests
95
+
96
+ ```bash
97
+ python -m unittest test -v
98
+ ```
@@ -0,0 +1,86 @@
1
+ # Transformer Variants
2
+
3
+ Implementations of encoder, decoder, and encoder-decoder Transformer architectures in PyTorch, following "Attention Is All You Need" (Vaswani et al., 2017).
4
+
5
+ ## Installation
6
+
7
+ ```bash
8
+ pip install transformers-from-scratch
9
+ ```
10
+
11
+ PyTorch (`torch>=2.0`) is a declared dependency, so `pip` installs it automatically if it is missing. To choose a specific CUDA build, install PyTorch first from [pytorch.org](https://pytorch.org/get-started/locally/).
12
+
13
+ ## Development Setup
14
+
15
+ ```bash
16
+ python3 -m venv venv
17
+ source venv/bin/activate
18
+ pip install -r requirements.txt
19
+ ```
20
+
21
+ ## Modules
22
+
23
+ `model.py` provides three stand-alone models and the building blocks they are composed of.
24
+
25
+ | Class | Description |
26
+ |---|---|
27
+ | `EncoderTransformer` | BERT-style bidirectional encoder |
28
+ | `DecoderOnlyTransformer` | GPT-style autoregressive decoder |
29
+ | `EncoderDecoderTransformer` | Sequence-to-sequence encoder-decoder |
30
+ | `DecoderTransformer` | Decoder component — not for stand-alone use |
31
+
32
+ ## Usage
33
+
34
+ ```python
35
+ import torch
36
+ from transformer_variants import EncoderTransformer, DecoderOnlyTransformer, EncoderDecoderTransformer
37
+
38
+ VOCAB_SIZE, MAX_SEQ_LEN = 32000, 512
39
+ D_MODEL, D_FF, N_HEADS, N_BLOCKS = 512, 2048, 8, 6
40
+ ```
41
+
42
+ ### Encoder
43
+
44
+ ```python
45
+ model = EncoderTransformer(VOCAB_SIZE, MAX_SEQ_LEN, D_MODEL, D_FF, N_HEADS, N_BLOCKS)
46
+
47
+ src = torch.randint(0, VOCAB_SIZE, (B, T_src)) # [B, T_src]
48
+ src_mask = (src == pad_id).unsqueeze(1) # [B, 1, T_src] True = padding
49
+
50
+ out = model(src, src_mask) # [B, T_src, d_model]
51
+ ```
52
+
53
+ ### Decoder-only
54
+
55
+ ```python
56
+ model = DecoderOnlyTransformer(VOCAB_SIZE, MAX_SEQ_LEN, D_MODEL, D_FF, N_HEADS, N_BLOCKS)
57
+
58
+ tgt = torch.randint(0, VOCAB_SIZE, (B, T)) # [B, T]
59
+ pad_mask = (tgt == pad_id).unsqueeze(1) # [B, 1, T] True = padding (optional)
60
+
61
+ logits = model(tgt, pad_mask) # [B, T, vocab_size]
62
+ ```
63
+
64
+ A causal mask is generated internally — no need to pass one.
65
+
66
+ ### Encoder-decoder
67
+
68
+ ```python
69
+ model = EncoderDecoderTransformer(VOCAB_SIZE, MAX_SEQ_LEN, D_MODEL, D_FF, N_HEADS, N_BLOCKS, N_BLOCKS)
70
+
71
+ src = torch.randint(0, VOCAB_SIZE, (B, T_src)) # [B, T_src]
72
+ tgt = torch.randint(0, VOCAB_SIZE, (B, T_tgt)) # [B, T_tgt]
73
+
74
+ src_mask = (src == pad_id).unsqueeze(1) # [B, 1, T_src]
75
+ T_tgt = tgt.size(1)
76
+ causal = torch.ones(T_tgt, T_tgt, dtype=torch.bool, device=tgt.device).triu(1).unsqueeze(0) # [1, T_tgt, T_tgt]
77
+ tgt_mask = causal | (tgt == pad_id).unsqueeze(1) # [B, T_tgt, T_tgt]
78
+
79
+ logits = model(src, tgt, src_mask, tgt_mask) # [B, T_tgt, vocab_size]
80
+ ```
81
+
82
+ ## Running tests
83
+
84
+ ```bash
85
+ python -m unittest test -v
86
+ ```
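The mask construction in the usage snippets relies on broadcasting: a `[B, 1, T]` padding mask OR-ed with a `[1, T, T]` causal mask yields a `[B, T, T]` mask. The dependency-free sketch below reproduces that logic with nested lists so the resulting pattern can be inspected without PyTorch. The helper names (`padding_mask`, `causal_mask`, `combine`) and `PAD_ID = 0` are illustrative assumptions, not part of the package.

```python
def padding_mask(tokens, pad_id):
    """[B][1][T] mask: True where the key position holds padding."""
    return [[[tok == pad_id for tok in seq]] for seq in tokens]


def causal_mask(T):
    """[1][T][T] mask: True where key index j lies after query index i."""
    return [[[j > i for j in range(T)] for i in range(T)]]


def combine(causal, pad):
    """Emulate broadcasting [1][T][T] | [B][1][T] -> [B][T][T]."""
    T = len(causal[0])
    return [
        [[causal[0][i][j] or p[0][j] for j in range(T)] for i in range(T)]
        for p in pad
    ]


PAD_ID = 0  # hypothetical padding id, chosen for this example only
tokens = [[5, 7, 9, 0], [3, 4, 0, 0]]  # two right-padded sequences
mask = combine(causal_mask(4), padding_mask(tokens, PAD_ID))

# Print the mask of the second sequence: "x" = ignored key, "." = visible key.
for row in mask[1]:
    print("".join("x" if m else "." for m in row))
```

For the second sequence this prints `.xxx`, then `..xx` three times. Every query row keeps at least one visible key here because the padding sits on the right.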
@@ -0,0 +1,16 @@
1
+ [build-system]
2
+ requires = ["setuptools>=68"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "transformers-from-scratch"
7
+ version = "0.1.0"
8
+ description = "Encoder, decoder, and encoder-decoder Transformer architectures in PyTorch"
9
+ readme = "README.md"
10
+ license = "MIT"
11
+ license-files = ["LICENSE"]
12
+ requires-python = ">=3.9"
13
+ dependencies = ["torch>=2.0"]
14
+
15
+ [project.urls]
16
+ Repository = "https://github.com/etfrer-yi/Transformer-Variants"
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
@@ -0,0 +1,13 @@
1
+ from .model import (
2
+     EncoderTransformer,
3
+     DecoderOnlyTransformer,
4
+     DecoderTransformer,
5
+     EncoderDecoderTransformer,
6
+ )
7
+
8
+ __all__ = [
9
+     "EncoderTransformer",
10
+     "DecoderOnlyTransformer",
11
+     "DecoderTransformer",
12
+     "EncoderDecoderTransformer",
13
+ ]
@@ -0,0 +1,294 @@
1
+ """
2
+ Implementations of encoder, decoder, and encoder-decoder Transformer architectures,
3
+ following the design of "Attention Is All You Need" (Vaswani et al., 2017).
4
+
5
+ Notation used in shape comments throughout this module:
6
+ B — batch size
7
+ T — sequence length (generic)
8
+ T_src — source sequence length
9
+ T_tgt — target sequence length
10
+ d_model — model embedding dimension
11
+ d_ff — feed-forward hidden dimension
12
+ d_key — per-head key/query dimension (= d_model // n_heads)
13
+ d_val — per-head value dimension (= d_model // n_heads)
14
+
15
+ Mask convention: BoolTensor where True marks positions to ignore (filled with -inf).
16
+ """
17
+
18
+ import math
19
+ import torch
20
+ from torch import Tensor, BoolTensor
21
+ import torch.nn as nn
22
+ import torch.nn.functional as F
23
+
24
+
25
+ class PositionalEncoding(nn.Module):
26
+     def __init__(self, max_seq_len: int, d_model: int):
27
+         super().__init__()
28
+         pos = torch.arange(max_seq_len).unsqueeze(1)  # [T, 1]
29
+         i = torch.arange(0, d_model, 2)  # [ceil(d_model/2)]
30
+         pe = torch.zeros(1, max_seq_len, d_model)  # [1, T, d_model]
31
+         pe[0, :, 0::2] = torch.sin(pos / 10000 ** (i / d_model))
32
+         pe[0, :, 1::2] = torch.cos(pos / 10000 ** (i / d_model))[:, : d_model // 2]  # slice so an odd d_model also fits
33
+         self.register_buffer('pe', pe)
34
+
35
+     def forward(self, x: Tensor) -> Tensor:
36
+         # x: [B, T, d_model]
37
+         return x + self.pe[:, :x.size(1)]  # [B, T, d_model]
38
+
39
+
40
+ class FeedForward(nn.Module):
41
+     def __init__(self, d_model: int, d_ff: int):
42
+         super().__init__()
43
+         self.gelu = nn.GELU()
44
+         self.linear1 = nn.Linear(d_model, d_ff)
45
+         self.linear2 = nn.Linear(d_ff, d_model)
46
+
47
+     def forward(self, x: Tensor):
48
+         # x: [B, T, d_model]
49
+         x = self.linear1(x)  # [B, T, d_ff]
50
+         x = self.gelu(x)  # [B, T, d_ff]
51
+         x = self.linear2(x)  # [B, T, d_model]
52
+         return x
53
+
54
+
55
+ class SingleHeadCrossAttention(nn.Module):
56
+     def __init__(self, d_model: int, d_key: int, d_val: int):
57
+         super().__init__()
58
+         self.softmax = nn.Softmax(-1)
59
+         self.q_proj = nn.Linear(d_model, d_key, bias=False)
60
+         self.k_proj = nn.Linear(d_model, d_key, bias=False)
61
+         self.v_proj = nn.Linear(d_model, d_val, bias=False)
62
+
63
+     def forward(self, src: Tensor, tgt: Tensor, mask: BoolTensor = None):
64
+         # src: [B, T_src, d_model], tgt: [B, T_tgt, d_model], mask: [B, 1, T_src] or [B, T_tgt, T_src]
65
+         Q = self.q_proj(tgt)  # [B, T_tgt, d_key]
66
+         K = self.k_proj(src)  # [B, T_src, d_key]
67
+         V = self.v_proj(src)  # [B, T_src, d_val]
68
+         attn_scores = torch.matmul(Q, K.transpose(-1, -2)) / math.sqrt(K.size(-1))  # [B, T_tgt, T_src]
69
+         if mask is not None:
70
+             attn_scores = attn_scores.masked_fill(mask, float('-inf'))
71
+         return torch.matmul(self.softmax(attn_scores), V)  # [B, T_tgt, d_val]
72
+
73
+
74
+ class SingleHeadSelfAttention(nn.Module):
75
+     def __init__(self, d_model: int, d_key: int, d_val: int):
76
+         super().__init__()
77
+         self.softmax = nn.Softmax(-1)
78
+         self.q_proj = nn.Linear(d_model, d_key, bias=False)
79
+         self.k_proj = nn.Linear(d_model, d_key, bias=False)
80
+         self.v_proj = nn.Linear(d_model, d_val, bias=False)
81
+
82
+     def forward(self, x: Tensor, mask: BoolTensor = None):
83
+         # x: [B, T, d_model], mask: [B, T, T] or [1, T, T]
84
+         Q = self.q_proj(x)  # [B, T, d_key]
85
+         K = self.k_proj(x)  # [B, T, d_key]
86
+         V = self.v_proj(x)  # [B, T, d_val]
87
+         attn_scores = torch.matmul(Q, K.transpose(-1, -2)) / math.sqrt(K.size(-1))  # [B, T, T]
88
+         if mask is not None:
89
+             attn_scores = attn_scores.masked_fill(mask, float('-inf'))
90
+         return torch.matmul(self.softmax(attn_scores), V)  # [B, T, d_val]
91
+
92
+
93
+ class MultiHeadCrossAttention(nn.Module):
94
+     def __init__(self, d_model: int, n_heads: int):
95
+         super().__init__()
96
+         assert d_model % n_heads == 0
97
+         d_head = d_model // n_heads
98
+         self.heads = nn.ModuleList([
99
+             SingleHeadCrossAttention(d_model, d_head, d_head) for _ in range(n_heads)
100
+         ])
101
+         self.out_proj = nn.Linear(d_model, d_model, bias=False)
102
+
103
+     def forward(self, src: Tensor, tgt: Tensor, mask: BoolTensor = None):
104
+         # src: [B, T_src, d_model], tgt: [B, T_tgt, d_model], mask: [B, 1, T_src] or [B, T_tgt, T_src]
105
+         return self.out_proj(torch.cat([head(src, tgt, mask) for head in self.heads], dim=-1))  # [B, T_tgt, d_model]
106
+
107
+
108
+ class MultiHeadSelfAttention(nn.Module):
109
+     def __init__(self, d_model: int, n_heads: int):
110
+         super().__init__()
111
+         assert d_model % n_heads == 0
112
+         d_head = d_model // n_heads
113
+         self.heads = nn.ModuleList([
114
+             SingleHeadSelfAttention(d_model, d_head, d_head) for _ in range(n_heads)
115
+         ])
116
+         self.out_proj = nn.Linear(d_model, d_model, bias=False)
117
+
118
+     def forward(self, x: Tensor, mask: BoolTensor = None):
119
+         # x: [B, T, d_model], mask: [B, T, T] or [1, T, T]
120
+         return self.out_proj(torch.cat([head(x, mask) for head in self.heads], dim=-1))  # [B, T, d_model]
121
+
122
+
123
+ class CrossAttentionTransformerBlock(nn.Module):
124
+     def __init__(self, d_model: int, d_ff: int, n_heads: int):
125
+         super().__init__()
126
+         self.layer_norm1 = nn.LayerNorm(d_model)
127
+         self.layer_norm2 = nn.LayerNorm(d_model)
128
+         self.layer_norm3 = nn.LayerNorm(d_model)
129
+         self.multi_head_self_attn = MultiHeadSelfAttention(d_model, n_heads)
130
+         self.multi_head_cross_attn = MultiHeadCrossAttention(d_model, n_heads)
131
+         self.feed_forward = FeedForward(d_model, d_ff)
132
+
133
+     def forward(self, src: Tensor, tgt: Tensor, src_mask: BoolTensor = None, tgt_mask: BoolTensor = None):
134
+         # src: [B, T_src, d_model], tgt: [B, T_tgt, d_model], src_mask: [B, 1, T_src], tgt_mask: [B, T_tgt, T_tgt]
135
+         carry = self.multi_head_self_attn(self.layer_norm1(tgt), tgt_mask)  # [B, T_tgt, d_model]
136
+         tgt = carry + tgt  # [B, T_tgt, d_model]
137
+         carry = self.multi_head_cross_attn(src, self.layer_norm2(tgt), src_mask)  # [B, T_tgt, d_model]
138
+         tgt = carry + tgt  # [B, T_tgt, d_model]
139
+         carry = self.feed_forward(self.layer_norm3(tgt))  # [B, T_tgt, d_model]
140
+         tgt = carry + tgt  # [B, T_tgt, d_model]
141
+         return tgt
142
+
143
+
144
+ class TransformerBlock(nn.Module):
145
+     def __init__(self, d_model: int, d_ff: int, n_heads: int):
146
+         super().__init__()
147
+         self.layer_norm1 = nn.LayerNorm(d_model)
148
+         self.layer_norm2 = nn.LayerNorm(d_model)
149
+         self.multi_head_attn = MultiHeadSelfAttention(d_model, n_heads)
150
+         self.feed_forward = FeedForward(d_model, d_ff)
151
+
152
+     def forward(self, x: Tensor, mask: BoolTensor = None):
153
+         # x: [B, T, d_model], mask: [B, T, T] or [1, T, T]
154
+         carry = self.multi_head_attn(self.layer_norm1(x), mask)  # [B, T, d_model]
155
+         x = carry + x  # [B, T, d_model]
156
+         carry = self.feed_forward(self.layer_norm2(x))  # [B, T, d_model]
157
+         x = carry + x  # [B, T, d_model]
158
+         return x
159
+
160
+
161
+ class DecoderOnlyTransformer(nn.Module):
162
+     """
163
+     An autoregressive, GPT-style, decoder-only transformer for text generation purposes.
164
+     Importantly, returns outputs with dimension vocab_size, i.e. logits.
165
+     Stand-alone, can be used on its own.
166
+     """
167
+     def __init__(
168
+         self,
169
+         vocab_size: int,
170
+         max_seq_len: int,
171
+         d_model: int,
172
+         d_ff: int,
173
+         n_heads: int,
174
+         n_blocks: int,
175
+     ):
176
+         super().__init__()
177
+         assert d_model % n_heads == 0
178
+         self.embedding = nn.Embedding(vocab_size, d_model)
179
+         self.pos_encoding = PositionalEncoding(max_seq_len, d_model)
180
+         self.attn_blocks = nn.ModuleList([
181
+             TransformerBlock(d_model, d_ff, n_heads) for blk in range(n_blocks)
182
+         ])
183
+         self.final_layer_norm = nn.LayerNorm(d_model)
184
+         # In decoder-only Transformer, we use the unembedding matrix to map to logits of the vocabulary
185
+         self.unembedding = nn.Linear(d_model, vocab_size, bias=False)
186
+         self.unembedding.weight = self.embedding.weight
187
+
188
+     def forward(self, x: Tensor, pad_mask: BoolTensor = None):
189
+         # x: [B, T] (token indices), pad_mask: [B, 1, T] (optional, True = padding)
190
+         T = x.size(1)
191
+         causal_mask = torch.ones(T, T, dtype=torch.bool, device=x.device).triu(1).unsqueeze(0)  # [1, T, T]
192
+         mask = (causal_mask | pad_mask) if pad_mask is not None else causal_mask  # [B, T, T]
193
+         x = self.pos_encoding(self.embedding(x))  # [B, T, d_model]
194
+         for attn_blk in self.attn_blocks:
195
+             x = attn_blk(x, mask)  # [B, T, d_model]
196
+         x = self.final_layer_norm(x)  # [B, T, d_model]
197
+         x = self.unembedding(x)  # [B, T, vocab_size]
198
+         return x
199
+
200
+
201
+ class EncoderTransformer(nn.Module):
202
+     """
203
+     An encoder Transformer, with BERT-style bidirectionality.
204
+     Importantly, returns outputs with dimension d_model, i.e. same as the input embeddings.
205
+     Stand-alone, can be used on its own.
206
+     """
207
+     def __init__(
208
+         self,
209
+         vocab_size: int,
210
+         max_seq_len: int,
211
+         d_model: int,
212
+         d_ff: int,
213
+         n_heads: int,
214
+         n_blocks: int,
215
+     ):
216
+         super().__init__()
217
+         assert d_model % n_heads == 0
218
+         self.embedding = nn.Embedding(vocab_size, d_model)
219
+         self.pos_encoding = PositionalEncoding(max_seq_len, d_model)
220
+         self.attn_blocks = nn.ModuleList([
221
+             TransformerBlock(d_model, d_ff, n_heads) for blk in range(n_blocks)
222
+         ])
223
+         self.final_layer_norm = nn.LayerNorm(d_model)
224
+
225
+     def forward(self, x: Tensor, mask: BoolTensor = None):
226
+         # x: [B, T] (token indices), mask: [B, 1, T] (optional, True = padding)
227
+         x = self.pos_encoding(self.embedding(x))  # [B, T, d_model]
228
+         for attn_blk in self.attn_blocks:
229
+             x = attn_blk(x, mask)  # [B, T, d_model]
230
+         x = self.final_layer_norm(x)  # [B, T, d_model]
231
+         return x
232
+
233
+
234
+ class DecoderTransformer(nn.Module):
235
+     """
236
+     A decoder Transformer, which assumes that the src argument already consists of encoded vectors.
237
+     Importantly, returns outputs with dimension vocab_size, i.e. logits.
238
+     Not for stand-alone use.
239
+     """
240
+     def __init__(
241
+         self,
242
+         vocab_size: int,
243
+         max_seq_len: int,
244
+         d_model: int,
245
+         d_ff: int,
246
+         n_heads: int,
247
+         n_blocks: int,
248
+     ):
249
+         super().__init__()
250
+         assert d_model % n_heads == 0
251
+         self.embedding = nn.Embedding(vocab_size, d_model)
252
+         self.pos_encoding = PositionalEncoding(max_seq_len, d_model)
253
+         self.attn_blocks = nn.ModuleList([
254
+             CrossAttentionTransformerBlock(d_model, d_ff, n_heads) for blk in range(n_blocks)
255
+         ])
256
+         self.final_layer_norm = nn.LayerNorm(d_model)
257
+         # In decoder Transformer, we use the unembedding matrix to map to logits of the vocabulary
258
+         self.unembedding = nn.Linear(d_model, vocab_size, bias=False)
259
+         self.unembedding.weight = self.embedding.weight
260
+
261
+     def forward(self, src: Tensor, tgt: Tensor, src_mask: BoolTensor = None, tgt_mask: BoolTensor = None):
262
+         # src: [B, T_src, d_model] (encoded), tgt: [B, T_tgt] (token indices), src_mask: [B, 1, T_src], tgt_mask: [B, T_tgt, T_tgt]
263
+         tgt = self.pos_encoding(self.embedding(tgt))  # [B, T_tgt, d_model]
264
+         for attn_blk in self.attn_blocks:
265
+             tgt = attn_blk(src, tgt, src_mask, tgt_mask)  # [B, T_tgt, d_model]
266
+         tgt = self.final_layer_norm(tgt)  # [B, T_tgt, d_model]
267
+         tgt = self.unembedding(tgt)  # [B, T_tgt, vocab_size]
268
+         return tgt
269
+
270
+
271
+ class EncoderDecoderTransformer(nn.Module):
272
+     """
273
+     An encoder-decoder Transformer.
274
+     Importantly, returns outputs with dimension vocab_size, i.e. logits.
275
+     Stand-alone, can be used on its own.
276
+     """
277
+     def __init__(
278
+         self,
279
+         vocab_size: int,
280
+         max_seq_len: int,
281
+         d_model: int,
282
+         d_ff: int,
283
+         n_heads: int,
284
+         n_blocks_encoder: int,
285
+         n_blocks_decoder: int,
286
+     ):
287
+         super().__init__()
288
+         assert d_model % n_heads == 0
289
+         self.encoder = EncoderTransformer(vocab_size, max_seq_len, d_model, d_ff, n_heads, n_blocks_encoder)
290
+         self.decoder = DecoderTransformer(vocab_size, max_seq_len, d_model, d_ff, n_heads, n_blocks_decoder)
291
+
292
+     def forward(self, src: Tensor, tgt: Tensor, src_mask: BoolTensor = None, tgt_mask: BoolTensor = None):
293
+         # src: [B, T_src] (token indices), tgt: [B, T_tgt] (token indices), src_mask: [B, 1, T_src], tgt_mask: [B, T_tgt, T_tgt]
294
+         return self.decoder(self.encoder(src, src_mask), tgt, src_mask, tgt_mask)  # [B, T_tgt, vocab_size]
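As a rough illustration of what the single-head attention classes in this module compute, here is a pure-Python sketch of scaled dot-product attention using the module's mask convention (True = ignore, score set to -inf). It is not the package's code: `masked_attention` is a hypothetical helper that handles one unbatched sequence with plain lists and omits the learned projections. Unlike the PyTorch modules, it raises when a query has every key masked, a case in which a softmax over all -inf scores would produce NaN.

```python
import math


def masked_attention(Q, K, V, mask=None):
    """Scaled dot-product attention for one sequence.

    Q: [T_q][d_key], K: [T_k][d_key], V: [T_k][d_val],
    mask: [T_q][T_k] booleans where True means "ignore this key".
    """
    d_key = len(K[0])
    out = []
    for i, q in enumerate(Q):
        # Dot product of the query with every key, scaled by sqrt(d_key).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_key) for k in K]
        if mask is not None:
            scores = [float("-inf") if mask[i][j] else s for j, s in enumerate(scores)]
        peak = max(scores)
        if peak == float("-inf"):
            raise ValueError(f"query {i} has every key masked")
        # Numerically stable softmax; masked keys get weight exactly 0.
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out


Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
causal = [[j > i for j in range(3)] for i in range(3)]  # True above the diagonal
out = masked_attention(Q, K, V, causal)
print(out[0])  # [1.0, 0.0]: the first query can only see the first key
```

With the causal mask, the first output row equals the first value vector, because every later key is masked for that query.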
@@ -0,0 +1,98 @@
1
+ Metadata-Version: 2.4
2
+ Name: transformers-from-scratch
3
+ Version: 0.1.0
4
+ Summary: Encoder, decoder, and encoder-decoder Transformer architectures in PyTorch
5
+ License-Expression: MIT
6
+ Project-URL: Repository, https://github.com/etfrer-yi/Transformer-Variants
7
+ Requires-Python: >=3.9
8
+ Description-Content-Type: text/markdown
9
+ License-File: LICENSE
10
+ Requires-Dist: torch>=2.0
11
+ Dynamic: license-file
12
+
13
+ # Transformer Variants
14
+
15
+ Implementations of encoder, decoder, and encoder-decoder Transformer architectures in PyTorch, following "Attention Is All You Need" (Vaswani et al., 2017).
16
+
17
+ ## Installation
18
+
19
+ ```bash
20
+ pip install transformers-from-scratch
21
+ ```
22
+
23
+ PyTorch (`torch>=2.0`) is a declared dependency, so `pip` installs it automatically if it is missing. To choose a specific CUDA build, install PyTorch first from [pytorch.org](https://pytorch.org/get-started/locally/).
24
+
25
+ ## Development Setup
26
+
27
+ ```bash
28
+ python3 -m venv venv
29
+ source venv/bin/activate
30
+ pip install -r requirements.txt
31
+ ```
32
+
33
+ ## Modules
34
+
35
+ `model.py` provides three stand-alone models and the building blocks they are composed of.
36
+
37
+ | Class | Description |
38
+ |---|---|
39
+ | `EncoderTransformer` | BERT-style bidirectional encoder |
40
+ | `DecoderOnlyTransformer` | GPT-style autoregressive decoder |
41
+ | `EncoderDecoderTransformer` | Sequence-to-sequence encoder-decoder |
42
+ | `DecoderTransformer` | Decoder component — not for stand-alone use |
43
+
44
+ ## Usage
45
+
46
+ ```python
47
+ import torch
48
+ from transformer_variants import EncoderTransformer, DecoderOnlyTransformer, EncoderDecoderTransformer
49
+
50
+ VOCAB_SIZE, MAX_SEQ_LEN = 32000, 512
51
+ D_MODEL, D_FF, N_HEADS, N_BLOCKS = 512, 2048, 8, 6
52
+ ```
53
+
54
+ ### Encoder
55
+
56
+ ```python
57
+ model = EncoderTransformer(VOCAB_SIZE, MAX_SEQ_LEN, D_MODEL, D_FF, N_HEADS, N_BLOCKS)
58
+
59
+ src = torch.randint(0, VOCAB_SIZE, (B, T_src)) # [B, T_src]
60
+ src_mask = (src == pad_id).unsqueeze(1) # [B, 1, T_src] True = padding
61
+
62
+ out = model(src, src_mask) # [B, T_src, d_model]
63
+ ```
64
+
65
+ ### Decoder-only
66
+
67
+ ```python
68
+ model = DecoderOnlyTransformer(VOCAB_SIZE, MAX_SEQ_LEN, D_MODEL, D_FF, N_HEADS, N_BLOCKS)
69
+
70
+ tgt = torch.randint(0, VOCAB_SIZE, (B, T)) # [B, T]
71
+ pad_mask = (tgt == pad_id).unsqueeze(1) # [B, 1, T] True = padding (optional)
72
+
73
+ logits = model(tgt, pad_mask) # [B, T, vocab_size]
74
+ ```
75
+
76
+ A causal mask is generated internally — no need to pass one.
77
+
78
+ ### Encoder-decoder
79
+
80
+ ```python
81
+ model = EncoderDecoderTransformer(VOCAB_SIZE, MAX_SEQ_LEN, D_MODEL, D_FF, N_HEADS, N_BLOCKS, N_BLOCKS)
82
+
83
+ src = torch.randint(0, VOCAB_SIZE, (B, T_src)) # [B, T_src]
84
+ tgt = torch.randint(0, VOCAB_SIZE, (B, T_tgt)) # [B, T_tgt]
85
+
86
+ src_mask = (src == pad_id).unsqueeze(1) # [B, 1, T_src]
87
+ T_tgt = tgt.size(1)
88
+ causal = torch.ones(T_tgt, T_tgt, dtype=torch.bool, device=tgt.device).triu(1).unsqueeze(0) # [1, T_tgt, T_tgt]
89
+ tgt_mask = causal | (tgt == pad_id).unsqueeze(1) # [B, T_tgt, T_tgt]
90
+
91
+ logits = model(src, tgt, src_mask, tgt_mask) # [B, T_tgt, vocab_size]
92
+ ```
93
+
94
+ ## Running tests
95
+
96
+ ```bash
97
+ python -m unittest test -v
98
+ ```
@@ -0,0 +1,10 @@
1
+ LICENSE
2
+ README.md
3
+ pyproject.toml
4
+ transformer_variants/__init__.py
5
+ transformer_variants/model.py
6
+ transformers_from_scratch.egg-info/PKG-INFO
7
+ transformers_from_scratch.egg-info/SOURCES.txt
8
+ transformers_from_scratch.egg-info/dependency_links.txt
9
+ transformers_from_scratch.egg-info/requires.txt
10
+ transformers_from_scratch.egg-info/top_level.txt