x-transformers 2.0.0__py3-none-any.whl → 2.0.2__py3-none-any.whl

@@ -0,0 +1,2420 @@
1
+ Metadata-Version: 2.4
2
+ Name: x-transformers
3
+ Version: 2.0.2
4
+ Summary: X-Transformers
5
+ Project-URL: Homepage, https://pypi.org/project/x-transformers/
6
+ Project-URL: Repository, https://github.com/lucidrains/x-transformers
7
+ Author-email: Phil Wang <lucidrains@gmail.com>
8
+ License: MIT License
9
+
10
+ Copyright (c) 2020 Phil Wang
11
+
12
+ Permission is hereby granted, free of charge, to any person obtaining a copy
13
+ of this software and associated documentation files (the "Software"), to deal
14
+ in the Software without restriction, including without limitation the rights
15
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
16
+ copies of the Software, and to permit persons to whom the Software is
17
+ furnished to do so, subject to the following conditions:
18
+
19
+ The above copyright notice and this permission notice shall be included in all
20
+ copies or substantial portions of the Software.
21
+
22
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
23
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
24
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
25
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
26
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
27
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
28
+ SOFTWARE.
29
+ License-File: LICENSE
30
+ Keywords: artificial intelligence,attention mechanism,transformers
31
+ Classifier: Development Status :: 4 - Beta
32
+ Classifier: Intended Audience :: Developers
33
+ Classifier: License :: OSI Approved :: MIT License
34
+ Classifier: Programming Language :: Python :: 3.6
35
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
36
+ Requires-Python: >=3.9
37
+ Requires-Dist: einops>=0.8.0
38
+ Requires-Dist: einx>=0.3.0
39
+ Requires-Dist: loguru
40
+ Requires-Dist: packaging>=21.0
41
+ Requires-Dist: torch>=2.0
42
+ Provides-Extra: examples
43
+ Requires-Dist: lion-pytorch; extra == 'examples'
44
+ Requires-Dist: torchvision; extra == 'examples'
45
+ Requires-Dist: tqdm; extra == 'examples'
46
+ Provides-Extra: test
47
+ Requires-Dist: pytest; extra == 'test'
48
+ Description-Content-Type: text/markdown
49
+
50
+ ## x-transformers
51
+
52
+ [![PyPI version](https://badge.fury.io/py/x-transformers.svg)](https://badge.fury.io/py/x-transformers)
53
+
54
+ A concise but fully-featured transformer, complete with a set of promising e**x**perimental features from various papers.
55
+
56
+ ## Install
57
+
58
+ ```bash
59
+ $ pip install x-transformers
60
+ ```
61
+
62
+ ## Usage
63
+
64
+ Full encoder / decoder
65
+
66
+ ```python
67
+ import torch
68
+ from x_transformers import XTransformer
69
+
70
+ model = XTransformer(
71
+ dim = 512,
72
+ enc_num_tokens = 256,
73
+ enc_depth = 6,
74
+ enc_heads = 8,
75
+ enc_max_seq_len = 1024,
76
+ dec_num_tokens = 256,
77
+ dec_depth = 6,
78
+ dec_heads = 8,
79
+ dec_max_seq_len = 1024,
80
+ tie_token_emb = True # tie embeddings of encoder and decoder
81
+ )
82
+
83
+ src = torch.randint(0, 256, (1, 1024))
84
+ src_mask = torch.ones_like(src).bool()
85
+ tgt = torch.randint(0, 256, (1, 1024))
86
+
87
+ loss = model(src, tgt, mask = src_mask) # scalar cross entropy loss
88
+ loss.backward()
89
+ ```
90
+
91
+ Decoder-only (GPT-like)
92
+
93
+ ```python
94
+ import torch
95
+ from x_transformers import TransformerWrapper, Decoder
96
+
97
+ model = TransformerWrapper(
98
+ num_tokens = 20000,
99
+ max_seq_len = 1024,
100
+ attn_layers = Decoder(
101
+ dim = 512,
102
+ depth = 12,
103
+ heads = 8
104
+ )
105
+ ).cuda()
106
+
107
+ x = torch.randint(0, 256, (1, 1024)).cuda()
108
+
109
+ model(x) # (1, 1024, 20000)
110
+ ```
111
+
112
+ GPT-3 would be approximately the following (though you would not be able to run it anyway)
113
+
114
+ ```python
115
+
116
+ gpt3 = TransformerWrapper(
117
+ num_tokens = 50000,
118
+ max_seq_len = 2048,
119
+ attn_layers = Decoder(
120
+ dim = 12288,
121
+ depth = 96,
122
+ heads = 96,
123
+ attn_dim_head = 128
124
+ )
125
+ ).cuda()
126
+ ```
127
+
128
+ Encoder-only (BERT-like)
129
+
130
+ ```python
131
+ import torch
132
+ from x_transformers import TransformerWrapper, Encoder
133
+
134
+ model = TransformerWrapper(
135
+ num_tokens = 20000,
136
+ max_seq_len = 1024,
137
+ attn_layers = Encoder(
138
+ dim = 512,
139
+ depth = 12,
140
+ heads = 8
141
+ )
142
+ ).cuda()
143
+
144
+ x = torch.randint(0, 256, (1, 1024)).cuda()
145
+ mask = torch.ones_like(x).bool()
146
+
147
+ model(x, mask = mask) # (1, 1024, 20000)
148
+ ```
149
+
150
+ State of the art image classification (<a href="https://arxiv.org/abs/2205.01580">SimpleViT</a>)
151
+
152
+ ```python
153
+ import torch
154
+ from x_transformers import ViTransformerWrapper, Encoder
155
+
156
+ model = ViTransformerWrapper(
157
+ image_size = 256,
158
+ patch_size = 32,
159
+ num_classes = 1000,
160
+ attn_layers = Encoder(
161
+ dim = 512,
162
+ depth = 6,
163
+ heads = 8,
164
+ )
165
+ )
166
+
167
+ img = torch.randn(1, 3, 256, 256)
168
+ model(img) # (1, 1000)
169
+ ```
170
+
171
+ Image -> caption
172
+
173
+ ```python
174
+ import torch
175
+ from x_transformers import ViTransformerWrapper, TransformerWrapper, Encoder, Decoder
176
+
177
+ encoder = ViTransformerWrapper(
178
+ image_size = 256,
179
+ patch_size = 32,
180
+ attn_layers = Encoder(
181
+ dim = 512,
182
+ depth = 6,
183
+ heads = 8
184
+ )
185
+ )
186
+
187
+ decoder = TransformerWrapper(
188
+ num_tokens = 20000,
189
+ max_seq_len = 1024,
190
+ attn_layers = Decoder(
191
+ dim = 512,
192
+ depth = 6,
193
+ heads = 8,
194
+ cross_attend = True
195
+ )
196
+ )
197
+
198
+ img = torch.randn(1, 3, 256, 256)
199
+ caption = torch.randint(0, 20000, (1, 1024))
200
+
201
+ encoded = encoder(img, return_embeddings = True)
202
+ decoder(caption, context = encoded) # (1, 1024, 20000)
203
+ ```
204
+
205
+ <a href="https://arxiv.org/abs/2209.06794">PaLI</a>, state of the art language-vision model
206
+
207
+ ```python
208
+ import torch
209
+ from x_transformers import ViTransformerWrapper, XTransformer, Encoder
210
+
211
+ # PaLI is composed of
212
+ # 1. vision transformer (ViTransformerWrapper) +
213
+ # 2. encoder-decoder transformer (XTransformer)
214
+
215
+ vit = ViTransformerWrapper(
216
+ image_size = 256,
217
+ patch_size = 32,
218
+ attn_layers = Encoder(
219
+ dim = 512,
220
+ depth = 6,
221
+ heads = 8
222
+ )
223
+ )
224
+
225
+ pali = XTransformer(
226
+ dim = 512,
227
+ enc_num_tokens = 256,
228
+ enc_depth = 6,
229
+ enc_heads = 8,
230
+ enc_max_seq_len = 1024,
231
+ dec_num_tokens = 256,
232
+ dec_depth = 6,
233
+ dec_heads = 8,
234
+ dec_max_seq_len = 1024
235
+ )
236
+
237
+ # training data
238
+
239
+ img = torch.randn(1, 3, 256, 256) # images
240
+ prompt = torch.randint(0, 256, (1, 1024)) # prompt
241
+ prompt_mask = torch.ones(1, 1024).bool() # prompt text mask
242
+ output_text = torch.randint(0, 256, (1, 1024)) # target output text
243
+
244
+ # train
245
+
246
+ img_embeds = vit(
247
+ img,
248
+ return_embeddings = True
249
+ )
250
+
251
+ loss = pali(
252
+ prompt,
253
+ output_text,
254
+ mask = prompt_mask,
255
+ src_prepend_embeds = img_embeds # will prepend image embeddings to encoder text embeddings before attention
256
+ )
257
+
258
+ loss.backward()
259
+
260
+ # do the above for many steps on a 17B parameter model
261
+ # attention is all you need
262
+ ```
263
+
264
+ ## Dropouts
265
+
266
+ ```python
267
+ import torch
268
+ from x_transformers import TransformerWrapper, Decoder, Encoder
269
+
270
+ model = TransformerWrapper(
271
+ num_tokens = 20000,
272
+ max_seq_len = 1024,
273
+ emb_dropout = 0.1, # dropout after embedding
274
+ attn_layers = Decoder(
275
+ dim = 512,
276
+ depth = 6,
277
+ heads = 8,
278
+ layer_dropout = 0.1, # stochastic depth - dropout entire layer
279
+ attn_dropout = 0.1, # dropout post-attention
280
+ ff_dropout = 0.1 # feedforward dropout
281
+ )
282
+ )
283
+
284
+ x = torch.randint(0, 20000, (1, 1024))
285
+ model(x)
286
+ ```
287
+
288
+ ## Features
289
+
290
+ ### Flash Attention
291
+
292
+ <img src="./images/flash-attention.png" width="500px"></img>
293
+
294
+ What originally started off as <a href="https://arxiv.org/abs/2112.05682">a short paper</a> from Markus Rabe culminated in a practical fused attention CUDA kernel, named <a href="https://arxiv.org/abs/2205.14135">Flash Attention</a> by <a href="https://tridao.me/">Tri Dao</a>.
295
+
296
+ The technique processes the attention matrix in tiles, only keeping track of the running softmax and exponentiated weighted sums. By recomputing on the backwards pass in a tiled fashion, one is able to keep memory linear with respect to sequence length. This allows many recent models to reach longer context lengths without worrying about the memory bottleneck.
297
+
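+ To make the running-softmax idea concrete, here is a minimal sketch (mine, not the fused kernel itself) of the online softmax accumulation for a single query, processing keys / values chunk by chunk while keeping only a running max, normalizer and weighted sum:
+
+ ```python
+ import torch
+
+ def online_softmax_attention(q, k, v, chunk_size = 64):
+     # q: (dim,), k / v: (seq, dim) - single query for clarity
+     scale = q.shape[-1] ** -0.5
+
+     running_max = torch.tensor(float('-inf'))
+     running_denom = torch.tensor(0.)
+     running_numer = torch.zeros_like(v[0])
+
+     for k_chunk, v_chunk in zip(k.split(chunk_size), v.split(chunk_size)):
+         scores = (k_chunk @ q) * scale
+         new_max = torch.maximum(running_max, scores.max())
+
+         # rescale previously accumulated statistics to the new running max
+         correction = torch.exp(running_max - new_max)
+         exp_scores = torch.exp(scores - new_max)
+
+         running_numer = running_numer * correction + exp_scores @ v_chunk
+         running_denom = running_denom * correction + exp_scores.sum()
+         running_max = new_max
+
+     return running_numer / running_denom
+
+ q, k, v = torch.randn(64), torch.randn(1024, 64), torch.randn(1024, 64)
+
+ out = online_softmax_attention(q, k, v)
+ exact = torch.softmax((k @ q) * 64 ** -0.5, dim = 0) @ v
+ assert torch.allclose(out, exact, atol = 1e-4)  # matches full attention without materializing all scores at once
+ ```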
298
+ Other engineering decisions made by Tri Dao led to its enormous success, namely minimizing HBM accesses so that both the forward and backward passes outperform naive attention. In other words, flash attention is not only more memory efficient, but faster as well, making it a necessity for training transformers.
299
+
300
+ MetaAI has recently added the ability to use <a href="https://github.com/hazyresearch/flash-attention">Tri Dao's CUDA kernel</a> through the <a href="https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html">scaled_dot_product_attention</a> function in Pytorch 2.0. (They also have a `mem_efficient` attention, which is identical in design to flash attention, except that the tiles are traversed differently.)
301
+
302
+ <a href="https://ai.facebook.com/blog/large-language-model-llama-meta-ai/">Llama</a> was trained using Flash Attention. The only reason to avoid it is if you require operating on the attention matrix (dynamic positional bias, talking heads, residual attention).
303
+
304
+ You can use it in this repository by setting `attn_flash` to `True` and enjoy the immediate memory savings and increase in speed.
305
+
306
+ ex.
307
+
308
+ ```python
309
+ import torch
310
+ from x_transformers import TransformerWrapper, Decoder, Encoder
311
+
312
+ model = TransformerWrapper(
313
+ num_tokens = 20000,
314
+ max_seq_len = 1024,
315
+ attn_layers = Decoder(
316
+ dim = 512,
317
+ depth = 6,
318
+ heads = 8,
319
+ attn_flash = True # just set this to True if you have pytorch 2.0 installed
320
+ )
321
+ )
322
+ ```
323
+
324
+ ### Augmenting Self-attention with Persistent Memory
325
+
326
+ <img src="./images/all-attention.png" width="500px"></img>
327
+
328
+ https://arxiv.org/abs/1907.01470
329
+
330
+ Proposes adding learned memory key / values prior to attention. They were able to remove feedforwards altogether and attain similar performance to the original transformers. I have found that keeping the feedforwards and adding the memory key / values leads to even better performance.
331
+
332
+ ```python
333
+ from x_transformers import Decoder, Encoder
334
+
335
+ enc = Encoder(
336
+ dim = 512,
337
+ depth = 6,
338
+ heads = 8,
339
+ attn_num_mem_kv = 16 # 16 memory key / values
340
+ )
341
+ ```
342
+
343
+ ### Memory Transformers
344
+
345
+ <img src="./images/memory-transformer.png" width="500px"></img>
346
+
347
+ https://arxiv.org/abs/2006.11527
348
+
349
+ Proposes adding learned tokens, akin to CLS tokens, named memory tokens, that are passed through the attention layers alongside the input tokens. This setting is compatible with both encoder and decoder training.
350
+
351
+ ```python
352
+ import torch
353
+ from x_transformers import TransformerWrapper, Decoder, Encoder
354
+
355
+ model = TransformerWrapper(
356
+ num_tokens = 20000,
357
+ max_seq_len = 1024,
358
+ num_memory_tokens = 20, # 20 memory tokens
359
+ attn_layers = Encoder(
360
+ dim = 512,
361
+ depth = 6,
362
+ heads = 8
363
+ )
364
+ )
365
+ ```
366
+
367
+ Update: MetaAI researchers <a href="https://arxiv.org/abs/2309.16588">have found</a> that adding memory tokens (they call them register tokens) alleviates outliers (which are now suspected to be a pathology of attention networks that are unable to <a href="https://arxiv.org/abs/2306.12929">attend to nothing</a>).
368
+
369
+ Update 2: a hybrid architecture out of Nvidia named <a href="https://openreview.net/forum?id=A1ztozypga">Hymba</a> used memory tokens successfully in the autoregressive case, termed meta tokens in their paper.
370
+
371
+ Update 3: further corroborated by <a href="https://arxiv.org/abs/2501.00663">a paper</a> trying to extend memory in attention networks, termed persistent memory
372
+
373
+ ### Transformers Without Tears
374
+
375
+ <img src="./images/scalenorm.png"></img>
376
+
377
+ https://arxiv.org/abs/1910.05895
378
+
379
+ They experimented with alternatives to layer normalization and found one, ScaleNorm, that is both effective and simpler. Researchers have shared with me that it leads to faster convergence.
380
+
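+ For reference, ScaleNorm replaces the layernorm with a single learned scalar times the l2-normalized vector. A rough sketch of the idea (the class name is mine, not the library's internal module):
+
+ ```python
+ import torch
+ from torch import nn
+ import torch.nn.functional as F
+
+ class ScaleNorm(nn.Module):
+     # ScaleNorm(x) = g * x / ||x||, with one learned scalar g per layer
+     def __init__(self, dim, eps = 1e-5):
+         super().__init__()
+         self.eps = eps
+         self.g = nn.Parameter(torch.ones(1) * dim ** 0.5)  # initialized to sqrt(dim), per the paper
+
+     def forward(self, x):
+         return F.normalize(x, dim = -1, eps = self.eps) * self.g
+
+ norm = ScaleNorm(512)
+ out = norm(torch.randn(1, 1024, 512))  # (1, 1024, 512)
+ ```
+
+ In this repository you only need to set a single flag: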
381
+ ```python
382
+ import torch
383
+ from x_transformers import TransformerWrapper, Decoder, Encoder
384
+
385
+ model = TransformerWrapper(
386
+ num_tokens = 20000,
387
+ max_seq_len = 1024,
388
+ attn_layers = Decoder(
389
+ dim = 512,
390
+ depth = 6,
391
+ heads = 8,
392
+ use_scalenorm = True # set to True to use for all layers
393
+ )
394
+ )
395
+ ```
396
+
397
+ You can also use the l2 normalized embeddings proposed as part of `fixnorm`. I have found it leads to improved convergence when paired with small initialization (proposed by <a href="https://github.com/BlinkDL">BlinkDL</a>). The small initialization will be taken care of as long as `l2norm_embed` is set to `True`.
398
+
399
+ ```python
400
+ import torch
401
+ from x_transformers import TransformerWrapper, Decoder, Encoder
402
+
403
+ model = TransformerWrapper(
404
+ num_tokens = 20000,
405
+ max_seq_len = 1024,
406
+ l2norm_embed = True, # set this to True for l2 normalized embedding + small init
407
+ attn_layers = Decoder(
408
+ dim = 512,
409
+ depth = 6,
410
+ heads = 8
411
+ )
412
+ )
413
+ ```
414
+
415
+ Along the same lines of l2 normalized embeddings, Huggingface's <a href="https://huggingface.co/bigscience/bloom">175B parameter BLOOM</a> also places a layernorm right after the embeddings and just before the tokens enter the attention layers. This was corroborated by Yandex's <a href="https://github.com/yandex/YaLM-100B">100B parameter YaLM</a> to stabilize training.
416
+
417
+ It is recommended you have either `l2norm_embed` or `post_emb_norm` set to `True`, but not both, as they probably serve the same purpose.
418
+
419
+ ```python
420
+ import torch
421
+ from x_transformers import TransformerWrapper, Decoder, Encoder
422
+
423
+ model = TransformerWrapper(
424
+ num_tokens = 20000,
425
+ max_seq_len = 1024,
426
+ post_emb_norm = True, # set this to True to layernorm summed token + pos embeddings
427
+ attn_layers = Decoder(
428
+ dim = 512,
429
+ depth = 6,
430
+ heads = 8
431
+ )
432
+ )
433
+ ```
434
+
435
+ ### Root Mean Square Layer Normalization
436
+
437
+ https://arxiv.org/abs/1910.07467
438
+
439
+ The authors propose to replace layer normalization with a simpler alternative, without mean centering and the learned bias. An investigative paper found this to be the <a href="https://arxiv.org/abs/2102.11972">best performing normalization variant</a>. It was also used in Deepmind's latest large language models, <a href="https://deepmind.com/research/publications/2021/improving-language-models-by-retrieving-from-trillions-of-tokens">Retro</a> and <a href="https://arxiv.org/abs/2112.11446">Gopher</a>.
440
+
441
+ ```python
442
+ import torch
443
+ from x_transformers import TransformerWrapper, Decoder, Encoder
444
+
445
+ model = TransformerWrapper(
446
+ num_tokens = 20000,
447
+ max_seq_len = 1024,
448
+ attn_layers = Decoder(
449
+ dim = 512,
450
+ depth = 6,
451
+ heads = 8,
452
+ use_rmsnorm = True # set to true to use for all layers
453
+ )
454
+ )
455
+ ```
456
+
457
+ *July 2023* <a href="https://arxiv.org/abs/2307.14995">A linear attention paper</a> has experiments to show that removing the learned multiplicative gamma led to no performance degradation. This simplifies the RMS normalization to a satisfying `l2norm(x) * sqrt(dim)`.
458
+
459
+ ```python
460
+ import torch
461
+ from x_transformers import TransformerWrapper, Decoder, Encoder
462
+
463
+ model = TransformerWrapper(
464
+ num_tokens = 20000,
465
+ max_seq_len = 1024,
466
+ attn_layers = Decoder(
467
+ dim = 512,
468
+ depth = 6,
469
+ heads = 8,
470
+ use_simple_rmsnorm = True # set to true to use for all layers
471
+ )
472
+ )
473
+ ```
474
+
475
+ ### GLU Variants Improve Transformer
476
+
477
+ <img src="./images/ffglu.png"></img>
478
+
479
+ https://arxiv.org/abs/2002.05202
480
+
481
+ A Noam Shazeer paper that explores gating in the feedforward, finding that simple gating with GELU leads to significant improvements. This variant also showed up in the latest mT5 architecture. You should always turn this on (I may eventually turn it on by default).
482
+
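+ The gist of the GLU variant is to project to twice the hidden dimension, split, and gate one half with the activation of the other. A rough sketch of such a feedforward (names are mine; the library's internal implementation may differ in details such as the hidden dimension):
+
+ ```python
+ import torch
+ from torch import nn
+ import torch.nn.functional as F
+
+ class GEGLUFeedForward(nn.Module):
+     def __init__(self, dim, mult = 4):
+         super().__init__()
+         inner_dim = dim * mult
+         self.proj_in = nn.Linear(dim, inner_dim * 2)  # project to twice the hidden dim
+         self.proj_out = nn.Linear(inner_dim, dim)
+
+     def forward(self, x):
+         x, gate = self.proj_in(x).chunk(2, dim = -1)  # split into value and gate
+         return self.proj_out(x * F.gelu(gate))        # GELU-gated linear unit
+
+ ff = GEGLUFeedForward(512)
+ out = ff(torch.randn(1, 1024, 512))  # (1, 1024, 512)
+ ```
+
+ To turn it on for all feedforwards in this repository: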
483
+ ```python
484
+ import torch
485
+ from x_transformers import TransformerWrapper, Decoder, Encoder
486
+
487
+ model = TransformerWrapper(
488
+ num_tokens = 20000,
489
+ max_seq_len = 1024,
490
+ attn_layers = Decoder(
491
+ dim = 512,
492
+ depth = 6,
493
+ heads = 8,
494
+ ff_glu = True # set to true to use for all feedforwards
495
+ )
496
+ )
497
+ ```
498
+
499
+ The <a href="https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html">PaLM</a> language model also chose to use the Swish GLU variant. You can turn this on by setting two flags
500
+
501
+ ```python
502
+ import torch
503
+ from x_transformers import TransformerWrapper, Decoder, Encoder
504
+
505
+ model = TransformerWrapper(
506
+ num_tokens = 20000,
507
+ max_seq_len = 1024,
508
+ attn_layers = Decoder(
509
+ dim = 512,
510
+ depth = 6,
511
+ heads = 8,
512
+ ff_swish = True, # set this to True
513
+ ff_glu = True # set to true to use for all feedforwards
514
+ )
515
+ )
516
+ ```
517
+
518
+ ### No Bias in Feedforward
519
+
520
+ Starting with <a href="https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html">PaLM</a>, there began a trend of removing biases from the transformer altogether. <a href="https://github.com/borisdayma">Boris Dayma</a> has run a number of experiments that showed removing biases from feedforwards led to increased throughput without any loss of accuracy. This was corroborated by <a href="https://arxiv.org/abs/2212.14034">yet another paper</a> investigating transformer architecture variants.
521
+
522
+ You can turn off the feedforward bias as follows
523
+
524
+ ```python
525
+ import torch
526
+ from x_transformers import TransformerWrapper, Decoder, Encoder
527
+
528
+ model = TransformerWrapper(
529
+ num_tokens = 20000,
530
+ max_seq_len = 1024,
531
+ attn_layers = Decoder(
532
+ dim = 512,
533
+ depth = 6,
534
+ heads = 8,
535
+ ff_no_bias = True # set this to True
536
+ )
537
+ )
538
+ ```
539
+
540
+ ### ReLU²
541
+
542
+ https://arxiv.org/abs/2109.08668
543
+
544
+ This paper used neural architecture search and found an activation, ReLU Squared, that is simpler than GELU and performs better in the autoregressive language model setting. I have confirmed this in my independent experiments. However, if one were using the GLU variant from above, GELU still performs better. Pending further corroboration.
545
+
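+ The activation itself is nothing more than a squared ReLU, i.e. `relu(x) ** 2`:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def relu_squared(x):
+     # ReLU Squared: zero for negative inputs, quadratic growth for positive ones
+     return F.relu(x) ** 2
+
+ out = relu_squared(torch.randn(1, 1024, 2048))
+ ```
+
+ To use it in place of GELU for the feedforwards: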
546
+ ```python
547
+ import torch
548
+ from x_transformers import TransformerWrapper, Decoder, Encoder
549
+
550
+ model = TransformerWrapper(
551
+ num_tokens = 20000,
552
+ max_seq_len = 1024,
553
+ attn_layers = Decoder(
554
+ dim = 512,
555
+ depth = 6,
556
+ heads = 8,
557
+ ff_relu_squared = True
558
+ )
559
+ )
560
+ ```
561
+
562
+ ### Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection
563
+
564
+ <img src="./images/topk-attention.png" width="500px"></img>
565
+
566
+ https://arxiv.org/abs/1912.11637
567
+
568
+ This paper proposes an efficient way to sparsify attention by zeroing all dot-product query/key values not within the top k values. They show that this cheap method was as effective as other, more expensive operations like sparsemax or entmax15. This technique comes with the cost of an extra hyperparameter (the top k values to keep). The paper recommends a value of `k = 8`.
569
+
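+ Mechanically, this amounts to keeping only the top k pre-softmax logits per query and pushing the rest to negative infinity. A rough standalone sketch of that masking step (not the library's internal code):
+
+ ```python
+ import torch
+
+ def sparse_topk_logits(logits, k = 8):
+     # keep the top-k logits per query row, mask everything else to -inf before softmax
+     top_values, _ = logits.topk(k, dim = -1)
+     kth_value = top_values[..., -1:]
+     return logits.masked_fill(logits < kth_value, float('-inf'))
+
+ logits = torch.randn(1, 8, 1024, 1024)  # (batch, heads, queries, keys)
+ attn = sparse_topk_logits(logits, k = 8).softmax(dim = -1)
+ ```
+
+ To use it within this repository: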
570
+ ```python
571
+ import torch
572
+ from x_transformers import TransformerWrapper, Decoder
573
+
574
+ model = TransformerWrapper(
575
+ num_tokens = 20000,
576
+ max_seq_len = 1024,
577
+ attn_layers = Decoder(
578
+ dim = 512,
579
+ depth = 6,
580
+ heads = 8,
581
+ attn_sparse_topk = 8, # keep only the top 8 values before attention (softmax)
582
+ attn_sparse_topk_straight_through = True # straight through the original gradients
583
+ )
584
+ )
585
+ ```
586
+
587
+ For the extreme case of a `topk` value of `1`, you can use the following
588
+
589
+ ```python
590
+ model = TransformerWrapper(
591
+ num_tokens = 20000,
592
+ max_seq_len = 1024,
593
+ attn_layers = Decoder(
594
+ dim = 512,
595
+ depth = 6,
596
+ heads = 8,
597
+ attn_hard = True # will only propagate the single value of the argmax of qk logit. offered in the case it addresses https://arxiv.org/abs/2410.01104
598
+ )
599
+ )
600
+ ```
601
+
602
+ ### Talking-Heads Attention
603
+
604
+ <img src="./images/talking-heads.png" width="500px"></img>
605
+
606
+ https://arxiv.org/abs/2003.02436
607
+
608
+ A Noam Shazeer paper that proposes mixing information between heads both before and after the attention softmax. This comes with the cost of extra memory and compute.
609
+
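+ The mixing itself is just a learned heads-by-heads linear projection applied over the head dimension of the attention logits (pre-softmax) and/or the attention weights (post-softmax). A rough standalone sketch (the identity initialization is my assumption, not necessarily the library's):
+
+ ```python
+ import torch
+ from torch import nn
+
+ class TalkingHeads(nn.Module):
+     def __init__(self, heads):
+         super().__init__()
+         self.mix = nn.Parameter(torch.eye(heads))  # learned (heads x heads) mixing matrix
+
+     def forward(self, attn):
+         # attn: (batch, heads, query len, key len) - linearly combine across heads
+         return torch.einsum('b h i j, h g -> b g i j', attn, self.mix)
+
+ pre_mix, post_mix = TalkingHeads(8), TalkingHeads(8)
+
+ logits = torch.randn(1, 8, 1024, 1024)
+ attn = post_mix(pre_mix(logits).softmax(dim = -1))
+ ```
+
+ Within this repository: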
610
+ ```python
611
+ import torch
612
+ from x_transformers import TransformerWrapper, Decoder
613
+
614
+ model = TransformerWrapper(
615
+ num_tokens = 20000,
616
+ max_seq_len = 1024,
617
+ attn_layers = Decoder(
618
+ dim = 512,
619
+ depth = 6,
620
+ heads = 8,
621
+ attn_pre_talking_heads = True, # linear combination across pre-softmax attn logits across heads
622
+ attn_post_talking_heads = True # linear combination across post-softmax attn across heads
623
+ )
624
+ )
625
+ ```
626
+
627
+ ### One Write-Head Is All You Need
628
+
629
+ https://arxiv.org/abs/1911.02150
630
+
631
+ Yet another Noam Shazeer paper (he's a legend) that proposes to only have one head for the key / values, but multi-headed queries. This paper was largely ignored for a while, but recently validated at scale in <a href="https://arxiv.org/abs/2203.07814">AlphaCode</a> as well as <a href="https://arxiv.org/abs/2204.02311">PaLM</a>. It has the property of being memory efficient when decoding extremely large language models. You can use it with one keyword argument as shown below.
632
+
633
+ ```python
634
+ import torch
635
+ from x_transformers import TransformerWrapper, Decoder
636
+
637
+ model = TransformerWrapper(
638
+ num_tokens = 20000,
639
+ max_seq_len = 1024,
640
+ attn_layers = Decoder(
641
+ dim = 512,
642
+ depth = 6,
643
+ heads = 8,
644
+ attn_one_kv_head = True
645
+ )
646
+ )
647
+ ```
648
+
649
+ This has been further generalized in <a href="https://arxiv.org/abs/2305.13245">a recent paper</a> to allow for groups of query heads to attend to a single key / value head. You can use this by specifying `attn_kv_heads`.
650
+
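+ Conceptually, grouped query attention just shares each key / value head across a group of query heads. A rough sketch of the idea, done here by repeating the kv heads (the library handles this internally; the shapes are only illustrative):
+
+ ```python
+ import torch
+ from einops import repeat
+
+ q = torch.randn(1, 8, 1024, 64)  # 8 query heads
+ k = torch.randn(1, 2, 1024, 64)  # only 2 key / value heads
+ v = torch.randn(1, 2, 1024, 64)
+
+ groups = q.shape[1] // k.shape[1]  # 4 query heads share each kv head
+ k, v = map(lambda t: repeat(t, 'b h n d -> b (h g) n d', g = groups), (k, v))
+
+ attn = ((q @ k.transpose(-2, -1)) * 64 ** -0.5).softmax(dim = -1)
+ out = attn @ v  # (1, 8, 1024, 64)
+ ```
+
+ In this repository: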
651
+ ```python
652
+ import torch
653
+ from x_transformers import TransformerWrapper, Decoder
654
+
655
+ model = TransformerWrapper(
656
+ num_tokens = 20000,
657
+ max_seq_len = 1024,
658
+ attn_layers = Decoder(
659
+ dim = 512,
660
+ depth = 12,
661
+ heads = 8,
662
+ attn_kv_heads = 2 # say you want 4 query heads to attend to 1 key / value head
663
+ )
664
+ )
665
+ ```
666
+
667
+ ### Attention on Attention for Image Captioning
668
+
669
+ <img src="./images/attention-on-attention.png"></img>
670
+
671
+ https://arxiv.org/abs/1908.06954
672
+
673
+ This paper proposes to add a gated linear unit at the end of the attention layer, further gated by the original queries. Although this is not widely used outside of visual question / answering, I suspect it should lead to improvements after seeing the success of the feedforward GLU variant.
674
+
675
+ Update: After some experimentation, I found this variant actually performs worse, but if it were to be modified to not concatenate the queries before gating, it performs much better. That is what we will be using in this repository.
676
+
677
+ ```python
678
+ import torch
679
+ from x_transformers import TransformerWrapper, Encoder
680
+
681
+ model = TransformerWrapper(
682
+ num_tokens = 20000,
683
+ max_seq_len = 1024,
684
+ attn_layers = Encoder(
685
+ dim = 512,
686
+ depth = 6,
687
+ heads = 8,
688
+ attn_on_attn = True # gate output of attention layer, by queries
689
+ )
690
+ )
691
+ ```
692
+
693
+ ### Intra-attention Gating on Values
694
+
695
+ <img src="./images/gate_values.png" width="400px"></img>
696
+
697
+ <a href="https://github.com/deepmind/alphafold">Alphafold2</a> had a peculiar variant of attention where they gate the aggregated values with the input, presumably to have the block have more control over the update.
698
+
699
+ A quick test shows a small but noticeable improvement, on about the same order as attention on attention.
700
+
701
+ ```python
702
+ import torch
703
+ from x_transformers import TransformerWrapper, Encoder
704
+
705
+ model = TransformerWrapper(
706
+ num_tokens = 20000,
707
+ max_seq_len = 1024,
708
+ attn_layers = Encoder(
709
+ dim = 512,
710
+ depth = 6,
711
+ heads = 8,
712
+ attn_gate_values = True # gate aggregated values with the input
713
+ )
714
+ )
715
+ ```
716
+
717
+ ### Improving Transformer Models by Reordering their Sublayers
718
+
719
+ <img src="./images/sandwich.png"></img>
720
+
721
+ <img src="./images/sandwich-2.png"></img>
722
+
723
+ https://arxiv.org/abs/1911.03864
724
+
725
+ This paper proposes to break from the normal fixed pattern of alternating attention and feedforwards, instead having blocks of only attention at the beginning followed by blocks of only feedforwards at the end. This was further corroborated by a paper from Nvidia that reduces the number of attention layers to a third of the feedforwards without loss in performance.
726
+
727
+ The amount of interleaving is controlled by a "sandwich coefficient", which they found to be optimal at a value of `6`.
728
+
729
+ You can experiment with this feature as shown below
730
+
731
+ ```python
732
+ import torch
733
+ from x_transformers import TransformerWrapper, Encoder
734
+
735
+ model = TransformerWrapper(
736
+ num_tokens = 20000,
737
+ max_seq_len = 1024,
738
+ attn_layers = Encoder(
739
+ dim = 512,
740
+ depth = 6,
741
+ heads = 8,
742
+ sandwich_coef = 6 # interleave attention and feedforwards with sandwich coefficient of 6
743
+ )
744
+ )
745
+ ```
746
+
747
+ ### Weight-tied Layers
748
+
749
+ In the early days of the Cambrian explosion of BERT variants, a paper explored tying the weights of all the layers; the model was named <a href="https://arxiv.org/abs/1909.11942">ALBERT</a>. You can use it by setting `weight_tie_layers = True`
750
+
751
+ ```python
752
+ import torch
753
+ from x_transformers import TransformerWrapper, Encoder
754
+
755
+ model = TransformerWrapper(
756
+ num_tokens = 20000,
757
+ max_seq_len = 1024,
758
+ attn_layers = Encoder(
759
+ dim = 512,
760
+ depth = 12,
761
+ weight_tie_layers = True # set this to True to weight tie all the layers
762
+ )
763
+ )
764
+ ```
765
+
766
+ If you wish to do something more sophisticated, say 3 layers, with each layer run recurrently 4 times before moving onto the next (similar to <a href="https://arxiv.org/abs/2405.15071">this paper</a>), that is possible as well. Be aware the `layers_execute_order` is 0-indexed
767
+
768
+ ```python
769
+ import torch
770
+ from x_transformers import TransformerWrapper, Decoder
771
+
772
+ model = TransformerWrapper(
773
+ num_tokens = 20000,
774
+ max_seq_len = 1024,
775
+ attn_layers = Decoder(
776
+ dim = 512,
777
+ custom_layers = (
778
+ 'a', 'f', # 3 sets of attention and feedforward
779
+ 'a', 'f',
780
+ 'a', 'f'
781
+ ),
782
+ layers_execute_order = (
783
+ *((0, 1) * 4), # each done 4 times before sequentially passed forward, but you can probably imagine some more interesting configurations...
784
+ *((2, 3) * 4),
785
+ *((4, 5) * 4),
786
+ )
787
+ )
788
+ )
789
+ ```
790
+
791
+ ### Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View
792
+
793
+ <img src="./images/macaron-1.png"></img>
794
+
795
+ <img src="./images/macaron-2.png"></img>
796
+
797
+ https://arxiv.org/abs/1906.02762
798
+
799
+ The authors propose to view the success of transformers from a dynamical systems point of view, and then propose an improvement based on the mathematics of that view. Specifically, they propose to place the attention layer in between two feedforward layers. This was adopted by a paper using transformers for speech recognition, the <a href="https://arxiv.org/abs/2005.08100">Conformer</a>.
800
+
801
+ ```python
802
+ import torch
803
+ from x_transformers import TransformerWrapper, Encoder
804
+
805
+ model = TransformerWrapper(
806
+ num_tokens = 20000,
807
+ max_seq_len = 1024,
808
+ attn_layers = Encoder(
809
+ dim = 512,
810
+ depth = 6,
811
+ heads = 8,
812
+ macaron = True # use macaron configuration
813
+ )
814
+ )
815
+ ```
816
+
817
+ ### T5's Simplified Relative Positional Encoding
818
+
819
+ https://arxiv.org/abs/1910.10683
820
+
821
+ T5 is one of the most successful encoder / decoder transformer architectures trained to date. They invented a new simplified relative positional encoding based on learned bias values that are added to the attention matrix pre-softmax. This bias is shared and injected into each attention layer. I have decided to include this because it offers a cheap way to have relative positional encoding (superior to absolute positional encoding), and I have read papers that suggest having positional encoding added to each layer (vs only before the first) is beneficial.
822
+
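+ As a simplified sketch of the idea (ignoring T5's logarithmic bucketing of large distances, and with names of my own), the bias is just a learned embedding per head, indexed by the clipped relative distance and added to the attention logits:
+
+ ```python
+ import torch
+ from torch import nn
+
+ class SimpleRelativePositionBias(nn.Module):
+     def __init__(self, heads, max_distance = 128):
+         super().__init__()
+         self.max_distance = max_distance
+         self.bias = nn.Embedding(2 * max_distance + 1, heads)  # one bias per head per relative distance
+
+     def forward(self, qlen, klen):
+         rel = torch.arange(klen)[None, :] - torch.arange(qlen)[:, None]          # (qlen, klen)
+         rel = rel.clamp(-self.max_distance, self.max_distance) + self.max_distance
+         return self.bias(rel).permute(2, 0, 1)                                    # (heads, qlen, klen)
+
+ rel_pos = SimpleRelativePositionBias(heads = 8)
+ bias = rel_pos(1024, 1024)
+ # attention logits = (q @ k^T) * scale + bias, before the softmax
+ ```
+
+ In this repository it is a single flag: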
823
+ ```python
824
+ import torch
825
+ from x_transformers import TransformerWrapper, Decoder
826
+
827
+ model = TransformerWrapper(
828
+ num_tokens = 20000,
829
+ max_seq_len = 1024,
830
+ attn_layers = Decoder(
831
+ dim = 512,
832
+ depth = 6,
833
+ heads = 8,
834
+ rel_pos_bias = True # adds relative positional bias to all attention layers, a la T5
835
+ )
836
+ )
837
+ ```
838
+
839
+ ### Residual Attention
840
+
841
+ <img src="./images/residual_attn.png" width="500px"></img>
842
+
843
+ https://arxiv.org/abs/2012.11747
844
+
845
+ This paper from Google proposes residualizing the pre-attention scores across all layers. At the cost of no extra parameters, they show improvement on top of regular attention networks. If you turn on this setting, be aware that the best results in the paper used post-normalization, in which case a learning rate warmup will be needed. The authors also reported that they could use a higher learning rate and get even better gains in the same amount of steps. (In the paper they use `2e-4` vs `1e-4` for the vanilla transformer)
846
+
847
+ ```python
848
+ import torch
849
+ from x_transformers import TransformerWrapper, Encoder
850
+
851
+ model = TransformerWrapper(
852
+ num_tokens = 20000,
853
+ max_seq_len = 1024,
854
+ attn_layers = Encoder(
855
+ dim = 512,
856
+ depth = 6,
857
+ heads = 8,
858
+ pre_norm = False, # in the paper, residual attention had best results with post-layernorm
859
+ residual_attn = True # add residual attention
860
+ )
861
+ )
862
+ ```
863
+
864
+ I also tried residualizing cross attention and may have noticed an improvement in convergence. You can try it by setting the `cross_residual_attn` keyword to `True`
865
+
866
+ ```python
867
+ import torch
868
+ from x_transformers import XTransformer
869
+
870
+ model = XTransformer(
871
+ dim = 512,
872
+ enc_num_tokens = 256,
873
+ enc_depth = 6,
874
+ enc_heads = 8,
875
+ enc_max_seq_len = 1024,
876
+ dec_num_tokens = 256,
877
+ dec_depth = 6,
878
+ dec_heads = 8,
879
+ dec_max_seq_len = 1024,
880
+ dec_cross_residual_attn = True # residualize cross attention
881
+ )
882
+ ```
883
+
884
+ ### Transformer-XL recurrence
885
+
886
+ You can also do Transformer-XL recurrence, by simply passing in a `max_mem_len` in the `TransformerWrapper` class, and then making sure your `Decoder` has `rel_pos_bias` (or `rotary_pos_emb`) set to `True`.
887
+
888
+ Then, you can retrieve the memories at each step with the `return_mems` keyword and pass it to the next iteration.
889
+
890
+ ```python
891
+ import torch
892
+ from x_transformers import TransformerWrapper, Decoder
893
+
894
+ model_xl = TransformerWrapper(
895
+ num_tokens = 20000,
896
+ max_seq_len = 512,
897
+ max_mem_len = 2048,
898
+ attn_layers = Decoder(
899
+ dim = 512,
900
+ depth = 6,
901
+ heads = 8,
902
+ rel_pos_bias = True
903
+ )
904
+ )
905
+
906
+ seg1 = torch.randint(0, 20000, (1, 512))
907
+ seg2 = torch.randint(0, 20000, (1, 512))
908
+ seg3 = torch.randint(0, 20000, (1, 512))
909
+
910
+ logits1, mems1 = model_xl(seg1, return_mems = True)
911
+ logits2, mems2 = model_xl(seg2, mems = mems1, return_mems = True)
912
+ logits3, mems3 = model_xl(seg3, mems = mems2, return_mems = True)
913
+ ```
914
+
915
+ Setting up the logic for training and sampling from Transformer-XL can be a bit overwhelming. This repository offers a simple wrapper that should make this easy, with the `XLAutoregressiveWrapper`.
916
+
917
+ ```python
918
+ from x_transformers import XLAutoregressiveWrapper
+
+ # pass in the above model_xl
919
+
920
+ xl_wrapper = XLAutoregressiveWrapper(model_xl)
921
+
922
+ seg = torch.randint(0, 20000, (1, 4096)).cuda() # sequence exceeding max length, automatically segmented and memory managed
923
+
924
+ loss = xl_wrapper(seg)
925
+ loss.backward()
926
+
927
+ # then, after much training
928
+
929
+ prime = seg[:, :1024] # if prime exceeds max length, memory will be caught up before generating
930
+
931
+ generated = xl_wrapper.generate(prime, 4096) # (1, 4096)
932
+ ```
933
+
934
+ ### Enhanced recurrence
935
+
936
+ <img src="./images/enhanced-recurrence.png" width="400px"/>
937
+
938
+ <a href="https://arxiv.org/abs/2012.15688">This paper</a> proposes a simple technique to enhance the range of Transformer-XL. They simply route the memory segment of a layer to the layer below it, for the next recurrent step. You can enable this by setting `shift_mem_down = 1`. You can also shift down arbitrary number of layers by setting this value to `> 1`.
939
+
940
+ ```python
941
+ import torch
942
+ from x_transformers import TransformerWrapper, Decoder
943
+
944
+ model_xl = TransformerWrapper(
945
+ num_tokens = 20000,
946
+ max_seq_len = 512,
947
+ max_mem_len = 2048,
948
+ shift_mem_down = 1,
949
+ attn_layers = Decoder(
950
+ dim = 512,
951
+ depth = 6,
952
+ heads = 8,
953
+ rotary_pos_emb = True
954
+ )
955
+ )
956
+
957
+ seg1 = torch.randint(0, 20000, (1, 512))
958
+ seg2 = torch.randint(0, 20000, (1, 512))
959
+ seg3 = torch.randint(0, 20000, (1, 512))
960
+
961
+ logits1, mems1 = model_xl(seg1, return_mems = True)
962
+ logits2, mems2 = model_xl(seg2, mems = mems1, return_mems = True) # mems1 of layer N are automatically routed to the layer N-1
963
+ ```
964
+
965
+ ### Gated residual
966
+
967
+ <img src="./images/gating.png" width="500px"></img>
968
+
969
+ https://arxiv.org/abs/1910.06764
970
+
971
+ The authors propose gating the residual connections in the transformer network and demonstrate increased stability and performance for Transformer-XL in a variety of reinforcement learning tasks.
972
+
973
+ ```python
974
+ import torch
975
+ from x_transformers import TransformerWrapper, Decoder
976
+
977
+ model = TransformerWrapper(
978
+ num_tokens = 20000,
979
+ max_seq_len = 1024,
980
+ max_mem_len = 2048,
981
+ attn_layers = Decoder(
982
+ dim = 512,
983
+ depth = 6,
984
+ heads = 16,
985
+ gate_residual = True
986
+ )
987
+ )
988
+ ```
989
+
990
+ ### Rotary Positional Embeddings
991
+
992
+ <img src="./images/rotary.png" width="500px"></img>
993
+
994
+ Developed in Beijing, this new technique quickly gained interest in NLP circles. In short, it allows you to endow the transformer with relative positional embeddings at the cost of no learned parameters. You apply a rotary operation to the queries and keys prior to their dot product in attention. The big idea is injecting positions through rotations.
995
+
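+ A minimal sketch of what that rotation looks like (the half-rotation formulation; the library's own implementation is more featureful):
+
+ ```python
+ import torch
+
+ def rotate_half(x):
+     x1, x2 = x.chunk(2, dim = -1)
+     return torch.cat((-x2, x1), dim = -1)
+
+ def apply_rotary(x, base = 10000):
+     # x: (..., seq len, dim head) - rotate feature pairs by position-dependent angles
+     n, d = x.shape[-2], x.shape[-1]
+     inv_freq = 1. / (base ** (torch.arange(0, d, 2).float() / d))
+     angles = torch.outer(torch.arange(n).float(), inv_freq)  # (n, d / 2)
+     angles = torch.cat((angles, angles), dim = -1)            # (n, d)
+     return x * angles.cos() + rotate_half(x) * angles.sin()
+
+ q = torch.randn(1, 8, 1024, 64)
+ k = torch.randn(1, 8, 1024, 64)
+
+ q, k = apply_rotary(q), apply_rotary(k)  # then attend as usual - relative positions fall out of the dot product
+ ```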
996
+ Highly recommend that you have this turned on whenever you are working on an ordered sequence.
997
+
998
+ ```python
999
+ import torch
1000
+ from x_transformers import TransformerWrapper, Decoder
1001
+
1002
+ model = TransformerWrapper(
1003
+ num_tokens = 20000,
1004
+ max_seq_len = 1024,
1005
+ attn_layers = Decoder(
1006
+ dim = 512,
1007
+ depth = 6,
1008
+ heads = 8,
1009
+ rotary_pos_emb = True # turns on rotary positional embeddings
1010
+ )
1011
+ )
1012
+ ```
1013
+
1014
+ Update (12/2022): Rotary embedding has since been hugely successful, widely adopted in many large language models, including the largest in the world, PaLM. However, it has been uncovered in the ALiBi paper that rotary embeddings cannot length extrapolate well. This was recently addressed in <a href="https://arxiv.org/abs/2212.10554v1">a Microsoft research paper</a>. They propose a way to unobtrusively add the same decay as in ALiBi, and found that this resolves the extrapolation problem. You can use it in this repository by setting `rotary_xpos = True`. Like ALiBi, it would enforce the attention to be local. You can set the receptive field with `rotary_xpos_scale_base` value, which defaults to `512`
1015
+
1016
+ ```python
1017
+ import torch
1018
+ from x_transformers import TransformerWrapper, Decoder
1019
+
1020
+ model = TransformerWrapper(
1021
+ num_tokens = 20000,
1022
+ max_seq_len = 1024,
1023
+ attn_layers = Decoder(
1024
+ dim = 512,
1025
+ depth = 6,
1026
+ heads = 8,
1027
+ rotary_xpos = True # modified rotary to extrapolate well beyond length at which it was trained
1028
+ )
1029
+ )
1030
+ ```
1031
+
1032
+ ### Dynamic Positional Bias
1033
+
1034
+ <img src="./images/dynamic-pos-bias.png" width="150px"></img>
1035
+
1036
+ This technique has its roots in the field of vision transformers, where researchers are trying to have relative positions generalize to larger resolutions (without having to retrain the entire network). It was used in two recent papers, <a href="https://arxiv.org/abs/2108.00154">CrossFormer</a>, as well as <a href="https://arxiv.org/abs/2111.09883">SwinV2</a>.
1037
+
1038
+ <a href="https://github.com/cfoster0">Charles Foster</a> first tried this for a language model, and found that it works. Later on <a href="https://github.com/bob80333">Eric Engelhart</a> produced experimental results that show the same type of extrapolation holds, even for 1d sequences.
1039
+
1040
+ Eric trained at sequence lengths of 128, and showed that it generalized well to 1024. In addition, he showed that linear distances were better than log distances (used in SwinV2) for language.
1041
+
1042
+ Linear distances
1043
+
1044
+ <img src="./images/dynamic-pos-bias-linear.png" width="600px"></img>
1045
+
1046
+ Log distances
1047
+
1048
+ <img src="./images/dynamic-pos-bias-log.png" width="600px"></img>
1049
+
1050
+ Negative control - Sinusoidal
1051
+
1052
+ <img src="./images/dynamic-pos-bias-sinusoidal.png" width="600px"></img>
1053
+
1054
+ More of Eric's experimental results can be found <a href="https://github.com/bob80333/investigating_extrapolation">here</a>
1055
+
1056
+ You can use this type of relative position if you wish to train at smaller sequence lengths and have it generalize to longer ones, for both autoregressive and bidirectional models.
1057
+
1058
+ Update: <a href="https://www.kaggle.com/competitions/stanford-ribonanza-rna-folding/discussion/460121">First place RNA folding using dynamic positional bias</a>
1059
+
1060
+ ```python
1061
+ import torch
1062
+ from x_transformers import TransformerWrapper, Decoder
1063
+
1064
+ model = TransformerWrapper(
1065
+ num_tokens = 256,
1066
+ max_seq_len = 1024,
1067
+ attn_layers = Decoder(
1068
+ dim = 512,
1069
+ depth = 6,
1070
+ heads = 8,
1071
+ dynamic_pos_bias = True, # set this to True
1072
+ dynamic_pos_bias_log_distance = False # whether to use log distance, as in SwinV2
1073
+ )
1074
+ )
1075
+ ```
1076
+
1077
+ ### ALiBi Positional Embedding
1078
+
1079
+ <a href="https://ofir.io/train_short_test_long.pdf">This paper</a> proposes to simply apply a static linear bias to the attention matrix. The authors show this is not only effective as a relative positional encoding, but also allows the attention net to extrapolate to greater sequences length than what it was trained on, for autoregressive language models.
1080
+
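+ The bias itself is just a per-head slope times the (negative) query-key distance, added to the attention logits. A rough sketch for the causal case, with the slope schedule simplified to the power-of-two-heads rule from the paper:
+
+ ```python
+ import torch
+
+ def alibi_bias(heads, seq_len):
+     # geometric per-head slopes: 2^(-8/heads), 2^(-16/heads), ...
+     slopes = torch.tensor([2 ** (-8 * (h + 1) / heads) for h in range(heads)])
+
+     pos = torch.arange(seq_len)
+     rel = (pos[None, :] - pos[:, None]).clamp(max = 0)  # 0 on / after the diagonal, negative for past keys
+     return slopes[:, None, None] * rel[None, :, :]      # (heads, seq, seq), added to the logits pre-softmax
+
+ bias = alibi_bias(heads = 8, seq_len = 1024)
+ # attention logits = (q @ k^T) * scale + bias, then causal mask and softmax
+ ```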
1081
+ This repository also offers a bidirectional variant (nonsymmetric), proposed by the authors <a href="https://github.com/ofirpress/attention_with_linear_biases/issues/5">here</a>. However, this is untested. If you need bidirectional length extrapolation, the safest option would be Dynamic Position Bias
1082
+
1083
+ Update: It may be that ALiBi enforces a strong local attention across the heads, and may hinder it from attending at distances greater than 1k. To avoid any issues with global message passing, I've decided to introduce another hyperparameter `alibi_num_heads`, so one can specify fewer heads for the ALiBi bias
1084
+
1085
+ Update: There are reports that ALiBi outperforms rotary embeddings for pretraining and downstream fine-tuning.
1086
+
1087
+ Update: <a href="https://arxiv.org/abs/2305.19466">New paper</a> shows that no positional embedding can length extrapolate even than explicit ones
1088
+
1089
+ ```python
1090
+ import torch
1091
+ from x_transformers import TransformerWrapper, Decoder
1092
+
1093
+ model = TransformerWrapper(
1094
+ num_tokens = 20000,
1095
+ max_seq_len = 1024,
1096
+ attn_layers = Decoder(
1097
+ dim = 512,
1098
+ depth = 6,
1099
+ heads = 8,
1100
+ alibi_pos_bias = True, # turns on ALiBi positional embedding
1101
+ alibi_num_heads = 4 # only use ALiBi for 4 out of the 8 heads, so other 4 heads can still attend far distances
1102
+ )
1103
+ )
1104
+ ```
1105
+
1106
+ ### Shifted Tokens
1107
+
1108
+ An <a href="https://github.com/BlinkDL">independent researcher</a> has found that shifting a subset of the feature dimension along the sequence dimension by 1 token helps with convergence (<a href="https://zhuanlan.zhihu.com/p/191393788">Time-mixing</a>). I have tested this for the autoregressive case and can confirm that it leads to greatly improved convergence. This also lines up with <a href="https://arxiv.org/abs/2106.07477">the results</a> of some papers in the vision domain.
1109
+
1110
+ To use it, simply set `shift_tokens = 1` (or to whatever number of shifts you desire). The feature dimension will be divided by `shift_tokens + 1` and then each chunk will be shifted `[0, shift_tokens]` respectively
1111
+
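+ A rough standalone sketch of the shift operation itself (the library applies it inside each block, before the attention or feedforward):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def shift_tokens(x, shifts = 1):
+     # split the feature dimension into (shifts + 1) chunks, shift chunk i back by i positions
+     chunks = x.chunk(shifts + 1, dim = -1)
+     shifted = [chunks[0]]
+
+     for i, chunk in enumerate(chunks[1:], start = 1):
+         shifted.append(F.pad(chunk, (0, 0, i, -i)))  # pad i positions at the start of the sequence, trim i at the end
+
+     return torch.cat(shifted, dim = -1)
+
+ x = torch.randn(1, 1024, 512)
+ out = shift_tokens(x, shifts = 1)  # (1, 1024, 512) - half the features now come from the previous token
+ ```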
1112
+ Update: new experiments by @sdtblck suggest this may only work for character-level training
1113
+
1114
+ Update: after more experiments, it seems that in the context of BPE encoding, with rotary turned on, there is no benefit to shifting. For character-level training, shifting may still improve results a tiny bit.
1115
+
1116
+ Update: When doing BPE encoded tokens, it seems that a shift of 2 will bottleneck the dimensions (divided by 5). It is recommended you always do a shift of 1, unless you are working at the character level.
1117
+
1118
+ ```python
1119
+ import torch
1120
+ from x_transformers import TransformerWrapper, Decoder
1121
+
1122
+ model = TransformerWrapper(
1123
+ num_tokens = 20000,
1124
+ max_seq_len = 1024,
1125
+ attn_layers = Decoder(
1126
+ dim = 512,
1127
+ depth = 6,
1128
+ heads = 8,
1129
+ shift_tokens = 1
1130
+ )
1131
+ )
1132
+ ```
1133
+
1134
+ If you want finer control over how much is shifted per block (whether attention or feedforward), simply pass in a tuple of size that is equal to the number of layers.
1135
+
1136
+ ```python
1137
+ import torch
1138
+ from x_transformers import TransformerWrapper, Decoder
1139
+
1140
+ model = TransformerWrapper(
1141
+ num_tokens = 20000,
1142
+ max_seq_len = 1024,
1143
+ attn_layers = Decoder(
1144
+ dim = 512,
1145
+ depth = 6,
1146
+ heads = 8,
1147
+ shift_tokens = (1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0) # 12 blocks, attention and feedforward alternating, with progressively less shifting
1148
+ )
1149
+ )
1150
+ ```
1151
+
1152
+ ### Sandwich Norm
1153
+
1154
+ <img src="./images/sandwich_norm.png" width="400px"/>
1155
+
1156
+ This technique first made an appearance in <a href="https://arxiv.org/abs/2105.13290">the CogView paper</a>, a Chinese version of the famous text-to-image transformer DALL-E. They propose, when using pre-layernorm, to add an extra layernorm to all the branch outputs. I have found this to be very effective for a number of projects when facing instability during training.
1157
+
1158
+ ```python
1159
+ import torch
1160
+ from x_transformers import TransformerWrapper, Decoder
1161
+
1162
+ model = TransformerWrapper(
1163
+ num_tokens = 20000,
1164
+ max_seq_len = 1024,
1165
+ attn_layers = Decoder(
1166
+ dim = 512,
1167
+ depth = 6,
1168
+ heads = 8,
1169
+ sandwich_norm = True # set this to True
1170
+ )
1171
+ )
1172
+
1173
+ x = torch.randint(0, 20000, (1, 1024))
1174
+ model(x)
1175
+ ```
1176
+
1177
+ ### ResiDual
1178
+
1179
+ <img src="./images/resi_dual.png" width="400px"/>
1180
+
1181
+ <a href="https://arxiv.org/abs/2304.14802">This Microsoft paper</a> proposes yet another normalization configuration, combining both pre and post layernorm. They claim this hybridization reduces representation collapse (known to be an issue with pre-layernorm with increasing depth), while maintaining stability and reducing vanishing gradients (issues with post-layernorm). Initial experiments on my end show it to work no worse than pre-layernorm or sandwich norm. More study needed by the public to see if this is actually a winning technique.
1182
+
1183
+ ```python
1184
+ import torch
1185
+ from x_transformers import TransformerWrapper, Decoder
1186
+
1187
+ model = TransformerWrapper(
1188
+ num_tokens = 20000,
1189
+ max_seq_len = 1024,
1190
+ attn_layers = Decoder(
1191
+ dim = 512,
1192
+ depth = 6,
1193
+ heads = 8,
1194
+ resi_dual = True, # set this to True
1195
+ resi_dual_scale = 0.1 # in appendix, they said on fp16 the prenorm residual is prone to overflow. they claim by scaling it at each layer by a factor, it would prevent the overflow, and keep results the same (as layernorms are invariant to scaling of the input)
1196
+ )
1197
+ )
1198
+
1199
+ x = torch.randint(0, 20000, (1, 1024))
1200
+ model(x)
1201
+ ```
1202
+
1203
+ ### Normformer
1204
+
1205
+ <img src="./images/normformer.png" width="400px"/>
1206
+
1207
+ This <a href="https://openreview.net/forum?id=GMYWzWztDx5">paper</a> uncovers an issue with pre-norm transformers where gradients are mismatched between the early and later layers. They propose 4 changes, of which I will be offering 3.
1208
+
1209
+ The first change is to offer per head scaling after aggregating the values in attention. My experiments show a slight improvement in convergence.
1210
+
1211
+ ```python
1212
+ import torch
1213
+ from x_transformers import TransformerWrapper, Decoder
1214
+
1215
+ model = TransformerWrapper(
1216
+ num_tokens = 20000,
1217
+ max_seq_len = 1024,
1218
+ attn_layers = Decoder(
1219
+ dim = 512,
1220
+ depth = 6,
1221
+ heads = 8,
1222
+ attn_head_scale = True # set this to True
1223
+ )
1224
+ )
1225
+
1226
+ x = torch.randint(0, 20000, (1, 1024))
1227
+ model(x)
1228
+ ```
1229
+
1230
+ The second change is an extra layernorm right after the activation in the feedforward. I have also verified a slight improvement, at the cost of extra compute.
1231
+
1232
+ ```python
1233
+ import torch
1234
+ from x_transformers import TransformerWrapper, Decoder
1235
+
1236
+ model = TransformerWrapper(
1237
+ num_tokens = 20000,
1238
+ max_seq_len = 1024,
1239
+ attn_layers = Decoder(
1240
+ dim = 512,
1241
+ depth = 6,
1242
+ heads = 8,
1243
+ ff_post_act_ln = True # set this to True
1244
+ )
1245
+ )
1246
+
1247
+ x = torch.randint(0, 20000, (1, 1024))
1248
+ model(x)
1249
+ ```
1250
+
1251
+ For the residual scaling, you simply have to set `scale_residual = True`. I have noticed slight improvements, but occasional instability as well, so use with caution.
1252
+
1253
+ ```python
1254
+ import torch
1255
+ from x_transformers import TransformerWrapper, Decoder
1256
+
1257
+ model = TransformerWrapper(
1258
+ num_tokens = 20000,
1259
+ max_seq_len = 1024,
1260
+ attn_layers = Decoder(
1261
+ dim = 512,
1262
+ depth = 6,
1263
+ heads = 8,
1264
+ scale_residual = True # set this to True
1265
+ )
1266
+ )
1267
+
1268
+ x = torch.randint(0, 20000, (1, 1024))
1269
+ model(x)
1270
+ ```
1271
+
1272
+ The last change is a layernorm right after the outward projection in attention. This is actually identical to the sandwich norm proposed by the CogView paper, so you can use this by simply setting `sandwich_norm = True`, although it would also add it to the feedforward layer.
1273
+
1274
+ ### Cosine Sim Attention
1275
+
1276
+ <img src="./images/cosine-sim-attention.png" width="400px"></img>
1277
+
1278
+ This <a href="https://arxiv.org/abs/2010.04245">paper</a> proposes to l2 normalize the queries and keys along the head dimension before the dot product (cosine similarity), with the additional change of the scale being learned rather than static. The normalization prevents the attention operation from overflowing, and removes any need for numerical stability measures prior to softmax. Both are perennial problems when training transformers.
1279
+
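+ The core of the change is tiny: l2-normalize queries and keys along the head dimension, then scale the resulting cosine similarities by a (learned) temperature instead of `1 / sqrt(dim_head)`. A rough sketch with a fixed scale, which is the simplification mentioned further below:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ q = torch.randn(1, 8, 1024, 64)
+ k = torch.randn(1, 8, 1024, 64)
+ v = torch.randn(1, 8, 1024, 64)
+
+ # cosine similarities are bounded in [-1, 1], so the logits cannot overflow
+ q, k = F.normalize(q, dim = -1), F.normalize(k, dim = -1)
+
+ scale = 10.  # learned in the paper; here a fixed constant for simplicity
+ attn = ((q @ k.transpose(-2, -1)) * scale).softmax(dim = -1)
+ out = attn @ v  # (1, 8, 1024, 64)
+ ```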
1280
+ This was validated at scale recently by the training of <a href="https://arxiv.org/abs/2111.09883">a 3B parameter vision transformer</a>. The SwinV2 paper also proposes to change the pre-layernorm to a post-layernorm for further stability.
1281
+
1282
+ I have validated that this works just as well as dot product attention in an autoregressive setting, if one were to initialize the temperature as proposed in the QK-norm paper (as a function of the sequence length).
1283
+
1284
+ This flavor of attention also has <a href="https://arxiv.org/abs/2111.05498">a connection</a> to sparse distributed memory. <a href="https://www.youtube.com/watch?v=THIIk7LR9_8">[youtube talk]</a>
1285
+
1286
+ Update: I have discovered a way to remove the learned temperature altogether, by grouping the feature dimension and doing l2-normalization on each group. This allows the queries and keys to have a similarity that is upper bounded by the number of groups. A group size of 8 or 16 was sufficient in my tests. Decided to name this technique "Grouped QK Normalization". The drawback is that I believe an attention head dimension of 32 is too small to use this tactic (a dimension often used in vision)
1287
+
1288
+ Update 2: Tero Karras has successfully used cosine sim attention in <a href="https://arxiv.org/abs/2312.02696">a new paper</a>.
1289
+
1290
+ You can use it as follows
1291
+
1292
+ ```python
1293
+ import torch
1294
+ from x_transformers import TransformerWrapper, Decoder
1295
+
1296
+ model = TransformerWrapper(
1297
+ num_tokens = 20000,
1298
+ max_seq_len = 1024,
1299
+ attn_layers = Decoder(
1300
+ dim = 512,
1301
+ depth = 6,
1302
+ heads = 8,
1303
+ attn_qk_norm = True, # set this to True
1304
+ attn_qk_norm_groups = 8 # number of groups in the feature dimension for l2norm, similarity scores will be bounded between [-group, group]. determines how sharp the attention can be
1305
+ )
1306
+ )
1307
+
1308
+ x = torch.randint(0, 20000, (1, 1024))
1309
+ model(x)
1310
+ ```
1311
+
1312
+ Another update: Simply scaling the cosine similarity (group of 1) with a fixed constant (10) may work too
1313
+
1314
+ ```python
1315
+ import torch
1316
+ from x_transformers import TransformerWrapper, Decoder
1317
+
1318
+ model = TransformerWrapper(
1319
+ num_tokens = 20000,
1320
+ max_seq_len = 1024,
1321
+ attn_layers = Decoder(
1322
+ dim = 512,
1323
+ depth = 6,
1324
+ heads = 8,
1325
+ attn_qk_norm = True, # set to True
1326
+ attn_qk_norm_scale = 10 # new scale on the similarity, with groups of 1
1327
+ )
1328
+ )
1329
+
1330
+ x = torch.randint(0, 20000, (1, 1024))
1331
+ model(x)
1332
+ ```
1333
+
1334
+ ### QK RMSNorm
1335
+
1336
+ <img src="./images/qknorm-analysis.png" width="450px"></img>
1337
+
1338
+ Update: Google Brain has proven out something similar to cosine sim attention in <a href="https://arxiv.org/abs/2302.05442">a 22B parameter model</a>. In their paper, they show analysis indicating that the normalization resulted in not only extra stability, but also better results in the end (due to less need to adjust the learning rate when increasing parameter count).
1339
+
1340
+ We are nearing the point of wiping out a source of transformer training instability with one simple intervention, in my opinion. The only slight difference in the paper is that they still have a learned scale across the feature dimension (per use of rmsnorm). Not sure how critical this is, but just to make sure we don't miss anything, I will include this here. You can use this by setting `qk_norm_dim_scale = True`
1341
+
1342
+ Update: <a href="https://twitter.com/Tim_Dettmers/status/1625531080513306627">Counterpoint from Tim Dettmers</a>
1343
+
1344
+ Update 2: <a href="https://arxiv.org/abs/2305.19268">Counter</a> to Tim's assertion that outliers are needed, and potentially even <a href="https://arxiv.org/abs/2306.12929">some solutions</a>
1345
+
1346
+ Update 3: Used by <a href="https://www.adept.ai/blog/persimmon-8b">8B parameter LLM</a> successfully
1347
+
1348
+ Update 4: a MetaAI group found that they can <a href="https://arxiv.org/abs/2309.16588">alleviate outliers</a> by adding `register tokens`, also known as `memory tokens` from earlier literature (Burtsev et al). Perhaps what should be tried next is to see whether qk norm can be improved in the presence of memory tokens.
1349
+
1350
+ ```python
1351
+ import torch
1352
+ from x_transformers import TransformerWrapper, Decoder
1353
+
1354
+ model = TransformerWrapper(
1355
+ num_tokens = 20000,
1356
+ max_seq_len = 1024,
1357
+ attn_layers = Decoder(
1358
+ dim = 512,
1359
+ depth = 12,
1360
+ heads = 8,
1361
+ attn_qk_norm = True,
1362
+ attn_qk_norm_dim_scale = True # set this to True, in addition to `attn_qk_norm = True`
1363
+ )
1364
+ )
1365
+
1366
+ x = torch.randint(0, 256, (1, 1024))
1367
+ model(x)
1368
+ ```
1369
+
1370
+ ### Turning off absolute positional embedding
1371
+
1372
+ A number of papers have hinted that causal transformers (`Decoder`) can learn absolute positions in the absence of added embeddings of any sort. This was recently thoroughly investigated <a href="https://arxiv.org/abs/2203.16634">here</a>. You can turn off the absolute positional embedding by setting `use_abs_pos_emb = False` in the `TransformerWrapper`
1373
+
1374
+ Given <a href="https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html">PaLM</a>, the trend going forward may be to forgo absolute positional embedding (again, for causal transformers only), and add relative positional embeddings with RoPE, ALiBi, etc.
1375
+
1376
+ Update: <a href="https://arxiv.org/abs/2305.19466">This paper</a> shows that in the absence of any engineered absolute or relative positional embeddings, decoders can generate implicit positions, and even length generalize better than solutions of the past. They were unaware of dynamic positional bias, however.
1377
+
1378
+ ```python
1379
+ import torch
1380
+ from x_transformers import TransformerWrapper, Decoder
1381
+
1382
+ model = TransformerWrapper(
1383
+ num_tokens = 20000,
1384
+ max_seq_len = 1024,
1385
+ use_abs_pos_emb = False, # set this to False
1386
+ attn_layers = Decoder(
1387
+ dim = 512,
1388
+ depth = 6,
1389
+ heads = 8,
1390
+ )
1391
+ )
1392
+
1393
+ x = torch.randint(0, 20000, (1, 1024))
1394
+ model(x)
1395
+ ```
1396
+
1397
+ ### Forgetful Causal Mask
1398
+
1399
+ <img src="./images/fcm.png" width="450px"></img>
1400
+
1401
+ <a href="https://arxiv.org/abs/2210.13432">This paper</a> shows convincing results that one can combine masking (from masked language modeling) with autoregressive training, leading to significantly better results.
1402
+
1403
+ You can use this by setting the `mask_prob` on the `AutoregressiveWrapper` class
1404
+
1405
+
1406
+ ```python
1407
+ import torch
1408
+ from x_transformers import TransformerWrapper, Decoder, AutoregressiveWrapper
1409
+
1410
+ model = TransformerWrapper(
1411
+ num_tokens = 20000,
1412
+ max_seq_len = 1024,
1413
+ attn_layers = Decoder(
1414
+ dim = 512,
1415
+ depth = 12,
1416
+ heads = 8
1417
+ )
1418
+ )
1419
+
1420
+ model = AutoregressiveWrapper(
1421
+ model,
1422
+ mask_prob = 0.15 # in paper, they use 15%, same as BERT
1423
+ ).cuda()
1424
+
1425
+ # mock data
1426
+
1427
+ x = torch.randint(0, 20000, (1, 1024)).cuda()
1428
+
1429
+ # derive cross entropy loss, masking all taken care of
1430
+
1431
+ loss = model(x)
1432
+ loss.backward()
1433
+ ```
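+
+ The random masking is only applied during training; sampling works as with any `AutoregressiveWrapper`. A short usage sketch, continuing from the example above (the `(start_tokens, seq_len)` form of `generate` is the wrapper's standard signature):
+
+ ```python
+ # continuing from the FCM example above
+
+ start_tokens = torch.randint(0, 20000, (1, 1)).cuda()
+
+ generated = model.generate(start_tokens, 256) # (1, 256)
+ ```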
1434
+
1435
+
1436
+ ## Miscellaneous
1437
+
1438
+ ### Cross Attention
1439
+
1440
+ ```python
1441
+ import torch
1442
+ from x_transformers import Encoder, CrossAttender
1443
+
1444
+ enc = Encoder(dim = 512, depth = 6)
1445
+ model = CrossAttender(dim = 512, depth = 6)
1446
+
1447
+ nodes = torch.randn(1, 1, 512)
1448
+ node_masks = torch.ones(1, 1).bool()
1449
+
1450
+ neighbors = torch.randn(1, 5, 512)
1451
+ neighbor_masks = torch.ones(1, 5).bool()
1452
+
1453
+ encoded_neighbors = enc(neighbors, mask = neighbor_masks)
1454
+ model(nodes, context = encoded_neighbors, mask = node_masks, context_mask = neighbor_masks) # (1, 1, 512)
1455
+
1456
+ ```
1457
+
1458
+ ### Continuous Embeddings
1459
+
1460
+ ```python
1461
+ import torch
1462
+ from x_transformers import ContinuousTransformerWrapper, Decoder
1463
+
1464
+ model = ContinuousTransformerWrapper(
1465
+ dim_in = 32,
1466
+ dim_out = 100,
1467
+ max_seq_len = 1024,
1468
+ attn_layers = Decoder(
1469
+ dim = 512,
1470
+ depth = 12,
1471
+ heads = 8
1472
+ )
1473
+ )
1474
+
1475
+ x = torch.randn((1, 1024, 32))
1476
+ mask = torch.ones(1, 1024).bool()
1477
+
1478
+ model(x, mask = mask) # (1, 1024, 100)
1479
+ ```
1480
+
1481
+ You can also easily train a transformer that accepts continuous values autoregressively, following the same scheme used successfully in <a href="https://arxiv.org/abs/2112.05329">this paper</a>
1482
+
1483
+ ```python
1484
+ import torch
1485
+ from x_transformers import ContinuousTransformerWrapper, Decoder
1486
+ from x_transformers import ContinuousAutoregressiveWrapper
1487
+
1488
+ model = ContinuousTransformerWrapper(
1489
+ dim_in = 777,
1490
+ dim_out = 777,
1491
+ max_seq_len = 1024,
1492
+ attn_layers = Decoder(
1493
+ dim = 512,
1494
+ depth = 12,
1495
+ heads = 8
1496
+ )
1497
+ )
1498
+
1499
+ # wrap it with the continuous autoregressive wrapper
1500
+
1501
+ model = ContinuousAutoregressiveWrapper(model)
1502
+
1503
+ # mock data
1504
+
1505
+ x = torch.randn((1, 1024, 777))
1506
+ mask = torch.ones(1, 1024).bool()
1507
+
1508
+ # train on a lot of data above
1509
+
1510
+ loss = model(x, mask = mask)
1511
+ loss.backward()
1512
+
1513
+ # then generate
1514
+
1515
+ start_emb = torch.randn(1, 777)
1516
+ generated = model.generate(start_emb, 17) # (17, 777)
1517
+ ```
1518
+
1519
+ ### xVal - Continuous and Discrete
1520
+
1521
+ <img src="./images/xval.png" width="400px"></img>
1522
+
1523
+ This is promising work that resulted from a collaboration across many institutes (collectively known as Polymathic AI). They found that by offering the transformer a continuously scaled number token, it was able to generalize to arithmetic and forecasting tasks better than alternative encoding schemes.
1524
+
1525
+ This is corroborated by some [prior work](https://github.com/lucidrains/tab-transformer-pytorch#ft-transformer)
1526
+
1527
+ ```python
1528
+ import torch
1529
+
1530
+ from x_transformers import (
1531
+ Decoder,
1532
+ XValTransformerWrapper,
1533
+ XValAutoregressiveWrapper
1534
+ )
1535
+
1536
+ model = XValTransformerWrapper(
1537
+ num_tokens = 4,
1538
+ numerical_token_id = 3,
1539
+ max_seq_len = 1024,
1540
+ attn_layers = Decoder(
1541
+ dim = 512,
1542
+ depth = 12,
1543
+ heads = 8
1544
+ )
1545
+ )
1546
+
1547
+ # wrap it with the xval autoregressive wrapper
1548
+
1549
+ model = XValAutoregressiveWrapper(model)
1550
+
1551
+ # mock data
1552
+
1553
+ ids = torch.randint(0, 4, (1, 777))
1554
+ nums = torch.randn(1, 777)
1555
+
1556
+ # train on a lot of data above
1557
+
1558
+ loss = model(ids, nums)
1559
+ loss.backward()
1560
+
1561
+ # then generate
1562
+
1563
+ start_ids = torch.randint(0, 4, (1, 1))
1564
+ start_nums = torch.randn(1, 1)
1565
+
1566
+ ids_out, num_out, is_number_mask = model.generate(start_ids, start_nums, 17)
1567
+
1568
+ # (1, 17), (1, 17), (1, 17)
1569
+
1570
+ # discrete, continuous, mask for discrete / continuous
1571
+ ```
1572
+
1573
+ ## Citations
1574
+
1575
+ ```bibtex
1576
+ @misc{vaswani2017attention,
1577
+ title = {Attention Is All You Need},
1578
+ author = {Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin},
1579
+ year = {2017},
1580
+ eprint = {1706.03762},
1581
+ archivePrefix = {arXiv},
1582
+ primaryClass = {cs.CL}
1583
+ }
1584
+ ```
1585
+
1586
+ ```bibtex
1587
+ @article{DBLP:journals/corr/abs-1907-01470,
1588
+ author = {Sainbayar Sukhbaatar and Edouard Grave and Guillaume Lample and Herv{\'{e}} J{\'{e}}gou and Armand Joulin},
1589
+ title = {Augmenting Self-attention with Persistent Memory},
1590
+ journal = {CoRR},
1591
+ volume = {abs/1907.01470},
1592
+ year = {2019},
1593
+ url = {http://arxiv.org/abs/1907.01470}
1594
+ }
1595
+ ```
1596
+
1597
+ ```bibtex
1598
+ @article{1910.05895,
1599
+ author = {Toan Q. Nguyen and Julian Salazar},
1600
+ title = {Transformers without Tears: Improving the Normalization of Self-Attention},
1601
+ year = {2019},
1602
+ eprint = {arXiv:1910.05895},
1603
+ doi = {10.5281/zenodo.3525484},
1604
+ }
1605
+ ```
1606
+
1607
+ ```bibtex
1608
+ @misc{shazeer2020glu,
1609
+ title = {GLU Variants Improve Transformer},
1610
+ author = {Noam Shazeer},
1611
+ year = {2020},
1612
+ url = {https://arxiv.org/abs/2002.05202}
1613
+ }
1614
+ ```
1615
+
1616
+ ```bibtex
1617
+ @inproceedings{Zoph2022STMoEDS,
1618
+ title = {ST-MoE: Designing Stable and Transferable Sparse Expert Models},
1619
+ author = {Barret Zoph and Irwan Bello and Sameer Kumar and Nan Du and Yanping Huang and Jeff Dean and Noam M. Shazeer and William Fedus},
1620
+ year = {2022}
1621
+ }
1622
+ ```
1623
+
1624
+ ```bibtex
1625
+ @misc{bhojanapalli2020lowrank,
1626
+ title = {Low-Rank Bottleneck in Multi-head Attention Models},
1627
+ author = {Srinadh Bhojanapalli and Chulhee Yun and Ankit Singh Rawat and Sashank J. Reddi and Sanjiv Kumar},
1628
+ year = {2020},
1629
+ eprint = {2002.07028}
1630
+ }
1631
+ ```
1632
+
1633
+ ```bibtex
1634
+ @misc{burtsev2020memory,
1635
+ title = {Memory Transformer},
1636
+ author = {Mikhail S. Burtsev and Grigory V. Sapunov},
1637
+ year = {2020},
1638
+ eprint = {2006.11527},
1639
+ archivePrefix = {arXiv},
1640
+ primaryClass = {cs.CL}
1641
+ }
1642
+ ```
1643
+
1644
+ ```bibtex
1645
+ @misc{zhao2019explicit,
1646
+ title = {Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection},
1647
+ author = {Guangxiang Zhao and Junyang Lin and Zhiyuan Zhang and Xuancheng Ren and Qi Su and Xu Sun},
1648
+ year = {2019},
1649
+ eprint = {1912.11637},
1650
+ archivePrefix = {arXiv},
1651
+ primaryClass = {cs.CL}
1652
+ }
1653
+ ```
1654
+
1655
+ ```bibtex
1656
+ @misc{correia2019adaptively,
1657
+ title = {Adaptively Sparse Transformers},
1658
+ author = {Gonçalo M. Correia and Vlad Niculae and André F. T. Martins},
1659
+ year = {2019},
1660
+ eprint = {1909.00015},
1661
+ archivePrefix = {arXiv},
1662
+ primaryClass = {cs.CL}
1663
+ }
1664
+ ```
1665
+
1666
+ ```bibtex
1667
+ @misc{shazeer2020talkingheads,
1668
+ title = {Talking-Heads Attention},
1669
+ author = {Noam Shazeer and Zhenzhong Lan and Youlong Cheng and Nan Ding and Le Hou},
1670
+ year = {2020},
1671
+ eprint = {2003.02436},
1672
+ archivePrefix = {arXiv},
1673
+ primaryClass = {cs.LG}
1674
+ }
1675
+ ```
1676
+
1677
+ ```bibtex
1678
+ @misc{press2020improving,
1679
+ title = {Improving Transformer Models by Reordering their Sublayers},
1680
+ author = {Ofir Press and Noah A. Smith and Omer Levy},
1681
+ year = {2020},
1682
+ eprint = {1911.03864},
1683
+ archivePrefix = {arXiv},
1684
+ primaryClass = {cs.CL}
1685
+ }
1686
+ ```
1687
+
1688
+ ```bibtex
1689
+ @misc{lu2019understanding,
1690
+ title = {Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View},
1691
+ author = {Yiping Lu and Zhuohan Li and Di He and Zhiqing Sun and Bin Dong and Tao Qin and Liwei Wang and Tie-Yan Liu},
1692
+ year = {2019},
1693
+ eprint = {1906.02762},
1694
+ archivePrefix = {arXiv},
1695
+ primaryClass = {cs.LG}
1696
+ }
1697
+ ```
1698
+
1699
+ ```bibtex
1700
+ @misc{ke2020rethinking,
1701
+ title = {Rethinking Positional Encoding in Language Pre-training},
1702
+ author = {Guolin Ke and Di He and Tie-Yan Liu},
1703
+ year = {2020},
1704
+ eprint = {2006.15595},
1705
+ archivePrefix = {arXiv},
1706
+ primaryClass = {cs.CL}
1707
+ }
1708
+ ```
1709
+
1710
+ ```bibtex
1711
+ @misc{dosovitskiy2020image,
1712
+ title = {An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
1713
+ author = {Alexey Dosovitskiy and Lucas Beyer and Alexander Kolesnikov and Dirk Weissenborn and Xiaohua Zhai and Thomas Unterthiner and Mostafa Dehghani and Matthias Minderer and Georg Heigold and Sylvain Gelly and Jakob Uszkoreit and Neil Houlsby},
1714
+ year = {2020},
1715
+ eprint = {2010.11929},
1716
+ archivePrefix = {arXiv},
1717
+ primaryClass = {cs.CV}
1718
+ }
1719
+ ```
1720
+
1721
+ ```bibtex
1722
+ @misc{huang2019attention,
1723
+ title = {Attention on Attention for Image Captioning},
1724
+ author = {Lun Huang and Wenmin Wang and Jie Chen and Xiao-Yong Wei},
1725
+ year = {2019},
1726
+ eprint = {1908.06954},
1727
+ archivePrefix = {arXiv},
1728
+ primaryClass = {cs.CV}
1729
+ }
1730
+ ```
1731
+
1732
+ ```bibtex
1733
+ @misc{raffel2020exploring,
1734
+ title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
1735
+ author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
1736
+ year = {2020},
1737
+ eprint = {1910.10683},
1738
+ archivePrefix = {arXiv},
1739
+ primaryClass = {cs.LG}
1740
+ }
1741
+ ```
1742
+
1743
+ ```bibtex
1744
+ @inproceedings{martins-etal-2020-sparse,
1745
+ title = "Sparse Text Generation",
1746
+ author = "Martins, Pedro Henrique and
1747
+ Marinho, Zita and
1748
+ Martins, Andr{\'e} F. T.",
1749
+ booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
1750
+ month = nov,
1751
+ year = "2020",
1752
+ address = "Online",
1753
+ publisher = "Association for Computational Linguistics",
1754
+ url = "https://www.aclweb.org/anthology/2020.emnlp-main.348"
1755
+ }
1756
+ ```
1757
+
1758
+ ```bibtex
1759
+ @misc{he2020realformer,
1760
+ title = {RealFormer: Transformer Likes Residual Attention},
1761
+ author = {Ruining He and Anirudh Ravula and Bhargav Kanagal and Joshua Ainslie},
1762
+ year = {2020},
1763
+ eprint = {2012.11747},
1764
+ archivePrefix = {arXiv},
1765
+ primaryClass = {cs.LG}
1766
+ }
1767
+ ```
1768
+
1769
+ ```bibtex
1770
+ @misc{carion2020endtoend,
1771
+ title = {End-to-End Object Detection with Transformers},
1772
+ author = {Nicolas Carion and Francisco Massa and Gabriel Synnaeve and Nicolas Usunier and Alexander Kirillov and Sergey Zagoruyko},
1773
+ year = {2020},
1774
+ eprint = {2005.12872},
1775
+ archivePrefix = {arXiv},
1776
+ primaryClass = {cs.CV}
1777
+ }
1778
+ ```
1779
+
1780
+ ```bibtex
1781
+ @misc{press2021ALiBi,
1782
+ title = {Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation},
1783
+ author = {Ofir Press and Noah A. Smith and Mike Lewis},
1784
+ year = {2021},
1785
+ url = {https://ofir.io/train_short_test_long.pdf}
1786
+ }
1787
+ ```
1788
+
1789
+ ```bibtex
1790
+ @misc{parisotto2019stabilizing,
1791
+ title = {Stabilizing Transformers for Reinforcement Learning},
1792
+ author = {Emilio Parisotto and H. Francis Song and Jack W. Rae and Razvan Pascanu and Caglar Gulcehre and Siddhant M. Jayakumar and Max Jaderberg and Raphael Lopez Kaufman and Aidan Clark and Seb Noury and Matthew M. Botvinick and Nicolas Heess and Raia Hadsell},
1793
+ year = {2019},
1794
+ eprint = {1910.06764},
1795
+ archivePrefix = {arXiv},
1796
+ primaryClass = {cs.LG}
1797
+ }
1798
+ ```
1799
+
1800
+ ```bibtex
1801
+ @misc{narang2021transformer,
1802
+ title = {Do Transformer Modifications Transfer Across Implementations and Applications?},
1803
+ author = {Sharan Narang and Hyung Won Chung and Yi Tay and William Fedus and Thibault Fevry and Michael Matena and Karishma Malkan and Noah Fiedel and Noam Shazeer and Zhenzhong Lan and Yanqi Zhou and Wei Li and Nan Ding and Jake Marcus and Adam Roberts and Colin Raffel},
1804
+ year = {2021},
1805
+ eprint = {2102.11972},
1806
+ archivePrefix = {arXiv},
1807
+ primaryClass = {cs.LG}
1808
+ }
1809
+ ```
1810
+
1811
+ ```bibtex
1812
+ @misc{zhang2019root,
1813
+ title = {Root Mean Square Layer Normalization},
1814
+ author = {Biao Zhang and Rico Sennrich},
1815
+ year = {2019},
1816
+ eprint = {1910.07467},
1817
+ archivePrefix = {arXiv},
1818
+ primaryClass = {cs.LG}
1819
+ }
1820
+ ```
1821
+
1822
+ ```bibtex
1823
+ @inproceedings{Qin2023ScalingTT,
1824
+ title = {Scaling TransNormer to 175 Billion Parameters},
1825
+ author = {Zhen Qin and Dong Li and Weigao Sun and Weixuan Sun and Xuyang Shen and Xiaodong Han and Yunshen Wei and Baohong Lv and Fei Yuan and Xiao Luo and Y. Qiao and Yiran Zhong},
1826
+ year = {2023},
1827
+ url = {https://api.semanticscholar.org/CorpusID:260203124}
1828
+ }
1829
+ ```
1830
+
1831
+ ```bibtex
1832
+ @misc{su2021roformer,
1833
+ title = {RoFormer: Enhanced Transformer with Rotary Position Embedding},
1834
+ author = {Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu},
1835
+ year = {2021},
1836
+ eprint = {2104.09864},
1837
+ archivePrefix = {arXiv},
1838
+ primaryClass = {cs.CL}
1839
+ }
1840
+ ```
1841
+
1842
+ ```bibtex
1843
+ @inproceedings{Chen2023ExtendingCW,
1844
+ title = {Extending Context Window of Large Language Models via Positional Interpolation},
1845
+ author = {Shouyuan Chen and Sherman Wong and Liangjian Chen and Yuandong Tian},
1846
+ year = {2023}
1847
+ }
1848
+ ```
1849
+
1850
+ ```bibtex
1851
+ @inproceedings{Sun2022ALT,
1852
+ title = {A Length-Extrapolatable Transformer},
1853
+ author = {Yutao Sun and Li Dong and Barun Patra and Shuming Ma and Shaohan Huang and Alon Benhaim and Vishrav Chaudhary and Xia Song and Furu Wei},
1854
+ year = {2022}
1855
+ }
1856
+ ```
1857
+
1858
+ ```bibtex
1859
+ @Article{AlphaFold2021,
1860
+ author = {Jumper, John and Evans, Richard and Pritzel, Alexander and Green, Tim and Figurnov, Michael and Ronneberger, Olaf and Tunyasuvunakool, Kathryn and Bates, Russ and {\v{Z}}{\'\i}dek, Augustin and Potapenko, Anna and Bridgland, Alex and Meyer, Clemens and Kohl, Simon A A and Ballard, Andrew J and Cowie, Andrew and Romera-Paredes, Bernardino and Nikolov, Stanislav and Jain, Rishub and Adler, Jonas and Back, Trevor and Petersen, Stig and Reiman, David and Clancy, Ellen and Zielinski, Michal and Steinegger, Martin and Pacholska, Michalina and Berghammer, Tamas and Bodenstein, Sebastian and Silver, David and Vinyals, Oriol and Senior, Andrew W and Kavukcuoglu, Koray and Kohli, Pushmeet and Hassabis, Demis},
1861
+ journal = {Nature},
1862
+ title = {Highly accurate protein structure prediction with {AlphaFold}},
1863
+ year = {2021},
1864
+ doi = {10.1038/s41586-021-03819-2},
1865
+ note = {(Accelerated article preview)},
1866
+ }
1867
+ ```
1868
+
1869
+ ```bibtex
1870
+ @software{peng_bo_2021_5196578,
1871
+ author = {PENG Bo},
1872
+ title = {BlinkDL/RWKV-LM: 0.01},
1873
+ month = {aug},
1874
+ year = {2021},
1875
+ publisher = {Zenodo},
1876
+ version = {0.01},
1877
+ doi = {10.5281/zenodo.5196578},
1878
+ url = {https://doi.org/10.5281/zenodo.5196578}
1879
+ }
1880
+ ```
1881
+
1882
+ ```bibtex
1883
+ @misc{csordás2021devil,
1884
+ title = {The Devil is in the Detail: Simple Tricks Improve Systematic Generalization of Transformers},
1885
+ author = {Róbert Csordás and Kazuki Irie and Jürgen Schmidhuber},
1886
+ year = {2021},
1887
+ eprint = {2108.12284},
1888
+ archivePrefix = {arXiv},
1889
+ primaryClass = {cs.LG}
1890
+ }
1891
+ ```
1892
+
1893
+ ```bibtex
1894
+ @misc{so2021primer,
1895
+ title = {Primer: Searching for Efficient Transformers for Language Modeling},
1896
+ author = {David R. So and Wojciech Mańke and Hanxiao Liu and Zihang Dai and Noam Shazeer and Quoc V. Le},
1897
+ year = {2021},
1898
+ eprint = {2109.08668},
1899
+ archivePrefix = {arXiv},
1900
+ primaryClass = {cs.LG}
1901
+ }
1902
+ ```
1903
+
1904
+ ```bibtex
1905
+ @misc{ding2021erniedoc,
1906
+ title = {ERNIE-Doc: A Retrospective Long-Document Modeling Transformer},
1907
+ author = {Siyu Ding and Junyuan Shang and Shuohuan Wang and Yu Sun and Hao Tian and Hua Wu and Haifeng Wang},
1908
+ year = {2021},
1909
+ eprint = {2012.15688},
1910
+ archivePrefix = {arXiv},
1911
+ primaryClass = {cs.CL}
1912
+ }
1913
+ ```
1914
+
1915
+ ```bibtex
1916
+ @misc{ding2021cogview,
1917
+ title = {CogView: Mastering Text-to-Image Generation via Transformers},
1918
+ author = {Ming Ding and Zhuoyi Yang and Wenyi Hong and Wendi Zheng and Chang Zhou and Da Yin and Junyang Lin and Xu Zou and Zhou Shao and Hongxia Yang and Jie Tang},
1919
+ year = {2021},
1920
+ eprint = {2105.13290},
1921
+ archivePrefix = {arXiv},
1922
+ primaryClass = {cs.CV}
1923
+ }
1924
+ ```
1925
+
1926
+ ```bibtex
1927
+ @inproceedings{anonymous2022normformer,
1928
+ title = {NormFormer: Improved Transformer Pretraining with Extra Normalization},
1929
+ author = {Anonymous},
1930
+ booktitle = {Submitted to The Tenth International Conference on Learning Representations },
1931
+ year = {2022},
1932
+ url = {https://openreview.net/forum?id=GMYWzWztDx5},
1933
+ note = {under review}
1934
+ }
1935
+ ```
1936
+
1937
+ ```bibtex
1938
+ @misc{henry2020querykey,
1939
+ title = {Query-Key Normalization for Transformers},
1940
+ author = {Alex Henry and Prudhvi Raj Dachapally and Shubham Pawar and Yuxuan Chen},
1941
+ year = {2020},
1942
+ eprint = {2010.04245},
1943
+ archivePrefix = {arXiv},
1944
+ primaryClass = {cs.CL}
1945
+ }
1946
+ ```
1947
+
1948
+ ```bibtex
1949
+ @misc{liu2021swin,
1950
+ title = {Swin Transformer V2: Scaling Up Capacity and Resolution},
1951
+ author = {Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei and Baining Guo},
1952
+ year = {2021},
1953
+ eprint = {2111.09883},
1954
+ archivePrefix = {arXiv},
1955
+ primaryClass = {cs.CV}
1956
+ }
1957
+ ```
1958
+
1959
+ ```bibtex
1960
+ @article{Haviv2022TransformerLM,
1961
+ title = {Transformer Language Models without Positional Encodings Still Learn Positional Information},
1962
+ author = {Adi Haviv and Ori Ram and Ofir Press and Peter Izsak and Omer Levy},
1963
+ journal = {ArXiv},
1964
+ year = {2022},
1965
+ volume = {abs/2203.16634}
1966
+ }
1967
+ ```
1968
+
1969
+ ```bibtex
1970
+ @article{chowdhery2022PaLM,
1971
+ title = {PaLM: Scaling Language Modeling with Pathways},
1972
+ author = {Chowdhery, Aakanksha et al},
1973
+ year = {2022}
1974
+ }
1975
+ ```
1976
+
1977
+ ```bibtex
1978
+ @article{Shazeer2019FastTD,
1979
+ title = {Fast Transformer Decoding: One Write-Head is All You Need},
1980
+ author = {Noam M. Shazeer},
1981
+ journal = {ArXiv},
1982
+ year = {2019},
1983
+ volume = {abs/1911.02150}
1984
+ }
1985
+ ```
1986
+
1987
+ ```bibtex
1988
+ @article{Ainslie2023GQATG,
1989
+ title = {GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints},
1990
+ author = {Joshua Ainslie and James Lee-Thorp and Michiel de Jong and Yury Zemlyanskiy and Federico Lebrón and Sumit K. Sanghai},
1991
+ journal = {ArXiv},
1992
+ year = {2023},
1993
+ volume = {abs/2305.13245},
1994
+ url = {https://api.semanticscholar.org/CorpusID:258833177}
1995
+ }
1996
+ ```
1997
+
1998
+ ```bibtex
1999
+ @article{Liu2022FCMFC,
2000
+ title = {FCM: Forgetful Causal Masking Makes Causal Language Models Better Zero-Shot Learners},
2001
+ author = {Hao Liu and Xinyang Geng and Lisa Lee and Igor Mordatch and Sergey Levine and Sharan Narang and P. Abbeel},
2002
+ journal = {ArXiv},
2003
+ year = {2022},
2004
+ volume = {abs/2210.13432}
2005
+ }
2006
+ ```
2007
+
2008
+ ```bibtex
2009
+ @inproceedings{Huang2016DeepNW,
2010
+ title = {Deep Networks with Stochastic Depth},
2011
+ author = {Gao Huang and Yu Sun and Zhuang Liu and Daniel Sedra and Kilian Q. Weinberger},
2012
+ booktitle = {European Conference on Computer Vision},
2013
+ year = {2016}
2014
+ }
2015
+ ```
2016
+
2017
+ ```bibtex
2018
+ @inproceedings{Hua2022TransformerQI,
2019
+ title = {Transformer Quality in Linear Time},
2020
+ author = {Weizhe Hua and Zihang Dai and Hanxiao Liu and Quoc V. Le},
2021
+ booktitle = {International Conference on Machine Learning},
2022
+ year = {2022}
2023
+ }
2024
+ ```
2025
+
2026
+ ```bibtex
2027
+ @article{Chang2022MaskGITMG,
2028
+ title = {MaskGIT: Masked Generative Image Transformer},
2029
+ author = {Huiwen Chang and Han Zhang and Lu Jiang and Ce Liu and William T. Freeman},
2030
+ journal = {2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
2031
+ year = {2022},
2032
+ pages = {11305-11315}
2033
+ }
2034
+ ```
2035
+
2036
+ ```bibtex
2037
+ @article{Lezama2022ImprovedMI,
2038
+ title = {Improved Masked Image Generation with Token-Critic},
2039
+ author = {Jos{\'e} Lezama and Huiwen Chang and Lu Jiang and Irfan Essa},
2040
+ journal = {ArXiv},
2041
+ year = {2022},
2042
+ volume = {abs/2209.04439}
2043
+ }
2044
+ ```
2045
+
2046
+ ```bibtex
2047
+ @misc{https://doi.org/10.48550/arxiv.2302.01327,
2048
+ doi = {10.48550/ARXIV.2302.01327},
2049
+ url = {https://arxiv.org/abs/2302.01327},
2050
+ author = {Kumar, Manoj and Dehghani, Mostafa and Houlsby, Neil},
2051
+ title = {Dual PatchNorm},
2052
+ publisher = {arXiv},
2053
+ year = {2023},
2054
+ copyright = {Creative Commons Attribution 4.0 International}
2055
+ }
2056
+ ```
2057
+
2058
+ ```bibtex
2059
+ @inproceedings{dao2022flashattention,
2060
+ title = {Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
2061
+ author = {Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
2062
+ booktitle = {Advances in Neural Information Processing Systems},
2063
+ year = {2022}
2064
+ }
2065
+ ```
2066
+
2067
+ ```bibtex
2068
+ @inproceedings{Dehghani2023ScalingVT,
2069
+ title = {Scaling Vision Transformers to 22 Billion Parameters},
2070
+ author = {Mostafa Dehghani and Josip Djolonga and Basil Mustafa and Piotr Padlewski and Jonathan Heek and Justin Gilmer and Andreas Steiner and Mathilde Caron and Robert Geirhos and Ibrahim M. Alabdulmohsin and Rodolphe Jenatton and Lucas Beyer and Michael Tschannen and Anurag Arnab and Xiao Wang and Carlos Riquelme and Matthias Minderer and Joan Puigcerver and Utku Evci and Manoj Kumar and Sjoerd van Steenkiste and Gamaleldin F. Elsayed and Aravindh Mahendran and Fisher Yu and Avital Oliver and Fantine Huot and Jasmijn Bastings and Mark Collier and Alexey A. Gritsenko and Vighnesh Birodkar and Cristina Nader Vasconcelos and Yi Tay and Thomas Mensink and Alexander Kolesnikov and Filip Paveti'c and Dustin Tran and Thomas Kipf and Mario Luvci'c and Xiaohua Zhai and Daniel Keysers and Jeremiah Harmsen and Neil Houlsby},
2071
+ year = {2023}
2072
+ }
2073
+ ```
2074
+
2075
+ ```bibtex
2076
+ @article{Beyer2022BetterPV,
2077
+ title = {Better plain ViT baselines for ImageNet-1k},
2078
+ author = {Lucas Beyer and Xiaohua Zhai and Alexander Kolesnikov},
2079
+ journal = {ArXiv},
2080
+ year = {2022},
2081
+ volume = {abs/2205.01580}
2082
+ }
2083
+ ```
2084
+
2085
+ ```bibtex
2086
+ @article{Kazemnejad2023TheIO,
2087
+ title = {The Impact of Positional Encoding on Length Generalization in Transformers},
2088
+ author = {Amirhossein Kazemnejad and Inkit Padhi and Karthikeyan Natesan Ramamurthy and Payel Das and Siva Reddy},
2089
+ journal = {ArXiv},
2090
+ year = {2023},
2091
+ volume = {abs/2305.19466}
2092
+ }
2093
+ ```
2094
+
2095
+ ```bibtex
2096
+ @misc{bloc97-2023,
2097
+ title = {NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation.},
2098
+ author = {/u/bloc97},
2099
+ url = {https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/}
2100
+ }
2101
+ ```
2102
+
2103
+ ```bibtex
2112
+ @article{Lan2019ALBERTAL,
2113
+ title = {ALBERT: A Lite BERT for Self-supervised Learning of Language Representations},
2114
+ author = {Zhenzhong Lan and Mingda Chen and Sebastian Goodman and Kevin Gimpel and Piyush Sharma and Radu Soricut},
2115
+ journal = {ArXiv},
2116
+ year = {2019},
2117
+ volume = {abs/1909.11942},
2118
+ url = {https://api.semanticscholar.org/CorpusID:202888986}
2119
+ }
2120
+ ```
2121
+
2122
+ ```bibtex
2123
+ @inproceedings{Li2022ContrastiveDO,
2124
+ title = {Contrastive Decoding: Open-ended Text Generation as Optimization},
2125
+ author = {Xiang Lisa Li and Ari Holtzman and Daniel Fried and Percy Liang and Jason Eisner and Tatsunori Hashimoto and Luke Zettlemoyer and Mike Lewis},
2126
+ booktitle = {Annual Meeting of the Association for Computational Linguistics},
2127
+ year = {2022},
2128
+ url = {https://api.semanticscholar.org/CorpusID:253157949}
2129
+ }
2130
+ ```
2131
+
2132
+ ```bibtex
2133
+ @inproceedings{OBrien2023ContrastiveDI,
2134
+ title = {Contrastive Decoding Improves Reasoning in Large Language Models},
2135
+ author = {Sean O'Brien and Mike Lewis},
2136
+ year = {2023},
2137
+ url = {https://api.semanticscholar.org/CorpusID:261884427}
2138
+ }
2139
+ ```
2140
+
2141
+ ```bibtex
2142
+ @inproceedings{Darcet2023VisionTN,
2143
+ title = {Vision Transformers Need Registers},
2144
+ author = {Timothée Darcet and Maxime Oquab and Julien Mairal and Piotr Bojanowski},
2145
+ year = {2023},
2146
+ url = {https://api.semanticscholar.org/CorpusID:263134283}
2147
+ }
2148
+ ```
2149
+
2150
+ ```bibtex
2151
+ @article{Bondarenko2023QuantizableTR,
2152
+ title = {Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing},
2153
+ author = {Yelysei Bondarenko and Markus Nagel and Tijmen Blankevoort},
2154
+ journal = {ArXiv},
2155
+ year = {2023},
2156
+ volume = {abs/2306.12929},
2157
+ url = {https://api.semanticscholar.org/CorpusID:259224568}
2158
+ }
2159
+ ```
2160
+
2161
+ ```bibtex
2162
+ @inproceedings{Golkar2023xValAC,
2163
+ title = {xVal: A Continuous Number Encoding for Large Language Models},
2164
+ author = {Siavash Golkar and Mariel Pettee and Michael Eickenberg and Alberto Bietti and M. Cranmer and G{\'e}raud Krawezik and Francois Lanusse and Michael McCabe and Ruben Ohana and Liam Parker and Bruno R{\'e}galdo-Saint Blancard and Tiberiu Teşileanu and Kyunghyun Cho and Shirley Ho},
2165
+ year = {2023},
2166
+ url = {https://api.semanticscholar.org/CorpusID:263622222}
2167
+ }
2168
+ ```
2169
+
2170
+ ```bibtex
2171
+ @article{Wang2022DeepNetST,
2172
+ title = {DeepNet: Scaling Transformers to 1,000 Layers},
2173
+ author = {Hongyu Wang and Shuming Ma and Li Dong and Shaohan Huang and Dongdong Zhang and Furu Wei},
2174
+ journal = {ArXiv},
2175
+ year = {2022},
2176
+ volume = {abs/2203.00555},
2177
+ url = {https://api.semanticscholar.org/CorpusID:247187905}
2178
+ }
2179
+ ```
2180
+
2181
+ ```bibtex
2182
+ @article{Rafailov2023DirectPO,
2183
+ title = {Direct Preference Optimization: Your Language Model is Secretly a Reward Model},
2184
+ author = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Stefano Ermon and Christopher D. Manning and Chelsea Finn},
2185
+ journal = {ArXiv},
2186
+ year = {2023},
2187
+ volume = {abs/2305.18290},
2188
+ url = {https://api.semanticscholar.org/CorpusID:258959321}
2189
+ }
2190
+ ```
2191
+
2192
+ ```bibtex
2193
+ @misc{xAI2024Grok,
2194
+ author = {xAI},
2195
+ title = {Grok},
2196
+ year = {2024},
2197
+ publisher = {GitHub},
2198
+ journal = {GitHub repository},
2199
+ howpublished = {\url{https://github.com/xai-org/grok-1}},
2200
+ }
2201
+ ```
2202
+
2203
+ ```bibtex
2204
+ @inproceedings{Golovneva2024ContextualPE,
2205
+ title = {Contextual Position Encoding: Learning to Count What's Important},
2206
+ author = {Olga Golovneva and Tianlu Wang and Jason Weston and Sainbayar Sukhbaatar},
2207
+ year = {2024},
2208
+ url = {https://api.semanticscholar.org/CorpusID:270094992}
2209
+ }
2210
+ ```
2211
+
2212
+ ```bibtex
2213
+ @article{Peebles2022ScalableDM,
2214
+ title = {Scalable Diffusion Models with Transformers},
2215
+ author = {William S. Peebles and Saining Xie},
2216
+ journal = {2023 IEEE/CVF International Conference on Computer Vision (ICCV)},
2217
+ year = {2022},
2218
+ pages = {4172-4182},
2219
+ url = {https://api.semanticscholar.org/CorpusID:254854389}
2220
+ }
2221
+ ```
2222
+
2223
+ ```bibtex
2224
+ @misc{Rubin2024,
2225
+ author = {Ohad Rubin},
2226
+ url = {https://medium.com/@ohadrubin/exploring-weight-decay-in-layer-normalization-challenges-and-a-reparameterization-solution-ad4d12c24950}
2227
+ }
2228
+ ```
2229
+
2230
+ ```bibtex
2231
+ @article{Mesnard2024GemmaOM,
2232
+ title = {Gemma: Open Models Based on Gemini Research and Technology},
2233
+ author = {Gemma Team Thomas Mesnard and Cassidy Hardin and Robert Dadashi and Surya Bhupatiraju and Shreya Pathak and L. Sifre and Morgane Riviere and Mihir Kale and J Christopher Love and Pouya Dehghani Tafti and L'eonard Hussenot and Aakanksha Chowdhery and Adam Roberts and Aditya Barua and Alex Botev and Alex Castro-Ros and Ambrose Slone and Am'elie H'eliou and Andrea Tacchetti and Anna Bulanova and Antonia Paterson and Beth Tsai and Bobak Shahriari and Charline Le Lan and Christopher A. Choquette-Choo and Cl'ement Crepy and Daniel Cer and Daphne Ippolito and David Reid and Elena Buchatskaya and Eric Ni and Eric Noland and Geng Yan and George Tucker and George-Christian Muraru and Grigory Rozhdestvenskiy and Henryk Michalewski and Ian Tenney and Ivan Grishchenko and Jacob Austin and James Keeling and Jane Labanowski and Jean-Baptiste Lespiau and Jeff Stanway and Jenny Brennan and Jeremy Chen and Johan Ferret and Justin Chiu and Justin Mao-Jones and Katherine Lee and Kathy Yu and Katie Millican and Lars Lowe Sjoesund and Lisa Lee and Lucas Dixon and Machel Reid and Maciej Mikula and Mateo Wirth and Michael Sharman and Nikolai Chinaev and Nithum Thain and Olivier Bachem and Oscar Chang and Oscar Wahltinez and Paige Bailey and Paul Michel and Petko Yotov and Pier Giuseppe Sessa and Rahma Chaabouni and Ramona Comanescu and Reena Jana and Rohan Anil and Ross McIlroy and Ruibo Liu and Ryan Mullins and Samuel L Smith and Sebastian Borgeaud and Sertan Girgin and Sholto Douglas and Shree Pandya and Siamak Shakeri and Soham De and Ted Klimenko and Tom Hennigan and Vladimir Feinberg and Wojciech Stokowiec and Yu-hui Chen and Zafarali Ahmed and Zhitao Gong and Tris Brian Warkentin and Ludovic Peran and Minh Giang and Cl'ement Farabet and Oriol Vinyals and Jeffrey Dean and Koray Kavukcuoglu and Demis Hassabis and Zoubin Ghahramani and Douglas Eck and Joelle Barral and Fernando Pereira and Eli Collins and Armand Joulin and Noah Fiedel and Evan Senter and Alek Andreev and Kathleen Kenealy},
2234
+ journal = {ArXiv},
2235
+ year = {2024},
2236
+ volume = {abs/2403.08295},
2237
+ url = {https://api.semanticscholar.org/CorpusID:268379206}
2238
+ }
2239
+ ```
2240
+
2241
+ ```bibtex
2242
+ @article{Nguyen2024MinPS,
2243
+ title = {Min P Sampling: Balancing Creativity and Coherence at High Temperature},
2244
+ author = {Minh Nguyen and Andrew Baker and Andreas Kirsch and Clement Neo},
2245
+ journal = {ArXiv},
2246
+ year = {2024},
2247
+ volume = {abs/2407.01082},
2248
+ url = {https://api.semanticscholar.org/CorpusID:270870613}
2249
+ }
2250
+ ```
2251
+
2252
+ ```bibtex
2253
+ @article{Bao2022AllAW,
2254
+ title = {All are Worth Words: A ViT Backbone for Diffusion Models},
2255
+ author = {Fan Bao and Shen Nie and Kaiwen Xue and Yue Cao and Chongxuan Li and Hang Su and Jun Zhu},
2256
+ journal = {2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
2257
+ year = {2022},
2258
+ pages = {22669-22679},
2259
+ url = {https://api.semanticscholar.org/CorpusID:253581703}
2260
+ }
2261
+ ```
2262
+
2263
+ ```bibtex
2264
+ @article{Jumper2021HighlyAP,
2265
+ title = {Highly accurate protein structure prediction with AlphaFold},
2266
+ author = {John M. Jumper and Richard Evans and Alexander Pritzel and Tim Green and Michael Figurnov and Olaf Ronneberger and Kathryn Tunyasuvunakool and Russ Bates and Augustin Ž{\'i}dek and Anna Potapenko and Alex Bridgland and Clemens Meyer and Simon A A Kohl and Andy Ballard and Andrew Cowie and Bernardino Romera-Paredes and Stanislav Nikolov and Rishub Jain and Jonas Adler and Trevor Back and Stig Petersen and David Reiman and Ellen Clancy and Michal Zielinski and Martin Steinegger and Michalina Pacholska and Tamas Berghammer and Sebastian Bodenstein and David Silver and Oriol Vinyals and Andrew W. Senior and Koray Kavukcuoglu and Pushmeet Kohli and Demis Hassabis},
2267
+ journal = {Nature},
2268
+ year = {2021},
2269
+ volume = {596},
2270
+ pages = {583 - 589},
2271
+ url = {https://api.semanticscholar.org/CorpusID:235959867}
2272
+ }
2273
+ ```
2274
+
2275
+ ```bibtex
2276
+ @article{Yang2017BreakingTS,
2277
+ title = {Breaking the Softmax Bottleneck: A High-Rank RNN Language Model},
2278
+ author = {Zhilin Yang and Zihang Dai and Ruslan Salakhutdinov and William W. Cohen},
2279
+ journal = {ArXiv},
2280
+ year = {2017},
2281
+ volume = {abs/1711.03953},
2282
+ url = {https://api.semanticscholar.org/CorpusID:26238954}
2283
+ }
2284
+ ```
2285
+
2286
+ ```bibtex
2287
+ @inproceedings{Kanai2018SigsoftmaxRO,
2288
+ title = {Sigsoftmax: Reanalysis of the Softmax Bottleneck},
2289
+ author = {Sekitoshi Kanai and Yasuhiro Fujiwara and Yuki Yamanaka and Shuichi Adachi},
2290
+ booktitle = {Neural Information Processing Systems},
2291
+ year = {2018},
2292
+ url = {https://api.semanticscholar.org/CorpusID:44064935}
+ }
2293
+ ```
2294
+
2295
+ ```bibtex
2296
+ @article{Kim2020TheLC,
2297
+ title = {The Lipschitz Constant of Self-Attention},
2298
+ author = {Hyunjik Kim and George Papamakarios and Andriy Mnih},
2299
+ journal = {ArXiv},
2300
+ year = {2020},
2301
+ volume = {abs/2006.04710},
2302
+ url = {https://api.semanticscholar.org/CorpusID:219530837}
2303
+ }
2304
+ ```
2305
+
2306
+ ```bibtex
2307
+ @inproceedings{Ramapuram2024TheoryAA,
2308
+ title = {Theory, Analysis, and Best Practices for Sigmoid Self-Attention},
2309
+ author = {Jason Ramapuram and Federico Danieli and Eeshan Gunesh Dhekane and Floris Weers and Dan Busbridge and Pierre Ablin and Tatiana Likhomanenko and Jagrit Digani and Zijin Gu and Amitis Shidani and Russ Webb},
2310
+ year = {2024},
2311
+ url = {https://api.semanticscholar.org/CorpusID:272463580}
2312
+ }
2313
+ ```
2314
+
2315
+ ```bibtex
2316
+ @inproceedings{Leviathan2024SelectiveAI,
2317
+ title = {Selective Attention Improves Transformer},
2318
+ author = {Yaniv Leviathan and Matan Kalman and Yossi Matias},
2319
+ year = {2024},
2320
+ url = {https://api.semanticscholar.org/CorpusID:273098114}
2321
+ }
2322
+ ```
2323
+
2324
+ ```bibtex
2325
+ @article{Bai2019DeepEM,
2326
+ title = {Deep Equilibrium Models},
2327
+ author = {Shaojie Bai and J. Zico Kolter and Vladlen Koltun},
2328
+ journal = {ArXiv},
2329
+ year = {2019},
2330
+ volume = {abs/1909.01377},
2331
+ url = {https://api.semanticscholar.org/CorpusID:202539738}
2332
+ }
2333
+ ```
2334
+
2335
+ ```bibtex
2336
+ @article{Wu2021MuseMorphoseFA,
2337
+ title = {MuseMorphose: Full-Song and Fine-Grained Piano Music Style Transfer With One Transformer VAE},
2338
+ author = {Shih-Lun Wu and Yi-Hsuan Yang},
2339
+ journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
2340
+ year = {2021},
2341
+ volume = {31},
2342
+ pages = {1953-1967},
2343
+ url = {https://api.semanticscholar.org/CorpusID:234338162}
2344
+ }
2345
+ ```
2346
+
2347
+ ```bibtex
2348
+ @inproceedings{Zhou2024ValueRL,
2349
+ title = {Value Residual Learning For Alleviating Attention Concentration In Transformers},
2350
+ author = {Zhanchao Zhou and Tianyi Wu and Zhiyun Jiang and Zhenzhong Lan},
2351
+ year = {2024},
2352
+ url = {https://api.semanticscholar.org/CorpusID:273532030}
2353
+ }
2354
+ ```
2355
+
2356
+ ```bibtex
2357
+ @inproceedings{anonymous2024forgetting,
2358
+ title = {Forgetting Transformer: Softmax Attention with a Forget Gate},
2359
+ author = {Anonymous},
2360
+ booktitle = {Submitted to The Thirteenth International Conference on Learning Representations},
2361
+ year = {2024},
2362
+ url = {https://openreview.net/forum?id=q2Lnyegkr8},
2363
+ note = {under review}
2364
+ }
2365
+ ```
2366
+
2367
+ ```bibtex
2368
+ @inproceedings{anonymous2024from,
2369
+ title = {From {MLP} to Neo{MLP}: Leveraging Self-Attention for Neural Fields},
2370
+ author = {Anonymous},
2371
+ booktitle = {Submitted to The Thirteenth International Conference on Learning Representations},
2372
+ year = {2024},
2373
+ url = {https://openreview.net/forum?id=A8Vuf2e8y6},
2374
+ note = {under review}
2375
+ }
2376
+ ```
2377
+
2378
+ ```bibtex
2379
+ @inproceedings{Duvvuri2024LASERAW,
2380
+ title = {LASER: Attention with Exponential Transformation},
2381
+ author = {Sai Surya Duvvuri and Inderjit S. Dhillon},
2382
+ year = {2024},
2383
+ url = {https://api.semanticscholar.org/CorpusID:273849947}
2384
+ }
2385
+ ```
2386
+
2387
+ ```bibtex
2388
+ @article{Zhu2024HyperConnections,
2389
+ title = {Hyper-Connections},
2390
+ author = {Defa Zhu and Hongzhi Huang and Zihao Huang and Yutao Zeng and Yunyao Mao and Banggu Wu and Qiyang Min and Xun Zhou},
2391
+ journal = {ArXiv},
2392
+ year = {2024},
2393
+ volume = {abs/2409.19606},
2394
+ url = {https://api.semanticscholar.org/CorpusID:272987528}
2395
+ }
2396
+ ```
2397
+
2398
+ ```bibtex
2399
+ @inproceedings{anonymous2024hymba,
2400
+ title = {Hymba: A Hybrid-head Architecture for Small Language Models},
2401
+ author = {Anonymous},
2402
+ booktitle = {Submitted to The Thirteenth International Conference on Learning Representations},
2403
+ year = {2024},
2404
+ url = {https://openreview.net/forum?id=A1ztozypga},
2405
+ note = {under review}
2406
+ }
2407
+ ```
2408
+
2409
+ ```bibtex
2410
+ @article{Shao2024DeepSeekV2AS,
2411
+ title = {DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model},
2412
+ author = {Zhihong Shao and Damai Dai and Daya Guo and Bo Liu (Benjamin Liu) and Zihan Wang and Huajian Xin},
2413
+ journal = {ArXiv},
2414
+ year = {2024},
2415
+ volume = {abs/2405.04434},
2416
+ url = {https://api.semanticscholar.org/CorpusID:269613809}
2417
+ }
2418
+ ```
2419
+
2420
+ *solve intelligence... then use that to solve everything else.* - Demis Hassabis