adv-optm 0.1.0__tar.gz → 1.2.2__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- adv_optm-1.2.2/PKG-INFO +222 -0
- adv_optm-1.2.2/README.md +191 -0
- adv_optm-1.2.2/adv_optm/__init__.py +23 -0
- adv_optm-1.2.2/adv_optm/optim/AdaMuon_adv.py +729 -0
- adv_optm-1.2.2/adv_optm/optim/AdamW_adv.py +381 -0
- adv_optm-1.2.2/adv_optm/optim/Adopt_adv.py +444 -0
- adv_optm-1.2.2/adv_optm/optim/Lion_Prodigy_adv.py +348 -0
- adv_optm-1.2.2/adv_optm/optim/Lion_adv.py +217 -0
- adv_optm-1.2.2/adv_optm/optim/Muon_adv.py +730 -0
- adv_optm-1.2.2/adv_optm/optim/Prodigy_adv.py +546 -0
- adv_optm-1.2.2/adv_optm/optim/Simplified_AdEMAMix.py +305 -0
- adv_optm-1.2.2/adv_optm/optim/__init__.py +19 -0
- {adv_optm-0.1.0 → adv_optm-1.2.2}/adv_optm/util/BF16_Stochastic_Rounding.py +22 -4
- adv_optm-1.2.2/adv_optm/util/Kourkoutas.py +165 -0
- adv_optm-1.2.2/adv_optm/util/Newton_Schulz.py +87 -0
- {adv_optm-0.1.0 → adv_optm-1.2.2}/adv_optm/util/__init__.py +3 -1
- adv_optm-1.2.2/adv_optm.egg-info/PKG-INFO +222 -0
- {adv_optm-0.1.0 → adv_optm-1.2.2}/adv_optm.egg-info/SOURCES.txt +7 -1
- {adv_optm-0.1.0 → adv_optm-1.2.2}/setup.py +1 -1
- adv_optm-0.1.0/PKG-INFO +0 -134
- adv_optm-0.1.0/README.md +0 -103
- adv_optm-0.1.0/adv_optm/__init__.py +0 -13
- adv_optm-0.1.0/adv_optm/optim/AdamW_adv.py +0 -293
- adv_optm-0.1.0/adv_optm/optim/Adopt_adv.py +0 -336
- adv_optm-0.1.0/adv_optm/optim/Prodigy_adv.py +0 -367
- adv_optm-0.1.0/adv_optm/optim/__init__.py +0 -9
- adv_optm-0.1.0/adv_optm/util/Randomized_SVD.py +0 -37
- adv_optm-0.1.0/adv_optm.egg-info/PKG-INFO +0 -134
- {adv_optm-0.1.0 → adv_optm-1.2.2}/LICENSE +0 -0
- {adv_optm-0.1.0 → adv_optm-1.2.2}/adv_optm/util/Effective_Shape.py +0 -0
- {adv_optm-0.1.0 → adv_optm-1.2.2}/adv_optm/util/NNMF.py +0 -0
- {adv_optm-0.1.0 → adv_optm-1.2.2}/adv_optm/util/One_Bit_Boolean.py +0 -0
- {adv_optm-0.1.0 → adv_optm-1.2.2}/adv_optm/util/OrthoGrad.py +0 -0
- {adv_optm-0.1.0 → adv_optm-1.2.2}/adv_optm.egg-info/dependency_links.txt +0 -0
- {adv_optm-0.1.0 → adv_optm-1.2.2}/adv_optm.egg-info/requires.txt +0 -0
- {adv_optm-0.1.0 → adv_optm-1.2.2}/adv_optm.egg-info/top_level.txt +0 -0
- {adv_optm-0.1.0 → adv_optm-1.2.2}/setup.cfg +0 -0
adv_optm-1.2.2/PKG-INFO
ADDED
@@ -0,0 +1,222 @@
Metadata-Version: 2.4
Name: adv_optm
Version: 1.2.2
Summary: A family of highly efficient, lightweight yet powerful optimizers.
Home-page: https://github.com/Koratahiu/Advanced_Optimizers
Author: Koratahiu
Author-email: hiuhonor@gmail.com
License: Apache 2.0
Keywords: llm,fine-tuning,memory-efficient,low-rank,compression,pytorch,optimizer,adam
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Advanced Optimizers (AIO)

A comprehensive, all-in-one collection of optimization algorithms for deep learning, designed for **maximum efficiency**, **minimal memory footprint**, and **superior performance** across diverse model architectures and training scenarios.

[](https://pypi.org/project/adv_optm/)

---

## 📦 Installation

```bash
pip install adv_optm
```

---

## 🧠 Core Innovations

This library integrates multiple state-of-the-art optimization techniques, validated through extensive research and practical training, with **1-bit compression for optimizer states**:

### **Memory-Efficient Optimization (SMMF-inspired)**
- **Paper**: [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
- **Approach**: Rank-1 non-negative matrix factorization with a reconstruction cycle (factor → reconstruct → update → factor)
- **Innovation**:
  - First moment split into **1-bit sign + absolute value**
  - Final storage: **four factored vectors + one 1-bit sign state**
  - Preserves Adam-like update quality with drastically reduced memory
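The factor → reconstruct → update → factor cycle can be illustrated with a small pure-Python sketch (a toy model of the idea, not the library's implementation; the real code operates on tensors and handles the sign state separately):

```python
# Toy sketch of a rank-1 non-negative factorization cycle (not the SMMF code).
# Between optimizer steps only two vectors are stored; the full matrix
# exists transiently while the update is applied.

def factor(M):
    """Factor a non-negative matrix into a row vector r and a column vector c."""
    total = sum(sum(row) for row in M) or 1.0
    r = [sum(row) for row in M]                # row sums
    c = [sum(col) / total for col in zip(*M)]  # normalized column sums
    return r, c

def reconstruct(r, c):
    """Rank-1 outer product r c^T."""
    return [[ri * cj for cj in c] for ri in r]

# For a rank-1 non-negative matrix the cycle is lossless:
M = [[3.0, 4.0], [6.0, 8.0]]   # outer product of [1, 2] and [3, 4]
r, c = factor(M)
M_hat = reconstruct(r, c)
```

Storage drops from `n*m` entries per state matrix to `n + m`, which is where the memory savings reported in the tables come from.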

---

## ⚡ Performance Characteristics

### Memory Efficiency (SDXL Model – 6.5 GB)
| Optimizer | Memory Usage | Description |
|-----------|--------------|-------------|
| `Adopt_Factored` | 328 MB | 4 small vectors + one 1-bit state |
| `Adopt_Factored + AdEMAMix` | 625 MB | 6 small vectors + two 1-bit states |
| `Simplified_AdEMAMix` | 328 MB | Same as standard factored (no extra state) |

### Speed Comparison (SDXL, Batch Size 4)
| Optimizer | Speed | Notes |
|-----------|-------|-------|
| `Adafactor` | ~8.5 s/it | Baseline |
| `Adopt_Factored` | ~10 s/it | +18% overhead from compression |
| `Adopt_Factored + AdEMAMix` | ~12 s/it | +41% overhead (3 factored states) |

---

## 🧪 Available Optimizers

### Standard Optimizers (All support `factored=True/False`)
| Optimizer | Description | Best For |
|-----------|-------------|----------|
| `Adam_Adv` | Advanced Adam implementation | General purpose |
| `Adopt_Adv` | Adam variant with an independent beta2 | Stable training in small-batch regimes |
| `Prodigy_Adv` | Prodigy with D-Adaptation | Adam with automatic LR tuning |
| `Simplified_AdEMAMix` | Adam variant with accumulator momentum | Small- and large-batch training when tuned correctly |
| `Lion_Adv` | Advanced Lion implementation | Memory-constrained environments |
| `Prodigy_Lion_Adv` | Prodigy + Lion combination | Lion with automatic LR tuning |

---

## ⚙️ Feature Matrix

| Feature | Adam_Adv | Adopt_Adv | Prodigy_Adv | Simplified_AdEMAMix | Lion_Adv |
|---------|----------|-----------|-------------|---------------------|----------|
| Factored | ✓ | ✓ | ✓ | ✓ | ✓ |
| AdEMAMix | ✓ | ✓ | ✓ | ✗ | ✗ |
| Simplified_AdEMAMix | ✗ | ✓ | ✓ | ✓ | ✗ |
| OrthoGrad | ✓ | ✓ | ✓ | ✓ | ✓ |
| Grams | ✓ | ✓ | ✓ | ✗ | ✗ |
| Cautious | ✓ | ✓ | ✓ | ✗ | ✓ |
| atan2 | ✓ | ✓ | ✓ | ✗ | ✗ |
| Stochastic Rounding | ✓ | ✓ | ✓ | ✓ | ✓ |
| Fused Backward Pass | ✓ | ✓ | ✓ | ✓ | ✓ |
| **Kourkoutas-β** | ✓ | ✓ | ✓ | ✓ | ✗ |

---

## 🛠️ Comprehensive Feature Guide

### A. Universal Safe Features
*These features work with all optimizers and are generally safe to enable.*

| Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
|---------|-------------|-------------------|--------------------|-------------------|---------------|
| **Fused Back Pass** | Fuses the optimizer step into the backward pass; gradients are consumed immediately and their memory freed on the fly | Memory-constrained environments | Reduces peak memory | Memory optimization | All optimizers |
| **Stochastic Rounding** | Replaces nearest rounding with stochastic rounding to preserve small gradient updates in BF16 | BF16 training | Minimal overhead (<5%) | [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192) | All optimizers |
| **OrthoGrad** | Removes the gradient component parallel to the weights to reduce overfitting | Full fine-tuning without weight decay | +33% time overhead (BS=4); less at larger BS | [Grokking at the Edge](https://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability) | All optimizers |
| **Factored** | Memory-efficient optimization via rank-1 1-bit factorization of optimizer states | Large models / memory-limited hardware | Adds compression overhead | [SMMF](https://arxiv.org/abs/2412.08894) | All optimizers |
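The OrthoGrad row amounts to removing the component of the gradient that is parallel to the weight vector. A minimal sketch of the idea (illustrative, not the library's tensor implementation):

```python
# OrthoGrad idea (toy sketch): g_orth = g - ((g . w) / (w . w)) * w,
# so the surviving gradient is orthogonal to the weights and cannot
# grow or shrink the weight norm directly.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthograd(w, g, eps=1e-30):
    scale = dot(g, w) / (dot(w, w) + eps)  # projection coefficient
    return [gi - scale * wi for gi, wi in zip(g, w)]
```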

### B. Individual Features

| Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
|---------|-------------|-------------------|--------------------|-------------------|---------------|
| **Cautious** | Only applies an update component if the gradient direction aligns with the momentum direction | Accelerating convergence | No overhead | [C-Optim](https://github.com/kyleliang919/C-Optim) | Adam/Adopt/Prodigy/Lion |
| **Grams** | Update direction derived purely from the current gradient | When Cautious is insufficient | No overhead | [Grams](https://github.com/Gunale0926/Grams) | Adam/Adopt/Prodigy |
| **AdEMAMix** | Dual-EMA system that keeps gradients relevant over tens of thousands of steps | Long training runs, especially where model forgetting is a concern | +1 state memory | [AdEMAMix](https://arxiv.org/abs/2409.03137) | Adam/Adopt/Prodigy |
| **Simplified_AdEMAMix** | Accumulator-based momentum; single-EMA variant of AdEMAMix | All scenarios when tuned correctly | No overhead | [Connections](https://arxiv.org/abs/2502.02431) | Adam/Adopt/Prodigy |
| **atan2** | Robust epsilon replacement with built-in update clipping | Stable bounded updates (Adopt in particular needs this) | No overhead | [Adam-atan2](https://github.com/lucidrains/adam-atan2-pytorch) | Adam/Adopt/Prodigy |
| **Kourkoutas-β** | Layer-wise adaptive β₂ based on a gradient "sunspike" ratio | Noisy, small-batch, large-batch, or high-LR training | No overhead | [Kourkoutas-β](https://arxiv.org/abs/2508.12996) | Adam/Adopt/Prodigy/Simplified_AdEMAMix |

> **Note**: If both **Cautious** and **Grams** are enabled, **Grams takes precedence** and Cautious is disabled.
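The Cautious rule from the table can be sketched as a sign-agreement mask (an illustrative simplification of C-Optim, which also rescales the surviving components):

```python
# Cautious masking (toy sketch): zero out update components whose sign
# disagrees with the current gradient, leaving aligned components intact.

def cautious(update, grad):
    return [u if u * g > 0 else 0.0 for u, g in zip(update, grad)]
```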

---

## 🔍 Feature Deep Dives

### AdEMAMix

- Adds a **slow-decaying second EMA** (`beta3`) that retains gradient memory over tens of thousands of steps.
- Particularly effective for **small batch sizes**, where Adam's standard first moment is nearly useless.

#### Tunable Hyperparameters
| Parameter | Default | Tuning Guide |
|-----------|---------|--------------|
| `beta3` | 0.9999 | • Runs >120k steps: **0.9999**<br>• Runs ≤120k steps: **0.999** |
| `alpha` | 5 | • Reduce to **2–3** if diverging<br>• Increase to strengthen long-term memory |

> ✅ **Pro Tip**: Set `beta1=0` in Adam/Adopt/Prodigy to skip the standard EMA entirely and rely solely on AdEMAMix's slow EMA, ideal for small-batch regimes.
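In sketch form, the dual-EMA direction looks roughly like the following (an approximation of the scheme described above; the actual implementation also applies bias correction and the second-moment denominator):

```python
# AdEMAMix-style dual EMA (toy sketch): a fast EMA (beta1) tracks recent
# gradients, a slow EMA (beta3) remembers old ones, and alpha weights
# the slow term in the combined direction.

def ademamix_step(m_fast, m_slow, g, beta1=0.9, beta3=0.9999, alpha=5.0):
    m_fast = beta1 * m_fast + (1 - beta1) * g
    m_slow = beta3 * m_slow + (1 - beta3) * g
    return m_fast, m_slow, m_fast + alpha * m_slow
```

With `beta1=0` the fast EMA collapses to the raw gradient, which is what the Pro Tip above exploits.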

---

### Simplified_AdEMAMix

- Introduced in [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants (arXiv:2502.02431)](https://arxiv.org/abs/2502.02431).
- Replaces Adam's first moment with a **theory-based momentum** that emphasizes the raw gradient, combining the stability of long memory with responsiveness to recent gradients.
- **Key insight**: Classical momentum **does not accelerate** in noisy (small-batch) regimes; this accumulator does.

#### Tunable Hyperparameters
| Parameter | Default | Tuning Guide |
|-----------|---------|--------------|
| `beta1` | 0.99 | Controls accumulator memory length:<br>• Small BS: **0.99–0.9999**<br>• Large BS: **0.9** |
| `Grad α` | 100 | Most critical parameter:<br>• Inversely scales with batch size<br>• **100–10** for small BS (≤32)<br>• **1–0.1** for large BS (≥512) |

> ⚠️ **Critical**: Requires a **~100x smaller learning rate** than AdamW (e.g., 1e-6 vs 1e-4).
> For `Prodigy_Adv`, set `initial_d` to:
> - **LoRA**: `1e-8`
> - **Full FT**: `1e-10`
> - **Embedding**: `1e-7`

> ⚠️ **Incompatible** with: **Cautious**, **Grams**, **atan2**, and standard update clipping.
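In rough sketch form (an assumed reading of the accumulator described above; see arXiv:2502.02431 and the source for the exact update), the first moment becomes a pure accumulator and `Grad α` re-weights the raw gradient:

```python
# Simplified-AdEMAMix-style accumulator (toy sketch, assumed form):
# there is no (1 - beta1) factor, so the state grows toward roughly
# 1/(1 - beta1) times the gradient scale. That inflated magnitude is
# consistent with the much smaller learning rate required.

def accumulator_step(m, g, beta1=0.99, alpha_grad=100.0):
    m = beta1 * m + g             # accumulator, not a convex EMA
    return m, alpha_grad * g + m  # numerator of the update direction
```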

---

### atan2

- Replaces `eps` in Adam-family optimizers with a **scale-invariant**, bounded update rule.
- Automatically clips updates to **[-2, 2]**, preventing destabilizing jumps.
- **Highly recommended** for `Adopt_Adv`, which is prone to instability without clipping.
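A minimal sketch of the rule (illustrative; the reference implementation scales the arguments and result with constants, which is how the bound becomes roughly [-2, 2] rather than [-π/2, π/2]):

```python
import math

# atan2-style update (toy sketch): replaces m / (sqrt(v) + eps).
# The result is bounded no matter how small v gets, so a tiny second
# moment can no longer produce a huge step.

def atan2_update(m, v):
    # sqrt(v) >= 0, so the result lies in [-pi/2, pi/2]
    return math.atan2(m, math.sqrt(v))
```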

> 📚 **Reference**:
> - Paper: https://arxiv.org/abs/2407.05872
> - Code: https://github.com/lucidrains/adam-atan2-pytorch

---

### **Kourkoutas-β**

**Kourkoutas-β** introduces a **sunspike-driven, layer-wise adaptive second-moment decay (β₂)** as an optional enhancement for `Adam_Adv`, `Adopt_Adv`, `Prodigy_Adv`, and `Simplified_AdEMAMix`.

Instead of using a fixed β₂ (e.g., 0.999 or 0.95), it **dynamically modulates β₂ per layer** based on a bounded *sunspike ratio*:

- **During gradient bursts** → β₂ ↓ toward the lower β₂ bound → faster reaction
- **During calm phases** → β₂ ↑ toward the selected β₂ → stronger smoothing

This is especially effective for **noisy training, small batch sizes, and high learning rates**, where gradient norms shift abruptly due to noise or aggressive LR schedules.
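A sketch of the modulation (illustrative; the names, defaults, and the exact bounded ratio are assumptions here, not the kbeta implementation):

```python
# Kourkoutas-beta-style modulation (toy sketch): a bounded "sunspike"
# ratio pulls beta2 toward beta2_min during gradient bursts and lets it
# recover toward beta2_max during calm phases.

def kourkoutas_beta2(grad_norm, ema_norm, beta2_max=0.999, beta2_min=0.88):
    sunspike = grad_norm / (grad_norm + ema_norm + 1e-12)  # bounded in [0, 1)
    return beta2_max - (beta2_max - beta2_min) * sunspike
```

A spike in the current gradient norm relative to its running average drives the ratio toward 1 and β₂ toward its lower bound; a calm gradient drives the ratio toward 0 and β₂ back to its configured value.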

#### Pros/Cons

| **Category** | **Details** |
|--------------|-------------|
| ✅ **Pros** | • **Layer-wise adaptation** blends the benefits of high β₂ (strong smoothing) and low β₂ (fast reaction).<br>• **Robust to sudden loss-landscape shifts**: reacts quickly during gradient bursts, smooths during calm phases.<br>• **High tolerance to aggressive learning rates**. |
| ⚠️ **Cons** | • **Potentially unstable at the start of training** due to unreliable early gradient norms; mitigated by using `K-β Warmup Steps`. |

> 💡 **Best Practice**: Set `K_warmup_steps` equal to your standard LR warmup steps. During warmup, the optimizer uses the static `beta2`; adaptation begins only after warmup ends.

> 📚 **Reference**:
> - Paper: [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
> - Code: [kbeta](https://github.com/sck-at-ucy/kbeta)

---

## 📚 References

1. [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192)
2. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
3. [The AdEMAMix Optimizer](https://arxiv.org/abs/2409.03137)
4. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD](https://arxiv.org/abs/2502.02431)
5. [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
6. [Scaling Exponents Across Parameterizations and Optimizers](https://arxiv.org/abs/2407.05872)
adv_optm-1.2.2/README.md
ADDED
@@ -0,0 +1,191 @@
(Content identical to the markdown description embedded in adv_optm-1.2.2/PKG-INFO above, minus the metadata headers.)
adv_optm-1.2.2/adv_optm/__init__.py
ADDED
@@ -0,0 +1,23 @@
from .optim import (
    AdamW_adv,
    Prodigy_adv,
    Adopt_adv,
    Simplified_AdEMAMix,
    Lion_adv,
    Lion_Prodigy_adv,
    Muon_adv,
    AdaMuon_adv,
)

__all__ = [
    "AdamW_adv",
    "Prodigy_adv",
    "Adopt_adv",
    "Simplified_AdEMAMix",
    "Lion_adv",
    "Lion_Prodigy_adv",
    "Muon_adv",
    "AdaMuon_adv",
]

__version__ = "1.2.2"