adv-optm 1.2.2__tar.gz → 2.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (38)
  1. {adv_optm-1.2.2 → adv_optm-2.1.0}/PKG-INFO +39 -13
  2. {adv_optm-1.2.2 → adv_optm-2.1.0}/README.md +38 -12
  3. {adv_optm-1.2.2 → adv_optm-2.1.0}/adv_optm/__init__.py +3 -1
  4. adv_optm-2.1.0/adv_optm/optim/AdaMuon_adv.py +544 -0
  5. {adv_optm-1.2.2 → adv_optm-2.1.0}/adv_optm/optim/AdamW_adv.py +132 -121
  6. {adv_optm-1.2.2 → adv_optm-2.1.0}/adv_optm/optim/Adopt_adv.py +151 -152
  7. {adv_optm-1.2.2 → adv_optm-2.1.0}/adv_optm/optim/Lion_Prodigy_adv.py +143 -102
  8. {adv_optm-1.2.2 → adv_optm-2.1.0}/adv_optm/optim/Lion_adv.py +110 -71
  9. adv_optm-2.1.0/adv_optm/optim/Muon_adv.py +482 -0
  10. {adv_optm-1.2.2 → adv_optm-2.1.0}/adv_optm/optim/Prodigy_adv.py +172 -156
  11. adv_optm-2.1.0/adv_optm/optim/SignSGD_adv.py +245 -0
  12. {adv_optm-1.2.2 → adv_optm-2.1.0}/adv_optm/optim/Simplified_AdEMAMix.py +85 -64
  13. {adv_optm-1.2.2 → adv_optm-2.1.0}/adv_optm/optim/__init__.py +3 -1
  14. {adv_optm-1.2.2 → adv_optm-2.1.0}/adv_optm/util/Kourkoutas.py +72 -41
  15. adv_optm-2.1.0/adv_optm/util/Muon_AuxAdam.py +163 -0
  16. adv_optm-2.1.0/adv_optm/util/Muon_util.py +322 -0
  17. {adv_optm-1.2.2 → adv_optm-2.1.0}/adv_optm/util/OrthoGrad.py +9 -4
  18. adv_optm-2.1.0/adv_optm/util/__init__.py +0 -0
  19. adv_optm-2.1.0/adv_optm/util/factorization_util.py +105 -0
  20. adv_optm-2.1.0/adv_optm/util/lion_k.py +53 -0
  21. adv_optm-2.1.0/adv_optm/util/param_update.py +164 -0
  22. adv_optm-2.1.0/adv_optm/util/update_util.py +24 -0
  23. {adv_optm-1.2.2 → adv_optm-2.1.0}/adv_optm.egg-info/PKG-INFO +39 -13
  24. {adv_optm-1.2.2 → adv_optm-2.1.0}/adv_optm.egg-info/SOURCES.txt +8 -6
  25. {adv_optm-1.2.2 → adv_optm-2.1.0}/setup.py +1 -1
  26. adv_optm-1.2.2/adv_optm/optim/AdaMuon_adv.py +0 -729
  27. adv_optm-1.2.2/adv_optm/optim/Muon_adv.py +0 -730
  28. adv_optm-1.2.2/adv_optm/util/BF16_Stochastic_Rounding.py +0 -65
  29. adv_optm-1.2.2/adv_optm/util/Effective_Shape.py +0 -8
  30. adv_optm-1.2.2/adv_optm/util/NNMF.py +0 -18
  31. adv_optm-1.2.2/adv_optm/util/Newton_Schulz.py +0 -87
  32. adv_optm-1.2.2/adv_optm/util/One_Bit_Boolean.py +0 -22
  33. adv_optm-1.2.2/adv_optm/util/__init__.py +0 -13
  34. {adv_optm-1.2.2 → adv_optm-2.1.0}/LICENSE +0 -0
  35. {adv_optm-1.2.2 → adv_optm-2.1.0}/adv_optm.egg-info/dependency_links.txt +0 -0
  36. {adv_optm-1.2.2 → adv_optm-2.1.0}/adv_optm.egg-info/requires.txt +0 -0
  37. {adv_optm-1.2.2 → adv_optm-2.1.0}/adv_optm.egg-info/top_level.txt +0 -0
  38. {adv_optm-1.2.2 → adv_optm-2.1.0}/setup.cfg +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: adv_optm
- Version: 1.2.2
+ Version: 2.1.0
  Summary: A family of highly efficient, lightweight yet powerful optimizers.
  Home-page: https://github.com/Koratahiu/Advanced_Optimizers
  Author: Koratahiu
@@ -35,6 +35,32 @@ A comprehensive, all-in-one collection of optimization algorithms for deep learn
 
  [![PyPI](https://img.shields.io/pypi/v/adv_optm)](https://pypi.org/project/adv_optm/)
 
+ ## 🔥 What's New
+
+ ### in 2.0.x
+
+ * Implemented `torch.compile` support for all advanced optimizers. Enable it via `compiled_optimizer=True` to fuse and optimize the optimizer step path.
+ * Improved 1-bit factored mode, enabled via `nnmf_factor=True`.
+ * Various improvements across the optimizers.
+
+ ### in 1.2.x
+ * Added **advanced variants** of the [Muon optimizer](https://kellerjordan.github.io/posts/muon/) with **features** and **settings** from recent papers.
+
+ | Optimizer | Description |
+ |---|---|
+ | `Muon_adv` | Advanced Muon implementation with CANS, NorMuon, low-rank orthogonalization, and other features. |
+ | `AdaMuon_adv` | Advanced AdaMuon implementation, combining Muon's geometry with Adam-like adaptive scaling and sign-based orthogonalization. |
+
+ > *Documentation coming soon.*
+
+ * Implemented [Cautious Weight Decay](https://arxiv.org/abs/2510.12402) for all advanced optimizers.
+
+ * Improved parameter updates and weight decay for **BF16** with **stochastic rounding**: updates are now accumulated in **float32** and rounded once at the end.
+
+ * Fused and in-place operations are now used wherever possible across all advanced optimizers.
+
+ * **Prodigy variants** are now **50% faster** by [avoiding CUDA syncs](https://github.com/Koratahiu/Advanced_Optimizers/pull/5). Thanks to **@dxqb**!
+
  ---
 
  ## 📦 Installation
@@ -52,7 +78,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
  ### **Memory-Efficient Optimization (SMMF-inspired)**
  - **Paper**: [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
  - **Approach**: Uses rank-1 non-negative matrix factorization with reconstruction cycle (factor → reconstruct → update → factor)
- - **Innovation**:
+ - **Innovation**:
  - First moment split into **1-bit sign + absolute value**
  - Final storage: **four factored vectors + one 1-bit sign state**
  - Preserves Adam-like update quality with drastically reduced memory
@@ -110,7 +136,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
 
  ## 🛠️ Comprehensive Feature Guide
 
- ### A. Universal Safe Features
+ ### A. Universal Safe Features
  *These features work with all optimizers and are generally safe to enable.*
 
  | Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
@@ -164,7 +190,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
  | `beta1` | 0.99 | Controls accumulator memory length:<br>• Small BS: **0.99–0.9999**<br>• Large BS: **0.9** |
  | `Grad α` | 100 | Most critical parameter:<br>• Inversely scales with batch size<br>• **100–10** for small BS (≤32)<br>• **1–0.1** for large BS (≥512) |
 
- > ⚠️ **Critical**: Requires **~100x smaller learning rate** than AdamW (e.g., 1e-6 vs 1e-4).
+ > ⚠️ **Critical**: Requires **~100x smaller learning rate** than AdamW (e.g., 1e-6 vs 1e-4).
  > For `Prodigy_Adv`, set `initial_d` to:
  > - **LoRA**: `1e-8`
  > - **Full FT**: `1e-10`
@@ -180,7 +206,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
  - Automatically clips updates to **[-2, 2]**, preventing destabilizing jumps.
  - **Highly recommended** for `Adopt_Adv`, which is prone to instability without clipping.
 
- > 📚 **Reference**:
+ > 📚 **Reference**:
  > - Paper: https://arxiv.org/abs/2407.05872
  > - Code: https://github.com/lucidrains/adam-atan2-pytorch
 
@@ -192,8 +218,8 @@ This library integrates multiple state-of-the-art optimization techniques valida
 
  Instead of using a fixed β₂ (e.g., 0.999 or 0.95), it **dynamically modulates β₂ per layer** based on a bounded *sunspike ratio*:
 
- - **During gradient bursts** → β₂ ↓ toward `Lower β₂` → faster reaction
- - **During calm phases** → β₂ ↑ toward `The Selected β₂` → stronger smoothing
+ - **During gradient bursts** → β₂ ↓ toward `Lower β₂` → faster reaction
+ - **During calm phases** → β₂ ↑ toward `The Selected β₂` → stronger smoothing
 
  This is especially effective for **noisy training, small batch sizes, and high learning rates**, where gradient norms shift abruptly due to noise or aggressive LR schedules.
 
@@ -206,17 +232,17 @@ This is especially effective for **noisy training, small batch sizes, and high l
 
  > 💡 **Best Practice**: Set `K_warmup_steps` equal to your standard LR warmup steps. During warmup, the optimizer uses the static `beta2`; adaptation begins only after warmup ends.
 
- > 📚 **Reference**:
- > - Paper: [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
+ > 📚 **Reference**:
+ > - Paper: [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
  > - Code: [kbeta](https://github.com/sck-at-ucy/kbeta)
 
  ---
 
  ## 📚 References
 
- 1. [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192)
- 2. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
- 3. [The AdEMAMix Optimizer](https://arxiv.org/abs/2409.03137)
- 4. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD](https://arxiv.org/abs/2502.02431)
+ 1. [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192)
+ 2. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
+ 3. [The AdEMAMix Optimizer](https://arxiv.org/abs/2409.03137)
+ 4. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD](https://arxiv.org/abs/2502.02431)
  6. [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
  7. [Scaling Exponents Across Parameterizations and Optimizers](https://arxiv.org/abs/2407.05872)
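
The release notes above name two new switches, `compiled_optimizer=True` and `nnmf_factor=True`, without showing a call site. Below is a minimal usage sketch, assuming the advanced optimizers accept these as constructor keyword arguments; only the option names and the `AdamW_adv` module (file 5 in the list above) come from this diff, everything else is illustrative.

```python
import torch
from adv_optm import AdamW_adv  # assumed top-level export; AdamW_adv ships in adv_optm/optim per the file list

model = torch.nn.Linear(128, 64)

# Hypothetical constructor call: the keyword names are quoted from the What's New
# notes, but treating them as constructor kwargs (and the lr value) is an assumption.
opt = AdamW_adv(
    model.parameters(),
    lr=1e-4,
    compiled_optimizer=True,  # fuse/optimize the optimizer step path via torch.compile
    nnmf_factor=True,         # 1-bit factored (SMMF-style) optimizer state
)

loss = model(torch.randn(8, 128)).pow(2).mean()
loss.backward()
opt.step()
opt.zero_grad()
```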
@@ -4,6 +4,32 @@ A comprehensive, all-in-one collection of optimization algorithms for deep learn
 
  [![PyPI](https://img.shields.io/pypi/v/adv_optm)](https://pypi.org/project/adv_optm/)
 
+ ## 🔥 What's New
+
+ ### in 2.0.x
+
+ * Implemented `torch.compile` support for all advanced optimizers. Enable it via `compiled_optimizer=True` to fuse and optimize the optimizer step path.
+ * Improved 1-bit factored mode, enabled via `nnmf_factor=True`.
+ * Various improvements across the optimizers.
+
+ ### in 1.2.x
+ * Added **advanced variants** of the [Muon optimizer](https://kellerjordan.github.io/posts/muon/) with **features** and **settings** from recent papers.
+
+ | Optimizer | Description |
+ |---|---|
+ | `Muon_adv` | Advanced Muon implementation with CANS, NorMuon, low-rank orthogonalization, and other features. |
+ | `AdaMuon_adv` | Advanced AdaMuon implementation, combining Muon's geometry with Adam-like adaptive scaling and sign-based orthogonalization. |
+
+ > *Documentation coming soon.*
+
+ * Implemented [Cautious Weight Decay](https://arxiv.org/abs/2510.12402) for all advanced optimizers.
+
+ * Improved parameter updates and weight decay for **BF16** with **stochastic rounding**: updates are now accumulated in **float32** and rounded once at the end.
+
+ * Fused and in-place operations are now used wherever possible across all advanced optimizers.
+
+ * **Prodigy variants** are now **50% faster** by [avoiding CUDA syncs](https://github.com/Koratahiu/Advanced_Optimizers/pull/5). Thanks to **@dxqb**!
+
  ---
 
  ## 📦 Installation
@@ -21,7 +47,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
  ### **Memory-Efficient Optimization (SMMF-inspired)**
  - **Paper**: [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
  - **Approach**: Uses rank-1 non-negative matrix factorization with reconstruction cycle (factor → reconstruct → update → factor)
- - **Innovation**:
+ - **Innovation**:
  - First moment split into **1-bit sign + absolute value**
  - Final storage: **four factored vectors + one 1-bit sign state**
  - Preserves Adam-like update quality with drastically reduced memory
@@ -79,7 +105,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
 
  ## 🛠️ Comprehensive Feature Guide
 
- ### A. Universal Safe Features
+ ### A. Universal Safe Features
  *These features work with all optimizers and are generally safe to enable.*
 
  | Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
@@ -133,7 +159,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
  | `beta1` | 0.99 | Controls accumulator memory length:<br>• Small BS: **0.99–0.9999**<br>• Large BS: **0.9** |
  | `Grad α` | 100 | Most critical parameter:<br>• Inversely scales with batch size<br>• **100–10** for small BS (≤32)<br>• **1–0.1** for large BS (≥512) |
 
- > ⚠️ **Critical**: Requires **~100x smaller learning rate** than AdamW (e.g., 1e-6 vs 1e-4).
+ > ⚠️ **Critical**: Requires **~100x smaller learning rate** than AdamW (e.g., 1e-6 vs 1e-4).
  > For `Prodigy_Adv`, set `initial_d` to:
  > - **LoRA**: `1e-8`
  > - **Full FT**: `1e-10`
@@ -149,7 +175,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
  - Automatically clips updates to **[-2, 2]**, preventing destabilizing jumps.
  - **Highly recommended** for `Adopt_Adv`, which is prone to instability without clipping.
 
- > 📚 **Reference**:
+ > 📚 **Reference**:
  > - Paper: https://arxiv.org/abs/2407.05872
  > - Code: https://github.com/lucidrains/adam-atan2-pytorch
 
@@ -161,8 +187,8 @@ This library integrates multiple state-of-the-art optimization techniques valida
 
  Instead of using a fixed β₂ (e.g., 0.999 or 0.95), it **dynamically modulates β₂ per layer** based on a bounded *sunspike ratio*:
 
- - **During gradient bursts** → β₂ ↓ toward `Lower β₂` → faster reaction
- - **During calm phases** → β₂ ↑ toward `The Selected β₂` → stronger smoothing
+ - **During gradient bursts** → β₂ ↓ toward `Lower β₂` → faster reaction
+ - **During calm phases** → β₂ ↑ toward `The Selected β₂` → stronger smoothing
 
  This is especially effective for **noisy training, small batch sizes, and high learning rates**, where gradient norms shift abruptly due to noise or aggressive LR schedules.
 
@@ -175,17 +201,17 @@ This is especially effective for **noisy training, small batch sizes, and high l
 
  > 💡 **Best Practice**: Set `K_warmup_steps` equal to your standard LR warmup steps. During warmup, the optimizer uses the static `beta2`; adaptation begins only after warmup ends.
 
- > 📚 **Reference**:
- > - Paper: [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
+ > 📚 **Reference**:
+ > - Paper: [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
  > - Code: [kbeta](https://github.com/sck-at-ucy/kbeta)
 
  ---
 
  ## 📚 References
 
- 1. [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192)
- 2. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
- 3. [The AdEMAMix Optimizer](https://arxiv.org/abs/2409.03137)
- 4. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD](https://arxiv.org/abs/2502.02431)
+ 1. [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192)
+ 2. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
+ 3. [The AdEMAMix Optimizer](https://arxiv.org/abs/2409.03137)
+ 4. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD](https://arxiv.org/abs/2502.02431)
  6. [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
  7. [Scaling Exponents Across Parameterizations and Optimizers](https://arxiv.org/abs/2407.05872)
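
The Kourkoutas-β hunks above describe the β₂ modulation only qualitatively. The toy sketch below reproduces just that behaviour (a bounded spike ratio pulls β₂ toward a lower bound during gradient bursts and lets it recover toward the selected β₂ in calm phases); the spike formula, the EMA tracker, and the `beta2_low` bound are illustrative assumptions, not the exact rule used by the paper or this package.

```python
def dynamic_beta2(grad_norm, norm_ema, beta2_max=0.999, beta2_low=0.88,
                  ema_decay=0.95, eps=1e-12):
    """Toy per-layer beta2 modulation driven by a bounded spike ratio in [0, 1)."""
    spike = max(0.0, grad_norm - norm_ema) / (grad_norm + eps)  # ~0 when calm, ->1 on bursts
    beta2 = beta2_max - (beta2_max - beta2_low) * spike         # bursts pull beta2 toward beta2_low
    norm_ema = ema_decay * norm_ema + (1.0 - ema_decay) * grad_norm
    return beta2, norm_ema

# Calm steps keep beta2 near 0.999; the burst at step 3 drops it toward 0.88,
# and beta2 recovers once gradient norms settle again.
ema = 0.1
for step, g in enumerate([0.1, 0.1, 5.0, 0.1, 0.1], start=1):
    b2, ema = dynamic_beta2(g, ema)
    print(f"step {step}: grad_norm={g:.1f} beta2={b2:.4f}")
```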
@@ -7,6 +7,7 @@ from .optim import (
  Lion_Prodigy_adv,
  Muon_adv,
  AdaMuon_adv,
+ SignSGD_adv,
  )
 
  __all__ = [
@@ -18,6 +19,7 @@ __all__ = [
  "Lion_Prodigy_adv",
  "Muon_adv",
  "AdaMuon_adv",
+ "SignSGD_adv",
  ]
 
- __version__ = "1.2.2"
+ __version__ = "2.1.0"
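
The `__init__.py` hunks above add `SignSGD_adv` to the package's top-level imports and `__all__`. A minimal sketch of picking it up from the new release follows; only the class name and its export come from this diff, the constructor arguments are assumed.

```python
import torch
from adv_optm import SignSGD_adv  # exported at the package top level as of 2.1.0

model = torch.nn.Linear(32, 8)
# Only the class name comes from the diff; lr (and any other kwargs) are assumed.
opt = SignSGD_adv(model.parameters(), lr=1e-3)

model(torch.randn(4, 32)).sum().backward()
opt.step()
opt.zero_grad()
```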