adv-optm 1.2.dev9__tar.gz → 2.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (40)
  1. {adv_optm-1.2.dev9 → adv_optm-2.1.0}/PKG-INFO +45 -72
  2. {adv_optm-1.2.dev9 → adv_optm-2.1.0}/README.md +44 -71
  3. {adv_optm-1.2.dev9 → adv_optm-2.1.0}/adv_optm/__init__.py +3 -1
  4. adv_optm-2.1.0/adv_optm/optim/AdaMuon_adv.py +544 -0
  5. {adv_optm-1.2.dev9 → adv_optm-2.1.0}/adv_optm/optim/AdamW_adv.py +151 -133
  6. {adv_optm-1.2.dev9 → adv_optm-2.1.0}/adv_optm/optim/Adopt_adv.py +165 -155
  7. {adv_optm-1.2.dev9 → adv_optm-2.1.0}/adv_optm/optim/Lion_Prodigy_adv.py +149 -101
  8. {adv_optm-1.2.dev9 → adv_optm-2.1.0}/adv_optm/optim/Lion_adv.py +116 -70
  9. adv_optm-2.1.0/adv_optm/optim/Muon_adv.py +482 -0
  10. {adv_optm-1.2.dev9 → adv_optm-2.1.0}/adv_optm/optim/Prodigy_adv.py +185 -155
  11. adv_optm-2.1.0/adv_optm/optim/SignSGD_adv.py +245 -0
  12. {adv_optm-1.2.dev9 → adv_optm-2.1.0}/adv_optm/optim/Simplified_AdEMAMix.py +95 -67
  13. {adv_optm-1.2.dev9 → adv_optm-2.1.0}/adv_optm/optim/__init__.py +3 -1
  14. adv_optm-2.1.0/adv_optm/util/Kourkoutas.py +196 -0
  15. adv_optm-2.1.0/adv_optm/util/Muon_AuxAdam.py +163 -0
  16. adv_optm-2.1.0/adv_optm/util/Muon_util.py +322 -0
  17. {adv_optm-1.2.dev9 → adv_optm-2.1.0}/adv_optm/util/OrthoGrad.py +9 -4
  18. adv_optm-2.1.0/adv_optm/util/__init__.py +0 -0
  19. adv_optm-2.1.0/adv_optm/util/factorization_util.py +105 -0
  20. adv_optm-2.1.0/adv_optm/util/lion_k.py +53 -0
  21. adv_optm-2.1.0/adv_optm/util/param_update.py +164 -0
  22. adv_optm-2.1.0/adv_optm/util/update_util.py +24 -0
  23. {adv_optm-1.2.dev9 → adv_optm-2.1.0}/adv_optm.egg-info/PKG-INFO +45 -72
  24. {adv_optm-1.2.dev9 → adv_optm-2.1.0}/adv_optm.egg-info/SOURCES.txt +8 -7
  25. {adv_optm-1.2.dev9 → adv_optm-2.1.0}/setup.py +1 -1
  26. adv_optm-1.2.dev9/adv_optm/optim/AdaMuon_adv.py +0 -473
  27. adv_optm-1.2.dev9/adv_optm/optim/Muon_adv.py +0 -503
  28. adv_optm-1.2.dev9/adv_optm/util/BF16_Stochastic_Rounding.py +0 -47
  29. adv_optm-1.2.dev9/adv_optm/util/Effective_Shape.py +0 -8
  30. adv_optm-1.2.dev9/adv_optm/util/Kourkoutas.py +0 -194
  31. adv_optm-1.2.dev9/adv_optm/util/MuonAdam_helper.py +0 -32
  32. adv_optm-1.2.dev9/adv_optm/util/NNMF.py +0 -18
  33. adv_optm-1.2.dev9/adv_optm/util/Newton_Schulz.py +0 -48
  34. adv_optm-1.2.dev9/adv_optm/util/One_Bit_Boolean.py +0 -22
  35. adv_optm-1.2.dev9/adv_optm/util/__init__.py +0 -12
  36. {adv_optm-1.2.dev9 → adv_optm-2.1.0}/LICENSE +0 -0
  37. {adv_optm-1.2.dev9 → adv_optm-2.1.0}/adv_optm.egg-info/dependency_links.txt +0 -0
  38. {adv_optm-1.2.dev9 → adv_optm-2.1.0}/adv_optm.egg-info/requires.txt +0 -0
  39. {adv_optm-1.2.dev9 → adv_optm-2.1.0}/adv_optm.egg-info/top_level.txt +0 -0
  40. {adv_optm-1.2.dev9 → adv_optm-2.1.0}/setup.cfg +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: adv_optm
- Version: 1.2.dev9
+ Version: 2.1.0
  Summary: A family of highly efficient, lightweight yet powerful optimizers.
  Home-page: https://github.com/Koratahiu/Advanced_Optimizers
  Author: Koratahiu
@@ -35,6 +35,32 @@ A comprehensive, all-in-one collection of optimization algorithms for deep learn
 
  [![PyPI](https://img.shields.io/pypi/v/adv_optm)](https://pypi.org/project/adv_optm/)
 
+ ## 🔥 What's New
+
+ ### in 2.0.x
+
+ * Implemented torch.compile for all advanced optimizers. Enable it via `compiled_optimizer=True` to fuse and optimize the optimizer step path.
+ * Improved 1-bit factored mode via `nnmf_factor=True`.
+ * Various improvements across the optimizers.
+
+ ### in 1.2.x
+ * Added **advanced variants** of the [Muon optimizer](https://kellerjordan.github.io/posts/muon/) with **features** and **settings** from recent papers.
+
+ | Optimizer | Description |
+ |---|---|
+ | `Muon_adv` | Advanced Muon implementation with CANS, NorMuon, low-rank orthogonalization, and other features. |
+ | `AdaMuon_adv` | Advanced AdaMuon implementation, which combines Muon's geometry with Adam-like adaptive scaling and sign-based orthogonalization. |
+
+ > *Documentation coming soon.*
+
+ * Implemented [Cautious Weight Decay](https://arxiv.org/abs/2510.12402) for all advanced optimizers.
+
+ * Improved parameter updates and weight decay for **BF16** with **stochastic rounding**: updates are now accumulated in **float32** and rounded once at the end.
+
+ * Fused and in-place operations are now used wherever possible across all advanced optimizers.
+
+ * **Prodigy variants** are now **50% faster** by [avoiding CUDA syncs](https://github.com/Koratahiu/Advanced_Optimizers/pull/5). Thanks to **@dxqb**!
+
  ---
 
  ## 📦 Installation
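The BF16 bullet above describes an accumulate-in-float32, round-once pattern. Below is a minimal sketch of that pattern; the helper names and the exact step composition are illustrative assumptions, not the library's `param_update` code.

```python
import torch

def stochastic_round_to_bf16(x_fp32: torch.Tensor) -> torch.Tensor:
    """Unbiased stochastic rounding of float32 to bfloat16: add 16 bits of
    random noise to the float32 bit pattern, then truncate the low 16 bits."""
    bits = x_fp32.view(torch.int32)
    noise = torch.randint_like(bits, 0, 1 << 16)
    return ((bits + noise) & -65536).view(torch.float32).to(torch.bfloat16)

@torch.no_grad()
def apply_update_bf16(param_bf16, update_fp32, lr, weight_decay):
    # Accumulate weight decay and the optimizer update in float32,
    # then round back to BF16 exactly once at the end of the step.
    p32 = param_bf16.float()
    p32.mul_(1.0 - lr * weight_decay).add_(update_fp32, alpha=-lr)
    param_bf16.copy_(stochastic_round_to_bf16(p32))
```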
@@ -52,7 +78,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
  ### **Memory-Efficient Optimization (SMMF-inspired)**
  - **Paper**: [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
  - **Approach**: Uses rank-1 non-negative matrix factorization with reconstruction cycle (factor → reconstruct → update → factor)
- - **Innovation**:
+ - **Innovation**:
  - First moment split into **1-bit sign + absolute value**
  - Final storage: **four factored vectors + one 1-bit sign state**
  - Preserves Adam-like update quality with drastically reduced memory
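To make the factor → reconstruct → update → factor cycle above concrete, here is a rough sketch using an Adafactor-style rank-1 reconstruction plus a separate sign tensor. It only illustrates the storage idea (two non-negative vectors + a 1-bit sign state); SMMF's actual factorization and this package's `factorization_util` differ in the details.

```python
import torch

def factor(m: torch.Tensor):
    """Compress a 2-D moment into a sign state plus two non-negative vectors."""
    sign = m >= 0                                   # 1-bit-per-element sign state
    a = m.abs()
    r = a.mean(dim=1)                               # row factor
    c = a.mean(dim=0) / a.mean().clamp_min(1e-30)   # normalized column factor
    return sign, r, c

def reconstruct(sign, r, c):
    """Rank-1 approximation of |M| with the stored signs reapplied."""
    return (sign.float() * 2 - 1) * torch.outer(r, c)

# One optimizer-style cycle: reconstruct -> update with the new gradient -> re-factor.
g = torch.randn(128, 64)
sign, r, c = factor(torch.zeros_like(g))                  # initial (zero) first moment
m = reconstruct(sign, r, c).mul_(0.9).add_(g, alpha=0.1)  # EMA update in full precision
sign, r, c = factor(m)                                    # store only sign + two vectors
```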
@@ -110,7 +136,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
 
  ## 🛠️ Comprehensive Feature Guide
 
- ### A. Universal Safe Features
+ ### A. Universal Safe Features
  *These features work with all optimizers and are generally safe to enable.*
 
  | Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
@@ -141,7 +167,6 @@ This library integrates multiple state-of-the-art optimization techniques valida
 
  - Adds a **slow-decaying second EMA** (`beta3`) that retains gradient memory over tens of thousands of steps.
  - Particularly effective for **small batch sizes**, where Adam’s standard first moment is nearly useless.
- - **Reference**: [AdaMeM: Memory Efficient Momentum for Adafactor](https://openreview.net/forum?id=fZqMVTz7K5)
 
  #### Tunable Hyperparameters
  | Parameter | Default | Tuning Guide |
@@ -156,7 +181,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
  ### Simplified_AdEMAMix
 
  - Introduced in [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants (arXiv:2502.02431)](https://arxiv.org/abs/2502.02431).
- - Replaces Adam’s first moment with a **gradient accumulator**, combining the stability of long memory with responsiveness to recent gradients.
+ - Replaces Adam’s first moment with a **theory-based momentum** that emphasizes the raw gradient, combining the stability of long memory with responsiveness to recent gradients.
  - **Key insight**: Classical momentum **does not accelerate** in noisy (small-batch) regimes; this accumulator does.
 
  #### Tunable Hyperparameters
@@ -165,26 +190,13 @@ This library integrates multiple state-of-the-art optimization techniques valida
  | `beta1` | 0.99 | Controls accumulator memory length:<br>• Small BS: **0.99–0.9999**<br>• Large BS: **0.9** |
  | `Grad α` | 100 | Most critical parameter:<br>• Inversely scales with batch size<br>• **100–10** for small BS (≤32)<br>• **1–0.1** for large BS (≥512) |
 
- > ⚠️ **Critical**: Requires **~100x smaller learning rate** than AdamW (e.g., 1e-6 vs 1e-4).
+ > ⚠️ **Critical**: Requires **~100x smaller learning rate** than AdamW (e.g., 1e-6 vs 1e-4).
  > For `Prodigy_Adv`, set `initial_d` to:
  > - **LoRA**: `1e-8`
  > - **Full FT**: `1e-10`
  > - **Embedding**: `1e-7`
 
- > ⚠️ **Incompatible** with: **Cautious**, **Grams**, **atan2**, and standard gradient clipping.
-
- #### Performance Validation
-
- **Small Batch Training (SDXL, BS=2, 1.8K steps)**
- ![Training Comparison](https://github.com/user-attachments/assets/7eff0671-cc59-47fc-8b63-d5205456d649)
-
- - **🟢 Prodigy_Adv** (beta1=0.9, d0=1e-5): Final LR = 2.9e-4
- - **🔵 Prodigy_Adv + Simplified_AdEMAMix** (beta1=0.99, α=100, d0=1e-7): Final LR = 5.8e-6
-
- **Results**:
- - Faster convergence and higher final performance with Simplified_AdEMAMix
- - D-Adaptation automatically compensates for aggressive updates
- - Generated samples show **significantly better quality**
+ > ⚠️ **Incompatible** with: **Cautious**, **Grams**, **atan2**, and standard update clipping.
 
  ---
 
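One plausible reading of the description above, an unnormalized long-memory accumulator plus an α-weighted raw-gradient term, is sketched below. The names, normalization, and the omission of bias correction are assumptions rather than the library's `Simplified_AdEMAMix` code, but the sketch shows why the learning rate must be roughly 100x smaller: with no `(1 - beta1)` factor the accumulator grows to about `1/(1 - beta1)` times the gradient scale.

```python
import torch

@torch.no_grad()
def simplified_ademamix_step(p, g, state, lr=1e-6, beta1=0.99, beta2=0.999,
                             alpha=100.0, eps=1e-8):
    # Hypothetical single-tensor step; bias correction omitted for brevity.
    m = state.setdefault("m", torch.zeros_like(p))   # unnormalized accumulator
    v = state.setdefault("v", torch.zeros_like(p))   # Adam-style second moment
    m.mul_(beta1).add_(g)                            # no (1 - beta1): |m| ~ |g| / (1 - beta1)
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
    update = m.add(g, alpha=alpha)                   # long memory + alpha * raw gradient
    p.addcdiv_(update, v.sqrt().add_(eps), value=-lr)
```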
@@ -194,6 +206,10 @@ This library integrates multiple state-of-the-art optimization techniques valida
  - Automatically clips updates to **[-2, 2]**, preventing destabilizing jumps.
  - **Highly recommended** for `Adopt_Adv`, which is prone to instability without clipping.
 
+ > 📚 **Reference**:
+ > - Paper: https://arxiv.org/abs/2407.05872
+ > - Code: https://github.com/lucidrains/adam-atan2-pytorch
+
 
  ---
 
  ### **Kourkoutas-β**
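For context on the [-2, 2] clipping and the reference added in the hunk above: the atan2 form replaces Adam's `m / (sqrt(v) + eps)` with a bounded `a * atan2(m, sqrt(v))`. A minimal sketch follows; the scale `a ≈ 1.27` (giving a bound of about `a·π/2 ≈ 2`) is taken as an assumption in line with the linked adam-atan2-pytorch code.

```python
import torch

def atan2_update(m_hat: torch.Tensor, v_hat: torch.Tensor, a: float = 1.27) -> torch.Tensor:
    """Bounded replacement for m_hat / (v_hat.sqrt() + eps): atan2 is finite even
    when v_hat is zero (no eps needed) and lies in (-a*pi/2, a*pi/2) ~ (-2, 2)."""
    return torch.atan2(m_hat, v_hat.sqrt()).mul_(a)

# Typical use inside an Adam-style step:
# param.add_(atan2_update(m_hat, v_hat), alpha=-lr)
```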
@@ -202,8 +218,8 @@ This library integrates multiple state-of-the-art optimization techniques valida
 
  Instead of using a fixed β₂ (e.g., 0.999 or 0.95), it **dynamically modulates β₂ per layer** based on a bounded *sunspike ratio*:
 
- - **During gradient bursts** → β₂ ↓ toward `Lower β₂` → faster reaction
- - **During calm phases** → β₂ ↑ toward `The Selected β₂` → stronger smoothing
+ - **During gradient bursts** → β₂ ↓ toward `Lower β₂` → faster reaction
+ - **During calm phases** → β₂ ↑ toward `The Selected β₂` → stronger smoothing
 
  This is especially effective for **noisy training, small batch sizes, and high learning rates**, where gradient norms shift abruptly due to noise or aggressive LR schedules.
 
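A rough sketch of the per-layer modulation described above, assuming a simple sunspike ratio of the current gradient norm to a running reference; the lower-bound value and the exact ratio, bounds, and warmup handling are illustrative assumptions and differ from the paper and the package's `Kourkoutas.py`.

```python
def kourkoutas_beta2(grad_norm: float, state: dict,
                     beta2_max: float = 0.999, beta2_min: float = 0.88,
                     decay: float = 0.99) -> float:
    """Per-layer beta2 from a bounded 'sunspike' ratio in [0, 1]:
    bursts (ratio -> 1) pull beta2 toward beta2_min (faster reaction),
    calm phases (ratio -> 0) keep it near the selected beta2_max."""
    # Running reference scale for this layer's gradient norms (assumed form).
    state["ref"] = max(decay * state.get("ref", 0.0), grad_norm)
    sunspike = grad_norm / (state["ref"] + 1e-12)    # bounded because ref >= grad_norm
    return beta2_max - sunspike * (beta2_max - beta2_min)

# Example: a sudden spike in one layer's gradient norm lowers its beta2 for that step.
layer_state = {}
for norm in (1.0, 1.1, 0.9, 8.0, 1.2):
    print(round(kourkoutas_beta2(norm, layer_state), 4))
```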
@@ -216,60 +232,17 @@ This is especially effective for **noisy training, small batch sizes, and high l
 
  > 💡 **Best Practice**: Set `K_warmup_steps` equal to your standard LR warmup steps. During warmup, the optimizer uses the static `beta2`; adaptation begins only after warmup ends.
 
- > 🔍 **Debugging Aid**: Enable `K_Logging` to monitor (min, max, mean) of dynamic β₂ values across layers every *N* steps.
-
- #### 📊 Performance Validation
-
- **ADAMW_ADV - full SDXL finetuning (aggressive LR: 3e-5) (BS=4, 2.5K steps)**
- <img width="1460" height="382" alt="image" src="https://github.com/user-attachments/assets/007f278a-fbac-4f3d-9cc7-274c3b959cdd" />
-
- - 🟣 Fixed `beta2=0.999`
- - 🟠 Auto K-beta
-
- **Observations:**
- - K-beta is clearly better and more robust/stable for high LRs.
-
- > 📚 **Reference**:
- > - Paper: [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
+ > 📚 **Reference**:
+ > - Paper: [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
  > - Code: [kbeta](https://github.com/sck-at-ucy/kbeta)
 
  ---
 
- ## Recommended Preset (Tested on LoRA/FT/Embedding)
-
- ```yaml
- Learning Rate: 1
- optimizer: PRODIGY_Adv
- settings:
- - beta1: 0.99 # Controls momentum decay, ~100-step effective memory. Adjust to 0.999 (1000 steps) or 0.9999 (10000 steps) based on training length and stability needs.
- - beta2: 0.999
- - kourkoutas_beta: True # For Kourkoutas-β
- - K-β Warmup Steps: 50 # Or 100, 200, depending on your run
- - Simplified_AdEMAMix: True
- - Grad α: 100
- - OrthoGrad: True
- - weight_decay: 0.0
- - initial_d:
- • LoRA: 1e-8
- • Full fine-tune: 1e-10
- • Embedding: 1e-7
- - d_coef: 1
- - d_limiter: True # To stablizie Prodigy with Simplified_AdEMAMix
- - factored: False # Can be true or false, quality should not degrade due to Simplified_AdEMAMix’s high tolerance to 1-bit factorization.
- ```
-
- > ✅ **Why it works**:
- > - `Kourkoutas-β` handles beta2 values
- > - `Simplified_AdEMAMix` ensures responsiveness in small-batch noise
- > - `OrthoGrad` prevents overfitting without weight decay
-
- ---
-
  ## 📚 References
 
- 1. [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192)
- 2. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
- 3. [The AdEMAMix Optimizer](https://arxiv.org/abs/2409.03137)
- 4. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD](https://arxiv.org/abs/2502.02431)
- 5. [AdaMeM: Memory Efficient Momentum for Adafactor](https://openreview.net/forum?id=fZqMVTz7K5)
+ 1. [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192)
+ 2. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
+ 3. [The AdEMAMix Optimizer](https://arxiv.org/abs/2409.03137)
+ 4. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD](https://arxiv.org/abs/2502.02431)
  6. [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
+ 7. [Scaling Exponents Across Parameterizations and Optimizers](https://arxiv.org/abs/2407.05872)
@@ -4,6 +4,32 @@ A comprehensive, all-in-one collection of optimization algorithms for deep learn
 
  [![PyPI](https://img.shields.io/pypi/v/adv_optm)](https://pypi.org/project/adv_optm/)
 
+ ## 🔥 What's New
+
+ ### in 2.0.x
+
+ * Implemented torch.compile for all advanced optimizers. Enable it via `compiled_optimizer=True` to fuse and optimize the optimizer step path.
+ * Improved 1-bit factored mode via `nnmf_factor=True`.
+ * Various improvements across the optimizers.
+
+ ### in 1.2.x
+ * Added **advanced variants** of the [Muon optimizer](https://kellerjordan.github.io/posts/muon/) with **features** and **settings** from recent papers.
+
+ | Optimizer | Description |
+ |---|---|
+ | `Muon_adv` | Advanced Muon implementation with CANS, NorMuon, low-rank orthogonalization, and other features. |
+ | `AdaMuon_adv` | Advanced AdaMuon implementation, which combines Muon's geometry with Adam-like adaptive scaling and sign-based orthogonalization. |
+
+ > *Documentation coming soon.*
+
+ * Implemented [Cautious Weight Decay](https://arxiv.org/abs/2510.12402) for all advanced optimizers.
+
+ * Improved parameter updates and weight decay for **BF16** with **stochastic rounding**: updates are now accumulated in **float32** and rounded once at the end.
+
+ * Fused and in-place operations are now used wherever possible across all advanced optimizers.
+
+ * **Prodigy variants** are now **50% faster** by [avoiding CUDA syncs](https://github.com/Koratahiu/Advanced_Optimizers/pull/5). Thanks to **@dxqb**!
+
  ---
 
  ## 📦 Installation
@@ -21,7 +47,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
  ### **Memory-Efficient Optimization (SMMF-inspired)**
  - **Paper**: [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
  - **Approach**: Uses rank-1 non-negative matrix factorization with reconstruction cycle (factor → reconstruct → update → factor)
- - **Innovation**:
+ - **Innovation**:
  - First moment split into **1-bit sign + absolute value**
  - Final storage: **four factored vectors + one 1-bit sign state**
  - Preserves Adam-like update quality with drastically reduced memory
@@ -79,7 +105,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
 
  ## 🛠️ Comprehensive Feature Guide
 
- ### A. Universal Safe Features
+ ### A. Universal Safe Features
  *These features work with all optimizers and are generally safe to enable.*
 
  | Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
@@ -110,7 +136,6 @@ This library integrates multiple state-of-the-art optimization techniques valida
 
  - Adds a **slow-decaying second EMA** (`beta3`) that retains gradient memory over tens of thousands of steps.
  - Particularly effective for **small batch sizes**, where Adam’s standard first moment is nearly useless.
- - **Reference**: [AdaMeM: Memory Efficient Momentum for Adafactor](https://openreview.net/forum?id=fZqMVTz7K5)
 
  #### Tunable Hyperparameters
  | Parameter | Default | Tuning Guide |
@@ -125,7 +150,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
  ### Simplified_AdEMAMix
 
  - Introduced in [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants (arXiv:2502.02431)](https://arxiv.org/abs/2502.02431).
- - Replaces Adam’s first moment with a **gradient accumulator**, combining the stability of long memory with responsiveness to recent gradients.
+ - Replaces Adam’s first moment with a **theory-based momentum** that emphasizes the raw gradient, combining the stability of long memory with responsiveness to recent gradients.
  - **Key insight**: Classical momentum **does not accelerate** in noisy (small-batch) regimes; this accumulator does.
 
  #### Tunable Hyperparameters
@@ -134,26 +159,13 @@ This library integrates multiple state-of-the-art optimization techniques valida
  | `beta1` | 0.99 | Controls accumulator memory length:<br>• Small BS: **0.99–0.9999**<br>• Large BS: **0.9** |
  | `Grad α` | 100 | Most critical parameter:<br>• Inversely scales with batch size<br>• **100–10** for small BS (≤32)<br>• **1–0.1** for large BS (≥512) |
 
- > ⚠️ **Critical**: Requires **~100x smaller learning rate** than AdamW (e.g., 1e-6 vs 1e-4).
+ > ⚠️ **Critical**: Requires **~100x smaller learning rate** than AdamW (e.g., 1e-6 vs 1e-4).
  > For `Prodigy_Adv`, set `initial_d` to:
  > - **LoRA**: `1e-8`
  > - **Full FT**: `1e-10`
  > - **Embedding**: `1e-7`
 
- > ⚠️ **Incompatible** with: **Cautious**, **Grams**, **atan2**, and standard gradient clipping.
-
- #### Performance Validation
-
- **Small Batch Training (SDXL, BS=2, 1.8K steps)**
- ![Training Comparison](https://github.com/user-attachments/assets/7eff0671-cc59-47fc-8b63-d5205456d649)
-
- - **🟢 Prodigy_Adv** (beta1=0.9, d0=1e-5): Final LR = 2.9e-4
- - **🔵 Prodigy_Adv + Simplified_AdEMAMix** (beta1=0.99, α=100, d0=1e-7): Final LR = 5.8e-6
-
- **Results**:
- - Faster convergence and higher final performance with Simplified_AdEMAMix
- - D-Adaptation automatically compensates for aggressive updates
- - Generated samples show **significantly better quality**
+ > ⚠️ **Incompatible** with: **Cautious**, **Grams**, **atan2**, and standard update clipping.
 
  ---
 
@@ -163,6 +175,10 @@ This library integrates multiple state-of-the-art optimization techniques valida
  - Automatically clips updates to **[-2, 2]**, preventing destabilizing jumps.
  - **Highly recommended** for `Adopt_Adv`, which is prone to instability without clipping.
 
+ > 📚 **Reference**:
+ > - Paper: https://arxiv.org/abs/2407.05872
+ > - Code: https://github.com/lucidrains/adam-atan2-pytorch
+
 
  ---
 
  ### **Kourkoutas-β**
@@ -171,8 +187,8 @@ This library integrates multiple state-of-the-art optimization techniques valida
 
  Instead of using a fixed β₂ (e.g., 0.999 or 0.95), it **dynamically modulates β₂ per layer** based on a bounded *sunspike ratio*:
 
- - **During gradient bursts** → β₂ ↓ toward `Lower β₂` → faster reaction
- - **During calm phases** → β₂ ↑ toward `The Selected β₂` → stronger smoothing
+ - **During gradient bursts** → β₂ ↓ toward `Lower β₂` → faster reaction
+ - **During calm phases** → β₂ ↑ toward `The Selected β₂` → stronger smoothing
 
  This is especially effective for **noisy training, small batch sizes, and high learning rates**, where gradient norms shift abruptly due to noise or aggressive LR schedules.
 
@@ -185,60 +201,17 @@ This is especially effective for **noisy training, small batch sizes, and high l
 
  > 💡 **Best Practice**: Set `K_warmup_steps` equal to your standard LR warmup steps. During warmup, the optimizer uses the static `beta2`; adaptation begins only after warmup ends.
 
- > 🔍 **Debugging Aid**: Enable `K_Logging` to monitor (min, max, mean) of dynamic β₂ values across layers every *N* steps.
-
- #### 📊 Performance Validation
-
- **ADAMW_ADV - full SDXL finetuning (aggressive LR: 3e-5) (BS=4, 2.5K steps)**
- <img width="1460" height="382" alt="image" src="https://github.com/user-attachments/assets/007f278a-fbac-4f3d-9cc7-274c3b959cdd" />
-
- - 🟣 Fixed `beta2=0.999`
- - 🟠 Auto K-beta
-
- **Observations:**
- - K-beta is clearly better and more robust/stable for high LRs.
-
- > 📚 **Reference**:
- > - Paper: [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
+ > 📚 **Reference**:
+ > - Paper: [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
  > - Code: [kbeta](https://github.com/sck-at-ucy/kbeta)
 
  ---
 
- ## Recommended Preset (Tested on LoRA/FT/Embedding)
-
- ```yaml
- Learning Rate: 1
- optimizer: PRODIGY_Adv
- settings:
- - beta1: 0.99 # Controls momentum decay, ~100-step effective memory. Adjust to 0.999 (1000 steps) or 0.9999 (10000 steps) based on training length and stability needs.
- - beta2: 0.999
- - kourkoutas_beta: True # For Kourkoutas-β
- - K-β Warmup Steps: 50 # Or 100, 200, depending on your run
- - Simplified_AdEMAMix: True
- - Grad α: 100
- - OrthoGrad: True
- - weight_decay: 0.0
- - initial_d:
- • LoRA: 1e-8
- • Full fine-tune: 1e-10
- • Embedding: 1e-7
- - d_coef: 1
- - d_limiter: True # To stablizie Prodigy with Simplified_AdEMAMix
- - factored: False # Can be true or false, quality should not degrade due to Simplified_AdEMAMix’s high tolerance to 1-bit factorization.
- ```
-
- > ✅ **Why it works**:
- > - `Kourkoutas-β` handles beta2 values
- > - `Simplified_AdEMAMix` ensures responsiveness in small-batch noise
- > - `OrthoGrad` prevents overfitting without weight decay
-
- ---
-
  ## 📚 References
 
- 1. [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192)
- 2. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
- 3. [The AdEMAMix Optimizer](https://arxiv.org/abs/2409.03137)
- 4. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD](https://arxiv.org/abs/2502.02431)
- 5. [AdaMeM: Memory Efficient Momentum for Adafactor](https://openreview.net/forum?id=fZqMVTz7K5)
+ 1. [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192)
+ 2. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
+ 3. [The AdEMAMix Optimizer](https://arxiv.org/abs/2409.03137)
+ 4. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD](https://arxiv.org/abs/2502.02431)
  6. [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
+ 7. [Scaling Exponents Across Parameterizations and Optimizers](https://arxiv.org/abs/2407.05872)
@@ -7,6 +7,7 @@ from .optim import (
  Lion_Prodigy_adv,
  Muon_adv,
  AdaMuon_adv,
+ SignSGD_adv,
  )
 
  __all__ = [
@@ -18,6 +19,7 @@ __all__ = [
  "Lion_Prodigy_adv",
  "Muon_adv",
  "AdaMuon_adv",
+ "SignSGD_adv",
  ]
 
- __version__ = "1.2.dev9"
+ __version__ = "2.1.0"
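To close the loop on the export changes above, a minimal usage sketch: the class names come from the `__init__.py` hunks and the changed-file list, while `compiled_optimizer` and `nnmf_factor` are the flags quoted in the README. The rest of the constructor signature, and whether every optimizer accepts these flags, is an assumption.

```python
import torch
from adv_optm import AdamW_adv, SignSGD_adv   # SignSGD_adv is new in 2.1.0

model = torch.nn.Linear(128, 64)
opt = AdamW_adv(
    model.parameters(),
    lr=1e-4,
    compiled_optimizer=True,   # 2.0.x: torch.compile'd optimizer step path (per README)
    nnmf_factor=True,          # 2.0.x: improved 1-bit factored mode (per README)
)

loss = model(torch.randn(8, 128)).pow(2).mean()
loss.backward()
opt.step()
opt.zero_grad()
```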