adv-optm 1.2.dev9__tar.gz → 1.2.2__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (32)
  1. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/PKG-INFO +8 -61
  2. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/README.md +7 -60
  3. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/__init__.py +1 -1
  4. adv_optm-1.2.2/adv_optm/optim/AdaMuon_adv.py +729 -0
  5. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/optim/AdamW_adv.py +29 -22
  6. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/optim/Adopt_adv.py +35 -24
  7. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/optim/Lion_Prodigy_adv.py +8 -1
  8. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/optim/Lion_adv.py +8 -1
  9. adv_optm-1.2.2/adv_optm/optim/Muon_adv.py +730 -0
  10. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/optim/Prodigy_adv.py +30 -16
  11. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/optim/Simplified_AdEMAMix.py +12 -5
  12. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/util/BF16_Stochastic_Rounding.py +22 -4
  13. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/util/Kourkoutas.py +24 -53
  14. adv_optm-1.2.2/adv_optm/util/Newton_Schulz.py +87 -0
  15. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/util/__init__.py +1 -0
  16. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm.egg-info/PKG-INFO +8 -61
  17. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm.egg-info/SOURCES.txt +0 -1
  18. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/setup.py +1 -1
  19. adv_optm-1.2.dev9/adv_optm/optim/AdaMuon_adv.py +0 -473
  20. adv_optm-1.2.dev9/adv_optm/optim/Muon_adv.py +0 -503
  21. adv_optm-1.2.dev9/adv_optm/util/MuonAdam_helper.py +0 -32
  22. adv_optm-1.2.dev9/adv_optm/util/Newton_Schulz.py +0 -48
  23. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/LICENSE +0 -0
  24. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/optim/__init__.py +0 -0
  25. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/util/Effective_Shape.py +0 -0
  26. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/util/NNMF.py +0 -0
  27. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/util/One_Bit_Boolean.py +0 -0
  28. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/util/OrthoGrad.py +0 -0
  29. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm.egg-info/dependency_links.txt +0 -0
  30. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm.egg-info/requires.txt +0 -0
  31. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm.egg-info/top_level.txt +0 -0
  32. {adv_optm-1.2.dev9 → adv_optm-1.2.2}/setup.cfg +0 -0
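The new `adv_optm/util/Newton_Schulz.py` (file 14) and the rewritten `Muon_adv.py` / `AdaMuon_adv.py` (files 4 and 13) point at Muon-style orthogonalized updates. As a point of reference only, here is a minimal sketch of the quintic Newton–Schulz iteration commonly used for this; the coefficients come from the public Muon reference code, and whether this package's `Newton_Schulz.py` uses the same form, constants, or function name is an assumption.

```python
# Illustrative sketch, not adv_optm's actual code: approximate orthogonalization
# of a 2-D update matrix via the quintic Newton-Schulz iteration used by
# Muon-style optimizers. Function name and constants are assumptions here.
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Quintic iteration coefficients from the public Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)            # bring the spectral norm below ~1
    transposed = X.size(0) > X.size(1)
    if transposed:                      # iterate on the wide orientation
        X = X.mT
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.mT if transposed else X
```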
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: adv_optm
- Version: 1.2.dev9
+ Version: 1.2.2
  Summary: A family of highly efficient, lightweight yet powerful optimizers.
  Home-page: https://github.com/Koratahiu/Advanced_Optimizers
  Author: Koratahiu
@@ -141,7 +140,6 @@ This library integrates multiple state-of-the-art optimization techniques valida
 
  - Adds a **slow-decaying second EMA** (`beta3`) that retains gradient memory over tens of thousands of steps.
  - Particularly effective for **small batch sizes**, where Adam’s standard first moment is nearly useless.
- - **Reference**: [AdaMeM: Memory Efficient Momentum for Adafactor](https://openreview.net/forum?id=fZqMVTz7K5)
 
  #### Tunable Hyperparameters
  | Parameter | Default | Tuning Guide |
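The hunk above describes the slow-decaying second EMA (`beta3`). For orientation, a hedged sketch of how such a slow EMA typically enters an AdEMAMix-style step, following the cited AdEMAMix paper (reference 3, arXiv:2409.03137); the function name, the defaults, and how `AdamW_adv` actually wires this in are assumptions.

```python
# Hedged sketch of an AdEMAMix-style step (per the cited AdEMAMix paper):
# a fast EMA m1 (beta1) is combined with a slow-decaying second EMA m2 (beta3).
# Exact bias correction and defaults in adv_optm are assumptions.
import torch

def ademamix_style_update(p, g, m1, m2, v, lr, beta1=0.9, beta2=0.999,
                          beta3=0.9999, alpha=5.0, eps=1e-8, step=1):
    m1.mul_(beta1).add_(g, alpha=1 - beta1)          # fast first moment
    m2.mul_(beta3).add_(g, alpha=1 - beta3)          # slow second EMA (beta3)
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)    # second moment
    m1_hat = m1 / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    p.add_((m1_hat + alpha * m2) / (v_hat.sqrt() + eps), alpha=-lr)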
@@ -156,7 +155,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
  ### Simplified_AdEMAMix
 
  - Introduced in [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants (arXiv:2502.02431)](https://arxiv.org/abs/2502.02431).
- - Replaces Adam’s first moment with a **gradient accumulator**, combining the stability of long memory with responsiveness to recent gradients.
+ - Replaces Adam’s first moment with a **theory-based momentum** that emphasizes the raw gradient, combining the stability of long memory with responsiveness to recent gradients.
  - **Key insight**: Classical momentum **does not accelerate** in noisy (small-batch) regimes; this accumulator does.
 
  #### Tunable Hyperparameters
@@ -171,20 +170,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
  > - **Full FT**: `1e-10`
  > - **Embedding**: `1e-7`
 
- > ⚠️ **Incompatible** with: **Cautious**, **Grams**, **atan2**, and standard gradient clipping.
-
- #### Performance Validation
-
- **Small Batch Training (SDXL, BS=2, 1.8K steps)**
- ![Training Comparison](https://github.com/user-attachments/assets/7eff0671-cc59-47fc-8b63-d5205456d649)
-
- - **🟢 Prodigy_Adv** (beta1=0.9, d0=1e-5): Final LR = 2.9e-4
- - **🔵 Prodigy_Adv + Simplified_AdEMAMix** (beta1=0.99, α=100, d0=1e-7): Final LR = 5.8e-6
-
- **Results**:
- - Faster convergence and higher final performance with Simplified_AdEMAMix
- - D-Adaptation automatically compensates for aggressive updates
- - Generated samples show **significantly better quality**
+ > ⚠️ **Incompatible** with: **Cautious**, **Grams**, **atan2**, and standard update clipping.
 
  ---
 
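The two hunks above describe Simplified_AdEMAMix: the first moment becomes a long-memory accumulator while the raw gradient keeps a large weight (the removed preset calls it `Grad α`). A heavily hedged sketch of one common way to write such an update, based on the cited paper; where α enters, the bias correction, and the interaction with update clipping in `adv_optm` are all assumptions.

```python
# Hedged sketch of a Simplified-AdEMAMix-style numerator: a slow-decaying
# accumulator m plus an alpha-weighted raw-gradient term. The exact rule used
# by adv_optm (bias correction, clipping behaviour) is an assumption.
import torch

def simplified_ademamix_step(p, g, m, v, lr, beta1=0.99, beta2=0.999,
                             alpha=100.0, eps=1e-8):
    m.mul_(beta1).add_(g, alpha=1 - beta1)           # long-memory accumulator
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)    # Adam-style second moment
    update = (alpha * g + m) / (v.sqrt() + eps)      # raw gradient kept dominant via alpha
    p.add_(update, alpha=-lr)
```

A large α makes the raw update much bigger, which is consistent with the much smaller `d0`/learning-rate recommendations in the hyperparameter notes above.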
@@ -194,6 +180,10 @@ This library integrates multiple state-of-the-art optimization techniques valida
  - Automatically clips updates to **[-2, 2]**, preventing destabilizing jumps.
  - **Highly recommended** for `Adopt_Adv`, which is prone to instability without clipping.
 
+ > 📚 **Reference**:
+ > - Paper: https://arxiv.org/abs/2407.05872
+ > - Code: https://github.com/lucidrains/adam-atan2-pytorch
+
  ---
 
  ### **Kourkoutas-β**
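The reference block added in the hunk above points to the adam-atan2 work. A minimal sketch of the idea it describes: `atan2` replaces the eps-guarded division, so each update element is bounded, and with a scale around 1.27 the range is roughly the **[-2, 2]** quoted above. The scale constants and function name are assumptions, not adv_optm's actual parameters.

```python
# Hedged sketch of an atan2-style Adam update (after arXiv:2407.05872 and
# lucidrains/adam-atan2-pytorch): atan2 replaces m_hat / (sqrt(v_hat) + eps).
# Since atan2(y, x) with x >= 0 lies in [-pi/2, pi/2], the result is bounded by
# a * pi/2 (~2 for a = 1.27) and no eps term is needed. Constants are assumptions.
import torch

def adam_atan2_update(m_hat: torch.Tensor, v_hat: torch.Tensor,
                      a: float = 1.27, b: float = 1.0) -> torch.Tensor:
    return a * torch.atan2(m_hat, b * v_hat.sqrt())
```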
@@ -216,60 +206,17 @@ This is especially effective for **noisy training, small batch sizes, and high l
 
  > 💡 **Best Practice**: Set `K_warmup_steps` equal to your standard LR warmup steps. During warmup, the optimizer uses the static `beta2`; adaptation begins only after warmup ends.
 
- > 🔍 **Debugging Aid**: Enable `K_Logging` to monitor (min, max, mean) of dynamic β₂ values across layers every *N* steps.
-
- #### 📊 Performance Validation
-
- **ADAMW_ADV - full SDXL finetuning (aggressive LR: 3e-5) (BS=4, 2.5K steps)**
- <img width="1460" height="382" alt="image" src="https://github.com/user-attachments/assets/007f278a-fbac-4f3d-9cc7-274c3b959cdd" />
-
- - 🟣 Fixed `beta2=0.999`
- - 🟠 Auto K-beta
-
- **Observations:**
- - K-beta is clearly better and more robust/stable for high LRs.
-
  > 📚 **Reference**:
  > - Paper: [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
  > - Code: [kbeta](https://github.com/sck-at-ucy/kbeta)
 
  ---
 
- ## Recommended Preset (Tested on LoRA/FT/Embedding)
-
- ```yaml
- Learning Rate: 1
- optimizer: PRODIGY_Adv
- settings:
- - beta1: 0.99 # Controls momentum decay, ~100-step effective memory. Adjust to 0.999 (1000 steps) or 0.9999 (10000 steps) based on training length and stability needs.
- - beta2: 0.999
- - kourkoutas_beta: True # For Kourkoutas-β
- - K-β Warmup Steps: 50 # Or 100, 200, depending on your run
- - Simplified_AdEMAMix: True
- - Grad α: 100
- - OrthoGrad: True
- - weight_decay: 0.0
- - initial_d:
- • LoRA: 1e-8
- • Full fine-tune: 1e-10
- • Embedding: 1e-7
- - d_coef: 1
- - d_limiter: True # To stablizie Prodigy with Simplified_AdEMAMix
- - factored: False # Can be true or false, quality should not degrade due to Simplified_AdEMAMix’s high tolerance to 1-bit factorization.
- ```
-
- > ✅ **Why it works**:
- > - `Kourkoutas-β` handles beta2 values
- > - `Simplified_AdEMAMix` ensures responsiveness in small-batch noise
- > - `OrthoGrad` prevents overfitting without weight decay
-
- ---
-
  ## 📚 References
 
  1. [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192)
  2. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
  3. [The AdEMAMix Optimizer](https://arxiv.org/abs/2409.03137)
  4. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD](https://arxiv.org/abs/2502.02431)
- 5. [AdaMeM: Memory Efficient Momentum for Adafactor](https://openreview.net/forum?id=fZqMVTz7K5)
  6. [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
+ 7. [Scaling Exponents Across Parameterizations and Optimizers](https://arxiv.org/abs/2407.05872)
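The Kourkoutas-β lines in the hunk above describe a per-layer dynamic β₂ that stays at the static `beta2` during warmup and adapts afterwards. A heavily hedged sketch of the general "sunspike" idea from the cited paper: β₂ is pulled down from its maximum when the current layer-gradient norm spikes relative to its running average. Only `K_warmup_steps` and the static-β₂-during-warmup behaviour come from the text above; the formula, bounds, and function name below are assumptions.

```python
# Heavily hedged sketch of a Kourkoutas-beta-style dynamic beta2
# (after arXiv:2508.12996). The exact sunspike formula and the
# min/max bounds used by adv_optm are assumptions.
import torch

def dynamic_beta2(grad: torch.Tensor, norm_ema: torch.Tensor, step: int,
                  beta2_max: float = 0.999, beta2_min: float = 0.88,
                  ema_decay: float = 0.9, K_warmup_steps: int = 50,
                  tiny: float = 1e-12) -> float:
    g_norm = grad.norm()
    norm_ema.mul_(ema_decay).add_(g_norm, alpha=1 - ema_decay)  # running norm average
    if step <= K_warmup_steps:
        return beta2_max                          # static beta2 during warmup
    spike = g_norm / (norm_ema + tiny)            # how far above its average the norm is
    sunspike = float(spike / (1.0 + spike))       # bounded to [0, 1)
    return beta2_max - (beta2_max - beta2_min) * sunspike
```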
@@ -110,7 +110,6 @@ This library integrates multiple state-of-the-art optimization techniques valida
 
  - Adds a **slow-decaying second EMA** (`beta3`) that retains gradient memory over tens of thousands of steps.
  - Particularly effective for **small batch sizes**, where Adam’s standard first moment is nearly useless.
- - **Reference**: [AdaMeM: Memory Efficient Momentum for Adafactor](https://openreview.net/forum?id=fZqMVTz7K5)
 
  #### Tunable Hyperparameters
  | Parameter | Default | Tuning Guide |
@@ -125,7 +124,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
  ### Simplified_AdEMAMix
 
  - Introduced in [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants (arXiv:2502.02431)](https://arxiv.org/abs/2502.02431).
- - Replaces Adam’s first moment with a **gradient accumulator**, combining the stability of long memory with responsiveness to recent gradients.
+ - Replaces Adam’s first moment with a **theory-based momentum** that emphasizes the raw gradient, combining the stability of long memory with responsiveness to recent gradients.
  - **Key insight**: Classical momentum **does not accelerate** in noisy (small-batch) regimes; this accumulator does.
 
  #### Tunable Hyperparameters
@@ -140,20 +139,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
  > - **Full FT**: `1e-10`
  > - **Embedding**: `1e-7`
 
- > ⚠️ **Incompatible** with: **Cautious**, **Grams**, **atan2**, and standard gradient clipping.
-
- #### Performance Validation
-
- **Small Batch Training (SDXL, BS=2, 1.8K steps)**
- ![Training Comparison](https://github.com/user-attachments/assets/7eff0671-cc59-47fc-8b63-d5205456d649)
-
- - **🟢 Prodigy_Adv** (beta1=0.9, d0=1e-5): Final LR = 2.9e-4
- - **🔵 Prodigy_Adv + Simplified_AdEMAMix** (beta1=0.99, α=100, d0=1e-7): Final LR = 5.8e-6
-
- **Results**:
- - Faster convergence and higher final performance with Simplified_AdEMAMix
- - D-Adaptation automatically compensates for aggressive updates
- - Generated samples show **significantly better quality**
+ > ⚠️ **Incompatible** with: **Cautious**, **Grams**, **atan2**, and standard update clipping.
 
  ---
 
@@ -163,6 +149,10 @@ This library integrates multiple state-of-the-art optimization techniques valida
  - Automatically clips updates to **[-2, 2]**, preventing destabilizing jumps.
  - **Highly recommended** for `Adopt_Adv`, which is prone to instability without clipping.
 
+ > 📚 **Reference**:
+ > - Paper: https://arxiv.org/abs/2407.05872
+ > - Code: https://github.com/lucidrains/adam-atan2-pytorch
+
  ---
 
  ### **Kourkoutas-β**
@@ -185,60 +175,17 @@ This is especially effective for **noisy training, small batch sizes, and high l
 
  > 💡 **Best Practice**: Set `K_warmup_steps` equal to your standard LR warmup steps. During warmup, the optimizer uses the static `beta2`; adaptation begins only after warmup ends.
 
- > 🔍 **Debugging Aid**: Enable `K_Logging` to monitor (min, max, mean) of dynamic β₂ values across layers every *N* steps.
-
- #### 📊 Performance Validation
-
- **ADAMW_ADV - full SDXL finetuning (aggressive LR: 3e-5) (BS=4, 2.5K steps)**
- <img width="1460" height="382" alt="image" src="https://github.com/user-attachments/assets/007f278a-fbac-4f3d-9cc7-274c3b959cdd" />
-
- - 🟣 Fixed `beta2=0.999`
- - 🟠 Auto K-beta
-
- **Observations:**
- - K-beta is clearly better and more robust/stable for high LRs.
-
  > 📚 **Reference**:
  > - Paper: [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
  > - Code: [kbeta](https://github.com/sck-at-ucy/kbeta)
 
  ---
 
- ## Recommended Preset (Tested on LoRA/FT/Embedding)
-
- ```yaml
- Learning Rate: 1
- optimizer: PRODIGY_Adv
- settings:
- - beta1: 0.99 # Controls momentum decay, ~100-step effective memory. Adjust to 0.999 (1000 steps) or 0.9999 (10000 steps) based on training length and stability needs.
- - beta2: 0.999
- - kourkoutas_beta: True # For Kourkoutas-β
- - K-β Warmup Steps: 50 # Or 100, 200, depending on your run
- - Simplified_AdEMAMix: True
- - Grad α: 100
- - OrthoGrad: True
- - weight_decay: 0.0
- - initial_d:
- • LoRA: 1e-8
- • Full fine-tune: 1e-10
- • Embedding: 1e-7
- - d_coef: 1
- - d_limiter: True # To stablizie Prodigy with Simplified_AdEMAMix
- - factored: False # Can be true or false, quality should not degrade due to Simplified_AdEMAMix’s high tolerance to 1-bit factorization.
- ```
-
- > ✅ **Why it works**:
- > - `Kourkoutas-β` handles beta2 values
- > - `Simplified_AdEMAMix` ensures responsiveness in small-batch noise
- > - `OrthoGrad` prevents overfitting without weight decay
-
- ---
-
  ## 📚 References
 
  1. [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192)
  2. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
  3. [The AdEMAMix Optimizer](https://arxiv.org/abs/2409.03137)
  4. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD](https://arxiv.org/abs/2502.02431)
- 5. [AdaMeM: Memory Efficient Momentum for Adafactor](https://openreview.net/forum?id=fZqMVTz7K5)
  6. [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
+ 7. [Scaling Exponents Across Parameterizations and Optimizers](https://arxiv.org/abs/2407.05872)
@@ -20,4 +20,4 @@ __all__ = [
  "AdaMuon_adv",
  ]
 
- __version__ = "1.2.dev9"
+ __version__ = "1.2.2"
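The final hunk bumps `__version__` to 1.2.2. A quick sanity check that an installed environment matches this release; nothing beyond the `__version__` attribute shown above is assumed about the package API.

```python
# Confirm the installed build matches the 1.2.2 release described by this diff.
# Install first with: pip install adv_optm==1.2.2
import adv_optm

assert adv_optm.__version__ == "1.2.2", adv_optm.__version__
print("adv_optm", adv_optm.__version__)
```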