adv-optm 1.2.dev9.tar.gz → 1.2.2.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/PKG-INFO +8 -61
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/README.md +7 -60
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/__init__.py +1 -1
- adv_optm-1.2.2/adv_optm/optim/AdaMuon_adv.py +729 -0
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/optim/AdamW_adv.py +29 -22
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/optim/Adopt_adv.py +35 -24
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/optim/Lion_Prodigy_adv.py +8 -1
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/optim/Lion_adv.py +8 -1
- adv_optm-1.2.2/adv_optm/optim/Muon_adv.py +730 -0
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/optim/Prodigy_adv.py +30 -16
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/optim/Simplified_AdEMAMix.py +12 -5
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/util/BF16_Stochastic_Rounding.py +22 -4
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/util/Kourkoutas.py +24 -53
- adv_optm-1.2.2/adv_optm/util/Newton_Schulz.py +87 -0
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/util/__init__.py +1 -0
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm.egg-info/PKG-INFO +8 -61
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm.egg-info/SOURCES.txt +0 -1
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/setup.py +1 -1
- adv_optm-1.2.dev9/adv_optm/optim/AdaMuon_adv.py +0 -473
- adv_optm-1.2.dev9/adv_optm/optim/Muon_adv.py +0 -503
- adv_optm-1.2.dev9/adv_optm/util/MuonAdam_helper.py +0 -32
- adv_optm-1.2.dev9/adv_optm/util/Newton_Schulz.py +0 -48
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/LICENSE +0 -0
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/optim/__init__.py +0 -0
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/util/Effective_Shape.py +0 -0
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/util/NNMF.py +0 -0
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/util/One_Bit_Boolean.py +0 -0
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm/util/OrthoGrad.py +0 -0
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm.egg-info/dependency_links.txt +0 -0
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm.egg-info/requires.txt +0 -0
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/adv_optm.egg-info/top_level.txt +0 -0
- {adv_optm-1.2.dev9 → adv_optm-1.2.2}/setup.cfg +0 -0
{adv_optm-1.2.dev9 → adv_optm-1.2.2}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: adv_optm
-Version: 1.2.dev9
+Version: 1.2.2
 Summary: A family of highly efficient, lightweight yet powerful optimizers.
 Home-page: https://github.com/Koratahiu/Advanced_Optimizers
 Author: Koratahiu
@@ -141,7 +141,6 @@ This library integrates multiple state-of-the-art optimization techniques valida
 
 - Adds a **slow-decaying second EMA** (`beta3`) that retains gradient memory over tens of thousands of steps.
 - Particularly effective for **small batch sizes**, where Adam’s standard first moment is nearly useless.
-- **Reference**: [AdaMeM: Memory Efficient Momentum for Adafactor](https://openreview.net/forum?id=fZqMVTz7K5)
 
 #### Tunable Hyperparameters
 | Parameter | Default | Tuning Guide |
@@ -156,7 +155,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
 ### Simplified_AdEMAMix
 
 - Introduced in [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants (arXiv:2502.02431)](https://arxiv.org/abs/2502.02431).
-- Replaces Adam’s first moment with a **gradient
+- Replaces Adam’s first moment with a **theory-based momentum** with emphasize on raw gradient, combining the stability of long memory with responsiveness to recent gradients.
 - **Key insight**: Classical momentum **does not accelerate** in noisy (small-batch) regimes; this accumulator do.
 
 #### Tunable Hyperparameters
@@ -171,20 +170,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
 > - **Full FT**: `1e-10`
 > - **Embedding**: `1e-7`
 
-> ⚠️ **Incompatible** with: **Cautious**, **Grams**, **atan2**, and standard
-
-#### Performance Validation
-
-**Small Batch Training (SDXL, BS=2, 1.8K steps)**
-
-
-- **🟢 Prodigy_Adv** (beta1=0.9, d0=1e-5): Final LR = 2.9e-4
-- **🔵 Prodigy_Adv + Simplified_AdEMAMix** (beta1=0.99, α=100, d0=1e-7): Final LR = 5.8e-6
-
-**Results**:
-- Faster convergence and higher final performance with Simplified_AdEMAMix
-- D-Adaptation automatically compensates for aggressive updates
-- Generated samples show **significantly better quality**
+> ⚠️ **Incompatible** with: **Cautious**, **Grams**, **atan2**, and standard update clipping.
 
 ---
 
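The hunks above describe Simplified_AdEMAMix only in prose. As a rough illustration of what "long-memory accumulator plus an emphasized raw gradient" can mean inside an Adam-style step, here is a hedged sketch; the function name, the un-normalized accumulator, and the `alpha`/`beta1` defaults are assumptions drawn from the `Grad α: 100` / `beta1: 0.99` guidance elsewhere in the README, not the package's actual code.

```python
import torch

def simplified_ademamix_numerator(grad: torch.Tensor,
                                  exp_avg: torch.Tensor,
                                  beta1: float = 0.99,
                                  alpha: float = 100.0) -> torch.Tensor:
    # Long-memory accumulator: deliberately no (1 - beta1) rescaling here,
    # so past gradients are retained for roughly 1/(1 - beta1) steps.
    exp_avg.mul_(beta1).add_(grad)
    # Emphasize the newest raw gradient on top of that long memory.
    return exp_avg + alpha * grad

# In an Adam-like optimizer this numerator would stand in for the usual
# first moment before the division by sqrt(v) + eps, e.g.:
#   update = simplified_ademamix_numerator(g, m) / (v.sqrt() + eps)
```

With `beta1 = 0.99` the accumulator's steady-state magnitude is on the order of 1/(1 − beta1) = 100, so an `alpha` of the same order keeps the raw-gradient term and the memory term comparable; the exact balance used by `adv_optm` may differ.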
@@ -194,6 +180,10 @@ This library integrates multiple state-of-the-art optimization techniques valida
 - Automatically clips updates to **[-2, 2]**, preventing destabilizing jumps.
 - **Highly recommended** for `Adopt_Adv`, which is prone to instability without clipping.
 
+> 📚 **Reference**:
+> - Paper: https://arxiv.org/abs/2407.05872
+> - Code: https://github.com/lucidrains/adam-atan2-pytorch
+
 ---
 
 ### **Kourkoutas-β**
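For context on the atan2 variant referenced in the hunk above: the cited paper and the lucidrains repository replace Adam's `m / (sqrt(v) + eps)` division with a bounded arctangent, which is what produces the `[-2, 2]` clipping mentioned here. A minimal sketch with generic Adam state names (bias correction omitted; the constant `a = 4/π` is chosen so the bound works out to 2, and is not taken from this package's code):

```python
import math
import torch

def adam_atan2_update(exp_avg: torch.Tensor,
                      exp_avg_sq: torch.Tensor,
                      a: float = 4.0 / math.pi,
                      b: float = 1.0) -> torch.Tensor:
    # atan2(m, b * sqrt(v)) lies in (-pi/2, pi/2) because sqrt(v) >= 0,
    # so scaling by a = 4/pi bounds the update to (-2, 2). The usual epsilon
    # term becomes unnecessary, since torch.atan2(0., 0.) is defined as 0.
    return a * torch.atan2(exp_avg, b * exp_avg_sq.sqrt())

# Inside an Adam-like step the parameter update would then be:
#   p.add_(adam_atan2_update(exp_avg, exp_avg_sq), alpha=-lr)
```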
@@ -216,60 +206,17 @@ This is especially effective for **noisy training, small batch sizes, and high l
 
 > 💡 **Best Practice**: Set `K_warmup_steps` equal to your standard LR warmup steps. During warmup, the optimizer uses the static `beta2`; adaptation begins only after warmup ends.
 
-> 🔍 **Debugging Aid**: Enable `K_Logging` to monitor (min, max, mean) of dynamic β₂ values across layers every *N* steps.
-
-#### 📊 Performance Validation
-
-**ADAMW_ADV - full SDXL finetuning (aggressive LR: 3e-5) (BS=4, 2.5K steps)**
-<img width="1460" height="382" alt="image" src="https://github.com/user-attachments/assets/007f278a-fbac-4f3d-9cc7-274c3b959cdd" />
-
-- 🟣 Fixed `beta2=0.999`
-- 🟠 Auto K-beta
-
-**Observations:**
-- K-beta is clearly better and more robust/stable for high LRs.
-
 > 📚 **Reference**:
 > - Paper: [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
 > - Code: [kbeta](https://github.com/sck-at-ucy/kbeta)
 
 ---
 
-## Recommended Preset (Tested on LoRA/FT/Embedding)
-
-```yaml
-Learning Rate: 1
-optimizer: PRODIGY_Adv
-settings:
-  - beta1: 0.99 # Controls momentum decay, ~100-step effective memory. Adjust to 0.999 (1000 steps) or 0.9999 (10000 steps) based on training length and stability needs.
-  - beta2: 0.999
-  - kourkoutas_beta: True # For Kourkoutas-β
-  - K-β Warmup Steps: 50 # Or 100, 200, depending on your run
-  - Simplified_AdEMAMix: True
-  - Grad α: 100
-  - OrthoGrad: True
-  - weight_decay: 0.0
-  - initial_d:
-      • LoRA: 1e-8
-      • Full fine-tune: 1e-10
-      • Embedding: 1e-7
-  - d_coef: 1
-  - d_limiter: True # To stablizie Prodigy with Simplified_AdEMAMix
-  - factored: False # Can be true or false, quality should not degrade due to Simplified_AdEMAMix’s high tolerance to 1-bit factorization.
-```
-
-> ✅ **Why it works**:
-> - `Kourkoutas-β` handles beta2 values
-> - `Simplified_AdEMAMix` ensures responsiveness in small-batch noise
-> - `OrthoGrad` prevents overfitting without weight decay
-
----
-
 ## 📚 References
 
 1. [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192)
 2. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
 3. [The AdEMAMix Optimizer](https://arxiv.org/abs/2409.03137)
 4. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD](https://arxiv.org/abs/2502.02431)
-5. [AdaMeM: Memory Efficient Momentum for Adafactor](https://openreview.net/forum?id=fZqMVTz7K5)
 6. [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
+7. [Scaling Exponents Across Parameterizations and Optimizers](https://arxiv.org/abs/2407.05872)
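Since the Kourkoutas-β notes above stay at the level of hyperparameter advice, here is a heavily simplified sketch of the layer-wise dynamic β₂ idea: a "sunspike" signal derived from the layer's gradient norm pulls β₂ down toward a floor when gradients spike and lets it relax toward the usual 0.999 when training is calm. The squashing, the EMA tracker, and the `beta2_min` floor below are illustrative assumptions, not the formula used by `adv_optm` or the kbeta reference code.

```python
import torch

def kourkoutas_style_beta2(grad: torch.Tensor,
                           norm_ema: float,
                           beta2_min: float = 0.9,
                           beta2_max: float = 0.999,
                           ema_decay: float = 0.9,
                           tiny: float = 1e-9) -> tuple[float, float]:
    # Track a smoothed per-layer gradient-norm scale.
    g_norm = grad.norm().item()
    norm_ema = ema_decay * norm_ema + (1.0 - ema_decay) * g_norm
    # "Sunspike": how far the current norm sticks out above its recent scale,
    # squashed into [0, 1).
    spike = g_norm / (norm_ema + tiny)
    spike = spike / (1.0 + spike)
    # Spiky layers get a smaller beta2 (react faster); calm layers stay near beta2_max.
    beta2 = beta2_max - (beta2_max - beta2_min) * spike
    return beta2, norm_ema
```

During the `K_warmup_steps` window described in the hunk above, the optimizer would simply keep the static `beta2` and only switch to the dynamic value afterwards.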
{adv_optm-1.2.dev9 → adv_optm-1.2.2}/README.md

@@ -110,7 +110,6 @@ This library integrates multiple state-of-the-art optimization techniques valida
 
 - Adds a **slow-decaying second EMA** (`beta3`) that retains gradient memory over tens of thousands of steps.
 - Particularly effective for **small batch sizes**, where Adam’s standard first moment is nearly useless.
-- **Reference**: [AdaMeM: Memory Efficient Momentum for Adafactor](https://openreview.net/forum?id=fZqMVTz7K5)
 
 #### Tunable Hyperparameters
 | Parameter | Default | Tuning Guide |

@@ -125,7 +124,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
 ### Simplified_AdEMAMix
 
 - Introduced in [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants (arXiv:2502.02431)](https://arxiv.org/abs/2502.02431).
-- Replaces Adam’s first moment with a **gradient
+- Replaces Adam’s first moment with a **theory-based momentum** with emphasize on raw gradient, combining the stability of long memory with responsiveness to recent gradients.
 - **Key insight**: Classical momentum **does not accelerate** in noisy (small-batch) regimes; this accumulator do.
 
 #### Tunable Hyperparameters

@@ -140,20 +139,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
 > - **Full FT**: `1e-10`
 > - **Embedding**: `1e-7`
 
-> ⚠️ **Incompatible** with: **Cautious**, **Grams**, **atan2**, and standard
-
-#### Performance Validation
-
-**Small Batch Training (SDXL, BS=2, 1.8K steps)**
-
-
-- **🟢 Prodigy_Adv** (beta1=0.9, d0=1e-5): Final LR = 2.9e-4
-- **🔵 Prodigy_Adv + Simplified_AdEMAMix** (beta1=0.99, α=100, d0=1e-7): Final LR = 5.8e-6
-
-**Results**:
-- Faster convergence and higher final performance with Simplified_AdEMAMix
-- D-Adaptation automatically compensates for aggressive updates
-- Generated samples show **significantly better quality**
+> ⚠️ **Incompatible** with: **Cautious**, **Grams**, **atan2**, and standard update clipping.
 
 ---
 

@@ -163,6 +149,10 @@ This library integrates multiple state-of-the-art optimization techniques valida
 - Automatically clips updates to **[-2, 2]**, preventing destabilizing jumps.
 - **Highly recommended** for `Adopt_Adv`, which is prone to instability without clipping.
 
+> 📚 **Reference**:
+> - Paper: https://arxiv.org/abs/2407.05872
+> - Code: https://github.com/lucidrains/adam-atan2-pytorch
+
 ---
 
 ### **Kourkoutas-β**

@@ -185,60 +175,17 @@ This is especially effective for **noisy training, small batch sizes, and high l
 
 > 💡 **Best Practice**: Set `K_warmup_steps` equal to your standard LR warmup steps. During warmup, the optimizer uses the static `beta2`; adaptation begins only after warmup ends.
 
-> 🔍 **Debugging Aid**: Enable `K_Logging` to monitor (min, max, mean) of dynamic β₂ values across layers every *N* steps.
-
-#### 📊 Performance Validation
-
-**ADAMW_ADV - full SDXL finetuning (aggressive LR: 3e-5) (BS=4, 2.5K steps)**
-<img width="1460" height="382" alt="image" src="https://github.com/user-attachments/assets/007f278a-fbac-4f3d-9cc7-274c3b959cdd" />
-
-- 🟣 Fixed `beta2=0.999`
-- 🟠 Auto K-beta
-
-**Observations:**
-- K-beta is clearly better and more robust/stable for high LRs.
-
 > 📚 **Reference**:
 > - Paper: [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
 > - Code: [kbeta](https://github.com/sck-at-ucy/kbeta)
 
 ---
 
-## Recommended Preset (Tested on LoRA/FT/Embedding)
-
-```yaml
-Learning Rate: 1
-optimizer: PRODIGY_Adv
-settings:
-  - beta1: 0.99 # Controls momentum decay, ~100-step effective memory. Adjust to 0.999 (1000 steps) or 0.9999 (10000 steps) based on training length and stability needs.
-  - beta2: 0.999
-  - kourkoutas_beta: True # For Kourkoutas-β
-  - K-β Warmup Steps: 50 # Or 100, 200, depending on your run
-  - Simplified_AdEMAMix: True
-  - Grad α: 100
-  - OrthoGrad: True
-  - weight_decay: 0.0
-  - initial_d:
-      • LoRA: 1e-8
-      • Full fine-tune: 1e-10
-      • Embedding: 1e-7
-  - d_coef: 1
-  - d_limiter: True # To stablizie Prodigy with Simplified_AdEMAMix
-  - factored: False # Can be true or false, quality should not degrade due to Simplified_AdEMAMix’s high tolerance to 1-bit factorization.
-```
-
-> ✅ **Why it works**:
-> - `Kourkoutas-β` handles beta2 values
-> - `Simplified_AdEMAMix` ensures responsiveness in small-batch noise
-> - `OrthoGrad` prevents overfitting without weight decay
-
----
-
 ## 📚 References
 
 1. [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192)
 2. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
 3. [The AdEMAMix Optimizer](https://arxiv.org/abs/2409.03137)
 4. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD](https://arxiv.org/abs/2502.02431)
-5. [AdaMeM: Memory Efficient Momentum for Adafactor](https://openreview.net/forum?id=fZqMVTz7K5)
 6. [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
+7. [Scaling Exponents Across Parameterizations and Optimizers](https://arxiv.org/abs/2407.05872)