adv-optm 0.1.8__tar.gz → 1.0.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

Files changed (27)
  1. adv_optm-1.0.0/PKG-INFO +174 -0
  2. adv_optm-1.0.0/README.md +143 -0
  3. {adv_optm-0.1.8 → adv_optm-1.0.0}/adv_optm/__init__.py +1 -1
  4. {adv_optm-0.1.8 → adv_optm-1.0.0}/adv_optm/optim/AdamW_adv.py +3 -0
  5. {adv_optm-0.1.8 → adv_optm-1.0.0}/adv_optm/optim/Adopt_adv.py +47 -8
  6. {adv_optm-0.1.8 → adv_optm-1.0.0}/adv_optm/optim/Prodigy_adv.py +4 -1
  7. adv_optm-1.0.0/adv_optm.egg-info/PKG-INFO +174 -0
  8. {adv_optm-0.1.8 → adv_optm-1.0.0}/setup.py +1 -1
  9. adv_optm-0.1.8/PKG-INFO +0 -130
  10. adv_optm-0.1.8/README.md +0 -99
  11. adv_optm-0.1.8/adv_optm.egg-info/PKG-INFO +0 -130
  12. {adv_optm-0.1.8 → adv_optm-1.0.0}/LICENSE +0 -0
  13. {adv_optm-0.1.8 → adv_optm-1.0.0}/adv_optm/optim/Lion_Prodigy_adv.py +0 -0
  14. {adv_optm-0.1.8 → adv_optm-1.0.0}/adv_optm/optim/Lion_adv.py +0 -0
  15. {adv_optm-0.1.8 → adv_optm-1.0.0}/adv_optm/optim/Simplified_AdEMAMix.py +0 -0
  16. {adv_optm-0.1.8 → adv_optm-1.0.0}/adv_optm/optim/__init__.py +0 -0
  17. {adv_optm-0.1.8 → adv_optm-1.0.0}/adv_optm/util/BF16_Stochastic_Rounding.py +0 -0
  18. {adv_optm-0.1.8 → adv_optm-1.0.0}/adv_optm/util/Effective_Shape.py +0 -0
  19. {adv_optm-0.1.8 → adv_optm-1.0.0}/adv_optm/util/NNMF.py +0 -0
  20. {adv_optm-0.1.8 → adv_optm-1.0.0}/adv_optm/util/One_Bit_Boolean.py +0 -0
  21. {adv_optm-0.1.8 → adv_optm-1.0.0}/adv_optm/util/OrthoGrad.py +0 -0
  22. {adv_optm-0.1.8 → adv_optm-1.0.0}/adv_optm/util/__init__.py +0 -0
  23. {adv_optm-0.1.8 → adv_optm-1.0.0}/adv_optm.egg-info/SOURCES.txt +0 -0
  24. {adv_optm-0.1.8 → adv_optm-1.0.0}/adv_optm.egg-info/dependency_links.txt +0 -0
  25. {adv_optm-0.1.8 → adv_optm-1.0.0}/adv_optm.egg-info/requires.txt +0 -0
  26. {adv_optm-0.1.8 → adv_optm-1.0.0}/adv_optm.egg-info/top_level.txt +0 -0
  27. {adv_optm-0.1.8 → adv_optm-1.0.0}/setup.cfg +0 -0
adv_optm-1.0.0/PKG-INFO ADDED
@@ -0,0 +1,174 @@
1
+ Metadata-Version: 2.4
2
+ Name: adv_optm
3
+ Version: 1.0.0
4
+ Summary: A family of highly efficient, lightweight yet powerful optimizers.
5
+ Home-page: https://github.com/Koratahiu/Advanced_Optimizers
6
+ Author: Koratahiu
7
+ Author-email: hiuhonor@gmail.com
8
+ License: Apache 2.0
9
+ Keywords: llm,fine-tuning,memory-efficient,low-rank,compression,pytorch,optimizer,adam
10
+ Classifier: Programming Language :: Python :: 3
11
+ Classifier: License :: OSI Approved :: Apache Software License
12
+ Classifier: Operating System :: OS Independent
13
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
14
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
15
+ Requires-Python: >=3.8
16
+ Description-Content-Type: text/markdown
17
+ License-File: LICENSE
18
+ Requires-Dist: torch>=2.0
19
+ Dynamic: author
20
+ Dynamic: author-email
21
+ Dynamic: classifier
22
+ Dynamic: description
23
+ Dynamic: description-content-type
24
+ Dynamic: home-page
25
+ Dynamic: keywords
26
+ Dynamic: license
27
+ Dynamic: license-file
28
+ Dynamic: requires-dist
29
+ Dynamic: requires-python
30
+ Dynamic: summary
31
+
32
+ # Advanced Optimizers (AIO)
33
+
34
+ A comprehensive, all-in-one collection of optimization algorithms for deep learning, designed for maximum efficiency, minimal memory footprint, and superior performance across diverse model architectures and training scenarios.
35
+
36
+ [![PyPI](https://img.shields.io/pypi/v/adv_optm)](https://pypi.org/project/adv_optm/)
37
+
38
+ ---
39
+
40
+ ## 📦 Installation
41
+
42
+ ```bash
43
+ pip install adv_optm
44
+ ```
45
+
46
+ ---
47
+
48
+ ## 🧠 Core Innovations
49
+
50
+ This library integrates multiple state-of-the-art optimization techniques validated through extensive research and practical training, with 1-bit compression for optimizer states:
51
+
52
+ ### **Memory-Efficient Optimization (SMMF-inspired)**
53
+ - **Paper**: [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
54
+ - **Approach**: Uses rank-1 non-negative matrix factorization with reconstruction cycle (factor → reconstruct → update → factor)
55
+ - **Innovation**:
56
+ - First moment split into **1-bit sign + absolute value**
57
+ - Final storage: **four factored vectors + one 1-bit sign state**
58
+ - Preserves Adam-like update quality with drastically reduced memory
59
+
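
As a rough illustration of the factor → reconstruct → update → factor cycle described above, here is a minimal rank-1 NNMF sketch. The helper names and exact normalization are assumptions for illustration only, not the package's internal API (the real implementations live in `adv_optm/util/NNMF.py` and `adv_optm/util/One_Bit_Boolean.py` in the file list):

```python
import torch

def nnmf_rank1(m_abs: torch.Tensor):
    """Rank-1 non-negative factorization: m_abs (d1 x d2) ~= outer(r, c)."""
    r = m_abs.sum(dim=1)                      # row sums (length d1)
    c = m_abs.sum(dim=0)                      # column sums (length d2)
    c = c / m_abs.sum().clamp_min(1e-30)      # normalize so outer(r, c) matches the scale
    return r, c

def momentum_cycle(r, c, sign, grad, beta1=0.9):
    """One factor -> reconstruct -> update -> factor step for the signed first moment."""
    m = torch.outer(r, c)                      # reconstruct |m| from the two factored vectors
    m = torch.where(sign, m, -m)               # reapply the stored 1-bit signs
    m.mul_(beta1).add_(grad, alpha=1 - beta1)  # ordinary Adam-style EMA update on the full state
    sign = m >= 0                              # new sign state (bool here; bit-packed in the library)
    r, c = nnmf_rank1(m.abs())                 # re-factor the magnitude back to two vectors
    return r, c, sign
```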
60
+ ---
61
+
62
+ ## ⚡ Performance Characteristics
63
+
64
+ ### Memory Efficiency (SDXL Model - 6.5GB)
65
+ | Optimizer | Memory Usage | Description |
66
+ |-----------|--------------|-------------|
67
+ | `Adopt_Factored` | 328 MB | 4 small vectors + 1-bit state |
68
+ | `Adopt_Factored + AdEMAMix` | 625 MB | 6 small vectors + two 1-bit states |
69
+ | `Simplified_AdEMAMix` | 328 MB | Same as standard factored (no extra state) |
70
+
71
+ ### Speed Comparison (SDXL, Batch Size 4)
72
+ | Optimizer | Speed | Notes |
73
+ |-----------|-------|-------|
74
+ | `Adafactor` | ~8.5s/it | Baseline |
75
+ | `Adopt_Factored` | ~10s/it | +18% overhead from compression |
76
+ | `Adopt_Factored + AdEMAMix` | ~12s/it | +41% overhead (3 factored states) |
77
+
78
+ ---
79
+
80
+ ## 🧪 Available Optimizers
81
+
82
+ ### Standard Optimizers (All support `factored=True/False`)
83
+ | Optimizer | Description | Best For |
84
+ |-----------|-------------|----------|
85
+ | `Adam_Adv` | Advanced Adam implementation | General purpose |
86
+ | `Adopt_Adv` | Adam-variant with independent beta2 | Stable training for small batch size regimes |
87
+ | `Prodigy_Adv` | Prodigy with D-Adaptation | Adam with automatic LR tuning |
88
+ | `Simplified_AdEMAMix` | Adam variant with accumulator momentum | Small/large batch training when tuned correctly |
89
+ | `Lion_Adv` | Advanced Lion implementation | Memory-constrained environments |
90
+ | `Prodigy_Lion_Adv` | Prodigy + Lion combination | Lion with automatic LR tuning |
91
+
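
A minimal usage sketch, assuming `Adopt_adv` is exported at the package top level (as the `__all__` entries in `adv_optm/__init__.py` suggest); the keyword names mirror the `Adopt_adv` constructor shown further down in this diff, but check your installed version's signatures before relying on them:

```python
import torch
from adv_optm import Adopt_adv  # class names follow adv_optm/optim/*_adv.py

model = torch.nn.Linear(512, 512)

optimizer = Adopt_adv(
    model.parameters(),
    lr=1e-4,
    factored=True,             # rank-1 factored states + 1-bit sign (the SMMF-style path)
    stochastic_rounding=True,  # only matters when parameters are stored in BF16
    use_atan2=True,            # eps-free smoothing with built-in clipping
)

for _ in range(5):
    loss = model(torch.randn(8, 512)).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```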
92
+ ### Feature Matrix
93
+ | Feature | Adam_Adv | Adopt_Adv | Prodigy_Adv | Simplified_AdEMAMix | Lion_Adv |
94
+ |---------|----------|-----------|-------------|---------------------|----------|
95
+ | Factored | ✓ | ✓ | ✓ | ✓ | ✓ |
96
+ | AdEMAMix | ✓ | ✓ | ✓ | ✗ | ✗ |
97
+ | Simplified_AdEMAMix | ✗ | ✗ | ✓ | ✓ | ✗ |
98
+ | OrthoGrad | ✓ | ✓ | ✓ | ✓ | ✓ |
99
+ | Grams | ✓ | ✓ | ✓ | ✗ | ✗ |
100
+ | Cautious | ✓ | ✓ | ✓ | ✗ | ✓ |
101
+ | atan2 | ✓ | ✓ | ✓ | ✗ | ✗ |
102
+ | Stochastic Rounding | ✓ | ✓ | ✓ | ✓ | ✓ |
103
+ | Fused Backward Pass | ✓ | ✓ | ✓ | ✓ | ✓ |
104
+
105
+ ---
106
+
107
+ ## ⚙️ Key Features & Parameters
108
+
109
+ ### Comprehensive Feature Guide
110
+
111
+ | Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
112
+ |---------|-------------|-------------------|--------------------|-------------------|--------------|
113
+ | **Factored** | Memory-efficient optimization using rank-1 factorization | Enable for large models (>1B params) or limited VRAM | +12-41% time overhead; optimizer state reduced to factored vectors plus a 1-bit sign | [SMMF](https://arxiv.org/abs/2412.08894) | All optimizers |
114
+ | **AdEMAMix** | Dual EMA system for momentum | Use for long training runs (10k+ steps) | +1 state memory. | [AdEMAMix](https://arxiv.org/abs/2409.03137) | Adam/Adopt/Prodigy |
115
+ | **Simplified_AdEMAMix** | Accumulator-based momentum | Small batch training (≤32) | Same memory as standard, no extra overhead | [Schedule-Free Connections](https://arxiv.org/abs/2502.02431) | Adam/Prodigy |
116
+ | **OrthoGrad** | Removes gradient component parallel to weights | Full finetuning without weight decay | +33% time overhead, no memory impact | [Grokking at Edge](https://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability) | All optimizers |
117
+ | **Stochastic Rounding** | Improves precision for BF16 training | BF16 training | Minimal overhead (<5%) | [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192) | All optimizers |
118
+ | **atan2** | Robust eps replacement + built-in clipping | Use with Adopt or unstable training | No overhead | [Adam-atan2](https://github.com/lucidrains/adam-atan2-pytorch) | Adam/Adopt/Prodigy |
119
+ | **Cautious** | Applies update components only where they align with the current gradient | Generally safe to enable; should speed up convergence | No overhead | [C-Optim](https://github.com/kyleliang919/C-Optim) | Adam/Adopt/Prodigy |
120
+ | **Grams** | Takes the update direction from the sign of the current gradient | Similar to Cautious, but with a stronger effect | No overhead | [Grams](https://github.com/Gunale0926/Grams) | Adam/Adopt/Prodigy |
121
+
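
For the two entries that are hardest to picture from one line, OrthoGrad and Cautious, here are rough reference implementations written against their upstream papers/repos; the package's own versions (e.g. `adv_optm/util/OrthoGrad.py`) may differ in details such as rescaling:

```python
import torch

def orthograd(w: torch.Tensor, g: torch.Tensor, eps: float = 1e-30) -> torch.Tensor:
    """Drop the gradient component parallel to the weights: g - proj_w(g)."""
    w_flat, g_flat = w.flatten(), g.flatten()
    proj = torch.dot(w_flat, g_flat) / (torch.dot(w_flat, w_flat) + eps)
    return (g_flat - proj * w_flat).view_as(g)

def cautious(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Zero update components whose sign disagrees with the gradient,
    then rescale so the average update magnitude is roughly preserved."""
    mask = (update * grad > 0).to(update.dtype)
    scale = mask.numel() / mask.sum().clamp_min(1)
    return update * mask * scale
```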
122
+ ---
123
+
124
+ ## Simplified_AdEMAMix Parameters
125
+ Simplified_AdEMAMix replaces the standard momentum with an accumulator for better performance in both small- and large-batch regimes.
126
+
127
+ | Parameter | Recommended Values | Description |
128
+ |-----------|---------------------|-------------|
129
+ | `beta1` | 0.9 (large BS), 0.99-0.9999 (small BS) | Determines memory length of accumulator |
130
+ | `alpha` | 100-10 (small BS), 1-0 (large BS) | Gradient smoothing factor |
131
+
132
+ **Alpha Tuning Guide**:
133
+ | Batch Size | Recommended α | Rationale |
134
+ |------------|---------------|-----------|
135
+ | Small (≤32) | 100, 50, 20, 10 | Emphasizes recent gradients for quick adaptation |
136
+ | Medium (32-512) | 10, 5, 2, 1 | Balanced approach |
137
+ | Large (≥512) | 1, 0.5, 0 | Emphasizes historical gradients for stability |
138
+
139
+ ⚠️ **Important**: Use a **~100x smaller learning rate** with Simplified_AdEMAMix than with AdamW (e.g., 1e-6 instead of 1e-4).
140
+
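
Paraphrasing the `Adopt_adv` changes later in this diff, the step looks roughly like the sketch below: the first moment becomes an accumulator (no `1 - beta1` factor) and the raw gradient is mixed back in with weight `alpha_grad` (the `alpha` of the tables above, exposed as `alpha_grad` in the `Adopt_adv` signature), which is why a much smaller learning rate is advised. This is an illustrative sketch, not the library's exact code:

```python
import torch

def simplified_ademamix_step(p, grad, grad_hat, m, lr, beta1=0.99, alpha_grad=100.0):
    """p: parameter, grad: raw gradient, grad_hat: second-moment-normalized gradient,
    m: accumulator state. All tensors share the same shape."""
    m.mul_(beta1).add_(grad_hat)        # accumulator: decayed, but the full gradient is added
    update = m + alpha_grad * grad      # heavy weight on the *current* raw gradient
    p.add_(update, alpha=-lr)           # hence the ~100x smaller LR than AdamW
    return p, m
```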
141
+ ### 📊 Performance Validation
142
+ Small Batch Training (SDXL, BS=2, 1.8K steps)
143
+ ![Training Comparison](https://github.com/user-attachments/assets/7eff0671-cc59-47fc-8b63-d5205456d649)
144
+
145
+ - **🟢 Prodigy_adv** (beta1=0.9, d0=1e-5): Final LR=2.9e-4
146
+ - **🔵 Prodigy_adv + Simplified_AdEMAMix** (beta1=0.99, α=100, d0=1e-7): Final LR=5.8e-6
147
+
148
+ **Results**:
149
+ - Simplified_AdEMAMix shows faster convergence and better final performance
150
+ - D-Adaptation automatically handles aggressive updates (50x smaller LR)
151
+ - Generated samples show significantly better quality with Simplified_AdEMAMix
152
+
153
+ ---
154
+
155
+ ## ⚠️ Known Limitations
156
+
157
+ ### 1. Prodigy_Adv Sensitivity
158
+ - Highly sensitive to gradient modifications (Adopt normalization, low-rank factorization)
159
+ - May fail to increase learning rate in some LoRA scenarios
160
+ - **Fix**: Disable factorization or set beta1=0
161
+
162
+ ### 2. Aggressive Learning Rates
163
+ - Can destabilize factored first moment
164
+ - **Recommendation**: Check Prodigy learning rate as reference for safe LR threshold
165
+
166
+ ---
167
+
168
+ ## 📚 References
169
+
170
+ 1. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
171
+ 2. [The AdEMAMix Optimizer: Better, Faster, Older](https://arxiv.org/abs/2409.03137)
172
+ 3. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants](https://arxiv.org/abs/2502.02431)
173
+
174
+ ---
adv_optm-1.0.0/README.md ADDED
@@ -0,0 +1,143 @@
1
+ # Advanced Optimizers (AIO)
2
+
3
+ A comprehensive, all-in-one collection of optimization algorithms for deep learning, designed for maximum efficiency, minimal memory footprint, and superior performance across diverse model architectures and training scenarios.
4
+
5
+ [![PyPI](https://img.shields.io/pypi/v/adv_optm)](https://pypi.org/project/adv_optm/)
6
+
7
+ ---
8
+
9
+ ## 📦 Installation
10
+
11
+ ```bash
12
+ pip install adv_optm
13
+ ```
14
+
15
+ ---
16
+
17
+ ## 🧠 Core Innovations
18
+
19
+ This library integrates multiple state-of-the-art optimization techniques validated through extensive research and practical training, with 1-bit compression for optimizer states:
20
+
21
+ ### **Memory-Efficient Optimization (SMMF-inspired)**
22
+ - **Paper**: [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
23
+ - **Approach**: Uses rank-1 non-negative matrix factorization with reconstruction cycle (factor → reconstruct → update → factor)
24
+ - **Innovation**:
25
+ - First moment split into **1-bit sign + absolute value**
26
+ - Final storage: **four factored vectors + one 1-bit sign state**
27
+ - Preserves Adam-like update quality with drastically reduced memory
28
+
29
+ ---
30
+
31
+ ## ⚡ Performance Characteristics
32
+
33
+ ### Memory Efficiency (SDXL Model - 6.5GB)
34
+ | Optimizer | Memory Usage | Description |
35
+ |-----------|--------------|-------------|
36
+ | `Adopt_Factored` | 328 MB | 4 small vectors + 1-bit state |
37
+ | `Adopt_Factored + AdEMAMix` | 625 MB | 6 small vectors + two 1-bit states |
38
+ | `Simplified_AdEMAMix` | 328 MB | Same as standard factored (no extra state) |
39
+
40
+ ### Speed Comparison (SDXL, Batch Size 4)
41
+ | Optimizer | Speed | Notes |
42
+ |-----------|-------|-------|
43
+ | `Adafactor` | ~8.5s/it | Baseline |
44
+ | `Adopt_Factored` | ~10s/it | +18% overhead from compression |
45
+ | `Adopt_Factored + AdEMAMix` | ~12s/it | +41% overhead (3 factored states) |
46
+
47
+ ---
48
+
49
+ ## 🧪 Available Optimizers
50
+
51
+ ### Standard Optimizers (All support `factored=True/False`)
52
+ | Optimizer | Description | Best For |
53
+ |-----------|-------------|----------|
54
+ | `Adam_Adv` | Advanced Adam implementation | General purpose |
55
+ | `Adopt_Adv` | Adam-variant with independent beta2 | Stable training for small batch size regimes |
56
+ | `Prodigy_Adv` | Prodigy with D-Adaptation | Adam with automatic LR tuning |
57
+ | `Simplified_AdEMAMix` | Adam variant with accumulator momentum | Small/large batch training when tuned correctly |
58
+ | `Lion_Adv` | Advanced Lion implementation | Memory-constrained environments |
59
+ | `Prodigy_Lion_Adv` | Prodigy + Lion combination | Lion with automatic LR tuning |
60
+
61
+ ### Feature Matrix
62
+ | Feature | Adam_Adv | Adopt_Adv | Prodigy_Adv | Simplified_AdEMAMix | Lion_Adv |
63
+ |---------|----------|-----------|-------------|---------------------|----------|
64
+ | Factored | ✓ | ✓ | ✓ | ✓ | ✓ |
65
+ | AdEMAMix | ✓ | ✓ | ✓ | ✗ | ✗ |
66
+ | Simplified_AdEMAMix | ✗ | ✗ | ✓ | ✓ | ✗ |
67
+ | OrthoGrad | ✓ | ✓ | ✓ | ✓ | ✓ |
68
+ | Grams | ✓ | ✓ | ✓ | ✗ | ✗ |
69
+ | Cautious | ✓ | ✓ | ✓ | ✗ | ✓ |
70
+ | atan2 | ✓ | ✓ | ✓ | ✗ | ✗ |
71
+ | Stochastic Rounding | ✓ | ✓ | ✓ | ✓ | ✓ |
72
+ | Fused Backward Pass | ✓ | ✓ | ✓ | ✓ | ✓ |
73
+
74
+ ---
75
+
76
+ ## ⚙️ Key Features & Parameters
77
+
78
+ ### Comprehensive Feature Guide
79
+
80
+ | Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
81
+ |---------|-------------|-------------------|--------------------|-------------------|--------------|
82
+ | **Factored** | Memory-efficient optimization using rank-1 factorization | Enable for large models (>1B params) or limited VRAM | +12-41% time overhead; optimizer state reduced to factored vectors plus a 1-bit sign | [SMMF](https://arxiv.org/abs/2412.08894) | All optimizers |
83
+ | **AdEMAMix** | Dual EMA system for momentum | Use for long training runs (10k+ steps) | +1 state memory. | [AdEMAMix](https://arxiv.org/abs/2409.03137) | Adam/Adopt/Prodigy |
84
+ | **Simplified_AdEMAMix** | Accumulator-based momentum | Small batch training (≤32) | Same memory as standard, no extra overhead | [Schedule-Free Connections](https://arxiv.org/abs/2502.02431) | Adam/Prodigy |
85
+ | **OrthoGrad** | Removes gradient component parallel to weights | Full finetuning without weight decay | +33% time overhead, no memory impact | [Grokking at Edge](https://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability) | All optimizers |
86
+ | **Stochastic Rounding** | Improves precision for BF16 training | BF16 training | Minimal overhead (<5%) | [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192) | All optimizers |
87
+ | **atan2** | Robust eps replacement + built-in clipping | Use with Adopt or unstable training | No overhead | [Adam-atan2](https://github.com/lucidrains/adam-atan2-pytorch) | Adam/Adopt/Prodigy |
88
+ | **Cautious** | Applies update components only where they align with the current gradient | Generally safe to enable; should speed up convergence | No overhead | [C-Optim](https://github.com/kyleliang919/C-Optim) | Adam/Adopt/Prodigy |
89
+ | **Grams** | Takes the update direction from the sign of the current gradient | Similar to Cautious, but with a stronger effect | No overhead | [Grams](https://github.com/Gunale0926/Grams) | Adam/Adopt/Prodigy |
90
+
91
+ ---
92
+
93
+ ## Simplified_AdEMAMix Parameters
94
+ Simplified_AdEMAMix replaces the standard momentum with an accumulator for better performance in both small- and large-batch regimes.
95
+
96
+ | Parameter | Recommended Values | Description |
97
+ |-----------|---------------------|-------------|
98
+ | `beta1` | 0.9 (large BS), 0.99-0.9999 (small BS) | Determines memory length of accumulator |
99
+ | `alpha` | 100-10 (small BS), 1-0 (large BS) | Gradient smoothing factor |
100
+
101
+ **Alpha Tuning Guide**:
102
+ | Batch Size | Recommended α | Rationale |
103
+ |------------|---------------|-----------|
104
+ | Small (≤32) | 100, 50, 20, 10 | Emphasizes recent gradients for quick adaptation |
105
+ | Medium (32-512) | 10, 5, 2, 1 | Balanced approach |
106
+ | Large (≥512) | 1, 0.5, 0 | Emphasizes historical gradients for stability |
107
+
108
+ ⚠️ **Important**: Use a **~100x smaller learning rate** with Simplified_AdEMAMix than with AdamW (e.g., 1e-6 instead of 1e-4).
109
+
110
+ ### 📊 Performance Validation
111
+ Small Batch Training (SDXL, BS=2, 1.8K steps)
112
+ ![Training Comparison](https://github.com/user-attachments/assets/7eff0671-cc59-47fc-8b63-d5205456d649)
113
+
114
+ - **🟢 Prodigy_adv** (beta1=0.9, d0=1e-5): Final LR=2.9e-4
115
+ - **🔵 Prodigy_adv + Simplified_AdEMAMix** (beta1=0.99, α=100, d0=1e-7): Final LR=5.8e-6
116
+
117
+ **Results**:
118
+ - Simplified_AdEMAMix shows faster convergence and better final performance
119
+ - D-Adaptation automatically handles aggressive updates (50x smaller LR)
120
+ - Generated samples show significantly better quality with Simplified_AdEMAMix
121
+
122
+ ---
123
+
124
+ ## ⚠️ Known Limitations
125
+
126
+ ### 1. Prodigy_Adv Sensitivity
127
+ - Highly sensitive to gradient modifications (Adopt normalization, low-rank factorization)
128
+ - May fail to increase learning rate in some LoRA scenarios
129
+ - **Fix**: Disable factorization or set beta1=0
130
+
131
+ ### 2. Aggressive Learning Rates
132
+ - Can destabilize factored first moment
133
+ - **Recommendation**: Check Prodigy learning rate as reference for safe LR threshold
134
+
135
+ ---
136
+
137
+ ## 📚 References
138
+
139
+ 1. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
140
+ 2. [The AdEMAMix Optimizer: Better, Faster, Older](https://arxiv.org/abs/2409.03137)
141
+ 3. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants](https://arxiv.org/abs/2502.02431)
142
+
143
+ ---
{adv_optm-0.1.8 → adv_optm-1.0.0}/adv_optm/__init__.py
@@ -16,4 +16,4 @@ __all__ = [
16
16
  "Lion_Prodigy_adv",
17
17
  ]
18
18
 
19
- __version__ = "0.1.8"
19
+ __version__ = "1.0.0"
{adv_optm-0.1.8 → adv_optm-1.0.0}/adv_optm/optim/AdamW_adv.py
@@ -86,6 +86,9 @@ class AdamW_adv(torch.optim.Optimizer):
86
86
  raise ValueError(f"Epsilon should be >= 0.0. Got {eps}")
87
87
  if not (weight_decay >= 0.0):
88
88
  raise ValueError(f"Weight-decay should be >= 0.0. Got {weight_decay}")
89
+ if use_cautious and use_grams:
90
+ print("Warning: use_cautious is incompatible with use_grams, Disabling use_cautious.")
91
+ use_cautious = False
89
92
 
90
93
  defaults = {
91
94
  "lr": lr, "betas": betas, "eps": eps, "weight_decay": weight_decay,
{adv_optm-0.1.8 → adv_optm-1.0.0}/adv_optm/optim/Adopt_adv.py
@@ -62,6 +62,16 @@ class Adopt_adv(torch.optim.Optimizer):
62
62
  the warmup, `alpha` ramps from 0 to its target value. If `None`,
63
63
  the scheduler is disabled and the full `alpha` value is used from
64
64
  the start. (default: None)
65
+ Simplified_AdEMAMix (bool): whether to use the Simplified AdEMAMix update rule.
66
+ This replaces the EMA with an accumulator and changes the update numerator to `alpha_grad * grad + mt`, which can be
67
+ more responsive, especially for small batch sizes. Enabling this will
68
+ automatically disable `use_AdEMAMix`, `use_cautious`, `use_grams`,
69
+ and `use_atan2`. (default: False)
70
+ alpha_grad (float): Mixing coefficient for the Simplified AdEMAMix update rule
71
+ (only used when `Simplified_AdEMAMix` is `True`). Controls the weight of the
72
+ current gradient. For small batch sizes, use high values (e.g., 10-100) to be
73
+ more responsive. For large batch sizes, use low values (e.g., 0-1) for
74
+ stability. (default: 100.0)
65
75
  factored (bool): whether to use the factorization or disable it to use
66
76
  the uncompressed optimizer. (default: False)
67
77
  """
@@ -77,13 +87,15 @@ class Adopt_adv(torch.optim.Optimizer):
77
87
  vector_reshape: bool = True,
78
88
  stochastic_rounding: bool = True,
79
89
  use_atan2: bool = False,
80
- use_cautious: bool = True,
90
+ use_cautious: bool = False,
81
91
  use_grams: bool = False,
82
92
  use_orthograd: bool = False,
83
93
  use_AdEMAMix: bool = False,
84
94
  beta3_ema: float = 0.9999,
85
95
  alpha: float = 5.0,
86
96
  t_alpha: int | None = None,
97
+ Simplified_AdEMAMix: bool = False,
98
+ alpha_grad: float = 100.0,
87
99
  factored: bool = False,
88
100
  ):
89
101
  if not (lr >= 0.0):
@@ -94,19 +106,34 @@ class Adopt_adv(torch.optim.Optimizer):
94
106
  raise ValueError(f"Epsilon should be >= 0.0. Got {eps}")
95
107
  if not (weight_decay >= 0.0):
96
108
  raise ValueError(f"Weight-decay should be >= 0.0. Got {weight_decay}")
109
+ if use_cautious and use_grams:
110
+ print("Warning: use_cautious is incompatible with use_grams, Disabling use_cautious.")
111
+ use_cautious = False
112
+ if betas[0] == 0.0 and Simplified_AdEMAMix:
113
+ raise ValueError(f"Beta1 cannot be 0.0 when using Simplified_AdEMAMix. Got {betas[0]}")
114
+ if use_AdEMAMix and Simplified_AdEMAMix:
115
+ print("Warning: use_AdEMAMix is incompatible with Simplified_AdEMAMix, Disabling use_AdEMAMix.")
116
+ if use_grams and Simplified_AdEMAMix:
117
+ print("Warning: use_grams is incompatible with Simplified_AdEMAMix, Disabling use_grams.")
118
+ if use_cautious and Simplified_AdEMAMix:
119
+ print("Warning: use_cautious is incompatible with Simplified_AdEMAMix, Disabling use_cautious.")
120
+ if use_atan2 and Simplified_AdEMAMix:
121
+ print("Warning: use_atan2 is incompatible with Simplified_AdEMAMix. Disabling use_atan2.")
122
+ use_atan2 = False
97
123
 
98
124
  defaults = {
99
125
  "lr": lr, "betas": betas, "eps": eps, "weight_decay": weight_decay,
100
126
  "vector_reshape": vector_reshape, "beta3_ema": beta3_ema, "alpha": alpha,
101
- "t_alpha": t_alpha,
127
+ "t_alpha": t_alpha, "alpha_grad": alpha_grad,
102
128
  }
103
129
  self.clip_lambda = clip_lambda
104
130
  self.stochastic_rounding = stochastic_rounding
105
- self.use_atan2 = use_atan2
106
- self.use_cautious = use_cautious
107
- self.use_grams = use_grams
131
+ self.use_atan2 = use_atan2 and not Simplified_AdEMAMix
132
+ self.use_cautious = use_cautious and not Simplified_AdEMAMix
133
+ self.use_grams = use_grams and not Simplified_AdEMAMix
108
134
  self.use_orthograd = use_orthograd
109
- self.use_AdEMAMix = use_AdEMAMix
135
+ self.use_AdEMAMix = use_AdEMAMix and not Simplified_AdEMAMix
136
+ self.Simplified_AdEMAMix = Simplified_AdEMAMix
110
137
  self.factored = factored
111
138
  super().__init__(params, defaults)
112
139
 
@@ -185,6 +212,8 @@ class Adopt_adv(torch.optim.Optimizer):
185
212
  alpha_t = alpha
186
213
  if t_alpha is not None and t_alpha > 0 and current_step < t_alpha:
187
214
  alpha_t = min(current_step * alpha / t_alpha, alpha)
215
+ if self.Simplified_AdEMAMix:
216
+ alpha_grad = group["alpha_grad"]
188
217
 
189
218
  if state['factored']:
190
219
  d1, d2 = state['effective_shape']
@@ -224,7 +253,10 @@ class Adopt_adv(torch.optim.Optimizer):
224
253
  del denom
225
254
 
226
255
  # ADOPT Step B: Update momentum m_t using normalized gradient
227
- mt.mul_(beta1).add_(normalized_grad, alpha=1.0 - beta1)
256
+ if self.Simplified_AdEMAMix:
257
+ mt.mul_(beta1).add_(normalized_grad, alpha=1.0)
258
+ else:
259
+ mt.mul_(beta1).add_(normalized_grad, alpha=1.0 - beta1)
228
260
  if self.use_grams:
229
261
  mt = grad_reshaped.sign() * mt.abs()
230
262
  elif self.use_cautious:
@@ -237,6 +269,8 @@ class Adopt_adv(torch.optim.Optimizer):
237
269
  mt_slow.mul_(beta3_ema).add_(normalized_grad, alpha=1.0 - beta3_ema)
238
270
  update = torch.add(mt, m_slow, alpha=alpha_t)
239
271
  update = update.view(p.shape)
272
+ elif self.Simplified_AdEMAMix:
273
+ update = torch.add(mt, grad_reshaped, alpha=alpha_grad)
240
274
  else:
241
275
  update = mt.view(p.shape)
242
276
 
@@ -283,7 +317,10 @@ class Adopt_adv(torch.optim.Optimizer):
283
317
  del denom
284
318
 
285
319
  # ADOPT Step B: Update momentum m_t
286
- m.mul_(beta1).add_(normalized_grad, alpha=1.0 - beta1)
320
+ if self.Simplified_AdEMAMix:
321
+ m.mul_(beta1).add_(normalized_grad, alpha=1.0)
322
+ else:
323
+ m.mul_(beta1).add_(normalized_grad, alpha=1.0 - beta1)
287
324
 
288
325
  if self.use_grams:
289
326
  m = grad.sign() * m.abs()
@@ -296,6 +333,8 @@ class Adopt_adv(torch.optim.Optimizer):
296
333
  if self.use_AdEMAMix:
297
334
  m_slow.mul_(beta3_ema).add_(normalized_grad, alpha=1.0 - beta3_ema)
298
335
  update = torch.add(m, m_slow, alpha=alpha_t)
336
+ elif self.Simplified_AdEMAMix:
337
+ update = torch.add(m, grad, alpha=alpha_grad)
299
338
  else:
300
339
  update = m.clone()
301
340
 
{adv_optm-0.1.8 → adv_optm-1.0.0}/adv_optm/optim/Prodigy_adv.py
@@ -127,8 +127,11 @@ class Prodigy_adv(torch.optim.Optimizer):
127
127
  raise ValueError(f"Weight-decay should be >= 0.0. Got {weight_decay}")
128
128
  if not (prodigy_steps >= 0):
129
129
  raise ValueError(f"prodigy_steps should be >= 0. Got {prodigy_steps}")
130
+ if use_cautious and use_grams:
131
+ print("Warning: use_cautious is incompatible with use_grams, Disabling use_cautious.")
132
+ use_cautious = False
130
133
  if betas[0] == 0.0 and Simplified_AdEMAMix:
131
- raise ValueError(f"Beta 1 cannot be 0.0 when using Simplified_AdEMAMix. Got {betas[0]}")
134
+ raise ValueError(f"Beta1 cannot be 0.0 when using Simplified_AdEMAMix. Got {betas[0]}")
132
135
  if use_AdEMAMix and Simplified_AdEMAMix:
133
136
  print("Warning: use_AdEMAMix is incompatible with Simplified_AdEMAMix, Disabling use_AdEMAMix.")
134
137
  if use_grams and Simplified_AdEMAMix:
adv_optm-1.0.0/adv_optm.egg-info/PKG-INFO ADDED
@@ -0,0 +1,174 @@
1
+ Metadata-Version: 2.4
2
+ Name: adv_optm
3
+ Version: 1.0.0
4
+ Summary: A family of highly efficient, lightweight yet powerful optimizers.
5
+ Home-page: https://github.com/Koratahiu/Advanced_Optimizers
6
+ Author: Koratahiu
7
+ Author-email: hiuhonor@gmail.com
8
+ License: Apache 2.0
9
+ Keywords: llm,fine-tuning,memory-efficient,low-rank,compression,pytorch,optimizer,adam
10
+ Classifier: Programming Language :: Python :: 3
11
+ Classifier: License :: OSI Approved :: Apache Software License
12
+ Classifier: Operating System :: OS Independent
13
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
14
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
15
+ Requires-Python: >=3.8
16
+ Description-Content-Type: text/markdown
17
+ License-File: LICENSE
18
+ Requires-Dist: torch>=2.0
19
+ Dynamic: author
20
+ Dynamic: author-email
21
+ Dynamic: classifier
22
+ Dynamic: description
23
+ Dynamic: description-content-type
24
+ Dynamic: home-page
25
+ Dynamic: keywords
26
+ Dynamic: license
27
+ Dynamic: license-file
28
+ Dynamic: requires-dist
29
+ Dynamic: requires-python
30
+ Dynamic: summary
31
+
32
+ # Advanced Optimizers (AIO)
33
+
34
+ A comprehensive, all-in-one collection of optimization algorithms for deep learning, designed for maximum efficiency, minimal memory footprint, and superior performance across diverse model architectures and training scenarios.
35
+
36
+ [![PyPI](https://img.shields.io/pypi/v/adv_optm)](https://pypi.org/project/adv_optm/)
37
+
38
+ ---
39
+
40
+ ## 📦 Installation
41
+
42
+ ```bash
43
+ pip install adv_optm
44
+ ```
45
+
46
+ ---
47
+
48
+ ## 🧠 Core Innovations
49
+
50
+ This library integrates multiple state-of-the-art optimization techniques validated through extensive research and practical training, with 1-bit compression for optimizer states:
51
+
52
+ ### **Memory-Efficient Optimization (SMMF-inspired)**
53
+ - **Paper**: [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
54
+ - **Approach**: Uses rank-1 non-negative matrix factorization with reconstruction cycle (factor → reconstruct → update → factor)
55
+ - **Innovation**:
56
+ - First moment split into **1-bit sign + absolute value**
57
+ - Final storage: **four factored vectors + one 1-bit sign state**
58
+ - Preserves Adam-like update quality with drastically reduced memory
59
+
60
+ ---
61
+
62
+ ## ⚡ Performance Characteristics
63
+
64
+ ### Memory Efficiency (SDXL Model - 6.5GB)
65
+ | Optimizer | Memory Usage | Description |
66
+ |-----------|--------------|-------------|
67
+ | `Adopt_Factored` | 328 MB | 4 small vectors + 1-bit state |
68
+ | `Adopt_Factored + AdEMAMix` | 625 MB | 6 small vectors + two 1-bit states |
69
+ | `Simplified_AdEMAMix` | 328 MB | Same as standard factored (no extra state) |
70
+
71
+ ### Speed Comparison (SDXL, Batch Size 4)
72
+ | Optimizer | Speed | Notes |
73
+ |-----------|-------|-------|
74
+ | `Adafactor` | ~8.5s/it | Baseline |
75
+ | `Adopt_Factored` | ~10s/it | +18% overhead from compression |
76
+ | `Adopt_Factored + AdEMAMix` | ~12s/it | +41% overhead (3 factored states) |
77
+
78
+ ---
79
+
80
+ ## 🧪 Available Optimizers
81
+
82
+ ### Standard Optimizers (All support `factored=True/False`)
83
+ | Optimizer | Description | Best For |
84
+ |-----------|-------------|----------|
85
+ | `Adam_Adv` | Advanced Adam implementation | General purpose |
86
+ | `Adopt_Adv` | Adam-variant with independent beta2 | Stable training for small batch size regimes |
87
+ | `Prodigy_Adv` | Prodigy with D-Adaptation | Adam with automatic LR tuning |
88
+ | `Simplified_AdEMAMix` | Adam variant with accumulator momentum | Small/large batch training when tuned correctly |
89
+ | `Lion_Adv` | Advanced Lion implementation | Memory-constrained environments |
90
+ | `Prodigy_Lion_Adv` | Prodigy + Lion combination | Lion with automatic LR tuning |
91
+
92
+ ### Feature Matrix
93
+ | Feature | Adam_Adv | Adopt_Adv | Prodigy_Adv | Simplified_AdEMAMix | Lion_Adv |
94
+ |---------|----------|-----------|-------------|---------------------|----------|
95
+ | Factored | ✓ | ✓ | ✓ | ✓ | ✓ |
96
+ | AdEMAMix | ✓ | ✓ | ✓ | ✗ | ✗ |
97
+ | Simplified_AdEMAMix | ✗ | ✗ | ✓ | ✓ | ✗ |
98
+ | OrthoGrad | ✓ | ✓ | ✓ | ✓ | ✓ |
99
+ | Grams | ✓ | ✓ | ✓ | ✗ | ✗ |
100
+ | Cautious | ✓ | ✓ | ✓ | ✗ | ✓ |
101
+ | atan2 | ✓ | ✓ | ✓ | ✗ | ✗ |
102
+ | Stochastic Rounding | ✓ | ✓ | ✓ | ✓ | ✓ |
103
+ | Fused Backward Pass | ✓ | ✓ | ✓ | ✓ | ✓ |
104
+
105
+ ---
106
+
107
+ ## ⚙️ Key Features & Parameters
108
+
109
+ ### Comprehensive Feature Guide
110
+
111
+ | Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
112
+ |---------|-------------|-------------------|--------------------|-------------------|--------------|
113
+ | **Factored** | Memory-efficient optimization using rank-1 factorization | Enable for large models (>1B params) or limited VRAM | +12-41% time overhead; optimizer state reduced to factored vectors plus a 1-bit sign | [SMMF](https://arxiv.org/abs/2412.08894) | All optimizers |
114
+ | **AdEMAMix** | Dual EMA system for momentum | Use for long training runs (10k+ steps) | +1 state memory. | [AdEMAMix](https://arxiv.org/abs/2409.03137) | Adam/Adopt/Prodigy |
115
+ | **Simplified_AdEMAMix** | Accumulator-based momentum | Small batch training (≤32) | Same memory as standard, no extra overhead | [Schedule-Free Connections](https://arxiv.org/abs/2502.02431) | Adam/Prodigy |
116
+ | **OrthoGrad** | Removes gradient component parallel to weights | Full finetuning without weight decay | +33% time overhead, no memory impact | [Grokking at Edge](https://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability) | All optimizers |
117
+ | **Stochastic Rounding** | Improves precision for BF16 training | BF16 training | Minimal overhead (<5%) | [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192) | All optimizers |
118
+ | **atan2** | Robust eps replacement + built-in clipping | Use with Adopt or unstable training | No overhead | [Adam-atan2](https://github.com/lucidrains/adam-atan2-pytorch) | Adam/Adopt/Prodigy |
119
+ | **Cautious** | Applies update components only where they align with the current gradient | Generally safe to enable; should speed up convergence | No overhead | [C-Optim](https://github.com/kyleliang919/C-Optim) | Adam/Adopt/Prodigy |
120
+ | **Grams** | Takes the update direction from the sign of the current gradient | Similar to Cautious, but with a stronger effect | No overhead | [Grams](https://github.com/Gunale0926/Grams) | Adam/Adopt/Prodigy |
121
+
122
+ ---
123
+
124
+ ## Simplified_AdEMAMix Parameters
125
+ Simplified_AdEMAMix replaces the standard momentum with an accumulator for better performance in both small- and large-batch regimes.
126
+
127
+ | Parameter | Recommended Values | Description |
128
+ |-----------|---------------------|-------------|
129
+ | `beta1` | 0.9 (large BS), 0.99-0.9999 (small BS) | Determines memory length of accumulator |
130
+ | `alpha` | 100-10 (small BS), 1-0 (large BS) | Gradient smoothing factor |
131
+
132
+ **Alpha Tuning Guide**:
133
+ | Batch Size | Recommended α | Rationale |
134
+ |------------|---------------|-----------|
135
+ | Small (≤32) | 100, 50, 20, 10 | Emphasizes recent gradients for quick adaptation |
136
+ | Medium (32-512) | 10, 5, 2, 1 | Balanced approach |
137
+ | Large (≥512) | 1, 0.5, 0 | Emphasizes historical gradients for stability |
138
+
139
+ ⚠️ **Important**: Use a **~100x smaller learning rate** with Simplified_AdEMAMix than with AdamW (e.g., 1e-6 instead of 1e-4).
140
+
141
+ ### 📊 Performance Validation
142
+ Small Batch Training (SDXL, BS=2, 1.8K steps)
143
+ ![Training Comparison](https://github.com/user-attachments/assets/7eff0671-cc59-47fc-8b63-d5205456d649)
144
+
145
+ - **🟢 Prodigy_adv** (beta1=0.9, d0=1e-5): Final LR=2.9e-4
146
+ - **🔵 Prodigy_adv + Simplified_AdEMAMix** (beta1=0.99, α=100, d0=1e-7): Final LR=5.8e-6
147
+
148
+ **Results**:
149
+ - Simplified_AdEMAMix shows faster convergence and better final performance
150
+ - D-Adaptation automatically handles aggressive updates (50x smaller LR)
151
+ - Generated samples show significantly better quality with Simplified_AdEMAMix
152
+
153
+ ---
154
+
155
+ ## ⚠️ Known Limitations
156
+
157
+ ### 1. Prodigy_Adv Sensitivity
158
+ - Highly sensitive to gradient modifications (Adopt normalization, low-rank factorization)
159
+ - May fail to increase learning rate in some LoRA scenarios
160
+ - **Fix**: Disable factorization or set beta1=0
161
+
162
+ ### 2. Aggressive Learning Rates
163
+ - Can destabilize factored first moment
164
+ - **Recommendation**: Check Prodigy learning rate as reference for safe LR threshold
165
+
166
+ ---
167
+
168
+ ## 📚 References
169
+
170
+ 1. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
171
+ 2. [The AdEMAMix Optimizer: Better, Faster, Older](https://arxiv.org/abs/2409.03137)
172
+ 3. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants](https://arxiv.org/abs/2502.02431)
173
+
174
+ ---
{adv_optm-0.1.8 → adv_optm-1.0.0}/setup.py
@@ -5,7 +5,7 @@ with open("README.md", "r", encoding="utf-8") as fh:
5
5
 
6
6
  setup(
7
7
  name="adv_optm",
8
- version="0.1.8",
8
+ version="1.0.0",
9
9
  author="Koratahiu",
10
10
  author_email="hiuhonor@gmail.com",
11
11
  license='Apache 2.0',
adv_optm-0.1.8/PKG-INFO DELETED
@@ -1,130 +0,0 @@
1
- Metadata-Version: 2.4
2
- Name: adv_optm
3
- Version: 0.1.8
4
- Summary: A family of highly efficient, lightweight yet powerful optimizers.
5
- Home-page: https://github.com/Koratahiu/Advanced_Optimizers
6
- Author: Koratahiu
7
- Author-email: hiuhonor@gmail.com
8
- License: Apache 2.0
9
- Keywords: llm,fine-tuning,memory-efficient,low-rank,compression,pytorch,optimizer,adam
10
- Classifier: Programming Language :: Python :: 3
11
- Classifier: License :: OSI Approved :: Apache Software License
12
- Classifier: Operating System :: OS Independent
13
- Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
14
- Classifier: Topic :: Software Development :: Libraries :: Python Modules
15
- Requires-Python: >=3.8
16
- Description-Content-Type: text/markdown
17
- License-File: LICENSE
18
- Requires-Dist: torch>=2.0
19
- Dynamic: author
20
- Dynamic: author-email
21
- Dynamic: classifier
22
- Dynamic: description
23
- Dynamic: description-content-type
24
- Dynamic: home-page
25
- Dynamic: keywords
26
- Dynamic: license
27
- Dynamic: license-file
28
- Dynamic: requires-dist
29
- Dynamic: requires-python
30
- Dynamic: summary
31
-
32
- # Advanced Optimizers
33
-
34
- This repo introduces a new family of highly efficient, lightweight yet powerful optimizers, born from extensive research into recent academic literature and validated through practical training runs across diverse models.
35
-
36
- ---
37
-
38
- ### Install
39
-
40
- `pip install adv_optm`
41
-
42
- ---
43
-
44
- ### Theory (Inspired by SMMF)
45
-
46
- Based primarily on:
47
- **[SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization](https://arxiv.org/abs/2412.08894)**
48
-
49
- The core innovation:
50
- - Uses fast, non-negative matrix factorization (NNMF - rank 1), but **reconstructs the full state before each update** to preserve momentum accuracy, then re-factors afterward (factor → reconstruct → update → factor cycle).
51
- - For the *signed first moment*, we split into **sign + absolute value**:
52
- - Sign is stored as **1-bit state** via bitwise ops (SMMF originally used 8-bit with 7 bits wasted).
53
- - Absolute value goes through the factor/reconstruct cycle using two factored vectors + the signed state.
54
- - Final storage: **four factored vectors + one 1-bit sign**.
55
- - Updates behave like full-state Adam but with drastically reduced memory.
56
-
57
- > ✅ **TL;DR**: Lightweight, strong, memory-efficient optimizer.
58
-
59
- ---
60
-
61
- ### Memory Cost
62
-
63
- - **Adopt_Factored** for full SDXL finetune: **328 MB** (4 small vectors + 1-bit state)
64
- - **Adopt_Factored with AdEMAMix** for full SDXL finetune: **625 MB** (6 small vectors + two 1-bit states)
65
- > SDXL is 6.5GB model.
66
-
67
- ---
68
-
69
- ### ⏱️ Speed (my tests in SDXL - BS 4)
70
-
71
- - **Adopt_Factored**: ~10s/it
72
- - **Adopt_Factored with AdEMAMix**: ~12s/it
73
- - **Adafactor**: ~8.5s/it
74
- → Overhead from compression/reconstruction cycles.
75
- → It's faster than [MLorc](https://arxiv.org/abs/2506.01897) (~12s/it), which uses RSVD compression, and should be the fastest momentum compression (AFAIK).
76
-
77
- ---
78
-
79
- ### 📈 Performance
80
-
81
- - **Better than Adafactor, and CAME factorzation methods**
82
- - **Comparable or identical to Adam** (see SMMF paper results)
83
-
84
- ---
85
-
86
- ### Available Optimizers (all support `Factored` toggle)
87
-
88
- Set `Factored=False` to disable factorization and run as a full uncompressed optimizer (like vanilla Adam).
89
-
90
- 1. **Adam**
91
- 2. **Prodigy**
92
- 3. **Adopt**
93
-
94
- ---
95
-
96
- ### Bonus Features (Built-in)
97
-
98
- - **Fused Backward Pass**
99
-
100
- - **Stochastic Rounding (SR)**: Improves quality and convergence for **BF16 training**.
101
-
102
- - **[AdEMAMix](https://arxiv.org/abs/2409.03137)**
103
- → This adds a second, slow-moving EMA, which is combined with the primary momentum to stabilize updates, especially during long runs of full finetuning.
104
- → A higher value of beta3 (e.g., 0.9999) gives the EMA a longer memory, making it more stable but slower to adapt. A lower value (e.g., 0.999) is often better for shorter training runs (2k-4k steps).
105
- → When `factored` is true, it compresses the new momentum in the same way as the first moment (1-bit state + 2 vectors). However, this introduces noticeable overhead as we are compressing/reconstructing a third state each step.
106
-
107
- ⚠️ **Note**: AdEMAMix updates are more aggressive than normal Adam/Adopt, so use a x2-x5 smaller LR than usual (or use Prodigy).
108
-
109
- - **[`atan2` smoothing & scaling](https://github.com/lucidrains/adam-atan2-pytorch)**
110
- → Robust `eps` replacement (no tuning!) + built-in gradient clipping
111
- → *Ideal for ADOPT* (which normally needs higher `eps` and clipping), so `use_atan2` is all-in-one for it.
112
-
113
- - **[OrthoGrad](https://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability)**
114
- → Removes gradient component parallel to weights → prevents "naïve loss minimization" (NLM) → reduces natural overfitting
115
- → Perfect for fine-tuning the direction of existing features (e.g., full finetune or training a trained LoRA) without weight decay erasing prior knowledge.
116
-
117
- ⚠️ **Note**: OrthoGrad introduces **~33% time overhead**, so take this into account.
118
-
119
- - **[Grams: Gradient Descent with Adaptive Momentum Scaling](https://github.com/Gunale0926/Grams)**
120
- → Eliminates the need for 1-bit momentum sign storage by using the **sign of gradients** for the first moment.
121
-
122
- ⚠️ **Not recommended for small batch sizes**: gradients are too noisy, which can destabilize momentum (tested for Prodigy and it made the optimizer slower to find the LR or converge in BS 4).
123
-
124
- ### Other Notes
125
-
126
- - **Adopt** skips the first step (only initializes the states) and has built-in clipping (sticking to the original optimizer), but we skip both of these when you enable `use_atan2`; as the optimizer becomes scale-invariant and the values of the states won't cause any issues or instability.
127
-
128
- - When `use_atan2` is True, `eps` will be ignored and you should also disable any gradient clipping.
129
-
130
- ---
adv_optm-0.1.8/README.md DELETED
@@ -1,99 +0,0 @@
1
- # Advanced Optimizers
2
-
3
- This repo introduces a new family of highly efficient, lightweight yet powerful optimizers, born from extensive research into recent academic literature and validated through practical training runs across diverse models.
4
-
5
- ---
6
-
7
- ### Install
8
-
9
- `pip install adv_optm`
10
-
11
- ---
12
-
13
- ### Theory (Inspired by SMMF)
14
-
15
- Based primarily on:
16
- **[SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization](https://arxiv.org/abs/2412.08894)**
17
-
18
- The core innovation:
19
- - Uses fast, non-negative matrix factorization (NNMF - rank 1), but **reconstructs the full state before each update** to preserve momentum accuracy, then re-factors afterward (factor → reconstruct → update → factor cycle).
20
- - For the *signed first moment*, we split into **sign + absolute value**:
21
- - Sign is stored as **1-bit state** via bitwise ops (SMMF originally used 8-bit with 7 bits wasted).
22
- - Absolute value goes through the factor/reconstruct cycle using two factored vectors + the signed state.
23
- - Final storage: **four factored vectors + one 1-bit sign**.
24
- - Updates behave like full-state Adam but with drastically reduced memory.
25
-
26
- > ✅ **TL;DR**: Lightweight, strong, memory-efficient optimizer.
27
-
28
- ---
29
-
30
- ### Memory Cost
31
-
32
- - **Adopt_Factored** for full SDXL finetune: **328 MB** (4 small vectors + 1-bit state)
33
- - **Adopt_Factored with AdEMAMix** for full SDXL finetune: **625 MB** (6 small vectors + two 1-bit states)
34
- > SDXL is 6.5GB model.
35
-
36
- ---
37
-
38
- ### ⏱️ Speed (my tests in SDXL - BS 4)
39
-
40
- - **Adopt_Factored**: ~10s/it
41
- - **Adopt_Factored with AdEMAMix**: ~12s/it
42
- - **Adafactor**: ~8.5s/it
43
- → Overhead from compression/reconstruction cycles.
44
- → It's faster than [MLorc](https://arxiv.org/abs/2506.01897) (~12s/it), which uses RSVD compression, and should be the fastest momentum compression (AFAIK).
45
-
46
- ---
47
-
48
- ### 📈 Performance
49
-
50
- - **Better than Adafactor, and CAME factorzation methods**
51
- - **Comparable or identical to Adam** (see SMMF paper results)
52
-
53
- ---
54
-
55
- ### Available Optimizers (all support `Factored` toggle)
56
-
57
- Set `Factored=False` to disable factorization and run as a full uncompressed optimizer (like vanilla Adam).
58
-
59
- 1. **Adam**
60
- 2. **Prodigy**
61
- 3. **Adopt**
62
-
63
- ---
64
-
65
- ### Bonus Features (Built-in)
66
-
67
- - **Fused Backward Pass**
68
-
69
- - **Stochastic Rounding (SR)**: Improves quality and convergence for **BF16 training**.
70
-
71
- - **[AdEMAMix](https://arxiv.org/abs/2409.03137)**
72
- → This adds a second, slow-moving EMA, which is combined with the primary momentum to stabilize updates, especially during long runs of full finetuning.
73
- → A higher value of beta3 (e.g., 0.9999) gives the EMA a longer memory, making it more stable but slower to adapt. A lower value (e.g., 0.999) is often better for shorter training runs (2k-4k steps).
74
- → When `factored` is true, it compresses the new momentum in the same way as the first moment (1-bit state + 2 vectors). However, this introduces noticeable overhead as we are compressing/reconstructing a third state each step.
75
-
76
- ⚠️ **Note**: AdEMAMix updates are more aggressive than normal Adam/Adopt, so use a x2-x5 smaller LR than usual (or use Prodigy).
77
-
78
- - **[`atan2` smoothing & scaling](https://github.com/lucidrains/adam-atan2-pytorch)**
79
- → Robust `eps` replacement (no tuning!) + built-in gradient clipping
80
- → *Ideal for ADOPT* (which normally needs higher `eps` and clipping), so `use_atan2` is all-in-one for it.
81
-
82
- - **[OrthoGrad](https://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability)**
83
- → Removes gradient component parallel to weights → prevents "naïve loss minimization" (NLM) → reduces natural overfitting
84
- → Perfect for fine-tuning the direction of existing features (e.g., full finetune or training a trained LoRA) without weight decay erasing prior knowledge.
85
-
86
- ⚠️ **Note**: OrthoGrad introduces **~33% time overhead**, so take this into account.
87
-
88
- - **[Grams: Gradient Descent with Adaptive Momentum Scaling](https://github.com/Gunale0926/Grams)**
89
- → Eliminates the need for 1-bit momentum sign storage by using the **sign of gradients** for the first moment.
90
-
91
- ⚠️ **Not recommended for small batch sizes**: gradients are too noisy, which can destabilize momentum (tested for Prodigy and it made the optimizer slower to find the LR or converge in BS 4).
92
-
93
- ### Other Notes
94
-
95
- - **Adopt** skips the first step (only initializes the states) and has built-in clipping (sticking to the original optimizer), but we skip both of these when you enable `use_atan2`; as the optimizer becomes scale-invariant and the values of the states won't cause any issues or instability.
96
-
97
- - When `use_atan2` is True, `eps` will be ignored and you should also disable any gradient clipping.
98
-
99
- ---
adv_optm-0.1.8/adv_optm.egg-info/PKG-INFO DELETED
@@ -1,130 +0,0 @@
1
- Metadata-Version: 2.4
2
- Name: adv_optm
3
- Version: 0.1.8
4
- Summary: A family of highly efficient, lightweight yet powerful optimizers.
5
- Home-page: https://github.com/Koratahiu/Advanced_Optimizers
6
- Author: Koratahiu
7
- Author-email: hiuhonor@gmail.com
8
- License: Apache 2.0
9
- Keywords: llm,fine-tuning,memory-efficient,low-rank,compression,pytorch,optimizer,adam
10
- Classifier: Programming Language :: Python :: 3
11
- Classifier: License :: OSI Approved :: Apache Software License
12
- Classifier: Operating System :: OS Independent
13
- Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
14
- Classifier: Topic :: Software Development :: Libraries :: Python Modules
15
- Requires-Python: >=3.8
16
- Description-Content-Type: text/markdown
17
- License-File: LICENSE
18
- Requires-Dist: torch>=2.0
19
- Dynamic: author
20
- Dynamic: author-email
21
- Dynamic: classifier
22
- Dynamic: description
23
- Dynamic: description-content-type
24
- Dynamic: home-page
25
- Dynamic: keywords
26
- Dynamic: license
27
- Dynamic: license-file
28
- Dynamic: requires-dist
29
- Dynamic: requires-python
30
- Dynamic: summary
31
-
32
- # Advanced Optimizers
33
-
34
- This repo introduces a new family of highly efficient, lightweight yet powerful optimizers, born from extensive research into recent academic literature and validated through practical training runs across diverse models.
35
-
36
- ---
37
-
38
- ### Install
39
-
40
- `pip install adv_optm`
41
-
42
- ---
43
-
44
- ### Theory (Inspired by SMMF)
45
-
46
- Based primarily on:
47
- **[SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization](https://arxiv.org/abs/2412.08894)**
48
-
49
- The core innovation:
50
- - Uses fast, non-negative matrix factorization (NNMF - rank 1), but **reconstructs the full state before each update** to preserve momentum accuracy, then re-factors afterward (factor → reconstruct → update → factor cycle).
51
- - For the *signed first moment*, we split into **sign + absolute value**:
52
- - Sign is stored as **1-bit state** via bitwise ops (SMMF originally used 8-bit with 7 bits wasted).
53
- - Absolute value goes through the factor/reconstruct cycle using two factored vectors + the signed state.
54
- - Final storage: **four factored vectors + one 1-bit sign**.
55
- - Updates behave like full-state Adam but with drastically reduced memory.
56
-
57
- > ✅ **TL;DR**: Lightweight, strong, memory-efficient optimizer.
58
-
59
- ---
60
-
61
- ### Memory Cost
62
-
63
- - **Adopt_Factored** for full SDXL finetune: **328 MB** (4 small vectors + 1-bit state)
64
- - **Adopt_Factored with AdEMAMix** for full SDXL finetune: **625 MB** (6 small vectors + two 1-bit states)
65
- > SDXL is 6.5GB model.
66
-
67
- ---
68
-
69
- ### ⏱️ Speed (my tests in SDXL - BS 4)
70
-
71
- - **Adopt_Factored**: ~10s/it
72
- - **Adopt_Factored with AdEMAMix**: ~12s/it
73
- - **Adafactor**: ~8.5s/it
74
- → Overhead from compression/reconstruction cycles.
75
- → It's faster than [MLorc](https://arxiv.org/abs/2506.01897) (~12s/it), which uses RSVD compression, and should be the fastest momentum compression (AFAIK).
76
-
77
- ---
78
-
79
- ### 📈 Performance
80
-
81
- - **Better than Adafactor, and CAME factorzation methods**
82
- - **Comparable or identical to Adam** (see SMMF paper results)
83
-
84
- ---
85
-
86
- ### Available Optimizers (all support `Factored` toggle)
87
-
88
- Set `Factored=False` to disable factorization and run as a full uncompressed optimizer (like vanilla Adam).
89
-
90
- 1. **Adam**
91
- 2. **Prodigy**
92
- 3. **Adopt**
93
-
94
- ---
95
-
96
- ### Bonus Features (Built-in)
97
-
98
- - **Fused Backward Pass**
99
-
100
- - **Stochastic Rounding (SR)**: Improves quality and convergence for **BF16 training**.
101
-
102
- - **[AdEMAMix](https://arxiv.org/abs/2409.03137)**
103
- → This adds a second, slow-moving EMA, which is combined with the primary momentum to stabilize updates, especially during long runs of full finetuning.
104
- → A higher value of beta3 (e.g., 0.9999) gives the EMA a longer memory, making it more stable but slower to adapt. A lower value (e.g., 0.999) is often better for shorter training runs (2k-4k steps).
105
- → When `factored` is true, it compresses the new momentum in the same way as the first moment (1-bit state + 2 vectors). However, this introduces noticeable overhead as we are compressing/reconstructing a third state each step.
106
-
107
- ⚠️ **Note**: AdEMAMix updates are more aggressive than normal Adam/Adopt, so use a x2-x5 smaller LR than usual (or use Prodigy).
108
-
109
- - **[`atan2` smoothing & scaling](https://github.com/lucidrains/adam-atan2-pytorch)**
110
- → Robust `eps` replacement (no tuning!) + built-in gradient clipping
111
- → *Ideal for ADOPT* (which normally needs higher `eps` and clipping), so `use_atan2` is all-in-one for it.
112
-
113
- - **[OrthoGrad](https://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability)**
114
- → Removes gradient component parallel to weights → prevents "naïve loss minimization" (NLM) → reduces natural overfitting
115
- → Perfect for fine-tuning the direction of existing features (e.g., full finetune or training a trained LoRA) without weight decay erasing prior knowledge.
116
-
117
- ⚠️ **Note**: OrthoGrad introduces **~33% time overhead**, so take this into account.
118
-
119
- - **[Grams: Gradient Descent with Adaptive Momentum Scaling](https://github.com/Gunale0926/Grams)**
120
- → Eliminates the need for 1-bit momentum sign storage by using the **sign of gradients** for the first moment.
121
-
122
- ⚠️ **Not recommended for small batch sizes**: gradients are too noisy, which can destabilize momentum (tested for Prodigy and it made the optimizer slower to find the LR or converge in BS 4).
123
-
124
- ### Other Notes
125
-
126
- - **Adopt** skips the first step (only initializes the states) and has built-in clipping (sticking to the original optimizer), but we skip both of these when you enable `use_atan2`; as the optimizer becomes scale-invariant and the values of the states won't cause any issues or instability.
127
-
128
- - When `use_atan2` is True, `eps` will be ignored and you should also disable any gradient clipping.
129
-
130
- ---