adv-optm 0.1.7__tar.gz → 0.1.9__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release: this version of adv-optm has been flagged as possibly problematic by the registry.
- adv_optm-0.1.9/PKG-INFO +174 -0
- adv_optm-0.1.9/README.md +143 -0
- {adv_optm-0.1.7 → adv_optm-0.1.9}/adv_optm/__init__.py +1 -1
- {adv_optm-0.1.7 → adv_optm-0.1.9}/adv_optm/optim/AdamW_adv.py +13 -4
- {adv_optm-0.1.7 → adv_optm-0.1.9}/adv_optm/optim/Adopt_adv.py +52 -13
- {adv_optm-0.1.7 → adv_optm-0.1.9}/adv_optm/optim/Lion_Prodigy_adv.py +3 -37
- {adv_optm-0.1.7 → adv_optm-0.1.9}/adv_optm/optim/Lion_adv.py +6 -39
- {adv_optm-0.1.7 → adv_optm-0.1.9}/adv_optm/optim/Prodigy_adv.py +76 -39
- adv_optm-0.1.9/adv_optm.egg-info/PKG-INFO +174 -0
- {adv_optm-0.1.7 → adv_optm-0.1.9}/setup.py +1 -1
- adv_optm-0.1.7/PKG-INFO +0 -130
- adv_optm-0.1.7/README.md +0 -99
- adv_optm-0.1.7/adv_optm.egg-info/PKG-INFO +0 -130
- {adv_optm-0.1.7 → adv_optm-0.1.9}/LICENSE +0 -0
- {adv_optm-0.1.7 → adv_optm-0.1.9}/adv_optm/optim/Simplified_AdEMAMix.py +0 -0
- {adv_optm-0.1.7 → adv_optm-0.1.9}/adv_optm/optim/__init__.py +0 -0
- {adv_optm-0.1.7 → adv_optm-0.1.9}/adv_optm/util/BF16_Stochastic_Rounding.py +0 -0
- {adv_optm-0.1.7 → adv_optm-0.1.9}/adv_optm/util/Effective_Shape.py +0 -0
- {adv_optm-0.1.7 → adv_optm-0.1.9}/adv_optm/util/NNMF.py +0 -0
- {adv_optm-0.1.7 → adv_optm-0.1.9}/adv_optm/util/One_Bit_Boolean.py +0 -0
- {adv_optm-0.1.7 → adv_optm-0.1.9}/adv_optm/util/OrthoGrad.py +0 -0
- {adv_optm-0.1.7 → adv_optm-0.1.9}/adv_optm/util/__init__.py +0 -0
- {adv_optm-0.1.7 → adv_optm-0.1.9}/adv_optm.egg-info/SOURCES.txt +0 -0
- {adv_optm-0.1.7 → adv_optm-0.1.9}/adv_optm.egg-info/dependency_links.txt +0 -0
- {adv_optm-0.1.7 → adv_optm-0.1.9}/adv_optm.egg-info/requires.txt +0 -0
- {adv_optm-0.1.7 → adv_optm-0.1.9}/adv_optm.egg-info/top_level.txt +0 -0
- {adv_optm-0.1.7 → adv_optm-0.1.9}/setup.cfg +0 -0
adv_optm-0.1.9/PKG-INFO
ADDED
@@ -0,0 +1,174 @@

Metadata-Version: 2.4
Name: adv_optm
Version: 0.1.9
Summary: A family of highly efficient, lightweight yet powerful optimizers.
Home-page: https://github.com/Koratahiu/Advanced_Optimizers
Author: Koratahiu
Author-email: hiuhonor@gmail.com
License: Apache 2.0
Keywords: llm,fine-tuning,memory-efficient,low-rank,compression,pytorch,optimizer,adam
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Advanced Optimizers (AIO)

A comprehensive, all-in-one collection of optimization algorithms for deep learning, designed for maximum efficiency, minimal memory footprint, and superior performance across diverse model architectures and training scenarios.

[](https://pypi.org/project/adv_optm/)

---

## 📦 Installation

```bash
pip install adv_optm
```

---

## 🧠 Core Innovations

This library integrates multiple state-of-the-art optimization techniques validated through extensive research and practical training, with 1-bit compression for optimizer states:

### **Memory-Efficient Optimization (SMMF-inspired)**
- **Paper**: [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
- **Approach**: Uses rank-1 non-negative matrix factorization with reconstruction cycle (factor → reconstruct → update → factor)
- **Innovation**:
  - First moment split into **1-bit sign + absolute value**
  - Final storage: **four factored vectors + one 1-bit sign state**
  - Preserves Adam-like update quality with drastically reduced memory

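To make the factor → reconstruct → update → factor cycle above concrete, here is a minimal sketch. It is not code from this package; the Adafactor-style row/column-sum factorization and all names are illustrative assumptions, showing only how a non-negative second-moment matrix can live as two vectors between steps:

```python
import torch

def factor_rank1(m: torch.Tensor):
    """Compress a non-negative (d1, d2) matrix into two factor vectors.

    Row/column sums give a closed-form rank-1 NNMF (the Adafactor factorization):
    torch.outer(row, col) approximately reconstructs m and preserves its total mass.
    """
    row = m.sum(dim=1)                              # (d1,)
    col = m.sum(dim=0) / m.sum().clamp_min(1e-30)   # (d2,), normalized
    return row, col

def reconstruct_rank1(row: torch.Tensor, col: torch.Tensor) -> torch.Tensor:
    """Rebuild the full (d1, d2) matrix from its rank-1 factors."""
    return torch.outer(row, col)

# One pass of the cycle for an Adam-style second moment; only (row, col) persist.
d1, d2 = 4, 6
grad = torch.randn(d1, d2)
row, col = torch.ones(d1), torch.full((d2,), 1.0 / d2)   # initial factors
beta2 = 0.999

v = reconstruct_rank1(row, col)                          # reconstruct
v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)      # update
row, col = factor_rank1(v)                               # re-factor and discard v
```

The second moment is non-negative, so it factors directly; the first moment is not, which is why the package stores its sign separately as the 1-bit state described in the bullets above.
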
---

## ⚡ Performance Characteristics

### Memory Efficiency (SDXL Model - 6.5GB)
| Optimizer | Memory Usage | Description |
|-----------|--------------|-------------|
| `Adopt_Factored` | 328 MB | 4 small vectors + 1-bit state |
| `Adopt_Factored + AdEMAMix` | 625 MB | 6 small vectors + two 1-bit states |
| `Simplified_AdEMAMix` | 328 MB | Same as standard factored (no extra state) |

### Speed Comparison (SDXL, Batch Size 4)
| Optimizer | Speed | Notes |
|-----------|-------|-------|
| `Adafactor` | ~8.5s/it | Baseline |
| `Adopt_Factored` | ~10s/it | +18% overhead from compression |
| `Adopt_Factored + AdEMAMix` | ~12s/it | +41% overhead (3 factored states) |

---

## 🧪 Available Optimizers

### Standard Optimizers (All support `factored=True/False`)
| Optimizer | Description | Best For |
|-----------|-------------|----------|
| `Adam_Adv` | Advanced Adam implementation | General purpose |
| `Adopt_Adv` | Adam variant with an independent beta2 | Stable training in small-batch regimes |
| `Prodigy_Adv` | Prodigy with D-Adaptation | Adam with automatic LR tuning |
| `Simplified_AdEMAMix` | Adam variant with accumulator momentum | Small/large batch training when tuned correctly |
| `Lion_Adv` | Advanced Lion implementation | Memory-constrained environments |
| `Prodigy_Lion_Adv` | Prodigy + Lion combination | Lion with automatic LR tuning |

### Feature Matrix
| Feature | Adam_Adv | Adopt_Adv | Prodigy_Adv | Simplified_AdEMAMix | Lion_Adv |
|---------|----------|-----------|-------------|---------------------|----------|
| Factored | ✓ | ✓ | ✓ | ✓ | ✓ |
| AdEMAMix | ✓ | ✓ | ✓ | ✗ | ✗ |
| Simplified_AdEMAMix | ✗ | ✗ | ✓ | ✓ | ✗ |
| OrthoGrad | ✓ | ✓ | ✓ | ✓ | ✓ |
| Grams | ✓ | ✓ | ✓ | ✗ | ✗ |
| Cautious | ✓ | ✓ | ✓ | ✗ | ✓ |
| atan2 | ✓ | ✓ | ✓ | ✗ | ✗ |
| Stochastic Rounding | ✓ | ✓ | ✓ | ✓ | ✓ |
| Fused Backward Pass | ✓ | ✓ | ✓ | ✓ | ✓ |
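
A hedged usage sketch: the class names actually exported by `adv_optm/__init__.py` are not shown in this diff and may differ from the table above, so the import below is an assumption; the keyword arguments come from the `Adopt_adv` signature that appears in the code hunks further down.

```python
import torch
from adv_optm import Adopt_adv  # assumed export; check adv_optm/__init__.py for exact names

model = torch.nn.Linear(512, 512).bfloat16()

optimizer = Adopt_adv(
    model.parameters(),
    lr=1e-4,
    factored=True,             # rank-1 factored states (the new default is False)
    stochastic_rounding=True,  # recommended when parameters are BF16
    use_atan2=True,            # robust eps replacement with built-in clipping
)

x = torch.randn(8, 512, dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```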

---

## ⚙️ Key Features & Parameters

### Comprehensive Feature Guide

| Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
|---------|-------------|-------------------|--------------------|-------------------|--------------|
| **Factored** | Memory-efficient optimization using rank-1 factorization | Enable for large models (>1B params) or limited VRAM | +12-41% time overhead, ~1-bit optimizer-state memory | [SMMF](https://arxiv.org/abs/2412.08894) | All optimizers |
| **AdEMAMix** | Dual EMA system for momentum | Use for long training runs (10k+ steps) | +1 state memory | [AdEMAMix](https://arxiv.org/abs/2409.03137) | Adam/Adopt/Prodigy |
| **Simplified_AdEMAMix** | Accumulator-based momentum | Small batch training (≤32) | Same memory as standard, no extra overhead | [Schedule-Free Connections](https://arxiv.org/abs/2502.02431) | Adam/Prodigy |
| **OrthoGrad** | Removes the gradient component parallel to the weights | Full finetuning without weight decay | +33% time overhead, no memory impact | [Grokking at Edge](https://github.com/LucasPrietoAl/grokking-at-the-edge-of-numerical-stability) | All optimizers |
| **Stochastic Rounding** | Improves precision for BF16 training | BF16 training | Minimal overhead (<5%) | [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192) | All optimizers |
| **atan2** | Robust eps replacement + built-in clipping | Use with Adopt or unstable training | No overhead | [Adam-atan2](https://github.com/lucidrains/adam-atan2-pytorch) | Adam/Adopt/Prodigy |
| **Cautious** | Applies the update only where its direction agrees with the gradient | Should speed up convergence | No overhead | [C-Optim](https://github.com/kyleliang919/C-Optim) | Adam/Adopt/Prodigy |
| **Grams** | Takes the update direction from the gradient | Should have a stronger effect than Cautious | No overhead | [Grams](https://github.com/Gunale0926/Grams) | Adam/Adopt/Prodigy |
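
To illustrate the Stochastic Rounding row (a generic sketch, not the package's `BF16_Stochastic_Rounding` module): BF16 keeps only the top 16 bits of an FP32 value, so adding uniform random bits below the cut before truncating turns the usual round-to-nearest bias into an unbiased, stochastic round, which keeps tiny BF16 updates from being silently dropped.

```python
import torch

def add_stochastic_(param_bf16: torch.Tensor, update_fp32: torch.Tensor) -> None:
    """param_bf16 += update_fp32, rounding the FP32 result to BF16 stochastically."""
    acc = param_bf16.float().add_(update_fp32)       # accumulate in FP32
    bits = acc.view(torch.int32)                     # reinterpret the FP32 bit pattern
    noise = torch.randint_like(bits, 0, 1 << 16)     # random bits below the BF16 cutoff
    bits.add_(noise).bitwise_and_(-65536)            # maybe carry up, then zero the low 16 bits
    param_bf16.copy_(bits.view(torch.float32))       # truncated FP32 is exactly a BF16 value

p = torch.ones(4, dtype=torch.bfloat16)
for _ in range(1000):
    add_stochastic_(p, torch.full((4,), 1e-4))       # each step is far below BF16 resolution
print(p)  # ~1.1 on average; plain round-to-nearest BF16 would stay stuck at 1.0
```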

---

## Simplified_AdEMAMix Parameters
Simplified_AdEMAMix replaces the standard momentum with an accumulator for better performance at both small and large batch sizes.

| Parameter | Recommended Values | Description |
|-----------|---------------------|-------------|
| `beta1` | 0.9 (large BS), 0.99-0.9999 (small BS) | Determines the memory length of the accumulator |
| `alpha` | 100-10 (small BS), 1-0 (large BS) | Gradient smoothing factor |

**Alpha Tuning Guide**:
| Batch Size | Recommended α | Rationale |
|------------|---------------|-----------|
| Small (≤32) | 100, 50, 20, 10 | Emphasizes recent gradients for quick adaptation |
| Medium (32-512) | 10, 5, 2, 1 | Balanced approach |
| Large (≥512) | 1, 0.5, 0 | Emphasizes historical gradients for stability |

⚠️ **Important**: Use a **~100x smaller learning rate** with Simplified_AdEMAMix than with AdamW (e.g., 1e-6 instead of 1e-4).
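
Read against the `Simplified_AdEMAMix` code added to `Adopt_adv` later in this diff, the rule behind these recommendations is (with $\hat g_t$ the ADOPT-normalized gradient and $m_t$ the accumulator):

$$
m_t = \beta_1\, m_{t-1} + \hat g_t, \qquad u_t = \alpha_{\text{grad}}\, g_t + m_t
$$

The scale argument is the editor's reading, not a claim from the package: because the accumulator carries no $(1-\beta_1)$ factor, $\lVert m_t \rVert$ grows toward roughly $1/(1-\beta_1)$ times a single normalized gradient, and the current gradient is further scaled by $\alpha$ (default 100), so the update is about two orders of magnitude larger than an AdamW step. That is consistent with the ~100x smaller learning rate advised above.
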

### 📊 Performance Validation
Small Batch Training (SDXL, BS=2, 1.8K steps)


- **🟢 Prodigy_adv** (beta1=0.9, d0=1e-5): Final LR=2.9e-4
- **🔵 Prodigy_adv + Simplified_AdEMAMix** (beta1=0.99, α=100, d0=1e-7): Final LR=5.8e-6

**Results**:
- Simplified_AdEMAMix shows faster convergence and better final performance
- D-Adaptation automatically handles the more aggressive updates (settling on a ~50x smaller LR)
- Generated samples show significantly better quality with Simplified_AdEMAMix

---

## ⚠️ Known Limitations

### 1. Prodigy_Adv Sensitivity
- Highly sensitive to gradient modifications (Adopt normalization, low-rank factorization)
- May fail to increase the learning rate in some LoRA scenarios
- **Fix**: Disable factorization or set beta1=0

### 2. Aggressive Learning Rates
- Can destabilize the factored first moment
- **Recommendation**: Use the learning rate Prodigy settles on as a reference for a safe LR threshold

---

## 📚 References

1. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
2. [The AdEMAMix Optimizer: Better, Faster, Older](https://arxiv.org/abs/2409.03137)
3. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants](https://arxiv.org/abs/2502.02431)

---

adv_optm-0.1.9/README.md
ADDED
@@ -0,0 +1,143 @@

The 143 added lines of README.md are identical, line for line, to the README body embedded in the PKG-INFO above (from "# Advanced Optimizers (AIO)" through the final reference list).

{adv_optm-0.1.7 → adv_optm-0.1.9}/adv_optm/optim/AdamW_adv.py

@@ -55,7 +55,7 @@ class AdamW_adv(torch.optim.Optimizer):
             the warmup, `alpha` ramps from 0 to its target value. If `None`,
             the scheduler is disabled. (default: None)
         factored (bool): whether to use the factorization or disable it to use
-            the uncompressed optimizer. (default:
+            the uncompressed optimizer. (default: False)
     """
 
     def __init__(
@@ -76,7 +76,7 @@ class AdamW_adv(torch.optim.Optimizer):
         beta3_ema: float = 0.9999,
         alpha: float = 5.0,
         t_alpha: int | None = None,
-        factored: bool =
+        factored: bool = False,
     ):
         if not (lr >= 0.0):
             raise ValueError(f"Learning-rate should be >= 0.0. Got {lr}")
@@ -86,6 +86,9 @@ class AdamW_adv(torch.optim.Optimizer):
            raise ValueError(f"Epsilon should be >= 0.0. Got {eps}")
        if not (weight_decay >= 0.0):
            raise ValueError(f"Weight-decay should be >= 0.0. Got {weight_decay}")
+        if use_cautious and use_grams:
+            print("Warning: use_cautious is incompatible with use_grams, Disabling use_cautious.")
+            use_cautious = False
 
        defaults = {
            "lr": lr, "betas": betas, "eps": eps, "weight_decay": weight_decay,
@@ -216,7 +219,10 @@ class AdamW_adv(torch.optim.Optimizer):
                        del unpacked_sign_slow
 
                        mt_slow.mul_(beta3_ema).add_(grad_reshaped, alpha=1.0 - beta3_ema)
-
+                        if beta1 > 0:
+                            update = torch.add(mt, mt_slow, alpha=alpha_t)
+                        else:
+                            update = torch.add(grad_reshaped, mt_slow, alpha=alpha_t)
                    else:
                        update = mt.clone() if beta1 > 0 else grad_reshaped.clone()
                    del grad_reshaped
@@ -262,7 +268,10 @@ class AdamW_adv(torch.optim.Optimizer):
                if self.use_AdEMAMix:
                    exp_avg_slow = state['exp_avg_slow']
                    exp_avg_slow.mul_(beta3_ema).add_(grad, alpha=1 - beta3_ema)
-
+                    if beta1 > 0:
+                        update = torch.add(exp_avg, exp_avg_slow, alpha=alpha_t)
+                    else:
+                        update = torch.add(grad, exp_avg_slow, alpha=alpha_t)
                else:
                    update = exp_avg.clone() if beta1 > 0 else grad.clone()
 
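Taken out of the class, the branch these hunks add computes the AdEMAMix numerator as the fast EMA plus `alpha_t` times the slow EMA, falling back to the raw gradient when `beta1 == 0`. A standalone sketch for readability (the EMA recurrences themselves are the standard ones and are an assumption; only the blend lines are taken from the diff):

```python
import torch

def ademamix_numerator(grad, exp_avg, exp_avg_slow, beta1, beta3_ema, alpha_t):
    """Return the update numerator built by the added branch above."""
    exp_avg_slow.mul_(beta3_ema).add_(grad, alpha=1 - beta3_ema)    # slow EMA
    if beta1 > 0:
        exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)             # fast EMA
        return torch.add(exp_avg, exp_avg_slow, alpha=alpha_t)      # m_fast + alpha_t * m_slow
    return torch.add(grad, exp_avg_slow, alpha=alpha_t)             # grad + alpha_t * m_slow

grad = torch.randn(10)
fast, slow = torch.zeros(10), torch.zeros(10)
numerator = ademamix_numerator(grad, fast, slow, beta1=0.9, beta3_ema=0.9999, alpha_t=5.0)
```
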
{adv_optm-0.1.7 → adv_optm-0.1.9}/adv_optm/optim/Adopt_adv.py

@@ -62,8 +62,18 @@ class Adopt_adv(torch.optim.Optimizer):
             the warmup, `alpha` ramps from 0 to its target value. If `None`,
             the scheduler is disabled and the full `alpha` value is used from
             the start. (default: None)
+        Simplified_AdEMAMix (bool): whether to use the Simplified AdEMAMix update rule.
+            This changes the EMA to accumulator and the update numerator to `alpha_grad * grad + mt`, which can be
+            more responsive, especially for small batch sizes. Enabling this will
+            automatically disable `use_AdEMAMix`, `use_cautious`, `use_grams`,
+            and `use_atan2`. (default: False)
+        alpha_grad (float): Mixing coefficient for the Simplified AdEMAMix update rule
+            (only used when `Simplified_AdEMAMix` is `True`). Controls the weight of the
+            current gradient. For small batch sizes, use high values (e.g., 10-100) to be
+            more responsive. For large batch sizes, use low values (e.g., 0-1) for
+            stability. (default: 100.0)
         factored (bool): whether to use the factorization or disable it to use
-            the uncompressed optimizer. (default:
+            the uncompressed optimizer. (default: False)
     """
 
     def __init__(
@@ -77,14 +87,16 @@ class Adopt_adv(torch.optim.Optimizer):
         vector_reshape: bool = True,
         stochastic_rounding: bool = True,
         use_atan2: bool = False,
-        use_cautious: bool =
+        use_cautious: bool = False,
         use_grams: bool = False,
         use_orthograd: bool = False,
         use_AdEMAMix: bool = False,
         beta3_ema: float = 0.9999,
         alpha: float = 5.0,
         t_alpha: int | None = None,
-
+        Simplified_AdEMAMix: bool = False,
+        alpha_grad: float = 100.0,
+        factored: bool = False,
     ):
         if not (lr >= 0.0):
             raise ValueError(f"Learning-rate should be >= 0.0. Got {lr}")
@@ -94,19 +106,34 @@ class Adopt_adv(torch.optim.Optimizer):
            raise ValueError(f"Epsilon should be >= 0.0. Got {eps}")
        if not (weight_decay >= 0.0):
            raise ValueError(f"Weight-decay should be >= 0.0. Got {weight_decay}")
+        if use_cautious and use_grams:
+            print("Warning: use_cautious is incompatible with use_grams, Disabling use_cautious.")
+            use_cautious = False
+        if betas[0] == 0.0 and Simplified_AdEMAMix:
+            raise ValueError(f"Beta1 cannot be 0.0 when using Simplified_AdEMAMix. Got {betas[0]}")
+        if use_AdEMAMix and Simplified_AdEMAMix:
+            print("Warning: use_AdEMAMix is incompatible with Simplified_AdEMAMix, Disabling use_AdEMAMix.")
+        if use_grams and Simplified_AdEMAMix:
+            print("Warning: use_grams is incompatible with Simplified_AdEMAMix, Disabling use_grams.")
+        if use_cautious and Simplified_AdEMAMix:
+            print("Warning: use_cautious is incompatible with Simplified_AdEMAMix, Disabling use_cautious.")
+        if use_atan2 and Simplified_AdEMAMix:
+            print("Warning: use_atan2 is incompatible with Simplified_AdEMAMix. Disabling use_atan2.")
+            use_atan2 = False
 
        defaults = {
            "lr": lr, "betas": betas, "eps": eps, "weight_decay": weight_decay,
            "vector_reshape": vector_reshape, "beta3_ema": beta3_ema, "alpha": alpha,
-            "t_alpha": t_alpha,
+            "t_alpha": t_alpha, "alpha_grad": alpha_grad,
        }
        self.clip_lambda = clip_lambda
        self.stochastic_rounding = stochastic_rounding
-        self.use_atan2 = use_atan2
-        self.use_cautious = use_cautious
-        self.use_grams = use_grams
+        self.use_atan2 = use_atan2 and not Simplified_AdEMAMix
+        self.use_cautious = use_cautious and not Simplified_AdEMAMix
+        self.use_grams = use_grams and not Simplified_AdEMAMix
        self.use_orthograd = use_orthograd
-        self.use_AdEMAMix = use_AdEMAMix
+        self.use_AdEMAMix = use_AdEMAMix and not Simplified_AdEMAMix
+        self.Simplified_AdEMAMix = Simplified_AdEMAMix
        self.factored = factored
        super().__init__(params, defaults)
 
@@ -185,6 +212,8 @@ class Adopt_adv(torch.optim.Optimizer):
            alpha_t = alpha
            if t_alpha is not None and t_alpha > 0 and current_step < t_alpha:
                alpha_t = min(current_step * alpha / t_alpha, alpha)
+            if self.Simplified_AdEMAMix:
+                alpha_grad = group["alpha_grad"]
 
            if state['factored']:
                d1, d2 = state['effective_shape']
@@ -224,7 +253,10 @@ class Adopt_adv(torch.optim.Optimizer):
                del denom
 
                # ADOPT Step B: Update momentum m_t using normalized gradient
-
+                if self.Simplified_AdEMAMix:
+                    mt.mul_(beta1).add_(normalized_grad, alpha=1.0)
+                else:
+                    mt.mul_(beta1).add_(normalized_grad, alpha=1.0 - beta1)
                if self.use_grams:
                    mt = grad_reshaped.sign() * mt.abs()
                elif self.use_cautious:
@@ -235,8 +267,10 @@ class Adopt_adv(torch.optim.Optimizer):
 
                if self.use_AdEMAMix:
                    mt_slow.mul_(beta3_ema).add_(normalized_grad, alpha=1.0 - beta3_ema)
-                    update = mt
+                    update = torch.add(mt, m_slow, alpha=alpha_t)
                    update = update.view(p.shape)
+                elif self.Simplified_AdEMAMix:
+                    update = torch.add(mt, grad_reshaped, alpha=alpha_grad)
                else:
                    update = mt.view(p.shape)
 
@@ -283,7 +317,10 @@ class Adopt_adv(torch.optim.Optimizer):
                del denom
 
                # ADOPT Step B: Update momentum m_t
-
+                if self.Simplified_AdEMAMix:
+                    m.mul_(beta1).add_(normalized_grad, alpha=1.0)
+                else:
+                    m.mul_(beta1).add_(normalized_grad, alpha=1.0 - beta1)
 
                if self.use_grams:
                    m = grad.sign() * m.abs()
@@ -295,9 +332,11 @@ class Adopt_adv(torch.optim.Optimizer):
 
                if self.use_AdEMAMix:
                    m_slow.mul_(beta3_ema).add_(normalized_grad, alpha=1.0 - beta3_ema)
-                    update = m
+                    update = torch.add(m, m_slow, alpha=alpha_t)
+                elif self.Simplified_AdEMAMix:
+                    update = torch.add(m, grad, alpha=alpha_grad)
                else:
-                    update = m
+                    update = m.clone()
 
                if self.use_atan2:
                    update.mul_(group['lr'] * 1.2732395447351628)
 
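Pulled out of the class, one uncompressed step of the `Simplified_AdEMAMix` rule these hunks add: ADOPT-normalize the gradient, push it into an accumulator with no `1 - beta1` factor, then mix the raw gradient back in with `alpha_grad`. This is a sketch for readability, with weight decay, clipping, and the factored path omitted; it is not the released implementation:

```python
import torch

def simplified_ademamix_adopt_step(p, grad, state, lr=1e-6, beta1=0.99,
                                   beta2=0.9999, eps=1e-6, alpha_grad=100.0):
    # Note the ~100x smaller default lr, matching the README guidance above.
    v, m = state["v"], state["m"]
    normalized_grad = grad / v.sqrt().clamp_min(eps)     # ADOPT: normalize by the previous v
    m.mul_(beta1).add_(normalized_grad, alpha=1.0)       # accumulator: no (1 - beta1) factor
    update = torch.add(m, grad, alpha=alpha_grad)        # numerator from the diff: m + alpha_grad * grad
    p.add_(update, alpha=-lr)
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second moment updated last (ADOPT order)

p = torch.randn(10)
state = {"v": torch.ones(10), "m": torch.zeros(10)}
simplified_ademamix_adopt_step(p, torch.randn(10), state)
```
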
{adv_optm-0.1.7 → adv_optm-0.1.9}/adv_optm/optim/Lion_Prodigy_adv.py

@@ -33,8 +33,6 @@ class Lion_Prodigy_adv(torch.optim.Optimizer):
            (default: 0.0).
        factored (bool): whether to use the factorization or use the
            uncompressed optimizer. (default: True)
-        variance_reduction (bool): whether to use the variance reduction technique
-            from "Convergence Analysis of the Lion Optimizer" (arXiv:2508.12327v1). (default: False).
        d0 (float):
            Initial D estimate for D-adaptation (default 1e-6). Rarely needs changing.
        d_coef (float):
@@ -66,7 +64,6 @@ class Lion_Prodigy_adv(torch.optim.Optimizer):
        use_cautious: bool = False,
        clip_threshold: float = 0.0,
        factored: bool = True,
-        variance_reduction: bool = False,
        # prodigy parameters
        beta3: float = None,
        d0: float = 1e-6,
@@ -97,7 +94,6 @@ class Lion_Prodigy_adv(torch.optim.Optimizer):
        self.stochastic_rounding = stochastic_rounding
        self.use_cautious = use_cautious
        self.factored = factored
-        self.variance_reduction = variance_reduction
        self.fsdp_in_use = fsdp_in_use
        super().__init__(params, defaults)
        # Global state for accumulating metrics across parameter updates within a single step.
@@ -183,12 +179,8 @@ class Lion_Prodigy_adv(torch.optim.Optimizer):
                    state['mv_m_nmf'] = torch.zeros(d2, device=p.device, dtype=dtype)
                    packed_d2 = (d2 + 7) // 8
                    state['sign'] = torch.zeros((d1, packed_d2), dtype=torch.uint8, device=p.device)
-                    if self.variance_reduction:
-                        state['prev_grad'] = torch.zeros((d1, d2), device=p.device, dtype=dtype)
                else: # Fallback to standard Lion
                    state['exp_avg'] = torch.zeros_like(p, device=p.device, dtype=dtype)
-                    if self.variance_reduction:
-                        state['prev_grad'] = torch.zeros_like(p, device=p.device, dtype=dtype)
 
                if state['factored']:
                    # Factored Path
@@ -215,20 +207,7 @@ class Lion_Prodigy_adv(torch.optim.Optimizer):
                update_for_param = signed_update.view(p.shape).mul(self.dlr)
 
                # Update momentum m_t = β2*m_{t-1} + (1-β2)*lr*g_t
-
-                    if state['step'] == 1:
-                        exp_avg.copy_(grad_reshaped)
-                    else:
-                        # Heuristic Prodigy-STORM update
-                        correction = exp_avg.sub(state['prev_grad'])
-                        grad_alpha = self.d * (1 - self.beta2) + self.beta2
-                        exp_avg.copy_(grad_reshaped).mul_(grad_alpha).add_(correction, alpha=self.beta2)
-                        del correction, grad_alpha
-                        state['prev_grad'].copy_(grad_reshaped)
-                else:
-                    # Standard Prodigy-Lion
-                    alpha = self.d * (1 - self.beta2)
-                    exp_avg.mul_(self.beta2).add_(grad_reshaped, alpha=alpha)
+                exp_avg.mul_(self.beta2).add_(grad_reshaped, alpha=self.d * (1 - self.beta2))
                del grad_reshaped
 
                # Compress new momentum m_t and store factors
@@ -254,20 +233,7 @@ class Lion_Prodigy_adv(torch.optim.Optimizer):
                update_for_param = signed_update.mul(self.dlr)
 
                # Update momentum
-
-                    if state['step'] == 1:
-                        exp_avg.copy_(grad)
-                    else:
-                        # Heuristic Prodigy-STORM update
-                        correction = exp_avg.sub(state['prev_grad'])
-                        grad_alpha = self.d * (1 - self.beta2) + self.beta2
-                        exp_avg.copy_(grad).mul_(grad_alpha).add_(correction, alpha=self.beta2)
-                        del grad_alpha, correction
-                        state['prev_grad'].copy_(grad)
-                else:
-                    # Standard Prodigy-Lion
-                    alpha = self.d * (1 - self.beta2)
-                    exp_avg.mul_(self.beta2).add_(grad, alpha=alpha)
+                exp_avg.mul_(self.beta2).add_(grad, alpha=self.d * (1 - self.beta2))
 
                # --- Accumulate Prodigy stats ---
                d0, safeguard_warmup, slice_p = group['d0'], group['safeguard_warmup'], group['slice_p']
@@ -298,7 +264,7 @@ class Lion_Prodigy_adv(torch.optim.Optimizer):
            else:
                p.data.add_(-update_for_param)
 
-
+        del update_for_param
 
    @torch.no_grad()
    def step(self, closure: Optional[callable] = None):
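
With the STORM-style branch removed, the momentum path is the single Prodigy-Lion recurrence kept by these hunks. A compact sketch of that retained logic: the `sign` step follows the usual Lion interpolation and is an assumption, only the momentum line is taken verbatim from the diff, and the D-adaptation bookkeeping that maintains `d` is omitted.

```python
import torch

def prodigy_lion_update(p, grad, exp_avg, d, lr, beta1=0.9, beta2=0.99):
    dlr = d * lr                                              # Prodigy-scaled step size
    c = exp_avg.mul(beta1).add_(grad, alpha=d * (1 - beta1))  # Lion interpolation (assumed)
    p.add_(c.sign_(), alpha=-dlr)                             # signed update scaled by dlr
    exp_avg.mul_(beta2).add_(grad, alpha=d * (1 - beta2))     # the momentum line kept by this diff

p = torch.randn(10)
exp_avg = torch.zeros(10)
prodigy_lion_update(p, torch.randn(10), exp_avg, d=1e-6, lr=1.0)
```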