PyPI - adv-optm - Versions diffs - 2.4.dev20__tar.gz → 2.4.dev21__tar.gz - Mend

adv-optm 2.4.dev20__tar.gz → 2.4.dev21__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (38)

{adv_optm-2.4.dev20 → adv_optm-2.4.dev21}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: adv_optm
-Version: 2.4.dev20
+Version: 2.4.dev21
 Summary: A family of highly efficient, lightweight yet powerful optimizers.
 Home-page: https://github.com/Koratahiu/Advanced_Optimizers
 Author: Koratahiu
@@ -15,7 +15,7 @@ Classifier: Topic :: Software Development :: Libraries :: Python Modules
 Requires-Python: >=3.8
 Description-Content-Type: text/markdown
 License-File: LICENSE
-Requires-Dist: torch>=2.0
+Requires-Dist: torch>=2.1
 Dynamic: author
 Dynamic: author-email
 Dynamic: classifier
@@ -37,10 +37,6 @@ A comprehensive, all-in-one collection of optimization algorithms for deep learn
 ## 🔥 What's New
-### in 2.2.2
-- `Simplified_AdEMAMix` now uses the same LR as AdamW for all `beta1` and `alpha_grad` values!
 ### in 2.1.x
 - Added Signum (SignSGD with momentum): A new optimizer in the family (SignSGD_adv)
@@ -101,7 +97,6 @@ This library integrates multiple state-of-the-art optimization techniques valida
 |-----------|--------------|-------------|
 | `Adopt_Factored` | 328 MB | 4 small vectors + 1-bit state |
 | `Adopt_Factored + AdEMAMix` | 625 MB | 6 small vectors + two 1-bit states |
-| `Simplified_AdEMAMix` | 328 MB | Same as standard factored (no extra state) |
 ### Speed Comparison (SDXL, Batch Size 4)
 | Optimizer | Speed | Notes |
@@ -120,7 +115,6 @@ This library integrates multiple state-of-the-art optimization techniques valida
 | `Adam_Adv` | Advanced Adam implementation | General purpose |
 | `Adopt_Adv` | Adam-variant with independent beta2 | Stable training for small batch size regimes |
 | `Prodigy_Adv` | Prodigy with D-Adaptation | Adam with automatic LR tuning |
-| `Simplified_AdEMAMix` | Adam variant with accumulator momentum | Small/large batch training when tuned correctly |
 | `Lion_Adv` | Advanced Lion implementation | Memory-constrained environments |
 | `Prodigy_Lion_Adv` | Prodigy + Lion combination | Lion with automatic LR tuning |
@@ -128,18 +122,14 @@ This library integrates multiple state-of-the-art optimization techniques valida
 ## ⚙️ Feature Matrix
-| Feature | Adam_Adv | Adopt_Adv | Prodigy_Adv | Simplified_AdEMAMix | Lion_Adv |
-|---------|----------|-----------|-------------|---------------------|----------|
-| Factored | ✓ | ✓ | ✓ | ✓ | ✓ |
-| AdEMAMix | ✓ | ✓ | ✓ | ✗ | ✗ |
-| Simplified_AdEMAMix | ✗ | ✓ | ✓ | ✓ | ✗ |
-| OrthoGrad | ✓ | ✓ | ✓ | ✓ | ✓ |
-| Grams | ✓ | ✓ | ✓ | ✗ | ✗ |
-| Cautious | ✓ | ✓ | ✓ | ✗ | ✓ |
-| atan2 | ✓ | ✓ | ✓ | ✗ | ✗ |
-| Stochastic Rounding | ✓ | ✓ | ✓ | ✓ | ✓ |
-| Fused Backward Pass | ✓ | ✓ | ✓ | ✓ | ✓ |
-| **Kourkoutas-β** | ✓ | ✓ | ✓ | ✓ | ✗ |
+| Feature | Adam_Adv | Adopt_Adv | Prodigy_Adv | Lion_Adv |
+|---------|----------|-----------|-------------|----------|
+| Factored | ✓ | ✓ | ✓ ✓ |
+| OrthoGrad | ✓ | ✓ | ✓ | ✓ |
+| atan2 | ✓ | ✓ | ✓ |✗ |
+| Stochastic Rounding | ✓ | ✓ | ✓ |✓ |
+| Fused Backward Pass | ✓ | ✓ | ✓ | ✓ |
+| **Kourkoutas-β** | ✓ | ✓ | ✓ | ✗ |
 ---
@@ -159,48 +149,13 @@ This library integrates multiple state-of-the-art optimization techniques valida
 | Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
 |--------|-------------|-------------------|--------------------|-------------------|--------------|
-| **Cautious** | Only applies update if gradient direction aligns with momentum direction | Accelerating convergence | No overhead | [C-Optim](https://github.com/kyleliang919/C-Optim) | Adam/Adopt/Prodigy/Lion |
-| **Grams** | Update direction derived purely from current gradient | When Cautious is insufficient | No overhead | [Grams](https://github.com/Gunale0926/Grams) | Adam/Adopt/Prodigy |
-| **AdEMAMix** | Dual EMA system that retains relevance of gradients over tens of thousands of steps | Long training runs, especially where model forgetting is a concern | +1 state memory | [AdEMAMix](https://arxiv.org/abs/2409.03137) | Adam/Adopt/Prodigy |
-| **Simplified_AdEMAMix** | Accumulator-based momentum, single EMA variant of AdEMAMix | All scenarios when tuned correctly | No overhead | [Connections](https://arxiv.org/abs/2502.02431) | Adam/Adopt/Prodigy |
 | **atan2** | Robust epsilon replacement with built-in gradient clipping | Use for stable bounded updates (or for Adopt as it needs that) | No overhead | [Adam-atan2](https://github.com/lucidrains/adam-atan2-pytorch) | Adam/Adopt/Prodigy |
-| **Kourkoutas-β** | Layer-wise adaptive β₂ based on gradient “sunspike” ratio | Noisy/small/large-batch/high-LR training | No overhead | [Kourkoutas-β]() | Adam/Adopt/Prodigy/Simplified_AdEMAMix |
-> **Note**: If both **Cautious** and **Grams** are enabled, **Grams takes precedence** and Cautious is disabled.
+| **Kourkoutas-β** | Layer-wise adaptive β₂ based on gradient “sunspike” ratio | Noisy/small/large-batch/high-LR training | No overhead | [Kourkoutas-β]() | Adam/Adopt/Prodigy |
 ---
 ## 🔍 Feature Deep Dives
-### AdEMAMix
-- Adds a **slow-decaying second EMA** (`beta3`) that retains gradient memory over tens of thousands of steps.
-- Particularly effective for **small batch sizes**, where Adam’s standard first moment is nearly useless.
-#### Tunable Hyperparameters
-| Parameter | Default | Tuning Guide |
-|-----------|---------|--------------|
-| `beta3` | 0.9999 | • Runs >120k steps: **0.9999**<br>• Runs ≤120k steps: **0.999** |
-| `alpha` | 5 | • Reduce to **2–3** if diverging<br>• Increase to strengthen long-term memory |
-> ✅ **Pro Tip**: Set `beta1=0` in Adam/Adopt/Prodigy to skip standard EMA entirely and rely solely on AdEMAMix’s slow EMA, ideal for small-batch regimes.
----
-### Simplified_AdEMAMix
-- Introduced in [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants (arXiv:2502.02431)](https://arxiv.org/abs/2502.02431).
-- Replaces Adam’s first moment with a **theory-based momentum** with emphasize on raw gradient, combining the stability of long memory with responsiveness to recent gradients.
-- **Key insight**: Classical momentum **does not accelerate** in noisy (small-batch) regimes; this accumulator do.
-#### Tunable Hyperparameters
-| Parameter | Default | Tuning Guide |
-|----------|---------|--------------|
-| `beta1` | 0.99 | Controls accumulator memory length:<br>• Small BS: **0.99–0.9999**<br>• Large BS: **0.9** |
-| `Grad α` | 100 | Most critical parameter:<br>• Inversely scales with batch size<br>• **100–10** for small BS (≤32)<br>• **1–0.1** for large BS (≥512) |
----
 ### atan2
 - Replaces `eps` in Adam-family optimizers with a **scale-invariant**, bounded update rule.
@@ -215,7 +170,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
 ### **Kourkoutas-β**
-**Kourkoutas-β** introduces a **sunspike-driven, layer-wise adaptive second-moment decay (β₂)** as an optional enhancement for `Adam_Adv`, `Adopt_Adv`, `Prodigy_Adv`, and `Simplified_AdEMAMix`.
+**Kourkoutas-β** introduces a **sunspike-driven, layer-wise adaptive second-moment decay (β₂)** as an optional enhancement for `Adam_Adv`, `Adopt_Adv`, `Prodigy_Adv`.
 Instead of using a fixed β₂ (e.g., 0.999 or 0.95), it **dynamically modulates β₂ per layer** based on a bounded *sunspike ratio*:
@@ -243,7 +198,5 @@ This is especially effective for **noisy training, small batch sizes, and high l
 1. [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192)
 2. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
-3. [The AdEMAMix Optimizer](https://arxiv.org/abs/2409.03137)
-4. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD](https://arxiv.org/abs/2502.02431)
 6. [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
 7. [Scaling Exponents Across Parameterizations and Optimizers](https://arxiv.org/abs/2407.05872)

{adv_optm-2.4.dev20 → adv_optm-2.4.dev21}/README.md RENAMED

@@ -6,10 +6,6 @@ A comprehensive, all-in-one collection of optimization algorithms for deep learn
 ## 🔥 What's New
-### in 2.2.2
-- `Simplified_AdEMAMix` now uses the same LR as AdamW for all `beta1` and `alpha_grad` values!
 ### in 2.1.x
 - Added Signum (SignSGD with momentum): A new optimizer in the family (SignSGD_adv)
@@ -70,7 +66,6 @@ This library integrates multiple state-of-the-art optimization techniques valida
 |-----------|--------------|-------------|
 | `Adopt_Factored` | 328 MB | 4 small vectors + 1-bit state |
 | `Adopt_Factored + AdEMAMix` | 625 MB | 6 small vectors + two 1-bit states |
-| `Simplified_AdEMAMix` | 328 MB | Same as standard factored (no extra state) |
 ### Speed Comparison (SDXL, Batch Size 4)
 | Optimizer | Speed | Notes |
@@ -89,7 +84,6 @@ This library integrates multiple state-of-the-art optimization techniques valida
 | `Adam_Adv` | Advanced Adam implementation | General purpose |
 | `Adopt_Adv` | Adam-variant with independent beta2 | Stable training for small batch size regimes |
 | `Prodigy_Adv` | Prodigy with D-Adaptation | Adam with automatic LR tuning |
-| `Simplified_AdEMAMix` | Adam variant with accumulator momentum | Small/large batch training when tuned correctly |
 | `Lion_Adv` | Advanced Lion implementation | Memory-constrained environments |
 | `Prodigy_Lion_Adv` | Prodigy + Lion combination | Lion with automatic LR tuning |
@@ -97,18 +91,14 @@ This library integrates multiple state-of-the-art optimization techniques valida
 ## ⚙️ Feature Matrix
-| Feature | Adam_Adv | Adopt_Adv | Prodigy_Adv | Simplified_AdEMAMix | Lion_Adv |
-|---------|----------|-----------|-------------|---------------------|----------|
-| Factored | ✓ | ✓ | ✓ | ✓ | ✓ |
-| AdEMAMix | ✓ | ✓ | ✓ | ✗ | ✗ |
-| Simplified_AdEMAMix | ✗ | ✓ | ✓ | ✓ | ✗ |
-| OrthoGrad | ✓ | ✓ | ✓ | ✓ | ✓ |
-| Grams | ✓ | ✓ | ✓ | ✗ | ✗ |
-| Cautious | ✓ | ✓ | ✓ | ✗ | ✓ |
-| atan2 | ✓ | ✓ | ✓ | ✗ | ✗ |
-| Stochastic Rounding | ✓ | ✓ | ✓ | ✓ | ✓ |
-| Fused Backward Pass | ✓ | ✓ | ✓ | ✓ | ✓ |
-| **Kourkoutas-β** | ✓ | ✓ | ✓ | ✓ | ✗ |
+| Feature | Adam_Adv | Adopt_Adv | Prodigy_Adv | Lion_Adv |
+|---------|----------|-----------|-------------|----------|
+| Factored | ✓ | ✓ | ✓ ✓ |
+| OrthoGrad | ✓ | ✓ | ✓ | ✓ |
+| atan2 | ✓ | ✓ | ✓ |✗ |
+| Stochastic Rounding | ✓ | ✓ | ✓ |✓ |
+| Fused Backward Pass | ✓ | ✓ | ✓ | ✓ |
+| **Kourkoutas-β** | ✓ | ✓ | ✓ | ✗ |
 ---
@@ -128,48 +118,13 @@ This library integrates multiple state-of-the-art optimization techniques valida
 | Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
 |--------|-------------|-------------------|--------------------|-------------------|--------------|
-| **Cautious** | Only applies update if gradient direction aligns with momentum direction | Accelerating convergence | No overhead | [C-Optim](https://github.com/kyleliang919/C-Optim) | Adam/Adopt/Prodigy/Lion |
-| **Grams** | Update direction derived purely from current gradient | When Cautious is insufficient | No overhead | [Grams](https://github.com/Gunale0926/Grams) | Adam/Adopt/Prodigy |
-| **AdEMAMix** | Dual EMA system that retains relevance of gradients over tens of thousands of steps | Long training runs, especially where model forgetting is a concern | +1 state memory | [AdEMAMix](https://arxiv.org/abs/2409.03137) | Adam/Adopt/Prodigy |
-| **Simplified_AdEMAMix** | Accumulator-based momentum, single EMA variant of AdEMAMix | All scenarios when tuned correctly | No overhead | [Connections](https://arxiv.org/abs/2502.02431) | Adam/Adopt/Prodigy |
 | **atan2** | Robust epsilon replacement with built-in gradient clipping | Use for stable bounded updates (or for Adopt as it needs that) | No overhead | [Adam-atan2](https://github.com/lucidrains/adam-atan2-pytorch) | Adam/Adopt/Prodigy |
-| **Kourkoutas-β** | Layer-wise adaptive β₂ based on gradient “sunspike” ratio | Noisy/small/large-batch/high-LR training | No overhead | [Kourkoutas-β]() | Adam/Adopt/Prodigy/Simplified_AdEMAMix |
-> **Note**: If both **Cautious** and **Grams** are enabled, **Grams takes precedence** and Cautious is disabled.
+| **Kourkoutas-β** | Layer-wise adaptive β₂ based on gradient “sunspike” ratio | Noisy/small/large-batch/high-LR training | No overhead | [Kourkoutas-β]() | Adam/Adopt/Prodigy |
 ---
 ## 🔍 Feature Deep Dives
-### AdEMAMix
-- Adds a **slow-decaying second EMA** (`beta3`) that retains gradient memory over tens of thousands of steps.
-- Particularly effective for **small batch sizes**, where Adam’s standard first moment is nearly useless.
-#### Tunable Hyperparameters
-| Parameter | Default | Tuning Guide |
-|-----------|---------|--------------|
-| `beta3` | 0.9999 | • Runs >120k steps: **0.9999**<br>• Runs ≤120k steps: **0.999** |
-| `alpha` | 5 | • Reduce to **2–3** if diverging<br>• Increase to strengthen long-term memory |
-> ✅ **Pro Tip**: Set `beta1=0` in Adam/Adopt/Prodigy to skip standard EMA entirely and rely solely on AdEMAMix’s slow EMA, ideal for small-batch regimes.
----
-### Simplified_AdEMAMix
-- Introduced in [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants (arXiv:2502.02431)](https://arxiv.org/abs/2502.02431).
-- Replaces Adam’s first moment with a **theory-based momentum** with emphasize on raw gradient, combining the stability of long memory with responsiveness to recent gradients.
-- **Key insight**: Classical momentum **does not accelerate** in noisy (small-batch) regimes; this accumulator do.
-#### Tunable Hyperparameters
-| Parameter | Default | Tuning Guide |
-|----------|---------|--------------|
-| `beta1` | 0.99 | Controls accumulator memory length:<br>• Small BS: **0.99–0.9999**<br>• Large BS: **0.9** |
-| `Grad α` | 100 | Most critical parameter:<br>• Inversely scales with batch size<br>• **100–10** for small BS (≤32)<br>• **1–0.1** for large BS (≥512) |
----
 ### atan2
 - Replaces `eps` in Adam-family optimizers with a **scale-invariant**, bounded update rule.
@@ -184,7 +139,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
 ### **Kourkoutas-β**
-**Kourkoutas-β** introduces a **sunspike-driven, layer-wise adaptive second-moment decay (β₂)** as an optional enhancement for `Adam_Adv`, `Adopt_Adv`, `Prodigy_Adv`, and `Simplified_AdEMAMix`.
+**Kourkoutas-β** introduces a **sunspike-driven, layer-wise adaptive second-moment decay (β₂)** as an optional enhancement for `Adam_Adv`, `Adopt_Adv`, `Prodigy_Adv`.
 Instead of using a fixed β₂ (e.g., 0.999 or 0.95), it **dynamically modulates β₂ per layer** based on a bounded *sunspike ratio*:
@@ -212,7 +167,5 @@ This is especially effective for **noisy training, small batch sizes, and high l
 1. [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192)
 2. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
-3. [The AdEMAMix Optimizer](https://arxiv.org/abs/2409.03137)
-4. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD](https://arxiv.org/abs/2502.02431)
 6. [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
 7. [Scaling Exponents Across Parameterizations and Optimizers](https://arxiv.org/abs/2407.05872)

{adv_optm-2.4.dev20 → adv_optm-2.4.dev21}/adv_optm/__init__.py RENAMED

@@ -2,9 +2,7 @@ from .optim import (
     AdamW_adv,
     Prodigy_adv,
     Adopt_adv,
-    Simplified_AdEMAMix,
     Lion_adv,
-    Lion_Prodigy_adv,
     Muon_adv,
     AdaMuon_adv,
     SignSGD_adv,
@@ -15,13 +13,11 @@ __all__ = [
     "AdamW_adv",
     "Prodigy_adv",
     "Adopt_adv",
-    "Simplified_AdEMAMix",
     "Lion_adv",
-    "Lion_Prodigy_adv",
     "Muon_adv",
     "AdaMuon_adv",
     "SignSGD_adv",
     "SinkSGD_adv",
 ]
-__version__ = "2.4.dev20"
+__version__ = "2.4.dev21"
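
Both `Simplified_AdEMAMix` and `Lion_Prodigy_adv` leave the top-level import list and `__all__` in this file, so code that imports either name from `adv_optm` would presumably fail on 2.4.dev21. The following is a minimal sketch of a defensive lookup, not part of the package. It uses a stand-in namespace with plain strings instead of the real classes, and the helper name `resolve_optimizer` and its fallback policy are hypothetical.

```python
from types import SimpleNamespace

# Stand-in for the top-level `adv_optm` namespace after this change. Only the
# names visible in the hunk are listed; strings replace the real classes.
fake_adv_optm = SimpleNamespace(
    AdamW_adv="AdamW_adv", Prodigy_adv="Prodigy_adv", Adopt_adv="Adopt_adv",
    Lion_adv="Lion_adv", Muon_adv="Muon_adv", AdaMuon_adv="AdaMuon_adv",
    SignSGD_adv="SignSGD_adv", SinkSGD_adv="SinkSGD_adv",
)


def resolve_optimizer(namespace, name, fallback=None):
    """Hypothetical helper: return namespace.<name>, else the fallback, else raise."""
    found = getattr(namespace, name, None)
    if found is not None:
        return found
    if fallback is not None and hasattr(namespace, fallback):
        return getattr(namespace, fallback)
    raise ImportError(f"{name!r} is not exported by this version")
```

With the real package, `namespace` would be the imported `adv_optm` module; choosing a fallback optimizer is a training decision that this sketch does not make for the caller.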

{adv_optm-2.4.dev20 → adv_optm-2.4.dev21}/adv_optm/optim/AdaMuon_adv.py RENAMED

@@ -3,8 +3,8 @@ import torch
 import math
 from ..util import param_update
-from ..util.Muon_util import newton_schulz, _is_suitable_for_muon, rms_adjustment, normuon_update, approx_mars, _auto_projection_for_adamuon, get_spectral_scaling
-from ..util.scaled_optm import spectral_normalization, init_spectral_norm
+from ..util.Muon_util import newton_schulz, _is_suitable_for_muon, rms_adjustment, normuon_update, approx_mars, _auto_projection_for_adamuon
+from ..util.scaled_optm import spectral_normalization, init_spectral_norm, scale_eps
 from ..util.factorization_util import _get_effective_shape, _factorize_state, _reconstruct_state
 from ..util.OrthoGrad import _orthogonalize_gradient
 from ..util.Kourkoutas import KourkoutasHelper
@@ -50,7 +50,8 @@ class AdaMuon_adv(torch.optim.Optimizer):
             vector, used for RMS-aligned rescaling. Allows for the reuse of existing Adam
             learning rate schedules. (default: True).
         ns_steps (int): number of Newton-Schulz iterations to perform (default: 5).
-        ns_eps (float): epsilon for Newton-Schulz normalization stability (default: 1e-7).
+        ns_eps (float): epsilon for Newton-Schulz normalization stability. When None
+            it's derived from scale invariant rule (default: 1e-7).
         ns_coeffs (tuple[float, float, float]): The (a, b, c) coefficients for the
             quintic polynomial in the Newton-Schulz iteration.
             (default: (3.4445, -4.7750, 2.0315)).
@@ -77,7 +78,8 @@ class AdaMuon_adv(torch.optim.Optimizer):
             (default: 128)
         accelerated_ns (bool): If True, enables Chebyshev-accelerated Newton-Schulz, which
             dynamically calculates optimal 3rd-order polynomial coefficients. (default: False)
-        cns_a_bound (float): Initial lower bound for singular values for CANS. (default: 1e-4)
+        cns_a_bound (float): Initial lower bound for singular values for CANS. When None
+            it's derived from scale invariant rule (default: None).
         approx_mars (bool): If True, enables Approximated MARS-M variance reduction.
         fom the paper "MARS-M: When Variance Reduction Meets Matrices"
             (default: False)
@@ -112,12 +114,7 @@ class AdaMuon_adv(torch.optim.Optimizer):
         adam_fisher_wd (bool): Fisher Adam (FAdam) weight decay for the AdamW part. (default: False)
         adam_use_bias_correction (bool): Bias correction for AdamW.
         adam_use_atan2 (bool): Atan2 update rule for AdamW.
-        adam_cautious_mask (bool): Cautious masking for AdamW.
-        adam_grams_moment (bool): Grams-style updates for AdamW.
         adam_orthogonal_gradient (bool): OrthoGrad for AdamW.
-        adam_use_AdEMAMix (bool): AdEMAMix for AdamW.
-        adam_beta3_ema (float): Beta3 for AdEMAMix.
-        adam_alpha (float): Alpha for AdEMAMix.
         adam_nesterov (bool): Nesterov momentum for AdamW. (default: False)
         adam_nesterov_coef (float, optional): Nesterov coefficient for AdamW. (default: None)
         adam_kourkoutas_beta (bool): Kourkoutas-β for AdamW.
@@ -126,7 +123,7 @@ class AdaMuon_adv(torch.optim.Optimizer):
         adam_tiny_spike (float): Tiny spike for Kourkoutas-β. (default: 1e-9)
         adam_k_warmup_steps (int): Warmup steps for Kourkoutas-β. (default: 0)
         adam_spectral_normalization (bool): Enable explicit spectral normalization for AdamW. (default: False)
-        adam_state_precision (str): Precision for AuxAdam states. Options: 'auto', 'fp32', 'bf16_sr', 'fp8_sr', 'int8_sr', 'factored'. (default: 'auto')
+        adam_state_precision (str): Precision for AuxAdam states. Options: 'auto', 'fp32', 'bf16_sr', 'fp16', 'fp8_sr', 'int8_sr', 'factored'. (default: 'auto')
         adam_nnmf_factor (bool): 1-bit factored for AdamW.
         adam_factored_2nd (bool): Factorize only the second moment (v_t) for AuxAdam. (default: False)
     """
@@ -147,7 +144,7 @@ class AdaMuon_adv(torch.optim.Optimizer):
         rms_rescaling: bool = True,
         # Newton Schulz
         ns_steps: int = 5,
-        ns_eps: float = 1e-7,
+        ns_eps: float | None = 1e-7,
         ns_coeffs: tuple[float, float, float] = (3.4445, -4.7750, 2.0315),
         # Stochastic Rounding for BF16
         stochastic_rounding: bool = True,
@@ -174,7 +171,7 @@ class AdaMuon_adv(torch.optim.Optimizer):
         nnmf_factor: bool = False,
         # CANS
         accelerated_ns: bool = False,
-        cns_a_bound: float = 1e-4,
+        cns_a_bound: float | None = None,
         # MARS-M
         approx_mars: bool = False,
         mars_gamma: float = 0.025,
@@ -193,12 +190,7 @@ class AdaMuon_adv(torch.optim.Optimizer):
         adam_fisher_wd: bool = False,
         adam_use_bias_correction: bool = True,
         adam_use_atan2: bool = False,
-        adam_cautious_mask: bool = False,
-        adam_grams_moment: bool = False,
         adam_orthogonal_gradient: bool = False,
-        adam_use_AdEMAMix: bool = False,
-        adam_beta3_ema: float = 0.9999,
-        adam_alpha: float = 5.0,
         adam_nesterov: bool = False,
         adam_nesterov_coef: float | None = None,
         adam_kourkoutas_beta: bool = False,
@@ -223,8 +215,12 @@ class AdaMuon_adv(torch.optim.Optimizer):
         if spectral_normalization and accelerated_ns:
             ValueError("spectral_normalization violates accelerated Newton-Schulz assumptions. Pick one of them.")
+        # Legacy backwards compatibility support for `nnmf_factor=True`
+        if nnmf_factor:
+            state_precision = "factored"
         state_precision = state_precision.lower()
-        valid_precisions = {"auto", "fp32", "factored", "bf16_sr", "fp8_sr", "int8_sr"}
+        valid_precisions = {"auto", "fp32", "factored", "bf16_sr", "fp16", "fp8_sr", "int8_sr"}
         if state_precision not in valid_precisions:
             raise ValueError(f"state_precision must be one of {valid_precisions}. Got {state_precision}")
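
The hunk above maps the legacy `nnmf_factor=True` flag onto `state_precision="factored"` and adds `"fp16"` to the accepted values. A minimal standalone sketch of that normalization follows. The function name `normalize_state_precision` is hypothetical, and the `"auto"` default is an assumption taken from the `group.get('state_precision', 'auto')` calls elsewhere in this file.

```python
VALID_PRECISIONS = {"auto", "fp32", "factored", "bf16_sr", "fp16", "fp8_sr", "int8_sr"}


def normalize_state_precision(state_precision="auto", nnmf_factor=False):
    """Hypothetical helper mirroring the validation order shown in the hunk."""
    # Legacy flag wins: nnmf_factor=True overrides whatever precision was passed.
    if nnmf_factor:
        state_precision = "factored"
    # Lower-casing happens after the override, so "FP16" is accepted.
    state_precision = state_precision.lower()
    if state_precision not in VALID_PRECISIONS:
        raise ValueError(
            f"state_precision must be one of {sorted(VALID_PRECISIONS)}. Got {state_precision}"
        )
    return state_precision
```

One consequence of this ordering is that `nnmf_factor=True` silently discards an explicit `state_precision` argument.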
@@ -262,9 +258,7 @@ class AdaMuon_adv(torch.optim.Optimizer):
             "adam_betas": adam_betas, "adam_eps": adam_eps, "adam_weight_decay": adam_weight_decay,
             "adam_fisher_wd": adam_fisher_wd,
             "adam_use_bias_correction": adam_use_bias_correction, "adam_use_atan2": adam_use_atan2,
-            "adam_cautious_mask": adam_cautious_mask, "adam_grams_moment": adam_grams_moment,
             "adam_orthogonal_gradient": adam_orthogonal_gradient,
-            "adam_use_AdEMAMix": adam_use_AdEMAMix, "adam_beta3_ema": adam_beta3_ema, "adam_alpha": adam_alpha,
             "adam_nesterov": adam_nesterov, "adam_nesterov_coef": adam_nesterov_coef,
             "adam_kourkoutas_beta": adam_kourkoutas_beta, "adam_beta2_min": adam_beta2_min,
             "adam_ema_alpha": adam_ema_alpha, "adam_tiny_spike": adam_tiny_spike,
@@ -274,25 +268,10 @@ class AdaMuon_adv(torch.optim.Optimizer):
             "adam_nnmf_factor": adam_nnmf_factor, "adam_factored_2nd": adam_factored_2nd,
         }
         self.stochastic_rounding = stochastic_rounding
-        self._init_lr = lr
+        self._init_lr = lr if lr > 0 else 1
         super().__init__(params, defaults)
-        # Validate that every group has a determined optimizer type
-        for i, group in enumerate(self.param_groups):
-            if group.get('use_muon') is None and group.get('optim_type') is None:
-                # Automatic shape-based detection if not explicit
-                has_muon_shape = False
-                for p in group['params']:
-                    has_muon_shape = _is_suitable_for_muon(p)
-                    if has_muon_shape:
-                        group['use_muon'] = True
-                    else:
-                        group['use_muon'] = False
-            if group.get('use_muon') is None: # Fallback
-                 group['use_muon'] = group.get('optim_type') == 'muon'
         self.init_step()
         self.kourkoutas_helper = None
@@ -346,12 +325,19 @@ class AdaMuon_adv(torch.optim.Optimizer):
         if 'is_muon' in state:
             return
-        if group['use_muon']:
+        if group.get('use_muon') is not None:
+            state['is_muon'] = group['use_muon']
+        elif group.get('optim_type') is not None:
+            state['is_muon'] = group['optim_type'] == 'muon'
+        else: # Auto-detect per parameter
+            state['is_muon'] = _is_suitable_for_muon(p)
-            state['factored'] = (
-                group['nnmf_factor'] and
-                not (len(p.shape) == 1 and not group['vector_reshape'])
-            )
+        if state['is_muon']:
+            req_precision = group['state_precision']
+            is_vector = len(p.shape) == 1 and not group['vector_reshape']
+            state['factored'] = req_precision == 'factored' and not is_vector
             dtype = torch.float32 if state['factored'] else p.dtype
             device = p.device
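
The two hunks above move optimizer-type detection out of `__init__` and into per-parameter state setup, with a fixed precedence: an explicit `use_muon`, then `optim_type`, then shape-based auto-detection. A minimal sketch of that precedence is given below. The function `resolve_is_muon` is hypothetical, and because the body of `_is_suitable_for_muon` is not shown in this diff, the default predicate (two or more dimensions) is only an assumption.

```python
def resolve_is_muon(group, param_shape, is_suitable=None):
    """Hypothetical standalone version of the precedence in the hunk above."""
    if is_suitable is None:
        # Assumption: tensors with 2+ dims are Muon candidates. The real
        # `_is_suitable_for_muon` is not shown here and may apply other rules.
        def is_suitable(shape):
            return len(shape) >= 2

    if group.get("use_muon") is not None:
        return group["use_muon"]            # 1. explicit flag on the group
    if group.get("optim_type") is not None:
        return group["optim_type"] == "muon"  # 2. named optimizer type
    return is_suitable(param_shape)          # 3. auto-detect per parameter
```

In the removed group-level loop, `group['use_muon']` was overwritten on every iteration, so the last parameter in a group decided the result for all of them. Resolving per parameter avoids that for groups with mixed shapes.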
@@ -362,23 +348,21 @@ class AdaMuon_adv(torch.optim.Optimizer):
                 state['mv_mbuf_nmf'] = torch.zeros(d2, device=device, dtype=dtype)
                 packed_d2 = (d2 + 7) // 8
                 state['sign_buf'] = torch.zeros((d1, packed_d2), dtype=torch.uint8, device=device)
+                state['shifter'] = torch.tensor([1, 2, 4, 8, 16, 32, 64, 128], device=device, dtype=torch.uint8)
                 if not group['normuon_variant']:
                     state['mu_vbuf_nmf'] = torch.zeros(d1, device=device, dtype=dtype)
                     state['mv_vbuf_nmf'] = torch.zeros(d2, device=device, dtype=dtype)
             else:
                 # Determine effective state precision (small tensors always use fp32)
                 req_precision = group.get('state_precision', 'auto')
-                actual_precision = req_precision
-                if actual_precision != 'auto' and (p.numel() < 10000 or p.ndim == 1):
-                    actual_precision = 'fp32'
+                actual_precision = 'auto' if req_precision == 'factored' else req_precision
                 group['actual_state_precision'] = actual_precision
                 # factored_2nd: factorize v_t only; ignored for NorMuon (no v_t) and tiny params
                 use_factored_2nd = (
                     group.get('factored_2nd', False)
                     and not group['normuon_variant']
-                    and p.numel() >= 10000
-                    and p.ndim > 1
+                    and not (len(p.shape) == 1 and not group['vector_reshape'])
                 )
                 state['factored_2nd'] = use_factored_2nd
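
The hunk above caches a `shifter` tensor of bit weights `[1, 2, 4, ..., 128]` next to `sign_buf`, whose rows hold `(d2 + 7) // 8` bytes. How `_factorize_state` and `_reconstruct_state` use it is not shown in this diff. The following pure-Python sketch illustrates one way such weights can pack and unpack sign bits. The bit order (least significant bit first) and the treatment of zero as negative are assumptions, and the function names are hypothetical.

```python
SHIFTER = [1, 2, 4, 8, 16, 32, 64, 128]  # same bit weights as state['shifter']


def pack_signs(row):
    """Pack one row into bytes: a bit is set where the value is positive."""
    packed_len = (len(row) + 7) // 8  # matches packed_d2 in the diff
    packed = [0] * packed_len
    for col, value in enumerate(row):
        if value > 0:  # assumption: zero is stored as "not positive"
            packed[col // 8] |= SHIFTER[col % 8]
    return packed


def unpack_signs(packed, d2):
    """Recover +1/-1 signs for the first d2 columns of a packed row."""
    return [1 if packed[col // 8] & SHIFTER[col % 8] else -1 for col in range(d2)]
```

Keeping the weights in optimizer state means they are allocated once per parameter rather than rebuilt on every factorize or reconstruct call, which is presumably the motivation for the change.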
@@ -493,19 +477,22 @@ class AdaMuon_adv(torch.optim.Optimizer):
                     random_int_state_tensor = param_update._get_random_int_for_8bit_sr(p)
                 elif actual_precision == 'fp8_sr':
                     random_int_state_tensor = param_update._get_random_int_for_fp8_sr(p)
+                if group['low_rank_ortho']:
+                    random_G_sketch = param_update._get_random_noise_for_low_rank_ortho(p, group['ortho_rank'])
             else:
                 lr = group['lr']
-                muon_step_param = self._muon_step_parameter
                 random_int_state_tensor = None
+                random_G_sketch = None
+                muon_step_param = self._muon_step_parameter
-            muon_step_param(p, grad, state, group, lr, random_int_tensor, random_int_state_tensor)
+            muon_step_param(p, grad, state, group, lr, random_int_tensor, random_int_state_tensor, random_G_sketch)
     def compile(self, *args, **kwargs):
         self._compiled_muon_step_parameter = torch.compile(self._muon_step_parameter, *args, **kwargs)
         self._compiled_adam_step_parameter = torch.compile(Muon_AuxAdam._adam_step_parameter, *args, **kwargs)
     @torch.no_grad()
-    def _muon_step_parameter(self, p, grad, state, group, lr, random_int_tensor, random_int_state_tensor=None):
+    def _muon_step_parameter(self, p, grad, state, group, lr, random_int_tensor, random_int_state_tensor, random_G_sketch):
         # Upcast grad for low-precision state modes (non-factored path)
         grad = upcast_grad_for_precision(grad, state, group.get('state_precision', 'auto'))
         beta1, beta2 = group['betas']
@@ -523,14 +510,8 @@ class AdaMuon_adv(torch.optim.Optimizer):
             else:
                 kappa_p = 1.0
-        if group.get('spectral_normalization', False):
-            ns_eps, adaptive_eps, _, _ = get_spectral_scaling(p, p.shape, group.get('n_layers', 1))
-            decoupled_wd = True
-        else:
-            decoupled_wd = False
-            ns_eps = group['ns_eps']
-            adaptive_eps = group['eps']
+        ns_eps = group['ns_eps']
+        adaptive_eps = scale_eps(group['eps'], p)
         # MARS-M Approximated (Variance Reduction)
         if group.get('approx_mars', False):
@@ -545,7 +526,7 @@ class AdaMuon_adv(torch.optim.Optimizer):
             grad_reshaped = grad.view(d1, d2)
             # Reconstruct momentum from previous step's factors & sign
-            mt_buf = _reconstruct_state((state['mu_mbuf_nmf'], state['mv_mbuf_nmf'], state['sign_buf'], d2), signed=True)
+            mt_buf = _reconstruct_state((state['mu_mbuf_nmf'], state['mv_mbuf_nmf'], state['sign_buf'], d2), signed=True, shifter=state['shifter'])
             # Update momentum in full-size
             mt_buf.lerp_(grad_reshaped, 1 - beta1)
@@ -557,7 +538,7 @@ class AdaMuon_adv(torch.optim.Optimizer):
                 update = mt_buf.clone()
             # Factorize
-            state['mu_mbuf_nmf'], state['mv_mbuf_nmf'], state['sign_buf'] = _factorize_state(mt_buf, signed=True)
+            state['mu_mbuf_nmf'], state['mv_mbuf_nmf'], state['sign_buf'] = _factorize_state(mt_buf, signed=True, shifter=state['shifter'])
             del mt_buf
             # Apply update projection
@@ -573,7 +554,7 @@ class AdaMuon_adv(torch.optim.Optimizer):
                 cns_a_bound=group['cns_a_bound'],
                 low_rank_ortho=group['low_rank_ortho'],
                 ortho_rank=group['ortho_rank'],
-                spectral_normalization=group.get('spectral_normalization', False),
+                G_sketch=random_G_sketch,
                 compiled=group.get('compiled_optimizer', False)
             )
@@ -581,10 +562,10 @@ class AdaMuon_adv(torch.optim.Optimizer):
                 normuon_update(update, state['normuon_v'], beta2, group['eps'])
             else:
                 # Reconstruct second momentum from previous step's factors
-                vt_buf = _reconstruct_state((state['mu_vbuf_nmf'], state['mv_vbuf_nmf']), signed=False)
+                vt_buf = _reconstruct_state((state['mu_vbuf_nmf'], state['mv_vbuf_nmf']), signed=False, shifter=state['shifter'])
                 # Update second momentum in full-size
                 vt_buf.mul_(beta2).addcmul_(update, update, value=1 - beta2)
-                state['mu_vbuf_nmf'], state['mv_vbuf_nmf'] = _factorize_state(vt_buf, signed=False)
+                state['mu_vbuf_nmf'], state['mv_vbuf_nmf'] = _factorize_state(vt_buf, signed=False, shifter=state['shifter'])
                 # Apply second momentum update (adaptive scaling)
                 if group['use_atan2']:
                     denom = vt_buf.sqrt_()
@@ -629,7 +610,7 @@ class AdaMuon_adv(torch.optim.Optimizer):
                 cns_a_bound=group['cns_a_bound'],
                 low_rank_ortho=group['low_rank_ortho'],
                 ortho_rank=group['ortho_rank'],
-                spectral_normalization=group.get('spectral_normalization', False),
+                G_sketch=random_G_sketch,
                 compiled=group.get('compiled_optimizer', False)
             )
@@ -641,9 +622,9 @@ class AdaMuon_adv(torch.optim.Optimizer):
                 d1, d2 = state['effective_shape']
                 update = update.view(original_shape)
                 update_f32 = update.float()
-                vt_buf = _reconstruct_state((state['mu_vbuf_nmf'], state['mv_vbuf_nmf']), signed=False)
+                vt_buf = _reconstruct_state((state['mu_vbuf_nmf'], state['mv_vbuf_nmf']), signed=False, shifter=state['shifter'])
                 vt_buf.mul_(beta2).addcmul_(update_f32.view(d1, d2), update_f32.view(d1, d2), value=1 - beta2)
-                state['mu_vbuf_nmf'], state['mv_vbuf_nmf'] = _factorize_state(vt_buf, signed=False)
+                state['mu_vbuf_nmf'], state['mv_vbuf_nmf'] = _factorize_state(vt_buf, signed=False, shifter=state['shifter'])
                 # Apply second moment scaling
                 if group['use_atan2']:
                     denom = vt_buf.sqrt_().view(original_shape)
@@ -678,7 +659,7 @@ class AdaMuon_adv(torch.optim.Optimizer):
         update = update.reshape(original_shape)
-        param_update.apply_parameter_update(self, p, group, update, lr, random_int_tensor=random_int_tensor, decoupled=decoupled_wd)
+        param_update.apply_parameter_update(self, p, group, update, lr, random_int_tensor=random_int_tensor)
     @torch.no_grad()
     def step(self, closure=None):
