PyPI - adv-optm - Versions diffs - 2.4.dev25__tar.gz → 2.5.1__tar.gz - Mend

adv-optm 2.4.dev25__tar.gz → 2.5.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (38)

adv_optm-2.5.1/PKG-INFO ADDED

@@ -0,0 +1,113 @@
+Metadata-Version: 2.4
+Name: adv_optm
+Version: 2.5.1
+Summary: A family of highly efficient, lightweight yet powerful optimizers.
+Home-page: https://github.com/Koratahiu/Advanced_Optimizers
+Author: Koratahiu
+Author-email: hiuhonor@gmail.com
+License: Apache 2.0
+Keywords: llm,fine-tuning,memory-efficient,low-rank,compression,pytorch,optimizer,adam
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: OSI Approved :: Apache Software License
+Classifier: Operating System :: OS Independent
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Requires-Python: >=3.8
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: torch>=2.1
+Dynamic: author
+Dynamic: author-email
+Dynamic: classifier
+Dynamic: description
+Dynamic: description-content-type
+Dynamic: home-page
+Dynamic: keywords
+Dynamic: license
+Dynamic: license-file
+Dynamic: requires-dist
+Dynamic: requires-python
+Dynamic: summary
+# Advanced Optimizers (AIO)
+A comprehensive, all-in-one collection of state-of-the-art optimization algorithms for deep learning. Designed for **maximum efficiency**, **minimal memory footprint**, and **superior performance** across diverse model architectures and training scenarios.
+[![PyPI version](https://img.shields.io/pypi/v/adv_optm.svg?color=blue&style=flat-square)](https://pypi.org/project/adv_optm/)
+[![Python versions](https://img.shields.io/pypi/pyversions/adv_optm.svg?style=flat-square)](https://pypi.org/project/adv_optm/)
+[![License](https://img.shields.io/badge/license-Apache-green?style=flat-square)](LICENSE)
+---
+## 📦 Installation
+```bash
+pip install adv_optm
+```
+*Requires PyTorch 2.3+ for `torch.compile` support.*
+---
+## What's New
+### 🌟 Version 2.5.x: The Massive Refactor
+This major update introduces a complete architectural refactor of the library:
+**🆕 New Optimizers & Scaling**
+* **`SinkSGD_adv`:** Added a powerful new optimizer to the lineup.
+* **Spectral Scaling:** Now available across *all* optimizers, achieving width/rank invariant updates for highly stable training.
+**💾 Memory & State Precision Control**
+* **Granular State Precision (`state_precision`):** Drastically reduce memory overhead with new optimizer state modes:
+  * `factored` (Rank-2 factored mode)
+  * `fp32` (Full precision)
+  * `bf16_sr` & `int8_sr` (BF16/Int8 with Stochastic Rounding)
+* **Factored Second Moment (`factored_2nd`):** Available for all Adam variants. Works seamlessly alongside any `state_precision` setting to further slash memory usage.
+**⚙️ Advanced Dynamics & Momentum**
+* **Variance Normalized Momentum (`normed_momentum`):** Applies optimizer normalization *before* momentum (Normalization then Momentum/NtM). Available for `AdamW_adv`, `SignSGD_adv`, and `SinkSGD_adv`.
+* **Universal Nesterov Momentum:** Replaced the hard-to-tune Simplified_AdEMAMix with Nesterov momentum (`nesterov`) and a dedicated coefficient (`nesterov_coef`) across all optimizers.
+* **Preconditioning & Signs:**
+  * Added **Variance/Confidence Preconditioning (`snr_cond`)** for `SignSGD_adv` and `SinkSGD_adv` (requires `normed_momentum`). Read the technical reports: [AASS](https://koratahiu.github.io/aass/) & [sink-v](https://koratahiu.github.io/sink-v/).
+  * Added **Adaptive Stochastic Sign** with $L_\infty$ preconditioning (`stochastic_sign`) for `SignSGD_Adv` and `Lion_adv`.
+* **Improved CANS (`accelerated_ns`):** Enhanced for Muon variants by integrating a dynamic lower bound.
+* **New OrthoGrad modes (`orthogonal_gradient`):** Standard OrthoGrad `flattened` and a new matrix-wise mode `iterative`.
+**⚓ Weight Decay Innovations**
+* **Centered Weight Decay (`centered_wd`):** Pulls weights toward their pre-train state (anchor). To save memory, anchor precision (`centered_wd_mode`) can be set to full, float8, int8, or int4.
+* **Fisher Weight Decay (`fisher_wd`):** Now available for Adam variants based on the [FAdam paper](https://arxiv.org/abs/2405.12807).
+* **Geometric Weight Decay:** Added specifically for `SinkSGD_adv` and `SignSGD_adv`.
+*(Note: `Lion_Prodigy_adv`, `Simplified_AdEMAMix`, and heuristic cautious/grams modes have been deprecated in favor of these superior, theoretically-grounded features).*
+<details>
+<summary><b>Click to see older release notes (v1.2.x - v2.1.x)</b></summary>
+### Version 2.1.x
+* **New Optimizer:** Added **Signum** (SignSGD with momentum) to the `SignSGD_adv` family.
+### Version 2.0.x
+* ⚡ **`torch.compile` Support:** Fully implemented for all advanced optimizers. Enable via `compiled_optimizer=True` to heavily fuse and optimize the optimizer step path.
+* 📉 **1-Bit Factored Mode:** Vastly improved implementation via `nnmf_factor=True`.
+* 🛠️ Broad performance and stability improvements across all optimizers.
+### Version 1.2.x
+* **Advanced Muon Variants:** Brought the groundbreaking [Muon optimizer](https://kellerjordan.github.io/posts/muon/) into the fold, enriched with features from recent literature.
+| Optimizer | Description |
+|---|---|
+| `Muon_adv` | Advanced Muon implementation featuring CANS, NorMuon, Low-Rank Orthogonalization, and more. |
+| `AdaMuon_adv` | Combines Muon's geometry with Adam-like adaptive scaling and sign-based orthogonalization. |
+* **Prodigy Speedup:** Prodigy variants are now **50% faster** by eliminating unnecessary CUDA syncs (Shoutout to **@dxqb**!).
+* **Stochastic Rounding for BF16:** Parameter updates and weight decay now accumulate in float32 and round once at the end.
+* **Cautious Weight Decay:** Implemented for all advanced optimizers ([Paper](https://arxiv.org/abs/2510.12402)).
+* **Fused Operations:** Transitioned to fused and in-place operations wherever possible.
+</details>
+---
+## 💡 Core Innovations
+*(Documentation expanding on the theory and usage of these features is coming soon!)*

adv_optm-2.5.1/README.md ADDED

@@ -0,0 +1,82 @@
+# Advanced Optimizers (AIO)
+A comprehensive, all-in-one collection of state-of-the-art optimization algorithms for deep learning. Designed for **maximum efficiency**, **minimal memory footprint**, and **superior performance** across diverse model architectures and training scenarios.
+[![PyPI version](https://img.shields.io/pypi/v/adv_optm.svg?color=blue&style=flat-square)](https://pypi.org/project/adv_optm/)
+[![Python versions](https://img.shields.io/pypi/pyversions/adv_optm.svg?style=flat-square)](https://pypi.org/project/adv_optm/)
+[![License](https://img.shields.io/badge/license-Apache-green?style=flat-square)](LICENSE)
+---
+## 📦 Installation
+```bash
+pip install adv_optm
+```
+*Requires PyTorch 2.3+ for `torch.compile` support.*
+---
+## What's New
+### 🌟 Version 2.5.x: The Massive Refactor
+This major update introduces a complete architectural refactor of the library:
+**🆕 New Optimizers & Scaling**
+* **`SinkSGD_adv`:** Added a powerful new optimizer to the lineup.
+* **Spectral Scaling:** Now available across *all* optimizers, achieving width/rank invariant updates for highly stable training.
+**💾 Memory & State Precision Control**
+* **Granular State Precision (`state_precision`):** Drastically reduce memory overhead with new optimizer state modes:
+  * `factored` (Rank-2 factored mode)
+  * `fp32` (Full precision)
+  * `bf16_sr` & `int8_sr` (BF16/Int8 with Stochastic Rounding)
+* **Factored Second Moment (`factored_2nd`):** Available for all Adam variants. Works seamlessly alongside any `state_precision` setting to further slash memory usage.
+**⚙️ Advanced Dynamics & Momentum**
+* **Variance Normalized Momentum (`normed_momentum`):** Applies optimizer normalization *before* momentum (Normalization then Momentum/NtM). Available for `AdamW_adv`, `SignSGD_adv`, and `SinkSGD_adv`.
+* **Universal Nesterov Momentum:** Replaced the hard-to-tune Simplified_AdEMAMix with Nesterov momentum (`nesterov`) and a dedicated coefficient (`nesterov_coef`) across all optimizers.
+* **Preconditioning & Signs:**
+  * Added **Variance/Confidence Preconditioning (`snr_cond`)** for `SignSGD_adv` and `SinkSGD_adv` (requires `normed_momentum`). Read the technical reports: [AASS](https://koratahiu.github.io/aass/) & [sink-v](https://koratahiu.github.io/sink-v/).
+  * Added **Adaptive Stochastic Sign** with $L_\infty$ preconditioning (`stochastic_sign`) for `SignSGD_Adv` and `Lion_adv`.
+* **Improved CANS (`accelerated_ns`):** Enhanced for Muon variants by integrating a dynamic lower bound.
+* **New OrthoGrad modes (`orthogonal_gradient`):** Standard OrthoGrad `flattened` and a new matrix-wise mode `iterative`.
+**⚓ Weight Decay Innovations**
+* **Centered Weight Decay (`centered_wd`):** Pulls weights toward their pre-train state (anchor). To save memory, anchor precision (`centered_wd_mode`) can be set to full, float8, int8, or int4.
+* **Fisher Weight Decay (`fisher_wd`):** Now available for Adam variants based on the [FAdam paper](https://arxiv.org/abs/2405.12807).
+* **Geometric Weight Decay:** Added specifically for `SinkSGD_adv` and `SignSGD_adv`.
+*(Note: `Lion_Prodigy_adv`, `Simplified_AdEMAMix`, and heuristic cautious/grams modes have been deprecated in favor of these superior, theoretically-grounded features).*
+<details>
+<summary><b>Click to see older release notes (v1.2.x - v2.1.x)</b></summary>
+### Version 2.1.x
+* **New Optimizer:** Added **Signum** (SignSGD with momentum) to the `SignSGD_adv` family.
+### Version 2.0.x
+* ⚡ **`torch.compile` Support:** Fully implemented for all advanced optimizers. Enable via `compiled_optimizer=True` to heavily fuse and optimize the optimizer step path.
+* 📉 **1-Bit Factored Mode:** Vastly improved implementation via `nnmf_factor=True`.
+* 🛠️ Broad performance and stability improvements across all optimizers.
+### Version 1.2.x
+* **Advanced Muon Variants:** Brought the groundbreaking [Muon optimizer](https://kellerjordan.github.io/posts/muon/) into the fold, enriched with features from recent literature.
+| Optimizer | Description |
+|---|---|
+| `Muon_adv` | Advanced Muon implementation featuring CANS, NorMuon, Low-Rank Orthogonalization, and more. |
+| `AdaMuon_adv` | Combines Muon's geometry with Adam-like adaptive scaling and sign-based orthogonalization. |
+* **Prodigy Speedup:** Prodigy variants are now **50% faster** by eliminating unnecessary CUDA syncs (Shoutout to **@dxqb**!).
+* **Stochastic Rounding for BF16:** Parameter updates and weight decay now accumulate in float32 and round once at the end.
+* **Cautious Weight Decay:** Implemented for all advanced optimizers ([Paper](https://arxiv.org/abs/2510.12402)).
+* **Fused Operations:** Transitioned to fused and in-place operations wherever possible.
+</details>
+---
+## 💡 Core Innovations
+*(Documentation expanding on the theory and usage of these features is coming soon!)*

{adv_optm-2.4.dev25 → adv_optm-2.5.1}/adv_optm/__init__.py RENAMED

@@ -20,4 +20,4 @@ __all__ = [
     "SinkSGD_adv",
 ]
-__version__ = "2.4.dev25"
+__version__ = "2.5.1"

{adv_optm-2.4.dev25 → adv_optm-2.5.1}/adv_optm/optim/AdaMuon_adv.py RENAMED

@@ -57,7 +57,8 @@ class AdaMuon_adv(torch.optim.Optimizer):
             (default: (3.4445, -4.7750, 2.0315)).
         stochastic_rounding (bool): whether to use stochastic rounding for
             BF16 parameter updates (default: True).
-        orthogonal_gradient (bool): whether to use OrthoGrad.  (default: False)
+        orthogonal_gradient (str): whether to use OrthoGrad variants. 'disabled': off.
+        'flattened': Standard vectorized OrthoGrad. 'iterative': Matrix-wise rank-2 OrthoGrad. (default: disabled)
         nesterov (bool): enables Nesterov momentum (default: False).
         use_atan2 (bool): whether to use the atan2 update rule. (default: False)
         vector_reshape (bool): whether to reshape 1D vectors into 2D
@@ -114,7 +115,7 @@ class AdaMuon_adv(torch.optim.Optimizer):
         adam_fisher_wd (bool): Fisher Adam (FAdam) weight decay for the AdamW part. (default: False)
         adam_use_bias_correction (bool): Bias correction for AdamW.
         adam_use_atan2 (bool): Atan2 update rule for AdamW.
-        adam_orthogonal_gradient (bool): OrthoGrad for AdamW.
+        adam_orthogonal_gradient (str): OrthoGrad for AdamW.
         adam_nesterov (bool): Nesterov momentum for AdamW. (default: False)
         adam_nesterov_coef (float, optional): Nesterov coefficient for AdamW. (default: None)
         adam_kourkoutas_beta (bool): Kourkoutas-β for AdamW.
@@ -149,7 +150,7 @@ class AdaMuon_adv(torch.optim.Optimizer):
         # Stochastic Rounding for BF16
         stochastic_rounding: bool = True,
         # OrthoGrad
-        orthogonal_gradient: bool = False,
+        orthogonal_gradient: str = 'disabled', # 'flattened', 'iterative'
         # Adam_atan2 (scale invariant)
         use_atan2: bool = False,
         # NorMuon
@@ -190,7 +191,7 @@ class AdaMuon_adv(torch.optim.Optimizer):
         adam_fisher_wd: bool = False,
         adam_use_bias_correction: bool = True,
         adam_use_atan2: bool = False,
-        adam_orthogonal_gradient: bool = False,
+        adam_orthogonal_gradient: str = 'disabled', # 'flattened', 'iterative'
         adam_nesterov: bool = False,
         adam_nesterov_coef: float | None = None,
         adam_kourkoutas_beta: bool = False,
@@ -213,7 +214,7 @@ class AdaMuon_adv(torch.optim.Optimizer):
             print("Warning: spectral_normalization is incompatible with rms_rescaling, Disabling rms_rescaling.")
             rms_rescaling = False
         if spectral_normalization and accelerated_ns:
-            ValueError("spectral_normalization violates accelerated Newton-Schulz assumptions. Pick one of them.")
+            raise ValueError("spectral_normalization violates accelerated Newton-Schulz assumptions. Pick one of them.")
         # Legacy backwards compatibility support for `nnmf_factor=True`
         if nnmf_factor:
@@ -515,8 +516,7 @@ class AdaMuon_adv(torch.optim.Optimizer):
             grad = approx_mars(grad, state['last_grad'], group['mars_gamma'], beta1)
-        if group.get("orthogonal_gradient"):
-            grad = _orthogonalize_gradient(p, grad)
+        grad = _orthogonalize_gradient(p, grad, group.get("orthogonal_gradient"))
         if state['factored']: # Factored Muon
             d1, d2 = state['effective_shape']
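
Note on the `raise ValueError` fix in the hunk above: in Python, building an exception object without `raise` is an ordinary expression statement, so the 2.4.dev25 check never stopped construction. The following is a minimal sketch of the difference; `check_old` and `check_new` are hypothetical names for illustration and are not part of adv_optm.

```python
def check_old(spectral_normalization: bool, accelerated_ns: bool) -> str:
    # Pre-fix pattern: the exception object is created and then discarded.
    if spectral_normalization and accelerated_ns:
        ValueError("Pick one of them.")  # no `raise`, so execution continues
    return "constructed"


def check_new(spectral_normalization: bool, accelerated_ns: bool) -> str:
    # Post-fix pattern: the conflicting combination is rejected.
    if spectral_normalization and accelerated_ns:
        raise ValueError("Pick one of them.")
    return "constructed"
```

Under this sketch, `check_old(True, True)` returns normally while `check_new(True, True)` raises, which suggests that code relying on both flags being silently accepted may now fail at construction time.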

{adv_optm-2.4.dev25 → adv_optm-2.5.1}/adv_optm/optim/AdamW_adv.py RENAMED

@@ -45,7 +45,8 @@ class AdamW_adv(torch.optim.Optimizer):
         stochastic_rounding (bool): whether to use stochastic
             rounding for BF16 parameter updates (default: True).
         use_atan2 (bool): whether to use the atan2 update rule. (default: False)
-        orthogonal_gradient (bool): whether to use OrthoGrad.  (default: False)
+        orthogonal_gradient (str): whether to use OrthoGrad variants. 'disabled': off.
+        'flattened': Standard vectorized OrthoGrad. 'iterative': Matrix-wise rank-2 OrthoGrad. (default: disabled)
         normed_momentum (bool): whether to compute the first moment on the normalized gradient. (default: False)
         kourkoutas_beta (bool): whether to enable the layer-wise dynamic β₂ logic.
             If `False`, the optimizer behaves as standard AdamW. (default: False)
@@ -104,7 +105,7 @@ class AdamW_adv(torch.optim.Optimizer):
         # Adam_atan2 (scale invariant)
         use_atan2: bool = False,
         # OrthoGrad
-        orthogonal_gradient: bool = False,
+        orthogonal_gradient: str = 'disabled', # 'flattened', 'iterative'
         # Nesterov momentum
         nesterov: bool = False,
         nesterov_coef: float | None = None,
@@ -326,8 +327,7 @@ class AdamW_adv(torch.optim.Optimizer):
     def _step_parameter(self, p, grad, state, group, step_size, beta1, beta2, sqrt_bias_correction2, random_int_tensor, random_int_state_tensor):
         grad = upcast_grad_for_precision(grad, state, group['state_precision'])
-        if group["orthogonal_gradient"]:
-            grad = _orthogonalize_gradient(p, grad)
+        grad = _orthogonalize_gradient(p, grad, group["orthogonal_gradient"])
         nesterov = group.get('nesterov', False)
         nesterov_coef = group.get('nesterov_coef', None)
@@ -462,7 +462,7 @@ class AdamW_adv(torch.optim.Optimizer):
         else:
             update.mul_(update_scaling)
-        param_update.apply_parameter_update(self, p, group, update, step_size, random_int_tensor=random_int_tensor, wd_scaler=wd_scaler)
+        param_update.apply_parameter_update(self, p, group, update, group['lr'], random_int_tensor=random_int_tensor, wd_scaler=wd_scaler)
     def compile(self, *args, **kwargs):
         self._compiled_step_parameter = torch.compile(self._step_parameter, *args, **kwargs)
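
Note on the `orthogonal_gradient` change above: the flag moves from a bool to a mode string, and the helper is now always called with that mode. The body of `_orthogonalize_gradient` is not part of this diff, so the following is only a rough sketch of what a `'flattened'` mode could look like, assuming the usual OrthoGrad formulation (project out the component of the gradient parallel to the weights, then restore the original gradient norm). The function name `orthogonalize_sketch`, the `eps` value, and the use of plain Python lists are assumptions made for illustration; the `'iterative'` mode is deliberately left out because its algorithm is not shown.

```python
import math


def orthogonalize_sketch(param, grad, mode="disabled", eps=1e-30):
    """Hypothetical stand-in for a mode-dispatching OrthoGrad helper (flat lists of floats)."""
    if mode == "disabled":
        return list(grad)
    if mode != "flattened":
        raise NotImplementedError(f"mode {mode!r} is not sketched here")
    w_dot_g = sum(w * g for w, g in zip(param, grad))
    w_dot_w = sum(w * w for w in param)
    # Remove the component of the gradient that is parallel to the weights.
    g_perp = [g - (w_dot_g / (w_dot_w + eps)) * w for w, g in zip(param, grad)]
    # Rescale so the projected gradient keeps the norm of the original gradient.
    g_norm = math.sqrt(sum(g * g for g in grad))
    perp_norm = math.sqrt(sum(g * g for g in g_perp))
    scale = g_norm / (perp_norm + eps)
    return [g * scale for g in g_perp]
```

With `param = [1.0, 0.0]` and `grad = [1.0, 1.0]`, the sketch returns a vector orthogonal to `param` whose norm matches that of `grad`.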

{adv_optm-2.4.dev25 → adv_optm-2.5.1}/adv_optm/optim/Adopt_adv.py RENAMED

@@ -108,7 +108,7 @@ class Adopt_adv(torch.optim.Optimizer):
         # Stochastic Rounding for BF16
         stochastic_rounding: bool = True,
         # OrthoGrad
-        orthogonal_gradient: bool = False,
+        orthogonal_gradient: str = 'disabled', # 'flattened', 'iterative'
         # Nesterov momentum
         nesterov: bool = False,
         nesterov_coef: float | None = None,
@@ -158,7 +158,7 @@ class Adopt_adv(torch.optim.Optimizer):
         defaults = {
             "lr": lr, "betas": betas, "eps": eps, "weight_decay": weight_decay,
-            "fisher_wd": fisher_wd, "cautious_wd": cautious_wd,
+            "fisher_wd": fisher_wd, "cautious_wd": cautious_wd, "orthogonal_gradient": orthogonal_gradient,
             "nesterov": nesterov, "nesterov_coef": nesterov_coef,
             "kourkoutas_beta": kourkoutas_beta, "beta2_min": beta2_min, "ema_alpha": ema_alpha,
             "tiny_spike": tiny_spike, "k_warmup_steps": k_warmup_steps, "k_logging": k_logging,
@@ -172,7 +172,6 @@ class Adopt_adv(torch.optim.Optimizer):
         self.clip_lambda = clip_lambda
         self.stochastic_rounding = stochastic_rounding
         self.use_atan2 = use_atan2
-        self.orthogonal_gradient = orthogonal_gradient
         self.kourkoutas_beta = kourkoutas_beta
         self.layer_key_fn = layer_key_fn
         self._init_lr = lr if lr > 0 else 1
@@ -237,7 +236,7 @@ class Adopt_adv(torch.optim.Optimizer):
             dtype = torch.float32 if (state['factored'] or req_precision == 'factored') else p.dtype
             vt_dtype = torch.float32 if (state['factored'] or state['factored_2nd'] or req_precision in ['factored', 'bf16_sr', 'int8_sr']) else dtype
-            vt_init = grad.pow(2).to(vt_dtype) * (1 - group['betas'][1])
+            vt_init = grad.pow(2).to(vt_dtype)
             if state['factored']:
                 state['effective_shape'] = _get_effective_shape(p.numel())
@@ -329,8 +328,7 @@ class Adopt_adv(torch.optim.Optimizer):
     def _step_parameter(self, p, grad, state, group, lr, beta1, beta2, random_int_tensor, random_int_state_tensor):
         grad = upcast_grad_for_precision(grad, state, group['state_precision'])
-        if self.orthogonal_gradient:
-            grad = _orthogonalize_gradient(p, grad)
+        grad = _orthogonalize_gradient(p, grad, group["orthogonal_gradient"])
         nesterov = group.get('nesterov', False)
         nesterov_coef = group.get('nesterov_coef', None)
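
Note on moving `orthogonal_gradient` from `self.orthogonal_gradient` into `defaults` above: options stored in `defaults` are copied into each parameter group unless the group overrides them, so the step function can read `group["orthogonal_gradient"]` per group instead of one optimizer-wide attribute. The sketch below mimics that fallback with plain dictionaries; `build_param_groups` is a hypothetical helper for illustration, not an adv_optm or PyTorch function.

```python
def build_param_groups(param_groups, defaults):
    """Toy version of per-group options falling back to optimizer-wide defaults."""
    merged = []
    for group in param_groups:
        group = dict(group)  # copy so the caller's dict is not mutated
        for name, value in defaults.items():
            group.setdefault(name, value)
        merged.append(group)
    return merged


defaults = {"lr": 1e-3, "orthogonal_gradient": "disabled"}
groups = build_param_groups(
    [
        {"params": ["w1"]},
        {"params": ["w2"], "orthogonal_gradient": "flattened"},
    ],
    defaults,
)
```

Here the first group inherits `'disabled'` while the second keeps its own `'flattened'` setting, which is the kind of per-group control the refactored `Adopt_adv` appears to allow.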

{adv_optm-2.4.dev25 → adv_optm-2.5.1}/adv_optm/optim/Lion_adv.py RENAMED

@@ -67,7 +67,7 @@ class Lion_adv(torch.optim.Optimizer):
         # Stochastic Rounding for BF16
         stochastic_rounding: bool = True,
         # OrthoGrad
-        orthogonal_gradient: bool = False,
+        orthogonal_gradient: str = 'disabled', # 'flattened', 'iterative'
         # Lion-k
         kappa_p: float = 1.0,
         auto_kappa_p: bool = False,
@@ -213,8 +213,9 @@ class Lion_adv(torch.optim.Optimizer):
     def _step_parameter(self, p, grad, state, group, lr, random_int_tensor, random_noise_tensor):
         if grad.dtype != torch.float32 and state['factored']:
             grad = grad.float()
-        if group["orthogonal_gradient"]:
-            grad = _orthogonalize_gradient(p, grad)
+        is_vector = p.ndim < 2 or getattr(p, '_is_dora_scale', False) or getattr(p, 'is_vector', False)
+        grad = _orthogonalize_gradient(p, grad, group["orthogonal_gradient"])
         # Lion-K Logic
         kappa_p = group.get("kappa_p", 1.0)
@@ -250,7 +251,7 @@ class Lion_adv(torch.optim.Optimizer):
             update = update.view(p.shape)
             if group.get('stochastic_sign', False):
-                update = apply_stochastic_sign_(update, noise=random_noise_tensor)
+                update = apply_stochastic_sign_(update, noise=random_noise_tensor, is_vector=is_vector)
             else:
                 update = _get_lion_k_update(update, kappa_p)
@@ -265,7 +266,7 @@ class Lion_adv(torch.optim.Optimizer):
             exp_avg.lerp_(grad, 1 - beta2)
             if group.get('stochastic_sign', False):
-                update = apply_stochastic_sign_(update, noise=random_noise_tensor)
+                update = apply_stochastic_sign_(update, noise=random_noise_tensor, is_vector=is_vector)
             else:
                 update = _get_lion_k_update(update, kappa_p)

{adv_optm-2.4.dev25 → adv_optm-2.5.1}/adv_optm/optim/Muon_adv.py RENAMED

@@ -39,7 +39,8 @@ class Muon_adv(torch.optim.Optimizer):
             (default: (3.4445, -4.7750, 2.0315)).
         stochastic_rounding (bool): whether to use stochastic rounding for
             BF16 parameter updates (default: True).
-        orthogonal_gradient (bool): whether to use OrthoGrad.  (default: False)
+        orthogonal_gradient (str): whether to use OrthoGrad variants. 'disabled': off.
+        'flattened': Standard vectorized OrthoGrad. 'iterative': Matrix-wise rank-2 OrthoGrad. (default: disabled)
         vector_reshape (bool): whether to reshape 1D vectors into 2D
             matrices to apply low-rank compression (default: True).
         nnmf_factor (bool): whether to use the factorization or disable it to use
@@ -89,7 +90,7 @@ class Muon_adv(torch.optim.Optimizer):
         adam_fisher_wd (bool): Fisher Adam (FAdam) weight decay for the AdamW part. (default: False)
         adam_use_bias_correction (bool): Bias correction for AdamW.
         adam_use_atan2 (bool): Atan2 update rule for AdamW.
-        adam_orthogonal_gradient (bool): OrthoGrad for AdamW.
+        adam_orthogonal_gradient (str): OrthoGrad for AdamW.
         adam_nesterov (bool): Nesterov momentum for AdamW. (default: False)
         adam_nesterov_coef (float, optional): Nesterov coefficient for AdamW. (default: None)
         adam_kourkoutas_beta (bool): Kourkoutas-β for AdamW.
@@ -121,7 +122,7 @@ class Muon_adv(torch.optim.Optimizer):
         # Stochastic Rounding for BF16
         stochastic_rounding: bool = True,
         # OrthoGrad
-        orthogonal_gradient: bool = False,
+        orthogonal_gradient: str = 'disabled', # 'flattened', 'iterative'
         # RMS Rescaling
         rms_rescaling: bool = True,
         # SMMF factorization
@@ -159,7 +160,7 @@ class Muon_adv(torch.optim.Optimizer):
         adam_fisher_wd: bool = False,
         adam_use_bias_correction: bool = True,
         adam_use_atan2: bool = False,
-        adam_orthogonal_gradient: bool = False,
+        adam_orthogonal_gradient: str = 'disabled', # 'flattened', 'iterative'
         adam_nesterov: bool = False,
         adam_nesterov_coef: float | None = None,
         adam_kourkoutas_beta: bool = False,
@@ -186,7 +187,7 @@ class Muon_adv(torch.optim.Optimizer):
             print("Warning: spectral_normalization is incompatible with rms_rescaling, Disabling rms_rescaling.")
             rms_rescaling = False
         if spectral_normalization and accelerated_ns:
-            ValueError("spectral_normalization violates accelerated Newton-Schulz assumptions. Pick one of them.")
+            raise ValueError("spectral_normalization violates accelerated Newton-Schulz assumptions. Pick one of them.")
         # Legacy backwards compatibility support for `nnmf_factor=True`
         if nnmf_factor:
@@ -457,8 +458,7 @@ class Muon_adv(torch.optim.Optimizer):
         if grad.dtype != torch.float32 and state.get('factored', False):
             grad = grad.float()
-        if group.get("orthogonal_gradient"):
-            grad = _orthogonalize_gradient(p, grad)
+        grad = _orthogonalize_gradient(p, grad, group.get("orthogonal_gradient"))
         if state['factored']: # Factored Muon
             d1, d2 = state['effective_shape']

{adv_optm-2.4.dev25 → adv_optm-2.5.1}/adv_optm/optim/Prodigy_adv.py RENAMED

@@ -43,7 +43,8 @@ class Prodigy_adv(torch.optim.Optimizer):
         stochastic_rounding (bool): whether to use stochastic
             rounding for BF16 parameter updates (default: True).
         use_atan2 (bool): whether to use the atan2 update rule. (default: False)
-        orthogonal_gradient (bool): whether to use OrthoGrad.  (default: False)
+        orthogonal_gradient (str): whether to use OrthoGrad variants. 'disabled': off.
+        'flattened': Standard vectorized OrthoGrad. 'iterative': Matrix-wise rank-2 OrthoGrad. (default: disabled)
         nnmf_factor (bool): whether to use the factorization or disable it to use
             the uncompressed optimizer. (default: False)
         factored_2nd (bool): whether to keep the first moment uncompressed (dense)
@@ -119,7 +120,7 @@ class Prodigy_adv(torch.optim.Optimizer):
         # Adam_atan2 (scale invariant)
         use_atan2: bool = False,
         # OrthoGrad
-        orthogonal_gradient: bool = False,
+        orthogonal_gradient: str = 'disabled', # 'flattened', 'iterative'
         # Nesterov momentum
         nesterov: bool = False,
         nesterov_coef: float | None = None,
@@ -371,8 +372,7 @@ class Prodigy_adv(torch.optim.Optimizer):
     def _step_parameter(self, p, grad, state, group, beta2, d, dlr, random_int_tensor, random_int_state_tensor):
         grad = upcast_grad_for_precision(grad, state, group['state_precision'])
-        if group["orthogonal_gradient"]:
-            grad = _orthogonalize_gradient(p, grad)
+        grad = _orthogonalize_gradient(p, grad, group["orthogonal_gradient"])
         nesterov = group.get('nesterov', False)
         nesterov_coef = group.get('nesterov_coef', None)

{adv_optm-2.4.dev25 → adv_optm-2.5.1}/adv_optm/optim/SignSGD_adv.py RENAMED

@@ -62,7 +62,7 @@ class SignSGD_adv(torch.optim.Optimizer):
         # Stochastic Rounding for BF16
         stochastic_rounding: bool = True,
         # OrthoGrad
-        orthogonal_gradient: bool = False,
+        orthogonal_gradient: str = 'disabled', # 'flattened', 'iterative'
         # Stochastic Sign Operator
         stochastic_sign: bool = False,
         # Nesterov momentum
@@ -171,7 +171,7 @@ class SignSGD_adv(torch.optim.Optimizer):
     def __init_state(self, p, group):
         state = self.state[p]
         # State Initialization
-        if group["momentum"] > 0 and len(state) == 0:
+        if 'step' not in state:
             req_precision = group['state_precision']
             is_vector = len(p.shape) == 1 and not group['vector_reshape']
@@ -259,8 +259,7 @@ class SignSGD_adv(torch.optim.Optimizer):
         wd_target = None
         cwd_target = None
-        if group["orthogonal_gradient"]:
-            grad = _orthogonalize_gradient(p, grad)
+        grad = _orthogonalize_gradient(p, grad, group["orthogonal_gradient"])
         if normed_mt:
             if sso:
@@ -282,7 +281,7 @@ class SignSGD_adv(torch.optim.Optimizer):
                 if nesterov and normed_mt:
                     # Scale the normalized gradient using empirical buffer magnitude (SNR recovery)
-                    normed_grad = grad_reshaped * exp_avg.abs()
+                    normed_grad = exp_avg.abs().mul_(grad_reshaped)
                 exp_avg.lerp_(grad_reshaped, 1 - momentum)
@@ -313,7 +312,7 @@ class SignSGD_adv(torch.optim.Optimizer):
                 if nesterov and normed_mt:
                     # Scale the normalized gradient using empirical buffer magnitude (SNR recovery)
-                    normed_grad = grad * exp_avg.abs()
+                    normed_grad = exp_avg.abs().mul_(grad)
                 exp_avg.lerp_(grad, 1 - momentum)
@@ -344,7 +343,7 @@ class SignSGD_adv(torch.optim.Optimizer):
         if group.get('geometric_wd', False) and group["weight_decay"] > 0 :
             wd_target = get_signsgd_wd_target(p, denom=denom, stochastic_sign=sso, noise=random_noise_tensor, is_vector=is_vector)
-            if group.get('centered_wd', 0.0) > 0 and 'anchor_type' in state:
+            if group.get('centered_wd', 0.0) > 0 and 'anchor_data' in state:
                 anchor = dequantize_anchor(p, state, group, p.dtype)
                 cwd_target = get_signsgd_wd_target(p.sub(anchor), denom=denom, stochastic_sign=sso, noise=random_noise_tensor, is_vector=is_vector)
                 del anchor
@@ -355,7 +354,7 @@ class SignSGD_adv(torch.optim.Optimizer):
             update_scaling = lr * A if snr_cond else lr
             update.mul_(update_scaling)
-        param_update.apply_parameter_update(self, p, group, update, lr, random_int_tensor=random_int_tensor, wd_target=wd_target, cwd_target=cwd_target, decoupled=True)
+        param_update.apply_parameter_update(self, p, group, update, lr, random_int_tensor=random_int_tensor, wd_target=wd_target, cwd_target=cwd_target)
     def compile(self, *args, **kwargs):
         self._compiled_step_parameter = torch.compile(self._step_parameter, *args, **kwargs)
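
Note on the `__init_state` guard change above (`group["momentum"] > 0 and len(state) == 0` becomes `'step' not in state`): the old guard skipped initialization whenever momentum was zero or the state dict already held any entry. The rest of `__init_state` is not shown in this diff, so the sketch below assumes, purely for illustration, that initialization writes a `'step'` key; `init_state_old` and `init_state_new` are hypothetical names.

```python
def init_state_old(state: dict, momentum: float) -> dict:
    # Old guard: requires positive momentum AND a completely empty state dict.
    if momentum > 0 and len(state) == 0:
        state["step"] = 0  # assumed initialization, not taken from the diff
    return state


def init_state_new(state: dict, momentum: float) -> dict:
    # New guard: initialize once, keyed on the presence of 'step' only.
    if "step" not in state:
        state["step"] = 0  # assumed initialization, not taken from the diff
    return state
```

Under these assumptions, the new guard still initializes when `momentum == 0`, and also when another entry (for example anchor data) was stored in the state before the first step.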

{adv_optm-2.4.dev25 → adv_optm-2.5.1}/adv_optm/optim/SinkSGD_adv.py RENAMED

@@ -69,7 +69,7 @@ class SinkSGD_adv(torch.optim.Optimizer):
         # Stochastic Rounding for BF16
         stochastic_rounding: bool = True,
         # OrthoGrad
-        orthogonal_gradient: bool = False,
+        orthogonal_gradient: str = 'disabled', # 'flattened', 'iterative'
         # Spectral Normed Optimizer
         spectral_normalization: bool = False,
         # Centered WD
@@ -89,8 +89,8 @@ class SinkSGD_adv(torch.optim.Optimizer):
             raise ValueError(f"Momentum should be >= 0.0. Got {momentum}")
         if not (weight_decay >= 0.0):
             raise ValueError(f"Weight-decay should be >= 0.0. Got {weight_decay}")
-        if snr_cond and not normed_momentum:
-            raise NotImplementedError(f"snr_cond is intended to be used with normed_momentum")
+        if snr_cond and not normed_momentum and not momentum > 0:
+            raise NotImplementedError(f"snr_cond is intended to be used with normed_momentum.")
         state_precision = state_precision.lower()
         valid_precisions = {"auto", "fp32", "factored", "bf16_sr", "fp16", "int8_sr"}
@@ -237,8 +237,7 @@ class SinkSGD_adv(torch.optim.Optimizer):
         wd_target = None
         cwd_target = None
-        if group["orthogonal_gradient"]:
-            grad = _orthogonalize_gradient(p, grad)
+        grad = _orthogonalize_gradient(p, grad, group["orthogonal_gradient"])
         if normed_mt:
             if not is_vector:
@@ -266,7 +265,7 @@ class SinkSGD_adv(torch.optim.Optimizer):
                 if nesterov and normed_mt:
                     # Scale the normalized gradient using empirical buffer magnitude (SNR recovery)
-                    normed_grad = grad_reshaped * buf.abs()
+                    normed_grad = buf.abs().mul_(grad_reshaped)
                 buf.lerp_(grad_reshaped, 1 - momentum)
@@ -303,7 +302,7 @@ class SinkSGD_adv(torch.optim.Optimizer):
                 if nesterov and normed_mt:
                     # Scale the normalized gradient using empirical buffer magnitude (SNR recovery)
-                    normed_grad = grad * buf.abs()
+                    normed_grad = buf.abs().mul_(grad)
                 buf.lerp_(grad, 1 - momentum)
@@ -346,7 +345,7 @@ class SinkSGD_adv(torch.optim.Optimizer):
                     wd_scaler = get_sinkhorn_wd_scaler(p, row_denom=vt_row, col_denom=vt_col)
                 else:
                     wd_target = get_signsgd_wd_target(p, denom=denom)
-            if is_vector and group.get('centered_wd', 0.0) > 0 and 'anchor_type' in state:
+            if is_vector and group.get('centered_wd', 0.0) > 0 and 'anchor_data' in state:
                 anchor = dequantize_anchor(p, state, group, p.dtype)
                 cwd_target = get_signsgd_wd_target(p.sub(anchor), denom=denom)
                 del anchor
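
Note on the relaxed `snr_cond` validation above: because comparisons bind more tightly than `not`, the expression `not momentum > 0` reads as `not (momentum > 0)`. The check therefore raises only when `snr_cond` is set, `normed_momentum` is off, and momentum is not positive, even though the error message still mentions only `normed_momentum`. A small sketch of the predicate, with `snr_cond_rejected` as a hypothetical name:

```python
def snr_cond_rejected(snr_cond: bool, normed_momentum: bool, momentum: float) -> bool:
    # Mirrors the 2.5.1 condition; `not momentum > 0` means `not (momentum > 0)`.
    return bool(snr_cond and not normed_momentum and not momentum > 0)
```

In this reading, `snr_cond=True` with plain (non-normalized) momentum greater than zero is accepted in 2.5.1, whereas 2.4.dev25 rejected every `snr_cond` configuration that lacked `normed_momentum`.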

{adv_optm-2.4.dev25 → adv_optm-2.5.1}/adv_optm/util/Muon_AuxAdam.py RENAMED

@@ -71,8 +71,7 @@ def _init_auxadam_state(self, p, group):
 def _adam_step_parameter(self, p, grad, state, group, beta1_adam, beta2_adam, sqrt_bias_correction2, step_size, random_int_tensor, random_int_state_tensor=None):
     grad = upcast_grad_for_precision(grad, state, group.get('adam_state_precision', 'auto'))
-    if group.get("adam_orthogonal_gradient"):
-        grad = _orthogonalize_gradient(p, grad)
+    grad = _orthogonalize_gradient(p, grad, group.get("adam_orthogonal_gradient"))
     if hasattr(self, 'kourkoutas_helper') and self.kourkoutas_helper:
         # Accumulate current grad's norm for the *next* step
@@ -190,4 +189,4 @@ def _adam_step_parameter(self, p, grad, state, group, beta1_adam, beta2_adam, sq
     else:
         update.mul_(update_scaling)
-    param_update.apply_parameter_update(self, p, group, update, step_size, group["adam_weight_decay"], random_int_tensor=random_int_tensor, wd_scaler=wd_scaler)
+    param_update.apply_parameter_update(self, p, group, update, group['lr'], group["adam_weight_decay"], random_int_tensor=random_int_tensor, wd_scaler=wd_scaler)
