PyPI - adv-optm: 2.4.dev22.tar.gz → 2.4.dev24.tar.gz

Files changed (38)

adv_optm-2.4.dev24/PKG-INFO ADDED

@@ -0,0 +1,109 @@
+Metadata-Version: 2.4
+Name: adv_optm
+Version: 2.4.dev24
+Summary: A family of highly efficient, lightweight yet powerful optimizers.
+Home-page: https://github.com/Koratahiu/Advanced_Optimizers
+Author: Koratahiu
+Author-email: hiuhonor@gmail.com
+License: Apache 2.0
+Keywords: llm,fine-tuning,memory-efficient,low-rank,compression,pytorch,optimizer,adam
+Classifier: Programming Language :: Python :: 3
+Classifier: License :: OSI Approved :: Apache Software License
+Classifier: Operating System :: OS Independent
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Requires-Python: >=3.8
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: torch>=2.1
+Dynamic: author
+Dynamic: author-email
+Dynamic: classifier
+Dynamic: description
+Dynamic: description-content-type
+Dynamic: home-page
+Dynamic: keywords
+Dynamic: license
+Dynamic: license-file
+Dynamic: requires-dist
+Dynamic: requires-python
+Dynamic: summary
+# Advanced Optimizers (AIO)
+A comprehensive, all-in-one collection of optimization algorithms for deep learning, designed for **maximum efficiency**, **minimal memory footprint**, and **superior performance** across diverse model architectures and training scenarios.
+[![PyPI](https://img.shields.io/pypi/v/adv_optm)](https://pypi.org/project/adv_optm/)
+## 🔥 What's New
+### In 2.4.x:
+This update introduces a whole refactor of the library with many new features and changes:
+- New optimizers state mode option (`state_precision`) with many precision settings for the optimizer states: rank-2 factored mode (`factored`), full FP32 (`fp32`), BF16 with Stochastic Rounding (`bf16_sr`), int8/uint8 with Stochastic Rounding (`int8_sr`), FP16 (`fp16`)
+- Added new powerful optimizer: SinkSGD_adv.
+- Added spectral scaling option to all optimizers, achieving width/rank invariant updates.
+- Added Nesterov momentum (`nesterov`) and its coef (`nesterov_coef`) to all optimizers.
+- Added centered weight decay (`centered_wd`), to pull the weights toward their pre-train state (anchor)
+    - anchor precision can be changed to save memory (`centered_wd_mode`): full, float8, int8, int4
+- Added Fisher Weight Decay option for Adam variants (`fisher_wd`).
+    - Paper: [FAdam...](https://arxiv.org/abs/2405.12807)
+- Added Factored Second Moment option for Adam variants (`factored_2nd`). This works alongside any `state_precision` setting.
+- Added Geometric Weight Decay for SinkSGD_adv and SignSGD_adv.
+- Added new powerful mode: variance normalized momentum (`normed_momentum`). Which applies the optimizer normalization before the momentum (also called as Normalization then momentum NtM)
+    - For: AdamW_adv, SignSGD_adv, SinkSGD_adv.
+- Added Variance/Confidence Preconditioning (`snr_cond`) for SignSGD_adv, SinkSGD_adv.
+    - Only works with `normed_momentum`.
+    - Technical reports: [AASS](https://koratahiu.github.io/aass/), and [sink-v](https://koratahiu.github.io/sink-v/).
+- Added Adaptive Stochastic Sign with L_inf preconditioning (`stochastic_sign`) for SignSGD_Adv and Lion_adv.
+- Improved CANS (`accelerated_ns`) for Muon variants, by integrating dynamic lower bound.
+- Removed Simplified_AdEMAMix optimizer and its settings in other optimizers, they are now replaced by Nesterov momentum and its coef. Which is better and less hard to tune.
+- Removed cautious and grams modes, as they were heuristic and not working well.
+- Removed optimizers: Lion_Prodigy_adv, and Simplified_AdEMAMix.
+### in 2.1.x
+- Added Signum (SignSGD with momentum): A new optimizer in the family (SignSGD_adv)
+- More info coming soon.
+### in 2.0.x
+* Implemented torch.compile for all advanced optimizers. Enabled via (compiled_optimizer=True) to fuse and optimize the optimizer step path.
+* Better and improved 1-bit factored mode via (nnmf_factor=True).
+* Various improvements across the optimizers.
+### in 1.2.x
+* Added **advanced variants** of [Muon optimizer](https://kellerjordan.github.io/posts/muon/) with **features** and **settings** from recent papers.
+| Optimizer | Description |
+|---|---|
+| `Muon_adv` | Advanced Muon implementation with CANS, NorMuon, Low-Rank ortho, etc. features. |
+| `AdaMuon_adv` | Advanced AdaMuon implementation, which combines Muon's geometry with Adam-like adaptive scaling and sign-based orthogonalization. |
+> *Documentation coming soon.*
+* Implemented [Cautious Weight Decay](https://arxiv.org/abs/2510.12402) for all advanced optimizers.
+* Improved parameter update and weight decay for **BF16** with **stochastic rounding**. The updates are now accumulated in **float32** and rounded once at the end.
+* Use fused and in-place operations whenever possible for all advanced optimizers.
+* **Prodigy variants** are now **50% faster** by [avoiding CUDA syncs](https://github.com/Koratahiu/Advanced_Optimizers/pull/5). Thanks to **@dxqb**!
+---
+## 📦 Installation
+```bash
+pip install adv_optm
+```
+---
+## 🧠 Core Innovations
+This library integrates multiple state-of-the-art optimization techniques validated through extensive research and practical training.
+---

adv_optm-2.4.dev24/README.md ADDED

@@ -0,0 +1,78 @@
+# Advanced Optimizers (AIO)
+A comprehensive, all-in-one collection of optimization algorithms for deep learning, designed for **maximum efficiency**, **minimal memory footprint**, and **superior performance** across diverse model architectures and training scenarios.
+[![PyPI](https://img.shields.io/pypi/v/adv_optm)](https://pypi.org/project/adv_optm/)
+## 🔥 What's New
+### In 2.4.x:
+This update introduces a whole refactor of the library with many new features and changes:
+- New optimizers state mode option (`state_precision`) with many precision settings for the optimizer states: rank-2 factored mode (`factored`), full FP32 (`fp32`), BF16 with Stochastic Rounding (`bf16_sr`), int8/uint8 with Stochastic Rounding (`int8_sr`), FP16 (`fp16`)
+- Added new powerful optimizer: SinkSGD_adv.
+- Added spectral scaling option to all optimizers, achieving width/rank invariant updates.
+- Added Nesterov momentum (`nesterov`) and its coef (`nesterov_coef`) to all optimizers.
+- Added centered weight decay (`centered_wd`), to pull the weights toward their pre-train state (anchor)
+    - anchor precision can be changed to save memory (`centered_wd_mode`): full, float8, int8, int4
+- Added Fisher Weight Decay option for Adam variants (`fisher_wd`).
+    - Paper: [FAdam...](https://arxiv.org/abs/2405.12807)
+- Added Factored Second Moment option for Adam variants (`factored_2nd`). This works alongside any `state_precision` setting.
+- Added Geometric Weight Decay for SinkSGD_adv and SignSGD_adv.
+- Added new powerful mode: variance normalized momentum (`normed_momentum`). Which applies the optimizer normalization before the momentum (also called as Normalization then momentum NtM)
+    - For: AdamW_adv, SignSGD_adv, SinkSGD_adv.
+- Added Variance/Confidence Preconditioning (`snr_cond`) for SignSGD_adv, SinkSGD_adv.
+    - Only works with `normed_momentum`.
+    - Technical reports: [AASS](https://koratahiu.github.io/aass/), and [sink-v](https://koratahiu.github.io/sink-v/).
+- Added Adaptive Stochastic Sign with L_inf preconditioning (`stochastic_sign`) for SignSGD_Adv and Lion_adv.
+- Improved CANS (`accelerated_ns`) for Muon variants, by integrating dynamic lower bound.
+- Removed Simplified_AdEMAMix optimizer and its settings in other optimizers, they are now replaced by Nesterov momentum and its coef. Which is better and less hard to tune.
+- Removed cautious and grams modes, as they were heuristic and not working well.
+- Removed optimizers: Lion_Prodigy_adv, and Simplified_AdEMAMix.
+### in 2.1.x
+- Added Signum (SignSGD with momentum): A new optimizer in the family (SignSGD_adv)
+- More info coming soon.
+### in 2.0.x
+* Implemented torch.compile for all advanced optimizers. Enabled via (compiled_optimizer=True) to fuse and optimize the optimizer step path.
+* Better and improved 1-bit factored mode via (nnmf_factor=True).
+* Various improvements across the optimizers.
+### in 1.2.x
+* Added **advanced variants** of [Muon optimizer](https://kellerjordan.github.io/posts/muon/) with **features** and **settings** from recent papers.
+| Optimizer | Description |
+|---|---|
+| `Muon_adv` | Advanced Muon implementation with CANS, NorMuon, Low-Rank ortho, etc. features. |
+| `AdaMuon_adv` | Advanced AdaMuon implementation, which combines Muon's geometry with Adam-like adaptive scaling and sign-based orthogonalization. |
+> *Documentation coming soon.*
+* Implemented [Cautious Weight Decay](https://arxiv.org/abs/2510.12402) for all advanced optimizers.
+* Improved parameter update and weight decay for **BF16** with **stochastic rounding**. The updates are now accumulated in **float32** and rounded once at the end.
+* Use fused and in-place operations whenever possible for all advanced optimizers.
+* **Prodigy variants** are now **50% faster** by [avoiding CUDA syncs](https://github.com/Koratahiu/Advanced_Optimizers/pull/5). Thanks to **@dxqb**!
+---
+## 📦 Installation
+```bash
+pip install adv_optm
+```
+---
+## 🧠 Core Innovations
+This library integrates multiple state-of-the-art optimization techniques validated through extensive research and practical training.
+---
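
The README entry on BF16 with stochastic rounding (accumulate in float32, round once at the end) can be illustrated with a common bit-level trick. This is a minimal sketch under stated assumptions, not the library's implementation: the helper names `f32_bits`, `bits_f32`, and `stochastic_round_bf16` are hypothetical, and the sketch ignores inf/NaN handling.

```python
import struct


def f32_bits(x: float) -> int:
    """Return the IEEE-754 float32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b: int) -> float:
    """Inverse of f32_bits: interpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]


def stochastic_round_bf16(x: float, rand16: int) -> float:
    """Round a float32 value to a BF16-representable value (hypothetical helper).

    BF16 keeps the top 16 bits of a float32. Adding a uniform random 16-bit
    integer to the low 16 bits before truncating makes the probability of
    rounding away from zero equal to the discarded fraction, so the rounding
    is unbiased in expectation.
    """
    bits = f32_bits(x)
    bits = (bits + (rand16 & 0xFFFF)) & 0xFFFF0000  # add noise, then truncate
    return bits_f32(bits)
```

For `x = 1.0 + 2**-10`, the two BF16 neighbours are 1.0 and 1.0078125; the sketch returns the upper one for one eighth of all possible `rand16` values, which matches the position of `x` between them.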

{adv_optm-2.4.dev22 → adv_optm-2.4.dev24}/adv_optm/__init__.py RENAMED

@@ -20,4 +20,4 @@ __all__ = [
     "SinkSGD_adv",
 ]
-__version__ = "2.4.dev22"
+__version__ = "2.4.dev24"

{adv_optm-2.4.dev22 → adv_optm-2.4.dev24}/adv_optm/optim/AdaMuon_adv.py RENAMED

@@ -99,7 +99,7 @@ class AdaMuon_adv(torch.optim.Optimizer):
         use_muon (bool | None): whether to use Muon or AuxAdamW. MUST be provided
             either here or via `optim_type` in parameter groups. (default: None)
         state_precision (str): Precision for Muon optimizer states. Options: 'auto' (parameter dtype), 'fp32',
-            'bf16_sr' (BF16 with stochastic rounding), 'fp8_sr', 'int8_sr'.
+            'bf16_sr' (BF16 with stochastic rounding), 'int8_sr'.
             (default: 'auto')
         factored_2nd (bool): Factorize only the second moment (v_t) using SMMF
             low-rank compression while keeping the first moment (momentum_buffer)
@@ -123,7 +123,7 @@ class AdaMuon_adv(torch.optim.Optimizer):
         adam_tiny_spike (float): Tiny spike for Kourkoutas-β. (default: 1e-9)
         adam_k_warmup_steps (int): Warmup steps for Kourkoutas-β. (default: 0)
         adam_spectral_normalization (bool): Enable explicit spectral normalization for AdamW. (default: False)
-        adam_state_precision (str): Precision for AuxAdam states. Options: 'auto', 'fp32', 'bf16_sr', 'fp16', 'fp8_sr', 'int8_sr', 'factored'. (default: 'auto')
+        adam_state_precision (str): Precision for AuxAdam states. Options: 'auto', 'fp32', 'bf16_sr', 'fp16', 'int8_sr', 'factored'. (default: 'auto')
         adam_nnmf_factor (bool): 1-bit factored for AdamW.
         adam_factored_2nd (bool): Factorize only the second moment (v_t) for AuxAdam. (default: False)
     """
@@ -157,7 +157,7 @@ class AdaMuon_adv(torch.optim.Optimizer):
         # Boolean to spilt param
         use_muon: bool | None = None,
         # States precision (Muon path)
-        state_precision: str = "auto",  # 'fp32', 'bf16_sr', 'fp8_sr', 'int8_sr'
+        state_precision: str = "auto",  # 'fp32', 'bf16_sr', 'int8_sr'
         # Factorized second moment only
         factored_2nd: bool = False,
         # Update geometry parameters
@@ -220,7 +220,7 @@ class AdaMuon_adv(torch.optim.Optimizer):
             state_precision = "factored"
         state_precision = state_precision.lower()
-        valid_precisions = {"auto", "fp32", "factored", "bf16_sr", "fp16", "fp8_sr", "int8_sr"}
+        valid_precisions = {"auto", "fp32", "factored", "bf16_sr", "fp16", "int8_sr"}
         if state_precision not in valid_precisions:
             raise ValueError(f"state_precision must be one of {valid_precisions}. Got {state_precision}")
@@ -374,6 +374,7 @@ class AdaMuon_adv(torch.optim.Optimizer):
                     d1, d2 = state['effective_shape']
                     state['mu_vbuf_nmf'] = torch.zeros(d1, device=p.device, dtype=torch.float32)
                     state['mv_vbuf_nmf'] = torch.zeros(d2, device=p.device, dtype=torch.float32)
+                    state['shifter'] = torch.tensor([1, 2, 4, 8, 16, 32, 64, 128], device=p.device, dtype=torch.uint8)
                 elif not group['normuon_variant']:
                     init_state_tensor(state, 'second_momentum_buffer', p.shape, actual_precision, p.device, default_dtype, non_neg=True)
@@ -454,8 +455,6 @@ class AdaMuon_adv(torch.optim.Optimizer):
                     random_int_state_tensor = param_update._get_random_int_for_sr(p)
                 elif actual_precision == 'int8_sr':
                     random_int_state_tensor = param_update._get_random_int_for_8bit_sr(p)
-                elif actual_precision == 'fp8_sr':
-                    random_int_state_tensor = param_update._get_random_int_for_fp8_sr(p)
             else:
                 adam_step_param = Muon_AuxAdam._adam_step_parameter
@@ -475,8 +474,6 @@ class AdaMuon_adv(torch.optim.Optimizer):
                     random_int_state_tensor = param_update._get_random_int_for_sr(p)
                 elif actual_precision == 'int8_sr':
                     random_int_state_tensor = param_update._get_random_int_for_8bit_sr(p)
-                elif actual_precision == 'fp8_sr':
-                    random_int_state_tensor = param_update._get_random_int_for_fp8_sr(p)
                 if group['low_rank_ortho']:
                     random_G_sketch = param_update._get_random_noise_for_low_rank_ortho(p, group['ortho_rank'])
             else:
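
The practical effect of the `valid_precisions` change in the hunks above is that `'fp8_sr'` is now rejected at construction time instead of selecting an FP8 state path. The check can be sketched in isolation as follows; the set contents and the error message are copied from the diff, while the standalone function name `normalize_state_precision` is hypothetical.

```python
# Allowed values after this release, as shown in the diff ('fp8_sr' removed).
VALID_PRECISIONS = {"auto", "fp32", "factored", "bf16_sr", "fp16", "int8_sr"}


def normalize_state_precision(state_precision: str) -> str:
    """Lower-case and validate a state_precision value (hypothetical wrapper)."""
    state_precision = state_precision.lower()
    if state_precision not in VALID_PRECISIONS:
        raise ValueError(
            f"state_precision must be one of {VALID_PRECISIONS}. Got {state_precision}"
        )
    return state_precision
```

Callers that still pass `state_precision="fp8_sr"` should expect a `ValueError` after upgrading and would need to pick one of the remaining modes.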

{adv_optm-2.4.dev22 → adv_optm-2.4.dev24}/adv_optm/optim/AdamW_adv.py RENAMED

@@ -84,7 +84,7 @@ class AdamW_adv(torch.optim.Optimizer):
             while only factorizing the second moment. (default: False)
         state_precision (str): Precision method for Adopt states. Options: 'auto'
             (parameter precision), 'fp32', 'factored' (SMMF low-rank FP32), 'bf16_sr' (with
-            stochastic rounding), 'fp16' , 'fp8_sr', 'int8_sr'. (default: 'auto')
+            stochastic rounding), 'fp16' , 'int8_sr'. (default: 'auto')
     """
     def __init__(
@@ -124,7 +124,7 @@ class AdamW_adv(torch.optim.Optimizer):
         centered_wd: float = 0.0,
         centered_wd_mode: str = 'float8',
         # States precision
-        state_precision: str = "auto", # 'fp32', 'factored', 'bf16_sr', 'fp8_sr', 'int8_sr'.
+        state_precision: str = "auto", # 'fp32', 'factored', 'bf16_sr', 'int8_sr'.
         # Factorized second moment only
         factored_2nd: bool = False,
         # SMMF factorization (legacy)
@@ -145,7 +145,7 @@ class AdamW_adv(torch.optim.Optimizer):
             raise ValueError(f"For Kourkoutas-β, betas[1] (as beta2_max) must be > beta2_min. Got {betas[1]} and {beta2_min}")
         state_precision = state_precision.lower()
-        valid_precisions = {"auto", "fp32", "factored", "bf16_sr", "fp16", "fp8_sr", "int8_sr"}
+        valid_precisions = {"auto", "fp32", "factored", "bf16_sr", "fp16", "int8_sr"}
         if state_precision not in valid_precisions:
             raise ValueError(f"state_precision must be one of {valid_precisions}. Got {state_precision}")
@@ -264,6 +264,7 @@ class AdamW_adv(torch.optim.Optimizer):
                     d1, d2 = state['effective_shape']
                     state['mu_v_nmf'] = torch.zeros(d1, device=device, dtype=torch.float32)
                     state['mv_v_nmf'] = torch.zeros(d2, device=device, dtype=torch.float32)
+                    state['shifter'] = torch.tensor([1, 2, 4, 8, 16, 32, 64, 128], device=device, dtype=torch.uint8)
                 else:
                     init_state_tensor(state, 'exp_avg_sq', p.shape, actual_precision, p.device, dtype, non_neg=True)
@@ -314,8 +315,6 @@ class AdamW_adv(torch.optim.Optimizer):
                 random_int_state_tensor = param_update._get_random_int_for_sr(p)
             elif group['actual_state_precision'] == 'int8_sr':
                 random_int_state_tensor = param_update._get_random_int_for_8bit_sr(p)
-            elif group['actual_state_precision'] == 'fp8_sr':
-                random_int_state_tensor = param_update._get_random_int_for_fp8_sr(p)
             step_param_fn = self._compiled_step_parameter
         else:
             step_param_fn = self._step_parameter
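
The new `state['shifter']` tensor holds the powers of two `[1, 2, 4, ..., 128]` as `uint8`. The diff does not show how it is consumed, so the following is only an assumption: a table like this is commonly used to pack eight boolean flags (for example sign bits) into one byte. The sketch below shows that technique in plain Python; `pack_bits` and `unpack_bits` are hypothetical names, not functions from the package.

```python
# Bit weights, matching the values stored in state['shifter'] in the diff.
SHIFTER = [1, 2, 4, 8, 16, 32, 64, 128]


def pack_bits(flags):
    """Pack booleans into bytes, eight per byte, least significant bit first."""
    padded = list(flags) + [False] * (-len(flags) % 8)  # pad to a multiple of 8
    out = bytearray()
    for i in range(0, len(padded), 8):
        # Sum the weights of the set flags to form one byte.
        out.append(sum(w for f, w in zip(padded[i:i + 8], SHIFTER) if f))
    return bytes(out)


def unpack_bits(packed, n):
    """Recover the first n booleans by masking each byte with every weight."""
    flags = [bool(byte & w) for byte in packed for w in SHIFTER]
    return flags[:n]
```

Creating the weight table once and storing it in the optimizer state avoids rebuilding it on every step, which may be the motivation for the added line, though the diff itself does not say so.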

{adv_optm-2.4.dev22 → adv_optm-2.4.dev24}/adv_optm/optim/Adopt_adv.py RENAMED

@@ -88,7 +88,7 @@ class Adopt_adv(torch.optim.Optimizer):
             while only factorizing the second moment. (default: False)
         state_precision (str): Precision method for Adopt states. Options: 'auto'
             (parameter precision), 'fp32', 'factored' (SMMF low-rank FP32), 'bf16_sr' (with
-            stochastic rounding), 'fp16' , 'fp8_sr', 'int8_sr'. (default: 'auto')
+            stochastic rounding), 'fp16' , 'int8_sr'. (default: 'auto')
     """
     def __init__(
@@ -126,7 +126,7 @@ class Adopt_adv(torch.optim.Optimizer):
         centered_wd: float = 0.0,
         centered_wd_mode: str = 'float8',
         # States precision
-        state_precision: str = "auto", # 'fp32', 'factored', 'bf16_sr', 'fp8_sr', 'int8_sr'.
+        state_precision: str = "auto", # 'fp32', 'factored', 'bf16_sr', 'int8_sr'.
         # Factorized second moment only
         factored_2nd: bool = False,
         # SMMF factorization (legacy)
@@ -148,7 +148,7 @@ class Adopt_adv(torch.optim.Optimizer):
         state_precision = state_precision.lower()
-        valid_precisions = {"auto", "fp32", "factored", "bf16_sr", "fp16", "fp8_sr", "int8_sr"}
+        valid_precisions = {"auto", "fp32", "factored", "bf16_sr", "fp16", "int8_sr"}
         if state_precision not in valid_precisions:
             raise ValueError(f"state_precision must be one of {valid_precisions}. Got {state_precision}")
@@ -236,7 +236,7 @@ class Adopt_adv(torch.optim.Optimizer):
             dtype = torch.float32 if (state['factored'] or req_precision == 'factored') else p.dtype
-            vt_dtype = torch.float32 if (state['factored'] or state['factored_2nd'] or req_precision in ['factored', 'bf16_sr', 'fp8_sr', 'int8_sr']) else dtype
+            vt_dtype = torch.float32 if (state['factored'] or state['factored_2nd'] or req_precision in ['factored', 'bf16_sr', 'int8_sr']) else dtype
             vt_init = grad.pow(2).to(vt_dtype) * (1 - group['betas'][1])
             if state['factored']:
@@ -262,6 +262,7 @@ class Adopt_adv(torch.optim.Optimizer):
                     state['effective_shape'] = _get_effective_shape(p.numel())
                     d1, d2 = state['effective_shape']
                     state['mu_v_nmf'], state['mv_v_nmf'] = _nnmf(vt_init.view(d1, d2))
+                    state['shifter'] = torch.tensor([1, 2, 4, 8, 16, 32, 64, 128], device=p.device, dtype=torch.uint8)
                 else:
                     init_state_tensor(state, 'exp_avg_sq', p.shape, actual_precision, p.device, dtype)
                     set_state(state, 'exp_avg_sq', vt_init, actual_precision, None, non_neg=True)
@@ -316,8 +317,6 @@ class Adopt_adv(torch.optim.Optimizer):
                 random_int_state_tensor = param_update._get_random_int_for_sr(p)
             elif group['actual_state_precision'] == 'int8_sr':
                 random_int_state_tensor = param_update._get_random_int_for_8bit_sr(p)
-            elif group['actual_state_precision'] == 'fp8_sr':
-                random_int_state_tensor = param_update._get_random_int_for_fp8_sr(p)
             step_param_fn = self._compiled_step_parameter
         else:
             lr = group['lr']

{adv_optm-2.4.dev22 → adv_optm-2.4.dev24}/adv_optm/optim/Lion_adv.py RENAMED

@@ -33,8 +33,6 @@ class Lion_adv(torch.optim.Optimizer):
         stochastic_rounding (bool, optional): whether to use stochastic
             rounding for BF16 parameter updates (default: True).
         orthogonal_gradient (bool): whether to orthogonalize the gradient (default: False).
-        clip_threshold (float, optional): whether to clip the gradients norm
-            per-parameter (default: 0.0).
         kappa_p (float, optional): The p-value for the Lp-norm in Lion-K (domain [1.0, 2.0]).
             - 1.0: Standard Lion (sign update).
             - 2.0: Spherical Lion (normalized L2 update).

{adv_optm-2.4.dev22 → adv_optm-2.4.dev24}/adv_optm/optim/Muon_adv.py RENAMED

@@ -47,7 +47,7 @@ class Muon_adv(torch.optim.Optimizer):
         use_muon (bool | None): whether to use Muon or AuxAdamW. MUST be provided
             either here or via `optim_type` in parameter groups. (default: None)
         state_precision (str): Precision for Muon optimizer states. Options: 'auto' (parameter dtype), 'fp32',
-            'bf16_sr' (BF16 with stochastic rounding), 'fp8_sr', 'int8_sr'.
+            'bf16_sr' (BF16 with stochastic rounding), 'int8_sr'.
             (default: 'auto')
         low_rank_ortho (bool): If True, enables low-rank orthogonalization, which
             projects the update to a lower rank before orthogonalization.
@@ -98,7 +98,7 @@ class Muon_adv(torch.optim.Optimizer):
         adam_tiny_spike (float): Tiny spike for Kourkoutas-β. (default: 1e-9)
         adam_k_warmup_steps (int): Warmup steps for Kourkoutas-β. (default: 0)
         adam_spectral_normalization (bool): Enable explicit spectral normalization for AdamW. (default: False)
-        adam_state_precision (str): Precision for AuxAdam states. Options: 'auto', 'fp32', 'bf16_sr', 'fp16', 'fp8_sr', 'int8_sr', 'factored'. (default: 'auto')
+        adam_state_precision (str): Precision for AuxAdam states. Options: 'auto', 'fp32', 'bf16_sr', 'fp16', 'int8_sr', 'factored'. (default: 'auto')
         adam_nnmf_factor (bool): 1-bit factored for AdamW.
         adam_factored_2nd (bool): Factorize only the second moment (v_t) for AuxAdam. (default: False)
         """
@@ -130,7 +130,7 @@ class Muon_adv(torch.optim.Optimizer):
         # Boolean to spilt param
         use_muon: bool | None = None,
         # States precision (Muon path)
-        state_precision: str = "auto",  # 'fp32', 'bf16_sr', 'fp8_sr', 'int8_sr'
+        state_precision: str = "auto",  # 'fp32', 'bf16_sr', 'int8_sr'
         # Low-rank Muon
         low_rank_ortho: bool = False,
         ortho_rank: int = 128,
@@ -193,7 +193,7 @@ class Muon_adv(torch.optim.Optimizer):
             state_precision = "factored"
         state_precision = state_precision.lower()
-        valid_precisions = {"auto", "fp32", "factored", "bf16_sr", "fp16", "fp8_sr", "int8_sr"}
+        valid_precisions = {"auto", "fp32", "factored", "bf16_sr", "fp16", "int8_sr"}
         if state_precision not in valid_precisions:
             raise ValueError(f"state_precision must be one of {valid_precisions}. Got {state_precision}")
@@ -406,8 +406,6 @@ class Muon_adv(torch.optim.Optimizer):
                     random_int_state_tensor = param_update._get_random_int_for_sr(p)
                 elif actual_precision == 'int8_sr':
                     random_int_state_tensor = param_update._get_random_int_for_8bit_sr(p)
-                elif actual_precision == 'fp8_sr':
-                    random_int_state_tensor = param_update._get_random_int_for_fp8_sr(p)
             else:
                 adam_step_param = Muon_AuxAdam._adam_step_parameter
@@ -427,8 +425,6 @@ class Muon_adv(torch.optim.Optimizer):
                     random_int_state_tensor = param_update._get_random_int_for_sr(p)
                 elif actual_precision == 'int8_sr':
                     random_int_state_tensor = param_update._get_random_int_for_8bit_sr(p)
-                elif actual_precision == 'fp8_sr':
-                    random_int_state_tensor = param_update._get_random_int_for_fp8_sr(p)
                 if group['low_rank_ortho']:
                     random_G_sketch = param_update._get_random_noise_for_low_rank_ortho(p, group['ortho_rank'])
             else:

{adv_optm-2.4.dev22 → adv_optm-2.4.dev24}/adv_optm/optim/Prodigy_adv.py RENAMED

@@ -124,7 +124,7 @@ class Prodigy_adv(torch.optim.Optimizer):
         nesterov: bool = False,
         nesterov_coef: float | None = None,
         # States precision
-        state_precision: str = "auto", # 'fp32', 'factored', 'bf16_sr', 'fp8_sr', 'int8_sr'.
+        state_precision: str = "auto", # 'fp32', 'factored', 'bf16_sr', 'int8_sr'.
         # Factorized second moment only
         factored_2nd: bool = False,
         # SMMF factorization (legacy)
@@ -168,7 +168,7 @@ class Prodigy_adv(torch.optim.Optimizer):
             raise ValueError(f"For Kourkoutas-β, betas[1] (as beta2_max) must be > beta2_min. Got {betas[1]} and {beta2_min}")
         state_precision = state_precision.lower()
-        valid_precisions = {"auto", "fp32", "factored", "bf16_sr", "fp16", "fp8_sr", "int8_sr"}
+        valid_precisions = {"auto", "fp32", "factored", "bf16_sr", "fp16", "int8_sr"}
         if state_precision not in valid_precisions:
             raise ValueError(f"state_precision must be one of {valid_precisions}. Got {state_precision}")
@@ -311,6 +311,7 @@ class Prodigy_adv(torch.optim.Optimizer):
                     d1, d2 = state['effective_shape']
                     state['mu_v_nmf'] = torch.zeros(d1, device=device, dtype=torch.float32)
                     state['mv_v_nmf'] = torch.zeros(d2, device=device, dtype=torch.float32)
+                    state['shifter'] = torch.tensor([1, 2, 4, 8, 16, 32, 64, 128], device=p.device, dtype=torch.uint8)
                 else:
                     init_state_tensor(state, 'exp_avg_sq', p.shape, actual_precision, p.device, dtype, non_neg=True)
@@ -358,8 +359,6 @@ class Prodigy_adv(torch.optim.Optimizer):
                 random_int_state_tensor = param_update._get_random_int_for_sr(p)
             elif group['actual_state_precision'] == 'int8_sr':
                 random_int_state_tensor = param_update._get_random_int_for_8bit_sr(p)
-            elif group['actual_state_precision'] == 'fp8_sr':
-                random_int_state_tensor = param_update._get_random_int_for_fp8_sr(p)
             step_param_fn = self._compiled_step_parameter
         else:
             d = group['d']

{adv_optm-2.4.dev22 → adv_optm-2.4.dev24}/adv_optm/optim/SignSGD_adv.py RENAMED

@@ -44,7 +44,7 @@ class SignSGD_adv(torch.optim.Optimizer):
             'int4': Uses 4-bit block-wise quantization (block size 32).
         state_precision (str): Precision method for Adopt states. Options: 'auto'
             (parameter precision), 'fp32', 'factored' (SMMF low-rank FP32), 'bf16_sr' (with
-            stochastic rounding), 'fp16' , 'fp8_sr', 'int8_sr'. (default: 'auto')
+            stochastic rounding), 'fp16' , 'int8_sr'. (default: 'auto')
         nnmf_factor (bool): whether to use the factorization or use the
             uncompressed optimizer. (default: True)
     """
@@ -70,13 +70,13 @@ class SignSGD_adv(torch.optim.Optimizer):
         nesterov_coef: float | None = None,
         # Normalization then Momentum
         normed_momentum: bool = False,
-        # SNR Precondition
+        # SNR Precondition (requires normed_momentum)
         snr_cond: bool = False,
         # Centered WD
         centered_wd: float = 0.0,
         centered_wd_mode: str = 'float8',
         # States precision
-        state_precision: str = "auto", # 'fp32', 'factored', 'bf16_sr', 'fp8_sr', 'int8_sr'.
+        state_precision: str = "auto", # 'fp32', 'factored', 'bf16_sr', 'int8_sr'.
         # Spectral Normed Optimizer
         spectral_normalization: bool = False,
         # SMMF factorization
@@ -95,7 +95,7 @@ class SignSGD_adv(torch.optim.Optimizer):
             raise NotImplementedError(f"snr_cond is intended to be used with normed_momentum")
         state_precision = state_precision.lower()
-        valid_precisions = {"auto", "fp32", "factored", "bf16_sr", "fp16", "fp8_sr", "int8_sr"}
+        valid_precisions = {"auto", "fp32", "factored", "bf16_sr", "fp16", "int8_sr"}
         if state_precision not in valid_precisions:
             raise ValueError(f"state_precision must be one of {valid_precisions}. Got {state_precision}")
@@ -230,8 +230,6 @@ class SignSGD_adv(torch.optim.Optimizer):
                     random_int_state_tensor = param_update._get_random_int_for_sr(p)
                 elif group['actual_state_precision'] == 'int8_sr':
                     random_int_state_tensor = param_update._get_random_int_for_8bit_sr(p)
-                elif group['actual_state_precision'] == 'fp8_sr':
-                    random_int_state_tensor = param_update._get_random_int_for_fp8_sr(p)
             if group.get('stochastic_sign', False) and not is_vector:
                 random_noise_tensor = param_update._get_random_noise_for_sso(p)
@@ -254,7 +252,8 @@ class SignSGD_adv(torch.optim.Optimizer):
         nesterov = group.get('nesterov', False)
         nesterov_coef = group.get('nesterov_coef', None)
         sso = group.get('stochastic_sign', False)
-        snr_cond = group.get('snr_cond', False) and group.get('normed_momentum', False) and momentum > 0
+        normed_mt = group.get('normed_momentum', False)
+        snr_cond = group.get('snr_cond', False) and normed_mt and momentum > 0
         denom = None
         wd_target = None
@@ -263,7 +262,7 @@ class SignSGD_adv(torch.optim.Optimizer):
         if group["orthogonal_gradient"]:
             grad = _orthogonalize_gradient(p, grad)
-        if group.get('normed_momentum', False):
+        if normed_mt:
             if sso:
                 grad = apply_stochastic_sign_(grad, noise=random_noise_tensor, is_vector=is_vector)
             else:
@@ -285,7 +284,12 @@ class SignSGD_adv(torch.optim.Optimizer):
                 if nesterov:
                     nv_coef = momentum if nesterov_coef is None else nesterov_coef
-                    raw_update = grad_reshaped.lerp(exp_avg, nv_coef)
+                    if normed_mt:
+                        # Scale the normalized gradient down to match the buffer's variance
+                        ema_std = math.sqrt((1 - momentum) / (1 + momentum))
+                        raw_update = (grad_reshaped * ema_std).lerp_(exp_avg, nv_coef)
+                    else:
+                        raw_update = grad.lerp(exp_avg, nv_coef)
                 else:
                     raw_update = exp_avg.clone()
@@ -309,7 +313,12 @@ class SignSGD_adv(torch.optim.Optimizer):
                 if nesterov:
                     nv_coef = momentum if nesterov_coef is None else nesterov_coef
-                    raw_update = grad.lerp(exp_avg, nv_coef)
+                    if normed_mt:
+                        # Scale the normalized gradient down to match the buffer's variance
+                        ema_std = math.sqrt((1 - momentum) / (1 + momentum))
+                        raw_update = (grad * ema_std).lerp_(exp_avg, nv_coef)
+                    else:
+                        raw_update = grad.lerp(exp_avg, nv_coef)
                 else:
                     raw_update = exp_avg.clone()

{adv_optm-2.4.dev22 → adv_optm-2.4.dev24}/adv_optm/optim/SinkSGD_adv.py RENAMED

@@ -42,7 +42,7 @@ class SinkSGD_adv(torch.optim.Optimizer):
         nnmf_factor (bool): whether to use factorization or disable it. (default: False)
         state_precision (str): Precision method for states. Options: 'auto'
             (parameter precision), 'fp32', 'factored' (SMMF low-rank FP32), 'bf16_sr',
-            'fp8_sr', 'int8_sr'. (default: 'auto')
+            'int8_sr'. (default: 'auto')
         compiled_optimizer (bool): Compiles the core step function using torch.compile
             for faster execution. (default: False)
     """
@@ -58,7 +58,7 @@ class SinkSGD_adv(torch.optim.Optimizer):
         orthogonal_sinkhorn: bool = False,
         # Normalization then Momentum
         normed_momentum: bool = False,
-        # SNR Precondition
+        # SNR Precondition (requires normed_momentum)
         snr_cond: bool = False,
         # Nesterov Momentum
         nesterov: bool = False,
@@ -93,7 +93,7 @@ class SinkSGD_adv(torch.optim.Optimizer):
             raise NotImplementedError(f"snr_cond is intended to be used with normed_momentum")
         state_precision = state_precision.lower()
-        valid_precisions = {"auto", "fp32", "factored", "bf16_sr", "fp16", "fp8_sr", "int8_sr"}
+        valid_precisions = {"auto", "fp32", "factored", "bf16_sr", "fp16", "int8_sr"}
         if state_precision not in valid_precisions:
             raise ValueError(f"state_precision must be one of {valid_precisions}. Got {state_precision}")
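
With `'fp8_sr'` removed from `valid_precisions`, passing it now fails the membership check. The following standalone sketch mirrors the check shown in the hunk above. The function name `check_state_precision` is hypothetical; in the package the check sits inline in the optimizer constructor.

```python
# Set of accepted values as of 2.4.dev24, copied from the diff.
VALID_PRECISIONS = {"auto", "fp32", "factored", "bf16_sr", "fp16", "int8_sr"}


def check_state_precision(state_precision: str) -> str:
    """Lower-case the option and reject anything outside the accepted set."""
    state_precision = state_precision.lower()
    if state_precision not in VALID_PRECISIONS:
        raise ValueError(
            f"state_precision must be one of {VALID_PRECISIONS}. Got {state_precision}"
        )
    return state_precision


if __name__ == "__main__":
    print(check_state_precision("INT8_SR"))
    try:
        check_state_precision("fp8_sr")
    except ValueError as err:
        print("rejected:", err)
```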
@@ -209,8 +209,6 @@ class SinkSGD_adv(torch.optim.Optimizer):
                 random_int_state_tensor = param_update._get_random_int_for_sr(p)
             elif group['actual_state_precision'] == 'int8_sr':
                 random_int_state_tensor = param_update._get_random_int_for_8bit_sr(p)
-            elif group['actual_state_precision'] == 'fp8_sr':
-                random_int_state_tensor = param_update._get_random_int_for_fp8_sr(p)
             step_param_fn = self._compiled_step_parameter
         else:
             step_param_fn = self._step_parameter
@@ -226,6 +224,7 @@ class SinkSGD_adv(torch.optim.Optimizer):
         orthogonal_sinkhorn = group['orthogonal_sinkhorn']
         momentum = group['momentum']
+        normed_mt = group.get('normed_momentum', False)
         nesterov = group['nesterov']
         nesterov_coef = group.get('nesterov_coef', None)
         snr_cond = group.get('snr_cond', False)
@@ -238,7 +237,10 @@ class SinkSGD_adv(torch.optim.Optimizer):
         wd_target = None
         cwd_target = None
-        if group.get('normed_momentum', False):
+        if group["orthogonal_gradient"]:
+            grad = _orthogonalize_gradient(p, grad)
+        if normed_mt:
             if not is_vector:
                 # Sinkhorn iterative normalization
                 grad = apply_sr_sinkhorn(grad, iters=sinkhorn_iterations, p=p, ortho_project=orthogonal_sinkhorn)
@@ -246,9 +248,6 @@ class SinkSGD_adv(torch.optim.Optimizer):
                 # For vectors, apply sign operation
                 grad = grad.sign_()
-        if group["orthogonal_gradient"]:
-            grad = _orthogonalize_gradient(p, grad)
         if state['factored']:
             d1, d2 = state['effective_shape']
             grad_reshaped = grad.view(d1, d2)
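
The two hunks above move the `orthogonal_gradient` step ahead of the normalization. One effect of the new order is that the normalized gradient is what gets accumulated, rather than a projected version of it. The sketch below illustrates this for the vector (sign) branch only. It assumes `_orthogonalize_gradient` removes the component of the gradient parallel to the parameter; the real helper is not shown in this diff and may also rescale the result. The functions `orthogonalize` and `sign` are hypothetical stand-ins on plain Python lists.

```python
from typing import List


def orthogonalize(p: List[float], g: List[float]) -> List[float]:
    """Assumed behaviour: project g onto the subspace orthogonal to p."""
    pp = sum(pi * pi for pi in p)
    if pp == 0.0:
        return list(g)
    scale = sum(pi * gi for pi, gi in zip(p, g)) / pp
    return [gi - scale * pi for pi, gi in zip(p, g)]


def sign(v: List[float]) -> List[float]:
    """Elementwise sign, returning -1.0, 0.0 or 1.0."""
    return [float((x > 0) - (x < 0)) for x in v]


if __name__ == "__main__":
    p = [1.0, 2.0]
    g = [3.0, 1.0]
    # New order (2.4.dev24): orthogonalize first, then normalize.
    print("new:", sign(orthogonalize(p, g)))
    # Old order (2.4.dev22): normalize first, then orthogonalize.
    print("old:", orthogonalize(p, sign(g)))
```

In the new order every entry of the result is -1, 0 or 1; in the old order the projection is applied last, so the entries generally are not.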
@@ -272,7 +271,12 @@ class SinkSGD_adv(torch.optim.Optimizer):
                 if nesterov:
                     nv_coef = momentum if nesterov_coef is None else nesterov_coef
-                    update = grad_reshaped.lerp(buf, nv_coef)
+                    if normed_mt:
+                        # Scale the normalized gradient down to match the buffer's variance
+                        ema_std = math.sqrt((1 - momentum) / (1 + momentum))
+                        update = (grad_reshaped * ema_std).lerp_(buf, nv_coef)
+                    else:
+                        update = grad_reshaped.lerp(buf, nv_coef)
                 else:
                     update = buf.clone()
             else:
@@ -301,7 +305,12 @@ class SinkSGD_adv(torch.optim.Optimizer):
                 if nesterov:
                     nv_coef = momentum if nesterov_coef is None else nesterov_coef
-                    update = grad.lerp(buf, nv_coef)
+                    if normed_mt:
+                        # Scale the normalized gradient down to match the buffer's variance
+                        ema_std = math.sqrt((1 - momentum) / (1 + momentum))
+                        update = (grad * ema_std).lerp_(buf, nv_coef)
+                    else:
+                        update = grad.lerp(buf, nv_coef)
                 else:
                     update = buf.clone()
             else:
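
The Nesterov branch above interpolates between the current gradient and the momentum buffer. `Tensor.lerp(end, weight)` computes `start + weight * (end - start)`, so the update is `(1 - nv_coef) * grad + nv_coef * buf`, with the gradient pre-scaled by `ema_std` when `normed_momentum` is on. A scalar sketch of that arithmetic follows; the function names are hypothetical and plain floats stand in for tensors.

```python
import math
from typing import Optional


def lerp(start: float, end: float, weight: float) -> float:
    """Same formula as torch.lerp: start + weight * (end - start)."""
    return start + weight * (end - start)


def nesterov_update(
    grad: float,
    buf: float,
    momentum: float,
    nesterov_coef: Optional[float] = None,
    normed_mt: bool = False,
) -> float:
    """Scalar version of the Nesterov branch shown in the diff."""
    nv_coef = momentum if nesterov_coef is None else nesterov_coef
    if normed_mt:
        # Scale the normalized gradient down to the buffer's scale first.
        ema_std = math.sqrt((1 - momentum) / (1 + momentum))
        return lerp(grad * ema_std, buf, nv_coef)
    return lerp(grad, buf, nv_coef)


if __name__ == "__main__":
    print(nesterov_update(1.0, 0.0, momentum=0.9))
    print(nesterov_update(1.0, 0.0, momentum=0.9, normed_mt=True))
```

With an empty buffer and `momentum=0.9`, the unscaled path contributes `0.1 * grad`, while the scaled path contributes `0.1 * ema_std * grad`.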

{adv_optm-2.4.dev22 → adv_optm-2.4.dev24}/adv_optm/util/Muon_AuxAdam.py RENAMED

@@ -56,6 +56,7 @@ def _init_auxadam_state(self, p, group):
             d1, d2 = state['effective_shape']
             state['mu_v_nmf'] = torch.zeros(d1, device=device, dtype=torch.float32)
             state['mv_v_nmf'] = torch.zeros(d2, device=device, dtype=torch.float32)
+            state['shifter'] = torch.tensor([1, 2, 4, 8, 16, 32, 64, 128], device=device, dtype=torch.uint8)
         else:
             init_state_tensor(state, 'exp_avg_sq', p.shape, actual_precision, p.device, dtype, non_neg=True)
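
The new `shifter` state holds the eight single-bit masks of a byte (1, 2, 4, ..., 128) as `uint8`. The diff does not show where it is consumed. A common use for such a tensor, and purely an assumption here, is packing eight boolean flags (for example sign bits) into one byte and unpacking them again. The sketch below shows that technique on plain Python lists; `pack_bits` and `unpack_bits` are hypothetical names, not functions from the package.

```python
from typing import List

# Same values as the 'shifter' tensor added in the diff.
SHIFTER = [1, 2, 4, 8, 16, 32, 64, 128]


def pack_bits(flags: List[bool]) -> int:
    """Pack exactly eight booleans into one byte, lowest bit first."""
    if len(flags) != 8:
        raise ValueError("expected exactly 8 flags")
    return sum(mask for mask, flag in zip(SHIFTER, flags) if flag)


def unpack_bits(byte: int) -> List[bool]:
    """Recover the eight booleans by testing each bit mask."""
    return [(byte & mask) != 0 for mask in SHIFTER]


if __name__ == "__main__":
    flags = [True, False, True, True, False, False, False, True]
    packed = pack_bits(flags)
    print(packed)
    print(unpack_bits(packed) == flags)
```

Storing the masks once in the optimizer state, rather than rebuilding them on each step, would avoid a small allocation per call; whether that is the motivation for this change is not stated in the diff.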
