adv-optm 1.2.dev19.tar.gz → 2.dev2.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release. This version of adv-optm might be problematic.
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/PKG-INFO +20 -20
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/README.md +19 -19
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm/__init__.py +1 -1
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm/optim/AdaMuon_adv.py +11 -9
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm/optim/AdamW_adv.py +91 -61
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm/optim/Adopt_adv.py +113 -68
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm/optim/Lion_Prodigy_adv.py +79 -81
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm/optim/Lion_adv.py +59 -43
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm/optim/Muon_adv.py +13 -12
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm/optim/Prodigy_adv.py +108 -86
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm/optim/Simplified_AdEMAMix.py +93 -52
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm/optim/__init__.py +1 -1
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm/util/BF16_Stochastic_Rounding.py +1 -1
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm/util/Effective_Shape.py +1 -1
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm/util/Kourkoutas.py +10 -12
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm/util/NNMF.py +7 -2
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm/util/One_Bit_Boolean.py +1 -1
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm/util/OrthoGrad.py +4 -3
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm/util/__init__.py +1 -1
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm.egg-info/PKG-INFO +20 -20
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/setup.py +1 -1
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/LICENSE +0 -0
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm/util/Newton_Schulz.py +0 -0
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm.egg-info/SOURCES.txt +0 -0
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm.egg-info/dependency_links.txt +0 -0
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm.egg-info/requires.txt +0 -0
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm.egg-info/top_level.txt +0 -0
- {adv_optm-1.2.dev19 → adv_optm-2.dev2}/setup.cfg +0 -0
{adv_optm-1.2.dev19 → adv_optm-2.dev2}/PKG-INFO

````diff
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: adv_optm
-Version: 1.2.dev19
+Version: 2.dev2
 Summary: A family of highly efficient, lightweight yet powerful optimizers.
 Home-page: https://github.com/Koratahiu/Advanced_Optimizers
 Author: Koratahiu
@@ -52,7 +52,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
 ### **Memory-Efficient Optimization (SMMF-inspired)**
 - **Paper**: [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
 - **Approach**: Uses rank-1 non-negative matrix factorization with reconstruction cycle (factor → reconstruct → update → factor)
-- **Innovation**:
+- **Innovation**:
   - First moment split into **1-bit sign + absolute value**
   - Final storage: **four factored vectors + one 1-bit sign state**
   - Preserves Adam-like update quality with drastically reduced memory
@@ -110,7 +110,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
 
 ## 🛠️ Comprehensive Feature Guide
 
-### A. Universal Safe Features
+### A. Universal Safe Features
 *These features work with all optimizers and are generally safe to enable.*
 
 | Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
@@ -165,7 +165,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
 | `beta1` | 0.99 | Controls accumulator memory length:<br>• Small BS: **0.99–0.9999**<br>• Large BS: **0.9** |
 | `Grad α` | 100 | Most critical parameter:<br>• Inversely scales with batch size<br>• **100–10** for small BS (≤32)<br>• **1–0.1** for large BS (≥512) |
 
-> ⚠️ **Critical**: Requires **~100x smaller learning rate** than AdamW (e.g., 1e-6 vs 1e-4).
+> ⚠️ **Critical**: Requires **~100x smaller learning rate** than AdamW (e.g., 1e-6 vs 1e-4).
 > For `Prodigy_Adv`, set `initial_d` to:
 > - **LoRA**: `1e-8`
 > - **Full FT**: `1e-10`
@@ -175,10 +175,10 @@ This library integrates multiple state-of-the-art optimization techniques valida
 
 #### Performance Validation
 
-**Small Batch Training (SDXL, BS=2, 1.8K steps)**
+**Small Batch Training (SDXL, BS=2, 1.8K steps)**
 
 
-- **🟢 Prodigy_Adv** (beta1=0.9, d0=1e-5): Final LR = 2.9e-4
+- **🟢 Prodigy_Adv** (beta1=0.9, d0=1e-5): Final LR = 2.9e-4
 - **🔵 Prodigy_Adv + Simplified_AdEMAMix** (beta1=0.99, α=100, d0=1e-7): Final LR = 5.8e-6
 
 **Results**:
@@ -202,8 +202,8 @@ This library integrates multiple state-of-the-art optimization techniques valida
 
 Instead of using a fixed β₂ (e.g., 0.999 or 0.95), it **dynamically modulates β₂ per layer** based on a bounded *sunspike ratio*:
 
-- **During gradient bursts** → β₂ ↓ toward `Lower β₂` → faster reaction
-- **During calm phases** → β₂ ↑ toward `The Selected β₂` → stronger smoothing
+- **During gradient bursts** → β₂ ↓ toward `Lower β₂` → faster reaction
+- **During calm phases** → β₂ ↑ toward `The Selected β₂` → stronger smoothing
 
 This is especially effective for **noisy training, small batch sizes, and high learning rates**, where gradient norms shift abruptly due to noise or aggressive LR schedules.
 
@@ -220,17 +220,17 @@ This is especially effective for **noisy training, small batch sizes, and high l
 
 #### 📊 Performance Validation
 
-**ADAMW_ADV - full SDXL finetuning (aggressive LR: 3e-5) (BS=4, 2.5K steps)**
+**ADAMW_ADV - full SDXL finetuning (aggressive LR: 3e-5) (BS=4, 2.5K steps)**
 <img width="1460" height="382" alt="image" src="https://github.com/user-attachments/assets/007f278a-fbac-4f3d-9cc7-274c3b959cdd" />
 
-- 🟣 Fixed `beta2=0.999`
-- 🟠 Auto K-beta
+- 🟣 Fixed `beta2=0.999`
+- 🟠 Auto K-beta
 
-**Observations:**
+**Observations:**
 - K-beta is clearly better and more robust/stable for high LRs.
 
-> 📚 **Reference**:
-> - Paper: [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
+> 📚 **Reference**:
+> - Paper: [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
 > - Code: [kbeta](https://github.com/sck-at-ucy/kbeta)
 
 ---
@@ -258,7 +258,7 @@ settings:
 - factored: False # Can be true or false, quality should not degrade due to Simplified_AdEMAMix’s high tolerance to 1-bit factorization.
 ```
 
-> ✅ **Why it works**:
+> ✅ **Why it works**:
 > - `Kourkoutas-β` handles beta2 values
 > - `Simplified_AdEMAMix` ensures responsiveness in small-batch noise
 > - `OrthoGrad` prevents overfitting without weight decay
@@ -267,9 +267,9 @@ settings:
 
 ## 📚 References
 
-1. [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192)
-2. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
-3. [The AdEMAMix Optimizer](https://arxiv.org/abs/2409.03137)
-4. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD](https://arxiv.org/abs/2502.02431)
-5. [AdaMeM: Memory Efficient Momentum for Adafactor](https://openreview.net/forum?id=fZqMVTz7K5)
+1. [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192)
+2. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
+3. [The AdEMAMix Optimizer](https://arxiv.org/abs/2409.03137)
+4. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD](https://arxiv.org/abs/2502.02431)
+5. [AdaMeM: Memory Efficient Momentum for Adafactor](https://openreview.net/forum?id=fZqMVTz7K5)
 6. [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
````
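The Kourkoutas-β behaviour described in the hunks above (β₂ dropping toward a lower bound during gradient bursts and relaxing back toward the selected β₂ in calm phases) can be pictured with a small standalone sketch. This is illustrative only and assumes a simple sunspike formula; the package's real logic lives in `KourkoutasHelper` (adv_optm/util/Kourkoutas.py) and the referenced paper, and `dynamic_beta2` below is a hypothetical helper, not part of the library.

```python
import torch

def dynamic_beta2(grad: torch.Tensor, ema_state: dict,
                  beta2_max: float = 0.999, beta2_min: float = 0.88,
                  ema_alpha: float = 0.95, tiny_spike: float = 1e-9) -> float:
    """Illustrative per-layer beta2: pulled toward beta2_min on gradient bursts,
    staying near beta2_max (the selected beta2) during calm phases."""
    g_norm = grad.float().norm().item()
    ema = ema_state.get("ema", g_norm)
    ema = ema_alpha * ema + (1.0 - ema_alpha) * g_norm   # smoothed norm history
    ema_state["ema"] = ema
    # Bounded "sunspike" ratio in [0, 1): ~0 when the current norm matches history,
    # approaching 1 when the gradient suddenly spikes far above it.
    spike = max(g_norm - ema, 0.0) / (ema + tiny_spike)
    sunspike = spike / (1.0 + spike)
    return beta2_max - (beta2_max - beta2_min) * sunspike
```

With a formula of this shape, a layer whose gradient norm stays steady keeps β₂ near `beta2_max`, while a layer that spikes is temporarily pushed toward `beta2_min` and reacts faster, which is the qualitative behaviour the README describes.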
{adv_optm-1.2.dev19 → adv_optm-2.dev2}/README.md

````diff
@@ -21,7 +21,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
 ### **Memory-Efficient Optimization (SMMF-inspired)**
 - **Paper**: [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
 - **Approach**: Uses rank-1 non-negative matrix factorization with reconstruction cycle (factor → reconstruct → update → factor)
-- **Innovation**:
+- **Innovation**:
   - First moment split into **1-bit sign + absolute value**
   - Final storage: **four factored vectors + one 1-bit sign state**
   - Preserves Adam-like update quality with drastically reduced memory
@@ -79,7 +79,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
 
 ## 🛠️ Comprehensive Feature Guide
 
-### A. Universal Safe Features
+### A. Universal Safe Features
 *These features work with all optimizers and are generally safe to enable.*
 
 | Feature | Description | Recommended Usage | Performance Impact | Theoretical Basis | Compatibility |
@@ -134,7 +134,7 @@ This library integrates multiple state-of-the-art optimization techniques valida
 | `beta1` | 0.99 | Controls accumulator memory length:<br>• Small BS: **0.99–0.9999**<br>• Large BS: **0.9** |
 | `Grad α` | 100 | Most critical parameter:<br>• Inversely scales with batch size<br>• **100–10** for small BS (≤32)<br>• **1–0.1** for large BS (≥512) |
 
-> ⚠️ **Critical**: Requires **~100x smaller learning rate** than AdamW (e.g., 1e-6 vs 1e-4).
+> ⚠️ **Critical**: Requires **~100x smaller learning rate** than AdamW (e.g., 1e-6 vs 1e-4).
 > For `Prodigy_Adv`, set `initial_d` to:
 > - **LoRA**: `1e-8`
 > - **Full FT**: `1e-10`
@@ -144,10 +144,10 @@ This library integrates multiple state-of-the-art optimization techniques valida
 
 #### Performance Validation
 
-**Small Batch Training (SDXL, BS=2, 1.8K steps)**
+**Small Batch Training (SDXL, BS=2, 1.8K steps)**
 
 
-- **🟢 Prodigy_Adv** (beta1=0.9, d0=1e-5): Final LR = 2.9e-4
+- **🟢 Prodigy_Adv** (beta1=0.9, d0=1e-5): Final LR = 2.9e-4
 - **🔵 Prodigy_Adv + Simplified_AdEMAMix** (beta1=0.99, α=100, d0=1e-7): Final LR = 5.8e-6
 
 **Results**:
@@ -171,8 +171,8 @@ This library integrates multiple state-of-the-art optimization techniques valida
 
 Instead of using a fixed β₂ (e.g., 0.999 or 0.95), it **dynamically modulates β₂ per layer** based on a bounded *sunspike ratio*:
 
-- **During gradient bursts** → β₂ ↓ toward `Lower β₂` → faster reaction
-- **During calm phases** → β₂ ↑ toward `The Selected β₂` → stronger smoothing
+- **During gradient bursts** → β₂ ↓ toward `Lower β₂` → faster reaction
+- **During calm phases** → β₂ ↑ toward `The Selected β₂` → stronger smoothing
 
 This is especially effective for **noisy training, small batch sizes, and high learning rates**, where gradient norms shift abruptly due to noise or aggressive LR schedules.
 
@@ -189,17 +189,17 @@ This is especially effective for **noisy training, small batch sizes, and high l
 
 #### 📊 Performance Validation
 
-**ADAMW_ADV - full SDXL finetuning (aggressive LR: 3e-5) (BS=4, 2.5K steps)**
+**ADAMW_ADV - full SDXL finetuning (aggressive LR: 3e-5) (BS=4, 2.5K steps)**
 <img width="1460" height="382" alt="image" src="https://github.com/user-attachments/assets/007f278a-fbac-4f3d-9cc7-274c3b959cdd" />
 
-- 🟣 Fixed `beta2=0.999`
-- 🟠 Auto K-beta
+- 🟣 Fixed `beta2=0.999`
+- 🟠 Auto K-beta
 
-**Observations:**
+**Observations:**
 - K-beta is clearly better and more robust/stable for high LRs.
 
-> 📚 **Reference**:
-> - Paper: [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
+> 📚 **Reference**:
+> - Paper: [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
 > - Code: [kbeta](https://github.com/sck-at-ucy/kbeta)
 
 ---
@@ -227,7 +227,7 @@ settings:
 - factored: False # Can be true or false, quality should not degrade due to Simplified_AdEMAMix’s high tolerance to 1-bit factorization.
 ```
 
-> ✅ **Why it works**:
+> ✅ **Why it works**:
 > - `Kourkoutas-β` handles beta2 values
 > - `Simplified_AdEMAMix` ensures responsiveness in small-batch noise
 > - `OrthoGrad` prevents overfitting without weight decay
@@ -236,9 +236,9 @@ settings:
 
 ## 📚 References
 
-1. [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192)
-2. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
-3. [The AdEMAMix Optimizer](https://arxiv.org/abs/2409.03137)
-4. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD](https://arxiv.org/abs/2502.02431)
-5. [AdaMeM: Memory Efficient Momentum for Adafactor](https://openreview.net/forum?id=fZqMVTz7K5)
+1. [Revisiting BFloat16 Training](https://arxiv.org/abs/2010.06192)
+2. [SMMF: Square-Matricized Momentum Factorization](https://arxiv.org/abs/2412.08894)
+3. [The AdEMAMix Optimizer](https://arxiv.org/abs/2409.03137)
+4. [Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD](https://arxiv.org/abs/2502.02431)
+5. [AdaMeM: Memory Efficient Momentum for Adafactor](https://openreview.net/forum?id=fZqMVTz7K5)
 6. [Kourkoutas-β: A Sunspike-Driven Adam Optimizer with Desert Flair](https://arxiv.org/abs/2508.12996)
````
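The factor → reconstruct → update → factor cycle named in the SMMF bullets above can be sketched for a single non-negative second-moment matrix. This is a minimal, generic illustration under assumed row/column-mean factors, not the package's NNMF.py code.

```python
import torch

def factor_rank1(V: torch.Tensor):
    """Compress a non-negative (d1, d2) matrix into two vectors plus a scale."""
    row = V.mean(dim=1)                   # (d1,) row factor
    col = V.mean(dim=0)                   # (d2,) column factor
    scale = V.mean().clamp_min(1e-30)     # keeps outer(row, col)/scale close to V
    return row, col, scale

def reconstruct(row, col, scale):
    return torch.outer(row, col) / scale  # (d1, d2) rank-1 approximation

# One step of maintaining the second moment while storing only the factors:
d1, d2, beta2 = 64, 128, 0.999
row, col, scale = factor_rank1(torch.full((d1, d2), 1e-12))
grad = torch.randn(d1, d2)

V = reconstruct(row, col, scale)                        # 1) reconstruct
V.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)     # 2) EMA update with grad²
row, col, scale = factor_rank1(V)                       # 3) re-factor, drop the full matrix
```

Only the two vectors and the scale survive between steps, which is the memory saving the README attributes to the SMMF-inspired design (plus the separate 1-bit sign state for the first moment).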
{adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm/optim/AdaMuon_adv.py

```diff
@@ -46,7 +46,6 @@ class AdaMuon_adv(torch.optim.Optimizer):
            (default: (3.4445, -4.7750, 2.0315)).
        stochastic_rounding (bool): whether to use stochastic rounding for
            BF16 parameter updates (default: True).
-       orthogonal_gradient (bool): whether to use OrthoGrad. (default: False)
        nesterov (bool): enables Nesterov momentum (default: False).
        use_atan2 (bool): whether to use the atan2 update rule. (default: False)
        Simplified_AdEMAMix (bool): whether to use the Simplified AdEMAMix update rule.
@@ -96,7 +95,6 @@ class AdaMuon_adv(torch.optim.Optimizer):
         ns_eps: float = 1e-7,
         ns_coeffs: tuple[float, float, float] = (3.4445, -4.7750, 2.0315),
         stochastic_rounding: bool = False,
-        orthogonal_gradient: bool = False,
         use_atan2: bool = False,
         nesterov: bool = False,
         Simplified_AdEMAMix: bool = False,
@@ -149,7 +147,7 @@ class AdaMuon_adv(torch.optim.Optimizer):
             "vector_reshape": vector_reshape,
             "nesterov":nesterov, "use_atan2":use_atan2,
             "Simplified_AdEMAMix": Simplified_AdEMAMix, "alpha_grad": alpha_grad,
-            "normuon_variant": normuon_variant,
+            "normuon_variant": normuon_variant,
             # Low-rank Ortho
             "low_rank_ortho": low_rank_ortho, "ortho_rank": ortho_rank,
             "compiled_optimizer":compiled_optimizer,
@@ -284,10 +282,6 @@ class AdaMuon_adv(torch.optim.Optimizer):
         nesterov = group['nesterov']
         Simplified_AdEMAMix = group['Simplified_AdEMAMix']
         alpha_grad = group['alpha_grad']
-        if grad.dtype != torch.float32 and state.get('factored', False):
-            grad = grad.float()
-        if group.get("orthogonal_gradient"):
-            grad = _orthogonalize_gradient(p, grad)
 
         if state['factored']: # Factored AdaMuon
 
@@ -351,7 +345,11 @@ class AdaMuon_adv(torch.optim.Optimizer):
             mean_squared_update = torch.mean(update.square(), dim=1)
             v_t.mul_(beta2).add_(mean_squared_update, alpha=1 - beta2)
             # Normalize update
-            update.div_(v_t.sqrt().unsqueeze(1).add_(group['eps']))
+            if group['use_atan2']:
+                a = 1.2732395
+                update.atan2_(v_t.sqrt().unsqueeze(1)).mul_(a)
+            else:
+                update.div_(v_t.sqrt().unsqueeze(1).add_(group['eps']))
             # Scale learning rate
             update_norm = torch.linalg.vector_norm(update)
             scaled_lr = group['rms_target'] * lr * (p.numel()**0.5) / update_norm.add_(group['eps'])
@@ -456,7 +454,11 @@ class AdaMuon_adv(torch.optim.Optimizer):
             mean_squared_update = torch.mean(update.square(), dim=1)
             v_t.mul_(beta2).add_(mean_squared_update, alpha=1 - beta2)
             # Normalize update
-            update.div_(v_t.sqrt().unsqueeze(1).add_(group['eps']))
+            if group['use_atan2']:
+                a = 1.2732395
+                update.atan2_(v_t.sqrt().unsqueeze(1)).mul_(a)
+            else:
+                update.div_(v_t.sqrt().unsqueeze(1).add_(group['eps']))
             # Scale learning rate
             update_norm = torch.linalg.vector_norm(update)
             scaled_lr = group['rms_target'] * lr * (p.numel()**0.5) / update_norm.add_(group['eps'])
```
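The `use_atan2` branch added in both AdaMuon hunks swaps the usual division by `sqrt(v) + eps` for an `atan2`-based normalization scaled by 1.2732395 (≈ 4/π), which is bounded and needs no `eps`. A minimal side-by-side sketch on a generic Adam-style update (the tensor names here are illustrative, not the optimizer's internals):

```python
import torch

m = torch.randn(4, 8)           # first-moment-like update
v = torch.rand(4, 8) + 1e-4     # per-row second-moment estimate
eps = 1e-8

# Classic rule: divide by sqrt(v) + eps (can blow up when v is tiny).
classic = m / (v.sqrt() + eps)

# atan2 rule from the diff above: bounded output, no eps needed,
# rescaled by the constant 1.2732395 (≈ 4/pi) used in the hunks.
atan2_update = torch.atan2(m, v.sqrt()) * 1.2732395

print(classic.abs().max(), atan2_update.abs().max())
```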
{adv_optm-1.2.dev19 → adv_optm-2.dev2}/adv_optm/optim/AdamW_adv.py

```diff
@@ -49,14 +49,12 @@ class AdamW_adv(torch.optim.Optimizer):
            before it is added to the fast momentum term (`update = mt + alpha * mt_slow`).
            A higher value increases the stabilizing influence of the slow
            momentum. (default: 5.0)
-        t_alpha (Optional[int]): The number of steps for a linear warmup of the
-            `alpha` parameter (only used when `use_AdEMAMix` is `True`). This is
-            highly recommended to prevent instability at the beginning of training,
-            as it gradually introduces the stabilizing slow momentum term. During
-            the warmup, `alpha` ramps from 0 to its target value. If `None`,
-            the scheduler is disabled. (default: None)
        kourkoutas_beta (bool): whether to enable the layer-wise dynamic β₂ logic.
            If `False`, the optimizer behaves as standard AdamW. (default: False)
+        layer_key_fn (Optional[Callable]): A function that takes a parameter `p`
+            and returns a unique, hashable key representing its "layer" or "bucket".
+            If `None`, parameters are bucketed by their shape.
+            (default: None)
        beta2_min (float): The minimum value for dynamic β₂, used during periods of
            high gradient variance ("sunspikes"). Must be less than `betas[1]`.
            (default: 0.88)
@@ -72,11 +70,7 @@ class AdamW_adv(torch.optim.Optimizer):
        k_logging (int): if > 0 and kourkoutas_beta=True, enables periodic console
            logging of Kourkoutas-β statistics (min, max, mean of `β₂` across layers)
            every logging steps. Useful for debugging and tuning. Set to 0 to disable
-           logging (default: 0).
-        layer_key_fn (Optional[Callable]): A function that takes a parameter `p`
-            and returns a unique, hashable key representing its "layer" or "bucket".
-            If `None`, parameters are bucketed by their memory ID (tensor-wise).
-            (default: None)
+           logging (default: 0).
        nnmf_factor (bool): whether to use the factorization or disable it to use
            the uncompressed optimizer. (default: False)
     """
@@ -89,7 +83,7 @@ class AdamW_adv(torch.optim.Optimizer):
         eps: float = 1e-8,
         weight_decay: float = 0.0,
         use_bias_correction: bool = True,
-        vector_reshape: bool =
+        vector_reshape: bool = False,
         stochastic_rounding: bool = True,
         use_atan2: bool = False,
         cautious_mask: bool = False,
@@ -98,15 +92,16 @@ class AdamW_adv(torch.optim.Optimizer):
         use_AdEMAMix: bool = False,
         beta3_ema: float = 0.9999,
         alpha: float = 5.0,
-        t_alpha: int | None = None,
         kourkoutas_beta: bool = False,
+        layer_key_fn: Optional[Callable] = None,
         beta2_min: float = 0.9,
         ema_alpha: float = 0.95,
         tiny_spike: float = 1e-9,
         k_warmup_steps: int = 0,
         k_logging: int = 0,
-        layer_key_fn: Optional[Callable] = None,
         nnmf_factor: bool = False,
+        # Compiled
+        compiled_optimizer: bool = False,
     ):
         if not (lr >= 0.0):
             raise ValueError(f"Learning-rate should be >= 0.0. Got {lr}")
@@ -116,7 +111,8 @@ class AdamW_adv(torch.optim.Optimizer):
             raise ValueError(f"Epsilon should be >= 0.0. Got {eps}")
         if not (weight_decay >= 0.0):
             raise ValueError(f"Weight-decay should be >= 0.0. Got {weight_decay}")
-        if kourkoutas_beta and not (betas[1] > beta2_min):
+        if kourkoutas_beta and not (betas[1] > beta2_min):
+            raise ValueError(f"For Kourkoutas-β, betas[1] (as beta2_max) must be > beta2_min. Got {betas[1]} and {beta2_min}")
 
         if cautious_mask and grams_moment:
             print("Warning: cautious is incompatible with grams, Disabling cautious.")
@@ -126,22 +122,30 @@ class AdamW_adv(torch.optim.Optimizer):
             "lr": lr, "betas": betas, "eps": eps, "weight_decay": weight_decay,
             "vector_reshape": vector_reshape, "use_atan2": use_atan2,
             "orthogonal_gradient": orthogonal_gradient, "use_bias_correction": use_bias_correction,
-            "beta3_ema": beta3_ema, "alpha": alpha,
+            "beta3_ema": beta3_ema, "alpha": alpha,
             "kourkoutas_beta": kourkoutas_beta, "beta2_min": beta2_min, "ema_alpha": ema_alpha,
             "tiny_spike": tiny_spike, "k_warmup_steps": k_warmup_steps, "k_logging": k_logging,
+            "compiled_optimizer": compiled_optimizer,
         }
         self.stochastic_rounding = stochastic_rounding
         self.cautious_mask = cautious_mask
         self.grams_moment = grams_moment
         self.use_AdEMAMix = use_AdEMAMix
         self.factored = nnmf_factor
-        self.kourkoutas_beta = kourkoutas_beta
         self.layer_key_fn = layer_key_fn
+        self.kourkoutas_beta = kourkoutas_beta
+
         super().__init__(params, defaults)
 
+        self.init_step()
+
         if self.kourkoutas_beta:
             self.kourkoutas_helper = KourkoutasHelper(self)
 
+        if compiled_optimizer:
+            torch._dynamo.config.cache_size_limit = 8192
+            self.compile(fullgraph=True)
+
     @property
     def supports_fused_back_pass(self):
         return True
```
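The `compiled_optimizer` path added in the hunk above wraps the per-parameter update in `torch.compile` and raises `torch._dynamo.config.cache_size_limit` so many differently shaped parameters can keep their compiled graphs cached. A minimal standalone sketch of the same pattern; `update_param` and its arguments are illustrative stand-ins, not the library's API:

```python
import torch

torch._dynamo.config.cache_size_limit = 8192   # allow many per-shape graph variants

def update_param(p: torch.Tensor, grad: torch.Tensor, lr: torch.Tensor) -> None:
    # Plain SGD body, standing in for the real per-parameter update.
    p.sub_(lr * grad)

compiled_update = torch.compile(update_param, fullgraph=True)

params = [torch.randn(16, 16), torch.randn(32)]
grads = [torch.randn_like(p) for p in params]

# Passing lr as a tensor (as the optimizer does with self.lr_tensor further below)
# lets the scheduler change its value without the float being baked into the graph.
lr = torch.tensor(1e-3)
for p, g in zip(params, grads):
    compiled_update(p, g, lr)
```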
```diff
@@ -154,29 +158,24 @@ class AdamW_adv(torch.optim.Optimizer):
     def supports_flat_params(self):
         return False
 
-
-
-
-
+    def init_step(self):
+        for group in self.param_groups:
+            for p in group['params']:
+                self.__init_state(p, group)
 
-
-
-        grad = grad.float()
-        if group["orthogonal_gradient"]:
-            grad = _orthogonalize_gradient(p, grad)
+    @torch.no_grad()
+    def __init_state(self, p, group):
         state = self.state[p]
 
-
-
+        if len(state) == 0:
+
             state['step'] = 0
 
-
+            state['factored'] = (
                 self.factored and
                 not (len(p.shape) == 1 and not group['vector_reshape'])
             )
 
-            state['factored'] = should_factor
-
             dtype = torch.float32 if self.factored else p.dtype
             device = p.device
 
@@ -186,18 +185,18 @@ class AdamW_adv(torch.optim.Optimizer):
 
             # First moment (m)
             if group['betas'][0] > 0:
-                state['mu_m_nmf'] = torch.zeros(d1, device=device, dtype=dtype)
+                state['mu_m_nmf'] = torch.zeros(d1, device=device, dtype=dtype)
                 state['mv_m_nmf'] = torch.zeros(d2, device=device, dtype=dtype)
                 if not self.grams_moment:
                     packed_d2 = (d2 + 7) // 8
                     state['sign'] = torch.zeros((d1, packed_d2), dtype=torch.uint8, device=device)
             if self.use_AdEMAMix:
-                state['mu_m_slow_nmf'] = torch.zeros(d1, device=p.device, dtype=dtype)
+                state['mu_m_slow_nmf'] = torch.zeros(d1, device=p.device, dtype=dtype)
                 state['mv_m_slow_nmf'] = torch.zeros(d2, device=p.device, dtype=dtype)
                 packed_d2 = (d2 + 7) // 8
                 state['sign_slow'] = torch.zeros((d1, packed_d2), dtype=torch.uint8, device=p.device)
             # Second moment (v)
-            state['mu_v_nmf'] = torch.zeros(d1, device=device, dtype=dtype)
+            state['mu_v_nmf'] = torch.zeros(d1, device=device, dtype=dtype)
             state['mv_v_nmf'] = torch.zeros(d2, device=device, dtype=dtype)
         else: # Fallback to standard AdamW for non-factored tensors
             if group['betas'][0] > 0:
@@ -206,37 +205,32 @@ class AdamW_adv(torch.optim.Optimizer):
                 state['exp_avg_slow'] = torch.zeros_like(p, device=device, dtype=dtype)
             state['exp_avg_sq'] = torch.zeros_like(p, device=device, dtype=dtype)
 
+    @torch.no_grad()
+    def __step_parameter(self, p: torch.Tensor, group: dict, lr: torch.Tensor | float, bias_correction1: torch.Tensor | float, bias_correction2: torch.Tensor | float):
+        if p.grad is None:
+            return
+
+        grad = p.grad
+        if grad.dtype != torch.float32 and self.factored:
+            grad = grad.float()
+        if group["orthogonal_gradient"]:
+            grad = _orthogonalize_gradient(p, grad)
+        state = self.state[p]
+
+
         beta1, beta2 = group['betas']
 
-        current_step = state['step']
         if group.get('kourkoutas_beta', False):
-            # Call prepare_step() once at the beginning of the step for all params
-            self.kourkoutas_helper.maybe_prepare_step(current_step)
             # Accumulate current grad's norm for the *next* step
             self.kourkoutas_helper.accumulate_gradient_sq_norm(p, grad)
             # Get the dynamic beta2 calculated in prepare_step()
-            beta2 = self.kourkoutas_helper.get_beta2(p, group
+            beta2 = self.kourkoutas_helper.get_beta2(p, group)
 
-
-        if group['use_bias_correction']:
-            bias_correction1 = 1.0 - beta1 ** step
-            if group.get('kourkoutas_beta', False):
-                bias_correction2 = 1.0 - group['betas'][1] ** step
-                # Use beta2_max for bias correction
-            else:
-                bias_correction2 = 1.0 - beta2 ** step
-        else:
-            bias_correction1 = 1
-            bias_correction2 = 1
-        step_size = group['lr'] / bias_correction1
+        step_size = lr / bias_correction1
 
         if self.use_AdEMAMix:
             beta3_ema = group['beta3_ema']
             alpha = group['alpha']
-            t_alpha = group['t_alpha']
-            alpha_t = alpha
-            if t_alpha is not None and t_alpha > 0 and step < t_alpha:
-                alpha_t = min(step * alpha / t_alpha, alpha)
 
         if state['factored']:
             d1, d2 = state['effective_shape']
@@ -272,9 +266,9 @@ class AdamW_adv(torch.optim.Optimizer):
 
                 mt_slow.mul_(beta3_ema).add_(grad_reshaped, alpha=1.0 - beta3_ema)
                 if beta1 > 0:
-                    update = torch.add(mt, mt_slow, alpha=
+                    update = torch.add(mt, mt_slow, alpha=alpha)
                 else:
-                    update = torch.add(grad_reshaped, mt_slow, alpha=
+                    update = torch.add(grad_reshaped, mt_slow, alpha=alpha)
             else:
                 update = mt.clone() if beta1 > 0 else grad_reshaped.clone()
             del grad_reshaped
@@ -321,9 +315,9 @@ class AdamW_adv(torch.optim.Optimizer):
                 exp_avg_slow = state['exp_avg_slow']
                 exp_avg_slow.mul_(beta3_ema).add_(grad, alpha=1 - beta3_ema)
                 if beta1 > 0:
-                    update = torch.add(exp_avg, exp_avg_slow, alpha=
+                    update = torch.add(exp_avg, exp_avg_slow, alpha=alpha)
                 else:
-                    update = torch.add(grad, exp_avg_slow, alpha=
+                    update = torch.add(grad, exp_avg_slow, alpha=alpha)
             else:
                 update = exp_avg.clone() if beta1 > 0 else grad.clone()
 
```
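The two hunks above change the AdEMAMix mixing line so the slow EMA is scaled by the fixed `alpha` from the defaults (the `t_alpha` warmup having been removed earlier in this file). A compact sketch of the two-EMA combination being applied here, with hypothetical tensor names:

```python
import torch

beta1, beta3_ema, alpha = 0.9, 0.9999, 5.0

grad = torch.randn(10)
exp_avg = torch.zeros(10)       # fast EMA, decayed by beta1
exp_avg_slow = torch.zeros(10)  # slow EMA, decayed by beta3_ema

exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
exp_avg_slow.mul_(beta3_ema).add_(grad, alpha=1 - beta3_ema)

# Mixing rule from the diff: update = fast + alpha * slow
# (falls back to grad + alpha * slow when beta1 == 0).
update = torch.add(exp_avg, exp_avg_slow, alpha=alpha)
```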
```diff
@@ -343,9 +337,9 @@ class AdamW_adv(torch.optim.Optimizer):
         # Decoupled weight decay
         if group["weight_decay"] != 0:
             if p.dtype == torch.bfloat16 and self.stochastic_rounding:
-                add_stochastic_(p.data, p.data, alpha=-group["weight_decay"] *
+                add_stochastic_(p.data, p.data, alpha=-group["weight_decay"] * lr)
             else:
-                p.data.add_(p.data, alpha=-group["weight_decay"] *
+                p.data.add_(p.data, alpha=-group["weight_decay"] * lr)
 
         if p.dtype == torch.bfloat16 and self.stochastic_rounding:
             add_stochastic_(p.data, -update)
```
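The weight-decay hunk keeps routing BF16 parameter updates through `add_stochastic_` (adv_optm/util/BF16_Stochastic_Rounding.py). As background, stochastic rounding to bfloat16 can be sketched with the usual bit trick below; this is a generic illustration, not the library's implementation:

```python
import torch

def stochastic_round_to_bf16(x: torch.Tensor) -> torch.Tensor:
    """Round float32 to bfloat16, rounding up or down at random in proportion to
    how close the value sits to each neighbour, so the rounding error is zero-mean."""
    assert x.dtype == torch.float32
    bits = x.view(torch.int32)
    # Add random noise to the 16 low bits that bfloat16 discards, then truncate them.
    noise = torch.randint(0, 1 << 16, bits.shape, device=x.device, dtype=torch.int32)
    rounded = (bits + noise) & -65536          # -65536 == 0xFFFF0000 as int32 mask
    return rounded.view(torch.float32).to(torch.bfloat16)

w = torch.full((100_000,), 1.0) + 1e-4         # 1.0001 is not representable in bf16
print(stochastic_round_to_bf16(w).float().mean())   # ~1.0001 on average
```

This zero-mean behaviour is what keeps tiny BF16 updates from being systematically lost, which is the motivation behind the "Revisiting BFloat16 Training" reference in the README.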
```diff
@@ -353,7 +347,40 @@ class AdamW_adv(torch.optim.Optimizer):
             p.data.add_(-update)
         del update
 
-
+    @torch.no_grad()
+    def step_parameter(self, p: torch.Tensor, group: dict, i: int | None = None):
+
+        state = self.state[p]
+
+        step = state['step']
+
+        if group['use_bias_correction']:
+            current_step = step + 1
+            beta1, beta2 = group['betas']
+            bias_correction1 = 1.0 - beta1 ** current_step
+            bias_correction2 = 1.0 - beta2 ** current_step
+        else:
+            bias_correction1 = 1.0
+            bias_correction2 = 1.0
+
+        if group.get('kourkoutas_beta', False):
+            # Prepare Kourkoutas-β once per step using the global step counter.
+            self.kourkoutas_helper.maybe_prepare_step(step)
+
+        self.state[p]['step'] += 1
+
+        if not group.get('compiled_optimizer', False):
+            self.__step_parameter(p, group, group['lr'], bias_correction1, bias_correction2)
+        else:
+            if not hasattr(self, 'lr_tensor') or self.lr_tensor is None:
+                # convert to tensors for compiled path once a step
+                self.lr_tensor = torch.tensor(group['lr'], device=p.device)
+                self.bc1_tensor = torch.tensor(bias_correction1, device=p.device)
+                self.bc2_tensor = torch.tensor(bias_correction2, device=p.device)
+            self._compiled_step_parameter(p, group, self.lr_tensor, self.bc1_tensor, self.bc2_tensor)
+
+    def compile(self, *args, **kwargs):
+        self._compiled_step_parameter = torch.compile(self.__step_parameter, *args, **kwargs)
 
     @torch.no_grad()
     def step(self, closure=None):
@@ -367,4 +394,7 @@ class AdamW_adv(torch.optim.Optimizer):
             for i, p in enumerate(group['params']):
                 self.step_parameter(p, group, i)
 
+        if self.param_groups[0].get('compiled_optimizer', False):
+            # Reset compile tensors once a step
+            self.lr_tensor = None
         return loss
```
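Since `supports_fused_back_pass` returns `True` and the new `step_parameter(p, group, i)` entry point updates one parameter at a time, a trainer can apply each update as soon as its gradient is accumulated instead of calling `step()` once after backward. A hypothetical usage sketch; the hook wiring below is an assumption (PyTorch ≥ 2.1 for `register_post_accumulate_grad_hook`), not code from the package:

```python
import torch

model = torch.nn.Linear(128, 128)
optimizer = AdamW_adv(model.parameters(), lr=1e-4)   # assumes the class is imported

for group in optimizer.param_groups:
    for i, p in enumerate(group["params"]):
        def _fused_step(param, group=group, i=i):
            # Runs right after param.grad has been accumulated for this backward pass.
            optimizer.step_parameter(param, group, i)
            param.grad = None            # free the gradient immediately
        p.register_post_accumulate_grad_hook(_fused_step)

loss = model(torch.randn(4, 128)).sum()
loss.backward()                          # parameters are updated during backward
```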