PyPI - heavyball - Versions diffs - 0.18.7__tar.gz → 0.19.0__tar.gz - Mend

heavyball 0.18.7.tar.gz → 0.19.0.tar.gz


Files changed (45)

{heavyball-0.18.7 → heavyball-0.19.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: heavyball
-Version: 0.18.7
+Version: 0.19.0
 Summary: Efficient optimizers
 Home-page: https://github.com/clashluke/heavyball
 Author: Lucas Nestler
@@ -32,7 +32,7 @@ A simple package of efficient optimizers
 The goal is not to thrive for completeness, full maintenance or abstraction, but instead to provide a simple
 largely static alternative to `torch.optim` with more and better optimizers.
-Currently (2024-11-21, 0.18.6), the recommended stable optimizer is `PrecondSchedulePaLMSOAP` (see below). The
+Currently (2024-11-22, 0.19), the recommended stable optimizer is `PrecondSchedulePaLMSOAP` (see below). The
 recommended experimental optimizer is `DelayedPSGDKron` ([tuning guide](docs/psgd_efficiency.md)).
 ## Features
@@ -45,8 +45,10 @@ recommended experimental optimizer is `DelayedPSGDKron` ([tuning guide](docs/psg
 * **ScheduleFree**: No learning rate schedule, but better convergence
 * [**Preconditioner Schedule**](https://github.com/lixilinx/psgd_torch/): Improved loss-per-step in early convergence,
   better step-per-second in late convergence (explained below)
-* **Memory-efficient storage** PSGD supports `store_triu_as_line` (default: `True`) to trade off memory usage for memory
-  bandwidth; turn it off for lower overheads (for more, see [PSGD Efficiency](docs/psgd_efficiency.md))
+* **Memory-efficient storage** PSGD supports `store_triu_as_line` (default: `True`) and `q_dtype` to trade off memory
+  usage for memory
+  bandwidth; Other optimizers have `storage_dtype`, supporting lower-precision EMAs at no(?) performance drop via
+  stochastic rounding
 ## Getting started
@@ -76,19 +78,19 @@ for _ in range(1000):
 ## Optimizers
-| Name                    | Description                                                                                                                                                       | Advantages / Disadvantages                                                                                                                                                                                                                                                                                                            |
-|-------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| **AdamW**               | More efficient (speed, memory) [AdamW](https://arxiv.org/abs/1711.05101)                                                                                          | + Faster than AdamW<br>+ Possibly more (numerically) stable
-| **LaProp**              | More efficient (speed, memory) [LaProp](https://arxiv.org/abs/2002.04839)                                                                                         | + Same cost as AdamW<br>+ Marginally better converence (better proofs)<br>+ Higher hyperparameter stability<br>- Not a guaranteed win (can be neutral)<br>- No "Slingshot"                                                                                                                                                            |
-| **ADOPT**               | More efficient (speed, memory) [ADOPT](https://arxiv.org/abs/2411.02853)                                                                                          | + Same cost as AdamW<br>+ Rigorous mathematical convergence proofs, even for challenging models (GANs)<br>- Empirically underperforms LaProp<br>- no bf16                                                                                                                                                                             |
-| **SFAdamW**             | More efficient (speed, memory) [ScheduleFree AdamW](https://arxiv.org/abs/2405.15682)                                                                             | + Same cost as AdamW, but better eval perf<br>+ Full control over hyperparameters                                                                                                                                                                                                                                                     |
-| **PaLMSFAdamW**         | ForeachSFAdamW with [PaLM's beta2 schedule](https://arxiv.org/abs/2204.02311)                                                                                     | + Same cost as AdamW, but better eval perf<br>+ Less control, but faster early and more stable late convergence<br>+ ScheduleFree<br>- slow early convergence                                                                                                                                                                         |
-| **SOAP**                | More efficient (speed, memory) [SOAP](https://arxiv.org/abs/2409.11321)                                                                                           | + Faster convergence (loss-at-step)<br>+ Full control over hyperparameters<br>- more memory usage<br>- more hyperparameters<br>- higher overhead than AdamW (can be ammortized; better loss-at-second)                                                                                                                                |
-| **PaLMSOAP**            | ForeachSOAP with [PaLM's beta2 schedule](https://arxiv.org/abs/2204.02311)                                                                                        | + Faster convergence (loss-at-step)<br>+ Less control, but faster early and more stable late convergence<br>- more memory usage<br>- more hyperparameters<br>- higher overhead than AdamW (can be ammortized; better loss-at-second)                                                                                                  |
-| **SFPaLMSOAP**          | ScheduleFree PaLMForeachSOAP                                                                                                                                      | + Fast convergence (loss-at-step)<br>+ less memory usage than PaLMForeachSOAP (more tham AdamW)<br>- slower initial convergence than PaLMForeachSOAP (but allows higher LRs)<br>- higher overhead than AdamW (can be ammortized)                                                                                                      |
+| Name                          | Description                                                                                                                                                       | Advantages / Disadvantages                                                                                                                                                                                                                                                                                                            |
+|-------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| **AdamW**                     | More efficient (speed, memory) [AdamW](https://arxiv.org/abs/1711.05101)                                                                                          | + Faster than AdamW<br>+ Possibly more (numerically) stable
+| **LaProp**                    | More efficient (speed, memory) [LaProp](https://arxiv.org/abs/2002.04839)                                                                                         | + Same cost as AdamW<br>+ Marginally better converence (better proofs)<br>+ Higher hyperparameter stability<br>- Not a guaranteed win (can be neutral)<br>- No "Slingshot"                                                                                                                                                            |
+| **ADOPT**                     | More efficient (speed, memory) [ADOPT](https://arxiv.org/abs/2411.02853)                                                                                          | + Same cost as AdamW<br>+ Rigorous mathematical convergence proofs, even for challenging models (GANs)<br>- Empirically underperforms LaProp<br>- no bf16                                                                                                                                                                             |
+| **SFAdamW**                   | More efficient (speed, memory) [ScheduleFree AdamW](https://arxiv.org/abs/2405.15682)                                                                             | + Same cost as AdamW, but better eval perf<br>+ Full control over hyperparameters                                                                                                                                                                                                                                                     |
+| **PaLMSFAdamW**               | ForeachSFAdamW with [PaLM's beta2 schedule](https://arxiv.org/abs/2204.02311)                                                                                     | + Same cost as AdamW, but better eval perf<br>+ Less control, but faster early and more stable late convergence<br>+ ScheduleFree<br>- slow early convergence                                                                                                                                                                         |
+| **SOAP**                      | More efficient (speed, memory) [SOAP](https://arxiv.org/abs/2409.11321)                                                                                           | + Faster convergence (loss-at-step)<br>+ Full control over hyperparameters<br>- more memory usage<br>- more hyperparameters<br>- higher overhead than AdamW (can be ammortized; better loss-at-second)                                                                                                                                |
+| **PaLMSOAP**                  | ForeachSOAP with [PaLM's beta2 schedule](https://arxiv.org/abs/2204.02311)                                                                                        | + Faster convergence (loss-at-step)<br>+ Less control, but faster early and more stable late convergence<br>- more memory usage<br>- more hyperparameters<br>- higher overhead than AdamW (can be ammortized; better loss-at-second)                                                                                                  |
+| **SFPaLMSOAP**                | ScheduleFree PaLMForeachSOAP                                                                                                                                      | + Fast convergence (loss-at-step)<br>+ less memory usage than PaLMForeachSOAP (more tham AdamW)<br>- slower initial convergence than PaLMForeachSOAP (but allows higher LRs)<br>- higher overhead than AdamW (can be ammortized)                                                                                                      |
 | **PrecondScheduleSFPaLMSOAP** | SFPaLMForeachSOAP with [preconditioner schedule](https://github.com/lixilinx/psgd_torch/), matching the error of PrecondEvery=2 with the cost of PrecondEvery=512 | + Better initial convergence than SFPaLMForeachSOAP<br>+ Significantly faster (sec/it) later<br>+ less memory usage than PaLMForeachSOAP (more tham AdamW)<br>- slower initial convergence than PaLMForeachSOAP (but allows higher LRs)<br>- higher overhead than AdamW (can be ammortized), goes to 0 with increasing number of step |
-| **PrecondSchedulePaLMSOAP** | PrecondScheduleSFPaLMForeachSOAP without schedule-free                                                                                                            | + Best initial convergence<br>+ Significantly faster (sec/it) later<br>+ high stability<br>- more memory usage than PrecondScheduleSFPaLMForeachSOAP<br>- higher overhead than AdamW (can be ammortized), goes to 0 with increasing number of steps                                                                                   |
-| **PrecondScheduleSOAP** | PrecondScheduleSFPaLMForeachSOAP without PaLM's beta2 schedule                                                                                                    | + Better initial convergence<br>+ Significantly faster (sec/it) later<br>- more memory usage than PrecondScheduleSFPaLMForeachSOAP<br>- higher overhead than AdamW (can be ammortized), goes to 0 with increasing number of steps                                                                                                     |
+| **PrecondSchedulePaLMSOAP**   | PrecondScheduleSFPaLMForeachSOAP without schedule-free                                                                                                            | + Best initial convergence<br>+ Significantly faster (sec/it) later<br>+ high stability<br>- more memory usage than PrecondScheduleSFPaLMForeachSOAP<br>- higher overhead than AdamW (can be ammortized), goes to 0 with increasing number of steps                                                                                   |
+| **PrecondScheduleSOAP**       | PrecondScheduleSFPaLMForeachSOAP without PaLM's beta2 schedule                                                                                                    | + Better initial convergence<br>+ Significantly faster (sec/it) later<br>- more memory usage than PrecondScheduleSFPaLMForeachSOAP<br>- higher overhead than AdamW (can be ammortized), goes to 0 with increasing number of steps                                                                                                     |
 ## Precond Schedule

{heavyball-0.18.7 → heavyball-0.19.0}/README.md RENAMED

@@ -8,7 +8,7 @@ A simple package of efficient optimizers
 The goal is not to thrive for completeness, full maintenance or abstraction, but instead to provide a simple
 largely static alternative to `torch.optim` with more and better optimizers.
-Currently (2024-11-21, 0.18.6), the recommended stable optimizer is `PrecondSchedulePaLMSOAP` (see below). The
+Currently (2024-11-22, 0.19), the recommended stable optimizer is `PrecondSchedulePaLMSOAP` (see below). The
 recommended experimental optimizer is `DelayedPSGDKron` ([tuning guide](docs/psgd_efficiency.md)).
 ## Features
@@ -21,8 +21,10 @@ recommended experimental optimizer is `DelayedPSGDKron` ([tuning guide](docs/psg
 * **ScheduleFree**: No learning rate schedule, but better convergence
 * [**Preconditioner Schedule**](https://github.com/lixilinx/psgd_torch/): Improved loss-per-step in early convergence,
   better step-per-second in late convergence (explained below)
-* **Memory-efficient storage** PSGD supports `store_triu_as_line` (default: `True`) to trade off memory usage for memory
-  bandwidth; turn it off for lower overheads (for more, see [PSGD Efficiency](docs/psgd_efficiency.md))
+* **Memory-efficient storage** PSGD supports `store_triu_as_line` (default: `True`) and `q_dtype` to trade off memory
+  usage for memory
+  bandwidth; Other optimizers have `storage_dtype`, supporting lower-precision EMAs at no(?) performance drop via
+  stochastic rounding
 ## Getting started
@@ -52,19 +54,19 @@ for _ in range(1000):
 ## Optimizers
-| Name                    | Description                                                                                                                                                       | Advantages / Disadvantages                                                                                                                                                                                                                                                                                                            |
-|-------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| **AdamW**               | More efficient (speed, memory) [AdamW](https://arxiv.org/abs/1711.05101)                                                                                          | + Faster than AdamW<br>+ Possibly more (numerically) stable
-| **LaProp**              | More efficient (speed, memory) [LaProp](https://arxiv.org/abs/2002.04839)                                                                                         | + Same cost as AdamW<br>+ Marginally better converence (better proofs)<br>+ Higher hyperparameter stability<br>- Not a guaranteed win (can be neutral)<br>- No "Slingshot"                                                                                                                                                            |
-| **ADOPT**               | More efficient (speed, memory) [ADOPT](https://arxiv.org/abs/2411.02853)                                                                                          | + Same cost as AdamW<br>+ Rigorous mathematical convergence proofs, even for challenging models (GANs)<br>- Empirically underperforms LaProp<br>- no bf16                                                                                                                                                                             |
-| **SFAdamW**             | More efficient (speed, memory) [ScheduleFree AdamW](https://arxiv.org/abs/2405.15682)                                                                             | + Same cost as AdamW, but better eval perf<br>+ Full control over hyperparameters                                                                                                                                                                                                                                                     |
-| **PaLMSFAdamW**         | ForeachSFAdamW with [PaLM's beta2 schedule](https://arxiv.org/abs/2204.02311)                                                                                     | + Same cost as AdamW, but better eval perf<br>+ Less control, but faster early and more stable late convergence<br>+ ScheduleFree<br>- slow early convergence                                                                                                                                                                         |
-| **SOAP**                | More efficient (speed, memory) [SOAP](https://arxiv.org/abs/2409.11321)                                                                                           | + Faster convergence (loss-at-step)<br>+ Full control over hyperparameters<br>- more memory usage<br>- more hyperparameters<br>- higher overhead than AdamW (can be ammortized; better loss-at-second)                                                                                                                                |
-| **PaLMSOAP**            | ForeachSOAP with [PaLM's beta2 schedule](https://arxiv.org/abs/2204.02311)                                                                                        | + Faster convergence (loss-at-step)<br>+ Less control, but faster early and more stable late convergence<br>- more memory usage<br>- more hyperparameters<br>- higher overhead than AdamW (can be ammortized; better loss-at-second)                                                                                                  |
-| **SFPaLMSOAP**          | ScheduleFree PaLMForeachSOAP                                                                                                                                      | + Fast convergence (loss-at-step)<br>+ less memory usage than PaLMForeachSOAP (more tham AdamW)<br>- slower initial convergence than PaLMForeachSOAP (but allows higher LRs)<br>- higher overhead than AdamW (can be ammortized)                                                                                                      |
+| Name                          | Description                                                                                                                                                       | Advantages / Disadvantages                                                                                                                                                                                                                                                                                                            |
+|-------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| **AdamW**                     | More efficient (speed, memory) [AdamW](https://arxiv.org/abs/1711.05101)                                                                                          | + Faster than AdamW<br>+ Possibly more (numerically) stable
+| **LaProp**                    | More efficient (speed, memory) [LaProp](https://arxiv.org/abs/2002.04839)                                                                                         | + Same cost as AdamW<br>+ Marginally better converence (better proofs)<br>+ Higher hyperparameter stability<br>- Not a guaranteed win (can be neutral)<br>- No "Slingshot"                                                                                                                                                            |
+| **ADOPT**                     | More efficient (speed, memory) [ADOPT](https://arxiv.org/abs/2411.02853)                                                                                          | + Same cost as AdamW<br>+ Rigorous mathematical convergence proofs, even for challenging models (GANs)<br>- Empirically underperforms LaProp<br>- no bf16                                                                                                                                                                             |
+| **SFAdamW**                   | More efficient (speed, memory) [ScheduleFree AdamW](https://arxiv.org/abs/2405.15682)                                                                             | + Same cost as AdamW, but better eval perf<br>+ Full control over hyperparameters                                                                                                                                                                                                                                                     |
+| **PaLMSFAdamW**               | ForeachSFAdamW with [PaLM's beta2 schedule](https://arxiv.org/abs/2204.02311)                                                                                     | + Same cost as AdamW, but better eval perf<br>+ Less control, but faster early and more stable late convergence<br>+ ScheduleFree<br>- slow early convergence                                                                                                                                                                         |
+| **SOAP**                      | More efficient (speed, memory) [SOAP](https://arxiv.org/abs/2409.11321)                                                                                           | + Faster convergence (loss-at-step)<br>+ Full control over hyperparameters<br>- more memory usage<br>- more hyperparameters<br>- higher overhead than AdamW (can be ammortized; better loss-at-second)                                                                                                                                |
+| **PaLMSOAP**                  | ForeachSOAP with [PaLM's beta2 schedule](https://arxiv.org/abs/2204.02311)                                                                                        | + Faster convergence (loss-at-step)<br>+ Less control, but faster early and more stable late convergence<br>- more memory usage<br>- more hyperparameters<br>- higher overhead than AdamW (can be ammortized; better loss-at-second)                                                                                                  |
+| **SFPaLMSOAP**                | ScheduleFree PaLMForeachSOAP                                                                                                                                      | + Fast convergence (loss-at-step)<br>+ less memory usage than PaLMForeachSOAP (more tham AdamW)<br>- slower initial convergence than PaLMForeachSOAP (but allows higher LRs)<br>- higher overhead than AdamW (can be ammortized)                                                                                                      |
 | **PrecondScheduleSFPaLMSOAP** | SFPaLMForeachSOAP with [preconditioner schedule](https://github.com/lixilinx/psgd_torch/), matching the error of PrecondEvery=2 with the cost of PrecondEvery=512 | + Better initial convergence than SFPaLMForeachSOAP<br>+ Significantly faster (sec/it) later<br>+ less memory usage than PaLMForeachSOAP (more tham AdamW)<br>- slower initial convergence than PaLMForeachSOAP (but allows higher LRs)<br>- higher overhead than AdamW (can be ammortized), goes to 0 with increasing number of step |
-| **PrecondSchedulePaLMSOAP** | PrecondScheduleSFPaLMForeachSOAP without schedule-free                                                                                                            | + Best initial convergence<br>+ Significantly faster (sec/it) later<br>+ high stability<br>- more memory usage than PrecondScheduleSFPaLMForeachSOAP<br>- higher overhead than AdamW (can be ammortized), goes to 0 with increasing number of steps                                                                                   |
-| **PrecondScheduleSOAP** | PrecondScheduleSFPaLMForeachSOAP without PaLM's beta2 schedule                                                                                                    | + Better initial convergence<br>+ Significantly faster (sec/it) later<br>- more memory usage than PrecondScheduleSFPaLMForeachSOAP<br>- higher overhead than AdamW (can be ammortized), goes to 0 with increasing number of steps                                                                                                     |
+| **PrecondSchedulePaLMSOAP**   | PrecondScheduleSFPaLMForeachSOAP without schedule-free                                                                                                            | + Best initial convergence<br>+ Significantly faster (sec/it) later<br>+ high stability<br>- more memory usage than PrecondScheduleSFPaLMForeachSOAP<br>- higher overhead than AdamW (can be ammortized), goes to 0 with increasing number of steps                                                                                   |
+| **PrecondScheduleSOAP**       | PrecondScheduleSFPaLMForeachSOAP without PaLM's beta2 schedule                                                                                                    | + Better initial convergence<br>+ Significantly faster (sec/it) later<br>- more memory usage than PrecondScheduleSFPaLMForeachSOAP<br>- higher overhead than AdamW (can be ammortized), goes to 0 with increasing number of steps                                                                                                     |
 ## Precond Schedule

heavyball-0.19.0/heavyball/foreach_adamw.py ADDED

@@ -0,0 +1,56 @@
+import torch
+import torch.optim
+from heavyball.utils import copy_stochastic_list_
+from .utils import warmup, exp_avg_sq_, beta_debias, update_param_, StatefulOptimizer, promote
+@torch.compile(mode='max-autotune-no-cudagraphs', fullgraph=True, dynamic=True)
+def _compilable_step_(y, grad, exp_avg_sq, exp_avg, beta1, beta2, step, lr, eps, decay):
+    g32, exp_avg32, exp_avg_sq32 = [list(map(promote, x)) for x in [grad, exp_avg, exp_avg_sq]]
+    torch._foreach_lerp_(exp_avg32, g32, 1 - beta_debias(beta1, step + 1))
+    denom = list(exp_avg_sq_(exp_avg_sq32, g32, beta_debias(beta2, step + 1), eps))
+    update_param_(y, exp_avg32, lr, decay, lambda p, e, l: p.addcdiv_(e, denom.pop(0), value=l))
+    copy_stochastic_list_(exp_avg, exp_avg32)
+    copy_stochastic_list_(exp_avg_sq, exp_avg_sq32)
+class ForeachAdamW(StatefulOptimizer):
+    def __init__(self, params, lr=0.0025, betas=(0.9, 0.99), eps=1e-8, weight_decay=0, warmup_steps=0,
+                 foreach: bool = True, storage_dtype: str = 'float32'):
+        defaults = dict(lr=lr, betas=betas, eps=eps, k=0, warmup_steps=warmup_steps, train_mode=True, weight_sum=0.0,
+                        lr_max=-1.0, weight_decay=weight_decay, storage_dtype=storage_dtype)
+        super().__init__(params, defaults, foreach)
+    def _step(self, group):
+        eps = group['eps']
+        decay = group['weight_decay']
+        k = group['k']
+        if not group['train_mode']:
+            raise Exception("Not in train mode!")
+        active_p = [p for p in group['params'] if p.grad is not None]
+        if not active_p:
+            return
+        storage_dtype = getattr(torch, group['storage_dtype'])
+        for p in active_p:
+            if 'exp_avg' not in self.state_(p):
+                self.state_(p)['exp_avg'] = torch.zeros_like(p.data, dtype=storage_dtype)
+                self.state_(p)['exp_avg_sq'] = torch.zeros_like(p.data, dtype=storage_dtype)
+        y, grad, exp_avg_sq, exp_avg = zip(
+            *[(p.data, p.grad, self.state_(p)['exp_avg_sq'], self.state_(p)['exp_avg']) for p in active_p])
+        lr = -warmup(group['lr'], k + 1, group['warmup_steps'])
+        lr = torch.empty((), dtype=torch.float32, device=y[0].device).fill_(lr)
+        step = torch.empty((), dtype=torch.int32, device=y[0].device).fill_(k)
+        _compilable_step_(y, grad, exp_avg_sq, exp_avg, group['betas'][0], group['betas'][1], step, lr, eps, decay)
+        group['k'] = k + 1
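
Note on `_compilable_step_` above: it passes `beta_debias(beta1, step + 1)` into the EMA update instead of applying Adam's usual bias correction afterwards. `beta_debias` is not part of this diff, so the sketch below assumes it is defined as `1 - (1 - beta) / (1 - beta ** step)`. Under that assumption it checks, on plain scalars, that an EMA run with the per-step beta matches the textbook bias-corrected first moment. The function names other than `beta_debias` are hypothetical.

```python
def beta_debias(beta: float, step: int) -> float:
    """ASSUMED definition; the real helper is not shown in this diff."""
    return 1 - (1 - beta) / (1 - beta ** step)


def debiased_ema(grads, beta):
    """EMA with a step-dependent beta, mirroring the lerp call in the diff."""
    avg, out = 0.0, []
    for step, g in enumerate(grads, start=1):
        weight = 1 - beta_debias(beta, step)  # lerp weight toward g
        avg = avg + weight * (g - avg)        # scalar form of lerp(avg, g, weight)
        out.append(avg)
    return out


def adam_bias_corrected(grads, beta):
    """Textbook Adam: plain EMA, then divide by (1 - beta ** t)."""
    m, out = 0.0, []
    for step, g in enumerate(grads, start=1):
        m = beta * m + (1 - beta) * g
        out.append(m / (1 - beta ** step))
    return out


grads = [0.5, -1.0, 2.0, 0.25, -0.75]
a = debiased_ema(grads, beta=0.9)
b = adam_bias_corrected(grads, beta=0.9)
print(max(abs(x - y) for x, y in zip(a, b)))  # agreement up to rounding error
```

If the assumption holds, the stored `exp_avg` is already the bias-corrected estimate, which would explain why the update needs no separate correction factor.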

heavyball-0.19.0/heavyball/foreach_adopt.py ADDED

@@ -0,0 +1,78 @@
+import torch
+import torch.optim
+from heavyball.utils import copy_stochastic_list_
+from .utils import warmup, beta_debias, update_param_, StatefulOptimizer, promote
+@torch.compile(mode='max-autotune-no-cudagraphs', fullgraph=True, dynamic=True)
+def _compilable_step_(y, grad, exp_avg_sq, exp_avg, beta1, beta2, step, lr, eps, decay):
+    g32, exp_avg32, exp_avg_sq32 = [list(map(promote, x)) for x in [grad, exp_avg, exp_avg_sq]]
+    update_param_(y, exp_avg, lr, decay)
+    beta1 = beta_debias(beta1, step)
+    denom = torch._foreach_sqrt(exp_avg_sq32)
+    torch._foreach_maximum_(denom, eps)
+    torch._foreach_mul_(exp_avg32, beta1)
+    [ea32.addcdiv_(g, d, value=1 - beta1) for ea32, g, d in zip(exp_avg32, g32, denom)]
+    beta2 = beta_debias(beta2, step + 1)
+    torch._foreach_mul_(exp_avg_sq32, beta2)
+    [eas32.addcmul_(g, g, value=1 - beta2) for eas32, g in zip(exp_avg_sq32, g32)]
+    copy_stochastic_list_(exp_avg, exp_avg32)
+    copy_stochastic_list_(exp_avg_sq, exp_avg_sq32)
+class ForeachADOPT(StatefulOptimizer):
+    def __init__(self, params, lr=0.0025, betas=(0.9, 0.99), eps=1e-8, weight_decay=0, warmup_steps=0,
+                 foreach: bool = True, storage_dtype: str = 'float32'):
+        defaults = dict(lr=lr, betas=betas, eps=eps, k=0, warmup_steps=warmup_steps, train_mode=True, weight_sum=0.0,
+                        lr_max=-1.0, weight_decay=weight_decay, storage_dtype=storage_dtype)
+        super().__init__(params, defaults, foreach)
+    def _step(self, group):
+        eps = group['eps']
+        decay = group['weight_decay']
+        k = group['k']
+        if not group['train_mode']:
+            raise Exception("Not in train mode!")
+        active_p = [p for p in group['params'] if p.grad is not None]
+        if not active_p:
+            return
+        storage_dtype = getattr(torch, group['storage_dtype'])
+        for p in active_p:
+            if 'exp_avg' not in self.state_(p):
+                self.state_(p)['exp_avg'] = torch.zeros_like(p.data, dtype=storage_dtype)
+                self.state_(p)['exp_avg_sq'] = torch.zeros_like(p.data, dtype=storage_dtype)
+        y, grad, exp_avg_sq, exp_avg = zip(
+            *[(p.data, p.grad, self.state_(p)['exp_avg_sq'], self.state_(p)['exp_avg']) for p in active_p])
+        group['k'] = k + 1
+        if k > 1:
+            lr = -warmup(group['lr'], k - 1, group['warmup_steps'])
+            lr = torch.empty((), dtype=torch.float32, device=y[0].device).fill_(lr)
+            k = torch.empty((), dtype=torch.int32, device=y[0].device).fill_(k)
+            _compilable_step_(y, grad, exp_avg_sq, exp_avg, group['betas'][0], group['betas'][1], k, lr, eps, decay)
+            return
+        grad = [promote(g) for g in grad]
+        if k > 0:
+            beta1 = beta_debias(group['betas'][0], k)
+            denom = torch._foreach_sqrt(exp_avg_sq)
+            torch._foreach_maximum_(denom, eps)
+            torch._foreach_mul_(exp_avg, beta1)
+            torch._foreach_addcdiv_(exp_avg, grad, denom, 1 - beta1)
+        beta2 = beta_debias(group['betas'][1], k + 1)
+        torch._foreach_mul_(exp_avg_sq, beta2)
+        torch._foreach_addcmul_(exp_avg_sq, grad, grad, value=1 - beta2)
+        del grad
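
The ordering used in the ADOPT step above can be sketched with scalars. This is a minimal illustration, not the package's code: it uses plain Python floats, fixed `beta1`/`beta2` instead of the values returned by `beta_debias`, and a hypothetical helper name `adopt_moments`. Seeding `exp_avg_sq` with the squared first gradient is an assumption of this sketch. What it keeps from the diff is the order of operations: the denominator is taken from `exp_avg_sq` before the current gradient is folded into it.

```python
import math


def adopt_moments(grads, beta1=0.9, beta2=0.99, eps=1e-8):
    """Scalar sketch of ADOPT-style moment updates (hypothetical helper).

    Simplifications: fixed betas, no debiasing, no parameter update.
    """
    exp_avg = 0.0
    exp_avg_sq = None
    history = []
    for g in grads:
        if exp_avg_sq is None:
            # Assumption: the first gradient only seeds the second moment.
            exp_avg_sq = g * g
        else:
            # The denominator comes from the previous second moment ...
            denom = max(math.sqrt(exp_avg_sq), eps)
            exp_avg = beta1 * exp_avg + (1 - beta1) * g / denom
            # ... and only afterwards is the current gradient folded in.
            exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * g * g
        history.append((exp_avg, exp_avg_sq))
    return history
```

With gradients `[2.0, 4.0]`, the second gradient is divided by `sqrt(4.0)`, the estimate built from the first gradient alone, rather than by an estimate that already contains `4.0 ** 2`.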

heavyball-0.19.0/heavyball/foreach_laprop.py ADDED

@@ -0,0 +1,61 @@
+import torch
+import torch.optim
+from .utils import warmup, exp_avg_sq_, beta_debias, update_param_, StatefulOptimizer, promote, copy_stochastic_list_
+@torch.compile(mode='max-autotune-no-cudagraphs', fullgraph=True, dynamic=True)
+def _compilable_step_(y, grad, exp_avg_sq, exp_avg, beta1, beta2, step, lr, eps, decay):
+    g32, exp_avg32, exp_avg_sq32 = [list(map(promote, x)) for x in [grad, exp_avg, exp_avg_sq]]
+    denom = exp_avg_sq_(exp_avg_sq32, g32, beta_debias(beta2, step), eps)
+    beta1 = beta_debias(beta1, step)
+    torch._foreach_mul_(exp_avg32, beta1)
+    [ea32.addcdiv_(g, d, value=1 - beta1) for ea32, g, d in zip(exp_avg32, g32, denom)]
+    update_param_(y, exp_avg32, lr, decay)
+    copy_stochastic_list_(exp_avg, exp_avg32)
+    copy_stochastic_list_(exp_avg_sq, exp_avg_sq32)
+class ForeachLaProp(StatefulOptimizer):
+    def __init__(self, params, lr=0.0025, betas=(0.9, 0.99), eps=1e-8, weight_decay=0, warmup_steps=1,
+                 foreach: bool = True, storage_dtype: str = 'float32'):
+        defaults = dict(lr=lr, betas=betas, eps=eps, k=0, warmup_steps=warmup_steps, train_mode=True, weight_sum=0.0,
+                        lr_max=-1.0, weight_decay=weight_decay, storage_dtype=storage_dtype)
+        super().__init__(params, defaults, foreach)
+    def _step(self, group):
+        eps = group['eps']
+        decay = group['weight_decay']
+        k = group['k']
+        if not group['train_mode']:
+            raise Exception("Not in train mode!")
+        active_p = [p for p in group['params'] if p.grad is not None]
+        if not active_p:
+            return
+        storage_dtype = getattr(torch, group['storage_dtype'])
+        for p in active_p:
+            if 'exp_avg' not in self.state_(p):
+                self.state_(p)['exp_avg'] = torch.zeros_like(p.data, dtype=storage_dtype)
+                self.state_(p)['exp_avg_sq'] = torch.zeros_like(p.data, dtype=storage_dtype)
+        y, grad, exp_avg_sq, exp_avg = zip(
+            *[(p.data, p.grad, self.state_(p)['exp_avg_sq'], self.state_(p)['exp_avg'])  #
+              for p in active_p])
+        lr = -warmup(group['lr'], k + 1, group['warmup_steps'])
+        lr = torch.empty((), dtype=torch.float32, device=y[0].device).fill_(lr)
+        step = torch.empty((), dtype=torch.int32, device=y[0].device).fill_(k + 1)
+        _compilable_step_(y, grad, exp_avg_sq, exp_avg, group['betas'][0], group['betas'][1], step, lr, eps, decay)
+        group['k'] = k + 1
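
For comparison, the order of operations in the LaProp `_compilable_step_` above can be written out for a single scalar parameter. This is a sketch under stated assumptions: `laprop_step` is a hypothetical name, the betas are used as given (no `beta_debias`), weight decay is omitted, and a positive `lr` is subtracted where the diff passes a negated `lr` to `update_param_`.

```python
import math


def laprop_step(p, g, exp_avg, exp_avg_sq, lr=0.0025, beta1=0.9, beta2=0.99, eps=1e-8):
    """One scalar LaProp-style step (hypothetical helper, simplified)."""
    # The second moment is updated first, then used as the denominator.
    exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * g * g
    denom = max(math.sqrt(exp_avg_sq), eps)
    # Momentum accumulates the already-normalised gradient.
    exp_avg = beta1 * exp_avg + (1 - beta1) * g / denom
    # Descent step on the parameter; weight decay is left out of this sketch.
    p = p - lr * exp_avg
    return p, exp_avg, exp_avg_sq
```

The returned `exp_avg` and `exp_avg_sq` would be carried into the next call, mirroring the optimizer state kept per parameter.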

heavyball-0.19.0/heavyball/foreach_sfadamw.py ADDED

@@ -0,0 +1,63 @@
+import torch
+import torch.optim
+from heavyball.utils import get_ckp1, copy_stochastic_list_
+from .utils import warmup, ScheduleFree, exp_avg_sq_, beta_debias, promote, _compilable_schedule_free_
+@torch.compile(mode='max-autotune-no-cudagraphs', fullgraph=True, dynamic=True)
+def _compilable_step_(y, grad, exp_avg_sq, z, beta1, beta2, step, ckp1, eps, decay, lr):
+    old_debiased2 = beta_debias(beta2, step)
+    g32 = [promote(g_) for g_ in grad]
+    exp_avg_sq32 = [promote(e_) for e_ in exp_avg_sq]
+    denom = exp_avg_sq_(exp_avg_sq32, g32, old_debiased2, eps)
+    torch._foreach_div_(g32, denom)
+    if decay != 0:
+        torch._foreach_add_(g32, y, alpha=decay)
+    for p, z_, g in zip(y, z, g32):
+        _compilable_schedule_free_(p, z_, ckp1, g, lr, beta1)
+    copy_stochastic_list_(exp_avg_sq, exp_avg_sq32)
+class ForeachSFAdamW(ScheduleFree):
+    def __init__(self, params, lr=0.0025, betas=(0.9, 0.99), eps=1e-8, weight_decay=0, warmup_steps=0, r=0.0,
+                 weight_lr_power=2.0, foreach: bool = True, storage_dtype: str = 'float32'):
+        defaults = dict(lr=lr, betas=betas, eps=eps, r=r, k=0, warmup_steps=warmup_steps, train_mode=True,
+                        weight_sum=0.0, lr_max=-1.0, weight_lr_power=weight_lr_power, weight_decay=weight_decay,
+                        foreach=foreach, storage_dtype=storage_dtype)
+        super().__init__(params, defaults, foreach)
+    def _step(self, group):
+        eps = group['eps']
+        decay = group['weight_decay']
+        k = group['k']
+        if not group['train_mode']:
+            raise Exception("Not in train mode!")
+        active_p = [p for p in group['params'] if p.grad is not None]
+        if not active_p:
+            return
+        storage_dtype = getattr(torch, group['storage_dtype'])
+        for p in active_p:
+            if 'z' not in self.state_(p):
+                self.state_(p)['z'] = torch.clone(p.data)
+                self.state_(p)['exp_avg_sq'] = torch.zeros_like(p.data, dtype=storage_dtype)
+        y, grad, exp_avg_sq, z = zip(*[(p.data, p.grad, self.state_(p)['exp_avg_sq'], self.state_(p)['z'])  #
+                                       for p in active_p])
+        lr = warmup(group['lr'], k + 1, group['warmup_steps'])
+        ckp1, group['weight_sum'] = get_ckp1(lr, group['weight_lr_power'], group['weight_sum'], group['r'], k + 1)
+        step = torch.empty((), dtype=torch.int32, device=y[0].device).fill_(k + 1)
+        ckp1 = torch.empty((), dtype=torch.float32, device=y[0].device).fill_(ckp1)
+        lr = torch.empty((), dtype=torch.float32, device=y[0].device).fill_(lr)
+        _compilable_step_(y, grad, exp_avg_sq, z, group['betas'][0], group['betas'][1], step, ckp1, eps, decay, lr)
+        group['k'] = k + 1
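
To see what `get_ckp1` hands to the schedule-free update, it helps to trace it over a few steps. The function body below is copied from the `utils.py` hunk of this diff; the loop, the helper name `trace_ckp1` and the chosen hyperparameters are illustrative assumptions. With a constant learning rate and `r=0.0`, every step contributes the same weight, so `ckp1` reduces to `1 / step`.

```python
def get_ckp1(lr, weight_lr_power, weight_sum, r, step):
    # Copied from the utils.py hunk in this diff.
    weight = lr ** weight_lr_power * max(step, 1) ** r
    weight_sum = weight_sum + weight
    try:
        ckp1 = weight / weight_sum
    except ZeroDivisionError:
        ckp1 = 0
    return ckp1, weight_sum


def trace_ckp1(lr, steps, weight_lr_power=2.0, r=0.0):
    """Hypothetical helper: collect ckp1 for steps 1..steps at a constant lr."""
    weight_sum = 0.0
    values = []
    for step in range(1, steps + 1):
        ckp1, weight_sum = get_ckp1(lr, weight_lr_power, weight_sum, r, step)
        values.append(ckp1)
    return values


if __name__ == "__main__":
    for step, ckp1 in enumerate(trace_ckp1(1e-3, 4), start=1):
        print(step, ckp1)
```

A learning rate of zero, for instance on the first step of a warmup, makes both `weight` and `weight_sum` zero; the `ZeroDivisionError` branch then returns `ckp1 = 0`.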

heavyball-0.19.0/heavyball/palm_foreach_sfadamw.py ADDED

@@ -0,0 +1,69 @@
+import torch
+import torch.optim
+from .utils import warmup, ScheduleFree, exp_avg_sq_, beta_debias, get_ckp1, promote, \
+    _compilable_schedule_free_, copy_stochastic_list_
+@torch.compile(mode='max-autotune-no-cudagraphs', fullgraph=True, dynamic=True)
+def _compilable_step_(y, grad, exp_avg_sq, z, beta1, beta2, step, ckp1, eps, decay, lr):
+    old_debiased2 = beta_debias(beta2, step)
+    g32 = [promote(g_) for g_ in grad]
+    exp_avg_sq32 = [promote(e_) for e_ in exp_avg_sq]
+    denom = exp_avg_sq_(exp_avg_sq32, g32, old_debiased2, eps)
+    torch._foreach_div_(g32, denom)
+    if decay != 0:
+        torch._foreach_add_(g32, y, alpha=decay)
+    for p, z_, g in zip(y, z, g32):
+        _compilable_schedule_free_(p, z_, ckp1, g, lr, beta1)
+    copy_stochastic_list_(exp_avg_sq, exp_avg_sq32)
+class PaLMForeachSFAdamW(ScheduleFree):
+    def __init__(self, params, lr=0.0025, beta=0.9, betas=(None, None), eps=1e-8, weight_decay=0, warmup_steps=0, r=0.0,
+                 weight_lr_power=2.0, beta2_scale: float = 0.8, foreach: bool = True, storage_dtype: str = 'float32'):
+        if betas[0] is not None:
+            beta = betas[0]
+        defaults = dict(lr=lr, beta=beta, eps=eps, r=r, k=0, warmup_steps=warmup_steps, train_mode=True, weight_sum=0.0,
+                        lr_max=-1.0, weight_lr_power=weight_lr_power, weight_decay=weight_decay,
+                        beta2_scale=beta2_scale, storage_dtype=storage_dtype)
+        super().__init__(params, defaults, foreach)
+    def _step(self, group):
+        eps = group['eps']
+        decay = group['weight_decay']
+        k = group['k']
+        if not group['train_mode']:
+            raise Exception("Not in train mode!")
+        active_p = [p for p in group['params'] if p.grad is not None]
+        if not active_p:
+            return
+        storage_dtype = getattr(torch, group['storage_dtype'])
+        for p in active_p:
+            if 'z' not in self.state_(p):
+                self.state_(p)['z'] = torch.clone(p.data)
+                self.state_(p)['exp_avg_sq'] = torch.zeros_like(p.data, dtype=storage_dtype)
+        # Decay the first moment running average coefficient
+        beta2 = 1 - (k + 1) ** -group['beta2_scale']
+        y, grad, exp_avg_sq, z = zip(*[(p.data, p.grad, self.state_(p)['exp_avg_sq'], self.state_(p)['z'])  #
+                                       for p in active_p])
+        lr = warmup(group['lr'], k + 1, group['warmup_steps'])
+        ckp1, group['weight_sum'] = get_ckp1(lr, group['weight_lr_power'], group['weight_sum'], group['r'], k + 1)
+        step = torch.empty((), dtype=torch.int32, device=y[0].device).fill_(k + 1)
+        ckp1 = torch.empty((), dtype=torch.float32, device=y[0].device).fill_(ckp1)
+        beta2 = torch.empty((), dtype=torch.float32, device=y[0].device).fill_(beta2)
+        lr = torch.empty((), dtype=torch.float32, device=y[0].device).fill_(lr)
+        _compilable_step_(y, grad, exp_avg_sq, z, group['beta'], beta2, step, ckp1, eps, decay, lr)
+        group['k'] = k + 1
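
The `beta2` schedule computed in `_step` above is a one-line formula, and evaluating it at a few step counts shows how it behaves. The expression is taken from the diff; the function name `palm_beta2` is a hypothetical wrapper. The result is passed to `_compilable_step_` as `beta2`, the coefficient of the `exp_avg_sq` running average.

```python
def palm_beta2(k, beta2_scale=0.8):
    """beta2 at zero-based step k, as computed in PaLMForeachSFAdamW._step."""
    return 1 - (k + 1) ** -beta2_scale


if __name__ == "__main__":
    for k in (0, 9, 99, 999):
        print(k, palm_beta2(k))
```

At `k = 0` the value is exactly 0, so the first squared gradient replaces the zero-initialised state entirely; the coefficient then rises monotonically towards 1 as training proceeds.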

{heavyball-0.18.7 → heavyball-0.19.0}/heavyball/psgd_kron.py RENAMED

@@ -104,7 +104,8 @@ class ForeachPSGDKron(PSGDBase):
             if should_update:
                 q32 = [promote(q_) for q_ in q]
-                self.do_update(group, [p], [g], [q32], precond_lr, [q_orig], store_triu_as_line)
+                self.do_update(group, [p], [ea if momentum_into_precond_update else g], [q32], precond_lr, [q_orig],
+                               store_triu_as_line)
             set_(g, psgd_precond_grad(q, self.state_(p)["exprs"], ea))
         grad_list = self.clip_fn(grad_list)

{heavyball-0.18.7 → heavyball-0.19.0}/heavyball/utils.py RENAMED

@@ -40,14 +40,25 @@ def warmup(lr: float, step: int, warmup_steps: int):
 @torch.compile(mode='max-autotune-no-cudagraphs', fullgraph=True, dynamic=True)
 def _compilable_schedule_free_(p, z, ckp1, grad, lr, beta1):
-    p32 = p.float()
-    z32 = z.float()
-    p32.lerp_(end=z32, weight=1 - ckp1)
+    p32 = promote(p)
+    z32 = promote(z)
+    p32.lerp_(end=z32, weight=ckp1)
     p32.add_(grad, alpha=lr * (beta1 * (1 - ckp1) - 1))
-    _guarded_copy_stochastic(p, p32)
+    copy_stochastic_(p, p32)
     z32.add_(grad, alpha=-lr)
-    _guarded_copy_stochastic(z, z32)
+    copy_stochastic_(z, z32)
+def get_ckp1(lr, weight_lr_power, weight_sum, r, step):
+    weight = lr ** weight_lr_power * max(step, 1) ** r
+    weight_sum = weight_sum + weight
+    try:
+        ckp1 = weight / weight_sum
+    except ZeroDivisionError:
+        ckp1 = 0
+    return ckp1, weight_sum
 def schedule_free_(lr: float, weight_lr_power: float, weight_sum: float, beta1: float, parameters: List[torch.Tensor],
@@ -136,7 +147,7 @@ def exp_avg_sq_(state, grad, beta2, eps, out=None):
         return torch.sqrt(state, out=out).clamp_(min=eps)
     torch._foreach_mul_(state, beta2)
-    torch._foreach_addcmul_(state, grad, grad, value=1 - beta2)
+    [s.addcmul_(g, g, value=1 - beta2) for s, g in zip(state, grad)]
     denom = torch._foreach_sqrt(state)
     torch._foreach_maximum_(denom, eps)
     return denom
@@ -332,9 +343,9 @@ def compute_ggt(grad, GG, max_precond_dim, precondition_1d, beta):
 def promote(x):
-    if x in (torch.bfloat16, torch.float16):
+    if isinstance(x, torch.dtype) and x in (torch.bfloat16, torch.float16):
         return torch.float32
-    if hasattr(x, 'dtype') and x.dtype in (torch.bfloat16, torch.float16):
+    if isinstance(x, torch.Tensor) and x.dtype in (torch.bfloat16, torch.float16):
         return x.float()
     return x
@@ -486,13 +497,8 @@ def copy_stochastic_list_(target: List[torch.Tensor], source: List[torch.Tensor]
         copy_stochastic_(t, s)
-def _guarded_copy_stochastic(target: torch.Tensor, source: torch.Tensor):
-    if target.dtype != torch.bfloat16 or source.dtype not in (torch.float16, torch.float32, torch.float64):
-        set_(target, source)
-    _compilable_copy_stochastic_(target, source)
-@torch.compile(mode='max-autotune-no-cudagraphs', fullgraph=True, dynamic=True)
+# this can be dynamic for most optimizers - just not for PSGD. So, it's disabled for all
+@torch.compile(mode='max-autotune-no-cudagraphs', fullgraph=True)
 def _compilable_copy_stochastic_(target: torch.Tensor, source: torch.Tensor):
     """Taken as-is from https://github.com/pytorch/pytorch/issues/120376#issuecomment-1974828905"""
     # create a random 16 bit integer
@@ -509,22 +515,24 @@ def _compilable_copy_stochastic_(target: torch.Tensor, source: torch.Tensor):
 def copy_stochastic_(target: torch.Tensor, source: torch.Tensor):
-    if target.data_ptr() == source.data_ptr():
+    if not torch.compiler.is_compiling() and target.data_ptr() == source.data_ptr():
         return
-    _guarded_copy_stochastic(target, source)
+    if target.dtype != torch.bfloat16 or source.dtype not in (torch.float16, torch.float32, torch.float64):
+        set_(target, source)
+    _compilable_copy_stochastic_(target, source)
 @torch.compile(mode='max-autotune-no-cudagraphs', fullgraph=True, dynamic=True)
 def _compilable_update_one_(p, u, decay, add_fn, lr):
-    p32 = p.float()
-    u32 = u.view(p.shape).float()
+    p32 = promote(p)
+    u32 = promote(u.view(p.shape))
     if decay > 0:
         p32.mul_(1 - decay * lr)
     if add_fn is None:
         p32.add_(u32, alpha=lr)
     else:
         add_fn(p32, u32, lr)
-    _guarded_copy_stochastic(p, p32)
+    copy_stochastic_(p, p32)
 def update_param_(param: List[torch.Tensor], update: List[torch.Tensor], lr: float, decay: float,
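
The stochastic rounding that `copy_stochastic_` relies on can be illustrated without PyTorch. The sketch below is not the package's implementation: it works on one Python float at a time using the standard `struct` module, and `stochastic_round_to_bf16` is a hypothetical name. It follows the idea hinted at by the "create a random 16 bit integer" comment in the hunk above: add a random 16-bit integer to the float32 bit pattern, then clear the low 16 bits so that only the bfloat16-representable part remains.

```python
import random
import struct


def stochastic_round_to_bf16(x, rng=random):
    """Round a float32 value to a bfloat16-representable float, stochastically.

    Hypothetical scalar helper; `rng` is anything with getrandbits().
    """
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    (bits,) = struct.unpack('<I', struct.pack('<f', x))
    # Add a random 16-bit integer, then clear the low 16 bits. The carry into
    # the upper half happens with probability proportional to the discarded part.
    bits = (bits + rng.getrandbits(16)) & 0xFFFF0000
    (rounded,) = struct.unpack('<f', struct.pack('<I', bits))
    return rounded


if __name__ == "__main__":
    rng = random.Random(0)
    x = 1.0 + 2.0 ** -10  # not representable in bfloat16
    samples = [stochastic_round_to_bf16(x, rng) for _ in range(20000)]
    print(sorted(set(samples)), sum(samples) / len(samples))
```

Each individual result is one of the two neighbouring bfloat16 values, but the average over many roundings stays close to the original number. This is the property that keeps low-precision running averages from systematically drifting.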

{heavyball-0.18.7 → heavyball-0.19.0}/heavyball.egg-info/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: heavyball
-Version: 0.18.7
+Version: 0.19.0
 Summary: Efficient optimizers
 Home-page: https://github.com/clashluke/heavyball
 Author: Lucas Nestler
@@ -32,7 +32,7 @@ A simple package of efficient optimizers
 The goal is not to thrive for completeness, full maintenance or abstraction, but instead to provide a simple
 largely static alternative to `torch.optim` with more and better optimizers.
-Currently (2024-11-21, 0.18.6), the recommended stable optimizer is `PrecondSchedulePaLMSOAP` (see below). The
+Currently (2024-11-22, 0.19), the recommended stable optimizer is `PrecondSchedulePaLMSOAP` (see below). The
 recommended experimental optimizer is `DelayedPSGDKron` ([tuning guide](docs/psgd_efficiency.md)).
 ## Features
@@ -45,8 +45,10 @@ recommended experimental optimizer is `DelayedPSGDKron` ([tuning guide](docs/psg
 * **ScheduleFree**: No learning rate schedule, but better convergence
 * [**Preconditioner Schedule**](https://github.com/lixilinx/psgd_torch/): Improved loss-per-step in early convergence,
   better step-per-second in late convergence (explained below)
-* **Memory-efficient storage** PSGD supports `store_triu_as_line` (default: `True`) to trade off memory usage for memory
-  bandwidth; turn it off for lower overheads (for more, see [PSGD Efficiency](docs/psgd_efficiency.md))
+* **Memory-efficient storage** PSGD supports `store_triu_as_line` (default: `True`) and `q_dtype` to trade off memory
+  usage for memory
+  bandwidth; Other optimizers have `storage_dtype`, supporting lower-precision EMAs at no(?) performance drop via
+  stochastic rounding
 ## Getting started
@@ -76,19 +78,19 @@ for _ in range(1000):
 ## Optimizers
-| Name                    | Description                                                                                                                                                       | Advantages / Disadvantages                                                                                                                                                                                                                                                                                                            |
-|-------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| **AdamW**               | More efficient (speed, memory) [AdamW](https://arxiv.org/abs/1711.05101)                                                                                          | + Faster than AdamW<br>+ Possibly more (numerically) stable
-| **LaProp**              | More efficient (speed, memory) [LaProp](https://arxiv.org/abs/2002.04839)                                                                                         | + Same cost as AdamW<br>+ Marginally better converence (better proofs)<br>+ Higher hyperparameter stability<br>- Not a guaranteed win (can be neutral)<br>- No "Slingshot"                                                                                                                                                            |
-| **ADOPT**               | More efficient (speed, memory) [ADOPT](https://arxiv.org/abs/2411.02853)                                                                                          | + Same cost as AdamW<br>+ Rigorous mathematical convergence proofs, even for challenging models (GANs)<br>- Empirically underperforms LaProp<br>- no bf16                                                                                                                                                                             |
-| **SFAdamW**             | More efficient (speed, memory) [ScheduleFree AdamW](https://arxiv.org/abs/2405.15682)                                                                             | + Same cost as AdamW, but better eval perf<br>+ Full control over hyperparameters                                                                                                                                                                                                                                                     |
-| **PaLMSFAdamW**         | ForeachSFAdamW with [PaLM's beta2 schedule](https://arxiv.org/abs/2204.02311)                                                                                     | + Same cost as AdamW, but better eval perf<br>+ Less control, but faster early and more stable late convergence<br>+ ScheduleFree<br>- slow early convergence                                                                                                                                                                         |
-| **SOAP**                | More efficient (speed, memory) [SOAP](https://arxiv.org/abs/2409.11321)                                                                                           | + Faster convergence (loss-at-step)<br>+ Full control over hyperparameters<br>- more memory usage<br>- more hyperparameters<br>- higher overhead than AdamW (can be ammortized; better loss-at-second)                                                                                                                                |
-| **PaLMSOAP**            | ForeachSOAP with [PaLM's beta2 schedule](https://arxiv.org/abs/2204.02311)                                                                                        | + Faster convergence (loss-at-step)<br>+ Less control, but faster early and more stable late convergence<br>- more memory usage<br>- more hyperparameters<br>- higher overhead than AdamW (can be ammortized; better loss-at-second)                                                                                                  |
-| **SFPaLMSOAP**          | ScheduleFree PaLMForeachSOAP                                                                                                                                      | + Fast convergence (loss-at-step)<br>+ less memory usage than PaLMForeachSOAP (more tham AdamW)<br>- slower initial convergence than PaLMForeachSOAP (but allows higher LRs)<br>- higher overhead than AdamW (can be ammortized)                                                                                                      |
+| Name                          | Description                                                                                                                                                       | Advantages / Disadvantages                                                                                                                                                                                                                                                                                                            |
+|-------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| **AdamW**                     | More efficient (speed, memory) [AdamW](https://arxiv.org/abs/1711.05101)                                                                                          | + Faster than AdamW<br>+ Possibly more (numerically) stable
+| **LaProp**                    | More efficient (speed, memory) [LaProp](https://arxiv.org/abs/2002.04839)                                                                                         | + Same cost as AdamW<br>+ Marginally better converence (better proofs)<br>+ Higher hyperparameter stability<br>- Not a guaranteed win (can be neutral)<br>- No "Slingshot"                                                                                                                                                            |
+| **ADOPT**                     | More efficient (speed, memory) [ADOPT](https://arxiv.org/abs/2411.02853)                                                                                          | + Same cost as AdamW<br>+ Rigorous mathematical convergence proofs, even for challenging models (GANs)<br>- Empirically underperforms LaProp<br>- no bf16                                                                                                                                                                             |
+| **SFAdamW**                   | More efficient (speed, memory) [ScheduleFree AdamW](https://arxiv.org/abs/2405.15682)                                                                             | + Same cost as AdamW, but better eval perf<br>+ Full control over hyperparameters                                                                                                                                                                                                                                                     |
+| **PaLMSFAdamW**               | ForeachSFAdamW with [PaLM's beta2 schedule](https://arxiv.org/abs/2204.02311)                                                                                     | + Same cost as AdamW, but better eval perf<br>+ Less control, but faster early and more stable late convergence<br>+ ScheduleFree<br>- slow early convergence                                                                                                                                                                         |
+| **SOAP**                      | More efficient (speed, memory) [SOAP](https://arxiv.org/abs/2409.11321)                                                                                           | + Faster convergence (loss-at-step)<br>+ Full control over hyperparameters<br>- more memory usage<br>- more hyperparameters<br>- higher overhead than AdamW (can be ammortized; better loss-at-second)                                                                                                                                |
+| **PaLMSOAP**                  | ForeachSOAP with [PaLM's beta2 schedule](https://arxiv.org/abs/2204.02311)                                                                                        | + Faster convergence (loss-at-step)<br>+ Less control, but faster early and more stable late convergence<br>- more memory usage<br>- more hyperparameters<br>- higher overhead than AdamW (can be ammortized; better loss-at-second)                                                                                                  |
+| **SFPaLMSOAP**                | ScheduleFree PaLMForeachSOAP                                                                                                                                      | + Fast convergence (loss-at-step)<br>+ less memory usage than PaLMForeachSOAP (more tham AdamW)<br>- slower initial convergence than PaLMForeachSOAP (but allows higher LRs)<br>- higher overhead than AdamW (can be ammortized)                                                                                                      |
 | **PrecondScheduleSFPaLMSOAP** | SFPaLMForeachSOAP with [preconditioner schedule](https://github.com/lixilinx/psgd_torch/), matching the error of PrecondEvery=2 with the cost of PrecondEvery=512 | + Better initial convergence than SFPaLMForeachSOAP<br>+ Significantly faster (sec/it) later<br>+ less memory usage than PaLMForeachSOAP (more tham AdamW)<br>- slower initial convergence than PaLMForeachSOAP (but allows higher LRs)<br>- higher overhead than AdamW (can be ammortized), goes to 0 with increasing number of step |
-| **PrecondSchedulePaLMSOAP** | PrecondScheduleSFPaLMForeachSOAP without schedule-free                                                                                                            | + Best initial convergence<br>+ Significantly faster (sec/it) later<br>+ high stability<br>- more memory usage than PrecondScheduleSFPaLMForeachSOAP<br>- higher overhead than AdamW (can be ammortized), goes to 0 with increasing number of steps                                                                                   |
-| **PrecondScheduleSOAP** | PrecondScheduleSFPaLMForeachSOAP without PaLM's beta2 schedule                                                                                                    | + Better initial convergence<br>+ Significantly faster (sec/it) later<br>- more memory usage than PrecondScheduleSFPaLMForeachSOAP<br>- higher overhead than AdamW (can be ammortized), goes to 0 with increasing number of steps                                                                                                     |
+| **PrecondSchedulePaLMSOAP**   | PrecondScheduleSFPaLMForeachSOAP without schedule-free                                                                                                            | + Best initial convergence<br>+ Significantly faster (sec/it) later<br>+ high stability<br>- more memory usage than PrecondScheduleSFPaLMForeachSOAP<br>- higher overhead than AdamW (can be ammortized), goes to 0 with increasing number of steps                                                                                   |
+| **PrecondScheduleSOAP**       | PrecondScheduleSFPaLMForeachSOAP without PaLM's beta2 schedule                                                                                                    | + Better initial convergence<br>+ Significantly faster (sec/it) later<br>- more memory usage than PrecondScheduleSFPaLMForeachSOAP<br>- higher overhead than AdamW (can be ammortized), goes to 0 with increasing number of steps                                                                                                     |
 ## Precond Schedule

{heavyball-0.18.7 → heavyball-0.19.0}/heavyball.egg-info/SOURCES.txt RENAMED

@@ -27,6 +27,7 @@ heavyball.egg-info/requires.txt
 heavyball.egg-info/top_level.txt
 test/test_bf16_params.py
 test/test_bf16_q.py
+test/test_bf16_storage.py
 test/test_closure.py
 test/test_foreach.py
 test/test_memory.py

{heavyball-0.18.7 → heavyball-0.19.0}/setup.py RENAMED

@@ -10,7 +10,7 @@ setuptools.setup(
     name='heavyball',
     license='BSD',
     description='Efficient optimizers',
-    version='0.18.7',
+    version='0.19.0',
     long_description=README,
     url='https://github.com/clashluke/heavyball',
     packages=setuptools.find_packages(),

{heavyball-0.18.7 → heavyball-0.19.0}/test/test_bf16_params.py RENAMED

@@ -22,7 +22,6 @@ def get_memory():
 @pytest.mark.parametrize("size,depth", [(256, 2)])
 def test_foreach(opt, size, depth: int, iterations: int = 128, outer_iterations: int = 3):
     set_torch()
     opt = getattr(heavyball, opt)
     peaks = []

heavyball-0.19.0/test/test_bf16_storage.py ADDED

@@ -0,0 +1,60 @@
+import pytest
+import torch
+from torch import nn
+from torch._dynamo import config
+import heavyball
+import heavyball.utils
+from benchmark.utils import get_optim
+from heavyball.utils import clean, set_torch, PSGDBase
+config.cache_size_limit = 128
+def get_memory():
+    clean()
+    torch.cuda.synchronize()
+    clean()
+    torch.cuda.synchronize()
+    return torch.cuda.memory_allocated()
+@pytest.mark.parametrize("opt", heavyball.__all__)
+@pytest.mark.parametrize("size,depth", [(256, 2)])
+def test_foreach(opt, size, depth: int, iterations: int = 128, outer_iterations: int = 3):
+    set_torch()
+    if 'soap' in opt.lower():
+        raise pytest.skip('soap is not supported')
+    opt = getattr(heavyball, opt)
+    if PSGDBase in opt.__mro__:
+        raise pytest.skip('PSGD is not supported')
+    peaks = []
+    losses = []
+    for dtype_name in ["float32", "bfloat16"]:
+        torch.manual_seed(0x2131290)
+        peaks.append([])
+        losses.append([])
+        dtype = getattr(torch, dtype_name)
+        for i in range(outer_iterations):
+            model = nn.Sequential(*[nn.Linear(size, size) for _ in range(depth)]).cuda().to(dtype)
+            o = get_optim(opt, model.parameters(), lr=1e-3, storage_dtype=dtype_name)
+            for _ in range(iterations):
+                loss = model(torch.randn((1024, size), device='cuda', dtype=dtype)).square().mean()
+                loss.backward()
+                o.step()
+                o.zero_grad()
+                losses[-1].append(loss.detach())
+            del model, o
+            clean()
+    for i, (l0, l1) in enumerate(zip(*losses)):
+        print(i, l0.item(), l1.item())
+        assert torch.allclose(l0.float(), l1.float(), rtol=0.1)

heavyball-0.18.7/heavyball/foreach_adamw.py DELETED

@@ -1,42 +0,0 @@
-import torch
-import torch.optim
-from .utils import warmup, exp_avg_sq_, beta_debias, update_param_, StatefulOptimizer
-class ForeachAdamW(StatefulOptimizer):
-    def __init__(self, params, lr=0.0025, betas=(0.9, 0.99), eps=1e-8, weight_decay=0, warmup_steps=0,
-                 foreach: bool = True):
-        defaults = dict(lr=lr, betas=betas, eps=eps, k=0, warmup_steps=warmup_steps, train_mode=True, weight_sum=0.0,
-                        lr_max=-1.0, weight_decay=weight_decay)
-        super().__init__(params, defaults, foreach)
-    def _step(self, group):
-        eps = group['eps']
-        decay = group['weight_decay']
-        k = group['k']
-        if not group['train_mode']:
-            raise Exception("Not in train mode!")
-        active_p = [p for p in group['params'] if p.grad is not None]
-        if not active_p:
-            return
-        for p in active_p:
-            if 'exp_avg' not in self.state_(p):
-                self.state_(p)['exp_avg'] = torch.zeros_like(p.data, dtype=torch.float32)
-                self.state_(p)['exp_avg_sq'] = torch.zeros_like(p.data, dtype=torch.float32)
-        y, grad, exp_avg_sq, exp_avg = zip(
-            *[(p.data, p.grad.float(), self.state_(p)['exp_avg_sq'], self.state_(p)['exp_avg']) for p in active_p])
-        # Decay the first and second moment running average coefficient
-        torch._foreach_lerp_(exp_avg, grad, 1 - beta_debias(group['betas'][0], k + 1))
-        denom = list(exp_avg_sq_(exp_avg_sq, grad, beta_debias(group['betas'][1], k + 1), eps))
-        # Normalize grad in-place for memory efficiency
-        lr = -warmup(group['lr'], k + 1, group['warmup_steps'])
-        update_param_(y, exp_avg, lr, decay, lambda p, e, l: p.addcdiv_(e, denom.pop(0), value=l))
-        group['k'] = k + 1

heavyball-0.18.7/heavyball/foreach_adopt.py DELETED

@@ -1,52 +0,0 @@
-import torch
-import torch.optim
-from .utils import warmup, beta_debias, update_param_, StatefulOptimizer
-class ForeachADOPT(StatefulOptimizer):
-    def __init__(self, params, lr=0.0025, betas=(0.9, 0.99), eps=1e-8, weight_decay=0, warmup_steps=0,
-                 foreach: bool = True):
-        defaults = dict(lr=lr, betas=betas, eps=eps, k=0, warmup_steps=warmup_steps, train_mode=True, weight_sum=0.0,
-                        lr_max=-1.0, weight_decay=weight_decay)
-        super().__init__(params, defaults, foreach)
-    def _step(self, group):
-        eps = group['eps']
-        decay = group['weight_decay']
-        k = group['k']
-        if not group['train_mode']:
-            raise Exception("Not in train mode!")
-        active_p = [p for p in group['params'] if p.grad is not None]
-        if not active_p:
-            return
-        for p in active_p:
-            if 'exp_avg' not in self.state_(p):
-                self.state_(p)['exp_avg'] = torch.zeros_like(p.data, dtype=torch.float32)
-                self.state_(p)['exp_avg_sq'] = torch.zeros_like(p.data, dtype=torch.float32)
-        y, grad, exp_avg_sq, exp_avg = zip(
-            *[(p.data, p.grad.float(), self.state_(p)['exp_avg_sq'], self.state_(p)['exp_avg']) for p in active_p])
-        if k > 1:
-            lr = -warmup(group['lr'], k - 1, group['warmup_steps'])
-            update_param_(y, exp_avg, lr, decay)
-        if k > 0:
-            beta1 = beta_debias(group['betas'][0], k)
-            denom = torch._foreach_sqrt(exp_avg_sq)
-            torch._foreach_maximum_(denom, eps)
-            torch._foreach_mul_(exp_avg, beta1)
-            torch._foreach_addcdiv_(exp_avg, grad, denom, 1 - beta1)
-        beta2 = beta_debias(group['betas'][1], k + 1)
-        torch._foreach_mul_(exp_avg_sq, beta2)
-        torch._foreach_addcmul_(exp_avg_sq, grad, grad, value=1 - beta2)
-        del grad
-        group['k'] = k + 1
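
Note on the deleted `ForeachADOPT._step` above: its order of operations is that the parameter update uses the momentum from the previous step, the momentum is normalized by the second-moment estimate from before the current gradient is folded in, and the second moment is updated last. A scalar sketch of that ordering follows. It is not the package's implementation. The helper name `adopt_like_step` is hypothetical, the betas are held constant (the source passes them through `beta_debias`, which is not shown in this diff), warmup and weight decay are omitted, and `update_param_` is assumed to apply a plain `param -= lr * exp_avg`.

```python
import math


def adopt_like_step(param, grad, exp_avg, exp_avg_sq, k,
                    lr=0.0025, beta1=0.9, beta2=0.99, eps=1e-8):
    """One scalar step mirroring the ordering of the deleted _step.

    Assumptions (not taken from the diff): constant betas, no warmup,
    no weight decay, and a plain `param -= lr * exp_avg` update.
    """
    if k > 1:
        # Apply the momentum accumulated on earlier steps.
        param -= lr * exp_avg
    if k > 0:
        # Normalize by the second moment *before* it sees the current gradient.
        denom = max(math.sqrt(exp_avg_sq), eps)
        exp_avg = beta1 * exp_avg + (1 - beta1) * grad / denom
    # Fold the current gradient into the second moment last.
    exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * grad * grad
    return param, exp_avg, exp_avg_sq, k + 1


# Three steps with a constant gradient of 2.0.
state = (1.0, 0.0, 0.0, 0)  # param, exp_avg, exp_avg_sq, k
history = []
for _ in range(3):
    state = adopt_like_step(state[0], 2.0, state[1], state[2], state[3])
    history.append(state)
```

Under these assumptions the parameter is untouched for the first two calls (`k = 0` only seeds the second moment, `k = 1` only seeds the momentum) and first moves on the third call.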

heavyball-0.18.7/heavyball/foreach_laprop.py DELETED

@@ -1,47 +0,0 @@
-import torch
-import torch.optim
-from .utils import warmup, exp_avg_sq_, beta_debias, update_param_, StatefulOptimizer
-class ForeachLaProp(StatefulOptimizer):
-    def __init__(self, params, lr=0.0025, betas=(0.9, 0.99), eps=1e-8, weight_decay=0, warmup_steps=1,
-                 foreach: bool = True):
-        defaults = dict(lr=lr, betas=betas, eps=eps, k=0, warmup_steps=warmup_steps, train_mode=True, weight_sum=0.0,
-                        lr_max=-1.0, weight_decay=weight_decay)
-        super().__init__(params, defaults, foreach)
-    def _step(self, group):
-        eps = group['eps']
-        decay = group['weight_decay']
-        k = group['k']
-        if not group['train_mode']:
-            raise Exception("Not in train mode!")
-        active_p = [p for p in group['params'] if p.grad is not None]
-        if not active_p:
-            return
-        for p in active_p:
-            if 'exp_avg' not in self.state_(p):
-                self.state_(p)['exp_avg'] = torch.zeros_like(p.data, dtype=torch.float32)
-                self.state_(p)['exp_avg_sq'] = torch.zeros_like(p.data, dtype=torch.float32)
-        y, grad, exp_avg_sq, exp_avg = zip(
-            *[(p.data, p.grad.float(), self.state_(p)['exp_avg_sq'], self.state_(p)['exp_avg']) for p in active_p])
-        # Decay the first and second moment running average coefficient
-        denom = exp_avg_sq_(exp_avg_sq, grad, beta_debias(group['betas'][1], k + 1), eps)
-        beta1 = beta_debias(group['betas'][0], k + 1)
-        torch._foreach_mul_(exp_avg, beta1)
-        torch._foreach_addcdiv_(exp_avg, grad, denom, 1 - beta1)
-        del grad
-        # Normalize grad in-place for memory efficiency
-        lr = -warmup(group['lr'], k + 1, group['warmup_steps'])
-        update_param_(y, exp_avg, lr, decay)
-        group['k'] = k + 1

heavyball-0.18.7/heavyball/foreach_sfadamw.py DELETED

@@ -1,54 +0,0 @@
-import torch
-import torch.optim
-from .utils import schedule_free_, warmup, ScheduleFree, exp_avg_sq_, beta_debias
-class ForeachSFAdamW(ScheduleFree):
-    def __init__(self, params, lr=0.0025, betas=(0.9, 0.99), eps=1e-8, weight_decay=0, warmup_steps=0, r=0.0,
-                 weight_lr_power=2.0, foreach: bool = True):
-        defaults = dict(lr=lr, betas=betas, eps=eps, r=r, k=0, warmup_steps=warmup_steps, train_mode=True,
-                        weight_sum=0.0, lr_max=-1.0, weight_lr_power=weight_lr_power, weight_decay=weight_decay,
-                        foreach=foreach)
-        super().__init__(params, defaults, foreach)
-    def _step(self, group):
-        eps = group['eps']
-        decay = group['weight_decay']
-        k = group['k']
-        if not group['train_mode']:
-            raise Exception("Not in train mode!")
-        active_p = [p for p in group['params'] if p.grad is not None]
-        if not active_p:
-            return
-        for p in active_p:
-            if 'z' not in self.state_(p):
-                self.state_(p)['z'] = torch.clone(p.data)
-                self.state_(p)['exp_avg_sq'] = torch.zeros_like(p.data, dtype=torch.float32)
-        y, grad, exp_avg_sq, z = zip(
-            *[(p.data, p.grad.float(), self.state_(p)['exp_avg_sq'], self.state_(p)['z']) for p in active_p])
-        # Decay the first moment running average coefficient
-        old_debiased = beta_debias(group['betas'][1], k + 1)
-        # Decay the first and second moment running average coefficient
-        denom = exp_avg_sq_(exp_avg_sq, grad, old_debiased, eps)
-        # Normalize grad in-place for memory efficiency
-        torch._foreach_div_(grad, denom)
-        # Weight decay calculated at y
-        if decay != 0:
-            torch._foreach_add_(grad, y, alpha=decay)
-        lr = warmup(group['lr'], k + 1, group['warmup_steps'])
-        group['weight_sum'] = schedule_free_(lr, group['weight_lr_power'], group['weight_sum'], group['betas'][0], y, z,
-                                             grad, group['r'], k + 1)
-        group['k'] = k + 1

heavyball-0.18.7/heavyball/palm_foreach_sfadamw.py DELETED

@@ -1,57 +0,0 @@
-import torch
-import torch.optim
-from .utils import schedule_free_, warmup, ScheduleFree, exp_avg_sq_, beta_debias
-class PaLMForeachSFAdamW(ScheduleFree):
-    def __init__(self, params, lr=0.0025, beta=0.9, betas=(None, None), eps=1e-8, weight_decay=0, warmup_steps=0, r=0.0,
-                 weight_lr_power=2.0, beta2_scale: float = 0.8,
-                 foreach: bool = True):
-        if betas[0] is not None:
-            beta = betas[0]
-        defaults = dict(lr=lr, beta=beta, eps=eps, r=r, k=0, warmup_steps=warmup_steps, train_mode=True, weight_sum=0.0,
-                        lr_max=-1.0, weight_lr_power=weight_lr_power, weight_decay=weight_decay,
-                        beta2_scale=beta2_scale)
-        super().__init__(params, defaults, foreach)
-    def _step(self, group):
-        eps = group['eps']
-        decay = group['weight_decay']
-        k = group['k']
-        if not group['train_mode']:
-            raise Exception("Not in train mode!")
-        active_p = [p for p in group['params'] if p.grad is not None]
-        if not active_p:
-            return
-        for p in active_p:
-            if 'z' not in self.state_(p):
-                self.state_(p)['z'] = torch.clone(p.data)
-                self.state_(p)['exp_avg_sq'] = torch.zeros_like(p.data, dtype=torch.float32)
-        y, grad, exp_avg_sq, z = zip(
-            *[(p.data, p.grad.float(), self.state_(p)['exp_avg_sq'], self.state_(p)['z']) for p in active_p])
-        # Decay the first moment running average coefficient
-        beta2 = 1 - (k + 1) ** -group['beta2_scale']
-        old_debiased = beta_debias(beta2, k + 1)
-        # Decay the first and second moment running average coefficient
-        denom = exp_avg_sq_(exp_avg_sq, grad, old_debiased, eps)
-        # Normalize grad in-place for memory efficiency
-        torch._foreach_div_(grad, denom)
-        # Weight decay calculated at y
-        if decay != 0:
-            torch._foreach_add_(grad, y, alpha=decay)
-        lr = warmup(group['lr'], k + 1, group['warmup_steps'])
-        group['weight_sum'] = schedule_free_(lr, group['weight_lr_power'], group['weight_sum'], group['beta'], y, z,
-                                             grad, group['r'], k + 1)
-        group['k'] = k + 1
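
Note on the deleted `PaLMForeachSFAdamW._step` above: the `beta2` line computes a step-dependent second-moment decay, `1 - (k + 1) ** -beta2_scale`, instead of reading a fixed value from `betas`. A minimal sketch of that expression in isolation follows. The helper name `palm_beta2` is hypothetical, and the subsequent `beta_debias` call is left out because its definition is not shown in this diff.

```python
def palm_beta2(step: int, beta2_scale: float = 0.8) -> float:
    """beta2 for a 1-indexed step, i.e. step = k + 1 in the deleted code."""
    if step < 1:
        raise ValueError("step must be >= 1")
    return 1 - step ** -beta2_scale


# beta2 is 0 on the first step and rises towards 1 as the step count grows.
for step in (1, 10, 100, 1000):
    print(step, round(palm_beta2(step), 4))
```

With the default `beta2_scale=0.8` the second-moment average weights recent gradients heavily early on and averages over a longer horizon later.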