ista-daslab-optimizers 1.1.6__tar.gz → 1.1.8__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {ista_daslab_optimizers-1.1.6/ista_daslab_optimizers.egg-info → ista_daslab_optimizers-1.1.8}/PKG-INFO +32 -10
- {ista_daslab_optimizers-1.1.6 → ista_daslab_optimizers-1.1.8}/README.md +27 -8
- {ista_daslab_optimizers-1.1.6 → ista_daslab_optimizers-1.1.8}/ista_daslab_optimizers/__init__.py +2 -0
- {ista_daslab_optimizers-1.1.6 → ista_daslab_optimizers-1.1.8}/ista_daslab_optimizers/dense_mfac/dense_core_mfac.py +4 -4
- ista_daslab_optimizers-1.1.8/ista_daslab_optimizers/fft_low_rank/dct_adamw.py +351 -0
- ista_daslab_optimizers-1.1.8/ista_daslab_optimizers/fft_low_rank/fft_projector.py +192 -0
- ista_daslab_optimizers-1.1.8/ista_daslab_optimizers/fft_low_rank/trion.py +242 -0
- ista_daslab_optimizers-1.1.8/ista_daslab_optimizers/ista_optimizer/__init__.py +5 -0
- ista_daslab_optimizers-1.1.8/ista_daslab_optimizers/ista_optimizer/ista_optimizer.py +36 -0
- {ista_daslab_optimizers-1.1.6 → ista_daslab_optimizers-1.1.8}/ista_daslab_optimizers/micro_adam/micro_adam.py +14 -14
- {ista_daslab_optimizers-1.1.6 → ista_daslab_optimizers-1.1.8}/ista_daslab_optimizers/sparse_mfac/sparse_core_mfac_w_ef.py +10 -10
- {ista_daslab_optimizers-1.1.6 → ista_daslab_optimizers-1.1.8}/ista_daslab_optimizers/tools.py +4 -3
- ista_daslab_optimizers-1.1.8/ista_daslab_optimizers/utils/dct.py +45 -0
- ista_daslab_optimizers-1.1.8/ista_daslab_optimizers/utils/global_cache.py +45 -0
- ista_daslab_optimizers-1.1.8/ista_daslab_optimizers/utils/matrix_storage.py +58 -0
- ista_daslab_optimizers-1.1.8/ista_daslab_optimizers/utils/newton_schulz_triton.py +374 -0
- ista_daslab_optimizers-1.1.8/ista_daslab_optimizers/utils/quantizers.py +71 -0
- ista_daslab_optimizers-1.1.8/ista_daslab_optimizers/utils/schedulers.py +41 -0
- {ista_daslab_optimizers-1.1.6 → ista_daslab_optimizers-1.1.8/ista_daslab_optimizers.egg-info}/PKG-INFO +32 -10
- ista_daslab_optimizers-1.1.8/ista_daslab_optimizers.egg-info/SOURCES.txt +33 -0
- {ista_daslab_optimizers-1.1.6 → ista_daslab_optimizers-1.1.8}/ista_daslab_optimizers.egg-info/requires.txt +2 -0
- ista_daslab_optimizers-1.1.8/ista_daslab_optimizers.egg-info/top_level.txt +1 -0
- {ista_daslab_optimizers-1.1.6 → ista_daslab_optimizers-1.1.8}/pyproject.toml +4 -1
- ista_daslab_optimizers-1.1.6/ista_daslab_optimizers.egg-info/SOURCES.txt +0 -48
- ista_daslab_optimizers-1.1.6/ista_daslab_optimizers.egg-info/top_level.txt +0 -5
- ista_daslab_optimizers-1.1.6/kernels/dense_mfac/dense_mfac.cpp +0 -20
- ista_daslab_optimizers-1.1.6/kernels/dense_mfac/dense_mfac_kernel.cu +0 -216
- ista_daslab_optimizers-1.1.6/kernels/micro_adam/micro_adam.cpp +0 -62
- ista_daslab_optimizers-1.1.6/kernels/micro_adam/micro_adam_asymm_block_quant.cu +0 -64
- ista_daslab_optimizers-1.1.6/kernels/micro_adam/micro_adam_asymm_block_quant_inv.cu +0 -83
- ista_daslab_optimizers-1.1.6/kernels/micro_adam/micro_adam_update.cu +0 -165
- ista_daslab_optimizers-1.1.6/kernels/sparse_mfac/sparse_mfac.cpp +0 -84
- ista_daslab_optimizers-1.1.6/kernels/sparse_mfac/sparse_mfac_LCG_kernel.cu +0 -246
- ista_daslab_optimizers-1.1.6/kernels/sparse_mfac/sparse_mfac_SP_kernel.cu +0 -251
- ista_daslab_optimizers-1.1.6/kernels/sparse_mfac_pruner/sparse_mfac_pruner.cpp +0 -57
- ista_daslab_optimizers-1.1.6/kernels/sparse_mfac_pruner/sparse_mfac_pruner.cu +0 -235
- ista_daslab_optimizers-1.1.6/kernels/tools/tools.cpp +0 -127
- ista_daslab_optimizers-1.1.6/kernels/tools/tools_kernel.cu +0 -315
- ista_daslab_optimizers-1.1.6/kernels/utils.h +0 -125
- ista_daslab_optimizers-1.1.6/setup.py +0 -56
- {ista_daslab_optimizers-1.1.6 → ista_daslab_optimizers-1.1.8}/LICENSE +0 -0
- {ista_daslab_optimizers-1.1.6 → ista_daslab_optimizers-1.1.8}/MANIFEST.in +0 -0
- {ista_daslab_optimizers-1.1.6 → ista_daslab_optimizers-1.1.8}/ista_daslab_optimizers/acdc/__init__.py +0 -0
- {ista_daslab_optimizers-1.1.6 → ista_daslab_optimizers-1.1.8}/ista_daslab_optimizers/acdc/acdc.py +0 -0
- {ista_daslab_optimizers-1.1.6 → ista_daslab_optimizers-1.1.8}/ista_daslab_optimizers/acdc/wd_scheduler.py +0 -0
- {ista_daslab_optimizers-1.1.6 → ista_daslab_optimizers-1.1.8}/ista_daslab_optimizers/dense_mfac/__init__.py +0 -0
- {ista_daslab_optimizers-1.1.6 → ista_daslab_optimizers-1.1.8}/ista_daslab_optimizers/dense_mfac/dense_mfac.py +0 -0
- {ista_daslab_optimizers-1.1.6 → ista_daslab_optimizers-1.1.8}/ista_daslab_optimizers/micro_adam/__init__.py +0 -0
- {ista_daslab_optimizers-1.1.6 → ista_daslab_optimizers-1.1.8}/ista_daslab_optimizers/sparse_mfac/__init__.py +0 -0
- {ista_daslab_optimizers-1.1.6 → ista_daslab_optimizers-1.1.8}/ista_daslab_optimizers/sparse_mfac/sparse_mfac.py +0 -0
- {ista_daslab_optimizers-1.1.6 → ista_daslab_optimizers-1.1.8}/ista_daslab_optimizers.egg-info/dependency_links.txt +0 -0
- {ista_daslab_optimizers-1.1.6 → ista_daslab_optimizers-1.1.8}/setup.cfg +0 -0
--- ista_daslab_optimizers-1.1.6/ista_daslab_optimizers.egg-info/PKG-INFO
+++ ista_daslab_optimizers-1.1.8/PKG-INFO
@@ -1,6 +1,6 @@
-Metadata-Version: 2.
+Metadata-Version: 2.4
 Name: ista_daslab_optimizers
-Version: 1.1.6
+Version: 1.1.8
 Summary: Deep Learning optimizers developed in the Distributed Algorithms and Systems group (DASLab) @ Institute of Science and Technology Austria (ISTA)
 Author-email: Ionut-Vlad Modoranu <ionut-vlad.modoranu@ist.ac.at>
 Maintainer-email: Ionut-Vlad Modoranu <ionut-vlad.modoranu@ist.ac.at>
@@ -222,6 +222,9 @@ Requires-Dist: gpustat
 Requires-Dist: timm
 Requires-Dist: einops
 Requires-Dist: psutil
+Requires-Dist: fast-hadamard-transform
+Requires-Dist: ista-daslab-optimizers-cuda
+Dynamic: license-file

 # ISTA DAS Lab Optimization Algorithms Package
 This repository contains optimization algorithms for Deep Learning developed by
@@ -240,6 +243,9 @@ The repository contains code for the following optimizers published by DASLab @
 - **MicroAdam**:
   - paper: [MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence](https://arxiv.org/abs/2405.15593)
   - official repository: [GitHub](https://github.com/IST-DASLab/MicroAdam)
+- **Trion / DCT-AdamW**:
+  - paper: [FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models](https://arxiv.org/abs/2505.17967v3)
+  - code: [GitHub](https://github.com/IST-DASLab/ISTA-DASLab-Optimizers/tree/main/ista_daslab_optimizers/fft_low_rank)

 ### Installation
 To use the latest stable version of this repository, you can install via pip:
@@ -261,7 +267,8 @@ source install.sh

 ## How to use optimizers?

-In this repository we provide a minimal working example for CIFAR-10 for optimizers `acdc`, `dense_mfac`, `sparse_mfac` and `micro_adam`:
+In this repository we provide a minimal working example for CIFAR-10 for optimizers `acdc`,
+`dense_mfac`, `sparse_mfac` and `micro_adam`:
 ```shell
 cd examples/cifar10
 OPTIMIZER=micro_adam # or any other optimizer listed above
@@ -291,18 +298,33 @@ optimizer = MicroAdam(
 # Versions summary:

 ---
+- **1.1.8** @ February 5th, 2026:
+  - moved kernels to [ISTA-DASLab-Optimizers-CUDA](https://github.com/IST-DASLab/ISTA-DASLab-Optimizers-CUDA)
+  - building the package after adding a new optimizer that doesn't require CUDA support would require compiling
+    the kernels from scratch, which is time consuming and not needed
+- **1.1.7** @ October 8th, 2025:
+  - added code for `Trion & DCT-AdamW`
 - **1.1.6** @ February 19th, 2025:
-  - do not update the parameters that have `None` gradient in method `update_model` from `tools.py`. This is useful when using M-FAC for models with more than one classification head in the Continual Learning framework.
+  - do not update the parameters that have `None` gradient in method `update_model` from `tools.py`.
+    This is useful when using M-FAC for models with more than one classification head in the Continual Learning framework.
 - **1.1.5** @ February 19th, 2025:
-  - adapted `DenseMFAC` for a model with multiple classification heads for Continual Learning where we have one feature extractor block and a list of classification heads. The issue was related to the model size, which included the feature extractor backbone and all classification heads, but in practice only one classification head will be used for training and inference. This caused some size mismatch errors at runtime in the `DenseCoreMFAC` module because the gradient at runtime had fewer entries than the entire model. When using `DenseMFAC` for such settings, set `optimizer.model_size` to the correct size after calling the constructor and the `DenseCoreMFAC` object will be created automatically in the `step` function.
+  - adapted `DenseMFAC` for a model with multiple classification heads for Continual Learning where
+    we have one feature extractor block and a list of classification heads. The issue was related to
+    the model size, which included the feature extractor backbone and all classification heads, but
+    in practice only one classification head will be used for training and inference. This caused some
+    size mismatch errors at runtime in the `DenseCoreMFAC` module because the gradient at runtime had
+    fewer entries than the entire model. When using `DenseMFAC` for such settings, set `optimizer.model_size`
+    to the correct size after calling the constructor and the `DenseCoreMFAC` object will be created
+    automatically in the `step` function.
 - **1.1.3** @ September 5th, 2024:
   - allow using `SparseCoreMFACwithEF` separately by importing it in `sparse_mfac.__init__.py`
 - **1.1.2** @ August 1st, 2024:
-  - ***[1.1.0]:*** added support to densify the final update: introduced parameter alpha that controls
-    (EF) to be integrated into the update to make it dense. Finally, the
-    the expense of another call to `Qinv` and `Q` (and
-
-
+  - ***[1.1.0]:*** added support to densify the final update: introduced parameter alpha that controls
+    the fraction of error feedback (EF) to be integrated into the update to make it dense. Finally, the
+    fraction alpha will be discarded from the EF at the expense of another call to `Qinv` and `Q` (and
+    implicitly quantization statistics computation).
+  - ***[1.0.2]:*** added FSDP-compatible implementation by initializing the parameter states in the
+    `update_step` method instead of MicroAdam constructor
 - **1.0.1** @ June 27th, 2024:
   - removed version in dependencies to avoid conflicts with llm-foundry
 - **1.0.0** @ June 20th, 2024:
--- ista_daslab_optimizers-1.1.6/README.md
+++ ista_daslab_optimizers-1.1.8/README.md
@@ -15,6 +15,9 @@ The repository contains code for the following optimizers published by DASLab @
 - **MicroAdam**:
   - paper: [MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence](https://arxiv.org/abs/2405.15593)
   - official repository: [GitHub](https://github.com/IST-DASLab/MicroAdam)
+- **Trion / DCT-AdamW**:
+  - paper: [FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models](https://arxiv.org/abs/2505.17967v3)
+  - code: [GitHub](https://github.com/IST-DASLab/ISTA-DASLab-Optimizers/tree/main/ista_daslab_optimizers/fft_low_rank)

 ### Installation
 To use the latest stable version of this repository, you can install via pip:
@@ -36,7 +39,8 @@ source install.sh

 ## How to use optimizers?

-In this repository we provide a minimal working example for CIFAR-10 for optimizers `acdc`, `dense_mfac`, `sparse_mfac` and `micro_adam`:
+In this repository we provide a minimal working example for CIFAR-10 for optimizers `acdc`,
+`dense_mfac`, `sparse_mfac` and `micro_adam`:
 ```shell
 cd examples/cifar10
 OPTIMIZER=micro_adam # or any other optimizer listed above
@@ -66,18 +70,33 @@ optimizer = MicroAdam(
 # Versions summary:

 ---
+- **1.1.8** @ February 5th, 2026:
+  - moved kernels to [ISTA-DASLab-Optimizers-CUDA](https://github.com/IST-DASLab/ISTA-DASLab-Optimizers-CUDA)
+  - building the package after adding a new optimizer that doesn't require CUDA support would require compiling
+    the kernels from scratch, which is time consuming and not needed
+- **1.1.7** @ October 8th, 2025:
+  - added code for `Trion & DCT-AdamW`
 - **1.1.6** @ February 19th, 2025:
-  - do not update the parameters that have `None` gradient in method `update_model` from `tools.py`. This is useful when using M-FAC for models with more than one classification head in the Continual Learning framework.
+  - do not update the parameters that have `None` gradient in method `update_model` from `tools.py`.
+    This is useful when using M-FAC for models with more than one classification head in the Continual Learning framework.
 - **1.1.5** @ February 19th, 2025:
-  - adapted `DenseMFAC` for a model with multiple classification heads for Continual Learning where we have one feature extractor block and a list of classification heads. The issue was related to the model size, which included the feature extractor backbone and all classification heads, but in practice only one classification head will be used for training and inference. This caused some size mismatch errors at runtime in the `DenseCoreMFAC` module because the gradient at runtime had fewer entries than the entire model. When using `DenseMFAC` for such settings, set `optimizer.model_size` to the correct size after calling the constructor and the `DenseCoreMFAC` object will be created automatically in the `step` function.
+  - adapted `DenseMFAC` for a model with multiple classification heads for Continual Learning where
+    we have one feature extractor block and a list of classification heads. The issue was related to
+    the model size, which included the feature extractor backbone and all classification heads, but
+    in practice only one classification head will be used for training and inference. This caused some
+    size mismatch errors at runtime in the `DenseCoreMFAC` module because the gradient at runtime had
+    fewer entries than the entire model. When using `DenseMFAC` for such settings, set `optimizer.model_size`
+    to the correct size after calling the constructor and the `DenseCoreMFAC` object will be created
+    automatically in the `step` function.
 - **1.1.3** @ September 5th, 2024:
   - allow using `SparseCoreMFACwithEF` separately by importing it in `sparse_mfac.__init__.py`
 - **1.1.2** @ August 1st, 2024:
-  - ***[1.1.0]:*** added support to densify the final update: introduced parameter alpha that controls
-    (EF) to be integrated into the update to make it dense. Finally, the
-    the expense of another call to `Qinv` and `Q` (and
-
-
+  - ***[1.1.0]:*** added support to densify the final update: introduced parameter alpha that controls
+    the fraction of error feedback (EF) to be integrated into the update to make it dense. Finally, the
+    fraction alpha will be discarded from the EF at the expense of another call to `Qinv` and `Q` (and
+    implicitly quantization statistics computation).
+  - ***[1.0.2]:*** added FSDP-compatible implementation by initializing the parameter states in the
+    `update_step` method instead of MicroAdam constructor
 - **1.0.1** @ June 27th, 2024:
   - removed version in dependencies to avoid conflicts with llm-foundry
 - **1.0.0** @ June 20th, 2024:
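The 1.1.5 note in the versions summary above tells users to set `optimizer.model_size` by hand when `DenseMFAC` is applied to a multi-head Continual Learning model. A minimal sketch of that workflow follows; the `DenseMFAC` import path and constructor arguments are illustrative assumptions, and only the `optimizer.model_size` attribute and the lazy creation of `DenseCoreMFAC` inside `step` come from the release note.

```python
# Hypothetical sketch of the workflow from the 1.1.5 release note above.
# The DenseMFAC import path and constructor arguments are assumptions,
# not the exact API; only `optimizer.model_size` comes from the note.
import torch.nn as nn
from ista_daslab_optimizers import DenseMFAC  # assumed import path

backbone = nn.Linear(32, 16)                                   # shared feature extractor
heads = nn.ModuleList([nn.Linear(16, 10) for _ in range(3)])   # one head per task
model = nn.ModuleDict(dict(backbone=backbone, heads=heads))
task_id = 0                                                    # head being trained right now

optimizer = DenseMFAC(model.parameters(), lr=1e-3)  # assumed arguments

# Count only the parameters that actually receive gradients at this stage:
# the backbone plus the single active classification head.
active_size = sum(p.numel() for p in backbone.parameters())
active_size += sum(p.numel() for p in heads[task_id].parameters())

# Per the release note, DenseCoreMFAC is then created lazily inside step()
# using this corrected size instead of the full multi-head model size.
optimizer.model_size = active_size
```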
--- ista_daslab_optimizers-1.1.6/ista_daslab_optimizers/dense_mfac/dense_core_mfac.py
+++ ista_daslab_optimizers-1.1.8/ista_daslab_optimizers/dense_mfac/dense_core_mfac.py
@@ -4,10 +4,10 @@ import numpy as np

 USE_CUDA = True
 try:
-    import
+    import ista_daslab_cuda_dense_mfac
 except Exception as e:
     USE_CUDA = False
-    print('\n\t[WARNING] The module "
+    print('\n\t[WARNING] The module "ista_daslab_cuda_dense_mfac" is not installed, using slower PyTorch implementation!\n')

 class DenseCoreMFAC:
     def __init__(self, grads, dev, gpus, damp=1e-5, create_G=False):
@@ -76,7 +76,7 @@ class DenseCoreMFAC:

         if USE_CUDA:
             diag = torch.diag(torch.full(size=[self.m], fill_value=self.lambd, device=self.dev, dtype=self.dtype))
-            self.coef =
+            self.coef = ista_daslab_cuda_dense_mfac.hinv_setup(tmp, diag)
         else:
             for i in range(max(self.last, 1), self.m):
                 self.coef[i, :i] = tmp[i, :i].matmul(self.coef[:i, :i])
@@ -130,7 +130,7 @@ class DenseCoreMFAC:
         dots = self.compute_scalar_products(x)
         giHix = self.lambd * dots
         if USE_CUDA:
-            giHix =
+            giHix = ista_daslab_cuda_dense_mfac.hinv_mul(self.m, self.giHig, giHix)
         else:
             for i in range(1, self.m):
                 giHix[i:].sub_(self.giHig[i - 1, i:], alpha=giHix[i - 1] / self.denom[i - 1])
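Since 1.1.8 the compiled kernels ship in the separate `ista-daslab-optimizers-cuda` dependency (see the new `Requires-Dist` entries above), and the guarded import in `dense_core_mfac.py` silently falls back to pure PyTorch when the extension is missing. A quick sanity check, a sketch that simply mirrors that guarded import, is:

```python
# Mirrors the guarded import in dense_core_mfac.py above: when the compiled
# extension from ISTA-DASLab-Optimizers-CUDA is importable, the hinv_setup /
# hinv_mul kernels are used; otherwise the slower PyTorch loops run.
try:
    import ista_daslab_cuda_dense_mfac  # noqa: F401  (shipped by ista-daslab-optimizers-cuda)
    print('Dense M-FAC CUDA kernels available')
except Exception:
    print('CUDA kernels not installed; the pure-PyTorch fallback will be used')
```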
--- /dev/null
+++ ista_daslab_optimizers-1.1.8/ista_daslab_optimizers/fft_low_rank/dct_adamw.py
@@ -0,0 +1,351 @@
+import torch
+import torch.distributed as dist
+
+import math
+import numpy as np
+
+from fast_hadamard_transform import hadamard_transform
+from ista_daslab_optimizers.utils.dct import dct3_matrix
+from ista_daslab_optimizers.utils.quantizers import Quantizer4bit, Quantizer8bit
+from ista_daslab_optimizers.fft_low_rank.fft_projector import FFTLowRankProjector
+
+PROJ_DCT = 'dct'
+PROJ_HDM = 'hdm'
+PROJ_RAND_QR = 'rqr'
+
+ALL_PROJ = [
+    PROJ_DCT, # DCT projection
+    PROJ_HDM, # Hadamard projection
+    PROJ_RAND_QR, # Random-QR projection
+]
+
+STATE_M = 'm'
+STATE_V = 'v'
+STATE_Q = 'Q'
+STATE_ID = 'param-id'
+STATE_EF = 'ef'
+# STATE_EF_MIN = 'ef-min-vals'
+# STATE_EF_MAX = 'ef-max-vals'
+STATE_FFT_LRP = 'fft-low-rank-projector'
+STATE_BROADCAST_SOURCE = 'broadcast-src' # the process rank that computes the update for a parameter p will broadcast the parameter p to other workers
+
+
+class DCTAdamW(torch.optim.Optimizer):
+    def __init__(self,
+                 params,
+                 lr,
+                 weight_decay,
+                 rank,
+                 proj,
+                 use_ef=False,
+                 q_ef=False,
+                 distributed=False,
+                 update_proj_gap=1,
+                 rotate_subspace=False,
+                 sim_type='matmul',
+                 ell_norm=1,
+                 max_shape=32_000,
+                 betas=(0.9, 0.999),
+                 eps=1e-8):
+        assert proj in ALL_PROJ
+
+        super().__init__(params, dict(lr=lr, weight_decay=weight_decay))
+
+        self.rank = rank
+        self.proj = proj
+        self.use_ef = use_ef
+        self.q_ef = q_ef
+        self.distributed = distributed
+        self.update_proj_gap = update_proj_gap
+        self.rotate_subspace = rotate_subspace
+        self.sim_type = sim_type
+        self.ell_norm = ell_norm
+        self.max_shape = max_shape # apply low-rank to 2D parameters that have both dimensions smaller than max_shape
+        self.betas = betas
+        self.eps = eps
+
+        self.steps = 0
+        self.is_state_initialized = False
+        self.Q = None # the full transformation matrix (non-truncated, all rows and columns)
+        self.Q_cols_norm = None
+        self.use_theoretical_similarity = (self.ell_norm < 0)
+        self.ell_norm = abs(self.ell_norm)
+
+        if proj == PROJ_DCT:
+            assert sim_type in ['matmul', 'makhoul']
+        else:
+            assert sim_type == 'matmul'
+
+    def setup_Q(self, p):
+        if self.Q is None:
+            size = min(p.shape)
+            if self.proj == PROJ_DCT:
+                Qdct3 = dct3_matrix(size, p.dtype, p.device) # first row is zero
+                if self.sim_type == 'makhoul':
+                    self.Q = Qdct3.t()
+                    print(f'\n\t!!!!! Initialized DCT-2 matrix of size {size} !!!!!\n')
+                elif self.sim_type == 'matmul':
+                    self.Q = Qdct3
+                    print(f'\n\t!!!!! Initialized DCT-3 matrix of size {size} !!!!!\n')
+                else:
+                    raise RuntimeError(f'Unknown sim_type: {self.sim_type}')
+            elif self.proj == PROJ_HDM:
+                self.Q = hadamard_transform(torch.eye(size).to(device=p.device, dtype=p.dtype), scale=1. / math.sqrt(size))
+                print(f'\n\t!!!!! Initialized Hadamard matrix of size {size} !!!!!\n')
+            elif self.proj == PROJ_RAND_QR:
+                random = torch.randn(size, size, dtype=p.dtype, device=p.device)
+                self.Q, _ = torch.linalg.qr(random)
+                del random
+            else:
+                raise RuntimeError(f'Projection {self.proj} is currently not supported!')
+
+            if self.use_theoretical_similarity:
+                self.Q_cols_norm = self.Q.norm(p=self.ell_norm, dim=0)
+
+    def should_compute_update(self, p):
+        """
+        This function returns a boolean indicating whether the update for the parameter p should be computed on the current GPU
+        """
+        state = self.state[p]
+        param_id = state[STATE_ID]
+        return param_id % dist.get_world_size() == dist.get_rank()
+
+    def should_update_projection(self):
+        return self.steps == 1 or self.steps % self.update_proj_gap == 0
+
+    def init_state(self, p):
+        state = self.state[p]
+        if p.ndim == 1: # adam update
+            print(f'Parameter of size {tuple(p.shape)} will receive original AdamW update with state shape {tuple(p.shape)}')
+            state[STATE_M] = torch.zeros_like(p)
+            state[STATE_V] = torch.zeros_like(p)
+        elif p.ndim == 2: # low-rank adam update
+            n, m = p.shape
+            if n >= self.max_shape or m >= self.max_shape: # apply full-rank
+                print(f'Parameter of size {tuple(p.shape)} will receive original AdamW update with state shape {tuple(p.shape)}')
+                state[STATE_M] = torch.zeros_like(p)
+                state[STATE_V] = torch.zeros_like(p)
+            else: # apply low-rank using the DCT transform as orthogonal matrix
+                if n >= m:
+                    low_rank_shape = (n, self.rank)
+                else:
+                    # fix for Llama-3-8B that has a layer of size (1024, 4096)
+                    # fix for Qwen2.5-7B that has a layer of size (512, 3584)
+                    if n in [512, 1024] and m in [3584, 4096]:
+                        low_rank_shape = (n, self.rank)
+                    else:
+                        low_rank_shape = (self.rank, m)
+                # low_rank_shape = (n, self.rank) if n >= m else (self.rank, m)
+                print(f'Parameter of size {tuple(p.shape)} will receive low-rank update with state shape {low_rank_shape}')
+                state[STATE_M] = torch.zeros(*low_rank_shape, dtype=p.dtype, device=p.device)
+                state[STATE_V] = torch.zeros(*low_rank_shape, dtype=p.dtype, device=p.device)
+                state[STATE_FFT_LRP] = FFTLowRankProjector(p,
+                                                           rank=self.rank,
+                                                           proj=self.proj,
+                                                           rotate_subspace=self.rotate_subspace,
+                                                           sim_type=self.sim_type,
+                                                           ell_norm=self.ell_norm,
+                                                           use_th_sim=self.use_theoretical_similarity)
+                if self.use_ef:
+                    if self.q_ef > 0:
+                        # state[STATE_EF] = torch.zeros(p.numel() // 2, dtype=torch.uint8, device=p.device)
+                        # state[STATE_EF_MIN] = torch.zeros(p.shape[0], dtype=torch.bfloat16, device=p.device)
+                        # state[STATE_EF_MAX] = torch.zeros(p.shape[0], dtype=torch.bfloat16, device=p.device)
+                        quantClass = {4: Quantizer4bit, 8: Quantizer8bit}[self.q_ef]
+                        if self.q_ef == 4:
+                            quantClass = Quantizer4bit
+                            print(f'\n\t!!!!! Quantizing EF to 4 bits !!!!!\n')
+                        elif self.q_ef == 8:
+                            quantClass = Quantizer8bit
+                            print(f'\n\t!!!!! Quantizing EF to 8 bits !!!!!\n')
+                        else:
+                            raise RuntimeError(f'Quantization on {self.q_ef} bits is currently not supported!')
+                        state[STATE_EF] = quantClass(shape=p.shape, device=p.device, dtype=p.dtype, bucket_size=p.shape[1])
+                    else:
+                        state[STATE_EF] = torch.zeros_like(p)
+
+                ### initialize Q
+                print('calling setup_Q')
+                self.setup_Q(p)
+        # end if
+
+    def init(self):
+        # init broadcast info
+        self.is_state_initialized = True
+        bcast_src_list = []
+        param_id = 0 # parameter id
+        for group in self.param_groups:
+            for p in group['params']:
+                if p is None: continue
+                if p.grad is None: continue
+
+                state = self.state[p]
+                if len(state) == 0:
+                    if self.distributed:
+                        state[STATE_ID] = param_id
+                        param_id += 1
+                        if self.should_compute_update(p):
+                            # if the current process computes the update, then it will also broadcast the parameters to all other workers
+                            state[STATE_BROADCAST_SOURCE] = torch.tensor(dist.get_rank(), dtype=torch.int32, device=f'cuda:{dist.get_rank()}')
+                            self.init_state(p)
+                        else:
+                            # p.register_hook(lambda grad: None) # set gradient to None
+                            # p.requires_grad = False # disable gradient computation for this layer
+                            state[STATE_BROADCAST_SOURCE] = torch.tensor(0, dtype=torch.int32, device=f'cuda:{dist.get_rank()}') # zero means empty here because we will do an all reduce
+                        bcast_src_list.append(state[STATE_BROADCAST_SOURCE].item())
+                    else:
+                        self.init_state(p)
+        # end for group
+
+        if self.distributed:
+            dist.barrier()
+
+            # with open(f'broadcast-{dist.get_rank()}.txt', 'w') as w:
+            # sync broadcast source
+            # w.write(f'Broadcast SRC on worker {dist.get_rank()} before all_reduce: {",".join(map(str, bcast_src_list))}\n')
+            bcast_src_list = []
+            for group in self.param_groups:
+                for p in group['params']:
+                    if p is None: continue
+                    if p.grad is None: continue
+
+                    state = self.state[p]
+                    dist.all_reduce(state[STATE_BROADCAST_SOURCE], op=dist.ReduceOp.SUM)
+                    state[STATE_BROADCAST_SOURCE] = state[STATE_BROADCAST_SOURCE].item()
+                    bcast_src_list.append(state[STATE_BROADCAST_SOURCE])
+            # end for group
+            # w.write(f'Broadcast SRC on worker {dist.get_rank()} after all_reduce: {",".join(map(str, bcast_src_list))}\n')
+            dist.barrier()
+        # end if
+        torch.cuda.empty_cache()
+
+    @torch.no_grad()
+    def step(self, closure=None):
+        self.steps += 1
+
+        loss = None
+        if closure is not None:
+            with torch.enable_grad():
+                loss = closure()
+
+        if not self.is_state_initialized:
+            self.init() # init broadcast info
+
+        for group in self.param_groups:
+            lr = group['lr']
+            wd = group['weight_decay']
+
+            for p in group['params']:
+                if p is None: continue
+                if p.grad is None: continue
+
+                if wd > 0:
+                    p.mul_(1 - lr * wd)
+
+                if self.distributed:
+                    if self.should_compute_update(p):
+                        self.update_step(p, lr)
+                else:
+                    self.update_step(p, lr)
+        # end for group
+
+        if self.distributed:
+            for group in self.param_groups:
+                for p in group['params']:
+                    if p is None: continue
+                    if p.grad is None: continue
+
+                    dist.broadcast(p, src=self.state[p][STATE_BROADCAST_SOURCE])
+
+            # end for group
+            dist.barrier() # wait for all GPUs to compute the update for all layers
+        # end if distributed
+        return loss
+
+    @torch.no_grad()
+    def update_step(self, p, lr):
+        if p.ndim == 1: # adam update
+            self.adamw_step(p, lr)
+        elif p.ndim == 2: # low-rank adam update
+            n, m = p.shape
+            if n >= self.max_shape or m >= self.max_shape: # apply full-rank for parameters that have at least one dimension >= max_size (e.g. embeddings and lm_head)
+                self.adamw_step(p, lr)
+            else:
+                self.dct_low_rank_step(p, lr)
+
+    def dct_low_rank_step(self, p, lr):
+        beta1, beta2 = self.betas
+        bc1 = 1 - beta1 ** self.steps
+        sqrt_bc2 = math.sqrt(1 - beta2 ** self.steps)
+        adjusted_lr = -lr * sqrt_bc2 / bc1
+
+        A = p.grad # initially, the accumulator stores gradient and a bit later we will add the error feedback
+        state = self.state[p]
+
+        mt = state[STATE_M]
+        vt = state[STATE_V]
+
+        if self.use_ef:
+            E = state[STATE_EF]
+            if self.q_ef:
+                # see step 4 from Algorithm 1 in the MicroAdam paper https://arxiv.black/pdf/2405.15593
+                A.add_(E.quantize_inv()) # p.grad += Qinv(EF)
+            else:
+                A.add_(E)
+
+        clrp: FFTLowRankProjector = state[STATE_FFT_LRP]
+        clrp.inc_step()
+
+        if self.should_update_projection():
+            a = clrp.change_subspace(self.Q, A, col_norms=self.Q_cols_norm)
+        else:
+            ### compute low-rank accumulator a
+            a = clrp.from_higher_to_lower_dimensions(self.Q, A)
+
+        if self.use_ef:
+            A_reconstructed = clrp.from_lower_to_higher_dimensions(self.Q, a)
+            if self.q_ef:
+                A.sub_(A_reconstructed) # the full precision EF is stored now in A
+                # see step 8 from Algorithm 1 in the MicroAdam paper https://arxiv.black/pdf/2405.15593
+                E.quantize(A)
+            else:
+                E.copy_(A).sub_(A_reconstructed)
+            del A_reconstructed
+
+        ### update momentum m and v (rotate first, if needed)
+        if self.steps > 1 and self.rotate_subspace and self.should_update_projection():
+            R = clrp.get_subspace_rotation_matrix(self.Q)
+            clrp.rotate_subspace(R, mt)
+            clrp.rotate_subspace(R, vt)
+            vt.abs_() # make sure vt is positive
+            del R
+
+        mt.mul_(beta1).add_(a, alpha=1 - beta1)
+        vt.mul_(beta2).addcmul_(a, a, value=1 - beta2)
+
+        u = mt / (self.eps * sqrt_bc2 + vt.sqrt())
+        clrp.from_lower_to_higher_dimensions(self.Q, u, out=p.grad)
+        del u, a
+
+        p.add_(p.grad, alpha=adjusted_lr)
+
+    @torch.no_grad()
+    def adamw_step(self, p, lr):
+        state = self.state[p]
+        g = p.grad
+
+        mt = state[STATE_M]
+        vt = state[STATE_V]
+
+        beta1, beta2 = self.betas
+        bc1 = 1 - beta1 ** self.steps
+        sqrt_bc2 = math.sqrt(1 - beta2 ** self.steps)
+        adjusted_lr = -lr * sqrt_bc2 / bc1
+
+        # update momentum m and v
+        mt.mul_(beta1).add_(g, alpha=1-beta1)
+        vt.mul_(beta2).addcmul_(g, g, value=1-beta2)
+
+        # U = mt / (self.eps * sqrt_bc2 + vt.sqrt())
+        g.copy_(vt).sqrt_().add_(self.eps * sqrt_bc2).div_(mt).reciprocal_()
+        p.add_(g, alpha=adjusted_lr)