compressed-tensors 0.5.0__py3-none-any.whl → 0.6.0__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- compressed_tensors/compressors/base.py +200 -8
- compressed_tensors/compressors/model_compressor.py +68 -1
- compressed_tensors/compressors/naive_quantized.py +71 -75
- compressed_tensors/compressors/pack_quantized.py +83 -94
- compressed_tensors/config/base.py +6 -1
- compressed_tensors/linear/__init__.py +13 -0
- compressed_tensors/linear/compressed_linear.py +87 -0
- compressed_tensors/quantization/lifecycle/apply.py +46 -8
- compressed_tensors/quantization/lifecycle/calibration.py +5 -4
- compressed_tensors/quantization/lifecycle/compressed.py +3 -1
- compressed_tensors/quantization/lifecycle/forward.py +76 -43
- compressed_tensors/quantization/lifecycle/helpers.py +29 -2
- compressed_tensors/quantization/lifecycle/initialize.py +51 -16
- compressed_tensors/quantization/observers/__init__.py +1 -0
- compressed_tensors/quantization/observers/base.py +54 -14
- compressed_tensors/quantization/observers/min_max.py +8 -0
- compressed_tensors/quantization/observers/mse.py +162 -0
- compressed_tensors/quantization/quant_args.py +96 -24
- compressed_tensors/quantization/quant_scheme.py +7 -9
- compressed_tensors/quantization/utils/helpers.py +1 -1
- compressed_tensors/utils/__init__.py +1 -0
- compressed_tensors/utils/helpers.py +13 -0
- compressed_tensors/utils/offload.py +14 -2
- compressed_tensors/utils/permute.py +70 -0
- compressed_tensors/utils/safetensors_load.py +2 -0
- compressed_tensors/utils/semi_structured_conversions.py +1 -0
- compressed_tensors/version.py +1 -1
- {compressed_tensors-0.5.0.dist-info → compressed_tensors-0.6.0.dist-info}/METADATA +35 -23
- compressed_tensors-0.6.0.dist-info/RECORD +52 -0
- {compressed_tensors-0.5.0.dist-info → compressed_tensors-0.6.0.dist-info}/WHEEL +1 -1
- compressed_tensors-0.5.0.dist-info/RECORD +0 -48
- {compressed_tensors-0.5.0.dist-info → compressed_tensors-0.6.0.dist-info}/LICENSE +0 -0
- {compressed_tensors-0.5.0.dist-info → compressed_tensors-0.6.0.dist-info}/top_level.txt +0 -0
compressed_tensors/quantization/observers/mse.py
ADDED
@@ -0,0 +1,162 @@
+# Copyright (c) 2021 - present / Neuralmagic, Inc. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from typing import Any, Optional, Tuple
+
+import torch
+from compressed_tensors.quantization.observers.base import Observer
+from compressed_tensors.quantization.observers.helpers import calculate_qparams
+from compressed_tensors.quantization.quant_args import QuantizationArgs
+from torch import FloatTensor, IntTensor, Tensor
+
+
+__all__ = ["MovingAverageMSEObserver"]
+
+
+@Observer.register("mse")
+class MovingAverageMSEObserver(Observer):
+    """
+    Implements a dynamic quantization observer that sets the scale and
+    zero point based on a moving average of the mse-clipped min and max observed values
+    """
+
+    def __init__(
+        self,
+        quantization_args: QuantizationArgs,
+        averaging_constant: float = 0.01,
+        grid: float = 100.0,
+        maxshrink: float = 0.80,
+        norm: float = 2.4,
+    ):
+        super().__init__(quantization_args=quantization_args)
+
+        self.min_val = {}
+        self.max_val = {}
+        self.averaging_constant = averaging_constant
+        self.grid = grid
+        self.maxshrink = maxshrink
+        self.norm = norm
+
+    def calculate_mse_min_max(
+        self,
+        observed: Tensor,
+        reduce_dims: Optional[Tuple[int]] = None,
+    ):
+        """
+        Computes the mse-clipped min and max values of the observed tensor by
+        optimizing for quantization error
+
+        :param observed: observed tensor to calculate quantization parameters for
+        :param reduce_dims: optional tuple of dimensions to reduce along,
+            returned values will be shaped (1,) along the reduced dimensions
+        :return: tuple of min and max values derived from the observed tensor
+        """
+        from compressed_tensors.quantization.lifecycle import fake_quantize
+
+        if not reduce_dims:
+            absolute_min_val, absolute_max_val = torch.aminmax(observed)
+        else:
+            absolute_min_val = torch.amin(observed, dim=reduce_dims, keepdims=True)
+            absolute_max_val = torch.amax(observed, dim=reduce_dims, keepdims=True)
+
+        best = torch.full(absolute_min_val.shape, float("inf"))
+        min_val = torch.ones(absolute_min_val.shape)
+        max_val = torch.zeros(absolute_max_val.shape)
+        for i in range(int(self.maxshrink * self.grid)):
+            p = 1 - i / self.grid
+            shrinked_min_val = p * absolute_min_val
+            shrinked_max_val = p * absolute_max_val
+
+            candidate_scales, candidate_zero_points = calculate_qparams(
+                shrinked_min_val, shrinked_max_val, self.quantization_args
+            )
+            q = fake_quantize(
+                observed,
+                candidate_scales,
+                candidate_zero_points,
+                self.quantization_args,
+            )
+
+            q -= observed
+            q.abs_()
+            q.pow_(self.norm)
+            if not reduce_dims:
+                err = torch.sum(q)
+            else:
+                err = torch.sum(q, reduce_dims, keepdims=True)
+
+            tmp = err < best
+            if torch.any(tmp):
+                best[tmp] = err[tmp]
+                min_val[tmp] = shrinked_min_val[tmp]
+                max_val[tmp] = shrinked_max_val[tmp]
+        return min_val, max_val
+
+    def calculate_qparams(
+        self,
+        observed: Tensor,
+        reduce_dims: Optional[Tuple[int]] = None,
+        tensor_id: Optional[Any] = None,
+    ) -> Tuple[FloatTensor, IntTensor]:
+        """
+        Updates the mse-clipped min and max values of the observed tensor using
+        a moving average smoothed by the averaging_constant
+
+        :param observed: observed tensor to calculate quantization parameters for
+        :param reduce_dims: optional tuple of dimensions to reduce along,
+            returned scale and zero point will be shaped (1,) along the
+            reduced dimensions
+        :param tensor_id: Optional id if different ranges of observed tensors are
+            passed, useful for sharding tensors by group_size
+        :return: tuple of scale and zero point derived from the observed tensor
+        """
+        min_val, max_val = self.calculate_mse_min_max(observed, reduce_dims)
+
+        running_min_val = self.min_val.get(tensor_id, None)
+        running_max_val = self.max_val.get(tensor_id, None)
+
+        if running_min_val is None or running_max_val is None:
+            updated_min_val = min_val
+            updated_max_val = max_val
+        else:
+            updated_min_val = running_min_val + self.averaging_constant * (
+                min_val - running_min_val
+            )
+            updated_max_val = running_max_val + self.averaging_constant * (
+                max_val - running_max_val
+            )
+
+        tensor_id = tensor_id or "default"
+        self.min_val[tensor_id] = updated_min_val
+        self.max_val[tensor_id] = updated_max_val
+
+        return calculate_qparams(
+            updated_min_val, updated_max_val, self.quantization_args
+        )
+
+    def get_qparams_along_dim(
+        self, observed, dim: int, tensor_id: Optional[Any] = None
+    ):
+        reduce_dims = tuple(idx for idx in range(observed.ndim) if idx != dim)
+        return self.calculate_qparams(
+            observed, reduce_dims=reduce_dims, tensor_id=tensor_id
+        )
+
+    def reset(self):
+        """
+        Reset the state of the observer, including min and maximum values
+        """
+        super().reset()
+        self.min_val = {}
+        self.max_val = {}
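A minimal usage sketch of the new observer (not part of the diff); it relies only on the registry call shown above, and the tensor shape and bit width are illustrative:

```python
import torch

from compressed_tensors.quantization.observers.base import Observer
from compressed_tensors.quantization.quant_args import QuantizationArgs

# "mse" resolves to MovingAverageMSEObserver through the Observer registry
args = QuantizationArgs(num_bits=4, symmetric=True, observer="mse")
observer = Observer.load_from_registry("mse", quantization_args=args)

weight = torch.randn(256, 512)  # illustrative weight tensor
scale, zero_point = observer.calculate_qparams(weight)
```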
compressed_tensors/quantization/quant_args.py
CHANGED
@@ -13,10 +13,10 @@
 # limitations under the License.
 
 from enum import Enum
-from typing import Any, Dict, Optional
+from typing import Any, Dict, Optional, Union
 
 import torch
-from pydantic import BaseModel, Field,
+from pydantic import BaseModel, Field, field_validator, model_validator
 
 
 __all__ = [
@@ -25,6 +25,7 @@ __all__ = [
     "QuantizationStrategy",
     "QuantizationArgs",
     "round_to_quantized_type",
+    "ActivationOrdering",
 ]
 
 FP8_DTYPE = torch.float8_e4m3fn
@@ -51,6 +52,19 @@ class QuantizationStrategy(str, Enum):
     TOKEN = "token"
 
 
+class ActivationOrdering(str, Enum):
+    """
+    Enum storing strategies for activation ordering
+
+    Group: reorder groups and weight\n
+    Weight: only reorder weight, not groups. Slightly lower latency and
+        accuracy compared to group actorder\n
+    """
+
+    GROUP = "group"
+    WEIGHT = "weight"
+
+
 class QuantizationArgs(BaseModel, use_enum_values=True):
     """
     User facing arguments used to define a quantization config for weights or
@@ -68,15 +82,18 @@ class QuantizationArgs(BaseModel, use_enum_values=True):
         ranges will be observed with every sample. Defaults to False for static
         quantization. Note that enabling dynamic quantization will change the default
         observer to a memoryless one
+    :param actorder: whether to apply group quantization in decreasing order of
+        activation. Defaults to None for arbitrary ordering
     """
 
     num_bits: int = 8
-    type: QuantizationType = QuantizationType.INT
+    type: QuantizationType = QuantizationType.INT
     symmetric: bool = True
     group_size: Optional[int] = None
     strategy: Optional[QuantizationStrategy] = None
     block_structure: Optional[str] = None
     dynamic: bool = False
+    actorder: Union[ActivationOrdering, bool, None] = None
     observer: str = Field(
         default="minmax",
         description=(
@@ -98,41 +115,96 @@ class QuantizationArgs(BaseModel, use_enum_values=True):
         """
         from compressed_tensors.quantization.observers.base import Observer
 
-        if self.
+        if self.dynamic:
             # override defualt observer for dynamic, you never want minmax which
             # keeps state across samples for dynamic
             self.observer = "memoryless"
 
         return Observer.load_from_registry(self.observer, quantization_args=self)
 
-    @
-    def
-
+    @field_validator("type", mode="before")
+    def validate_type(cls, value) -> QuantizationType:
+        if isinstance(value, str):
+            return QuantizationType(value.lower())
 
-
-        if group_size is not None and value is None:
-            if group_size > 0:
-                return QuantizationStrategy.GROUP
+        return value
 
-
-
+    @field_validator("group_size", mode="before")
+    def validate_group(cls, value) -> Union[int, None]:
+        if value is None:
+            return value
 
-
-
-
-
-
-            )
+        if value < -1:
+            raise ValueError(
+                f"Invalid group size {value}. Use group_size > 0 for "
+                "strategy='group' and group_size = -1 for 'channel'"
+            )
 
-
-        if group_size is None:
-            raise ValueError(f"strategy {value} requires group_size to be set.")
+        return value
 
-
-
+    @field_validator("strategy", mode="before")
+    def validate_strategy(cls, value) -> Union[QuantizationStrategy, None]:
+        if isinstance(value, str):
+            return QuantizationStrategy(value.lower())
+
+        return value
+
+    @field_validator("actorder", mode="before")
+    def validate_actorder(cls, value) -> Optional[ActivationOrdering]:
+        if isinstance(value, bool):
+            return ActivationOrdering.GROUP if value else None
+
+        if isinstance(value, str):
+            return ActivationOrdering(value.lower())
 
         return value
 
+    @model_validator(mode="after")
+    def validate_model_after(model: "QuantizationArgs") -> Dict[str, Any]:
+        # extract user-passed values from dictionary
+        strategy = model.strategy
+        group_size = model.group_size
+        actorder = model.actorder
+
+        # infer strategy
+        if strategy is None:
+            if group_size is None:
+                strategy = QuantizationStrategy.TENSOR
+            elif group_size > 0:
+                strategy = QuantizationStrategy.GROUP
+            elif group_size == -1:
+                strategy = QuantizationStrategy.CHANNEL
+            else:
+                raise ValueError(
+                    f"Invalid group size {group_size}. Use group_size > 0 for "
+                    "strategy='group' and group_size = -1 for 'channel'"
+                )
+
+        # validate strategy and group
+        if strategy == QuantizationStrategy.GROUP:
+            if group_size is None or group_size <= 0:
+                raise ValueError(
+                    f"strategy {strategy} requires group_size to be "
+                    "set to a positive value"
+                )
+        if (
+            group_size is not None
+            and group_size > 0
+            and strategy != QuantizationStrategy.GROUP
+        ):
+            raise ValueError("group_size requires strategy to be set to 'group'")
+
+        # validate activation ordering and strategy
+        if actorder is not None and strategy != QuantizationStrategy.GROUP:
+            raise ValueError(
+                "Must use group quantization strategy in order to apply "
+                "activation ordering"
+            )
+
+        # write back modified values
+        model.strategy = strategy
+        return model
+
     def pytorch_dtype(self) -> torch.dtype:
         if self.type == QuantizationType.FLOAT:
            return FP8_DTYPE
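A sketch of how the new validators behave (not part of the diff); the module path matches the file above and the concrete argument values are illustrative:

```python
from compressed_tensors.quantization.quant_args import (
    ActivationOrdering,
    QuantizationArgs,
    QuantizationStrategy,
)

# strategy is inferred from group_size, and actorder=True normalizes to GROUP
args = QuantizationArgs(num_bits=4, group_size=128, actorder=True)
assert args.strategy == QuantizationStrategy.GROUP
assert args.actorder == ActivationOrdering.GROUP

# mismatched strategy/group_size combinations are rejected by validate_model_after
try:
    QuantizationArgs(num_bits=4, group_size=128, strategy="tensor")
except ValueError:
    pass  # "group_size requires strategy to be set to 'group'"
```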
compressed_tensors/quantization/quant_scheme.py
CHANGED
@@ -57,15 +57,9 @@ class QuantizationScheme(BaseModel):
         # default to quantizing all Linear layers
         targets = ["Linear"]
 
-        # default
-
-
-
-        # default to 8 bit integer asymmetric quantization
-        input_activations = QuantizationArgs(num_bits=8, symmetric=True)
-
-        # Do not quantize the output activations
-        # by default
+        # by default, activations and weights are left unquantized
+        weights = None
+        input_activations = None
         output_activations = None
 
         return cls(
@@ -111,6 +105,8 @@ def is_preset_scheme(name: str) -> bool:
     return name.upper() in PRESET_SCHEMES
 
 
+UNQUANTIZED = dict()
+
 # 8 bit integer weights and 8 bit activations quantization
 W8A8 = dict(
     weights=QuantizationArgs(
@@ -208,6 +204,8 @@ FP8_DYNAMIC = dict(
 )
 
 PRESET_SCHEMES = {
+    # Unquantized (no-op)
+    "UNQUANTIZED": UNQUANTIZED,
     # Integer weight only schemes
     "W8A16": W8A16,
     "W4A16": W4A16,
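The new preset can be looked up like any other; a small sketch (not part of the diff) assuming the module path above:

```python
from compressed_tensors.quantization.quant_scheme import PRESET_SCHEMES, is_preset_scheme

assert is_preset_scheme("unquantized")      # lookup is case-insensitive
assert PRESET_SCHEMES["UNQUANTIZED"] == {}  # no weight or activation args attached
```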
compressed_tensors/quantization/utils/helpers.py
CHANGED
@@ -181,7 +181,7 @@ def calculate_compression_ratio(model: Module) -> float:
     for parameter in model.parameters():
         uncompressed_bits = get_torch_bit_depth(parameter)
         compressed_bits = uncompressed_bits
-        if is_module_quantized(submodule):
+        if is_module_quantized(submodule) and submodule.quantization_scheme.weights:
             compressed_bits = submodule.quantization_scheme.weights.num_bits
 
         num_weights = parameter.numel()
compressed_tensors/utils/helpers.py
CHANGED
@@ -22,6 +22,7 @@ __all__ = [
     "infer_compressor_from_model_config",
     "fix_fsdp_module_name",
     "tensor_follows_mask_structure",
+    "replace_module",
 ]
 
 FSDP_WRAPPER_NAME = "_fsdp_wrapped_module"
@@ -90,3 +91,15 @@ def tensor_follows_mask_structure(tensor, mask: str = "2:4") -> bool:
         raise ValueError()
 
     return True
+
+
+def replace_module(model: torch.nn.Module, name: str, new_module: torch.nn.Module):
+    if "." in name:
+        parent_name = name.rsplit(".", 1)[0]
+        child_name = name[len(parent_name) + 1 :]
+        parent = model.get_submodule(parent_name)
+    else:
+        parent_name = ""
+        parent = model
+        child_name = name
+    setattr(parent, child_name, new_module)
compressed_tensors/utils/offload.py
CHANGED
@@ -40,7 +40,13 @@ def get_execution_device(module: Module) -> torch.device:
     """
     if is_module_offloaded(module):
         return module._hf_hook.execution_device
-
+    device = next(module.parameters()).device
+
+    # offload only gets set for leaf modules, fallback to checking for device type
+    if device.type == "meta":
+        return module._hf_hook.execution_device
+
+    return device
 
 
 def get_offloaded_device(module: Module) -> torch.device:
@@ -83,8 +89,11 @@ update_parameter_data(
 
     :param module: layer containing the parameter to update
     :param new_param_data: tensor to update parameter with
-    :param param_name:
+    :param param_name: name of layer parameter to update
     """
+    if not hasattr(module, param_name):
+        return
+
     device = next(module.parameters()).device
 
     offloaded = False
@@ -93,6 +102,9 @@
         offloaded = True
 
     parameter = getattr(module, param_name, None)
+    if parameter is None:
+        raise ValueError("Attempted to update uninitialized parameter")
+
     dtype = parameter.dtype
     parameter.data = new_param_data.to(device).to(dtype)
 
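A sketch of the guarded update path (not part of the diff), assuming the (module, new_param_data, param_name) argument order implied by the docstring:

```python
import torch

from compressed_tensors.utils.offload import update_parameter_data

linear = torch.nn.Linear(8, 8)
update_parameter_data(linear, torch.zeros(8, 8), "weight")     # updates in place
update_parameter_data(linear, torch.zeros(1), "weight_scale")  # missing attribute, returns early
```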
compressed_tensors/utils/permute.py
ADDED
@@ -0,0 +1,70 @@
+# Copyright (c) 2021 - present / Neuralmagic, Inc. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from typing import Set, Tuple
+
+import torch
+
+
+__all__ = ["safe_permute"]
+
+
+# these datatypes are missing implementations required for standard permutation
+_EXPERIMENTAL_DTYPES: Set[Tuple[torch.dtype, torch.device]] = set()
+
+
+def safe_permute(value: torch.Tensor, perm: torch.Tensor, dim: int = 0) -> torch.Tensor:
+    """
+    Perform out-of-place permutation without using torch.Tensor.index_put_,
+    whose implementation is missing for datatypes such as `torch.float8_e4m3fn`
+
+    :param value: tensor to permute
+    :param perm: permutation map
+    :param dim: dimension along which to apply permutation
+    :return: permuted value
+    """
+    dtype_tuple = (value.dtype, value.device)
+
+    if dtype_tuple in _EXPERIMENTAL_DTYPES:
+        return _fallback_permute(value, perm, dim)
+
+    try:
+        return value[tuple([slice(None)] * dim + [perm])]
+    except RuntimeError:
+        # Mark dtype as experimental if advanced indexing fails
+        _EXPERIMENTAL_DTYPES.add(dtype_tuple)
+        return _fallback_permute(value, perm, dim)
+
+
+def _fallback_permute(
+    value: torch.Tensor, perm: torch.Tensor, dim: int
+) -> torch.Tensor:
+    """
+    Fallback permutation method for experimental dtypes.
+
+    :param value: tensor to permute
+    :param perm: permutation map
+    :param dim: dimension along which to apply permutation
+    :return: permuted value
+    """
+    value_ret = value.clone()  # cannot use zeros_like b/c of missing impl.
+    orig_slices = [slice(None)] * (dim + 1)
+    perm_slices = [slice(None)] * (dim + 1)
+
+    for index, perm_index in enumerate(perm):
+        orig_slices[dim] = index
+        perm_slices[dim] = perm_index
+        value_ret[tuple(orig_slices)] = value[tuple(perm_slices)]
+
+    return value_ret
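A usage sketch for safe_permute (not part of the diff) on a dtype without full advanced-indexing support; the shapes are illustrative:

```python
import torch

from compressed_tensors.utils.permute import safe_permute

weight = torch.randn(4, 8).to(torch.float8_e4m3fn)
perm = torch.randperm(8)

# equivalent to weight[:, perm]; falls back to a per-index copy if indexing
# is unimplemented for this dtype/device combination
permuted = safe_permute(weight, perm, dim=1)
```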
compressed_tensors/version.py
CHANGED

{compressed_tensors-0.5.0.dist-info → compressed_tensors-0.6.0.dist-info}/METADATA
CHANGED
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: compressed-tensors
-Version: 0.5.0
+Version: 0.6.0
 Summary: Library for utilization of compressed safetensors of neural network models
 Home-page: https://github.com/neuralmagic/compressed-tensors
 Author: Neuralmagic, Inc.
@@ -8,44 +8,56 @@ Author-email: support@neuralmagic.com
 License: Apache 2.0
 Description-Content-Type: text/markdown
 License-File: LICENSE
-Requires-Dist: torch
+Requires-Dist: torch>=1.7.0
 Requires-Dist: transformers
-Requires-Dist:
-
+Requires-Dist: pydantic>=2.0
+Provides-Extra: accelerate
+Requires-Dist: accelerate; extra == "accelerate"
 Provides-Extra: dev
-Requires-Dist: black
-Requires-Dist: isort
-Requires-Dist: wheel
-Requires-Dist: flake8
-Requires-Dist: pytest
-Requires-Dist: nbconvert
+Requires-Dist: black==22.12.0; extra == "dev"
+Requires-Dist: isort==5.8.0; extra == "dev"
+Requires-Dist: wheel>=0.36.2; extra == "dev"
+Requires-Dist: flake8>=3.8.3; extra == "dev"
+Requires-Dist: pytest>=6.0.0; extra == "dev"
+Requires-Dist: nbconvert>=7.16.3; extra == "dev"
 
-#
+# compressed-tensors
 
-
+The `compressed-tensors` library extends the [safetensors](https://github.com/huggingface/safetensors) format, providing a versatile and efficient way to store and manage compressed tensor data. This library supports various quantization and sparsity schemes, making it a unified format for handling different model optimizations like GPTQ, AWQ, SmoothQuant, INT8, FP8, SparseGPT, and more.
 
-##
+## Why `compressed-tensors`?
 
-
+As model compression becomes increasingly important for efficient deployment of LLMs, the landscape of quantization and compression techniques has become increasingly fragmented.
+Each method often comes with its own storage format and loading procedures, making it challenging to work with multiple techniques or switch between them.
+`compressed-tensors` addresses this by providing a single, extensible format that can represent a wide variety of compression schemes.
 
-
+* **Unified Checkpoint Format**: Supports various compression schemes in a single, consistent format.
+* **Wide Compatibility**: Works with popular quantization methods like GPTQ, SmoothQuant, and FP8. See [llm-compressor](https://github.com/vllm-project/llm-compressor)
+* **Flexible Quantization Support**:
+  * Weight-only quantization (e.g., W4A16, W8A16, WnA16)
+  * Activation quantization (e.g., W8A8)
+  * KV cache quantization
+  * Non-uniform schemes (different layers can be quantized in different ways!)
+* **Sparsity Support**: Handles both unstructured and semi-structured (e.g., 2:4) sparsity patterns.
+* **Open-Source Integration**: Designed to work seamlessly with Hugging Face models and PyTorch.
 
-
-- Quantized -> due to their low precision representation.
-
-### Introduce an elegant interface to save/load compressed tensors
-
-The library provides the user with the ability to compress/decompress tensors. The properties of tensors are defined by human-readable configs, allowing the users to understand the compression format at a quick glance.
+This allows developers and researchers to easily experiment with composing different quantization methods, simplify model deployment pipelines, and reduce the overhead of supporting multiple compression formats in inference engines.
 
 ## Installation
 
-###
+### From [PyPI](https://pypi.org/project/compressed-tensors)
 
+Stable release:
 ```bash
 pip install compressed-tensors
 ```
 
-
+Nightly release:
+```bash
+pip install compressed-tensors-nightly
+```
+
+### From Source
 
 ```bash
 git clone https://github.com/neuralmagic/compressed-tensors
|