PyPI - neural-compressor - Versions diffs - 3.1__tar.gz → 3.2__tar.gz - Mend

neural-compressor 3.1__tar.gz → 3.2__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (599)

{neural_compressor-3.1 → neural_compressor-3.2}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: neural_compressor
-Version: 3.1
+Version: 3.2
 Summary: Repository of Intel® Neural Compressor
 Home-page: https://github.com/intel/neural-compressor
 Author: Intel AIPT Team
@@ -51,7 +51,7 @@ Intel® Neural Compressor
 <h3> An open-source Python library supporting popular model compression techniques on all mainstream deep learning frameworks (TensorFlow, PyTorch, and ONNX Runtime)</h3>
 [![python](https://img.shields.io/badge/python-3.8%2B-blue)](https://github.com/intel/neural-compressor)
-[![version](https://img.shields.io/badge/release-3.1-green)](https://github.com/intel/neural-compressor/releases)
+[![version](https://img.shields.io/badge/release-3.2-green)](https://github.com/intel/neural-compressor/releases)
 [![license](https://img.shields.io/badge/license-Apache%202-blue)](https://github.com/intel/neural-compressor/blob/master/LICENSE)
 [![coverage](https://img.shields.io/badge/coverage-85%25-green)](https://github.com/intel/neural-compressor)
 [![Downloads](https://static.pepy.tech/personalized-badge/neural-compressor?period=total&units=international_system&left_color=grey&right_color=green&left_text=downloads)](https://pepy.tech/project/neural-compressor)
@@ -124,7 +124,7 @@ Following example code demonstrates FP8 Quantization, it is supported by Intel G
 To try on Intel Gaudi2, docker image with Gaudi Software Stack is recommended, please refer to following script for environment setup. More details can be found in [Gaudi Guide](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#launch-docker-image-that-was-built).
 ```bash
 # Run a container with an interactive shell
-docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.17.0/ubuntu22.04/habanalabs/pytorch-installer-2.3.1:latest
+docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.0/ubuntu24.04/habanalabs/pytorch-installer-2.5.1:latest
 ```
 Run the example:
 ```python
@@ -133,14 +133,21 @@ from neural_compressor.torch.quantization import (
     prepare,
     convert,
 )
+import torch
 import torchvision.models as models
 model = models.resnet18()
 qconfig = FP8Config(fp8_config="E4M3")
 model = prepare(model, qconfig)
-# customer defined calibration
-calib_func(model)
+# Customer defined calibration. Below is a dummy calibration
+model(torch.randn(1, 3, 224, 224).to("hpu"))
 model = convert(model)
+output = model(torch.randn(1, 3, 224, 224).to("hpu")).to("cpu")
+print(output.shape)
 ```
 ### Weight-Only Large Language Model Loading (LLMs)

{neural_compressor-3.1 → neural_compressor-3.2}/README.md RENAMED

@@ -5,7 +5,7 @@ Intel® Neural Compressor
 <h3> An open-source Python library supporting popular model compression techniques on all mainstream deep learning frameworks (TensorFlow, PyTorch, and ONNX Runtime)</h3>
 [![python](https://img.shields.io/badge/python-3.8%2B-blue)](https://github.com/intel/neural-compressor)
-[![version](https://img.shields.io/badge/release-3.1-green)](https://github.com/intel/neural-compressor/releases)
+[![version](https://img.shields.io/badge/release-3.2-green)](https://github.com/intel/neural-compressor/releases)
 [![license](https://img.shields.io/badge/license-Apache%202-blue)](https://github.com/intel/neural-compressor/blob/master/LICENSE)
 [![coverage](https://img.shields.io/badge/coverage-85%25-green)](https://github.com/intel/neural-compressor)
 [![Downloads](https://static.pepy.tech/personalized-badge/neural-compressor?period=total&units=international_system&left_color=grey&right_color=green&left_text=downloads)](https://pepy.tech/project/neural-compressor)
@@ -78,7 +78,7 @@ Following example code demonstrates FP8 Quantization, it is supported by Intel G
 To try on Intel Gaudi2, docker image with Gaudi Software Stack is recommended, please refer to following script for environment setup. More details can be found in [Gaudi Guide](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#launch-docker-image-that-was-built).
 ```bash
 # Run a container with an interactive shell
-docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.17.0/ubuntu22.04/habanalabs/pytorch-installer-2.3.1:latest
+docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.0/ubuntu24.04/habanalabs/pytorch-installer-2.5.1:latest
 ```
 Run the example:
 ```python
@@ -87,14 +87,21 @@ from neural_compressor.torch.quantization import (
     prepare,
     convert,
 )
+import torch
 import torchvision.models as models
 model = models.resnet18()
 qconfig = FP8Config(fp8_config="E4M3")
 model = prepare(model, qconfig)
-# customer defined calibration
-calib_func(model)
+# Customer defined calibration. Below is a dummy calibration
+model(torch.randn(1, 3, 224, 224).to("hpu"))
 model = convert(model)
+output = model(torch.randn(1, 3, 224, 224).to("hpu")).to("cpu")
+print(output.shape)
 ```
 ### Weight-Only Large Language Model Loading (LLMs)

{neural_compressor-3.1 → neural_compressor-3.2}/neural_compressor/adaptor/ox_utils/smooth_quant.py RENAMED

@@ -295,6 +295,9 @@ class ORTSmoothQuant:
                 return False
             for inp in node.input:
                 if self.model.get_initializer(inp) is not None:
+                    # Ensure that mul operators with shared initializer will not be absorbed.
+                    if self.model.get_initializer_share_num(inp) > 1:
+                        return False
                     key = node.input[0].split("_smooth_output")[0]
                     tensor = self.model.get_initializer(inp)
                     new_tensor = (
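
Note on this hunk: the added guard skips absorption when the `Mul` initializer is consumed by more than one node, presumably because rescaling a shared tensor in place would also change the other consumers. The sketch below illustrates that check on a toy graph. `Node`, `initializer_share_num`, and `can_absorb_mul` are hypothetical stand-ins written for this note, not the library's ONNX model wrapper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for a graph node: a name, an op type, and input tensor names."""

    name: str
    op_type: str
    inputs: list = field(default_factory=list)


def initializer_share_num(nodes, initializer_name):
    """Count how many nodes consume the given initializer."""
    return sum(1 for node in nodes if initializer_name in node.inputs)


def can_absorb_mul(nodes, mul_node, initializers):
    """Return True only if the Mul's initializer input is used by no other node."""
    for inp in mul_node.inputs:
        if inp in initializers:
            return initializer_share_num(nodes, inp) == 1
    return False  # no initializer input, nothing to absorb


nodes = [
    Node("mul_a", "Mul", ["x", "shared_w"]),
    Node("mul_b", "Mul", ["y", "shared_w"]),
    Node("mul_c", "Mul", ["z", "private_w"]),
]
initializers = {"shared_w", "private_w"}

print(can_absorb_mul(nodes, nodes[0], initializers))  # False: shared_w has two consumers
print(can_absorb_mul(nodes, nodes[2], initializers))  # True: private_w has one consumer
```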

{neural_compressor-3.1 → neural_compressor-3.2}/neural_compressor/adaptor/pytorch.py RENAMED

@@ -4926,7 +4926,7 @@ class PyTorchWeightOnlyAdaptor(TemplateAdaptor):
         act_group_size = self.recipes["autoround_args"].get("act_group_size", None)
         act_sym = self.recipes["autoround_args"].get("act_sym", None)
         act_dynamic = self.recipes["autoround_args"].get("act_dynamic", True)
-        quant_block_list = self.recipes["autoround_args"].get("quant_block_list", None)
+        to_quant_block_names = self.recipes["autoround_args"].get("to_quant_block_names", None)
         use_layer_wise = self.recipes["autoround_args"].get("use_layer_wise", False)
         if dataloader is not None:
@@ -4959,7 +4959,7 @@ class PyTorchWeightOnlyAdaptor(TemplateAdaptor):
             dynamic_max_gap=dynamic_max_gap,
             data_type=data_type,
             scale_dtype=scale_dtype,
-            quant_block_list=quant_block_list,
+            to_quant_block_names=to_quant_block_names,
             act_bits=act_bits,
             act_group_size=act_group_size,
             act_sym=act_sym,

{neural_compressor-3.1 → neural_compressor-3.2}/neural_compressor/adaptor/torch_utils/weight_only.py RENAMED

@@ -706,7 +706,7 @@ def autoround_quantize(
     dynamic_max_gap: int = -1,
     data_type: str = "int",  ##only support int for now
     scale_dtype: str = "fp16",
-    quant_block_list: list = None,
+    to_quant_block_names: list = None,
     act_bits: int = 32,
     act_group_size: int = None,
     act_sym: bool = None,
@@ -761,7 +761,7 @@ def autoround_quantize(
         data_type (str): The data type to be used (default is "int").
         scale_dtype (str): The data type of quantization scale to be used (default is "float32"), different kernels
                            have different choices.
-        quant_block_list (list): A list whose elements are list of block's layer names to be quantized.
+        to_quant_block_names (list): A list whose elements are list of block's layer names to be quantized.
         act_bits (int): Number of bits for activation quantization. Default is 32.
         act_group_size (int): Group size for activation quantization. Default is None.
         act_sym (bool): Whether to use symmetric activation quantization. Default is None.
@@ -800,7 +800,7 @@ def autoround_quantize(
         dynamic_max_gap=dynamic_max_gap,
         data_type=data_type,  ## only support data_type
         scale_dtype=scale_dtype,
-        quant_block_list=quant_block_list,
+        to_quant_block_names=to_quant_block_names,
         act_bits=act_bits,
         act_group_size=act_group_size,
         act_sym=act_sym,
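
Note on the rename: the adaptor hunk above now reads the recipe with `.get("to_quant_block_names", None)`, so a 3.1-style recipe that still sets `quant_block_list` would appear to fall back to `None` there without an error. A caller-side shim along the following lines could bridge the two versions. `migrate_autoround_args` is a hypothetical helper written for this note and is not part of the package.

```python
import warnings


def migrate_autoround_args(autoround_args):
    """Return a copy of the recipe dict with the 3.1 key renamed to the 3.2 key."""
    migrated = dict(autoround_args)  # leave the caller's dict untouched
    if "quant_block_list" in migrated:
        legacy_value = migrated.pop("quant_block_list")
        warnings.warn(
            "'quant_block_list' was renamed to 'to_quant_block_names' in 3.2",
            DeprecationWarning,
            stacklevel=2,
        )
        # An explicitly set new-style key takes precedence over the legacy one.
        migrated.setdefault("to_quant_block_names", legacy_value)
    return migrated


old_style = {"act_bits": 32, "quant_block_list": [["model.layers.0"]]}
print(migrate_autoround_args(old_style))
```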

{neural_compressor-3.1 → neural_compressor-3.2}/neural_compressor/common/__init__.py RENAMED

@@ -27,7 +27,7 @@ from neural_compressor.common.utils import (
     dump_elapsed_time,
 )
 from neural_compressor.common.base_config import options
+from neural_compressor.common.version import __version__
 __all__ = [
     "options",

{neural_compressor-3.1 → neural_compressor-3.2}/neural_compressor/common/utils/constants.py RENAMED

@@ -56,6 +56,7 @@ class Mode(Enum):
     PREPARE = "prepare"
     CONVERT = "convert"
     QUANTIZE = "quantize"
+    LOAD = "load"
 SERVER_PROCESSOR_BRAND_KEY_WORLD_LST = ["Xeon"]

{neural_compressor-3.1 → neural_compressor-3.2}/neural_compressor/common/utils/logger.py RENAMED

@@ -17,6 +17,7 @@
 """Logger: handles logging functionalities."""
+import functools
 import logging
 import os
@@ -137,6 +138,12 @@ class Logger(object):
         else:
             Logger().get_logger().warning(msg, *args, **kwargs)
+    @functools.lru_cache(None)
+    def warning_once(msg, *args, **kwargs):
+        """Output log with the warning level only once."""
+        Logger.warning("Below warning will be shown only once:")
+        Logger.warning(msg, *args, **kwargs)
 level = Logger().get_logger().level
 level_name = logging.getLevelName(level)
@@ -152,6 +159,8 @@ def _get_log_msg(mode):
         log_msg = "Preparation"
     elif mode == Mode.CONVERT:  # pragma: no cover
         log_msg = "Conversion"
+    elif mode == Mode.LOAD:  # pragma: no cover
+        log_msg = "Loading"
     return log_msg
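
Note on `warning_once`: the once-only behaviour comes from `functools.lru_cache`, which skips the function body when the same arguments are seen again. The standalone sketch below shows the mechanism outside the `Logger` class. The logger name and the `emitted` list are assumptions made for this illustration.

```python
import functools
import logging

logger = logging.getLogger("warning_once_demo")
emitted = []  # records every message that actually reached the logger


@functools.lru_cache(None)
def warning_once(msg):
    """Log msg at WARNING level on first sight; repeat calls are served from the cache."""
    emitted.append(msg)
    logger.warning(msg)


warning_once("deprecated option")
warning_once("deprecated option")  # cached: the body does not run again
warning_once("another message")
print(emitted)  # ['deprecated option', 'another message']
```

Two consequences of this design are worth keeping in mind: the cache key is the full argument tuple, so the same text with different format arguments is logged again, and every argument has to be hashable.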

{neural_compressor-3.1/neural_compressor → neural_compressor-3.2/neural_compressor/common}/version.py RENAMED

@@ -15,4 +15,4 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """Intel® Neural Compressor: An open-source Python library supporting popular model compression techniques."""
-__version__ = "3.1"
+__version__ = "3.2"

{neural_compressor-3.1 → neural_compressor-3.2}/neural_compressor/evaluation/lm_eval/accuracy.py RENAMED

@@ -199,6 +199,7 @@ def cli_evaluate(args) -> None:
             },
         )
     lm.pad_to_buckets = args.pad_to_buckets
+    lm.buckets = args.buckets
     results = evaluator.simple_evaluate(
         model=lm,

{neural_compressor-3.1 → neural_compressor-3.2}/neural_compressor/evaluation/lm_eval/models/huggingface.py RENAMED

@@ -116,11 +116,14 @@ class HFLM(TemplateLM):
         peft: Optional[str] = None,
         autogptq: Optional[Union[bool, str]] = False,
         pad_to_buckets: Optional[Union[bool]] = False,
+        buckets: Optional[list] = [32, 64, 128, 256, 512, 1024, 2048, 4096],
         model_format: Optional[str] = "torch",
         **kwargs,
     ) -> None:
         super().__init__()
         self.pad_to_buckets = pad_to_buckets
+        self.buckets = buckets
+        self.last_bucket = -1
         self.model_format = model_format
         # optionally: take in an already-initialized transformers.PreTrainedModel
         if not isinstance(pretrained, str):
@@ -874,6 +877,19 @@ class HFLM(TemplateLM):
         elif self.AUTO_MODEL_CLASS == transformers.AutoModelForSeq2SeqLM:
             return self.tokenizer.decode(tokens, skip_special_tokens=skip_special_tokens)
+    def find_bucket(self, length):
+        suitable_buckets = [b for b in self.buckets if b >= length]
+        if len(suitable_buckets) == 0:
+            eval_logger.error(f"The input_length={length} exceeds the maximum value in buckets={self.buckets}")
+            eval_logger.error("Please add a higher value into the buckets list for this case.")
+            exit(0)
+        else:
+            if self.last_bucket != suitable_buckets[0]:
+                if hasattr(self.model, "clear_cache"):
+                    self.model.clear_cache()  # clear HPU graph cache to avoid OOM
+                self.last_bucket = suitable_buckets[0]
+            return self.last_bucket
     def _model_call(self, inps, attn_mask=None, labels=None):
         """
         :param inps: torch.Tensor
@@ -943,8 +959,7 @@ class HFLM(TemplateLM):
                     if self.pad_to_buckets:  # use buckets to pad inputs
                         bs, seq_length = inps.shape
                         padding_length = 0
-                        buckets = [64, 128, 256, 512, 1024, 2048, 4096, 8192]
-                        bucket_length = [b for b in buckets if b >= seq_length][0]
+                        bucket_length = self.find_bucket(seq_length)
                         padding_length = bucket_length - seq_length
                         inps = F.pad(inps, (0, padding_length), value=self.model.config.pad_token_id)
                     output = self.model(inps)
@@ -954,6 +969,8 @@ class HFLM(TemplateLM):
                         output = output.logits
                     if self.pad_to_buckets and padding_length != 0:  # use buckets to pad inputs
                         output = output[:, :-padding_length, :]
+                    if "hpu" in output.device.type:  # make sure return fp32 tensor for HPU, TODO: root cause
+                        output = output.to(torch.float32)
                 return output
     def _model_generate(self, context, max_length, stop, **generation_kwargs):
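
Note on the bucket change: the hard-coded list `[64, ..., 8192]` is replaced by a configurable `buckets` argument whose default tops out at 4096, so an input longer than 4096 tokens no longer fits the defaults. The sketch below reproduces only the selection and padding arithmetic. `pick_bucket` is a hypothetical name, and it raises `ValueError` where the diffed `find_bucket` logs an error and calls `exit(0)`. The cache clearing on bucket change is left out.

```python
DEFAULT_BUCKETS = [32, 64, 128, 256, 512, 1024, 2048, 4096]


def pick_bucket(length, buckets=DEFAULT_BUCKETS):
    """Return the first bucket (in list order) that is >= length."""
    for bucket in buckets:
        if bucket >= length:
            return bucket
    raise ValueError(f"input_length={length} exceeds the maximum value in buckets={buckets}")


seq_length = 100
bucket_length = pick_bucket(seq_length)
padding_length = bucket_length - seq_length
print(bucket_length, padding_length)  # 128 28

# A custom list with a larger top bucket accepts longer inputs.
print(pick_bucket(5000, buckets=DEFAULT_BUCKETS + [8192]))  # 8192
```

As in the diffed code, the first match in list order is taken, so the list is assumed to be sorted in ascending order.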

{neural_compressor-3.1 → neural_compressor-3.2}/neural_compressor/evaluation/lm_eval/utils.py RENAMED

@@ -49,6 +49,7 @@ class LMEvalParser:
         seed=[0, 1234, 1234],
         trust_remote_code=False,
         pad_to_buckets=None,  # used by HPU to align input length for performance.
+        buckets=[32, 64, 128, 256, 512, 1024, 2048, 4096],  # used by HPU to limit input length range.
     ):
         self.model = model
         self.tasks = tasks
@@ -81,3 +82,4 @@ class LMEvalParser:
                 self.pad_to_buckets = False
         else:
             self.pad_to_buckets = pad_to_buckets
+        self.buckets = buckets

{neural_compressor-3.1 → neural_compressor-3.2}/neural_compressor/torch/algorithms/fp8_quant/__init__.py RENAMED

@@ -19,4 +19,24 @@ from neural_compressor.torch.algorithms.fp8_quant.common import (
     with_patched_module,
 )
 from neural_compressor.torch.algorithms.fp8_quant.prepare_quant.prepare_model import finish_measurements, prep_model
-from neural_compressor.torch.algorithms.fp8_quant.fp8_quant import FP8Quantizer
+from neural_compressor.torch.algorithms.fp8_quant.quantizer import FP8Quantizer
+from neural_compressor.torch.algorithms.fp8_quant.patched_module_base import (
+    PatchedModuleBase,
+    register_patched_module,
+)
+from neural_compressor.torch.algorithms.fp8_quant.scaling_method_base import (
+    ScalingMethodBase,
+    register_scaling_methods,
+)
+from neural_compressor.torch.algorithms.fp8_quant.observer import (
+    ObserverBase,
+    register_observer,
+)
+from neural_compressor.torch.algorithms.fp8_quant.model_configs import (
+    ModuleConfig,
+    ModuleInfo,
+    ModuleType,
+    ModuleExtraConfig
+)
+from neural_compressor.torch.algorithms.fp8_quant.save_load import save, load

{neural_compressor-3.1 → neural_compressor-3.2}/neural_compressor/torch/algorithms/fp8_quant/_core/common.py RENAMED

@@ -23,7 +23,15 @@ import torch
 from .._quant_common.helper_modules import *
 from .._quant_common.quant_config import get_hqt_config
 from ..utils.logger import logger
+from neural_compressor.torch.algorithms.fp8_quant.model_configs import (
+    ModuleInfo,
+    ModuleConfig,
+    ModuleType,
+    ModuleExtraConfig,
+    get_patched_module_table,
+    get_patched_module_type_table,
+)
+from neural_compressor.torch.utils.auto_accelerator import auto_detect_accelerator
 deepspeed_exists = False
 if importlib.util.find_spec("deepspeed"):  # check if deepspeed is installed
     deepspeed_exists = True
@@ -31,38 +39,7 @@ if importlib.util.find_spec("deepspeed"):  # check if deepspeed is installed
 UNMEASURED_MODELS = "UnmeasuredModels"
-class ModuleInfo:
-    def __init__(self, type, patched_module, should_measure=True):
-        self.type = type
-        self.patched_module = patched_module
-        self.should_measure = should_measure
-class ModuleConfig:
-    def __init__(self, inputs=(None,), outputs=(None,), params=None):
-        self.inputs = inputs
-        self.outputs = outputs
-        self.params = params if params is not None else {}
-class ModuleExtraConfig:
-    def __init__(self, inputs=(None,), outputs=(None,), params=None, scale=None, config_params=None):
-        self.inputs = inputs
-        self.outputs = outputs
-        self.params = params if params is not None else {}
-        self.scale = scale
-        self.config_params = config_params if config_params is not None else {}
-class ModuleType:
-    def __init__(self, num_inputs, param_names, num_outputs, required_output):
-        self.num_inputs = num_inputs
-        self.param_names = param_names
-        self.num_outputs = num_outputs
-        self.required_output = required_output
-mod_types = {
+_mod_types = {
     "linear": ModuleType(1, ["weight"], 1, False),
     "matmul": ModuleType(2, [], 1, False),
     "kv_cache": ModuleType(1, [], 1, False),
@@ -110,7 +87,7 @@ def save_file(model, d, source_format, fname, mode):
     config = get_hqt_config(model)
     logger.debug("Saving %s file: %s", mode, fname)
     ext = os.path.splitext(fname)[1]
-    target_format = file_functions[ext]["format"]
+    target_format = file_functions[ext]['format']
     dc = rec_fn(d, format_functions[(source_format, target_format)])
     df = {
         "GlobalRank": config.cfg["global_rank"],
@@ -119,7 +96,7 @@ def save_file(model, d, source_format, fname, mode):
         "Nodes": dc,
     }
     try:
-        file_functions[ext]["save"](df, fname)
+        file_functions[ext]['save'](df, fname)
     except:
         pass
@@ -127,10 +104,10 @@ def save_file(model, d, source_format, fname, mode):
 def load_file(fname, target_format, fail_on_file_not_exist):
     logger.debug("Loading file: %s", fname)
     ext = os.path.splitext(fname)[1]
-    source_format = file_functions[ext]["format"]
+    source_format = file_functions[ext]['format']
     d = {}
     if os.path.isfile(fname):
-        d = file_functions[ext]["load"](fname)
+        d = file_functions[ext]['load'](fname)
     elif fail_on_file_not_exist:
         raise FileNotFoundError(f"Failed to load file {fname}")
     if "Nodes" in d:
@@ -190,17 +167,17 @@ def load_scales(fname, target_format):
     return d
-def convert_scales_to_tensors_dict(scales_obj, scales_file_format, hp_dtype):
+def convert_scales_to_tensors_dict(scales_obj, scales_file_format, hp_dtype, device="hpu"):
     scales_temp = {k: scales_obj[k].__dict__ for k in scales_obj}
     scales_temp = format_functions_rec((scales_file_format, torch.Tensor))(scales_temp)
-    scales_temp = rec_fn(scales_temp, lambda x: x.to(dtype=hp_dtype, device="hpu"))
+    scales_temp = rec_fn(scales_temp, lambda x: x.to(dtype=hp_dtype, device=device))
     scales = {k: ModuleConfig(**scales_temp[k]) for k in scales_temp}
     return scales
 file_functions = {
-    ".json": {"format": list, "save": save_json, "load": load_json},
-    ".npz": {"format": np.ndarray, "save": save_npz, "load": load_npz},
+    ".json": {'format': list, 'save': save_json, 'load': load_json},
+    ".npz": {'format': np.ndarray, 'save': save_npz, 'load': load_npz}
 }
 format_functions = {
@@ -219,7 +196,7 @@ format_functions = {
 format_functions_rec = lambda k: functools.partial(rec_fn, fn=format_functions[k])
-mod_default_dict = {
+_mod_default_dict = {
     "Matmul": ModuleInfo("matmul", PatchedMatmul),
     "Linear": ModuleInfo("linear", PatchedLinear),
     "RowParallelLinear": ModuleInfo("linear", PatchedRowParallelLinear),
@@ -241,7 +218,7 @@ mod_default_dict = {
 if deepspeed_exists:
-    mod_default_dict.update(
+    _mod_default_dict.update(
         {
             "LinearLayer": ModuleInfo("linear", PatchedLinear),
             "LinearAllreduce": ModuleInfo("linear", PatchedLinearAllReduce),
@@ -250,6 +227,25 @@ if deepspeed_exists:
         }
     )
+@functools.lru_cache(maxsize=None)
+def _import_hpu_modules():
+    from neural_compressor.torch.algorithms.fp8_quant.patched_module_base import (
+        PATCHED_MODULE_TABLE, PATCHED_MODULE_TYPES_TABLE
+    )
+    cur_accelerator = auto_detect_accelerator()
+    if not cur_accelerator.current_device_name().startswith("hpu"):
+        return
+    PATCHED_MODULE_TABLE["hpu"].update(_mod_default_dict)
+    PATCHED_MODULE_TYPES_TABLE["hpu"].update(_mod_types)
+_import_hpu_modules()
+mod_default_dict = get_patched_module_table()
+mod_types = get_patched_module_type_table()
+def get_white_list():
+    return list(mod_default_dict.keys())
 class ModInstInfo:
     def __init__(self, name, parent):
@@ -267,3 +263,7 @@ def generate_model_info(model):
             create_mod_info_recursion(mod)
     create_mod_info_recursion(model)
+def get_device_type_for_scales(mod):
+    config = get_hqt_config(mod).cfg
+    return config["device_for_scales"]

{neural_compressor-3.1 → neural_compressor-3.2}/neural_compressor/torch/algorithms/fp8_quant/_core/fp_utils.py RENAMED

@@ -12,12 +12,13 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
+import torch
 import habana_frameworks.torch.core as htcore
 import habana_frameworks.torch.utils.experimental as htexp
-import torch
 from .common import ModuleConfig
-from .quant_dequant import cast_fcn, cast_to_fp8_fcn, descale_fcn, scale_fcn
+from .quant_dequant import cast_to_fp8_fcn, cast_fcn, descale_fcn, scale_fcn
+from neural_compressor.torch.utils.auto_accelerator import auto_detect_accelerator
+cur_accelerator = auto_detect_accelerator()
 GAUDI2 = htexp.synDeviceType.synDeviceGaudi2
 GAUDI3 = htexp.synDeviceType.synDeviceGaudi3
@@ -116,9 +117,9 @@ def scale_to_pow2(scale):
 # for Gaudi2 the range is 16^-2..16^1 so we change 2 with 16 and remember that:
 # 16 = 2^4, log16(m)=log2(m)/log2(16)=log2(m)/4, and we get:
 # we choose s=16^ciel(log16(m))=2^4^ciel(log2(m)/4)=2^(4*ciel(log2(m)/4))=2^(ciel(log2(m)/4)*4)
-def scale_to_pow2_hw(scale, device_type):
+def scale_to_pow2_hw(scale, device_for_scales):
     scale_pow2 = scale_to_pow2(scale)
-    min_scale, max_scale, scale_factor = FP8_143_SCALES_TRAITS[device_type]
+    min_scale, max_scale, scale_factor = FP8_143_SCALES_TRAITS[device_for_scales]
     scale_pow2_hw = torch.minimum(
         torch.maximum(
             2 ** (torch.ceil(torch.log2(scale_pow2) / scale_factor) * scale_factor),
@@ -142,13 +143,13 @@ def mmse_scale_multi(x, ref_scale, scales, lp_dtype, hp_dtype):
         xscales = rs * sv
         y = scale_fcn(x, xscales)
         y = cast_to_fp8_fcn(y, lp_dtype)
-        htcore.mark_step()  # we are measuring the error so we want to avoid fusion of the converts
+        cur_accelerator.synchronize()  # we are measuring the error so we want to avoid fusion of the converts
         y = cast_fcn(y, hp_dtype)
         y = descale_fcn(y, xscales)
         err = torch.sum((x - y) ** 2, dim=sum_axis)
         opt_scale = torch.where(err < opt_err, sv, opt_scale)
         opt_err = torch.where(err < opt_err, err, opt_err)
-        htcore.mark_step()
+        cur_accelerator.synchronize()
     return opt_scale * ref_scale
@@ -160,13 +161,13 @@ def mmse_scale(x, scales, lp_dtype, hp_dtype):
     for s in scales:
         y = scale_fcn(x, s)
         y = cast_to_fp8_fcn(y, lp_dtype)
-        htcore.mark_step()  # we are measuring the error so we want to avoid fusion of the converts
+        cur_accelerator.synchronize()  # we are measuring the error so we want to avoid fusion of the converts
         y = cast_fcn(y, hp_dtype)
         y = descale_fcn(y, s)
         err = torch.norm(x - y)
         opt_scale = torch.where(err <= opt_err, s, opt_scale)
         opt_err = torch.where(err <= opt_err, err, opt_err)
-        htcore.mark_step()
+        cur_accelerator.synchronize()
     return opt_scale
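
Note on `scale_to_pow2_hw`: the context comment derives the hardware scale as 2^(ceil(log2(m)/4)*4) for Gaudi2, that is, the scale rounded up to the next power of 16. The scalar sketch below works through that arithmetic with plain `math` instead of `torch`. The clamp bounds 16^-2 and 16^1 are taken from the comment's stated Gaudi2 range, and `round_scale_to_hw_pow2` is a hypothetical name. The actual `FP8_143_SCALES_TRAITS` values are not shown in this diff.

```python
import math


def round_scale_to_hw_pow2(scale, min_scale=16.0**-2, max_scale=16.0**1, scale_factor=4):
    """Round scale up to a power of 2**scale_factor, then clamp to [min_scale, max_scale]."""
    exponent = math.ceil(math.log2(scale) / scale_factor) * scale_factor
    return min(max(2.0**exponent, min_scale), max_scale)


print(round_scale_to_hw_pow2(3.0))    # 16.0: next power of 16 above 3
print(round_scale_to_hw_pow2(0.5))    # 1.0: next power of 16 above 0.5
print(round_scale_to_hw_pow2(100.0))  # 16.0: 256 is clamped to the upper bound
```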
