PyPI - neural-compressor - Versions diffs - 2.4__tar.gz → 2.5__tar.gz - Mend


Files changed (554)

{neural_compressor-2.4 → neural_compressor-2.5}/PKG-INFO RENAMED

@@ -1,9 +1,9 @@
 Metadata-Version: 2.1
 Name: neural_compressor
-Version: 2.4
+Version: 2.5
 Summary: Repository of Intel® Neural Compressor
 Home-page: https://github.com/intel/neural-compressor
-Author: Intel AIA Team
+Author: Intel AIPT Team
 Author-email: feng.tian@intel.com, haihao.shen@intel.com, suyue.chen@intel.com
 License: Apache 2.0
 Keywords: quantization,auto-tuning,post-training static quantization,post-training dynamic quantization,quantization-aware training
@@ -16,7 +16,7 @@ Description-Content-Type: text/markdown
 License-File: LICENSE
 License-File: third-party-programs.txt
 Requires-Dist: deprecated>=1.2.13
-Requires-Dist: numpy
+Requires-Dist: numpy<2.0
 Requires-Dist: opencv-python-headless
 Requires-Dist: pandas
 Requires-Dist: Pillow
@@ -30,11 +30,11 @@ Requires-Dist: requests
 Requires-Dist: schema
 Requires-Dist: scikit-learn
 Provides-Extra: pt
-Requires-Dist: neural_compressor_3x_pt==2.4; extra == "pt"
+Requires-Dist: neural_compressor_3x_pt==2.5; extra == "pt"
 Provides-Extra: tf
-Requires-Dist: neural_compressor_3x_tf==2.4; extra == "tf"
+Requires-Dist: neural_compressor_3x_tf==2.5; extra == "tf"
 Provides-Extra: ort
-Requires-Dist: neural_compressor_3x_ort==2.4; extra == "ort"
+Requires-Dist: neural_compressor_3x_ort==2.5; extra == "ort"
 <div align="center">
@@ -43,12 +43,12 @@ Intel® Neural Compressor
 <h3> An open-source Python library supporting popular model compression techniques on all mainstream deep learning frameworks (TensorFlow, PyTorch, ONNX Runtime, and MXNet)</h3>
 [![python](https://img.shields.io/badge/python-3.8%2B-blue)](https://github.com/intel/neural-compressor)
-[![version](https://img.shields.io/badge/release-2.4-green)](https://github.com/intel/neural-compressor/releases)
+[![version](https://img.shields.io/badge/release-2.5-green)](https://github.com/intel/neural-compressor/releases)
 [![license](https://img.shields.io/badge/license-Apache%202-blue)](https://github.com/intel/neural-compressor/blob/master/LICENSE)
 [![coverage](https://img.shields.io/badge/coverage-85%25-green)](https://github.com/intel/neural-compressor)
 [![Downloads](https://static.pepy.tech/personalized-badge/neural-compressor?period=total&units=international_system&left_color=grey&right_color=green&left_text=downloads)](https://pepy.tech/project/neural-compressor)
-[Architecture](./docs/source/design.md#architecture)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Workflow](./docs/source/design.md#workflow)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Results](./docs/source/validated_model_list.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Examples](./examples/README.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentations](https://intel.github.io/neural-compressor)
+[Architecture](./docs/source/design.md#architecture)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Workflow](./docs/source/design.md#workflow)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[LLMs Recipes](./docs/source/llm_recipes.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Results](./docs/source/validated_model_list.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentations](https://intel.github.io/neural-compressor)
 ---
 <div align="left">
@@ -63,6 +63,9 @@ In particular, the tool provides the key features, typical examples, and open co
 * Collaborate with cloud marketplaces such as [Google Cloud Platform](https://console.cloud.google.com/marketplace/product/bitnami-launchpad/inc-tensorflow-intel?project=verdant-sensor-286207), [Amazon Web Services](https://aws.amazon.com/marketplace/pp/prodview-yjyh2xmggbmga#pdp-support), and [Azure](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/bitnami.inc-tensorflow-intel), software platforms such as [Alibaba Cloud](https://www.intel.com/content/www/us/en/developer/articles/technical/quantize-ai-by-oneapi-analytics-on-alibaba-cloud.html), [Tencent TACO](https://new.qq.com/rain/a/20221202A00B9S00) and [Microsoft Olive](https://github.com/microsoft/Olive), and open AI ecosystem such as [Hugging Face](https://huggingface.co/blog/intel), [PyTorch](https://pytorch.org/tutorials/recipes/intel_neural_compressor_for_pytorch.html), [ONNX](https://github.com/onnx/models#models), [ONNX Runtime](https://github.com/microsoft/onnxruntime), and [Lightning AI](https://github.com/Lightning-AI/lightning/blob/master/docs/source-pytorch/advanced/post_training_quantization.rst)
+## What's New
+* [2024/03] A new SOTA approach [AutoRound](https://github.com/intel/auto-round) Weight-Only Quantization on [Intel Gaudi2 AI accelerator](https://habana.ai/products/gaudi2/) is available for LLMs.
 ## Installation
 ### Install from pypi
@@ -73,29 +76,77 @@ pip install neural-compressor
 > More installation methods can be found at [Installation Guide](https://github.com/intel/neural-compressor/blob/master/docs/source/installation_guide.md). Please check out our [FAQ](https://github.com/intel/neural-compressor/blob/master/docs/source/faq.md) for more details.
 ## Getting Started
-### Quantization with Python API
-```shell
-# Install Intel Neural Compressor and TensorFlow
-pip install neural-compressor
-pip install tensorflow
-# Prepare fp32 model
-wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v1_6/mobilenet_v1_1.0_224_frozen.pb
+Setting up the environment:
+```bash
+pip install "neural-compressor>=2.3" "transformers>=4.34.0" torch torchvision
 ```
+After successfully installing these packages, try your first quantization program.
+### Weight-Only Quantization (LLMs)
+Following example code demonstrates Weight-Only Quantization on LLMs, it supports Intel CPU, Intel Gauid2 AI Accelerator, Nvidia GPU, best device will be selected automatically.
+To try on Intel Gaudi2, docker image with Gaudi Software Stack is recommended, please refer to following script for environment setup. More details can be found in [Gaudi Guide](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#launch-docker-image-that-was-built).
+```bash
+docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.14.0/ubuntu22.04//habanalabs/pytorch-installer-2.1.1:latest
+# Check the container ID
+docker ps
+# Login into container
+docker exec -it <container_id> bash
+# Install the optimum-habana
+pip install --upgrade-strategy eager optimum[habana]
+# Install INC/auto_round
+pip install neural-compressor auto_round
+```
+Run the example:
 ```python
-from neural_compressor.data import DataLoader, Datasets
+from transformers import AutoModel, AutoTokenizer
 from neural_compressor.config import PostTrainingQuantConfig
+from neural_compressor.quantization import fit
+from neural_compressor.adaptor.torch_utils.auto_round import get_dataloader
+model_name = "EleutherAI/gpt-neo-125m"
+float_model = AutoModel.from_pretrained(model_name)
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+dataloader = get_dataloader(tokenizer, seqlen=2048)
+woq_conf = PostTrainingQuantConfig(
+    approach="weight_only",
+    op_type_dict={
+        ".*": {  # match all ops
+            "weight": {
+                "dtype": "int",
+                "bits": 4,
+                "algorithm": "AUTOROUND",
+            },
+        }
+    },
+)
+quantized_model = fit(model=float_model, conf=woq_conf, calib_dataloader=dataloader)
+```
+**Note:**
+To try INT4 model inference, please directly use [Intel Extension for Transformers](https://github.com/intel/intel-extension-for-transformers), which leverages Intel Neural Compressor for model quantization.
-dataset = Datasets("tensorflow")["dummy"](shape=(1, 224, 224, 3))
-dataloader = DataLoader(framework="tensorflow", dataset=dataset)
+### Static Quantization (Non-LLMs)
+```python
+from torchvision import models
+from neural_compressor.config import PostTrainingQuantConfig
+from neural_compressor.data import DataLoader, Datasets
 from neural_compressor.quantization import fit
-q_model = fit(
-    model="./mobilenet_v1_1.0_224_frozen.pb",
-    conf=PostTrainingQuantConfig(),
-    calib_dataloader=dataloader,
-)
+float_model = models.resnet18()
+dataset = Datasets("pytorch")["dummy"](shape=(1, 3, 224, 224))
+calib_dataloader = DataLoader(framework="pytorch", dataset=dataset)
+static_quant_conf = PostTrainingQuantConfig()
+quantized_model = fit(model=float_model, conf=static_quant_conf, calib_dataloader=calib_dataloader)
 ```
 ## Documentation
@@ -110,8 +161,9 @@ q_model = fit(
     <tr>
       <td colspan="2" align="center"><a href="./docs/source/design.md#architecture">Architecture</a></td>
       <td colspan="2" align="center"><a href="./docs/source/design.md#workflow">Workflow</a></td>
+      <td colspan="1" align="center"><a href="https://intel.github.io/neural-compressor/latest/docs/source/api-doc/apis.html">APIs</a></td>
+      <td colspan="1" align="center"><a href="./docs/source/llm_recipes.md">LLMs Recipes</a></td>
       <td colspan="2" align="center"><a href="examples/README.md">Examples</a></td>
-      <td colspan="2" align="center"><a href="https://intel.github.io/neural-compressor/latest/docs/source/api-doc/apis.html">APIs</a></td>
     </tr>
   </tbody>
   <thead>

{neural_compressor-2.4 → neural_compressor-2.5}/README.md RENAMED

@@ -5,12 +5,12 @@ Intel® Neural Compressor
 <h3> An open-source Python library supporting popular model compression techniques on all mainstream deep learning frameworks (TensorFlow, PyTorch, ONNX Runtime, and MXNet)</h3>
 [![python](https://img.shields.io/badge/python-3.8%2B-blue)](https://github.com/intel/neural-compressor)
-[![version](https://img.shields.io/badge/release-2.4-green)](https://github.com/intel/neural-compressor/releases)
+[![version](https://img.shields.io/badge/release-2.5-green)](https://github.com/intel/neural-compressor/releases)
 [![license](https://img.shields.io/badge/license-Apache%202-blue)](https://github.com/intel/neural-compressor/blob/master/LICENSE)
 [![coverage](https://img.shields.io/badge/coverage-85%25-green)](https://github.com/intel/neural-compressor)
 [![Downloads](https://static.pepy.tech/personalized-badge/neural-compressor?period=total&units=international_system&left_color=grey&right_color=green&left_text=downloads)](https://pepy.tech/project/neural-compressor)
-[Architecture](./docs/source/design.md#architecture)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Workflow](./docs/source/design.md#workflow)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Results](./docs/source/validated_model_list.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Examples](./examples/README.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentations](https://intel.github.io/neural-compressor)
+[Architecture](./docs/source/design.md#architecture)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Workflow](./docs/source/design.md#workflow)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[LLMs Recipes](./docs/source/llm_recipes.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Results](./docs/source/validated_model_list.md)&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;[Documentations](https://intel.github.io/neural-compressor)
 ---
 <div align="left">
@@ -25,6 +25,9 @@ In particular, the tool provides the key features, typical examples, and open co
 * Collaborate with cloud marketplaces such as [Google Cloud Platform](https://console.cloud.google.com/marketplace/product/bitnami-launchpad/inc-tensorflow-intel?project=verdant-sensor-286207), [Amazon Web Services](https://aws.amazon.com/marketplace/pp/prodview-yjyh2xmggbmga#pdp-support), and [Azure](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/bitnami.inc-tensorflow-intel), software platforms such as [Alibaba Cloud](https://www.intel.com/content/www/us/en/developer/articles/technical/quantize-ai-by-oneapi-analytics-on-alibaba-cloud.html), [Tencent TACO](https://new.qq.com/rain/a/20221202A00B9S00) and [Microsoft Olive](https://github.com/microsoft/Olive), and open AI ecosystem such as [Hugging Face](https://huggingface.co/blog/intel), [PyTorch](https://pytorch.org/tutorials/recipes/intel_neural_compressor_for_pytorch.html), [ONNX](https://github.com/onnx/models#models), [ONNX Runtime](https://github.com/microsoft/onnxruntime), and [Lightning AI](https://github.com/Lightning-AI/lightning/blob/master/docs/source-pytorch/advanced/post_training_quantization.rst)
+## What's New
+* [2024/03] A new SOTA approach [AutoRound](https://github.com/intel/auto-round) Weight-Only Quantization on [Intel Gaudi2 AI accelerator](https://habana.ai/products/gaudi2/) is available for LLMs.
 ## Installation
 ### Install from pypi
@@ -35,29 +38,77 @@ pip install neural-compressor
 > More installation methods can be found at [Installation Guide](https://github.com/intel/neural-compressor/blob/master/docs/source/installation_guide.md). Please check out our [FAQ](https://github.com/intel/neural-compressor/blob/master/docs/source/faq.md) for more details.
 ## Getting Started
-### Quantization with Python API
-```shell
-# Install Intel Neural Compressor and TensorFlow
-pip install neural-compressor
-pip install tensorflow
-# Prepare fp32 model
-wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v1_6/mobilenet_v1_1.0_224_frozen.pb
+Setting up the environment:
+```bash
+pip install "neural-compressor>=2.3" "transformers>=4.34.0" torch torchvision
+```
+After successfully installing these packages, try your first quantization program.
+### Weight-Only Quantization (LLMs)
+Following example code demonstrates Weight-Only Quantization on LLMs, it supports Intel CPU, Intel Gauid2 AI Accelerator, Nvidia GPU, best device will be selected automatically.
+To try on Intel Gaudi2, docker image with Gaudi Software Stack is recommended, please refer to following script for environment setup. More details can be found in [Gaudi Guide](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#launch-docker-image-that-was-built).
+```bash
+docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.14.0/ubuntu22.04//habanalabs/pytorch-installer-2.1.1:latest
+# Check the container ID
+docker ps
+# Login into container
+docker exec -it <container_id> bash
+# Install the optimum-habana
+pip install --upgrade-strategy eager optimum[habana]
+# Install INC/auto_round
+pip install neural-compressor auto_round
 ```
+Run the example:
 ```python
-from neural_compressor.data import DataLoader, Datasets
+from transformers import AutoModel, AutoTokenizer
 from neural_compressor.config import PostTrainingQuantConfig
+from neural_compressor.quantization import fit
+from neural_compressor.adaptor.torch_utils.auto_round import get_dataloader
+model_name = "EleutherAI/gpt-neo-125m"
+float_model = AutoModel.from_pretrained(model_name)
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+dataloader = get_dataloader(tokenizer, seqlen=2048)
+woq_conf = PostTrainingQuantConfig(
+    approach="weight_only",
+    op_type_dict={
+        ".*": {  # match all ops
+            "weight": {
+                "dtype": "int",
+                "bits": 4,
+                "algorithm": "AUTOROUND",
+            },
+        }
+    },
+)
+quantized_model = fit(model=float_model, conf=woq_conf, calib_dataloader=dataloader)
+```
+**Note:**
+To try INT4 model inference, please directly use [Intel Extension for Transformers](https://github.com/intel/intel-extension-for-transformers), which leverages Intel Neural Compressor for model quantization.
-dataset = Datasets("tensorflow")["dummy"](shape=(1, 224, 224, 3))
-dataloader = DataLoader(framework="tensorflow", dataset=dataset)
+### Static Quantization (Non-LLMs)
+```python
+from torchvision import models
+from neural_compressor.config import PostTrainingQuantConfig
+from neural_compressor.data import DataLoader, Datasets
 from neural_compressor.quantization import fit
-q_model = fit(
-    model="./mobilenet_v1_1.0_224_frozen.pb",
-    conf=PostTrainingQuantConfig(),
-    calib_dataloader=dataloader,
-)
+float_model = models.resnet18()
+dataset = Datasets("pytorch")["dummy"](shape=(1, 3, 224, 224))
+calib_dataloader = DataLoader(framework="pytorch", dataset=dataset)
+static_quant_conf = PostTrainingQuantConfig()
+quantized_model = fit(model=float_model, conf=static_quant_conf, calib_dataloader=calib_dataloader)
 ```
 ## Documentation
@@ -72,8 +123,9 @@ q_model = fit(
     <tr>
       <td colspan="2" align="center"><a href="./docs/source/design.md#architecture">Architecture</a></td>
       <td colspan="2" align="center"><a href="./docs/source/design.md#workflow">Workflow</a></td>
+      <td colspan="1" align="center"><a href="https://intel.github.io/neural-compressor/latest/docs/source/api-doc/apis.html">APIs</a></td>
+      <td colspan="1" align="center"><a href="./docs/source/llm_recipes.md">LLMs Recipes</a></td>
       <td colspan="2" align="center"><a href="examples/README.md">Examples</a></td>
-      <td colspan="2" align="center"><a href="https://intel.github.io/neural-compressor/latest/docs/source/api-doc/apis.html">APIs</a></td>
     </tr>
   </tbody>
   <thead>
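
The weight-only example added to the README keys its settings by a pattern (`".*"`, commented as "match all ops"). As a rough illustration of that idea only, the sketch below resolves pattern-keyed settings with regular-expression full matching. The `resolve_op_config` helper and the op type names are hypothetical and not part of neural_compressor; the library's real matching rules are not shown in this diff.

```python
import re

# Pattern-keyed settings, copied from the README example in the diff.
op_type_dict = {
    ".*": {"weight": {"dtype": "int", "bits": 4, "algorithm": "AUTOROUND"}},
}


def resolve_op_config(op_type, patterns):
    """Return the settings of the first pattern that fully matches op_type.

    Hypothetical helper: assumes keys are regular expressions and that the
    first match wins. Returns None when no pattern matches.
    """
    for pattern, settings in patterns.items():
        if re.fullmatch(pattern, op_type):
            return settings
    return None


# "Linear" is an illustrative op type name; ".*" matches any name.
linear_cfg = resolve_op_config("Linear", op_type_dict)
print(linear_cfg["weight"]["bits"])
```

Under this assumption, a narrower key such as `"Linear"` would apply only to ops of that exact type, while `".*"` applies the 4-bit setting everywhere.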

{neural_compressor-2.4 → neural_compressor-2.5}/neural_compressor/adaptor/keras.py RENAMED

@@ -42,6 +42,7 @@ from .adaptor import Adaptor, adaptor_registry
 from .query import QueryBackendCapability
 tf = LazyImport("tensorflow")
+keras = LazyImport("keras")
 def _add_supported_quantized_objects(custom_objects):
@@ -519,6 +520,13 @@ class KerasAdaptor(Adaptor):
     def _restore_model_from_json(self, json_model):
         from tensorflow.keras.models import model_from_json
+        from neural_compressor.utils.utility import version1_gte_version2
+        if version1_gte_version2(keras.__version__, "2.13.1"):
+            from keras.src.saving import serialization_lib
+            serialization_lib.enable_unsafe_deserialization()
         custom_objects = {}
         # We need to keep a dictionary of custom objects as our quantized library
         # is not recognized by keras.
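
The hunk above gates `enable_unsafe_deserialization()` on the installed Keras version being at least 2.13.1. A minimal sketch of that kind of version gate follows, using only the standard library. `version_gte` is a hypothetical stand-in for `version1_gte_version2`, whose actual implementation is not shown in this diff and may differ, and `needs_unsafe_deserialization` is an illustrative name.

```python
import re


def version_gte(v1, v2):
    """Return True if dotted version v1 >= v2, comparing components as integers.

    Hypothetical stand-in: only the first three numeric components are
    considered, and suffixes such as "rc0" are ignored.
    """
    def key(version):
        return tuple(int(part) for part in re.findall(r"\d+", version)[:3])

    return key(v1) >= key(v2)


def needs_unsafe_deserialization(keras_version):
    """Mirror the gate in the diff: true from Keras 2.13.1 onward."""
    return version_gte(keras_version, "2.13.1")


for candidate in ("2.9.0", "2.13.1", "2.15.0"):
    print(candidate, needs_unsafe_deserialization(candidate))
```

Comparing integer tuples rather than raw strings matters here: as strings, `"2.9.0"` sorts after `"2.13.1"`, which would enable the new code path on an older Keras.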

{neural_compressor-2.4 → neural_compressor-2.5}/neural_compressor/adaptor/mxnet_utils/__init__.py RENAMED

@@ -1,4 +1,5 @@
 """Mxnet util init."""
 #!/usr/bin/env python
 # -*- coding: utf-8 -*-
 #

{neural_compressor-2.4 → neural_compressor-2.5}/neural_compressor/adaptor/mxnet_utils/util.py RENAMED

@@ -1,4 +1,5 @@
 """Mxnet util module."""
 #!/usr/bin/env python
 # -*- coding: utf-8 -*-
 #

{neural_compressor-2.4 → neural_compressor-2.5}/neural_compressor/adaptor/onnxrt.py RENAMED

@@ -417,15 +417,21 @@ class ONNXRUNTIMEAdaptor(Adaptor):
                 self.quantizable_op_types,
                 self.query_handler.get_fallback_list(),
                 self.reduce_range,
-                options.onnxrt.qdq_setting.AddQDQPairToWeight
-                if "add_qdq_pair_to_weight" not in self.recipes
-                else self.recipes.get("add_qdq_pair_to_weight", False),
-                options.onnxrt.qdq_setting.OpTypesToExcludeOutputQuantizatioin
-                if "optypes_to_exclude_output_quant" not in self.recipes
-                else self.recipes.get("optypes_to_exclude_output_quant", []),
-                options.onnxrt.qdq_setting.DedicatedQDQPair
-                if "dedicated_qdq_pair" not in self.recipes
-                else self.recipes.get("dedicated_qdq_pair", False),
+                (
+                    options.onnxrt.qdq_setting.AddQDQPairToWeight
+                    if "add_qdq_pair_to_weight" not in self.recipes
+                    else self.recipes.get("add_qdq_pair_to_weight", False)
+                ),
+                (
+                    options.onnxrt.qdq_setting.OpTypesToExcludeOutputQuantizatioin
+                    if "optypes_to_exclude_output_quant" not in self.recipes
+                    else self.recipes.get("optypes_to_exclude_output_quant", [])
+                ),
+                (
+                    options.onnxrt.qdq_setting.DedicatedQDQPair
+                    if "dedicated_qdq_pair" not in self.recipes
+                    else self.recipes.get("dedicated_qdq_pair", False)
+                ),
                 self.backend,
             )
             quantizer.quantize_model()
@@ -502,15 +508,21 @@ class ONNXRUNTIMEAdaptor(Adaptor):
             self.quantizable_op_types,
             self.query_handler.get_fallback_list(),
             self.reduce_range,
-            options.onnxrt.qdq_setting.AddQDQPairToWeight
-            if "add_qdq_pair_to_weight" not in self.recipes
-            else self.recipes.get("add_qdq_pair_to_weight", False),
-            options.onnxrt.qdq_setting.OpTypesToExcludeOutputQuantizatioin
-            if "optypes_to_exclude_output_quant" not in self.recipes
-            else self.recipes.get("optypes_to_exclude_output_quant", []),
-            options.onnxrt.qdq_setting.DedicatedQDQPair
-            if "dedicated_qdq_pair" not in self.recipes
-            else self.recipes.get("dedicated_qdq_pair", False),
+            (
+                options.onnxrt.qdq_setting.AddQDQPairToWeight
+                if "add_qdq_pair_to_weight" not in self.recipes
+                else self.recipes.get("add_qdq_pair_to_weight", False)
+            ),
+            (
+                options.onnxrt.qdq_setting.OpTypesToExcludeOutputQuantizatioin
+                if "optypes_to_exclude_output_quant" not in self.recipes
+                else self.recipes.get("optypes_to_exclude_output_quant", [])
+            ),
+            (
+                options.onnxrt.qdq_setting.DedicatedQDQPair
+                if "dedicated_qdq_pair" not in self.recipes
+                else self.recipes.get("dedicated_qdq_pair", False)
+            ),
             self.backend,
         )
         quantizer.quantize_model()
@@ -657,15 +669,21 @@ class ONNXRUNTIMEAdaptor(Adaptor):
             self.quantizable_op_types,
             self.query_handler.get_fallback_list(),
             self.reduce_range,
-            options.onnxrt.qdq_setting.AddQDQPairToWeight
-            if not options.onnxrt.qdq_setting.AddQDQPairToWeight
-            else self.recipes.get("add_qdq_pair_to_weight", False),
-            options.onnxrt.qdq_setting.OpTypesToExcludeOutputQuantizatioin
-            if options.onnxrt.qdq_setting.OpTypesToExcludeOutputQuantizatioin is not None
-            else self.recipes.get("optypes_to_exclude_output_quant", []),
-            options.onnxrt.qdq_setting.DedicatedQDQPair
-            if not options.onnxrt.qdq_setting.DedicatedQDQPair
-            else self.recipes.get("dedicated_qdq_pair", False),
+            (
+                options.onnxrt.qdq_setting.AddQDQPairToWeight
+                if not options.onnxrt.qdq_setting.AddQDQPairToWeight
+                else self.recipes.get("add_qdq_pair_to_weight", False)
+            ),
+            (
+                options.onnxrt.qdq_setting.OpTypesToExcludeOutputQuantizatioin
+                if options.onnxrt.qdq_setting.OpTypesToExcludeOutputQuantizatioin is not None
+                else self.recipes.get("optypes_to_exclude_output_quant", [])
+            ),
+            (
+                options.onnxrt.qdq_setting.DedicatedQDQPair
+                if not options.onnxrt.qdq_setting.DedicatedQDQPair
+                else self.recipes.get("dedicated_qdq_pair", False)
+            ),
         )
         quantizer.quantize_model()
@@ -765,7 +783,7 @@ class ONNXRUNTIMEAdaptor(Adaptor):
             black_nodes=black_nodes,
             white_nodes=white_nodes,
             iterations=list(range(0, iterations)),
-            backend=self.backend if self.backend != "DmlExecutionProvider" else "CPUExecutionProvider",
+            backend=self.backend,
             reduce_range=self.reduce_range,
             **kwargs,
         )
@@ -979,12 +997,10 @@ class ONNXRUNTIMEAdaptor(Adaptor):
             sess_options.register_custom_ops_library(get_library_path())
         if not model.is_large_model:
-            sess = ort.InferenceSession(
-                model.model.SerializeToString(), sess_options, providers=["CPUExecutionProvider"]
-            )
+            sess = ort.InferenceSession(model.model.SerializeToString(), sess_options, providers=[self.backend])
         elif model.model_path is not None:  # pragma: no cover
             model.model = onnx.ModelProto()  # clean memory for large model
-            sess = ort.InferenceSession(model.model_path, sess_options, providers=["CPUExecutionProvider"])
+            sess = ort.InferenceSession(model.model_path, sess_options, providers=[self.backend])
         else:  # pragma: no cover
             logger.warning("Please use model path instead of onnx model object to quantize")
         del sess
@@ -1914,6 +1930,7 @@ class ONNXRT_WeightOnlyAdaptor(ONNXRUNTIMEAdaptor):
                 mse=mse,
                 perchannel=perchannel,
                 accuracy_level=accuracy_level,
+                providers=[self.backend],
             )
         if "AWQ" in algos:
             from neural_compressor.adaptor.ox_utils.weight_only import awq_quantize
@@ -1931,6 +1948,7 @@ class ONNXRT_WeightOnlyAdaptor(ONNXRUNTIMEAdaptor):
                 enable_auto_scale=enable_auto_scale,
                 enable_mse_search=enable_mse_search,
                 accuracy_level=accuracy_level,
+                providers=[self.backend],
             )
         elif "RTN" in algos:
             from neural_compressor.adaptor.ox_utils.weight_only import rtn_quantize
@@ -1940,6 +1958,7 @@ class ONNXRT_WeightOnlyAdaptor(ONNXRUNTIMEAdaptor):
                 tmp_model,
                 quant_config,
                 accuracy_level=accuracy_level,
+                providers=[self.backend],
             )
         tmp_model.q_config = copy.deepcopy(quant_config)
         self._dump_model_op_stats(tmp_model, tune_cfg)
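
The first two reformatted hunks in this file choose each QDQ setting from a per-run `recipes` mapping when the key is present and from the global option otherwise. The sketch below isolates that selection logic with plain dictionaries. `pick_setting` and the sample values are hypothetical; only the conditional expression itself is taken from the diff.

```python
def pick_setting(recipes, key, global_default, fallback):
    """Same conditional shape as the diff: the global option is used unless
    the key appears in recipes, in which case the recipe value is used."""
    return global_default if key not in recipes else recipes.get(key, fallback)


# No recipe entry: the global option is returned.
print(pick_setting({}, "dedicated_qdq_pair", True, False))

# Recipe entry present: it overrides the global option.
print(pick_setting({"dedicated_qdq_pair": False}, "dedicated_qdq_pair", True, False))
```

Because the `else` branch only runs when the key is present, the `fallback` argument is never actually returned, so in this sketch the expression behaves like `recipes.get(key, global_default)`. The third hunk (around line 657 of the original file) is different: it branches on the global option's own value rather than on membership in `recipes`.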

{neural_compressor-2.4 → neural_compressor-2.5}/neural_compressor/adaptor/onnxrt_cuda.yaml RENAMED

@@ -17,6 +17,20 @@
 -
   version:
     name: '1.6.0'
+  weight_only_integer: &cap_weight_only {
+    'MatMul': &cap_weight_only_matmul {
+        'weight': {
+                    'dtype': ['int'], # no need to care uint
+                    'bits': [4, 3, 8], # [1-8]
+                    'group_size': [32, -1, 1, 16, 64, 128, 256, 512, 1024], # [1-inf]
+                    'scheme': ['sym', 'asym'], # sym, no ZP
+                    'algorithm': ['RTN', 'AWQ', 'GPTQ']
+        },
+        'activation': {
+                    'dtype': ['fp32']
+        }
+    },
+  }
   int8: &ref_1_6 {
     'static': &ref_1_6_static {
       'Conv': {
@@ -114,6 +128,7 @@
 -
   version:
     name: '1.7.0'
+  weight_only_integer: *cap_weight_only
   int8: {
     'static': {
       'FusedConv': {
@@ -155,6 +170,7 @@
 -
   version:
     name: '1.8.0'
+  weight_only_integer: *cap_weight_only
   int8: {
     'static': {
       'FusedConv': {
@@ -224,6 +240,7 @@
 -
   version:
     name: '1.9.0'
+  weight_only_integer: *cap_weight_only
   int8: {
     'static': {
       'FusedConv': {
@@ -300,6 +317,7 @@
 -
   version:
     name: '1.10.0'
+  weight_only_integer: *cap_weight_only
   int8: {
     'static': {
       'FusedConv': {
@@ -356,6 +374,7 @@
 -
   version:
     name: '1.11.0'
+  weight_only_integer: *cap_weight_only
   int8: &ref_1_11 {
     'static': {
       'FusedConv': {
@@ -427,6 +446,7 @@
 -
   version:
     name: '1.12.0'
+  weight_only_integer: *cap_weight_only
   int8: *ref_1_11
   fp16: *common_fp16
   bf16: *common_bf16
@@ -436,6 +456,7 @@
 -
   version:
     name: 'default'
+  weight_only_integer: *cap_weight_only
   int8: *ref_1_6
   fp16: *common_fp16
   bf16: *common_bf16
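
The `weight_only_integer` capability added above lists, for `MatMul` weights, the allowed values of `dtype`, `bits`, `group_size`, `scheme`, and `algorithm`. As a sketch of how such a table could be used to check a candidate configuration, the code below reports the fields whose values are not listed. The `unsupported_fields` helper is hypothetical, and treating the lists as strict allow-lists is an assumption: the YAML comments (`[1-8]`, `[1-inf]`) suggest the library may accept a wider range than the values enumerated.

```python
# Weight capability for MatMul, copied from the YAML hunk in the diff.
CAP_WEIGHT_ONLY_MATMUL = {
    "dtype": ["int"],
    "bits": [4, 3, 8],
    "group_size": [32, -1, 1, 16, 64, 128, 256, 512, 1024],
    "scheme": ["sym", "asym"],
    "algorithm": ["RTN", "AWQ", "GPTQ"],
}


def unsupported_fields(weight_cfg, capability):
    """Return the sorted names of fields whose value is not listed in capability.

    Hypothetical helper: assumes each capability list is a strict allow-list.
    """
    return sorted(
        name
        for name, value in weight_cfg.items()
        if name not in capability or value not in capability[name]
    )


ok_cfg = {"dtype": "int", "bits": 4, "group_size": 32, "scheme": "sym", "algorithm": "RTN"}
bad_cfg = {"bits": 5, "algorithm": "AUTOROUND"}

print(unsupported_fields(ok_cfg, CAP_WEIGHT_ONLY_MATMUL))
print(unsupported_fields(bad_cfg, CAP_WEIGHT_ONLY_MATMUL))
```

Note that `AUTOROUND`, used in the README's PyTorch example, does not appear in this ONNX Runtime CUDA capability list, which names only `RTN`, `AWQ`, and `GPTQ`.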
