PyPI - embedl-deploy-tensorrt - Versions diffs - 0.4.0__tar.gz → 0.4.1__tar.gz - Mend

embedl-deploy-tensorrt 0.4.0.tar.gz → 0.4.1.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (34)

{embedl_deploy_tensorrt-0.4.0 → embedl_deploy_tensorrt-0.4.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: embedl-deploy-tensorrt
-Version: 0.4.0
+Version: 0.4.1
 Summary: TensorRT backend for embedl-deploy.
 Author-email: Embedl AB <support@embedl.com>
 Project-URL: Homepage, https://www.embedl.com/
@@ -13,7 +13,6 @@ Requires-Python: >=3.10
 Description-Content-Type: text/markdown
 License-File: LICENSE
 License-File: NOTICE
-Requires-Dist: tensorrt
 Provides-Extra: core
 Requires-Dist: embedl-deploy; extra == "core"
 Dynamic: license-file
@@ -55,16 +54,16 @@ hardware target ensuring correct quantization and compilation.
 ## Supported Backends
-| Backend             | Status      |
-|---------------------|-------------|
-| NVIDIA TensorRT     | Supported   |
+| Backend                 | Status      |
+|-------------------------|-------------|
+| NVIDIA TensorRT (v10.3) | Supported   |
-Contact us for other backends.
+Contact Embedl for other backends.
 ## Installation
 ```bash
-pip install embedl-deploy
+pip install "embedl-deploy[tensorrt]"
 ```
 Note that you may need to also install `onnx` and `onnx-simplifier` to export
 and get the exported model compiled with TensorRT if using ONNX as an
@@ -86,6 +85,9 @@ model = Model().eval()
 example_input = torch.randn(1, 3, 224, 224)
 # 2. Transform — fuse and optimize for TensorRT in one call
+# For broader compatibility, you can trace your model with torch.export.export
+# as follows (the example inputs must be passed as a tuple):
+# model = torch.export.export(model, (example_input,)).module()
 res = transform(model, patterns=TENSORRT_PATTERNS)
 print("Model\n", res.model.print_readable())
 print("Matches", "\n".join([str(match) for match in res.matches]))
@@ -112,28 +114,54 @@ torch.onnx.export(
 qat_model = quantized_model.train()
 # Freeze BatchNorm, or apply other QAT utilities as needed
 # train(qat_model)
+```
+### Compile
+Compilation can be done with TensorRT's trtexec tool, which can take the ONNX
+model and compile it for inference. The exported layer info and profile can
+be used for debugging, optimization and visualization.
+Note: the ONNX model might need to be simplified with onnx-simplifier before
+trtexec can compile it. Dynamo-exported models may have compilation issues,
+so exporting with dynamo=False is recommended.
+```bash
+onnxsim model.onnx model.onnx
+/usr/src/tensorrt/bin/trtexec --onnx=model.onnx --fp16 --int8 --useCudaGraph
+```
+Optionally, add the following flags to trtexec to export the layer info and profile:
+```
+--exportLayerInfo=layer_info.json
+--exportProfile=profile.json
+--profilingVerbosity=detailed
+```
-# Compile
-# -------
-# Compilation can be done with TensorRT's trtexec tool, which can take the ONNX
-# model and compile it for inference. The exported layer info and profile can
-# be used for debugging, optimization and visualization.
-#
-# Note: that the ONNX model might need to be simplified with onnx-simplifier to
-# make trtexec compile it. Dynamo exported models may have compilation issues,
-# so it's recommended to export with dynamo=False.
-#
-# We are working on a Aten-based export path that should be more robust and
-# support more models in the future.
-# >> onnxsim model.onnx model.onnx
-# >> trtexec \
-#       --onnx=model.onnx \
-#       --exportLayerInfo=layer_info.json \
-#       --exportProfile=profile.json \
-#       --profilingVerbosity=detailed
-# More benchmarking scripts can be found in the examples/ directory
+## Mixed Precision
+To keep a specific layer in higher precision while quantizing the rest to INT8,
+pass its `nn.Conv2d` instance to `ModulesToSkip` after `transform`. Note that
+`torch.fx.GraphModule` deep-copies submodules during tracing, so you must take
+the reference **from the fused graph**, not from the original model:
+```python
+from embedl_deploy.quantize import quantize, QuantConfig, ModulesToSkip
+res = transform(model, patterns=TENSORRT_PATTERNS)
+# Grab the conv instance from the fused graph (not from the original model)
+first_conv = res.model.FusedConvBNActMaxPool_0.conv
+config = QuantConfig(
+    skip=ModulesToSkip(
+        stub={first_conv},    # disables input activation quantization
+        weight={first_conv},  # disables weight fake-quantization
+    )
+)
+quantized_model = quantize(
+    res.model, (example_input,), config=config, forward_loop=calibration_loop
+)
 ```
 ## Design Principles
@@ -150,10 +178,13 @@ qat_model = quantized_model.train()
    `transform()` is a convenience for the common case where you want
    everything applied.
-3. **FX-graph-based.**
-   All graph analysis and surgery uses `torch.fx`. Models are traced once
-   and manipulated as `fx.GraphModule` objects. Support for Aten graphs
-   produced by `torch.export.export` is planned for the future.
+3. **Graph-based (`torch.export.export` and symbolic tracing).**
+   All graph analysis and surgery operate on traced graphs. Models are traced
+   once and manipulated as `fx.GraphModule` objects, with support for tracing
+   via both `torch.fx` (symbolic) and `torch.export.export` (Aten). Aten
+   graphs are supported through recomposition patterns that automatically
+   compose Aten operations into equivalent `torch.nn` modules before
+   conversions and fusions.
 ## Support

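The Compile section added to the long description above runs two commands: simplify the ONNX file with onnxsim, then build it with trtexec, optionally exporting layer info and a profile. A minimal sketch of scripting those steps follows. `build_compile_commands` is a hypothetical helper, not part of embedl-deploy, and the trtexec path is the one used in the README, which may differ on your system. The sketch only assembles argument lists; pass each one to `subprocess.run(cmd, check=True)` to execute it.

```python
# Default trtexec location used in the README; adjust for your install (assumption).
TRTEXEC = "/usr/src/tensorrt/bin/trtexec"


def build_compile_commands(
    onnx_path: str,
    trtexec: str = TRTEXEC,
    profile: bool = False,
) -> list[list[str]]:
    """Return the onnxsim and trtexec argument lists shown in the README.

    Hypothetical helper: it builds commands and runs nothing.
    """
    # Step 1: simplify the ONNX graph in place (same input and output file).
    simplify = ["onnxsim", onnx_path, onnx_path]
    # Step 2: build an engine with FP16 and INT8 enabled, using CUDA graphs.
    build = [trtexec, f"--onnx={onnx_path}", "--fp16", "--int8", "--useCudaGraph"]
    if profile:
        # Optional flags that dump per-layer info and timings for debugging.
        build += [
            "--exportLayerInfo=layer_info.json",
            "--exportProfile=profile.json",
            "--profilingVerbosity=detailed",
        ]
    return [simplify, build]


if __name__ == "__main__":
    for cmd in build_compile_commands("model.onnx", profile=True):
        print(" ".join(cmd))
```

Keeping the commands as lists rather than shell strings avoids quoting problems when paths contain spaces.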
{embedl_deploy_tensorrt-0.4.0 → embedl_deploy_tensorrt-0.4.1}/README.md RENAMED

@@ -35,16 +35,16 @@ hardware target ensuring correct quantization and compilation.
 ## Supported Backends
-| Backend             | Status      |
-|---------------------|-------------|
-| NVIDIA TensorRT     | Supported   |
+| Backend                 | Status      |
+|-------------------------|-------------|
+| NVIDIA TensorRT (v10.3) | Supported   |
-Contact us for other backends.
+Contact Embedl for other backends.
 ## Installation
 ```bash
-pip install embedl-deploy
+pip install "embedl-deploy[tensorrt]"
 ```
 Note that you may need to also install `onnx` and `onnx-simplifier` to export
 and get the exported model compiled with TensorRT if using ONNX as an
@@ -66,6 +66,9 @@ model = Model().eval()
 example_input = torch.randn(1, 3, 224, 224)
 # 2. Transform — fuse and optimize for TensorRT in one call
+# For broader compatibility, you can trace your model with torch.export.export
+# as follows (the example inputs must be passed as a tuple):
+# model = torch.export.export(model, (example_input,)).module()
 res = transform(model, patterns=TENSORRT_PATTERNS)
 print("Model\n", res.model.print_readable())
 print("Matches", "\n".join([str(match) for match in res.matches]))
@@ -92,28 +95,54 @@ torch.onnx.export(
 qat_model = quantized_model.train()
 # Freeze BatchNorm, or apply other QAT utilities as needed
 # train(qat_model)
+```
+### Compile
+Compilation can be done with TensorRT's trtexec tool, which can take the ONNX
+model and compile it for inference. The exported layer info and profile can
+be used for debugging, optimization and visualization.
+Note: the ONNX model might need to be simplified with onnx-simplifier before
+trtexec can compile it. Dynamo-exported models may have compilation issues,
+so exporting with dynamo=False is recommended.
+```bash
+onnxsim model.onnx model.onnx
+/usr/src/tensorrt/bin/trtexec --onnx=model.onnx --fp16 --int8 --useCudaGraph
+```
+Optionally, add the following flags to trtexec to export the layer info and profile:
+```
+--exportLayerInfo=layer_info.json
+--exportProfile=profile.json
+--profilingVerbosity=detailed
+```
-# Compile
-# -------
-# Compilation can be done with TensorRT's trtexec tool, which can take the ONNX
-# model and compile it for inference. The exported layer info and profile can
-# be used for debugging, optimization and visualization.
-#
-# Note: that the ONNX model might need to be simplified with onnx-simplifier to
-# make trtexec compile it. Dynamo exported models may have compilation issues,
-# so it's recommended to export with dynamo=False.
-#
-# We are working on a Aten-based export path that should be more robust and
-# support more models in the future.
-# >> onnxsim model.onnx model.onnx
-# >> trtexec \
-#       --onnx=model.onnx \
-#       --exportLayerInfo=layer_info.json \
-#       --exportProfile=profile.json \
-#       --profilingVerbosity=detailed
-# More benchmarking scripts can be found in the examples/ directory
+## Mixed Precision
+To keep a specific layer in higher precision while quantizing the rest to INT8,
+pass its `nn.Conv2d` instance to `ModulesToSkip` after `transform`. Note that
+`torch.fx.GraphModule` deep-copies submodules during tracing, so you must take
+the reference **from the fused graph**, not from the original model:
+```python
+from embedl_deploy.quantize import quantize, QuantConfig, ModulesToSkip
+res = transform(model, patterns=TENSORRT_PATTERNS)
+# Grab the conv instance from the fused graph (not from the original model)
+first_conv = res.model.FusedConvBNActMaxPool_0.conv
+config = QuantConfig(
+    skip=ModulesToSkip(
+        stub={first_conv},    # disables input activation quantization
+        weight={first_conv},  # disables weight fake-quantization
+    )
+)
+quantized_model = quantize(
+    res.model, (example_input,), config=config, forward_loop=calibration_loop
+)
 ```
 ## Design Principles
@@ -130,10 +159,13 @@ qat_model = quantized_model.train()
    `transform()` is a convenience for the common case where you want
    everything applied.
-3. **FX-graph-based.**
-   All graph analysis and surgery uses `torch.fx`. Models are traced once
-   and manipulated as `fx.GraphModule` objects. Support for Aten graphs
-   produced by `torch.export.export` is planned for the future.
+3. **Graph-based (`torch.export.export` and symbolic tracing).**
+   All graph analysis and surgery operate on traced graphs. Models are traced
+   once and manipulated as `fx.GraphModule` objects, with support for tracing
+   via both `torch.fx` (symbolic) and `torch.export.export` (Aten). Aten
+   graphs are supported through recomposition patterns that automatically
+   compose Aten operations into equivalent `torch.nn` modules before
+   conversions and fusions.
 ## Support

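The Mixed Precision note in the README above hinges on object identity: if a skip set matches module instances, a reference taken before the model was copied cannot match the copy. The stdlib-only sketch below illustrates the effect with `copy.deepcopy`. `FakeConv`, `is_skipped`, and the dictionary "models" are hypothetical stand-ins for `nn.Conv2d`, the skip check, and the original and fused graphs; it assumes, as the README implies, that `ModulesToSkip` compares instances rather than layer names.

```python
import copy


class FakeConv:
    """Hypothetical stand-in for nn.Conv2d. Like nn.Module, it keeps the
    default identity-based __hash__ and __eq__."""

    def __init__(self, out_channels: int) -> None:
        self.out_channels = out_channels


def is_skipped(module: object, skip: set[object]) -> bool:
    """Return True only if this exact instance is in the skip set."""
    return module in skip


original = {"conv": FakeConv(64)}
# Stand-in for the deep copy the README says happens during tracing.
fused = copy.deepcopy(original)

wrong_skip = {original["conv"]}  # reference taken from the original model
right_skip = {fused["conv"]}     # reference taken from the fused graph

print(is_skipped(fused["conv"], wrong_skip))  # False: a different instance
print(is_skipped(fused["conv"], right_skip))  # True: the same instance
```

The copied layer has identical attributes, yet it does not match the original reference, which is why the README reads `first_conv` from `res.model` rather than from `model`.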
{embedl_deploy_tensorrt-0.4.0 → embedl_deploy_tensorrt-0.4.1}/pyproject.toml RENAMED

@@ -24,7 +24,7 @@ license-files = [
 readme = "README.md"
 description = "TensorRT backend for embedl-deploy."
 dynamic = ["version"]
-dependencies = ["tensorrt"]
+dependencies = []
 [project.optional-dependencies]
 core = ["embedl-deploy"]

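With `dependencies = []`, installing this package alone no longer pulls in `tensorrt`. The README now suggests `pip install "embedl-deploy[tensorrt]"`, which presumably brings TensorRT in through that extra. A minimal sketch for checking at runtime whether a TensorRT distribution is installed and whether it matches the v10.3 listed as supported is shown below. It assumes the distribution is named `tensorrt`, as NVIDIA's PyPI wheels are; a TensorRT installed outside pip may not be visible to `importlib.metadata`. Treating a matching major.minor as "ok" is also an assumption, not a documented compatibility rule. `parse_major_minor` and `tensorrt_status` are hypothetical helpers.

```python
import re
from importlib import metadata

# TensorRT version listed in the README's backend table.
SUPPORTED = (10, 3)


def parse_major_minor(version: str) -> tuple[int, int]:
    """Turn a version string such as '10.3.0.26' into (10, 3)."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def tensorrt_status(dist_name: str = "tensorrt") -> str:
    """Report whether the distribution is missing, matches, or is untested."""
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return "missing"
    if parse_major_minor(installed) == SUPPORTED:
        return f"ok ({installed})"
    return f"untested ({installed})"


if __name__ == "__main__":
    print(tensorrt_status())
```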
{embedl_deploy_tensorrt-0.4.0 → embedl_deploy_tensorrt-0.4.1}/src/embedl_deploy/_internal/tensorrt/patterns/conversions/attention.py RENAMED

@@ -47,7 +47,7 @@ from embedl_deploy._internal.tensorrt.modules.swin_attention import (
 )
 try:
-    from torchvision.models.swin_transformer import (  # type: ignore[import-untyped]
+    from torchvision.models.swin_transformer import (
         shifted_window_attention,
     )

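The `general.py` change that follows narrows `_is_flatten` to 4-D inputs flattened from `start_dim=1`. The shape arithmetic behind the rule: flattening dims `start_dim` through the last dim collapses them into one, leaving `start_dim + 1` dims. The stdlib-only sketch below reproduces that arithmetic and the new predicate. `flatten_shape` and `matches_head_flatten` are hypothetical helpers that assume the default `end_dim=-1`; unlike the real function, they take the shape and `start_dim` directly instead of reading them from an `fx.Node`, and the example shapes are illustrative.

```python
from math import prod
from typing import Optional


def flatten_shape(shape: tuple[int, ...], start_dim: int = 0) -> tuple[int, ...]:
    """Output shape of torch.flatten(x, start_dim) with the default end_dim=-1."""
    if start_dim < 0:
        start_dim += len(shape)  # e.g. -1 on a 4-D shape means dim 3
    # Keep dims before start_dim and collapse the rest into a single dim.
    return shape[:start_dim] + (prod(shape[start_dim:]),)


def matches_head_flatten(shape: Optional[tuple[int, ...]], start_dim: int) -> bool:
    """Mirror of the 0.4.1 rule: only a 4-D input flattened from dim 1."""
    return shape is not None and len(shape) == 4 and start_dim == 1


# Classification head: (N, C, H, W) -> (N, C*H*W), accepted.
print(flatten_shape((8, 512, 7, 7), 1), matches_head_flatten((8, 512, 7, 7), 1))
# -> (8, 25088) True
# MHA head merging, flatten(2): (B, T, heads, head_dim) -> (B, T, heads*head_dim), rejected.
print(flatten_shape((8, 197, 12, 64), 2), matches_head_flatten((8, 197, 12, 64), 2))
# -> (8, 197, 768) False
```

Note that the real code falls back to `start_dim = 0` (the `torch.flatten` default) when no positional `start_dim` is found, so such flattens are rejected as well.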
{embedl_deploy_tensorrt-0.4.0 → embedl_deploy_tensorrt-0.4.1}/src/embedl_deploy/_internal/tensorrt/patterns/conversions/general.py RENAMED

@@ -139,7 +139,13 @@ class RemoveExportAssertPattern(Pattern):
 def _is_flatten(node: fx.Node) -> bool:
-    """Return ``True`` when `node` is a flatten call with shape metadata."""
+    """Return ``True`` when `node` is a 4D→2D flatten with shape metadata.
+    Only matches flattens with ``start_dim=1`` on a 4-D input, which
+    produces a 2-D output (the classification-head pattern). Flattens
+    with ``start_dim >= 2`` (e.g. the MHA head-merging ``flatten(2)``
+    that produces 3-D output) are rejected.
+    """
     if node.op == "call_function":
         is_flat = node.target is torch.flatten
     elif node.op == "call_method":
@@ -149,7 +155,16 @@ def _is_flatten(node: fx.Node) -> bool:
     if not is_flat:
         return False
     shape = get_input_shape(node)
-    return shape is not None
+    if shape is None or len(shape) != 4:
+        return False
+    mod = get_module(node)
+    if isinstance(mod, nn.Flatten):
+        start_dim: int = mod.start_dim
+    elif len(node.args) > 1 and isinstance(node.args[1], int):
+        start_dim = node.args[1]
+    else:
+        start_dim = 0
+    return start_dim == 1
 ElementWiseLike: TypeAlias = (