PyPI - embedl-deploy-tensorrt - Versions diffs - 0.4.0__tar.gz → 0.5.0__tar.gz - Mend

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (39)

{embedl_deploy_tensorrt-0.4.0 → embedl_deploy_tensorrt-0.5.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: embedl-deploy-tensorrt
-Version: 0.4.0
+Version: 0.5.0
 Summary: TensorRT backend for embedl-deploy.
 Author-email: Embedl AB <support@embedl.com>
 Project-URL: Homepage, https://www.embedl.com/
@@ -13,7 +13,6 @@ Requires-Python: >=3.10
 Description-Content-Type: text/markdown
 License-File: LICENSE
 License-File: NOTICE
-Requires-Dist: tensorrt
 Provides-Extra: core
 Requires-Dist: embedl-deploy; extra == "core"
 Dynamic: license-file
@@ -55,16 +54,17 @@ hardware target ensuring correct quantization and compilation.
 ## Supported Backends
-| Backend             | Status      |
-|---------------------|-------------|
-| NVIDIA TensorRT     | Supported   |
+| Backend                   | Status          |
+|---------------------------|-----------------|
+| NVIDIA TensorRT  (v10.3)  | Supported       |
+| Lattice SensAI (v8.0)     | In Development  |
-Contact us for other backends.
+Contact Embedl for other backends.
 ## Installation
 ```bash
-pip install embedl-deploy
+pip install "embedl-deploy[tensorrt]"
 ```
 Note that you may need to also install `onnx` and `onnx-simplifier` to export
 and get the exported model compiled with TensorRT if using ONNX as an
@@ -72,7 +72,7 @@ intermediate.
 ---
-## Quick Start
+## Quick Start for TensorRT Backend
 ```python
 import torch
@@ -86,6 +86,9 @@ model = Model().eval()
 example_input = torch.randn(1, 3, 224, 224)
 # 2. Transform — fuse and optimize for TensorRT in one call
+# For more compatibility you can trace your model with torch.export.export
+# as follows:
+# model = torch.export.export(model, (example_input,)).module()
 res = transform(model, patterns=TENSORRT_PATTERNS)
 print("Model\n", res.model.print_readable())
 print("Matches", "\n".join([str(match) for match in res.matches]))
@@ -112,28 +115,54 @@ torch.onnx.export(
 qat_model = quantized_model.train()
 # Freeze BatchNorm, or apply other QAT utilities as needed
 # train(qat_model)
+```
+### Compile
+Compilation can be done with TensorRT's trtexec tool, which can take the ONNX
+model and compile it for inference. The exported layer info and profile can
+be used for debugging, optimization and visualization.
+Note that the ONNX model might need to be simplified with onnx-simplifier to
+make trtexec compile it. Dynamo exported models may have compilation issues,
+so it's recommended to export with dynamo=False.
+```bash
+onnxsim model.onnx model.onnx
+/usr/src/tensorrt/bin/trtexec --onnx=model.onnx --fp16 --int8 --useCudaGraph
+```
+Optionally you can get the layer profile with the following flags:
+```
+--exportLayerInfo=layer_info.json
+--exportProfile=profile.json
+--profilingVerbosity=detailed
+```
-# Compile
-# -------
-# Compilation can be done with TensorRT's trtexec tool, which can take the ONNX
-# model and compile it for inference. The exported layer info and profile can
-# be used for debugging, optimization and visualization.
-#
-# Note: that the ONNX model might need to be simplified with onnx-simplifier to
-# make trtexec compile it. Dynamo exported models may have compilation issues,
-# so it's recommended to export with dynamo=False.
-#
-# We are working on a Aten-based export path that should be more robust and
-# support more models in the future.
-# >> onnxsim model.onnx model.onnx
-# >> trtexec \
-#       --onnx=model.onnx \
-#       --exportLayerInfo=layer_info.json \
-#       --exportProfile=profile.json \
-#       --profilingVerbosity=detailed
-# More benchmarking scripts can be found in the examples/ directory
+## Mixed Precision
+To keep a specific layer in higher precision while quantizing the rest to INT8,
+pass its `nn.Conv2d` instance to `ModulesToSkip` after `transform`. Note that
+`torch.fx.GraphModule` deep-copies submodules during tracing, so you must take
+the reference **from the fused graph**, not from the original model:
+```python
+from embedl_deploy.quantize import quantize, QuantConfig, ModulesToSkip
+res = transform(model, patterns=TENSORRT_PATTERNS)
+# Grab the conv instance from the fused graph (not from the original model)
+first_conv = res.model.FusedConvBNActMaxPool_0.conv
+config = QuantConfig(
+    skip=ModulesToSkip(
+        stub={first_conv},    # disables input activation quantization
+        weight={first_conv},  # disables weight fake-quantization
+    )
+)
+quantized_model = quantize(
+    res.model, (example_input,), config=config, forward_loop=calibration_loop
+)
 ```
 ## Design Principles
@@ -150,10 +179,13 @@ qat_model = quantized_model.train()
    `transform()` is a convenience for the common case where you want
    everything applied.
-3. **FX-graph-based.**
-   All graph analysis and surgery uses `torch.fx`. Models are traced once
-   and manipulated as `fx.GraphModule` objects. Support for Aten graphs
-   produced by `torch.export.export` is planned for the future.
+3. **Graph-based models (torch.export.export and symbolic traced).**
+   All graph analysis and surgery uses traced graphs. Models are traced once
+   and manipulated as `fx.GraphModule` objects with support for tracing via both
+   `torch.fx` (symbolic) as well as `torch.export.export` (Aten). Support for
+   Aten graphs is automatically enabled using Aten recomposition
+   patterns that compose Aten operations into equivalent `torch.nn` modules
+   automatically before conversions and fusions.
 ## Support
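
Note on the Compile section added above: the `trtexec` invocation and the optional profiling flags can also be assembled programmatically. The following is a minimal sketch and not part of the package. `build_trtexec_command` is a hypothetical helper, and it uses only the binary path and flags quoted in the README; the path may differ on your installation.

```python
from __future__ import annotations

# Path as quoted in the README; adjust for your TensorRT installation.
TRTEXEC = "/usr/src/tensorrt/bin/trtexec"


def build_trtexec_command(
    onnx_path: str, profile: bool = False, trtexec: str = TRTEXEC
) -> list[str]:
    """Build the argument list for compiling an ONNX model (hypothetical helper)."""
    cmd = [trtexec, f"--onnx={onnx_path}", "--fp16", "--int8", "--useCudaGraph"]
    if profile:
        # Optional flags from the README for layer info and profiling output.
        cmd += [
            "--exportLayerInfo=layer_info.json",
            "--exportProfile=profile.json",
            "--profilingVerbosity=detailed",
        ]
    return cmd


command = build_trtexec_command("model.onnx", profile=True)
print(" ".join(command))
```

On a machine with TensorRT installed, the resulting list could be passed to `subprocess.run(command, check=True)`.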

{embedl_deploy_tensorrt-0.4.0 → embedl_deploy_tensorrt-0.5.0}/README.md RENAMED

@@ -35,16 +35,17 @@ hardware target ensuring correct quantization and compilation.
 ## Supported Backends
-| Backend             | Status      |
-|---------------------|-------------|
-| NVIDIA TensorRT     | Supported   |
+| Backend                   | Status          |
+|---------------------------|-----------------|
+| NVIDIA TensorRT  (v10.3)  | Supported       |
+| Lattice SensAI (v8.0)     | In Development  |
-Contact us for other backends.
+Contact Embedl for other backends.
 ## Installation
 ```bash
-pip install embedl-deploy
+pip install "embedl-deploy[tensorrt]"
 ```
 Note that you may need to also install `onnx` and `onnx-simplifier` to export
 and get the exported model compiled with TensorRT if using ONNX as an
@@ -52,7 +53,7 @@ intermediate.
 ---
-## Quick Start
+## Quick Start for TensorRT Backend
 ```python
 import torch
@@ -66,6 +67,9 @@ model = Model().eval()
 example_input = torch.randn(1, 3, 224, 224)
 # 2. Transform — fuse and optimize for TensorRT in one call
+# For more compatibility you can trace your model with torch.export.export
+# as follows:
+# model = torch.export.export(model, (example_input,)).module()
 res = transform(model, patterns=TENSORRT_PATTERNS)
 print("Model\n", res.model.print_readable())
 print("Matches", "\n".join([str(match) for match in res.matches]))
@@ -92,28 +96,54 @@ torch.onnx.export(
 qat_model = quantized_model.train()
 # Freeze BatchNorm, or apply other QAT utilities as needed
 # train(qat_model)
+```
+### Compile
+Compilation can be done with TensorRT's trtexec tool, which can take the ONNX
+model and compile it for inference. The exported layer info and profile can
+be used for debugging, optimization and visualization.
+Note that the ONNX model might need to be simplified with onnx-simplifier to
+make trtexec compile it. Dynamo exported models may have compilation issues,
+so it's recommended to export with dynamo=False.
+```bash
+onnxsim model.onnx model.onnx
+/usr/src/tensorrt/bin/trtexec --onnx=model.onnx --fp16 --int8 --useCudaGraph
+```
+Optionally you can get the layer profile with the following flags:
+```
+--exportLayerInfo=layer_info.json
+--exportProfile=profile.json
+--profilingVerbosity=detailed
+```
-# Compile
-# -------
-# Compilation can be done with TensorRT's trtexec tool, which can take the ONNX
-# model and compile it for inference. The exported layer info and profile can
-# be used for debugging, optimization and visualization.
-#
-# Note: that the ONNX model might need to be simplified with onnx-simplifier to
-# make trtexec compile it. Dynamo exported models may have compilation issues,
-# so it's recommended to export with dynamo=False.
-#
-# We are working on a Aten-based export path that should be more robust and
-# support more models in the future.
-# >> onnxsim model.onnx model.onnx
-# >> trtexec \
-#       --onnx=model.onnx \
-#       --exportLayerInfo=layer_info.json \
-#       --exportProfile=profile.json \
-#       --profilingVerbosity=detailed
-# More benchmarking scripts can be found in the examples/ directory
+## Mixed Precision
+To keep a specific layer in higher precision while quantizing the rest to INT8,
+pass its `nn.Conv2d` instance to `ModulesToSkip` after `transform`. Note that
+`torch.fx.GraphModule` deep-copies submodules during tracing, so you must take
+the reference **from the fused graph**, not from the original model:
+```python
+from embedl_deploy.quantize import quantize, QuantConfig, ModulesToSkip
+res = transform(model, patterns=TENSORRT_PATTERNS)
+# Grab the conv instance from the fused graph (not from the original model)
+first_conv = res.model.FusedConvBNActMaxPool_0.conv
+config = QuantConfig(
+    skip=ModulesToSkip(
+        stub={first_conv},    # disables input activation quantization
+        weight={first_conv},  # disables weight fake-quantization
+    )
+)
+quantized_model = quantize(
+    res.model, (example_input,), config=config, forward_loop=calibration_loop
+)
 ```
 ## Design Principles
@@ -130,10 +160,13 @@ qat_model = quantized_model.train()
    `transform()` is a convenience for the common case where you want
    everything applied.
-3. **FX-graph-based.**
-   All graph analysis and surgery uses `torch.fx`. Models are traced once
-   and manipulated as `fx.GraphModule` objects. Support for Aten graphs
-   produced by `torch.export.export` is planned for the future.
+3. **Graph-based models (torch.export.export and symbolic traced).**
+   All graph analysis and surgery uses traced graphs. Models are traced once
+   and manipulated as `fx.GraphModule` objects with support for tracing via both
+   `torch.fx` (symbolic) as well as `torch.export.export` (Aten). Support for
+   Aten graphs is automatically enabled using Aten recomposition
+   patterns that compose Aten operations into equivalent `torch.nn` modules
+   automatically before conversions and fusions.
 ## Support
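
Note on the Mixed Precision section added above: the warning about taking the reference from the fused graph follows from how deep copies behave. The sketch below is a pure-Python analogy, not package code. `Layer` and `Model` are hypothetical stand-ins for `nn.Module` objects, and it assumes the skip sets match modules by object identity, which is what passing module instances in a `set` suggests.

```python
import copy


class Layer:
    """Stand-in for an nn.Module; hashed and compared by identity by default."""

    def __init__(self, name):
        self.name = name


class Model:
    """Stand-in for a model that owns one layer."""

    def __init__(self):
        self.conv = Layer("conv")


original = Model()
traced = copy.deepcopy(original)  # stands in for the copy made during tracing

skip_from_original = {original.conv}
skip_from_traced = {traced.conv}

# The traced model holds a different object, so a set built from the
# original model does not contain it.
print(traced.conv in skip_from_original)  # False
print(traced.conv in skip_from_traced)    # True
```

Under that assumption, a skip set built from the original model would silently match nothing in the fused graph, which is why the README takes `first_conv` from `res.model`.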

{embedl_deploy_tensorrt-0.4.0 → embedl_deploy_tensorrt-0.5.0}/pyproject.toml RENAMED

@@ -24,7 +24,7 @@ license-files = [
 readme = "README.md"
 description = "TensorRT backend for embedl-deploy."
 dynamic = ["version"]
-dependencies = ["tensorrt"]
+dependencies = []
 [project.optional-dependencies]
 core = ["embedl-deploy"]
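
Note on the dependency change above: since `tensorrt` is no longer a hard requirement in 0.5.0, installing this package alone may leave the TensorRT Python bindings absent. Code that needs them could check at runtime. This is a minimal sketch using only the standard library; `has_module` is a hypothetical helper, not package API.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


if not has_module("tensorrt"):
    print("tensorrt Python package not found; install it separately if needed")
```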

{embedl_deploy_tensorrt-0.4.0 → embedl_deploy_tensorrt-0.5.0}/src/embedl_deploy/_internal/tensorrt/backend.py RENAMED

@@ -11,6 +11,7 @@ from embedl_deploy._internal.tensorrt.plan import (
 )
 BACKEND = Backend(
+    name="tensorrt",
     conversion_patterns=TENSORRT_CONVERSION_PATTERNS,
     fusion_patterns=TENSORRT_FUSION_PATTERNS,
     smooth_patterns=TENSORRT_SMOOTH_PATTERNS,

{embedl_deploy_tensorrt-0.4.0 → embedl_deploy_tensorrt-0.5.0}/src/embedl_deploy/_internal/tensorrt/modules/attention.py RENAMED

@@ -69,7 +69,7 @@ class MHAInProjection(ConvertedModule):
         v = v.view(batch, seq, self.num_heads, self.head_dim).transpose(1, 2)
         return q, k, v
-    def __repr__(self) -> str:
+    def __repr__(self) -> str:  # pragma: no cover
         embed_dim = self.num_heads * self.head_dim
         return (
             f"MHAInProjection("
@@ -80,7 +80,7 @@ class MHAInProjection(ConvertedModule):
 class ScaledDotProductAttention(ConvertedModule):
-    """Core attention: ``softmax(Q · Kᵀ / √H) · V``.
+    """Core attention: ``softmax(Q · Kᵀ · scale) · V``.
     :param num_heads:
         Number of attention heads.
@@ -88,6 +88,14 @@ class ScaledDotProductAttention(ConvertedModule):
         Dimension of each head.
     :param dropout:
         Dropout probability (applied during training only).
+    :param is_causal:
+        Whether to apply a causal mask. Mirrors the ``is_causal`` kwarg
+        of ``F.scaled_dot_product_attention``.
+    :param scale:
+        Explicit attention score scale (multiplied on Q·Kᵀ). When
+        ``None`` the PyTorch default ``1/√head_dim`` is used. Models
+        that pre-scale Q themselves (e.g. chronos-2 + RoPE) must pass
+        ``scale=1.0`` so the default scaling does not apply twice.
     """
     def __init__(
@@ -95,11 +103,15 @@ class ScaledDotProductAttention(ConvertedModule):
         num_heads: int,
         head_dim: int,
         dropout: float = 0.0,
+        is_causal: bool = False,
+        scale: float | None = None,
     ) -> None:
         super().__init__()
         self.num_heads = num_heads
         self.head_dim = head_dim
         self.dropout = dropout
+        self.is_causal = is_causal
+        self.scale = scale
     def forward(
         self,
@@ -117,8 +129,9 @@ class ScaledDotProductAttention(ConvertedModule):
         :param v:
             Value tensor ``[B, num_heads, S, head_dim]``.
         :param attn_mask:
-            Optional attention mask. ``aten.scaled_dot_product_attention``
-            takes an optional 4th positional arg; ``WrapAtenSDPAPattern``
+            Optional attention mask.
+            ``torch.nn.functional.scaled_dot_product_attention`` takes an
+            optional 4th positional arg; ``WrapFunctionalSDPAPattern``
             forwards whatever positional args were on the source node, so
             this module accepts the mask too. SAM3, masked-LM, and
             similar models that compile with mixed-mask attention rely
@@ -135,14 +148,18 @@ class ScaledDotProductAttention(ConvertedModule):
             v,
             attn_mask=attn_mask,
             dropout_p=self.dropout if self.training else 0.0,
+            is_causal=self.is_causal,
+            scale=self.scale,
         )
-    def __repr__(self) -> str:
+    def __repr__(self) -> str:  # pragma: no cover
         return (
             f"ScaledDotProductAttention("
             f"num_heads={self.num_heads}, "
             f"head_dim={self.head_dim}, "
-            f"dropout={self.dropout})"
+            f"dropout={self.dropout}, "
+            f"is_causal={self.is_causal}, "
+            f"scale={self.scale})"
         )
@@ -197,7 +214,7 @@ class FusedMHAInProjection(FusedModule):
         v = v.view(batch, seq, num_heads, head_dim).transpose(1, 2)
         return q, k, v
-    def __repr__(self) -> str:
+    def __repr__(self) -> str:  # pragma: no cover
         embed_dim = self.in_proj.num_heads * self.in_proj.head_dim
         return (
             f"FusedMHAInProjection("
@@ -275,10 +292,18 @@ class FusedScaledDotProductAttention(FusedModule):
         # MHA kernel onto the slower INT8-aware variant for no gain.
         if not self.surrounded or not self.softmax_quant.enabled:
             return self.attention(q, k, v, attn_mask)
-        # Use ``1/sqrt(head_dim)`` rather than ``head_dim ** -0.5``: the
+        # Honour the wrapped attention module's explicit ``scale`` if
+        # set — models that pre-scale Q themselves (chronos-2 + RoPE,
+        # for example) build with ``scale=1.0`` to disable the default
+        # ``1/sqrt(head_dim)`` scaling. Falling back to the default
+        # here would apply it twice and collapse softmax.
+        # Note on ``1/sqrt(head_dim)`` vs ``head_dim ** -0.5``: the
         # tensor Pow with a negative float exponent traces to ONNX as a
         # ``Cast → complex128`` node that TRT 10.x can't parse.
-        scale = 1.0 / math.sqrt(q.shape[-1])
+        if self.attention.scale is not None:
+            scale = self.attention.scale
+        else:
+            scale = 1.0 / math.sqrt(q.shape[-1])
         attn_weight = torch.matmul(q, k.transpose(-2, -1)) * scale
         if attn_mask is not None:
             if attn_mask.dtype == torch.bool:
@@ -297,7 +322,7 @@ class FusedScaledDotProductAttention(FusedModule):
             )
         return torch.matmul(attn_weight, v)
-    def __repr__(self) -> str:
+    def __repr__(self) -> str:  # pragma: no cover
         a = self.attention
         qdq = "yes" if self.softmax_quant.enabled else "no"
         return (
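
Note on the `scale` handling introduced in this file: the double-scaling problem described in the comments can be illustrated without PyTorch. The following is a minimal pure-Python sketch for a single head. It assumes only what the docstring states, namely that an explicit `scale` replaces the default `1/√head_dim`. `resolve_scale` and `attention_weights` are hypothetical names, not package API.

```python
import math


def resolve_scale(scale, head_dim):
    """Use the explicit scale if given, else the 1/sqrt(head_dim) default."""
    if scale is not None:
        return scale
    return 1.0 / math.sqrt(head_dim)


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, k, scale=None):
    """Compute softmax(Q · Kᵀ · scale) for one head; q and k are lists of vectors."""
    s = resolve_scale(scale, len(q[0]))
    return [
        softmax([s * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        for qi in q
    ]


q = [[1.0, 2.0, 0.0, 1.0], [0.5, 0.0, 1.0, 2.0]]
k = [[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 1.0]]

# A model that pre-scales Q itself by the default factor.
default = 1.0 / math.sqrt(len(q[0]))
q_prescaled = [[x * default for x in row] for row in q]

reference = attention_weights(q, k)                   # default scaling, raw Q
correct = attention_weights(q_prescaled, k, scale=1.0)  # pre-scaled Q, scale disabled
doubled = attention_weights(q_prescaled, k)           # pre-scaled Q scaled again

print("scale=1.0 matches reference:", all(
    math.isclose(a, b) for ra, rb in zip(reference, correct) for a, b in zip(ra, rb)
))
print("default scale on pre-scaled Q differs:", doubled != reference)
```

With pre-scaled Q, passing `scale=1.0` reproduces the reference weights, while falling back to the default applies the factor twice and flattens the softmax distribution.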

{embedl_deploy_tensorrt-0.4.0 → embedl_deploy_tensorrt-0.5.0}/src/embedl_deploy/_internal/tensorrt/modules/conv.py RENAMED

@@ -91,7 +91,7 @@ class FusedConvBNAct(FusedModule):
             x = self.bn(x)
         return self.act(x)
-    def __repr__(self) -> str:
+    def __repr__(self) -> str:  # pragma: no cover
         bn_info = ""
         if self.bn is not None:
             bn_info = f", bn={self.bn.num_features} (foldable)"
@@ -129,7 +129,7 @@ class FusedConvBN(FusedModule):
             x = self.bn(x)
         return x
-    def __repr__(self) -> str:
+    def __repr__(self) -> str:  # pragma: no cover
         bn_info = ""
         if self.bn is not None:
             bn_info = f", bn={self.bn.num_features} (foldable)"
@@ -168,7 +168,7 @@ class FusedConvBNActMaxPool(FusedModule):
         x = self.act(x)
         return self.maxpool(x)
-    def __repr__(self) -> str:
+    def __repr__(self) -> str:  # pragma: no cover
         bn_info = ""
         if self.bn is not None:
             bn_info = f", bn={self.bn.num_features} (foldable)"
@@ -213,7 +213,7 @@ class FusedConvBNAddAct(FusedModule):
         x = self.bn(x)
         return self.act(x + residual)
-    def __repr__(self) -> str:
+    def __repr__(self) -> str:  # pragma: no cover
         return (
             f"FusedConvBNAddAct("
             f"{self.conv.in_channels}→{self.conv.out_channels}, "

{embedl_deploy_tensorrt-0.4.0 → embedl_deploy_tensorrt-0.5.0}/src/embedl_deploy/_internal/tensorrt/modules/linear.py RENAMED

@@ -82,7 +82,7 @@ class FusedLinear(FusedModule):
         # pylint: disable-next=not-callable
         return F.linear(x, weight, self.linear.bias)
-    def __repr__(self) -> str:
+    def __repr__(self) -> str:  # pragma: no cover
         return (
             f"FusedLinear("
             f"{self.linear.in_features}→{self.linear.out_features})"
@@ -113,7 +113,7 @@ class FusedLinearAct(FusedModule):
         x = F.linear(x, weight, self.linear.bias)
         return self.act(x)
-    def __repr__(self) -> str:
+    def __repr__(self) -> str:  # pragma: no cover
         act_name = type(self.act).__name__
         return (
             f"FusedLinearAct("
@@ -151,7 +151,7 @@ class FusedLayerNorm(FusedModule):
         """Apply ``layer_norm``."""
         return self.layer_norm(x)
-    def __repr__(self) -> str:
+    def __repr__(self) -> str:  # pragma: no cover
         return (
             f"FusedLayerNorm("
             f"normalized_shape={self.layer_norm.normalized_shape}, "

{embedl_deploy_tensorrt-0.4.0 → embedl_deploy_tensorrt-0.5.0}/src/embedl_deploy/_internal/tensorrt/modules/pointwise.py RENAMED

@@ -34,5 +34,5 @@ class FusedActAdd(FusedModule):
         """Apply ``act(x) + residual``."""
         return self.act(x) + residual
-    def __repr__(self) -> str:
+    def __repr__(self) -> str:  # pragma: no cover
         return f"FusedActAdd({type(self.act).__name__})"

{embedl_deploy_tensorrt-0.4.0 → embedl_deploy_tensorrt-0.5.0}/src/embedl_deploy/_internal/tensorrt/modules/pool.py RENAMED

@@ -21,5 +21,5 @@ class FusedAdaptiveAvgPool2d(FusedModule):
         """Apply adaptive average pooling."""
         return self.pool(x)
-    def __repr__(self) -> str:
+    def __repr__(self) -> str:  # pragma: no cover
         return f"FusedAdaptiveAvgPool2d(output_size={self.pool.output_size})"
