PyPI - embedl-deploy-tensorrt - Versions diffs - 0.6.1__tar.gz → 0.7.0__tar.gz - Mend

embedl-deploy-tensorrt 0.6.1 (tar.gz) → 0.7.0 (tar.gz)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (36)

{embedl_deploy_tensorrt-0.6.1 → embedl_deploy_tensorrt-0.7.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: embedl-deploy-tensorrt
-Version: 0.6.1
+Version: 0.7.0
 Summary: TensorRT backend for embedl-deploy.
 Author-email: Embedl AB <support@embedl.com>
 Project-URL: Homepage, https://www.embedl.com/
@@ -58,6 +58,7 @@ hardware target ensuring correct quantization and compilation.
 |---------------------------|-----------------|
 | NVIDIA TensorRT  (v10.3)  | Supported       |
 | Lattice SensAI (v8.0)     | In Development  |
+| AMD Vitis AI              | In Development  |
 Contact Embedl for other backends.

{embedl_deploy_tensorrt-0.6.1 → embedl_deploy_tensorrt-0.7.0}/README.md RENAMED

@@ -39,6 +39,7 @@ hardware target ensuring correct quantization and compilation.
 |---------------------------|-----------------|
 | NVIDIA TensorRT  (v10.3)  | Supported       |
 | Lattice SensAI (v8.0)     | In Development  |
+| AMD Vitis AI              | In Development  |
 Contact Embedl for other backends.

{embedl_deploy_tensorrt-0.6.1 → embedl_deploy_tensorrt-0.7.0}/src/embedl_deploy/_internal/tensorrt/modules/AGENTS.md RENAMED

@@ -31,10 +31,10 @@ core/quantize/prepare.py       # walks the graph, uses isinstance(mod, FusedModu
 ```
 **Instantiated by:** pattern grafts and `replace()` methods in `patterns/fusions.py`
-and `patterns/conversions/`. Most fusion patterns declare a `graft` attribute
-pointing to the fused module class; the graft system calls `_collect_modules()`
-to gather matched modules in tree order and passes them as positional arguments
-to the constructor. For example, `ConvBNActPattern` grafts `FusedConvBNAct`;
+and `patterns/conversions/`. Most fusion patterns declare `graft = (make_fused(FusedFoo),)`;
+`make_fused` gathers the matched modules in tree order (via `_collect_modules()`)
+and passes them as positional arguments to the constructor. For example,
+`ConvBNActPattern` grafts `(make_fused(FusedConvBNAct),)`;
 `DecomposeMultiheadAttentionPattern.replace()` constructs `MHAInProjection` and
 `ScaledDotProductAttention`.
@@ -61,15 +61,18 @@ Every `FusedModule` subclass must satisfy three requirements:
 2. **Call `super().__init__()`**, which creates `self.input_quant_stubs` — a dict
    mapping each index in `inputs_to_quantize` to a fresh `QuantStub`. The Q/DQ
    pass later enables and configures these stubs during `prepare_qdq()`. It also
-   initialises `self.surrounded = False`, which is later set to `True` by
-   `SurroundWithQuantStubsPattern` to mark modules that have been surrounded
-   with input `QuantStub` entries.
+   initialises `self.output_precision = Precision.UNSET`. Surround-type modules
+   override this to `Precision.DEFERRED` in their constructors to signal that
+   their INT8 decision depends on graph context (see
+   `FusedAdaptiveAvgPool2d`, `FusedScaledDotProductAttention`,
+   `FusedSwinAttention`). Quantization patterns read and update
+   `output_precision` to track precision flow through the graph.
 3. **Implement `forward()`** with the fused computation the module represents.
 ### Graft compatibility
-When a pattern uses `graft = FusedFoo` (bare class), the graft system calls
+When a pattern uses `graft = (make_fused(FusedFoo),)`, `make_fused` calls
 `_collect_modules()` to walk the matched tree and collect the `nn.Module`
 instances corresponding to trunk and fork nodes (nested branches first, then
 trunk nodes). These are passed as positional arguments to the constructor.
@@ -155,11 +158,14 @@ Fusing them together avoids an extra quantized activation between Conv and Pool.
 `inputs_to_quantize = {0, 1}`. The residual tensor is the second input (index 1),
 hence both inputs are quantized. This is the ResNet skip-connection block tail.
-**INT8 compatibility guard:** grouped convolutions where `in_channels / groups` or
-`out_channels / groups` is not a multiple of 4 cannot be quantized to INT8 in
-TensorRT. For those cases `_is_int8_compatible_conv()` returns `False`, and the
-module sets `self.input_quant_stubs = {}` (overriding the `super().__init__()`
-default), effectively opting out of quantization.
+**INT8 compatibility:** all conv fused modules unconditionally create a
+`WeightFakeQuantize` and `input_quant_stubs` in their constructors.  INT8
+compatibility filtering (grouped convolutions where `in_channels / groups` or
+`out_channels / groups` is not a multiple of 4, depthwise convolutions) is
+deferred to `DisableInt8Pattern` in the quantization pass, which
+calls `disable_int8()` on incompatible modules.  Depthwise convolutions set
+`output_precision = Precision.DEFERRED` so the surround pass can decide
+contextually whether to enable INT8.
 ### Linear family — `linear.py`
@@ -217,7 +223,9 @@ is quantized; `_key` and `_value` are accepted to match the self-attention
 call-site but ignored.
 **`FusedScaledDotProductAttention`** — wraps `ScaledDotProductAttention`,
-`inputs_to_quantize = set()`. Adds an internal `softmax_quant` stub with a fixed
+`inputs_to_quantize = {0, 1, 2}`. The pre-declared stubs (Q, K, V) are created
+disabled; `SurroundWithQuantStubsPattern` enables them when the module is
+surrounded with Q/DQ stubs. Adds an internal `softmax_quant` stub with a fixed
 calibration of `(1/127, 0)` — i.e., 8-bit symmetric with a fixed scale matched to
 the softmax output range `[0, 1]`. When the stub is disabled the module delegates
 to the plain SDPA; when enabled it performs manual attention with the quantization
@@ -262,10 +270,12 @@ three modules. This is intentional: the three modules reference the *same*
 referencing attributes to point to the copy. However, the sharing is not
 thread-safe (see Gotchas).
-**`FusedSwinAttention`** — wraps `SwinAttention`, `inputs_to_quantize = set()`.
-Mirrors `FusedScaledDotProductAttention`: adds an internal `softmax_quant` stub
-with fixed calibration. When enabled it manually expands the attention computation
-to insert the quantization step between softmax and BMM2.
+**`FusedSwinAttention`** — wraps `SwinAttention`, `inputs_to_quantize = {0, 1, 2}`.
+The pre-declared stubs (Q, K, V) are created disabled;
+`SurroundWithQuantStubsPattern` enables them. Mirrors
+`FusedScaledDotProductAttention`: adds an internal `softmax_quant` stub with fixed
+calibration. When enabled it manually expands the attention computation to insert
+the quantization step between softmax and BMM2.
 ### Pointwise family — `pointwise.py`
@@ -283,9 +293,9 @@ quantized at their respective scales.
 Pattern that creates it: `patterns/fusions.py`.
 **`FusedAdaptiveAvgPool2d`** — wraps `nn.AdaptiveAvgPool2d`,
-`inputs_to_quantize = set()`. No quantization is applied (pooling is a
-linear operation that does not benefit from separate quantization). Exists as a
-`FusedModule` so the Q/DQ pass treats it uniformly without special-casing it.
+`inputs_to_quantize = {0}`. Defers output precision to the surround pattern.
+The pre-declared stub is created disabled; `SurroundWithQuantStubsPattern`
+enables it when the module is surrounded with Q/DQ stubs.
 ---
@@ -355,11 +365,13 @@ weights are modified after fusion, the fused module sees the change. This is
 usually desirable (e.g. for QAT gradient updates), but can be surprising if the
 original model is used independently.
-**Grouped conv INT8 opt-out:** When `_is_int8_compatible_conv()` returns `False`
-the `__init__` of the conv fused modules sets `self.input_quant_stubs = {}`,
-overriding the dict populated by `FusedModule.__init__()`. This means the module
-is effectively excluded from quantization despite being a `FusedModule`. The
-`weight_fake_quant` attribute is also not created in this path.
+**Grouped conv INT8 opt-out:** INT8 compatibility for grouped convolutions is
+no longer handled in the module constructor. Instead, `DisableInt8Pattern`
+calls `disable_int8()` on modules where `is_int8_beneficial_conv()` returns
+`False`. The module still has `input_quant_stubs` and `weight_fake_quant` after
+construction — they are disabled by the pattern pass. This means
+`bool(mod.input_quant_stubs)` is `True` even for incompatible convolutions until
+the quantization pass runs.
 ---
@@ -372,11 +384,11 @@ is effectively excluded from quantization despite being a `FusedModule`. The
    `__init__` if the module has a learnable weight that should be fake-quantized.
 4. **Implement `forward()`** with the fused computation.
 5. **Write a `Pattern` subclass** in `patterns/fusions.py`. Prefer declaring
-   `graft = FusedFoo` (bare class) so the graft system handles replacement
+   `graft = (make_fused(FusedFoo),)` so the graft system handles replacement
    automatically. The constructor must accept modules in tree order (nested
    branches first, then trunk nodes). If the replacement logic cannot be
-   expressed as a bare-class graft, provide a `ReplacementMaker` or a custom
-   `replace()` method instead.
+   expressed with `make_fused`, provide a different `ReplacementMaker` or a
+   custom `replace()` method instead.
 6. **Add the pattern to `TENSORRT_PATTERNS`** (or the appropriate pattern list) in
    `tensorrt/plan.py`.
 7. **Write tests** in `tests/tensorrt/patterns/fusions/` covering: pattern match,
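
Annotation on the "Grouped conv INT8 opt-out" gotcha in the AGENTS.md hunks above: from 0.7.0 the stubs exist from construction onward, so the truthiness of `input_quant_stubs` no longer says whether a module will end up quantized. The sketch below illustrates that with hypothetical stand-in classes (`StubStandIn`, `FusedConvStandIn`, `has_active_stub`); none of them are the package's own `FusedModule` or `QuantStub`. Whether the real `disable_int8()` flips a flag or empties the dict is not visible in this diff, so the flag-based version here is an assumption.

```python
class StubStandIn:
    """Hypothetical stand-in for a quant stub with an on/off switch."""

    def __init__(self) -> None:
        self.enabled = True


class FusedConvStandIn:
    """Hypothetical stand-in for a conv fused module as of 0.7.0."""

    inputs_to_quantize = {0}

    def __init__(self) -> None:
        # Stubs are created unconditionally, as the AGENTS.md text describes.
        self.input_quant_stubs = {
            index: StubStandIn() for index in self.inputs_to_quantize
        }

    def disable_int8(self) -> None:
        # Assumption: disabling flips a flag and keeps the dict entries.
        for stub in self.input_quant_stubs.values():
            stub.enabled = False


def has_active_stub(mod: FusedConvStandIn) -> bool:
    """Inspect stub state instead of relying on dict truthiness."""
    return any(stub.enabled for stub in mod.input_quant_stubs.values())


mod = FusedConvStandIn()
# Before the quantization pass, the dict is truthy for every conv module.
looked_quantizable_before_pass = bool(mod.input_quant_stubs)
mod.disable_int8()  # what a DisableInt8Pattern-like pass would trigger
active_after_pass = has_active_stub(mod)
```

Under that assumption, a check written against 0.6.1 behaviour (`if mod.input_quant_stubs:`) would treat every conv module as quantizable before the pass runs, compatible or not.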

{embedl_deploy_tensorrt-0.6.1 → embedl_deploy_tensorrt-0.7.0}/src/embedl_deploy/_internal/tensorrt/modules/attention.py RENAMED

@@ -13,11 +13,14 @@ import torch
 import torch.nn.functional as F
 from torch import nn
-from embedl_deploy._internal.core.modules import ConvertedModule, FusedModule
-from embedl_deploy._internal.core.quantize.stubs import QuantStub
-from embedl_deploy._internal.tensorrt.modules.linear import (
-    attach_int8_weight_quant,
-    maybe_quantize_weight,
+from embedl_deploy._internal.core.modules import (
+    ConvertedModule,
+    FusedModule,
+    Precision,
+)
+from embedl_deploy._internal.core.quantize.stubs import (
+    QuantStub,
+    WeightFakeQuantize,
 )
@@ -181,7 +184,7 @@ class FusedMHAInProjection(FusedModule):
     def __init__(self, in_proj: MHAInProjection) -> None:
         super().__init__()
         self.in_proj = in_proj
-        attach_int8_weight_quant(self, in_proj.linear)
+        self.weight_fake_quant = WeightFakeQuantize({self})
     @property
     def quantized_weight(self) -> torch.Tensor | None:
@@ -205,7 +208,7 @@ class FusedMHAInProjection(FusedModule):
         :returns:
             Tuple ``(Q, K, V)`` each of shape ``[B, num_heads, S, head_dim]``.
         """
-        weight = maybe_quantize_weight(self, self.in_proj.linear.weight)
+        weight = self.weight_fake_quant(self.in_proj.linear.weight)
         batch, seq, _ = query.shape
         qkv = F.linear(query, weight, self.in_proj.linear.bias)
         q, k, v = qkv.chunk(3, dim=-1)
@@ -229,15 +232,13 @@ class FusedMHAInProjection(FusedModule):
 class FusedScaledDotProductAttention(FusedModule):
     """Fused wrapper for ``ScaledDotProductAttention``.
-    Allows the Q/DQ insertion pass to place quantize / dequantize stubs on each
-    of the three inputs (Q, K, V).
-    Additionally holds an internal
+    Quantizes the three inputs (Q, K, V) and holds an internal
     :class:`~embedl_deploy._internal.core.quantize.stubs.QuantStub` between the
-    softmax output and the second batched matrix multiply (BMM2). When that
-    stub is disabled the forward pass delegates to the unwrapped
+    softmax output and the second batched matrix multiply (BMM2). When
+    ``output_precision`` is still ``DEFERRED`` or the softmax stub is disabled,
+    the forward pass delegates to the unwrapped
     :class:`~embedl_deploy._internal.tensorrt.modules.attention.ScaledDotProductAttention`;
-    when enabled it performs manual attention with the quantization step.
+    otherwise it performs manual attention with the quantization step.
     :param attention:
         The
@@ -245,11 +246,12 @@ class FusedScaledDotProductAttention(FusedModule):
         from the decomposed MHA.
     """
-    inputs_to_quantize: set[int] = set()
+    inputs_to_quantize: set[int] = {0, 1, 2}
     def __init__(self, attention: ScaledDotProductAttention) -> None:
         super().__init__()
         self.attention = attention
+        self.output_precision = Precision.DEFERRED
         self.softmax_quant = QuantStub(
             consumers={self},
             n_bits=8,
@@ -266,11 +268,11 @@ class FusedScaledDotProductAttention(FusedModule):
     ) -> torch.Tensor:
         r"""Compute scaled dot-product attention.
-        When the SDPA has been surrounded by ``QuantStub``\ s on its Q/K/V
-        inputs *and* the internal softmax quant stub is enabled, performs
-        manual attention with a quantization step between softmax and BMM2.
-        Otherwise delegates to the wrapped attention module so TensorRT can
-        fuse it into its native FP16 MHA kernel.
+        When ``output_precision`` has been resolved (no longer ``DEFERRED``)
+        and the internal softmax quant stub is enabled, performs manual
+        attention with a quantization step between softmax and BMM2. Otherwise
+        delegates to the wrapped attention module so TensorRT can fuse it into
+        its native FP16 MHA kernel.
         :param q:
             Query tensor ``[B, num_heads, S, head_dim]``.
@@ -286,13 +288,16 @@ class FusedScaledDotProductAttention(FusedModule):
             Output tensor ``[B, num_heads, S, head_dim]``. Callers are
             responsible for any subsequent head-flattening reshape.
         """
-        # Manual attention is only beneficial when this SDPA was
-        # surrounded with input ``QuantStub``s (i.e. Q/K/V are arriving
-        # in INT8). Without surround, ``configure`` may still have left
-        # ``softmax_quant`` enabled — running manual attention then adds
-        # a softmax Q/DQ pair that pushes TensorRT off its FP16 fused
-        # MHA kernel onto the slower INT8-aware variant for no gain.
-        if not self.surrounded or not self.softmax_quant.enabled:
+        # Manual attention is only beneficial when output_precision has
+        # been resolved (i.e. Q/K/V stubs are active and arriving in
+        # INT8). While still DEFERRED, ``softmax_quant`` may be enabled
+        # but running manual attention would add a softmax Q/DQ pair
+        # that pushes TensorRT off its FP16 fused MHA kernel onto the
+        # slower INT8-aware variant for no gain.
+        if (
+            self.output_precision == Precision.DEFERRED
+            or not self.softmax_quant.enabled
+        ):
             return self.attention(q, k, v, attn_mask)
         # Honor the wrapped attention module's explicit ``scale`` if
         # set — models that pre-scale Q themselves (chronos-2 + RoPE,
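
Annotation on the attention.py hunks above: the new guard in `forward()` and the fixed `(1/127, 0)` softmax calibration can be illustrated without PyTorch. This is a minimal sketch, not the package's code. The `Precision` enum below is a stand-in listing only the three members that appear in this diff (the real enum may have more), `use_manual_attention` and `quantize_prob` are hypothetical helper names, and the `[-128, 127]` clamp range is an assumption (it does not matter for softmax outputs, which lie in `[0, 1]`).

```python
from enum import Enum, auto


class Precision(Enum):
    """Stand-in enum; only these members are visible in the diff."""

    UNSET = auto()
    DEFERRED = auto()
    INT8 = auto()


# Fixed calibration from AGENTS.md: scale 1/127, zero point 0.
SOFTMAX_SCALE = 1.0 / 127.0


def use_manual_attention(
    output_precision: Precision,
    softmax_quant_enabled: bool,
) -> bool:
    """Mirror the 0.7.0 guard: delegate while DEFERRED or stub disabled."""
    return (
        output_precision is not Precision.DEFERRED and softmax_quant_enabled
    )


def quantize_prob(prob: float) -> int:
    """Map a softmax probability to a signed 8-bit integer code."""
    code = round(prob / SOFTMAX_SCALE)
    # Assumed clamp range for a signed 8-bit code.
    return max(-128, min(127, code))


def dequantize(code: int) -> float:
    """Recover the value the next matmul would see after Q/DQ."""
    return code * SOFTMAX_SCALE
```

With this scale, the softmax range `[0, 1]` maps onto codes `0..127`, so a probability of 1.0 is represented exactly and the step between representable probabilities is 1/127.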

{embedl_deploy_tensorrt-0.6.1 → embedl_deploy_tensorrt-0.7.0}/src/embedl_deploy/_internal/tensorrt/modules/conv.py RENAMED

@@ -15,29 +15,20 @@ import torch
 import torch.nn.functional as F
 from torch import nn
-from embedl_deploy._internal.core.modules import ActivationLike, FusedModule
+from embedl_deploy._internal.core.modules import (
+    ActivationLike,
+    FusedModule,
+    Precision,
+)
 from embedl_deploy._internal.core.quantize.stubs import (
+    QuantStub,
     WeightFakeQuantize,
 )
-def _is_int8_compatible_conv(conv: nn.Conv2d) -> bool:
-    """Return ``True`` unless *conv* is a grouped conv violating TRT INT8.
-    TensorRT's documented constraint for ``IConvolutionLayer`` is that
-    ``in_channels / groups`` and ``out_channels / groups`` must both be
-    multiples of 4 in INT8 mode. Depthwise convolutions (``groups ==
-    in_channels``) are an exception: our benchmarks on the target devices show
-    they still benefit from INT8 despite channels-per-group being 1, so we let
-    them through.
-    """
-    if conv.groups <= 1:
-        return True
-    if conv.groups == conv.in_channels:
-        return True
-    in_per_group: int = conv.in_channels // conv.groups
-    out_per_group: int = conv.out_channels // conv.groups
-    return in_per_group % 4 == 0 and out_per_group % 4 == 0
+def is_depthwise_conv(conv: nn.Conv2d) -> bool:
+    """Return ``True`` when *conv* is depthwise."""
+    return conv.groups == conv.in_channels
 def _conv_weight_forward(
@@ -51,6 +42,7 @@ def _conv_weight_forward(
         if weight_fake_quant is not None
         else conv.weight
     )
+    # pylint: disable-next=not-callable
     return F.conv2d(
         x,
         weight,
@@ -63,7 +55,10 @@ def _conv_weight_forward(
 class FusedConvBNAct(FusedModule):
-    """Fused ``Conv2d → [BatchNorm2d] → Act``."""
+    """Fused ``Conv2d → [BatchNorm2d] → Act``.
+    Depthwise convolutions defer output precision to the surround pattern.
+    """
     inputs_to_quantize: set[int] = {0}
@@ -77,10 +72,9 @@ class FusedConvBNAct(FusedModule):
         self.conv = conv
         self.bn = bn
         self.act = act
-        if _is_int8_compatible_conv(conv):
-            self.weight_fake_quant = WeightFakeQuantize({self})
-        else:
-            self.input_quant_stubs = {}
+        self.weight_fake_quant = WeightFakeQuantize({self})
+        if is_depthwise_conv(conv):
+            self.output_precision = Precision.DEFERRED
     @property
     def quantized_weight(self) -> torch.Tensor | None:
@@ -88,7 +82,7 @@ class FusedConvBNAct(FusedModule):
     def forward(self, x: torch.Tensor) -> torch.Tensor:
         """Apply ``conv → [bn] → act``."""
-        wfq = getattr(self, 'weight_fake_quant', None)
+        wfq = getattr(self, "weight_fake_quant", None)
         x = _conv_weight_forward(self.conv, wfq, x)
         if self.bn is not None:
             x = self.bn(x)
@@ -107,7 +101,10 @@ class FusedConvBNAct(FusedModule):
 class FusedConvBN(FusedModule):
-    """Fused ``Conv2d → [BatchNorm2d]`` (no activation)."""
+    """Fused ``Conv2d → [BatchNorm2d]`` (no activation).
+    Depthwise convolutions defer output precision to the surround pattern.
+    """
     inputs_to_quantize: set[int] = {0}
@@ -119,10 +116,9 @@ class FusedConvBN(FusedModule):
         super().__init__()
         self.conv = conv
         self.bn = bn
-        if _is_int8_compatible_conv(conv):
-            self.weight_fake_quant = WeightFakeQuantize({self})
-        else:
-            self.input_quant_stubs = {}
+        self.weight_fake_quant = WeightFakeQuantize({self})
+        if is_depthwise_conv(conv):
+            self.output_precision = Precision.DEFERRED
     @property
     def quantized_weight(self) -> torch.Tensor | None:
@@ -130,7 +126,7 @@ class FusedConvBN(FusedModule):
     def forward(self, x: torch.Tensor) -> torch.Tensor:
         """Apply ``conv → [bn]``."""
-        wfq = getattr(self, 'weight_fake_quant', None)
+        wfq = getattr(self, "weight_fake_quant", None)
         x = _conv_weight_forward(self.conv, wfq, x)
         if self.bn is not None:
             x = self.bn(x)
@@ -173,7 +169,8 @@ class FusedConvBNActMaxPool(FusedModule):
     def forward(self, x: torch.Tensor) -> torch.Tensor:
         """Apply ``conv → [bn] → act → maxpool``."""
-        x = _conv_weight_forward(self.conv, self.weight_fake_quant, x)
+        wfq = getattr(self, "weight_fake_quant", None)
+        x = _conv_weight_forward(self.conv, wfq, x)
         if self.bn is not None:
             x = self.bn(x)
         x = self.act(x)
@@ -194,10 +191,12 @@ class FusedConvBNActMaxPool(FusedModule):
 class FusedConvBNAddAct(FusedModule):
-    """Fused ``Conv2d → BatchNorm2d → add(·, residual) → Activation``.
+    """Fused ``Conv2d → [BatchNorm2d] → add(·, residual) → [Activation]``.
     ``forward()`` accepts two inputs: the main tensor ``x`` and the
-    ``residual`` tensor.
+    ``residual`` tensor. Both the ``BatchNorm2d`` and the trailing activation
+    are optional so that EfficientNet-style ``Conv → BN → Add`` blocks (no
+    activation) are captured.
     """
     inputs_to_quantize: set[int] = {0, 1}
@@ -205,33 +204,99 @@ class FusedConvBNAddAct(FusedModule):
     def __init__(
         self,
         conv: nn.Conv2d,
-        bn: nn.BatchNorm2d,
-        act: ActivationLike,
+        bn: nn.BatchNorm2d | None,
+        act: ActivationLike | None,
     ) -> None:
         super().__init__()
         self.conv = conv
         self.bn = bn
         self.act = act
-        if _is_int8_compatible_conv(conv):
-            self.weight_fake_quant = WeightFakeQuantize({self})
-        else:
-            self.input_quant_stubs = {}
+        self.weight_fake_quant = WeightFakeQuantize({self})
     @property
     def quantized_weight(self) -> torch.Tensor | None:
         return self.conv.weight
     def forward(self, x: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
-        """Apply ``conv → bn → add(·, residual) → act``."""
-        wfq = getattr(self, 'weight_fake_quant', None)
+        """Apply ``conv → [bn] → add(·, residual) → [act]``."""
+        wfq = getattr(self, "weight_fake_quant", None)
         x = _conv_weight_forward(self.conv, wfq, x)
-        x = self.bn(x)
-        return self.act(x + residual)
+        if self.bn is not None:
+            x = self.bn(x)
+        x = x + residual
+        if self.act is not None:
+            x = self.act(x)
+        return x
     def __repr__(self) -> str:  # pragma: no cover
+        bn_info = ""
+        if self.bn is not None:
+            bn_info = f", bn={self.bn.num_features} (foldable)"
+        act_info = ""
+        if self.act is not None:
+            act_info = f", act={type(self.act).__name__}"
         return (
             f"FusedConvBNAddAct("
             f"{self.conv.in_channels}→{self.conv.out_channels}, "
-            f"k={self.conv.kernel_size}, s={self.conv.stride}, "
-            f"bn={self.bn.num_features} (foldable))"
+            f"k={self.conv.kernel_size}, s={self.conv.stride}"
+            f"{bn_info}{act_info})"
+        )
+class FusedConvBNSigmoidMul(FusedModule):
+    """Fused ``Conv2d → [BatchNorm2d] → Sigmoid → Mul(·, skip)``.
+    Captures the SE gate pattern where an expand convolution produces channel
+    attention weights via Sigmoid, then element-wise multiplies with the skip
+    connection. An internal
+    :class:`~embedl_deploy._internal.core.quantize.stubs.QuantStub` between the
+    conv/BN output and the Sigmoid produces the Q/DQ pair that enables
+    TensorRT's ``PWN(Sigmoid, Mul)`` fusion.
+    ``forward()`` accepts two inputs: the main tensor ``x`` feeding the
+    convolution, and the ``skip`` tensor multiplied by the sigmoid gate.
+    """
+    inputs_to_quantize: set[int] = {0, 1}
+    def __init__(
+        self,
+        conv: nn.Conv2d,
+        bn: nn.BatchNorm2d | None,
+        sigmoid: nn.Sigmoid,
+    ) -> None:
+        super().__init__()
+        self.conv = conv
+        self.bn = bn
+        self.sigmoid = sigmoid
+        self.weight_fake_quant = WeightFakeQuantize({self})
+        self.gate_quant = QuantStub({self})
+    @property
+    def quantized_weight(self) -> torch.Tensor | None:
+        return self.conv.weight
+    def forward(
+        self,
+        x: torch.Tensor,
+        skip: torch.Tensor,
+    ) -> torch.Tensor:
+        """Apply ``conv → [bn] → gate_quant → sigmoid → mul(·, skip)``."""
+        wfq = getattr(self, "weight_fake_quant", None)
+        x = _conv_weight_forward(self.conv, wfq, x)
+        if self.bn is not None:
+            x = self.bn(x)
+        x = self.gate_quant(x)
+        x = self.sigmoid(x)
+        return x * skip
+    def __repr__(self) -> str:  # pragma: no cover
+        bn_info = ""
+        if self.bn is not None:
+            bn_info = f", bn={self.bn.num_features} (foldable)"
+        return (
+            f"FusedConvBNSigmoidMul("
+            f"{self.conv.in_channels}→{self.conv.out_channels}, "
+            f"k={self.conv.kernel_size}, s={self.conv.stride}"
+            f"{bn_info})"
         )
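
Annotation on the conv.py hunks above: the removed `_is_int8_compatible_conv()` combined two checks, the channels-per-group rule and a depthwise exemption, while 0.7.0 keeps only `is_depthwise_conv()` in this file and moves the filtering to `DisableInt8Pattern`, whose body is not part of this diff. The sketch below separates the two predicates so they can be tried on plain numbers. `ConvShape` and `channels_per_group_ok` are hypothetical names, and how the pattern combines the predicates in 0.7.0 is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvShape:
    """Hypothetical stand-in for the nn.Conv2d fields the checks read."""

    in_channels: int
    out_channels: int
    groups: int = 1


def is_depthwise(shape: ConvShape) -> bool:
    """Same comparison as the 0.7.0 is_depthwise_conv() helper."""
    return shape.groups == shape.in_channels


def channels_per_group_ok(shape: ConvShape) -> bool:
    """Channels-per-group rule from the removed 0.6.1 helper."""
    if shape.groups <= 1:
        return True
    in_per_group = shape.in_channels // shape.groups
    out_per_group = shape.out_channels // shape.groups
    return in_per_group % 4 == 0 and out_per_group % 4 == 0


dense = ConvShape(64, 128)
grouped_ok = ConvShape(16, 16, groups=2)  # 8 channels per group
grouped_bad = ConvShape(12, 12, groups=2)  # 6 channels per group
depthwise = ConvShape(32, 32, groups=32)  # 1 channel per group
```

A depthwise layer fails the channels-per-group rule by construction, which is why 0.6.1 exempted it explicitly and why 0.7.0 marks it `Precision.DEFERRED` instead of deciding in the constructor.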

{embedl_deploy_tensorrt-0.6.1 → embedl_deploy_tensorrt-0.7.0}/src/embedl_deploy/_internal/tensorrt/modules/linear.py RENAMED

@@ -8,60 +8,16 @@ import torch
 import torch.nn.functional as F
 from torch import nn
-from embedl_deploy._internal.core.modules import ActivationLike, FusedModule
+from embedl_deploy._internal.core.modules import (
+    ActivationLike,
+    FusedModule,
+    Precision,
+)
 from embedl_deploy._internal.core.quantize.stubs import (
     SmoothQuantObserver,
     WeightFakeQuantize,
 )
-#: Minimum ``K * N / (K + N)`` for INT8 to outperform FP16.
-INT8_LINEAR_MIN_RATIO: int = 256
-def is_int8_beneficial_linear(linear: nn.Linear) -> bool:
-    """Return ``True`` when INT8 quantization benefits *linear*.
-    Uses the harmonic mean of the weight dimensions ``K * N / (K + N)`` as a
-    proxy for the ratio of INT8 compute savings to Q/DQ reformat overhead.
-    Below
-    :data:`~embedl_deploy._internal.tensorrt.modules.linear.INT8_LINEAR_MIN_RATIO`,
-    the overhead from quantize/dequantize boundary layers exceeds any INT8 GEMM
-    speedup and the layer is better left in FP16.
-    Reference: NVIDIA benchmarks show INT8 GEMM outperforms FP16 only when all
-    three matrix dimensions exceed ~2048 (A100). The harmonic mean threshold of
-    256 conservatively separates mobile-class models (MobileViT FFN ratio ≤
-    160) from server-class models (ViT-B/16 FFN ratio = 614) where INT8 is
-    beneficial.
-    """
-    k, n = linear.in_features, linear.out_features
-    return k * n / (k + n) >= INT8_LINEAR_MIN_RATIO
-def attach_int8_weight_quant(
-    mod: FusedModule,
-    linear: nn.Linear,
-) -> None:
-    """Attach a ``WeightFakeQuantize`` to *mod* when INT8 helps *linear*.
-    When INT8 wouldn't pay for its Q/DQ boundary cost, also clear
-    ``mod.input_quant_stubs`` so the surrounding Q/DQ pass leaves the wrapped
-    linear entirely in FP16.
-    """
-    if is_int8_beneficial_linear(linear):
-        mod.weight_fake_quant = WeightFakeQuantize({mod})
-    else:
-        mod.input_quant_stubs = {}
-def maybe_quantize_weight(
-    mod: nn.Module,
-    weight: torch.Tensor,
-) -> torch.Tensor:
-    """Fake-quantize *weight* through ``mod.weight_fake_quant`` if present."""
-    wfq = getattr(mod, "weight_fake_quant", None)
-    return wfq(weight) if wfq is not None else weight
 class FusedLinear(FusedModule):
     """Fused wrapper for a standalone ``Linear`` layer.
@@ -75,7 +31,7 @@ class FusedLinear(FusedModule):
     def __init__(self, linear: nn.Linear) -> None:
         super().__init__()
         self.linear = linear
-        attach_int8_weight_quant(self, linear)
+        self.weight_fake_quant = WeightFakeQuantize({self})
     @property
     def quantized_weight(self) -> torch.Tensor | None:
@@ -83,7 +39,7 @@ class FusedLinear(FusedModule):
     def forward(self, x: torch.Tensor) -> torch.Tensor:
         """Apply ``linear``, fake-quantizing the weight."""
-        weight = maybe_quantize_weight(self, self.linear.weight)
+        weight = self.weight_fake_quant(self.linear.weight)
         return F.linear(x, weight, self.linear.bias)
     def __repr__(self) -> str:  # pragma: no cover
@@ -108,7 +64,7 @@ class FusedLinearAct(FusedModule):
         super().__init__()
         self.linear = linear
         self.act = act
-        attach_int8_weight_quant(self, linear)
+        self.weight_fake_quant = WeightFakeQuantize({self})
     @property
     def quantized_weight(self) -> torch.Tensor | None:
@@ -116,7 +72,7 @@ class FusedLinearAct(FusedModule):
     def forward(self, x: torch.Tensor) -> torch.Tensor:
         """Apply ``linear → activation``, fake-quantizing the weight."""
-        weight = maybe_quantize_weight(self, self.linear.weight)
+        weight = self.weight_fake_quant(self.linear.weight)
         x = F.linear(x, weight, self.linear.bias)
         return self.act(x)
@@ -143,12 +99,12 @@ class FusedLayerNorm(FusedModule):
         The ``nn.LayerNorm`` from the matched chain.
     """
-    prefers_fp_input: bool = True
     inputs_to_quantize: set[int] = set()
     def __init__(self, layer_norm: nn.LayerNorm) -> None:
         super().__init__()
         self.layer_norm = layer_norm
+        self.output_precision = Precision.INT8
         self.smooth_quant_observer = SmoothQuantObserver(
             consumers={self},
             layer_norm=layer_norm,
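
Annotation on the linear.py hunks above: the heuristic removed from this file compared `K * N / (K + N)` against a threshold of 256. A short worked example makes the numbers in the removed docstring concrete. The function names below are hypothetical, and where (or whether) the heuristic lives in 0.7.0 is not visible in this diff. Note that `K * N / (K + N)` is strictly half the harmonic mean of `K` and `N`, even though the removed docstring calls it the harmonic mean.

```python
# Threshold value taken from the removed INT8_LINEAR_MIN_RATIO constant.
INT8_LINEAR_MIN_RATIO = 256


def int8_ratio(in_features: int, out_features: int) -> float:
    """Compute K * N / (K + N), the quantity the removed helper used."""
    return in_features * out_features / (in_features + out_features)


def harmonic_mean(in_features: int, out_features: int) -> float:
    """True harmonic mean, shown only to contrast with int8_ratio()."""
    return 2 * in_features * out_features / (in_features + out_features)


# ViT-B/16 feed-forward layer: 768 -> 3072.
vit_ratio = int8_ratio(768, 3072)
vit_passes = vit_ratio >= INT8_LINEAR_MIN_RATIO

# A small square layer for comparison.
small_ratio = int8_ratio(256, 256)
small_passes = small_ratio >= INT8_LINEAR_MIN_RATIO
```

The 768 to 3072 layer gives 614.4, matching the "ViT-B/16 FFN ratio = 614" figure in the removed docstring, while a 256 by 256 layer gives 128 and would have been left in FP16 under the 0.6.1 rule.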

{embedl_deploy_tensorrt-0.6.1 → embedl_deploy_tensorrt-0.7.0}/src/embedl_deploy/_internal/tensorrt/modules/pool.py RENAMED

@@ -5,17 +5,21 @@
 import torch
 from torch import nn
-from embedl_deploy._internal.core.modules import FusedModule
+from embedl_deploy._internal.core.modules import FusedModule, Precision
 class FusedAdaptiveAvgPool2d(FusedModule):
-    """Fused wrapper for ``AdaptiveAvgPool2d``."""
+    """Fused wrapper for ``AdaptiveAvgPool2d``.
-    inputs_to_quantize: set[int] = set()
+    Defers output precision to the surround pattern.
+    """
+    inputs_to_quantize: set[int] = {0}
     def __init__(self, pool: nn.AdaptiveAvgPool2d) -> None:
         super().__init__()
         self.pool = pool
+        self.output_precision = Precision.DEFERRED
     def forward(self, x: torch.Tensor) -> torch.Tensor:
         """Apply adaptive average pooling."""
