PyPI - embedl-deploy-tensorrt - Versions diffs - 0.5.0__tar.gz → 0.6.0__tar.gz - Mend

embedl-deploy-tensorrt 0.5.0__tar.gz → 0.6.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (41)

{embedl_deploy_tensorrt-0.5.0 → embedl_deploy_tensorrt-0.6.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: embedl-deploy-tensorrt
-Version: 0.5.0
+Version: 0.6.0
 Summary: TensorRT backend for embedl-deploy.
 Author-email: Embedl AB <support@embedl.com>
 Project-URL: Homepage, https://www.embedl.com/

{embedl_deploy_tensorrt-0.5.0 → embedl_deploy_tensorrt-0.6.0}/src/embedl_deploy/__init__.py RENAMED

@@ -3,6 +3,7 @@
 """Python package to make AI models deployment-ready for any hardware."""
 from embedl_deploy._internal.core.plan import (
+    Trace,
     TransformationPlan,
     TransformationResult,
     apply_transformation_plan,
@@ -14,6 +15,7 @@ from embedl_deploy.version.public import PUBLIC_VERSION
 __version__ = PUBLIC_VERSION
 __all__ = [
+    "Trace",
     "TransformationPlan",
     "TransformationResult",
     "__version__",

embedl_deploy_tensorrt-0.6.0/src/embedl_deploy/_internal/tensorrt/modules/AGENTS.md ADDED

@@ -0,0 +1,408 @@
+# TensorRT Fused Modules — Subsystem Guide
+## Role & Boundary
+**This subsystem owns:** concrete `FusedModule` (and `ConvertedModule`) implementations
+for TensorRT. Every class in `modules/` corresponds to a hardware-fusible or
+structurally-decomposed operation the TensorRT backend can emit as a single engine
+layer or kernel.
+**This subsystem does NOT own:**
+- When or how to match these modules in a graph — that lives in `patterns/`.
+- Q/DQ stub insertion or calibration logic — that lives in `core/quantize/`.
+- The pattern-matching engine itself — that lives in `core/`.
+---
+## Position in the System
+```
+patterns/fusions.py                # creates FusedModule instances
+patterns/conversions/general.py    # creates ConvertedModule instances (erasure, flatten-linear)
+patterns/conversions/attention.py  # creates ConvertedModule instances (MHA/Swin/SDPA)
+        │
+        ▼
+modules/ (this subsystem)
+        │
+        ▼
+core/quantize/prepare.py       # walks the graph, uses isinstance(mod, FusedModule)
+                               # and mod.inputs_to_quantize to insert Q/DQ stubs
+```
+**Instantiated by:** pattern grafts and `replace()` methods in `patterns/fusions.py`
+and `patterns/conversions/`. Most fusion patterns declare a `graft` attribute
+pointing to the fused module class; the graft system calls `_collect_modules()`
+to gather matched modules in tree order and passes them as positional arguments
+to the constructor. For example, `ConvBNActPattern` grafts `FusedConvBNAct`;
+`DecomposeMultiheadAttentionPattern.replace()` constructs `MHAInProjection` and
+`ScaledDotProductAttention`.
+**Used by:** The Q/DQ insertion pass in `core/quantize/prepare.py` via
+`isinstance(mod, FusedModule)` and `mod.inputs_to_quantize`, to determine which
+inputs of each `call_module` node should receive a `QuantStub`.
+**Also used as intermediate graph nodes by:** conversion patterns, which produce
+`ConvertedModule` subclasses (`MHAInProjection`, `ScaledDotProductAttention`,
+`SwinWindowPartition`, `SwinAttention`, `SwinWindowReverse`). These stay opaque
+during re-tracing so that downstream fusion patterns can see them as atomic nodes.
+---
+## FusedModule Contract
+Every `FusedModule` subclass must satisfy three requirements:
+1. **Set `inputs_to_quantize` as a class attribute** (not an instance attribute).
+   The Q/DQ insertion pass reads this before the module is instantiated to decide
+   which positional arguments of the `call_module` FX node will have a `QuantStub`
+   inserted in front of them.
+2. **Call `super().__init__()`**, which creates `self.input_quant_stubs` — a dict
+   mapping each index in `inputs_to_quantize` to a fresh `QuantStub`. The Q/DQ
+   pass later enables and configures these stubs during `prepare_qdq()`. It also
+   initialises `self.surrounded = False`, which is later set to `True` by
+   `SurroundWithQuantStubsPattern` to mark modules that have been surrounded
+   with input `QuantStub` entries.
+3. **Implement `forward()`** with the fused computation the module represents.
+### Graft compatibility
+When a pattern uses `graft = FusedFoo` (bare class), the graft system calls
+`_collect_modules()` to walk the matched tree and collect the `nn.Module`
+instances corresponding to trunk and fork nodes (nested branches first, then
+trunk nodes). These are passed as positional arguments to the constructor.
+Therefore the constructor signature must accept modules in the same order
+they appear in the pattern tree.
+### What `inputs_to_quantize` means
+`inputs_to_quantize` is a set of *positional argument indices* corresponding to
+the `call_module` FX node's arguments — that is, the arguments visible in the
+traced graph, not necessarily the Python keyword positions in `forward()`.
+For example, `FusedConvBN.inputs_to_quantize = {0}` means the first tensor
+argument to the fused node (the image tensor) gets a `QuantStub`. The convolution
+weight is quantized separately via `WeightFakeQuantize`, which is attached to the
+module directly rather than via `inputs_to_quantize`.
+`FusedConvBNAddAct.inputs_to_quantize = {0, 1}` means both the main feature map
+and the residual skip tensor each get their own `QuantStub`.
+### Why `inputs_to_quantize` is a class attribute
+Because it describes the module *type's* quantization contract, not any particular
+instance's configuration. The Q/DQ insertion pass queries it once per class during
+graph preparation, before modules are constructed. Making it a class attribute
+(rather than an instance attribute set in `__init__`) ensures the value is
+available from the class itself without constructing an instance.
+---
+## ConvertedModule vs FusedModule
+Both base classes are defined in `core/modules.py`. They serve different roles:
+**`ConvertedModule` subclasses** are *intermediate decomposition products*. They
+replace high-level opaque ops (e.g. `nn.MultiheadAttention`,
+`shifted_window_attention`) with sub-modules that downstream *fusion* patterns can
+then match individually. The custom tracer (`_LeafTracer`) treats them as leaf
+nodes — they are never unwrapped during re-tracing. Examples: `MHAInProjection`,
+`ScaledDotProductAttention`, `SwinWindowPartition`, `SwinAttention`,
+`SwinWindowReverse`.
+**`FusedModule` subclasses** are the *final quantization target*. The Q/DQ
+insertion pass identifies them via `isinstance(mod, FusedModule)` and wraps their
+inputs/weights with fake-quantization stubs. Examples: `FusedConvBN`,
+`FusedConvBNAct`, `FusedLinear`, `FusedSwinAttention`, `FusedMHAInProjection`.
+A `FusedModule` may wrap a `ConvertedModule` (e.g. `FusedMHAInProjection` wraps
+`MHAInProjection`, `FusedSwinAttention` wraps `SwinAttention`). In that case the
+conversion pass runs first, then the fusion pass replaces the `ConvertedModule`
+with its `FusedModule` counterpart.
+---
+## Module Catalog
+### Conv family — `conv.py`
+Pattern that creates them: `patterns/fusions.py` (`ConvBNActPattern`,
+`ConvBNPattern`, `StemConvBNActMaxPoolPattern`, `ConvBNAddActPattern`).
+All conv fused modules store a reference to the original `nn.Conv2d` (and
+optionally `nn.BatchNorm2d`, `ActivationLike`, `nn.MaxPool2d`) — they do not
+pre-compute fused weights. This lets the same module object work correctly in both
+training (with live BN statistics) and eval (where BN is folded by TensorRT at
+export time).
+**BN folding indicator:** The presence of `self.bn` (not `None`) implicitly
+indicates that BN can be folded into the convolution at export time. The
+`__repr__` methods print `"(foldable)"` when `self.bn is not None`.
+**`FusedConvBN`** — `Conv2d → [BatchNorm2d]`, `inputs_to_quantize = {0}`.
+**`FusedConvBNAct`** — `Conv2d → [BatchNorm2d] → Activation`, `inputs_to_quantize = {0}`.
+**`FusedConvBNActMaxPool`** — `Conv2d → [BatchNorm2d] → Activation → MaxPool2d`,
+`inputs_to_quantize = {0}`. MaxPool is included here (rather than fused
+separately) because TensorRT supports fusing the full stem sequence
+(`Conv → BN → Act → Pool`) as a single engine node in the common 7x7 stem pattern.
+Fusing them together avoids an extra quantized activation between Conv and Pool.
+**`FusedConvBNAddAct`** — `Conv2d → BatchNorm2d → add(·, residual) → Activation`,
+`inputs_to_quantize = {0, 1}`. The residual tensor is the second input (index 1),
+hence both inputs are quantized. This is the ResNet skip-connection block tail.
+**INT8 compatibility guard:** grouped convolutions where `in_channels / groups` or
+`out_channels / groups` is not a multiple of 4 cannot be quantized to INT8 in
+TensorRT. For those cases `_is_int8_compatible_conv()` returns `False`, and the
+module sets `self.input_quant_stubs = {}` (overriding the `super().__init__()`
+default), effectively opting out of quantization.
+### Linear family — `linear.py`
+Pattern that creates them: `patterns/fusions.py`.
+**`FusedLinear`** — wraps a single `nn.Linear`, `inputs_to_quantize = {0}`.
+**`FusedLinearAct`** — wraps `nn.Linear → Activation`, `inputs_to_quantize = {0}`.
+**`FusedLayerNorm`** — wraps `nn.LayerNorm`, `inputs_to_quantize = set()`.
+LayerNorm is placed in the linear family because it appears in transformer
+architectures immediately before or after linear layers, and the SmoothQuant
+pass needs to reason about LayerNorm and the following linear together.
+Input quantization is disabled (`inputs_to_quantize = set()`, `prefers_fp_input =
+True`) because LayerNorm normalises its input internally — quantizing the input
+before LayerNorm would destroy the statistical properties that normalization relies
+on.
+**`FusedLayerNorm.smooth_quant_observer`** — holds a `SmoothQuantObserver`
+instance. It is created in `__init__` but is populated (i.e., scale factors are
+computed and migrated) by the SmoothQuant calibration pass in
+`core/quantize/calibrate.py`. The module itself does not use the observer
+during `forward()` — it only provides a hook for the calibration pass to attach to.
+### Attention family — `attention.py`
+Pattern that creates the `ConvertedModule` subclasses:
+`patterns/conversions/attention.py` (`DecomposeMultiheadAttentionPattern`).
+Pattern that creates the `FusedModule` wrappers: `patterns/fusions.py`.
+**Decomposition model:** `nn.MultiheadAttention` is opaque to FX and to TensorRT.
+The conversion pass decomposes it into three explicit sub-modules:
+1. `MHAInProjection` (in-projection) — packed `Linear(E, 3E)` followed by
+   chunking and head splitting.
+2. `ScaledDotProductAttention` (SDPA) — `softmax(Q·Kᵀ / √H) · V`.
+3. `nn.Linear` (out-projection) — the original `mha.out_proj`.
+**Why in-proj returns a tuple:** The in-projection produces three independent
+tensors (Q, K, V). TensorRT can fuse the single packed `Linear(E, 3E)` and the
+split into an efficient multi-head projection kernel. Returning a tuple of
+`(Q, K, V)` lets the graph represent the split explicitly — the conversion pass
+inserts `operator.getitem` nodes to fan out the tuple into three separate feeds
+for SDPA.
+**Head splitting:** Inside `MHAInProjection.forward()`, after the packed linear
+op, each of Q, K, V is reshaped from `[B, S, E]` to `[B, num_heads, S, head_dim]`
+via `view` + `transpose`. This puts the head dimension before the sequence
+dimension, which is the layout expected by `F.scaled_dot_product_attention` and
+by matrix-multiply kernels.
+**`FusedMHAInProjection`** — wraps `MHAInProjection`, `inputs_to_quantize = {0}`.
+Adds a `WeightFakeQuantize` for the packed linear weight. Only `query` (index 0)
+is quantized; `_key` and `_value` are accepted to match the self-attention
+call-site but ignored.
+**`FusedScaledDotProductAttention`** — wraps `ScaledDotProductAttention`,
+`inputs_to_quantize = set()`. Adds an internal `softmax_quant` stub with a fixed
+calibration of `(1/127, 0)` — i.e., 8-bit symmetric with a fixed scale matched to
+the softmax output range `[0, 1]`. When the stub is disabled the module delegates
+to the plain SDPA; when enabled it performs manual attention with the quantization
+step between softmax and the second batched matrix multiply (BMM2).
+### Swin Attention family — `swin_attention.py`
+Pattern that creates the `ConvertedModule` subclasses:
+`patterns/conversions/attention.py` (`DecomposeSwinAttentionPattern`). Pattern
+that creates `FusedSwinAttention`: `patterns/fusions.py`.
+**Decomposition model:** `torchvision`'s `shifted_window_attention` free function
+is an opaque `fx.wrap`-ped call that includes spatial padding, cyclic shifting,
+window partitioning, QKV projection, attention, output projection, and window
+reversal — all in one node. The conversion pass splits it into five sub-modules
+that downstream fusion patterns can match individually:
+1. `SwinWindowPartition` — pad, cyclic-shift, partition to windows.
+2. `MHAInProjection` — QKV projection (shared with the MHA attention family).
+3. `SwinAttention` — windowed attention with relative position bias and shifted-window mask.
+4. `nn.Linear` — output projection.
+5. `SwinWindowReverse` — reverse partition, shift, and unpad.
+**`SwinSpatialState`** — a mutable dataclass shared by all three spatial modules
+(`SwinWindowPartition`, `SwinAttention`, `SwinWindowReverse`). During each
+`forward()` call:
+- `SwinWindowPartition.forward()` **writes** `batch_size`, `height`, `width`,
+  `pad_height`, `pad_width`, and `effective_shift_size` into the state.
+- `SwinAttention.forward()` **reads** `pad_height`, `pad_width`,
+  `effective_shift_size`, and `batch_size` to compute the attention mask.
+- `SwinWindowReverse.forward()` **reads** all fields to undo the partition and
+  remove padding.
+The state must be shared (not copied) because `SwinWindowPartition` computes the
+actual padded dimensions and effective shift at runtime — these depend on the input
+spatial size, which is not known until `forward()` is called.
+**`deepcopy` note:** `copy.deepcopy` preserves the shared reference among the
+three modules. This is intentional: the three modules reference the *same*
+`SwinSpatialState` object; `deepcopy` copies the object once and updates all
+referencing attributes to point to the copy. However, the sharing is not
+thread-safe (see Gotchas).
+**`FusedSwinAttention`** — wraps `SwinAttention`, `inputs_to_quantize = set()`.
+Mirrors `FusedScaledDotProductAttention`: adds an internal `softmax_quant` stub
+with fixed calibration. When enabled it manually expands the attention computation
+to insert the quantization step between softmax and BMM2.
+### Pointwise family — `pointwise.py`
+Pattern that creates it: `patterns/fusions.py` (`ActAddPattern`).
+**`FusedActAdd`** — `Activation → add(·, residual)`, `inputs_to_quantize = {0, 1}`.
+Constructor takes `(act: ActivationLike)`. `forward(x, residual)` applies
+`act(x) + residual`. This prevents TensorRT from merging the upstream convolution
+into an activation-fused kernel when the activation output is consumed by a
+subsequent add. Both the activated feature map and the skip-connection tensor are
+quantized at their respective scales.
+### Pool family — `pool.py`
+Pattern that creates it: `patterns/fusions.py`.
+**`FusedAdaptiveAvgPool2d`** — wraps `nn.AdaptiveAvgPool2d`,
+`inputs_to_quantize = set()`. No quantization is applied (pooling is a
+linear operation that does not benefit from separate quantization). Exists as a
+`FusedModule` so the Q/DQ pass treats it uniformly without special-casing it.
+---
+## BN Folding
+Batch normalisation folding is the process of absorbing the BN scale and bias into
+the preceding convolution weight and bias at inference time:
+```
+w_folded = w * (γ / σ)
+b_folded = (b - μ) * (γ / σ) + β
+```
+where `γ`, `β` are BN's learned weight/bias, `μ` is the running mean, and `σ` is
+the running standard deviation, `sqrt(running_var + eps)`.
+The fused modules store the original `nn.Conv2d` and `nn.BatchNorm2d` as separate
+sub-modules. During training, the BN is applied as a normal operation (live
+statistics). At export, TensorRT performs the folding in the engine. Whether a
+fused module carries a BN is determined implicitly: `self.bn is not None` means
+the module was created from a pattern that included a `BatchNorm2d`, and folding
+is safe. When `self.bn is None`, the convolution weight is used as-is.
+---
+## Design Decisions
+**Why modules store original sub-modules rather than pre-computing fused weights:**
+Because the modules must be correct in both training and eval modes. Training uses
+live BN statistics; the fused weights can only be computed accurately in eval mode
+with frozen running mean/variance. Storing the originals defers the decision to
+the export step while keeping `forward()` correct throughout.
+**Why `inputs_to_quantize` is a class attribute:** The Q/DQ insertion pass queries
+the set before constructing any instances — it reads it from the class via
+`type(mod).inputs_to_quantize`. Making it a class attribute also prevents
+accidental per-instance divergence.
+**Why intermediate decomposition modules are `ConvertedModule`:** The custom tracer
+(`_LeafTracer` in `core/modules.py`) checks `isinstance(m, (ConvertedModule,
+FusedModule))` to decide whether to treat a module as a leaf. `ConvertedModule` is
+the marker that tells the tracer "do not recurse into this module's `forward()`".
+Without this, re-tracing the graph after conversion would unwrap the decomposed
+sub-modules, defeating the purpose of decomposition.
+---
+## Gotchas & Pitfalls
+**`SwinSpatialState` thread-safety:** The state is a mutable shared object written
+during every `SwinWindowPartition.forward()` call and read by the other two
+modules in the same forward pass. If the model is run from multiple threads
+concurrently (e.g. in a data-parallel setup without model replication), the writes
+from one thread will corrupt the reads of another. If you use
+`torch.nn.DataParallel` (rather than `DistributedDataParallel`), make sure each
+replica gets its own model copy via `deepcopy`, which preserves the sharing
+within a single copy.
+**`inputs_to_quantize` indices vs. Python `forward()` args:** The indices refer to
+the positional arguments of the FX `call_module` node in the traced graph, not to
+the `forward()` Python signature. In the normal case they are identical. They
+diverge if the FX graph is produced from a non-trivial call-site (e.g. keyword
+arguments re-ordered). Always verify against the actual graph node when debugging
+Q/DQ placement.
+**Weight sharing:** The conv/linear sub-modules stored inside fused modules hold
+references to the tensors from the *original* model. If the original model's
+weights are modified after fusion, the fused module sees the change. This is
+usually desirable (e.g. for QAT gradient updates), but can be surprising if the
+original model is used independently.
+**Grouped conv INT8 opt-out:** When `_is_int8_compatible_conv()` returns `False`
+the `__init__` of the conv fused modules sets `self.input_quant_stubs = {}`,
+overriding the dict populated by `FusedModule.__init__()`. This means the module
+is effectively excluded from quantization despite being a `FusedModule`. The
+`weight_fake_quant` attribute is also not created in this path.
+---
+## Adding a New Fused Module
+1. **Subclass `FusedModule`** (from `core/modules.py`).
+2. **Set `inputs_to_quantize`** as a class attribute — a `set[int]` of positional
+   argument indices that should receive activation `QuantStub`s.
+3. **Optionally add `self.weight_fake_quant = WeightFakeQuantize({self})`** in
+   `__init__` if the module has a learnable weight that should be fake-quantized.
+4. **Implement `forward()`** with the fused computation.
+5. **Write a `Pattern` subclass** in `patterns/fusions.py`. Prefer declaring
+   `graft = FusedFoo` (bare class) so the graft system handles replacement
+   automatically. The constructor must accept modules in tree order (nested
+   branches first, then trunk nodes). If the replacement logic cannot be
+   expressed as a bare-class graft, provide a `ReplacementMaker` or a custom
+   `replace()` method instead.
+6. **Add the pattern to `TENSORRT_PATTERNS`** (or the appropriate pattern list) in
+   `tensorrt/plan.py`.
+7. **Write tests** in `tests/tensorrt/patterns/fusions/` covering: pattern match,
+   pattern replace, correct quantisation stub placement.
+---
+## Testing
+Test models for these modules live in the following locations:
+- `tests/models/conv.py` — `ConvBnRelu`, `ConvBnSiLU`, `ConvBn`, `ConvOnly`,
+  `ConvBnAddRelu`, `StemConvBnReluMaxPool`, etc.
+- `tests/models/attention.py` — `SimpleSelfAttention`, `SimpleSwinAttention`, etc.
+- `tests/models/linear.py`, `tests/models/pool.py` — linear and pool variants.
+- `tests/tensorrt/models/attention.py` — `LayerNormMHAInProjection`,
+  `SDPALinearOutProjection`, `SwinAttentionLinearOutProjection`.
+Pattern tests live in `tests/tensorrt/patterns/fusions/`.
+**What to verify for a new fused module:**
+- The pattern matches exactly the expected nodes (check `tree_match.serialize()`).
+- The fused node's `target` name is the expected auto-generated name.
+- `resolve_module(fused_node, FusedXxx)` succeeds.
+- `bool(fused_module.input_quant_stubs)` matches `inputs_to_quantize` expectations.
+- `hasattr(fused_module, 'weight_fake_quant')` matches expectations.
+- Graph equivalence: fused model output matches original model output on the same
+  input (see `tests/tensorrt/test_equivalence.py`).
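
Reviewer note on the "FusedModule Contract" section of the added `AGENTS.md`. The following is a minimal sketch in plain Python (no PyTorch), not the package's code: `QuantStubSketch`, `FusedModuleSketch`, and `FusedAddSketch` are hypothetical stand-ins. They only illustrate why `inputs_to_quantize` can be read from the class itself and how a base `__init__` could build one stub per listed index.

```python
class QuantStubSketch:
    """Hypothetical stand-in for a quantization stub; starts disabled."""

    def __init__(self) -> None:
        self.enabled = False


class FusedModuleSketch:
    """Hypothetical stand-in for the FusedModule base class."""

    # Class attribute: describes the type's contract, readable without an instance.
    inputs_to_quantize: set[int] = set()

    def __init__(self) -> None:
        # One fresh stub per positional input index named by the class.
        self.input_quant_stubs = {
            index: QuantStubSketch() for index in sorted(self.inputs_to_quantize)
        }
        self.surrounded = False


class FusedAddSketch(FusedModuleSketch):
    """Two-input module: both the feature map and the residual get a stub."""

    inputs_to_quantize = {0, 1}

    def forward(self, x: float, residual: float) -> float:
        return x + residual


# The contract can be read from the class itself, before construction.
indices_from_class = FusedAddSketch.inputs_to_quantize

module = FusedAddSketch()
stub_indices = sorted(module.input_quant_stubs)
```

Under these assumptions, a subclass that forgets `super().__init__()` would have no `input_quant_stubs` attribute at all, which is why the guide lists that call as a requirement.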

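Reviewer note on the "BN Folding" section of the added `AGENTS.md`. The sketch below checks the two folding formulas numerically for a single output channel, treating the convolution as a scalar multiply-add. It assumes the standard batch-norm definition in which `σ = sqrt(running_var + eps)`. The function names and the sample values are illustrative only and do not come from the package.

```python
import math


def conv_then_bn(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Reference path: y = w*x + b, then batch norm with running statistics."""
    y = w * x + b
    return gamma * (y - mean) / math.sqrt(var + eps) + beta


def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BN into the preceding weight and bias (sigma = sqrt(var + eps))."""
    scale = gamma / math.sqrt(var + eps)
    w_folded = w * scale
    b_folded = (b - mean) * scale + beta
    return w_folded, b_folded


# Arbitrary illustrative values for a single output channel.
params = dict(w=0.8, b=0.1, gamma=1.5, beta=-0.2, mean=0.3, var=4.0)
w_f, b_f = fold_bn(**params)

x = 2.0
reference = conv_then_bn(x, **params)
folded = w_f * x + b_f
max_abs_error = abs(reference - folded)
```

The two paths agree up to floating-point rounding, which is the property that lets the folding be deferred to export time while `forward()` keeps the convolution and BN separate.
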
{embedl_deploy_tensorrt-0.5.0 → embedl_deploy_tensorrt-0.6.0}/src/embedl_deploy/_internal/tensorrt/modules/attention.py RENAMED

@@ -3,7 +3,7 @@
 """Attention sub-modules introduced by MHA decomposition.
 These plain ``nn.Module`` subclasses replace the opaque
-``nn.MultiheadAttention`` in the FX graph.  Phase 2 creates ``Fused*`` wrappers
+``nn.MultiheadAttention`` in the FX graph. Phase 2 creates ``Fused*`` wrappers
 around them for Q/DQ insertion.
 """
@@ -138,10 +138,9 @@ class ScaledDotProductAttention(ConvertedModule):
             on this. Passes through to ``F.scaled_dot_product_attention``
             unchanged (``None`` is the no-mask default).
         :returns:
-            Output tensor ``[B, num_heads, S, head_dim]``.  Callers are
+            Output tensor ``[B, num_heads, S, head_dim]``. Callers are
             responsible for any subsequent head-flattening reshape.
         """
-        # pylint: disable-next=not-callable
         return F.scaled_dot_product_attention(
             q,
             k,
@@ -166,8 +165,8 @@ class ScaledDotProductAttention(ConvertedModule):
 class FusedMHAInProjection(FusedModule):
     """Fused wrapper for ``MHAInProjection``.
-    Allows the Q/DQ insertion pass to place quantize / dequantize stubs
-    around the input projection and to attach a
+    Allows the Q/DQ insertion pass to place quantize / dequantize stubs around
+    the input projection and to attach a
     :class:`~embedl_deploy._internal.core.quantize.stubs.WeightFakeQuantize`
     for the packed linear weight.
@@ -184,6 +183,10 @@ class FusedMHAInProjection(FusedModule):
         self.in_proj = in_proj
         attach_int8_weight_quant(self, in_proj.linear)
+    @property
+    def quantized_weight(self) -> torch.Tensor | None:
+        return self.in_proj.linear.weight
     def forward(
         self,
         query: torch.Tensor,
@@ -192,10 +195,10 @@ class FusedMHAInProjection(FusedModule):
     ) -> tuple[torch.Tensor, ...]:
         """Project input to per-head ``(Q, K, V)`` tensors.
-        Fake-quantizes the packed projection weight when enabled,
-        then performs the linear operation.  Only `query` is used;
-        `_key` and `_value` are accepted to match the call-site
-        signature but ignored for self-attention.
+        Fake-quantizes the packed projection weight when enabled, then performs
+        the linear operation. Only `query` is used; `_key` and `_value` are
+        accepted to match the call-site signature but ignored for
+        self-attention.
         :param query:
             Input tensor of shape ``[B, S, E]``.
@@ -204,7 +207,6 @@ class FusedMHAInProjection(FusedModule):
         """
         weight = maybe_quantize_weight(self, self.in_proj.linear.weight)
         batch, seq, _ = query.shape
-        # pylint: disable-next=not-callable
         qkv = F.linear(query, weight, self.in_proj.linear.bias)
         q, k, v = qkv.chunk(3, dim=-1)
         num_heads = self.in_proj.num_heads
@@ -227,13 +229,13 @@ class FusedMHAInProjection(FusedModule):
 class FusedScaledDotProductAttention(FusedModule):
     """Fused wrapper for ``ScaledDotProductAttention``.
-    Allows the Q/DQ insertion pass to place quantize / dequantize stubs
-    on each of the three inputs (Q, K, V).
+    Allows the Q/DQ insertion pass to place quantize / dequantize stubs on each
+    of the three inputs (Q, K, V).
     Additionally holds an internal
-    :class:`~embedl_deploy._internal.core.quantize.stubs.QuantStub` between
-    the softmax output and the second batched matrix multiply (BMM2).  When
-    that stub is disabled the forward pass delegates to the unwrapped
+    :class:`~embedl_deploy._internal.core.quantize.stubs.QuantStub` between the
+    softmax output and the second batched matrix multiply (BMM2). When that
+    stub is disabled the forward pass delegates to the unwrapped
     :class:`~embedl_deploy._internal.tensorrt.modules.attention.ScaledDotProductAttention`;
     when enabled it performs manual attention with the quantization step.
@@ -266,9 +268,9 @@ class FusedScaledDotProductAttention(FusedModule):
         When the SDPA has been surrounded by ``QuantStub``\ s on its Q/K/V
         inputs *and* the internal softmax quant stub is enabled, performs
-        manual attention with a quantization step between softmax and
-        BMM2.  Otherwise delegates to the wrapped attention module so
-        TensorRT can fuse it into its native FP16 MHA kernel.
+        manual attention with a quantization step between softmax and BMM2.
+        Otherwise delegates to the wrapped attention module so TensorRT can
+        fuse it into its native FP16 MHA kernel.
         :param q:
             Query tensor ``[B, num_heads, S, head_dim]``.
@@ -281,7 +283,7 @@ class FusedScaledDotProductAttention(FusedModule):
             additive float mask broadcastable to ``[B, num_heads, S, S]``
             or a bool mask where ``True`` means "attend".
         :returns:
-            Output tensor ``[B, num_heads, S, head_dim]``.  Callers are
+            Output tensor ``[B, num_heads, S, head_dim]``. Callers are
             responsible for any subsequent head-flattening reshape.
         """
         # Manual attention is only beneficial when this SDPA was
@@ -292,7 +294,7 @@ class FusedScaledDotProductAttention(FusedModule):
         # MHA kernel onto the slower INT8-aware variant for no gain.
         if not self.surrounded or not self.softmax_quant.enabled:
             return self.attention(q, k, v, attn_mask)
-        # Honour the wrapped attention module's explicit ``scale`` if
+        # Honor the wrapped attention module's explicit ``scale`` if
         # set — models that pre-scale Q themselves (chronos-2 + RoPE,
         # for example) build with ``scale=1.0`` to disable the default
         # ``1/sqrt(head_dim)`` scaling. Falling back to the default
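
Reviewer note on the internal `softmax_quant` stub. `AGENTS.md` describes its fixed calibration as `(1/127, 0)`, that is, scale `1/127` and zero point `0`. The sketch below shows what fake-quantizing softmax probabilities on that grid does, in plain Python. The clamp range `[-127, 127]` is an assumption based on the phrase "8-bit symmetric", and `softmax` and `fake_quantize` are hypothetical helpers, not the package's `QuantStub`.

```python
import math

SCALE = 1.0 / 127.0     # fixed scale from the guide: softmax outputs lie in [0, 1]
ZERO_POINT = 0          # symmetric quantization
QMIN, QMAX = -127, 127  # assumed symmetric signed 8-bit range


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fake_quantize(value):
    """Quantize to the integer grid, clamp, and map back to float."""
    q = round(value / SCALE) + ZERO_POINT
    q = max(QMIN, min(QMAX, q))
    return (q - ZERO_POINT) * SCALE


probs = softmax([2.0, 1.0, 0.1, -1.0])
quantized = [fake_quantize(p) for p in probs]
worst_error = max(abs(p - q) for p, q in zip(probs, quantized))
```

Because every probability lies in `[0, 1]`, no value is clamped and the rounding error per element stays within half a quantization step, `SCALE / 2`.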

{embedl_deploy_tensorrt-0.5.0 → embedl_deploy_tensorrt-0.6.0}/src/embedl_deploy/_internal/tensorrt/modules/conv.py RENAMED

@@ -3,7 +3,7 @@
 """Fused ``nn.Module`` replacements for convolution-based patterns.
 Each class represents a hardware-fusible operation that replaces a multi-op
-chain found by the pattern matcher.  The fused module keeps the original sub-
+chain found by the pattern matcher. The fused module keeps the original sub-
 modules (``Conv``, ``BN``, ``ReLU``) as children so that:
 * Weights are trivially transferred from the original model.
@@ -25,11 +25,11 @@ def _is_int8_compatible_conv(conv: nn.Conv2d) -> bool:
     """Return ``True`` unless *conv* is a grouped conv violating TRT INT8.
     TensorRT's documented constraint for ``IConvolutionLayer`` is that
-    ``in_channels / groups`` and ``out_channels / groups`` must both
-    be multiples of 4 in INT8 mode.  Depthwise convolutions
-    (``groups == in_channels``) are an exception: our benchmarks on
-    the target devices show they still benefit from INT8 despite
-    channels-per-group being 1, so we let them through.
+    ``in_channels / groups`` and ``out_channels / groups`` must both be
+    multiples of 4 in INT8 mode. Depthwise convolutions (``groups ==
+    in_channels``) are an exception: our benchmarks on the target devices show
+    they still benefit from INT8 despite channels-per-group being 1, so we let
+    them through.
     """
     if conv.groups <= 1:
         return True
@@ -51,7 +51,6 @@ def _conv_weight_forward(
         if weight_fake_quant is not None
         else conv.weight
     )
-    # pylint: disable-next=not-callable
     return F.conv2d(
         x,
         weight,
@@ -83,6 +82,10 @@ class FusedConvBNAct(FusedModule):
         else:
             self.input_quant_stubs = {}
+    @property
+    def quantized_weight(self) -> torch.Tensor | None:
+        return self.conv.weight
     def forward(self, x: torch.Tensor) -> torch.Tensor:
         """Apply ``conv → [bn] → act``."""
         wfq = getattr(self, 'weight_fake_quant', None)
@@ -121,6 +124,10 @@ class FusedConvBN(FusedModule):
         else:
             self.input_quant_stubs = {}
+    @property
+    def quantized_weight(self) -> torch.Tensor | None:
+        return self.conv.weight
     def forward(self, x: torch.Tensor) -> torch.Tensor:
         """Apply ``conv → [bn]``."""
         wfq = getattr(self, 'weight_fake_quant', None)
@@ -160,6 +167,10 @@ class FusedConvBNActMaxPool(FusedModule):
         self.maxpool = maxpool
         self.weight_fake_quant = WeightFakeQuantize({self})
+    @property
+    def quantized_weight(self) -> torch.Tensor | None:
+        return self.conv.weight
     def forward(self, x: torch.Tensor) -> torch.Tensor:
         """Apply ``conv → [bn] → act → maxpool``."""
         x = _conv_weight_forward(self.conv, self.weight_fake_quant, x)
@@ -206,6 +217,10 @@ class FusedConvBNAddAct(FusedModule):
         else:
             self.input_quant_stubs = {}
+    @property
+    def quantized_weight(self) -> torch.Tensor | None:
+        return self.conv.weight
     def forward(self, x: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
         """Apply ``conv → bn → add(·, residual) → act``."""
         wfq = getattr(self, 'weight_fake_quant', None)
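
Reviewer note on `_is_int8_compatible_conv()`. The diff shows only the docstring and the first branch of that function, so the following is a hypothetical restatement of the documented rule on plain integers, not the package's implementation. `is_int8_compatible` and its parameters are illustrative names.

```python
def is_int8_compatible(in_channels: int, out_channels: int, groups: int) -> bool:
    """Hypothetical restatement of the rule in the docstring above."""
    if groups <= 1:
        # Ordinary (non-grouped) convolution: no constraint applies.
        return True
    if groups == in_channels:
        # Depthwise convolution: let through as the documented exception.
        return True
    in_per_group = in_channels // groups
    out_per_group = out_channels // groups
    return in_per_group % 4 == 0 and out_per_group % 4 == 0


checks = {
    "dense 3->64": is_int8_compatible(3, 64, 1),
    "depthwise 32": is_int8_compatible(32, 32, 32),
    "grouped 64/8": is_int8_compatible(64, 64, 8),
    "grouped 24/4": is_int8_compatible(24, 24, 4),
}
```

In this sketch the 24-channel, 4-group case is rejected because 24 / 4 = 6 channels per group is not a multiple of 4. That is the situation in which the fused conv modules set `self.input_quant_stubs = {}`.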
