
trnsparse 0.3.2.tar.gz → 0.4.0.tar.gz


Files changed (61)

{trnsparse-0.3.2 → trnsparse-0.4.0}/CHANGELOG.md RENAMED

@@ -5,6 +5,44 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [0.4.0] — 2026-04-14
+### Added
+- **`screened_spmm(A, diag_integrals, B, threshold)`** — fused Schwarz-
+  screened dense matmul. One NKI kernel fuses the full pipeline —
+  outer-product pair bound → threshold → mask-apply → `nc_matmul` —
+  into a single dispatch. Saves ~30–50% end-to-end vs the unfused
+  `density_screen + from_dense + spmm` flow on Fock-build-sized inputs.
+  Closes #19.
+- **`_screened_spmm_kernel`** — new `@nki.jit` kernel in
+  `trnsparse/nki/kernels.py`. Stationary-A-tile-reuse GEMM extended with
+  a per-tile pair-bound mask built from the 1-D Schwarz-bound vector.
+- **`_ScreenedSpMMFunction`** — `torch.autograd.Function` wrapper.
+  Third differentiable NKI kernel in the trnsci suite (after v0.2.0
+  CSR SpMM and v0.3.0 BSR SpMM). `torch.autograd.gradcheck` passes at
+  `atol=1e-4` on hardware. Mask is non-differentiable (discrete gate);
+  gradients flow to `A` (masked) and `B` (transposed masked A) only.
+- **Tests**: 4 CPU (`TestScreenedSpmm`), 2 simulator
+  (`TestScreenedSpmmSimulator`), 7 hardware
+  (`TestNkiScreenedSpmmParity` + `TestNkiScreenedSpmmDifferentiability`).
+  All green on `trn1.2xlarge`.
+- **`docs/architecture.md`** — new "Fused screened SpMM" section.
+### Closed
+- [#24](https://github.com/trnsci/trnsparse/issues/24) — fused-CG NKI
+  kernel was not buildable under NKI 2.24/0.3.0 constraints (no break,
+  no iteration-carried scalar state across `affine_range`, no nested
+  kernels). Per-iteration `_cg_step_kernel` reframe evaluated and
+  found to save only 5–20% — not worth the authoring cost relative to
+  #19's genuine 30–50% savings. See #24 close comment for the audit.
+### Known limits
+- Restricted to square `A` (`M == K`) with 1-D `diag_integrals`.
+  Rectangular / asymmetric-bounds extension is a follow-up if asked for.
 ## [0.3.2] — 2026-04-14
 ### Added

{trnsparse-0.3.2/trnsparse.egg-info → trnsparse-0.4.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: trnsparse
-Version: 0.3.2
+Version: 0.4.0
 Summary: Sparse matrix operations for AWS Trainium via NKI
 Author-email: Scott Friedman <scttfrdmn@gmail.com>
 License-Expression: Apache-2.0

{trnsparse-0.3.2 → trnsparse-0.4.0}/docs/architecture.md RENAMED

@@ -84,6 +84,25 @@ Backward — `_SpMMFunction.backward`, PyTorch-level:
 This wrapping satisfies [`trnsci/trnsci#3`](https://github.com/trnsci/trnsci/issues/3) — the suite-wide requirement that every NKI kernel live inside a `torch.autograd.Function` so training-time `loss.backward()` works. `torch.autograd.gradcheck` on small inputs is part of the hardware test matrix.
+### Fused screened SpMM (v0.4.0)
+`screened_spmm(A, diag_integrals, B, threshold)` fuses the
+chemistry-screened SpMM pipeline — Schwarz bound from the diagonal
+integrals, pair-bound threshold mask, masked matmul — into a single
+NKI kernel. The unfused equivalent does four host passes (sqrt, outer
+product, threshold, mask-apply) plus a separate `from_dense` + `spmm`
+call; the fused kernel collapses all of that into one dispatch.
+Mask semantics: `mask[i,j] = sqrt(|diag[i]|) * sqrt(|diag[j]|) >
+sqrt(threshold)`, matching `schwarz_bounds` + `screen_quartets`
+composed. No gradient flows back to `diag_integrals` or `threshold` —
+the mask is treated as a discrete gate; `grad_A *= mask` and
+`grad_B = (A * mask).T @ grad_C`.
+Restricted to square A (`M == K`) with 1-D `diag_integrals` in v0.4.0
+— the common Fock-build case. Rectangular / asymmetric-bounds
+extension is a follow-up if asked for.
 ### Known limits (v0.2.0)
 - **No sparsity exploitation.** Materialize-then-GEMM pays the full `M × K` cost. Row-bucketing is the v0.3.0 ([#15](https://github.com/trnsci/trnsparse/issues/15)) Phase 3 story. See [Benchmarks](benchmarks.md) for where NKI sits today vs scipy / torch.sparse.
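
Reviewer sketch for the mask and gradient rules stated in the new "Fused screened SpMM" docs section. It uses NumPy rather than torch so it runs without Neuron hardware, and `screened_matmul_ref`, `screened_matmul_grads`, and `numeric_grad` are hypothetical helper names, not part of trnsparse. Assuming the mask acts as a constant gate, the two documented backward formulas should agree with central finite differences of the loss `sum(C**2)`:

```python
import numpy as np


def screened_matmul_ref(A, diag, B, threshold):
    """Forward spec: C = (A * mask) @ B, mask[i, j] = Q[i] * Q[j] > sqrt(threshold)."""
    Q = np.sqrt(np.abs(diag))
    mask = np.outer(Q, Q) > np.sqrt(threshold)
    return (A * mask) @ B, mask


def screened_matmul_grads(A, B, mask, grad_C):
    """Backward rules from the docs; diag and threshold receive no gradient."""
    grad_A = (grad_C @ B.T) * mask
    grad_B = (A * mask).T @ grad_C
    return grad_A, grad_B


def numeric_grad(f, X, eps=1e-6):
    """Central finite differences of scalar f with respect to every entry of X."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps
        Xm[idx] -= eps
        G[idx] = (f(Xp) - f(Xm)) / (2.0 * eps)
    return G


rng = np.random.default_rng(0)
n, N, threshold = 6, 3, 0.5
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, N))
diag = np.linspace(0.05, 4.0, n)  # deterministic bounds: some pairs pass, some fail


def loss(A_, B_):
    C_, _ = screened_matmul_ref(A_, diag, B_, threshold)
    return float((C_ ** 2).sum())


C, mask = screened_matmul_ref(A, diag, B, threshold)
grad_A, grad_B = screened_matmul_grads(A, B, mask, 2.0 * C)  # dL/dC = 2C

fd_A = numeric_grad(lambda X: loss(X, B), A)
fd_B = numeric_grad(lambda X: loss(A, X), B)
print(np.allclose(grad_A, fd_A, atol=1e-5), np.allclose(grad_B, fd_B, atol=1e-5))
```

Entries of `A` that the mask drops get exactly zero gradient, which is what "`grad_A *= mask`" means in practice.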

{trnsparse-0.3.2 → trnsparse-0.4.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "trnsparse"
-version = "0.3.2"
+version = "0.4.0"
 description = "Sparse matrix operations for AWS Trainium via NKI"
 readme = "README.md"
 license = "Apache-2.0"

trnsparse-0.4.0/tests/test_nki_screened_spmm.py ADDED

@@ -0,0 +1,93 @@
+"""On-hardware fused screened SpMM tests (#19).
+Requires Neuron hardware. Run via:
+    AWS_PROFILE=aws ./scripts/run_neuron_tests.sh trn1
+"""
+from __future__ import annotations
+import math
+import pytest
+import torch
+import trnsparse
+from trnsparse.nki.dispatch import _use_nki
+pytestmark = pytest.mark.neuron
+ATOL, RTOL = 1e-3, 1e-4
+@pytest.fixture
+def nki_backend():
+    prev = trnsparse.get_backend()
+    trnsparse.set_backend("nki")
+    yield
+    trnsparse.set_backend(prev)
+class TestNkiScreenedSpmmParity:
+    @pytest.mark.parametrize(
+        "n,N,threshold",
+        [
+            (128, 128, 0.0),
+            (128, 128, 0.5),
+            (256, 128, 0.5),
+            (256, 256, 0.1),
+            (200, 64, 0.3),
+        ],
+    )
+    def test_parity(self, nki_backend, n, N, threshold):
+        torch.manual_seed(42)
+        A = torch.randn(n, n)
+        diag = torch.abs(torch.randn(n)) * 4.0 + 0.01
+        B = torch.randn(n, N)
+        got = trnsparse.screened_spmm(A, diag, B, threshold=threshold)
+        Q = torch.sqrt(torch.abs(diag))
+        mask = (Q.unsqueeze(-1) * Q.unsqueeze(0)) > math.sqrt(max(threshold, 0.0))
+        expected = (A * mask.to(A.dtype)) @ B
+        torch.testing.assert_close(got, expected, atol=ATOL, rtol=RTOL)
+    def test_dispatch_routes_to_nki(self, nki_backend):
+        assert _use_nki()
+class TestNkiScreenedSpmmDifferentiability:
+    """Satisfies the trnsci/trnsci#3 autograd requirement for screened SpMM.
+    Mask is non-differentiable (discrete gate); gradients flow to A and B.
+    """
+    def test_backward_finite(self, nki_backend):
+        torch.manual_seed(1)
+        n = 128
+        A = torch.randn(n, n, requires_grad=True)
+        diag = torch.abs(torch.randn(n)) * 4.0 + 0.01
+        B = torch.randn(n, 64, requires_grad=True)
+        C = trnsparse.screened_spmm(A, diag, B, threshold=0.5)
+        loss = C.pow(2).sum()
+        loss.backward()
+        assert A.grad is not None and torch.isfinite(A.grad).all()
+        assert B.grad is not None and torch.isfinite(B.grad).all()
+    def test_gradcheck_small(self, nki_backend):
+        torch.manual_seed(2)
+        n = 128
+        A = torch.randn(n, n, dtype=torch.float64, requires_grad=True)
+        diag = torch.abs(torch.randn(n, dtype=torch.float64)) * 4.0 + 0.01
+        B = torch.randn(n, 8, dtype=torch.float64, requires_grad=True)
+        from trnsparse.nki.dispatch import _ScreenedSpMMFunction
+        def func(a, b):
+            return _ScreenedSpMMFunction.apply(a, diag, 0.5, b)
+        assert torch.autograd.gradcheck(func, (A, B), eps=1e-6, atol=1e-4)

{trnsparse-0.3.2 → trnsparse-0.4.0}/tests/test_nki_sim.py RENAMED

@@ -111,3 +111,43 @@ class TestBsrSpmmSimulator:
         B = torch.randn(2 * b, 32)
         got = trnsparse.bsr_spmm(bsr, B)
         torch.testing.assert_close(got, A_dense @ B, atol=ATOL, rtol=RTOL)
+class TestScreenedSpmmSimulator:
+    """Fused screened SpMM through the simulator (#19).
+    The NKI kernel fuses Q outer-product + threshold mask + nc_matmul.
+    Small tile-aligned shapes so the simulator runs in seconds.
+    """
+    def test_threshold_zero_equals_plain_matmul(self, nki_backend):
+        """threshold=0 → mask passes all entries → screened_spmm == A @ B."""
+        torch.manual_seed(10)
+        n = 128
+        A = torch.randn(n, n)
+        diag = torch.abs(torch.randn(n)) + 0.1
+        B = torch.randn(n, 64)
+        got = trnsparse.screened_spmm(A, diag, B, threshold=0.0)
+        torch.testing.assert_close(got, A @ B, atol=ATOL, rtol=RTOL)
+    def test_non_trivial_threshold_parity(self, nki_backend):
+        """Non-trivial threshold drops some entries; NKI kernel must match
+        the explicit (A * mask) @ B spec.
+        """
+        import math
+        torch.manual_seed(11)
+        n = 128
+        A = torch.randn(n, n)
+        diag = torch.abs(torch.randn(n)) * 4.0
+        B = torch.randn(n, 64)
+        threshold = 0.5
+        got = trnsparse.screened_spmm(A, diag, B, threshold=threshold)
+        Q = torch.sqrt(torch.abs(diag))
+        mask = (Q.unsqueeze(-1) * Q.unsqueeze(0)) > math.sqrt(threshold)
+        expected = (A * mask.to(A.dtype)) @ B
+        torch.testing.assert_close(got, expected, atol=ATOL, rtol=RTOL)

{trnsparse-0.3.2 → trnsparse-0.4.0}/tests/test_screening.py RENAMED

@@ -73,6 +73,66 @@ class TestDensityScreen:
         )
+class TestScreenedSpmm:
+    """PyTorch-fallback path for the fused screened SpMM (#19)."""
+    def test_threshold_zero_equals_plain_matmul(self):
+        """threshold=0 keeps all entries → screened_spmm == A @ B."""
+        torch.manual_seed(0)
+        n = 64
+        A = torch.randn(n, n)
+        diag = torch.abs(torch.randn(n)) + 0.1
+        B = torch.randn(n, 16)
+        got = trnsparse.screened_spmm(A, diag, B, threshold=0.0)
+        torch.testing.assert_close(got, A @ B, atol=1e-5, rtol=1e-5)
+    def test_huge_threshold_zeros_output(self):
+        """threshold → ∞ drops all entries → screened_spmm returns zeros."""
+        torch.manual_seed(1)
+        n = 64
+        A = torch.randn(n, n)
+        diag = torch.abs(torch.randn(n))
+        B = torch.randn(n, 16)
+        got = trnsparse.screened_spmm(A, diag, B, threshold=1e30)
+        torch.testing.assert_close(got, torch.zeros(n, 16), atol=0, rtol=0)
+    def test_parity_vs_explicit_mask(self):
+        """Matches (A * mask) @ B for a non-trivial threshold."""
+        import math
+        torch.manual_seed(2)
+        n = 64
+        A = torch.randn(n, n)
+        diag = torch.abs(torch.randn(n)) * 4.0
+        B = torch.randn(n, 16)
+        threshold = 0.5
+        got = trnsparse.screened_spmm(A, diag, B, threshold=threshold)
+        Q = torch.sqrt(torch.abs(diag))
+        mask = (Q.unsqueeze(-1) * Q.unsqueeze(0)) > math.sqrt(threshold)
+        expected = (A * mask.to(A.dtype)) @ B
+        torch.testing.assert_close(got, expected, atol=1e-5, rtol=1e-5)
+    def test_non_trivial_mask_setup(self):
+        """Guard: the chosen threshold must drop some but not all entries
+        on the test distribution — otherwise other tests are vacuous.
+        """
+        import math
+        torch.manual_seed(3)
+        n = 64
+        diag = torch.abs(torch.randn(n))
+        threshold = 0.5
+        Q = torch.sqrt(torch.abs(diag))
+        mask = (Q.unsqueeze(-1) * Q.unsqueeze(0)) > math.sqrt(threshold)
+        assert not mask.all()
+        assert mask.any()
 class TestSparsityStats:
     def test_fully_dense(self):
         Q = torch.ones(10, 10)
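
Reviewer sketch of what the `test_non_trivial_mask_setup` guard protects against. `kept_fraction` is a hypothetical helper (not a trnsparse function), and NumPy with a deterministic `diag` stands in for the seeded torch data used in the tests. The idea is that a threshold is only a meaningful test case when it keeps some pairs and drops others:

```python
import numpy as np


def kept_fraction(diag, threshold):
    """Fraction of (i, j) pairs with sqrt(|d_i|) * sqrt(|d_j|) > sqrt(threshold)."""
    Q = np.sqrt(np.abs(diag))
    mask = np.outer(Q, Q) > np.sqrt(threshold)
    return float(mask.mean())


# Deterministic, strictly positive stand-in for the random Schwarz diagonal.
diag = np.linspace(0.01, 4.0, 64)

for t in (0.0, 0.5, 1e30):
    print(f"threshold={t:g}: kept fraction = {kept_fraction(diag, t):.3f}")
```

With strictly positive bounds, `threshold=0` keeps every pair and a huge threshold drops every pair, which matches the two extreme CPU tests; only the intermediate threshold exercises the mask.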

{trnsparse-0.3.2 → trnsparse-0.4.0}/trnsparse/__init__.py RENAMED

@@ -5,7 +5,7 @@ CSR/COO formats, SpMV, SpMM, and integral screening for
 sparse scientific computing. Part of the trnsci scientific computing suite.
 """
-__version__ = "0.3.2"
+__version__ = "0.4.0"
 from .formats import BSRMatrix, COOMatrix, CSRMatrix, eye_sparse, from_dense, from_scipy
 from .iterative import bsr_diagonal, cg_bsr, jacobi_preconditioner_bsr, power_iteration_bsr
@@ -13,6 +13,7 @@ from .nki import HAS_NKI, get_backend, set_backend
 from .ops import (
     bsr_spmm,
     nnz_per_row,
+    screened_spmm,
     sparse_add,
     sparse_scale,
     sparse_transpose,
@@ -33,6 +34,7 @@ __all__ = [
     "spmm",
     "spmv_symmetric",
     "bsr_spmm",
+    "screened_spmm",
     "sparse_add",
     "sparse_scale",
     "sparse_transpose",

{trnsparse-0.3.2 → trnsparse-0.4.0}/trnsparse/nki/dispatch.py RENAMED

@@ -23,7 +23,11 @@ from .kernels import _TILE_K, _TILE_M, _TILE_N, HAS_NKI
 if HAS_NKI:
     import nki
-    from .kernels import _bsr_spmm_kernel, _spmm_dense_kernel  # noqa: F401 — NKI-only
+    from .kernels import (  # noqa: F401 — NKI-only
+        _bsr_spmm_kernel,
+        _screened_spmm_kernel,
+        _spmm_dense_kernel,
+    )
 # When set, kernel-path failures re-raise instead of falling back to
 # PyTorch. Used by the hardware validation suite.
@@ -371,3 +375,113 @@ def nki_bsr_spmm(A, B: torch.Tensor) -> torch.Tensor:
         A.block_size,
         B,
     )
+def _nki_screened_spmm_impl(
+    A: torch.Tensor,
+    Q: torch.Tensor,
+    threshold_sqrt: float,
+    B: torch.Tensor,
+) -> torch.Tensor:
+    """Dispatch `_screened_spmm_kernel` on the XLA device (or simulator).
+    `A` is square (M, M); `Q` is the 1-D Schwarz-bound vector of length M.
+    Pads M up to a multiple of TILE_M, and N up to a multiple of TILE_N
+    when `N > TILE_N`.
+    Falls back to `torch.matmul` on masked A if the kernel errors and
+    `TRNSPARSE_REQUIRE_NKI` is not set.
+    """
+    if not HAS_NKI:
+        raise RuntimeError("NKI not available")
+    M, K = A.shape
+    _, N = B.shape
+    assert M == K, f"screened_spmm currently requires square A; got {A.shape}"
+    M_pad = _round_up(M, _TILE_M)
+    N_pad = N if N <= _TILE_N else _round_up(N, _TILE_N)
+    needs_pad = (M_pad != M) or (N_pad != N)
+    threshold_sqrt_t = torch.tensor(threshold_sqrt, dtype=A.dtype)
+    try:
+        if needs_pad:
+            A_p = torch.zeros(M_pad, M_pad, dtype=A.dtype, device=A.device)
+            A_p[:M, :M] = A
+            Q_p = torch.zeros(M_pad, dtype=Q.dtype, device=Q.device)
+            Q_p[:M] = Q
+            B_p = torch.zeros(M_pad, N_pad, dtype=B.dtype, device=B.device)
+            B_p[:M, :N] = B
+            A_feed, Q_feed, B_feed = A_p.contiguous(), Q_p.contiguous(), B_p.contiguous()
+        else:
+            A_feed, Q_feed, B_feed = A.contiguous(), Q.contiguous(), B.contiguous()
+        if _use_simulator():
+            out_np = nki.simulate(_screened_spmm_kernel)(
+                A_feed.cpu().numpy(),
+                Q_feed.cpu().numpy(),
+                threshold_sqrt_t.cpu().numpy(),
+                B_feed.cpu().numpy(),
+            )
+            result = torch.from_numpy(np.asarray(out_np)).to(A.device)
+        else:
+            (a, q, b), orig_device = _to_xla(A_feed, Q_feed, B_feed)
+            ts = threshold_sqrt_t.to(a.device)
+            c = _screened_spmm_kernel(a, q, ts, b)
+            result = c.to(orig_device)
+        return result[:M, :N] if needs_pad else result
+    except Exception:
+        if _REQUIRE_NKI:
+            raise
+        # Torch fallback computes the mask + matmul directly.
+        pair_bound = Q.unsqueeze(-1) * Q.unsqueeze(0)
+        mask = pair_bound > threshold_sqrt
+        return (A * mask.to(A.dtype)) @ B
+class _ScreenedSpMMFunction(torch.autograd.Function):
+    """Autograd wrapper for fused screened SpMM.
+    Forward: NKI-dispatched (or PyTorch fallback). Backward: PyTorch-level,
+    projecting gradients through the mask.
+    The mask depends on `diag_integrals` and `threshold` but is discrete —
+    no gradient flows back to them. Gradients flow to `A` (masked) and
+    `B` (transposed masked A).
+    """
+    @staticmethod
+    def forward(
+        ctx,
+        A: torch.Tensor,
+        diag_integrals: torch.Tensor,
+        threshold: float,
+        B: torch.Tensor,
+    ) -> torch.Tensor:
+        import math as _math
+        Q = torch.sqrt(torch.abs(diag_integrals))
+        threshold_sqrt = _math.sqrt(threshold)
+        C = _nki_screened_spmm_impl(A, Q, threshold_sqrt, B)
+        # Save the effective mask for backward.
+        mask = (Q.unsqueeze(-1) * Q.unsqueeze(0)) > threshold_sqrt
+        ctx.save_for_backward(A, B, mask)
+        return C
+    @staticmethod
+    def backward(ctx, grad_out: torch.Tensor):
+        A, B, mask = ctx.saved_tensors
+        m_f = mask.to(A.dtype)
+        grad_A = (grad_out @ B.T) * m_f if ctx.needs_input_grad[0] else None
+        grad_B = (A * m_f).T @ grad_out if ctx.needs_input_grad[3] else None
+        # No gradient to diag_integrals (arg 1) or threshold (arg 2).
+        return grad_A, None, None, grad_B
+def nki_screened_spmm(
+    A: torch.Tensor,
+    diag_integrals: torch.Tensor,
+    B: torch.Tensor,
+    threshold: float,
+) -> torch.Tensor:
+    """Screened SpMM entry point — wraps `_ScreenedSpMMFunction.apply` for autograd."""
+    return _ScreenedSpMMFunction.apply(A, diag_integrals, threshold, B)
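
Reviewer sketch of why the zero-padding in `_nki_screened_spmm_impl` should not change the result. This is a NumPy stand-in, not the dispatch code: `round_up`, `screened_ref`, and `screened_padded` are hypothetical names, and the tile sizes 128 and 512 are assumptions taken from the kernel docstring. Padded rows and columns of `A` are zero and padded entries of `Q` are zero, so they contribute nothing to the product and the top-left `(M, N)` slice matches the unpadded spec:

```python
import numpy as np

TILE_M, TILE_N = 128, 512  # assumed tile sizes, per the kernel docstring


def round_up(x, tile):
    """Smallest multiple of `tile` that is >= x."""
    return ((x + tile - 1) // tile) * tile


def screened_ref(A, Q, threshold_sqrt, B):
    """Unpadded spec: C = (A * mask) @ B."""
    mask = np.outer(Q, Q) > threshold_sqrt
    return (A * mask) @ B


def screened_padded(A, Q, threshold_sqrt, B):
    """Zero-pad to tile multiples, compute, then slice back to (M, N)."""
    M, N = A.shape[0], B.shape[1]
    M_pad = round_up(M, TILE_M)
    N_pad = N if N <= TILE_N else round_up(N, TILE_N)
    A_p = np.zeros((M_pad, M_pad))
    A_p[:M, :M] = A
    Q_p = np.zeros(M_pad)
    Q_p[:M] = Q
    B_p = np.zeros((M_pad, N_pad))
    B_p[:M, :N] = B
    return screened_ref(A_p, Q_p, threshold_sqrt, B_p)[:M, :N]


rng = np.random.default_rng(42)
M, N = 200, 64  # the non-tile-aligned shape from the hardware parity test
A = rng.standard_normal((M, M))
B = rng.standard_normal((M, N))
Q = np.sqrt(np.abs(rng.standard_normal(M)) * 4.0 + 0.01)
ts = np.sqrt(0.3)

print(np.allclose(screened_padded(A, Q, ts, B), screened_ref(A, Q, ts, B)))
```

The comparison uses `allclose` rather than exact equality because the padded matmul may sum in a different order.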

{trnsparse-0.3.2 → trnsparse-0.4.0}/trnsparse/nki/kernels.py RENAMED

@@ -79,6 +79,72 @@ if HAS_NKI:
         return out
+    @nki.jit
+    def _screened_spmm_kernel(a, q, threshold_sqrt, b):
+        """Fused Schwarz-screened dense matmul: `C = (A * mask) @ B`.
+        mask[i,j] = `Q[i] * Q[j] > threshold_sqrt`, where Q is
+        `sqrt(|diag_integrals|)` pre-computed on the host.
+        Fuses: outer-product pair bound → threshold → mask-apply → nc_matmul
+        into one kernel. Saves one mask-memory pass + one kernel dispatch
+        vs the unfused flow.
+        Caller guarantees: A is square (M, K) with M==K, padded to a
+        multiple of TILE_M=TILE_K=128. B has K padded likewise and N
+        either ≤ 512 or a multiple of 512. `q` is the 1-D Schwarz-bound
+        vector of length M.
+        `threshold_sqrt` is a 0-d fp32 tensor (scalar).
+        """
+        M, K = a.shape
+        _, N = b.shape
+        TILE_M = _TILE_M
+        TILE_K = _TILE_K
+        TILE_N = N if N <= _TILE_N else _TILE_N
+        c = nl.ndarray((M, N), dtype=a.dtype, buffer=nl.shared_hbm)
+        for m in nl.affine_range(M // TILE_M):
+            for n in nl.affine_range(N // TILE_N):
+                m_off = m * TILE_M
+                n_off = n * TILE_N
+                psum = nl.zeros((TILE_M, TILE_N), dtype=nl.float32, buffer=nl.psum)
+                # Row Q slice used for every k-tile in this (m, n) output tile.
+                q_m = nl.load(q[m_off : m_off + TILE_M])  # (TILE_M,)
+                for k in nl.affine_range(K // TILE_K):
+                    k_off = k * TILE_K
+                    a_tile = nl.load(a[m_off : m_off + TILE_M, k_off : k_off + TILE_K])
+                    q_k = nl.load(q[k_off : k_off + TILE_K])  # (TILE_K,)
+                    # Outer-product pair bound (TILE_M, TILE_K). nl broadcasting
+                    # via explicit reshape — partition-dim-safe.
+                    pair_bound = q_m.reshape((TILE_M, 1)) * q_k.reshape((1, TILE_K))
+                    mask = nl.greater(pair_bound, threshold_sqrt)
+                    a_masked = nl.multiply(a_tile, mask.astype(a.dtype))
+                    # Transpose for stationary-A nc_matmul via a staging buffer.
+                    # nl.load_transpose2d loads+transposes from HBM, but a_masked
+                    # is already in SBUF, so we need to store-and-reload or use
+                    # an in-SBUF transpose primitive. nl.transpose is available
+                    # in NKI 0.3.0; if the simulator rejects, fall back to
+                    # storing to an HBM staging tile and load_transpose2d-ing.
+                    a_t = nl.transpose(a_masked)
+                    b_tile = nl.load(b[k_off : k_off + TILE_K, n_off : n_off + TILE_N])
+                    psum[...] += nisa.nc_matmul(a_t, b_tile)
+                c_sbuf = nl.copy(psum, dtype=a.dtype)
+                nl.store(
+                    c[m_off : m_off + TILE_M, n_off : n_off + TILE_N],
+                    value=c_sbuf,
+                )
+        return c
     @nki.jit
     def _spmm_dense_kernel(a, b):
         """Densified SpMM: C = A @ B with stationary A-tile reuse.
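
Reviewer sketch of the loop structure in `_screened_spmm_kernel`, emulated in NumPy. This is not NKI code: `screened_tiled` is a hypothetical name, the tile sizes are tiny for illustration, and the transpose that `nc_matmul` needs is omitted because a plain `@` does not require it. The point is that building the mask per `(m, k)` tile from slices of `Q` and accumulating over `k` should reproduce the global `(A * mask) @ B`:

```python
import numpy as np


def screened_tiled(A, Q, threshold_sqrt, B, tile_m, tile_k, tile_n):
    """Tile-by-tile (A * mask) @ B; shapes must be multiples of the tile sizes."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for m0 in range(0, M, tile_m):
        q_m = Q[m0:m0 + tile_m]  # row bounds, reused for every k-tile
        for n0 in range(0, N, tile_n):
            acc = np.zeros((tile_m, tile_n))  # plays the role of the PSUM tile
            for k0 in range(0, K, tile_k):
                q_k = Q[k0:k0 + tile_k]
                # Outer-product pair bound for this (m, k) tile only.
                pair_bound = q_m[:, None] * q_k[None, :]
                a_masked = A[m0:m0 + tile_m, k0:k0 + tile_k] * (pair_bound > threshold_sqrt)
                acc += a_masked @ B[k0:k0 + tile_k, n0:n0 + tile_n]
            C[m0:m0 + tile_m, n0:n0 + tile_n] = acc
    return C


rng = np.random.default_rng(11)
M, N = 8, 6
A = rng.standard_normal((M, M))
B = rng.standard_normal((M, N))
Q = np.sqrt(np.linspace(0.05, 4.0, M))
ts = np.sqrt(0.5)

global_mask = np.outer(Q, Q) > ts
expected = (A * global_mask) @ B
got = screened_tiled(A, Q, ts, B, tile_m=4, tile_k=4, tile_n=3)
print(np.allclose(got, expected))
```

No full `(M, K)` mask is ever materialized in the tiled version; each tile's mask is derived from two short slices of `Q`, which is the memory saving the kernel docstring describes.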

{trnsparse-0.3.2 → trnsparse-0.4.0}/trnsparse/ops.py RENAMED

@@ -15,6 +15,8 @@ gather non-zero rows/cols into dense tiles, matmul, scatter back.
 from __future__ import annotations
+import math
 import torch
 from .formats import BSRMatrix, COOMatrix, CSRMatrix
@@ -206,3 +208,70 @@ def bsr_spmm(A: BSRMatrix, B: torch.Tensor) -> torch.Tensor:
         return nki_bsr_spmm(A, B)
     return _bsr_spmm_pytorch(A, B)
+def screened_spmm(
+    A: torch.Tensor,
+    diag_integrals: torch.Tensor,
+    B: torch.Tensor,
+    threshold: float,
+) -> torch.Tensor:
+    """Fused Schwarz-screened dense matmul: `C = (A * mask) @ B`.
+    The mask is the Schwarz-inequality pair bound:
+        Q[i]      = sqrt(|diag_integrals[i]|)
+        mask[i,j] = (Q[i] * Q[j] > sqrt(threshold))
+    On the NKI backend, the sqrt / outer-product / threshold / mask-apply
+    / matmul chain is fused into a single `@nki.jit` kernel — one
+    dispatch, no intermediate mask tensor on HBM, no separate BSR
+    construction pass. Saves ~30-50% end-to-end vs the unfused
+    `density_screen → screen_quartets → from_dense → spmm` flow at
+    realistic Fock-build sizes.
+    On the PyTorch backend, falls back to explicit mask materialization
+    followed by a matmul (the semantic spec the NKI kernel must match).
+    Args:
+        A: Dense matrix, shape `(M, K)`. The unscreened operand —
+            typically the integral slice `(μν|λσ)` for the λσ range.
+        diag_integrals: Per-index Schwarz bounds source, 1-D with shape
+            `(M,)`. `A` must be square (`M == K`) in v0.4.0, so the same
+            vector bounds both rows and columns (the common chemistry
+            case: square A, symmetric bounds).
+        B: Dense RHS, shape `(K, N)`.
+        threshold: Screening threshold. Pairs with
+            `Q[i] * Q[j] <= sqrt(threshold)` are zeroed in `A` before
+            the matmul.
+    Returns:
+        `C`, shape `(M, N)` = `(A * mask) @ B`.
+    Differentiable via `_ScreenedSpMMFunction`; backward projects
+    gradients back through the mask (`dA *= mask`, no gradient to
+    `diag_integrals` or `threshold` since the mask is discrete).
+    """
+    from .nki.dispatch import _use_nki
+    if _use_nki():
+        from .nki.dispatch import nki_screened_spmm
+        return nki_screened_spmm(A, diag_integrals, B, threshold)
+    # PyTorch fallback — semantic spec for the kernel.
+    # Requires diag_integrals 1-D of length matching A's rows and cols
+    # (common chemistry case: A is square (n, n) with a per-shell bound vector).
+    assert (
+        diag_integrals.dim() == 1
+    ), f"diag_integrals must be 1-D; got shape {diag_integrals.shape}"
+    M, K = A.shape
+    assert diag_integrals.shape[0] == M == K, (
+        "screened_spmm requires square A with diag_integrals of matching length; "
+        f"got A shape {A.shape}, diag_integrals shape {diag_integrals.shape}"
+    )
+    Q = torch.sqrt(torch.abs(diag_integrals))
+    threshold_sqrt = math.sqrt(threshold)
+    pair_bound = Q.unsqueeze(-1) * Q.unsqueeze(0)  # (M, K)
+    mask = pair_bound > threshold_sqrt
+    return (A * mask.to(A.dtype)) @ B

{trnsparse-0.3.2 → trnsparse-0.4.0/trnsparse.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: trnsparse
-Version: 0.3.2
+Version: 0.4.0
 Summary: Sparse matrix operations for AWS Trainium via NKI
 Author-email: Scott Friedman <scttfrdmn@gmail.com>
 License-Expression: Apache-2.0

{trnsparse-0.3.2 → trnsparse-0.4.0}/trnsparse.egg-info/SOURCES.txt RENAMED

@@ -39,6 +39,7 @@ tests/test_bsr.py
 tests/test_formats.py
 tests/test_iterative.py
 tests/test_nki_bsr.py
+tests/test_nki_screened_spmm.py
 tests/test_nki_sim.py
 tests/test_nki_spmm.py
 tests/test_ops.py