PyPI - sgemm-bi - Versions diffs - 0.1.1__tar.gz - Mend

sgemm-bi 0.1.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (38)

sgemm_bi-0.1.1/.gitignore +5 -0
sgemm_bi-0.1.1/CHANGELOG.md +85 -0
sgemm_bi-0.1.1/Cargo.lock +114 -0
sgemm_bi-0.1.1/Cargo.toml +32 -0
sgemm_bi-0.1.1/LICENSE-APACHE +19 -0
sgemm_bi-0.1.1/LICENSE-MIT +21 -0
sgemm_bi-0.1.1/PKG-INFO +84 -0
sgemm_bi-0.1.1/README.md +147 -0
sgemm_bi-0.1.1/deny.toml +21 -0
sgemm_bi-0.1.1/docs/usage-guide.md +172 -0
sgemm_bi-0.1.1/examples/capi/smoke.c +162 -0
sgemm_bi-0.1.1/examples/deterministic_training.rs +99 -0
sgemm_bi-0.1.1/include/sgemm_bi.h +126 -0
sgemm_bi-0.1.1/kernels/casts.cu +36 -0
sgemm_bi-0.1.1/kernels/prelude.cuh +24 -0
sgemm_bi-0.1.1/kernels/sgemm_bi.cu +6354 -0
sgemm_bi-0.1.1/pyproject.toml +29 -0
sgemm_bi-0.1.1/python/.gitignore +5 -0
sgemm_bi-0.1.1/python/Cargo.lock +172 -0
sgemm_bi-0.1.1/python/Cargo.toml +33 -0
sgemm_bi-0.1.1/python/README.md +67 -0
sgemm_bi-0.1.1/python/examples/train_deterministic.py +55 -0
sgemm_bi-0.1.1/python/sgemm_bi/__init__.py +25 -0
sgemm_bi-0.1.1/python/sgemm_bi/_sgemm_bi.pyi +96 -0
sgemm_bi-0.1.1/python/sgemm_bi/py.typed +0 -0
sgemm_bi-0.1.1/python/sgemm_bi/torch.py +230 -0
sgemm_bi-0.1.1/python/src/lib.rs +307 -0
sgemm_bi-0.1.1/python/tests/test_torch.py +142 -0
sgemm_bi-0.1.1/src/capi.rs +423 -0
sgemm_bi-0.1.1/src/dispatch.rs +2284 -0
sgemm_bi-0.1.1/src/dtype.rs +36 -0
sgemm_bi-0.1.1/src/engine.rs +333 -0
sgemm_bi-0.1.1/src/error.rs +57 -0
sgemm_bi-0.1.1/src/kernels.rs +270 -0
sgemm_bi-0.1.1/src/lib.rs +90 -0
sgemm_bi-0.1.1/tests/common/mod.rs +115 -0
sgemm_bi-0.1.1/tests/contracts.rs +203 -0
sgemm_bi-0.1.1/tests/tensor_cores.rs +322 -0

sgemm_bi-0.1.1/.gitignore ADDED

@@ -0,0 +1,5 @@
+/target
+.idea/
+.DS_Store
+*.iml
+/internal/

sgemm_bi-0.1.1/CHANGELOG.md ADDED

@@ -0,0 +1,85 @@
+# Changelog
+All notable changes to this project are documented here. The format is
+based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and
+this project adheres to [Semantic Versioning](https://semver.org/).
+## [0.1.1] - 2026-06-12
+### Added
+- **Tensor-core small-tile family + BK=64 staging**: six
+  `sgemm_bi_*_tc64_*` kernels (64×64 tiles, 128 threads) extend the TC
+  tier to both output dims >= 64 and to GPU-underfilling grids; both
+  tile families stage 64-deep reduction slabs (half the barrier count)
+  with the 128-tile family on dynamic shared memory. The two families
+  are BIT-IDENTICAL per output element (same ascending mma chain), so
+  the shape-only routing never changes output bits and the strict
+  all-M forward invariance survives tile switching — asserted through
+  the public API by `tc_cross_tile_strict_all_m_invariance`. Measured
+  (RTX 6000 Ada, bf16): TC forward 3.2–6.3× over the scalar tier at
+  GEMM level, ~116 TFLOPS on M2048 K768 N3072; in a host training
+  loop, parity with cuBLAS-PEDANTIC on d128/d256 models and 16–30 %
+  faster from d768 up.
+- **PyTorch binding** (`python/`, PyPI package `sgemm-bi`, import
+  `sgemm_bi`): PyO3 0.29 + maturin, abi3 wheel for Python >= 3.9. No
+  libtorch linkage — tensors cross as raw device pointers, so one wheel
+  works with any PyTorch build; runtime needs only the NVIDIA driver.
+  Ships `sgemm_bi.Linear` (deterministic `nn.Linear` replacement with
+  GEMM-natural `[in, out]` weight layout and `from_torch` converter),
+  the functional `deterministic_linear` autograd op (dW accumulated in
+  f32 inside the kernel, one rounding to the parameter dtype), and the
+  low-level `Engine`. Engine work is ordered against torch's current
+  stream with a CUDA-event bridge (no host syncs); calls release the
+  GIL; forward/backward are safe across torch's autograd thread.
+  Desk-reviewed against PyTorch/PyO3/maturin/CUDA driver documentation;
+  GPU test suite (`python/tests/`) green on RTX 6000 Ada: parity vs
+  float64 references, bit-identity across runs in all three dtypes,
+  strict all-M batch invariance of the tensor-core forward, end-to-end
+  training.
+- **CI/release for the binding**: `python-binding` job (fmt, clippy,
+  wheel build artifact) and a tag-gated `publish-pypi` job using PyPI
+  trusted publishing (OIDC, no token secret).
+- **C ABI** behind the `capi` feature (`src/capi.rs`, header
+  `include/sgemm_bi.h`, `cdylib`/`staticlib` crate types): engine
+  create/destroy/synchronize on a device ordinal, one `SgbGemm`
+  descriptor for all six GEMM entry points (scalar forward/dW/dX +
+  tensor-core triad) over raw `CUdeviceptr`s, per-thread error strings
+  (`sgb_last_error`), raw stream access for event-based ordering, and
+  upcast-scratch pre-sizing for CUDA Graph capture. Panics convert to
+  `SGB_ERR_PANIC` instead of unwinding across the boundary. Smoke test:
+  `examples/capi/smoke.c`.
+- **Explicit architecture gate**: `SgemmBi::new` now rejects devices
+  below `sm_80` with the new `Error::UnsupportedArch` ("requires Ampere
+  or newer") instead of surfacing an opaque NVRTC failure — the kernel
+  blob uses `cp.async` and native bf16 in every tier, so pre-Ampere
+  devices were never able to run it.
+## [0.1.0] - 2026-06-12
+### Added
+- Initial release: deterministic, batch-invariant CUDA GEMM engine with
+  the full training triad — forward `Y = X@W + bias`, weight gradient
+  `dW += X^T@dY` (f32 master accumulator), input gradient `dX = dY@W^T`.
+- **f32 tier**: full shape coverage via bucketed dispatch (GEMV,
+  ultra-thin, narrow, split-K, gap-fill, Big, Slim, split-M/N); fixed
+  reduction order, no atomics, no cuBLAS anywhere.
+- **Typed tier (bf16/f16)**: native buckets keep f32 shared memory and
+  accumulation with the f32 tier's exact FMA chain; uncovered shapes
+  take "upcast → f32 kernel → RNE downcast". Both routes are
+  bit-identical by contract.
+- **Tensor-core tier (bf16/f16)**: `mma.sync.m16n8k16` with f32
+  accumulators, 2-stage `cp.async` staging, `ldmatrix` fragment loads.
+  Separate numeric contract; bit-identical across runs and strictly
+  batch-invariant forward across all M. 3-7x faster than the scalar
+  tiers on 128x128-tile shapes.
+- GPU contract tests (`tests/contracts.rs`, `tests/tensor_cores.rs`)
+  validated on RTX 6000 Ada / CUDA 13.2; CI with fmt, clippy, docs,
+  MSRV 1.94, cargo-deny, cross-platform build matrix, and a manual
+  tag-gated release pipeline.
+[0.1.1]: https://github.com/silvermpx/sgemm-bi/compare/v0.1.0...v0.1.1
+[0.1.0]: https://github.com/silvermpx/sgemm-bi/releases/tag/v0.1.0
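
Note on the changelog's "fixed reduction order, no atomics" guarantee: floating-point addition is not associative, so a kernel whose summation order can vary (for example through atomics) may change output bits from run to run. The sketch below is a pure-Python illustration of that point, not the package's kernels. `f32`, `dot_ascending`, and `gemm` are hypothetical helper names, and the multiply and add are rounded separately rather than fused as an FMA.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (f64) to the nearest f32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_ascending(xs, ws):
    """Dot product in a fixed ascending-k order, rounding to f32 at every step."""
    acc = 0.0
    for x, w in zip(xs, ws):
        acc = f32(acc + f32(x * w))
    return acc


def gemm(X, W):
    """Reference Y = X @ W where every output element uses the same fixed order."""
    K, N = len(W), len(W[0])
    cols = [[W[k][n] for k in range(K)] for n in range(N)]
    return [[dot_ascending(row, col) for col in cols] for row in X]


ones = [1.0, 1.0, 1.0]
# Same three terms, two summation orders: the f32 results differ.
in_order = dot_ascending([1e8, 1.0, -1e8], ones)
reordered = dot_ascending([1e8, -1e8, 1.0], ones)
print(in_order, reordered)

# Batch invariance in this toy model: row 0 does not depend on how many rows follow.
X = [[1e8, 1.0, -1e8], [1.0, 2.0, 3.0]]
W = [[1.0], [1.0], [1.0]]
print(gemm(X[:1], W)[0] == gemm(X, W)[0])
```

With the order pinned, repeated runs give the same bits by construction. The changelog's harder claim is that the real kernels keep this property across tile families and batch sizes.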

sgemm_bi-0.1.1/Cargo.lock ADDED

@@ -0,0 +1,114 @@
+# This file is automatically @generated by Cargo.
+# It is not intended for manual editing.
+version = 4
+[[package]]
+name = "cfg-if"
+version = "1.0.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "9330f8b2ff13f34540b44e946ef35111825727b38d33286ef986142615121801"
+[[package]]
+name = "crunchy"
+version = "0.2.4"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "460fbee9c2c2f33933d720630a6a0bac33ba7053db5344fac858d4b8952d77d5"
+[[package]]
+name = "cudarc"
+version = "0.19.7"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "1cea5f10a99e025c1b44ae2354c2d8326b25ddbd0baf76bde8e55cfd4018a2cc"
+dependencies = [
+ "libloading",
+]
+[[package]]
+name = "half"
+version = "2.7.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "6ea2d84b969582b4b1864a92dc5d27cd2b77b622a8d79306834f1be5ba20d84b"
+dependencies = [
+ "cfg-if",
+ "crunchy",
+ "zerocopy",
+]
+[[package]]
+name = "libloading"
+version = "0.9.0"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "754ca22de805bb5744484a5b151a9e1a8e837d5dc232c2d7d8c2e3492edc8b60"
+dependencies = [
+ "cfg-if",
+ "windows-link",
+]
+[[package]]
+name = "proc-macro2"
+version = "1.0.106"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "8fd00f0bb2e90d81d1044c2b32617f68fcb9fa3bb7640c23e9c748e53fb30934"
+dependencies = [
+ "unicode-ident",
+]
+[[package]]
+name = "quote"
+version = "1.0.45"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "41f2619966050689382d2b44f664f4bc593e129785a36d6ee376ddf37259b924"
+dependencies = [
+ "proc-macro2",
+]
+[[package]]
+name = "sgemm-bi"
+version = "0.1.1"
+dependencies = [
+ "cudarc",
+ "half",
+]
+[[package]]
+name = "syn"
+version = "2.0.117"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "e665b8803e7b1d2a727f4023456bbbbe74da67099c585258af0ad9c5013b9b99"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "unicode-ident",
+]
+[[package]]
+name = "unicode-ident"
+version = "1.0.24"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "e6e4313cd5fcd3dad5cafa179702e2b244f760991f45397d14d4ebf38247da75"
+[[package]]
+name = "windows-link"
+version = "0.2.1"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "f0805222e57f7521d6a62e36fa9163bc891acd422f971defe97d64e70d0a4fe5"
+[[package]]
+name = "zerocopy"
+version = "0.8.52"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "ce1022995ff5ff5d841ad7d994facc23098cd40152f2c1d11cd607c6f530653f"
+dependencies = [
+ "zerocopy-derive",
+]
+[[package]]
+name = "zerocopy-derive"
+version = "0.8.52"
+source = "registry+https://github.com/rust-lang/crates.io-index"
+checksum = "1ae7f38b72ec2a254e2b87ef277cf2cd4fb97cbebf944faa6f33354da0867930"
+dependencies = [
+ "proc-macro2",
+ "quote",
+ "syn",
+]

sgemm_bi-0.1.1/Cargo.toml ADDED

@@ -0,0 +1,32 @@
+[package]
+name = "sgemm-bi"
+version = "0.1.1"
+edition = "2024"
+rust-version = "1.94"
+authors = ["silvermpx"]
+description = "Deterministic, batch-invariant CUDA GEMM engine with a full training triad (forward, dW, dX) in f32 / bf16 / f16, plus an opt-in tensor-core tier that is faster than cuBLAS PEDANTIC. Bit-identical results across runs; fixed reduction order; no atomics; no cuBLAS dependency."
+homepage = "https://github.com/silvermpx/sgemm-bi"
+documentation = "https://docs.rs/sgemm-bi"
+license = "MIT OR Apache-2.0"
+repository = "https://github.com/silvermpx/sgemm-bi"
+readme = "README.md"
+exclude = ["python/", ".github/"]
+keywords = ["cuda", "gemm", "deterministic", "deep-learning", "reproducibility"]
+categories = ["science", "mathematics"]
+[lib]
+# cdylib/staticlib carry the C ABI (`capi` feature); rlib is the normal
+# Rust dependency form.
+crate-type = ["lib", "cdylib", "staticlib"]
+[features]
+# Flat extern "C" interface (src/capi.rs + include/sgemm_bi.h).
+capi = []
+[dependencies.cudarc]
+version = "0.19.7"
+default-features = false
+features = ["driver", "nvrtc", "dynamic-loading", "cuda-version-from-build-system"]
+[dev-dependencies]
+half = "2"

sgemm_bi-0.1.1/LICENSE-APACHE ADDED

@@ -0,0 +1,19 @@
+                              Apache License
+                        Version 2.0, January 2004
+                     http://www.apache.org/licenses/
+TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+Copyright 2026 silvermpx
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.

sgemm_bi-0.1.1/LICENSE-MIT ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 silvermpx
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

sgemm_bi-0.1.1/PKG-INFO ADDED

@@ -0,0 +1,84 @@
+Metadata-Version: 2.4
+Name: sgemm-bi
+Version: 0.1.1
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Rust
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Classifier: Environment :: GPU :: NVIDIA CUDA
+Summary: Deterministic, batch-invariant CUDA GEMM for PyTorch: bit-identical training matmuls (forward / dW / dX) in f32, bf16, f16, with an opt-in tensor-core tier.
+Keywords: cuda,gemm,deterministic,pytorch,reproducibility
+Author: silvermpx
+License-Expression: MIT OR Apache-2.0
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
+Project-URL: Changelog, https://github.com/silvermpx/sgemm-bi/blob/main/CHANGELOG.md
+Project-URL: Repository, https://github.com/silvermpx/sgemm-bi
+# sgemm-bi for PyTorch
+Deterministic, batch-invariant CUDA GEMM for PyTorch training:
+bit-identical matmuls (forward / dW / dX) in float32, bfloat16, and
+float16, with an opt-in tensor-core tier that is faster than cuBLAS
+PEDANTIC on transformer-class shapes.
+Built on the [sgemm-bi](https://github.com/silvermpx/sgemm-bi) Rust
+engine. No libtorch linkage — tensors cross as raw device pointers, so
+one wheel works with any PyTorch build. Kernels compile once at engine
+creation via NVRTC; the runtime needs only the NVIDIA driver.
+**Requirements**: NVIDIA Ampere or newer (sm_80+), PyTorch with CUDA.
+## Install
+```sh
+pip install maturin
+cd python && maturin build --release
+pip install target/wheels/sgemm_bi-*.whl
+```
+## Use
+```python
+import torch, sgemm_bi
+# Drop-in layer (weight stored [in, out] — GEMM-natural; convert
+# existing layers with Linear.from_torch):
+layer = sgemm_bi.Linear(768, 3072, device="cuda", dtype=torch.bfloat16,
+                        tensor_cores=True)
+y = layer(x)
+y.sum().backward()   # deterministic dW (f32-accumulated) and dX
+# Functional form:
+y = sgemm_bi.deterministic_linear(x, weight, bias, tensor_cores=True)
+# Low-level engine (raw pointers, explicit stream):
+eng = sgemm_bi.Engine(0)
+eng.forward(y.data_ptr(), x.data_ptr(), w.data_ptr(), None,
+            m, k, n, "bfloat16",
+            torch.cuda.current_stream().cuda_stream, True)
+```
+## Determinism contracts
+| tier | guarantee |
+|---|---|
+| f32 / typed scalar | bit-identical across runs; bf16/f16 ≡ "upcast → f32 kernel → RNE downcast"; full shape coverage |
+| tensor cores (`tensor_cores=True`) | own deterministic contract (mma.sync, f32 accumulate); bit-identical across runs; forward strictly batch-invariant across all M; falls back to the scalar tier when an output dim is < 64 (shape-only dispatch — still deterministic; two bit-identical tile families, 128×128 and 64×64, cover everything above) |
+Bias gradient is a plain f32 column sum done in torch (deterministic
+run-to-run for a fixed shape; not part of the engine's batch-invariance
+contract).
+## Examples
+[`examples/train_deterministic.py`](examples/train_deterministic.py)
+trains twice from one seed and asserts bit-identical weights, then
+checks batch invariance. Typed stubs ship in the wheel (`py.typed`) —
+IDEs autocomplete and document the native `Engine` class.
+## Tests
+```sh
+pip install pytest && pytest tests/ -v   # needs a CUDA GPU
+```
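
Note on the typed contract in the table above ("upcast → f32 kernel → RNE downcast"): the point of accumulating in f32 and rounding once is that small contributions are not lost at every step. The sketch below shows a standard round-to-nearest-even conversion from f32 bits to bf16 and compares one final rounding against rounding after every addition. It is a stdlib-only illustration under stated assumptions: `bf16_rne` and `bf16_to_float` are hypothetical names, NaN and infinity handling is omitted, and nothing here is taken from the package's kernels.

```python
import struct


def f32_bits(x: float) -> int:
    """Bit pattern of x after rounding to f32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bf16_rne(x: float) -> int:
    """16-bit bf16 pattern of x, round-to-nearest-even (finite inputs only)."""
    bits = f32_bits(x)
    lsb = (bits >> 16) & 1          # lowest bit that survives the truncation
    rounded = bits + 0x7FFF + lsb   # ties go to the even mantissa
    return (rounded >> 16) & 0xFFFF


def bf16_to_float(h: int) -> float:
    """Widen a bf16 bit pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", h << 16))[0]


terms = [1.0] + [2.0 ** -8] * 4   # each small term is half a bf16 ulp at 1.0

# Accumulate at higher precision, round once at the output store.
acc = 0.0
for t in terms:
    acc += t                      # exact for these values
single_rounding = bf16_to_float(bf16_rne(acc))

# Round to bf16 after every addition instead.
acc16 = 0.0
for t in terms:
    acc16 = bf16_to_float(bf16_rne(acc16 + t))

print(single_rounding, acc16)
```

Here the per-step variant never leaves 1.0 because every partial sum lands on a tie that rounds back down, while the single-rounding variant keeps the full contribution of the four small terms.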

sgemm_bi-0.1.1/README.md ADDED

@@ -0,0 +1,147 @@
+# sgemm-bi
+Deterministic, batch-invariant CUDA GEMM engine with a full **training
+triad** — forward, weight gradient, and input gradient — in **f32, bf16,
+and f16**, plus an opt-in **tensor-core tier**.
+Existing batch-invariant kernel collections cover inference only and trade
+10–40% throughput for determinism. `sgemm-bi` covers the backward pass
+too, and on tile-friendly shapes the tensor-core tier makes deterministic
+training *faster* than a CUDA-core cuBLAS baseline.
+## Guarantees
+- **Run-to-run determinism** — fixed reduction order in every kernel: no
+  atomics, no data-dependent splits, no vendor-BLAS fallback. Same inputs
+  → bit-identical outputs, including through CUDA Graph replay.
+- **Batch invariance** — within a dispatch bucket, output row 0 is
+  bit-identical regardless of the batch dimension M. The tensor-core
+  forward is strictly batch-invariant across *all* M.
+- **Typed bit contract** — bf16/f16 results are bit-identical to "upcast
+  the inputs to f32, run the f32 tier, round-to-nearest-even downcast the
+  output". Accumulation never happens in reduced precision; exactly one
+  rounding is applied, at the output store.
+## Operations
+| op | math | output |
+|---|---|---|
+| `forward` | `Y[M,N] = X[M,K] @ W[K,N] + bias[N]` | typed / f32 |
+| `backward_dw` | `dW[K,N] += X^T[K,M] @ dY[M,N]` | f32 accumulate |
+| `backward_dx` | `dX[M,K] = dY[M,N] @ W^T[N,K]` | typed / f32 |
+Each op exists in three tiers: `*_f32` (the reference chain), typed
+(bf16/f16, bit-equal to the f32 tier on upcast inputs), and `*_tc`
+(tensor cores — a separate deterministic contract; mma.sync with f32
+accumulators cannot bit-match a scalar FMA chain, but it is deterministic
+and strictly batch-invariant).
+The f32 and typed tiers cover **every** shape: a bucketed dispatcher
+(Big / Slim / narrow / ultra-thin / GEMV / split-K/M/N with fixed-order
+tree reduction) handles the common cases natively and the typed tier
+falls back to "upcast → f32 kernel → downcast" — same bits by contract —
+for the rest. The tensor-core tier covers both output dims ≥ 64 (two
+bit-identical kernel families, 128×128 and 64×64 tiles, routed by shape)
+and returns `Error::Uncovered` otherwise.
+## Performance (RTX 6000 Ada, bf16)
+Tensor-core tier vs the scalar deterministic tier, GEMM level
+(forward; measured on this crate's bench suite):
+| shape (M, K, N) | scalar | tensor cores | speedup |
+|---|---:|---:|---:|
+| 2048, 768, 3072 | 290.9 µs | 83.4 µs | **3.5×** |
+| 4096, 1536, 3072 | 1123.0 µs | 353.5 µs | 3.2× |
+| 2048, 768, 512 | 123.5 µs | 19.5 µs | **6.3×** |
+~116 TFLOPS bf16 at M2048 K768 N3072 (~32 % of Ada dense peak). dW and
+dX see similar gains (4.0–5.6× and 3.5–5.1× on the same shapes).
+Against cuBLAS (measured in a host application using this engine for
+every training GEMM, same GPU, per optimizer step):
+| dtype × tier | step time vs cuBLAS (lower is faster) | notes |
+|---|---|---|
+| f32 scalar vs TF32 | 1.28–1.53× | full f32 precision vs truncated-mantissa TF32 |
+| bf16/f16 scalar vs PEDANTIC | 1.09–1.37× | bit-contract, CUDA cores |
+| bf16 TC vs PEDANTIC | **1.04× (d128) → 0.70× (d1536)** | parity on small models, 16–30 % FASTER from d768 |
+| f16 TC vs PEDANTIC | 1.19× (d128) → 0.76× (d1536) | |
+The cost of determinism is zero-to-negative on transformer-class
+shapes; the deterministic bf16 step at d1536 also beats the f32-TF32
+baseline outright.
+## Documentation and examples
+- [Usage guide](docs/usage-guide.md) — recipes for all three interfaces,
+  tier selection, CUDA Graph capture, determinism self-checks.
+- [`examples/deterministic_training.rs`](examples/deterministic_training.rs) —
+  full Rust triad with runtime determinism/invariance asserts
+  (`cargo run --release --example deterministic_training`).
+- [`examples/capi/smoke.c`](examples/capi/smoke.c) — the C ABI end to end.
+- [`python/examples/train_deterministic.py`](python/examples/train_deterministic.py) —
+  bit-identical PyTorch training, twice from one seed.
+- API reference: [docs.rs/sgemm-bi](https://docs.rs/sgemm-bi); the Python
+  package ships typed stubs (`.pyi` + `py.typed`), so IDE hover/completion
+  documents the native `Engine` too.
+## Usage
+```rust,ignore
+use sgemm_bi::{Dtype, SgemmBi, TypedPtr};
+let context = cudarc::driver::CudaContext::new(0).unwrap();
+let stream = context.new_stream().unwrap();
+let engine = SgemmBi::new(&context, stream.clone()).unwrap();
+// y/x/w are CUdeviceptr device allocations on `stream` (bf16 storage).
+engine
+    .forward(
+        TypedPtr::new(y, Dtype::Bf16),
+        TypedPtr::new(x, Dtype::Bf16),
+        TypedPtr::new(w, Dtype::Bf16),
+        Some(bias_f32_ptr),
+        (m, k, n),
+    )
+    .unwrap();
+```
+The engine binds to one stream; all calls enqueue and return. For CUDA
+Graph capture, call `presize_upcast_scratch` before capturing so the
+typed fallback never allocates inside (or after) a captured graph.
+## Requirements
+- NVIDIA GPU, `sm_80`+ (Ampere or newer) for every tier: `cp.async` and native
+  bf16 are used throughout, so `SgemmBi::new` rejects older devices.
+- CUDA driver + NVRTC at run time. Kernels compile at engine construction
+  for the device's native architecture — no toolkit or `nvcc` needed.
+- No cuBLAS: the library never links or calls a vendor BLAS.
+## Testing
+Contract tests require a CUDA device:
+```sh
+cargo test --release -- --test-threads=1
+```
+Covered: f32 run-to-run bit identity; the typed bit contract swept across
+~90 dispatch-gate boundary shapes (forward) plus backward shapes;
+per-bucket batch invariance; tensor-core determinism, strict all-M
+invariance, and accuracy vs the f32 reference. Benchmarks are `#[ignore]`d
+(`bench_tc_vs_scalar`).
+## Lineage
+The Big-tile kernels descend from [siboehm's SGEMM
+warptiling](https://github.com/siboehm/SGEMM_CUDA) work; smem padding
+follows [salykova's sgemm.cu](https://github.com/salykova/sgemm.cu). The
+engine is extracted from the GEMM layer of
+[mamba-rs](https://github.com/silvermpx/mamba-rs), where it powers
+deterministic SSM training.
+## License
+Dual-licensed under MIT or Apache-2.0.
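
Note on the README's performance table: the speedup column and the "~116 TFLOPS" figure can be re-derived from the listed timings, counting a GEMM as 2·M·K·N floating-point operations. The sketch below does that arithmetic. The timings are copied from the table; `gemm_tflops` is a hypothetical helper name, and the bias add is ignored.

```python
def gemm_tflops(m: int, k: int, n: int, micros: float) -> float:
    """Throughput in TFLOPS for one M x K by K x N GEMM that took `micros` microseconds."""
    flops = 2 * m * k * n            # one multiply and one add per (m, k, n) triple
    return flops / (micros * 1e-6) / 1e12


# (M, K, N), scalar-tier time in µs, tensor-core time in µs, as listed in the README.
rows = [
    ((2048, 768, 3072), 290.9, 83.4),
    ((4096, 1536, 3072), 1123.0, 353.5),
    ((2048, 768, 512), 123.5, 19.5),
]

for shape, scalar_us, tc_us in rows:
    speedup = scalar_us / tc_us
    print(shape, f"{speedup:.1f}x", f"{gemm_tflops(*shape, tc_us):.1f} TFLOPS")
```

The first row reproduces the quoted ~116 TFLOPS, and the three speedups round to the 3.5×, 3.2×, and 6.3× shown in the table.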

sgemm_bi-0.1.1/deny.toml ADDED

@@ -0,0 +1,21 @@
+[advisories]
+version = 2
+[licenses]
+version = 2
+allow = [
+    "MIT",
+    "Apache-2.0",
+    "BSD-2-Clause",
+    "BSD-3-Clause",
+    "ISC",
+    "Unicode-3.0",
+    "Zlib",
+]
+[bans]
+multiple-versions = "warn"
+[sources]
+unknown-registry = "deny"
+unknown-git = "deny"