blksprs 2.1.3.tar.gz → 2.1.5.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (28)
  1. {blksprs-2.1.3 → blksprs-2.1.5}/PKG-INFO +7 -11
  2. {blksprs-2.1.3 → blksprs-2.1.5}/README.md +6 -10
  3. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs/__init__.py +2 -2
  4. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs/ops/conversion.py +12 -20
  5. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs/ops/distribution.py +12 -20
  6. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs/ops/flow.py +12 -20
  7. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs/ops/matmul.py +6 -10
  8. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs/ops/misc/broadcast_ops.py +6 -10
  9. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs/ops/misc/row_wise.py +35 -35
  10. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs/ops/repeat.py +2 -2
  11. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs/ops/softmax.py +10 -12
  12. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs/utils/autotuning.py +2 -2
  13. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs/utils/validation.py +21 -0
  14. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs.egg-info/PKG-INFO +7 -11
  15. {blksprs-2.1.3 → blksprs-2.1.5}/pyproject.toml +1 -1
  16. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs/layouting/distribution_layout.py +0 -0
  17. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs/layouting/sparsity_layout.py +0 -0
  18. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs/ops/partitioning.py +0 -0
  19. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs/ops/transpose.py +0 -0
  20. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs/utils/benchmarking.py +0 -0
  21. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs/utils/blksprs_tensor.py +0 -0
  22. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs/utils/processing.py +0 -0
  23. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs/utils/tools.py +0 -0
  24. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs.egg-info/SOURCES.txt +0 -0
  25. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs.egg-info/dependency_links.txt +0 -0
  26. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs.egg-info/requires.txt +0 -0
  27. {blksprs-2.1.3 → blksprs-2.1.5}/blksprs.egg-info/top_level.txt +0 -0
  28. {blksprs-2.1.3 → blksprs-2.1.5}/setup.cfg +0 -0

{blksprs-2.1.3 → blksprs-2.1.5}/PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: blksprs
- Version: 2.1.3
+ Version: 2.1.5
  Summary: A lightweight library for operations on block-sparse matrices in PyTorch.
  Author-email: Felix Schön <schoen@kr.tuwien.ac.at>
  Project-URL: Homepage, https://github.com/FelixSchoen/blksprs
@@ -20,7 +20,8 @@ Requires-Dist: matplotlib; extra == "test"
  # blksprs

  [![GitHub Release](https://img.shields.io/github/v/release/FelixSchoen/blksprs?include_prereleases&label=Latest%20Release)](https://github.com/FelixSchoen/blksprs/releases)
- [![Python Version](https://img.shields.io/badge/Python%20Version-3.11-blue)](https://www.python.org/downloads/release/python-3119/)
+ [![Python 3.11](https://img.shields.io/badge/Python%20Version-3.11-blue)](https://www.python.org/downloads/release/python-3119/)
+ [![Python 3.12](https://img.shields.io/badge/Python%20Version-3.12-blue)](https://www.python.org/downloads/release/python-31210/)

  ## Overview

@@ -75,9 +76,7 @@ _* see the [Roadmap](#roadmap) section for more information_

  ## Installation

- Note that due to the dependency on [Triton](https://github.com/triton-lang/triton) this library is **only compatible
- with
- the Linux platform**.
+ Note that due to the dependency on [Triton](https://github.com/triton-lang/triton) this library is **only compatible with the Linux platform**.
  Keep track of this [issue](https://github.com/triton-lang/triton/issues/1640) for updates.

  We recommend installing blksprs from [PyPI](https://pypi.org/project/blksprs/) using pip:
@@ -86,8 +85,8 @@ We recommend installing blksprs from [PyPI](https://pypi.org/project/blksprs/) u

  ### Dependencies

- - [PyTorch](https://pytorch.org/) (built with v2.6)
- - _[NumPy](https://numpy.org/) (to get rid of warnings, built with v2.2.4)_
+ - [PyTorch](https://pytorch.org/) (built with v2.7.1)
+ - _[NumPy](https://numpy.org/) (to get rid of warnings, built with v2.3.1)_
  - _[Triton](https://github.com/triton-lang/triton) (included with PyTorch)_

  ## Changelog
@@ -103,7 +102,7 @@ We will continue to maintain the library and fix any issues that arise.
  Should you find any bugs please open an [issue](https://github.com/FelixSchoen/blksprs/issues).
  We also encourage [pull requests](https://github.com/FelixSchoen/blksprs/pulls).

- It might be that this changes with future projects, but as of March 2025, we are content with the current state of the
+ It might be that this changes with future projects, but as of June 2025, we are content with the current state of the
  library.

  ## Known Limitations and Issues
@@ -112,9 +111,6 @@ library.
  In order to work around this bug a manual conversion of some values is needed, (slightly) negatively impacting
  performance.
  Watch the [issue](https://github.com/triton-lang/triton/issues/6376) on Triton's issue tracker for more information.
- - PyTorch's `wrap_triton()` currently does not support config pruning. It thus cannot be used for some of the kernels,
- which could impact graph compilation.
- - There seem to be some issues with autocasting, forcing some operations to manually cast.
  - There will be some slight numerical differences between vanilla and blksprs operations.
  These instabilities are due to Triton and thus cannot be fixed by this library alone.
  However, for all intents and purposes, these very minor differences should not matter and can safely be ignored.

{blksprs-2.1.3 → blksprs-2.1.5}/README.md

@@ -1,7 +1,8 @@
  # blksprs

  [![GitHub Release](https://img.shields.io/github/v/release/FelixSchoen/blksprs?include_prereleases&label=Latest%20Release)](https://github.com/FelixSchoen/blksprs/releases)
- [![Python Version](https://img.shields.io/badge/Python%20Version-3.11-blue)](https://www.python.org/downloads/release/python-3119/)
+ [![Python 3.11](https://img.shields.io/badge/Python%20Version-3.11-blue)](https://www.python.org/downloads/release/python-3119/)
+ [![Python 3.12](https://img.shields.io/badge/Python%20Version-3.12-blue)](https://www.python.org/downloads/release/python-31210/)

  ## Overview

@@ -56,9 +57,7 @@ _* see the [Roadmap](#roadmap) section for more information_

  ## Installation

- Note that due to the dependency on [Triton](https://github.com/triton-lang/triton) this library is **only compatible
- with
- the Linux platform**.
+ Note that due to the dependency on [Triton](https://github.com/triton-lang/triton) this library is **only compatible with the Linux platform**.
  Keep track of this [issue](https://github.com/triton-lang/triton/issues/1640) for updates.

  We recommend installing blksprs from [PyPI](https://pypi.org/project/blksprs/) using pip:
@@ -67,8 +66,8 @@ We recommend installing blksprs from [PyPI](https://pypi.org/project/blksprs/) u

  ### Dependencies

- - [PyTorch](https://pytorch.org/) (built with v2.6)
- - _[NumPy](https://numpy.org/) (to get rid of warnings, built with v2.2.4)_
+ - [PyTorch](https://pytorch.org/) (built with v2.7.1)
+ - _[NumPy](https://numpy.org/) (to get rid of warnings, built with v2.3.1)_
  - _[Triton](https://github.com/triton-lang/triton) (included with PyTorch)_

  ## Changelog
@@ -84,7 +83,7 @@ We will continue to maintain the library and fix any issues that arise.
  Should you find any bugs please open an [issue](https://github.com/FelixSchoen/blksprs/issues).
  We also encourage [pull requests](https://github.com/FelixSchoen/blksprs/pulls).

- It might be that this changes with future projects, but as of March 2025, we are content with the current state of the
+ It might be that this changes with future projects, but as of June 2025, we are content with the current state of the
  library.

  ## Known Limitations and Issues
@@ -93,9 +92,6 @@ library.
  In order to work around this bug a manual conversion of some values is needed, (slightly) negatively impacting
  performance.
  Watch the [issue](https://github.com/triton-lang/triton/issues/6376) on Triton's issue tracker for more information.
- - PyTorch's `wrap_triton()` currently does not support config pruning. It thus cannot be used for some of the kernels,
- which could impact graph compilation.
- - There seem to be some issues with autocasting, forcing some operations to manually cast.
  - There will be some slight numerical differences between vanilla and blksprs operations.
  These instabilities are due to Triton and thus cannot be fixed by this library alone.
  However, for all intents and purposes, these very minor differences should not matter and can safely be ignored.

{blksprs-2.1.3 → blksprs-2.1.5}/blksprs/__init__.py

@@ -1,6 +1,6 @@
  from blksprs.utils.blksprs_tensor import BlksprsTensor

- __version__ = "2.1.2"
+ __version__ = "2.1.5"


  class ops:
@@ -27,9 +27,9 @@ class utils:
  from blksprs.utils.processing import apply_torch_linear, apply_torch_normalisation, apply_torch_dropout, \
  apply_function_applicable_row_wise
  from blksprs.utils.tools import do_shape_blocksparse, undo_shape_blocksparse
+ from blksprs.utils.validation import disable_contiguous, disable_validation

  class validation:
- from blksprs.utils.validation import disable_validation
  from blksprs.utils.validation import validate_dimensions, validate_contiguous, validate_dtype_float, \
  validate_dtype_int, validate_device, validate_sparsity, validate_sparsity_dense, \
  validate_sparsity_block_size
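
With 2.1.5 the toggles move: `disable_validation` is no longer re-exported under `blksprs.validation` but, together with the new `disable_contiguous`, under `blksprs.utils`. A minimal usage sketch of the relocated calls (call sites are illustrative only, assuming an environment where blksprs imports, i.e. Linux with a Triton-enabled PyTorch build):

```python
import blksprs as bs

# Globally switch off the contiguity-enforcement and input-validation passes,
# e.g. once inputs are known to be well-formed (locations as of 2.1.5).
bs.utils.disable_contiguous()
bs.utils.disable_validation()
```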

{blksprs-2.1.3 → blksprs-2.1.5}/blksprs/ops/conversion.py

@@ -106,17 +106,13 @@ def to_sparse_kernel(x,
  pid_col = tl.program_id(axis=2)

  # Get sparsity index of current output block consisting of its batch, row, and column index
- spa_bat_idx = (pid_blk * s_lut_r_s + 0 * s_lut_c_s)
- spa_bat_msk = (spa_bat_idx >= 0 and spa_bat_idx < s_lut_r * s_lut_r_s)
- spa_bat = tl.load(s_lut + spa_bat_idx, mask=spa_bat_msk)
+ spa_val_idx = pid_blk * s_lut_r_s + tl.arange(0, 4) * s_lut_c_s
+ spa_val_msk = (tl.arange(0, 4) < 3)
+ spa_val = tl.load(s_lut + spa_val_idx, mask=spa_val_msk)

- spa_row_idx = (pid_blk * s_lut_r_s + 1 * s_lut_c_s)
- spa_row_msk = (spa_row_idx >= 0 and spa_row_idx < s_lut_r * s_lut_r_s)
- spa_row = tl.load(s_lut + spa_row_idx, mask=spa_row_msk)
-
- spa_col_idx = (pid_blk * s_lut_r_s + 2 * s_lut_c_s)
- spa_col_msk = (spa_col_idx >= 0 and spa_col_idx < s_lut_r * s_lut_r_s)
- spa_col = tl.load(s_lut + spa_col_idx, mask=spa_col_msk)
+ spa_bat = tl.sum(spa_val * (tl.arange(0, 4) == 0))
+ spa_row = tl.sum(spa_val * (tl.arange(0, 4) == 1))
+ spa_col = tl.sum(spa_val * (tl.arange(0, 4) == 2))

  # Load block from dense tensor
  blk_d_idx = (spa_bat * x_b_s +
@@ -445,17 +441,13 @@ def adapt_layout_kernel(x,
  pid_col = tl.program_id(axis=2)

  # Get position of current sparsity block consisting of its batch, row, and column index
- spa_bat_o_idx = (pid_blk * s_lut_o_r_s + 0 * s_lut_o_c_s)
- spa_bat_o_msk = (spa_bat_o_idx >= 0 and spa_bat_o_idx < s_lut_o_r * s_lut_o_r_s)
- spa_bat_o = tl.load(s_lut_o + spa_bat_o_idx, mask=spa_bat_o_msk)
-
- spa_row_o_idx = (pid_blk * s_lut_o_r_s + 1 * s_lut_o_c_s)
- spa_row_o_msk = (spa_row_o_idx >= 0 and spa_row_o_idx < s_lut_o_r * s_lut_o_r_s)
- spa_row_o = tl.load(s_lut_o + spa_row_o_idx, mask=spa_row_o_msk)
+ spa_val_idx = pid_blk * s_lut_o_r_s + tl.arange(0, 4) * s_lut_o_c_s
+ spa_val_msk = (tl.arange(0, 4) < 3)
+ spa_val = tl.load(s_lut_o + spa_val_idx, mask=spa_val_msk)

- spa_col_o_idx = (pid_blk * s_lut_o_r_s + 2 * s_lut_o_c_s)
- spa_col_o_msk = (spa_col_o_idx >= 0 and spa_col_o_idx < s_lut_o_r * s_lut_o_r_s)
- spa_col_o = tl.load(s_lut_o + spa_col_o_idx, mask=spa_col_o_msk)
+ spa_bat_o = tl.sum(spa_val * (tl.arange(0, 4) == 0))
+ spa_row_o = tl.sum(spa_val * (tl.arange(0, 4) == 1))
+ spa_col_o = tl.sum(spa_val * (tl.arange(0, 4) == 2))

  # Get equivalent sparsity block in from layout
  spa_bat_x = spa_bat_o
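
Across these kernels (and the analogous hunks in distribution.py, flow.py, matmul.py, broadcast_ops.py, row_wise.py, and softmax.py below), three separately bounds-checked scalar loads from the sparsity LUT are replaced by one masked vector load of the (batch, row, column) triple, from which each component is then extracted with a masked sum. A minimal PyTorch sketch of the same arithmetic, with illustrative values rather than anything taken from the library:

```python
import torch

# Hypothetical sparsity LUT: one (batch, row, col) triple per sparse block.
s_lut = torch.tensor([[0, 2, 5],
                      [1, 0, 3]], dtype=torch.int32)
s_lut_r_s, s_lut_c_s = s_lut.stride()   # row stride = 3, column stride = 1
flat = s_lut.flatten()

pid_blk = 1                      # block handled by this "program"
lanes = torch.arange(0, 4)       # power-of-two lane count, as tl.arange requires
idx = pid_blk * s_lut_r_s + lanes * s_lut_c_s
msk = lanes < 3                  # only the first three lanes hold real data

# Masked gather: the out-of-range lane reads a safe dummy index and is zeroed,
# mirroring spa_val = tl.load(s_lut + spa_val_idx, mask=spa_val_msk).
safe_idx = torch.where(msk, idx, torch.zeros_like(idx))
spa_val = torch.where(msk, flat[safe_idx], torch.zeros_like(flat[safe_idx]))

# Scalar extraction via masked sums, mirroring tl.sum(spa_val * (lanes == k)).
spa_bat = int((spa_val * (lanes == 0)).sum())
spa_row = int((spa_val * (lanes == 1)).sum())
spa_col = int((spa_val * (lanes == 2)).sum())
assert (spa_bat, spa_row, spa_col) == (1, 0, 3)   # the triple stored in row 1
```

A single vectorized load per block replaces three round trips to the LUT and the per-scalar bounds checks, which is presumably the motivation for the change.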

{blksprs-2.1.3 → blksprs-2.1.5}/blksprs/ops/distribution.py

@@ -125,17 +125,13 @@ def gather_kernel(x,
  pid_col = tl.program_id(axis=2)

  # Get position of current sparsity block consisting of its batch, row, and column index
- spa_bat_o_idx = (pid_blk * s_lut_o_r_s + 0 * s_lut_o_c_s)
- spa_bat_o_msk = (spa_bat_o_idx >= 0 and spa_bat_o_idx < s_lut_o_r * s_lut_o_r_s)
- spa_bat_o = tl.load(s_lut_o + spa_bat_o_idx, mask=spa_bat_o_msk)
+ spa_val_idx = pid_blk * s_lut_o_r_s + tl.arange(0, 4) * s_lut_o_c_s
+ spa_val_msk = (tl.arange(0, 4) < 3)
+ spa_val = tl.load(s_lut_o + spa_val_idx, mask=spa_val_msk)

- spa_row_o_idx = (pid_blk * s_lut_o_r_s + 1 * s_lut_o_c_s)
- spa_row_o_msk = (spa_row_o_idx >= 0 and spa_row_o_idx < s_lut_o_r * s_lut_o_r_s)
- spa_row_o = tl.load(s_lut_o + spa_row_o_idx, mask=spa_row_o_msk)
-
- spa_col_o_idx = (pid_blk * s_lut_o_r_s + 2 * s_lut_o_c_s)
- spa_col_o_msk = (spa_col_o_idx >= 0 and spa_col_o_idx < s_lut_o_r * s_lut_o_r_s)
- spa_col_o = tl.load(s_lut_o + spa_col_o_idx, mask=spa_col_o_msk)
+ spa_bat_o = tl.sum(spa_val * (tl.arange(0, 4) == 0))
+ spa_row_o = tl.sum(spa_val * (tl.arange(0, 4) == 1))
+ spa_col_o = tl.sum(spa_val * (tl.arange(0, 4) == 2))

  # Load index values
  blk_i_idx = ((pid_blk * i_b_s) +
@@ -374,17 +370,13 @@ def scatter_reduce_kernel(x,
  pid_col = tl.program_id(axis=2)

  # Get position of current sparsity block consisting of its batch, row, and column index
- spa_bat_x_idx = (pid_blk * s_lut_x_r_s + 0 * s_lut_x_c_s)
- spa_bat_x_msk = (spa_bat_x_idx >= 0 and spa_bat_x_idx < s_lut_x_r * s_lut_x_r_s)
- spa_bat_x = tl.load(s_lut_x + spa_bat_x_idx, mask=spa_bat_x_msk)
-
- spa_row_x_idx = (pid_blk * s_lut_x_r_s + 1 * s_lut_x_c_s)
- spa_row_x_msk = (spa_row_x_idx >= 0 and spa_row_x_idx < s_lut_x_r * s_lut_x_r_s)
- spa_row_x = tl.load(s_lut_x + spa_row_x_idx, mask=spa_row_x_msk)
+ spa_val_idx = pid_blk * s_lut_x_r_s + tl.arange(0, 4) * s_lut_x_c_s
+ spa_val_msk = (tl.arange(0, 4) < 3)
+ spa_val = tl.load(s_lut_x + spa_val_idx, mask=spa_val_msk)

- spa_col_x_idx = (pid_blk * s_lut_x_r_s + 2 * s_lut_x_c_s)
- spa_col_x_msk = (spa_col_x_idx >= 0 and spa_col_x_idx < s_lut_x_r * s_lut_x_r_s)
- spa_col_x = tl.load(s_lut_x + spa_col_x_idx, mask=spa_col_x_msk)
+ spa_bat_x = tl.sum(spa_val * (tl.arange(0, 4) == 0))
+ spa_row_x = tl.sum(spa_val * (tl.arange(0, 4) == 1))
+ spa_col_x = tl.sum(spa_val * (tl.arange(0, 4) == 2))

  # Load x values
  blk_x_idx = ((pid_blk * x_b_s) +

{blksprs-2.1.3 → blksprs-2.1.5}/blksprs/ops/flow.py

@@ -66,17 +66,13 @@ def flow_pull_kernel(x,
  pid_col = tl.program_id(axis=2)

  # Get sparsity index of current output block consisting of its batch, row, and column index
- spa_bat_idx = (pid_blk * s_lut_r_s + 0 * s_lut_c_s)
- spa_bat_msk = (spa_bat_idx >= 0 and spa_bat_idx < s_lut_r * s_lut_r_s)
- spa_bat = tl.load(s_lut + spa_bat_idx, mask=spa_bat_msk)
+ spa_val_idx = pid_blk * s_lut_r_s + tl.arange(0, 4) * s_lut_c_s
+ spa_val_msk = (tl.arange(0, 4) < 3)
+ spa_val = tl.load(s_lut + spa_val_idx, mask=spa_val_msk)

- spa_row_idx = (pid_blk * s_lut_r_s + 1 * s_lut_c_s)
- spa_row_msk = (spa_row_idx >= 0 and spa_row_idx < s_lut_r * s_lut_r_s)
- spa_row = tl.load(s_lut + spa_row_idx, mask=spa_row_msk)
-
- spa_col_idx = (pid_blk * s_lut_r_s + 2 * s_lut_c_s)
- spa_col_msk = (spa_col_idx >= 0 and spa_col_idx < s_lut_r * s_lut_r_s)
- spa_col = tl.load(s_lut + spa_col_idx, mask=spa_col_msk)
+ spa_bat = tl.sum(spa_val * (tl.arange(0, 4) == 0))
+ spa_row = tl.sum(spa_val * (tl.arange(0, 4) == 1))
+ spa_col = tl.sum(spa_val * (tl.arange(0, 4) == 2))

  # Load reverse sparsity index
  rev_idx_spa_idx = (spa_bat * s_l_o_b_s +
@@ -157,17 +153,13 @@ def flow_push_kernel(x,
  pid_col = tl.program_id(axis=2)

  # Get sparsity index of current input block consisting of its batch, row, and column index
- spa_bat_idx = (pid_blk * s_lut_r_s + 0 * s_lut_c_s)
- spa_bat_msk = (spa_bat_idx >= 0 and spa_bat_idx < s_lut_r * s_lut_r_s)
- spa_bat = tl.load(s_lut + spa_bat_idx, mask=spa_bat_msk)
-
- spa_row_idx = (pid_blk * s_lut_r_s + 1 * s_lut_c_s)
- spa_row_msk = (spa_row_idx >= 0 and spa_row_idx < s_lut_r * s_lut_r_s)
- spa_row = tl.load(s_lut + spa_row_idx, mask=spa_row_msk)
+ spa_val_idx = pid_blk * s_lut_r_s + tl.arange(0, 4) * s_lut_c_s
+ spa_val_msk = (tl.arange(0, 4) < 3)
+ spa_val = tl.load(s_lut + spa_val_idx, mask=spa_val_msk)

- spa_col_idx = (pid_blk * s_lut_r_s + 2 * s_lut_c_s)
- spa_col_msk = (spa_col_idx >= 0 and spa_col_idx < s_lut_r * s_lut_r_s)
- spa_col = tl.load(s_lut + spa_col_idx, mask=spa_col_msk)
+ spa_bat = tl.sum(spa_val * (tl.arange(0, 4) == 0))
+ spa_row = tl.sum(spa_val * (tl.arange(0, 4) == 1))
+ spa_col = tl.sum(spa_val * (tl.arange(0, 4) == 2))

  # Get reverse sparsity index
  rev_idx_spa_idx = (spa_bat * s_l_x_b_s +

{blksprs-2.1.3 → blksprs-2.1.5}/blksprs/ops/matmul.py

@@ -145,17 +145,13 @@ def matmul_kernel(x,
  pid_col = tl.program_id(axis=2)

  # Get position of current sparsity block consisting of its batch, row, and column index
- spa_bat_o_idx = (pid_blk * s_lut_o_r_s + 0 * s_lut_o_c_s)
- spa_bat_o_msk = (spa_bat_o_idx >= 0 and spa_bat_o_idx < s_lut_o_r * s_lut_o_r_s)
- spa_bat_o = tl.load(s_lut_o + spa_bat_o_idx, mask=spa_bat_o_msk)
+ spa_val_idx = pid_blk * s_lut_o_r_s + tl.arange(0, 4) * s_lut_o_c_s
+ spa_val_msk = (tl.arange(0, 4) < 3)
+ spa_val = tl.load(s_lut_o + spa_val_idx, mask=spa_val_msk)

- spa_row_o_idx = (pid_blk * s_lut_o_r_s + 1 * s_lut_o_c_s)
- spa_row_o_msk = (spa_row_o_idx >= 0 and spa_row_o_idx < s_lut_o_r * s_lut_o_r_s)
- spa_row_o = tl.load(s_lut_o + spa_row_o_idx, mask=spa_row_o_msk)
-
- spa_col_o_idx = (pid_blk * s_lut_o_r_s + 2 * s_lut_o_c_s)
- spa_col_o_msk = (spa_col_o_idx >= 0 and spa_col_o_idx < s_lut_o_r * s_lut_o_r_s)
- spa_col_o = tl.load(s_lut_o + spa_col_o_idx, mask=spa_col_o_msk)
+ spa_bat_o = tl.sum(spa_val * (tl.arange(0, 4) == 0))
+ spa_row_o = tl.sum(spa_val * (tl.arange(0, 4) == 1))
+ spa_col_o = tl.sum(spa_val * (tl.arange(0, 4) == 2))

  # Setup buffer
  buf = tl.zeros(shape=(TRITON_BLOCK_SIZE, TRITON_BLOCK_SIZE), dtype=tl.float32)

{blksprs-2.1.3 → blksprs-2.1.5}/blksprs/ops/misc/broadcast_ops.py

@@ -110,17 +110,13 @@ def broadcast_add_kernel(x,
  pid_col = tl.program_id(axis=2)

  # Get position of current sparsity block consisting of its batch, row, and column index
- spa_bat_o_idx = (pid_blk * s_lut_o_r_s + 0 * s_lut_o_c_s)
- spa_bat_o_msk = (spa_bat_o_idx >= 0 and spa_bat_o_idx < s_lut_o_r * s_lut_o_r_s)
- spa_bat_o = tl.load(s_lut_o + spa_bat_o_idx, mask=spa_bat_o_msk)
+ spa_val_idx = pid_blk * s_lut_o_r_s + tl.arange(0, 4) * s_lut_o_c_s
+ spa_val_msk = (tl.arange(0, 4) < 3)
+ spa_val = tl.load(s_lut_o + spa_val_idx, mask=spa_val_msk)

- spa_row_o_idx = (pid_blk * s_lut_o_r_s + 1 * s_lut_o_c_s)
- spa_row_o_msk = (spa_row_o_idx >= 0 and spa_row_o_idx < s_lut_o_r * s_lut_o_r_s)
- spa_row_o = tl.load(s_lut_o + spa_row_o_idx, mask=spa_row_o_msk)
-
- spa_col_o_idx = (pid_blk * s_lut_o_r_s + 2 * s_lut_o_c_s)
- spa_col_o_msk = (spa_col_o_idx >= 0 and spa_col_o_idx < s_lut_o_r * s_lut_o_r_s)
- spa_col_o = tl.load(s_lut_o + spa_col_o_idx, mask=spa_col_o_msk)
+ spa_bat_o = tl.sum(spa_val * (tl.arange(0, 4) == 0))
+ spa_row_o = tl.sum(spa_val * (tl.arange(0, 4) == 1))
+ spa_col_o = tl.sum(spa_val * (tl.arange(0, 4) == 2))

  # Load x block
  blk_x_idx = (spa_bat_o * x_b_s +

{blksprs-2.1.3 → blksprs-2.1.5}/blksprs/ops/misc/row_wise.py

@@ -119,17 +119,17 @@ def row_wise_sum_kernel(x,
  pid_col = tl.program_id(axis=2)

  # Get position of current sparsity block consisting of its batch and row index
- spa_bat_idx = (pid_blk * s_lut_x_r_s + 0 * s_lut_x_c_s)
- spa_bat_msk = (spa_bat_idx >= 0 and spa_bat_idx < s_lut_x_r * s_lut_x_r_s)
- spa_bat = tl.load(s_lut_x + spa_bat_idx, mask=spa_bat_msk)
+ spa_val_idx = pid_blk * s_lut_x_r_s + tl.arange(0, 4) * s_lut_x_c_s
+ spa_val_msk = (tl.arange(0, 4) < 3)
+ spa_val = tl.load(s_lut_x + spa_val_idx, mask=spa_val_msk)

- spa_row_idx = (pid_blk * s_lut_x_r_s + 1 * s_lut_x_c_s)
- spa_row_msk = (spa_row_idx >= 0 and spa_row_idx < s_lut_x_r * s_lut_x_r_s)
- spa_row = tl.load(s_lut_x + spa_row_idx, mask=spa_row_msk)
+ spa_bat_x = tl.sum(spa_val * (tl.arange(0, 4) == 0))
+ spa_row_x = tl.sum(spa_val * (tl.arange(0, 4) == 1))
+ spa_col_x = tl.sum(spa_val * (tl.arange(0, 4) == 2))

  # Load reverse sparsity index for current block
- rev_idx_spa_idx = (spa_bat * s_l_o_b_s +
- spa_row * s_l_o_r_s)
+ rev_idx_spa_idx = (spa_bat_x * s_l_o_b_s +
+ spa_row_x * s_l_o_r_s)
  rev_idx_spa_msk = (rev_idx_spa_idx >= 0 and rev_idx_spa_idx < s_l_o_b * s_l_o_b_s)
  rev_idx_spa = tl.load(r_lut_o + rev_idx_spa_idx, mask=rev_idx_spa_msk).to(tl.int32)

@@ -263,17 +263,17 @@ def row_wise_max_kernel(x,
  pid_col = tl.program_id(axis=2)

  # Get position of current sparsity block consisting of its batch and row index
- spa_bat_idx = (pid_blk * s_lut_x_r_s + 0 * s_lut_x_c_s)
- spa_bat_msk = (spa_bat_idx >= 0 and spa_bat_idx < s_lut_x_r * s_lut_x_r_s)
- spa_bat = tl.load(s_lut_x + spa_bat_idx, mask=spa_bat_msk)
+ spa_val_idx = pid_blk * s_lut_x_r_s + tl.arange(0, 4) * s_lut_x_c_s
+ spa_val_msk = (tl.arange(0, 4) < 3)
+ spa_val = tl.load(s_lut_x + spa_val_idx, mask=spa_val_msk)

- spa_row_idx = (pid_blk * s_lut_x_r_s + 1 * s_lut_x_c_s)
- spa_row_msk = (spa_row_idx >= 0 and spa_row_idx < s_lut_x_r * s_lut_x_r_s)
- spa_row = tl.load(s_lut_x + spa_row_idx, mask=spa_row_msk)
+ spa_bat_x = tl.sum(spa_val * (tl.arange(0, 4) == 0))
+ spa_row_x = tl.sum(spa_val * (tl.arange(0, 4) == 1))
+ spa_col_x = tl.sum(spa_val * (tl.arange(0, 4) == 2))

  # Load reverse sparsity index for current block
- rev_idx_spa_idx = (spa_bat * s_l_o_b_s +
- spa_row * s_l_o_r_s)
+ rev_idx_spa_idx = (spa_bat_x * s_l_o_b_s +
+ spa_row_x * s_l_o_r_s)
  rev_idx_spa_msk = (rev_idx_spa_idx >= 0 and rev_idx_spa_idx < s_l_o_b * s_l_o_b_s)
  rev_idx_spa = tl.load(r_lut_o + rev_idx_spa_idx, mask=rev_idx_spa_msk).to(tl.int32)

@@ -361,7 +361,7 @@ def row_wise_add_forward(x: Tensor, sparsity_lut_x: Tensor,
  triton.cdiv(o_r, meta["TRITON_BLOCK_SIZE"]),
  triton.cdiv(o_c, meta["TRITON_BLOCK_SIZE"])]

- (wrap_triton(kernel_blocksparse_row_wise_add)[triton_grid]
+ (wrap_triton(row_wise_add_kernel)[triton_grid]
  (x,
  x_b, x_b_s, x_r_s, x_c_s,
  sparsity_lut_x, s_lut_r, s_lut_r_s, s_lut_c_s,
@@ -383,33 +383,33 @@ def row_wise_add_forward(x: Tensor, sparsity_lut_x: Tensor,
  reset_to_zero=["o"]
  )
  @triton.jit
- def kernel_blocksparse_row_wise_add(x,
- x_b, x_b_s, x_r_s, x_c_s,
- s_lut_x, s_lut_x_r, s_lut_x_r_s, s_lut_x_c_s,
- y, y_b, y_b_s, y_r_s, y_c_s,
- s_l_y_b, s_l_y_b_s, s_l_y_r_s,
- r_lut_y,
- o,
- o_b, o_b_s, o_r_s, o_c_s,
- sparsity_block_size,
- TRITON_BLOCK_SIZE: tl.constexpr) -> None:
+ def row_wise_add_kernel(x,
+ x_b, x_b_s, x_r_s, x_c_s,
+ s_lut_x, s_lut_x_r, s_lut_x_r_s, s_lut_x_c_s,
+ y, y_b, y_b_s, y_r_s, y_c_s,
+ s_l_y_b, s_l_y_b_s, s_l_y_r_s,
+ r_lut_y,
+ o,
+ o_b, o_b_s, o_r_s, o_c_s,
+ sparsity_block_size,
+ TRITON_BLOCK_SIZE: tl.constexpr) -> None:
  # Get triton block indices
  pid_blk = tl.program_id(axis=0)
  pid_row = tl.program_id(axis=1)
  pid_col = tl.program_id(axis=2)

  # Get position of current sparsity block consisting of its batch and row index
- spa_bat_idx = (pid_blk * s_lut_x_r_s + 0 * s_lut_x_c_s)
- spa_bat_msk = (spa_bat_idx >= 0 and spa_bat_idx < s_lut_x_r * s_lut_x_r_s)
- spa_bat = tl.load(s_lut_x + spa_bat_idx, mask=spa_bat_msk)
+ spa_val_idx = pid_blk * s_lut_x_r_s + tl.arange(0, 4) * s_lut_x_c_s
+ spa_val_msk = (tl.arange(0, 4) < 3)
+ spa_val = tl.load(s_lut_x + spa_val_idx, mask=spa_val_msk)

- spa_row_idx = (pid_blk * s_lut_x_r_s + 1 * s_lut_x_c_s)
- spa_row_msk = (spa_row_idx >= 0 and spa_row_idx < s_lut_x_r * s_lut_x_r_s)
- spa_row = tl.load(s_lut_x + spa_row_idx, mask=spa_row_msk)
+ spa_bat_x = tl.sum(spa_val * (tl.arange(0, 4) == 0))
+ spa_row_x = tl.sum(spa_val * (tl.arange(0, 4) == 1))
+ spa_col_x = tl.sum(spa_val * (tl.arange(0, 4) == 2))

  # Get reverse sparsity indices for s
- rev_idx_spa_s_idx = (spa_bat * s_l_y_b_s +
- spa_row * s_l_y_r_s)
+ rev_idx_spa_s_idx = (spa_bat_x * s_l_y_b_s +
+ spa_row_x * s_l_y_r_s)
  rev_idx_spa_s_msk = (rev_idx_spa_s_idx >= 0 and rev_idx_spa_s_idx < s_l_y_b * s_l_y_b_s)
  rev_idx_spa_s = tl.load(r_lut_y + rev_idx_spa_s_idx, mask=rev_idx_spa_s_msk).to(tl.int32)


{blksprs-2.1.3 → blksprs-2.1.5}/blksprs/ops/repeat.py

@@ -142,7 +142,7 @@ def repeat_build_lut(lut: dict, sparsity_layout_x: Tensor, repeats: tuple[int, i
  n_sparse_blocks = torch.sum(lut["sparsity_layout_o"].to(torch.int)).item()
  lut["n_sparse_blocks"] = n_sparse_blocks

- validate_contiguous(sparsity_layout_o, lut["sparsity_lut"], lut["sparsity_reverse_lut"])
+ validate_contiguous(lut["sparsity_layout_o"], lut["sparsity_lut"], lut["sparsity_reverse_lut"])

  return lut

@@ -178,7 +178,7 @@ def repeat_interleave_build_lut(lut: dict, sparsity_layout_x: Tensor, repeats: i
  n_sparse_blocks = torch.sum(lut["sparsity_layout_o"].to(torch.int)).item()
  lut["n_sparse_blocks"] = n_sparse_blocks

- validate_contiguous(sparsity_layout_o, lut["sparsity_lut"], lut["sparsity_reverse_lut"])
+ validate_contiguous(lut["sparsity_layout_o"], lut["sparsity_lut"], lut["sparsity_reverse_lut"])

  return lut


{blksprs-2.1.3 → blksprs-2.1.5}/blksprs/ops/softmax.py

@@ -176,13 +176,12 @@ def softmax_kernel(x,
  pid_col = tl.program_id(axis=2)

  # Get position of current sparsity block consisting of its batch and row index
- spa_bat_idx = (pid_blk * s_lut_r_s + 0 * s_lut_c_s)
- spa_bat_msk = (spa_bat_idx >= 0 and spa_bat_idx < s_lut_r * s_lut_r_s)
- spa_bat = tl.load(s_lut + spa_bat_idx, mask=spa_bat_msk)
+ spa_val_idx = pid_blk * s_lut_r_s + tl.arange(0, 4) * s_lut_c_s
+ spa_val_msk = (tl.arange(0, 4) < 3)
+ spa_val = tl.load(s_lut + spa_val_idx, mask=spa_val_msk)

- spa_row_idx = (pid_blk * s_lut_r_s + 1 * s_lut_c_s)
- spa_row_msk = (spa_row_idx >= 0 and spa_row_idx < s_lut_r * s_lut_r_s)
- spa_row = tl.load(s_lut + spa_row_idx, mask=spa_row_msk)
+ spa_bat = tl.sum(spa_val * (tl.arange(0, 4) == 0))
+ spa_row = tl.sum(spa_val * (tl.arange(0, 4) == 1))

  # Get reverse sparsity indices for s
  rev_idx_spa_s_idx = (spa_bat * s_l_s_b_s +
@@ -241,13 +240,12 @@ def softmax_kernel_grad(g,
  pid_col = tl.program_id(axis=2)

  # Get position of current sparsity block consisting of its batch and row index
- spa_bat_idx = (pid_blk * s_lut_r_s + 0 * s_lut_c_s)
- spa_bat_msk = (spa_bat_idx >= 0 and spa_bat_idx < s_lut_r * s_lut_r_s)
- spa_bat = tl.load(s_lut + spa_bat_idx, mask=spa_bat_msk)
+ spa_val_idx = pid_blk * s_lut_r_s + tl.arange(0, 4) * s_lut_c_s
+ spa_val_msk = (tl.arange(0, 4) < 3)
+ spa_val = tl.load(s_lut + spa_val_idx, mask=spa_val_msk)

- spa_row_idx = (pid_blk * s_lut_r_s + 1 * s_lut_c_s)
- spa_row_msk = (spa_row_idx >= 0 and spa_row_idx < s_lut_r * s_lut_r_s)
- spa_row = tl.load(s_lut + spa_row_idx, mask=spa_row_msk)
+ spa_bat = tl.sum(spa_val * (tl.arange(0, 4) == 0))
+ spa_row = tl.sum(spa_val * (tl.arange(0, 4) == 1))

  rev_idx_spa_s_idx = (spa_bat * s_l_s_b_s +
  spa_row * s_l_s_r_s)

{blksprs-2.1.3 → blksprs-2.1.5}/blksprs/utils/autotuning.py

@@ -14,11 +14,11 @@ if blksprs_autotune_mode == "DEFAULT":

  (64, 3, 8),
  (64, 4, 4),
- (64, 5, 2),
+ (64, 4, 8),

  (128, 3, 8),
  (128, 4, 4),
- (128, 5, 2),
+ (128, 4, 8),
  ]
  elif blksprs_autotune_mode == "TEST":
  autotune_parameters = [

{blksprs-2.1.3 → blksprs-2.1.5}/blksprs/utils/validation.py

@@ -1,9 +1,17 @@
  import torch
  from torch import Tensor

+ CONTIGUOUS = True
  VALIDATION = True


+ def ensure_contiguous(*tensors: Tensor) -> tuple[Tensor, ...]:
+ if _check_skip_contiguous():
+ return tensors
+
+ return tuple(tensor.contiguous() for tensor in tensors)
+
+
  def validate_dimensions(*tensors: Tensor, dims=3) -> None:
  if _check_skip_validation():
  return
@@ -124,6 +132,19 @@ def validate_sparsity_block_size(sparsity_block_size: int, *tensors):
  raise ValueError("Tensor sizes must be divisible by sparsity block size")


+ def _check_skip_contiguous():
+ return not CONTIGUOUS
+
+
+ def _set_skip_contiguous(skip_contiguous: bool):
+ global CONTIGUOUS
+ CONTIGUOUS = not skip_contiguous
+
+
+ def disable_contiguous():
+ _set_skip_contiguous(True)
+
+
  def _check_skip_validation():
  return not VALIDATION
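
The new helpers make the contiguity pass skippable in the same way validation already was. A small usage sketch of the behaviour implied by the code above (illustrative only; it assumes an environment in which blksprs imports, i.e. Linux with a Triton-enabled PyTorch build):

```python
import torch
from blksprs.utils.validation import ensure_contiguous, disable_contiguous

x = torch.arange(12.0).reshape(3, 4).t()   # a transposed view is non-contiguous
(y,) = ensure_contiguous(x)                # returns contiguous copies by default
assert y.is_contiguous()

disable_contiguous()                       # flips the module-level CONTIGUOUS flag
(z,) = ensure_contiguous(x)                # now a pass-through: tensors come back untouched
assert z is x and not z.is_contiguous()
```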

{blksprs-2.1.3 → blksprs-2.1.5}/blksprs.egg-info/PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: blksprs
- Version: 2.1.3
+ Version: 2.1.5
  Summary: A lightweight library for operations on block-sparse matrices in PyTorch.
  Author-email: Felix Schön <schoen@kr.tuwien.ac.at>
  Project-URL: Homepage, https://github.com/FelixSchoen/blksprs
@@ -20,7 +20,8 @@ Requires-Dist: matplotlib; extra == "test"
  # blksprs

  [![GitHub Release](https://img.shields.io/github/v/release/FelixSchoen/blksprs?include_prereleases&label=Latest%20Release)](https://github.com/FelixSchoen/blksprs/releases)
- [![Python Version](https://img.shields.io/badge/Python%20Version-3.11-blue)](https://www.python.org/downloads/release/python-3119/)
+ [![Python 3.11](https://img.shields.io/badge/Python%20Version-3.11-blue)](https://www.python.org/downloads/release/python-3119/)
+ [![Python 3.12](https://img.shields.io/badge/Python%20Version-3.12-blue)](https://www.python.org/downloads/release/python-31210/)

  ## Overview

@@ -75,9 +76,7 @@ _* see the [Roadmap](#roadmap) section for more information_

  ## Installation

- Note that due to the dependency on [Triton](https://github.com/triton-lang/triton) this library is **only compatible
- with
- the Linux platform**.
+ Note that due to the dependency on [Triton](https://github.com/triton-lang/triton) this library is **only compatible with the Linux platform**.
  Keep track of this [issue](https://github.com/triton-lang/triton/issues/1640) for updates.

  We recommend installing blksprs from [PyPI](https://pypi.org/project/blksprs/) using pip:
@@ -86,8 +85,8 @@ We recommend installing blksprs from [PyPI](https://pypi.org/project/blksprs/) u

  ### Dependencies

- - [PyTorch](https://pytorch.org/) (built with v2.6)
- - _[NumPy](https://numpy.org/) (to get rid of warnings, built with v2.2.4)_
+ - [PyTorch](https://pytorch.org/) (built with v2.7.1)
+ - _[NumPy](https://numpy.org/) (to get rid of warnings, built with v2.3.1)_
  - _[Triton](https://github.com/triton-lang/triton) (included with PyTorch)_

  ## Changelog
@@ -103,7 +102,7 @@ We will continue to maintain the library and fix any issues that arise.
  Should you find any bugs please open an [issue](https://github.com/FelixSchoen/blksprs/issues).
  We also encourage [pull requests](https://github.com/FelixSchoen/blksprs/pulls).

- It might be that this changes with future projects, but as of March 2025, we are content with the current state of the
+ It might be that this changes with future projects, but as of June 2025, we are content with the current state of the
  library.

  ## Known Limitations and Issues
@@ -112,9 +111,6 @@ library.
  In order to work around this bug a manual conversion of some values is needed, (slightly) negatively impacting
  performance.
  Watch the [issue](https://github.com/triton-lang/triton/issues/6376) on Triton's issue tracker for more information.
- - PyTorch's `wrap_triton()` currently does not support config pruning. It thus cannot be used for some of the kernels,
- which could impact graph compilation.
- - There seem to be some issues with autocasting, forcing some operations to manually cast.
  - There will be some slight numerical differences between vanilla and blksprs operations.
  These instabilities are due to Triton and thus cannot be fixed by this library alone.
  However, for all intents and purposes, these very minor differences should not matter and can safely be ignored.

{blksprs-2.1.3 → blksprs-2.1.5}/pyproject.toml

@@ -1,6 +1,6 @@
  [project]
  name = "blksprs"
- version = "2.1.3"
+ version = "2.1.5"
  authors = [{ name = "Felix Schön", email = "schoen@kr.tuwien.ac.at" }]
  description = "A lightweight library for operations on block-sparse matrices in PyTorch."
  readme = "README.md"