blksprs 1.9.2__tar.gz → 1.10__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (30)
  1. {blksprs-1.9.2 → blksprs-1.10}/PKG-INFO +18 -14
  2. {blksprs-1.9.2 → blksprs-1.10}/README.md +17 -13
  3. {blksprs-1.9.2 → blksprs-1.10}/blksprs/__init__.py +0 -6
  4. {blksprs-1.9.2 → blksprs-1.10}/blksprs/layouting/distribution_layout.py +6 -6
  5. {blksprs-1.9.2 → blksprs-1.10}/blksprs/layouting/sparsity_layout.py +7 -7
  6. {blksprs-1.9.2 → blksprs-1.10}/blksprs/ops/conversion.py +14 -16
  7. {blksprs-1.9.2 → blksprs-1.10}/blksprs/ops/distribution.py +14 -14
  8. {blksprs-1.9.2 → blksprs-1.10}/blksprs/ops/flow.py +12 -12
  9. {blksprs-1.9.2 → blksprs-1.10}/blksprs/ops/matmul.py +8 -8
  10. {blksprs-1.9.2 → blksprs-1.10}/blksprs/ops/misc/broadcast_ops.py +6 -6
  11. {blksprs-1.9.2 → blksprs-1.10}/blksprs/ops/misc/exp.py +2 -2
  12. {blksprs-1.9.2 → blksprs-1.10}/blksprs/ops/misc/row_wise.py +16 -19
  13. {blksprs-1.9.2 → blksprs-1.10}/blksprs/ops/partitioning.py +24 -10
  14. {blksprs-1.9.2 → blksprs-1.10}/blksprs/ops/softmax.py +17 -16
  15. {blksprs-1.9.2 → blksprs-1.10}/blksprs/ops/transpose.py +9 -8
  16. {blksprs-1.9.2 → blksprs-1.10}/blksprs/utils/blksprs_tensor.py +3 -1
  17. {blksprs-1.9.2 → blksprs-1.10}/blksprs.egg-info/PKG-INFO +18 -14
  18. {blksprs-1.9.2 → blksprs-1.10}/blksprs.egg-info/SOURCES.txt +0 -1
  19. {blksprs-1.9.2 → blksprs-1.10}/pyproject.toml +1 -1
  20. blksprs-1.9.2/blksprs/ops/experimental/distribution_mdi.py +0 -447
  21. {blksprs-1.9.2 → blksprs-1.10}/blksprs/ops/repeat.py +0 -0
  22. {blksprs-1.9.2 → blksprs-1.10}/blksprs/utils/benchmarking.py +0 -0
  23. {blksprs-1.9.2 → blksprs-1.10}/blksprs/utils/layout_utils.py +0 -0
  24. {blksprs-1.9.2 → blksprs-1.10}/blksprs/utils/processing.py +0 -0
  25. {blksprs-1.9.2 → blksprs-1.10}/blksprs/utils/tools.py +0 -0
  26. {blksprs-1.9.2 → blksprs-1.10}/blksprs/utils/validation.py +0 -0
  27. {blksprs-1.9.2 → blksprs-1.10}/blksprs.egg-info/dependency_links.txt +0 -0
  28. {blksprs-1.9.2 → blksprs-1.10}/blksprs.egg-info/requires.txt +0 -0
  29. {blksprs-1.9.2 → blksprs-1.10}/blksprs.egg-info/top_level.txt +0 -0
  30. {blksprs-1.9.2 → blksprs-1.10}/setup.cfg +0 -0
PKG-INFO:

@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: blksprs
- Version: 1.9.2
+ Version: 1.10
  Summary: A lightweight library for operations on blocksparse matrices in PyTorch.
  Author-email: Felix Schön <schoen@kr.tuwien.ac.at>
  Project-URL: Homepage, https://github.com/FelixSchoen/blksprs
@@ -23,14 +23,6 @@ Requires-Dist: build; extra == "build"
  [![GitHub Release](https://img.shields.io/github/v/release/FelixSchoen/blksprs?include_prereleases&label=Latest%20Release)](https://github.com/FelixSchoen/blksprs/releases)
  [![Python Version](https://img.shields.io/badge/Python%20Version-3.11-blue)](https://www.python.org/downloads/release/python-3119/)

- ## Important Notice
-
- 🚨 **Non-Final API** 🚨
-
- Although it already supports a wide variety of functions, this library is still under active development and the API is
- subject to change. For feature requests or bug reports, please open an [issue](https://github.com/FelixSchoen/blksprs/issues).
- We also encourage [pull requests](https://github.com/FelixSchoen/blksprs/pulls).
-
  ## Overview

  A lightweight and efficient library for operations on block-sparse matrices in PyTorch using Triton.
@@ -44,7 +36,7 @@ Currently supported operations (includes gradient calculation):
  - Scatter (_supports either no reduction or summation, gradients are only available for summation_)
  - Repeat (_supports target sparsity layout_)
  - Repeat Interleave (_supports target sparsity layout_)
- - Splitting and merging of matrices along the last dimension
+ - Splitting and merging of matrices (_currently* only supports splitting and merging along the last dimension_)
  - Conversion to and from sparse form
  - Conversion to different sparsity layouts and different sparsity block sizes

@@ -70,13 +62,15 @@ Furthermore, the library provides a set of utility functions
  - for the creation of sparsity layouts based on existing
  dense tensors and for the scatter operation (module ``bs.layouting``),
  - for the application of ``nn.Linear``, ``nn.Dropout``, and ``nn.LayerNorm`` layers to block-sparse tensors,
- - as well as utility functions to apply linear layers,
- ensure correct input dimensionality, and validate input (module ``bs.utils``).
+ - as well as utility functions to ensure correct input dimensionality and validate input (module ``bs.utils``).
+
+ _* see the [Roadmap](#roadmap) section for more information_

  ## Installation

- Note that due to the dependency on [Triton](https://github.com/triton-lang/triton) this library is only compatible with
- the Linux platform.
+ Note that due to the dependency on [Triton](https://github.com/triton-lang/triton), this library is **only compatible with
+ the Linux platform**.
+ Keep track of this [issue](https://github.com/triton-lang/triton/issues/1640) for updates.

  We recommend installing blksprs from [PyPI](https://pypi.org/project/blksprs/) using pip:

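The install command itself falls just outside the hunks shown here; for reference, the standard install from the linked PyPI page is `pip install blksprs`.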
@@ -92,6 +86,16 @@ We recommend installing blksprs from [PyPI](https://pypi.org/project/blksprs/) u
  See [`CHANGELOG.md`](https://github.com/FelixSchoen/blksprs/blob/main/CHANGELOG.md) for a detailed changelog.

+ ## Roadmap
+
+ Note that since this library covers all of our current needs, it is in a **bugfix-only** state.
+ This means that there are no plans to add new features, e.g., support for dimension specification of the ``split`` and ``merge`` operations.
+ We will continue to maintain the library and fix any issues that arise.
+ Should you find any bugs, please open an [issue](https://github.com/FelixSchoen/blksprs/issues).
+ We also encourage [pull requests](https://github.com/FelixSchoen/blksprs/pulls).
+
+ This may change with future projects, but as of December 2024, we are content with the current state of the library.
+
  ## Usage

  We provide an example below to demonstrate the usage of the library.
README.md:

@@ -3,14 +3,6 @@
  [![GitHub Release](https://img.shields.io/github/v/release/FelixSchoen/blksprs?include_prereleases&label=Latest%20Release)](https://github.com/FelixSchoen/blksprs/releases)
  [![Python Version](https://img.shields.io/badge/Python%20Version-3.11-blue)](https://www.python.org/downloads/release/python-3119/)

- ## Important Notice
-
- 🚨 **Non-Final API** 🚨
-
- Although it already supports a wide variety of functions, this library is still under active development and the API is
- subject to change. For feature requests or bug reports, please open an [issue](https://github.com/FelixSchoen/blksprs/issues).
- We also encourage [pull requests](https://github.com/FelixSchoen/blksprs/pulls).
-
  ## Overview

  A lightweight and efficient library for operations on block-sparse matrices in PyTorch using Triton.
@@ -24,7 +16,7 @@ Currently supported operations (includes gradient calculation):
  - Scatter (_supports either no reduction or summation, gradients are only available for summation_)
  - Repeat (_supports target sparsity layout_)
  - Repeat Interleave (_supports target sparsity layout_)
- - Splitting and merging of matrices along the last dimension
+ - Splitting and merging of matrices (_currently* only supports splitting and merging along the last dimension_)
  - Conversion to and from sparse form
  - Conversion to different sparsity layouts and different sparsity block sizes

@@ -50,13 +42,15 @@ Furthermore, the library provides a set of utility functions
  - for the creation of sparsity layouts based on existing
  dense tensors and for the scatter operation (module ``bs.layouting``),
  - for the application of ``nn.Linear``, ``nn.Dropout``, and ``nn.LayerNorm`` layers to block-sparse tensors,
- - as well as utility functions to apply linear layers,
- ensure correct input dimensionality, and validate input (module ``bs.utils``).
+ - as well as utility functions to ensure correct input dimensionality and validate input (module ``bs.utils``).
+
+ _* see the [Roadmap](#roadmap) section for more information_

  ## Installation

- Note that due to the dependency on [Triton](https://github.com/triton-lang/triton) this library is only compatible with
- the Linux platform.
+ Note that due to the dependency on [Triton](https://github.com/triton-lang/triton), this library is **only compatible with
+ the Linux platform**.
+ Keep track of this [issue](https://github.com/triton-lang/triton/issues/1640) for updates.

  We recommend installing blksprs from [PyPI](https://pypi.org/project/blksprs/) using pip:

@@ -72,6 +66,16 @@ We recommend installing blksprs from [PyPI](https://pypi.org/project/blksprs/) u
  See [`CHANGELOG.md`](https://github.com/FelixSchoen/blksprs/blob/main/CHANGELOG.md) for a detailed changelog.

+ ## Roadmap
+
+ Note that since this library covers all of our current needs, it is in a **bugfix-only** state.
+ This means that there are no plans to add new features, e.g., support for dimension specification of the ``split`` and ``merge`` operations.
+ We will continue to maintain the library and fix any issues that arise.
+ Should you find any bugs, please open an [issue](https://github.com/FelixSchoen/blksprs/issues).
+ We also encourage [pull requests](https://github.com/FelixSchoen/blksprs/pulls).
+
+ This may change with future projects, but as of December 2024, we are content with the current state of the library.
+
  ## Usage

  We provide an example below to demonstrate the usage of the library.
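The diff cuts off before the README's actual usage example. Purely as a hypothetical sketch of the flow the surrounding text describes — the layouting call uses a name visible in this diff, while `to_sparse`/`to_dense` and all signatures are guesses inferred from the private classes `_BlocksparseToSparse`/`_BlocksparseToDense` below, not the documented API:

```python
import torch
import blksprs as bs

# Assumed values; the real example lives in the project README.
sparsity_block_size = 32
x = torch.randn(2, 64, 64, device="cuda")

# Derive a sparsity layout from the dense tensor (module bs.layouting).
sparsity_layout = bs.layouting.build_sparsity_layout(x, sparsity_block_size)

# Convert to block-sparse form and back (module bs.ops).
x_sparse = bs.ops.to_sparse(x, sparsity_layout, sparsity_block_size)
x_dense = bs.ops.to_dense(x_sparse, sparsity_layout, sparsity_block_size)
```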
blksprs/__init__.py:

@@ -15,9 +15,6 @@ class ops:
  from blksprs.ops.misc.broadcast_ops import broadcast_add, broadcast_sub
  from blksprs.ops.misc.exp import exp

- class experimental:
- from blksprs.ops.experimental.distribution_mdi import gather_mdi, scatter_reduce_mdi
-

  class layouting:
  from blksprs.layouting.distribution_layout import build_distribution_layout
@@ -25,9 +22,6 @@ class layouting:
  build_sparsity_layout_matmul, build_sparsity_layout_matmul_fast
  from blksprs.utils.layout_utils import build_full_sparsity_layout

- class experimental:
- from blksprs.ops.experimental.distribution_mdi import build_distribution_layout_mdi
-

  class utils:
  from blksprs.utils.processing import apply_torch_linear, apply_torch_normalisation, apply_torch_dropout, \
blksprs/layouting/distribution_layout.py:

@@ -84,21 +84,21 @@ def kernel_distribution_layout(i,

  # Get position of current sparsity block consisting of its batch, row, and column index
  spa_bat_i_idx = (pid_blk * s_lut_i_r_s + 0 * s_lut_i_c_s)
- spa_bat_i_msk = (spa_bat_i_idx < s_lut_i_r * s_lut_i_r_s)
+ spa_bat_i_msk = (spa_bat_i_idx >= 0 and spa_bat_i_idx < s_lut_i_r * s_lut_i_r_s)
  spa_bat_i = tl.load(s_lut_i + spa_bat_i_idx, mask=spa_bat_i_msk)

  spa_row_i_idx = (pid_blk * s_lut_i_r_s + 1 * s_lut_i_c_s)
- spa_row_i_msk = (spa_row_i_idx < s_lut_i_r * s_lut_i_r_s)
+ spa_row_i_msk = (spa_row_i_idx >= 0 and spa_row_i_idx < s_lut_i_r * s_lut_i_r_s)
  spa_row_i = tl.load(s_lut_i + spa_row_i_idx, mask=spa_row_i_msk)

  spa_col_i_idx = (pid_blk * s_lut_i_r_s + 2 * s_lut_i_c_s)
- spa_col_i_msk = (spa_col_i_idx < s_lut_i_r * s_lut_i_r_s)
+ spa_col_i_msk = (spa_col_i_idx >= 0 and spa_col_i_idx < s_lut_i_r * s_lut_i_r_s)
  spa_col_i = tl.load(s_lut_i + spa_col_i_idx, mask=spa_col_i_msk)

  blk_i_idx = (pid_blk * i_b_s +
  ((pid_row * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * i_r_s)[:, None] +
  ((pid_col * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * i_c_s)[None, :])
- blk_i_msk = (blk_i_idx < i_b * i_b_s)
+ blk_i_msk = (blk_i_idx >= 0 and blk_i_idx < i_b * i_b_s)
  blk_i = tl.load(i + blk_i_idx, mask=blk_i_msk)

  dst_bat_idx = tl.full((TRITON_BLOCK_SIZE, TRITON_BLOCK_SIZE), spa_bat_i, dtype=tl.int32)
@@ -111,10 +111,10 @@ def kernel_distribution_layout(i,
  elif dim == 2:
  dst_col_idx = blk_i // sparsity_block_size

- blk_v = tl.full((TRITON_BLOCK_SIZE, TRITON_BLOCK_SIZE), 1, dtype=tl.int32)
+ blk_v = tl.full((TRITON_BLOCK_SIZE, TRITON_BLOCK_SIZE), 1, dtype=tl.int1)

  blk_o_idx = ((dst_bat_idx * o_b_s) +
  (dst_row_idx * o_r_s) +
  (dst_col_idx * o_c_s))
- blk_o_msk = (blk_o_idx < o_b * o_b_s)
+ blk_o_msk = (blk_o_idx >= 0 and blk_o_idx < o_b * o_b_s)
  tl.store(o + blk_o_idx, blk_v, mask=blk_o_msk)
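This hunk shows the two changes that repeat through every kernel in the release: each load/store mask gains a `>= 0` lower bound next to the existing upper bound, and the stored layout marker is narrowed from `tl.int32` to `tl.int1`, matching the boolean nature of a sparsity-layout entry. A minimal, self-contained sketch of the two-sided mask pattern (a hypothetical copy kernel, not blksprs code; written with `&` where the blksprs kernels spell the conjunction as Python `and`):

```python
import torch
import triton
import triton.language as tl


@triton.jit
def copy_kernel(x_ptr, o_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    # Two-sided mask: guard against negative offsets as well as overflow.
    msk = (offs >= 0) & (offs < n_elements)
    blk = tl.load(x_ptr + offs, mask=msk)
    tl.store(o_ptr + offs, blk, mask=msk)


def copy(x: torch.Tensor) -> torch.Tensor:
    o = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    copy_kernel[grid](x, o, x.numel(), BLOCK_SIZE=1024)
    return o
```

The upper bound alone already prevents reads past the end of the tensor; the added lower bound additionally rejects indices that went negative upstream, for instance via a `-1` sentinel from a reverse look-up table.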
blksprs/layouting/sparsity_layout.py:

@@ -71,7 +71,7 @@ def kernel_sparsity_layout(x,
  blk_x_idx = (pid_bat * x_b_s +
  ((pid_row * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * x_r_s)[:, None] +
  ((pid_col * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * x_c_s)[None, :])
- blk_x_msk = (blk_x_idx < x_b * x_b_s)
+ blk_x_msk = (blk_x_idx >= 0 and blk_x_idx < x_b * x_b_s)
  blk_x = tl.load(x + blk_x_idx, mask=blk_x_msk)

  # Store sparsity layout value
@@ -79,7 +79,7 @@ def kernel_sparsity_layout(x,
  blk_o_idx = (pid_bat * o_b_s +
  (((pid_row * TRITON_BLOCK_SIZE) // sparsity_block_size) * o_r_s +
  ((pid_col * TRITON_BLOCK_SIZE) // sparsity_block_size) * o_c_s))
- blk_o_msk = (blk_o_idx < o_b * o_b_s)
+ blk_o_msk = (blk_o_idx >= 0 and blk_o_idx < o_b * o_b_s)
  tl.store(o + blk_o_idx, 1, mask=blk_o_msk)


@@ -162,22 +162,22 @@ def kernel_sparsity_layout_adaption(x,

  # Get sparsity index of current output block consisting of its batch, row, and column index
  spa_bat_idx = (pid_blk * s_lut_r_s + 0 * s_lut_c_s)
- spa_bat_msk = (spa_bat_idx < s_lut_r * s_lut_r_s)
+ spa_bat_msk = (spa_bat_idx >= 0 and spa_bat_idx < s_lut_r * s_lut_r_s)
  spa_bat = tl.load(s_lut + spa_bat_idx, mask=spa_bat_msk)

  spa_row_idx = (pid_blk * s_lut_r_s + 1 * s_lut_c_s)
- spa_row_msk = (spa_row_idx < s_lut_r * s_lut_r_s)
+ spa_row_msk = (spa_row_idx >= 0 and spa_row_idx < s_lut_r * s_lut_r_s)
  spa_row = tl.load(s_lut + spa_row_idx, mask=spa_row_msk)

  spa_col_idx = (pid_blk * s_lut_r_s + 2 * s_lut_c_s)
- spa_col_msk = (spa_col_idx < s_lut_r * s_lut_r_s)
+ spa_col_msk = (spa_col_idx >= 0 and spa_col_idx < s_lut_r * s_lut_r_s)
  spa_col = tl.load(s_lut + spa_col_idx, mask=spa_col_msk)

  # Load x values
  blk_x_idx = ((pid_blk * x_b_s) +
  ((pid_row * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * x_r_s)[:, None] +
  ((pid_col * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * x_c_s)[None, :])
- blk_x_msk = (blk_x_idx < x_b * x_b_s)
+ blk_x_msk = (blk_x_idx >= 0 and blk_x_idx < x_b * x_b_s)
  blk_x = tl.load(x + blk_x_idx, mask=blk_x_msk)

  # Store sparsity layout value
@@ -187,7 +187,7 @@ def kernel_sparsity_layout_adaption(x,
  // sparsity_block_size_to) * o_r_s) +
  (((spa_col * sparsity_block_size_from + pid_col * TRITON_BLOCK_SIZE)
  // sparsity_block_size_to) * o_c_s))
- blk_o_msk = (blk_o_idx < o_b * o_b_s)
+ blk_o_msk = (blk_o_idx >= 0 and blk_o_idx < o_b * o_b_s)
  tl.store(o + blk_o_idx, 1, mask=blk_o_msk)

blksprs/ops/conversion.py:

@@ -1,5 +1,3 @@
- from typing import Any
-
  import torch
  import triton
  from torch import Tensor
@@ -133,7 +131,7 @@ class _BlocksparseToDense(torch.autograd.Function):

  # Get reverse sparsity index for current block
  rev_idx_spa_idx = (pid_blk * s_l_b_s + spa_row * s_l_r_s + spa_col * s_l_c_s)
- rev_idx_spa_msk = (rev_idx_spa_idx < s_l_b * s_l_b_s)
+ rev_idx_spa_msk = (rev_idx_spa_idx >= 0 and rev_idx_spa_idx < s_l_b * s_l_b_s)
  rev_idx_spa = tl.load(sparsity_reverse_lut + rev_idx_spa_idx, mask=rev_idx_spa_msk).to(tl.int32)

  # If block is present commence operations
@@ -143,13 +141,13 @@ class _BlocksparseToDense(torch.autograd.Function):
  tl.arange(0, TRITON_BLOCK_SIZE)) * x_r_s)[:, None] +
  (((pid_col % (sparsity_block_size // TRITON_BLOCK_SIZE)) * TRITON_BLOCK_SIZE +
  tl.arange(0, TRITON_BLOCK_SIZE)) * x_c_s)[None, :])
- blk_msk = (blk_idx < x_b * x_b_s)
+ blk_msk = (blk_idx >= 0 and blk_idx < x_b * x_b_s)
  blk = tl.load(x + blk_idx, mask=blk_msk)

  o_idx = (pid_blk * o_b_s +
  ((pid_row * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * o_r_s)[:, None] +
  ((pid_col * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * o_c_s)[None, :])
- o_msk = (o_idx < o_b * o_b_s)
+ o_msk = (o_idx >= 0 and o_idx < o_b * o_b_s)
  tl.store(o + o_idx, blk, o_msk)


@@ -260,15 +258,15 @@ class _BlocksparseToSparse(torch.autograd.Function):

  # Get sparsity index of current output block consisting of its batch, row, and column index
  spa_bat_idx = (pid_blk * s_lut_r_s + 0 * s_lut_c_s)
- spa_bat_msk = (spa_bat_idx < s_lut_r * s_lut_r_s)
+ spa_bat_msk = (spa_bat_idx >= 0 and spa_bat_idx < s_lut_r * s_lut_r_s)
  spa_bat = tl.load(s_lut + spa_bat_idx, mask=spa_bat_msk)

  spa_row_idx = (pid_blk * s_lut_r_s + 1 * s_lut_c_s)
- spa_row_msk = (spa_row_idx < s_lut_r * s_lut_r_s)
+ spa_row_msk = (spa_row_idx >= 0 and spa_row_idx < s_lut_r * s_lut_r_s)
  spa_row = tl.load(s_lut + spa_row_idx, mask=spa_row_msk)

  spa_col_idx = (pid_blk * s_lut_r_s + 2 * s_lut_c_s)
- spa_col_msk = (spa_col_idx < s_lut_r * s_lut_r_s)
+ spa_col_msk = (spa_col_idx >= 0 and spa_col_idx < s_lut_r * s_lut_r_s)
  spa_col = tl.load(s_lut + spa_col_idx, mask=spa_col_msk)

  # Load block from dense tensor
@@ -277,14 +275,14 @@ class _BlocksparseToSparse(torch.autograd.Function):
  tl.arange(0, TRITON_BLOCK_SIZE)) * x_r_s)[:, None] +
  ((spa_col * sparsity_block_size + pid_col * TRITON_BLOCK_SIZE +
  tl.arange(0, TRITON_BLOCK_SIZE)) * x_c_s)[None, :])
- blk_d_msk = (blk_d_idx < x_b * x_b_s)
+ blk_d_msk = (blk_d_idx >= 0 and blk_d_idx < x_b * x_b_s)
  blk_d = tl.load(x + blk_d_idx, mask=blk_d_msk)

  # Store block in sparse tensor
  blk_o_idx = ((pid_blk * o_b_s) +
  ((pid_row * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * o_r_s)[:, None] +
  ((pid_col * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE) * o_c_s))[None, :])
- blk_o_msk = (blk_o_idx < (pid_blk + 1) * o_b_s)
+ blk_o_msk = (blk_o_idx >= 0 and blk_o_idx < (pid_blk + 1) * o_b_s)
  tl.store(o + blk_o_idx, blk_d, mask=blk_o_msk)


@@ -424,15 +422,15 @@ class _BlocksparseAdaptLayout(torch.autograd.Function):

  # Get position of current sparsity block consisting of its batch, row, and column index
  spa_bat_o_idx = (pid_blk * s_lut_o_r_s + 0 * s_lut_o_c_s)
- spa_bat_o_msk = (spa_bat_o_idx < s_lut_o_r * s_lut_o_r_s)
+ spa_bat_o_msk = (spa_bat_o_idx >= 0 and spa_bat_o_idx < s_lut_o_r * s_lut_o_r_s)
  spa_bat_o = tl.load(s_lut_o + spa_bat_o_idx, mask=spa_bat_o_msk)

  spa_row_o_idx = (pid_blk * s_lut_o_r_s + 1 * s_lut_o_c_s)
- spa_row_o_msk = (spa_row_o_idx < s_lut_o_r * s_lut_o_r_s)
+ spa_row_o_msk = (spa_row_o_idx >= 0 and spa_row_o_idx < s_lut_o_r * s_lut_o_r_s)
  spa_row_o = tl.load(s_lut_o + spa_row_o_idx, mask=spa_row_o_msk)

  spa_col_o_idx = (pid_blk * s_lut_o_r_s + 2 * s_lut_o_c_s)
- spa_col_o_msk = (spa_col_o_idx < s_lut_o_r * s_lut_o_r_s)
+ spa_col_o_msk = (spa_col_o_idx >= 0 and spa_col_o_idx < s_lut_o_r * s_lut_o_r_s)
  spa_col_o = tl.load(s_lut_o + spa_col_o_idx, mask=spa_col_o_msk)

  # Get equivalent sparsity block in from layout
@@ -444,7 +442,7 @@ class _BlocksparseAdaptLayout(torch.autograd.Function):
  rev_idx_spa_x_idx = (spa_bat_x * s_l_x_b_s +
  spa_row_x * s_l_x_r_s +
  spa_col_x * s_l_x_c_s)
- rev_idx_spa_x_msk = (rev_idx_spa_x_idx < s_l_x_b * s_l_x_b_s)
+ rev_idx_spa_x_msk = (rev_idx_spa_x_idx >= 0 and rev_idx_spa_x_idx < s_l_x_b * s_l_x_b_s)
  rev_idx_spa_x = tl.load(r_lut_x + rev_idx_spa_x_idx, mask=rev_idx_spa_x_msk).to(tl.int32)

  # If block is present commence operations
@@ -459,12 +457,12 @@ class _BlocksparseAdaptLayout(torch.autograd.Function):
  blk_x_idx = ((rev_idx_spa_x * x_b_s) +
  ((shift_row_x * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * x_r_s)[:, None] +
  ((shift_col_x * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * x_c_s)[None, :])
- blk_x_msk = (blk_x_idx < x_b * x_b_s)
+ blk_x_msk = (blk_x_idx >= 0 and blk_x_idx < x_b * x_b_s)
  blk_x = tl.load(x + blk_x_idx, mask=blk_x_msk)

  # Store output
  blk_o_idx = ((pid_blk * o_b_s) +
  ((pid_row * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * o_r_s)[:, None] +
  ((pid_col * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * o_c_s)[None, :])
- blk_o_msk = (blk_o_idx < o_b * o_b_s)
+ blk_o_msk = (blk_o_idx >= 0 and blk_o_idx < o_b * o_b_s)
  tl.store(o + blk_o_idx, blk_x, mask=blk_o_msk)
blksprs/ops/distribution.py:

@@ -138,22 +138,22 @@ class _BlocksparseGather(torch.autograd.Function):

  # Get position of current sparsity block consisting of its batch, row, and column index
  spa_bat_o_idx = (pid_blk * s_lut_o_r_s + 0 * s_lut_o_c_s)
- spa_bat_o_msk = (spa_bat_o_idx < s_lut_o_r * s_lut_o_r_s)
+ spa_bat_o_msk = (spa_bat_o_idx >= 0 and spa_bat_o_idx < s_lut_o_r * s_lut_o_r_s)
  spa_bat_o = tl.load(s_lut_o + spa_bat_o_idx, mask=spa_bat_o_msk)

  spa_row_o_idx = (pid_blk * s_lut_o_r_s + 1 * s_lut_o_c_s)
- spa_row_o_msk = (spa_row_o_idx < s_lut_o_r * s_lut_o_r_s)
+ spa_row_o_msk = (spa_row_o_idx >= 0 and spa_row_o_idx < s_lut_o_r * s_lut_o_r_s)
  spa_row_o = tl.load(s_lut_o + spa_row_o_idx, mask=spa_row_o_msk)

  spa_col_o_idx = (pid_blk * s_lut_o_r_s + 2 * s_lut_o_c_s)
- spa_col_o_msk = (spa_col_o_idx < s_lut_o_r * s_lut_o_r_s)
+ spa_col_o_msk = (spa_col_o_idx >= 0 and spa_col_o_idx < s_lut_o_r * s_lut_o_r_s)
  spa_col_o = tl.load(s_lut_o + spa_col_o_idx, mask=spa_col_o_msk)

  # Load index values
  blk_i_idx = ((pid_blk * i_b_s) +
  ((pid_row * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * i_r_s)[:, None] +
  ((pid_col * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * i_c_s)[None, :])
- blk_i_msk = (blk_i_idx < i_b * i_b_s)
+ blk_i_msk = (blk_i_idx >= 0 and blk_i_idx < i_b * i_b_s)
  blk_i = tl.load(i + blk_i_idx, mask=blk_i_msk).to(tl.int32)

  # Get indices of sparsity blocks and positions within the blocks
@@ -180,21 +180,21 @@ class _BlocksparseGather(torch.autograd.Function):
  rev_idx_spa_x_idx = ((rev_dst_bat_x * s_l_x_b_s) +
  (rev_dst_row_x * s_l_x_r_s) +
  (rev_dst_col_x * s_l_x_c_s))
- rev_idx_spa_x_msk = (rev_idx_spa_x_idx < s_l_x_b * s_l_x_b_s)
+ rev_idx_spa_x_msk = (rev_idx_spa_x_idx >= 0 and rev_idx_spa_x_idx < s_l_x_b * s_l_x_b_s)
  rev_idx_spa_x = tl.load(r_lut_x + rev_idx_spa_x_idx, mask=rev_idx_spa_x_msk).to(tl.int32)

  # Load x values
  blk_x_idx = ((rev_idx_spa_x * x_b_s) +
  dst_row_x +
  dst_col_x)
- blk_x_msk = ((blk_x_idx < x_b * x_b_s) & rev_idx_spa_x_msk != -1)
+ blk_x_msk = ((blk_x_idx >= 0 and blk_x_idx < x_b * x_b_s) and rev_idx_spa_x_msk != -1)
  blk_x = tl.load(x + blk_x_idx, mask=blk_x_msk)

  # Store output
  blk_o_idx = ((pid_blk * o_b_s) +
  ((pid_row * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * o_r_s)[:, None] +
  ((pid_col * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * o_c_s)[None, :])
- blk_o_msk = ((blk_o_idx < o_b * o_b_s) & rev_idx_spa_x_msk != -1)
+ blk_o_msk = ((blk_o_idx >= 0 and blk_o_idx < o_b * o_b_s) and rev_idx_spa_x_msk != -1)
  tl.store(o + blk_o_idx, blk_x, mask=blk_o_msk)

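Gather adds one more condition on top of the bounds check: the reverse look-up table marks absent blocks with `-1`, and the mask has to veto those lanes before any memory is touched. A simplified, hypothetical 1-D analogue of that guard (not the blksprs kernel itself):

```python
import triton
import triton.language as tl


@triton.jit
def guarded_gather_kernel(lut_ptr, x_ptr, o_ptr, n_lut, n_x, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    lut_msk = (offs >= 0) & (offs < n_lut)
    # Reverse LUT: -1 marks entries with no backing block in the layout.
    idx = tl.load(lut_ptr + offs, mask=lut_msk, other=-1)
    # Bounds check and sentinel check combined before the indirect load.
    x_msk = lut_msk & (idx >= 0) & (idx < n_x)
    val = tl.load(x_ptr + idx, mask=x_msk)
    tl.store(o_ptr + offs, val, mask=x_msk)
```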
@@ -364,29 +364,29 @@ class _BlocksparseScatterReduce(torch.autograd.Function):

  # Get position of current sparsity block consisting of its batch, row, and column index
  spa_bat_x_idx = (pid_blk * s_lut_x_r_s + 0 * s_lut_x_c_s)
- spa_bat_x_msk = (spa_bat_x_idx < s_lut_x_r * s_lut_x_r_s)
+ spa_bat_x_msk = (spa_bat_x_idx >= 0 and spa_bat_x_idx < s_lut_x_r * s_lut_x_r_s)
  spa_bat_x = tl.load(s_lut_x + spa_bat_x_idx, mask=spa_bat_x_msk)

  spa_row_x_idx = (pid_blk * s_lut_x_r_s + 1 * s_lut_x_c_s)
- spa_row_x_msk = (spa_row_x_idx < s_lut_x_r * s_lut_x_r_s)
+ spa_row_x_msk = (spa_row_x_idx >= 0 and spa_row_x_idx < s_lut_x_r * s_lut_x_r_s)
  spa_row_x = tl.load(s_lut_x + spa_row_x_idx, mask=spa_row_x_msk)

  spa_col_x_idx = (pid_blk * s_lut_x_r_s + 2 * s_lut_x_c_s)
- spa_col_x_msk = (spa_col_x_idx < s_lut_x_r * s_lut_x_r_s)
+ spa_col_x_msk = (spa_col_x_idx >= 0 and spa_col_x_idx < s_lut_x_r * s_lut_x_r_s)
  spa_col_x = tl.load(s_lut_x + spa_col_x_idx, mask=spa_col_x_msk)

  # Load x values
  blk_x_idx = ((pid_blk * x_b_s) +
  ((pid_row * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * x_r_s)[:, None] +
  ((pid_col * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * x_c_s)[None, :])
- blk_x_msk = (blk_x_idx < x_b * x_b_s)
+ blk_x_msk = (blk_x_idx >= 0 and blk_x_idx < x_b * x_b_s)
  blk_x = tl.load(x + blk_x_idx, mask=blk_x_msk)

  # Load index values
  blk_i_idx = ((pid_blk * i_b_s) +
  ((pid_row * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * i_r_s)[:, None] +
  ((pid_col * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * i_c_s)[None, :])
- blk_i_msk = (blk_i_idx < i_b * i_b_s)
+ blk_i_msk = (blk_i_idx >= 0 and blk_i_idx < i_b * i_b_s)
  blk_i = tl.load(i + blk_i_idx, mask=blk_i_msk).to(tl.int32)

  # Get indices of sparsity blocks and positions within the blocks
@@ -413,14 +413,14 @@ class _BlocksparseScatterReduce(torch.autograd.Function):
  rev_idx_spa_o_idx = ((rev_dst_bat_o * s_l_o_b_s) +
  (rev_dst_row_o * s_l_o_r_s) +
  (rev_dst_col_o * s_l_o_c_s))
- rev_idx_spa_o_msk = (rev_idx_spa_o_idx < s_l_o_b * s_l_o_b_s)
+ rev_idx_spa_o_msk = (rev_idx_spa_o_idx >= 0 and rev_idx_spa_o_idx < s_l_o_b * s_l_o_b_s)
  rev_idx_spa_o = tl.load(r_lut_o + rev_idx_spa_o_idx, mask=rev_idx_spa_o_msk).to(tl.int32)

  # Store output
  blk_o_idx = ((rev_idx_spa_o * o_b_s) +
  dst_row_o +
  dst_col_o)
- blk_o_msk = ((blk_o_idx < o_b * o_b_s) & rev_idx_spa_o_msk != -1)
+ blk_o_msk = ((blk_o_idx >= 0 and blk_o_idx < o_b * o_b_s) and rev_idx_spa_o_msk != -1)

  if reduce_op_ind == 0:
  tl.store(o + blk_o_idx, blk_x, mask=blk_o_msk)
blksprs/ops/flow.py:

@@ -22,22 +22,22 @@ def kernel_blocksparse_flow_pull(x,

  # Get sparsity index of current output block consisting of its batch, row, and column index
  spa_bat_idx = (pid_blk * s_lut_r_s + 0 * s_lut_c_s)
- spa_bat_msk = (spa_bat_idx < s_lut_r * s_lut_r_s)
+ spa_bat_msk = (spa_bat_idx >= 0 and spa_bat_idx < s_lut_r * s_lut_r_s)
  spa_bat = tl.load(s_lut + spa_bat_idx, mask=spa_bat_msk)

  spa_row_idx = (pid_blk * s_lut_r_s + 1 * s_lut_c_s)
- spa_row_msk = (spa_row_idx < s_lut_r * s_lut_r_s)
+ spa_row_msk = (spa_row_idx >= 0 and spa_row_idx < s_lut_r * s_lut_r_s)
  spa_row = tl.load(s_lut + spa_row_idx, mask=spa_row_msk)

  spa_col_idx = (pid_blk * s_lut_r_s + 2 * s_lut_c_s)
- spa_col_msk = (spa_col_idx < s_lut_r * s_lut_r_s)
+ spa_col_msk = (spa_col_idx >= 0 and spa_col_idx < s_lut_r * s_lut_r_s)
  spa_col = tl.load(s_lut + spa_col_idx, mask=spa_col_msk)

  # Get reverse sparsity index
  rev_idx_spa_idx = (spa_bat * s_l_o_b_s +
  spa_row * s_l_o_r_s +
  spa_col * s_l_o_c_s)
- rev_idx_spa_msk = (rev_idx_spa_idx < s_l_o_b * s_l_o_b_s)
+ rev_idx_spa_msk = (rev_idx_spa_idx >= 0 and rev_idx_spa_idx < s_l_o_b * s_l_o_b_s)
  rev_idx_spa = tl.load(r_lut + rev_idx_spa_idx, mask=rev_idx_spa_msk).to(tl.int32)

  if rev_idx_spa == -1:
@@ -47,13 +47,13 @@ def kernel_blocksparse_flow_pull(x,
  blk_x_idx = (rev_idx_spa * x_b_s +
  ((pid_row * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * x_r_s)[:, None] +
  ((pid_col * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * x_c_s)[None, :])
- blk_x_msk = (blk_x_idx < x_b * x_b_s)
+ blk_x_msk = (blk_x_idx >= 0 and blk_x_idx < x_b * x_b_s)
  blk_x = tl.load(x + blk_x_idx, mask=blk_x_msk)

  blk_o_idx = (pid_blk * o_b_s +
  ((pid_row * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * o_r_s)[:, None] +
  ((pid_col * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * o_c_s)[None, :])
- blk_o_msk = (blk_o_idx < o_b * o_b_s)
+ blk_o_msk = (blk_o_idx >= 0 and blk_o_idx < o_b * o_b_s)
  tl.store(o + blk_o_idx, blk_x, mask=blk_o_msk)


@@ -73,22 +73,22 @@ def kernel_blocksparse_flow_push(x,

  # Get sparsity index of current input block consisting of its batch, row, and column index
  spa_bat_idx = (pid_blk * s_lut_r_s + 0 * s_lut_c_s)
- spa_bat_msk = (spa_bat_idx < s_lut_r * s_lut_r_s)
+ spa_bat_msk = (spa_bat_idx >= 0 and spa_bat_idx < s_lut_r * s_lut_r_s)
  spa_bat = tl.load(s_lut + spa_bat_idx, mask=spa_bat_msk)

  spa_row_idx = (pid_blk * s_lut_r_s + 1 * s_lut_c_s)
- spa_row_msk = (spa_row_idx < s_lut_r * s_lut_r_s)
+ spa_row_msk = (spa_row_idx >= 0 and spa_row_idx < s_lut_r * s_lut_r_s)
  spa_row = tl.load(s_lut + spa_row_idx, mask=spa_row_msk)

  spa_col_idx = (pid_blk * s_lut_r_s + 2 * s_lut_c_s)
- spa_col_msk = (spa_col_idx < s_lut_r * s_lut_r_s)
+ spa_col_msk = (spa_col_idx >= 0 and spa_col_idx < s_lut_r * s_lut_r_s)
  spa_col = tl.load(s_lut + spa_col_idx, mask=spa_col_msk)

  # Get reverse sparsity index
  rev_idx_spa_idx = (spa_bat * s_l_x_b_s +
  spa_row * s_l_x_r_s +
  spa_col * s_l_x_c_s)
- rev_idx_spa_msk = (rev_idx_spa_idx < s_l_x_b * s_l_x_b_s)
+ rev_idx_spa_msk = (rev_idx_spa_idx >= 0 and rev_idx_spa_idx < s_l_x_b * s_l_x_b_s)
  rev_idx_spa = tl.load(r_lut + rev_idx_spa_idx, mask=rev_idx_spa_msk).to(tl.int32)

  if rev_idx_spa == -1:
@@ -98,13 +98,13 @@ def kernel_blocksparse_flow_push(x,
  blk_x_idx = (pid_blk * x_b_s +
  ((pid_row * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * x_r_s)[:, None] +
  ((pid_col * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * x_c_s)[None, :])
- blk_x_msk = (blk_x_idx < x_b * x_b_s)
+ blk_x_msk = (blk_x_idx >= 0 and blk_x_idx < x_b * x_b_s)
  blk_x = tl.load(x + blk_x_idx, mask=blk_x_msk)

  blk_o_idx = (rev_idx_spa * o_b_s +
  ((pid_row * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * o_r_s)[:, None] +
  ((pid_col * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * o_c_s)[None, :])
- blk_o_msk = (blk_o_idx < o_b * o_b_s)
+ blk_o_msk = (blk_o_idx >= 0 and blk_o_idx < o_b * o_b_s)
  tl.atomic_add(o + blk_o_idx, blk_x, mask=blk_o_msk)

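A detail worth calling out in the two kernels above: the pull variant writes its result with a plain `tl.store`, while the push variant accumulates with `tl.atomic_add`, since several source blocks may be pushed into the same destination block. A stripped-down, hypothetical 1-D illustration of the push side (not the blksprs kernel):

```python
import triton
import triton.language as tl


@triton.jit
def push_kernel(x_ptr, o_ptr, r_lut_ptr, n_x, n_o, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    blk_x_msk = (offs >= 0) & (offs < n_x)
    blk_x = tl.load(x_ptr + offs, mask=blk_x_msk)
    # Destination block index; -1 means the block is absent from the target layout.
    rev_idx = tl.load(r_lut_ptr + pid)
    if rev_idx == -1:
        return
    o_offs = rev_idx * BLOCK + tl.arange(0, BLOCK)
    blk_o_msk = (o_offs >= 0) & (o_offs < n_o)
    # Accumulate rather than overwrite: multiple sources may target this block.
    tl.atomic_add(o_ptr + o_offs, blk_x, mask=blk_o_msk)
```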
blksprs/ops/matmul.py:

@@ -164,15 +164,15 @@ class _BlocksparseMatmulSSS(torch.autograd.Function):

  # Get position of current sparsity block consisting of its batch, row, and column index
  spa_bat_o_idx = (pid_blk * s_lut_o_r_s + 0 * s_lut_o_c_s)
- spa_bat_o_msk = (spa_bat_o_idx < s_lut_o_r * s_lut_o_r_s)
+ spa_bat_o_msk = (spa_bat_o_idx >= 0 and spa_bat_o_idx < s_lut_o_r * s_lut_o_r_s)
  spa_bat_o = tl.load(s_lut_o + spa_bat_o_idx, mask=spa_bat_o_msk)

  spa_row_o_idx = (pid_blk * s_lut_o_r_s + 1 * s_lut_o_c_s)
- spa_row_o_msk = (spa_row_o_idx < s_lut_o_r * s_lut_o_r_s)
+ spa_row_o_msk = (spa_row_o_idx >= 0 and spa_row_o_idx < s_lut_o_r * s_lut_o_r_s)
  spa_row_o = tl.load(s_lut_o + spa_row_o_idx, mask=spa_row_o_msk)

  spa_col_o_idx = (pid_blk * s_lut_o_r_s + 2 * s_lut_o_c_s)
- spa_col_o_msk = (spa_col_o_idx < s_lut_o_r * s_lut_o_r_s)
+ spa_col_o_msk = (spa_col_o_idx >= 0 and spa_col_o_idx < s_lut_o_r * s_lut_o_r_s)
  spa_col_o = tl.load(s_lut_o + spa_col_o_idx, mask=spa_col_o_msk)

  # Setup buffer
@@ -192,12 +192,12 @@ class _BlocksparseMatmulSSS(torch.autograd.Function):
  rev_idx_spa_x_idx = (spa_bat_o * s_l_x_b_s +
  spa_row_o * s_l_x_r_s +
  i_seg_spa * s_l_x_c_s)
- rev_idx_spa_x_msk = (rev_idx_spa_x_idx < s_l_x_b * s_l_x_b_s)
+ rev_idx_spa_x_msk = (rev_idx_spa_x_idx >= 0 and rev_idx_spa_x_idx < s_l_x_b * s_l_x_b_s)
  rev_idx_spa_x = tl.load(r_lut_x + rev_idx_spa_x_idx, mask=rev_idx_spa_x_msk).to(tl.int32)

  # Get reverse sparsity indices for y
  rev_idx_spa_y_idx = (spa_bat_o * s_l_y_b_s + i_seg_spa * s_l_y_r_s + spa_col_o * s_l_y_c_s)
- rev_idx_spa_y_msk = (rev_idx_spa_y_idx < s_l_y_b * s_l_y_b_s)
+ rev_idx_spa_y_msk = (rev_idx_spa_y_idx >= 0 and rev_idx_spa_y_idx < s_l_y_b * s_l_y_b_s)
  rev_idx_spa_y = tl.load(r_lut_y + rev_idx_spa_y_idx, mask=rev_idx_spa_y_msk).to(tl.int32)

  # If both blocks are present commence calculation
@@ -206,14 +206,14 @@ class _BlocksparseMatmulSSS(torch.autograd.Function):
  ((pid_row * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * x_r_s)[:, None] +
  ((i_seg_tri_mod * TRITON_BLOCK_SIZE +
  tl.arange(0, TRITON_BLOCK_SIZE)) * x_c_s)[None, :])
- blk_x_msk = (blk_x_idx < x_b * x_b_s)
+ blk_x_msk = (blk_x_idx >= 0 and blk_x_idx < x_b * x_b_s)
  blk_x = tl.load(x + blk_x_idx, mask=blk_x_msk)

  blk_y_idx = ((rev_idx_spa_y * y_b_s) +
  ((i_seg_tri_mod * TRITON_BLOCK_SIZE +
  tl.arange(0, TRITON_BLOCK_SIZE)) * y_r_s)[:, None] +
  ((pid_col * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * y_c_s)[None, :])
- blk_y_msk = (blk_y_idx < y_b * y_b_s)
+ blk_y_msk = (blk_y_idx >= 0 and blk_y_idx < y_b * y_b_s)
  blk_y = tl.load(y + blk_y_idx, mask=blk_y_msk)

  # Perform matrix multiplication
@@ -223,5 +223,5 @@ class _BlocksparseMatmulSSS(torch.autograd.Function):
  blk_o_idx = ((pid_blk * o_b_s) +
  ((pid_row * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * o_r_s)[:, None] +
  ((pid_col * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * o_c_s)[None, :])
- blk_o_msk = (blk_o_idx < o_b * o_b_s)
+ blk_o_msk = (blk_o_idx >= 0 and blk_o_idx < o_b * o_b_s)
  tl.store(o + blk_o_idx, buf, mask=blk_o_msk)
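For context on the `buf` the final hunk stores: `_BlocksparseMatmulSSS` walks the shared dimension, multiplies each pair of present blocks with `tl.dot`, and accumulates the partial products into a buffer that is written out once at the end. A dense, hypothetical skeleton of that accumulate-then-store pattern (standard Triton matmul shape, not the block-sparse kernel):

```python
import triton
import triton.language as tl


@triton.jit
def matmul_kernel(x_ptr, y_ptr, o_ptr, M, N, K, BLOCK: tl.constexpr):
    pid_m = tl.program_id(axis=0)
    pid_n = tl.program_id(axis=1)
    offs_m = pid_m * BLOCK + tl.arange(0, BLOCK)
    offs_n = pid_n * BLOCK + tl.arange(0, BLOCK)
    # Buffer for the partial products, analogous to `buf` above.
    buf = tl.zeros((BLOCK, BLOCK), dtype=tl.float32)
    for k in range(0, K, BLOCK):
        offs_k = k + tl.arange(0, BLOCK)
        x_msk = (offs_m[:, None] < M) & (offs_k[None, :] < K)
        y_msk = (offs_k[:, None] < K) & (offs_n[None, :] < N)
        blk_x = tl.load(x_ptr + offs_m[:, None] * K + offs_k[None, :], mask=x_msk, other=0.0)
        blk_y = tl.load(y_ptr + offs_k[:, None] * N + offs_n[None, :], mask=y_msk, other=0.0)
        buf += tl.dot(blk_x, blk_y)
    o_msk = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(o_ptr + offs_m[:, None] * N + offs_n[None, :], buf, mask=o_msk)
```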
blksprs/ops/misc/broadcast_ops.py:

@@ -99,29 +99,29 @@ def kernel_broadcast_addition(x,

  # Get position of current sparsity block consisting of its batch, row, and column index
  spa_bat_o_idx = (pid_blk * s_lut_o_r_s + 0 * s_lut_o_c_s)
- spa_bat_o_msk = (spa_bat_o_idx < s_lut_o_r * s_lut_o_r_s)
+ spa_bat_o_msk = (spa_bat_o_idx >= 0 and spa_bat_o_idx < s_lut_o_r * s_lut_o_r_s)
  spa_bat_o = tl.load(s_lut_o + spa_bat_o_idx, mask=spa_bat_o_msk)

  spa_row_o_idx = (pid_blk * s_lut_o_r_s + 1 * s_lut_o_c_s)
- spa_row_o_msk = (spa_row_o_idx < s_lut_o_r * s_lut_o_r_s)
+ spa_row_o_msk = (spa_row_o_idx >= 0 and spa_row_o_idx < s_lut_o_r * s_lut_o_r_s)
  spa_row_o = tl.load(s_lut_o + spa_row_o_idx, mask=spa_row_o_msk)

  spa_col_o_idx = (pid_blk * s_lut_o_r_s + 2 * s_lut_o_c_s)
- spa_col_o_msk = (spa_col_o_idx < s_lut_o_r * s_lut_o_r_s)
+ spa_col_o_msk = (spa_col_o_idx >= 0 and spa_col_o_idx < s_lut_o_r * s_lut_o_r_s)
  spa_col_o = tl.load(s_lut_o + spa_col_o_idx, mask=spa_col_o_msk)

  # Load x block
  blk_x_idx = (spa_bat_o * x_b_s +
  ((spa_row_o * sparsity_block_size + pid_row * TRITON_BLOCK_SIZE +
  tl.arange(0, TRITON_BLOCK_SIZE)) * x_c_s)[None, :])
- blk_x_msk = (blk_x_idx < x_b * x_b_s)
+ blk_x_msk = (blk_x_idx >= 0 and blk_x_idx < x_b * x_b_s)
  blk_x = tl.load(x + blk_x_idx, mask=blk_x_msk)

  # Load y block
  blk_y_idx = (spa_bat_o * y_b_s +
  ((spa_col_o * sparsity_block_size + pid_col * TRITON_BLOCK_SIZE +
  tl.arange(0, TRITON_BLOCK_SIZE)) * y_c_s)[None, :])
- blk_y_msk = (blk_y_idx < y_b * y_b_s)
+ blk_y_msk = (blk_y_idx >= 0 and blk_y_idx < y_b * y_b_s)
  blk_y = tl.load(y + blk_y_idx, mask=blk_y_msk)

  # Compute sum
@@ -132,5 +132,5 @@ def kernel_broadcast_addition(x,
  blk_o_idx = ((pid_blk * o_b_s) +
  ((pid_row * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * o_r_s)[:, None] +
  ((pid_col * TRITON_BLOCK_SIZE + tl.arange(0, TRITON_BLOCK_SIZE)) * o_c_s)[None, :])
- blk_o_msk = (blk_o_idx < o_b * o_b_s)
+ blk_o_msk = (blk_o_idx >= 0 and blk_o_idx < o_b * o_b_s)
  tl.store(o + blk_o_idx, buf, mask=blk_o_msk)
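The broadcast addition above assembles each output block as an outer sum: a slice of `x` indexed by the block's rows and a slice of `y` indexed by its columns, expanded against each other via `[:, None]` / `[None, :]`. A compact, hypothetical dense analogue of that expansion:

```python
import triton
import triton.language as tl


@triton.jit
def outer_add_kernel(x_ptr, y_ptr, o_ptr, n, BLOCK: tl.constexpr):
    pid_r = tl.program_id(axis=0)
    pid_c = tl.program_id(axis=1)
    rows = pid_r * BLOCK + tl.arange(0, BLOCK)
    cols = pid_c * BLOCK + tl.arange(0, BLOCK)
    r_msk = (rows >= 0) & (rows < n)
    c_msk = (cols >= 0) & (cols < n)
    blk_x = tl.load(x_ptr + rows, mask=r_msk)  # (BLOCK,)
    blk_y = tl.load(y_ptr + cols, mask=c_msk)  # (BLOCK,)
    # Outer sum: o[i, j] = x[i] + y[j].
    buf = blk_x[:, None] + blk_y[None, :]
    o_idx = rows[:, None] * n + cols[None, :]
    tl.store(o_ptr + o_idx, buf, mask=r_msk[:, None] & c_msk[None, :])
```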