PyPI - flashinfer-python - Versions diffs - 0.2.1.post1__tar.gz → 0.2.1.post2__tar.gz - Mend

flashinfer-python 0.2.1.post1.tar.gz → 0.2.1.post2.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (1189)

{flashinfer_python-0.2.1.post1/flashinfer_python.egg-info → flashinfer_python-0.2.1.post2}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: flashinfer-python
-Version: 0.2.1.post1
+Version: 0.2.1.post2
 Summary: FlashInfer: Kernel Library for LLM Serving
 Author: FlashInfer team
 License: Apache License 2.0
@@ -38,12 +38,12 @@ Check our [v0.2 release blog](https://flashinfer.ai/2024/12/16/flashinfer-v02-re
 The core features of FlashInfer include:
 1. **Efficient Sparse/Dense Attention Kernels**: Efficient single/batch attention for sparse(paged)/dense KV-storage on CUDA Cores and Tensor Cores (both FA2 & FA3) templates. The vector-sparse attention can achieve 90% of the bandwidth of dense kernels with same problem size.
 2. **Load-Balanced Scheduling**: FlashInfer decouples `plan`/`run` stage of attention computation where we schedule the computation of variable-length inputs in `plan` stage to alleviate load-imbalance issue.
-3. **Memory Efficiency**: FlashInfer offers [Cascade Attention](https://docs.flashinfer.ai/api/cascade.html#flashinfer.cascade.MultiLevelCascadeAttentionWrapper) for hierical KV-Cache, and implements Head-Query fusion for accelerating Grouped-Query Attention, and efficient kernels for low-precision attention and fused-RoPE attention for compressed KV-Cache.
+3. **Memory Efficiency**: FlashInfer offers [Cascade Attention](https://docs.flashinfer.ai/api/cascade.html#flashinfer.cascade.MultiLevelCascadeAttentionWrapper) for hierarchical KV-Cache, and implements Head-Query fusion for accelerating Grouped-Query Attention, and efficient kernels for low-precision attention and fused-RoPE attention for compressed KV-Cache.
 4. **Customizable Attention**: Bring your own attention variants through JIT-compilation.
 5. **CUDAGraph and torch.compile Compatibility**: FlashInfer kernels can be captured by CUDAGraphs and torch.compile for low-latency inference.
 6. **Efficient LLM-specific Operators**: High-Performance [fused kernel for Top-P, Top-K/Min-P sampling](https://docs.flashinfer.ai/api/sampling.html) without the need to sorting.
-FlashInfer support PyTorch, TVM and C++ (header-only) APIs, and can be easily integrated into existing projects.
+FlashInfer supports PyTorch, TVM and C++ (header-only) APIs, and can be easily integrated into existing projects.
 ## News
 - [Dec 16, 2024] [Blog Post](https://flashinfer.ai/2024/12/16/flashinfer-v02-release.html) FlashInfer 0.2 - Efficient and Customizable Kernels for LLM Inference Serving
@@ -164,6 +164,7 @@ We are thrilled to share that FlashInfer is being adopted by many cutting-edge p
 - [vLLM](https://github.com/vllm-project/vllm)
 - [TGI](https://github.com/huggingface/text-generation-inference)
 - [lorax](https://github.com/predibase/lorax)
+- [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
 ## Acknowledgement

{flashinfer_python-0.2.1.post1 → flashinfer_python-0.2.1.post2}/README.md RENAMED

@@ -23,12 +23,12 @@ Check our [v0.2 release blog](https://flashinfer.ai/2024/12/16/flashinfer-v02-re
 The core features of FlashInfer include:
 1. **Efficient Sparse/Dense Attention Kernels**: Efficient single/batch attention for sparse(paged)/dense KV-storage on CUDA Cores and Tensor Cores (both FA2 & FA3) templates. The vector-sparse attention can achieve 90% of the bandwidth of dense kernels with same problem size.
 2. **Load-Balanced Scheduling**: FlashInfer decouples `plan`/`run` stage of attention computation where we schedule the computation of variable-length inputs in `plan` stage to alleviate load-imbalance issue.
-3. **Memory Efficiency**: FlashInfer offers [Cascade Attention](https://docs.flashinfer.ai/api/cascade.html#flashinfer.cascade.MultiLevelCascadeAttentionWrapper) for hierical KV-Cache, and implements Head-Query fusion for accelerating Grouped-Query Attention, and efficient kernels for low-precision attention and fused-RoPE attention for compressed KV-Cache.
+3. **Memory Efficiency**: FlashInfer offers [Cascade Attention](https://docs.flashinfer.ai/api/cascade.html#flashinfer.cascade.MultiLevelCascadeAttentionWrapper) for hierarchical KV-Cache, and implements Head-Query fusion for accelerating Grouped-Query Attention, and efficient kernels for low-precision attention and fused-RoPE attention for compressed KV-Cache.
 4. **Customizable Attention**: Bring your own attention variants through JIT-compilation.
 5. **CUDAGraph and torch.compile Compatibility**: FlashInfer kernels can be captured by CUDAGraphs and torch.compile for low-latency inference.
 6. **Efficient LLM-specific Operators**: High-Performance [fused kernel for Top-P, Top-K/Min-P sampling](https://docs.flashinfer.ai/api/sampling.html) without the need to sorting.
-FlashInfer support PyTorch, TVM and C++ (header-only) APIs, and can be easily integrated into existing projects.
+FlashInfer supports PyTorch, TVM and C++ (header-only) APIs, and can be easily integrated into existing projects.
 ## News
 - [Dec 16, 2024] [Blog Post](https://flashinfer.ai/2024/12/16/flashinfer-v02-release.html) FlashInfer 0.2 - Efficient and Customizable Kernels for LLM Inference Serving
@@ -149,6 +149,7 @@ We are thrilled to share that FlashInfer is being adopted by many cutting-edge p
 - [vLLM](https://github.com/vllm-project/vllm)
 - [TGI](https://github.com/huggingface/text-generation-inference)
 - [lorax](https://github.com/predibase/lorax)
+- [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
 ## Acknowledgement

{flashinfer_python-0.2.1.post1 → flashinfer_python-0.2.1.post2}/csrc/batch_decode_mla_config.jinja RENAMED

@@ -14,6 +14,8 @@ constexpr bool USE_LOGITS_SOFT_CAP = {{ use_logits_soft_cap }};
 constexpr int HEAD_DIM_CKV = {{ head_dim_ckv }};
 constexpr int HEAD_DIM_KPE = {{ head_dim_kpe }};
+constexpr int QO_TILE_LEN = {{ qo_tile_len }};
 using Params = BatchDecodeParamsMLA<DTypeQ, DTypeKV, DTypeO, IdType>;
 using AttentionVariant =
     DefaultAttention</*use_custom_mask=*/false, USE_SLIDING_WINDOW, USE_LOGITS_SOFT_CAP, /*use_alibi*/false>;

flashinfer_python-0.2.1.post2/csrc/batch_decode_mla_cute_sm80.cu ADDED

@@ -0,0 +1,107 @@
+#include <optional>
+#include "pytorch_extension_utils.h"
+#include "mla_config.inc"
+#include <flashinfer/attention/decode_mla_cute_sm80.cuh>
+#include <flashinfer/attention/scheduler.cuh>
+using namespace flashinfer;
+std::vector<int64_t> BatchDecodeWithPagedKVCachePlanMLA(
+    at::Tensor float_workspace_buffer, at::Tensor int_workspace_buffer,
+    at::Tensor page_locked_int_workspace_buffer, at::Tensor indptr, unsigned int batch_size,
+    unsigned int num_qo_heads, unsigned int page_size, bool enable_cuda_graph,
+    int64_t cuda_stream) {
+  size_t float_workspace_size_in_bytes =
+      float_workspace_buffer.size(0) * float_workspace_buffer.element_size();
+  size_t int_workspace_size_in_bytes =
+      int_workspace_buffer.size(0) * int_workspace_buffer.element_size();
+  DecodePlanInfo plan_info;
+  cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream);
+  auto work_estimation_func =
+      BatchDecodeWithPagedKVCacheWorkEstimationDispatchedMlaCuteSM80<HEAD_DIM_CKV, HEAD_DIM_KPE, QO_TILE_LEN,
+                                                             AttentionVariant, Params>;
+  cudaError_t status =
+      DecodePlan<HEAD_DIM_CKV, flashinfer::PosEncodingMode::kNone, AttentionVariant, Params>(
+          static_cast<void*>(float_workspace_buffer.data_ptr()), float_workspace_size_in_bytes,
+          static_cast<void*>(int_workspace_buffer.data_ptr()),
+          static_cast<void*>(page_locked_int_workspace_buffer.data_ptr()),
+          int_workspace_size_in_bytes, plan_info, static_cast<IdType*>(indptr.data_ptr()),
+          batch_size, num_qo_heads, page_size, enable_cuda_graph, /*stream=*/stream,
+          work_estimation_func);
+  TORCH_CHECK(status == cudaSuccess, "BatchDecodeWithPagedKVCachePlanMLA failed with error ",
+              cudaGetErrorString(status));
+  return plan_info.ToVector();
+}
+void BatchDecodeWithPagedKVCacheRunMLA(
+    at::Tensor float_workspace_buffer, at::Tensor int_workspace_buffer,
+    std::vector<int64_t> plan_info_vec, at::Tensor q_nope, at::Tensor q_pe,
+    at::Tensor paged_ckv_cache, at::Tensor paged_kpe_cache, at::Tensor paged_kv_indptr,
+    at::Tensor paged_kv_indices, at::Tensor paged_kv_last_page_len, at::Tensor o, float sm_scale,
+    int window_left, float logits_soft_cap, float rope_scale, float rope_theta,
+    std::optional<at::Tensor> maybe_lse, int64_t cuda_stream) {
+  DecodePlanInfo plan_info;
+  plan_info.FromVector(plan_info_vec);
+  auto device = q_nope.device();
+  int64_t batch_size = q_nope.size(0);
+  int64_t num_qo_heads = q_nope.size(1);
+  int64_t page_size = paged_ckv_cache.size(1);
+  if (maybe_lse) {
+    const auto& lse = *maybe_lse;
+    TORCH_CHECK(lse.size(0) == batch_size, lse.size(0), q_nope.size(0));
+    TORCH_CHECK(lse.size(1) == num_qo_heads, lse.size(1), q_nope.size(1));
+  }
+  TORCH_CHECK(logits_soft_cap >= 0.f, "logits_soft_cap must be non-negative");
+  void* float_buffer = static_cast<void*>(float_workspace_buffer.data_ptr());
+  void* int_buffer = static_cast<void*>(int_workspace_buffer.data_ptr());
+  paged_kv_mla_t<DTypeKV, IdType> paged_kv(
+      page_size, HEAD_DIM_CKV, HEAD_DIM_KPE, batch_size,
+      static_cast<DTypeKV*>(paged_ckv_cache.data_ptr()), paged_ckv_cache.strides().data(),
+      static_cast<DTypeKV*>(paged_kpe_cache.data_ptr()), paged_kpe_cache.strides().data(),
+      static_cast<IdType*>(paged_kv_indices.data_ptr()),
+      static_cast<IdType*>(paged_kv_indptr.data_ptr()),
+      static_cast<IdType*>(paged_kv_last_page_len.data_ptr()));
+  Params params(static_cast<DTypeQ*>(q_nope.data_ptr()), static_cast<DTypeQ*>(q_pe.data_ptr()),
+                /*q_offset=*/nullptr, paged_kv, static_cast<DTypeO*>(o.data_ptr()),
+                /*lse=*/(maybe_lse ? static_cast<float*>(maybe_lse->data_ptr()) : nullptr),
+                num_qo_heads, window_left, logits_soft_cap, sm_scale, rope_scale, rope_theta);
+  DTypeO* tmp_v = nullptr;
+  float* tmp_s = nullptr;
+  params.request_indices =
+      GetPtrFromBaseOffset<IdType>(int_buffer, plan_info.request_indices_offset);
+  params.kv_tile_indices =
+      GetPtrFromBaseOffset<IdType>(int_buffer, plan_info.kv_tile_indices_offset);
+  params.o_indptr = GetPtrFromBaseOffset<IdType>(int_buffer, plan_info.o_indptr_offset);
+  params.kv_chunk_size_ptr =
+      GetPtrFromBaseOffset<IdType>(int_buffer, plan_info.kv_chunk_size_ptr_offset);
+  if (plan_info.split_kv) {
+    tmp_v = GetPtrFromBaseOffset<DTypeO>(float_buffer, plan_info.v_offset);
+    tmp_s = GetPtrFromBaseOffset<float>(float_buffer, plan_info.s_offset);
+    if (plan_info.enable_cuda_graph) {
+      params.block_valid_mask =
+          GetPtrFromBaseOffset<bool>(int_buffer, plan_info.block_valid_mask_offset);
+    }
+  }
+  params.padded_batch_size = plan_info.padded_batch_size;
+  cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream);
+  cudaError_t status =
+      BatchDecodeWithPagedKVCacheDispatchedMlaCuteSM80<HEAD_DIM_CKV, HEAD_DIM_KPE, QO_TILE_LEN,
+                                               Params>(params, tmp_v, tmp_s, /*stream=*/stream);
+  TORCH_CHECK(status == cudaSuccess, "BatchDecodeWithPagedKVCache failed with error ",
+              cudaGetErrorString(status));
+}

{flashinfer_python-0.2.1.post1 → flashinfer_python-0.2.1.post2}/csrc/batch_mla_run.cu RENAMED

@@ -68,10 +68,11 @@ void BatchMLAPagedAttentionRun(at::Tensor float_workspace_buffer, at::Tensor int
         params.q_pe = static_cast<DTypeQ*>(q_pe.data_ptr());
         params.ckv = static_cast<DTypeKV*>(ckv_cache.data_ptr());
         params.kpe = static_cast<DTypeKV*>(kpe_cache.data_ptr());
-        params.kv_indices = static_cast<IdType*>(kv_indices.data_ptr());
         params.q_indptr = GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.q_indptr_offset);
         params.kv_indptr = GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.kv_indptr_offset);
+        params.partial_indptr =
+            GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.partial_indptr_offset);
         params.kv_indices = static_cast<IdType*>(kv_indices.data_ptr());
         params.q_len = GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.q_len_offset);
         params.kv_len = GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.kv_len_offset);
@@ -80,6 +81,12 @@ void BatchMLAPagedAttentionRun(at::Tensor float_workspace_buffer, at::Tensor int
         params.kv_end = GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.kv_end_offset);
         params.work_indptr =
             GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.work_indptr_offset);
+        params.merge_packed_offset_start = GetPtrFromBaseOffset<IdType>(
+            int_buffer_ptr, plan_info.merge_packed_offset_start_offset);
+        params.merge_packed_offset_end =
+            GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.merge_packed_offset_end_offset);
+        params.merge_indptr =
+            GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.merge_indptr_offset);
         params.final_o = static_cast<DTypeO*>(o.data_ptr());
         params.final_lse =
             maybe_lse.has_value() ? static_cast<float*>(maybe_lse->data_ptr()) : nullptr;

flashinfer_python-0.2.1.post2/flashinfer/_build_meta.py ADDED

@@ -0,0 +1 @@
+__version__ = '0.2.1.post2'

{flashinfer_python-0.2.1.post1 → flashinfer_python-0.2.1.post2}/flashinfer/decode.py RENAMED

@@ -45,6 +45,7 @@ from .utils import (
     _check_cached_qkv_data_type,
     _check_kv_layout,
     _check_pos_encoding_mode,
+    _check_shape_dtype_device,
     _get_cache_alibi_slopes_buf,
     _get_cache_buf,
     _get_range_buf,
@@ -972,6 +973,8 @@ class BatchDecodeWithPagedKVCacheWrapper:
         q_scale: Optional[float] = None,
         k_scale: Optional[float] = None,
         v_scale: Optional[float] = None,
+        out: Optional[torch.Tensor] = None,
+        lse: Optional[torch.Tensor] = None,
         return_lse: Literal[False] = False,
     ) -> torch.Tensor: ...
@@ -984,6 +987,8 @@ class BatchDecodeWithPagedKVCacheWrapper:
         q_scale: Optional[float] = None,
         k_scale: Optional[float] = None,
         v_scale: Optional[float] = None,
+        out: Optional[torch.Tensor] = None,
+        lse: Optional[torch.Tensor] = None,
         return_lse: Literal[True] = True,
     ) -> Tuple[torch.Tensor, torch.Tensor]: ...
@@ -995,6 +1000,8 @@ class BatchDecodeWithPagedKVCacheWrapper:
         q_scale: Optional[float] = None,
         k_scale: Optional[float] = None,
         v_scale: Optional[float] = None,
+        out: Optional[torch.Tensor] = None,
+        lse: Optional[torch.Tensor] = None,
         return_lse: bool = False,
     ) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
         r"""Compute batch decode attention between query and paged kv cache.
@@ -1016,13 +1023,18 @@ class BatchDecodeWithPagedKVCacheWrapper:
               ``[max_num_pages, 2, num_kv_heads, page_size, head_dim]`` if
               :attr:`kv_layout` is ``HND``. Where ``paged_kv_cache[:, 0]`` is the key-cache and
               ``paged_kv_cache[:, 1]`` is the value-cache.
+        *args
+            Additional arguments for the custom kernel.
         q_scale : Optional[float]
             The calibration scale of query for fp8 input, if not provided, will be set to ``1.0``.
         k_scale : Optional[float]
             The calibration scale of key for fp8 input, if not provided, will be set to ``1.0``.
         v_scale : Optional[float]
             The calibration scale of value for fp8 input, if not provided, will be set to ``1.0``.
+        out : Optional[torch.Tensor]
+            The output tensor, if not provided, will be allocated internally.
+        lse : Optional[torch.Tensor]
+            The log-sum-exp of attention logits, if not provided, will be allocated internally.
         return_lse : bool
             Whether to return the logsumexp of attention scores, defaults to ``False``.
@@ -1061,13 +1073,21 @@ class BatchDecodeWithPagedKVCacheWrapper:
         if rope_theta is None:
             rope_theta = 1e4
-        lse = None
         if return_lse:
-            lse = torch.empty(
-                (q.size(0), q.size(1)), dtype=torch.float32, device=q.device
-            )
+            if lse is None:
+                lse = torch.empty(
+                    (q.size(0), q.size(1)), dtype=torch.float32, device=q.device
+                )
+            else:
+                _check_shape_dtype_device(
+                    lse, (q.size(0), q.size(1)), torch.float32, q.device, "lse"
+                )
+        if out is None:
+            out = torch.empty_like(q)
+        else:
+            _check_shape_dtype_device(out, q.shape, q.dtype, q.device, "out")
-        out = torch.empty_like(q)
         if self.use_tensor_cores:
             run_args = [
                 self._float_workspace_buffer,
@@ -1252,6 +1272,7 @@ class BatchDecodeMlaWithPagedKVCacheWrapper:
         self,
         float_workspace_buffer: torch.Tensor,
         use_cuda_graph: bool = False,
+        use_tensor_cores: bool = False,
         paged_kv_indptr_buffer: Optional[torch.Tensor] = None,
         paged_kv_indices_buffer: Optional[torch.Tensor] = None,
         paged_kv_last_page_len_buffer: Optional[torch.Tensor] = None,
@@ -1270,6 +1291,10 @@ class BatchDecodeMlaWithPagedKVCacheWrapper:
             auxiliary data structures will be stored as the provided buffers. The ``batch_size``
             cannot change during the lifecycle of this wrapper when CUDAGraph is enabled.
+        use_tensor_cores : bool
+            Whether to use tensor cores for the computation. Will be faster for large group
+            size in grouped query attention. Defaults to ``False``.
         paged_kv_indptr_buffer : Optional[torch.Tensor]
             The user reserved buffer on GPU to store the indptr of the paged kv cache, the size
             of the buffer should be ``[batch_size + 1]``.
@@ -1319,6 +1344,7 @@ class BatchDecodeMlaWithPagedKVCacheWrapper:
         else:
             self._fixed_batch_size = 0
+        self._use_tensor_cores = use_tensor_cores
         self._paged_kv_indptr_buf = paged_kv_indptr_buffer
         self._paged_kv_indices_buf = paged_kv_indices_buffer
         self._paged_kv_last_page_len_buf = paged_kv_last_page_len_buffer
@@ -1328,6 +1354,10 @@ class BatchDecodeMlaWithPagedKVCacheWrapper:
     def is_cuda_graph_enabled(self) -> bool:
         return self._use_cuda_graph
+    @property
+    def use_tensor_cores(self) -> bool:
+        return self._use_tensor_cores
     def reset_workspace_buffer(
         self, float_workspace_buffer: torch.Tensor, int_workspace_buffer: torch.Tensor
     ) -> None:
@@ -1445,8 +1475,10 @@ class BatchDecodeMlaWithPagedKVCacheWrapper:
             q_data_type,
             indptr.dtype,
             head_dim_compressed_kv,
+            num_qo_heads,
             window_left != -1,  # use_sliding_window
             logits_soft_cap > 0,  # use_logits_soft_cap
+            self._use_tensor_cores,
         )
         with self.device as device:
             self._plan_info = self._cached_module.plan(
@@ -1476,6 +1508,8 @@ class BatchDecodeMlaWithPagedKVCacheWrapper:
         q_scale: Optional[float] = None,
         k_scale: Optional[float] = None,
         v_scale: Optional[float] = None,
+        out: Optional[torch.Tensor] = None,
+        lse: Optional[torch.Tensor] = None,
         return_lse: bool = False,
     ) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
         r"""Compute batch decode attention between query and paged kv cache.
@@ -1498,6 +1532,10 @@ class BatchDecodeMlaWithPagedKVCacheWrapper:
             The calibration scale of key for fp8 input, if not provided, will be set to ``1.0``.
         v_scale : Optional[float]
             The calibration scale of value for fp8 input, if not provided, will be set to ``1.0``.
+        out : Optional[torch.Tensor]
+            The output tensor, if not provided, will be allocated internally.
+        lse : Optional[torch.Tensor]
+            The log-sum-exp of attention logits, if not provided, will be allocated internally.
         return_lse : bool
             Whether to return the logsumexp of attention scores, defaults to ``False``.
@@ -1527,14 +1565,28 @@ class BatchDecodeMlaWithPagedKVCacheWrapper:
             rope_theta = 1e4
         with self.device as device:
-            o = torch.empty_like(q_nope, device=device)
-            maybe_lse = (
-                torch.empty(
-                    (q_nope.size(0), q_nope.size(1)), dtype=torch.float32, device=device
+            if out is None:
+                out = torch.empty_like(q_nope, device=device)
+            else:
+                _check_shape_dtype_device(
+                    out, q_nope.shape, q_nope.dtype, q_nope.device, "out"
                 )
-                if return_lse
-                else None
-            )
+            if return_lse:
+                if lse is None:
+                    lse = torch.empty(
+                        (q_nope.size(0), q_nope.size(1)),
+                        dtype=torch.float32,
+                        device=device,
+                    )
+                else:
+                    _check_shape_dtype_device(
+                        lse,
+                        (q_nope.size(0), q_nope.size(1)),
+                        q_nope.dtype,
+                        q_nope.device,
+                        "lse",
+                    )
             self._cached_module.run(
                 self._float_workspace_buffer,
                 self._int_workspace_buffer,
@@ -1546,16 +1598,16 @@ class BatchDecodeMlaWithPagedKVCacheWrapper:
                 self._paged_kv_indptr_buf,
                 self._paged_kv_indices_buf,
                 self._paged_kv_last_page_len_buf,
-                o,
+                out,
                 sm_scale,
                 window_left,
                 logits_soft_cap,
                 rope_scale,
                 rope_theta,
-                maybe_lse,
+                lse,
                 get_cuda_stream(device),
             )
-            out = [o, maybe_lse] if return_lse else [o]
+            out = [out, lse] if return_lse else [out]
         if v_scale is not None:
             out[0] *= v_scale
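
The `out`/`lse` hunks above share one pattern: allocate a buffer when the caller passes `None`, otherwise validate the caller's buffer against the query before handing it to the kernel. The body of `_check_shape_dtype_device` is not part of this diff, so the following is only a sketch of what such a helper might look like. `check_shape_dtype_device`, `FakeTensor`, and `resolve_out` are hypothetical names introduced here, and `FakeTensor` stands in for `torch.Tensor` so the sketch runs without PyTorch.

```python
from dataclasses import dataclass
from typing import Any, Optional, Sequence, Tuple


def check_shape_dtype_device(
    x: Any,
    expected_shape: Sequence[int],
    expected_dtype: Any,
    expected_device: Any,
    name: str,
) -> None:
    """Hypothetical stand-in for flashinfer.utils._check_shape_dtype_device.

    Works on any object exposing .shape, .dtype and .device, and raises
    ValueError on the first mismatch. The real helper may behave differently.
    """
    if tuple(x.shape) != tuple(expected_shape):
        raise ValueError(
            f"{name}: expected shape {tuple(expected_shape)}, got {tuple(x.shape)}"
        )
    if x.dtype != expected_dtype:
        raise ValueError(f"{name}: expected dtype {expected_dtype}, got {x.dtype}")
    if x.device != expected_device:
        raise ValueError(f"{name}: expected device {expected_device}, got {x.device}")


@dataclass
class FakeTensor:
    """Illustrative placeholder carrying only the metadata the check needs."""

    shape: Tuple[int, ...]
    dtype: str
    device: str


def resolve_out(q: FakeTensor, out: Optional[FakeTensor] = None) -> FakeTensor:
    """Allocate-or-validate, mirroring the `out` handling in the hunks above."""
    if out is None:
        # Plays the role of torch.empty_like(q) in the real code.
        return FakeTensor(q.shape, q.dtype, q.device)
    check_shape_dtype_device(out, q.shape, q.dtype, q.device, "out")
    return out
```

One detail that may be worth checking against the upstream source: in the `BatchDecodeMlaWithPagedKVCacheWrapper` hunk, the internally allocated `lse` uses `torch.float32`, while a caller-supplied `lse` is checked against `q_nope.dtype`. The `BatchDecodeWithPagedKVCacheWrapper` hunk checks `lse` against `torch.float32` in both cases.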

{flashinfer_python-0.2.1.post1 → flashinfer_python-0.2.1.post2}/flashinfer/jit/attention.py RENAMED

@@ -22,7 +22,7 @@ from typing import List, Tuple
 import jinja2
 import torch
-from .core import load_cuda_ops, sm90a_nvcc_flags
+from .core import logger, load_cuda_ops, sm90a_nvcc_flags
 from .env import FLASHINFER_CSRC_DIR, FLASHINFER_GEN_SRC_DIR
 from .utils import (
     dtype_map,
@@ -216,20 +216,20 @@ def get_batch_decode_mla_uri(
     dtype_kv: torch.dtype,
     dtype_o: torch.dtype,
     dtype_idx: torch.dtype,
-    head_dim_qk: int,
-    head_dim_vo: int,
+    head_dim_ckv: int,
     use_sliding_window: bool,
     use_logits_soft_cap: bool,
+    arc: str,
 ) -> str:
     return (
         f"batch_decode_mla_with_kv_cache_dtype_q_{filename_safe_dtype_map[dtype_q]}_"
         f"dtype_kv_{filename_safe_dtype_map[dtype_kv]}_"
         f"dtype_o_{filename_safe_dtype_map[dtype_o]}_"
         f"dtype_idx_{filename_safe_dtype_map[dtype_idx]}_"
-        f"head_dim_qk_{head_dim_qk}_"
-        f"head_dim_vo_{head_dim_vo}_"
+        f"head_dim_ckv{head_dim_ckv}_"
         f"use_swa_{use_sliding_window}_"
-        f"use_logits_cap_{use_logits_soft_cap}"
+        f"use_logits_cap_{use_logits_soft_cap}_"
+        f"arc_{arc}"
     )
@@ -239,18 +239,39 @@ def gen_batch_decode_mla_module(
     dtype_o: torch.dtype,
     dtype_idx: torch.dtype,
     head_dim: int,
+    num_qo_heads: int,
     use_sliding_window: bool,
     use_logits_soft_cap: bool,
+    use_tensor_cores: bool,
 ):
+    cuda_arch_major = torch.cuda.get_device_properties(0).major
+    if cuda_arch_major >= 9: # smem size of SM90 can accommodate all 128 qo-heads data
+        qo_tile_len = 128
+    else:
+        qo_tile_len = 64
+    if (
+            use_tensor_cores and
+            cuda_arch_major >= 8 and num_qo_heads % qo_tile_len == 0 and
+            dtype_q == torch.float16 and dtype_kv == torch.float16 and
+            dtype_o == torch.float16
+       ):
+        logger.info(f"Use tensor-core SM80 version of MLA decode kernel.")
+        arc = "sm80"
+    else:
+        logger.info(f"Fall back to cuda-core version of MLA decode kernel.")
+        arc = "cuda_core"
     uri = get_batch_decode_mla_uri(
         dtype_q,
         dtype_kv,
         dtype_o,
         dtype_idx,
         head_dim,
-        head_dim,
         use_sliding_window,
         use_logits_soft_cap,
+        arc,
     )
     gen_directory = FLASHINFER_GEN_SRC_DIR / uri
     os.makedirs(gen_directory, exist_ok=True)
@@ -267,17 +288,27 @@ def gen_batch_decode_mla_module(
             dtype_idx=dtype_map[dtype_idx],
             head_dim_ckv=head_dim,
             head_dim_kpe=head_dim // 8,
+            qo_tile_len=qo_tile_len,
             use_sliding_window=str(use_sliding_window).lower(),
             use_logits_soft_cap=str(use_logits_soft_cap).lower(),
         ),
     )
+    filenames = []
+    if arc == "sm80":
+        filenames = [
+                        "batch_decode_mla_cute_sm80.cu",
+                        "batch_decode_mla_pybind.cu",
+                    ]
+    else:
+        filenames = [
+                        "batch_decode_mla_plan.cu",
+                        "batch_decode_mla_run.cu",
+                        "batch_decode_mla_pybind.cu",
+                    ]
     source_paths = []
-    for filename in [
-        "batch_decode_mla_plan.cu",
-        "batch_decode_mla_run.cu",
-        "batch_decode_mla_pybind.cu",
-    ]:
+    for filename in filenames:
         src_path = FLASHINFER_CSRC_DIR / filename
         dest_path = gen_directory / filename
         source_paths.append(dest_path)
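
The branch added to `gen_batch_decode_mla_module` above can be read as a pure decision function: pick a tile length from the compute capability, then take the tensor-core path only when every precondition holds. The sketch below restates that logic without PyTorch so it can be exercised directly. `select_mla_decode_backend` and `source_files` are hypothetical names, and dtypes are passed as plain strings rather than `torch.dtype` objects.

```python
from typing import List, Tuple


def select_mla_decode_backend(
    cuda_arch_major: int,
    num_qo_heads: int,
    dtype_q: str,
    dtype_kv: str,
    dtype_o: str,
    use_tensor_cores: bool,
) -> Tuple[str, int]:
    """Return (arc, qo_tile_len), following the branch structure in the hunk above."""
    # SM90 and newer get the larger tile; everything else uses 64.
    qo_tile_len = 128 if cuda_arch_major >= 9 else 64
    all_fp16 = dtype_q == dtype_kv == dtype_o == "float16"
    if (
        use_tensor_cores
        and cuda_arch_major >= 8
        and num_qo_heads % qo_tile_len == 0
        and all_fp16
    ):
        return "sm80", qo_tile_len
    # Any failed precondition falls back to the CUDA-core kernel.
    return "cuda_core", qo_tile_len


def source_files(arc: str) -> List[str]:
    """Source files copied into the generated directory for each backend."""
    if arc == "sm80":
        return ["batch_decode_mla_cute_sm80.cu", "batch_decode_mla_pybind.cu"]
    return [
        "batch_decode_mla_plan.cu",
        "batch_decode_mla_run.cu",
        "batch_decode_mla_pybind.cu",
    ]
```

Under this reading, 128 query heads on an SM80 device with all-fp16 tensors select the tensor-core kernel with a tile length of 64, while 64 query heads on an SM90 device fall back to CUDA cores because 64 is not a multiple of 128. The original queries `torch.cuda.get_device_properties(0)`, so the tile length follows device 0 rather than the tensor's device.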

{flashinfer_python-0.2.1.post1 → flashinfer_python-0.2.1.post2}/flashinfer/mla.py RENAMED

@@ -21,7 +21,13 @@ from typing import List, Literal, Optional, Tuple, Union, overload
 import torch
 from .jit import gen_batch_mla_module, get_batch_mla_uri
-from .utils import MaskMode, get_cuda_stream, register_custom_op, register_fake_op
+from .utils import (
+    MaskMode,
+    _check_shape_dtype_device,
+    get_cuda_stream,
+    register_custom_op,
+    register_fake_op,
+)
 _batch_mla_modules = {}
@@ -267,6 +273,8 @@ class BatchMLAPagedAttentionWrapper:
         q_pe: torch.Tensor,
         ckv_cache: torch.Tensor,
         kpe_cache: torch.Tensor,
+        out: Optional[torch.Tensor] = None,
+        lse: Optional[torch.Tensor] = None,
         return_lse: bool = False,
     ) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
         r"""Run the MLA attention computation.
@@ -283,6 +291,10 @@ class BatchMLAPagedAttentionWrapper:
         kpe_cache : torch.Tensor
             The rope part of the kv-cache tensor, shape: ``[num_pages, page_size, head_dim_kpe]``.
             ``head_dim_kpe`` is 64 in DeepSeek v2/v3 models.
+        out : Optional[torch.Tensor]
+            The output tensor, if not provided, will be allocated internally.
+        lse : Optional[torch.Tensor]
+            The log-sum-exp of attention logits, if not provided, will be allocated internally.
         return_lse : bool, optional
             Whether to return the log-sum-exp value, default is False.
         """
@@ -292,12 +304,22 @@ class BatchMLAPagedAttentionWrapper:
         causal = self._causal
         mask_mode = MaskMode.CAUSAL.value if causal else MaskMode.NON_CAUSAL.value
         with self.device as device:
-            o = torch.empty_like(q_nope)
-            maybe_lse = (
-                torch.empty(q_nope.shape[:2], dtype=torch.float32, device=device)
-                if return_lse
-                else None
-            )
+            if out is None:
+                out = torch.empty_like(q_nope)
+            else:
+                _check_shape_dtype_device(
+                    out, q_nope.shape, q_nope.dtype, q_nope.device, "out"
+                )
+            if return_lse:
+                if lse is None:
+                    lse = torch.empty(
+                        q_nope.shape[:2], dtype=torch.float32, device=device
+                    )
+                else:
+                    _check_shape_dtype_device(
+                        lse, q_nope.shape[:2], torch.float32, q_nope.device, "lse"
+                    )
             self._cached_module.run(
                 self._float_workspace_buffer,
                 self._int_workspace_buffer,
@@ -307,8 +329,8 @@ class BatchMLAPagedAttentionWrapper:
                 ckv_cache,
                 kpe_cache,
                 self._kv_indices_buf,
-                o,
-                maybe_lse,
+                out,
+                lse,
                 mask_mode,
                 num_heads,
                 page_size,
@@ -316,4 +338,4 @@ class BatchMLAPagedAttentionWrapper:
                 get_cuda_stream(device),
             )
-        return (o, maybe_lse) if return_lse else o
+        return (out, lse) if return_lse else out
