PyPI - flashinfer-python - Versions diffs - 0.2.3__tar.gz → 0.2.4__tar.gz - Mend

Files changed (1166)

{flashinfer_python-0.2.3/flashinfer_python.egg-info → flashinfer_python-0.2.4}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
-Metadata-Version: 2.2
+Metadata-Version: 2.4
 Name: flashinfer-python
-Version: 0.2.3
+Version: 0.2.4
 Summary: FlashInfer: Kernel Library for LLM Serving
 Author: FlashInfer team
 License: Apache License 2.0
@@ -11,6 +11,7 @@ License-File: LICENSE
 Requires-Dist: numpy
 Requires-Dist: torch
 Requires-Dist: ninja
+Dynamic: license-file
 Dynamic: requires-dist
 <p align="center">
@@ -27,6 +28,7 @@ Kernel Library for LLM Serving
 | <a href="https://flashinfer.ai"><b>Blog</b></a> | <a href="https://docs.flashinfer.ai"><b>Documentation</b></a> | <a href="https://join.slack.com/t/flashinfer/shared_invite/zt-2r93kj2aq-wZnC2n_Z2~mf73N5qnVGGA"><b>Slack</b></a>|  <a href="https://github.com/orgs/flashinfer-ai/discussions"><b>Discussion Forum</b></a> |
 </p>
+[![Build Status](https://ci.tlcpack.ai/job/flashinfer-ci/job/main/badge/icon)](https://ci.tlcpack.ai/job/flashinfer-ci/job/main/)
 [![Release](https://github.com/flashinfer-ai/flashinfer/actions/workflows/release_wheel.yml/badge.svg)](https://github.com/flashinfer-ai/flashinfer/actions/workflows/release_wheel.yml)
 [![Documentation](https://github.com/flashinfer-ai/flashinfer/actions/workflows/build-doc.yml/badge.svg)](https://github.com/flashinfer-ai/flashinfer/actions/workflows/build-doc.yml)
@@ -46,6 +48,8 @@ The core features of FlashInfer include:
 FlashInfer supports PyTorch, TVM and C++ (header-only) APIs, and can be easily integrated into existing projects.
 ## News
+- [Mar 10, 2025] [Blog Post](https://flashinfer.ai/2025/03/10/sampling.html) Sorting-Free GPU Kernels for LLM Sampling, which explains the design of sampling kernels in FlashInfer.
+- [Mar 1, 2025] Checkout flashinfer's [intra-kernel profiler](https://github.com/flashinfer-ai/flashinfer/tree/main/profiler) for visualizing the timeline of each threadblock in GPU kernels.
 - [Dec 16, 2024] [Blog Post](https://flashinfer.ai/2024/12/16/flashinfer-v02-release.html) FlashInfer 0.2 - Efficient and Customizable Kernels for LLM Inference Serving
 - [Sept 2024] We've launched a [Slack](https://join.slack.com/t/flashinfer/shared_invite/zt-2r93kj2aq-wZnC2n_Z2~mf73N5qnVGGA) workspace for Flashinfer users and developers. Join us for timely support, discussions, updates and knowledge sharing!
 - [Jan 31, 2024] [Blog Post](https://flashinfer.ai/2024/01/08/cascade-inference.html) Cascade Inference: Memory-Efficient Shared Prefix Batch Decoding

{flashinfer_python-0.2.3 → flashinfer_python-0.2.4}/README.md RENAMED

@@ -12,6 +12,7 @@ Kernel Library for LLM Serving
 | <a href="https://flashinfer.ai"><b>Blog</b></a> | <a href="https://docs.flashinfer.ai"><b>Documentation</b></a> | <a href="https://join.slack.com/t/flashinfer/shared_invite/zt-2r93kj2aq-wZnC2n_Z2~mf73N5qnVGGA"><b>Slack</b></a>|  <a href="https://github.com/orgs/flashinfer-ai/discussions"><b>Discussion Forum</b></a> |
 </p>
+[![Build Status](https://ci.tlcpack.ai/job/flashinfer-ci/job/main/badge/icon)](https://ci.tlcpack.ai/job/flashinfer-ci/job/main/)
 [![Release](https://github.com/flashinfer-ai/flashinfer/actions/workflows/release_wheel.yml/badge.svg)](https://github.com/flashinfer-ai/flashinfer/actions/workflows/release_wheel.yml)
 [![Documentation](https://github.com/flashinfer-ai/flashinfer/actions/workflows/build-doc.yml/badge.svg)](https://github.com/flashinfer-ai/flashinfer/actions/workflows/build-doc.yml)
@@ -31,6 +32,8 @@ The core features of FlashInfer include:
 FlashInfer supports PyTorch, TVM and C++ (header-only) APIs, and can be easily integrated into existing projects.
 ## News
+- [Mar 10, 2025] [Blog Post](https://flashinfer.ai/2025/03/10/sampling.html) Sorting-Free GPU Kernels for LLM Sampling, which explains the design of sampling kernels in FlashInfer.
+- [Mar 1, 2025] Checkout flashinfer's [intra-kernel profiler](https://github.com/flashinfer-ai/flashinfer/tree/main/profiler) for visualizing the timeline of each threadblock in GPU kernels.
 - [Dec 16, 2024] [Blog Post](https://flashinfer.ai/2024/12/16/flashinfer-v02-release.html) FlashInfer 0.2 - Efficient and Customizable Kernels for LLM Inference Serving
 - [Sept 2024] We've launched a [Slack](https://join.slack.com/t/flashinfer/shared_invite/zt-2r93kj2aq-wZnC2n_Z2~mf73N5qnVGGA) workspace for Flashinfer users and developers. Join us for timely support, discussions, updates and knowledge sharing!
 - [Jan 31, 2024] [Blog Post](https://flashinfer.ai/2024/01/08/cascade-inference.html) Cascade Inference: Memory-Efficient Shared Prefix Batch Decoding

{flashinfer_python-0.2.3 → flashinfer_python-0.2.4}/csrc/activation.cu RENAMED

@@ -32,11 +32,12 @@ __device__ __forceinline__ float gelu_tanh(const float& val) {
   return val * cdf;
 }
-void silu_and_mul(at::Tensor& out, at::Tensor& input, bool enable_pdl, int64_t cuda_stream) {
+void silu_and_mul(at::Tensor& out, at::Tensor& input, bool enable_pdl) {
   int d = input.size(-1) / 2;
   int64_t num_tokens = input.numel() / input.size(-1);
-  cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream);
+  const c10::cuda::OptionalCUDAGuard device_guard(out.device());
+  auto stream = at::cuda::getCurrentCUDAStream();
   DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16(input.scalar_type(), c_type, [&] {
     uint32_t vec_size = 16 / sizeof(c_type);
@@ -63,11 +64,13 @@ void silu_and_mul(at::Tensor& out, at::Tensor& input, bool enable_pdl, int64_t c
   });
 }
-void gelu_tanh_and_mul(at::Tensor& out, at::Tensor& input, bool enable_pdl, int64_t cuda_stream) {
+void gelu_tanh_and_mul(at::Tensor& out, at::Tensor& input, bool enable_pdl) {
   int d = input.size(-1) / 2;
   int64_t num_tokens = input.numel() / input.size(-1);
-  cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream);
+  const c10::cuda::OptionalCUDAGuard device_guard(out.device());
+  auto stream = at::cuda::getCurrentCUDAStream();
   DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16(input.scalar_type(), c_type, [&] {
     uint32_t vec_size = 16 / sizeof(c_type);
     cudaLaunchConfig_t config;
@@ -93,12 +96,12 @@ void gelu_tanh_and_mul(at::Tensor& out, at::Tensor& input, bool enable_pdl, int6
   });
 }
-void gelu_and_mul(at::Tensor& out, at::Tensor& input, bool enable_pdl, int64_t cuda_stream) {
+void gelu_and_mul(at::Tensor& out, at::Tensor& input, bool enable_pdl) {
   int d = input.size(-1) / 2;
   int64_t num_tokens = input.numel() / input.size(-1);
-  dim3 grid(num_tokens);
+  const c10::cuda::OptionalCUDAGuard device_guard(out.device());
+  auto stream = at::cuda::getCurrentCUDAStream();
-  cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream);
   DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16(input.scalar_type(), c_type, [&] {
     uint32_t vec_size = 16 / sizeof(c_type);
     cudaLaunchConfig_t config;
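
Note on the hunks above: the three activation entry points no longer take an `int64_t cuda_stream` argument. Each one now pins the device with a guard and then asks for the current stream. The sketch below models that pattern in plain C++ with no CUDA or PyTorch dependency. Everything in the `mock` namespace (`DeviceGuard`, `current_device`, `current_stream`, `launch_old`, `launch_new`, the stream ids 100 and 200) is a hypothetical stand-in invented for illustration. It is not the implementation of `c10::cuda::OptionalCUDAGuard` or `at::cuda::getCurrentCUDAStream()`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

namespace mock {

constexpr int kNumDevices = 2;

// Hypothetical per-thread state: the active device and one "current stream" id per device.
thread_local int g_current_device = 0;
thread_local std::array<std::int64_t, kNumDevices> g_current_stream = {{100, 200}};

inline int current_device() { return g_current_device; }
inline std::int64_t current_stream() { return g_current_stream[g_current_device]; }

// RAII guard: switches the active device on construction and restores the
// previous one on destruction, so the caller's device is left untouched.
class DeviceGuard {
 public:
  explicit DeviceGuard(int device) : prev_(g_current_device) {
    assert(device >= 0 && device < kNumDevices);
    g_current_device = device;
  }
  ~DeviceGuard() { g_current_device = prev_; }
  DeviceGuard(const DeviceGuard&) = delete;
  DeviceGuard& operator=(const DeviceGuard&) = delete;

 private:
  int prev_;
};

// 0.2.3 style: the caller supplies the stream handle explicitly.
inline std::int64_t launch_old(int /*tensor_device*/, std::int64_t cuda_stream) {
  return cuda_stream;
}

// 0.2.4 style: guard the tensor's device, then look up the ambient stream.
inline std::int64_t launch_new(int tensor_device) {
  const DeviceGuard guard(tensor_device);
  return current_stream();
}

}  // namespace mock
```

In this model, calling `mock::launch_new(1)` from a thread whose active device is 0 selects device 1's stream for the duration of the call and leaves the active device at 0 afterwards, which is the behavior the guard-then-lookup ordering in the diff appears intended to provide.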

{flashinfer_python-0.2.3 → flashinfer_python-0.2.4}/csrc/batch_decode.cu RENAMED

@@ -19,8 +19,8 @@
 #include <optional>
 #include "batch_decode_config.inc"
-#include "pytorch_extension_utils.h"
 #include "pytorch_conversion_utils.h"
+#include "pytorch_extension_utils.h"
 namespace flashinfer {
@@ -36,9 +36,9 @@ using namespace flashinfer;
 at::Tensor BatchDecodeWithPagedKVCachePlan(
     at::Tensor float_workspace_buffer, at::Tensor int_workspace_buffer,
     at::Tensor page_locked_int_workspace_buffer, at::Tensor indptr, int64_t batch_size,
-    int64_t num_qo_heads, int64_t num_kv_heads, int64_t page_size,
-    bool enable_cuda_graph, int64_t window_left, double logits_soft_cap, int64_t head_dim_qk,
-    int64_t head_dim_vo, at::Tensor empty_q_data, at::Tensor empty_kv_data, int64_t cuda_stream) {
+    int64_t num_qo_heads, int64_t num_kv_heads, int64_t page_size, bool enable_cuda_graph,
+    int64_t window_left, double logits_soft_cap, int64_t head_dim_qk, int64_t head_dim_vo,
+    at::Tensor empty_q_data, at::Tensor empty_kv_data) {
   size_t float_workspace_size_in_bytes =
       float_workspace_buffer.size(0) * float_workspace_buffer.element_size();
   size_t int_workspace_size_in_bytes =
@@ -53,7 +53,8 @@ at::Tensor BatchDecodeWithPagedKVCachePlan(
               "CUDA cores template only supports equal head dim for QK and VO, please use tensor "
               "cores template for different head dim");
-  cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream);
+  const c10::cuda::OptionalCUDAGuard device_guard(float_workspace_buffer.device());
+  const cudaStream_t stream = c10::cuda::getCurrentCUDAStream();
   DISPATCH_context(
       DTypeQ, DTypeKV, DTypeO, IdType, HEAD_DIM_QK, HEAD_DIM_VO, POS_ENCODING_MODE,
       USE_SLIDING_WINDOW, USE_LOGITS_SOFT_CAP, AttentionVariant, Params, [&] {
@@ -77,12 +78,14 @@ at::Tensor BatchDecodeWithPagedKVCachePlan(
   return vec_to_tensor(plan_info.ToVector());
 }
-void BatchDecodeWithPagedKVCacheRun(
-    at::Tensor float_workspace_buffer, at::Tensor int_workspace_buffer,
-    at::Tensor plan_info_vec, at::Tensor q, at::Tensor paged_k_cache,
-    at::Tensor paged_v_cache, at::Tensor paged_kv_indptr, at::Tensor paged_kv_indices,
-    at::Tensor paged_kv_last_page_len, at::Tensor o, std::optional<at::Tensor> maybe_lse,
-    int64_t kv_layout_code, int64_t window_left ADDITIONAL_FUNC_PARAMS, int64_t cuda_stream) {
+void BatchDecodeWithPagedKVCacheRun(at::Tensor float_workspace_buffer,
+                                    at::Tensor int_workspace_buffer, at::Tensor plan_info_vec,
+                                    at::Tensor q, at::Tensor paged_k_cache,
+                                    at::Tensor paged_v_cache, at::Tensor paged_kv_indptr,
+                                    at::Tensor paged_kv_indices, at::Tensor paged_kv_last_page_len,
+                                    at::Tensor o, std::optional<at::Tensor> maybe_lse,
+                                    int64_t kv_layout_code,
+                                    int64_t window_left ADDITIONAL_FUNC_PARAMS) {
   DecodePlanInfo plan_info;
   plan_info.FromVector(tensor_to_vec(plan_info_vec));
   QKVLayout kv_layout = static_cast<QKVLayout>(kv_layout_code);
@@ -129,7 +132,8 @@ void BatchDecodeWithPagedKVCacheRun(
   TORCH_CHECK(k_strides == v_strides, "k/v strides must be identical");
   kv_cache_strides = k_strides.data();
-  cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream);
+  const c10::cuda::OptionalCUDAGuard device_guard(device);
+  const cudaStream_t stream = c10::cuda::getCurrentCUDAStream();
   DISPATCH_context(
       DTypeQ, DTypeKV, DTypeO, IdType, HEAD_DIM_QK, HEAD_DIM_VO, POS_ENCODING_MODE,
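
Note on the hunks above: the plan function returns `vec_to_tensor(plan_info.ToVector())` and the run function rebuilds the record with `plan_info.FromVector(tensor_to_vec(plan_info_vec))`, so the plan travels between the two calls as a flat list of integers. A minimal sketch of that round trip follows, assuming a made-up two-field record. `PlanInfo`, its `split_kv` field, and the rounding rule in `plan` are hypothetical; the real `DecodePlanInfo` layout is not shown in this diff.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

namespace sketch {

// Hypothetical plan record, not the real DecodePlanInfo.
struct PlanInfo {
  std::int64_t padded_batch_size = 0;
  std::int64_t split_kv = 0;  // hypothetical flag

  // Flatten the record into a fixed-order list of integers.
  std::vector<std::int64_t> ToVector() const { return {padded_batch_size, split_kv}; }

  // Rebuild the record from the flat list, rejecting a wrong length.
  void FromVector(const std::vector<std::int64_t>& vec) {
    if (vec.size() != 2) throw std::invalid_argument("PlanInfo expects exactly 2 values");
    padded_batch_size = vec[0];
    split_kv = vec[1];
  }
};

// "plan": compute the record once and hand it back in flat form.
inline std::vector<std::int64_t> plan(std::int64_t batch_size) {
  assert(batch_size > 0);
  PlanInfo info;
  info.padded_batch_size = (batch_size + 7) / 8 * 8;  // arbitrary rule: round up to a multiple of 8
  info.split_kv = batch_size < 8 ? 1 : 0;             // arbitrary rule for illustration
  return info.ToVector();
}

// "run": reconstruct the record from the flat form and use it.
inline std::int64_t run(const std::vector<std::int64_t>& plan_info_vec) {
  PlanInfo info;
  info.FromVector(plan_info_vec);
  return info.padded_batch_size;
}

}  // namespace sketch
```

Keeping the serialized form a plain integer list means it can cross the operator boundary as an ordinary tensor argument, which matches the `at::Tensor plan_info_vec` parameter in the signatures above.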

{flashinfer_python-0.2.3 → flashinfer_python-0.2.4}/csrc/batch_decode_jit_pybind.cu RENAMED

@@ -19,16 +19,18 @@
 at::Tensor BatchDecodeWithPagedKVCachePlan(
     at::Tensor float_workspace_buffer, at::Tensor int_workspace_buffer,
     at::Tensor page_locked_int_workspace_buffer, at::Tensor indptr, int64_t batch_size,
-    int64_t num_qo_heads, int64_t num_kv_heads, int64_t page_size,
-    bool enable_cuda_graph, int64_t window_left, double logits_soft_cap, int64_t head_dim_qk,
-    int64_t head_dim_vo, at::Tensor empty_q_data, at::Tensor empty_kv_data, int64_t cuda_stream);
+    int64_t num_qo_heads, int64_t num_kv_heads, int64_t page_size, bool enable_cuda_graph,
+    int64_t window_left, double logits_soft_cap, int64_t head_dim_qk, int64_t head_dim_vo,
+    at::Tensor empty_q_data, at::Tensor empty_kv_data);
-void BatchDecodeWithPagedKVCacheRun(
-    at::Tensor float_workspace_buffer, at::Tensor int_workspace_buffer,
-    at::Tensor plan_info_vec, at::Tensor q, at::Tensor paged_k_cache,
-    at::Tensor paged_v_cache, at::Tensor paged_kv_indptr, at::Tensor paged_kv_indices,
-    at::Tensor paged_kv_last_page_len, at::Tensor o, std::optional<at::Tensor> maybe_lse,
-    int64_t kv_layout_code, int64_t window_left ADDITIONAL_FUNC_PARAMS, int64_t cuda_stream);
+void BatchDecodeWithPagedKVCacheRun(at::Tensor float_workspace_buffer,
+                                    at::Tensor int_workspace_buffer, at::Tensor plan_info_vec,
+                                    at::Tensor q, at::Tensor paged_k_cache,
+                                    at::Tensor paged_v_cache, at::Tensor paged_kv_indptr,
+                                    at::Tensor paged_kv_indices, at::Tensor paged_kv_last_page_len,
+                                    at::Tensor o, std::optional<at::Tensor> maybe_lse,
+                                    int64_t kv_layout_code,
+                                    int64_t window_left ADDITIONAL_FUNC_PARAMS);
 TORCH_LIBRARY_FRAGMENT(TORCH_EXTENSION_NAME, m) {
   // Batched decode with paged KV-Cache plan

{flashinfer_python-0.2.3 → flashinfer_python-0.2.4}/csrc/batch_decode_mla_cute_sm80.cu RENAMED

@@ -1,11 +1,9 @@
+#include <flashinfer/attention/decode_mla_cute_sm80.cuh>
+#include <flashinfer/attention/scheduler.cuh>
 #include <optional>
-#include "pytorch_extension_utils.h"
 #include "mla_config.inc"
-#include <flashinfer/attention/decode_mla_cute_sm80.cuh>
-#include <flashinfer/attention/scheduler.cuh>
+#include "pytorch_extension_utils.h"
 using namespace flashinfer;
@@ -22,9 +20,8 @@ std::vector<int64_t> BatchDecodeWithPagedKVCachePlanMLA(
   DecodePlanInfo plan_info;
   cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream);
-  auto work_estimation_func =
-      BatchDecodeWithPagedKVCacheWorkEstimationDispatchedMlaCuteSM80<HEAD_DIM_CKV, HEAD_DIM_KPE, QO_TILE_LEN,
-                                                             AttentionVariant, Params>;
+  auto work_estimation_func = BatchDecodeWithPagedKVCacheWorkEstimationDispatchedMlaCuteSM80<
+      HEAD_DIM_CKV, HEAD_DIM_KPE, QO_TILE_LEN, AttentionVariant, Params>;
   cudaError_t status =
       DecodePlan<HEAD_DIM_CKV, flashinfer::PosEncodingMode::kNone, AttentionVariant, Params>(
           static_cast<void*>(float_workspace_buffer.data_ptr()), float_workspace_size_in_bytes,
@@ -40,7 +37,6 @@ std::vector<int64_t> BatchDecodeWithPagedKVCachePlanMLA(
   return plan_info.ToVector();
 }
 void BatchDecodeWithPagedKVCacheRunMLA(
     at::Tensor float_workspace_buffer, at::Tensor int_workspace_buffer,
     std::vector<int64_t> plan_info_vec, at::Tensor q_nope, at::Tensor q_pe,
@@ -99,9 +95,9 @@ void BatchDecodeWithPagedKVCacheRunMLA(
   params.padded_batch_size = plan_info.padded_batch_size;
   cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream);
-  cudaError_t status =
-      BatchDecodeWithPagedKVCacheDispatchedMlaCuteSM80<HEAD_DIM_CKV, HEAD_DIM_KPE, QO_TILE_LEN,
-                                               Params>(params, tmp_v, tmp_s, /*stream=*/stream);
+  cudaError_t status = BatchDecodeWithPagedKVCacheDispatchedMlaCuteSM80<HEAD_DIM_CKV, HEAD_DIM_KPE,
+                                                                        QO_TILE_LEN, Params>(
+      params, tmp_v, tmp_s, /*stream=*/stream);
   TORCH_CHECK(status == cudaSuccess, "BatchDecodeWithPagedKVCache failed with error ",
               cudaGetErrorString(status));
 }
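
Note on the hunks above: this file still receives the stream as an `int64_t cuda_stream` and recovers it with `reinterpret_cast<cudaStream_t>(cuda_stream)`. That works because `cudaStream_t` is a pointer type, so its value fits in a 64-bit integer and survives the round trip. The sketch below shows the same round trip in plain C++ with no CUDA dependency. `FakeStream_st`, `fake_stream_t`, and the helper functions are hypothetical names used only for illustration, and the sketch assumes a platform where pointers are at most 64 bits wide.

```cpp
#include <cassert>
#include <cstdint>

namespace handle_sketch {

struct FakeStream_st { int id; };      // stand-in for the opaque stream struct
using fake_stream_t = FakeStream_st*;  // stand-in for cudaStream_t

// Caller side: flatten the pointer into an integer handle.
inline std::int64_t to_handle(fake_stream_t s) {
  return reinterpret_cast<std::int64_t>(s);
}

// Callee side: recover the pointer from the integer handle.
inline fake_stream_t from_handle(std::int64_t h) {
  return reinterpret_cast<fake_stream_t>(h);
}

// Use the recovered pointer; a null handle is rejected in debug builds.
inline int stream_id_from_handle(std::int64_t h) {
  fake_stream_t s = from_handle(h);
  assert(s != nullptr);
  return s->id;
}

}  // namespace handle_sketch
```

The integer carries no type or lifetime information, so the callee has to trust that the caller passed a live stream belonging to the right device. The files in this diff that drop the parameter avoid that handoff by querying the current stream directly.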

{flashinfer_python-0.2.3 → flashinfer_python-0.2.4}/csrc/batch_decode_mla_plan.cu RENAMED

@@ -3,16 +3,17 @@
 #include <optional>
 #include "mla_config.inc"
-#include "pytorch_extension_utils.h"
 #include "pytorch_conversion_utils.h"
+#include "pytorch_extension_utils.h"
 using namespace flashinfer;
-at::Tensor BatchDecodeWithPagedKVCachePlanMLA(
-    at::Tensor float_workspace_buffer, at::Tensor int_workspace_buffer,
-    at::Tensor page_locked_int_workspace_buffer, at::Tensor indptr, int64_t batch_size,
-    int64_t num_qo_heads, int64_t page_size, bool enable_cuda_graph,
-    int64_t cuda_stream) {
+at::Tensor BatchDecodeWithPagedKVCachePlanMLA(at::Tensor float_workspace_buffer,
+                                              at::Tensor int_workspace_buffer,
+                                              at::Tensor page_locked_int_workspace_buffer,
+                                              at::Tensor indptr, int64_t batch_size,
+                                              int64_t num_qo_heads, int64_t page_size,
+                                              bool enable_cuda_graph, int64_t cuda_stream) {
   size_t float_workspace_size_in_bytes =
       float_workspace_buffer.size(0) * float_workspace_buffer.element_size();
   size_t int_workspace_size_in_bytes =

flashinfer_python-0.2.4/csrc/batch_decode_mla_pybind.cu ADDED

@@ -0,0 +1,21 @@
+#include "mla_config.inc"
+#include "pytorch_extension_utils.h"
+at::Tensor BatchDecodeWithPagedKVCachePlanMLA(at::Tensor float_workspace_buffer,
+                                              at::Tensor int_workspace_buffer,
+                                              at::Tensor page_locked_int_workspace_buffer,
+                                              at::Tensor indptr, int64_t batch_size,
+                                              int64_t num_qo_heads, int64_t page_size,
+                                              bool enable_cuda_graph, int64_t cuda_stream);
+void BatchDecodeWithPagedKVCacheRunMLA(
+    at::Tensor float_workspace_buffer, at::Tensor int_workspace_buffer, at::Tensor plan_info_vec,
+    at::Tensor q_nope, at::Tensor q_pe, at::Tensor paged_ckv_cache, at::Tensor paged_kpe_cache,
+    at::Tensor paged_kv_indptr, at::Tensor paged_kv_indices, at::Tensor paged_kv_last_page_len,
+    at::Tensor o, double sm_scale, int64_t window_left, double logits_soft_cap, double rope_scale,
+    double rope_theta, std::optional<at::Tensor> maybe_lse, int64_t cuda_stream);
+TORCH_LIBRARY_FRAGMENT(TORCH_EXTENSION_NAME, m) {
+  m.def("plan", BatchDecodeWithPagedKVCachePlanMLA);
+  m.def("run", BatchDecodeWithPagedKVCacheRunMLA);
+}

{flashinfer_python-0.2.3 → flashinfer_python-0.2.4}/csrc/batch_decode_mla_run.cu RENAMED

@@ -3,18 +3,17 @@
 #include <optional>
 #include "mla_config.inc"
-#include "pytorch_extension_utils.h"
 #include "pytorch_conversion_utils.h"
+#include "pytorch_extension_utils.h"
 using namespace flashinfer;
 void BatchDecodeWithPagedKVCacheRunMLA(
-    at::Tensor float_workspace_buffer, at::Tensor int_workspace_buffer,
-    at::Tensor plan_info_vec, at::Tensor q_nope, at::Tensor q_pe,
-    at::Tensor paged_ckv_cache, at::Tensor paged_kpe_cache, at::Tensor paged_kv_indptr,
-    at::Tensor paged_kv_indices, at::Tensor paged_kv_last_page_len, at::Tensor o, double sm_scale,
-    int64_t window_left, double logits_soft_cap, double rope_scale, double rope_theta,
-    std::optional<at::Tensor> maybe_lse, int64_t cuda_stream) {
+    at::Tensor float_workspace_buffer, at::Tensor int_workspace_buffer, at::Tensor plan_info_vec,
+    at::Tensor q_nope, at::Tensor q_pe, at::Tensor paged_ckv_cache, at::Tensor paged_kpe_cache,
+    at::Tensor paged_kv_indptr, at::Tensor paged_kv_indices, at::Tensor paged_kv_last_page_len,
+    at::Tensor o, double sm_scale, int64_t window_left, double logits_soft_cap, double rope_scale,
+    double rope_theta, std::optional<at::Tensor> maybe_lse, int64_t cuda_stream) {
   DecodePlanInfo plan_info;
   plan_info.FromVector(tensor_to_vec(plan_info_vec));

{flashinfer_python-0.2.3 → flashinfer_python-0.2.4}/csrc/batch_mla_plan.cu RENAMED

@@ -26,8 +26,7 @@ at::Tensor BatchMLAPagedAttentionPlan(at::Tensor float_workspace_buffer,
                                       at::Tensor int_workspace_buffer,
                                       at::Tensor page_locked_int_workspace_buffer,
                                       at::Tensor qo_indptr, at::Tensor kv_indptr, at::Tensor kv_len,
-                                      int64_t num_heads, int64_t head_dim_o, bool causal,
-                                      int64_t cuda_stream) {
+                                      int64_t num_heads, int64_t head_dim_o, bool causal) {
   size_t float_workspace_size_in_bytes =
       float_workspace_buffer.size(0) * float_workspace_buffer.element_size();
   size_t int_workspace_size_in_bytes =
@@ -37,7 +36,9 @@ at::Tensor BatchMLAPagedAttentionPlan(at::Tensor float_workspace_buffer,
   int batch_size = kv_len.size(0);
-  cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream);
+  const c10::cuda::OptionalCUDAGuard device_guard(float_workspace_buffer.device());
+  const cudaStream_t stream = c10::cuda::getCurrentCUDAStream();
   cudaError_t status =
       MLAPlan(float_workspace_buffer.data_ptr(), float_workspace_size_in_bytes,
               int_workspace_buffer.data_ptr(), page_locked_int_workspace_buffer.data_ptr(),

{flashinfer_python-0.2.3 → flashinfer_python-0.2.4}/csrc/batch_mla_pybind.cu RENAMED

@@ -20,15 +20,14 @@ at::Tensor BatchMLAPagedAttentionPlan(at::Tensor float_workspace_buffer,
                                       at::Tensor int_workspace_buffer,
                                       at::Tensor page_locked_int_workspace_buffer,
                                       at::Tensor qo_indptr, at::Tensor kv_indptr, at::Tensor kv_len,
-                                      int64_t num_heads, int64_t head_dim_o, bool causal,
-                                      int64_t cuda_stream);
+                                      int64_t num_heads, int64_t head_dim_o, bool causal);
 void BatchMLAPagedAttentionRun(at::Tensor float_workspace_buffer, at::Tensor int_workspace_buffer,
                                at::Tensor plan_info_vec, at::Tensor q_nope, at::Tensor q_pe,
                                at::Tensor ckv_cache, at::Tensor kpe_cache, at::Tensor kv_indices,
                                at::Tensor o, std::optional<at::Tensor> maybe_lse,
                                int64_t mask_mode_code, int64_t num_heads, int64_t page_size,
-                               double sm_scale, int64_t cuda_stream);
+                               double sm_scale);
 TORCH_LIBRARY_FRAGMENT(TORCH_EXTENSION_NAME, m) {
   m.def("plan", &BatchMLAPagedAttentionPlan);

{flashinfer_python-0.2.3 → flashinfer_python-0.2.4}/csrc/batch_mla_run.cu RENAMED

@@ -29,7 +29,7 @@ void BatchMLAPagedAttentionRun(at::Tensor float_workspace_buffer, at::Tensor int
                                at::Tensor ckv_cache, at::Tensor kpe_cache, at::Tensor kv_indices,
                                at::Tensor o, std::optional<at::Tensor> maybe_lse,
                                int64_t mask_mode_code, int64_t num_heads, int64_t page_size,
-                               double sm_scale, int64_t cuda_stream) {
+                               double sm_scale) {
   // q_nope: [n, num_heads, head_dim_ckv]
   // q_pe: [n, num_heads, head_dim_kpe]
   // ckv_cache: [num_pages, page_size, head_dim_ckv]
@@ -58,7 +58,8 @@ void BatchMLAPagedAttentionRun(at::Tensor float_workspace_buffer, at::Tensor int
   unsigned int o_stride_n = o.stride(0);
   unsigned int o_stride_h = o.stride(1);
-  cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream);
+  const c10::cuda::OptionalCUDAGuard device_guard(device);
+  const cudaStream_t stream = c10::cuda::getCurrentCUDAStream();
   DISPATCH_context(
       DTypeQ, DTypeKV, DTypeO, IdType, MASK_MODE, HEAD_DIM_CKV, HEAD_DIM_KPE, Params, [&] {

{flashinfer_python-0.2.3 → flashinfer_python-0.2.4}/csrc/batch_mla_sm90_plan.cu RENAMED

@@ -27,7 +27,7 @@ at::Tensor BatchMLAPagedAttentionSM90Plan(at::Tensor float_workspace_buffer,
                                           at::Tensor page_locked_int_workspace_buffer,
                                           at::Tensor qo_indptr, at::Tensor kv_indptr,
                                           at::Tensor kv_len, int64_t num_heads, int64_t head_dim_o,
-                                          bool causal, int64_t cuda_stream) {
+                                          bool causal) {
   size_t float_workspace_size_in_bytes =
       float_workspace_buffer.size(0) * float_workspace_buffer.element_size();
   size_t int_workspace_size_in_bytes =
@@ -37,7 +37,9 @@ at::Tensor BatchMLAPagedAttentionSM90Plan(at::Tensor float_workspace_buffer,
   int batch_size = kv_len.size(0);
-  cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream);
+  const c10::cuda::OptionalCUDAGuard device_guard(float_workspace_buffer.device());
+  const cudaStream_t stream = c10::cuda::getCurrentCUDAStream();
   cudaError_t status =
       MLAPlan(float_workspace_buffer.data_ptr(), float_workspace_size_in_bytes,
               int_workspace_buffer.data_ptr(), page_locked_int_workspace_buffer.data_ptr(),

{flashinfer_python-0.2.3 → flashinfer_python-0.2.4}/csrc/batch_mla_sm90_pybind.cu RENAMED

@@ -21,7 +21,7 @@ at::Tensor BatchMLAPagedAttentionSM90Plan(at::Tensor float_workspace_buffer,
                                           at::Tensor page_locked_int_workspace_buffer,
                                           at::Tensor qo_indptr, at::Tensor kv_indptr,
                                           at::Tensor kv_len, int64_t num_heads, int64_t head_dim_o,
-                                          bool causal, int64_t cuda_stream);
+                                          bool causal);
 void BatchMLAPagedAttentionSM90Run(at::Tensor float_workspace_buffer,
                                    at::Tensor int_workspace_buffer, at::Tensor plan_info_vec,
@@ -29,7 +29,7 @@ void BatchMLAPagedAttentionSM90Run(at::Tensor float_workspace_buffer,
                                    at::Tensor kpe_cache, at::Tensor kv_indices, at::Tensor o,
                                    std::optional<at::Tensor> maybe_lse, int64_t mask_mode_code,
                                    int64_t num_heads, int64_t page_size,
-                                   double sm_scale ADDITIONAL_FUNC_PARAMS, int64_t cuda_stream);
+                                   double sm_scale ADDITIONAL_FUNC_PARAMS);
 TORCH_LIBRARY_FRAGMENT(TORCH_EXTENSION_NAME, m) {
   m.def("plan", &BatchMLAPagedAttentionSM90Plan);

{flashinfer_python-0.2.3 → flashinfer_python-0.2.4}/csrc/batch_mla_sm90_run.cu RENAMED

@@ -30,7 +30,7 @@ void BatchMLAPagedAttentionSM90Run(at::Tensor float_workspace_buffer,
                                    at::Tensor kpe_cache, at::Tensor kv_indices, at::Tensor o,
                                    std::optional<at::Tensor> maybe_lse, int64_t mask_mode_code,
                                    int64_t num_heads, int64_t page_size,
-                                   double sm_scale ADDITIONAL_FUNC_PARAMS, int64_t cuda_stream) {
+                                   double sm_scale ADDITIONAL_FUNC_PARAMS) {
   // q_nope: [n, num_heads, head_dim_ckv]
   // q_pe: [n, num_heads, head_dim_kpe]
   // ckv_cache: [num_pages, page_size, head_dim_ckv]
@@ -59,7 +59,8 @@ void BatchMLAPagedAttentionSM90Run(at::Tensor float_workspace_buffer,
   unsigned int o_stride_n = o.stride(0);
   unsigned int o_stride_h = o.stride(1);
-  cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream);
+  const c10::cuda::OptionalCUDAGuard device_guard(device);
+  const cudaStream_t stream = c10::cuda::getCurrentCUDAStream();
   DISPATCH_context(
       DTypeQ, DTypeKV, DTypeO, IdType, MASK_MODE, HEAD_DIM_CKV, HEAD_DIM_KPE, Params, [&] {

{flashinfer_python-0.2.3 → flashinfer_python-0.2.4}/csrc/batch_prefill.cu RENAMED

@@ -19,8 +19,8 @@
 #include <optional>
 #include "batch_prefill_config.inc"
-#include "pytorch_extension_utils.h"
 #include "pytorch_conversion_utils.h"
+#include "pytorch_extension_utils.h"
 namespace flashinfer {
@@ -43,10 +43,9 @@ using namespace flashinfer;
 at::Tensor BatchPrefillWithKVCachePlan(
     at::Tensor float_workspace_buffer, at::Tensor int_workspace_buffer,
     at::Tensor page_locked_int_workspace_buffer, at::Tensor qo_indptr, at::Tensor kv_indptr,
-    at::Tensor kv_len_arr, int64_t total_num_rows, int64_t batch_size,
-    int64_t num_qo_heads, int64_t num_kv_heads, int64_t page_size,
-    bool enable_cuda_graph, int64_t head_dim_qk, int64_t head_dim_vo, bool causal,
-    int64_t cuda_stream) {
+    at::Tensor kv_len_arr, int64_t total_num_rows, int64_t batch_size, int64_t num_qo_heads,
+    int64_t num_kv_heads, int64_t page_size, bool enable_cuda_graph, int64_t head_dim_qk,
+    int64_t head_dim_vo, bool causal) {
   size_t float_workspace_size_in_bytes =
       float_workspace_buffer.size(0) * float_workspace_buffer.element_size();
   size_t int_workspace_size_in_bytes =
@@ -54,7 +53,8 @@ at::Tensor BatchPrefillWithKVCachePlan(
   PrefillPlanInfo plan_info;
-  cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream);
+  const c10::cuda::OptionalCUDAGuard device_guard(float_workspace_buffer.device());
+  const cudaStream_t stream = c10::cuda::getCurrentCUDAStream();
   cudaError_t status = PrefillPlan<IdType>(
       float_workspace_buffer.data_ptr(), float_workspace_size_in_bytes,
       int_workspace_buffer.data_ptr(), page_locked_int_workspace_buffer.data_ptr(),
@@ -68,12 +68,12 @@ at::Tensor BatchPrefillWithKVCachePlan(
   return vec_to_tensor(plan_info.ToVector());
 }
-void BatchPrefillWithRaggedKVCacheRun(
-    at::Tensor float_workspace_buffer, at::Tensor int_workspace_buffer,
-    at::Tensor plan_info_vec, at::Tensor q, at::Tensor k, at::Tensor v,
-    at::Tensor qo_indptr, at::Tensor kv_indptr, at::Tensor o, std::optional<at::Tensor> maybe_lse,
-    int64_t mask_mode_code, int64_t layout, int64_t window_left ADDITIONAL_FUNC_PARAMS,
-    int64_t cuda_stream) {
+void BatchPrefillWithRaggedKVCacheRun(at::Tensor float_workspace_buffer,
+                                      at::Tensor int_workspace_buffer, at::Tensor plan_info_vec,
+                                      at::Tensor q, at::Tensor k, at::Tensor v,
+                                      at::Tensor qo_indptr, at::Tensor kv_indptr, at::Tensor o,
+                                      std::optional<at::Tensor> maybe_lse, int64_t mask_mode_code,
+                                      int64_t layout, int64_t window_left ADDITIONAL_FUNC_PARAMS) {
   PrefillPlanInfo plan_info;
   plan_info.FromVector(tensor_to_vec(plan_info_vec));
   QKVLayout kv_layout = static_cast<QKVLayout>(layout);
@@ -109,7 +109,8 @@ void BatchPrefillWithRaggedKVCacheRun(
   auto q_scalar_type = q.scalar_type();
   auto kv_scalar_type = k.scalar_type();
-  cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream);
+  const c10::cuda::OptionalCUDAGuard device_guard(float_workspace_buffer.device());
+  const cudaStream_t stream = c10::cuda::getCurrentCUDAStream();
   DISPATCH_context(
       DTypeQ, DTypeKV, DTypeO, IdType, MASK_MODE, HEAD_DIM_QK, HEAD_DIM_VO, POS_ENCODING_MODE,
@@ -193,13 +194,14 @@ void BatchPrefillWithRaggedKVCacheRun(
       });
 }
-void BatchPrefillWithPagedKVCacheRun(
-    at::Tensor float_workspace_buffer, at::Tensor int_workspace_buffer,
-    at::Tensor plan_info_vec, at::Tensor q, at::Tensor paged_k_cache,
-    at::Tensor paged_v_cache, at::Tensor qo_indptr, at::Tensor paged_kv_indptr,
-    at::Tensor paged_kv_indices, at::Tensor paged_kv_last_page_len, at::Tensor o,
-    std::optional<at::Tensor> maybe_lse, int64_t mask_mode_code, int64_t layout,
-    int64_t window_left ADDITIONAL_FUNC_PARAMS, int64_t cuda_stream) {
+void BatchPrefillWithPagedKVCacheRun(at::Tensor float_workspace_buffer,
+                                     at::Tensor int_workspace_buffer, at::Tensor plan_info_vec,
+                                     at::Tensor q, at::Tensor paged_k_cache,
+                                     at::Tensor paged_v_cache, at::Tensor qo_indptr,
+                                     at::Tensor paged_kv_indptr, at::Tensor paged_kv_indices,
+                                     at::Tensor paged_kv_last_page_len, at::Tensor o,
+                                     std::optional<at::Tensor> maybe_lse, int64_t mask_mode_code,
+                                     int64_t layout, int64_t window_left ADDITIONAL_FUNC_PARAMS) {
   PrefillPlanInfo plan_info;
   plan_info.FromVector(tensor_to_vec(plan_info_vec));
   QKVLayout kv_layout = static_cast<QKVLayout>(layout);
@@ -240,7 +242,8 @@ void BatchPrefillWithPagedKVCacheRun(
   TORCH_CHECK(k_strides == v_strides, "k/v strides must be identical");
   kv_cache_strides = k_strides.data();
-  cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream);
+  const c10::cuda::OptionalCUDAGuard device_guard(float_workspace_buffer.device());
+  const cudaStream_t stream = c10::cuda::getCurrentCUDAStream();
   DISPATCH_context(
       DTypeQ, DTypeKV, DTypeO, IdType, MASK_MODE, HEAD_DIM_QK, HEAD_DIM_VO, POS_ENCODING_MODE,

flashinfer_python-0.2.4/csrc/batch_prefill_jit_pybind.cu ADDED

@@ -0,0 +1,49 @@
+/*
+ * Copyright (c) 2023-2025 by FlashInfer team.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+#include "batch_prefill_config.inc"
+#include "pytorch_extension_utils.h"
+at::Tensor BatchPrefillWithKVCachePlan(
+    at::Tensor float_workspace_buffer, at::Tensor int_workspace_buffer,
+    at::Tensor page_locked_int_workspace_buffer, at::Tensor qo_indptr, at::Tensor kv_indptr,
+    at::Tensor kv_len_arr, int64_t total_num_rows, int64_t batch_size, int64_t num_qo_heads,
+    int64_t num_kv_heads, int64_t page_size, bool enable_cuda_graph, int64_t head_dim_qk,
+    int64_t head_dim_vo, bool causal);
+void BatchPrefillWithRaggedKVCacheRun(at::Tensor float_workspace_buffer,
+                                      at::Tensor int_workspace_buffer, at::Tensor plan_info_vec,
+                                      at::Tensor q, at::Tensor k, at::Tensor v,
+                                      at::Tensor qo_indptr, at::Tensor kv_indptr, at::Tensor o,
+                                      std::optional<at::Tensor> maybe_lse, int64_t mask_mode_code,
+                                      int64_t layout, int64_t window_left ADDITIONAL_FUNC_PARAMS);
+void BatchPrefillWithPagedKVCacheRun(at::Tensor float_workspace_buffer,
+                                     at::Tensor int_workspace_buffer, at::Tensor plan_info_vec,
+                                     at::Tensor q, at::Tensor paged_k_cache,
+                                     at::Tensor paged_v_cache, at::Tensor qo_indptr,
+                                     at::Tensor paged_kv_indptr, at::Tensor paged_kv_indices,
+                                     at::Tensor paged_kv_last_page_len, at::Tensor o,
+                                     std::optional<at::Tensor> maybe_lse, int64_t mask_mode_code,
+                                     int64_t layout, int64_t window_left ADDITIONAL_FUNC_PARAMS);
+TORCH_LIBRARY_FRAGMENT(TORCH_EXTENSION_NAME, m) {
+  // Batch-request prefill attention with KV-Cache plan
+  m.def("plan", BatchPrefillWithKVCachePlan);
+  // Batch-request prefill attention with KV-Cache operator
+  m.def("ragged_run", BatchPrefillWithRaggedKVCacheRun);
+  // Batch-request prefill attention with KV-Cache operator
+  m.def("paged_run", BatchPrefillWithPagedKVCacheRun);
+}
