flashinfer-python 0.2.2__tar.gz → 0.2.3__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- flashinfer_python-0.2.3/PKG-INFO +198 -0
- flashinfer_python-0.2.3/csrc/activation.cu +125 -0
- flashinfer_python-0.2.3/csrc/batch_mla_config.jinja +33 -0
- flashinfer_python-0.2.3/csrc/batch_mla_run.cu +124 -0
- flashinfer_python-0.2.3/csrc/batch_mla_sm90_pybind.cu +37 -0
- flashinfer_python-0.2.3/csrc/batch_mla_sm90_run.cu +128 -0
- flashinfer_python-0.2.3/csrc/flashinfer_norm_ops.cu +39 -0
- flashinfer_python-0.2.3/csrc/flashinfer_ops.cu +297 -0
- flashinfer_python-0.2.3/csrc/flashinfer_sampling_ops.cu +84 -0
- flashinfer_python-0.2.3/csrc/generated/aot_default_additional_params.h +45 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_decode_head_qk_128_head_vo_128_posenc_0_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32.cu +40 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_decode_head_qk_128_head_vo_128_posenc_0_dtypeq_bf16_dtypekv_e4m3_dtypeout_bf16_idtype_i32.cu +40 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_decode_head_qk_128_head_vo_128_posenc_0_dtypeq_bf16_dtypekv_e5m2_dtypeout_bf16_idtype_i32.cu +40 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_decode_head_qk_128_head_vo_128_posenc_0_dtypeq_e4m3_dtypekv_e4m3_dtypeout_e4m3_idtype_i32.cu +40 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_decode_head_qk_128_head_vo_128_posenc_0_dtypeq_e5m2_dtypekv_e5m2_dtypeout_e5m2_idtype_i32.cu +40 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_decode_head_qk_128_head_vo_128_posenc_0_dtypeq_f16_dtypekv_e4m3_dtypeout_f16_idtype_i32.cu +40 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_decode_head_qk_128_head_vo_128_posenc_0_dtypeq_f16_dtypekv_e5m2_dtypeout_f16_idtype_i32.cu +40 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_decode_head_qk_128_head_vo_128_posenc_0_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32.cu +40 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_decode_head_qk_256_head_vo_256_posenc_0_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32.cu +40 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_decode_head_qk_256_head_vo_256_posenc_0_dtypeq_bf16_dtypekv_e4m3_dtypeout_bf16_idtype_i32.cu +40 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_decode_head_qk_256_head_vo_256_posenc_0_dtypeq_bf16_dtypekv_e5m2_dtypeout_bf16_idtype_i32.cu +40 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_decode_head_qk_256_head_vo_256_posenc_0_dtypeq_e4m3_dtypekv_e4m3_dtypeout_e4m3_idtype_i32.cu +40 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_decode_head_qk_256_head_vo_256_posenc_0_dtypeq_e5m2_dtypekv_e5m2_dtypeout_e5m2_idtype_i32.cu +40 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_decode_head_qk_256_head_vo_256_posenc_0_dtypeq_f16_dtypekv_e4m3_dtypeout_f16_idtype_i32.cu +40 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_decode_head_qk_256_head_vo_256_posenc_0_dtypeq_f16_dtypekv_e5m2_dtypeout_f16_idtype_i32.cu +40 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_decode_head_qk_256_head_vo_256_posenc_0_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32.cu +40 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32_sm90.cu +100 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_e4m3_dtypeout_bf16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_e5m2_dtypeout_bf16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_e4m3_dtypeout_f16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_e5m2_dtypeout_f16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32_sm90.cu +100 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32_sm90.cu +100 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_e4m3_dtypeout_bf16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_e5m2_dtypeout_bf16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_e4m3_dtypeout_f16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_e5m2_dtypeout_f16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32_sm90.cu +100 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32_sm90.cu +100 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_e4m3_dtypeout_bf16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_e5m2_dtypeout_bf16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_e4m3_dtypeout_f16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_e5m2_dtypeout_f16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32_sm90.cu +100 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_192_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32_sm90.cu +100 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_192_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32_sm90.cu +100 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_192_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32_sm90.cu +100 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_192_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32_sm90.cu +100 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_192_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32_sm90.cu +100 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_192_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32_sm90.cu +100 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32_sm90.cu +100 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_e4m3_dtypeout_bf16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_e5m2_dtypeout_bf16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_e4m3_dtypeout_f16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_e5m2_dtypeout_f16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32_sm90.cu +100 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32_sm90.cu +100 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_e4m3_dtypeout_bf16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_e5m2_dtypeout_bf16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_e4m3_dtypeout_f16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_e5m2_dtypeout_f16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32_sm90.cu +100 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32_sm90.cu +100 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_e4m3_dtypeout_bf16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_e5m2_dtypeout_bf16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_e4m3_dtypeout_f16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_e5m2_dtypeout_f16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32.cu +79 -0
- flashinfer_python-0.2.3/csrc/generated/batch_paged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32_sm90.cu +100 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32_sm90.cu +93 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_e4m3_dtypeout_bf16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_e5m2_dtypeout_bf16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_e4m3_dtypeout_f16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_e5m2_dtypeout_f16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32_sm90.cu +93 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32_sm90.cu +93 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_e4m3_dtypeout_bf16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_e5m2_dtypeout_bf16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_e4m3_dtypeout_f16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_e5m2_dtypeout_f16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32_sm90.cu +93 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32_sm90.cu +93 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_e4m3_dtypeout_bf16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_e5m2_dtypeout_bf16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_e4m3_dtypeout_f16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_e5m2_dtypeout_f16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32_sm90.cu +93 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_192_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32_sm90.cu +93 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_192_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32_sm90.cu +93 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_192_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32_sm90.cu +93 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_192_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32_sm90.cu +93 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_192_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32_sm90.cu +93 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_192_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32_sm90.cu +93 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32_sm90.cu +93 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_e4m3_dtypeout_bf16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_e5m2_dtypeout_bf16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_e4m3_dtypeout_f16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_e5m2_dtypeout_f16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32_sm90.cu +93 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32_sm90.cu +93 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_e4m3_dtypeout_bf16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_e5m2_dtypeout_bf16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_e4m3_dtypeout_f16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_e5m2_dtypeout_f16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32_sm90.cu +93 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_idtype_i32_sm90.cu +93 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_e4m3_dtypeout_bf16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_e5m2_dtypeout_bf16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_e4m3_dtypeout_f16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_e5m2_dtypeout_f16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32.cu +80 -0
- flashinfer_python-0.2.3/csrc/generated/batch_ragged_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_f16_dtypeout_f16_idtype_i32_sm90.cu +93 -0
- flashinfer_python-0.2.3/csrc/generated/dispatch.inc +24 -0
- flashinfer_python-0.2.3/csrc/generated/single_decode_head_qk_128_head_vo_128_posenc_0_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_decode_head_qk_128_head_vo_128_posenc_0_dtypeq_bf16_dtypekv_e4m3_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_decode_head_qk_128_head_vo_128_posenc_0_dtypeq_bf16_dtypekv_e5m2_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_decode_head_qk_128_head_vo_128_posenc_0_dtypeq_e4m3_dtypekv_e4m3_dtypeout_e4m3.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_decode_head_qk_128_head_vo_128_posenc_0_dtypeq_e5m2_dtypekv_e5m2_dtypeout_e5m2.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_decode_head_qk_128_head_vo_128_posenc_0_dtypeq_f16_dtypekv_e4m3_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_decode_head_qk_128_head_vo_128_posenc_0_dtypeq_f16_dtypekv_e5m2_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_decode_head_qk_128_head_vo_128_posenc_0_dtypeq_f16_dtypekv_f16_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_decode_head_qk_256_head_vo_256_posenc_0_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_decode_head_qk_256_head_vo_256_posenc_0_dtypeq_bf16_dtypekv_e4m3_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_decode_head_qk_256_head_vo_256_posenc_0_dtypeq_bf16_dtypekv_e5m2_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_decode_head_qk_256_head_vo_256_posenc_0_dtypeq_e4m3_dtypekv_e4m3_dtypeout_e4m3.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_decode_head_qk_256_head_vo_256_posenc_0_dtypeq_e5m2_dtypekv_e5m2_dtypeout_e5m2.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_decode_head_qk_256_head_vo_256_posenc_0_dtypeq_f16_dtypekv_e4m3_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_decode_head_qk_256_head_vo_256_posenc_0_dtypeq_f16_dtypekv_e5m2_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_decode_head_qk_256_head_vo_256_posenc_0_dtypeq_f16_dtypekv_f16_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_sm90.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_e4m3_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_e5m2_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_e4m3_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_e5m2_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_f16_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_f16_dtypeout_f16_sm90.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_sm90.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_e4m3_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_e5m2_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_e4m3_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_e5m2_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_f16_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_f16_dtypeout_f16_sm90.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_sm90.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_e4m3_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_e5m2_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_e4m3_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_e5m2_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_f16_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_128_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_f16_dtypeout_f16_sm90.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_192_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_sm90.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_192_head_vo_128_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_f16_dtypeout_f16_sm90.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_192_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_sm90.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_192_head_vo_128_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_f16_dtypeout_f16_sm90.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_192_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_sm90.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_192_head_vo_128_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_f16_dtypeout_f16_sm90.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_sm90.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_e4m3_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_bf16_dtypekv_e5m2_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_e4m3_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_e5m2_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_f16_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_0_dtypeq_f16_dtypekv_f16_dtypeout_f16_sm90.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_sm90.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_e4m3_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_bf16_dtypekv_e5m2_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_e4m3_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_e5m2_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_f16_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_1_dtypeq_f16_dtypekv_f16_dtypeout_f16_sm90.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_bf16_dtypeout_bf16_sm90.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_e4m3_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_bf16_dtypekv_e5m2_dtypeout_bf16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_e4m3_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_e5m2_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_f16_dtypeout_f16.cu +32 -0
- flashinfer_python-0.2.3/csrc/generated/single_prefill_head_qk_256_head_vo_256_posenc_0_fp16qkred_0_mask_2_dtypeq_f16_dtypekv_f16_dtypeout_f16_sm90.cu +32 -0
- flashinfer_python-0.2.3/csrc/norm.cu +131 -0
- flashinfer_python-0.2.3/csrc/sampling.cu +227 -0
- flashinfer_python-0.2.3/flashinfer/_build_meta.py +1 -0
- flashinfer_python-0.2.3/flashinfer/activation.py +215 -0
- flashinfer_python-0.2.3/flashinfer/gemm.py +653 -0
- flashinfer_python-0.2.3/flashinfer/jit/__init__.py +76 -0
- flashinfer_python-0.2.3/flashinfer/jit/activation.py +89 -0
- flashinfer_python-0.2.3/flashinfer/jit/attention/pytorch.py +1089 -0
- flashinfer_python-0.2.3/flashinfer/mla.py +371 -0
- flashinfer_python-0.2.3/flashinfer/norm.py +260 -0
- flashinfer_python-0.2.3/flashinfer/page.py +350 -0
- flashinfer_python-0.2.3/flashinfer/sampling.py +1243 -0
- flashinfer_python-0.2.3/flashinfer/sparse.py +620 -0
- flashinfer_python-0.2.3/flashinfer/triton/gemm.py +104 -0
- flashinfer_python-0.2.3/flashinfer/triton/page.py +42 -0
- flashinfer_python-0.2.3/flashinfer_python.egg-info/PKG-INFO +198 -0
- flashinfer_python-0.2.3/flashinfer_python.egg-info/SOURCES.txt +1491 -0
- flashinfer_python-0.2.3/include/flashinfer/activation.cuh +69 -0
- flashinfer_python-0.2.3/include/flashinfer/attention/hopper/attention_updater.cuh +258 -0
- flashinfer_python-0.2.3/include/flashinfer/attention/hopper/default_params.cuh +158 -0
- flashinfer_python-0.2.3/include/flashinfer/attention/hopper/named_barrier.cuh +113 -0
- flashinfer_python-0.2.3/include/flashinfer/attention/hopper/quantization/epilogue.cuh +211 -0
- flashinfer_python-0.2.3/include/flashinfer/attention/hopper/quantization/kernel_traits.cuh +170 -0
- flashinfer_python-0.2.3/include/flashinfer/attention/hopper/quantization/mainloop_load.cuh +395 -0
- flashinfer_python-0.2.3/include/flashinfer/attention/hopper/quantization/mainloop_mma.cuh +240 -0
- flashinfer_python-0.2.3/include/flashinfer/attention/hopper/quantization/prefill_sm90.cuh +357 -0
- flashinfer_python-0.2.3/include/flashinfer/attention/hopper/utils.cuh +197 -0
- flashinfer_python-0.2.3/include/flashinfer/attention/hopper/variants.cuh +107 -0
- flashinfer_python-0.2.3/include/flashinfer/attention/mla.cuh +1045 -0
- flashinfer_python-0.2.3/include/flashinfer/attention/mla_hopper.cuh +982 -0
- flashinfer_python-0.2.3/include/flashinfer/attention/mla_params.cuh +78 -0
- flashinfer_python-0.2.3/include/flashinfer/attention/scheduler.cuh +1323 -0
- flashinfer_python-0.2.3/include/flashinfer/norm.cuh +349 -0
- flashinfer_python-0.2.3/include/flashinfer/profiler.cuh +135 -0
- flashinfer_python-0.2.3/include/flashinfer/sampling.cuh +1492 -0
- flashinfer_python-0.2.3/pyproject.toml +119 -0
- flashinfer_python-0.2.3/tests/test_activation.py +66 -0
- flashinfer_python-0.2.3/tests/test_deepseek_mla.py +457 -0
- flashinfer_python-0.2.3/tests/test_norm.py +202 -0
- flashinfer_python-0.2.3/tests/test_sampling.py +479 -0
- flashinfer_python-0.2.3/tvm_binding/batch_decode.cu +217 -0
- flashinfer_python-0.2.3/tvm_binding/batch_decode_customize_config.jinja +64 -0
- flashinfer_python-0.2.3/tvm_binding/batch_decode_jit_tvm_binding.cu +34 -0
- flashinfer_python-0.2.3/tvm_binding/batch_mla_config.jinja +27 -0
- flashinfer_python-0.2.3/tvm_binding/batch_mla_jit_tvm_binding.cu +32 -0
- flashinfer_python-0.2.3/tvm_binding/batch_mla_plan.cu +55 -0
- flashinfer_python-0.2.3/tvm_binding/batch_mla_run.cu +130 -0
- flashinfer_python-0.2.3/tvm_binding/batch_prefill.cu +381 -0
- flashinfer_python-0.2.3/tvm_binding/batch_prefill_customize_config.jinja +126 -0
- flashinfer_python-0.2.3/tvm_binding/batch_prefill_jit_tvm_binding.cu +49 -0
- flashinfer_python-0.2.3/tvm_binding/batch_prefill_sm90.cu +323 -0
- flashinfer_python-0.2.3/tvm_binding/batch_prefill_sm90_customize_config.jinja +122 -0
- flashinfer_python-0.2.3/tvm_binding/batch_prefill_sm90_jit_tvm_binding.cu +45 -0
- flashinfer_python-0.2.3/tvm_binding/tvm_binding_utils.h +38 -0
- flashinfer_python-0.2.3/version.txt +1 -0
- flashinfer_python-0.2.2/PKG-INFO +0 -198
- flashinfer_python-0.2.2/csrc/activation.cu +0 -81
- flashinfer_python-0.2.2/csrc/batch_mla_config.jinja +0 -24
- flashinfer_python-0.2.2/csrc/batch_mla_run.cu +0 -120
- flashinfer_python-0.2.2/csrc/batch_mla_sm90_pybind.cu +0 -37
- flashinfer_python-0.2.2/csrc/batch_mla_sm90_run.cu +0 -122
- flashinfer_python-0.2.2/csrc/flashinfer_norm_ops.cu +0 -39
- flashinfer_python-0.2.2/csrc/flashinfer_ops.cu +0 -293
- flashinfer_python-0.2.2/csrc/flashinfer_sampling_ops.cu +0 -76
- flashinfer_python-0.2.2/csrc/norm.cu +0 -127
- flashinfer_python-0.2.2/csrc/sampling.cu +0 -185
- flashinfer_python-0.2.2/flashinfer/_build_meta.py +0 -1
- flashinfer_python-0.2.2/flashinfer/activation.py +0 -192
- flashinfer_python-0.2.2/flashinfer/gemm.py +0 -737
- flashinfer_python-0.2.2/flashinfer/jit/__init__.py +0 -68
- flashinfer_python-0.2.2/flashinfer/jit/activation.py +0 -74
- flashinfer_python-0.2.2/flashinfer/jit/attention/pytorch.py +0 -1084
- flashinfer_python-0.2.2/flashinfer/mla.py +0 -355
- flashinfer_python-0.2.2/flashinfer/norm.py +0 -212
- flashinfer_python-0.2.2/flashinfer/page.py +0 -371
- flashinfer_python-0.2.2/flashinfer/sampling.py +0 -1225
- flashinfer_python-0.2.2/flashinfer/sparse.py +0 -619
- flashinfer_python-0.2.2/flashinfer_python.egg-info/PKG-INFO +0 -198
- flashinfer_python-0.2.2/flashinfer_python.egg-info/SOURCES.txt +0 -1057
- flashinfer_python-0.2.2/include/flashinfer/activation.cuh +0 -61
- flashinfer_python-0.2.2/include/flashinfer/attention/hopper/attention_updater.cuh +0 -257
- flashinfer_python-0.2.2/include/flashinfer/attention/hopper/default_params.cuh +0 -155
- flashinfer_python-0.2.2/include/flashinfer/attention/hopper/named_barrier.cuh +0 -112
- flashinfer_python-0.2.2/include/flashinfer/attention/hopper/utils.cuh +0 -165
- flashinfer_python-0.2.2/include/flashinfer/attention/hopper/variants.cuh +0 -69
- flashinfer_python-0.2.2/include/flashinfer/attention/mla.cuh +0 -1040
- flashinfer_python-0.2.2/include/flashinfer/attention/mla_hopper.cuh +0 -894
- flashinfer_python-0.2.2/include/flashinfer/attention/mla_params.cuh +0 -72
- flashinfer_python-0.2.2/include/flashinfer/attention/scheduler.cuh +0 -1261
- flashinfer_python-0.2.2/include/flashinfer/norm.cuh +0 -270
- flashinfer_python-0.2.2/include/flashinfer/sampling.cuh +0 -1408
- flashinfer_python-0.2.2/pyproject.toml +0 -117
- flashinfer_python-0.2.2/tests/test_activation.py +0 -53
- flashinfer_python-0.2.2/tests/test_deepseek_mla.py +0 -457
- flashinfer_python-0.2.2/tests/test_norm.py +0 -144
- flashinfer_python-0.2.2/tests/test_sampling.py +0 -397
- flashinfer_python-0.2.2/version.txt +0 -1
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/algorithm/axpby.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/algorithm/clear.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/algorithm/cooperative_copy.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/algorithm/cooperative_gemm.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/algorithm/copy.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/algorithm/fill.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/algorithm/functional.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/algorithm/gemm.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/algorithm/prefer.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/algorithm/prefetch.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/algorithm/tensor_algorithms.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/algorithm/tuple_algorithms.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/arch/cluster_sm90.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/arch/config.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/arch/copy.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/arch/copy_sm50.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/arch/copy_sm75.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/arch/copy_sm80.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/arch/copy_sm90.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/arch/copy_sm90_desc.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/arch/copy_sm90_tma.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/arch/mma.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/arch/mma_sm61.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/arch/mma_sm70.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/arch/mma_sm75.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/arch/mma_sm80.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/arch/mma_sm90.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/arch/mma_sm90_desc.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/arch/mma_sm90_gmma.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/arch/mma_sm90_gmma_sparse.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/arch/util.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/atom/copy_atom.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/atom/copy_traits.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/atom/copy_traits_sm50.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/atom/copy_traits_sm75.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/atom/copy_traits_sm80.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/atom/copy_traits_sm90.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/atom/copy_traits_sm90_im2col.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/atom/copy_traits_sm90_tma.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/atom/copy_traits_sm90_tma_swizzle.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/atom/mma_atom.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/atom/mma_traits.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/atom/mma_traits_sm61.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/atom/mma_traits_sm70.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/atom/mma_traits_sm75.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/atom/mma_traits_sm80.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/atom/mma_traits_sm90.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/atom/mma_traits_sm90_gmma.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/atom/mma_traits_sm90_gmma_sparse.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/config.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/container/alignment.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/container/array.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/container/array_aligned.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/container/array_subbyte.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/container/bit_field.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/container/cuda_types.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/container/packed_tuple.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/container/tuple.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/container/type_list.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/int_tuple.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/layout.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/layout_composed.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/numeric/arithmetic_tuple.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/numeric/complex.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/numeric/int.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/numeric/integer_sequence.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/numeric/integral_constant.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/numeric/integral_ratio.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/numeric/math.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/numeric/numeric_types.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/numeric/real.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/pointer.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/pointer_base.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/pointer_flagged.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/pointer_sparse.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/pointer_swizzle.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/stride.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/swizzle.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/swizzle_layout.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/tensor.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/tensor_impl.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/tensor_predicate.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/tensor_zip.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/underscore.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/util/debug.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/util/print.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cute/util/type_traits.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/aligned_buffer.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/arch.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/barrier.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/cache_operation.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/config.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/grid_dependency_control.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/memory.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/memory_sm75.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/memory_sm80.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/mma.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/mma_sm50.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/mma_sm60.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/mma_sm61.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/mma_sm70.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/mma_sm75.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/mma_sm80.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/mma_sm89.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/mma_sm90.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/mma_sparse_sm80.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/mma_sparse_sm89.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/reg_reconfig.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/simd.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/simd_sm60.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/simd_sm61.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/synclog.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/wmma.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/wmma_sm70.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/wmma_sm72.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/arch/wmma_sm75.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/array.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/array_planar_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/array_subbyte.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/barrier.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/bfloat16.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/blas3.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/blas3_types.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/block_striped.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/cluster_launch.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/constants.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/collective/builders/sm90_common.inl +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/collective/builders/sm90_gmma_builder.inl +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/collective/collective_builder.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/collective/collective_conv.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/collective/detail.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/collective/sm90_implicit_gemm_gmma_ss_warpspecialized.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/conv2d_problem_size.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/conv3d_problem_size.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/convnd_problem_shape.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/convolution.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/detail.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/device/direct_convolution.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/device/implicit_gemm_convolution.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/device/implicit_gemm_convolution_fusion.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/dispatch_policy.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/conv_universal.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/default_conv2d.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/default_conv2d_dgrad.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/default_conv2d_fprop.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/default_conv2d_fprop_fusion.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/default_conv2d_fprop_with_absmax.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/default_conv2d_fprop_with_broadcast.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/default_conv2d_fprop_with_reduction.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/default_conv2d_group_fprop.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/default_conv2d_wgrad.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/default_conv2d_wgrad_fusion.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/default_conv3d_dgrad.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/default_conv3d_fprop.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/default_conv3d_fprop_fusion.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/default_conv3d_fprop_with_broadcast.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/default_conv3d_wgrad.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/default_deconv2d.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/default_deconv2d_with_broadcast.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/default_deconv3d.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/default_deconv3d_with_broadcast.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/default_depthwise_fprop.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/direct_convolution.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/implicit_gemm_convolution.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/implicit_gemm_convolution_fusion.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/implicit_gemm_convolution_strided_dgrad.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/implicit_gemm_convolution_with_absmax.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/implicit_gemm_convolution_with_fused_epilogue.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/thread/depthwise_mma.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv2d_dgrad_filter_tile_access_iterator_analytic.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv2d_dgrad_filter_tile_access_iterator_optimized.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv2d_dgrad_output_gradient_tile_access_iterator_analytic.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv2d_dgrad_output_gradient_tile_access_iterator_optimized.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_analytic.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_few_channels.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_fixed_channels.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_optimized.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv2d_fprop_filter_tile_access_iterator_analytic.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv2d_fprop_filter_tile_access_iterator_few_channels.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv2d_fprop_filter_tile_access_iterator_fixed_channels.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv2d_fprop_filter_tile_access_iterator_optimized.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv2d_params.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv2d_tile_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv2d_wgrad_activation_tile_access_iterator_analytic.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv2d_wgrad_activation_tile_access_iterator_optimized.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv2d_wgrad_output_gradient_tile_access_iterator_analytic.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv2d_wgrad_output_gradient_tile_access_iterator_optimized.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv3d_dgrad_filter_tile_access_iterator_analytic.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv3d_dgrad_filter_tile_access_iterator_optimized.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv3d_dgrad_output_gradient_tile_access_iterator_analytic.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv3d_dgrad_output_gradient_tile_access_iterator_optimized.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv3d_fprop_activation_tile_access_iterator_analytic.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv3d_fprop_activation_tile_access_iterator_optimized.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv3d_fprop_filter_tile_access_iterator_analytic.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv3d_fprop_filter_tile_access_iterator_optimized.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv3d_params.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv3d_wgrad_activation_tile_access_iterator_analytic.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv3d_wgrad_activation_tile_access_iterator_optimized.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv3d_wgrad_output_gradient_tile_access_iterator_analytic.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/conv3d_wgrad_output_gradient_tile_access_iterator_optimized.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/depthwise_direct_conv_params.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/depthwise_fprop_activation_tile_access_iterator_direct_conv_fixed_stride_dilation.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/depthwise_fprop_activation_tile_access_iterator_direct_conv_optimized.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/depthwise_fprop_direct_conv_multistage.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/depthwise_fprop_filter_tile_access_iterator_direct_conv_optimized.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/depthwise_fprop_pipelined.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/depthwise_mma_base.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/depthwise_mma_core_with_lane_access_size.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/implicit_gemm_fprop_fusion_multistage.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/implicit_gemm_multistage.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/implicit_gemm_pipelined.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/implicit_gemm_wgrad_fusion_multistage.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/predicated_scale_bias_vector_access_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/predicated_scale_bias_vector_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/threadblock/threadblock_swizzle.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/warp/mma_depthwise_simt.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/warp/mma_depthwise_simt_tile_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/conv/warp/scale_bias_relu_transform.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/coord.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/core_io.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/cuda_host_adapter.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/cutlass.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/detail/collective.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/detail/dependent_false.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/detail/helper_macros.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/detail/layout.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/detail/mma.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/device_kernel.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/collective/builders/sm90_builder.inl +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/collective/builders/sm90_common.inl +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/collective/collective_builder.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/collective/collective_epilogue.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/collective/default_epilogue.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/collective/default_epilogue_array.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/collective/detail.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/collective/epilogue_tensor_broadcast.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/collective/sm70_epilogue_vectorized.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/collective/sm70_epilogue_vectorized_array.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized_bias_elementwise.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/dispatch_policy.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/fusion/callbacks.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/fusion/operations.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/fusion/sm90_callbacks_tma_warpspecialized.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/fusion/sm90_visitor_compute_tma_warpspecialized.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/fusion/sm90_visitor_load_tma_warpspecialized.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/fusion/sm90_visitor_store_tma_warpspecialized.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/fusion/sm90_visitor_tma_warpspecialized.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/activation.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/conversion_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/detail.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/linear_combination.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/linear_combination_bias_elementwise.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/linear_combination_bias_relu.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/linear_combination_clamp.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/linear_combination_dgelu.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/linear_combination_drelu.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/linear_combination_gelu.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/linear_combination_generic.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/linear_combination_generic_with_scaling.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/linear_combination_hardswish.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/linear_combination_leaky_relu.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/linear_combination_params.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/linear_combination_planar_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/linear_combination_relu.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/linear_combination_relu0.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/linear_combination_residual_block.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/linear_combination_sigmoid.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/linear_combination_silu.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/linear_combination_tensor_broadcast.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/linear_combination_with_elementwise.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/reduction_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/thread/scale_type.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/default_epilogue_complex_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/default_epilogue_complex_tensor_op_blas3.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/default_epilogue_direct_store.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/default_epilogue_planar_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/default_epilogue_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/default_epilogue_tensor_op_blas3.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/default_epilogue_volta_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/default_epilogue_with_absmax.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/default_epilogue_with_broadcast.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/default_epilogue_with_reduction.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/default_epilogue_wmma_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/default_thread_map_simt.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/default_thread_map_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/default_thread_map_volta_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/default_thread_map_wmma_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/direct_store_epilogue_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/epilogue.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/epilogue_base.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/epilogue_base_streamk.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/epilogue_depthwise.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/epilogue_direct_store.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/epilogue_gemm_k_reduction.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/epilogue_planar_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/epilogue_smem_accumulator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/epilogue_streamk_with_broadcast.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/epilogue_visitor_with_softmax.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/epilogue_with_absmax.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/epilogue_with_broadcast.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/epilogue_with_reduction.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/epilogue_with_visitor.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/epilogue_with_visitor_callbacks.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/epilogue_workspace.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/fusion/visitor_2x.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/fusion/visitor_compute.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/fusion/visitor_load.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/fusion/visitor_store.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/fusion/visitors.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/interleaved_epilogue.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/output_iterator_parameter.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/output_tile_thread_map.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/predicated_tile_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/predicated_tile_iterator_affine.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/predicated_tile_iterator_affine_layout_params.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/predicated_tile_iterator_blas3.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/predicated_tile_iterator_conv.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/predicated_tile_iterator_direct_conv.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/predicated_tile_iterator_params.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/predicated_tile_iterator_predicates.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/predicated_tile_iterator_strided_dgrad.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/shared_load_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/shared_load_iterator_mixed.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/threadblock/shared_load_iterator_pitch_linear.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/warp/fragment_iterator_complex_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/warp/fragment_iterator_gaussian_complex_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/warp/fragment_iterator_simt.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/warp/fragment_iterator_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/warp/fragment_iterator_volta_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/warp/fragment_iterator_wmma_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/warp/simt_policy.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/warp/tensor_op_policy.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/warp/tile_iterator_simt.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/warp/tile_iterator_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/warp/tile_iterator_tensor_op_mixed.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/warp/tile_iterator_volta_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/warp/tile_iterator_wmma_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/warp/volta_tensor_op_policy.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/epilogue/warp/wmma_tensor_op_policy.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/fast_math.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/float8.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/floating_point_nvrtc.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/functional.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/collective/builders/sm90_common.inl +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/collective/builders/sm90_gmma_builder.inl +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/collective/builders/sm90_sparse_config.inl +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/collective/builders/sm90_sparse_gmma_builder.inl +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/collective/collective_builder.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/collective/collective_builder_decl.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/collective/collective_mma.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/collective/collective_mma_decl.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/collective/fp8_accumulation.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/collective/sm70_mma_twostage.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/collective/sm80_mma_multistage.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/collective/sm90_mma_array_tma_gmma_ss_warpspecialized.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/collective/sm90_mma_multistage_gmma_rs_warpspecialized.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/collective/sm90_mma_multistage_gmma_ss_warpspecialized.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/collective/sm90_mma_tma_gmma_rs_warpspecialized.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/collective/sm90_mma_tma_gmma_rs_warpspecialized_mixed_input.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized_fp8.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/collective/sm90_sparse_mma_tma_gmma_ss_warpspecialized.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/base_grouped.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/default_gemm_configuration.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/ell_gemm.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/gemm.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/gemm_array.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/gemm_batched.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/gemm_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/gemm_grouped.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/gemm_layernorm_mainloop_fusion.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/gemm_sparse.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/gemm_sparse_universal.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/gemm_sparse_universal_with_absmax.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/gemm_sparse_with_absmax.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/gemm_sparse_with_visitor.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/gemm_splitk_parallel.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/gemm_universal.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/gemm_universal_base.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/gemm_universal_streamk_with_broadcast.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/gemm_universal_with_absmax.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/gemm_universal_with_broadcast.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/gemm_with_k_reduction.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/gemv.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/rank_2k.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/rank_2k_grouped.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/rank_k.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/symm.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/device/trmm.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/dispatch_policy.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/gemm.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/gemm_enumerated_types.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/group_array_problem_shape.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_ell_gemm.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_gemm.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_gemm_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_gemm_grouped.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_gemm_grouped_softmax_mainloop_fusion.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_gemm_layernorm_mainloop_fusion.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_gemm_planar_complex_universal.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_gemm_sparse.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_gemm_sparse_universal.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_gemm_sparse_universal_with_absmax.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_gemm_sparse_with_absmax.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_gemm_sparse_with_visitor.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_gemm_splitk_parallel.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_gemm_streamk_with_broadcast.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_gemm_universal.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_gemm_universal_with_visitor.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_gemm_with_absmax.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_gemm_with_broadcast.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_gemm_with_k_reduction.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_gemm_with_reduction.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_gemv.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_rank_2k.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_rank_2k_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_rank_2k_grouped.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_rank_2k_universal.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_rank_k.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_rank_k_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_rank_k_universal.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_symm.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_symm_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_symm_universal.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_trmm.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_trmm_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/default_trmm_universal.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/ell_gemm.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_array.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_batched.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_grouped.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_grouped_problem_visitor.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_grouped_softmax_mainloop_fusion.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_layernorm_mainloop_fusion.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_params.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_pipelined.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_planar_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_planar_complex_array.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_sparse_universal.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_sparse_universal_with_absmax.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_splitk_parallel.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_streamk_with_fused_epilogue.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_transpose_operands.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_universal.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_universal.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_universal_decl.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_universal_streamk.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_universal_with_visitor.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_universal_with_visitor_streamk.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_with_absmax.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_with_fused_epilogue.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemm_with_k_reduction.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemv.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/gemv_batched_strided.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/grouped_problem_visitor.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/params_sparse_base.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/params_universal_base.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/rank_2k_grouped.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/rank_2k_grouped_problem_visitor.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/rank_2k_transpose_operands.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/rank_2k_universal.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/rank_k_universal.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/sm70_gemm.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_cooperative.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_pingpong.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/sm90_gemm_warpspecialized.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/sm90_gemm_warpspecialized_cooperative.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/sm90_gemm_warpspecialized_pingpong.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/sm90_tile_scheduler.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/sm90_tile_scheduler_group.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/sm90_tile_scheduler_stream_k.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/sparse_gemm.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/sparse_gemm_with_absmax.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/sparse_gemm_with_visitor.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/static_tile_scheduler.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/symm_universal.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/tile_scheduler.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/tile_scheduler_params.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/kernel/trmm_universal.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/thread/mma.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/thread/mma_sm50.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/thread/mma_sm60.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/thread/mma_sm61.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/default_ell_mma.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/default_gemv_core.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/default_mma.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/default_mma_core.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/default_mma_core_simt.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/default_mma_core_sm70.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/default_mma_core_sm75.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/default_mma_core_sm80.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/default_mma_core_sparse_sm80.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/default_mma_core_with_access_size.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/default_mma_core_with_reduction.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/default_mma_core_wmma.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/default_mma_layernorm_mainloop_fusion.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/default_mma_planar_complex_multistage.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/default_mma_planar_complex_pipelined.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/default_mma_softmax_mainloop_fusion.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/default_mma_with_reduction.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/default_multistage_mma_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/default_multistage_mma_complex_core.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/default_multistage_mma_complex_core_sm80.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/default_multistage_trmm_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/default_sparse_mma.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/default_trmm.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/ell_mma_multistage.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/ell_mma_pipelined.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/gemv.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/index_remat.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/mma_base.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/mma_blas3_multistage.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/mma_layernorm_mainloop_fusion_multistage.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/mma_multistage.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/mma_pipelined.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/mma_planar_complex_base.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/mma_planar_complex_multistage.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/mma_planar_complex_pipelined.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/mma_singlestage.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/mma_softmax_mainloop_fusion_multistage.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/mma_sparse_base.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/mma_sparse_multistage.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/mma_with_reduction_multistage.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/threadblock_swizzle.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/threadblock/threadblock_swizzle_streamk.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/default_mma_complex_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/default_mma_sparse_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/default_mma_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/default_mma_tensor_op_sm80.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/default_mma_with_reduction_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/default_mma_wmma_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/layernorm_scale_bias_transform.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_complex_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_complex_tensor_op_fast_f32.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_complex_tensor_op_tile_iterator_sm80.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_gaussian_complex_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_gaussian_complex_tensor_op_tile_iterator_sm80.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_mixed_input_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_planar_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_simt.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_simt_policy.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_simt_tile_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_sparse_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_tensor_op_fast_f32.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_tensor_op_fragment_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_tensor_op_policy.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_tensor_op_sm70.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_tensor_op_tile_access_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_tensor_op_tile_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_tensor_op_tile_iterator_sm70.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_tensor_op_tile_iterator_sm80.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_tensor_op_tile_iterator_sparse.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_tensor_op_tile_iterator_wmma.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_tensor_op_wmma.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/mma_with_reduction_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/scale_bias_tile_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/softmax_scale_bias_transform.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm/warp/tile_iterator_planar_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm_coord.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/gemm_coord.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/half.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/integer_subbyte.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/kernel_hardware_info.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/kernel_hardware_info.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/kernel_launch.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/layout/layout.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/layout/matrix.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/layout/permute.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/layout/pitch_linear.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/layout/tensor.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/layout/tensor_op_multiplicand_sm70.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/layout/tensor_op_multiplicand_sm75.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/layout/tensor_op_multiplicand_sm80.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/layout/vector.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/matrix.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/matrix_coord.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/matrix_shape.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/numeric_conversion.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/numeric_size.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/numeric_types.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/pipeline/pipeline.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/pipeline/sm90_pipeline.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/pitch_linear_coord.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/platform/platform.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/predicate_vector.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/quaternion.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/real.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/reduction/device/reduce_split_k.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/reduction/device/tensor_reduce.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/reduction/device/tensor_reduce_affine_contiguous.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/reduction/device/tensor_reduce_affine_strided.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/reduction/kernel/reduce_softmax_final.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/reduction/kernel/reduce_split_k.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/reduction/kernel/tensor_reduce_affine_contiguous.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/reduction/kernel/tensor_reduce_affine_strided.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/reduction/thread/reduce.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/reduction/thread/reduction_operators.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/reduction/threadblock_swizzle.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/relatively_equal.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/semaphore.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/subbyte_reference.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/tensor_coord.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/tensor_ref.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/tensor_ref_planar_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/tensor_view.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/tensor_view_planar_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/tfloat32.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/thread/matrix.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/trace.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/collective/sm90_wgmma_transpose.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/device/transform_universal_adapter.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/kernel/filter_format_transformer.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/kernel/sm90_sparse_gemm_compressor.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/kernel/sparse_gemm_compressor.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/pitch_linear_thread_map.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/thread/transpose.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/thread/unary_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/ell_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/ell_predicated_tile_access_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/ell_predicated_tile_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/predicated_scale_bias_vector_access_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/predicated_scale_bias_vector_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/predicated_tile_access_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/predicated_tile_access_iterator_2dthreadtile.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/predicated_tile_access_iterator_params.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/predicated_tile_access_iterator_triangular_matrix.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/predicated_tile_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/predicated_tile_iterator_2dthreadtile.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/predicated_tile_iterator_triangular_matrix.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/predicated_vector_access_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/regular_scale_bias_vector_access_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/regular_tile_access_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/regular_tile_access_iterator_pitch_linear.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/regular_tile_access_iterator_pitch_linear_direct_conv.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/regular_tile_access_iterator_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/regular_tile_access_iterator_tensor_op_sm80.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/regular_tile_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/regular_tile_iterator_pitch_linear.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/regular_tile_iterator_pitch_linear_2dthreadtile.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/regular_tile_iterator_tensor_op.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/regular_tile_iterator_tensor_op_sm70.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/threadblock/vector_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/transform/warp/vector_fragment_iterator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/uint128.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/version.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/wmma_array.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/include/cutlass/workspace.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/GPU_Clock.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/command_line.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/cublas_wrappers.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/debug.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/device_dump.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/device_groupnorm.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/device_layernorm.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/device_memory.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/device_nchw_to_nhwc.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/device_nhwc_padding.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/device_nhwc_pooling.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/device_nhwc_to_nchw.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/device_rmsnorm.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/device_utils.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/distribution.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/exceptions.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/gett_commandline.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/helper_cuda.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/host_reorder.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/host_tensor.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/host_tensor_planar_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/host_uncompress.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/index_sequence.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/packed_stride.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/print_error.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/detail/inner_product.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/detail/linear_to_coordinate.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/device/convolution.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/device/gemm.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/device/gemm_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/device/gemm_planar_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/device/gett.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/device/kernel/gemm.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/device/kernel/tensor_elementwise.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/device/kernel/tensor_foreach.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/device/rank_2k_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/device/tensor_compare.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/device/tensor_fill.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/device/tensor_foreach.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/device/tensor_reduce.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/device/tensor_relu.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/device/thread/gemm.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/conv.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/convolution.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/error_metrics.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/gemm.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/gemm_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/gemm_planar_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/gett.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/rank_2k.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/rank_2k_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/rank_k_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/symm.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/symm_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/tensor_compare.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/tensor_compare.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/tensor_copy.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/tensor_elementwise.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/tensor_fill.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/tensor_fill.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/tensor_foreach.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/tensor_norm.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/tensor_reduce.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/tensor_reduce.hpp +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/trmm.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/reference/host/trmm_complex.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/tensor_view_io.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/3rdparty/cutlass/tools/util/include/cutlass/util/type_traits.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/LICENSE +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/README.md +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/aot_default_additional_params.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/aot_extension_utils.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_decode.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_decode_config.inc +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_decode_customize_config.jinja +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_decode_jit_pybind.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_decode_kernel_inst.jinja +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_decode_mla_config.jinja +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_decode_mla_cute_sm80.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_decode_mla_plan.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_decode_mla_pybind.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_decode_mla_run.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_mla_plan.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_mla_pybind.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_mla_sm90_plan.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_prefill.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_prefill_config.inc +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_prefill_customize_config.jinja +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_prefill_jit_pybind.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_prefill_paged_kernel_inst.jinja +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_prefill_paged_sm90_kernel_inst.jinja +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_prefill_ragged_kernel_inst.jinja +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_prefill_ragged_sm90_kernel_inst.jinja +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_prefill_sm90.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_prefill_sm90_config.inc +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_prefill_sm90_customize_config.jinja +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/batch_prefill_sm90_jit_pybind.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/bmm_fp8.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/cascade.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/flashinfer_cascade_ops.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/flashinfer_gemm_ops.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/flashinfer_gemm_sm90_ops.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/flashinfer_ops_sm90.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/flashinfer_page_ops.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/flashinfer_quantization_ops.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/flashinfer_rope_ops.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/group_gemm.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/group_gemm_sm90.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/page.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/pytorch_conversion_utils.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/pytorch_extension_utils.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/quantization.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/renorm.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/rope.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/runtime_utils.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/single_decode.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/single_decode_config.inc +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/single_decode_customize_config.jinja +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/single_decode_jit_pybind.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/single_decode_kernel_inst.jinja +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/single_prefill.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/single_prefill_config.inc +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/single_prefill_customize_config.jinja +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/single_prefill_jit_pybind.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/single_prefill_kernel_inst.jinja +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/single_prefill_sm90.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/single_prefill_sm90_config.inc +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/single_prefill_sm90_customize_config.jinja +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/single_prefill_sm90_jit_pybind.cu +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/csrc/single_prefill_sm90_kernel_inst.jinja +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/custom_backend.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer/__init__.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer/cascade.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer/decode.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer/jit/aot_config.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer/jit/attention/__init__.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer/jit/attention/tvm.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer/jit/attention/utils.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer/jit/core.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer/jit/env.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer/jit/utils.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer/prefill.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer/py.typed +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer/quantization.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer/rope.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer/triton/__init__.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer/triton/activation.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer/triton/cascade.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer/triton/kernels/__init__.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer/triton/kernels/activation.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer/triton/kernels/cascade.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer/triton/kernels/quant.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer/triton/utils.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer/utils.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer_python.egg-info/dependency_links.txt +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer_python.egg-info/requires.txt +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/flashinfer_python.egg-info/top_level.txt +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/allocator.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/attention/cascade.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/attention/decode.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/attention/decode_mla_cute_sm80.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/attention/default_decode_params.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/attention/default_prefill_params.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/attention/heap.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/attention/hopper/block_sparse_gather.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/attention/hopper/epilogue.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/attention/hopper/kernel_traits.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/attention/hopper/mainloop.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/attention/hopper/mainloop_mma.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/attention/hopper/prefill_sm90.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/attention/hopper/sparse_mainloop.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/attention/hopper/tile_scheduler.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/attention/hopper/variant_helper.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/attention/hopper.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/attention/mask.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/attention/prefill.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/attention/state.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/attention/variant_helper.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/attention/variants.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/attention_impl.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/cp_async.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/cutlass_utils.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/distributed/all_reduce.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/exception.h +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/fastdiv.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/frag_layout_swizzle.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/gemm/bmm_fp8.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/gemm/group_gemm.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/gemm/group_gemm_lora.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/gemm/group_gemm_sm90.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/gemm/group_gemv.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/layout.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/math.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/mma.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/page.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/permuted_smem.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/pos_enc.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/quantization.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/utils.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/include/flashinfer/vec_dtypes.cuh +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/setup.cfg +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/setup.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_alibi.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_batch_decode_kernels.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_batch_prefill_kernels.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_block_sparse.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_block_sparse_indices_to_vector_sparse_offsets.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_bmm_fp8.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_decode_fp8_calibration_scale.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_decode_prefill_lse.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_fp8_prefill.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_group_gemm.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_hopper.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_jit_example.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_jit_warmup.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_logits_cap.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_mla_decode_kernel.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_non_contiguous_decode.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_non_contiguous_prefill.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_page.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_quantization.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_rope.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_shared_prefix_kernels.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_sliding_window.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_tensor_cores_decode.py +0 -0
- {flashinfer_python-0.2.2 → flashinfer_python-0.2.3}/tests/test_triton_cascade.py +0 -0
|
@@ -0,0 +1,198 @@
|
|
|
1
|
+
Metadata-Version: 2.2
|
|
2
|
+
Name: flashinfer-python
|
|
3
|
+
Version: 0.2.3
|
|
4
|
+
Summary: FlashInfer: Kernel Library for LLM Serving
|
|
5
|
+
Author: FlashInfer team
|
|
6
|
+
License: Apache License 2.0
|
|
7
|
+
Project-URL: Homepage, https://github.com/flashinfer-ai/flashinfer
|
|
8
|
+
Requires-Python: <4.0,>=3.8
|
|
9
|
+
Description-Content-Type: text/markdown
|
|
10
|
+
License-File: LICENSE
|
|
11
|
+
Requires-Dist: numpy
|
|
12
|
+
Requires-Dist: torch
|
|
13
|
+
Requires-Dist: ninja
|
|
14
|
+
Dynamic: requires-dist
|
|
15
|
+
|
|
16
|
+
<p align="center">
|
|
17
|
+
<picture>
|
|
18
|
+
<source media="(prefers-color-scheme: dark)" srcset="https://github.com/flashinfer-ai/web-data/blob/main/logo/FlashInfer-black-background.png?raw=true">
|
|
19
|
+
<img alt="FlashInfer" src="https://github.com/flashinfer-ai/web-data/blob/main/logo/FlashInfer-white-background.png?raw=true" width=55%>
|
|
20
|
+
</picture>
|
|
21
|
+
</p>
|
|
22
|
+
<h1 align="center">
|
|
23
|
+
Kernel Library for LLM Serving
|
|
24
|
+
</h1>
|
|
25
|
+
|
|
26
|
+
<p align="center">
|
|
27
|
+
| <a href="https://flashinfer.ai"><b>Blog</b></a> | <a href="https://docs.flashinfer.ai"><b>Documentation</b></a> | <a href="https://join.slack.com/t/flashinfer/shared_invite/zt-2r93kj2aq-wZnC2n_Z2~mf73N5qnVGGA"><b>Slack</b></a> | <a href="https://github.com/orgs/flashinfer-ai/discussions"><b>Discussion Forum</b></a> |
|
|
28
|
+
</p>
|
|
29
|
+
|
|
30
|
+
[](https://github.com/flashinfer-ai/flashinfer/actions/workflows/release_wheel.yml)
|
|
31
|
+
[](https://github.com/flashinfer-ai/flashinfer/actions/workflows/build-doc.yml)
|
|
32
|
+
|
|
33
|
+
|
|
34
|
+
FlashInfer is a library and kernel generator for Large Language Models that provides high-performance implementation of LLM GPU kernels such as FlashAttention, SparseAttention, PageAttention, Sampling, and more. FlashInfer focuses on LLM serving and inference, and delivers state-of-the-art performance across diverse scenarios.
|
|
35
|
+
|
|
36
|
+
Check our [v0.2 release blog](https://flashinfer.ai/2024/12/16/flashinfer-v02-release.html) for new features!
|
|
37
|
+
|
|
38
|
+
The core features of FlashInfer include:
|
|
39
|
+
1. **Efficient Sparse/Dense Attention Kernels**: Efficient single/batch attention for sparse(paged)/dense KV-storage on CUDA Cores and Tensor Cores (both FA2 & FA3) templates. The vector-sparse attention can achieve 90% of the bandwidth of dense kernels with the same problem size.
|
|
40
|
+
2. **Load-Balanced Scheduling**: FlashInfer decouples `plan`/`run` stage of attention computation where we schedule the computation of variable-length inputs in `plan` stage to alleviate load-imbalance issue.
|
|
41
|
+
3. **Memory Efficiency**: FlashInfer offers [Cascade Attention](https://docs.flashinfer.ai/api/cascade.html#flashinfer.cascade.MultiLevelCascadeAttentionWrapper) for hierarchical KV-Cache, and implements Head-Query fusion for accelerating Grouped-Query Attention, and efficient kernels for low-precision attention and fused-RoPE attention for compressed KV-Cache.
|
|
42
|
+
4. **Customizable Attention**: Bring your own attention variants through JIT-compilation.
|
|
43
|
+
5. **CUDAGraph and torch.compile Compatibility**: FlashInfer kernels can be captured by CUDAGraphs and torch.compile for low-latency inference.
|
|
44
|
+
6. **Efficient LLM-specific Operators**: High-Performance [fused kernel for Top-P, Top-K/Min-P sampling](https://docs.flashinfer.ai/api/sampling.html) without the need for sorting.
|
|
45
|
+
|
|
46
|
+
FlashInfer supports PyTorch, TVM and C++ (header-only) APIs, and can be easily integrated into existing projects.
|
|
47
|
+
|
|
48
|
+
## News
|
|
49
|
+
- [Dec 16, 2024] [Blog Post](https://flashinfer.ai/2024/12/16/flashinfer-v02-release.html) FlashInfer 0.2 - Efficient and Customizable Kernels for LLM Inference Serving
|
|
50
|
+
- [Sept 2024] We've launched a [Slack](https://join.slack.com/t/flashinfer/shared_invite/zt-2r93kj2aq-wZnC2n_Z2~mf73N5qnVGGA) workspace for FlashInfer users and developers. Join us for timely support, discussions, updates and knowledge sharing!
|
|
51
|
+
- [Jan 31, 2024] [Blog Post](https://flashinfer.ai/2024/01/08/cascade-inference.html) Cascade Inference: Memory-Efficient Shared Prefix Batch Decoding
|
|
52
|
+
- [Jan 31, 2024] [Blog Post](https://flashinfer.ai/2024/01/03/introduce-flashinfer.html) Accelerating Self-Attentions for LLM Serving with FlashInfer
|
|
53
|
+
|
|
54
|
+
## Getting Started
|
|
55
|
+
|
|
56
|
+
Using our PyTorch API is the easiest way to get started:
|
|
57
|
+
|
|
58
|
+
### Install from PIP
|
|
59
|
+
|
|
60
|
+
We provide prebuilt python wheels for Linux. Install FlashInfer with the following command:
|
|
61
|
+
|
|
62
|
+
```bash
|
|
63
|
+
# For CUDA 12.4 & torch 2.5
|
|
64
|
+
pip install flashinfer-python -i https://flashinfer.ai/whl/cu124/torch2.5
|
|
65
|
+
# For other CUDA & torch versions, check https://docs.flashinfer.ai/installation.html
|
|
66
|
+
```
|
|
67
|
+
|
|
68
|
+
To try the latest features from the main branch, use our nightly-built wheels:
|
|
69
|
+
|
|
70
|
+
```bash
|
|
71
|
+
pip install flashinfer-python -i https://flashinfer.ai/whl/nightly/cu124/torch2.4
|
|
72
|
+
```
|
|
73
|
+
|
|
74
|
+
For a JIT version (compiling every kernel from scratch, [NVCC](https://developer.nvidia.com/cuda-downloads) is required), install from [PyPI](https://pypi.org/project/flashinfer-python/):
|
|
75
|
+
|
|
76
|
+
```bash
|
|
77
|
+
pip install flashinfer-python
|
|
78
|
+
```
|
|
79
|
+
|
|
80
|
+
### Install from Source
|
|
81
|
+
|
|
82
|
+
Alternatively, build FlashInfer from source:
|
|
83
|
+
|
|
84
|
+
```bash
|
|
85
|
+
git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
|
|
86
|
+
cd flashinfer
|
|
87
|
+
pip install -e . -v
|
|
88
|
+
```
|
|
89
|
+
|
|
90
|
+
To pre-compile essential kernels, set the environment variable `FLASHINFER_ENABLE_AOT=1` before running the installation command:
|
|
91
|
+
|
|
92
|
+
```bash
|
|
93
|
+
FLASHINFER_ENABLE_AOT=1 pip install -e . -v
|
|
94
|
+
```
|
|
95
|
+
|
|
96
|
+
For more details, refer to the [Install from Source documentation](https://docs.flashinfer.ai/installation.html#install-from-source).
|
|
97
|
+
|
|
98
|
+
### Trying it out
|
|
99
|
+
|
|
100
|
+
Below is a minimal example of using FlashInfer's single-request decode/append/prefill attention kernels:
|
|
101
|
+
|
|
102
|
+
```python
|
|
103
|
+
import torch
|
|
104
|
+
import flashinfer
|
|
105
|
+
|
|
106
|
+
kv_len = 2048
|
|
107
|
+
num_kv_heads = 32
|
|
108
|
+
head_dim = 128
|
|
109
|
+
|
|
110
|
+
k = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)
|
|
111
|
+
v = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)
|
|
112
|
+
|
|
113
|
+
# decode attention
|
|
114
|
+
|
|
115
|
+
num_qo_heads = 32
|
|
116
|
+
q = torch.randn(num_qo_heads, head_dim).half().to(0)
|
|
117
|
+
|
|
118
|
+
o = flashinfer.single_decode_with_kv_cache(q, k, v) # decode attention without RoPE on-the-fly
|
|
119
|
+
o_rope_on_the_fly = flashinfer.single_decode_with_kv_cache(q, k, v, pos_encoding_mode="ROPE_LLAMA") # decode with LLaMA style RoPE on-the-fly
|
|
120
|
+
|
|
121
|
+
# append attention
|
|
122
|
+
append_qo_len = 128
|
|
123
|
+
q = torch.randn(append_qo_len, num_qo_heads, head_dim).half().to(0) # append attention, the last 128 tokens in the KV-Cache are the new tokens
|
|
124
|
+
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True) # append attention without RoPE on-the-fly, apply causal mask
|
|
125
|
+
o_rope_on_the_fly = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True, pos_encoding_mode="ROPE_LLAMA") # append attention with LLaMA style RoPE on-the-fly, apply causal mask
|
|
126
|
+
|
|
127
|
+
# prefill attention
|
|
128
|
+
qo_len = 2048
|
|
129
|
+
q = torch.randn(qo_len, num_qo_heads, head_dim).half().to(0) # prefill attention
|
|
130
|
+
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=False) # prefill attention without RoPE on-the-fly, do not apply causal mask
|
|
131
|
+
```
|
|
132
|
+
|
|
133
|
+
Check out [documentation](https://docs.flashinfer.ai/) for usage of batch decode/append/prefill kernels and shared-prefix cascading kernels.
|
|
134
|
+
|
|
135
|
+
## Custom Attention Variants
|
|
136
|
+
|
|
137
|
+
Starting from FlashInfer v0.2, users can customize their own attention variants with additional parameters. For more details, refer to our [JIT examples](https://github.com/flashinfer-ai/flashinfer/blob/main/tests/test_jit_example.py).
|
|
138
|
+
|
|
139
|
+
## Run Benchmarks
|
|
140
|
+
|
|
141
|
+
We profile FlashInfer kernel performance with [nvbench](https://github.com/NVIDIA/nvbench) and you can compile and run the benchmarks with the following commands:
|
|
142
|
+
|
|
143
|
+
```bash
|
|
144
|
+
mkdir build
|
|
145
|
+
cp cmake/config.cmake build # you can modify the config.cmake to enable/disable benchmarks and change CUDA architectures
|
|
146
|
+
cd build
|
|
147
|
+
cmake ..
|
|
148
|
+
make -j12
|
|
149
|
+
```
|
|
150
|
+
|
|
151
|
+
You can run `./bench_{single/batch}_{prefill/decode}` to benchmark the performance (e.g. `./bench_single_prefill` for single-request prefill attention). `./bench_{single/batch}_{prefill/decode} --help` will show you the available options.
|
|
152
|
+
|
|
153
|
+
## C++ API and TVM Bindings
|
|
154
|
+
|
|
155
|
+
FlashInfer also provides C++ API and TVM bindings, please refer to [documentation](https://docs.flashinfer.ai/) for more details.
|
|
156
|
+
|
|
157
|
+
## Adoption
|
|
158
|
+
|
|
159
|
+
We are thrilled to share that FlashInfer is being adopted by many cutting-edge projects, including but not limited to:
|
|
160
|
+
- [MLC-LLM](https://github.com/mlc-ai/mlc-llm)
|
|
161
|
+
- [Punica](https://github.com/punica-ai/punica)
|
|
162
|
+
- [SGLang](https://github.com/sgl-project/sglang)
|
|
163
|
+
- [ScaleLLM](https://github.com/vectorch-ai/ScaleLLM)
|
|
164
|
+
- [vLLM](https://github.com/vllm-project/vllm)
|
|
165
|
+
- [TGI](https://github.com/huggingface/text-generation-inference)
|
|
166
|
+
- [lorax](https://github.com/predibase/lorax)
|
|
167
|
+
- [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
|
|
168
|
+
- [LightLLM](https://github.com/ModelTC/lightllm)
|
|
169
|
+
|
|
170
|
+
## Acknowledgement
|
|
171
|
+
|
|
172
|
+
FlashInfer is inspired by [FlashAttention 1&2](https://github.com/dao-AILab/flash-attention/), [vLLM](https://github.com/vllm-project/vllm), [stream-K](https://arxiv.org/abs/2301.03598), [cutlass](https://github.com/nvidia/cutlass) and [AITemplate](https://github.com/facebookincubator/AITemplate) projects.
|
|
173
|
+
|
|
174
|
+
## Citation
|
|
175
|
+
|
|
176
|
+
If you find FlashInfer helpful in your project or research, please consider citing our [paper](https://arxiv.org/abs/2501.01005):
|
|
177
|
+
|
|
178
|
+
```bibtex
|
|
179
|
+
@article{ye2025flashinfer,
|
|
180
|
+
title = {FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving},
|
|
181
|
+
author = {
|
|
182
|
+
Ye, Zihao and
|
|
183
|
+
Chen, Lequn and
|
|
184
|
+
Lai, Ruihang and
|
|
185
|
+
Lin, Wuwei and
|
|
186
|
+
Zhang, Yineng and
|
|
187
|
+
Wang, Stephanie and
|
|
188
|
+
Chen, Tianqi and
|
|
189
|
+
Kasikci, Baris and
|
|
190
|
+
Grover, Vinod and
|
|
191
|
+
Krishnamurthy, Arvind and
|
|
192
|
+
Ceze, Luis
|
|
193
|
+
},
|
|
194
|
+
journal = {arXiv preprint arXiv:2501.01005},
|
|
195
|
+
year = {2025},
|
|
196
|
+
url = {https://arxiv.org/abs/2501.01005}
|
|
197
|
+
}
|
|
198
|
+
```
|
|
@@ -0,0 +1,125 @@
|
|
|
1
|
+
/*
|
|
2
|
+
* Copyright (c) 2024 by FlashInfer team.
|
|
3
|
+
*
|
|
4
|
+
* Licensed under the Apache License, Version 2.0 (the "License");
|
|
5
|
+
* you may not use this file except in compliance with the License.
|
|
6
|
+
* You may obtain a copy of the License at
|
|
7
|
+
*
|
|
8
|
+
* http://www.apache.org/licenses/LICENSE-2.0
|
|
9
|
+
*
|
|
10
|
+
* Unless required by applicable law or agreed to in writing, software
|
|
11
|
+
* distributed under the License is distributed on an "AS IS" BASIS,
|
|
12
|
+
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
13
|
+
* See the License for the specific language governing permissions and
|
|
14
|
+
* limitations under the License.
|
|
15
|
+
*/
|
|
16
|
+
#include <flashinfer/activation.cuh>
|
|
17
|
+
|
|
18
|
+
#include "pytorch_extension_utils.h"
|
|
19
|
+
|
|
20
|
+
using namespace flashinfer;
|
|
21
|
+
|
|
22
|
+
// Sigmoid Linear Unit: x * sigmoid(x). Uses the fast-math __expf intrinsic,
// matching the precision/throughput trade-off of the original implementation.
__device__ __forceinline__ float silu(const float& x) {
  const float denom = 1.0f + __expf(-x);
  return x / denom;
}
|
|
23
|
+
|
|
24
|
+
// Exact (erf-based) GELU: x * Phi(x), with Phi the standard normal CDF.
__device__ __forceinline__ float gelu(const float& x) {
  constexpr float kInvSqrt2 = M_SQRT1_2;  // 1/sqrt(2), folded at compile time
  return x * 0.5f * (1.0f + ::erf(x * kInvSqrt2));
}
|
|
28
|
+
|
|
29
|
+
// Tanh-approximated GELU:
//   x * 0.5 * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
// The polynomial is evaluated in the same association as the reference code
// so results are bit-identical.
__device__ __forceinline__ float gelu_tanh(const float& x) {
  const float inner = 0.7978845608028654f * (x + 0.044715f * x * x * x);
  const float cdf = 0.5f * (1.0f + math::tanh(inner));
  return x * cdf;
}
|
|
34
|
+
|
|
35
|
+
// Fused SiLU-and-multiply over the last dimension split in half
// (d = input.size(-1) / 2). Launches one block per token; blockDim is sized
// assuming the kernel consumes `vec_size` elements per thread (16-byte
// vectors) — see act_and_mul_kernel. `enable_pdl` opts into programmatic
// dependent launch via the stream-serialization launch attribute.
void silu_and_mul(at::Tensor& out, at::Tensor& input, bool enable_pdl, int64_t cuda_stream) {
  int d = input.size(-1) / 2;  // width of each half of the last dim
  int64_t num_tokens = input.numel() / input.size(-1);
  cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream);

  DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16(input.scalar_type(), c_type, [&] {
    uint32_t vec_size = 16 / sizeof(c_type);  // elements per 16-byte vector

    // PDL opt-in: allow the kernel to overlap with its predecessor when
    // enable_pdl is true.
    cudaLaunchAttribute attrs[1];
    attrs[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attrs[0].val.programmaticStreamSerializationAllowed = enable_pdl;

    cudaLaunchConfig_t config;
    config.gridDim = num_tokens;
    config.blockDim = std::min(d / vec_size, 1024U);
    config.dynamicSmemBytes = 0;
    config.stream = stream;
    config.numAttrs = 1;
    config.attrs = attrs;

    auto kernel = flashinfer::activation::act_and_mul_kernel<c_type, silu>;
    cudaLaunchKernelEx(&config, kernel, static_cast<c_type*>(out.data_ptr()),
                       static_cast<c_type*>(input.data_ptr()), d);

    // cudaLaunchKernelEx reports config errors asynchronously; surface them.
    cudaError_t err = cudaGetLastError();
    TORCH_CHECK(err == cudaSuccess, "Failed to launch kernel: ", cudaGetErrorString(err));
    return true;
  });
}
|
|
65
|
+
|
|
66
|
+
// Fused tanh-GELU-and-multiply over the last dimension split in half
// (d = input.size(-1) / 2). Grid: one block per token; no dynamic shared
// memory. `enable_pdl` toggles the programmatic-stream-serialization launch
// attribute (programmatic dependent launch).
void gelu_tanh_and_mul(at::Tensor& out, at::Tensor& input, bool enable_pdl, int64_t cuda_stream) {
  int d = input.size(-1) / 2;
  int64_t num_tokens = input.numel() / input.size(-1);
  cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream);

  DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16(input.scalar_type(), c_type, [&] {
    // One thread handles a 16-byte vector of elements.
    uint32_t vec_size = 16 / sizeof(c_type);

    cudaLaunchAttribute attrs[1];
    attrs[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attrs[0].val.programmaticStreamSerializationAllowed = enable_pdl;

    cudaLaunchConfig_t config;
    config.gridDim = num_tokens;
    config.blockDim = std::min(d / vec_size, 1024U);
    config.dynamicSmemBytes = 0;
    config.stream = stream;
    config.numAttrs = 1;
    config.attrs = attrs;

    auto kernel = flashinfer::activation::act_and_mul_kernel<c_type, gelu_tanh>;
    cudaLaunchKernelEx(&config, kernel, static_cast<c_type*>(out.data_ptr()),
                       static_cast<c_type*>(input.data_ptr()), d);

    cudaError_t err = cudaGetLastError();
    TORCH_CHECK(err == cudaSuccess, "Failed to launch kernel: ", cudaGetErrorString(err));
    return true;
  });
}
|
|
95
|
+
|
|
96
|
+
// Fused erf-GELU-and-multiply over the last dimension split in half
// (d = input.size(-1) / 2). Launch: one block per token, up to 1024 threads,
// no dynamic shared memory. `enable_pdl` opts into programmatic dependent
// launch via the stream-serialization attribute.
//
// Fix: the unused local `dim3 grid(num_tokens);` is removed — the grid size
// is carried in config.gridDim, consistent with silu_and_mul and
// gelu_tanh_and_mul.
void gelu_and_mul(at::Tensor& out, at::Tensor& input, bool enable_pdl, int64_t cuda_stream) {
  int d = input.size(-1) / 2;
  int64_t num_tokens = input.numel() / input.size(-1);
  cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream);

  DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FP16(input.scalar_type(), c_type, [&] {
    uint32_t vec_size = 16 / sizeof(c_type);  // elements per 16-byte vector
    cudaLaunchConfig_t config;
    config.gridDim = num_tokens;
    config.blockDim = std::min(d / vec_size, 1024U);
    config.dynamicSmemBytes = 0;
    config.stream = stream;
    // Programmatic stream serialization (PDL) opt-in.
    cudaLaunchAttribute attrs[1];
    attrs[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attrs[0].val.programmaticStreamSerializationAllowed = enable_pdl;
    config.numAttrs = 1;
    config.attrs = attrs;

    auto kernel = flashinfer::activation::act_and_mul_kernel<c_type, gelu>;
    cudaLaunchKernelEx(&config, kernel, static_cast<c_type*>(out.data_ptr()),
                       static_cast<c_type*>(input.data_ptr()), d);

    // Surface launch-configuration errors from the asynchronous launch.
    cudaError_t err = cudaGetLastError();
    TORCH_CHECK(err == cudaSuccess, "Failed to launch kernel: ", cudaGetErrorString(err));
    return true;
  });
}
|
|
@@ -0,0 +1,33 @@
|
|
|
1
|
+
#pragma once
|
|
2
|
+
#include <flashinfer/page.cuh>
|
|
3
|
+
#include <flashinfer/math.cuh>
|
|
4
|
+
#include <flashinfer/layout.cuh>
|
|
5
|
+
#include <flashinfer/utils.cuh>
|
|
6
|
+
#include <flashinfer/pos_enc.cuh>
|
|
7
|
+
#include <flashinfer/fastdiv.cuh>
|
|
8
|
+
#include <flashinfer/attention/variant_helper.cuh>
|
|
9
|
+
#include <flashinfer/attention/mla_params.cuh>
|
|
10
|
+
|
|
11
|
+
using namespace flashinfer;
|
|
12
|
+
|
|
13
|
+
#ifdef FLASHINFER_ENABLE_PROFILER
|
|
14
|
+
#define ADDITIONAL_FUNC_PARAMS , at::Tensor profiler_buffer
|
|
15
|
+
#define ADDITIONAL_PARAMS_SETTER \
|
|
16
|
+
params.profiler_buffer = static_cast<uint64_t*>(profiler_buffer.data_ptr());
|
|
17
|
+
#else
|
|
18
|
+
#define ADDITIONAL_FUNC_PARAMS
|
|
19
|
+
#define ADDITIONAL_PARAMS_SETTER
|
|
20
|
+
#endif
|
|
21
|
+
|
|
22
|
+
using DTypeQ = {{ dtype_q }};
|
|
23
|
+
using DTypeKV = {{ dtype_kv }};
|
|
24
|
+
using DTypeO = {{ dtype_o }};
|
|
25
|
+
using IdType = {{ dtype_idx }};
|
|
26
|
+
constexpr int HEAD_DIM_CKV = {{ head_dim_ckv }};
|
|
27
|
+
constexpr int HEAD_DIM_KPE = {{ head_dim_kpe }};
|
|
28
|
+
|
|
29
|
+
// Dispatch helper used by the MLA run translation units.
// NOTE: it implicitly captures a `mask_mode` variable from the enclosing
// scope (passed as the runtime value to DISPATCH_MASK_MODE), binds it to the
// compile-time MASK_MODE constant, defines `Params` as the concrete
// MLAParams instantiation, and finally invokes the trailing lambda.
#define DISPATCH_context(DTypeQ, DTypeKV, DTypeO, IdType, MASK_MODE, HEAD_DIM_CKV, HEAD_DIM_KPE, Params, ...) \
  DISPATCH_MASK_MODE(mask_mode, MASK_MODE, { \
    using Params = MLAParams<DTypeQ, DTypeKV, DTypeO, IdType>; \
    __VA_ARGS__(); \
  })
|
|
@@ -0,0 +1,124 @@
|
|
|
1
|
+
/*
|
|
2
|
+
* Copyright (c) 2025 by FlashInfer team.
|
|
3
|
+
*
|
|
4
|
+
* Licensed under the Apache License, Version 2.0 (the "License");
|
|
5
|
+
* you may not use this file except in compliance with the License.
|
|
6
|
+
* You may obtain a copy of the License at
|
|
7
|
+
*
|
|
8
|
+
* http://www.apache.org/licenses/LICENSE-2.0
|
|
9
|
+
*
|
|
10
|
+
* Unless required by applicable law or agreed to in writing, software
|
|
11
|
+
* distributed under the License is distributed on an "AS IS" BASIS,
|
|
12
|
+
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
13
|
+
* See the License for the specific language governing permissions and
|
|
14
|
+
* limitations under the License.
|
|
15
|
+
*/
|
|
16
|
+
#include <flashinfer/attention/mla.cuh>
|
|
17
|
+
#include <flashinfer/attention/scheduler.cuh>
|
|
18
|
+
#include <flashinfer/fastdiv.cuh>
|
|
19
|
+
#include <optional>
|
|
20
|
+
|
|
21
|
+
#include "batch_mla_config.inc"
|
|
22
|
+
#include "pytorch_conversion_utils.h"
|
|
23
|
+
#include "pytorch_extension_utils.h"
|
|
24
|
+
|
|
25
|
+
using namespace flashinfer;
|
|
26
|
+
|
|
27
|
+
// Execute a previously planned batch MLA (multi-head latent attention)
// paged-attention pass. `plan_info_vec` is the serialized MLAPlanInfo
// produced by the planner; the int/float workspace buffers hold the
// scheduler metadata and partial-result scratch space at the offsets
// recorded in the plan.
void BatchMLAPagedAttentionRun(at::Tensor float_workspace_buffer, at::Tensor int_workspace_buffer,
                               at::Tensor plan_info_vec, at::Tensor q_nope, at::Tensor q_pe,
                               at::Tensor ckv_cache, at::Tensor kpe_cache, at::Tensor kv_indices,
                               at::Tensor o, std::optional<at::Tensor> maybe_lse,
                               int64_t mask_mode_code, int64_t num_heads, int64_t page_size,
                               double sm_scale, int64_t cuda_stream) {
  // q_nope: [n, num_heads, head_dim_ckv]
  // q_pe: [n, num_heads, head_dim_kpe]
  // ckv_cache: [num_pages, page_size, head_dim_ckv]
  // kpe_cache: [num_pages, page_size, head_dim_kpe]
  MLAPlanInfo plan_info;
  plan_info.FromVector(tensor_to_vec(plan_info_vec));

  auto device = q_nope.device();

  void* float_buffer_ptr = float_workspace_buffer.data_ptr();
  void* int_buffer_ptr = int_workspace_buffer.data_ptr();

  // NOTE(review): `mask_mode` is referenced implicitly by the
  // DISPATCH_context macro below; `device`, `q_scalar_type` and
  // `kv_scalar_type` look unused here but may be consumed by dispatch macros
  // in the generated batch_mla_config.inc — confirm before removing.
  const MaskMode mask_mode = static_cast<MaskMode>(mask_mode_code);

  auto q_scalar_type = q_nope.scalar_type();
  auto kv_scalar_type = ckv_cache.scalar_type();

  // Element strides taken from the tensors so non-contiguous layouts work.
  unsigned int q_nope_stride_n = q_nope.stride(0);
  unsigned int q_nope_stride_h = q_nope.stride(1);
  unsigned int q_pe_stride_n = q_pe.stride(0);
  unsigned int q_pe_stride_h = q_pe.stride(1);
  unsigned int ckv_stride_page = ckv_cache.stride(0);
  unsigned int ckv_stride_n = ckv_cache.stride(1);
  unsigned int kpe_stride_page = kpe_cache.stride(0);
  unsigned int kpe_stride_n = kpe_cache.stride(1);
  unsigned int o_stride_n = o.stride(0);
  unsigned int o_stride_h = o.stride(1);

  cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream);

  DISPATCH_context(
      DTypeQ, DTypeKV, DTypeO, IdType, MASK_MODE, HEAD_DIM_CKV, HEAD_DIM_KPE, Params, [&] {
        Params params;

        // Raw input/output pointers.
        params.q_nope = static_cast<DTypeQ*>(q_nope.data_ptr());
        params.q_pe = static_cast<DTypeQ*>(q_pe.data_ptr());
        params.ckv = static_cast<DTypeKV*>(ckv_cache.data_ptr());
        params.kpe = static_cast<DTypeKV*>(kpe_cache.data_ptr());

        // Scheduler metadata lives in the int workspace buffer at offsets
        // recorded by the planner.
        params.q_indptr = GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.q_indptr_offset);
        params.kv_indptr = GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.kv_indptr_offset);
        params.partial_indptr =
            GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.partial_indptr_offset);
        params.kv_indices = static_cast<IdType*>(kv_indices.data_ptr());
        params.q_len = GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.q_len_offset);
        params.kv_len = GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.kv_len_offset);
        params.q_start = GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.q_start_offset);
        params.kv_start = GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.kv_start_offset);
        params.kv_end = GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.kv_end_offset);
        params.work_indptr =
            GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.work_indptr_offset);
        // Offsets used by the split-KV merge (combining partial outputs).
        params.merge_packed_offset_start = GetPtrFromBaseOffset<IdType>(
            int_buffer_ptr, plan_info.merge_packed_offset_start_offset);
        params.merge_packed_offset_end =
            GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.merge_packed_offset_end_offset);
        params.merge_partial_packed_offset_start = GetPtrFromBaseOffset<IdType>(
            int_buffer_ptr, plan_info.merge_partial_packed_offset_start_offset);
        params.merge_partial_packed_offset_end = GetPtrFromBaseOffset<IdType>(
            int_buffer_ptr, plan_info.merge_partial_packed_offset_end_offset);
        params.merge_partial_stride =
            GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.merge_partial_stride_offset);
        // Final output and optional log-sum-exp; partial buffers live in the
        // float workspace.
        params.final_o = static_cast<DTypeO*>(o.data_ptr());
        params.final_lse =
            maybe_lse.has_value() ? static_cast<float*>(maybe_lse->data_ptr()) : nullptr;
        params.partial_o =
            GetPtrFromBaseOffset<DTypeO>(float_buffer_ptr, plan_info.partial_o_offset);
        params.partial_lse =
            GetPtrFromBaseOffset<float>(float_buffer_ptr, plan_info.partial_lse_offset);

        // Fast division by num_heads / page_size inside the kernel.
        params.num_heads = uint_fastdiv(num_heads);
        params.block_size = uint_fastdiv(page_size);

        params.q_nope_stride_n = q_nope_stride_n;
        params.q_nope_stride_h = q_nope_stride_h;
        params.q_pe_stride_n = q_pe_stride_n;
        params.q_pe_stride_h = q_pe_stride_h;
        params.ckv_stride_page = ckv_stride_page;
        params.ckv_stride_n = ckv_stride_n;
        params.kpe_stride_page = kpe_stride_page;
        params.kpe_stride_n = kpe_stride_n;
        params.o_stride_n = o_stride_n;
        params.o_stride_h = o_stride_h;

        params.sm_scale = sm_scale;

        // Grid shape (num_blks_x, num_blks_y) was decided by the planner.
        cudaError_t status = mla::BatchMLAPagedAttention<MASK_MODE, HEAD_DIM_CKV, HEAD_DIM_KPE>(
            params, plan_info.num_blks_x, plan_info.num_blks_y, stream);

        TORCH_CHECK(status == cudaSuccess,
                    "Failed to run MLA, error: ", cudaGetErrorString(status));
      });
}
|
|
@@ -0,0 +1,37 @@
|
|
|
1
|
+
/*
|
|
2
|
+
* Copyright (c) 2025 by FlashInfer team.
|
|
3
|
+
*
|
|
4
|
+
* Licensed under the Apache License, Version 2.0 (the "License");
|
|
5
|
+
* you may not use this file except in compliance with the License.
|
|
6
|
+
* You may obtain a copy of the License at
|
|
7
|
+
*
|
|
8
|
+
* http://www.apache.org/licenses/LICENSE-2.0
|
|
9
|
+
*
|
|
10
|
+
* Unless required by applicable law or agreed to in writing, software
|
|
11
|
+
* distributed under the License is distributed on an "AS IS" BASIS,
|
|
12
|
+
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
13
|
+
* See the License for the specific language governing permissions and
|
|
14
|
+
* limitations under the License.
|
|
15
|
+
*/
|
|
16
|
+
#include "batch_mla_sm90_config.inc"
|
|
17
|
+
#include "pytorch_extension_utils.h"
|
|
18
|
+
|
|
19
|
+
// Forward declaration: plan entry point for the SM90 (Hopper) batch MLA
// backend; returns the serialized plan-info vector consumed by the run call.
at::Tensor BatchMLAPagedAttentionSM90Plan(at::Tensor float_workspace_buffer,
                                          at::Tensor int_workspace_buffer,
                                          at::Tensor page_locked_int_workspace_buffer,
                                          at::Tensor qo_indptr, at::Tensor kv_indptr,
                                          at::Tensor kv_len, int64_t num_heads, int64_t head_dim_o,
                                          bool causal, int64_t cuda_stream);

// Forward declaration: run entry point for the SM90 batch MLA backend.
// ADDITIONAL_FUNC_PARAMS expands to the profiler buffer argument when
// FLASHINFER_ENABLE_PROFILER is defined, and to nothing otherwise.
void BatchMLAPagedAttentionSM90Run(at::Tensor float_workspace_buffer,
                                   at::Tensor int_workspace_buffer, at::Tensor plan_info_vec,
                                   at::Tensor q_nope, at::Tensor q_pe, at::Tensor ckv_cache,
                                   at::Tensor kpe_cache, at::Tensor kv_indices, at::Tensor o,
                                   std::optional<at::Tensor> maybe_lse, int64_t mask_mode_code,
                                   int64_t num_heads, int64_t page_size,
                                   double sm_scale ADDITIONAL_FUNC_PARAMS, int64_t cuda_stream);

// Register the SM90 MLA plan/run operators with the torch extension library.
TORCH_LIBRARY_FRAGMENT(TORCH_EXTENSION_NAME, m) {
  m.def("plan", &BatchMLAPagedAttentionSM90Plan);
  m.def("run", &BatchMLAPagedAttentionSM90Run);
}
|
|
@@ -0,0 +1,128 @@
|
|
|
1
|
+
/*
|
|
2
|
+
* Copyright (c) 2025 by FlashInfer team.
|
|
3
|
+
*
|
|
4
|
+
* Licensed under the Apache License, Version 2.0 (the "License");
|
|
5
|
+
* you may not use this file except in compliance with the License.
|
|
6
|
+
* You may obtain a copy of the License at
|
|
7
|
+
*
|
|
8
|
+
* http://www.apache.org/licenses/LICENSE-2.0
|
|
9
|
+
*
|
|
10
|
+
* Unless required by applicable law or agreed to in writing, software
|
|
11
|
+
* distributed under the License is distributed on an "AS IS" BASIS,
|
|
12
|
+
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
13
|
+
* See the License for the specific language governing permissions and
|
|
14
|
+
* limitations under the License.
|
|
15
|
+
*/
|
|
16
|
+
#include <flashinfer/attention/mla_hopper.cuh>
|
|
17
|
+
#include <flashinfer/attention/scheduler.cuh>
|
|
18
|
+
#include <flashinfer/fastdiv.cuh>
|
|
19
|
+
#include <optional>
|
|
20
|
+
|
|
21
|
+
#include "batch_mla_sm90_config.inc"
|
|
22
|
+
#include "pytorch_conversion_utils.h"
|
|
23
|
+
#include "pytorch_extension_utils.h"
|
|
24
|
+
|
|
25
|
+
using namespace flashinfer;
|
|
26
|
+
|
|
27
|
+
// Execute a previously planned batch MLA paged-attention pass on the SM90
// (Hopper) backend. Mirrors BatchMLAPagedAttentionRun but dispatches to the
// Hopper kernel and optionally wires up the profiler buffer
// (ADDITIONAL_FUNC_PARAMS / ADDITIONAL_PARAMS_SETTER expand under
// FLASHINFER_ENABLE_PROFILER).
void BatchMLAPagedAttentionSM90Run(at::Tensor float_workspace_buffer,
                                   at::Tensor int_workspace_buffer, at::Tensor plan_info_vec,
                                   at::Tensor q_nope, at::Tensor q_pe, at::Tensor ckv_cache,
                                   at::Tensor kpe_cache, at::Tensor kv_indices, at::Tensor o,
                                   std::optional<at::Tensor> maybe_lse, int64_t mask_mode_code,
                                   int64_t num_heads, int64_t page_size,
                                   double sm_scale ADDITIONAL_FUNC_PARAMS, int64_t cuda_stream) {
  // q_nope: [n, num_heads, head_dim_ckv]
  // q_pe: [n, num_heads, head_dim_kpe]
  // ckv_cache: [num_pages, page_size, head_dim_ckv]
  // kpe_cache: [num_pages, page_size, head_dim_kpe]
  MLAPlanInfo plan_info;
  plan_info.FromVector(tensor_to_vec(plan_info_vec));

  auto device = q_nope.device();

  void* float_buffer_ptr = float_workspace_buffer.data_ptr();
  void* int_buffer_ptr = int_workspace_buffer.data_ptr();

  // NOTE(review): `mask_mode` is consumed implicitly by DISPATCH_context;
  // `device`, `q_scalar_type`, `kv_scalar_type` appear unused here but may be
  // referenced by macros in batch_mla_sm90_config.inc — confirm before
  // removing.
  const MaskMode mask_mode = static_cast<MaskMode>(mask_mode_code);

  auto q_scalar_type = q_nope.scalar_type();
  auto kv_scalar_type = ckv_cache.scalar_type();

  // Element strides so non-contiguous tensor layouts are handled.
  unsigned int q_nope_stride_n = q_nope.stride(0);
  unsigned int q_nope_stride_h = q_nope.stride(1);
  unsigned int q_pe_stride_n = q_pe.stride(0);
  unsigned int q_pe_stride_h = q_pe.stride(1);
  unsigned int ckv_stride_page = ckv_cache.stride(0);
  unsigned int ckv_stride_n = ckv_cache.stride(1);
  unsigned int kpe_stride_page = kpe_cache.stride(0);
  unsigned int kpe_stride_n = kpe_cache.stride(1);
  unsigned int o_stride_n = o.stride(0);
  unsigned int o_stride_h = o.stride(1);

  cudaStream_t stream = reinterpret_cast<cudaStream_t>(cuda_stream);

  DISPATCH_context(
      DTypeQ, DTypeKV, DTypeO, IdType, MASK_MODE, HEAD_DIM_CKV, HEAD_DIM_KPE, Params, [&] {
        Params params;

        // Raw input pointers.
        params.q_nope = static_cast<DTypeQ*>(q_nope.data_ptr());
        params.q_pe = static_cast<DTypeQ*>(q_pe.data_ptr());
        params.ckv = static_cast<DTypeKV*>(ckv_cache.data_ptr());
        params.kpe = static_cast<DTypeKV*>(kpe_cache.data_ptr());

        // Scheduler metadata from the int workspace at planner-recorded
        // offsets.
        params.q_indptr = GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.q_indptr_offset);
        params.kv_indptr = GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.kv_indptr_offset);
        params.partial_indptr =
            GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.partial_indptr_offset);
        params.kv_indices = static_cast<IdType*>(kv_indices.data_ptr());
        params.q_len = GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.q_len_offset);
        params.kv_len = GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.kv_len_offset);
        params.q_start = GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.q_start_offset);
        params.kv_start = GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.kv_start_offset);
        params.kv_end = GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.kv_end_offset);
        params.work_indptr =
            GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.work_indptr_offset);
        // Offsets for merging partial (split-KV) outputs.
        params.merge_packed_offset_start = GetPtrFromBaseOffset<IdType>(
            int_buffer_ptr, plan_info.merge_packed_offset_start_offset);
        params.merge_packed_offset_end =
            GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.merge_packed_offset_end_offset);
        params.merge_partial_packed_offset_start = GetPtrFromBaseOffset<IdType>(
            int_buffer_ptr, plan_info.merge_partial_packed_offset_start_offset);
        params.merge_partial_packed_offset_end = GetPtrFromBaseOffset<IdType>(
            int_buffer_ptr, plan_info.merge_partial_packed_offset_end_offset);
        params.merge_partial_stride =
            GetPtrFromBaseOffset<IdType>(int_buffer_ptr, plan_info.merge_partial_stride_offset);
        // Final output, optional LSE, and float-workspace partial buffers.
        params.final_o = static_cast<DTypeO*>(o.data_ptr());
        params.final_lse =
            maybe_lse.has_value() ? static_cast<float*>(maybe_lse->data_ptr()) : nullptr;
        params.partial_o =
            GetPtrFromBaseOffset<DTypeO>(float_buffer_ptr, plan_info.partial_o_offset);
        params.partial_lse =
            GetPtrFromBaseOffset<float>(float_buffer_ptr, plan_info.partial_lse_offset);

        // Fast in-kernel division by num_heads / page_size.
        params.num_heads = uint_fastdiv(num_heads);
        params.block_size = uint_fastdiv(page_size);

        params.q_nope_stride_n = q_nope_stride_n;
        params.q_nope_stride_h = q_nope_stride_h;
        params.q_pe_stride_n = q_pe_stride_n;
        params.q_pe_stride_h = q_pe_stride_h;
        params.ckv_stride_page = ckv_stride_page;
        params.ckv_stride_n = ckv_stride_n;
        params.kpe_stride_page = kpe_stride_page;
        params.kpe_stride_n = kpe_stride_n;
        params.o_stride_n = o_stride_n;
        params.o_stride_h = o_stride_h;

        // Sets params.profiler_buffer when the profiler is enabled.
        ADDITIONAL_PARAMS_SETTER

        params.sm_scale = sm_scale;

        // Grid shape (num_blks_x, num_blks_y) was decided by the planner.
        cudaError_t status =
            mla::BatchMLAPageAttentionHopper<MASK_MODE, HEAD_DIM_CKV, HEAD_DIM_KPE>(
                params, plan_info.num_blks_x, plan_info.num_blks_y, stream);

        TORCH_CHECK(status == cudaSuccess,
                    "Failed to run MLA, error: ", cudaGetErrorString(status));
      });
}
|