PyPI - sglang - Versions diffs - 0.4.1.post2__tar.gz → 0.4.1.post4__tar.gz - Mend

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (337)

{sglang-0.4.1.post2 → sglang-0.4.1.post4}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: sglang
-Version: 0.4.1.post2
+Version: 0.4.1.post4
 Summary: SGLang is yet another fast serving framework for large language models and vision language models.
 License:                                  Apache License
                                    Version 2.0, January 2004
@@ -243,11 +243,11 @@ Requires-Dist: torch; extra == "srt"
 Requires-Dist: vllm<=0.6.4.post1,>=0.6.3.post1; extra == "srt"
 Requires-Dist: cuda-python; extra == "srt"
 Requires-Dist: flashinfer==0.1.6; extra == "srt"
-Requires-Dist: sgl-kernel>=0.0.2.post10; extra == "srt"
+Requires-Dist: sgl-kernel>=0.0.2.post11; extra == "srt"
 Provides-Extra: srt-hip
 Requires-Dist: sglang[runtime_common]; extra == "srt-hip"
 Requires-Dist: torch; extra == "srt-hip"
-Requires-Dist: vllm==0.6.3.dev13; extra == "srt-hip"
+Requires-Dist: vllm==0.6.3.post2.dev1; extra == "srt-hip"
 Provides-Extra: srt-xpu
 Requires-Dist: sglang[runtime_common]; extra == "srt-xpu"
 Provides-Extra: srt-hpu
@@ -315,7 +315,7 @@ Requires-Dist: sglang[test]; extra == "dev-hpu"
 | [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/)
 | [**Documentation**](https://sgl-project.github.io/)
-| [**Join Slack**](https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2tmmp6flg-89dOlJW2TjnBrTRk1I_~GA)
+| [**Join Slack**](https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2um0ad92q-LkU19KQTxCGzlCgRiOiQEw)
 | [**Join Bi-Weekly Development Meeting**](https://docs.google.com/document/d/1xEow4eIM152xNcRxqZz9VEcOiTQo8-CEuuQ5qTmkt-E/edit?usp=sharing)
 | [**Slides**](https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#slides) |
@@ -347,9 +347,10 @@ The core features include:
 ## Getting Started
 - [Install SGLang](https://sgl-project.github.io/start/install.html)
-- [Send requests](https://sgl-project.github.io/start/send_request.html)
-- [Backend: SGLang Runtime (SRT)](https://sgl-project.github.io/backend/backend.html)
-- [Frontend: Structured Generation Language (SGLang)](https://sgl-project.github.io/frontend/frontend.html)
+- [Quick Start](https://sgl-project.github.io/start/send_request.html)
+- [Backend Tutorial](https://sgl-project.github.io/backend/openai_api_completions.html)
+- [Frontend Tutorial](https://sgl-project.github.io/frontend/frontend.html)
+- [Contribution Guide](https://sgl-project.github.io/references/contribution_guide.html)
 ## Benchmark and Performance
 Learn more in our release blogs: [v0.2 blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/), [v0.3 blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/), [v0.4 blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/)
@@ -361,5 +362,5 @@ Learn more in our release blogs: [v0.2 blog](https://lmsys.org/blog/2024-07-25-s
 The project is supported by (alphabetically): AMD, Baseten, DataCrunch, Etched, Hyperbolic, Jam & Tea Studios, LinkedIn, LMSYS.org, Meituan, NVIDIA, RunPod, Stanford, UC Berkeley, UCLA, xAI, 01.AI.
 ## Acknowledgment and Citation
-We learned from the design and reused code from the following projects: [Guidance](https://github.com/guidance-ai/guidance), [vLLM](https://github.com/vllm-project/vllm), [LightLLM](https://github.com/ModelTC/lightllm), [FlashInfer](https://github.com/flashinfer-ai/flashinfer), [Outlines](https://github.com/outlines-dev/outlines), and [LMQL](https://github.com/eth-sri/lmql).
+We learned the design and reused code from the following projects: [Guidance](https://github.com/guidance-ai/guidance), [vLLM](https://github.com/vllm-project/vllm), [LightLLM](https://github.com/ModelTC/lightllm), [FlashInfer](https://github.com/flashinfer-ai/flashinfer), [Outlines](https://github.com/outlines-dev/outlines), and [LMQL](https://github.com/eth-sri/lmql).
 Please cite the paper, [SGLang: Efficient Execution of Structured Language Model Programs](https://arxiv.org/abs/2312.07104), if you find the project useful.

{sglang-0.4.1.post2 → sglang-0.4.1.post4}/README.md RENAMED

@@ -14,7 +14,7 @@
 | [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/)
 | [**Documentation**](https://sgl-project.github.io/)
-| [**Join Slack**](https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2tmmp6flg-89dOlJW2TjnBrTRk1I_~GA)
+| [**Join Slack**](https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2um0ad92q-LkU19KQTxCGzlCgRiOiQEw)
 | [**Join Bi-Weekly Development Meeting**](https://docs.google.com/document/d/1xEow4eIM152xNcRxqZz9VEcOiTQo8-CEuuQ5qTmkt-E/edit?usp=sharing)
 | [**Slides**](https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#slides) |
@@ -46,9 +46,10 @@ The core features include:
 ## Getting Started
 - [Install SGLang](https://sgl-project.github.io/start/install.html)
-- [Send requests](https://sgl-project.github.io/start/send_request.html)
-- [Backend: SGLang Runtime (SRT)](https://sgl-project.github.io/backend/backend.html)
-- [Frontend: Structured Generation Language (SGLang)](https://sgl-project.github.io/frontend/frontend.html)
+- [Quick Start](https://sgl-project.github.io/start/send_request.html)
+- [Backend Tutorial](https://sgl-project.github.io/backend/openai_api_completions.html)
+- [Frontend Tutorial](https://sgl-project.github.io/frontend/frontend.html)
+- [Contribution Guide](https://sgl-project.github.io/references/contribution_guide.html)
 ## Benchmark and Performance
 Learn more in our release blogs: [v0.2 blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/), [v0.3 blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/), [v0.4 blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/)
@@ -60,5 +61,5 @@ Learn more in our release blogs: [v0.2 blog](https://lmsys.org/blog/2024-07-25-s
 The project is supported by (alphabetically): AMD, Baseten, DataCrunch, Etched, Hyperbolic, Jam & Tea Studios, LinkedIn, LMSYS.org, Meituan, NVIDIA, RunPod, Stanford, UC Berkeley, UCLA, xAI, 01.AI.
 ## Acknowledgment and Citation
-We learned from the design and reused code from the following projects: [Guidance](https://github.com/guidance-ai/guidance), [vLLM](https://github.com/vllm-project/vllm), [LightLLM](https://github.com/ModelTC/lightllm), [FlashInfer](https://github.com/flashinfer-ai/flashinfer), [Outlines](https://github.com/outlines-dev/outlines), and [LMQL](https://github.com/eth-sri/lmql).
+We learned the design and reused code from the following projects: [Guidance](https://github.com/guidance-ai/guidance), [vLLM](https://github.com/vllm-project/vllm), [LightLLM](https://github.com/ModelTC/lightllm), [FlashInfer](https://github.com/flashinfer-ai/flashinfer), [Outlines](https://github.com/outlines-dev/outlines), and [LMQL](https://github.com/eth-sri/lmql).
 Please cite the paper, [SGLang: Efficient Execution of Structured Language Model Programs](https://arxiv.org/abs/2312.07104), if you find the project useful.

{sglang-0.4.1.post2 → sglang-0.4.1.post4}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "sglang"
-version = "0.4.1.post2"
+version = "0.4.1.post4"
 description = "SGLang is yet another fast serving framework for large language models and vision language models."
 readme = "README.md"
 requires-python = ">=3.8"
@@ -23,11 +23,11 @@ runtime_common = ["aiohttp", "decord", "fastapi",
     "psutil", "pydantic", "python-multipart",
     "pyzmq>=25.1.2", "torchao>=0.7.0", "uvicorn", "uvloop",
     "xgrammar>=0.1.6"]
-srt = ["sglang[runtime_common]", "torch", "vllm>=0.6.3.post1,<=0.6.4.post1", "cuda-python", "flashinfer==0.1.6", "sgl-kernel>=0.0.2.post10"]
+srt = ["sglang[runtime_common]", "torch", "vllm>=0.6.3.post1,<=0.6.4.post1", "cuda-python", "flashinfer==0.1.6", "sgl-kernel>=0.0.2.post11"]
 # HIP (Heterogeneous-computing Interface for Portability) for AMD
 # => base docker rocm/vllm-dev:20241022, not from public vllm whl
-srt_hip = ["sglang[runtime_common]", "torch", "vllm==0.6.3.dev13"]
+srt_hip = ["sglang[runtime_common]", "torch", "vllm==0.6.3.post2.dev1"]
 # xpu is not enabled in public vllm and torch whl,
 # need to follow https://docs.vllm.ai/en/latest/getting_started/xpu-installation.htmlinstall vllm
 srt_xpu = ["sglang[runtime_common]"]
@@ -61,7 +61,7 @@ dev_hpu = ["sglang[all_hpu]", "sglang[test]"]
 "Bug Tracker" = "https://github.com/sgl-project/sglang/issues"
 [tool.setuptools.package-data]
-"sglang" = ["srt/layers/fused_moe_triton/configs/*.json"]
+"sglang" = ["srt/layers/moe/fused_moe_triton/configs/*.json", "srt/layers/quantization/configs/*.json"]
 [tool.setuptools.packages.find]
 exclude = [

{sglang-0.4.1.post2 → sglang-0.4.1.post4}/sglang/bench_one_batch.py RENAMED

@@ -63,6 +63,7 @@ from sglang.srt.model_executor.model_runner import ModelRunner
 from sglang.srt.sampling.sampling_params import SamplingParams
 from sglang.srt.server import _set_envs_and_config
 from sglang.srt.server_args import PortArgs, ServerArgs
+from sglang.srt.speculative.spec_info import SpeculativeAlgorithm
 from sglang.srt.utils import configure_logger, kill_process_tree, suppress_other_loggers
@@ -214,6 +215,7 @@ def extend(reqs, model_runner):
         tree_cache=None,
         model_config=model_runner.model_config,
         enable_overlap=False,
+        spec_algorithm=SpeculativeAlgorithm.NONE,
     )
     batch.prepare_for_extend()
     model_worker_batch = batch.get_model_worker_batch()

{sglang-0.4.1.post2 → sglang-0.4.1.post4}/sglang/srt/layers/attention/__init__.py RENAMED

@@ -1,10 +1,14 @@
+from __future__ import annotations
 from abc import ABC, abstractmethod
-from typing import Optional
+from typing import TYPE_CHECKING, Optional
 import torch
-from sglang.srt.layers.radix_attention import RadixAttention
-from sglang.srt.model_executor.forward_batch_info import ForwardBatch
+if TYPE_CHECKING:
+    from sglang.srt.layers.radix_attention import RadixAttention
+    from sglang.srt.model_executor.forward_batch_info import ForwardBatch, ForwardMode
+    from sglang.srt.speculative.spec_info import SpecInfo
 class AttentionBackend(ABC):
@@ -22,9 +26,12 @@ class AttentionBackend(ABC):
     def init_forward_metadata_capture_cuda_graph(
         self,
         bs: int,
+        num_tokens: int,
         req_pool_indices: torch.Tensor,
         seq_lens: torch.Tensor,
-        encoder_lens: Optional[torch.Tensor] = None,
+        encoder_lens: Optional[torch.Tensor],
+        forward_mode: ForwardMode,
+        spec_info: Optional[SpecInfo],
     ):
         """Init the metadata for a forward pass for capturing a cuda graph."""
         raise NotImplementedError()
@@ -35,7 +42,9 @@ class AttentionBackend(ABC):
         req_pool_indices: torch.Tensor,
         seq_lens: torch.Tensor,
         seq_lens_sum: int,
-        encoder_lens: Optional[torch.Tensor] = None,
+        encoder_lens: Optional[torch.Tensor],
+        forward_mode: ForwardMode,
+        spec_info: Optional[SpecInfo],
     ):
         """Init the metadata for a forward pass for replying a cuda graph."""
         raise NotImplementedError()
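
Note on the hunk above: the type-only imports move behind `TYPE_CHECKING`, and `from __future__ import annotations` keeps annotations as unevaluated strings, so the module no longer imports `RadixAttention`, `ForwardBatch`, `ForwardMode`, or `SpecInfo` at runtime. The new parameters also carry no defaults, so callers have to pass them explicitly. The following is a minimal sketch of that pattern, not sglang code: `heavy_module_that_does_not_exist`, `Backend`, and `init_capture` are hypothetical names chosen for illustration.

```python
from __future__ import annotations  # annotations are stored as strings, not evaluated

from abc import ABC
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Only static type checkers follow this import. At runtime the block is
    # skipped, so a circular or missing module cannot break this file's import.
    from heavy_module_that_does_not_exist import ForwardMode, SpecInfo  # hypothetical


class Backend(ABC):  # hypothetical stand-in for an attention backend base class
    def init_capture(
        self,
        bs: int,
        num_tokens: int,
        encoder_lens: Optional[list],
        forward_mode: ForwardMode,
        spec_info: Optional[SpecInfo],
    ):
        """No defaults: every caller must pass all arguments explicitly."""
        raise NotImplementedError()


# The annotation is kept as a plain string and never resolved at runtime.
print(Backend.init_capture.__annotations__["forward_mode"])
```

Because the annotations are never evaluated, the sketch imports cleanly even though the hypothetical module does not exist; a call that omits `forward_mode` or `spec_info` fails with `TypeError` before the method body runs.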

{sglang-0.4.1.post2 → sglang-0.4.1.post4}/sglang/srt/layers/attention/double_sparsity_backend.py RENAMED

@@ -3,7 +3,6 @@ from __future__ import annotations
 from typing import TYPE_CHECKING
 import torch
-import torch.nn as nn
 from sglang.srt.layers.attention import AttentionBackend
 from sglang.srt.managers.schedule_batch import global_server_args_dict
@@ -52,8 +51,6 @@ class DoubleSparseAttnBackend(AttentionBackend):
         self.forward_metadata = None
-        self.cuda_graph_max_seq_len = model_runner.model_config.context_len
     def init_forward_metadata(self, forward_batch: ForwardBatch):
         """Init auxiliary variables for triton attention backend."""
@@ -115,55 +112,6 @@ class DoubleSparseAttnBackend(AttentionBackend):
             ds_req_to_token,
         )
-    def init_cuda_graph_state(self, max_bs: int):
-        # TODO(Andy): Support CUDA graph for double sparse attention
-        raise ValueError(
-            "Double sparse attention does not support CUDA graph for now. Please --disable-cuda-graph"
-        )
-        self.cuda_graph_max_total_num_tokens = max_bs * self.cuda_graph_max_seq_len
-        self.cuda_graph_start_loc = torch.zeros(
-            (max_bs,), dtype=torch.int32, device="cuda"
-        )
-        self.cuda_graph_attn_logits = torch.empty(
-            (
-                self.num_head,
-                self.cuda_graph_max_total_num_tokens,
-            ),
-            dtype=self.reduce_dtype,
-            device="cuda",
-        )
-    def init_forward_metadata_capture_cuda_graph(
-        self,
-        bs: int,
-        req_pool_indices: torch.Tensor,
-        seq_lens: torch.Tensor,
-        encoder_lens=None,
-    ):
-        # NOTE: encoder_lens expected to be zeros or None
-        self.forward_metadata = (
-            self.cuda_graph_start_loc,
-            self.cuda_graph_attn_logits,
-            self.cuda_graph_max_seq_len,
-            None,
-        )
-    def init_forward_metadata_replay_cuda_graph(
-        self,
-        bs: int,
-        req_pool_indices: torch.Tensor,
-        seq_lens: torch.Tensor,
-        seq_lens_sum: int,
-        encoder_lens=None,
-    ):
-        # NOTE: encoder_lens expected to be zeros or None
-        self.cuda_graph_start_loc.zero_()
-        self.cuda_graph_start_loc[1:bs] = torch.cumsum(seq_lens[: bs - 1], dim=0)
-    def get_cuda_graph_seq_len_fill_value(self):
-        return 1
     def forward_extend(
         self,
         q,
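
Note on the hunk above: with the CUDA graph overrides removed from `DoubleSparseAttnBackend`, calls to the capture and replay hooks presumably resolve to the `AttentionBackend` base methods shown in `attention/__init__.py`, which raise `NotImplementedError`. The following is a minimal sketch of that fallback under that assumption; `BaseBackend`, `SparseBackend`, and `supports_cuda_graph_capture` are hypothetical names, not part of sglang.

```python
from abc import ABC


class BaseBackend(ABC):  # hypothetical stand-in for AttentionBackend
    def init_forward_metadata_capture_cuda_graph(self, *args, **kwargs):
        """Base hook: backends that support CUDA graphs override this."""
        raise NotImplementedError()


class SparseBackend(BaseBackend):  # hypothetical backend that defines no override
    pass


def supports_cuda_graph_capture(backend: BaseBackend) -> bool:
    """Return True if the backend's class overrides the capture hook."""
    hook = type(backend).init_forward_metadata_capture_cuda_graph
    return hook is not BaseBackend.init_forward_metadata_capture_cuda_graph
```

In this sketch, `supports_cuda_graph_capture(SparseBackend())` is `False`, and calling the hook on that instance raises `NotImplementedError` from the base class instead of the custom `ValueError` that the removed `init_cuda_graph_state` used to raise.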
