PyPI - sglang - Versions diffs - 0.4.3.post2__tar.gz → 0.4.3.post4__tar.gz - Mend

sglang 0.4.3.post2tar.gz → 0.4.3.post4tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (497) hide show

{sglang-0.4.3.post2/sglang.egg-info → sglang-0.4.3.post4}/PKG-INFO RENAMED Viewed

@@ -1,6 +1,6 @@
 Metadata-Version: 2.2
 Name: sglang
-Version: 0.4.3.post2
+Version: 0.4.3.post4
 Summary: SGLang is yet another fast serving framework for large language models and vision language models.
 License:                                  Apache License
                                    Version 2.0, January 2004
@@ -235,32 +235,35 @@ Requires-Dist: pyzmq>=25.1.2; extra == "runtime-common"
 Requires-Dist: torchao>=0.7.0; extra == "runtime-common"
 Requires-Dist: uvicorn; extra == "runtime-common"
 Requires-Dist: uvloop; extra == "runtime-common"
-Requires-Dist: xgrammar==0.1.10; extra == "runtime-common"
+Requires-Dist: xgrammar==0.1.14; extra == "runtime-common"
 Requires-Dist: ninja; extra == "runtime-common"
+Requires-Dist: transformers==4.48.3; extra == "runtime-common"
+Requires-Dist: llguidance>=0.6.15; extra == "runtime-common"
+Requires-Dist: datasets; extra == "runtime-common"
 Provides-Extra: srt
 Requires-Dist: sglang[runtime_common]; extra == "srt"
-Requires-Dist: cuda-python; extra == "srt"
-Requires-Dist: sgl-kernel>=0.0.3.post6; extra == "srt"
-Requires-Dist: torch; extra == "srt"
+Requires-Dist: sgl-kernel==0.0.3.post6; extra == "srt"
+Requires-Dist: flashinfer_python==0.2.2.post1; extra == "srt"
+Requires-Dist: torch==2.5.1; extra == "srt"
 Requires-Dist: vllm<=0.7.2,>=0.6.4.post1; extra == "srt"
-Requires-Dist: flashinfer_python>=0.2.1.post2; extra == "srt"
+Requires-Dist: cuda-python; extra == "srt"
 Requires-Dist: outlines<=0.1.11,>=0.0.44; extra == "srt"
 Provides-Extra: srt-hip
 Requires-Dist: sglang[runtime_common]; extra == "srt-hip"
+Requires-Dist: sgl-kernel==0.0.3.post6; extra == "srt-hip"
 Requires-Dist: torch; extra == "srt-hip"
 Requires-Dist: vllm==0.6.7.dev2; extra == "srt-hip"
 Requires-Dist: outlines==0.1.11; extra == "srt-hip"
-Requires-Dist: sgl-kernel>=0.0.3.post1; extra == "srt-hip"
 Provides-Extra: srt-xpu
 Requires-Dist: sglang[runtime_common]; extra == "srt-xpu"
-Requires-Dist: outlines<0.1.0,>=0.0.44; extra == "srt-xpu"
+Requires-Dist: outlines<=0.1.11,>=0.0.44; extra == "srt-xpu"
 Provides-Extra: srt-hpu
 Requires-Dist: sglang[runtime_common]; extra == "srt-hpu"
-Requires-Dist: outlines<0.1.0,>=0.0.44; extra == "srt-hpu"
+Requires-Dist: outlines<=0.1.11,>=0.0.44; extra == "srt-hpu"
 Provides-Extra: srt-cpu
 Requires-Dist: sglang[runtime_common]; extra == "srt-cpu"
+Requires-Dist: outlines<=0.1.11,>=0.0.44; extra == "srt-cpu"
 Requires-Dist: torch; extra == "srt-cpu"
-Requires-Dist: outlines<0.1.0,>=0.0.44; extra == "srt-cpu"
 Provides-Extra: openai
 Requires-Dist: openai>=1.0; extra == "openai"
 Requires-Dist: tiktoken; extra == "openai"
@@ -318,7 +321,7 @@ Provides-Extra: dev-cpu
 Requires-Dist: sglang[all_cpu]; extra == "dev-cpu"
 Requires-Dist: sglang[test]; extra == "dev-cpu"
-<div align="center"  id="sglangtop">
+<div align="center" id="sglangtop">
 <img src="https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png" alt="logo" width="400" margin="10px"></img>
 [![PyPI](https://img.shields.io/pypi/v/sglang)](https://pypi.org/project/sglang)
@@ -336,10 +339,11 @@ Requires-Dist: sglang[test]; extra == "dev-cpu"
 | [**Documentation**](https://docs.sglang.ai/)
 | [**Join Slack**](https://slack.sglang.ai/)
 | [**Join Bi-Weekly Development Meeting**](https://meeting.sglang.ai/)
+| [**Roadmap**](https://github.com/sgl-project/sglang/issues/4042)
 | [**Slides**](https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#slides) |
 ## News
-- [2025/01] 🔥 SGLang provides day one support for DeepSeek V3/R1 models on NVIDIA and AMD GPUs with DeepSeek-specific optimizations. ([instructions](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3), [AMD blog](https://www.amd.com/en/developer/resources/technical-articles/amd-instinct-gpus-power-deepseek-v3-revolutionizing-ai-development-with-sglang.html))
+- [2025/01] 🔥 SGLang provides day one support for DeepSeek V3/R1 models on NVIDIA and AMD GPUs with DeepSeek-specific optimizations. ([instructions](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3), [AMD blog](https://www.amd.com/en/developer/resources/technical-articles/amd-instinct-gpus-power-deepseek-v3-revolutionizing-ai-development-with-sglang.html), [10+ other companies](https://x.com/lmsysorg/status/1887262321636221412))
 - [2024/12] 🔥 v0.4 Release: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, Faster Structured Outputs ([blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/)).
 - [2024/09] v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
 - [2024/07] v0.2 Release: Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) ([blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/)).
@@ -366,7 +370,7 @@ The core features include:
 ## Getting Started
 - [Install SGLang](https://docs.sglang.ai/start/install.html)
-- [Quick Start](https://docs.sglang.ai/start/send_request.html)
+- [Quick Start](https://docs.sglang.ai/backend/send_request.html)
 - [Backend Tutorial](https://docs.sglang.ai/backend/openai_api_completions.html)
 - [Frontend Tutorial](https://docs.sglang.ai/frontend/frontend.html)
 - [Contribution Guide](https://docs.sglang.ai/references/contribution_guide.html)
@@ -375,10 +379,13 @@ The core features include:
 Learn more in the release blogs: [v0.2 blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/), [v0.3 blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/), [v0.4 blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/)
 ## Roadmap
-[Development Roadmap (2024 Q4)](https://github.com/sgl-project/sglang/issues/1487)
+[Development Roadmap (2025 H1)](https://github.com/sgl-project/sglang/issues/4042)
 ## Adoption and Sponsorship
-The project is supported by (alphabetically): AMD, Atlas Cloud, Baseten, Cursor, DataCrunch, Etched, Hyperbolic, Jam & Tea Studios, LinkedIn, LMSYS CORP, Meituan, Nebius, Novita AI, NVIDIA, RunPod, Stanford, UC Berkeley, UCLA, xAI, 01.AI.
+The project has been deployed to large-scale production, generating trillions of tokens every day.
+It is supported by the following institutions: AMD, Atlas Cloud, Baseten, Cursor, DataCrunch, Etched, Hyperbolic, Iflytek, Jam & Tea Studios, LinkedIn, LMSYS, Meituan, Nebius, Novita AI, NVIDIA, RunPod, Stanford, UC Berkeley, UCLA, xAI, and 01.AI.
+<img src="https://raw.githubusercontent.com/sgl-project/sgl-learning-materials/main/slides/adoption.png" alt="logo" width="800" margin="10px"></img>
 ## Contact Us

{sglang-0.4.3.post2 → sglang-0.4.3.post4}/README.md RENAMED Viewed

@@ -1,4 +1,4 @@
-<div align="center"  id="sglangtop">
+<div align="center" id="sglangtop">
 <img src="https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png" alt="logo" width="400" margin="10px"></img>
 [![PyPI](https://img.shields.io/pypi/v/sglang)](https://pypi.org/project/sglang)
@@ -16,10 +16,11 @@
 | [**Documentation**](https://docs.sglang.ai/)
 | [**Join Slack**](https://slack.sglang.ai/)
 | [**Join Bi-Weekly Development Meeting**](https://meeting.sglang.ai/)
+| [**Roadmap**](https://github.com/sgl-project/sglang/issues/4042)
 | [**Slides**](https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#slides) |
 ## News
-- [2025/01] 🔥 SGLang provides day one support for DeepSeek V3/R1 models on NVIDIA and AMD GPUs with DeepSeek-specific optimizations. ([instructions](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3), [AMD blog](https://www.amd.com/en/developer/resources/technical-articles/amd-instinct-gpus-power-deepseek-v3-revolutionizing-ai-development-with-sglang.html))
+- [2025/01] 🔥 SGLang provides day one support for DeepSeek V3/R1 models on NVIDIA and AMD GPUs with DeepSeek-specific optimizations. ([instructions](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3), [AMD blog](https://www.amd.com/en/developer/resources/technical-articles/amd-instinct-gpus-power-deepseek-v3-revolutionizing-ai-development-with-sglang.html), [10+ other companies](https://x.com/lmsysorg/status/1887262321636221412))
 - [2024/12] 🔥 v0.4 Release: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, Faster Structured Outputs ([blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/)).
 - [2024/09] v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
 - [2024/07] v0.2 Release: Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) ([blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/)).
@@ -46,7 +47,7 @@ The core features include:
 ## Getting Started
 - [Install SGLang](https://docs.sglang.ai/start/install.html)
-- [Quick Start](https://docs.sglang.ai/start/send_request.html)
+- [Quick Start](https://docs.sglang.ai/backend/send_request.html)
 - [Backend Tutorial](https://docs.sglang.ai/backend/openai_api_completions.html)
 - [Frontend Tutorial](https://docs.sglang.ai/frontend/frontend.html)
 - [Contribution Guide](https://docs.sglang.ai/references/contribution_guide.html)
@@ -55,10 +56,13 @@ The core features include:
 Learn more in the release blogs: [v0.2 blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/), [v0.3 blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/), [v0.4 blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/)
 ## Roadmap
-[Development Roadmap (2024 Q4)](https://github.com/sgl-project/sglang/issues/1487)
+[Development Roadmap (2025 H1)](https://github.com/sgl-project/sglang/issues/4042)
 ## Adoption and Sponsorship
-The project is supported by (alphabetically): AMD, Atlas Cloud, Baseten, Cursor, DataCrunch, Etched, Hyperbolic, Jam & Tea Studios, LinkedIn, LMSYS CORP, Meituan, Nebius, Novita AI, NVIDIA, RunPod, Stanford, UC Berkeley, UCLA, xAI, 01.AI.
+The project has been deployed to large-scale production, generating trillions of tokens every day.
+It is supported by the following institutions: AMD, Atlas Cloud, Baseten, Cursor, DataCrunch, Etched, Hyperbolic, Iflytek, Jam & Tea Studios, LinkedIn, LMSYS, Meituan, Nebius, Novita AI, NVIDIA, RunPod, Stanford, UC Berkeley, UCLA, xAI, and 01.AI.
+<img src="https://raw.githubusercontent.com/sgl-project/sgl-learning-materials/main/slides/adoption.png" alt="logo" width="800" margin="10px"></img>
 ## Contact Us

{sglang-0.4.3.post2 → sglang-0.4.3.post4}/pyproject.toml RENAMED Viewed

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "sglang"
-version = "0.4.3.post2"
+version = "0.4.3.post4"
 description = "SGLang is yet another fast serving framework for large language models and vision language models."
 readme = "README.md"
 requires-python = ">=3.8"
@@ -17,32 +17,57 @@ dependencies = ["requests", "tqdm", "numpy", "IPython", "setproctitle"]
 [project.optional-dependencies]
 runtime_common = [
-    "aiohttp", "decord", "fastapi",
-    "hf_transfer", "huggingface_hub", "interegular", "modelscope",
-    "orjson", "packaging", "pillow", "prometheus-client>=0.20.0",
-    "psutil", "pydantic", "python-multipart", "pyzmq>=25.1.2",
-    "torchao>=0.7.0", "uvicorn", "uvloop", "xgrammar==0.1.10", "ninja"
+    "aiohttp",
+    "decord",
+    "fastapi",
+    "hf_transfer",
+    "huggingface_hub",
+    "interegular",
+    "modelscope",
+    "orjson",
+    "packaging",
+    "pillow",
+    "prometheus-client>=0.20.0",
+    "psutil",
+    "pydantic",
+    "python-multipart",
+    "pyzmq>=25.1.2",
+    "torchao>=0.7.0",
+    "uvicorn",
+    "uvloop",
+    "xgrammar==0.1.14",
+    "ninja",
+    "transformers==4.48.3",
+    "llguidance>=0.6.15",
+    "datasets"
 ]
 srt = [
-    "sglang[runtime_common]", "cuda-python",
-    "sgl-kernel>=0.0.3.post6", "torch", "vllm>=0.6.4.post1,<=0.7.2",
-    "flashinfer_python>=0.2.1.post2",
+    "sglang[runtime_common]",
+    "sgl-kernel==0.0.3.post6",
+    "flashinfer_python==0.2.2.post1",
+    "torch==2.5.1",
+    "vllm>=0.6.4.post1,<=0.7.2",
+    "cuda-python",
     "outlines>=0.0.44,<=0.1.11",
 ]
 # HIP (Heterogeneous-computing Interface for Portability) for AMD
-# => base docker rocm/vllm-dev:20241022, not from public vllm whl
-srt_hip = ["sglang[runtime_common]", "torch", "vllm==0.6.7.dev2", "outlines==0.1.11", "sgl-kernel>=0.0.3.post1"]
+# => base docker rocm/vllm-dev:20250114, not from public vllm whl
+srt_hip = ["sglang[runtime_common]", "sgl-kernel==0.0.3.post6", "torch", "vllm==0.6.7.dev2", "outlines==0.1.11"]
 # xpu is not enabled in public vllm and torch whl,
 # need to follow https://docs.vllm.ai/en/latest/getting_started/xpu-installation.htmlinstall vllm
-srt_xpu = ["sglang[runtime_common]", "outlines>=0.0.44,<0.1.0"]
-#For Intel Gaudi(device : hpu) follow the installation guide
-#https://docs.vllm.ai/en/latest/getting_started/gaudi-installation.html
-srt_hpu = ["sglang[runtime_common]", "outlines>=0.0.44,<0.1.0"]
+srt_xpu = ["sglang[runtime_common]", "outlines>=0.0.44,<=0.1.11"]
+# For Intel Gaudi(device : hpu) follow the installation guide
+# https://docs.vllm.ai/en/latest/getting_started/gaudi-installation.html
+srt_hpu = ["sglang[runtime_common]", "outlines>=0.0.44,<=0.1.11"]
 # CPU: currently, there are no pre-built vllm wheels for CPU.
 # To install vllm for CPU, please follow the instruction here:
 # https://docs.vllm.ai/en/latest/getting_started/installation/cpu/index.html
-srt_cpu = ["sglang[runtime_common]", "torch", "outlines>=0.0.44,<0.1.0"]
+srt_cpu = ["sglang[runtime_common]", "outlines>=0.0.44,<=0.1.11", "torch"]
 openai = ["openai>=1.0", "tiktoken"]
 anthropic = ["anthropic>=0.20.0"]
@@ -73,7 +98,10 @@ dev_cpu = ["sglang[all_cpu]", "sglang[test]"]
 "Bug Tracker" = "https://github.com/sgl-project/sglang/issues"
 [tool.setuptools.package-data]
-"sglang" = ["srt/layers/moe/fused_moe_triton/configs/*.json", "srt/layers/quantization/configs/*.json"]
+"sglang" = [
+    "srt/layers/moe/fused_moe_triton/configs/*.json",
+    "srt/layers/quantization/configs/*.json",
+]
 [tool.setuptools.packages.find]
 exclude = [

{sglang-0.4.3.post2 → sglang-0.4.3.post4}/sglang/api.py RENAMED Viewed

@@ -94,7 +94,7 @@ def gen(
     regex: Optional[str] = None,
     json_schema: Optional[str] = None,
 ):
-    """Call the model to generate. See the meaning of the arguments in docs/sampling_params.md"""
+    """Call the model to generate. See the meaning of the arguments in docs/backend/sampling_params.md"""
     if choices:
         return SglSelect(

{sglang-0.4.3.post2 → sglang-0.4.3.post4}/sglang/bench_offline_throughput.py RENAMED Viewed

@@ -56,6 +56,7 @@ class BenchArgs:
     profile: bool = False
     skip_warmup: bool = False
     do_not_exit: bool = False
+    prompt_suffix: str = ""
     @staticmethod
     def add_cli_args(parser: argparse.ArgumentParser):
@@ -177,6 +178,12 @@ class BenchArgs:
             action="store_true",
             help="Do not exit the program. This is useful for nsys profile with --duration and --delay.",
         )
+        parser.add_argument(
+            "--prompt-suffix",
+            type=str,
+            default="",
+            help="Suffix applied to the end of all user prompts, followed by assistant prompt suffix.",
+        )
     @classmethod
     def from_cli_args(cls, args: argparse.Namespace):
@@ -216,6 +223,10 @@ def throughput_test_once(
     ]
     if profile:
+        assert (
+            "SGLANG_TORCH_PROFILER_DIR" in os.environ
+        ), "Please set SGLANG_TORCH_PROFILER_DIR."
+        os.makedirs(os.environ["SGLANG_TORCH_PROFILER_DIR"], exist_ok=True)
         backend.start_profile()
     st = time.perf_counter()
@@ -229,6 +240,8 @@ def throughput_test_once(
     if backend_name == "runtime":
         gen_out = json.loads(gen_out)
+    server_info = backend.get_server_info()
     measurement_results["total_latency"] = latency
     measurement_results["total_output_tokens"] = sum(
         o["meta_info"]["completion_tokens"] for o in gen_out
@@ -246,6 +259,7 @@ def throughput_test_once(
         measurement_results["total_input_tokens"]
         + measurement_results["total_output_tokens"]
     ) / latency
+    measurement_results["last_gen_throughput"] = server_info["last_gen_throughput"]
     return measurement_results
@@ -361,6 +375,11 @@ def throughput_test(
     print(
         "{:<40} {:<10}".format("Total generated tokens:", result["total_output_tokens"])
     )
+    print(
+        "{:<40} {:<10.2f}".format(
+            "Last generation throughput (tok/s):", result["last_gen_throughput"]
+        )
+    )
     print(
         "{:<40} {:<10.2f}".format(
             "Request throughput (req/s):", result["request_throughput"]

{sglang-0.4.3.post2 → sglang-0.4.3.post4}/sglang/bench_one_batch.py RENAMED Viewed

@@ -230,7 +230,7 @@ def extend(reqs, model_runner):
     batch = ScheduleBatch.init_new(
         reqs=reqs,
         req_to_token_pool=model_runner.req_to_token_pool,
-        token_to_kv_pool=model_runner.token_to_kv_pool,
+        token_to_kv_pool_allocator=model_runner.token_to_kv_pool_allocator,
         tree_cache=None,
         model_config=model_runner.model_config,
         enable_overlap=False,
@@ -326,7 +326,7 @@ def latency_test_run_once(
     # Clear the pools.
     model_runner.req_to_token_pool.clear()
-    model_runner.token_to_kv_pool.clear()
+    model_runner.token_to_kv_pool_allocator.clear()
     measurement_results = {
         "run_name": run_name,

sglang 0.4.3.post2__tar.gz → 0.4.3.post4__tar.gz

sglang 0.4.3.post2tar.gz → 0.4.3.post4tar.gz