PyPI - sglang - Versions diffs - 0.4.6.post2__tar.gz → 0.4.6.post4__tar.gz - Mend

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (653)

{sglang-0.4.6.post2/sglang.egg-info → sglang-0.4.6.post4}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: sglang
-Version: 0.4.6.post2
+Version: 0.4.6.post4
 Summary: SGLang is yet another fast serving framework for large language models and vision language models.
 License:                                  Apache License
                                    Version 2.0, January 2004
@@ -230,6 +230,7 @@ Requires-Dist: modelscope; extra == "runtime-common"
 Requires-Dist: ninja; extra == "runtime-common"
 Requires-Dist: orjson; extra == "runtime-common"
 Requires-Dist: packaging; extra == "runtime-common"
+Requires-Dist: partial_json_parser; extra == "runtime-common"
 Requires-Dist: pillow; extra == "runtime-common"
 Requires-Dist: prometheus-client>=0.20.0; extra == "runtime-common"
 Requires-Dist: psutil; extra == "runtime-common"
@@ -242,17 +243,16 @@ Requires-Dist: torchao>=0.9.0; extra == "runtime-common"
 Requires-Dist: transformers==4.51.1; extra == "runtime-common"
 Requires-Dist: uvicorn; extra == "runtime-common"
 Requires-Dist: uvloop; extra == "runtime-common"
-Requires-Dist: xgrammar==0.1.17; extra == "runtime-common"
+Requires-Dist: xgrammar==0.1.19; extra == "runtime-common"
 Requires-Dist: blobfile==3.0.0; extra == "runtime-common"
 Provides-Extra: srt
 Requires-Dist: sglang[runtime_common]; extra == "srt"
-Requires-Dist: sgl-kernel==0.1.1; extra == "srt"
+Requires-Dist: sgl-kernel==0.1.2.post1; extra == "srt"
 Requires-Dist: flashinfer_python==0.2.5; extra == "srt"
 Requires-Dist: torch==2.6.0; extra == "srt"
 Requires-Dist: torchvision==0.21.0; extra == "srt"
 Requires-Dist: cuda-python; extra == "srt"
 Requires-Dist: outlines<=0.1.11,>=0.0.44; extra == "srt"
-Requires-Dist: partial_json_parser; extra == "srt"
 Requires-Dist: einops; extra == "srt"
 Provides-Extra: blackwell
 Requires-Dist: sglang[runtime_common]; extra == "blackwell"
@@ -261,7 +261,6 @@ Requires-Dist: torch; extra == "blackwell"
 Requires-Dist: torchvision; extra == "blackwell"
 Requires-Dist: cuda-python; extra == "blackwell"
 Requires-Dist: outlines<=0.1.11,>=0.0.44; extra == "blackwell"
-Requires-Dist: partial_json_parser; extra == "blackwell"
 Requires-Dist: einops; extra == "blackwell"
 Provides-Extra: srt-hip
 Requires-Dist: sglang[runtime_common]; extra == "srt-hip"
@@ -278,6 +277,9 @@ Provides-Extra: srt-cpu
 Requires-Dist: sglang[runtime_common]; extra == "srt-cpu"
 Requires-Dist: outlines<=0.1.11,>=0.0.44; extra == "srt-cpu"
 Requires-Dist: torch; extra == "srt-cpu"
+Provides-Extra: srt-npu
+Requires-Dist: sglang[runtime_common]; extra == "srt-npu"
+Requires-Dist: outlines<=0.1.11,>=0.0.44; extra == "srt-npu"
 Provides-Extra: openai
 Requires-Dist: openai>=1.0; extra == "openai"
 Requires-Dist: tiktoken; extra == "openai"
@@ -299,6 +301,7 @@ Requires-Dist: sglang[srt]; extra == "all"
 Requires-Dist: sglang[openai]; extra == "all"
 Requires-Dist: sglang[anthropic]; extra == "all"
 Requires-Dist: sglang[litellm]; extra == "all"
+Requires-Dist: sglang[torch_memory_saver]; extra == "all"
 Provides-Extra: all-hip
 Requires-Dist: sglang[srt_hip]; extra == "all-hip"
 Requires-Dist: sglang[openai]; extra == "all-hip"
@@ -319,6 +322,11 @@ Requires-Dist: sglang[srt_cpu]; extra == "all-cpu"
 Requires-Dist: sglang[openai]; extra == "all-cpu"
 Requires-Dist: sglang[anthropic]; extra == "all-cpu"
 Requires-Dist: sglang[litellm]; extra == "all-cpu"
+Provides-Extra: all-npu
+Requires-Dist: sglang[srt_npu]; extra == "all-npu"
+Requires-Dist: sglang[openai]; extra == "all-npu"
+Requires-Dist: sglang[anthropic]; extra == "all-npu"
+Requires-Dist: sglang[litellm]; extra == "all-npu"
 Provides-Extra: dev
 Requires-Dist: sglang[all]; extra == "dev"
 Requires-Dist: sglang[test]; extra == "dev"
@@ -358,18 +366,19 @@ Dynamic: license-file
 | [**Slides**](https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#slides) |
 ## News
+- [2025/05] 🔥 Deploying DeepSeek with PD Disaggregation and Large-scale Expert Parallelism on 96 H100 GPUs ([blog](https://lmsys.org/blog/2025-05-05-large-scale-ep/)).
 - [2025/03] Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X ([AMD blog](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html))
 - [2025/03] SGLang Joins PyTorch Ecosystem: Efficient LLM Serving Engine ([PyTorch blog](https://pytorch.org/blog/sglang-joins-pytorch/))
-- [2025/02] Unlock DeepSeek-R1 Inference Performance on AMD Instinct™ MI300X GPU ([AMD blog](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1_Perf/README.html))
 - [2025/01] 🔥 SGLang provides day one support for DeepSeek V3/R1 models on NVIDIA and AMD GPUs with DeepSeek-specific optimizations. ([instructions](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3), [AMD blog](https://www.amd.com/en/developer/resources/technical-articles/amd-instinct-gpus-power-deepseek-v3-revolutionizing-ai-development-with-sglang.html), [10+ other companies](https://x.com/lmsysorg/status/1887262321636221412))
 - [2024/12] 🔥 v0.4 Release: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, Faster Structured Outputs ([blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/)).
-- [2024/09] v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
 - [2024/07] v0.2 Release: Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) ([blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/)).
 <details>
 <summary>More</summary>
+- [2025/02] Unlock DeepSeek-R1 Inference Performance on AMD Instinct™ MI300X GPU ([AMD blog](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1_Perf/README.html))
 - [2024/10] The First SGLang Online Meetup ([slides](https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#the-first-sglang-online-meetup)).
+- [2024/09] v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
 - [2024/02] SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).
 - [2024/01] SGLang provides up to **5x faster inference** with RadixAttention ([blog](https://lmsys.org/blog/2024-01-17-sglang/)).
 - [2024/01] SGLang powers the serving of the official **LLaVA v1.6** release demo ([usage](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#demo)).
@@ -383,7 +392,7 @@ The core features include:
 - **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, continuous batching, token attention (paged attention), speculative decoding, tensor parallelism, chunked prefill, structured outputs, quantization (FP8/INT4/AWQ/GPTQ), and multi-lora batching.
 - **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
-- **Extensive Model Support**: Supports a wide range of generative models (Llama, Gemma, Mistral, QWen, DeepSeek, LLaVA, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
+- **Extensive Model Support**: Supports a wide range of generative models (Llama, Gemma, Mistral, Qwen, DeepSeek, LLaVA, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
 - **Active Community**: SGLang is open-source and backed by an active community with industry adoption.
 ## Getting Started
@@ -401,7 +410,7 @@ Learn more in the release blogs: [v0.2 blog](https://lmsys.org/blog/2024-07-25-s
 ## Adoption and Sponsorship
 The project has been deployed to large-scale production, generating trillions of tokens every day.
-It is supported by the following institutions: AMD, Atlas Cloud, Baseten, Cursor, DataCrunch, Etched, Hyperbolic, Iflytek, Jam & Tea Studios, LinkedIn, LMSYS, Meituan, Nebius, Novita AI, NVIDIA, Oracle, RunPod, Stanford, UC Berkeley, UCLA, xAI, and 01.AI.
+It is supported by the following institutions: AMD, Atlas Cloud, Baseten, Cursor, DataCrunch, Etched, Google Cloud, Hyperbolic, Iflytek, InnoMatrix, Jam & Tea Studios, LinkedIn, LMSYS, Meituan, Nebius, Novita AI, NVIDIA, Oracle, RunPod, Stanford, UC Berkeley, UCLA, xAI, and 01.AI.
 <img src="https://raw.githubusercontent.com/sgl-project/sgl-learning-materials/main/slides/adoption.png" alt="logo" width="800" margin="10px"></img>

{sglang-0.4.6.post2 → sglang-0.4.6.post4}/README.md RENAMED

@@ -20,18 +20,19 @@
 | [**Slides**](https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#slides) |
 ## News
+- [2025/05] 🔥 Deploying DeepSeek with PD Disaggregation and Large-scale Expert Parallelism on 96 H100 GPUs ([blog](https://lmsys.org/blog/2025-05-05-large-scale-ep/)).
 - [2025/03] Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X ([AMD blog](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html))
 - [2025/03] SGLang Joins PyTorch Ecosystem: Efficient LLM Serving Engine ([PyTorch blog](https://pytorch.org/blog/sglang-joins-pytorch/))
-- [2025/02] Unlock DeepSeek-R1 Inference Performance on AMD Instinct™ MI300X GPU ([AMD blog](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1_Perf/README.html))
 - [2025/01] 🔥 SGLang provides day one support for DeepSeek V3/R1 models on NVIDIA and AMD GPUs with DeepSeek-specific optimizations. ([instructions](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3), [AMD blog](https://www.amd.com/en/developer/resources/technical-articles/amd-instinct-gpus-power-deepseek-v3-revolutionizing-ai-development-with-sglang.html), [10+ other companies](https://x.com/lmsysorg/status/1887262321636221412))
 - [2024/12] 🔥 v0.4 Release: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, Faster Structured Outputs ([blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/)).
-- [2024/09] v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
 - [2024/07] v0.2 Release: Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) ([blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/)).
 <details>
 <summary>More</summary>
+- [2025/02] Unlock DeepSeek-R1 Inference Performance on AMD Instinct™ MI300X GPU ([AMD blog](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1_Perf/README.html))
 - [2024/10] The First SGLang Online Meetup ([slides](https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#the-first-sglang-online-meetup)).
+- [2024/09] v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
 - [2024/02] SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).
 - [2024/01] SGLang provides up to **5x faster inference** with RadixAttention ([blog](https://lmsys.org/blog/2024-01-17-sglang/)).
 - [2024/01] SGLang powers the serving of the official **LLaVA v1.6** release demo ([usage](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#demo)).
@@ -45,7 +46,7 @@ The core features include:
 - **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, continuous batching, token attention (paged attention), speculative decoding, tensor parallelism, chunked prefill, structured outputs, quantization (FP8/INT4/AWQ/GPTQ), and multi-lora batching.
 - **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
-- **Extensive Model Support**: Supports a wide range of generative models (Llama, Gemma, Mistral, QWen, DeepSeek, LLaVA, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
+- **Extensive Model Support**: Supports a wide range of generative models (Llama, Gemma, Mistral, Qwen, DeepSeek, LLaVA, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models.
 - **Active Community**: SGLang is open-source and backed by an active community with industry adoption.
 ## Getting Started
@@ -63,7 +64,7 @@ Learn more in the release blogs: [v0.2 blog](https://lmsys.org/blog/2024-07-25-s
 ## Adoption and Sponsorship
 The project has been deployed to large-scale production, generating trillions of tokens every day.
-It is supported by the following institutions: AMD, Atlas Cloud, Baseten, Cursor, DataCrunch, Etched, Hyperbolic, Iflytek, Jam & Tea Studios, LinkedIn, LMSYS, Meituan, Nebius, Novita AI, NVIDIA, Oracle, RunPod, Stanford, UC Berkeley, UCLA, xAI, and 01.AI.
+It is supported by the following institutions: AMD, Atlas Cloud, Baseten, Cursor, DataCrunch, Etched, Google Cloud, Hyperbolic, Iflytek, InnoMatrix, Jam & Tea Studios, LinkedIn, LMSYS, Meituan, Nebius, Novita AI, NVIDIA, Oracle, RunPod, Stanford, UC Berkeley, UCLA, xAI, and 01.AI.
 <img src="https://raw.githubusercontent.com/sgl-project/sgl-learning-materials/main/slides/adoption.png" alt="logo" width="800" margin="10px"></img>

{sglang-0.4.6.post2 → sglang-0.4.6.post4}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "sglang"
-version = "0.4.6.post2"
+version = "0.4.6.post4"
 description = "SGLang is yet another fast serving framework for large language models and vision language models."
 readme = "README.md"
 requires-python = ">=3.8"
@@ -29,6 +29,7 @@ runtime_common = [
     "ninja",
     "orjson",
     "packaging",
+    "partial_json_parser",
     "pillow",
     "prometheus-client>=0.20.0",
     "psutil",
@@ -41,19 +42,18 @@ runtime_common = [
     "transformers==4.51.1",
     "uvicorn",
     "uvloop",
-    "xgrammar==0.1.17",
+    "xgrammar==0.1.19",
     "blobfile==3.0.0"
 ]
 srt = [
     "sglang[runtime_common]",
-    "sgl-kernel==0.1.1",
+    "sgl-kernel==0.1.2.post1",
     "flashinfer_python==0.2.5",
     "torch==2.6.0",
     "torchvision==0.21.0",
     "cuda-python",
     "outlines>=0.0.44,<=0.1.11",
-    "partial_json_parser",
     "einops",
 ]
@@ -64,7 +64,6 @@ blackwell = [
     "torchvision",
     "cuda-python",
     "outlines>=0.0.44,<=0.1.11",
-    "partial_json_parser",
     "einops",
 ]
@@ -89,6 +88,8 @@ srt_hpu = ["sglang[runtime_common]", "outlines>=0.0.44,<=0.1.11"]
 # To install vllm for CPU, please follow the instruction here:
 # https://docs.vllm.ai/en/latest/getting_started/installation/cpu/index.html
 srt_cpu = ["sglang[runtime_common]", "outlines>=0.0.44,<=0.1.11", "torch"]
+# https://vllm-ascend.readthedocs.io/en/latest/installation.html
+srt_npu = ["sglang[runtime_common]", "outlines>=0.0.44,<=0.1.11"]
 openai = ["openai>=1.0", "tiktoken"]
 anthropic = ["anthropic>=0.20.0"]
@@ -102,11 +103,12 @@ test = [
     "accelerate",
     "peft",
 ]
-all = ["sglang[srt]", "sglang[openai]", "sglang[anthropic]", "sglang[litellm]"]
+all = ["sglang[srt]", "sglang[openai]", "sglang[anthropic]", "sglang[litellm]", "sglang[torch_memory_saver]"]
 all_hip = ["sglang[srt_hip]", "sglang[openai]", "sglang[anthropic]", "sglang[litellm]"]
 all_xpu = ["sglang[srt_xpu]", "sglang[openai]", "sglang[anthropic]", "sglang[litellm]"]
 all_hpu = ["sglang[srt_hpu]", "sglang[openai]", "sglang[anthropic]", "sglang[litellm]"]
 all_cpu = ["sglang[srt_cpu]", "sglang[openai]", "sglang[anthropic]", "sglang[litellm]"]
+all_npu = ["sglang[srt_npu]", "sglang[openai]", "sglang[anthropic]", "sglang[litellm]"]
 dev = ["sglang[all]", "sglang[test]"]
 dev_hip = ["sglang[all_hip]", "sglang[test]"]
@@ -145,3 +147,7 @@ exclude = [
     "scripts*",
     "tests*",
 ]
+[tool.codespell]
+ignore-words-list = "ans, als, hel, boostrap, childs, te, vas, hsa, ment"
+skip = "*.json,*.jsonl,*.patch,*.txt"
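Review note: the dependency moves in the `pyproject.toml` hunks are easier to audit per extra. The sketch below compares two extras tables and reports what each extra gained and lost. `OLD` and `NEW` are hand-copied excerpts of only the changed entries from the hunks above, not the full tables, and `diff_extras` is a hypothetical helper that is not part of sglang.

```python
from typing import Dict, List, Tuple

# Hand-copied excerpts of the changed entries from the pyproject.toml hunks.
# Illustrative subsets only; unchanged requirements are omitted.
OLD: Dict[str, List[str]] = {
    "runtime_common": ["xgrammar==0.1.17"],
    "srt": ["sgl-kernel==0.1.1", "partial_json_parser"],
    "blackwell": ["partial_json_parser"],
}
NEW: Dict[str, List[str]] = {
    "runtime_common": ["partial_json_parser", "xgrammar==0.1.19"],
    "srt": ["sgl-kernel==0.1.2.post1"],
    "blackwell": [],
    "srt_npu": ["sglang[runtime_common]", "outlines>=0.0.44,<=0.1.11"],
}


def diff_extras(
    old: Dict[str, List[str]], new: Dict[str, List[str]]
) -> Dict[str, Tuple[List[str], List[str]]]:
    """Return {extra: (added, removed)} for every extra whose entries changed."""
    changes: Dict[str, Tuple[List[str], List[str]]] = {}
    for extra in sorted(set(old) | set(new)):
        before = set(old.get(extra, []))
        after = set(new.get(extra, []))
        added, removed = sorted(after - before), sorted(before - after)
        if added or removed:
            changes[extra] = (added, removed)
    return changes


if __name__ == "__main__":
    for extra, (added, removed) in diff_extras(OLD, NEW).items():
        print(f"{extra}: added={added} removed={removed}")
```

Read this way, `partial_json_parser` is not dropped: it leaves `srt` and `blackwell` and reappears in `runtime_common`, which both of those extras already pull in through `sglang[runtime_common]`.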

{sglang-0.4.6.post2 → sglang-0.4.6.post4}/sglang/bench_offline_throughput.py RENAMED

@@ -259,7 +259,9 @@ def throughput_test_once(
         measurement_results["total_input_tokens"]
         + measurement_results["total_output_tokens"]
     ) / latency
-    measurement_results["last_gen_throughput"] = server_info["last_gen_throughput"]
+    measurement_results["last_gen_throughput"] = server_info["internal_states"][0][
+        "last_gen_throughput"
+    ]
     return measurement_results
@@ -315,7 +317,7 @@ def throughput_test(
     tokenizer_id = server_args.tokenizer_path or server_args.model_path
     tokenizer = get_tokenizer(tokenizer_id)
-    # Set global environmnets
+    # Set global environments
     set_ulimit()
     random.seed(bench_args.seed)
     np.random.seed(bench_args.seed)
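Review note: the first hunk moves `last_gen_throughput` from the top level of `server_info` into `server_info["internal_states"][0]`. Scripts that must read results from both versions could use a small accessor like the one below. It is a minimal sketch: `read_last_gen_throughput` is a hypothetical name, and the two layouts are inferred only from the access patterns visible in this diff, not from a documented response schema.

```python
from typing import Any, Mapping, Optional


def read_last_gen_throughput(server_info: Mapping[str, Any]) -> Optional[float]:
    """Read last_gen_throughput from either layout seen in the diff.

    Newer access pattern: server_info["internal_states"][0]["last_gen_throughput"]
    Older access pattern: server_info["last_gen_throughput"]
    Returns None when neither key is present.
    """
    states = server_info.get("internal_states")
    if states:  # non-empty list: use the nested layout
        return states[0].get("last_gen_throughput")
    return server_info.get("last_gen_throughput")
```

Returning `None` rather than raising keeps the caller in charge of deciding whether a missing metric is an error.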

{sglang-0.4.6.post2 → sglang-0.4.6.post4}/sglang/bench_one_batch.py RENAMED

@@ -137,17 +137,7 @@ def load_model(server_args, port_args, tp_rank):
     suppress_other_loggers()
     rank_print = print if tp_rank == 0 else lambda *args, **kwargs: None
-    model_config = ModelConfig(
-        server_args.model_path,
-        trust_remote_code=server_args.trust_remote_code,
-        revision=server_args.revision,
-        context_length=server_args.context_length,
-        model_override_args=server_args.json_model_override_args,
-        is_embedding=server_args.is_embedding,
-        enable_multimodal=server_args.enable_multimodal,
-        dtype=server_args.dtype,
-        quantization=server_args.quantization,
-    )
+    model_config = ModelConfig.from_server_args(server_args)
     model_runner = ModelRunner(
         model_config=model_config,
         mem_fraction_static=server_args.mem_fraction_static,
@@ -256,7 +246,7 @@ def extend(reqs, model_runner):
     _maybe_prepare_dp_attn_batch(batch, model_runner)
     model_worker_batch = batch.get_model_worker_batch()
     forward_batch = ForwardBatch.init_new(model_worker_batch, model_runner)
-    logits_output = model_runner.forward(forward_batch)
+    logits_output, _ = model_runner.forward(forward_batch)
     next_token_ids = model_runner.sample(logits_output, forward_batch)
     return next_token_ids, logits_output.next_token_logits, batch
@@ -268,7 +258,7 @@ def decode(input_token_ids, batch, model_runner):
     _maybe_prepare_dp_attn_batch(batch, model_runner)
     model_worker_batch = batch.get_model_worker_batch()
     forward_batch = ForwardBatch.init_new(model_worker_batch, model_runner)
-    logits_output = model_runner.forward(forward_batch)
+    logits_output, _ = model_runner.forward(forward_batch)
     next_token_ids = model_runner.sample(logits_output, forward_batch)
     return next_token_ids, logits_output.next_token_logits
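Review note: both call sites now unpack two values from `model_runner.forward(...)` and discard the second. The diff does not show what that second value holds. For code that has to run against either version, a normalizing wrapper along these lines may help. `split_forward_result` is a hypothetical helper, and the assumption that the older return value is never itself a 2-tuple is mine.

```python
from typing import Any, Tuple


def split_forward_result(result: Any) -> Tuple[Any, Any]:
    """Normalize a forward() return value to (logits_output, extra).

    - A 2-tuple (the shape the updated call sites unpack) is returned as-is.
    - Any other value (the shape the old call sites used) is paired with None.

    Assumption: the older single return value is not itself a 2-tuple.
    """
    if isinstance(result, tuple) and len(result) == 2:
        return result
    return result, None
```

A call site would then read `logits_output, _ = split_forward_result(model_runner.forward(forward_batch))`.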

{sglang-0.4.6.post2 → sglang-0.4.6.post4}/sglang/bench_one_batch_server.py RENAMED

@@ -25,6 +25,7 @@ import requests
 from sglang.srt.entrypoints.http_server import launch_server
 from sglang.srt.server_args import ServerArgs
 from sglang.srt.utils import kill_process_tree
+from sglang.test.test_utils import is_in_ci, write_github_step_summary
 @dataclasses.dataclass
@@ -33,9 +34,13 @@ class BenchArgs:
     batch_size: Tuple[int] = (1,)
     input_len: Tuple[int] = (1024,)
     output_len: Tuple[int] = (16,)
+    temperature: float = 0.0
+    return_logprob: bool = False
+    input_len_step_percentage: float = 0.0
     result_filename: str = "result.jsonl"
     base_url: str = ""
     skip_warmup: bool = False
+    show_report: bool = False
     @staticmethod
     def add_cli_args(parser: argparse.ArgumentParser):
@@ -49,11 +54,19 @@ class BenchArgs:
         parser.add_argument(
             "--output-len", type=int, nargs="+", default=BenchArgs.output_len
         )
+        parser.add_argument("--temperature", type=float, default=BenchArgs.temperature)
+        parser.add_argument("--return-logprob", action="store_true")
+        parser.add_argument(
+            "--input-len-step-percentage",
+            type=float,
+            default=BenchArgs.input_len_step_percentage,
+        )
         parser.add_argument(
             "--result-filename", type=str, default=BenchArgs.result_filename
         )
         parser.add_argument("--base-url", type=str, default=BenchArgs.base_url)
         parser.add_argument("--skip-warmup", action="store_true")
+        parser.add_argument("--show-report", action="store_true")
     @classmethod
     def from_cli_args(cls, args: argparse.Namespace):
@@ -99,36 +112,89 @@ def run_one_case(
     batch_size: int,
     input_len: int,
     output_len: int,
+    temperature: float,
+    return_logprob: bool,
+    input_len_step_percentage: float,
     run_name: str,
     result_filename: str,
 ):
+    requests.post(url + "/flush_cache")
+    input_lens = [
+        int(input_len * (1 + (i - (batch_size - 1) / 2) * input_len_step_percentage))
+        for i in range(batch_size)
+    ]
     input_ids = [
-        [int(x) for x in np.random.randint(0, high=16384, size=(input_len,))]
-        for _ in range(batch_size)
+        [int(x) for x in np.random.randint(0, high=16384, size=(input_lens[i],))]
+        for i in range(batch_size)
     ]
+    use_structured_outputs = False
+    if use_structured_outputs:
+        texts = []
+        for _ in range(batch_size):
+            texts.append(
+                "Human: What is the capital city of france? can you give as many trivial information as possible about that city? answer in json.\n"
+                * 50
+                + "Assistant:"
+            )
+        json_schema = "$$ANY$$"
+    else:
+        json_schema = None
     tic = time.time()
     response = requests.post(
         url + "/generate",
         json={
+            # "text": texts,
             "input_ids": input_ids,
             "sampling_params": {
-                "temperature": 0,
+                "temperature": temperature,
                 "max_new_tokens": output_len,
                 "ignore_eos": True,
+                "json_schema": json_schema,
             },
+            "return_logprob": return_logprob,
+            "stream": True,
         },
+        stream=True,
     )
-    latency = time.time() - tic
-    _ = response.json()
-    output_throughput = batch_size * output_len / latency
+    # The TTFT of the last request in the batch
+    ttft = 0.0
+    for chunk in response.iter_lines(decode_unicode=False):
+        chunk = chunk.decode("utf-8")
+        if chunk and chunk.startswith("data:"):
+            if chunk == "data: [DONE]":
+                break
+            data = json.loads(chunk[5:].strip("\n"))
+            if "error" in data:
+                raise RuntimeError(f"Request has failed. {data}.")
+            assert (
+                data["meta_info"]["finish_reason"] is None
+                or data["meta_info"]["finish_reason"]["type"] == "length"
+            )
+            if data["meta_info"]["completion_tokens"] == 1:
+                ttft = time.time() - tic
+    latency = time.time() - tic
+    input_throughput = batch_size * input_len / ttft
+    output_throughput = batch_size * output_len / (latency - ttft)
     overall_throughput = batch_size * (input_len + output_len) / latency
+    server_info = requests.get(url + "/get_server_info").json()
+    acc_length = server_info["internal_states"][0].get("avg_spec_accept_length", None)
+    last_gen_throughput = server_info["internal_states"][0]["last_gen_throughput"]
     print(f"batch size: {batch_size}")
+    print(f"input_len: {input_len}")
+    print(f"output_len: {output_len}")
     print(f"latency: {latency:.2f} s")
-    print(f"output throughput: {output_throughput:.2f} token/s")
-    print(f"(input + output) throughput: {overall_throughput:.2f} token/s")
+    print(f"ttft: {ttft:.2f} s")
+    print(f"Last generation throughput: {last_gen_throughput:.2f} tok/s")
+    print(f"Input throughput: {input_throughput:.2f} tok/s")
+    if output_len != 1:
+        print(f"output throughput: {output_throughput:.2f} tok/s")
     if result_filename:
         with open(result_filename, "a") as fout:
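Review note: this hunk introduces two pieces of arithmetic that are worth checking in isolation: the staggered per-request input lengths and the TTFT-based split of throughput into a prefill part and a decode part. The sketch below copies both expressions from the hunk into standalone functions. The function names and the `ValueError` guards are my additions. The original initializes `ttft` to `0.0` and divides by it without a guard.

```python
from typing import Dict, List


def staggered_input_lens(batch_size: int, input_len: int, step_pct: float) -> List[int]:
    """Per-request lengths, offset symmetrically around the batch midpoint.

    Same expression as the hunk. Because the offsets cancel, the mean length
    stays at input_len (before int truncation), which is why the benchmark
    can keep using the nominal input_len in its throughput formulas.
    """
    return [
        int(input_len * (1 + (i - (batch_size - 1) / 2) * step_pct))
        for i in range(batch_size)
    ]


def split_throughput(
    batch_size: int, input_len: int, output_len: int, ttft: float, latency: float
) -> Dict[str, float]:
    """Throughput figures in tokens per second, using the hunk's formulas.

    ttft    : seconds until a chunk reports completion_tokens == 1
    latency : seconds until the stream finishes
    """
    if ttft <= 0:
        # Added guard: ttft stays 0.0 if no chunk ever reports exactly one token.
        raise ValueError("ttft was never recorded")
    if latency <= ttft:
        # Added guard: no measurable decode phase.
        raise ValueError("latency must exceed ttft")
    return {
        "input_throughput": batch_size * input_len / ttft,
        "output_throughput": batch_size * output_len / (latency - ttft),
        "overall_throughput": batch_size * (input_len + output_len) / latency,
    }


if __name__ == "__main__":
    print(staggered_input_lens(3, 100, 0.5))
    print(split_throughput(4, 1000, 100, ttft=0.5, latency=2.5))
```

With `step_pct = 0.0` (the default for `--input-len-step-percentage`) every request gets exactly `input_len` tokens, so existing benchmark configurations keep their input shapes.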
@@ -140,9 +206,21 @@ def run_one_case(
                 "latency": round(latency, 4),
                 "output_throughput": round(output_throughput, 2),
                 "overall_throughput": round(overall_throughput, 2),
+                "last_gen_throughput": round(last_gen_throughput, 2),
             }
             fout.write(json.dumps(res) + "\n")
+    return (
+        batch_size,
+        latency,
+        ttft,
+        input_throughput,
+        output_throughput,
+        overall_throughput,
+        last_gen_throughput,
+        acc_length,
+    )
 def run_benchmark(server_args: ServerArgs, bench_args: BenchArgs):
     if bench_args.base_url:
@@ -152,27 +230,38 @@ def run_benchmark(server_args: ServerArgs, bench_args: BenchArgs):
     # warmup
     if not bench_args.skip_warmup:
+        print("=" * 8 + " Warmup Begin " + "=" * 8)
         run_one_case(
             base_url,
             batch_size=16,
             input_len=1024,
             output_len=16,
+            temperature=bench_args.temperature,
+            return_logprob=bench_args.return_logprob,
+            input_len_step_percentage=bench_args.input_len_step_percentage,
             run_name="",
             result_filename="",
         )
+        print("=" * 8 + " Warmup End   " + "=" * 8 + "\n")
     # benchmark
+    result = []
     try:
         for bs, il, ol in itertools.product(
             bench_args.batch_size, bench_args.input_len, bench_args.output_len
         ):
-            run_one_case(
-                base_url,
-                bs,
-                il,
-                ol,
-                bench_args.run_name,
-                bench_args.result_filename,
+            result.append(
+                run_one_case(
+                    base_url,
+                    bs,
+                    il,
+                    ol,
+                    temperature=bench_args.temperature,
+                    return_logprob=bench_args.return_logprob,
+                    input_len_step_percentage=bench_args.input_len_step_percentage,
+                    run_name=bench_args.run_name,
+                    result_filename=bench_args.result_filename,
+                )
             )
     finally:
         if proc:
@@ -180,6 +269,45 @@ def run_benchmark(server_args: ServerArgs, bench_args: BenchArgs):
     print(f"\nResults are saved to {bench_args.result_filename}")
+    if not bench_args.show_report:
+        return
+    summary = " | batch size | latency (s) | input throughput (tok/s)  | output throughput (tok/s) | acc length | ITL (ms) | input price ($/1M) | output price ($/1M) |\n"
+    summary += "| ---------- | ----------- | ------------------------- | ------------------------- | ---------- | -------- | ------------------ | ------------------- |\n"
+    for (
+        batch_size,
+        latency,
+        ttft,
+        input_throughput,
+        output_throughput,
+        overall_throughput,
+        last_gen_throughput,
+        acc_length,
+    ) in result:
+        hourly_cost = 2 * server_args.tp_size  # $2/hour for one H100
+        input_util = 0.7
+        accept_length = round(acc_length, 2) if acc_length is not None else "n/a"
+        line = (
+            f"| {batch_size} | "
+            f"{latency:.2f} | "
+            f"{input_throughput:.2f} | "
+            f"{output_throughput:.2f} | "
+            f"{accept_length} | "
+            f"{1 / (output_throughput/batch_size) * 1000:.2f} | "
+            f"{1e6 / (input_throughput * input_util) / 3600 * hourly_cost:.2f} | "
+            f"{1e6 / output_throughput / 3600 * hourly_cost:.2f} |\n"
+        )
+        summary += line
+    # print metrics table
+    print(summary)
+    if is_in_ci():
+        write_github_step_summary(
+            f"### Test Nightly Benchmark (bench_one_batch) \n{summary}"
+        )
 if __name__ == "__main__":
     parser = argparse.ArgumentParser()
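Review note: the `--show-report` table derives three columns from the measured throughputs: ITL in milliseconds and the input and output price per million tokens. The sketch below restates those expressions as one function so the units can be checked. `report_metrics` and its keyword parameters are hypothetical. The defaults of 2.0 dollars per GPU-hour and 0.7 input utilization mirror the constants hard-coded in the hunk.

```python
from typing import Dict


def report_metrics(
    batch_size: int,
    input_throughput: float,   # tok/s during prefill
    output_throughput: float,  # tok/s during decode
    tp_size: int,
    gpu_hourly_usd: float = 2.0,  # hunk hard-codes $2/hour per H100
    input_util: float = 0.7,      # hunk hard-codes 0.7
) -> Dict[str, float]:
    """Derived report columns, using the same formulas as the hunk."""
    hourly_cost = gpu_hourly_usd * tp_size
    # Inter-token latency: milliseconds per token for a single request.
    itl_ms = 1 / (output_throughput / batch_size) * 1000
    # Seconds needed for 1M tokens -> hours -> dollars.
    input_price = 1e6 / (input_throughput * input_util) / 3600 * hourly_cost
    output_price = 1e6 / output_throughput / 3600 * hourly_cost
    return {
        "itl_ms": itl_ms,
        "input_price_per_1m": input_price,
        "output_price_per_1m": output_price,
    }


if __name__ == "__main__":
    print(report_metrics(16, 10000.0, 1600.0, tp_size=1))
```

Because `hourly_cost` scales with `tp_size`, the reported prices assume one GPU per tensor-parallel rank at the same hourly rate.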
