PyPI - sglang - Versions diffs - 0.4.6.post3__tar.gz → 0.4.6.post4__tar.gz - Mend

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (653)

{sglang-0.4.6.post3/sglang.egg-info → sglang-0.4.6.post4}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: sglang
-Version: 0.4.6.post3
+Version: 0.4.6.post4
 Summary: SGLang is yet another fast serving framework for large language models and vision language models.
 License:                                  Apache License
                                    Version 2.0, January 2004
@@ -247,7 +247,7 @@ Requires-Dist: xgrammar==0.1.19; extra == "runtime-common"
 Requires-Dist: blobfile==3.0.0; extra == "runtime-common"
 Provides-Extra: srt
 Requires-Dist: sglang[runtime_common]; extra == "srt"
-Requires-Dist: sgl-kernel==0.1.1; extra == "srt"
+Requires-Dist: sgl-kernel==0.1.2.post1; extra == "srt"
 Requires-Dist: flashinfer_python==0.2.5; extra == "srt"
 Requires-Dist: torch==2.6.0; extra == "srt"
 Requires-Dist: torchvision==0.21.0; extra == "srt"
@@ -301,6 +301,7 @@ Requires-Dist: sglang[srt]; extra == "all"
 Requires-Dist: sglang[openai]; extra == "all"
 Requires-Dist: sglang[anthropic]; extra == "all"
 Requires-Dist: sglang[litellm]; extra == "all"
+Requires-Dist: sglang[torch_memory_saver]; extra == "all"
 Provides-Extra: all-hip
 Requires-Dist: sglang[srt_hip]; extra == "all-hip"
 Requires-Dist: sglang[openai]; extra == "all-hip"
@@ -368,16 +369,16 @@ Dynamic: license-file
 - [2025/05] 🔥 Deploying DeepSeek with PD Disaggregation and Large-scale Expert Parallelism on 96 H100 GPUs ([blog](https://lmsys.org/blog/2025-05-05-large-scale-ep/)).
 - [2025/03] Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X ([AMD blog](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html))
 - [2025/03] SGLang Joins PyTorch Ecosystem: Efficient LLM Serving Engine ([PyTorch blog](https://pytorch.org/blog/sglang-joins-pytorch/))
-- [2025/02] Unlock DeepSeek-R1 Inference Performance on AMD Instinct™ MI300X GPU ([AMD blog](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1_Perf/README.html))
 - [2025/01] 🔥 SGLang provides day one support for DeepSeek V3/R1 models on NVIDIA and AMD GPUs with DeepSeek-specific optimizations. ([instructions](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3), [AMD blog](https://www.amd.com/en/developer/resources/technical-articles/amd-instinct-gpus-power-deepseek-v3-revolutionizing-ai-development-with-sglang.html), [10+ other companies](https://x.com/lmsysorg/status/1887262321636221412))
 - [2024/12] 🔥 v0.4 Release: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, Faster Structured Outputs ([blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/)).
-- [2024/09] v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
 - [2024/07] v0.2 Release: Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) ([blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/)).
 <details>
 <summary>More</summary>
+- [2025/02] Unlock DeepSeek-R1 Inference Performance on AMD Instinct™ MI300X GPU ([AMD blog](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1_Perf/README.html))
 - [2024/10] The First SGLang Online Meetup ([slides](https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#the-first-sglang-online-meetup)).
+- [2024/09] v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
 - [2024/02] SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).
 - [2024/01] SGLang provides up to **5x faster inference** with RadixAttention ([blog](https://lmsys.org/blog/2024-01-17-sglang/)).
 - [2024/01] SGLang powers the serving of the official **LLaVA v1.6** release demo ([usage](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#demo)).
@@ -409,7 +410,7 @@ Learn more in the release blogs: [v0.2 blog](https://lmsys.org/blog/2024-07-25-s
 ## Adoption and Sponsorship
 The project has been deployed to large-scale production, generating trillions of tokens every day.
-It is supported by the following institutions: AMD, Atlas Cloud, Baseten, Cursor, DataCrunch, Etched, Google Cloud, Hyperbolic, Iflytek, Jam & Tea Studios, LinkedIn, LMSYS, Meituan, Nebius, Novita AI, NVIDIA, Oracle, RunPod, Stanford, UC Berkeley, UCLA, xAI, and 01.AI.
+It is supported by the following institutions: AMD, Atlas Cloud, Baseten, Cursor, DataCrunch, Etched, Google Cloud, Hyperbolic, Iflytek, InnoMatrix, Jam & Tea Studios, LinkedIn, LMSYS, Meituan, Nebius, Novita AI, NVIDIA, Oracle, RunPod, Stanford, UC Berkeley, UCLA, xAI, and 01.AI.
 <img src="https://raw.githubusercontent.com/sgl-project/sgl-learning-materials/main/slides/adoption.png" alt="logo" width="800" margin="10px"></img>

{sglang-0.4.6.post3 → sglang-0.4.6.post4}/README.md RENAMED

@@ -23,16 +23,16 @@
 - [2025/05] 🔥 Deploying DeepSeek with PD Disaggregation and Large-scale Expert Parallelism on 96 H100 GPUs ([blog](https://lmsys.org/blog/2025-05-05-large-scale-ep/)).
 - [2025/03] Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X ([AMD blog](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html))
 - [2025/03] SGLang Joins PyTorch Ecosystem: Efficient LLM Serving Engine ([PyTorch blog](https://pytorch.org/blog/sglang-joins-pytorch/))
-- [2025/02] Unlock DeepSeek-R1 Inference Performance on AMD Instinct™ MI300X GPU ([AMD blog](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1_Perf/README.html))
 - [2025/01] 🔥 SGLang provides day one support for DeepSeek V3/R1 models on NVIDIA and AMD GPUs with DeepSeek-specific optimizations. ([instructions](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3), [AMD blog](https://www.amd.com/en/developer/resources/technical-articles/amd-instinct-gpus-power-deepseek-v3-revolutionizing-ai-development-with-sglang.html), [10+ other companies](https://x.com/lmsysorg/status/1887262321636221412))
 - [2024/12] 🔥 v0.4 Release: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, Faster Structured Outputs ([blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/)).
-- [2024/09] v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
 - [2024/07] v0.2 Release: Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) ([blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/)).
 <details>
 <summary>More</summary>
+- [2025/02] Unlock DeepSeek-R1 Inference Performance on AMD Instinct™ MI300X GPU ([AMD blog](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1_Perf/README.html))
 - [2024/10] The First SGLang Online Meetup ([slides](https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#the-first-sglang-online-meetup)).
+- [2024/09] v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
 - [2024/02] SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).
 - [2024/01] SGLang provides up to **5x faster inference** with RadixAttention ([blog](https://lmsys.org/blog/2024-01-17-sglang/)).
 - [2024/01] SGLang powers the serving of the official **LLaVA v1.6** release demo ([usage](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#demo)).
@@ -64,7 +64,7 @@ Learn more in the release blogs: [v0.2 blog](https://lmsys.org/blog/2024-07-25-s
 ## Adoption and Sponsorship
 The project has been deployed to large-scale production, generating trillions of tokens every day.
-It is supported by the following institutions: AMD, Atlas Cloud, Baseten, Cursor, DataCrunch, Etched, Google Cloud, Hyperbolic, Iflytek, Jam & Tea Studios, LinkedIn, LMSYS, Meituan, Nebius, Novita AI, NVIDIA, Oracle, RunPod, Stanford, UC Berkeley, UCLA, xAI, and 01.AI.
+It is supported by the following institutions: AMD, Atlas Cloud, Baseten, Cursor, DataCrunch, Etched, Google Cloud, Hyperbolic, Iflytek, InnoMatrix, Jam & Tea Studios, LinkedIn, LMSYS, Meituan, Nebius, Novita AI, NVIDIA, Oracle, RunPod, Stanford, UC Berkeley, UCLA, xAI, and 01.AI.
 <img src="https://raw.githubusercontent.com/sgl-project/sgl-learning-materials/main/slides/adoption.png" alt="logo" width="800" margin="10px"></img>

{sglang-0.4.6.post3 → sglang-0.4.6.post4}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "sglang"
-version = "0.4.6.post3"
+version = "0.4.6.post4"
 description = "SGLang is yet another fast serving framework for large language models and vision language models."
 readme = "README.md"
 requires-python = ">=3.8"
@@ -48,7 +48,7 @@ runtime_common = [
 srt = [
     "sglang[runtime_common]",
-    "sgl-kernel==0.1.1",
+    "sgl-kernel==0.1.2.post1",
     "flashinfer_python==0.2.5",
     "torch==2.6.0",
     "torchvision==0.21.0",
@@ -103,7 +103,7 @@ test = [
     "accelerate",
     "peft",
 ]
-all = ["sglang[srt]", "sglang[openai]", "sglang[anthropic]", "sglang[litellm]"]
+all = ["sglang[srt]", "sglang[openai]", "sglang[anthropic]", "sglang[litellm]", "sglang[torch_memory_saver]"]
 all_hip = ["sglang[srt_hip]", "sglang[openai]", "sglang[anthropic]", "sglang[litellm]"]
 all_xpu = ["sglang[srt_xpu]", "sglang[openai]", "sglang[anthropic]", "sglang[litellm]"]
 all_hpu = ["sglang[srt_hpu]", "sglang[openai]", "sglang[anthropic]", "sglang[litellm]"]
@@ -147,3 +147,7 @@ exclude = [
     "scripts*",
     "tests*",
 ]
+[tool.codespell]
+ignore-words-list = "ans, als, hel, boostrap, childs, te, vas, hsa, ment"
+skip = "*.json,*.jsonl,*.patch,*.txt"

{sglang-0.4.6.post3 → sglang-0.4.6.post4}/sglang/bench_offline_throughput.py RENAMED

@@ -259,7 +259,9 @@ def throughput_test_once(
         measurement_results["total_input_tokens"]
         + measurement_results["total_output_tokens"]
     ) / latency
-    measurement_results["last_gen_throughput"] = server_info["last_gen_throughput"]
+    measurement_results["last_gen_throughput"] = server_info["internal_states"][0][
+        "last_gen_throughput"
+    ]
     return measurement_results
@@ -315,7 +317,7 @@ def throughput_test(
     tokenizer_id = server_args.tokenizer_path or server_args.model_path
     tokenizer = get_tokenizer(tokenizer_id)
-    # Set global environmnets
+    # Set global environments
     set_ulimit()
     random.seed(bench_args.seed)
     np.random.seed(bench_args.seed)
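
The first hunk above moves `last_gen_throughput` from a top-level key of the server-info payload to the first entry of `internal_states`. A minimal sketch of a reader that tolerates both layouts follows. The helper name `read_last_gen_throughput` and the sample dictionaries are hypothetical; only the two key paths are taken from the diff.

```python
from typing import Any, Dict, Optional


def read_last_gen_throughput(server_info: Dict[str, Any]) -> Optional[float]:
    """Return last_gen_throughput from either payload layout, or None if absent."""
    # Layout read by the added lines: the metric sits in internal_states[0].
    states = server_info.get("internal_states")
    if states:
        return states[0].get("last_gen_throughput")
    # Layout read by the removed line: the metric is a top-level key.
    return server_info.get("last_gen_throughput")


# Hypothetical payloads, for illustration only.
old_style = {"last_gen_throughput": 123.4}
new_style = {"internal_states": [{"last_gen_throughput": 567.8}]}

print(read_last_gen_throughput(old_style))
print(read_last_gen_throughput(new_style))
```

A fallback like this may be useful when one script has to talk to servers on both sides of the version bump; the benchmark in the diff itself reads only the nested path.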

{sglang-0.4.6.post3 → sglang-0.4.6.post4}/sglang/bench_one_batch.py RENAMED

@@ -246,7 +246,7 @@ def extend(reqs, model_runner):
     _maybe_prepare_dp_attn_batch(batch, model_runner)
     model_worker_batch = batch.get_model_worker_batch()
     forward_batch = ForwardBatch.init_new(model_worker_batch, model_runner)
-    logits_output = model_runner.forward(forward_batch)
+    logits_output, _ = model_runner.forward(forward_batch)
     next_token_ids = model_runner.sample(logits_output, forward_batch)
     return next_token_ids, logits_output.next_token_logits, batch
@@ -258,7 +258,7 @@ def decode(input_token_ids, batch, model_runner):
     _maybe_prepare_dp_attn_batch(batch, model_runner)
     model_worker_batch = batch.get_model_worker_batch()
     forward_batch = ForwardBatch.init_new(model_worker_batch, model_runner)
-    logits_output = model_runner.forward(forward_batch)
+    logits_output, _ = model_runner.forward(forward_batch)
     next_token_ids = model_runner.sample(logits_output, forward_batch)
     return next_token_ids, logits_output.next_token_logits

{sglang-0.4.6.post3 → sglang-0.4.6.post4}/sglang/bench_one_batch_server.py RENAMED

@@ -25,6 +25,7 @@ import requests
 from sglang.srt.entrypoints.http_server import launch_server
 from sglang.srt.server_args import ServerArgs
 from sglang.srt.utils import kill_process_tree
+from sglang.test.test_utils import is_in_ci, write_github_step_summary
 @dataclasses.dataclass
@@ -33,9 +34,13 @@ class BenchArgs:
     batch_size: Tuple[int] = (1,)
     input_len: Tuple[int] = (1024,)
     output_len: Tuple[int] = (16,)
+    temperature: float = 0.0
+    return_logprob: bool = False
+    input_len_step_percentage: float = 0.0
     result_filename: str = "result.jsonl"
     base_url: str = ""
     skip_warmup: bool = False
+    show_report: bool = False
     @staticmethod
     def add_cli_args(parser: argparse.ArgumentParser):
@@ -49,11 +54,19 @@ class BenchArgs:
         parser.add_argument(
             "--output-len", type=int, nargs="+", default=BenchArgs.output_len
         )
+        parser.add_argument("--temperature", type=float, default=BenchArgs.temperature)
+        parser.add_argument("--return-logprob", action="store_true")
+        parser.add_argument(
+            "--input-len-step-percentage",
+            type=float,
+            default=BenchArgs.input_len_step_percentage,
+        )
         parser.add_argument(
             "--result-filename", type=str, default=BenchArgs.result_filename
         )
         parser.add_argument("--base-url", type=str, default=BenchArgs.base_url)
         parser.add_argument("--skip-warmup", action="store_true")
+        parser.add_argument("--show-report", action="store_true")
     @classmethod
     def from_cli_args(cls, args: argparse.Namespace):
@@ -99,36 +112,89 @@ def run_one_case(
     batch_size: int,
     input_len: int,
     output_len: int,
+    temperature: float,
+    return_logprob: bool,
+    input_len_step_percentage: float,
     run_name: str,
     result_filename: str,
 ):
+    requests.post(url + "/flush_cache")
+    input_lens = [
+        int(input_len * (1 + (i - (batch_size - 1) / 2) * input_len_step_percentage))
+        for i in range(batch_size)
+    ]
     input_ids = [
-        [int(x) for x in np.random.randint(0, high=16384, size=(input_len,))]
-        for _ in range(batch_size)
+        [int(x) for x in np.random.randint(0, high=16384, size=(input_lens[i],))]
+        for i in range(batch_size)
     ]
+    use_structured_outputs = False
+    if use_structured_outputs:
+        texts = []
+        for _ in range(batch_size):
+            texts.append(
+                "Human: What is the capital city of france? can you give as many trivial information as possible about that city? answer in json.\n"
+                * 50
+                + "Assistant:"
+            )
+        json_schema = "$$ANY$$"
+    else:
+        json_schema = None
     tic = time.time()
     response = requests.post(
         url + "/generate",
         json={
+            # "text": texts,
             "input_ids": input_ids,
             "sampling_params": {
-                "temperature": 0,
+                "temperature": temperature,
                 "max_new_tokens": output_len,
                 "ignore_eos": True,
+                "json_schema": json_schema,
             },
+            "return_logprob": return_logprob,
+            "stream": True,
         },
+        stream=True,
     )
-    latency = time.time() - tic
-    _ = response.json()
-    output_throughput = batch_size * output_len / latency
+    # The TTFT of the last request in the batch
+    ttft = 0.0
+    for chunk in response.iter_lines(decode_unicode=False):
+        chunk = chunk.decode("utf-8")
+        if chunk and chunk.startswith("data:"):
+            if chunk == "data: [DONE]":
+                break
+            data = json.loads(chunk[5:].strip("\n"))
+            if "error" in data:
+                raise RuntimeError(f"Request has failed. {data}.")
+            assert (
+                data["meta_info"]["finish_reason"] is None
+                or data["meta_info"]["finish_reason"]["type"] == "length"
+            )
+            if data["meta_info"]["completion_tokens"] == 1:
+                ttft = time.time() - tic
+    latency = time.time() - tic
+    input_throughput = batch_size * input_len / ttft
+    output_throughput = batch_size * output_len / (latency - ttft)
     overall_throughput = batch_size * (input_len + output_len) / latency
+    server_info = requests.get(url + "/get_server_info").json()
+    acc_length = server_info["internal_states"][0].get("avg_spec_accept_length", None)
+    last_gen_throughput = server_info["internal_states"][0]["last_gen_throughput"]
     print(f"batch size: {batch_size}")
+    print(f"input_len: {input_len}")
+    print(f"output_len: {output_len}")
     print(f"latency: {latency:.2f} s")
-    print(f"output throughput: {output_throughput:.2f} token/s")
-    print(f"(input + output) throughput: {overall_throughput:.2f} token/s")
+    print(f"ttft: {ttft:.2f} s")
+    print(f"Last generation throughput: {last_gen_throughput:.2f} tok/s")
+    print(f"Input throughput: {input_throughput:.2f} tok/s")
+    if output_len != 1:
+        print(f"output throughput: {output_throughput:.2f} tok/s")
     if result_filename:
         with open(result_filename, "a") as fout:
@@ -140,9 +206,21 @@ def run_one_case(
                 "latency": round(latency, 4),
                 "output_throughput": round(output_throughput, 2),
                 "overall_throughput": round(overall_throughput, 2),
+                "last_gen_throughput": round(last_gen_throughput, 2),
             }
             fout.write(json.dumps(res) + "\n")
+    return (
+        batch_size,
+        latency,
+        ttft,
+        input_throughput,
+        output_throughput,
+        overall_throughput,
+        last_gen_throughput,
+        acc_length,
+    )
 def run_benchmark(server_args: ServerArgs, bench_args: BenchArgs):
     if bench_args.base_url:
@@ -152,27 +230,38 @@ def run_benchmark(server_args: ServerArgs, bench_args: BenchArgs):
     # warmup
     if not bench_args.skip_warmup:
+        print("=" * 8 + " Warmup Begin " + "=" * 8)
         run_one_case(
             base_url,
             batch_size=16,
             input_len=1024,
             output_len=16,
+            temperature=bench_args.temperature,
+            return_logprob=bench_args.return_logprob,
+            input_len_step_percentage=bench_args.input_len_step_percentage,
             run_name="",
             result_filename="",
         )
+        print("=" * 8 + " Warmup End   " + "=" * 8 + "\n")
     # benchmark
+    result = []
     try:
         for bs, il, ol in itertools.product(
             bench_args.batch_size, bench_args.input_len, bench_args.output_len
         ):
-            run_one_case(
-                base_url,
-                bs,
-                il,
-                ol,
-                bench_args.run_name,
-                bench_args.result_filename,
+            result.append(
+                run_one_case(
+                    base_url,
+                    bs,
+                    il,
+                    ol,
+                    temperature=bench_args.temperature,
+                    return_logprob=bench_args.return_logprob,
+                    input_len_step_percentage=bench_args.input_len_step_percentage,
+                    run_name=bench_args.run_name,
+                    result_filename=bench_args.result_filename,
+                )
             )
     finally:
         if proc:
@@ -180,6 +269,45 @@ def run_benchmark(server_args: ServerArgs, bench_args: BenchArgs):
     print(f"\nResults are saved to {bench_args.result_filename}")
+    if not bench_args.show_report:
+        return
+    summary = " | batch size | latency (s) | input throughput (tok/s)  | output throughput (tok/s) | acc length | ITL (ms) | input price ($/1M) | output price ($/1M) |\n"
+    summary += "| ---------- | ----------- | ------------------------- | ------------------------- | ---------- | -------- | ------------------ | ------------------- |\n"
+    for (
+        batch_size,
+        latency,
+        ttft,
+        input_throughput,
+        output_throughput,
+        overall_throughput,
+        last_gen_throughput,
+        acc_length,
+    ) in result:
+        hourly_cost = 2 * server_args.tp_size  # $2/hour for one H100
+        input_util = 0.7
+        accept_length = round(acc_length, 2) if acc_length is not None else "n/a"
+        line = (
+            f"| {batch_size} | "
+            f"{latency:.2f} | "
+            f"{input_throughput:.2f} | "
+            f"{output_throughput:.2f} | "
+            f"{accept_length} | "
+            f"{1 / (output_throughput/batch_size) * 1000:.2f} | "
+            f"{1e6 / (input_throughput * input_util) / 3600 * hourly_cost:.2f} | "
+            f"{1e6 / output_throughput / 3600 * hourly_cost:.2f} |\n"
+        )
+        summary += line
+    # print metrics table
+    print(summary)
+    if is_in_ci():
+        write_github_step_summary(
+            f"### Test Nightly Benchmark (bench_one_batch) \n{summary}"
+        )
 if __name__ == "__main__":
     parser = argparse.ArgumentParser()
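
The arithmetic added to `run_one_case` and to the report table can be checked in isolation. The sketch below copies the expressions from the diff into three standalone functions. The function names, the `ValueError` guard, and the sample numbers are assumptions made for illustration; the $2/hour rate and the 0.7 input utilization are the constants that appear in the diff.

```python
from typing import List, Tuple


def spread_input_lens(input_len: int, batch_size: int, step_percentage: float) -> List[int]:
    """Per-request input lengths, spread symmetrically around input_len."""
    return [
        int(input_len * (1 + (i - (batch_size - 1) / 2) * step_percentage))
        for i in range(batch_size)
    ]


def derived_throughputs(
    batch_size: int, input_len: int, output_len: int, ttft: float, latency: float
) -> Tuple[float, float, float]:
    """Input, output, and overall throughput in tokens per second."""
    # Guard added for this sketch: the expressions divide by ttft and by
    # (latency - ttft), so both must be positive.
    if ttft <= 0 or latency <= ttft:
        raise ValueError("need 0 < ttft < latency")
    input_throughput = batch_size * input_len / ttft
    output_throughput = batch_size * output_len / (latency - ttft)
    overall_throughput = batch_size * (input_len + output_len) / latency
    return input_throughput, output_throughput, overall_throughput


def report_columns(
    batch_size: int,
    input_throughput: float,
    output_throughput: float,
    tp_size: int,
    hourly_rate: float = 2.0,  # dollars per GPU-hour, as in the diff
    input_util: float = 0.7,   # input utilization factor, as in the diff
) -> Tuple[float, float, float]:
    """ITL in milliseconds and input/output price in dollars per 1M tokens."""
    hourly_cost = hourly_rate * tp_size
    itl_ms = 1 / (output_throughput / batch_size) * 1000
    input_price = 1e6 / (input_throughput * input_util) / 3600 * hourly_cost
    output_price = 1e6 / output_throughput / 3600 * hourly_cost
    return itl_ms, input_price, output_price


lens = spread_input_lens(1024, 4, 0.25)
inp, out, overall = derived_throughputs(4, 1024, 16, ttft=0.5, latency=2.5)
itl_ms, input_price, output_price = report_columns(4, inp, out, tp_size=1)
print(lens, inp, out, overall, itl_ms)
```

Note that the throughput expressions use the nominal `input_len` rather than the individual spread lengths; because the spread is symmetric, the total token count stays close to `batch_size * input_len`, up to integer truncation.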

{sglang-0.4.6.post3 → sglang-0.4.6.post4}/sglang/bench_serving.py RENAMED

@@ -1103,7 +1103,7 @@ async def benchmark(
     lora_names: List[str],
     extra_request_body: Dict[str, Any],
     profile: bool,
-    pd_seperated: bool = False,
+    pd_separated: bool = False,
     flush_cache: bool = False,
     warmup_requests: int = 1,
 ):
@@ -1239,12 +1239,14 @@ async def benchmark(
     if "sglang" in backend:
         server_info = requests.get(base_url + "/get_server_info")
-        if pd_seperated:
-            accept_length = server_info.json()["decode"][0].get(
+        if pd_separated:
+            accept_length = server_info.json()["decode"][0]["internal_states"][0].get(
                 "avg_spec_accept_length", None
             )
         else:
-            accept_length = server_info.json().get("avg_spec_accept_length", None)
+            accept_length = server_info.json()["internal_states"][0].get(
+                "avg_spec_accept_length", None
+            )
     else:
         accept_length = None
@@ -1263,7 +1265,7 @@ async def benchmark(
     print("{:<40} {:<10}".format("Traffic request rate:", request_rate))
     print(
         "{:<40} {:<10}".format(
-            "Max reqeuest concurrency:",
+            "Max request concurrency:",
             max_concurrency if max_concurrency else "not set",
         )
     )
@@ -1541,7 +1543,7 @@ def run_benchmark(args_: argparse.Namespace):
             lora_names=args.lora_name,
             extra_request_body=extra_request_body,
             profile=args.profile,
-            pd_seperated=args.pd_seperated,
+            pd_separated=args.pd_separated,
             flush_cache=args.flush_cache,
         )
     )
@@ -1720,7 +1722,7 @@ if __name__ == "__main__":
         help="Suffix applied to the end of all user prompts, followed by assistant prompt suffix.",
     )
     parser.add_argument(
-        "--pd-seperated",
+        "--pd-separated",
         action="store_true",
         help="Benchmark PD disaggregation server",
     )
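
The second hunk above changes where `avg_spec_accept_length` is read: under `internal_states[0]` of the payload itself, or under `internal_states[0]` of the first `decode` entry when `pd_separated` is set. A small sketch of that lookup, assuming the payload shapes implied by the diff; the helper name `read_accept_length` and the sample payloads are hypothetical.

```python
from typing import Any, Dict, Optional


def read_accept_length(server_info: Dict[str, Any], pd_separated: bool = False) -> Optional[float]:
    """Return avg_spec_accept_length, or None when the server does not report it."""
    # With PD disaggregation, the diff reads the first decode server's entry.
    node = server_info["decode"][0] if pd_separated else server_info
    return node["internal_states"][0].get("avg_spec_accept_length", None)


# Hypothetical payloads, for illustration only.
single = {"internal_states": [{"avg_spec_accept_length": 2.5}]}
disaggregated = {"decode": [{"internal_states": [{"avg_spec_accept_length": 3.0}]}]}
no_speculation = {"internal_states": [{}]}

print(read_accept_length(single))
print(read_accept_length(disaggregated, pd_separated=True))
print(read_accept_length(no_speculation))
```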

{sglang-0.4.6.post3 → sglang-0.4.6.post4}/sglang/compile_deep_gemm.py RENAMED

@@ -129,7 +129,7 @@ def launch_server_process_and_send_one_request(
 def refine_server_args(server_args: ServerArgs, compile_args: CompileArgs):
-    # Disbale cuda graph and torch compile to save time
+    # Disable cuda graph and torch compile to save time
     server_args.disable_cuda_graph = True
     server_args.enable_torch_compile = False
     print(f"Disable CUDA Graph and Torch Compile to save time...")

sglang-0.4.6.post4/sglang/eval/loogle_eval.py ADDED

@@ -0,0 +1,157 @@
+import argparse
+import asyncio
+import os
+import pickle
+from pathlib import Path
+from typing import List
+import openai
+import torch
+from bert_score import BERTScorer
+from datasets import load_dataset
+from tqdm import tqdm
+def get_client(api_url: str) -> openai.AsyncOpenAI:
+    if os.getenv("OPENAI_API_KEY") is None:
+        os.environ["OPENAI_API_KEY"] = "EMPTY"
+    return openai.AsyncOpenAI(base_url=api_url)
+def get_dataset():
+    return load_dataset("bigai-nlco/LooGLE", "longdep_qa", split="test")
+async def fetch_response(
+    client: openai.AsyncOpenAI,
+    context: str,
+    question: str,
+    semaphore: asyncio.Semaphore,
+    index: int,
+    model: str,
+    output_dir: Path,
+):
+    output_file = output_dir / f"response_{index}.pkl"
+    if output_file.exists():
+        return
+    prompt = (
+        "Please answer the question based on the long texts below.\n"
+        f"{context}\n"
+        f"Question: {question}\n"
+        "Answer:"
+    )
+    messages = [
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": prompt},
+    ]
+    async with semaphore:
+        try:
+            response = await client.chat.completions.create(
+                model=model,
+                messages=messages,
+                temperature=0.0,
+                max_tokens=512,
+            )
+        except openai.BadRequestError as e:
+            with open(output_file, "wb") as f:
+                pickle.dump({"error": str(e)}, f)
+            return
+    with open(output_file, "wb") as f:
+        pickle.dump(response, f)
+async def benchmark(args):
+    dataset = get_dataset()
+    output_dir = Path(args.output_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
+    client = get_client(args.api_url)
+    semaphore = asyncio.Semaphore(args.max_concurrency)
+    tasks: List[asyncio.Task] = []
+    for idx, ex in enumerate(dataset):
+        tasks.append(
+            asyncio.create_task(
+                fetch_response(
+                    client,
+                    ex["context"],
+                    ex["question"],
+                    semaphore,
+                    idx,
+                    args.model,
+                    output_dir,
+                )
+            )
+        )
+    for _ in tqdm(
+        asyncio.as_completed(tasks), total=len(tasks), desc="Running benchmark"
+    ):
+        await _
+def analyse(args):
+    dataset = get_dataset()
+    output_dir = Path(args.output_dir)
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    scorer = BERTScorer(lang="en", device=device)
+    hyps: List[str] = []
+    refs: List[str] = []
+    for idx, ex in enumerate(tqdm(dataset, desc="Loading responses")):
+        pkl_file = output_dir / f"response_{idx}.pkl"
+        if not pkl_file.exists():
+            raise FileNotFoundError(pkl_file)
+        response = pickle.load(open(pkl_file, "rb"))
+        if isinstance(response, dict) and "error" in response:
+            continue
+        hyps.append(response.choices[0].message.content.strip())
+        refs.append(ex["answer"])
+    if not hyps:
+        print("No valid responses to score!")
+        return
+    batch_size = 64
+    all_f1: List[float] = []
+    for i in tqdm(range(0, len(hyps), batch_size), desc="Scoring batches"):
+        h_batch = hyps[i : i + batch_size]
+        r_batch = refs[i : i + batch_size]
+        _, _, f1_scores = scorer.score(h_batch, r_batch, verbose=False)
+        all_f1.extend([float(x) for x in f1_scores])
+    avg = sum(all_f1) / len(all_f1)
+    print(f"Average BERTScore (F1): {avg:.2%}")
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(
+        description="Run benchmark and evaluation in one go."
+    )
+    parser.add_argument(
+        "--api-url",
+        default="http://127.0.0.1:30000/v1",
+        help="OpenAI‑compatible API base URL",
+    )
+    parser.add_argument(
+        "--model",
+        default="meta-llama/Llama-4-Maverick-17B-128E-Instruct",
+        help="Model name or ID, only used for model name",
+    )
+    parser.add_argument(
+        "--max-concurrency", type=int, default=144, help="Maximum concurrent requests"
+    )
+    parser.add_argument(
+        "--output-dir", default="tmp-output-dir", help="Directory for cached responses"
+    )
+    args = parser.parse_args()
+    asyncio.run(benchmark(args))
+    analyse(args)
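
The `benchmark` coroutine in the new file caps in-flight requests with an `asyncio.Semaphore` and drains the tasks with `asyncio.as_completed`. The pattern can be sketched without any network access, as below. Here `asyncio.sleep` stands in for the API call, and the names `bounded_call` and `run_all` and the peak counter are hypothetical additions used only to make the concurrency limit observable.

```python
import asyncio
from typing import List, Tuple


async def bounded_call(
    index: int, semaphore: asyncio.Semaphore, active: List[int], peak: List[int]
) -> int:
    # At most max_concurrency coroutines are inside this block at once.
    async with semaphore:
        active[0] += 1
        peak[0] = max(peak[0], active[0])
        await asyncio.sleep(0.01)  # stand-in for the real request
        active[0] -= 1
    return index


async def run_all(num_tasks: int, max_concurrency: int) -> Tuple[List[int], int]:
    semaphore = asyncio.Semaphore(max_concurrency)
    active, peak = [0], [0]
    tasks = [
        asyncio.create_task(bounded_call(i, semaphore, active, peak))
        for i in range(num_tasks)
    ]
    results = []
    # as_completed yields awaitables in completion order, not submission order.
    for finished in asyncio.as_completed(tasks):
        results.append(await finished)
    return sorted(results), peak[0]


results, peak = asyncio.run(run_all(num_tasks=10, max_concurrency=3))
print(results, peak)
```

All tasks are created up front, so the semaphore is what bounds concurrency; the task list itself still grows with the dataset size.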
