sglang 0.3.3__tar.gz → 0.3.4__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (152)
  1. {sglang-0.3.3/sglang.egg-info → sglang-0.3.4}/PKG-INFO +82 -32
  2. {sglang-0.3.3 → sglang-0.3.4}/README.md +51 -14
  3. {sglang-0.3.3 → sglang-0.3.4}/pyproject.toml +12 -5
  4. {sglang-0.3.3 → sglang-0.3.4}/sglang/bench_latency.py +31 -13
  5. {sglang-0.3.3 → sglang-0.3.4}/sglang/bench_server_latency.py +21 -10
  6. {sglang-0.3.3 → sglang-0.3.4}/sglang/bench_serving.py +101 -7
  7. {sglang-0.3.3 → sglang-0.3.4}/sglang/global_config.py +0 -1
  8. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/conversation.py +11 -2
  9. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/layers/attention/__init__.py +27 -5
  10. sglang-0.3.4/sglang/srt/layers/attention/double_sparsity_backend.py +281 -0
  11. sglang-0.3.4/sglang/srt/layers/attention/flashinfer_backend.py +546 -0
  12. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/layers/attention/triton_backend.py +6 -4
  13. sglang-0.3.4/sglang/srt/layers/attention/triton_ops/double_sparsity_attention.py +772 -0
  14. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/layers/attention/triton_ops/extend_attention.py +5 -3
  15. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/layers/attention/triton_ops/prefill_attention.py +4 -2
  16. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/layers/sampler.py +6 -2
  17. sglang-0.3.4/sglang/srt/managers/data_parallel_controller.py +177 -0
  18. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/managers/detokenizer_manager.py +31 -10
  19. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/managers/io_struct.py +11 -2
  20. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/managers/schedule_batch.py +126 -43
  21. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/managers/schedule_policy.py +2 -1
  22. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/managers/scheduler.py +245 -142
  23. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/managers/tokenizer_manager.py +14 -1
  24. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/managers/tp_worker.py +111 -1
  25. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/mem_cache/chunk_cache.py +8 -4
  26. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/mem_cache/memory_pool.py +77 -4
  27. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/mem_cache/radix_cache.py +15 -7
  28. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/model_executor/cuda_graph_runner.py +4 -4
  29. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/model_executor/forward_batch_info.py +16 -21
  30. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/model_executor/model_runner.py +100 -36
  31. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/baichuan.py +2 -3
  32. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/chatglm.py +5 -6
  33. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/commandr.py +1 -2
  34. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/dbrx.py +1 -2
  35. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/deepseek.py +4 -5
  36. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/deepseek_v2.py +5 -6
  37. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/exaone.py +1 -2
  38. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/gemma.py +2 -2
  39. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/gemma2.py +5 -5
  40. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/gpt_bigcode.py +5 -5
  41. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/grok.py +1 -2
  42. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/internlm2.py +1 -2
  43. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/llama.py +1 -2
  44. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/llama_classification.py +1 -2
  45. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/llama_reward.py +2 -3
  46. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/llava.py +4 -8
  47. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/llavavid.py +1 -2
  48. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/minicpm.py +1 -2
  49. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/minicpm3.py +5 -6
  50. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/mixtral.py +1 -2
  51. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/mixtral_quant.py +1 -2
  52. sglang-0.3.4/sglang/srt/models/olmo.py +352 -0
  53. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/olmoe.py +1 -2
  54. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/qwen.py +1 -2
  55. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/qwen2.py +1 -2
  56. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/qwen2_moe.py +4 -5
  57. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/stablelm.py +1 -2
  58. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/torch_native_llama.py +1 -2
  59. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/xverse.py +1 -2
  60. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/xverse_moe.py +4 -5
  61. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/yivl.py +1 -2
  62. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/openai_api/adapter.py +97 -52
  63. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/openai_api/protocol.py +10 -2
  64. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/sampling/penaltylib/orchestrator.py +28 -9
  65. sglang-0.3.4/sglang/srt/sampling/sampling_batch_info.py +226 -0
  66. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/sampling/sampling_params.py +2 -0
  67. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/server.py +171 -37
  68. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/server_args.py +127 -48
  69. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/utils.py +37 -14
  70. {sglang-0.3.3 → sglang-0.3.4}/sglang/test/few_shot_gsm8k.py +4 -1
  71. sglang-0.3.4/sglang/test/few_shot_gsm8k_engine.py +144 -0
  72. {sglang-0.3.3 → sglang-0.3.4}/sglang/test/srt/sampling/penaltylib/utils.py +16 -12
  73. sglang-0.3.4/sglang/version.py +1 -0
  74. {sglang-0.3.3 → sglang-0.3.4/sglang.egg-info}/PKG-INFO +82 -32
  75. {sglang-0.3.3 → sglang-0.3.4}/sglang.egg-info/SOURCES.txt +5 -1
  76. {sglang-0.3.3 → sglang-0.3.4}/sglang.egg-info/requires.txt +20 -3
  77. sglang-0.3.3/sglang/srt/layers/attention/flashinfer_backend.py +0 -277
  78. sglang-0.3.3/sglang/srt/layers/attention/flashinfer_utils.py +0 -237
  79. sglang-0.3.3/sglang/srt/sampling/sampling_batch_info.py +0 -180
  80. sglang-0.3.3/sglang/version.py +0 -1
  81. {sglang-0.3.3 → sglang-0.3.4}/LICENSE +0 -0
  82. {sglang-0.3.3 → sglang-0.3.4}/setup.cfg +0 -0
  83. {sglang-0.3.3 → sglang-0.3.4}/sglang/__init__.py +0 -0
  84. {sglang-0.3.3 → sglang-0.3.4}/sglang/api.py +0 -0
  85. {sglang-0.3.3 → sglang-0.3.4}/sglang/check_env.py +0 -0
  86. {sglang-0.3.3 → sglang-0.3.4}/sglang/lang/__init__.py +0 -0
  87. {sglang-0.3.3 → sglang-0.3.4}/sglang/lang/backend/__init__.py +0 -0
  88. {sglang-0.3.3 → sglang-0.3.4}/sglang/lang/backend/anthropic.py +0 -0
  89. {sglang-0.3.3 → sglang-0.3.4}/sglang/lang/backend/base_backend.py +0 -0
  90. {sglang-0.3.3 → sglang-0.3.4}/sglang/lang/backend/litellm.py +0 -0
  91. {sglang-0.3.3 → sglang-0.3.4}/sglang/lang/backend/openai.py +0 -0
  92. {sglang-0.3.3 → sglang-0.3.4}/sglang/lang/backend/runtime_endpoint.py +0 -0
  93. {sglang-0.3.3 → sglang-0.3.4}/sglang/lang/backend/vertexai.py +0 -0
  94. {sglang-0.3.3 → sglang-0.3.4}/sglang/lang/chat_template.py +0 -0
  95. {sglang-0.3.3 → sglang-0.3.4}/sglang/lang/choices.py +0 -0
  96. {sglang-0.3.3 → sglang-0.3.4}/sglang/lang/compiler.py +0 -0
  97. {sglang-0.3.3 → sglang-0.3.4}/sglang/lang/interpreter.py +0 -0
  98. {sglang-0.3.3 → sglang-0.3.4}/sglang/lang/ir.py +0 -0
  99. {sglang-0.3.3 → sglang-0.3.4}/sglang/lang/tracer.py +0 -0
  100. {sglang-0.3.3 → sglang-0.3.4}/sglang/launch_server.py +0 -0
  101. {sglang-0.3.3 → sglang-0.3.4}/sglang/launch_server_llavavid.py +0 -0
  102. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/configs/__init__.py +0 -0
  103. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/configs/exaone.py +0 -0
  104. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/configs/model_config.py +0 -0
  105. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/constrained/__init__.py +0 -0
  106. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/constrained/base_tool_cache.py +0 -0
  107. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/constrained/fsm_cache.py +0 -0
  108. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/constrained/jump_forward.py +0 -0
  109. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/hf_transformers_utils.py +0 -0
  110. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/layers/activation.py +0 -0
  111. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/layers/attention/triton_ops/decode_attention.py +0 -0
  112. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/layers/fused_moe/__init__.py +0 -0
  113. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/layers/fused_moe/fused_moe.py +0 -0
  114. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/layers/fused_moe/layer.py +0 -0
  115. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/layers/fused_moe/patch.py +0 -0
  116. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/layers/layernorm.py +0 -0
  117. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/layers/linear.py +0 -0
  118. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/layers/logits_processor.py +0 -0
  119. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/layers/pooler.py +0 -0
  120. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/layers/quantization/__init__.py +0 -0
  121. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/layers/quantization/base_config.py +0 -0
  122. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/layers/radix_attention.py +0 -0
  123. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/layers/torchao_utils.py +0 -0
  124. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/lora/lora.py +0 -0
  125. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/lora/lora_config.py +0 -0
  126. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/lora/lora_manager.py +0 -0
  127. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/managers/image_processor.py +0 -0
  128. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/mem_cache/base_prefix_cache.py +0 -0
  129. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/mem_cache/flush_cache.py +0 -0
  130. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/mm_utils.py +0 -0
  131. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/llama_embedding.py +0 -0
  132. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/models/mistral.py +0 -0
  133. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/sampling/penaltylib/__init__.py +0 -0
  134. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/sampling/penaltylib/penalizers/frequency_penalty.py +0 -0
  135. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/sampling/penaltylib/penalizers/min_new_tokens.py +0 -0
  136. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/sampling/penaltylib/penalizers/presence_penalty.py +0 -0
  137. {sglang-0.3.3 → sglang-0.3.4}/sglang/srt/sampling/penaltylib/penalizers/repetition_penalty.py +0 -0
  138. {sglang-0.3.3 → sglang-0.3.4}/sglang/test/run_eval.py +0 -0
  139. {sglang-0.3.3 → sglang-0.3.4}/sglang/test/runners.py +0 -0
  140. {sglang-0.3.3 → sglang-0.3.4}/sglang/test/simple_eval_common.py +0 -0
  141. {sglang-0.3.3 → sglang-0.3.4}/sglang/test/simple_eval_gpqa.py +0 -0
  142. {sglang-0.3.3 → sglang-0.3.4}/sglang/test/simple_eval_humaneval.py +0 -0
  143. {sglang-0.3.3 → sglang-0.3.4}/sglang/test/simple_eval_math.py +0 -0
  144. {sglang-0.3.3 → sglang-0.3.4}/sglang/test/simple_eval_mgsm.py +0 -0
  145. {sglang-0.3.3 → sglang-0.3.4}/sglang/test/simple_eval_mmlu.py +0 -0
  146. {sglang-0.3.3 → sglang-0.3.4}/sglang/test/test_activation.py +0 -0
  147. {sglang-0.3.3 → sglang-0.3.4}/sglang/test/test_layernorm.py +0 -0
  148. {sglang-0.3.3 → sglang-0.3.4}/sglang/test/test_programs.py +0 -0
  149. {sglang-0.3.3 → sglang-0.3.4}/sglang/test/test_utils.py +0 -0
  150. {sglang-0.3.3 → sglang-0.3.4}/sglang/utils.py +0 -0
  151. {sglang-0.3.3 → sglang-0.3.4}/sglang.egg-info/dependency_links.txt +0 -0
  152. {sglang-0.3.3 → sglang-0.3.4}/sglang.egg-info/top_level.txt +0 -0
PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: sglang
-Version: 0.3.3
+Version: 0.3.4
 Summary: SGLang is yet another fast serving framework for large language models and vision language models.
 License: Apache License
         Version 2.0, January 2004
@@ -214,26 +214,31 @@ License-File: LICENSE
 Requires-Dist: requests
 Requires-Dist: tqdm
 Requires-Dist: numpy
+Provides-Extra: runtime-common
+Requires-Dist: aiohttp; extra == "runtime-common"
+Requires-Dist: decord; extra == "runtime-common"
+Requires-Dist: fastapi; extra == "runtime-common"
+Requires-Dist: hf_transfer; extra == "runtime-common"
+Requires-Dist: huggingface_hub; extra == "runtime-common"
+Requires-Dist: interegular; extra == "runtime-common"
+Requires-Dist: orjson; extra == "runtime-common"
+Requires-Dist: packaging; extra == "runtime-common"
+Requires-Dist: pillow; extra == "runtime-common"
+Requires-Dist: psutil; extra == "runtime-common"
+Requires-Dist: pydantic; extra == "runtime-common"
+Requires-Dist: python-multipart; extra == "runtime-common"
+Requires-Dist: torchao; extra == "runtime-common"
+Requires-Dist: uvicorn; extra == "runtime-common"
+Requires-Dist: uvloop; extra == "runtime-common"
+Requires-Dist: zmq; extra == "runtime-common"
+Requires-Dist: outlines>=0.0.44; extra == "runtime-common"
+Requires-Dist: modelscope; extra == "runtime-common"
 Provides-Extra: srt
-Requires-Dist: aiohttp; extra == "srt"
-Requires-Dist: decord; extra == "srt"
-Requires-Dist: fastapi; extra == "srt"
-Requires-Dist: hf_transfer; extra == "srt"
-Requires-Dist: huggingface_hub; extra == "srt"
-Requires-Dist: interegular; extra == "srt"
-Requires-Dist: packaging; extra == "srt"
-Requires-Dist: pillow; extra == "srt"
-Requires-Dist: psutil; extra == "srt"
-Requires-Dist: pydantic; extra == "srt"
-Requires-Dist: python-multipart; extra == "srt"
+Requires-Dist: sglang[runtime_common]; extra == "srt"
 Requires-Dist: torch; extra == "srt"
-Requires-Dist: torchao; extra == "srt"
-Requires-Dist: uvicorn; extra == "srt"
-Requires-Dist: uvloop; extra == "srt"
-Requires-Dist: zmq; extra == "srt"
 Requires-Dist: vllm==0.5.5; extra == "srt"
-Requires-Dist: outlines>=0.0.44; extra == "srt"
-Requires-Dist: modelscope; extra == "srt"
+Provides-Extra: srt-xpu
+Requires-Dist: sglang[runtime_common]; extra == "srt-xpu"
 Provides-Extra: openai
 Requires-Dist: openai>=1.0; extra == "openai"
 Requires-Dist: tiktoken; extra == "openai"
@@ -253,12 +258,20 @@ Requires-Dist: sglang[srt]; extra == "all"
 Requires-Dist: sglang[openai]; extra == "all"
 Requires-Dist: sglang[anthropic]; extra == "all"
 Requires-Dist: sglang[litellm]; extra == "all"
+Provides-Extra: all-xpu
+Requires-Dist: sglang[srt_xpu]; extra == "all-xpu"
+Requires-Dist: sglang[openai]; extra == "all-xpu"
+Requires-Dist: sglang[anthropic]; extra == "all-xpu"
+Requires-Dist: sglang[litellm]; extra == "all-xpu"
 Provides-Extra: dev
 Requires-Dist: sglang[all]; extra == "dev"
 Requires-Dist: sglang[test]; extra == "dev"
+Provides-Extra: dev-xpu
+Requires-Dist: sglang[all_xpu]; extra == "dev-xpu"
+Requires-Dist: sglang[test]; extra == "dev-xpu"
 
-<div align="center">
-<img src="https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png" alt="logo" width="400"></img>
+<div align="center" id="sglangtop">
+<img src="https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png" alt="logo" width="400" margin="10px"></img>
 
 [![PyPI](https://img.shields.io/pypi/v/sglang)](https://pypi.org/project/sglang)
 ![PyPI - Downloads](https://img.shields.io/pypi/dm/sglang)
@@ -270,15 +283,13 @@ Requires-Dist: sglang[test]; extra == "dev"
 
 --------------------------------------------------------------------------------
 
-| [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/) | [**Paper**](https://arxiv.org/abs/2312.07104) | [**Join Slack**](https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw) | [**Join Bi-Weekly Development Meeting (Oct. 19)**](https://calendar.app.google/GYW7S8QGoanCuaxW6) |
-
-## Upcoming Events
-- [Oct. 11, 2024] Invited talks at [AMD Advancing AI](https://www.amd.com/en/corporate/events/advancing-ai.html) Developer Day.
-- [Oct. 16, 2024] Online meetup for efficient LLM deployment and serving, co-hosted by SGLang, FlashInfer, and MLC LLM! Fill out the [Google form](https://forms.gle/B3YeedLxmrrhL1NM8) to receive the invite link.
+| [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/) | [**Paper**](https://arxiv.org/abs/2312.07104) | [**Slides**](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_dev_day_v2.pdf) | [**Learn More**](https://github.com/sgl-project/sgl-learning-materials) | [**Join Slack**](https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw) |
+[**Join Bi-Weekly Development Meeting (Oct. 19)**](https://calendar.app.google/GYW7S8QGoanCuaxW6) |
 
 ## News
-- [2024/09] 🔥 SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
-- [2024/07] 🔥 Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) ([blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/)).
+- [2024/10] 🔥 The First SGLang Online Meetup ([slides](https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#the-first-sglang-online-meetup)).
+- [2024/09] SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
+- [2024/07] Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) ([blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/)).
 - [2024/02] SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).
 
 <details>
@@ -324,7 +335,7 @@ pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
 ### Method 2: From source
 ```
 # Use the last release branch
-git clone -b v0.3.3 https://github.com/sgl-project/sglang.git
+git clone -b v0.3.4 https://github.com/sgl-project/sglang.git
 cd sglang
 
 pip install --upgrade pip
@@ -501,6 +512,40 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-0:50000 --nnodes 2 --node-rank 1
 ```
 
+### Engine Without HTTP Server
+
+We also provide an inference engine **without a HTTP server**. For example,
+
+```python
+import sglang as sgl
+
+
+def main():
+    prompts = [
+        "Hello, my name is",
+        "The president of the United States is",
+        "The capital of France is",
+        "The future of AI is",
+    ]
+    sampling_params = {"temperature": 0.8, "top_p": 0.95}
+    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
+
+    outputs = llm.generate(prompts, sampling_params)
+    for prompt, output in zip(prompts, outputs):
+        print("===============================")
+        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
+
+if __name__ == "__main__":
+    main()
+```
+
+This can be used for:
+
+1. **Offline Batch Inference**
+2. **Building Custom Servers**
+
+You can view the full example [here](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine)
+
 ### Supported Models
 
 **Generative Models**
@@ -837,10 +882,7 @@ def chat_example(s):
 - The `regex` argument in `sgl.gen` is implemented through autoregressive decoding with logit bias masking, according to the constraints set by the regex. It is compatible with `temperature=0` and `temperature != 0`.
 
 ## Benchmark And Performance
-![8b_throughput](https://lmsys.org/images/blog/sglang_llama3/8b_throughput.svg)
-![70b_fp8_throughput](https://lmsys.org/images/blog/sglang_llama3/70b_fp8_throughput.svg)
-
-Learn more at this [blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/).
+Learn more in our release blogs: [v0.2](https://lmsys.org/blog/2024-07-25-sglang-llama3/), [v0.3](https://lmsys.org/blog/2024-09-04-sglang-v0-3/).
 
 ## Roadmap
 [Development Roadmap (2024 Q4)](https://github.com/sgl-project/sglang/issues/1487)
@@ -848,3 +890,11 @@ Learn more at this [blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/).
 ## Citation And Acknowledgment
 Please cite our paper, [SGLang: Efficient Execution of Structured Language Model Programs](https://arxiv.org/abs/2312.07104), if you find the project useful.
 We also learned from the design and reused code from the following projects: [Guidance](https://github.com/guidance-ai/guidance), [vLLM](https://github.com/vllm-project/vllm), [LightLLM](https://github.com/ModelTC/lightllm), [FlashInfer](https://github.com/flashinfer-ai/flashinfer), [Outlines](https://github.com/outlines-dev/outlines), and [LMQL](https://github.com/eth-sri/lmql).
+
+
+
+<p align="center">
+<a href="#sglangtop" target="_blank">
+<bold>Back To Top </bold>
+</a>
+</p>
README.md

@@ -1,5 +1,5 @@
-<div align="center">
-<img src="https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png" alt="logo" width="400"></img>
+<div align="center" id="sglangtop">
+<img src="https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png" alt="logo" width="400" margin="10px"></img>
 
 [![PyPI](https://img.shields.io/pypi/v/sglang)](https://pypi.org/project/sglang)
 ![PyPI - Downloads](https://img.shields.io/pypi/dm/sglang)
@@ -11,15 +11,13 @@
 
 --------------------------------------------------------------------------------
 
-| [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/) | [**Paper**](https://arxiv.org/abs/2312.07104) | [**Join Slack**](https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw) | [**Join Bi-Weekly Development Meeting (Oct. 19)**](https://calendar.app.google/GYW7S8QGoanCuaxW6) |
-
-## Upcoming Events
-- [Oct. 11, 2024] Invited talks at [AMD Advancing AI](https://www.amd.com/en/corporate/events/advancing-ai.html) Developer Day.
-- [Oct. 16, 2024] Online meetup for efficient LLM deployment and serving, co-hosted by SGLang, FlashInfer, and MLC LLM! Fill out the [Google form](https://forms.gle/B3YeedLxmrrhL1NM8) to receive the invite link.
+| [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/) | [**Paper**](https://arxiv.org/abs/2312.07104) | [**Slides**](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/amd_dev_day_v2.pdf) | [**Learn More**](https://github.com/sgl-project/sgl-learning-materials) | [**Join Slack**](https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw) |
+[**Join Bi-Weekly Development Meeting (Oct. 19)**](https://calendar.app.google/GYW7S8QGoanCuaxW6) |
 
 ## News
-- [2024/09] 🔥 SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
-- [2024/07] 🔥 Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) ([blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/)).
+- [2024/10] 🔥 The First SGLang Online Meetup ([slides](https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#the-first-sglang-online-meetup)).
+- [2024/09] SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
+- [2024/07] Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) ([blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/)).
 - [2024/02] SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).
 
 <details>
@@ -65,7 +63,7 @@ pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
 ### Method 2: From source
 ```
 # Use the last release branch
-git clone -b v0.3.3 https://github.com/sgl-project/sglang.git
+git clone -b v0.3.4 https://github.com/sgl-project/sglang.git
 cd sglang
 
 pip install --upgrade pip
@@ -242,6 +240,40 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-0:50000 --nnodes 2 --node-rank 1
 ```
 
+### Engine Without HTTP Server
+
+We also provide an inference engine **without a HTTP server**. For example,
+
+```python
+import sglang as sgl
+
+
+def main():
+    prompts = [
+        "Hello, my name is",
+        "The president of the United States is",
+        "The capital of France is",
+        "The future of AI is",
+    ]
+    sampling_params = {"temperature": 0.8, "top_p": 0.95}
+    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
+
+    outputs = llm.generate(prompts, sampling_params)
+    for prompt, output in zip(prompts, outputs):
+        print("===============================")
+        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
+
+if __name__ == "__main__":
+    main()
+```
+
+This can be used for:
+
+1. **Offline Batch Inference**
+2. **Building Custom Servers**
+
+You can view the full example [here](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine)
+
 ### Supported Models
 
 **Generative Models**
@@ -578,10 +610,7 @@ def chat_example(s):
 - The `regex` argument in `sgl.gen` is implemented through autoregressive decoding with logit bias masking, according to the constraints set by the regex. It is compatible with `temperature=0` and `temperature != 0`.
 
 ## Benchmark And Performance
-![8b_throughput](https://lmsys.org/images/blog/sglang_llama3/8b_throughput.svg)
-![70b_fp8_throughput](https://lmsys.org/images/blog/sglang_llama3/70b_fp8_throughput.svg)
-
-Learn more at this [blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/).
+Learn more in our release blogs: [v0.2](https://lmsys.org/blog/2024-07-25-sglang-llama3/), [v0.3](https://lmsys.org/blog/2024-09-04-sglang-v0-3/).
 
 ## Roadmap
 [Development Roadmap (2024 Q4)](https://github.com/sgl-project/sglang/issues/1487)
@@ -589,3 +618,11 @@ Learn more at this [blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/).
 ## Citation And Acknowledgment
 Please cite our paper, [SGLang: Efficient Execution of Structured Language Model Programs](https://arxiv.org/abs/2312.07104), if you find the project useful.
 We also learned from the design and reused code from the following projects: [Guidance](https://github.com/guidance-ai/guidance), [vLLM](https://github.com/vllm-project/vllm), [LightLLM](https://github.com/ModelTC/lightllm), [FlashInfer](https://github.com/flashinfer-ai/flashinfer), [Outlines](https://github.com/outlines-dev/outlines), and [LMQL](https://github.com/eth-sri/lmql).
+
+
+
+<p align="center">
+<a href="#sglangtop" target="_blank">
+<bold>Back To Top </bold>
+</a>
+</p>
pyproject.toml

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "sglang"
-version = "0.3.3"
+version = "0.3.4"
 description = "SGLang is yet another fast serving framework for large language models and vision language models."
 readme = "README.md"
 requires-python = ">=3.8"
@@ -20,16 +20,23 @@ dependencies = [
 ]
 
 [project.optional-dependencies]
-srt = ["aiohttp", "decord", "fastapi", "hf_transfer", "huggingface_hub", "interegular",
-       "packaging", "pillow", "psutil", "pydantic", "python-multipart",
-       "torch", "torchao", "uvicorn", "uvloop", "zmq",
-       "vllm==0.5.5", "outlines>=0.0.44", "modelscope"]
+runtime_common = ["aiohttp", "decord", "fastapi", "hf_transfer", "huggingface_hub", "interegular",
+                  "orjson", "packaging", "pillow", "psutil", "pydantic", "python-multipart",
+                  "torchao", "uvicorn", "uvloop", "zmq",
+                  "outlines>=0.0.44", "modelscope"]
+# xpu is not enabled in public vllm and torch whl,
+# need to follow https://docs.vllm.ai/en/latest/getting_started/xpu-installation.html install vllm
+srt = ["sglang[runtime_common]", "torch", "vllm==0.5.5"]
+srt_xpu = ["sglang[runtime_common]"]
+
 openai = ["openai>=1.0", "tiktoken"]
 anthropic = ["anthropic>=0.20.0"]
 litellm = ["litellm>=1.0.0"]
 test = ["jsonlines", "matplotlib", "pandas", "sentence_transformers", "accelerate", "peft"]
 all = ["sglang[srt]", "sglang[openai]", "sglang[anthropic]", "sglang[litellm]"]
+all_xpu = ["sglang[srt_xpu]", "sglang[openai]", "sglang[anthropic]", "sglang[litellm]"]
 dev = ["sglang[all]", "sglang[test]"]
+dev_xpu = ["sglang[all_xpu]", "sglang[test]"]
 
 [project.urls]
 "Homepage" = "https://github.com/sgl-project/sglang"
sglang/bench_latency.py

@@ -139,7 +139,7 @@ def load_model(server_args, port_args, tp_rank):
         gpu_id=tp_rank,
         tp_rank=tp_rank,
         tp_size=server_args.tp_size,
-        nccl_port=port_args.nccl_ports[0],
+        nccl_port=port_args.nccl_port,
         server_args=server_args,
     )
     rank_print(f"max_total_num_tokens={model_runner.max_total_num_tokens}")
@@ -220,6 +220,7 @@ def prepare_synthetic_inputs_for_latency_test(batch_size, input_len):
     return reqs
 
 
+@torch.inference_mode()
 def extend(reqs, model_runner):
     batch = ScheduleBatch.init_new(
         reqs=reqs,
@@ -231,26 +232,28 @@ def extend(reqs, model_runner):
     model_worker_batch = batch.get_model_worker_batch()
     forward_batch = ForwardBatch.init_new(model_worker_batch, model_runner)
     logits_output = model_runner.forward(forward_batch)
-    next_token_ids = model_runner.sample(logits_output, forward_batch).tolist()
+    next_token_ids = model_runner.sample(logits_output, forward_batch)
     return next_token_ids, logits_output.next_token_logits, batch
 
 
+@torch.inference_mode()
 def decode(input_token_ids, batch, model_runner):
-    batch.prepare_for_decode(input_token_ids)
+    batch.output_ids = input_token_ids
+    batch.prepare_for_decode()
     model_worker_batch = batch.get_model_worker_batch()
     forward_batch = ForwardBatch.init_new(model_worker_batch, model_runner)
     logits_output = model_runner.forward(forward_batch)
-    next_token_ids = model_runner.sample(logits_output, forward_batch).tolist()
+    next_token_ids = model_runner.sample(logits_output, forward_batch)
    return next_token_ids, logits_output.next_token_logits
 
 
-@torch.inference_mode()
 def correctness_test(
     server_args,
     port_args,
     bench_args,
     tp_rank,
 ):
+    configure_logger(server_args, prefix=f" TP{tp_rank}")
     rank_print = print if tp_rank == 0 else lambda *args, **kwargs: None
 
     # Load the model
@@ -278,8 +281,9 @@ def correctness_test(
     output_ids = [input_ids[i] + [next_token_ids[i]] for i in range(len(input_ids))]
     for _ in range(bench_args.output_len[0] - 1):
         next_token_ids, _ = decode(next_token_ids, batch, model_runner)
+        next_token_ids_list = next_token_ids.tolist()
         for i in range(len(reqs)):
-            output_ids[i].append(next_token_ids[i])
+            output_ids[i].append(next_token_ids_list[i])
 
     # Print
     for i in range(len(reqs)):
@@ -287,9 +291,15 @@ def correctness_test(
         rank_print(tokenizer.decode(output_ids[i]), "\n")
 
 
-@torch.inference_mode()
+def synchronize(device):
+    if device == "cuda":
+        torch.cuda.synchronize()
+    elif device == "xpu":
+        torch.xpu.synchronize()
+
+
 def latency_test_run_once(
-    run_name, model_runner, rank_print, reqs, batch_size, input_len, output_len
+    run_name, model_runner, rank_print, reqs, batch_size, input_len, output_len, device
 ):
     max_batch_size = model_runner.max_total_num_tokens // (input_len + output_len)
     if batch_size > max_batch_size:
@@ -312,10 +322,10 @@
     tot_latency = 0
 
     # Prefill
-    torch.cuda.synchronize()
+    synchronize(device)
     tic = time.time()
     next_token_ids, _, batch = extend(reqs, model_runner)
-    torch.cuda.synchronize()
+    synchronize(device)
     prefill_latency = time.time() - tic
     tot_latency += prefill_latency
     throughput = input_len * batch_size / prefill_latency
@@ -328,10 +338,10 @@
     # Decode
     decode_latencies = []
     for i in range(output_len - 1):
-        torch.cuda.synchronize()
+        synchronize(device)
         tic = time.time()
        next_token_ids, _ = decode(next_token_ids, batch, model_runner)
-        torch.cuda.synchronize()
+        synchronize(device)
         latency = time.time() - tic
         tot_latency += latency
         throughput = batch_size / latency
@@ -387,6 +397,7 @@ def latency_test(
         bench_args.batch_size[0],
         bench_args.input_len[0],
         8,  # shorter decoding to speed up the warmup
+        server_args.device,
     )
     rank_print("Benchmark ...")
 
@@ -397,7 +408,14 @@
     ):
         reqs = prepare_synthetic_inputs_for_latency_test(bs, il)
         ret = latency_test_run_once(
-            bench_args.run_name, model_runner, rank_print, reqs, bs, il, ol
+            bench_args.run_name,
+            model_runner,
+            rank_print,
+            reqs,
+            bs,
+            il,
+            ol,
+            server_args.device,
         )
         if ret is not None:
             result_list.append(ret)
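
The bench_latency.py changes above replace the hard-coded `torch.cuda.synchronize()` calls with a `synchronize(device)` helper and pass `server_args.device` into `latency_test_run_once`, so the benchmark can also time non-CUDA runs. A hypothetical invocation under these changes (the `--device` flag is an assumption based on `server_args.device`; the corresponding server_args.py change is not shown in this hunk):

```bash
# Latency test on the device-agnostic path; cuda remains the usual default,
# while xpu exercises the torch.xpu.synchronize() branch added above
python3 -m sglang.bench_latency --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --batch-size 1 --input-len 1024 --output-len 8 --device xpu
```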
sglang/bench_server_latency.py

@@ -6,6 +6,8 @@ It accepts arguments similar to those of launch_server.py.
 Usage:
 
 python3 -m sglang.bench_server_latency --model meta-llama/Meta-Llama-3.1-8B --batch-size 1 16 64 --input-len 1024 --output-len 8
+
+python3 -m sglang.bench_server_latency --model None --base-url http://localhost:30000 --batch-size 16 --input-len 1024 --output-len 8
 """
 
 import argparse
@@ -32,6 +34,8 @@ class BenchArgs:
     input_len: Tuple[int] = (1024,)
     output_len: Tuple[int] = (16,)
     result_filename: str = "result.jsonl"
+    base_url: str = ""
+    skip_warmup: bool = False
 
     @staticmethod
     def add_cli_args(parser: argparse.ArgumentParser):
@@ -48,6 +52,8 @@ class BenchArgs:
         parser.add_argument(
             "--result-filename", type=str, default=BenchArgs.result_filename
         )
+        parser.add_argument("--base-url", type=str, default=BenchArgs.base_url)
+        parser.add_argument("--skip-warmup", action="store_true")
 
     @classmethod
     def from_cli_args(cls, args: argparse.Namespace):
@@ -139,17 +145,21 @@ def run_one_case(
 
 
 def run_benchmark(server_args: ServerArgs, bench_args: BenchArgs):
-    proc, base_url = launch_server_process(server_args)
+    if bench_args.base_url:
+        proc, base_url = None, bench_args.base_url
+    else:
+        proc, base_url = launch_server_process(server_args)
 
     # warmup
-    run_one_case(
-        base_url,
-        batch_size=16,
-        input_len=1024,
-        output_len=16,
-        run_name="",
-        result_filename="",
-    )
+    if not bench_args.skip_warmup:
+        run_one_case(
+            base_url,
+            batch_size=16,
+            input_len=1024,
+            output_len=16,
+            run_name="",
+            result_filename="",
+        )
 
     # benchmark
     try:
@@ -165,7 +175,8 @@ def run_benchmark(server_args: ServerArgs, bench_args: BenchArgs):
             bench_args.result_filename,
         )
     finally:
-        kill_child_process(proc.pid)
+        if proc:
+            kill_child_process(proc.pid)
 
     print(f"\nResults are saved to {bench_args.result_filename}")
 
sglang/bench_serving.py

@@ -222,6 +222,85 @@ async def async_request_openai_completions(
     return output
 
 
+async def async_request_sglang_generate(
+    request_func_input: RequestFuncInput,
+    pbar: Optional[tqdm] = None,
+) -> RequestFuncOutput:
+    api_url = request_func_input.api_url
+    prompt = request_func_input.prompt
+
+    async with aiohttp.ClientSession(timeout=AIOHTTP_TIMEOUT) as session:
+        payload = {
+            "text": prompt,
+            "sampling_params": {
+                "temperature": 0.0,
+                "max_new_tokens": request_func_input.output_len,
+                "ignore_eos": not args.disable_ignore_eos,
+            },
+            "stream": not args.disable_stream,
+            **request_func_input.extra_request_body,
+        }
+        headers = {}
+
+        output = RequestFuncOutput()
+        output.prompt_len = request_func_input.prompt_len
+
+        generated_text = ""
+        ttft = 0.0
+        st = time.perf_counter()
+        most_recent_timestamp = st
+        try:
+            async with session.post(
+                url=api_url, json=payload, headers=headers
+            ) as response:
+                if response.status == 200:
+                    async for chunk_bytes in response.content:
+                        chunk_bytes = chunk_bytes.strip()
+                        if not chunk_bytes:
+                            continue
+                        # print(chunk_bytes)
+
+                        chunk = remove_prefix(chunk_bytes.decode("utf-8"), "data: ")
+                        latency = time.perf_counter() - st
+                        if chunk == "[DONE]":
+                            pass
+                        else:
+                            data = json.loads(chunk)
+
+                            # NOTE: Some completion API might have a last
+                            # usage summary response without a token so we
+                            # want to check a token was generated
+                            if data["text"]:
+                                timestamp = time.perf_counter()
+                                # First token
+                                if ttft == 0.0:
+                                    ttft = time.perf_counter() - st
+                                    output.ttft = ttft
+
+                                # Decoding phase
+                                else:
+                                    output.itl.append(timestamp - most_recent_timestamp)
+
+                                most_recent_timestamp = timestamp
+                                generated_text = data["text"]
+
+                    output.generated_text = generated_text
+                    output.success = True
+                    output.latency = latency
+                    output.output_len = request_func_input.output_len
+                else:
+                    output.error = response.reason or ""
+                    output.success = False
+        except Exception:
+            output.success = False
+            exc_info = sys.exc_info()
+            output.error = "".join(traceback.format_exception(*exc_info))
+
+    if pbar:
+        pbar.update(1)
+    return output
+
+
 async def async_request_gserver(
     request_func_input: RequestFuncInput,
     pbar: Optional[tqdm] = None,
@@ -264,7 +343,9 @@ def get_tokenizer(
 
 
 ASYNC_REQUEST_FUNCS = {
-    "sglang": async_request_openai_completions,
+    "sglang": async_request_sglang_generate,
+    "sglang-native": async_request_sglang_generate,
+    "sglang-oai": async_request_openai_completions,
     "vllm": async_request_openai_completions,
     "lmdeploy": async_request_openai_completions,
     "trt": async_request_trt_llm,
@@ -387,6 +468,8 @@ def sample_sharegpt_requests(
             continue
         filtered_dataset.append((prompt, prompt_len, output_len))
 
+    print(f"#Input tokens: {np.sum([x[1] for x in filtered_dataset])}")
+    print(f"#Output tokens: {np.sum([x[2] for x in filtered_dataset])}")
     return filtered_dataset
 
 
@@ -587,6 +670,8 @@ async def benchmark(
     else:
         print("Initial test run completed. Starting main benchmark run...")
 
+    time.sleep(1.5)
+
     pbar = None if disable_tqdm else tqdm(total=len(input_requests))
 
     benchmark_start_time = time.perf_counter()
@@ -782,24 +867,33 @@ def run_benchmark(args_: argparse.Namespace):
     if args.port is None:
         args.port = {
             "sglang": 30000,
+            "sglang-native": 30000,
+            "sglang-oai": 30000,
             "lmdeploy": 23333,
             "vllm": 8000,
             "trt": 8000,
             "gserver": 9988,
         }.get(args.backend, 30000)
 
-    api_url = (
-        f"{args.base_url}/v1/completions"
-        if args.base_url
-        else f"http://{args.host}:{args.port}/v1/completions"
-    )
     model_url = (
         f"{args.base_url}/v1/models"
         if args.base_url
         else f"http://{args.host}:{args.port}/v1/models"
     )
 
-    if args.backend == "trt":
+    if args.backend in ["sglang", "sglang-native"]:
+        api_url = (
+            f"{args.base_url}/generate"
+            if args.base_url
+            else f"http://{args.host}:{args.port}/generate"
+        )
+    elif args.backend in ["sglang-oai", "vllm", "lmdeploy"]:
+        api_url = (
+            f"{args.base_url}/v1/completions"
+            if args.base_url
+            else f"http://{args.host}:{args.port}/v1/completions"
+        )
+    elif args.backend == "trt":
         api_url = (
             f"{args.base_url}/v2/models/ensemble/generate_stream"
             if args.base_url
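
With these changes, the `sglang` and new `sglang-native` backends in bench_serving.py target the native `/generate` endpoint, while `sglang-oai` preserves the previous OpenAI-compatible `/v1/completions` behavior. A hypothetical comparison run under the new backend names (assuming the script's existing `--backend` and `--num-prompts` flags and a server on the default port 30000):

```bash
# Native /generate endpoint (now also what --backend sglang uses)
python3 -m sglang.bench_serving --backend sglang-native --num-prompts 1000

# OpenAI-compatible /v1/completions endpoint, matching the 0.3.3 behavior
python3 -m sglang.bench_serving --backend sglang-oai --num-prompts 1000
```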