sglang 0.2.0__tar.gz → 0.2.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (98)
  1. {sglang-0.2.0/sglang.egg-info → sglang-0.2.1}/PKG-INFO +28 -14
  2. {sglang-0.2.0 → sglang-0.2.1}/README.md +27 -13
  3. {sglang-0.2.0 → sglang-0.2.1}/pyproject.toml +1 -1
  4. {sglang-0.2.0 → sglang-0.2.1}/sglang/bench_serving.py +3 -3
  5. {sglang-0.2.0 → sglang-0.2.1}/sglang/global_config.py +1 -1
  6. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/managers/controller/model_runner.py +1 -1
  7. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/managers/io_struct.py +4 -1
  8. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/openai_api/adapter.py +6 -1
  9. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/utils.py +1 -0
  10. sglang-0.2.1/sglang/version.py +1 -0
  11. {sglang-0.2.0 → sglang-0.2.1/sglang.egg-info}/PKG-INFO +28 -14
  12. sglang-0.2.0/sglang/version.py +0 -1
  13. {sglang-0.2.0 → sglang-0.2.1}/LICENSE +0 -0
  14. {sglang-0.2.0 → sglang-0.2.1}/setup.cfg +0 -0
  15. {sglang-0.2.0 → sglang-0.2.1}/sglang/__init__.py +0 -0
  16. {sglang-0.2.0 → sglang-0.2.1}/sglang/api.py +0 -0
  17. {sglang-0.2.0 → sglang-0.2.1}/sglang/bench_latency.py +0 -0
  18. {sglang-0.2.0 → sglang-0.2.1}/sglang/check_env.py +0 -0
  19. {sglang-0.2.0 → sglang-0.2.1}/sglang/lang/__init__.py +0 -0
  20. {sglang-0.2.0 → sglang-0.2.1}/sglang/lang/backend/__init__.py +0 -0
  21. {sglang-0.2.0 → sglang-0.2.1}/sglang/lang/backend/anthropic.py +0 -0
  22. {sglang-0.2.0 → sglang-0.2.1}/sglang/lang/backend/base_backend.py +0 -0
  23. {sglang-0.2.0 → sglang-0.2.1}/sglang/lang/backend/litellm.py +0 -0
  24. {sglang-0.2.0 → sglang-0.2.1}/sglang/lang/backend/openai.py +0 -0
  25. {sglang-0.2.0 → sglang-0.2.1}/sglang/lang/backend/runtime_endpoint.py +0 -0
  26. {sglang-0.2.0 → sglang-0.2.1}/sglang/lang/backend/vertexai.py +0 -0
  27. {sglang-0.2.0 → sglang-0.2.1}/sglang/lang/chat_template.py +0 -0
  28. {sglang-0.2.0 → sglang-0.2.1}/sglang/lang/compiler.py +0 -0
  29. {sglang-0.2.0 → sglang-0.2.1}/sglang/lang/interpreter.py +0 -0
  30. {sglang-0.2.0 → sglang-0.2.1}/sglang/lang/ir.py +0 -0
  31. {sglang-0.2.0 → sglang-0.2.1}/sglang/lang/tracer.py +0 -0
  32. {sglang-0.2.0 → sglang-0.2.1}/sglang/launch_server.py +0 -0
  33. {sglang-0.2.0 → sglang-0.2.1}/sglang/launch_server_llavavid.py +0 -0
  34. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/constrained/__init__.py +0 -0
  35. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/constrained/base_cache.py +0 -0
  36. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/constrained/fsm_cache.py +0 -0
  37. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/constrained/jump_forward.py +0 -0
  38. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/conversation.py +0 -0
  39. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/flush_cache.py +0 -0
  40. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/hf_transformers_utils.py +0 -0
  41. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/layers/context_flashattention_nopad.py +0 -0
  42. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/layers/extend_attention.py +0 -0
  43. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/layers/fused_moe.py +0 -0
  44. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/layers/linear.py +0 -0
  45. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/layers/logits_processor.py +0 -0
  46. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/layers/quantization/__init__.py +0 -0
  47. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/layers/quantization/fp8.py +0 -0
  48. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/layers/radix_attention.py +0 -0
  49. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/layers/token_attention.py +0 -0
  50. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/managers/controller/cuda_graph_runner.py +0 -0
  51. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/managers/controller/infer_batch.py +0 -0
  52. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/managers/controller/manager_multi.py +0 -0
  53. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/managers/controller/manager_single.py +0 -0
  54. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/managers/controller/radix_cache.py +0 -0
  55. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/managers/controller/schedule_heuristic.py +0 -0
  56. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/managers/controller/tp_worker.py +0 -0
  57. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/managers/detokenizer_manager.py +0 -0
  58. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/managers/tokenizer_manager.py +0 -0
  59. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/memory_pool.py +0 -0
  60. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/mm_utils.py +0 -0
  61. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/model_config.py +0 -0
  62. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/model_loader/model_loader.py +0 -0
  63. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/model_loader/utils.py +0 -0
  64. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/models/chatglm.py +0 -0
  65. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/models/commandr.py +0 -0
  66. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/models/dbrx.py +0 -0
  67. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/models/deepseek.py +0 -0
  68. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/models/gemma.py +0 -0
  69. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/models/gemma2.py +0 -0
  70. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/models/gpt_bigcode.py +0 -0
  71. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/models/grok.py +0 -0
  72. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/models/internlm2.py +0 -0
  73. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/models/llama2.py +0 -0
  74. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/models/llama_classification.py +0 -0
  75. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/models/llava.py +0 -0
  76. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/models/llavavid.py +0 -0
  77. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/models/minicpm.py +0 -0
  78. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/models/mistral.py +0 -0
  79. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/models/mixtral.py +0 -0
  80. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/models/mixtral_quant.py +0 -0
  81. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/models/qwen.py +0 -0
  82. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/models/qwen2.py +0 -0
  83. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/models/qwen2_moe.py +0 -0
  84. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/models/stablelm.py +0 -0
  85. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/models/yivl.py +0 -0
  86. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/openai_api/protocol.py +0 -0
  87. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/sampling_params.py +0 -0
  88. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/server.py +0 -0
  89. {sglang-0.2.0 → sglang-0.2.1}/sglang/srt/server_args.py +0 -0
  90. {sglang-0.2.0 → sglang-0.2.1}/sglang/test/test_conversation.py +0 -0
  91. {sglang-0.2.0 → sglang-0.2.1}/sglang/test/test_openai_protocol.py +0 -0
  92. {sglang-0.2.0 → sglang-0.2.1}/sglang/test/test_programs.py +0 -0
  93. {sglang-0.2.0 → sglang-0.2.1}/sglang/test/test_utils.py +0 -0
  94. {sglang-0.2.0 → sglang-0.2.1}/sglang/utils.py +0 -0
  95. {sglang-0.2.0 → sglang-0.2.1}/sglang.egg-info/SOURCES.txt +0 -0
  96. {sglang-0.2.0 → sglang-0.2.1}/sglang.egg-info/dependency_links.txt +0 -0
  97. {sglang-0.2.0 → sglang-0.2.1}/sglang.egg-info/requires.txt +0 -0
  98. {sglang-0.2.0 → sglang-0.2.1}/sglang.egg-info/top_level.txt +0 -0
{sglang-0.2.0/sglang.egg-info → sglang-0.2.1}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: sglang
-Version: 0.2.0
+Version: 0.2.1
 Summary: SGLang is yet another fast serving framework for large language models and vision language models.
 License: Apache License
                                  Version 2.0, January 2004
@@ -249,7 +249,7 @@ Requires-Dist: sglang[litellm]; extra == "all"
 
 --------------------------------------------------------------------------------
 
-| [**Blog**](https://lmsys.org/blog/2024-01-17-sglang/) | [**Paper**](https://arxiv.org/abs/2312.07104) |
+| [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/) | [**Paper**](https://arxiv.org/abs/2312.07104) |
 
 SGLang is a fast serving framework for large language models and vision language models.
 It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
@@ -259,13 +259,14 @@ The core features include:
 - **Flexible Frontend Language**: Enables easy programming of LLM applications with chained generation calls, advanced prompting, control flow, multiple modalities, parallelism, and external interactions.
 
 ## News
-- [2024/04] 🔥 SGLang is used by the official **LLaVA-NeXT (video)** release ([blog](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/)).
-- [2024/02] 🔥 SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).
-- [2024/01] SGLang provides up to **5x faster inference** with RadixAttention ([blog](https://lmsys.org/blog/2024-01-17-sglang/)).
+- [2024/07] 🔥 Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) ([blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/)).
+- [2024/04] SGLang is used by the official **LLaVA-NeXT (video)** release ([blog](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/)).
+- [2024/02] SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).
 
 <details>
 <summary>More</summary>
 
+- [2024/01] SGLang provides up to **5x faster inference** with RadixAttention ([blog](https://lmsys.org/blog/2024-01-17-sglang/)).
 - [2024/01] SGLang powers the serving of the official **LLaVA v1.6** release demo ([usage](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#demo)).
 
 </details>
@@ -302,7 +303,8 @@ pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/
 ```
 
 ### Method 3: Using docker
-The docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags).
+The docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](docker).
+Repalce `<secret>` below with your huggingface hub [token](https://huggingface.co/docs/hub/en/security-tokens).
 
 ```bash
 docker run --gpus all \
@@ -311,7 +313,7 @@ docker run --gpus all \
     --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
     --ipc=host \
     lmsysorg/sglang:latest \
-    python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B --host 0.0.0.0 --port 30000
+    python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 30000
 ```
 
 ### Common Notes
@@ -399,6 +401,21 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 - To enable fp8 quantization, you can add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
 - To enable experimental torch.compile support, you can add `--enable-torch-compile`. It accelerates small models on small batch sizes.
 
+### Run Llama 3.1 405B
+
+```bash
+# 2 nodes run 405B fp16
+# replace the `172.16.4.52:20000` with your own first node ip address and port, disable CUDA Graph temporarily
+# on the first node
+GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 0 --disable-cuda-graph --mem-frac 0.75
+
+# on the second
+GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 1 --disable-cuda-graph --mem-frac 0.75
+
+# single node run 405B fp8
+python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8
+```
+
 ### Supported Models
 
 - Llama / Llama 2 / Llama 3 / Llama 3.1
@@ -656,15 +673,12 @@ for out in state.text_iter():
 - The `choices` argument in `sgl.gen` is implemented by computing the [token-length normalized log probabilities](https://blog.eleuther.ai/multiple-choice-normalization/) of all choices and selecting the one with the highest probability.
 - The `regex` argument in `sgl.gen` is implemented through autoregressive decoding with logit bias masking, according to the constraints set by the regex. It is compatible with `temperature=0` and `temperature != 0`.
 
-## Benchmark And Performance
-- Llama-7B on NVIDIA A10G, FP16, Tensor Parallelism=1
-![llama_7b](assets/llama_7b.jpg)
 
-- Mixtral-8x7B on NVIDIA A10G, FP16, Tensor Parallelism=8
-![mixtral_8x7b](assets/mixtral_8x7b.jpg)
+## Benchmark And Performance
+![8b_throughput](https://lmsys.org/images/blog/sglang_llama3/8b_throughput.svg)
+![70b_fp8_throughput](https://lmsys.org/images/blog/sglang_llama3/70b_fp8_throughput.svg)
 
-- Learn more about the above [results](docs/benchmark_results.md).
-- Synthetic latency and throughput benchmark [scripts](https://github.com/sgl-project/sglang/tree/main/benchmark/latency_throughput).
+Learn more at this [blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/).
 
 ## Roadmap
 [Development Roadmap (2024 Q3)](https://github.com/sgl-project/sglang/issues/634)
{sglang-0.2.0 → sglang-0.2.1}/README.md

@@ -4,7 +4,7 @@
 
 --------------------------------------------------------------------------------
 
-| [**Blog**](https://lmsys.org/blog/2024-01-17-sglang/) | [**Paper**](https://arxiv.org/abs/2312.07104) |
+| [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/) | [**Paper**](https://arxiv.org/abs/2312.07104) |
 
 SGLang is a fast serving framework for large language models and vision language models.
 It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
@@ -14,13 +14,14 @@ The core features include:
 - **Flexible Frontend Language**: Enables easy programming of LLM applications with chained generation calls, advanced prompting, control flow, multiple modalities, parallelism, and external interactions.
 
 ## News
-- [2024/04] 🔥 SGLang is used by the official **LLaVA-NeXT (video)** release ([blog](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/)).
-- [2024/02] 🔥 SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).
-- [2024/01] SGLang provides up to **5x faster inference** with RadixAttention ([blog](https://lmsys.org/blog/2024-01-17-sglang/)).
+- [2024/07] 🔥 Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) ([blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/)).
+- [2024/04] SGLang is used by the official **LLaVA-NeXT (video)** release ([blog](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/)).
+- [2024/02] SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).
 
 <details>
 <summary>More</summary>
 
+- [2024/01] SGLang provides up to **5x faster inference** with RadixAttention ([blog](https://lmsys.org/blog/2024-01-17-sglang/)).
 - [2024/01] SGLang powers the serving of the official **LLaVA v1.6** release demo ([usage](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#demo)).
 
 </details>
@@ -57,7 +58,8 @@ pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/
 ```
 
 ### Method 3: Using docker
-The docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags).
+The docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](docker).
+Repalce `<secret>` below with your huggingface hub [token](https://huggingface.co/docs/hub/en/security-tokens).
 
 ```bash
 docker run --gpus all \
@@ -66,7 +68,7 @@ docker run --gpus all \
     --env "HUGGING_FACE_HUB_TOKEN=<secret>" \
     --ipc=host \
     lmsysorg/sglang:latest \
-    python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B --host 0.0.0.0 --port 30000
+    python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 30000
 ```
 
 ### Common Notes
@@ -154,6 +156,21 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 - To enable fp8 quantization, you can add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
 - To enable experimental torch.compile support, you can add `--enable-torch-compile`. It accelerates small models on small batch sizes.
 
+### Run Llama 3.1 405B
+
+```bash
+# 2 nodes run 405B fp16
+# replace the `172.16.4.52:20000` with your own first node ip address and port, disable CUDA Graph temporarily
+# on the first node
+GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 0 --disable-cuda-graph --mem-frac 0.75
+
+# on the second
+GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 1 --disable-cuda-graph --mem-frac 0.75
+
+# single node run 405B fp8
+python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8
+```
+
 ### Supported Models
 
 - Llama / Llama 2 / Llama 3 / Llama 3.1
@@ -411,15 +428,12 @@ for out in state.text_iter():
 - The `choices` argument in `sgl.gen` is implemented by computing the [token-length normalized log probabilities](https://blog.eleuther.ai/multiple-choice-normalization/) of all choices and selecting the one with the highest probability.
 - The `regex` argument in `sgl.gen` is implemented through autoregressive decoding with logit bias masking, according to the constraints set by the regex. It is compatible with `temperature=0` and `temperature != 0`.
 
-## Benchmark And Performance
-- Llama-7B on NVIDIA A10G, FP16, Tensor Parallelism=1
-![llama_7b](assets/llama_7b.jpg)
 
-- Mixtral-8x7B on NVIDIA A10G, FP16, Tensor Parallelism=8
-![mixtral_8x7b](assets/mixtral_8x7b.jpg)
+## Benchmark And Performance
+![8b_throughput](https://lmsys.org/images/blog/sglang_llama3/8b_throughput.svg)
+![70b_fp8_throughput](https://lmsys.org/images/blog/sglang_llama3/70b_fp8_throughput.svg)
 
-- Learn more about the above [results](docs/benchmark_results.md).
-- Synthetic latency and throughput benchmark [scripts](https://github.com/sgl-project/sglang/tree/main/benchmark/latency_throughput).
+Learn more at this [blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/).
 
 ## Roadmap
 [Development Roadmap (2024 Q3)](https://github.com/sgl-project/sglang/issues/634)
{sglang-0.2.0 → sglang-0.2.1}/pyproject.toml

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "sglang"
-version = "0.2.0"
+version = "0.2.1"
 description = "SGLang is yet another fast serving framework for large language models and vision language models."
 readme = "README.md"
 requires-python = ">=3.8"
{sglang-0.2.0 → sglang-0.2.1}/sglang/bench_serving.py

@@ -369,7 +369,7 @@ def sample_random_requests(
 ) -> List[Tuple[str, int, int]]:
 
     input_lens = np.random.randint(
-        int(input_len * range_ratio),
+        max(int(input_len * range_ratio), 1),
         input_len + 1,
         size=num_prompts,
     )
@@ -415,7 +415,7 @@ def sample_random_requests(
         prompt_token_ids = tokenizer(prompt).input_ids
         prompt_len = len(prompt_token_ids)
 
-        if prompt_len <= input_lens[i]:
+        if prompt_len > input_lens[i]:
             input_ids = prompt_token_ids[: input_lens[i]]
         else:
             ratio = (input_lens[i] + prompt_len - 1) // prompt_len
@@ -935,7 +935,7 @@ if __name__ == "__main__":
     parser.add_argument(
         "--random-range-ratio",
         type=float,
-        default=1.0,
+        default=0.0,
         help="Range of sampled ratio of input/output length, "
        "used only for random dataset.",
    )
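
Taken together, the `bench_serving.py` changes make the random dataset behave as intended: with the new `--random-range-ratio` default of `0.0`, the old lower bound `int(input_len * range_ratio)` evaluates to `0`, so `np.random.randint` could emit zero-length prompts; clamping it to `1` guarantees at least one input token. A minimal sketch of the fixed sampling (names mirror the diff; this is an illustration, not the full benchmark script):

```python
# Minimal sketch of the fixed length sampling in sample_random_requests
# (illustration only, not the full benchmark script).
import numpy as np

def sample_input_lens(input_len: int, range_ratio: float, num_prompts: int) -> np.ndarray:
    # 0.2.0: low bound was int(input_len * range_ratio) -> 0 when range_ratio == 0.0,
    # which allowed zero-length prompts.
    # 0.2.1: clamp the low bound to 1 so every sampled prompt has at least one token.
    low = max(int(input_len * range_ratio), 1)
    return np.random.randint(low, input_len + 1, size=num_prompts)

print(sample_input_lens(input_len=1024, range_ratio=0.0, num_prompts=5))
```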
{sglang-0.2.0 → sglang-0.2.1}/sglang/global_config.py

@@ -17,7 +17,7 @@ class GlobalConfig:
 
         # Runtime constants: New generation token ratio estimation
         self.init_new_token_ratio = 0.7
-        self.base_min_new_token_ratio = 0.2
+        self.base_min_new_token_ratio = 0.1
         self.new_token_ratio_decay = 0.001
         self.new_token_ratio_recovery = 0.05
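
Lowering `base_min_new_token_ratio` from 0.2 to 0.1 gives the runtime a deeper floor to decay toward when estimating how many new tokens running requests will still generate. The snippet below only illustrates how these constants could interact over scheduling steps; the actual decay and recovery logic lives in the controller and may differ:

```python
# Illustration of the constants from the diff interacting over time;
# not sglang's actual scheduler code.
init_new_token_ratio = 0.7
base_min_new_token_ratio = 0.1   # was 0.2 in 0.2.0
new_token_ratio_decay = 0.001

ratio = init_new_token_ratio
for _ in range(1000):
    ratio = max(ratio - new_token_ratio_decay, base_min_new_token_ratio)
print(f"ratio after 1000 steps: {ratio:.2f}")  # settles at the lower 0.1 floor
```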
 
{sglang-0.2.0 → sglang-0.2.1}/sglang/srt/managers/controller/model_runner.py

@@ -121,7 +121,7 @@ class ModelRunner:
             skip_tokenizer_init=True,
         )
 
-        if is_llama3_405b_fp8(self.model_config):
+        if is_llama3_405b_fp8(self.model_config) and self.tp_size <= 8:
             # A temporary hack to fix the num_heads for meta-llama/Meta-Llama-3.1-405B-FP8 checkpoints
             self.model_config.hf_config.num_key_value_heads = 8
             vllm_model_config.hf_config.num_key_value_heads = 8
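
The extra `self.tp_size <= 8` condition restricts the KV-head override to single-node tensor-parallel layouts. A hedged guess at the arithmetic (an inference from the diff, not documented sglang behaviour): once the head count is forced down to 8, tensor-parallel sizes above 8 can no longer split the KV heads evenly across ranks.

```python
# Inference from the diff, not documented sglang behaviour: the override to
# 8 KV heads only divides evenly across ranks for tensor-parallel sizes <= 8.
def kv_heads_per_rank(num_kv_heads: int, tp_size: int) -> float:
    return num_kv_heads / tp_size

for tp in (4, 8, 16):
    print(tp, kv_heads_per_rank(8, tp))  # 2.0, 1.0, 0.5 -> fractional beyond tp=8
```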
{sglang-0.2.0 → sglang-0.2.1}/sglang/srt/managers/io_struct.py

@@ -40,7 +40,10 @@ class GenerateReqInput:
             self.text is not None and self.input_ids is not None
         ):
             raise ValueError("Either text or input_ids should be provided.")
-        if self.sampling_params.get("n", 1) != 1:
+        if (
+            isinstance(self.sampling_params, dict)
+            and self.sampling_params.get("n", 1) != 1
+        ):
             is_single = False
         else:
             if self.text is not None:
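
The old check called `.get("n", 1)` on `sampling_params` unconditionally, which raises `AttributeError` when a batched request passes a list of per-sample dicts instead of a single dict. A stand-alone sketch of the added guard (the real `GenerateReqInput` carries more fields than shown here):

```python
# Stand-alone sketch of the guard added in GenerateReqInput's init check;
# the real dataclass has more fields, this only isolates the "n" lookup.
from typing import Dict, List, Union

def wants_parallel_sampling(sampling_params: Union[Dict, List[Dict]]) -> bool:
    # 0.2.0: `sampling_params.get("n", 1)` raised AttributeError when a
    # batched request supplied a list of per-sample dicts.
    # 0.2.1: only read "n" when sampling_params is a single dict.
    return isinstance(sampling_params, dict) and sampling_params.get("n", 1) != 1

print(wants_parallel_sampling({"n": 2}))              # True  -> treated as multiple outputs
print(wants_parallel_sampling([{"n": 1}, {"n": 1}]))  # False -> no crash; batching is
                                                      # decided from text/input_ids instead
```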
{sglang-0.2.0 → sglang-0.2.1}/sglang/srt/openai_api/adapter.py

@@ -94,9 +94,14 @@ def load_chat_template_for_openai_api(chat_template_arg):
 async def v1_completions(tokenizer_manager, raw_request: Request):
     request_json = await raw_request.json()
     request = CompletionRequest(**request_json)
+    prompt = request.prompt
+    if isinstance(prompt, str) or isinstance(prompt[0], str):
+        prompt_kwargs = {"text": prompt}
+    else:
+        prompt_kwargs = {"input_ids": prompt}
 
     adapted_request = GenerateReqInput(
-        text=request.prompt,
+        **prompt_kwargs,
         sampling_params={
             "temperature": request.temperature,
             "max_new_tokens": request.max_tokens,
{sglang-0.2.0 → sglang-0.2.1}/sglang/srt/utils.py

@@ -626,6 +626,7 @@ def is_llama3_405b_fp8(model_config):
         and model_config.hf_config.intermediate_size == 53248
         and model_config.hf_config.num_hidden_layers == 126
         and model_config.hf_config.num_key_value_heads == 16
+        and hasattr(model_config.hf_config, "quantization_config")
         and model_config.hf_config.quantization_config["quant_method"] == "fbgemm_fp8"
     ):
         return True
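
The added `hasattr` check keeps `is_llama3_405b_fp8` from raising on checkpoints whose config has no `quantization_config` attribute at all, such as an unquantized checkpoint. A small sketch with a stand-in config object, not the real transformers config class:

```python
# Stand-in config to show why the hasattr guard matters; not the real
# transformers config class, and only attributes from the diff are set.
class FakeHFConfig:
    intermediate_size = 53248
    num_hidden_layers = 126
    num_key_value_heads = 16
    # no `quantization_config` attribute, as on an unquantized checkpoint

cfg = FakeHFConfig()

# 0.2.0 behaviour: reading cfg.quantization_config directly raises AttributeError.
# 0.2.1 behaviour: guard first, then read.
is_fbgemm_fp8 = (
    hasattr(cfg, "quantization_config")
    and cfg.quantization_config["quant_method"] == "fbgemm_fp8"
)
print(is_fbgemm_fp8)  # False, instead of crashing
```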
sglang-0.2.1/sglang/version.py (new file)

@@ -0,0 +1 @@
+__version__ = "0.2.1"
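
With the version string now living in `sglang/version.py`, the bump can be verified after upgrading via the standard library, without assuming any sglang-specific API:

```python
# Confirm the installed sglang version without importing the package itself.
from importlib.metadata import version

print(version("sglang"))  # expected: 0.2.1
```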
{sglang-0.2.0 → sglang-0.2.1/sglang.egg-info}/PKG-INFO

(Identical to the top-level PKG-INFO diff shown above: the version bump to 0.2.1 and the same README updates.)
sglang-0.2.0/sglang/version.py (removed)

@@ -1 +0,0 @@
-__version__ = "0.2.0"