sglang 0.1.2__tar.gz → 0.1.4__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {sglang-0.1.2 → sglang-0.1.4}/PKG-INFO +101 -5
- {sglang-0.1.2 → sglang-0.1.4}/README.md +100 -4
- {sglang-0.1.2 → sglang-0.1.4}/pyproject.toml +1 -1
- {sglang-0.1.2 → sglang-0.1.4}/sglang/__init__.py +1 -1
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/layers/context_flashattention_nopad.py +8 -1
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/layers/extend_attention.py +47 -1
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/managers/router/model_rpc.py +2 -1
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/utils.py +1 -1
- {sglang-0.1.2 → sglang-0.1.4}/sglang.egg-info/PKG-INFO +101 -5
- {sglang-0.1.2 → sglang-0.1.4}/LICENSE +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/setup.cfg +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/api.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/backend/__init__.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/backend/anthropic.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/backend/base_backend.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/backend/huggingface.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/backend/openai.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/backend/runtime_endpoint.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/backend/tgi.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/flush_cache.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/global_config.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/lang/__init__.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/lang/chat_template.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/lang/compiler.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/lang/interpreter.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/lang/ir.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/lang/tracer.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/launch_server.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/backend_config.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/constrained/fsm.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/constrained/fsm_cache.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/constrained/regex.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/constrained/tokenizer.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/hf_transformers_utils.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/layers/get_selected_logprob.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/layers/logits_processor.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/layers/radix_attention.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/layers/token_attention.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/managers/detokenizer_manager.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/managers/io_struct.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/managers/openai_protocol.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/managers/router/infer_batch.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/managers/router/manager.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/managers/router/model_runner.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/managers/router/radix_cache.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/managers/router/scheduler.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/managers/tokenizer_manager.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/memory_pool.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/model_config.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/models/llama2.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/models/llava.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/models/mixtral.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/sampling_params.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/server.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/srt/server_args.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/test/test_programs.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/test/test_utils.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang/utils.py +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang.egg-info/SOURCES.txt +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang.egg-info/dependency_links.txt +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang.egg-info/requires.txt +0 -0
- {sglang-0.1.2 → sglang-0.1.4}/sglang.egg-info/top_level.txt +0 -0

{sglang-0.1.2 → sglang-0.1.4}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: sglang
-Version: 0.1.2
+Version: 0.1.4
 Summary: A structured generation langauge for LLMs.
 License: Apache License
                         Version 2.0, January 2004
@@ -267,10 +267,20 @@ pip install --upgrade pip
 pip install -e "python[all]"
 ```

+### Notes
+- If you are using older GPUs (NVIDIA T4, V100), please use `pip install "triton>=2.2.0"` to avoid some bugs in the triton compiler
+- If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install sglang[openai]`
+
 ## Quick Start
 The example below shows how to use sglang to answer a mulit-turn question.

 ### Using OpenAI Models
+Set the OpenAI API Key
+```
+export OPENAI_API_KEY=sk-xxxxxx
+```
+
+Then, answer a multi-turn question.
 ```python
 from sglang import function, system, user, assistant, gen, set_default_backend, OpenAI

@@ -329,30 +339,104 @@ You can find more examples at [examples/quick_start](examples/quick_start).

 ## Frontend: Structured Generation Langauge (SGLang)

+To begin with, import sglang.
+```python
+import sglang as sgl
+```
+
+`sglang` provides some simple primitives such as `gen`, `select`, `fork`, `image`.
+You can implement your prompt flow in a function decorated by `sgl.function`.
+You can then invoke the function with `run` or `run_batch`.
+The system will manage the state, chat template, and parallelism for you.
+
 ### Control Flow
+```python
+@sgl.function
+def control_flow(s, question):
+    s += "To answer this question: " + question + ", "
+    s += "I need to use a " + sgl.gen("tool", choices=["calculator", "web browser"]) + ". "
+
+    # You can use if or nested function calls
+    if s["tool"] == "calculator":
+        s += "The math expression is" + sgl.gen("expression")
+    elif s["tool"] == "web browser":
+        s += "The website url is" + sgl.gen("url")
+```

 ### Parallelism
+```python
+@sgl.function
+def tip_suggestion(s):
+    s += (
+        "Here are two tips for staying healthy: "
+        "1. Balanced Diet. 2. Regular Exercise.\n\n"
+    )
+
+    forks = s.fork(2) # Launch parallel prompts
+    for i, f in enumerate(forks):
+        f += f"Now, expand tip {i+1} into a paragraph:\n"
+        f += sgl.gen(f"detailed_tip", max_tokens=256, stop="\n\n")
+
+    s += "Tip 1:" + forks[0]["detailed_tip"] + "\n"
+    s += "Tip 2:" + forks[1]["detailed_tip"] + "\n"
+    s += "In summary" + sgl.gen("summary")
+```

 ### Multi Modality
 ```python
 @sgl.function
 def image_qa(s, image_file, question):
     s += sgl.user(sgl.image(image_file) + question)
-    s += sgl.assistant(sgl.gen("
+    s += sgl.assistant(sgl.gen("answer", max_tokens=256)
 ```

-### Constrained
+### Constrained Decoding
+```python
+@sgl.function
+def regular_expression_gen(s):
+    s += "Q: What is the IP address of the Google DNS servers?\n"
+    s += "A: " + sgl.gen(
+        "answer",
+        temperature=0,
+        regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?).){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
+    )
+```

 ### Batching
+```python
+@sgl.function
+def text_qa(s, question):
+    s += "Q: " + question + "\n"
+    s += "A:" + sgl.gen("answer", stop="\n")
+
+states = text_qa.run_batch(
+    [
+        {"question": "What is the capital of the United Kingdom?"},
+        {"question": "What is the capital of France?"},
+        {"question": "What is the capital of Japan?"},
+    ],
+)
+```

 ### Streaming
+```python
+@sgl.function
+def text_qa(s, question):
+    s += "Q: " + question + "\n"
+    s += "A:" + sgl.gen("answer", stop="\n")

-
+states = text_qa.run(
+    question="What is the capital of France?",
+    temperature=0.1)
+
+for out in state.text_iter():
+    print(out, end="", flush=True)
+```

 ## Backend: SGLang Runtime (SRT)
 The SGLang Runtime (SRT) is designed to work best with the SGLang frontend.
 However, it can also be used as a standalone API server.
-In this case, the RadixAttention can still greatly accelerate many use cases.
+In this case, the [RadixAttention](https://arxiv.org/abs/2312.07104) can still greatly accelerate many use cases with automatic KV cache reuse.

 ### Usage
 Launch a server
@@ -376,6 +460,10 @@ curl http://localhost:30000/v1/completions \
 ```
 python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --tp 2
 ```
+- If you see out-of-memory errors during serving, please try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`
+```
+python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --mem-fraction-static 0.7
+```

 ### Supported Models
 - Llama
@@ -386,6 +474,14 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port

 ## Benchmark And Performance

+- Llama-7B on NVIDIA A10G, FP16, Tensor Parallelism=1
+![llama_7b](assets/llama_7b.jpg)
+
+- Mixtral-8x7B on NVIDIA A10G, FP16, Tensor Parallelism=8
+![mixtral_8x7b](assets/mixtral_8x7b.jpg)
+
+Learn more [here](docs/benchmark_results.md).
+
 ## Roadmap
 - [ ] Function call
 - [ ] Quantization

{sglang-0.1.2 → sglang-0.1.4}/README.md

@@ -32,10 +32,20 @@ pip install --upgrade pip
 pip install -e "python[all]"
 ```

+### Notes
+- If you are using older GPUs (NVIDIA T4, V100), please use `pip install "triton>=2.2.0"` to avoid some bugs in the triton compiler
+- If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install sglang[openai]`
+
 ## Quick Start
 The example below shows how to use sglang to answer a mulit-turn question.

 ### Using OpenAI Models
+Set the OpenAI API Key
+```
+export OPENAI_API_KEY=sk-xxxxxx
+```
+
+Then, answer a multi-turn question.
 ```python
 from sglang import function, system, user, assistant, gen, set_default_backend, OpenAI

@@ -94,30 +104,104 @@ You can find more examples at [examples/quick_start](examples/quick_start).

 ## Frontend: Structured Generation Langauge (SGLang)

+To begin with, import sglang.
+```python
+import sglang as sgl
+```
+
+`sglang` provides some simple primitives such as `gen`, `select`, `fork`, `image`.
+You can implement your prompt flow in a function decorated by `sgl.function`.
+You can then invoke the function with `run` or `run_batch`.
+The system will manage the state, chat template, and parallelism for you.
+
 ### Control Flow
+```python
+@sgl.function
+def control_flow(s, question):
+    s += "To answer this question: " + question + ", "
+    s += "I need to use a " + sgl.gen("tool", choices=["calculator", "web browser"]) + ". "
+
+    # You can use if or nested function calls
+    if s["tool"] == "calculator":
+        s += "The math expression is" + sgl.gen("expression")
+    elif s["tool"] == "web browser":
+        s += "The website url is" + sgl.gen("url")
+```

 ### Parallelism
+```python
+@sgl.function
+def tip_suggestion(s):
+    s += (
+        "Here are two tips for staying healthy: "
+        "1. Balanced Diet. 2. Regular Exercise.\n\n"
+    )
+
+    forks = s.fork(2) # Launch parallel prompts
+    for i, f in enumerate(forks):
+        f += f"Now, expand tip {i+1} into a paragraph:\n"
+        f += sgl.gen(f"detailed_tip", max_tokens=256, stop="\n\n")
+
+    s += "Tip 1:" + forks[0]["detailed_tip"] + "\n"
+    s += "Tip 2:" + forks[1]["detailed_tip"] + "\n"
+    s += "In summary" + sgl.gen("summary")
+```

 ### Multi Modality
 ```python
 @sgl.function
 def image_qa(s, image_file, question):
     s += sgl.user(sgl.image(image_file) + question)
-    s += sgl.assistant(sgl.gen("
+    s += sgl.assistant(sgl.gen("answer", max_tokens=256)
 ```

-### Constrained
+### Constrained Decoding
+```python
+@sgl.function
+def regular_expression_gen(s):
+    s += "Q: What is the IP address of the Google DNS servers?\n"
+    s += "A: " + sgl.gen(
+        "answer",
+        temperature=0,
+        regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?).){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
+    )
+```

 ### Batching
+```python
+@sgl.function
+def text_qa(s, question):
+    s += "Q: " + question + "\n"
+    s += "A:" + sgl.gen("answer", stop="\n")
+
+states = text_qa.run_batch(
+    [
+        {"question": "What is the capital of the United Kingdom?"},
+        {"question": "What is the capital of France?"},
+        {"question": "What is the capital of Japan?"},
+    ],
+)
+```

 ### Streaming
+```python
+@sgl.function
+def text_qa(s, question):
+    s += "Q: " + question + "\n"
+    s += "A:" + sgl.gen("answer", stop="\n")

-
+states = text_qa.run(
+    question="What is the capital of France?",
+    temperature=0.1)
+
+for out in state.text_iter():
+    print(out, end="", flush=True)
+```

 ## Backend: SGLang Runtime (SRT)
 The SGLang Runtime (SRT) is designed to work best with the SGLang frontend.
 However, it can also be used as a standalone API server.
-In this case, the RadixAttention can still greatly accelerate many use cases.
+In this case, the [RadixAttention](https://arxiv.org/abs/2312.07104) can still greatly accelerate many use cases with automatic KV cache reuse.

 ### Usage
 Launch a server
@@ -141,6 +225,10 @@ curl http://localhost:30000/v1/completions \
 ```
 python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --tp 2
 ```
+- If you see out-of-memory errors during serving, please try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`
+```
+python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --mem-fraction-static 0.7
+```

 ### Supported Models
 - Llama
@@ -151,6 +239,14 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port

 ## Benchmark And Performance

+- Llama-7B on NVIDIA A10G, FP16, Tensor Parallelism=1
+![llama_7b](assets/llama_7b.jpg)
+
+- Mixtral-8x7B on NVIDIA A10G, FP16, Tensor Parallelism=8
+![mixtral_8x7b](assets/mixtral_8x7b.jpg)
+
+Learn more [here](docs/benchmark_results.md).
+
 ## Roadmap
 - [ ] Function call
 - [ ] Quantization

{sglang-0.1.2 → sglang-0.1.4}/sglang/srt/layers/context_flashattention_nopad.py

@@ -6,6 +6,9 @@ import triton.language as tl
 from sglang.srt.utils import wrap_kernel_launcher


+CUDA_CAPABILITY = torch.cuda.get_device_capability()
+
+
 @triton.jit
 def _fwd_kernel(
     Q,
@@ -120,7 +123,11 @@ cached_kernel = None


 def context_attention_fwd(q, k, v, o, b_start_loc, b_seq_len, max_input_len):
-
+    if CUDA_CAPABILITY[0] >= 8:
+        BLOCK = 128
+    else:
+        BLOCK = 64
+
     Lq, Lk, Lv = q.shape[-1], k.shape[-1], v.shape[-1]
     assert Lq == Lk and Lk == Lv
     assert Lk in {16, 32, 64, 128}
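
The change above keys the Triton tile size off the GPU's compute capability: Ampere-class and newer cards (capability 8.x or higher) keep the 128-wide block, while older parts such as T4 or V100 drop to 64, which is the safer default on GPUs with less shared memory per SM. A minimal sketch of the same pattern; the `pick_block_size` helper is illustrative and not part of sglang:

```python
import torch

# (major, minor) compute capability, e.g. (7, 5) for T4, (8, 0) for A100.
# Guarded so the snippet also runs on a machine without a GPU.
CUDA_CAPABILITY = torch.cuda.get_device_capability() if torch.cuda.is_available() else (0, 0)


def pick_block_size() -> int:
    """Pick a Triton tile width based on the GPU generation."""
    # sm_80+ keeps the larger tile; pre-Ampere GPUs fall back to 64.
    return 128 if CUDA_CAPABILITY[0] >= 8 else 64
```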

{sglang-0.1.2 → sglang-0.1.4}/sglang/srt/layers/extend_attention.py

@@ -2,6 +2,10 @@ import torch
 import triton
 import triton.language as tl
 from sglang.srt.layers.context_flashattention_nopad import context_attention_fwd
+from sglang.srt.utils import wrap_kernel_launcher
+
+
+CUDA_CAPABILITY = torch.cuda.get_device_capability()


 @triton.jit
@@ -153,6 +157,9 @@ def _fwd_kernel(
     tl.store(O_Extend + offs_o, acc / deno[:, None], mask=mask_m[:, None])


+cached_kernel = None
+
+
 def extend_attention_fwd(
     q_extend,
     k_extend,
@@ -175,7 +182,11 @@ def extend_attention_fwd(

     k_buffer, v_buffer: (prefix + extend) tensors in mem_manager
     """
-
+    if CUDA_CAPABILITY[0] >= 8:
+        BLOCK_M, BLOCK_N = 128, 128
+    else:
+        BLOCK_M, BLOCK_N = 64, 64
+
     Lq, Lk, Lv, Lo = (
         q_extend.shape[-1],
         k_extend.shape[-1],
@@ -193,6 +204,40 @@ def extend_attention_fwd(
     num_warps = 4 if Lk <= 64 else 8
     num_stages = 1

+    global cached_kernel
+    if cached_kernel:
+        cached_kernel(
+            grid,
+            num_warps,
+            q_extend,
+            k_extend,
+            v_extend,
+            o_extend,
+            k_buffer,
+            v_buffer,
+            req_to_tokens,
+            b_req_idx,
+            b_seq_len,
+            b_start_loc_extend,
+            b_seq_len_extend,
+            sm_scale,
+            kv_group_num,
+            q_extend.stride(0),
+            q_extend.stride(1),
+            k_extend.stride(0),
+            k_extend.stride(1),
+            v_extend.stride(0),
+            v_extend.stride(1),
+            o_extend.stride(0),
+            o_extend.stride(1),
+            k_buffer.stride(0),
+            k_buffer.stride(1),
+            v_buffer.stride(0),
+            v_buffer.stride(1),
+            req_to_tokens.stride(0),
+        )
+        return
+
     _fwd_kernel[grid](
         q_extend,
         k_extend,
@@ -226,6 +271,7 @@ def extend_attention_fwd(
         num_warps=num_warps,
         num_stages=num_stages,
     )
+    cached_kernel = wrap_kernel_launcher(_fwd_kernel)


 def redundant_attention(
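
The other change to this file caches the kernel launcher after the first call (`cached_kernel = wrap_kernel_launcher(_fwd_kernel)`), so later calls to `extend_attention_fwd` re-launch the already-compiled Triton kernel directly instead of going through the JIT dispatch path again. A rough, self-contained sketch of that caching pattern; `compile_kernel` below is a stand-in and not sglang's actual `wrap_kernel_launcher`:

```python
from typing import Callable, Optional

# Module-level cache, mirroring `cached_kernel = None` in extend_attention.py.
cached_kernel: Optional[Callable] = None


def compile_kernel() -> Callable:
    """Stand-in for Triton JIT compilation: expensive once, cheap to call afterwards."""
    def launcher(grid, num_warps, *args):
        # A real launcher would enqueue the compiled GPU kernel with these arguments.
        return ("launched", grid, num_warps, len(args))
    return launcher


def attention_fwd(grid, num_warps, *args):
    global cached_kernel
    if cached_kernel is not None:
        # Fast path: reuse the cached launcher and skip dispatch/compilation.
        return cached_kernel(grid, num_warps, *args)

    launcher = compile_kernel()   # slow path, analogous to `_fwd_kernel[grid](...)`
    cached_kernel = launcher      # analogous to `wrap_kernel_launcher(_fwd_kernel)`
    return launcher(grid, num_warps, *args)


print(attention_fwd((4,), 8, "q", "k", "v"))  # first call compiles and caches
print(attention_fwd((4,), 8, "q", "k", "v"))  # later calls hit the cache
```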

{sglang-0.1.2 → sglang-0.1.4}/sglang/srt/managers/router/model_rpc.py

@@ -5,6 +5,7 @@ import time
 from concurrent.futures import ThreadPoolExecutor
 from enum import Enum, auto
 from typing import Dict, List, Optional, Tuple, Union
+import warnings

 import numpy as np
 import rpyc
@@ -164,7 +165,7 @@ class ModelRpcServer(rpyc.Service):
             + self.tree_cache.evictable_size()
         )
         if available_size != self.max_total_num_token:
-
+            warnings.warn(
                 "Warning: "
                 f"available_size={available_size}, max_total_num_token={self.max_total_num_token}\n"
                 "KV cache pool leak detected!"
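
Besides importing `warnings`, this hunk routes the KV cache leak report through `warnings.warn`, so it goes through the standard warning machinery (filterable, reported with a source location) rather than plain output. A simplified sketch of the accounting check, with the pool and cache sizes passed in as plain integers instead of being read from `token_to_kv_pool` and `tree_cache`:

```python
import warnings


def check_kv_cache_leak(free_slots: int, evictable_slots: int, max_total_num_token: int) -> None:
    """Warn if free + evictable KV cache slots no longer add up to the pool size."""
    available_size = free_slots + evictable_slots
    if available_size != max_total_num_token:
        warnings.warn(
            "Warning: "
            f"available_size={available_size}, max_total_num_token={max_total_num_token}\n"
            "KV cache pool leak detected!"
        )


# Example: a pool of 1000 token slots where 10 slots went missing.
check_kv_cache_leak(free_slots=600, evictable_slots=390, max_total_num_token=1000)
```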

{sglang-0.1.2 → sglang-0.1.4}/sglang/srt/utils.py

@@ -209,7 +209,7 @@ def load_image(image_file):
     elif image_file.lower().endswith(("png", "jpg", "jpeg", "webp", "gif")):
         image = Image.open(image_file)
     elif image_file.startswith("data:"):
-        image_file =
+        image_file = image_file.split(",")[1]
         image = Image.open(BytesIO(base64.b64decode(image_file)))
     else:
         image = Image.open(BytesIO(base64.b64decode(image_file)))
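
This one-line fix makes `load_image` handle `data:` URIs such as `data:image/png;base64,<payload>`: everything up to the first comma is stripped before base64-decoding. A standalone version of just that branch (the helper name is illustrative; the real logic lives inside `load_image` in `sglang/srt/utils.py`):

```python
import base64
from io import BytesIO

from PIL import Image


def load_image_from_data_uri(image_file: str) -> Image.Image:
    """Decode a `data:image/...;base64,<payload>` URI into a PIL image."""
    assert image_file.startswith("data:")
    payload = image_file.split(",")[1]  # drop the "data:image/png;base64," prefix
    return Image.open(BytesIO(base64.b64decode(payload)))
```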

{sglang-0.1.2 → sglang-0.1.4}/sglang.egg-info/PKG-INFO

The changes to sglang.egg-info/PKG-INFO are identical to the PKG-INFO changes shown above: the version bump to 0.1.4 and the regenerated README content.