sglang 0.1.3__tar.gz → 0.1.5__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {sglang-0.1.3/sglang.egg-info → sglang-0.1.5}/PKG-INFO +44 -12
- {sglang-0.1.3 → sglang-0.1.5}/README.md +43 -11
- {sglang-0.1.3 → sglang-0.1.5}/pyproject.toml +1 -1
- {sglang-0.1.3 → sglang-0.1.5}/sglang/__init__.py +1 -1
- {sglang-0.1.3 → sglang-0.1.5}/sglang/api.py +1 -0
- sglang-0.1.5/sglang/backend/vertexai.py +147 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/lang/interpreter.py +8 -9
- {sglang-0.1.3 → sglang-0.1.5}/sglang/lang/ir.py +21 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/layers/context_flashattention_nopad.py +7 -1
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/layers/extend_attention.py +46 -1
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/managers/router/manager.py +2 -2
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/managers/router/model_rpc.py +7 -3
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/managers/router/model_runner.py +1 -1
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/models/mixtral.py +1 -1
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/server_args.py +22 -4
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/utils.py +1 -1
- {sglang-0.1.3 → sglang-0.1.5}/sglang/test/test_programs.py +4 -1
- {sglang-0.1.3 → sglang-0.1.5/sglang.egg-info}/PKG-INFO +44 -12
- {sglang-0.1.3 → sglang-0.1.5}/sglang.egg-info/SOURCES.txt +1 -2
- sglang-0.1.3/sglang/backend/huggingface.py +0 -349
- sglang-0.1.3/sglang/backend/tgi.py +0 -190
- {sglang-0.1.3 → sglang-0.1.5}/LICENSE +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/setup.cfg +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/backend/__init__.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/backend/anthropic.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/backend/base_backend.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/backend/openai.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/backend/runtime_endpoint.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/flush_cache.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/global_config.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/lang/__init__.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/lang/chat_template.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/lang/compiler.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/lang/tracer.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/launch_server.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/backend_config.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/constrained/fsm.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/constrained/fsm_cache.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/constrained/regex.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/constrained/tokenizer.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/hf_transformers_utils.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/layers/get_selected_logprob.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/layers/logits_processor.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/layers/radix_attention.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/layers/token_attention.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/managers/detokenizer_manager.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/managers/io_struct.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/managers/openai_protocol.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/managers/router/infer_batch.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/managers/router/radix_cache.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/managers/router/scheduler.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/managers/tokenizer_manager.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/memory_pool.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/model_config.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/models/llama2.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/models/llava.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/sampling_params.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/srt/server.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/test/test_utils.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang/utils.py +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang.egg-info/dependency_links.txt +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang.egg-info/requires.txt +0 -0
- {sglang-0.1.3 → sglang-0.1.5}/sglang.egg-info/top_level.txt +0 -0
{sglang-0.1.3/sglang.egg-info → sglang-0.1.5}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: sglang
-Version: 0.1.3
+Version: 0.1.5
 Summary: A structured generation langauge for LLMs.
 License: Apache License
                                 Version 2.0, January 2004
@@ -234,6 +234,7 @@ Requires-Dist: sglang[openai]; extra == "all"
 Requires-Dist: sglang[anthropic]; extra == "all"
 
 # SGLang
+| [**Blog**](https://lmsys.org/blog/2024-01-17-sglang/) | [**Paper**](https://arxiv.org/abs/2312.07104) |
 
 SGLang is a structured generation language designed for large language models (LLMs).
 It makes your interaction with LLMs faster and more controllable by co-designing the frontend language and the runtime system.
@@ -267,10 +268,20 @@ pip install --upgrade pip
 pip install -e "python[all]"
 ```
 
+### Notes
+- If you are using older GPUs (NVIDIA T4, V100), please use `pip install "triton>=2.2.0"` to avoid some bugs in the triton compiler
+- If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install sglang[openai]`
+
 ## Quick Start
 The example below shows how to use sglang to answer a mulit-turn question.
 
 ### Using OpenAI Models
+Set the OpenAI API Key
+```
+export OPENAI_API_KEY=sk-******
+```
+
+Then, answer a multi-turn question.
 ```python
 from sglang import function, system, user, assistant, gen, set_default_backend, OpenAI
 
@@ -325,6 +336,7 @@ for m in state.messages():
 
 ### More Examples
 
+Anthropic and VertexAI (Gemini) models are also supported.
 You can find more examples at [examples/quick_start](examples/quick_start).
 
 ## Frontend: Structured Generation Langauge (SGLang)
@@ -334,19 +346,20 @@ To begin with, import sglang.
 import sglang as sgl
 ```
 
-`sglang` provides some simple primitives such as `gen`, `select`, `fork`.
+`sglang` provides some simple primitives such as `gen`, `select`, `fork`, `image`.
 You can implement your prompt flow in a function decorated by `sgl.function`.
 You can then invoke the function with `run` or `run_batch`.
 The system will manage the state, chat template, and parallelism for you.
 
 ### Control Flow
+You can use any Python code within the function body, including control flow, nested function calls, and external libraries.
+
 ```python
 @sgl.function
 def control_flow(s, question):
     s += "To answer this question: " + question + ", "
     s += "I need to use a " + sgl.gen("tool", choices=["calculator", "web browser"]) + ". "
 
-    # You can use if or nested function calls
     if s["tool"] == "calculator":
         s += "The math expression is" + sgl.gen("expression")
     elif s["tool"] == "web browser":
@@ -354,6 +367,9 @@ def control_flow(s, question):
 ```
 
 ### Parallelism
+Use `fork` to launch parallel prompts.
+Because `sgl.gen` is non-blocking, the for loop below issues two generation calls in parallel.
+
 ```python
 @sgl.function
 def tip_suggestion(s):
@@ -362,7 +378,7 @@ def tip_suggestion(s):
         "1. Balanced Diet. 2. Regular Exercise.\n\n"
     )
 
-    forks = s.fork(2)
+    forks = s.fork(2)
     for i, f in enumerate(forks):
         f += f"Now, expand tip {i+1} into a paragraph:\n"
         f += sgl.gen(f"detailed_tip", max_tokens=256, stop="\n\n")
@@ -373,6 +389,8 @@ def tip_suggestion(s):
 ```
 
 ### Multi Modality
+Use `sgl.image` to pass an image as input.
+
 ```python
 @sgl.function
 def image_qa(s, image_file, question):
@@ -381,11 +399,13 @@ def image_qa(s, image_file, question):
 ```
 
 ### Constrained Decoding
+Use `regex=` to specify a regular expression as a decoding constraint.
+
 ```python
-@function
+@sgl.function
 def regular_expression_gen(s):
     s += "Q: What is the IP address of the Google DNS servers?\n"
-    s += "A: " + gen(
+    s += "A: " + sgl.gen(
         "answer",
         temperature=0,
         regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?).){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
@@ -393,6 +413,8 @@ def regular_expression_gen(s):
 ```
 
 ### Batching
+Use `run_batch` to run a batch of requests with continuous batching.
+
 ```python
 @sgl.function
 def text_qa(s, question):
@@ -405,10 +427,13 @@ states = text_qa.run_batch(
         {"question": "What is the capital of France?"},
         {"question": "What is the capital of Japan?"},
     ],
+    progress_bar=True
 )
 ```
 
 ### Streaming
+Add `stream=True` to enable streaming.
+
 ```python
 @sgl.function
 def text_qa(s, question):
@@ -417,7 +442,9 @@ def text_qa(s, question):
 
 states = text_qa.run(
     question="What is the capital of France?",
-    temperature=0.1
+    temperature=0.1,
+    stream=True
+)
 
 for out in state.text_iter():
     print(out, end="", flush=True)
@@ -426,7 +453,7 @@ for out in state.text_iter():
 ## Backend: SGLang Runtime (SRT)
 The SGLang Runtime (SRT) is designed to work best with the SGLang frontend.
 However, it can also be used as a standalone API server.
-In this case, the RadixAttention can still greatly accelerate many use cases.
+In this case, the [RadixAttention](https://arxiv.org/abs/2312.07104) can still greatly accelerate many use cases with automatic KV cache reuse.
 
 ### Usage
 Launch a server
@@ -450,6 +477,10 @@ curl http://localhost:30000/v1/completions \
 ```
 python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --tp 2
 ```
+- If you see out-of-memory errors during serving, please try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`
+```
+python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --mem-fraction-static 0.7
+```
 
 ### Supported Models
 - Llama
@@ -457,6 +488,7 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
 - Mixtral
 - LLaVA
   - `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --port 30000`
+- AWQ quantization
 
 ## Benchmark And Performance
 
@@ -466,13 +498,13 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
 - Mixtral-8x7B on NVIDIA A10G, FP16, Tensor Parallelism=8
 
 
-Learn more [here]().
+Learn more [here](docs/benchmark_results.md).
 
 ## Roadmap
-- [ ] Function call
-- [ ] Quantization
+- [ ] Function call APIs
 - [ ] S-LoRA
-- [ ]
+- [ ] Support more models
+- [ ] Support more hardware backends
 
 ## Citation And Acknowledgment
 ```
{sglang-0.1.3 → sglang-0.1.5}/README.md

@@ -1,4 +1,5 @@
 # SGLang
+| [**Blog**](https://lmsys.org/blog/2024-01-17-sglang/) | [**Paper**](https://arxiv.org/abs/2312.07104) |
 
 SGLang is a structured generation language designed for large language models (LLMs).
 It makes your interaction with LLMs faster and more controllable by co-designing the frontend language and the runtime system.
@@ -32,10 +33,20 @@ pip install --upgrade pip
 pip install -e "python[all]"
 ```
 
+### Notes
+- If you are using older GPUs (NVIDIA T4, V100), please use `pip install "triton>=2.2.0"` to avoid some bugs in the triton compiler
+- If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install sglang[openai]`
+
 ## Quick Start
 The example below shows how to use sglang to answer a mulit-turn question.
 
 ### Using OpenAI Models
+Set the OpenAI API Key
+```
+export OPENAI_API_KEY=sk-******
+```
+
+Then, answer a multi-turn question.
 ```python
 from sglang import function, system, user, assistant, gen, set_default_backend, OpenAI
 
@@ -90,6 +101,7 @@ for m in state.messages():
 
 ### More Examples
 
+Anthropic and VertexAI (Gemini) models are also supported.
 You can find more examples at [examples/quick_start](examples/quick_start).
 
 ## Frontend: Structured Generation Langauge (SGLang)
@@ -99,19 +111,20 @@ To begin with, import sglang.
 import sglang as sgl
 ```
 
-`sglang` provides some simple primitives such as `gen`, `select`, `fork`.
+`sglang` provides some simple primitives such as `gen`, `select`, `fork`, `image`.
 You can implement your prompt flow in a function decorated by `sgl.function`.
 You can then invoke the function with `run` or `run_batch`.
 The system will manage the state, chat template, and parallelism for you.
 
 ### Control Flow
+You can use any Python code within the function body, including control flow, nested function calls, and external libraries.
+
 ```python
 @sgl.function
 def control_flow(s, question):
     s += "To answer this question: " + question + ", "
     s += "I need to use a " + sgl.gen("tool", choices=["calculator", "web browser"]) + ". "
 
-    # You can use if or nested function calls
     if s["tool"] == "calculator":
         s += "The math expression is" + sgl.gen("expression")
     elif s["tool"] == "web browser":
@@ -119,6 +132,9 @@ def control_flow(s, question):
 ```
 
 ### Parallelism
+Use `fork` to launch parallel prompts.
+Because `sgl.gen` is non-blocking, the for loop below issues two generation calls in parallel.
+
 ```python
 @sgl.function
 def tip_suggestion(s):
@@ -127,7 +143,7 @@ def tip_suggestion(s):
         "1. Balanced Diet. 2. Regular Exercise.\n\n"
     )
 
-    forks = s.fork(2)
+    forks = s.fork(2)
     for i, f in enumerate(forks):
         f += f"Now, expand tip {i+1} into a paragraph:\n"
         f += sgl.gen(f"detailed_tip", max_tokens=256, stop="\n\n")
@@ -138,6 +154,8 @@ def tip_suggestion(s):
 ```
 
 ### Multi Modality
+Use `sgl.image` to pass an image as input.
+
 ```python
 @sgl.function
 def image_qa(s, image_file, question):
@@ -146,11 +164,13 @@ def image_qa(s, image_file, question):
 ```
 
 ### Constrained Decoding
+Use `regex=` to specify a regular expression as a decoding constraint.
+
 ```python
-@function
+@sgl.function
 def regular_expression_gen(s):
     s += "Q: What is the IP address of the Google DNS servers?\n"
-    s += "A: " + gen(
+    s += "A: " + sgl.gen(
         "answer",
         temperature=0,
         regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?).){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
@@ -158,6 +178,8 @@ def regular_expression_gen(s):
 ```
 
 ### Batching
+Use `run_batch` to run a batch of requests with continuous batching.
+
 ```python
 @sgl.function
 def text_qa(s, question):
@@ -170,10 +192,13 @@ states = text_qa.run_batch(
         {"question": "What is the capital of France?"},
         {"question": "What is the capital of Japan?"},
     ],
+    progress_bar=True
 )
 ```
 
 ### Streaming
+Add `stream=True` to enable streaming.
+
 ```python
 @sgl.function
 def text_qa(s, question):
@@ -182,7 +207,9 @@ def text_qa(s, question):
 
 states = text_qa.run(
     question="What is the capital of France?",
-    temperature=0.1
+    temperature=0.1,
+    stream=True
+)
 
 for out in state.text_iter():
     print(out, end="", flush=True)
@@ -191,7 +218,7 @@ for out in state.text_iter():
 ## Backend: SGLang Runtime (SRT)
 The SGLang Runtime (SRT) is designed to work best with the SGLang frontend.
 However, it can also be used as a standalone API server.
-In this case, the RadixAttention can still greatly accelerate many use cases.
+In this case, the [RadixAttention](https://arxiv.org/abs/2312.07104) can still greatly accelerate many use cases with automatic KV cache reuse.
 
 ### Usage
 Launch a server
@@ -215,6 +242,10 @@ curl http://localhost:30000/v1/completions \
 ```
 python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --tp 2
 ```
+- If you see out-of-memory errors during serving, please try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`
+```
+python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --mem-fraction-static 0.7
+```
 
 ### Supported Models
 - Llama
@@ -222,6 +253,7 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
 - Mixtral
 - LLaVA
   - `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --port 30000`
+- AWQ quantization
 
 ## Benchmark And Performance
 
@@ -231,13 +263,13 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
 - Mixtral-8x7B on NVIDIA A10G, FP16, Tensor Parallelism=8
 
 
-Learn more [here]().
+Learn more [here](docs/benchmark_results.md).
 
 ## Roadmap
-- [ ] Function call
-- [ ] Quantization
+- [ ] Function call APIs
 - [ ] S-LoRA
-- [ ]
+- [ ] Support more models
+- [ ] Support more hardware backends
 
 ## Citation And Acknowledgment
 ```
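The Quick Start hunks above only show the import line of the multi-turn example; the rest of that example is unchanged and therefore absent from the diff. As a rough sketch of the flow the README describes, assuming the `multi_turn_question` function name, the questions, and the `"gpt-3.5-turbo"` model name (all illustrative, not taken from this diff):

```python
from sglang import function, system, user, assistant, gen, set_default_backend, OpenAI

@function
def multi_turn_question(s, question_1, question_2):
    # Roles are appended as chat-template-aware segments.
    s += system("You are a helpful assistant.")
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=256))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=256))

# Requires OPENAI_API_KEY, as noted in the new "Set the OpenAI API Key" section.
set_default_backend(OpenAI("gpt-3.5-turbo"))

state = multi_turn_question.run(
    question_1="What is the capital of the United States?",
    question_2="List two local attractions there.",
)

for m in state.messages():
    print(m["role"], ":", m["content"])
```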
{sglang-0.1.3 → sglang-0.1.5}/sglang/api.py

@@ -6,6 +6,7 @@ from sglang.backend.anthropic import Anthropic
 from sglang.backend.base_backend import BaseBackend
 from sglang.backend.openai import OpenAI
 from sglang.backend.runtime_endpoint import RuntimeEndpoint
+from sglang.backend.vertexai import VertexAI
 from sglang.global_config import global_config
 from sglang.lang.ir import (
     SglExpr,
sglang-0.1.5/sglang/backend/vertexai.py (new file)

@@ -0,0 +1,147 @@
+import os
+import warnings
+from typing import List, Optional, Union
+
+import numpy as np
+from sglang.backend.base_backend import BaseBackend
+from sglang.lang.chat_template import get_chat_template
+from sglang.lang.interpreter import StreamExecutor
+from sglang.lang.ir import SglSamplingParams
+
+try:
+    import vertexai
+    from vertexai.preview.generative_models import (
+        GenerationConfig,
+        GenerativeModel,
+        Image,
+    )
+except ImportError as e:
+    GenerativeModel = e
+
+
+class VertexAI(BaseBackend):
+    def __init__(self, model_name):
+        super().__init__()
+
+        if isinstance(GenerativeModel, Exception):
+            raise GenerativeModel
+
+        project_id = os.environ["GCP_PROJECT_ID"]
+        location = os.environ.get("GCP_LOCATION")
+        vertexai.init(project=project_id, location=location)
+
+        self.model_name = model_name
+        self.chat_template = get_chat_template("default")
+
+    def get_chat_template(self):
+        return self.chat_template
+
+    def generate(
+        self,
+        s: StreamExecutor,
+        sampling_params: SglSamplingParams,
+    ):
+        if s.messages_:
+            prompt = self.messages_to_vertexai_input(s.messages_)
+        else:
+            # single-turn
+            prompt = (
+                self.text_to_vertexai_input(s.text_, s.cur_images)
+                if s.cur_images
+                else s.text_
+            )
+        ret = GenerativeModel(self.model_name).generate_content(
+            prompt,
+            generation_config=GenerationConfig(**sampling_params.to_vertexai_kwargs()),
+        )
+
+        comp = ret.text
+
+        return comp, {}
+
+    def generate_stream(
+        self,
+        s: StreamExecutor,
+        sampling_params: SglSamplingParams,
+    ):
+        if s.messages_:
+            prompt = self.messages_to_vertexai_input(s.messages_)
+        else:
+            # single-turn
+            prompt = (
+                self.text_to_vertexai_input(s.text_, s.cur_images)
+                if s.cur_images
+                else s.text_
+            )
+        generator = GenerativeModel(self.model_name).generate_content(
+            prompt,
+            stream=True,
+            generation_config=GenerationConfig(**sampling_params.to_vertexai_kwargs()),
+        )
+        for ret in generator:
+            yield ret.text, {}
+
+    def text_to_vertexai_input(self, text, images):
+        input = []
+        # split with image token
+        text_segs = text.split(self.chat_template.image_token)
+        for image_path, image_base64_data in images:
+            text_seg = text_segs.pop(0)
+            if text_seg != "":
+                input.append(text_seg)
+            input.append(Image.from_bytes(image_base64_data))
+        text_seg = text_segs.pop(0)
+        if text_seg != "":
+            input.append(text_seg)
+        return input
+
+    def messages_to_vertexai_input(self, messages):
+        vertexai_message = []
+        # from openai message format to vertexai message format
+        for msg in messages:
+            if isinstance(msg["content"], str):
+                text = msg["content"]
+            else:
+                text = msg["content"][0]["text"]
+
+            if msg["role"] == "system":
+                warnings.warn("Warning: system prompt is not supported in VertexAI.")
+                vertexai_message.append(
+                    {
+                        "role": "user",
+                        "parts": [{"text": "System prompt: " + text}],
+                    }
+                )
+                vertexai_message.append(
+                    {
+                        "role": "model",
+                        "parts": [{"text": "Understood."}],
+                    }
+                )
+                continue
+            if msg["role"] == "user":
+                vertexai_msg = {
+                    "role": "user",
+                    "parts": [{"text": text}],
+                }
+            elif msg["role"] == "assistant":
+                vertexai_msg = {
+                    "role": "model",
+                    "parts": [{"text": text}],
+                }
+
+            # images
+            if isinstance(msg["content"], list) and len(msg["content"]) > 1:
+                for image in msg["content"][1:]:
+                    assert image["type"] == "image_url"
+                    vertexai_msg["parts"].append(
+                        {
+                            "inline_data": {
+                                "data": image["image_url"]["url"].split(",")[1],
+                                "mime_type": "image/jpeg",
+                            }
+                        }
+                    )
+
+            vertexai_message.append(vertexai_msg)
+        return vertexai_message
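A short usage sketch for the new backend added above. It assumes the `"gemini-pro"` model name and that `GCP_PROJECT_ID` (and optionally `GCP_LOCATION`) are set in the environment, since `VertexAI.__init__` reads them before calling `vertexai.init`; the function body and question are illustrative only:

```python
import sglang as sgl
from sglang import set_default_backend
from sglang.backend.vertexai import VertexAI

# The constructor initializes the Vertex AI SDK from GCP_PROJECT_ID / GCP_LOCATION.
set_default_backend(VertexAI("gemini-pro"))

@sgl.function
def capital_qa(s, country):
    # user/assistant turns go through messages_to_vertexai_input();
    # note the backend warns that system prompts are not supported.
    s += sgl.user(f"What is the capital of {country}? Answer in one word.")
    s += sgl.assistant(sgl.gen("answer", max_tokens=16))

state = capital_qa.run(country="France")
print(state["answer"])
```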
{sglang-0.1.3 → sglang-0.1.5}/sglang/lang/interpreter.py

@@ -365,11 +365,10 @@ class StreamExecutor:
             for comp, meta_info in generator:
                 self.text_ += comp
                 self.variables[name] += comp
+                self.meta_info[name] = meta_info
                 self.stream_var_event[name].set()
                 self.stream_text_event.set()
 
-            self.meta_info[name] = meta_info
-
             self.variable_event[name].set()
             self.stream_var_event[name].set()
 
@@ -428,6 +427,7 @@ class StreamExecutor:
             self.messages_.append(last_msg)
             self.cur_images = []
         else:
+            # OpenAI chat API format
            self.messages_.append({"role": expr.role, "content": new_text})
 
         self.cur_role = None
@@ -582,7 +582,7 @@ class ProgramState:
         else:
             yield self.get_var(name)
 
-    async def text_async_iter(self, var_name=None):
+    async def text_async_iter(self, var_name=None, return_meta_data=False):
         loop = asyncio.get_running_loop()
 
         if self.stream_executor.stream:
@@ -606,7 +606,10 @@ class ProgramState:
                 out = str(self.stream_executor.variables[var_name][prev:])
                 prev += len(out)
                 if out:
-                    yield out
+                    if return_meta_data:
+                        yield out, self.stream_executor.meta_info[var_name]
+                    else:
+                        yield out
                 if self.stream_executor.variable_event[var_name].is_set():
                     break
             else:
@@ -632,11 +635,7 @@ class ProgramState:
         self.stream_executor.end()
 
     def __repr__(self) -> str:
-        msgs = self.messages()
-        ret = ""
-        for msg in msgs:
-            ret += msg["role"] + ":\n" + msg["content"] + "\n"
-        return ret
+        return f"ProgramState({self.text()})"
 
 
 class ProgramStateGroup:
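The interpreter changes above record `meta_info` per variable inside the streaming loop and add a `return_meta_data` flag to `text_async_iter`. A sketch of how a caller might consume it, assuming a streaming-capable backend (e.g. a local SGLang runtime endpoint) has already been set as the default and a `text_qa` function along the lines of the README example:

```python
import asyncio
import sglang as sgl

@sgl.function
def text_qa(s, question):
    s += "Q: " + question + "\n"
    s += "A: " + sgl.gen("answer", max_tokens=64, stop="\n")

async def main():
    state = text_qa.run(question="What is the capital of France?", stream=True)
    # With return_meta_data=True, each streamed chunk of the "answer" variable
    # is yielded together with the latest meta_info recorded for it.
    async for chunk, meta_info in state.text_async_iter("answer", return_meta_data=True):
        print(chunk, end="", flush=True)

asyncio.run(main())
```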
{sglang-0.1.3 → sglang-0.1.5}/sglang/lang/ir.py

@@ -2,6 +2,7 @@
 
 import dataclasses
 import inspect
+import warnings
 from typing import List, Optional, Union
 
 from sglang.global_config import global_config
@@ -40,6 +41,8 @@ class SglSamplingParams:
 
     def to_openai_kwargs(self):
         # OpenAI does not support top_k, so we drop it here
+        if self.regex is not None:
+            warnings.warn("Regular expression is not supported in the OpenAI backend.")
         return {
             "max_tokens": self.max_new_tokens,
             "stop": self.stop or None,
@@ -49,8 +52,26 @@ class SglSamplingParams:
             "presence_penalty": self.presence_penalty,
         }
 
+    def to_vertexai_kwargs(self):
+        if self.regex is not None:
+            warnings.warn(
+                "Regular expression is not supported in the VertexAI backend."
+            )
+        return {
+            "candidate_count": 1,
+            "max_output_tokens": self.max_new_tokens,
+            "stop_sequences": self.stop,
+            "temperature": self.temperature,
+            "top_p": self.top_p,
+            "top_k": self.top_k if self.top_k > 0 else None,
+        }
+
     def to_anthropic_kwargs(self):
         # Anthropic does not support frequency_penalty or presence_penalty, so we drop it here
+        if self.regex is not None:
+            warnings.warn(
+                "Regular expression is not supported in the Anthropic backend."
+            )
         return {
             "max_tokens_to_sample": self.max_new_tokens,
             "stop_sequences": self.stop,
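To make the new mapping concrete, here is a small sketch that builds a sampling configuration and translates it for the Vertex AI backend, and shows how the new warnings fire when `regex` is set for a backend that cannot enforce it. The keyword arguments to `SglSamplingParams` are assumed from the dataclass fields referenced in this diff:

```python
from sglang.lang.ir import SglSamplingParams

params = SglSamplingParams(
    max_new_tokens=128,
    stop="\n",
    temperature=0.7,
    top_p=0.9,
    top_k=40,
    regex=r"\d+",  # constrained decoding request
)

# Translated into kwargs for vertexai GenerationConfig; a UserWarning is emitted
# because regex cannot be honored there. With top_k <= 0, top_k would be None.
print(params.to_vertexai_kwargs())
# roughly: {'candidate_count': 1, 'max_output_tokens': 128, 'stop_sequences': '\n',
#           'temperature': 0.7, 'top_p': 0.9, 'top_k': 40}

# The OpenAI and Anthropic translations now warn in the same way.
print(params.to_openai_kwargs())
```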
{sglang-0.1.3 → sglang-0.1.5}/sglang/srt/layers/context_flashattention_nopad.py

@@ -5,6 +5,8 @@ import triton
 import triton.language as tl
 from sglang.srt.utils import wrap_kernel_launcher
 
+CUDA_CAPABILITY = torch.cuda.get_device_capability()
+
 
 @triton.jit
 def _fwd_kernel(
@@ -120,7 +122,11 @@ cached_kernel = None
 
 
 def context_attention_fwd(q, k, v, o, b_start_loc, b_seq_len, max_input_len):
-    BLOCK = 128
+    if CUDA_CAPABILITY[0] >= 8:
+        BLOCK = 128
+    else:
+        BLOCK = 64
+
     Lq, Lk, Lv = q.shape[-1], k.shape[-1], v.shape[-1]
     assert Lq == Lk and Lk == Lv
     assert Lk in {16, 32, 64, 128}
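The same capability gate shown standalone: `torch.cuda.get_device_capability()` returns a `(major, minor)` tuple, and the attention kernel now picks a smaller Triton block size on pre-Ampere GPUs instead of always using 128. A minimal sketch of the selection logic (the helper name is illustrative):

```python
import torch

def pick_block_size() -> int:
    # Ampere (SM 8.x) and newer keep the 128-wide block; older GPUs such as
    # T4 (SM 7.5) and V100 (SM 7.0) fall back to a 64-wide block.
    major, _minor = torch.cuda.get_device_capability()
    return 128 if major >= 8 else 64
```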