sglang 0.1.4__tar.gz → 0.1.5__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (63)
  1. {sglang-0.1.4/sglang.egg-info → sglang-0.1.5}/PKG-INFO +26 -8
  2. {sglang-0.1.4 → sglang-0.1.5}/README.md +25 -7
  3. {sglang-0.1.4 → sglang-0.1.5}/pyproject.toml +1 -1
  4. {sglang-0.1.4 → sglang-0.1.5}/sglang/__init__.py +1 -1
  5. {sglang-0.1.4 → sglang-0.1.5}/sglang/api.py +1 -0
  6. sglang-0.1.5/sglang/backend/vertexai.py +147 -0
  7. {sglang-0.1.4 → sglang-0.1.5}/sglang/lang/interpreter.py +8 -9
  8. {sglang-0.1.4 → sglang-0.1.5}/sglang/lang/ir.py +21 -0
  9. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/layers/context_flashattention_nopad.py +0 -1
  10. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/layers/extend_attention.py +0 -1
  11. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/managers/router/manager.py +2 -2
  12. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/managers/router/model_rpc.py +6 -3
  13. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/managers/router/model_runner.py +1 -1
  14. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/models/mixtral.py +1 -1
  15. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/server_args.py +22 -4
  16. {sglang-0.1.4 → sglang-0.1.5}/sglang/test/test_programs.py +4 -1
  17. {sglang-0.1.4 → sglang-0.1.5/sglang.egg-info}/PKG-INFO +26 -8
  18. {sglang-0.1.4 → sglang-0.1.5}/sglang.egg-info/SOURCES.txt +1 -2
  19. sglang-0.1.4/sglang/backend/huggingface.py +0 -349
  20. sglang-0.1.4/sglang/backend/tgi.py +0 -190
  21. {sglang-0.1.4 → sglang-0.1.5}/LICENSE +0 -0
  22. {sglang-0.1.4 → sglang-0.1.5}/setup.cfg +0 -0
  23. {sglang-0.1.4 → sglang-0.1.5}/sglang/backend/__init__.py +0 -0
  24. {sglang-0.1.4 → sglang-0.1.5}/sglang/backend/anthropic.py +0 -0
  25. {sglang-0.1.4 → sglang-0.1.5}/sglang/backend/base_backend.py +0 -0
  26. {sglang-0.1.4 → sglang-0.1.5}/sglang/backend/openai.py +0 -0
  27. {sglang-0.1.4 → sglang-0.1.5}/sglang/backend/runtime_endpoint.py +0 -0
  28. {sglang-0.1.4 → sglang-0.1.5}/sglang/flush_cache.py +0 -0
  29. {sglang-0.1.4 → sglang-0.1.5}/sglang/global_config.py +0 -0
  30. {sglang-0.1.4 → sglang-0.1.5}/sglang/lang/__init__.py +0 -0
  31. {sglang-0.1.4 → sglang-0.1.5}/sglang/lang/chat_template.py +0 -0
  32. {sglang-0.1.4 → sglang-0.1.5}/sglang/lang/compiler.py +0 -0
  33. {sglang-0.1.4 → sglang-0.1.5}/sglang/lang/tracer.py +0 -0
  34. {sglang-0.1.4 → sglang-0.1.5}/sglang/launch_server.py +0 -0
  35. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/backend_config.py +0 -0
  36. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/constrained/fsm.py +0 -0
  37. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/constrained/fsm_cache.py +0 -0
  38. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/constrained/regex.py +0 -0
  39. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/constrained/tokenizer.py +0 -0
  40. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/hf_transformers_utils.py +0 -0
  41. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/layers/get_selected_logprob.py +0 -0
  42. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/layers/logits_processor.py +0 -0
  43. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/layers/radix_attention.py +0 -0
  44. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/layers/token_attention.py +0 -0
  45. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/managers/detokenizer_manager.py +0 -0
  46. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/managers/io_struct.py +0 -0
  47. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/managers/openai_protocol.py +0 -0
  48. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/managers/router/infer_batch.py +0 -0
  49. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/managers/router/radix_cache.py +0 -0
  50. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/managers/router/scheduler.py +0 -0
  51. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/managers/tokenizer_manager.py +0 -0
  52. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/memory_pool.py +0 -0
  53. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/model_config.py +0 -0
  54. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/models/llama2.py +0 -0
  55. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/models/llava.py +0 -0
  56. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/sampling_params.py +0 -0
  57. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/server.py +0 -0
  58. {sglang-0.1.4 → sglang-0.1.5}/sglang/srt/utils.py +0 -0
  59. {sglang-0.1.4 → sglang-0.1.5}/sglang/test/test_utils.py +0 -0
  60. {sglang-0.1.4 → sglang-0.1.5}/sglang/utils.py +0 -0
  61. {sglang-0.1.4 → sglang-0.1.5}/sglang.egg-info/dependency_links.txt +0 -0
  62. {sglang-0.1.4 → sglang-0.1.5}/sglang.egg-info/requires.txt +0 -0
  63. {sglang-0.1.4 → sglang-0.1.5}/sglang.egg-info/top_level.txt +0 -0
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: sglang
- Version: 0.1.4
+ Version: 0.1.5
  Summary: A structured generation langauge for LLMs.
  License: Apache License
  Version 2.0, January 2004
@@ -234,6 +234,7 @@ Requires-Dist: sglang[openai]; extra == "all"
  Requires-Dist: sglang[anthropic]; extra == "all"

  # SGLang
+ | [**Blog**](https://lmsys.org/blog/2024-01-17-sglang/) | [**Paper**](https://arxiv.org/abs/2312.07104) |

  SGLang is a structured generation language designed for large language models (LLMs).
  It makes your interaction with LLMs faster and more controllable by co-designing the frontend language and the runtime system.
@@ -277,7 +278,7 @@ The example below shows how to use sglang to answer a mulit-turn question.
  ### Using OpenAI Models
  Set the OpenAI API Key
  ```
- export OPENAI_API_KEY=sk-xxxxxx
+ export OPENAI_API_KEY=sk-******
  ```

  Then, answer a multi-turn question.
@@ -335,6 +336,7 @@ for m in state.messages():

  ### More Examples

+ Anthropic and VertexAI (Gemini) models are also supported.
  You can find more examples at [examples/quick_start](examples/quick_start).

  ## Frontend: Structured Generation Langauge (SGLang)
@@ -350,13 +352,14 @@ You can then invoke the function with `run` or `run_batch`.
  The system will manage the state, chat template, and parallelism for you.

  ### Control Flow
+ You can use any Python code within the function body, including control flow, nested function calls, and external libraries.
+
  ```python
  @sgl.function
  def control_flow(s, question):
  s += "To answer this question: " + question + ", "
  s += "I need to use a " + sgl.gen("tool", choices=["calculator", "web browser"]) + ". "

- # You can use if or nested function calls
  if s["tool"] == "calculator":
  s += "The math expression is" + sgl.gen("expression")
  elif s["tool"] == "web browser":
@@ -364,6 +367,9 @@ def control_flow(s, question):
  ```

  ### Parallelism
+ Use `fork` to launch parallel prompts.
+ Because `sgl.gen` is non-blocking, the for loop below issues two generation calls in parallel.
+
  ```python
  @sgl.function
  def tip_suggestion(s):
@@ -372,7 +378,7 @@ def tip_suggestion(s):
  "1. Balanced Diet. 2. Regular Exercise.\n\n"
  )

- forks = s.fork(2) # Launch parallel prompts
+ forks = s.fork(2)
  for i, f in enumerate(forks):
  f += f"Now, expand tip {i+1} into a paragraph:\n"
  f += sgl.gen(f"detailed_tip", max_tokens=256, stop="\n\n")
@@ -383,6 +389,8 @@ def tip_suggestion(s):
  ```

  ### Multi Modality
+ Use `sgl.image` to pass an image as input.
+
  ```python
  @sgl.function
  def image_qa(s, image_file, question):
@@ -391,6 +399,8 @@ def image_qa(s, image_file, question):
  ```

  ### Constrained Decoding
+ Use `regex=` to specify a regular expression as a decoding constraint.
+
  ```python
  @sgl.function
  def regular_expression_gen(s):
@@ -403,6 +413,8 @@ def regular_expression_gen(s):
  ```

  ### Batching
+ Use `run_batch` to run a batch of requests with continuous batching.
+
  ```python
  @sgl.function
  def text_qa(s, question):
@@ -415,10 +427,13 @@ states = text_qa.run_batch(
  {"question": "What is the capital of France?"},
  {"question": "What is the capital of Japan?"},
  ],
+ progress_bar=True
  )
  ```

  ### Streaming
+ Add `stream=True` to enable streaming.
+
  ```python
  @sgl.function
  def text_qa(s, question):
@@ -427,7 +442,9 @@ def text_qa(s, question):

  states = text_qa.run(
  question="What is the capital of France?",
- temperature=0.1)
+ temperature=0.1,
+ stream=True
+ )

  for out in state.text_iter():
  print(out, end="", flush=True)
@@ -471,6 +488,7 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
  - Mixtral
  - LLaVA
  - `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --port 30000`
+ - AWQ quantization

  ## Benchmark And Performance

@@ -483,10 +501,10 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
  Learn more [here](docs/benchmark_results.md).

  ## Roadmap
- - [ ] Function call
- - [ ] Quantization
+ - [ ] Function call APIs
  - [ ] S-LoRA
- - [ ] More models
+ - [ ] Support more models
+ - [ ] Support more hardware backends

  ## Citation And Acknowledgment
  ```
@@ -1,4 +1,5 @@
  # SGLang
+ | [**Blog**](https://lmsys.org/blog/2024-01-17-sglang/) | [**Paper**](https://arxiv.org/abs/2312.07104) |

  SGLang is a structured generation language designed for large language models (LLMs).
  It makes your interaction with LLMs faster and more controllable by co-designing the frontend language and the runtime system.
@@ -42,7 +43,7 @@ The example below shows how to use sglang to answer a mulit-turn question.
  ### Using OpenAI Models
  Set the OpenAI API Key
  ```
- export OPENAI_API_KEY=sk-xxxxxx
+ export OPENAI_API_KEY=sk-******
  ```

  Then, answer a multi-turn question.
@@ -100,6 +101,7 @@ for m in state.messages():

  ### More Examples

+ Anthropic and VertexAI (Gemini) models are also supported.
  You can find more examples at [examples/quick_start](examples/quick_start).

  ## Frontend: Structured Generation Langauge (SGLang)
@@ -115,13 +117,14 @@ You can then invoke the function with `run` or `run_batch`.
  The system will manage the state, chat template, and parallelism for you.

  ### Control Flow
+ You can use any Python code within the function body, including control flow, nested function calls, and external libraries.
+
  ```python
  @sgl.function
  def control_flow(s, question):
  s += "To answer this question: " + question + ", "
  s += "I need to use a " + sgl.gen("tool", choices=["calculator", "web browser"]) + ". "

- # You can use if or nested function calls
  if s["tool"] == "calculator":
  s += "The math expression is" + sgl.gen("expression")
  elif s["tool"] == "web browser":
@@ -129,6 +132,9 @@ def control_flow(s, question):
  ```

  ### Parallelism
+ Use `fork` to launch parallel prompts.
+ Because `sgl.gen` is non-blocking, the for loop below issues two generation calls in parallel.
+
  ```python
  @sgl.function
  def tip_suggestion(s):
@@ -137,7 +143,7 @@ def tip_suggestion(s):
  "1. Balanced Diet. 2. Regular Exercise.\n\n"
  )

- forks = s.fork(2) # Launch parallel prompts
+ forks = s.fork(2)
  for i, f in enumerate(forks):
  f += f"Now, expand tip {i+1} into a paragraph:\n"
  f += sgl.gen(f"detailed_tip", max_tokens=256, stop="\n\n")
@@ -148,6 +154,8 @@ def tip_suggestion(s):
  ```

  ### Multi Modality
+ Use `sgl.image` to pass an image as input.
+
  ```python
  @sgl.function
  def image_qa(s, image_file, question):
@@ -156,6 +164,8 @@ def image_qa(s, image_file, question):
  ```

  ### Constrained Decoding
+ Use `regex=` to specify a regular expression as a decoding constraint.
+
  ```python
  @sgl.function
  def regular_expression_gen(s):
@@ -168,6 +178,8 @@ def regular_expression_gen(s):
  ```

  ### Batching
+ Use `run_batch` to run a batch of requests with continuous batching.
+
  ```python
  @sgl.function
  def text_qa(s, question):
@@ -180,10 +192,13 @@ states = text_qa.run_batch(
  {"question": "What is the capital of France?"},
  {"question": "What is the capital of Japan?"},
  ],
+ progress_bar=True
  )
  ```

  ### Streaming
+ Add `stream=True` to enable streaming.
+
  ```python
  @sgl.function
  def text_qa(s, question):
@@ -192,7 +207,9 @@ def text_qa(s, question):

  states = text_qa.run(
  question="What is the capital of France?",
- temperature=0.1)
+ temperature=0.1,
+ stream=True
+ )

  for out in state.text_iter():
  print(out, end="", flush=True)
@@ -236,6 +253,7 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
  - Mixtral
  - LLaVA
  - `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --port 30000`
+ - AWQ quantization

  ## Benchmark And Performance

@@ -248,10 +266,10 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
  Learn more [here](docs/benchmark_results.md).

  ## Roadmap
- - [ ] Function call
- - [ ] Quantization
+ - [ ] Function call APIs
  - [ ] S-LoRA
- - [ ] More models
+ - [ ] Support more models
+ - [ ] Support more hardware backends

  ## Citation And Acknowledgment
  ```
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

  [project]
  name = "sglang"
- version = "0.1.4"
+ version = "0.1.5"
  description = "A structured generation langauge for LLMs."
  readme = "README.md"
  requires-python = ">=3.8"
@@ -1,4 +1,4 @@
- __version__ = "0.1.4"
+ __version__ = "0.1.5"

  from sglang.api import *
  from sglang.global_config import global_config
@@ -6,6 +6,7 @@ from sglang.backend.anthropic import Anthropic
  from sglang.backend.base_backend import BaseBackend
  from sglang.backend.openai import OpenAI
  from sglang.backend.runtime_endpoint import RuntimeEndpoint
+ from sglang.backend.vertexai import VertexAI
  from sglang.global_config import global_config
  from sglang.lang.ir import (
  SglExpr,
@@ -0,0 +1,147 @@
+ import os
+ import warnings
+ from typing import List, Optional, Union
+
+ import numpy as np
+ from sglang.backend.base_backend import BaseBackend
+ from sglang.lang.chat_template import get_chat_template
+ from sglang.lang.interpreter import StreamExecutor
+ from sglang.lang.ir import SglSamplingParams
+
+ try:
+ import vertexai
+ from vertexai.preview.generative_models import (
+ GenerationConfig,
+ GenerativeModel,
+ Image,
+ )
+ except ImportError as e:
+ GenerativeModel = e
+
+
+ class VertexAI(BaseBackend):
+ def __init__(self, model_name):
+ super().__init__()
+
+ if isinstance(GenerativeModel, Exception):
+ raise GenerativeModel
+
+ project_id = os.environ["GCP_PROJECT_ID"]
+ location = os.environ.get("GCP_LOCATION")
+ vertexai.init(project=project_id, location=location)
+
+ self.model_name = model_name
+ self.chat_template = get_chat_template("default")
+
+ def get_chat_template(self):
+ return self.chat_template
+
+ def generate(
+ self,
+ s: StreamExecutor,
+ sampling_params: SglSamplingParams,
+ ):
+ if s.messages_:
+ prompt = self.messages_to_vertexai_input(s.messages_)
+ else:
+ # single-turn
+ prompt = (
+ self.text_to_vertexai_input(s.text_, s.cur_images)
+ if s.cur_images
+ else s.text_
+ )
+ ret = GenerativeModel(self.model_name).generate_content(
+ prompt,
+ generation_config=GenerationConfig(**sampling_params.to_vertexai_kwargs()),
+ )
+
+ comp = ret.text
+
+ return comp, {}
+
+ def generate_stream(
+ self,
+ s: StreamExecutor,
+ sampling_params: SglSamplingParams,
+ ):
+ if s.messages_:
+ prompt = self.messages_to_vertexai_input(s.messages_)
+ else:
+ # single-turn
+ prompt = (
+ self.text_to_vertexai_input(s.text_, s.cur_images)
+ if s.cur_images
+ else s.text_
+ )
+ generator = GenerativeModel(self.model_name).generate_content(
+ prompt,
+ stream=True,
+ generation_config=GenerationConfig(**sampling_params.to_vertexai_kwargs()),
+ )
+ for ret in generator:
+ yield ret.text, {}
+
+ def text_to_vertexai_input(self, text, images):
+ input = []
+ # split with image token
+ text_segs = text.split(self.chat_template.image_token)
+ for image_path, image_base64_data in images:
+ text_seg = text_segs.pop(0)
+ if text_seg != "":
+ input.append(text_seg)
+ input.append(Image.from_bytes(image_base64_data))
+ text_seg = text_segs.pop(0)
+ if text_seg != "":
+ input.append(text_seg)
+ return input
+
+ def messages_to_vertexai_input(self, messages):
+ vertexai_message = []
+ # from openai message format to vertexai message format
+ for msg in messages:
+ if isinstance(msg["content"], str):
+ text = msg["content"]
+ else:
+ text = msg["content"][0]["text"]
+
+ if msg["role"] == "system":
+ warnings.warn("Warning: system prompt is not supported in VertexAI.")
+ vertexai_message.append(
+ {
+ "role": "user",
+ "parts": [{"text": "System prompt: " + text}],
+ }
+ )
+ vertexai_message.append(
+ {
+ "role": "model",
+ "parts": [{"text": "Understood."}],
+ }
+ )
+ continue
+ if msg["role"] == "user":
+ vertexai_msg = {
+ "role": "user",
+ "parts": [{"text": text}],
+ }
+ elif msg["role"] == "assistant":
+ vertexai_msg = {
+ "role": "model",
+ "parts": [{"text": text}],
+ }
+
+ # images
+ if isinstance(msg["content"], list) and len(msg["content"]) > 1:
+ for image in msg["content"][1:]:
+ assert image["type"] == "image_url"
+ vertexai_msg["parts"].append(
+ {
+ "inline_data": {
+ "data": image["image_url"]["url"].split(",")[1],
+ "mime_type": "image/jpeg",
+ }
+ }
+ )
+
+ vertexai_message.append(vertexai_msg)
+ return vertexai_message
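The new `sglang/backend/vertexai.py` backend above is selected like any other sglang backend. A minimal usage sketch, assuming the usual `sgl.set_default_backend` pattern; the model name `gemini-pro` and the function below are illustrative and not part of this diff:

```python
# Illustrative sketch: run an sglang program against the new VertexAI backend.
# Requires GCP_PROJECT_ID (and optionally GCP_LOCATION) in the environment,
# as read by VertexAI.__init__ above.
import sglang as sgl

@sgl.function
def multi_turn_qa(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))

sgl.set_default_backend(sgl.VertexAI("gemini-pro"))

state = multi_turn_qa.run(question="What is the capital of France?")
print(state["answer"])
```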
@@ -365,11 +365,10 @@ class StreamExecutor:
  for comp, meta_info in generator:
  self.text_ += comp
  self.variables[name] += comp
+ self.meta_info[name] = meta_info
  self.stream_var_event[name].set()
  self.stream_text_event.set()

- self.meta_info[name] = meta_info
-
  self.variable_event[name].set()
  self.stream_var_event[name].set()

@@ -428,6 +427,7 @@ class StreamExecutor:
  self.messages_.append(last_msg)
  self.cur_images = []
  else:
+ # OpenAI chat API format
  self.messages_.append({"role": expr.role, "content": new_text})

  self.cur_role = None
@@ -582,7 +582,7 @@ class ProgramState:
  else:
  yield self.get_var(name)

- async def text_async_iter(self, var_name=None):
+ async def text_async_iter(self, var_name=None, return_meta_data=False):
  loop = asyncio.get_running_loop()

  if self.stream_executor.stream:
@@ -606,7 +606,10 @@ class ProgramState:
  out = str(self.stream_executor.variables[var_name][prev:])
  prev += len(out)
  if out:
- yield out
+ if return_meta_data:
+ yield out, self.stream_executor.meta_info[var_name]
+ else:
+ yield out
  if self.stream_executor.variable_event[var_name].is_set():
  break
  else:
@@ -632,11 +635,7 @@ class ProgramState:
  self.stream_executor.end()

  def __repr__(self) -> str:
- msgs = self.messages()
- ret = ""
- for msg in msgs:
- ret += msg["role"] + ":\n" + msg["content"] + "\n"
- return ret
+ return f"ProgramState({self.text()})"


  class ProgramStateGroup:
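The `return_meta_data` flag added to `text_async_iter` above lets async consumers receive the per-chunk `meta_info` that the stream executor now records while streaming. A hedged sketch of how it might be consumed; the program, the `"answer"` variable, and the backend setup are illustrative:

```python
# Hedged sketch of the new return_meta_data flag on text_async_iter.
import asyncio
import sglang as sgl

@sgl.function
def text_qa(s, question):
    s += "Q: " + question + "\n"
    s += "A:" + sgl.gen("answer", stop="\n")

async def main():
    state = text_qa.run(question="What is the capital of France?", stream=True)
    # Yields (text_chunk, meta_info) pairs instead of bare text chunks.
    async for chunk, meta in state.text_async_iter("answer", return_meta_data=True):
        print(chunk, end="", flush=True)

asyncio.run(main())
```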
@@ -2,6 +2,7 @@

  import dataclasses
  import inspect
+ import warnings
  from typing import List, Optional, Union

  from sglang.global_config import global_config
@@ -40,6 +41,8 @@ class SglSamplingParams:

  def to_openai_kwargs(self):
  # OpenAI does not support top_k, so we drop it here
+ if self.regex is not None:
+ warnings.warn("Regular expression is not supported in the OpenAI backend.")
  return {
  "max_tokens": self.max_new_tokens,
  "stop": self.stop or None,
@@ -49,8 +52,26 @@ class SglSamplingParams:
  "presence_penalty": self.presence_penalty,
  }

+ def to_vertexai_kwargs(self):
+ if self.regex is not None:
+ warnings.warn(
+ "Regular expression is not supported in the VertexAI backend."
+ )
+ return {
+ "candidate_count": 1,
+ "max_output_tokens": self.max_new_tokens,
+ "stop_sequences": self.stop,
+ "temperature": self.temperature,
+ "top_p": self.top_p,
+ "top_k": self.top_k if self.top_k > 0 else None,
+ }
+
  def to_anthropic_kwargs(self):
  # Anthropic does not support frequency_penalty or presence_penalty, so we drop it here
+ if self.regex is not None:
+ warnings.warn(
+ "Regular expression is not supported in the Anthropic backend."
+ )
  return {
  "max_tokens_to_sample": self.max_new_tokens,
  "stop_sequences": self.stop,
@@ -5,7 +5,6 @@ import triton
  import triton.language as tl
  from sglang.srt.utils import wrap_kernel_launcher

-
  CUDA_CAPABILITY = torch.cuda.get_device_capability()


@@ -4,7 +4,6 @@ import triton.language as tl
  from sglang.srt.layers.context_flashattention_nopad import context_attention_fwd
  from sglang.srt.utils import wrap_kernel_launcher

-
  CUDA_CAPABILITY = torch.cuda.get_device_capability()


@@ -28,7 +28,7 @@ class RouterManager:
  self.model_client = model_client
  self.recv_reqs = []

- # Init Some Configs
+ # Init some configs
  self.extend_dependency_time = GLOBAL_BACKEND_CONFIG.extend_dependency_time

  async def loop_for_forward(self):
@@ -46,7 +46,7 @@ class RouterManager:
  if has_finished:
  await asyncio.sleep(self.extend_dependency_time)

- await asyncio.sleep(0.001)
+ await asyncio.sleep(0.0006)

  async def loop_for_recv_requests(self):
  while True:
@@ -2,10 +2,10 @@ import asyncio
  import logging
  import multiprocessing
  import time
+ import warnings
  from concurrent.futures import ThreadPoolExecutor
  from enum import Enum, auto
  from typing import Dict, List, Optional, Tuple, Union
- import warnings

  import numpy as np
  import rpyc
@@ -45,6 +45,7 @@ class ModelRpcServer(rpyc.Service):
  self.tp_rank = tp_rank
  self.tp_size = server_args.tp_size
  self.schedule_heuristic = server_args.schedule_heuristic
+ self.schedule_conservativeness = server_args.schedule_conservativeness

  # Init model and tokenizer
  self.model_config = ModelConfig(
@@ -108,7 +109,7 @@ class ModelRpcServer(rpyc.Service):
  self.running_batch: Batch = None
  self.out_pyobjs = []
  self.decode_forward_ct = 0
- self.stream_interval = 2
+ self.stream_interval = server_args.stream_interval

  # Init the FSM cache for constrained generation
  self.regex_fsm_cache = FSMCache(self.tokenizer)
@@ -248,7 +249,9 @@ class ModelRpcServer(rpyc.Service):
  available_size = (
  self.token_to_kv_pool.available_size() + self.tree_cache.evictable_size()
  )
- new_ratio = self.scheduler.new_token_estimation_ratio()
+ new_ratio = (
+ self.scheduler.new_token_estimation_ratio() * self.schedule_conservativeness
+ )
  if self.running_batch:
  available_size -= sum(
  [
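The two new knobs read by `ModelRpcServer` above, `schedule_conservativeness` and `stream_interval`, come from `ServerArgs` (the `sglang/srt/server_args.py` diff of +22 -4 is not expanded here). A hedged sketch of setting them programmatically; the exact defaults and CLI flag names live in `server_args.py`:

```python
# Hedged sketch: the scheduling knobs are assumed to be plain ServerArgs fields.
from sglang.srt.server_args import ServerArgs

args = ServerArgs(
    model_path="meta-llama/Llama-2-7b-chat-hf",
    stream_interval=8,              # decode steps between streamed outputs
    schedule_conservativeness=1.0,  # scales the new-token estimation ratio; >1 admits fewer new requests
)
```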
@@ -278,7 +278,7 @@ class ModelRunner:
  load_format=self.load_format,
  revision=None,
  )
- self.model = model
+ self.model = model.eval()

  def profile_max_num_token(self, total_gpu_memory):
  available_gpu_memory = get_available_gpu_memory(
@@ -355,7 +355,7 @@ class MixtralForCausalLM(nn.Module):
  ):
  if "rotary_emb.inv_freq" in name:
  continue
- for (param_name, weight_name, shard_id) in stacked_params_mapping:
+ for param_name, weight_name, shard_id in stacked_params_mapping:
  if weight_name not in name:
  continue
  name = name.replace(weight_name, param_name)