sglang 0.1.1.tar.gz → 0.1.3.tar.gz
This diff shows the content differences between two publicly released versions of the package, as published to a supported registry. It is provided for informational purposes only and reflects the packages exactly as they appear in their public registries.
- {sglang-0.1.1 → sglang-0.1.3}/PKG-INFO +86 -4
- {sglang-0.1.1 → sglang-0.1.3}/README.md +85 -3
- {sglang-0.1.1 → sglang-0.1.3}/pyproject.toml +1 -1
- {sglang-0.1.1 → sglang-0.1.3}/sglang/__init__.py +1 -1
- {sglang-0.1.1 → sglang-0.1.3}/sglang/api.py +13 -1
- {sglang-0.1.1 → sglang-0.1.3}/sglang/backend/anthropic.py +3 -3
- {sglang-0.1.1 → sglang-0.1.3}/sglang/backend/base_backend.py +3 -3
- {sglang-0.1.1 → sglang-0.1.3}/sglang/backend/openai.py +3 -3
- {sglang-0.1.1 → sglang-0.1.3}/sglang/backend/runtime_endpoint.py +3 -3
- {sglang-0.1.1 → sglang-0.1.3}/sglang/backend/tgi.py +3 -3
- {sglang-0.1.1 → sglang-0.1.3}/sglang/lang/compiler.py +3 -7
- {sglang-0.1.1 → sglang-0.1.3}/sglang/lang/interpreter.py +5 -7
- {sglang-0.1.1 → sglang-0.1.3}/sglang/lang/ir.py +13 -9
- {sglang-0.1.1 → sglang-0.1.3}/sglang/lang/tracer.py +2 -1
- sglang-0.1.3/sglang/srt/backend_config.py +12 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/managers/router/manager.py +10 -2
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/managers/router/model_rpc.py +15 -2
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/managers/router/radix_cache.py +2 -2
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/managers/router/scheduler.py +1 -1
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/models/mixtral.py +1 -1
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/server_args.py +8 -2
- {sglang-0.1.1 → sglang-0.1.3}/sglang/test/test_programs.py +1 -1
- {sglang-0.1.1 → sglang-0.1.3}/sglang/test/test_utils.py +22 -2
- {sglang-0.1.1 → sglang-0.1.3}/sglang/utils.py +1 -1
- {sglang-0.1.1 → sglang-0.1.3}/sglang.egg-info/PKG-INFO +86 -4
- {sglang-0.1.1 → sglang-0.1.3}/sglang.egg-info/SOURCES.txt +1 -0
- {sglang-0.1.1 → sglang-0.1.3}/LICENSE +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/setup.cfg +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/backend/__init__.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/backend/huggingface.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/flush_cache.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/global_config.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/lang/__init__.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/lang/chat_template.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/launch_server.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/constrained/fsm.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/constrained/fsm_cache.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/constrained/regex.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/constrained/tokenizer.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/hf_transformers_utils.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/layers/context_flashattention_nopad.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/layers/extend_attention.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/layers/get_selected_logprob.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/layers/logits_processor.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/layers/radix_attention.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/layers/token_attention.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/managers/detokenizer_manager.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/managers/io_struct.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/managers/openai_protocol.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/managers/router/infer_batch.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/managers/router/model_runner.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/managers/tokenizer_manager.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/memory_pool.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/model_config.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/models/llama2.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/models/llava.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/sampling_params.py +2 -2
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/server.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang/srt/utils.py +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang.egg-info/dependency_links.txt +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang.egg-info/requires.txt +0 -0
- {sglang-0.1.1 → sglang-0.1.3}/sglang.egg-info/top_level.txt +0 -0
PKG-INFO:
```diff
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: sglang
-Version: 0.1.1
+Version: 0.1.3
 Summary: A structured generation langauge for LLMs.
 License: Apache License
          Version 2.0, January 2004
```
````diff
@@ -329,25 +329,99 @@ You can find more examples at [examples/quick_start](examples/quick_start).
 
 ## Frontend: Structured Generation Langauge (SGLang)
 
+To begin with, import sglang.
+```python
+import sglang as sgl
+```
+
+`sglang` provides some simple primitives such as `gen`, `select`, `fork`.
+You can implement your prompt flow in a function decorated by `sgl.function`.
+You can then invoke the function with `run` or `run_batch`.
+The system will manage the state, chat template, and parallelism for you.
+
 ### Control Flow
+```python
+@sgl.function
+def control_flow(s, question):
+    s += "To answer this question: " + question + ", "
+    s += "I need to use a " + sgl.gen("tool", choices=["calculator", "web browser"]) + ". "
+
+    # You can use if or nested function calls
+    if s["tool"] == "calculator":
+        s += "The math expression is" + sgl.gen("expression")
+    elif s["tool"] == "web browser":
+        s += "The website url is" + sgl.gen("url")
+```
 
 ### Parallelism
+```python
+@sgl.function
+def tip_suggestion(s):
+    s += (
+        "Here are two tips for staying healthy: "
+        "1. Balanced Diet. 2. Regular Exercise.\n\n"
+    )
+
+    forks = s.fork(2)  # Launch parallel prompts
+    for i, f in enumerate(forks):
+        f += f"Now, expand tip {i+1} into a paragraph:\n"
+        f += sgl.gen(f"detailed_tip", max_tokens=256, stop="\n\n")
+
+    s += "Tip 1:" + forks[0]["detailed_tip"] + "\n"
+    s += "Tip 2:" + forks[1]["detailed_tip"] + "\n"
+    s += "In summary" + sgl.gen("summary")
+```
 
 ### Multi Modality
 ```python
 @sgl.function
 def image_qa(s, image_file, question):
     s += sgl.user(sgl.image(image_file) + question)
-    s += sgl.assistant(sgl.gen("
+    s += sgl.assistant(sgl.gen("answer", max_tokens=256))
 ```
 
-### Constrained
+### Constrained Decoding
+```python
+@function
+def regular_expression_gen(s):
+    s += "Q: What is the IP address of the Google DNS servers?\n"
+    s += "A: " + gen(
+        "answer",
+        temperature=0,
+        regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?).){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
+    )
+```
 
 ### Batching
+```python
+@sgl.function
+def text_qa(s, question):
+    s += "Q: " + question + "\n"
+    s += "A:" + sgl.gen("answer", stop="\n")
+
+states = text_qa.run_batch(
+    [
+        {"question": "What is the capital of the United Kingdom?"},
+        {"question": "What is the capital of France?"},
+        {"question": "What is the capital of Japan?"},
+    ],
+)
+```
 
 ### Streaming
+```python
+@sgl.function
+def text_qa(s, question):
+    s += "Q: " + question + "\n"
+    s += "A:" + sgl.gen("answer", stop="\n")
+
+state = text_qa.run(
+    question="What is the capital of France?",
+    temperature=0.1)
 
-
+for out in state.text_iter():
+    print(out, end="", flush=True)
+```
 
 ## Backend: SGLang Runtime (SRT)
 The SGLang Runtime (SRT) is designed to work best with the SGLang frontend.
````
```diff
@@ -386,6 +460,14 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
 
 ## Benchmark And Performance
 
+- Llama-7B on NVIDIA A10G, FP16, Tensor Parallelism=1
+
+
+- Mixtral-8x7B on NVIDIA A10G, FP16, Tensor Parallelism=8
+
+
+Learn more [here]().
+
 ## Roadmap
 - [ ] Function call
 - [ ] Quantization
```
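The new frontend overview above names `select` among the primitives, but none of the added examples use it directly (the control-flow example reaches it through `gen(..., choices=...)`). A minimal sketch of direct `select` usage, assuming a default backend has already been set:

```python
import sglang as sgl

@sgl.function
def sentiment(s, review):
    s += "Review: " + review + "\n"
    # select constrains the completion to exactly one of the listed choices
    s += "Sentiment: " + sgl.select("label", choices=["positive", "negative"])
```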
README.md — identical content changes to the PKG-INFO long description above, applied to the standalone README at hunks `@@ -94,25 +94,99 @@` and `@@ -151,6 +225,14 @@` (the new frontend examples and the new benchmark section).
sglang/api.py:
```diff
@@ -17,13 +17,19 @@ from sglang.lang.ir import (
     SglRoleEnd,
     SglSelect,
 )
-from sglang.srt.server import Runtime
 
 
 def function(func: Callable):
     return SglFunction(func)
 
 
+def Runtime(*args, **kwargs):
+    # Avoid importing unnecessary dependency
+    from sglang.srt.server import Runtime
+
+    return Runtime(*args, **kwargs)
+
+
 def set_default_backend(backend: BaseBackend):
     global_config.default_backend = backend
@@ -37,6 +43,7 @@ def gen(
     top_k: Optional[int] = None,
     frequency_penalty: Optional[float] = None,
    presence_penalty: Optional[float] = None,
+    ignore_eos: Optional[bool] = None,
     dtype: Optional[type] = None,
     choices: Optional[List[str]] = None,
     regex: Optional[str] = None,
@@ -60,6 +67,7 @@ def gen(
         top_k,
         frequency_penalty,
         presence_penalty,
+        ignore_eos,
         dtype,
         regex,
     )
@@ -74,6 +82,7 @@ def gen_int(
     top_k: Optional[int] = None,
     frequency_penalty: Optional[float] = None,
     presence_penalty: Optional[float] = None,
+    ignore_eos: Optional[bool] = None,
 ):
     return SglGen(
         name,
@@ -84,6 +93,7 @@ def gen_int(
         top_k,
         frequency_penalty,
         presence_penalty,
+        ignore_eos,
         int,
         None,
     )
@@ -98,6 +108,7 @@ def gen_string(
     top_k: Optional[int] = None,
     frequency_penalty: Optional[float] = None,
     presence_penalty: Optional[float] = None,
+    ignore_eos: Optional[bool] = None,
 ):
     return SglGen(
         name,
@@ -108,6 +119,7 @@ def gen_string(
         top_k,
         frequency_penalty,
         presence_penalty,
+        ignore_eos,
         str,
         None,
     )
```
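The two API changes above are independent: `Runtime` becomes a thin wrapper so that `import sglang` no longer drags in the heavy `sglang.srt.server` dependencies, and `ignore_eos` is threaded from `gen`/`gen_int`/`gen_string` down into the sampling parameters. A hedged usage sketch (the model path is illustrative):

```python
import sglang as sgl

# The srt server stack is only imported here, at the Runtime call itself.
runtime = sgl.Runtime(model_path="meta-llama/Llama-2-7b-chat-hf")
sgl.set_default_backend(runtime)

@sgl.function
def story(s):
    # ignore_eos=True keeps decoding past the EOS token, so the full
    # token budget is always generated (useful for benchmarking).
    s += "Once upon a time," + sgl.gen("tale", max_tokens=64, ignore_eos=True)
```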
sglang/backend/anthropic.py:
```diff
@@ -4,7 +4,7 @@ import numpy as np
 from sglang.backend.base_backend import BaseBackend
 from sglang.lang.chat_template import get_chat_template
 from sglang.lang.interpreter import StreamExecutor
-from sglang.lang.ir import SamplingParams
+from sglang.lang.ir import SglSamplingParams
 
 try:
     import anthropic
@@ -28,7 +28,7 @@ class Anthropic(BaseBackend):
     def generate(
         self,
         s: StreamExecutor,
-        sampling_params: SamplingParams,
+        sampling_params: SglSamplingParams,
     ):
         prompt = s.text_
         ret = anthropic.Anthropic().completions.create(
@@ -43,7 +43,7 @@ class Anthropic(BaseBackend):
     def generate_stream(
         self,
         s: StreamExecutor,
-        sampling_params: SamplingParams,
+        sampling_params: SglSamplingParams,
     ):
         prompt = s.text_
         generator = anthropic.Anthropic().completions.create(
```
sglang/backend/base_backend.py:
```diff
@@ -2,7 +2,7 @@ from typing import Callable, List, Optional, Union
 
 from sglang.lang.chat_template import get_chat_template
 from sglang.lang.interpreter import StreamExecutor
-from sglang.lang.ir import SamplingParams
+from sglang.lang.ir import SglSamplingParams
 
 
 class BaseBackend:
@@ -48,14 +48,14 @@ class BaseBackend:
     def generate(
         self,
         s: StreamExecutor,
-        sampling_params: SamplingParams,
+        sampling_params: SglSamplingParams,
     ):
         raise NotImplementedError()
 
     def generate_stream(
         self,
         s: StreamExecutor,
-        sampling_params: SamplingParams,
+        sampling_params: SglSamplingParams,
     ):
         raise NotImplementedError()
```
sglang/backend/openai.py:
```diff
@@ -4,7 +4,7 @@ import numpy as np
 from sglang.backend.base_backend import BaseBackend
 from sglang.lang.chat_template import get_chat_template
 from sglang.lang.interpreter import StreamExecutor
-from sglang.lang.ir import SamplingParams
+from sglang.lang.ir import SglSamplingParams
 
 try:
     import openai
@@ -73,7 +73,7 @@ class OpenAI(BaseBackend):
     def generate(
         self,
         s: StreamExecutor,
-        sampling_params: SamplingParams,
+        sampling_params: SglSamplingParams,
     ):
         if sampling_params.dtype is None:
             if self.is_chat_model:
@@ -122,7 +122,7 @@ class OpenAI(BaseBackend):
     def generate_stream(
         self,
         s: StreamExecutor,
-        sampling_params: SamplingParams,
+        sampling_params: SglSamplingParams,
     ):
         if sampling_params.dtype is None:
             if self.is_chat_model:
```
sglang/backend/runtime_endpoint.py:
```diff
@@ -7,7 +7,7 @@ from sglang.backend.base_backend import BaseBackend
 from sglang.global_config import global_config
 from sglang.lang.chat_template import get_chat_template_by_model_path
 from sglang.lang.interpreter import StreamExecutor
-from sglang.lang.ir import SamplingParams, SglArgument
+from sglang.lang.ir import SglArgument, SglSamplingParams
 from sglang.utils import encode_image_base64, find_printable_text, http_request
 
 
@@ -55,7 +55,7 @@ class RuntimeEndpoint(BaseBackend):
     def generate(
         self,
         s: StreamExecutor,
-        sampling_params: SamplingParams,
+        sampling_params: SglSamplingParams,
     ):
         if sampling_params.dtype is None:
             data = {
@@ -87,7 +87,7 @@ class RuntimeEndpoint(BaseBackend):
     def generate_stream(
         self,
         s: StreamExecutor,
-        sampling_params: SamplingParams,
+        sampling_params: SglSamplingParams,
     ):
         if sampling_params.dtype is None:
             data = {
```
sglang/backend/tgi.py:
```diff
@@ -7,7 +7,7 @@ from typing import List, Optional, Union
 from sglang.backend.base_backend import BaseBackend
 from sglang.lang.chat_template import get_chat_template_by_model_path
 from sglang.lang.interpreter import StreamExecutor
-from sglang.lang.ir import SamplingParams
+from sglang.lang.ir import SglSamplingParams
 from sglang.utils import http_request
 
 
@@ -138,7 +138,7 @@ class TGI(BaseBackend):
         self,
         s: StreamExecutor,
         choices: List[str],
-        sampling_params: SamplingParams,
+        sampling_params: SglSamplingParams,
     ):
         decision = self.retry_for_expected(
             s.text_,
@@ -152,7 +152,7 @@ class TGI(BaseBackend):
         s: StreamExecutor,
         max_tokens: int,
         stop: Union[str, List[str]],
-        sampling_params: SamplingParams,
+        sampling_params: SglSamplingParams,
         dtype: Optional[str] = None,
     ):
         if dtype is None:
```
sglang/lang/compiler.py:
```diff
@@ -6,10 +6,10 @@ from typing import List, Union
 from sglang.global_config import global_config
 from sglang.lang.interpreter import ProgramState, StreamExecutor, pin_program
 from sglang.lang.ir import (
-    SamplingParams,
     SglArgument,
     SglConstantText,
     SglExpr,
+    SglSamplingParams,
     SglVariable,
 )
 
@@ -137,10 +137,9 @@ class CompiledFunction:
     ):
         backend = backend or global_config.default_backend
 
-        kwargs = {k: SglArgument(k, v) for k, v in kwargs.items()}
         kwargs.update(self.function.bind_arguments)
 
-        default_sampling_para = SamplingParams(
+        default_sampling_para = SglSamplingParams(
             max_new_tokens=max_new_tokens,
             stop=stop,
             temperature=temperature,
@@ -173,7 +172,7 @@ class CompiledFunction:
 
         backend = backend or global_config.default_backend
 
-        default_sampling_para = SamplingParams(
+        default_sampling_para = SglSamplingParams(
             max_new_tokens=max_new_tokens,
             stop=stop,
             temperature=temperature,
@@ -182,9 +181,6 @@ class CompiledFunction:
             frequency_penalty=frequency_penalty,
             presence_penalty=presence_penalty,
         )
-        batch_kwargs = [
-            {k: SglArgument(k, v) for k, v in kwargs.items()} for kwargs in batch_kwargs
-        ]
 
         # Extract prefix by tracing and cache it
         if len(batch_kwargs) > 1:
```
sglang/lang/interpreter.py:
```diff
@@ -12,7 +12,6 @@ from typing import Any, Callable, Dict, List, Optional, Union
 import tqdm
 from sglang.global_config import global_config
 from sglang.lang.ir import (
-    SglArgument,
     SglCommitLazy,
     SglConcateAndAppend,
     SglConstantText,
@@ -89,7 +88,7 @@ def run_program_batch(
         for arguments in batch_arguments:
             rets.append(
                 run_program(
-                    program, backend, (), arguments, default_sampling_para, False,
+                    program, backend, (), arguments, default_sampling_para, False, True
                 )
             )
     else:
@@ -108,7 +107,7 @@ def run_program_batch(
                     arguments,
                     default_sampling_para,
                     False,
-
+                    True,
                 )
             )
         if progress_bar:
@@ -292,7 +291,7 @@ class StreamExecutor:
 
         assert isinstance(other, SglExpr), f"{other}"
 
-        if isinstance(other,
+        if isinstance(other, SglConstantText):
             self._execute_fill(other.value)
         elif isinstance(other, SglGen):
             self._execute_gen(other)
@@ -332,8 +331,6 @@ class StreamExecutor:
 
     def _execute_image(self, expr: SglImage):
         path = expr.path
-        if isinstance(path, SglArgument):
-            path = path.value
 
         base64_data = encode_image_base64(path)
 
@@ -419,7 +416,7 @@ class StreamExecutor:
             "role": expr.role,
             "content": [{"type": "text", "text": new_text}],
         }
-        for
+        for image_path, image_base64_data in self.cur_images:
             last_msg["content"].append(
                 {
                     "type": "image_url",
@@ -480,6 +477,7 @@ class StreamExecutor:
             "top_k",
             "frequency_penalty",
             "presence_penalty",
+            "ignore_eos",
             "dtype",
             "regex",
         ]:
```
sglang/lang/ir.py:
```diff
@@ -13,7 +13,7 @@ REGEX_STRING = r"\"[\w\d\s]*\""  # bugs with regex r"\".*\"" in interegular pkg
 
 
 @dataclasses.dataclass
-class SamplingParams:
+class SglSamplingParams:
     max_new_tokens: int = 16
     stop: Union[str, List[str]] = ()
     temperature: float = 1.0
@@ -21,13 +21,14 @@ class SamplingParams:
     top_k: int = -1  # -1 means disable
     frequency_penalty: float = 0.0
     presence_penalty: float = 0.0
+    ignore_eos: bool = False
 
     # for constrained generation, not included in to_xxx_kwargs
     dtype: Optional[str] = None
     regex: Optional[str] = None
 
     def clone(self):
-        return SamplingParams(
+        return SglSamplingParams(
             self.max_new_tokens,
             self.stop,
             self.temperature,
@@ -67,6 +68,7 @@ class SamplingParams:
             "top_k": self.top_k,
             "frequency_penalty": self.frequency_penalty,
             "presence_penalty": self.presence_penalty,
+            "ignore_eos": self.ignore_eos,
             "regex": self.regex,
         }
 
@@ -98,13 +100,14 @@ class SglFunction:
         top_k: int = -1,
         frequency_penalty: float = 0.0,
         presence_penalty: float = 0.0,
+        ignore_eos: bool = False,
         stream: bool = False,
         backend=None,
         **kwargs,
     ):
         from sglang.lang.interpreter import run_program
 
-        default_sampling_para = SamplingParams(
+        default_sampling_para = SglSamplingParams(
             max_new_tokens=max_new_tokens,
             stop=stop,
             temperature=temperature,
@@ -112,9 +115,9 @@ class SglFunction:
             top_k=top_k,
             frequency_penalty=frequency_penalty,
             presence_penalty=presence_penalty,
+            ignore_eos=ignore_eos,
         )
         backend = backend or global_config.default_backend
-        kwargs = {k: SglArgument(k, v) for k, v in kwargs.items()}
         return run_program(self, backend, args, kwargs, default_sampling_para, stream)
 
     def run_batch(
@@ -128,6 +131,7 @@ class SglFunction:
         top_k: int = -1,
         frequency_penalty: float = 0.0,
         presence_penalty: float = 0.0,
+        ignore_eos: bool = False,
         backend=None,
         num_threads: Union[str, int] = "auto",
         progress_bar: bool = False,
@@ -139,7 +143,7 @@ class SglFunction:
             return []
         assert isinstance(batch_kwargs[0], dict)
 
-        default_sampling_para = SamplingParams(
+        default_sampling_para = SglSamplingParams(
             max_new_tokens=max_new_tokens,
             stop=stop,
             temperature=temperature,
@@ -147,11 +151,9 @@ class SglFunction:
             top_k=top_k,
             frequency_penalty=frequency_penalty,
             presence_penalty=presence_penalty,
+            ignore_eos=ignore_eos,
         )
         backend = backend or global_config.default_backend
-        batch_kwargs = [
-            {k: SglArgument(k, v) for k, v in kwargs.items()} for kwargs in batch_kwargs
-        ]
         return run_program_batch(
             self,
             backend,
@@ -321,12 +323,13 @@ class SglGen(SglExpr):
         top_k,
         frequency_penalty,
         presence_penalty,
+        ignore_eos,
         dtype,
         regex,
     ):
         super().__init__()
         self.name = name
-        self.sampling_params = SamplingParams(
+        self.sampling_params = SglSamplingParams(
             max_new_tokens=max_new_tokens,
             stop=stop,
             temperature=temperature,
@@ -334,6 +337,7 @@ class SglGen(SglExpr):
             top_k=top_k,
             frequency_penalty=frequency_penalty,
             presence_penalty=presence_penalty,
+            ignore_eos=ignore_eos,
             dtype=dtype,
             regex=regex,
         )
```
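With the rename, every frontend call site now builds an `SglSamplingParams`, and `ignore_eos` rides along through `run` and `run_batch` like any other sampling knob. A sketch reusing the README's `text_qa` function, with a backend assumed to be set:

```python
# ignore_eos also works at call time, not just inside gen():
state = text_qa.run(
    question="What is the capital of France?",
    max_new_tokens=32,
    ignore_eos=True,
)
print(state["answer"])
```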
sglang/lang/tracer.py:
```diff
@@ -40,7 +40,8 @@ def extract_prefix_by_tracing(program, backend):
     try:
         with TracingScope(tracer):
             tracer.ret_value = program.func(tracer, **arguments)
-    except StopTracing:
+    except (StopTracing, TypeError):
+        # Some exceptions may not be catched
         pass
 
     # Run and cache prefix
```
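Tracing executes the user function once with placeholder arguments to extract a cacheable prefix, so ordinary Python operations on a placeholder can raise `TypeError` before tracing ends; the widened `except` treats that as "stop tracing here" rather than an error. A hypothetical function that would trigger it:

```python
@sgl.function
def count_chars(s, question):
    # len() on a traced placeholder raises TypeError during tracing,
    # even though it works on the real string argument at run time.
    s += f"The question has {len(question)} characters.\n"
    s += "Q: " + question + "\nA:" + sgl.gen("answer")
```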
sglang/srt/managers/router/manager.py:
```diff
@@ -1,10 +1,10 @@
 import asyncio
 import logging
-from typing import List, Tuple
 
 import uvloop
 import zmq
 import zmq.asyncio
+from sglang.srt.backend_config import GLOBAL_BACKEND_CONFIG
 from sglang.srt.managers.router.model_rpc import ModelRpcClient
 from sglang.srt.server_args import PortArgs, ServerArgs
 from sglang.srt.utils import get_exception_traceback
@@ -28,6 +28,9 @@ class RouterManager:
         self.model_client = model_client
         self.recv_reqs = []
 
+        # Init Some Configs
+        self.extend_dependency_time = GLOBAL_BACKEND_CONFIG.extend_dependency_time
+
     async def loop_for_forward(self):
         while True:
             next_step_input = list(self.recv_reqs)
@@ -37,7 +40,12 @@ class RouterManager:
             for obj in out_pyobjs:
                 self.send_to_detokenizer.send_pyobj(obj)
 
-            #
+            # async sleep for recving the subsequent request, and avoiding cache miss
+            if len(out_pyobjs) != 0:
+                has_finished = any([obj.finished for obj in out_pyobjs])
+                if has_finished:
+                    await asyncio.sleep(self.extend_dependency_time)
+
             await asyncio.sleep(0.001)
 
     async def loop_for_recv_requests(self):
```
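The new sleep gives a request that depends on a just-finished one (for example, the next turn of a chat reusing the same prefix) a brief window to arrive while its prefix is still cached, instead of racing the scheduler. The control flow, as a standalone sketch with an assumed config value:

```python
import asyncio

EXTEND_DEPENDENCY_TIME = 0.03  # assumed; the real value lives in backend_config.py

async def forward_loop(step_fn, send_fn):
    while True:
        out_objs = await step_fn()
        for obj in out_objs:
            send_fn(obj)
        # If anything finished, pause briefly so a dependent follow-up
        # request can land before the next scheduling step evicts its prefix.
        if any(obj.finished for obj in out_objs):
            await asyncio.sleep(EXTEND_DEPENDENCY_TIME)
        await asyncio.sleep(0.001)
```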
sglang/srt/managers/router/model_rpc.py:
```diff
@@ -19,7 +19,6 @@ from sglang.srt.managers.router.model_runner import ModelRunner
 from sglang.srt.managers.router.radix_cache import RadixCache
 from sglang.srt.managers.router.scheduler import Scheduler
 from sglang.srt.model_config import ModelConfig
-from sglang.srt.sampling_params import SamplingParams
 from sglang.srt.server_args import PortArgs, ServerArgs
 from sglang.srt.utils import (
     get_exception_traceback,
@@ -158,6 +157,18 @@ class ModelRpcServer(rpyc.Service):
                 if self.running_batch.is_empty():
                     self.running_batch = None
                     break
+            else:
+                # check the available size
+                available_size = (
+                    self.token_to_kv_pool.available_size()
+                    + self.tree_cache.evictable_size()
+                )
+                if available_size != self.max_total_num_token:
+                    logger.warning(
+                        "Warning: "
+                        f"available_size={available_size}, max_total_num_token={self.max_total_num_token}\n"
+                        "KV cache pool leak detected!"
+                    )
 
         if self.running_batch is not None and self.tp_rank == 0:
             if self.decode_forward_ct >= 20:
@@ -408,7 +419,9 @@ class ModelRpcServer(rpyc.Service):
         token_ids = tuple(req.input_ids + req.output_ids)
         seq_len = len(token_ids) - 1
         indices = self.req_to_token_pool.req_to_token[req_pool_idx, :seq_len]
-        prefix_len = self.tree_cache.insert(token_ids[:seq_len], indices.clone())
+        prefix_len = self.tree_cache.insert(
+            token_ids[:seq_len], indices.clone()
+        )
 
         self.token_to_kv_pool.free(indices[:prefix_len])
         self.req_to_token_pool.free(req_pool_idx)
```
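The new `else` branch checks a conservation invariant: when no batch is running, every KV-cache slot must be either free in the token pool or held by the radix cache (and therefore evictable). The arithmetic, reduced to a sketch:

```python
def detect_kv_leak(token_to_kv_pool, tree_cache, max_total_num_token) -> bool:
    # Free slots + cached-but-evictable slots should cover the whole pool;
    # any shortfall means token slots leaked somewhere in the pipeline.
    available = token_to_kv_pool.available_size() + tree_cache.evictable_size()
    return available != max_total_num_token  # True means a leak was detected
```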
sglang/srt/managers/router/radix_cache.py:
```diff
@@ -116,12 +116,12 @@ class RadixCache:
         for c_key, child in node.children.items():
             prefix_len = match(c_key, key)
             if prefix_len != 0:
-                if prefix_len
+                if prefix_len < len(c_key):
                     new_node = self._split_node(c_key, child, prefix_len)
                     value.append(new_node.value)
                     last_node[0] = new_node
                 else:
-                    value.append(child.value
+                    value.append(child.value)
                     last_node[0] = child
                     self._match_prefix_helper(child, key[prefix_len:], value, last_node)
                 break
```
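The repaired branch distinguishes a partial match of a child's key (split that node at the boundary) from a full match (consume it and recurse deeper). The `match` helper it relies on is just a shared-prefix length; a sketch:

```python
def match(key_a, key_b):
    # Length of the common prefix of two token-id sequences.
    i = 0
    while i < len(key_a) and i < len(key_b) and key_a[i] == key_b[i]:
        i += 1
    return i

# prefix_len < len(c_key)  -> partial match: split c_key at prefix_len
# prefix_len == len(c_key) -> full match: keep child's value, recurse into it
```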
sglang/srt/managers/router/scheduler.py:
```diff
@@ -18,7 +18,7 @@ class Scheduler:
         self.tree_cache = tree_cache
 
     def new_token_estimation_ratio(self):
-        return 0.
+        return 0.5 if self.schedule_heuristic != "fcfs" else 0.6
 
     def get_priority_queue(self, forward_queue):
         if self.schedule_heuristic == "lpm":
```
sglang/srt/models/mixtral.py (whitespace-only change):
```diff
@@ -351,7 +351,7 @@ class MixtralForCausalLM(nn.Module):
 
         params_dict = dict(self.named_parameters())
         for name, loaded_weight in hf_model_weights_iterator(
-                model_name_or_path, cache_dir, load_format, revision
+            model_name_or_path, cache_dir, load_format, revision
         ):
             if "rotary_emb.inv_freq" in name:
                 continue
```
sglang/srt/server_args.py:
```diff
@@ -12,7 +12,7 @@ class ServerArgs:
     load_format: str = "auto"
     tokenizer_mode: str = "auto"
     trust_remote_code: bool = True
-    mem_fraction_static: float =
+    mem_fraction_static: Optional[float] = None
     tp_size: int = 1
     model_mode: List[str] = ()
     schedule_heuristic: str = "lpm"
@@ -24,6 +24,11 @@ class ServerArgs:
     def __post_init__(self):
         if self.tokenizer_path is None:
             self.tokenizer_path = self.model_path
+        if self.mem_fraction_static is None:
+            if self.tp_size > 1:
+                self.mem_fraction_static = 0.8
+            else:
+                self.mem_fraction_static = 0.9
 
     @staticmethod
     def add_cli_args(parser: argparse.ArgumentParser):
@@ -88,7 +93,8 @@ class ServerArgs:
             type=str,
             default=[],
             nargs="+",
-
+            choices=["flashinfer", "no-cache"],
+            help="Model mode: [flashinfer, no-cache]",
         )
         parser.add_argument(
             "--schedule-heuristic",
```
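`mem_fraction_static` is now resolved in `__post_init__`: 0.8 under tensor parallelism (presumably to leave headroom for communication buffers) and 0.9 otherwise, unless the user sets it explicitly. Observable as:

```python
from sglang.srt.server_args import ServerArgs

args = ServerArgs(model_path="meta-llama/Llama-2-7b-chat-hf")
print(args.mem_fraction_static)  # 0.9, since tp_size defaults to 1

args_tp = ServerArgs(model_path="meta-llama/Llama-2-7b-chat-hf", tp_size=8)
print(args_tp.mem_fraction_static)  # 0.8
```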
sglang/test/test_programs.py:
```diff
@@ -174,7 +174,7 @@ def test_tool_use():
     def tool_use(s, lhs, rhs):
         s += "Please perform computations using a calculator. You can use calculate(expression) to get the results.\n"
         s += "For example,\ncalculate(1+2)=3\ncalculate(3*4)=12\n"
-        s += "Question: What is the product of " + lhs + " and " + rhs + "?\n"
+        s += "Question: What is the product of " + str(lhs) + " and " + str(rhs) + "?\n"
         s += (
             "Answer: The answer is calculate("
             + sgl.gen("expression", stop=")")
```
sglang/test/test_utils.py:
```diff
@@ -38,6 +38,26 @@ def call_generate_vllm(prompt, temperature, max_tokens, stop, url, n=1):
     return pred
 
 
+def call_generate_outlines(
+    prompt, temperature, max_tokens, url, stop=[], regex=None, n=1
+):
+    data = {
+        "prompt": prompt,
+        "temperature": temperature,
+        "max_tokens": max_tokens,
+        "stop": stop,
+        "regex": regex,
+        "n": n,
+    }
+    res = requests.post(url, json=data)
+    assert res.status_code == 200
+    if n == 1:
+        pred = res.json()["text"][0][len(prompt) :]
+    else:
+        pred = [x[len(prompt) :] for x in res.json()["text"]]
+    return pred
+
+
 def call_generate_srt_raw(prompt, temperature, max_tokens, stop, url):
     data = {
         "text": prompt,
@@ -79,7 +99,7 @@ def call_select_vllm(context, choices, url):
     }
     res = requests.post(url, json=data)
     assert res.status_code == 200
-    scores.append(res.json()
+    scores.append(res.json().get("prompt_score", 0))
     return np.argmax(scores)
 
 """
@@ -92,7 +112,7 @@ def call_select_vllm(context, choices, url):
 
 
 def add_common_other_args_and_parse(parser):
-    parser.add_argument("--parallel", type=int, default=
+    parser.add_argument("--parallel", type=int, default=64)
     parser.add_argument("--host", type=str, default="http://127.0.0.1")
     parser.add_argument("--port", type=int, default=None)
     parser.add_argument(
```
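`call_generate_outlines` mirrors the existing vLLM helper but also forwards a `regex` constraint and strips the echoed prompt from the response. A hedged call against a local endpoint (URL and port are illustrative):

```python
pred = call_generate_outlines(
    prompt="Q: What is the IP address of the Google DNS servers?\nA: ",
    temperature=0,
    max_tokens=16,
    url="http://127.0.0.1:21000/generate",  # hypothetical local server
    regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
)
print(pred)
```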
sglang.egg-info/PKG-INFO — identical to the PKG-INFO changes at the top of this diff (version bump to 0.1.3 plus the regenerated long description).
sglang/srt/sampling_params.py:
```diff
@@ -7,13 +7,13 @@ _SAMPLING_EPS = 1e-6
 class SamplingParams:
     def __init__(
         self,
+        max_new_tokens: int = 16,
+        stop: Optional[Union[str, List[str]]] = None,
         temperature: float = 1.0,
         top_p: float = 1.0,
         top_k: int = -1,
         frequency_penalty: float = 0.0,
         presence_penalty: float = 0.0,
-        stop: Optional[Union[str, List[str]]] = None,
-        max_new_tokens: int = 16,
         ignore_eos: bool = False,
         skip_special_tokens: bool = True,
         dtype: Optional[str] = None,
```
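Moving `max_new_tokens` and `stop` to the front changes the positional order of `SamplingParams.__init__`, so positional callers would silently misassign values after this release; keyword construction is unaffected:

```python
# Safe across the reorder: every argument is named explicitly.
params = SamplingParams(
    max_new_tokens=128,
    stop="\n",
    temperature=0.7,
    top_p=0.95,
)
```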
All other files listed above with +0 -0 are unchanged between 0.1.1 and 0.1.3.