sglang 0.2.5__tar.gz → 0.2.7__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {sglang-0.2.5/sglang.egg-info → sglang-0.2.7}/PKG-INFO +40 -12
- {sglang-0.2.5 → sglang-0.2.7}/README.md +39 -11
- {sglang-0.2.5 → sglang-0.2.7}/pyproject.toml +1 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/__init__.py +33 -26
- {sglang-0.2.5 → sglang-0.2.7}/sglang/api.py +9 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/bench_latency.py +2 -2
- {sglang-0.2.5 → sglang-0.2.7}/sglang/bench_serving.py +10 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/check_env.py +1 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/lang/backend/litellm.py +1 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/lang/backend/openai.py +1 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/lang/backend/runtime_endpoint.py +4 -4
- {sglang-0.2.5 → sglang-0.2.7}/sglang/lang/interpreter.py +24 -9
- {sglang-0.2.5 → sglang-0.2.7}/sglang/lang/ir.py +1 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/constrained/__init__.py +15 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/constrained/base_cache.py +15 -0
- sglang-0.2.7/sglang/srt/constrained/fsm_cache.py +66 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/constrained/jump_forward.py +15 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/conversation.py +26 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/hf_transformers_utils.py +18 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/layers/context_flashattention_nopad.py +15 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/layers/extend_attention.py +15 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/layers/fused_moe.py +15 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/layers/linear.py +15 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/layers/logits_processor.py +109 -72
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/layers/quantization/__init__.py +15 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/layers/quantization/fp8.py +15 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/layers/radix_attention.py +21 -3
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/layers/token_attention.py +16 -1
- sglang-0.2.5/sglang/srt/managers/controller/manager_multi.py → sglang-0.2.7/sglang/srt/managers/controller_multi.py +17 -2
- sglang-0.2.5/sglang/srt/managers/controller/manager_single.py → sglang-0.2.7/sglang/srt/managers/controller_single.py +17 -2
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/managers/detokenizer_manager.py +16 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/managers/io_struct.py +38 -5
- sglang-0.2.5/sglang/srt/managers/controller/schedule_heuristic.py → sglang-0.2.7/sglang/srt/managers/policy_scheduler.py +37 -22
- sglang-0.2.5/sglang/srt/managers/controller/infer_batch.py → sglang-0.2.7/sglang/srt/managers/schedule_batch.py +85 -25
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/managers/tokenizer_manager.py +99 -57
- {sglang-0.2.5/sglang/srt/managers/controller → sglang-0.2.7/sglang/srt/managers}/tp_worker.py +177 -81
- sglang-0.2.7/sglang/srt/mem_cache/flush_cache.py +33 -0
- {sglang-0.2.5/sglang/srt → sglang-0.2.7/sglang/srt/mem_cache}/memory_pool.py +16 -1
- {sglang-0.2.5/sglang/srt/managers/controller → sglang-0.2.7/sglang/srt/mem_cache}/radix_cache.py +15 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/mm_utils.py +15 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/model_config.py +20 -0
- {sglang-0.2.5/sglang/srt/managers/controller → sglang-0.2.7/sglang/srt/model_executor}/cuda_graph_runner.py +42 -18
- {sglang-0.2.5/sglang/srt/managers/controller → sglang-0.2.7/sglang/srt/model_executor}/model_runner.py +51 -16
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/model_loader/model_loader.py +15 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/model_loader/utils.py +16 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/models/chatglm.py +16 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/models/commandr.py +16 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/models/dbrx.py +16 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/models/deepseek.py +16 -1
- sglang-0.2.7/sglang/srt/models/deepseek_v2.py +532 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/models/gemma.py +16 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/models/gemma2.py +16 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/models/gpt_bigcode.py +16 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/models/grok.py +16 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/models/internlm2.py +16 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/models/llama2.py +16 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/models/llama_classification.py +19 -4
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/models/llava.py +17 -2
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/models/llavavid.py +17 -2
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/models/minicpm.py +16 -1
- sglang-0.2.7/sglang/srt/models/mistral.py +26 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/models/mixtral.py +16 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/models/mixtral_quant.py +16 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/models/qwen.py +16 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/models/qwen2.py +16 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/models/qwen2_moe.py +16 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/models/stablelm.py +16 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/models/yivl.py +15 -0
- sglang-0.2.7/sglang/srt/openai_api/adapter.py +822 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/openai_api/protocol.py +65 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/sampling_params.py +20 -4
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/server.py +90 -37
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/server_args.py +76 -17
- {sglang-0.2.5 → sglang-0.2.7}/sglang/srt/utils.py +15 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/test/test_programs.py +5 -1
- {sglang-0.2.5 → sglang-0.2.7}/sglang/utils.py +22 -0
- sglang-0.2.7/sglang/version.py +1 -0
- {sglang-0.2.5 → sglang-0.2.7/sglang.egg-info}/PKG-INFO +40 -12
- {sglang-0.2.5 → sglang-0.2.7}/sglang.egg-info/SOURCES.txt +11 -10
- sglang-0.2.5/sglang/srt/constrained/fsm_cache.py +0 -31
- sglang-0.2.5/sglang/srt/flush_cache.py +0 -18
- sglang-0.2.5/sglang/srt/models/mistral.py +0 -11
- sglang-0.2.5/sglang/srt/openai_api/adapter.py +0 -437
- sglang-0.2.5/sglang/version.py +0 -1
- {sglang-0.2.5 → sglang-0.2.7}/LICENSE +0 -0
- {sglang-0.2.5 → sglang-0.2.7}/setup.cfg +0 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/global_config.py +0 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/lang/__init__.py +0 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/lang/backend/__init__.py +0 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/lang/backend/anthropic.py +0 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/lang/backend/base_backend.py +0 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/lang/backend/vertexai.py +0 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/lang/chat_template.py +0 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/lang/compiler.py +0 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/lang/tracer.py +0 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/launch_server.py +0 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/launch_server_llavavid.py +0 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/test/test_conversation.py +0 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/test/test_openai_protocol.py +0 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang/test/test_utils.py +0 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang.egg-info/dependency_links.txt +0 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang.egg-info/requires.txt +0 -0
- {sglang-0.2.5 → sglang-0.2.7}/sglang.egg-info/top_level.txt +0 -0

{sglang-0.2.5/sglang.egg-info → sglang-0.2.7}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: sglang
-Version: 0.2.5
+Version: 0.2.7
 Summary: SGLang is yet another fast serving framework for large language models and vision language models.
 License: Apache License
                                  Version 2.0, January 2004

@@ -245,11 +245,18 @@ Requires-Dist: sglang[litellm]; extra == "all"
 
 <div align="center">
 <img src="https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png" alt="logo" width="400"></img>
+
+[](https://pypi.org/project/sglang)
+
+[](https://github.com/sgl-project/sglang/tree/main/LICENSE)
+[](https://github.com/sgl-project/sglang/issues)
+[](https://github.com/sgl-project/sglang/issues)
+
 </div>
 
 --------------------------------------------------------------------------------
 
-| [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/) | [**Paper**](https://arxiv.org/abs/2312.07104) |
+| [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/) | [**Paper**](https://arxiv.org/abs/2312.07104) | [**Slack**](https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw) |
 
 SGLang is a fast serving framework for large language models and vision language models.
 It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.

@@ -292,7 +299,8 @@ pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/
 
 ### Method 2: From source
 ```
-git clone https://github.com/sgl-project/sglang.git
+# Use the stable release branch
+git clone -b release https://github.com/sgl-project/sglang.git
 cd sglang
 
 pip install --upgrade pip

@@ -341,7 +349,7 @@ curl http://localhost:30000/generate \
   }
 }'
 ```
-Learn more about the argument format [here](docs/sampling_params.md).
+Learn more about the argument format [here](docs/en/sampling_params.md).
 
 ### OpenAI Compatible API
 In addition, the server supports OpenAI-compatible APIs.

@@ -388,7 +396,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 ```
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --mem-fraction-static 0.7
 ```
-- See [hyperparameter_tuning.md](docs/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
+- See [hyperparameter_tuning.md](docs/en/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
 - Add `--nnodes 2` to run tensor parallelism on multiple nodes. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port.
 ```
 # Node 0

@@ -397,23 +405,24 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 # Node 1
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-0:50000 --nnodes 2 --node-rank 1
 ```
-- If the model does not have a template in the Hugging Face tokenizer, you can specify a [custom chat template](docs/custom_chat_template.md).
+- If the model does not have a template in the Hugging Face tokenizer, you can specify a [custom chat template](docs/en/custom_chat_template.md).
 - To enable fp8 quantization, you can add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
 - To enable experimental torch.compile support, you can add `--enable-torch-compile`. It accelerates small models on small batch sizes.
 
 ### Run Llama 3.1 405B
 
 ```bash
-
+## Run 405B (fp8) on a single node
+python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8
+
+## Run 405B (fp16) on two nodes
 # replace the `172.16.4.52:20000` with your own first node ip address and port, disable CUDA Graph temporarily
+
 # on the first node
 GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 0 --disable-cuda-graph --mem-frac 0.75
 
 # on the second
 GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 1 --disable-cuda-graph --mem-frac 0.75
-
-# single node run 405B fp8
-python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8
 ```
 
 ### Supported Models

@@ -422,6 +431,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instr
 - Mistral / Mixtral
 - Gemma / Gemma 2
 - Qwen / Qwen 2 / Qwen 2 MoE
+- DeepSeek / DeepSeek 2
 - LLaVA 1.5 / 1.6
   - `python -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
   - `python -m sglang.launch_server --model-path liuhaotian/llava-v1.6-vicuna-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`

@@ -438,11 +448,11 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instr
 - InternLM 2
 - Mistral NeMo
 
-Instructions for supporting a new model are [here](https://github.com/sgl-project/sglang/blob/main/docs/model_support.md).
+Instructions for supporting a new model are [here](https://github.com/sgl-project/sglang/blob/main/docs/en/model_support.md).
 
 ### Benchmark Performance
 
-- Benchmark a single static batch
+- Benchmark a single static batch by running the following command without launching a server. The arguments are the same as those for `launch_server.py`. This is not a dynamic batching server, so it may run out of memory for a batch size that can run successfully with a real server. This is because a real server will truncate the prefill into several batches/chunks, while this unit test does not do this.
 ```
 python -m sglang.bench_latency --model-path meta-llama/Meta-Llama-3-8B-Instruct --batch 32 --input-len 256 --output-len 32
 ```

@@ -669,6 +679,24 @@ for out in state.text_iter():
     print(out, end="", flush=True)
 ```
 
+#### Roles
+
+Use `sgl.system`, `sgl.user` and `sgl.assistant` to set roles when using Chat models. You can also define more complex role prompts using begin and end tokens.
+
+```python
+@sgl.function
+def chat_example(s):
+    s += sgl.system("You are a helpful assistant.")
+    # Same as: s += s.system("You are a helpful assistant.")
+
+    with s.user():
+        s += "Question: What is the capital of France?"
+
+    s += sgl.assistant_begin()
+    s += "Answer: " + sgl.gen(max_tokens=100, stop="\n")
+    s += sgl.assistant_end()
+```
+
 #### Tips and Implementation Details
 - The `choices` argument in `sgl.gen` is implemented by computing the [token-length normalized log probabilities](https://blog.eleuther.ai/multiple-choice-normalization/) of all choices and selecting the one with the highest probability.
 - The `regex` argument in `sgl.gen` is implemented through autoregressive decoding with logit bias masking, according to the constraints set by the regex. It is compatible with `temperature=0` and `temperature != 0`.

{sglang-0.2.5 → sglang-0.2.7}/README.md

@@ -1,10 +1,17 @@
 <div align="center">
 <img src="https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png" alt="logo" width="400"></img>
+
+[](https://pypi.org/project/sglang)
+
+[](https://github.com/sgl-project/sglang/tree/main/LICENSE)
+[](https://github.com/sgl-project/sglang/issues)
+[](https://github.com/sgl-project/sglang/issues)
+
 </div>
 
 --------------------------------------------------------------------------------
 
-| [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/) | [**Paper**](https://arxiv.org/abs/2312.07104) |
+| [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/) | [**Paper**](https://arxiv.org/abs/2312.07104) | [**Slack**](https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw) |
 
 SGLang is a fast serving framework for large language models and vision language models.
 It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.

@@ -47,7 +54,8 @@ pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/
 
 ### Method 2: From source
 ```
-git clone https://github.com/sgl-project/sglang.git
+# Use the stable release branch
+git clone -b release https://github.com/sgl-project/sglang.git
 cd sglang
 
 pip install --upgrade pip

@@ -96,7 +104,7 @@ curl http://localhost:30000/generate \
   }
 }'
 ```
-Learn more about the argument format [here](docs/sampling_params.md).
+Learn more about the argument format [here](docs/en/sampling_params.md).
 
 ### OpenAI Compatible API
 In addition, the server supports OpenAI-compatible APIs.

@@ -143,7 +151,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 ```
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --mem-fraction-static 0.7
 ```
-- See [hyperparameter_tuning.md](docs/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
+- See [hyperparameter_tuning.md](docs/en/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
 - Add `--nnodes 2` to run tensor parallelism on multiple nodes. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port.
 ```
 # Node 0

@@ -152,23 +160,24 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 # Node 1
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-0:50000 --nnodes 2 --node-rank 1
 ```
-- If the model does not have a template in the Hugging Face tokenizer, you can specify a [custom chat template](docs/custom_chat_template.md).
+- If the model does not have a template in the Hugging Face tokenizer, you can specify a [custom chat template](docs/en/custom_chat_template.md).
 - To enable fp8 quantization, you can add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
 - To enable experimental torch.compile support, you can add `--enable-torch-compile`. It accelerates small models on small batch sizes.
 
 ### Run Llama 3.1 405B
 
 ```bash
-
+## Run 405B (fp8) on a single node
+python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8
+
+## Run 405B (fp16) on two nodes
 # replace the `172.16.4.52:20000` with your own first node ip address and port, disable CUDA Graph temporarily
+
 # on the first node
 GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 0 --disable-cuda-graph --mem-frac 0.75
 
 # on the second
 GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 1 --disable-cuda-graph --mem-frac 0.75
-
-# single node run 405B fp8
-python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8
 ```
 
 ### Supported Models

@@ -177,6 +186,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instr
 - Mistral / Mixtral
 - Gemma / Gemma 2
 - Qwen / Qwen 2 / Qwen 2 MoE
+- DeepSeek / DeepSeek 2
 - LLaVA 1.5 / 1.6
   - `python -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
   - `python -m sglang.launch_server --model-path liuhaotian/llava-v1.6-vicuna-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`

@@ -193,11 +203,11 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instr
 - InternLM 2
 - Mistral NeMo
 
-Instructions for supporting a new model are [here](https://github.com/sgl-project/sglang/blob/main/docs/model_support.md).
+Instructions for supporting a new model are [here](https://github.com/sgl-project/sglang/blob/main/docs/en/model_support.md).
 
 ### Benchmark Performance
 
-- Benchmark a single static batch
+- Benchmark a single static batch by running the following command without launching a server. The arguments are the same as those for `launch_server.py`. This is not a dynamic batching server, so it may run out of memory for a batch size that can run successfully with a real server. This is because a real server will truncate the prefill into several batches/chunks, while this unit test does not do this.
 ```
 python -m sglang.bench_latency --model-path meta-llama/Meta-Llama-3-8B-Instruct --batch 32 --input-len 256 --output-len 32
 ```

@@ -424,6 +434,24 @@ for out in state.text_iter():
     print(out, end="", flush=True)
 ```
 
+#### Roles
+
+Use `sgl.system`, `sgl.user` and `sgl.assistant` to set roles when using Chat models. You can also define more complex role prompts using begin and end tokens.
+
+```python
+@sgl.function
+def chat_example(s):
+    s += sgl.system("You are a helpful assistant.")
+    # Same as: s += s.system("You are a helpful assistant.")
+
+    with s.user():
+        s += "Question: What is the capital of France?"
+
+    s += sgl.assistant_begin()
+    s += "Answer: " + sgl.gen(max_tokens=100, stop="\n")
+    s += sgl.assistant_end()
+```
+
 #### Tips and Implementation Details
 - The `choices` argument in `sgl.gen` is implemented by computing the [token-length normalized log probabilities](https://blog.eleuther.ai/multiple-choice-normalization/) of all choices and selecting the one with the highest probability.
 - The `regex` argument in `sgl.gen` is implemented through autoregressive decoding with logit bias masking, according to the constraints set by the regex. It is compatible with `temperature=0` and `temperature != 0`.
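
The "Tips and Implementation Details" note above describes `choices` as token-length normalized log-probability selection. As a rough illustration only (this is not SGLang's code; the per-choice token logprob lists are assumed inputs from a prefill pass), the scoring rule looks like:

```python
# Minimal sketch of token-length normalized selection; illustrative names only.
from typing import Dict, List


def select_choice(choice_token_logprobs: Dict[str, List[float]]) -> str:
    """Pick the choice with the highest average per-token logprob."""

    def score(logprobs: List[float]) -> float:
        return sum(logprobs) / max(len(logprobs), 1)  # normalize by token count

    return max(choice_token_logprobs, key=lambda c: score(choice_token_logprobs[c]))


# Length normalization keeps a longer option from being penalized just for
# having more tokens:
print(select_choice({
    "Paris": [-0.20, -0.10],
    "The capital of France is Paris": [-0.25, -0.15, -0.10, -0.20, -0.10, -0.05],
}))
```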

{sglang-0.2.5 → sglang-0.2.7}/pyproject.toml

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "sglang"
-version = "0.2.5"
+version = "0.2.7"
 description = "SGLang is yet another fast serving framework for large language models and vision language models."
 readme = "README.md"
 requires-python = ">=3.8"

{sglang-0.2.5 → sglang-0.2.7}/sglang/__init__.py

@@ -1,4 +1,5 @@
 # SGL API Components
+
 from sglang.api import (
     Runtime,
     assistant,

@@ -14,48 +15,54 @@ from sglang.api import (
     select,
     set_default_backend,
     system,
+    system_begin,
+    system_end,
     user,
     user_begin,
     user_end,
     video,
 )
 
-#
-from sglang.global_config import global_config
-
-# SGL Backends
-from sglang.lang.backend.anthropic import Anthropic
-from sglang.lang.backend.litellm import LiteLLM
-from sglang.lang.backend.openai import OpenAI
-from sglang.lang.backend.runtime_endpoint import RuntimeEndpoint
-from sglang.lang.backend.vertexai import VertexAI
-
-from .version import __version__
-
-# public APIs management
+# SGLang DSL APIs
 __all__ = [
-    "global_config",
-    "Anthropic",
-    "LiteLLM",
-    "OpenAI",
-    "RuntimeEndpoint",
-    "VertexAI",
-    "function",
     "Runtime",
-    "set_default_backend",
+    "assistant",
+    "assistant_begin",
+    "assistant_end",
     "flush_cache",
-    "get_server_args",
+    "function",
     "gen",
     "gen_int",
     "gen_string",
+    "get_server_args",
     "image",
-    "video",
     "select",
+    "set_default_backend",
     "system",
+    "system_begin",
+    "system_end",
     "user",
-    "assistant",
     "user_begin",
     "user_end",
-    "assistant_begin",
-    "assistant_end",
+    "video",
 ]
+
+# Global Configurations
+from sglang.global_config import global_config
+
+__all__ += ["global_config"]
+
+from sglang.version import __version__
+
+__all__ += ["__version__"]
+
+# SGL Backends
+from sglang.lang.backend.runtime_endpoint import RuntimeEndpoint
+from sglang.utils import LazyImport
+
+Anthropic = LazyImport("sglang.lang.backend.anthropic", "Anthropic")
+LiteLLM = LazyImport("sglang.lang.backend.litellm", "LiteLLM")
+OpenAI = LazyImport("sglang.lang.backend.openai", "OpenAI")
+VertexAI = LazyImport("sglang.lang.backend.vertexai", "VertexAI")
+
+__all__ += ["Anthropic", "LiteLLM", "OpenAI", "VertexAI", "RuntimeEndpoint"]
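
The new `__init__.py` routes the optional backends (Anthropic, LiteLLM, OpenAI, VertexAI) through `LazyImport` from `sglang.utils` (also touched in this release), so importing `sglang` no longer pulls those client libraries in eagerly. A minimal sketch of what such a lazy proxy can look like; the actual `sglang.utils.LazyImport` may be implemented differently.

```python
# Sketch of a LazyImport-style proxy (assumed behavior, not the real implementation).
import importlib


class LazyImport:
    """Defer `from <module_name> import <class_name>` until first use."""

    def __init__(self, module_name: str, class_name: str):
        self.module_name = module_name
        self.class_name = class_name
        self._target = None

    def _load(self):
        if self._target is None:
            module = importlib.import_module(self.module_name)
            self._target = getattr(module, self.class_name)
        return self._target

    def __getattr__(self, name):
        return getattr(self._load(), name)

    def __call__(self, *args, **kwargs):
        # The optional dependency is imported only when the backend is constructed.
        return self._load()(*args, **kwargs)


# Mirrors the new __init__.py wiring shown above:
OpenAI = LazyImport("sglang.lang.backend.openai", "OpenAI")
```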

{sglang-0.2.5 → sglang-0.2.7}/sglang/api.py

@@ -75,7 +75,7 @@ def gen(
     choices: Optional[List[str]] = None,
     regex: Optional[str] = None,
 ):
-    """Call the model to generate. See the meaning of the arguments in docs/sampling_params.md"""
+    """Call the model to generate. See the meaning of the arguments in docs/en/sampling_params.md"""
 
     if choices:
         return SglSelect(name, choices, 0.0 if temperature is None else temperature)

@@ -210,6 +210,14 @@ def assistant(expr: Optional[SglExpr] = None):
     return _role_common("assistant", expr)
 
 
+def system_begin():
+    return SglRoleBegin("system")
+
+
+def system_end():
+    return SglRoleEnd("system")
+
+
 def user_begin():
     return SglRoleBegin("user")
 
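
The new `system_begin()` / `system_end()` helpers mirror the existing `user_begin`/`user_end` and `assistant_begin`/`assistant_end` pair, so a system prompt can be opened and closed explicitly. A small usage sketch following the "Roles" example above; the function name and prompt text are illustrative, not from this release.

```python
# Hypothetical usage of the new system_begin()/system_end() helpers,
# by analogy with the assistant_begin()/assistant_end() example in the README.
import sglang as sgl


@sgl.function
def pirate_qa(s, question):
    s += sgl.system_begin()
    s += "You are a helpful assistant that answers like a pirate."
    s += sgl.system_end()

    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))
```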

{sglang-0.2.5 → sglang-0.2.7}/sglang/bench_latency.py

@@ -37,9 +37,9 @@ import torch
 import torch.distributed as dist
 
 from sglang.srt.hf_transformers_utils import get_tokenizer
-from sglang.srt.managers.controller.infer_batch import Batch, ForwardMode, Req
-from sglang.srt.managers.controller.model_runner import ModelRunner
+from sglang.srt.managers.schedule_batch import Batch, ForwardMode, Req
 from sglang.srt.model_config import ModelConfig
+from sglang.srt.model_executor.model_runner import ModelRunner
 from sglang.srt.sampling_params import SamplingParams
 from sglang.srt.server_args import ServerArgs
 from sglang.srt.utils import suppress_other_loggers

{sglang-0.2.5 → sglang-0.2.7}/sglang/bench_serving.py

@@ -1,5 +1,6 @@
 # Adapted from https://github.com/vllm-project/vllm/blob/6366efc67b0aedd2c1721c14385370e50b297fb3/benchmarks/backend_request_func.py
 # Adapted from https://github.com/vllm-project/vllm/blob/6366efc67b0aedd2c1721c14385370e50b297fb3/benchmarks/benchmark_serving.py
+
 """
 Benchmark online serving.
 

@@ -84,6 +85,9 @@ async def async_request_trt_llm(
             "min_length": request_func_input.output_len,
             "end_id": 1048576,
         }
+        if args.disable_ignore_eos:
+            del payload["min_length"]
+            del payload["end_id"]
         output = RequestFuncOutput()
         output.prompt_len = request_func_input.prompt_len
 

@@ -149,7 +153,7 @@ async def async_request_openai_completions(
             "best_of": 1,
             "max_tokens": request_func_input.output_len,
             "stream": not args.disable_stream,
-            "ignore_eos":
+            "ignore_eos": not args.disable_ignore_eos,
         }
         headers = {"Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"}
 

@@ -969,6 +973,11 @@ if __name__ == "__main__":
         action="store_true",
         help="Disable streaming mode.",
     )
+    parser.add_argument(
+        "--disable-ignore-eos",
+        action="store_true",
+        help="Disable ignoring EOS.",
+    )
 
     set_ulimit()
 

{sglang-0.2.5 → sglang-0.2.7}/sglang/lang/backend/runtime_endpoint.py

@@ -253,14 +253,14 @@ class RuntimeEndpoint(BaseBackend):
             r["meta_info"]["normalized_prompt_logprob"] for r in obj
         ]
         decision = choices[np.argmax(normalized_prompt_logprobs)]
-
-
+        input_token_logprobs = [r["meta_info"]["input_token_logprobs"] for r in obj]
+        output_token_logprobs = [r["meta_info"]["output_token_logprobs"] for r in obj]
 
         return (
             decision,
             normalized_prompt_logprobs,
-
-
+            input_token_logprobs,
+            output_token_logprobs,
         )
 
     def concatenate_and_append(self, src_rids: List[str], dst_rid: str):

{sglang-0.2.5 → sglang-0.2.7}/sglang/lang/interpreter.py

@@ -541,18 +541,19 @@ class StreamExecutor:
         (
             decision,
             normalized_prompt_logprobs,
-
-
+            input_token_logprobs,
+            output_token_logprobs,
         ) = self.backend.select(self, expr.choices, expr.temperature)
         if expr.name is not None:
             name = expr.name
             self.variables[name] = decision
             self.meta_info[name] = {
                 "normalized_prompt_logprobs": normalized_prompt_logprobs,
-                "
-                "
+                "input_token_logprobs": input_token_logprobs,
+                "output_token_logprobs": output_token_logprobs,
             }
             self.variable_event[name].set()
+            self.stream_var_event[name].set()
         self.text_ += decision
 
     def _execute_variable(self, expr: SglVariable):

@@ -705,9 +706,9 @@ class ProgramState:
 
     def _role_common(self, name: str, expr: Optional[SglExpr] = None):
         if expr is not None:
-
-
-
+            role_expr = SglExprList([SglRoleBegin(name), expr, SglRoleEnd(name)])
+            self.stream_executor.submit(role_expr)
+            return role_expr
         else:
 
             @contextmanager

@@ -778,7 +779,14 @@ class ProgramState:
                 if self.stream_executor.is_finished:
                     break
         else:
-            event = self.stream_executor.stream_var_event[var_name]
+            event = None
+            while not event:
+                if var_name in self.stream_executor.stream_var_event:
+                    event = self.stream_executor.stream_var_event[var_name]
+                if self.stream_executor.is_finished:
+                    yield ""
+                    return
+
             while True:
                 event.wait()
                 event.clear()

@@ -813,7 +821,14 @@ class ProgramState:
                 if self.stream_executor.is_finished:
                     break
         else:
-            event = self.stream_executor.stream_var_event[var_name]
+            event = None
+            while not event:
+                if var_name in self.stream_executor.stream_var_event:
+                    event = self.stream_executor.stream_var_event[var_name]
+                if self.stream_executor.is_finished:
+                    yield ""
+                    return
+
             while True:
                 await loop.run_in_executor(None, event.wait)
                 event.clear()

{sglang-0.2.5 → sglang-0.2.7}/sglang/lang/ir.py

@@ -410,7 +410,7 @@ class SglGen(SglExpr):
         dtype: Optional[type] = None,
         regex: Optional[str] = None,
     ):
-        """Call the model to generate. See the meaning of the arguments in docs/sampling_params.md"""
+        """Call the model to generate. See the meaning of the arguments in docs/en/sampling_params.md"""
         super().__init__()
         self.name = name
         self.sampling_params = SglSamplingParams(

{sglang-0.2.5 → sglang-0.2.7}/sglang/srt/constrained/__init__.py

@@ -1,3 +1,18 @@
+"""
+Copyright 2023-2024 SGLang Team
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+"""
+
 import json
 from typing import Dict, Optional, Union
 

{sglang-0.2.5 → sglang-0.2.7}/sglang/srt/constrained/base_cache.py

@@ -1,3 +1,18 @@
+"""
+Copyright 2023-2024 SGLang Team
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+"""
+
 """Base cache class."""
 
 import time

sglang-0.2.7/sglang/srt/constrained/fsm_cache.py (new file)

@@ -0,0 +1,66 @@
+"""
+Copyright 2023-2024 SGLang Team
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+"""
+
+"""Cache for the compressed finite state machine."""
+
+from sglang.srt.constrained import RegexGuide, TransformerTokenizer
+from sglang.srt.constrained.base_cache import BaseCache
+
+
+class FSMCache(BaseCache):
+    def __init__(self, tokenizer_path, tokenizer_args_dict, enable=True):
+        super().__init__(enable=enable)
+
+        if tokenizer_path.endswith(".json") or tokenizer_path.endswith(".model"):
+            # Do not support TiktokenTokenizer or SentencePieceTokenizer
+            return
+
+        from importlib.metadata import version
+
+        if version("outlines") >= "0.0.35":
+            from transformers import AutoTokenizer
+
+            tokenizer_args_dict.setdefault("padding_side", "left")
+            tokenizer = AutoTokenizer.from_pretrained(
+                tokenizer_path, **tokenizer_args_dict
+            )
+            try:
+                self.outlines_tokenizer = TransformerTokenizer(tokenizer)
+            except AttributeError:
+                # FIXME: tmp fix for chatglm2 & chatglm3 (pad_token_id=0)
+                origin_pad_token_id = tokenizer.pad_token_id
+
+                def fset(self, value):
+                    self._value = value
+
+                type(tokenizer).pad_token_id = property(
+                    fget=type(tokenizer).pad_token_id.fget, fset=fset
+                )
+                self.outlines_tokenizer = TransformerTokenizer(tokenizer)
+                self.outlines_tokenizer.tokenizer.pad_token_id = origin_pad_token_id
+                self.outlines_tokenizer.pad_token_id = origin_pad_token_id
+                self.outlines_tokenizer.pad_token = (
+                    self.outlines_tokenizer.tokenizer.pad_token
+                )
+                self.outlines_tokenizer.vocabulary = (
+                    self.outlines_tokenizer.tokenizer.get_vocab()
+                )
+        else:
+            self.outlines_tokenizer = TransformerTokenizer(
+                tokenizer_path, **tokenizer_args_dict
+            )
+
+    def init_value(self, regex):
+        return RegexGuide(regex, self.outlines_tokenizer)
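
The rewritten `fsm_cache.py` compiles an outlines `RegexGuide` per regex, so the compilation cost is paid once per distinct `regex` argument and reused across requests through the `BaseCache` machinery. A hedged sketch of how such a compiled guide is typically consumed during constrained decoding: at each step the guide exposes the token ids allowed by the current FSM state, and the sampler masks logits accordingly. The `get_next_instruction` / `get_next_state` method names follow outlines' Guide interface and are assumptions here, not taken from this diff.

```python
# Sketch of constrained greedy decoding with a compiled guide; assumed API.
import torch


def constrained_greedy_step(logits: torch.Tensor, guide, state: int):
    """Mask logits so only tokens allowed by the current FSM state can be chosen."""
    allowed = guide.get_next_instruction(state).tokens  # token ids legal in this state
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed] = 0.0
    next_token = int(torch.argmax(logits + mask))
    next_state = guide.get_next_state(state, next_token)
    return next_token, next_state
```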