sglang 0.1.15__tar.gz → 0.1.17__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {sglang-0.1.15/sglang.egg-info → sglang-0.1.17}/PKG-INFO +23 -13
- {sglang-0.1.15 → sglang-0.1.17}/README.md +15 -10
- {sglang-0.1.15 → sglang-0.1.17}/pyproject.toml +5 -4
- {sglang-0.1.15 → sglang-0.1.17}/sglang/__init__.py +5 -1
- {sglang-0.1.15 → sglang-0.1.17}/sglang/api.py +8 -3
- {sglang-0.1.15 → sglang-0.1.17}/sglang/backend/anthropic.py +1 -1
- sglang-0.1.17/sglang/backend/litellm.py +90 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/backend/openai.py +148 -12
- {sglang-0.1.15 → sglang-0.1.17}/sglang/backend/runtime_endpoint.py +18 -10
- {sglang-0.1.15 → sglang-0.1.17}/sglang/global_config.py +11 -1
- {sglang-0.1.15 → sglang-0.1.17}/sglang/lang/chat_template.py +9 -2
- {sglang-0.1.15 → sglang-0.1.17}/sglang/lang/interpreter.py +161 -81
- {sglang-0.1.15 → sglang-0.1.17}/sglang/lang/ir.py +29 -11
- {sglang-0.1.15 → sglang-0.1.17}/sglang/lang/tracer.py +1 -1
- {sglang-0.1.15 → sglang-0.1.17}/sglang/launch_server.py +1 -2
- sglang-0.1.17/sglang/launch_server_llavavid.py +31 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/constrained/fsm_cache.py +3 -0
- sglang-0.1.17/sglang/srt/flush_cache.py +16 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/hf_transformers_utils.py +83 -2
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/layers/extend_attention.py +17 -0
- sglang-0.1.17/sglang/srt/layers/fused_moe.py +485 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/layers/logits_processor.py +12 -7
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/layers/radix_attention.py +10 -3
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/layers/token_attention.py +16 -1
- sglang-0.1.17/sglang/srt/managers/controller/dp_worker.py +110 -0
- sglang-0.1.17/sglang/srt/managers/controller/infer_batch.py +619 -0
- sglang-0.1.17/sglang/srt/managers/controller/manager_multi.py +191 -0
- sglang-0.1.17/sglang/srt/managers/controller/manager_single.py +97 -0
- sglang-0.1.17/sglang/srt/managers/controller/model_runner.py +462 -0
- {sglang-0.1.15/sglang/srt/managers/router → sglang-0.1.17/sglang/srt/managers/controller}/radix_cache.py +54 -18
- sglang-0.1.17/sglang/srt/managers/controller/schedule_heuristic.py +59 -0
- sglang-0.1.17/sglang/srt/managers/controller/tp_worker.py +791 -0
- sglang-0.1.17/sglang/srt/managers/detokenizer_manager.py +95 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/managers/io_struct.py +26 -10
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/managers/router/infer_batch.py +130 -74
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/managers/router/manager.py +7 -9
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/managers/router/model_rpc.py +224 -135
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/managers/router/model_runner.py +94 -107
- sglang-0.1.17/sglang/srt/managers/router/radix_cache.py +267 -0
- sglang-0.1.17/sglang/srt/managers/router/scheduler.py +59 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/managers/tokenizer_manager.py +183 -88
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/model_config.py +5 -2
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/models/commandr.py +15 -22
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/models/dbrx.py +22 -29
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/models/gemma.py +14 -24
- sglang-0.1.17/sglang/srt/models/grok.py +671 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/models/llama2.py +24 -23
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/models/llava.py +85 -25
- sglang-0.1.17/sglang/srt/models/llavavid.py +298 -0
- sglang-0.1.17/sglang/srt/models/mixtral.py +513 -0
- sglang-0.1.15/sglang/srt/models/mixtral.py → sglang-0.1.17/sglang/srt/models/mixtral_quant.py +18 -34
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/models/qwen.py +28 -25
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/models/qwen2.py +17 -22
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/models/stablelm.py +21 -26
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/models/yivl.py +17 -25
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/openai_api_adapter.py +140 -95
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/openai_protocol.py +10 -1
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/server.py +101 -52
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/server_args.py +59 -11
- sglang-0.1.17/sglang/srt/utils.py +484 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/test/test_programs.py +44 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/test/test_utils.py +32 -1
- {sglang-0.1.15 → sglang-0.1.17}/sglang/utils.py +95 -26
- {sglang-0.1.15 → sglang-0.1.17/sglang.egg-info}/PKG-INFO +23 -13
- {sglang-0.1.15 → sglang-0.1.17}/sglang.egg-info/SOURCES.txt +15 -3
- {sglang-0.1.15 → sglang-0.1.17}/sglang.egg-info/requires.txt +8 -2
- sglang-0.1.15/sglang/srt/backend_config.py +0 -13
- sglang-0.1.15/sglang/srt/managers/detokenizer_manager.py +0 -95
- sglang-0.1.15/sglang/srt/managers/router/scheduler.py +0 -70
- sglang-0.1.15/sglang/srt/models/dbrx_config.py +0 -281
- sglang-0.1.15/sglang/srt/utils.py +0 -317
- sglang-0.1.15/sglang/srt/weight_utils.py +0 -402
- {sglang-0.1.15 → sglang-0.1.17}/LICENSE +0 -0
- {sglang-0.1.15 → sglang-0.1.17}/setup.cfg +0 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/backend/__init__.py +0 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/backend/base_backend.py +0 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/backend/vertexai.py +0 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/lang/__init__.py +0 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/lang/compiler.py +0 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/constrained/__init__.py +0 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/constrained/base_cache.py +0 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/constrained/jump_forward.py +0 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/conversation.py +0 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/layers/context_flashattention_nopad.py +0 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/memory_pool.py +0 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/mm_utils.py +0 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/models/mistral.py +0 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/srt/sampling_params.py +0 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/test/test_conversation.py +0 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang/test/test_openai_protocol.py +0 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang.egg-info/dependency_links.txt +0 -0
- {sglang-0.1.15 → sglang-0.1.17}/sglang.egg-info/top_level.txt +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: sglang
-Version: 0.1.15
+Version: 0.1.17
 Summary: A structured generation langauge for LLMs.
 License: Apache License
         Version 2.0, January 2004
@@ -222,12 +222,14 @@ Requires-Dist: torch; extra == "srt"
 Requires-Dist: uvloop; extra == "srt"
 Requires-Dist: uvicorn; extra == "srt"
 Requires-Dist: zmq; extra == "srt"
-Requires-Dist: vllm
+Requires-Dist: vllm==0.4.3; extra == "srt"
 Requires-Dist: interegular; extra == "srt"
 Requires-Dist: pydantic; extra == "srt"
 Requires-Dist: pillow; extra == "srt"
-Requires-Dist: outlines>=0.0.27; extra == "srt"
 Requires-Dist: packaging; extra == "srt"
+Requires-Dist: huggingface_hub; extra == "srt"
+Requires-Dist: hf_transfer; extra == "srt"
+Requires-Dist: outlines>=0.0.34; extra == "srt"
 Provides-Extra: openai
 Requires-Dist: openai>=1.0; extra == "openai"
 Requires-Dist: numpy; extra == "openai"
@@ -235,10 +237,13 @@ Requires-Dist: tiktoken; extra == "openai"
 Provides-Extra: anthropic
 Requires-Dist: anthropic>=0.20.0; extra == "anthropic"
 Requires-Dist: numpy; extra == "anthropic"
+Provides-Extra: litellm
+Requires-Dist: litellm>=1.0.0; extra == "litellm"
 Provides-Extra: all
 Requires-Dist: sglang[srt]; extra == "all"
 Requires-Dist: sglang[openai]; extra == "all"
 Requires-Dist: sglang[anthropic]; extra == "all"
+Requires-Dist: sglang[litellm]; extra == "all"
 
 <div align="center">
 <img src="assets/logo.png" alt="logo" width="400"></img>
@@ -251,9 +256,9 @@ Requires-Dist: sglang[anthropic]; extra == "all"
 SGLang is a structured generation language designed for large language models (LLMs).
 It makes your interaction with LLMs faster and more controllable by co-designing the frontend language and the runtime system.
 
-The core features
+The core features include:
 - **A Flexible Front-End Language**: This allows for easy programming of LLM applications with multiple chained generation calls, advanced prompting techniques, control flow, multiple modalities, parallelism, and external interaction.
-- **A High-Performance Runtime with RadixAttention**: This feature significantly accelerates the execution of complex LLM programs by
+- **A High-Performance Runtime with RadixAttention**: This feature significantly accelerates the execution of complex LLM programs by automatically reusing the KV cache across multiple calls. It can also be used as a standalone serving engine with all common techniques implemented, such as continuous batching and tensor parallelism.
 
 ## News
 - [2024/02] 🔥 SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).
@@ -286,12 +291,8 @@ pip install -e "python[all]"
 ```
 
 ### Notes
-- If you are using older GPUs (NVIDIA V100, T4), please pick the correct triton compiler version to avoid some known bugs.
-  - For NVIDIA T4, please use `pip install "triton>=2.2.0"`.
-  - For NVIDIA V100, please install the [nightly](https://triton-lang.org/main/getting-started/installation.html) version.
 - If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install "sglang[openai]"`
 
-
 ## Quick Start
 The example below shows how to use sglang to answer a mulit-turn question.
 
@@ -568,15 +569,17 @@ response = client.chat.completions.create(
 print(response)
 ```
 
-
-
+
+By default, the server uses the chat template specified in the model tokenizer from Hugging Face. It should just work for most official models such as Llama-2/Llama-3.
+
+If needed, you can also override the chat template when launching the server:
 
 ```
 python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template llama-2
 ```
 
 If the chat template you are looking for is missing, you are welcome to contribute it.
-Meanwhile, you can also
+Meanwhile, you can also temporarily register your chat template as follows:
 
 ```json
 {
@@ -599,11 +602,16 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
 ```
 python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --tp 2
 ```
+- Add `--dp 2` to enable data parallelism. It can also be used together with tp. Data parallelism is better for throughput if there is enough memory.
+```
+python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --dp 2 --tp 2
+```
 - If you see out-of-memory errors during serving, please try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`
 ```
 python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --mem-fraction-static 0.7
 ```
-
+- See [flashinfer.md](docs/flashinfer.md) on accelerating inference using highly optimized CUDA kernels.
+- See [hyperparameter_tuning.md](docs/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
 
 ### Supported Models
 - Llama
@@ -617,6 +625,8 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
   - `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
   - `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.6-vicuna-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
   - `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.6-34b --tokenizer-path liuhaotian/llava-v1.6-34b-tokenizer --port 3000`
+- LLaVA-NeXT-Video
+  - see [srt_example_llava_v.sh](examples/usage/llava_video/srt_example_llava_v.sh)
 - Yi-VL
   - see [srt_example_yi_vl.py](examples/quick_start/srt_example_yi_vl.py).
 - StableLM
@@ -9,9 +9,9 @@
 SGLang is a structured generation language designed for large language models (LLMs).
 It makes your interaction with LLMs faster and more controllable by co-designing the frontend language and the runtime system.
 
-The core features
+The core features include:
 - **A Flexible Front-End Language**: This allows for easy programming of LLM applications with multiple chained generation calls, advanced prompting techniques, control flow, multiple modalities, parallelism, and external interaction.
-- **A High-Performance Runtime with RadixAttention**: This feature significantly accelerates the execution of complex LLM programs by
+- **A High-Performance Runtime with RadixAttention**: This feature significantly accelerates the execution of complex LLM programs by automatically reusing the KV cache across multiple calls. It can also be used as a standalone serving engine with all common techniques implemented, such as continuous batching and tensor parallelism.
 
 ## News
 - [2024/02] 🔥 SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).
@@ -44,12 +44,8 @@ pip install -e "python[all]"
 ```
 
 ### Notes
-- If you are using older GPUs (NVIDIA V100, T4), please pick the correct triton compiler version to avoid some known bugs.
-  - For NVIDIA T4, please use `pip install "triton>=2.2.0"`.
-  - For NVIDIA V100, please install the [nightly](https://triton-lang.org/main/getting-started/installation.html) version.
 - If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install "sglang[openai]"`
 
-
 ## Quick Start
 The example below shows how to use sglang to answer a mulit-turn question.
 
@@ -326,15 +322,17 @@ response = client.chat.completions.create(
 print(response)
 ```
 
-
-
+
+By default, the server uses the chat template specified in the model tokenizer from Hugging Face. It should just work for most official models such as Llama-2/Llama-3.
+
+If needed, you can also override the chat template when launching the server:
 
 ```
 python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template llama-2
 ```
 
 If the chat template you are looking for is missing, you are welcome to contribute it.
-Meanwhile, you can also
+Meanwhile, you can also temporarily register your chat template as follows:
 
 ```json
 {
@@ -357,11 +355,16 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
 ```
 python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --tp 2
 ```
+- Add `--dp 2` to enable data parallelism. It can also be used together with tp. Data parallelism is better for throughput if there is enough memory.
+```
+python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --dp 2 --tp 2
+```
 - If you see out-of-memory errors during serving, please try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`
 ```
 python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --mem-fraction-static 0.7
 ```
-
+- See [flashinfer.md](docs/flashinfer.md) on accelerating inference using highly optimized CUDA kernels.
+- See [hyperparameter_tuning.md](docs/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
 
 ### Supported Models
 - Llama
@@ -375,6 +378,8 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
   - `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
   - `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.6-vicuna-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
   - `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.6-34b --tokenizer-path liuhaotian/llava-v1.6-34b-tokenizer --port 3000`
+- LLaVA-NeXT-Video
+  - see [srt_example_llava_v.sh](examples/usage/llava_video/srt_example_llava_v.sh)
 - Yi-VL
   - see [srt_example_yi_vl.py](examples/quick_start/srt_example_yi_vl.py).
 - StableLM
@@ -4,8 +4,8 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "sglang"
-version = "0.1.15"
-description = "A structured generation langauge for LLMs."
+version = "0.1.17"
+description = "A structured generation langauge for LLMs."
 readme = "README.md"
 requires-python = ">=3.8"
 license = {file = "LICENSE"}
@@ -20,10 +20,11 @@ dependencies = [
 
 [project.optional-dependencies]
 srt = ["aiohttp", "fastapi", "psutil", "rpyc", "torch", "uvloop", "uvicorn",
-       "zmq", "vllm
+       "zmq", "vllm==0.4.3", "interegular", "pydantic", "pillow", "packaging", "huggingface_hub", "hf_transfer", "outlines>=0.0.34"]
 openai = ["openai>=1.0", "numpy", "tiktoken"]
 anthropic = ["anthropic>=0.20.0", "numpy"]
-
+litellm = ["litellm>=1.0.0"]
+all = ["sglang[srt]", "sglang[openai]", "sglang[anthropic]", "sglang[litellm]"]
 
 [project.urls]
 "Homepage" = "https://github.com/sgl-project/sglang"
@@ -1,4 +1,4 @@
-__version__ = "0.1.15"
+__version__ = "0.1.17"
 
 # SGL API Components
 from sglang.api import (
@@ -19,6 +19,7 @@ from sglang.api import (
     user,
     user_begin,
     user_end,
+    video,
 )
 
 # SGL Backends
@@ -26,6 +27,7 @@ from sglang.backend.anthropic import Anthropic
 from sglang.backend.openai import OpenAI
 from sglang.backend.runtime_endpoint import RuntimeEndpoint
 from sglang.backend.vertexai import VertexAI
+from sglang.backend.litellm import LiteLLM
 
 # Global Configurations
 from sglang.global_config import global_config
@@ -34,6 +36,7 @@ from sglang.global_config import global_config
 __all__ = [
     "global_config",
     "Anthropic",
+    "LiteLLM",
     "OpenAI",
     "RuntimeEndpoint",
     "VertexAI",
@@ -46,6 +49,7 @@ __all__ = [
     "gen_int",
     "gen_string",
     "image",
+    "video",
     "select",
     "system",
     "user",
@@ -15,17 +15,18 @@ from sglang.lang.ir import (
     SglRoleBegin,
     SglRoleEnd,
     SglSelect,
+    SglVideo,
 )
 
 
 def function(
-    func: Optional[Callable] = None,
+    func: Optional[Callable] = None, num_api_spec_tokens: Optional[int] = None
 ):
     if func:
-        return SglFunction(func,
+        return SglFunction(func, num_api_spec_tokens=num_api_spec_tokens)
 
     def decorator(func):
-        return SglFunction(func,
+        return SglFunction(func, num_api_spec_tokens=num_api_spec_tokens)
 
     return decorator
 
@@ -151,6 +152,10 @@ def image(expr: SglExpr):
     return SglImage(expr)
 
 
+def video(path: str, num_frames: int):
+    return SglVideo(path, num_frames)
+
+
 def select(
     name: Optional[str] = None,
     choices: List[str] = None,
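The new `video` primitive mirrors the existing `image` primitive: it wraps a file path and a frame count into an `SglVideo` expression that can be embedded in a prompt. A minimal usage sketch, assuming a video-capable model (such as LLaVA-NeXT-Video) is served behind a local `RuntimeEndpoint`; the endpoint, clip path, frame count, and prompt below are illustrative placeholders, not taken from the diff:

```python
import sglang as sgl

# Assumed setup: an sglang server with a video-capable model on port 30000.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def describe_clip(s, clip_path):
    # video(path, num_frames) samples num_frames frames from the clip.
    s += sgl.user(sgl.video(clip_path, 16) + "What happens in this clip?")
    s += sgl.assistant(sgl.gen("answer", max_tokens=128))

state = describe_clip.run(clip_path="example_clip.mp4")
print(state["answer"])
```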
@@ -0,0 +1,90 @@
+from typing import Mapping, Optional
+
+from sglang.backend.base_backend import BaseBackend
+from sglang.lang.chat_template import get_chat_template_by_model_path
+from sglang.lang.interpreter import StreamExecutor
+from sglang.lang.ir import SglSamplingParams
+
+try:
+    import litellm
+except ImportError as e:
+    litellm = e
+litellm.num_retries = 1
+
+
+class LiteLLM(BaseBackend):
+
+    def __init__(
+        self,
+        model_name,
+        chat_template=None,
+        api_key=None,
+        organization: Optional[str] = None,
+        base_url: Optional[str] = None,
+        timeout: Optional[float] = 600,
+        max_retries: Optional[int] = litellm.num_retries,
+        default_headers: Optional[Mapping[str, str]] = None,
+    ):
+        super().__init__()
+
+        if isinstance(litellm, Exception):
+            raise litellm
+
+        self.model_name = model_name
+
+        self.chat_template = chat_template or get_chat_template_by_model_path(
+            model_name)
+
+        self.client_params = {
+            "api_key": api_key,
+            "organization": organization,
+            "base_url": base_url,
+            "timeout": timeout,
+            "max_retries": max_retries,
+            "default_headers": default_headers,
+        }
+
+    def get_chat_template(self):
+        return self.chat_template
+
+    def generate(
+        self,
+        s: StreamExecutor,
+        sampling_params: SglSamplingParams,
+    ):
+        if s.messages_:
+            messages = s.messages_
+        else:
+            messages = [{"role": "user", "content": s.text_}]
+
+        ret = litellm.completion(
+            model=self.model_name,
+            messages=messages,
+            **self.client_params,
+            **sampling_params.to_anthropic_kwargs(),
+        )
+        comp = ret.choices[0].message.content
+
+        return comp, {}
+
+    def generate_stream(
+        self,
+        s: StreamExecutor,
+        sampling_params: SglSamplingParams,
+    ):
+        if s.messages_:
+            messages = s.messages_
+        else:
+            messages = [{"role": "user", "content": s.text_}]
+
+        ret = litellm.completion(
+            model=self.model_name,
+            messages=messages,
+            stream=True,
+            **self.client_params,
+            **sampling_params.to_litellm_kwargs(),
+        )
+        for chunk in ret:
+            text = chunk.choices[0].delta.content
+            if text is not None:
+                yield text, {}
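Because `LiteLLM` is exported from the package `__init__` (see the import and `__all__` hunks above), it can be dropped in like any other backend. A minimal sketch, assuming the relevant provider credentials are already set in the environment; the model name and prompt are only examples:

```python
import sglang as sgl

# LiteLLM routes the call to whichever provider the model name implies.
sgl.set_default_backend(sgl.LiteLLM("gpt-3.5-turbo"))

@sgl.function
def tell_joke(s):
    s += sgl.user("Tell me a short joke about compilers.")
    s += sgl.assistant(sgl.gen("joke", max_tokens=64))

print(tell_joke.run()["joke"])
```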
@@ -1,5 +1,7 @@
 import logging
 import time
+import warnings
+import dataclasses
 from typing import Callable, List, Optional, Union
 
 import numpy as np
@@ -41,6 +43,15 @@ INSTRUCT_MODEL_NAMES = [
 ]
 
 
+@dataclasses.dataclass
+class TokenUsage:
+    prompt_tokens: int
+    completion_tokens: int
+
+    def reset(self):
+        self.prompt_tokens = self.completion_tokens = 0
+
+
 class OpenAI(BaseBackend):
     def __init__(
         self,
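The `TokenUsage` counters are attached to the backend instance (as `token_usage` in the following hunks) and accumulated from the `usage` field of each API response, so aggregate prompt and completion token counts can be read back after running a program. A small sketch of how that could be used; the model name and prompt are illustrative:

```python
import sglang as sgl

backend = sgl.OpenAI("gpt-3.5-turbo")  # illustrative model name
sgl.set_default_backend(backend)

@sgl.function
def greet(s):
    s += sgl.user("Say hello in five words.")
    s += sgl.assistant(sgl.gen("hello", max_tokens=16))

backend.token_usage.reset()
greet.run()
# Totals accumulated across all API calls made by the program.
print(backend.token_usage.prompt_tokens, backend.token_usage.completion_tokens)
```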
@@ -80,40 +91,89 @@ class OpenAI(BaseBackend):
         else:
             self.is_chat_model = True
 
-        self.
+        self.chat_prefix = self.chat_template.role_prefix_and_suffix["assistant"][0]
+
+        # Usage
+        self.token_usage = TokenUsage(0, 0)
+
+        # API speculative execution
+        # TODO(ying): This does not support multi-threading (run_batch)
+        self.spec_kwargs = {}
+        self.spec_format = []
+        self.spec_max_num_tries = 3
 
     def get_chat_template(self):
         return self.chat_template
 
+    def _prepare_spec_execution(self, sampling_params: SglSamplingParams,
+                                num_api_spec_tokens: int, spec_var_name: str):
+        if "max_tokens" not in self.spec_kwargs:
+            self.spec_kwargs["max_tokens"] = num_api_spec_tokens
+        else:
+            assert (
+                self.spec_kwargs["max_tokens"] == num_api_spec_tokens
+            )
+
+        params = sampling_params.to_openai_kwargs()
+        for key, value in params.items():
+            if key in ["stop"]:
+                continue
+            if key in ["max_tokens"]:
+                warnings.warn(
+                    "The parameter max_tokens will be overwritten by speculated number of tokens."
+                )
+                continue
+            if key not in self.spec_kwargs:
+                self.spec_kwargs[key] = value
+            else:
+                assert (
+                    value == self.spec_kwargs[key]
+                ), "sampling parameters should be consistent if turn on api speculative execution."
+        self.spec_format.append(
+            {"text": "", "stop": params["stop"], "name": spec_var_name}
+        )
+        return "", {}
+
     def generate(
         self,
         s: StreamExecutor,
         sampling_params: SglSamplingParams,
+        spec_var_name: str = None,
     ):
         if sampling_params.dtype is None:
             if self.is_chat_model:
-                if
-
-
-
-
-
+                if s.num_api_spec_tokens is None:
+                    if not s.text_.endswith(self.chat_prefix):
+                        raise RuntimeError(
+                            "This use case is not supported if api speculative execution is off. "
+                            "For OpenAI chat models, sgl.gen must be right after sgl.assistant. "
+                            "Example of adding api speculative execution: @function(num_api_spec_tokens=128)."
+                        )
+                    prompt = s.messages_
+                else:
+                    return self._prepare_spec_execution(sampling_params,
+                        s.num_api_spec_tokens, spec_var_name)
             else:
                 prompt = s.text_
 
             kwargs = sampling_params.to_openai_kwargs()
             comp = openai_completion(
                 client=self.client,
+                token_usage=self.token_usage,
                 is_chat=self.is_chat_model,
                 model=self.model_name,
                 prompt=prompt,
                 **kwargs,
             )
         elif sampling_params.dtype in [str, "str", "string"]:
+            assert (
+                not self.is_chat_model
+            ), "constrained type not supported on chat model"
             kwargs = sampling_params.to_openai_kwargs()
             kwargs.pop("stop")
             comp = openai_completion(
                 client=self.client,
+                token_usage=self.token_usage,
                 is_chat=self.is_chat_model,
                 model=self.model_name,
                 prompt=s.text_ + '"',
@@ -122,10 +182,14 @@
             )
             comp = '"' + comp + '"'
         elif sampling_params.dtype in [int, "int"]:
+            assert (
+                not self.is_chat_model
+            ), "constrained type not supported on chat model"
             kwargs = sampling_params.to_openai_kwargs()
             kwargs.pop("stop")
             comp = openai_completion(
                 client=self.client,
+                token_usage=self.token_usage,
                 is_chat=self.is_chat_model,
                 model=self.model_name,
                 prompt=s.text_,
@@ -138,6 +202,63 @@
 
         return comp, {}
 
+    def spec_fill(self, value: str):
+        assert self.is_chat_model
+        self.spec_format.append({"text": value, "stop": None, "name": None})
+
+    def spec_pattern_match(self, comp):
+        for i, term in enumerate(self.spec_format):
+            text = term["text"]
+            if text != "":
+                if comp.startswith(text):
+                    comp = comp[len(text) :]
+                else:
+                    return False
+            else:
+                pos = comp.find(term["stop"])
+                if pos != -1:
+                    term["text"] = comp[:pos]
+                    comp = comp[pos:]
+                else:
+                    if i == len(self.spec_format) - 1:
+                        term["text"] = comp
+                    else:
+                        return False
+        return True
+
+    def role_end_generate(
+        self,
+        s: StreamExecutor,
+    ):
+        if s.num_api_spec_tokens is None or not s.text_.endswith(self.chat_prefix):
+            return
+
+        comp = ""
+        if not all(x["name"] is None for x in self.spec_format):
+            # TODO(ying): throw errors or warnings
+            for i in range(self.spec_max_num_tries):
+                comp = openai_completion(
+                    client=self.client,
+                    token_usage=self.token_usage,
+                    is_chat=self.is_chat_model,
+                    model=self.model_name,
+                    prompt=s.messages_,
+                    **self.spec_kwargs,
+                )
+                if self.spec_pattern_match(comp):
+                    break
+
+        for term in self.spec_format:
+            s.text_ += term["text"]
+            name = term["name"]
+            if name is not None:
+                s.variables[name] = term["text"]
+                s.meta_info[name] = {}
+                s.variable_event[name].set()
+
+        self.spec_kwargs = {}
+        self.spec_format = []
+
     def generate_stream(
         self,
         s: StreamExecutor,
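These methods implement the API speculative execution path referenced by the new error message: when `num_api_spec_tokens` is set on the function, consecutive `sgl.gen` calls inside a single assistant turn are collected into one speculative API request, and `spec_pattern_match` splits the returned text back into the named variables at the end of the role. A hedged usage sketch; the prompt, variable names, and token budget are illustrative:

```python
import sglang as sgl

sgl.set_default_backend(sgl.OpenAI("gpt-3.5-turbo"))

# num_api_spec_tokens lets one API call speculate the whole assistant turn.
@sgl.function(num_api_spec_tokens=128)
def character_card(s, name):
    s += sgl.user(f"Create a short character card for {name}.")
    s += sgl.assistant(
        "Job: " + sgl.gen("job", stop="\n")
        + "\nHobby: " + sgl.gen("hobby", stop="\n")
    )

state = character_card.run(name="Ada")
print(state["job"], state["hobby"])
```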
@@ -145,7 +266,7 @@
     ):
         if sampling_params.dtype is None:
             if self.is_chat_model:
-                if not s.text_.endswith(self.
+                if not s.text_.endswith(self.chat_prefix):
                     raise RuntimeError(
                         "This use case is not supported. "
                         "For OpenAI chat models, sgl.gen must be right after sgl.assistant"
@@ -157,6 +278,7 @@
         kwargs = sampling_params.to_openai_kwargs()
         generator = openai_completion_stream(
             client=self.client,
+            token_usage=self.token_usage,
             is_chat=self.is_chat_model,
             model=self.model_name,
             prompt=prompt,
@@ -202,6 +324,8 @@
         )
         ret_str = ret.choices[0].text
         ret_token = self.tokenizer.encode(ret_str)[0]
+        self.token_usage.prompt_tokens += ret.usage.prompt_tokens
+        self.token_usage.completion_tokens= ret.usage.completion_tokens
 
         # TODO:
         # 1. return logits as the scores
@@ -231,7 +355,7 @@
         return decision, scores, None, None
 
 
-def openai_completion(client,
+def openai_completion(client, token_usage, is_chat=None, retries=3, prompt=None, **kwargs):
     for attempt in range(retries):
         try:
             if is_chat:
@@ -245,6 +369,9 @@ def openai_completion(client, retries=3, is_chat=None, prompt=None, **kwargs):
                 comp = [c.text for c in ret.choices]
             else:
                 comp = ret.choices[0].text
+
+            token_usage.prompt_tokens += ret.usage.prompt_tokens
+            token_usage.completion_tokens += ret.usage.completion_tokens
             break
         except (openai.APIError, openai.APIConnectionError, openai.RateLimitError) as e:
             logger.error(f"OpenAI Error: {e}. Waiting 5 seconds...")
@@ -258,16 +385,19 @@ def openai_completion(client, retries=3, is_chat=None, prompt=None, **kwargs):
     return comp
 
 
-def openai_completion_stream(client,
+def openai_completion_stream(client, token_usage, is_chat=None, retries=3, prompt=None, **kwargs):
     for attempt in range(retries):
         try:
             if is_chat:
                 if "stop" in kwargs and kwargs["stop"] is None:
                     kwargs.pop("stop")
                 generator = client.chat.completions.create(
-                    messages=prompt, stream=True,
+                    messages=prompt, stream=True, stream_options={"include_usage": True},
+                    **kwargs
                 )
                 for ret in generator:
+                    if len(ret.choices) == 0:
+                        continue
                     try:
                         content = ret.choices[0].delta.content
                     except IndexError:
@@ -275,11 +405,17 @@ def openai_completion_stream(client, retries=3, is_chat=None, prompt=None, **kwargs):
                         yield content or "", {}
             else:
                 generator = client.completions.create(
-                    prompt=prompt, stream=True,
+                    prompt=prompt, stream=True, stream_options={"include_usage": True},
+                    **kwargs
                 )
                 for ret in generator:
+                    if len(ret.choices) == 0:
+                        continue
                     content = ret.choices[0].text
                     yield content or "", {}
+
+            token_usage.prompt_tokens += ret.usage.prompt_tokens
+            token_usage.completion_tokens += ret.usage.completion_tokens
             break
         except (openai.APIError, openai.APIConnectionError, openai.RateLimitError) as e:
             logger.error(f"OpenAI Error: {e}. Waiting 5 seconds...")