sglang-0.1.17.tar.gz → sglang-0.1.18.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {sglang-0.1.17/sglang.egg-info → sglang-0.1.18}/PKG-INFO +29 -22
- {sglang-0.1.17 → sglang-0.1.18}/README.md +18 -10
- {sglang-0.1.17 → sglang-0.1.18}/pyproject.toml +6 -5
- {sglang-0.1.17 → sglang-0.1.18}/sglang/__init__.py +2 -2
- {sglang-0.1.17 → sglang-0.1.18}/sglang/api.py +4 -4
- {sglang-0.1.17 → sglang-0.1.18}/sglang/backend/litellm.py +2 -2
- {sglang-0.1.17 → sglang-0.1.18}/sglang/backend/openai.py +26 -15
- sglang-0.1.18/sglang/bench_latency.py +299 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/global_config.py +4 -1
- {sglang-0.1.17 → sglang-0.1.18}/sglang/lang/compiler.py +2 -2
- {sglang-0.1.17 → sglang-0.1.18}/sglang/lang/interpreter.py +1 -1
- {sglang-0.1.17 → sglang-0.1.18}/sglang/lang/ir.py +15 -5
- {sglang-0.1.17 → sglang-0.1.18}/sglang/launch_server.py +4 -1
- {sglang-0.1.17 → sglang-0.1.18}/sglang/launch_server_llavavid.py +2 -1
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/constrained/__init__.py +13 -6
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/constrained/fsm_cache.py +6 -3
- sglang-0.1.18/sglang/srt/constrained/jump_forward.py +164 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/conversation.py +2 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/flush_cache.py +2 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/hf_transformers_utils.py +64 -9
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/layers/fused_moe.py +186 -89
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/layers/logits_processor.py +53 -25
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/layers/radix_attention.py +34 -7
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/managers/controller/dp_worker.py +6 -3
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/managers/controller/infer_batch.py +142 -67
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/managers/controller/manager_multi.py +5 -5
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/managers/controller/manager_single.py +8 -3
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/managers/controller/model_runner.py +154 -54
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/managers/controller/radix_cache.py +4 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/managers/controller/schedule_heuristic.py +2 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/managers/controller/tp_worker.py +140 -135
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/managers/detokenizer_manager.py +15 -19
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/managers/io_struct.py +10 -4
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/managers/tokenizer_manager.py +14 -13
- sglang-0.1.18/sglang/srt/model_config.py +125 -0
- sglang-0.1.18/sglang/srt/models/chatglm.py +399 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/models/commandr.py +2 -2
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/models/gemma.py +5 -1
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/models/grok.py +204 -137
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/models/llama2.py +11 -4
- sglang-0.1.18/sglang/srt/models/llama_classification.py +104 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/models/llava.py +11 -8
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/models/llavavid.py +1 -1
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/models/mixtral.py +164 -115
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/models/mixtral_quant.py +0 -1
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/models/qwen.py +1 -1
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/models/qwen2.py +1 -1
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/models/stablelm.py +1 -1
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/models/yivl.py +2 -2
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/openai_api_adapter.py +33 -23
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/openai_protocol.py +1 -1
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/server.py +60 -19
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/server_args.py +79 -44
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/utils.py +146 -37
- {sglang-0.1.17 → sglang-0.1.18}/sglang/test/test_programs.py +28 -10
- {sglang-0.1.17 → sglang-0.1.18}/sglang/utils.py +4 -3
- {sglang-0.1.17 → sglang-0.1.18/sglang.egg-info}/PKG-INFO +29 -22
- {sglang-0.1.17 → sglang-0.1.18}/sglang.egg-info/SOURCES.txt +3 -6
- {sglang-0.1.17 → sglang-0.1.18}/sglang.egg-info/requires.txt +10 -11
- sglang-0.1.17/sglang/srt/constrained/jump_forward.py +0 -76
- sglang-0.1.17/sglang/srt/managers/router/infer_batch.py +0 -596
- sglang-0.1.17/sglang/srt/managers/router/manager.py +0 -82
- sglang-0.1.17/sglang/srt/managers/router/model_rpc.py +0 -818
- sglang-0.1.17/sglang/srt/managers/router/model_runner.py +0 -445
- sglang-0.1.17/sglang/srt/managers/router/radix_cache.py +0 -267
- sglang-0.1.17/sglang/srt/managers/router/scheduler.py +0 -59
- sglang-0.1.17/sglang/srt/model_config.py +0 -46
- {sglang-0.1.17 → sglang-0.1.18}/LICENSE +0 -0
- {sglang-0.1.17 → sglang-0.1.18}/setup.cfg +0 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/backend/__init__.py +0 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/backend/anthropic.py +0 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/backend/base_backend.py +0 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/backend/runtime_endpoint.py +0 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/backend/vertexai.py +0 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/lang/__init__.py +0 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/lang/chat_template.py +0 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/lang/tracer.py +0 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/constrained/base_cache.py +0 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/layers/context_flashattention_nopad.py +0 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/layers/extend_attention.py +0 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/layers/token_attention.py +0 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/memory_pool.py +0 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/mm_utils.py +0 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/models/dbrx.py +1 -1
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/models/mistral.py +0 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/srt/sampling_params.py +0 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/test/test_conversation.py +0 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/test/test_openai_protocol.py +0 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang/test/test_utils.py +0 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang.egg-info/dependency_links.txt +0 -0
- {sglang-0.1.17 → sglang-0.1.18}/sglang.egg-info/top_level.txt +0 -0
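The version bump lands in PKG-INFO, `pyproject.toml`, and `sglang/__init__.py` below; a quick way to confirm which release is installed (a trivial sketch assuming nothing beyond the `__version__` string shown in the diff):

```python
import sglang

# Prints "0.1.18" once the new release is installed, "0.1.17" beforehand.
print(sglang.__version__)
```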
{sglang-0.1.17/sglang.egg-info → sglang-0.1.18}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: sglang
-Version: 0.1.17
+Version: 0.1.18
 Summary: A structured generation langauge for LLMs.
 License: Apache License
         Version 2.0, January 2004

@@ -213,30 +213,29 @@ Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: requests
 Requires-Dist: tqdm
+Requires-Dist: numpy
 Provides-Extra: srt
 Requires-Dist: aiohttp; extra == "srt"
 Requires-Dist: fastapi; extra == "srt"
+Requires-Dist: hf_transfer; extra == "srt"
+Requires-Dist: huggingface_hub; extra == "srt"
+Requires-Dist: interegular; extra == "srt"
+Requires-Dist: packaging; extra == "srt"
+Requires-Dist: pillow; extra == "srt"
 Requires-Dist: psutil; extra == "srt"
+Requires-Dist: pydantic; extra == "srt"
 Requires-Dist: rpyc; extra == "srt"
 Requires-Dist: torch; extra == "srt"
-Requires-Dist: uvloop; extra == "srt"
 Requires-Dist: uvicorn; extra == "srt"
+Requires-Dist: uvloop; extra == "srt"
 Requires-Dist: zmq; extra == "srt"
-Requires-Dist: vllm==0.
-Requires-Dist:
-Requires-Dist: pydantic; extra == "srt"
-Requires-Dist: pillow; extra == "srt"
-Requires-Dist: packaging; extra == "srt"
-Requires-Dist: huggingface_hub; extra == "srt"
-Requires-Dist: hf_transfer; extra == "srt"
-Requires-Dist: outlines>=0.0.34; extra == "srt"
+Requires-Dist: vllm==0.5.0; extra == "srt"
+Requires-Dist: outlines>=0.0.44; extra == "srt"
 Provides-Extra: openai
 Requires-Dist: openai>=1.0; extra == "openai"
-Requires-Dist: numpy; extra == "openai"
 Requires-Dist: tiktoken; extra == "openai"
 Provides-Extra: anthropic
 Requires-Dist: anthropic>=0.20.0; extra == "anthropic"
-Requires-Dist: numpy; extra == "anthropic"
 Provides-Extra: litellm
 Requires-Dist: litellm>=1.0.0; extra == "litellm"
 Provides-Extra: all

@@ -257,8 +256,8 @@ SGLang is a structured generation language designed for large language models (L
 It makes your interaction with LLMs faster and more controllable by co-designing the frontend language and the runtime system.

 The core features include:
-- **
-- **
+- **Flexible Frontend Language**: Enables easy programming of LLM applications with chained generation calls, advanced prompting, control flow, multiple modalities, parallelism, and external interactions.
+- **High-Performance Backend Runtime**: Features RadixAttention for accelerating complex LLM programs by reusing the KV cache across multiple calls. It can also serve as a standalone engine with all common techniques implemented (e.g., continuous batching and tensor parallelism).

 ## News
 - [2024/02] 🔥 SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).

@@ -279,19 +278,27 @@ The core features include:
 ### Method 1: With pip
 ```
 pip install "sglang[all]"
+
+# Install FlashInfer CUDA kernels
+pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/
 ```

 ### Method 2: From source
 ```
-git clone
+git clone https://github.com/sgl-project/sglang.git
 cd sglang

 pip install --upgrade pip
 pip install -e "python[all]"
+
+# Install FlashInfer CUDA kernels
+pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/
 ```

 ### Notes
-- If you
+- If you see errors from the Triton compiler, please install the [Triton Nightly](https://triton-lang.org/main/getting-started/installation.html).
+- If you cannot install FlashInfer, check out its [installation](https://docs.flashinfer.ai/installation.html#) page. If you still cannot install it, you can use the slower Triton kernels by adding `--disable-flashinfer` when launching the server.
+- If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.

 ## Quick Start
 The example below shows how to use sglang to answer a mulit-turn question.

@@ -610,7 +617,6 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
 ```
 python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --mem-fraction-static 0.7
 ```
-- See [flashinfer.md](docs/flashinfer.md) on accelerating inference using highly optimized CUDA kernels.
 - See [hyperparameter_tuning.md](docs/hyperparameter_tuning.md) on tuning hyperparameters for better performance.

 ### Supported Models

@@ -643,17 +649,18 @@ Instructions for supporting a new model are [here](https://github.com/sgl-projec
 - Mixtral-8x7B on NVIDIA A10G, FP16, Tensor Parallelism=8
 

-Learn more [
+- Learn more about the above [results](docs/benchmark_results.md).
+- Synthetic latency and throughput benchmark [scripts](https://github.com/sgl-project/sglang/tree/main/benchmark/latency_throughput).

 ## Roadmap
 https://github.com/sgl-project/sglang/issues/157

 ## Citation And Acknowledgment
 ```
-@misc{
-title={
-author={Lianmin Zheng and Liangsheng Yin and Zhiqiang Xie and
-year={
+@misc{zheng2024sglang,
+title={SGLang: Efficient Execution of Structured Language Model Programs},
+author={Lianmin Zheng and Liangsheng Yin and Zhiqiang Xie and Chuyue Sun and Jeff Huang and Cody Hao Yu and Shiyi Cao and Christos Kozyrakis and Ion Stoica and Joseph E. Gonzalez and Clark Barrett and Ying Sheng},
+year={2024},
 eprint={2312.07104},
 archivePrefix={arXiv},
 primaryClass={cs.AI}
{sglang-0.1.17 → sglang-0.1.18}/README.md

@@ -10,8 +10,8 @@ SGLang is a structured generation language designed for large language models (L
 It makes your interaction with LLMs faster and more controllable by co-designing the frontend language and the runtime system.

 The core features include:
-- **
-- **
+- **Flexible Frontend Language**: Enables easy programming of LLM applications with chained generation calls, advanced prompting, control flow, multiple modalities, parallelism, and external interactions.
+- **High-Performance Backend Runtime**: Features RadixAttention for accelerating complex LLM programs by reusing the KV cache across multiple calls. It can also serve as a standalone engine with all common techniques implemented (e.g., continuous batching and tensor parallelism).

 ## News
 - [2024/02] 🔥 SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).

@@ -32,19 +32,27 @@ The core features include:
 ### Method 1: With pip
 ```
 pip install "sglang[all]"
+
+# Install FlashInfer CUDA kernels
+pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/
 ```

 ### Method 2: From source
 ```
-git clone
+git clone https://github.com/sgl-project/sglang.git
 cd sglang

 pip install --upgrade pip
 pip install -e "python[all]"
+
+# Install FlashInfer CUDA kernels
+pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/
 ```

 ### Notes
-- If you
+- If you see errors from the Triton compiler, please install the [Triton Nightly](https://triton-lang.org/main/getting-started/installation.html).
+- If you cannot install FlashInfer, check out its [installation](https://docs.flashinfer.ai/installation.html#) page. If you still cannot install it, you can use the slower Triton kernels by adding `--disable-flashinfer` when launching the server.
+- If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.

 ## Quick Start
 The example below shows how to use sglang to answer a mulit-turn question.

@@ -363,7 +371,6 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
 ```
 python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --mem-fraction-static 0.7
 ```
-- See [flashinfer.md](docs/flashinfer.md) on accelerating inference using highly optimized CUDA kernels.
 - See [hyperparameter_tuning.md](docs/hyperparameter_tuning.md) on tuning hyperparameters for better performance.

 ### Supported Models

@@ -396,17 +403,18 @@ Instructions for supporting a new model are [here](https://github.com/sgl-projec
 - Mixtral-8x7B on NVIDIA A10G, FP16, Tensor Parallelism=8
 

-Learn more [
+- Learn more about the above [results](docs/benchmark_results.md).
+- Synthetic latency and throughput benchmark [scripts](https://github.com/sgl-project/sglang/tree/main/benchmark/latency_throughput).

 ## Roadmap
 https://github.com/sgl-project/sglang/issues/157

 ## Citation And Acknowledgment
 ```
-@misc{
-title={
-author={Lianmin Zheng and Liangsheng Yin and Zhiqiang Xie and
-year={
+@misc{zheng2024sglang,
+title={SGLang: Efficient Execution of Structured Language Model Programs},
+author={Lianmin Zheng and Liangsheng Yin and Zhiqiang Xie and Chuyue Sun and Jeff Huang and Cody Hao Yu and Shiyi Cao and Christos Kozyrakis and Ion Stoica and Joseph E. Gonzalez and Clark Barrett and Ying Sheng},
+year={2024},
 eprint={2312.07104},
 archivePrefix={arXiv},
 primaryClass={cs.AI}
{sglang-0.1.17 → sglang-0.1.18}/pyproject.toml

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "sglang"
-version = "0.1.17"
+version = "0.1.18"
 description = "A structured generation langauge for LLMs."
 readme = "README.md"
 requires-python = ">=3.8"

@@ -16,13 +16,14 @@ classifiers = [
 dependencies = [
     "requests",
     "tqdm",
+    "numpy",
 ]

 [project.optional-dependencies]
-srt = ["aiohttp", "fastapi", "
-       "
-openai = ["openai>=1.0", "
-anthropic = ["anthropic>=0.20.0"
+srt = ["aiohttp", "fastapi", "hf_transfer", "huggingface_hub", "interegular", "packaging", "pillow",
+       "psutil", "pydantic", "rpyc", "torch", "uvicorn", "uvloop", "zmq", "vllm==0.5.0", "outlines>=0.0.44"]
+openai = ["openai>=1.0", "tiktoken"]
+anthropic = ["anthropic>=0.20.0"]
 litellm = ["litellm>=1.0.0"]
 all = ["sglang[srt]", "sglang[openai]", "sglang[anthropic]", "sglang[litellm]"]

{sglang-0.1.17 → sglang-0.1.18}/sglang/__init__.py

@@ -1,4 +1,4 @@
-__version__ = "0.1.17"
+__version__ = "0.1.18"

 # SGL API Components
 from sglang.api import (

@@ -24,10 +24,10 @@ from sglang.api import (

 # SGL Backends
 from sglang.backend.anthropic import Anthropic
+from sglang.backend.litellm import LiteLLM
 from sglang.backend.openai import OpenAI
 from sglang.backend.runtime_endpoint import RuntimeEndpoint
 from sglang.backend.vertexai import VertexAI
-from sglang.backend.litellm import LiteLLM

 # Global Configurations
 from sglang.global_config import global_config
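The import reordering above keeps `LiteLLM`, `OpenAI`, `RuntimeEndpoint`, and `VertexAI` re-exported from the package top level. A minimal sketch of how these exports are typically used together with `set_default_backend` (the model name, prompt, and function name are illustrative, not taken from this diff):

```python
import sglang as sgl

# Any of the re-exported backends can be set as the default; OpenAI is used here
# only as an illustration (requires OPENAI_API_KEY). The model name is a placeholder.
sgl.set_default_backend(sgl.OpenAI("gpt-3.5-turbo"))

@sgl.function
def qa(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

state = qa.run(question="What does RadixAttention reuse across calls?")
print(state["answer"])
```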
{sglang-0.1.17 → sglang-0.1.18}/sglang/api.py

@@ -1,4 +1,4 @@
-"""
+"""Public APIs of the language."""

 import os
 import re

@@ -43,14 +43,14 @@ def set_default_backend(backend: BaseBackend):
     global_config.default_backend = backend


-def flush_cache(backend: BaseBackend = None):
+def flush_cache(backend: Optional[BaseBackend] = None):
     backend = backend or global_config.default_backend
     if backend is None:
         return False
     return backend.flush_cache()


-def get_server_args(backend: BaseBackend = None):
+def get_server_args(backend: Optional[BaseBackend] = None):
     backend = backend or global_config.default_backend
     if backend is None:
         return None

@@ -158,7 +158,7 @@ def video(path: str, num_frames: int):

 def select(
     name: Optional[str] = None,
-    choices: List[str] = None,
+    choices: Optional[List[str]] = None,
     temperature: float = 0.0,
 ):
     assert choices is not None
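`select()` above now annotates `choices` as `Optional[List[str]]` but still asserts that a list is actually passed. A short sketch of how the primitive is typically called (function and field names are illustrative; a default backend is assumed to be set already):

```python
import sglang as sgl

@sgl.function
def classify(s, review):
    s += "Review: " + review + "\nSentiment: "
    # choices must still be provided explicitly; the Optional annotation only
    # formalizes that the argument defaults to None.
    s += sgl.select("label", choices=["positive", "negative"], temperature=0.0)

state = classify.run(review="Fast installs and even faster decoding.")
print(state["label"])
```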
{sglang-0.1.17 → sglang-0.1.18}/sglang/backend/litellm.py

@@ -13,7 +13,6 @@ except ImportError as e:


 class LiteLLM(BaseBackend):
-
     def __init__(
         self,
         model_name,

@@ -33,7 +32,8 @@ class LiteLLM(BaseBackend):
         self.model_name = model_name

         self.chat_template = chat_template or get_chat_template_by_model_path(
-            model_name)
+            model_name
+        )

         self.client_params = {
             "api_key": api_key,
{sglang-0.1.17 → sglang-0.1.18}/sglang/backend/openai.py

@@ -1,7 +1,7 @@
+import dataclasses
 import logging
 import time
 import warnings
-import dataclasses
 from typing import Callable, List, Optional, Union

 import numpy as np

@@ -105,14 +105,16 @@ class OpenAI(BaseBackend):
     def get_chat_template(self):
         return self.chat_template

-    def _prepare_spec_execution(
-
+    def _prepare_spec_execution(
+        self,
+        sampling_params: SglSamplingParams,
+        num_api_spec_tokens: int,
+        spec_var_name: str,
+    ):
         if "max_tokens" not in self.spec_kwargs:
             self.spec_kwargs["max_tokens"] = num_api_spec_tokens
         else:
-            assert (
-                self.spec_kwargs["max_tokens"] == num_api_spec_tokens
-            )
+            assert self.spec_kwargs["max_tokens"] == num_api_spec_tokens

         params = sampling_params.to_openai_kwargs()
         for key, value in params.items():

@@ -151,8 +153,9 @@ class OpenAI(BaseBackend):
                 )
                 prompt = s.messages_
             else:
-                return self._prepare_spec_execution(
-                    s.num_api_spec_tokens, spec_var_name
+                return self._prepare_spec_execution(
+                    sampling_params, s.num_api_spec_tokens, spec_var_name
+                )
         else:
             prompt = s.text_

@@ -325,7 +328,7 @@ class OpenAI(BaseBackend):
         ret_str = ret.choices[0].text
         ret_token = self.tokenizer.encode(ret_str)[0]
         self.token_usage.prompt_tokens += ret.usage.prompt_tokens
-        self.token_usage.completion_tokens= ret.usage.completion_tokens
+        self.token_usage.completion_tokens = ret.usage.completion_tokens

         # TODO:
         # 1. return logits as the scores

@@ -355,7 +358,9 @@ class OpenAI(BaseBackend):
         return decision, scores, None, None


-def openai_completion(
+def openai_completion(
+    client, token_usage, is_chat=None, retries=3, prompt=None, **kwargs
+):
     for attempt in range(retries):
         try:
             if is_chat:

@@ -385,15 +390,19 @@ def openai_completion(client, token_usage, is_chat=None, retries=3, prompt=None,
             return comp


-def openai_completion_stream(
+def openai_completion_stream(
+    client, token_usage, is_chat=None, retries=3, prompt=None, **kwargs
+):
     for attempt in range(retries):
         try:
             if is_chat:
                 if "stop" in kwargs and kwargs["stop"] is None:
                     kwargs.pop("stop")
                 generator = client.chat.completions.create(
-                    messages=prompt,
-
+                    messages=prompt,
+                    stream=True,
+                    stream_options={"include_usage": True},
+                    **kwargs,
                 )
                 for ret in generator:
                     if len(ret.choices) == 0:

@@ -405,8 +414,10 @@ def openai_completion_stream(client, token_usage, is_chat=None, retries=3, promp
                     yield content or "", {}
             else:
                 generator = client.completions.create(
-                    prompt=prompt,
-
+                    prompt=prompt,
+                    stream=True,
+                    stream_options={"include_usage": True},
+                    **kwargs,
                 )
                 for ret in generator:
                     if len(ret.choices) == 0: