sglang 0.3.1.post1.tar.gz → 0.3.1.post2.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {sglang-0.3.1.post1/sglang.egg-info → sglang-0.3.1.post2}/PKG-INFO +4 -5
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/README.md +3 -4
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/pyproject.toml +1 -1
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/bench_latency.py +3 -1
- sglang-0.3.1.post2/sglang/bench_server_latency.py +187 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/bench_serving.py +1 -1
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/layers/activation.py +6 -3
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/layers/layernorm.py +10 -7
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/layers/sampler.py +9 -2
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/managers/io_struct.py +3 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/managers/policy_scheduler.py +49 -93
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/managers/schedule_batch.py +1 -1
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/managers/tp_worker.py +11 -6
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/model_executor/cuda_graph_runner.py +15 -14
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/model_executor/model_runner.py +13 -5
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/deepseek_v2.py +2 -2
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/llama.py +1 -3
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/llama_classification.py +2 -3
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/minicpm3.py +2 -2
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/xverse.py +1 -3
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/xverse_moe.py +1 -4
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/server_args.py +17 -21
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/test/few_shot_gsm8k.py +8 -2
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/test/test_utils.py +1 -0
- sglang-0.3.1.post2/sglang/version.py +1 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2/sglang.egg-info}/PKG-INFO +4 -5
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang.egg-info/SOURCES.txt +1 -0
- sglang-0.3.1.post1/sglang/version.py +0 -1
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/LICENSE +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/setup.cfg +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/__init__.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/api.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/check_env.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/global_config.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/lang/__init__.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/lang/backend/__init__.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/lang/backend/anthropic.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/lang/backend/base_backend.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/lang/backend/litellm.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/lang/backend/openai.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/lang/backend/runtime_endpoint.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/lang/backend/vertexai.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/lang/chat_template.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/lang/choices.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/lang/compiler.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/lang/interpreter.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/lang/ir.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/lang/tracer.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/launch_server.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/launch_server_llavavid.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/configs/__init__.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/configs/exaone.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/configs/model_config.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/constrained/__init__.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/constrained/base_tool_cache.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/constrained/fsm_cache.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/constrained/jump_forward.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/conversation.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/hf_transformers_utils.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/layers/attention_backend.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/layers/flashinfer_utils.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/layers/fused_moe/__init__.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/layers/fused_moe/fused_moe.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/layers/fused_moe/layer.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/layers/logits_processor.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/layers/pooler.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/layers/radix_attention.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/layers/torchao_utils.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/layers/triton_attention/decode_attention.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/layers/triton_attention/extend_attention.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/layers/triton_attention/prefill_attention.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/lora/lora.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/lora/lora_config.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/lora/lora_manager.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/managers/controller_multi.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/managers/controller_single.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/managers/detokenizer_manager.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/managers/tokenizer_manager.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/mem_cache/base_prefix_cache.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/mem_cache/chunk_cache.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/mem_cache/flush_cache.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/mem_cache/memory_pool.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/mem_cache/radix_cache.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/mm_utils.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/model_executor/forward_batch_info.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/baichuan.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/chatglm.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/commandr.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/dbrx.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/deepseek.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/exaone.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/gemma.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/gemma2.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/gpt_bigcode.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/grok.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/internlm2.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/llama_embedding.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/llava.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/llavavid.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/minicpm.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/mistral.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/mixtral.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/mixtral_quant.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/olmoe.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/qwen.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/qwen2.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/qwen2_moe.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/stablelm.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/models/yivl.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/openai_api/adapter.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/openai_api/protocol.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/sampling/penaltylib/__init__.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/sampling/penaltylib/orchestrator.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/sampling/penaltylib/penalizers/frequency_penalty.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/sampling/penaltylib/penalizers/min_new_tokens.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/sampling/penaltylib/penalizers/presence_penalty.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/sampling/penaltylib/penalizers/repetition_penalty.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/sampling/sampling_batch_info.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/sampling/sampling_params.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/server.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/utils.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/test/run_eval.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/test/runners.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/test/simple_eval_common.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/test/simple_eval_gpqa.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/test/simple_eval_humaneval.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/test/simple_eval_math.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/test/simple_eval_mgsm.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/test/simple_eval_mmlu.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/test/srt/sampling/penaltylib/utils.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/test/test_activation.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/test/test_layernorm.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/test/test_programs.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/utils.py +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang.egg-info/dependency_links.txt +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang.egg-info/requires.txt +0 -0
- {sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang.egg-info/top_level.txt +0 -0
{sglang-0.3.1.post1/sglang.egg-info → sglang-0.3.1.post2}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: sglang
-Version: 0.3.1.post1
+Version: 0.3.1.post2
 Summary: SGLang is yet another fast serving framework for large language models and vision language models.
 License: Apache License
 Version 2.0, January 2004
@@ -269,7 +269,7 @@ Requires-Dist: sglang[test]; extra == "dev"
 
 --------------------------------------------------------------------------------
 
-| [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/) | [**Paper**](https://arxiv.org/abs/2312.07104) | [**Slack**](https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw) |
+| [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/) | [**Paper**](https://arxiv.org/abs/2312.07104) | [**Join Slack**](https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw) | [**Join Weekly Development Meeting**](https://calendar.app.google/v2Tw3kuHkKYyp8VV7) |
 
 SGLang is a fast serving framework for large language models and vision language models.
 It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
@@ -278,7 +278,7 @@ The core features include:
 - **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, jump-forward constrained decoding, continuous batching, token attention (paged attention), tensor parallelism, FlashInfer kernels, chunked prefill, and quantization (INT4/FP8/AWQ/GPTQ).
 - **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
 - **Extensive Model Support**: Supports a wide range of generative models (Llama 3, Gemma 2, Mistral, QWen, DeepSeek, LLaVA, etc.) and embedding models (e5-mistral), with easy extensibility for integrating new models.
-- **Active Community**: SGLang is open-source and backed by an active community with industry adoption
+- **Active Community**: SGLang is open-source and backed by an active community with industry adoption.
 
 ## News
 - [2024/09] 🔥 SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
@@ -318,7 +318,7 @@ pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
 ### Method 2: From source
 ```
 # Use the last release branch
-git clone -b v0.3.1.post1 https://github.com/sgl-project/sglang.git
+git clone -b v0.3.1.post2 https://github.com/sgl-project/sglang.git
 cd sglang
 
 pip install --upgrade pip
@@ -483,7 +483,6 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 - To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes.
 - To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
 - To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
-- To enable DeepSeek MLA acceleration, add `--enable-mla`.
 - If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](docs/en/custom_chat_template.md).
 - To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port.
 ```
{sglang-0.3.1.post1 → sglang-0.3.1.post2}/README.md

@@ -11,7 +11,7 @@
 
 --------------------------------------------------------------------------------
 
-| [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/) | [**Paper**](https://arxiv.org/abs/2312.07104) | [**Slack**](https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw) |
+| [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/) | [**Paper**](https://arxiv.org/abs/2312.07104) | [**Join Slack**](https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw) | [**Join Weekly Development Meeting**](https://calendar.app.google/v2Tw3kuHkKYyp8VV7) |
 
 SGLang is a fast serving framework for large language models and vision language models.
 It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
@@ -20,7 +20,7 @@ The core features include:
 - **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, jump-forward constrained decoding, continuous batching, token attention (paged attention), tensor parallelism, FlashInfer kernels, chunked prefill, and quantization (INT4/FP8/AWQ/GPTQ).
 - **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
 - **Extensive Model Support**: Supports a wide range of generative models (Llama 3, Gemma 2, Mistral, QWen, DeepSeek, LLaVA, etc.) and embedding models (e5-mistral), with easy extensibility for integrating new models.
-- **Active Community**: SGLang is open-source and backed by an active community with industry adoption
+- **Active Community**: SGLang is open-source and backed by an active community with industry adoption.
 
 ## News
 - [2024/09] 🔥 SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
@@ -60,7 +60,7 @@ pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
 ### Method 2: From source
 ```
 # Use the last release branch
-git clone -b v0.3.1.post1 https://github.com/sgl-project/sglang.git
+git clone -b v0.3.1.post2 https://github.com/sgl-project/sglang.git
 cd sglang
 
 pip install --upgrade pip
@@ -225,7 +225,6 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 - To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes.
 - To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
 - To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
-- To enable DeepSeek MLA acceleration, add `--enable-mla`.
 - If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](docs/en/custom_chat_template.md).
 - To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port.
 ```
{sglang-0.3.1.post1 → sglang-0.3.1.post2}/pyproject.toml

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "sglang"
-version = "0.3.1.post1"
+version = "0.3.1.post2"
 description = "SGLang is yet another fast serving framework for large language models and vision language models."
 readme = "README.md"
 requires-python = ">=3.8"
{sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/bench_latency.py

@@ -1,5 +1,7 @@
 """
-Benchmark the latency of a
+Benchmark the latency of running a single static batch.
+This script does not launch a server and uses the low-level APIs.
+It accepts arguments similar to those of launch_server.py.
 
 # Usage (latency test)
 ## with dummy weights:
sglang-0.3.1.post2/sglang/bench_server_latency.py (new file)

@@ -0,0 +1,187 @@
+"""
+Benchmark the latency of serving a single batch with a real server.
+This script launches a server and uses the HTTP interface.
+It accepts arguments similar to those of launch_server.py.
+
+Usage:
+
+python3 -m sglang.bench_server_latency --model meta-llama/Meta-Llama-3.1-8B --batch-size 1 16 64 --input-len 1024 --output-len 8
+"""
+
+import argparse
+import dataclasses
+import itertools
+import json
+import multiprocessing
+import os
+import time
+from typing import Tuple
+
+import numpy as np
+import requests
+
+from sglang.srt.server import launch_server
+from sglang.srt.server_args import ServerArgs
+from sglang.srt.utils import kill_child_process
+
+
+@dataclasses.dataclass
+class BenchArgs:
+    run_name: str = "default"
+    batch_size: Tuple[int] = (1,)
+    input_len: Tuple[int] = (1024,)
+    output_len: Tuple[int] = (16,)
+    result_filename: str = "result.jsonl"
+
+    @staticmethod
+    def add_cli_args(parser: argparse.ArgumentParser):
+        parser.add_argument("--run-name", type=str, default=BenchArgs.run_name)
+        parser.add_argument(
+            "--batch-size", type=int, nargs="+", default=BenchArgs.batch_size
+        )
+        parser.add_argument(
+            "--input-len", type=int, nargs="+", default=BenchArgs.input_len
+        )
+        parser.add_argument(
+            "--output-len", type=int, nargs="+", default=BenchArgs.output_len
+        )
+        parser.add_argument(
+            "--result-filename", type=str, default=BenchArgs.result_filename
+        )
+
+    @classmethod
+    def from_cli_args(cls, args: argparse.Namespace):
+        # use the default value's type to case the args into correct types.
+        attrs = [(attr.name, type(attr.default)) for attr in dataclasses.fields(cls)]
+        return cls(
+            **{attr: attr_type(getattr(args, attr)) for attr, attr_type in attrs}
+        )
+
+
+def launch_server_internal(server_args):
+    try:
+        launch_server(server_args)
+    except Exception as e:
+        raise e
+    finally:
+        kill_child_process(os.getpid(), including_parent=False)
+
+
+def launch_server_process(server_args: ServerArgs):
+    proc = multiprocessing.Process(target=launch_server_internal, args=(server_args,))
+    proc.start()
+    base_url = f"http://{server_args.host}:{server_args.port}"
+    timeout = 600
+
+    start_time = time.time()
+    while time.time() - start_time < timeout:
+        try:
+            headers = {
+                "Content-Type": "application/json; charset=utf-8",
+            }
+            response = requests.get(f"{base_url}/v1/models", headers=headers)
+            if response.status_code == 200:
+                return proc, base_url
+        except requests.RequestException:
+            pass
+        time.sleep(10)
+    raise TimeoutError("Server failed to start within the timeout period.")
+
+
+def run_one_case(
+    url: str,
+    batch_size: int,
+    input_len: int,
+    output_len: int,
+    run_name: str,
+    result_filename: str,
+):
+    input_ids = [
+        [int(x) for x in np.random.randint(0, high=16384, size=(input_len,))]
+        for _ in range(batch_size)
+    ]
+
+    tic = time.time()
+    response = requests.post(
+        url + "/generate",
+        json={
+            "input_ids": input_ids,
+            "sampling_params": {
+                "temperature": 0,
+                "max_new_tokens": output_len,
+                "ignore_eos": True,
+            },
+        },
+    )
+    latency = time.time() - tic
+
+    _ = response.json()
+    output_throughput = batch_size * output_len / latency
+    overall_throughput = batch_size * (input_len + output_len) / latency
+
+    print(f"batch size: {batch_size}")
+    print(f"latency: {latency:.2f} s")
+    print(f"output throughput: {output_throughput:.2f} token/s")
+    print(f"(input + output) throughput: {overall_throughput:.2f} token/s")
+
+    if result_filename:
+        with open(result_filename, "a") as fout:
+            res = {
+                "run_name": run_name,
+                "batch_size": batch_size,
+                "input_len": input_len,
+                "output_len": output_len,
+                "latency": round(latency, 4),
+                "output_throughput": round(output_throughput, 2),
+                "overall_throughput": round(overall_throughput, 2),
+            }
+            fout.write(json.dumps(res) + "\n")
+
+
+def run_benchmark(server_args: ServerArgs, bench_args: BenchArgs):
+    proc, base_url = launch_server_process(server_args)
+
+    # warmup
+    run_one_case(
+        base_url,
+        batch_size=16,
+        input_len=1024,
+        output_len=16,
+        run_name="",
+        result_filename="",
+    )
+
+    # benchmark
+    try:
+        for bs, il, ol in itertools.product(
+            bench_args.batch_size, bench_args.input_len, bench_args.output_len
+        ):
+            run_one_case(
+                base_url,
+                bs,
+                il,
+                ol,
+                bench_args.run_name,
+                bench_args.result_filename,
+            )
+    finally:
+        kill_child_process(proc.pid)
+
+    print(f"\nResults are saved to {bench_args.result_filename}")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    ServerArgs.add_cli_args(parser)
+    BenchArgs.add_cli_args(parser)
+    # For this script, model-path is not required
+    assert (
+        parser._actions[1].option_strings[0] == "--model-path"
+    ), "options changed, this code need to be updated"
+    parser._actions[1].required = False
+    args = parser.parse_args()
+
+    server_args = ServerArgs.from_cli_args(args)
+    bench_args = BenchArgs.from_cli_args(args)
+
+    run_benchmark(server_args, bench_args)
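One detail of the new script worth calling out: `BenchArgs.from_cli_args` casts each parsed argparse value with the type of the corresponding dataclass default, so values passed via `nargs="+"` come back as tuples. A minimal standalone sketch of the same idiom (the `DemoArgs` class and flag values below are illustrative, not part of the package):

```python
import argparse
import dataclasses
from typing import Tuple


@dataclasses.dataclass
class DemoArgs:
    batch_size: Tuple[int] = (1,)
    input_len: Tuple[int] = (1024,)

    @classmethod
    def from_cli_args(cls, args: argparse.Namespace):
        # Cast each parsed value with the type of the field's default,
        # so nargs="+" lists become tuples again.
        attrs = [(f.name, type(f.default)) for f in dataclasses.fields(cls)]
        return cls(**{name: typ(getattr(args, name)) for name, typ in attrs})


parser = argparse.ArgumentParser()
parser.add_argument("--batch-size", type=int, nargs="+", default=DemoArgs.batch_size)
parser.add_argument("--input-len", type=int, nargs="+", default=DemoArgs.input_len)
demo = DemoArgs.from_cli_args(parser.parse_args(["--batch-size", "1", "16", "64"]))
print(demo)  # DemoArgs(batch_size=(1, 16, 64), input_len=(1024,))
```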
{sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/bench_serving.py

@@ -2,7 +2,7 @@
 # Adapted from https://github.com/vllm-project/vllm/blob/6366efc67b0aedd2c1721c14385370e50b297fb3/benchmarks/benchmark_serving.py
 
 """
-Benchmark online serving.
+Benchmark online serving with dynamic requests.
 
 Usage:
 python3 -m sglang.bench_serving --backend sglang --num-prompt 10
{sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/layers/activation.py

@@ -19,7 +19,12 @@ from typing import Optional
 import torch
 import torch.nn as nn
 import torch.nn.functional as F
-from flashinfer.activation import gelu_and_mul, gelu_tanh_and_mul, silu_and_mul
+
+from sglang.srt.utils import is_hip
+
+if not is_hip():
+    from flashinfer.activation import gelu_and_mul, gelu_tanh_and_mul, silu_and_mul
+
 from vllm.distributed import (
     divide,
     get_tensor_model_parallel_rank,
@@ -29,8 +34,6 @@ from vllm.model_executor.custom_op import CustomOp
 from vllm.model_executor.layers.quantization import QuantizationConfig
 from vllm.model_executor.utils import set_weight_attrs
 
-from sglang.srt.utils import is_hip
-
 logger = logging.getLogger(__name__)
 
 
{sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/layers/layernorm.py

@@ -20,16 +20,19 @@ from typing import Optional, Tuple, Union
 
 import torch
 import torch.nn as nn
-from flashinfer.norm import (
-    fused_add_rmsnorm,
-    gemma_fused_add_rmsnorm,
-    gemma_rmsnorm,
-    rmsnorm,
-)
-from vllm.model_executor.custom_op import CustomOp
 
 from sglang.srt.utils import is_hip
 
+if not is_hip():
+    from flashinfer.norm import (
+        fused_add_rmsnorm,
+        gemma_fused_add_rmsnorm,
+        gemma_rmsnorm,
+        rmsnorm,
+    )
+
+from vllm.model_executor.custom_op import CustomOp
+
 logger = logging.getLogger(__name__)
 
 
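The activation.py and layernorm.py hunks above apply the same pattern: FlashInfer kernels are imported only when `is_hip()` is false, so ROCm builds can import these modules without FlashInfer installed. Below is a hedged sketch of that guarded-import idiom with a plain-PyTorch fallback; the dispatch function and reference RMSNorm are illustrative rather than SGLang's actual code path, and the `rmsnorm` call signature is an assumption:

```python
import torch


def is_hip() -> bool:
    # Mirrors sglang.srt.utils.is_hip: True on ROCm builds of PyTorch.
    return torch.version.hip is not None


_flashinfer_rmsnorm = None
if not is_hip():
    try:
        from flashinfer.norm import rmsnorm as _flashinfer_rmsnorm
    except ImportError:
        pass  # fall back to the reference implementation below


def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    if _flashinfer_rmsnorm is not None and x.is_cuda:
        return _flashinfer_rmsnorm(x, weight, eps)  # assumed signature
    # Reference fallback in plain PyTorch.
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight
```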
{sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/layers/sampler.py

@@ -31,8 +31,11 @@ class Sampler(nn.Module):
         logits = logits.next_token_logits
 
         # Post process logits
+        logits = logits.contiguous()
         logits.div_(sampling_info.temperatures)
-        probs =
+        probs = torch.softmax(logits, dim=-1)
+        logits = None
+        del logits
 
         if torch.any(torch.isnan(probs)):
             logger.warning("Detected errors during sampling! NaN in the probability.")
@@ -53,7 +56,11 @@ class Sampler(nn.Module):
             )
         else:
             batch_next_token_ids, success = top_k_top_p_sampling_from_probs(
-                probs,
+                probs,
+                uniform_samples,
+                sampling_info.top_ks,
+                sampling_info.top_ps,
+                filter_apply_order="joint",
             )
 
         if not torch.all(success):
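For context on the sampler.py change: the path it touches makes the logits contiguous, scales by temperature, applies a softmax, and then samples with a joint top-k/top-p filter (hence `filter_apply_order="joint"`). The sketch below reproduces that flow in plain PyTorch purely as an illustration; it is not the FlashInfer `top_k_top_p_sampling_from_probs` kernel the server calls, and the tensor shapes are assumptions:

```python
import torch


def sample_joint_top_k_top_p(logits, temperatures, top_ks, top_ps):
    # Temperature scaling and softmax, as in the diff above.
    logits = logits.contiguous()
    logits.div_(temperatures)
    probs = torch.softmax(logits, dim=-1)

    # Joint filter: a token survives only if it passes both top-k and top-p.
    sorted_probs, sorted_idx = torch.sort(probs, dim=-1, descending=True)
    cumsum = torch.cumsum(sorted_probs, dim=-1)
    ranks = torch.arange(probs.shape[-1], device=probs.device).expand_as(probs)
    keep = (ranks < top_ks.unsqueeze(-1)) & (
        (cumsum - sorted_probs) < top_ps.unsqueeze(-1)
    )
    keep[..., 0] = True  # always keep the most likely token

    filtered = sorted_probs * keep
    filtered = filtered / filtered.sum(dim=-1, keepdim=True)
    sampled = torch.multinomial(filtered, num_samples=1)
    return sorted_idx.gather(-1, sampled).squeeze(-1)


if __name__ == "__main__":
    torch.manual_seed(0)
    logits = torch.randn(2, 32000)          # [batch, vocab]; assumed shapes
    temperatures = torch.full((2, 1), 0.7)
    top_ks = torch.tensor([40, 1])
    top_ps = torch.tensor([0.95, 1.0])
    print(sample_joint_top_k_top_p(logits, temperatures, top_ks, top_ps))
```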
{sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/managers/io_struct.py

@@ -133,6 +133,9 @@ class GenerateReqInput:
                 self.image_data = [None] * num
             elif not isinstance(self.image_data, list):
                 self.image_data = [self.image_data] * num
+            elif isinstance(self.image_data, list):
+                # multi-image with n > 1
+                self.image_data = self.image_data * num
 
             if self.sampling_params is None:
                 self.sampling_params = [{}] * num
{sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/managers/policy_scheduler.py

@@ -119,19 +119,32 @@ class PrefillAdder:
         self.running_batch = running_batch
         self.new_token_ratio = new_token_ratio
         self.rem_total_tokens = rem_total_tokens - mixed_with_decode_tokens
-        self.rem_total_tokens_ = self.rem_total_tokens
-        self.total_tokens = rem_total_tokens
         self.rem_input_tokens = rem_input_tokens - mixed_with_decode_tokens
         self.rem_chunk_tokens = rem_chunk_tokens
         if self.rem_chunk_tokens is not None:
             self.rem_chunk_tokens -= mixed_with_decode_tokens
 
+        self.cur_rem_tokens = rem_total_tokens - mixed_with_decode_tokens
+
         self.req_states = None
         self.can_run_list = []
         self.new_inflight_req = None
         self.log_hit_tokens = 0
         self.log_input_tokens = 0
 
+        if running_batch is not None:
+            # Pre-remove the tokens which will be occupied by the running requests
+            self.rem_total_tokens -= sum(
+                [
+                    min(
+                        (r.sampling_params.max_new_tokens - len(r.output_ids)),
+                        CLIP_MAX_NEW_TOKENS,
+                    )
+                    * self.new_token_ratio
+                    for r in running_batch.reqs
+                ]
+            )
+
     def no_remaining_tokens(self):
         return (
             self.rem_total_tokens <= 0
@@ -141,31 +154,14 @@ class PrefillAdder:
             if self.rem_chunk_tokens is not None
             else False
         )
-
-
-    def remove_running_tokens(self, running_batch: ScheduleBatch):
-        self.rem_total_tokens -= sum(
-            [
-                min(
-                    (r.sampling_params.max_new_tokens - len(r.output_ids)),
-                    CLIP_MAX_NEW_TOKENS,
-                )
-                * self.new_token_ratio
-                for r in running_batch.reqs
-            ]
-        )
-        self.rem_total_tokens_ -= sum(
-            [
-                r.sampling_params.max_new_tokens - len(r.output_ids)
-                for r in running_batch.reqs
-            ]
+            or self.cur_rem_tokens <= 0
         )
 
     def _prefill_one_req(
         self, prefix_len: int, extend_input_len: int, max_new_tokens: int
     ):
         self.rem_total_tokens -= extend_input_len + max_new_tokens
-        self.
+        self.cur_rem_tokens -= extend_input_len
         self.rem_input_tokens -= extend_input_len
         if self.rem_chunk_tokens is not None:
             self.rem_chunk_tokens -= extend_input_len
@@ -173,29 +169,7 @@ class PrefillAdder:
         self.log_hit_tokens += prefix_len
         self.log_input_tokens += extend_input_len
 
-    def add_inflight_req_ignore_eos(self, req: Req):
-        truncated = req.extend_input_len > self.rem_chunk_tokens
-        req.extend_input_len = min(req.extend_input_len, self.rem_chunk_tokens)
-        req.fill_ids = req.fill_ids[: len(req.prefix_indices) + req.extend_input_len]
-        self.can_run_list.append(req)
-
-        self._prefill_one_req(
-            0,
-            req.extend_input_len,
-            (
-                min(req.sampling_params.max_new_tokens, CLIP_MAX_NEW_TOKENS)
-                if not truncated
-                else 0
-            ),
-        )
-
-        # Return if chunked prefill not finished
-        return req if truncated else None
-
     def add_inflight_req(self, req: Req):
-        if req.sampling_params.ignore_eos:
-            return self.add_inflight_req_ignore_eos(req)
-
         truncated = req.extend_input_len > self.rem_chunk_tokens
         req.extend_input_len = min(req.extend_input_len, self.rem_chunk_tokens)
         req.fill_ids = req.fill_ids[: len(req.prefix_indices) + req.extend_input_len]
@@ -225,7 +199,7 @@ class PrefillAdder:
         self.rem_total_tokens += delta
 
     def add_one_req_ignore_eos(self, req: Req):
-        def
+        def add_req_state(r, insert_sort=False):
             new_token_ratio = (
                 1.0 if r.sampling_params.ignore_eos else self.new_token_ratio
             )
@@ -235,56 +209,38 @@ class PrefillAdder:
             tokens_occupied = len(r.origin_input_ids) + len(r.output_ids)
 
             if tokens_left > 0:
-
-
-
-
-
-        can_run = False
-        if (
-            req.extend_input_len + req.sampling_params.max_new_tokens
-            <= self.rem_total_tokens
-        ):
-            can_run = True
-
-        if not can_run:
-            if self.req_states is None:
-                self.req_states = []
-                if self.running_batch is not None:
-                    for r in self.running_batch.reqs:
-                        state = get_req_state(r)
-                        if state is not None:
-                            self.req_states.append(state)
-                for r in self.can_run_list:
-                    state = get_req_state(r)
-                    if state is not None:
-                        self.req_states.append(state)
-                state = get_req_state(req)
-                if state is not None:
-                    self.req_states.append(state)
-
-                self.req_states.sort(key=lambda x: x[0])
-            else:
-                state = get_req_state(req)
-                if state is not None:
-                    for i, (tokens_left, tokens_occupied) in enumerate(self.req_states):
-                        if tokens_left >= state[0]:
-                            self.req_states.insert(i, state)
+                if not insert_sort:
+                    self.req_states.append((tokens_left, tokens_occupied))
+                else:
+                    for i in range(len(self.req_states)):
+                        if tokens_left <= self.req_states[i][0]:
                             break
-
-
-
-
-
-
-
-
-
-            )
-
-
-
-
+                    self.req_states.insert(i, (tokens_left, tokens_occupied))
+
+        if self.req_states is None:
+            self.req_states = []
+            add_req_state(req)
+            if self.running_batch is not None:
+                for r in self.running_batch.reqs:
+                    add_req_state(r)
+            for r in self.can_run_list:
+                add_req_state(r)
+            self.req_states.sort(key=lambda x: x[0])
+        else:
+            add_req_state(req, insert_sort=True)
+
+        cur_rem_tokens = self.cur_rem_tokens - len(req.origin_input_ids)
+        tokens_freed = 0
+        for i, (tokens_left, tokens_occupied) in enumerate(self.req_states):
+            decode_steps = (
+                self.req_states[i + 1][0]
+                if i + 1 < len(self.req_states)
+                else tokens_left
+            )
+            bs = len(self.req_states) - i
+            if cur_rem_tokens + tokens_freed - decode_steps * bs <= 0:
+                return False
+            tokens_freed += tokens_occupied
 
         if req.extend_input_len <= self.rem_chunk_tokens:
             self.can_run_list.append(req)
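The rewritten `add_one_req_ignore_eos` above replaces the old `can_run` shortcut with a small simulation: each request is summarized as a `(tokens_left, tokens_occupied)` pair, the pairs are kept sorted by `tokens_left`, and the new request is admitted only if the remaining KV-cache budget stays positive while the batch drains, counting the tokens each finishing request frees. A self-contained sketch of that check, with made-up numbers:

```python
# Standalone sketch of the admission check added to PrefillAdder above.
# req_states holds (tokens_left, tokens_occupied) pairs sorted by tokens_left;
# cur_rem_tokens is the KV-cache budget left after the new prompt is counted.
from typing import List, Tuple


def fits_in_budget(req_states: List[Tuple[int, int]], cur_rem_tokens: int) -> bool:
    tokens_freed = 0
    for i, (tokens_left, tokens_occupied) in enumerate(req_states):
        # Decode until the next-shortest request finishes (or this one does).
        decode_steps = req_states[i + 1][0] if i + 1 < len(req_states) else tokens_left
        bs = len(req_states) - i  # requests still running during these steps
        if cur_rem_tokens + tokens_freed - decode_steps * bs <= 0:
            return False
        tokens_freed += tokens_occupied  # this request's KV cache is released
    return True


# Illustrative numbers only.
print(fits_in_budget([(8, 100), (32, 400), (64, 900)], cur_rem_tokens=300))  # True
print(fits_in_budget([(512, 4000)], cur_rem_tokens=100))                     # False
```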
{sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/managers/schedule_batch.py

@@ -40,7 +40,7 @@ global_server_args_dict = {
     "attention_backend": ServerArgs.attention_backend,
     "sampling_backend": ServerArgs.sampling_backend,
     "triton_attention_reduce_in_fp32": ServerArgs.triton_attention_reduce_in_fp32,
-    "
+    "disable_mla": ServerArgs.disable_mla,
     "torchao_config": ServerArgs.torchao_config,
 }
 
{sglang-0.3.1.post1 → sglang-0.3.1.post2}/sglang/srt/managers/tp_worker.py

@@ -445,9 +445,6 @@ class ModelTpServer:
             num_mixed_running,
         )
 
-        if self.running_batch is not None:
-            adder.remove_running_tokens(self.running_batch)
-
         has_inflight = self.current_inflight_req is not None
         if self.current_inflight_req is not None:
             self.current_inflight_req.init_next_round_input(
@@ -465,9 +462,6 @@ class ModelTpServer:
             )
 
         for req in self.waiting_queue:
-            if adder.no_remaining_tokens():
-                break
-            req.init_next_round_input(None if prefix_computed else self.tree_cache)
             if (
                 self.lora_paths is not None
                 and len(
@@ -478,6 +472,10 @@ class ModelTpServer:
                 > self.max_loras_per_batch
             ):
                 break
+
+            if adder.no_remaining_tokens():
+                break
+            req.init_next_round_input(None if prefix_computed else self.tree_cache)
             res = adder.add_one_req(req)
             if (
                 not res
@@ -507,6 +505,11 @@ class ModelTpServer:
         else:
            tree_cache_hit_rate = 0.0
 
+        num_used = self.max_total_num_tokens - (
+            self.token_to_kv_pool.available_size()
+            + self.tree_cache.evictable_size()
+        )
+
         if num_mixed_running > 0:
             logger.info(
                 f"Prefill batch"
@@ -515,6 +518,7 @@ class ModelTpServer:
                 f"#new-token: {adder.log_input_tokens}, "
                 f"#cached-token: {adder.log_hit_tokens}, "
                 f"cache hit rate: {100.0 * tree_cache_hit_rate:.2f}%, "
+                f"token usage: {num_used / self.max_total_num_tokens:.2f}, "
                 f"#queue-req: {len(self.waiting_queue) - len(can_run_list) + has_inflight}"
             )
         else:
@@ -524,6 +528,7 @@ class ModelTpServer:
                 f"#new-token: {adder.log_input_tokens}, "
                 f"#cached-token: {adder.log_hit_tokens}, "
                 f"cache hit rate: {100.0 * tree_cache_hit_rate:.2f}%, "
+                f"token usage: {num_used / self.max_total_num_tokens:.2f}, "
                 f"#running-req: {running_bs}, "
                 f"#queue-req: {len(self.waiting_queue) - len(can_run_list) + has_inflight}"
             )