sglang 0.2.5.tar.gz → 0.2.6.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {sglang-0.2.5/sglang.egg-info → sglang-0.2.6}/PKG-INFO +9 -7
- {sglang-0.2.5 → sglang-0.2.6}/README.md +8 -6
- {sglang-0.2.5 → sglang-0.2.6}/pyproject.toml +1 -1
- {sglang-0.2.5 → sglang-0.2.6}/sglang/lang/backend/runtime_endpoint.py +4 -4
- {sglang-0.2.5 → sglang-0.2.6}/sglang/lang/interpreter.py +4 -4
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/constrained/fsm_cache.py +21 -1
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/hf_transformers_utils.py +3 -1
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/layers/logits_processor.py +70 -61
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/layers/radix_attention.py +5 -2
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/layers/token_attention.py +1 -1
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/managers/controller/cuda_graph_runner.py +26 -17
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/managers/controller/infer_batch.py +54 -13
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/managers/controller/model_runner.py +22 -7
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/managers/controller/tp_worker.py +47 -41
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/managers/io_struct.py +2 -2
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/managers/tokenizer_manager.py +62 -43
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/model_config.py +5 -0
- sglang-0.2.6/sglang/srt/models/deepseek_v2.py +517 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/models/llama_classification.py +3 -3
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/openai_api/adapter.py +33 -33
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/openai_api/protocol.py +1 -1
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/sampling_params.py +5 -4
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/server.py +2 -15
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/server_args.py +28 -7
- {sglang-0.2.5 → sglang-0.2.6}/sglang/test/test_programs.py +5 -1
- sglang-0.2.6/sglang/version.py +1 -0
- {sglang-0.2.5 → sglang-0.2.6/sglang.egg-info}/PKG-INFO +9 -7
- {sglang-0.2.5 → sglang-0.2.6}/sglang.egg-info/SOURCES.txt +1 -0
- sglang-0.2.5/sglang/version.py +0 -1
- {sglang-0.2.5 → sglang-0.2.6}/LICENSE +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/setup.cfg +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/__init__.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/api.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/bench_latency.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/bench_serving.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/check_env.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/global_config.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/lang/__init__.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/lang/backend/__init__.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/lang/backend/anthropic.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/lang/backend/base_backend.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/lang/backend/litellm.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/lang/backend/openai.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/lang/backend/vertexai.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/lang/chat_template.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/lang/compiler.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/lang/ir.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/lang/tracer.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/launch_server.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/launch_server_llavavid.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/constrained/__init__.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/constrained/base_cache.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/constrained/jump_forward.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/conversation.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/flush_cache.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/layers/context_flashattention_nopad.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/layers/extend_attention.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/layers/fused_moe.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/layers/linear.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/layers/quantization/__init__.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/layers/quantization/fp8.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/managers/controller/manager_multi.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/managers/controller/manager_single.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/managers/controller/radix_cache.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/managers/controller/schedule_heuristic.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/managers/detokenizer_manager.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/memory_pool.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/mm_utils.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/model_loader/model_loader.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/model_loader/utils.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/models/chatglm.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/models/commandr.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/models/dbrx.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/models/deepseek.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/models/gemma.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/models/gemma2.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/models/gpt_bigcode.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/models/grok.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/models/internlm2.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/models/llama2.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/models/llava.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/models/llavavid.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/models/minicpm.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/models/mistral.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/models/mixtral.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/models/mixtral_quant.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/models/qwen.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/models/qwen2.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/models/qwen2_moe.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/models/stablelm.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/models/yivl.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/srt/utils.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/test/test_conversation.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/test/test_openai_protocol.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/test/test_utils.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang/utils.py +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang.egg-info/dependency_links.txt +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang.egg-info/requires.txt +0 -0
- {sglang-0.2.5 → sglang-0.2.6}/sglang.egg-info/top_level.txt +0 -0

{sglang-0.2.5/sglang.egg-info → sglang-0.2.6}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: sglang
-Version: 0.2.5
+Version: 0.2.6
 Summary: SGLang is yet another fast serving framework for large language models and vision language models.
 License: Apache License
                                 Version 2.0, January 2004
@@ -249,7 +249,7 @@ Requires-Dist: sglang[litellm]; extra == "all"

 --------------------------------------------------------------------------------

-| [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/) | [**Paper**](https://arxiv.org/abs/2312.07104) |
+| [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/) | [**Paper**](https://arxiv.org/abs/2312.07104) | [**Slack**](https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw) |

 SGLang is a fast serving framework for large language models and vision language models.
 It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
@@ -404,16 +404,17 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 ### Run Llama 3.1 405B

 ```bash
-
+## Run 405B (fp8) on a single node
+python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8
+
+## Run 405B (fp16) on two nodes
 # replace the `172.16.4.52:20000` with your own first node ip address and port, disable CUDA Graph temporarily
+
 # on the first node
 GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 0 --disable-cuda-graph --mem-frac 0.75

 # on the second
 GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 1 --disable-cuda-graph --mem-frac 0.75
-
-# single node run 405B fp8
-python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8
 ```

 ### Supported Models
@@ -422,6 +423,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instr
 - Mistral / Mixtral
 - Gemma / Gemma 2
 - Qwen / Qwen 2 / Qwen 2 MoE
+- DeepSeek / DeepSeek 2
 - LLaVA 1.5 / 1.6
   - `python -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
   - `python -m sglang.launch_server --model-path liuhaotian/llava-v1.6-vicuna-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
@@ -442,7 +444,7 @@ Instructions for supporting a new model are [here](https://github.com/sgl-projec

 ### Benchmark Performance

-- Benchmark a single static batch
+- Benchmark a single static batch by running the following command without launching a server. The arguments are the same as those for `launch_server.py`. This is not a dynamic batching server, so it may run out of memory for a batch size that can run successfully with a real server. This is because a real server will truncate the prefill into several batches/chunks, while this unit test does not do this.
 ```
 python -m sglang.bench_latency --model-path meta-llama/Meta-Llama-3-8B-Instruct --batch 32 --input-len 256 --output-len 32
 ```

{sglang-0.2.5 → sglang-0.2.6}/README.md

@@ -4,7 +4,7 @@

 --------------------------------------------------------------------------------

-| [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/) | [**Paper**](https://arxiv.org/abs/2312.07104) |
+| [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/) | [**Paper**](https://arxiv.org/abs/2312.07104) | [**Slack**](https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw) |

 SGLang is a fast serving framework for large language models and vision language models.
 It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
@@ -159,16 +159,17 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 ### Run Llama 3.1 405B

 ```bash
-
+## Run 405B (fp8) on a single node
+python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8
+
+## Run 405B (fp16) on two nodes
 # replace the `172.16.4.52:20000` with your own first node ip address and port, disable CUDA Graph temporarily
+
 # on the first node
 GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 0 --disable-cuda-graph --mem-frac 0.75

 # on the second
 GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 1 --disable-cuda-graph --mem-frac 0.75
-
-# single node run 405B fp8
-python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8
 ```

 ### Supported Models
@@ -177,6 +178,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instr
 - Mistral / Mixtral
 - Gemma / Gemma 2
 - Qwen / Qwen 2 / Qwen 2 MoE
+- DeepSeek / DeepSeek 2
 - LLaVA 1.5 / 1.6
   - `python -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
   - `python -m sglang.launch_server --model-path liuhaotian/llava-v1.6-vicuna-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
@@ -197,7 +199,7 @@ Instructions for supporting a new model are [here](https://github.com/sgl-projec

 ### Benchmark Performance

-- Benchmark a single static batch
+- Benchmark a single static batch by running the following command without launching a server. The arguments are the same as those for `launch_server.py`. This is not a dynamic batching server, so it may run out of memory for a batch size that can run successfully with a real server. This is because a real server will truncate the prefill into several batches/chunks, while this unit test does not do this.
 ```
 python -m sglang.bench_latency --model-path meta-llama/Meta-Llama-3-8B-Instruct --batch 32 --input-len 256 --output-len 32
 ```

{sglang-0.2.5 → sglang-0.2.6}/pyproject.toml

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "sglang"
-version = "0.2.5"
+version = "0.2.6"
 description = "SGLang is yet another fast serving framework for large language models and vision language models."
 readme = "README.md"
 requires-python = ">=3.8"

{sglang-0.2.5 → sglang-0.2.6}/sglang/lang/backend/runtime_endpoint.py

@@ -253,14 +253,14 @@ class RuntimeEndpoint(BaseBackend):
            r["meta_info"]["normalized_prompt_logprob"] for r in obj
        ]
        decision = choices[np.argmax(normalized_prompt_logprobs)]
-        prefill_token_logprobs = [r["meta_info"]["prefill_token_logprobs"] for r in obj]
-        decode_token_logprobs = [r["meta_info"]["decode_token_logprobs"] for r in obj]
+        input_token_logprobs = [r["meta_info"]["input_token_logprobs"] for r in obj]
+        output_token_logprobs = [r["meta_info"]["output_token_logprobs"] for r in obj]

        return (
            decision,
            normalized_prompt_logprobs,
-            prefill_token_logprobs,
-            decode_token_logprobs,
+            input_token_logprobs,
+            output_token_logprobs,
        )

    def concatenate_and_append(self, src_rids: List[str], dst_rid: str):

{sglang-0.2.5 → sglang-0.2.6}/sglang/lang/interpreter.py

@@ -541,16 +541,16 @@ class StreamExecutor:
        (
            decision,
            normalized_prompt_logprobs,
-            prefill_token_logprobs,
-            decode_token_logprobs,
+            input_token_logprobs,
+            output_token_logprobs,
        ) = self.backend.select(self, expr.choices, expr.temperature)
        if expr.name is not None:
            name = expr.name
            self.variables[name] = decision
            self.meta_info[name] = {
                "normalized_prompt_logprobs": normalized_prompt_logprobs,
-                "prefill_token_logprobs": prefill_token_logprobs,
-                "decode_token_logprobs": decode_token_logprobs,
+                "input_token_logprobs": input_token_logprobs,
+                "output_token_logprobs": output_token_logprobs,
            }
            self.variable_event[name].set()
        self.text_ += decision

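The two frontend hunks above only rename the logprob fields that `select` records in `meta_info`. Below is a hedged sketch of how the new key names surface to user code, assuming a locally running server and that the program state still exposes this metadata through `get_meta_info` as in earlier releases (the example program itself is illustrative, not taken from the diff):

```python
import sglang as sgl

# Assumption: an sglang server is already running locally on port 30000.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))


@sgl.function
def pick_tool(s, question):
    s += question
    s += sgl.select("tool", choices=["calculator", "search"])


state = pick_tool.run(question="What is 2 + 2? The best tool is: ")
info = state.get_meta_info("tool")  # assumed accessor, unchanged by this release
print(info["normalized_prompt_logprobs"])
print(info["input_token_logprobs"])   # named "prefill_token_logprobs" before 0.2.6
print(info["output_token_logprobs"])  # named "decode_token_logprobs" before 0.2.6
```
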
{sglang-0.2.5 → sglang-0.2.6}/sglang/srt/constrained/fsm_cache.py

@@ -21,7 +21,27 @@ class FSMCache(BaseCache):
            tokenizer = AutoTokenizer.from_pretrained(
                tokenizer_path, **tokenizer_args_dict
            )
-            self.outlines_tokenizer = TransformerTokenizer(tokenizer)
+            try:
+                self.outlines_tokenizer = TransformerTokenizer(tokenizer)
+            except AttributeError:
+                # FIXME: tmp fix for chatglm2 & chatglm3 (pad_token_id=0)
+                origin_pad_token_id = tokenizer.pad_token_id
+
+                def fset(self, value):
+                    self._value = value
+
+                type(tokenizer).pad_token_id = property(
+                    fget=type(tokenizer).pad_token_id.fget, fset=fset
+                )
+                self.outlines_tokenizer = TransformerTokenizer(tokenizer)
+                self.outlines_tokenizer.tokenizer.pad_token_id = origin_pad_token_id
+                self.outlines_tokenizer.pad_token_id = origin_pad_token_id
+                self.outlines_tokenizer.pad_token = (
+                    self.outlines_tokenizer.tokenizer.pad_token
+                )
+                self.outlines_tokenizer.vocabulary = (
+                    self.outlines_tokenizer.tokenizer.get_vocab()
+                )
        else:
            self.outlines_tokenizer = TransformerTokenizer(
                tokenizer_path, **tokenizer_args_dict

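The `except AttributeError` branch above works around tokenizers whose `pad_token_id` is exposed as a read-only property, which outlines' `TransformerTokenizer` then tries to assign to. A minimal standalone sketch of that property-setter trick on a toy class (the class below is illustrative only, not the Hugging Face tokenizer API):

```python
class ToyTokenizer:
    """Stand-in for a tokenizer whose pad_token_id is a read-only property."""

    @property
    def pad_token_id(self):
        return getattr(self, "_value", 0)


tok = ToyTokenizer()
# tok.pad_token_id = 2  # would raise AttributeError: can't set attribute


def fset(self, value):
    self._value = value


# Re-install the property with the original getter plus a setter, as the hunk does.
type(tok).pad_token_id = property(fget=type(tok).pad_token_id.fget, fset=fset)
tok.pad_token_id = 2
print(tok.pad_token_id)  # 2
```
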
{sglang-0.2.5 → sglang-0.2.6}/sglang/srt/hf_transformers_utils.py

@@ -73,7 +73,9 @@ def get_context_length(config):
    rope_scaling = getattr(config, "rope_scaling", None)
    if rope_scaling:
        rope_scaling_factor = config.rope_scaling["factor"]
-        if
+        if "original_max_position_embeddings" in rope_scaling:
+            rope_scaling_factor = 1
+        if config.rope_scaling.get("rope_type", None) == "llama3":
            rope_scaling_factor = 1
    else:
        rope_scaling_factor = 1

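The hunk above only decides the effective rope scaling factor; how that factor feeds into the reported context length is sketched below as an assumption-level illustration (this is not the sglang function itself, just the factor logic from the hunk applied to `max_position_embeddings`):

```python
def sketch_context_length(max_position_embeddings, rope_scaling):
    # Mirror of the factor selection in the hunk above.
    rope_scaling_factor = 1
    if rope_scaling:
        rope_scaling_factor = rope_scaling["factor"]
        if "original_max_position_embeddings" in rope_scaling:
            rope_scaling_factor = 1
        if rope_scaling.get("rope_type", None) == "llama3":
            rope_scaling_factor = 1
    # Assumption: the factor multiplies the model's max_position_embeddings.
    return int(rope_scaling_factor * max_position_embeddings)


# Llama 3.1 style config: the factor is ignored, so the result stays at 131072.
print(sketch_context_length(131072, {"rope_type": "llama3", "factor": 8.0}))
# Yarn-style config carrying original_max_position_embeddings: factor ignored as well.
print(sketch_context_length(163840, {"factor": 40, "original_max_position_embeddings": 4096}))
```
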
{sglang-0.2.5 → sglang-0.2.6}/sglang/srt/layers/logits_processor.py

@@ -1,7 +1,7 @@
 """Logits processing."""

 import dataclasses
-from typing import List, Union
+from typing import List, Optional, Union

 import torch
 from torch import nn
@@ -22,23 +22,23 @@ class LogitProcessorOutput:

    # The normlaized logprobs of prompts. shape: [#seq]
    normalized_prompt_logprobs: torch.Tensor
-    # The logprobs of prefill tokens. shape: [#token, vocab_size]
-    prefill_token_logprobs: torch.Tensor
+    # The logprobs of input tokens. shape: [#token, vocab_size]
+    input_token_logprobs: torch.Tensor

-    # The logprob and id of the top-k tokens in prefill positions. shape [#seq, #token, k] of Tuple(logprob, token_id)
-    prefill_top_logprobs: List
-    # The logprob and id of the top-k tokens in decode positions. shape [#seq, #token, k] of Tuple(logprob, token_id)
-    decode_top_logprobs: List
+    # The logprob and id of the top-k tokens in input positions. shape [#seq, #token, k] of Tuple(logprob, token_id)
+    input_top_logprobs: List
+    # The logprob and id of the top-k tokens in output positions. shape [#seq, #token, k] of Tuple(logprob, token_id)
+    output_top_logprobs: List


 @dataclasses.dataclass
 class LogitsMetadata:
    forward_mode: ForwardMode
-    return_logprob: bool
+    return_logprob: bool = False

-    extend_seq_lens: torch.Tensor = None
-    extend_start_loc: torch.Tensor = None
-    top_logprobs_nums: List[int] = None
+    extend_seq_lens: Optional[torch.Tensor] = None
+    extend_start_loc: Optional[torch.Tensor] = None
+    top_logprobs_nums: Optional[List[int]] = None

    @classmethod
    def from_input_metadata(cls, input_metadata: InputMetadata):
@@ -58,20 +58,16 @@ class LogitsProcessor(nn.Module):
        self.tp_size = get_tensor_model_parallel_world_size()

    def _get_normalized_prompt_logprobs(
-        self, prefill_token_logprobs, logits_metadata: LogitsMetadata
+        self, input_token_logprobs, logits_metadata: LogitsMetadata
    ):
-        logprobs_cumsum = torch.cumsum(
-            prefill_token_logprobs, dim=0, dtype=torch.float32
-        )
+        logprobs_cumsum = torch.cumsum(input_token_logprobs, dim=0, dtype=torch.float32)

        start = logits_metadata.extend_start_loc.clone()
        end = start + logits_metadata.extend_seq_lens - 2
-        start.clamp_(min=0, max=prefill_token_logprobs.shape[0] - 1)
-        end.clamp_(min=0, max=prefill_token_logprobs.shape[0] - 1)
+        start.clamp_(min=0, max=input_token_logprobs.shape[0] - 1)
+        end.clamp_(min=0, max=input_token_logprobs.shape[0] - 1)
        sum_logp = (
-            logprobs_cumsum[end]
-            - logprobs_cumsum[start]
-            + prefill_token_logprobs[start]
+            logprobs_cumsum[end] - logprobs_cumsum[start] + input_token_logprobs[start]
        )
        normalized_prompt_logprobs = sum_logp / (
            (logits_metadata.extend_seq_lens - 1).clamp(min=1)
@@ -79,37 +75,38 @@ class LogitsProcessor(nn.Module):

        return normalized_prompt_logprobs

-    def _get_top_logprobs(self, all_logprobs, logits_metadata: LogitsMetadata):
+    @staticmethod
+    def get_top_logprobs(all_logprobs, logits_metadata: LogitsMetadata):
        # TODO: vectorize the code below
        if logits_metadata.forward_mode == ForwardMode.DECODE:
-            decode_top_logprobs = []
+            output_top_logprobs = []
            for i in range(all_logprobs.shape[0]):
                k = logits_metadata.top_logprobs_nums[i]
                t = all_logprobs[i].topk(k)
                v_cpu = t.values.tolist()
                p_cpu = t.indices.tolist()
-                decode_top_logprobs.append(list(zip(v_cpu, p_cpu)))
-            return None, decode_top_logprobs
+                output_top_logprobs.append(list(zip(v_cpu, p_cpu)))
+            return None, output_top_logprobs
        else:
-            prefill_top_logprobs, decode_top_logprobs = [], []
+            input_top_logprobs, output_top_logprobs = [], []
            pt = 0
            extend_seq_lens_cpu = logits_metadata.extend_seq_lens.tolist()
            for i, extend_seq_len in enumerate(extend_seq_lens_cpu):
                if extend_seq_len == 0:
-                    prefill_top_logprobs.append([])
-                    decode_top_logprobs.append([])
+                    input_top_logprobs.append([])
+                    output_top_logprobs.append([])
                    continue
                k = logits_metadata.top_logprobs_nums[i]
                t = all_logprobs[pt : pt + extend_seq_len].topk(k)
                vs_cpu = t.values.tolist()
                ps_cpu = t.indices.tolist()
-                prefill_top_logprobs.append(
+                input_top_logprobs.append(
                    [list(zip(vs_cpu[j], ps_cpu[j])) for j in range(len(vs_cpu) - 1)]
                )
-                decode_top_logprobs.append(list(zip(vs_cpu[-1], ps_cpu[-1])))
+                output_top_logprobs.append(list(zip(vs_cpu[-1], ps_cpu[-1])))
                pt += extend_seq_len

-            return prefill_top_logprobs, decode_top_logprobs
+            return input_top_logprobs, output_top_logprobs

    def forward(
        self,
@@ -136,7 +133,7 @@ class LogitsProcessor(nn.Module):
        last_logits = torch.matmul(last_hidden, weight.T)
        if self.tp_size > 1:
            last_logits = tensor_model_parallel_all_gather(last_logits)
-        last_logits = last_logits[:, : self.config.vocab_size]
+        last_logits = last_logits[:, : self.config.vocab_size].float()

        if hasattr(self.config, "final_logit_softcapping"):
            last_logits /= self.config.final_logit_softcapping
@@ -149,63 +146,75 @@ class LogitsProcessor(nn.Module):
                next_token_logits=last_logits,
                next_token_logprobs=None,
                normalized_prompt_logprobs=None,
-                prefill_token_logprobs=None,
-                prefill_top_logprobs=None,
-                decode_top_logprobs=None,
+                input_token_logprobs=None,
+                input_top_logprobs=None,
+                output_top_logprobs=None,
            )
        else:
            # When logprob is requested, compute the logits for all tokens.
            if logits_metadata.forward_mode == ForwardMode.DECODE:
-                all_logits = last_logits
-            else:
-                all_logits = torch.matmul(hidden_states, weight.T)
-                if self.tp_size > 1:
-                    all_logits = tensor_model_parallel_all_gather(all_logits)
-                all_logits = all_logits[:, : self.config.vocab_size]
-
-            all_logprobs = all_logits.float()
-            del all_logits
-            all_logprobs[:] = torch.nn.functional.log_softmax(all_logprobs, dim=-1)
+                last_logprobs = torch.nn.functional.log_softmax(last_logits, dim=-1)

-            # Get the logprob of top-k tokens
-            return_top_logprob = any(x > 0 for x in logits_metadata.top_logprobs_nums)
-            if return_top_logprob:
-                prefill_top_logprobs, decode_top_logprobs = self._get_top_logprobs(
-                    all_logprobs, logits_metadata
+                # Get the logprob of top-k tokens
+                return_top_logprob = any(
+                    x > 0 for x in logits_metadata.top_logprobs_nums
                )
-            else:
-                prefill_top_logprobs = decode_top_logprobs = None
+                if return_top_logprob:
+                    output_top_logprobs = self.get_top_logprobs(
+                        last_logprobs, logits_metadata
+                    )[1]
+                else:
+                    output_top_logprobs = None

-            if logits_metadata.forward_mode == ForwardMode.DECODE:
                return LogitProcessorOutput(
                    next_token_logits=last_logits,
-                    next_token_logprobs=
+                    next_token_logprobs=last_logprobs,
                    normalized_prompt_logprobs=None,
-                    prefill_token_logprobs=None,
-                    prefill_top_logprobs=None,
-                    decode_top_logprobs=decode_top_logprobs,
+                    input_token_logprobs=None,
+                    input_top_logprobs=None,
+                    output_top_logprobs=output_top_logprobs,
                )
            else:
+                all_logits = torch.matmul(hidden_states, weight.T)
+                if self.tp_size > 1:
+                    all_logits = tensor_model_parallel_all_gather(all_logits)
+                all_logits = all_logits[:, : self.config.vocab_size].float()
+
+                all_logprobs = all_logits
+                del all_logits
+                all_logprobs[:] = torch.nn.functional.log_softmax(all_logprobs, dim=-1)
+
+                # Get the logprob of top-k tokens
+                return_top_logprob = any(
+                    x > 0 for x in logits_metadata.top_logprobs_nums
+                )
+                if return_top_logprob:
+                    input_top_logprobs, output_top_logprobs = self.get_top_logprobs(
+                        all_logprobs, logits_metadata
+                    )
+                else:
+                    input_top_logprobs = output_top_logprobs = None
+
                last_logprobs = all_logprobs[last_index]

                # Compute the logprobs and normalized logprobs for the prefill tokens.
                # Note that we pad a zero at the end of each sequence for easy computation.
-                prefill_token_logprobs = all_logprobs[
+                input_token_logprobs = all_logprobs[
                    torch.arange(all_logprobs.shape[0], device="cuda"),
                    torch.cat([input_ids[1:], torch.tensor([0], device="cuda")]),
                ]

                normalized_prompt_logprobs = self._get_normalized_prompt_logprobs(
-                    prefill_token_logprobs, logits_metadata
+                    input_token_logprobs, logits_metadata
                )

                return LogitProcessorOutput(
                    next_token_logits=last_logits,
                    next_token_logprobs=last_logprobs,
                    normalized_prompt_logprobs=normalized_prompt_logprobs,
-                    prefill_token_logprobs=prefill_token_logprobs,
-                    prefill_top_logprobs=prefill_top_logprobs,
-                    decode_top_logprobs=decode_top_logprobs,
+                    input_token_logprobs=input_token_logprobs,
+                    input_top_logprobs=input_top_logprobs,
+                    output_top_logprobs=output_top_logprobs,
                )

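For reference, the cumulative-sum arithmetic used by `_get_normalized_prompt_logprobs` above can be checked in isolation. The snippet below is a self-contained illustration with made-up values (two requests of three prompt tokens each), not sglang code:

```python
import torch

# Per-token logprobs for two concatenated requests, flattened to shape [#token].
input_token_logprobs = torch.tensor([-0.5, -1.0, -0.25, -2.0, -0.75, -1.5])
extend_start_loc = torch.tensor([0, 3])  # where each request starts
extend_seq_lens = torch.tensor([3, 3])   # prompt length of each request

logprobs_cumsum = torch.cumsum(input_token_logprobs, dim=0, dtype=torch.float32)
start = extend_start_loc.clone()
end = start + extend_seq_lens - 2
start.clamp_(min=0, max=input_token_logprobs.shape[0] - 1)
end.clamp_(min=0, max=input_token_logprobs.shape[0] - 1)

# Sum of each request's logprobs over [start, end] without a Python loop.
sum_logp = logprobs_cumsum[end] - logprobs_cumsum[start] + input_token_logprobs[start]
normalized = sum_logp / (extend_seq_lens - 1).clamp(min=1)
print(normalized)  # tensor([-0.7500, -1.3750])
```
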
{sglang-0.2.5 → sglang-0.2.6}/sglang/srt/layers/radix_attention.py

@@ -7,8 +7,11 @@ from torch import nn
 from sglang.global_config import global_config
 from sglang.srt.layers.extend_attention import extend_attention_fwd
 from sglang.srt.layers.token_attention import token_attention_fwd
-from sglang.srt.managers.controller.model_runner import ForwardMode, InputMetadata
-
+from sglang.srt.managers.controller.model_runner import (
+    ForwardMode,
+    InputMetadata,
+    global_server_args_dict,
+)


 class RadixAttention(nn.Module):

{sglang-0.2.5 → sglang-0.2.6}/sglang/srt/layers/token_attention.py

@@ -5,7 +5,7 @@ import torch
 import triton
 import triton.language as tl

-from sglang.srt.
+from sglang.srt.managers.controller.infer_batch import global_server_args_dict

 if global_server_args_dict.get("attention_reduce_in_fp32", False):
    REDUCE_TRITON_TYPE = tl.float32

{sglang-0.2.5 → sglang-0.2.6}/sglang/srt/managers/controller/cuda_graph_runner.py

@@ -9,7 +9,11 @@ from flashinfer.decode import _grouped_size_compiled_for_decode_kernels
 from vllm.distributed.parallel_state import graph_capture
 from vllm.model_executor.custom_op import CustomOp

-from sglang.srt.layers.logits_processor import LogitProcessorOutput
+from sglang.srt.layers.logits_processor import (
+    LogitProcessorOutput,
+    LogitsMetadata,
+    LogitsProcessor,
+)
 from sglang.srt.managers.controller.infer_batch import (
    Batch,
    ForwardMode,
@@ -185,7 +189,6 @@ class CudaGraphRunner:

    def replay(self, batch: Batch):
        assert batch.out_cache_loc is not None
-        assert not batch.return_logprob
        raw_bs = len(batch.reqs)

        # Pad
@@ -218,23 +221,29 @@ class CudaGraphRunner:
        output = self.output_buffers[bs]

        # Unpad
-        if bs == raw_bs:
-            return output
-        else:
+        if bs != raw_bs:
            output = LogitProcessorOutput(
                next_token_logits=output.next_token_logits[:raw_bs],
-                next_token_logprobs=(
-                    output.next_token_logprobs[:raw_bs]
-                    if output.next_token_logprobs is not None
-                    else None
-                ),
+                next_token_logprobs=None,
                normalized_prompt_logprobs=None,
-                prefill_token_logprobs=None,
-                prefill_top_logprobs=None,
-                decode_top_logprobs=(
-                    output.decode_top_logprobs[:raw_bs]
-                    if output.decode_top_logprobs is not None
-                    else None
-                ),
+                input_token_logprobs=None,
+                input_top_logprobs=None,
+                output_top_logprobs=None,
            )
+
+        # Extract logprobs
+        if batch.return_logprob:
+            output.next_token_logprobs = torch.nn.functional.log_softmax(
+                output.next_token_logits, dim=-1
+            )
+            return_top_logprob = any(x > 0 for x in batch.top_logprobs_nums)
+            if return_top_logprob:
+                logits_metadata = LogitsMetadata(
+                    forward_mode=ForwardMode.DECODE,
+                    top_logprobs_nums=batch.top_logprobs_nums,
+                )
+                output.output_top_logprobs = LogitsProcessor.get_top_logprobs(
+                    output.next_token_logprobs, logits_metadata
+                )[1]
+
        return output

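Because the captured CUDA graph itself only produces next-token logits, the `replay` changes above recompute logprobs outside the graph when a request asks for them. A rough standalone illustration of that post-replay step (tensor shapes and the `top_logprobs_nums` values are made up):

```python
import torch

next_token_logits = torch.randn(4, 32000)  # stand-in for the replayed [batch, vocab] logits
next_token_logprobs = torch.nn.functional.log_softmax(next_token_logits, dim=-1)

top_logprobs_nums = [0, 5, 2, 0]  # per-request top-k; 0 means not requested
output_top_logprobs = []
for i, k in enumerate(top_logprobs_nums):
    if k > 0:
        t = next_token_logprobs[i].topk(k)
        output_top_logprobs.append(list(zip(t.values.tolist(), t.indices.tolist())))
    else:
        output_top_logprobs.append([])
```
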
{sglang-0.2.5 → sglang-0.2.6}/sglang/srt/managers/controller/infer_batch.py

@@ -17,6 +17,13 @@ from sglang.srt.memory_pool import ReqToTokenPool, TokenToKVPool

 INIT_INCREMENTAL_DETOKENIZATION_OFFSET = 5

+# Put some global args for easy access
+global_server_args_dict = {
+    "disable_flashinfer": False,
+    "disable_flashinfer_sampling": False,
+    "attention_reduce_in_fp32": False,
+}
+

 class ForwardMode(IntEnum):
    # Prefill a new sequence. This is deprecated now. "EXTEND" covers this case.
@@ -124,10 +131,10 @@ class Req:
        self.logprob_start_len = 0
        self.top_logprobs_num = 0
        self.normalized_prompt_logprob = None
-        self.prefill_token_logprobs = None
-        self.prefill_top_logprobs = None
-        self.decode_token_logprobs = []
-        self.decode_top_logprobs = []
+        self.input_token_logprobs = None
+        self.input_top_logprobs = None
+        self.output_token_logprobs = []
+        self.output_top_logprobs = []
        # The tokens is prefilled but need to be considered as decode tokens
        # and should be updated for the decode logprobs
        self.last_update_decode_tokens = 0
@@ -244,8 +251,8 @@ class Req:
                k = k + 1
            else:
                break
-        self.decode_token_logprobs = self.decode_token_logprobs[:k]
-        self.decode_top_logprobs = self.decode_top_logprobs[:k]
+        self.output_token_logprobs = self.output_token_logprobs[:k]
+        self.output_top_logprobs = self.output_top_logprobs[:k]
        self.logprob_start_len = prompt_tokens + k
        self.last_update_decode_tokens = len(self.output_ids) - k

@@ -376,7 +383,7 @@ class Batch:
                    logit_bias = torch.zeros(
                        (bs, vocab_size), dtype=torch.float32, device=device
                    )
-                logit_bias[i] = int_token_logit_bias
+                logit_bias[i][: len(int_token_logit_bias)] = int_token_logit_bias

        # Set fields
        self.input_ids = torch.tensor(
@@ -687,13 +694,21 @@ class Batch:
        # TODO(lmzheng): apply penalty
        probs = torch.softmax(logits, dim=-1)

-        max_top_k_round, batch_size = 32, probs.shape[0]
-        uniform_samples = torch.rand((max_top_k_round, batch_size), device=probs.device)
-        batch_next_token_ids, success = top_k_top_p_sampling_from_probs(
-            probs, uniform_samples, self.top_ks, self.top_ps
-        )
+        if not global_server_args_dict["disable_flashinfer_sampling"]:
+            max_top_k_round, batch_size = 32, probs.shape[0]
+            uniform_samples = torch.rand(
+                (max_top_k_round, batch_size), device=probs.device
+            )
+            batch_next_token_ids, success = top_k_top_p_sampling_from_probs(
+                probs, uniform_samples, self.top_ks, self.top_ps
+            )
+        else:
+            # Here we provide a slower fallback implementation.
+            batch_next_token_ids, success = top_k_top_p_sampling_from_probs_torch(
+                probs, self.top_ks, self.top_ps
+            )

-        if torch.
+        if not torch.all(success):
            warnings.warn("Sampling failed, fallback to top_k=1 strategy")
            probs = probs.masked_fill(torch.isnan(probs), 0.0)
            argmax_ids = torch.argmax(probs, dim=-1)
@@ -933,3 +948,29 @@ def init_triton_args(forward_mode, seq_lens, prefix_lens):
    max_extend_len = int(torch.max(extend_seq_lens))

    return max_seq_len, max_extend_len, start_loc, prefix_lens
+
+
+def top_k_top_p_sampling_from_probs_torch(
+    probs: torch.Tensor, top_ks: torch.Tensor, top_ps: torch.Tensor
+):
+    """A top-k and top-k sampling implementation with native pytorch operations."""
+    probs_sort, probs_idx = probs.sort(dim=-1, descending=True)
+    probs_sum = torch.cumsum(probs_sort, dim=-1)
+    probs_sort[(probs_sum - probs_sort) > top_ps.view(-1, 1)] = 0.0
+    probs_sort[
+        torch.arange(0, probs.shape[-1], device=probs.device).view(1, -1)
+        >= top_ks.view(-1, 1)
+    ] = 0.0
+    probs_sort.div_(probs_sort.max(dim=-1, keepdim=True)[0])
+    try:
+        sampled_index = torch.multinomial(probs_sort, num_samples=1)
+    except RuntimeError:
+        batch_next_token_ids = torch.zeros(
+            (probs_sort.shape[0],), dtype=torch.int64, device=probs.device
+        )
+        success = torch.zeros(probs.shape[0], dtype=torch.bool, device=probs.device)
+        return batch_next_token_ids, success
+
+    batch_next_token_ids = torch.gather(probs_idx, dim=1, index=sampled_index).view(-1)
+    success = torch.ones(probs.shape[0], dtype=torch.bool, device=probs.device)
+    return batch_next_token_ids, success
