PyPI - sglang - Versions diffs - 0.2.14__tar.gz → 0.2.14.post1__tar.gz - Mend

sglang 0.2.14__tar.gz → 0.2.14.post1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (121)

{sglang-0.2.14/sglang.egg-info → sglang-0.2.14.post1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: sglang
-Version: 0.2.14
+Version: 0.2.14.post1
 Summary: SGLang is yet another fast serving framework for large language models and vision language models.
 License:                                  Apache License
                                    Version 2.0, January 2004
@@ -312,7 +312,7 @@ pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
 ### Method 2: From source
 ```
 # Use the last release branch
-git clone -b v0.2.14 https://github.com/sgl-project/sglang.git
+git clone -b v0.2.14.post1 https://github.com/sgl-project/sglang.git
 cd sglang
 pip install --upgrade pip
@@ -339,6 +339,7 @@ docker run --gpus all \
 ### Method 4: Using docker compose
 <details>
+<summary>More</summary>
 > This method is recommended if you plan to serve it as a service.
 > A better approach is to use the [k8s-sglang-service.yaml](./docker/k8s-sglang-service.yaml).
@@ -350,6 +351,7 @@ docker run --gpus all \
 ### Method 5: Run on Kubernetes or Clouds with SkyPilot
 <details>
+<summary>More</summary>
 To deploy on Kubernetes or 12+ clouds, you can use [SkyPilot](https://github.com/skypilot-org/skypilot).
@@ -389,7 +391,7 @@ sky status --endpoint 30000 sglang
 ### Common Notes
-- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is currently one of the dependencies that must be installed for SGLang. If you are using NVIDIA GPU devices below sm80, such as T4, you can't use SGLang for the time being. We expect to resolve this issue soon, so please stay tuned. If you encounter any FlashInfer-related issues on sm80+ devices (e.g., A100, L40S, H100), consider using Triton's kernel by `--disable-flashinfer --disable-flashinfer-sampling` and raise a issue.
+- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is currently one of the dependencies that must be installed for SGLang. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), consider using Triton's kernel by `--disable-flashinfer --disable-flashinfer-sampling` and raise an issue.
 - If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
 ## Backend: SGLang Runtime (SRT)
@@ -518,6 +520,7 @@ Instructions for supporting a new model are [here](https://github.com/sgl-projec
 #### Use Models From ModelScope
 <details>
+<summary>More</summary>
 To use a model from [ModelScope](https://www.modelscope.cn), set the environment variable SGLANG_USE_MODELSCOPE.
 ```
@@ -532,6 +535,7 @@ SGLANG_USE_MODELSCOPE=true python -m sglang.launch_server --model-path qwen/Qwen
 #### Run Llama 3.1 405B
 <details>
+<summary>More</summary>
 ```bash
 # Run 405B (fp8) on a single node
@@ -549,7 +553,9 @@ GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/
 ### Benchmark Performance
-- Benchmark a single static batch by running the following command without launching a server. The arguments are the same as for `launch_server.py`. Note that this is not a dynamic batching server, so it may run out of memory for a batch size that a real server can handle. A real server truncates the prefill into several batches, while this unit test does not. For accurate large batch testing, consider using `sglang.bench_serving`.
+- Benchmark a single static batch by running the following command without launching a server. The arguments are the same as for `launch_server.py`.
+  Note that this is not a dynamic batching server, so it may run out of memory for a batch size that a real server can handle.
+  A real server truncates the prefill into several batches, while this unit test does not. For accurate large batch testing, please use `sglang.bench_serving` instead.
   ```
   python -m sglang.bench_latency --model-path meta-llama/Meta-Llama-3-8B-Instruct --batch 32 --input-len 256 --output-len 32
   ```

{sglang-0.2.14 → sglang-0.2.14.post1}/README.md RENAMED

@@ -56,7 +56,7 @@ pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
 ### Method 2: From source
 ```
 # Use the last release branch
-git clone -b v0.2.14 https://github.com/sgl-project/sglang.git
+git clone -b v0.2.14.post1 https://github.com/sgl-project/sglang.git
 cd sglang
 pip install --upgrade pip
@@ -83,6 +83,7 @@ docker run --gpus all \
 ### Method 4: Using docker compose
 <details>
+<summary>More</summary>
 > This method is recommended if you plan to serve it as a service.
 > A better approach is to use the [k8s-sglang-service.yaml](./docker/k8s-sglang-service.yaml).
@@ -94,6 +95,7 @@ docker run --gpus all \
 ### Method 5: Run on Kubernetes or Clouds with SkyPilot
 <details>
+<summary>More</summary>
 To deploy on Kubernetes or 12+ clouds, you can use [SkyPilot](https://github.com/skypilot-org/skypilot).
@@ -133,7 +135,7 @@ sky status --endpoint 30000 sglang
 ### Common Notes
-- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is currently one of the dependencies that must be installed for SGLang. If you are using NVIDIA GPU devices below sm80, such as T4, you can't use SGLang for the time being. We expect to resolve this issue soon, so please stay tuned. If you encounter any FlashInfer-related issues on sm80+ devices (e.g., A100, L40S, H100), consider using Triton's kernel by `--disable-flashinfer --disable-flashinfer-sampling` and raise a issue.
+- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is currently one of the dependencies that must be installed for SGLang. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), consider using Triton's kernel by `--disable-flashinfer --disable-flashinfer-sampling` and raise an issue.
 - If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
 ## Backend: SGLang Runtime (SRT)
@@ -262,6 +264,7 @@ Instructions for supporting a new model are [here](https://github.com/sgl-projec
 #### Use Models From ModelScope
 <details>
+<summary>More</summary>
 To use a model from [ModelScope](https://www.modelscope.cn), set the environment variable SGLANG_USE_MODELSCOPE.
 ```
@@ -276,6 +279,7 @@ SGLANG_USE_MODELSCOPE=true python -m sglang.launch_server --model-path qwen/Qwen
 #### Run Llama 3.1 405B
 <details>
+<summary>More</summary>
 ```bash
 # Run 405B (fp8) on a single node
@@ -293,7 +297,9 @@ GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/
 ### Benchmark Performance
-- Benchmark a single static batch by running the following command without launching a server. The arguments are the same as for `launch_server.py`. Note that this is not a dynamic batching server, so it may run out of memory for a batch size that a real server can handle. A real server truncates the prefill into several batches, while this unit test does not. For accurate large batch testing, consider using `sglang.bench_serving`.
+- Benchmark a single static batch by running the following command without launching a server. The arguments are the same as for `launch_server.py`.
+  Note that this is not a dynamic batching server, so it may run out of memory for a batch size that a real server can handle.
+  A real server truncates the prefill into several batches, while this unit test does not. For accurate large batch testing, please use `sglang.bench_serving` instead.
   ```
   python -m sglang.bench_latency --model-path meta-llama/Meta-Llama-3-8B-Instruct --batch 32 --input-len 256 --output-len 32
   ```

{sglang-0.2.14 → sglang-0.2.14.post1}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "sglang"
-version = "0.2.14"
+version = "0.2.14.post1"
 description = "SGLang is yet another fast serving framework for large language models and vision language models."
 readme = "README.md"
 requires-python = ">=3.8"

{sglang-0.2.14 → sglang-0.2.14.post1}/sglang/srt/constrained/fsm_cache.py RENAMED

@@ -15,6 +15,8 @@ limitations under the License.
 """Cache for the compressed finite state machine."""
+from outlines.fsm.json_schema import build_regex_from_schema
 from sglang.srt.constrained import RegexGuide, TransformerTokenizer
 from sglang.srt.constrained.base_tool_cache import BaseToolCache
@@ -26,9 +28,12 @@ class FSMCache(BaseToolCache):
         tokenizer_args_dict,
         enable=True,
         skip_tokenizer_init=False,
+        json_schema_mode=False,
     ):
         super().__init__(enable=enable)
+        self.json_schema_mode = json_schema_mode
         if (
             skip_tokenizer_init
             or tokenizer_path.endswith(".json")
@@ -72,5 +77,9 @@ class FSMCache(BaseToolCache):
                 tokenizer_path, **tokenizer_args_dict
             )
-    def init_value(self, regex):
-        return RegexGuide(regex, self.outlines_tokenizer)
+    def init_value(self, value):
+        if self.json_schema_mode:
+            regex = build_regex_from_schema(value)
+            return RegexGuide(regex, self.outlines_tokenizer), regex
+        else:
+            return RegexGuide(value, self.outlines_tokenizer)
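The `init_value` change above makes the cached value depend on `json_schema_mode`: a bare guide in regex mode, a `(guide, regex)` pair in schema mode. A minimal sketch of that pattern follows, using only the standard library. `ToyGuideCache`, `toy_schema_to_regex`, and the `query` method are hypothetical stand-ins for illustration; they are not the `FSMCache`, `build_regex_from_schema`, or `BaseToolCache` implementations, and the toy converter only understands string `enum` schemas.

```python
import json
import re
from typing import Dict, Tuple, Union

# In this sketch a compiled pattern plays the role of the guide object.
CacheValue = Union[re.Pattern, Tuple[re.Pattern, str]]


def toy_schema_to_regex(schema: str) -> str:
    """Hypothetical converter: handles only {"enum": [<strings>]} schemas."""
    options = json.loads(schema)["enum"]
    # Each option is matched as its JSON encoding, e.g. "yes" with quotes.
    return "|".join(re.escape(json.dumps(option)) for option in options)


class ToyGuideCache:
    """Memoizes guides; the value shape depends on json_schema_mode."""

    def __init__(self, json_schema_mode: bool = False):
        self.json_schema_mode = json_schema_mode
        self._store: Dict[str, CacheValue] = {}

    def init_value(self, value: str) -> CacheValue:
        if self.json_schema_mode:
            # Schema mode: derive the regex first, then return it with the guide.
            regex = toy_schema_to_regex(value)
            return re.compile(regex), regex
        # Regex mode: the key already is the regex.
        return re.compile(value)

    def query(self, key: str) -> CacheValue:
        # Build the value once per key and reuse it afterwards.
        if key not in self._store:
            self._store[key] = self.init_value(key)
        return self._store[key]
```

Callers of a schema-mode cache have to unpack two values (`guide, regex = cache.query(schema)`), while regex-mode callers get the guide alone, so the two modes are best kept in separate cache instances.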

{sglang-0.2.14 → sglang-0.2.14.post1}/sglang/srt/constrained/jump_forward.py RENAMED

@@ -23,6 +23,7 @@ from collections import defaultdict
 import interegular
 import outlines.caching
+from outlines.fsm.json_schema import build_regex_from_schema
 from sglang.srt.constrained import (
     FSMInfo,

sglang-0.2.14.post1/sglang/srt/layers/activation.py ADDED

@@ -0,0 +1,131 @@
+"""
+Copyright 2023-2024 SGLang Team
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+"""
+"""Fused operators for activation layers."""
+from typing import Optional
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from flashinfer.activation import gelu_tanh_and_mul, silu_and_mul
+from vllm.distributed import (
+    divide,
+    get_tensor_model_parallel_rank,
+    get_tensor_model_parallel_world_size,
+)
+from vllm.model_executor.custom_op import CustomOp
+from vllm.model_executor.layers.quantization import QuantizationConfig
+from vllm.model_executor.utils import set_weight_attrs
+class SiluAndMul(CustomOp):
+    def forward_native(self, x: torch.Tensor) -> torch.Tensor:
+        d = x.shape[-1] // 2
+        return F.silu(x[..., :d]) * x[..., d:]
+    def forward_cuda(self, x: torch.Tensor) -> torch.Tensor:
+        d = x.shape[-1] // 2
+        output_shape = x.shape[:-1] + (d,)
+        out = torch.empty(output_shape, dtype=x.dtype, device=x.device)
+        silu_and_mul(x, out)
+        return out
+class GeluAndMul(CustomOp):
+    def __init__(self, **kwargs):
+        super().__init__()
+    def forward_native(self, x: torch.Tensor) -> torch.Tensor:
+        d = x.shape[-1] // 2
+        return F.gelu(x[..., :d], approximate="tanh") * x[..., d:]
+    def forward_cuda(self, x: torch.Tensor) -> torch.Tensor:
+        d = x.shape[-1] // 2
+        output_shape = x.shape[:-1] + (d,)
+        out = torch.empty(output_shape, dtype=x.dtype, device=x.device)
+        gelu_tanh_and_mul(x, out)
+        return out
+class ScaledActivation(nn.Module):
+    """An activation function with post-scale parameters.
+    This is used for some quantization methods like AWQ.
+    """
+    def __init__(
+        self,
+        act_module: nn.Module,
+        intermediate_size: int,
+        input_is_parallel: bool = True,
+        params_dtype: Optional[torch.dtype] = None,
+    ):
+        super().__init__()
+        self.act = act_module
+        self.input_is_parallel = input_is_parallel
+        if input_is_parallel:
+            tp_size = get_tensor_model_parallel_world_size()
+            intermediate_size_per_partition = divide(intermediate_size, tp_size)
+        else:
+            intermediate_size_per_partition = intermediate_size
+        if params_dtype is None:
+            params_dtype = torch.get_default_dtype()
+        self.scales = nn.Parameter(
+            torch.empty(intermediate_size_per_partition, dtype=params_dtype)
+        )
+        set_weight_attrs(self.scales, {"weight_loader": self.weight_loader})
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return self.act(x) / self.scales
+    def weight_loader(self, param: nn.Parameter, loaded_weight: torch.Tensor):
+        param_data = param.data
+        if self.input_is_parallel:
+            tp_rank = get_tensor_model_parallel_rank()
+            shard_size = param_data.shape[0]
+            start_idx = tp_rank * shard_size
+            loaded_weight = loaded_weight.narrow(0, start_idx, shard_size)
+        assert param_data.shape == loaded_weight.shape
+        param_data.copy_(loaded_weight)
+_ACTIVATION_REGISTRY = {
+    "gelu": nn.GELU(),
+    "gelu_pytorch_tanh": nn.GELU(approximate="tanh"),
+}
+def get_act_fn(
+    act_fn_name: str,
+    quant_config: Optional[QuantizationConfig] = None,
+    intermediate_size: Optional[int] = None,
+    input_is_parallel: bool = True,
+    params_dtype: Optional[torch.dtype] = None,
+) -> nn.Module:
+    """Get an activation function by name."""
+    act_fn_name = act_fn_name.lower()
+    if act_fn_name not in _ACTIVATION_REGISTRY:
+        raise ValueError(f"Activation function {act_fn_name!r} is not supported.")
+    act_fn = _ACTIVATION_REGISTRY[act_fn_name]
+    if quant_config is not None and act_fn_name in quant_config.get_scaled_act_names():
+        if intermediate_size is None:
+            raise ValueError(
+                "intermediate_size must be specified for scaled "
+                "activation functions."
+            )
+        return ScaledActivation(
+            act_fn, intermediate_size, input_is_parallel, params_dtype
+        )
+    return act_fn
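The gated activations added in this file split the last dimension in half, apply the activation to the first half, and multiply by the second half, so the output is half as wide as the input. The sketch below restates the `forward_native` slicing of `SiluAndMul` and the division in `ScaledActivation.forward` on plain Python lists. The function names are hypothetical, it handles a single row rather than a tensor, and it assumes an even input width.

```python
import math
from typing import List


def silu(v: float) -> float:
    """SiLU (swish): v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def silu_and_mul_reference(row: List[float]) -> List[float]:
    """Return silu(row[:d]) * row[d:] with d = len(row) // 2."""
    if len(row) % 2 != 0:
        # Assumption of this sketch: the gate and value halves have equal width.
        raise ValueError("input width must be even")
    d = len(row) // 2
    gate, up = row[:d], row[d:]
    return [silu(g) * u for g, u in zip(gate, up)]


def scaled_activation_reference(
    activated: List[float], scales: List[float]
) -> List[float]:
    """Post-scale step: divide each activated value by its per-channel scale."""
    if len(activated) != len(scales):
        raise ValueError("activated and scales must have the same length")
    return [a / s for a, s in zip(activated, scales)]
```

For example, `silu_and_mul_reference([0.0, 1.0, 2.0, 3.0])` gates `[2.0, 3.0]` with `silu` of `[0.0, 1.0]` and returns two values. The `forward_cuda` paths in the diff compute the same quantity with FlashInfer's fused kernels, writing into a preallocated `out` tensor.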

{sglang-0.2.14 → sglang-0.2.14.post1}/sglang/srt/layers/layernorm.py RENAMED

@@ -32,15 +32,12 @@ class RMSNorm(CustomOp):
         super().__init__()
         self.weight = nn.Parameter(torch.ones(hidden_size))
         self.variance_epsilon = eps
-        self.is_lower_sm80 = torch.cuda.get_device_capability()[0] < 8
     def forward_cuda(
         self,
         x: torch.Tensor,
         residual: Optional[torch.Tensor] = None,
     ) -> Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
-        if self.is_lower_sm80:
-            return self.forward_native(x, residual)
         if residual is not None:
             fused_add_rmsnorm(x, residual, self.weight.data, self.variance_epsilon)

{sglang-0.2.14 → sglang-0.2.14.post1}/sglang/srt/layers/logits_processor.py RENAMED

@@ -29,7 +29,7 @@ from sglang.srt.model_executor.forward_batch_info import ForwardMode, InputMetad
 @dataclasses.dataclass
-class LogitsProcessorOutput:
+class LogitProcessorOutput:
     # The logits of the next tokens.       shape: [#seq, vocab_size]
     next_token_logits: torch.Tensor
     # The logprobs of the next tokens.     shape: [#seq, vocab_size]
@@ -185,7 +185,7 @@ class LogitsProcessor(nn.Module):
         # Return only last_logits if logprob is not requested
         if not logits_metadata.return_logprob:
-            return LogitsProcessorOutput(
+            return LogitProcessorOutput(
                 next_token_logits=last_logits,
                 next_token_logprobs=None,
                 normalized_prompt_logprobs=None,
@@ -209,7 +209,7 @@ class LogitsProcessor(nn.Module):
                 else:
                     output_top_logprobs = None
-                return LogitsProcessorOutput(
+                return LogitProcessorOutput(
                     next_token_logits=last_logits,
                     next_token_logprobs=last_logprobs,
                     normalized_prompt_logprobs=None,
@@ -278,7 +278,7 @@ class LogitsProcessor(nn.Module):
                 # Remove the last token logprob for the prefill tokens.
                 input_token_logprobs = input_token_logprobs[:-1]
-                return LogitsProcessorOutput(
+                return LogitProcessorOutput(
                     next_token_logits=last_logits,
                     next_token_logprobs=last_logprobs,
                     normalized_prompt_logprobs=normalized_prompt_logprobs,

{sglang-0.2.14 → sglang-0.2.14.post1}/sglang/srt/layers/sampler.py RENAMED

@@ -1,6 +1,4 @@
-import dataclasses
 import logging
-from typing import Union
 import torch
 from flashinfer.sampling import (
@@ -11,8 +9,6 @@ from flashinfer.sampling import (
 )
 from vllm.model_executor.custom_op import CustomOp
-from sglang.srt.layers.logits_processor import LogitsProcessorOutput
 # TODO: move this dict to another place
 from sglang.srt.managers.schedule_batch import global_server_args_dict
 from sglang.srt.sampling.sampling_batch_info import SamplingBatchInfo
@@ -20,71 +16,30 @@ from sglang.srt.sampling.sampling_batch_info import SamplingBatchInfo
 logger = logging.getLogger(__name__)
-@dataclasses.dataclass
-class SampleOutput:
-    success: torch.Tensor
-    probs: torch.Tensor
-    batch_next_token_ids: torch.Tensor
 class Sampler(CustomOp):
     def __init__(self):
         super().__init__()
-    def _apply_penalties(self, logits: torch.Tensor, sampling_info: SamplingBatchInfo):
-        # min-token, presence, frequency
-        if sampling_info.linear_penalties is not None:
-            logits += sampling_info.linear_penalties
-        # repetition
-        if sampling_info.scaling_penalties is not None:
-            logits = torch.where(
-                logits > 0,
-                logits / sampling_info.scaling_penalties,
-                logits * sampling_info.scaling_penalties,
-            )
-        return logits
-    def _get_probs(
-        self,
-        logits: torch.Tensor,
-        sampling_info: SamplingBatchInfo,
-        is_torch_compile: bool = False,
-    ):
+    def forward_cuda(self, logits: torch.Tensor, sampling_info: SamplingBatchInfo):
         # Post process logits
         logits = logits.contiguous()
         logits.div_(sampling_info.temperatures)
-        if is_torch_compile:
-            # FIXME: Temporary workaround for unknown bugs in torch.compile
-            logits.add_(0)
         if sampling_info.logit_bias is not None:
             logits.add_(sampling_info.logit_bias)
         if sampling_info.vocab_mask is not None:
             logits = logits.masked_fill(~sampling_info.vocab_mask, float("-inf"))
-        logits = self._apply_penalties(logits, sampling_info)
+        logits = sampling_info.penalizer_orchestrator.apply(logits)
-        return torch.softmax(logits, dim=-1)
-    def forward_cuda(
-        self,
-        logits: Union[torch.Tensor, LogitsProcessorOutput],
-        sampling_info: SamplingBatchInfo,
-    ):
-        if isinstance(logits, LogitsProcessorOutput):
-            logits = logits.next_token_logits
-        probs = self._get_probs(logits, sampling_info)
+        probs = torch.softmax(logits, dim=-1)
         if not global_server_args_dict["disable_flashinfer_sampling"]:
             max_top_k_round, batch_size = 32, probs.shape[0]
             uniform_samples = torch.rand(
                 (max_top_k_round, batch_size), device=probs.device
             )
-            if sampling_info.need_min_p_sampling:
+            if sampling_info.min_ps.any():
                 probs = top_k_renorm_prob(probs, sampling_info.top_ks)
                 probs = top_p_renorm_prob(probs, sampling_info.top_ps)
                 batch_next_token_ids, success = min_p_sampling_from_probs(
@@ -100,23 +55,18 @@ class Sampler(CustomOp):
                 probs, sampling_info.top_ks, sampling_info.top_ps, sampling_info.min_ps
             )
-        return SampleOutput(success, probs, batch_next_token_ids)
-    def forward_native(
-        self,
-        logits: Union[torch.Tensor, LogitsProcessorOutput],
-        sampling_info: SamplingBatchInfo,
-    ):
-        if isinstance(logits, LogitsProcessorOutput):
-            logits = logits.next_token_logits
-        probs = self._get_probs(logits, sampling_info, is_torch_compile=True)
+        if not torch.all(success):
+            logging.warning("Sampling failed, fallback to top_k=1 strategy")
+            probs = probs.masked_fill(torch.isnan(probs), 0.0)
+            argmax_ids = torch.argmax(probs, dim=-1)
+            batch_next_token_ids = torch.where(
+                success, batch_next_token_ids, argmax_ids
+            )
-        batch_next_token_ids, success = top_k_top_p_min_p_sampling_from_probs_torch(
-            probs, sampling_info.top_ks, sampling_info.top_ps, sampling_info.min_ps
-        )
+        return batch_next_token_ids
-        return SampleOutput(success, probs, batch_next_token_ids)
+    def forward_native():
+        raise NotImplementedError("Native forward is not implemented yet.")
 def top_k_top_p_min_p_sampling_from_probs_torch(
@@ -137,10 +87,7 @@ def top_k_top_p_min_p_sampling_from_probs_torch(
     probs_sort[probs_sort < min_p_thresholds.view(-1, 1)] = 0.0
     probs_sort.div_(probs_sort.max(dim=-1, keepdim=True)[0])
     try:
-        # FIXME: torch.multiomial does not support num_samples = 1
-        sampled_index = torch.multinomial(probs_sort, num_samples=2, replacement=True)[
-            :, :1
-        ]
+        sampled_index = torch.multinomial(probs_sort, num_samples=1)
     except RuntimeError as e:
         logger.warning(f"Sampling error: {e}")
         batch_next_token_ids = torch.zeros(
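In this version the "fallback to top_k=1" logic lives inside `Sampler.forward_cuda`: rows whose sampling failed get the argmax of their probabilities, with NaN entries zeroed first, and successful rows keep their sampled token. A list-based sketch of that per-row selection is shown below. `argmax_fallback` is a hypothetical name, and the loop stands in for the `masked_fill`, `argmax`, and `where` tensor operations in the diff.

```python
import math
from typing import List


def argmax_fallback(
    probs: List[List[float]],
    sampled_ids: List[int],
    success: List[bool],
) -> List[int]:
    """Keep sampled ids where sampling succeeded, else use the row argmax."""
    fixed_ids = []
    for row, token_id, ok in zip(probs, sampled_ids, success):
        if ok:
            fixed_ids.append(token_id)
            continue
        # Zero out NaNs so they can never win the argmax.
        cleaned = [0.0 if math.isnan(p) else p for p in row]
        # Index of the largest probability (first one on ties in this sketch).
        fixed_ids.append(max(range(len(cleaned)), key=cleaned.__getitem__))
    return fixed_ids
```

With `probs = [[0.1, 0.7, 0.2], [nan, 0.3, 0.6]]`, `sampled_ids = [2, 0]`, and `success = [True, False]`, the first row keeps token 2 and the second row falls back to index 2, its largest non-NaN probability.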

{sglang-0.2.14 → sglang-0.2.14.post1}/sglang/srt/managers/schedule_batch.py RENAMED

@@ -1,5 +1,3 @@
-from __future__ import annotations
 """
 Copyright 2023-2024 SGLang Team
 Licensed under the Apache License, Version 2.0 (the "License");
@@ -19,7 +17,7 @@ limitations under the License.
 import logging
 from dataclasses import dataclass
-from typing import TYPE_CHECKING, List, Optional, Union
+from typing import List, Optional, Union
 import torch
@@ -31,10 +29,6 @@ from sglang.srt.mem_cache.chunk_cache import ChunkCache
 from sglang.srt.mem_cache.memory_pool import BaseTokenToKVPool, ReqToTokenPool
 from sglang.srt.sampling.sampling_batch_info import SamplingBatchInfo
-if TYPE_CHECKING:
-    from sglang.srt.layers.sampler import SampleOutput
 INIT_INCREMENTAL_DETOKENIZATION_OFFSET = 5
 # Put some global args for easy access
@@ -268,7 +262,14 @@ class Req:
         all_text = self.origin_input_text + self.decoded_text + jump_forward_str
         all_ids = self.tokenizer.encode(all_text)
+        if not all_ids:
+            logger.warning("Encoded all_text resulted in empty all_ids")
+            return False
         prompt_tokens = len(self.origin_input_ids_unpadded)
+        if prompt_tokens > len(all_ids):
+            logger.warning("prompt_tokens is larger than encoded all_ids")
+            return False
         if all_ids[prompt_tokens - 1] != self.origin_input_ids_unpadded[-1]:
             # TODO(lsyin): fix token fusion
@@ -677,17 +678,11 @@ class ScheduleBatch:
         self.top_logprobs_nums.extend(other.top_logprobs_nums)
         self.return_logprob = any(req.return_logprob for req in self.reqs)
-    def check_sample_results(self, sample_output: SampleOutput):
-        if not torch.all(sample_output.success):
-            probs = sample_output.probs
-            batch_next_token_ids = sample_output.batch_next_token_ids
-            logging.warning("Sampling failed, fallback to top_k=1 strategy")
-            probs = probs.masked_fill(torch.isnan(probs), 0.0)
-            argmax_ids = torch.argmax(probs, dim=-1)
-            batch_next_token_ids = torch.where(
-                sample_output.success, batch_next_token_ids, argmax_ids
-            )
-            sample_output.probs = probs
-            sample_output.batch_next_token_ids = batch_next_token_ids
+    def sample(self, logits: torch.Tensor):
+        from sglang.srt.layers.sampler import Sampler
+        sampler = Sampler()
+        batch_next_token_ids = sampler(logits, self.sampling_info)
-        return sample_output.batch_next_token_ids
+        return batch_next_token_ids
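The two guards added to `Req` in this file protect the `all_ids[prompt_tokens - 1]` lookup: without them, an empty or too-short re-encoding would raise `IndexError`. The sketch below isolates that check. `boundary_token_unchanged` is a hypothetical helper, the empty-prompt guard is an extra assumption that is not in the diff, and what the real method does when the boundary token differs lies outside the hunk shown, so the sketch only reports the comparison.

```python
from typing import Sequence


def boundary_token_unchanged(
    all_ids: Sequence[int], prompt_ids: Sequence[int]
) -> bool:
    """Return True only if the last prompt token survived re-encoding."""
    if not all_ids:
        # Mirrors the first guard in the diff: nothing was encoded.
        return False
    if not prompt_ids:
        # Extra guard for this sketch only (not in the diff).
        return False
    prompt_tokens = len(prompt_ids)
    if prompt_tokens > len(all_ids):
        # Mirrors the second guard: the index below would be out of range.
        return False
    return all_ids[prompt_tokens - 1] == prompt_ids[-1]
```

In the diff both guard failures log a warning and return `False`, so the caller sees them as a failed retokenization rather than as an exception.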
