PyPI - vllm-sr - Versions diffs - 0.1.0b2.dev20260204070724__py3-none-any.whl → 0.1.0b2.dev20260204090051__py3-none-any.whl - Mend


This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (8)

cli/merger.py CHANGED

@@ -154,8 +154,20 @@ def translate_latency_signals(latencies: list) -> list:
     for signal in latencies:
         rule = {
             "name": signal.name,
-            "max_tpot": signal.max_tpot,
         }
+        # At least one of tpot_percentile or ttft_percentile should be set
+        if signal.tpot_percentile is not None and signal.tpot_percentile > 0:
+            rule["tpot_percentile"] = signal.tpot_percentile
+        if signal.ttft_percentile is not None and signal.ttft_percentile > 0:
+            rule["ttft_percentile"] = signal.ttft_percentile
+        # Validate that at least one is set
+        if "tpot_percentile" not in rule and "ttft_percentile" not in rule:
+            log.warn(
+                f"Latency signal '{signal.name}' has neither tpot_percentile nor ttft_percentile set, skipping"
+            )
+            continue
         if signal.description:
             rule["description"] = signal.description
         rules.append(rule)
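
The filtering in the hunk above can be exercised in isolation. The following is a minimal sketch, not the package's code: `LatencySignal` is a hypothetical stand-in for the `Latency` model, and `log` is assumed to be a standard-library logger (the diff does not show how `log` is defined). The sketch calls `warning` because `Logger.warn` is a deprecated alias in the standard library.

```python
import logging
from dataclasses import dataclass
from typing import Optional

# Assumption: a plain stdlib logger; the package's own `log` object is not shown in the diff.
log = logging.getLogger("latency_sketch")


@dataclass
class LatencySignal:
    """Hypothetical stand-in for the package's Latency model."""
    name: str
    description: str = ""
    tpot_percentile: Optional[int] = None
    ttft_percentile: Optional[int] = None


def translate_latency_signals(latencies):
    """Mirror the hunk: copy positive percentiles, skip signals that set neither."""
    rules = []
    for signal in latencies:
        rule = {"name": signal.name}
        for key in ("tpot_percentile", "ttft_percentile"):
            value = getattr(signal, key)
            # None and non-positive values are both treated as "not set".
            if value is not None and value > 0:
                rule[key] = value
        if "tpot_percentile" not in rule and "ttft_percentile" not in rule:
            log.warning(
                "Latency signal '%s' has neither tpot_percentile nor "
                "ttft_percentile set, skipping",
                signal.name,
            )
            continue
        if signal.description:
            rule["description"] = signal.description
        rules.append(rule)
    return rules


signals = [
    LatencySignal("low_latency_comprehensive", "fast start and fast generation", 10, 10),
    LatencySignal("chat_fast_start", "fast first token", ttft_percentile=10),
    LatencySignal("unset_signal", "no percentile configured"),
]
rules = translate_latency_signals(signals)
```

Note that a signal with neither percentile is dropped with a warning rather than rejected, so a decision that references it by name would point at a rule that no longer exists in the translated output.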

cli/models.py CHANGED

@@ -72,7 +72,8 @@ class Latency(BaseModel):
     """Latency signal configuration."""
     name: str
-    max_tpot: float
+    tpot_percentile: Optional[int] = None
+    ttft_percentile: Optional[int] = None
     description: str
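
In the hunk above both percentile fields default to `None`, so the model itself does not enforce the "at least one must be set" rule stated in the template comments. One way such a check could be expressed is sketched below. This is an illustration only: it uses a standard-library dataclass rather than the package's `BaseModel`, the class name `LatencySketch` is hypothetical, and the 1 to 100 range is an assumption that the diff does not confirm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LatencySketch:
    """Hypothetical dataclass analogue of the Latency model, with eager validation."""
    name: str
    description: str
    tpot_percentile: Optional[int] = None
    ttft_percentile: Optional[int] = None

    def __post_init__(self):
        values = (self.tpot_percentile, self.ttft_percentile)
        # Reject a signal that configures neither metric.
        if all(v is None for v in values):
            raise ValueError(
                f"latency signal '{self.name}' needs tpot_percentile or ttft_percentile"
            )
        # Assumed range: percentiles are integers from 1 to 100.
        for v in values:
            if v is not None and not 1 <= v <= 100:
                raise ValueError(f"percentile {v} is outside the assumed 1-100 range")


chat = LatencySketch("chat_fast_start", "fast first token", ttft_percentile=10)
```

Configurations that still use the removed `max_tpot` key would need to be migrated to one of the two percentile fields.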

cli/templates/config.template.yaml CHANGED

@@ -177,14 +177,48 @@ signals:
     - name: "ja"
       description: "Japanese language queries"
-  # latency - Latency-based routing signals (TPOT-based)
+  # latency - Latency-based routing signals (TPOT and TTFT percentile-based)
+  # Percentile-based rules adapt to each model's actual performance distribution
+  # Works with any number of observations (1+): uses average for 1-2, percentile for 3+
+  #
+  # ⚠️  RECOMMENDATION: Use BOTH tpot_percentile AND ttft_percentile for comprehensive latency evaluation
+  #    - TPOT: Measures token generation speed (throughput)
+  #    - TTFT: Measures first token latency (user-perceived latency)
+  #    - Together: Complete latency picture for optimal routing decisions
+  #
+  # You CAN use only one if needed for specific use cases:
+  #   - TPOT only: Batch processing, cost optimization (throughput matters)
+  #   - TTFT only: Real-time chat (user perception matters)
+  #   - Both: RECOMMENDED for most applications (comprehensive evaluation)
+  #
+  # At least one of tpot_percentile or ttft_percentile must be set (validation requirement)
+  # When both are set, model must meet BOTH thresholds (AND logic)
   latency:
-    - name: "low_latency"
-      max_tpot: 0.05  # 50ms per token
-      description: "For real-time chat applications requiring fast responses"
-    - name: "medium_latency"
-      max_tpot: 0.15  # 150ms per token
-      description: "For standard applications with moderate latency tolerance"
+    # Example 1: RECOMMENDED - Both TPOT and TTFT percentiles (comprehensive latency evaluation)
+    - name: "low_latency_comprehensive"
+      tpot_percentile: 10  # 10th percentile for TPOT (top 10% fastest token generation)
+      ttft_percentile: 10  # 10th percentile for TTFT (top 10% fastest first token)
+      description: "RECOMMENDED: For real-time applications - fast start and fast generation"
+    # Example 2: Different percentiles for different priorities
+    - name: "balanced_latency"
+      tpot_percentile: 50  # Median TPOT (top 50%)
+      ttft_percentile: 10  # Top 10% TTFT (prioritize fast start)
+      description: "Prioritize fast start, accept moderate generation speed"
+    # Example 3: TPOT percentile only (use case: batch processing, cost optimization)
+    # ⚠️  Note: Only using one metric is allowed but not recommended for most use cases
+    - name: "batch_processing_optimized"
+      tpot_percentile: 10  # 10th percentile for TPOT (top 10% fastest token generation)
+      # ttft_percentile: not set - acceptable for batch processing where throughput matters
+      description: "For batch processing where throughput (TPOT) is critical, TTFT less important"
+    # Example 4: TTFT percentile only (use case: real-time chat where first token matters)
+    # ⚠️  Note: Only using one metric is allowed but not recommended for most use cases
+    - name: "chat_fast_start"
+      ttft_percentile: 10  # 10th percentile for TTFT (top 10% fastest first token)
+      # tpot_percentile: not set - acceptable for chat apps where user perception matters
+      description: "For chat applications where fast first token (TTFT) is critical for UX"
   # context - Context length signals (Token Count)
   context:
@@ -431,6 +465,39 @@ decisions:
           enabled: true
           similarity_threshold: 0.85
+  # Latency-based routing example: Route to models that meet latency requirements
+  - name: "low_latency_route"
+    description: "Route to models with low latency (fast TPOT and TTFT)"
+    priority: 90
+    rules:
+      operator: "AND"
+      conditions:
+        - type: "latency"
+          name: "low_latency_comprehensive"  # Requires both TPOT and TTFT percentiles
+    modelRefs:
+      - model: "openai/gpt-oss-120b"
+        use_reasoning: false
+    plugins:
+      - type: "system_prompt"
+        configuration:
+          system_prompt: "Provide fast, concise responses suitable for real-time applications."
+  - name: "fast_start_route"
+    description: "Route to models with fast first token (prioritize TTFT for chat apps)"
+    priority: 85
+    rules:
+      operator: "AND"
+      conditions:
+        - type: "latency"
+          name: "chat_fast_start"  # Only requires TTFT percentile
+    modelRefs:
+      - model: "openai/gpt-oss-120b"
+        use_reasoning: false
+    plugins:
+      - type: "system_prompt"
+        configuration:
+          system_prompt: "Start responding quickly. User is waiting for immediate feedback."
   # Size-aware routing example: Try smaller models first, escalate if confidence is low
   - name: "confidence_route"
     description: "Cost-efficient routing: try small model first, escalate if needed"
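
The template comments above describe two behaviors: a threshold that falls back to the average for 1-2 observations and uses a percentile for 3 or more, and AND logic when both metrics are configured. The router code that implements this is not part of this diff, so the following is only a sketch under stated assumptions: the function names are hypothetical, the percentile uses the nearest-rank method, and a model is assumed to qualify when its current value is at or below the threshold derived from the observation history.

```python
import math
from statistics import mean


def latency_threshold(observations, percentile):
    """Average for 1-2 observations, nearest-rank percentile for 3 or more."""
    if not observations:
        raise ValueError("at least one observation is required")
    if len(observations) < 3:
        return mean(observations)
    ordered = sorted(observations)
    # Nearest-rank: the smallest value with at least `percentile` percent of data at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def meets_rule(rule, current, history):
    """AND logic: every percentile configured in the rule must be satisfied."""
    for metric in ("tpot", "ttft"):
        pct = rule.get(f"{metric}_percentile")
        if pct is None:
            continue  # metric not configured, so it imposes no constraint
        if current[metric] > latency_threshold(history[metric], pct):
            return False
    return True


# Illustrative values in milliseconds, not measurements.
history = {
    "tpot": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    "ttft": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
}
balanced = {"name": "balanced_latency", "tpot_percentile": 50, "ttft_percentile": 10}
fast_enough = meets_rule(balanced, {"tpot": 45, "ttft": 10}, history)
slow_start = meets_rule(balanced, {"tpot": 45, "ttft": 15}, history)
```

Under these assumptions a rule with only one percentile set constrains only that metric, which matches the single-metric examples in the template.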

{vllm_sr-0.1.0b2.dev20260204070724.dist-info → vllm_sr-0.1.0b2.dev20260204090051.dist-info}/METADATA RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: vllm-sr
-Version: 0.1.0b2.dev20260204070724
+Version: 0.1.0b2.dev20260204090051
 Summary: vLLM Semantic Router - Intelligent routing for Mixture-of-Models
 Author: vLLM-SR Team
 License: Apache-2.0

{vllm_sr-0.1.0b2.dev20260204070724.dist-info → vllm_sr-0.1.0b2.dev20260204090051.dist-info}/RECORD RENAMED

@@ -6,8 +6,8 @@ cli/defaults.py,sha256=x3mdZPBtI74bzgO5UDL-xoNV5jh9uPeSDcfgnkWHVyA,1969
 cli/docker_cli.py,sha256=kj3VvNfUJwcizZQj63mB2yLBBUbvZ9P3MtleIXhTp0M,25230
 cli/logo.py,sha256=I0qnCCGyOsHmN6MRgqa_c07MZSArv_6eHcHAL36V0Eg,1512
 cli/main.py,sha256=-PkXZw_z0VtzLbA5BCQlvLTYRuNbAT0CcBSFvjrsmBo,9137
-cli/merger.py,sha256=KMajL8tzdZeVRSdyq6MJt-DNhOQ5K_suMGx3jhl0odQ,16569
-cli/models.py,sha256=jrHiAqEIsWgu5O4_Y7T1xXLxFWlXuEtdr9RCLg1B0V0,13468
+cli/merger.py,sha256=Cd3DHt5_bMAw6oxSMtu5Debd6gCCRb0wjj-RTZSrZw4,17172
+cli/models.py,sha256=IOpG-XtLZtVDzQVIVd_PyLvMj_5BJDAt4R7kYBqr7Wo,13532
 cli/parser.py,sha256=lFPrROAYyc2cPoMMULV2sPid5IklIvc8zpDekj5Pocw,5653
 cli/utils.py,sha256=CHMwSZC-zxrxCnlZqiyzF2CToCVB5tHku2zyYig8fo0,3787
 cli/validator.py,sha256=G49I8_vyg0HvfI2LHwFlOagTwWAEsIkIDzX5riYoZhA,10651
@@ -19,7 +19,7 @@ cli/commands/serve.py,sha256=vsK3T4uWx49CPTgy4sc6oM7F5riq0sNbyiT7KbS37Wc,6104
 cli/commands/show_config.py,sha256=emkIdH9LKbzMzxna3Lxl-hEG85isfAXY5BlZjP_vaIM,4689
 cli/commands/show_defaults.py,sha256=r95KOHKQWeeNnoFO5EmCUZSPeOKb-5NzCoaZ6CaBcIw,701
 cli/commands/validate.py,sha256=Chm_MfyJMESiOo6IXiroYwMUkm8HCI26OHYYUBkGXGk,2773
-cli/templates/config.template.yaml,sha256=sS7ZJqrujBQgIs8E0vxUYcVofFL_SiGgW8XnNz5lrRE,18541
+cli/templates/config.template.yaml,sha256=8qnnyWiSLT7EkCFWK_O3qyHqvzfY_Q8GjmcbF13-e3k,22012
 cli/templates/envoy.template.yaml,sha256=j43mxQ3NcFMlmbGyrDM0-1rvQkC1N4MpXUC2o7SLT9g,8293
 cli/templates/generate_dashboard.py,sha256=gCui1rPrxUFIeioebyQ_JhL7zbjoEAg8GjUJcWXHcys,24233
 cli/templates/grafana-dashboard.serve.yaml,sha256=o1H5xx60esBzk19qnMQtX4LmL5DswpMvOO7cO09Q3kI,350
@@ -30,8 +30,8 @@ cli/templates/llm-router-dashboard.serve.json,sha256=pwnTjUh7z3_3LnIwtaLXjDWH4aH
 cli/templates/prometheus.serve.yaml,sha256=MGYq8dlRq_i2m5sogQ--kwTvJpkf44QQoCNoI7oyVT8,270
 cli/templates/router-defaults.yaml,sha256=a0riw9juttoBr4G7rtUEJA2FY9TgAvK-8mngWGCCIoc,9373
 cli/templates/tools_db.json,sha256=CPqPBkd5nc966m1YEozz06frrmv3Pd5rrkxKkO3rTiA,4537
-vllm_sr-0.1.0b2.dev20260204070724.dist-info/METADATA,sha256=Klc_-W-wKFC4HErcZy2jqrsSzfQQUSPI5JRB6MFWMCc,7298
-vllm_sr-0.1.0b2.dev20260204070724.dist-info/WHEEL,sha256=wUyA8OaulRlbfwMtmQsvNngGrxQHAvkKcvRmdizlJi0,92
-vllm_sr-0.1.0b2.dev20260204070724.dist-info/entry_points.txt,sha256=WhlBPbLHUpWUsMuUQX9cnvsYMf0ih5i57vvJ1jJNi0k,42
-vllm_sr-0.1.0b2.dev20260204070724.dist-info/top_level.txt,sha256=2ImG917oaVHlm0nP9oJE-Qrgs-fq_fGWgba2H1f8fpE,4
-vllm_sr-0.1.0b2.dev20260204070724.dist-info/RECORD,,
+vllm_sr-0.1.0b2.dev20260204090051.dist-info/METADATA,sha256=DoGzgyB-bYglhYLedrBqXLTlhB2RqF8rN1qosDC8SGQ,7298
+vllm_sr-0.1.0b2.dev20260204090051.dist-info/WHEEL,sha256=wUyA8OaulRlbfwMtmQsvNngGrxQHAvkKcvRmdizlJi0,92
+vllm_sr-0.1.0b2.dev20260204090051.dist-info/entry_points.txt,sha256=WhlBPbLHUpWUsMuUQX9cnvsYMf0ih5i57vvJ1jJNi0k,42
+vllm_sr-0.1.0b2.dev20260204090051.dist-info/top_level.txt,sha256=2ImG917oaVHlm0nP9oJE-Qrgs-fq_fGWgba2H1f8fpE,4
+vllm_sr-0.1.0b2.dev20260204090051.dist-info/RECORD,,

{vllm_sr-0.1.0b2.dev20260204070724.dist-info → vllm_sr-0.1.0b2.dev20260204090051.dist-info}/WHEEL RENAMED

File without changes

{vllm_sr-0.1.0b2.dev20260204070724.dist-info → vllm_sr-0.1.0b2.dev20260204090051.dist-info}/entry_points.txt RENAMED

File without changes

{vllm_sr-0.1.0b2.dev20260204070724.dist-info → vllm_sr-0.1.0b2.dev20260204090051.dist-info}/top_level.txt RENAMED

File without changes
