PyPI - sie-server - Versions diffs - 0.4.2__tar.gz → 0.5.0__tar.gz - Mend

sie-server 0.4.2__tar.gz → 0.5.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (436)

{sie_server-0.4.2 → sie_server-0.5.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: sie-server
-Version: 0.4.2
+Version: 0.5.0
 Summary: Search Inference Engine - GPU inference server for search workloads
 License: Apache-2.0
 License-File: LICENSE

{sie_server-0.4.2 → sie_server-0.5.0}/models/Qwen__Qwen3-4B-Instruct-2507.yaml RENAMED

@@ -14,12 +14,18 @@ tasks:
     context_length: 32768
     max_output_tokens: 4096
     capabilities:
-      # Outlines-backed JSON Schema, regex, and EBNF grammars are
-      # all supported by the SGLang adapter (Outlines and XGrammar
-      # both accept EBNF natively). The gateway gates requests on
-      # this exact list — adding a new ``grammar.kind`` variant
-      # requires both the gateway parser and this list to be updated.
-      grammar: ["json_schema", "regex", "ebnf"]
+      # Same constraint as Qwen3.5-4B / Qwen3.6-27B: this profile runs the
+      # default ``grammar_backend: outlines``, and SGLang's outlines_backend
+      # does NOT implement ebnf (it logs ``Skip unsupported key_type='ebnf'``
+      # then fails the compile — confirmed on A100 smoke 2026-05-26). Only
+      # advertise ``"ebnf"`` here once a profile pins an EBNF-capable backend
+      # (``grammar_backend: xgrammar``) AND a via-SIE EBNF smoke passes; the
+      # gateway gates requests on this exact list, so advertising ebnf without
+      # an ebnf-capable backend admits requests the worker then fails to serve.
+      # The ``test_advertised_ebnf_requires_capable_backend`` consistency test
+      # guards this invariant. Adding a new ``grammar.kind`` variant requires
+      # both the gateway parser and this list to be updated.
+      grammar: ["json_schema", "regex"]
       streaming: true
       # Qwen3-4B-Instruct's chat template emits OpenAI-compatible
       # ``<tool_call>{...}</tool_call>`` blocks when ``tools`` is
@@ -28,6 +34,16 @@ tasks:
       # them on ``delta.tool_calls`` for SSE and on
       # ``message.tool_calls`` for non-streaming requests.
       tools: true
+      # Validated for code generation — MEASURED HumanEval 0.866 / MBPP 0.74 on
+      # Modal A100 (see benchmarks/generation/.../code/measured_baseline.json).
+      # Backs the model="code" alias. Informational (not request-gated).
+      code: true
+      # Backs the model="sql" alias. NOTE: this profile runs the outlines
+      # backend (no ebnf), so SQL output here relies on the model's native
+      # text-to-SQL ability, NOT an EBNF grammar constraint. Re-add ``"ebnf"``
+      # to ``grammar`` above (behind an xgrammar profile + via-SIE EBNF smoke)
+      # to enable grammar-constrained SQL. Informational (not request-gated).
+      sql: true
     # Forwarded verbatim to ``tokenizer.apply_chat_template(**kwargs)`` when
     # the worker renders an OpenAI-shaped ``messages`` request.
     # Qwen3's chat template emits a ``<think>``/``</think>`` reasoning block
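
The comment in this hunk describes an invariant: `"ebnf"` may only be advertised under `grammar` when the serving profile pins an EBNF-capable backend. A minimal sketch of such a check follows. It is not the package's `test_advertised_ebnf_requires_capable_backend`; the function name, the dict-shaped config, and the choice to require every profile (rather than at least one) to be EBNF-capable are assumptions made for illustration.

```python
# Hypothetical consistency check. The config shape follows the YAML in this
# diff; the helper name and the "all profiles must comply" rule are assumptions.
EBNF_CAPABLE_BACKENDS = {"xgrammar"}    # per the comment: outlines fails on ebnf
DEFAULT_GRAMMAR_BACKEND = "outlines"    # per the comment: the default backend


def profiles_missing_ebnf_backend(model_config: dict) -> list[str]:
    """Return names of profiles that cannot serve an advertised "ebnf" grammar."""
    capabilities = model_config["tasks"]["generate"]["capabilities"]
    if "ebnf" not in capabilities.get("grammar", []):
        return []  # ebnf is not advertised, so there is nothing to verify

    offending = []
    for name, profile in model_config.get("profiles", {}).items():
        loadtime = profile.get("adapter_options", {}).get("loadtime", {})
        backend = loadtime.get("grammar_backend", DEFAULT_GRAMMAR_BACKEND)
        if backend not in EBNF_CAPABLE_BACKENDS:
            offending.append(name)
    return offending
```

An empty result means the advertised list and the backends agree; a non-empty result names the profiles that would admit EBNF requests the worker cannot serve.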

{sie_server-0.4.2 → sie_server-0.5.0}/models/Qwen__Qwen3.5-4B.yaml RENAMED

@@ -29,6 +29,11 @@ tasks:
       grammar: ["json_schema", "regex"]
       streaming: true
       tools: true
+      # NOTE: not advertising code:true here yet — Qwen3.5-4B is the strongest
+      # model on paper but its NEXTN/hybrid serving path does not come up
+      # reliably on the current eval image, so it has no measured HumanEval/MBPP
+      # baseline. The ``model="code"`` alias points to the measured
+      # Qwen3-4B-Instruct-2507; promote this once 3.5-4B is measured.
     # Grammar backend: ``outlines`` (set per-profile under
     # ``adapter_options.loadtime``). Earlier revisions forced ``xgrammar``
     # here because the worker-side Outlines preflight (``compile_outlines``)

{sie_server-0.4.2 → sie_server-0.5.0}/models/Qwen__Qwen3.6-27B.yaml RENAMED

@@ -34,6 +34,23 @@ tasks:
       grammar: ["json_schema", "regex"]
       streaming: true
       tools: true
+      # Validated on code + text-to-SQL — MEASURED on Modal H100 (bf16 no-spec,
+      # greedy+min_tokens=10; accuracy is GPU-invariant + FP8≈BF16 per ADR
+      # 0001): HumanEval 0.933, MBPP 0.81 (beats the 4B code default 0.866/0.74),
+      # Spider exec-acc 0.693. See code_sql_tools/measured_baseline.json.
+      # Informational (not request-gated); surfaces the high-end code/SQL option.
+      code: true
+      # CAVEAT: this flag is precision-agnostic, but FP8 weight quant regresses
+      # SQL ~13pts (Spider 0.71 BF16 -> 0.58 FP8, same-subset control; see the
+      # ADR + baseline). Code/tools/MC are FP8-safe. The default rtx-pro-6000
+      # profile is FP8, so route SQL-critical traffic to the BF16+NEXTN variant
+      # (the 96GB card fits it). The `:profile` variant suffix can't route by
+      # precision (it's stripped before worker dispatch), so deploy the BF16
+      # variant under a distinct bundle and point the `sql` job alias at it via
+      # a bundle-qualified target, e.g.
+      # SIE_GATEWAY_MODEL_ALIASES={"sql":"<bf16-bundle>:/Qwen/Qwen3.6-27B"}
+      # (gateway resolve_model_spec_with_aliases; see ADR 0001).
+      sql: true
     # Qwen3.6 emits ``<think>...</think>`` reasoning by default. We
     # disable it for the OpenAI-compat path so visible output is the
     # answer only. Operators wanting CoT can flip this profile-side.

sie_server-0.5.0/models/defog__sqlcoder-7b-2.yaml ADDED

@@ -0,0 +1,70 @@
+sie_id: defog/sqlcoder-7b-2
+hf_id: defog/sqlcoder-7b-2
+# SQLCoder-7B-2 (Defog) — a CodeLlama-7B base fine-tuned for text-to-SQL.
+# Served via the existing SGLang generation adapter (LlamaForCausalLM is native
+# to SGLang); onboarding is config-only.
+#
+# CAVEATS (read before pointing model="sql" at this):
+#   * COMPLETION model, not chat-tuned. It expects its verbatim template
+#     (``### Task ... ### Database Schema ... ### Answer ...[SQL]``) via the
+#     completions path — a chat-completions wrapper will underperform. The
+#     model="sql" alias currently targets Qwen3-4B-Instruct (chat + grammar
+#     path); repoint here only once the SQLCoder prompt template + completions
+#     rendering are wired and measured on Spider.
+#   * dtype float16 (CodeLlama base; the card pins fp16).
+#   * License CC-BY-SA-4.0 (share-alike copyleft) — clear distribution/bundle
+#     use before shipping.
+#   * Card recommends num_beams=4 (beam search); SGLang serves greedy, so
+#     expect a small delta vs the published SQL-Eval numbers.
+#   * MEASURED in SIE (Modal A100): serves via the sglang.generation adapter.
+#     The native ### Task...[SQL] completions template is ESSENTIAL — generic
+#     chat prompt scored 0.025, the native template (spider_sqlcoder task,
+#     --mode completions) scored 0.467 (140/300, greedy) on 2026-06-03. But that
+#     still LOSES to general Qwen3-4B-Instruct at 0.70 on Spider exec-acc, so the
+#     model="sql" alias STAYS on Qwen3-4B-Instruct. (The card recommends
+#     num_beams=4; SGLang serves greedy, so 0.467 is a lower bound, but beam
+#     search is a serving change and would still likely trail.) See
+#     benchmarks/generation/Qwen__Qwen3-4B-Instruct-2507/sql/measured_baseline.json.
+inputs:
+  text: true
+  image: false
+  audio: false
+  video: false
+tasks:
+  generate:
+    # Base model supports 16384; 8192 comfortably fits a Spider schema + question.
+    context_length: 8192
+    max_output_tokens: 512
+    capabilities:
+      # Grammar-constrained SQL (the "any LLM + grammar" path) is plausible via
+      # the SGLang outlines/xgrammar backend but unverified for this model, so
+      # nothing is advertised here yet.
+      grammar: []
+      streaming: true
+      tools: false
+      # NOT advertising sql:true. The capability flag would expose this model as
+      # SQL-ready to capability-based SDK/UI selection, but the currently-wired
+      # served path (generic chat) scores only 0.025 on Spider, and even with its
+      # native completions template (0.467) it loses to Qwen3-4B-Instruct (0.70 —
+      # the actual model="sql" target). Flip on only once SQLCoder beats the
+      # incumbent on a wired, measured path.
+profiles:
+  default:
+    # Generation does not batch at the SIE layer (SGLang batches internally) but
+    # the validator requires the field.
+    max_batch_tokens: 16384
+    compute_precision: float16
+    adapter_path: sie_server.adapters.sglang.generation:SGLangGenerationAdapter
+    kv_budget_tokens: 32768
+    adapter_options:
+      loadtime:
+        mem_fraction_static: 0.85
+        served_model_name: defog/sqlcoder-7b-2
+      runtime:
+        first_chunk_timeout_s: 30
+        inter_chunk_timeout_s: 10
+        overall_timeout_s: 300
+        # Deterministic decode for text-to-SQL (the card uses do_sample=False).
+        default_sampling:
+          temperature: 0.0
+          top_p: 1.0

sie_server-0.5.0/models/ibm-granite__granite-guardian-3.0-2b.yaml ADDED

@@ -0,0 +1,93 @@
+sie_id: ibm-granite/granite-guardian-3.0-2b
+hf_id: ibm-granite/granite-guardian-3.0-2b
+inputs:
+  text: true
+  image: false
+  audio: false
+  video: false
+tasks:
+  # CHECK POLICY job: a generative guard model. It takes a conversation (or a
+  # single user message) and *generates* a one-token verdict — "Yes" (unsafe)
+  # or "No" (safe) — under a risk taxonomy. Its chat template defaults to the
+  # ``harm`` risk when no ``guardian_config`` is supplied, so the standard
+  # OpenAI ``/v1/chat/completions`` path elicits a moderation verdict with no
+  # special kwargs. Served on the same SGLang generation adapter as the other
+  # decoder-only LLMs (architecture: GraniteForCausalLM).
+  generate:
+    context_length: 8192
+    # Moderation needs only a couple of tokens ("Yes"/"No"); the cap is a
+    # generous ceiling, not a target.
+    max_output_tokens: 512
+    capabilities:
+      # Guard verdicts are free-form single tokens — no grammar/tools needed.
+      grammar: []
+      streaming: true
+      tools: false
+      # Content-moderation / policy-check job — MEASURED on ToxicChat via the
+      # generation gate (see safety/measured_baseline.json): high-recall (0.97)
+      # / low-precision (0.16) under the default 'harm' risk. Backs the
+      # model="guard" alias. Informational (not request-gated).
+      guard: true
+max_sequence_length: 8192
+profiles:
+  default:
+    # max_batch_tokens is a generic engine knob; generation does not batch at
+    # the SIE layer (SGLang batches internally) but the validator requires it.
+    max_batch_tokens: 16384
+    compute_precision: bfloat16
+    adapter_path: sie_server.adapters.sglang.generation:SGLangGenerationAdapter
+    # Conservative L4 baseline; moderation prompts are short so a small KV
+    # budget is ample. Re-calibrate with a concurrency/OOM sweep if promoted.
+    kv_budget_tokens: 8192
+    adapter_options:
+      loadtime:
+        mem_fraction_static: 0.85
+        served_model_name: ibm-granite/granite-guardian-3.0-2b
+        # CHECK POLICY precision dial. The adapter reads the verdict-token
+        # logprobs, computes P(unsafe) over Yes/No, and returns "Yes" iff
+        # P(unsafe) >= threshold. threshold=0.5 == the model's argmax (the
+        # current high-recall default: recall 0.97 / precision 0.16). Measured
+        # ToxicChat trade-off (guard baseline logprob_threshold_sweep):
+        #   0.5  -> F1 0.26  recall 0.97  precision 0.15  (catch-everything)
+        #   0.8  -> F1 0.38  recall 0.54  precision 0.29  (best F1)
+        #   0.95 -> F1 0.31  recall 0.20  precision 0.70  (precision-critical)
+        # Raising it trades recall for precision — a PRODUCT/safety decision per
+        # deployment (a moderation guard usually favours recall), so the shipped
+        # default stays at argmax. Operators raise it to taste.
+        guard:
+          threshold: 0.5
+      runtime:
+        first_chunk_timeout_s: 30
+        inter_chunk_timeout_s: 10
+        overall_timeout_s: 300
+        # Greedy: a guard verdict must be deterministic.
+        default_sampling:
+          temperature: 0.0
+        stop_tokens:
+          - "<|end_of_text|>"
+  a100-40gb:
+    max_batch_tokens: 32768
+    compute_precision: bfloat16
+    adapter_path: sie_server.adapters.sglang.generation:SGLangGenerationAdapter
+    # ~2.5B weights leave most of a 40 GB card for KV; moderation traffic is
+    # short-context so this budget is set for many concurrent admissions.
+    kv_budget_tokens: 65536
+    adapter_options:
+      loadtime:
+        mem_fraction_static: 0.85
+        served_model_name: ibm-granite/granite-guardian-3.0-2b
+        # Duplicated from ``default`` — this is a NON-extending profile, and
+        # loadtime blocks are whole-dict replaced (not deep-merged), so the
+        # CHECK POLICY precision dial would otherwise be dropped here and the
+        # A100 variant would silently fall back to raw argmax (self._guard={}).
+        # Keep in sync with the ``default`` profile's guard.threshold.
+        guard:
+          threshold: 0.5
+      runtime:
+        first_chunk_timeout_s: 30
+        inter_chunk_timeout_s: 10
+        overall_timeout_s: 300
+        default_sampling:
+          temperature: 0.0
+        stop_tokens:
+          - "<|end_of_text|>"
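
The `guard.threshold` comments describe the verdict rule: compute P(unsafe) over the Yes/No verdict tokens and return "Yes" iff P(unsafe) >= threshold. A minimal sketch of that rule, together with the standard F1 formula used in the sweep table, is shown here. The function names and the two-logprob input are assumptions; how the adapter actually extracts the logprobs is not shown in this diff.

```python
import math


def p_unsafe(logprob_yes: float, logprob_no: float) -> float:
    """Renormalise the two verdict-token logprobs into P(unsafe)."""
    # Subtract the max before exponentiating for numerical stability.
    peak = max(logprob_yes, logprob_no)
    weight_yes = math.exp(logprob_yes - peak)
    weight_no = math.exp(logprob_no - peak)
    return weight_yes / (weight_yes + weight_no)


def guard_verdict(logprob_yes: float, logprob_no: float, threshold: float = 0.5) -> str:
    """Return "Yes" (unsafe) iff P(unsafe) meets the threshold, else "No"."""
    return "Yes" if p_unsafe(logprob_yes, logprob_no) >= threshold else "No"


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

The F1 values in the sweep are consistent with the listed precision and recall: 0.15 and 0.97 give about 0.26, 0.29 and 0.54 give about 0.38, and 0.70 and 0.20 give about 0.31. A message with P(unsafe) = 0.7 is flagged at the default threshold of 0.5 but passes at 0.8, which is the recall-for-precision trade the comment describes.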

{sie_server-0.4.2 → sie_server-0.5.0}/openapi.json RENAMED

@@ -3,7 +3,7 @@
   "info": {
     "title": "SIE Server",
     "description": "Search Inference Engine - GPU inference server for search workloads",
-    "version": "0.4.2"
+    "version": "0.5.0"
   },
   "paths": {
     "/": {
@@ -848,6 +848,41 @@
         "type": "object",
         "title": "HTTPValidationError"
       },
+      "ModelCapabilities": {
+        "properties": {
+          "grammar": {
+            "items": {
+              "type": "string"
+            },
+            "type": "array",
+            "title": "Grammar",
+            "default": []
+          },
+          "tools": {
+            "type": "boolean",
+            "title": "Tools",
+            "default": false
+          },
+          "code": {
+            "type": "boolean",
+            "title": "Code",
+            "default": false
+          },
+          "sql": {
+            "type": "boolean",
+            "title": "Sql",
+            "default": false
+          },
+          "guard": {
+            "type": "boolean",
+            "title": "Guard",
+            "default": false
+          }
+        },
+        "type": "object",
+        "title": "ModelCapabilities",
+        "description": "Advertised generation capabilities for a model.\n\nMirrors the gateway ``capabilities`` wire shape\n(``ModelCapabilitiesWire``) for the keys derivable from the loaded\nmodel config's :class:`~sie_server.config.model.GenerateCapabilities`.\n``code``/``sql``/``guard`` are informational flags advertising\nvalidated generation jobs that back the ``model=\"code\"`` /\n``model=\"sql\"`` / ``model=\"guard\"`` aliases. Populated only for\nmodels that declare ``tasks.generate``; ``None`` otherwise.\n\nThese flags mean the model *supports* a task \u2014 they are NOT a\nprecision-independent quality SLA. A flag is true at the model level even\nwhen quality is profile/precision-dependent (e.g. ``sql`` quality regresses\nunder FP8; route SQL-critical traffic to a BF16 bundle via the ``sql``\nalias)."
+      },
       "ModelInfo": {
         "properties": {
           "name": {
@@ -919,6 +954,16 @@
             "type": "object",
             "title": "Profiles",
             "default": {}
+          },
+          "capabilities": {
+            "anyOf": [
+              {
+                "$ref": "#/components/schemas/ModelCapabilities"
+              },
+              {
+                "type": "null"
+              }
+            ]
           }
         },
         "type": "object",

{sie_server-0.4.2 → sie_server-0.5.0}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "sie-server"
-version = "0.4.2"
+version = "0.5.0"
 description = "Search Inference Engine - GPU inference server for search workloads"
 requires-python = ">=3.12,<3.13"
 license = { text = "Apache-2.0" }

{sie_server-0.4.2 → sie_server-0.5.0}/src/sie_server/adapters/florence2/__init__.py RENAMED

@@ -29,6 +29,7 @@ from sie_server.adapters._base_adapter import BaseAdapter
 from sie_server.adapters._spec import AdapterSpec
 from sie_server.adapters._types import ERR_NOT_LOADED, ComputePrecision
 from sie_server.core.inference_output import EncodeOutput, ExtractOutput
+from sie_server.core.preprocessor.vision import resolve_florence2_prompt
 from sie_server.types.inputs import media_bytes
 from sie_server.types.responses import DetectedObject, Entity
@@ -247,15 +248,18 @@ class Florence2Adapter(BaseAdapter):
         max_new_tokens = options.get("max_new_tokens", self._max_new_tokens)
         num_beams = options.get("num_beams", self._num_beams)
-        # Build task prompt
-        prompt = self._build_prompt(task, labels, instruction)
+        # Resolve the prompt and the effective task token. A free-text instruction
+        # is answered via DocVQA, so the effective task may differ from the
+        # configured one — post-processing must use the effective task to match
+        # what the prompt asked the model to do.
+        prompt, effective_task = resolve_florence2_prompt(task, labels, instruction)
         # Use preprocessed items if available
         if prepared_items is not None and len(prepared_items) > 0:
             return self._extract_preprocessed(
                 items=items,
                 prepared_items=prepared_items,
-                task=task,
+                task=effective_task,
                 max_new_tokens=max_new_tokens,
                 num_beams=num_beams,
             )
@@ -267,7 +271,7 @@ class Florence2Adapter(BaseAdapter):
             entities, objects = self._extract_single(
                 item,
                 prompt=prompt,
-                task=task,
+                task=effective_task,
                 max_new_tokens=max_new_tokens,
                 num_beams=num_beams,
             )
@@ -388,16 +392,8 @@ class Florence2Adapter(BaseAdapter):
         Returns:
             Complete prompt string.
         """
-        # Use instruction as custom prompt if provided
-        if instruction:
-            return f"{task}{instruction}"
-        # For phrase grounding, append labels
-        if task == TASK_CAPTION_TO_PHRASE_GROUNDING and labels:
-            label_text = ", ".join(labels)
-            return f"{task}{label_text}"
-        return task
+        prompt, _ = resolve_florence2_prompt(task, labels, instruction)
+        return prompt
     def _extract_single(
         self,
