npm - @miller-tech/uap - Versions diffs - 1.15.5 → 1.15.7 - Mend

@miller-tech/uap 1.15.5 → 1.15.7


Files changed (7)

package/docs/INDEX.md +8 -0
package/docs/blog/SPECULATIVE_DECODING_PRODUCTION_PLAYBOOK.md +139 -0
package/docs/pr/PR_SPECULATIVE_DOCS_TEMPLATE.md +146 -0
package/package.json +1 -1
package/templates/hooks/pre-tool-use-bash.sh +9 -0
package/tools/agents/scripts/anthropic_proxy.py +302 -53
package/tools/agents/tests/test_anthropic_proxy_streaming.py +197 -0

package/docs/INDEX.md CHANGED

@@ -47,6 +47,14 @@
 - [Token Optimization](benchmarks/TOKEN_OPTIMIZATION.md) -- Per-feature token savings analysis
 - [Accuracy Analysis](benchmarks/ACCURACY_ANALYSIS.md) -- Internal vs Terminal-Bench comparison
+## Blog
+- [Speculative Decoding Production Playbook](blog/SPECULATIVE_DECODING_PRODUCTION_PLAYBOOK.md) -- Long-form narrative on throughput gains, failure modes, and stable profiles
+## PR Templates
+- [Speculative Docs PR Template](pr/PR_SPECULATIVE_DOCS_TEMPLATE.md) -- Ready-to-submit PR copy, checklist, and reviewer guidance
 ## Research
 - [Memory Systems Comparison](research/MEMORY_SYSTEMS_COMPARISON.md) -- MemGPT, LangGraph, Mem0, A-MEM analysis

package/docs/blog/SPECULATIVE_DECODING_PRODUCTION_PLAYBOOK.md ADDED

@@ -0,0 +1,139 @@
+# Speculative Decoding in llama.cpp: Real Speedups Without Breaking Agentic Reliability
+Speculative decoding can look like free performance - until it meets long-context, tool-heavy agent workflows. This write-up covers what improved throughput, what regressed, and which operational changes restored stability across `llama.cpp` and an Anthropic-compatible proxy.
+## Why This Matters
+Speculative decoding is strongest when generated text has predictable structure or repetition. But in real coding sessions, throughput alone is not enough: the system must preserve clean output, reliable tool-call behavior, and long-session continuity.
+In practice, the following four layers behave as one runtime boundary:
+- `llama.cpp` speculative behavior
+- parameter profile and rollback mode
+- proxy streaming/fallback policies
+- agentic tool-loop control behavior
+## Baseline Environment
+- Runtime: `llama.cpp` + CUDA + Qwen3.5 GGUF
+- Context window: `262144`
+- Spec type: `ngram-cache`
+- Gateway: Anthropic-compatible proxy forwarding to OpenAI-compatible server
+Related runbooks:
+- `docs/deployment/UAP_LLAMA_ANTHROPIC_PROXY_BOOTSTRAP.md`
+- `docs/benchmarks/SPECULATIVE_DECODING_JOURNEY_2026-03.md`
+## What We Observed
+### Throughput Gains Were Workload-Dependent
+Speculation did not uniformly improve all turns. Coding/tool turns often saw small uplift; repetition-heavy turns saw large gains.
+Representative 27B snapshot (`ctx=262144`):
+- No spec: ~43 tok/s coding, ~41 tok/s pattern
+- Balanced spec (`12/2/0.80`): ~43 tok/s coding, ~102 tok/s pattern
+Takeaway: benchmark by workload class, not one blended average.
+### Newer Lineage Produced Noisier Warnings
+Under identical settings, newer builds emitted warnings such as:
+- `find_slot: non-consecutive token position`
+This correlated with lower effective throughput and less stable long-session behavior in A/B comparisons.
+### Proxy Fallback Could Leak Malformed Internal Text
+When the upstream server returned reasoning-heavy output with no visible content, a weak fallback policy could expose malformed fragments (pseudo-tool text, schema/policy echoes) to end users.
+Patterns included:
+- `</parameter>`-style fragments
+- non-JSON pseudo-tool content
+- repetitive policy-like loops with no valid `tool_calls`
+## Immediate Fixes That Worked
+### Safe Production Defaults
+The highest-leverage stabilization profile was:
+- `PROXY_STREAM_REASONING_FALLBACK=off`
+- `PROXY_MALFORMED_TOOL_GUARDRAIL=on`
+- `PROXY_MALFORMED_TOOL_STREAM_STRICT=on`
+- `PROXY_MAX_TOKENS_FLOOR=4096`
+Why:
+- `fallback=off` suppresses malformed reasoning leakage.
+- malformed-tool guardrail + strict stream path recovers bad stream+tools turns.
+- lower token floor reduces long failure-turn latency while preserving normal turns.
+### Balanced Speculative Profile for Daily Agentic Work
+- `spec-type=ngram-cache`
+- `draft-max=12`
+- `draft-min=2`
+- `draft-p-min=0.80`
+- rollback mode: `strict`
+This profile is less aggressive than max-throughput tuning, but significantly safer for long coding sessions.
+## Benchmark Method That Prevents False Wins
+A useful speculative benchmark protocol should include:
+1. Prompt classes
+   - coding/tool-call tasks
+   - repetition/pattern-heavy tasks
+2. Repeats and warmup
+   - fixed run count
+   - warmup policy
+   - p50/p95 latency, not only mean tok/s
+3. Required metrics
+   - decode throughput (`eval tok/s`)
+   - prefill throughput (`prompt eval tok/s`)
+   - acceptance/rejection behavior
+   - malformed-turn incidence
+   - stop reason distribution
+4. Profile matrix
+   - no-spec baseline
+   - aggressive profile
+   - balanced profile
+Without this, speculative tuning can appear faster while degrading real agentic reliability.
+## Practical Playbook
+### Use for Daily Agentic Coding
+- balanced `ngram-cache` (`12/2/0.80`)
+- strict malformed-tool stream guardrail
+- reasoning fallback disabled
+- reduced token floor (`4096`)
+### Use for Max Throughput Exploration
+- hybrid rollback
+- larger draft windows
+- tightly scoped benchmark prompts
+Then promote only if long-session tool-loop soak remains stable.
+## What llama.cpp Docs Should Add Next
+Mechanics are documented well today. The next improvement is operational clarity:
+- implementation selection matrix by workload
+- troubleshooting by signature (`find_slot`, rollback spikes, acceptance collapse)
+- reproducible benchmark protocol and output schema
+- rollout/canary/rollback criteria
+- proxy compatibility appendix for stream+tools environments
+## Final Takeaway
+Speculative decoding in production is a systems problem, not just a decoding primitive. Treating runtime + transport + tool-loop behavior as one boundary is what makes speculative speedups both real and reliable.
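
The benchmark protocol in the playbook above (per-workload grouping, warmup, p50/p95 latency, malformed-turn incidence) can be sketched as a small aggregation step. This is a minimal illustration, not part of the package: the `Run` record, its field names, and the nearest-rank percentile are all assumptions made for the example.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Run:
    """One measured turn; all field names here are illustrative."""
    profile: str         # e.g. "no-spec", "balanced"
    workload: str        # e.g. "coding", "pattern"
    decode_tok_s: float  # decode throughput (eval tok/s)
    latency_s: float     # end-to-end latency for the turn
    malformed: bool      # True if the turn produced malformed output


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(runs, warmup=1):
    """Group runs by (profile, workload), drop warmup runs, report stats."""
    groups = defaultdict(list)
    for run in runs:
        groups[(run.profile, run.workload)].append(run)
    summary = {}
    for key, group in groups.items():
        measured = group[warmup:]  # discard the first `warmup` runs per group
        if not measured:
            continue
        latencies = [r.latency_s for r in measured]
        summary[key] = {
            "runs": len(measured),
            "mean_decode_tok_s": sum(r.decode_tok_s for r in measured) / len(measured),
            "p50_latency_s": percentile(latencies, 50),
            "p95_latency_s": percentile(latencies, 95),
            "malformed_rate": sum(r.malformed for r in measured) / len(measured),
        }
    return summary
```

Keying the summary by `(profile, workload)` keeps coding and pattern-heavy results separate, which is the point of the "benchmark by workload class" takeaway: a blended mean would hide a profile that helps one class and does nothing for the other.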

package/docs/pr/PR_SPECULATIVE_DOCS_TEMPLATE.md ADDED

@@ -0,0 +1,146 @@
+## Title
+docs: add speculative decoding production playbook and agentic compatibility guidance
+## Context
+`docs/speculative.md` explains speculative mechanisms and flags, but production operators also need:
+- workload-driven profile selection,
+- reproducible benchmarking protocol,
+- signature-based regression triage,
+- guidance for stream+tools agentic environments.
+This PR adds operational documentation to reduce drift between benchmark wins and real-session behavior.
+## Changes
+### Add new guide
+- New: `docs/speculative-production.md`
+  - implementation matrix:
+    - `draft`
+    - `ngram-cache`
+    - `ngram-simple`
+    - `ngram-map-k`
+    - `ngram-map-k4v`
+    - `ngram-mod`
+  - decision tree by workload (coding, repetitive transform, mixed)
+  - benchmark protocol (run counts, warmup, prompt classes, metrics)
+  - troubleshooting by signature:
+    - `find_slot: non-consecutive token position`
+    - low acceptance + high rollback
+    - throughput collapse after commit switch
+  - rollout rules (canary, promotion threshold, rollback triggers)
+### Update existing speculative docs
+- Update `docs/speculative.md`:
+  - add link to production guide
+  - add "how to interpret statistics in practice"
+  - add "workload sensitivity and reproducibility notes"
+### Add compatibility appendix
+- New appendix (or linked page): stream+tools compatibility for proxy-mediated agentic flows
+  - fallback policy guidance (`off` default for production)
+  - malformed stream/tool guardrail behavior
+  - max token floor and prune target recommendations
+## Why
+Speculative decoding quality in agentic coding depends on end-to-end behavior, including transport and stream tool-loop handling. This documentation closes that gap and provides a repeatable operator path.
+## Validation Plan
+- Verify all CLI flags/options in examples against current `llama-server`.
+- Verify all linked scripts/docs paths resolve.
+- Include one benchmark table with:
+  - decode/prefill throughput
+  - acceptance indicators
+  - latency percentiles
+  - workload class labels
+## Risks
+- Overfitting recommendations to one model/hardware class.
+- Treating proxy behavior as universally required.
+## Mitigations
+- Mark all profile recommendations as workload/hardware sensitive.
+- Separate "safe baseline" from "aggressive benchmark-only" profiles.
+- Require local A/B validation before rollout.
+## Out of Scope
+- Runtime code changes
+- Kernel-level speculative optimization changes
+- Proxy implementation changes (docs-only PR)
+## Follow-ups
+1. Add nightly speculative regression harness.
+2. Publish benchmark JSON schema for machine comparison.
+3. Add commit-lineage tracking for performance regressions.
+---
+## Ready-to-Submit GitHub PR Body
+### Summary
+This docs PR adds a production-oriented speculative decoding playbook for llama.cpp users running real multi-turn workloads (especially agentic/tool-call scenarios). It complements existing mechanism-level docs with actionable tuning, troubleshooting, and rollout guidance.
+### What Changed
+- Added `docs/speculative-production.md` (new operational guide)
+  - implementation selection matrix
+  - workload-based decision tree
+  - benchmark protocol + required metrics
+  - troubleshooting by real log signatures
+  - canary/rollback rollout guidance
+- Updated `docs/speculative.md`
+  - links to production guide
+  - practical stats interpretation notes
+  - workload sensitivity notes
+- Added/linked "agentic stream+tools compatibility" appendix
+  - fallback policy defaults
+  - malformed stream/tool guardrails
+  - token-floor/prune guidance
+### Why
+Current docs describe speculative decoding internals clearly, but production operators need a reproducible way to:
+- choose stable profiles by workload,
+- detect/triage regressions quickly,
+- avoid benchmark-only wins that fail in long sessions.
+### Reviewer Guide
+Please focus review on:
+1. Accuracy of CLI flags and option names.
+2. Correctness of troubleshooting signatures and interpretations.
+3. Clarity of benchmark protocol (can another team reproduce it?).
+4. Whether safe-vs-aggressive profile separation is clear enough.
+### Validation
+- [ ] Command examples verified against current `llama-server --help`
+- [ ] Linked docs/scripts paths validated
+- [ ] Benchmark table includes workload class labels
+- [ ] Metrics include decode/prefill throughput + latency percentile view
+- [ ] No runtime behavior claims without explicit caveats
+### Risks / Caveats
+- Recommendations are model/hardware/workload dependent.
+- Guidance is operational, not a substitute for local A/B testing.
+### Follow-ups
+- [ ] Add nightly regression harness for speculative profiles
+- [ ] Publish machine-readable benchmark schema
+- [ ] Add commit lineage references in benchmark artifacts

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@miller-tech/uap",
-  "version": "1.15.5",
+  "version": "1.15.7",
   "description": "Autonomous AI agent memory system with CLAUDE.md protocol enforcement",
   "type": "module",
   "main": "dist/index.js",

package/templates/hooks/pre-tool-use-bash.sh CHANGED

@@ -22,6 +22,15 @@ if [ -z "$CMD" ]; then
   exit 0
 fi
+# ─── Protocol Tag Injection Guard ────────────────────────────────
+# Reject Bash payloads that still contain standalone protocol tag lines.
+# These fragments can appear after malformed tool-call rendering and must
+# never reach shell evaluation.
+if printf '%s\n' "$CMD" | grep -qE '^\s*</?(tool_call|tool_response|parameter(=[^>]*)?|function(=[^>]*)?|think)\s*>\s*$'; then
+  echo "BLOCKED [bash-safety]: Command contains standalone XML/protocol tag lines. Remove tool-call tag artifacts before execution." >&2
+  exit 2
+fi
 # ─── IaC Pipeline Enforcement ───────────────────────────────────
 # Block local terraform apply/destroy (policies/iac-pipeline-enforcement.md)
 # Allow: terraform fmt, validate, init, plan, output, show, state list, graph
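
To see which commands the new guard would reject, the check can be reproduced in a few lines of Python. The sketch below reuses the pattern that the proxy defines as `_BASH_PROTOCOL_LINE_RE` later in this diff; the helper name `has_protocol_tag_line` is hypothetical. One difference worth noting: the proxy pattern is compiled with `re.IGNORECASE`, while the hook's `grep -qE` call has no `-i` and is therefore case-sensitive.

```python
import re

# Same pattern as the proxy's _BASH_PROTOCOL_LINE_RE in this diff.
PROTOCOL_LINE_RE = re.compile(
    r"^\s*</?(?:tool_call|tool_response|parameter(?:=[^>]*)?|function(?:=[^>]*)?|think)\s*>\s*$",
    re.IGNORECASE,
)


def has_protocol_tag_line(command: str) -> bool:
    """Return True if any line of `command` is nothing but a protocol tag."""
    return any(PROTOCOL_LINE_RE.match(line) for line in command.splitlines())


if __name__ == "__main__":
    # A tag on its own line is flagged; a tag quoted inside a command is not.
    print(has_protocol_tag_line("ls -la\n</parameter>"))
    print(has_protocol_tag_line("grep '</tool_call>' log.txt"))
```

Because the pattern is anchored at both ends of a line, a command that merely mentions a tag (for example when grepping logs) passes, and only lines consisting solely of a tag are treated as rendering artifacts.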

package/tools/agents/scripts/anthropic_proxy.py CHANGED

@@ -254,6 +254,28 @@ PROXY_ANALYSIS_ONLY_MIN_TOOLS = int(
 PROXY_ANALYSIS_ONLY_MAX_MESSAGES = int(
     os.environ.get("PROXY_ANALYSIS_ONLY_MAX_MESSAGES", "2")
 )
+PROXY_TOOL_CALL_GRAMMAR = os.environ.get(
+    "PROXY_TOOL_CALL_GRAMMAR", "on"
+).lower() not in {
+    "0",
+    "false",
+    "off",
+    "no",
+}
+PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY = os.environ.get(
+    "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY", "on"
+).lower() not in {
+    "0",
+    "false",
+    "off",
+    "no",
+}
+PROXY_TOOL_CALL_GRAMMAR_PATH = os.path.abspath(
+    os.environ.get(
+        "PROXY_TOOL_CALL_GRAMMAR_PATH",
+        os.path.join(os.path.dirname(__file__), "..", "config", "tool-call.gbnf"),
+    )
+)
 # ---------------------------------------------------------------------------
 # Logging
@@ -266,6 +288,45 @@ logging.basicConfig(
 logger = logging.getLogger("uap.anthropic_proxy")
+def _load_tool_call_grammar(path: str) -> str:
+    if not PROXY_TOOL_CALL_GRAMMAR:
+        return ""
+    try:
+        with open(path, "r", encoding="utf-8") as fh:
+            return fh.read().strip()
+    except OSError as exc:
+        logger.warning(
+            "Tool-call grammar disabled: failed to read %s (%s)",
+            path,
+            exc,
+        )
+        return ""
+TOOL_CALL_GBNF = _load_tool_call_grammar(PROXY_TOOL_CALL_GRAMMAR_PATH)
+def _apply_tool_call_grammar(
+    request_body: dict, tool_choice: str | None = None
+) -> None:
+    request_body.pop("grammar", None)
+    if not PROXY_TOOL_CALL_GRAMMAR or not TOOL_CALL_GBNF:
+        return
+    if not request_body.get("tools"):
+        return
+    effective_tool_choice = (
+        tool_choice if tool_choice is not None else request_body.get("tool_choice")
+    )
+    if PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY and effective_tool_choice != "required":
+        return
+    request_body["grammar"] = TOOL_CALL_GBNF
 # ---------------------------------------------------------------------------
 # Option F: Session-level Context Window Monitor
 # ---------------------------------------------------------------------------
@@ -876,7 +937,7 @@ async def lifespan(app: FastAPI):
         _resolve_prune_target_fraction() * 100,
     )
     logger.info(
-        "Guardrails: malformed=%s stream_strict=%s force_non_stream=%s args_preflight=%s tool_narrowing=%s thinking_off_on_tools=%s dampener=%s(%d/%d/%d/%d->%d) contamination_breaker=%s(%d forced=%d required_miss=%d) analysis_only_route=%s(min_tools=%d,max_msgs=%d)",
+        "Guardrails: malformed=%s stream_strict=%s force_non_stream=%s args_preflight=%s tool_narrowing=%s thinking_off_on_tools=%s dampener=%s(%d/%d/%d/%d->%d) contamination_breaker=%s(%d forced=%d required_miss=%d) analysis_only_route=%s(min_tools=%d,max_msgs=%d) grammar=%s(required_only=%s loaded=%s path=%s)",
         PROXY_MALFORMED_TOOL_GUARDRAIL,
         PROXY_MALFORMED_TOOL_STREAM_STRICT,
         PROXY_FORCE_NON_STREAM,
@@ -896,6 +957,10 @@ async def lifespan(app: FastAPI):
         PROXY_ANALYSIS_ONLY_ROUTE,
         PROXY_ANALYSIS_ONLY_MIN_TOOLS,
         PROXY_ANALYSIS_ONLY_MAX_MESSAGES,
+        PROXY_TOOL_CALL_GRAMMAR,
+        PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY,
+        bool(TOOL_CALL_GBNF),
+        PROXY_TOOL_CALL_GRAMMAR_PATH,
     )
     yield
@@ -1044,49 +1109,27 @@ def _is_analysis_only_prompt(text: str) -> bool:
     if not text:
         return False
-    analysis_markers = (
-        "analy",
-        "review",
-        "audit",
-        "summar",
-        "explain",
-        "plan",
-        "recommend",
-        "assess",
-        "compare",
-        "investigate",
-        "diagnose",
+    normalized = text.lower()
+    has_analysis = bool(
+        re.search(
+            r"\b(?:analy(?:ze|zing|sis)?|review|audit|summar(?:y|ize|ized|ise)|explain|plan|recommend|assess|compare|investigate|diagnos(?:e|is))\b",
+            normalized,
+        )
     )
-    action_markers = (
-        "fix",
-        "edit",
-        "write",
-        "create",
-        "implement",
-        "patch",
-        "change",
-        "update",
-        "run ",
-        "execute",
-        "command",
-        "use tool",
-        "call tool",
-        "apply",
-        "commit",
-        "push",
-        "merge",
-        "publish",
-        "deploy",
-        "test",
-        "build",
-        "refactor",
-        "rename",
-        "delete",
-        "install",
+    has_action = bool(
+        re.search(
+            r"\b(?:fix|edit|write|create|implement|patch|change|update|run|execute|apply|commit|push|merge|publish|deploy|test|build|refactor|rename|delete|install)\b",
+            normalized,
+        )
+    ) or any(
+        phrase in normalized
+        for phrase in (
+            "use tool",
+            "call tool",
+            "run command",
+            "execute command",
+        )
     )
-    has_analysis = any(marker in text for marker in analysis_markers)
-    has_action = any(marker in text for marker in action_markers)
     return has_analysis and not has_action
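
The hunk above replaces substring containment with whole-word regex matching. The following sketch shows how the two approaches differ on the same input. It is a simplified illustration: `ACTION_WORDS` is a small subset of the real list, and both helper names are hypothetical.

```python
import re

ACTION_WORDS = ("fix", "run", "test", "build")  # small subset for illustration

ACTION_RE = re.compile(r"\b(?:" + "|".join(ACTION_WORDS) + r")\b")


def has_action_substring(text: str) -> bool:
    """Old-style check: plain substring containment."""
    return any(word in text for word in ACTION_WORDS)


def has_action_word(text: str) -> bool:
    """New-style check: lowercase first, then whole-word regex match."""
    return bool(ACTION_RE.search(text.lower()))


if __name__ == "__main__":
    prompt = "summarize the latest changes"
    # Substring matching finds "test" inside "latest"; word matching does not.
    print(has_action_substring(prompt), has_action_word(prompt))
```

Word boundaries remove false positives such as "latest", but they also stop inflected forms from matching: with this pattern "tests" or "fixes" no longer counts as an action word unless the form is listed explicitly, which may or may not be the intended trade-off.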
@@ -1467,6 +1510,8 @@ def build_openai_request(anthropic_body: dict, monitor: SessionMonitor) -> dict:
                 "Thinking disabled for tool turn (PROXY_DISABLE_THINKING_ON_TOOL_TURNS=on)"
             )
+        _apply_tool_call_grammar(openai_body)
     return openai_body
@@ -1793,6 +1838,11 @@ _TOOL_ARG_MARKERS = (
     "</think>",
 )
+_BASH_PROTOCOL_LINE_RE = re.compile(
+    r"^\s*</?(?:tool_call|tool_response|parameter(?:=[^>]*)?|function(?:=[^>]*)?|think)\s*>\s*$",
+    re.IGNORECASE,
+)
 def _iter_string_leaves(value):
     if isinstance(value, str):
@@ -1822,6 +1872,26 @@ def _strip_tool_markup_artifacts(text: str) -> str:
     return cleaned.strip()
+def _strip_protocol_tag_only_lines(text: str) -> tuple[str, bool]:
+    if not isinstance(text, str):
+        return text, False
+    lines = text.splitlines()
+    kept_lines: list[str] = []
+    stripped = False
+    for line in lines:
+        if _BASH_PROTOCOL_LINE_RE.match(line):
+            stripped = True
+            continue
+        kept_lines.append(line)
+    if not stripped:
+        return text, False
+    cleaned = "\n".join(kept_lines).strip()
+    return cleaned, True
 def _sanitize_markup_value(value):
     if isinstance(value, str):
         cleaned = _strip_tool_markup_artifacts(value)
@@ -2036,6 +2106,77 @@ def _repair_required_tool_args(
     return repaired_response, repaired_count
+def _repair_bash_command_artifacts(openai_resp: dict) -> tuple[dict, int]:
+    if not _openai_has_tool_calls(openai_resp):
+        return openai_resp, 0
+    choice, message = _extract_openai_choice(openai_resp)
+    tool_calls = message.get("tool_calls") or []
+    if not tool_calls:
+        return openai_resp, 0
+    repaired_tool_calls = []
+    repaired_count = 0
+    for tool_call in tool_calls:
+        fn = tool_call.get("function") if isinstance(tool_call, dict) else {}
+        if not isinstance(fn, dict):
+            fn = {}
+        tool_name = str(fn.get("name", "")).strip().lower()
+        if tool_name != "bash":
+            repaired_tool_calls.append(tool_call)
+            continue
+        raw_args = fn.get("arguments", "{}")
+        if isinstance(raw_args, dict):
+            parsed_args = dict(raw_args)
+        else:
+            try:
+                parsed_args = json.loads(str(raw_args))
+            except json.JSONDecodeError:
+                repaired_tool_calls.append(tool_call)
+                continue
+        if not isinstance(parsed_args, dict):
+            repaired_tool_calls.append(tool_call)
+            continue
+        command = parsed_args.get("command")
+        if not isinstance(command, str):
+            repaired_tool_calls.append(tool_call)
+            continue
+        cleaned_command, changed = _strip_protocol_tag_only_lines(command)
+        if not changed:
+            repaired_tool_calls.append(tool_call)
+            continue
+        parsed_args["command"] = cleaned_command
+        new_tool_call = dict(tool_call)
+        new_fn = dict(fn)
+        new_fn["arguments"] = json.dumps(parsed_args, separators=(",", ":"))
+        new_tool_call["function"] = new_fn
+        repaired_tool_calls.append(new_tool_call)
+        repaired_count += 1
+    if repaired_count == 0:
+        return openai_resp, 0
+    repaired_response = dict(openai_resp)
+    choices = list(openai_resp.get("choices") or [])
+    if not choices:
+        return openai_resp, 0
+    updated_choice = dict(choice)
+    updated_message = dict(message)
+    updated_message["tool_calls"] = repaired_tool_calls
+    updated_choice["message"] = updated_message
+    choices[0] = updated_choice
+    repaired_response["choices"] = choices
+    return repaired_response, repaired_count
 def _required_value_is_empty(value) -> bool:
     if value is None:
         return True
@@ -2132,6 +2273,22 @@ def _validate_tool_call_arguments(
             ),
         )
+    if tool_name.strip().lower() == "bash":
+        command = parsed.get("command")
+        if isinstance(command, str):
+            cleaned_command, had_protocol_lines = _strip_protocol_tag_only_lines(
+                command
+            )
+            if had_protocol_lines and not cleaned_command:
+                return ToolResponseIssue(
+                    kind="invalid_tool_args",
+                    reason="arguments for 'Bash' contained only protocol tag lines",
+                    retry_hint=(
+                        "Emit exactly one `Bash` tool call with a valid shell command in `arguments.command`. "
+                        "Do not include standalone XML/protocol tags."
+                    ),
+                )
     if _contains_tool_markup(parsed):
         return ToolResponseIssue(
             kind="invalid_tool_args",
@@ -2345,20 +2502,34 @@ def _is_malformed_tool_response(openai_resp: dict, anthropic_body: dict) -> bool
 def _build_malformed_retry_body(
-    openai_body: dict, anthropic_body: dict, retry_hint: str = ""
+    openai_body: dict,
+    anthropic_body: dict,
+    retry_hint: str = "",
+    tool_choice: str = "required",
+    attempt: int = 1,
+    total_attempts: int = 1,
 ) -> dict:
     retry_body = dict(openai_body)
     retry_body["stream"] = False
-    retry_body["tool_choice"] = "required"
+    retry_body["tool_choice"] = tool_choice
     retry_body["temperature"] = PROXY_MALFORMED_TOOL_RETRY_TEMPERATURE
-    malformed_retry_instruction = {
-        "role": "user",
-        "content": (
+    if tool_choice == "required":
+        retry_instruction = (
             "Your previous response had invalid tool-call formatting. "
             "Respond with exactly one valid tool call using the provided tools. "
             "Do not output prose, markdown, XML tags, or schema snippets."
-        ),
+        )
+    else:
+        retry_instruction = (
+            "Your previous response had invalid tool-call formatting. "
+            "If a tool is needed, emit exactly one valid tool call with strict JSON arguments. "
+            "If no tool is needed for this turn, return concise plain text with no protocol tags."
+        )
+    malformed_retry_instruction = {
+        "role": "user",
+        "content": retry_instruction,
     }
     existing_messages = retry_body.get("messages")
     if isinstance(existing_messages, list) and existing_messages:
@@ -2381,19 +2552,51 @@ def _build_malformed_retry_body(
     if PROXY_DISABLE_THINKING_ON_TOOL_TURNS:
         retry_body["enable_thinking"] = False
+    _apply_tool_call_grammar(retry_body, tool_choice=tool_choice)
     if retry_hint:
         repair_prompt = (
-            "[TOOL CALL REPAIR]\n"
+            f"[TOOL CALL REPAIR attempt {attempt}/{total_attempts}]\n"
             f"{retry_hint}\n"
-            "Return exactly one valid tool call object and no explanatory prose."
+            "Return a valid response for this turn without protocol artifacts."
         )
         retry_messages = list(retry_body.get("messages", []))
-        retry_messages.append({"role": "system", "content": repair_prompt})
+        retry_messages.append({"role": "user", "content": repair_prompt})
         retry_body["messages"] = retry_messages
     return retry_body
+def _retry_tool_choice_for_attempt(
+    required_tool_choice: bool, attempt: int, total_attempts: int
+) -> str:
+    if not required_tool_choice:
+        return "auto"
+    if total_attempts <= 1:
+        return "required"
+    return "auto" if attempt == total_attempts - 1 else "required"
+def _build_safe_text_openai_response(openai_resp: dict, text: str) -> dict:
+    return {
+        "id": openai_resp.get("id", f"chatcmpl_{uuid.uuid4().hex[:12]}"),
+        "object": openai_resp.get("object", "chat.completion"),
+        "created": openai_resp.get("created", int(time.time())),
+        "model": openai_resp.get("model", "unknown"),
+        "choices": [
+            {
+                "index": 0,
+                "finish_reason": "stop",
+                "message": {
+                    "role": "assistant",
+                    "content": text,
+                },
+            }
+        ],
+        "usage": openai_resp.get("usage", {}),
+    }
 def _build_clean_guardrail_openai_response(openai_resp: dict) -> dict:
     return {
         "id": openai_resp.get("id", f"chatcmpl_{uuid.uuid4().hex[:12]}"),
@@ -2437,6 +2640,7 @@ async def _apply_unexpected_end_turn_guardrail(
     retry_body = dict(openai_body)
     retry_body["tool_choice"] = "required"
     retry_body["stream"] = False
+    _apply_tool_call_grammar(retry_body, tool_choice="required")
     retry_resp = await client.post(
         f"{LLAMA_CPP_BASE}/chat/completions",
@@ -2486,7 +2690,8 @@ async def _apply_malformed_tool_guardrail(
         working_resp, required_repairs = _repair_required_tool_args(
             working_resp, anthropic_body
         )
-        repair_count = markup_repairs + required_repairs
+        working_resp, bash_repairs = _repair_bash_command_artifacts(working_resp)
+        repair_count = markup_repairs + required_repairs + bash_repairs
     required_tool_choice = openai_body.get("tool_choice") == "required"
     has_tool_calls = _openai_has_tool_calls(working_resp)
@@ -2536,10 +2741,18 @@ async def _apply_malformed_tool_guardrail(
     attempts = max(0, PROXY_MALFORMED_TOOL_RETRY_MAX)
     current_issue = issue
     for attempt in range(attempts):
+        attempt_tool_choice = _retry_tool_choice_for_attempt(
+            required_tool_choice,
+            attempt,
+            attempts,
+        )
         retry_body = _build_malformed_retry_body(
             openai_body,
             anthropic_body,
             retry_hint=current_issue.retry_hint,
+            tool_choice=attempt_tool_choice,
+            attempt=attempt + 1,
+            total_attempts=attempts,
         )
         retry_resp = await client.post(
             f"{LLAMA_CPP_BASE}/chat/completions",
@@ -2563,7 +2776,14 @@ async def _apply_malformed_tool_guardrail(
             retry_working, retry_required_repairs = _repair_required_tool_args(
                 retry_working, anthropic_body
             )
-            retry_repairs = retry_markup_repairs + retry_required_repairs
+            retry_working, retry_bash_repairs = _repair_bash_command_artifacts(
+                retry_working
+            )
+            retry_repairs = (
+                retry_markup_repairs + retry_required_repairs + retry_bash_repairs
+            )
+        working_resp = retry_working
         retry_has_tool_calls = _openai_has_tool_calls(retry_working)
         retry_required = retry_body.get("tool_choice") == "required"
@@ -2620,6 +2840,17 @@ async def _apply_malformed_tool_guardrail(
         monitor.invalid_tool_call_streak,
         monitor.required_tool_miss_streak,
     )
+    degraded_text = _sanitize_tool_call_apology_text(
+        _openai_message_text(working_resp)
+    ).strip()
+    if degraded_text and not _looks_malformed_tool_payload(degraded_text):
+        logger.warning(
+            "TOOL RESPONSE degrade: session=%s returning safe text fallback after retry exhaustion",
+            session_id,
+        )
+        return _build_safe_text_openai_response(working_resp, degraded_text)
     return _build_clean_guardrail_openai_response(working_resp)
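
The retry changes above interact with the grammar gating added earlier in this file: each retry picks a `tool_choice`, and the grammar is attached only when the gating conditions hold. The sketch below condenses both decisions into pure functions so the resulting plan can be inspected. `retry_tool_choice` mirrors `_retry_tool_choice_for_attempt` from the diff; `grammar_applies` and `retry_plan` are hypothetical helpers that assume the default settings (`PROXY_TOOL_CALL_GRAMMAR=on`, `PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY=on`, grammar file loaded, tools present).

```python
def retry_tool_choice(required: bool, attempt: int, total_attempts: int) -> str:
    """Mirror of _retry_tool_choice_for_attempt; `attempt` is zero-based."""
    if not required:
        return "auto"
    if total_attempts <= 1:
        return "required"
    return "auto" if attempt == total_attempts - 1 else "required"


def grammar_applies(tool_choice: str, has_tools: bool = True,
                    enabled: bool = True, required_only: bool = True,
                    grammar_loaded: bool = True) -> bool:
    """Condensed form of the checks in _apply_tool_call_grammar."""
    if not (enabled and grammar_loaded and has_tools):
        return False
    return tool_choice == "required" or not required_only


def retry_plan(required: bool, total_attempts: int):
    """List (attempt number, tool_choice, grammar attached?) per retry."""
    plan = []
    for attempt in range(total_attempts):
        choice = retry_tool_choice(required, attempt, total_attempts)
        plan.append((attempt + 1, choice, grammar_applies(choice)))
    return plan


if __name__ == "__main__":
    for row in retry_plan(required=True, total_attempts=3):
        print(row)
```

With three attempts and an originally required tool choice, the first two retries stay on `required` with the grammar attached, and the last one is released to `auto` without it. That final unconstrained attempt is what lets the model answer in plain text, which the degrade path can then return if it passes the malformed-payload check.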
@@ -2720,6 +2951,18 @@ def openai_to_anthropic_response(openai_resp: dict, model: str) -> dict:
             args = json.loads(fn.get("arguments", "{}"))
         except json.JSONDecodeError:
             args = {}
+        if fn.get("name", "").strip().lower() == "bash" and isinstance(args, dict):
+            command = args.get("command")
+            if isinstance(command, str):
+                cleaned_command, had_protocol_lines = _strip_protocol_tag_only_lines(
+                    command
+                )
+                if had_protocol_lines:
+                    args = dict(args)
+                    args["command"] = cleaned_command
+                    logger.warning(
+                        "BASH SAFETY: stripped standalone protocol-tag lines from command before tool execution"
+                    )
         content.append(
             {
                 "type": "tool_use",
@@ -3564,6 +3807,12 @@ async def context_status(request: Request):
         "overflow_count": monitor.overflow_count,
         "prune_threshold": PROXY_CONTEXT_PRUNE_THRESHOLD,
         "recent_history": monitor.context_history[-10:],
+        "tool_call_grammar": {
+            "enabled": PROXY_TOOL_CALL_GRAMMAR,
+            "required_only": PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY,
+            "path": PROXY_TOOL_CALL_GRAMMAR_PATH,
+            "loaded": bool(TOOL_CALL_GBNF),
+        },
         # Loop protection stats
         "loop_protection": {
             "enabled": PROXY_LOOP_BREAKER,

package/tools/agents/tests/test_anthropic_proxy_streaming.py CHANGED

@@ -487,6 +487,68 @@ class TestMalformedToolGuardrail(unittest.TestCase):
             setattr(proxy, "PROXY_MALFORMED_TOOL_RETRY_TEMPERATURE", old_temp)
             setattr(proxy, "PROXY_DISABLE_THINKING_ON_TOOL_TURNS", old_disable)
+    def test_malformed_retry_body_appends_retry_hint_as_user_message(self):
+        openai_body = {
+            "model": "test",
+            "messages": [{"role": "user", "content": "fix"}],
+        }
+        anthropic_body = {
+            "tools": [{"name": "Read", "input_schema": {"type": "object"}}]
+        }
+        retry = proxy._build_malformed_retry_body(
+            openai_body,
+            anthropic_body,
+            retry_hint="Use strict JSON",
+            tool_choice="required",
+            attempt=1,
+            total_attempts=2,
+        )
+        self.assertEqual(retry["messages"][-1]["role"], "user")
+        self.assertIn("TOOL CALL REPAIR attempt 1/2", retry["messages"][-1]["content"])
+    def test_retry_ladder_releases_last_attempt_to_auto(self):
+        self.assertEqual(proxy._retry_tool_choice_for_attempt(True, 0, 3), "required")
+        self.assertEqual(proxy._retry_tool_choice_for_attempt(True, 1, 3), "required")
+        self.assertEqual(proxy._retry_tool_choice_for_attempt(True, 2, 3), "auto")
+        self.assertEqual(proxy._retry_tool_choice_for_attempt(False, 0, 3), "auto")
+    def test_malformed_retry_body_applies_grammar_only_for_required_tool_choice(self):
+        old_enabled = getattr(proxy, "PROXY_TOOL_CALL_GRAMMAR")
+        old_required_only = getattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY")
+        old_grammar = getattr(proxy, "TOOL_CALL_GBNF")
+        try:
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR", True)
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY", True)
+            setattr(proxy, "TOOL_CALL_GBNF", 'root ::= "<tool_call>"')
+            openai_body = {
+                "model": "test",
+                "messages": [{"role": "user", "content": "fix"}],
+            }
+            anthropic_body = {
+                "tools": [{"name": "Read", "input_schema": {"type": "object"}}]
+            }
+            required_retry = proxy._build_malformed_retry_body(
+                openai_body,
+                anthropic_body,
+                tool_choice="required",
+            )
+            auto_retry = proxy._build_malformed_retry_body(
+                openai_body,
+                anthropic_body,
+                tool_choice="auto",
+            )
+            self.assertEqual(required_retry.get("grammar"), 'root ::= "<tool_call>"')
+            self.assertNotIn("grammar", auto_retry)
+        finally:
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR", old_enabled)
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY", old_required_only)
+            setattr(proxy, "TOOL_CALL_GBNF", old_grammar)
     def test_clean_guardrail_response_does_not_promise_future_tool_call(self):
         guardrail = proxy._build_clean_guardrail_openai_response(
             {"model": "test-model"}
@@ -772,6 +834,34 @@ class TestMalformedToolGuardrail(unittest.TestCase):
         )
         self.assertEqual(args["command"], "ls")
+    def test_bash_command_repair_strips_protocol_tag_only_lines(self):
+        openai_resp = {
+            "choices": [
+                {
+                    "finish_reason": "tool_calls",
+                    "message": {
+                        "content": "",
+                        "tool_calls": [
+                            {
+                                "id": "call_1",
+                                "function": {
+                                    "name": "Bash",
+                                    "arguments": '{"command":"pwd\\n</function>\\n<tool_call>"}',
+                                },
+                            }
+                        ],
+                    },
+                }
+            ]
+        }
+        repaired, count = proxy._repair_bash_command_artifacts(openai_resp)
+        self.assertEqual(count, 1)
+        args = json.loads(
+            repaired["choices"][0]["message"]["tool_calls"][0]["function"]["arguments"]
+        )
+        self.assertEqual(args["command"], "pwd")
     def test_guardrail_accepts_repaired_markup_without_fallback(self):
         old_retry = getattr(proxy, "PROXY_MALFORMED_TOOL_RETRY_MAX")
         try:
@@ -1214,6 +1304,81 @@ class TestToolTurnControls(unittest.TestCase):
             setattr(proxy, "PROXY_FORCED_TOOL_DAMPENER_REJECTIONS", old_rejections)
             setattr(proxy, "PROXY_FORCED_TOOL_DAMPENER_AUTO_TURNS", old_auto_turns)
+    def test_build_request_applies_grammar_when_tool_choice_required(self):
+        old_enabled = getattr(proxy, "PROXY_TOOL_CALL_GRAMMAR")
+        old_required_only = getattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY")
+        old_grammar = getattr(proxy, "TOOL_CALL_GBNF")
+        try:
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR", True)
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY", True)
+            setattr(proxy, "TOOL_CALL_GBNF", 'root ::= "<tool_call>"')
+            body = {
+                "model": "test",
+                "messages": [
+                    {
+                        "role": "assistant",
+                        "content": [{"type": "text", "text": "I will continue."}],
+                    },
+                    {"role": "user", "content": "continue"},
+                ],
+                "tools": [
+                    {
+                        "name": "Read",
+                        "description": "Read file",
+                        "input_schema": {"type": "object"},
+                    }
+                ],
+            }
+            openai = proxy.build_openai_request(
+                body, proxy.SessionMonitor(context_window=262144)
+            )
+            self.assertEqual(openai.get("tool_choice"), "required")
+            self.assertEqual(openai.get("grammar"), 'root ::= "<tool_call>"')
+        finally:
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR", old_enabled)
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY", old_required_only)
+            setattr(proxy, "TOOL_CALL_GBNF", old_grammar)
+    def test_build_request_omits_grammar_when_tool_choice_released_to_auto(self):
+        old_enabled = getattr(proxy, "PROXY_TOOL_CALL_GRAMMAR")
+        old_required_only = getattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY")
+        old_grammar = getattr(proxy, "TOOL_CALL_GBNF")
+        try:
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR", True)
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY", True)
+            setattr(proxy, "TOOL_CALL_GBNF", 'root ::= "<tool_call>"')
+            monitor = proxy.SessionMonitor(context_window=262144)
+            monitor.forced_auto_cooldown_turns = 1
+            body = {
+                "model": "test",
+                "messages": [
+                    {
+                        "role": "assistant",
+                        "content": [{"type": "text", "text": "I will continue."}],
+                    },
+                    {"role": "user", "content": "continue"},
+                ],
+                "tools": [
+                    {
+                        "name": "Read",
+                        "description": "Read file",
+                        "input_schema": {"type": "object"},
+                    }
+                ],
+            }
+            openai = proxy.build_openai_request(body, monitor)
+            self.assertEqual(openai.get("tool_choice"), "auto")
+            self.assertNotIn("grammar", openai)
+        finally:
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR", old_enabled)
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY", old_required_only)
+            setattr(proxy, "TOOL_CALL_GBNF", old_grammar)
     def test_no_tools_does_not_inject_agentic_system_message(self):
         body = {
             "model": "test",
@@ -1290,6 +1455,38 @@ class TestToolTurnControls(unittest.TestCase):
             setattr(proxy, "PROXY_ANALYSIS_ONLY_MIN_TOOLS", old_min_tools)
             setattr(proxy, "PROXY_ANALYSIS_ONLY_MAX_MESSAGES", old_max_messages)
+    def test_analysis_only_route_does_not_treat_implementation_as_action(self):
+        old_route = getattr(proxy, "PROXY_ANALYSIS_ONLY_ROUTE")
+        old_min_tools = getattr(proxy, "PROXY_ANALYSIS_ONLY_MIN_TOOLS")
+        old_max_messages = getattr(proxy, "PROXY_ANALYSIS_ONLY_MAX_MESSAGES")
+        try:
+            setattr(proxy, "PROXY_ANALYSIS_ONLY_ROUTE", True)
+            setattr(proxy, "PROXY_ANALYSIS_ONLY_MIN_TOOLS", 4)
+            setattr(proxy, "PROXY_ANALYSIS_ONLY_MAX_MESSAGES", 2)
+            body = {
+                "messages": [
+                    {
+                        "role": "user",
+                        "content": "analyze implementation options and summarize tradeoffs",
+                    }
+                ],
+                "tools": [
+                    {"name": "Read", "input_schema": {"type": "object"}},
+                    {"name": "Edit", "input_schema": {"type": "object"}},
+                    {"name": "Write", "input_schema": {"type": "object"}},
+                    {"name": "Bash", "input_schema": {"type": "object"}},
+                ],
+            }
+            updated, removed = proxy._maybe_route_analysis_without_tools(body)
+            self.assertEqual(removed, 4)
+            self.assertNotIn("tools", updated)
+        finally:
+            setattr(proxy, "PROXY_ANALYSIS_ONLY_ROUTE", old_route)
+            setattr(proxy, "PROXY_ANALYSIS_ONLY_MIN_TOOLS", old_min_tools)
+            setattr(proxy, "PROXY_ANALYSIS_ONLY_MAX_MESSAGES", old_max_messages)
 class TestSessionContaminationBreaker(unittest.TestCase):
     def test_contamination_breaker_trims_and_resets_streak(self):