@miller-tech/uap 1.15.5 → 1.15.7

package/docs/INDEX.md CHANGED
@@ -47,6 +47,14 @@
47
47
  - [Token Optimization](benchmarks/TOKEN_OPTIMIZATION.md) -- Per-feature token savings analysis
48
48
  - [Accuracy Analysis](benchmarks/ACCURACY_ANALYSIS.md) -- Internal vs Terminal-Bench comparison
49
49
 
50
+ ## Blog
51
+
52
+ - [Speculative Decoding Production Playbook](blog/SPECULATIVE_DECODING_PRODUCTION_PLAYBOOK.md) -- Long-form narrative on throughput gains, failure modes, and stable profiles
53
+
54
+ ## PR Templates
55
+
56
+ - [Speculative Docs PR Template](pr/PR_SPECULATIVE_DOCS_TEMPLATE.md) -- Ready-to-submit PR copy, checklist, and reviewer guidance
57
+
50
58
  ## Research
51
59
 
52
60
  - [Memory Systems Comparison](research/MEMORY_SYSTEMS_COMPARISON.md) -- MemGPT, LangGraph, Mem0, A-MEM analysis
@@ -0,0 +1,139 @@
1
+ # Speculative Decoding in llama.cpp: Real Speedups Without Breaking Agentic Reliability
2
+
3
+ Speculative decoding can look like free performance - until it meets long-context, tool-heavy agent workflows. This write-up covers what improved throughput, what regressed, and which operational changes restored stability across `llama.cpp` and an Anthropic-compatible proxy.
4
+
5
+ ## Why This Matters
6
+
7
+ Speculative decoding is strongest when generated text has predictable structure or repetition. But in real coding sessions, throughput alone is not enough: the system must preserve clean output, reliable tool-call behavior, and long-session continuity.
8
+
9
+ In practice, these four layers behave as one runtime boundary and have to be tuned together:
10
+
11
+ - `llama.cpp` speculative behavior
12
+ - parameter profile and rollback mode
13
+ - proxy streaming/fallback policies
14
+ - agentic tool-loop control behavior
15
+
16
+ ## Baseline Environment
17
+
18
+ - Runtime: `llama.cpp` + CUDA + Qwen3.5 GGUF
19
+ - Context window: `262144`
20
+ - Spec type: `ngram-cache`
21
+ - Gateway: Anthropic-compatible proxy forwarding to OpenAI-compatible server
22
+
23
+ Related runbooks:
24
+
25
+ - `docs/deployment/UAP_LLAMA_ANTHROPIC_PROXY_BOOTSTRAP.md`
26
+ - `docs/benchmarks/SPECULATIVE_DECODING_JOURNEY_2026-03.md`
27
+
28
+ ## What We Observed
29
+
30
+ ### Throughput Gains Were Workload-Dependent
31
+
32
+ Speculation did not improve every turn equally. Coding/tool turns often saw little or no uplift, while repetition-heavy turns saw large gains.
33
+
34
+ Representative 27B snapshot (`ctx=262144`):
35
+
36
+ - No spec: ~43 tok/s coding, ~41 tok/s pattern
37
+ - Balanced spec (`draft-max`/`draft-min`/`draft-p-min` = `12/2/0.80`): ~43 tok/s coding, ~102 tok/s pattern
38
+
39
+ Takeaway: benchmark by workload class, not one blended average.
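To make the takeaway concrete, the following minimal sketch computes the speedup per workload class from the snapshot figures above and compares it with a single blended ratio. The dictionary layout and the function name are illustrative assumptions, not part of any benchmark tool.

```python
# Throughput figures (tok/s) copied from the 27B snapshot above.
# The structure and names here are hypothetical, chosen for illustration.
SNAPSHOT = {
    "no_spec": {"coding": 43.0, "pattern": 41.0},
    "balanced_spec": {"coding": 43.0, "pattern": 102.0},
}


def speedup_by_class(baseline: dict, candidate: dict) -> dict:
    """Return the candidate/baseline throughput ratio for each workload class."""
    return {cls: candidate[cls] / baseline[cls] for cls in baseline}


ratios = speedup_by_class(SNAPSHOT["no_spec"], SNAPSHOT["balanced_spec"])
for cls, ratio in ratios.items():
    print(f"{cls}: {ratio:.2f}x")

# A single blended ratio hides the split between the two classes.
blended = sum(SNAPSHOT["balanced_spec"].values()) / sum(SNAPSHOT["no_spec"].values())
print(f"blended: {blended:.2f}x")
```

With these numbers the coding class shows no gain and the pattern class shows roughly 2.5x, while the blended figure lands in between and describes neither workload.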
40
+
41
+ ### Newer Lineage Produced Noisier Warnings
42
+
43
+ Under identical settings, newer builds emitted warnings such as:
44
+
45
+ - `find_slot: non-consecutive token position`
46
+
47
+ This correlated with lower effective throughput and less stable long-session behavior in A/B comparisons.
48
+
49
+ ### Proxy Fallback Could Leak Malformed Internal Text
50
+
51
+ When the upstream returned a reasoning-heavy response with no visible output, a weak fallback policy could expose malformed fragments (pseudo-tool text, schema/policy echoes) to end users.
52
+
53
+ Patterns included:
54
+
55
+ - `</parameter>`-style fragments
56
+ - non-JSON pseudo-tool content
57
+ - repetitive policy-like loops with no valid `tool_calls`
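As a rough illustration of how the first pattern can be caught, the sketch below drops lines that consist only of a protocol tag. The regular expression matches the one the proxy in this release defines as `_BASH_PROTOCOL_LINE_RE`; the function name `strip_protocol_lines` is hypothetical, and the sketch covers only standalone tag lines, not the other two patterns.

```python
import re

# Same pattern as the proxy's standalone-tag guard; everything else is a sketch.
PROTOCOL_LINE_RE = re.compile(
    r"^\s*</?(?:tool_call|tool_response|parameter(?:=[^>]*)?|function(?:=[^>]*)?|think)\s*>\s*$",
    re.IGNORECASE,
)


def strip_protocol_lines(text: str) -> tuple[str, bool]:
    """Remove lines that are nothing but a protocol tag.

    Returns the cleaned text and a flag saying whether anything was removed.
    Text without such lines is returned unchanged.
    """
    lines = text.splitlines()
    kept = [line for line in lines if not PROTOCOL_LINE_RE.match(line)]
    if len(kept) == len(lines):
        return text, False
    return "\n".join(kept).strip(), True


cleaned, changed = strip_protocol_lines("ls -la\n</parameter>\n</tool_call>")
print(repr(cleaned), changed)
```

Only whole-line tags are removed, so a command that merely mentions a tag inside other text (for example inside a quoted string) is left alone.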
58
+
59
+ ## Immediate Fixes That Worked
60
+
61
+ ### Safe Production Defaults
62
+
63
+ The highest-leverage stabilization profile was:
64
+
65
+ - `PROXY_STREAM_REASONING_FALLBACK=off`
66
+ - `PROXY_MALFORMED_TOOL_GUARDRAIL=on`
67
+ - `PROXY_MALFORMED_TOOL_STREAM_STRICT=on`
68
+ - `PROXY_MAX_TOKENS_FLOOR=4096`
69
+
70
+ Why:
71
+
72
+ - `PROXY_STREAM_REASONING_FALLBACK=off` suppresses malformed reasoning leakage.
73
+ - malformed-tool guardrail + strict stream path recovers bad stream+tools turns.
74
+ - a lower token floor (`PROXY_MAX_TOKENS_FLOOR=4096`) reduces latency on long failure turns without affecting normal turns.
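One way to keep a deployment from drifting away from this profile is a small pre-start check. The sketch below compares an environment mapping against the four values listed above. The helper name `profile_deviations` is hypothetical, and the comparison is a plain case-insensitive string match; how the proxy itself parses each variable is not assumed here.

```python
import os

# Values copied from the stabilization profile above.
SAFE_PROFILE = {
    "PROXY_STREAM_REASONING_FALLBACK": "off",
    "PROXY_MALFORMED_TOOL_GUARDRAIL": "on",
    "PROXY_MALFORMED_TOOL_STREAM_STRICT": "on",
    "PROXY_MAX_TOKENS_FLOOR": "4096",
}


def profile_deviations(env) -> dict:
    """Return {name: actual_value} for every variable that is unset
    or differs from the safe profile (None means unset)."""
    deviations = {}
    for name, expected in SAFE_PROFILE.items():
        actual = env.get(name)
        if actual is None or actual.strip().lower() != expected:
            deviations[name] = actual
    return deviations


if __name__ == "__main__":
    for name, actual in profile_deviations(os.environ).items():
        print(f"{name}: expected {SAFE_PROFILE[name]!r}, found {actual!r}")
```

Running it before launching the proxy prints one line per variable that is missing or set to something else, and prints nothing when the profile is fully applied.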
75
+
76
+ ### Balanced Speculative Profile for Daily Agentic Work
77
+
78
+ - `spec-type=ngram-cache`
79
+ - `draft-max=12`
80
+ - `draft-min=2`
81
+ - `draft-p-min=0.80`
82
+ - rollback mode: `strict`
83
+
84
+ This profile is less aggressive than max-throughput tuning, but significantly safer for long coding sessions.
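For scripting, the profile can be kept as data and rendered into command-line arguments. The sketch below assumes each setting maps to a `--<name> <value>` flag on `llama-server`; that spelling is an assumption and should be checked against `llama-server --help` for the build in use. Rollback mode is left out because this write-up does not name a flag for it.

```python
# Settings copied from the balanced profile above.
BALANCED_PROFILE = {
    "spec-type": "ngram-cache",
    "draft-max": 12,
    "draft-min": 2,
    "draft-p-min": 0.80,
}


def to_cli_args(profile: dict) -> list[str]:
    """Render a profile as a flat ["--name", "value", ...] argument list.

    Assumes every key is accepted as a long option of the same name;
    verify this against the server build before relying on it.
    """
    args: list[str] = []
    for name, value in profile.items():
        args += [f"--{name}", str(value)]
    return args


print(" ".join(to_cli_args(BALANCED_PROFILE)))
```

Keeping the profile as a dictionary makes it straightforward to store the balanced and aggressive variants side by side and to record which one produced a given benchmark run.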
85
+
86
+ ## Benchmark Method That Prevents False Wins
87
+
88
+ A useful speculative benchmark protocol should include:
89
+
90
+ 1. Prompt classes
91
+ - coding/tool-call tasks
92
+ - repetition/pattern-heavy tasks
93
+ 2. Repeats and warmup
94
+ - fixed run count
95
+ - warmup policy
96
+ - p50/p95 latency, not only mean tok/s
97
+ 3. Required metrics
98
+ - decode throughput (`eval tok/s`)
99
+ - prefill throughput (`prompt eval tok/s`)
100
+ - acceptance/rejection behavior
101
+ - malformed-turn incidence
102
+ - stop reason distribution
103
+ 4. Profile matrix
104
+ - no-spec baseline
105
+ - aggressive profile
106
+ - balanced profile
107
+
108
+ Without this, speculative tuning can appear faster while degrading real agentic reliability.
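The latency part of the protocol can be sketched with the Python standard library. The example below reports p50 and p95 alongside the mean for one prompt class. The function name and the sample latencies are made up for illustration and are not measurements from this work.

```python
import statistics


def summarize_latency(latencies_ms: list[float]) -> dict:
    """Return mean, p50 and p95 for one prompt class.

    statistics.quantiles with n=100 yields 99 cut points;
    index 94 is the 95th percentile. Needs at least two samples.
    """
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(latencies_ms),
        "p50": statistics.median(latencies_ms),
        "p95": cuts[94],
    }


# Hypothetical per-turn latencies: mostly fast turns plus two slow failure turns.
sample = [800.0] * 18 + [9000.0, 12000.0]
summary = summarize_latency(sample)
print({name: round(value, 1) for name, value in summary.items()})
```

In a sample like this the median stays at the fast-turn latency while the mean and p95 are pulled up by the two slow turns, which is why the protocol asks for percentiles rather than a mean alone.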
109
+
110
+ ## Practical Playbook
111
+
112
+ ### Use for Daily Agentic Coding
113
+
114
+ - balanced `ngram-cache` profile (`draft-max=12`, `draft-min=2`, `draft-p-min=0.80`)
115
+ - strict malformed-tool stream guardrail
116
+ - reasoning fallback disabled
117
+ - reduced token floor (`4096`)
118
+
119
+ ### Use for Max Throughput Exploration
120
+
121
+ - hybrid rollback
122
+ - larger draft windows
123
+ - tightly scoped benchmark prompts
124
+
125
+ Promote an exploratory profile only if a long-session tool-loop soak test remains stable.
126
+
127
+ ## What llama.cpp Docs Should Add Next
128
+
129
+ Mechanics are documented well today. The next improvement is operational clarity:
130
+
131
+ - implementation selection matrix by workload
132
+ - troubleshooting by signature (`find_slot`, rollback spikes, acceptance collapse)
133
+ - reproducible benchmark protocol and output schema
134
+ - rollout/canary/rollback criteria
135
+ - proxy compatibility appendix for stream+tools environments
136
+
137
+ ## Final Takeaway
138
+
139
+ Speculative decoding in production is a systems problem, not just a decoding primitive. Treating runtime + transport + tool-loop behavior as one boundary is what makes speculative speedups both real and reliable.
@@ -0,0 +1,146 @@
1
+ ## Title
2
+
3
+ docs: add speculative decoding production playbook and agentic compatibility guidance
4
+
5
+ ## Context
6
+
7
+ `docs/speculative.md` explains speculative mechanisms and flags, but production operators also need:
8
+
9
+ - workload-driven profile selection,
10
+ - reproducible benchmarking protocol,
11
+ - signature-based regression triage,
12
+ - guidance for stream+tools agentic environments.
13
+
14
+ This PR adds operational documentation to reduce drift between benchmark wins and real-session behavior.
15
+
16
+ ## Changes
17
+
18
+ ### Add new guide
19
+
20
+ - New: `docs/speculative-production.md`
21
+ - implementation matrix:
22
+ - `draft`
23
+ - `ngram-cache`
24
+ - `ngram-simple`
25
+ - `ngram-map-k`
26
+ - `ngram-map-k4v`
27
+ - `ngram-mod`
28
+ - decision tree by workload (coding, repetitive transform, mixed)
29
+ - benchmark protocol (run counts, warmup, prompt classes, metrics)
30
+ - troubleshooting by signature:
31
+ - `find_slot: non-consecutive token position`
32
+ - low acceptance + high rollback
33
+ - throughput collapse after commit switch
34
+ - rollout rules (canary, promotion threshold, rollback triggers)
35
+
36
+ ### Update existing speculative docs
37
+
38
+ - Update `docs/speculative.md`:
39
+ - add link to production guide
40
+ - add "how to interpret statistics in practice"
41
+ - add "workload sensitivity and reproducibility notes"
42
+
43
+ ### Add compatibility appendix
44
+
45
+ - New appendix (or linked page): stream+tools compatibility for proxy-mediated agentic flows
46
+ - fallback policy guidance (`off` default for production)
47
+ - malformed stream/tool guardrail behavior
48
+ - max token floor and prune target recommendations
49
+
50
+ ## Why
51
+
52
+ Speculative decoding quality in agentic coding depends on end-to-end behavior, including transport and stream tool-loop handling. This documentation closes that gap and provides a repeatable operator path.
53
+
54
+ ## Validation Plan
55
+
56
+ - Verify all CLI flags/options in examples against current `llama-server`.
57
+ - Verify all linked scripts/docs paths resolve.
58
+ - Include one benchmark table with:
59
+ - decode/prefill throughput
60
+ - acceptance indicators
61
+ - latency percentiles
62
+ - workload class labels
63
+
64
+ ## Risks
65
+
66
+ - Overfitting recommendations to one model/hardware class.
67
+ - Treating proxy behavior as universally required.
68
+
69
+ ## Mitigations
70
+
71
+ - Mark all profile recommendations as workload/hardware sensitive.
72
+ - Separate "safe baseline" from "aggressive benchmark-only" profiles.
73
+ - Require local A/B validation before rollout.
74
+
75
+ ## Out of Scope
76
+
77
+ - Runtime code changes
78
+ - Kernel-level speculative optimization changes
79
+ - Proxy implementation changes (docs-only PR)
80
+
81
+ ## Follow-ups
82
+
83
+ 1. Add nightly speculative regression harness.
84
+ 2. Publish benchmark JSON schema for machine comparison.
85
+ 3. Add commit-lineage tracking for performance regressions.
86
+
87
+ ---
88
+
89
+ ## Ready-to-Submit GitHub PR Body
90
+
91
+ ### Summary
92
+
93
+ This docs PR adds a production-oriented speculative decoding playbook for llama.cpp users running real multi-turn workloads (especially agentic/tool-call scenarios). It complements existing mechanism-level docs with actionable tuning, troubleshooting, and rollout guidance.
94
+
95
+ ### What Changed
96
+
97
+ - Added `docs/speculative-production.md` (new operational guide)
98
+ - implementation selection matrix
99
+ - workload-based decision tree
100
+ - benchmark protocol + required metrics
101
+ - troubleshooting by real log signatures
102
+ - canary/rollback rollout guidance
103
+ - Updated `docs/speculative.md`
104
+ - links to production guide
105
+ - practical stats interpretation notes
106
+ - workload sensitivity notes
107
+ - Added/linked "agentic stream+tools compatibility" appendix
108
+ - fallback policy defaults
109
+ - malformed stream/tool guardrails
110
+ - token-floor/prune guidance
111
+
112
+ ### Why
113
+
114
+ Current docs describe speculative decoding internals clearly, but production operators need a reproducible way to:
115
+
116
+ - choose stable profiles by workload,
117
+ - detect/triage regressions quickly,
118
+ - avoid benchmark-only wins that fail in long sessions.
119
+
120
+ ### Reviewer Guide
121
+
122
+ Please focus review on:
123
+
124
+ 1. Accuracy of CLI flags and option names.
125
+ 2. Correctness of troubleshooting signatures and interpretations.
126
+ 3. Clarity of benchmark protocol (can another team reproduce it?).
127
+ 4. Whether safe-vs-aggressive profile separation is clear enough.
128
+
129
+ ### Validation
130
+
131
+ - [ ] Command examples verified against current `llama-server --help`
132
+ - [ ] Linked docs/scripts paths validated
133
+ - [ ] Benchmark table includes workload class labels
134
+ - [ ] Metrics include decode/prefill throughput + latency percentile view
135
+ - [ ] No runtime behavior claims without explicit caveats
136
+
137
+ ### Risks / Caveats
138
+
139
+ - Recommendations are model/hardware/workload dependent.
140
+ - Guidance is operational, not a substitute for local A/B testing.
141
+
142
+ ### Follow-ups
143
+
144
+ - [ ] Add nightly regression harness for speculative profiles
145
+ - [ ] Publish machine-readable benchmark schema
146
+ - [ ] Add commit lineage references in benchmark artifacts
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@miller-tech/uap",
3
- "version": "1.15.5",
3
+ "version": "1.15.7",
4
4
  "description": "Autonomous AI agent memory system with CLAUDE.md protocol enforcement",
5
5
  "type": "module",
6
6
  "main": "dist/index.js",
@@ -22,6 +22,15 @@ if [ -z "$CMD" ]; then
22
22
  exit 0
23
23
  fi
24
24
 
25
+ # ─── Protocol Tag Injection Guard ────────────────────────────────
26
+ # Reject Bash payloads that still contain standalone protocol tag lines.
27
+ # These fragments can appear after malformed tool-call rendering and must
28
+ # never reach shell evaluation.
29
+ if printf '%s\n' "$CMD" | grep -qE '^[[:space:]]*</?(tool_call|tool_response|parameter(=[^>]*)?|function(=[^>]*)?|think)[[:space:]]*>[[:space:]]*$'; then
30
+ echo "BLOCKED [bash-safety]: Command contains standalone XML/protocol tag lines. Remove tool-call tag artifacts before execution." >&2
31
+ exit 2
32
+ fi
33
+
25
34
  # ─── IaC Pipeline Enforcement ───────────────────────────────────
26
35
  # Block local terraform apply/destroy (policies/iac-pipeline-enforcement.md)
27
36
  # Allow: terraform fmt, validate, init, plan, output, show, state list, graph
@@ -254,6 +254,28 @@ PROXY_ANALYSIS_ONLY_MIN_TOOLS = int(
254
254
  PROXY_ANALYSIS_ONLY_MAX_MESSAGES = int(
255
255
  os.environ.get("PROXY_ANALYSIS_ONLY_MAX_MESSAGES", "2")
256
256
  )
257
+ PROXY_TOOL_CALL_GRAMMAR = os.environ.get(
258
+ "PROXY_TOOL_CALL_GRAMMAR", "on"
259
+ ).lower() not in {
260
+ "0",
261
+ "false",
262
+ "off",
263
+ "no",
264
+ }
265
+ PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY = os.environ.get(
266
+ "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY", "on"
267
+ ).lower() not in {
268
+ "0",
269
+ "false",
270
+ "off",
271
+ "no",
272
+ }
273
+ PROXY_TOOL_CALL_GRAMMAR_PATH = os.path.abspath(
274
+ os.environ.get(
275
+ "PROXY_TOOL_CALL_GRAMMAR_PATH",
276
+ os.path.join(os.path.dirname(__file__), "..", "config", "tool-call.gbnf"),
277
+ )
278
+ )
257
279
 
258
280
  # ---------------------------------------------------------------------------
259
281
  # Logging
@@ -266,6 +288,45 @@ logging.basicConfig(
266
288
  logger = logging.getLogger("uap.anthropic_proxy")
267
289
 
268
290
 
291
+ def _load_tool_call_grammar(path: str) -> str:
292
+ if not PROXY_TOOL_CALL_GRAMMAR:
293
+ return ""
294
+
295
+ try:
296
+ with open(path, "r", encoding="utf-8") as fh:
297
+ return fh.read().strip()
298
+ except OSError as exc:
299
+ logger.warning(
300
+ "Tool-call grammar disabled: failed to read %s (%s)",
301
+ path,
302
+ exc,
303
+ )
304
+ return ""
305
+
306
+
307
+ TOOL_CALL_GBNF = _load_tool_call_grammar(PROXY_TOOL_CALL_GRAMMAR_PATH)
308
+
309
+
310
+ def _apply_tool_call_grammar(
311
+ request_body: dict, tool_choice: str | None = None
312
+ ) -> None:
313
+ request_body.pop("grammar", None)
314
+
315
+ if not PROXY_TOOL_CALL_GRAMMAR or not TOOL_CALL_GBNF:
316
+ return
317
+
318
+ if not request_body.get("tools"):
319
+ return
320
+
321
+ effective_tool_choice = (
322
+ tool_choice if tool_choice is not None else request_body.get("tool_choice")
323
+ )
324
+ if PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY and effective_tool_choice != "required":
325
+ return
326
+
327
+ request_body["grammar"] = TOOL_CALL_GBNF
328
+
329
+
269
330
  # ---------------------------------------------------------------------------
270
331
  # Option F: Session-level Context Window Monitor
271
332
  # ---------------------------------------------------------------------------
@@ -876,7 +937,7 @@ async def lifespan(app: FastAPI):
876
937
  _resolve_prune_target_fraction() * 100,
877
938
  )
878
939
  logger.info(
879
- "Guardrails: malformed=%s stream_strict=%s force_non_stream=%s args_preflight=%s tool_narrowing=%s thinking_off_on_tools=%s dampener=%s(%d/%d/%d/%d->%d) contamination_breaker=%s(%d forced=%d required_miss=%d) analysis_only_route=%s(min_tools=%d,max_msgs=%d)",
940
+ "Guardrails: malformed=%s stream_strict=%s force_non_stream=%s args_preflight=%s tool_narrowing=%s thinking_off_on_tools=%s dampener=%s(%d/%d/%d/%d->%d) contamination_breaker=%s(%d forced=%d required_miss=%d) analysis_only_route=%s(min_tools=%d,max_msgs=%d) grammar=%s(required_only=%s loaded=%s path=%s)",
880
941
  PROXY_MALFORMED_TOOL_GUARDRAIL,
881
942
  PROXY_MALFORMED_TOOL_STREAM_STRICT,
882
943
  PROXY_FORCE_NON_STREAM,
@@ -896,6 +957,10 @@ async def lifespan(app: FastAPI):
896
957
  PROXY_ANALYSIS_ONLY_ROUTE,
897
958
  PROXY_ANALYSIS_ONLY_MIN_TOOLS,
898
959
  PROXY_ANALYSIS_ONLY_MAX_MESSAGES,
960
+ PROXY_TOOL_CALL_GRAMMAR,
961
+ PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY,
962
+ bool(TOOL_CALL_GBNF),
963
+ PROXY_TOOL_CALL_GRAMMAR_PATH,
899
964
  )
900
965
 
901
966
  yield
@@ -1044,49 +1109,27 @@ def _is_analysis_only_prompt(text: str) -> bool:
1044
1109
  if not text:
1045
1110
  return False
1046
1111
 
1047
- analysis_markers = (
1048
- "analy",
1049
- "review",
1050
- "audit",
1051
- "summar",
1052
- "explain",
1053
- "plan",
1054
- "recommend",
1055
- "assess",
1056
- "compare",
1057
- "investigate",
1058
- "diagnose",
1112
+ normalized = text.lower()
1113
+ has_analysis = bool(
1114
+ re.search(
1115
+ r"\b(?:analy(?:ze|zing|sis)?|review|audit|summar(?:y|ize|ized|ise)|explain|plan|recommend|assess|compare|investigate|diagnos(?:e|is))\b",
1116
+ normalized,
1117
+ )
1059
1118
  )
1060
- action_markers = (
1061
- "fix",
1062
- "edit",
1063
- "write",
1064
- "create",
1065
- "implement",
1066
- "patch",
1067
- "change",
1068
- "update",
1069
- "run ",
1070
- "execute",
1071
- "command",
1072
- "use tool",
1073
- "call tool",
1074
- "apply",
1075
- "commit",
1076
- "push",
1077
- "merge",
1078
- "publish",
1079
- "deploy",
1080
- "test",
1081
- "build",
1082
- "refactor",
1083
- "rename",
1084
- "delete",
1085
- "install",
1119
+ has_action = bool(
1120
+ re.search(
1121
+ r"\b(?:fix|edit|write|create|implement|patch|change|update|run|execute|apply|commit|push|merge|publish|deploy|test|build|refactor|rename|delete|install)\b",
1122
+ normalized,
1123
+ )
1124
+ ) or any(
1125
+ phrase in normalized
1126
+ for phrase in (
1127
+ "use tool",
1128
+ "call tool",
1129
+ "run command",
1130
+ "execute command",
1131
+ )
1086
1132
  )
1087
-
1088
- has_analysis = any(marker in text for marker in analysis_markers)
1089
- has_action = any(marker in text for marker in action_markers)
1090
1133
  return has_analysis and not has_action
1091
1134
 
1092
1135
 
@@ -1467,6 +1510,8 @@ def build_openai_request(anthropic_body: dict, monitor: SessionMonitor) -> dict:
1467
1510
  "Thinking disabled for tool turn (PROXY_DISABLE_THINKING_ON_TOOL_TURNS=on)"
1468
1511
  )
1469
1512
 
1513
+ _apply_tool_call_grammar(openai_body)
1514
+
1470
1515
  return openai_body
1471
1516
 
1472
1517
 
@@ -1793,6 +1838,11 @@ _TOOL_ARG_MARKERS = (
1793
1838
  "</think>",
1794
1839
  )
1795
1840
 
1841
+ _BASH_PROTOCOL_LINE_RE = re.compile(
1842
+ r"^\s*</?(?:tool_call|tool_response|parameter(?:=[^>]*)?|function(?:=[^>]*)?|think)\s*>\s*$",
1843
+ re.IGNORECASE,
1844
+ )
1845
+
1796
1846
 
1797
1847
  def _iter_string_leaves(value):
1798
1848
  if isinstance(value, str):
@@ -1822,6 +1872,26 @@ def _strip_tool_markup_artifacts(text: str) -> str:
1822
1872
  return cleaned.strip()
1823
1873
 
1824
1874
 
1875
+ def _strip_protocol_tag_only_lines(text: str) -> tuple[str, bool]:
1876
+ if not isinstance(text, str):
1877
+ return text, False
1878
+
1879
+ lines = text.splitlines()
1880
+ kept_lines: list[str] = []
1881
+ stripped = False
1882
+ for line in lines:
1883
+ if _BASH_PROTOCOL_LINE_RE.match(line):
1884
+ stripped = True
1885
+ continue
1886
+ kept_lines.append(line)
1887
+
1888
+ if not stripped:
1889
+ return text, False
1890
+
1891
+ cleaned = "\n".join(kept_lines).strip()
1892
+ return cleaned, True
1893
+
1894
+
1825
1895
  def _sanitize_markup_value(value):
1826
1896
  if isinstance(value, str):
1827
1897
  cleaned = _strip_tool_markup_artifacts(value)
@@ -2036,6 +2106,77 @@ def _repair_required_tool_args(
2036
2106
  return repaired_response, repaired_count
2037
2107
 
2038
2108
 
2109
+ def _repair_bash_command_artifacts(openai_resp: dict) -> tuple[dict, int]:
2110
+ if not _openai_has_tool_calls(openai_resp):
2111
+ return openai_resp, 0
2112
+
2113
+ choice, message = _extract_openai_choice(openai_resp)
2114
+ tool_calls = message.get("tool_calls") or []
2115
+ if not tool_calls:
2116
+ return openai_resp, 0
2117
+
2118
+ repaired_tool_calls = []
2119
+ repaired_count = 0
2120
+
2121
+ for tool_call in tool_calls:
2122
+ fn = tool_call.get("function") if isinstance(tool_call, dict) else {}
2123
+ if not isinstance(fn, dict):
2124
+ fn = {}
2125
+
2126
+ tool_name = str(fn.get("name", "")).strip().lower()
2127
+ if tool_name != "bash":
2128
+ repaired_tool_calls.append(tool_call)
2129
+ continue
2130
+
2131
+ raw_args = fn.get("arguments", "{}")
2132
+ if isinstance(raw_args, dict):
2133
+ parsed_args = dict(raw_args)
2134
+ else:
2135
+ try:
2136
+ parsed_args = json.loads(str(raw_args))
2137
+ except json.JSONDecodeError:
2138
+ repaired_tool_calls.append(tool_call)
2139
+ continue
2140
+
2141
+ if not isinstance(parsed_args, dict):
2142
+ repaired_tool_calls.append(tool_call)
2143
+ continue
2144
+
2145
+ command = parsed_args.get("command")
2146
+ if not isinstance(command, str):
2147
+ repaired_tool_calls.append(tool_call)
2148
+ continue
2149
+
2150
+ cleaned_command, changed = _strip_protocol_tag_only_lines(command)
2151
+ if not changed:
2152
+ repaired_tool_calls.append(tool_call)
2153
+ continue
2154
+
2155
+ parsed_args["command"] = cleaned_command
2156
+ new_tool_call = dict(tool_call)
2157
+ new_fn = dict(fn)
2158
+ new_fn["arguments"] = json.dumps(parsed_args, separators=(",", ":"))
2159
+ new_tool_call["function"] = new_fn
2160
+ repaired_tool_calls.append(new_tool_call)
2161
+ repaired_count += 1
2162
+
2163
+ if repaired_count == 0:
2164
+ return openai_resp, 0
2165
+
2166
+ repaired_response = dict(openai_resp)
2167
+ choices = list(openai_resp.get("choices") or [])
2168
+ if not choices:
2169
+ return openai_resp, 0
2170
+
2171
+ updated_choice = dict(choice)
2172
+ updated_message = dict(message)
2173
+ updated_message["tool_calls"] = repaired_tool_calls
2174
+ updated_choice["message"] = updated_message
2175
+ choices[0] = updated_choice
2176
+ repaired_response["choices"] = choices
2177
+ return repaired_response, repaired_count
2178
+
2179
+
2039
2180
  def _required_value_is_empty(value) -> bool:
2040
2181
  if value is None:
2041
2182
  return True
@@ -2132,6 +2273,22 @@ def _validate_tool_call_arguments(
2132
2273
  ),
2133
2274
  )
2134
2275
 
2276
+ if tool_name.strip().lower() == "bash":
2277
+ command = parsed.get("command")
2278
+ if isinstance(command, str):
2279
+ cleaned_command, had_protocol_lines = _strip_protocol_tag_only_lines(
2280
+ command
2281
+ )
2282
+ if had_protocol_lines and not cleaned_command:
2283
+ return ToolResponseIssue(
2284
+ kind="invalid_tool_args",
2285
+ reason="arguments for 'Bash' contained only protocol tag lines",
2286
+ retry_hint=(
2287
+ "Emit exactly one `Bash` tool call with a valid shell command in `arguments.command`. "
2288
+ "Do not include standalone XML/protocol tags."
2289
+ ),
2290
+ )
2291
+
2135
2292
  if _contains_tool_markup(parsed):
2136
2293
  return ToolResponseIssue(
2137
2294
  kind="invalid_tool_args",
@@ -2345,20 +2502,34 @@ def _is_malformed_tool_response(openai_resp: dict, anthropic_body: dict) -> bool
2345
2502
 
2346
2503
 
2347
2504
  def _build_malformed_retry_body(
2348
- openai_body: dict, anthropic_body: dict, retry_hint: str = ""
2505
+ openai_body: dict,
2506
+ anthropic_body: dict,
2507
+ retry_hint: str = "",
2508
+ tool_choice: str = "required",
2509
+ attempt: int = 1,
2510
+ total_attempts: int = 1,
2349
2511
  ) -> dict:
2350
2512
  retry_body = dict(openai_body)
2351
2513
  retry_body["stream"] = False
2352
- retry_body["tool_choice"] = "required"
2514
+ retry_body["tool_choice"] = tool_choice
2353
2515
  retry_body["temperature"] = PROXY_MALFORMED_TOOL_RETRY_TEMPERATURE
2354
2516
 
2355
- malformed_retry_instruction = {
2356
- "role": "user",
2357
- "content": (
2517
+ if tool_choice == "required":
2518
+ retry_instruction = (
2358
2519
  "Your previous response had invalid tool-call formatting. "
2359
2520
  "Respond with exactly one valid tool call using the provided tools. "
2360
2521
  "Do not output prose, markdown, XML tags, or schema snippets."
2361
- ),
2522
+ )
2523
+ else:
2524
+ retry_instruction = (
2525
+ "Your previous response had invalid tool-call formatting. "
2526
+ "If a tool is needed, emit exactly one valid tool call with strict JSON arguments. "
2527
+ "If no tool is needed for this turn, return concise plain text with no protocol tags."
2528
+ )
2529
+
2530
+ malformed_retry_instruction = {
2531
+ "role": "user",
2532
+ "content": retry_instruction,
2362
2533
  }
2363
2534
  existing_messages = retry_body.get("messages")
2364
2535
  if isinstance(existing_messages, list) and existing_messages:
@@ -2381,19 +2552,51 @@ def _build_malformed_retry_body(
2381
2552
  if PROXY_DISABLE_THINKING_ON_TOOL_TURNS:
2382
2553
  retry_body["enable_thinking"] = False
2383
2554
 
2555
+ _apply_tool_call_grammar(retry_body, tool_choice=tool_choice)
2556
+
2384
2557
  if retry_hint:
2385
2558
  repair_prompt = (
2386
- "[TOOL CALL REPAIR]\n"
2559
+ f"[TOOL CALL REPAIR attempt {attempt}/{total_attempts}]\n"
2387
2560
  f"{retry_hint}\n"
2388
- "Return exactly one valid tool call object and no explanatory prose."
2561
+ "Return a valid response for this turn without protocol artifacts."
2389
2562
  )
2390
2563
  retry_messages = list(retry_body.get("messages", []))
2391
- retry_messages.append({"role": "system", "content": repair_prompt})
2564
+ retry_messages.append({"role": "user", "content": repair_prompt})
2392
2565
  retry_body["messages"] = retry_messages
2393
2566
 
2394
2567
  return retry_body
2395
2568
 
2396
2569
 
2570
+ def _retry_tool_choice_for_attempt(
2571
+ required_tool_choice: bool, attempt: int, total_attempts: int
2572
+ ) -> str:
2573
+ if not required_tool_choice:
2574
+ return "auto"
2575
+ if total_attempts <= 1:
2576
+ return "required"
2577
+ return "auto" if attempt == total_attempts - 1 else "required"
2578
+
2579
+
2580
+ def _build_safe_text_openai_response(openai_resp: dict, text: str) -> dict:
2581
+ return {
2582
+ "id": openai_resp.get("id", f"chatcmpl_{uuid.uuid4().hex[:12]}"),
2583
+ "object": openai_resp.get("object", "chat.completion"),
2584
+ "created": openai_resp.get("created", int(time.time())),
2585
+ "model": openai_resp.get("model", "unknown"),
2586
+ "choices": [
2587
+ {
2588
+ "index": 0,
2589
+ "finish_reason": "stop",
2590
+ "message": {
2591
+ "role": "assistant",
2592
+ "content": text,
2593
+ },
2594
+ }
2595
+ ],
2596
+ "usage": openai_resp.get("usage", {}),
2597
+ }
2598
+
2599
+
2397
2600
  def _build_clean_guardrail_openai_response(openai_resp: dict) -> dict:
2398
2601
  return {
2399
2602
  "id": openai_resp.get("id", f"chatcmpl_{uuid.uuid4().hex[:12]}"),
@@ -2437,6 +2640,7 @@ async def _apply_unexpected_end_turn_guardrail(
2437
2640
  retry_body = dict(openai_body)
2438
2641
  retry_body["tool_choice"] = "required"
2439
2642
  retry_body["stream"] = False
2643
+ _apply_tool_call_grammar(retry_body, tool_choice="required")
2440
2644
 
2441
2645
  retry_resp = await client.post(
2442
2646
  f"{LLAMA_CPP_BASE}/chat/completions",
@@ -2486,7 +2690,8 @@ async def _apply_malformed_tool_guardrail(
2486
2690
  working_resp, required_repairs = _repair_required_tool_args(
2487
2691
  working_resp, anthropic_body
2488
2692
  )
2489
- repair_count = markup_repairs + required_repairs
2693
+ working_resp, bash_repairs = _repair_bash_command_artifacts(working_resp)
2694
+ repair_count = markup_repairs + required_repairs + bash_repairs
2490
2695
 
2491
2696
  required_tool_choice = openai_body.get("tool_choice") == "required"
2492
2697
  has_tool_calls = _openai_has_tool_calls(working_resp)
@@ -2536,10 +2741,18 @@ async def _apply_malformed_tool_guardrail(
2536
2741
  attempts = max(0, PROXY_MALFORMED_TOOL_RETRY_MAX)
2537
2742
  current_issue = issue
2538
2743
  for attempt in range(attempts):
2744
+ attempt_tool_choice = _retry_tool_choice_for_attempt(
2745
+ required_tool_choice,
2746
+ attempt,
2747
+ attempts,
2748
+ )
2539
2749
  retry_body = _build_malformed_retry_body(
2540
2750
  openai_body,
2541
2751
  anthropic_body,
2542
2752
  retry_hint=current_issue.retry_hint,
2753
+ tool_choice=attempt_tool_choice,
2754
+ attempt=attempt + 1,
2755
+ total_attempts=attempts,
2543
2756
  )
2544
2757
  retry_resp = await client.post(
2545
2758
  f"{LLAMA_CPP_BASE}/chat/completions",
@@ -2563,7 +2776,14 @@ async def _apply_malformed_tool_guardrail(
2563
2776
  retry_working, retry_required_repairs = _repair_required_tool_args(
2564
2777
  retry_working, anthropic_body
2565
2778
  )
2566
- retry_repairs = retry_markup_repairs + retry_required_repairs
2779
+ retry_working, retry_bash_repairs = _repair_bash_command_artifacts(
2780
+ retry_working
2781
+ )
2782
+ retry_repairs = (
2783
+ retry_markup_repairs + retry_required_repairs + retry_bash_repairs
2784
+ )
2785
+
2786
+ working_resp = retry_working
2567
2787
 
2568
2788
  retry_has_tool_calls = _openai_has_tool_calls(retry_working)
2569
2789
  retry_required = retry_body.get("tool_choice") == "required"
@@ -2620,6 +2840,17 @@ async def _apply_malformed_tool_guardrail(
2620
2840
  monitor.invalid_tool_call_streak,
2621
2841
  monitor.required_tool_miss_streak,
2622
2842
  )
2843
+
2844
+ degraded_text = _sanitize_tool_call_apology_text(
2845
+ _openai_message_text(working_resp)
2846
+ ).strip()
2847
+ if degraded_text and not _looks_malformed_tool_payload(degraded_text):
2848
+ logger.warning(
2849
+ "TOOL RESPONSE degrade: session=%s returning safe text fallback after retry exhaustion",
2850
+ session_id,
2851
+ )
2852
+ return _build_safe_text_openai_response(working_resp, degraded_text)
2853
+
2623
2854
  return _build_clean_guardrail_openai_response(working_resp)
2624
2855
 
2625
2856
 
@@ -2720,6 +2951,18 @@ def openai_to_anthropic_response(openai_resp: dict, model: str) -> dict:
2720
2951
  args = json.loads(fn.get("arguments", "{}"))
2721
2952
  except json.JSONDecodeError:
2722
2953
  args = {}
2954
+ if str(fn.get("name") or "").strip().lower() == "bash" and isinstance(args, dict):
2955
+ command = args.get("command")
2956
+ if isinstance(command, str):
2957
+ cleaned_command, had_protocol_lines = _strip_protocol_tag_only_lines(
2958
+ command
2959
+ )
2960
+ if had_protocol_lines:
2961
+ args = dict(args)
2962
+ args["command"] = cleaned_command
2963
+ logger.warning(
2964
+ "BASH SAFETY: stripped standalone protocol-tag lines from command before tool execution"
2965
+ )
2723
2966
  content.append(
2724
2967
  {
2725
2968
  "type": "tool_use",
@@ -3564,6 +3807,12 @@ async def context_status(request: Request):
3564
3807
  "overflow_count": monitor.overflow_count,
3565
3808
  "prune_threshold": PROXY_CONTEXT_PRUNE_THRESHOLD,
3566
3809
  "recent_history": monitor.context_history[-10:],
3810
+ "tool_call_grammar": {
3811
+ "enabled": PROXY_TOOL_CALL_GRAMMAR,
3812
+ "required_only": PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY,
3813
+ "path": PROXY_TOOL_CALL_GRAMMAR_PATH,
3814
+ "loaded": bool(TOOL_CALL_GBNF),
3815
+ },
3567
3816
  # Loop protection stats
3568
3817
  "loop_protection": {
3569
3818
  "enabled": PROXY_LOOP_BREAKER,
@@ -487,6 +487,68 @@ class TestMalformedToolGuardrail(unittest.TestCase):
487
487
  setattr(proxy, "PROXY_MALFORMED_TOOL_RETRY_TEMPERATURE", old_temp)
488
488
  setattr(proxy, "PROXY_DISABLE_THINKING_ON_TOOL_TURNS", old_disable)
489
489
 
490
+ def test_malformed_retry_body_appends_retry_hint_as_user_message(self):
491
+ openai_body = {
492
+ "model": "test",
493
+ "messages": [{"role": "user", "content": "fix"}],
494
+ }
495
+ anthropic_body = {
496
+ "tools": [{"name": "Read", "input_schema": {"type": "object"}}]
497
+ }
498
+
499
+ retry = proxy._build_malformed_retry_body(
500
+ openai_body,
501
+ anthropic_body,
502
+ retry_hint="Use strict JSON",
503
+ tool_choice="required",
504
+ attempt=1,
505
+ total_attempts=2,
506
+ )
507
+
508
+ self.assertEqual(retry["messages"][-1]["role"], "user")
509
+ self.assertIn("TOOL CALL REPAIR attempt 1/2", retry["messages"][-1]["content"])
510
+
511
+ def test_retry_ladder_releases_last_attempt_to_auto(self):
512
+ self.assertEqual(proxy._retry_tool_choice_for_attempt(True, 0, 3), "required")
513
+ self.assertEqual(proxy._retry_tool_choice_for_attempt(True, 1, 3), "required")
514
+ self.assertEqual(proxy._retry_tool_choice_for_attempt(True, 2, 3), "auto")
515
+ self.assertEqual(proxy._retry_tool_choice_for_attempt(False, 0, 3), "auto")
516
+
517
+ def test_malformed_retry_body_applies_grammar_only_for_required_tool_choice(self):
518
+ old_enabled = getattr(proxy, "PROXY_TOOL_CALL_GRAMMAR")
519
+ old_required_only = getattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY")
520
+ old_grammar = getattr(proxy, "TOOL_CALL_GBNF")
521
+ try:
522
+ setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR", True)
523
+ setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY", True)
524
+ setattr(proxy, "TOOL_CALL_GBNF", 'root ::= "<tool_call>"')
525
+
526
+ openai_body = {
527
+ "model": "test",
528
+ "messages": [{"role": "user", "content": "fix"}],
529
+ }
530
+ anthropic_body = {
531
+ "tools": [{"name": "Read", "input_schema": {"type": "object"}}]
532
+ }
533
+
534
+ required_retry = proxy._build_malformed_retry_body(
535
+ openai_body,
536
+ anthropic_body,
537
+ tool_choice="required",
538
+ )
539
+ auto_retry = proxy._build_malformed_retry_body(
540
+ openai_body,
541
+ anthropic_body,
542
+ tool_choice="auto",
543
+ )
544
+
545
+ self.assertEqual(required_retry.get("grammar"), 'root ::= "<tool_call>"')
546
+ self.assertNotIn("grammar", auto_retry)
547
+ finally:
548
+ setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR", old_enabled)
549
+ setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY", old_required_only)
550
+ setattr(proxy, "TOOL_CALL_GBNF", old_grammar)
551
+
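Reviewer note: the diff exercises `_retry_tool_choice_for_attempt` only through the four assertions above and does not show its body. A minimal sketch that satisfies those assertions follows. The function name without the leading underscore, the parameter names, and the rule "release on the final attempt" are assumptions inferred from the test, not the proxy's actual implementation.

```python
def retry_tool_choice_for_attempt(
    required: bool, attempt: int, total_attempts: int
) -> str:
    """Pick the tool_choice value for a zero-based retry attempt.

    Assumption: forced tool use stays "required" on every attempt except
    the last, which is released to "auto" so the model may answer in text.
    """
    if not required:
        # The original request never forced a tool call; nothing to release.
        return "auto"
    is_last_attempt = attempt >= total_attempts - 1
    return "auto" if is_last_attempt else "required"
```

Under this reading, a ladder with a single attempt is released to "auto" immediately. Whether the real helper behaves that way is not visible in this diff.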
490
552
  def test_clean_guardrail_response_does_not_promise_future_tool_call(self):
491
553
  guardrail = proxy._build_clean_guardrail_openai_response(
492
554
  {"model": "test-model"}
@@ -772,6 +834,34 @@ class TestMalformedToolGuardrail(unittest.TestCase):
772
834
  )
773
835
  self.assertEqual(args["command"], "ls")
774
836
 
837
+ def test_bash_command_repair_strips_protocol_tag_only_lines(self):
838
+ openai_resp = {
839
+ "choices": [
840
+ {
841
+ "finish_reason": "tool_calls",
842
+ "message": {
843
+ "content": "",
844
+ "tool_calls": [
845
+ {
846
+ "id": "call_1",
847
+ "function": {
848
+ "name": "Bash",
849
+ "arguments": '{"command":"pwd\\n</function>\\n<tool_call>"}',
850
+ },
851
+ }
852
+ ],
853
+ },
854
+ }
855
+ ]
856
+ }
857
+
858
+ repaired, count = proxy._repair_bash_command_artifacts(openai_resp)
859
+ self.assertEqual(count, 1)
860
+ args = json.loads(
861
+ repaired["choices"][0]["message"]["tool_calls"][0]["function"]["arguments"]
862
+ )
863
+ self.assertEqual(args["command"], "pwd")
864
+
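Reviewer note: the test above expects `pwd\n</function>\n<tool_call>` to be repaired to `pwd`, and the proxy hunk calls `_strip_protocol_tag_only_lines(command)` expecting a `(cleaned_command, had_protocol_lines)` pair. The helper's body is not part of this diff, so the following is only a sketch of one way to get that behavior. The regular expression, the set of tag names, and the trailing `strip()` are assumptions.

```python
from __future__ import annotations

import re

# Hypothetical pattern: a line that holds nothing but a single protocol-style
# tag, e.g. <tool_call>, </tool_call>, <function=Bash>, or </function>.
_PROTOCOL_TAG_ONLY_LINE = re.compile(
    r"^\s*</?(?:tool_call|function|parameter)(?:=[^>]*)?>\s*$"
)


def strip_protocol_tag_only_lines(command: str) -> tuple[str, bool]:
    """Drop lines that consist solely of a protocol tag.

    Returns the cleaned command and a flag telling whether any line was
    removed. Lines that merely mention a tag alongside other text are kept.
    """
    lines = command.split("\n")
    kept = [line for line in lines if not _PROTOCOL_TAG_ONLY_LINE.match(line)]
    if len(kept) == len(lines):
        # Nothing matched, so return the input untouched.
        return command, False
    return "\n".join(kept).strip(), True
```

Matching whole lines only is the conservative choice here: a command such as `echo '<tool_call>' > out.txt` passes through unchanged, because the tag shares its line with real shell text.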
775
865
  def test_guardrail_accepts_repaired_markup_without_fallback(self):
776
866
  old_retry = getattr(proxy, "PROXY_MALFORMED_TOOL_RETRY_MAX")
777
867
  try:
@@ -1214,6 +1304,81 @@ class TestToolTurnControls(unittest.TestCase):
1214
1304
  setattr(proxy, "PROXY_FORCED_TOOL_DAMPENER_REJECTIONS", old_rejections)
1215
1305
  setattr(proxy, "PROXY_FORCED_TOOL_DAMPENER_AUTO_TURNS", old_auto_turns)
1216
1306
 
1307
+ def test_build_request_applies_grammar_when_tool_choice_required(self):
1308
+ old_enabled = getattr(proxy, "PROXY_TOOL_CALL_GRAMMAR")
1309
+ old_required_only = getattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY")
1310
+ old_grammar = getattr(proxy, "TOOL_CALL_GBNF")
1311
+ try:
1312
+ setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR", True)
1313
+ setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY", True)
1314
+ setattr(proxy, "TOOL_CALL_GBNF", 'root ::= "<tool_call>"')
1315
+
1316
+ body = {
1317
+ "model": "test",
1318
+ "messages": [
1319
+ {
1320
+ "role": "assistant",
1321
+ "content": [{"type": "text", "text": "I will continue."}],
1322
+ },
1323
+ {"role": "user", "content": "continue"},
1324
+ ],
1325
+ "tools": [
1326
+ {
1327
+ "name": "Read",
1328
+ "description": "Read file",
1329
+ "input_schema": {"type": "object"},
1330
+ }
1331
+ ],
1332
+ }
1333
+
1334
+ openai = proxy.build_openai_request(
1335
+ body, proxy.SessionMonitor(context_window=262144)
1336
+ )
1337
+ self.assertEqual(openai.get("tool_choice"), "required")
1338
+ self.assertEqual(openai.get("grammar"), 'root ::= "<tool_call>"')
1339
+ finally:
1340
+ setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR", old_enabled)
1341
+ setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY", old_required_only)
1342
+ setattr(proxy, "TOOL_CALL_GBNF", old_grammar)
1343
+
1344
+ def test_build_request_omits_grammar_when_tool_choice_released_to_auto(self):
1345
+ old_enabled = getattr(proxy, "PROXY_TOOL_CALL_GRAMMAR")
1346
+ old_required_only = getattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY")
1347
+ old_grammar = getattr(proxy, "TOOL_CALL_GBNF")
1348
+ try:
1349
+ setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR", True)
1350
+ setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY", True)
1351
+ setattr(proxy, "TOOL_CALL_GBNF", 'root ::= "<tool_call>"')
1352
+
1353
+ monitor = proxy.SessionMonitor(context_window=262144)
1354
+ monitor.forced_auto_cooldown_turns = 1
1355
+
1356
+ body = {
1357
+ "model": "test",
1358
+ "messages": [
1359
+ {
1360
+ "role": "assistant",
1361
+ "content": [{"type": "text", "text": "I will continue."}],
1362
+ },
1363
+ {"role": "user", "content": "continue"},
1364
+ ],
1365
+ "tools": [
1366
+ {
1367
+ "name": "Read",
1368
+ "description": "Read file",
1369
+ "input_schema": {"type": "object"},
1370
+ }
1371
+ ],
1372
+ }
1373
+
1374
+ openai = proxy.build_openai_request(body, monitor)
1375
+ self.assertEqual(openai.get("tool_choice"), "auto")
1376
+ self.assertNotIn("grammar", openai)
1377
+ finally:
1378
+ setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR", old_enabled)
1379
+ setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY", old_required_only)
1380
+ setattr(proxy, "TOOL_CALL_GBNF", old_grammar)
1381
+
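Reviewer note: the two tests above pin down when the request body carries a `grammar` key: it is present when `tool_choice` is `"required"` and absent once the choice is released to `"auto"`, given that both `PROXY_TOOL_CALL_GRAMMAR` and `PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY` are true. The gating logic inside `build_openai_request` is not shown in this part of the diff. The sketch below is one plausible reading, and the function name and keyword parameters are hypothetical.

```python
from typing import Any, Dict


def apply_tool_call_grammar(
    body: Dict[str, Any],
    *,
    enabled: bool,
    required_only: bool,
    grammar: str,
) -> Dict[str, Any]:
    """Return a copy of the request body, adding "grammar" when allowed.

    Assumptions: an empty grammar string means "not loaded", and
    required_only restricts the grammar to tool_choice == "required".
    """
    out = dict(body)  # shallow copy so the caller's dict is not mutated
    if not enabled or not grammar:
        return out
    if required_only and out.get("tool_choice") != "required":
        return out
    out["grammar"] = grammar
    return out
```

For example, calling it with `{"tool_choice": "auto"}` and `required_only=True` returns the body without a `grammar` key, which mirrors the second test.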
1217
1382
  def test_no_tools_does_not_inject_agentic_system_message(self):
1218
1383
  body = {
1219
1384
  "model": "test",
@@ -1290,6 +1455,38 @@ class TestToolTurnControls(unittest.TestCase):
1290
1455
  setattr(proxy, "PROXY_ANALYSIS_ONLY_MIN_TOOLS", old_min_tools)
1291
1456
  setattr(proxy, "PROXY_ANALYSIS_ONLY_MAX_MESSAGES", old_max_messages)
1292
1457
 
1458
+ def test_analysis_only_route_does_not_treat_implementation_as_action(self):
1459
+ old_route = getattr(proxy, "PROXY_ANALYSIS_ONLY_ROUTE")
1460
+ old_min_tools = getattr(proxy, "PROXY_ANALYSIS_ONLY_MIN_TOOLS")
1461
+ old_max_messages = getattr(proxy, "PROXY_ANALYSIS_ONLY_MAX_MESSAGES")
1462
+ try:
1463
+ setattr(proxy, "PROXY_ANALYSIS_ONLY_ROUTE", True)
1464
+ setattr(proxy, "PROXY_ANALYSIS_ONLY_MIN_TOOLS", 4)
1465
+ setattr(proxy, "PROXY_ANALYSIS_ONLY_MAX_MESSAGES", 2)
1466
+
1467
+ body = {
1468
+ "messages": [
1469
+ {
1470
+ "role": "user",
1471
+ "content": "analyze implementation options and summarize tradeoffs",
1472
+ }
1473
+ ],
1474
+ "tools": [
1475
+ {"name": "Read", "input_schema": {"type": "object"}},
1476
+ {"name": "Edit", "input_schema": {"type": "object"}},
1477
+ {"name": "Write", "input_schema": {"type": "object"}},
1478
+ {"name": "Bash", "input_schema": {"type": "object"}},
1479
+ ],
1480
+ }
1481
+
1482
+ updated, removed = proxy._maybe_route_analysis_without_tools(body)
1483
+ self.assertEqual(removed, 4)
1484
+ self.assertNotIn("tools", updated)
1485
+ finally:
1486
+ setattr(proxy, "PROXY_ANALYSIS_ONLY_ROUTE", old_route)
1487
+ setattr(proxy, "PROXY_ANALYSIS_ONLY_MIN_TOOLS", old_min_tools)
1488
+ setattr(proxy, "PROXY_ANALYSIS_ONLY_MAX_MESSAGES", old_max_messages)
1489
+
1293
1490
 
1294
1491
  class TestSessionContaminationBreaker(unittest.TestCase):
1295
1492
  def test_contamination_breaker_trims_and_resets_streak(self):