@miller-tech/uap 1.15.5 → 1.15.6

package/docs/INDEX.md CHANGED
@@ -47,6 +47,14 @@
47
47
  - [Token Optimization](benchmarks/TOKEN_OPTIMIZATION.md) -- Per-feature token savings analysis
48
48
  - [Accuracy Analysis](benchmarks/ACCURACY_ANALYSIS.md) -- Internal vs Terminal-Bench comparison
49
49
 
50
+ ## Blog
51
+
52
+ - [Speculative Decoding Production Playbook](blog/SPECULATIVE_DECODING_PRODUCTION_PLAYBOOK.md) -- Long-form narrative on throughput gains, failure modes, and stable profiles
53
+
54
+ ## PR Templates
55
+
56
+ - [Speculative Docs PR Template](pr/PR_SPECULATIVE_DOCS_TEMPLATE.md) -- Ready-to-submit PR copy, checklist, and reviewer guidance
57
+
50
58
  ## Research
51
59
 
52
60
  - [Memory Systems Comparison](research/MEMORY_SYSTEMS_COMPARISON.md) -- MemGPT, LangGraph, Mem0, A-MEM analysis
@@ -0,0 +1,139 @@
1
+ # Speculative Decoding in llama.cpp: Real Speedups Without Breaking Agentic Reliability
2
+
3
+ Speculative decoding can look like free performance until it meets long-context, tool-heavy agent workflows. This write-up covers where throughput improved, what regressed, and which operational changes restored stability across `llama.cpp` and an Anthropic-compatible proxy.
4
+
5
+ ## Why This Matters
6
+
7
+ Speculative decoding is strongest when generated text has predictable structure or repetition. But in real coding sessions, throughput alone is not enough: the system must preserve clean output, reliable tool-call behavior, and long-session continuity.
8
+
9
+ In practice, four layers behave as a single runtime boundary and have to be tuned together:
10
+
11
+ - `llama.cpp` speculative behavior
12
+ - parameter profile and rollback mode
13
+ - proxy streaming/fallback policies
14
+ - agentic tool-loop control behavior
15
+
16
+ ## Baseline Environment
17
+
18
+ - Runtime: `llama.cpp` + CUDA + Qwen3.5 GGUF
19
+ - Context window: `262144`
20
+ - Spec type: `ngram-cache`
21
+ - Gateway: Anthropic-compatible proxy forwarding to OpenAI-compatible server
22
+
23
+ Related runbooks:
24
+
25
+ - `docs/deployment/UAP_LLAMA_ANTHROPIC_PROXY_BOOTSTRAP.md`
26
+ - `docs/benchmarks/SPECULATIVE_DECODING_JOURNEY_2026-03.md`
27
+
28
+ ## What We Observed
29
+
30
+ ### Throughput Gains Were Workload-Dependent
31
+
32
+ Speculation did not improve all turns uniformly. Coding/tool turns saw little or no uplift; repetition-heavy turns saw large gains.
33
+
34
+ Representative 27B snapshot (`ctx=262144`):
35
+
36
+ - No spec: ~43 tok/s coding, ~41 tok/s pattern
37
+ - Balanced spec (`12/2/0.80`): ~43 tok/s coding, ~102 tok/s pattern
38
+
39
+ Takeaway: benchmark by workload class, not one blended average.
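To make the takeaway concrete, the snapshot figures above can be turned into per-class speedups. This is a minimal sketch that uses only the approximate tok/s values quoted in this section; the dictionary layout and the `speedup_by_class` name are illustrative assumptions, not part of the package.

```python
# Approximate tok/s figures quoted above (27B snapshot, ctx=262144).
NO_SPEC = {"coding": 43.0, "pattern": 41.0}
BALANCED_SPEC = {"coding": 43.0, "pattern": 102.0}


def speedup_by_class(baseline: dict, candidate: dict) -> dict:
    """Return the candidate/baseline throughput ratio for each workload class."""
    return {name: candidate[name] / baseline[name] for name in baseline}


per_class = speedup_by_class(NO_SPEC, BALANCED_SPEC)

# A single blended ratio over both classes, shown only for contrast.
blended = sum(BALANCED_SPEC.values()) / sum(NO_SPEC.values())

for name, ratio in per_class.items():
    print(f"{name}: {ratio:.2f}x")
print(f"blended: {blended:.2f}x")
```

The blended ratio (about 1.73x) describes neither class: coding turns are unchanged, while pattern turns improve by roughly 2.5x.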
40
+
41
+ ### Newer Lineage Produced Noisier Warnings
42
+
43
+ Under identical settings, newer builds emitted warnings such as:
44
+
45
+ - `find_slot: non-consecutive token position`
46
+
47
+ These warnings correlated with lower effective throughput and less stable long-session behavior in A/B comparisons.
48
+
49
+ ### Proxy Fallback Could Leak Malformed Internal Text
50
+
51
+ When the upstream server returned reasoning content but no visible output, a permissive fallback policy could expose malformed fragments (pseudo-tool text, schema/policy echoes) to end users.
52
+
53
+ Patterns included:
54
+
55
+ - `</parameter>`-style fragments
56
+ - non-JSON pseudo-tool content
57
+ - repetitive policy-like loops with no valid `tool_calls`
58
+
59
+ ## Immediate Fixes That Worked
60
+
61
+ ### Safe Production Defaults
62
+
63
+ The highest-leverage stabilization profile was:
64
+
65
+ - `PROXY_STREAM_REASONING_FALLBACK=off`
66
+ - `PROXY_MALFORMED_TOOL_GUARDRAIL=on`
67
+ - `PROXY_MALFORMED_TOOL_STREAM_STRICT=on`
68
+ - `PROXY_MAX_TOKENS_FLOOR=4096`
69
+
70
+ Why:
71
+
72
+ - `PROXY_STREAM_REASONING_FALLBACK=off` suppresses malformed reasoning leakage.
73
+ - The malformed-tool guardrail plus the strict stream path recover malformed stream+tools turns.
74
+ - The lower token floor (`4096`) shortens long failure turns while leaving normal turns unaffected.
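The profile above could be loaded and type-checked at startup along the following lines. This is a sketch only: the variable names and values come from the list above, but the `load_profile` helper and the rule that `on`/`off` strings map to booleans are assumptions for illustration, not the proxy's actual parsing code.

```python
import os
from typing import Mapping, Optional

# Names and values are taken from the profile above.
SAFE_DEFAULTS = {
    "PROXY_STREAM_REASONING_FALLBACK": "off",
    "PROXY_MALFORMED_TOOL_GUARDRAIL": "on",
    "PROXY_MALFORMED_TOOL_STREAM_STRICT": "on",
    "PROXY_MAX_TOKENS_FLOOR": "4096",
}


def load_profile(env: Optional[Mapping[str, str]] = None) -> dict:
    """Overlay environment values on the safe defaults and coerce types.

    Assumption: boolean switches are spelled "on"/"off"; the token floor is an int.
    """
    source = os.environ if env is None else env
    profile = {}
    for key, default in SAFE_DEFAULTS.items():
        value = source.get(key, default)
        if key == "PROXY_MAX_TOKENS_FLOOR":
            profile[key] = int(value)
        else:
            profile[key] = value.strip().lower() == "on"
    return profile


# With no overrides, the result is exactly the safe profile.
print(load_profile({}))
```

Passing an explicit mapping keeps the example deterministic; calling `load_profile()` with no argument reads `os.environ` instead.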
75
+
76
+ ### Balanced Speculative Profile for Daily Agentic Work
77
+
78
+ - `spec-type=ngram-cache`
79
+ - `draft-max=12`
80
+ - `draft-min=2`
81
+ - `draft-p-min=0.80`
82
+ - rollback mode: `strict`
83
+
84
+ This profile is less aggressive than max-throughput tuning, but significantly safer for long coding sessions.
85
+
86
+ ## Benchmark Method That Prevents False Wins
87
+
88
+ A useful speculative benchmark protocol should include:
89
+
90
+ 1. Prompt classes
91
+ - coding/tool-call tasks
92
+ - repetition/pattern-heavy tasks
93
+ 2. Repeats and warmup
94
+ - fixed run count
95
+ - warmup policy
96
+ - p50/p95 latency, not only mean tok/s
97
+ 3. Required metrics
98
+ - decode throughput (`eval tok/s`)
99
+ - prefill throughput (`prompt eval tok/s`)
100
+ - acceptance/rejection behavior
101
+ - malformed-turn incidence
102
+ - stop reason distribution
103
+ 4. Profile matrix
104
+ - no-spec baseline
105
+ - aggressive profile
106
+ - balanced profile
107
+
108
+ Without this protocol, speculative tuning can appear faster while degrading real agentic reliability.
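The "repeats and warmup" step of the protocol can be sketched as follows. The latency values are hypothetical, and the `percentile` and `summarize` helpers are assumptions for illustration (nearest-rank percentiles, a fixed warmup count); they are not taken from the package.

```python
import math
from statistics import mean


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_s: list, warmup_runs: int = 1) -> dict:
    """Drop the warmup runs, then report mean, p50, and p95 latency in seconds."""
    measured = latencies_s[warmup_runs:]
    return {
        "mean": mean(measured),
        "p50": percentile(measured, 50),
        "p95": percentile(measured, 95),
    }


# Hypothetical latencies: one slow warmup run, then nine normal turns and one stall.
runs = [9.0, 2.0, 2.1, 2.0, 2.2, 2.1, 2.0, 2.1, 2.2, 2.0, 8.0]
summary = summarize(runs)
print(summary)
```

In this made-up sample the p50 is 2.1 s and the p95 is 8.0 s, while the mean sits near 2.67 s. A single stalled turn is visible in the tail percentile and largely hidden in the mean, which is why the protocol asks for both.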
109
+
110
+ ## Practical Playbook
111
+
112
+ ### Use for Daily Agentic Coding
113
+
114
+ - balanced `ngram-cache` (`12/2/0.80`)
115
+ - strict malformed-tool stream guardrail
116
+ - reasoning fallback disabled
117
+ - reduced token floor (`4096`)
118
+
119
+ ### Use for Max Throughput Exploration
120
+
121
+ - hybrid rollback
122
+ - larger draft windows
123
+ - tightly scoped benchmark prompts
124
+
125
+ Promote an exploration profile only if a long-session tool-loop soak test remains stable.
126
+
127
+ ## What llama.cpp Docs Should Add Next
128
+
129
+ Mechanics are documented well today. The next improvement is operational clarity:
130
+
131
+ - implementation selection matrix by workload
132
+ - troubleshooting by signature (`find_slot`, rollback spikes, acceptance collapse)
133
+ - reproducible benchmark protocol and output schema
134
+ - rollout/canary/rollback criteria
135
+ - proxy compatibility appendix for stream+tools environments
136
+
137
+ ## Final Takeaway
138
+
139
+ Speculative decoding in production is a systems problem, not just a decoding primitive. Treating runtime + transport + tool-loop behavior as one boundary is what makes speculative speedups both real and reliable.
@@ -0,0 +1,146 @@
1
+ ## Title
2
+
3
+ docs: add speculative decoding production playbook and agentic compatibility guidance
4
+
5
+ ## Context
6
+
7
+ `docs/speculative.md` explains speculative mechanisms and flags, but production operators also need:
8
+
9
+ - workload-driven profile selection,
10
+ - reproducible benchmarking protocol,
11
+ - signature-based regression triage,
12
+ - guidance for stream+tools agentic environments.
13
+
14
+ This PR adds operational documentation to reduce drift between benchmark wins and real-session behavior.
15
+
16
+ ## Changes
17
+
18
+ ### Add new guide
19
+
20
+ - New: `docs/speculative-production.md`
21
+ - implementation matrix:
22
+ - `draft`
23
+ - `ngram-cache`
24
+ - `ngram-simple`
25
+ - `ngram-map-k`
26
+ - `ngram-map-k4v`
27
+ - `ngram-mod`
28
+ - decision tree by workload (coding, repetitive transform, mixed)
29
+ - benchmark protocol (run counts, warmup, prompt classes, metrics)
30
+ - troubleshooting by signature:
31
+ - `find_slot: non-consecutive token position`
32
+ - low acceptance + high rollback
33
+ - throughput collapse after commit switch
34
+ - rollout rules (canary, promotion threshold, rollback triggers)
35
+
36
+ ### Update existing speculative docs
37
+
38
+ - Update `docs/speculative.md`:
39
+ - add link to production guide
40
+ - add "how to interpret statistics in practice"
41
+ - add "workload sensitivity and reproducibility notes"
42
+
43
+ ### Add compatibility appendix
44
+
45
+ - New appendix (or linked page): stream+tools compatibility for proxy-mediated agentic flows
46
+ - fallback policy guidance (`off` default for production)
47
+ - malformed stream/tool guardrail behavior
48
+ - max token floor and prune target recommendations
49
+
50
+ ## Why
51
+
52
+ Speculative decoding quality in agentic coding depends on end-to-end behavior, including transport and stream tool-loop handling. This documentation closes that gap and provides a repeatable operator path.
53
+
54
+ ## Validation Plan
55
+
56
+ - Verify all CLI flags/options in examples against current `llama-server`.
57
+ - Verify all linked scripts/docs paths resolve.
58
+ - Include one benchmark table with:
59
+ - decode/prefill throughput
60
+ - acceptance indicators
61
+ - latency percentiles
62
+ - workload class labels
63
+
64
+ ## Risks
65
+
66
+ - Overfitting recommendations to one model/hardware class.
67
+ - Treating proxy behavior as universally required.
68
+
69
+ ## Mitigations
70
+
71
+ - Mark all profile recommendations as workload/hardware sensitive.
72
+ - Separate "safe baseline" from "aggressive benchmark-only" profiles.
73
+ - Require local A/B validation before rollout.
74
+
75
+ ## Out of Scope
76
+
77
+ - Runtime code changes
78
+ - Kernel-level speculative optimization changes
79
+ - Proxy implementation changes (docs-only PR)
80
+
81
+ ## Follow-ups
82
+
83
+ 1. Add nightly speculative regression harness.
84
+ 2. Publish benchmark JSON schema for machine comparison.
85
+ 3. Add commit-lineage tracking for performance regressions.
86
+
87
+ ---
88
+
89
+ ## Ready-to-Submit GitHub PR Body
90
+
91
+ ### Summary
92
+
93
+ This docs PR adds a production-oriented speculative decoding playbook for llama.cpp users running real multi-turn workloads (especially agentic/tool-call scenarios). It complements existing mechanism-level docs with actionable tuning, troubleshooting, and rollout guidance.
94
+
95
+ ### What Changed
96
+
97
+ - Added `docs/speculative-production.md` (new operational guide)
98
+ - implementation selection matrix
99
+ - workload-based decision tree
100
+ - benchmark protocol + required metrics
101
+ - troubleshooting by real log signatures
102
+ - canary/rollback rollout guidance
103
+ - Updated `docs/speculative.md`
104
+ - links to production guide
105
+ - practical stats interpretation notes
106
+ - workload sensitivity notes
107
+ - Added/linked "agentic stream+tools compatibility" appendix
108
+ - fallback policy defaults
109
+ - malformed stream/tool guardrails
110
+ - token-floor/prune guidance
111
+
112
+ ### Why
113
+
114
+ Current docs describe speculative decoding internals clearly, but production operators need a reproducible way to:
115
+
116
+ - choose stable profiles by workload,
117
+ - detect/triage regressions quickly,
118
+ - avoid benchmark-only wins that fail in long sessions.
119
+
120
+ ### Reviewer Guide
121
+
122
+ Please focus review on:
123
+
124
+ 1. Accuracy of CLI flags and option names.
125
+ 2. Correctness of troubleshooting signatures and interpretations.
126
+ 3. Clarity of benchmark protocol (can another team reproduce it?).
127
+ 4. Whether safe-vs-aggressive profile separation is clear enough.
128
+
129
+ ### Validation
130
+
131
+ - [ ] Command examples verified against current `llama-server --help`
132
+ - [ ] Linked docs/scripts paths validated
133
+ - [ ] Benchmark table includes workload class labels
134
+ - [ ] Metrics include decode/prefill throughput + latency percentile view
135
+ - [ ] No runtime behavior claims without explicit caveats
136
+
137
+ ### Risks / Caveats
138
+
139
+ - Recommendations are model/hardware/workload dependent.
140
+ - Guidance is operational, not a substitute for local A/B testing.
141
+
142
+ ### Follow-ups
143
+
144
+ - [ ] Add nightly regression harness for speculative profiles
145
+ - [ ] Publish machine-readable benchmark schema
146
+ - [ ] Add commit lineage references in benchmark artifacts
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@miller-tech/uap",
-   "version": "1.15.5",
+   "version": "1.15.6",
    "description": "Autonomous AI agent memory system with CLAUDE.md protocol enforcement",
    "type": "module",
    "main": "dist/index.js",
@@ -22,6 +22,15 @@ if [ -z "$CMD" ]; then
    exit 0
  fi

+ # ─── Protocol Tag Injection Guard ────────────────────────────────
+ # Reject Bash payloads that still contain standalone protocol tag lines.
+ # These fragments can appear after malformed tool-call rendering and must
+ # never reach shell evaluation.
+ if printf '%s\n' "$CMD" | grep -qE '^\s*</?(tool_call|tool_response|parameter(=[^>]*)?|function(=[^>]*)?|think)\s*>\s*$'; then
+   echo "BLOCKED [bash-safety]: Command contains standalone XML/protocol tag lines. Remove tool-call tag artifacts before execution." >&2
+   exit 2
+ fi
+
  # ─── IaC Pipeline Enforcement ───────────────────────────────────
  # Block local terraform apply/destroy (policies/iac-pipeline-enforcement.md)
  # Allow: terraform fmt, validate, init, plan, output, show, state list, graph
@@ -1044,49 +1044,27 @@ def _is_analysis_only_prompt(text: str) -> bool:
      if not text:
          return False

-     analysis_markers = (
-         "analy",
-         "review",
-         "audit",
-         "summar",
-         "explain",
-         "plan",
-         "recommend",
-         "assess",
-         "compare",
-         "investigate",
-         "diagnose",
+     normalized = text.lower()
+     has_analysis = bool(
+         re.search(
+             r"\b(?:analy(?:ze|zing|sis)?|review|audit|summar(?:y|ize|ized|ise)|explain|plan|recommend|assess|compare|investigate|diagnos(?:e|is))\b",
+             normalized,
+         )
      )
-     action_markers = (
-         "fix",
-         "edit",
-         "write",
-         "create",
-         "implement",
-         "patch",
-         "change",
-         "update",
-         "run ",
-         "execute",
-         "command",
-         "use tool",
-         "call tool",
-         "apply",
-         "commit",
-         "push",
-         "merge",
-         "publish",
-         "deploy",
-         "test",
-         "build",
-         "refactor",
-         "rename",
-         "delete",
-         "install",
+     has_action = bool(
+         re.search(
+             r"\b(?:fix|edit|write|create|implement|patch|change|update|run|execute|apply|commit|push|merge|publish|deploy|test|build|refactor|rename|delete|install)\b",
+             normalized,
+         )
+     ) or any(
+         phrase in normalized
+         for phrase in (
+             "use tool",
+             "call tool",
+             "run command",
+             "execute command",
+         )
      )
-
-     has_analysis = any(marker in text for marker in analysis_markers)
-     has_action = any(marker in text for marker in action_markers)
      return has_analysis and not has_action

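The effect of moving from substring markers to word-boundary patterns can be checked in isolation. The sketch below copies the two patterns and the phrase list from the hunk above into a standalone function; the module-level constant names, the `is_analysis_only_prompt` and `legacy_has_action` helpers, and the sample prompts are assumptions for illustration, not the proxy's own layout.

```python
import re

# Patterns copied from the hunk above; constant names are illustrative.
ANALYSIS_RE = re.compile(
    r"\b(?:analy(?:ze|zing|sis)?|review|audit|summar(?:y|ize|ized|ise)|explain|plan"
    r"|recommend|assess|compare|investigate|diagnos(?:e|is))\b"
)
ACTION_RE = re.compile(
    r"\b(?:fix|edit|write|create|implement|patch|change|update|run|execute|apply|commit"
    r"|push|merge|publish|deploy|test|build|refactor|rename|delete|install)\b"
)
ACTION_PHRASES = ("use tool", "call tool", "run command", "execute command")

# A few of the removed substring markers, enough to show the old behavior.
LEGACY_ACTION_MARKERS = ("fix", "implement", "test", "build")


def is_analysis_only_prompt(text: str) -> bool:
    """True when the prompt contains an analysis verb and no action verb or phrase."""
    if not text:
        return False
    normalized = text.lower()
    has_analysis = bool(ANALYSIS_RE.search(normalized))
    has_action = bool(ACTION_RE.search(normalized)) or any(
        phrase in normalized for phrase in ACTION_PHRASES
    )
    return has_analysis and not has_action


def legacy_has_action(text: str) -> bool:
    """Old-style substring check: any marker appearing anywhere counts as an action."""
    return any(marker in text for marker in LEGACY_ACTION_MARKERS)


prompt = "analyze implementation options and summarize tradeoffs"
print(is_analysis_only_prompt(prompt))  # word boundaries: "implementation" is not "implement"
print(legacy_has_action(prompt))        # substring match: "implement" is found inside "implementation"
print(is_analysis_only_prompt("Review the patch and fix the failing test"))
```

The first prompt is the one used by the new unit test later in this diff: under the substring check it would count as an action request, while the word-boundary pattern classifies it as analysis-only.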
 
@@ -1793,6 +1771,11 @@ _TOOL_ARG_MARKERS = (
      "</think>",
  )

+ _BASH_PROTOCOL_LINE_RE = re.compile(
+     r"^\s*</?(?:tool_call|tool_response|parameter(?:=[^>]*)?|function(?:=[^>]*)?|think)\s*>\s*$",
+     re.IGNORECASE,
+ )
+

  def _iter_string_leaves(value):
      if isinstance(value, str):
@@ -1822,6 +1805,26 @@ def _strip_tool_markup_artifacts(text: str) -> str:
      return cleaned.strip()


+ def _strip_protocol_tag_only_lines(text: str) -> tuple[str, bool]:
+     if not isinstance(text, str):
+         return text, False
+
+     lines = text.splitlines()
+     kept_lines: list[str] = []
+     stripped = False
+     for line in lines:
+         if _BASH_PROTOCOL_LINE_RE.match(line):
+             stripped = True
+             continue
+         kept_lines.append(line)
+
+     if not stripped:
+         return text, False
+
+     cleaned = "\n".join(kept_lines).strip()
+     return cleaned, True
+
+
  def _sanitize_markup_value(value):
      if isinstance(value, str):
          cleaned = _strip_tool_markup_artifacts(value)
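The behavior of the new line filter is easiest to see on a few inputs. The sketch below reproduces the regex and function from the two hunks above as a standalone script (the leading underscores are dropped and the sample commands are made up for illustration).

```python
import re

# Same pattern as _BASH_PROTOCOL_LINE_RE in the hunk above.
BASH_PROTOCOL_LINE_RE = re.compile(
    r"^\s*</?(?:tool_call|tool_response|parameter(?:=[^>]*)?|function(?:=[^>]*)?|think)\s*>\s*$",
    re.IGNORECASE,
)


def strip_protocol_tag_only_lines(text):
    """Drop lines that consist solely of a protocol tag; report whether anything was removed."""
    if not isinstance(text, str):
        return text, False

    kept_lines = []
    stripped = False
    for line in text.splitlines():
        if BASH_PROTOCOL_LINE_RE.match(line):
            stripped = True
            continue
        kept_lines.append(line)

    if not stripped:
        return text, False
    return "\n".join(kept_lines).strip(), True


# Trailing tag-only lines are removed and the real command survives.
print(strip_protocol_tag_only_lines("pwd\n</function>\n<tool_call>"))

# A tag embedded in a real command line is left alone: only whole-line tags match.
print(strip_protocol_tag_only_lines("echo '<think>' > notes.txt"))

# A command made only of tag lines collapses to an empty string.
print(strip_protocol_tag_only_lines("</parameter>"))
```

The last case matters for the validation hunk further down: an empty command after stripping is reported as `invalid_tool_args` rather than being executed.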
@@ -2036,6 +2039,77 @@ def _repair_required_tool_args(
      return repaired_response, repaired_count


+ def _repair_bash_command_artifacts(openai_resp: dict) -> tuple[dict, int]:
+     if not _openai_has_tool_calls(openai_resp):
+         return openai_resp, 0
+
+     choice, message = _extract_openai_choice(openai_resp)
+     tool_calls = message.get("tool_calls") or []
+     if not tool_calls:
+         return openai_resp, 0
+
+     repaired_tool_calls = []
+     repaired_count = 0
+
+     for tool_call in tool_calls:
+         fn = tool_call.get("function") if isinstance(tool_call, dict) else {}
+         if not isinstance(fn, dict):
+             fn = {}
+
+         tool_name = str(fn.get("name", "")).strip().lower()
+         if tool_name != "bash":
+             repaired_tool_calls.append(tool_call)
+             continue
+
+         raw_args = fn.get("arguments", "{}")
+         if isinstance(raw_args, dict):
+             parsed_args = dict(raw_args)
+         else:
+             try:
+                 parsed_args = json.loads(str(raw_args))
+             except json.JSONDecodeError:
+                 repaired_tool_calls.append(tool_call)
+                 continue
+
+         if not isinstance(parsed_args, dict):
+             repaired_tool_calls.append(tool_call)
+             continue
+
+         command = parsed_args.get("command")
+         if not isinstance(command, str):
+             repaired_tool_calls.append(tool_call)
+             continue
+
+         cleaned_command, changed = _strip_protocol_tag_only_lines(command)
+         if not changed:
+             repaired_tool_calls.append(tool_call)
+             continue
+
+         parsed_args["command"] = cleaned_command
+         new_tool_call = dict(tool_call)
+         new_fn = dict(fn)
+         new_fn["arguments"] = json.dumps(parsed_args, separators=(",", ":"))
+         new_tool_call["function"] = new_fn
+         repaired_tool_calls.append(new_tool_call)
+         repaired_count += 1
+
+     if repaired_count == 0:
+         return openai_resp, 0
+
+     repaired_response = dict(openai_resp)
+     choices = list(openai_resp.get("choices") or [])
+     if not choices:
+         return openai_resp, 0
+
+     updated_choice = dict(choice)
+     updated_message = dict(message)
+     updated_message["tool_calls"] = repaired_tool_calls
+     updated_choice["message"] = updated_message
+     choices[0] = updated_choice
+     repaired_response["choices"] = choices
+     return repaired_response, repaired_count
+
+
  def _required_value_is_empty(value) -> bool:
      if value is None:
          return True
@@ -2132,6 +2206,22 @@ def _validate_tool_call_arguments(
              ),
          )

+     if tool_name.strip().lower() == "bash":
+         command = parsed.get("command")
+         if isinstance(command, str):
+             cleaned_command, had_protocol_lines = _strip_protocol_tag_only_lines(
+                 command
+             )
+             if had_protocol_lines and not cleaned_command:
+                 return ToolResponseIssue(
+                     kind="invalid_tool_args",
+                     reason="arguments for 'Bash' contained only protocol tag lines",
+                     retry_hint=(
+                         "Emit exactly one `Bash` tool call with a valid shell command in `arguments.command`. "
+                         "Do not include standalone XML/protocol tags."
+                     ),
+                 )
+
      if _contains_tool_markup(parsed):
          return ToolResponseIssue(
              kind="invalid_tool_args",
@@ -2345,20 +2435,34 @@ def _is_malformed_tool_response(openai_resp: dict, anthropic_body: dict) -> bool


  def _build_malformed_retry_body(
-     openai_body: dict, anthropic_body: dict, retry_hint: str = ""
+     openai_body: dict,
+     anthropic_body: dict,
+     retry_hint: str = "",
+     tool_choice: str = "required",
+     attempt: int = 1,
+     total_attempts: int = 1,
  ) -> dict:
      retry_body = dict(openai_body)
      retry_body["stream"] = False
-     retry_body["tool_choice"] = "required"
+     retry_body["tool_choice"] = tool_choice
      retry_body["temperature"] = PROXY_MALFORMED_TOOL_RETRY_TEMPERATURE

-     malformed_retry_instruction = {
-         "role": "user",
-         "content": (
+     if tool_choice == "required":
+         retry_instruction = (
              "Your previous response had invalid tool-call formatting. "
              "Respond with exactly one valid tool call using the provided tools. "
              "Do not output prose, markdown, XML tags, or schema snippets."
-         ),
+         )
+     else:
+         retry_instruction = (
+             "Your previous response had invalid tool-call formatting. "
+             "If a tool is needed, emit exactly one valid tool call with strict JSON arguments. "
+             "If no tool is needed for this turn, return concise plain text with no protocol tags."
+         )
+
+     malformed_retry_instruction = {
+         "role": "user",
+         "content": retry_instruction,
      }
      existing_messages = retry_body.get("messages")
      if isinstance(existing_messages, list) and existing_messages:
@@ -2383,17 +2487,47 @@ def _build_malformed_retry_body(

      if retry_hint:
          repair_prompt = (
-             "[TOOL CALL REPAIR]\n"
+             f"[TOOL CALL REPAIR attempt {attempt}/{total_attempts}]\n"
              f"{retry_hint}\n"
-             "Return exactly one valid tool call object and no explanatory prose."
+             "Return a valid response for this turn without protocol artifacts."
          )
          retry_messages = list(retry_body.get("messages", []))
-         retry_messages.append({"role": "system", "content": repair_prompt})
+         retry_messages.append({"role": "user", "content": repair_prompt})
          retry_body["messages"] = retry_messages

      return retry_body


+ def _retry_tool_choice_for_attempt(
+     required_tool_choice: bool, attempt: int, total_attempts: int
+ ) -> str:
+     if not required_tool_choice:
+         return "auto"
+     if total_attempts <= 1:
+         return "required"
+     return "auto" if attempt == total_attempts - 1 else "required"
+
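The retry ladder defined above can be tabulated for a few attempt budgets. The function body below is copied from the hunk (without the leading underscore); the `ladder` helper is an assumption added only to print the sequence, and `attempt` is zero-based, matching the `for attempt in range(attempts)` loop later in this diff.

```python
def retry_tool_choice_for_attempt(
    required_tool_choice: bool, attempt: int, total_attempts: int
) -> str:
    """Pick tool_choice for a zero-based retry attempt; the last of several attempts relaxes to auto."""
    if not required_tool_choice:
        return "auto"
    if total_attempts <= 1:
        return "required"
    return "auto" if attempt == total_attempts - 1 else "required"


def ladder(required_tool_choice: bool, total_attempts: int) -> list:
    """Illustrative helper: the tool_choice used on each attempt, in order."""
    return [
        retry_tool_choice_for_attempt(required_tool_choice, attempt, total_attempts)
        for attempt in range(total_attempts)
    ]


print(ladder(True, 3))   # strict, strict, then relaxed on the final attempt
print(ladder(True, 1))   # a single attempt stays strict
print(ladder(False, 2))  # requests that never required a tool stay on auto
```

With a budget of one attempt the ladder never relaxes, so the plain-text escape hatch on the final retry only applies when `PROXY_MALFORMED_TOOL_RETRY_MAX` is at least 2.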
+
+ def _build_safe_text_openai_response(openai_resp: dict, text: str) -> dict:
+     return {
+         "id": openai_resp.get("id", f"chatcmpl_{uuid.uuid4().hex[:12]}"),
+         "object": openai_resp.get("object", "chat.completion"),
+         "created": openai_resp.get("created", int(time.time())),
+         "model": openai_resp.get("model", "unknown"),
+         "choices": [
+             {
+                 "index": 0,
+                 "finish_reason": "stop",
+                 "message": {
+                     "role": "assistant",
+                     "content": text,
+                 },
+             }
+         ],
+         "usage": openai_resp.get("usage", {}),
+     }
+
+
  def _build_clean_guardrail_openai_response(openai_resp: dict) -> dict:
      return {
          "id": openai_resp.get("id", f"chatcmpl_{uuid.uuid4().hex[:12]}"),
@@ -2486,7 +2620,8 @@ async def _apply_malformed_tool_guardrail(
      working_resp, required_repairs = _repair_required_tool_args(
          working_resp, anthropic_body
      )
-     repair_count = markup_repairs + required_repairs
+     working_resp, bash_repairs = _repair_bash_command_artifacts(working_resp)
+     repair_count = markup_repairs + required_repairs + bash_repairs

      required_tool_choice = openai_body.get("tool_choice") == "required"
      has_tool_calls = _openai_has_tool_calls(working_resp)
@@ -2536,10 +2671,18 @@ async def _apply_malformed_tool_guardrail(
      attempts = max(0, PROXY_MALFORMED_TOOL_RETRY_MAX)
      current_issue = issue
      for attempt in range(attempts):
+         attempt_tool_choice = _retry_tool_choice_for_attempt(
+             required_tool_choice,
+             attempt,
+             attempts,
+         )
          retry_body = _build_malformed_retry_body(
              openai_body,
              anthropic_body,
              retry_hint=current_issue.retry_hint,
+             tool_choice=attempt_tool_choice,
+             attempt=attempt + 1,
+             total_attempts=attempts,
          )
          retry_resp = await client.post(
              f"{LLAMA_CPP_BASE}/chat/completions",
@@ -2563,7 +2706,14 @@ async def _apply_malformed_tool_guardrail(
          retry_working, retry_required_repairs = _repair_required_tool_args(
              retry_working, anthropic_body
          )
-         retry_repairs = retry_markup_repairs + retry_required_repairs
+         retry_working, retry_bash_repairs = _repair_bash_command_artifacts(
+             retry_working
+         )
+         retry_repairs = (
+             retry_markup_repairs + retry_required_repairs + retry_bash_repairs
+         )
+
+         working_resp = retry_working

          retry_has_tool_calls = _openai_has_tool_calls(retry_working)
          retry_required = retry_body.get("tool_choice") == "required"
@@ -2620,6 +2770,17 @@ async def _apply_malformed_tool_guardrail(
          monitor.invalid_tool_call_streak,
          monitor.required_tool_miss_streak,
      )
+
+     degraded_text = _sanitize_tool_call_apology_text(
+         _openai_message_text(working_resp)
+     ).strip()
+     if degraded_text and not _looks_malformed_tool_payload(degraded_text):
+         logger.warning(
+             "TOOL RESPONSE degrade: session=%s returning safe text fallback after retry exhaustion",
+             session_id,
+         )
+         return _build_safe_text_openai_response(working_resp, degraded_text)
+
      return _build_clean_guardrail_openai_response(working_resp)


@@ -2720,6 +2881,18 @@ def openai_to_anthropic_response(openai_resp: dict, model: str) -> dict:
              args = json.loads(fn.get("arguments", "{}"))
          except json.JSONDecodeError:
              args = {}
+         if fn.get("name", "").strip().lower() == "bash" and isinstance(args, dict):
+             command = args.get("command")
+             if isinstance(command, str):
+                 cleaned_command, had_protocol_lines = _strip_protocol_tag_only_lines(
+                     command
+                 )
+                 if had_protocol_lines:
+                     args = dict(args)
+                     args["command"] = cleaned_command
+                     logger.warning(
+                         "BASH SAFETY: stripped standalone protocol-tag lines from command before tool execution"
+                     )
          content.append(
              {
                  "type": "tool_use",
@@ -487,6 +487,33 @@ class TestMalformedToolGuardrail(unittest.TestCase):
              setattr(proxy, "PROXY_MALFORMED_TOOL_RETRY_TEMPERATURE", old_temp)
              setattr(proxy, "PROXY_DISABLE_THINKING_ON_TOOL_TURNS", old_disable)

+     def test_malformed_retry_body_appends_retry_hint_as_user_message(self):
+         openai_body = {
+             "model": "test",
+             "messages": [{"role": "user", "content": "fix"}],
+         }
+         anthropic_body = {
+             "tools": [{"name": "Read", "input_schema": {"type": "object"}}]
+         }
+
+         retry = proxy._build_malformed_retry_body(
+             openai_body,
+             anthropic_body,
+             retry_hint="Use strict JSON",
+             tool_choice="required",
+             attempt=1,
+             total_attempts=2,
+         )
+
+         self.assertEqual(retry["messages"][-1]["role"], "user")
+         self.assertIn("TOOL CALL REPAIR attempt 1/2", retry["messages"][-1]["content"])
+
+     def test_retry_ladder_releases_last_attempt_to_auto(self):
+         self.assertEqual(proxy._retry_tool_choice_for_attempt(True, 0, 3), "required")
+         self.assertEqual(proxy._retry_tool_choice_for_attempt(True, 1, 3), "required")
+         self.assertEqual(proxy._retry_tool_choice_for_attempt(True, 2, 3), "auto")
+         self.assertEqual(proxy._retry_tool_choice_for_attempt(False, 0, 3), "auto")
+
      def test_clean_guardrail_response_does_not_promise_future_tool_call(self):
          guardrail = proxy._build_clean_guardrail_openai_response(
              {"model": "test-model"}
@@ -772,6 +799,34 @@ class TestMalformedToolGuardrail(unittest.TestCase):
          )
          self.assertEqual(args["command"], "ls")

+     def test_bash_command_repair_strips_protocol_tag_only_lines(self):
+         openai_resp = {
+             "choices": [
+                 {
+                     "finish_reason": "tool_calls",
+                     "message": {
+                         "content": "",
+                         "tool_calls": [
+                             {
+                                 "id": "call_1",
+                                 "function": {
+                                     "name": "Bash",
+                                     "arguments": '{"command":"pwd\\n</function>\\n<tool_call>"}',
+                                 },
+                             }
+                         ],
+                     },
+                 }
+             ]
+         }
+
+         repaired, count = proxy._repair_bash_command_artifacts(openai_resp)
+         self.assertEqual(count, 1)
+         args = json.loads(
+             repaired["choices"][0]["message"]["tool_calls"][0]["function"]["arguments"]
+         )
+         self.assertEqual(args["command"], "pwd")
+
      def test_guardrail_accepts_repaired_markup_without_fallback(self):
          old_retry = getattr(proxy, "PROXY_MALFORMED_TOOL_RETRY_MAX")
          try:
@@ -1290,6 +1345,38 @@ class TestToolTurnControls(unittest.TestCase):
              setattr(proxy, "PROXY_ANALYSIS_ONLY_MIN_TOOLS", old_min_tools)
              setattr(proxy, "PROXY_ANALYSIS_ONLY_MAX_MESSAGES", old_max_messages)

+     def test_analysis_only_route_does_not_treat_implementation_as_action(self):
+         old_route = getattr(proxy, "PROXY_ANALYSIS_ONLY_ROUTE")
+         old_min_tools = getattr(proxy, "PROXY_ANALYSIS_ONLY_MIN_TOOLS")
+         old_max_messages = getattr(proxy, "PROXY_ANALYSIS_ONLY_MAX_MESSAGES")
+         try:
+             setattr(proxy, "PROXY_ANALYSIS_ONLY_ROUTE", True)
+             setattr(proxy, "PROXY_ANALYSIS_ONLY_MIN_TOOLS", 4)
+             setattr(proxy, "PROXY_ANALYSIS_ONLY_MAX_MESSAGES", 2)
+
+             body = {
+                 "messages": [
+                     {
+                         "role": "user",
+                         "content": "analyze implementation options and summarize tradeoffs",
+                     }
+                 ],
+                 "tools": [
+                     {"name": "Read", "input_schema": {"type": "object"}},
+                     {"name": "Edit", "input_schema": {"type": "object"}},
+                     {"name": "Write", "input_schema": {"type": "object"}},
+                     {"name": "Bash", "input_schema": {"type": "object"}},
+                 ],
+             }
+
+             updated, removed = proxy._maybe_route_analysis_without_tools(body)
+             self.assertEqual(removed, 4)
+             self.assertNotIn("tools", updated)
+         finally:
+             setattr(proxy, "PROXY_ANALYSIS_ONLY_ROUTE", old_route)
+             setattr(proxy, "PROXY_ANALYSIS_ONLY_MIN_TOOLS", old_min_tools)
+             setattr(proxy, "PROXY_ANALYSIS_ONLY_MAX_MESSAGES", old_max_messages)
+

  class TestSessionContaminationBreaker(unittest.TestCase):
      def test_contamination_breaker_trims_and_resets_streak(self):