@miller-tech/uap 1.15.5 → 1.15.6
- package/docs/INDEX.md +8 -0
- package/docs/blog/SPECULATIVE_DECODING_PRODUCTION_PLAYBOOK.md +139 -0
- package/docs/pr/PR_SPECULATIVE_DOCS_TEMPLATE.md +146 -0
- package/package.json +1 -1
- package/templates/hooks/pre-tool-use-bash.sh +9 -0
- package/tools/agents/scripts/anthropic_proxy.py +225 -52
- package/tools/agents/tests/test_anthropic_proxy_streaming.py +87 -0
package/docs/INDEX.md
CHANGED
|
@@ -47,6 +47,14 @@
|
|
|
47
47
|
- [Token Optimization](benchmarks/TOKEN_OPTIMIZATION.md) -- Per-feature token savings analysis
|
|
48
48
|
- [Accuracy Analysis](benchmarks/ACCURACY_ANALYSIS.md) -- Internal vs Terminal-Bench comparison
|
|
49
49
|
|
|
50
|
+
## Blog
|
|
51
|
+
|
|
52
|
+
- [Speculative Decoding Production Playbook](blog/SPECULATIVE_DECODING_PRODUCTION_PLAYBOOK.md) -- Long-form narrative on throughput gains, failure modes, and stable profiles
|
|
53
|
+
|
|
54
|
+
## PR Templates
|
|
55
|
+
|
|
56
|
+
- [Speculative Docs PR Template](pr/PR_SPECULATIVE_DOCS_TEMPLATE.md) -- Ready-to-submit PR copy, checklist, and reviewer guidance
|
|
57
|
+
|
|
50
58
|
## Research
|
|
51
59
|
|
|
52
60
|
- [Memory Systems Comparison](research/MEMORY_SYSTEMS_COMPARISON.md) -- MemGPT, LangGraph, Mem0, A-MEM analysis
|
|
package/docs/blog/SPECULATIVE_DECODING_PRODUCTION_PLAYBOOK.md
ADDED
|
@@ -0,0 +1,139 @@
|
|
|
1
|
+
# Speculative Decoding in llama.cpp: Real Speedups Without Breaking Agentic Reliability
|
|
2
|
+
|
|
3
|
+
Speculative decoding can look like free performance until it meets long-context, tool-heavy agent workflows. This write-up covers which changes improved throughput, what regressed, and which operational changes restored stability across `llama.cpp` and an Anthropic-compatible proxy.
|
|
4
|
+
|
|
5
|
+
## Why This Matters
|
|
6
|
+
|
|
7
|
+
Speculative decoding is strongest when generated text has predictable structure or repetition. But in real coding sessions, throughput alone is not enough: the system must preserve clean output, reliable tool-call behavior, and long-session continuity.
|
|
8
|
+
|
|
9
|
+
In practice, the following layers form a single runtime boundary and have to be tuned together:
|
|
10
|
+
|
|
11
|
+
- `llama.cpp` speculative behavior
|
|
12
|
+
- parameter profile and rollback mode
|
|
13
|
+
- proxy streaming/fallback policies
|
|
14
|
+
- agentic tool-loop control behavior
|
|
15
|
+
|
|
16
|
+
## Baseline Environment
|
|
17
|
+
|
|
18
|
+
- Runtime: `llama.cpp` + CUDA + Qwen3.5 GGUF
|
|
19
|
+
- Context window: `262144`
|
|
20
|
+
- Spec type: `ngram-cache`
|
|
21
|
+
- Gateway: Anthropic-compatible proxy forwarding to OpenAI-compatible server
|
|
22
|
+
|
|
23
|
+
Related runbooks:
|
|
24
|
+
|
|
25
|
+
- `docs/deployment/UAP_LLAMA_ANTHROPIC_PROXY_BOOTSTRAP.md`
|
|
26
|
+
- `docs/benchmarks/SPECULATIVE_DECODING_JOURNEY_2026-03.md`
|
|
27
|
+
|
|
28
|
+
## What We Observed
|
|
29
|
+
|
|
30
|
+
### Throughput Gains Were Workload-Dependent
|
|
31
|
+
|
|
32
|
+
Speculation did not uniformly improve all turns. Coding/tool turns often saw small uplift; repetition-heavy turns saw large gains.
|
|
33
|
+
|
|
34
|
+
Representative 27B snapshot (`ctx=262144`):
|
|
35
|
+
|
|
36
|
+
- No spec: ~43 tok/s coding, ~41 tok/s pattern
|
|
37
|
+
- Balanced spec (`draft-max`/`draft-min`/`draft-p-min` = `12/2/0.80`): ~43 tok/s coding, ~102 tok/s pattern
|
|
38
|
+
|
|
39
|
+
Takeaway: benchmark by workload class, not one blended average.
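
To make the takeaway concrete, here is a minimal sketch that computes the speedup per workload class from the snapshot figures above and compares it with a single blended ratio. The function names are illustrative and not part of any tool in this package.

```python
# Throughput figures (tok/s) copied from the 27B snapshot above.
SNAPSHOT = {
    "no_spec": {"coding": 43.0, "pattern": 41.0},
    "balanced_spec": {"coding": 43.0, "pattern": 102.0},
}


def speedup_by_class(baseline, candidate):
    """Return the candidate/baseline throughput ratio for each workload class."""
    return {name: candidate[name] / baseline[name] for name in baseline}


def blended_speedup(baseline, candidate):
    """Ratio of unweighted mean throughputs; this hides per-class differences."""
    base_mean = sum(baseline.values()) / len(baseline)
    cand_mean = sum(candidate.values()) / len(candidate)
    return cand_mean / base_mean


per_class = speedup_by_class(SNAPSHOT["no_spec"], SNAPSHOT["balanced_spec"])
blended = blended_speedup(SNAPSHOT["no_spec"], SNAPSHOT["balanced_spec"])

for name, ratio in per_class.items():
    print(f"{name}: {ratio:.2f}x")
print(f"blended: {blended:.2f}x")
```

With these inputs, coding turns stay at 1.00x and pattern turns reach roughly 2.49x, while the blended figure of about 1.73x describes neither class.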
|
|
40
|
+
|
|
41
|
+
### Newer Lineage Produced Noisier Warnings
|
|
42
|
+
|
|
43
|
+
Under identical settings, newer builds emitted warnings such as:
|
|
44
|
+
|
|
45
|
+
- `find_slot: non-consecutive token position`
|
|
46
|
+
|
|
47
|
+
This correlated with lower effective throughput and less stable long-session behavior in A/B comparisons.
|
|
48
|
+
|
|
49
|
+
### Proxy Fallback Could Leak Malformed Internal Text
|
|
50
|
+
|
|
51
|
+
When the upstream server returned reasoning-heavy output with no visible text, a weak fallback policy could expose malformed fragments (pseudo-tool text, schema/policy echoes) to end users.
|
|
52
|
+
|
|
53
|
+
Patterns included:
|
|
54
|
+
|
|
55
|
+
- `</parameter>`-style fragments
|
|
56
|
+
- non-JSON pseudo-tool content
|
|
57
|
+
- repetitive policy-like loops with no valid `tool_calls`
|
|
58
|
+
|
|
59
|
+
## Immediate Fixes That Worked
|
|
60
|
+
|
|
61
|
+
### Safe Production Defaults
|
|
62
|
+
|
|
63
|
+
The highest-leverage stabilization profile was:
|
|
64
|
+
|
|
65
|
+
- `PROXY_STREAM_REASONING_FALLBACK=off`
|
|
66
|
+
- `PROXY_MALFORMED_TOOL_GUARDRAIL=on`
|
|
67
|
+
- `PROXY_MALFORMED_TOOL_STREAM_STRICT=on`
|
|
68
|
+
- `PROXY_MAX_TOKENS_FLOOR=4096`
|
|
69
|
+
|
|
70
|
+
Why:
|
|
71
|
+
|
|
72
|
+
- `PROXY_STREAM_REASONING_FALLBACK=off` suppresses leakage of malformed reasoning text.
|
|
73
|
+
- the malformed-tool guardrail plus the strict stream path recover malformed turns that combine streaming with tool calls.
|
|
74
|
+
- a lower token floor shortens long failing turns while leaving normal turns unaffected.
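
As a rough illustration of how an operator might check a deployment against this profile, the sketch below reads the four variables and reports any that differ from the values listed above. The variable names come from this section. The `on`/`off` parsing, the class, and the helper names are assumptions, and the proxy's own parsing and built-in defaults may differ.

```python
import os
from dataclasses import dataclass


def _flag(value: str) -> bool:
    # The accepted "on" spellings are an assumption; the real proxy may parse differently.
    return value.strip().lower() in {"on", "1", "true", "yes"}


@dataclass(frozen=True)
class ProxySafetyProfile:
    stream_reasoning_fallback: bool
    malformed_tool_guardrail: bool
    malformed_tool_stream_strict: bool
    max_tokens_floor: int


# The stabilization profile listed above.
SAFE = ProxySafetyProfile(False, True, True, 4096)


def load_profile(env=None) -> ProxySafetyProfile:
    """Read the four settings, falling back to the values listed above."""
    env = os.environ if env is None else env
    return ProxySafetyProfile(
        stream_reasoning_fallback=_flag(env.get("PROXY_STREAM_REASONING_FALLBACK", "off")),
        malformed_tool_guardrail=_flag(env.get("PROXY_MALFORMED_TOOL_GUARDRAIL", "on")),
        malformed_tool_stream_strict=_flag(env.get("PROXY_MALFORMED_TOOL_STREAM_STRICT", "on")),
        max_tokens_floor=int(env.get("PROXY_MAX_TOKENS_FLOOR", "4096")),
    )


def deviations(profile: ProxySafetyProfile) -> list[str]:
    """Names of the fields that differ from the safe profile."""
    fields = (
        "stream_reasoning_fallback",
        "malformed_tool_guardrail",
        "malformed_tool_stream_strict",
        "max_tokens_floor",
    )
    return [name for name in fields if getattr(profile, name) != getattr(SAFE, name)]


# Example: an environment that re-enables the reasoning fallback.
print(deviations(load_profile({"PROXY_STREAM_REASONING_FALLBACK": "on"})))
```

Passing a plain dictionary instead of `os.environ` keeps the check testable without touching the real environment.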
|
|
75
|
+
|
|
76
|
+
### Balanced Speculative Profile for Daily Agentic Work
|
|
77
|
+
|
|
78
|
+
- `spec-type=ngram-cache`
|
|
79
|
+
- `draft-max=12`
|
|
80
|
+
- `draft-min=2`
|
|
81
|
+
- `draft-p-min=0.80`
|
|
82
|
+
- rollback mode: `strict`
|
|
83
|
+
|
|
84
|
+
This profile is less aggressive than max-throughput tuning, but significantly safer for long coding sessions.
|
|
85
|
+
|
|
86
|
+
## Benchmark Method That Prevents False Wins
|
|
87
|
+
|
|
88
|
+
A useful speculative benchmark protocol should include:
|
|
89
|
+
|
|
90
|
+
1. Prompt classes
|
|
91
|
+
- coding/tool-call tasks
|
|
92
|
+
- repetition/pattern-heavy tasks
|
|
93
|
+
2. Repeats and warmup
|
|
94
|
+
- fixed run count
|
|
95
|
+
- warmup policy
|
|
96
|
+
- p50/p95 latency, not only mean tok/s
|
|
97
|
+
3. Required metrics
|
|
98
|
+
- decode throughput (`eval tok/s`)
|
|
99
|
+
- prefill throughput (`prompt eval tok/s`)
|
|
100
|
+
- acceptance/rejection behavior
|
|
101
|
+
- malformed-turn incidence
|
|
102
|
+
- stop reason distribution
|
|
103
|
+
4. Profile matrix
|
|
104
|
+
- no-spec baseline
|
|
105
|
+
- aggressive profile
|
|
106
|
+
- balanced profile
|
|
107
|
+
|
|
108
|
+
Without this, speculative tuning can appear faster while degrading real agentic reliability.
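
The protocol above can be sketched as a small summarizer that groups runs by profile and prompt class and reports latency percentiles alongside mean decode throughput and malformed-turn incidence. The record fields (`profile`, `prompt_class`, `latency_s`, `eval_tok_s`, `malformed`) are hypothetical names chosen for this example, and the sample rows are placeholders rather than measurements.

```python
import math
import statistics
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(runs):
    """Group runs by (profile, prompt_class) and compute the required metrics."""
    groups = defaultdict(list)
    for run in runs:
        groups[(run["profile"], run["prompt_class"])].append(run)

    summary = {}
    for key, rows in groups.items():
        latencies = [row["latency_s"] for row in rows]
        summary[key] = {
            "runs": len(rows),
            "p50_latency_s": percentile(latencies, 50),
            "p95_latency_s": percentile(latencies, 95),
            "mean_eval_tok_s": statistics.fmean(row["eval_tok_s"] for row in rows),
            "malformed_rate": sum(row["malformed"] for row in rows) / len(rows),
        }
    return summary


# Placeholder rows, for shape only.
sample_runs = [
    {"profile": "balanced", "prompt_class": "coding", "latency_s": 1.0, "eval_tok_s": 40.0, "malformed": False},
    {"profile": "balanced", "prompt_class": "coding", "latency_s": 2.0, "eval_tok_s": 42.0, "malformed": False},
    {"profile": "balanced", "prompt_class": "coding", "latency_s": 3.0, "eval_tok_s": 44.0, "malformed": True},
    {"profile": "balanced", "prompt_class": "coding", "latency_s": 4.0, "eval_tok_s": 46.0, "malformed": False},
]

for key, metrics in summarize(sample_runs).items():
    print(key, metrics)
```

Keeping the prompt class in the grouping key prevents a repetition-heavy class from masking a regression in coding turns.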
|
|
109
|
+
|
|
110
|
+
## Practical Playbook
|
|
111
|
+
|
|
112
|
+
### Use for Daily Agentic Coding
|
|
113
|
+
|
|
114
|
+
- balanced `ngram-cache` (`12/2/0.80`)
|
|
115
|
+
- strict malformed-tool stream guardrail
|
|
116
|
+
- reasoning fallback disabled
|
|
117
|
+
- reduced token floor (`4096`)
|
|
118
|
+
|
|
119
|
+
### Use for Max Throughput Exploration
|
|
120
|
+
|
|
121
|
+
- hybrid rollback
|
|
122
|
+
- larger draft windows
|
|
123
|
+
- tightly scoped benchmark prompts
|
|
124
|
+
|
|
125
|
+
Promote an exploratory profile to daily use only if a long-session tool-loop soak test remains stable.
|
|
126
|
+
|
|
127
|
+
## What llama.cpp Docs Should Add Next
|
|
128
|
+
|
|
129
|
+
Mechanics are documented well today. The next improvement is operational clarity:
|
|
130
|
+
|
|
131
|
+
- implementation selection matrix by workload
|
|
132
|
+
- troubleshooting by signature (`find_slot`, rollback spikes, acceptance collapse)
|
|
133
|
+
- reproducible benchmark protocol and output schema
|
|
134
|
+
- rollout/canary/rollback criteria
|
|
135
|
+
- proxy compatibility appendix for stream+tools environments
|
|
136
|
+
|
|
137
|
+
## Final Takeaway
|
|
138
|
+
|
|
139
|
+
Speculative decoding in production is a systems problem, not just a decoding primitive. Treating runtime + transport + tool-loop behavior as one boundary is what makes speculative speedups both real and reliable.
|
|
package/docs/pr/PR_SPECULATIVE_DOCS_TEMPLATE.md
ADDED
|
@@ -0,0 +1,146 @@
|
|
|
1
|
+
## Title
|
|
2
|
+
|
|
3
|
+
docs: add speculative decoding production playbook and agentic compatibility guidance
|
|
4
|
+
|
|
5
|
+
## Context
|
|
6
|
+
|
|
7
|
+
`docs/speculative.md` explains speculative mechanisms and flags, but production operators also need:
|
|
8
|
+
|
|
9
|
+
- workload-driven profile selection,
|
|
10
|
+
- reproducible benchmarking protocol,
|
|
11
|
+
- signature-based regression triage,
|
|
12
|
+
- guidance for stream+tools agentic environments.
|
|
13
|
+
|
|
14
|
+
This PR adds operational documentation to reduce drift between benchmark wins and real-session behavior.
|
|
15
|
+
|
|
16
|
+
## Changes
|
|
17
|
+
|
|
18
|
+
### Add new guide
|
|
19
|
+
|
|
20
|
+
- New: `docs/speculative-production.md`
|
|
21
|
+
- implementation matrix:
|
|
22
|
+
- `draft`
|
|
23
|
+
- `ngram-cache`
|
|
24
|
+
- `ngram-simple`
|
|
25
|
+
- `ngram-map-k`
|
|
26
|
+
- `ngram-map-k4v`
|
|
27
|
+
- `ngram-mod`
|
|
28
|
+
- decision tree by workload (coding, repetitive transform, mixed)
|
|
29
|
+
- benchmark protocol (run counts, warmup, prompt classes, metrics)
|
|
30
|
+
- troubleshooting by signature:
|
|
31
|
+
- `find_slot: non-consecutive token position`
|
|
32
|
+
- low acceptance + high rollback
|
|
33
|
+
- throughput collapse after commit switch
|
|
34
|
+
- rollout rules (canary, promotion threshold, rollback triggers)
|
|
35
|
+
|
|
36
|
+
### Update existing speculative docs
|
|
37
|
+
|
|
38
|
+
- Update `docs/speculative.md`:
|
|
39
|
+
- add link to production guide
|
|
40
|
+
- add "how to interpret statistics in practice"
|
|
41
|
+
- add "workload sensitivity and reproducibility notes"
|
|
42
|
+
|
|
43
|
+
### Add compatibility appendix
|
|
44
|
+
|
|
45
|
+
- New appendix (or linked page): stream+tools compatibility for proxy-mediated agentic flows
|
|
46
|
+
- fallback policy guidance (`off` default for production)
|
|
47
|
+
- malformed stream/tool guardrail behavior
|
|
48
|
+
- max token floor and prune target recommendations
|
|
49
|
+
|
|
50
|
+
## Why
|
|
51
|
+
|
|
52
|
+
Speculative decoding quality in agentic coding depends on end-to-end behavior, including transport and stream tool-loop handling. This documentation closes that gap and provides a repeatable operator path.
|
|
53
|
+
|
|
54
|
+
## Validation Plan
|
|
55
|
+
|
|
56
|
+
- Verify all CLI flags/options in examples against current `llama-server`.
|
|
57
|
+
- Verify all linked scripts/docs paths resolve.
|
|
58
|
+
- Include one benchmark table with:
|
|
59
|
+
- decode/prefill throughput
|
|
60
|
+
- acceptance indicators
|
|
61
|
+
- latency percentiles
|
|
62
|
+
- workload class labels
|
|
63
|
+
|
|
64
|
+
## Risks
|
|
65
|
+
|
|
66
|
+
- Overfitting recommendations to one model/hardware class.
|
|
67
|
+
- Treating proxy behavior as universally required.
|
|
68
|
+
|
|
69
|
+
## Mitigations
|
|
70
|
+
|
|
71
|
+
- Mark all profile recommendations as workload/hardware sensitive.
|
|
72
|
+
- Separate "safe baseline" from "aggressive benchmark-only" profiles.
|
|
73
|
+
- Require local A/B validation before rollout.
|
|
74
|
+
|
|
75
|
+
## Out of Scope
|
|
76
|
+
|
|
77
|
+
- Runtime code changes
|
|
78
|
+
- Kernel-level speculative optimization changes
|
|
79
|
+
- Proxy implementation changes (docs-only PR)
|
|
80
|
+
|
|
81
|
+
## Follow-ups
|
|
82
|
+
|
|
83
|
+
1. Add nightly speculative regression harness.
|
|
84
|
+
2. Publish benchmark JSON schema for machine comparison.
|
|
85
|
+
3. Add commit-lineage tracking for performance regressions.
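
One possible shape for the machine-readable benchmark record mentioned in follow-up 2 is sketched below. Every field name here is hypothetical and exists only to show how the metrics named in this template could be serialized for comparison. The values are placeholders, not results.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class BenchmarkRecord:
    """One row of a benchmark table; all field names are illustrative."""

    profile: str              # e.g. "no-spec", "balanced", "aggressive"
    prompt_class: str         # workload class label
    commit: str               # commit lineage reference for the runtime build
    runs: int
    decode_tok_s: float
    prefill_tok_s: float
    p50_latency_s: float
    p95_latency_s: float
    draft_acceptance_rate: Optional[float] = None
    malformed_turns: int = 0
    stop_reasons: dict = field(default_factory=dict)

    def to_json(self) -> str:
        # Sorted keys keep the output stable for diffing between runs.
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder values, for shape only.
record = BenchmarkRecord(
    profile="balanced",
    prompt_class="coding",
    commit="<commit-sha>",
    runs=5,
    decode_tok_s=0.0,
    prefill_tok_s=0.0,
    p50_latency_s=0.0,
    p95_latency_s=0.0,
    stop_reasons={"end_turn": 5},
)
print(record.to_json())
```

Carrying the commit and workload class in every record is what would let a nightly harness compare like with like.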
|
|
86
|
+
|
|
87
|
+
---
|
|
88
|
+
|
|
89
|
+
## Ready-to-Submit GitHub PR Body
|
|
90
|
+
|
|
91
|
+
### Summary
|
|
92
|
+
|
|
93
|
+
This docs PR adds a production-oriented speculative decoding playbook for llama.cpp users running real multi-turn workloads (especially agentic/tool-call scenarios). It complements existing mechanism-level docs with actionable tuning, troubleshooting, and rollout guidance.
|
|
94
|
+
|
|
95
|
+
### What Changed
|
|
96
|
+
|
|
97
|
+
- Added `docs/speculative-production.md` (new operational guide)
|
|
98
|
+
- implementation selection matrix
|
|
99
|
+
- workload-based decision tree
|
|
100
|
+
- benchmark protocol + required metrics
|
|
101
|
+
- troubleshooting by real log signatures
|
|
102
|
+
- canary/rollback rollout guidance
|
|
103
|
+
- Updated `docs/speculative.md`
|
|
104
|
+
- links to production guide
|
|
105
|
+
- practical stats interpretation notes
|
|
106
|
+
- workload sensitivity notes
|
|
107
|
+
- Added/linked "agentic stream+tools compatibility" appendix
|
|
108
|
+
- fallback policy defaults
|
|
109
|
+
- malformed stream/tool guardrails
|
|
110
|
+
- token-floor/prune guidance
|
|
111
|
+
|
|
112
|
+
### Why
|
|
113
|
+
|
|
114
|
+
Current docs describe speculative decoding internals clearly, but production operators need a reproducible way to:
|
|
115
|
+
|
|
116
|
+
- choose stable profiles by workload,
|
|
117
|
+
- detect/triage regressions quickly,
|
|
118
|
+
- avoid benchmark-only wins that fail in long sessions.
|
|
119
|
+
|
|
120
|
+
### Reviewer Guide
|
|
121
|
+
|
|
122
|
+
Please focus review on:
|
|
123
|
+
|
|
124
|
+
1. Accuracy of CLI flags and option names.
|
|
125
|
+
2. Correctness of troubleshooting signatures and interpretations.
|
|
126
|
+
3. Clarity of benchmark protocol (can another team reproduce it?).
|
|
127
|
+
4. Whether safe-vs-aggressive profile separation is clear enough.
|
|
128
|
+
|
|
129
|
+
### Validation
|
|
130
|
+
|
|
131
|
+
- [ ] Command examples verified against current `llama-server --help`
|
|
132
|
+
- [ ] Linked docs/scripts paths validated
|
|
133
|
+
- [ ] Benchmark table includes workload class labels
|
|
134
|
+
- [ ] Metrics include decode/prefill throughput + latency percentile view
|
|
135
|
+
- [ ] No runtime behavior claims without explicit caveats
|
|
136
|
+
|
|
137
|
+
### Risks / Caveats
|
|
138
|
+
|
|
139
|
+
- Recommendations are model/hardware/workload dependent.
|
|
140
|
+
- Guidance is operational, not a substitute for local A/B testing.
|
|
141
|
+
|
|
142
|
+
### Follow-ups
|
|
143
|
+
|
|
144
|
+
- [ ] Add nightly regression harness for speculative profiles
|
|
145
|
+
- [ ] Publish machine-readable benchmark schema
|
|
146
|
+
- [ ] Add commit lineage references in benchmark artifacts
|
package/package.json
CHANGED
|
package/templates/hooks/pre-tool-use-bash.sh
CHANGED
|
@@ -22,6 +22,15 @@ if [ -z "$CMD" ]; then
|
|
|
22
22
|
exit 0
|
|
23
23
|
fi
|
|
24
24
|
|
|
25
|
+
# ─── Protocol Tag Injection Guard ────────────────────────────────
|
|
26
|
+
# Reject Bash payloads that still contain standalone protocol tag lines.
|
|
27
|
+
# These fragments can appear after malformed tool-call rendering and must
|
|
28
|
+
# never reach shell evaluation.
|
|
29
|
+
if printf '%s\n' "$CMD" | grep -qE '^\s*</?(tool_call|tool_response|parameter(=[^>]*)?|function(=[^>]*)?|think)\s*>\s*$'; then
|
|
30
|
+
echo "BLOCKED [bash-safety]: Command contains standalone XML/protocol tag lines. Remove tool-call tag artifacts before execution." >&2
|
|
31
|
+
exit 2
|
|
32
|
+
fi
|
|
33
|
+
|
|
25
34
|
# ─── IaC Pipeline Enforcement ───────────────────────────────────
|
|
26
35
|
# Block local terraform apply/destroy (policies/iac-pipeline-enforcement.md)
|
|
27
36
|
# Allow: terraform fmt, validate, init, plan, output, show, state list, graph
|
|
package/tools/agents/scripts/anthropic_proxy.py
CHANGED
|
@@ -1044,49 +1044,27 @@ def _is_analysis_only_prompt(text: str) -> bool:
|
|
|
1044
1044
|
if not text:
|
|
1045
1045
|
return False
|
|
1046
1046
|
|
|
1047
|
-
|
|
1048
|
-
|
|
1049
|
-
|
|
1050
|
-
|
|
1051
|
-
|
|
1052
|
-
|
|
1053
|
-
"plan",
|
|
1054
|
-
"recommend",
|
|
1055
|
-
"assess",
|
|
1056
|
-
"compare",
|
|
1057
|
-
"investigate",
|
|
1058
|
-
"diagnose",
|
|
1047
|
+
normalized = text.lower()
|
|
1048
|
+
has_analysis = bool(
|
|
1049
|
+
re.search(
|
|
1050
|
+
r"\b(?:analy(?:ze|zing|sis)?|review|audit|summar(?:y|ize|ized|ise)|explain|plan|recommend|assess|compare|investigate|diagnos(?:e|is))\b",
|
|
1051
|
+
normalized,
|
|
1052
|
+
)
|
|
1059
1053
|
)
|
|
1060
|
-
|
|
1061
|
-
|
|
1062
|
-
|
|
1063
|
-
|
|
1064
|
-
|
|
1065
|
-
|
|
1066
|
-
|
|
1067
|
-
|
|
1068
|
-
|
|
1069
|
-
|
|
1070
|
-
|
|
1071
|
-
|
|
1072
|
-
|
|
1073
|
-
"call tool",
|
|
1074
|
-
"apply",
|
|
1075
|
-
"commit",
|
|
1076
|
-
"push",
|
|
1077
|
-
"merge",
|
|
1078
|
-
"publish",
|
|
1079
|
-
"deploy",
|
|
1080
|
-
"test",
|
|
1081
|
-
"build",
|
|
1082
|
-
"refactor",
|
|
1083
|
-
"rename",
|
|
1084
|
-
"delete",
|
|
1085
|
-
"install",
|
|
1054
|
+
has_action = bool(
|
|
1055
|
+
re.search(
|
|
1056
|
+
r"\b(?:fix|edit|write|create|implement|patch|change|update|run|execute|apply|commit|push|merge|publish|deploy|test|build|refactor|rename|delete|install)\b",
|
|
1057
|
+
normalized,
|
|
1058
|
+
)
|
|
1059
|
+
) or any(
|
|
1060
|
+
phrase in normalized
|
|
1061
|
+
for phrase in (
|
|
1062
|
+
"use tool",
|
|
1063
|
+
"call tool",
|
|
1064
|
+
"run command",
|
|
1065
|
+
"execute command",
|
|
1066
|
+
)
|
|
1086
1067
|
)
|
|
1087
|
-
|
|
1088
|
-
has_analysis = any(marker in text for marker in analysis_markers)
|
|
1089
|
-
has_action = any(marker in text for marker in action_markers)
|
|
1090
1068
|
return has_analysis and not has_action
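
The practical effect of moving from substring checks to word-boundary regular expressions can be seen in a standalone sketch. The two patterns and the phrase list are copied from the added lines above. The function name without the leading underscore and the sample prompts are illustrative.

```python
import re

# Patterns copied from the added lines of _is_analysis_only_prompt.
ANALYSIS_RE = re.compile(
    r"\b(?:analy(?:ze|zing|sis)?|review|audit|summar(?:y|ize|ized|ise)|explain|plan|recommend|assess|compare|investigate|diagnos(?:e|is))\b"
)
ACTION_RE = re.compile(
    r"\b(?:fix|edit|write|create|implement|patch|change|update|run|execute|apply|commit|push|merge|publish|deploy|test|build|refactor|rename|delete|install)\b"
)
ACTION_PHRASES = ("use tool", "call tool", "run command", "execute command")


def is_analysis_only_prompt(text: str) -> bool:
    """True when the prompt asks for analysis and names no action verb."""
    if not text:
        return False
    normalized = text.lower()
    has_analysis = bool(ANALYSIS_RE.search(normalized))
    has_action = bool(ACTION_RE.search(normalized)) or any(
        phrase in normalized for phrase in ACTION_PHRASES
    )
    return has_analysis and not has_action


prompts = [
    "analyze implementation options and summarize tradeoffs",  # "implementation" is not "implement"
    "summarize the latest results",                             # "latest" is not "test"
    "review the patch and fix the bug",                         # real action verbs
]
for prompt in prompts:
    print(is_analysis_only_prompt(prompt), "-", prompt)
```

Under the earlier `marker in text` checks, a word such as "latest" would have matched the `test` marker. With `\b` anchors only whole words count, which is the behavior the new `test_analysis_only_route_does_not_treat_implementation_as_action` test exercises.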
|
|
1091
1069
|
|
|
1092
1070
|
|
|
@@ -1793,6 +1771,11 @@ _TOOL_ARG_MARKERS = (
|
|
|
1793
1771
|
"</think>",
|
|
1794
1772
|
)
|
|
1795
1773
|
|
|
1774
|
+
_BASH_PROTOCOL_LINE_RE = re.compile(
|
|
1775
|
+
r"^\s*</?(?:tool_call|tool_response|parameter(?:=[^>]*)?|function(?:=[^>]*)?|think)\s*>\s*$",
|
|
1776
|
+
re.IGNORECASE,
|
|
1777
|
+
)
|
|
1778
|
+
|
|
1796
1779
|
|
|
1797
1780
|
def _iter_string_leaves(value):
|
|
1798
1781
|
if isinstance(value, str):
|
|
@@ -1822,6 +1805,26 @@ def _strip_tool_markup_artifacts(text: str) -> str:
|
|
|
1822
1805
|
return cleaned.strip()
|
|
1823
1806
|
|
|
1824
1807
|
|
|
1808
|
+
def _strip_protocol_tag_only_lines(text: str) -> tuple[str, bool]:
|
|
1809
|
+
if not isinstance(text, str):
|
|
1810
|
+
return text, False
|
|
1811
|
+
|
|
1812
|
+
lines = text.splitlines()
|
|
1813
|
+
kept_lines: list[str] = []
|
|
1814
|
+
stripped = False
|
|
1815
|
+
for line in lines:
|
|
1816
|
+
if _BASH_PROTOCOL_LINE_RE.match(line):
|
|
1817
|
+
stripped = True
|
|
1818
|
+
continue
|
|
1819
|
+
kept_lines.append(line)
|
|
1820
|
+
|
|
1821
|
+
if not stripped:
|
|
1822
|
+
return text, False
|
|
1823
|
+
|
|
1824
|
+
cleaned = "\n".join(kept_lines).strip()
|
|
1825
|
+
return cleaned, True
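
The behavior of the helper above is easiest to see on a few inputs. The sketch below reuses the same regular expression as `_BASH_PROTOCOL_LINE_RE` in a condensed, standalone function. The public-style names and the sample commands are illustrative.

```python
import re

# Same pattern as _BASH_PROTOCOL_LINE_RE in the diff above.
PROTOCOL_LINE_RE = re.compile(
    r"^\s*</?(?:tool_call|tool_response|parameter(?:=[^>]*)?|function(?:=[^>]*)?|think)\s*>\s*$",
    re.IGNORECASE,
)


def strip_protocol_tag_only_lines(text: str) -> tuple[str, bool]:
    """Drop lines that consist solely of a protocol tag; report whether any were dropped."""
    lines = text.splitlines()
    kept = [line for line in lines if not PROTOCOL_LINE_RE.match(line)]
    if len(kept) == len(lines):
        return text, False
    return "\n".join(kept).strip(), True


samples = [
    "pwd\n</function>\n<tool_call>",   # trailing artifacts after a real command
    "echo '<think>' > out.txt",        # tag inside a command line is left alone
    "<parameter=command>",             # nothing but an artifact
]
for sample in samples:
    print(repr(sample), "->", strip_protocol_tag_only_lines(sample))
```

Only whole-line tags are removed, so a command that merely mentions a tag is untouched. An input that reduces to an empty string is the case the validation change later in this diff reports as `invalid_tool_args`.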
|
|
1826
|
+
|
|
1827
|
+
|
|
1825
1828
|
def _sanitize_markup_value(value):
|
|
1826
1829
|
if isinstance(value, str):
|
|
1827
1830
|
cleaned = _strip_tool_markup_artifacts(value)
|
|
@@ -2036,6 +2039,77 @@ def _repair_required_tool_args(
|
|
|
2036
2039
|
return repaired_response, repaired_count
|
|
2037
2040
|
|
|
2038
2041
|
|
|
2042
|
+
def _repair_bash_command_artifacts(openai_resp: dict) -> tuple[dict, int]:
|
|
2043
|
+
if not _openai_has_tool_calls(openai_resp):
|
|
2044
|
+
return openai_resp, 0
|
|
2045
|
+
|
|
2046
|
+
choice, message = _extract_openai_choice(openai_resp)
|
|
2047
|
+
tool_calls = message.get("tool_calls") or []
|
|
2048
|
+
if not tool_calls:
|
|
2049
|
+
return openai_resp, 0
|
|
2050
|
+
|
|
2051
|
+
repaired_tool_calls = []
|
|
2052
|
+
repaired_count = 0
|
|
2053
|
+
|
|
2054
|
+
for tool_call in tool_calls:
|
|
2055
|
+
fn = tool_call.get("function") if isinstance(tool_call, dict) else {}
|
|
2056
|
+
if not isinstance(fn, dict):
|
|
2057
|
+
fn = {}
|
|
2058
|
+
|
|
2059
|
+
tool_name = str(fn.get("name", "")).strip().lower()
|
|
2060
|
+
if tool_name != "bash":
|
|
2061
|
+
repaired_tool_calls.append(tool_call)
|
|
2062
|
+
continue
|
|
2063
|
+
|
|
2064
|
+
raw_args = fn.get("arguments", "{}")
|
|
2065
|
+
if isinstance(raw_args, dict):
|
|
2066
|
+
parsed_args = dict(raw_args)
|
|
2067
|
+
else:
|
|
2068
|
+
try:
|
|
2069
|
+
parsed_args = json.loads(str(raw_args))
|
|
2070
|
+
except json.JSONDecodeError:
|
|
2071
|
+
repaired_tool_calls.append(tool_call)
|
|
2072
|
+
continue
|
|
2073
|
+
|
|
2074
|
+
if not isinstance(parsed_args, dict):
|
|
2075
|
+
repaired_tool_calls.append(tool_call)
|
|
2076
|
+
continue
|
|
2077
|
+
|
|
2078
|
+
command = parsed_args.get("command")
|
|
2079
|
+
if not isinstance(command, str):
|
|
2080
|
+
repaired_tool_calls.append(tool_call)
|
|
2081
|
+
continue
|
|
2082
|
+
|
|
2083
|
+
cleaned_command, changed = _strip_protocol_tag_only_lines(command)
|
|
2084
|
+
if not changed:
|
|
2085
|
+
repaired_tool_calls.append(tool_call)
|
|
2086
|
+
continue
|
|
2087
|
+
|
|
2088
|
+
parsed_args["command"] = cleaned_command
|
|
2089
|
+
new_tool_call = dict(tool_call)
|
|
2090
|
+
new_fn = dict(fn)
|
|
2091
|
+
new_fn["arguments"] = json.dumps(parsed_args, separators=(",", ":"))
|
|
2092
|
+
new_tool_call["function"] = new_fn
|
|
2093
|
+
repaired_tool_calls.append(new_tool_call)
|
|
2094
|
+
repaired_count += 1
|
|
2095
|
+
|
|
2096
|
+
if repaired_count == 0:
|
|
2097
|
+
return openai_resp, 0
|
|
2098
|
+
|
|
2099
|
+
repaired_response = dict(openai_resp)
|
|
2100
|
+
choices = list(openai_resp.get("choices") or [])
|
|
2101
|
+
if not choices:
|
|
2102
|
+
return openai_resp, 0
|
|
2103
|
+
|
|
2104
|
+
updated_choice = dict(choice)
|
|
2105
|
+
updated_message = dict(message)
|
|
2106
|
+
updated_message["tool_calls"] = repaired_tool_calls
|
|
2107
|
+
updated_choice["message"] = updated_message
|
|
2108
|
+
choices[0] = updated_choice
|
|
2109
|
+
repaired_response["choices"] = choices
|
|
2110
|
+
return repaired_response, repaired_count
|
|
2111
|
+
|
|
2112
|
+
|
|
2039
2113
|
def _required_value_is_empty(value) -> bool:
|
|
2040
2114
|
if value is None:
|
|
2041
2115
|
return True
|
|
@@ -2132,6 +2206,22 @@ def _validate_tool_call_arguments(
|
|
|
2132
2206
|
),
|
|
2133
2207
|
)
|
|
2134
2208
|
|
|
2209
|
+
if tool_name.strip().lower() == "bash":
|
|
2210
|
+
command = parsed.get("command")
|
|
2211
|
+
if isinstance(command, str):
|
|
2212
|
+
cleaned_command, had_protocol_lines = _strip_protocol_tag_only_lines(
|
|
2213
|
+
command
|
|
2214
|
+
)
|
|
2215
|
+
if had_protocol_lines and not cleaned_command:
|
|
2216
|
+
return ToolResponseIssue(
|
|
2217
|
+
kind="invalid_tool_args",
|
|
2218
|
+
reason="arguments for 'Bash' contained only protocol tag lines",
|
|
2219
|
+
retry_hint=(
|
|
2220
|
+
"Emit exactly one `Bash` tool call with a valid shell command in `arguments.command`. "
|
|
2221
|
+
"Do not include standalone XML/protocol tags."
|
|
2222
|
+
),
|
|
2223
|
+
)
|
|
2224
|
+
|
|
2135
2225
|
if _contains_tool_markup(parsed):
|
|
2136
2226
|
return ToolResponseIssue(
|
|
2137
2227
|
kind="invalid_tool_args",
|
|
@@ -2345,20 +2435,34 @@ def _is_malformed_tool_response(openai_resp: dict, anthropic_body: dict) -> bool
|
|
|
2345
2435
|
|
|
2346
2436
|
|
|
2347
2437
|
def _build_malformed_retry_body(
|
|
2348
|
-
openai_body: dict,
|
|
2438
|
+
openai_body: dict,
|
|
2439
|
+
anthropic_body: dict,
|
|
2440
|
+
retry_hint: str = "",
|
|
2441
|
+
tool_choice: str = "required",
|
|
2442
|
+
attempt: int = 1,
|
|
2443
|
+
total_attempts: int = 1,
|
|
2349
2444
|
) -> dict:
|
|
2350
2445
|
retry_body = dict(openai_body)
|
|
2351
2446
|
retry_body["stream"] = False
|
|
2352
|
-
retry_body["tool_choice"] =
|
|
2447
|
+
retry_body["tool_choice"] = tool_choice
|
|
2353
2448
|
retry_body["temperature"] = PROXY_MALFORMED_TOOL_RETRY_TEMPERATURE
|
|
2354
2449
|
|
|
2355
|
-
|
|
2356
|
-
|
|
2357
|
-
"content": (
|
|
2450
|
+
if tool_choice == "required":
|
|
2451
|
+
retry_instruction = (
|
|
2358
2452
|
"Your previous response had invalid tool-call formatting. "
|
|
2359
2453
|
"Respond with exactly one valid tool call using the provided tools. "
|
|
2360
2454
|
"Do not output prose, markdown, XML tags, or schema snippets."
|
|
2361
|
-
)
|
|
2455
|
+
)
|
|
2456
|
+
else:
|
|
2457
|
+
retry_instruction = (
|
|
2458
|
+
"Your previous response had invalid tool-call formatting. "
|
|
2459
|
+
"If a tool is needed, emit exactly one valid tool call with strict JSON arguments. "
|
|
2460
|
+
"If no tool is needed for this turn, return concise plain text with no protocol tags."
|
|
2461
|
+
)
|
|
2462
|
+
|
|
2463
|
+
malformed_retry_instruction = {
|
|
2464
|
+
"role": "user",
|
|
2465
|
+
"content": retry_instruction,
|
|
2362
2466
|
}
|
|
2363
2467
|
existing_messages = retry_body.get("messages")
|
|
2364
2468
|
if isinstance(existing_messages, list) and existing_messages:
|
|
@@ -2383,17 +2487,47 @@ def _build_malformed_retry_body(
|
|
|
2383
2487
|
|
|
2384
2488
|
if retry_hint:
|
|
2385
2489
|
repair_prompt = (
|
|
2386
|
-
"[TOOL CALL REPAIR]\n"
|
|
2490
|
+
f"[TOOL CALL REPAIR attempt {attempt}/{total_attempts}]\n"
|
|
2387
2491
|
f"{retry_hint}\n"
|
|
2388
|
-
"Return
|
|
2492
|
+
"Return a valid response for this turn without protocol artifacts."
|
|
2389
2493
|
)
|
|
2390
2494
|
retry_messages = list(retry_body.get("messages", []))
|
|
2391
|
-
retry_messages.append({"role": "
|
|
2495
|
+
retry_messages.append({"role": "user", "content": repair_prompt})
|
|
2392
2496
|
retry_body["messages"] = retry_messages
|
|
2393
2497
|
|
|
2394
2498
|
return retry_body
|
|
2395
2499
|
|
|
2396
2500
|
|
|
2501
|
+
def _retry_tool_choice_for_attempt(
|
|
2502
|
+
required_tool_choice: bool, attempt: int, total_attempts: int
|
|
2503
|
+
) -> str:
|
|
2504
|
+
if not required_tool_choice:
|
|
2505
|
+
return "auto"
|
|
2506
|
+
if total_attempts <= 1:
|
|
2507
|
+
return "required"
|
|
2508
|
+
return "auto" if attempt == total_attempts - 1 else "required"
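
To show the retry ladder this function produces, the sketch below copies its logic into a standalone function and lists the `tool_choice` value for each zero-based attempt. The `ladder` helper is illustrative and not part of the proxy.

```python
def retry_tool_choice_for_attempt(
    required_tool_choice: bool, attempt: int, total_attempts: int
) -> str:
    """Same logic as _retry_tool_choice_for_attempt above; attempt is zero-based."""
    if not required_tool_choice:
        return "auto"
    if total_attempts <= 1:
        return "required"
    return "auto" if attempt == total_attempts - 1 else "required"


def ladder(required_tool_choice: bool, total_attempts: int) -> list[str]:
    """tool_choice used on each retry, in order."""
    return [
        retry_tool_choice_for_attempt(required_tool_choice, attempt, total_attempts)
        for attempt in range(total_attempts)
    ]


print(ladder(True, 3))   # ['required', 'required', 'auto']
print(ladder(True, 1))   # ['required']
print(ladder(False, 2))  # ['auto', 'auto']
```

With more than one retry available, only the final attempt is released to `auto`, which gives the model a chance to answer in plain text instead of failing the same forced tool call again. A single retry stays `required`.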
|
|
2509
|
+
|
|
2510
|
+
|
|
2511
|
+
def _build_safe_text_openai_response(openai_resp: dict, text: str) -> dict:
|
|
2512
|
+
return {
|
|
2513
|
+
"id": openai_resp.get("id", f"chatcmpl_{uuid.uuid4().hex[:12]}"),
|
|
2514
|
+
"object": openai_resp.get("object", "chat.completion"),
|
|
2515
|
+
"created": openai_resp.get("created", int(time.time())),
|
|
2516
|
+
"model": openai_resp.get("model", "unknown"),
|
|
2517
|
+
"choices": [
|
|
2518
|
+
{
|
|
2519
|
+
"index": 0,
|
|
2520
|
+
"finish_reason": "stop",
|
|
2521
|
+
"message": {
|
|
2522
|
+
"role": "assistant",
|
|
2523
|
+
"content": text,
|
|
2524
|
+
},
|
|
2525
|
+
}
|
|
2526
|
+
],
|
|
2527
|
+
"usage": openai_resp.get("usage", {}),
|
|
2528
|
+
}
|
|
2529
|
+
|
|
2530
|
+
|
|
2397
2531
|
def _build_clean_guardrail_openai_response(openai_resp: dict) -> dict:
|
|
2398
2532
|
return {
|
|
2399
2533
|
"id": openai_resp.get("id", f"chatcmpl_{uuid.uuid4().hex[:12]}"),
|
|
@@ -2486,7 +2620,8 @@ async def _apply_malformed_tool_guardrail(
|
|
|
2486
2620
|
working_resp, required_repairs = _repair_required_tool_args(
|
|
2487
2621
|
working_resp, anthropic_body
|
|
2488
2622
|
)
|
|
2489
|
-
|
|
2623
|
+
working_resp, bash_repairs = _repair_bash_command_artifacts(working_resp)
|
|
2624
|
+
repair_count = markup_repairs + required_repairs + bash_repairs
|
|
2490
2625
|
|
|
2491
2626
|
required_tool_choice = openai_body.get("tool_choice") == "required"
|
|
2492
2627
|
has_tool_calls = _openai_has_tool_calls(working_resp)
|
|
@@ -2536,10 +2671,18 @@ async def _apply_malformed_tool_guardrail(
|
|
|
2536
2671
|
attempts = max(0, PROXY_MALFORMED_TOOL_RETRY_MAX)
|
|
2537
2672
|
current_issue = issue
|
|
2538
2673
|
for attempt in range(attempts):
|
|
2674
|
+
attempt_tool_choice = _retry_tool_choice_for_attempt(
|
|
2675
|
+
required_tool_choice,
|
|
2676
|
+
attempt,
|
|
2677
|
+
attempts,
|
|
2678
|
+
)
|
|
2539
2679
|
retry_body = _build_malformed_retry_body(
|
|
2540
2680
|
openai_body,
|
|
2541
2681
|
anthropic_body,
|
|
2542
2682
|
retry_hint=current_issue.retry_hint,
|
|
2683
|
+
tool_choice=attempt_tool_choice,
|
|
2684
|
+
attempt=attempt + 1,
|
|
2685
|
+
total_attempts=attempts,
|
|
2543
2686
|
)
|
|
2544
2687
|
retry_resp = await client.post(
|
|
2545
2688
|
f"{LLAMA_CPP_BASE}/chat/completions",
|
|
@@ -2563,7 +2706,14 @@ async def _apply_malformed_tool_guardrail(
|
|
|
2563
2706
|
retry_working, retry_required_repairs = _repair_required_tool_args(
|
|
2564
2707
|
retry_working, anthropic_body
|
|
2565
2708
|
)
|
|
2566
|
-
|
|
2709
|
+
retry_working, retry_bash_repairs = _repair_bash_command_artifacts(
|
|
2710
|
+
retry_working
|
|
2711
|
+
)
|
|
2712
|
+
retry_repairs = (
|
|
2713
|
+
retry_markup_repairs + retry_required_repairs + retry_bash_repairs
|
|
2714
|
+
)
|
|
2715
|
+
|
|
2716
|
+
working_resp = retry_working
|
|
2567
2717
|
|
|
2568
2718
|
retry_has_tool_calls = _openai_has_tool_calls(retry_working)
|
|
2569
2719
|
retry_required = retry_body.get("tool_choice") == "required"
|
|
@@ -2620,6 +2770,17 @@ async def _apply_malformed_tool_guardrail(
|
|
|
2620
2770
|
monitor.invalid_tool_call_streak,
|
|
2621
2771
|
monitor.required_tool_miss_streak,
|
|
2622
2772
|
)
|
|
2773
|
+
|
|
2774
|
+
degraded_text = _sanitize_tool_call_apology_text(
|
|
2775
|
+
_openai_message_text(working_resp)
|
|
2776
|
+
).strip()
|
|
2777
|
+
if degraded_text and not _looks_malformed_tool_payload(degraded_text):
|
|
2778
|
+
logger.warning(
|
|
2779
|
+
"TOOL RESPONSE degrade: session=%s returning safe text fallback after retry exhaustion",
|
|
2780
|
+
session_id,
|
|
2781
|
+
)
|
|
2782
|
+
return _build_safe_text_openai_response(working_resp, degraded_text)
|
|
2783
|
+
|
|
2623
2784
|
return _build_clean_guardrail_openai_response(working_resp)
|
|
2624
2785
|
|
|
2625
2786
|
|
|
@@ -2720,6 +2881,18 @@ def openai_to_anthropic_response(openai_resp: dict, model: str) -> dict:
|
|
|
2720
2881
|
args = json.loads(fn.get("arguments", "{}"))
|
|
2721
2882
|
except json.JSONDecodeError:
|
|
2722
2883
|
args = {}
|
|
2884
|
+
if str(fn.get("name") or "").strip().lower() == "bash" and isinstance(args, dict):
|
|
2885
|
+
command = args.get("command")
|
|
2886
|
+
if isinstance(command, str):
|
|
2887
|
+
cleaned_command, had_protocol_lines = _strip_protocol_tag_only_lines(
|
|
2888
|
+
command
|
|
2889
|
+
)
|
|
2890
|
+
if had_protocol_lines:
|
|
2891
|
+
args = dict(args)
|
|
2892
|
+
args["command"] = cleaned_command
|
|
2893
|
+
logger.warning(
|
|
2894
|
+
"BASH SAFETY: stripped standalone protocol-tag lines from command before tool execution"
|
|
2895
|
+
)
|
|
2723
2896
|
content.append(
|
|
2724
2897
|
{
|
|
2725
2898
|
"type": "tool_use",
|
|
package/tools/agents/tests/test_anthropic_proxy_streaming.py
CHANGED
|
@@ -487,6 +487,33 @@ class TestMalformedToolGuardrail(unittest.TestCase):
|
|
|
487
487
|
setattr(proxy, "PROXY_MALFORMED_TOOL_RETRY_TEMPERATURE", old_temp)
|
|
488
488
|
setattr(proxy, "PROXY_DISABLE_THINKING_ON_TOOL_TURNS", old_disable)
|
|
489
489
|
|
|
490
|
+
def test_malformed_retry_body_appends_retry_hint_as_user_message(self):
|
|
491
|
+
openai_body = {
|
|
492
|
+
"model": "test",
|
|
493
|
+
"messages": [{"role": "user", "content": "fix"}],
|
|
494
|
+
}
|
|
495
|
+
anthropic_body = {
|
|
496
|
+
"tools": [{"name": "Read", "input_schema": {"type": "object"}}]
|
|
497
|
+
}
|
|
498
|
+
|
|
499
|
+
retry = proxy._build_malformed_retry_body(
|
|
500
|
+
openai_body,
|
|
501
|
+
anthropic_body,
|
|
502
|
+
retry_hint="Use strict JSON",
|
|
503
|
+
tool_choice="required",
|
|
504
|
+
attempt=1,
|
|
505
|
+
total_attempts=2,
|
|
506
|
+
)
|
|
507
|
+
|
|
508
|
+
self.assertEqual(retry["messages"][-1]["role"], "user")
|
|
509
|
+
self.assertIn("TOOL CALL REPAIR attempt 1/2", retry["messages"][-1]["content"])
|
|
510
|
+
|
|
511
|
+
def test_retry_ladder_releases_last_attempt_to_auto(self):
|
|
512
|
+
self.assertEqual(proxy._retry_tool_choice_for_attempt(True, 0, 3), "required")
|
|
513
|
+
self.assertEqual(proxy._retry_tool_choice_for_attempt(True, 1, 3), "required")
|
|
514
|
+
self.assertEqual(proxy._retry_tool_choice_for_attempt(True, 2, 3), "auto")
|
|
515
|
+
self.assertEqual(proxy._retry_tool_choice_for_attempt(False, 0, 3), "auto")
|
|
516
|
+
|
|
490
517
|
def test_clean_guardrail_response_does_not_promise_future_tool_call(self):
|
|
491
518
|
guardrail = proxy._build_clean_guardrail_openai_response(
|
|
492
519
|
{"model": "test-model"}
|
|
@@ -772,6 +799,34 @@ class TestMalformedToolGuardrail(unittest.TestCase):
|
|
|
772
799
|
)
|
|
773
800
|
self.assertEqual(args["command"], "ls")
|
|
774
801
|
|
|
802
|
+
def test_bash_command_repair_strips_protocol_tag_only_lines(self):
|
|
803
|
+
openai_resp = {
|
|
804
|
+
"choices": [
|
|
805
|
+
{
|
|
806
|
+
"finish_reason": "tool_calls",
|
|
807
|
+
"message": {
|
|
808
|
+
"content": "",
|
|
809
|
+
"tool_calls": [
|
|
810
|
+
{
|
|
811
|
+
"id": "call_1",
|
|
812
|
+
"function": {
|
|
813
|
+
"name": "Bash",
|
|
814
|
+
"arguments": '{"command":"pwd\\n</function>\\n<tool_call>"}',
|
|
815
|
+
},
|
|
816
|
+
}
|
|
817
|
+
],
|
|
818
|
+
},
|
|
819
|
+
}
|
|
820
|
+
]
|
|
821
|
+
}
|
|
822
|
+
|
|
823
|
+
repaired, count = proxy._repair_bash_command_artifacts(openai_resp)
|
|
824
|
+
self.assertEqual(count, 1)
|
|
825
|
+
args = json.loads(
|
|
826
|
+
repaired["choices"][0]["message"]["tool_calls"][0]["function"]["arguments"]
|
|
827
|
+
)
|
|
828
|
+
self.assertEqual(args["command"], "pwd")
|
|
829
|
+
|
|
775
830
|
def test_guardrail_accepts_repaired_markup_without_fallback(self):
|
|
776
831
|
old_retry = getattr(proxy, "PROXY_MALFORMED_TOOL_RETRY_MAX")
|
|
777
832
|
try:
|
|
@@ -1290,6 +1345,38 @@ class TestToolTurnControls(unittest.TestCase):
|
|
|
1290
1345
|
setattr(proxy, "PROXY_ANALYSIS_ONLY_MIN_TOOLS", old_min_tools)
|
|
1291
1346
|
setattr(proxy, "PROXY_ANALYSIS_ONLY_MAX_MESSAGES", old_max_messages)
|
|
1292
1347
|
|
|
1348
|
+
def test_analysis_only_route_does_not_treat_implementation_as_action(self):
|
|
1349
|
+
old_route = getattr(proxy, "PROXY_ANALYSIS_ONLY_ROUTE")
|
|
1350
|
+
old_min_tools = getattr(proxy, "PROXY_ANALYSIS_ONLY_MIN_TOOLS")
|
|
1351
|
+
old_max_messages = getattr(proxy, "PROXY_ANALYSIS_ONLY_MAX_MESSAGES")
|
|
1352
|
+
try:
|
|
1353
|
+
setattr(proxy, "PROXY_ANALYSIS_ONLY_ROUTE", True)
|
|
1354
|
+
setattr(proxy, "PROXY_ANALYSIS_ONLY_MIN_TOOLS", 4)
|
|
1355
|
+
setattr(proxy, "PROXY_ANALYSIS_ONLY_MAX_MESSAGES", 2)
|
|
1356
|
+
|
|
1357
|
+
body = {
|
|
1358
|
+
"messages": [
|
|
1359
|
+
{
|
|
1360
|
+
"role": "user",
|
|
1361
|
+
"content": "analyze implementation options and summarize tradeoffs",
|
|
1362
|
+
}
|
|
1363
|
+
],
|
|
1364
|
+
"tools": [
|
|
1365
|
+
{"name": "Read", "input_schema": {"type": "object"}},
|
|
1366
|
+
{"name": "Edit", "input_schema": {"type": "object"}},
|
|
1367
|
+
{"name": "Write", "input_schema": {"type": "object"}},
|
|
1368
|
+
{"name": "Bash", "input_schema": {"type": "object"}},
|
|
1369
|
+
],
|
|
1370
|
+
}
|
|
1371
|
+
|
|
1372
|
+
updated, removed = proxy._maybe_route_analysis_without_tools(body)
|
|
1373
|
+
self.assertEqual(removed, 4)
|
|
1374
|
+
self.assertNotIn("tools", updated)
|
|
1375
|
+
finally:
|
|
1376
|
+
setattr(proxy, "PROXY_ANALYSIS_ONLY_ROUTE", old_route)
|
|
1377
|
+
setattr(proxy, "PROXY_ANALYSIS_ONLY_MIN_TOOLS", old_min_tools)
|
|
1378
|
+
setattr(proxy, "PROXY_ANALYSIS_ONLY_MAX_MESSAGES", old_max_messages)
|
|
1379
|
+
|
|
1293
1380
|
|
|
1294
1381
|
class TestSessionContaminationBreaker(unittest.TestCase):
|
|
1295
1382
|
def test_contamination_breaker_trims_and_resets_streak(self):
|