@miller-tech/uap 1.15.5 → 1.15.7
- package/docs/INDEX.md +8 -0
- package/docs/blog/SPECULATIVE_DECODING_PRODUCTION_PLAYBOOK.md +139 -0
- package/docs/pr/PR_SPECULATIVE_DOCS_TEMPLATE.md +146 -0
- package/package.json +1 -1
- package/templates/hooks/pre-tool-use-bash.sh +9 -0
- package/tools/agents/scripts/anthropic_proxy.py +302 -53
- package/tools/agents/tests/test_anthropic_proxy_streaming.py +197 -0
package/docs/INDEX.md
CHANGED
@@ -47,6 +47,14 @@
 - [Token Optimization](benchmarks/TOKEN_OPTIMIZATION.md) -- Per-feature token savings analysis
 - [Accuracy Analysis](benchmarks/ACCURACY_ANALYSIS.md) -- Internal vs Terminal-Bench comparison
 
+## Blog
+
+- [Speculative Decoding Production Playbook](blog/SPECULATIVE_DECODING_PRODUCTION_PLAYBOOK.md) -- Long-form narrative on throughput gains, failure modes, and stable profiles
+
+## PR Templates
+
+- [Speculative Docs PR Template](pr/PR_SPECULATIVE_DOCS_TEMPLATE.md) -- Ready-to-submit PR copy, checklist, and reviewer guidance
+
 ## Research
 
 - [Memory Systems Comparison](research/MEMORY_SYSTEMS_COMPARISON.md) -- MemGPT, LangGraph, Mem0, A-MEM analysis
package/docs/blog/SPECULATIVE_DECODING_PRODUCTION_PLAYBOOK.md
ADDED
@@ -0,0 +1,139 @@
+# Speculative Decoding in llama.cpp: Real Speedups Without Breaking Agentic Reliability
+
+Speculative decoding can look like free performance - until it meets long-context, tool-heavy agent workflows. This write-up covers what improved throughput, what regressed, and which operational changes restored stability across `llama.cpp` and an Anthropic-compatible proxy.
+
+## Why This Matters
+
+Speculative decoding is strongest when generated text has predictable structure or repetition. But in real coding sessions, throughput alone is not enough: the system must preserve clean output, reliable tool-call behavior, and long-session continuity.
+
+In practice, this is one runtime boundary:
+
+- `llama.cpp` speculative behavior
+- parameter profile and rollback mode
+- proxy streaming/fallback policies
+- agentic tool-loop control behavior
+
+## Baseline Environment
+
+- Runtime: `llama.cpp` + CUDA + Qwen3.5 GGUF
+- Context window: `262144`
+- Spec type: `ngram-cache`
+- Gateway: Anthropic-compatible proxy forwarding to OpenAI-compatible server
+
+Related runbooks:
+
+- `docs/deployment/UAP_LLAMA_ANTHROPIC_PROXY_BOOTSTRAP.md`
+- `docs/benchmarks/SPECULATIVE_DECODING_JOURNEY_2026-03.md`
+
+## What We Observed
+
+### Throughput Gains Were Workload-Dependent
+
+Speculation did not uniformly improve all turns. Coding/tool turns often saw small uplift; repetition-heavy turns saw large gains.
+
+Representative 27B snapshot (`ctx=262144`):
+
+- No spec: ~43 tok/s coding, ~41 tok/s pattern
+- Balanced spec (`12/2/0.80`): ~43 tok/s coding, ~102 tok/s pattern
+
+Takeaway: benchmark by workload class, not one blended average.
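The snapshot above can be turned into per-class speedup ratios. This is a minimal sketch using only the approximate tok/s figures quoted in the playbook; the dictionary layout and the `speedup` helper are illustrative assumptions, not part of any benchmark tooling in the package.

```python
# Approximate tok/s figures copied from the 27B snapshot above.
no_spec = {"coding": 43.0, "pattern": 41.0}
balanced = {"coding": 43.0, "pattern": 102.0}


def speedup(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Return the candidate/baseline throughput ratio for each workload class."""
    return {name: candidate[name] / baseline[name] for name in baseline}


ratios = speedup(no_spec, balanced)

# A naive blend that sums throughput across classes before dividing.
blended = sum(balanced.values()) / sum(no_spec.values())

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
print(f"blended: {blended:.2f}x")
```

The blended figure (about 1.73x) describes neither class: coding turns see roughly 1.00x while pattern turns see roughly 2.49x, which is why the takeaway asks for per-class reporting.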
+
+### Newer Lineage Produced Noisier Warnings
+
+Under identical settings, newer builds emitted warnings such as:
+
+- `find_slot: non-consecutive token position`
+
+This correlated with lower effective throughput and less stable long-session behavior in A/B comparisons.
+
+### Proxy Fallback Could Leak Malformed Internal Text
+
+When upstream returned reasoning-heavy but empty visible output, weak fallback policy could expose malformed fragments (pseudo-tool text, schema/policy echoes) to end users.
+
+Patterns included:
+
+- `</parameter>`-style fragments
+- non-JSON pseudo-tool content
+- repetitive policy-like loops with no valid `tool_calls`
+
+## Immediate Fixes That Worked
+
+### Safe Production Defaults
+
+The highest-leverage stabilization profile was:
+
+- `PROXY_STREAM_REASONING_FALLBACK=off`
+- `PROXY_MALFORMED_TOOL_GUARDRAIL=on`
+- `PROXY_MALFORMED_TOOL_STREAM_STRICT=on`
+- `PROXY_MAX_TOKENS_FLOOR=4096`
+
+Why:
+
+- `fallback=off` suppresses malformed reasoning leakage.
+- malformed-tool guardrail + strict stream path recovers bad stream+tools turns.
+- lower token floor reduces long failure-turn latency while preserving normal turns.
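The defaults above can be applied when launching the proxy. The following is a minimal sketch, assuming the proxy reads these variables from its process environment; the `SAFE_PROFILE` mapping and the `build_proxy_env` helper are hypothetical names introduced here for illustration only.

```python
import os

# Stabilization profile copied from the list above.
SAFE_PROFILE = {
    "PROXY_STREAM_REASONING_FALLBACK": "off",
    "PROXY_MALFORMED_TOOL_GUARDRAIL": "on",
    "PROXY_MALFORMED_TOOL_STREAM_STRICT": "on",
    "PROXY_MAX_TOKENS_FLOOR": "4096",
}


def build_proxy_env(base: dict[str, str]) -> dict[str, str]:
    """Return a copy of `base` with the safe profile filled in.

    Values already present in `base` are kept, so an explicit operator
    override still wins over the profile.
    """
    env = dict(base)
    for name, value in SAFE_PROFILE.items():
        env.setdefault(name, value)
    return env


# Hypothetical usage: derive the launch environment from the current one.
env = build_proxy_env(dict(os.environ))
```

The resulting mapping could be passed as `env=` to `subprocess.Popen` when starting the proxy process. Using `setdefault` is a design choice in this sketch: it keeps the profile as a baseline that does not silently overwrite a deliberate local setting.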
+
+### Balanced Speculative Profile for Daily Agentic Work
+
+- `spec-type=ngram-cache`
+- `draft-max=12`
+- `draft-min=2`
+- `draft-p-min=0.80`
+- rollback mode: `strict`
+
+This profile is less aggressive than max-throughput tuning, but significantly safer for long coding sessions.
+
+## Benchmark Method That Prevents False Wins
+
+A useful speculative benchmark protocol should include:
+
+1. Prompt classes
+   - coding/tool-call tasks
+   - repetition/pattern-heavy tasks
+2. Repeats and warmup
+   - fixed run count
+   - warmup policy
+   - p50/p95 latency, not only mean tok/s
+3. Required metrics
+   - decode throughput (`eval tok/s`)
+   - prefill throughput (`prompt eval tok/s`)
+   - acceptance/rejection behavior
+   - malformed-turn incidence
+   - stop reason distribution
+4. Profile matrix
+   - no-spec baseline
+   - aggressive profile
+   - balanced profile
+
+Without this, speculative tuning can appear faster while degrading real agentic reliability.
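The protocol above can be sketched as a small per-class summary. This is an illustrative outline under stated assumptions: the record layout, the `percentile` and `summarize` helpers, and every number in `RUNS` are hypothetical placeholders, not measurements from this project.

```python
from statistics import mean

# Hypothetical run records: (profile, prompt_class, latency_s, eval_tok_s).
# Values are placeholders for illustration, not measured results.
RUNS = [
    ("no-spec", "coding", 2.0, 40.0),
    ("no-spec", "coding", 2.2, 38.0),
    ("no-spec", "coding", 3.0, 36.0),
    ("balanced", "coding", 2.1, 41.0),
    ("balanced", "coding", 2.0, 39.0),
    ("balanced", "coding", 4.5, 30.0),
]


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling division
    return ordered[rank - 1]


def summarize(runs):
    """Group runs by (profile, prompt_class) and report latency percentiles."""
    groups: dict[tuple[str, str], list[tuple[float, float]]] = {}
    for profile, prompt_class, latency_s, tok_s in runs:
        groups.setdefault((profile, prompt_class), []).append((latency_s, tok_s))

    summary = {}
    for key, rows in groups.items():
        latencies = [row[0] for row in rows]
        summary[key] = {
            "runs": len(rows),
            "p50_latency_s": percentile(latencies, 50),
            "p95_latency_s": percentile(latencies, 95),
            "mean_eval_tok_s": mean(row[1] for row in rows),
        }
    return summary


for key, stats in summarize(RUNS).items():
    print(key, stats)
```

Keeping the grouping key as `(profile, prompt_class)` means a slow tail in one class stays visible instead of being averaged away across the profile matrix.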
+
+## Practical Playbook
+
+### Use for Daily Agentic Coding
+
+- balanced `ngram-cache` (`12/2/0.80`)
+- strict malformed-tool stream guardrail
+- reasoning fallback disabled
+- reduced token floor (`4096`)
+
+### Use for Max Throughput Exploration
+
+- hybrid rollback
+- larger draft windows
+- tightly scoped benchmark prompts
+
+Then promote only if long-session tool-loop soak remains stable.
+
+## What llama.cpp Docs Should Add Next
+
+Mechanics are documented well today. The next improvement is operational clarity:
+
+- implementation selection matrix by workload
+- troubleshooting by signature (`find_slot`, rollback spikes, acceptance collapse)
+- reproducible benchmark protocol and output schema
+- rollout/canary/rollback criteria
+- proxy compatibility appendix for stream+tools environments
+
+## Final Takeaway
+
+Speculative decoding in production is a systems problem, not just a decoding primitive. Treating runtime + transport + tool-loop behavior as one boundary is what makes speculative speedups both real and reliable.
package/docs/pr/PR_SPECULATIVE_DOCS_TEMPLATE.md
ADDED
@@ -0,0 +1,146 @@
+## Title
+
+docs: add speculative decoding production playbook and agentic compatibility guidance
+
+## Context
+
+`docs/speculative.md` explains speculative mechanisms and flags, but production operators also need:
+
+- workload-driven profile selection,
+- reproducible benchmarking protocol,
+- signature-based regression triage,
+- guidance for stream+tools agentic environments.
+
+This PR adds operational documentation to reduce drift between benchmark wins and real-session behavior.
+
+## Changes
+
+### Add new guide
+
+- New: `docs/speculative-production.md`
+  - implementation matrix:
+    - `draft`
+    - `ngram-cache`
+    - `ngram-simple`
+    - `ngram-map-k`
+    - `ngram-map-k4v`
+    - `ngram-mod`
+  - decision tree by workload (coding, repetitive transform, mixed)
+  - benchmark protocol (run counts, warmup, prompt classes, metrics)
+  - troubleshooting by signature:
+    - `find_slot: non-consecutive token position`
+    - low acceptance + high rollback
+    - throughput collapse after commit switch
+  - rollout rules (canary, promotion threshold, rollback triggers)
+
+### Update existing speculative docs
+
+- Update `docs/speculative.md`:
+  - add link to production guide
+  - add "how to interpret statistics in practice"
+  - add "workload sensitivity and reproducibility notes"
+
+### Add compatibility appendix
+
+- New appendix (or linked page): stream+tools compatibility for proxy-mediated agentic flows
+  - fallback policy guidance (`off` default for production)
+  - malformed stream/tool guardrail behavior
+  - max token floor and prune target recommendations
+
+## Why
+
+Speculative decoding quality in agentic coding depends on end-to-end behavior, including transport and stream tool-loop handling. This documentation closes that gap and provides a repeatable operator path.
+
+## Validation Plan
+
+- Verify all CLI flags/options in examples against current `llama-server`.
+- Verify all linked scripts/docs paths resolve.
+- Include one benchmark table with:
+  - decode/prefill throughput
+  - acceptance indicators
+  - latency percentiles
+  - workload class labels
+
+## Risks
+
+- Overfitting recommendations to one model/hardware class.
+- Treating proxy behavior as universally required.
+
+## Mitigations
+
+- Mark all profile recommendations as workload/hardware sensitive.
+- Separate "safe baseline" from "aggressive benchmark-only" profiles.
+- Require local A/B validation before rollout.
+
+## Out of Scope
+
+- Runtime code changes
+- Kernel-level speculative optimization changes
+- Proxy implementation changes (docs-only PR)
+
+## Follow-ups
+
+1. Add nightly speculative regression harness.
+2. Publish benchmark JSON schema for machine comparison.
+3. Add commit-lineage tracking for performance regressions.
+
+---
+
+## Ready-to-Submit GitHub PR Body
+
+### Summary
+
+This docs PR adds a production-oriented speculative decoding playbook for llama.cpp users running real multi-turn workloads (especially agentic/tool-call scenarios). It complements existing mechanism-level docs with actionable tuning, troubleshooting, and rollout guidance.
+
+### What Changed
+
+- Added `docs/speculative-production.md` (new operational guide)
+  - implementation selection matrix
+  - workload-based decision tree
+  - benchmark protocol + required metrics
+  - troubleshooting by real log signatures
+  - canary/rollback rollout guidance
+- Updated `docs/speculative.md`
+  - links to production guide
+  - practical stats interpretation notes
+  - workload sensitivity notes
+- Added/linked "agentic stream+tools compatibility" appendix
+  - fallback policy defaults
+  - malformed stream/tool guardrails
+  - token-floor/prune guidance
+
+### Why
+
+Current docs describe speculative decoding internals clearly, but production operators need a reproducible way to:
+
+- choose stable profiles by workload,
+- detect/triage regressions quickly,
+- avoid benchmark-only wins that fail in long sessions.
+
+### Reviewer Guide
+
+Please focus review on:
+
+1. Accuracy of CLI flags and option names.
+2. Correctness of troubleshooting signatures and interpretations.
+3. Clarity of benchmark protocol (can another team reproduce it?).
+4. Whether safe-vs-aggressive profile separation is clear enough.
+
+### Validation
+
+- [ ] Command examples verified against current `llama-server --help`
+- [ ] Linked docs/scripts paths validated
+- [ ] Benchmark table includes workload class labels
+- [ ] Metrics include decode/prefill throughput + latency percentile view
+- [ ] No runtime behavior claims without explicit caveats
+
+### Risks / Caveats
+
+- Recommendations are model/hardware/workload dependent.
+- Guidance is operational, not a substitute for local A/B testing.
+
+### Follow-ups
+
+- [ ] Add nightly regression harness for speculative profiles
+- [ ] Publish machine-readable benchmark schema
+- [ ] Add commit lineage references in benchmark artifacts
package/package.json
CHANGED
package/templates/hooks/pre-tool-use-bash.sh
CHANGED
@@ -22,6 +22,15 @@ if [ -z "$CMD" ]; then
   exit 0
 fi
 
+# ─── Protocol Tag Injection Guard ────────────────────────────────
+# Reject Bash payloads that still contain standalone protocol tag lines.
+# These fragments can appear after malformed tool-call rendering and must
+# never reach shell evaluation.
+if printf '%s\n' "$CMD" | grep -qE '^\s*</?(tool_call|tool_response|parameter(=[^>]*)?|function(=[^>]*)?|think)\s*>\s*$'; then
+  echo "BLOCKED [bash-safety]: Command contains standalone XML/protocol tag lines. Remove tool-call tag artifacts before execution." >&2
+  exit 2
+fi
+
 # ─── IaC Pipeline Enforcement ───────────────────────────────────
 # Block local terraform apply/destroy (policies/iac-pipeline-enforcement.md)
 # Allow: terraform fmt, validate, init, plan, output, show, state list, graph
package/tools/agents/scripts/anthropic_proxy.py
CHANGED
@@ -254,6 +254,28 @@ PROXY_ANALYSIS_ONLY_MIN_TOOLS = int(
 PROXY_ANALYSIS_ONLY_MAX_MESSAGES = int(
     os.environ.get("PROXY_ANALYSIS_ONLY_MAX_MESSAGES", "2")
 )
+PROXY_TOOL_CALL_GRAMMAR = os.environ.get(
+    "PROXY_TOOL_CALL_GRAMMAR", "on"
+).lower() not in {
+    "0",
+    "false",
+    "off",
+    "no",
+}
+PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY = os.environ.get(
+    "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY", "on"
+).lower() not in {
+    "0",
+    "false",
+    "off",
+    "no",
+}
+PROXY_TOOL_CALL_GRAMMAR_PATH = os.path.abspath(
+    os.environ.get(
+        "PROXY_TOOL_CALL_GRAMMAR_PATH",
+        os.path.join(os.path.dirname(__file__), "..", "config", "tool-call.gbnf"),
+    )
+)
 
 # ---------------------------------------------------------------------------
 # Logging
@@ -266,6 +288,45 @@ logging.basicConfig(
 logger = logging.getLogger("uap.anthropic_proxy")
 
 
+def _load_tool_call_grammar(path: str) -> str:
+    if not PROXY_TOOL_CALL_GRAMMAR:
+        return ""
+
+    try:
+        with open(path, "r", encoding="utf-8") as fh:
+            return fh.read().strip()
+    except OSError as exc:
+        logger.warning(
+            "Tool-call grammar disabled: failed to read %s (%s)",
+            path,
+            exc,
+        )
+        return ""
+
+
+TOOL_CALL_GBNF = _load_tool_call_grammar(PROXY_TOOL_CALL_GRAMMAR_PATH)
+
+
+def _apply_tool_call_grammar(
+    request_body: dict, tool_choice: str | None = None
+) -> None:
+    request_body.pop("grammar", None)
+
+    if not PROXY_TOOL_CALL_GRAMMAR or not TOOL_CALL_GBNF:
+        return
+
+    if not request_body.get("tools"):
+        return
+
+    effective_tool_choice = (
+        tool_choice if tool_choice is not None else request_body.get("tool_choice")
+    )
+    if PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY and effective_tool_choice != "required":
+        return
+
+    request_body["grammar"] = TOOL_CALL_GBNF
+
+
 # ---------------------------------------------------------------------------
 # Option F: Session-level Context Window Monitor
 # ---------------------------------------------------------------------------
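The gating order in `_apply_tool_call_grammar` can be restated as a pure predicate, which makes the effect of the two flags easier to see. This is a sketch under stated assumptions: `should_attach_grammar` and its keyword arguments are hypothetical names, and the function only mirrors the early returns visible in the hunk above.

```python
from __future__ import annotations


def should_attach_grammar(
    *,
    grammar_enabled: bool,
    grammar_loaded: bool,
    has_tools: bool,
    tool_choice: str | None,
    required_only: bool,
) -> bool:
    """Mirror the early-return order of `_apply_tool_call_grammar`."""
    if not grammar_enabled or not grammar_loaded:
        return False
    if not has_tools:
        return False
    if required_only and tool_choice != "required":
        return False
    return True


# With both flags left at their "on" defaults, only forced tool turns
# get the grammar attached.
for choice in ("required", "auto", None):
    attached = should_attach_grammar(
        grammar_enabled=True,
        grammar_loaded=True,
        has_tools=True,
        tool_choice=choice,
        required_only=True,
    )
    print(choice, attached)
```

Under the defaults shown in the diff, this means ordinary `auto` turns are left unconstrained, and the grammar only applies where a tool call is mandatory.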
@@ -876,7 +937,7 @@ async def lifespan(app: FastAPI):
         _resolve_prune_target_fraction() * 100,
     )
     logger.info(
-        "Guardrails: malformed=%s stream_strict=%s force_non_stream=%s args_preflight=%s tool_narrowing=%s thinking_off_on_tools=%s dampener=%s(%d/%d/%d/%d->%d) contamination_breaker=%s(%d forced=%d required_miss=%d) analysis_only_route=%s(min_tools=%d,max_msgs=%d)",
+        "Guardrails: malformed=%s stream_strict=%s force_non_stream=%s args_preflight=%s tool_narrowing=%s thinking_off_on_tools=%s dampener=%s(%d/%d/%d/%d->%d) contamination_breaker=%s(%d forced=%d required_miss=%d) analysis_only_route=%s(min_tools=%d,max_msgs=%d) grammar=%s(required_only=%s loaded=%s path=%s)",
         PROXY_MALFORMED_TOOL_GUARDRAIL,
         PROXY_MALFORMED_TOOL_STREAM_STRICT,
         PROXY_FORCE_NON_STREAM,
@@ -896,6 +957,10 @@ async def lifespan(app: FastAPI):
         PROXY_ANALYSIS_ONLY_ROUTE,
         PROXY_ANALYSIS_ONLY_MIN_TOOLS,
         PROXY_ANALYSIS_ONLY_MAX_MESSAGES,
+        PROXY_TOOL_CALL_GRAMMAR,
+        PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY,
+        bool(TOOL_CALL_GBNF),
+        PROXY_TOOL_CALL_GRAMMAR_PATH,
     )
 
     yield
@@ -1044,49 +1109,27 @@ def _is_analysis_only_prompt(text: str) -> bool:
     if not text:
         return False
 
-
-
-
-
-
-
-        "plan",
-        "recommend",
-        "assess",
-        "compare",
-        "investigate",
-        "diagnose",
+    normalized = text.lower()
+    has_analysis = bool(
+        re.search(
+            r"\b(?:analy(?:ze|zing|sis)?|review|audit|summar(?:y|ize|ized|ise)|explain|plan|recommend|assess|compare|investigate|diagnos(?:e|is))\b",
+            normalized,
+        )
     )
-
-
-
-
-
-
-
-
-
-
-
-
-
-        "call tool",
-        "apply",
-        "commit",
-        "push",
-        "merge",
-        "publish",
-        "deploy",
-        "test",
-        "build",
-        "refactor",
-        "rename",
-        "delete",
-        "install",
+    has_action = bool(
+        re.search(
+            r"\b(?:fix|edit|write|create|implement|patch|change|update|run|execute|apply|commit|push|merge|publish|deploy|test|build|refactor|rename|delete|install)\b",
+            normalized,
+        )
+    ) or any(
+        phrase in normalized
+        for phrase in (
+            "use tool",
+            "call tool",
+            "run command",
+            "execute command",
+        )
     )
-
-    has_analysis = any(marker in text for marker in analysis_markers)
-    has_action = any(marker in text for marker in action_markers)
     return has_analysis and not has_action
 
 
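The hunk above replaces substring containment (`marker in text`) with whole-word regex matching. The following sketch shows why that matters, using three action markers that are visible in the removed list; the helper names and the sample prompt are hypothetical and chosen only for illustration.

```python
import re

# Three action markers that appear in the removed marker list above.
ACTION_MARKERS = ("apply", "test", "build")

# Whole-word version of the same three markers.
_ACTION_RE = re.compile(r"\b(?:apply|test|build)\b")


def has_action_substring(text: str) -> bool:
    """Old style: plain substring containment."""
    return any(marker in text for marker in ACTION_MARKERS)


def has_action_word(text: str) -> bool:
    """New style: whole-word match on lowercased text."""
    return bool(_ACTION_RE.search(text.lower()))


prompt = "review the latest rebuild logs and summarize"
print(has_action_substring(prompt))  # True: "test" is inside "latest"
print(has_action_word(prompt))       # False: no whole-word action verb
```

The trade-off is that whole-word matching also stops matching inflected forms such as `tests` or `building` on their own, so an analysis-only classification now depends on the exact verb forms listed in the pattern.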
@@ -1467,6 +1510,8 @@ def build_openai_request(anthropic_body: dict, monitor: SessionMonitor) -> dict:
             "Thinking disabled for tool turn (PROXY_DISABLE_THINKING_ON_TOOL_TURNS=on)"
         )
 
+    _apply_tool_call_grammar(openai_body)
+
     return openai_body
 
 
@@ -1793,6 +1838,11 @@ _TOOL_ARG_MARKERS = (
     "</think>",
 )
 
+_BASH_PROTOCOL_LINE_RE = re.compile(
+    r"^\s*</?(?:tool_call|tool_response|parameter(?:=[^>]*)?|function(?:=[^>]*)?|think)\s*>\s*$",
+    re.IGNORECASE,
+)
+
 
 def _iter_string_leaves(value):
     if isinstance(value, str):
@@ -1822,6 +1872,26 @@ def _strip_tool_markup_artifacts(text: str) -> str:
     return cleaned.strip()
 
 
+def _strip_protocol_tag_only_lines(text: str) -> tuple[str, bool]:
+    if not isinstance(text, str):
+        return text, False
+
+    lines = text.splitlines()
+    kept_lines: list[str] = []
+    stripped = False
+    for line in lines:
+        if _BASH_PROTOCOL_LINE_RE.match(line):
+            stripped = True
+            continue
+        kept_lines.append(line)
+
+    if not stripped:
+        return text, False
+
+    cleaned = "\n".join(kept_lines).strip()
+    return cleaned, True
+
+
 def _sanitize_markup_value(value):
     if isinstance(value, str):
         cleaned = _strip_tool_markup_artifacts(value)
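The behavior of the tag-only line filter can be checked in isolation. The sketch below is a condensed standalone restatement that reuses the pattern from `_BASH_PROTOCOL_LINE_RE`; the names `PROTOCOL_LINE_RE` and `strip_protocol_lines` and the sample commands are assumptions made for this example.

```python
import re

# Same pattern as `_BASH_PROTOCOL_LINE_RE` in the hunk above.
PROTOCOL_LINE_RE = re.compile(
    r"^\s*</?(?:tool_call|tool_response|parameter(?:=[^>]*)?|function(?:=[^>]*)?|think)\s*>\s*$",
    re.IGNORECASE,
)


def strip_protocol_lines(text: str) -> tuple[str, bool]:
    """Drop lines that consist solely of a protocol tag."""
    lines = text.splitlines()
    kept = [line for line in lines if not PROTOCOL_LINE_RE.match(line)]
    if len(kept) == len(lines):
        return text, False
    return "\n".join(kept).strip(), True


leaked = "<parameter=command>\nls -la\n</parameter>\n</function>"
print(strip_protocol_lines(leaked))            # ('ls -la', True)
print(strip_protocol_lines("echo '<think>'"))  # ("echo '<think>'", False)
```

Only lines made up entirely of a tag are removed, so a command that legitimately mentions a tag inline is left alone. One difference worth noting between the two layers in this diff: the shell hook's `grep -qE` is case-sensitive, while the proxy pattern is compiled with `re.IGNORECASE`.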
@@ -2036,6 +2106,77 @@ def _repair_required_tool_args(
     return repaired_response, repaired_count
 
 
+def _repair_bash_command_artifacts(openai_resp: dict) -> tuple[dict, int]:
+    if not _openai_has_tool_calls(openai_resp):
+        return openai_resp, 0
+
+    choice, message = _extract_openai_choice(openai_resp)
+    tool_calls = message.get("tool_calls") or []
+    if not tool_calls:
+        return openai_resp, 0
+
+    repaired_tool_calls = []
+    repaired_count = 0
+
+    for tool_call in tool_calls:
+        fn = tool_call.get("function") if isinstance(tool_call, dict) else {}
+        if not isinstance(fn, dict):
+            fn = {}
+
+        tool_name = str(fn.get("name", "")).strip().lower()
+        if tool_name != "bash":
+            repaired_tool_calls.append(tool_call)
+            continue
+
+        raw_args = fn.get("arguments", "{}")
+        if isinstance(raw_args, dict):
+            parsed_args = dict(raw_args)
+        else:
+            try:
+                parsed_args = json.loads(str(raw_args))
+            except json.JSONDecodeError:
+                repaired_tool_calls.append(tool_call)
+                continue
+
+        if not isinstance(parsed_args, dict):
+            repaired_tool_calls.append(tool_call)
+            continue
+
+        command = parsed_args.get("command")
+        if not isinstance(command, str):
+            repaired_tool_calls.append(tool_call)
+            continue
+
+        cleaned_command, changed = _strip_protocol_tag_only_lines(command)
+        if not changed:
+            repaired_tool_calls.append(tool_call)
+            continue
+
+        parsed_args["command"] = cleaned_command
+        new_tool_call = dict(tool_call)
+        new_fn = dict(fn)
+        new_fn["arguments"] = json.dumps(parsed_args, separators=(",", ":"))
+        new_tool_call["function"] = new_fn
+        repaired_tool_calls.append(new_tool_call)
+        repaired_count += 1
+
+    if repaired_count == 0:
+        return openai_resp, 0
+
+    repaired_response = dict(openai_resp)
+    choices = list(openai_resp.get("choices") or [])
+    if not choices:
+        return openai_resp, 0
+
+    updated_choice = dict(choice)
+    updated_message = dict(message)
+    updated_message["tool_calls"] = repaired_tool_calls
+    updated_choice["message"] = updated_message
+    choices[0] = updated_choice
+    repaired_response["choices"] = choices
+    return repaired_response, repaired_count
+
+
 def _required_value_is_empty(value) -> bool:
     if value is None:
         return True
@@ -2132,6 +2273,22 @@ def _validate_tool_call_arguments(
             ),
         )
 
+    if tool_name.strip().lower() == "bash":
+        command = parsed.get("command")
+        if isinstance(command, str):
+            cleaned_command, had_protocol_lines = _strip_protocol_tag_only_lines(
+                command
+            )
+            if had_protocol_lines and not cleaned_command:
+                return ToolResponseIssue(
+                    kind="invalid_tool_args",
+                    reason="arguments for 'Bash' contained only protocol tag lines",
+                    retry_hint=(
+                        "Emit exactly one `Bash` tool call with a valid shell command in `arguments.command`. "
+                        "Do not include standalone XML/protocol tags."
+                    ),
+                )
+
     if _contains_tool_markup(parsed):
         return ToolResponseIssue(
             kind="invalid_tool_args",
@@ -2345,20 +2502,34 @@ def _is_malformed_tool_response(openai_resp: dict, anthropic_body: dict) -> bool
 
 
 def _build_malformed_retry_body(
-    openai_body: dict,
+    openai_body: dict,
+    anthropic_body: dict,
+    retry_hint: str = "",
+    tool_choice: str = "required",
+    attempt: int = 1,
+    total_attempts: int = 1,
 ) -> dict:
     retry_body = dict(openai_body)
     retry_body["stream"] = False
-    retry_body["tool_choice"] =
+    retry_body["tool_choice"] = tool_choice
     retry_body["temperature"] = PROXY_MALFORMED_TOOL_RETRY_TEMPERATURE
 
-
-
-        "content": (
+    if tool_choice == "required":
+        retry_instruction = (
             "Your previous response had invalid tool-call formatting. "
             "Respond with exactly one valid tool call using the provided tools. "
             "Do not output prose, markdown, XML tags, or schema snippets."
-        )
+        )
+    else:
+        retry_instruction = (
+            "Your previous response had invalid tool-call formatting. "
+            "If a tool is needed, emit exactly one valid tool call with strict JSON arguments. "
+            "If no tool is needed for this turn, return concise plain text with no protocol tags."
+        )
+
+    malformed_retry_instruction = {
+        "role": "user",
+        "content": retry_instruction,
     }
     existing_messages = retry_body.get("messages")
     if isinstance(existing_messages, list) and existing_messages:
@@ -2381,19 +2552,51 @@ def _build_malformed_retry_body(
     if PROXY_DISABLE_THINKING_ON_TOOL_TURNS:
         retry_body["enable_thinking"] = False
 
+    _apply_tool_call_grammar(retry_body, tool_choice=tool_choice)
+
     if retry_hint:
         repair_prompt = (
-            "[TOOL CALL REPAIR]\n"
+            f"[TOOL CALL REPAIR attempt {attempt}/{total_attempts}]\n"
             f"{retry_hint}\n"
-            "Return
+            "Return a valid response for this turn without protocol artifacts."
         )
         retry_messages = list(retry_body.get("messages", []))
-        retry_messages.append({"role": "
+        retry_messages.append({"role": "user", "content": repair_prompt})
         retry_body["messages"] = retry_messages
 
     return retry_body
 
 
+def _retry_tool_choice_for_attempt(
+    required_tool_choice: bool, attempt: int, total_attempts: int
+) -> str:
+    if not required_tool_choice:
+        return "auto"
+    if total_attempts <= 1:
+        return "required"
+    return "auto" if attempt == total_attempts - 1 else "required"
+
+
+def _build_safe_text_openai_response(openai_resp: dict, text: str) -> dict:
+    return {
+        "id": openai_resp.get("id", f"chatcmpl_{uuid.uuid4().hex[:12]}"),
+        "object": openai_resp.get("object", "chat.completion"),
+        "created": openai_resp.get("created", int(time.time())),
+        "model": openai_resp.get("model", "unknown"),
+        "choices": [
+            {
+                "index": 0,
+                "finish_reason": "stop",
+                "message": {
+                    "role": "assistant",
+                    "content": text,
+                },
+            }
+        ],
+        "usage": openai_resp.get("usage", {}),
+    }
+
+
 def _build_clean_guardrail_openai_response(openai_resp: dict) -> dict:
     return {
         "id": openai_resp.get("id", f"chatcmpl_{uuid.uuid4().hex[:12]}"),
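The retry policy in `_retry_tool_choice_for_attempt` is easiest to read as a schedule over attempts. The sketch below restates the same branching in a standalone function and lists the resulting `tool_choice` per attempt; `retry_tool_choice` and `schedule` are hypothetical names for this example, and the attempt index is assumed to be 0-based because the guardrail loop passes the `range(attempts)` counter directly.

```python
def retry_tool_choice(required_tool_choice: bool, attempt: int, total_attempts: int) -> str:
    """Same branching as `_retry_tool_choice_for_attempt` (attempt is 0-based)."""
    if not required_tool_choice:
        return "auto"
    if total_attempts <= 1:
        return "required"
    return "auto" if attempt == total_attempts - 1 else "required"


def schedule(required_tool_choice: bool, total_attempts: int) -> list[str]:
    """List the tool_choice used on each retry attempt, in order."""
    return [
        retry_tool_choice(required_tool_choice, attempt, total_attempts)
        for attempt in range(total_attempts)
    ]


print(schedule(True, 3))   # ['required', 'required', 'auto']
print(schedule(True, 1))   # ['required']
print(schedule(False, 2))  # ['auto', 'auto']
```

When several retries are configured, the final attempt relaxes to `auto`, which pairs with the alternate retry instruction in `_build_malformed_retry_body` that allows a plain-text answer when no tool is needed.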
@@ -2437,6 +2640,7 @@ async def _apply_unexpected_end_turn_guardrail(
|
|
|
2437
2640
|
retry_body = dict(openai_body)
|
|
2438
2641
|
retry_body["tool_choice"] = "required"
|
|
2439
2642
|
retry_body["stream"] = False
|
|
2643
|
+
_apply_tool_call_grammar(retry_body, tool_choice="required")
|
|
2440
2644
|
|
|
2441
2645
|
retry_resp = await client.post(
|
|
2442
2646
|
f"{LLAMA_CPP_BASE}/chat/completions",
|
|
@@ -2486,7 +2690,8 @@ async def _apply_malformed_tool_guardrail(
|
|
|
2486
2690
|
working_resp, required_repairs = _repair_required_tool_args(
|
|
2487
2691
|
working_resp, anthropic_body
|
|
2488
2692
|
)
|
|
2489
|
-
|
|
2693
|
+
working_resp, bash_repairs = _repair_bash_command_artifacts(working_resp)
|
|
2694
|
+
repair_count = markup_repairs + required_repairs + bash_repairs
|
|
2490
2695
|
|
|
2491
2696
|
required_tool_choice = openai_body.get("tool_choice") == "required"
|
|
2492
2697
|
has_tool_calls = _openai_has_tool_calls(working_resp)
|
|
@@ -2536,10 +2741,18 @@ async def _apply_malformed_tool_guardrail(
     attempts = max(0, PROXY_MALFORMED_TOOL_RETRY_MAX)
     current_issue = issue
     for attempt in range(attempts):
+        attempt_tool_choice = _retry_tool_choice_for_attempt(
+            required_tool_choice,
+            attempt,
+            attempts,
+        )
         retry_body = _build_malformed_retry_body(
             openai_body,
             anthropic_body,
             retry_hint=current_issue.retry_hint,
+            tool_choice=attempt_tool_choice,
+            attempt=attempt + 1,
+            total_attempts=attempts,
         )
         retry_resp = await client.post(
             f"{LLAMA_CPP_BASE}/chat/completions",
@@ -2563,7 +2776,14 @@ async def _apply_malformed_tool_guardrail(
         retry_working, retry_required_repairs = _repair_required_tool_args(
             retry_working, anthropic_body
         )
-
+        retry_working, retry_bash_repairs = _repair_bash_command_artifacts(
+            retry_working
+        )
+        retry_repairs = (
+            retry_markup_repairs + retry_required_repairs + retry_bash_repairs
+        )
+
+        working_resp = retry_working
 
         retry_has_tool_calls = _openai_has_tool_calls(retry_working)
         retry_required = retry_body.get("tool_choice") == "required"
@@ -2620,6 +2840,17 @@ async def _apply_malformed_tool_guardrail(
         monitor.invalid_tool_call_streak,
         monitor.required_tool_miss_streak,
     )
+
+    degraded_text = _sanitize_tool_call_apology_text(
+        _openai_message_text(working_resp)
+    ).strip()
+    if degraded_text and not _looks_malformed_tool_payload(degraded_text):
+        logger.warning(
+            "TOOL RESPONSE degrade: session=%s returning safe text fallback after retry exhaustion",
+            session_id,
+        )
+        return _build_safe_text_openai_response(working_resp, degraded_text)
+
     return _build_clean_guardrail_openai_response(working_resp)
 
 
@@ -2720,6 +2951,18 @@ def openai_to_anthropic_response(openai_resp: dict, model: str) -> dict:
                 args = json.loads(fn.get("arguments", "{}"))
             except json.JSONDecodeError:
                 args = {}
+            if fn.get("name", "").strip().lower() == "bash" and isinstance(args, dict):
+                command = args.get("command")
+                if isinstance(command, str):
+                    cleaned_command, had_protocol_lines = _strip_protocol_tag_only_lines(
+                        command
+                    )
+                    if had_protocol_lines:
+                        args = dict(args)
+                        args["command"] = cleaned_command
+                        logger.warning(
+                            "BASH SAFETY: stripped standalone protocol-tag lines from command before tool execution"
+                        )
             content.append(
                 {
                     "type": "tool_use",
@@ -3564,6 +3807,12 @@ async def context_status(request: Request):
         "overflow_count": monitor.overflow_count,
         "prune_threshold": PROXY_CONTEXT_PRUNE_THRESHOLD,
         "recent_history": monitor.context_history[-10:],
+        "tool_call_grammar": {
+            "enabled": PROXY_TOOL_CALL_GRAMMAR,
+            "required_only": PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY,
+            "path": PROXY_TOOL_CALL_GRAMMAR_PATH,
+            "loaded": bool(TOOL_CALL_GBNF),
+        },
         # Loop protection stats
         "loop_protection": {
             "enabled": PROXY_LOOP_BREAKER,
@@ -487,6 +487,68 @@ class TestMalformedToolGuardrail(unittest.TestCase):
             setattr(proxy, "PROXY_MALFORMED_TOOL_RETRY_TEMPERATURE", old_temp)
             setattr(proxy, "PROXY_DISABLE_THINKING_ON_TOOL_TURNS", old_disable)
 
+    def test_malformed_retry_body_appends_retry_hint_as_user_message(self):
+        openai_body = {
+            "model": "test",
+            "messages": [{"role": "user", "content": "fix"}],
+        }
+        anthropic_body = {
+            "tools": [{"name": "Read", "input_schema": {"type": "object"}}]
+        }
+
+        retry = proxy._build_malformed_retry_body(
+            openai_body,
+            anthropic_body,
+            retry_hint="Use strict JSON",
+            tool_choice="required",
+            attempt=1,
+            total_attempts=2,
+        )
+
+        self.assertEqual(retry["messages"][-1]["role"], "user")
+        self.assertIn("TOOL CALL REPAIR attempt 1/2", retry["messages"][-1]["content"])
+
+    def test_retry_ladder_releases_last_attempt_to_auto(self):
+        self.assertEqual(proxy._retry_tool_choice_for_attempt(True, 0, 3), "required")
+        self.assertEqual(proxy._retry_tool_choice_for_attempt(True, 1, 3), "required")
+        self.assertEqual(proxy._retry_tool_choice_for_attempt(True, 2, 3), "auto")
+        self.assertEqual(proxy._retry_tool_choice_for_attempt(False, 0, 3), "auto")
+
+    def test_malformed_retry_body_applies_grammar_only_for_required_tool_choice(self):
+        old_enabled = getattr(proxy, "PROXY_TOOL_CALL_GRAMMAR")
+        old_required_only = getattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY")
+        old_grammar = getattr(proxy, "TOOL_CALL_GBNF")
+        try:
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR", True)
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY", True)
+            setattr(proxy, "TOOL_CALL_GBNF", 'root ::= "<tool_call>"')
+
+            openai_body = {
+                "model": "test",
+                "messages": [{"role": "user", "content": "fix"}],
+            }
+            anthropic_body = {
+                "tools": [{"name": "Read", "input_schema": {"type": "object"}}]
+            }
+
+            required_retry = proxy._build_malformed_retry_body(
+                openai_body,
+                anthropic_body,
+                tool_choice="required",
+            )
+            auto_retry = proxy._build_malformed_retry_body(
+                openai_body,
+                anthropic_body,
+                tool_choice="auto",
+            )
+
+            self.assertEqual(required_retry.get("grammar"), 'root ::= "<tool_call>"')
+            self.assertNotIn("grammar", auto_retry)
+        finally:
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR", old_enabled)
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY", old_required_only)
+            setattr(proxy, "TOOL_CALL_GBNF", old_grammar)
+
     def test_clean_guardrail_response_does_not_promise_future_tool_call(self):
         guardrail = proxy._build_clean_guardrail_openai_response(
             {"model": "test-model"}
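Reviewer note: the four assertions in `test_retry_ladder_releases_last_attempt_to_auto` pin down a simple ladder: keep `tool_choice="required"` on every retry except the last, and never force it when the original request did not require a tool. The sketch below is one function that satisfies those assertions. It is not the proxy's actual `_retry_tool_choice_for_attempt`; the name `retry_tool_choice_for_attempt` is hypothetical, and its behavior for a single-attempt budget is an assumption the tests do not cover.

```python
def retry_tool_choice_for_attempt(
    required_tool_choice: bool, attempt: int, total_attempts: int
) -> str:
    """Hypothetical stand-in consistent with the retry-ladder test.

    `attempt` is zero-based, matching `for attempt in range(attempts)`.
    """
    if not required_tool_choice:
        # The original request never forced a tool call, so retries stay on auto.
        return "auto"
    # Assumption: the final attempt always releases to auto, even when
    # total_attempts == 1. The test only exercises total_attempts == 3.
    if attempt >= total_attempts - 1:
        return "auto"
    return "required"


ladder = [retry_tool_choice_for_attempt(True, i, 3) for i in range(3)]
```

Releasing the last attempt gives the model one chance to answer in plain text, rather than producing another malformed forced tool call.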
@@ -772,6 +834,34 @@ class TestMalformedToolGuardrail(unittest.TestCase):
         )
         self.assertEqual(args["command"], "ls")
 
+    def test_bash_command_repair_strips_protocol_tag_only_lines(self):
+        openai_resp = {
+            "choices": [
+                {
+                    "finish_reason": "tool_calls",
+                    "message": {
+                        "content": "",
+                        "tool_calls": [
+                            {
+                                "id": "call_1",
+                                "function": {
+                                    "name": "Bash",
+                                    "arguments": '{"command":"pwd\\n</function>\\n<tool_call>"}',
+                                },
+                            }
+                        ],
+                    },
+                }
+            ]
+        }
+
+        repaired, count = proxy._repair_bash_command_artifacts(openai_resp)
+        self.assertEqual(count, 1)
+        args = json.loads(
+            repaired["choices"][0]["message"]["tool_calls"][0]["function"]["arguments"]
+        )
+        self.assertEqual(args["command"], "pwd")
+
     def test_guardrail_accepts_repaired_markup_without_fallback(self):
         old_retry = getattr(proxy, "PROXY_MALFORMED_TOOL_RETRY_MAX")
         try:
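Reviewer note: the bash repair test expects a command of `pwd` followed by lines holding only `</function>` and `<tool_call>` to come back as plain `pwd`. A minimal sketch of that filtering step follows. It is not the proxy's `_strip_protocol_tag_only_lines`; the helper name and the set of tag names treated as protocol markup are assumptions for illustration.

```python
import re
from typing import Tuple

# Assumption: these tag names are the protocol markup worth stripping.
# The real proxy may recognize a different set.
_TAG_ONLY_LINE = re.compile(r"^\s*</?(?:tool_call|function|parameter)\b[^>]*>\s*$")


def strip_protocol_tag_only_lines(command: str) -> Tuple[str, bool]:
    """Drop lines that consist solely of a protocol tag.

    Returns the cleaned command and whether anything was removed.
    Lines that merely mention a tag inside a real command are kept.
    """
    lines = command.split("\n")
    kept = [line for line in lines if not _TAG_ONLY_LINE.match(line)]
    if len(kept) == len(lines):
        return command, False
    return "\n".join(kept).strip(), True


cleaned, changed = strip_protocol_tag_only_lines("pwd\n</function>\n<tool_call>")
```

Matching whole lines only is the conservative choice here: a command such as `echo "<tool_call>"` passes through untouched.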
@@ -1214,6 +1304,81 @@ class TestToolTurnControls(unittest.TestCase):
             setattr(proxy, "PROXY_FORCED_TOOL_DAMPENER_REJECTIONS", old_rejections)
             setattr(proxy, "PROXY_FORCED_TOOL_DAMPENER_AUTO_TURNS", old_auto_turns)
 
+    def test_build_request_applies_grammar_when_tool_choice_required(self):
+        old_enabled = getattr(proxy, "PROXY_TOOL_CALL_GRAMMAR")
+        old_required_only = getattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY")
+        old_grammar = getattr(proxy, "TOOL_CALL_GBNF")
+        try:
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR", True)
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY", True)
+            setattr(proxy, "TOOL_CALL_GBNF", 'root ::= "<tool_call>"')
+
+            body = {
+                "model": "test",
+                "messages": [
+                    {
+                        "role": "assistant",
+                        "content": [{"type": "text", "text": "I will continue."}],
+                    },
+                    {"role": "user", "content": "continue"},
+                ],
+                "tools": [
+                    {
+                        "name": "Read",
+                        "description": "Read file",
+                        "input_schema": {"type": "object"},
+                    }
+                ],
+            }
+
+            openai = proxy.build_openai_request(
+                body, proxy.SessionMonitor(context_window=262144)
+            )
+            self.assertEqual(openai.get("tool_choice"), "required")
+            self.assertEqual(openai.get("grammar"), 'root ::= "<tool_call>"')
+        finally:
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR", old_enabled)
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY", old_required_only)
+            setattr(proxy, "TOOL_CALL_GBNF", old_grammar)
+
+    def test_build_request_omits_grammar_when_tool_choice_released_to_auto(self):
+        old_enabled = getattr(proxy, "PROXY_TOOL_CALL_GRAMMAR")
+        old_required_only = getattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY")
+        old_grammar = getattr(proxy, "TOOL_CALL_GBNF")
+        try:
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR", True)
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY", True)
+            setattr(proxy, "TOOL_CALL_GBNF", 'root ::= "<tool_call>"')
+
+            monitor = proxy.SessionMonitor(context_window=262144)
+            monitor.forced_auto_cooldown_turns = 1
+
+            body = {
+                "model": "test",
+                "messages": [
+                    {
+                        "role": "assistant",
+                        "content": [{"type": "text", "text": "I will continue."}],
+                    },
+                    {"role": "user", "content": "continue"},
+                ],
+                "tools": [
+                    {
+                        "name": "Read",
+                        "description": "Read file",
+                        "input_schema": {"type": "object"},
+                    }
+                ],
+            }
+
+            openai = proxy.build_openai_request(body, monitor)
+            self.assertEqual(openai.get("tool_choice"), "auto")
+            self.assertNotIn("grammar", openai)
+        finally:
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR", old_enabled)
+            setattr(proxy, "PROXY_TOOL_CALL_GRAMMAR_REQUIRED_ONLY", old_required_only)
+            setattr(proxy, "TOOL_CALL_GBNF", old_grammar)
+
     def test_no_tools_does_not_inject_agentic_system_message(self):
         body = {
             "model": "test",
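Reviewer note: the two `build_openai_request` tests describe a gate. With the grammar feature enabled and `REQUIRED_ONLY` set, the request carries a `grammar` field only when `tool_choice` resolves to `"required"`, and omits it once the dampener releases the turn to `"auto"`. The sketch below shows one way such a gate could look. The function name and keyword parameters are hypothetical, since the proxy reads module-level settings instead. What happens when `required_only` is false is an assumption the tests do not exercise.

```python
def apply_tool_call_grammar(
    body: dict,
    tool_choice: str,
    *,
    enabled: bool,
    required_only: bool,
    gbnf: str,
) -> bool:
    """Hypothetical gate: attach a GBNF grammar to an OpenAI-style request body.

    Mutates `body` in place and returns True when the grammar was attached.
    """
    if not enabled or not gbnf:
        # Feature switched off, or no grammar text was loaded.
        return False
    if required_only and tool_choice != "required":
        # Leave auto turns unconstrained so plain-text answers remain possible.
        return False
    # Assumption: with required_only=False the grammar applies to every tool turn.
    body["grammar"] = gbnf
    return True


GBNF = 'root ::= "<tool_call>"'
required_body = {"model": "test", "tool_choice": "required"}
auto_body = {"model": "test", "tool_choice": "auto"}
applied_required = apply_tool_call_grammar(
    required_body, "required", enabled=True, required_only=True, gbnf=GBNF
)
applied_auto = apply_tool_call_grammar(
    auto_body, "auto", enabled=True, required_only=True, gbnf=GBNF
)
```

Limiting the grammar to required turns avoids constraining turns where the model may legitimately reply without calling a tool.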
@@ -1290,6 +1455,38 @@ class TestToolTurnControls(unittest.TestCase):
             setattr(proxy, "PROXY_ANALYSIS_ONLY_MIN_TOOLS", old_min_tools)
             setattr(proxy, "PROXY_ANALYSIS_ONLY_MAX_MESSAGES", old_max_messages)
 
+    def test_analysis_only_route_does_not_treat_implementation_as_action(self):
+        old_route = getattr(proxy, "PROXY_ANALYSIS_ONLY_ROUTE")
+        old_min_tools = getattr(proxy, "PROXY_ANALYSIS_ONLY_MIN_TOOLS")
+        old_max_messages = getattr(proxy, "PROXY_ANALYSIS_ONLY_MAX_MESSAGES")
+        try:
+            setattr(proxy, "PROXY_ANALYSIS_ONLY_ROUTE", True)
+            setattr(proxy, "PROXY_ANALYSIS_ONLY_MIN_TOOLS", 4)
+            setattr(proxy, "PROXY_ANALYSIS_ONLY_MAX_MESSAGES", 2)
+
+            body = {
+                "messages": [
+                    {
+                        "role": "user",
+                        "content": "analyze implementation options and summarize tradeoffs",
+                    }
+                ],
+                "tools": [
+                    {"name": "Read", "input_schema": {"type": "object"}},
+                    {"name": "Edit", "input_schema": {"type": "object"}},
+                    {"name": "Write", "input_schema": {"type": "object"}},
+                    {"name": "Bash", "input_schema": {"type": "object"}},
+                ],
+            }
+
+            updated, removed = proxy._maybe_route_analysis_without_tools(body)
+            self.assertEqual(removed, 4)
+            self.assertNotIn("tools", updated)
+        finally:
+            setattr(proxy, "PROXY_ANALYSIS_ONLY_ROUTE", old_route)
+            setattr(proxy, "PROXY_ANALYSIS_ONLY_MIN_TOOLS", old_min_tools)
+            setattr(proxy, "PROXY_ANALYSIS_ONLY_MAX_MESSAGES", old_max_messages)
+
 
 class TestSessionContaminationBreaker(unittest.TestCase):
     def test_contamination_breaker_trims_and_resets_streak(self):