@miller-tech/uap 1.39.0 → 1.40.1

Files changed (99)
  1. package/README.md +109 -642
  2. package/dist/.tsbuildinfo +1 -1
  3. package/dist/bin/cli.js +2 -2
  4. package/dist/bin/cli.js.map +1 -1
  5. package/dist/cli/deliver.d.ts +3 -2
  6. package/dist/cli/deliver.d.ts.map +1 -1
  7. package/dist/cli/deliver.js +10 -5
  8. package/dist/cli/deliver.js.map +1 -1
  9. package/docs/INDEX.md +48 -286
  10. package/docs/architecture/OVERVIEW.md +328 -0
  11. package/docs/architecture/PROTOCOL.md +204 -0
  12. package/docs/benchmarks/README.md +17 -192
  13. package/docs/getting-started/CONFIGURATION.md +237 -0
  14. package/docs/getting-started/INSTALLATION.md +125 -0
  15. package/docs/getting-started/QUICKSTART.md +115 -0
  16. package/docs/guides/COORDINATION.md +162 -0
  17. package/docs/guides/DELIVER.md +115 -0
  18. package/docs/guides/DEPLOY_BATCHING.md +212 -0
  19. package/docs/guides/DROIDS_AND_SKILLS.md +202 -0
  20. package/docs/guides/LOCAL_MODELS.md +148 -0
  21. package/docs/guides/MCP_ROUTER.md +195 -0
  22. package/docs/guides/MEMORY.md +235 -0
  23. package/docs/guides/MULTI_MODEL.md +223 -0
  24. package/docs/guides/POLICIES.md +190 -0
  25. package/docs/guides/WORKTREE_WORKFLOW.md +185 -0
  26. package/docs/integrations/MCP_ROUTER.md +147 -0
  27. package/docs/integrations/RTK.md +102 -0
  28. package/docs/reference/API.md +485 -0
  29. package/docs/reference/CLI.md +719 -0
  30. package/docs/reference/CONFIGURATION.md +90 -193
  31. package/docs/reference/DATABASE_SCHEMA.md +110 -344
  32. package/docs/reference/FEATURES.md +176 -472
  33. package/docs/reference/PATTERNS.md +102 -0
  34. package/docs/reference/PLATFORMS.md +83 -0
  35. package/package.json +1 -1
  36. package/docs/AGENTS.md +0 -423
  37. package/docs/DOCUMENTATION_AUDIT_REPORT.md +0 -131
  38. package/docs/GETTING_STARTED.md +0 -288
  39. package/docs/PROJECT_ANALYSIS_REPORT.md +0 -510
  40. package/docs/architecture/COMPLETE_ARCHITECTURE.md +0 -748
  41. package/docs/architecture/EXPERT_STACK.md +0 -137
  42. package/docs/architecture/MULTI_MODEL.md +0 -224
  43. package/docs/architecture/PLATFORM_GATING.md +0 -68
  44. package/docs/architecture/SYSTEM_ANALYSIS.md +0 -334
  45. package/docs/architecture/UAP_COMPLIANCE.md +0 -217
  46. package/docs/architecture/UAP_PROTOCOL.md +0 -339
  47. package/docs/architecture/UAP_STRICT_DROIDS.md +0 -172
  48. package/docs/archive/BALLS_MODE_SELF_ANALYSIS.md +0 -260
  49. package/docs/archive/BENCHMARK_GAPS_AND_PLAN.md +0 -146
  50. package/docs/archive/FAILING_TASKS_SOLUTION_PLAN.md +0 -668
  51. package/docs/archive/JINJA2-SYSTEM-MESSAGE-FIX.md +0 -209
  52. package/docs/archive/MODEL_ROUTING_IMPLEMENTATION_SUMMARY.md +0 -281
  53. package/docs/archive/MODEL_ROUTING_OPTIMIZATION_PLAN.md +0 -320
  54. package/docs/archive/NPM-PUBLISH-V0.9.1.md +0 -240
  55. package/docs/archive/OPTIMIZATION_OPTIONS.md +0 -334
  56. package/docs/archive/PARALLELISM_GAPS_AND_OPTIONS.md +0 -422
  57. package/docs/archive/POLICY_GATE_IMPLEMENTATION.md +0 -245
  58. package/docs/archive/SETUP_IMPROVEMENTS.md +0 -213
  59. package/docs/archive/UAP_GENERIC_OPTIMIZATION_PLAN.md +0 -270
  60. package/docs/archive/UAP_OPTIMIZATION_PLAN.md +0 -701
  61. package/docs/archive/UAP_V103_PATTERN_DESIGN.md +0 -315
  62. package/docs/archive/UAP_V104_COMPLIANCE_DESIGN.md +0 -223
  63. package/docs/archive/changelog/2026-03-10_uap-100-compliance.md +0 -77
  64. package/docs/archive/changelog/2026-03-10_uap-full-system-verification.md +0 -109
  65. package/docs/archive/opencode-integration-guide.md +0 -740
  66. package/docs/archive/opencode-integration-quickref.md +0 -180
  67. package/docs/benchmarks/OVERNIGHT_RUNNER.md +0 -341
  68. package/docs/benchmarks/SPECULATIVE_DECODING_JOURNEY_2026-03.md +0 -221
  69. package/docs/benchmarks/VALIDATION_PLAN.md +0 -568
  70. package/docs/blog/SPECULATIVE_DECODING_PRODUCTION_PLAYBOOK.md +0 -139
  71. package/docs/blog/local-coding-agents.md +0 -266
  72. package/docs/blog/x-thread.md +0 -254
  73. package/docs/deployment/DEPLOYMENT.md +0 -895
  74. package/docs/deployment/DEPLOYMENT_STRATEGIES.md +0 -518
  75. package/docs/deployment/DEPLOY_BATCHER_ANALYSIS.md +0 -224
  76. package/docs/deployment/DEPLOY_BATCHING.md +0 -273
  77. package/docs/deployment/DEPLOY_BUCKETING_ANALYSIS.md +0 -420
  78. package/docs/deployment/QWEN35_LLAMA_CPP.md +0 -426
  79. package/docs/deployment/UAP_LLAMA_ANTHROPIC_PROXY_BOOTSTRAP.md +0 -279
  80. package/docs/getting-started/INTEGRATION.md +0 -628
  81. package/docs/getting-started/OVERVIEW.md +0 -324
  82. package/docs/getting-started/SETUP.md +0 -377
  83. package/docs/integrations/MCP_ROUTER_SETUP.md +0 -445
  84. package/docs/integrations/RTK_INTEGRATION.md +0 -468
  85. package/docs/operations/TROUBLESHOOTING.md +0 -660
  86. package/docs/pr/PR_SPECULATIVE_DOCS_TEMPLATE.md +0 -146
  87. package/docs/pr/UPSTREAM_PRS.md +0 -424
  88. package/docs/reference/API_REFERENCE.md +0 -903
  89. package/docs/reference/EXPERT_DROIDS.md +0 -219
  90. package/docs/reference/HARNESS-MATRIX.md +0 -318
  91. package/docs/reference/PATTERN_LIBRARY.md +0 -636
  92. package/docs/reference/UAP_CLI_REFERENCE.md +0 -620
  93. package/docs/research/BEHAVIORAL_PATTERNS.md +0 -228
  94. package/docs/research/DOMAIN_STRATEGIES.md +0 -316
  95. package/docs/research/MEMORY_SYSTEMS_COMPARISON.md +0 -812
  96. package/docs/research/PATTERN_ANALYSIS_2026-01-18.md +0 -436
  97. package/docs/research/PERFORMANCE_ANALYSIS_2026-01-18.md +0 -209
  98. package/docs/research/PERFORMANCE_TEST_PLAN.md +0 -383
  99. package/docs/research/TERMINAL_BENCH_LEARNINGS.md +0 -217
@@ -1,146 +0,0 @@
1
- ## Title
2
-
3
- docs: add speculative decoding production playbook and agentic compatibility guidance
4
-
5
- ## Context
6
-
7
- `docs/speculative.md` explains speculative mechanisms and flags, but production operators also need:
8
-
9
- - workload-driven profile selection,
10
- - reproducible benchmarking protocol,
11
- - signature-based regression triage,
12
- - guidance for stream+tools agentic environments.
13
-
14
- This PR adds operational documentation to reduce drift between benchmark wins and real-session behavior.
15
-
16
- ## Changes
17
-
18
- ### Add new guide
19
-
20
- - New: `docs/speculative-production.md`
-   - implementation matrix:
-     - `draft`
-     - `ngram-cache`
-     - `ngram-simple`
-     - `ngram-map-k`
-     - `ngram-map-k4v`
-     - `ngram-mod`
-   - decision tree by workload (coding, repetitive transform, mixed)
-   - benchmark protocol (run counts, warmup, prompt classes, metrics)
-   - troubleshooting by signature:
-     - `find_slot: non-consecutive token position`
-     - low acceptance + high rollback
-     - throughput collapse after commit switch
-   - rollout rules (canary, promotion threshold, rollback triggers)
35
-
36
- ### Update existing speculative docs
37
-
38
- - Update `docs/speculative.md`:
-   - add link to production guide
-   - add "how to interpret statistics in practice"
-   - add "workload sensitivity and reproducibility notes"
42
-
43
- ### Add compatibility appendix
44
-
45
- - New appendix (or linked page): stream+tools compatibility for proxy-mediated agentic flows
-   - fallback policy guidance (`off` default for production)
-   - malformed stream/tool guardrail behavior
-   - max token floor and prune target recommendations
49
-
50
- ## Why
51
-
52
- Speculative decoding quality in agentic coding depends on end-to-end behavior, including transport and stream tool-loop handling. This documentation closes that gap and provides a repeatable operator path.
53
-
54
- ## Validation Plan
55
-
56
- - Verify all CLI flags/options in examples against current `llama-server`.
57
- - Verify all linked scripts/docs paths resolve.
58
- - Include one benchmark table with:
-   - decode/prefill throughput
-   - acceptance indicators
-   - latency percentiles
-   - workload class labels
63
-
64
- ## Risks
65
-
66
- - Overfitting recommendations to one model/hardware class.
67
- - Treating proxy behavior as universally required.
68
-
69
- ## Mitigations
70
-
71
- - Mark all profile recommendations as workload/hardware sensitive.
72
- - Separate "safe baseline" from "aggressive benchmark-only" profiles.
73
- - Require local A/B validation before rollout.
74
-
75
- ## Out of Scope
76
-
77
- - Runtime code changes
78
- - Kernel-level speculative optimization changes
79
- - Proxy implementation changes (docs-only PR)
80
-
81
- ## Follow-ups
82
-
83
- 1. Add nightly speculative regression harness.
84
- 2. Publish benchmark JSON schema for machine comparison.
85
- 3. Add commit-lineage tracking for performance regressions.
86
-
87
- ---
88
-
89
- ## Ready-to-Submit GitHub PR Body
90
-
91
- ### Summary
92
-
93
- This docs PR adds a production-oriented speculative decoding playbook for llama.cpp users running real multi-turn workloads (especially agentic/tool-call scenarios). It complements existing mechanism-level docs with actionable tuning, troubleshooting, and rollout guidance.
94
-
95
- ### What Changed
96
-
97
- - Added `docs/speculative-production.md` (new operational guide)
-   - implementation selection matrix
-   - workload-based decision tree
-   - benchmark protocol + required metrics
-   - troubleshooting by real log signatures
-   - canary/rollback rollout guidance
- - Updated `docs/speculative.md`
-   - links to production guide
-   - practical stats interpretation notes
-   - workload sensitivity notes
- - Added/linked "agentic stream+tools compatibility" appendix
-   - fallback policy defaults
-   - malformed stream/tool guardrails
-   - token-floor/prune guidance
111
-
112
- ### Why
113
-
114
- Current docs describe speculative decoding internals clearly, but production operators need a reproducible way to:
115
-
116
- - choose stable profiles by workload,
117
- - detect/triage regressions quickly,
118
- - avoid benchmark-only wins that fail in long sessions.
119
-
120
- ### Reviewer Guide
121
-
122
- Please focus review on:
123
-
124
- 1. Accuracy of CLI flags and option names.
125
- 2. Correctness of troubleshooting signatures and interpretations.
126
- 3. Clarity of benchmark protocol (can another team reproduce it?).
127
- 4. Whether safe-vs-aggressive profile separation is clear enough.
128
-
129
- ### Validation
130
-
131
- - [ ] Command examples verified against current `llama-server --help`
132
- - [ ] Linked docs/scripts paths validated
133
- - [ ] Benchmark table includes workload class labels
134
- - [ ] Metrics include decode/prefill throughput + latency percentile view
135
- - [ ] No runtime behavior claims without explicit caveats
136
-
137
- ### Risks / Caveats
138
-
139
- - Recommendations are model/hardware/workload dependent.
140
- - Guidance is operational, not a substitute for local A/B testing.
141
-
142
- ### Follow-ups
143
-
144
- - [ ] Add nightly regression harness for speculative profiles
145
- - [ ] Publish machine-readable benchmark schema
146
- - [ ] Add commit lineage references in benchmark artifacts
@@ -1,424 +0,0 @@
1
- # UAP Upstream PR Plan
2
-
3
- 5 PRs covering the session stickiness bug, loop protection hardening, per-request spec control, OpenAI-compat endpoint, and the policy engine.
4
-
5
- ## Dependency graph
6
-
7
- ```
- PR 1 (session fingerprinting) ── CRITICAL ──► enables PR 2, PR 5
- PR 2 (loop protection) ── depends on PR 1
- PR 3 (spec decoding control) ── independent
- PR 4 (OpenAI /v1/chat/completions) ── depends on PR 2 (via guardrails)
- PR 5 (policy engine) ── depends on PR 1 + PR 2
- ```
14
-
15
- ---
16
-
17
- ## PR 1 — `proxy: stable session fingerprinting`
18
-
19
- **Scope:** Critical bug fix
20
- **Files:** `tools/agents/scripts/anthropic_proxy.py`
21
- **Risk:** Low — pure fix, no new surface area
22
- **Priority:** Highest — every stateful guardrail depends on this
23
-
24
- ### Problem
25
-
26
- Session fingerprints were hashed from `remote | model | system | first_user_content`. Two inputs were volatile:
27
-
28
- 1. **`tool_use_id`** values in tool_result blocks — random UUIDs regenerated per turn. `_content_fingerprint` included `f"result:{block.get('tool_use_id', '')}"` in the hash.
29
- 2. **`system` prompt** — clients inject volatile context (timestamps, cwd, session markers) into system prompts.
30
-
31
- Result: **every single request got a different session ID** → every request spawned a fresh `SessionMonitor` → every stateful guardrail (cycle detection, forced_budget, review_cycles, finalize_hard_stop, unproductive_exhaustion_streak) was effectively stateless per-request.
32
-
33
- This silently broke every loop protection mechanism ever built on top of the session monitor.
34
-
35
- ### Diagnostic evidence
36
-
37
- After adding session ID logging:
38
-
39
- ```
- sess=fp:9c8f26a802f9f4739f18 msgs=79
- sess=fp:b801857a9e49e21a6599 msgs=81
- sess=fp:aeef638954a390ef7aec msgs=83
- sess=fp:16f908db2e478f31cb91 msgs=85
- ```
45
-
46
- Every request got a new session ID. `session_count: 35` after 35 requests on what should have been one session.
47
-
48
- ### Fix
49
-
50
- 1. `_content_fingerprint` uses stable content excerpt (`result:<first 64 chars>`) instead of `tool_use_id`
51
- 2. `resolve_session_id` hashes only the first user message's **text content**, excludes `system` prompt entirely
52
-
53
- ```python
- def resolve_session_id(request: Request, anthropic_body: dict) -> str:
-     # ... header-based lookup unchanged ...
-
-     first_user = ""
-     for msg in anthropic_body.get("messages", []):
-         if msg.get("role") == "user":
-             content = msg.get("content", "")
-             if isinstance(content, str):
-                 first_user = content[:512]
-             elif isinstance(content, list):
-                 text_parts = [
-                     b.get("text", "") for b in content
-                     if isinstance(b, dict) and b.get("type") == "text"
-                 ]
-                 first_user = "\n".join(text_parts)[:512]
-             break
-
-     # Deliberately exclude `system` from fingerprint — clients inject
-     # volatile context (timestamps, cwd, session markers).
-     digest = hashlib.sha256(
-         f"{remote}|{model}|{first_user}".encode("utf-8", errors="ignore")
-     ).hexdigest()[:20]
-     return f"fp:{digest}"
- ```
78
-
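To illustrate the fix, here is a minimal, self-contained sketch of fingerprinting from stable inputs only. The helper names `fingerprint` and `first_user_text`, the sample request bodies, and the `"local"` model name are all hypothetical; only the hashing recipe (SHA-256 over `remote|model|first_user`, truncated to 20 hex characters) follows the snippet above.

```python
import hashlib


def first_user_text(body: dict) -> str:
    """Return the text of the first user message, ignoring non-text blocks."""
    for msg in body.get("messages", []):
        if msg.get("role") != "user":
            continue
        content = msg.get("content", "")
        if isinstance(content, str):
            return content
        return "\n".join(
            b.get("text", "") for b in content
            if isinstance(b, dict) and b.get("type") == "text"
        )
    return ""


def fingerprint(remote: str, model: str, first_user: str) -> str:
    """Hash only inputs that stay constant across turns (hypothetical helper)."""
    raw = f"{remote}|{model}|{first_user[:512]}"
    return "fp:" + hashlib.sha256(raw.encode("utf-8", errors="ignore")).hexdigest()[:20]


# Two turns of one conversation: the system prompt and tool_use_id change,
# but the first user message does not.
turn_1 = {"system": "cwd=/a ts=1",
          "messages": [{"role": "user", "content": "fix the bug"}]}
turn_2 = {"system": "cwd=/a ts=2",
          "messages": [
              {"role": "user", "content": [{"type": "text", "text": "fix the bug"}]},
              {"role": "assistant", "content": "ok"},
              {"role": "user", "content": [
                  {"type": "tool_result", "tool_use_id": "toolu_random", "content": "done"}]},
          ]}

fp1 = fingerprint("127.0.0.1", "local", first_user_text(turn_1))
fp2 = fingerprint("127.0.0.1", "local", first_user_text(turn_2))
```

Because neither `system` nor any `tool_use_id` feeds the hash, both turns resolve to the same session ID in this sketch.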
79
- ### Impact
80
-
81
- - Before: 1 request per session
82
- - After: 170+ requests on the same session (verified with Claude Code + OpenCode + Forge clients)
83
- - All downstream guardrails suddenly started working — no changes needed to them
84
-
85
- ### Add session ID logging
86
-
87
- The REQ line now includes `sess=` for diagnosis:
88
-
89
- ```
- REQ: client=remote:127.0.0.1 sess=fp:aa5169796b2c39c2a4a4 rate_60s=1 ...
- ```
92
-
93
- ### Tests
94
-
95
- - [ ] Unit test: same message with changing tool_use_ids → stable fingerprint
96
- - [ ] Unit test: same message with changing system timestamps → stable fingerprint
97
- - [ ] Integration test: 3 sequential requests on same conversation → same session_id
98
-
99
- ---
100
-
101
- ## PR 2 — `proxy: loop protection hardening`
102
-
103
- **Scope:** Medium — new counters + threshold gates
104
- **Files:** `anthropic_proxy.py`
105
- **Depends on:** PR 1 (counters only work with sticky sessions)
106
-
107
- ### Additions
108
-
109
- 1. **`tool_state_unproductive_exhaustion_streak`**
-    - Tracks consecutive `forced_budget_exhausted` events where NEITHER cycling NOR stagnation was detected
-    - After `PROXY_UNPRODUCTIVE_EXHAUSTION_LIMIT` (default 4), forces finalize
-    - Catches "distinct-but-unproductive tool spam" that defeats per-tool cycle detection
-
- 2. **`finalize_hard_stop_count`** (monotonic session-level)
-    - NOT reset by `fresh_user_text` / `inactive_loop` paths
-    - Incremented in BOTH:
-      - `_inject_synthetic_continuation` (synthetic continuation path)
-      - `state_choice == "finalize"` handler (tool-stripping path)
-    - When `>= PROXY_FINALIZE_SESSION_HARD_CAP` (default 6), synthetic continuation injection is blocked, natural end_turn passes through → client terminates loop cleanly
-
- 3. **`finalize_fired` flag in `_completion_blockers()`**
-    - When `finalize_hard_stop_count > 0`, suppresses `text_only_after_tool_results` blocker
-    - Prevents state machine from re-entering active loop after a finalize wraps up the work
-    - Was causing `finalize → review → cycle_detected → finalize → review → ...` infinite ping-pong
125
-
126
- ### New env vars
127
-
128
- ```
- PROXY_UNPRODUCTIVE_EXHAUSTION_LIMIT=4 # new
- PROXY_FINALIZE_SESSION_HARD_CAP=6 # new
- ```
132
-
133
- ### Tuned thresholds (tighter defaults)
134
-
135
- ```
- PROXY_LOOP_REPEAT_THRESHOLD=4 # was 10
- PROXY_FORCED_THRESHOLD=12 # was 18
- PROXY_NO_PROGRESS_THRESHOLD=3 # was 5
- PROXY_TOOL_STATE_STAGNATION_THRESHOLD=4 # was 8
- PROXY_TOOL_STATE_FINALIZE_THRESHOLD=8 # was 18
- PROXY_TOOL_STATE_REVIEW_CYCLE_LIMIT=5 # was 3 (relaxed after tuning)
- PROXY_TOOL_NARROWING_EXPAND_ON_LOOP=off # was on
- PROXY_TOOL_NARROWING_KEEP=8 # was 12
- ```
145
-
146
- ### Verification
147
-
148
- Real session that was previously looping indefinitely terminated cleanly:
149
- ```
- TOOL STATE MACHINE: 4 consecutive unproductive budget exhaustions — forcing finalize
- TOOL STATE MACHINE: phase review -> finalize reason=unproductive_exhaustion
- FINALIZE CONTINUATION: session hard cap reached (6/6) — not injecting, allowing termination
- ```
154
-
155
- Client received clean `end_turn`, started a fresh new task.
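The two counters behind this log sequence can be sketched as a small state object. This is an illustration only, not the proxy's code: the `LoopState` class and its method names are hypothetical, and resetting the streak when cycling or stagnation is detected is an assumption drawn from the word "consecutive" in the description above. The limits mirror the documented defaults.

```python
from dataclasses import dataclass

UNPRODUCTIVE_EXHAUSTION_LIMIT = 4  # mirrors PROXY_UNPRODUCTIVE_EXHAUSTION_LIMIT
FINALIZE_SESSION_HARD_CAP = 6      # mirrors PROXY_FINALIZE_SESSION_HARD_CAP


@dataclass
class LoopState:
    """Hypothetical per-session counters; names are illustrative."""
    unproductive_streak: int = 0
    finalize_hard_stop_count: int = 0  # monotonic for the whole session

    def on_budget_exhausted(self, cycling: bool, stagnating: bool) -> bool:
        """Return True when finalize should be forced."""
        if cycling or stagnating:
            # Assumption: other detectors own these cases, so the streak restarts.
            self.unproductive_streak = 0
            return False
        self.unproductive_streak += 1
        return self.unproductive_streak >= UNPRODUCTIVE_EXHAUSTION_LIMIT

    def on_finalize(self) -> bool:
        """Count a finalize; return True if a synthetic continuation may still be injected."""
        self.finalize_hard_stop_count += 1
        return self.finalize_hard_stop_count < FINALIZE_SESSION_HARD_CAP


state = LoopState()
forced = [state.on_budget_exhausted(cycling=False, stagnating=False) for _ in range(4)]
may_continue = [state.on_finalize() for _ in range(6)]
```

In this sketch the fourth unproductive exhaustion forces finalize, and the sixth finalize stops further continuation injection, which matches the `4 consecutive` and `6/6` figures in the log lines above.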
156
-
157
- ### Tests
158
-
159
- - [ ] Simulated loop: distinct tool calls with no context growth → triggers unproductive exhaustion
160
- - [ ] Simulated loop: same tool repeated → triggers per-tool cycle detection (existing)
161
- - [ ] Finalize → synthetic continuation → reset → new active loop → hard cap at 6 → natural termination
162
-
163
- ---
164
-
165
- ## PR 3 — `proxy: per-request speculative decoding control`
166
-
167
- **Scope:** Small, focused
168
- **Files:** `anthropic_proxy.py`, README
169
- **Risk:** Low
170
-
171
- ### Feature
172
-
173
- New env var `PROXY_DISABLE_SPEC_ON_TOOL_TURNS` (default off). When on, the proxy sets `openai_body["speculative.n_max"] = 0` on tool-turn requests, telling llama.cpp to skip the draft/spec path for that request only.
174
-
175
- ### Why
176
-
177
- Some models (observed: early Qwen3.5-35B-A3B Q4_K_M) produce garbled tool-call output under speculative decoding due to rejected-draft state leakage. Disabling spec on tool turns while keeping it on for plain chat gives the best of both worlds for unstable models. Stable models can leave this off and benefit from spec on every turn.
178
-
179
- ### Applied in two places
180
-
181
- 1. Main handler (`_build_openai_request` end)
182
- 2. Tool starvation breaker early-return path (so the flag is respected on both code paths)
183
-
184
- ```python
- if PROXY_DISABLE_SPEC_ON_TOOL_TURNS:
-     openai_body["speculative.n_max"] = 0
-     logger.info("Spec decoding disabled for tool turn (PROXY_DISABLE_SPEC_ON_TOOL_TURNS=on)")
- ```
189
-
190
- ### Relies on llama.cpp upstream support
191
-
192
- llama.cpp already supports per-request `speculative.n_max` in `server-task.cpp`:
193
- ```cpp
- params.speculative.n_max = json_value(data, "speculative.n_max", defaults.speculative.n_max);
- ```
196
-
197
- Setting it to 0 gates the entire draft path (`if (n_draft_max > 0)` in `server-context.cpp`).
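The gating described in this PR can be sketched as a pure function, which also makes the three test cases listed below easy to check. The function name `apply_spec_policy` and the `is_tool_turn` argument are hypothetical; how the proxy actually decides that a request is a tool turn is not shown in this plan. Only the `speculative.n_max` key comes from the text above.

```python
def apply_spec_policy(openai_body: dict, is_tool_turn: bool,
                      disable_on_tool_turns: bool) -> dict:
    """Return a copy of the body with the draft path disabled on tool turns only."""
    body = dict(openai_body)
    if disable_on_tool_turns and is_tool_turn:
        # n_max = 0 asks the server to skip speculative drafting for this request.
        body["speculative.n_max"] = 0
    return body


base = {"model": "local", "messages": []}
tool_turn_on = apply_spec_policy(base, is_tool_turn=True, disable_on_tool_turns=True)
chat_turn_on = apply_spec_policy(base, is_tool_turn=False, disable_on_tool_turns=True)
tool_turn_off = apply_spec_policy(base, is_tool_turn=True, disable_on_tool_turns=False)
```

Returning a copy rather than mutating the input is a choice made for the sketch so the three cases stay independent; the snippet in the plan mutates `openai_body` in place.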
198
-
199
- ### Tests
200
-
201
- - [ ] Tool-turn request with flag on → `speculative.n_max=0` in forwarded body
202
- - [ ] Non-tool request with flag on → no speculative field added
203
- - [ ] Flag off → no speculative field added regardless
204
-
205
- ---
206
-
207
- ## PR 4 — `proxy: fully guarded OpenAI /v1/chat/completions endpoint`
208
-
209
- **Scope:** Medium — new endpoint with full bidirectional conversion
210
- **Files:** `anthropic_proxy.py`
211
- **Depends on:** PR 2 (reuses the guardrail pipeline)
212
-
213
- ### Motivation
214
-
215
- Clients like **OpenCode**, **Forge**, **Cline**, and many LangChain-based agents expect OpenAI's `/v1/chat/completions` shape. The proxy previously only exposed `/v1/messages` (Anthropic shape), so these clients either:
216
- 1. Bypassed the proxy and talked directly to llama.cpp (no guardrails), OR
217
- 2. Couldn't use the proxy at all
218
-
219
- ### Approach
220
-
221
- Add `/v1/chat/completions` handler that:
222
- 1. Receives OpenAI-format request
223
- 2. Converts to Anthropic format (`openai_to_anthropic_request`)
224
- 3. Invokes the existing `messages()` handler via synthetic `Request` with Anthropic body
225
- 4. Converts the Anthropic response back to OpenAI format (`anthropic_to_openai_response`)
226
- 5. Returns to the client
227
-
228
- **All guardrails from the `/v1/messages` path apply automatically** — loop detection, tool narrowing, cycle breaking, malformed tool retry, context pruning, profile overrides, activation replay (llama.cpp side).
229
-
230
- ### Streaming
231
-
232
- Client stream requests are processed internally as non-stream through the Anthropic pipeline, then re-streamed as OpenAI SSE chunks:
233
-
234
- ```
- data: {"id":"msg_...","delta":{"role":"assistant"},...}
- data: {"id":"msg_...","delta":{"content":"..."},...}
- data: {"id":"msg_...","delta":{"tool_calls":[...]},...}
- data: {"id":"msg_...","delta":{},"finish_reason":"tool_calls"}
- data: [DONE]
- ```
241
-
242
- This sacrifices token-by-token streaming granularity in exchange for keeping all guardrails. The difference is invisible to most clients.
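A rough sketch of the re-streaming step, assuming the complete assistant message is already available. The function name `restream_as_sse` and the placeholder ID are hypothetical, and the chunk layout follows the public OpenAI `chat.completion.chunk` shape (delta nested under `choices[0]`) rather than anything confirmed about the proxy. Real OpenAI tool-call deltas also carry an `index` per call, which is omitted here.

```python
import json
from typing import Iterator, Optional


def restream_as_sse(message: dict, finish_reason: str,
                    msg_id: str = "msg_example") -> Iterator[str]:
    """Yield SSE events that replay a finished message as a short chunk sequence."""

    def chunk(delta: dict, finish: Optional[str] = None) -> str:
        payload = {
            "id": msg_id,
            "object": "chat.completion.chunk",
            "choices": [{"index": 0, "delta": delta, "finish_reason": finish}],
        }
        return f"data: {json.dumps(payload)}\n\n"

    yield chunk({"role": "assistant"})
    if message.get("content"):
        yield chunk({"content": message["content"]})   # whole text in one chunk
    if message.get("tool_calls"):
        yield chunk({"tool_calls": message["tool_calls"]})
    yield chunk({}, finish_reason)
    yield "data: [DONE]\n\n"


events = list(restream_as_sse({"content": "hi"}, "stop"))
```

The entire text arrives in a single content chunk, which is the granularity trade-off the paragraph above describes.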
243
-
244
- ### Helper functions added
245
-
246
- - **`openai_to_anthropic_request(openai_body)`** — full conversion (system prompt, messages, tool_calls, tool_responses, tools, tool_choice, sampling params)
247
- - **`anthropic_to_openai_response(anthropic_resp)`** — content blocks → message, tool_use → tool_calls, stop_reason → finish_reason, usage mapping
248
- - **`_parse_anthropic_sse_to_message(raw)`** — SSE fallback parser if inner pipeline returns a stream despite `stream=False`
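As an illustration of the response-side mapping listed above, the sketch below converts an Anthropic-style message into an OpenAI-style chat completion. It is not the proxy's `anthropic_to_openai_response`; the name `sketch_anthropic_to_openai` and the exact `STOP_REASON_MAP` are assumptions based on the two public API shapes.

```python
import json

# Assumed mapping between the two APIs' termination reasons.
STOP_REASON_MAP = {
    "end_turn": "stop",
    "stop_sequence": "stop",
    "tool_use": "tool_calls",
    "max_tokens": "length",
}


def sketch_anthropic_to_openai(resp: dict) -> dict:
    """Map content blocks, tool_use blocks, stop_reason and usage to the OpenAI shape."""
    text_parts, tool_calls = [], []
    for block in resp.get("content", []):
        if block.get("type") == "text":
            text_parts.append(block.get("text", ""))
        elif block.get("type") == "tool_use":
            tool_calls.append({
                "id": block["id"],
                "type": "function",
                # OpenAI carries arguments as a JSON string, Anthropic as an object.
                "function": {"name": block["name"],
                             "arguments": json.dumps(block.get("input", {}))},
            })
    message = {"role": "assistant", "content": "".join(text_parts) or None}
    if tool_calls:
        message["tool_calls"] = tool_calls
    usage = resp.get("usage", {})
    prompt, completion = usage.get("input_tokens", 0), usage.get("output_tokens", 0)
    return {
        "id": resp.get("id", ""),
        "object": "chat.completion",
        "model": resp.get("model", ""),
        "choices": [{
            "index": 0,
            "message": message,
            "finish_reason": STOP_REASON_MAP.get(resp.get("stop_reason"), "stop"),
        }],
        "usage": {"prompt_tokens": prompt, "completion_tokens": completion,
                  "total_tokens": prompt + completion},
    }


sample = {
    "id": "msg_1", "model": "local", "stop_reason": "tool_use",
    "content": [{"type": "text", "text": "Reading."},
                {"type": "tool_use", "id": "toolu_1", "name": "Read",
                 "input": {"path": "a.py"}}],
    "usage": {"input_tokens": 10, "output_tokens": 5},
}
converted = sketch_anthropic_to_openai(sample)
```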
249
-
250
- ### Verification
251
-
252
- Tested against OpenCode, Forge, and synthetic curl requests:
253
- - Plain chat: clean text response
254
- - Tool use: proper `tool_calls` with JSON arguments
255
- - Streaming: proper SSE chunks with finish_reason
256
- - All guardrails active (verified via log `CHAT (guarded)` marker)
257
-
258
- ### Tests
259
-
260
- - [ ] Round-trip: OpenAI request → Anthropic → OpenAI with matching content
261
- - [ ] Tool call conversion (both directions)
262
- - [ ] System prompt extraction from messages
263
- - [ ] Streaming endpoint emits valid SSE sequence
264
- - [ ] Profile overrides apply to chat/completions path
265
-
266
- ---
267
-
268
- ## PR 5 — `proxy: policy engine with worktree + CI/CD enforcement`
269
-
270
- **Scope:** Large — new module + hook points
271
- **Files:** `policies/engine.py`, `policies/rules/*.py`, `anthropic_proxy.py` (hook points), tests
272
- **Depends on:** PR 1 (session continuity), PR 2 (guardrail infrastructure)
273
- **Risk:** Medium — new subsystem
274
-
275
- ### Motivation
276
-
277
- You can tell a local coding agent to use a git worktree. You can write it in CLAUDE.md, put it in the system prompt, make it the first rule. Local 27–35B models **still commit directly to main**.
278
-
279
- Policy-as-prompt is not an enforcement mechanism for local coding agents — it's a suggestion. The only reliable way to enforce workflow requirements is to make them non-bypassable at the proxy layer.
280
-
281
- ### What it enforces
282
-
283
- - **Worktree routing** — `Edit`, `Write`, `Bash` tool inputs get rewritten to reference the active worktree path. Operations targeting the main working tree are rejected.
284
- - **Completion gates** — `end_turn` is blocked unless tests ran, memory was queried, parallel reviewers were invoked.
285
- - **Pre-commit discipline** — commit tool calls blocked until code-reviewer + security-auditor + architect-reviewer were invoked.
286
- - **CI/CD deploy bucketing** — each agent session has a deploy bucket tied to its worktree. Concurrent agents don't collide at the pipeline layer.
287
- - **Per-profile rule sets** — `build` / `plan` / `memory` / `autoaccept` each get a different policy set.
288
- - **Session start protocol** — mandatory bootstrap checks (memory query, session context load)
289
- - **Auditable trail** — every policy decision logged with rule ID, context, outcome
290
-
291
- ### Architecture
292
-
293
- ```
- client → proxy → [guardrails] → [policy engine] → [tool rewriter] → llama.cpp
-
- audit log
- ```
298
-
299
- Every tool call goes through a policy check chain before being forwarded to llama.cpp. Rules can allow, rewrite, or block.
300
-
301
- ### Rule DSL
302
-
303
- ```python
- from uap.policies import policy, block, allow, rewrite_to_worktree, MUTATING_TOOLS
-
- @policy("worktree.enforce", profile=["build", "autoaccept"])
- def enforce_worktree(request, session):
-     if request.tool_name in MUTATING_TOOLS:
-         if not session.worktree_active:
-             return block("worktree_not_in_use",
-                          hint="Create a worktree first with `git worktree add`")
-         request.tool_input["path"] = rewrite_to_worktree(
-             request.tool_input["path"], session.worktree
-         )
-     return allow()
-
- @policy("commit.parallel_review", profile="build")
- def enforce_parallel_review(request, session):
-     if request.tool_name == "Bash" and "git commit" in request.tool_input.get("command", ""):
-         if not session.review_completed_this_turn:
-             return block("parallel_review_required",
-                          hint="Invoke code-reviewer + security-auditor + architect-reviewer in parallel before committing")
-     return allow()
-
- @policy("completion.gates", profile="build")
- def enforce_completion_gates(request, session):
-     if request.is_end_turn:
-         blockers = []
-         if not session.tests_ran:
-             blockers.append("tests_not_run")
-         if not session.memory_queried:
-             blockers.append("memory_not_queried")
-         if blockers:
-             return block(f"completion_gates_failed: {','.join(blockers)}")
-     return allow()
- ```
337
-
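One way the decorator-based DSL above could be backed is a plain rule registry evaluated in order. The sketch below is an assumption about the design, not the `uap.policies` module: `Decision`, `_RULES`, and `evaluate` are hypothetical names, and the `rewrite` outcome mentioned in the architecture section is left out for brevity.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Callable, List, Optional, Tuple


@dataclass
class Decision:
    action: str                  # "allow" or "block"
    reason: str = ""
    hint: Optional[str] = None


def allow() -> Decision:
    return Decision("allow")


def block(reason: str, hint: Optional[str] = None) -> Decision:
    return Decision("block", reason, hint)


_RULES: List[Tuple[str, List[str], Callable]] = []


def policy(rule_id: str, profile):
    """Register a rule for one profile name or a list of profile names."""
    profiles = [profile] if isinstance(profile, str) else list(profile)

    def register(fn: Callable) -> Callable:
        _RULES.append((rule_id, profiles, fn))
        return fn

    return register


def evaluate(profile: str, request, session) -> Tuple[Optional[str], Decision]:
    """Run the profile's rules in registration order; the first block wins."""
    for rule_id, profiles, fn in _RULES:
        if profile in profiles:
            decision = fn(request, session)
            if decision.action == "block":
                return rule_id, decision
    return None, allow()


@policy("commit.parallel_review", profile="build")
def enforce_parallel_review(request, session):
    if request.tool_name == "Bash" and "git commit" in request.tool_input.get("command", ""):
        if not session.review_completed_this_turn:
            return block("parallel_review_required")
    return allow()


commit = SimpleNamespace(tool_name="Bash", tool_input={"command": "git commit -m wip"})
session = SimpleNamespace(review_completed_this_turn=False)
blocked = evaluate("build", commit, session)   # rule applies to the build profile
passed = evaluate("plan", commit, session)     # no rules registered for plan here
```

Returning the rule ID alongside the decision gives the caller what it needs to emit an audit line with rule, tool, and outcome.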
338
- ### Integration with existing `_completion_blockers()`
339
-
340
- Policy blockers extend the existing completion contract:
341
-
342
- ```python
- def _completion_blockers(anthropic_body, has_tool_results, phase="", finalize_fired=False):
-     blockers = []
-     # ... existing checks ...
-
-     # NEW: policy-level blockers
-     policy_blockers = policy_engine.evaluate_completion(anthropic_body, session)
-     blockers.extend(policy_blockers)
-
-     return blockers
- ```
353
-
354
- ### Per-profile rule sets
355
-
356
- ```python
- # policies/profiles.py
- BUILD_PROFILE_RULES = [
-     "worktree.enforce",
-     "commit.parallel_review",
-     "commit.message_format",
-     "commit.no_secrets",
-     "completion.gates",
-     "session.bootstrap",
- ]
-
- PLAN_PROFILE_RULES = [
-     "tools.read_only",  # blocks write/edit/bash tools
-     "session.bootstrap",
- ]
-
- MEMORY_PROFILE_RULES = [
-     "tools.memory_only",  # only memory read/write tools allowed
- ]
-
- AUTOACCEPT_PROFILE_RULES = [
-     "worktree.enforce",  # same worktree rule
-     "commit.no_secrets",  # security still enforced
-     # no parallel review required (autoaccept is explicit trade-off)
- ]
- ```
382
-
383
- ### Audit trail
384
-
385
- Every policy decision is logged with session, rule ID, tool name, decision, and blocker reason:
386
-
387
- ```
- POLICY: sess=fp:aa51... rule=worktree.enforce tool=Edit decision=rewrite old_path=/home/cogtek/dev/main/app.py new_path=/home/cogtek/dev/.worktrees/feat-x/app.py
- POLICY: sess=fp:aa51... rule=commit.parallel_review tool=Bash decision=block reason=parallel_review_required
- ```
391
-
392
- ### Tests
393
-
394
- - [ ] Unit tests for each rule in isolation
395
- - [ ] Integration: build profile session → attempt commit without review → blocked → invoke review → commit succeeds
396
- - [ ] Integration: plan profile session → attempt Write → blocked
397
- - [ ] Multi-agent: two sessions with different worktrees → no collision
398
- - [ ] Audit log format validation
399
-
400
- ### Migration path
401
-
402
- - PR introduces the policy engine as **opt-in** per profile (default profile has no policies — fully backward-compatible)
403
- - Users can enable rules one at a time via profile env vars
404
- - Existing CLAUDE.md prose instructions can reference policies for context, but policies are now enforced independent of prose
405
-
406
- ---
407
-
408
- ## Submission order
409
-
410
- 1. **PR 1 (session fingerprinting)** — critical bug fix, low risk, unblocks everything else
411
- 2. **PR 2 (loop protection hardening)** — depends on PR 1, reviewers can verify that PR 1's fix makes these counters functional
412
- 3. **PR 3 (spec decoding control)** — independent, small, easy to review
413
- 4. **PR 4 (OpenAI endpoint)** — depends on PR 2 (reuses guardrails), adds major new functionality
414
- 5. **PR 5 (policy engine)** — depends on PR 1 + PR 2, new subsystem, needs the most review
415
-
416
- ## Pre-submission checklist (all PRs)
417
-
418
- - [ ] Unit tests added
419
- - [ ] Integration tests with real llama.cpp upstream
420
- - [ ] README / docs updated
421
- - [ ] Env var reference updated
422
- - [ ] No breaking changes to existing endpoints (or clearly flagged)
423
- - [ ] Config migration notes for existing deployments
424
- - [ ] Diff against current production (`anthropic-proxy.env.*` profiles)