@miller-tech/uap 1.39.0 → 1.40.1
- package/README.md +109 -642
- package/dist/.tsbuildinfo +1 -1
- package/dist/bin/cli.js +2 -2
- package/dist/bin/cli.js.map +1 -1
- package/dist/cli/deliver.d.ts +3 -2
- package/dist/cli/deliver.d.ts.map +1 -1
- package/dist/cli/deliver.js +10 -5
- package/dist/cli/deliver.js.map +1 -1
- package/docs/INDEX.md +48 -286
- package/docs/architecture/OVERVIEW.md +328 -0
- package/docs/architecture/PROTOCOL.md +204 -0
- package/docs/benchmarks/README.md +17 -192
- package/docs/getting-started/CONFIGURATION.md +237 -0
- package/docs/getting-started/INSTALLATION.md +125 -0
- package/docs/getting-started/QUICKSTART.md +115 -0
- package/docs/guides/COORDINATION.md +162 -0
- package/docs/guides/DELIVER.md +115 -0
- package/docs/guides/DEPLOY_BATCHING.md +212 -0
- package/docs/guides/DROIDS_AND_SKILLS.md +202 -0
- package/docs/guides/LOCAL_MODELS.md +148 -0
- package/docs/guides/MCP_ROUTER.md +195 -0
- package/docs/guides/MEMORY.md +235 -0
- package/docs/guides/MULTI_MODEL.md +223 -0
- package/docs/guides/POLICIES.md +190 -0
- package/docs/guides/WORKTREE_WORKFLOW.md +185 -0
- package/docs/integrations/MCP_ROUTER.md +147 -0
- package/docs/integrations/RTK.md +102 -0
- package/docs/reference/API.md +485 -0
- package/docs/reference/CLI.md +719 -0
- package/docs/reference/CONFIGURATION.md +90 -193
- package/docs/reference/DATABASE_SCHEMA.md +110 -344
- package/docs/reference/FEATURES.md +176 -472
- package/docs/reference/PATTERNS.md +102 -0
- package/docs/reference/PLATFORMS.md +83 -0
- package/package.json +1 -1
- package/docs/AGENTS.md +0 -423
- package/docs/DOCUMENTATION_AUDIT_REPORT.md +0 -131
- package/docs/GETTING_STARTED.md +0 -288
- package/docs/PROJECT_ANALYSIS_REPORT.md +0 -510
- package/docs/architecture/COMPLETE_ARCHITECTURE.md +0 -748
- package/docs/architecture/EXPERT_STACK.md +0 -137
- package/docs/architecture/MULTI_MODEL.md +0 -224
- package/docs/architecture/PLATFORM_GATING.md +0 -68
- package/docs/architecture/SYSTEM_ANALYSIS.md +0 -334
- package/docs/architecture/UAP_COMPLIANCE.md +0 -217
- package/docs/architecture/UAP_PROTOCOL.md +0 -339
- package/docs/architecture/UAP_STRICT_DROIDS.md +0 -172
- package/docs/archive/BALLS_MODE_SELF_ANALYSIS.md +0 -260
- package/docs/archive/BENCHMARK_GAPS_AND_PLAN.md +0 -146
- package/docs/archive/FAILING_TASKS_SOLUTION_PLAN.md +0 -668
- package/docs/archive/JINJA2-SYSTEM-MESSAGE-FIX.md +0 -209
- package/docs/archive/MODEL_ROUTING_IMPLEMENTATION_SUMMARY.md +0 -281
- package/docs/archive/MODEL_ROUTING_OPTIMIZATION_PLAN.md +0 -320
- package/docs/archive/NPM-PUBLISH-V0.9.1.md +0 -240
- package/docs/archive/OPTIMIZATION_OPTIONS.md +0 -334
- package/docs/archive/PARALLELISM_GAPS_AND_OPTIONS.md +0 -422
- package/docs/archive/POLICY_GATE_IMPLEMENTATION.md +0 -245
- package/docs/archive/SETUP_IMPROVEMENTS.md +0 -213
- package/docs/archive/UAP_GENERIC_OPTIMIZATION_PLAN.md +0 -270
- package/docs/archive/UAP_OPTIMIZATION_PLAN.md +0 -701
- package/docs/archive/UAP_V103_PATTERN_DESIGN.md +0 -315
- package/docs/archive/UAP_V104_COMPLIANCE_DESIGN.md +0 -223
- package/docs/archive/changelog/2026-03-10_uap-100-compliance.md +0 -77
- package/docs/archive/changelog/2026-03-10_uap-full-system-verification.md +0 -109
- package/docs/archive/opencode-integration-guide.md +0 -740
- package/docs/archive/opencode-integration-quickref.md +0 -180
- package/docs/benchmarks/OVERNIGHT_RUNNER.md +0 -341
- package/docs/benchmarks/SPECULATIVE_DECODING_JOURNEY_2026-03.md +0 -221
- package/docs/benchmarks/VALIDATION_PLAN.md +0 -568
- package/docs/blog/SPECULATIVE_DECODING_PRODUCTION_PLAYBOOK.md +0 -139
- package/docs/blog/local-coding-agents.md +0 -266
- package/docs/blog/x-thread.md +0 -254
- package/docs/deployment/DEPLOYMENT.md +0 -895
- package/docs/deployment/DEPLOYMENT_STRATEGIES.md +0 -518
- package/docs/deployment/DEPLOY_BATCHER_ANALYSIS.md +0 -224
- package/docs/deployment/DEPLOY_BATCHING.md +0 -273
- package/docs/deployment/DEPLOY_BUCKETING_ANALYSIS.md +0 -420
- package/docs/deployment/QWEN35_LLAMA_CPP.md +0 -426
- package/docs/deployment/UAP_LLAMA_ANTHROPIC_PROXY_BOOTSTRAP.md +0 -279
- package/docs/getting-started/INTEGRATION.md +0 -628
- package/docs/getting-started/OVERVIEW.md +0 -324
- package/docs/getting-started/SETUP.md +0 -377
- package/docs/integrations/MCP_ROUTER_SETUP.md +0 -445
- package/docs/integrations/RTK_INTEGRATION.md +0 -468
- package/docs/operations/TROUBLESHOOTING.md +0 -660
- package/docs/pr/PR_SPECULATIVE_DOCS_TEMPLATE.md +0 -146
- package/docs/pr/UPSTREAM_PRS.md +0 -424
- package/docs/reference/API_REFERENCE.md +0 -903
- package/docs/reference/EXPERT_DROIDS.md +0 -219
- package/docs/reference/HARNESS-MATRIX.md +0 -318
- package/docs/reference/PATTERN_LIBRARY.md +0 -636
- package/docs/reference/UAP_CLI_REFERENCE.md +0 -620
- package/docs/research/BEHAVIORAL_PATTERNS.md +0 -228
- package/docs/research/DOMAIN_STRATEGIES.md +0 -316
- package/docs/research/MEMORY_SYSTEMS_COMPARISON.md +0 -812
- package/docs/research/PATTERN_ANALYSIS_2026-01-18.md +0 -436
- package/docs/research/PERFORMANCE_ANALYSIS_2026-01-18.md +0 -209
- package/docs/research/PERFORMANCE_TEST_PLAN.md +0 -383
- package/docs/research/TERMINAL_BENCH_LEARNINGS.md +0 -217
|
@@ -1,146 +0,0 @@
|
|
|
1
|
-
## Title
|
|
2
|
-
|
|
3
|
-
docs: add speculative decoding production playbook and agentic compatibility guidance
|
|
4
|
-
|
|
5
|
-
## Context
|
|
6
|
-
|
|
7
|
-
`docs/speculative.md` explains speculative mechanisms and flags, but production operators also need:
|
|
8
|
-
|
|
9
|
-
- workload-driven profile selection,
|
|
10
|
-
- reproducible benchmarking protocol,
|
|
11
|
-
- signature-based regression triage,
|
|
12
|
-
- guidance for stream+tools agentic environments.
|
|
13
|
-
|
|
14
|
-
This PR adds operational documentation to reduce drift between benchmark wins and real-session behavior.
|
|
15
|
-
|
|
16
|
-
## Changes
|
|
17
|
-
|
|
18
|
-
### Add new guide
|
|
19
|
-
|
|
20
|
-
- New: `docs/speculative-production.md`
|
|
21
|
-
- implementation matrix:
|
|
22
|
-
- `draft`
|
|
23
|
-
- `ngram-cache`
|
|
24
|
-
- `ngram-simple`
|
|
25
|
-
- `ngram-map-k`
|
|
26
|
-
- `ngram-map-k4v`
|
|
27
|
-
- `ngram-mod`
|
|
28
|
-
- decision tree by workload (coding, repetitive transform, mixed)
|
|
29
|
-
- benchmark protocol (run counts, warmup, prompt classes, metrics)
|
|
30
|
-
- troubleshooting by signature:
|
|
31
|
-
- `find_slot: non-consecutive token position`
|
|
32
|
-
- low acceptance + high rollback
|
|
33
|
-
- throughput collapse after commit switch
|
|
34
|
-
- rollout rules (canary, promotion threshold, rollback triggers)
|
|
35
|
-
|
|
36
|
-
### Update existing speculative docs
|
|
37
|
-
|
|
38
|
-
- Update `docs/speculative.md`:
|
|
39
|
-
- add link to production guide
|
|
40
|
-
- add "how to interpret statistics in practice"
|
|
41
|
-
- add "workload sensitivity and reproducibility notes"
|
|
42
|
-
|
|
43
|
-
### Add compatibility appendix
|
|
44
|
-
|
|
45
|
-
- New appendix (or linked page): stream+tools compatibility for proxy-mediated agentic flows
|
|
46
|
-
- fallback policy guidance (`off` default for production)
|
|
47
|
-
- malformed stream/tool guardrail behavior
|
|
48
|
-
- max token floor and prune target recommendations
|
|
49
|
-
|
|
50
|
-
## Why
|
|
51
|
-
|
|
52
|
-
Speculative decoding quality in agentic coding depends on end-to-end behavior, including transport and stream tool-loop handling. This documentation closes that gap and provides a repeatable operator path.
|
|
53
|
-
|
|
54
|
-
## Validation Plan
|
|
55
|
-
|
|
56
|
-
- Verify all CLI flags/options in examples against current `llama-server`.
|
|
57
|
-
- Verify all linked scripts/docs paths resolve.
|
|
58
|
-
- Include one benchmark table with:
|
|
59
|
-
- decode/prefill throughput
|
|
60
|
-
- acceptance indicators
|
|
61
|
-
- latency percentiles
|
|
62
|
-
- workload class labels
|
|
63
|
-
|
|
64
|
-
## Risks
|
|
65
|
-
|
|
66
|
-
- Overfitting recommendations to one model/hardware class.
|
|
67
|
-
- Treating proxy behavior as universally required.
|
|
68
|
-
|
|
69
|
-
## Mitigations
|
|
70
|
-
|
|
71
|
-
- Mark all profile recommendations as workload/hardware sensitive.
|
|
72
|
-
- Separate "safe baseline" from "aggressive benchmark-only" profiles.
|
|
73
|
-
- Require local A/B validation before rollout.
|
|
74
|
-
|
|
75
|
-
## Out of Scope
|
|
76
|
-
|
|
77
|
-
- Runtime code changes
|
|
78
|
-
- Kernel-level speculative optimization changes
|
|
79
|
-
- Proxy implementation changes (docs-only PR)
|
|
80
|
-
|
|
81
|
-
## Follow-ups
|
|
82
|
-
|
|
83
|
-
1. Add nightly speculative regression harness.
|
|
84
|
-
2. Publish benchmark JSON schema for machine comparison.
|
|
85
|
-
3. Add commit-lineage tracking for performance regressions.
|
|
86
|
-
|
|
87
|
-
---
|
|
88
|
-
|
|
89
|
-
## Ready-to-Submit GitHub PR Body
|
|
90
|
-
|
|
91
|
-
### Summary
|
|
92
|
-
|
|
93
|
-
This docs PR adds a production-oriented speculative decoding playbook for llama.cpp users running real multi-turn workloads (especially agentic/tool-call scenarios). It complements existing mechanism-level docs with actionable tuning, troubleshooting, and rollout guidance.
|
|
94
|
-
|
|
95
|
-
### What Changed
|
|
96
|
-
|
|
97
|
-
- Added `docs/speculative-production.md` (new operational guide)
|
|
98
|
-
- implementation selection matrix
|
|
99
|
-
- workload-based decision tree
|
|
100
|
-
- benchmark protocol + required metrics
|
|
101
|
-
- troubleshooting by real log signatures
|
|
102
|
-
- canary/rollback rollout guidance
|
|
103
|
-
- Updated `docs/speculative.md`
|
|
104
|
-
- links to production guide
|
|
105
|
-
- practical stats interpretation notes
|
|
106
|
-
- workload sensitivity notes
|
|
107
|
-
- Added/linked "agentic stream+tools compatibility" appendix
|
|
108
|
-
- fallback policy defaults
|
|
109
|
-
- malformed stream/tool guardrails
|
|
110
|
-
- token-floor/prune guidance
|
|
111
|
-
|
|
112
|
-
### Why
|
|
113
|
-
|
|
114
|
-
Current docs describe speculative decoding internals clearly, but production operators need a reproducible way to:
|
|
115
|
-
|
|
116
|
-
- choose stable profiles by workload,
|
|
117
|
-
- detect/triage regressions quickly,
|
|
118
|
-
- avoid benchmark-only wins that fail in long sessions.
|
|
119
|
-
|
|
120
|
-
### Reviewer Guide
|
|
121
|
-
|
|
122
|
-
Please focus review on:
|
|
123
|
-
|
|
124
|
-
1. Accuracy of CLI flags and option names.
|
|
125
|
-
2. Correctness of troubleshooting signatures and interpretations.
|
|
126
|
-
3. Clarity of benchmark protocol (can another team reproduce it?).
|
|
127
|
-
4. Whether safe-vs-aggressive profile separation is clear enough.
|
|
128
|
-
|
|
129
|
-
### Validation
|
|
130
|
-
|
|
131
|
-
- [ ] Command examples verified against current `llama-server --help`
|
|
132
|
-
- [ ] Linked docs/scripts paths validated
|
|
133
|
-
- [ ] Benchmark table includes workload class labels
|
|
134
|
-
- [ ] Metrics include decode/prefill throughput + latency percentile view
|
|
135
|
-
- [ ] No runtime behavior claims without explicit caveats
|
|
136
|
-
|
|
137
|
-
### Risks / Caveats
|
|
138
|
-
|
|
139
|
-
- Recommendations are model/hardware/workload dependent.
|
|
140
|
-
- Guidance is operational, not a substitute for local A/B testing.
|
|
141
|
-
|
|
142
|
-
### Follow-ups
|
|
143
|
-
|
|
144
|
-
- [ ] Add nightly regression harness for speculative profiles
|
|
145
|
-
- [ ] Publish machine-readable benchmark schema
|
|
146
|
-
- [ ] Add commit lineage references in benchmark artifacts
|
package/docs/pr/UPSTREAM_PRS.md
DELETED
|
@@ -1,424 +0,0 @@
|
|
|
1
|
-
# UAP Upstream PR Plan
|
|
2
|
-
|
|
3
|
-
Five PRs covering the session stickiness bug, loop protection hardening, per-request speculative decoding control, an OpenAI-compatible endpoint, and the policy engine.
|
|
4
|
-
|
|
5
|
-
## Dependency graph
|
|
6
|
-
|
|
7
|
-
```
PR 1 (session fingerprinting)        ── CRITICAL ──► enables PR 2, PR 5 (and PR 4 via PR 2)
PR 2 (loop protection)               ── depends on PR 1
PR 3 (spec decoding control)         ── independent
PR 4 (OpenAI /v1/chat/completions)   ── depends on PR 2 (via guardrails)
PR 5 (policy engine)                 ── depends on PR 1 + PR 2
```
|
|
14
|
-
|
|
15
|
-
---
|
|
16
|
-
|
|
17
|
-
## PR 1 — `proxy: stable session fingerprinting`
|
|
18
|
-
|
|
19
|
-
**Scope:** Critical bug fix
|
|
20
|
-
**Files:** `tools/agents/scripts/anthropic_proxy.py`
|
|
21
|
-
**Risk:** Low — pure fix, no new surface area
|
|
22
|
-
**Priority:** Highest — every stateful guardrail depends on this
|
|
23
|
-
|
|
24
|
-
### Problem
|
|
25
|
-
|
|
26
|
-
Session fingerprints were hashed from `remote | model | system | first_user_content`. Two inputs were volatile:
|
|
27
|
-
|
|
28
|
-
1. **`tool_use_id`** values in tool_result blocks — random UUIDs regenerated per turn. `_content_fingerprint` included `f"result:{block.get('tool_use_id', '')}"` in the hash.
|
|
29
|
-
2. **`system` prompt** — clients inject volatile context (timestamps, cwd, session markers) into system prompts.
|
|
30
|
-
|
|
31
|
-
Result: **every single request got a different session ID** → every request spawned a fresh `SessionMonitor` → every stateful guardrail (cycle detection, forced_budget, review_cycles, finalize_hard_stop, unproductive_exhaustion_streak) was effectively stateless per-request.
|
|
32
|
-
|
|
33
|
-
This silently broke every loop protection mechanism ever built on top of the session monitor.
|
|
34
|
-
|
|
35
|
-
### Diagnostic evidence
|
|
36
|
-
|
|
37
|
-
After adding session ID logging:
|
|
38
|
-
|
|
39
|
-
```
sess=fp:9c8f26a802f9f4739f18 msgs=79
sess=fp:b801857a9e49e21a6599 msgs=81
sess=fp:aeef638954a390ef7aec msgs=83
sess=fp:16f908db2e478f31cb91 msgs=85
```
|
|
45
|
-
|
|
46
|
-
Every request got a new session ID. `session_count: 35` after 35 requests on what should have been one session.
|
|
47
|
-
|
|
48
|
-
### Fix
|
|
49
|
-
|
|
50
|
-
1. `_content_fingerprint` uses stable content excerpt (`result:<first 64 chars>`) instead of `tool_use_id`
|
|
51
|
-
2. `resolve_session_id` hashes only the first user message's **text content**, excludes `system` prompt entirely
|
|
52
|
-
|
|
53
|
-
```python
import hashlib


def resolve_session_id(request: Request, anthropic_body: dict) -> str:
    # ... header-based lookup unchanged ...
    # (`remote` and `model` used below are assumed to be set in the elided part.)

    # Fingerprint only the text of the first user message.
    first_user = ""
    for msg in anthropic_body.get("messages", []):
        if msg.get("role") == "user":
            content = msg.get("content", "")
            if isinstance(content, str):
                first_user = content[:512]
            elif isinstance(content, list):
                text_parts = [
                    b.get("text", "") for b in content
                    if isinstance(b, dict) and b.get("type") == "text"
                ]
                first_user = "\n".join(text_parts)[:512]
            break

    # Deliberately exclude `system` from the fingerprint: clients inject
    # volatile context (timestamps, cwd, session markers).
    digest = hashlib.sha256(
        f"{remote}|{model}|{first_user}".encode("utf-8", errors="ignore")
    ).hexdigest()[:20]
    return f"fp:{digest}"
```
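The first half of the fix (hashing a stable content excerpt rather than `tool_use_id`) can be sketched as follows. This is a minimal illustration, not the proxy's actual `_content_fingerprint`: the function name `content_fingerprint`, the `text:` prefix, and the handling of list-valued tool results are assumptions; only the `result:<first 64 chars>` form comes from the description above.

```python
import hashlib
import uuid


def content_fingerprint(content) -> str:
    """Hash message content using only fields that are stable across turns."""
    parts = []
    if isinstance(content, str):
        parts.append(f"text:{content[:64]}")
    else:
        for block in content:
            if not isinstance(block, dict):
                continue
            if block.get("type") == "text":
                parts.append(f"text:{block.get('text', '')[:64]}")
            elif block.get("type") == "tool_result":
                # Use an excerpt of the result body, never the per-turn tool_use_id.
                body = block.get("content", "")
                if isinstance(body, list):
                    body = "".join(
                        b.get("text", "") for b in body if isinstance(b, dict)
                    )
                parts.append(f"result:{str(body)[:64]}")
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:20]


def make_turn(tool_use_id: str) -> list:
    """Build the same user turn with a caller-supplied (volatile) tool_use_id."""
    return [
        {"type": "text", "text": "run the tests"},
        {"type": "tool_result", "tool_use_id": tool_use_id, "content": "3 passed"},
    ]


# Two turns that differ only in their random tool_use_id.
fp_a = content_fingerprint(make_turn(str(uuid.uuid4())))
fp_b = content_fingerprint(make_turn(str(uuid.uuid4())))
```

Because the random identifier never reaches the hash input, `fp_a` and `fp_b` are equal, which is the property the first unit test in the checklist below is meant to pin down.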
|
|
78
|
-
|
|
79
|
-
### Impact
|
|
80
|
-
|
|
81
|
-
- Before: 1 request per session
|
|
82
|
-
- After: 170+ requests on the same session (verified with Claude Code + OpenCode + Forge clients)
|
|
83
|
-
- All downstream guardrails suddenly started working — no changes needed to them
|
|
84
|
-
|
|
85
|
-
### Add session ID logging
|
|
86
|
-
|
|
87
|
-
The REQ line now includes `sess=` for diagnosis:
|
|
88
|
-
|
|
89
|
-
```
REQ: client=remote:127.0.0.1 sess=fp:aa5169796b2c39c2a4a4 rate_60s=1 ...
```
|
|
92
|
-
|
|
93
|
-
### Tests
|
|
94
|
-
|
|
95
|
-
- [ ] Unit test: same message with changing tool_use_ids → stable fingerprint
|
|
96
|
-
- [ ] Unit test: same message with changing system timestamps → stable fingerprint
|
|
97
|
-
- [ ] Integration test: 3 sequential requests on same conversation → same session_id
|
|
98
|
-
|
|
99
|
-
---
|
|
100
|
-
|
|
101
|
-
## PR 2 — `proxy: loop protection hardening`
|
|
102
|
-
|
|
103
|
-
**Scope:** Medium — new counters + threshold gates
|
|
104
|
-
**Files:** `anthropic_proxy.py`
|
|
105
|
-
**Depends on:** PR 1 (counters only work with sticky sessions)
|
|
106
|
-
|
|
107
|
-
### Additions
|
|
108
|
-
|
|
109
|
-
1. **`tool_state_unproductive_exhaustion_streak`**
|
|
110
|
-
- Tracks consecutive `forced_budget_exhausted` events where NEITHER cycling NOR stagnation was detected
|
|
111
|
-
- After `PROXY_UNPRODUCTIVE_EXHAUSTION_LIMIT` (default 4), forces finalize
|
|
112
|
-
- Catches "distinct-but-unproductive tool spam" that defeats per-tool cycle detection
|
|
113
|
-
|
|
114
|
-
2. **`finalize_hard_stop_count`** (monotonic session-level)
|
|
115
|
-
- NOT reset by `fresh_user_text` / `inactive_loop` paths
|
|
116
|
-
- Incremented in BOTH:
|
|
117
|
-
- `_inject_synthetic_continuation` (synthetic continuation path)
|
|
118
|
-
- `state_choice == "finalize"` handler (tool-stripping path)
|
|
119
|
-
- When `>= PROXY_FINALIZE_SESSION_HARD_CAP` (default 6), synthetic continuation injection is blocked, natural end_turn passes through → client terminates loop cleanly
|
|
120
|
-
|
|
121
|
-
3. **`finalize_fired` flag in `_completion_blockers()`**
|
|
122
|
-
- When `finalize_hard_stop_count > 0`, suppresses `text_only_after_tool_results` blocker
|
|
123
|
-
- Prevents state machine from re-entering active loop after a finalize wraps up the work
|
|
124
|
-
- Was causing `finalize → review → cycle_detected → finalize → review → ...` infinite ping-pong
|
|
125
|
-
|
|
126
|
-
### New env vars
|
|
127
|
-
|
|
128
|
-
```
PROXY_UNPRODUCTIVE_EXHAUSTION_LIMIT=4      # new
PROXY_FINALIZE_SESSION_HARD_CAP=6          # new
```
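How the two new counters interact can be sketched with a small state holder. This is an illustrative model only: the class `LoopCounters` and its method names are hypothetical, and resetting the streak when cycling or stagnation is detected (or on fresh user text) is an assumption drawn from the word "consecutive" in the description. The defaults of 4 and 6 mirror the env vars above.

```python
from dataclasses import dataclass


@dataclass
class LoopCounters:
    exhaustion_limit: int = 4    # PROXY_UNPRODUCTIVE_EXHAUSTION_LIMIT
    finalize_hard_cap: int = 6   # PROXY_FINALIZE_SESSION_HARD_CAP
    unproductive_exhaustion_streak: int = 0
    finalize_hard_stop_count: int = 0  # monotonic for the session's lifetime

    def on_budget_exhausted(self, cycling: bool, stagnation: bool) -> bool:
        """Return True when the streak should force a finalize."""
        if cycling or stagnation:
            # Another detector already explains this exhaustion (assumed reset).
            self.unproductive_exhaustion_streak = 0
            return False
        self.unproductive_exhaustion_streak += 1
        return self.unproductive_exhaustion_streak >= self.exhaustion_limit

    def on_finalize(self) -> None:
        """Called from both the synthetic-continuation and tool-stripping paths."""
        self.finalize_hard_stop_count += 1

    def on_fresh_user_text(self) -> None:
        # Resets the streak but deliberately leaves finalize_hard_stop_count alone.
        self.unproductive_exhaustion_streak = 0

    def may_inject_continuation(self) -> bool:
        """False once the session-level hard cap has been reached."""
        return self.finalize_hard_stop_count < self.finalize_hard_cap


counters = LoopCounters()
forced = [
    counters.on_budget_exhausted(cycling=False, stagnation=False)
    for _ in range(4)
]
```

Here `forced` only becomes true on the fourth consecutive unproductive exhaustion. Keeping `finalize_hard_stop_count` out of every reset path is what lets the hard cap end a session that keeps re-entering the active loop.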
|
|
132
|
-
|
|
133
|
-
### Tuned thresholds (tighter defaults)
|
|
134
|
-
|
|
135
|
-
```
PROXY_LOOP_REPEAT_THRESHOLD=4              # was 10
PROXY_FORCED_THRESHOLD=12                  # was 18
PROXY_NO_PROGRESS_THRESHOLD=3              # was 5
PROXY_TOOL_STATE_STAGNATION_THRESHOLD=4    # was 8
PROXY_TOOL_STATE_FINALIZE_THRESHOLD=8      # was 18
PROXY_TOOL_STATE_REVIEW_CYCLE_LIMIT=5      # was 3 (relaxed after tuning)
PROXY_TOOL_NARROWING_EXPAND_ON_LOOP=off    # was on
PROXY_TOOL_NARROWING_KEEP=8                # was 12
```
|
|
145
|
-
|
|
146
|
-
### Verification
|
|
147
|
-
|
|
148
|
-
Real session that was previously looping indefinitely terminated cleanly:
|
|
149
|
-
```
|
|
150
|
-
TOOL STATE MACHINE: 4 consecutive unproductive budget exhaustions — forcing finalize
|
|
151
|
-
TOOL STATE MACHINE: phase review -> finalize reason=unproductive_exhaustion
|
|
152
|
-
FINALIZE CONTINUATION: session hard cap reached (6/6) — not injecting, allowing termination
|
|
153
|
-
```
|
|
154
|
-
|
|
155
|
-
Client received clean `end_turn`, started a fresh new task.
|
|
156
|
-
|
|
157
|
-
### Tests
|
|
158
|
-
|
|
159
|
-
- [ ] Simulated loop: distinct tool calls with no context growth → triggers unproductive exhaustion
|
|
160
|
-
- [ ] Simulated loop: same tool repeated → triggers per-tool cycle detection (existing)
|
|
161
|
-
- [ ] Finalize → synthetic continuation → reset → new active loop → hard cap at 6 → natural termination
|
|
162
|
-
|
|
163
|
-
---
|
|
164
|
-
|
|
165
|
-
## PR 3 — `proxy: per-request speculative decoding control`
|
|
166
|
-
|
|
167
|
-
**Scope:** Small, focused
|
|
168
|
-
**Files:** `anthropic_proxy.py`, README
|
|
169
|
-
**Risk:** Low
|
|
170
|
-
|
|
171
|
-
### Feature
|
|
172
|
-
|
|
173
|
-
New env var `PROXY_DISABLE_SPEC_ON_TOOL_TURNS` (default off). When on, the proxy sets `openai_body["speculative.n_max"] = 0` on tool-turn requests, telling llama.cpp to skip the draft/spec path for that request only.
|
|
174
|
-
|
|
175
|
-
### Why
|
|
176
|
-
|
|
177
|
-
Some models (observed: early Qwen3.5-35B-A3B Q4_K_M) produce garbled tool-call output under speculative decoding due to rejected-draft state leakage. Disabling spec on tool turns while keeping it on for plain chat gives the best of both worlds for unstable models. Stable models can leave this off and benefit from spec on every turn.
|
|
178
|
-
|
|
179
|
-
### Applied in two places
|
|
180
|
-
|
|
181
|
-
1. Main handler (`_build_openai_request` end)
|
|
182
|
-
2. Tool starvation breaker early-return path (so the flag is respected on both code paths)
|
|
183
|
-
|
|
184
|
-
```python
if PROXY_DISABLE_SPEC_ON_TOOL_TURNS:
    openai_body["speculative.n_max"] = 0
    logger.info("Spec decoding disabled for tool turn (PROXY_DISABLE_SPEC_ON_TOOL_TURNS=on)")
```
|
|
189
|
-
|
|
190
|
-
### Relies on llama.cpp upstream support
|
|
191
|
-
|
|
192
|
-
llama.cpp already supports per-request `speculative.n_max` in `server-task.cpp`:
|
|
193
|
-
```cpp
params.speculative.n_max = json_value(data, "speculative.n_max", defaults.speculative.n_max);
```
|
|
196
|
-
|
|
197
|
-
Setting it to 0 gates the entire draft path (`if (n_draft_max > 0)` in `server-context.cpp`).
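The intended per-request behavior, matching the three test cases listed below, can be sketched as a pure function. The name `apply_spec_control` and the `is_tool_turn` argument are hypothetical; how the proxy actually decides that a request is a tool turn is not shown in this plan.

```python
import copy


def apply_spec_control(
    openai_body: dict, is_tool_turn: bool, disable_on_tool_turns: bool
) -> dict:
    """Return a copy of the forwarded body with spec decoding disabled if needed."""
    body = copy.deepcopy(openai_body)
    if disable_on_tool_turns and is_tool_turn:
        # 0 tells llama.cpp to skip the draft path for this request only.
        body["speculative.n_max"] = 0
    return body


request_body = {"model": "local", "messages": []}
tool_turn = apply_spec_control(request_body, is_tool_turn=True, disable_on_tool_turns=True)
chat_turn = apply_spec_control(request_body, is_tool_turn=False, disable_on_tool_turns=True)
flag_off = apply_spec_control(request_body, is_tool_turn=True, disable_on_tool_turns=False)
```

Only `tool_turn` carries the `speculative.n_max` key; the other two bodies are forwarded unchanged, so the server's own default still applies to them.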
|
|
198
|
-
|
|
199
|
-
### Tests
|
|
200
|
-
|
|
201
|
-
- [ ] Tool-turn request with flag on → `speculative.n_max=0` in forwarded body
|
|
202
|
-
- [ ] Non-tool request with flag on → no speculative field added
|
|
203
|
-
- [ ] Flag off → no speculative field added regardless
|
|
204
|
-
|
|
205
|
-
---
|
|
206
|
-
|
|
207
|
-
## PR 4 — `proxy: fully guarded OpenAI /v1/chat/completions endpoint`
|
|
208
|
-
|
|
209
|
-
**Scope:** Medium — new endpoint with full bidirectional conversion
|
|
210
|
-
**Files:** `anthropic_proxy.py`
|
|
211
|
-
**Depends on:** PR 2 (reuses the guardrail pipeline)
|
|
212
|
-
|
|
213
|
-
### Motivation
|
|
214
|
-
|
|
215
|
-
Clients like **OpenCode**, **Forge**, **Cline**, and many LangChain-based agents expect OpenAI's `/v1/chat/completions` shape. The proxy previously only exposed `/v1/messages` (Anthropic shape), so these clients either:
|
|
216
|
-
1. Bypassed the proxy and talked directly to llama.cpp (no guardrails), OR
|
|
217
|
-
2. Couldn't use the proxy at all
|
|
218
|
-
|
|
219
|
-
### Approach
|
|
220
|
-
|
|
221
|
-
Add `/v1/chat/completions` handler that:
|
|
222
|
-
1. Receives OpenAI-format request
|
|
223
|
-
2. Converts to Anthropic format (`openai_to_anthropic_request`)
|
|
224
|
-
3. Invokes the existing `messages()` handler via synthetic `Request` with Anthropic body
|
|
225
|
-
4. Converts the Anthropic response back to OpenAI format (`anthropic_to_openai_response`)
|
|
226
|
-
5. Returns to the client
|
|
227
|
-
|
|
228
|
-
**All guardrails from the `/v1/messages` path apply automatically** — loop detection, tool narrowing, cycle breaking, malformed tool retry, context pruning, profile overrides, activation replay (llama.cpp side).
|
|
229
|
-
|
|
230
|
-
### Streaming
|
|
231
|
-
|
|
232
|
-
Client stream requests are processed internally as non-stream through the Anthropic pipeline, then re-streamed as OpenAI SSE chunks:
|
|
233
|
-
|
|
234
|
-
```
data: {"id":"msg_...","delta":{"role":"assistant"},...}
data: {"id":"msg_...","delta":{"content":"..."},...}
data: {"id":"msg_...","delta":{"tool_calls":[...]},...}
data: {"id":"msg_...","delta":{},"finish_reason":"tool_calls"}
data: [DONE]
```
|
|
241
|
-
|
|
242
|
-
This sacrifices token-by-token streaming granularity in exchange for keeping all guardrails. The difference is invisible to most clients.
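Re-streaming a finished message as a handful of SSE chunks can be sketched as a generator. This is an assumption-laden illustration rather than the proxy's code: the function name `openai_sse_chunks` and the flat `message` dict it consumes are hypothetical, and the chunk envelope follows the commonly documented `chat.completion.chunk` layout.

```python
import json
import time
from typing import Iterator


def openai_sse_chunks(message: dict, model: str) -> Iterator[str]:
    """Yield OpenAI-style SSE events for an already-complete chat message."""
    base = {
        "id": message["id"],
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model,
    }

    def chunk(delta: dict, finish_reason=None) -> str:
        choice = {"index": 0, "delta": delta, "finish_reason": finish_reason}
        return f"data: {json.dumps(dict(base, choices=[choice]))}\n\n"

    yield chunk({"role": "assistant"})
    if message.get("content"):
        yield chunk({"content": message["content"]})
    if message.get("tool_calls"):
        # Streaming tool-call deltas carry an index per call.
        calls = [dict(call, index=i) for i, call in enumerate(message["tool_calls"])]
        yield chunk({"tool_calls": calls})
    yield chunk({}, finish_reason=message.get("finish_reason", "stop"))
    yield "data: [DONE]\n\n"


example = {
    "id": "msg_example",
    "content": "Listing files.",
    "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "ls", "arguments": "{}"}}
    ],
    "finish_reason": "tool_calls",
}
events = list(openai_sse_chunks(example, model="local"))
```

The whole text arrives in one `content` delta, which is the granularity trade-off described above: the client still sees a valid event sequence, just a coarse one.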
|
|
243
|
-
|
|
244
|
-
### Helper functions added
|
|
245
|
-
|
|
246
|
-
- **`openai_to_anthropic_request(openai_body)`** — full conversion (system prompt, messages, tool_calls, tool_responses, tools, tool_choice, sampling params)
|
|
247
|
-
- **`anthropic_to_openai_response(anthropic_resp)`** — content blocks → message, tool_use → tool_calls, stop_reason → finish_reason, usage mapping
|
|
248
|
-
- **`_parse_anthropic_sse_to_message(raw)`** — SSE fallback parser if inner pipeline returns a stream despite `stream=False`
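The response-side mapping can be sketched roughly as below. It is a simplified stand-in for `anthropic_to_openai_response`, not its actual body: the name `anthropic_to_openai_sketch`, the `STOP_REASON_MAP` table, and the fallback to `"stop"` for unknown stop reasons are assumptions made for illustration.

```python
import json

# Assumed mapping from Anthropic stop_reason to OpenAI finish_reason.
STOP_REASON_MAP = {
    "end_turn": "stop",
    "stop_sequence": "stop",
    "max_tokens": "length",
    "tool_use": "tool_calls",
}


def anthropic_to_openai_sketch(resp: dict, model: str = "") -> dict:
    """Convert an Anthropic-shaped message into an OpenAI chat completion."""
    text_parts, tool_calls = [], []
    for block in resp.get("content", []):
        if block.get("type") == "text":
            text_parts.append(block.get("text", ""))
        elif block.get("type") == "tool_use":
            tool_calls.append({
                "id": block.get("id", ""),
                "type": "function",
                "function": {
                    "name": block.get("name", ""),
                    # OpenAI carries arguments as a JSON-encoded string.
                    "arguments": json.dumps(block.get("input", {})),
                },
            })

    message = {"role": "assistant", "content": "".join(text_parts) or None}
    if tool_calls:
        message["tool_calls"] = tool_calls

    usage = resp.get("usage", {})
    prompt = usage.get("input_tokens", 0)
    completion = usage.get("output_tokens", 0)
    return {
        "id": resp.get("id", ""),
        "object": "chat.completion",
        "model": model or resp.get("model", ""),
        "choices": [{
            "index": 0,
            "message": message,
            "finish_reason": STOP_REASON_MAP.get(resp.get("stop_reason"), "stop"),
        }],
        "usage": {
            "prompt_tokens": prompt,
            "completion_tokens": completion,
            "total_tokens": prompt + completion,
        },
    }


anthropic_resp = {
    "id": "msg_example",
    "content": [
        {"type": "text", "text": "Listing files."},
        {"type": "tool_use", "id": "toolu_1", "name": "ls", "input": {"path": "."}},
    ],
    "stop_reason": "tool_use",
    "usage": {"input_tokens": 10, "output_tokens": 5},
}
converted = anthropic_to_openai_sketch(anthropic_resp, model="local")
```

The one non-obvious step is serializing `input` to a string: OpenAI clients expect `function.arguments` as JSON text, whereas the Anthropic block holds a parsed object.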
|
|
249
|
-
|
|
250
|
-
### Verification
|
|
251
|
-
|
|
252
|
-
Tested against OpenCode, Forge, and synthetic curl requests:
|
|
253
|
-
- Plain chat: clean text response
|
|
254
|
-
- Tool use: proper `tool_calls` with JSON arguments
|
|
255
|
-
- Streaming: proper SSE chunks with finish_reason
|
|
256
|
-
- All guardrails active (verified via log `CHAT (guarded)` marker)
|
|
257
|
-
|
|
258
|
-
### Tests
|
|
259
|
-
|
|
260
|
-
- [ ] Round-trip: OpenAI request → Anthropic → OpenAI with matching content
|
|
261
|
-
- [ ] Tool call conversion (both directions)
|
|
262
|
-
- [ ] System prompt extraction from messages
|
|
263
|
-
- [ ] Streaming endpoint emits valid SSE sequence
|
|
264
|
-
- [ ] Profile overrides apply to chat/completions path
|
|
265
|
-
|
|
266
|
-
---
|
|
267
|
-
|
|
268
|
-
## PR 5 — `proxy: policy engine with worktree + CI/CD enforcement`
|
|
269
|
-
|
|
270
|
-
**Scope:** Large — new module + hook points
|
|
271
|
-
**Files:** `policies/engine.py`, `policies/rules/*.py`, `anthropic_proxy.py` (hook points), tests
|
|
272
|
-
**Depends on:** PR 1 (session continuity), PR 2 (guardrail infrastructure)
|
|
273
|
-
**Risk:** Medium — new subsystem
|
|
274
|
-
|
|
275
|
-
### Motivation
|
|
276
|
-
|
|
277
|
-
You can tell a local coding agent to use a git worktree. You can write it in CLAUDE.md, put it in the system prompt, make it the first rule. Local 27–35B models **still commit directly to main**.
|
|
278
|
-
|
|
279
|
-
Policy-as-prompt is not an enforcement mechanism for local coding agents — it's a suggestion. The only reliable way to enforce workflow requirements is to make them non-bypassable at the proxy layer.
|
|
280
|
-
|
|
281
|
-
### What it enforces
|
|
282
|
-
|
|
283
|
-
- **Worktree routing** — `Edit`, `Write`, `Bash` tool inputs get rewritten to reference the active worktree path. Operations targeting the main working tree are rejected.
|
|
284
|
-
- **Completion gates** — `end_turn` is blocked unless tests ran, memory was queried, parallel reviewers were invoked.
|
|
285
|
-
- **Pre-commit discipline** — commit tool calls blocked until code-reviewer + security-auditor + architect-reviewer were invoked.
|
|
286
|
-
- **CI/CD deploy bucketing** — each agent session has a deploy bucket tied to its worktree. Concurrent agents don't collide at the pipeline layer.
|
|
287
|
-
- **Per-profile rule sets** — `build` / `plan` / `memory` / `autoaccept` each get a different policy set.
|
|
288
|
-
- **Session start protocol** — mandatory bootstrap checks (memory query, session context load)
|
|
289
|
-
- **Auditable trail** — every policy decision logged with rule ID, context, outcome
|
|
290
|
-
|
|
291
|
-
### Architecture
|
|
292
|
-
|
|
293
|
-
```
client → proxy → [guardrails] → [policy engine] → [tool rewriter] → llama.cpp
                                       ↓
                                   audit log
```
|
|
298
|
-
|
|
299
|
-
Every tool call goes through a policy check chain before being forwarded to llama.cpp. Rules can allow, rewrite, or block.
|
|
300
|
-
|
|
301
|
-
### Rule DSL
|
|
302
|
-
|
|
303
|
-
```python
from uap.policies import policy, block, allow, MUTATING_TOOLS

@policy("worktree.enforce", profile=["build", "autoaccept"])
def enforce_worktree(request, session):
    if request.tool_name in MUTATING_TOOLS:
        if not session.worktree_active:
            return block("worktree_not_in_use",
                         hint="Create a worktree first with `git worktree add`")
        # Guard the lookup: Bash inputs carry `command`, not `path`
        # (command rewriting is not shown here).
        if "path" in request.tool_input:
            request.tool_input["path"] = rewrite_to_worktree(
                request.tool_input["path"], session.worktree
            )
    return allow()

@policy("commit.parallel_review", profile="build")
def enforce_parallel_review(request, session):
    if request.tool_name == "Bash" and "git commit" in request.tool_input.get("command", ""):
        if not session.review_completed_this_turn:
            return block("parallel_review_required",
                         hint="Invoke code-reviewer + security-auditor + architect-reviewer in parallel before committing")
    return allow()

@policy("completion.gates", profile="build")
def enforce_completion_gates(request, session):
    if request.is_end_turn:
        blockers = []
        if not session.tests_ran:
            blockers.append("tests_not_run")
        if not session.memory_queried:
            blockers.append("memory_not_queried")
        if blockers:
            return block(f"completion_gates_failed: {','.join(blockers)}")
    return allow()
```
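One way the decorator and the check chain behind this DSL could fit together is sketched below. Everything here is hypothetical scaffolding (`Decision`, `Rule`, `REGISTRY`, `evaluate`); the plan only specifies the `policy`, `block`, and `allow` names and that rules can allow, rewrite, or block. The sketch assumes the first blocking rule wins.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Callable, Dict, List


@dataclass
class Decision:
    allowed: bool
    reason: str = ""
    hint: str = ""


def allow() -> Decision:
    return Decision(allowed=True)


def block(reason: str, hint: str = "") -> Decision:
    return Decision(allowed=False, reason=reason, hint=hint)


@dataclass
class Rule:
    rule_id: str
    profiles: List[str]
    check: Callable


REGISTRY: Dict[str, Rule] = {}


def policy(rule_id: str, profile=None):
    """Register a rule function under rule_id for one or more profiles."""
    profiles = [profile] if isinstance(profile, str) else list(profile or [])

    def register(fn):
        REGISTRY[rule_id] = Rule(rule_id, profiles, fn)
        return fn

    return register


def evaluate(request, session, profile: str) -> Decision:
    """Run every rule registered for the profile; the first block wins."""
    for rule in REGISTRY.values():
        if profile in rule.profiles:
            decision = rule.check(request, session)
            if not decision.allowed:
                return decision
    return allow()


@policy("tools.read_only", profile="plan")
def read_only(request, session):
    if request.tool_name in {"Write", "Edit", "Bash"}:
        return block("read_only_profile")
    return allow()


plan_result = evaluate(SimpleNamespace(tool_name="Write"), SimpleNamespace(), "plan")
build_result = evaluate(SimpleNamespace(tool_name="Write"), SimpleNamespace(), "build")
```

Under these assumptions a `Write` call is blocked in the `plan` profile and passes in `build`, because `tools.read_only` is registered for `plan` only.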
|
|
337
|
-
|
|
338
|
-
### Integration with existing `_completion_blockers()`
|
|
339
|
-
|
|
340
|
-
Policy blockers extend the existing completion contract:
|
|
341
|
-
|
|
342
|
-
```python
def _completion_blockers(anthropic_body, has_tool_results, phase="", finalize_fired=False):
    blockers = []
    # ... existing checks ...

    # NEW: policy-level blockers
    policy_blockers = policy_engine.evaluate_completion(anthropic_body, session)
    blockers.extend(policy_blockers)

    return blockers
```
|
|
353
|
-
|
|
354
|
-
### Per-profile rule sets
|
|
355
|
-
|
|
356
|
-
```python
# policies/profiles.py
BUILD_PROFILE_RULES = [
    "worktree.enforce",
    "commit.parallel_review",
    "commit.message_format",
    "commit.no_secrets",
    "completion.gates",
    "session.bootstrap",
]

PLAN_PROFILE_RULES = [
    "tools.read_only",        # blocks write/edit/bash tools
    "session.bootstrap",
]

MEMORY_PROFILE_RULES = [
    "tools.memory_only",      # only memory read/write tools allowed
]

AUTOACCEPT_PROFILE_RULES = [
    "worktree.enforce",       # same worktree rule
    "commit.no_secrets",      # security still enforced
    # no parallel review required (autoaccept is explicit trade-off)
]
```
|
|
382
|
-
|
|
383
|
-
### Audit trail
|
|
384
|
-
|
|
385
|
-
Every policy decision is logged with session, rule ID, tool name, decision, and blocker reason:
|
|
386
|
-
|
|
387
|
-
```
POLICY: sess=fp:aa51... rule=worktree.enforce tool=Edit decision=rewrite old_path=/home/cogtek/dev/main/app.py new_path=/home/cogtek/dev/.worktrees/feat-x/app.py
POLICY: sess=fp:aa51... rule=commit.parallel_review tool=Bash decision=block reason=parallel_review_required
```
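A formatter and parser for lines in this shape might look like the sketch below, which could also back the "audit log format validation" test. The helper names are hypothetical, and the parser assumes that values contain no spaces, which holds for the two sample lines but is not guaranteed by the plan.

```python
def format_policy_log(
    session_id: str, rule_id: str, tool: str, decision: str, **extra: str
) -> str:
    """Render one audit line as space-separated key=value pairs."""
    fields = {
        "sess": session_id,
        "rule": rule_id,
        "tool": tool,
        "decision": decision,
        **extra,
    }
    return "POLICY: " + " ".join(f"{key}={value}" for key, value in fields.items())


def parse_policy_log(line: str) -> dict:
    """Parse an audit line back into a dict (values must not contain spaces)."""
    prefix = "POLICY: "
    if not line.startswith(prefix):
        raise ValueError("not a POLICY line")
    return dict(pair.split("=", 1) for pair in line[len(prefix):].split(" "))


audit_line = format_policy_log(
    "fp:aa51...",
    "commit.parallel_review",
    "Bash",
    "block",
    reason="parallel_review_required",
)
parsed = parse_policy_log(audit_line)
```

Splitting on the first `=` only keeps values such as file paths intact; quoting would be needed if values could ever contain spaces.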
|
|
391
|
-
|
|
392
|
-
### Tests
|
|
393
|
-
|
|
394
|
-
- [ ] Unit tests for each rule in isolation
|
|
395
|
-
- [ ] Integration: build profile session → attempt commit without review → blocked → invoke review → commit succeeds
|
|
396
|
-
- [ ] Integration: plan profile session → attempt Write → blocked
|
|
397
|
-
- [ ] Multi-agent: two sessions with different worktrees → no collision
|
|
398
|
-
- [ ] Audit log format validation
|
|
399
|
-
|
|
400
|
-
### Migration path
|
|
401
|
-
|
|
402
|
-
- PR introduces the policy engine as **opt-in** per profile (default profile has no policies — fully backward-compatible)
|
|
403
|
-
- Users can enable rules one at a time via profile env vars
|
|
404
|
-
- Existing CLAUDE.md prose instructions can reference policies for context, but policies are now enforced independent of prose
|
|
405
|
-
|
|
406
|
-
---
|
|
407
|
-
|
|
408
|
-
## Submission order
|
|
409
|
-
|
|
410
|
-
1. **PR 1 (session fingerprinting)** — critical bug fix, low risk, unblocks everything else
|
|
411
|
-
2. **PR 2 (loop protection hardening)** — depends on PR 1, reviewers can verify that PR 1's fix makes these counters functional
|
|
412
|
-
3. **PR 3 (spec decoding control)** — independent, small, easy to review
|
|
413
|
-
4. **PR 4 (OpenAI endpoint)** — depends on PR 2 (reuses guardrails), adds major new functionality
|
|
414
|
-
5. **PR 5 (policy engine)** — depends on PR 1 + PR 2, new subsystem, needs the most review
|
|
415
|
-
|
|
416
|
-
## Pre-submission checklist (all PRs)
|
|
417
|
-
|
|
418
|
-
- [ ] Unit tests added
|
|
419
|
-
- [ ] Integration tests with real llama.cpp upstream
|
|
420
|
-
- [ ] README / docs updated
|
|
421
|
-
- [ ] Env var reference updated
|
|
422
|
-
- [ ] No breaking changes to existing endpoints (or clearly flagged)
|
|
423
|
-
- [ ] Config migration notes for existing deployments
|
|
424
|
-
- [ ] Diff against current production (`anthropic-proxy.env.*` profiles)
|