@miller-tech/uap 1.13.13 → 1.13.15

@@ -0,0 +1,221 @@
1
+ # Speculative Decoding Journey (2026-03)
2
+
3
+ This document records the end-to-end speculative decoding stabilization journey across `llama.cpp` runtime tuning and `uap-anthropic-proxy` guardrails, including fixes, benchmark results, and the production profile now in use.
4
+
5
+ ## Scope
6
+
7
+ - Runtime: `llama.cpp` with Qwen3.5 models, CUDA, `ctx-size=262144`.
8
+ - Gateway: Anthropic-compatible proxy (`tools/agents/scripts/anthropic_proxy.py`).
9
+ - Client behavior: agentic coding loops with tool calls (Claude Code style).
10
+
11
+ ## Goals
12
+
13
+ 1. Preserve high speculative decoding throughput.
14
+ 2. Eliminate pathological loops and malformed visible output.
15
+ 3. Keep tool-call behavior reliable under long sessions.
16
+ 4. Keep production context window at `262144`.
17
+
18
+ ## Phase 1 - llama.cpp Speculative Stability
19
+
20
+ ### Problems Observed
21
+
22
+ - Rollback loops and instability under aggressive speculative settings.
23
+ - `find_slot` and related server warnings during long agentic sessions.
24
+ - Throughput regressions compared to known fast baseline.
25
+
26
+ ### Work Performed
27
+
28
+ - Implemented and tested multiple rollback strategies in `llama.cpp` worktree branches.
29
+ - Compared baseline fast commit vs newer speculative logic.
30
+ - Restored proven fast runtime path for production service while preserving learned guardrails.
31
+
32
+ ### Key Runtime Decisions
33
+
34
+ - Keep production on fast validated binary lineage (`029edcafc` baseline family).
35
+ - Use strict balanced speculative profile for 35B operations:
36
+ - `speculative.n_max=12`
37
+ - `speculative.n_min=2`
38
+ - `speculative.p_min=0.80`
39
+
40
+ ### Representative Throughput Findings
41
+
42
+ - Qwen3.5-27B, `ctx=262144`, q4 KV cache:
43
+ - No spec: ~43 tok/s coding, ~41 tok/s pattern.
44
+ - Spec (balanced): ~43 tok/s coding, ~102 tok/s pattern.
45
+ - The uplift is confined to pattern-heavy turns (~41 to ~102 tok/s, roughly 2.5x); coding turns stayed at ~43 tok/s.
46
+
47
+ ## Phase 2 - Proxy Reasoning Fallback Leak Fix
48
+
49
+ ### Problems Observed
50
+
51
+ - Empty visible output (`output_tokens=0`) with large hidden reasoning payloads.
52
+ - Proxy emitted malformed chain-of-thought text as fallback, causing user-visible garbage:
53
+ - repeated fragments like `</parameter>`, tool schema echoes, policy text loops.
54
+
55
+ ### Fixes Implemented
56
+
57
+ - Added explicit streaming fallback policy:
58
+ - `PROXY_STREAM_REASONING_FALLBACK=off|sanitized|visible`
59
+ - `PROXY_STREAM_REASONING_MAX_CHARS`
60
+ - Set production default to `off`.
61
+
62
+ ### Result
63
+
64
+ - Malformed reasoning fallback leakage is suppressed by default.
65
+ - Debugging remains possible with `sanitized`/`visible` modes when intentionally enabled.
66
+
67
+ ## Phase 3 - Token Floor and Prune Controls
68
+
69
+ ### Problems Observed
70
+
71
+ - Hardcoded `max_tokens` floor (`16384`) forced very long failure turns.
72
+ - The pruning threshold flag alone could trigger the pruning path without meaningfully reducing the message count.
73
+
74
+ ### Fixes Implemented
75
+
76
+ - Added configurable max token floor:
77
+ - `PROXY_MAX_TOKENS_FLOOR` (`0` disables floor)
78
+ - Added configurable prune target:
79
+ - `PROXY_CONTEXT_PRUNE_TARGET_FRACTION`
80
+
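A minimal sketch of how a configurable floor like `PROXY_MAX_TOKENS_FLOOR` could be applied to an incoming request. The function name and the exact clamping rule are assumptions for illustration; the source only states that a floor exists and that `0` disables it.

```python
def apply_max_tokens_floor(requested: int, floor: int) -> int:
    """Return the max_tokens value to send upstream.

    Assumed semantics (not confirmed by the proxy source): a floor of 0
    disables the floor; otherwise requests below the floor are raised to it
    and larger requests pass through unchanged.
    """
    if floor <= 0:
        return requested
    return max(requested, floor)


if __name__ == "__main__":
    # A short 1024-token request under three floor settings.
    for floor in (16384, 4096, 0):
        print(floor, apply_max_tokens_floor(1024, floor))
```

Under these assumed semantics, a turn that fails silently can run for up to the floor value in tokens, which is why the size of the floor matters for failure latency.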
81
+ ### Live A/B Result (Production-Like)
82
+
83
+ `PROXY_MAX_TOKENS_FLOOR=16384` vs `4096`:
84
+
85
+ - Silent reasoning-heavy turn:
86
+ - `16384`: avg `78.749s`
87
+ - `4096`: avg `19.777s`
88
+ - Latency reduction: ~`74.9%`
89
+ - Predicted throughput unchanged (roughly `208 tok/s` in both runs)
90
+ - Normal tool turns remained stable and slightly faster with `4096`.
91
+
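The latency reduction quoted above follows directly from the two averages. A small check, using only the numbers reported in this section (the helper name is illustrative):

```python
def percent_reduction(before_s: float, after_s: float) -> float:
    """Relative reduction from before_s to after_s, in percent."""
    return (before_s - after_s) / before_s * 100.0


avg_floor_16384 = 78.749  # seconds, silent reasoning-heavy turn
avg_floor_4096 = 19.777   # seconds, same turn with the lower floor

reduction = percent_reduction(avg_floor_16384, avg_floor_4096)
print(f"{reduction:.1f}%")  # 74.9%
```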
92
+ ## Phase 4 - Malformed Tool-Loop Hardening
93
+
94
+ ### Problem Pattern
95
+
96
+ Under adversarial or degraded prompt states, the model can emit pseudo-tool text instead of valid tool calls, e.g.:
97
+
98
+ - `</parameter>` fragments
99
+ - echoed policy snippets (`you MUST call a tool...`)
100
+ - long no-progress text with no `tool_calls`
101
+
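The symptoms above suggest a simple text heuristic. The sketch below is hypothetical and is not the proxy's actual detector: the function name, the patterns, and the `max_plain_chars` cutoff are all assumptions chosen to mirror the three listed symptoms.

```python
import re

# Hypothetical markers derived from the symptoms listed above.
MARKUP_FRAGMENT = re.compile(r"</?parameter\b[^>]*>", re.IGNORECASE)
POLICY_ECHO = re.compile(r"you MUST call a tool", re.IGNORECASE)


def looks_malformed(text: str, tool_calls: list, max_plain_chars: int = 2000) -> bool:
    """Heuristically flag pseudo-tool text on a turn that offered tools."""
    if tool_calls:
        # A structured tool call was produced, so treat the turn as valid.
        return False
    if MARKUP_FRAGMENT.search(text) or POLICY_ECHO.search(text):
        return True
    # Long text with no tool call counts as "no progress" (cutoff is illustrative).
    return len(text) > max_plain_chars
```

A detector of this kind only decides whether to retry; the retry limits themselves would come from settings such as `PROXY_MALFORMED_TOOL_RETRY_MAX`.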
102
+ ### Feature Set Added (Flag Controlled)
103
+
104
+ 1. **Malformed tool guardrail + retry**
105
+ - `PROXY_MALFORMED_TOOL_GUARDRAIL`
106
+ - `PROXY_MALFORMED_TOOL_RETRY_MAX`
107
+ - `PROXY_MALFORMED_TOOL_RETRY_MAX_TOKENS`
108
+ - `PROXY_MALFORMED_TOOL_RETRY_TEMPERATURE`
109
+
110
+ 2. **Strict stream guardrail path**
111
+ - `PROXY_MALFORMED_TOOL_STREAM_STRICT`
112
+ - For stream+tools requests, the proxy makes a guarded non-streaming upstream call, then replays the result as SSE.
113
+
114
+ 3. **Tool narrowing (optional)**
115
+ - `PROXY_TOOL_NARROWING`
116
+ - `PROXY_TOOL_NARROWING_KEEP`
117
+ - `PROXY_TOOL_NARROWING_MIN_TOOLS`
118
+
119
+ 4. **Disable thinking on tool turns (optional)**
120
+ - `PROXY_DISABLE_THINKING_ON_TOOL_TURNS`
121
+
122
+ 5. **Session contamination breaker (optional safety net)**
123
+ - `PROXY_SESSION_CONTAMINATION_BREAKER`
124
+ - `PROXY_SESSION_CONTAMINATION_THRESHOLD`
125
+ - `PROXY_SESSION_CONTAMINATION_KEEP_LAST`
126
+
127
+ 6. **Agentic supplement mode**
128
+ - `PROXY_AGENTIC_SUPPLEMENT_MODE=clean|legacy`
129
+
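As an illustration of the session contamination breaker (item 5), the sketch below counts consecutive malformed turns and trims the history once a threshold is reached. The class, its method, and the trimming rule are assumptions, not the proxy's implementation; the values `3` and `8` mirror the documented defaults for `PROXY_SESSION_CONTAMINATION_THRESHOLD` and `PROXY_SESSION_CONTAMINATION_KEEP_LAST`.

```python
from dataclasses import dataclass


@dataclass
class ContaminationBreaker:
    threshold: int = 3  # consecutive malformed turns before a reset
    keep_last: int = 8  # most recent messages preserved by a reset
    streak: int = 0

    def observe(self, malformed: bool, messages: list) -> list:
        """Update the streak and return the (possibly trimmed) history."""
        self.streak = self.streak + 1 if malformed else 0
        if self.streak < self.threshold:
            return messages
        # Threshold reached: reset the streak and keep only recent context.
        self.streak = 0
        return messages[-self.keep_last:] if self.keep_last > 0 else []
```

Whether a real reset also preserves the system prompt or pairs tool calls with their results is not stated in this document, so the sketch leaves that out.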
130
+ ### Test Coverage
131
+
132
+ - Unit tests in `tools/agents/tests/test_anthropic_proxy_streaming.py`
133
+ - Current targeted suite count in this workstream: `16` passing tests.
134
+
135
+ ## Benchmark Highlights (Per-Option Toggles)
136
+
137
+ ### Artifact Stress Benchmark (v3)
138
+
139
+ Source: `/tmp/proxy_visibility_benchmark_v3.json`
140
+
141
+ | Mode | Key Flags | Outcome Summary |
142
+ | --- | --- | --- |
143
+ | Baseline | none | no tool call, policy-echo text surfaced |
144
+ | Option 1 | malformed guardrail + strict stream | malformed detected and retried; returned `tool_use` with empty visible text |
145
+ | Option 2 | tool narrowing only | not sufficient alone in stress case |
146
+ | Option 3 | disable thinking only | not sufficient alone in stress case |
147
+ | Option 4 | contamination breaker only | not sufficient alone in this synthetic workload |
148
+ | Option 5 | clean supplement only | not sufficient alone in stress case |
149
+
150
+ ### Practical Conclusion
151
+
152
+ - Strongest primary mitigation: **Option 1** (malformed guardrail + strict stream + bounded retry).
153
+ - Other options are secondary tuning aids and should not replace Option 1 for this failure class.
154
+
155
+ ## 10-Turn Live Stability Soak
156
+
157
+ Source: `/tmp/proxy_10turn_soak_results.json`
158
+
159
+ - 10 turns, alternating malformed-stress and normal tool-call turns, single live session id.
160
+ - Results:
161
+ - Error rate: `0.0%`
162
+ - Malformed visible output rate (stress turns): `0.0%`
163
+ - Normal tool-call success rate: `100.0%`
164
+ - Duration p50/p95: `10.2s` / `21.366s`
165
+ - Stop reasons: `tool_use=6`, `max_tokens=3`, `end_turn=1`
166
+
167
+ ## Production Profile (Current)
168
+
169
+ File: `/home/cogtek/.config/uap/anthropic-proxy.env`
170
+
171
+ ```bash
172
+ PROXY_MAX_TOKENS_FLOOR=4096
173
+ PROXY_STREAM_REASONING_FALLBACK=off
174
+
175
+ PROXY_MALFORMED_TOOL_GUARDRAIL=on
176
+ PROXY_MALFORMED_TOOL_STREAM_STRICT=on
177
+ PROXY_MALFORMED_TOOL_RETRY_MAX=1
178
+ PROXY_MALFORMED_TOOL_RETRY_MAX_TOKENS=512
179
+ PROXY_MALFORMED_TOOL_RETRY_TEMPERATURE=0
180
+
181
+ PROXY_TOOL_NARROWING=off
182
+ PROXY_DISABLE_THINKING_ON_TOOL_TURNS=off
183
+ PROXY_SESSION_CONTAMINATION_BREAKER=off
184
+ PROXY_AGENTIC_SUPPLEMENT_MODE=legacy
185
+ ```
186
+
187
+ Rationale:
188
+
189
+ - Keep the strongest practical fix enabled (malformed guardrail + strict stream path).
190
+ - Keep latency-optimized floor (`4096`).
191
+ - Keep optional secondary heuristics off unless new evidence warrants enablement.
192
+
193
+ ## Reproduction Checklist
194
+
195
+ 1. Restart services:
196
+
197
+ ```bash
198
+ systemctl --user restart uap-llama-server.service
199
+ systemctl --user restart uap-anthropic-proxy.service
200
+ ```
201
+
202
+ 2. Run targeted unit tests:
203
+
204
+ ```bash
205
+ python3 -m pytest tools/agents/tests/test_anthropic_proxy_streaming.py -q
206
+ ```
207
+
208
+ 3. Run soak script (or equivalent alternating malformed/normal stream sequence).
209
+
210
+ 4. Validate logs:
211
+
212
+ - `MALFORMED TOOL PAYLOAD`
213
+ - `MALFORMED RETRY ...`
214
+ - `STRICT STREAM GUARDRAIL`
215
+ - Absence of user-visible malformed fragments.
216
+
217
+ ## Open Follow-Ups
218
+
219
+ - Add a dedicated persistent benchmark harness under `scripts/` for this exact soak profile.
220
+ - Add branch/commit links from `llama.cpp` worktrees for cross-repo traceability.
221
+ - Optionally evaluate enabling `PROXY_TOOL_NARROWING` in production only after longer mixed-workload soak data.
@@ -116,6 +116,50 @@ llama-server \
116
116
  | `--draft-max` | `16` | Max tokens to draft per iteration. Higher = more throughput, more VRAM. |
117
117
  | `--draft-p-min` | `0.75` | Minimum acceptance probability. Lower = more aggressive drafting. |
118
118
 
119
+ ## Extension Options for Speculative Decoding
120
+
121
+ ### Option 1: Adaptive Runtime Tuning (implemented)
122
+
123
+ Use acceptance and rollback rates to auto-adjust `draft-max`, `draft-min`, and `draft-p-min` over time.
124
+
125
+ - Best for immediate gains without kernel changes
126
+ - Reduces bad bursts when acceptance drops
127
+ - Increases burst length automatically during high-acceptance windows
128
+
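To make the adaptive idea concrete, here is a hypothetical single adjustment step. It is not the algorithm behind `llama-optimize spec-autotune`; the thresholds, step sizes, and bounds are invented for illustration, and only the direction of each adjustment follows the bullets above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DraftParams:
    draft_max: int
    draft_min: int
    draft_p_min: float


def autotune_step(p: DraftParams, acceptance: float, rollback: float) -> DraftParams:
    """One adjustment step; every threshold and step size here is illustrative."""
    if acceptance < 0.60 or rollback > 0.20:
        # Drafts are being rejected: shorten bursts and require more confidence.
        return replace(
            p,
            draft_max=max(p.draft_min, p.draft_max - 2),
            draft_p_min=min(0.95, round(p.draft_p_min + 0.02, 2)),
        )
    if acceptance > 0.80 and rollback < 0.10:
        # High-acceptance window: allow longer bursts and looser drafting.
        return replace(
            p,
            draft_max=min(32, p.draft_max + 2),
            draft_p_min=max(0.50, round(p.draft_p_min - 0.02, 2)),
        )
    return p  # In the middle band, leave the parameters alone.
```

With these made-up bands, the sample metrics from the command below (`--acceptance 0.71 --rollback 0.14`) fall in the middle band and leave the parameters unchanged.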
129
+ Commands:
130
+
131
+ ```bash
132
+ # Tune once from observed metrics
133
+ llama-optimize spec-autotune --acceptance 0.71 --rollback 0.14 --profile throughput
134
+
135
+ # Compare static defaults vs adaptive tuning using deterministic simulation
136
+ llama-optimize spec-benchmark --profile throughput --trace mixed --steps 180
137
+
138
+ # Live benchmark active server and get tuned flag recommendation
139
+ llama-optimize spec-benchmark-live \
140
+ --endpoint http://127.0.0.1:8080/v1 \
141
+ --model qwen3.5-a3b-iq4xs \
142
+ --runs 5 --max-tokens 256 --profile throughput
143
+ ```
144
+
145
+ Recommended workflow:
146
+
147
+ 1. Run `spec-benchmark-live` with your current startup flags and note `Throughput`.
148
+ 2. Restart `llama-server` with the `Suggested params` flags.
149
+ 3. Re-run `spec-benchmark-live` with the same settings to measure actual gain.
150
+
151
+ ### Option 2: GPU Residency + Overlap
152
+
153
+ - Keep draft model and draft KV fully on GPU
154
+ - Preallocate buffers and overlap draft + verify passes with CUDA streams
155
+ - Improves p95 latency consistency on long runs
156
+
157
+ ### Option 3: GPU Checkpoint/Rollback
158
+
159
+ - Move speculative checkpoint snapshots from CPU RAM to GPU buffers
160
+ - Remove host-device copy overhead from rollback paths
161
+ - Highest upside, but requires deeper runtime changes
162
+
119
163
  ### Sampling
120
164
 
121
165
  | Flag | Value | Purpose |
@@ -177,6 +221,37 @@ All settings are via environment variables:
177
221
  | `PROXY_LOG_LEVEL` | `INFO` | Logging level (DEBUG/INFO/WARNING/ERROR) |
178
222
  | `PROXY_READ_TIMEOUT` | `600` | Read timeout (seconds) for LLM streaming |
179
223
  | `PROXY_MAX_CONNECTIONS` | `20` | Max concurrent upstream connections |
224
+ | `PROXY_MAX_TOKENS_FLOOR` | `16384` | Floor applied to incoming `max_tokens` (`0` disables the floor) |
225
+ | `PROXY_CONTEXT_PRUNE_TARGET_FRACTION` | `0.65` | Target context utilization after pruning (`0.0 < value < 1.0`) |
226
+ | `PROXY_STREAM_REASONING_FALLBACK` | `off` | Streaming behavior for reasoning-only empty turns (`off`, `sanitized`, `visible`) |
227
+ | `PROXY_STREAM_REASONING_MAX_CHARS` | `240` | Max fallback length when `PROXY_STREAM_REASONING_FALLBACK=sanitized` |
228
+ | `PROXY_TOOL_NARROWING` | `off` | Narrow large tool lists to top relevant tools per turn |
229
+ | `PROXY_TOOL_NARROWING_KEEP` | `8` | Number of tools to keep when narrowing is enabled |
230
+ | `PROXY_TOOL_NARROWING_MIN_TOOLS` | `12` | Minimum tool count before narrowing activates |
231
+ | `PROXY_DISABLE_THINKING_ON_TOOL_TURNS` | `off` | Sends `enable_thinking=false` when tools are present |
232
+ | `PROXY_MALFORMED_TOOL_GUARDRAIL` | `on` | Detects malformed pseudo tool payloads and retries with strict settings |
233
+ | `PROXY_MALFORMED_TOOL_RETRY_MAX` | `1` | Number of malformed-tool retries |
234
+ | `PROXY_MALFORMED_TOOL_RETRY_MAX_TOKENS` | `2048` | Retry cap for `max_tokens` during malformed-tool recovery |
235
+ | `PROXY_MALFORMED_TOOL_RETRY_TEMPERATURE` | `0` | Retry temperature for malformed-tool recovery |
236
+ | `PROXY_MALFORMED_TOOL_STREAM_STRICT` | `off` | For stream+tools requests, use guarded non-stream upstream path then replay SSE |
237
+ | `PROXY_SESSION_CONTAMINATION_BREAKER` | `on` | Resets long-running malformed sessions to recent context |
238
+ | `PROXY_SESSION_CONTAMINATION_THRESHOLD` | `3` | Consecutive malformed turns before reset |
239
+ | `PROXY_SESSION_CONTAMINATION_KEEP_LAST` | `8` | Number of latest messages to preserve during contamination reset |
240
+ | `PROXY_AGENTIC_SUPPLEMENT_MODE` | `clean` | Agentic system supplement variant (`clean`, `legacy`) |
241
+
242
+ For agentic coding workloads, keep `PROXY_STREAM_REASONING_FALLBACK=off` (default) to avoid leaking malformed internal reasoning as user-visible output. Use `sanitized` only for debugging.
243
+
244
+ For Claude Code + Qwen malformed-tool loops, the recommended starting profile is:
245
+
246
+ ```bash
247
+ PROXY_STREAM_REASONING_FALLBACK=off
248
+ PROXY_MAX_TOKENS_FLOOR=4096
249
+ PROXY_MALFORMED_TOOL_GUARDRAIL=on
250
+ PROXY_TOOL_NARROWING=on
251
+ PROXY_DISABLE_THINKING_ON_TOOL_TURNS=on
252
+ PROXY_SESSION_CONTAMINATION_BREAKER=on
253
+ PROXY_AGENTIC_SUPPLEMENT_MODE=clean
254
+ ```
180
255
 
181
256
  ### Example: Custom upstream
182
257
 
@@ -339,3 +414,4 @@ Two possible causes:
339
414
  - `tools/agents/scripts/qwen_tool_call_test.py` - Test suite using OpenAI-compatible API
340
415
  - `src/cli/tool-calls.ts` - CLI command for template management
341
416
  - `src/bin/llama-server-optimize.ts` - llama-server startup optimizer
417
+ - `docs/deployment/UAP_LLAMA_ANTHROPIC_PROXY_BOOTSTRAP.md` - service bootstrap + ngram-cache A/B benchmarking
@@ -0,0 +1,279 @@
1
+ # UAP + llama.cpp + Anthropic Proxy Bootstrap
2
+
3
+ This guide captures the local continuity stack as a repeatable bootstrap:
4
+
5
+ - `uap-llama-server.service` (llama.cpp)
6
+ - `uap-anthropic-proxy.service` (Anthropic API compatibility)
7
+ - A/B benchmark workflow for speculative decoding with `ngram-cache`
8
+
9
+ It also documents the UAP-side support changes needed to keep llama.cpp speculative decoding stable in agentic workflows.
10
+
11
+ ## 1) Bootstrap services
12
+
13
+ Run:
14
+
15
+ ```bash
16
+ bash scripts/bootstrap/bootstrap-uap-llama-proxy-stack.sh
17
+ ```
18
+
19
+ This writes:
20
+
21
+ - `~/.config/uap/llama-server.env`
22
+ - `~/.config/uap/anthropic-proxy.env`
23
+ - `~/.config/systemd/user/uap-llama-server.service`
24
+ - `~/.config/systemd/user/uap-anthropic-proxy.service`
25
+
26
+ Then it enables and starts both user services.
27
+
28
+ ## 2) Key llama env knobs
29
+
30
+ Edit `~/.config/uap/llama-server.env` and restart the service:
31
+
32
+ ```bash
33
+ systemctl --user restart uap-llama-server.service
34
+ ```
35
+
36
+ Important variables:
37
+
38
+ - `LLAMA_SPEC_TYPE` (`none`, `ngram-cache`, etc.)
39
+ - `LLAMA_DRAFT_MAX`
40
+ - `LLAMA_DRAFT_MIN`
41
+ - `LLAMA_DRAFT_P_MIN`
42
+ - `LLAMA_EXTRA_ARGS` (optional additional startup flags)
43
+
44
+ ## 3) Key proxy env knobs
45
+
46
+ Edit `~/.config/uap/anthropic-proxy.env` and restart the proxy:
47
+
48
+ ```bash
49
+ systemctl --user restart uap-anthropic-proxy.service
50
+ ```
51
+
52
+ Important variables:
53
+
54
+ - `PROXY_PORT`
55
+ - `LLAMA_CPP_BASE`
56
+ - `PROXY_CONTEXT_WINDOW` (set to `262144` to match llama context)
57
+ - Loop/guardrail options (`PROXY_LOOP_BREAKER`, `PROXY_FORCED_THRESHOLD`, etc.)
58
+
59
+ ## 4) Run ngram-cache signal benchmark
60
+
61
+ Use the service-oriented A/B script:
62
+
63
+ ```bash
64
+ bash scripts/benchmarks/run-spec-ngram-service-ab.sh
65
+ ```
66
+
67
+ What it does:
68
+
69
+ 1. Stops managed `uap-llama-server.service` temporarily
70
+ 2. Runs transient systemd service benchmarks for:
71
+ - `spec-type=none`
72
+ - `spec-type=ngram-cache` (default draft params)
73
+ - `spec-type=ngram-cache` (tuned: `21/6/0.72`)
74
+ 3. Restores managed `uap-llama-server.service`
75
+ 4. Writes report artifacts under `benchmark-results/spec-ngram-ab-<timestamp>/`
76
+
77
+ Outputs:
78
+
79
+ - `report.json` machine-readable deltas
80
+ - `report.md` human-readable summary
81
+
82
+ ## 5) Run automatic draft-parameter sweep (Option 2)
83
+
84
+ Use this to search for the best local `ngram-cache` settings:
85
+
86
+ ```bash
87
+ bash scripts/benchmarks/run-spec-ngram-sweep.sh
88
+ ```
89
+
90
+ Useful overrides:
91
+
92
+ ```bash
93
+ RUNS=5 MAX_TOKENS=256 \
94
+ DRAFT_MAXS="16 18 20 22" \
95
+ DRAFT_MINS="3 4 5 6" \
96
+ DRAFT_P_MINS="0.70 0.72 0.75 0.78" \
97
+ bash scripts/benchmarks/run-spec-ngram-sweep.sh
98
+ ```
99
+
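To gauge how long a sweep with these overrides might take, it helps to count the candidates. The sketch below assumes the script evaluates the full Cartesian product of the three lists, which this guide does not confirm.

```python
import itertools

# Same values as the override example above.
draft_maxs = "16 18 20 22".split()
draft_mins = "3 4 5 6".split()
draft_p_mins = "0.70 0.72 0.75 0.78".split()

# Assumption: every (max, min, p_min) combination is one candidate.
candidates = [
    (int(dmax), int(dmin), float(pmin))
    for dmax, dmin, pmin in itertools.product(draft_maxs, draft_mins, draft_p_mins)
]

print(len(candidates))  # 64
```

If each candidate is measured `RUNS=5` times, that would be 320 benchmark runs, so narrowing one of the lists is the quickest way to shorten a sweep.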
100
+ Outputs are written under `benchmark-results/spec-ngram-sweep-<timestamp>/`:
101
+
102
+ - `results.jsonl` one entry per candidate
103
+ - `summary.json` best candidate + stats
104
+ - `summary.md` top 5 table
105
+
106
+ ## 6) Profiles for agentic coding vs max speed
107
+
108
+ Use two explicit profiles depending on your goal.
109
+
110
+ ### A) Agentic coding continuity profile (recommended daily use)
111
+
112
+ This profile prioritizes long, coherent coding sessions and minimizes `find_slot` warnings.
113
+
114
+ `~/.config/uap/llama-server.env`:
115
+
116
+ ```env
117
+ LLAMA_CTX_SIZE=262144
118
+ LLAMA_SPEC_TYPE=ngram-cache
119
+ LLAMA_DRAFT_MAX=12
120
+ LLAMA_DRAFT_MIN=2
121
+ LLAMA_DRAFT_P_MIN=0.80
122
+ LLAMA_HYBRID_ROLLBACK_MODE=strict
123
+ ```
124
+
125
+ Apply it:
126
+
127
+ ```bash
128
+ systemctl --user restart uap-llama-server.service
129
+ ```
130
+
131
+ `~/.config/uap/anthropic-proxy.env`:
132
+
133
+ ```env
134
+ PROXY_CONTEXT_WINDOW=262144
135
+ PROXY_LOOP_BREAKER=on
136
+ PROXY_LOOP_WINDOW=6
137
+ PROXY_LOOP_REPEAT_THRESHOLD=10
138
+ PROXY_FORCED_THRESHOLD=18
139
+ PROXY_NO_PROGRESS_THRESHOLD=5
140
+ PROXY_CONTEXT_RELEASE_THRESHOLD=0.95
141
+ PROXY_GUARDRAIL_RETRY=on
142
+ ```
143
+
144
+ Apply it:
145
+
146
+ ```bash
147
+ systemctl --user restart uap-anthropic-proxy.service
148
+ ```
149
+
150
+ ### B) Max-throughput benchmark profile (where 220+ tok/s was observed)
151
+
152
+ The 220+ tok/s decode throughput observed in this session was achieved with:
153
+
154
+ - CUDA build: `/home/cogtek/llama.cpp/.worktrees/001-llama-spec-rollback-fix/build-cuda/bin/llama-server`
155
+ - GPU flags: `--device CUDA0 --n-gpu-layers all --flash-attn on`
156
+ - Speculative mode: `--spec-type ngram-cache`
157
+ - Rollback mode: `LLAMA_HYBRID_ROLLBACK_MODE=hybrid`
158
+ - Workload: repetitive pattern prompt, `n_predict=512`
159
+
160
+ Run command used for that profile:
161
+
162
+ ```bash
163
+ LLAMA_HYBRID_ROLLBACK_MODE=hybrid \
164
+ /home/cogtek/llama.cpp/.worktrees/001-llama-spec-rollback-fix/build-cuda/bin/llama-server \
165
+ -m "/home/cogtek/Downloads/Qwen3.5-35B-A3B-UD-IQ4_XS.gguf" \
166
+ --host 127.0.0.1 --port 18121 \
167
+ --ctx-size 16384 --parallel 1 --no-warmup \
168
+ --device CUDA0 --n-gpu-layers all --flash-attn on \
169
+ --spec-type ngram-cache
170
+ ```
171
+
172
+ Important: this max-speed profile is workload-sensitive and was measured on a pattern-heavy prompt. For real agentic coding, use Profile A.
173
+
174
+ ## 7) Validated A/B findings (2026-03-23)
175
+
176
+ Direct old-vs-new A/B was run against:
177
+
178
+ - old fast commit: `029edcafc` (the first fast state pushed, at around 21:35)
179
+ - newer commit: `1f8225f8f`
180
+ - model: `Qwen3.5-35B-A3B-UD-IQ4_XS.gguf`
181
+ - speculative: `ngram-cache`, `draft-max=16`, `draft-min=3`, `draft-p-min=0.72`
182
+
183
+ Notes:
184
+
185
+ - Standalone launches at `ctx-size=262144` can fail GPU allocation on some runs for the old commit (`failed to allocate compute pp buffers`).
186
+ - For controlled apples-to-apples throughput comparison, A/B was run at `ctx-size=16384`.
187
+
188
+ Observed results (`/tmp/ab_matrix_ctx16_v2.json`):
189
+
190
+ | Path | Old `029edcafc` | New `1f8225f8f` | Delta (new vs old) |
191
+ | --------------- | --------------- | --------------- | ------------------- |
192
+ | Raw coding | 107.97 tok/s | 99.23 tok/s | -8.1% |
193
+ | Raw pattern | 158.71 tok/s | 105.75 tok/s | -33.4% |
194
+ | Proxy plain | 113.74 tok/s | 109.39 tok/s | -3.8% |
195
+ | Agentic tool 2nd turn | `tool_use` (stable) | `tool_use` (stable) | parity on control flow |
196
+
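The delta column can be recomputed from the two throughput columns. A short check using the table's own numbers (the helper name is illustrative):

```python
# (old tok/s, new tok/s) per benchmark path, copied from the table above.
rows = {
    "Raw coding": (107.97, 99.23),
    "Raw pattern": (158.71, 105.75),
    "Proxy plain": (113.74, 109.39),
}


def delta_pct(old: float, new: float) -> float:
    """Change of new relative to old, in percent (negative means slower)."""
    return (new - old) / old * 100.0


deltas = {name: round(delta_pct(old, new), 1) for name, (old, new) in rows.items()}
for name, value in deltas.items():
    print(f"{name}: {value:+.1f}%")
```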
197
+ Behavioral observations:
198
+
199
+ - Newer commit emitted many `find_slot: non-consecutive token position` warnings in raw/proxy runs under the same speculative settings.
200
+ - Old commit produced materially cleaner logs and higher throughput in the same benchmark profile.
201
+ - Proxy continuity fixes improved agentic tool-loop stability and no longer force a premature stop in the tested loop.
202
+
203
+ Decision for throughput-sensitive testing:
204
+
205
+ - Prefer old fast commit `029edcafc` profile for max-throughput benchmarking.
206
+ - Keep a separate continuity profile for long-context agentic coding if warning volume grows.
207
+
208
+ Additional 27B impact snapshot (`Qwen3.5-27B-IQ4_XS`, `ctx=262144`, q4 KV cache):
209
+
210
+ - no speculative: ~43 tok/s coding, ~41 tok/s pattern
211
+ - aggressive speculative (`16/3/0.72`): ~44 tok/s coding, ~102 tok/s pattern
212
+ - balanced speculative (`12/2/0.80`): ~43 tok/s coding, ~102 tok/s pattern
213
+
214
+ Interpretation:
215
+
216
+ - the balanced profile is functionally safer for agentic sessions;
217
+ - the aggressive profile can edge higher on some coding runs (~44 vs ~43 tok/s here);
218
+ - both speculative profiles reach roughly 2.5x the no-spec rate on the pattern workload (~102 vs ~41 tok/s).
219
+
220
+ ## 8) Throughput interpretation and loop prevention
221
+
222
+ When reading llama logs, treat these as different metrics:
223
+
224
+ - `prompt eval time ... tokens per second` = prefill throughput
225
+ - `eval time ... tokens per second` = decode/completion throughput
226
+
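A small parser can keep the two metrics apart when scanning logs. This is a sketch: the regular expression keys only on the two phrases quoted above, and the sample lines in the usage note are illustrative rather than captured output, so real log lines may carry extra prefixes or fields.

```python
import re
from typing import Optional, Tuple

# "prompt eval time" is prefill; a bare "eval time" is decode. The search is
# leftmost-first, so a prompt line is never misread as a decode line.
RATE_LINE = re.compile(
    r"(?P<kind>prompt eval|eval) time\b.*?"
    r"(?P<rate>[0-9]+(?:\.[0-9]+)?) tokens per second"
)


def classify(line: str) -> Optional[Tuple[str, float]]:
    """Return ("prefill" | "decode", tokens_per_second), or None if no match."""
    m = RATE_LINE.search(line)
    if m is None:
        return None
    metric = "prefill" if m.group("kind") == "prompt eval" else "decode"
    return metric, float(m.group("rate"))
```

For example, `classify("prompt eval time = 100.00 ms / 250 tokens ( 0.40 ms per token, 2500.00 tokens per second)")` yields a prefill reading, while the same call on a line starting with `eval time` yields a decode reading.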
227
+ In local continuity runs with large context, prompt throughput may exceed 2k tok/s while decode remains near 80-125 tok/s.
228
+
229
+ For default stability, use the guardrails from Profile A. If you hit active loop incidents, temporarily tighten to:
230
+
231
+ ```env
232
+ PROXY_LOOP_WINDOW=6
233
+ PROXY_LOOP_REPEAT_THRESHOLD=8
234
+ PROXY_FORCED_THRESHOLD=14
235
+ PROXY_NO_PROGRESS_THRESHOLD=4
236
+ PROXY_CONTEXT_RELEASE_THRESHOLD=0.90
237
+ ```
238
+
239
+ Then restart the proxy:
240
+
241
+ ```bash
242
+ systemctl --user restart uap-anthropic-proxy.service
243
+ ```
244
+
245
+ ## 9) UAP support changes required for reliable operation
246
+
247
+ The following UAP-side changes are part of the working stack and should be present:
248
+
249
+ 1. Session-scoped loop protection in Anthropic proxy (no cross-session contamination).
250
+ 2. Guardrail retry for unexpected text-only end-turn in active tool loops.
251
+ 3. Optional systemd scaffolding from CLI:
252
+ - `uap init --systemd-services`
253
+ - `uap setup --systemd-services`
254
+ 4. Dedicated launch scripts:
255
+ - `scripts/run-llama-server-continuity.sh`
256
+ - `scripts/run-anthropic-proxy-continuity.sh`
257
+
258
+ These changes ensure llama speculative behavior is evaluated in a stable proxy/control-plane environment.
259
+
260
+ ## 10) Check service health
261
+
262
+ ```bash
263
+ systemctl --user status uap-llama-server.service --no-pager
264
+ systemctl --user status uap-anthropic-proxy.service --no-pager
265
+ curl -sf http://127.0.0.1:8080/v1/models
266
+ curl -sf http://127.0.0.1:4000/health
267
+ ```
268
+
269
+ ## 11) References and credits
270
+
271
+ This implementation and tuning flow builds on prior llama.cpp and proxy work:
272
+
273
+ - llama.cpp speculative docs: `docs/speculative.md`
274
+ - llama.cpp hybrid rollout notes: `docs/development/speculative-hybrid-rollout.md`
275
+ - llama.cpp speculative lineage: #5479, #6828, #6848, #19164
276
+ - checkpoint/SWA context note:
277
+ - https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055
278
+
279
+ Thanks to ggml-org/llama.cpp maintainers and contributors for speculative, cache, and memory-path groundwork.
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "@miller-tech/uap",
3
- "version": "1.13.13",
3
+ "version": "1.13.15",
4
4
  "description": "Autonomous AI agent memory system with CLAUDE.md protocol enforcement",
5
5
  "type": "module",
6
6
  "main": "dist/index.js",