@miller-tech/uap 1.39.0 → 1.40.1

Files changed (99)
  1. package/README.md +109 -642
  2. package/dist/.tsbuildinfo +1 -1
  3. package/dist/bin/cli.js +2 -2
  4. package/dist/bin/cli.js.map +1 -1
  5. package/dist/cli/deliver.d.ts +3 -2
  6. package/dist/cli/deliver.d.ts.map +1 -1
  7. package/dist/cli/deliver.js +10 -5
  8. package/dist/cli/deliver.js.map +1 -1
  9. package/docs/INDEX.md +48 -286
  10. package/docs/architecture/OVERVIEW.md +328 -0
  11. package/docs/architecture/PROTOCOL.md +204 -0
  12. package/docs/benchmarks/README.md +17 -192
  13. package/docs/getting-started/CONFIGURATION.md +237 -0
  14. package/docs/getting-started/INSTALLATION.md +125 -0
  15. package/docs/getting-started/QUICKSTART.md +115 -0
  16. package/docs/guides/COORDINATION.md +162 -0
  17. package/docs/guides/DELIVER.md +115 -0
  18. package/docs/guides/DEPLOY_BATCHING.md +212 -0
  19. package/docs/guides/DROIDS_AND_SKILLS.md +202 -0
  20. package/docs/guides/LOCAL_MODELS.md +148 -0
  21. package/docs/guides/MCP_ROUTER.md +195 -0
  22. package/docs/guides/MEMORY.md +235 -0
  23. package/docs/guides/MULTI_MODEL.md +223 -0
  24. package/docs/guides/POLICIES.md +190 -0
  25. package/docs/guides/WORKTREE_WORKFLOW.md +185 -0
  26. package/docs/integrations/MCP_ROUTER.md +147 -0
  27. package/docs/integrations/RTK.md +102 -0
  28. package/docs/reference/API.md +485 -0
  29. package/docs/reference/CLI.md +719 -0
  30. package/docs/reference/CONFIGURATION.md +90 -193
  31. package/docs/reference/DATABASE_SCHEMA.md +110 -344
  32. package/docs/reference/FEATURES.md +176 -472
  33. package/docs/reference/PATTERNS.md +102 -0
  34. package/docs/reference/PLATFORMS.md +83 -0
  35. package/package.json +1 -1
  36. package/docs/AGENTS.md +0 -423
  37. package/docs/DOCUMENTATION_AUDIT_REPORT.md +0 -131
  38. package/docs/GETTING_STARTED.md +0 -288
  39. package/docs/PROJECT_ANALYSIS_REPORT.md +0 -510
  40. package/docs/architecture/COMPLETE_ARCHITECTURE.md +0 -748
  41. package/docs/architecture/EXPERT_STACK.md +0 -137
  42. package/docs/architecture/MULTI_MODEL.md +0 -224
  43. package/docs/architecture/PLATFORM_GATING.md +0 -68
  44. package/docs/architecture/SYSTEM_ANALYSIS.md +0 -334
  45. package/docs/architecture/UAP_COMPLIANCE.md +0 -217
  46. package/docs/architecture/UAP_PROTOCOL.md +0 -339
  47. package/docs/architecture/UAP_STRICT_DROIDS.md +0 -172
  48. package/docs/archive/BALLS_MODE_SELF_ANALYSIS.md +0 -260
  49. package/docs/archive/BENCHMARK_GAPS_AND_PLAN.md +0 -146
  50. package/docs/archive/FAILING_TASKS_SOLUTION_PLAN.md +0 -668
  51. package/docs/archive/JINJA2-SYSTEM-MESSAGE-FIX.md +0 -209
  52. package/docs/archive/MODEL_ROUTING_IMPLEMENTATION_SUMMARY.md +0 -281
  53. package/docs/archive/MODEL_ROUTING_OPTIMIZATION_PLAN.md +0 -320
  54. package/docs/archive/NPM-PUBLISH-V0.9.1.md +0 -240
  55. package/docs/archive/OPTIMIZATION_OPTIONS.md +0 -334
  56. package/docs/archive/PARALLELISM_GAPS_AND_OPTIONS.md +0 -422
  57. package/docs/archive/POLICY_GATE_IMPLEMENTATION.md +0 -245
  58. package/docs/archive/SETUP_IMPROVEMENTS.md +0 -213
  59. package/docs/archive/UAP_GENERIC_OPTIMIZATION_PLAN.md +0 -270
  60. package/docs/archive/UAP_OPTIMIZATION_PLAN.md +0 -701
  61. package/docs/archive/UAP_V103_PATTERN_DESIGN.md +0 -315
  62. package/docs/archive/UAP_V104_COMPLIANCE_DESIGN.md +0 -223
  63. package/docs/archive/changelog/2026-03-10_uap-100-compliance.md +0 -77
  64. package/docs/archive/changelog/2026-03-10_uap-full-system-verification.md +0 -109
  65. package/docs/archive/opencode-integration-guide.md +0 -740
  66. package/docs/archive/opencode-integration-quickref.md +0 -180
  67. package/docs/benchmarks/OVERNIGHT_RUNNER.md +0 -341
  68. package/docs/benchmarks/SPECULATIVE_DECODING_JOURNEY_2026-03.md +0 -221
  69. package/docs/benchmarks/VALIDATION_PLAN.md +0 -568
  70. package/docs/blog/SPECULATIVE_DECODING_PRODUCTION_PLAYBOOK.md +0 -139
  71. package/docs/blog/local-coding-agents.md +0 -266
  72. package/docs/blog/x-thread.md +0 -254
  73. package/docs/deployment/DEPLOYMENT.md +0 -895
  74. package/docs/deployment/DEPLOYMENT_STRATEGIES.md +0 -518
  75. package/docs/deployment/DEPLOY_BATCHER_ANALYSIS.md +0 -224
  76. package/docs/deployment/DEPLOY_BATCHING.md +0 -273
  77. package/docs/deployment/DEPLOY_BUCKETING_ANALYSIS.md +0 -420
  78. package/docs/deployment/QWEN35_LLAMA_CPP.md +0 -426
  79. package/docs/deployment/UAP_LLAMA_ANTHROPIC_PROXY_BOOTSTRAP.md +0 -279
  80. package/docs/getting-started/INTEGRATION.md +0 -628
  81. package/docs/getting-started/OVERVIEW.md +0 -324
  82. package/docs/getting-started/SETUP.md +0 -377
  83. package/docs/integrations/MCP_ROUTER_SETUP.md +0 -445
  84. package/docs/integrations/RTK_INTEGRATION.md +0 -468
  85. package/docs/operations/TROUBLESHOOTING.md +0 -660
  86. package/docs/pr/PR_SPECULATIVE_DOCS_TEMPLATE.md +0 -146
  87. package/docs/pr/UPSTREAM_PRS.md +0 -424
  88. package/docs/reference/API_REFERENCE.md +0 -903
  89. package/docs/reference/EXPERT_DROIDS.md +0 -219
  90. package/docs/reference/HARNESS-MATRIX.md +0 -318
  91. package/docs/reference/PATTERN_LIBRARY.md +0 -636
  92. package/docs/reference/UAP_CLI_REFERENCE.md +0 -620
  93. package/docs/research/BEHAVIORAL_PATTERNS.md +0 -228
  94. package/docs/research/DOMAIN_STRATEGIES.md +0 -316
  95. package/docs/research/MEMORY_SYSTEMS_COMPARISON.md +0 -812
  96. package/docs/research/PATTERN_ANALYSIS_2026-01-18.md +0 -436
  97. package/docs/research/PERFORMANCE_ANALYSIS_2026-01-18.md +0 -209
  98. package/docs/research/PERFORMANCE_TEST_PLAN.md +0 -383
  99. package/docs/research/TERMINAL_BENCH_LEARNINGS.md +0 -217
@@ -1,279 +0,0 @@
- # UAP + llama.cpp + Anthropic Proxy Bootstrap
-
- This guide captures the local continuity stack as a repeatable bootstrap:
-
- - `uap-llama-server.service` (llama.cpp)
- - `uap-anthropic-proxy.service` (Anthropic API compatibility)
- - A/B benchmark workflow for speculative decoding with `ngram-cache`
-
- It also documents the UAP-side support changes needed to keep llama.cpp speculative decoding stable in agentic workflows.
-
- ## 1) Bootstrap services
-
- Run:
-
- ```bash
- bash scripts/bootstrap/bootstrap-uap-llama-proxy-stack.sh
- ```
-
- This writes:
-
- - `~/.config/uap/llama-server.env`
- - `~/.config/uap/anthropic-proxy.env`
- - `~/.config/systemd/user/uap-llama-server.service`
- - `~/.config/systemd/user/uap-anthropic-proxy.service`
-
- Then it enables and starts both user services.
-
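You can confirm that the bootstrap step wrote everything by checking for the four files listed above. The sketch below is a minimal, hypothetical helper; `missing_bootstrap_files` is not part of UAP. It only checks that the paths exist as regular files. It does not check that their contents are valid.

```python
from pathlib import Path
from typing import Optional

# Paths the bootstrap script is documented to write, relative to the home directory.
EXPECTED_FILES = [
    ".config/uap/llama-server.env",
    ".config/uap/anthropic-proxy.env",
    ".config/systemd/user/uap-llama-server.service",
    ".config/systemd/user/uap-anthropic-proxy.service",
]


def missing_bootstrap_files(home: Optional[Path] = None) -> list[str]:
    """Return the expected bootstrap files that are not regular files under `home`."""
    base = home if home is not None else Path.home()
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]
```

Calling `missing_bootstrap_files()` with no argument checks the real home directory. Whether the units were actually enabled is a separate question, which `systemctl --user is-enabled uap-llama-server.service` answers.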
- ## 2) Key llama env knobs
-
- Edit `~/.config/uap/llama-server.env` and restart the service:
-
- ```bash
- systemctl --user restart uap-llama-server.service
- ```
-
- Important variables:
-
- - `LLAMA_SPEC_TYPE` (`none`, `ngram-cache`, etc.)
- - `LLAMA_DRAFT_MAX`
- - `LLAMA_DRAFT_MIN`
- - `LLAMA_DRAFT_P_MIN`
- - `LLAMA_EXTRA_ARGS` (optional additional startup flags)
-
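This guide does not document how the launch scripts turn these variables into `llama-server` flags. The sketch below assumes a direct mapping onto the flag names used elsewhere here: `--spec-type`, plus the `draft-max`, `draft-min`, and `draft-p-min` parameters from the A/B notes. It parses a simple `KEY=VALUE` file and appends `LLAMA_EXTRA_ARGS` as shell-split extra flags. `FLAG_MAP`, `parse_env`, and `llama_args` are hypothetical names.

```python
import shlex

# Hypothetical mapping from the env knobs above to llama-server flags.
FLAG_MAP = {
    "LLAMA_SPEC_TYPE": "--spec-type",
    "LLAMA_DRAFT_MAX": "--draft-max",
    "LLAMA_DRAFT_MIN": "--draft-min",
    "LLAMA_DRAFT_P_MIN": "--draft-p-min",
}


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one layer of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        env[key.strip()] = value
    return env


def llama_args(env: dict[str, str]) -> list[str]:
    """Translate the knobs that are set into flags, then append LLAMA_EXTRA_ARGS."""
    args: list[str] = []
    for key, flag in FLAG_MAP.items():
        if env.get(key):
            args += [flag, env[key]]
    # shlex.split keeps quoted values inside LLAMA_EXTRA_ARGS together.
    args += shlex.split(env.get("LLAMA_EXTRA_ARGS", ""))
    return args
```

The parser only handles the simple values shown in this guide. If the service loads the file through systemd's `EnvironmentFile=` (an assumption), systemd's own parsing rules are what actually apply.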
- ## 3) Key proxy env knobs
-
- Edit `~/.config/uap/anthropic-proxy.env` and restart the proxy:
-
- ```bash
- systemctl --user restart uap-anthropic-proxy.service
- ```
-
- Important variables:
-
- - `PROXY_PORT`
- - `LLAMA_CPP_BASE`
- - `PROXY_CONTEXT_WINDOW` (set to `262144` to match the llama.cpp context size, `LLAMA_CTX_SIZE`)
- - Loop/guardrail options (`PROXY_LOOP_BREAKER`, `PROXY_FORCED_THRESHOLD`, etc.)
-
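`PROXY_CONTEXT_WINDOW` is meant to mirror the llama.cpp context size, so a mismatch between the two env files is worth catching before a restart. This hypothetical check takes both files already parsed into dictionaries. It compares `LLAMA_CTX_SIZE` (the variable set in Profile A) with `PROXY_CONTEXT_WINDOW`.

```python
from typing import Optional


def context_mismatch(llama_env: dict[str, str], proxy_env: dict[str, str]) -> Optional[str]:
    """Return a problem description, or None if the two context sizes agree."""
    llama_ctx = llama_env.get("LLAMA_CTX_SIZE")
    proxy_ctx = proxy_env.get("PROXY_CONTEXT_WINDOW")
    if llama_ctx is None or proxy_ctx is None:
        return "LLAMA_CTX_SIZE or PROXY_CONTEXT_WINDOW is not set"
    if int(llama_ctx) != int(proxy_ctx):
        return f"PROXY_CONTEXT_WINDOW={proxy_ctx} does not match LLAMA_CTX_SIZE={llama_ctx}"
    return None
```

The check treats a missing key as a problem rather than assuming a default. That is deliberate: an unset value is exactly the kind of silent drift this check is meant to surface.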
- ## 4) Run ngram-cache signal benchmark
-
- Use the service-oriented A/B script:
-
- ```bash
- bash scripts/benchmarks/run-spec-ngram-service-ab.sh
- ```
-
- What it does:
-
- 1. Temporarily stops the managed `uap-llama-server.service`.
- 2. Runs transient systemd service benchmarks for:
-    - `spec-type=none`
-    - `spec-type=ngram-cache` (default draft params)
-    - `spec-type=ngram-cache` (tuned `draft-max/draft-min/draft-p-min`: `21/6/0.72`)
- 3. Restores the managed `uap-llama-server.service`.
- 4. Writes report artifacts under `benchmark-results/spec-ngram-ab-<timestamp>/`.
-
- Outputs:
-
- - `report.json`: machine-readable deltas
- - `report.md`: human-readable summary
-
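Every A/B run creates a new `spec-ngram-ab-<timestamp>` directory. This guide does not specify the timestamp format, so the hypothetical helper below picks the most recently modified matching directory instead of sorting by name. It returns that directory's `report.md` if one exists.

```python
from pathlib import Path
from typing import Optional


def latest_report(results_root: Path, prefix: str = "spec-ngram-ab-",
                  name: str = "report.md") -> Optional[Path]:
    """Return `name` inside the newest `<prefix>*` run directory, or None."""
    runs = [p for p in results_root.glob(prefix + "*") if p.is_dir()]
    if not runs:
        return None
    # Use modification time so the result does not depend on the timestamp format.
    newest = max(runs, key=lambda p: p.stat().st_mtime)
    report = newest / name
    return report if report.is_file() else None
```

The same helper works for the parameter sweep. Pass `prefix="spec-ngram-sweep-"` and `name="summary.md"`.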
- ## 5) Run automatic draft-parameter sweep (Option 2)
-
- Use this to search for the best local `ngram-cache` settings:
-
- ```bash
- bash scripts/benchmarks/run-spec-ngram-sweep.sh
- ```
-
- Useful overrides:
-
- ```bash
- RUNS=5 MAX_TOKENS=256 \
- DRAFT_MAXS="16 18 20 22" \
- DRAFT_MINS="3 4 5 6" \
- DRAFT_P_MINS="0.70 0.72 0.75 0.78" \
- bash scripts/benchmarks/run-spec-ngram-sweep.sh
- ```
-
- Outputs are written under `benchmark-results/spec-ngram-sweep-<timestamp>/`:
-
- - `results.jsonl`: one entry per candidate
- - `summary.json`: best candidate + stats
- - `summary.md`: table of the top 5 candidates
-
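The override example defines a full grid: 4 draft-max values × 4 draft-min values × 4 draft-p-min values = 64 candidates. If `RUNS` counts repetitions per candidate, that is 320 benchmark runs, so check the grid size before starting a sweep. That reading of `RUNS` is an assumption; this guide does not describe the script's exact semantics. The sketch below enumerates the same grid. Its ordering and the `draft_min <= draft_max` filter are illustrative choices, not necessarily what `run-spec-ngram-sweep.sh` does.

```python
from itertools import product

# Override values from the example above, space-separated as the script takes them.
DRAFT_MAXS = "16 18 20 22".split()
DRAFT_MINS = "3 4 5 6".split()
DRAFT_P_MINS = "0.70 0.72 0.75 0.78".split()
RUNS = 5


def sweep_grid(maxs, mins, p_mins):
    """Yield (draft_max, draft_min, draft_p_min) for every combination.

    The draft_min <= draft_max filter is a sanity check added here; it removes
    nothing from this particular grid.
    """
    for d_max, d_min, p_min in product(maxs, mins, p_mins):
        if int(d_min) <= int(d_max):
            yield int(d_max), int(d_min), float(p_min)


candidates = list(sweep_grid(DRAFT_MAXS, DRAFT_MINS, DRAFT_P_MINS))
print(f"{len(candidates)} candidates x {RUNS} runs = {len(candidates) * RUNS} runs")
```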
- ## 6) Profiles for agentic coding vs max speed
-
- Choose one of two explicit profiles depending on your goal.
-
- ### A) Agentic coding continuity profile (recommended for daily use)
-
- This profile prioritizes long, coherent coding sessions and minimizes `find_slot` warnings.
-
- `~/.config/uap/llama-server.env`:
-
- ```env
- LLAMA_CTX_SIZE=262144
- LLAMA_SPEC_TYPE=ngram-cache
- LLAMA_DRAFT_MAX=12
- LLAMA_DRAFT_MIN=2
- LLAMA_DRAFT_P_MIN=0.80
- LLAMA_HYBRID_ROLLBACK_MODE=strict
- ```
-
- Apply it:
-
- ```bash
- systemctl --user restart uap-llama-server.service
- ```
-
- `~/.config/uap/anthropic-proxy.env`:
-
- ```env
- PROXY_CONTEXT_WINDOW=262144
- PROXY_LOOP_BREAKER=on
- PROXY_LOOP_WINDOW=6
- PROXY_LOOP_REPEAT_THRESHOLD=10
- PROXY_FORCED_THRESHOLD=18
- PROXY_NO_PROGRESS_THRESHOLD=5
- PROXY_CONTEXT_RELEASE_THRESHOLD=0.95
- PROXY_GUARDRAIL_RETRY=on
- ```
-
- Apply it:
-
- ```bash
- systemctl --user restart uap-anthropic-proxy.service
- ```
-
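This guide does not define how `PROXY_CONTEXT_RELEASE_THRESHOLD` is applied. Suppose it is a simple fraction of `PROXY_CONTEXT_WINDOW`; that is an assumption. Then Profile A's 0.95 would trigger a release once roughly 249k of the 262,144 tokens are in use. The tightened 0.90 from section 8 would trigger at roughly 236k. The arithmetic, as a hypothetical helper:

```python
import math


def release_point(window_tokens: int, threshold: float) -> int:
    """First token count at which usage reaches `threshold` of the window.

    Assumes the threshold is a simple fraction of the context window; the
    proxy's real rule is not documented in this guide.
    """
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    return math.ceil(window_tokens * threshold)


print(release_point(262144, 0.95))  # Profile A
print(release_point(262144, 0.90))  # tightened setting from section 8
```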
- ### B) Max-throughput benchmark profile (where 220+ tok/s was observed)
-
- The 220+ tok/s decode throughput was measured with:
-
- - CUDA build: `/home/cogtek/llama.cpp/.worktrees/001-llama-spec-rollback-fix/build-cuda/bin/llama-server`
- - GPU flags: `--device CUDA0 --n-gpu-layers all --flash-attn on`
- - Speculative mode: `--spec-type ngram-cache`
- - Rollback mode: `LLAMA_HYBRID_ROLLBACK_MODE=hybrid`
- - Workload: repetitive pattern prompt, `n_predict=512`
-
- Command used for that run:
-
- ```bash
- LLAMA_HYBRID_ROLLBACK_MODE=hybrid \
- /home/cogtek/llama.cpp/.worktrees/001-llama-spec-rollback-fix/build-cuda/bin/llama-server \
-   -m "/home/cogtek/Downloads/Qwen3.5-35B-A3B-UD-IQ4_XS.gguf" \
-   --host 127.0.0.1 --port 18121 \
-   --ctx-size 16384 --parallel 1 --no-warmup \
-   --device CUDA0 --n-gpu-layers all --flash-attn on \
-   --spec-type ngram-cache
- ```
-
- Important: this max-speed profile is workload-sensitive and was measured on a pattern-heavy prompt. For real agentic coding, use Profile A.
-
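To rerun the max-throughput measurement from a script, you can assemble the command above as an argument list, with the rollback mode passed through the environment. This is a sketch. The binary and model paths from this guide are machine-specific and must be supplied for your setup. `build_max_speed_cmd` is a hypothetical helper that only builds the command and does not launch anything.

```python
import os


def build_max_speed_cmd(server: str, model: str,
                        port: int = 18121) -> tuple[list[str], dict[str, str]]:
    """Return (argv, env) mirroring the Profile B llama-server launch above."""
    argv = [
        server,
        "-m", model,
        "--host", "127.0.0.1", "--port", str(port),
        "--ctx-size", "16384", "--parallel", "1", "--no-warmup",
        "--device", "CUDA0", "--n-gpu-layers", "all", "--flash-attn", "on",
        "--spec-type", "ngram-cache",
    ]
    # As in the shell command, the rollback mode travels as an environment variable.
    env = dict(os.environ, LLAMA_HYBRID_ROLLBACK_MODE="hybrid")
    return argv, env
```

To launch, call `subprocess.Popen(argv, env=env)`. Passing a list instead of a shell string avoids quoting problems with paths that contain spaces.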
- ## 7) Validated A/B findings (2026-03-23)
-
- A direct old-vs-new A/B comparison was run with:
-
- - old fast commit: `029edcafc` (the first pushed fast state, around 21:35)
- - newer commit: `1f8225f8f`
- - model: `Qwen3.5-35B-A3B-UD-IQ4_XS.gguf`
- - speculative: `ngram-cache`, `draft-max=16`, `draft-min=3`, `draft-p-min=0.72`
-
- Notes:
-
- - With the old commit, standalone launches at `ctx-size=262144` can fail GPU allocation on some runs (`failed to allocate compute pp buffers`).
- - For a controlled, apples-to-apples throughput comparison, the A/B was run at `ctx-size=16384`.
-
- Observed results (`/tmp/ab_matrix_ctx16_v2.json`):
-
- | Path                  | Old `029edcafc`     | New `1f8225f8f`     | Delta (new vs old)     |
- | --------------------- | ------------------- | ------------------- | ---------------------- |
- | Raw coding            | 107.97 tok/s        | 99.23 tok/s         | -8.1%                  |
- | Raw pattern           | 158.71 tok/s        | 105.75 tok/s        | -33.4%                 |
- | Proxy plain           | 113.74 tok/s        | 109.39 tok/s        | -3.8%                  |
- | Agentic tool 2nd turn | `tool_use` (stable) | `tool_use` (stable) | parity on control flow |
-
- Behavioral observations:
-
- - The newer commit emitted many `find_slot: non-consecutive token position` warnings in raw and proxy runs under the same speculative settings.
- - The old commit produced materially cleaner logs and higher throughput in the same benchmark profile.
- - Proxy continuity fixes improved agentic tool-loop stability and no longer force a premature stop in the tested loop.
-
- Decision for throughput-sensitive testing:
-
- - Prefer the old fast commit `029edcafc` for max-throughput benchmarking.
- - Keep a separate continuity profile for long-context agentic coding if warning volume grows.
-
- Additional 27B impact snapshot (`Qwen3.5-27B-IQ4_XS`, `ctx=262144`, q4 KV cache):
-
- - no speculative: ~43 tok/s coding, ~41 tok/s pattern
- - aggressive speculative (`16/3/0.72`): ~44 tok/s coding, ~102 tok/s pattern
- - balanced speculative (`12/2/0.80`): ~43 tok/s coding, ~102 tok/s pattern
-
- Interpretation:
-
- - The balanced profile is functionally safer for agentic sessions.
- - The aggressive profile can edge slightly higher on some coding runs (~44 vs ~43 tok/s).
- - Both speculative profiles far outperform no-spec on the repetition-heavy pattern workload (~102 vs ~41 tok/s, roughly 2.5x).
-
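The delta column is the relative change (new - old) / old, expressed as a percentage. Recomputing it from the tok/s values is a cheap consistency check for hand-assembled tables like the one above. The sketch below reproduces the three numeric rows. It also computes the roughly 2.5x pattern-workload speedup from the 27B snapshot (~102 vs ~41 tok/s).

```python
# tok/s values copied from the A/B table: (old 029edcafc, new 1f8225f8f)
AB_RESULTS = {
    "Raw coding": (107.97, 99.23),
    "Raw pattern": (158.71, 105.75),
    "Proxy plain": (113.74, 109.39),
}


def pct_delta(old: float, new: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


for path, (old, new) in AB_RESULTS.items():
    print(f"{path:<12} {pct_delta(old, new):+.1f}%")

# 27B snapshot, pattern workload: ~41 tok/s without speculation, ~102 tok/s with it.
print(f"27B pattern speedup: {102 / 41:.1f}x")
```

Run as-is, this prints -8.1%, -33.4%, and -3.8%, matching the table, followed by a 2.5x speedup.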
- ## 8) Throughput interpretation and loop prevention
-
- When reading llama.cpp logs, treat these as distinct metrics:
-
- - `prompt eval time ... tokens per second` = prefill throughput
- - `eval time ... tokens per second` = decode/completion throughput
-
- In local continuity runs with a large context, prompt (prefill) throughput may exceed 2k tok/s while decode stays around 80-125 tok/s.
-
- For default stability, use the guardrails from Profile A. If you hit active loop incidents, temporarily tighten them to:
-
- ```env
- PROXY_LOOP_WINDOW=6
- PROXY_LOOP_REPEAT_THRESHOLD=8
- PROXY_FORCED_THRESHOLD=14
- PROXY_NO_PROGRESS_THRESHOLD=4
- PROXY_CONTEXT_RELEASE_THRESHOLD=0.90
- ```
-
- Then restart the proxy:
-
- ```bash
- systemctl --user restart uap-anthropic-proxy.service
- ```
-
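To separate the two metrics programmatically, a regular expression over the server's timing lines is enough. The sketch assumes the `prompt eval time = ... ms / N tokens (... ms per token, X tokens per second)` layout that llama-server prints, with a matching plain `eval time` line. The sample lines are illustrative and were not captured from the runs described in this guide.

```python
import re

# One pattern for both lines: "prompt eval time" (prefill) and "eval time" (decode).
TIMING_RE = re.compile(
    r"(?P<kind>prompt eval|eval) time\s*=\s*(?P<ms>[\d.]+) ms\s*/\s*(?P<tokens>\d+) tokens"
    r".*?(?P<tps>[\d.]+) tokens per second"
)


def throughput(log_text: str) -> dict[str, list[float]]:
    """Split tokens-per-second figures from timing lines into prefill and decode."""
    result: dict[str, list[float]] = {"prefill": [], "decode": []}
    for line in log_text.splitlines():
        match = TIMING_RE.search(line)
        if match:
            key = "prefill" if match.group("kind") == "prompt eval" else "decode"
            result[key].append(float(match.group("tps")))
    return result


SAMPLE = """\
prompt eval time =    1520.31 ms /  3400 tokens (    0.45 ms per token,  2236.38 tokens per second)
       eval time =    4123.90 ms /   412 tokens (   10.01 ms per token,    99.91 tokens per second)
      total time =    5644.21 ms /  3812 tokens
"""
print(throughput(SAMPLE))
```

The `total time` line has no `eval time` prefix, so it is ignored.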
- ## 9) UAP support changes required for reliable operation
-
- The following UAP-side changes are part of the working stack and should be present:
-
- 1. Session-scoped loop protection in the Anthropic proxy (no cross-session contamination).
- 2. Guardrail retry for unexpected text-only end turns in active tool loops.
- 3. Optional systemd scaffolding from the CLI:
-    - `uap init --systemd-services`
-    - `uap setup --systemd-services`
- 4. Dedicated launch scripts:
-    - `scripts/run-llama-server-continuity.sh`
-    - `scripts/run-anthropic-proxy-continuity.sh`
-
- These changes ensure that llama.cpp speculative behavior is evaluated in a stable proxy/control-plane environment.
-
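Session-scoped loop protection means loop counters are kept per session, so one client's repeated tool calls cannot trip the breaker for another. This guide does not describe the proxy's actual detection rules or how `PROXY_LOOP_WINDOW` and the thresholds interact. The following is therefore a hypothetical illustration of the isolation idea only. It counts consecutive identical tool calls per session ID and trips at a configurable limit. `SessionLoopGuard` and `LoopState` are invented names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LoopState:
    last_call: str = ""
    repeats: int = 0


class SessionLoopGuard:
    """Hypothetical per-session detector for consecutive identical tool calls."""

    def __init__(self, repeat_limit: int) -> None:
        self.repeat_limit = repeat_limit
        # One independent state object per session ID, so counters are never shared.
        self._sessions: defaultdict[str, LoopState] = defaultdict(LoopState)

    def record(self, session_id: str, tool_call: str) -> bool:
        """Record a tool call; return True if this session should be interrupted."""
        state = self._sessions[session_id]
        if tool_call == state.last_call:
            state.repeats += 1
        else:
            state.last_call, state.repeats = tool_call, 1
        return state.repeats >= self.repeat_limit

    def reset(self, session_id: str) -> None:
        """Forget all loop state for one session."""
        self._sessions.pop(session_id, None)
```

Keying state by session ID is what prevents cross-session contamination. A single global counter would let a loop in one session trigger interventions in another.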
- ## 10) Check service health
-
- ```bash
- systemctl --user status uap-llama-server.service --no-pager
- systemctl --user status uap-anthropic-proxy.service --no-pager
- curl -sf http://127.0.0.1:8080/v1/models
- curl -sf http://127.0.0.1:4000/health
- ```
-
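The two `curl -sf` probes can also be run from a script or monitoring job. Below is a minimal standard-library sketch using the same URLs. `probe` and `health_report` are hypothetical helpers. They treat a 2xx response as healthy and any connection error, timeout, or HTTP error status as unhealthy, roughly like `curl -f`. Environment proxy settings are bypassed on purpose, since both endpoints are local.

```python
import http.client
import urllib.request

ENDPOINTS = [
    "http://127.0.0.1:8080/v1/models",  # llama-server
    "http://127.0.0.1:4000/health",     # Anthropic-compatible proxy
]

# An opener with an empty ProxyHandler ignores http_proxy/https_proxy variables.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if `url` answers with a 2xx status within `timeout` seconds."""
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException, ValueError):
        # URLError/HTTPError, refused connections and timeouts are OSError subclasses;
        # HTTPException covers malformed responses, ValueError malformed URLs.
        return False


def health_report(urls: list[str]) -> dict[str, bool]:
    """Probe each URL and map it to its health status."""
    return {url: probe(url) for url in urls}
```

For example, `health_report(ENDPOINTS)` returns one boolean per endpoint, which a cron job or systemd timer could act on.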
- ## 11) References and credits
-
- This implementation and tuning flow builds on prior llama.cpp and proxy work:
-
- - llama.cpp speculative docs: `docs/speculative.md`
- - llama.cpp hybrid rollout notes: `docs/development/speculative-hybrid-rollout.md`
- - llama.cpp speculative lineage: #5479, #6828, #6848, #19164
- - checkpoint/SWA context note:
-   - https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055
-
- Thanks to the ggml-org/llama.cpp maintainers and contributors for the speculative-decoding, cache, and memory-path groundwork.