@miller-tech/uap 1.40.0 → 1.41.0

Files changed (150)
  1. package/README.md +109 -642
  2. package/dist/.tsbuildinfo +1 -1
  3. package/dist/cli/deliver-defaults.d.ts +23 -0
  4. package/dist/cli/deliver-defaults.d.ts.map +1 -0
  5. package/dist/cli/deliver-defaults.js +121 -0
  6. package/dist/cli/deliver-defaults.js.map +1 -0
  7. package/dist/cli/init.d.ts.map +1 -1
  8. package/dist/cli/init.js +29 -0
  9. package/dist/cli/init.js.map +1 -1
  10. package/dist/cli/setup.d.ts.map +1 -1
  11. package/dist/cli/setup.js +19 -0
  12. package/dist/cli/setup.js.map +1 -1
  13. package/dist/policies/policy-tools.d.ts +7 -0
  14. package/dist/policies/policy-tools.d.ts.map +1 -1
  15. package/dist/policies/policy-tools.js +24 -2
  16. package/dist/policies/policy-tools.js.map +1 -1
  17. package/docs/INDEX.md +48 -286
  18. package/docs/architecture/OVERVIEW.md +328 -0
  19. package/docs/architecture/PROTOCOL.md +204 -0
  20. package/docs/benchmarks/README.md +17 -192
  21. package/docs/getting-started/CONFIGURATION.md +237 -0
  22. package/docs/getting-started/INSTALLATION.md +125 -0
  23. package/docs/getting-started/QUICKSTART.md +115 -0
  24. package/docs/guides/COORDINATION.md +162 -0
  25. package/docs/guides/DELIVER.md +115 -0
  26. package/docs/guides/DEPLOY_BATCHING.md +212 -0
  27. package/docs/guides/DROIDS_AND_SKILLS.md +202 -0
  28. package/docs/guides/LOCAL_MODELS.md +148 -0
  29. package/docs/guides/MCP_ROUTER.md +195 -0
  30. package/docs/guides/MEMORY.md +235 -0
  31. package/docs/guides/MULTI_MODEL.md +223 -0
  32. package/docs/guides/POLICIES.md +190 -0
  33. package/docs/guides/WORKTREE_WORKFLOW.md +185 -0
  34. package/docs/integrations/MCP_ROUTER.md +147 -0
  35. package/docs/integrations/RTK.md +102 -0
  36. package/docs/reference/API.md +485 -0
  37. package/docs/reference/CLI.md +719 -0
  38. package/docs/reference/CONFIGURATION.md +90 -193
  39. package/docs/reference/DATABASE_SCHEMA.md +110 -344
  40. package/docs/reference/FEATURES.md +176 -472
  41. package/docs/reference/PATTERNS.md +102 -0
  42. package/docs/reference/PLATFORMS.md +83 -0
  43. package/package.json +3 -1
  44. package/src/policies/enforcers/7ebbc721-7540-4e9f-879a-770e0213a09b_architecture_review.py +101 -0
  45. package/src/policies/enforcers/__pycache__/_common.cpython-312.pyc +0 -0
  46. package/src/policies/enforcers/_common.py +100 -0
  47. package/src/policies/enforcers/artifact_hygiene.py +52 -0
  48. package/src/policies/enforcers/cluster_routing.py +63 -0
  49. package/src/policies/enforcers/codebase_read_before_plan.py +52 -0
  50. package/src/policies/enforcers/coord_overlap.py +81 -0
  51. package/src/policies/enforcers/delivery_enforcement.py +97 -0
  52. package/src/policies/enforcers/doc_live_over_report.py +50 -0
  53. package/src/policies/enforcers/expert_review_required.py +135 -0
  54. package/src/policies/enforcers/iac_parity.py +53 -0
  55. package/src/policies/enforcers/mcp_router_first.py +37 -0
  56. package/src/policies/enforcers/memory_before_plan.py +61 -0
  57. package/src/policies/enforcers/parallel_reads.py +50 -0
  58. package/src/policies/enforcers/rtk_wrap.py +44 -0
  59. package/src/policies/enforcers/schema_diff_gate.py +80 -0
  60. package/src/policies/enforcers/session_memory_write.py +52 -0
  61. package/src/policies/enforcers/task_required.py +131 -0
  62. package/src/policies/enforcers/test_gate.py +58 -0
  63. package/src/policies/enforcers/validate_plan_before_build.py +75 -0
  64. package/src/policies/enforcers/worktree_required.py +57 -0
  65. package/src/policies/schemas/policies/architecture-review.md +51 -0
  66. package/src/policies/schemas/policies/artifact-hygiene.md +29 -0
  67. package/src/policies/schemas/policies/cluster-routing.md +31 -0
  68. package/src/policies/schemas/policies/codebase-read-before-plan.md +30 -0
  69. package/src/policies/schemas/policies/coord-overlap.md +24 -0
  70. package/src/policies/schemas/policies/delivery-enforcement.md +45 -0
  71. package/src/policies/schemas/policies/doc-live-over-report.md +32 -0
  72. package/src/policies/schemas/policies/expert-review-required.md +60 -0
  73. package/src/policies/schemas/policies/iac-parity.md +31 -0
  74. package/src/policies/schemas/policies/mandatory-testing-deployment.md +147 -0
  75. package/src/policies/schemas/policies/mcp-router-first.md +24 -0
  76. package/src/policies/schemas/policies/memory-before-plan.md +24 -0
  77. package/src/policies/schemas/policies/merge-deploy-monitor-verify.md +145 -0
  78. package/src/policies/schemas/policies/parallel-reads.md +24 -0
  79. package/src/policies/schemas/policies/rtk-wrap.md +26 -0
  80. package/src/policies/schemas/policies/schema-diff-gate.md +30 -0
  81. package/src/policies/schemas/policies/session-memory-write.md +24 -0
  82. package/src/policies/schemas/policies/task-required.md +49 -0
  83. package/src/policies/schemas/policies/test-gate.md +24 -0
  84. package/src/policies/schemas/policies/validate-plan-before-build.md +28 -0
  85. package/src/policies/schemas/policies/worktree-required.md +28 -0
  86. package/templates/hooks/uap-policy-gate.sh +5 -0
  87. package/docs/AGENTS.md +0 -423
  88. package/docs/DOCUMENTATION_AUDIT_REPORT.md +0 -131
  89. package/docs/GETTING_STARTED.md +0 -288
  90. package/docs/PROJECT_ANALYSIS_REPORT.md +0 -510
  91. package/docs/architecture/COMPLETE_ARCHITECTURE.md +0 -748
  92. package/docs/architecture/EXPERT_STACK.md +0 -137
  93. package/docs/architecture/MULTI_MODEL.md +0 -224
  94. package/docs/architecture/PLATFORM_GATING.md +0 -68
  95. package/docs/architecture/SYSTEM_ANALYSIS.md +0 -334
  96. package/docs/architecture/UAP_COMPLIANCE.md +0 -217
  97. package/docs/architecture/UAP_PROTOCOL.md +0 -339
  98. package/docs/architecture/UAP_STRICT_DROIDS.md +0 -172
  99. package/docs/archive/BALLS_MODE_SELF_ANALYSIS.md +0 -260
  100. package/docs/archive/BENCHMARK_GAPS_AND_PLAN.md +0 -146
  101. package/docs/archive/FAILING_TASKS_SOLUTION_PLAN.md +0 -668
  102. package/docs/archive/JINJA2-SYSTEM-MESSAGE-FIX.md +0 -209
  103. package/docs/archive/MODEL_ROUTING_IMPLEMENTATION_SUMMARY.md +0 -281
  104. package/docs/archive/MODEL_ROUTING_OPTIMIZATION_PLAN.md +0 -320
  105. package/docs/archive/NPM-PUBLISH-V0.9.1.md +0 -240
  106. package/docs/archive/OPTIMIZATION_OPTIONS.md +0 -334
  107. package/docs/archive/PARALLELISM_GAPS_AND_OPTIONS.md +0 -422
  108. package/docs/archive/POLICY_GATE_IMPLEMENTATION.md +0 -245
  109. package/docs/archive/SETUP_IMPROVEMENTS.md +0 -213
  110. package/docs/archive/UAP_GENERIC_OPTIMIZATION_PLAN.md +0 -270
  111. package/docs/archive/UAP_OPTIMIZATION_PLAN.md +0 -701
  112. package/docs/archive/UAP_V103_PATTERN_DESIGN.md +0 -315
  113. package/docs/archive/UAP_V104_COMPLIANCE_DESIGN.md +0 -223
  114. package/docs/archive/changelog/2026-03-10_uap-100-compliance.md +0 -77
  115. package/docs/archive/changelog/2026-03-10_uap-full-system-verification.md +0 -109
  116. package/docs/archive/opencode-integration-guide.md +0 -740
  117. package/docs/archive/opencode-integration-quickref.md +0 -180
  118. package/docs/benchmarks/OVERNIGHT_RUNNER.md +0 -341
  119. package/docs/benchmarks/SPECULATIVE_DECODING_JOURNEY_2026-03.md +0 -221
  120. package/docs/benchmarks/VALIDATION_PLAN.md +0 -568
  121. package/docs/blog/SPECULATIVE_DECODING_PRODUCTION_PLAYBOOK.md +0 -139
  122. package/docs/blog/local-coding-agents.md +0 -266
  123. package/docs/blog/x-thread.md +0 -254
  124. package/docs/deployment/DEPLOYMENT.md +0 -895
  125. package/docs/deployment/DEPLOYMENT_STRATEGIES.md +0 -518
  126. package/docs/deployment/DEPLOY_BATCHER_ANALYSIS.md +0 -224
  127. package/docs/deployment/DEPLOY_BATCHING.md +0 -273
  128. package/docs/deployment/DEPLOY_BUCKETING_ANALYSIS.md +0 -420
  129. package/docs/deployment/QWEN35_LLAMA_CPP.md +0 -426
  130. package/docs/deployment/UAP_LLAMA_ANTHROPIC_PROXY_BOOTSTRAP.md +0 -279
  131. package/docs/getting-started/INTEGRATION.md +0 -628
  132. package/docs/getting-started/OVERVIEW.md +0 -324
  133. package/docs/getting-started/SETUP.md +0 -377
  134. package/docs/integrations/MCP_ROUTER_SETUP.md +0 -445
  135. package/docs/integrations/RTK_INTEGRATION.md +0 -468
  136. package/docs/operations/TROUBLESHOOTING.md +0 -660
  137. package/docs/pr/PR_SPECULATIVE_DOCS_TEMPLATE.md +0 -146
  138. package/docs/pr/UPSTREAM_PRS.md +0 -424
  139. package/docs/reference/API_REFERENCE.md +0 -903
  140. package/docs/reference/EXPERT_DROIDS.md +0 -219
  141. package/docs/reference/HARNESS-MATRIX.md +0 -318
  142. package/docs/reference/PATTERN_LIBRARY.md +0 -636
  143. package/docs/reference/UAP_CLI_REFERENCE.md +0 -620
  144. package/docs/research/BEHAVIORAL_PATTERNS.md +0 -228
  145. package/docs/research/DOMAIN_STRATEGIES.md +0 -316
  146. package/docs/research/MEMORY_SYSTEMS_COMPARISON.md +0 -812
  147. package/docs/research/PATTERN_ANALYSIS_2026-01-18.md +0 -436
  148. package/docs/research/PERFORMANCE_ANALYSIS_2026-01-18.md +0 -209
  149. package/docs/research/PERFORMANCE_TEST_PLAN.md +0 -383
  150. package/docs/research/TERMINAL_BENCH_LEARNINGS.md +0 -217
@@ -1,426 +0,0 @@
1
- # Qwen3.5 llama.cpp Deployment Guide
2
-
3
- How to run Qwen3.5 35B A3B with the official Qwen3 chat template, LoRA adapters, and structured tool call output via llama.cpp.
4
-
5
- ## Prerequisites
6
-
7
- - [llama.cpp](https://github.com/ggml-org/llama.cpp) built with CUDA/Metal support
8
- - Qwen3.5 35B A3B GGUF model (e.g. `qwen3.5-a3b-iq4xs.gguf`)
9
- - (Optional) Draft model for speculative decoding: `Qwen3.5-0.8B-Q8_0.gguf`
10
- - (Optional) LoRA adapter GGUF for improved tool call reliability
11
-
12
- ## Quick Start
13
-
14
- ```bash
15
- llama-server \
16
- --model /path/to/qwen3.5-a3b-iq4xs.gguf \
17
- --chat-template-file chat_template.jinja \
18
- --n-predict 16384 \
19
- --temp 0.6 --top-p 0.9 --top-k 20 --min-p 0.05 \
20
- --repeat-penalty 1.0 \
21
- --threads 8 --ctx-size 131072 --batch-size 8 \
22
- --gpu-layers 35 --mlock --flash-attn
23
- ```
24
-
25
- ## Configuration Files
26
-
27
- | File | Purpose |
28
- | ------------------------------------------- | ------------------------------------------------------------------- |
29
- | `chat_template.jinja` | Official Qwen3 chat template with native tool descriptions |
30
- | `tools/agents/config/tool-call.gbnf` | GBNF grammar for per-request use (do NOT use with `--grammar-file`) |
31
- | `tools/agents/config/tool-call-schema.json` | JSON Schema for the tool call payload |
32
- | `config/qwen35-settings.json` | Full model settings, optimization config |
33
- | `config/lora-finetune.yaml` | LoRA training configuration (axolotl/unsloth compatible) |
34
-
35
- ## Important: Do NOT Use `--grammar-file`
36
-
37
- The `--grammar-file` flag applies a GBNF grammar **globally to every completion**. This breaks normal chat because the grammar forces `<tool_call>` output even when no tools are provided.
38
-
39
- llama.cpp's **differential autoparser** handles tool calls automatically:
40
-
41
- 1. It analyzes the Jinja template to discover `<tool_call>`/`</tool_call>` markers
42
- 2. It generates PEG grammar rules with **lazy activation** (`grammar_lazy = true`)
43
- 3. When `tool_choice == "auto"`, the model generates freely until it emits `<tool_call>`, at which point the grammar activates to constrain the JSON payload
44
- 4. After `</tool_call>`, the grammar allows another `<tool_call>` for parallel calls
45
- 5. Plain chat (no tools) is unconstrained
46
-
47
- The GBNF file is kept in the repo for per-request use via the `grammar` field in API payloads, but should never be a server startup flag.
48
-
49
- ## Server Configurations
50
-
51
- ### Basic (no LoRA, no speculative decoding)
52
-
53
- ```bash
54
- llama-server \
55
- --model /path/to/qwen3.5-a3b-iq4xs.gguf \
56
- --chat-template-file chat_template.jinja \
57
- --n-predict 16384 \
58
- --temp 0.6 --top-p 0.9 --top-k 20 --min-p 0.05 \
59
- --repeat-penalty 1.0 \
60
- --threads 8 --ctx-size 131072 --batch-size 8 \
61
- --gpu-layers 35 --mlock --flash-attn
62
- ```
63
-
64
- ### With LoRA Adapter
65
-
66
- ```bash
67
- llama-server \
68
- --model /path/to/qwen3.5-a3b-iq4xs.gguf \
69
- --lora /path/to/qwen35-tool-call-lora/adapter.gguf \
70
- --lora-scale 1.0 \
71
- --chat-template-file chat_template.jinja \
72
- --n-predict 16384 \
73
- --temp 0.6 --top-p 0.9 --top-k 20 --min-p 0.05 \
74
- --repeat-penalty 1.0 \
75
- --threads 8 --ctx-size 131072 --batch-size 8 \
76
- --gpu-layers 35 --mlock --flash-attn
77
- ```
78
-
79
- ### Full Setup (LoRA + Speculative Decoding)
80
-
81
- ```bash
82
- llama-server \
83
- --model /path/to/qwen3.5-a3b-iq4xs.gguf \
84
- --lora /path/to/qwen35-tool-call-lora/adapter.gguf \
85
- --lora-scale 1.0 \
86
- --chat-template-file chat_template.jinja \
87
- --draft-model /path/to/Qwen3.5-0.8B-Q8_0.gguf \
88
- --draft-max 16 --draft-p-min 0.75 \
89
- --n-predict 16384 \
90
- --temp 0.6 --top-p 0.9 --top-k 20 --min-p 0.05 \
91
- --repeat-penalty 1.0 \
92
- --threads 8 --ctx-size 131072 --batch-size 8 \
93
- --gpu-layers 35 --mlock --flash-attn
94
- ```
95
-
96
- ## Key Parameters
97
-
98
- ### Chat Template & Tool Calls
99
-
100
- | Flag | Value | Purpose |
101
- | ---------------------- | --------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
102
- | `--chat-template-file` | `chat_template.jinja` | Official Qwen3 template with native `tools` block. llama.cpp's autoparser discovers `<tool_call>` markers and generates lazy grammar + triggers automatically. |
103
-
104
- ### LoRA
105
-
106
- | Flag | Value | Purpose |
107
- | -------------- | ---------------------- | ------------------------------------------------------------------------------------------------------ |
108
- | `--lora` | Path to `adapter.gguf` | Loads LoRA adapter at runtime (no model merge needed). Improves tool call format adherence by ~15-20%. |
109
- | `--lora-scale` | `0.0` - `1.0` | Adapter strength. Use `1.0` for full effect, `0.5`-`0.8` to blend with base model behavior. |
110
-
111
- ### Speculative Decoding
112
-
113
- | Flag | Value | Purpose |
114
- | --------------- | -------------------------------- | ----------------------------------------------------------------------- |
115
- | `--draft-model` | Path to `Qwen3.5-0.8B-Q8_0.gguf` | Small draft model proposes tokens verified by the main model. |
116
- | `--draft-max` | `16` | Max tokens to draft per iteration. Higher = more throughput, more VRAM. |
117
- | `--draft-p-min` | `0.75` | Minimum draft-token probability required to keep drafting. Lower = more aggressive drafting. |
118
-
119
- ## Extension Options for Speculative Decoding
120
-
121
- ### Option 1: Adaptive Runtime Tuning (implemented)
122
-
123
- Use acceptance and rollback rates to auto-adjust `--draft-max`, `--draft-min`, and `--draft-p-min` over time.
124
-
125
- - Best for immediate gains without kernel changes
126
- - Reduces bad bursts when acceptance drops
127
- - Increases burst length automatically during high-acceptance windows
128
-
129
- Commands:
130
-
131
- ```bash
132
- # Tune once from observed metrics
133
- llama-optimize spec-autotune --acceptance 0.71 --rollback 0.14 --profile throughput
134
-
135
- # Compare static defaults vs adaptive tuning using deterministic simulation
136
- llama-optimize spec-benchmark --profile throughput --trace mixed --steps 180
137
-
138
- # Live benchmark active server and get tuned flag recommendation
139
- llama-optimize spec-benchmark-live \
140
- --endpoint http://127.0.0.1:8080/v1 \
141
- --model qwen3.5-a3b-iq4xs \
142
- --runs 5 --max-tokens 256 --profile throughput
143
- ```
144
-
145
- Recommended workflow:
146
-
147
- 1. Run `spec-benchmark-live` with your current startup flags and note `Throughput`.
148
- 2. Restart `llama-server` with the `Suggested params` flags.
149
- 3. Re-run `spec-benchmark-live` with the same settings to measure actual gain.
150
-
151
- ### Option 2: GPU Residency + Overlap
152
-
153
- - Keep draft model and draft KV fully on GPU
154
- - Preallocate buffers and overlap draft + verify passes with CUDA streams
155
- - Improves p95 latency consistency on long runs
156
-
157
- ### Option 3: GPU Checkpoint/Rollback
158
-
159
- - Move speculative checkpoint snapshots from CPU RAM to GPU buffers
160
- - Remove host-device copy overhead from rollback paths
161
- - Highest upside, but requires deeper runtime changes
162
-
163
- ### Sampling
164
-
165
- | Flag | Value | Purpose |
166
- | ------------------ | ------ | ------------------------------------------------- |
167
- | `--temp` | `0.6` | Moderately low temperature for more consistent (not strictly deterministic) tool calls. |
168
- | `--top-p` | `0.9` | Nucleus sampling threshold. |
169
- | `--top-k` | `20` | Limits token candidates per step. |
170
- | `--min-p` | `0.05` | Filters tokens below 5% of top token probability. |
171
- | `--repeat-penalty` | `1.0` | No repetition penalty — code naturally repeats patterns. |
172
-
173
- ### Performance
174
-
175
- | Flag | Value | Purpose |
176
- | -------------- | -------- | ------------------------------------------------- |
177
- | `--flash-attn` | (flag) | 1.5-2x speed on long context. |
178
- | `--gpu-layers` | `35` | Layers offloaded to GPU. Increase if VRAM allows. |
179
- | `--ctx-size` | `131072` | Full 128K context window. |
180
- | `--mlock` | (flag) | Prevents OS from swapping model to disk. |
181
-
182
- ## VRAM Estimates
183
-
184
- | Component | VRAM |
185
- | ------------------- | ---------- |
186
- | Main model (IQ4_XS) | ~17 GB |
187
- | Draft model (Q8_0) | ~0.8 GB |
188
- | KV cache (128K ctx) | ~2-3 GB |
189
- | LoRA adapter | ~50 MB |
190
- | **Total** | **~20 GB** |
191
-
192
- ## Anthropic API Proxy (for Claude Code / Forge Code)
193
-
194
- Claude Code and Forge Code speak the Anthropic Messages API, but llama.cpp exposes an OpenAI-compatible API. The UAP Anthropic Proxy bridges this gap by translating between the two protocols in real time, including full streaming and tool calling support.
195
-
196
- ### Architecture
197
-
198
- ```
199
- Claude Code --(Anthropic API :4000)--> UAP Proxy --(OpenAI API :8080)--> llama.cpp
200
- ```
201
-
202
- ### Quick Start
203
-
204
- ```bash
205
- # Install Python dependencies
206
- pip install -r tools/agents/scripts/requirements-proxy.txt
207
-
208
- # Start the proxy (default: listen on :4000, forward to llama.cpp on :8080)
209
- python tools/agents/scripts/anthropic_proxy.py
210
- ```
211
-
212
- ### Configuration
213
-
214
- All settings are via environment variables:
215
-
216
- | Variable | Default | Description |
217
- | ----------------------- | ------------------------------------ | ---------------------------------------- |
218
- | `LLAMA_CPP_BASE` | `http://192.168.1.165:8080/v1` | OpenAI-compatible upstream server URL |
219
- | `PROXY_PORT` | `4000` | Port for the proxy to listen on |
220
- | `PROXY_HOST` | `0.0.0.0` | Host/IP to bind to |
221
- | `PROXY_LOG_LEVEL` | `INFO` | Logging level (DEBUG/INFO/WARNING/ERROR) |
222
- | `PROXY_READ_TIMEOUT` | `600` | Read timeout (seconds) for LLM streaming |
223
- | `PROXY_MAX_CONNECTIONS` | `20` | Max concurrent upstream connections |
224
- | `PROXY_MAX_TOKENS_FLOOR` | `16384` | Minimum floor applied to incoming `max_tokens` (`0` disables floor) |
225
- | `PROXY_CONTEXT_PRUNE_TARGET_FRACTION` | `0.65` | Target context utilization after pruning (`0.0 < value < 1.0`) |
226
- | `PROXY_STREAM_REASONING_FALLBACK` | `off` | Streaming behavior for reasoning-only empty turns (`off`, `sanitized`, `visible`) |
227
- | `PROXY_STREAM_REASONING_MAX_CHARS` | `240` | Max fallback length when `PROXY_STREAM_REASONING_FALLBACK=sanitized` |
228
- | `PROXY_TOOL_NARROWING` | `off` | Narrow large tool lists to top relevant tools per turn |
229
- | `PROXY_TOOL_NARROWING_KEEP` | `8` | Number of tools to keep when narrowing is enabled |
230
- | `PROXY_TOOL_NARROWING_MIN_TOOLS` | `12` | Minimum tool count before narrowing activates |
231
- | `PROXY_DISABLE_THINKING_ON_TOOL_TURNS` | `off` | Sends `enable_thinking=false` when tools are present |
232
- | `PROXY_MALFORMED_TOOL_GUARDRAIL` | `on` | Detects malformed pseudo tool payloads and retries with strict settings |
233
- | `PROXY_MALFORMED_TOOL_RETRY_MAX` | `1` | Number of malformed-tool retries |
234
- | `PROXY_MALFORMED_TOOL_RETRY_MAX_TOKENS` | `2048` | Retry cap for `max_tokens` during malformed-tool recovery |
235
- | `PROXY_MALFORMED_TOOL_RETRY_TEMPERATURE` | `0` | Retry temperature for malformed-tool recovery |
236
- | `PROXY_MALFORMED_TOOL_STREAM_STRICT` | `off` | For stream+tools requests, use guarded non-stream upstream path then replay SSE |
237
- | `PROXY_SESSION_CONTAMINATION_BREAKER` | `on` | Resets long-running malformed sessions to recent context |
238
- | `PROXY_SESSION_CONTAMINATION_THRESHOLD` | `3` | Consecutive malformed turns before reset |
239
- | `PROXY_SESSION_CONTAMINATION_KEEP_LAST` | `8` | Number of latest messages to preserve during contamination reset |
240
- | `PROXY_AGENTIC_SUPPLEMENT_MODE` | `clean` | Agentic system supplement variant (`clean`, `legacy`) |
241
-
242
- For agentic coding workloads, keep `PROXY_STREAM_REASONING_FALLBACK=off` (default) to avoid leaking malformed internal reasoning as user-visible output. Use `sanitized` only for debugging.
243
-
244
- For Claude Code + Qwen malformed-tool loops, recommended starting profile:
245
-
246
- ```bash
247
- PROXY_STREAM_REASONING_FALLBACK=off
248
- PROXY_MAX_TOKENS_FLOOR=4096
249
- PROXY_MALFORMED_TOOL_GUARDRAIL=on
250
- PROXY_TOOL_NARROWING=on
251
- PROXY_DISABLE_THINKING_ON_TOOL_TURNS=on
252
- PROXY_SESSION_CONTAMINATION_BREAKER=on
253
- PROXY_AGENTIC_SUPPLEMENT_MODE=clean
254
- ```
255
-
256
- ### Example: Custom upstream
257
-
258
- ```bash
259
- LLAMA_CPP_BASE=http://localhost:8080/v1 PROXY_PORT=5000 python tools/agents/scripts/anthropic_proxy.py
260
- ```
261
-
262
- ### Claude Code Configuration
263
-
264
- Point Claude Code at the proxy by setting the API base URL:
265
-
266
- ```bash
267
- export ANTHROPIC_BASE_URL=http://localhost:4000
268
- ```
269
-
270
- ### Endpoints
271
-
272
- The proxy speaks **Anthropic Messages API as its canonical interface** and
273
- keeps an **OpenAI Chat Completions passthrough** for clients that require the
274
- OpenAI shape. Both paths run through the same guarded pipeline (loop
275
- detection, tool narrowing, malformed-payload retry, context pruning, etc.) —
276
- the OpenAI route converts the request to Anthropic, runs the pipeline, and
277
- re-shapes the final response back to OpenAI.
278
-
279
- | Path | Method | Shape | Description |
280
- | ------------------------ | ------ | --------- | --------------------------------------------------------------- |
281
- | `/v1/messages` | POST | Anthropic | Anthropic Messages API — default/canonical (streaming + sync) |
282
- | `/anthropic/v1/messages` | POST | Anthropic | Alias for `/v1/messages` (some Claude Code configs use this) |
283
- | `/v1/chat/completions` | POST | OpenAI | OpenAI Chat Completions passthrough (e.g. Forge, OpenCode) |
284
- | `/v1/models` | GET | Anthropic | Lists spoofed Anthropic model IDs |
285
- | `/health` | GET | — | Health check (verifies upstream reachability) |
286
- | `/v1/context` | GET | — | Current session context usage and pruning state |
287
-
288
- ### Running as a Service (systemd)
289
-
290
- ```ini
291
- [Unit]
292
- Description=UAP Anthropic Proxy
293
- After=network.target
294
-
295
- [Service]
296
- Type=simple
297
- User=cogtek
298
- Environment=LLAMA_CPP_BASE=http://192.168.1.165:8080/v1
299
- Environment=PROXY_PORT=4000
300
- ExecStart=/usr/bin/python3 /path/to/tools/agents/scripts/anthropic_proxy.py
301
- Restart=always
302
- RestartSec=5
303
-
304
- [Install]
305
- WantedBy=multi-user.target
306
- ```
307
-
308
- ## Tool Call Format
309
-
310
- The model emits tool calls in the official Qwen3 format:
311
-
312
- ```
313
- <tool_call>
314
- {"name": "read_file", "arguments": {"path": "/etc/hosts"}}
315
- </tool_call>
316
- ```
317
-
318
- Multiple tool calls in a single turn:
319
-
320
- ```
321
- <tool_call>
322
- {"name": "read_file", "arguments": {"path": "/etc/hosts"}}
323
- </tool_call>
324
- <tool_call>
325
- {"name": "list_dir", "arguments": {"path": "/tmp"}}
326
- </tool_call>
327
- ```
328
-
329
- llama.cpp's autoparser handles stop behavior structurally via PEG grammar rules, not stop sequences. No explicit `</tool_call>` stop sequence is needed at the server level.
330
-
331
- ## LoRA Training Pipeline
332
-
333
- ### 1. Generate Training Data
334
-
335
- ```bash
336
- python3 tools/agents/scripts/generate_lora_training_data.py -n 500
337
- ```
338
-
339
- Produces `tool_call_training_data.jsonl` with ChatML-formatted examples using the official `<tool_call>` format.
340
-
341
- ### 2. Fine-Tune
342
-
343
- Using axolotl:
344
-
345
- ```bash
346
- accelerate launch -m axolotl.cli.train config/lora-finetune.yaml
347
- ```
348
-
349
- Using unsloth (faster, less VRAM):
350
-
351
- ```bash
352
- unsloth train --config config/lora-finetune.yaml
353
- ```
354
-
355
- Training config highlights (`config/lora-finetune.yaml`):
356
-
357
- - LoRA rank 16, alpha 32
358
- - Targets all linear layers (q/k/v/o/gate/up/down projections)
359
- - 3 epochs, cosine LR schedule, 2e-4 learning rate
360
- - BF16 + gradient checkpointing + flash attention
361
-
362
- ### 3. Convert to GGUF
363
-
364
- ```bash
365
- python3 convert_lora_to_gguf.py \
366
- --base /path/to/Qwen3.5-35B-A3B \
367
- --outfile adapter.gguf \
368
- output/qwen35-tool-call-lora
369
- ```
370
-
371
- ### 4. Load at Runtime
372
-
373
- ```bash
374
- llama-server --model base.gguf --lora adapter.gguf --lora-scale 1.0
375
- ```
376
-
377
- ## Quantization Options
378
-
379
- | Quant | VRAM | Accuracy | Tool Call Reliability |
380
- | ------ | ----- | -------- | --------------------- |
381
- | IQ4_XS | 17 GB | 96% | 94% |
382
- | Q4_K_M | 20 GB | 95% | 95% |
383
- | Q5_K_M | 24 GB | 97% | 97% |
384
- | Q6_K | 28 GB | 98% | 98% |
385
-
386
- ## Troubleshooting
387
-
388
- ### "Template supports tool calls but does not natively describe tools"
389
-
390
- This warning means llama.cpp detected `tool_calls` handling but no `tools` variable access in the template. The `chat_template.jinja` in this repo resolves this by including a `{%- if tools %}` block that renders tool descriptions in `<tools></tools>` XML tags.
391
-
392
- Verify the template is loaded:
393
-
394
- ```bash
395
- llama-server --chat-template-file chat_template.jinja --verbose
396
- ```
397
-
398
- ### LoRA not taking effect
399
-
400
- - Ensure the adapter was converted to GGUF format (not safetensors/PyTorch)
401
- - Check `--lora-scale` is not `0.0`
402
- - Verify the adapter was trained against the same base model architecture
403
-
404
- ### Grammar rejecting valid output
405
-
406
- If you pass the GBNF grammar via the per-request `grammar` field and it is too restrictive, the model may produce truncated output. Check that `tools/agents/config/tool-call.gbnf` allows the argument types your tools use (strings, numbers, objects, arrays, booleans, and null are all supported).
407
-
408
- ### Model only outputs tool calls, never plain text
409
-
410
- You are likely using `--grammar-file` on the server command line. This forces ALL output into `<tool_call>` format. Remove `--grammar-file` from the startup command and let the autoparser handle tool call detection lazily.
411
-
412
- ### Multi-tool calls truncated to single call
413
-
414
- Two possible causes:
415
-
416
- 1. `--grammar-file` is set globally and the stop sequence `</tool_call>` terminates after the first call. Remove `--grammar-file`.
417
- 2. The client is not passing `parallel_tool_calls: true` in the request. Add it to enable multiple tool calls per turn.
418
-
419
- ## Related Files
420
-
421
- - `tools/agents/scripts/qwen_tool_call_wrapper.py` - Python wrapper with retry logic and format validation
422
- - `tools/agents/scripts/fix_qwen_chat_template.py` - Template verifier/fixer (detects format, validates Jinja2)
423
- - `tools/agents/scripts/qwen_tool_call_test.py` - Test suite using OpenAI-compatible API
424
- - `src/cli/tool-calls.ts` - CLI command for template management
425
- - `src/bin/llama-server-optimize.ts` - llama-server startup optimizer
426
- - `docs/deployment/UAP_LLAMA_ANTHROPIC_PROXY_BOOTSTRAP.md` - service bootstrap + ngram-cache A/B benchmarking
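
The "Tool Call Format" section of the removed guide shows each call as one JSON object wrapped in `<tool_call>`/`</tool_call>` markers, with several blocks allowed per turn. As a rough client-side illustration only, the sketch below pulls those payloads out of raw model text. It is not the parser used by llama.cpp or by the UAP proxy; `extract_tool_calls` is a hypothetical helper name, and the validation rule (a string `name` plus an object `arguments`) is an assumption based on the examples in the guide.

```python
import json
import re

# Matches one <tool_call>...</tool_call> block; DOTALL lets the JSON span lines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text):
    """Return (calls, plain_text) parsed from raw model output.

    Hypothetical helper for illustration. Blocks whose payload is not a JSON
    object with a string "name" and an object "arguments" are skipped.
    """
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # malformed payload: ignore it rather than guess
        if (
            isinstance(payload, dict)
            and isinstance(payload.get("name"), str)
            and isinstance(payload.get("arguments"), dict)
        ):
            calls.append(payload)
    # Whatever is left after removing every block is ordinary chat text.
    plain_text = TOOL_CALL_RE.sub("", text).strip()
    return calls, plain_text


sample = (
    "Checking two things.\n"
    "<tool_call>\n"
    '{"name": "read_file", "arguments": {"path": "/etc/hosts"}}\n'
    "</tool_call>\n"
    "<tool_call>\n"
    '{"name": "list_dir", "arguments": {"path": "/tmp"}}\n'
    "</tool_call>"
)
calls, plain_text = extract_tool_calls(sample)
print([call["name"] for call in calls])
print(plain_text)
```

Skipping malformed blocks instead of raising keeps the sketch tolerant of truncated output; a stricter client might prefer to surface the error and retry.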