@miller-tech/uap 1.40.0 → 1.40.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +109 -642
- package/docs/INDEX.md +48 -286
- package/docs/architecture/OVERVIEW.md +328 -0
- package/docs/architecture/PROTOCOL.md +204 -0
- package/docs/benchmarks/README.md +17 -192
- package/docs/getting-started/CONFIGURATION.md +237 -0
- package/docs/getting-started/INSTALLATION.md +125 -0
- package/docs/getting-started/QUICKSTART.md +115 -0
- package/docs/guides/COORDINATION.md +162 -0
- package/docs/guides/DELIVER.md +115 -0
- package/docs/guides/DEPLOY_BATCHING.md +212 -0
- package/docs/guides/DROIDS_AND_SKILLS.md +202 -0
- package/docs/guides/LOCAL_MODELS.md +148 -0
- package/docs/guides/MCP_ROUTER.md +195 -0
- package/docs/guides/MEMORY.md +235 -0
- package/docs/guides/MULTI_MODEL.md +223 -0
- package/docs/guides/POLICIES.md +190 -0
- package/docs/guides/WORKTREE_WORKFLOW.md +185 -0
- package/docs/integrations/MCP_ROUTER.md +147 -0
- package/docs/integrations/RTK.md +102 -0
- package/docs/reference/API.md +485 -0
- package/docs/reference/CLI.md +719 -0
- package/docs/reference/CONFIGURATION.md +90 -193
- package/docs/reference/DATABASE_SCHEMA.md +110 -344
- package/docs/reference/FEATURES.md +176 -472
- package/docs/reference/PATTERNS.md +102 -0
- package/docs/reference/PLATFORMS.md +83 -0
- package/package.json +1 -1
- package/docs/AGENTS.md +0 -423
- package/docs/DOCUMENTATION_AUDIT_REPORT.md +0 -131
- package/docs/GETTING_STARTED.md +0 -288
- package/docs/PROJECT_ANALYSIS_REPORT.md +0 -510
- package/docs/architecture/COMPLETE_ARCHITECTURE.md +0 -748
- package/docs/architecture/EXPERT_STACK.md +0 -137
- package/docs/architecture/MULTI_MODEL.md +0 -224
- package/docs/architecture/PLATFORM_GATING.md +0 -68
- package/docs/architecture/SYSTEM_ANALYSIS.md +0 -334
- package/docs/architecture/UAP_COMPLIANCE.md +0 -217
- package/docs/architecture/UAP_PROTOCOL.md +0 -339
- package/docs/architecture/UAP_STRICT_DROIDS.md +0 -172
- package/docs/archive/BALLS_MODE_SELF_ANALYSIS.md +0 -260
- package/docs/archive/BENCHMARK_GAPS_AND_PLAN.md +0 -146
- package/docs/archive/FAILING_TASKS_SOLUTION_PLAN.md +0 -668
- package/docs/archive/JINJA2-SYSTEM-MESSAGE-FIX.md +0 -209
- package/docs/archive/MODEL_ROUTING_IMPLEMENTATION_SUMMARY.md +0 -281
- package/docs/archive/MODEL_ROUTING_OPTIMIZATION_PLAN.md +0 -320
- package/docs/archive/NPM-PUBLISH-V0.9.1.md +0 -240
- package/docs/archive/OPTIMIZATION_OPTIONS.md +0 -334
- package/docs/archive/PARALLELISM_GAPS_AND_OPTIONS.md +0 -422
- package/docs/archive/POLICY_GATE_IMPLEMENTATION.md +0 -245
- package/docs/archive/SETUP_IMPROVEMENTS.md +0 -213
- package/docs/archive/UAP_GENERIC_OPTIMIZATION_PLAN.md +0 -270
- package/docs/archive/UAP_OPTIMIZATION_PLAN.md +0 -701
- package/docs/archive/UAP_V103_PATTERN_DESIGN.md +0 -315
- package/docs/archive/UAP_V104_COMPLIANCE_DESIGN.md +0 -223
- package/docs/archive/changelog/2026-03-10_uap-100-compliance.md +0 -77
- package/docs/archive/changelog/2026-03-10_uap-full-system-verification.md +0 -109
- package/docs/archive/opencode-integration-guide.md +0 -740
- package/docs/archive/opencode-integration-quickref.md +0 -180
- package/docs/benchmarks/OVERNIGHT_RUNNER.md +0 -341
- package/docs/benchmarks/SPECULATIVE_DECODING_JOURNEY_2026-03.md +0 -221
- package/docs/benchmarks/VALIDATION_PLAN.md +0 -568
- package/docs/blog/SPECULATIVE_DECODING_PRODUCTION_PLAYBOOK.md +0 -139
- package/docs/blog/local-coding-agents.md +0 -266
- package/docs/blog/x-thread.md +0 -254
- package/docs/deployment/DEPLOYMENT.md +0 -895
- package/docs/deployment/DEPLOYMENT_STRATEGIES.md +0 -518
- package/docs/deployment/DEPLOY_BATCHER_ANALYSIS.md +0 -224
- package/docs/deployment/DEPLOY_BATCHING.md +0 -273
- package/docs/deployment/DEPLOY_BUCKETING_ANALYSIS.md +0 -420
- package/docs/deployment/QWEN35_LLAMA_CPP.md +0 -426
- package/docs/deployment/UAP_LLAMA_ANTHROPIC_PROXY_BOOTSTRAP.md +0 -279
- package/docs/getting-started/INTEGRATION.md +0 -628
- package/docs/getting-started/OVERVIEW.md +0 -324
- package/docs/getting-started/SETUP.md +0 -377
- package/docs/integrations/MCP_ROUTER_SETUP.md +0 -445
- package/docs/integrations/RTK_INTEGRATION.md +0 -468
- package/docs/operations/TROUBLESHOOTING.md +0 -660
- package/docs/pr/PR_SPECULATIVE_DOCS_TEMPLATE.md +0 -146
- package/docs/pr/UPSTREAM_PRS.md +0 -424
- package/docs/reference/API_REFERENCE.md +0 -903
- package/docs/reference/EXPERT_DROIDS.md +0 -219
- package/docs/reference/HARNESS-MATRIX.md +0 -318
- package/docs/reference/PATTERN_LIBRARY.md +0 -636
- package/docs/reference/UAP_CLI_REFERENCE.md +0 -620
- package/docs/research/BEHAVIORAL_PATTERNS.md +0 -228
- package/docs/research/DOMAIN_STRATEGIES.md +0 -316
- package/docs/research/MEMORY_SYSTEMS_COMPARISON.md +0 -812
- package/docs/research/PATTERN_ANALYSIS_2026-01-18.md +0 -436
- package/docs/research/PERFORMANCE_ANALYSIS_2026-01-18.md +0 -209
- package/docs/research/PERFORMANCE_TEST_PLAN.md +0 -383
- package/docs/research/TERMINAL_BENCH_LEARNINGS.md +0 -217
|
@@ -1,426 +0,0 @@
|
|
|
1
|
-
# Qwen3.5 llama.cpp Deployment Guide
|
|
2
|
-
|
|
3
|
-
How to run Qwen3.5 35B A3B with the official Qwen3 chat template, LoRA adapters, and structured tool call output via llama.cpp.
|
|
4
|
-
|
|
5
|
-
## Prerequisites
|
|
6
|
-
|
|
7
|
-
- [llama.cpp](https://github.com/ggml-org/llama.cpp) built with CUDA/Metal support
|
|
8
|
-
- Qwen3.5 35B A3B GGUF model (e.g. `qwen3.5-a3b-iq4xs.gguf`)
|
|
9
|
-
- (Optional) Draft model for speculative decoding: `Qwen3.5-0.8B-Q8_0.gguf`
|
|
10
|
-
- (Optional) LoRA adapter GGUF for improved tool call reliability
|
|
11
|
-
|
|
12
|
-
## Quick Start
|
|
13
|
-
|
|
14
|
-
```bash
|
|
15
|
-
llama-server \
|
|
16
|
-
--model /path/to/qwen3.5-a3b-iq4xs.gguf \
|
|
17
|
-
--chat-template-file chat_template.jinja \
|
|
18
|
-
--n-predict 16384 \
|
|
19
|
-
--temp 0.6 --top-p 0.9 --top-k 20 --min-p 0.05 \
|
|
20
|
-
--repeat-penalty 1.0 \
|
|
21
|
-
--threads 8 --ctx-size 131072 --batch-size 8 \
|
|
22
|
-
--gpu-layers 35 --mlock --flash-attn
|
|
23
|
-
```
|
|
24
|
-
|
|
25
|
-
## Configuration Files
|
|
26
|
-
|
|
27
|
-
| File | Purpose |
|
|
28
|
-
| ------------------------------------------- | ------------------------------------------------------------------- |
|
|
29
|
-
| `chat_template.jinja` | Official Qwen3 chat template with native tool descriptions |
|
|
30
|
-
| `tools/agents/config/tool-call.gbnf` | GBNF grammar for per-request use (do NOT use with `--grammar-file`) |
|
|
31
|
-
| `tools/agents/config/tool-call-schema.json` | JSON Schema for the tool call payload |
|
|
32
|
-
| `config/qwen35-settings.json` | Full model settings, optimization config |
|
|
33
|
-
| `config/lora-finetune.yaml` | LoRA training configuration (axolotl/unsloth compatible) |
|
|
34
|
-
|
|
35
|
-
## Important: Do NOT Use `--grammar-file`
|
|
36
|
-
|
|
37
|
-
The `--grammar-file` flag applies a GBNF grammar **globally to every completion**. This breaks normal chat because the grammar forces `<tool_call>` output even when no tools are provided.
|
|
38
|
-
|
|
39
|
-
llama.cpp's **differential autoparser** handles tool calls automatically:
|
|
40
|
-
|
|
41
|
-
1. It analyzes the Jinja template to discover `<tool_call>`/`</tool_call>` markers
|
|
42
|
-
2. It generates PEG grammar rules with **lazy activation** (`grammar_lazy = true`)
|
|
43
|
-
3. When `tool_choice == "auto"`, the model generates freely until it emits `<tool_call>`, at which point the grammar activates to constrain the JSON payload
|
|
44
|
-
4. After `</tool_call>`, the grammar allows another `<tool_call>` for parallel calls
|
|
45
|
-
5. Plain chat (no tools) is unconstrained
|
|
46
|
-
|
|
47
|
-
The GBNF file is kept in the repo for per-request use via the `grammar` field in API payloads, but should never be a server startup flag.
|
|
48
|
-
|
|
49
|
-
## Server Configurations
|
|
50
|
-
|
|
51
|
-
### Basic (no LoRA, no speculative decoding)
|
|
52
|
-
|
|
53
|
-
```bash
|
|
54
|
-
llama-server \
|
|
55
|
-
--model /path/to/qwen3.5-a3b-iq4xs.gguf \
|
|
56
|
-
--chat-template-file chat_template.jinja \
|
|
57
|
-
--n-predict 16384 \
|
|
58
|
-
--temp 0.6 --top-p 0.9 --top-k 20 --min-p 0.05 \
|
|
59
|
-
--repeat-penalty 1.0 \
|
|
60
|
-
--threads 8 --ctx-size 131072 --batch-size 8 \
|
|
61
|
-
--gpu-layers 35 --mlock --flash-attn
|
|
62
|
-
```
|
|
63
|
-
|
|
64
|
-
### With LoRA Adapter
|
|
65
|
-
|
|
66
|
-
```bash
|
|
67
|
-
llama-server \
|
|
68
|
-
--model /path/to/qwen3.5-a3b-iq4xs.gguf \
|
|
69
|
-
--lora /path/to/qwen35-tool-call-lora/adapter.gguf \
|
|
70
|
-
--lora-scale 1.0 \
|
|
71
|
-
--chat-template-file chat_template.jinja \
|
|
72
|
-
--n-predict 16384 \
|
|
73
|
-
--temp 0.6 --top-p 0.9 --top-k 20 --min-p 0.05 \
|
|
74
|
-
--repeat-penalty 1.0 \
|
|
75
|
-
--threads 8 --ctx-size 131072 --batch-size 8 \
|
|
76
|
-
--gpu-layers 35 --mlock --flash-attn
|
|
77
|
-
```
|
|
78
|
-
|
|
79
|
-
### Full Setup (LoRA + Speculative Decoding)
|
|
80
|
-
|
|
81
|
-
```bash
|
|
82
|
-
llama-server \
|
|
83
|
-
--model /path/to/qwen3.5-a3b-iq4xs.gguf \
|
|
84
|
-
--lora /path/to/qwen35-tool-call-lora/adapter.gguf \
|
|
85
|
-
--lora-scale 1.0 \
|
|
86
|
-
--chat-template-file chat_template.jinja \
|
|
87
|
-
--draft-model /path/to/Qwen3.5-0.8B-Q8_0.gguf \
|
|
88
|
-
--draft-max 16 --draft-p-min 0.75 \
|
|
89
|
-
--n-predict 16384 \
|
|
90
|
-
--temp 0.6 --top-p 0.9 --top-k 20 --min-p 0.05 \
|
|
91
|
-
--repeat-penalty 1.0 \
|
|
92
|
-
--threads 8 --ctx-size 131072 --batch-size 8 \
|
|
93
|
-
--gpu-layers 35 --mlock --flash-attn
|
|
94
|
-
```
|
|
95
|
-
|
|
96
|
-
## Key Parameters
|
|
97
|
-
|
|
98
|
-
### Chat Template & Tool Calls
|
|
99
|
-
|
|
100
|
-
| Flag | Value | Purpose |
|
|
101
|
-
| ---------------------- | --------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
|
102
|
-
| `--chat-template-file` | `chat_template.jinja` | Official Qwen3 template with native `tools` block. llama.cpp's autoparser discovers `<tool_call>` markers and generates lazy grammar + triggers automatically. |
|
|
103
|
-
|
|
104
|
-
### LoRA
|
|
105
|
-
|
|
106
|
-
| Flag | Value | Purpose |
|
|
107
|
-
| -------------- | ---------------------- | ------------------------------------------------------------------------------------------------------ |
|
|
108
|
-
| `--lora` | Path to `adapter.gguf` | Loads LoRA adapter at runtime (no model merge needed). Improves tool call format adherence by ~15-20%. |
|
|
109
|
-
| `--lora-scale` | `0.0` - `1.0` | Adapter strength. Use `1.0` for full effect, `0.5`-`0.8` to blend with base model behavior. |
|
|
110
|
-
|
|
111
|
-
### Speculative Decoding
|
|
112
|
-
|
|
113
|
-
| Flag | Value | Purpose |
|
|
114
|
-
| --------------- | -------------------------------- | ----------------------------------------------------------------------- |
|
|
115
|
-
| `--draft-model` | Path to `Qwen3.5-0.8B-Q8_0.gguf` | Small draft model proposes tokens verified by the main model. |
|
|
116
|
-
| `--draft-max` | `16` | Max tokens to draft per iteration. Higher = more throughput, more VRAM. |
|
|
117
|
-
| `--draft-p-min` | `0.75` | Minimum acceptance probability. Lower = more aggressive drafting. |
|
|
118
|
-
|
|
119
|
-
## Extension Options for Speculative Decoding
|
|
120
|
-
|
|
121
|
-
### Option 1: Adaptive Runtime Tuning (implemented)
|
|
122
|
-
|
|
123
|
-
Use acceptance and rollback rates to auto-adjust `draft-max`, `draft-min`, and `draft-p-min` over time.
|
|
124
|
-
|
|
125
|
-
- Best for immediate gains without kernel changes
|
|
126
|
-
- Reduces bad bursts when acceptance drops
|
|
127
|
-
- Increases burst length automatically during high-acceptance windows
|
|
128
|
-
|
|
129
|
-
Commands:
|
|
130
|
-
|
|
131
|
-
```bash
|
|
132
|
-
# Tune once from observed metrics
|
|
133
|
-
llama-optimize spec-autotune --acceptance 0.71 --rollback 0.14 --profile throughput
|
|
134
|
-
|
|
135
|
-
# Compare static defaults vs adaptive tuning using deterministic simulation
|
|
136
|
-
llama-optimize spec-benchmark --profile throughput --trace mixed --steps 180
|
|
137
|
-
|
|
138
|
-
# Live benchmark active server and get tuned flag recommendation
|
|
139
|
-
llama-optimize spec-benchmark-live \
|
|
140
|
-
--endpoint http://127.0.0.1:8080/v1 \
|
|
141
|
-
--model qwen3.5-a3b-iq4xs \
|
|
142
|
-
--runs 5 --max-tokens 256 --profile throughput
|
|
143
|
-
```
|
|
144
|
-
|
|
145
|
-
Recommended workflow:
|
|
146
|
-
|
|
147
|
-
1. Run `spec-benchmark-live` with your current startup flags and note `Throughput`.
|
|
148
|
-
2. Restart `llama-server` with the `Suggested params` flags.
|
|
149
|
-
3. Re-run `spec-benchmark-live` with the same settings to measure actual gain.
|
|
150
|
-
|
|
151
|
-
### Option 2: GPU Residency + Overlap
|
|
152
|
-
|
|
153
|
-
- Keep draft model and draft KV fully on GPU
|
|
154
|
-
- Preallocate buffers and overlap draft + verify passes with CUDA streams
|
|
155
|
-
- Improves p95 latency consistency on long runs
|
|
156
|
-
|
|
157
|
-
### Option 3: GPU Checkpoint/Rollback
|
|
158
|
-
|
|
159
|
-
- Move speculative checkpoint snapshots from CPU RAM to GPU buffers
|
|
160
|
-
- Remove host-device copy overhead from rollback paths
|
|
161
|
-
- Highest upside, but requires deeper runtime changes
|
|
162
|
-
|
|
163
|
-
### Sampling
|
|
164
|
-
|
|
165
|
-
| Flag | Value | Purpose |
|
|
166
|
-
| ------------------ | ------ | ------------------------------------------------- |
|
|
167
|
-
| `--temp` | `0.6` | Low temperature for deterministic tool calls. |
|
|
168
|
-
| `--top-p` | `0.9` | Nucleus sampling threshold. |
|
|
169
|
-
| `--top-k` | `20` | Limits token candidates per step. |
|
|
170
|
-
| `--min-p` | `0.05` | Filters tokens below 5% of top token probability. |
|
|
171
|
-
| `--repeat-penalty` | `1.0` | No repetition penalty — code naturally repeats patterns. |
|
|
172
|
-
|
|
173
|
-
### Performance
|
|
174
|
-
|
|
175
|
-
| Flag | Value | Purpose |
|
|
176
|
-
| -------------- | -------- | ------------------------------------------------- |
|
|
177
|
-
| `--flash-attn` | (flag) | 1.5-2x speed on long context. |
|
|
178
|
-
| `--gpu-layers` | `35` | Layers offloaded to GPU. Increase if VRAM allows. |
|
|
179
|
-
| `--ctx-size` | `131072` | Full 128K context window. |
|
|
180
|
-
| `--mlock` | (flag) | Prevents OS from swapping model to disk. |
|
|
181
|
-
|
|
182
|
-
## VRAM Estimates
|
|
183
|
-
|
|
184
|
-
| Component | VRAM |
|
|
185
|
-
| ------------------- | ---------- |
|
|
186
|
-
| Main model (IQ4_XS) | ~17 GB |
|
|
187
|
-
| Draft model (Q8_0) | ~0.8 GB |
|
|
188
|
-
| KV cache (128K ctx) | ~2-3 GB |
|
|
189
|
-
| LoRA adapter | ~50 MB |
|
|
190
|
-
| **Total** | **~20 GB** |
|
|
191
|
-
|
|
192
|
-
## Anthropic API Proxy (for Claude Code / Forge Code)
|
|
193
|
-
|
|
194
|
-
Claude Code and Forge Code speak the Anthropic Messages API, but llama.cpp exposes an OpenAI-compatible API. The UAP Anthropic Proxy bridges this gap by translating between the two protocols in real time, including full streaming and tool calling support.
|
|
195
|
-
|
|
196
|
-
### Architecture
|
|
197
|
-
|
|
198
|
-
```
|
|
199
|
-
Claude Code --(Anthropic API :4000)--> UAP Proxy --(OpenAI API :8080)--> llama.cpp
|
|
200
|
-
```
|
|
201
|
-
|
|
202
|
-
### Quick Start
|
|
203
|
-
|
|
204
|
-
```bash
|
|
205
|
-
# Install Python dependencies
|
|
206
|
-
pip install -r tools/agents/scripts/requirements-proxy.txt
|
|
207
|
-
|
|
208
|
-
# Start the proxy (default: listen on :4000, forward to llama.cpp on :8080)
|
|
209
|
-
python tools/agents/scripts/anthropic_proxy.py
|
|
210
|
-
```
|
|
211
|
-
|
|
212
|
-
### Configuration
|
|
213
|
-
|
|
214
|
-
All settings are via environment variables:
|
|
215
|
-
|
|
216
|
-
| Variable | Default | Description |
|
|
217
|
-
| ----------------------- | ------------------------------------ | ---------------------------------------- |
|
|
218
|
-
| `LLAMA_CPP_BASE` | `http://192.168.1.165:8080/v1` | OpenAI-compatible upstream server URL |
|
|
219
|
-
| `PROXY_PORT` | `4000` | Port for the proxy to listen on |
|
|
220
|
-
| `PROXY_HOST` | `0.0.0.0` | Host/IP to bind to |
|
|
221
|
-
| `PROXY_LOG_LEVEL` | `INFO` | Logging level (DEBUG/INFO/WARNING/ERROR) |
|
|
222
|
-
| `PROXY_READ_TIMEOUT` | `600` | Read timeout (seconds) for LLM streaming |
|
|
223
|
-
| `PROXY_MAX_CONNECTIONS` | `20` | Max concurrent upstream connections |
|
|
224
|
-
| `PROXY_MAX_TOKENS_FLOOR` | `16384` | Minimum floor applied to incoming `max_tokens` (`0` disables floor) |
|
|
225
|
-
| `PROXY_CONTEXT_PRUNE_TARGET_FRACTION` | `0.65` | Target context utilization after pruning (`0.0 < value < 1.0`) |
|
|
226
|
-
| `PROXY_STREAM_REASONING_FALLBACK` | `off` | Streaming behavior for reasoning-only empty turns (`off`, `sanitized`, `visible`) |
|
|
227
|
-
| `PROXY_STREAM_REASONING_MAX_CHARS` | `240` | Max fallback length when `PROXY_STREAM_REASONING_FALLBACK=sanitized` |
|
|
228
|
-
| `PROXY_TOOL_NARROWING` | `off` | Narrow large tool lists to top relevant tools per turn |
|
|
229
|
-
| `PROXY_TOOL_NARROWING_KEEP` | `8` | Number of tools to keep when narrowing is enabled |
|
|
230
|
-
| `PROXY_TOOL_NARROWING_MIN_TOOLS` | `12` | Minimum tool count before narrowing activates |
|
|
231
|
-
| `PROXY_DISABLE_THINKING_ON_TOOL_TURNS` | `off` | Sends `enable_thinking=false` when tools are present |
|
|
232
|
-
| `PROXY_MALFORMED_TOOL_GUARDRAIL` | `on` | Detects malformed pseudo tool payloads and retries with strict settings |
|
|
233
|
-
| `PROXY_MALFORMED_TOOL_RETRY_MAX` | `1` | Number of malformed-tool retries |
|
|
234
|
-
| `PROXY_MALFORMED_TOOL_RETRY_MAX_TOKENS` | `2048` | Retry cap for `max_tokens` during malformed-tool recovery |
|
|
235
|
-
| `PROXY_MALFORMED_TOOL_RETRY_TEMPERATURE` | `0` | Retry temperature for malformed-tool recovery |
|
|
236
|
-
| `PROXY_MALFORMED_TOOL_STREAM_STRICT` | `off` | For stream+tools requests, use guarded non-stream upstream path then replay SSE |
|
|
237
|
-
| `PROXY_SESSION_CONTAMINATION_BREAKER` | `on` | Resets long-running malformed sessions to recent context |
|
|
238
|
-
| `PROXY_SESSION_CONTAMINATION_THRESHOLD` | `3` | Consecutive malformed turns before reset |
|
|
239
|
-
| `PROXY_SESSION_CONTAMINATION_KEEP_LAST` | `8` | Number of latest messages to preserve during contamination reset |
|
|
240
|
-
| `PROXY_AGENTIC_SUPPLEMENT_MODE` | `clean` | Agentic system supplement variant (`clean`, `legacy`) |
|
|
241
|
-
|
|
242
|
-
For agentic coding workloads, keep `PROXY_STREAM_REASONING_FALLBACK=off` (default) to avoid leaking malformed internal reasoning as user-visible output. Use `sanitized` only for debugging.
|
|
243
|
-
|
|
244
|
-
For Claude Code + Qwen malformed-tool loops, recommended starting profile:
|
|
245
|
-
|
|
246
|
-
```bash
|
|
247
|
-
PROXY_STREAM_REASONING_FALLBACK=off
|
|
248
|
-
PROXY_MAX_TOKENS_FLOOR=4096
|
|
249
|
-
PROXY_MALFORMED_TOOL_GUARDRAIL=on
|
|
250
|
-
PROXY_TOOL_NARROWING=on
|
|
251
|
-
PROXY_DISABLE_THINKING_ON_TOOL_TURNS=on
|
|
252
|
-
PROXY_SESSION_CONTAMINATION_BREAKER=on
|
|
253
|
-
PROXY_AGENTIC_SUPPLEMENT_MODE=clean
|
|
254
|
-
```
|
|
255
|
-
|
|
256
|
-
### Example: Custom upstream
|
|
257
|
-
|
|
258
|
-
```bash
|
|
259
|
-
LLAMA_CPP_BASE=http://localhost:8080/v1 PROXY_PORT=5000 python tools/agents/scripts/anthropic_proxy.py
|
|
260
|
-
```
|
|
261
|
-
|
|
262
|
-
### Claude Code Configuration
|
|
263
|
-
|
|
264
|
-
Point Claude Code at the proxy by setting the API base URL:
|
|
265
|
-
|
|
266
|
-
```bash
|
|
267
|
-
export ANTHROPIC_BASE_URL=http://localhost:4000
|
|
268
|
-
```
|
|
269
|
-
|
|
270
|
-
### Endpoints
|
|
271
|
-
|
|
272
|
-
The proxy speaks **Anthropic Messages API as its canonical interface** and
|
|
273
|
-
keeps an **OpenAI Chat Completions passthrough** for clients that require the
|
|
274
|
-
OpenAI shape. Both paths run through the same guarded pipeline (loop
|
|
275
|
-
detection, tool narrowing, malformed-payload retry, context pruning, etc.) —
|
|
276
|
-
the OpenAI route converts the request to Anthropic, runs the pipeline, and
|
|
277
|
-
re-shapes the final response back to OpenAI.
|
|
278
|
-
|
|
279
|
-
| Path | Method | Shape | Description |
|
|
280
|
-
| ------------------------ | ------ | --------- | --------------------------------------------------------------- |
|
|
281
|
-
| `/v1/messages` | POST | Anthropic | Anthropic Messages API — default/canonical (streaming + sync) |
|
|
282
|
-
| `/anthropic/v1/messages` | POST | Anthropic | Alias for `/v1/messages` (some Claude Code configs use this) |
|
|
283
|
-
| `/v1/chat/completions` | POST | OpenAI | OpenAI Chat Completions passthrough (e.g. Forge, OpenCode) |
|
|
284
|
-
| `/v1/models` | GET | Anthropic | Lists spoofed Anthropic model IDs |
|
|
285
|
-
| `/health` | GET | — | Health check (verifies upstream reachability) |
|
|
286
|
-
| `/v1/context` | GET | — | Current session context usage and pruning state |
|
|
287
|
-
|
|
288
|
-
### Running as a Service (systemd)
|
|
289
|
-
|
|
290
|
-
```ini
|
|
291
|
-
[Unit]
|
|
292
|
-
Description=UAP Anthropic Proxy
|
|
293
|
-
After=network.target
|
|
294
|
-
|
|
295
|
-
[Service]
|
|
296
|
-
Type=simple
|
|
297
|
-
User=cogtek
|
|
298
|
-
Environment=LLAMA_CPP_BASE=http://192.168.1.165:8080/v1
|
|
299
|
-
Environment=PROXY_PORT=4000
|
|
300
|
-
ExecStart=/usr/bin/python3 /path/to/tools/agents/scripts/anthropic_proxy.py
|
|
301
|
-
Restart=always
|
|
302
|
-
RestartSec=5
|
|
303
|
-
|
|
304
|
-
[Install]
|
|
305
|
-
WantedBy=multi-user.target
|
|
306
|
-
```
|
|
307
|
-
|
|
308
|
-
## Tool Call Format
|
|
309
|
-
|
|
310
|
-
The model emits tool calls in the official Qwen3 format:
|
|
311
|
-
|
|
312
|
-
```
|
|
313
|
-
<tool_call>
|
|
314
|
-
{"name": "read_file", "arguments": {"path": "/etc/hosts"}}
|
|
315
|
-
</tool_call>
|
|
316
|
-
```
|
|
317
|
-
|
|
318
|
-
Multiple tool calls in a single turn:
|
|
319
|
-
|
|
320
|
-
```
|
|
321
|
-
<tool_call>
|
|
322
|
-
{"name": "read_file", "arguments": {"path": "/etc/hosts"}}
|
|
323
|
-
</tool_call>
|
|
324
|
-
<tool_call>
|
|
325
|
-
{"name": "list_dir", "arguments": {"path": "/tmp"}}
|
|
326
|
-
</tool_call>
|
|
327
|
-
```
|
|
328
|
-
|
|
329
|
-
llama.cpp's autoparser handles stop behavior structurally via PEG grammar rules, not stop sequences. No explicit `</tool_call>` stop sequence is needed at the server level.
|
|
330
|
-
|
|
331
|
-
## LoRA Training Pipeline
|
|
332
|
-
|
|
333
|
-
### 1. Generate Training Data
|
|
334
|
-
|
|
335
|
-
```bash
|
|
336
|
-
python3 tools/agents/scripts/generate_lora_training_data.py -n 500
|
|
337
|
-
```
|
|
338
|
-
|
|
339
|
-
Produces `tool_call_training_data.jsonl` with ChatML-formatted examples using the official `<tool_call>` format.
|
|
340
|
-
|
|
341
|
-
### 2. Fine-Tune
|
|
342
|
-
|
|
343
|
-
Using axolotl:
|
|
344
|
-
|
|
345
|
-
```bash
|
|
346
|
-
accelerate launch -m axolotl.cli.train config/lora-finetune.yaml
|
|
347
|
-
```
|
|
348
|
-
|
|
349
|
-
Using unsloth (faster, less VRAM):
|
|
350
|
-
|
|
351
|
-
```bash
|
|
352
|
-
unsloth train --config config/lora-finetune.yaml
|
|
353
|
-
```
|
|
354
|
-
|
|
355
|
-
Training config highlights (`config/lora-finetune.yaml`):
|
|
356
|
-
|
|
357
|
-
- LoRA rank 16, alpha 32
|
|
358
|
-
- Targets all linear layers (q/k/v/o/gate/up/down projections)
|
|
359
|
-
- 3 epochs, cosine LR schedule, 2e-4 learning rate
|
|
360
|
-
- BF16 + gradient checkpointing + flash attention
|
|
361
|
-
|
|
362
|
-
### 3. Convert to GGUF
|
|
363
|
-
|
|
364
|
-
```bash
|
|
365
|
-
python3 convert_lora_to_gguf.py \
|
|
366
|
-
--base Qwen/Qwen3.5-35B-A3B \
|
|
367
|
-
--lora output/qwen35-tool-call-lora \
|
|
368
|
-
--output adapter.gguf
|
|
369
|
-
```
|
|
370
|
-
|
|
371
|
-
### 4. Load at Runtime
|
|
372
|
-
|
|
373
|
-
```bash
|
|
374
|
-
llama-server --model base.gguf --lora adapter.gguf --lora-scale 1.0
|
|
375
|
-
```
|
|
376
|
-
|
|
377
|
-
## Quantization Options
|
|
378
|
-
|
|
379
|
-
| Quant | VRAM | Accuracy | Tool Call Reliability |
|
|
380
|
-
| ------ | ----- | -------- | --------------------- |
|
|
381
|
-
| IQ4_XS | 17 GB | 96% | 94% |
|
|
382
|
-
| Q4_K_M | 20 GB | 95% | 95% |
|
|
383
|
-
| Q5_K_M | 24 GB | 97% | 97% |
|
|
384
|
-
| Q6_K | 28 GB | 98% | 98% |
|
|
385
|
-
|
|
386
|
-
## Troubleshooting
|
|
387
|
-
|
|
388
|
-
### "Template supports tool calls but does not natively describe tools"
|
|
389
|
-
|
|
390
|
-
This warning means llama.cpp detected `tool_calls` handling but no `tools` variable access in the template. The `chat_template.jinja` in this repo resolves this by including a `{%- if tools %}` block that renders tool descriptions in `<tools></tools>` XML tags.
|
|
391
|
-
|
|
392
|
-
Verify the template is loaded:
|
|
393
|
-
|
|
394
|
-
```bash
|
|
395
|
-
llama-server --chat-template-file chat_template.jinja --verbose
|
|
396
|
-
```
|
|
397
|
-
|
|
398
|
-
### LoRA not taking effect
|
|
399
|
-
|
|
400
|
-
- Ensure the adapter was converted to GGUF format (not safetensors/PyTorch)
|
|
401
|
-
- Check `--lora-scale` is not `0.0`
|
|
402
|
-
- Verify the adapter was trained against the same base model architecture
|
|
403
|
-
|
|
404
|
-
### Grammar rejecting valid output
|
|
405
|
-
|
|
406
|
-
If using the GBNF grammar via per-request `grammar` field and it's too restrictive, the model may produce truncated output. Check `tools/agents/config/tool-call.gbnf` allows the argument types your tools use (strings, numbers, objects, arrays, booleans, null are all supported).
|
|
407
|
-
|
|
408
|
-
### Model only outputs tool calls, never plain text
|
|
409
|
-
|
|
410
|
-
You are likely using `--grammar-file` on the server command line. This forces ALL output into `<tool_call>` format. Remove `--grammar-file` from the startup command and let the autoparser handle tool call detection lazily.
|
|
411
|
-
|
|
412
|
-
### Multi-tool calls truncated to single call
|
|
413
|
-
|
|
414
|
-
Two possible causes:
|
|
415
|
-
|
|
416
|
-
1. `--grammar-file` is set globally and the stop sequence `</tool_call>` terminates after the first call. Remove `--grammar-file`.
|
|
417
|
-
2. The client is not passing `parallel_tool_calls: true` in the request. Add it to enable multiple tool calls per turn.
|
|
418
|
-
|
|
419
|
-
## Related Files
|
|
420
|
-
|
|
421
|
-
- `tools/agents/scripts/qwen_tool_call_wrapper.py` - Python wrapper with retry logic and format validation
|
|
422
|
-
- `tools/agents/scripts/fix_qwen_chat_template.py` - Template verifier/fixer (detects format, validates Jinja2)
|
|
423
|
-
- `tools/agents/scripts/qwen_tool_call_test.py` - Test suite using OpenAI-compatible API
|
|
424
|
-
- `src/cli/tool-calls.ts` - CLI command for template management
|
|
425
|
-
- `src/bin/llama-server-optimize.ts` - llama-server startup optimizer
|
|
426
|
-
- `docs/deployment/UAP_LLAMA_ANTHROPIC_PROXY_BOOTSTRAP.md` - service bootstrap + ngram-cache A/B benchmarking
|