@miller-tech/uap 1.39.0 → 1.40.1
- package/README.md +109 -642
- package/dist/.tsbuildinfo +1 -1
- package/dist/bin/cli.js +2 -2
- package/dist/bin/cli.js.map +1 -1
- package/dist/cli/deliver.d.ts +3 -2
- package/dist/cli/deliver.d.ts.map +1 -1
- package/dist/cli/deliver.js +10 -5
- package/dist/cli/deliver.js.map +1 -1
- package/docs/INDEX.md +48 -286
- package/docs/architecture/OVERVIEW.md +328 -0
- package/docs/architecture/PROTOCOL.md +204 -0
- package/docs/benchmarks/README.md +17 -192
- package/docs/getting-started/CONFIGURATION.md +237 -0
- package/docs/getting-started/INSTALLATION.md +125 -0
- package/docs/getting-started/QUICKSTART.md +115 -0
- package/docs/guides/COORDINATION.md +162 -0
- package/docs/guides/DELIVER.md +115 -0
- package/docs/guides/DEPLOY_BATCHING.md +212 -0
- package/docs/guides/DROIDS_AND_SKILLS.md +202 -0
- package/docs/guides/LOCAL_MODELS.md +148 -0
- package/docs/guides/MCP_ROUTER.md +195 -0
- package/docs/guides/MEMORY.md +235 -0
- package/docs/guides/MULTI_MODEL.md +223 -0
- package/docs/guides/POLICIES.md +190 -0
- package/docs/guides/WORKTREE_WORKFLOW.md +185 -0
- package/docs/integrations/MCP_ROUTER.md +147 -0
- package/docs/integrations/RTK.md +102 -0
- package/docs/reference/API.md +485 -0
- package/docs/reference/CLI.md +719 -0
- package/docs/reference/CONFIGURATION.md +90 -193
- package/docs/reference/DATABASE_SCHEMA.md +110 -344
- package/docs/reference/FEATURES.md +176 -472
- package/docs/reference/PATTERNS.md +102 -0
- package/docs/reference/PLATFORMS.md +83 -0
- package/package.json +1 -1
- package/docs/AGENTS.md +0 -423
- package/docs/DOCUMENTATION_AUDIT_REPORT.md +0 -131
- package/docs/GETTING_STARTED.md +0 -288
- package/docs/PROJECT_ANALYSIS_REPORT.md +0 -510
- package/docs/architecture/COMPLETE_ARCHITECTURE.md +0 -748
- package/docs/architecture/EXPERT_STACK.md +0 -137
- package/docs/architecture/MULTI_MODEL.md +0 -224
- package/docs/architecture/PLATFORM_GATING.md +0 -68
- package/docs/architecture/SYSTEM_ANALYSIS.md +0 -334
- package/docs/architecture/UAP_COMPLIANCE.md +0 -217
- package/docs/architecture/UAP_PROTOCOL.md +0 -339
- package/docs/architecture/UAP_STRICT_DROIDS.md +0 -172
- package/docs/archive/BALLS_MODE_SELF_ANALYSIS.md +0 -260
- package/docs/archive/BENCHMARK_GAPS_AND_PLAN.md +0 -146
- package/docs/archive/FAILING_TASKS_SOLUTION_PLAN.md +0 -668
- package/docs/archive/JINJA2-SYSTEM-MESSAGE-FIX.md +0 -209
- package/docs/archive/MODEL_ROUTING_IMPLEMENTATION_SUMMARY.md +0 -281
- package/docs/archive/MODEL_ROUTING_OPTIMIZATION_PLAN.md +0 -320
- package/docs/archive/NPM-PUBLISH-V0.9.1.md +0 -240
- package/docs/archive/OPTIMIZATION_OPTIONS.md +0 -334
- package/docs/archive/PARALLELISM_GAPS_AND_OPTIONS.md +0 -422
- package/docs/archive/POLICY_GATE_IMPLEMENTATION.md +0 -245
- package/docs/archive/SETUP_IMPROVEMENTS.md +0 -213
- package/docs/archive/UAP_GENERIC_OPTIMIZATION_PLAN.md +0 -270
- package/docs/archive/UAP_OPTIMIZATION_PLAN.md +0 -701
- package/docs/archive/UAP_V103_PATTERN_DESIGN.md +0 -315
- package/docs/archive/UAP_V104_COMPLIANCE_DESIGN.md +0 -223
- package/docs/archive/changelog/2026-03-10_uap-100-compliance.md +0 -77
- package/docs/archive/changelog/2026-03-10_uap-full-system-verification.md +0 -109
- package/docs/archive/opencode-integration-guide.md +0 -740
- package/docs/archive/opencode-integration-quickref.md +0 -180
- package/docs/benchmarks/OVERNIGHT_RUNNER.md +0 -341
- package/docs/benchmarks/SPECULATIVE_DECODING_JOURNEY_2026-03.md +0 -221
- package/docs/benchmarks/VALIDATION_PLAN.md +0 -568
- package/docs/blog/SPECULATIVE_DECODING_PRODUCTION_PLAYBOOK.md +0 -139
- package/docs/blog/local-coding-agents.md +0 -266
- package/docs/blog/x-thread.md +0 -254
- package/docs/deployment/DEPLOYMENT.md +0 -895
- package/docs/deployment/DEPLOYMENT_STRATEGIES.md +0 -518
- package/docs/deployment/DEPLOY_BATCHER_ANALYSIS.md +0 -224
- package/docs/deployment/DEPLOY_BATCHING.md +0 -273
- package/docs/deployment/DEPLOY_BUCKETING_ANALYSIS.md +0 -420
- package/docs/deployment/QWEN35_LLAMA_CPP.md +0 -426
- package/docs/deployment/UAP_LLAMA_ANTHROPIC_PROXY_BOOTSTRAP.md +0 -279
- package/docs/getting-started/INTEGRATION.md +0 -628
- package/docs/getting-started/OVERVIEW.md +0 -324
- package/docs/getting-started/SETUP.md +0 -377
- package/docs/integrations/MCP_ROUTER_SETUP.md +0 -445
- package/docs/integrations/RTK_INTEGRATION.md +0 -468
- package/docs/operations/TROUBLESHOOTING.md +0 -660
- package/docs/pr/PR_SPECULATIVE_DOCS_TEMPLATE.md +0 -146
- package/docs/pr/UPSTREAM_PRS.md +0 -424
- package/docs/reference/API_REFERENCE.md +0 -903
- package/docs/reference/EXPERT_DROIDS.md +0 -219
- package/docs/reference/HARNESS-MATRIX.md +0 -318
- package/docs/reference/PATTERN_LIBRARY.md +0 -636
- package/docs/reference/UAP_CLI_REFERENCE.md +0 -620
- package/docs/research/BEHAVIORAL_PATTERNS.md +0 -228
- package/docs/research/DOMAIN_STRATEGIES.md +0 -316
- package/docs/research/MEMORY_SYSTEMS_COMPARISON.md +0 -812
- package/docs/research/PATTERN_ANALYSIS_2026-01-18.md +0 -436
- package/docs/research/PERFORMANCE_ANALYSIS_2026-01-18.md +0 -209
- package/docs/research/PERFORMANCE_TEST_PLAN.md +0 -383
- package/docs/research/TERMINAL_BENCH_LEARNINGS.md +0 -217
@@ -0,0 +1,148 @@
# Running UAP Against Local Models

> UAP v1.40.0

UAP can drive its coding/convergence loop against **local models** served by
[llama.cpp](https://github.com/ggml-org/llama.cpp) instead of a hosted API.
This keeps inference on your own hardware (zero per-token cost) and works with
quantized open-weight models such as Qwen 3.x.

There are two endpoint shapes involved, and it matters which client speaks
which protocol:

- **`uap deliver`** speaks the **OpenAI-compatible** Chat Completions API
  (`POST /v1/chat/completions`). It points directly at llama.cpp's OpenAI
  endpoint (commonly `:8080/v1`), so no translation proxy is needed for UAP's own
  loop. See
  [`src/models/openai-compat-client.ts`](../../src/models/openai-compat-client.ts).
- **Anthropic-protocol clients** (e.g. Claude Code) speak the **Anthropic
  Messages API**. When llama.cpp serves the Anthropic protocol natively, those
  clients can point straight at the local port. Where native Anthropic serving
  is not available, the bundled `uap-anthropic-proxy` translates Anthropic
  requests into OpenAI requests for llama.cpp. Prefer the direct, native path
  when you have it.

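To make the first endpoint shape concrete, here is a minimal sketch of the kind of request an OpenAI-compatible client sends. It is not taken from `openai-compat-client.ts`: the helper name `buildChatRequest` and the `ChatRequest` type are illustrative assumptions, and only the `/chat/completions` path and the `model`/`messages` body fields come from the Chat Completions API itself.

```typescript
// Illustrative types; not UAP's own definitions.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Build (but do not send) a Chat Completions request for an
// OpenAI-compatible `/v1` base such as llama.cpp's server.
function buildChatRequest(base: string, model: string, messages: ChatMessage[]): ChatRequest {
  const trimmed = base.replace(/\/+$/, ""); // tolerate a trailing slash
  return {
    url: `${trimmed}/chat/completions`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages }),
  };
}

const request = buildChatRequest("http://localhost:8080/v1/", "qwen35-a3b-iq4xs", [
  { role: "user", content: "fix the failing build" },
]);
console.log(request.url);
```

The returned object can be passed to `fetch(request.url, request)`; the sketch stops short of sending it so that it runs without a server.
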
## The model presets

`uap deliver` selects a model by **preset id**. Presets are defined in
[`src/models/types.ts`](../../src/models/types.ts) (`ModelPresets`). The default
local preset is `qwen35-a3b`:

```jsonc
// qwen35-a3b (excerpt)
{
  "provider": "custom",
  "apiModel": "qwen35-a3b-iq4xs",
  "endpoint": "http://192.168.1.165:8080/v1", // llama.cpp OpenAI endpoint
  "maxContextTokens": 262144
}
```

The endpoint is the llama.cpp server's OpenAI-compatible base. Adjust it for
your host (for example `http://localhost:8080/v1`) by overriding the endpoint
(see below) or editing the preset.

> The exact host/IP in the shipped preset is environment-specific. Point it at
> wherever your llama.cpp server is listening.

## Serving a local model

Start llama.cpp's `llama-server` listening on an OpenAI-compatible port
(`:8080` by default in the helper scripts). A continuity helper for serving is
included at
[`scripts/run-llama-server-continuity.sh`](../../scripts/run-llama-server-continuity.sh);
it wraps `llama-server` with `--host`, `--port` (default `8080`), `--model`,
and an optional `--chat-template-file`. Models with a custom chat format need
the correct template applied.

UAP ships several helpers around local serving (registered as bins in
`package.json`):

- **`llama-optimize`**: generates optimal `llama.cpp` startup parameters
  (quantization profile, KV-cache quant, flash attention, speculative
  decoding, etc.) for Qwen 3.x-class models on 16 GB/24 GB VRAM. Source:
  [`src/bin/llama-server-optimize.ts`](../../src/bin/llama-server-optimize.ts).
- **`uap-template-verify`**: model-agnostic chat-template finder/verifier;
  validates Jinja2 syntax, renders test data, and checks tool-call format
  support. Source:
  [`tools/agents/scripts/chat_template_verifier.py`](../../tools/agents/scripts/chat_template_verifier.py).
- **`uap-anthropic-proxy`**: Anthropic→OpenAI translation proxy for clients
  that only speak the Anthropic Messages API. Source:
  [`tools/agents/scripts/anthropic_proxy.py`](../../tools/agents/scripts/anthropic_proxy.py).
  It reads `LLAMA_CPP_BASE` (upstream OpenAI endpoint) and `PROXY_PORT` from the
  environment. Use this only when a client can't reach a native Anthropic
  endpoint.

## Running the convergence loop locally

`uap deliver` iterates a model through execute → apply → verify → feedback until
the project's completion gates pass. To run it against a local model, pass the preset:

```bash
uap deliver "fix the failing build" --model qwen35-a3b
```

Because `qwen35-a3b` is the default preset, you can also omit `--model`, or set
the default once through `UAP_DELIVER_MODEL`:

```bash
export UAP_DELIVER_MODEL=qwen35-a3b
uap deliver "add input validation to the parser"
```

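The execute → apply → verify → feedback cycle can be sketched as a bounded loop. This is an illustration under assumptions, not the code behind `uap deliver`: the `LoopDeps` interface, the `converge` function, and the synchronous callbacks are hypothetical, and a real run would call a model and shell out to gates asynchronously. Only the phase order and the default of 5 turns come from this guide.

```typescript
// Hypothetical shapes for one gate result and the loop's dependencies.
type GateResult = { gate: string; passed: boolean; output: string };

interface LoopDeps {
  execute: (task: string, feedback: string) => string; // model proposes a change
  apply: (change: string) => void;                     // write it to the workspace
  verify: () => GateResult[];                          // run build/typecheck/test/lint
}

interface LoopOutcome {
  converged: boolean;
  turns: number;
}

// Iterate until every gate passes or the turn budget is spent.
function converge(task: string, deps: LoopDeps, maxTurns = 5): LoopOutcome {
  let feedback = "";
  for (let turn = 1; turn <= maxTurns; turn++) {
    const change = deps.execute(task, feedback);
    deps.apply(change);
    const failures = deps.verify().filter((r) => !r.passed);
    if (failures.length === 0) return { converged: true, turns: turn };
    // Failing gate output becomes the feedback for the next turn.
    feedback = failures.map((f) => `${f.gate}: ${f.output}`).join("\n");
  }
  return { converged: false, turns: maxTurns };
}

// Stub dependencies: the fake build gate passes once three changes are applied.
let applied = 0;
const outcome = converge("fix the failing build", {
  execute: (_task, feedback) => (feedback ? "retry patch" : "first patch"),
  apply: () => { applied += 1; },
  verify: () => [{ gate: "build", passed: applied >= 3, output: "error TS2304" }],
});
console.log(outcome);
```

Injecting `execute`, `apply`, and `verify` keeps the loop testable without a model or a build toolchain.
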
### Pointing at your own server

Override the endpoint without touching the preset:

```bash
uap deliver "refactor the auth module" \
  --model qwen35-a3b \
  --endpoint http://localhost:8080/v1
```

The endpoint must be an OpenAI-compatible `/v1` base. If neither the
`--endpoint` flag nor the preset sets an endpoint, the client falls back to
`UAP_INFERENCE_ENDPOINT`, then to `http://localhost:4000/v1`.

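The fallback order for the endpoint can be written down as a small resolver. This is a sketch of the precedence described in this guide, not the client's actual code: `resolveEndpoint` and `EndpointSources` are hypothetical names, and the environment value is passed in as a plain argument so the example has no runtime dependencies.

```typescript
// Hypothetical container for the three places an endpoint can come from.
interface EndpointSources {
  flag?: string;   // value of --endpoint
  preset?: string; // "endpoint" field on the model preset
  env?: string;    // value of UAP_INFERENCE_ENDPOINT
}

const LAST_RESORT = "http://localhost:4000/v1";

// First non-empty source wins: flag, then preset, then environment, then default.
function resolveEndpoint(sources: EndpointSources): string {
  const candidates = [sources.flag, sources.preset, sources.env];
  for (const candidate of candidates) {
    if (candidate !== undefined && candidate.trim() !== "") return candidate.trim();
  }
  return LAST_RESORT;
}

console.log(resolveEndpoint({ preset: "http://192.168.1.165:8080/v1" }));
console.log(resolveEndpoint({}));
```
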
### Useful `uap deliver` options

| Option              | Meaning |
| ------------------- | ------- |
| `-m, --model <preset>` | Model preset id (default `$UAP_DELIVER_MODEL` or `qwen35-a3b`) |
| `--endpoint <url>` | Override the model endpoint (OpenAI-compatible `/v1`) |
| `--temperature <t>` | Sampling temperature (default: execution-profile value) |
| `--max-turns <n>` | Maximum execute→verify iterations (default 5) |
| `--gates <ids>` | Restrict to a subset of gates (`build,typecheck,test,lint`) |
| `--escalate` / `--escalate-model <preset>` | On stagnation, escalate to a stronger model preset (default `$UAP_ESCALATE_MODEL`) |
| `--coordinate` | Register the run with the coordination layer (announce, heartbeat, overlap detection) |
| `--deploy` | On success, queue a commit of applied files into the deploy batcher |
| `--dry-run` | Show detected gates and plan without calling the model |

A common local pattern is a cheap local executor that escalates to a hosted
model only when it stalls:

```bash
uap deliver "implement the retry logic" \
  --model qwen35-a3b \
  --escalate --escalate-model sonnet-4.6
```

## Connecting Claude Code (or other Anthropic clients)

If your llama.cpp server exposes the Anthropic Messages API natively, point the
Anthropic client at that local port directly; this is the preferred path.

Otherwise, run the translation proxy in front of llama.cpp:

```bash
LLAMA_CPP_BASE=http://localhost:8080/v1 PROXY_PORT=4000 uap-anthropic-proxy
```

The client then speaks the Anthropic protocol to the proxy, which forwards
OpenAI requests to llama.cpp. The proxy supports streaming and tool-call
translation.

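The core of an Anthropic→OpenAI translation is a reshaping of the request body. The sketch below covers plain text messages only and is an illustration, not the logic of `anthropic_proxy.py` (which is a Python script and also handles streaming and tool calls). The type and function names are assumptions; the field names (`system`, `max_tokens`, `messages`, text content blocks) follow the public Anthropic Messages and OpenAI Chat Completions request formats.

```typescript
type TextBlock = { type: "text"; text: string };

// Subset of an Anthropic Messages request (text content only).
interface AnthropicRequest {
  model: string;
  max_tokens: number;
  system?: string;
  messages: { role: "user" | "assistant"; content: string | TextBlock[] }[];
}

// Subset of an OpenAI Chat Completions request.
interface OpenAIRequest {
  model: string;
  max_tokens: number;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
}

// Anthropic content is either a string or a list of blocks; join the text blocks.
function flatten(content: string | TextBlock[]): string {
  return typeof content === "string" ? content : content.map((b) => b.text).join("");
}

// The top-level `system` prompt becomes a leading system message.
function toOpenAI(req: AnthropicRequest): OpenAIRequest {
  const messages: OpenAIRequest["messages"] = [];
  if (req.system) messages.push({ role: "system", content: req.system });
  for (const m of req.messages) {
    messages.push({ role: m.role, content: flatten(m.content) });
  }
  return { model: req.model, max_tokens: req.max_tokens, messages };
}

const translated = toOpenAI({
  model: "qwen35-a3b-iq4xs",
  max_tokens: 256,
  system: "You are a careful reviewer.",
  messages: [{ role: "user", content: [{ type: "text", text: "Summarize the diff." }] }],
});
console.log(JSON.stringify(translated, null, 2));
```
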
## Related

- [Deploy Batching](./DEPLOY_BATCHING.md): what `uap deliver --deploy` queues.
- [Coordination](./COORDINATION.md): what `uap deliver --coordinate` registers.
@@ -0,0 +1,195 @@
# MCP Router

> UAP v1.40.0

The MCP Router is a token-optimizing proxy that sits between an AI harness and
its MCP tool servers. It is implemented as 11 modules under
[`src/mcp-router/`](../../src/mcp-router/) and exposed through the
`uap mcp-router` CLI.

## The problem: tool-output token bloat

When an agent calls an MCP tool, the *entire* tool result is injected into the
model's context window. A single `read_file`, search, or API call can return tens
of kilobytes of mostly irrelevant text. Across a session this dominates token
spend and crowds out useful context.

The router solves this by **compressing tool output before it reaches the model**,
returning only the parts the agent actually needs. In practice this yields **up
to 98% token reduction** on large outputs.

## Compression strategy

Every tool result passes through the output compressor
([`output-compressor.ts`](../../src/mcp-router/output-compressor.ts)), which picks
a strategy based on output size and whether the call supplied an *intent*:

| Output size | Strategy | Method |
|-------------|----------|--------|
| ≤ 5 KB | **Pass through** unchanged | `passthrough` |
| 5–10 KB | **Head + tail** smart truncation | `truncated` |
| ≥ 10 KB **with intent** | **FTS5 index-then-search**: return only matching snippets | `indexed` |
| ≥ 10 KB without intent | Head + tail smart truncation | `truncated` |

The exact thresholds are `5120` bytes (truncation) and `10240` bytes
(auto-indexing).

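The table reduces to a short decision function. The sketch below is an assumption-laden reading of it rather than the code in `output-compressor.ts`: `chooseMethod` and the constant names are hypothetical, and the behavior at exactly 10240 bytes follows the table's "≥ 10 KB" row, which the source may or may not implement as an inclusive bound.

```typescript
type Method = "passthrough" | "truncated" | "indexed";

const TRUNCATE_THRESHOLD = 5120; // bytes; at or below this, output is untouched
const INDEX_THRESHOLD = 10240;   // bytes; at or above this, intent enables indexing

// Pick a compression method from the output size and whether an intent was given.
function chooseMethod(sizeBytes: number, hasIntent: boolean): Method {
  if (sizeBytes <= TRUNCATE_THRESHOLD) return "passthrough";
  if (sizeBytes >= INDEX_THRESHOLD && hasIntent) return "indexed";
  return "truncated";
}

console.log(chooseMethod(4096, false));
console.log(chooseMethod(8000, true));
console.log(chooseMethod(50000, true));
console.log(chooseMethod(50000, false));
```

Note that supplying an intent only changes the outcome for large outputs; between the two thresholds the result is truncated either way.
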
### Intent-driven FTS5 search

When an output is large and the agent describes what it is looking for (an
`intent`), the compressor:

1. **Chunks** the content by structure: markdown headings first, then
   blank-line paragraphs, then fixed-size line blocks as a fallback.
2. Builds an **in-memory SQLite FTS5** virtual table (`porter` tokenizer) over
   the chunks.
3. Runs the intent as a **BM25-ranked** full-text query and returns up to
   **3 matching snippets** (`MAX_SNIPPETS`).
4. Appends a short list of searchable vocabulary terms so the agent can refine a
   follow-up query.

If FTS5 returns nothing, it falls back to a keyword scan of the chunks, and
finally to plain truncation.

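The chunking step and the keyword-scan fallback can be approximated without SQLite. The sketch below is illustrative only: it does not build an FTS5 index or compute BM25, the function names and the 20-line block size are assumptions, and the keyword score is a simple count of intent terms present in each chunk.

```typescript
const BLOCK_LINES = 20; // hypothetical fallback block size

// Split by markdown headings, else blank-line paragraphs, else fixed line blocks.
function chunkByStructure(text: string): string[] {
  if (/^#{1,6}\s/m.test(text)) {
    return text
      .split(/^(?=#{1,6}\s)/m)
      .map((s) => s.trim())
      .filter((s) => s.length > 0);
  }
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  if (paragraphs.length > 1) return paragraphs;
  const lines = text.split("\n");
  const blocks: string[] = [];
  for (let i = 0; i < lines.length; i += BLOCK_LINES) {
    blocks.push(lines.slice(i, i + BLOCK_LINES).join("\n"));
  }
  return blocks;
}

// Rank chunks by how many intent terms they contain; keep the best few.
function keywordScan(chunks: string[], intent: string, maxSnippets = 3): string[] {
  const terms = intent.toLowerCase().split(/\W+/).filter((t) => t.length > 2);
  return chunks
    .map((chunk) => {
      const lower = chunk.toLowerCase();
      return { chunk, score: terms.filter((t) => lower.indexOf(t) !== -1).length };
    })
    .filter((entry) => entry.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, maxSnippets)
    .map((entry) => entry.chunk);
}

const log = "# Setup\ninstall deps\n# Failures\ntest_login failed with stack frame at auth.ts:42\n# Summary\n2 passed";
const snippets = keywordScan(chunkByStructure(log), "the failing test name and stack frame");
console.log(snippets);
```
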
### Safety guards

Native tokenizers can choke on pathological input, so the compressor defends
against it:

- **Null-byte sanitization**: embedded `\0` bytes are stripped before insertion
  to avoid tokenizer crashes.
- **Per-chunk cap**: chunks are limited to **8 KB** (`MAX_CHUNK_BYTES`) to avoid
  stressing the porter tokenizer on inputs with no word boundaries (base64 blobs,
  minified JS, double-serialized JSON).
- **Index ceiling**: outputs above **2 MB** (`MAX_INDEX_BYTES`) skip FTS5
  entirely and fall back to truncation, since the native tokenizer can segfault on
  very large unbroken inputs.
- **Null/exotic results**: `null`/`undefined` results collapse to an empty
  string; BigInt and circular values are coerced safely rather than producing
  `"[object Object]"`.

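The guards above can be sketched as a few small helpers. This is an illustration under assumptions rather than the compressor's code: the helper names are hypothetical, the chunk cap counts UTF-16 code units as a stand-in for bytes, and the `"[Circular]"` placeholder is an invented marker. Only the constant names and their values come from this guide.

```typescript
const MAX_CHUNK_BYTES = 8 * 1024;
const MAX_INDEX_BYTES = 2 * 1024 * 1024;

// Remove embedded NUL characters before text reaches a native tokenizer.
function stripNulls(text: string): string {
  return text.replace(/\0/g, "");
}

// Cap a chunk; length is measured in UTF-16 code units, an approximation of bytes.
function capChunk(chunk: string): string {
  return chunk.length > MAX_CHUNK_BYTES ? chunk.slice(0, MAX_CHUNK_BYTES) : chunk;
}

// Only outputs at or below the ceiling are eligible for indexing.
function canIndex(text: string): boolean {
  return new TextEncoder().encode(text).length <= MAX_INDEX_BYTES;
}

// Coerce any tool result to text without throwing or yielding "[object Object]".
function toText(result: unknown): string {
  if (result === null || result === undefined) return "";
  if (typeof result === "string") return result;
  const seen = new Set<object>();
  try {
    const json = JSON.stringify(result, (_key, value) => {
      if (typeof value === "bigint") return value.toString();
      if (typeof value === "object" && value !== null) {
        // Any object seen before is replaced; this also flags repeated
        // (non-circular) references, which is acceptable for a sketch.
        if (seen.has(value)) return "[Circular]";
        seen.add(value);
      }
      return value;
    });
    return json ?? String(result);
  } catch {
    return String(result);
  }
}

const loop: Record<string, unknown> = { name: "loop" };
loop.self = loop;
console.log(toText(loop));
console.log(stripNulls("a\0b"));
```
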
### Supplying intent

The `execute_tool` proxy accepts an optional `intent` argument. From its schema
([`tools/execute.ts`](../../src/mcp-router/tools/execute.ts)):

> *Optional: describe what you are looking for in the output. For large results
> (>10KB), only matching sections are returned instead of the full output.*

So an agent calling a tool through the router can pass, e.g.,
`intent: "the failing test name and stack frame"` and receive only the matching
sections of an otherwise huge log.

## Modules

| Concern | Module |
|---------|--------|
| Stdio MCP server entrypoint | [`server.ts`](../../src/mcp-router/server.ts) |
| Tool discovery (search across servers) | [`tools/discover.ts`](../../src/mcp-router/tools/discover.ts) |
| Tool execution + output compression | [`tools/execute.ts`](../../src/mcp-router/tools/execute.ts) |
| Output compression engine | [`output-compressor.ts`](../../src/mcp-router/output-compressor.ts) |
| Per-session token-savings accounting | [`session-stats.ts`](../../src/mcp-router/session-stats.ts) |
| Config parsing (`mcp.json`) | [`config/parser.ts`](../../src/mcp-router/config/parser.ts) |
| Fuzzy tool search | [`search/fuzzy.ts`](../../src/mcp-router/search/fuzzy.ts) |

## The `uap mcp-router` CLI

Commands are defined in [`src/bin/cli.ts`](../../src/bin/cli.ts) and implemented
in [`src/cli/mcp-router.ts`](../../src/cli/mcp-router.ts).

### Start

Run the router as a stdio MCP server (this is what a harness launches).

```bash
uap mcp-router start [options]
```

| Option | Description |
|--------|-------------|
| `-c, --config <path>` | Path to an `mcp.json` config file |
| `-v, --verbose` | Enable verbose logging |

### Stats

Show servers, tools, and token savings for the session.

```bash
uap mcp-router stats [-c <path>] [-v] [--json]
```

### Discover

Find tools matching a query across all configured servers.

```bash
uap mcp-router discover -q "<query>" [options]
```

| Option | Default | Description |
|--------|---------|-------------|
| `-q, --query <query>` | — | Search query (required) |
| `-s, --server <server>` | — | Filter to a specific server |
| `-l, --limit <limit>` | `10` | Max results |
| `-c, --config <path>` | — | Path to `mcp.json` config file |
| `-v, --verbose` | — | Enable verbose logging |
| `--json` | — | Output as JSON |

### List

List the configured MCP servers.

```bash
uap mcp-router list [-c <path>] [--json]
```

## Enabling the router per harness

The router replaces a harness's individual MCP servers with a single `router`
entry that runs `uap mcp-router start`. The bundled installer wires this up for
all supported harnesses:

```bash
uap mcp-setup [--force] [--verbose]
```

This command ([`src/cli/setup-mcp-router.ts`](../../src/cli/setup-mcp-router.ts))
configures **Claude Code**, **Factory.AI**, **VSCode**, and **Cursor**. It writes
a `router` server into each harness's MCP config:

- **Claude Code**: `~/.claude/settings.json`
- **Factory.AI**: `~/.factory/mcp.json`
- **VSCode**: `~/.vscode/mcp.json`
- **Cursor**: `~/.cursor/settings.json`

The entry it installs looks like:

```json
{
  "mcpServers": {
    "router": {
      "command": "npx",
      "args": ["uap", "mcp-router", "start"],
      "description": "Unified MCP Router - routes all tool calls"
    }
  }
}
```

When a harness already has MCP servers configured, `mcp-setup` migrates them
behind the router, preserving the originals in a backup field. Pass `--force` to
skip the confirmation prompt. After setup it validates the install by running
`uap mcp-router list`.

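The migration step can be pictured as a pure function over a harness config. The sketch below is a guess at the shape of that transformation, not the code in `setup-mcp-router.ts`: `installRouter` is a hypothetical name, and the backup field name `_mcpServersBackup` is invented for illustration because this guide does not name the real field. The `router` entry itself is copied from the JSON above.

```typescript
interface ServerEntry {
  command: string;
  args?: string[];
  description?: string;
}

interface HarnessConfig {
  mcpServers?: Record<string, ServerEntry>;
  [key: string]: unknown;
}

const ROUTER_ENTRY: ServerEntry = {
  command: "npx",
  args: ["uap", "mcp-router", "start"],
  description: "Unified MCP Router - routes all tool calls",
};

const BACKUP_FIELD = "_mcpServersBackup"; // hypothetical field name

// Return a new config whose only MCP server is the router; any previously
// configured servers are copied into the backup field instead of being lost.
function installRouter(config: HarnessConfig): HarnessConfig {
  const existing = config.mcpServers ?? {};
  const others = Object.keys(existing).filter((name) => name !== "router");
  const next: HarnessConfig = { ...config, mcpServers: { router: ROUTER_ENTRY } };
  if (others.length > 0) {
    const backup: Record<string, ServerEntry> = {};
    for (const name of others) backup[name] = existing[name];
    next[BACKUP_FIELD] = backup;
  }
  return next;
}

const migrated = installRouter({
  theme: "dark",
  mcpServers: { github: { command: "github-mcp" } },
});
console.log(JSON.stringify(migrated, null, 2));
```

Returning a new object instead of mutating the input makes it easy to show a diff and ask for confirmation before writing, which is the step `--force` skips.
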
## The savings

When a tool returns a large, mostly irrelevant payload and the agent supplies an
intent, the router returns only the 3 best-matching snippets, **up to 98% fewer
tokens** than the raw output. Small outputs (≤5 KB) pass through untouched, so
there is no penalty for small results.
Per-session savings are tracked in
[`session-stats.ts`](../../src/mcp-router/session-stats.ts) and surfaced via
`uap mcp-router stats`.

See also the [Memory guide](./MEMORY.md) for reducing token spend on persistent
context rather than tool output.
@@ -0,0 +1,235 @@
# Memory System

> UAP v1.40.0

The Universal Agent Protocol gives agents a persistent, multi-tier memory so
that learnings survive across sessions, compactions, and even harness switches.
The system is implemented as 27 modules under [`src/memory/`](../../src/memory/)
and is driven from the `uap memory` CLI.

The design goal is **token efficiency**: instead of replaying entire transcripts
into the context window, agents write small, high-signal memories and retrieve
only the most relevant ones on demand via semantic search.

## The four tiers

Memory flows from a cheap, high-churn staging area down to a durable, searchable
archive. Each tier has a distinct cost/permanence trade-off.

| Tier | Name | Storage | Purpose | Module(s) |
|------|------|---------|---------|-----------|
| 0 | Daily log | SQLite | Staging area for raw writes; "log first, promote later" | [`daily-log.ts`](../../src/memory/daily-log.ts), [`short-term/sqlite.ts`](../../src/memory/short-term/sqlite.ts), [`short-term/schema.ts`](../../src/memory/short-term/schema.ts) |
| 1 | Working cache | In-process / SQLite | Hot context with decay; predictive prefetch | [`speculative-cache.ts`](../../src/memory/speculative-cache.ts), [`predictive-memory.ts`](../../src/memory/predictive-memory.ts) |
| 2 | Semantic | Qdrant vectors | Embedding-based recall over consolidated knowledge | [`serverless-qdrant.ts`](../../src/memory/serverless-qdrant.ts), [`embeddings.ts`](../../src/memory/embeddings.ts) |
| 3 | Long-term archive | Pluggable backends | Durable, auditable store of promoted learnings | [`backends/base.ts`](../../src/memory/backends/base.ts), [`backends/factory.ts`](../../src/memory/backends/factory.ts), [`backends/github.ts`](../../src/memory/backends/github.ts), [`backends/qdrant-cloud.ts`](../../src/memory/backends/qdrant-cloud.ts) |

### Tier 0: Daily log

Every observation an agent records lands first in the daily log, a SQLite-backed
staging table (`daily_log`). This follows a "log first, promote later" pattern:
writes are cheap and non-destructive, and a separate review step decides which
entries are worth keeping. Each entry carries a `suggestedTier` of either
`working` or `semantic` so promotion is guided rather than blind. See
[`daily-log.ts`](../../src/memory/daily-log.ts).

### Tier 1: Working cache

The working cache holds the hot context an agent is likely to need next. Entries
**decay** over time so the cache stays small and relevant, and a predictive layer
prefetches likely-needed memories. See
[`speculative-cache.ts`](../../src/memory/speculative-cache.ts) and
[`predictive-memory.ts`](../../src/memory/predictive-memory.ts).

### Tier 2: Semantic

Consolidated knowledge is embedded and stored as vectors in Qdrant. Recall is by
semantic similarity rather than exact match, so a query retrieves conceptually
related memories even when the wording differs. See
[`serverless-qdrant.ts`](../../src/memory/serverless-qdrant.ts).

### Tier 3: Long-term archive

The durable archive is backend-pluggable. The bundled backends are selected via
[`backends/factory.ts`](../../src/memory/backends/factory.ts):

- **Qdrant Cloud**: managed vector store ([`backends/qdrant-cloud.ts`](../../src/memory/backends/qdrant-cloud.ts))
- **GitHub**: version-controlled archive ([`backends/github.ts`](../../src/memory/backends/github.ts))

Backends share the common interface defined in [`backends/base.ts`](../../src/memory/backends/base.ts).

## Semantic recall

Tier 2/3 recall uses **real embeddings**, not placeholder hashes. The default
provider runs a local `nomic-embed-text-v2-moe` model via `llama-server` (or
Ollama) and produces **768-dimensional** vectors (Matryoshka embeddings,
truncatable to 256 dimensions). Running embeddings locally means semantic recall
incurs no per-query API cost. See [`embeddings.ts`](../../src/memory/embeddings.ts).

Queries return matches ranked by cosine similarity, filtered by a configurable
threshold (default `0.35`).

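Ranking by cosine similarity with a threshold, and truncating a Matryoshka-style vector, can both be shown in a few lines. The sketch below uses tiny hand-written vectors instead of real 768-dimensional embeddings; the function names and the `Memory` type are assumptions, and it does not reflect how `embeddings.ts` or Qdrant implement the search. Re-normalizing after truncation is the usual practice for Matryoshka embeddings and is assumed here.

```typescript
interface Memory {
  text: string;
  vector: number[];
}

// Scale a vector to unit length (zero vectors are returned unchanged).
function normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

// Keep the leading `dims` components and re-normalize.
function truncateEmbedding(v: number[], dims: number): number[] {
  return normalize(v.slice(0, dims));
}

function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  const na = normalize(a);
  const nb = normalize(b);
  return na.reduce((sum, x, i) => sum + x * nb[i], 0);
}

// Score every memory, drop those below the threshold, return the best first.
function recall(query: number[], memories: Memory[], threshold = 0.35, limit = 10) {
  return memories
    .map((m) => ({ text: m.text, score: cosine(query, m.vector) }))
    .filter((m) => m.score >= threshold)
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}

const hits = recall([1, 0], [
  { text: "a", vector: [1, 0] },
  { text: "b", vector: [0, 1] },
  { text: "c", vector: [1, 1] },
]);
console.log(hits.map((h) => h.text));
```

Here `b` is orthogonal to the query (similarity 0) and falls below the default threshold, while `c` scores about 0.71 and is kept.
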
## Write gates

Not every observation deserves to be a memory. The write gate
([`write-gate.ts`](../../src/memory/write-gate.ts)) scores incoming content and
**rejects low-value writes** before they consume storage or pollute recall.
Rejections include:

- Empty content
- Content too short to be a meaningful memory
- Content matching a **noise pattern** (acknowledgements, transient requests)

Content that records a **decision and its reasoning**, a durable **preference or
convention**, or other high-signal information passes the gate. Rejected writes
come back with a `rejectionReason` so the caller knows why.

The gate can be bypassed deliberately with `--force` on `uap memory store`.

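A rule-based approximation of the gate illustrates the three rejection cases. It is a sketch, not the logic of `write-gate.ts`: the real gate is described as scoring content, which this example does not attempt, and the minimum length of 20 characters and both regular expressions are invented for illustration. Only the `rejectionReason` field name and the force bypass come from this guide.

```typescript
interface GateDecision {
  accepted: boolean;
  rejectionReason?: string;
}

const MIN_LENGTH = 20; // hypothetical threshold

// Hypothetical noise patterns: bare acknowledgements and transient requests.
const NOISE_PATTERNS: RegExp[] = [
  /^(ok|okay|thanks|thank you|got it|sure)[.!]*$/i,
  /^(please\s+)?(run|show|list|open)\b/i,
];

// Decide whether content is worth storing; `force` skips every check.
function writeGate(content: string, force = false): GateDecision {
  if (force) return { accepted: true };
  const text = content.trim();
  if (text.length === 0) return { accepted: false, rejectionReason: "empty content" };
  if (NOISE_PATTERNS.some((pattern) => pattern.test(text))) {
    return { accepted: false, rejectionReason: "matches a noise pattern" };
  }
  if (text.length < MIN_LENGTH) return { accepted: false, rejectionReason: "too short" };
  return { accepted: true };
}

console.log(writeGate("thanks!"));
console.log(writeGate("Chose Qdrant over pgvector for HNSW recall speed"));
```
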
## Correction propagation

When a fact changes, you don't want stale copies lingering across tiers. The
correction propagator ([`correction-propagator.ts`](../../src/memory/correction-propagator.ts))
applies a correction **across all tiers** and marks the superseded entries with a
date and reason, preserving an **audit trail** in a `superseded_entries` table
rather than silently deleting. The result reports `tiersUpdated` and
`supersededCount`.

Trigger it from the CLI with `uap memory correct`.

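The supersede-and-audit idea can be modeled on in-memory arrays. This is a sketch under assumptions, not the propagator's implementation: tiers are plain arrays instead of SQLite and Qdrant stores, matching is a case-insensitive substring test, and `tiersUpdated` is modeled as a list of tier names even though the source only states that the field exists. The `audit` array stands in for the `superseded_entries` table.

```typescript
interface Entry {
  id: number;
  content: string;
}

// One audit row per replaced entry (stand-in for `superseded_entries`).
interface SupersededEntry {
  tier: string;
  entryId: number;
  oldContent: string;
  reason: string;
  supersededAt: string;
}

type Tiers = Record<string, Entry[]>;

// Replace every matching entry in every tier and record what was replaced.
function propagateCorrection(
  tiers: Tiers,
  search: string,
  correction: string,
  reason: string,
  now: Date = new Date(),
) {
  const audit: SupersededEntry[] = [];
  const tiersUpdated: string[] = [];
  const needle = search.toLowerCase();
  for (const tier of Object.keys(tiers)) {
    let touched = false;
    for (const entry of tiers[tier]) {
      if (entry.content.toLowerCase().indexOf(needle) === -1) continue;
      audit.push({
        tier,
        entryId: entry.id,
        oldContent: entry.content,
        reason,
        supersededAt: now.toISOString(),
      });
      entry.content = correction;
      touched = true;
    }
    if (touched) tiersUpdated.push(tier);
  }
  return { tiersUpdated, supersededCount: audit.length, audit };
}

const tiers: Tiers = {
  working: [{ id: 1, content: "Project uses pgvector for recall" }],
  semantic: [{ id: 7, content: "uses pgvector" }, { id: 8, content: "CI runs on Node 20" }],
  archive: [{ id: 9, content: "Deploys are batched" }],
};
const result = propagateCorrection(tiers, "uses pgvector", "uses Qdrant for semantic recall", "migrated in v1.26");
console.log(result.tiersUpdated, result.supersededCount);
```
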
## Supporting modules

Beyond the tiers, the system includes consolidation
([`memory-consolidator.ts`](../../src/memory/memory-consolidator.ts)), a
knowledge graph ([`knowledge-graph.ts`](../../src/memory/knowledge-graph.ts)),
task classification ([`task-classifier.ts`](../../src/memory/task-classifier.ts)),
dynamic retrieval ([`dynamic-retrieval.ts`](../../src/memory/dynamic-retrieval.ts)),
semantic compression
([`semantic-compression.ts`](../../src/memory/semantic-compression.ts)), and
scheduled maintenance
([`memory-maintenance.ts`](../../src/memory/memory-maintenance.ts)).

## The `uap memory` CLI

All commands are defined in [`src/bin/cli.ts`](../../src/bin/cli.ts) and
implemented in [`src/cli/memory.ts`](../../src/cli/memory.ts).

```bash
uap memory status   # Show memory system status
uap memory start    # Start memory services (Qdrant container)
uap memory stop     # Stop memory services
```

### Query

```bash
uap memory query <search> [options]
```

| Option | Default | Description |
|--------|---------|-------------|
| `-n, --limit <number>` | `10` | Max results |
| `-k, --top-k <number>` | `10` | Alias for `--limit` |
| `-t, --threshold <number>` | `0.35` | Minimum similarity score (0–1) |

```bash
uap memory query "qdrant connection retry" --limit 5 --threshold 0.5
```

### Store

```bash
uap memory store <content> [options]
```

Applies the write gate unless `--force` is passed.

| Option | Default | Description |
|--------|---------|-------------|
| `-t, --tags <tags>` | — | Comma-separated tags |
| `-i, --importance <number>` | `5` | Importance score (1–10) |
| `-f, --force` | — | Bypass the write gate (store without quality check) |

```bash
uap memory store "Chose Qdrant over pgvector for HNSW recall speed" \
  --tags architecture,memory --importance 8
```

### Prepopulate

Seed memory from existing project knowledge.

```bash
uap memory prepopulate [options]
```

| Option | Default | Description |
|--------|---------|-------------|
| `--docs` | — | Import from documentation only |
| `--git` | — | Import from git history only |
| `-n, --limit <number>` | `500` | Limit git commits to analyze |
| `--since <date>` | — | Only analyze commits since date (e.g. `2024-01-01`) |
| `-v, --verbose` | — | Show detailed output |

### Promote

Review daily-log (Tier 0) entries and promote significant ones into working or
semantic memory.

```bash
uap memory promote
```

### Correct

Find an existing memory and supersede it with a correction that propagates across
all tiers.

```bash
uap memory correct <search> [options]
```

| Option | Description |
|--------|-------------|
| `-c, --correction <text>` | The corrected content |
| `-r, --reason <reason>` | Reason for the correction |

```bash
uap memory correct "uses pgvector" \
  --correction "uses Qdrant for semantic recall" \
  --reason "migrated in v1.26"
```

### Maintain

Run scheduled maintenance: decay, prune stale entries, archive old ones, and
remove duplicates.

```bash
uap memory maintain [-v|--verbose]
```

## How agents use memory

The recommended decision loop (see the project `CLAUDE.md`) wires memory into
every task:

1. **READ** recent context with `uap memory query`.
2. **QUERY** long-term memory for related learnings (semantic search).
3. **ACT** on the task.
4. **RECORD** observations back to the daily log (`uap memory store`).
5. **PROMOTE** significant learnings to long-term memory (`uap memory promote`).

Corrections discovered along the way are pushed with `uap memory correct` so the
fix cascades across tiers.

## How it saves tokens

- Agents retrieve a handful of **relevant** memories instead of replaying whole
  transcripts into context.
- The **write gate** keeps storage and recall results free of noise, so each
  retrieved item carries signal.
- **Decay** and **maintenance** keep the working set small.
- **Local embeddings** make semantic recall free of per-query API cost.
- **Correction propagation** prevents stale duplicates from inflating results.

See also the [MCP Router guide](./MCP_ROUTER.md) for compressing tool *output*
before it reaches the model.