@miller-tech/uap 1.40.0 → 1.41.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (150) hide show
  1. package/README.md +109 -642
  2. package/dist/.tsbuildinfo +1 -1
  3. package/dist/cli/deliver-defaults.d.ts +23 -0
  4. package/dist/cli/deliver-defaults.d.ts.map +1 -0
  5. package/dist/cli/deliver-defaults.js +121 -0
  6. package/dist/cli/deliver-defaults.js.map +1 -0
  7. package/dist/cli/init.d.ts.map +1 -1
  8. package/dist/cli/init.js +29 -0
  9. package/dist/cli/init.js.map +1 -1
  10. package/dist/cli/setup.d.ts.map +1 -1
  11. package/dist/cli/setup.js +19 -0
  12. package/dist/cli/setup.js.map +1 -1
  13. package/dist/policies/policy-tools.d.ts +7 -0
  14. package/dist/policies/policy-tools.d.ts.map +1 -1
  15. package/dist/policies/policy-tools.js +24 -2
  16. package/dist/policies/policy-tools.js.map +1 -1
  17. package/docs/INDEX.md +48 -286
  18. package/docs/architecture/OVERVIEW.md +328 -0
  19. package/docs/architecture/PROTOCOL.md +204 -0
  20. package/docs/benchmarks/README.md +17 -192
  21. package/docs/getting-started/CONFIGURATION.md +237 -0
  22. package/docs/getting-started/INSTALLATION.md +125 -0
  23. package/docs/getting-started/QUICKSTART.md +115 -0
  24. package/docs/guides/COORDINATION.md +162 -0
  25. package/docs/guides/DELIVER.md +115 -0
  26. package/docs/guides/DEPLOY_BATCHING.md +212 -0
  27. package/docs/guides/DROIDS_AND_SKILLS.md +202 -0
  28. package/docs/guides/LOCAL_MODELS.md +148 -0
  29. package/docs/guides/MCP_ROUTER.md +195 -0
  30. package/docs/guides/MEMORY.md +235 -0
  31. package/docs/guides/MULTI_MODEL.md +223 -0
  32. package/docs/guides/POLICIES.md +190 -0
  33. package/docs/guides/WORKTREE_WORKFLOW.md +185 -0
  34. package/docs/integrations/MCP_ROUTER.md +147 -0
  35. package/docs/integrations/RTK.md +102 -0
  36. package/docs/reference/API.md +485 -0
  37. package/docs/reference/CLI.md +719 -0
  38. package/docs/reference/CONFIGURATION.md +90 -193
  39. package/docs/reference/DATABASE_SCHEMA.md +110 -344
  40. package/docs/reference/FEATURES.md +176 -472
  41. package/docs/reference/PATTERNS.md +102 -0
  42. package/docs/reference/PLATFORMS.md +83 -0
  43. package/package.json +3 -1
  44. package/src/policies/enforcers/7ebbc721-7540-4e9f-879a-770e0213a09b_architecture_review.py +101 -0
  45. package/src/policies/enforcers/__pycache__/_common.cpython-312.pyc +0 -0
  46. package/src/policies/enforcers/_common.py +100 -0
  47. package/src/policies/enforcers/artifact_hygiene.py +52 -0
  48. package/src/policies/enforcers/cluster_routing.py +63 -0
  49. package/src/policies/enforcers/codebase_read_before_plan.py +52 -0
  50. package/src/policies/enforcers/coord_overlap.py +81 -0
  51. package/src/policies/enforcers/delivery_enforcement.py +97 -0
  52. package/src/policies/enforcers/doc_live_over_report.py +50 -0
  53. package/src/policies/enforcers/expert_review_required.py +135 -0
  54. package/src/policies/enforcers/iac_parity.py +53 -0
  55. package/src/policies/enforcers/mcp_router_first.py +37 -0
  56. package/src/policies/enforcers/memory_before_plan.py +61 -0
  57. package/src/policies/enforcers/parallel_reads.py +50 -0
  58. package/src/policies/enforcers/rtk_wrap.py +44 -0
  59. package/src/policies/enforcers/schema_diff_gate.py +80 -0
  60. package/src/policies/enforcers/session_memory_write.py +52 -0
  61. package/src/policies/enforcers/task_required.py +131 -0
  62. package/src/policies/enforcers/test_gate.py +58 -0
  63. package/src/policies/enforcers/validate_plan_before_build.py +75 -0
  64. package/src/policies/enforcers/worktree_required.py +57 -0
  65. package/src/policies/schemas/policies/architecture-review.md +51 -0
  66. package/src/policies/schemas/policies/artifact-hygiene.md +29 -0
  67. package/src/policies/schemas/policies/cluster-routing.md +31 -0
  68. package/src/policies/schemas/policies/codebase-read-before-plan.md +30 -0
  69. package/src/policies/schemas/policies/coord-overlap.md +24 -0
  70. package/src/policies/schemas/policies/delivery-enforcement.md +45 -0
  71. package/src/policies/schemas/policies/doc-live-over-report.md +32 -0
  72. package/src/policies/schemas/policies/expert-review-required.md +60 -0
  73. package/src/policies/schemas/policies/iac-parity.md +31 -0
  74. package/src/policies/schemas/policies/mandatory-testing-deployment.md +147 -0
  75. package/src/policies/schemas/policies/mcp-router-first.md +24 -0
  76. package/src/policies/schemas/policies/memory-before-plan.md +24 -0
  77. package/src/policies/schemas/policies/merge-deploy-monitor-verify.md +145 -0
  78. package/src/policies/schemas/policies/parallel-reads.md +24 -0
  79. package/src/policies/schemas/policies/rtk-wrap.md +26 -0
  80. package/src/policies/schemas/policies/schema-diff-gate.md +30 -0
  81. package/src/policies/schemas/policies/session-memory-write.md +24 -0
  82. package/src/policies/schemas/policies/task-required.md +49 -0
  83. package/src/policies/schemas/policies/test-gate.md +24 -0
  84. package/src/policies/schemas/policies/validate-plan-before-build.md +28 -0
  85. package/src/policies/schemas/policies/worktree-required.md +28 -0
  86. package/templates/hooks/uap-policy-gate.sh +5 -0
  87. package/docs/AGENTS.md +0 -423
  88. package/docs/DOCUMENTATION_AUDIT_REPORT.md +0 -131
  89. package/docs/GETTING_STARTED.md +0 -288
  90. package/docs/PROJECT_ANALYSIS_REPORT.md +0 -510
  91. package/docs/architecture/COMPLETE_ARCHITECTURE.md +0 -748
  92. package/docs/architecture/EXPERT_STACK.md +0 -137
  93. package/docs/architecture/MULTI_MODEL.md +0 -224
  94. package/docs/architecture/PLATFORM_GATING.md +0 -68
  95. package/docs/architecture/SYSTEM_ANALYSIS.md +0 -334
  96. package/docs/architecture/UAP_COMPLIANCE.md +0 -217
  97. package/docs/architecture/UAP_PROTOCOL.md +0 -339
  98. package/docs/architecture/UAP_STRICT_DROIDS.md +0 -172
  99. package/docs/archive/BALLS_MODE_SELF_ANALYSIS.md +0 -260
  100. package/docs/archive/BENCHMARK_GAPS_AND_PLAN.md +0 -146
  101. package/docs/archive/FAILING_TASKS_SOLUTION_PLAN.md +0 -668
  102. package/docs/archive/JINJA2-SYSTEM-MESSAGE-FIX.md +0 -209
  103. package/docs/archive/MODEL_ROUTING_IMPLEMENTATION_SUMMARY.md +0 -281
  104. package/docs/archive/MODEL_ROUTING_OPTIMIZATION_PLAN.md +0 -320
  105. package/docs/archive/NPM-PUBLISH-V0.9.1.md +0 -240
  106. package/docs/archive/OPTIMIZATION_OPTIONS.md +0 -334
  107. package/docs/archive/PARALLELISM_GAPS_AND_OPTIONS.md +0 -422
  108. package/docs/archive/POLICY_GATE_IMPLEMENTATION.md +0 -245
  109. package/docs/archive/SETUP_IMPROVEMENTS.md +0 -213
  110. package/docs/archive/UAP_GENERIC_OPTIMIZATION_PLAN.md +0 -270
  111. package/docs/archive/UAP_OPTIMIZATION_PLAN.md +0 -701
  112. package/docs/archive/UAP_V103_PATTERN_DESIGN.md +0 -315
  113. package/docs/archive/UAP_V104_COMPLIANCE_DESIGN.md +0 -223
  114. package/docs/archive/changelog/2026-03-10_uap-100-compliance.md +0 -77
  115. package/docs/archive/changelog/2026-03-10_uap-full-system-verification.md +0 -109
  116. package/docs/archive/opencode-integration-guide.md +0 -740
  117. package/docs/archive/opencode-integration-quickref.md +0 -180
  118. package/docs/benchmarks/OVERNIGHT_RUNNER.md +0 -341
  119. package/docs/benchmarks/SPECULATIVE_DECODING_JOURNEY_2026-03.md +0 -221
  120. package/docs/benchmarks/VALIDATION_PLAN.md +0 -568
  121. package/docs/blog/SPECULATIVE_DECODING_PRODUCTION_PLAYBOOK.md +0 -139
  122. package/docs/blog/local-coding-agents.md +0 -266
  123. package/docs/blog/x-thread.md +0 -254
  124. package/docs/deployment/DEPLOYMENT.md +0 -895
  125. package/docs/deployment/DEPLOYMENT_STRATEGIES.md +0 -518
  126. package/docs/deployment/DEPLOY_BATCHER_ANALYSIS.md +0 -224
  127. package/docs/deployment/DEPLOY_BATCHING.md +0 -273
  128. package/docs/deployment/DEPLOY_BUCKETING_ANALYSIS.md +0 -420
  129. package/docs/deployment/QWEN35_LLAMA_CPP.md +0 -426
  130. package/docs/deployment/UAP_LLAMA_ANTHROPIC_PROXY_BOOTSTRAP.md +0 -279
  131. package/docs/getting-started/INTEGRATION.md +0 -628
  132. package/docs/getting-started/OVERVIEW.md +0 -324
  133. package/docs/getting-started/SETUP.md +0 -377
  134. package/docs/integrations/MCP_ROUTER_SETUP.md +0 -445
  135. package/docs/integrations/RTK_INTEGRATION.md +0 -468
  136. package/docs/operations/TROUBLESHOOTING.md +0 -660
  137. package/docs/pr/PR_SPECULATIVE_DOCS_TEMPLATE.md +0 -146
  138. package/docs/pr/UPSTREAM_PRS.md +0 -424
  139. package/docs/reference/API_REFERENCE.md +0 -903
  140. package/docs/reference/EXPERT_DROIDS.md +0 -219
  141. package/docs/reference/HARNESS-MATRIX.md +0 -318
  142. package/docs/reference/PATTERN_LIBRARY.md +0 -636
  143. package/docs/reference/UAP_CLI_REFERENCE.md +0 -620
  144. package/docs/research/BEHAVIORAL_PATTERNS.md +0 -228
  145. package/docs/research/DOMAIN_STRATEGIES.md +0 -316
  146. package/docs/research/MEMORY_SYSTEMS_COMPARISON.md +0 -812
  147. package/docs/research/PATTERN_ANALYSIS_2026-01-18.md +0 -436
  148. package/docs/research/PERFORMANCE_ANALYSIS_2026-01-18.md +0 -209
  149. package/docs/research/PERFORMANCE_TEST_PLAN.md +0 -383
  150. package/docs/research/TERMINAL_BENCH_LEARNINGS.md +0 -217
@@ -0,0 +1,148 @@
1
+ # Running UAP Against Local Models
2
+
3
+ > UAP v1.40.0
4
+
5
+ UAP can drive its coding/convergence loop against **local models** served by
6
+ [llama.cpp](https://github.com/ggml-org/llama.cpp) instead of a hosted API.
7
+ This keeps inference on your own hardware (zero per-token cost) and works with
8
+ quantized open-weight models such as Qwen 3.x.
9
+
10
+ There are two endpoint shapes involved, and it matters which client speaks
11
+ which protocol:
12
+
13
+ - **`uap deliver`** talks the **OpenAI-compatible** Chat Completions API
14
+ (`POST /v1/chat/completions`). It points directly at llama.cpp's OpenAI
15
+ endpoint (commonly `:8080/v1`) — no translation proxy is needed for UAP's own
16
+ loop. See
17
+ [`src/models/openai-compat-client.ts`](../../src/models/openai-compat-client.ts).
18
+ - **Anthropic-protocol clients** (e.g. Claude Code) speak the **Anthropic
19
+ Messages API**. When llama.cpp serves the Anthropic protocol natively, those
20
+ clients can point straight at the local port. Where native Anthropic serving
21
+ is not available, the bundled `uap-anthropic-proxy` translates Anthropic
22
+ requests into OpenAI requests for llama.cpp. Prefer the direct, native path
23
+ when you have it.
24
+
25
+ ## The model presets
26
+
27
+ `uap deliver` selects a model by **preset id**. Presets are defined in
28
+ [`src/models/types.ts`](../../src/models/types.ts) (`ModelPresets`). The default
29
+ local preset is `qwen35-a3b`:
30
+
31
+ ```jsonc
32
+ // qwen35-a3b (excerpt)
33
+ {
34
+ "provider": "custom",
35
+ "apiModel": "qwen35-a3b-iq4xs",
36
+ "endpoint": "http://192.168.1.165:8080/v1", // llama.cpp OpenAI endpoint
37
+ "maxContextTokens": 262144
38
+ }
39
+ ```
40
+
41
+ The endpoint is the llama.cpp server's OpenAI-compatible base. Adjust it for
42
+ your host (for example `http://localhost:8080/v1`) by overriding the endpoint
43
+ (see below) or editing the preset.
44
+
45
+ > The exact host/IP in the shipped preset is environment-specific. Point it at
46
+ > wherever your llama.cpp server is listening.
47
+
48
+ ## Serving a local model
49
+
50
+ Start llama.cpp's `llama-server` listening on an OpenAI-compatible port
51
+ (`:8080` by default in the helper scripts). A continuity helper for serving is
52
+ included at
53
+ [`scripts/run-llama-server-continuity.sh`](../../scripts/run-llama-server-continuity.sh);
54
+ it wraps `llama-server` with `--host`, `--port` (default `8080`), `--model`,
55
+ and an optional `--chat-template-file`. Models with a custom chat format need
56
+ the correct template applied.
57
+
58
+ UAP ships several helpers around local serving (registered as bins in
59
+ `package.json`):
60
+
61
+ - **`llama-optimize`** — generates optimal `llama.cpp` startup parameters
62
+ (quantization profile, KV-cache quant, flash attention, speculative
63
+ decoding, etc.) for Qwen 3.x-class models on 16GB/24GB VRAM. Source:
64
+ [`src/bin/llama-server-optimize.ts`](../../src/bin/llama-server-optimize.ts).
65
+ - **`uap-template-verify`** — model-agnostic chat-template finder/verifier;
66
+ validates Jinja2 syntax, renders test data, and checks tool-call format
67
+ support. Source:
68
+ [`tools/agents/scripts/chat_template_verifier.py`](../../tools/agents/scripts/chat_template_verifier.py).
69
+ - **`uap-anthropic-proxy`** — Anthropic→OpenAI translation proxy for clients
70
+ that only speak the Anthropic Messages API. Source:
71
+ [`tools/agents/scripts/anthropic_proxy.py`](../../tools/agents/scripts/anthropic_proxy.py).
72
+ It reads `LLAMA_CPP_BASE` (upstream OpenAI endpoint) and `PROXY_PORT` from the
73
+ environment. Use this only when a client can't reach a native Anthropic
74
+ endpoint.
75
+
76
+ ## Running the convergence loop locally
77
+
78
+ `uap deliver` iterates a model through execute → apply → verify → feedback until
79
+ real completion gates pass. To run it against a local model, pass the preset:
80
+
81
+ ```bash
82
+ uap deliver "fix the failing build" --model qwen35-a3b
83
+ ```
84
+
85
+ Because `qwen35-a3b` is the default preset, you can also omit `--model` (or set
86
+ `UAP_DELIVER_MODEL`):
87
+
88
+ ```bash
89
+ export UAP_DELIVER_MODEL=qwen35-a3b
90
+ uap deliver "add input validation to the parser"
91
+ ```
92
+
93
+ ### Pointing at your own server
94
+
95
+ Override the endpoint without touching the preset:
96
+
97
+ ```bash
98
+ uap deliver "refactor the auth module" \
99
+ --model qwen35-a3b \
100
+ --endpoint http://localhost:8080/v1
101
+ ```
102
+
103
+ The endpoint must be an OpenAI-compatible `/v1` base. If no endpoint is set on
104
+ the preset or the flag, the client falls back to `UAP_INFERENCE_ENDPOINT`, then
105
+ to `http://localhost:4000/v1`.
106
+
107
+ ### Useful `uap deliver` options
108
+
109
+ | Option | Meaning |
110
+ | ------------------- | ------- |
111
+ | `-m, --model <preset>` | Model preset id (default `$UAP_DELIVER_MODEL` or `qwen35-a3b`) |
112
+ | `--endpoint <url>` | Override the model endpoint (OpenAI-compatible `/v1`) |
113
+ | `--temperature <t>` | Sampling temperature (default: execution-profile value) |
114
+ | `--max-turns <n>` | Maximum execute→verify iterations (default 5) |
115
+ | `--gates <ids>` | Restrict to a subset of gates (`build,typecheck,test,lint`) |
116
+ | `--escalate` / `--escalate-model <preset>` | On stagnation, escalate to a stronger model preset (default `$UAP_ESCALATE_MODEL`) |
117
+ | `--coordinate` | Register the run with the coordination layer (announce, heartbeat, overlap detection) |
118
+ | `--deploy` | On success, queue a commit of applied files into the deploy batcher |
119
+ | `--dry-run` | Show detected gates and plan without calling the model |
120
+
121
+ A common local pattern is a cheap local executor that escalates to a hosted
122
+ model only when it stalls:
123
+
124
+ ```bash
125
+ uap deliver "implement the retry logic" \
126
+ --model qwen35-a3b \
127
+ --escalate --escalate-model sonnet-4.6
128
+ ```
129
+
130
+ ## Connecting Claude Code (or other Anthropic clients)
131
+
132
+ If your llama.cpp server exposes the Anthropic Messages API natively, point the
133
+ Anthropic client at that local port directly — this is the preferred path.
134
+
135
+ Otherwise, run the translation proxy in front of llama.cpp:
136
+
137
+ ```bash
138
+ LLAMA_CPP_BASE=http://localhost:8080/v1 PROXY_PORT=4000 uap-anthropic-proxy
139
+ ```
140
+
141
+ The client then talks the Anthropic protocol to the proxy, which forwards
142
+ OpenAI requests to llama.cpp. The proxy supports streaming and tool-call
143
+ translation.
144
+
145
+ ## Related
146
+
147
+ - [Deploy Batching](./DEPLOY_BATCHING.md) — what `uap deliver --deploy` queues.
148
+ - [Coordination](./COORDINATION.md) — what `uap deliver --coordinate` registers.
@@ -0,0 +1,195 @@
1
+ # MCP Router
2
+
3
+ > UAP v1.40.0
4
+
5
+ The MCP Router is a token-optimizing proxy that sits between an AI harness and
6
+ its MCP tool servers. It is implemented as 11 modules under
7
+ [`src/mcp-router/`](../../src/mcp-router/) and exposed through the
8
+ `uap mcp-router` CLI.
9
+
10
+ ## The problem: tool-output token bloat
11
+
12
+ When an agent calls an MCP tool, the *entire* tool result is injected into the
13
+ model's context window. A single `read_file`, search, or API call can return tens
14
+ of kilobytes of mostly-irrelevant text. Across a session this dominates token
15
+ spend and crowds out useful context.
16
+
17
+ The router solves this by **compressing tool output before it reaches the model**,
18
+ returning only the parts the agent actually needs. In practice this yields **up
19
+ to 98% token reduction** on large outputs.
20
+
21
+ ## Compression strategy
22
+
23
+ Every tool result passes through the output compressor
24
+ ([`output-compressor.ts`](../../src/mcp-router/output-compressor.ts)), which picks
25
+ a strategy based on output size and whether the call supplied an *intent*:
26
+
27
+ | Output size | Strategy | Method |
28
+ |-------------|----------|--------|
29
+ | ≤ 5 KB | **Pass through** unchanged | `passthrough` |
30
+ | 5–10 KB | **Head + tail** smart truncation | `truncated` |
31
+ | ≥ 10 KB **with intent** | **FTS5 index-then-search** — return only matching snippets | `indexed` |
32
+ | ≥ 10 KB without intent | Head + tail smart truncation | `truncated` |
33
+
34
+ The exact thresholds are `5120` bytes (truncation) and `10240` bytes
35
+ (auto-indexing).
36
+
37
+ ### Intent-driven FTS5 search
38
+
39
+ When an output is large and the agent describes what it is looking for (an
40
+ `intent`), the compressor:
41
+
42
+ 1. **Chunks** the content by structure — markdown headings first, then
43
+ blank-line paragraphs, then fixed-size line blocks as a fallback.
44
+ 2. Builds an **in-memory SQLite FTS5** virtual table (`porter` tokenizer) over
45
+ the chunks.
46
+ 3. Runs the intent as a **BM25-ranked** full-text query and returns up to
47
+ **3 matching snippets** (`MAX_SNIPPETS`).
48
+ 4. Appends a short list of searchable vocabulary terms so the agent can refine a
49
+ follow-up query.
50
+
51
+ If FTS5 returns nothing, it falls back to a keyword scan of the chunks, and
52
+ finally to plain truncation.
53
+
54
+ ### Safety guards
55
+
56
+ Native tokenizers can choke on pathological input, so the compressor defends
57
+ against it:
58
+
59
+ - **Null-byte sanitization** — embedded `\0` bytes are stripped before insertion
60
+ to avoid tokenizer crashes.
61
+ - **Per-chunk cap** — chunks are limited to **8 KB** (`MAX_CHUNK_BYTES`) to avoid
62
+ stressing the porter tokenizer on inputs with no word boundaries (base64 blobs,
63
+ minified JS, double-serialized JSON).
64
+ - **Index ceiling** — outputs above **2 MB** (`MAX_INDEX_BYTES`) skip FTS5
65
+ entirely and fall back to truncation, since the native tokenizer can segfault on
66
+ very large unbroken inputs.
67
+ - **Null/exotic results** — `null`/`undefined` results collapse to an empty
68
+ string; BigInt and circular values are coerced safely rather than producing
69
+ `"[object Object]"`.
70
+
71
+ ### Supplying intent
72
+
73
+ The `execute_tool` proxy accepts an optional `intent` argument. From its schema
74
+ ([`tools/execute.ts`](../../src/mcp-router/tools/execute.ts)):
75
+
76
+ > *Optional: describe what you are looking for in the output. For large results
77
+ > (>10KB), only matching sections are returned instead of the full output.*
78
+
79
+ So an agent calling a tool through the router can pass, e.g.,
80
+ `intent: "the failing test name and stack frame"` and receive only the matching
81
+ sections of an otherwise huge log.
82
+
83
+ ## Modules
84
+
85
+ | Concern | Module |
86
+ |---------|--------|
87
+ | Stdio MCP server entrypoint | [`server.ts`](../../src/mcp-router/server.ts) |
88
+ | Tool discovery (search across servers) | [`tools/discover.ts`](../../src/mcp-router/tools/discover.ts) |
89
+ | Tool execution + output compression | [`tools/execute.ts`](../../src/mcp-router/tools/execute.ts) |
90
+ | Output compression engine | [`output-compressor.ts`](../../src/mcp-router/output-compressor.ts) |
91
+ | Per-session token-savings accounting | [`session-stats.ts`](../../src/mcp-router/session-stats.ts) |
92
+ | Config parsing (`mcp.json`) | [`config/parser.ts`](../../src/mcp-router/config/parser.ts) |
93
+ | Fuzzy tool search | [`search/fuzzy.ts`](../../src/mcp-router/search/fuzzy.ts) |
94
+
95
+ ## The `uap mcp-router` CLI
96
+
97
+ Commands are defined in [`src/bin/cli.ts`](../../src/bin/cli.ts) and implemented
98
+ in [`src/cli/mcp-router.ts`](../../src/cli/mcp-router.ts).
99
+
100
+ ### Start
101
+
102
+ Run the router as a stdio MCP server (this is what a harness launches).
103
+
104
+ ```bash
105
+ uap mcp-router start [options]
106
+ ```
107
+
108
+ | Option | Description |
109
+ |--------|-------------|
110
+ | `-c, --config <path>` | Path to an `mcp.json` config file |
111
+ | `-v, --verbose` | Enable verbose logging |
112
+
113
+ ### Stats
114
+
115
+ Show servers, tools, and token savings for the session.
116
+
117
+ ```bash
118
+ uap mcp-router stats [-c <path>] [-v] [--json]
119
+ ```
120
+
121
+ ### Discover
122
+
123
+ Find tools matching a query across all configured servers.
124
+
125
+ ```bash
126
+ uap mcp-router discover -q "<query>" [options]
127
+ ```
128
+
129
+ | Option | Default | Description |
130
+ |--------|---------|-------------|
131
+ | `-q, --query <query>` | — | Search query (required) |
132
+ | `-s, --server <server>` | — | Filter to a specific server |
133
+ | `-l, --limit <limit>` | `10` | Max results |
134
+ | `-c, --config <path>` | — | Path to `mcp.json` config file |
135
+ | `-v, --verbose` | — | Enable verbose logging |
136
+ | `--json` | — | Output as JSON |
137
+
138
+ ### List
139
+
140
+ List the configured MCP servers.
141
+
142
+ ```bash
143
+ uap mcp-router list [-c <path>] [--json]
144
+ ```
145
+
146
+ ## Enabling the router per harness
147
+
148
+ The router replaces a harness's individual MCP servers with a single `router`
149
+ entry that runs `uap mcp-router start`. The bundled installer wires this up for
150
+ all supported harnesses:
151
+
152
+ ```bash
153
+ uap mcp-setup [--force] [--verbose]
154
+ ```
155
+
156
+ This command ([`src/cli/setup-mcp-router.ts`](../../src/cli/setup-mcp-router.ts))
157
+ configures **Claude Code**, **Factory.AI**, **VSCode**, and **Cursor**. It writes
158
+ a `router` server into each harness's MCP config:
159
+
160
+ - **Claude Code** — `~/.claude/settings.json`
161
+ - **Factory.AI** — `~/.factory/mcp.json`
162
+ - **VSCode** — `~/.vscode/mcp.json`
163
+ - **Cursor** — `~/.cursor/settings.json`
164
+
165
+ The entry it installs looks like:
166
+
167
+ ```json
168
+ {
169
+ "mcpServers": {
170
+ "router": {
171
+ "command": "npx",
172
+ "args": ["uap", "mcp-router", "start"],
173
+ "description": "Unified MCP Router - routes all tool calls"
174
+ }
175
+ }
176
+ }
177
+ ```
178
+
179
+ When a harness already has MCP servers configured, `mcp-setup` migrates them
180
+ behind the router (preserving the originals in a backup field) — pass `--force` to
181
+ skip the confirmation prompt. After setup it validates the install by running
182
+ `uap mcp-router list`.
183
+
184
+ ## The savings
185
+
186
+ For the common case where a tool returns a large, mostly-irrelevant payload and
187
+ the agent supplies an intent, the router returns only the 3 best-matching
188
+ snippets — **up to 98% fewer tokens** than the raw output. Small outputs (≤5 KB)
189
+ pass through untouched, so there is no penalty for the common small-result case.
190
+ Per-session savings are tracked in
191
+ [`session-stats.ts`](../../src/mcp-router/session-stats.ts) and surfaced via
192
+ `uap mcp-router stats`.
193
+
194
+ See also the [Memory guide](./MEMORY.md) for reducing token spend on persistent
195
+ context rather than tool output.
@@ -0,0 +1,235 @@
1
+ # Memory System
2
+
3
+ > UAP v1.40.0
4
+
5
+ The Universal Agent Protocol gives agents a persistent, multi-tier memory so
6
+ that learnings survive across sessions, compactions, and even harness switches.
7
+ The system is implemented as 27 modules under [`src/memory/`](../../src/memory/)
8
+ and is driven from the `uap memory` CLI.
9
+
10
+ The design goal is **token efficiency**: instead of replaying entire transcripts
11
+ into the context window, agents write small, high-signal memories and retrieve
12
+ only the most relevant ones on demand via semantic search.
13
+
14
+ ## The four tiers
15
+
16
+ Memory flows from a cheap, high-churn staging area down to a durable, searchable
17
+ archive. Each tier has a distinct cost/permanence trade-off.
18
+
19
+ | Tier | Name | Storage | Purpose | Module(s) |
20
+ |------|------|---------|---------|-----------|
21
+ | 0 | Daily log | SQLite | Staging area for raw writes; "log first, promote later" | [`daily-log.ts`](../../src/memory/daily-log.ts), [`short-term/sqlite.ts`](../../src/memory/short-term/sqlite.ts), [`short-term/schema.ts`](../../src/memory/short-term/schema.ts) |
22
+ | 1 | Working cache | In-process / SQLite | Hot context with decay; predictive prefetch | [`speculative-cache.ts`](../../src/memory/speculative-cache.ts), [`predictive-memory.ts`](../../src/memory/predictive-memory.ts) |
23
+ | 2 | Semantic | Qdrant vectors | Embedding-based recall over consolidated knowledge | [`serverless-qdrant.ts`](../../src/memory/serverless-qdrant.ts), [`embeddings.ts`](../../src/memory/embeddings.ts) |
24
+ | 3 | Long-term archive | Pluggable backends | Durable, auditable store of promoted learnings | [`backends/base.ts`](../../src/memory/backends/base.ts), [`backends/factory.ts`](../../src/memory/backends/factory.ts), [`backends/github.ts`](../../src/memory/backends/github.ts), [`backends/qdrant-cloud.ts`](../../src/memory/backends/qdrant-cloud.ts) |
25
+
26
+ ### Tier 0 — Daily log
27
+
28
+ Every observation an agent records lands first in the daily log, a SQLite-backed
29
+ staging table (`daily_log`). This follows a "log first, promote later" pattern:
30
+ writes are cheap and non-destructive, and a separate review step decides which
31
+ entries are worth keeping. Each entry carries a `suggestedTier` of either
32
+ `working` or `semantic` so promotion is guided rather than blind. See
33
+ [`daily-log.ts`](../../src/memory/daily-log.ts).
34
+
35
+ ### Tier 1 — Working cache
36
+
37
+ The working cache holds the hot context an agent is likely to need next. Entries
38
+ **decay** over time so the cache stays small and relevant, and a predictive layer
39
+ prefetches likely-needed memories. See
40
+ [`speculative-cache.ts`](../../src/memory/speculative-cache.ts) and
41
+ [`predictive-memory.ts`](../../src/memory/predictive-memory.ts).
42
+
43
+ ### Tier 2 — Semantic
44
+
45
+ Consolidated knowledge is embedded and stored as vectors in Qdrant. Recall is by
46
+ semantic similarity rather than exact match, so a query retrieves conceptually
47
+ related memories even when the wording differs. See
48
+ [`serverless-qdrant.ts`](../../src/memory/serverless-qdrant.ts).
49
+
50
+ ### Tier 3 — Long-term archive
51
+
52
+ The durable archive is backend-pluggable. The bundled backends are selected via
53
+ [`backends/factory.ts`](../../src/memory/backends/factory.ts):
54
+
55
+ - **Qdrant Cloud** — managed vector store ([`backends/qdrant-cloud.ts`](../../src/memory/backends/qdrant-cloud.ts))
56
+ - **GitHub** — version-controlled archive ([`backends/github.ts`](../../src/memory/backends/github.ts))
57
+ - A common interface defined in [`backends/base.ts`](../../src/memory/backends/base.ts)
58
+
59
+ ## Semantic recall
60
+
61
+ Tier 2/3 recall uses **real embeddings**, not placeholder hashes. The default
62
+ provider runs a local `nomic-embed-text-v2-moe` model via `llama-server` (or
63
+ Ollama) and produces **768-dimensional** vectors (Matryoshka — truncatable to
64
+ 256). Running embeddings locally means semantic recall incurs no per-query API
65
+ cost. See [`embeddings.ts`](../../src/memory/embeddings.ts).
66
+
67
+ Queries return matches ranked by cosine similarity, filtered by a configurable
68
+ threshold (default `0.35`).
69
+
70
+ ## Write gates
71
+
72
+ Not every observation deserves to be a memory. The write gate
73
+ ([`write-gate.ts`](../../src/memory/write-gate.ts)) scores incoming content and
74
+ **rejects low-value writes** before they consume storage or pollute recall.
75
+ Rejections include:
76
+
77
+ - Empty content
78
+ - Content too short to be a meaningful memory
79
+ - Content matching a **noise pattern** (acknowledgements, transient requests)
80
+
81
+ Content that records a **decision and its reasoning**, a durable **preference or
82
+ convention**, or other high-signal information passes the gate. Rejected writes
83
+ come back with a `rejectionReason` so the caller knows why.
84
+
85
+ The gate can be bypassed deliberately with `--force` on `uap memory store`.
86
+
87
+ ## Correction propagation
88
+
89
+ When a fact changes, you don't want stale copies lingering across tiers. The
90
+ correction propagator ([`correction-propagator.ts`](../../src/memory/correction-propagator.ts))
91
+ applies a correction **across all tiers** and marks the superseded entries with a
92
+ date and reason, preserving an **audit trail** in a `superseded_entries` table
93
+ rather than silently deleting. The result reports `tiersUpdated` and
94
+ `supersededCount`.
95
+
96
+ Trigger it from the CLI with `uap memory correct`.
97
+
98
+ ## Supporting modules
99
+
100
+ Beyond the tiers, the system includes consolidation
101
+ ([`memory-consolidator.ts`](../../src/memory/memory-consolidator.ts)), a
102
+ knowledge graph ([`knowledge-graph.ts`](../../src/memory/knowledge-graph.ts)),
103
+ task classification ([`task-classifier.ts`](../../src/memory/task-classifier.ts)),
104
+ dynamic retrieval ([`dynamic-retrieval.ts`](../../src/memory/dynamic-retrieval.ts)),
105
+ semantic compression
106
+ ([`semantic-compression.ts`](../../src/memory/semantic-compression.ts)), and
107
+ scheduled maintenance
108
+ ([`memory-maintenance.ts`](../../src/memory/memory-maintenance.ts)).
109
+
110
+ ## The `uap memory` CLI
111
+
112
+ All commands are defined in [`src/bin/cli.ts`](../../src/bin/cli.ts) and
113
+ implemented in [`src/cli/memory.ts`](../../src/cli/memory.ts).
114
+
115
+ ```bash
116
+ uap memory status # Show memory system status
117
+ uap memory start # Start memory services (Qdrant container)
118
+ uap memory stop # Stop memory services
119
+ ```
120
+
121
+ ### Query
122
+
123
+ ```bash
124
+ uap memory query <search> [options]
125
+ ```
126
+
127
+ | Option | Default | Description |
128
+ |--------|---------|-------------|
129
+ | `-n, --limit <number>` | `10` | Max results |
130
+ | `-k, --top-k <number>` | `10` | Alias for `--limit` |
131
+ | `-t, --threshold <number>` | `0.35` | Minimum similarity score (0–1) |
132
+
133
+ ```bash
134
+ uap memory query "qdrant connection retry" --limit 5 --threshold 0.5
135
+ ```
136
+
137
+ ### Store
138
+
139
+ ```bash
140
+ uap memory store <content> [options]
141
+ ```
142
+
143
+ Applies the write gate unless `--force` is passed.
144
+
145
+ | Option | Default | Description |
146
+ |--------|---------|-------------|
147
+ | `-t, --tags <tags>` | — | Comma-separated tags |
148
+ | `-i, --importance <number>` | `5` | Importance score (1–10) |
149
+ | `-f, --force` | — | Bypass the write gate (store without quality check) |
150
+
151
+ ```bash
152
+ uap memory store "Chose Qdrant over pgvector for HNSW recall speed" \
153
+ --tags architecture,memory --importance 8
154
+ ```
155
+
156
+ ### Prepopulate
157
+
158
+ Seed memory from existing project knowledge.
159
+
160
+ ```bash
161
+ uap memory prepopulate [options]
162
+ ```
163
+
164
+ | Option | Default | Description |
165
+ |--------|---------|-------------|
166
+ | `--docs` | — | Import from documentation only |
167
+ | `--git` | — | Import from git history only |
168
+ | `-n, --limit <number>` | `500` | Limit git commits to analyze |
169
+ | `--since <date>` | — | Only analyze commits since date (e.g. `2024-01-01`) |
170
+ | `-v, --verbose` | — | Show detailed output |
171
+
172
+ ### Promote
173
+
174
+ Review daily-log (Tier 0) entries and promote significant ones into working or
175
+ semantic memory.
176
+
177
+ ```bash
178
+ uap memory promote
179
+ ```
180
+
181
+ ### Correct
182
+
183
+ Find an existing memory and supersede it with a correction that propagates across
184
+ all tiers.
185
+
186
+ ```bash
187
+ uap memory correct <search> [options]
188
+ ```
189
+
190
+ | Option | Description |
191
+ |--------|-------------|
192
+ | `-c, --correction <text>` | The corrected content |
193
+ | `-r, --reason <reason>` | Reason for the correction |
194
+
195
+ ```bash
196
+ uap memory correct "uses pgvector" \
197
+ --correction "uses Qdrant for semantic recall" \
198
+ --reason "migrated in v1.26"
199
+ ```
200
+
201
+ ### Maintain
202
+
203
+ Run scheduled maintenance: decay, prune stale entries, archive old ones, and
204
+ remove duplicates.
205
+
206
+ ```bash
207
+ uap memory maintain [-v|--verbose]
208
+ ```
209
+
210
+ ## How agents use memory
211
+
212
+ The recommended decision loop (see the project `CLAUDE.md`) wires memory into
213
+ every task:
214
+
215
+ 1. **READ** recent context with `uap memory query`.
216
+ 2. **QUERY** long-term memory for related learnings (semantic search).
217
+ 3. **ACT** on the task.
218
+ 4. **RECORD** observations back to the daily log (`uap memory store`).
219
+ 5. **PROMOTE** significant learnings to long-term memory (`uap memory promote`).
220
+
221
+ Corrections discovered along the way are pushed with `uap memory correct` so the
222
+ fix cascades across tiers.
223
+
224
+ ## How it saves tokens
225
+
226
+ - Agents retrieve a handful of **relevant** memories instead of replaying whole
227
+ transcripts into context.
228
+ - The **write gate** keeps storage and recall results free of noise, so each
229
+ retrieved item carries signal.
230
+ - **Decay** and **maintenance** keep the working set small.
231
+ - **Local embeddings** make semantic recall free of per-query API cost.
232
+ - **Correction propagation** prevents stale duplicates from inflating results.
233
+
234
+ See also the [MCP Router guide](./MCP_ROUTER.md) for compressing tool *output*
235
+ before it reaches the model.