@miller-tech/uap 1.40.0 → 1.41.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (150) hide show
  1. package/README.md +109 -642
  2. package/dist/.tsbuildinfo +1 -1
  3. package/dist/cli/deliver-defaults.d.ts +23 -0
  4. package/dist/cli/deliver-defaults.d.ts.map +1 -0
  5. package/dist/cli/deliver-defaults.js +121 -0
  6. package/dist/cli/deliver-defaults.js.map +1 -0
  7. package/dist/cli/init.d.ts.map +1 -1
  8. package/dist/cli/init.js +29 -0
  9. package/dist/cli/init.js.map +1 -1
  10. package/dist/cli/setup.d.ts.map +1 -1
  11. package/dist/cli/setup.js +19 -0
  12. package/dist/cli/setup.js.map +1 -1
  13. package/dist/policies/policy-tools.d.ts +7 -0
  14. package/dist/policies/policy-tools.d.ts.map +1 -1
  15. package/dist/policies/policy-tools.js +24 -2
  16. package/dist/policies/policy-tools.js.map +1 -1
  17. package/docs/INDEX.md +48 -286
  18. package/docs/architecture/OVERVIEW.md +328 -0
  19. package/docs/architecture/PROTOCOL.md +204 -0
  20. package/docs/benchmarks/README.md +17 -192
  21. package/docs/getting-started/CONFIGURATION.md +237 -0
  22. package/docs/getting-started/INSTALLATION.md +125 -0
  23. package/docs/getting-started/QUICKSTART.md +115 -0
  24. package/docs/guides/COORDINATION.md +162 -0
  25. package/docs/guides/DELIVER.md +115 -0
  26. package/docs/guides/DEPLOY_BATCHING.md +212 -0
  27. package/docs/guides/DROIDS_AND_SKILLS.md +202 -0
  28. package/docs/guides/LOCAL_MODELS.md +148 -0
  29. package/docs/guides/MCP_ROUTER.md +195 -0
  30. package/docs/guides/MEMORY.md +235 -0
  31. package/docs/guides/MULTI_MODEL.md +223 -0
  32. package/docs/guides/POLICIES.md +190 -0
  33. package/docs/guides/WORKTREE_WORKFLOW.md +185 -0
  34. package/docs/integrations/MCP_ROUTER.md +147 -0
  35. package/docs/integrations/RTK.md +102 -0
  36. package/docs/reference/API.md +485 -0
  37. package/docs/reference/CLI.md +719 -0
  38. package/docs/reference/CONFIGURATION.md +90 -193
  39. package/docs/reference/DATABASE_SCHEMA.md +110 -344
  40. package/docs/reference/FEATURES.md +176 -472
  41. package/docs/reference/PATTERNS.md +102 -0
  42. package/docs/reference/PLATFORMS.md +83 -0
  43. package/package.json +3 -1
  44. package/src/policies/enforcers/7ebbc721-7540-4e9f-879a-770e0213a09b_architecture_review.py +101 -0
  45. package/src/policies/enforcers/__pycache__/_common.cpython-312.pyc +0 -0
  46. package/src/policies/enforcers/_common.py +100 -0
  47. package/src/policies/enforcers/artifact_hygiene.py +52 -0
  48. package/src/policies/enforcers/cluster_routing.py +63 -0
  49. package/src/policies/enforcers/codebase_read_before_plan.py +52 -0
  50. package/src/policies/enforcers/coord_overlap.py +81 -0
  51. package/src/policies/enforcers/delivery_enforcement.py +97 -0
  52. package/src/policies/enforcers/doc_live_over_report.py +50 -0
  53. package/src/policies/enforcers/expert_review_required.py +135 -0
  54. package/src/policies/enforcers/iac_parity.py +53 -0
  55. package/src/policies/enforcers/mcp_router_first.py +37 -0
  56. package/src/policies/enforcers/memory_before_plan.py +61 -0
  57. package/src/policies/enforcers/parallel_reads.py +50 -0
  58. package/src/policies/enforcers/rtk_wrap.py +44 -0
  59. package/src/policies/enforcers/schema_diff_gate.py +80 -0
  60. package/src/policies/enforcers/session_memory_write.py +52 -0
  61. package/src/policies/enforcers/task_required.py +131 -0
  62. package/src/policies/enforcers/test_gate.py +58 -0
  63. package/src/policies/enforcers/validate_plan_before_build.py +75 -0
  64. package/src/policies/enforcers/worktree_required.py +57 -0
  65. package/src/policies/schemas/policies/architecture-review.md +51 -0
  66. package/src/policies/schemas/policies/artifact-hygiene.md +29 -0
  67. package/src/policies/schemas/policies/cluster-routing.md +31 -0
  68. package/src/policies/schemas/policies/codebase-read-before-plan.md +30 -0
  69. package/src/policies/schemas/policies/coord-overlap.md +24 -0
  70. package/src/policies/schemas/policies/delivery-enforcement.md +45 -0
  71. package/src/policies/schemas/policies/doc-live-over-report.md +32 -0
  72. package/src/policies/schemas/policies/expert-review-required.md +60 -0
  73. package/src/policies/schemas/policies/iac-parity.md +31 -0
  74. package/src/policies/schemas/policies/mandatory-testing-deployment.md +147 -0
  75. package/src/policies/schemas/policies/mcp-router-first.md +24 -0
  76. package/src/policies/schemas/policies/memory-before-plan.md +24 -0
  77. package/src/policies/schemas/policies/merge-deploy-monitor-verify.md +145 -0
  78. package/src/policies/schemas/policies/parallel-reads.md +24 -0
  79. package/src/policies/schemas/policies/rtk-wrap.md +26 -0
  80. package/src/policies/schemas/policies/schema-diff-gate.md +30 -0
  81. package/src/policies/schemas/policies/session-memory-write.md +24 -0
  82. package/src/policies/schemas/policies/task-required.md +49 -0
  83. package/src/policies/schemas/policies/test-gate.md +24 -0
  84. package/src/policies/schemas/policies/validate-plan-before-build.md +28 -0
  85. package/src/policies/schemas/policies/worktree-required.md +28 -0
  86. package/templates/hooks/uap-policy-gate.sh +5 -0
  87. package/docs/AGENTS.md +0 -423
  88. package/docs/DOCUMENTATION_AUDIT_REPORT.md +0 -131
  89. package/docs/GETTING_STARTED.md +0 -288
  90. package/docs/PROJECT_ANALYSIS_REPORT.md +0 -510
  91. package/docs/architecture/COMPLETE_ARCHITECTURE.md +0 -748
  92. package/docs/architecture/EXPERT_STACK.md +0 -137
  93. package/docs/architecture/MULTI_MODEL.md +0 -224
  94. package/docs/architecture/PLATFORM_GATING.md +0 -68
  95. package/docs/architecture/SYSTEM_ANALYSIS.md +0 -334
  96. package/docs/architecture/UAP_COMPLIANCE.md +0 -217
  97. package/docs/architecture/UAP_PROTOCOL.md +0 -339
  98. package/docs/architecture/UAP_STRICT_DROIDS.md +0 -172
  99. package/docs/archive/BALLS_MODE_SELF_ANALYSIS.md +0 -260
  100. package/docs/archive/BENCHMARK_GAPS_AND_PLAN.md +0 -146
  101. package/docs/archive/FAILING_TASKS_SOLUTION_PLAN.md +0 -668
  102. package/docs/archive/JINJA2-SYSTEM-MESSAGE-FIX.md +0 -209
  103. package/docs/archive/MODEL_ROUTING_IMPLEMENTATION_SUMMARY.md +0 -281
  104. package/docs/archive/MODEL_ROUTING_OPTIMIZATION_PLAN.md +0 -320
  105. package/docs/archive/NPM-PUBLISH-V0.9.1.md +0 -240
  106. package/docs/archive/OPTIMIZATION_OPTIONS.md +0 -334
  107. package/docs/archive/PARALLELISM_GAPS_AND_OPTIONS.md +0 -422
  108. package/docs/archive/POLICY_GATE_IMPLEMENTATION.md +0 -245
  109. package/docs/archive/SETUP_IMPROVEMENTS.md +0 -213
  110. package/docs/archive/UAP_GENERIC_OPTIMIZATION_PLAN.md +0 -270
  111. package/docs/archive/UAP_OPTIMIZATION_PLAN.md +0 -701
  112. package/docs/archive/UAP_V103_PATTERN_DESIGN.md +0 -315
  113. package/docs/archive/UAP_V104_COMPLIANCE_DESIGN.md +0 -223
  114. package/docs/archive/changelog/2026-03-10_uap-100-compliance.md +0 -77
  115. package/docs/archive/changelog/2026-03-10_uap-full-system-verification.md +0 -109
  116. package/docs/archive/opencode-integration-guide.md +0 -740
  117. package/docs/archive/opencode-integration-quickref.md +0 -180
  118. package/docs/benchmarks/OVERNIGHT_RUNNER.md +0 -341
  119. package/docs/benchmarks/SPECULATIVE_DECODING_JOURNEY_2026-03.md +0 -221
  120. package/docs/benchmarks/VALIDATION_PLAN.md +0 -568
  121. package/docs/blog/SPECULATIVE_DECODING_PRODUCTION_PLAYBOOK.md +0 -139
  122. package/docs/blog/local-coding-agents.md +0 -266
  123. package/docs/blog/x-thread.md +0 -254
  124. package/docs/deployment/DEPLOYMENT.md +0 -895
  125. package/docs/deployment/DEPLOYMENT_STRATEGIES.md +0 -518
  126. package/docs/deployment/DEPLOY_BATCHER_ANALYSIS.md +0 -224
  127. package/docs/deployment/DEPLOY_BATCHING.md +0 -273
  128. package/docs/deployment/DEPLOY_BUCKETING_ANALYSIS.md +0 -420
  129. package/docs/deployment/QWEN35_LLAMA_CPP.md +0 -426
  130. package/docs/deployment/UAP_LLAMA_ANTHROPIC_PROXY_BOOTSTRAP.md +0 -279
  131. package/docs/getting-started/INTEGRATION.md +0 -628
  132. package/docs/getting-started/OVERVIEW.md +0 -324
  133. package/docs/getting-started/SETUP.md +0 -377
  134. package/docs/integrations/MCP_ROUTER_SETUP.md +0 -445
  135. package/docs/integrations/RTK_INTEGRATION.md +0 -468
  136. package/docs/operations/TROUBLESHOOTING.md +0 -660
  137. package/docs/pr/PR_SPECULATIVE_DOCS_TEMPLATE.md +0 -146
  138. package/docs/pr/UPSTREAM_PRS.md +0 -424
  139. package/docs/reference/API_REFERENCE.md +0 -903
  140. package/docs/reference/EXPERT_DROIDS.md +0 -219
  141. package/docs/reference/HARNESS-MATRIX.md +0 -318
  142. package/docs/reference/PATTERN_LIBRARY.md +0 -636
  143. package/docs/reference/UAP_CLI_REFERENCE.md +0 -620
  144. package/docs/research/BEHAVIORAL_PATTERNS.md +0 -228
  145. package/docs/research/DOMAIN_STRATEGIES.md +0 -316
  146. package/docs/research/MEMORY_SYSTEMS_COMPARISON.md +0 -812
  147. package/docs/research/PATTERN_ANALYSIS_2026-01-18.md +0 -436
  148. package/docs/research/PERFORMANCE_ANALYSIS_2026-01-18.md +0 -209
  149. package/docs/research/PERFORMANCE_TEST_PLAN.md +0 -383
  150. package/docs/research/TERMINAL_BENCH_LEARNINGS.md +0 -217
@@ -1,266 +0,0 @@
1
- # Taming Local Coding Agents: How We Made 35B-A3B Actually Usable
2
-
3
- *A deep dive into hybrid speculative decoding, session-level loop protection, policy enforcement, and building a universal coding agent layer on top of llama.cpp.*
4
-
5
- ---
6
-
7
- ## The problem
8
-
9
- Local LLMs have reached a point where a single RTX 3090 can run 27–35B parameter models fast enough for interactive coding agents. But "fast enough" isn't "usable."
10
-
11
- We hit several walls building our universal coding agent stack (UAP) on top of llama.cpp:
12
-
13
- 1. Speculative decoding **silently corrupts** hybrid SSM+attention models (Qwen3.5-35B-A3B, Jamba)
14
- 2. Agent clients enter **runaway tool-use loops** that burn thousands of wasted tokens
15
- 3. Every client speaks a slightly different API shape and injects **volatile context** that breaks stateful guardrails
16
- 4. Models **ignore workflow requirements** in CLAUDE.md — they commit directly to main no matter what the prompt says
17
- 5. Context, memory, skill routing, and multi-agent coordination all need an **additional enforcement layer** above raw inference
18
-
19
- This is what we built to fix all of it.
20
-
21
- ---
22
-
23
- ## Part 1: The hybrid speculative decoding bug
24
-
25
- Qwen3.5-35B-A3B is a hybrid model: 16 of its 64 layers use attention KV cache, 48 use recurrent (SSM) state. When speculative decoding rolls back a partially-accepted batch, it calls `seq_rm(seq_id, p0, -1)` to discard tokens after position `p0`.
26
-
27
- For attention layers this is trivial. For SSM layers it's **impossible** — recurrent state can't be positionally rewound. The upstream llama.cpp handled this with an exact-match checkpoint restore that **never fired** during real speculative decoding:
28
-
29
- ```cpp
30
- checkpoint.pos == p0 - 1 // checkpoint at pre-speculation position K
31
- // p0 - 1 = K + accepted_drafts
32
- // K == K + m → false whenever m > 0
33
- ```
34
-
35
- The fallback path silently updated `cell.pos` without restoring R/S tensor data. SSM state drifted every batch. After a few hundred spec cycles, the model was generating degenerate output that looked like "tool call looping" but was actually accumulated state corruption.
36
-
37
- **Our fix (2 patches, ~280 lines):**
38
-
39
- 1. Added a CPU-side checkpoint system in `llama_memory_hybrid` — save R/S tensors before multi-token speculative batches via `ggml_backend_tensor_get`, restore via `ggml_backend_tensor_set`
40
- 2. Changed the restore condition from `checkpoint.pos == p0 - 1` to `checkpoint.pos <= p0 - 1`
41
- 3. Added **server-side activation replay**: after `seq_rm` restores an earlier checkpoint, re-decode the tokens from `(cache_pos + 1)` to the target position via `llama_decode`, bringing both caches back in sync
42
-
43
- This is the "activation replay" technique from Snakes & Ladders (NeurIPS 2024). The result: Qwen3.5-35B-A3B speculative decoding went from "unusable — produces garbled tool calls that loop forever" to **stable 100+ tok/s with 88–98% draft acceptance**.
44
-
45
- ---
46
-
47
- ## Part 2: The ngram cache reset trap
48
-
49
- llama.cpp's `ngram-mod` speculative type has a hardcoded "low acceptance streak" reset: if draft acceptance drops below 50% for 3 consecutive calls, the entire ngram table is wiped.
50
-
51
- For models with naturally variable output (MoE, fine-tuned, uncensored), this fires constantly. The cache would build up to 100+ drafts/call, then get wiped, then rebuild, then get wiped again. We saw acceptance rates oscillate between 26% and 69% for hours.
52
-
53
- **The fix:** single env var — `NGRAM_MOD_RESET_STREAK=16` (default 3 preserves upstream behavior, `0` disables the reset entirely). On 35B-A3B this moved average acceptance from ~50% to a stable 88%, with peak 98% warmed-up rates.
54
-
55
- ~10 lines of code. Bizarrely impactful.
56
-
57
- ---
58
-
59
- ## Part 3: Loop protection that actually works
60
-
61
- Coding agents making rapid tool calls can fall into pathological loops. We saw three distinct patterns on local 27–35B models:
62
-
63
- 1. **Repeated same tool** — 58 req/min on `Read("/dev/null")`. Easy to catch with per-tool cycle detection.
64
- 2. **Distinct but unproductive** — model rotates through `Glob → Read → Bash → FetchUrl` making tiny calls that add no context. **Defeats** per-tool cycle detection because each call is technically different.
65
- 3. **Post-finalize ping-pong** — state machine forces a finalize turn, model emits text, but completion contract re-triggers the active loop on the next request.
66
-
67
- Our proxy's state machine already had per-tool cycle detection, but it didn't catch patterns 2 and 3. We added:
68
-
69
- - **Unproductive exhaustion streak**: counts consecutive `forced_budget_exhausted` events where no cycle was detected. After N in a row, force finalize.
70
- - **Monotonic finalize hard cap**: session-level counter that survives state resets. After N total finalize events (default 6), stop injecting synthetic continuations and let the natural `end_turn` terminate the loop.
71
- - **`finalize_fired` blocker suppression**: once a finalize has fired in the session, suppress `text_only_after_tool_results` blockers that would re-trigger the active loop.
72
-
73
- But the actual fix for all of this turned out to be a **one-line session fingerprint bug**.
74
-
75
- ---
76
-
77
- ## Part 4: The session fingerprint bug that broke everything
78
-
79
- For weeks, none of our loop protection worked reliably. The state machine would detect a cycle, force a finalize, inject a hint — and then the very next request, the `forced_budget` counter would be back at 11, the `review_cycles` at 0, all the state wiped.
80
-
81
- We assumed it was a state machine bug and wrote more guardrails. Then we added session ID logging:
82
-
83
- ```
84
- REQ: ... sess=fp:9c8f26a802f9f4739f18 msgs=79
85
- REQ: ... sess=fp:b801857a9e49e21a6599 msgs=81
86
- REQ: ... sess=fp:aeef638954a390ef7aec msgs=83
87
- ```
88
-
89
- **Every single request got a new session ID.** Every `SessionMonitor` was fresh. None of the counters were accumulating. Every guardrail we'd built was effectively stateless per-request.
90
-
91
- The bug: session fingerprints included:
92
-
93
- 1. `tool_use_id` values from tool_result blocks (random UUIDs regenerated per turn)
94
- 2. The entire `system` prompt (clients inject timestamps, cwd, session markers)
95
-
96
- **The fix:** hash only the first user message's **text content**. Exclude system prompts. Use stable content hashes for tool_result blocks.
97
-
98
- After this fix, session stickiness went from 1 request/session to 170+ requests/session. Every prior loop protection mechanism suddenly started working. The unproductive exhaustion streak fired exactly when it should. The finalize hard cap terminated runaway sessions cleanly. Context accumulated correctly for prompt caching.
99
-
100
- One bug — the wrong fingerprint inputs — had been silently defeating every stateful guardrail above it for the entire project. If you're building your own state machine on top of an LLM proxy: **check whether your session key is stable FIRST**.
101
-
102
- ---
103
-
104
- ## Part 5: UAP — the universal coding agent layer
105
-
106
- llama.cpp is the engine. UAP is the layer that makes coding agents on top of it actually work.
107
-
108
- ### Session and state management
109
- - **Sticky session fingerprinting** (Part 4)
110
- - **Per-session conversation pruning** to stay under context limits
111
- - **Automatic context window detection** from `/slots`
112
- - **Memory system** with auto-save for user profile, feedback rules, project context, reference pointers — the agent learns across sessions without re-prompting
113
- - **Automatic context insertion** at natural triggers (session start, fresh task detection)
114
-
115
- ### Universal client compatibility
116
- - **Native Anthropic `/v1/messages`** endpoint
117
- - **Full OpenAI `/v1/chat/completions`** endpoint with bidirectional conversion (all guardrails active on both paths)
118
- - **Per-profile chat templates** — ChatML, Gemma-4's `peg-gemma4` DSL, or model-embedded
119
- - **Per-profile grammar** — Qwen-style `<tool_call>` JSON grammar, or off (required for models that use different tool formats)
120
-
121
- ### Skill routing and tool management
122
- - **Tool narrowing** — automatically reduces 35+ tool schemas down to top-N most relevant per request via query token similarity scoring
123
- - **Tool cycling detection** with session-level bans for persistent offenders
124
- - **Malformed tool-call retry** with token/temperature caps
125
- - **Grammar-constrained tool output** (optional per profile)
126
- - **Software pattern prefill** — agent skill registry with discovery and auto-invocation for known task patterns
127
-
128
- ### Loop protection (5-layer defense)
129
- 1. Per-tool fingerprint cycle detection
130
- 2. Stagnation tracking (message fingerprint doesn't change)
131
- 3. Unproductive exhaustion streak (distinct-but-useless calls)
132
- 4. Review cycle limit → forced finalize
133
- 5. Session hard cap on total finalize events → natural termination
134
-
135
- ### Speculative decoding tuning
136
- - Per-profile spec decoding enable/disable
137
- - Per-request `speculative.n_max=0` override for tool turns (optional per profile)
138
- - Configurable ngram-mod reset threshold via env var (Part 2)
139
- - Profile-specific draft parameters (`draft-max`, `draft-min`, `draft-p-min`)
140
-
141
- ### Multi-agent coordination
142
- - **Git worktree enablement** for concurrent agent sessions with isolated filesystem state
143
- - **CI/CD deploy bucketing** to match concurrent agent development cadence — each agent's deploys go to its own bucket
144
- - **Shared memory layer** with conflict detection
145
- - **Skill registry** with discovery
146
-
147
- ### Token optimization
148
- - Pre-request token budget monitoring with estimation
149
- - Automatic conversation pruning near context limits
150
- - Tool schema caching
151
- - Static ngram cache support for cold-start acceleration
152
- - Tool narrowing (35 → 8 saves ~15k tokens per request on the 35-tool setup)
153
-
154
- ---
155
-
156
- ## Part 5b: The policy engine — enforcement, not suggestions
157
-
158
- You can tell a local coding agent to use a git worktree. You can write it in CLAUDE.md. You can put it in the system prompt. You can make it the first rule in the instructions.
159
-
160
- They will still commit directly to main.
161
-
162
- We learned this the hard way. **The only reliable way to enforce a workflow requirement is to make it non-bypassable at the proxy layer — not at the prompt layer.**
163
-
164
- So we built a **policy engine** that intercepts every tool call and completion check.
165
-
166
- ### What it enforces today
167
-
168
- - **Worktree routing** — `Edit`, `Write`, `Bash` tool inputs get rewritten to reference the active worktree path. Operations targeting the main working tree are **rejected** with a policy blocker that the agent can't ignore because it can't produce a valid tool call.
169
- - **Completion gates** — the proxy's completion contract is extended with policy-level blockers. An agent can't emit `end_turn` on a task unless:
170
- - Tests were actually run (not just mentioned)
171
- - Parallel reviewers (code-reviewer + security-auditor + architect-reviewer) were invoked before any commit
172
- - Memory was queried before any review/check/look operation
173
- - Session start protocol completed (bootstrap checks)
174
- - **Commit discipline** — pre-commit policy invokes review agents, validates commit message format, checks for secrets, runs completion gates. Only then does the `commit` tool call pass through.
175
- - **CI/CD deploy bucketing** — each agent session has a deploy bucket tied to its worktree. Multi-agent concurrent development doesn't collide at the pipeline layer because each bucket runs independently.
176
- - **Per-profile rule sets** — the `build` profile has strict worktree + review + test requirements. `plan` mode blocks all `write`/`edit` tools. `memory` mode is read-only. `autoaccept` can skip some gates but not the security ones.
177
-
178
- ### How it works
179
-
180
- Every tool call goes through a policy check chain before being forwarded to llama.cpp:
181
-
182
- ```
183
- client → proxy → [guardrails] → [policy engine] → [tool rewriter] → llama.cpp
184
-
185
- audit log
186
- ```
187
-
188
- Each policy is a small declarative rule:
189
-
190
- ```python
191
- @policy("worktree.enforce", profile=["build", "autoaccept"])
192
- def enforce_worktree(request, session):
193
- if request.tool_name in MUTATING_TOOLS:
194
- if not session.worktree_active:
195
- return block("worktree_not_in_use",
196
- hint="Create a worktree first: git worktree add ...")
197
- request.tool_input["path"] = rewrite_to_worktree(
198
- request.tool_input["path"], session.worktree
199
- )
200
- return allow()
201
-
202
- @policy("commit.parallel_review", profile="build")
203
- def enforce_parallel_review(request, session):
204
- if "git commit" in request.tool_input.get("command", ""):
205
- if not session.review_completed_this_turn:
206
- return block("parallel_review_required")
207
- return allow()
208
- ```
209
-
210
- The rule either allows the call, rewrites it, or blocks it with a reason that becomes part of the agent's context on the next turn. **Agents can't route around a block** — the proxy doesn't give them a tool they can use to bypass the policy, so they have no tokens to emit that would reach the outside world.
211
-
212
- ### Why this matters for local models
213
-
214
- Frontier models kind of follow instructions in CLAUDE.md. Local 27–35B models don't. The gap is large enough that policy-as-prompt is not an enforcement mechanism for local coding agents — it's a suggestion the model ignores when the compute pressure is on.
215
-
216
- Moving enforcement from prompt layer to proxy layer turned our local coding agents from "unreliable hobby" to **"actually usable in a real delivery pipeline."**
217
-
218
- ---
219
-
220
- ## Part 6: Results
221
-
222
- On a single RTX 3090 with Qwen3.5-35B-A3B-UD-IQ4_XS:
223
-
224
- | Metric | Before | After |
225
- |--------|--------|-------|
226
- | Speculative decoding | Broken (garbled output) | **Stable** |
227
- | Peak generation speed | 30–55 tok/s (unstable) | **100+ tok/s** |
228
- | Draft acceptance | 26–69% (oscillating) | **88–98%** |
229
- | Loop protection | Stateless (session bug) | Works end-to-end |
230
- | Session stickiness | 1 req/session | 170+ req/session |
231
- | Time to break runaway loop | Indefinite | ~30–60 seconds |
232
- | Tool output corruption | Frequent | Rare (auto-retried cleanly) |
233
- | Worktree compliance | ~20% (model ignored prompts) | **100% (policy-enforced)** |
234
- | Pre-commit review compliance | ~10% | **100%** |
235
- | Concurrent agent collisions | Common | None (bucketed) |
236
-
237
- ---
238
-
239
- ## Part 7: Where this is going
240
-
241
- We're preparing upstream PRs:
242
-
243
- - **llama.cpp** — three PRs:
244
- 1. Configurable ngram-mod reset threshold
245
- 2. Hybrid speculative rollback via CPU state checkpoints
246
- 3. Server activation replay for partial speculative rollback
247
- - **UAP proxy** — five PRs:
248
- 1. Stable session fingerprinting (critical bug fix)
249
- 2. Loop protection hardening
250
- 3. Per-request speculative decoding control
251
- 4. OpenAI-compatible `/v1/chat/completions` endpoint with guardrails
252
- 5. Policy engine with worktree + CI/CD enforcement
253
-
254
- The llama.cpp patches are at `github.com/DammianMiller/llama.cpp` on branch `upgrade-b8740`. UAP is at `github.com/miller-tech/universal-agent-protocol` (public release pending).
255
-
256
- ---
257
-
258
- ## The punchline
259
-
260
- Local coding agents on consumer GPUs are actually viable today — if you fix the half-dozen subtle bugs that every path through the stack seems to land on.
261
-
262
- Most of the fixes are small. Most of them would be invisible without the right logging. And most of them only matter once you stack them together: the speculative decoding fix makes generation fast enough to be interactive, the ngram reset fix makes it stable, the session fingerprint fix makes loop protection functional, the loop protection makes the agent stoppable, the OpenAI endpoint makes any client able to benefit from it all, and the **policy engine is what finally makes the output trustworthy enough to ship.**
263
-
264
- We kept finding one more bug, one more missing piece, one more enforcement gap. When the last one cleared, we had a local coding agent stack that actually works.
265
-
266
- Share your own findings — the local LLM tooling space is still wide open.
@@ -1,254 +0,0 @@
1
- # X Thread: Taming Local Coding Agents
2
-
3
- Publish as a thread on x.com. Each section is one tweet (≤280 chars where noted).
4
-
5
- ---
6
-
7
- **1/ 🧵**
8
-
9
- Taming local coding agents on a single RTX 3090.
10
-
11
- Qwen3.5-35B-A3B @ ~100 tok/s with working spec decoding, clean tool calls, loop protection that actually works, and policy-enforced worktrees.
12
-
13
- A deep dive into the llama.cpp + UAP stack we built.
14
-
15
- ---
16
-
17
- **2/**
18
-
19
- Five walls you hit building local coding agents:
20
-
21
- 1. Spec decoding silently corrupts hybrid SSM+attention models
22
- 2. Agents enter runaway tool loops
23
- 3. Each client injects volatile context that breaks stateful guardrails
24
- 4. Models ignore workflow rules in CLAUDE.md
25
- 5. Multi-agent concurrency collides at the pipeline
26
-
27
- ---
28
-
29
- **3/**
30
-
31
- Wall 1: Hybrid spec decoding.
32
-
33
- Qwen3.5-35B-A3B has 16 attention + 48 recurrent layers. When spec decoding partially accepts drafts, it needs to roll back. Attention = trivial. Recurrent SSM state = can't positionally rewind.
34
-
35
- ---
36
-
37
- **4/**
38
-
39
- Upstream llama.cpp had an exact-match checkpoint restore:
40
- `checkpoint.pos == p0 - 1`
41
-
42
- But during real spec decoding, `checkpoint.pos = K` while `p0-1 = K + accepted_drafts`. The match never fired. The fallback silently updated position counters without restoring R/S tensors.
43
-
44
- ---
45
-
46
- **5/**
47
-
48
- State drifted every batch. After a few hundred cycles, the model produced degenerate output.
49
-
50
- Symptom: "looping tool calls."
51
- Root cause: accumulated SSM state corruption.
52
-
53
- The two diagnoses look identical from the outside. They're completely different problems.
54
-
55
- ---
56
-
57
- **6/**
58
-
59
- Fix: CPU-side checkpoint system that saves R/S tensors before multi-token batches, plus activation replay (Snakes & Ladders, NeurIPS 2024).
60
-
61
- After `seq_rm` restores a checkpoint, re-decode tokens from (cache_pos+1) → target via `llama_decode` to resync both caches.
62
-
63
- ---
64
-
65
- **7/**
66
-
67
- Result: 35B-A3B spec decoding went from "unusable — produces garbled tool calls that loop forever" to stable **100+ tok/s with 88–98% draft acceptance**.
68
-
69
- ~280 lines of llama.cpp patches. Upstream PRs incoming.
70
-
71
- ---
72
-
73
- **8/**
74
-
75
- Wall 2: Loop protection that doesn't work.
76
-
77
- Agent clients on local models loop. We built per-tool cycle detection, stagnation tracking, forced finalize, synthetic continuation injection. None of it worked reliably.
78
-
79
- ---
80
-
81
- **9/**
82
-
83
- Added session ID logging and saw this:
84
-
85
- ```
86
- REQ ... sess=fp:9c8f... msgs=79
87
- REQ ... sess=fp:b801... msgs=81
88
- REQ ... sess=fp:aeef... msgs=83
89
- ```
90
-
91
- Every request got a NEW session ID. Every counter was fresh. Every guardrail was stateless.
92
-
93
- ---
94
-
95
- **10/**
96
-
97
- Cause: Session fingerprints hashed `tool_use_id` (random UUIDs per turn) + `system` prompt (clients inject timestamps/cwd/sessions).
98
-
99
- Fix: hash ONLY the first user message's text content.
100
-
101
- ---
102
-
103
- **11/**
104
-
105
- One-line fix. Every upstream guardrail suddenly started working. Loop protection went from 0% to >95% effective.
106
-
107
- Lesson: if your state machine isn't working, check whether the session key is stable FIRST. Every other "fix" is noise until that's right.
108
-
109
- ---
110
-
111
- **12/**
112
-
113
- Wall 3: ngram-mod cache reset.
114
-
115
- llama.cpp's `ngram-mod` spec type has a hardcoded reset: if acceptance dips below 50% for 3 calls, wipe the cache.
116
-
117
- For 35B MoE models with naturally variable output, this fires constantly. Cache never stabilizes.
118
-
119
- ---
120
-
121
- **13/**
122
-
123
- Fix: one env var, `NGRAM_MOD_RESET_STREAK=16`. Default 3 (upstream behavior preserved). On 35B-A3B, moved avg acceptance from ~50% to stable 88%+.
124
-
125
- ~10 lines, tiny PR.
126
-
127
- ---
128
-
129
- **14/**
130
-
131
- Wall 4: Model ignores CLAUDE.md.
132
-
133
- You can tell a local 27–35B coding agent "always use a git worktree, run parallel reviews before committing, query memory first."
134
-
135
- It will ignore all of that and commit directly to main. Every time.
136
-
137
- ---
138
-
139
- **15/**
140
-
141
- So we built a **policy engine** that enforces workflow rules at the proxy layer.
142
-
143
- The only reliable enforcement is non-bypassable at the tool-call layer, not at the prompt layer.
144
-
145
- ---
146
-
147
- **16/**
148
-
149
- Policy engine intercepts every tool call BEFORE it reaches llama.cpp:
150
-
151
- - Rewrites file paths to route through active worktree
152
- - Blocks commits until reviewers run in parallel
153
- - Enforces completion gates (tests ran, memory queried, security checked)
154
- - Per-profile rule sets (build / plan / memory / autoaccept)
155
-
156
- ---
157
-
158
- **17/**
159
-
160
- Rules are tiny declarative policies:
161
-
162
- ```python
163
- @policy("worktree.enforce")
164
- def enforce(req, session):
165
- if req.tool in MUTATING_TOOLS:
166
- if not session.worktree_active:
167
- return block("worktree_not_in_use")
168
- req.input.path = to_worktree(req.input.path)
169
- return allow()
170
- ```
171
-
172
- ---
173
-
174
- **18/**
175
-
176
- The agent can't route around a block because the proxy never gives it a tool to bypass with. It has no tokens to emit that would reach the outside world without going through the policy chain.
177
-
178
- This is the difference between "coding agent suggestion" and "coding agent enforcement."
179
-
180
- ---
181
-
182
- **19/**
183
-
184
- Part of UAP: a universal coding agent layer on top of llama.cpp.
185
-
186
- Features:
187
- - Skill routing + tool narrowing (35 → 8 per request)
188
- - Universal client shim: /v1/messages AND /v1/chat/completions, both guarded
189
- - Memory with auto-save for user / feedback / project context
190
- - Sticky sessions with monotonic loop counters
191
-
192
- ---
193
-
194
- **20/**
195
-
196
- And at the dev workflow layer:
197
-
198
- - Git worktree enablement for concurrent agents
199
- - CI/CD deploy bucketing per-worktree
200
- - Token budget monitoring with pre-request estimation
201
- - Software pattern prefill via skill registry
202
- - Multi-agent coordination with shared memory + conflict detection
203
-
204
- ---
205
-
206
- **21/**
207
-
208
- Results on RTX 3090 + Qwen3.5-35B-A3B-UD-IQ4_XS:
209
-
210
- | | Before | After |
211
- |---|---|---|
212
- | Spec decode | Broken | Stable |
213
- | Peak tok/s | 30–55 | **100+** |
214
- | Draft accept | 26–69% | **88–98%** |
215
- | Loop protect | 0% | >95% |
216
- | Worktree compliance | ~20% | **100%** |
217
- | Pre-commit review | ~10% | **100%** |
218
-
219
- ---
220
-
221
- **22/**
222
-
223
- Patches: `github.com/DammianMiller/llama.cpp` branch `upgrade-b8740`
224
-
225
- UAP: `github.com/miller-tech/universal-agent-protocol` (public release pending)
226
-
227
- Upstream PRs coming:
228
- - llama.cpp: hybrid spec rollback, activation replay, configurable ngram reset
229
- - UAP: session fingerprinting, loop protection, policy engine
230
-
231
- ---
232
-
233
- **23/**
234
-
235
- The punchline:
236
-
237
- Local coding agents on consumer GPUs are actually viable today. You just have to fix the half-dozen subtle bugs that every path through the stack seems to land on.
238
-
239
- Most of them are one-line fixes you only find by adding the right logging.
240
-
241
- ---
242
-
243
- **24/**
244
-
245
- The kicker: none of these fixes matter alone.
246
-
247
- - Fast spec decoding is useless if the model loops
248
- - Loop protection is useless if sessions are stateless
249
- - Stateless protection is useless if workflow isn't enforced
250
- - Enforcement is useless if tool output is corrupted
251
-
252
- Stack them all, and it works.
253
-
254
- /end