npm - @miller-tech/uap - Versions diffs - 1.40.0 → 1.41.0 - Mend

@miller-tech/uap 1.40.0 → 1.41.0

Files changed (150)

package/README.md +109 -642
package/dist/.tsbuildinfo +1 -1
package/dist/cli/deliver-defaults.d.ts +23 -0
package/dist/cli/deliver-defaults.d.ts.map +1 -0
package/dist/cli/deliver-defaults.js +121 -0
package/dist/cli/deliver-defaults.js.map +1 -0
package/dist/cli/init.d.ts.map +1 -1
package/dist/cli/init.js +29 -0
package/dist/cli/init.js.map +1 -1
package/dist/cli/setup.d.ts.map +1 -1
package/dist/cli/setup.js +19 -0
package/dist/cli/setup.js.map +1 -1
package/dist/policies/policy-tools.d.ts +7 -0
package/dist/policies/policy-tools.d.ts.map +1 -1
package/dist/policies/policy-tools.js +24 -2
package/dist/policies/policy-tools.js.map +1 -1
package/docs/INDEX.md +48 -286
package/docs/architecture/OVERVIEW.md +328 -0
package/docs/architecture/PROTOCOL.md +204 -0
package/docs/benchmarks/README.md +17 -192
package/docs/getting-started/CONFIGURATION.md +237 -0
package/docs/getting-started/INSTALLATION.md +125 -0
package/docs/getting-started/QUICKSTART.md +115 -0
package/docs/guides/COORDINATION.md +162 -0
package/docs/guides/DELIVER.md +115 -0
package/docs/guides/DEPLOY_BATCHING.md +212 -0
package/docs/guides/DROIDS_AND_SKILLS.md +202 -0
package/docs/guides/LOCAL_MODELS.md +148 -0
package/docs/guides/MCP_ROUTER.md +195 -0
package/docs/guides/MEMORY.md +235 -0
package/docs/guides/MULTI_MODEL.md +223 -0
package/docs/guides/POLICIES.md +190 -0
package/docs/guides/WORKTREE_WORKFLOW.md +185 -0
package/docs/integrations/MCP_ROUTER.md +147 -0
package/docs/integrations/RTK.md +102 -0
package/docs/reference/API.md +485 -0
package/docs/reference/CLI.md +719 -0
package/docs/reference/CONFIGURATION.md +90 -193
package/docs/reference/DATABASE_SCHEMA.md +110 -344
package/docs/reference/FEATURES.md +176 -472
package/docs/reference/PATTERNS.md +102 -0
package/docs/reference/PLATFORMS.md +83 -0
package/package.json +3 -1
package/src/policies/enforcers/7ebbc721-7540-4e9f-879a-770e0213a09b_architecture_review.py +101 -0
package/src/policies/enforcers/__pycache__/_common.cpython-312.pyc +0 -0
package/src/policies/enforcers/_common.py +100 -0
package/src/policies/enforcers/artifact_hygiene.py +52 -0
package/src/policies/enforcers/cluster_routing.py +63 -0
package/src/policies/enforcers/codebase_read_before_plan.py +52 -0
package/src/policies/enforcers/coord_overlap.py +81 -0
package/src/policies/enforcers/delivery_enforcement.py +97 -0
package/src/policies/enforcers/doc_live_over_report.py +50 -0
package/src/policies/enforcers/expert_review_required.py +135 -0
package/src/policies/enforcers/iac_parity.py +53 -0
package/src/policies/enforcers/mcp_router_first.py +37 -0
package/src/policies/enforcers/memory_before_plan.py +61 -0
package/src/policies/enforcers/parallel_reads.py +50 -0
package/src/policies/enforcers/rtk_wrap.py +44 -0
package/src/policies/enforcers/schema_diff_gate.py +80 -0
package/src/policies/enforcers/session_memory_write.py +52 -0
package/src/policies/enforcers/task_required.py +131 -0
package/src/policies/enforcers/test_gate.py +58 -0
package/src/policies/enforcers/validate_plan_before_build.py +75 -0
package/src/policies/enforcers/worktree_required.py +57 -0
package/src/policies/schemas/policies/architecture-review.md +51 -0
package/src/policies/schemas/policies/artifact-hygiene.md +29 -0
package/src/policies/schemas/policies/cluster-routing.md +31 -0
package/src/policies/schemas/policies/codebase-read-before-plan.md +30 -0
package/src/policies/schemas/policies/coord-overlap.md +24 -0
package/src/policies/schemas/policies/delivery-enforcement.md +45 -0
package/src/policies/schemas/policies/doc-live-over-report.md +32 -0
package/src/policies/schemas/policies/expert-review-required.md +60 -0
package/src/policies/schemas/policies/iac-parity.md +31 -0
package/src/policies/schemas/policies/mandatory-testing-deployment.md +147 -0
package/src/policies/schemas/policies/mcp-router-first.md +24 -0
package/src/policies/schemas/policies/memory-before-plan.md +24 -0
package/src/policies/schemas/policies/merge-deploy-monitor-verify.md +145 -0
package/src/policies/schemas/policies/parallel-reads.md +24 -0
package/src/policies/schemas/policies/rtk-wrap.md +26 -0
package/src/policies/schemas/policies/schema-diff-gate.md +30 -0
package/src/policies/schemas/policies/session-memory-write.md +24 -0
package/src/policies/schemas/policies/task-required.md +49 -0
package/src/policies/schemas/policies/test-gate.md +24 -0
package/src/policies/schemas/policies/validate-plan-before-build.md +28 -0
package/src/policies/schemas/policies/worktree-required.md +28 -0
package/templates/hooks/uap-policy-gate.sh +5 -0
package/docs/AGENTS.md +0 -423
package/docs/DOCUMENTATION_AUDIT_REPORT.md +0 -131
package/docs/GETTING_STARTED.md +0 -288
package/docs/PROJECT_ANALYSIS_REPORT.md +0 -510
package/docs/architecture/COMPLETE_ARCHITECTURE.md +0 -748
package/docs/architecture/EXPERT_STACK.md +0 -137
package/docs/architecture/MULTI_MODEL.md +0 -224
package/docs/architecture/PLATFORM_GATING.md +0 -68
package/docs/architecture/SYSTEM_ANALYSIS.md +0 -334
package/docs/architecture/UAP_COMPLIANCE.md +0 -217
package/docs/architecture/UAP_PROTOCOL.md +0 -339
package/docs/architecture/UAP_STRICT_DROIDS.md +0 -172
package/docs/archive/BALLS_MODE_SELF_ANALYSIS.md +0 -260
package/docs/archive/BENCHMARK_GAPS_AND_PLAN.md +0 -146
package/docs/archive/FAILING_TASKS_SOLUTION_PLAN.md +0 -668
package/docs/archive/JINJA2-SYSTEM-MESSAGE-FIX.md +0 -209
package/docs/archive/MODEL_ROUTING_IMPLEMENTATION_SUMMARY.md +0 -281
package/docs/archive/MODEL_ROUTING_OPTIMIZATION_PLAN.md +0 -320
package/docs/archive/NPM-PUBLISH-V0.9.1.md +0 -240
package/docs/archive/OPTIMIZATION_OPTIONS.md +0 -334
package/docs/archive/PARALLELISM_GAPS_AND_OPTIONS.md +0 -422
package/docs/archive/POLICY_GATE_IMPLEMENTATION.md +0 -245
package/docs/archive/SETUP_IMPROVEMENTS.md +0 -213
package/docs/archive/UAP_GENERIC_OPTIMIZATION_PLAN.md +0 -270
package/docs/archive/UAP_OPTIMIZATION_PLAN.md +0 -701
package/docs/archive/UAP_V103_PATTERN_DESIGN.md +0 -315
package/docs/archive/UAP_V104_COMPLIANCE_DESIGN.md +0 -223
package/docs/archive/changelog/2026-03-10_uap-100-compliance.md +0 -77
package/docs/archive/changelog/2026-03-10_uap-full-system-verification.md +0 -109
package/docs/archive/opencode-integration-guide.md +0 -740
package/docs/archive/opencode-integration-quickref.md +0 -180
package/docs/benchmarks/OVERNIGHT_RUNNER.md +0 -341
package/docs/benchmarks/SPECULATIVE_DECODING_JOURNEY_2026-03.md +0 -221
package/docs/benchmarks/VALIDATION_PLAN.md +0 -568
package/docs/blog/SPECULATIVE_DECODING_PRODUCTION_PLAYBOOK.md +0 -139
package/docs/blog/local-coding-agents.md +0 -266
package/docs/blog/x-thread.md +0 -254
package/docs/deployment/DEPLOYMENT.md +0 -895
package/docs/deployment/DEPLOYMENT_STRATEGIES.md +0 -518
package/docs/deployment/DEPLOY_BATCHER_ANALYSIS.md +0 -224
package/docs/deployment/DEPLOY_BATCHING.md +0 -273
package/docs/deployment/DEPLOY_BUCKETING_ANALYSIS.md +0 -420
package/docs/deployment/QWEN35_LLAMA_CPP.md +0 -426
package/docs/deployment/UAP_LLAMA_ANTHROPIC_PROXY_BOOTSTRAP.md +0 -279
package/docs/getting-started/INTEGRATION.md +0 -628
package/docs/getting-started/OVERVIEW.md +0 -324
package/docs/getting-started/SETUP.md +0 -377
package/docs/integrations/MCP_ROUTER_SETUP.md +0 -445
package/docs/integrations/RTK_INTEGRATION.md +0 -468
package/docs/operations/TROUBLESHOOTING.md +0 -660
package/docs/pr/PR_SPECULATIVE_DOCS_TEMPLATE.md +0 -146
package/docs/pr/UPSTREAM_PRS.md +0 -424
package/docs/reference/API_REFERENCE.md +0 -903
package/docs/reference/EXPERT_DROIDS.md +0 -219
package/docs/reference/HARNESS-MATRIX.md +0 -318
package/docs/reference/PATTERN_LIBRARY.md +0 -636
package/docs/reference/UAP_CLI_REFERENCE.md +0 -620
package/docs/research/BEHAVIORAL_PATTERNS.md +0 -228
package/docs/research/DOMAIN_STRATEGIES.md +0 -316
package/docs/research/MEMORY_SYSTEMS_COMPARISON.md +0 -812
package/docs/research/PATTERN_ANALYSIS_2026-01-18.md +0 -436
package/docs/research/PERFORMANCE_ANALYSIS_2026-01-18.md +0 -209
package/docs/research/PERFORMANCE_TEST_PLAN.md +0 -383
package/docs/research/TERMINAL_BENCH_LEARNINGS.md +0 -217

package/docs/blog/local-coding-agents.md DELETED

@@ -1,266 +0,0 @@
-# Taming Local Coding Agents: How We Made 35B-A3B Actually Usable
-*A deep dive into hybrid speculative decoding, session-level loop protection, policy enforcement, and building a universal coding agent layer on top of llama.cpp.*
----
-## The problem
-Local LLMs have reached a point where a single RTX 3090 can run 27–35B parameter models fast enough for interactive coding agents. But "fast enough" isn't "usable."
-We hit several walls building our universal coding agent stack (UAP) on top of llama.cpp:
-1. Speculative decoding **silently corrupts** hybrid SSM+attention models (Qwen3.5-35B-A3B, Jamba)
-2. Agent clients enter **runaway tool-use loops** that burn thousands of wasted tokens
-3. Every client speaks a slightly different API shape and injects **volatile context** that breaks stateful guardrails
-4. Models **ignore workflow requirements** in CLAUDE.md — they commit directly to main no matter what the prompt says
-5. Context, memory, skill routing, and multi-agent coordination all need an **additional enforcement layer** above raw inference
-This is what we built to fix all of it.
----
-## Part 1: The hybrid speculative decoding bug
-Qwen3.5-35B-A3B is a hybrid model: 16 of its 64 layers use attention KV cache, 48 use recurrent (SSM) state. When speculative decoding rolls back a partially-accepted batch, it calls `seq_rm(seq_id, p0, -1)` to discard tokens after position `p0`.
-For attention layers this is trivial. For SSM layers it's **impossible** — recurrent state can't be positionally rewound. The upstream llama.cpp handled this with an exact-match checkpoint restore that **never fired** during real speculative decoding:
-```cpp
-checkpoint.pos == p0 - 1    // checkpoint at pre-speculation position K
-                            // p0 - 1 = K + accepted_drafts
-                            // K == K + m → false whenever m > 0
-```
-The fallback path silently updated `cell.pos` without restoring R/S tensor data. SSM state drifted every batch. After a few hundred spec cycles, the model was generating degenerate output that looked like "tool call looping" but was actually accumulated state corruption.
-**Our fix (2 patches, ~280 lines):**
-1. Added a CPU-side checkpoint system in `llama_memory_hybrid` — save R/S tensors before multi-token speculative batches via `ggml_backend_tensor_get`, restore via `ggml_backend_tensor_set`
-2. Changed the restore condition from `checkpoint.pos == p0 - 1` to `checkpoint.pos <= p0 - 1`
-3. Added **server-side activation replay**: after `seq_rm` restores an earlier checkpoint, re-decode the tokens from `(cache_pos + 1)` to the target position via `llama_decode`, bringing both caches back in sync
-This is the "activation replay" technique from Snakes & Ladders (NeurIPS 2024). The result: Qwen3.5-35B-A3B speculative decoding went from "unusable — produces garbled tool calls that loop forever" to **stable 100+ tok/s with 88–98% draft acceptance**.
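The gap between the two restore conditions can be seen in a toy model. The sketch below is not the llama.cpp patch (which is C++); it stands in for the recurrent state with a single integer, and `Checkpoint`, `step`, `fold`, `find_exact`, `find_latest_at_or_before`, and `rollback` are hypothetical names chosen for illustration. It assumes, as the text describes, that a checkpoint is taken at position `K` before the speculative batch and that `p0 - 1 = K + accepted_drafts`.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Checkpoint:
    pos: int    # last token position already folded into the saved state
    state: int  # stand-in for the saved R/S tensor data


def step(state: int, token: int) -> int:
    """Toy recurrent update: order-dependent, so it cannot be rewound in place."""
    return (state * 31 + token) % 1_000_003


def fold(tokens: Sequence[int]) -> int:
    """State after decoding every token in `tokens` from an empty state."""
    state = 0
    for tok in tokens:
        state = step(state, tok)
    return state


def find_exact(checkpoints: List[Checkpoint], p0: int) -> Optional[Checkpoint]:
    """Original condition: only a checkpoint at exactly p0 - 1 qualifies."""
    return next((cp for cp in checkpoints if cp.pos == p0 - 1), None)


def find_latest_at_or_before(checkpoints: List[Checkpoint], p0: int) -> Optional[Checkpoint]:
    """Relaxed condition: the newest checkpoint with pos <= p0 - 1."""
    eligible = [cp for cp in checkpoints if cp.pos <= p0 - 1]
    return max(eligible, key=lambda cp: cp.pos) if eligible else None


def rollback(checkpoints: List[Checkpoint], tokens: Sequence[int], p0: int) -> int:
    """Rebuild the state as if only tokens[0:p0] had been decoded."""
    cp = find_latest_at_or_before(checkpoints, p0)
    if cp is None:
        raise LookupError("no checkpoint at or before p0 - 1")
    state = cp.state
    # Activation replay: re-decode the accepted tokens after the checkpoint.
    for pos in range(cp.pos + 1, p0):
        state = step(state, tokens[pos])
    return state


tokens = [5, 7, 11, 13, 17, 19]
K = 2                                   # checkpoint taken after position K
checkpoints = [Checkpoint(pos=K, state=fold(tokens[: K + 1]))]
p0 = K + 2 + 1                          # two drafts accepted, discard from p0 onward
```

With two accepted drafts the exact-match lookup finds nothing, while the relaxed lookup plus replay reproduces the state a clean decode of `tokens[0:p0]` would have produced.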
----
-## Part 2: The ngram cache reset trap
-llama.cpp's `ngram-mod` speculative type has a hardcoded "low acceptance streak" reset: if draft acceptance drops below 50% for 3 consecutive calls, the entire ngram table is wiped.
-For models with naturally variable output (MoE, fine-tuned, uncensored), this fires constantly. The cache would build up to 100+ drafts/call, then get wiped, then rebuild, then get wiped again. We saw acceptance rates oscillate between 26% and 69% for hours.
-**The fix:** single env var — `NGRAM_MOD_RESET_STREAK=16` (default 3 preserves upstream behavior, `0` disables the reset entirely). On 35B-A3B this moved average acceptance from ~50% to a stable 88%, with peak 98% warmed-up rates.
-~10 lines of code. Bizarrely impactful.
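As a rough model of the reset rule described above, the following Python sketch tracks a low-acceptance streak and wipes a table when the streak reaches a configurable threshold. Only the `NGRAM_MOD_RESET_STREAK` variable name, the default of 3, the meaning of `0`, and the 50% cutoff come from the text; the `NgramCache` class and its fields are hypothetical and do not mirror the actual llama.cpp code.

```python
import os
from typing import Dict, Optional, Tuple


class NgramCache:
    """Toy ngram table with a configurable low-acceptance reset streak."""

    def __init__(self, reset_streak: Optional[int] = None) -> None:
        if reset_streak is None:
            # Default 3 keeps the described upstream behavior; 0 disables resets.
            reset_streak = int(os.environ.get("NGRAM_MOD_RESET_STREAK", "3"))
        self.reset_streak = reset_streak
        self.table: Dict[Tuple[int, ...], int] = {}
        self.low_streak = 0
        self.resets = 0

    def record(self, drafted: int, accepted: int) -> None:
        """Update the streak after one speculative call."""
        if drafted == 0:
            return
        if accepted / drafted < 0.5:
            self.low_streak += 1
        else:
            self.low_streak = 0
        if self.reset_streak > 0 and self.low_streak >= self.reset_streak:
            self.table.clear()   # the whole table is wiped
            self.low_streak = 0
            self.resets += 1
```

Feeding six consecutive low-acceptance calls into this model wipes the table twice at a threshold of 3 and never at 16, which is the oscillation the text describes.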
----
-## Part 3: Loop protection that actually works
-Coding agents making rapid tool calls can fall into pathological loops. We saw three distinct patterns on local 27–35B models:
-1. **Repeated same tool** — 58 req/min on `Read("/dev/null")`. Easy to catch with per-tool cycle detection.
-2. **Distinct but unproductive** — model rotates through `Glob → Read → Bash → FetchUrl` making tiny calls that add no context. **Defeats** per-tool cycle detection because each call is technically different.
-3. **Post-finalize ping-pong** — state machine forces a finalize turn, model emits text, but completion contract re-triggers the active loop on the next request.
-Our proxy's state machine already had per-tool cycle detection, but it didn't catch patterns 2 and 3. We added:
-- **Unproductive exhaustion streak**: counts consecutive `forced_budget_exhausted` events where no cycle was detected. After N in a row, force finalize.
-- **Monotonic finalize hard cap**: session-level counter that survives state resets. After N total finalize events (default 6), stop injecting synthetic continuations and let the natural `end_turn` terminate the loop.
-- **`finalize_fired` blocker suppression**: once a finalize has fired in the session, suppress `text_only_after_tool_results` blockers that would re-trigger the active loop.
-But the actual fix for all of this turned out to be a **one-line session fingerprint bug**.
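The first two additions can be sketched as a small per-session counter object. This is an illustration only: `LoopGuard`, its method names, the returned action strings, and the streak limit of 3 are assumptions; the text supplies only the `forced_budget_exhausted` event, the "N in a row" rule, and the default hard cap of 6.

```python
from dataclasses import dataclass


@dataclass
class LoopGuard:
    streak_limit: int = 3         # hypothetical value for "N in a row"
    finalize_cap: int = 6         # default hard cap mentioned in the text
    unproductive_streak: int = 0
    total_finalizes: int = 0      # monotonic: never cleared by state resets

    def on_budget_exhausted(self, cycle_detected: bool) -> str:
        """Handle one forced_budget_exhausted event and return the action to take."""
        if cycle_detected:
            # Per-tool cycle detection owns this case; the streak starts over.
            self.unproductive_streak = 0
            return "cycle_handler"
        self.unproductive_streak += 1
        if self.unproductive_streak < self.streak_limit:
            return "continue"
        self.unproductive_streak = 0
        if self.total_finalizes >= self.finalize_cap:
            # Cap reached: stop injecting continuations, let end_turn through.
            return "allow_end_turn"
        self.total_finalizes += 1
        return "force_finalize"

    def reset_state(self) -> None:
        """State-machine reset; total_finalizes is intentionally left alone."""
        self.unproductive_streak = 0
```

Keeping `total_finalizes` outside `reset_state` is the point of the "monotonic" counter: a reset cannot hand a runaway session a fresh budget.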
----
-## Part 4: The session fingerprint bug that broke everything
-For weeks, none of our loop protection worked reliably. The state machine would detect a cycle, force a finalize, inject a hint — and then the very next request, the `forced_budget` counter would be back at 11, the `review_cycles` at 0, all the state wiped.
-We assumed it was a state machine bug and wrote more guardrails. Then we added session ID logging:
-```
-REQ: ... sess=fp:9c8f26a802f9f4739f18 msgs=79
-REQ: ... sess=fp:b801857a9e49e21a6599 msgs=81
-REQ: ... sess=fp:aeef638954a390ef7aec msgs=83
-```
-**Every single request got a new session ID.** Every `SessionMonitor` was fresh. None of the counters were accumulating. Every guardrail we'd built was effectively stateless per-request.
-The bug: session fingerprints included:
-1. `tool_use_id` values from tool_result blocks (random UUIDs regenerated per turn)
-2. The entire `system` prompt (clients inject timestamps, cwd, session markers)
-**The fix:** hash only the first user message's **text content**. Exclude system prompts. Use stable content hashes for tool_result blocks.
-After this fix, session stickiness went from 1 request/session to 170+ requests/session. Every prior loop protection mechanism suddenly started working. The unproductive exhaustion streak fired exactly when it should. The finalize hard cap terminated runaway sessions cleanly. Context accumulated correctly for prompt caching.
-One bug — the wrong fingerprint inputs — had been silently defeating every stateful guardrail above it for the entire project. If you're building your own state machine on top of an LLM proxy: **check whether your session key is stable FIRST**.
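A minimal sketch of the stable fingerprint, assuming Anthropic-style message dicts where `content` is either a string or a list of typed blocks. The function names, the choice of SHA-256, and the 20-character truncation (picked to match the length of the IDs in the log excerpt) are assumptions, not the proxy's actual code. The stable hashing of `tool_result` blocks mentioned in the fix is not covered here.

```python
import hashlib
from typing import Any, Dict, List


def first_user_text(messages: List[Dict[str, Any]]) -> str:
    """Return only the text content of the first user message."""
    for msg in messages:
        if msg.get("role") != "user":
            continue
        content = msg.get("content", "")
        if isinstance(content, str):
            return content
        # Keep text blocks only; tool_result blocks and their IDs are skipped.
        return "\n".join(
            block.get("text", "")
            for block in content
            if isinstance(block, dict) and block.get("type") == "text"
        )
    return ""


def session_fingerprint(messages: List[Dict[str, Any]]) -> str:
    """Hash the first user message's text; the system prompt is never an input."""
    digest = hashlib.sha256(first_user_text(messages).encode("utf-8")).hexdigest()
    return "fp:" + digest[:20]
```

Because later turns, `tool_use_id` values, and the system prompt never reach the hash, every request in a conversation maps to the same key as long as the opening user message is unchanged.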
----
-## Part 5: UAP — the universal coding agent layer
-llama.cpp is the engine. UAP is the layer that makes coding agents on top of it actually work.
-### Session and state management
-- **Sticky session fingerprinting** (Part 4)
-- **Per-session conversation pruning** to stay under context limits
-- **Automatic context window detection** from `/slots`
-- **Memory system** with auto-save for user profile, feedback rules, project context, reference pointers — the agent learns across sessions without re-prompting
-- **Automatic context insertion** at natural triggers (session start, fresh task detection)
-### Universal client compatibility
-- **Native Anthropic `/v1/messages`** endpoint
-- **Full OpenAI `/v1/chat/completions`** endpoint with bidirectional conversion (all guardrails active on both paths)
-- **Per-profile chat templates** — ChatML, Gemma-4's `peg-gemma4` DSL, or model-embedded
-- **Per-profile grammar** — Qwen-style `<tool_call>` JSON grammar, or off (required for models that use different tool formats)
-### Skill routing and tool management
-- **Tool narrowing** — automatically reduces 35+ tool schemas down to top-N most relevant per request via query token similarity scoring
-- **Tool cycling detection** with session-level bans for persistent offenders
-- **Malformed tool-call retry** with token/temperature caps
-- **Grammar-constrained tool output** (optional per profile)
-- **Software pattern prefill** — agent skill registry with discovery and auto-invocation for known task patterns
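Tool narrowing by "query token similarity scoring" could look roughly like the sketch below. The text does not say which similarity measure is used, so this assumes a plain Jaccard overlap between lowercase word tokens of the query and of each tool's name plus description; `tokenize`, `narrow_tools`, and the tool dict shape are hypothetical.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def narrow_tools(query: str, tools: List[Dict[str, str]], top_n: int = 8) -> List[Dict[str, str]]:
    """Keep the top_n tool schemas whose name and description best overlap the query."""
    q = tokenize(query)

    def score(tool: Dict[str, str]) -> float:
        t = tokenize(tool["name"] + " " + tool.get("description", ""))
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    # sorted() is stable, so tools with equal scores keep their original order.
    return sorted(tools, key=score, reverse=True)[:top_n]
```

Only the surviving schemas would be sent with the request, which is where the token savings come from.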
-### Loop protection (5-layer defense)
-1. Per-tool fingerprint cycle detection
-2. Stagnation tracking (message fingerprint doesn't change)
-3. Unproductive exhaustion streak (distinct-but-useless calls)
-4. Review cycle limit → forced finalize
-5. Session hard cap on total finalize events → natural termination
-### Speculative decoding tuning
-- Per-profile spec decoding enable/disable
-- Per-request `speculative.n_max=0` override for tool turns (optional per profile)
-- Configurable ngram-mod reset threshold via env var (Part 2)
-- Profile-specific draft parameters (`draft-max`, `draft-min`, `draft-p-min`)
-### Multi-agent coordination
-- **Git worktree enablement** for concurrent agent sessions with isolated filesystem state
-- **CI/CD deploy bucketing** to match concurrent agent development cadence — each agent's deploys go to its own bucket
-- **Shared memory layer** with conflict detection
-- **Skill registry** with discovery
-### Token optimization
-- Pre-request token budget monitoring with estimation
-- Automatic conversation pruning near context limits
-- Tool schema caching
-- Static ngram cache support for cold-start acceleration
-- Tool narrowing (35 → 8 saves ~15k tokens per request on the 35-tool setup)
----
-## Part 5b: The policy engine — enforcement, not suggestions
-You can tell a local coding agent to use a git worktree. You can write it in CLAUDE.md. You can put it in the system prompt. You can make it the first rule in the instructions.
-They will still commit directly to main.
-We learned this the hard way. **The only reliable way to enforce a workflow requirement is to make it non-bypassable at the proxy layer — not at the prompt layer.**
-So we built a **policy engine** that intercepts every tool call and completion check.
-### What it enforces today
-- **Worktree routing** — `Edit`, `Write`, `Bash` tool inputs get rewritten to reference the active worktree path. Operations targeting the main working tree are **rejected** with a policy blocker that the agent can't ignore because it can't produce a valid tool call.
-- **Completion gates** — the proxy's completion contract is extended with policy-level blockers. An agent can't emit `end_turn` on a task unless:
-  - Tests were actually run (not just mentioned)
-  - Parallel reviewers (code-reviewer + security-auditor + architect-reviewer) were invoked before any commit
-  - Memory was queried before any review/check/look operation
-  - Session start protocol completed (bootstrap checks)
-- **Commit discipline** — pre-commit policy invokes review agents, validates commit message format, checks for secrets, runs completion gates. Only then does the `commit` tool call pass through.
-- **CI/CD deploy bucketing** — each agent session has a deploy bucket tied to its worktree. Multi-agent concurrent development doesn't collide at the pipeline layer because each bucket runs independently.
-- **Per-profile rule sets** — the `build` profile has strict worktree + review + test requirements. `plan` mode blocks all `write`/`edit` tools. `memory` mode is read-only. `autoaccept` can skip some gates but not the security ones.
-### How it works
-Every tool call goes through a policy check chain before being forwarded to llama.cpp:
-```
-client → proxy → [guardrails] → [policy engine] → [tool rewriter] → llama.cpp
-                                       ↓
-                                  audit log
-```
-Each policy is a small declarative rule:
-```python
-@policy("worktree.enforce", profile=["build", "autoaccept"])
-def enforce_worktree(request, session):
-    if request.tool_name in MUTATING_TOOLS:
-        if not session.worktree_active:
-            return block("worktree_not_in_use",
-                         hint="Create a worktree first: git worktree add ...")
-        request.tool_input["path"] = rewrite_to_worktree(
-            request.tool_input["path"], session.worktree
-        )
-    return allow()
-@policy("commit.parallel_review", profile="build")
-def enforce_parallel_review(request, session):
-    if "git commit" in request.tool_input.get("command", ""):
-        if not session.review_completed_this_turn:
-            return block("parallel_review_required")
-    return allow()
-```
-The rule either allows the call, rewrites it, or blocks it with a reason that becomes part of the agent's context on the next turn. **Agents can't route around a block** — the proxy doesn't give them a tool they can use to bypass the policy, so they have no tokens to emit that would reach the outside world.
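The rules above rely on `policy`, `block`, and `allow` helpers that the post does not show. One way such a registry and check chain might be wired up is sketched here; every class and helper in it is a hypothetical stand-in, and only the commit-review rule is reproduced from the text so the example stays self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence, Tuple, Union


@dataclass
class Decision:
    allowed: bool
    reason: Optional[str] = None
    hint: Optional[str] = None


def allow() -> Decision:
    return Decision(True)


def block(reason: str, hint: Optional[str] = None) -> Decision:
    return Decision(False, reason, hint)


@dataclass
class Request:
    tool_name: str
    tool_input: Dict[str, str] = field(default_factory=dict)


@dataclass
class Session:
    profile: str
    worktree_active: bool = False
    review_completed_this_turn: bool = False


PolicyFn = Callable[[Request, Session], Decision]
_REGISTRY: List[Tuple[str, Tuple[str, ...], PolicyFn]] = []


def policy(name: str, profile: Union[str, Sequence[str]]) -> Callable[[PolicyFn], PolicyFn]:
    """Register a rule for one profile or a list of profiles."""
    profiles = (profile,) if isinstance(profile, str) else tuple(profile)

    def register(fn: PolicyFn) -> PolicyFn:
        _REGISTRY.append((name, profiles, fn))
        return fn

    return register


def check(request: Request, session: Session) -> Decision:
    """Run every rule for the session's profile; the first block wins."""
    for _name, profiles, fn in _REGISTRY:
        if session.profile in profiles:
            decision = fn(request, session)
            if not decision.allowed:
                return decision
    return allow()


@policy("commit.parallel_review", profile="build")
def enforce_parallel_review(request: Request, session: Session) -> Decision:
    if "git commit" in request.tool_input.get("command", ""):
        if not session.review_completed_this_turn:
            return block("parallel_review_required")
    return allow()
```

In this sketch a blocked `Decision` carries the reason and hint that would be fed back into the agent's context on the next turn, while an allowed one lets the call continue down the chain.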
-### Why this matters for local models
-Frontier models kind of follow instructions in CLAUDE.md. Local 27–35B models don't. The gap is large enough that policy-as-prompt is not an enforcement mechanism for local coding agents — it's a suggestion the model ignores when the compute pressure is on.
-Moving enforcement from prompt layer to proxy layer turned our local coding agents from "unreliable hobby" to **"actually usable in a real delivery pipeline."**
----
-## Part 6: Results
-On a single RTX 3090 with Qwen3.5-35B-A3B-UD-IQ4_XS:
-| Metric | Before | After |
-|--------|--------|-------|
-| Speculative decoding | Broken (garbled output) | **Stable** |
-| Peak generation speed | 30–55 tok/s (unstable) | **100+ tok/s** |
-| Draft acceptance | 26–69% (oscillating) | **88–98%** |
-| Loop protection | Stateless (session bug) | Works end-to-end |
-| Session stickiness | 1 req/session | 170+ req/session |
-| Time to break runaway loop | Indefinite | ~30–60 seconds |
-| Tool output corruption | Frequent | Rare (auto-retried cleanly) |
-| Worktree compliance | ~20% (model ignored prompts) | **100% (policy-enforced)** |
-| Pre-commit review compliance | ~10% | **100%** |
-| Concurrent agent collisions | Common | None (bucketed) |
----
-## Part 7: Where this is going
-We're preparing upstream PRs:
-- **llama.cpp** — three PRs:
-  1. Configurable ngram-mod reset threshold
-  2. Hybrid speculative rollback via CPU state checkpoints
-  3. Server activation replay for partial speculative rollback
-- **UAP proxy** — five PRs:
-  1. Stable session fingerprinting (critical bug fix)
-  2. Loop protection hardening
-  3. Per-request speculative decoding control
-  4. OpenAI-compatible `/v1/chat/completions` endpoint with guardrails
-  5. Policy engine with worktree + CI/CD enforcement
-The llama.cpp patches are at `github.com/DammianMiller/llama.cpp` on branch `upgrade-b8740`. UAP is at `github.com/miller-tech/universal-agent-protocol` (public release pending).
----
-## The punchline
-Local coding agents on consumer GPUs are actually viable today — if you fix the half-dozen subtle bugs that every path through the stack seems to land on.
-Most of the fixes are small. Most of them would be invisible without the right logging. And most of them only matter once you stack them together: the speculative decoding fix makes generation fast enough to be interactive, the ngram reset fix makes it stable, the session fingerprint fix makes loop protection functional, the loop protection makes the agent stoppable, the OpenAI endpoint makes any client able to benefit from it all, and the **policy engine is what finally makes the output trustworthy enough to ship.**
-We kept finding one more bug, one more missing piece, one more enforcement gap. When the last one cleared, we had a local coding agent stack that actually works.
-Share your own findings — the local LLM tooling space is still wide open.

package/docs/blog/x-thread.md DELETED

@@ -1,254 +0,0 @@
-# X Thread: Taming Local Coding Agents
-Publish as a thread on x.com. Each section is one tweet (≤280 chars where noted).
----
-**1/ 🧵**
-Taming local coding agents on a single RTX 3090.
-Qwen3.5-35B-A3B @ ~100 tok/s with working spec decoding, clean tool calls, loop protection that actually works, and policy-enforced worktrees.
-A deep dive into the llama.cpp + UAP stack we built.
----
-**2/**
-Five walls you hit building local coding agents:
-1. Spec decoding silently corrupts hybrid SSM+attention models
-2. Agents enter runaway tool loops
-3. Each client injects volatile context that breaks stateful guardrails
-4. Models ignore workflow rules in CLAUDE.md
-5. Multi-agent concurrency collides at the pipeline
----
-**3/**
-Wall 1: Hybrid spec decoding.
-Qwen3.5-35B-A3B has 16 attention + 48 recurrent layers. When spec decoding partially accepts drafts, it needs to roll back. Attention = trivial. Recurrent SSM state = can't positionally rewind.
----
-**4/**
-Upstream llama.cpp had an exact-match checkpoint restore:
-`checkpoint.pos == p0 - 1`
-But during real spec decoding, `checkpoint.pos = K` while `p0-1 = K + accepted_drafts`. The match never fired. The fallback silently updated position counters without restoring R/S tensors.
----
-**5/**
-State drifted every batch. After a few hundred cycles, the model produced degenerate output.
-Symptom: "looping tool calls."
-Root cause: accumulated SSM state corruption.
-The two diagnoses look identical from the outside. They're completely different problems.
----
-**6/**
-Fix: CPU-side checkpoint system that saves R/S tensors before multi-token batches, plus activation replay (Snakes & Ladders, NeurIPS 2024).
-After `seq_rm` restores a checkpoint, re-decode tokens from (cache_pos+1) → target via `llama_decode` to resync both caches.
----
-**7/**
-Result: 35B-A3B spec decoding went from "unusable — produces garbled tool calls that loop forever" to stable **100+ tok/s with 88–98% draft acceptance**.
-~280 lines of llama.cpp patches. Upstream PRs incoming.
----
-**8/**
-Wall 2: Loop protection that doesn't work.
-Agent clients on local models loop. We built per-tool cycle detection, stagnation tracking, forced finalize, synthetic continuation injection. None of it worked reliably.
----
-**9/**
-Added session ID logging and saw this:
-```
-REQ ... sess=fp:9c8f... msgs=79
-REQ ... sess=fp:b801... msgs=81
-REQ ... sess=fp:aeef... msgs=83
-```
-Every request got a NEW session ID. Every counter was fresh. Every guardrail was stateless.
----
-**10/**
-Cause: Session fingerprints hashed `tool_use_id` (random UUIDs per turn) + `system` prompt (clients inject timestamps/cwd/sessions).
-Fix: hash ONLY the first user message's text content.
----
-**11/**
-One-line fix. Every stateful guardrail suddenly started working. Loop protection went from 0% to >95% effective.
-Lesson: if your state machine isn't working, check whether the session key is stable FIRST. Every other "fix" is noise until that's right.
----
-**12/**
-Wall 3: ngram-mod cache reset.
-llama.cpp's `ngram-mod` spec type has a hardcoded reset: if acceptance dips below 50% for 3 calls, wipe the cache.
-For 35B MoE models with naturally variable output, this fires constantly. Cache never stabilizes.
----
-**13/**
-Fix: one env var, `NGRAM_MOD_RESET_STREAK=16`. Default 3 (upstream behavior preserved). On 35B-A3B, moved avg acceptance from ~50% to stable 88%+.
-~10 lines, tiny PR.
----
-**14/**
-Wall 4: Model ignores CLAUDE.md.
-You can tell a local 27–35B coding agent "always use a git worktree, run parallel reviews before committing, query memory first."
-It will ignore all of that and commit directly to main. Every time.
----
-**15/**
-So we built a **policy engine** that enforces workflow rules at the proxy layer.
-The only reliable enforcement is non-bypassable at the tool-call layer, not at the prompt layer.
----
-**16/**
-Policy engine intercepts every tool call BEFORE it reaches llama.cpp:
-- Rewrites file paths to route through active worktree
-- Blocks commits until reviewers run in parallel
-- Enforces completion gates (tests ran, memory queried, security checked)
-- Per-profile rule sets (build / plan / memory / autoaccept)
----
-**17/**
-Rules are tiny declarative policies:
-```python
-@policy("worktree.enforce")
-def enforce(req, session):
-    if req.tool in MUTATING_TOOLS:
-        if not session.worktree_active:
-            return block("worktree_not_in_use")
-        req.input.path = to_worktree(req.input.path)
-    return allow()
-```
----
-**18/**
-The agent can't route around a block because the proxy never gives it a tool to bypass with. It has no tokens to emit that would reach the outside world without going through the policy chain.
-This is the difference between "coding agent suggestion" and "coding agent enforcement."
----
-**19/**
-Part of UAP: a universal coding agent layer on top of llama.cpp.
-Features:
-- Skill routing + tool narrowing (35 → 8 per request)
-- Universal client shim: /v1/messages AND /v1/chat/completions, both guarded
-- Memory with auto-save for user / feedback / project context
-- Sticky sessions with monotonic loop counters
----
-**20/**
-And at the dev workflow layer:
-- Git worktree enablement for concurrent agents
-- CI/CD deploy bucketing per-worktree
-- Token budget monitoring with pre-request estimation
-- Software pattern prefill via skill registry
-- Multi-agent coordination with shared memory + conflict detection
----
-**21/**
-Results on RTX 3090 + Qwen3.5-35B-A3B-UD-IQ4_XS:
-| | Before | After |
-|---|---|---|
-| Spec decode | Broken | Stable |
-| Peak tok/s | 30–55 | **100+** |
-| Draft accept | 26–69% | **88–98%** |
-| Loop protect | 0% | >95% |
-| Worktree compliance | ~20% | **100%** |
-| Pre-commit review | ~10% | **100%** |
----
-**22/**
-Patches: `github.com/DammianMiller/llama.cpp` branch `upgrade-b8740`
-UAP: `github.com/miller-tech/universal-agent-protocol` (public release pending)
-Upstream PRs coming:
-- llama.cpp: hybrid spec rollback, activation replay, configurable ngram reset
-- UAP: session fingerprinting, loop protection, policy engine
----
-**23/**
-The punchline:
-Local coding agents on consumer GPUs are actually viable today. You just have to fix the half-dozen subtle bugs that every path through the stack seems to land on.
-Most of them are one-line fixes you only find by adding the right logging.
----
-**24/**
-The kicker: none of these fixes matter alone.
-- Fast spec decoding is useless if the model loops
-- Loop protection is useless if sessions are stateless
-- Stateful protection is useless if workflow isn't enforced
-- Enforcement is useless if tool output is corrupted
-Stack them all, and it works.
-/end