@miller-tech/uap 1.40.0 → 1.41.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +109 -642
- package/dist/.tsbuildinfo +1 -1
- package/dist/cli/deliver-defaults.d.ts +23 -0
- package/dist/cli/deliver-defaults.d.ts.map +1 -0
- package/dist/cli/deliver-defaults.js +121 -0
- package/dist/cli/deliver-defaults.js.map +1 -0
- package/dist/cli/init.d.ts.map +1 -1
- package/dist/cli/init.js +29 -0
- package/dist/cli/init.js.map +1 -1
- package/dist/cli/setup.d.ts.map +1 -1
- package/dist/cli/setup.js +19 -0
- package/dist/cli/setup.js.map +1 -1
- package/dist/policies/policy-tools.d.ts +7 -0
- package/dist/policies/policy-tools.d.ts.map +1 -1
- package/dist/policies/policy-tools.js +24 -2
- package/dist/policies/policy-tools.js.map +1 -1
- package/docs/INDEX.md +48 -286
- package/docs/architecture/OVERVIEW.md +328 -0
- package/docs/architecture/PROTOCOL.md +204 -0
- package/docs/benchmarks/README.md +17 -192
- package/docs/getting-started/CONFIGURATION.md +237 -0
- package/docs/getting-started/INSTALLATION.md +125 -0
- package/docs/getting-started/QUICKSTART.md +115 -0
- package/docs/guides/COORDINATION.md +162 -0
- package/docs/guides/DELIVER.md +115 -0
- package/docs/guides/DEPLOY_BATCHING.md +212 -0
- package/docs/guides/DROIDS_AND_SKILLS.md +202 -0
- package/docs/guides/LOCAL_MODELS.md +148 -0
- package/docs/guides/MCP_ROUTER.md +195 -0
- package/docs/guides/MEMORY.md +235 -0
- package/docs/guides/MULTI_MODEL.md +223 -0
- package/docs/guides/POLICIES.md +190 -0
- package/docs/guides/WORKTREE_WORKFLOW.md +185 -0
- package/docs/integrations/MCP_ROUTER.md +147 -0
- package/docs/integrations/RTK.md +102 -0
- package/docs/reference/API.md +485 -0
- package/docs/reference/CLI.md +719 -0
- package/docs/reference/CONFIGURATION.md +90 -193
- package/docs/reference/DATABASE_SCHEMA.md +110 -344
- package/docs/reference/FEATURES.md +176 -472
- package/docs/reference/PATTERNS.md +102 -0
- package/docs/reference/PLATFORMS.md +83 -0
- package/package.json +3 -1
- package/src/policies/enforcers/7ebbc721-7540-4e9f-879a-770e0213a09b_architecture_review.py +101 -0
- package/src/policies/enforcers/__pycache__/_common.cpython-312.pyc +0 -0
- package/src/policies/enforcers/_common.py +100 -0
- package/src/policies/enforcers/artifact_hygiene.py +52 -0
- package/src/policies/enforcers/cluster_routing.py +63 -0
- package/src/policies/enforcers/codebase_read_before_plan.py +52 -0
- package/src/policies/enforcers/coord_overlap.py +81 -0
- package/src/policies/enforcers/delivery_enforcement.py +97 -0
- package/src/policies/enforcers/doc_live_over_report.py +50 -0
- package/src/policies/enforcers/expert_review_required.py +135 -0
- package/src/policies/enforcers/iac_parity.py +53 -0
- package/src/policies/enforcers/mcp_router_first.py +37 -0
- package/src/policies/enforcers/memory_before_plan.py +61 -0
- package/src/policies/enforcers/parallel_reads.py +50 -0
- package/src/policies/enforcers/rtk_wrap.py +44 -0
- package/src/policies/enforcers/schema_diff_gate.py +80 -0
- package/src/policies/enforcers/session_memory_write.py +52 -0
- package/src/policies/enforcers/task_required.py +131 -0
- package/src/policies/enforcers/test_gate.py +58 -0
- package/src/policies/enforcers/validate_plan_before_build.py +75 -0
- package/src/policies/enforcers/worktree_required.py +57 -0
- package/src/policies/schemas/policies/architecture-review.md +51 -0
- package/src/policies/schemas/policies/artifact-hygiene.md +29 -0
- package/src/policies/schemas/policies/cluster-routing.md +31 -0
- package/src/policies/schemas/policies/codebase-read-before-plan.md +30 -0
- package/src/policies/schemas/policies/coord-overlap.md +24 -0
- package/src/policies/schemas/policies/delivery-enforcement.md +45 -0
- package/src/policies/schemas/policies/doc-live-over-report.md +32 -0
- package/src/policies/schemas/policies/expert-review-required.md +60 -0
- package/src/policies/schemas/policies/iac-parity.md +31 -0
- package/src/policies/schemas/policies/mandatory-testing-deployment.md +147 -0
- package/src/policies/schemas/policies/mcp-router-first.md +24 -0
- package/src/policies/schemas/policies/memory-before-plan.md +24 -0
- package/src/policies/schemas/policies/merge-deploy-monitor-verify.md +145 -0
- package/src/policies/schemas/policies/parallel-reads.md +24 -0
- package/src/policies/schemas/policies/rtk-wrap.md +26 -0
- package/src/policies/schemas/policies/schema-diff-gate.md +30 -0
- package/src/policies/schemas/policies/session-memory-write.md +24 -0
- package/src/policies/schemas/policies/task-required.md +49 -0
- package/src/policies/schemas/policies/test-gate.md +24 -0
- package/src/policies/schemas/policies/validate-plan-before-build.md +28 -0
- package/src/policies/schemas/policies/worktree-required.md +28 -0
- package/templates/hooks/uap-policy-gate.sh +5 -0
- package/docs/AGENTS.md +0 -423
- package/docs/DOCUMENTATION_AUDIT_REPORT.md +0 -131
- package/docs/GETTING_STARTED.md +0 -288
- package/docs/PROJECT_ANALYSIS_REPORT.md +0 -510
- package/docs/architecture/COMPLETE_ARCHITECTURE.md +0 -748
- package/docs/architecture/EXPERT_STACK.md +0 -137
- package/docs/architecture/MULTI_MODEL.md +0 -224
- package/docs/architecture/PLATFORM_GATING.md +0 -68
- package/docs/architecture/SYSTEM_ANALYSIS.md +0 -334
- package/docs/architecture/UAP_COMPLIANCE.md +0 -217
- package/docs/architecture/UAP_PROTOCOL.md +0 -339
- package/docs/architecture/UAP_STRICT_DROIDS.md +0 -172
- package/docs/archive/BALLS_MODE_SELF_ANALYSIS.md +0 -260
- package/docs/archive/BENCHMARK_GAPS_AND_PLAN.md +0 -146
- package/docs/archive/FAILING_TASKS_SOLUTION_PLAN.md +0 -668
- package/docs/archive/JINJA2-SYSTEM-MESSAGE-FIX.md +0 -209
- package/docs/archive/MODEL_ROUTING_IMPLEMENTATION_SUMMARY.md +0 -281
- package/docs/archive/MODEL_ROUTING_OPTIMIZATION_PLAN.md +0 -320
- package/docs/archive/NPM-PUBLISH-V0.9.1.md +0 -240
- package/docs/archive/OPTIMIZATION_OPTIONS.md +0 -334
- package/docs/archive/PARALLELISM_GAPS_AND_OPTIONS.md +0 -422
- package/docs/archive/POLICY_GATE_IMPLEMENTATION.md +0 -245
- package/docs/archive/SETUP_IMPROVEMENTS.md +0 -213
- package/docs/archive/UAP_GENERIC_OPTIMIZATION_PLAN.md +0 -270
- package/docs/archive/UAP_OPTIMIZATION_PLAN.md +0 -701
- package/docs/archive/UAP_V103_PATTERN_DESIGN.md +0 -315
- package/docs/archive/UAP_V104_COMPLIANCE_DESIGN.md +0 -223
- package/docs/archive/changelog/2026-03-10_uap-100-compliance.md +0 -77
- package/docs/archive/changelog/2026-03-10_uap-full-system-verification.md +0 -109
- package/docs/archive/opencode-integration-guide.md +0 -740
- package/docs/archive/opencode-integration-quickref.md +0 -180
- package/docs/benchmarks/OVERNIGHT_RUNNER.md +0 -341
- package/docs/benchmarks/SPECULATIVE_DECODING_JOURNEY_2026-03.md +0 -221
- package/docs/benchmarks/VALIDATION_PLAN.md +0 -568
- package/docs/blog/SPECULATIVE_DECODING_PRODUCTION_PLAYBOOK.md +0 -139
- package/docs/blog/local-coding-agents.md +0 -266
- package/docs/blog/x-thread.md +0 -254
- package/docs/deployment/DEPLOYMENT.md +0 -895
- package/docs/deployment/DEPLOYMENT_STRATEGIES.md +0 -518
- package/docs/deployment/DEPLOY_BATCHER_ANALYSIS.md +0 -224
- package/docs/deployment/DEPLOY_BATCHING.md +0 -273
- package/docs/deployment/DEPLOY_BUCKETING_ANALYSIS.md +0 -420
- package/docs/deployment/QWEN35_LLAMA_CPP.md +0 -426
- package/docs/deployment/UAP_LLAMA_ANTHROPIC_PROXY_BOOTSTRAP.md +0 -279
- package/docs/getting-started/INTEGRATION.md +0 -628
- package/docs/getting-started/OVERVIEW.md +0 -324
- package/docs/getting-started/SETUP.md +0 -377
- package/docs/integrations/MCP_ROUTER_SETUP.md +0 -445
- package/docs/integrations/RTK_INTEGRATION.md +0 -468
- package/docs/operations/TROUBLESHOOTING.md +0 -660
- package/docs/pr/PR_SPECULATIVE_DOCS_TEMPLATE.md +0 -146
- package/docs/pr/UPSTREAM_PRS.md +0 -424
- package/docs/reference/API_REFERENCE.md +0 -903
- package/docs/reference/EXPERT_DROIDS.md +0 -219
- package/docs/reference/HARNESS-MATRIX.md +0 -318
- package/docs/reference/PATTERN_LIBRARY.md +0 -636
- package/docs/reference/UAP_CLI_REFERENCE.md +0 -620
- package/docs/research/BEHAVIORAL_PATTERNS.md +0 -228
- package/docs/research/DOMAIN_STRATEGIES.md +0 -316
- package/docs/research/MEMORY_SYSTEMS_COMPARISON.md +0 -812
- package/docs/research/PATTERN_ANALYSIS_2026-01-18.md +0 -436
- package/docs/research/PERFORMANCE_ANALYSIS_2026-01-18.md +0 -209
- package/docs/research/PERFORMANCE_TEST_PLAN.md +0 -383
- package/docs/research/TERMINAL_BENCH_LEARNINGS.md +0 -217
@@ -1,137 +0,0 @@
-# Expert Stack: Forward-Design, HALO & Open-Collider
-
-This document covers the expert-system extensions added on top of the v1.23.0
-droid stack: forward-design experts, the activated experts-as-MCP-tools surface,
-HALO trace-based harness optimization, open-collider divergent ideation, and the
-expert-review hard gate.
-
-> Scope note: the base 33-droid roster, `ExpertOrchestrator`, `expert-route`
-> CLI, and `parallel-expert-review` skill already shipped in v1.23.0. This layer
-> closes real gaps in that stack and integrates two external tools.
-
----
-
-## 1. Forward-design droids
-
-The pre-existing roster was review-heavy — the orchestrator's `plan`/`design`
-phases produced no up-front design. Three forward-design experts fill that gap:
-
-| Droid | Phase | Role |
-|---|---|---|
-| `strategic-architect` | plan | North-star architecture, technology selection (OSS-first), multi-quarter evolution, one-way-door decisions. Forward-design counterpart to `architect-reviewer`. |
-| `tactical-architect` | design | Concrete component/module boundaries, interfaces, data shapes, pattern selection, refactor strategy. |
-| `implementation-planner` | design | Executable work breakdown: ordered steps, file-level plan (reuse-first), test plan, risk/rollback. Feeds the `validate-plan-before-build` gate. |
-
-Wiring: `src/coordination/expert-orchestrator.ts` — `PHASE_ROSTER.plan` gains
-`strategic-architect`; `PHASE_ROSTER.design` gains `tactical-architect` and
-`implementation-planner`; `isRelevantForCapability` maps them to the
-`architecture`/`api-design` capabilities so they appear only on relevant tasks.
-
-```bash
-uap expert-route "Design a new billing subsystem" --files src/types/billing.ts --json
-# → plan: strategic-architect … design: tactical-architect, implementation-planner …
-```
-
----
-
-## 2. Experts as MCP tools (activated)
-
-`src/mcp-router/experts/registry.ts` could already convert droids to virtual
-`experts.<name>` tools (`loadExpertTools`) but was never wired in. Now:
-
-- `McpRouter.loadTools()` (`src/mcp-router/server.ts`) calls `loadExpertTools(cwd)`
-  and adds the experts to the fuzzy search index.
-- `handleExecuteTool` (`src/mcp-router/tools/execute.ts`) intercepts
-  `experts.<droid>` paths and dispatches an in-process `consultExpert()` — it
-  loads the droid's instructions and returns them wrapped as a prompt (mirroring
-  `uap_droid_invoke`), instead of routing to an external MCP server.
-
-Result: `discover_tools "architecture review"` surfaces the right expert and
-`execute_tool experts.architect-reviewer` returns a consultation — all within
-the 2-tool token-saving router shape.
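Editor's sketch (not part of the deleted file): the `experts.<droid>` interception described in section 2 could look roughly like the following. Only the `experts.` prefix comes from the text; `parseExpertPath`, `dispatch`, and the `Handler` type are hypothetical names, and the two handlers are passed in as stubs because the real `consultExpert()` signature is not shown.

```typescript
// Hypothetical sketch: route `experts.<droid>` tool paths to an in-process
// consultation instead of an external MCP server. All names other than the
// `experts.` prefix are illustrative, not UAP's actual API.

const EXPERT_PREFIX = "experts.";

/** Returns the droid name for an `experts.<droid>` path, or null otherwise. */
function parseExpertPath(toolPath: string): string | null {
  if (!toolPath.startsWith(EXPERT_PREFIX)) return null;
  const droid = toolPath.slice(EXPERT_PREFIX.length);
  return droid.length > 0 ? droid : null;
}

type Handler = (name: string) => string;

/** Sends expert paths to the in-process handler and everything else to the external one. */
function dispatch(
  toolPath: string,
  consultExpert: Handler,
  callExternal: Handler,
): string {
  const droid = parseExpertPath(toolPath);
  return droid !== null ? consultExpert(droid) : callExternal(toolPath);
}
```

Keeping the prefix check in one small function means the router's two-tool surface stays unchanged: the caller still issues `execute_tool`, and only the dispatch target differs.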
-
----
-
-## 3. HALO — trace-based harness optimization
-
-[HALO](https://github.com/context-labs/HALO) analyzes large volumes of execution
-traces to find *systemic* harness/prompt failure modes (not one-off errors). UAP
-integrates it as an exporter + a droid + a CLI.
-
-**Exporter** (`src/observability/halo-exporter.ts`) — opt-in, zero-overhead when
-off. Emits one JSONL span per agent/LLM/tool call in HALO's OTLP/OpenInference
-shape: OTLP identity, `resource.attributes."service.name"`, and the four
-`inference.*` attributes (`project_id`, `observation_kind`, `export.schema_version`,
-`openinference.span.kind`), with nanosecond-precision timestamps.
-
-Tap points: `execute.ts:handleExecuteTool` (TOOL spans) and
-`session-telemetry.ts` `agentComplete`/`agentError` (AGENT spans).
-
-```bash
-export UAP_HALO_TRACE=1 # enable collection
-export UAP_HALO_TRACE_PATH=.uap/halo/traces.jsonl
-# … run your workflow …
-uap harness status # enabled? path? span count?
-uap harness analyze -p "systemic failure modes?" # wraps `halo <file> -p ...`
-```
-
-**Prerequisite:** `pip install halo-engine` (Python ≥3.10) + an OpenAI-compatible
-endpoint. Each analysis run incurs LLM cost. The `harness-optimizer` droid runs
-the loop: diagnose → **verify each claim against the repo** → route fixes →
-re-measure. Hard rule: *ask HALO about the trace data; never ask it to write code.*
-
----
-
-## 4. Open-Collider — divergent ideation
-
-[open-collider](https://github.com/CL-ML/open-collider) escapes LLM "hivemind"
-clustering by colliding structurally distant knowledge domains (Koestler
-bisociation), then curating non-trivial ideas. Skill mode is free.
-
-- `ideation-expert` droid drives the brief → domains → collide → curate flow.
-- `uap ideate setup <name>` scaffolds the `projects/<name>/` file contract
-  (`brief_validated.json`, `input_bank.yaml`, `prompts/`, `texts/`).
-- `uap ideate run <name>` drives the brainstorm; `uap ideate ideas <name>` reads
-  the newest `curated_ideas.json`.
-- Orchestrator opt-in: `new ExpertOrchestrator({ includeIdeation: true })`
-  prepends an `ideate` phase feeding the plan-phase product/strategy droids.
-  `readCuratedIdeas()` (`src/cli/ideate.ts`) is the consumable artifact.
-
-Use it only when the solution space is wide; skip for convergent tasks.
-
----
-
-## 5. Expert-review hard gate
-
-The `parallel-expert-review` skill claimed "REQUIRED by policy" but nothing
-enforced it. Two policy artifacts close that:
-
-- `expert-review-required` (`src/policies/schemas/policies/expert-review-required.md`
-  + `src/policies/enforcers/expert_review_required.py`): blocks ship actions
-  (`git commit`/`push`, `gh pr create`, merge/pr-ready/signoff) unless
-  `.uap/reviews/<branch-slug>.json` exists and covers `HEAD` (stale → block).
-  Fail-open on detached/non-git; override `UAP_NO_REVIEW=1`.
-- `architecture-review` (`…/policies/architecture-review.md`): the missing
-  backing doc for the previously-orphan `architecture_review.py` enforcer
-  (ADR-or-waiver on architecturally significant diffs).
-
-The review flow writes the artifact on consolidation:
-`{ "head": "<sha>", "verdict": "approve", "reviewers": [...] }`. Install with:
-
-```bash
-uap policy install expert-review-required # attaches the enforcer to the hook
-```
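Editor's sketch (not part of the deleted file): the freshness rule in section 5, "artifact exists and covers `HEAD`", can be expressed as a small pure function. The three field names come from the JSON shape quoted in the text; `ReviewArtifact`, `GateDecision`, and `checkReview` are hypothetical names, the element type of `reviewers` is unknown, and requiring `verdict === "approve"` is an assumption the text does not state explicitly. The shipped enforcer is a Python script and may behave differently.

```typescript
// Hypothetical sketch of the check implied by the expert-review gate.
// Assumption: a non-"approve" verdict also blocks; the source only says the
// artifact must exist and cover HEAD.

interface ReviewArtifact {
  head: string;          // commit sha the review was written against
  verdict: string;       // e.g. "approve"
  reviewers: unknown[];  // element shape is not specified in the source
}

type GateDecision = { allow: true } | { allow: false; reason: string };

function checkReview(artifact: ReviewArtifact | null, headSha: string): GateDecision {
  if (artifact === null) {
    return { allow: false, reason: "no review artifact for this branch" };
  }
  if (artifact.head !== headSha) {
    return { allow: false, reason: "review is stale: it does not cover HEAD" };
  }
  if (artifact.verdict !== "approve") {
    return { allow: false, reason: "verdict is " + artifact.verdict };
  }
  return { allow: true };
}
```

Comparing the stored `head` against the current sha is what makes a review go stale as soon as a new commit lands on the branch.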
-
----
-
-## File map
-
-| Concern | Path |
-|---|---|
-| Forward-design droids | `.factory/droids/{strategic-architect,tactical-architect,implementation-planner}.md` |
-| Orchestrator wiring | `src/coordination/expert-orchestrator.ts` |
-| Experts-MCP dispatch | `src/mcp-router/experts/registry.ts`, `server.ts`, `tools/execute.ts` |
-| HALO exporter | `src/observability/halo-exporter.ts` |
-| HALO droid + CLI | `.factory/droids/harness-optimizer.md`, `src/cli/harness.ts` |
-| Ideation droid + CLI | `.factory/droids/ideation-expert.md`, `src/cli/ideate.ts` |
-| Review gate | `src/policies/{schemas/policies/expert-review-required.md,enforcers/expert_review_required.py}` |
@@ -1,224 +0,0 @@
-# Multi-Model Agentic Architecture
-
-## Executive Summary
-
-This document proposes a two-tier agentic architecture using separate models for planning and execution, achieving **92-98% cost reduction** while maintaining near-original performance for complex tasks.
-
-## Core Concept
-
-**Separation of Concerns:**
-- **Tier 1 (Planner)**: High-level reasoning, task decomposition, orchestration
-- **Tier 2 (Executor)**: Concrete implementation following planner's specifications
-
-### Research Findings (2026)
-
-#### Model Candidates
-
-| Model | Role | Cost (Input/Output) | SWE-Bench | Context | Notes |
-|-------|------|----------------------|-----------|---------|-------|
-| **Claude Opus 4.5** | Planner (current) | $5/$25 per 1M | Highest | 200K | Premium, but expensive |
-| **DeepSeek-V3.2** | Planner | $0.25/$0.38 per 1M | 73.1% | 164K | Best cost/performance ratio |
-| **DeepSeek-V3.2-Exp** | Executor | $0.21/$0.32 per 1M | Strong | 164K | 78x cheaper output than Opus |
-| **GLM-4.7** | Executor | Very Low | Good | 128K | Current workhorse |
-
-#### Key Findings
-
-1. **DeepSeek-V3.2 Speciale** achieves 73.1% on SWE-Bench Verified (vs Opus's highest scores)
-2. **Cost differential**: DeepSeek is ~23x cheaper for input, ~78x cheaper for output
-3. **Context**: 164K is sufficient for most agentic workflows (vs 200K for Opus)
-4. **Architecture**: MoE with 671B params, activates only 37B per token (high efficiency)
-
-## Proposed Architecture
-
-### Tier 1: Master Planner
-
-**Model**: **DeepSeek-V3.2 Speciale** (replacing Opus 4.5)
-
-**Responsibilities**:
-- Task decomposition and planning
-- Subtask dependency analysis
-- Model selection for each subtask
-- Quality assurance routing
-- Critical path identification
-
-**When to invoke:**
-- New task request
-- Complex multi-step workflows
-- Requirements for strategic planning
-- Architectural decisions
-
-**Fallback**: If DeepSeek fails on critical planning, escalate to Opus 4.5 (1% of cases)
-
-### Tier 2: Task Executor
-
-**Model**: **GLM-4.7** (current workhorse) or **DeepSeek-V3.2-Exp**
-
-**Responsibilities**:
-- Implement specific code blocks
-- Execute tool calls
-- Write tests
-- Fix bugs based on planner guidance
-- Generate documentation
-
-**When to invoke:**
-- Concrete implementation tasks
-- Coding following specifications
-- Test writing
-- Bug fixes with clear guidance
-
-### Route Decision Matrix
-
-| Task Complexity | Routing Logic | Model Selection |
-|----------------|---------------|-----------------|
-| **High** (new feature, architecture) | → Planner → Decompose → Executor | DeepSeek-V3.2 → GLM-4.7 |
-| **Medium** (refactor, bug fix) | → Direct Executor | GLM-4.7 |
-| **Low** (simple change) | → Direct Executor | GLM-4.7 |
-| **Critical** (security, deployment) | → Planner → Verify → Executor | DeepSeek-V3.2 → GLM-4.7 |
-
-## Implementation Strategy
-
-### Phase 1: Router (Week 1)
-
-```typescript
-interface ModelRouter {
-  route(task: AgenticTask): ModelSelection;
-}
-
-interface ModelSelection {
-  model: ModelId;
-  fallback?: ModelId;
-  reasoning: string;
-}
-```
-
-**Routing Logic**:
-1. Analyze task complexity (token estimate, dependencies, novelty)
-2. Check for critical keywords (security, architecture, planning)
-3. Select DeepSeek-V3.2 for planning tasks
-4. Select GLM-4.7 for execution tasks
-5. Fallback to Opus 4.5 only on threshold failures
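Editor's sketch (not part of the deleted file): one way the routing logic and the Route Decision Matrix could be combined behind the `ModelRouter` interface. The interface and `ModelSelection` shapes are taken from the hunk; the `AgenticTask` fields, the `Complexity` labels, the model id strings, and the keyword list are assumptions made for illustration, and complexity is supplied by the caller rather than estimated.

```typescript
// Illustrative sketch only. Complexity tiers and model pairings follow the
// Route Decision Matrix; everything marked "assumed" is not in the source.

type ModelId = "deepseek-v3.2" | "deepseek-v3.2-exp" | "glm-4.7" | "opus-4.5"; // assumed ids
type Complexity = "low" | "medium" | "high" | "critical";                      // assumed labels

interface AgenticTask {        // assumed shape
  description: string;
  complexity: Complexity;
}

interface ModelSelection {
  model: ModelId;
  fallback?: ModelId;
  reasoning: string;
}

interface ModelRouter {
  route(task: AgenticTask): ModelSelection;
}

// Keywords named in step 2 of the routing logic.
const CRITICAL_KEYWORDS = ["security", "architecture", "planning"];

const router: ModelRouter = {
  route(task: AgenticTask): ModelSelection {
    const text = task.description.toLowerCase();
    const hasKeyword = CRITICAL_KEYWORDS.some((k) => text.includes(k));
    const needsPlanner =
      task.complexity === "high" || task.complexity === "critical" || hasKeyword;

    if (needsPlanner) {
      // Planner first; Opus is the escalation path on failure.
      return {
        model: "deepseek-v3.2",
        fallback: "opus-4.5",
        reasoning: "planning task: planner decomposes, executor implements",
      };
    }
    // Medium and low complexity go straight to the executor.
    return {
      model: "glm-4.7",
      fallback: "deepseek-v3.2-exp",
      reasoning: "execution task: direct to executor",
    };
  },
};
```

A keyword hit upgrades a medium or low task to the planner path, which errs on the side of more planning; a real router would also need the "threshold failure" signal that triggers the Opus fallback, which the source does not define.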
-
-### Phase 2: Planner Integration (Week 2)
-
-**Planner Interface**:
-```typescript
-interface Planner {
-  plan(task: AgenticTask): ExecutionPlan;
-}
-
-interface ExecutionPlan {
-  subtasks: Subtask[];
-  dependencies: DependencyGraph;
-  modelAssignments: Map<SubtaskId, ModelId>;
-}
-```
-
-**DeepSeek-V3.2 Integration**:
-- API endpoint integration
-- Context window management (164K)
-- Token budget accounting
-- Failure detection and escalation
-
-### Phase 3: Executor Pool (Week 3)
-
-**Executor Options**:
-1. **Primary**: GLM-4.7 (existing, low cost, good performance)
-2. **Backup**: DeepSeek-V3.2-Exp (if GLM-4.7 unavailable)
-3. **Fallback**: Opus 4.5 (critical failures only)
-
-**Load Balancing**:
-- Round-robin across multiple executor instances
-- Circuit breaker pattern for reliability
-- Timeout management per subtask
-
-## Cost Analysis
-
-### Baseline (Opus 4.5 Only)
-
-**Assumptions**:
-- 100 tasks/day
-- Average 50K input tokens, 30K output tokens per task
-- $5 per 1M input tokens, $25 per 1M output tokens
-
-**Daily Cost**:
-- Input: 100 * 50K * $5/1M = $25
-- Output: 100 * 30K * $25/1M = $75
-- **Total: $100/day**
-
-**Monthly Cost**: $3,000
-**Yearly Cost**: $36,500
-
-### Proposed (DeepSeek + GLM-4.7)
-
-**Distribution**:
-- 30% complex tasks → DeepSeek-V3.2 planning (10K tokens)
-- 70% direct execution → GLM-4.7 (15K input, 5K output)
-
-**Daily Cost**:
-- Planner (DeepSeek): 30 tasks * 10K tokens * ($0.25 + $0.38)/1M = $0.19
-- Executor (GLM):
-  - Input: 100 tasks * 15K * $1/1M = $1.50
-  - Output: 100 tasks * 5K * $2/1M = $1.00
-- **Total: $2.69/day**
-
-**Monthly Cost**: $80.70
-**Yearly Cost**: $982
-
-### Cost Savings
-
-| Metric | Baseline | Proposed | Savings |
-|--------|----------|----------|---------|
-| Daily | $100 | $2.69 | **97.3%** |
-| Monthly | $3,000 | $80.70 | **97.3%** |
-| Yearly | $36,500 | $982 | **97.3%** |
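Editor's sketch (not part of the deleted file): the cost arithmetic can be reproduced with a few lines. Prices are USD per 1M tokens as listed in the hunk. The planner figure assumes 10K input plus 10K output tokens per planned task, which is the reading that yields the stated $0.19; the `Usage` type and `dailyCost` function are hypothetical names.

```typescript
// Recomputes the daily-cost figures from the Cost Analysis section.
// Assumption: planner tasks use 10K input and 10K output tokens each.

interface Usage {
  tasks: number;        // tasks per day
  inputTokens: number;  // average input tokens per task
  outputTokens: number; // average output tokens per task
  inputPrice: number;   // USD per 1M input tokens
  outputPrice: number;  // USD per 1M output tokens
}

function dailyCost(u: Usage): number {
  const perTask = u.inputTokens * u.inputPrice + u.outputTokens * u.outputPrice;
  return (u.tasks * perTask) / 1000000;
}

const baseline = dailyCost({
  tasks: 100, inputTokens: 50000, outputTokens: 30000, inputPrice: 5, outputPrice: 25,
});
const planner = dailyCost({
  tasks: 30, inputTokens: 10000, outputTokens: 10000, inputPrice: 0.25, outputPrice: 0.38,
});
const executor = dailyCost({
  tasks: 100, inputTokens: 15000, outputTokens: 5000, inputPrice: 1, outputPrice: 2,
});

const proposed = planner + executor;
const savingsPct = (1 - proposed / baseline) * 100;
```

With these inputs the baseline comes to $100/day, the proposed setup to about $2.69/day, and the savings to roughly 97.3%, matching the table. The monthly and yearly rows use 30 and 365 days respectively.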
-
-### Performance Impact
-
-Expected SWE-Bench performance:
-- **Baseline**: Opus 4.5 (highest scores)
-- **Proposed**:
-  - Planner (DeepSeek-V3.2): 73.1% (verified)
-  - Executor (GLM-4.7): Strong on straightforward tasks
-  - **Composite**: Estimated 85-90% of baseline
-
-**Trade-off**: Accept 10-15% performance drop for 97% cost reduction
-
-## Risk Assessment
-
-### Risks
-
-1. **Routing Errors**: Poor model selection for tasks
-   - **Mitigation**: Start conservative, 10% fallback to Opus
-   - **Monitoring**: Track task success rates per model
-
-2. **Quality Regression**: Lower code quality
-   - **Mitigation**: Add review loops, use quality droids
-   - **Monitoring**: Track test pass rates, bug counts
-
-3. **API Reliability**: DeepSeek availability issues
-   - **Mitigation**: Multi-in redundancy, fallback to Opus
-   - **Monitoring**: Uptime, latency tracking
-
-### Rollback Plan
-
-If metrics degrade >20%, revert to Opus 4.5-only mode within 24 hours.
-
-## Next Steps
-
-1. **Week 1**: Implement router with conservative routing (20% direct to Opus)
-2. **Week 2**: Integrate DeepSeek-V3.2 API, test on 10% of tasks
-3. **Week 3**: Shift to 50/50 routing, monitor carefully
-4. **Week 4**: Full deployment, 95% tasks to proposed architecture
-
-## Success Metrics
-
-- Cost reduction: >90% achieved by month 1
-- Performance: <20% drop vs baseline
-- Reliability: <5% increase in task failures
-- ROI: Break-even within 2 weeks
-
----
-
-**Status**: Draft - Ready for review and implementation
-**Created**: 2026-01-21
-**Next Review**: 2026-01-28 (after week 1 pilot)
@@ -1,68 +0,0 @@
-# Platform Gating
-
-How UAP's policy gate (the DB-driven enforcement that blocks tool calls via
-`policies.db` + `.policy-tools/*.py`) is applied across each supported agent
-harness, and where harness limits make it advisory.
-
-## Install & validate
-
-```bash
-uap hooks install # all project platforms (Hermes is global → opt-in)
-uap hooks install -t hermes # Hermes (writes global ~/.hermes/config.yaml)
-uap hooks doctor # audit coverage; exits non-zero on gaps
-uap setup # now also installs hooks (Step 7)
-```
-
-The gate script `templates/hooks/uap-policy-gate.sh` is copied into each
-platform's hooks dir and registered on that platform's pre-tool event. It reads
-the tool payload on stdin, runs the active enforcers, and blocks with `exit 2`
-(Claude convention). Hermes uses a wrapper (`uap-policy-gate-hermes.sh`) that
-translates `exit 2` into a stdout `{"decision":"block"}` JSON.
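Editor's sketch (not part of the deleted file): the translation the Hermes wrapper performs can be summarized as a mapping from the gate's exit code to a decision document. The real wrapper is a shell script; this only illustrates the logic. The `{"decision":"block"}` shape is stated in the text, while the `"allow"` value, the treatment of other non-zero exit codes, and the name `toHermesDecision` are assumptions.

```typescript
// Hypothetical sketch of the exit-code translation. Only the "block" JSON
// shape comes from the source; the "allow" value is assumed for illustration.

const BLOCK_EXIT_CODE = 2; // Claude-convention exit code meaning "block this tool call"

/** Maps the gate's exit code to the JSON text the wrapper would print on stdout. */
function toHermesDecision(gateExitCode: number): string {
  // Assumption: any exit code other than 2 (including gate crashes) lets the
  // tool proceed, so only a deliberate block is reported as one.
  const decision = gateExitCode === BLOCK_EXIT_CODE ? "block" : "allow";
  return JSON.stringify({ decision });
}
```

Always producing well-formed JSON matters on a fail-open harness: a wrapper that crashed or printed nothing would let the tool call through, so the block has to be carried in the output rather than in the exit status.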
-
-## Coverage matrix
-
-| Platform | Tier | Pre-tool mechanism | Config |
-|---|---|---|---|
-| claude | ✅ gated | `PreToolUse` hooks (Edit/Write/MultiEdit, Bash, Task/Agent/…) | `.claude/settings.local.json` |
-| vscode | ✅ gated | same (Claude format) | `.claude/settings.local.json` |
-| cursor | ✅ gated | `preToolUse` array | `.cursor/hooks.json` |
-| factory | ✅ gated | `PreToolUse` hooks | `.factory/settings.local.json` |
-| opencode | ✅ gated | `tool.execute.before` plugin hook (throws to abort) | `.opencode/plugin/uap-session-hooks.ts` |
-| omp | ✅ gated | `preToolUsePolicyGate` hook | `.uap/omp/settings.json` |
-| hermes | ✅ gated | `pre_tool_call` shell hook (stdout block JSON) | `~/.hermes/config.yaml` (global) |
-| codex | ⚠️ MCP-gated | no native pre-tool hook event | `.codex/config.toml` `[mcp_servers.uap]` |
-| forgecode | ⚠️ advisory | plugin injects policy context; no block path | `.forge/forgecode.plugin.sh` |
-
-## Harness limits (why two platforms are not hard-gated)
-
-- **Codex** has no pre-tool-use *hook event*, so it can't auto-run the gate
-  before every tool. Gating is **hard** for tools routed through the UAP MCP
-  server (`execute_tool` runs the PolicyGate) and **advisory** for codex-native
-  edit/bash (run `bash .codex/hooks/uap-policy-gate.sh` per AGENTS.md). `hooks
-  doctor` reports codex as MCP-gated.
-- **ForgeCode**'s plugin surfaces session/compaction lifecycle and injects the
-  active-policy list as context, but exposes no pre-tool interception point that
-  can *block*. Reported as advisory.
-
-## Hermes specifics
-
-- Config is **global** (`$HERMES_HOME` or `~/.hermes/config.yaml`), so it is
-  excluded from the default `uap hooks install` loop and installed explicitly
-  with `-t hermes`. `hooks doctor` treats an absent `~/.hermes` as optional, and
-  a present-but-unwired install as a real gap.
-- Hermes hooks are **fail-open** (a crashing/exit-non-zero/bad-JSON hook lets the
-  tool proceed). The UAP Hermes gate therefore always exits 0 and always emits a
-  valid decision JSON, so genuine blocks are enforced.
-- Hermes prompts once to approve each hook command (stored in
-  `~/.hermes/shell-hooks-allowlist.json`); approve the UAP gate, or set
-  `hooks_auto_accept: true`.
-- Hermes has no per-file persona registry, so UAP droids are surfaced via a
-  skills bridge (`~/.hermes/uap-skills/uap-experts/SKILL.md`) that routes to
-  `uap expert-route` and the MCP `experts.<name>` tools.
-
-## Key files
-
-- Installer + doctor: `src/cli/hooks.ts` (`copyHookScripts`, `installHermesHooks`, `auditPlatform`, `hooksDoctor`, `ALL_TARGETS`).
-- Gate scripts: `templates/hooks/uap-policy-gate.sh`, `templates/hooks/uap-policy-gate-hermes.sh`.
-- MCP-router gate (codex path): `src/mcp-router/tools/execute.ts:handleExecuteTool`.
-- Setup wiring: `src/cli/setup.ts`.