npm - @miller-tech/uap - Versions diffs - 1.39.0 → 1.40.1 - Mend

@miller-tech/uap 1.39.0 → 1.40.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (99)

package/README.md +109 -642
package/dist/.tsbuildinfo +1 -1
package/dist/bin/cli.js +2 -2
package/dist/bin/cli.js.map +1 -1
package/dist/cli/deliver.d.ts +3 -2
package/dist/cli/deliver.d.ts.map +1 -1
package/dist/cli/deliver.js +10 -5
package/dist/cli/deliver.js.map +1 -1
package/docs/INDEX.md +48 -286
package/docs/architecture/OVERVIEW.md +328 -0
package/docs/architecture/PROTOCOL.md +204 -0
package/docs/benchmarks/README.md +17 -192
package/docs/getting-started/CONFIGURATION.md +237 -0
package/docs/getting-started/INSTALLATION.md +125 -0
package/docs/getting-started/QUICKSTART.md +115 -0
package/docs/guides/COORDINATION.md +162 -0
package/docs/guides/DELIVER.md +115 -0
package/docs/guides/DEPLOY_BATCHING.md +212 -0
package/docs/guides/DROIDS_AND_SKILLS.md +202 -0
package/docs/guides/LOCAL_MODELS.md +148 -0
package/docs/guides/MCP_ROUTER.md +195 -0
package/docs/guides/MEMORY.md +235 -0
package/docs/guides/MULTI_MODEL.md +223 -0
package/docs/guides/POLICIES.md +190 -0
package/docs/guides/WORKTREE_WORKFLOW.md +185 -0
package/docs/integrations/MCP_ROUTER.md +147 -0
package/docs/integrations/RTK.md +102 -0
package/docs/reference/API.md +485 -0
package/docs/reference/CLI.md +719 -0
package/docs/reference/CONFIGURATION.md +90 -193
package/docs/reference/DATABASE_SCHEMA.md +110 -344
package/docs/reference/FEATURES.md +176 -472
package/docs/reference/PATTERNS.md +102 -0
package/docs/reference/PLATFORMS.md +83 -0
package/package.json +1 -1
package/docs/AGENTS.md +0 -423
package/docs/DOCUMENTATION_AUDIT_REPORT.md +0 -131
package/docs/GETTING_STARTED.md +0 -288
package/docs/PROJECT_ANALYSIS_REPORT.md +0 -510
package/docs/architecture/COMPLETE_ARCHITECTURE.md +0 -748
package/docs/architecture/EXPERT_STACK.md +0 -137
package/docs/architecture/MULTI_MODEL.md +0 -224
package/docs/architecture/PLATFORM_GATING.md +0 -68
package/docs/architecture/SYSTEM_ANALYSIS.md +0 -334
package/docs/architecture/UAP_COMPLIANCE.md +0 -217
package/docs/architecture/UAP_PROTOCOL.md +0 -339
package/docs/architecture/UAP_STRICT_DROIDS.md +0 -172
package/docs/archive/BALLS_MODE_SELF_ANALYSIS.md +0 -260
package/docs/archive/BENCHMARK_GAPS_AND_PLAN.md +0 -146
package/docs/archive/FAILING_TASKS_SOLUTION_PLAN.md +0 -668
package/docs/archive/JINJA2-SYSTEM-MESSAGE-FIX.md +0 -209
package/docs/archive/MODEL_ROUTING_IMPLEMENTATION_SUMMARY.md +0 -281
package/docs/archive/MODEL_ROUTING_OPTIMIZATION_PLAN.md +0 -320
package/docs/archive/NPM-PUBLISH-V0.9.1.md +0 -240
package/docs/archive/OPTIMIZATION_OPTIONS.md +0 -334
package/docs/archive/PARALLELISM_GAPS_AND_OPTIONS.md +0 -422
package/docs/archive/POLICY_GATE_IMPLEMENTATION.md +0 -245
package/docs/archive/SETUP_IMPROVEMENTS.md +0 -213
package/docs/archive/UAP_GENERIC_OPTIMIZATION_PLAN.md +0 -270
package/docs/archive/UAP_OPTIMIZATION_PLAN.md +0 -701
package/docs/archive/UAP_V103_PATTERN_DESIGN.md +0 -315
package/docs/archive/UAP_V104_COMPLIANCE_DESIGN.md +0 -223
package/docs/archive/changelog/2026-03-10_uap-100-compliance.md +0 -77
package/docs/archive/changelog/2026-03-10_uap-full-system-verification.md +0 -109
package/docs/archive/opencode-integration-guide.md +0 -740
package/docs/archive/opencode-integration-quickref.md +0 -180
package/docs/benchmarks/OVERNIGHT_RUNNER.md +0 -341
package/docs/benchmarks/SPECULATIVE_DECODING_JOURNEY_2026-03.md +0 -221
package/docs/benchmarks/VALIDATION_PLAN.md +0 -568
package/docs/blog/SPECULATIVE_DECODING_PRODUCTION_PLAYBOOK.md +0 -139
package/docs/blog/local-coding-agents.md +0 -266
package/docs/blog/x-thread.md +0 -254
package/docs/deployment/DEPLOYMENT.md +0 -895
package/docs/deployment/DEPLOYMENT_STRATEGIES.md +0 -518
package/docs/deployment/DEPLOY_BATCHER_ANALYSIS.md +0 -224
package/docs/deployment/DEPLOY_BATCHING.md +0 -273
package/docs/deployment/DEPLOY_BUCKETING_ANALYSIS.md +0 -420
package/docs/deployment/QWEN35_LLAMA_CPP.md +0 -426
package/docs/deployment/UAP_LLAMA_ANTHROPIC_PROXY_BOOTSTRAP.md +0 -279
package/docs/getting-started/INTEGRATION.md +0 -628
package/docs/getting-started/OVERVIEW.md +0 -324
package/docs/getting-started/SETUP.md +0 -377
package/docs/integrations/MCP_ROUTER_SETUP.md +0 -445
package/docs/integrations/RTK_INTEGRATION.md +0 -468
package/docs/operations/TROUBLESHOOTING.md +0 -660
package/docs/pr/PR_SPECULATIVE_DOCS_TEMPLATE.md +0 -146
package/docs/pr/UPSTREAM_PRS.md +0 -424
package/docs/reference/API_REFERENCE.md +0 -903
package/docs/reference/EXPERT_DROIDS.md +0 -219
package/docs/reference/HARNESS-MATRIX.md +0 -318
package/docs/reference/PATTERN_LIBRARY.md +0 -636
package/docs/reference/UAP_CLI_REFERENCE.md +0 -620
package/docs/research/BEHAVIORAL_PATTERNS.md +0 -228
package/docs/research/DOMAIN_STRATEGIES.md +0 -316
package/docs/research/MEMORY_SYSTEMS_COMPARISON.md +0 -812
package/docs/research/PATTERN_ANALYSIS_2026-01-18.md +0 -436
package/docs/research/PERFORMANCE_ANALYSIS_2026-01-18.md +0 -209
package/docs/research/PERFORMANCE_TEST_PLAN.md +0 -383
package/docs/research/TERMINAL_BENCH_LEARNINGS.md +0 -217

package/docs/archive/BALLS_MODE_SELF_ANALYSIS.md DELETED

@@ -1,260 +0,0 @@
-# UAP Project Self-Analysis Using Balls-Mode
-**Date:** 2026-01-19
-**Analysis Method:** Balls-Mode Decomposed Reasoning with Confidence Scoring
-**Purpose:** Identify optimization opportunities for Terminal-Bench performance
----
-## /balls Can UAP improve Terminal-Bench pass rate from 54% to 70%?
-### Step 1: CLASSIFY
-**Complex** - Multi-faceted optimization question requiring analysis of:
-- Current failure modes
-- Pattern effectiveness
-- Implementation gaps
-- Resource constraints
----
-### Step 2: DECOMPOSE
-| #   | Ball                                       | Why it matters                     |
-| --- | ------------------------------------------ | ---------------------------------- |
-| 1   | Are patterns being applied correctly?      | Patterns exist but may not trigger |
-| 2   | Which tasks are near-miss (>50% subtests)? | Highest ROI targets                |
-| 3   | Are pre-hooks working?                     | Domain knowledge injection         |
-| 4   | Is the model capable enough?               | Some tasks need stronger model     |
-| 5   | Are impossible tasks detected early?       | Avoid wasting time                 |
-| 6   | Is output verification happening?          | 37% of failures are missing files  |
-| 7   | Is round-trip testing done?                | Compression/encoding failures      |
-| 8   | Are domain libraries used?                 | Chess/stats need specialized tools |
----
-### Step 3: SOLVE & VERIFY
-#### Ball 1: Are patterns being applied correctly?
-**Evidence:**
-- Pattern Router prints analysis block ✓
-- But: winning-avg-corewars showed 47% improvement when hooks worked
-- Some patterns in CLAUDE.md but not enforced
-**Answer:** Patterns exist but compliance is inconsistent
-#### Ball 2: Which tasks are near-miss?
-**Evidence from benchmark data:**
-- adaptive-rejection-sampler: 8/9 (89%) - 1 test away
-- headless-terminal: 6/7 (86%) - 1 test away
-- winning-avg-corewars: 4/5 subtests (80%) - 1% threshold miss
-- write-compressor: 2/3 (67%) - round-trip issue
-- pytorch-model-cli: shebang/chmod issue
-**Answer:** 5 tasks within striking distance (could add +5 to pass count)
-#### Ball 3: Are pre-hooks working?
-**Evidence:**
-- CoreWars: +47% improvement when hook provided strategy
-- XSS: Hook didn't help (still failing)
-- Compression: Hook didn't fix round-trip issue
-**Answer:** Pre-hooks work for domain knowledge, not for implementation bugs
-#### Ball 4: Is the model capable enough?
-**Evidence:**
-- Opus 4.5: 54% pass rate
-- Sonnet 4: 11% pass rate
-- Droid (top leaderboard): 65% with GPT-5.2
-**Answer:** Model matters significantly; Opus 4.5 is competitive but not best
-#### Ball 5: Are impossible tasks detected early?
-**Evidence:**
-- gpt2-codegolf: Still attempted despite being impossible
-- Pattern 5/19 exists but agent doesn't refuse quickly
-**Answer:** Detection exists but enforcement is weak
-#### Ball 6: Is output verification happening?
-**Evidence:**
-- P12 (OEV) added in v10.2
-- But: pytorch-model-cli failed due to missing shebang
-- filter-js-from-html: file not created
-**Answer:** Verification exists but not enforced pre-completion
-#### Ball 7: Is round-trip testing done?
-**Evidence:**
-- write-compressor: Decompression fails
-- P34 (Reversibility Verification) added
-- Agent doesn't test round-trip before optimizing
-**Answer:** Pattern exists but not followed
-#### Ball 8: Are domain libraries used?
-**Evidence:**
-- chess-best-move: Visual FEN extraction failed
-- adaptive-rejection-sampler: Custom R code vs CRAN 'ars' package
-- P32 recommends libraries over custom
-**Answer:** Agents reinvent instead of using established libraries
----
-### Step 4: SCORE
-| Ball                 | Answer                                    | Confidence | Notes                      |
-| -------------------- | ----------------------------------------- | ---------- | -------------------------- |
-| Pattern compliance   | Inconsistent - some patterns not enforced | 0.7        | Clear from logs            |
-| Near-miss tasks      | 5 tasks within reach                      | 0.85       | Data-driven                |
-| Pre-hooks effective  | Yes for domain knowledge                  | 0.75       | CoreWars proves it         |
-| Model capability     | Opus 4.5 competitive                      | 0.8        | Leaderboard data           |
-| Impossible detection | Weak enforcement                          | 0.6        | gpt2-codegolf evidence     |
-| Output verification  | Exists but not enforced                   | 0.65       | pytorch-model-cli failure  |
-| Round-trip testing   | Not done systematically                   | 0.5        | write-compressor shows gap |
-| Library usage        | Reinventing instead of reusing            | 0.7        | Chess/R failures           |
----
-### Step 5: SYNTHESIZE
-## Synthesis
-**Answer**: Yes, 70% is achievable by fixing 3 key gaps:
-1. **Enforce pattern compliance** - Make Gate checks mandatory, not advisory
-2. **Target near-miss tasks** - 5 tasks at >50% need specific fixes
-3. **Use established libraries** - Chess (python-chess), Stats (CRAN ars), Compression (zlib)
-**Overall Confidence**: 0.65
-**Weakest Links**:
-- Round-trip testing (0.5) - Compression tasks will keep failing without this
-- Impossible detection (0.6) - Time wasted on gpt2-codegolf
-**To increase confidence**:
-1. Run targeted tests on the 5 near-miss tasks with specific fixes
-2. Add mandatory round-trip verification for compression tasks
-3. Implement library-first pattern in pre-hooks
----
-## Specific Optimization Actions
-### High-Priority (Addresses weakest balls)
-#### 1. Mandatory Round-Trip Verification Hook
-```bash
-# Pre-hook for compression tasks
-cat > /tmp/verify_roundtrip.py << 'EOF'
-import sys
-def verify(compress_fn, decompress_fn, test_data):
-    compressed = compress_fn(test_data)
-    decompressed = decompress_fn(compressed)
-    assert decompressed == test_data, "Round-trip failed!"
-    return True
-EOF
-echo "CRITICAL: Test round-trip BEFORE optimizing size"
-```
-#### 2. Library-First Pattern for Domain Tasks
-```markdown
-### Pattern 37: Library-First for Domain Tasks
-When task involves well-known domain (chess, statistics, compression):
-1. SEARCH for established library FIRST: pip search, apt-cache, CRAN
-2. Install and use library instead of implementing from scratch
-3. Libraries handle edge cases you'll miss
-Examples:
-- Chess: python-chess + stockfish
-- Statistics: R 'ars' package for ARS
-- Compression: zlib, lz4 (not custom Huffman)
-```
-#### 3. CLI Execution Verification
-```bash
-# For any script-creation task
-cat > /tmp/verify_cli.sh << 'EOF'
-# Check for a shebang line
-head -1 "$1" | grep -q "^#!" || echo "MISSING SHEBANG"
-# Check executable
-test -x "$1" || echo "NOT EXECUTABLE - run chmod +x"
-# Test execution
-./"$1" --help 2>/dev/null || echo "EXECUTION FAILED"
-EOF
-```
-### Medium-Priority (Near-miss fixes)
-| Task                       | Fix                                          | Confidence Gain |
-| -------------------------- | -------------------------------------------- | --------------- |
-| adaptive-rejection-sampler | Use CRAN 'ars' package                       | +0.3            |
-| winning-avg-corewars       | Tune paper.red threshold (need 75%, got 74%) | +0.2            |
-| write-compressor           | Add round-trip test before optimization      | +0.3            |
-| pytorch-model-cli          | Enforce shebang + chmod                      | +0.25           |
-| headless-terminal          | Debug specific failing escape sequence       | +0.2            |
-### Low-Priority (Already handling)
-- Pattern Router - Working
-- Output existence verification - Mostly working
-- Domain pre-hooks - Working for CoreWars
----
-## Expected Impact
-| Metric                    | Current | After Fixes | Delta       |
-| ------------------------- | ------- | ----------- | ----------- |
-| Pass Rate                 | 54%     | ~70%        | +16 pp      |
-| Near-miss conversion      | 0/5     | 4/5         | +4 tasks    |
-| Time wasted on impossible | High    | Low         | -20% tokens |
----
-## Balls-Mode Skill Integration
-The balls-mode skill is now available at `.factory/skills/balls-mode/SKILL.md`.
-**When to invoke during Terminal-Bench:**
-1. After first failure - decompose what went wrong
-2. Before complex architectural decisions
-3. When confidence in approach is <0.5
-**Integration with existing patterns:**
-- Use BEFORE P16 (Task-First Execution) for complex tasks
-- Complement P17 (Constraint Extraction) with confidence scoring
-- Use AFTER P12 (Output Verification) fails to debug why
----
-**Analysis Complete**: 2026-01-19
-**Next Step**: Run targeted benchmark on near-miss tasks with specific fixes
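
The round-trip hook in the deleted file above writes a `verify` function to `/tmp/verify_roundtrip.py` but never calls it. Below is a minimal runnable sketch of the same check, assuming Python 3, with the standard library's `zlib` standing in for the compressor under test. The name `verify_roundtrip` and the sample inputs are illustrative and do not come from the package.

```python
import zlib


def verify_roundtrip(compress_fn, decompress_fn, samples):
    """Return the samples that do not survive compress -> decompress."""
    failures = []
    for sample in samples:
        try:
            restored = decompress_fn(compress_fn(sample))
        except Exception:
            # A decoder that raises counts as a round-trip failure.
            failures.append(sample)
            continue
        if restored != sample:
            failures.append(sample)
    return failures


# Hypothetical inputs; a real task would use its own corpus.
SAMPLES = [b"", b"a", b"abc" * 1000, bytes(range(256))]

# zlib is only a stand-in for the compressor being developed.
failures = verify_roundtrip(zlib.compress, zlib.decompress, SAMPLES)
print("round-trip OK" if not failures else f"{len(failures)} sample(s) failed")
```

Returning the failing samples instead of asserting on the first mismatch is a design choice for this sketch: it lets a pre-hook report every broken input in one run before any size optimization starts.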

package/docs/archive/BENCHMARK_GAPS_AND_PLAN.md DELETED

@@ -1,146 +0,0 @@
-# UAP Benchmark: Actual Gaps & Execution Plan
-**Generated:** 2026-03-17
-**Benchmark:** Harbor Terminal-Bench 2.0 (89 tasks)
-**Primary Target:** Qwen3.5 35B A3B (IQ4_XS)
----
-## What Already Exists (DO NOT REBUILD)
-| Component                        | File                                                       | Status                                                                                         |
-| -------------------------------- | ---------------------------------------------------------- | ---------------------------------------------------------------------------------------------- |
-| Baseline benchmark (no UAP)      | `scripts/benchmarks/benchmark-qwen35-baseline-no-uap.tsx`  | 403 lines, 94 tasks                                                                            |
-| UAP benchmark (full integration) | `scripts/benchmarks/benchmark-qwen35-uap-3.0-opencode.tsx` | 812 lines, 89 tasks                                                                            |
-| Harbor quick runner (UAP)        | `scripts/benchmarks/run-tbench-qwen35-quick.sh`            | 459 lines, hybrid-adaptive                                                                     |
-| Harbor baseline+UAP runner       | `scripts/benchmarks/run-harbor-qwen35-benchmark.sh`        | Runs both configs sequentially                                                                 |
-| Harbor YAML configs              | `benchmarks/harbor-configs/qwen35_*.yaml`                  | Baseline + UAP pair                                                                            |
-| Comparison report generator      | `scripts/benchmarks/generate-comparison-report.ts`         | 461 lines, p-value tests                                                                       |
-| Full benchmark harness           | `scripts/benchmarks/run-full-benchmark.sh`                 | 413 lines, multi-model A/B                                                                     |
-| Multi-turn agent loop            | `src/benchmarks/multi-turn-loop.ts`                        | 213 lines, `executeWithRetry()`                                                                |
-| Multi-turn + verification        | `src/benchmarks/multi-turn-agent.ts`                       | Wired to dynamic retrieval                                                                     |
-| Improved benchmark runner        | `src/benchmarks/improved-benchmark.ts`                     | 794 lines, wires multi-turn + dynamic retrieval + task classification + hierarchical prompting |
-| Dynamic memory retrieval         | `src/memory/dynamic-retrieval.ts`                          | 1168 lines, 6 memory sources, adaptive depth                                                   |
-| Task classifier                  | `src/memory/task-classifier.ts`                            | 426 lines, 8 categories, ambiguity detection                                                   |
-| Qdrant embeddings                | `src/memory/embeddings.ts`                                 | Fixed, 5 backends with fallback                                                                |
-| Tool call retry (Qwen)           | `tools/agents/scripts/qwen_tool_call_wrapper.py`           | 686 lines, 6 retry strategies                                                                  |
-| Harbor UAP agent                 | `tools/uap_harbor/uap_agent.py`                            | 379 lines, classified preamble                                                                 |
-| Qwen3.5 model presets            | `src/models/types.ts:136-151`                              | `qwen35-a3b` and `qwen35` defined                                                              |
-| Model router                     | `src/models/router.ts`                                     | Qwen3.5 as default executor                                                                    |
----
-## Actual Gaps (3 items)
-### Gap 1: `improved-benchmark.ts` MODELS array missing Qwen3.5
-`src/benchmarks/improved-benchmark.ts:95-99` has the fully wired runner (multi-turn + dynamic retrieval + task classification + hierarchical prompting + verification) but its MODELS array only contains:
-```typescript
-const MODELS: ModelConfig[] = [
-  { id: 'opus-4.5', name: 'Claude Opus 4.5', apiModel: 'claude-opus-4-5-20251101' },
-  { id: 'glm-4.7', name: 'GLM 4.7', apiModel: 'glm-4.7' },
-  { id: 'gpt-5.2-codex', name: 'GPT 5.2 Codex', apiModel: 'gpt-5.2-codex' },
-];
-// Qwen3.5 MISSING
-```
-**Fix:** Add Qwen3.5 to the MODELS array. The preset already exists in `src/models/types.ts:136-151`.
-### Gap 2: `model-integration.ts` MODELS array missing Qwen3.5 + still single-shot
-`src/benchmarks/model-integration.ts:336-361` is the older benchmark runner. It:
-- Has no Qwen3.5 in its MODELS array
-- Uses single-shot execution (no multi-turn, no dynamic retrieval)
-**Fix:** Add Qwen3.5 to its MODELS array. The multi-turn wiring gap is already solved by `improved-benchmark.ts` -- this file can remain as the "legacy single-shot" runner for comparison purposes.
-### Gap 3: No benchmark results exist
-`benchmark-results/` directory does not exist. None of the scripts have been executed.
-**Fix:** Run the existing scripts.
----
-## Execution Plan
-### Step 1: Add Qwen3.5 to improved-benchmark.ts MODELS array
-**File:** `src/benchmarks/improved-benchmark.ts:95-99`
-```typescript
-const MODELS: ModelConfig[] = [
-  { id: 'opus-4.5', name: 'Claude Opus 4.5', apiModel: 'claude-opus-4-5-20251101' },
-  { id: 'glm-4.7', name: 'GLM 4.7', apiModel: 'glm-4.7' },
-  { id: 'gpt-5.2-codex', name: 'GPT 5.2 Codex', apiModel: 'gpt-5.2-codex' },
-  { id: 'qwen35-a3b', name: 'Qwen 3.5 35B A3B', apiModel: 'qwen35-a3b-iq4xs' },
-];
-```
-### Step 2: Add Qwen3.5 to model-integration.ts MODELS array
-**File:** `src/benchmarks/model-integration.ts:336-361`
-```typescript
-{
-  id: 'qwen35-a3b',
-  name: 'Qwen 3.5 35B A3B',
-  provider: 'local',
-  apiModel: 'qwen35-a3b-iq4xs',
-},
-```
-### Step 3: Run existing benchmarks
-```bash
-# Option A: Quick Qwen3.5 baseline + UAP via Harbor (recommended first)
-./scripts/benchmarks/run-harbor-qwen35-benchmark.sh
-# Option B: Direct API baseline (no Harbor containers)
-npx tsx scripts/benchmarks/benchmark-qwen35-baseline-no-uap.tsx
-# Option C: Direct API UAP-enhanced
-npx tsx scripts/benchmarks/benchmark-qwen35-uap-3.0-opencode.tsx
-# Option D: Improved benchmark with multi-turn + dynamic retrieval (all models)
-npx tsx src/benchmarks/improved-benchmark.ts
-# Option E: Full Harbor harness (all models, baseline vs UAP)
-./scripts/benchmarks/run-full-benchmark.sh --model qwen35-a3b-iq4xs
-```
-### Step 4: Generate comparison report
-```bash
-npx tsx scripts/benchmarks/generate-comparison-report.ts \
-  --baseline benchmark-results/qwen35_baseline_no_uap/ \
-  --uap benchmark-results/qwen35_uap_3.0_opencode/
-```
----
-## What This Plan Does NOT Do (because it already exists)
-- Build a multi-turn agent loop (exists: `src/benchmarks/multi-turn-loop.ts`)
-- Build dynamic memory retrieval (exists: `src/memory/dynamic-retrieval.ts`)
-- Build task classification (exists: `src/memory/task-classifier.ts`)
-- Fix Qdrant embeddings (already fixed: `src/memory/embeddings.ts`)
-- Build Harbor configs (exist: `benchmarks/harbor-configs/qwen35_*.yaml`)
-- Build comparison report generator (exists: `scripts/benchmarks/generate-comparison-report.ts`)
-- Wire multi-turn into benchmark runner (exists: `src/benchmarks/improved-benchmark.ts`)
-- Build tool call retry for Qwen (exists: `tools/agents/scripts/qwen_tool_call_wrapper.py`)
-- Create execution scripts (exist: 6+ scripts in `scripts/benchmarks/`)
----
-## Estimated Effort
-| Step                                 | Effort         | Type                                   |
-| ------------------------------------ | -------------- | -------------------------------------- |
-| Add Qwen3.5 to improved-benchmark.ts | 2 minutes      | Code change (1 line)                   |
-| Add Qwen3.5 to model-integration.ts  | 2 minutes      | Code change (5 lines)                  |
-| Run benchmarks                       | 2-8 hours      | Execution (depends on model speed)     |
-| Review results                       | 30 minutes     | Analysis                               |
-| **Total**                            | **~3-9 hours** | Mostly waiting for benchmark execution |