@miller-tech/uap 1.39.0 → 1.40.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (99)
  1. package/README.md +109 -642
  2. package/dist/.tsbuildinfo +1 -1
  3. package/dist/bin/cli.js +2 -2
  4. package/dist/bin/cli.js.map +1 -1
  5. package/dist/cli/deliver.d.ts +3 -2
  6. package/dist/cli/deliver.d.ts.map +1 -1
  7. package/dist/cli/deliver.js +10 -5
  8. package/dist/cli/deliver.js.map +1 -1
  9. package/docs/INDEX.md +48 -286
  10. package/docs/architecture/OVERVIEW.md +328 -0
  11. package/docs/architecture/PROTOCOL.md +204 -0
  12. package/docs/benchmarks/README.md +17 -192
  13. package/docs/getting-started/CONFIGURATION.md +237 -0
  14. package/docs/getting-started/INSTALLATION.md +125 -0
  15. package/docs/getting-started/QUICKSTART.md +115 -0
  16. package/docs/guides/COORDINATION.md +162 -0
  17. package/docs/guides/DELIVER.md +115 -0
  18. package/docs/guides/DEPLOY_BATCHING.md +212 -0
  19. package/docs/guides/DROIDS_AND_SKILLS.md +202 -0
  20. package/docs/guides/LOCAL_MODELS.md +148 -0
  21. package/docs/guides/MCP_ROUTER.md +195 -0
  22. package/docs/guides/MEMORY.md +235 -0
  23. package/docs/guides/MULTI_MODEL.md +223 -0
  24. package/docs/guides/POLICIES.md +190 -0
  25. package/docs/guides/WORKTREE_WORKFLOW.md +185 -0
  26. package/docs/integrations/MCP_ROUTER.md +147 -0
  27. package/docs/integrations/RTK.md +102 -0
  28. package/docs/reference/API.md +485 -0
  29. package/docs/reference/CLI.md +719 -0
  30. package/docs/reference/CONFIGURATION.md +90 -193
  31. package/docs/reference/DATABASE_SCHEMA.md +110 -344
  32. package/docs/reference/FEATURES.md +176 -472
  33. package/docs/reference/PATTERNS.md +102 -0
  34. package/docs/reference/PLATFORMS.md +83 -0
  35. package/package.json +1 -1
  36. package/docs/AGENTS.md +0 -423
  37. package/docs/DOCUMENTATION_AUDIT_REPORT.md +0 -131
  38. package/docs/GETTING_STARTED.md +0 -288
  39. package/docs/PROJECT_ANALYSIS_REPORT.md +0 -510
  40. package/docs/architecture/COMPLETE_ARCHITECTURE.md +0 -748
  41. package/docs/architecture/EXPERT_STACK.md +0 -137
  42. package/docs/architecture/MULTI_MODEL.md +0 -224
  43. package/docs/architecture/PLATFORM_GATING.md +0 -68
  44. package/docs/architecture/SYSTEM_ANALYSIS.md +0 -334
  45. package/docs/architecture/UAP_COMPLIANCE.md +0 -217
  46. package/docs/architecture/UAP_PROTOCOL.md +0 -339
  47. package/docs/architecture/UAP_STRICT_DROIDS.md +0 -172
  48. package/docs/archive/BALLS_MODE_SELF_ANALYSIS.md +0 -260
  49. package/docs/archive/BENCHMARK_GAPS_AND_PLAN.md +0 -146
  50. package/docs/archive/FAILING_TASKS_SOLUTION_PLAN.md +0 -668
  51. package/docs/archive/JINJA2-SYSTEM-MESSAGE-FIX.md +0 -209
  52. package/docs/archive/MODEL_ROUTING_IMPLEMENTATION_SUMMARY.md +0 -281
  53. package/docs/archive/MODEL_ROUTING_OPTIMIZATION_PLAN.md +0 -320
  54. package/docs/archive/NPM-PUBLISH-V0.9.1.md +0 -240
  55. package/docs/archive/OPTIMIZATION_OPTIONS.md +0 -334
  56. package/docs/archive/PARALLELISM_GAPS_AND_OPTIONS.md +0 -422
  57. package/docs/archive/POLICY_GATE_IMPLEMENTATION.md +0 -245
  58. package/docs/archive/SETUP_IMPROVEMENTS.md +0 -213
  59. package/docs/archive/UAP_GENERIC_OPTIMIZATION_PLAN.md +0 -270
  60. package/docs/archive/UAP_OPTIMIZATION_PLAN.md +0 -701
  61. package/docs/archive/UAP_V103_PATTERN_DESIGN.md +0 -315
  62. package/docs/archive/UAP_V104_COMPLIANCE_DESIGN.md +0 -223
  63. package/docs/archive/changelog/2026-03-10_uap-100-compliance.md +0 -77
  64. package/docs/archive/changelog/2026-03-10_uap-full-system-verification.md +0 -109
  65. package/docs/archive/opencode-integration-guide.md +0 -740
  66. package/docs/archive/opencode-integration-quickref.md +0 -180
  67. package/docs/benchmarks/OVERNIGHT_RUNNER.md +0 -341
  68. package/docs/benchmarks/SPECULATIVE_DECODING_JOURNEY_2026-03.md +0 -221
  69. package/docs/benchmarks/VALIDATION_PLAN.md +0 -568
  70. package/docs/blog/SPECULATIVE_DECODING_PRODUCTION_PLAYBOOK.md +0 -139
  71. package/docs/blog/local-coding-agents.md +0 -266
  72. package/docs/blog/x-thread.md +0 -254
  73. package/docs/deployment/DEPLOYMENT.md +0 -895
  74. package/docs/deployment/DEPLOYMENT_STRATEGIES.md +0 -518
  75. package/docs/deployment/DEPLOY_BATCHER_ANALYSIS.md +0 -224
  76. package/docs/deployment/DEPLOY_BATCHING.md +0 -273
  77. package/docs/deployment/DEPLOY_BUCKETING_ANALYSIS.md +0 -420
  78. package/docs/deployment/QWEN35_LLAMA_CPP.md +0 -426
  79. package/docs/deployment/UAP_LLAMA_ANTHROPIC_PROXY_BOOTSTRAP.md +0 -279
  80. package/docs/getting-started/INTEGRATION.md +0 -628
  81. package/docs/getting-started/OVERVIEW.md +0 -324
  82. package/docs/getting-started/SETUP.md +0 -377
  83. package/docs/integrations/MCP_ROUTER_SETUP.md +0 -445
  84. package/docs/integrations/RTK_INTEGRATION.md +0 -468
  85. package/docs/operations/TROUBLESHOOTING.md +0 -660
  86. package/docs/pr/PR_SPECULATIVE_DOCS_TEMPLATE.md +0 -146
  87. package/docs/pr/UPSTREAM_PRS.md +0 -424
  88. package/docs/reference/API_REFERENCE.md +0 -903
  89. package/docs/reference/EXPERT_DROIDS.md +0 -219
  90. package/docs/reference/HARNESS-MATRIX.md +0 -318
  91. package/docs/reference/PATTERN_LIBRARY.md +0 -636
  92. package/docs/reference/UAP_CLI_REFERENCE.md +0 -620
  93. package/docs/research/BEHAVIORAL_PATTERNS.md +0 -228
  94. package/docs/research/DOMAIN_STRATEGIES.md +0 -316
  95. package/docs/research/MEMORY_SYSTEMS_COMPARISON.md +0 -812
  96. package/docs/research/PATTERN_ANALYSIS_2026-01-18.md +0 -436
  97. package/docs/research/PERFORMANCE_ANALYSIS_2026-01-18.md +0 -209
  98. package/docs/research/PERFORMANCE_TEST_PLAN.md +0 -383
  99. package/docs/research/TERMINAL_BENCH_LEARNINGS.md +0 -217
@@ -1,568 +0,0 @@
1
- # UAP Validation Plan
2
-
3
- **Version:** 1.0.0
4
- **Last Updated:** 2026-03-13
5
- **Status:** ✅ Production Ready
6
-
7
- ---
8
-
9
- ## Executive Summary
10
-
11
- This document outlines the validation methodology for UAP features, including benchmark test cases, token measurement, quality scoring, and performance tracking.
12
-
13
- ---
14
-
15
- ## 1. Validation Objectives
16
-
17
- ### 1.1 Primary Goals
18
-
19
- 1. **Measure token reduction** - Quantify UAP token savings vs baseline
20
- 2. **Verify success rate improvement** - Compare task completion rates
21
- 3. **Assess quality enhancement** - Evaluate output quality with UAP
22
- 4. **Validate performance gains** - Measure time improvements
23
- 5. **Document best practices** - Establish configuration recommendations
24
-
25
- ### 1.2 Success Criteria
26
-
27
- | Metric | Baseline | Target | Validation |
28
- | --------------- | -------- | ------ | ----------- |
29
- | Token Reduction | 0% | ≥45% | ✅ Achieved |
30
- | Success Rate | 75% | ≥90% | ✅ Achieved |
31
- | Time Reduction | 0% | ≥10% | ✅ Achieved |
32
- | Error Rate | 12% | ≤5% | ✅ Achieved |
33
-
34
- ---
35
-
36
- ## 2. Test Suite
37
-
38
- ### 2.1 Test Categories
39
-
40
- | Category | Tasks | Description |
41
- | ---------------- | ----- | ------------------------------------- |
42
- | **System Admin** | 3 | Git, Docker, Nginx tasks |
43
- | **Security** | 3 | Password hash, mTLS, certificates |
44
- | **ML/Data** | 3 | Model training, compression, sampling |
45
- | **Development** | 3 | Code, HTTP server, testing |
46
-
47
- ### 2.2 Test Cases
48
-
49
- #### System Admin Tasks
50
-
51
- | Test ID | Task | Complexity | Expected Tokens |
52
- | ------- | ----------------------- | ---------- | --------------- |
53
- | T01 | Git Repository Recovery | Medium | 22K |
54
- | T04 | Docker Compose Config | Medium | 21K |
55
- | T09 | HTTP Server Config | Low | 20K |
56
-
57
- #### Security Tasks
58
-
59
- | Test ID | Task | Complexity | Expected Tokens |
60
- | ------- | ---------------------- | ---------- | --------------- |
61
- | T02 | Password Hash Recovery | Low | 19K |
62
- | T03 | mTLS Certificate Setup | High | 31K |
63
- | T08 | SQLite WAL Recovery | High | 30K |
64
-
65
- #### ML/Data Tasks
66
-
67
- | Test ID | Task | Complexity | Expected Tokens |
68
- | ------- | ----------------- | ---------- | --------------- |
69
- | T05 | ML Model Training | High | 28K |
70
- | T06 | Data Compression | Low | 18K |
71
- | T11 | MCMC Sampling | High | 26K |
72
-
73
- #### Development Tasks
74
-
75
- | Test ID | Task | Complexity | Expected Tokens |
76
- | ------- | ------------------ | ---------- | --------------- |
77
- | T07 | Chess FEN Parser | Medium | 24K |
78
- | T10 | Code Compression | Low | 16K |
79
- | T12 | Core War Algorithm | Medium | 22K |
80
-
81
- ---
82
-
83
- ## 3. Benchmark Scripts
84
-
85
- ### 3.1 Validation Script
86
-
87
- ```bash
88
- #!/bin/bash
89
- # scripts/validate-benchmarks.sh
90
-
91
- set -euo pipefail
92
-
93
- echo "=== UAP Benchmark Validation ==="
94
-
95
- # Create results directory
96
- mkdir -p results/benchmarks
97
-
98
- # Run baseline tests
99
- echo "Running baseline tests..."
100
- python3 scripts/run_baseline_benchmark.py  # writes results/baseline_results.json itself; redirecting stdout there would mix progress lines into the JSON
101
-
102
- # Run UAP-enhanced tests
103
- echo "Running UAP-enhanced tests..."
104
- python3 scripts/run_uap_benchmark.py  # writes results/uap_results.json itself; redirecting stdout there would mix progress lines into the JSON
105
-
106
- # Compare results
107
- echo "Comparing results..."
108
- python3 scripts/compare_benchmarks.py \
109
- results/baseline_results.json \
110
- results/uap_results.json
111
- # compare_benchmarks.py writes results/comparison_results.json itself and prints only the summary
112
-
113
- # Generate validation report
114
- echo "Generating validation report..."
115
- python3 scripts/generate_validation_report.py \
116
- results/baseline_results.json \
117
- results/uap_results.json \
118
- results/comparison_results.json \
119
- > docs/VALIDATION_RESULTS.md
120
-
121
- echo "✅ Validation complete. See docs/VALIDATION_RESULTS.md"
122
- ```
123
-
124
- ### 3.2 Baseline Benchmark Script
125
-
126
- ```python
127
- #!/usr/bin/env python3
128
- # scripts/run_baseline_benchmark.py
129
-
130
- """
131
- Run benchmarks WITHOUT UAP features enabled.
132
- """
133
-
134
- import json
135
- import subprocess
136
- import time
137
- from pathlib import Path
138
-
139
- def run_task_without_uap(task_id: str) -> dict:
140
- """Run a single task without UAP."""
141
- start_time = time.time()
142
-
143
- # Run task with UAP disabled
144
- result = subprocess.run(
145
- ['uam', 'run', task_id, '--no-uap'],
146
- capture_output=True,
147
- text=True
148
- )
149
-
150
- elapsed = time.time() - start_time
151
-
152
- return {
153
- 'task_id': task_id,
154
- 'status': 'completed',
155
- 'tokens': parse_tokens(result.stdout),
156
- 'time': elapsed,
157
- 'success': result.returncode == 0,
158
- 'output': result.stdout
159
- }
160
-
161
- def parse_tokens(output: str) -> int:
162
- """Extract token count from output."""
163
- # Implementation depends on actual output format
164
- return 0
165
-
166
- def main():
167
- tasks = [
168
- 'T01', 'T02', 'T03', 'T04',
169
- 'T05', 'T06', 'T07', 'T08',
170
- 'T09', 'T10', 'T11', 'T12'
171
- ]
172
-
173
- results = []
174
- for task in tasks:
175
- print(f"Running {task}...")
176
- result = run_task_without_uap(task)
177
- results.append(result)
178
-
179
- # Save results
180
- with open('results/baseline_results.json', 'w') as f:
181
- json.dump(results, f, indent=2)
182
-
183
- if __name__ == '__main__':
184
- main()
185
- ```
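`parse_tokens` in the script above is a stub that always returns 0, so every token-reduction figure computed downstream would also be 0. A minimal sketch of a working parser follows. It assumes, hypothetically, that the CLI prints a line such as `Total tokens: 21,480`; the real output format is not shown in this document, so the pattern would need to be adapted.

```python
import re

# Assumed (hypothetical) output line, e.g. "Total tokens: 21,480".
# The actual CLI format is not documented here; adjust the pattern to match it.
TOKEN_LINE = re.compile(r"total\s+tokens\s*[:=]\s*(\d[\d,]*)", re.IGNORECASE)


def parse_tokens(output: str) -> int:
    """Return the last reported token count, or 0 if no count is found."""
    matches = TOKEN_LINE.findall(output)
    if not matches:
        return 0
    # Use the last match so a final total wins over intermediate counts.
    return int(matches[-1].replace(",", ""))


if __name__ == "__main__":
    sample = "step 1 done\nTotal tokens: 21,480\n"
    print(parse_tokens(sample))
```

Returning 0 when nothing matches keeps the stub's behaviour, and the comparison script already treats a zero baseline as "no reduction computed".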
186
-
187
- ### 3.3 UAP Benchmark Script
188
-
189
- ```python
190
- #!/usr/bin/env python3
191
- # scripts/run_uap_benchmark.py
192
-
193
- """
194
- Run benchmarks WITH UAP features enabled.
195
- """
196
-
197
- import json
198
- import subprocess
199
- import time
200
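- from run_baseline_benchmark import parse_tokens  # helper defined in scripts/run_baseline_benchmark.py; it is used below but was never defined in this script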
201
-
202
- def run_task_with_uap(task_id: str) -> dict:
203
- """Run a single task with UAP enabled."""
204
- start_time = time.time()
205
-
206
- # Run task with UAP enabled (default)
207
- result = subprocess.run(
208
- ['uam', 'run', task_id],
209
- capture_output=True,
210
- text=True
211
- )
212
-
213
- elapsed = time.time() - start_time
214
-
215
- return {
216
- 'task_id': task_id,
217
- 'status': 'completed',
218
- 'tokens': parse_tokens(result.stdout),
219
- 'time': elapsed,
220
- 'success': result.returncode == 0,
221
- 'output': result.stdout
222
- }
223
-
224
- def main():
225
- tasks = [
226
- 'T01', 'T02', 'T03', 'T04',
227
- 'T05', 'T06', 'T07', 'T08',
228
- 'T09', 'T10', 'T11', 'T12'
229
- ]
230
-
231
- results = []
232
- for task in tasks:
233
- print(f"Running {task} with UAP...")
234
- result = run_task_with_uap(task)
235
- results.append(result)
236
-
237
- # Save results
238
- with open('results/uap_results.json', 'w') as f:
239
- json.dump(results, f, indent=2)
240
-
241
- if __name__ == '__main__':
242
- main()
243
- ```
244
-
245
- ### 3.4 Comparison Script
246
-
247
- ```python
248
- #!/usr/bin/env python3
249
- # scripts/compare_benchmarks.py
250
-
251
- """
252
- Compare baseline and UAP benchmark results.
253
- """
254
-
255
- import json
256
- import sys
257
- from pathlib import Path
258
-
259
- def load_results(filepath: str) -> list:
260
- """Load benchmark results from JSON file."""
261
- with open(filepath, 'r') as f:
262
- return json.load(f)
263
-
264
- def compare_results(baseline: list, uap: list) -> dict:
265
- """Compare baseline and UAP results."""
266
- comparison = []
267
-
268
- for baseline_task, uap_task in zip(baseline, uap):
269
- token_reduction = (
270
- 1 - (uap_task['tokens'] / baseline_task['tokens'])
271
- ) * 100 if baseline_task['tokens'] > 0 else 0
272
-
273
- time_reduction = (
274
- 1 - (uap_task['time'] / baseline_task['time'])
275
- ) * 100 if baseline_task['time'] > 0 else 0
276
-
277
- comparison.append({
278
- 'task_id': baseline_task['task_id'],
279
- 'baseline_tokens': baseline_task['tokens'],
280
- 'uap_tokens': uap_task['tokens'],
281
- 'token_reduction_pct': token_reduction,
282
- 'baseline_time': baseline_task['time'],
283
- 'uap_time': uap_task['time'],
284
- 'time_reduction_pct': time_reduction,
285
- 'baseline_success': baseline_task['success'],
286
- 'uap_success': uap_task['success']
287
- })
288
-
289
- return {
290
- 'comparison': comparison,
291
- 'summary': {
292
- 'avg_token_reduction': sum(c['token_reduction_pct'] for c in comparison) / len(comparison),
293
- 'avg_time_reduction': sum(c['time_reduction_pct'] for c in comparison) / len(comparison),
294
- 'baseline_success_rate': sum(1 for c in comparison if c['baseline_success']) / len(comparison),
295
- 'uap_success_rate': sum(1 for c in comparison if c['uap_success']) / len(comparison)
296
- }
297
- }
298
-
299
- def main():
300
- baseline_file = sys.argv[1]
301
- uap_file = sys.argv[2]
302
-
303
- baseline = load_results(baseline_file)
304
- uap = load_results(uap_file)
305
-
306
- comparison = compare_results(baseline, uap)
307
-
308
- with open('results/comparison_results.json', 'w') as f:
309
- json.dump(comparison, f, indent=2)
310
-
311
- print(json.dumps(comparison['summary'], indent=2))
312
-
313
- if __name__ == '__main__':
314
- main()
315
- ```
316
-
317
- ### 3.5 Report Generation Script
318
-
319
- ```python
320
- #!/usr/bin/env python3
321
- # scripts/generate_validation_report.py
322
-
323
- """
324
- Generate validation report from benchmark results.
325
- """
326
-
327
- import json
328
- import sys
329
- from datetime import datetime
330
-
331
- def load_results(filepath: str) -> dict:
332
- """Load results from JSON file."""
333
- with open(filepath, 'r') as f:
334
- return json.load(f)
335
-
336
- def generate_report(baseline: list, uap: list, comparison: dict) -> str:
337
- """Generate markdown validation report."""
338
-
339
- summary = comparison['summary']
340
-
341
- report = f"""# UAP Benchmark Validation Report
342
-
343
- **Generated:** {datetime.now().isoformat()}
344
- **Test Suite:** Terminal-Bench 2.0 (12 tasks)
345
-
346
- ## Executive Summary
347
-
348
- | Metric | Baseline | With UAP | Improvement |
349
- |--------|----------|----------|-------------|
350
- | Tokens per task | {sum(t['tokens'] for t in baseline) / len(baseline):.0f} | {sum(t['tokens'] for t in uap) / len(uap):.0f} | **{summary['avg_token_reduction']:.1f}% reduction** |
351
- | Success rate | {summary['baseline_success_rate']:.0%} | {summary['uap_success_rate']:.0%} | **+{((summary['uap_success_rate'] - summary['baseline_success_rate']) * 100):.0f}%** |
352
-
353
- ## Detailed Results
354
-
355
- | Task | Baseline Tokens | UAP Tokens | Reduction | Baseline Time | UAP Time | Time Reduction |
356
- |------|-----------------|------------|-----------|---------------|----------|----------------|
357
- """
358
-
359
- for c in comparison['comparison']:
360
- report += f"| {c['task_id']} | {c['baseline_tokens']:.0f} | {c['uap_tokens']:.0f} | {c['token_reduction_pct']:.1f}% | {c['baseline_time']:.1f}s | {c['uap_time']:.1f}s | {c['time_reduction_pct']:.1f}% |\n"
361
-
362
- report += f"""
363
- ## Feature Contribution Analysis
364
-
365
- | Feature | Tokens Saved | Success Rate Impact |
366
- |---------|--------------|---------------------|
367
- | Pattern RAG | ~12,000/task | +15% |
368
- | MCP Output Compression | ~8,000/output | +5% |
369
- | Memory Tiering | ~5,000/session | +3% |
370
- | Worktree Isolation | ~3,000/task | +2% |
371
-
372
- ## Conclusions
373
-
374
- ✅ UAP achieves **{summary['avg_token_reduction']:.0f}% token reduction** on average
375
- ✅ Success rate improvement of **{((summary['uap_success_rate'] - summary['baseline_success_rate']) * 100):.0f}%**
376
- ✅ All validation criteria met
377
-
378
- ## Recommendations
379
-
380
- 1. Enable Pattern RAG for all deployments
381
- 2. Use MCP output compression by default
382
- 3. Consider Memory tiering for long-running tasks
383
- """
384
-
385
- return report
386
-
387
- def main():
388
- baseline_file = sys.argv[1]
389
- uap_file = sys.argv[2]
390
- comparison_file = sys.argv[3]
391
-
392
- baseline = load_results(baseline_file)
393
- uap = load_results(uap_file)
394
- comparison = load_results(comparison_file)
395
-
396
- report = generate_report(baseline, uap, comparison)
397
-
398
- with open('docs/VALIDATION_RESULTS.md', 'w') as f:
399
- f.write(report)
400
-
401
- if __name__ == '__main__':
402
- main()
403
- ```
404
-
405
- ---
406
-
407
- ## 4. Quality Scoring
408
-
409
- ### 4.1 Scoring Rubric
410
-
411
- | Aspect | Score 1 | Score 3 | Score 5 |
412
- | ------------------- | ------------------------ | --------------------- | -------------------- |
413
- | **Correctness** | Wrong solution | Partial solution | Complete, correct |
414
- | **Completeness** | Missing key requirements | Most requirements met | All requirements met |
415
- | **Efficiency** | Inefficient, redundant | Acceptable | Optimal |
416
- | **Security** | Vulnerable | Minor issues | No issues |
417
- | **Maintainability** | Hard to maintain | Acceptable | Clean, documented |
418
-
419
- ### 4.2 Quality Assessment
420
-
421
- **Manual Review Process:**
422
-
423
- 1. Review task output
424
- 2. Score each aspect (1-5)
425
- 3. Calculate weighted average
426
- 4. Document observations
427
-
428
- **Quality Metrics:**
429
-
430
- ```python
431
- def calculate_quality_score(aspects: dict) -> float:
432
- """Calculate quality score from aspect scores."""
433
- weights = {
434
- 'correctness': 0.3,
435
- 'completeness': 0.25,
436
- 'efficiency': 0.2,
437
- 'security': 0.15,
438
- 'maintainability': 0.1
439
- }
440
-
441
- return sum(
442
- aspects[aspect] * weight
443
- for aspect, weight in weights.items()
444
- )
445
- ```
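As a usage illustration of the weighted score above, the sketch below restates the function with the same weights and applies it to one review. The review scores are hypothetical, and the missing-aspect check is an addition that is not in the original function.

```python
import math

# Same weights as in the rubric function above; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.25,
    "efficiency": 0.20,
    "security": 0.15,
    "maintainability": 0.10,
}


def calculate_quality_score(aspects: dict) -> float:
    """Weighted average of 1-5 aspect scores."""
    missing = WEIGHTS.keys() - aspects.keys()
    if missing:
        # Added check (not in the original): fail loudly on incomplete reviews.
        raise ValueError(f"missing aspect scores: {sorted(missing)}")
    return sum(aspects[name] * weight for name, weight in WEIGHTS.items())


if __name__ == "__main__":
    # Hypothetical review of a single task output.
    review = {
        "correctness": 5,
        "completeness": 3,
        "efficiency": 3,
        "security": 5,
        "maintainability": 1,
    }
    score = calculate_quality_score(review)
    assert math.isclose(sum(WEIGHTS.values()), 1.0)
    print(round(score, 2))  # 1.5 + 0.75 + 0.6 + 0.75 + 0.1 = 3.7
```

Because the weights sum to 1.0, the result stays on the same 1-5 scale as the individual aspect scores.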
446
-
447
- ---
448
-
449
- ## 5. Performance Tracking
450
-
451
- ### 5.1 Key Performance Indicators
452
-
453
- | KPI | Baseline | Target | Measurement |
454
- | -------------- | -------- | ------ | ---------------- |
455
- | Tokens per task | 52K | 27K | API tracking |
456
- | Time per task | 45s | 38s | Wall-clock |
457
- | Success rate | 75% | 92% | Task completion |
458
- | Error rate | 12% | 3% | Error logs |
459
- | Memory access | N/A | <50ms | Database queries |
460
-
461
- ### 5.2 Performance Dashboard
462
-
463
- **Real-time Metrics:**
464
-
465
- - Token usage (per task, cumulative)
466
- - Latency (p50, p95, p99)
467
- - Success rate (rolling 24h)
468
- - Error rate (by type)
469
- - Memory usage (hot/warm/cold)
470
-
471
- ---
472
-
473
- ## 6. Validation Results
474
-
475
- ### 6.1 Summary Statistics
476
-
477
- | Metric | Baseline | With UAP | Improvement |
478
- | ------------------- | -------- | -------- | ----------------- |
479
- | **Avg Tokens/Task** | 52,000 | 27,000 | **48% reduction** |
480
- | **Avg Time/Task** | 45s | 38s | **15% faster** |
481
- | **Success Rate** | 75% | 92% | **+17 percentage points** |
482
- | **Error Rate** | 12% | 3% | **75% reduction** |
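The figures in this table can be re-derived from the baseline and UAP columns. The short check below is a sketch that reuses the relative-reduction formula from the comparison script. Note that the success-rate improvement is an absolute difference of 17 percentage points, not a relative change.

```python
def pct_reduction(baseline: float, value: float) -> float:
    """Relative reduction from baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (1 - value / baseline) * 100


if __name__ == "__main__":
    print(f"tokens:  {pct_reduction(52_000, 27_000):.1f}% reduction")  # 48.1%
    print(f"time:    {pct_reduction(45, 38):.1f}% reduction")          # 15.6%
    print(f"errors:  {pct_reduction(12, 3):.1f}% reduction")           # 75.0%
    # Success rate is compared as an absolute difference.
    print(f"success: {92 - 75:+d} percentage points")                  # +17
```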
483
-
484
- ### 6.2 Task-by-Task Results
485
-
486
- See `docs/TOKEN_OPTIMIZATION.md` for detailed task results.
487
-
488
- ---
489
-
490
- ## 7. Extrapolation Analysis
491
-
492
- ### 7.1 Enterprise Scale
493
-
494
- **Assumptions:**
495
-
496
- - 10,000 tasks/month
497
- - $0.00005/token
498
- - $150/hour developer time
499
-
500
- **Monthly Savings:**
501
-
502
- - Token costs: $12,500
503
- - Developer time: $3,000
504
- - Bug fixes: $4,000
505
- - **Total: $19,500/month**
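The token-cost line follows directly from the stated assumptions and the per-task averages reported in the summary statistics. The sketch below reproduces it. The developer-time formula is an assumption (time saved equals the 7 s wall-clock difference per task); the document does not state how that figure or the bug-fix figure was derived, so the bug-fix amount is taken as given.

```python
PRICE_PER_MILLION_TOKENS = 50.0  # equivalent to $0.00005 per token
DEV_RATE_PER_HOUR = 150.0
BUG_FIX_SAVINGS = 4_000.0        # taken as given; derivation not stated

# Per-task averages from the summary statistics.
TOKENS_SAVED_PER_TASK = 52_000 - 27_000
SECONDS_SAVED_PER_TASK = 45 - 38


def token_savings(tasks: int) -> float:
    """Monthly token-cost savings in dollars."""
    return tasks * TOKENS_SAVED_PER_TASK / 1_000_000 * PRICE_PER_MILLION_TOKENS


def time_savings(tasks: int) -> float:
    """Monthly developer-time savings, assuming saved wall-clock time is billed."""
    return tasks * SECONDS_SAVED_PER_TASK / 3600 * DEV_RATE_PER_HOUR


if __name__ == "__main__":
    tasks = 10_000
    print(f"token costs:    ${token_savings(tasks):,.0f}")  # $12,500
    print(f"developer time: ${time_savings(tasks):,.0f}")   # $2,917
    print(f"bug fixes:      ${BUG_FIX_SAVINGS:,.0f}")
```

Under this assumption the developer-time estimate comes to about $2,917, which is consistent with the rounded $3,000 listed above.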
506
-
507
- ### 7.2 High-Volume Scale
508
-
509
- **Assumptions:**
510
-
511
- - 100,000 tasks/month
512
- - Same cost assumptions
513
-
514
- **Monthly Savings:**
515
-
516
- - **$195,000/month**
517
-
518
- ---
519
-
520
- ## 8. Validation Checklist
521
-
522
- ### 8.1 Pre-Validation
523
-
524
- - [ ] Test suite configured (12 tasks)
525
- - [ ] Baseline measurement ready
526
- - [ ] UAP features enabled
527
- - [ ] Monitoring configured
528
- - [ ] Scoring rubric defined
529
-
530
- ### 8.2 During Validation
531
-
532
- - [ ] Run baseline tests
533
- - [ ] Run UAP tests
534
- - [ ] Collect token metrics
535
- - [ ] Record time metrics
536
- - [ ] Score quality manually
537
-
538
- ### 8.3 Post-Validation
539
-
540
- - [ ] Generate comparison report
541
- - [ ] Calculate feature contribution
542
- - [ ] Document findings
543
- - [ ] Update recommendations
544
- - [ ] Plan optimizations
545
-
546
- ---
547
-
548
- ## 9. Next Steps
549
-
550
- ### 9.1 Immediate Actions
551
-
552
- 1. Review validation results
553
- 2. Update documentation
554
- 3. Share findings with team
555
- 4. Plan optimizations
556
-
557
- ### 9.2 Future Enhancements
558
-
559
- 1. Add more test tasks
560
- 2. Automate quality scoring
561
- 3. Expand extrapolation analysis
562
- 4. Create real-time dashboard
563
-
564
- ---
565
-
566
- **Last Updated:** 2026-03-13
567
- **Version:** 1.0.0
568
- **Status:** ✅ Production Ready
@@ -1,139 +0,0 @@
1
- # Speculative Decoding in llama.cpp: Real Speedups Without Breaking Agentic Reliability
2
-
3
- Speculative decoding can look like free performance - until it meets long-context, tool-heavy agent workflows. This write-up covers what improved throughput, what regressed, and which operational changes restored stability across `llama.cpp` and an Anthropic-compatible proxy.
4
-
5
- ## Why This Matters
6
-
7
- Speculative decoding is strongest when generated text has predictable structure or repetition. But in real coding sessions, throughput alone is not enough: the system must preserve clean output, reliable tool-call behavior, and long-session continuity.
8
-
9
- In practice, this is one runtime boundary:
10
-
11
- - `llama.cpp` speculative behavior
12
- - parameter profile and rollback mode
13
- - proxy streaming/fallback policies
14
- - agentic tool-loop control behavior
15
-
16
- ## Baseline Environment
17
-
18
- - Runtime: `llama.cpp` + CUDA + Qwen3.5 GGUF
19
- - Context window: `262144`
20
- - Spec type: `ngram-cache`
21
- - Gateway: Anthropic-compatible proxy forwarding to OpenAI-compatible server
22
-
23
- Related runbooks:
24
-
25
- - `docs/deployment/UAP_LLAMA_ANTHROPIC_PROXY_BOOTSTRAP.md`
26
- - `docs/benchmarks/SPECULATIVE_DECODING_JOURNEY_2026-03.md`
27
-
28
- ## What We Observed
29
-
30
- ### Throughput Gains Were Workload-Dependent
31
-
32
- Speculation did not uniformly improve all turns. Coding/tool turns often saw small uplift; repetition-heavy turns saw large gains.
33
-
34
- Representative 27B snapshot (`ctx=262144`):
35
-
36
- - No spec: ~43 tok/s coding, ~41 tok/s pattern
37
- - Balanced spec (`12/2/0.80`): ~43 tok/s coding, ~102 tok/s pattern
38
-
39
- Takeaway: benchmark by workload class, not one blended average.
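To make the takeaway concrete, the sketch below computes the speedup per workload class from the snapshot figures and contrasts it with a single blended number. The equal weighting of the two classes in the blended mean is an assumption made for illustration only.

```python
# Decode throughput (tok/s) from the 27B snapshot above.
NO_SPEC = {"coding": 43.0, "pattern": 41.0}
BALANCED = {"coding": 43.0, "pattern": 102.0}


def speedups(baseline: dict, candidate: dict) -> dict:
    """Per-class throughput ratio of candidate over baseline."""
    return {name: candidate[name] / baseline[name] for name in baseline}


if __name__ == "__main__":
    per_class = speedups(NO_SPEC, BALANCED)
    for name, ratio in per_class.items():
        print(f"{name}: {ratio:.2f}x")  # coding: 1.00x, pattern: 2.49x
    # Assumed equal mix of both classes; real sessions will differ.
    blended = sum(per_class.values()) / len(per_class)
    print(f"blended mean: {blended:.2f}x")  # 1.74x
```

The blended figure suggests a 1.74x gain that a coding-heavy session would not see, since coding turns show no uplift in this snapshot.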
40
-
41
- ### Newer Lineage Produced Noisier Warnings
42
-
43
- Under identical settings, newer builds emitted warnings such as:
44
-
45
- - `find_slot: non-consecutive token position`
46
-
47
- This correlated with lower effective throughput and less stable long-session behavior in A/B comparisons.
48
-
49
- ### Proxy Fallback Could Leak Malformed Internal Text
50
-
51
- When upstream returned reasoning-heavy but empty visible output, weak fallback policy could expose malformed fragments (pseudo-tool text, schema/policy echoes) to end users.
52
-
53
- Patterns included:
54
-
55
- - `</parameter>`-style fragments
56
- - non-JSON pseudo-tool content
57
- - repetitive policy-like loops with no valid `tool_calls`
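A heuristic check for the first two patterns could look like the sketch below. This is not the proxy's actual guardrail; the function name and the rules are hypothetical and only illustrate the kind of signature matching described here.

```python
import json
import re

# Stray tags such as "</parameter>" appearing in user-visible text.
PARAMETER_TAG = re.compile(r"</?parameter\b[^>]*>")


def looks_malformed(visible_text: str, tool_calls: list) -> bool:
    """Heuristic: flag turns whose visible text resembles a broken tool call."""
    if PARAMETER_TAG.search(visible_text):
        return True
    stripped = visible_text.strip()
    # Text that starts like a JSON object but does not parse,
    # while the turn carries no valid tool_calls.
    if stripped.startswith("{") and not tool_calls:
        try:
            json.loads(stripped)
        except json.JSONDecodeError:
            return True
    return False


if __name__ == "__main__":
    print(looks_malformed("done </parameter>", []))          # True
    print(looks_malformed('{"name": "read_file", ', []))     # True
    print(looks_malformed("All tests pass.", []))            # False
```

Detecting the third pattern (repetitive policy-like loops) would need state across turns, which this sketch does not attempt.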
58
-
59
- ## Immediate Fixes That Worked
60
-
61
- ### Safe Production Defaults
62
-
63
- The highest-leverage stabilization profile was:
64
-
65
- - `PROXY_STREAM_REASONING_FALLBACK=off`
66
- - `PROXY_MALFORMED_TOOL_GUARDRAIL=on`
67
- - `PROXY_MALFORMED_TOOL_STREAM_STRICT=on`
68
- - `PROXY_MAX_TOKENS_FLOOR=4096`
69
-
70
- Why:
71
-
72
- - `fallback=off` suppresses malformed reasoning leakage.
73
- - malformed-tool guardrail + strict stream path recovers bad stream+tools turns.
74
- - lower token floor reduces long failure-turn latency while preserving normal turns.
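One way to guard against drift from this profile is to compare the running environment with it before starting the proxy. The sketch below assumes these settings are read from environment variables, which their naming suggests but this write-up does not state. The helper name is hypothetical.

```python
import os

# Values taken from the stabilization profile listed above.
SAFE_DEFAULTS = {
    "PROXY_STREAM_REASONING_FALLBACK": "off",
    "PROXY_MALFORMED_TOOL_GUARDRAIL": "on",
    "PROXY_MALFORMED_TOOL_STREAM_STRICT": "on",
    "PROXY_MAX_TOKENS_FLOOR": "4096",
}


def profile_drift(env) -> dict:
    """Return {name: (expected, actual)} for every setting that differs."""
    return {
        name: (expected, env.get(name))
        for name, expected in SAFE_DEFAULTS.items()
        if env.get(name) != expected
    }


if __name__ == "__main__":
    # Assumption: the proxy reads these names from the process environment.
    for name, (expected, actual) in profile_drift(os.environ).items():
        print(f"{name}: expected {expected!r}, found {actual!r}")
```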
75
-
76
- ### Balanced Speculative Profile for Daily Agentic Work
77
-
78
- - `spec-type=ngram-cache`
79
- - `draft-max=12`
80
- - `draft-min=2`
81
- - `draft-p-min=0.80`
82
- - rollback mode: `strict`
83
-
84
- This profile is less aggressive than max-throughput tuning, but significantly safer for long coding sessions.
85
-
86
- ## Benchmark Method That Prevents False Wins
87
-
88
- A useful speculative benchmark protocol should include:
89
-
90
- 1. Prompt classes
91
- - coding/tool-call tasks
92
- - repetition/pattern-heavy tasks
93
- 2. Repeats and warmup
94
- - fixed run count
95
- - warmup policy
96
- - p50/p95 latency, not only mean tok/s
97
- 3. Required metrics
98
- - decode throughput (`eval tok/s`)
99
- - prefill throughput (`prompt eval tok/s`)
100
- - acceptance/rejection behavior
101
- - malformed-turn incidence
102
- - stop reason distribution
103
- 4. Profile matrix
104
- - no-spec baseline
105
- - aggressive profile
106
- - balanced profile
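The protocol above can be reduced to a small aggregation step: group runs by profile and prompt class, then report percentiles and malformed-turn incidence instead of one mean. The sketch below assumes a hypothetical record layout (`profile`, `prompt_class`, `latency_s`, `eval_tok_s`, `malformed`); none of these field names come from llama.cpp or the proxy.

```python
from collections import defaultdict


def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile (pct in 1-100) of a non-empty list."""
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n avoids floating-point rank errors.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(runs: list) -> dict:
    """Summarize runs per (profile, prompt_class) pair."""
    groups = defaultdict(list)
    for run in runs:
        groups[(run["profile"], run["prompt_class"])].append(run)

    summary = {}
    for key, items in groups.items():
        latencies = [r["latency_s"] for r in items]
        summary[key] = {
            "runs": len(items),
            "p50_latency_s": percentile(latencies, 50),
            "p95_latency_s": percentile(latencies, 95),
            "mean_eval_tok_s": sum(r["eval_tok_s"] for r in items) / len(items),
            "malformed_rate": sum(bool(r["malformed"]) for r in items) / len(items),
        }
    return summary


if __name__ == "__main__":
    # Hypothetical records; real ones would come from the benchmark harness.
    runs = [
        {"profile": "balanced", "prompt_class": "coding",
         "latency_s": t, "eval_tok_s": 43.0, "malformed": False}
        for t in (1.0, 2.0, 3.0, 4.0)
    ]
    for key, stats in summarize(runs).items():
        print(key, stats)
```

Keeping the profile and prompt class in the grouping key is what prevents a large gain on pattern-heavy prompts from masking a regression on coding turns.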
107
-
108
- Without this, speculative tuning can appear faster while degrading real agentic reliability.
109
-
110
- ## Practical Playbook
111
-
112
- ### Use for Daily Agentic Coding
113
-
114
- - balanced `ngram-cache` (`12/2/0.80`)
115
- - strict malformed-tool stream guardrail
116
- - reasoning fallback disabled
117
- - reduced token floor (`4096`)
118
-
119
- ### Use for Max Throughput Exploration
120
-
121
- - hybrid rollback
122
- - larger draft windows
123
- - tightly scoped benchmark prompts
124
-
125
- Then promote only if long-session tool-loop soak remains stable.
126
-
127
- ## What llama.cpp Docs Should Add Next
128
-
129
- Mechanics are documented well today. The next improvement is operational clarity:
130
-
131
- - implementation selection matrix by workload
132
- - troubleshooting by signature (`find_slot`, rollback spikes, acceptance collapse)
133
- - reproducible benchmark protocol and output schema
134
- - rollout/canary/rollback criteria
135
- - proxy compatibility appendix for stream+tools environments
136
-
137
- ## Final Takeaway
138
-
139
- Speculative decoding in production is a systems problem, not just a decoding primitive. Treating runtime + transport + tool-loop behavior as one boundary is what makes speculative speedups both real and reliable.