adaptive-utility-agent 1.0.0__py3-none-any.whl

Files changed (57)
  1. adaptive_utility_agent-1.0.0.dist-info/METADATA +650 -0
  2. adaptive_utility_agent-1.0.0.dist-info/RECORD +57 -0
  3. adaptive_utility_agent-1.0.0.dist-info/WHEEL +4 -0
  4. adaptive_utility_agent-1.0.0.dist-info/entry_points.txt +2 -0
  5. adaptive_utility_agent-1.0.0.dist-info/licenses/LICENSE +674 -0
  6. adaptive_utility_agent-1.0.0.dist-info/licenses/LICENSE-CC-BY-4.0 +16 -0
  7. aua/__init__.py +261 -0
  8. aua/arbiter.py +581 -0
  9. aua/assertions_store.py +301 -0
  10. aua/auth.py +352 -0
  11. aua/blue_green.py +167 -0
  12. aua/certs.py +187 -0
  13. aua/chat.py +168 -0
  14. aua/cli.py +1789 -0
  15. aua/confidence_updater.py +21 -0
  16. aua/config.py +761 -0
  17. aua/contradiction_detector.py +246 -0
  18. aua/correction_loop.py +198 -0
  19. aua/defaults/__init__.py +1 -0
  20. aua/defaults/registry.py +183 -0
  21. aua/doctor.py +843 -0
  22. aua/encryption.py +153 -0
  23. aua/endpoints.py +418 -0
  24. aua/eval.py +278 -0
  25. aua/field_classifier.py +415 -0
  26. aua/hooks.py +166 -0
  27. aua/hot_reload.py +240 -0
  28. aua/logging_config.py +157 -0
  29. aua/metrics.py +258 -0
  30. aua/middleware.py +212 -0
  31. aua/otel.py +136 -0
  32. aua/plugins/__init__.py +0 -0
  33. aua/plugins/errors.py +194 -0
  34. aua/plugins/interfaces.py +319 -0
  35. aua/plugins/registry.py +210 -0
  36. aua/presets.py +98 -0
  37. aua/py.typed +0 -0
  38. aua/rate_limit.py +179 -0
  39. aua/rollback.py +471 -0
  40. aua/router.py +1435 -0
  41. aua/safety.py +110 -0
  42. aua/secrets.py +225 -0
  43. aua/serve.py +595 -0
  44. aua/session.py +152 -0
  45. aua/state.py +356 -0
  46. aua/status.py +355 -0
  47. aua/templates/__init__.py +1 -0
  48. aua/templates/registry.py +61 -0
  49. aua/tiers/a100-cluster.yaml +70 -0
  50. aua/tiers/a100.yaml +10 -0
  51. aua/tiers/macbook.yaml +56 -0
  52. aua/tiers/quad-4090.yaml +84 -0
  53. aua/tiers/rtx4090.yaml +10 -0
  54. aua/tiers/single-4090.yaml +70 -0
  55. aua/utility_scorer.py +302 -0
  56. aua/version.py +11 -0
  57. aua/webhooks.py +181 -0
@@ -0,0 +1,650 @@
1
+ Metadata-Version: 2.4
2
+ Name: adaptive-utility-agent
3
+ Version: 1.0.0
4
+ Summary: Adaptive Utility Agents — a Django-like framework for adaptive multi-model LLM systems.
5
+ Project-URL: Homepage, https://praneethtota.github.io/Adaptive-Utility-Agent
6
+ Project-URL: Repository, https://github.com/praneethtota/Adaptive-Utility-Agent
7
+ Project-URL: Whitepaper, https://praneethtota.github.io/Adaptive-Utility-Agent/whitepaper_v05.html
8
+ Author: Praneeth Tota
9
+ License: GPL-3.0
10
+ License-File: LICENSE
11
+ License-File: LICENSE-CC-BY-4.0
12
+ Keywords: agents,arbitration,dpo,llm,routing,specialist,utility
13
+ Classifier: Development Status :: 4 - Beta
14
+ Classifier: Intended Audience :: Developers
15
+ Classifier: Intended Audience :: Science/Research
16
+ Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
17
+ Classifier: Programming Language :: Python :: 3.10
18
+ Classifier: Programming Language :: Python :: 3.11
19
+ Classifier: Programming Language :: Python :: 3.12
20
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
21
+ Requires-Python: >=3.10
22
+ Requires-Dist: click>=8.1.0
23
+ Requires-Dist: fastapi>=0.111.0
24
+ Requires-Dist: filelock>=3.13.0
25
+ Requires-Dist: httpx>=0.27.0
26
+ Requires-Dist: pydantic>=2.0.0
27
+ Requires-Dist: pyyaml>=6.0
28
+ Requires-Dist: rich>=13.0.0
29
+ Requires-Dist: scipy>=1.11.0
30
+ Requires-Dist: uvicorn[standard]>=0.30.0
31
+ Provides-Extra: dev
32
+ Requires-Dist: black>=24.0; extra == 'dev'
33
+ Requires-Dist: build>=1.2; extra == 'dev'
34
+ Requires-Dist: isort>=5.0; extra == 'dev'
35
+ Requires-Dist: mypy>=1.8; extra == 'dev'
36
+ Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
37
+ Requires-Dist: pytest>=8.0; extra == 'dev'
38
+ Requires-Dist: respx>=0.21; extra == 'dev'
39
+ Requires-Dist: ruff>=0.4; extra == 'dev'
40
+ Requires-Dist: types-pyyaml; extra == 'dev'
41
+ Provides-Extra: otel
42
+ Requires-Dist: opentelemetry-api>=1.20; extra == 'otel'
43
+ Requires-Dist: opentelemetry-exporter-otlp>=1.20; extra == 'otel'
44
+ Requires-Dist: opentelemetry-sdk>=1.20; extra == 'otel'
45
+ Provides-Extra: postgres
46
+ Requires-Dist: asyncpg>=0.29; extra == 'postgres'
47
+ Provides-Extra: train
48
+ Requires-Dist: accelerate>=0.28.0; extra == 'train'
49
+ Requires-Dist: peft>=0.9.0; extra == 'train'
50
+ Requires-Dist: torch>=2.1.0; extra == 'train'
51
+ Requires-Dist: transformers>=4.40.0; extra == 'train'
52
+ Requires-Dist: trl>=0.8.0; extra == 'train'
53
+ Provides-Extra: ui
54
+ Provides-Extra: vllm
55
+ Requires-Dist: vllm>=0.4.0; extra == 'vllm'
56
+ Description-Content-Type: text/markdown
57
+
58
+ # Adaptive Utility Agents
59
+
60
+ > **The central failure mode of deployed language models is error repetition. This project builds AI agents that actively work against it — detecting errors, correcting behavior, and not repeating mistakes between model releases.**
61
+
62
+ ---
63
+
64
+ ## 📖 Documentation
65
+
66
+ **🌐 https://praneethtota.github.io/Adaptive-Utility-Agent**
67
+
68
+ The full site includes the whitepaper with rendered math, an architecture-first builder's tutorial with code walkthroughs, and eight domain deep-dives written for specific practitioner audiences. If you're reading this on GitHub, the site is the better starting point.
69
+
70
+ | Page | Audience | Link |
71
+ |---|---|---|
72
+ | **Landing page** | Everyone | [whitepaper.html](https://praneethtota.github.io/Adaptive-Utility-Agent/whitepaper.html) |
73
+ | **Whitepaper** (overview) | Researchers, theorists | [whitepaper_overview.html](https://praneethtota.github.io/Adaptive-Utility-Agent/whitepaper_overview.html) |
74
+ | **Whitepaper** (theory §§4–9) | Researchers | [whitepaper_theory.html](https://praneethtota.github.io/Adaptive-Utility-Agent/whitepaper_theory.html) |
75
+ | **Whitepaper** (architecture §10) | Engineers | [whitepaper_architecture.html](https://praneethtota.github.io/Adaptive-Utility-Agent/whitepaper_architecture.html) |
76
+ | **Whitepaper** (results + roadmap) | Everyone | [whitepaper_results.html](https://praneethtota.github.io/Adaptive-Utility-Agent/whitepaper_results.html) |
77
+ | **Whitepaper** (Appendix A — data) | Researchers | [whitepaper_appendix_a.html](https://praneethtota.github.io/Adaptive-Utility-Agent/whitepaper_appendix_a.html) |
78
+ | **Whitepaper** (Appendix B — proofs) | Theorists | [whitepaper_appendix_b.html](https://praneethtota.github.io/Adaptive-Utility-Agent/whitepaper_appendix_b.html) |
79
+ | **Whitepaper** (Appendix C — examples) | Practitioners | [whitepaper_appendix_c.html](https://praneethtota.github.io/Adaptive-Utility-Agent/whitepaper_appendix_c.html) |
80
+ | **Builder's Tutorial** | ML engineers, agent builders | [tutorial.html](https://praneethtota.github.io/Adaptive-Utility-Agent/tutorial.html) |
81
+ | **Production Architecture** | DevOps, platform engineers | [productionizing.html](https://praneethtota.github.io/Adaptive-Utility-Agent/productionizing.html) |
82
+ | AI Data Centers | Inference infra, GPU cloud | [domain_ai_datacenters.html](https://praneethtota.github.io/Adaptive-Utility-Agent/domain_ai_datacenters.html) |
83
+ | Self-Driving Vehicles | Waymo, Cruise, Aurora | [domain_self_driving.html](https://praneethtota.github.io/Adaptive-Utility-Agent/domain_self_driving.html) |
84
+ | Autonomous Systems | Robotics, safety-case engineering | [domain_autonomous_systems.html](https://praneethtota.github.io/Adaptive-Utility-Agent/domain_autonomous_systems.html) |
85
+ | Software Engineering | Coding agents, dev-tools | [domain_software_engineering.html](https://praneethtota.github.io/Adaptive-Utility-Agent/domain_software_engineering.html) |
86
+ | Dynamic Pricing | Pricing platforms, marketplaces | [domain_dynamic_pricing.html](https://praneethtota.github.io/Adaptive-Utility-Agent/domain_dynamic_pricing.html) |
87
+ | Energy Systems | Grid software, DER, smart home | [domain_energy_systems.html](https://praneethtota.github.io/Adaptive-Utility-Agent/domain_energy_systems.html) |
88
+ | Creative Systems | Generative media, content platforms | [domain_creative_systems.html](https://praneethtota.github.io/Adaptive-Utility-Agent/domain_creative_systems.html) |
89
+ | Recommendation Engines | RecSys, personalization platforms | [domain_recommendation_engines.html](https://praneethtota.github.io/Adaptive-Utility-Agent/domain_recommendation_engines.html) |
90
+ | Roadmap | Everyone | [aua_roadmap.html](https://praneethtota.github.io/Adaptive-Utility-Agent/aua_roadmap.html) |
91
+
92
+
93
+ ---
94
+
95
+ ## 🚀 Quickstart (v1.0)
96
+
97
+ ### 1. Install
98
+
99
+ ```bash
100
+ # Runtime only (CPU / Ollama)
101
+ pip install adaptive-utility-agent
102
+
103
+ # With GPU serving backend (Linux, CUDA required)
104
+ pip install "adaptive-utility-agent[vllm]"
105
+
106
+ # Development (includes test tools)
107
+ pip install "adaptive-utility-agent[dev]"
108
+ ```
109
+
110
+ ### 2. Scaffold a project
111
+
112
+ ```bash
113
+ # Mac/CPU — uses Ollama (install with: brew install ollama)
114
+ aua init my-project --tier macbook
115
+
116
+ # Single RTX 4090 — uses vLLM with AWQ quantization
117
+ aua init my-project --tier single-4090
118
+
119
+ # Quad RTX 4090 — dedicated GPU per specialist
120
+ aua init my-project --tier quad-4090
121
+
122
+ # A100 80 GB — fp16, no quantization
123
+ aua init my-project --tier a100-cluster
124
+
125
+ cd my-project
126
+ ```
127
+
128
+ ### 3. Check your setup
129
+
130
+ ```bash
131
+ aua doctor
132
+ # Every check shows PASS / FAIL / WARN with fix instructions.
133
+ # Exit 0 = all good. Exit 1 = at least one failure.
134
+
135
+ aua doctor --json # Machine-readable JSON output
136
+ aua doctor --strict # Treat warnings as failures (exit 2)
137
+ ```
138
+
139
+ ### 4. Start the system
140
+
141
+ ```bash
142
+ aua serve # start specialists + router
143
+ aua serve --dry-run # print commands without executing
144
+ aua serve --tier single-4090 # override tier at startup
145
+ aua serve --reuse-running # skip port-conflict check
146
+ ```
147
+
148
+ ### 5. Send a query
149
+
150
+ ```bash
151
+ # Single query (cURL)
152
+ curl -X POST http://localhost:8000/query -H "Content-Type: application/json" -d '{"query": "Write binary search in Python. State time complexity."}'
153
+
154
+ # Streaming (SSE)
155
+ curl -N http://localhost:8000/query/stream -X POST -H "Content-Type: application/json" -d '{"query": "Explain the VCG mechanism."}'
156
+
157
+ # Python (run inside a coroutine, e.g. via asyncio.run(); a bare await fails in a plain script)
158
+ from aua import Router
159
+ from aua.config import load_config
160
+
161
+ config = load_config("aua_config.yaml")
162
+ router = Router.from_config(config)
163
+ result = await router.query("Write bubble sort. What is its O complexity?")
164
+ print(result.response)
165
+ print(f"U={result.u_score:.3f} mode={result.routing_mode}")
166
+ ```
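The Python lines in the block above use a bare `await`, which only works inside a coroutine or an async REPL. A minimal sketch of driving the same call pattern from a plain script follows. `StubRouter` and `StubResult` are hypothetical stand-ins so the sketch runs without the package or a live server; only the names `query`, `response`, `u_score`, and `routing_mode` come from the snippet above.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubResult:
    """Hypothetical stand-in mirroring the fields used in the quickstart snippet."""
    response: str
    u_score: float
    routing_mode: str


class StubRouter:
    """Hypothetical stand-in for aua.Router; returns canned data, calls no model."""

    async def query(self, text: str) -> StubResult:
        await asyncio.sleep(0)  # yield control once, as a real network call would
        return StubResult(response=f"echo: {text}", u_score=0.5, routing_mode="stub")


async def main() -> StubResult:
    # With the real package this would be Router.from_config(load_config(...)).
    router = StubRouter()
    result = await router.query("Write bubble sort. What is its O complexity?")
    print(result.response)
    print(f"U={result.u_score:.3f}  mode={result.routing_mode}")
    return result


# asyncio.run() creates the event loop, runs main() to completion, and closes the loop.
result = asyncio.run(main())
```

Swapping `StubRouter()` for the real router should leave the `asyncio.run(main())` structure unchanged, assuming `Router.query` is a coroutine as the quickstart suggests.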
167
+
168
+ ### 6. Monitor
169
+
170
+ ```bash
171
+ aua status # live terminal dashboard (auto-refreshes)
172
+ aua status --once # single snapshot, then exit
173
+ aua status --json # JSON output
174
+ aua status --url http://host:8000 # remote router
175
+ ```
176
+
177
+ ### 7. Roll back a model promotion
178
+
179
+ ```bash
180
+ aua rollback --specialist swe # interactive
181
+ aua rollback --specialist swe --yes # skip confirmation
182
+ aua rollback --dry-run # preview only
183
+ aua rollback --all --yes # roll back every specialist
184
+ ```
185
+
186
+ ### Runtime layout
187
+
188
+ ```
189
+ my-project/
190
+ ├── aua_config.yaml ← edit this to change models/ports/tiers
191
+ ├── models/ ← place AWQ model files here
192
+ ├── dpo_pairs/ ← accumulated automatically
193
+ ├── results/ ← experiment outputs
194
+ ├── logs/ ← CLI logs
195
+ └── .aua/ ← runtime artifacts (auto-created by aua serve)
196
+ ├── logs/ ← per-service log files
197
+ ├── pids/ ← PID files
198
+ ├── state/ ← promotions.jsonl
199
+ └── checkpoints/ ← model symlinks
200
+ ```
201
+
202
+ ### Supported tiers
203
+
204
+ | Tier | Hardware | Backend | Specialists |
205
+ |---|---|---|---|
206
+ | `macbook` | Apple M-series / Intel Mac | Ollama | swe, math |
207
+ | `single-4090` | 1× RTX 4090 24 GB | vLLM AWQ | swe, math |
208
+ | `quad-4090` | 4× RTX 4090 (dedicated per GPU) | vLLM AWQ | swe, math, law |
209
+ | `a100-cluster` | 1× A100 80 GB | vLLM fp16 | swe, math |
210
+
211
+ Aliases `rtx4090` → `single-4090` and `a100` → `a100-cluster` remain for backward compatibility.
212
+
213
+ ---
214
+
215
+ ## License
216
+
217
+ **Code:** GNU General Public License v3.0 — see `LICENSE`
218
+ **Whitepaper:** Creative Commons Attribution 4.0 — see `LICENSE-CC-BY-4.0`
219
+
220
+ If you build on this work, please cite:
221
+ > Tota, P. (2026). *Adaptive Utility Agents: A Framework for Self-Optimizing AI Systems* (v1.0). GitHub. https://github.com/praneethtota/Adaptive-Utility-Agent
222
+
223
+ ---
224
+
225
+ ## The Problem
226
+
227
+ Deployed AI systems are static artifacts. A model that hallucinates today will hallucinate the same thing tomorrow, and every day until the next version ships — which may be months away. There is no feedback loop between detected errors and model behavior in the space between versions.
228
+
229
+ This project addresses that structural absence. The goal is **online learning and error non-repetition**: an agent that detects its own errors, adjusts behavior in response, and does not repeat those errors — continuously, between releases, without a new training cycle.
230
+
231
+ The work is grounded in multi-attribute utility theory from economics, extended by treating utility as a control signal in a feedback system rather than a static objective. It draws on mechanism design — specifically the Vickrey-Clarke-Groves (VCG) mechanism — for arbitration and incentive alignment across model components, and on Kalman filtering, Lyapunov stability analysis, and the Mann-Whitney dominance statistic for the formal foundations of each utility component.
232
+
233
+ ---
234
+
235
+ ## The Core Mechanism: Utility as a Control Law
236
+
237
+ ```
238
+ U = w_e(f) · E + w_c(f) · C + w_k(f) · K
239
+
240
+ E — Efficacy: performance relative to human baseline [0, 1]
241
+ C — Confidence: internal consistency, penalized by contradictions
242
+ K — Curiosity: exploration bonus for high-upside uncertain domains
243
+ f — field (surgery, law, software, creative, ...)
244
+ ```
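As a rough illustration of the formula above, the sketch below evaluates U for two fields. The weight values are hypothetical (chosen only so that each field's weights sum to 1) and are not taken from the package, and treating C and K as values in [0, 1] is likewise an assumption; the text states that range only for E.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldWeights:
    """Weights w_e(f), w_c(f), w_k(f) for one field; values below are illustrative."""
    w_e: float
    w_c: float
    w_k: float


# Hypothetical weights, not the package's: each row sums to 1.
WEIGHTS = {
    "surgery": FieldWeights(w_e=0.35, w_c=0.60, w_k=0.05),
    "creative": FieldWeights(w_e=0.40, w_c=0.20, w_k=0.40),
}


def utility(field: str, efficacy: float, confidence: float, curiosity: float) -> float:
    """Return U = w_e(f)*E + w_c(f)*C + w_k(f)*K for the given field."""
    for name, value in (("E", efficacy), ("C", confidence), ("K", curiosity)):
        if not 0.0 <= value <= 1.0:  # assumed range for all three terms
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    w = WEIGHTS[field]
    return w.w_e * efficacy + w.w_c * confidence + w.w_k * curiosity


# The same (E, C, K) triple scores differently depending on the field.
print(f"surgery : {utility('surgery', 0.8, 0.5, 0.2):.2f}")
print(f"creative: {utility('creative', 0.8, 0.5, 0.2):.2f}")
```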
245
+
246
+ The utility function is **not a monitoring metric**. It is the governing control law over the agent's behavior at every timescale:
247
+
248
+ - **At training time**: field penalty multipliers are DPO loss weights — a surgical contradiction is penalized 10× harder than a creative writing mistake at the weight-update level
249
+ - **During deployment**: utility deviation triggers behavioral corrections and controls whether a new model version is accepted
250
+ - **Across calibration cycles**: utility score determines which interactions generate DPO training pairs and how strongly each pair is weighted
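The training-time bullet can be sketched as a per-pair weight on the standard DPO loss. This is only an illustration of the idea, not the package's training code: the 10:1 surgery-to-creative ratio is from the text, while the absolute multipliers, the `beta` value, and the function name are assumptions.

```python
import math

# 10:1 ratio from the text; the absolute values are assumed for illustration.
PENALTY = {"surgery": 10.0, "creative": 1.0}


def weighted_dpo_loss(field: str,
                      logp_chosen: float, logp_rejected: float,
                      ref_chosen: float, ref_rejected: float,
                      beta: float = 0.1) -> float:
    """Field-weighted DPO loss for a single preference pair.

    Standard DPO loss is -log(sigmoid(beta * margin)), where the margin is the
    policy's log-ratio advantage of chosen over rejected relative to the
    reference model. The field penalty scales that loss.
    """
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    # -log(sigmoid(x)) == log(1 + exp(-x)); log1p keeps it accurate near zero.
    base_loss = math.log1p(math.exp(-beta * margin))
    return PENALTY[field] * base_loss


pair = dict(logp_chosen=-12.0, logp_rejected=-11.0, ref_chosen=-12.5, ref_rejected=-10.5)
print(weighted_dpo_loss("surgery", **pair) / weighted_dpo_loss("creative", **pair))
```

Because the weight multiplies the loss, it scales the gradient of every pair from that field by the same factor.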
251
+
252
+ The additive weighted structure is not a convenience: it is the unique functional form satisfying five behavioral axioms (monotonicity, continuity, separability, field invariance, linear scaling invariance). This is proved from first principles via Debreu's representation theorem and the Cauchy functional equation, using continuity only, with no differentiability required (Theorem B.1, Appendix B).
253
+
254
+ | Term | Name | Formal grounding |
255
+ |---|---|---|
256
+ | **E** | Efficacy | Mann-Whitney dominance probability under log-logistic model (Proposition B.3) |
257
+ | **C** | Confidence | Kalman-optimal EMA estimator for ρ = 0.05 noise ratio; geometric convergence with noise floor (Theorems B.4, B.5) |
258
+ | **K** | Curiosity | UCB-inspired exploration bonus; 50% cap enforces exploitation dominance (Proposition B.6) |
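The Kalman claim in the Confidence row can be sanity-checked numerically. The sketch assumes that ρ denotes the ratio of process-noise variance Q to measurement-noise variance R in a scalar random-walk model; that reading is an assumption, since the appendix is not reproduced here. Under it, the steady-state Kalman gain K solves K²/(1 − K) = ρ, and ρ = 0.05 gives K = 0.2, the EMA coefficient α listed for Theorem B.4.

```python
import math


def steady_state_gain(rho: float) -> float:
    """Closed-form steady-state Kalman gain for a scalar random walk, rho = Q / R.

    K solves K**2 / (1 - K) = rho, so K = (-rho + sqrt(rho**2 + 4*rho)) / 2.
    """
    return (-rho + math.sqrt(rho * rho + 4.0 * rho)) / 2.0


def iterate_gain(rho: float, steps: int = 200) -> float:
    """Run the scalar Riccati recursion (R normalised to 1) and return the final gain."""
    p_post, gain = 1.0, 0.0
    for _ in range(steps):
        p_prior = p_post + rho            # predict: variance grows by Q
        gain = p_prior / (p_prior + 1.0)  # update: K = P / (P + R)
        p_post = (1.0 - gain) * p_prior   # posterior variance after the update
    return gain


def ema_update(estimate: float, observation: float, alpha: float = 0.2) -> float:
    """One EMA step: move the estimate a fraction alpha toward the observation."""
    return estimate + alpha * (observation - estimate)


print(f"closed form: {steady_state_gain(0.05):.6f}")
print(f"iterated   : {iterate_gain(0.05):.6f}")
```

With the gain fixed at its steady-state value, the Kalman update has the same form as `ema_update`, which is the sense in which the two estimators coincide under this model.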
259
+
260
+ Field weights and minimum competence bounds are derived from existing societal licensing standards — medical malpractice thresholds, ICAO Annex 13 aviation certification, ISO 26262 safety classifications — making them principled rather than arbitrary.
261
+
262
+ ---
263
+
264
+ ## Applications and Motivation (§2)
265
+
266
+ The framework applies to any system that makes real-time decisions under competing objectives and must improve from experience without waiting for a full retrain. Seven worked domains from §2 of the whitepaper follow, each with a dedicated deep-dive on the [documentation site](https://praneethtota.github.io/Adaptive-Utility-Agent):
267
+
268
+ **Autonomous Vehicles** — A self-driving vehicle balances safety, efficiency, and comfort simultaneously. Weights shift automatically by context: safety dominates in school zones (w_s=0.90), efficiency rises in emergency transport (w_e=0.40). When sensor fusion uncertainty drives confidence below C_min=0.85, the vehicle abstains from the manoeuvre rather than proceeding at reduced reliability. Three Jetson-class specialists (perception, motion planning, traffic rules) consume ~110W total versus 700W for a single datacenter GPU — and a monolithic frontier model cannot fit a vehicle's power envelope at all. → [Self-Driving deep-dive](https://praneethtota.github.io/Adaptive-Utility-Agent/domain_self_driving_v0_5.html) · [Autonomous Systems deep-dive](https://praneethtota.github.io/Adaptive-Utility-Agent/domain_autonomous_systems_v0_5.html)
269
+
270
+ **Drone Delivery** — A delivery drone weighs speed against energy and airspace safety in real time. An approaching storm shifts the safety weight from w_s=0.50 to w_s=0.80, selecting a longer but safer route automatically — no pre-written storm rule required. When environmental uncertainty exceeds the confidence threshold, the drone aborts and returns to base. → [Autonomous Systems deep-dive](https://praneethtota.github.io/Adaptive-Utility-Agent/domain_autonomous_systems_v0_5.html)
271
+
272
+ **Smart Home Energy Management** — During a peak pricing event, the cost weight rises from w_k=0.40 to w_k=0.65, shifting appliance scheduling to off-peak automatically. When an occupant signals a preference, the system defers by activating a comfort-override profile (w_c=0.75) — not by adding a rule. Cross-session learning accumulates usage patterns without retraining. → [Energy Systems deep-dive](https://praneethtota.github.io/Adaptive-Utility-Agent/domain_energy_systems_v0_5.html)
273
+
274
+ **Energy Grid Load Balancing** — Under normal load, demand response with battery storage is preferred over gas peaker plants. Under a sudden demand surge, the stability weight rises to w_σ=0.80 and the decision flips to the peaker. The C_min=0.95 gate under surge conditions ensures the agent escalates to a human operator when demand forecasts are unreliable rather than committing a large generation decision under uncertainty. → [Energy Systems deep-dive](https://praneethtota.github.io/Adaptive-Utility-Agent/domain_energy_systems_v0_5.html)
275
+
276
+ **Dynamic Pricing** — Standard conditions favour moderate pricing with loyalty incentives. Under genuine supply constraints (w_r=0.65), surge pricing becomes optimal. Under a competitive threat, market share weight rises to w_m=0.40 and pricing shifts to defend position. Every price decision is logged with its full utility decomposition — the audit trail that regulators now require. → [Dynamic Pricing deep-dive](https://praneethtota.github.io/Adaptive-Utility-Agent/domain_dynamic_pricing_v0_5.html)
277
+
278
+ **AI Data Centers** — For GPU cloud operators, a routed graph of smaller specialist models shifts the optimisation target from raw frontier capability to revenue per watt, fleet utilisation, and cost per useful domain query. Lower-tier inventory (A40s, A100s, consumer-adjacent GPUs) that would otherwise be stranded gets a high-value specialist serving role. LoRA multi-tenancy improves utilisation further without expanding hardware. → [AI Data Centers deep-dive](https://praneethtota.github.io/Adaptive-Utility-Agent/domain_ai_datacenters_v0_5.html)
279
+
280
+ **Self-Driving Companies** — For AV companies the strongest argument is independent updateability, auditable behaviour, and principled abstention. Updating the traffic rules specialist for a new city does not force revalidation of perception or planning. The utility log produces a reproducible explanation of why a given manoeuvre was accepted, rejected, or escalated — the artifact that incident review and regulatory acceptance both require. → [Self-Driving deep-dive](https://praneethtota.github.io/Adaptive-Utility-Agent/domain_self_driving_v0_5.html)
281
+
282
+ Full worked numerical examples with explicit utility calculations for all seven domains are in [Appendix C of the whitepaper](https://praneethtota.github.io/Adaptive-Utility-Agent/whitepaper_full_v0_5.html#appendix-c).
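The two mechanisms that recur across these domains, context-dependent weights and a minimum-confidence gate, can be sketched together. The values w_s = 0.90 and C_min = 0.85 come from the Autonomous Vehicles paragraph; the split of the remaining weight, the per-objective scores, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    action: str     # "proceed" or "abstain"
    utility: float  # weighted score of the candidate action
    reason: str


def decide(scores: dict[str, float], weights: dict[str, float],
           confidence: float, c_min: float) -> Decision:
    """Score one candidate action, then apply the confidence gate."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    u = sum(weights[k] * scores[k] for k in weights)
    if confidence < c_min:
        # Below the floor the agent abstains regardless of how good u looks.
        return Decision("abstain", u, f"confidence {confidence:.2f} < C_min {c_min:.2f}")
    return Decision("proceed", u, "confidence gate passed")


# w_s = 0.90 is from the text; the 0.05 / 0.05 remainder is an assumed split.
school_zone = {"safety": 0.90, "efficiency": 0.05, "comfort": 0.05}
manoeuvre = {"safety": 0.7, "efficiency": 0.9, "comfort": 0.8}  # illustrative scores

print(decide(manoeuvre, school_zone, confidence=0.80, c_min=0.85))
print(decide(manoeuvre, school_zone, confidence=0.92, c_min=0.85))
```

Changing context then means swapping the weight profile, not adding a rule: the same `decide` call serves every scenario.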
283
+
284
+ ---
285
+
286
+ ## Architecture
287
+
288
+ ### Monolithic Setting (Current)
289
+
290
+ Until the Micro-Expert Architecture is operational, the system wraps a monolithic base model. Three layers compensate for the constraints of a monolithic system:
291
+
292
+ ```
293
+ Layer 1 — Per-session behavioral injection (real-time, no weight change)
294
+ Detected contradictions → corrective assertions → system prompt
295
+
296
+ Layer 2 — Calibration-cycle DPO fine-tuning (several times daily)
297
+ Utility-scored pairs → field-penalty-weighted DPO loss → LoRA update
298
+
299
+ Layer 3 — Release-level distillation (monthly)
300
+ Accumulated adapters → distilled into new base fine-tune
301
+ ```
302
+
303
+ **Personality System (interim wrapper):** Between calibration cycles, a behavioral wrapper biases generation toward safer operating regimes. Formally, it is a log-linear tilt of the base model's output distribution, parameterized by field-bounded trait scores (curiosity, caution, assertiveness, analytical_rigor, creativity). At the field-neutral point the wrapper is the identity and has no effect on generation. The trait dynamics are Lyapunov-stable, with a half-life of ≈ 34 cycles under zero drift (Theorem B.7). The wrapper resets on each new model release and is not instantiated in the Micro-Expert Architecture.
304
+
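A log-linear tilt of the kind described for the personality wrapper can be sketched as follows. The per-candidate feature vectors, the choice of 0 as the neutral point, and the exponential form of the relaxation are assumptions made for illustration; only the log-linear structure, the identity at the neutral point, and the 34-cycle half-life come from the text.

```python
import math


def tilt(probs: list[float], features: list[list[float]], traits: list[float]) -> list[float]:
    """Log-linear tilt: p'(y) is proportional to p(y) * exp(traits . features[y]).

    All entries of `probs` must be strictly positive.
    """
    logits = [
        math.log(p) + sum(t * f for t, f in zip(traits, feats))
        for p, feats in zip(probs, features)
    ]
    peak = max(logits)  # subtract the max before exponentiating, for stability
    unnorm = [math.exp(x - peak) for x in logits]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def decay_trait(value: float, cycles: float, half_life: float = 34.0) -> float:
    """Relax a trait toward the neutral point 0, halving every `half_life` cycles."""
    return value * 0.5 ** (cycles / half_life)


base = [0.5, 0.3, 0.2]
feats = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]  # hypothetical (caution, creativity) features

print(tilt(base, feats, [0.0, 0.0]))  # neutral traits: distribution unchanged
print(tilt(base, feats, [1.0, 0.0]))  # positive caution: mass shifts to candidate 1
```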
305
+ ### Micro-Expert Architecture (Target)
306
+
307
+ The monolithic model is decomposed into independently deployable domain submodels — microservices architecture applied to model inference:
308
+
309
+ ```
310
+ Router (Raft HA cluster, 150–300ms failover)
311
+ ↓ probabilistic field classification + fan-out
312
+ Domain Submodels (surgery | law | software | creative | ...)
313
+ ↓ independent weights, training, deployment
314
+ Arbiter Agent (§10.5) + VCG Mechanism (§10.6)
315
+ ↓ cross-domain contradiction resolution
316
+ Blue-Green Deployment (§10.7)
317
+ ↓ utility-deviation-triggered, softmax traffic routing
318
+ ```
319
+
320
+ Updating surgery weights cannot affect software engineering weights. There are no shared parameters to interfere. Catastrophic forgetting is resolved architecturally. Graph depth is hardware-adaptive: high-VRAM GPUs run shallow graphs of large models; consumer GPUs run deeper graphs of smaller specialists at lower cost per query.
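Two steps in the diagram above lend themselves to a short sketch: probabilistic fan-out at the router and softmax traffic routing between blue and green versions. The threshold, the temperature, and both function names are hypothetical; the package's actual routing logic may differ.

```python
import math


def fan_out(field_probs: dict[str, float], threshold: float = 0.25) -> list[str]:
    """Return every field whose classification probability clears the threshold.

    The top-scoring field is always included, so a query is never dropped.
    """
    top = max(field_probs, key=field_probs.get)
    selected = [f for f, p in field_probs.items() if p >= threshold or f == top]
    return sorted(selected, key=field_probs.get, reverse=True)


def traffic_split(utilities: dict[str, float], temperature: float = 0.05) -> dict[str, float]:
    """Softmax over per-version utility scores; lower temperature shifts traffic faster."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    peak = max(utilities.values())  # subtract the max for numerical stability
    weights = {k: math.exp((u - peak) / temperature) for k, u in utilities.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


print(fan_out({"software": 0.6, "math": 0.3, "law": 0.1}))
print(traffic_split({"blue": 0.70, "green": 0.75}))
```

A softmax split moves traffic gradually as the utility gap grows, which avoids a hard cut-over on a small or noisy difference.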
321
+
322
+ ### Arbiter Agent (§10.5)
323
+
324
+ When two submodels produce conflicting outputs, a dedicated Arbiter Agent runs structured evidence checks:
325
+
326
+ | Check | Weight | What it tests |
327
+ |---|---|---|
328
+ | Logical | 0.30 | Does the output contradict its own premises? |
329
+ | Mathematical | 0.40 | Are complexity or numerical claims provably wrong? |
330
+ | Cross-session | 0.20 | Does it contradict prior verified assertions? |
331
+ | Empirical | 0.10 | Does it contradict verifiable external ground truth? |
332
+
333
+ Four verdict cases: A correct → correct B; B correct → correct A; both wrong → correct both + curiosity gap bonus; inconclusive → controlled external escalation under minimum-disclosure protocol. Corrections route internally as DPO signal. Nothing is disclosed externally beyond the verified answer, or a minimal hedge on inconclusive cases.
334
+
335
+ Arbiter calibration: 2–5% of verdicts are independently verified against domain experts. The sampling rate escalates adaptively, up to a 15% hard ceiling, if correction volume rises above baseline.
336
+
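One way to combine the four check weights into a verdict is sketched below. The weights are the ones in the table; the `accept` and `margin` thresholds and the mapping from scores to the four verdict cases are hypothetical, since the text does not specify how the checks are aggregated.

```python
# Check weights from the table above.
CHECK_WEIGHTS = {
    "logical": 0.30,
    "mathematical": 0.40,
    "cross_session": 0.20,
    "empirical": 0.10,
}


def evidence_score(passed: dict[str, bool]) -> float:
    """Sum the weights of the checks an output passes (about 1.0 if all four pass)."""
    return sum(w for name, w in CHECK_WEIGHTS.items() if passed[name])


def verdict(a: dict[str, bool], b: dict[str, bool],
            accept: float = 0.6, margin: float = 0.2) -> str:
    """Map two evidence scores to one of the four verdict cases.

    `accept` and `margin` are hypothetical thresholds, not values from the package.
    """
    score_a, score_b = evidence_score(a), evidence_score(b)
    if max(score_a, score_b) < accept:
        return "both_wrong"    # neither output has enough supporting evidence
    if abs(score_a - score_b) < margin:
        return "inconclusive"  # too close to call: escalate
    return "a_correct" if score_a > score_b else "b_correct"


all_pass = {"logical": True, "mathematical": True, "cross_session": True, "empirical": True}
bad_math = dict(all_pass, mathematical=False)
print(verdict(all_pass, bad_math))
```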
337
+ ### VCG Arbitration Mechanism (§10.6)
338
+
339
+ The hand-specified Arbiter check weights are an engineering approximation. The theoretically grounded alternative treats domain submodels as players in a cooperative game:
340
+
341
+ **Three theorems proved (§10.6):**
342
+
343
+ | Theorem | Statement |
344
+ |---|---|
345
+ | **S1 — Dominant Strategy Truthfulness** | Truthful reporting of $v_i$ is a weakly dominant strategy for every submodel, regardless of others' reports |
346
+ | **S2 — Social Optimum (POA = 1)** | Under dominant-strategy equilibrium the Arbiter selects the claim maximising $\sum_i v_i(a)$; Price of Anarchy = 1 exactly |
347
+ | **S3 — Individual Rationality** | Every submodel weakly prefers participation to abstention |
348
+
349
+ Clarke pivot transfers applied as DPO penalty weight adjustments make check weights endogenous and replace the periodic expert-sampling audit with a continuous self-correcting signal.
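The selection rule in S2 and the Clarke pivot transfers can be illustrated with the textbook VCG computation. This is a generic sketch, not the package's arbiter: the submodel names, the reported valuations, and the function name are hypothetical, and how a transfer is converted into a DPO penalty weight is not shown.

```python
from typing import Optional


def vcg(valuations: dict[str, dict[str, float]]) -> tuple[str, dict[str, float]]:
    """Pick the welfare-maximising claim and compute Clarke pivot payments.

    valuations[i][a] is submodel i's reported value v_i(a) for candidate claim a.
    Payment for i is the harm its presence causes the others:
    (best welfare of the others without i) minus (others' welfare at the chosen claim).
    """
    claims = sorted(next(iter(valuations.values())))

    def welfare(claim: str, exclude: Optional[str] = None) -> float:
        return sum(v[claim] for i, v in valuations.items() if i != exclude)

    chosen = max(claims, key=welfare)
    payments = {}
    for i in valuations:
        best_without_i = max(welfare(a, exclude=i) for a in claims)
        payments[i] = best_without_i - welfare(chosen, exclude=i)
    return chosen, payments


reports = {
    "swe": {"A": 0.9, "B": 0.2},
    "math": {"A": 0.3, "B": 0.6},
    "law": {"A": 0.4, "B": 0.5},
}
chosen, payments = vcg(reports)
print(chosen, payments)
```

In this example only `swe` is pivotal: without its report the others would have selected claim B, so it alone bears a non-zero transfer.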
350
+
351
+ ### Assertions Store (Evidence with Decay)
352
+
353
+ Verified facts persist across sessions with field-specific confidence decay:
354
+
355
+ | Class | Decay | Examples |
356
+ |---|---|---|
357
+ | A — No decay | Never | Mathematical proofs, physical laws, algorithm correctness |
358
+ | B — Slow (τ = 10yr) | Exponential | Mechanical engineering, classical physics |
359
+ | C — Moderate (τ = 3yr) | Exponential | Medical anatomy, legal common law |
360
+ | D — Fast (τ = 6mo) | Exponential | Clinical guidelines, security practices, ML benchmarks |
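The decay classes can be sketched as a lookup of time constants plus an exponential. Reading τ as an e-folding time constant (not a half-life) is an assumption, as is the function name; the class letters and τ values are from the table.

```python
import math

# Time constants in years, from the table; class A never decays.
TAU_YEARS = {"A": math.inf, "B": 10.0, "C": 3.0, "D": 0.5}


def decayed_confidence(initial: float, age_years: float, decay_class: str) -> float:
    """Exponential decay c(t) = c0 * exp(-t / tau), with tau read as an e-folding time."""
    tau = TAU_YEARS[decay_class]
    if math.isinf(tau):
        return initial  # class A: proofs and physical laws keep full confidence
    return initial * math.exp(-age_years / tau)


# Confidence remaining in a two-year-old assertion, by class.
for cls in "ABCD":
    print(cls, round(decayed_confidence(0.95, 2.0, cls), 3))
```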
361
+
362
+ ---
363
+
364
+ ## The Consumer Hardware Argument (§10.9)
365
+
366
+ This is one of the more consequential implications of the Micro-Expert Architecture, and one the paper is careful to state with appropriate scope.
367
+
368
+ ### The claim
369
+
370
+ The dominant assumption in AI deployment is that frontier capability requires frontier compute — specifically, the high-bandwidth GPU clusters subject to export controls. The Micro-Expert Architecture challenges this assumption in a specific and falsifiable way.
371
+
372
+ **The claim is not** that consumer GPUs match H100s on general workloads. They do not — H100s have 3× the memory bandwidth and NVLink interconnects that PCIe cannot approach.
373
+
374
+ **The claim is** that for inference on specialised domain queries — the highest-value AI use cases for most professional organisations — a graph of domain-specialist models on consumer hardware can match the output quality of a monolithic frontier model on enterprise hardware, at substantially lower cost per query. The routing and arbitration layer that makes this possible is what §10.9 formalises and partially validates.
375
+
376
+ ### The cost arithmetic (from public hardware specs)
377
+
378
+ ```
379
+ 7B specialist on RTX 4090: ~$0.00014 per 1K tokens
380
+ 70B model on 2× H100: ~$0.00083 per 1K tokens
381
+
382
+ Single-specialist query: 6× cheaper on consumer hardware
383
+ 3-specialist fan-out: 2× cheaper even at maximum typical fan-out
384
+ ```
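The ratios above follow directly from the two per-token costs. The sketch reproduces them under the simplifying assumption that a fan-out of n specialists costs n times a single specialist, with no router or arbiter overhead.

```python
COST_7B_4090 = 0.00014   # USD per 1K tokens, 7B specialist on RTX 4090 (from the text)
COST_70B_H100 = 0.00083  # USD per 1K tokens, 70B model on 2x H100 (from the text)


def cost_ratio(fan_out: int) -> float:
    """How many times cheaper a fan-out of `fan_out` specialists is than the 70B baseline."""
    if fan_out < 1:
        raise ValueError("fan_out must be at least 1")
    return COST_70B_H100 / (fan_out * COST_7B_4090)


for n in (1, 2, 3):
    print(f"{n}-specialist query: {cost_ratio(n):.2f}x cheaper")
```

Under the same assumption the advantage disappears at a fan-out of six, where `cost_ratio(6)` drops just below 1.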
385
+
386
+ ### The routing experiment (§10.9.4)
387
+
388
+ A four-arm controlled study using the production agent codebase measured the contribution of the routing and arbitration layer to correctness, independently of model size or hardware. Quality parameters were derived from six published domain benchmarks (all cited in `routing_results.json`).
389
+
390
+ | Arm | Correctness | vs baseline | Brier | p-value |
391
+ |---|---|---|---|---|
392
+ | A — No routing (generic prompt) | 59.0% | — | 0.160 | — |
393
+ | B — Matched routing (oracle) | 71.5% | **+12.5%** | 0.106 | 0.009 |
394
+ | C — Mismatched routing (Regime 2) | 41.5% | **−17.5%** | 0.292 | <0.001 |
395
+ | D — VCG arbitration | 69.5% | **+10.5%** | 0.110 | 0.029 |
396
+
397
+ Three findings:
398
+
399
+ 1. **Correct routing contributes +12.5 percentage points of correctness** (p = 0.009) through prompt specialisation alone, before any weight-level fine-tuning. This is the routing layer's direct contribution, measurable independently of hardware.
400
+
401
+ 2. **Mismatched routing is actively harmful** (−17.5 percentage points, p < 0.001) and dramatically worsens confidence calibration (Brier 0.292 vs 0.160). The model is not just wrong; it is confidently wrong. This quantifies the Regime 2 failure mode from §10.4.1 and makes the case for probabilistic routing and VCG arbitration concrete rather than theoretical.
402
+
403
+ 3. **VCG arbitration captures 84% of the oracle matched-routing gain** (+10.5 pp vs +12.5 pp), statistically significant against the baseline (p = 0.029), with a near-matched Brier score (0.110 vs 0.106). The 2.0 pp gap to the oracle is not statistically significant (p = 0.66): at 82% routing accuracy, VCG arbitration essentially closes on the oracle best case.
404
+
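The headline figures in the three findings can be recomputed from the table. In the sketch below, `gain_captured` and `brier` follow standard definitions. `two_proportion_p` checks the reported p-values against a hypothetical sample size of 200 queries per arm and a pooled two-sided z-test. Neither the sample size nor the test used is stated in this section, so both are assumptions.

```python
import math


def gain_captured(baseline: float, arm: float, oracle: float) -> float:
    """Fraction of the oracle's gain over baseline that `arm` achieves."""
    return (arm - baseline) / (oracle - baseline)


def brier(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between stated confidence and 0/1 correctness."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


def two_proportion_p(p1: float, p2: float, n: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test with n samples per arm."""
    pooled = (p1 + p2) / 2.0
    se = math.sqrt(pooled * (1.0 - pooled) * 2.0 / n)
    z = abs(p1 - p2) / se
    return math.erfc(z / math.sqrt(2.0))  # equals 2 * (1 - Phi(z))


N = 200  # hypothetical per-arm sample size; not stated in the text

print(f"VCG share of oracle gain: {gain_captured(0.590, 0.695, 0.715):.2f}")
print(f"A vs B: p = {two_proportion_p(0.590, 0.715, N):.4f}")
print(f"A vs D: p = {two_proportion_p(0.590, 0.695, N):.4f}")
print(f"B vs D: p = {two_proportion_p(0.715, 0.695, N):.2f}")
```

Under these assumptions the computed values land close to the reported ones. That agreement is consistent with this design but does not confirm it.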
405
+ ```bash
406
+ cd agent && python3 routing_experiment.py
407
+ # Outputs: routing_output/routing_results.json, routing_report.txt, plots/ (4 figures)
408
+ # Replace _generate_response() with live_generate_response() for Ollama inference
409
+ ```
410
+
411
+ ### The complete argument (stated scope)
412
+
413
+ The consumer hardware case combines three components with different evidential status:
414
+
415
+ | Component | Evidence | Source |
416
+ |---|---|---|
417
+ | Routing + arbitration adds +10.5 pp correctness | **Measured** (this work, statistically significant) | `routing_experiment.py` |
418
+ | Domain-specialist 7B models match general 70B on domain benchmarks | **Published** (independently replicated) | DeepSeek Coder, WizardMath, Med-PaLM citations |
419
+ | 2–6× lower cost per query on consumer hardware | **Analytical** (public hardware specs and cloud pricing) | Lambda Labs, RunPod, NVIDIA specs |
420
+
421
+ Together these form the full argument. The remaining gap is direct quality benchmarking of fine-tuned 7B specialists against Llama 3.1 70B on physical 4090 hardware, which the second row currently covers only through published results. This is the primary item of empirical future work and requires only consumer hardware to run.
422
+
423
+ ---
424
+
425
+ ## Implications for the AI Landscape
426
+
427
+ ### The hardware moat is narrower than assumed for professional domains
428
+
429
+ Export controls on H100s, A100s, and their successors rest on a single architectural assumption: that frontier AI capability requires frontier compute. This assumption is well-founded for training and for general-purpose inference at scale. It is considerably weaker for the domain-specific professional inference use cases — medicine, law, engineering, software, mathematics — where AI has the clearest near-term value.
430
+
431
+ The published benchmark evidence is consistent and replicated across multiple independent groups: fine-tuned 7B–13B domain specialists routinely match or exceed general 70B models on their target domain benchmarks. This is not a marginal effect. WizardMath 7B reaches 54.9% on GSM8K, within two points of the 56.8% reported for Llama 2 70B at a tenth of the parameter count. Med-PaLM 2 matches GPT-4 on MedQA despite being orders of magnitude smaller. DeepSeek Coder 7B matches GPT-3.5 175B on HumanEval.
432
+
433
+ The Micro-Expert Architecture makes this practically deployable: a router that activates the right specialist for each query, an Arbiter that resolves cross-domain conflicts, and a utility-weighted calibration loop that improves over time — running on consumer hardware, without export-controlled components.
434
+
435
+ ### What this means for compute sovereignty
436
+
437
+ Countries and organisations operating without access to H100 clusters are not locked out of frontier AI capability in the domains that matter most for economic and scientific development. They face a different engineering challenge: building a routed graph of domain specialists rather than scaling a monolithic model. This paper is one piece of the technical foundation for that approach.
438
+
439
+ The critical caveat, stated explicitly throughout §10.9: general-purpose AI capability — the open-ended reasoning and knowledge breadth that frontier models provide on arbitrary queries — does retain a meaningful hardware advantage. The consumer hardware argument applies to the specialised slice, not the general case. That slice is, however, the commercially and professionally most important one.
440
+
441
+ ### The routing failure modes matter as much as the architecture
442
+
443
+ The export control implication is only as strong as the routing is reliable. The Regime 2 result (−17.5% correctness, Brier 0.292) shows that wrong-domain routing is not merely suboptimal — it actively makes the system worse than no routing at all, and does so confidently. This is why the routing problem (§10.4.1) and its mitigations (probabilistic fan-out, VCG calibration, M1–M5) are central to the paper and not peripheral engineering details. A Micro-Expert system with poor routing is worse than a monolithic model. A Micro-Expert system with good routing and proper arbitration is competitive with a much larger model on domain tasks, on consumer hardware.
444
+
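The probabilistic fan-out mitigation named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's router: the function name `fan_out`, the thresholds `min_prob` and `max_entropy_bits`, and the `"generalist"` fallback label are all hypothetical. The idea is to query every specialist whose domain probability clears a floor, and to fall back to a general model when the classifier's distribution is too flat to trust.

```python
import math


def fan_out(domain_probs, min_prob=0.25, max_entropy_bits=1.5, fallback="generalist"):
    """Choose which specialists to query from a classifier's domain distribution.

    Hypothetical helper: thresholds and the fallback name are illustrative only.
    """
    total = sum(domain_probs.values())
    if total <= 0:
        return [fallback]
    # Normalise so the values form a probability distribution.
    probs = {domain: p / total for domain, p in domain_probs.items()}
    # Shannon entropy in bits: a high value means the classifier is unsure.
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    if entropy > max_entropy_bits:
        return [fallback]
    ranked = sorted(probs.items(), key=lambda item: -item[1])
    selected = [domain for domain, p in ranked if p >= min_prob]
    return selected or [fallback]


confident = fan_out({"medicine": 0.90, "law": 0.05, "software": 0.05})
ambiguous = fan_out({"medicine": 0.50, "law": 0.45, "software": 0.05})
unsure = fan_out({"medicine": 0.25, "law": 0.25, "software": 0.25, "math": 0.25})
```

With these inputs, a confident classification goes to one specialist, a two-way ambiguity goes to both candidates so an arbiter can compare them, and a uniform distribution is sent to the fallback rather than to a possibly wrong specialist.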
445
+ ---
446
+
447
+ ## Mathematical Foundations (Appendix B, v0.5)
448
+
449
+ No proof relies on differentiability unless that assumption is stated; continuity alone is used otherwise, and all scope conditions are stated explicitly.
450
+
451
+ | Result | Content | Key note |
452
+ |---|---|---|
453
+ | **Theorem B.1** | The additive linear structure of U is the unique form implied by five axioms | Proved via Debreu + Cauchy functional equation; continuity only, no differentiability |
454
+ | **§B.2** | Field weights from error-cost proportionality, calibrated to liability standards | Design principle, not an optimality theorem |
455
+ | **Proposition B.3** | Efficacy sigmoid = Mann-Whitney dominance probability | Holds under log-logistic model with equal scale; distributional assumption stated |
456
+ | **Theorem B.4** | EMA with α = 0.2 is Kalman-optimal for ρ = 0.05 noise ratio | Reasoning direction clarified: α = 0.2 was chosen first, Kalman characterises the noise regime |
457
+ | **Theorem B.5** | Confidence convergence with noise-aware bound | $\mathbb{E}[\|C_t - C^*\|] \leq (1-\alpha)^t\|C_0 - C^*\| + \sigma_{\tilde{s}}\sqrt{\alpha/(2-\alpha)}$; requires $\lambda\mu(f) < 1$ |
458
+ | **Proposition B.6** | 50% curiosity cap enforces exploitation dominance | Proved exactly; regret analysis open |
459
+ | **Theorem B.7** | Personality Lyapunov stability | Part (iv) clarified: mean reversion β = 0.01 subsumed by field bounds at current parameters |
460
+
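Two rows of the table can be checked numerically. The sketch below assumes that ρ in Theorem B.4 is the ratio of process-noise variance to observation-noise variance in a scalar random-walk-plus-noise model (this section does not define ρ, so that reading is an assumption). Under it, the steady-state Kalman gain is $K = \tfrac{1}{2}\left(-\rho + \sqrt{\rho^2 + 4\rho}\right)$, with inverse $\rho = K^2/(1-K)$. The third function simply evaluates the right-hand side of the Theorem B.5 bound as printed above. Function names are illustrative.

```python
import math


def steady_state_gain(rho):
    """Steady-state Kalman gain for a scalar random-walk-plus-noise model.

    Assumes rho = process-noise variance / observation-noise variance.
    """
    return (-rho + math.sqrt(rho * rho + 4.0 * rho)) / 2.0


def noise_ratio_for_alpha(alpha):
    """Inverse map: the rho at which an EMA with rate alpha equals the steady-state filter."""
    return alpha * alpha / (1.0 - alpha)


def confidence_error_bound(t, alpha, initial_gap, sigma):
    """Right-hand side of the Theorem B.5 bound: transient term plus noise floor."""
    transient = (1.0 - alpha) ** t * abs(initial_gap)
    noise_floor = sigma * math.sqrt(alpha / (2.0 - alpha))
    return transient + noise_floor


gain = steady_state_gain(0.05)
rho = noise_ratio_for_alpha(0.2)
bound_after_10 = confidence_error_bound(10, 0.2, initial_gap=0.5, sigma=0.1)
```

Under that assumption, ρ = 0.05 gives a gain of 0.2, matching the table's pairing of α = 0.2 with ρ = 0.05. For α = 0.2 the noise floor in the B.5 bound is σ/3, so the bound cannot fall below one third of the signal noise however many updates are applied.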
461
+ ---
462
+
463
+ ## Simulation Results
464
+
465
+ ### Extended simulation (Appendix A) — 500-task two-arm + 10-cycle stability
466
+
467
+ ```
468
+ Cycle  Agent U  Base U   Ag Brier  Bl Brier  Ag Rep↑  Bl Rep↑
469
+ ─────  ───────  ───────  ────────  ────────  ───────  ───────
470
+ 1      0.5291   0.5333   0.3279    0.3502    0        0
471
+ 2      0.5441   0.5385   0.2177    0.2520    1        6
472
+ 3      0.5656   0.5604   0.2464    0.2860    4        10
473
+ 4      0.5828   0.5622   0.2149    0.2601    3        15
474
+ 5      0.5846   0.5765   0.1059    0.1501    6        15
475
+ ```
476
+
477
+ **69.6% reduction in repeated errors** over uncalibrated baseline (14 vs 46, cycles 2–5).
478
+ **14.3% Brier improvement** overall; 29.5% by cycle 5.
479
+ **Pearson r = 0.461** (U vs correctness, p < 10⁻⁴⁰) — U is a statistically significant correctness predictor.
480
+
481
+ 10-cycle stability: contradiction rate 22% → 6% (73% reduction); Brier reaches 0.049 by cycle 7.
482
+
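The headline percentages can be recomputed from the per-cycle table. The sketch below assumes the "overall" Brier score is the unweighted mean over the five cycles, which reproduces the 0.2226 and 0.2597 reported in the Validated Claims table. Because the table values are rounded to four decimals, the cycle-5 figure comes out a fraction below the reported 29.5%.

```python
# Per-cycle values copied from the table above (cycles 1-5).
agent_brier = [0.3279, 0.2177, 0.2464, 0.2149, 0.1059]
base_brier = [0.3502, 0.2520, 0.2860, 0.2601, 0.1501]
agent_repeats = [0, 1, 4, 3, 6]
base_repeats = [0, 6, 10, 15, 15]


def mean(values):
    return sum(values) / len(values)


def relative_reduction(new, old):
    """Fractional reduction of `new` relative to `old`."""
    return 1.0 - new / old


# Repeated errors are only possible from cycle 2 onward, so skip cycle 1.
repeat_reduction = relative_reduction(sum(agent_repeats[1:]), sum(base_repeats[1:]))
overall_brier_gain = relative_reduction(mean(agent_brier), mean(base_brier))
cycle5_brier_gain = relative_reduction(agent_brier[-1], base_brier[-1])
# 10-cycle stability run: contradiction rate 22% -> 6%.
contradiction_reduction = relative_reduction(0.06, 0.22)

for label, value in [
    ("repeated-error reduction", repeat_reduction),
    ("overall Brier improvement", overall_brier_gain),
    ("cycle-5 Brier improvement", cycle5_brier_gain),
    ("contradiction-rate reduction", contradiction_reduction),
]:
    print(f"{label}: {value:.1%}")
```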
483
+ ### Routing experiment (§10.9) — four-arm study
484
+
485
+ | Arm | Correctness | Δ vs baseline | Brier | p-value |
486
+ |---|---|---|---|---|
487
+ | A — No routing | 59.0% | — | 0.160 | — |
488
+ | B — Matched (oracle) | 71.5% | +12.5% | 0.106 | 0.009 |
489
+ | C — Mismatched (Regime 2) | 41.5% | −17.5% | 0.292 | <0.001 |
490
+ | D — VCG arbitration | 69.5% | +10.5% | 0.110 | 0.029 |
491
+
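The derived quantities quoted alongside this table follow from the correctness column alone. The sketch below treats the deltas as percentage-point differences and assumes Cohen's d is computed on the binary correct/incorrect outcome with equally sized arms. That assumption is not stated in this section, but it reproduces the d = 0.265 quoted in the Validated Claims table. Names are illustrative.

```python
import math

# Correctness per arm, copied from the table above.
correctness = {"A": 0.590, "B": 0.715, "C": 0.415, "D": 0.695}


def delta_pp(arm, baseline="A"):
    """Difference in correctness versus the baseline arm, in percentage points."""
    return (correctness[arm] - correctness[baseline]) * 100.0


def cohens_d(p_treatment, p_control):
    """Cohen's d for two Bernoulli outcomes, pooling variances with equal weights."""
    pooled_sd = math.sqrt(
        (p_treatment * (1 - p_treatment) + p_control * (1 - p_control)) / 2.0
    )
    return (p_treatment - p_control) / pooled_sd


oracle_gain = delta_pp("B")
vcg_gain = delta_pp("D")
mismatch_loss = delta_pp("C")
share_of_oracle = vcg_gain / oracle_gain
oracle_effect_size = cohens_d(correctness["B"], correctness["A"])

print(f"VCG captures {share_of_oracle:.0%} of the oracle gain")
print(f"Cohen's d, oracle vs no routing: {oracle_effect_size:.3f}")
```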
492
+ ---
493
+
494
+ ## Validated Claims
495
+
496
+ | Claim | Result | Status |
497
+ |---|---|---|
498
+ | Agent reduces repeated errors vs uncalibrated baseline | 69.6% reduction (14 vs 46 over 400 tasks) | **Confirmed** |
499
+ | U correlates with ground-truth correctness | Pearson r = 0.461 (agent), p < 10⁻⁴⁰ | **Confirmed** |
500
+ | Confidence is better calibrated under agent vs baseline | Brier 0.2226 vs 0.2597 (14.3% improvement) | **Confirmed** |
501
+ | Personality converges stably (Theorem B.7) | Traits in field bounds throughout; dynamics match theorem | **Confirmed** |
502
+ | Contradiction rate falls with sustained calibration | 22% → 6% over 10 cycles (73% reduction) | **Confirmed** |
503
+ | Long-tail errors persist beyond five correction cycles | 8 patterns; root cause: surface-form variability in assertions store | **Confirmed — limitation identified** |
504
+ | Correct routing improves correctness vs no routing | +12.5% (p = 0.009, Cohen's d = 0.265) | **Confirmed** |
505
+ | Mismatched routing is actively harmful | −17.5% correctness, Brier 0.292 vs 0.160 (p < 0.001) | **Confirmed** |
506
+ | VCG arbitration captures most of the routing gain | +10.5% (84% of oracle), p = 0.029 | **Confirmed** |
507
+ | Consumer hardware cost advantage | 2–6× lower cost per token (analytical, from public specs) | **Analytical — empirical validation pending** |
508
+
509
+ ---
510
+
511
+ ## Project Structure
512
+
513
+ ```
514
+ # Root-level HTML — served at https://praneethtota.github.io/Adaptive-Utility-Agent
515
+ whitepaper_v05.html # Landing page — site entry point
516
+ whitepaper_full_v0_5.html # Full whitepaper with KaTeX math + figures
517
+ tutorial_v0_5.html # Builder's tutorial (architecture + code walkthroughs)
518
+ domain_ai_datacenters_v0_5.html # AI Data Centers deep-dive
519
+ domain_self_driving_v0_5.html # Self-Driving Vehicles deep-dive
520
+ domain_autonomous_systems_v0_5.html # Autonomous Systems deep-dive
521
+ domain_software_engineering_v0_5.html # Software Engineering deep-dive
522
+ domain_dynamic_pricing_v0_5.html # Dynamic Pricing deep-dive
523
+ domain_energy_systems_v0_5.html # Energy Systems deep-dive
524
+ domain_creative_systems_v0_5.html # Creative Systems deep-dive
525
+ whitepaper_v05.md # Markdown edition of the whitepaper
526
+
527
+ agent/
528
+ ├── config.py # Field weights, bounds, penalty multipliers
529
+ ├── field_classifier.py # Field distribution: high-stakes floor, EMA drift, entropy fallback
530
+ ├── contradiction_detector.py # Logical, mathematical, cross-session detection
531
+ ├── assertions_store.py # Cross-session store with decay classes A–D
532
+ ├── trust_manager.py # Credential bootstrapping, tit-for-tat scoring
533
+ ├── arbiter.py # 4-check pipeline, gap bonus, adaptive sampling
534
+ ├── utility_scorer.py # E (EMA), C, K (50% cap), difficulty routing
535
+ ├── personality_manager.py # Wrapper evolution, Lyapunov-stable dynamics
536
+ ├── creative_efficacy.py # Two-component creative efficacy model
537
+ ├── agent.py # Main UtilityAgent — wires all components
538
+ ├── harness.py # Live API harness (requires ANTHROPIC_API_KEY)
539
+ ├── simulate.py # Original 3-cycle / 8-problem simulation
540
+ ├── simulate_extended.py # Extended simulation: 500-task two-arm + 10-cycle stability
541
+ ├── routing_experiment.py # Four-arm routing quality study (§10.9)
542
+ ├── requirements.txt
543
+ ├── extended_output/
544
+ │ ├── extended_results.json # Full raw data (task records, cycle stats, DPO pairs)
545
+ │ ├── report.txt
546
+ │ └── plots/ # 10 publication figures (PNG, 150 dpi)
547
+ └── routing_output/
548
+ ├── routing_results.json # Four-arm results with benchmark citations
549
+ ├── routing_report.txt
550
+ └── plots/ # 4 routing experiment figures (PNG, 150 dpi)
551
+ ├── figR1_correctness.png
552
+ ├── figR2_brier.png
553
+ ├── figR3_domain_heatmap.png
554
+ └── figR4_summary.png
555
+
556
+ docs/
557
+ └── to_do_in_version_v06_revised.md # v0.6 backend design: privacy-first MVP spec
558
+ ```
559
+
560
+ ---
561
+
562
+ ## Quick Start
563
+
564
+ ```bash
565
+ # Original simulation — no API key needed
566
+ (cd agent && python3 simulate.py)
567
+
568
+ # Extended simulation — generates all results and plots
569
+ (cd agent && python3 simulate_extended.py)
570
+
571
+ # Routing quality experiment (§10.9)
572
+ (cd agent && python3 routing_experiment.py)
573
+ # For live Ollama inference: replace _generate_response() with live_generate_response()
574
+ # Instructions in routing_experiment.py module docstring
575
+
576
+ # Live harness — requires API key
577
+ pip install httpx
578
+ export ANTHROPIC_API_KEY=sk-ant-...
579
+ (cd agent && python3 harness.py)
580
+ ```
581
+
582
+ Dependencies: `numpy`, `scipy`, `matplotlib` (standard scientific Python stack). No GPU required for any simulation.
583
+
584
+ 📖 **For the full architecture walkthrough and code-grounded tutorial, visit the documentation site:**
585
+ **https://praneethtota.github.io/Adaptive-Utility-Agent**
586
+
587
+ ---
588
+
589
+ ## What's New in v0.5
590
+
591
+ ### Theoretical additions
592
+
593
+ - **VCG arbitration mechanism (§10.6)**: Theorems S1–S3 prove dominant-strategy truthfulness, social optimality (POA = 1), and individual rationality. Clarke pivot transfers replace hand-specified check weights and the expert-sampling audit with a continuous self-correcting signal.
594
+
595
+ - **Appendix B — complete formal proofs (B.1–B.7)**: Key corrections: B.1 uses Cauchy functional equation (continuity only, no differentiability); B.5 noise-aware bound matches proof; B.7 Part (iv) clarified (β = 0.01 subsumed by field bounds); B.4 sensitivity table corrected.
596
+
597
+ - **§10.9 — Consumer hardware argument**: Analytical cost model (2–6× cheaper per token), routing quality experiment (+10.5% correctness from VCG arbitration, p = 0.029), and explicit scope statement distinguishing measured from analytical claims.
598
+
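For readers unfamiliar with Clarke pivot transfers, here is a minimal sketch of the textbook VCG rule. It is a generic illustration and not the package's arbiter: `vcg_outcome`, the expert names, and the reported values are hypothetical. Each expert reports a value for every candidate answer, the answer with the highest total reported value is chosen, and each expert is charged the welfare loss its presence imposes on the others.

```python
def vcg_outcome(reports):
    """Textbook VCG with Clarke pivot payments.

    reports maps agent -> {outcome: reported value}.
    Returns (chosen outcome, {agent: payment}).
    """
    outcomes = sorted({o for values in reports.values() for o in values})

    def welfare(outcome, exclude=None):
        # Total reported value of `outcome`, optionally leaving one agent out.
        return sum(
            values.get(outcome, 0.0)
            for agent, values in reports.items()
            if agent != exclude
        )

    chosen = max(outcomes, key=welfare)
    payments = {}
    for agent in reports:
        # What the others could have achieved had this agent been absent...
        best_without = max(welfare(o, exclude=agent) for o in outcomes)
        # ...minus what the others actually get at the chosen outcome.
        payments[agent] = best_without - welfare(chosen, exclude=agent)
    return chosen, payments


reports = {
    "math": {"answer_a": 0.9, "answer_b": 0.2},
    "code": {"answer_a": 0.3, "answer_b": 0.6},
    "law": {"answer_a": 0.1, "answer_b": 0.4},
}
chosen, payments = vcg_outcome(reports)
```

In this example only the `math` expert is pivotal: without it the others would have picked `answer_b`, so it alone is charged, and its charge stays below its own reported value for the chosen answer. Experts whose reports do not change the outcome pay nothing.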
599
+ ### Empirical additions
600
+
601
+ - **Extended simulation (Appendix A)**: 500-task two-arm comparison + 10-cycle stability run. 69.6% repeated-error reduction. Full data in `extended_results.json`.
602
+
603
+ - **Routing quality experiment (§10.9)**: Four-arm study quantifying the routing layer's contribution (+12.5% oracle, +10.5% VCG, −17.5% Regime 2). Quality model from published benchmarks; code structured for live Ollama drop-in. Data in `routing_results.json`.
604
+
605
+ ### Structural additions
606
+
607
+ - Supplement S1 integrated as §10.6; sections renumbered to §§10.7–10.10
608
+ - References merged: Clarke, Groves, Harsanyi/Selten, Hurwicz, Nash, Vickrey added
609
+ - Validated claims table expanded from 6 to 10 claims
610
+ - Full documentation site launched: seven domain deep-dives, builder's tutorial, rendered whitepaper
611
+
612
+ ---
613
+
614
+ ## Roadmap
615
+
616
+ | Phase | Description | Status |
617
+ |---|---|---|
618
+ | 1 | Code generation MVP — single domain, validate U correlates with quality | Simulated ✓ |
619
+ | 2 | Multi-domain STEM — math proof verification (Lean/SymPy), field classifier | Planned |
620
+ | 3 | Personality system — trait weighting and evolution service | Simulated ✓ |
621
+ | 4 | Trust system — entity scoring and lenient tit-for-tat | Implemented |
622
+ | 5 | Creative fields — platform signal collection, two-component efficacy | Designed |
623
+ | 6 | Full continual learning — LoRA calibration in production, replay buffer | Planned |
624
+ | 7 | Feedback into training — distill adapters into base fine-tune | Planned |
625
+ | 8 | **Physical Hardware Validation and Data Center Economics** — LoRA-adapted 7B specialists on 4× RTX 4090 vs Llama 3.1 70B on H100; latency and quality benchmarking under PCIe vs NVLink | **Next empirical priority** |
626
+ | 9 | **Safety-Critical Deployment Validation** — shadow-mode evaluation, auditable logs, and abstention testing in autonomy-style settings; validate modular updateability under regulatory constraints | Planned |
627
+ | **v0.6** | **Privacy-first backend MVP** — localhost correction memory, canonical query normalizer, domain-gated retry loop, context grammar, opt-in cross-user sharing | **In design** |
628
+
629
+ **Phase 8** is the experiment that turns the consumer hardware argument from analytical to empirical. It requires only consumer hardware (4× RTX 4090, ~$1,600 per card on the used market or ~$1.60/hr on RunPod), domain-specific fine-tuning datasets (open source), and the existing routing codebase. The experimental design is fully specified in §10.9 of the whitepaper.
630
+
631
+ **Phase 9** validates the framework's safety-case and certification arguments in autonomy-style settings — see the [Self-Driving](https://praneethtota.github.io/Adaptive-Utility-Agent/domain_self_driving_v0_5.html) and [Autonomous Systems](https://praneethtota.github.io/Adaptive-Utility-Agent/domain_autonomous_systems_v0_5.html) domain docs for the full scope.
632
+
633
+ **v0.6 design** is in `docs/to_do_in_version_v06_revised.md`.
634
+
635
+ ---
636
+
637
+ ## Status
638
+
639
+ Active research project at v0.5. Claims fall into three categories with different levels of evidence:
640
+
641
+ - **Measured** (this work): 69.6% repeated-error reduction, Brier calibration improvement, U↔correctness correlation, +10.5% correctness from VCG arbitration, −17.5% from Regime 2 routing failure
642
+ - **Analytical** (from public specs and published benchmarks): consumer hardware cost model, specialist quality gains
643
+ - **Pending empirical validation**: physical hardware comparison of 7B specialist graph vs 70B monolithic model
644
+
645
+ The gap between the second and third categories — turning the analytical consumer hardware claim into a measured one — is the clearest and most impactful next step, and one that requires only consumer hardware to close. Contributions and collaboration welcome.
646
+
647
+ ---
648
+
649
+ 📖 **Full documentation, domain deep-dives, and builder's tutorial:**
650
+ **https://praneethtota.github.io/Adaptive-Utility-Agent**