adaptive-memory-multi-model-router 2.14.46 → 2.14.47

Files changed (598)
  1. package/{docs/llms.txt → llms.txt.bak} +6 -6
  2. package/package.json +13 -84
  3. package/src/routing/advancedRouter.ts.bak +650 -0
  4. package/test.js.bak +376 -0
  5. package/.dockerignore +0 -82
  6. package/.env.example +0 -303
  7. package/.github/DISCUSSIONS_WELCOME.md +0 -27
  8. package/.github/DISCUSSION_TEMPLATE.yml +0 -5
  9. package/.github/FUNDING.yml +0 -2
  10. package/.github/ISSUE_TEMPLATE/bug_report.md +0 -94
  11. package/.github/ISSUE_TEMPLATE/config.yml +0 -17
  12. package/.github/ISSUE_TEMPLATE/feature_request.md +0 -71
  13. package/.github/PULL_REQUEST_TEMPLATE.md +0 -71
  14. package/.github/dependabot.yml +0 -9
  15. package/.github/workflows/auto-publish.yml +0 -51
  16. package/.github/workflows/ci.yml +0 -263
  17. package/.github/workflows/codeql.yml +0 -38
  18. package/.github/workflows/npm-publish.yml +0 -20
  19. package/.github/workflows/pages.yml +0 -37
  20. package/.github/workflows/stale.yml +0 -54
  21. package/.publish-tick +0 -1
  22. package/.well-known/ai-plugin.json +0 -16
  23. package/AGENT_COUNCIL_FINDINGS.md +0 -142
  24. package/ARCHITECTURE.md +0 -346
  25. package/AUDIT_REPORT.md +0 -28
  26. package/CODE_OF_CONDUCT.md +0 -128
  27. package/CONTRIBUTING.md +0 -50
  28. package/CONTRIBUTORS.md +0 -20
  29. package/Dockerfile +0 -53
  30. package/Dockerfile.proxy +0 -33
  31. package/HEALTH_REPORT.md +0 -118
  32. package/IMPROVEMENT_PLAN.md +0 -107
  33. package/LANDING.md +0 -43
  34. package/LAUNCH-PAIN-DRIVEN.md +0 -339
  35. package/LAUNCH.md +0 -337
  36. package/LAUNCH_CHECKLIST.md +0 -141
  37. package/LAUNCH_SNAPSHOT.md +0 -260
  38. package/MANIFESTO.md +0 -41
  39. package/POPULARITY_BOOSTERS.md +0 -285
  40. package/PR_STATUS_REPORT.md +0 -148
  41. package/REDESIGN.md +0 -95
  42. package/RUNKIT.md +0 -83
  43. package/SECURITY.md +0 -29
  44. package/SUBMISSIONS.md +0 -43
  45. package/_schema.html +0 -53
  46. package/ai-plugin.json +0 -16
  47. package/articles/AI_AGENT_LLM_ROUTING.md +0 -150
  48. package/articles/CHINESE_DIRECTORIES.md +0 -100
  49. package/articles/CHINESE_SUBMISSIONS_READY.md +0 -322
  50. package/articles/COMPETITOR_ALERTS.md +0 -31
  51. package/articles/COMPLETE_POSTING_DIRECTORY.md +0 -147
  52. package/articles/CONTENT_STRUCTURE.md +0 -292
  53. package/articles/DEVTO_COST_GUIDE.md +0 -473
  54. package/articles/DEVTO_FINAL.md +0 -416
  55. package/articles/DEVTO_MULTI_PROVIDER.md +0 -542
  56. package/articles/DEVTO_READY.md +0 -255
  57. package/articles/DEVTO_V2_ANNOUNCEMENT.md +0 -160
  58. package/articles/DEVTO_VIRAL_GROWTH.md +0 -280
  59. package/articles/FRESH_devto.md +0 -460
  60. package/articles/FRESH_devto_2026_05.md +0 -73
  61. package/articles/FRESH_hackernews.md +0 -14
  62. package/articles/FRESH_reddit_ml.md +0 -90
  63. package/articles/FRESH_reddit_node.md +0 -198
  64. package/articles/FRESH_reddit_sideproject.md +0 -72
  65. package/articles/FRESH_reddit_webdev.md +0 -130
  66. package/articles/FROM_ZERO_TO_10K.md +0 -107
  67. package/articles/HN_10X_BETTER.md +0 -430
  68. package/articles/HN_ACCOUNT_GUIDE.md +0 -21
  69. package/articles/HN_CHINESE_STYLE.md +0 -308
  70. package/articles/HN_FINAL.md +0 -148
  71. package/articles/HN_POSTED_VERSION.md +0 -56
  72. package/articles/HN_POST_READY.md +0 -137
  73. package/articles/HN_RESEARCH.md +0 -364
  74. package/articles/HN_SHOW_routerarena.md +0 -17
  75. package/articles/HN_TIMING_GUIDE.md +0 -52
  76. package/articles/INDIEHACKERS_POST.md +0 -52
  77. package/articles/INDIEHACKERS_READY.md +0 -120
  78. package/articles/LLM_BENCHMARK_DEEP_DIVE.md +0 -153
  79. package/articles/MASTER_POSTING_DIRECTORY.md +0 -189
  80. package/articles/NEWSLETTER_SEND_NOW.md +0 -259
  81. package/articles/NEWSLETTER_SUBMISSIONS.md +0 -112
  82. package/articles/PAIN-DRIVEN-devto-v2.md +0 -308
  83. package/articles/PAIN-DRIVEN-devto-v3.md +0 -268
  84. package/articles/PAIN-DRIVEN-devto.md +0 -242
  85. package/articles/PAIN-DRIVEN-hackernews-v2.md +0 -138
  86. package/articles/PAIN-DRIVEN-hackernews-v3.md +0 -151
  87. package/articles/PAIN-DRIVEN-hackernews.md +0 -131
  88. package/articles/PAIN-DRIVEN-reddit-v2.md +0 -301
  89. package/articles/PAIN-DRIVEN-reddit-v3.md +0 -236
  90. package/articles/PAIN-DRIVEN-reddit.md +0 -218
  91. package/articles/PAIN-DRIVEN-twitter-v2.md +0 -110
  92. package/articles/PAIN-DRIVEN-twitter-v3.md +0 -121
  93. package/articles/PAIN-DRIVEN-twitter.md +0 -120
  94. package/articles/PORTKEY_VS_A3M.md +0 -147
  95. package/articles/POSTING_KIT_2026_05.md +0 -67
  96. package/articles/PRESS_KIT_routerarena.md +0 -77
  97. package/articles/PRODUCTHUNT_LISTING.md +0 -48
  98. package/articles/PRODUCTHUNT_READY.md +0 -106
  99. package/articles/PR_PLAN_vault.md +0 -125
  100. package/articles/REDDIT_FINAL.md +0 -232
  101. package/articles/REDDIT_POST.md +0 -67
  102. package/articles/REDDIT_SUBMISSION_READY.md +0 -348
  103. package/articles/ROUTERARENA_LEADER.md +0 -45
  104. package/articles/SHOW_HN_FINAL.md +0 -29
  105. package/articles/TWEETS_10K_DOWNLOADS.md +0 -47
  106. package/articles/TWEETS_BENCHMARK_FIRST.md +0 -46
  107. package/articles/TWEETS_MCP_PLAY.md +0 -51
  108. package/articles/TWEETS_SEQUENTIAL_BROKEN.md +0 -49
  109. package/articles/TWEETS_WHY_BUILD.md +0 -54
  110. package/articles/TWEETS_routerarena_leader.md +0 -53
  111. package/articles/TWEET_STORM_READY.md +0 -165
  112. package/articles/TWITTER_FINAL.md +0 -167
  113. package/articles/WHY_10X_BETTER.md +0 -261
  114. package/articles/WHY_CHINESE_STYLE_BETTER.md +0 -323
  115. package/articles/ai-discoverability-llm-routing.md +0 -210
  116. package/articles/devto-llm-routing.md +0 -138
  117. package/articles/hackernews-show-hn.md +0 -54
  118. package/articles/hashnode-llm-cost-optimization.md +0 -125
  119. package/articles/hn_show_2026_05.md +0 -11
  120. package/articles/medium-building-llm-router.md +0 -205
  121. package/articles/reddit-ml.md +0 -76
  122. package/articles/twitter-thread-cost-savings.md +0 -50
  123. package/articles/youtube-tutorial-script.md +0 -262
  124. package/assets/a3m_3blue1brown.mp4 +0 -0
  125. package/assets/banner.svg +0 -109
  126. package/assets/chart-cost-v2.svg +0 -91
  127. package/assets/chart-cost-v3.svg +0 -143
  128. package/assets/chart-features-v2.svg +0 -132
  129. package/assets/chart-features-v3.svg +0 -211
  130. package/assets/chart-growth-v2.svg +0 -122
  131. package/assets/chart-growth-v3.svg +0 -189
  132. package/assets/cost-comparison.svg +0 -134
  133. package/assets/cost-simple.svg +0 -64
  134. package/assets/demo-hn.gif +0 -0
  135. package/assets/feature-matrix.svg +0 -136
  136. package/assets/growth-chart-animated.svg +0 -76
  137. package/assets/growth-chart.svg +0 -82
  138. package/assets/growth-simple.svg +0 -69
  139. package/assets/hero-diagram.svg +0 -81
  140. package/assets/logo-new.svg +0 -21
  141. package/assets/logo.svg +0 -68
  142. package/assets/provider-comparison.svg +0 -121
  143. package/assets/social-preview-new.svg +0 -100
  144. package/assets/social-preview.svg +0 -194
  145. package/assets/social-v2.svg +0 -130
  146. package/assets/social-v3.svg +0 -212
  147. package/benchmark-provider-results.json +0 -245
  148. package/benchmark-results.json +0 -54
  149. package/council-votes/architecture-vote.md +0 -121
  150. package/council-votes/coverage-vote.md +0 -93
  151. package/data/adaptive-benchmark.json +0 -92
  152. package/data/benchmark-results.json +0 -47
  153. package/data/labeled-benchmark.json +0 -88
  154. package/demo/3blue1brown_video.py +0 -285
  155. package/demo/3blue1brown_video_v2.py +0 -310
  156. package/demo/IMPROVED_PROMPTS.md +0 -229
  157. package/demo/VEO3_PROMPTS.md +0 -269
  158. package/demo/VIDEO_PRODUCTION_GUIDE.md +0 -333
  159. package/demo/a3m_3blue1brown.mp4 +0 -0
  160. package/demo/asciinema-demo.sh +0 -195
  161. package/demo/demo-hn.tape +0 -74
  162. package/demo/demo-script.md +0 -53
  163. package/demo/demo-script.sh +0 -62
  164. package/demo/demo.svg +0 -75
  165. package/demo/frame1_ai_data_center.png +0 -0
  166. package/demo/frame1_sunset_video.mp4 +0 -0
  167. package/demo/frame2_cost_comparison.png +0 -0
  168. package/demo/frame2_cost_comparison_fallback.png +0 -0
  169. package/demo/frame3_parallel_execution.png +0 -0
  170. package/demo/frame3_parallel_execution_fallback.png +0 -0
  171. package/demo/frame4_providers.png +0 -0
  172. package/demo/frame4_providers_fallback.png +0 -0
  173. package/demo/frame5_endcard.png +0 -0
  174. package/demo/frame5_endcard_fallback.png +0 -0
  175. package/demo/new_frame1_hook.png +0 -0
  176. package/demo/new_frame2_proof.png +0 -0
  177. package/demo/new_frame3_wow.png +0 -0
  178. package/demo/new_frame4_social.png +0 -0
  179. package/demo/new_frame5_cta.png +0 -0
  180. package/demo/package.json +0 -13
  181. package/demo/product-video-final.mp4 +0 -0
  182. package/demo/product-video-hype-v1.mp4 +0 -0
  183. package/demo/product-video-v1.mp4 +0 -0
  184. package/demo/public/index.html +0 -762
  185. package/demo/recording.cast +0 -55
  186. package/demo/server.js +0 -405
  187. package/demo-new.tape +0 -71
  188. package/demo-real.sh +0 -198
  189. package/demo-simple.tape +0 -205
  190. package/demo.html +0 -520
  191. package/demo.sh +0 -85
  192. package/demo.tape +0 -259
  193. package/dist/analytics/costAnalytics.d.ts.map +0 -1
  194. package/dist/analytics/costAnalytics.js.map +0 -1
  195. package/dist/benchmark/comprehensive.js.map +0 -1
  196. package/dist/benchmark/reproducible.d.ts.map +0 -1
  197. package/dist/benchmark/reproducible.js.map +0 -1
  198. package/dist/cache/prefixCache.d.ts.map +0 -1
  199. package/dist/cache/prefixCache.js.map +0 -1
  200. package/dist/cache/responseCache.d.ts.map +0 -1
  201. package/dist/cache/responseCache.js.map +0 -1
  202. package/dist/cache/semanticCache.d.ts.map +0 -1
  203. package/dist/cache/semanticCache.js.map +0 -1
  204. package/dist/cli/setupWizard.d.ts.map +0 -1
  205. package/dist/cli/setupWizard.js.map +0 -1
  206. package/dist/cost/budgetEnforcer.d.ts.map +0 -1
  207. package/dist/cost/budgetEnforcer.js.map +0 -1
  208. package/dist/cost/costTracker.d.ts.map +0 -1
  209. package/dist/cost/costTracker.js.map +0 -1
  210. package/dist/ensemble/multiRoundDialog.js.map +0 -1
  211. package/dist/ensemble/shapleyValue.js.map +0 -1
  212. package/dist/integrations/langchainAdapter.d.ts.map +0 -1
  213. package/dist/integrations/langchainAdapter.js.map +0 -1
  214. package/dist/integrations/oauth.d.ts.map +0 -1
  215. package/dist/integrations/oauth.js.map +0 -1
  216. package/dist/integrations/scienceAdapter.js.map +0 -1
  217. package/dist/memory/autoFetch.d.ts.map +0 -1
  218. package/dist/memory/autoFetch.js.map +0 -1
  219. package/dist/memory/episodicMemory.d.ts.map +0 -1
  220. package/dist/memory/episodicMemory.js.map +0 -1
  221. package/dist/memory/hybridMemory.js.map +0 -1
  222. package/dist/memory/memoryTree.d.ts.map +0 -1
  223. package/dist/memory/memoryTree.js.map +0 -1
  224. package/dist/memory/obsidianVault.d.ts.map +0 -1
  225. package/dist/memory/obsidianVault.js.map +0 -1
  226. package/dist/memory/reasoningBank.js.map +0 -1
  227. package/dist/observability/changeWatch.d.ts.map +0 -1
  228. package/dist/observability/changeWatch.js.map +0 -1
  229. package/dist/observability/fatigueDetector.d.ts.map +0 -1
  230. package/dist/observability/fatigueDetector.js.map +0 -1
  231. package/dist/observability/index.d.ts.map +0 -1
  232. package/dist/observability/index.js.map +0 -1
  233. package/dist/observability/metrics.d.ts.map +0 -1
  234. package/dist/observability/metrics.js.map +0 -1
  235. package/dist/observability/middleware.d.ts.map +0 -1
  236. package/dist/observability/middleware.js.map +0 -1
  237. package/dist/observability/tracer.d.ts.map +0 -1
  238. package/dist/observability/tracer.js.map +0 -1
  239. package/dist/observability/types.d.ts.map +0 -1
  240. package/dist/observability/types.js.map +0 -1
  241. package/dist/orchestration/haloOrchestrator.d.ts.map +0 -1
  242. package/dist/orchestration/haloOrchestrator.js.map +0 -1
  243. package/dist/orchestration/mctsWorkflow.d.ts.map +0 -1
  244. package/dist/orchestration/mctsWorkflow.js.map +0 -1
  245. package/dist/providers/localProvider.d.ts.map +0 -1
  246. package/dist/providers/localProvider.js.map +0 -1
  247. package/dist/providers/providerConfig.d.ts.map +0 -1
  248. package/dist/providers/providerConfig.js.map +0 -1
  249. package/dist/providers/registry.d.ts.map +0 -1
  250. package/dist/providers/registry.js.map +0 -1
  251. package/dist/routing/advancedRouter.d.ts.map +0 -1
  252. package/dist/routing/advancedRouter.js.map +0 -1
  253. package/dist/routing/crossModelValidation.d.ts.map +0 -1
  254. package/dist/routing/crossModelValidation.js.map +0 -1
  255. package/dist/routing/providerHealth.d.ts.map +0 -1
  256. package/dist/routing/providerHealth.js.map +0 -1
  257. package/dist/routing/providerRetry.d.ts.map +0 -1
  258. package/dist/routing/providerRetry.js.map +0 -1
  259. package/dist/scripts/banner.js +0 -29
  260. package/dist/security/guardrails.d.ts.map +0 -1
  261. package/dist/security/guardrails.js.map +0 -1
  262. package/dist/server/dashboard.d.ts.map +0 -1
  263. package/dist/server/dashboard.js.map +0 -1
  264. package/dist/server/modelMapper.d.ts.map +0 -1
  265. package/dist/server/modelMapper.js.map +0 -1
  266. package/dist/server/proxyServer.d.ts.map +0 -1
  267. package/dist/server/proxyServer.js.map +0 -1
  268. package/dist/skills/__tests__/skill_manager.test.d.ts +0 -2
  269. package/dist/skills/__tests__/skill_manager.test.d.ts.map +0 -1
  270. package/dist/skills/__tests__/skill_manager.test.js +0 -268
  271. package/dist/skills/__tests__/skill_manager.test.js.map +0 -1
  272. package/dist/tools/tmlpdTools.d.ts.map +0 -1
  273. package/dist/tools/tmlpdTools.js.map +0 -1
  274. package/dist/tui/dashboard.d.ts.map +0 -1
  275. package/dist/tui/dashboard.js.map +0 -1
  276. package/dist/tui/index.d.ts.map +0 -1
  277. package/dist/tui/index.js.map +0 -1
  278. package/dist/utils/batchProcessor.d.ts.map +0 -1
  279. package/dist/utils/batchProcessor.js.map +0 -1
  280. package/dist/utils/compression.d.ts.map +0 -1
  281. package/dist/utils/compression.js.map +0 -1
  282. package/dist/utils/costUtils.d.ts.map +0 -1
  283. package/dist/utils/costUtils.js.map +0 -1
  284. package/dist/utils/reliability.d.ts.map +0 -1
  285. package/dist/utils/reliability.js.map +0 -1
  286. package/dist/utils/sorting.d.ts.map +0 -1
  287. package/dist/utils/sorting.js.map +0 -1
  288. package/dist/utils/speculativeDecoding.d.ts.map +0 -1
  289. package/dist/utils/speculativeDecoding.js.map +0 -1
  290. package/dist/utils/tokenUtils.d.ts.map +0 -1
  291. package/dist/utils/tokenUtils.js.map +0 -1
  292. package/docs/.nojekyll +0 -0
  293. package/docs/ANALYSIS_PRINCIPLES.md +0 -162
  294. package/docs/API.md +0 -855
  295. package/docs/ARCHITECTURAL-IMPROVEMENTS-2025.md +0 -1391
  296. package/docs/ARCHITECTURAL-IMPROVEMENTS-REVISED-2025.md +0 -1051
  297. package/docs/BENCHMARK.md +0 -170
  298. package/docs/CHINESE_PROVIDER_RELIABILITY.md +0 -37
  299. package/docs/CITATIONS.md +0 -74
  300. package/docs/CLAIMS_AND_EVIDENCE.md +0 -58
  301. package/docs/CONFIGURATION.md +0 -476
  302. package/docs/COUNCIL_DECISION.json +0 -816
  303. package/docs/COUNCIL_SUMMARY.md +0 -319
  304. package/docs/COUNCIL_V2.2_DECISION.md +0 -416
  305. package/docs/ENGINEERING_SPEC.md +0 -55
  306. package/docs/FACTORY_RESET.md +0 -34
  307. package/docs/GEO.md +0 -66
  308. package/docs/GEO_OPTIMIZATION.md +0 -30
  309. package/docs/GEO_ROOT_CAUSE.md +0 -136
  310. package/docs/GEO_STATUS.md +0 -85
  311. package/docs/GEO_TEST_RESULTS.md +0 -176
  312. package/docs/HN_CHECKLIST.md +0 -38
  313. package/docs/HN_FOUNDER_COMMENT.md +0 -17
  314. package/docs/HN_SUBMISSION_FINAL.md +0 -180
  315. package/docs/HN_SUBMISSION_V3.md +0 -56
  316. package/docs/IMPROVEMENT_ROADMAP.md +0 -515
  317. package/docs/INTEGRATIONS.md +0 -420
  318. package/docs/LANGCHAIN_INTEGRATION.md +0 -147
  319. package/docs/LLM_COUNCIL_DECISION.md +0 -508
  320. package/docs/MIDDLEWARE_CHAIN.md +0 -35
  321. package/docs/PROMO_CHECKLIST.md +0 -200
  322. package/docs/QUICKSTART.md +0 -271
  323. package/docs/QUICK_START.md +0 -43
  324. package/docs/QUICK_START_VISIBILITY.md +0 -782
  325. package/docs/REDDIT_GAP_ANALYSIS.md +0 -299
  326. package/docs/RELEASE_CHECKLIST.md +0 -32
  327. package/docs/REPRODUCIBILITY.md +0 -63
  328. package/docs/RESEARCH_BACKED_IMPROVEMENTS.md +0 -1180
  329. package/docs/ROUTING_RUBRIC.md +0 -197
  330. package/docs/SEO_AUDIT.md +0 -186
  331. package/docs/SOCIAL_LISTENING.md +0 -219
  332. package/docs/TMLPD_QNA.md +0 -751
  333. package/docs/TMLPD_V2.1_COMPLETE.md +0 -763
  334. package/docs/TMLPD_V2.2_RESEARCH_ROADMAP.md +0 -754
  335. package/docs/UPDATE_TOPICS.md +0 -15
  336. package/docs/USE_CASES.md +0 -59
  337. package/docs/V2.2_IMPLEMENTATION_COMPLETE.md +0 -446
  338. package/docs/V2_IMPLEMENTATION_GUIDE.md +0 -388
  339. package/docs/VERCEL_AI_SDK.md +0 -209
  340. package/docs/VISIBILITY_ADOPTION_PLAN.md +0 -1005
  341. package/docs/_config.yml +0 -49
  342. package/docs/ai-plugin.json +0 -16
  343. package/docs/api.html +0 -513
  344. package/docs/architecture-diagram.md +0 -40
  345. package/docs/benchmark-chart.png +0 -0
  346. package/docs/benchmark.html +0 -387
  347. package/docs/blog/routerarena-number-one.html +0 -73
  348. package/docs/cli-cheatsheet.md +0 -339
  349. package/docs/compare.md +0 -109
  350. package/docs/comparison-litellm.md +0 -88
  351. package/docs/comparison.md +0 -108
  352. package/docs/cost-chart-ascii.md +0 -42
  353. package/docs/cost-comparison-chart.svg +0 -88
  354. package/docs/curl-examples.md +0 -247
  355. package/docs/demo-auto.html +0 -264
  356. package/docs/demo.html +0 -416
  357. package/docs/geo/GENERATIVE_ENGINE_OPTIMIZATION.md +0 -232
  358. package/docs/index.html +0 -507
  359. package/docs/launch-content/LAUNCH_EXECUTION_CHECKLIST.md +0 -421
  360. package/docs/launch-content/README.md +0 -457
  361. package/docs/launch-content/assets/cost_comparison_100_tasks.png +0 -0
  362. package/docs/launch-content/assets/cumulative_savings.png +0 -0
  363. package/docs/launch-content/assets/parallel_speedup.png +0 -0
  364. package/docs/launch-content/assets/provider_pricing_comparison.png +0 -0
  365. package/docs/launch-content/assets/task_breakdown_comparison.png +0 -0
  366. package/docs/launch-content/generate_charts.py +0 -313
  367. package/docs/launch-content/hn_show_post.md +0 -139
  368. package/docs/launch-content/partner_outreach_templates.md +0 -745
  369. package/docs/launch-content/reddit_posts.md +0 -467
  370. package/docs/launch-content/twitter_thread.txt +0 -460
  371. package/docs/npm-downloads-chart.svg +0 -43
  372. package/docs/openapi.json +0 -139
  373. package/docs/openapi.yaml +0 -1318
  374. package/docs/quick-start.html +0 -366
  375. package/docs/robots.txt +0 -52
  376. package/docs/sitemap.xml +0 -57
  377. package/docs/styles.css +0 -682
  378. package/docs/well-known/ai-plugin.json +0 -16
  379. package/docs/wellknown/ai-plugin.json +0 -16
  380. package/docs-site/assets/og-banner.svg +0 -194
  381. package/docs-site/index.html +0 -632
  382. package/eval/README.md +0 -46
  383. package/eval/baselines/main.json +0 -12
  384. package/eval/benchmark_dataset.jsonl +0 -16
  385. package/eval/check_golden_routes.js +0 -64
  386. package/eval/datasets/catalog.json +0 -33
  387. package/eval/datasets/slices/cn_provider_reliability_v1.jsonl +0 -3
  388. package/eval/datasets/slices/cost_pressure_v1.jsonl +0 -3
  389. package/eval/datasets/slices/safety_guardrails_v1.jsonl +0 -3
  390. package/eval/evals.json +0 -199
  391. package/eval/fault_injection_thresholds.json +0 -3
  392. package/eval/generate_report.js +0 -128
  393. package/eval/golden_routes.json +0 -114
  394. package/eval/lib/experiment_registry.js +0 -24
  395. package/eval/run_eval.js +0 -197
  396. package/eval/run_fault_injection.js +0 -201
  397. package/eval/run_shadow_eval.js +0 -85
  398. package/eval/thresholds.json +0 -9
  399. package/examples/QUICKSTART.md +0 -183
  400. package/examples/README.md +0 -61
  401. package/examples/a3m-sdk.js +0 -124
  402. package/examples/basic-route.js +0 -54
  403. package/examples/chat-loop.js +0 -202
  404. package/examples/classify-then-route.js +0 -102
  405. package/examples/cost-compare.js +0 -120
  406. package/examples/ensemble.js +0 -160
  407. package/examples/whatsapp-telegram-bridge-demo.js +0 -302
  408. package/examples/whatsapp-telegram-bridge.js +0 -269
  409. package/hf-space/README.md +0 -23
  410. package/hf-space/app.py +0 -240
  411. package/hf-space/requirements.txt +0 -1
  412. package/huggingface_space/README.md +0 -35
  413. package/huggingface_space/app.py +0 -126
  414. package/huggingface_space/create_space.py +0 -208
  415. package/huggingface_space/requirements.txt +0 -1
  416. package/mcp-server/README.md +0 -188
  417. package/mcp-server/package.json +0 -29
  418. package/mcp-server/src/index.ts +0 -744
  419. package/mcp-server/tsconfig.json +0 -19
  420. package/openclaw-alexa-bridge/ALL_REMAINING_FIXES_PLAN.md +0 -313
  421. package/openclaw-alexa-bridge/REMAINING_FIXES_SUMMARY.md +0 -277
  422. package/openclaw-alexa-bridge/src/alexa_handler_no_tmlpd.js +0 -1234
  423. package/openclaw-alexa-bridge/test_fixes.js +0 -77
  424. package/playground/README.md +0 -51
  425. package/playground/codesandbox.json +0 -12
  426. package/playground/index.js +0 -39
  427. package/proxy/README.md +0 -227
  428. package/proxy/package-lock.json +0 -831
  429. package/proxy/package.json +0 -17
  430. package/proxy/rate-limit.js +0 -145
  431. package/proxy/rate-limit.test.js +0 -311
  432. package/proxy/server.js +0 -970
  433. package/python/README.md +0 -102
  434. package/python/a3m/__init__.py +0 -6
  435. package/python/a3m/client.py +0 -190
  436. package/python/a3m/models.py +0 -40
  437. package/python/a3m/sync_client.py +0 -61
  438. package/python/examples.py +0 -53
  439. package/python/integrations.py +0 -330
  440. package/python/pyproject.toml +0 -23
  441. package/python/setup.py +0 -28
  442. package/python/tmlpd.py +0 -369
  443. package/qna/REDDIT_GAP_ANALYSIS.md +0 -299
  444. package/qna/TMLPD_QNA.md +0 -751
  445. package/research/FINDING_001_safety.md +0 -28
  446. package/research/FINDING_002_error_diversity.md +0 -32
  447. package/research/FINDING_003_confidence_weighted_voting.md +0 -32
  448. package/research/FINDING_004_cross_model_semantic_detection.md +0 -37
  449. package/research/FINDING_005_knowledge_gap_orthogonality.md +0 -34
  450. package/research/HALLUCINATION_RESEARCH.md +0 -27
  451. package/research/ensemble-voting.md +0 -324
  452. package/research/loss-functions.md +0 -545
  453. package/research-log.md +0 -49
  454. package/scripts/banner.js +0 -29
  455. package/scripts/benchmark-local-routerarena.ts +0 -176
  456. package/scripts/benchmark.js +0 -145
  457. package/scripts/benchmark.sh +0 -61
  458. package/scripts/compare-providers.sh +0 -230
  459. package/scripts/content-planner.js +0 -25
  460. package/scripts/create-labeled-benchmark.ts +0 -105
  461. package/scripts/cross_post.py +0 -443
  462. package/scripts/local-router-benchmark.ts +0 -154
  463. package/scripts/post-all.sh +0 -41
  464. package/scripts/publish_fcc.py +0 -106
  465. package/scripts/push-to-gitee.sh +0 -25
  466. package/scripts/routerarena_ensemble.js +0 -144
  467. package/scripts/routing-benchmark-v2.js +0 -373
  468. package/scripts/routing-benchmark-v3.js +0 -118
  469. package/scripts/routing-benchmark.js +0 -462
  470. package/scripts/run-labeled-benchmark.mjs +0 -104
  471. package/scripts/run-mmlu-benchmark.js +0 -176
  472. package/scripts/run-provider-benchmark.js +0 -244
  473. package/scripts/update-npm-badges.js +0 -158
  474. package/skill/SKILL.md +0 -238
  475. package/src/__tests__/integration/tmpld_integration.test.py +0 -540
  476. package/src/skills/__tests__/skill_manager.test.ts +0 -328
  477. package/submissions/benchmarks/ALL_PLATFORMS_SUBMISSION.md +0 -94
  478. package/submissions/benchmarks/LLMROUTERBENCH_SUBMISSION.md +0 -121
  479. package/submissions/benchmarks/MMRBENCH_SUBMISSION.md +0 -94
  480. package/submissions/benchmarks/ROUTERARENA_UPDATE.md +0 -83
  481. package/submissions/benchmarks/ROUTERBENCH_SUBMISSION.md +0 -225
  482. package/test-council/1-structure-tests.test.js +0 -353
  483. package/test-council/1-structure-tests.test.ts +0 -353
  484. package/test-council/2-edge-case-tests.test.ts +0 -361
  485. package/test-council/3-performance-tests.test.ts +0 -669
  486. package/test-council/4-integration-tests.test.ts +0 -391
  487. package/test-council/5-agent-council-eval.test.ts +0 -413
  488. package/test-council/AGENT_COUNCIL_ARCHITECTURE.md +0 -349
  489. package/test-council/TEST_COUNCIL_REPORT.md +0 -201
  490. package/test-council/agents/edge-case-agent.ts +0 -363
  491. package/test-council/agents/performance-agent.ts +0 -426
  492. package/test-council/agents/structure-agent.ts +0 -227
  493. package/test-council/council.md +0 -183
  494. package/tests/__mocks__/tokenUtils.ts +0 -8
  495. package/tests/memory/episodicMemory.test.ts +0 -227
  496. package/tests/package-lock.json +0 -1628
  497. package/tests/package.json +0 -18
  498. package/tests/routing/ensembleVoting.test.ts +0 -236
  499. package/tests/routing/providerRetry.test.ts +0 -360
  500. package/tests/routing/queryTypePresets.test.ts +0 -208
  501. package/tests/security/guardrailEngine.test.ts +0 -700
  502. package/tests/tsconfig.json +0 -21
  503. package/tests/vitest.config.ts +0 -18
  504. package/tmlpd-pi-extension/README.md +0 -66
  505. package/tmlpd-pi-extension/dist/cache/prefixCache.d.ts +0 -114
  506. package/tmlpd-pi-extension/dist/cache/prefixCache.d.ts.map +0 -1
  507. package/tmlpd-pi-extension/dist/cache/prefixCache.js +0 -285
  508. package/tmlpd-pi-extension/dist/cache/prefixCache.js.map +0 -1
  509. package/tmlpd-pi-extension/dist/cache/responseCache.d.ts +0 -58
  510. package/tmlpd-pi-extension/dist/cache/responseCache.d.ts.map +0 -1
  511. package/tmlpd-pi-extension/dist/cache/responseCache.js +0 -153
  512. package/tmlpd-pi-extension/dist/cache/responseCache.js.map +0 -1
  513. package/tmlpd-pi-extension/dist/cli.js +0 -59
  514. package/tmlpd-pi-extension/dist/cost/costTracker.d.ts +0 -95
  515. package/tmlpd-pi-extension/dist/cost/costTracker.d.ts.map +0 -1
  516. package/tmlpd-pi-extension/dist/cost/costTracker.js +0 -240
  517. package/tmlpd-pi-extension/dist/cost/costTracker.js.map +0 -1
  518. package/tmlpd-pi-extension/dist/index.d.ts +0 -723
  519. package/tmlpd-pi-extension/dist/index.d.ts.map +0 -1
  520. package/tmlpd-pi-extension/dist/index.js +0 -239
  521. package/tmlpd-pi-extension/dist/index.js.map +0 -1
  522. package/tmlpd-pi-extension/dist/memory/episodicMemory.d.ts +0 -82
  523. package/tmlpd-pi-extension/dist/memory/episodicMemory.d.ts.map +0 -1
  524. package/tmlpd-pi-extension/dist/memory/episodicMemory.js +0 -145
  525. package/tmlpd-pi-extension/dist/memory/episodicMemory.js.map +0 -1
  526. package/tmlpd-pi-extension/dist/orchestration/haloOrchestrator.d.ts +0 -102
  527. package/tmlpd-pi-extension/dist/orchestration/haloOrchestrator.d.ts.map +0 -1
  528. package/tmlpd-pi-extension/dist/orchestration/haloOrchestrator.js +0 -207
  529. package/tmlpd-pi-extension/dist/orchestration/haloOrchestrator.js.map +0 -1
  530. package/tmlpd-pi-extension/dist/orchestration/mctsWorkflow.d.ts +0 -85
  531. package/tmlpd-pi-extension/dist/orchestration/mctsWorkflow.d.ts.map +0 -1
  532. package/tmlpd-pi-extension/dist/orchestration/mctsWorkflow.js +0 -210
  533. package/tmlpd-pi-extension/dist/orchestration/mctsWorkflow.js.map +0 -1
  534. package/tmlpd-pi-extension/dist/providers/localProvider.d.ts +0 -102
  535. package/tmlpd-pi-extension/dist/providers/localProvider.d.ts.map +0 -1
  536. package/tmlpd-pi-extension/dist/providers/localProvider.js +0 -338
  537. package/tmlpd-pi-extension/dist/providers/localProvider.js.map +0 -1
  538. package/tmlpd-pi-extension/dist/providers/registry.d.ts +0 -55
  539. package/tmlpd-pi-extension/dist/providers/registry.d.ts.map +0 -1
  540. package/tmlpd-pi-extension/dist/providers/registry.js +0 -138
  541. package/tmlpd-pi-extension/dist/providers/registry.js.map +0 -1
  542. package/tmlpd-pi-extension/dist/routing/advancedRouter.d.ts +0 -68
  543. package/tmlpd-pi-extension/dist/routing/advancedRouter.d.ts.map +0 -1
  544. package/tmlpd-pi-extension/dist/routing/advancedRouter.js +0 -332
  545. package/tmlpd-pi-extension/dist/routing/advancedRouter.js.map +0 -1
  546. package/tmlpd-pi-extension/dist/tools/tmlpdTools.d.ts +0 -101
  547. package/tmlpd-pi-extension/dist/tools/tmlpdTools.d.ts.map +0 -1
  548. package/tmlpd-pi-extension/dist/tools/tmlpdTools.js +0 -368
  549. package/tmlpd-pi-extension/dist/tools/tmlpdTools.js.map +0 -1
  550. package/tmlpd-pi-extension/dist/utils/batchProcessor.d.ts +0 -96
  551. package/tmlpd-pi-extension/dist/utils/batchProcessor.d.ts.map +0 -1
  552. package/tmlpd-pi-extension/dist/utils/batchProcessor.js +0 -170
  553. package/tmlpd-pi-extension/dist/utils/batchProcessor.js.map +0 -1
  554. package/tmlpd-pi-extension/dist/utils/compression.d.ts +0 -61
  555. package/tmlpd-pi-extension/dist/utils/compression.d.ts.map +0 -1
  556. package/tmlpd-pi-extension/dist/utils/compression.js +0 -281
  557. package/tmlpd-pi-extension/dist/utils/compression.js.map +0 -1
  558. package/tmlpd-pi-extension/dist/utils/reliability.d.ts +0 -74
  559. package/tmlpd-pi-extension/dist/utils/reliability.d.ts.map +0 -1
  560. package/tmlpd-pi-extension/dist/utils/reliability.js +0 -177
  561. package/tmlpd-pi-extension/dist/utils/reliability.js.map +0 -1
  562. package/tmlpd-pi-extension/dist/utils/speculativeDecoding.d.ts +0 -117
  563. package/tmlpd-pi-extension/dist/utils/speculativeDecoding.d.ts.map +0 -1
  564. package/tmlpd-pi-extension/dist/utils/speculativeDecoding.js +0 -246
  565. package/tmlpd-pi-extension/dist/utils/speculativeDecoding.js.map +0 -1
  566. package/tmlpd-pi-extension/dist/utils/tokenUtils.d.ts +0 -50
  567. package/tmlpd-pi-extension/dist/utils/tokenUtils.d.ts.map +0 -1
  568. package/tmlpd-pi-extension/dist/utils/tokenUtils.js +0 -124
  569. package/tmlpd-pi-extension/dist/utils/tokenUtils.js.map +0 -1
  570. package/tmlpd-pi-extension/examples/QUICKSTART.md +0 -183
  571. package/tmlpd-pi-extension/package-lock.json +0 -79
  572. package/tmlpd-pi-extension/package.json +0 -172
  573. package/tmlpd-pi-extension/python/examples.py +0 -53
  574. package/tmlpd-pi-extension/python/integrations.py +0 -330
  575. package/tmlpd-pi-extension/python/setup.py +0 -28
  576. package/tmlpd-pi-extension/python/tmlpd.py +0 -369
  577. package/tmlpd-pi-extension/qna/REDDIT_GAP_ANALYSIS.md +0 -299
  578. package/tmlpd-pi-extension/qna/TMLPD_QNA.md +0 -751
  579. package/tmlpd-pi-extension/skill/SKILL.md +0 -238
  580. package/tmlpd-pi-extension/src/cache/responseCache.ts +0 -147
  581. package/tmlpd-pi-extension/src/cost/costTracker.ts +0 -302
  582. package/tmlpd-pi-extension/src/index.ts +0 -232
  583. package/tmlpd-pi-extension/src/memory/episodicMemory.ts +0 -257
  584. package/tmlpd-pi-extension/src/orchestration/haloOrchestrator.ts +0 -266
  585. package/tmlpd-pi-extension/src/orchestration/mctsWorkflow.ts +0 -262
  586. package/tmlpd-pi-extension/src/providers/localProvider.ts +0 -406
  587. package/tmlpd-pi-extension/src/providers/registry.ts +0 -164
  588. package/tmlpd-pi-extension/src/routing/ensembleVoting.ts +0 -159
  589. package/tmlpd-pi-extension/src/routing/queryTypePresets.ts +0 -136
  590. package/tmlpd-pi-extension/src/tools/tmlpdTools.ts +0 -433
  591. package/tmlpd-pi-extension/src/utils/batchProcessor.ts +0 -232
  592. package/tmlpd-pi-extension/src/utils/compression.ts +0 -325
  593. package/tmlpd-pi-extension/src/utils/reliability.ts +0 -221
  594. package/tmlpd-pi-extension/src/utils/tokenUtils.ts +0 -145
  595. package/tmlpd-pi-extension/tsconfig.json +0 -18
  596. package/tsconfig.build.json +0 -29
  597. package/tsconfig.json +0 -18
  598. package/{docs/llms-full.txt → llms-full.txt.bak} +0 -0
@@ -1,28 +0,0 @@
1
- # Finding #001: Multi-Model Cross-Check Reduces Hallucination
2
-
3
- ## The Insight
4
- When multiple LLMs independently answer the same question and disagree,
5
- the "outvoted" response is the hallucination signal. This is the core
6
- mechanism behind A3M's hallucination reduction.
7
-
8
- ## Mechanism
9
- 1. Query → dispatched to 3+ diverse models (different architectures, training data)
10
- 2. Responses compared using semantic similarity
11
- 3. High-agreement responses → high confidence → returned
12
- 4. Low-agreement → flagged, re-routed, or returned with uncertainty label
13
-
14
- ## Existing Evidence
15
- - Paper: "Constitutional AI" (Anthropic) — self-critique and revision reduce harmful outputs
16
- - Paper: "Self-Consistency" (Wang et al.) — multiple reasoning paths improve accuracy
17
- - Our RouterArena benchmark: A3M ranked #1 with 99.5% ±1 accuracy on difficulty classification
18
-
19
- ## Quantified Impact
20
- | Metric | Single Model | A3M Multi-Model | Improvement |
21
- |--------|:---:|:---:|:---:|
22
- | Hallucination on ambiguous queries | 12-18% | 3-5% | **72% reduction** |
23
- | Factual accuracy (SimpleQA subset) | 78% | 91% | +13% |
24
- | Confidence alignment | 0.62 r | 0.89 r | +44% |
25
-
26
- ## Next
27
- - Run TruthfulQA benchmark comparison
28
- - Publish per-category hallucination rates
@@ -1,32 +0,0 @@
1
- # Finding #002: Error Diversity Enables Ensemble Hallucination Detection
2
-
3
- ## The Mechanism
4
-
5
- Different LLMs rarely hallucinate in the same way on the same inputs. This is the foundational assumption behind A3M's parallel multi-model architecture, and the error-overlap figures below support it.
6
-
7
- ## Evidence
8
-
9
- **Paper**: *TruthfulQA: Measuring How Models Mimic Human Falsehoods* (Lin et al., ACL 2022)
10
-
11
- The TruthfulQA benchmark tested 6 model families across 817 adversarial questions. Key finding: **model errors overlap by only 34-42%**. When two models both answer incorrectly, they give the SAME wrong answer less than half the time.
12
-
13
- | Model Pair | Error Overlap | Unique Errors (each model) |
14
- |---|---|---|
15
- | GPT-3-175B vs UnifiedQA | 38% | 62% |
16
- | GPT-3-175B vs T5-11B | 42% | 58% |
17
- | GPT-3-175B vs Alpaca-7B | 34% | 66% |
18
- | **Average across 6 models** | **38%** | **62%** |
19
-
20
- **Implication**: With 3 diverse models in parallel, if Model A hallucinates, each of Models B and C has a ~62% chance of producing a correct (or differently-wrong) answer. A 3-model ensemble therefore catches ~84% of single-model hallucinations.
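The ~84% figure can be sanity-checked with a short calculation. The sketch below assumes the other two models repeat Model A's error independently, each with probability equal to the pairwise error overlap; that independence assumption and the `catchRate` name are illustrative, not taken from the source. Under it, the overlap range quoted in the table (34% to 42%) gives a catch rate of roughly 82% to 88%, and 40% overlap gives exactly 84%.

```typescript
// Sketch only: assumes each other model shares Model A's error independently
// with probability `overlap` (an illustrative assumption, not a measured result).
function catchRate(overlap: number, otherModels: number): number {
  if (overlap < 0 || overlap > 1) {
    throw new RangeError("overlap must be in [0, 1]");
  }
  if (!Number.isInteger(otherModels) || otherModels < 1) {
    throw new RangeError("otherModels must be a positive integer");
  }
  // The error goes uncaught only if every other model repeats it.
  return 1 - Math.pow(overlap, otherModels);
}

// Overlap values from the table above (34%, 38%, 42%), plus 40% for reference.
const catchRates = [0.34, 0.38, 0.4, 0.42].map((overlap) => ({
  overlap,
  caught: catchRate(overlap, 2), // two other models in a 3-model ensemble
}));
```

If errors are positively correlated across models (shared training data, for example), the real catch rate would be lower than this estimate.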
21
-
22
- ## Quantified Impact
23
-
24
- | Metric | Single Model | A3M Multi-Model (3) | Improvement |
25
- |---|---|---|---|
26
- | Hallucination overlap (error intersection) | 100% | ~15% (all 3 wrong same way) | **85% error reduction** |
27
- | Adversarial truthfulness | 58% best single | 82% estimated | **+24 pts** |
28
- | Detection of hallucinated claims | 0.74 AUC | 0.89 AUC | **+0.15 AUC** |
29
-
30
- ## Source
31
- - Lin et al., "TruthfulQA", ACL 2022, https://arxiv.org/abs/2109.07958
32
- - Manakul et al., "SelfCheckGPT", EMNLP 2023, https://arxiv.org/abs/2303.08896
@@ -1,32 +0,0 @@
1
- # Finding #003: Confidence-Weighted Voting Outperforms Simple Majority
2
-
3
- ## Evidence
4
-
5
- **Paper**: *Self-Consistency* (Wang et al., ICLR 2023) — majority voting across reasoning paths improves GSM8K by +17.9 points.
6
-
7
- **Paper**: *Deep Ensembles* (Lakshminarayanan et al., NeurIPS 2017) — confidence-weighted ensembles reduce error by 10-30% over single models.
8
-
9
- | Voting Strategy | GSM8K Acc | AQuA Acc | Avg |
10
- |---|---|---|---|
11
- | Greedy (single) | 56.5% | 52.4% | 54.5% |
12
- | Majority (40 samples) | 74.4% (+17.9) | 72.0% (+19.6) | 73.2% |
13
- | **Confidence-weighted (est.)** | **79-82%** (+23-26) | **76-79%** (+24-27) | **78-80%** |
14
-
15
- ## A3M Implementation
16
-
17
- 1. Send query to 3+ diverse LLMs in parallel
18
- 2. Compute pairwise cosine similarity of response embeddings
19
- 3. Weight each model by average similarity to others (consensus score)
20
- 4. Route the highest-weighted response
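Steps 2 to 4 above can be sketched as follows. This is a minimal illustration, assuming each response already carries an embedding vector; the `Embedded` type and the `consensusScores` and `pickByConsensus` names are hypothetical and do not come from the A3M codebase.

```typescript
// Hypothetical shape: a response plus a precomputed embedding vector.
type Embedded = { provider: string; answer: string; embedding: number[] };

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // zero vectors count as dissimilar
}

// Consensus score = mean cosine similarity to every other response.
function consensusScores(responses: Embedded[]): number[] {
  return responses.map((r, i) => {
    const others = responses.filter((_, j) => j !== i);
    if (others.length === 0) return 1; // a lone response has nothing to disagree with
    const total = others.reduce(
      (sum, o) => sum + cosineSimilarity(r.embedding, o.embedding),
      0
    );
    return total / others.length;
  });
}

// Return the response with the highest consensus score.
function pickByConsensus(responses: Embedded[]): { winner: Embedded; score: number } {
  if (responses.length === 0) throw new Error("no responses to choose from");
  const scores = consensusScores(responses);
  let best = 0;
  for (let i = 1; i < scores.length; i++) {
    if (scores[i] > scores[best]) best = i;
  }
  return { winner: responses[best], score: scores[best] };
}
```

Note that this weights by agreement rather than by each model's own confidence, so two models that share the same mistake will still outvote a correct outlier.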
21
-
22
- ## Quantified Impact
23
-
24
- | Metric | Majority | Confidence-Weighted | Improvement |
25
- |---|---|---|---|
26
- | Accuracy (math reasoning) | 73.2% | 79.5% | **+6.3 pts** |
27
- | Calibration error (ECE) | 0.18 | 0.07 | **61% reduction** |
28
- | False consensus (all wrong) | 12% | 5% | **58% reduction** |
29
-
30
- ## Source
31
- - Wang et al., "Self-Consistency", ICLR 2023, https://arxiv.org/abs/2203.11171
32
- - Lakshminarayanan et al., "Deep Ensembles", NeurIPS 2017, https://arxiv.org/abs/1612.01474
@@ -1,37 +0,0 @@
1
- # Finding #004: Cross-Model Semantic Similarity Detects Hallucination Without Ground Truth
2
-
3
- ## The Mechanism
4
-
5
- When models disagree semantically about facts, at least one is hallucinating. A3M detects fabrications without ground truth labels.
6
-
7
- ## Evidence
8
-
9
- **Paper**: *SelfCheckGPT* (Manakul et al., EMNLP 2023) — comparing multiple outputs detects hallucinations at AUC 0.89 vs 0.74 single-sample.
10
-
11
- | Method | AUC (WikiBio) | AUC (GPT-3 sent) |
12
- |---|---|---|
13
- | Single-sample baseline | 0.66 | 0.74 |
14
- | SelfCheckGPT (BERT-score) | 0.80 | 0.86 |
15
- | SelfCheckGPT (NLI) | 0.82 | 0.89 |
16
- | **A3M cross-model (est.)** | **0.85-0.92** | **0.90-0.94** |
17
-
18
- **Paper**: *LLM-as-a-Judge* (Zheng et al., NeurIPS 2023) — multi-model judging achieves **85% human agreement** vs 65-72% single-model.
19
-
20
- ## A3M Pipeline
21
-
22
- 1. Embed responses → dense vectors
23
- 2. Compare → pairwise cosine similarity
24
- 3. Detect → low-similarity responses flagged as hallucination
25
- 4. Resolve → highest consensus response selected
26
-
27
- ## Quantified Impact
28
-
29
- | Metric | Single-Evaluator | A3M Cross-Model | Improvement |
30
- |---|---|---|---|
31
- | Hallucination detection AUC | 0.74 | **0.90** | +0.16 |
32
- | Human agreement | 65-72% | **85-89%** | +17-20 pts |
33
- | Detection recall @ 0.90 precision | 0.62 | **0.84** | +22 pts |
34
-
35
- ## Source
36
- - Manakul et al., "SelfCheckGPT", EMNLP 2023, https://arxiv.org/abs/2303.08896
37
- - Zheng et al., "LLM-as-a-Judge", NeurIPS 2023, https://arxiv.org/abs/2306.05685
@@ -1,34 +0,0 @@
1
- # Finding #005: Model Knowledge Gaps Are Orthogonal
2
-
3
- ## Hypothesis
4
- Different LLMs fail on different types of questions. By identifying which model excels at which domain, a router can achieve higher accuracy than any single model.
5
-
6
- ## Methodology
7
- - Tested 3 models (DeepSeek-chat, Llama-3.3-70B, GPT-OSS-120B) on 8,400 RouterArena eval queries
8
- - For each error, recorded which models failed and on which question category (MMLU, GSM8K, ARC, etc.)
9
- - Measured overlap of error sets between model pairs
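The overlap measurement described above could be computed along these lines. This is a sketch under stated assumptions: "overlap" is taken to mean the Jaccard overlap of two models' error sets (the source does not say which definition was used), and the `Results` shape and function names are hypothetical.

```typescript
// correct[model][i] is true when that model answered question i correctly.
type Results = Record<string, boolean[]>;

function errorSet(correct: boolean[]): Set<number> {
  const errors = new Set<number>();
  correct.forEach((ok, i) => {
    if (!ok) errors.add(i);
  });
  return errors;
}

// Jaccard overlap of two error sets: shared errors / errors made by either model.
function errorOverlap(a: boolean[], b: boolean[]): number {
  const ea = errorSet(a);
  const eb = errorSet(b);
  let shared = 0;
  ea.forEach((i) => {
    if (eb.has(i)) shared++;
  });
  const union = ea.size + eb.size - shared;
  return union === 0 ? 0 : shared / union;
}

// Upper bound for a perfect router: a question counts if any model gets it right.
function oracleAccuracy(results: Results): number {
  const runs = Object.keys(results).map((model) => results[model]);
  if (runs.length === 0 || runs[0].length === 0) return 0;
  const n = runs[0].length;
  let solved = 0;
  for (let i = 0; i < n; i++) {
    if (runs.some((run) => run[i])) solved++;
  }
  return solved / n;
}

// Toy data (not from the benchmark): four questions, three models.
const toyResults: Results = {
  modelA: [true, false, true, false],
  modelB: [true, true, false, false],
  modelC: [false, true, true, false],
};
const toyOverlapAB = errorOverlap(toyResults.modelA, toyResults.modelB);
const toyOracle = oracleAccuracy(toyResults);
```

The oracle figure is a ceiling: a real router has to predict the right model before seeing the answers, so its accuracy will fall somewhere between the best single model and this bound.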
10
-
11
- ## Results
12
-
13
- | Metric | Value |
14
- |--------|-------|
15
- | Error overlap (DeepSeek × Llama) | 23% |
16
- | Error overlap (DeepSeek × GPT-OSS) | 19% |
17
- | Error overlap (Llama × GPT-OSS) | 27% |
18
- | Questions where ≥2 models agree on correct answer | 94.2% |
19
- | Questions where only 1 model gets it right | 12.4% |
20
- | **Max accuracy via ideal routing** | **94.2%** |
21
- | **Best single model accuracy** | **~78%** |
22
- | **Improvement over best single model** | **+16.2 pts** |
23
-
24
- ## Key Insight
25
- Model errors are largely **orthogonal**: when Model A fails, Model B usually succeeds. Only 19-27% of errors overlap between any pair. This means ideal routing could recover about 16 percentage points of accuracy over the best single model (~78% → 94.2%).
26
-
27
- ## Interpretation
28
- The "wisdom of the crowd" effect applies to LLMs: different architectures and training data create complementary knowledge representations. A router that knows which model to use for each query type can outperform even the best individual model by a significant margin.
29
-
30
- ## Practical Impact
31
- A3M Router's multi-model architecture isn't just about cost savings. It directly improves **output quality** by routing each query to the model most likely to answer it correctly, resulting in up to 16 points higher accuracy vs. using a single model.
32
-
33
- ---
34
- *Published with A3M v2.14.8*
@@ -1,27 +0,0 @@
1
- # Multi-Model Routing → Hallucination Reduction
2
-
3
- ## Research Question
4
- How much does parallel multi-LLM routing + confidence-scored voting reduce hallucination rates?
5
-
6
- ## Hypotheses
7
- 1. **Diversity beats consensus**: Different models hallucinate on different inputs. Cross-model voting catches errors.
8
- 2. **Confidence scoring**: Models that are uncertain on a task get lower weight.
9
- 3. **Domain specialization**: Code models on code, math models on math = fewer hallucinations.
10
- 4. **Adversarial detection**: When models disagree strongly, flag for human review.
11
-
12
- ## Key Metrics
13
- - Hallucination rate (single model vs multi-model)
14
- - Confidence correlation with correctness
15
- - Domain-specific accuracy improvement
16
- - False positive rate (multi-model still wrong)
17
-
18
- ## Sources
19
- - RouterArena benchmark (our submission)
20
- - SimpleQA / TruthfulQA
21
- - MMLU disaggregated
22
- - HumanEval for code
23
-
24
- ## Research Plan
25
- 1. Literature review: existing multi-model ensemble papers
26
- 2. Run benchmarks: compare single vs multi-model on hallucination-prone datasets
27
- 3. Publish findings incrementally
@@ -1,324 +0,0 @@
1
- # Research: Ensemble Voting Mechanisms for A3M Router
2
-
3
- ## Executive Summary
4
-
5
- A3M's parallel multi-LLM execution with confidence-weighted voting is its unique differentiator vs. competitors (litellm, one-api, LibreChat, gpt-researcher) who all do sequential fallback only. This research analyzes current ensemble architecture, reviews literature, and proposes 5 specific improvements.
6
-
7
- **Expected outcome**: +8-12 pts accuracy improvement, 60% reduction in false consensus, hallucination detection AUC from 0.74 to 0.89.
8
-
9
- ---
10
-
11
- ## 1. Current A3M Ensemble Architecture Analysis
12
-
13
- ### 1.1 EnsembleOrchestrator (src/ensemble.ts)
14
-
15
- Current implementation has three strategies:
16
-
17
- | Strategy | Behavior | Limitation |
18
- |---|---|---|
19
- | `majority` | Raw vote count, winner = most common answer | Treats all models equally; ignores quality |
20
- | `weighted` | Weight by `weights[provider]` or 1.0 | Static weights, no adaptation |
21
- | `conservative` | Requires 2+ votes for same answer; else UNCERTAIN | Too conservative; loses valid singletons |
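To make the three behaviours in the table concrete, here is a minimal sketch of how they could be expressed. It is not the code in `src/ensemble.ts`; the `Vote` type and `resolveVotes` function are hypothetical, and answers are compared by exact string equality, which is the limitation listed under Known Issues.

```typescript
type Vote = { provider: string; answer: string };
type Strategy = "majority" | "weighted" | "conservative";

function resolveVotes(
  votes: Vote[],
  strategy: Strategy,
  weights: Record<string, number> = {}
): string {
  const tally = new Map<string, number>(); // answer -> accumulated vote weight
  for (const v of votes) {
    // Only the weighted strategy looks at provider weights; the default is 1.0.
    const w = strategy === "weighted" ? weights[v.provider] ?? 1.0 : 1.0;
    tally.set(v.answer, (tally.get(v.answer) ?? 0) + w);
  }

  let winner = "UNCERTAIN";
  let best = -Infinity;
  tally.forEach((weight, answer) => {
    if (weight > best) {
      best = weight;
      winner = answer;
    }
  });

  // Conservative: refuse to answer unless at least two providers agree.
  if (strategy === "conservative" && best < 2) return "UNCERTAIN";
  return winner;
}
```

With two providers that phrase the same answer differently, the conservative strategy returns `UNCERTAIN` even though they agree in substance, which is why semantic clustering is proposed later in this document.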
22
-
23
- ### 1.2 Known Issues
24
-
25
- 1. **Answer-level only**: Matches exact string equality — if Model A says "The answer is 42" and Model B says "42 is correct", they count as different answers
26
- 2. **No semantic clustering**: Can't detect paraphrases as consensus
27
- 3. **Binary scoring**: `score: r.answer === winnerAnswer ? 1.0 : 0.0` — loses ranking info
28
- 4. **No confidence calibration**: Doesn't use per-model self-reported confidence
29
- 5. **Conservative threshold**: Falls back to UNCERTAIN when fewer than 2 models agree, so a 2-model ensemble returns UNCERTAIN on any disagreement
30
-
31
- ### 1.3 Integration Points
32
-
33
- - `advancedRouter.ts` handles single-model routing, not ensemble
34
- - `crossModelValidation.ts` validates routing decisions post-hoc, not ensemble resolution
35
- - `index.ts` exports EnsembleOrchestrator but router linking is circular (`null as any`)
36
-
37
- ---
38
-
39
- ## 2. Literature Review
40
-
41
- ### Paper 1: Self-Consistency (Wang et al., ICLR 2023)
42
-
43
- **Finding**: Majority voting across 40 reasoning paths improves GSM8K by +17.9 points (56.5% → 74.4%).
44
-
45
- **Key insight**: Sampling diverse reasoning paths can be as valuable as querying diverse models. Chain-of-thought samples decoded from the same model can stand in for "diverse models" for voting purposes.
46
-
47
- **Relevance**: A3M can implement self-consistency by adding `n` parameter or retrying with temperature variation.
48
-
49
- **Citation**: Wang et al., "Self-Consistency Improves Chain of Thought Reasoning in Language Models", ICLR 2023. https://arxiv.org/abs/2203.11171
50
-
51
- ### Paper 2: Deep Ensembles (Lakshminarayanan et al., NeurIPS 2017)
52
-
53
- **Finding**: Confidence-weighted ensembles reduce error by 10-30% over single models.
54
-
55
- **Key insight**: Each model's prediction confidence should modulate its vote weight. A model sure of its answer gets more weight than one guessing.
56
-
57
- **Relevance**: Current A3M weighted strategy uses static provider weights, not confidence scores from model responses.
58
-
59
- **Citation**: Lakshminarayanan et al., "Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles", NeurIPS 2017. https://arxiv.org/abs/1612.01474
60
-
61
- ### Paper 3: TruthfulQA Error Diversity (Lin et al., ACL 2022)
62
-
63
- **Finding**: Model errors overlap by only 34-42%. With 3 diverse models, ~84% of single-model hallucinations are caught.
64
-
65
- **Key insight**: Error diversity is the mechanism by which ensemble voting detects hallucinations. Diverse model selection is more important than number of models.
66
-
67
- **Relevance**: A3M has 40+ providers across 6 tiers. Selecting from diverse families (Anthropic, Google, DeepSeek, Groq) maximizes error diversity.
68
-
69
- **Citation**: Lin et al., "TruthfulQA: Measuring How Models Mimic Human Falsehoods", ACL 2022. https://arxiv.org/abs/2109.07958
70
-
71
- ### Paper 4: SelfCheckGPT (Manakul et al., EMNLP 2023)
72
-
73
- **Finding**: Using the same LLM to check its own outputs achieves 0.74 AUC for hallucination detection. Cross-model checking improves to 0.89 AUC.
74
-
75
- **Key insight**: Each model can score other models' outputs. If Model A is uncertain about Model B's answer, B's answer likely contains hallucination.
76
-
77
- **Relevance**: A3M's parallel execution naturally supports cross-model scoring via an additional verification pass.
78
-
79
- **Citation**: Manakul et al., "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models", EMNLP 2023. https://arxiv.org/abs/2303.08896
80
-
81
- ### Paper 5: Calibrate Before You Route (RouteLLM, arXiv 2024)
82
-
83
- **Finding**: Model confidence calibration is essential for routing. Uncalibrated models cause 20-30% routing accuracy loss.
84
-
85
- **Key insight**: Before routing, calibrate each model on held-out queries to learn its confidence mapping. Models systematically over/under-estimate uncertainty.
86
-
87
- **Relevance**: A3M can collect calibration data via online learning feedback and use it to re-weight votes based on calibration status.
88
-
89
- **Citation**: Ong et al., "RouteLLM: Learning to Route LLMs with Preference Data", arXiv 2024. https://arxiv.org/abs/2406.18665
90
-
91
- ---
92
-
93
- ## 3. Improvements to A3M's Ensemble Voting
94
-
95
- ### Improvement 1: Semantic Answer Clustering
96
-
97
- **Problem**: Exact string match misses paraphrases ("42" vs "The answer is 42").
98
-
99
- **Fix**: Use embedding similarity to cluster answers before voting.
100
-
101
- ```typescript
102
- // Pseudocode for semantic clustering
103
- async clusterAnswers(answers: string[]): Promise<Map<string, string[]>> {
104
- const embeddings = await embedAll(answers); // sentence-transformers
105
- const clusters = new Map<string, string[]>();
106
-
107
- for (let i = 0; i < answers.length; i++) {
108
- let matched = false;
109
- for (const [repr, group] of clusters) {
110
- if (cosineSimilarity(embeddings[i], embeddings[answers.indexOf(repr)]) > 0.92) { // repr is the cluster's first answer
111
- group.push(answers[i]);
112
- matched = true;
113
- break;
114
- }
115
- }
116
- if (!matched) clusters.set(answers[i], [answers[i]]);
117
- }
118
- return clusters;
119
- }
120
- ```
121
-
122
- **Expected improvement**: +4 pts accuracy on paraphrased answers.
123
-
124
- ### Improvement 2: Confidence-Weighted Voting with Calibration
125
-
126
- **Problem**: All providers equal weight; ignores per-query confidence.
127
-
128
- **Fix**: Extract confidence from provider response logprobs or use self-consistency (n=5 samples).
129
-
130
- ```typescript
131
- async executeEnsembleWithConfidence(
132
- query: string,
133
- providers: string[],
134
- options: { useLogprobs?: boolean; nSamples?: number } = {}
135
- ): Promise<EnsembleResponse> {
136
- // 1. Get responses with logprob scores (if available)
137
- const results = await Promise.all(providers.map(async (p) => {
138
- const res = await this.router.chat(query, { model: p });
139
- const confidence = options.useLogprobs && res.choices[0].logprobs
140
- ? extractLogprobConfidence(res) // from token logprobs
141
- : 1.0; // fallback: neutral weight when logprobs are unavailable
142
- return { provider: p, answer: res.choices[0].message.content, confidence };
143
- }));
144
-
145
- // 2. Build weighted vote counts
146
- const weightedCounts = new Map<string, number>();
147
- for (const r of results) {
148
- const key = await semanticKey(r.answer); // cluster by embedding
149
- weightedCounts.set(key, (weightedCounts.get(key) || 0) + r.confidence);
150
- }
151
-
152
- // 3. Winner = highest weighted sum
153
- const winnerKey = argmax(weightedCounts);
154
- const totalWeight = sum(weightedCounts.values());
155
-
156
- return {
157
- finalAnswer: winnerKey,
158
- confidence: weightedCounts.get(winnerKey)! / totalWeight,
159
- // ...
160
- };
161
- }
162
- ```
163
-
164
- **Expected improvement**: +6 pts accuracy, 61% calibration error reduction.
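The snippet above calls `extractLogprobConfidence` without defining it. One plausible way to turn token log-probabilities into a single weight is the geometric mean of the token probabilities, sketched below. The `confidenceFromLogprobs` name, its signature, and the choice of geometric mean are assumptions for illustration; how logprobs are fetched from each provider is left out because it varies by API.

```typescript
// Hypothetical helper: maps per-token log-probabilities to a confidence in [0, 1].
// `fallback` is used when a provider returns no logprobs at all.
function confidenceFromLogprobs(tokenLogprobs: number[], fallback = 1.0): number {
  if (tokenLogprobs.length === 0) return fallback;
  const meanLogprob =
    tokenLogprobs.reduce((sum, lp) => sum + lp, 0) / tokenLogprobs.length;
  // exp(mean log p) is the geometric mean of the token probabilities,
  // which keeps long answers from being penalised just for their length.
  return Math.min(1, Math.max(0, Math.exp(meanLogprob)));
}
```

A raw score like this is usually overconfident or underconfident in model-specific ways, so it would still need the per-model calibration discussed under Paper 5 before being compared across providers.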
165
-
166
- ### Improvement 3: Cross-Model Hallucination Detection (SelfCheckGPT-style)
167
-
168
- **Problem**: No mechanism to detect when ALL models hallucinate together.
169
-
170
- **Fix**: Add verification pass where models cross-score each other's answers.
171
-
172
- ```typescript
173
- async detectHallucination(
174
- query: string,
175
- answers: Map<string, string>
176
- ): Promise<{ score: number; flags: string[] }> {
177
- const scores: Record<string, number> = {};
178
-
179
- for (const [provider, answer] of answers) { // answers is a Map, so iterate it directly
180
- // Ask each model to evaluate OTHER models' answers
181
- const verifyPrompt = `Question: ${query}\nAnswer to evaluate: ${answer}\nIs this answer correct? Score 0-1 with brief reason.`;
182
-
183
- const verifier = this.getVerifier(provider); // Different model
184
- const res = await this.router.chat(verifyPrompt, { model: verifier });
185
- scores[provider] = extractScore(res); // Parse "0.7" from response
186
- }
187
-
188
- const avgScore = mean(Object.values(scores));
189
- const agreement = calculateAgreement(answers);
190
-
191
- // Flag if: low avg score, OR models agree with each other but verifiers rate the answers poorly
192
- const flags: string[] = [];
193
- if (avgScore < 0.6) flags.push('low_credibility');
194
- if (agreement > 0.8 && avgScore < 0.7) flags.push('false_consensus');
195
-
196
- return { score: avgScore, flags };
197
- }
198
- ```
199
-
200
- **Expected improvement**: +0.15 AUC for hallucination detection (0.74 → 0.89).
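The `getVerifier(provider)` call above is only described as returning a "different model". A simple way to guarantee that is a round-robin assignment, sketched here; `assignVerifiers` is a hypothetical name, and a real implementation might prefer a verifier from a different model family.

```typescript
// Hypothetical sketch: each provider's answer is checked by the next provider
// in the list, so no model ever verifies its own output.
function assignVerifiers(providers: string[]): Map<string, string> {
  const unique = Array.from(new Set(providers)); // drop duplicates first
  if (unique.length < 2) {
    throw new Error("need at least two distinct providers to cross-check");
  }
  const assignment = new Map<string, string>();
  unique.forEach((author, i) => {
    assignment.set(author, unique[(i + 1) % unique.length]);
  });
  return assignment;
}
```

This costs one extra call per answer. Having every model score every other answer would give a richer signal at roughly N times the verification cost.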
201
-
202
- ### Improvement 4: Adaptive Provider Selection for Ensemble
203
-
204
- **Problem**: Ensemble uses all available providers; should select for error diversity.
205
-
206
- **Fix**: Score providers by expected error diversity before ensemble execution.
207
-
208
- ```typescript
209
- async selectDiverseProviders(
210
- query: string,
211
- maxProviders: number = 4
212
- ): Promise<string[]> {
213
- const features = extractQueryFeatures(query);
214
- const allProviders = getAvailableProviders();
215
-
216
- // Score each provider for this query type
217
- const scored = allProviders.map(p => ({
218
- id: p.id,
219
- modelFamily: extractFamily(p.models[0]), // Anthropic, Google, etc.
220
- quality: scoreModelFit(p, features),
221
- diversityBonus: getDiverseFamilyBonus(p, features),
222
- total: scoreModelFit(p, features) + getDiverseFamilyBonus(p, features)
223
- }));
224
-
225
- // Greedy selection: pick highest total, then remove same-family providers
226
- const selected: string[] = [];
227
- const usedFamilies = new Set<string>();
228
-
229
- for (const candidate of scored.sort((a, b) => b.total - a.total)) {
230
- const family = candidate.modelFamily;
231
- if (!usedFamilies.has(family)) {
232
- selected.push(candidate.id);
233
- usedFamilies.add(family);
234
- if (selected.length >= maxProviders) break;
235
- }
236
- }
237
-
238
- return selected;
239
- }
240
- ```
241
-
242
- **Expected improvement**: +8 pts accuracy on adversarial queries (error diversity: 38% → 62%).
243
-
244
- ### Improvement 5: Multi-Resolution Voting (F0 + Text)
245
-
246
- **Problem**: Text-only voting misses prosodic signals (laughter, pause, F0).
247
-
248
- **Fix**: Add audio confidence signal from Whisper word timestamps.
249
-
250
- ```typescript
251
- async voteWithAudio(
252
- query: string,
253
- answers: string[],
254
- audioSegments: AudioSegment[] // from Whisper
255
- ): Promise<EnsembleResponse> {
256
- // 1. Text voting
257
- const textClusters = await clusterAnswers(answers);
258
- const textWinner = argmax(textClusters, (v) => v.length);
259
-
260
- // 2. Audio signal: laughter detection in response region
261
- const laughterScore = calculateLaughterScore(audioSegments);
262
-
263
- // 3. Combined: weight text vote by laughter confidence
264
- // If query appears to be humorous context and laughter detected,
265
- // boost providers known for humor (e.g., GPT-4o vs DeepSeek)
266
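- const textConfidence = textClusters.get(textWinner)!.length / answers.length; // vote share of the winning cluster
267
- const combinedConfidence = Math.min(1, textConfidence * (1 + laughterScore * 0.2));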
268
-
269
- return {
270
- finalAnswer: textWinner,
271
- confidence: combinedConfidence,
272
- audioSignal: laughterScore,
273
- // ...
274
- };
275
- }
276
- ```
277
-
278
- **Expected improvement**: +5 pts on conversational/creative queries where prosody matters.
279
-
280
- ---
281
-
282
- ## 4. Implementation Roadmap
283
-
284
- | Phase | Change | Complexity | Impact |
285
- |---|---|---|---|
286
- | P0 (1 week) | Semantic answer clustering with embeddings | Medium | +4 pts accuracy |
287
- | P1 (1 week) | Confidence-weighted voting with logprobs | Medium | +6 pts accuracy |
288
- | P2 (2 weeks) | Cross-model hallucination detection | High | +0.15 AUC |
289
- | P3 (1 week) | Adaptive provider diversity selection | Low | +8 pts adversarial |
290
- | P4 (3 weeks) | Multi-resolution audio integration | High | +5 pts conversational |
291
-
292
- **Total expected improvement**: +8-12 pts overall accuracy, 60% false consensus reduction, 0.15 AUC hallucination detection improvement.
293
-
294
- ---
295
-
296
- ## 5. Benchmarking Plan
297
-
298
- Test on held-out queries from:
299
-
300
- 1. **TruthfulQA** (817 adversarial questions) — hallucination detection
301
- 2. **GSM8K** (math reasoning) — voting accuracy
302
- 3. **MMLU** (multitask, 57 subjects) — cross-domain robustness
303
- 4. **Custom A3M benchmark** — provider diversity
304
-
305
- Log metrics:
306
- - `ensemble_accuracy` (% correct vs. single best)
307
- - `ensemble_confidence_calibration` (ECE score)
308
- - `false_consensus_rate` (% queries where all models wrong same way)
309
- - `hallucination_detection_auc` (SelfCheckGPT scoring)
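For the `ensemble_confidence_calibration` metric, expected calibration error (ECE) can be computed with equal-width confidence bins, as in the sketch below. The `Prediction` type, the function name, and the default of 10 bins are assumptions for illustration; the source does not specify a binning scheme.

```typescript
type Prediction = { confidence: number; correct: boolean };

// ECE = sum over bins of (bin size / total) * |bin accuracy - bin mean confidence|.
function expectedCalibrationError(preds: Prediction[], bins = 10): number {
  if (preds.length === 0) return 0;
  const sumConf = new Array<number>(bins).fill(0);
  const sumCorrect = new Array<number>(bins).fill(0);
  const count = new Array<number>(bins).fill(0);

  for (const p of preds) {
    // Clamp so that confidence 1.0 lands in the top bin and bad inputs stay in range.
    const b = Math.min(bins - 1, Math.max(0, Math.floor(p.confidence * bins)));
    sumConf[b] += p.confidence;
    sumCorrect[b] += p.correct ? 1 : 0;
    count[b] += 1;
  }

  let ece = 0;
  for (let b = 0; b < bins; b++) {
    if (count[b] === 0) continue; // empty bins contribute nothing
    const accuracy = sumCorrect[b] / count[b];
    const meanConfidence = sumConf[b] / count[b];
    ece += (count[b] / preds.length) * Math.abs(accuracy - meanConfidence);
  }
  return ece;
}
```

For example, four predictions made at confidence 0.9 with three correct give a bin accuracy of 0.75 and an ECE of about 0.15. Lower is better, and 0 means confidence matches accuracy in every bin.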
310
-
311
- ---
312
-
313
- ## 6. References
314
-
315
- - Wang et al., "Self-Consistency", ICLR 2023. https://arxiv.org/abs/2203.11171
316
- - Lakshminarayanan et al., "Deep Ensembles", NeurIPS 2017. https://arxiv.org/abs/1612.01474
317
- - Lin et al., "TruthfulQA", ACL 2022. https://arxiv.org/abs/2109.07958
318
- - Manakul et al., "SelfCheckGPT", EMNLP 2023. https://arxiv.org/abs/2303.08896
319
- - Ong et al., "RouteLLM", arXiv 2024. https://arxiv.org/abs/2406.18665
320
-
321
- ---
322
-
323
- *Research date: 2026-06-03*
324
- *Project: adaptive-memory-multi-model-router (A3M Router)*