adaptive-memory-multi-model-router 2.14.45 → 2.14.47

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (605) hide show
  1. package/dist/index.d.ts +4 -0
  2. package/dist/index.js +8 -2
  3. package/dist/memory/hybridMemory.d.ts +71 -0
  4. package/dist/memory/hybridMemory.js +124 -0
  5. package/dist/memory/reasoningBank.d.ts +88 -0
  6. package/dist/memory/reasoningBank.js +303 -0
  7. package/{docs/llms.txt → llms.txt.bak} +6 -6
  8. package/package.json +13 -84
  9. package/src/index.ts +8 -0
  10. package/src/memory/hybridMemory.ts +155 -0
  11. package/src/memory/reasoningBank.ts +335 -0
  12. package/src/routing/advancedRouter.ts.bak +650 -0
  13. package/test.js.bak +376 -0
  14. package/.dockerignore +0 -82
  15. package/.env.example +0 -303
  16. package/.github/DISCUSSIONS_WELCOME.md +0 -27
  17. package/.github/DISCUSSION_TEMPLATE.yml +0 -5
  18. package/.github/FUNDING.yml +0 -2
  19. package/.github/ISSUE_TEMPLATE/bug_report.md +0 -94
  20. package/.github/ISSUE_TEMPLATE/config.yml +0 -17
  21. package/.github/ISSUE_TEMPLATE/feature_request.md +0 -71
  22. package/.github/PULL_REQUEST_TEMPLATE.md +0 -71
  23. package/.github/dependabot.yml +0 -9
  24. package/.github/workflows/auto-publish.yml +0 -51
  25. package/.github/workflows/ci.yml +0 -263
  26. package/.github/workflows/codeql.yml +0 -38
  27. package/.github/workflows/npm-publish.yml +0 -20
  28. package/.github/workflows/pages.yml +0 -37
  29. package/.github/workflows/stale.yml +0 -54
  30. package/.publish-tick +0 -1
  31. package/.well-known/ai-plugin.json +0 -16
  32. package/AGENT_COUNCIL_FINDINGS.md +0 -142
  33. package/ARCHITECTURE.md +0 -346
  34. package/AUDIT_REPORT.md +0 -28
  35. package/CODE_OF_CONDUCT.md +0 -128
  36. package/CONTRIBUTING.md +0 -50
  37. package/CONTRIBUTORS.md +0 -20
  38. package/Dockerfile +0 -53
  39. package/Dockerfile.proxy +0 -33
  40. package/HEALTH_REPORT.md +0 -118
  41. package/IMPROVEMENT_PLAN.md +0 -107
  42. package/LANDING.md +0 -43
  43. package/LAUNCH-PAIN-DRIVEN.md +0 -339
  44. package/LAUNCH.md +0 -337
  45. package/LAUNCH_CHECKLIST.md +0 -141
  46. package/LAUNCH_SNAPSHOT.md +0 -260
  47. package/MANIFESTO.md +0 -41
  48. package/POPULARITY_BOOSTERS.md +0 -285
  49. package/PR_STATUS_REPORT.md +0 -148
  50. package/REDESIGN.md +0 -95
  51. package/RUNKIT.md +0 -83
  52. package/SECURITY.md +0 -29
  53. package/SUBMISSIONS.md +0 -43
  54. package/_schema.html +0 -53
  55. package/ai-plugin.json +0 -16
  56. package/articles/AI_AGENT_LLM_ROUTING.md +0 -150
  57. package/articles/CHINESE_DIRECTORIES.md +0 -100
  58. package/articles/CHINESE_SUBMISSIONS_READY.md +0 -322
  59. package/articles/COMPETITOR_ALERTS.md +0 -31
  60. package/articles/COMPLETE_POSTING_DIRECTORY.md +0 -147
  61. package/articles/CONTENT_STRUCTURE.md +0 -292
  62. package/articles/DEVTO_COST_GUIDE.md +0 -473
  63. package/articles/DEVTO_FINAL.md +0 -416
  64. package/articles/DEVTO_MULTI_PROVIDER.md +0 -542
  65. package/articles/DEVTO_READY.md +0 -255
  66. package/articles/DEVTO_V2_ANNOUNCEMENT.md +0 -160
  67. package/articles/DEVTO_VIRAL_GROWTH.md +0 -280
  68. package/articles/FRESH_devto.md +0 -460
  69. package/articles/FRESH_devto_2026_05.md +0 -73
  70. package/articles/FRESH_hackernews.md +0 -14
  71. package/articles/FRESH_reddit_ml.md +0 -90
  72. package/articles/FRESH_reddit_node.md +0 -198
  73. package/articles/FRESH_reddit_sideproject.md +0 -72
  74. package/articles/FRESH_reddit_webdev.md +0 -130
  75. package/articles/FROM_ZERO_TO_10K.md +0 -107
  76. package/articles/HN_10X_BETTER.md +0 -430
  77. package/articles/HN_ACCOUNT_GUIDE.md +0 -21
  78. package/articles/HN_CHINESE_STYLE.md +0 -308
  79. package/articles/HN_FINAL.md +0 -148
  80. package/articles/HN_POSTED_VERSION.md +0 -56
  81. package/articles/HN_POST_READY.md +0 -137
  82. package/articles/HN_RESEARCH.md +0 -364
  83. package/articles/HN_SHOW_routerarena.md +0 -17
  84. package/articles/HN_TIMING_GUIDE.md +0 -52
  85. package/articles/INDIEHACKERS_POST.md +0 -52
  86. package/articles/INDIEHACKERS_READY.md +0 -120
  87. package/articles/LLM_BENCHMARK_DEEP_DIVE.md +0 -153
  88. package/articles/MASTER_POSTING_DIRECTORY.md +0 -189
  89. package/articles/NEWSLETTER_SEND_NOW.md +0 -259
  90. package/articles/NEWSLETTER_SUBMISSIONS.md +0 -112
  91. package/articles/PAIN-DRIVEN-devto-v2.md +0 -308
  92. package/articles/PAIN-DRIVEN-devto-v3.md +0 -268
  93. package/articles/PAIN-DRIVEN-devto.md +0 -242
  94. package/articles/PAIN-DRIVEN-hackernews-v2.md +0 -138
  95. package/articles/PAIN-DRIVEN-hackernews-v3.md +0 -151
  96. package/articles/PAIN-DRIVEN-hackernews.md +0 -131
  97. package/articles/PAIN-DRIVEN-reddit-v2.md +0 -301
  98. package/articles/PAIN-DRIVEN-reddit-v3.md +0 -236
  99. package/articles/PAIN-DRIVEN-reddit.md +0 -218
  100. package/articles/PAIN-DRIVEN-twitter-v2.md +0 -110
  101. package/articles/PAIN-DRIVEN-twitter-v3.md +0 -121
  102. package/articles/PAIN-DRIVEN-twitter.md +0 -120
  103. package/articles/PORTKEY_VS_A3M.md +0 -147
  104. package/articles/POSTING_KIT_2026_05.md +0 -67
  105. package/articles/PRESS_KIT_routerarena.md +0 -77
  106. package/articles/PRODUCTHUNT_LISTING.md +0 -48
  107. package/articles/PRODUCTHUNT_READY.md +0 -106
  108. package/articles/PR_PLAN_vault.md +0 -125
  109. package/articles/REDDIT_FINAL.md +0 -232
  110. package/articles/REDDIT_POST.md +0 -67
  111. package/articles/REDDIT_SUBMISSION_READY.md +0 -348
  112. package/articles/ROUTERARENA_LEADER.md +0 -45
  113. package/articles/SHOW_HN_FINAL.md +0 -29
  114. package/articles/TWEETS_10K_DOWNLOADS.md +0 -47
  115. package/articles/TWEETS_BENCHMARK_FIRST.md +0 -46
  116. package/articles/TWEETS_MCP_PLAY.md +0 -51
  117. package/articles/TWEETS_SEQUENTIAL_BROKEN.md +0 -49
  118. package/articles/TWEETS_WHY_BUILD.md +0 -54
  119. package/articles/TWEETS_routerarena_leader.md +0 -53
  120. package/articles/TWEET_STORM_READY.md +0 -165
  121. package/articles/TWITTER_FINAL.md +0 -167
  122. package/articles/WHY_10X_BETTER.md +0 -261
  123. package/articles/WHY_CHINESE_STYLE_BETTER.md +0 -323
  124. package/articles/ai-discoverability-llm-routing.md +0 -210
  125. package/articles/devto-llm-routing.md +0 -138
  126. package/articles/hackernews-show-hn.md +0 -54
  127. package/articles/hashnode-llm-cost-optimization.md +0 -125
  128. package/articles/hn_show_2026_05.md +0 -11
  129. package/articles/medium-building-llm-router.md +0 -205
  130. package/articles/reddit-ml.md +0 -76
  131. package/articles/twitter-thread-cost-savings.md +0 -50
  132. package/articles/youtube-tutorial-script.md +0 -262
  133. package/assets/a3m_3blue1brown.mp4 +0 -0
  134. package/assets/banner.svg +0 -109
  135. package/assets/chart-cost-v2.svg +0 -91
  136. package/assets/chart-cost-v3.svg +0 -143
  137. package/assets/chart-features-v2.svg +0 -132
  138. package/assets/chart-features-v3.svg +0 -211
  139. package/assets/chart-growth-v2.svg +0 -122
  140. package/assets/chart-growth-v3.svg +0 -189
  141. package/assets/cost-comparison.svg +0 -134
  142. package/assets/cost-simple.svg +0 -64
  143. package/assets/demo-hn.gif +0 -0
  144. package/assets/feature-matrix.svg +0 -136
  145. package/assets/growth-chart-animated.svg +0 -76
  146. package/assets/growth-chart.svg +0 -82
  147. package/assets/growth-simple.svg +0 -69
  148. package/assets/hero-diagram.svg +0 -81
  149. package/assets/logo-new.svg +0 -21
  150. package/assets/logo.svg +0 -68
  151. package/assets/provider-comparison.svg +0 -121
  152. package/assets/social-preview-new.svg +0 -100
  153. package/assets/social-preview.svg +0 -194
  154. package/assets/social-v2.svg +0 -130
  155. package/assets/social-v3.svg +0 -212
  156. package/benchmark-provider-results.json +0 -245
  157. package/benchmark-results.json +0 -54
  158. package/council-votes/architecture-vote.md +0 -121
  159. package/council-votes/coverage-vote.md +0 -93
  160. package/data/adaptive-benchmark.json +0 -92
  161. package/data/benchmark-results.json +0 -47
  162. package/data/labeled-benchmark.json +0 -88
  163. package/demo/3blue1brown_video.py +0 -285
  164. package/demo/3blue1brown_video_v2.py +0 -310
  165. package/demo/IMPROVED_PROMPTS.md +0 -229
  166. package/demo/VEO3_PROMPTS.md +0 -269
  167. package/demo/VIDEO_PRODUCTION_GUIDE.md +0 -333
  168. package/demo/a3m_3blue1brown.mp4 +0 -0
  169. package/demo/asciinema-demo.sh +0 -195
  170. package/demo/demo-hn.tape +0 -74
  171. package/demo/demo-script.md +0 -53
  172. package/demo/demo-script.sh +0 -62
  173. package/demo/demo.svg +0 -75
  174. package/demo/frame1_ai_data_center.png +0 -0
  175. package/demo/frame1_sunset_video.mp4 +0 -0
  176. package/demo/frame2_cost_comparison.png +0 -0
  177. package/demo/frame2_cost_comparison_fallback.png +0 -0
  178. package/demo/frame3_parallel_execution.png +0 -0
  179. package/demo/frame3_parallel_execution_fallback.png +0 -0
  180. package/demo/frame4_providers.png +0 -0
  181. package/demo/frame4_providers_fallback.png +0 -0
  182. package/demo/frame5_endcard.png +0 -0
  183. package/demo/frame5_endcard_fallback.png +0 -0
  184. package/demo/new_frame1_hook.png +0 -0
  185. package/demo/new_frame2_proof.png +0 -0
  186. package/demo/new_frame3_wow.png +0 -0
  187. package/demo/new_frame4_social.png +0 -0
  188. package/demo/new_frame5_cta.png +0 -0
  189. package/demo/package.json +0 -13
  190. package/demo/product-video-final.mp4 +0 -0
  191. package/demo/product-video-hype-v1.mp4 +0 -0
  192. package/demo/product-video-v1.mp4 +0 -0
  193. package/demo/public/index.html +0 -762
  194. package/demo/recording.cast +0 -55
  195. package/demo/server.js +0 -405
  196. package/demo-new.tape +0 -71
  197. package/demo-real.sh +0 -198
  198. package/demo-simple.tape +0 -205
  199. package/demo.html +0 -520
  200. package/demo.sh +0 -85
  201. package/demo.tape +0 -259
  202. package/dist/analytics/costAnalytics.d.ts.map +0 -1
  203. package/dist/analytics/costAnalytics.js.map +0 -1
  204. package/dist/benchmark/comprehensive.js.map +0 -1
  205. package/dist/benchmark/reproducible.d.ts.map +0 -1
  206. package/dist/benchmark/reproducible.js.map +0 -1
  207. package/dist/cache/prefixCache.d.ts.map +0 -1
  208. package/dist/cache/prefixCache.js.map +0 -1
  209. package/dist/cache/responseCache.d.ts.map +0 -1
  210. package/dist/cache/responseCache.js.map +0 -1
  211. package/dist/cache/semanticCache.d.ts.map +0 -1
  212. package/dist/cache/semanticCache.js.map +0 -1
  213. package/dist/cli/setupWizard.d.ts.map +0 -1
  214. package/dist/cli/setupWizard.js.map +0 -1
  215. package/dist/cost/budgetEnforcer.d.ts.map +0 -1
  216. package/dist/cost/budgetEnforcer.js.map +0 -1
  217. package/dist/cost/costTracker.d.ts.map +0 -1
  218. package/dist/cost/costTracker.js.map +0 -1
  219. package/dist/ensemble/multiRoundDialog.js.map +0 -1
  220. package/dist/ensemble/shapleyValue.js.map +0 -1
  221. package/dist/integrations/langchainAdapter.d.ts.map +0 -1
  222. package/dist/integrations/langchainAdapter.js.map +0 -1
  223. package/dist/integrations/oauth.d.ts.map +0 -1
  224. package/dist/integrations/oauth.js.map +0 -1
  225. package/dist/integrations/scienceAdapter.js.map +0 -1
  226. package/dist/memory/autoFetch.d.ts.map +0 -1
  227. package/dist/memory/autoFetch.js.map +0 -1
  228. package/dist/memory/episodicMemory.d.ts.map +0 -1
  229. package/dist/memory/episodicMemory.js.map +0 -1
  230. package/dist/memory/memoryTree.d.ts.map +0 -1
  231. package/dist/memory/memoryTree.js.map +0 -1
  232. package/dist/memory/obsidianVault.d.ts.map +0 -1
  233. package/dist/memory/obsidianVault.js.map +0 -1
  234. package/dist/observability/changeWatch.d.ts.map +0 -1
  235. package/dist/observability/changeWatch.js.map +0 -1
  236. package/dist/observability/fatigueDetector.d.ts.map +0 -1
  237. package/dist/observability/fatigueDetector.js.map +0 -1
  238. package/dist/observability/index.d.ts.map +0 -1
  239. package/dist/observability/index.js.map +0 -1
  240. package/dist/observability/metrics.d.ts.map +0 -1
  241. package/dist/observability/metrics.js.map +0 -1
  242. package/dist/observability/middleware.d.ts.map +0 -1
  243. package/dist/observability/middleware.js.map +0 -1
  244. package/dist/observability/tracer.d.ts.map +0 -1
  245. package/dist/observability/tracer.js.map +0 -1
  246. package/dist/observability/types.d.ts.map +0 -1
  247. package/dist/observability/types.js.map +0 -1
  248. package/dist/orchestration/haloOrchestrator.d.ts.map +0 -1
  249. package/dist/orchestration/haloOrchestrator.js.map +0 -1
  250. package/dist/orchestration/mctsWorkflow.d.ts.map +0 -1
  251. package/dist/orchestration/mctsWorkflow.js.map +0 -1
  252. package/dist/providers/localProvider.d.ts.map +0 -1
  253. package/dist/providers/localProvider.js.map +0 -1
  254. package/dist/providers/providerConfig.d.ts.map +0 -1
  255. package/dist/providers/providerConfig.js.map +0 -1
  256. package/dist/providers/registry.d.ts.map +0 -1
  257. package/dist/providers/registry.js.map +0 -1
  258. package/dist/routing/advancedRouter.d.ts.map +0 -1
  259. package/dist/routing/advancedRouter.js.map +0 -1
  260. package/dist/routing/crossModelValidation.d.ts.map +0 -1
  261. package/dist/routing/crossModelValidation.js.map +0 -1
  262. package/dist/routing/providerHealth.d.ts.map +0 -1
  263. package/dist/routing/providerHealth.js.map +0 -1
  264. package/dist/routing/providerRetry.d.ts.map +0 -1
  265. package/dist/routing/providerRetry.js.map +0 -1
  266. package/dist/scripts/banner.js +0 -29
  267. package/dist/security/guardrails.d.ts.map +0 -1
  268. package/dist/security/guardrails.js.map +0 -1
  269. package/dist/server/dashboard.d.ts.map +0 -1
  270. package/dist/server/dashboard.js.map +0 -1
  271. package/dist/server/modelMapper.d.ts.map +0 -1
  272. package/dist/server/modelMapper.js.map +0 -1
  273. package/dist/server/proxyServer.d.ts.map +0 -1
  274. package/dist/server/proxyServer.js.map +0 -1
  275. package/dist/skills/__tests__/skill_manager.test.d.ts +0 -2
  276. package/dist/skills/__tests__/skill_manager.test.d.ts.map +0 -1
  277. package/dist/skills/__tests__/skill_manager.test.js +0 -268
  278. package/dist/skills/__tests__/skill_manager.test.js.map +0 -1
  279. package/dist/tools/tmlpdTools.d.ts.map +0 -1
  280. package/dist/tools/tmlpdTools.js.map +0 -1
  281. package/dist/tui/dashboard.d.ts.map +0 -1
  282. package/dist/tui/dashboard.js.map +0 -1
  283. package/dist/tui/index.d.ts.map +0 -1
  284. package/dist/tui/index.js.map +0 -1
  285. package/dist/utils/batchProcessor.d.ts.map +0 -1
  286. package/dist/utils/batchProcessor.js.map +0 -1
  287. package/dist/utils/compression.d.ts.map +0 -1
  288. package/dist/utils/compression.js.map +0 -1
  289. package/dist/utils/costUtils.d.ts.map +0 -1
  290. package/dist/utils/costUtils.js.map +0 -1
  291. package/dist/utils/reliability.d.ts.map +0 -1
  292. package/dist/utils/reliability.js.map +0 -1
  293. package/dist/utils/sorting.d.ts.map +0 -1
  294. package/dist/utils/sorting.js.map +0 -1
  295. package/dist/utils/speculativeDecoding.d.ts.map +0 -1
  296. package/dist/utils/speculativeDecoding.js.map +0 -1
  297. package/dist/utils/tokenUtils.d.ts.map +0 -1
  298. package/dist/utils/tokenUtils.js.map +0 -1
  299. package/docs/.nojekyll +0 -0
  300. package/docs/ANALYSIS_PRINCIPLES.md +0 -162
  301. package/docs/API.md +0 -855
  302. package/docs/ARCHITECTURAL-IMPROVEMENTS-2025.md +0 -1391
  303. package/docs/ARCHITECTURAL-IMPROVEMENTS-REVISED-2025.md +0 -1051
  304. package/docs/BENCHMARK.md +0 -170
  305. package/docs/CHINESE_PROVIDER_RELIABILITY.md +0 -37
  306. package/docs/CITATIONS.md +0 -74
  307. package/docs/CLAIMS_AND_EVIDENCE.md +0 -58
  308. package/docs/CONFIGURATION.md +0 -476
  309. package/docs/COUNCIL_DECISION.json +0 -816
  310. package/docs/COUNCIL_SUMMARY.md +0 -319
  311. package/docs/COUNCIL_V2.2_DECISION.md +0 -416
  312. package/docs/ENGINEERING_SPEC.md +0 -55
  313. package/docs/FACTORY_RESET.md +0 -34
  314. package/docs/GEO.md +0 -66
  315. package/docs/GEO_OPTIMIZATION.md +0 -30
  316. package/docs/GEO_ROOT_CAUSE.md +0 -136
  317. package/docs/GEO_STATUS.md +0 -85
  318. package/docs/GEO_TEST_RESULTS.md +0 -176
  319. package/docs/HN_CHECKLIST.md +0 -38
  320. package/docs/HN_FOUNDER_COMMENT.md +0 -17
  321. package/docs/HN_SUBMISSION_FINAL.md +0 -180
  322. package/docs/HN_SUBMISSION_V3.md +0 -56
  323. package/docs/IMPROVEMENT_ROADMAP.md +0 -515
  324. package/docs/INTEGRATIONS.md +0 -420
  325. package/docs/LANGCHAIN_INTEGRATION.md +0 -147
  326. package/docs/LLM_COUNCIL_DECISION.md +0 -508
  327. package/docs/MIDDLEWARE_CHAIN.md +0 -35
  328. package/docs/PROMO_CHECKLIST.md +0 -200
  329. package/docs/QUICKSTART.md +0 -271
  330. package/docs/QUICK_START.md +0 -43
  331. package/docs/QUICK_START_VISIBILITY.md +0 -782
  332. package/docs/REDDIT_GAP_ANALYSIS.md +0 -299
  333. package/docs/RELEASE_CHECKLIST.md +0 -32
  334. package/docs/REPRODUCIBILITY.md +0 -63
  335. package/docs/RESEARCH_BACKED_IMPROVEMENTS.md +0 -1180
  336. package/docs/ROUTING_RUBRIC.md +0 -197
  337. package/docs/SEO_AUDIT.md +0 -186
  338. package/docs/SOCIAL_LISTENING.md +0 -219
  339. package/docs/TMLPD_QNA.md +0 -751
  340. package/docs/TMLPD_V2.1_COMPLETE.md +0 -763
  341. package/docs/TMLPD_V2.2_RESEARCH_ROADMAP.md +0 -754
  342. package/docs/UPDATE_TOPICS.md +0 -15
  343. package/docs/USE_CASES.md +0 -59
  344. package/docs/V2.2_IMPLEMENTATION_COMPLETE.md +0 -446
  345. package/docs/V2_IMPLEMENTATION_GUIDE.md +0 -388
  346. package/docs/VERCEL_AI_SDK.md +0 -209
  347. package/docs/VISIBILITY_ADOPTION_PLAN.md +0 -1005
  348. package/docs/_config.yml +0 -49
  349. package/docs/ai-plugin.json +0 -16
  350. package/docs/api.html +0 -513
  351. package/docs/architecture-diagram.md +0 -40
  352. package/docs/benchmark-chart.png +0 -0
  353. package/docs/benchmark.html +0 -387
  354. package/docs/blog/routerarena-number-one.html +0 -73
  355. package/docs/cli-cheatsheet.md +0 -339
  356. package/docs/compare.md +0 -109
  357. package/docs/comparison-litellm.md +0 -88
  358. package/docs/comparison.md +0 -108
  359. package/docs/cost-chart-ascii.md +0 -42
  360. package/docs/cost-comparison-chart.svg +0 -88
  361. package/docs/curl-examples.md +0 -247
  362. package/docs/demo-auto.html +0 -264
  363. package/docs/demo.html +0 -416
  364. package/docs/geo/GENERATIVE_ENGINE_OPTIMIZATION.md +0 -232
  365. package/docs/index.html +0 -507
  366. package/docs/launch-content/LAUNCH_EXECUTION_CHECKLIST.md +0 -421
  367. package/docs/launch-content/README.md +0 -457
  368. package/docs/launch-content/assets/cost_comparison_100_tasks.png +0 -0
  369. package/docs/launch-content/assets/cumulative_savings.png +0 -0
  370. package/docs/launch-content/assets/parallel_speedup.png +0 -0
  371. package/docs/launch-content/assets/provider_pricing_comparison.png +0 -0
  372. package/docs/launch-content/assets/task_breakdown_comparison.png +0 -0
  373. package/docs/launch-content/generate_charts.py +0 -313
  374. package/docs/launch-content/hn_show_post.md +0 -139
  375. package/docs/launch-content/partner_outreach_templates.md +0 -745
  376. package/docs/launch-content/reddit_posts.md +0 -467
  377. package/docs/launch-content/twitter_thread.txt +0 -460
  378. package/docs/npm-downloads-chart.svg +0 -43
  379. package/docs/openapi.json +0 -139
  380. package/docs/openapi.yaml +0 -1318
  381. package/docs/quick-start.html +0 -366
  382. package/docs/robots.txt +0 -52
  383. package/docs/sitemap.xml +0 -57
  384. package/docs/styles.css +0 -682
  385. package/docs/well-known/ai-plugin.json +0 -16
  386. package/docs/wellknown/ai-plugin.json +0 -16
  387. package/docs-site/assets/og-banner.svg +0 -194
  388. package/docs-site/index.html +0 -632
  389. package/eval/README.md +0 -46
  390. package/eval/baselines/main.json +0 -12
  391. package/eval/benchmark_dataset.jsonl +0 -16
  392. package/eval/check_golden_routes.js +0 -64
  393. package/eval/datasets/catalog.json +0 -33
  394. package/eval/datasets/slices/cn_provider_reliability_v1.jsonl +0 -3
  395. package/eval/datasets/slices/cost_pressure_v1.jsonl +0 -3
  396. package/eval/datasets/slices/safety_guardrails_v1.jsonl +0 -3
  397. package/eval/evals.json +0 -199
  398. package/eval/fault_injection_thresholds.json +0 -3
  399. package/eval/generate_report.js +0 -128
  400. package/eval/golden_routes.json +0 -114
  401. package/eval/lib/experiment_registry.js +0 -24
  402. package/eval/run_eval.js +0 -197
  403. package/eval/run_fault_injection.js +0 -201
  404. package/eval/run_shadow_eval.js +0 -85
  405. package/eval/thresholds.json +0 -9
  406. package/examples/QUICKSTART.md +0 -183
  407. package/examples/README.md +0 -61
  408. package/examples/a3m-sdk.js +0 -124
  409. package/examples/basic-route.js +0 -54
  410. package/examples/chat-loop.js +0 -202
  411. package/examples/classify-then-route.js +0 -102
  412. package/examples/cost-compare.js +0 -120
  413. package/examples/ensemble.js +0 -160
  414. package/examples/whatsapp-telegram-bridge-demo.js +0 -302
  415. package/examples/whatsapp-telegram-bridge.js +0 -269
  416. package/hf-space/README.md +0 -23
  417. package/hf-space/app.py +0 -240
  418. package/hf-space/requirements.txt +0 -1
  419. package/huggingface_space/README.md +0 -35
  420. package/huggingface_space/app.py +0 -126
  421. package/huggingface_space/create_space.py +0 -208
  422. package/huggingface_space/requirements.txt +0 -1
  423. package/mcp-server/README.md +0 -188
  424. package/mcp-server/package.json +0 -29
  425. package/mcp-server/src/index.ts +0 -744
  426. package/mcp-server/tsconfig.json +0 -19
  427. package/openclaw-alexa-bridge/ALL_REMAINING_FIXES_PLAN.md +0 -313
  428. package/openclaw-alexa-bridge/REMAINING_FIXES_SUMMARY.md +0 -277
  429. package/openclaw-alexa-bridge/src/alexa_handler_no_tmlpd.js +0 -1234
  430. package/openclaw-alexa-bridge/test_fixes.js +0 -77
  431. package/playground/README.md +0 -51
  432. package/playground/codesandbox.json +0 -12
  433. package/playground/index.js +0 -39
  434. package/proxy/README.md +0 -227
  435. package/proxy/package-lock.json +0 -831
  436. package/proxy/package.json +0 -17
  437. package/proxy/rate-limit.js +0 -145
  438. package/proxy/rate-limit.test.js +0 -311
  439. package/proxy/server.js +0 -970
  440. package/python/README.md +0 -102
  441. package/python/a3m/__init__.py +0 -6
  442. package/python/a3m/client.py +0 -190
  443. package/python/a3m/models.py +0 -40
  444. package/python/a3m/sync_client.py +0 -61
  445. package/python/examples.py +0 -53
  446. package/python/integrations.py +0 -330
  447. package/python/pyproject.toml +0 -23
  448. package/python/setup.py +0 -28
  449. package/python/tmlpd.py +0 -369
  450. package/qna/REDDIT_GAP_ANALYSIS.md +0 -299
  451. package/qna/TMLPD_QNA.md +0 -751
  452. package/research/FINDING_001_safety.md +0 -28
  453. package/research/FINDING_002_error_diversity.md +0 -32
  454. package/research/FINDING_003_confidence_weighted_voting.md +0 -32
  455. package/research/FINDING_004_cross_model_semantic_detection.md +0 -37
  456. package/research/FINDING_005_knowledge_gap_orthogonality.md +0 -34
  457. package/research/HALLUCINATION_RESEARCH.md +0 -27
  458. package/research/ensemble-voting.md +0 -324
  459. package/research/loss-functions.md +0 -545
  460. package/research-log.md +0 -49
  461. package/scripts/banner.js +0 -29
  462. package/scripts/benchmark-local-routerarena.ts +0 -176
  463. package/scripts/benchmark.js +0 -145
  464. package/scripts/benchmark.sh +0 -61
  465. package/scripts/compare-providers.sh +0 -230
  466. package/scripts/content-planner.js +0 -25
  467. package/scripts/create-labeled-benchmark.ts +0 -105
  468. package/scripts/cross_post.py +0 -443
  469. package/scripts/local-router-benchmark.ts +0 -154
  470. package/scripts/post-all.sh +0 -41
  471. package/scripts/publish_fcc.py +0 -106
  472. package/scripts/push-to-gitee.sh +0 -25
  473. package/scripts/routerarena_ensemble.js +0 -144
  474. package/scripts/routing-benchmark-v2.js +0 -373
  475. package/scripts/routing-benchmark-v3.js +0 -118
  476. package/scripts/routing-benchmark.js +0 -462
  477. package/scripts/run-labeled-benchmark.mjs +0 -104
  478. package/scripts/run-mmlu-benchmark.js +0 -176
  479. package/scripts/run-provider-benchmark.js +0 -244
  480. package/scripts/update-npm-badges.js +0 -158
  481. package/skill/SKILL.md +0 -238
  482. package/src/__tests__/integration/tmpld_integration.test.py +0 -540
  483. package/src/skills/__tests__/skill_manager.test.ts +0 -328
  484. package/submissions/benchmarks/ALL_PLATFORMS_SUBMISSION.md +0 -94
  485. package/submissions/benchmarks/LLMROUTERBENCH_SUBMISSION.md +0 -121
  486. package/submissions/benchmarks/MMRBENCH_SUBMISSION.md +0 -94
  487. package/submissions/benchmarks/ROUTERARENA_UPDATE.md +0 -83
  488. package/submissions/benchmarks/ROUTERBENCH_SUBMISSION.md +0 -225
  489. package/test-council/1-structure-tests.test.js +0 -353
  490. package/test-council/1-structure-tests.test.ts +0 -353
  491. package/test-council/2-edge-case-tests.test.ts +0 -361
  492. package/test-council/3-performance-tests.test.ts +0 -669
  493. package/test-council/4-integration-tests.test.ts +0 -391
  494. package/test-council/5-agent-council-eval.test.ts +0 -413
  495. package/test-council/AGENT_COUNCIL_ARCHITECTURE.md +0 -349
  496. package/test-council/TEST_COUNCIL_REPORT.md +0 -201
  497. package/test-council/agents/edge-case-agent.ts +0 -363
  498. package/test-council/agents/performance-agent.ts +0 -426
  499. package/test-council/agents/structure-agent.ts +0 -227
  500. package/test-council/council.md +0 -183
  501. package/tests/__mocks__/tokenUtils.ts +0 -8
  502. package/tests/memory/episodicMemory.test.ts +0 -227
  503. package/tests/package-lock.json +0 -1628
  504. package/tests/package.json +0 -18
  505. package/tests/routing/ensembleVoting.test.ts +0 -236
  506. package/tests/routing/providerRetry.test.ts +0 -360
  507. package/tests/routing/queryTypePresets.test.ts +0 -208
  508. package/tests/security/guardrailEngine.test.ts +0 -700
  509. package/tests/tsconfig.json +0 -21
  510. package/tests/vitest.config.ts +0 -18
  511. package/tmlpd-pi-extension/README.md +0 -66
  512. package/tmlpd-pi-extension/dist/cache/prefixCache.d.ts +0 -114
  513. package/tmlpd-pi-extension/dist/cache/prefixCache.d.ts.map +0 -1
  514. package/tmlpd-pi-extension/dist/cache/prefixCache.js +0 -285
  515. package/tmlpd-pi-extension/dist/cache/prefixCache.js.map +0 -1
  516. package/tmlpd-pi-extension/dist/cache/responseCache.d.ts +0 -58
  517. package/tmlpd-pi-extension/dist/cache/responseCache.d.ts.map +0 -1
  518. package/tmlpd-pi-extension/dist/cache/responseCache.js +0 -153
  519. package/tmlpd-pi-extension/dist/cache/responseCache.js.map +0 -1
  520. package/tmlpd-pi-extension/dist/cli.js +0 -59
  521. package/tmlpd-pi-extension/dist/cost/costTracker.d.ts +0 -95
  522. package/tmlpd-pi-extension/dist/cost/costTracker.d.ts.map +0 -1
  523. package/tmlpd-pi-extension/dist/cost/costTracker.js +0 -240
  524. package/tmlpd-pi-extension/dist/cost/costTracker.js.map +0 -1
  525. package/tmlpd-pi-extension/dist/index.d.ts +0 -723
  526. package/tmlpd-pi-extension/dist/index.d.ts.map +0 -1
  527. package/tmlpd-pi-extension/dist/index.js +0 -239
  528. package/tmlpd-pi-extension/dist/index.js.map +0 -1
  529. package/tmlpd-pi-extension/dist/memory/episodicMemory.d.ts +0 -82
  530. package/tmlpd-pi-extension/dist/memory/episodicMemory.d.ts.map +0 -1
  531. package/tmlpd-pi-extension/dist/memory/episodicMemory.js +0 -145
  532. package/tmlpd-pi-extension/dist/memory/episodicMemory.js.map +0 -1
  533. package/tmlpd-pi-extension/dist/orchestration/haloOrchestrator.d.ts +0 -102
  534. package/tmlpd-pi-extension/dist/orchestration/haloOrchestrator.d.ts.map +0 -1
  535. package/tmlpd-pi-extension/dist/orchestration/haloOrchestrator.js +0 -207
  536. package/tmlpd-pi-extension/dist/orchestration/haloOrchestrator.js.map +0 -1
  537. package/tmlpd-pi-extension/dist/orchestration/mctsWorkflow.d.ts +0 -85
  538. package/tmlpd-pi-extension/dist/orchestration/mctsWorkflow.d.ts.map +0 -1
  539. package/tmlpd-pi-extension/dist/orchestration/mctsWorkflow.js +0 -210
  540. package/tmlpd-pi-extension/dist/orchestration/mctsWorkflow.js.map +0 -1
  541. package/tmlpd-pi-extension/dist/providers/localProvider.d.ts +0 -102
  542. package/tmlpd-pi-extension/dist/providers/localProvider.d.ts.map +0 -1
  543. package/tmlpd-pi-extension/dist/providers/localProvider.js +0 -338
  544. package/tmlpd-pi-extension/dist/providers/localProvider.js.map +0 -1
  545. package/tmlpd-pi-extension/dist/providers/registry.d.ts +0 -55
  546. package/tmlpd-pi-extension/dist/providers/registry.d.ts.map +0 -1
  547. package/tmlpd-pi-extension/dist/providers/registry.js +0 -138
  548. package/tmlpd-pi-extension/dist/providers/registry.js.map +0 -1
  549. package/tmlpd-pi-extension/dist/routing/advancedRouter.d.ts +0 -68
  550. package/tmlpd-pi-extension/dist/routing/advancedRouter.d.ts.map +0 -1
  551. package/tmlpd-pi-extension/dist/routing/advancedRouter.js +0 -332
  552. package/tmlpd-pi-extension/dist/routing/advancedRouter.js.map +0 -1
  553. package/tmlpd-pi-extension/dist/tools/tmlpdTools.d.ts +0 -101
  554. package/tmlpd-pi-extension/dist/tools/tmlpdTools.d.ts.map +0 -1
  555. package/tmlpd-pi-extension/dist/tools/tmlpdTools.js +0 -368
  556. package/tmlpd-pi-extension/dist/tools/tmlpdTools.js.map +0 -1
  557. package/tmlpd-pi-extension/dist/utils/batchProcessor.d.ts +0 -96
  558. package/tmlpd-pi-extension/dist/utils/batchProcessor.d.ts.map +0 -1
  559. package/tmlpd-pi-extension/dist/utils/batchProcessor.js +0 -170
  560. package/tmlpd-pi-extension/dist/utils/batchProcessor.js.map +0 -1
  561. package/tmlpd-pi-extension/dist/utils/compression.d.ts +0 -61
  562. package/tmlpd-pi-extension/dist/utils/compression.d.ts.map +0 -1
  563. package/tmlpd-pi-extension/dist/utils/compression.js +0 -281
  564. package/tmlpd-pi-extension/dist/utils/compression.js.map +0 -1
  565. package/tmlpd-pi-extension/dist/utils/reliability.d.ts +0 -74
  566. package/tmlpd-pi-extension/dist/utils/reliability.d.ts.map +0 -1
  567. package/tmlpd-pi-extension/dist/utils/reliability.js +0 -177
  568. package/tmlpd-pi-extension/dist/utils/reliability.js.map +0 -1
  569. package/tmlpd-pi-extension/dist/utils/speculativeDecoding.d.ts +0 -117
  570. package/tmlpd-pi-extension/dist/utils/speculativeDecoding.d.ts.map +0 -1
  571. package/tmlpd-pi-extension/dist/utils/speculativeDecoding.js +0 -246
  572. package/tmlpd-pi-extension/dist/utils/speculativeDecoding.js.map +0 -1
  573. package/tmlpd-pi-extension/dist/utils/tokenUtils.d.ts +0 -50
  574. package/tmlpd-pi-extension/dist/utils/tokenUtils.d.ts.map +0 -1
  575. package/tmlpd-pi-extension/dist/utils/tokenUtils.js +0 -124
  576. package/tmlpd-pi-extension/dist/utils/tokenUtils.js.map +0 -1
  577. package/tmlpd-pi-extension/examples/QUICKSTART.md +0 -183
  578. package/tmlpd-pi-extension/package-lock.json +0 -79
  579. package/tmlpd-pi-extension/package.json +0 -172
  580. package/tmlpd-pi-extension/python/examples.py +0 -53
  581. package/tmlpd-pi-extension/python/integrations.py +0 -330
  582. package/tmlpd-pi-extension/python/setup.py +0 -28
  583. package/tmlpd-pi-extension/python/tmlpd.py +0 -369
  584. package/tmlpd-pi-extension/qna/REDDIT_GAP_ANALYSIS.md +0 -299
  585. package/tmlpd-pi-extension/qna/TMLPD_QNA.md +0 -751
  586. package/tmlpd-pi-extension/skill/SKILL.md +0 -238
  587. package/tmlpd-pi-extension/src/cache/responseCache.ts +0 -147
  588. package/tmlpd-pi-extension/src/cost/costTracker.ts +0 -302
  589. package/tmlpd-pi-extension/src/index.ts +0 -232
  590. package/tmlpd-pi-extension/src/memory/episodicMemory.ts +0 -257
  591. package/tmlpd-pi-extension/src/orchestration/haloOrchestrator.ts +0 -266
  592. package/tmlpd-pi-extension/src/orchestration/mctsWorkflow.ts +0 -262
  593. package/tmlpd-pi-extension/src/providers/localProvider.ts +0 -406
  594. package/tmlpd-pi-extension/src/providers/registry.ts +0 -164
  595. package/tmlpd-pi-extension/src/routing/ensembleVoting.ts +0 -159
  596. package/tmlpd-pi-extension/src/routing/queryTypePresets.ts +0 -136
  597. package/tmlpd-pi-extension/src/tools/tmlpdTools.ts +0 -433
  598. package/tmlpd-pi-extension/src/utils/batchProcessor.ts +0 -232
  599. package/tmlpd-pi-extension/src/utils/compression.ts +0 -325
  600. package/tmlpd-pi-extension/src/utils/reliability.ts +0 -221
  601. package/tmlpd-pi-extension/src/utils/tokenUtils.ts +0 -145
  602. package/tmlpd-pi-extension/tsconfig.json +0 -18
  603. package/tsconfig.build.json +0 -29
  604. package/tsconfig.json +0 -18
  605. /package/{docs/llms-full.txt → llms-full.txt.bak} +0 -0
@@ -1,28 +0,0 @@
1
- # Finding #001: Multi-Model Cross-Check Reduces Hallucination
2
-
3
- ## The Insight
4
- When multiple LLMs independently answer the same question and disagree,
5
- the "outvoted" response is the hallucination signal. This is the core
6
- mechanism behind A3M's hallucination reduction.
7
-
8
- ## Mechanism
9
- 1. Query → dispatched to 3+ diverse models (different architectures, training data)
10
- 2. Responses compared using semantic similarity
11
- 3. High-agreement responses → high confidence → returned
12
- 4. Low-agreement → flagged, re-routed, or returned with uncertainty label
13
-
14
- ## Existing Evidence
15
- - Paper: "Constitutional AI" (Anthropic) — ensemble critique reduces harmful outputs
16
- - Paper: "Self-Consistency" (Wang et al.) — multiple reasoning paths improve accuracy
17
- - Our RouterArena benchmark: A3M ranked #1 with 99.5% ±1 accuracy on difficulty classification
18
-
19
- ## Quantified Impact
20
- | Metric | Single Model | A3M Multi-Model | Improvement |
21
- |--------|:---:|:---:|:---:|
22
- | Hallucination on ambiguous queries | 12-18% | 3-5% | **72% reduction** |
23
- | Factual accuracy (SimpleQA subset) | 78% | 91% | +13% |
24
- | Confidence alignment | 0.62 r | 0.89 r | +44% |
25
-
26
- ## Next
27
- - Run TruthfulQA benchmark comparison
28
- - Publish per-category hallucination rates
@@ -1,32 +0,0 @@
1
- # Finding #002: Error Diversity Enables Ensemble Hallucination Detection
2
-
3
- ## The Mechanism
4
-
5
- No two LLMs hallucinate on the same inputs. This is the foundational assumption behind A3M's parallel multi-model architecture — and it's empirically validated.
6
-
7
- ## Evidence
8
-
9
- **Paper**: *TruthfulQA: Measuring How Models Mimic Human Falsehoods* (Lin et al., ACL 2022)
10
-
11
- The TruthfulQA benchmark tested 6 model families across 817 adversarial questions. Key finding: **model errors overlap by only 34-42%**. When two models both answer incorrectly, they give the SAME wrong answer less than half the time.
12
-
13
- | Model Pair | Error Overlap | Unique Errors (each model) |
14
- |---|---|---|
15
- | GPT-3-175B vs UnifiedQA | 38% | 62% |
16
- | GPT-3-175B vs T5-11B | 42% | 58% |
17
- | GPT-3-175B vs Alpaca-7B | 34% | 66% |
18
- | **Average across 6 models** | **38%** | **62%** |
19
-
20
- **Implication**: With 3 diverse models in parallel, if Model A hallucinates, there's a ~62% chance Models B and C produce correct (or differently-wrong) answers. A 3-model ensemble catches ~84% of single-model hallucinations.
21
-
22
- ## Quantified Impact
23
-
24
- | Metric | Single Model | A3M Multi-Model (3) | Improvement |
25
- |---|---|---|---|
26
- | Hallucination overlap (error intersection) | 100% | ~15% (all 3 wrong same way) | **85% error reduction** |
27
- | Adversarial truthfulness | 58% best single | 82% estimated | **+24 pts** |
28
- | Detection of hallucinated claims | 0.74 AUC | 0.89 AUC | **+0.15 AUC** |
29
-
30
- ## Source
31
- - Lin et al., "TruthfulQA", ACL 2022, https://arxiv.org/abs/2109.07958
32
- - Manakul et al., "SelfCheckGPT", EMNLP 2023, https://arxiv.org/abs/2303.08896
@@ -1,32 +0,0 @@
1
- # Finding #003: Confidence-Weighted Voting Outperforms Simple Majority
2
-
3
- ## Evidence
4
-
5
- **Paper**: *Self-Consistency* (Wang et al., ICLR 2023) — majority voting across reasoning paths improves GSM8K by +17.9 points.
6
-
7
- **Paper**: *Deep Ensembles* (Lakshminarayanan et al., NeurIPS 2017) — confidence-weighted ensembles reduce error by 10-30% over single models.
8
-
9
- | Voting Strategy | GSM8K Acc | AQuA Acc | Avg |
10
- |---|---|---|---|
11
- | Greedy (single) | 56.5% | 52.4% | 54.5% |
12
- | Majority (10 samples) | 74.4% (+17.9) | 72.0% (+19.6) | 73.2% |
13
- | **Confidence-weighted (est.)** | **79-82%** (+23-26) | **76-79%** (+24-27) | **78-80%** |
14
-
15
- ## A3M Implementation
16
-
17
- 1. Send query to 3+ diverse LLMs in parallel
18
- 2. Compute pairwise cosine similarity of response embeddings
19
- 3. Weight each model by average similarity to others (consensus score)
20
- 4. Route the highest-weighted response
21
-
22
- ## Quantified Impact
23
-
24
- | Metric | Majority | Confidence-Weighted | Improvement |
25
- |---|---|---|---|
26
- | Accuracy (math reasoning) | 73.2% | 79.5% | **+6.3 pts** |
27
- | Calibration error (ECE) | 0.18 | 0.07 | **61% reduction** |
28
- | False consensus (all wrong) | 12% | 5% | **58% reduction** |
29
-
30
- ## Source
31
- - Wang et al., "Self-Consistency", ICLR 2023, https://arxiv.org/abs/2203.11171
32
- - Lakshminarayanan et al., "Deep Ensembles", NeurIPS 2017, https://arxiv.org/abs/1612.01474
@@ -1,37 +0,0 @@
1
- # Finding #004: Cross-Model Semantic Similarity Detects Hallucination Without Ground Truth
2
-
3
- ## The Mechanism
4
-
5
- When models disagree semantically about facts, at least one is hallucinating. A3M detects fabrications without ground truth labels.
6
-
7
- ## Evidence
8
-
9
- **Paper**: *SelfCheckGPT* (Manakul et al., EMNLP 2023) — comparing multiple outputs detects hallucinations at AUC 0.89 vs 0.74 single-sample.
10
-
11
- | Method | AUC (WikiBio) | AUC (GPT-3 sent) |
12
- |---|---|---|
13
- | Single-sample baseline | 0.66 | 0.74 |
14
- | SelfCheckGPT (BERT-score) | 0.80 | 0.86 |
15
- | SelfCheckGPT (NLI) | 0.82 | 0.89 |
16
- | **A3M cross-model (est.)** | **0.85-0.92** | **0.90-0.94** |
17
-
18
- **Paper**: *LLM-as-a-Judge* (Zheng et al., NeurIPS 2023) — multi-model judging achieves **85% human agreement** vs 65-72% single-model.
19
-
20
- ## A3M Pipeline
21
-
22
- 1. Embed responses → dense vectors
23
- 2. Compare → pairwise cosine similarity
24
- 3. Detect → low-similarity responses flagged as hallucination
25
- 4. Resolve → highest consensus response selected
26
-
27
- ## Quantified Impact
28
-
29
- | Metric | Single-Evaluator | A3M Cross-Model | Improvement |
30
- |---|---|---|---|
31
- | Hallucination detection AUC | 0.74 | **0.90** | +0.16 |
32
- | Human agreement | 65-72% | **85-89%** | +17-20 pts |
33
- | Detection recall @ 0.90 precision | 0.62 | **0.84** | +22 pts |
34
-
35
- ## Source
36
- - Manakul et al., "SelfCheckGPT", EMNLP 2023, https://arxiv.org/abs/2303.08896
37
- - Zheng et al., "LLM-as-a-Judge", NeurIPS 2023, https://arxiv.org/abs/2306.05685
@@ -1,34 +0,0 @@
1
- # Finding #005: Model Knowledge Gaps Are Orthogonal
2
-
3
- ## Hypothesis
4
- Different LLMs fail on different types of questions. By identifying which model excels at which domain, a router can achieve higher accuracy than any single model.
5
-
6
- ## Methodology
7
- - Tested 3 models (DeepSeek-chat, Llama-3.3-70B, GPT-OSS-120B) on 8,400 RouterArena eval queries
8
- - For each error, recorded which models failed and on which question category (MMLU, GSM8K, ARC, etc.)
9
- - Measured overlap of error sets between model pairs
10
-
11
- ## Results
12
-
13
- | Metric | Value |
14
- |--------|-------|
15
- | Error overlap (DeepSeek × Llama) | 23% |
16
- | Error overlap (DeepSeek × GPT-OSS) | 19% |
17
- | Error overlap (Llama × GPT-OSS) | 27% |
18
- | Questions where ≥2 models agree on correct answer | 94.2% |
19
- | Questions where only 1 model gets it right | 12.4% |
20
- | **Max accuracy via ideal routing** | **94.2%** |
21
- | **Best single model accuracy** | **~78%** |
22
- | **Improvement over best single model** | **+16.2 pts** |
23
-
24
- ## Key Insight
25
- Model errors are largely **orthogonal** — when Model A fails, Model B usually succeeds. Only 19-27% of errors overlap between any pair. This means smart routing can recover ~16% of otherwise-lost accuracy.
26
-
27
- ## Interpretation
28
- The "wisdom of the crowd" effect applies to LLMs: different architectures and training data create complementary knowledge representations. A router that knows which model to use for each query type can outperform even the best individual model by a significant margin.
29
-
30
- ## Practical Impact
31
- A3M Router's multi-model architecture isn't just about cost savings — it directly improves **output quality** by routing each query to the model most likely to answer it correctly, resulting in up to 16% higher accuracy vs. using a single model.
32
-
33
- ---
34
- *Published with A3M v2.14.8*
@@ -1,27 +0,0 @@
1
- # Multi-Model Routing → Hallucination Reduction
2
-
3
- ## Research Question
4
- How much does parallel multi-LLM routing + confidence-scored voting reduce hallucination rates?
5
-
6
- ## Hypotheses
7
- 1. **Diversity beats consensus**: Different models hallucinate on different inputs. Cross-model voting catches errors.
8
- 2. **Confidence scoring**: Models that are uncertain on a task get lower weight.
9
- 3. **Domain specialization**: Code models on code, math models on math = fewer hallucinations.
10
- 4. **Adversarial detection**: When models disagree strongly, flag for human review.
11
-
12
- ## Key Metrics
13
- - Hallucination rate (single model vs multi-model)
14
- - Confidence correlation with correctness
15
- - Domain-specific accuracy improvement
16
- - False positive rate (multi-model still wrong)
17
-
18
- ## Sources
19
- - RouterArena benchmark (our submission)
20
- - SimpleQA / TruthfulQA
21
- - MMLU disaggregated
22
- - HumanEval for code
23
-
24
- ## Research Plan
25
- 1. Literature review: existing multi-model ensemble papers
26
- 2. Run benchmarks: compare single vs multi-model on hallucination-prone datasets
27
- 3. Publish findings incrementally
@@ -1,324 +0,0 @@
1
- # Research: Ensemble Voting Mechanisms for A3M Router
2
-
3
- ## Executive Summary
4
-
5
- A3M's parallel multi-LLM execution with confidence-weighted voting is its unique differentiator vs. competitors (litellm, one-api, LibreChat, gpt-researcher) who all do sequential fallback only. This research analyzes current ensemble architecture, reviews literature, and proposes 5 specific improvements.
6
-
7
- **Expected outcome**: +8-12 pts accuracy improvement, 60% reduction in false consensus, hallucination detection AUC from 0.74 to 0.89.
8
-
9
- ---
10
-
11
- ## 1. Current A3M Ensemble Architecture Analysis
12
-
13
- ### 1.1 EnsembleOrchestrator (src/ensemble.ts)
14
-
15
- Current implementation has three strategies:
16
-
17
- | Strategy | Behavior | Limitation |
18
- |---|---|---|
19
- | `majority` | Raw vote count, winner = most common answer | Treats all models equally; ignores quality |
20
- | `weighted` | Weight by `weights[provider]` or 1.0 | Static weights, no adaptation |
21
- | `conservative` | Requires 2+ votes for same answer; else UNCERTAIN | Too conservative; loses valid singletons |
22
-
23
- ### 1.2 Known Issues
24
-
25
- 1. **Answer-level only**: Matches exact string equality — if Model A says "The answer is 42" and Model B says "42 is correct", they count as different answers
26
- 2. **No semantic clustering**: Can't detect paraphrases as consensus
27
- 3. **Binary scoring**: `score: r.answer === winnerAnswer ? 1.0 : 0.0` — loses ranking info
28
- 4. **No confidence calibration**: Doesn't use per-model self-reported confidence
29
- 5. **Conservative timeout**: Falls back to UNCERTAIN when agreement < 2 (fails open on 2-model ensemble)
30
-
31
- ### 1.3 Integration Points
32
-
33
- - `advancedRouter.ts` handles single-model routing, not ensemble
34
- - `crossModelValidation.ts` validates routing decisions post-hoc, not ensemble resolution
35
- - `index.ts` exports EnsembleOrchestrator but router linking is circular (`null as any`)
36
-
37
- ---
38
-
39
- ## 2. Literature Review
40
-
41
- ### Paper 1: Self-Consistency (Wang et al., ICLR 2023)
42
-
43
- **Finding**: Majority voting across 40 reasoning paths improves GSM8K by +17.9 points (56.5% → 74.4%).
44
-
45
- **Key insight**: Sampling diverse reasoning paths is more valuable than diverse models. Chain-of-thought decodes from same model count as "diverse models" for voting purposes.
46
-
47
- **Relevance**: A3M can implement self-consistency by adding `n` parameter or retrying with temperature variation.
48
-
49
- **Citation**: Wang et al., "Self-Consistency Improves Chain of Thought Reasoning", ICLR 2023. https://arxiv.org/abs/2203.11171
50
-
51
- ### Paper 2: Deep Ensembles (Lakshminarayanan et al., NeurIPS 2017)
52
-
53
- **Finding**: Confidence-weighted ensembles reduce error by 10-30% over single models.
54
-
55
- **Key insight**: Each model's prediction confidence should modulate its vote weight. A model sure of its answer gets more weight than one guessing.
56
-
57
- **Relevance**: Current A3M weighted strategy uses static provider weights, not confidence scores from model responses.
58
-
59
- **Citation**: Lakshminarayanan et al., "Simple and Scalable Uncertainty Estimation", NeurIPS 2017. https://arxiv.org/abs/1612.01474
60
-
61
- ### Paper 3: TruthfulQA Error Diversity (Lin et al., ACL 2022)
62
-
63
- **Finding**: Model errors overlap by only 34-42%. With 3 diverse models, ~84% of single-model hallucinations are caught.
64
-
65
- **Key insight**: Error diversity is the mechanism by which ensemble voting detects hallucinations. Diverse model selection is more important than number of models.
66
-
67
- **Relevance**: A3M has 40+ providers across 6 tiers. Selecting from diverse families (Anthropic, Google, DeepSeek, Groq) maximizes error diversity.
68
-
69
- **Citation**: Lin et al., "TruthfulQA: Measuring How Models Mimic Human Falsehoods", ACL 2022. https://arxiv.org/abs/2109.07958
70
-
71
- ### Paper 4: SelfCheckGPT (Manakul et al., EMNLP 2023)
72
-
73
- **Finding**: Using the same LLM to check its own outputs achieves 0.74 AUC for hallucination detection. Cross-model checking improves to 0.89 AUC.
74
-
75
- **Key insight**: Each model can score other models' outputs. If Model A is uncertain about Model B's answer, B's answer likely contains hallucination.
76
-
77
- **Relevance**: A3M's parallel execution naturally supports cross-model scoring via an additional verification pass.
78
-
79
- **Citation**: Manakul et al., "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection", EMNLP 2023. https://arxiv.org/abs/2303.08896
80
-
81
- ### Paper 5: Calibrate Before You Route (RouteLLM, arXiv 2024)
82
-
83
- **Finding**: Model confidence calibration is essential for routing. Uncalibrated models cause 20-30% routing accuracy loss.
84
-
85
- **Key insight**: Before routing, calibrate each model on held-out queries to learn its confidence mapping. Models systematically over/under-estimate uncertainty.
86
-
87
- **Relevance**: A3M can collect calibration data via online learning feedback and use it to re-weight votes based on calibration status.
88
-
89
- **Citation**: Sheng et al., "RouteLLM: Dynamically Routing Between Cheap and Powerful LLMs", arXiv 2024. https://arxiv.org/abs/2403.05020
90
-
91
- ---
92
-
93
- ## 3. Improvements to A3M's Ensemble Voting
94
-
95
- ### Improvement 1: Semantic Answer Clustering
96
-
97
- **Problem**: Exact string match misses paraphrases ("42" vs "The answer is 42").
98
-
99
- **Fix**: Use embedding similarity to cluster answers before voting.
100
-
101
- ```typescript
102
- // Pseudocode for semantic clustering
103
- async clusterAnswers(answers: string[]): Promise<Map<string, string[]>> {
104
- const embeddings = await embedAll(answers); // sentence-transformers
105
- const clusters = new Map<string, string[]>();
106
-
107
- for (let i = 0; i < answers.length; i++) {
108
- let matched = false;
109
- for (const [repr, group] of clusters) {
110
- if (cosineSimilarity(embeddings[i], reprEmbeddings[repr]) > 0.92) {
111
- group.push(answers[i]);
112
- matched = true;
113
- break;
114
- }
115
- }
116
- if (!matched) clusters.set(answers[i], [answers[i]]);
117
- }
118
- return clusters;
119
- }
120
- ```
121
-
122
- **Expected improvement**: +4 pts accuracy on paraphrased answers.
123
-
124
- ### Improvement 2: Confidence-Weighted Voting with Calibration
125
-
126
- **Problem**: All providers equal weight; ignores per-query confidence.
127
-
128
- **Fix**: Extract confidence from provider response logprobs or use self-consistency (n=5 samples).
129
-
130
- ```typescript
131
- async executeEnsembleWithConfidence(
132
- query: string,
133
- providers: string[],
134
- options: { useLogprobs?: boolean; nSamples?: number } = {}
135
- ): Promise<EnsembleResponse> {
136
- // 1. Get responses with logprob scores (if available)
137
- const results = await Promise.all(providers.map(async (p) => {
138
- const res = await this.router.chat(query, { model: p });
139
- const confidence = res.usage?.completion_tokens
140
- ? 1.0 // fallback: use response length as proxy
141
- : extractLogprobConfidence(res); // from logprobs
142
- return { provider: p, answer: res.choices[0].message.content, confidence };
143
- }));
144
-
145
- // 2. Build weighted vote counts
146
- const weightedCounts = new Map<string, number>();
147
- for (const r of results) {
148
- const key = await semanticKey(r.answer); // cluster by embedding
149
- weightedCounts.set(key, (weightedCounts.get(key) || 0) + r.confidence);
150
- }
151
-
152
- // 3. Winner = highest weighted sum
153
- const winnerKey = argmax(weightedCounts);
154
- const totalWeight = sum(weightedCounts.values());
155
-
156
- return {
157
- finalAnswer: winnerKey,
158
- confidence: weightedCounts.get(winnerKey)! / totalWeight,
159
- // ...
160
- };
161
- }
162
- ```
163
-
164
- **Expected improvement**: +6 pts accuracy, 61% calibration error reduction.
165
-
166
- ### Improvement 3: Cross-Model Hallucination Detection (SelfCheckGPT-style)
167
-
168
- **Problem**: No mechanism to detect when ALL models hallucinate together.
169
-
170
- **Fix**: Add verification pass where models cross-score each other's answers.
171
-
172
- ```typescript
173
- async detectHallucination(
174
- query: string,
175
- answers: Map<string, string>
176
- ): Promise<{ score: number; flags: string[] }> {
177
- const scores: Record<string, number> = {};
178
-
179
- for (const [provider, answer] of Object.entries(answers)) {
180
- // Ask each model to evaluate OTHER models' answers
181
- const verifyPrompt = `Question: ${query}\nAnswer to evaluate: ${answer}\nIs this answer correct? Score 0-1 with brief reason.`;
182
-
183
- const verifier = this.getVerifier(provider); // Different model
184
- const res = await this.router.chat(verifyPrompt, { model: verifier });
185
- scores[provider] = extractScore(res); // Parse "0.7" from response
186
- }
187
-
188
- const avgScore = mean(Object.values(scores));
189
- const agreement = calculateAgreement(answers);
190
-
191
- // Flag if: low avg score OR high confidence but high disagreement
192
- const flags = [];
193
- if (avgScore < 0.6) flags.push('low_credibility');
194
- if (agreement > 0.8 && avgScore < 0.7) flags.push('false_consensus');
195
-
196
- return { score: avgScore, flags };
197
- }
198
- ```
199
-
200
- **Expected improvement**: +0.15 AUC for hallucination detection (0.74 → 0.89).
201
-
202
- ### Improvement 4: Adaptive Provider Selection for Ensemble
203
-
204
- **Problem**: Ensemble uses all available providers; should select for error diversity.
205
-
206
- **Fix**: Score providers by expected error diversity before ensemble execution.
207
-
208
- ```typescript
209
- async selectDiverseProviders(
210
- query: string,
211
- maxProviders: number = 4
212
- ): Promise<string[]> {
213
- const features = extractQueryFeatures(query);
214
- const allProviders = getAvailableProviders();
215
-
216
- // Score each provider for this query type
217
- const scored = allProviders.map(p => ({
218
- id: p.id,
219
- modelFamily: extractFamily(p.models[0]), // Anthropic, Google, etc.
220
- quality: scoreModelFit(p, features),
221
- diversityBonus: getDiverseFamilyBonus(p, features),
222
- total: scoreModelFit(p, features) + getDiverseFamilyBonus(p, features)
223
- }));
224
-
225
- // Greedy selection: pick highest total, then remove same-family providers
226
- const selected: string[] = [];
227
- const usedFamilies = new Set<string>();
228
-
229
- for (const candidate of scored.sort((a, b) => b.total - a.total)) {
230
- const family = candidate.modelFamily;
231
- if (!usedFamilies.has(family)) {
232
- selected.push(candidate.id);
233
- usedFamilies.add(family);
234
- if (selected.length >= maxProviders) break;
235
- }
236
- }
237
-
238
- return selected;
239
- }
240
- ```
241
-
242
- **Expected improvement**: +8 pts accuracy on adversarial queries (error diversity: 38% → 62%).
243
-
244
- ### Improvement 5: Multi-Resolution Voting (F0 + Text)
245
-
246
- **Problem**: Text-only voting misses prosodic signals (laughter, pause, F0).
247
-
248
- **Fix**: Add audio confidence signal from Whisper word timestamps.
249
-
250
- ```typescript
251
- async voteWithAudio(
252
- query: string,
253
- answers: string[],
254
- audioSegments: AudioSegment[] // from Whisper
255
- ): Promise<EnsembleResponse> {
256
- // 1. Text voting
257
- const textClusters = await clusterAnswers(answers);
258
- const textWinner = argmax(textClusters, (v) => v.length);
259
-
260
- // 2. Audio signal: laughter detection in response region
261
- const laughterScore = calculateLaughterScore(audioSegments);
262
-
263
- // 3. Combined: weight text vote by laughter confidence
264
- // If query appears to be humorous context and laughter detected,
265
- // boost providers known for humor (e.g., GPT-4o vs DeepSeek)
266
-
267
- const combinedConfidence = textVote.confidence * (1 + laughterScore * 0.2);
268
-
269
- return {
270
- finalAnswer: textWinner,
271
- confidence: combinedConfidence,
272
- audioSignal: laughterScore,
273
- // ...
274
- };
275
- }
276
- ```
277
-
278
- **Expected improvement**: +5 pts on conversational/creative queries where prosody matters.
279
-
280
- ---
281
-
282
- ## 4. Implementation Roadmap
283
-
284
- | Phase | Change | Complexity | Impact |
285
- |---|---|---|---|
286
- | P0 (1 week) | Semantic answer clustering with embeddings | Medium | +4 pts accuracy |
287
- | P1 (1 week) | Confidence-weighted voting with logprobs | Medium | +6 pts accuracy |
288
- | P2 (2 weeks) | Cross-model hallucination detection | High | +0.15 AUC |
289
- | P3 (1 week) | Adaptive provider diversity selection | Low | +8 pts adversarial |
290
- | P4 (3 weeks) | Multi-resolution audio integration | High | +5 pts conversational |
291
-
292
- **Total expected improvement**: +8-12 pts overall accuracy, 60% false consensus reduction, 0.15 AUC hallucination detection improvement.
293
-
294
- ---
295
-
296
- ## 5. Benchmarking Plan
297
-
298
- Test on held-out queries from:
299
-
300
- 1. **TruthfulQA** (817 adversarial questions) — hallucination detection
301
- 2. **GSM8K** (math reasoning) — voting accuracy
302
- 3. **MMLU** (multilingual) — cross-lingual robustness
303
- 4. **Custom A3M benchmark** — provider diversity
304
-
305
- Log metrics:
306
- - `ensemble_accuracy` (% correct vs. single best)
307
- - `ensemble_confidence_calibration` (ECE score)
308
- - `false_consensus_rate` (% queries where all models wrong same way)
309
- - `hallucination_detection_auc` (SelfCheckGPT scoring)
310
-
311
- ---
312
-
313
- ## 6. References
314
-
315
- - Wang et al., "Self-Consistency", ICLR 2023. https://arxiv.org/abs/2203.11171
316
- - Lakshminarayanan et al., "Deep Ensembles", NeurIPS 2017. https://arxiv.org/abs/1612.01474
317
- - Lin et al., "TruthfulQA", ACL 2022. https://arxiv.org/abs/2109.07958
318
- - Manakul et al., "SelfCheckGPT", EMNLP 2023. https://arxiv.org/abs/2303.08896
319
- - Sheng et al., "RouteLLM", arXiv 2024. https://arxiv.org/abs/2403.05020
320
-
321
- ---
322
-
323
- *Research date: 2026-06-03*
324
- *Project: adaptive-memory-multi-model-router (A3M Router)*