mdcontext 0.1.0 → 0.2.0

Files changed (251)
  1. package/.changeset/config.json +9 -9
  2. package/.claude/settings.local.json +25 -0
  3. package/.github/workflows/claude-code-review.yml +44 -0
  4. package/.github/workflows/claude.yml +85 -0
  5. package/CONTRIBUTING.md +186 -0
  6. package/NOTES/NOTES +44 -0
  7. package/README.md +206 -3
  8. package/biome.json +1 -1
  9. package/dist/chunk-23UPXDNL.js +3044 -0
  10. package/dist/chunk-2W7MO2DL.js +1366 -0
  11. package/dist/chunk-3NUAZGMA.js +1689 -0
  12. package/dist/chunk-7TOWB2XB.js +366 -0
  13. package/dist/chunk-7XOTOADQ.js +3065 -0
  14. package/dist/chunk-AH2PDM2K.js +3042 -0
  15. package/dist/chunk-BNXWSZ63.js +3742 -0
  16. package/dist/chunk-BTL5DJVU.js +3222 -0
  17. package/dist/chunk-HDHYG7E4.js +104 -0
  18. package/dist/chunk-HLR4KZBP.js +3234 -0
  19. package/dist/chunk-IP3FRFEB.js +1045 -0
  20. package/dist/chunk-KHU56VDO.js +3042 -0
  21. package/dist/chunk-KRYIFLQR.js +85 -89
  22. package/dist/chunk-LBSDNLEM.js +287 -0
  23. package/dist/chunk-MNTQ7HCP.js +2643 -0
  24. package/dist/chunk-MUJELQQ6.js +1387 -0
  25. package/dist/chunk-MXJGMSLV.js +2199 -0
  26. package/dist/chunk-N6QJGC3Z.js +2636 -0
  27. package/dist/chunk-OBELGBPM.js +1713 -0
  28. package/dist/chunk-OT7R5XTA.js +3192 -0
  29. package/dist/chunk-P7X4RA2T.js +106 -0
  30. package/dist/chunk-PIDUQNC2.js +3185 -0
  31. package/dist/chunk-POGCDIH4.js +3187 -0
  32. package/dist/chunk-PSIEOQGZ.js +3043 -0
  33. package/dist/chunk-PVRT3IHA.js +3238 -0
  34. package/dist/chunk-QNN4TT23.js +1430 -0
  35. package/dist/chunk-RE3R45RJ.js +3042 -0
  36. package/dist/chunk-S7E6TFX6.js +718 -657
  37. package/dist/chunk-SG6GLU4U.js +1378 -0
  38. package/dist/chunk-SJCDV2ST.js +274 -0
  39. package/dist/chunk-SYE5XLF3.js +104 -0
  40. package/dist/chunk-T5VLYBZD.js +103 -0
  41. package/dist/chunk-TOQB7VWU.js +3238 -0
  42. package/dist/chunk-VFNMZ4ZQ.js +3228 -0
  43. package/dist/chunk-VVTGZNBT.js +1533 -1423
  44. package/dist/chunk-W7Q4RFEV.js +104 -0
  45. package/dist/chunk-XTYYVRLO.js +3190 -0
  46. package/dist/chunk-Y6MDYVJD.js +3063 -0
  47. package/dist/cli/main.js +4072 -629
  48. package/dist/index.d.ts +420 -33
  49. package/dist/index.js +8 -15
  50. package/dist/mcp/server.js +103 -7
  51. package/dist/schema-BAWSG7KY.js +22 -0
  52. package/dist/schema-E3QUPL26.js +20 -0
  53. package/dist/schema-EHL7WUT6.js +20 -0
  54. package/docs/019-USAGE.md +44 -5
  55. package/docs/020-current-implementation.md +8 -8
  56. package/docs/021-DOGFOODING-FINDINGS.md +1 -1
  57. package/docs/CONFIG.md +1123 -0
  58. package/docs/ERRORS.md +383 -0
  59. package/docs/summarization.md +320 -0
  60. package/justfile +40 -0
  61. package/package.json +39 -33
  62. package/research/INDEX.md +315 -0
  63. package/research/code-review/README.md +90 -0
  64. package/research/code-review/cli-error-handling-review.md +979 -0
  65. package/research/code-review/code-review-validation-report.md +464 -0
  66. package/research/code-review/main-ts-review.md +1128 -0
  67. package/research/config-docs/SUMMARY.md +357 -0
  68. package/research/config-docs/TEST-RESULTS.md +776 -0
  69. package/research/config-docs/TODO.md +542 -0
  70. package/research/config-docs/analysis.md +744 -0
  71. package/research/config-docs/fix-validation.md +502 -0
  72. package/research/config-docs/help-audit.md +264 -0
  73. package/research/config-docs/help-system-analysis.md +890 -0
  74. package/research/frontmatter/COMMENTS-ARE-SKIPPED.md +149 -0
  75. package/research/frontmatter/LLM-CODE-NAVIGATION.md +276 -0
  76. package/research/issue-review.md +603 -0
  77. package/research/llm-summarization/agent-cli-tools-2026.md +1082 -0
  78. package/research/llm-summarization/alternative-providers-2026.md +1428 -0
  79. package/research/llm-summarization/anthropic-2026.md +367 -0
  80. package/research/llm-summarization/claude-cli-integration.md +1706 -0
  81. package/research/llm-summarization/cli-integration-patterns.md +3155 -0
  82. package/research/llm-summarization/openai-2026.md +473 -0
  83. package/research/llm-summarization/openai-compatible-providers-2026.md +1022 -0
  84. package/research/llm-summarization/opencode-cli-integration.md +1552 -0
  85. package/research/llm-summarization/prompt-engineering-2026.md +1426 -0
  86. package/research/llm-summarization/prototype-results.md +56 -0
  87. package/research/llm-summarization/provider-switching-patterns-2026.md +2153 -0
  88. package/research/llm-summarization/typescript-llm-libraries-2026.md +2436 -0
  89. package/research/mdcontext-pudding/00-EXECUTIVE-SUMMARY.md +282 -0
  90. package/research/mdcontext-pudding/01-index-embed.md +956 -0
  91. package/research/mdcontext-pudding/02-search-COMMANDS.md +142 -0
  92. package/research/mdcontext-pudding/02-search-SUMMARY.md +146 -0
  93. package/research/mdcontext-pudding/02-search.md +970 -0
  94. package/research/mdcontext-pudding/03-context.md +779 -0
  95. package/research/mdcontext-pudding/04-navigation-and-analytics.md +803 -0
  96. package/research/mdcontext-pudding/04-tree.md +704 -0
  97. package/research/mdcontext-pudding/05-config.md +1038 -0
  98. package/research/mdcontext-pudding/06-links-summary.txt +87 -0
  99. package/research/mdcontext-pudding/06-links.md +679 -0
  100. package/research/mdcontext-pudding/07-stats.md +693 -0
  101. package/research/mdcontext-pudding/BUG-FIX-PLAN.md +388 -0
  102. package/research/mdcontext-pudding/P0-BUG-VALIDATION.md +167 -0
  103. package/research/mdcontext-pudding/README.md +168 -0
  104. package/research/mdcontext-pudding/TESTING-SUMMARY.md +128 -0
  105. package/research/research-quality-review.md +834 -0
  106. package/research/semantic-search/embedding-text-analysis.md +156 -0
  107. package/research/semantic-search/multi-word-failure-reproduction.md +171 -0
  108. package/research/semantic-search/query-processing-analysis.md +207 -0
  109. package/research/semantic-search/root-cause-and-solution.md +114 -0
  110. package/research/semantic-search/threshold-validation-report.md +69 -0
  111. package/research/semantic-search/vector-search-analysis.md +63 -0
  112. package/research/test-path-issues.md +276 -0
  113. package/review/ALP-76/1-error-type-design.md +962 -0
  114. package/review/ALP-76/2-error-handling-patterns.md +906 -0
  115. package/review/ALP-76/3-error-presentation.md +624 -0
  116. package/review/ALP-76/4-test-coverage.md +625 -0
  117. package/review/ALP-76/5-migration-completeness.md +440 -0
  118. package/review/ALP-76/6-effect-best-practices.md +755 -0
  119. package/scripts/apply-branch-protection.sh +47 -0
  120. package/scripts/branch-protection-templates.json +79 -0
  121. package/scripts/prototype-summarization.ts +346 -0
  122. package/scripts/rebuild-hnswlib.js +32 -37
  123. package/scripts/setup-branch-protection.sh +64 -0
  124. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/active-provider.json +7 -0
  125. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/bm25.json +541 -0
  126. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/bm25.meta.json +5 -0
  127. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/config.json +8 -0
  128. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.bin +0 -0
  129. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.meta.bin +0 -0
  130. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/documents.json +60 -0
  131. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/links.json +13 -0
  132. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/sections.json +1197 -0
  133. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/configuration-management.md +99 -0
  134. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/distributed-systems.md +92 -0
  135. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/error-handling.md +78 -0
  136. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/failure-automation.md +55 -0
  137. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/job-context.md +69 -0
  138. package/src/__tests__/fixtures/semantic-search/multi-word-corpus/process-orchestration.md +99 -0
  139. package/src/cli/argv-preprocessor.test.ts +2 -2
  140. package/src/cli/cli.test.ts +230 -33
  141. package/src/cli/commands/config-cmd.ts +642 -0
  142. package/src/cli/commands/context.ts +97 -9
  143. package/src/cli/commands/duplicates.ts +122 -0
  144. package/src/cli/commands/embeddings.ts +529 -0
  145. package/src/cli/commands/index-cmd.ts +210 -30
  146. package/src/cli/commands/index.ts +3 -0
  147. package/src/cli/commands/search.ts +894 -64
  148. package/src/cli/commands/stats.ts +3 -0
  149. package/src/cli/commands/tree.ts +26 -5
  150. package/src/cli/config-layer.ts +176 -0
  151. package/src/cli/error-handler.test.ts +235 -0
  152. package/src/cli/error-handler.ts +655 -0
  153. package/src/cli/flag-schemas.ts +66 -0
  154. package/src/cli/help.ts +209 -7
  155. package/src/cli/main.ts +348 -58
  156. package/src/cli/options.ts +10 -0
  157. package/src/cli/shared-error-handling.ts +199 -0
  158. package/src/cli/utils.ts +150 -17
  159. package/src/config/file-provider.test.ts +320 -0
  160. package/src/config/file-provider.ts +273 -0
  161. package/src/config/index.ts +72 -0
  162. package/src/config/integration.test.ts +667 -0
  163. package/src/config/precedence.test.ts +277 -0
  164. package/src/config/precedence.ts +451 -0
  165. package/src/config/schema.test.ts +414 -0
  166. package/src/config/schema.ts +603 -0
  167. package/src/config/service.test.ts +320 -0
  168. package/src/config/service.ts +243 -0
  169. package/src/config/testing.test.ts +264 -0
  170. package/src/config/testing.ts +110 -0
  171. package/src/core/types.ts +6 -33
  172. package/src/duplicates/detector.test.ts +183 -0
  173. package/src/duplicates/detector.ts +414 -0
  174. package/src/duplicates/index.ts +18 -0
  175. package/src/embeddings/embedding-namespace.test.ts +300 -0
  176. package/src/embeddings/embedding-namespace.ts +947 -0
  177. package/src/embeddings/heading-boost.test.ts +222 -0
  178. package/src/embeddings/hnsw-build-options.test.ts +198 -0
  179. package/src/embeddings/hyde.test.ts +272 -0
  180. package/src/embeddings/hyde.ts +264 -0
  181. package/src/embeddings/index.ts +2 -0
  182. package/src/embeddings/openai-provider.ts +332 -83
  183. package/src/embeddings/pricing.json +22 -0
  184. package/src/embeddings/provider-constants.ts +204 -0
  185. package/src/embeddings/provider-errors.test.ts +967 -0
  186. package/src/embeddings/provider-errors.ts +565 -0
  187. package/src/embeddings/provider-factory.test.ts +240 -0
  188. package/src/embeddings/provider-factory.ts +225 -0
  189. package/src/embeddings/provider-integration.test.ts +788 -0
  190. package/src/embeddings/query-preprocessing.test.ts +187 -0
  191. package/src/embeddings/semantic-search-threshold.test.ts +508 -0
  192. package/src/embeddings/semantic-search.ts +780 -93
  193. package/src/embeddings/types.ts +293 -16
  194. package/src/embeddings/vector-store.ts +486 -77
  195. package/src/embeddings/voyage-provider.ts +313 -0
  196. package/src/errors/errors.test.ts +845 -0
  197. package/src/errors/index.ts +533 -0
  198. package/src/index/ignore-patterns.test.ts +354 -0
  199. package/src/index/ignore-patterns.ts +305 -0
  200. package/src/index/indexer.ts +286 -48
  201. package/src/index/storage.ts +94 -30
  202. package/src/index/types.ts +40 -2
  203. package/src/index/watcher.ts +67 -9
  204. package/src/index.ts +22 -0
  205. package/src/integration/search-keyword.test.ts +678 -0
  206. package/src/mcp/server.ts +135 -6
  207. package/src/parser/parser.ts +18 -19
  208. package/src/parser/section-filter.test.ts +277 -0
  209. package/src/parser/section-filter.ts +125 -3
  210. package/src/search/__tests__/hybrid-search.test.ts +650 -0
  211. package/src/search/bm25-store.ts +366 -0
  212. package/src/search/cross-encoder.test.ts +253 -0
  213. package/src/search/cross-encoder.ts +406 -0
  214. package/src/search/fuzzy-search.test.ts +419 -0
  215. package/src/search/fuzzy-search.ts +273 -0
  216. package/src/search/hybrid-search.ts +448 -0
  217. package/src/search/path-matcher.test.ts +276 -0
  218. package/src/search/path-matcher.ts +33 -0
  219. package/src/search/searcher.test.ts +99 -1
  220. package/src/search/searcher.ts +189 -67
  221. package/src/search/wink-bm25.d.ts +30 -0
  222. package/src/summarization/cli-providers/claude.ts +202 -0
  223. package/src/summarization/cli-providers/detection.test.ts +273 -0
  224. package/src/summarization/cli-providers/detection.ts +118 -0
  225. package/src/summarization/cli-providers/index.ts +8 -0
  226. package/src/summarization/cost.test.ts +139 -0
  227. package/src/summarization/cost.ts +102 -0
  228. package/src/summarization/error-handler.test.ts +127 -0
  229. package/src/summarization/error-handler.ts +111 -0
  230. package/src/summarization/index.ts +102 -0
  231. package/src/summarization/pipeline.test.ts +498 -0
  232. package/src/summarization/pipeline.ts +231 -0
  233. package/src/summarization/prompts.test.ts +269 -0
  234. package/src/summarization/prompts.ts +133 -0
  235. package/src/summarization/provider-factory.test.ts +396 -0
  236. package/src/summarization/provider-factory.ts +178 -0
  237. package/src/summarization/types.ts +184 -0
  238. package/src/summarize/summarizer.ts +104 -35
  239. package/src/types/huggingface-transformers.d.ts +66 -0
  240. package/tests/fixtures/cli/.mdcontext/active-provider.json +7 -0
  241. package/tests/fixtures/cli/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.bin +0 -0
  242. package/tests/fixtures/cli/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.meta.bin +0 -0
  243. package/tests/fixtures/cli/.mdcontext/indexes/documents.json +4 -4
  244. package/tests/fixtures/cli/.mdcontext/indexes/sections.json +14 -0
  245. package/tests/integration/embed-index.test.ts +712 -0
  246. package/tests/integration/search-context.test.ts +469 -0
  247. package/tests/integration/search-semantic.test.ts +522 -0
  248. package/vitest.config.ts +1 -6
  249. package/AGENTS.md +0 -46
  250. package/tests/fixtures/cli/.mdcontext/vectors.bin +0 -0
  251. package/tests/fixtures/cli/.mdcontext/vectors.meta.json +0 -1264
@@ -0,0 +1,2153 @@
# LLM Provider Switching and Fallback Patterns - 2026 Research

**Date**: January 26, 2026
**Author**: Research Team
**Purpose**: Document production-ready patterns for multi-provider LLM architectures with automatic fallback, cost optimization, and reliability

---

## Executive Summary

This document synthesizes current best practices for building robust multi-provider LLM systems in 2026. Based on analysis of production implementations from Vercel AI SDK, LiteLLM, LangChain, and real-world deployments, the following patterns emerge as essential:

**Key Takeaways:**
1. **Multi-provider architecture is now standard** - Single-provider dependency is considered a production anti-pattern
2. **Cost optimization through smart routing** - Teams see 30-50% cost reduction with intelligent provider selection
3. **Caching delivers 80%+ savings** - Combined prompt caching and semantic caching provide the highest ROI
4. **Circuit breakers prevent cascading failures** - Essential for production stability
5. **Provider abstraction layers are mature** - TypeScript solutions with strong type safety available

**Cost Reduction Potential:**
- Smart routing: 30-50% reduction
- Prompt caching: 45-80% reduction
- Semantic caching: 15-30% additional reduction
- Combined strategies: 80-95% total reduction possible
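A quick sanity check on the combined figure: independent savings compound on the *remaining* spend rather than adding, which is why three strategies in the 20-60% range can land in the 80-95% band. A sketch, where the 40%/60%/20% inputs are illustrative mid-range picks from the list above, not measured values:

```typescript
// Illustrative arithmetic: each strategy leaves (1 - r) of the previous cost
// in place, so independent reductions multiply rather than add.
function combinedReduction(reductions: number[]): number {
  const remaining = reductions.reduce((left, r) => left * (1 - r), 1);
  return 1 - remaining;
}

// Smart routing (40%) + prompt caching (60%) + semantic caching (20%)
const total = combinedReduction([0.4, 0.6, 0.2]);
console.log(`${Math.round(total * 100)}% total reduction`); // → 81% total reduction
```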
---

## 1. Multi-Provider Architecture Patterns

### 1.1 Primary + Fallback Strategy

The most common production pattern uses a primary provider with automatic fallback to secondary providers when errors occur.

**Architecture:**
```typescript
interface ProviderConfig {
  name: string;
  priority: number;
  models: {
    cheap: string;     // For simple tasks
    standard: string;  // For typical workloads
    premium: string;   // For complex reasoning
  };
  healthCheck: () => Promise<boolean>;
  circuitBreaker: CircuitBreakerConfig;
}

const providerChain: ProviderConfig[] = [
  {
    name: 'anthropic',
    priority: 1,
    models: {
      cheap: 'claude-3-haiku-20240307',
      standard: 'claude-3-5-sonnet-20241022',
      premium: 'claude-opus-4-5-20251101'
    },
    healthCheck: () => checkAnthropicHealth(),
    circuitBreaker: {
      failureThreshold: 10,
      successThreshold: 3,
      timeout: 60000,
    }
  },
  {
    name: 'openai',
    priority: 2,
    models: {
      cheap: 'gpt-4o-mini',
      standard: 'gpt-4o',
      premium: 'o1'
    },
    healthCheck: () => checkOpenAIHealth(),
    circuitBreaker: {
      failureThreshold: 10,
      successThreshold: 3,
      timeout: 60000,
    }
  },
  {
    name: 'google',
    priority: 3,
    models: {
      cheap: 'gemini-2.5-flash',
      standard: 'gemini-2.5-pro',
      premium: 'gemini-2.5-pro'
    },
    healthCheck: () => checkGoogleHealth(),
    circuitBreaker: {
      failureThreshold: 10,
      successThreshold: 3,
      timeout: 60000,
    }
  }
];
```

**Key Features:**
- Ordered fallback chain with priority levels
- Per-provider circuit breakers to prevent cascading failures
- Health checks to proactively detect provider issues
- Quality tiers (cheap/standard/premium) for cost optimization
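The `checkAnthropicHealth()`-style probes referenced in the config are left undefined above. One plausible shape is a cheap API call raced against a deadline, so that a hung provider counts as unhealthy rather than blocking the chain. A sketch - the probe itself (e.g. a models-list call) is an assumption:

```typescript
// Hypothetical helper: runs a cheap probe and reports unhealthy on either
// an error or a missed deadline. The probe is whatever lightweight call
// the provider offers (a models-list request, for instance).
async function healthCheckWithTimeout(
  probe: () => Promise<unknown>,
  timeoutMs = 2000
): Promise<boolean> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error('health check timed out')), timeoutMs);
  });
  try {
    await Promise.race([probe(), deadline]);
    return true;  // Probe answered within the deadline
  } catch {
    return false; // Probe threw, or the deadline fired first
  } finally {
    clearTimeout(timer); // Don't leave the timer holding the process open
  }
}
```

A provider entry's `healthCheck` could then be written as `() => healthCheckWithTimeout(() => client.models.list())`, where `client` is whatever SDK handle the provider exposes.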
**Sources:**
- [Vercel AI Gateway Model Fallbacks](https://vercel.com/changelog/model-fallbacks-now-available-in-vercel-ai-gateway)
- [LiteLLM Router Architecture](https://docs.litellm.ai/docs/router_architecture)

### 1.2 Quality-Based Routing

Route requests to appropriate model tiers based on task complexity, achieving 30-50% cost reduction.

**Implementation Pattern:**
```typescript
interface TaskClassification {
  complexity: 'simple' | 'standard' | 'complex';
  estimatedTokens: number;
  requiresReasoning: boolean;
  requiresToolUse: boolean;
}

function selectModel(task: TaskClassification): ModelConfig {
  // Route 70% of queries to cheap models, escalate only when needed
  if (task.complexity === 'simple' && !task.requiresReasoning) {
    return {
      provider: 'google',
      model: 'gemini-2.5-flash',
      costPer1M: { input: 0.30, output: 2.50 }
    };
  }

  if (task.complexity === 'standard' || task.requiresToolUse) {
    return {
      provider: 'anthropic',
      model: 'claude-3-5-sonnet-20241022',
      costPer1M: { input: 3.00, output: 15.00 }
    };
  }

  // Complex reasoning tasks
  return {
    provider: 'anthropic',
    model: 'claude-opus-4-5-20251101',
    costPer1M: { input: 15.00, output: 75.00 }
  };
}
```

**Routing Strategies:**
1. **Simple Classification** - Use cheap models (GPT-4o-mini, Gemini Flash, Claude Haiku)
2. **Standard Q&A** - Use mid-tier models (GPT-4o, Claude Sonnet, Gemini Pro)
3. **Complex Reasoning** - Use premium models (Claude Opus, o1, Gemini Pro with extended context)
4. **Internal Processing** - Use local open-source models when possible

**Production Impact:**
- 70% of queries routed to cheap models
- 25% to standard models
- 5% to premium models
- Typical cost reduction: 30-50%
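Plugging the per-1M prices from `selectModel()` into the 70/25/5 split shows where the claimed reduction comes from. A back-of-envelope sketch, assuming a uniform workload of 1K input + 1K output tokens per request and an "everything on the mid-tier model" baseline (both assumptions are illustrative):

```typescript
// Tier prices taken from the costPer1M values in selectModel() above.
type Price = { input: number; output: number };

const tierPrices: Record<'cheap' | 'standard' | 'premium', Price> = {
  cheap: { input: 0.30, output: 2.50 },      // gemini-2.5-flash
  standard: { input: 3.00, output: 15.00 },  // claude-3-5-sonnet
  premium: { input: 15.00, output: 75.00 },  // claude-opus
};

function costPerRequest(p: Price, inTok = 1000, outTok = 1000): number {
  return (inTok / 1_000_000) * p.input + (outTok / 1_000_000) * p.output;
}

// Blend the tiers with the 70/25/5 routing split
const blended =
  0.70 * costPerRequest(tierPrices.cheap) +
  0.25 * costPerRequest(tierPrices.standard) +
  0.05 * costPerRequest(tierPrices.premium);

const allStandard = costPerRequest(tierPrices.standard);
const reduction = 1 - blended / allStandard;
console.log(`${Math.round(reduction * 100)}% cheaper than all-standard`); // → 39% cheaper than all-standard
```

The result lands inside the 30-50% band quoted above; heavier premium usage or a cheaper baseline shifts it accordingly.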
**Sources:**
- [LLM Cost Optimization Strategies](https://medium.com/@ajayverma23/taming-the-beast-cost-optimization-strategies-for-llm-api-calls-in-production-11f16dbe2c39)
- [Smart Routing with Load Balancing](https://www.kosmoy.com/post/llm-cost-management-stop-burning-money-on-tokens)

### 1.3 Cost-Optimized Provider Selection

Dynamically route to the cheapest provider for equivalent capability.

**2026 Cost Comparison (per 1M tokens):**

| Provider | Cheap Model | Input | Output | Best For |
|----------|-------------|-------|--------|----------|
| **DeepSeek** | deepseek-v3 | $0.27 | $1.10 | Massive cost savings (95% vs OpenAI) |
| **Google** | gemini-2.5-flash | $0.30 | $2.50 | Speed + cost balance |
| **OpenAI** | gpt-4o-mini | $0.15 | $0.60 | Simple tasks, fast responses |
| **Anthropic** | claude-3-haiku | $0.25 | $1.25 | Structured output, tool use |
| **Mistral** | mistral-small | $0.20 | $0.60 | EU data residency |

| Provider | Standard Model | Input | Output | Best For |
|----------|---------------|-------|--------|----------|
| **DeepSeek** | deepseek-r1 | $0.55 | $2.19 | Cost-effective reasoning |
| **Google** | gemini-2.5-pro | $1.25 | $10.00 | Large context windows |
| **OpenAI** | gpt-4o | $2.50 | $10.00 | General-purpose, reliable |
| **Anthropic** | claude-3.5-sonnet | $3.00 | $15.00 | Code generation, analysis |
| **Mistral** | devstral-2 | $0.62 | $2.46 | Code-specific tasks |

**Dynamic Routing Logic:**
```typescript
interface ProviderCost {
  provider: string;
  model: string;
  inputCostPer1M: number;
  outputCostPer1M: number;
  estimatedLatencyMs: number;
}

function selectCheapestProvider(
  estimatedInputTokens: number,
  estimatedOutputTokens: number,
  maxLatency?: number
): ProviderCost {
  const options: ProviderCost[] = [
    {
      provider: 'deepseek',
      model: 'deepseek-v3',
      inputCostPer1M: 0.27,
      outputCostPer1M: 1.10,
      estimatedLatencyMs: 2000
    },
    {
      provider: 'google',
      model: 'gemini-2.5-flash',
      inputCostPer1M: 0.30,
      outputCostPer1M: 2.50,
      estimatedLatencyMs: 800
    },
    {
      provider: 'openai',
      model: 'gpt-4o-mini',
      inputCostPer1M: 0.15,
      outputCostPer1M: 0.60,
      estimatedLatencyMs: 600
    }
  ];

  // Filter by latency requirement
  const validOptions = maxLatency
    ? options.filter(o => o.estimatedLatencyMs <= maxLatency)
    : options;

  // Calculate total cost and select cheapest
  return validOptions.reduce((cheapest, current) => {
    const currentCost =
      (estimatedInputTokens / 1_000_000) * current.inputCostPer1M +
      (estimatedOutputTokens / 1_000_000) * current.outputCostPer1M;

    const cheapestCost =
      (estimatedInputTokens / 1_000_000) * cheapest.inputCostPer1M +
      (estimatedOutputTokens / 1_000_000) * cheapest.outputCostPer1M;

    return currentCost < cheapestCost ? current : cheapest;
  });
}
```

**Key Insight:** Pricing varies 10x across providers for similar capability - smart routing prevents overpaying.

**Sources:**
- [LLM Cost Optimization 2026](https://byteiota.com/llm-cost-optimization-stop-overpaying-5-10x-in-2026/)
- [LLM API Pricing Comparison](https://pricepertoken.com/)

---
## 2. Automatic Failover and Error Handling

### 2.1 Circuit Breaker Pattern

Circuit breakers prevent cascading failures by temporarily blocking requests to unhealthy providers.

**States:**
1. **Closed** - Normal operation, requests pass through
2. **Open** - Too many failures, requests blocked for timeout period
3. **Half-Open** - Testing recovery, limited requests allowed

**Implementation:**
```typescript
class CircuitBreaker {
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private failures = 0;
  private successes = 0;
  private lastFailureTime: number | null = null;

  constructor(
    private config: {
      failureThreshold: number;  // e.g., 10 failures
      successThreshold: number;  // e.g., 3 successes to close
      timeout: number;           // e.g., 60000ms (60s)
      jitter: number;            // e.g., 5000ms random variation
    }
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      const timeoutWithJitter =
        this.config.timeout + Math.random() * this.config.jitter;

      if (Date.now() - this.lastFailureTime! < timeoutWithJitter) {
        throw new Error('Circuit breaker is OPEN');
      }

      // Transition to half-open for testing
      this.state = 'half-open';
      this.successes = 0;
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess(): void {
    this.failures = 0;

    if (this.state === 'half-open') {
      this.successes++;

      if (this.successes >= this.config.successThreshold) {
        this.state = 'closed';
        console.log('Circuit breaker closed - provider recovered');
      }
    }
  }

  private onFailure(): void {
    this.failures++;
    this.lastFailureTime = Date.now();

    if (this.failures >= this.config.failureThreshold) {
      this.state = 'open';
      console.log('Circuit breaker opened - provider unhealthy');
    }
  }

  getState(): { state: string; failures: number; successes: number } {
    return {
      state: this.state,
      failures: this.failures,
      successes: this.successes
    };
  }
}
```

**Benefits:**
- Prevents overwhelming unhealthy providers
- Gives systems time to recover
- Provides clear health signals for monitoring
- Reduces cascading failures

**Sources:**
- [Circuit Breakers in LLM Apps](https://portkey.ai/blog/retries-fallbacks-and-circuit-breakers-in-llm-apps/)
- [Azure API Management LLM Resiliency](https://techcommunity.microsoft.com/blog/azuredevcommunityblog/improve-llm-backend-resiliency-with-load-balancer-and-circuit-breaker-rules-in-a/4394502)
### 2.2 Retry Logic with Exponential Backoff

All major LLM providers recommend exponential backoff with jitter for handling rate limits and transient failures.

**Implementation:**
```typescript
interface RetryConfig {
  maxRetries: number;
  baseDelay: number;        // e.g., 1000ms
  maxDelay: number;         // e.g., 60000ms
  exponentialBase: number;  // e.g., 2
  jitter: boolean;
}

async function retryWithExponentialBackoff<T>(
  fn: () => Promise<T>,
  config: RetryConfig,
  onRetry?: (attempt: number, delay: number, error: any) => void
): Promise<T> {
  let lastError: any;

  for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error: any) {
      lastError = error;

      // Don't retry on non-retryable errors
      if (!isRetryableError(error)) {
        throw error;
      }

      // Check if we've exhausted retries
      if (attempt === config.maxRetries) {
        break;
      }

      // Calculate delay with exponential backoff
      let delay = Math.min(
        config.baseDelay * Math.pow(config.exponentialBase, attempt),
        config.maxDelay
      );

      // Add jitter to prevent thundering herd
      if (config.jitter) {
        delay = delay * (0.5 + Math.random() * 0.5);
      }

      // Check for retry-after header
      const retryAfter = parseRetryAfterHeader(error);
      if (retryAfter) {
        delay = Math.max(delay, retryAfter * 1000);
      }

      onRetry?.(attempt + 1, delay, error);

      await sleep(delay);
    }
  }

  throw lastError;
}

function isRetryableError(error: any): boolean {
  // Rate limiting (429)
  if (error.status === 429) return true;

  // Server errors (500-599)
  if (error.status >= 500 && error.status < 600) return true;

  // Network errors
  if (error.code === 'ECONNRESET' ||
      error.code === 'ETIMEDOUT' ||
      error.code === 'ENOTFOUND') return true;

  // Provider-specific timeout errors
  if (error.message?.includes('timeout')) return true;

  return false;
}

function parseRetryAfterHeader(error: any): number | null {
  const retryAfter = error.response?.headers?.['retry-after'];
  if (!retryAfter) return null;

  // The header may carry delay-seconds or an HTTP-date
  const seconds = parseInt(retryAfter, 10);
  if (!isNaN(seconds)) return seconds;

  const date = Date.parse(retryAfter);
  return isNaN(date) ? null : Math.max(0, (date - Date.now()) / 1000);
}

function sleep(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}
```

**Provider-Specific Retry Guidelines (2026):**

**OpenAI:**
- Automatically retries with exponential backoff
- Respects retry-after headers
- Recommended: 3-5 retries with base delay 1s

**Anthropic (Claude):**
- Returns 429 for RPM, ITPM, OTPM limits
- Provides retry-after header for precise wait times
- Recommended: Combine retry-after with exponential backoff fallback

**Google (Gemini):**
- Uses token bucket algorithm
- 429 quota-exceeded on any dimension
- Recommended: Exponential backoff with jitter (gold standard)

**Common Pattern:**
```typescript
const retryConfig: RetryConfig = {
  maxRetries: 5,
  baseDelay: 1000,     // 1s
  maxDelay: 60000,     // 60s max
  exponentialBase: 2,  // Double each time: 1s, 2s, 4s, 8s, 16s
  jitter: true         // Prevent thundering herd
};
```
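The delay sequence claimed in the comment can be derived mechanically from the config. A small helper, mirroring the backoff formula used above with jitter left out so the result is deterministic:

```typescript
// Compute the pre-jitter delay schedule: baseDelay * base^attempt, capped
// at maxDelay. With jitter on, each value would be scaled by a random
// factor in [0.5, 1.0).
function backoffSchedule(cfg: {
  maxRetries: number;
  baseDelay: number;
  maxDelay: number;
  exponentialBase: number;
}): number[] {
  return Array.from({ length: cfg.maxRetries }, (_, attempt) =>
    Math.min(cfg.baseDelay * Math.pow(cfg.exponentialBase, attempt), cfg.maxDelay)
  );
}

console.log(
  backoffSchedule({ maxRetries: 5, baseDelay: 1000, maxDelay: 60000, exponentialBase: 2 })
); // → [ 1000, 2000, 4000, 8000, 16000 ]
```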
470
+ **Sources:**
471
+ - [OpenAI Rate Limits Guide](https://platform.openai.com/docs/guides/rate-limits)
472
+ - [Claude API 429 Error Fix](https://www.aifreeapi.com/en/posts/fix-claude-api-429-rate-limit-error)
473
+ - [Gemini API Rate Limits 2026](https://www.aifreeapi.com/en/posts/gemini-api-rate-limit-explained)
474
+
475
+ ### 2.3 Fallback Chain Implementation
476
+
477
+ Combine retries with provider fallbacks for maximum reliability.
478
+
479
+ **Complete Pattern:**
480
+ ```typescript
481
+ interface FallbackConfig {
482
+ providers: ProviderConfig[];
483
+ retryConfig: RetryConfig;
484
+ circuitBreakers: Map<string, CircuitBreaker>;
485
+ }
486
+
487
+ async function executeWithFallback<T>(
488
+ task: LLMTask,
489
+ config: FallbackConfig
490
+ ): Promise<T> {
491
+ const errors: Array<{ provider: string; error: any }> = [];
492
+
493
+ for (const provider of config.providers) {
494
+ const breaker = config.circuitBreakers.get(provider.name);
495
+
496
+ // Skip if circuit breaker is open
497
+ if (breaker?.getState().state === 'open') {
498
+ console.log(`Skipping ${provider.name} - circuit breaker open`);
499
+ continue;
500
+ }
501
+
502
+ try {
+ // Attempt with retries; use a bare retry loop when no circuit breaker is configured
+ const run = () =>
+ retryWithExponentialBackoff(
+ () => callProvider(provider, task),
+ config.retryConfig,
+ (attempt, delay, error) => {
+ console.log(
+ `Retry ${attempt} for ${provider.name} after ${delay}ms`,
+ error.message
+ );
+ }
+ );
+ const result = breaker ? await breaker.execute(run) : await run();
516
+
517
+ console.log(`Success with ${provider.name}`);
518
+ return result;
519
+
520
+ } catch (error: any) {
521
+ errors.push({ provider: provider.name, error });
522
+ console.error(`Failed with ${provider.name}:`, error.message);
523
+
524
+ // Continue to next provider in chain
525
+ continue;
526
+ }
527
+ }
528
+
529
+ // All providers failed
530
+ throw new Error(
531
+ `All providers failed:\n${errors.map(e =>
532
+ ` ${e.provider}: ${e.error.message}`
533
+ ).join('\n')}`
534
+ );
535
+ }
536
+ ```
537
+
538
+ **LiteLLM Router Flow:**
539
+ 1. Request sent to `function_with_fallbacks`
540
+ 2. Wraps request in try-catch for fallback handling
541
+ 3. Passes to `function_with_retries` for retry logic
542
+ 4. Calls base LiteLLM unified function
543
+ 5. If model is cooling down (rate limited), skip to next
544
+ 6. After `num_retries`, fallback to next model group
545
+ 7. Fallbacks typically go from one model_name to another
546
+
547
+ **Benefits:**
548
+ - Maximum reliability across provider outages
549
+ - Automatic recovery from rate limits
550
+ - Graceful degradation under load
551
+ - Clear error trails for debugging
552
+
553
+ **Sources:**
554
+ - [LiteLLM Reliability Features](https://docs.litellm.ai/docs/completion/reliable_completions)
555
+ - [LangChain Fallbacks](https://python.langchain.com/v0.1/docs/guides/productionization/fallbacks/)
556
+
557
+ ---
558
+
559
+ ## 3. Caching Strategies
560
+
561
+ Caching provides the highest ROI for cost reduction (80-95% savings possible).
562
+
563
+ ### 3.1 Prompt Caching (Provider-Level)
564
+
565
+ Provider-native prefix caching delivers 50-90% cost reduction with minimal implementation effort.
566
+
567
+ **2026 Provider Support:**
568
+
569
+ | Provider | Feature | Cost Reduction | Latency Improvement | TTL |
570
+ |----------|---------|---------------|-------------------|-----|
571
+ | **Anthropic** | Prompt Caching | Up to 90% | Up to 85% | 5 min |
572
+ | **OpenAI** | Automatic Caching | 50% (cached tokens) | Moderate | 5-10 min |
573
+ | **Google** | Context Caching | 90% (cached reads) | Significant | Variable |
574
+
575
+ **Anthropic Implementation:**
576
+ ```typescript
577
+ const message = await anthropic.messages.create({
578
+ model: 'claude-3-5-sonnet-20241022',
579
+ max_tokens: 1024,
580
+ system: [
581
+ {
582
+ type: 'text',
583
+ text: 'You are an AI assistant analyzing code repositories.',
584
+ },
585
+ {
586
+ type: 'text',
587
+ text: largeCodebaseContext, // This gets cached
588
+ cache_control: { type: 'ephemeral' }
589
+ }
590
+ ],
591
+ messages: [
592
+ { role: 'user', content: 'Summarize this file: ...' }
593
+ ]
594
+ });
595
+
596
+ // Usage breakdown:
597
+ // - Input tokens: 1000
598
+ // - Cache creation: 10000 (write at 1.25x cost)
599
+ // - Cache read: 10000 (read at 0.1x cost = 90% savings)
600
+ // - Output tokens: 500
601
+ ```
602
+
603
+ **OpenAI Implementation:**
604
+ ```typescript
605
+ const completion = await openai.chat.completions.create({
606
+ model: 'gpt-4o',
607
+ messages: [
608
+ {
609
+ role: 'system',
610
+ content: largeCodebaseContext // Automatically cached once the prefix reaches 1024 tokens
611
+ },
612
+ {
613
+ role: 'user',
614
+ content: 'Summarize this file: ...'
615
+ }
616
+ ]
617
+ });
618
+
619
+ // OpenAI automatically caches prompt prefixes of 1024 tokens or more
620
+ // Cached input tokens: 50% discount
621
+ // No explicit cache control needed
622
+ ```
623
+
624
+ **Best Practices:**
625
+ 1. **Place static content first** - System prompts, codebase context, documentation
626
+ 2. **Put dynamic content last** - User queries, file-specific questions
627
+ 3. **Avoid caching tool definitions** - They change frequently, reducing cache hits
628
+ 4. **Strategic block control** - Exclude dynamic sections from cache scope
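These ordering rules can be captured in a small builder for Anthropic-style system blocks (the function and block shape here are illustrative). The cache breakpoint goes on the last static block, so the entire prefix before it is reusable; dynamic content stays in `messages`:

```typescript
type SystemBlock = {
  type: 'text';
  text: string;
  cache_control?: { type: 'ephemeral' };
};

// Static content first (cacheable prefix); only the last static block carries
// the cache breakpoint, marking everything up to it as reusable across requests.
function buildSystemBlocks(staticBlocks: string[]): SystemBlock[] {
  return staticBlocks.map((text, i) => ({
    type: 'text' as const,
    text,
    ...(i === staticBlocks.length - 1
      ? { cache_control: { type: 'ephemeral' as const } }
      : {}),
  }));
}
```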
629
+
630
+ **Recent Research (2026):**
631
+ - Strategic prompt cache block control provides more consistent benefits than naive full-context caching
632
+ - Reduces API costs by 45-80%
633
+ - Improves time to first token by 13-31%
634
+ - Chat apps with stable system prompts can cache 70%+ of input tokens
635
+
636
+ **Sources:**
637
+ - [Prompt Caching Infrastructure Guide](https://introl.com/blog/prompt-caching-infrastructure-llm-cost-latency-reduction-guide-2025)
638
+ - [Don't Break the Cache Research](https://arxiv.org/abs/2601.06007)
639
+
640
+ ### 3.2 Semantic Caching
641
+
642
+ Semantic caching eliminates API calls entirely for semantically similar queries.
643
+
644
+ **Architecture:**
645
+ ```typescript
646
+ interface SemanticCache {
647
+ // L1: Exact match cache (in-memory)
648
+ exactCache: Map<string, CachedResponse>;
649
+
650
+ // L2: Semantic match cache (vector DB)
651
+ vectorStore: VectorStore;
652
+ similarityThreshold: number; // e.g., 0.85
653
+ }
654
+
655
+ interface CachedResponse {
656
+ query: string;
657
+ embedding: number[];
658
+ response: string;
659
+ timestamp: number;
660
+ ttl: number;
661
+ metadata: {
662
+ provider: string;
663
+ model: string;
664
+ tokens: number;
665
+ cost: number;
666
+ };
667
+ }
668
+
669
+ async function getOrGenerateResponse(
670
+ query: string,
671
+ cache: SemanticCache,
672
+ generateFn: () => Promise<string>
673
+ ): Promise<{ response: string; cacheHit: boolean }> {
674
+ // L1: Check exact match
675
+ const exactMatch = cache.exactCache.get(query);
676
+ if (exactMatch && !isExpired(exactMatch)) {
677
+ return { response: exactMatch.response, cacheHit: true };
678
+ }
679
+
680
+ // L2: Check semantic similarity
681
+ const queryEmbedding = await generateEmbedding(query);
682
+ const similarQueries = await cache.vectorStore.similaritySearch(
683
+ queryEmbedding,
684
+ cache.similarityThreshold
685
+ );
686
+
687
+ if (similarQueries.length > 0) {
688
+ const bestMatch = similarQueries[0];
689
+ console.log(
690
+ `Semantic cache hit (${bestMatch.similarity.toFixed(3)}) for: "${query}"`
691
+ );
692
+ return { response: bestMatch.response, cacheHit: true };
693
+ }
694
+
695
+ // Cache miss - generate new response
696
+ const response = await generateFn();
697
+
698
+ // Store in both caches
699
+ const cached: CachedResponse = {
700
+ query,
701
+ embedding: queryEmbedding,
702
+ response,
703
+ timestamp: Date.now(),
704
+ ttl: 3600000, // 1 hour
705
+ metadata: {
706
+ provider: 'anthropic',
707
+ model: 'claude-3-5-sonnet',
708
+ tokens: response.length / 4, // Rough estimate
709
+ cost: calculateCost(response)
710
+ }
711
+ };
712
+
713
+ cache.exactCache.set(query, cached);
714
+ await cache.vectorStore.upsert(queryEmbedding, cached);
715
+
716
+ return { response, cacheHit: false };
717
+ }
718
+ ```
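The snippet above leans on an `isExpired` helper it never defines; a minimal TTL check consistent with the `CachedResponse` fields:

```typescript
interface CacheEntry {
  timestamp: number; // epoch ms when the entry was stored
  ttl: number;       // lifetime in ms
}

// An entry expires once its age exceeds its TTL.
function isExpired(entry: CacheEntry, now: number = Date.now()): boolean {
  return now - entry.timestamp > entry.ttl;
}
```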
719
+
720
+ **Multi-Tier Performance:**
721
+ - **Microsecond-level** responses for exact matches (L1)
722
+ - **Sub-second** responses for semantic matches (L2)
723
+ - **Full API latency** only for novel queries
724
+
725
+ **Production Impact:**
726
+ - 31% of LLM queries exhibit semantic similarity to previous requests
727
+ - Chat apps with repetitive questions: 30% cache hit rate
728
+ - Combined with prompt caching: 80%+ total savings
729
+
730
+ **Implementation Options:**
731
+ - **GPTCache** - Open-source semantic cache with LangChain integration
732
+ - **Redis** - Semantic caching with vector similarity search
733
+ - **Qdrant** - Vector database optimized for semantic search
734
+ - **LiteLLM** - Built-in caching support
735
+
736
+ **Sources:**
737
+ - [Redis Semantic Caching](https://redis.io/blog/what-is-semantic-caching/)
738
+ - [GPTCache GitHub](https://github.com/zilliztech/GPTCache)
739
+ - [LLM Caching Implementation Guide](https://reintech.io/blog/how-to-implement-llm-caching-strategies-for-faster-response-times)
740
+
741
+ ### 3.3 Hybrid Caching Strategy
742
+
743
+ Combine prompt caching and semantic caching for maximum savings.
744
+
745
+ **Example Configuration:**
746
+ ```typescript
747
+ const hybridCacheConfig = {
748
+ // Semantic cache for query deduplication
749
+ semanticCache: {
750
+ enabled: true,
751
+ provider: 'redis', // or 'qdrant', 'gptcache'
752
+ similarityThreshold: 0.85,
753
+ ttl: 3600000, // 1 hour
754
+ },
755
+
756
+ // Prompt cache for large context windows
757
+ promptCache: {
758
+ enabled: true,
759
+ strategy: 'provider-native', // Use Anthropic/OpenAI/Google native caching
760
+ systemPromptCaching: true,
761
+ codebaseContextCaching: true,
762
+ toolDefinitionCaching: false, // Avoid caching dynamic tools
763
+ },
764
+
765
+ // Cache invalidation
766
+ invalidation: {
767
+ onCodebaseChange: true,
768
+ maxAge: 86400000, // 24 hours
769
+ }
770
+ };
771
+ ```
772
+
773
+ **Combined Savings Example:**
774
+ ```
775
+ Scenario: Code summarization service
776
+ - 1000 requests/day
777
+ - Average input: 15,000 tokens (10k context + 5k query)
778
+ - Average output: 500 tokens
779
+
780
+ Without caching:
781
+ - Cost: 1000 * ((15000/1M * $3) + (500/1M * $15)) = $52.50/day
782
+
783
+ With prompt caching (70% of input tokens cached):
784
+ - Cost: 1000 * ((4500/1M * $3) + (10500/1M * $0.30) + (500/1M * $15)) = $24/day
785
+ - Savings: 54%
786
+
787
+ With semantic caching (30% query deduplication):
788
+ - Actual LLM calls: 700
789
+ - Cost: 700 * calculation above = $16.80/day
790
+ - Additional savings: 30%
791
+
792
+ Total savings: 68%
793
+
794
+ With both + optimized routing:
795
+ - Use cheaper models for 50% of queries
796
+ - Final cost: ~$8/day
797
+ - Total savings: 85%
798
+ ```
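The arithmetic above is easy to reproduce; a quick check, with price constants assumed to match the example's rates ($3/M input, $0.30/M cached reads, $15/M output):

```typescript
// Assumed per-token prices (Sonnet-class): $3/M input, $0.30/M cached, $15/M output.
const INPUT = 3 / 1e6;
const CACHED = 0.3 / 1e6;
const OUTPUT = 15 / 1e6;

// Daily cost for a given request volume and per-request token mix.
function dailyCost(
  requests: number,
  freshInputTokens: number,
  cachedInputTokens: number,
  outputTokens: number
): number {
  return (
    requests *
    (freshInputTokens * INPUT + cachedInputTokens * CACHED + outputTokens * OUTPUT)
  );
}

const baseline = dailyCost(1000, 15000, 0, 500);        // $52.50/day
const promptCached = dailyCost(1000, 4500, 10500, 500); // ~$24.15/day
const bothCaches = dailyCost(700, 4500, 10500, 500);    // ~$16.90/day
```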
799
+
800
+ **Sources:**
801
+ - [Ultimate Guide to LLM Caching](https://latitude-blog.ghost.io/blog/ultimate-guide-to-llm-caching-for-low-latency-ai/)
802
+ - [Effective LLM Caching](https://www.helicone.ai/blog/effective-llm-caching)
803
+
804
+ ---
805
+
806
+ ## 4. Provider Abstraction Layer
807
+
808
+ ### 4.1 Design Principles
809
+
810
+ Modern provider abstraction layers follow these principles:
811
+
812
+ 1. **Unified Interface** - Single API for all providers
813
+ 2. **Type Safety** - Strong TypeScript types prevent runtime errors
814
+ 3. **Provider-Specific Features** - Gracefully handle unique capabilities
815
+ 4. **Streaming Support** - First-class support for streaming responses
816
+ 5. **Tool/Function Calling** - Normalized tool use across providers
817
+ 6. **Observability** - Built-in logging, metrics, and tracing
818
+
819
+ **Core Interface:**
820
+ ```typescript
821
+ interface LLMProvider {
822
+ name: string;
823
+
824
+ // Chat completion
825
+ complete(params: CompletionParams): Promise<CompletionResponse>;
826
+
827
+ // Streaming
828
+ stream(params: CompletionParams): AsyncIterable<StreamChunk>;
829
+
830
+ // Embeddings
831
+ embed(text: string): Promise<number[]>;
832
+
833
+ // Health check
834
+ healthCheck(): Promise<HealthStatus>;
835
+
836
+ // Cost estimation
837
+ estimateCost(params: CompletionParams): CostEstimate;
838
+ }
839
+
840
+ interface CompletionParams {
841
+ model: string;
842
+ messages: Message[];
843
+ temperature?: number;
844
+ maxTokens?: number;
845
+ tools?: Tool[];
846
+
847
+ // Provider-specific options
848
+ providerOptions?: {
849
+ anthropic?: {
850
+ cacheControl?: CacheControl[];
851
+ };
852
+ openai?: {
853
+ responseFormat?: { type: 'json_object' };
854
+ };
855
+ google?: {
856
+ safetySettings?: SafetySetting[];
857
+ };
858
+ };
859
+ }
860
+
861
+ interface CompletionResponse {
862
+ content: string;
863
+ role: 'assistant';
864
+ finishReason: 'stop' | 'length' | 'tool_calls' | 'content_filter';
865
+ usage: {
866
+ inputTokens: number;
867
+ outputTokens: number;
868
+ cachedTokens?: number;
869
+ };
870
+ metadata: {
871
+ provider: string;
872
+ model: string;
873
+ latencyMs: number;
874
+ cost: number;
875
+ };
876
+ toolCalls?: ToolCall[];
877
+ }
878
+ ```
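One concrete slice of the interface: `estimateCost` can be implemented against a static price table. This sketch uses a simplified signature and illustrative per-million-token prices; real tables should come from provider pricing pages:

```typescript
interface CostEstimate {
  inputCost: number;
  estimatedOutputCost: number;
  total: number;
}

// Illustrative $/million-token prices, not authoritative.
const PRICES: Record<string, { input: number; output: number }> = {
  'claude-3-5-sonnet-20241022': { input: 3, output: 15 },
  'gpt-4o': { input: 2.5, output: 10 },
};

// Worst-case estimate: all of maxOutputTokens are generated.
function estimateCost(
  model: string,
  inputTokens: number,
  maxOutputTokens: number
): CostEstimate {
  const price = PRICES[model];
  if (!price) throw new Error(`No pricing data for ${model}`);
  const inputCost = (inputTokens / 1e6) * price.input;
  const estimatedOutputCost = (maxOutputTokens / 1e6) * price.output;
  return { inputCost, estimatedOutputCost, total: inputCost + estimatedOutputCost };
}
```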
879
+
880
+ **Sources:**
881
+ - [Multi-Provider LLM Orchestration 2026](https://dev.to/ash_dubai/multi-provider-llm-orchestration-in-production-a-2026-guide-1g10)
882
+ - [TypeScript & LLMs: Lessons from Production](https://johnchildseddy.medium.com/typescript-llms-lessons-learned-from-9-months-in-production-4910485e3272)
883
+
884
+ ### 4.2 Production Implementations
885
+
886
+ **Vercel AI SDK:**
887
+ ```typescript
888
+ import { generateText } from 'ai';
889
+ import { anthropic } from '@ai-sdk/anthropic';
890
+ import { openai } from '@ai-sdk/openai';
891
+ import { google } from '@ai-sdk/google';
892
+
893
+ // Unified interface across providers
894
+ const result = await generateText({
895
+ model: anthropic('claude-3-5-sonnet-20241022'),
896
+ messages: [
897
+ { role: 'user', content: 'Summarize this code...' }
898
+ ],
899
+ // Automatic fallback via AI Gateway (illustrative; the exact config shape depends on your gateway setup)
900
+ providerOptions: {
901
+ fallbacks: [
902
+ { model: openai('gpt-4o') },
903
+ { model: google('gemini-2.5-flash') }
904
+ ]
905
+ }
906
+ });
907
+ ```
908
+
909
+ **LiteLLM:**
910
+ ```typescript
911
+ // Illustrative: LiteLLM is Python-first; this mirrors its API shape in TypeScript
+ import litellm from 'litellm';
912
+
913
+ // Unified OpenAI-compatible interface
914
+ const response = await litellm.completion({
915
+ model: 'claude-3-5-sonnet-20241022',
916
+ messages: [{ role: 'user', content: 'Hello' }],
917
+
918
+ // Router config for fallbacks
919
+ fallbacks: ['gpt-4o', 'gemini-2.5-flash'],
920
+ num_retries: 3,
921
+ timeout: 30
922
+ });
923
+
924
+ // Automatic cost tracking
925
+ console.log(response._hidden_params.response_cost);
926
+ ```
927
+
928
+ **Custom TypeScript Abstraction:**
929
+ ```typescript
930
+ class UnifiedLLMClient {
931
+ private providers: Map<string, LLMProvider>;
932
+ private router: ProviderRouter;
933
+ private cache: SemanticCache;
934
+ private metrics: MetricsCollector;
935
+
936
+ constructor(config: UnifiedLLMConfig) {
937
+ this.providers = new Map([
938
+ ['anthropic', new AnthropicProvider(config.anthropic)],
939
+ ['openai', new OpenAIProvider(config.openai)],
940
+ ['google', new GoogleProvider(config.google)]
941
+ ]);
942
+
943
+ this.router = new ProviderRouter(config.routing);
944
+ this.cache = new SemanticCache(config.caching);
945
+ this.metrics = new MetricsCollector();
946
+ }
947
+
948
+ async complete(
949
+ params: CompletionParams,
950
+ options?: {
951
+ preferredProvider?: string;
952
+ enableFallback?: boolean;
953
+ enableCaching?: boolean;
954
+ }
955
+ ): Promise<CompletionResponse> {
956
+ const startTime = Date.now();
957
+
958
+ // Check semantic cache first
959
+ if (options?.enableCaching) {
960
+ const cached = await this.cache.get(params.messages);
961
+ if (cached) {
962
+ this.metrics.recordCacheHit();
963
+ return cached;
964
+ }
965
+ }
966
+
967
+ // Select provider via routing logic
968
+ const provider = options?.preferredProvider
969
+ ? this.providers.get(options.preferredProvider)!
970
+ : await this.router.selectProvider(params);
971
+
972
+ try {
973
+ // Execute with fallback
974
+ const response = await this.executeWithFallback(
975
+ provider,
976
+ params,
977
+ options?.enableFallback ?? true
978
+ );
979
+
980
+ // Record metrics
981
+ const latency = Date.now() - startTime;
982
+ this.metrics.record({
983
+ provider: response.metadata.provider,
984
+ latency,
985
+ tokens: response.usage.inputTokens + response.usage.outputTokens,
986
+ cost: response.metadata.cost,
987
+ cached: false
988
+ });
989
+
990
+ // Update cache
991
+ if (options?.enableCaching) {
992
+ await this.cache.set(params.messages, response);
993
+ }
994
+
995
+ return response;
996
+
997
+ } catch (error) {
998
+ this.metrics.recordError(provider.name, error);
999
+ throw error;
1000
+ }
1001
+ }
1002
+
1003
+ private async executeWithFallback(
1004
+ primary: LLMProvider,
1005
+ params: CompletionParams,
1006
+ enableFallback: boolean
1007
+ ): Promise<CompletionResponse> {
1008
+ // Implement fallback chain logic
+ // (See section 2.3 for full implementation)
+ throw new Error('executeWithFallback: see section 2.3');
+ }
1011
+ }
1012
+ ```
1013
+
1014
+ **Key Libraries (2026):**
1015
+ - **Vercel AI SDK** - Production-ready, TypeScript-first, excellent DX
1016
+ - **LiteLLM** - Python-focused, 100+ provider support, proxy mode
1017
+ - **AnyLLM** - TypeScript abstraction layer for seamless provider switching
1018
+ - **ModelFusion** - TypeScript AI abstraction whose ideas were merged into the Vercel AI SDK
1019
+
1020
+ **Sources:**
1021
+ - [Vercel AI SDK Docs](https://ai-sdk.dev/)
1022
+ - [AnyLLM GitHub](https://github.com/fkesheh/any-llm)
1023
+ - [ModelFusion GitHub](https://github.com/vercel/modelfusion)
1024
+
1025
+ ### 4.3 Handling Provider-Specific Features
1026
+
1027
+ Not all providers support all features - graceful degradation is essential.
1028
+
1029
+ **Feature Matrix (2026):**
1030
+
1031
+ | Feature | Anthropic | OpenAI | Google | Handling Strategy |
1032
+ |---------|-----------|--------|--------|-------------------|
1033
+ | Tool Calling | ✅ Native | ✅ Native | ✅ Native | Universal support |
1034
+ | JSON Mode | ✅ Native | ✅ Native | ✅ Native | Universal support |
1035
+ | Vision | ✅ Claude 3+ | ✅ GPT-4o | ✅ Gemini | Universal support |
1036
+ | Prompt Caching | ✅ Explicit | ✅ Automatic | ✅ Context cache | Abstract with unified API |
1037
+ | Streaming | ✅ SSE | ✅ SSE | ✅ SSE | Universal support |
1038
+ | System Messages | ✅ Separate | ✅ In messages | ✅ Separate | Normalize in abstraction |
1039
+ | Max Output Tokens | ✅ 8192 | ✅ 16384 | ✅ 8192 | Enforce limits per provider |
1040
+
1041
+ **Graceful Degradation Pattern:**
1042
+ ```typescript
1043
+ async function executeWithFeatureDetection(
1044
+ provider: LLMProvider,
1045
+ params: CompletionParams
1046
+ ): Promise<CompletionResponse> {
1047
+ const capabilities = provider.getCapabilities();
1048
+
1049
+ // Adapt parameters to provider capabilities
1050
+ const adaptedParams = { ...params };
1051
+
1052
+ // Handle tool calling
1053
+ if (params.tools && !capabilities.toolCalling) {
1054
+ console.warn(
1055
+ `Provider ${provider.name} doesn't support tool calling, ` +
1056
+ `using function injection instead`
1057
+ );
1058
+ adaptedParams.messages = injectToolsAsContext(
1059
+ params.messages,
1060
+ params.tools
1061
+ );
1062
+ delete adaptedParams.tools;
1063
+ }
1064
+
1065
+ // Handle JSON mode
1066
+ if (params.responseFormat?.type === 'json_object') {
1067
+ if (capabilities.jsonMode) {
1068
+ // Use native JSON mode
1069
+ adaptedParams.providerOptions = {
1070
+ ...adaptedParams.providerOptions,
1071
+ responseFormat: { type: 'json_object' }
1072
+ };
1073
+ } else {
1074
+ // Fallback to prompt engineering
1075
+ console.warn(
1076
+ `Provider ${provider.name} doesn't support native JSON mode, ` +
1077
+ `using prompt-based approach`
1078
+ );
1079
+ adaptedParams.messages[0].content +=
1080
+ '\n\nIMPORTANT: Respond with valid JSON only.';
1081
+ }
1082
+ }
1083
+
1084
+ // Handle prompt caching
1085
+ if (params.enableCaching && !capabilities.promptCaching) {
1086
+ console.warn(
1087
+ `Provider ${provider.name} doesn't support prompt caching`
1088
+ );
1089
+ // Fall back to semantic caching layer
1090
+ }
1091
+
1092
+ return provider.complete(adaptedParams);
1093
+ }
1094
+ ```
1095
+
1096
+ **Sources:**
1097
+ - [Building Bridges to LLMs: Moving Beyond Over Abstraction](https://hatchworks.com/blog/gen-ai/llm-projects-production-abstraction/)
1098
+ - [AI SDK Provider Management](https://ai-sdk.dev/docs/ai-sdk-core/provider-management)
1099
+
1100
+ ---
1101
+
1102
+ ## 5. Monitoring and Observability
1103
+
1104
+ ### 5.1 Key Metrics to Track
1105
+
1106
+ **Essential Metrics:**
1107
+ ```typescript
1108
+ interface LLMMetrics {
1109
+ // Performance
1110
+ latency: {
1111
+ p50: number;
1112
+ p95: number;
1113
+ p99: number;
1114
+ };
1115
+
1116
+ // Cost
1117
+ cost: {
1118
+ total: number;
1119
+ perRequest: number;
1120
+ perProvider: Map<string, number>;
1121
+ perModel: Map<string, number>;
1122
+ perUser?: Map<string, number>;
1123
+ };
1124
+
1125
+ // Usage
1126
+ tokens: {
1127
+ input: number;
1128
+ output: number;
1129
+ cached: number;
1130
+ total: number;
1131
+ };
1132
+
1133
+ // Reliability
1134
+ reliability: {
1135
+ successRate: number;
1136
+ errorRate: number;
1137
+ retryRate: number;
1138
+ fallbackRate: number;
1139
+ };
1140
+
1141
+ // Caching
1142
+ cache: {
1143
+ hitRate: number;
1144
+ semanticHitRate: number;
1145
+ promptCacheHitRate: number;
1146
+ savings: number; // Dollar amount saved
1147
+ };
1148
+
1149
+ // Provider Health
1150
+ providerHealth: Map<string, {
1151
+ availability: number;
1152
+ avgLatency: number;
1153
+ errorRate: number;
1154
+ circuitBreakerState: string;
1155
+ }>;
1156
+ }
1157
+ ```
1158
+
1159
+ **Critical Metrics (2026):**
1160
+ 1. **Tokens per request** - Normalize usage patterns
1161
+ 2. **Cost per user/team/feature** - Attribution for chargeback
1162
+ 3. **Cache hit ratio** - Reveal spend savings potential
1163
+ 4. **Provider availability** - Track SLA compliance
1164
+ 5. **Error rate by provider** - Identify stability issues
1165
+ 6. **Latency percentiles** - User experience monitoring
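Cache hit ratio and the dollar savings it implies (metric 3 above) fall out of per-request records; a sketch with an assumed record shape:

```typescript
interface RequestRecord {
  cost: number;         // what this request actually cost
  cacheHit: boolean;    // true if served from cache
  uncachedCost: number; // what it would have cost without caching
}

// Hit rate plus the dollars the cache avoided spending.
function cacheMetrics(records: RequestRecord[]) {
  const hits = records.filter((r) => r.cacheHit);
  const savings = hits.reduce((sum, r) => sum + (r.uncachedCost - r.cost), 0);
  return {
    hitRate: records.length === 0 ? 0 : hits.length / records.length,
    savings,
  };
}
```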
1166
+
1167
+ **Sources:**
1168
+ - [Langfuse Token and Cost Tracking](https://langfuse.com/docs/observability/features/token-and-cost-tracking)
1169
+ - [Tracking LLM Token Usage](https://www.traceloop.com/blog/from-bills-to-budgets-how-to-track-llm-token-usage-and-cost-per-user)
1170
+
1171
+ ### 5.2 Observability Platforms
1172
+
1173
+ **Top Platforms (2026):**
1174
+
1175
+ **1. Langfuse (Open Source)**
1176
+ ```typescript
1177
+ import { Langfuse } from 'langfuse';
1178
+
1179
+ const langfuse = new Langfuse({
1180
+ publicKey: process.env.LANGFUSE_PUBLIC_KEY,
1181
+ secretKey: process.env.LANGFUSE_SECRET_KEY
1182
+ });
1183
+
1184
+ const trace = langfuse.trace({
1185
+ name: 'code-summarization',
1186
+ userId: 'user-123',
1187
+ metadata: { repository: 'mdcontext' }
1188
+ });
1189
+
1190
+ const generation = trace.generation({
1191
+ name: 'summarize-file',
1192
+ model: 'claude-3-5-sonnet-20241022',
1193
+ modelParameters: {
1194
+ temperature: 0.7,
1195
+ maxTokens: 1024
1196
+ },
1197
+ input: messages
1198
+ });
1199
+
1200
+ // Make LLM call
1201
+ const response = await anthropic.messages.create(...);
1202
+
1203
+ // Log result
1204
+ generation.end({
1205
+ output: response.content,
1206
+ usage: {
1207
+ promptTokens: response.usage.input_tokens,
1208
+ completionTokens: response.usage.output_tokens
1209
+ },
1210
+ metadata: {
1211
+ cost: calculateCost(response.usage),
1212
+ cached: response.usage.cache_read_input_tokens > 0
1213
+ }
1214
+ });
1215
+ ```
1216
+
1217
+ **Features:**
1218
+ - Automatic cost tracking for 100+ models
1219
+ - User/session-level attribution
1220
+ - Trace-level debugging
1221
+ - Open-source and self-hostable
1222
+
1223
+ **2. Datadog LLM Observability**
1224
+ ```typescript
1225
+ // Illustrative: Datadog's Node LLM Observability is shipped through the dd-trace SDK
+ import { datadogLLM } from '@datadog/llm-observability';
1226
+
1227
+ // Automatic tracing
1228
+ const traced = datadogLLM.trace(anthropic.messages.create);
1229
+
1230
+ const response = await traced({
1231
+ model: 'claude-3-5-sonnet-20241022',
1232
+ messages: [...]
1233
+ });
1234
+
1235
+ // Automatic metrics:
1236
+ // - llm.request.duration
1237
+ // - llm.request.tokens.input
1238
+ // - llm.request.tokens.output
1239
+ // - llm.request.cost
1240
+ ```
1241
+
1242
+ **Features:**
1243
+ - End-to-end AI agent tracing
1244
+ - Cloud cost integration
1245
+ - Anomaly detection
1246
+ - Enterprise-grade dashboards
1247
+
1248
+ **3. Traceloop/OpenLLMetry**
1249
+ ```typescript
1250
+ import { Traceloop } from '@traceloop/node-server-sdk';
1251
+
1252
+ Traceloop.initialize({
1253
+ apiKey: process.env.TRACELOOP_API_KEY,
1254
+ baseUrl: 'https://api.traceloop.com'
1255
+ });
1256
+
1257
+ // Automatic instrumentation via OpenTelemetry
1258
+ const response = await anthropic.messages.create({
1259
+ model: 'claude-3-5-sonnet-20241022',
1260
+ messages: [...],
1261
+ metadata: {
1262
+ user_id: 'user-123', // Automatic attribution
1263
+ feature: 'code-summarization'
1264
+ }
1265
+ });
1266
+ ```
1267
+
1268
+ **Features:**
1269
+ - OpenTelemetry-based (industry standard)
1270
+ - Automatic user attribution
1271
+ - Rich contextual data
1272
+ - Integration with existing observability stack
1273
+
1274
+ **4. Portkey**
1275
+ ```typescript
1276
+ import Portkey from 'portkey-ai';
1277
+
1278
+ const portkey = new Portkey({
1279
+ apiKey: process.env.PORTKEY_API_KEY,
1280
+ virtualKey: process.env.ANTHROPIC_VIRTUAL_KEY
1281
+ });
1282
+
1283
+ const response = await portkey.chat.completions.create({
1284
+ model: 'claude-3-5-sonnet-20241022',
1285
+ messages: [...]
1286
+ });
1287
+
1288
+ // Dashboard shows:
1289
+ // - Usage by model, query, token
1290
+ // - Cost tracking and forecasts
1291
+ // - Near-real-time monitoring
1292
+ // - Automatic intervention on limits
1293
+ ```
1294
+
1295
+ **Features:**
1296
+ - Finance/leadership dashboards
1297
+ - Usage breakdowns and trends
1298
+ - Spend forecasting
1299
+ - Automatic budget controls
1300
+
1301
+ **Sources:**
1302
+ - [Langfuse Documentation](https://langfuse.com/docs/observability/features/token-and-cost-tracking)
1303
+ - [Datadog LLM Observability](https://www.datadoghq.com/product/llm-observability/)
1304
+ - [Traceloop Blog](https://www.traceloop.com/blog/granular-llm-monitoring-for-tracking-token-usage-and-latency-per-user-and-feature)
1305
+ - [Portkey Token Tracking](https://portkey.ai/blog/tracking-llm-token-usage-across-providers-teams-and-workloads/)
1306
+
1307
+ ### 5.3 Monitoring Implementation
1308
+
1309
+ **Complete Monitoring Setup:**
1310
+ ```typescript
1311
+ import { Langfuse } from 'langfuse';
1312
+ import { CloudWatch } from '@aws-sdk/client-cloudwatch';
1313
+
1314
+ class LLMMonitor {
1315
+ private langfuse: Langfuse;
1316
+ private cloudwatch: CloudWatch;
1317
+ private metrics: Map<string, MetricBuffer>;
1318
+
1319
+ constructor() {
1320
+ this.langfuse = new Langfuse({
1321
+ publicKey: process.env.LANGFUSE_PUBLIC_KEY,
1322
+ secretKey: process.env.LANGFUSE_SECRET_KEY
1323
+ });
1324
+
1325
+ this.cloudwatch = new CloudWatch({ region: 'us-east-1' });
1326
+ this.metrics = new Map();
1327
+ }
1328
+
1329
+ async trackRequest(
1330
+ provider: string,
1331
+ model: string,
1332
+ params: CompletionParams,
1333
+ response: CompletionResponse,
1334
+ metadata: {
1335
+ userId?: string;
1336
+ feature?: string;
1337
+ cacheHit?: boolean;
1338
+ }
1339
+ ): Promise<void> {
1340
+ const trace = this.langfuse.trace({
1341
+ name: metadata.feature || 'llm-request',
1342
+ userId: metadata.userId,
1343
+ metadata: {
1344
+ provider,
1345
+ model,
1346
+ cacheHit: metadata.cacheHit || false
1347
+ }
1348
+ });
1349
+
1350
+ const generation = trace.generation({
1351
+ name: 'completion',
1352
+ model,
1353
+ input: params.messages,
1354
+ output: response.content,
1355
+ usage: {
1356
+ promptTokens: response.usage.inputTokens,
1357
+ completionTokens: response.usage.outputTokens,
1358
+ totalTokens:
1359
+ response.usage.inputTokens + response.usage.outputTokens
1360
+ },
1361
+ metadata: {
1362
+ latencyMs: response.metadata.latencyMs,
1363
+ cost: response.metadata.cost,
1364
+ cachedTokens: response.usage.cachedTokens || 0,
1365
+ finishReason: response.finishReason
1366
+ }
1367
+ });
1368
+
1369
+ generation.end();
1370
+
1371
+ // Send metrics to CloudWatch
1372
+ await this.cloudwatch.putMetricData({
1373
+ Namespace: 'LLM/Production',
1374
+ MetricData: [
1375
+ {
1376
+ MetricName: 'Latency',
1377
+ Value: response.metadata.latencyMs,
1378
+ Unit: 'Milliseconds',
1379
+ Dimensions: [
1380
+ { Name: 'Provider', Value: provider },
1381
+ { Name: 'Model', Value: model }
1382
+ ]
1383
+ },
1384
+ {
1385
+ MetricName: 'Cost',
1386
+ Value: response.metadata.cost,
1387
+ Unit: 'None',
1388
+ Dimensions: [
1389
+ { Name: 'Provider', Value: provider },
1390
+ { Name: 'Feature', Value: metadata.feature || 'unknown' }
1391
+ ]
1392
+ },
1393
+ {
1394
+ MetricName: 'TokensUsed',
1395
+ Value: response.usage.inputTokens + response.usage.outputTokens,
1396
+ Unit: 'Count',
1397
+ Dimensions: [
1398
+ { Name: 'Provider', Value: provider }
1399
+ ]
1400
+ },
1401
+ {
1402
+ MetricName: 'CacheHit',
1403
+ Value: metadata.cacheHit ? 1 : 0,
1404
+ Unit: 'Count'
1405
+ }
1406
+ ]
1407
+ });
1408
+ }
1409
+
1410
+ async trackError(
1411
+ provider: string,
1412
+ model: string,
1413
+ error: any,
1414
+ metadata: {
1415
+ retryAttempt?: number;
1416
+ fallbackTriggered?: boolean;
1417
+ }
1418
+ ): Promise<void> {
1419
+ // Log to Langfuse
1420
+ this.langfuse.trace({
1421
+ name: 'llm-error',
1422
+ metadata: {
1423
+ provider,
1424
+ model,
1425
+ error: error.message,
1426
+ statusCode: error.status,
1427
+ retryAttempt: metadata.retryAttempt,
1428
+ fallbackTriggered: metadata.fallbackTriggered
1429
+ }
1430
+ });
1431
+
1432
+ // Send error metrics
1433
+ await this.cloudwatch.putMetricData({
1434
+ Namespace: 'LLM/Production',
1435
+ MetricData: [
1436
+ {
1437
+ MetricName: 'Errors',
1438
+ Value: 1,
1439
+ Unit: 'Count',
1440
+ Dimensions: [
1441
+ { Name: 'Provider', Value: provider },
1442
+ { Name: 'ErrorType', Value: error.status || 'unknown' }
1443
+ ]
1444
+ }
1445
+ ]
1446
+ });
1447
+ }
1448
+
1449
+ async getProviderHealth(): Promise<Map<string, ProviderHealth>> {
1450
+ // Query CloudWatch metrics
1451
+ const endTime = new Date();
1452
+ const startTime = new Date(endTime.getTime() - 3600000); // Last hour
1453
+
1454
+ const healthMap = new Map<string, ProviderHealth>();
1455
+
1456
+ for (const provider of ['anthropic', 'openai', 'google']) {
1457
+ const errorRate = await this.getErrorRate(provider, startTime, endTime);
1458
+ const avgLatency = await this.getAvgLatency(provider, startTime, endTime);
1459
+
1460
+ healthMap.set(provider, {
1461
+ availability: 1 - errorRate,
1462
+ avgLatency,
1463
+ errorRate,
1464
+ status: errorRate > 0.05 ? 'degraded' : 'healthy'
1465
+ });
1466
+ }
1467
+
1468
+ return healthMap;
1469
+ }
1470
+ }
1471
+ ```
1472
+
1473
+ **Dashboard Recommendations:**
1474
+ 1. **Real-time costs** by provider, model, user, feature
1475
2. **Latency percentiles** (p50, p95, p99) per provider
3. **Cache hit rates** for cost optimization tracking
4. **Error rates and circuit breaker states** for reliability
5. **Token usage trends** for capacity planning
6. **Provider health scores** for SLA monitoring

**Sources:**
- [LLM Observability Tools 2026](https://lakefs.io/blog/llm-observability-tools/)
- [Best LLM Observability Tools](https://www.firecrawl.dev/blog/best-llm-observability-tools)

---

## 6. Configuration Best Practices

### 6.1 Environment-Based Configuration

**Production Configuration Pattern:**
```typescript
interface LLMConfig {
  providers: {
    anthropic: {
      apiKey: string;
      baseURL?: string;
      timeout: number;
      maxRetries: number;
    };
    openai: {
      apiKey: string;
      organization?: string;
      timeout: number;
      maxRetries: number;
    };
    google: {
      apiKey: string;
      timeout: number;
      maxRetries: number;
    };
  };

  routing: {
    strategy: 'cost' | 'quality' | 'latency' | 'custom';
    fallbackEnabled: boolean;
    circuitBreaker: {
      enabled: boolean;
      failureThreshold: number;
      successThreshold: number;
      timeout: number;
    };
  };

  caching: {
    semantic: {
      enabled: boolean;
      provider: 'redis' | 'qdrant' | 'memory';
      similarityThreshold: number;
      ttl: number;
    };
    prompt: {
      enabled: boolean;
      strategy: 'provider-native' | 'custom';
    };
  };

  observability: {
    enabled: boolean;
    provider: 'langfuse' | 'datadog' | 'custom';
    sampleRate: number; // 0.0 - 1.0
    logLevel: 'debug' | 'info' | 'warn' | 'error';
  };

  limits: {
    maxTokensPerRequest: number;
    maxCostPerRequest: number;
    rateLimitPerUser?: number;
  };
}
```

**Environment-Specific Configs:**
```typescript
// config/production.ts
export const productionConfig: LLMConfig = {
  providers: {
    anthropic: {
      apiKey: process.env.ANTHROPIC_API_KEY!,
      timeout: 30000,
      maxRetries: 3
    },
    openai: {
      apiKey: process.env.OPENAI_API_KEY!,
      organization: process.env.OPENAI_ORG,
      timeout: 30000,
      maxRetries: 3
    },
    google: {
      apiKey: process.env.GOOGLE_API_KEY!,
      timeout: 30000,
      maxRetries: 3
    }
  },

  routing: {
    strategy: 'cost',
    fallbackEnabled: true,
    circuitBreaker: {
      enabled: true,
      failureThreshold: 10,
      successThreshold: 3,
      timeout: 60000
    }
  },

  caching: {
    semantic: {
      enabled: true,
      provider: 'redis',
      similarityThreshold: 0.85,
      ttl: 3600000 // 1 hour
    },
    prompt: {
      enabled: true,
      strategy: 'provider-native'
    }
  },

  observability: {
    enabled: true,
    provider: 'langfuse',
    sampleRate: 1.0, // Log everything in prod
    logLevel: 'info'
  },

  limits: {
    maxTokensPerRequest: 100000,
    maxCostPerRequest: 1.00, // $1 max per request
    rateLimitPerUser: 100 // per hour
  }
};

// config/development.ts
export const developmentConfig: LLMConfig = {
  ...productionConfig,

  routing: {
    ...productionConfig.routing,
    strategy: 'latency' // Prefer fast responses in dev
  },

  caching: {
    semantic: {
      enabled: false, // Disable caching in dev
      provider: 'memory',
      similarityThreshold: 0.85,
      ttl: 600000
    },
    prompt: {
      enabled: false,
      strategy: 'provider-native' // required by LLMConfig even when disabled
    }
  },

  observability: {
    enabled: true,
    provider: 'langfuse',
    sampleRate: 0.1, // Sample 10% in dev
    logLevel: 'debug'
  }
};
```
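
In practice, the environment-specific configs above are chosen once at startup based on `NODE_ENV`. A minimal sketch (the `loadConfig` helper and its stand-in config objects are illustrative, not part of mdcontext):

```typescript
// Hypothetical startup helper: select the config matching NODE_ENV,
// falling back to development for unset or unknown values.
type Env = 'production' | 'development';

// Stand-ins for the productionConfig/developmentConfig defined above.
const configs: Record<Env, { label: Env }> = {
  production: { label: 'production' },
  development: { label: 'development' },
};

function loadConfig(env = process.env.NODE_ENV): { label: Env } {
  // Unknown environments deliberately get the safer development config.
  return env === 'production' ? configs.production : configs.development;
}
```

Defaulting unknown environments to development avoids accidentally running with production rate limits and sampling in a staging box.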

### 6.2 Feature Flags for Gradual Rollout

```typescript
interface FeatureFlags {
  enableSemanticCache: boolean;
  enablePromptCache: boolean;
  enableMultiProvider: boolean;
  enableCircuitBreaker: boolean;
  newProviders: string[]; // e.g., ['deepseek', 'mistral']
  costOptimizationLevel: 'conservative' | 'balanced' | 'aggressive';
}

// Use LaunchDarkly, Flagsmith, or similar
const flags = await featureFlagClient.getFlags('llm-system');

if (flags.enableSemanticCache) {
  // Enable semantic caching
}

if (flags.newProviders.includes('deepseek')) {
  // Add DeepSeek to provider chain
  providerChain.push({
    name: 'deepseek',
    priority: 4,
    models: {
      cheap: 'deepseek-v3',
      standard: 'deepseek-r1',
      premium: 'deepseek-r1'
    }
  });
}
```
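
Boolean flags gate features on and off; for a *gradual* rollout, the usual trick is to bucket users deterministically into 0–99 and enable the feature below a rollout percentage. A sketch, assuming a stable user ID is available (the hash and helper names are illustrative):

```typescript
// Deterministically map a user ID to a bucket in [0, 99] so the same
// user always gets the same decision as the rollout percentage grows.
function bucketFor(userId: string): number {
  let hash = 0;
  for (const ch of userId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple 31-based rolling hash
  }
  return hash % 100;
}

// Enable the feature for roughly `rolloutPercent`% of users.
function isEnabled(userId: string, rolloutPercent: number): boolean {
  return bucketFor(userId) < rolloutPercent;
}
```

Raising `rolloutPercent` from 5 to 50 only *adds* users to the enabled set; nobody flips back and forth between runs.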

### 6.3 Model Selection Matrix

**Production Model Selection Guide:**

| Use Case | Recommended Model | Provider | Cost per 1M tokens (input/output) | Notes |
|----------|------------------|----------|-----------------------------------|-------|
| **Simple Classification** | gpt-4o-mini | OpenAI | $0.15 / $0.60 | Fastest, cheapest |
| **Standard Q&A** | gemini-2.5-flash | Google | $0.30 / $2.50 | Best value |
| **Code Generation** | claude-3-5-sonnet | Anthropic | $3.00 / $15.00 | Industry leading |
| **Complex Reasoning** | claude-opus-4.5 | Anthropic | $15.00 / $75.00 | Best quality |
| **Cost-Optimized** | deepseek-v3 | DeepSeek | $0.27 / $1.10 | 95% cheaper |
| **Large Context** | gemini-2.5-pro | Google | $1.25 / $10.00 | 1M-token context |
| **Local Deployment** | qwen-2.5-coder-32b | Local | $0 | Self-hosted |

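The matrix above can be encoded as a simple lookup table. A sketch (the use-case keys and return shape are assumptions for illustration, not an mdcontext API):

```typescript
// The selection matrix as data: one recommended provider/model per use case.
type UseCase =
  | 'simple-classification'
  | 'standard-qa'
  | 'code-generation'
  | 'complex-reasoning'
  | 'cost-optimized'
  | 'large-context'
  | 'local';

const MODEL_MATRIX: Record<UseCase, { provider: string; model: string }> = {
  'simple-classification': { provider: 'openai', model: 'gpt-4o-mini' },
  'standard-qa': { provider: 'google', model: 'gemini-2.5-flash' },
  'code-generation': { provider: 'anthropic', model: 'claude-3-5-sonnet' },
  'complex-reasoning': { provider: 'anthropic', model: 'claude-opus-4.5' },
  'cost-optimized': { provider: 'deepseek', model: 'deepseek-v3' },
  'large-context': { provider: 'google', model: 'gemini-2.5-pro' },
  'local': { provider: 'local', model: 'qwen-2.5-coder-32b' },
};

function modelFor(useCase: UseCase) {
  return MODEL_MATRIX[useCase];
}
```

Keeping the matrix as data (rather than branching logic) makes it trivial to update as pricing shifts.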
**Sources:**
- [LLM Pricing 2026](https://research.aimultiple.com/llm-pricing/)
- [Alternative Providers Research](./alternative-providers-2026.md)

---

## 7. Implementation Roadmap for mdcontext

### 7.1 Phase 1: Basic Multi-Provider Support (Week 1)

**Goals:**
- Add support for Anthropic, OpenAI, Google providers
- Implement basic fallback chain
- Add configuration system

**Tasks:**
1. Create provider abstraction layer
2. Implement unified interface
3. Add environment-based configuration
4. Basic error handling with retries

**Code Changes:**
```typescript
// src/llm/providers/base.ts
export interface LLMProvider {
  complete(params: CompletionParams): Promise<CompletionResponse>;
  healthCheck(): Promise<boolean>;
}

// src/llm/providers/anthropic.ts
export class AnthropicProvider implements LLMProvider {
  async complete(params: CompletionParams): Promise<CompletionResponse> {
    // Implementation
  }

  async healthCheck(): Promise<boolean> {
    // Implementation
  }
}

// src/llm/router.ts
export class ProviderRouter {
  async complete(params: CompletionParams): Promise<CompletionResponse> {
    // Try primary provider with fallback
  }
}
```
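
The `ProviderRouter` stub reduces to a try-in-priority-order loop. A minimal self-contained sketch of that fallback logic (the types and error handling are simplified assumptions, not the final mdcontext implementation):

```typescript
// Simplified types for the sketch.
interface CompletionParams { messages: { role: string; content: string }[] }
interface CompletionResponse { content: string; provider: string }

interface LLMProvider {
  name: string;
  complete(params: CompletionParams): Promise<CompletionResponse>;
}

class ProviderRouter {
  constructor(private providers: LLMProvider[]) {}

  async complete(params: CompletionParams): Promise<CompletionResponse> {
    let lastError: unknown;
    // Try each provider in priority order; first success wins.
    for (const provider of this.providers) {
      try {
        return await provider.complete(params);
      } catch (err) {
        lastError = err; // remember the failure and fall through to the next
      }
    }
    throw new Error(`All providers failed: ${String(lastError)}`);
  }
}
```

Circuit breakers and backoff (Phase 3) slot into the loop body later: skip providers whose circuit is open, and retry before falling through.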

### 7.2 Phase 2: Cost Optimization (Week 2)

**Goals:**
- Implement prompt caching
- Add cost-based routing
- Track token usage and costs

**Tasks:**
1. Enable Anthropic prompt caching for codebase context
2. Implement cost estimation per provider
3. Add Langfuse for cost tracking
4. Create cost optimization routing logic

**Expected Savings:**
- 50-70% cost reduction from prompt caching
- 20-30% additional from smart routing

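Task 2 (cost estimation per provider) is essentially a per-token price lookup. A sketch using illustrative prices from the model matrix in section 6.3 (the `PRICES` table is a non-exhaustive assumption):

```typescript
// Illustrative USD prices per 1M tokens (input/output) from the 6.3 matrix.
const PRICES: Record<string, { input: number; output: number }> = {
  'gpt-4o-mini': { input: 0.15, output: 0.6 },
  'gemini-2.5-flash': { input: 0.3, output: 2.5 },
  'claude-3-5-sonnet': { input: 3.0, output: 15.0 },
};

// Estimate the dollar cost of a request before sending it,
// so the router can compare providers and enforce maxCostPerRequest.
function estimateCost(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICES[model];
  if (!p) throw new Error(`Unknown model: ${model}`);
  return (inputTokens / 1_000_000) * p.input + (outputTokens / 1_000_000) * p.output;
}
```

The same function backs the `maxCostPerRequest` limit from section 6.1: reject or downgrade requests whose estimate exceeds the budget.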
### 7.3 Phase 3: Reliability (Week 3)

**Goals:**
- Add circuit breakers
- Implement exponential backoff
- Health monitoring

**Tasks:**
1. Implement circuit breaker per provider
2. Add exponential backoff with jitter
3. Health check endpoints
4. Provider health dashboard

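Task 2's exponential backoff with jitter can be sketched as follows. This uses "full jitter" (the delay is drawn uniformly from zero up to an exponentially growing, capped ceiling); the helper names and defaults are assumptions:

```typescript
// Full-jitter backoff: uniform in [0, min(cap, base * 2^attempt)].
// Jitter spreads retries out and avoids synchronized retry storms.
function backoffDelay(attempt: number, baseMs = 500, capMs = 30_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}

// Retry a provider call, sleeping a jittered delay between attempts.
async function withRetries<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // out of retries: surface the error
      await new Promise((resolve) => setTimeout(resolve, backoffDelay(attempt, baseMs)));
    }
  }
}
```

When a provider returns a `retry-after` header on a 429, that value should override the computed delay.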
### 7.4 Phase 4: Advanced Caching (Week 4)

**Goals:**
- Semantic caching layer
- Combined caching strategy
- Cache invalidation logic

**Tasks:**
1. Set up Redis/Qdrant for semantic cache
2. Implement similarity search
3. Add cache hit rate tracking
4. Configure TTL and invalidation

**Expected Savings:**
- Additional 20-30% from semantic caching
- Total: 70-90% cost reduction

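Task 2's similarity search boils down to cosine similarity between the query embedding and cached entries, honoring the `similarityThreshold` from the 6.1 config. A minimal in-memory sketch (a real deployment would query Redis/Qdrant rather than scan a local array, and embeddings would come from an embedding model):

```typescript
// A cached entry: the embedding of the original prompt plus the stored response.
interface CacheEntry { embedding: number[]; response: string }

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the best cached response at or above the similarity threshold,
// or undefined on a cache miss.
function semanticLookup(
  queryEmbedding: number[],
  entries: CacheEntry[],
  threshold = 0.85,
): string | undefined {
  let best: CacheEntry | undefined;
  let bestScore = threshold;
  for (const entry of entries) {
    const score = cosineSimilarity(queryEmbedding, entry.embedding);
    if (score >= bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best?.response;
}
```

Tuning the threshold is the key trade-off: too low returns stale or wrong answers, too high wastes cache hits.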
### 7.5 Phase 5: Observability (Ongoing)

**Goals:**
- Comprehensive monitoring
- Cost attribution
- Performance tracking

**Tasks:**
1. Complete the Langfuse integration
2. CloudWatch metrics
3. Cost per feature/user tracking
4. Alerting on budget thresholds

---

## 8. Testing Strategy

### 8.1 Provider Resilience Testing

```typescript
describe('Provider Fallback', () => {
  it('should fall back to the secondary provider on primary failure', async () => {
    // Mock primary provider failure
    mockAnthropicProvider.complete.mockRejectedValue(
      new Error('Rate limit exceeded')
    );

    mockOpenAIProvider.complete.mockResolvedValue({
      content: 'Success',
      usage: { inputTokens: 100, outputTokens: 50 }
    });

    const result = await router.complete({
      messages: [{ role: 'user', content: 'Test' }]
    });

    expect(result.metadata.provider).toBe('openai');
    expect(mockAnthropicProvider.complete).toHaveBeenCalledTimes(1);
    expect(mockOpenAIProvider.complete).toHaveBeenCalledTimes(1);
  });

  it('should respect circuit breaker state', async () => {
    // Trigger the circuit breaker with repeated failures
    for (let i = 0; i < 10; i++) {
      try {
        await router.complete({ messages: [{ role: 'user', content: 'Test' }] });
      } catch {}
    }

    const health = circuitBreaker.getState();
    expect(health.state).toBe('open');

    // Should skip the provider with an open circuit
    const result = await router.complete({ messages: [{ role: 'user', content: 'Test' }] });
    expect(result.metadata.provider).not.toBe('anthropic');
  });
});
```

### 8.2 Cost Validation Testing

```typescript
describe('Cost Optimization', () => {
  it('should route simple tasks to cheap models', () => {
    const model = selectModel({
      complexity: 'simple',
      estimatedTokens: 1000,
      requiresReasoning: false,
      requiresToolUse: false
    });

    expect(model.provider).toBe('google');
    expect(model.model).toBe('gemini-2.5-flash');
  });

  it('should use prompt caching for large contexts', async () => {
    const params = {
      messages: [
        { role: 'system', content: largeCodebaseContext },
        { role: 'user', content: 'Summarize this file' }
      ],
      enableCaching: true
    };

    const result = await provider.complete(params);

    expect(result.usage.cachedTokens).toBeGreaterThan(0);
    expect(result.metadata.cost).toBeLessThan(
      calculateCostWithoutCaching(result.usage)
    );
  });
});
```

### 8.3 Load Testing

```bash
# Artillery config for load testing
artillery run tests/load/provider-fallback.yml
```

```yaml
# tests/load/provider-fallback.yml
config:
  target: 'http://localhost:3000'
  phases:
    - duration: 60
      arrivalRate: 10
      name: 'Warm up'
    - duration: 120
      arrivalRate: 50
      name: 'Ramp up'
    - duration: 300
      arrivalRate: 100
      name: 'Sustained load'

scenarios:
  - name: 'Code Summarization'
    flow:
      - post:
          url: '/api/summarize'
          json:
            code: '{{ $randomString() }}'
            model: 'claude-3-5-sonnet-20241022'
          capture:
            - json: '$.cost'
              as: 'requestCost'
            - json: '$.provider'
              as: 'usedProvider'
```

---

## 9. Key Recommendations for mdcontext

### 9.1 Immediate Actions (This Week)

1. **Add Multi-Provider Support**
   - Implement Anthropic, OpenAI, Google providers
   - Basic fallback chain: Claude Sonnet → GPT-4o → Gemini Pro
   - Environment-based configuration

2. **Enable Prompt Caching**
   - Use Anthropic's cache_control for codebase context
   - Expected: 50-70% cost reduction immediately

3. **Add Basic Monitoring**
   - Integrate Langfuse for cost tracking
   - Track provider success/failure rates

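Item 2 maps to Anthropic's `cache_control` content blocks: mark the large, stable codebase context as an ephemeral cache block on the system prompt so repeat calls reuse it. A sketch of the request payload (the model name and context are placeholders; the payload would be passed to `client.messages.create(...)` from `@anthropic-ai/sdk`):

```typescript
// Placeholder for the large, rarely-changing repo context.
const codebaseContext = '/* large codebase context */';

// Request shape for Anthropic prompt caching: the cached system block is
// billed at a reduced rate on subsequent requests that reuse it.
const request = {
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 1024,
  system: [
    {
      type: 'text' as const,
      text: codebaseContext,
      cache_control: { type: 'ephemeral' as const },
    },
  ],
  messages: [{ role: 'user' as const, content: 'Summarize src/index.ts' }],
};
```

Put the stable context first and the varying user question last; changing anything before the cache breakpoint invalidates the cache.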
### 9.2 Short-Term (Next 2 Weeks)

1. **Implement Circuit Breakers**
   - Prevent cascading failures
   - Health monitoring per provider

2. **Cost-Based Routing**
   - Route simple tasks to Gemini Flash
   - Complex reasoning to Claude Sonnet
   - Expected: 30-50% additional savings

3. **Retry Logic**
   - Exponential backoff with jitter
   - Respect retry-after headers

### 9.3 Medium-Term (Next Month)

1. **Semantic Caching**
   - Set up Redis for cache storage
   - Implement similarity search
   - Expected: 20-30% additional savings

2. **Advanced Routing**
   - Task complexity classification
   - Dynamic provider selection based on:
     - Current cost
     - Latency requirements
     - Provider health

3. **Complete Observability**
   - Cost per feature attribution
   - User-level tracking
   - Budget alerts

### 9.4 Architecture Recommendation

```
┌─────────────────────────────────────────────────────────┐
│                     Client Request                      │
└───────────────────────────┬─────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│                  Semantic Cache Layer                   │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │ Exact Match  │  │  Similarity  │  │  TTL Check   │   │
│  │   (Memory)   │  │   (Redis)    │  │              │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
└───────────────────────────┬─────────────────────────────┘
                            │ Cache Miss
                            ▼
┌─────────────────────────────────────────────────────────┐
│                     Provider Router                     │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │   Classify   │  │    Route     │  │   Estimate   │   │
│  │     Task     │  │   Provider   │  │     Cost     │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
└───────────────────────────┬─────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│              Provider Chain with Fallback               │
│                                                         │
│  ┌─────────────────────────────────────────────────┐    │
│  │  Primary: Anthropic Claude                      │    │
│  │  ┌───────────────┐  ┌────────────────┐          │    │
│  │  │Circuit Breaker│  │Retry w/ Backoff│          │    │
│  │  └───────────────┘  └────────────────┘          │    │
│  │  Prompt Caching: ✅                             │    │
│  └─────────────────────────────────────────────────┘    │
│                        │ Fails                          │
│                        ▼                                │
│  ┌─────────────────────────────────────────────────┐    │
│  │  Secondary: OpenAI GPT-4o                       │    │
│  │  ┌───────────────┐  ┌────────────────┐          │    │
│  │  │Circuit Breaker│  │Retry w/ Backoff│          │    │
│  │  └───────────────┘  └────────────────┘          │    │
│  │  Prompt Caching: ✅ (Automatic)                 │    │
│  └─────────────────────────────────────────────────┘    │
│                        │ Fails                          │
│                        ▼                                │
│  ┌─────────────────────────────────────────────────┐    │
│  │  Tertiary: Google Gemini                        │    │
│  │  ┌───────────────┐  ┌────────────────┐          │    │
│  │  │Circuit Breaker│  │Retry w/ Backoff│          │    │
│  │  └───────────────┘  └────────────────┘          │    │
│  │  Context Caching: ✅                            │    │
│  └─────────────────────────────────────────────────┘    │
└───────────────────────────┬─────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│                  Observability Layer                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │   Langfuse   │  │  CloudWatch  │  │    Alerts    │   │
│  │Cost Tracking │  │   Metrics    │  │   Budgets    │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
└─────────────────────────────────────────────────────────┘
```

### 9.5 Expected Outcomes

**Cost Reduction:**
- Phase 1 (Prompt Caching): 50-70%
- Phase 2 (Smart Routing): 20-30%
- Phase 3 (Semantic Caching): 10-20%
- **Total: 70-90% reduction** (later phases compound on the remaining spend, so the total is below the sum of the individual percentages)

**Reliability Improvement:**
- 99.9%+ uptime with multi-provider fallback
- Circuit breakers prevent cascading failures
- Automatic recovery from rate limits

**Performance:**
- <100ms for cache hits (semantic)
- <1s for cached prompts
- Minimal latency overhead from routing layer

---

## 10. Additional Resources

### 10.1 Documentation

- [Vercel AI SDK Documentation](https://ai-sdk.dev/)
- [LiteLLM Documentation](https://docs.litellm.ai/)
- [Anthropic Prompt Caching Guide](https://docs.anthropic.com/claude/docs/prompt-caching)
- [OpenAI Rate Limits Guide](https://platform.openai.com/docs/guides/rate-limits)
- [Google Gemini API Documentation](https://ai.google.dev/docs)

### 10.2 Tools and Libraries

**Provider SDKs:**
- [@anthropic-ai/sdk](https://www.npmjs.com/package/@anthropic-ai/sdk)
- [openai](https://www.npmjs.com/package/openai)
- [@google/generative-ai](https://www.npmjs.com/package/@google/generative-ai)

**Abstraction Layers:**
- [ai (Vercel AI SDK)](https://www.npmjs.com/package/ai)
- [litellm](https://pypi.org/project/litellm/)
- [any-llm](https://www.npmjs.com/package/any-llm)

**Observability:**
- [langfuse](https://www.npmjs.com/package/langfuse)
- [@datadog/llm-observability](https://docs.datadoghq.com/llm_observability/)
- [@traceloop/node-server-sdk](https://www.npmjs.com/package/@traceloop/node-server-sdk)

**Caching:**
- [redis](https://www.npmjs.com/package/redis)
- [@qdrant/js-client-rest](https://www.npmjs.com/package/@qdrant/js-client-rest)
- [gptcache](https://github.com/zilliztech/GPTCache)

### 10.3 Related Research Documents

- [Alternative LLM Providers 2026](./alternative-providers-2026.md) - Comprehensive provider comparison
- [Anthropic 2026](./anthropic-2026.md) - Claude-specific features and pricing
- [OpenAI 2026](./openai-2026.md) - GPT models and capabilities
- [Prompt Engineering 2026](./prompt-engineering-2026.md) - Optimization techniques

---

## Sources

This research synthesizes information from the following sources:

### Multi-Provider Architecture
- [Vercel AI Gateway Model Fallbacks](https://vercel.com/changelog/model-fallbacks-now-available-in-vercel-ai-gateway)
- [Vercel AI SDK Provider Management](https://ai-sdk.dev/docs/ai-sdk-core/provider-management)
- [LiteLLM Router Architecture](https://docs.litellm.ai/docs/router_architecture)
- [LiteLLM Fallbacks & Reliability](https://docs.litellm.ai/docs/proxy/reliability)
- [LangChain Fallbacks](https://python.langchain.com/v0.1/docs/guides/productionization/fallbacks/)
- [Dynamic Failover with LangChain](https://medium.com/@andrewnguonly/dynamic-failover-and-load-balancing-llms-with-langchain-e930a094be61)

### Cost Optimization
- [LLM Cost Optimization Strategies](https://medium.com/@ajayverma23/taming-the-beast-cost-optimization-strategies-for-llm-api-calls-in-production-11f16dbe2c39)
- [Stop Overpaying 5-10x in 2026](https://byteiota.com/llm-cost-optimization-stop-overpaying-5-10x-in-2026/)
- [Smart Routing for Cost Management](https://www.kosmoy.com/post/llm-cost-management-stop-burning-money-on-tokens)
- [Monitor and Optimize LLM Costs](https://www.helicone.ai/blog/monitor-and-optimize-llm-costs)
- [LLM API Pricing 2026](https://pricepertoken.com/)

### Caching Strategies
- [Redis Semantic Caching](https://redis.io/blog/what-is-semantic-caching/)
- [GPTCache GitHub](https://github.com/zilliztech/GPTCache)
- [Prompt Caching: 10x Cheaper Tokens](https://ngrok.com/blog/prompt-caching/)
- [Prompt Caching Infrastructure Guide](https://introl.com/blog/prompt-caching-infrastructure-llm-cost-latency-reduction-guide-2025)
- [Don't Break the Cache Research](https://arxiv.org/abs/2601.06007)
- [LiteLLM Caching](https://docs.litellm.ai/docs/proxy/caching)
- [Ultimate Guide to LLM Caching](https://latitude-blog.ghost.io/blog/ultimate-guide-to-llm-caching-for-low-latency-ai/)
- [Effective LLM Caching](https://www.helicone.ai/blog/effective-llm-caching)

### Circuit Breakers & Reliability
- [Circuit Breakers in LLM Apps](https://portkey.ai/blog/retries-fallbacks-and-circuit-breakers-in-llm-apps/)
- [Azure API Management LLM Resiliency](https://techcommunity.microsoft.com/blog/azuredevcommunityblog/improve-llm-backend-resiliency-with-load-balancer-and-circuit-breaker-rules-in-a/4394502)
- [Apigee Circuit Breaker Pattern](https://github.com/GoogleCloudPlatform/apigee-samples/blob/main/llm-circuit-breaking/llm_circuit_breaking.ipynb)

### Rate Limiting & Retry Logic
- [OpenAI Rate Limits](https://platform.openai.com/docs/guides/rate-limits)
- [OpenAI Exponential Backoff](https://platform.openai.com/docs/guides/rate-limits/retrying-with-exponential-backoff)
- [Claude API 429 Error Fix](https://www.aifreeapi.com/en/posts/fix-claude-api-429-rate-limit-error)
- [Gemini API Rate Limits 2026](https://www.aifreeapi.com/en/posts/gemini-api-rate-limit-explained)

### Provider Abstraction
- [Multi-Provider LLM Orchestration 2026](https://dev.to/ash_dubai/multi-provider-llm-orchestration-in-production-a-2026-guide-1g10)
- [AnyLLM GitHub](https://github.com/fkesheh/any-llm)
- [TypeScript & LLMs: Production Lessons](https://johnchildseddy.medium.com/typescript-llms-lessons-learned-from-9-months-in-production-4910485e3272)
- [ModelFusion GitHub](https://github.com/vercel/modelfusion)
- [Building Bridges to LLMs](https://hatchworks.com/blog/gen-ai/llm-projects-production-abstraction/)

### Observability
- [Langfuse Token and Cost Tracking](https://langfuse.com/docs/observability/features/token-and-cost-tracking)
- [Datadog LLM Observability](https://www.datadoghq.com/product/llm-observability/)
- [Traceloop: Track Token Usage Per User](https://www.traceloop.com/blog/from-bills-to-budgets-how-to-track-llm-token-usage-and-cost-per-user)
- [Portkey Token Tracking](https://portkey.ai/blog/tracking-llm-token-usage-across-providers-teams-and-workloads/)
- [Elastic LLM Observability](https://www.elastic.co/observability/llm-monitoring)
- [LLM Observability Tools 2026](https://lakefs.io/blog/llm-observability-tools/)
- [Best LLM Observability Tools](https://www.firecrawl.dev/blog/best-llm-observability-tools)

---

**Document Version**: 1.0
**Last Updated**: January 26, 2026
**Next Review**: February 26, 2026