mdcontext 0.1.0 → 0.2.0
- package/.changeset/config.json +9 -9
- package/.claude/settings.local.json +25 -0
- package/.github/workflows/claude-code-review.yml +44 -0
- package/.github/workflows/claude.yml +85 -0
- package/CONTRIBUTING.md +186 -0
- package/NOTES/NOTES +44 -0
- package/README.md +206 -3
- package/biome.json +1 -1
- package/dist/chunk-23UPXDNL.js +3044 -0
- package/dist/chunk-2W7MO2DL.js +1366 -0
- package/dist/chunk-3NUAZGMA.js +1689 -0
- package/dist/chunk-7TOWB2XB.js +366 -0
- package/dist/chunk-7XOTOADQ.js +3065 -0
- package/dist/chunk-AH2PDM2K.js +3042 -0
- package/dist/chunk-BNXWSZ63.js +3742 -0
- package/dist/chunk-BTL5DJVU.js +3222 -0
- package/dist/chunk-HDHYG7E4.js +104 -0
- package/dist/chunk-HLR4KZBP.js +3234 -0
- package/dist/chunk-IP3FRFEB.js +1045 -0
- package/dist/chunk-KHU56VDO.js +3042 -0
- package/dist/chunk-KRYIFLQR.js +85 -89
- package/dist/chunk-LBSDNLEM.js +287 -0
- package/dist/chunk-MNTQ7HCP.js +2643 -0
- package/dist/chunk-MUJELQQ6.js +1387 -0
- package/dist/chunk-MXJGMSLV.js +2199 -0
- package/dist/chunk-N6QJGC3Z.js +2636 -0
- package/dist/chunk-OBELGBPM.js +1713 -0
- package/dist/chunk-OT7R5XTA.js +3192 -0
- package/dist/chunk-P7X4RA2T.js +106 -0
- package/dist/chunk-PIDUQNC2.js +3185 -0
- package/dist/chunk-POGCDIH4.js +3187 -0
- package/dist/chunk-PSIEOQGZ.js +3043 -0
- package/dist/chunk-PVRT3IHA.js +3238 -0
- package/dist/chunk-QNN4TT23.js +1430 -0
- package/dist/chunk-RE3R45RJ.js +3042 -0
- package/dist/chunk-S7E6TFX6.js +718 -657
- package/dist/chunk-SG6GLU4U.js +1378 -0
- package/dist/chunk-SJCDV2ST.js +274 -0
- package/dist/chunk-SYE5XLF3.js +104 -0
- package/dist/chunk-T5VLYBZD.js +103 -0
- package/dist/chunk-TOQB7VWU.js +3238 -0
- package/dist/chunk-VFNMZ4ZQ.js +3228 -0
- package/dist/chunk-VVTGZNBT.js +1533 -1423
- package/dist/chunk-W7Q4RFEV.js +104 -0
- package/dist/chunk-XTYYVRLO.js +3190 -0
- package/dist/chunk-Y6MDYVJD.js +3063 -0
- package/dist/cli/main.js +4072 -629
- package/dist/index.d.ts +420 -33
- package/dist/index.js +8 -15
- package/dist/mcp/server.js +103 -7
- package/dist/schema-BAWSG7KY.js +22 -0
- package/dist/schema-E3QUPL26.js +20 -0
- package/dist/schema-EHL7WUT6.js +20 -0
- package/docs/019-USAGE.md +44 -5
- package/docs/020-current-implementation.md +8 -8
- package/docs/021-DOGFOODING-FINDINGS.md +1 -1
- package/docs/CONFIG.md +1123 -0
- package/docs/ERRORS.md +383 -0
- package/docs/summarization.md +320 -0
- package/justfile +40 -0
- package/package.json +39 -33
- package/research/INDEX.md +315 -0
- package/research/code-review/README.md +90 -0
- package/research/code-review/cli-error-handling-review.md +979 -0
- package/research/code-review/code-review-validation-report.md +464 -0
- package/research/code-review/main-ts-review.md +1128 -0
- package/research/config-docs/SUMMARY.md +357 -0
- package/research/config-docs/TEST-RESULTS.md +776 -0
- package/research/config-docs/TODO.md +542 -0
- package/research/config-docs/analysis.md +744 -0
- package/research/config-docs/fix-validation.md +502 -0
- package/research/config-docs/help-audit.md +264 -0
- package/research/config-docs/help-system-analysis.md +890 -0
- package/research/frontmatter/COMMENTS-ARE-SKIPPED.md +149 -0
- package/research/frontmatter/LLM-CODE-NAVIGATION.md +276 -0
- package/research/issue-review.md +603 -0
- package/research/llm-summarization/agent-cli-tools-2026.md +1082 -0
- package/research/llm-summarization/alternative-providers-2026.md +1428 -0
- package/research/llm-summarization/anthropic-2026.md +367 -0
- package/research/llm-summarization/claude-cli-integration.md +1706 -0
- package/research/llm-summarization/cli-integration-patterns.md +3155 -0
- package/research/llm-summarization/openai-2026.md +473 -0
- package/research/llm-summarization/openai-compatible-providers-2026.md +1022 -0
- package/research/llm-summarization/opencode-cli-integration.md +1552 -0
- package/research/llm-summarization/prompt-engineering-2026.md +1426 -0
- package/research/llm-summarization/prototype-results.md +56 -0
- package/research/llm-summarization/provider-switching-patterns-2026.md +2153 -0
- package/research/llm-summarization/typescript-llm-libraries-2026.md +2436 -0
- package/research/mdcontext-pudding/00-EXECUTIVE-SUMMARY.md +282 -0
- package/research/mdcontext-pudding/01-index-embed.md +956 -0
- package/research/mdcontext-pudding/02-search-COMMANDS.md +142 -0
- package/research/mdcontext-pudding/02-search-SUMMARY.md +146 -0
- package/research/mdcontext-pudding/02-search.md +970 -0
- package/research/mdcontext-pudding/03-context.md +779 -0
- package/research/mdcontext-pudding/04-navigation-and-analytics.md +803 -0
- package/research/mdcontext-pudding/04-tree.md +704 -0
- package/research/mdcontext-pudding/05-config.md +1038 -0
- package/research/mdcontext-pudding/06-links-summary.txt +87 -0
- package/research/mdcontext-pudding/06-links.md +679 -0
- package/research/mdcontext-pudding/07-stats.md +693 -0
- package/research/mdcontext-pudding/BUG-FIX-PLAN.md +388 -0
- package/research/mdcontext-pudding/P0-BUG-VALIDATION.md +167 -0
- package/research/mdcontext-pudding/README.md +168 -0
- package/research/mdcontext-pudding/TESTING-SUMMARY.md +128 -0
- package/research/research-quality-review.md +834 -0
- package/research/semantic-search/embedding-text-analysis.md +156 -0
- package/research/semantic-search/multi-word-failure-reproduction.md +171 -0
- package/research/semantic-search/query-processing-analysis.md +207 -0
- package/research/semantic-search/root-cause-and-solution.md +114 -0
- package/research/semantic-search/threshold-validation-report.md +69 -0
- package/research/semantic-search/vector-search-analysis.md +63 -0
- package/research/test-path-issues.md +276 -0
- package/review/ALP-76/1-error-type-design.md +962 -0
- package/review/ALP-76/2-error-handling-patterns.md +906 -0
- package/review/ALP-76/3-error-presentation.md +624 -0
- package/review/ALP-76/4-test-coverage.md +625 -0
- package/review/ALP-76/5-migration-completeness.md +440 -0
- package/review/ALP-76/6-effect-best-practices.md +755 -0
- package/scripts/apply-branch-protection.sh +47 -0
- package/scripts/branch-protection-templates.json +79 -0
- package/scripts/prototype-summarization.ts +346 -0
- package/scripts/rebuild-hnswlib.js +32 -37
- package/scripts/setup-branch-protection.sh +64 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/active-provider.json +7 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/bm25.json +541 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/bm25.meta.json +5 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/config.json +8 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.bin +0 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.meta.bin +0 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/documents.json +60 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/links.json +13 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/.mdcontext/indexes/sections.json +1197 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/configuration-management.md +99 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/distributed-systems.md +92 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/error-handling.md +78 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/failure-automation.md +55 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/job-context.md +69 -0
- package/src/__tests__/fixtures/semantic-search/multi-word-corpus/process-orchestration.md +99 -0
- package/src/cli/argv-preprocessor.test.ts +2 -2
- package/src/cli/cli.test.ts +230 -33
- package/src/cli/commands/config-cmd.ts +642 -0
- package/src/cli/commands/context.ts +97 -9
- package/src/cli/commands/duplicates.ts +122 -0
- package/src/cli/commands/embeddings.ts +529 -0
- package/src/cli/commands/index-cmd.ts +210 -30
- package/src/cli/commands/index.ts +3 -0
- package/src/cli/commands/search.ts +894 -64
- package/src/cli/commands/stats.ts +3 -0
- package/src/cli/commands/tree.ts +26 -5
- package/src/cli/config-layer.ts +176 -0
- package/src/cli/error-handler.test.ts +235 -0
- package/src/cli/error-handler.ts +655 -0
- package/src/cli/flag-schemas.ts +66 -0
- package/src/cli/help.ts +209 -7
- package/src/cli/main.ts +348 -58
- package/src/cli/options.ts +10 -0
- package/src/cli/shared-error-handling.ts +199 -0
- package/src/cli/utils.ts +150 -17
- package/src/config/file-provider.test.ts +320 -0
- package/src/config/file-provider.ts +273 -0
- package/src/config/index.ts +72 -0
- package/src/config/integration.test.ts +667 -0
- package/src/config/precedence.test.ts +277 -0
- package/src/config/precedence.ts +451 -0
- package/src/config/schema.test.ts +414 -0
- package/src/config/schema.ts +603 -0
- package/src/config/service.test.ts +320 -0
- package/src/config/service.ts +243 -0
- package/src/config/testing.test.ts +264 -0
- package/src/config/testing.ts +110 -0
- package/src/core/types.ts +6 -33
- package/src/duplicates/detector.test.ts +183 -0
- package/src/duplicates/detector.ts +414 -0
- package/src/duplicates/index.ts +18 -0
- package/src/embeddings/embedding-namespace.test.ts +300 -0
- package/src/embeddings/embedding-namespace.ts +947 -0
- package/src/embeddings/heading-boost.test.ts +222 -0
- package/src/embeddings/hnsw-build-options.test.ts +198 -0
- package/src/embeddings/hyde.test.ts +272 -0
- package/src/embeddings/hyde.ts +264 -0
- package/src/embeddings/index.ts +2 -0
- package/src/embeddings/openai-provider.ts +332 -83
- package/src/embeddings/pricing.json +22 -0
- package/src/embeddings/provider-constants.ts +204 -0
- package/src/embeddings/provider-errors.test.ts +967 -0
- package/src/embeddings/provider-errors.ts +565 -0
- package/src/embeddings/provider-factory.test.ts +240 -0
- package/src/embeddings/provider-factory.ts +225 -0
- package/src/embeddings/provider-integration.test.ts +788 -0
- package/src/embeddings/query-preprocessing.test.ts +187 -0
- package/src/embeddings/semantic-search-threshold.test.ts +508 -0
- package/src/embeddings/semantic-search.ts +780 -93
- package/src/embeddings/types.ts +293 -16
- package/src/embeddings/vector-store.ts +486 -77
- package/src/embeddings/voyage-provider.ts +313 -0
- package/src/errors/errors.test.ts +845 -0
- package/src/errors/index.ts +533 -0
- package/src/index/ignore-patterns.test.ts +354 -0
- package/src/index/ignore-patterns.ts +305 -0
- package/src/index/indexer.ts +286 -48
- package/src/index/storage.ts +94 -30
- package/src/index/types.ts +40 -2
- package/src/index/watcher.ts +67 -9
- package/src/index.ts +22 -0
- package/src/integration/search-keyword.test.ts +678 -0
- package/src/mcp/server.ts +135 -6
- package/src/parser/parser.ts +18 -19
- package/src/parser/section-filter.test.ts +277 -0
- package/src/parser/section-filter.ts +125 -3
- package/src/search/__tests__/hybrid-search.test.ts +650 -0
- package/src/search/bm25-store.ts +366 -0
- package/src/search/cross-encoder.test.ts +253 -0
- package/src/search/cross-encoder.ts +406 -0
- package/src/search/fuzzy-search.test.ts +419 -0
- package/src/search/fuzzy-search.ts +273 -0
- package/src/search/hybrid-search.ts +448 -0
- package/src/search/path-matcher.test.ts +276 -0
- package/src/search/path-matcher.ts +33 -0
- package/src/search/searcher.test.ts +99 -1
- package/src/search/searcher.ts +189 -67
- package/src/search/wink-bm25.d.ts +30 -0
- package/src/summarization/cli-providers/claude.ts +202 -0
- package/src/summarization/cli-providers/detection.test.ts +273 -0
- package/src/summarization/cli-providers/detection.ts +118 -0
- package/src/summarization/cli-providers/index.ts +8 -0
- package/src/summarization/cost.test.ts +139 -0
- package/src/summarization/cost.ts +102 -0
- package/src/summarization/error-handler.test.ts +127 -0
- package/src/summarization/error-handler.ts +111 -0
- package/src/summarization/index.ts +102 -0
- package/src/summarization/pipeline.test.ts +498 -0
- package/src/summarization/pipeline.ts +231 -0
- package/src/summarization/prompts.test.ts +269 -0
- package/src/summarization/prompts.ts +133 -0
- package/src/summarization/provider-factory.test.ts +396 -0
- package/src/summarization/provider-factory.ts +178 -0
- package/src/summarization/types.ts +184 -0
- package/src/summarize/summarizer.ts +104 -35
- package/src/types/huggingface-transformers.d.ts +66 -0
- package/tests/fixtures/cli/.mdcontext/active-provider.json +7 -0
- package/tests/fixtures/cli/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.bin +0 -0
- package/tests/fixtures/cli/.mdcontext/embeddings/openai_text-embedding-3-small_512/vectors.meta.bin +0 -0
- package/tests/fixtures/cli/.mdcontext/indexes/documents.json +4 -4
- package/tests/fixtures/cli/.mdcontext/indexes/sections.json +14 -0
- package/tests/integration/embed-index.test.ts +712 -0
- package/tests/integration/search-context.test.ts +469 -0
- package/tests/integration/search-semantic.test.ts +522 -0
- package/vitest.config.ts +1 -6
- package/AGENTS.md +0 -46
- package/tests/fixtures/cli/.mdcontext/vectors.bin +0 -0
- package/tests/fixtures/cli/.mdcontext/vectors.meta.json +0 -1264
@@ -0,0 +1,2153 @@
# LLM Provider Switching and Fallback Patterns - 2026 Research

**Date**: January 26, 2026
**Author**: Research Team
**Purpose**: Document production-ready patterns for multi-provider LLM architectures with automatic fallback, cost optimization, and reliability

---

## Executive Summary

This document synthesizes current best practices for building robust multi-provider LLM systems in 2026. Based on analysis of production implementations from Vercel AI SDK, LiteLLM, LangChain, and real-world deployments, the following patterns emerge as essential:

**Key Takeaways:**
1. **Multi-provider architecture is now standard** - Single-provider dependency is considered a production anti-pattern
2. **Cost optimization through smart routing** - Teams see 30-50% cost reduction with intelligent provider selection
3. **Caching delivers 80%+ savings** - Combined prompt caching and semantic caching provide the highest ROI
4. **Circuit breakers prevent cascading failures** - Essential for production stability
5. **Provider abstraction layers are mature** - TypeScript solutions with strong type safety available

**Cost Reduction Potential:**
- Smart routing: 30-50% reduction
- Prompt caching: 45-80% reduction
- Semantic caching: 15-30% additional reduction
- Combined strategies: 80-95% total reduction possible

---

## 1. Multi-Provider Architecture Patterns

### 1.1 Primary + Fallback Strategy

The most common production pattern uses a primary provider with automatic fallback to secondary providers when errors occur.

**Architecture:**
```typescript
interface ProviderConfig {
  name: string;
  priority: number;
  models: {
    cheap: string;    // For simple tasks
    standard: string; // For typical workloads
    premium: string;  // For complex reasoning
  };
  healthCheck: () => Promise<boolean>;
  circuitBreaker: CircuitBreakerConfig;
}

const providerChain: ProviderConfig[] = [
  {
    name: 'anthropic',
    priority: 1,
    models: {
      cheap: 'claude-3-haiku-20240307',
      standard: 'claude-3-5-sonnet-20241022',
      premium: 'claude-opus-4-5-20251101'
    },
    healthCheck: () => checkAnthropicHealth(),
    circuitBreaker: {
      failureThreshold: 10,
      successThreshold: 3,
      timeout: 60000,
    }
  },
  {
    name: 'openai',
    priority: 2,
    models: {
      cheap: 'gpt-4o-mini',
      standard: 'gpt-4o',
      premium: 'o1'
    },
    healthCheck: () => checkOpenAIHealth(),
    circuitBreaker: {
      failureThreshold: 10,
      successThreshold: 3,
      timeout: 60000,
    }
  },
  {
    name: 'google',
    priority: 3,
    models: {
      cheap: 'gemini-2.5-flash',
      standard: 'gemini-2.5-pro',
      premium: 'gemini-2.5-pro'
    },
    healthCheck: () => checkGoogleHealth(),
    circuitBreaker: {
      failureThreshold: 10,
      successThreshold: 3,
      timeout: 60000,
    }
  }
];
```

**Key Features:**
- Ordered fallback chain with priority levels
- Per-provider circuit breakers to prevent cascading failures
- Health checks to proactively detect provider issues
- Quality tiers (cheap/standard/premium) for cost optimization

**Sources:**
- [Vercel AI Gateway Model Fallbacks](https://vercel.com/changelog/model-fallbacks-now-available-in-vercel-ai-gateway)
- [LiteLLM Router Architecture](https://docs.litellm.ai/docs/router_architecture)
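The chain above calls `checkAnthropicHealth()` and friends without defining them. One way to implement such a check is a cheap, timeboxed request against each provider's models endpoint; this is a sketch under assumptions - the endpoint URL, the `makeHealthCheck` helper, and the injectable `Fetcher` indirection (which makes the check testable) are ours, not from the document:

```typescript
// Minimal response shape we care about; injectable so tests can fake the network
type Fetcher = (url: string, init: { signal: AbortSignal }) => Promise<{ ok: boolean }>;

function makeHealthCheck(
  url: string,
  timeoutMs: number,
  fetcher: Fetcher = fetch as unknown as Fetcher
): () => Promise<boolean> {
  return async () => {
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), timeoutMs);
    try {
      const res = await fetcher(url, { signal: controller.signal });
      return res.ok; // Any 2xx counts as healthy
    } catch {
      return false; // Timeouts and network errors count as unhealthy
    } finally {
      clearTimeout(timer);
    }
  };
}

// Assumed endpoint - verify against the provider's actual API before relying on it
const checkAnthropicHealth = makeHealthCheck('https://api.anthropic.com/v1/models', 3000);
```

Injecting the fetcher also lets the fallback logic be exercised in unit tests without hitting any provider.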
### 1.2 Quality-Based Routing

Route requests to appropriate model tiers based on task complexity, achieving 30-50% cost reduction.

**Implementation Pattern:**
```typescript
interface TaskClassification {
  complexity: 'simple' | 'standard' | 'complex';
  estimatedTokens: number;
  requiresReasoning: boolean;
  requiresToolUse: boolean;
}

function selectModel(task: TaskClassification): ModelConfig {
  // Route 70% of queries to cheap models, escalate only when needed
  if (task.complexity === 'simple' && !task.requiresReasoning) {
    return {
      provider: 'google',
      model: 'gemini-2.5-flash',
      costPer1M: { input: 0.30, output: 2.50 }
    };
  }

  if (task.complexity === 'standard' || task.requiresToolUse) {
    return {
      provider: 'anthropic',
      model: 'claude-3-5-sonnet-20241022',
      costPer1M: { input: 3.00, output: 15.00 }
    };
  }

  // Complex reasoning tasks
  return {
    provider: 'anthropic',
    model: 'claude-opus-4-5-20251101',
    costPer1M: { input: 15.00, output: 75.00 }
  };
}
```

**Routing Strategies:**
1. **Simple Classification** - Use cheap models (GPT-4o-mini, Gemini Flash, Claude Haiku)
2. **Standard Q&A** - Use mid-tier models (GPT-4o, Claude Sonnet, Gemini Pro)
3. **Complex Reasoning** - Use premium models (Claude Opus, o1, Gemini Pro with extended context)
4. **Internal Processing** - Use local open-source models when possible

**Production Impact:**
- 70% of queries routed to cheap models
- 25% to standard models
- 5% to premium models
- Typical cost reduction: 30-50%

**Sources:**
- [LLM Cost Optimization Strategies](https://medium.com/@ajayverma23/taming-the-beast-cost-optimization-strategies-for-llm-api-calls-in-production-11f16dbe2c39)
- [Smart Routing with Load Balancing](https://www.kosmoy.com/post/llm-cost-management-stop-burning-money-on-tokens)
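The routing code assumes a `TaskClassification` is produced upstream but never shows the classifier. A minimal heuristic sketch - the keyword lists, the ~4 chars/token estimate, and the token thresholds are illustrative assumptions, not figures from this research:

```typescript
interface TaskClassification {
  complexity: 'simple' | 'standard' | 'complex';
  estimatedTokens: number;
  requiresReasoning: boolean;
  requiresToolUse: boolean;
}

function classifyTask(prompt: string): TaskClassification {
  const estimatedTokens = Math.ceil(prompt.length / 4); // rough ~4 chars/token heuristic
  const requiresReasoning = /\b(prove|derive|debug|why|architect|trade-?offs?)\b/i.test(prompt);
  const requiresToolUse = /\b(search|fetch|browse|run|execute)\b/i.test(prompt);

  // Escalate on reasoning hints or sheer size; default to the cheap tier
  let complexity: TaskClassification['complexity'] = 'simple';
  if (requiresReasoning || estimatedTokens > 2000) complexity = 'complex';
  else if (requiresToolUse || estimatedTokens > 300) complexity = 'standard';

  return { complexity, estimatedTokens, requiresReasoning, requiresToolUse };
}
```

A production system would replace the regexes with a small classifier model, but even a heuristic like this is enough to realize most of the cheap-tier routing described above.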
### 1.3 Cost-Optimized Provider Selection

Dynamically route to the cheapest provider for equivalent capability.

**2026 Cost Comparison (per 1M tokens):**

| Provider | Cheap Model | Input | Output | Best For |
|----------|-------------|-------|--------|----------|
| **DeepSeek** | deepseek-v3 | $0.27 | $1.10 | Massive cost savings (95% vs OpenAI) |
| **Google** | gemini-2.5-flash | $0.30 | $2.50 | Speed + cost balance |
| **OpenAI** | gpt-4o-mini | $0.15 | $0.60 | Simple tasks, fast responses |
| **Anthropic** | claude-3-haiku | $0.25 | $1.25 | Structured output, tool use |
| **Mistral** | mistral-small | $0.20 | $0.60 | EU data residency |

| Provider | Standard Model | Input | Output | Best For |
|----------|---------------|-------|--------|----------|
| **DeepSeek** | deepseek-r1 | $0.55 | $2.19 | Cost-effective reasoning |
| **Google** | gemini-2.5-pro | $1.25 | $10.00 | Large context windows |
| **OpenAI** | gpt-4o | $2.50 | $10.00 | General-purpose, reliable |
| **Anthropic** | claude-3.5-sonnet | $3.00 | $15.00 | Code generation, analysis |
| **Mistral** | devstral-2 | $0.62 | $2.46 | Code-specific tasks |

**Dynamic Routing Logic:**
```typescript
interface ProviderCost {
  provider: string;
  model: string;
  inputCostPer1M: number;
  outputCostPer1M: number;
  estimatedLatencyMs: number;
}

function selectCheapestProvider(
  estimatedInputTokens: number,
  estimatedOutputTokens: number,
  maxLatency?: number
): ProviderCost {
  const options: ProviderCost[] = [
    {
      provider: 'deepseek',
      model: 'deepseek-v3',
      inputCostPer1M: 0.27,
      outputCostPer1M: 1.10,
      estimatedLatencyMs: 2000
    },
    {
      provider: 'google',
      model: 'gemini-2.5-flash',
      inputCostPer1M: 0.30,
      outputCostPer1M: 2.50,
      estimatedLatencyMs: 800
    },
    {
      provider: 'openai',
      model: 'gpt-4o-mini',
      inputCostPer1M: 0.15,
      outputCostPer1M: 0.60,
      estimatedLatencyMs: 600
    }
  ];

  // Filter by latency requirement
  const validOptions = maxLatency
    ? options.filter(o => o.estimatedLatencyMs <= maxLatency)
    : options;

  // Guard: reduce() throws on an empty array, so fail with a clear error
  // when no provider satisfies the latency budget
  if (validOptions.length === 0) {
    throw new Error(`No provider meets the ${maxLatency}ms latency requirement`);
  }

  // Calculate total cost and select cheapest
  return validOptions.reduce((cheapest, current) => {
    const currentCost =
      (estimatedInputTokens / 1_000_000) * current.inputCostPer1M +
      (estimatedOutputTokens / 1_000_000) * current.outputCostPer1M;

    const cheapestCost =
      (estimatedInputTokens / 1_000_000) * cheapest.inputCostPer1M +
      (estimatedOutputTokens / 1_000_000) * cheapest.outputCostPer1M;

    return currentCost < cheapestCost ? current : cheapest;
  });
}
```

**Key Insight:** Pricing varies 10x across providers for similar capability - smart routing prevents overpaying.

**Sources:**
- [LLM Cost Optimization 2026](https://byteiota.com/llm-cost-optimization-stop-overpaying-5-10x-in-2026/)
- [LLM API Pricing Comparison](https://pricepertoken.com/)

---
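To make the spread concrete, the per-request cost is simply tokens / 1M × rate. Applying that to a 10k-input / 1k-output request (an arbitrary illustration, not a figure from the document) with the cheap-tier and standard-tier rates from the tables above:

```typescript
// Cost of one request at the listed per-1M-token rates
function requestCost(
  inputTokens: number,
  outputTokens: number,
  inputCostPer1M: number,
  outputCostPer1M: number
): number {
  return (
    (inputTokens / 1_000_000) * inputCostPer1M +
    (outputTokens / 1_000_000) * outputCostPer1M
  );
}

// 10k input + 1k output tokens:
const gpt4oMini = requestCost(10_000, 1_000, 0.15, 0.60); // ≈ $0.0021
const sonnet = requestCost(10_000, 1_000, 3.00, 15.00);   // ≈ $0.045
// Same request, ~21x the price - the "10x" spread is conservative
```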
## 2. Automatic Failover and Error Handling

### 2.1 Circuit Breaker Pattern

Circuit breakers prevent cascading failures by temporarily blocking requests to unhealthy providers.

**States:**
1. **Closed** - Normal operation, requests pass through
2. **Open** - Too many failures, requests blocked for timeout period
3. **Half-Open** - Testing recovery, limited requests allowed

**Implementation:**
```typescript
class CircuitBreaker {
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private failures = 0;
  private successes = 0;
  private lastFailureTime: number | null = null;

  constructor(
    private config: {
      failureThreshold: number; // e.g., 10 failures
      successThreshold: number; // e.g., 3 successes to close
      timeout: number;          // e.g., 60000ms (60s)
      jitter: number;           // e.g., 5000ms random variation
    }
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      const timeoutWithJitter =
        this.config.timeout + Math.random() * this.config.jitter;

      if (Date.now() - this.lastFailureTime! < timeoutWithJitter) {
        throw new Error('Circuit breaker is OPEN');
      }

      // Transition to half-open for testing
      this.state = 'half-open';
      this.successes = 0;
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess(): void {
    this.failures = 0;

    if (this.state === 'half-open') {
      this.successes++;

      if (this.successes >= this.config.successThreshold) {
        this.state = 'closed';
        console.log('Circuit breaker closed - provider recovered');
      }
    }
  }

  private onFailure(): void {
    this.failures++;
    this.lastFailureTime = Date.now();

    // A failure while half-open reopens the circuit immediately
    if (this.state === 'half-open' || this.failures >= this.config.failureThreshold) {
      this.state = 'open';
      console.log('Circuit breaker opened - provider unhealthy');
    }
  }

  getState(): { state: string; failures: number; successes: number } {
    return {
      state: this.state,
      failures: this.failures,
      successes: this.successes
    };
  }
}
```

**Benefits:**
- Prevents overwhelming unhealthy providers
- Gives systems time to recover
- Provides clear health signals for monitoring
- Reduces cascading failures

**Sources:**
- [Circuit Breakers in LLM Apps](https://portkey.ai/blog/retries-fallbacks-and-circuit-breakers-in-llm-apps/)
- [Azure API Management LLM Resiliency](https://techcommunity.microsoft.com/blog/azuredevcommunityblog/improve-llm-backend-resiliency-with-load-balancer-and-circuit-breaker-rules-in-a/4394502)
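To see the three-state machine in action, here is a condensed, closure-based variant of the same logic; the `miniBreaker` name, the injected clock (which makes the transitions deterministic to test), and the omission of jitter are our simplifications, not part of the document's design:

```typescript
type Clock = () => number;

function miniBreaker(failureThreshold: number, timeoutMs: number, now: Clock) {
  let state: 'closed' | 'open' | 'half-open' = 'closed';
  let failures = 0;
  let lastFailure = 0;

  return {
    state: () => state,
    record(success: boolean) {
      // Open circuits transition to half-open once the timeout elapses
      if (state === 'open' && now() - lastFailure >= timeoutMs) state = 'half-open';
      if (success) {
        failures = 0;
        if (state === 'half-open') state = 'closed'; // probe succeeded: recover
      } else {
        failures++;
        lastFailure = now();
        // Any half-open failure, or hitting the threshold, opens the circuit
        if (state === 'half-open' || failures >= failureThreshold) state = 'open';
      }
    },
  };
}
```

Driving it with a fake clock - three failures trip it open, and a successful probe after the timeout closes it again - exercises every transition described above without waiting on real time.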
### 2.2 Retry Logic with Exponential Backoff
|
|
348
|
+
|
|
349
|
+
All major LLM providers recommend exponential backoff with jitter for handling rate limits and transient failures.
|
|
350
|
+
|
|
351
|
+
**Implementation:**
|
|
352
|
+
```typescript
|
|
353
|
+
interface RetryConfig {
|
|
354
|
+
maxRetries: number;
|
|
355
|
+
baseDelay: number; // e.g., 1000ms
|
|
356
|
+
maxDelay: number; // e.g., 60000ms
|
|
357
|
+
exponentialBase: number; // e.g., 2
|
|
358
|
+
jitter: boolean;
|
|
359
|
+
}
|
|
360
|
+
|
|
361
|
+
async function retryWithExponentialBackoff<T>(
|
|
362
|
+
fn: () => Promise<T>,
|
|
363
|
+
config: RetryConfig,
|
|
364
|
+
onRetry?: (attempt: number, delay: number, error: any) => void
|
|
365
|
+
): Promise<T> {
|
|
366
|
+
let lastError: any;
|
|
367
|
+
|
|
368
|
+
for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
|
|
369
|
+
try {
|
|
370
|
+
return await fn();
|
|
371
|
+
} catch (error: any) {
|
|
372
|
+
lastError = error;
|
|
373
|
+
|
|
374
|
+
// Don't retry on non-retryable errors
|
|
375
|
+
if (!isRetryableError(error)) {
|
|
376
|
+
throw error;
|
|
377
|
+
}
|
|
378
|
+
|
|
379
|
+
// Check if we've exhausted retries
|
|
380
|
+
if (attempt === config.maxRetries) {
|
|
381
|
+
break;
|
|
382
|
+
}
|
|
383
|
+
|
|
384
|
+
// Calculate delay with exponential backoff
|
|
385
|
+
let delay = Math.min(
|
|
386
|
+
config.baseDelay * Math.pow(config.exponentialBase, attempt),
|
|
387
|
+
config.maxDelay
|
|
388
|
+
);
|
|
389
|
+
|
|
390
|
+
// Add jitter to prevent thundering herd
|
|
391
|
+
if (config.jitter) {
|
|
392
|
+
delay = delay * (0.5 + Math.random() * 0.5);
|
|
393
|
+
}
|
|
394
|
+
|
|
395
|
+
// Check for retry-after header
|
|
396
|
+
const retryAfter = parseRetryAfterHeader(error);
|
|
397
|
+
if (retryAfter) {
|
|
398
|
+
delay = Math.max(delay, retryAfter * 1000);
|
|
399
|
+
}
|
|
400
|
+
|
|
401
|
+
onRetry?.(attempt + 1, delay, error);
|
|
402
|
+
|
|
403
|
+
await sleep(delay);
|
|
404
|
+
}
|
|
405
|
+
}
|
|
406
|
+
|
|
407
|
+
throw lastError;
|
|
408
|
+
}
|
|
409
|
+
|
|
410
|
+
function isRetryableError(error: any): boolean {
|
|
411
|
+
// Rate limiting (429)
|
|
412
|
+
if (error.status === 429) return true;
|
|
413
|
+
|
|
414
|
+
// Server errors (500-599)
|
|
415
|
+
if (error.status >= 500 && error.status < 600) return true;
|
|
416
|
+
|
|
417
|
+
// Network errors
|
|
418
|
+
if (error.code === 'ECONNRESET' ||
|
|
419
|
+
error.code === 'ETIMEDOUT' ||
|
|
420
|
+
error.code === 'ENOTFOUND') return true;
|
|
421
|
+
|
|
422
|
+
// Provider-specific timeout errors
|
|
423
|
+
if (error.message?.includes('timeout')) return true;
|
|
424
|
+
|
|
425
|
+
return false;
|
|
426
|
+
}
|
|
427
|
+
|
|
428
|
+
function parseRetryAfterHeader(error: any): number | null {
|
|
429
|
+
const retryAfter = error.response?.headers?.['retry-after'];
|
|
430
|
+
if (!retryAfter) return null;
|
|
431
|
+
|
|
432
|
+
// Handle both seconds and date formats
|
|
433
|
+
const seconds = parseInt(retryAfter, 10);
|
|
434
|
+
return isNaN(seconds) ? null : seconds;
|
|
435
|
+
}
|
|
436
|
+
|
|
437
|
+
function sleep(ms: number): Promise<void> {
|
|
438
|
+
return new Promise(resolve => setTimeout(resolve, ms));
|
|
439
|
+
}
|
|
440
|
+
```
|
|
441
|
+
|
|
442
|
+
**Provider-Specific Retry Guidelines (2026):**

**OpenAI:**
- Automatically retries with exponential backoff
- Respects retry-after headers
- Recommended: 3-5 retries with base delay 1s

**Anthropic (Claude):**
- Returns 429 for RPM, ITPM, OTPM limits
- Provides retry-after header for precise wait times
- Recommended: Combine retry-after with exponential backoff fallback

**Google (Gemini):**
- Uses token bucket algorithm
- 429 quota-exceeded on any dimension
- Recommended: Exponential backoff with jitter (gold standard)

**Common Pattern:**
```typescript
const retryConfig: RetryConfig = {
  maxRetries: 5,
  baseDelay: 1000,    // 1s
  maxDelay: 60000,    // 60s max
  exponentialBase: 2, // Double each time: 1s, 2s, 4s, 8s, 16s
  jitter: true        // Prevent thundering herd
};
```

**Sources:**
- [OpenAI Rate Limits Guide](https://platform.openai.com/docs/guides/rate-limits)
- [Claude API 429 Error Fix](https://www.aifreeapi.com/en/posts/fix-claude-api-429-rate-limit-error)
- [Gemini API Rate Limits 2026](https://www.aifreeapi.com/en/posts/gemini-api-rate-limit-explained)

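The config above implies a concrete delay schedule. A minimal sketch of that computation, assuming a `RetryConfig` shape like the one shown (the `computeDelay` helper is illustrative, not from a specific library):

```typescript
interface RetryConfig {
  maxRetries: number;
  baseDelay: number;       // ms
  maxDelay: number;        // ms
  exponentialBase: number;
  jitter: boolean;
}

// Delay before retry N: baseDelay * base^attempt, capped at maxDelay,
// with optional "full jitter" to spread out simultaneous retries.
function computeDelay(attempt: number, cfg: RetryConfig): number {
  const exponential = cfg.baseDelay * Math.pow(cfg.exponentialBase, attempt);
  const capped = Math.min(exponential, cfg.maxDelay);
  // Full jitter picks uniformly from [0, capped] to prevent thundering herd
  return cfg.jitter ? Math.random() * capped : capped;
}
```

With `jitter: false` this yields the 1s, 2s, 4s, 8s, 16s schedule from the comment above, capping at 60s.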
### 2.3 Fallback Chain Implementation

Combine retries with provider fallbacks for maximum reliability.

**Complete Pattern:**
```typescript
interface FallbackConfig {
  providers: ProviderConfig[];
  retryConfig: RetryConfig;
  circuitBreakers: Map<string, CircuitBreaker>;
}

async function executeWithFallback<T>(
  task: LLMTask,
  config: FallbackConfig
): Promise<T> {
  const errors: Array<{ provider: string; error: any }> = [];

  for (const provider of config.providers) {
    const breaker = config.circuitBreakers.get(provider.name);

    // Skip if circuit breaker is open
    if (breaker?.getState().state === 'open') {
      console.log(`Skipping ${provider.name} - circuit breaker open`);
      continue;
    }

    try {
      // Attempt with retries
      const attemptCall = () =>
        retryWithExponentialBackoff(
          () => callProvider(provider, task),
          config.retryConfig,
          (attempt, delay, error) => {
            console.log(
              `Retry ${attempt} for ${provider.name} after ${delay}ms`,
              error.message
            );
          }
        );

      // Route through the circuit breaker when one is registered
      const result = await (breaker ? breaker.execute(attemptCall) : attemptCall());

      console.log(`Success with ${provider.name}`);
      return result;

    } catch (error: any) {
      errors.push({ provider: provider.name, error });
      console.error(`Failed with ${provider.name}:`, error.message);

      // Continue to next provider in chain
      continue;
    }
  }

  // All providers failed
  throw new Error(
    `All providers failed:\n${errors.map(e =>
      `  ${e.provider}: ${e.error.message}`
    ).join('\n')}`
  );
}
```

**LiteLLM Router Flow:**
1. Request sent to `function_with_fallbacks`
2. Wraps request in try-catch for fallback handling
3. Passes to `function_with_retries` for retry logic
4. Calls base LiteLLM unified function
5. If model is cooling down (rate limited), skip to next
6. After `num_retries`, fallback to next model group
7. Fallbacks typically go from one model_name to another

**Benefits:**
- Maximum reliability across provider outages
- Automatic recovery from rate limits
- Graceful degradation under load
- Clear error trails for debugging

**Sources:**
- [LiteLLM Reliability Features](https://docs.litellm.ai/docs/completion/reliable_completions)
- [LangChain Fallbacks](https://python.langchain.com/v0.1/docs/guides/productionization/fallbacks/)

---

## 3. Caching Strategies

Caching provides the highest ROI for cost reduction (80-95% savings possible).

### 3.1 Prompt Caching (Provider-Level)

Provider-native prefix caching delivers 50-90% cost reduction with minimal implementation effort.

**2026 Provider Support:**

| Provider | Feature | Cost Reduction | Latency Improvement | TTL |
|----------|---------|---------------|-------------------|-----|
| **Anthropic** | Prompt Caching | Up to 90% | Up to 85% | 5 min |
| **OpenAI** | Automatic Caching | 50% (cached tokens) | Moderate | Automatic |
| **Google** | Context Caching | 90% (cached reads) | Significant | Variable |

**Anthropic Implementation:**
```typescript
const message = await anthropic.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      text: 'You are an AI assistant analyzing code repositories.',
    },
    {
      type: 'text',
      text: largeCodebaseContext, // This gets cached
      cache_control: { type: 'ephemeral' }
    }
  ],
  messages: [
    { role: 'user', content: 'Summarize this file: ...' }
  ]
});

// Usage breakdown:
// - Input tokens: 1000
// - Cache creation: 10000 (write at 1.25x cost)
// - Cache read: 10000 (read at 0.1x cost = 90% savings)
// - Output tokens: 500
```

**OpenAI Implementation:**
```typescript
const completion = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [
    {
      role: 'system',
      content: largeCodebaseContext // Automatically cached if >1024 tokens
    },
    {
      role: 'user',
      content: 'Summarize this file: ...'
    }
  ]
});

// OpenAI automatically caches prefixes >1024 tokens
// Cached input tokens: 50% discount
// No explicit cache control needed
```

**Best Practices:**
1. **Place static content first** - System prompts, codebase context, documentation
2. **Put dynamic content last** - User queries, file-specific questions
3. **Avoid caching tool definitions** - They change frequently, reducing cache hits
4. **Strategic block control** - Exclude dynamic sections from cache scope

**Recent Research (2026):**
- Strategic prompt cache block control provides more consistent benefits than naive full-context caching
- Reduces API costs by 45-80%
- Improves time to first token by 13-31%
- Chat apps with stable system prompts can cache 70%+ of input tokens

**Sources:**
- [Prompt Caching Infrastructure Guide](https://introl.com/blog/prompt-caching-infrastructure-llm-cost-latency-reduction-guide-2025)
- [Don't Break the Cache Research](https://arxiv.org/abs/2601.06007)

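The usage breakdown in the Anthropic example translates into cost roughly as follows. The helper and the per-million-token prices ($3 input, 1.25x for cache writes, 0.1x for cache reads, $15 output, a Sonnet-class price point) are illustrative assumptions, not quoted rates:

```typescript
interface CacheUsage {
  input: number;      // uncached input tokens
  cacheWrite: number; // tokens written to the cache (billed at 1.25x input)
  cacheRead: number;  // tokens read from the cache (billed at 0.1x input)
  output: number;
}

// Illustrative per-million-token prices: $3 in, $3.75 write, $0.30 read, $15 out
function requestCost(u: CacheUsage): number {
  const M = 1_000_000;
  return (u.input / M) * 3 + (u.cacheWrite / M) * 3.75 +
         (u.cacheRead / M) * 0.30 + (u.output / M) * 15;
}
```

Under these assumptions, the cache-hit request from the comment block (1000 fresh input + 10000 cache-read + 500 output tokens) costs about $0.0135 versus about $0.0405 fully uncached, the 90% savings on the cached portion.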
### 3.2 Semantic Caching

Semantic caching eliminates API calls entirely for semantically similar queries.

**Architecture:**
```typescript
interface SemanticCache {
  // L1: Exact match cache (in-memory)
  exactCache: Map<string, CachedResponse>;

  // L2: Semantic match cache (vector DB)
  vectorStore: VectorStore;
  similarityThreshold: number; // e.g., 0.85
}

interface CachedResponse {
  query: string;
  embedding: number[];
  response: string;
  timestamp: number;
  ttl: number;
  metadata: {
    provider: string;
    model: string;
    tokens: number;
    cost: number;
  };
}

async function getOrGenerateResponse(
  query: string,
  cache: SemanticCache,
  generateFn: () => Promise<string>
): Promise<{ response: string; cacheHit: boolean }> {
  // L1: Check exact match
  const exactMatch = cache.exactCache.get(query);
  if (exactMatch && !isExpired(exactMatch)) {
    return { response: exactMatch.response, cacheHit: true };
  }

  // L2: Check semantic similarity
  const queryEmbedding = await generateEmbedding(query);
  const similarQueries = await cache.vectorStore.similaritySearch(
    queryEmbedding,
    cache.similarityThreshold
  );

  if (similarQueries.length > 0) {
    const bestMatch = similarQueries[0];
    console.log(
      `Semantic cache hit (${bestMatch.similarity.toFixed(3)}) for: "${query}"`
    );
    return { response: bestMatch.response, cacheHit: true };
  }

  // Cache miss - generate new response
  const response = await generateFn();

  // Store in both caches
  const cached: CachedResponse = {
    query,
    embedding: queryEmbedding,
    response,
    timestamp: Date.now(),
    ttl: 3600000, // 1 hour
    metadata: {
      provider: 'anthropic',
      model: 'claude-3-5-sonnet',
      tokens: response.length / 4, // Rough estimate
      cost: calculateCost(response)
    }
  };

  cache.exactCache.set(query, cached);
  await cache.vectorStore.upsert(queryEmbedding, cached);

  return { response, cacheHit: false };
}
```

**Multi-Tier Performance:**
- **Microsecond-level** responses for exact matches (L1)
- **Sub-second** responses for semantic matches (L2)
- **Full API latency** only for novel queries

**Production Impact:**
- 31% of LLM queries exhibit semantic similarity to previous requests
- Chat apps with repetitive questions: 30% cache hit rate
- Combined with prompt caching: 80%+ total savings

**Implementation Options:**
- **GPTCache** - Open-source semantic cache with LangChain integration
- **Redis** - Semantic caching with vector similarity search
- **Qdrant** - Vector database optimized for semantic search
- **LiteLLM** - Built-in caching support

**Sources:**
- [Redis Semantic Caching](https://redis.io/blog/what-is-semantic-caching/)
- [GPTCache GitHub](https://github.com/zilliztech/GPTCache)
- [LLM Caching Implementation Guide](https://reintech.io/blog/how-to-implement-llm-caching-strategies-for-faster-response-times)

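The `similaritySearch` call above is backed by a vector comparison, typically cosine similarity. A minimal sketch of that comparison (the vector store itself is abstracted away here):

```typescript
// Cosine similarity between two embeddings: 1 = same direction, 0 = orthogonal
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

A query counts as a semantic hit when its similarity to a stored embedding meets the configured threshold (0.85 in the architecture above).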
### 3.3 Hybrid Caching Strategy

Combine prompt caching and semantic caching for maximum savings.

**Example Configuration:**
```typescript
const hybridCacheConfig = {
  // Semantic cache for query deduplication
  semanticCache: {
    enabled: true,
    provider: 'redis', // or 'qdrant', 'gptcache'
    similarityThreshold: 0.85,
    ttl: 3600000, // 1 hour
  },

  // Prompt cache for large context windows
  promptCache: {
    enabled: true,
    strategy: 'provider-native', // Use Anthropic/OpenAI/Google native caching
    systemPromptCaching: true,
    codebaseContextCaching: true,
    toolDefinitionCaching: false, // Avoid caching dynamic tools
  },

  // Cache invalidation
  invalidation: {
    onCodebaseChange: true,
    maxAge: 86400000, // 24 hours
  }
};
```

**Combined Savings Example:**
```
Scenario: Code summarization service
- 1000 requests/day
- Average input: 15,000 tokens (10k context + 5k query)
- Average output: 500 tokens

Without caching:
- Cost: 1000 * ((15000/1M * $3) + (500/1M * $15)) = $52.50/day

With prompt caching (70% of input tokens cached):
- Cost: 1000 * ((4500/1M * $3) + (10500/1M * $0.30) + (500/1M * $15)) = $24/day
- Savings: 54%

With semantic caching (30% query deduplication):
- Actual LLM calls: 700
- Cost: 700 * calculation above = $16.80/day
- Additional savings: 30%

Total savings: 68%

With both + optimized routing:
- Use cheaper models for 50% of queries
- Final cost: ~$8/day
- Total savings: 85%
```

**Sources:**
- [Ultimate Guide to LLM Caching](https://latitude-blog.ghost.io/blog/ultimate-guide-to-llm-caching-for-low-latency-ai/)
- [Effective LLM Caching](https://www.helicone.ai/blog/effective-llm-caching)

---

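The arithmetic in the savings example can be checked with a small helper. Prices and rates mirror the scenario above ($3/MTok fresh input, $0.30/MTok cached reads, $15/MTok output) and are purely illustrative:

```typescript
// Daily cost for the scenario above; dedupRate is the fraction of
// requests answered from the semantic cache without an LLM call.
function dailyCost(s: {
  requests: number;
  freshIn: number;    // uncached input tokens per request
  cachedIn: number;   // cache-read input tokens per request
  out: number;        // output tokens per request
  dedupRate?: number;
}): number {
  const calls = s.requests * (1 - (s.dedupRate ?? 0));
  const perCall = (s.freshIn / 1e6) * 3 + (s.cachedIn / 1e6) * 0.30 + (s.out / 1e6) * 15;
  return calls * perCall;
}
```

Plugging in the scenario: no caching gives $52.50/day; prompt caching (4,500 fresh / 10,500 cached) gives ≈$24.15/day; adding 30% deduplication gives ≈$16.90/day (the figures in the text are rounded).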
## 4. Provider Abstraction Layer

### 4.1 Design Principles

Modern provider abstraction layers follow these principles:

1. **Unified Interface** - Single API for all providers
2. **Type Safety** - Strong TypeScript types prevent runtime errors
3. **Provider-Specific Features** - Gracefully handle unique capabilities
4. **Streaming Support** - First-class support for streaming responses
5. **Tool/Function Calling** - Normalized tool use across providers
6. **Observability** - Built-in logging, metrics, and tracing

**Core Interface:**
```typescript
interface LLMProvider {
  name: string;

  // Chat completion
  complete(params: CompletionParams): Promise<CompletionResponse>;

  // Streaming
  stream(params: CompletionParams): AsyncIterator<StreamChunk>;

  // Embeddings
  embed(text: string): Promise<number[]>;

  // Health check
  healthCheck(): Promise<HealthStatus>;

  // Cost estimation
  estimateCost(params: CompletionParams): CostEstimate;
}

interface CompletionParams {
  model: string;
  messages: Message[];
  temperature?: number;
  maxTokens?: number;
  tools?: Tool[];

  // Provider-specific options
  providerOptions?: {
    anthropic?: {
      cacheControl?: CacheControl[];
    };
    openai?: {
      responseFormat?: { type: 'json_object' };
    };
    google?: {
      safetySettings?: SafetySetting[];
    };
  };
}

interface CompletionResponse {
  content: string;
  role: 'assistant';
  finishReason: 'stop' | 'length' | 'tool_calls' | 'content_filter';
  usage: {
    inputTokens: number;
    outputTokens: number;
    cachedTokens?: number;
  };
  metadata: {
    provider: string;
    model: string;
    latencyMs: number;
    cost: number;
  };
  toolCalls?: ToolCall[];
}
```

**Sources:**
- [Multi-Provider LLM Orchestration 2026](https://dev.to/ash_dubai/multi-provider-llm-orchestration-in-production-a-2026-guide-1g10)
- [TypeScript & LLMs: Lessons from Production](https://johnchildseddy.medium.com/typescript-llms-lessons-learned-from-9-months-in-production-4910485e3272)

### 4.2 Production Implementations

**Vercel AI SDK:**
```typescript
import { generateText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
import { openai } from '@ai-sdk/openai';
import { google } from '@ai-sdk/google';

// Unified interface across providers
const result = await generateText({
  model: anthropic('claude-3-5-sonnet-20241022'),
  messages: [
    { role: 'user', content: 'Summarize this code...' }
  ],
  // Automatic fallback via AI Gateway
  providerOptions: {
    fallbacks: [
      { model: openai('gpt-4o') },
      { model: google('gemini-2.5-flash') }
    ]
  }
});
```

**LiteLLM:**
```typescript
import litellm from 'litellm';

// Unified OpenAI-compatible interface
const response = await litellm.completion({
  model: 'claude-3-5-sonnet-20241022',
  messages: [{ role: 'user', content: 'Hello' }],

  // Router config for fallbacks
  fallbacks: ['gpt-4o', 'gemini-2.5-flash'],
  num_retries: 3,
  timeout: 30
});

// Automatic cost tracking
console.log(response._hidden_params.response_cost);
```

**Custom TypeScript Abstraction:**
```typescript
class UnifiedLLMClient {
  private providers: Map<string, LLMProvider>;
  private router: ProviderRouter;
  private cache: SemanticCache;
  private metrics: MetricsCollector;

  constructor(config: UnifiedLLMConfig) {
    this.providers = new Map([
      ['anthropic', new AnthropicProvider(config.anthropic)],
      ['openai', new OpenAIProvider(config.openai)],
      ['google', new GoogleProvider(config.google)]
    ]);

    this.router = new ProviderRouter(config.routing);
    this.cache = new SemanticCache(config.caching);
    this.metrics = new MetricsCollector();
  }

  async complete(
    params: CompletionParams,
    options?: {
      preferredProvider?: string;
      enableFallback?: boolean;
      enableCaching?: boolean;
    }
  ): Promise<CompletionResponse> {
    const startTime = Date.now();

    // Check semantic cache first
    if (options?.enableCaching) {
      const cached = await this.cache.get(params.messages);
      if (cached) {
        this.metrics.recordCacheHit();
        return cached;
      }
    }

    // Select provider via routing logic
    const provider = options?.preferredProvider
      ? this.providers.get(options.preferredProvider)!
      : await this.router.selectProvider(params);

    try {
      // Execute with fallback
      const response = await this.executeWithFallback(
        provider,
        params,
        options?.enableFallback ?? true
      );

      // Record metrics
      const latency = Date.now() - startTime;
      this.metrics.record({
        provider: response.metadata.provider,
        latency,
        tokens: response.usage.inputTokens + response.usage.outputTokens,
        cost: response.metadata.cost,
        cached: false
      });

      // Update cache
      if (options?.enableCaching) {
        await this.cache.set(params.messages, response);
      }

      return response;

    } catch (error) {
      this.metrics.recordError(provider.name, error);
      throw error;
    }
  }

  private async executeWithFallback(
    primary: LLMProvider,
    params: CompletionParams,
    enableFallback: boolean
  ): Promise<CompletionResponse> {
    // Implement fallback chain logic
    // (See section 2.3 for full implementation)
  }
}
```

**Key Libraries (2026):**
- **Vercel AI SDK** - Production-ready, TypeScript-first, excellent DX
- **LiteLLM** - Python-focused, 100+ provider support, proxy mode
- **AnyLLM** - TypeScript abstraction layer for seamless provider switching
- **ModelFusion** - Vercel's abstraction for AI model integration

**Sources:**
- [Vercel AI SDK Docs](https://ai-sdk.dev/)
- [AnyLLM GitHub](https://github.com/fkesheh/any-llm)
- [ModelFusion GitHub](https://github.com/vercel/modelfusion)

### 4.3 Handling Provider-Specific Features

Not all providers support all features - graceful degradation is essential.

**Feature Matrix (2026):**

| Feature | Anthropic | OpenAI | Google | Handling Strategy |
|---------|-----------|--------|--------|-------------------|
| Tool Calling | ✅ Native | ✅ Native | ✅ Native | Universal support |
| JSON Mode | ✅ Native | ✅ Native | ✅ Native | Universal support |
| Vision | ✅ Claude 3+ | ✅ GPT-4o | ✅ Gemini | Universal support |
| Prompt Caching | ✅ Explicit | ✅ Automatic | ✅ Context cache | Abstract with unified API |
| Streaming | ✅ SSE | ✅ SSE | ✅ SSE | Universal support |
| System Messages | ✅ Separate | ✅ In messages | ✅ Separate | Normalize in abstraction |
| Max Output Tokens | ✅ 8192 | ✅ 16384 | ✅ 8192 | Enforce limits per provider |

**Graceful Degradation Pattern:**
```typescript
async function executeWithFeatureDetection(
  provider: LLMProvider,
  params: CompletionParams
): Promise<CompletionResponse> {
  const capabilities = provider.getCapabilities();

  // Adapt parameters to provider capabilities
  const adaptedParams = { ...params };

  // Handle tool calling
  if (params.tools && !capabilities.toolCalling) {
    console.warn(
      `Provider ${provider.name} doesn't support tool calling, ` +
      `using function injection instead`
    );
    adaptedParams.messages = injectToolsAsContext(
      params.messages,
      params.tools
    );
    delete adaptedParams.tools;
  }

  // Handle JSON mode
  if (params.responseFormat?.type === 'json_object') {
    if (capabilities.jsonMode) {
      // Use native JSON mode
      adaptedParams.providerOptions = {
        ...adaptedParams.providerOptions,
        responseFormat: { type: 'json_object' }
      };
    } else {
      // Fallback to prompt engineering
      console.warn(
        `Provider ${provider.name} doesn't support native JSON mode, ` +
        `using prompt-based approach`
      );
      adaptedParams.messages[0].content +=
        '\n\nIMPORTANT: Respond with valid JSON only.';
    }
  }

  // Handle prompt caching
  if (params.enableCaching && !capabilities.promptCaching) {
    console.warn(
      `Provider ${provider.name} doesn't support prompt caching`
    );
    // Fall back to semantic caching layer
  }

  return provider.complete(adaptedParams);
}
```

**Sources:**
- [Building Bridges to LLMs: Moving Beyond Over Abstraction](https://hatchworks.com/blog/gen-ai/llm-projects-production-abstraction/)
- [AI SDK Provider Management](https://ai-sdk.dev/docs/ai-sdk-core/provider-management)

---

## 5. Monitoring and Observability

### 5.1 Key Metrics to Track

**Essential Metrics:**
```typescript
interface LLMMetrics {
  // Performance
  latency: {
    p50: number;
    p95: number;
    p99: number;
  };

  // Cost
  cost: {
    total: number;
    perRequest: number;
    perProvider: Map<string, number>;
    perModel: Map<string, number>;
    perUser?: Map<string, number>;
  };

  // Usage
  tokens: {
    input: number;
    output: number;
    cached: number;
    total: number;
  };

  // Reliability
  reliability: {
    successRate: number;
    errorRate: number;
    retryRate: number;
    fallbackRate: number;
  };

  // Caching
  cache: {
    hitRate: number;
    semanticHitRate: number;
    promptCacheHitRate: number;
    savings: number; // Dollar amount saved
  };

  // Provider Health
  providerHealth: Map<string, {
    availability: number;
    avgLatency: number;
    errorRate: number;
    circuitBreakerState: string;
  }>;
}
```

**Critical Metrics (2026):**
1. **Tokens per request** - Normalize usage patterns
2. **Cost per user/team/feature** - Attribution for chargeback
3. **Cache hit ratio** - Reveal spend savings potential
4. **Provider availability** - Track SLA compliance
5. **Error rate by provider** - Identify stability issues
6. **Latency percentiles** - User experience monitoring

**Sources:**
- [Langfuse Token and Cost Tracking](https://langfuse.com/docs/observability/features/token-and-cost-tracking)
- [Tracking LLM Token Usage](https://www.traceloop.com/blog/from-bills-to-budgets-how-to-track-llm-token-usage-and-cost-per-user)

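The latency percentiles in the metrics interface can be computed from raw latency samples; a simple nearest-rank sketch (many metrics libraries use streaming estimators instead, but this shows the semantics):

```typescript
// Nearest-rank percentile: p in (0, 100], samples need not be pre-sorted
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, Math.min(sorted.length - 1, rank))];
}
```

For example, p99 over a window of per-request latencies surfaces tail behavior that an average hides.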
### 5.2 Observability Platforms

**Top Platforms (2026):**

**1. Langfuse (Open Source)**
```typescript
import { Langfuse } from 'langfuse';

const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY
});

const trace = langfuse.trace({
  name: 'code-summarization',
  userId: 'user-123',
  metadata: { repository: 'mdcontext' }
});

const generation = trace.generation({
  name: 'summarize-file',
  model: 'claude-3-5-sonnet-20241022',
  modelParameters: {
    temperature: 0.7,
    maxTokens: 1024
  },
  input: messages
});

// Make LLM call
const response = await anthropic.messages.create(...);

// Log result
generation.end({
  output: response.content,
  usage: {
    promptTokens: response.usage.input_tokens,
    completionTokens: response.usage.output_tokens
  },
  metadata: {
    cost: calculateCost(response.usage),
    cached: response.usage.cache_read_input_tokens > 0
  }
});
```

**Features:**
- Automatic cost tracking for 100+ models
- User/session-level attribution
- Trace-level debugging
- Open-source and self-hostable

**2. Datadog LLM Observability**
```typescript
import { datadogLLM } from '@datadog/llm-observability';

// Automatic tracing
const traced = datadogLLM.trace(anthropic.messages.create);

const response = await traced({
  model: 'claude-3-5-sonnet-20241022',
  messages: [...]
});

// Automatic metrics:
// - llm.request.duration
// - llm.request.tokens.input
// - llm.request.tokens.output
// - llm.request.cost
```

**Features:**
- End-to-end AI agent tracing
- Cloud cost integration
- Anomaly detection
- Enterprise-grade dashboards

**3. Traceloop/OpenLLMetry**
```typescript
import { Traceloop } from '@traceloop/node-server-sdk';

Traceloop.initialize({
  apiKey: process.env.TRACELOOP_API_KEY,
  baseUrl: 'https://api.traceloop.com'
});

// Automatic instrumentation via OpenTelemetry
const response = await anthropic.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  messages: [...],
  metadata: {
    user_id: 'user-123', // Automatic attribution
    feature: 'code-summarization'
  }
});
```

**Features:**
- OpenTelemetry-based (industry standard)
- Automatic user attribution
- Rich contextual data
- Integration with existing observability stack

**4. Portkey**
```typescript
import Portkey from 'portkey-ai';

const portkey = new Portkey({
  apiKey: process.env.PORTKEY_API_KEY,
  virtualKey: process.env.ANTHROPIC_VIRTUAL_KEY
});

const response = await portkey.chatCompletions.create({
  model: 'claude-3-5-sonnet-20241022',
  messages: [...]
});

// Dashboard shows:
// - Usage by model, query, token
// - Cost tracking and forecasts
// - Near-real-time monitoring
// - Automatic intervention on limits
```

**Features:**
- Finance/leadership dashboards
- Usage breakdowns and trends
- Spend forecasting
- Automatic budget controls

**Sources:**
- [Langfuse Documentation](https://langfuse.com/docs/observability/features/token-and-cost-tracking)
- [Datadog LLM Observability](https://www.datadoghq.com/product/llm-observability/)
- [Traceloop Blog](https://www.traceloop.com/blog/granular-llm-monitoring-for-tracking-token-usage-and-latency-per-user-and-feature)
- [Portkey Token Tracking](https://portkey.ai/blog/tracking-llm-token-usage-across-providers-teams-and-workloads/)

### 5.3 Monitoring Implementation

**Complete Monitoring Setup:**
```typescript
import { Langfuse } from 'langfuse';
import { CloudWatch } from '@aws-sdk/client-cloudwatch';

class LLMMonitor {
  private langfuse: Langfuse;
  private cloudwatch: CloudWatch;
  private metrics: Map<string, MetricBuffer>;

  constructor() {
    this.langfuse = new Langfuse({
      publicKey: process.env.LANGFUSE_PUBLIC_KEY,
      secretKey: process.env.LANGFUSE_SECRET_KEY
    });

    this.cloudwatch = new CloudWatch({ region: 'us-east-1' });
    this.metrics = new Map();
  }

  async trackRequest(
    provider: string,
    model: string,
    params: CompletionParams,
    response: CompletionResponse,
    metadata: {
      userId?: string;
      feature?: string;
      cacheHit?: boolean;
    }
  ): Promise<void> {
    const trace = this.langfuse.trace({
      name: metadata.feature || 'llm-request',
      userId: metadata.userId,
      metadata: {
        provider,
        model,
        cacheHit: metadata.cacheHit || false
      }
    });

    const generation = trace.generation({
      name: 'completion',
      model,
      input: params.messages,
      output: response.content,
      usage: {
        promptTokens: response.usage.inputTokens,
        completionTokens: response.usage.outputTokens,
        totalTokens:
          response.usage.inputTokens + response.usage.outputTokens
      },
      metadata: {
        latencyMs: response.metadata.latencyMs,
        cost: response.metadata.cost,
        cachedTokens: response.usage.cachedTokens || 0,
        finishReason: response.finishReason
      }
    });

    generation.end();

    // Send metrics to CloudWatch
    await this.cloudwatch.putMetricData({
      Namespace: 'LLM/Production',
      MetricData: [
        {
          MetricName: 'Latency',
          Value: response.metadata.latencyMs,
          Unit: 'Milliseconds',
          Dimensions: [
            { Name: 'Provider', Value: provider },
            { Name: 'Model', Value: model }
          ]
        },
        {
          MetricName: 'Cost',
          Value: response.metadata.cost,
          Unit: 'None',
          Dimensions: [
            { Name: 'Provider', Value: provider },
            { Name: 'Feature', Value: metadata.feature || 'unknown' }
          ]
        },
        {
          MetricName: 'TokensUsed',
          Value: response.usage.inputTokens + response.usage.outputTokens,
          Unit: 'Count',
          Dimensions: [
            { Name: 'Provider', Value: provider }
          ]
        },
        {
          MetricName: 'CacheHit',
          Value: metadata.cacheHit ? 1 : 0,
          Unit: 'Count'
        }
      ]
    });
  }

  async trackError(
    provider: string,
    model: string,
    error: any,
    metadata: {
      retryAttempt?: number;
      fallbackTriggered?: boolean;
    }
  ): Promise<void> {
    // Log to Langfuse
    this.langfuse.trace({
      name: 'llm-error',
      metadata: {
        provider,
        model,
        error: error.message,
        statusCode: error.status,
        retryAttempt: metadata.retryAttempt,
        fallbackTriggered: metadata.fallbackTriggered
      }
    });

    // Send error metrics
    await this.cloudwatch.putMetricData({
      Namespace: 'LLM/Production',
      MetricData: [
        {
          MetricName: 'Errors',
          Value: 1,
          Unit: 'Count',
          Dimensions: [
            { Name: 'Provider', Value: provider },
            // Dimension values must be strings
            { Name: 'ErrorType', Value: String(error.status ?? 'unknown') }
          ]
        }
      ]
    });
  }

  async getProviderHealth(): Promise<Map<string, ProviderHealth>> {
    // Query CloudWatch metrics (getErrorRate / getAvgLatency wrap
    // CloudWatch metric queries; implementations not shown)
    const endTime = new Date();
    const startTime = new Date(endTime.getTime() - 3600000); // Last hour

    const healthMap = new Map<string, ProviderHealth>();

    for (const provider of ['anthropic', 'openai', 'google']) {
      const errorRate = await this.getErrorRate(provider, startTime, endTime);
      const avgLatency = await this.getAvgLatency(provider, startTime, endTime);

      healthMap.set(provider, {
        availability: 1 - errorRate,
        avgLatency,
        errorRate,
        status: errorRate > 0.05 ? 'degraded' : 'healthy'
      });
    }

    return healthMap;
  }
}
```
**Dashboard Recommendations:**
1. **Real-time costs** by provider, model, user, feature
2. **Latency percentiles** (p50, p95, p99) per provider
3. **Cache hit rates** for cost optimization tracking
4. **Error rates and circuit breaker states** for reliability
5. **Token usage trends** for capacity planning
6. **Provider health scores** for SLA monitoring

**Sources:**
- [LLM Observability Tools 2026](https://lakefs.io/blog/llm-observability-tools/)
- [Best LLM Observability Tools](https://www.firecrawl.dev/blog/best-llm-observability-tools)

---
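Recommendation 2 (latency percentiles) can be computed directly from buffered samples when the metrics backend does not aggregate them for you. A minimal sketch using the nearest-rank method; the helper names are illustrative, not part of any SDK referenced above:

```typescript
// Nearest-rank percentile: sort a copy, pick the value at ceil(p/100 * n) - 1.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

// Summarize a window of latency samples into the dashboard percentiles.
function latencySummary(samples: number[]) {
  return {
    p50: percentile(samples, 50),
    p95: percentile(samples, 95),
    p99: percentile(samples, 99)
  };
}
```

In practice CloudWatch or Langfuse computes these server-side; the sketch is useful for in-process dashboards or tests.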
## 6. Configuration Best Practices

### 6.1 Environment-Based Configuration

**Production Configuration Pattern:**
```typescript
interface LLMConfig {
  providers: {
    anthropic: {
      apiKey: string;
      baseURL?: string;
      timeout: number;
      maxRetries: number;
    };
    openai: {
      apiKey: string;
      organization?: string;
      timeout: number;
      maxRetries: number;
    };
    google: {
      apiKey: string;
      timeout: number;
      maxRetries: number;
    };
  };

  routing: {
    strategy: 'cost' | 'quality' | 'latency' | 'custom';
    fallbackEnabled: boolean;
    circuitBreaker: {
      enabled: boolean;
      failureThreshold: number;
      successThreshold: number;
      timeout: number;
    };
  };

  caching: {
    semantic: {
      enabled: boolean;
      provider: 'redis' | 'qdrant' | 'memory';
      similarityThreshold: number;
      ttl: number;
    };
    prompt: {
      enabled: boolean;
      strategy: 'provider-native' | 'custom';
    };
  };

  observability: {
    enabled: boolean;
    provider: 'langfuse' | 'datadog' | 'custom';
    sampleRate: number; // 0.0 - 1.0
    logLevel: 'debug' | 'info' | 'warn' | 'error';
  };

  limits: {
    maxTokensPerRequest: number;
    maxCostPerRequest: number;
    rateLimitPerUser?: number;
  };
}
```
**Environment-Specific Configs:**
```typescript
// config/production.ts
export const productionConfig: LLMConfig = {
  providers: {
    anthropic: {
      apiKey: process.env.ANTHROPIC_API_KEY!,
      timeout: 30000,
      maxRetries: 3
    },
    openai: {
      apiKey: process.env.OPENAI_API_KEY!,
      organization: process.env.OPENAI_ORG,
      timeout: 30000,
      maxRetries: 3
    },
    google: {
      apiKey: process.env.GOOGLE_API_KEY!,
      timeout: 30000,
      maxRetries: 3
    }
  },

  routing: {
    strategy: 'cost',
    fallbackEnabled: true,
    circuitBreaker: {
      enabled: true,
      failureThreshold: 10,
      successThreshold: 3,
      timeout: 60000
    }
  },

  caching: {
    semantic: {
      enabled: true,
      provider: 'redis',
      similarityThreshold: 0.85,
      ttl: 3600000 // 1 hour
    },
    prompt: {
      enabled: true,
      strategy: 'provider-native'
    }
  },

  observability: {
    enabled: true,
    provider: 'langfuse',
    sampleRate: 1.0, // Log everything in prod
    logLevel: 'info'
  },

  limits: {
    maxTokensPerRequest: 100000,
    maxCostPerRequest: 1.00, // $1 max per request
    rateLimitPerUser: 100 // per hour
  }
};

// config/development.ts
export const developmentConfig: LLMConfig = {
  ...productionConfig,

  routing: {
    ...productionConfig.routing,
    strategy: 'latency' // Prefer fast responses in dev
  },

  caching: {
    semantic: {
      enabled: false, // Disable caching in dev
      provider: 'memory',
      similarityThreshold: 0.85,
      ttl: 600000
    },
    prompt: {
      enabled: false,
      strategy: 'provider-native' // Still required by the interface
    }
  },

  observability: {
    enabled: true,
    provider: 'langfuse',
    sampleRate: 0.1, // Sample 10% in dev
    logLevel: 'debug'
  }
};
```
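The production config above relies on non-null assertions (`!`) for API keys, which pushes missing-variable failures into the first request. A small startup check fails fast instead; helper names here are illustrative:

```typescript
// Return the names of required environment variables that are missing or blank.
function missingEnvVars(
  required: string[],
  env: Record<string, string | undefined> = process.env
): string[] {
  return required.filter((name) => !env[name] || env[name]!.trim() === '');
}

// Fail at boot rather than on the first LLM call.
function assertEnv(required: string[]): void {
  const missing = missingEnvVars(required);
  if (missing.length > 0) {
    throw new Error(`Missing required env vars: ${missing.join(', ')}`);
  }
}

// Usage at application startup:
// assertEnv(['ANTHROPIC_API_KEY', 'OPENAI_API_KEY', 'GOOGLE_API_KEY']);
```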
### 6.2 Feature Flags for Gradual Rollout

```typescript
interface FeatureFlags {
  enableSemanticCache: boolean;
  enablePromptCache: boolean;
  enableMultiProvider: boolean;
  enableCircuitBreaker: boolean;
  newProviders: string[]; // e.g., ['deepseek', 'mistral']
  costOptimizationLevel: 'conservative' | 'balanced' | 'aggressive';
}

// Use LaunchDarkly, Flagsmith, or similar
const flags = await featureFlagClient.getFlags('llm-system');

if (flags.enableSemanticCache) {
  // Enable semantic caching
}

if (flags.newProviders.includes('deepseek')) {
  // Add DeepSeek to provider chain
  providerChain.push({
    name: 'deepseek',
    priority: 4,
    models: {
      cheap: 'deepseek-v3',
      standard: 'deepseek-r1',
      premium: 'deepseek-r1'
    }
  });
}
```
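The flags above are on/off switches; a gradual rollout usually also needs a stable percentage bucket per user, so the same user always sees the same behavior as the percentage ramps. A sketch using an FNV-1a hash — a hypothetical helper, independent of any particular flag vendor:

```typescript
// Stable 32-bit FNV-1a hash of a string.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Deterministically place a user in [0, 100); enable the flag when the
// bucket falls below the rollout percentage. Same user, same answer.
function inRollout(userId: string, flag: string, percent: number): boolean {
  const bucket = fnv1a(`${flag}:${userId}`) % 100;
  return bucket < percent;
}
```

Hashing `flag:userId` (rather than the user ID alone) decorrelates buckets across flags, so ramping one feature does not always hit the same cohort.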
### 6.3 Model Selection Matrix

**Production Model Selection Guide:**

| Use Case | Recommended Model | Provider | Cost (1M tokens) | Notes |
|----------|-------------------|----------|------------------|-------|
| **Simple Classification** | gpt-4o-mini | OpenAI | $0.15/$0.60 | Fastest, cheapest |
| **Standard Q&A** | gemini-2.5-flash | Google | $0.30/$2.50 | Best value |
| **Code Generation** | claude-3-5-sonnet | Anthropic | $3.00/$15.00 | Industry leading |
| **Complex Reasoning** | claude-opus-4.5 | Anthropic | $15.00/$75.00 | Best quality |
| **Cost-Optimized** | deepseek-v3 | DeepSeek | $0.27/$1.10 | 95% cheaper |
| **Large Context** | gemini-2.5-pro | Google | $1.25/$10.00 | 1M tokens |
| **Local Deployment** | qwen-2.5-coder-32b | Local | $0 | Self-hosted |

**Sources:**
- [LLM Pricing 2026](https://research.aimultiple.com/llm-pricing/)
- [Alternative Providers Research](./alternative-providers-2026.md)
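One way to encode this matrix in code, kept consistent with the `selectModel` shape exercised in the tests of section 8.2. The threshold for "large context" and the fallthrough default are illustrative assumptions, not part of the matrix itself:

```typescript
interface TaskProfile {
  complexity: 'simple' | 'standard' | 'complex';
  estimatedTokens: number;
  requiresReasoning: boolean;
  requiresToolUse: boolean;
}

interface ModelChoice {
  provider: string;
  model: string;
}

// Minimal encoding of the selection matrix; a production router would also
// weigh latency targets, provider health, and per-request cost caps.
function selectModel(task: TaskProfile): ModelChoice {
  if (task.estimatedTokens > 200_000) {
    return { provider: 'google', model: 'gemini-2.5-pro' }; // large context
  }
  if (task.complexity === 'complex' || task.requiresReasoning || task.requiresToolUse) {
    return { provider: 'anthropic', model: 'claude-3-5-sonnet' };
  }
  return { provider: 'google', model: 'gemini-2.5-flash' }; // best-value default
}
```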

---
## 7. Implementation Roadmap for mdcontext

### 7.1 Phase 1: Basic Multi-Provider Support (Week 1)

**Goals:**
- Add support for Anthropic, OpenAI, Google providers
- Implement basic fallback chain
- Add configuration system

**Tasks:**
1. Create provider abstraction layer
2. Implement unified interface
3. Add environment-based configuration
4. Basic error handling with retries

**Code Changes:**
```typescript
// src/llm/providers/base.ts
export interface LLMProvider {
  complete(params: CompletionParams): Promise<CompletionResponse>;
  healthCheck(): Promise<boolean>;
}

// src/llm/providers/anthropic.ts
export class AnthropicProvider implements LLMProvider {
  async complete(params: CompletionParams): Promise<CompletionResponse> {
    // Implementation
  }
}

// src/llm/router.ts
export class ProviderRouter {
  async complete(params: CompletionParams): Promise<CompletionResponse> {
    // Try primary provider with fallback
  }
}
```
### 7.2 Phase 2: Cost Optimization (Week 2)

**Goals:**
- Implement prompt caching
- Add cost-based routing
- Track token usage and costs

**Tasks:**
1. Enable Anthropic prompt caching for codebase context
2. Implement cost estimation per provider
3. Add Langfuse for cost tracking
4. Create cost optimization routing logic

**Expected Savings:**
- 50-70% cost reduction from prompt caching
- 20-30% additional from smart routing
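Task 2 (cost estimation per provider) can be sketched as a lookup over the §6.3 price table. The 90% cached-input discount below is an assumption for illustration — actual cache discounts vary by provider:

```typescript
// Prices in USD per million tokens (input/output), from the §6.3 matrix.
const PRICES: Record<string, { input: number; output: number }> = {
  'gpt-4o-mini': { input: 0.15, output: 0.6 },
  'gemini-2.5-flash': { input: 0.3, output: 2.5 },
  'claude-3-5-sonnet': { input: 3.0, output: 15.0 }
};

// Estimated request cost in USD. Cached input tokens are billed here at 10%
// of the fresh-input rate (an assumption; check each provider's pricing).
function estimateCost(
  model: string,
  inputTokens: number,
  outputTokens: number,
  cachedTokens = 0
): number {
  const price = PRICES[model];
  if (!price) throw new Error(`unknown model: ${model}`);
  const freshInput = inputTokens - cachedTokens;
  return (
    (freshInput * price.input +
      cachedTokens * price.input * 0.1 +
      outputTokens * price.output) / 1_000_000
  );
}
```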
### 7.3 Phase 3: Reliability (Week 3)

**Goals:**
- Add circuit breakers
- Implement exponential backoff
- Health monitoring

**Tasks:**
1. Implement circuit breaker per provider
2. Add exponential backoff with jitter
3. Health check endpoints
4. Provider health dashboard
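Task 2's exponential backoff with jitter reduces to a pure delay function. This sketch uses "full jitter" (a uniform draw from zero up to the exponential ceiling); the base and cap are illustrative defaults:

```typescript
// Full-jitter exponential backoff: delay is uniform in
// [0, min(capMs, baseMs * 2^attempt)]. The random source is injectable
// so the function stays deterministic in tests.
function backoffDelayMs(
  attempt: number,
  baseMs = 1000,
  capMs = 60_000,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return random() * ceiling;
}
```

Full jitter spreads retries across the whole window, which avoids synchronized retry storms after a provider-wide outage.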
### 7.4 Phase 4: Advanced Caching (Week 4)

**Goals:**
- Semantic caching layer
- Combined caching strategy
- Cache invalidation logic

**Tasks:**
1. Set up Redis/Qdrant for semantic cache
2. Implement similarity search
3. Add cache hit rate tracking
4. Configure TTL and invalidation

**Expected Savings:**
- Additional 20-30% from semantic caching
- Total: 70-90% cost reduction
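Task 2 (similarity search) ultimately reduces to comparing embedding vectors against a threshold; Redis and Qdrant do this server-side, but the core operation is cosine similarity. A minimal sketch, with 0.85 matching the `similarityThreshold` in the production config above:

```typescript
// Cosine similarity between two embedding vectors of equal dimension.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// A query counts as a semantic cache hit when its embedding is close enough
// to a stored query's embedding.
function isCacheHit(query: number[], cached: number[], threshold = 0.85): boolean {
  return cosineSimilarity(query, cached) >= threshold;
}
```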
### 7.5 Phase 5: Observability (Ongoing)

**Goals:**
- Comprehensive monitoring
- Cost attribution
- Performance tracking

**Tasks:**
1. Langfuse integration complete
2. CloudWatch metrics
3. Cost per feature/user tracking
4. Alerting on budget thresholds

---
## 8. Testing Strategy

### 8.1 Provider Resilience Testing

```typescript
describe('Provider Fallback', () => {
  it('should fallback to secondary provider on primary failure', async () => {
    // Mock primary provider failure
    mockAnthropicProvider.complete.mockRejectedValue(
      new Error('Rate limit exceeded')
    );

    mockOpenAIProvider.complete.mockResolvedValue({
      content: 'Success',
      usage: { inputTokens: 100, outputTokens: 50 }
    });

    const result = await router.complete({
      messages: [{ role: 'user', content: 'Test' }]
    });

    expect(result.metadata.provider).toBe('openai');
    expect(mockAnthropicProvider.complete).toHaveBeenCalledTimes(1);
    expect(mockOpenAIProvider.complete).toHaveBeenCalledTimes(1);
  });

  it('should respect circuit breaker state', async () => {
    // Trigger circuit breaker
    for (let i = 0; i < 10; i++) {
      try {
        await router.complete({ messages: [...] });
      } catch {}
    }

    const health = circuitBreaker.getState();
    expect(health.state).toBe('open');

    // Should skip provider with open circuit
    const result = await router.complete({ messages: [...] });
    expect(result.metadata.provider).not.toBe('anthropic');
  });
});
```
### 8.2 Cost Validation Testing

```typescript
describe('Cost Optimization', () => {
  it('should route simple tasks to cheap models', () => {
    const model = selectModel({
      complexity: 'simple',
      estimatedTokens: 1000,
      requiresReasoning: false,
      requiresToolUse: false
    });

    expect(model.provider).toBe('google');
    expect(model.model).toBe('gemini-2.5-flash');
  });

  it('should use prompt caching for large contexts', async () => {
    const params = {
      messages: [
        { role: 'system', content: largeCodebaseContext },
        { role: 'user', content: 'Summarize this file' }
      ],
      enableCaching: true
    };

    const result = await provider.complete(params);

    expect(result.usage.cachedTokens).toBeGreaterThan(0);
    expect(result.metadata.cost).toBeLessThan(
      calculateCostWithoutCaching(result.usage)
    );
  });
});
```
### 8.3 Load Testing

```bash
# Artillery config for load testing
artillery run tests/load/provider-fallback.yml
```

```yaml
# tests/load/provider-fallback.yml
config:
  target: 'http://localhost:3000'
  phases:
    - duration: 60
      arrivalRate: 10
      name: 'Warm up'
    - duration: 120
      arrivalRate: 50
      name: 'Ramp up'
    - duration: 300
      arrivalRate: 100
      name: 'Sustained load'

scenarios:
  - name: 'Code Summarization'
    flow:
      - post:
          url: '/api/summarize'
          json:
            code: '{{ $randomString() }}'
            model: 'claude-3-5-sonnet-20241022'
          capture:
            - json: '$.cost'
              as: 'requestCost'
            - json: '$.provider'
              as: 'usedProvider'
```

---
## 9. Key Recommendations for mdcontext

### 9.1 Immediate Actions (This Week)

1. **Add Multi-Provider Support**
   - Implement Anthropic, OpenAI, Google providers
   - Basic fallback chain: Claude Sonnet → GPT-4o → Gemini Pro
   - Environment-based configuration

2. **Enable Prompt Caching**
   - Use Anthropic's cache_control for codebase context
   - Expected: 50-70% cost reduction immediately

3. **Add Basic Monitoring**
   - Integrate Langfuse for cost tracking
   - Track provider success/failure rates
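The fallback chain in action 1 can be reduced to a small loop that is independent of any SDK; provider calls are injected as plain async functions (an illustrative simplification of the router sketched in §7.1):

```typescript
type ProviderFn = () => Promise<string>;

interface Completion {
  content: string;
  provider: string;
}

// Try providers in priority order; return the first success, throw with the
// collected errors when every provider fails.
async function completeWithFallback(
  providers: Array<{ name: string; complete: ProviderFn }>
): Promise<Completion> {
  const errors: string[] = [];
  for (const p of providers) {
    try {
      return { content: await p.complete(), provider: p.name };
    } catch (err: any) {
      errors.push(`${p.name}: ${err?.message ?? err}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join('; ')}`);
}
```

A production version would also consult circuit-breaker state before attempting each provider, as in §8.1.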
### 9.2 Short-Term (Next 2 Weeks)

1. **Implement Circuit Breakers**
   - Prevent cascading failures
   - Health monitoring per provider

2. **Cost-Based Routing**
   - Route simple tasks to Gemini Flash
   - Complex reasoning to Claude Sonnet
   - Expected: 30-50% additional savings

3. **Retry Logic**
   - Exponential backoff with jitter
   - Respect retry-after headers
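Respecting `retry-after` headers (item 3) means handling both forms the header can take: delay-seconds and an HTTP-date. A sketch; the helper name is illustrative:

```typescript
// Parse a Retry-After header value: either delay-seconds ("30") or an
// HTTP-date. Returns a delay in milliseconds, or null when the header is
// absent or unparseable (in which case fall back to exponential backoff).
function retryAfterMs(header: string | null, now: Date = new Date()): number | null {
  if (!header) return null;
  const seconds = Number(header);
  if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;
  const date = Date.parse(header);
  if (Number.isNaN(date)) return null;
  return Math.max(0, date - now.getTime());
}
```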
### 9.3 Medium-Term (Next Month)

1. **Semantic Caching**
   - Set up Redis for cache storage
   - Implement similarity search
   - Expected: 20-30% additional savings

2. **Advanced Routing**
   - Task complexity classification
   - Dynamic provider selection based on:
     - Current cost
     - Latency requirements
     - Provider health

3. **Complete Observability**
   - Cost per feature attribution
   - User-level tracking
   - Budget alerts
### 9.4 Architecture Recommendation

```
┌─────────────────────────────────────────────────────────┐
│                     Client Request                      │
└───────────────────────────┬─────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│                  Semantic Cache Layer                   │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │ Exact Match  │  │  Similarity  │  │  TTL Check   │   │
│  │  (Memory)    │  │   (Redis)    │  │              │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
└───────────────────────────┬─────────────────────────────┘
                            │ Cache Miss
                            ▼
┌─────────────────────────────────────────────────────────┐
│                     Provider Router                     │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │   Classify   │  │    Route     │  │   Estimate   │   │
│  │   Task       │  │   Provider   │  │   Cost       │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
└───────────────────────────┬─────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│              Provider Chain with Fallback               │
│                                                         │
│  ┌─────────────────────────────────────────────────┐    │
│  │ Primary: Anthropic Claude                       │    │
│  │  ┌───────────────┐  ┌───────────────┐           │    │
│  │  │Circuit Breaker│  │Retry w/Backoff│           │    │
│  │  └───────────────┘  └───────────────┘           │    │
│  │  Prompt Caching: ✅                             │    │
│  └─────────────────────────────────────────────────┘    │
│                      │ Fails                            │
│                      ▼                                  │
│  ┌─────────────────────────────────────────────────┐    │
│  │ Secondary: OpenAI GPT-4o                        │    │
│  │  ┌───────────────┐  ┌───────────────┐           │    │
│  │  │Circuit Breaker│  │Retry w/Backoff│           │    │
│  │  └───────────────┘  └───────────────┘           │    │
│  │  Prompt Caching: ✅ (Automatic)                 │    │
│  └─────────────────────────────────────────────────┘    │
│                      │ Fails                            │
│                      ▼                                  │
│  ┌─────────────────────────────────────────────────┐    │
│  │ Tertiary: Google Gemini                         │    │
│  │  ┌───────────────┐  ┌───────────────┐           │    │
│  │  │Circuit Breaker│  │Retry w/Backoff│           │    │
│  │  └───────────────┘  └───────────────┘           │    │
│  │  Context Caching: ✅                            │    │
│  └─────────────────────────────────────────────────┘    │
└───────────────────────────┬─────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────┐
│                  Observability Layer                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │   Langfuse   │  │  CloudWatch  │  │    Alerts    │   │
│  │ Cost Tracking│  │   Metrics    │  │   Budgets    │   │
│  └──────────────┘  └──────────────┘  └──────────────┘   │
└─────────────────────────────────────────────────────────┘
```
### 9.5 Expected Outcomes

**Cost Reduction:**
- Phase 1 (Prompt Caching): 50-70%
- Phase 2 (Smart Routing): 20-30%
- Phase 3 (Semantic Caching): 10-20%
- **Total: 70-90% reduction**
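Note that the phase percentages compound on the spend remaining after the previous phase rather than adding up. A quick check of the range:

```typescript
// Total savings from sequential reductions: 1 - Π(1 - rᵢ).
function combinedSavings(rates: number[]): number {
  return 1 - rates.reduce((remaining, r) => remaining * (1 - r), 1);
}

// Low and high ends of the three phase ranges above:
const low = combinedSavings([0.5, 0.2, 0.1]);  // ≈ 0.64
const high = combinedSavings([0.7, 0.3, 0.2]); // ≈ 0.83
```

Compounded, the ranges give roughly 64-83% — close to, though slightly more conservative than, the 70-90% headline.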
**Reliability Improvement:**
- 99.9%+ uptime with multi-provider fallback
- Circuit breakers prevent cascading failures
- Automatic recovery from rate limits

**Performance:**
- <100ms for cache hits (semantic)
- <1s for cached prompts
- Minimal latency overhead from routing layer

---
## 10. Additional Resources

### 10.1 Documentation

- [Vercel AI SDK Documentation](https://ai-sdk.dev/)
- [LiteLLM Documentation](https://docs.litellm.ai/)
- [Anthropic Prompt Caching Guide](https://docs.anthropic.com/claude/docs/prompt-caching)
- [OpenAI Rate Limits Guide](https://platform.openai.com/docs/guides/rate-limits)
- [Google Gemini API Documentation](https://ai.google.dev/docs)

### 10.2 Tools and Libraries

**Provider SDKs:**
- [@anthropic-ai/sdk](https://www.npmjs.com/package/@anthropic-ai/sdk)
- [openai](https://www.npmjs.com/package/openai)
- [@google/generative-ai](https://www.npmjs.com/package/@google/generative-ai)

**Abstraction Layers:**
- [ai (Vercel AI SDK)](https://www.npmjs.com/package/ai)
- [litellm](https://pypi.org/project/litellm/)
- [any-llm](https://www.npmjs.com/package/any-llm)

**Observability:**
- [langfuse](https://www.npmjs.com/package/langfuse)
- [@datadog/llm-observability](https://docs.datadoghq.com/llm_observability/)
- [@traceloop/node-server-sdk](https://www.npmjs.com/package/@traceloop/node-server-sdk)

**Caching:**
- [redis](https://www.npmjs.com/package/redis)
- [qdrant-js](https://www.npmjs.com/package/@qdrant/js-client-rest)
- [gptcache](https://github.com/zilliztech/GPTCache)

### 10.3 Related Research Documents

- [Alternative LLM Providers 2026](./alternative-providers-2026.md) - Comprehensive provider comparison
- [Anthropic 2026](./anthropic-2026.md) - Claude-specific features and pricing
- [OpenAI 2026](./openai-2026.md) - GPT models and capabilities
- [Prompt Engineering 2026](./prompt-engineering-2026.md) - Optimization techniques

---
## Sources

This research synthesizes information from the following sources:

### Multi-Provider Architecture
- [Vercel AI Gateway Model Fallbacks](https://vercel.com/changelog/model-fallbacks-now-available-in-vercel-ai-gateway)
- [Vercel AI SDK Provider Management](https://ai-sdk.dev/docs/ai-sdk-core/provider-management)
- [LiteLLM Router Architecture](https://docs.litellm.ai/docs/router_architecture)
- [LiteLLM Fallbacks & Reliability](https://docs.litellm.ai/docs/proxy/reliability)
- [LangChain Fallbacks](https://python.langchain.com/v0.1/docs/guides/productionization/fallbacks/)
- [Dynamic Failover with LangChain](https://medium.com/@andrewnguonly/dynamic-failover-and-load-balancing-llms-with-langchain-e930a094be61)

### Cost Optimization
- [LLM Cost Optimization Strategies](https://medium.com/@ajayverma23/taming-the-beast-cost-optimization-strategies-for-llm-api-calls-in-production-11f16dbe2c39)
- [Stop Overpaying 5-10x in 2026](https://byteiota.com/llm-cost-optimization-stop-overpaying-5-10x-in-2026/)
- [Smart Routing for Cost Management](https://www.kosmoy.com/post/llm-cost-management-stop-burning-money-on-tokens)
- [Monitor and Optimize LLM Costs](https://www.helicone.ai/blog/monitor-and-optimize-llm-costs)
- [LLM API Pricing 2026](https://pricepertoken.com/)

### Caching Strategies
- [Redis Semantic Caching](https://redis.io/blog/what-is-semantic-caching/)
- [GPTCache GitHub](https://github.com/zilliztech/GPTCache)
- [Prompt Caching: 10x Cheaper Tokens](https://ngrok.com/blog/prompt-caching/)
- [Prompt Caching Infrastructure Guide](https://introl.com/blog/prompt-caching-infrastructure-llm-cost-latency-reduction-guide-2025)
- [Don't Break the Cache Research](https://arxiv.org/abs/2601.06007)
- [LiteLLM Caching](https://docs.litellm.ai/docs/proxy/caching)
- [Ultimate Guide to LLM Caching](https://latitude-blog.ghost.io/blog/ultimate-guide-to-llm-caching-for-low-latency-ai/)
- [Effective LLM Caching](https://www.helicone.ai/blog/effective-llm-caching)

### Circuit Breakers & Reliability
- [Circuit Breakers in LLM Apps](https://portkey.ai/blog/retries-fallbacks-and-circuit-breakers-in-llm-apps/)
- [Azure API Management LLM Resiliency](https://techcommunity.microsoft.com/blog/azuredevcommunityblog/improve-llm-backend-resiliency-with-load-balancer-and-circuit-breaker-rules-in-a/4394502)
- [Apigee Circuit Breaker Pattern](https://github.com/GoogleCloudPlatform/apigee-samples/blob/main/llm-circuit-breaking/llm_circuit_breaking.ipynb)

### Rate Limiting & Retry Logic
- [OpenAI Rate Limits](https://platform.openai.com/docs/guides/rate-limits)
- [OpenAI Exponential Backoff](https://platform.openai.com/docs/guides/rate-limits/retrying-with-exponential-backoff)
- [Claude API 429 Error Fix](https://www.aifreeapi.com/en/posts/fix-claude-api-429-rate-limit-error)
- [Gemini API Rate Limits 2026](https://www.aifreeapi.com/en/posts/gemini-api-rate-limit-explained)

### Provider Abstraction
- [Multi-Provider LLM Orchestration 2026](https://dev.to/ash_dubai/multi-provider-llm-orchestration-in-production-a-2026-guide-1g10)
- [AnyLLM GitHub](https://github.com/fkesheh/any-llm)
- [TypeScript & LLMs: Production Lessons](https://johnchildseddy.medium.com/typescript-llms-lessons-learned-from-9-months-in-production-4910485e3272)
- [ModelFusion GitHub](https://github.com/vercel/modelfusion)
- [Building Bridges to LLMs](https://hatchworks.com/blog/gen-ai/llm-projects-production-abstraction/)

### Observability
- [Langfuse Token and Cost Tracking](https://langfuse.com/docs/observability/features/token-and-cost-tracking)
- [Datadog LLM Observability](https://www.datadoghq.com/product/llm-observability/)
- [Traceloop: Track Token Usage Per User](https://www.traceloop.com/blog/from-bills-to-budgets-how-to-track-llm-token-usage-and-cost-per-user)
- [Portkey Token Tracking](https://portkey.ai/blog/tracking-llm-token-usage-across-providers-teams-and-workloads/)
- [Elastic LLM Observability](https://www.elastic.co/observability/llm-monitoring)
- [LLM Observability Tools 2026](https://lakefs.io/blog/llm-observability-tools/)
- [Best LLM Observability Tools](https://www.firecrawl.dev/blog/best-llm-observability-tools)

---

**Document Version**: 1.0
**Last Updated**: January 26, 2026
**Next Review**: February 26, 2026