llm_optimizer 0.1.2 → 0.1.4
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +25 -1
- data/README.md +39 -7
- data/lib/generators/llm_optimizer/templates/initializer.rb +12 -1
- data/lib/llm_optimizer/configuration.rb +2 -0
- data/lib/llm_optimizer/model_router.rb +36 -8
- data/lib/llm_optimizer/version.rb +1 -1
- data/lib/llm_optimizer.rb +76 -4
- metadata +1 -1
checksums.yaml
CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c6903d2d4c2163d93ffe8d0d5ad9708d64a8472a430ed9f266c9237e468c8585
+  data.tar.gz: c7270f4717ece6778976f46f1601f9e5d45939e3e7926ea7e3ed05b3b641f413
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 858cad7443f7adcbe42b3d5ce62b4e815081d2238b7711066276ee2a7c0fb6a506d267ccb48dbe611a2ed08b2eab29139057dcddc2d033155561499a0d6f5421
+  data.tar.gz: b3afc392e8fb2ef5b7baa468f74f9def34a15db9f6df898fd738503638d32f5dda9b04a6c8f2e005cd94aa893eca864111f3be0f2e8bfa1cc0aeef6391e0ae2c
data/CHANGELOG.md
CHANGED

@@ -7,6 +7,28 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.1.4] - 2026-04-13
+
+### Fixed
+- `WrapperModule#chat` (used by `wrap_client`) incorrectly called `LlmOptimizer.optimize` internally which required `llm_caller` to be configured — causing `ConfigurationError` for users who only called `wrap_client`. Refactored into `optimize_pre_call` / `optimize_post_call` so the wrapped client handles the actual LLM call via `super`. `llm_caller` is no longer needed when using `wrap_client`
+
+### Added
+- `LlmOptimizer.optimize_pre_call(prompt, config)` — runs compress → route → cache lookup without making an LLM call; used internally by `WrapperModule` and available for advanced integrations
+- `LlmOptimizer.optimize_post_call(pre_call_result, response, config)` — stores a response in the semantic cache after an LLM call; used internally by `WrapperModule`
+
+## [0.1.3] - 2026-04-10
+
+### Added
+- `classifier_caller` config option — injectable lambda for LLM-based prompt classification
+- Hybrid routing in `ModelRouter`: fast-path signals (code blocks, keywords) → LLM classifier → word-count heuristic fallback
+  - Fixes misclassification of short-but-complex prompts (e.g. "Fix this bug") and long-but-simple prompts
+- Classifier failures (network errors, missing model, unexpected response) automatically fall through to heuristic — no app impact
+- Tests for classifier integration, failure fallback, and fast-path bypass
+
+### Changed
+- `ModelRouter` routing logic now uses three-layer decision chain instead of pure heuristics
+- README updated with classifier documentation and routing decision flow
+
 ## [0.1.2] - 2026-04-10
 
 ### Fixed
@@ -57,7 +79,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - `OptimizeResult` struct with `response`, `model`, `model_tier`, `cache_status`, `original_tokens`, `compressed_tokens`, `latency_ms`, `messages`
 - Unit test suite covering all components with positive and negative scenarios using Minitest + Mocha
 
-[Unreleased]: https://github.com/arunkumarry/llm_optimizer/compare/v0.1.
+[Unreleased]: https://github.com/arunkumarry/llm_optimizer/compare/v0.1.4...HEAD
+[0.1.4]: https://github.com/arunkumarry/llm_optimizer/compare/v0.1.3...v0.1.4
+[0.1.3]: https://github.com/arunkumarry/llm_optimizer/compare/v0.1.2...v0.1.3
 [0.1.2]: https://github.com/arunkumarry/llm_optimizer/compare/v0.1.1...v0.1.2
 [0.1.1]: https://github.com/arunkumarry/llm_optimizer/compare/v0.1.0...v0.1.1
 [0.1.0]: https://github.com/arunkumarry/llm_optimizer/releases/tag/v0.1.0
data/README.md
CHANGED

@@ -1,6 +1,6 @@
 # llm_optimizer
 
-A Smart Gateway for LLM API calls in Ruby and Rails applications. Reduces token usage and API costs through four composable optimizations
+A Smart Gateway for LLM API calls in Ruby and Rails applications. Reduces token usage and API costs through four composable optimizations: all opt-in, all independently configurable.
 
 ## How it works
 
@@ -10,17 +10,41 @@ Every call to `LlmOptimizer.optimize` passes through an ordered pipeline:
 prompt → Compressor → ModelRouter → SemanticCache lookup → HistoryManager → LLM call → SemanticCache store → OptimizeResult
 ```
 
-Each stage is independently enabled via configuration flags. If any stage fails, the gem falls through to a raw LLM call
+Each stage is independently enabled via configuration flags. If any stage fails, the gem falls through to a raw LLM call; your app never breaks because of the optimizer.
 
 ## Optimizations
 
 ### 1. Semantic Caching
-Stores prompt embeddings in Redis. On subsequent calls, computes cosine similarity against stored embeddings. If similarity ≥ threshold, returns the cached response instantly
+Stores prompt embeddings in Redis. On subsequent calls, computes cosine similarity against stored embeddings. If similarity ≥ threshold, returns the cached response instantly; no LLM call is made.
 
 ### 2. Intelligent Model Routing
-
-
-
+
+Classifies each prompt and routes it to the appropriate model tier:
+
+- **Simple** → cheaper/faster model (e.g. `gpt-4o-mini`, `amazon.nova-micro`)
+- **Complex** → premium model (e.g. `claude-3-5-sonnet`, `gpt-4o`)
+
+Routing uses a three-layer decision chain:
+
+1. **Explicit override** — if `route_to: :simple` or `:complex` is set, always use that
+2. **Fast-path signals** — code blocks (` ``` `, `~~~`) and keywords (`analyze`, `refactor`, `debug`, `architect`, `explain in detail`) → instantly `:complex`, no LLM call
+3. **LLM classifier** (optional) — for ambiguous prompts, calls a cheap model with a classification prompt; falls back to the word-count heuristic if not configured or if the call fails
+
+This hybrid approach fixes the core weakness of pure heuristics:
+
+- `"Fix this bug"` → 3 words but `:complex` via classifier
+- `"Explain Ruby blocks simply"` → long but `:simple` via classifier
+- `"analyze this code"` → keyword fast-path → `:complex` instantly (no classifier call)
+
+Configure the classifier with any cheap model your app already uses:
+
+```ruby
+config.classifier_caller = ->(prompt) {
+  RubyLLM.chat(model: "amazon.nova-micro-v1:0", provider: :bedrock, assume_model_exists: true)
+    .ask(prompt).content.strip.downcase
+}
+```
+
+If `classifier_caller` is not set, the router falls back to the word-count heuristic (< 20 words → `:simple`).
 
 ### 3. Token Pruning
 Removes common English stop words from prompts before sending to the LLM. Preserves fenced code block content unchanged. Typically reduces token count by 10–20%.
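The decision chain described above can be sketched as a standalone function. This is an illustrative simplification of `ModelRouter#route` (abbreviated keyword list, no explicit-override or classifier layer), not the gem's actual code:

```ruby
# Simplified sketch of the routing decision chain: fast-path signals first,
# then the word-count heuristic fallback. Keyword list is abbreviated.
CODE_BLOCK_RE = /`{3}|~{3}/
COMPLEX_KEYWORDS = %w[analyze refactor debug architect].freeze

def route_tier(prompt)
  # Fast-path: fenced code blocks are always complex
  return :complex if CODE_BLOCK_RE.match?(prompt)

  # Fast-path: complexity keywords
  lower = prompt.downcase
  return :complex if COMPLEX_KEYWORDS.any? { |kw| lower.include?(kw) }

  # Heuristic fallback: short prompts are assumed simple
  prompt.split.length < 20 ? :simple : :complex
end

route_tier("analyze this code") # => :complex (keyword fast-path)
route_tier("What is Ruby?")     # => :simple  (short, no signals)
```

In the real router, the optional LLM classifier sits between the fast-path checks and the heuristic fallback.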
@@ -120,6 +144,13 @@ LlmOptimizer.configure do |config|
   config.embedding_caller = ->(text) {
     MyEmbeddingService.embed(text)
   }
+
+  # Classifier caller — optional, improves routing accuracy for ambiguous prompts
+  # Falls back to word-count heuristic if not set or if the call fails
+  config.classifier_caller = ->(prompt) {
+    RubyLLM.chat(model: "amazon.nova-micro-v1:0", provider: :bedrock, assume_model_exists: true)
+      .ask(prompt).content.strip.downcase
+  }
 end
 ```
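The "similarity ≥ threshold" test described under Semantic Caching is a cosine similarity between embedding vectors. A minimal sketch of that computation (the gem's `SemanticCache` internals are not shown in this diff, so this function is illustrative):

```ruby
# Illustrative cosine similarity between two embedding vectors, the measure
# a semantic cache uses to decide whether a stored response can be reused.
def cosine_similarity(a, b)
  dot   = a.zip(b).sum { |x, y| x * y }
  mag_a = Math.sqrt(a.sum { |x| x * x })
  mag_b = Math.sqrt(b.sum { |x| x * x })
  return 0.0 if mag_a.zero? || mag_b.zero?

  dot / (mag_a * mag_b)
end

# A cache hit requires similarity >= the configured similarity_threshold:
cosine_similarity([1.0, 0.0], [1.0, 0.0]) # => 1.0 (identical direction)
```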
@@ -143,6 +174,7 @@ end
 | `debug_logging` | Boolean | `false` | Log full prompt and response at DEBUG level |
 | `llm_caller` | Lambda | `nil` | `(prompt, model:) -> String` |
 | `embedding_caller` | Lambda | `nil` | `(text) -> Array<Float>` |
+| `classifier_caller` | Lambda | `nil` | `(prompt) -> "simple" or "complex"` |
 
 ## Per-call configuration
 
@@ -179,7 +211,7 @@ Transparently wrap an existing LLM client class so all calls through it are auto
 LlmOptimizer.wrap_client(OpenAI::Client)
 ```
 
-This prepends the optimization pipeline into the client's `chat` method. Safe to call multiple times
+This prepends the optimization pipeline into the client's `chat` method. Safe to call multiple times; the wrapping is idempotent.
 
 ## OptimizeResult
 
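The idempotency follows from Ruby's `Module#prepend` semantics: prepending a module that is already in the ancestor chain is a no-op, so wrapping twice cannot stack the pipeline. A minimal demonstration with stand-in names (not the gem's actual classes):

```ruby
# Prepending the same module twice does not stack it: Ruby skips the
# prepend when the module is already in the ancestor chain.
module Wrapper
  def chat(params)
    "wrapped: #{super}"
  end
end

class FakeClient
  def chat(params)
    "raw"
  end
end

FakeClient.prepend(Wrapper)
FakeClient.prepend(Wrapper) # second call is a no-op

FakeClient.new.chat({}) # => "wrapped: raw" (not "wrapped: wrapped: raw")
```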
data/lib/generators/llm_optimizer/templates/initializer.rb
CHANGED

@@ -64,5 +64,16 @@ LlmOptimizer.configure do |config|
   # Example:
   # config.embedding_caller = ->(text) { EmbeddingService.embed(text) }
   #
-  #
+  # --- Routing classifier (optional) ---
+  # When set, ambiguous prompts are classified by a cheap LLM instead of
+  # falling back to the word-count heuristic. Unambiguous signals (code blocks,
+  # keywords) still bypass the classifier for speed.
+  #
+  # Example:
+  # config.classifier_caller = ->(prompt) {
+  #   RubyLLM.chat(model: "amazon.nova-micro-v1:0", assume_model_exists: true)
+  #     .ask(prompt).content.strip.downcase
+  # }
+  #
+  # config.classifier_caller = nil
 end
data/lib/llm_optimizer/configuration.rb
CHANGED

@@ -21,6 +21,7 @@ module LlmOptimizer
       cache_ttl
       llm_caller
       embedding_caller
+      classifier_caller
     ].freeze
 
     # Define readers for all known keys (setters below track explicit sets)
@@ -45,6 +46,7 @@ module LlmOptimizer
       @cache_ttl = 86_400
       @llm_caller = nil
       @embedding_caller = nil
+      @classifier_caller = nil
     end
 
     # Copies only explicitly set keys from other_config without resetting unmentioned keys.
data/lib/llm_optimizer/model_router.rb
CHANGED

@@ -6,27 +6,55 @@ module LlmOptimizer
     COMPLEX_PHRASES = ["explain in detail"].freeze
     CODE_BLOCK_RE = /```|~~~/
 
+    CLASSIFIER_PROMPT = <<~PROMPT
+      Classify the following prompt as either 'simple' or 'complex'.
+
+      Rules:
+      - simple: factual questions, basic lookups, short explanations, greetings
+      - complex: code generation, debugging, architecture, multi-step reasoning, analysis
+
+      Reply with exactly one word: simple or complex
+
+      Prompt: %<prompt>s
+    PROMPT
+
     def initialize(config)
       @config = config
     end
 
     def route(prompt)
-      #
+      # Explicit override — always
       return @config.route_to if %i[simple complex].include?(@config.route_to)
 
-      #
+      # Unambiguous fast-path signals (no LLM call needed)
       return :complex if CODE_BLOCK_RE.match?(prompt)
 
-      # complex keywords or phrases
       lower = prompt.downcase
       return :complex if COMPLEX_KEYWORDS.any? { |kw| lower.include?(kw) }
-      return :complex if COMPLEX_PHRASES.any?
+      return :complex if COMPLEX_PHRASES.any? { |ph| lower.include?(ph) }
+
+      # LLM classifier for ambiguous prompts
+      if @config.classifier_caller
+        result = classify_with_llm(prompt)
+        return result if result
+      end
+
+      # Fallback heuristic
+      prompt.split.length < 20 ? :simple : :complex
+    end
+
+    private
 
-
-
+    def classify_with_llm(prompt)
+      classifier_prompt = format(CLASSIFIER_PROMPT, prompt: prompt)
+      response = @config.classifier_caller.call(classifier_prompt)
+      normalized = response.to_s.strip.downcase.gsub(/[^a-z]/, "")
+      return :simple if normalized == "simple"
+      return :complex if normalized == "complex"
 
-      #
-
+      nil # unrecognized response — fall through to heuristic
+    rescue StandardError
+      nil # classifier failure — fall through to heuristic
     end
   end
 end
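The fall-through behavior of `classify_with_llm` can be exercised with a stubbed caller. This standalone sketch mirrors the normalization and rescue logic above, with the caller passed in directly instead of read from config:

```ruby
# Standalone sketch of classify-with-fallback: normalize the classifier's
# reply, and return nil (fall through to the heuristic) on anything
# unrecognized or on any error from the caller.
def classify(prompt, classifier_caller)
  response = classifier_caller.call(prompt)
  normalized = response.to_s.strip.downcase.gsub(/[^a-z]/, "")
  return :simple  if normalized == "simple"
  return :complex if normalized == "complex"

  nil # unrecognized reply
rescue StandardError
  nil # caller raised (network error, missing model, ...)
end

classify("Fix this bug", ->(_) { " Complex.\n" }) # => :complex (normalized)
classify("hello", ->(_) { raise "network down" }) # => nil (falls through)
```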
data/lib/llm_optimizer.rb
CHANGED

@@ -58,12 +58,37 @@ module LlmOptimizer
   end
 
   # Opt-in client wrapping
+  # WrapperModule intercepts `chat` on the wrapped client, runs the pre-call
+  # optimization pipeline (compress, route, cache lookup), and delegates the
+  # actual LLM call to the original client via `super` — so llm_caller is NOT
+  # required when using wrap_client.
   module WrapperModule
-    def chat(params, &)
+    def chat(params, &block)
+      config = LlmOptimizer.configuration
       prompt = params[:messages] || params[:prompt]
-
-
-
+
+      # Run pre-call pipeline: compress, route, cache lookup
+      result = LlmOptimizer.optimize_pre_call(prompt, config)
+
+      # Cache hit — return immediately without calling the LLM
+      if result[:cache_status] == :hit
+        return result[:response]
+      end
+
+      # Apply compressed prompt and routed model, then delegate to original client
+      optimized_params = params.merge(model: result[:model])
+      if params[:messages]
+        optimized_params = optimized_params.merge(messages: result[:prompt])
+      elsif params[:prompt]
+        optimized_params = optimized_params.merge(prompt: result[:prompt])
+      end
+
+      response = super(optimized_params, &block)
+
+      # Store in cache after successful LLM call
+      LlmOptimizer.optimize_post_call(result, response, config)
+
+      response
     end
   end
 
@@ -231,6 +256,53 @@ module LlmOptimizer
     )
   end
 
+  # Pre-call pipeline for wrap_client: compress, route, cache lookup.
+  # Returns a hash with :prompt, :model, :model_tier, :embedding, :cache_status, :response.
+  # Does NOT make an LLM call — the wrapped client handles that via super.
+  def self.optimize_pre_call(prompt, config = configuration)
+    compressor = Compressor.new
+    prompt = compressor.compress(prompt) if config.compress_prompt
+
+    router = ModelRouter.new(config)
+    model_tier = router.route(prompt)
+    model = model_tier == :simple ? config.simple_model : config.complex_model
+
+    embedding = nil
+    if config.use_semantic_cache && config.redis_url
+      begin
+        emb_client = EmbeddingClient.new(
+          model: config.embedding_model,
+          timeout_seconds: config.timeout_seconds,
+          embedding_caller: config.embedding_caller
+        )
+        embedding = emb_client.embed(prompt)
+        redis = build_redis(config.redis_url)
+        cache = SemanticCache.new(redis, threshold: config.similarity_threshold, ttl: config.cache_ttl)
+        cached = cache.lookup(embedding)
+        return { prompt: prompt, model: model, model_tier: model_tier,
+                 embedding: embedding, cache_status: :hit, response: cached } if cached
+      rescue EmbeddingError => e
+        config.logger.warn("[llm_optimizer] wrap_client EmbeddingError (cache miss): #{e.message}")
+        embedding = nil
+      end
+    end
+
+    { prompt: prompt, model: model, model_tier: model_tier,
+      embedding: embedding, cache_status: :miss, response: nil }
+  end
+
+  # Post-call: store the LLM response in the semantic cache if applicable.
+  def self.optimize_post_call(pre_call_result, response, config = configuration)
+    return unless config.use_semantic_cache && config.redis_url
+    return unless pre_call_result[:embedding]
+
+    redis = build_redis(config.redis_url)
+    cache = SemanticCache.new(redis, threshold: config.similarity_threshold, ttl: config.cache_ttl)
+    cache.store(pre_call_result[:embedding], response)
+  rescue StandardError => e
+    config.logger.warn("[llm_optimizer] wrap_client cache store failed: #{e.message}")
+  end
+
   # Private helpers
 
   class << self