llm_optimizer 0.1.4 → 0.1.5
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +24 -1
- data/README.md +53 -50
- data/lib/generators/llm_optimizer/templates/initializer.rb +40 -2
- data/lib/llm_optimizer/configuration.rb +5 -0
- data/lib/llm_optimizer/conversation_store.rb +83 -0
- data/lib/llm_optimizer/model_router.rb +10 -5
- data/lib/llm_optimizer/pipeline.rb +173 -0
- data/lib/llm_optimizer/version.rb +1 -1
- data/lib/llm_optimizer.rb +58 -253
- metadata +3 -1
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: b5f6d0b99af3e0801e77df0316ac767e0e10e0d4e7bba9dc19623797681a2961
|
|
4
|
+
data.tar.gz: d8644df814cb0c7f219a51620d3cd409e1bb5822228278245b2572dbaf666fdc
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 1396d95f7e3f498e600cf6e3b99627ee2f746692a1f002be989ce1b13859f5a1af8f50656a82ff0fa853d3e42b0c219de49ac152baa2107d9c9529fc82bf63e4
|
|
7
|
+
data.tar.gz: 6b45bae664e4d43fd54fe47c2c8c9ebdeea2b4442f78bf22daee56d730095a471f58faf0e6432c7e5ef18a58cfc20a9279dc8706c9f41e4ca55db0cb441df1e8
|
data/CHANGELOG.md
CHANGED
|
@@ -7,6 +7,28 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
|
|
7
7
|
|
|
8
8
|
## [Unreleased]
|
|
9
9
|
|
|
10
|
+
## [0.1.5] - 2026-04-22
|
|
11
|
+
|
|
12
|
+
### Added
|
|
13
|
+
- `ConversationStore` — Redis-backed conversation persistence under the `llm_optimizer:conversation:<id>` namespace; handles load, save, TTL, and debug logging
|
|
14
|
+
- `conversation_id` option on `LlmOptimizer.optimize` — pass a stable ID and the gem automatically loads history from Redis, calls the LLM with full context, and saves the updated history back; no manual message management required
|
|
15
|
+
- `messages_caller` config option — injectable lambda `(messages, model:) -> String` for LLM providers that accept a full message array (OpenAI chat, Anthropic messages, etc.); takes priority over `llm_caller` when conversation history is present
|
|
16
|
+
- `system_prompt` config option — seeded as the opening exchange when a new conversation is created via `conversation_id`
|
|
17
|
+
- `conversation_ttl` config option — TTL in seconds for Redis conversation keys (default `86400`; `0` for no expiry)
|
|
18
|
+
- `LlmOptimizer.clear_conversation(conversation_id)` — deletes a conversation key from Redis; returns `true` if deleted, `false` if not found
|
|
19
|
+
- `pipeline#load_conversation` and `pipeline#persist_conversation` — internal helpers wiring `ConversationStore` into the optimize pipeline
|
|
20
|
+
- `pipeline#apply_history_manager` — applies `HistoryManager` sliding-window summarization to loaded conversation history when `manage_history: true`
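A minimal usage sketch of the conversation flow described above (the ID and prompts are illustrative, and `redis_url` plus an `llm_caller`/`messages_caller` are assumed to be configured already):
```ruby
# Illustrative only — history is loaded from Redis, sent to the LLM with the
# new prompt, and the updated history is saved back under the same key.
result = LlmOptimizer.optimize(
  "What did I ask you about earlier?",
  conversation_id: "user-42-session-7"
)
result.response      # assistant reply with prior context applied
result.messages.size # full stored history, including this turn

# Remove the stored history when the session ends.
LlmOptimizer.clear_conversation("user-42-session-7") # => true if a key was deleted
```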
|
|
21
|
+
|
|
22
|
+
### Changed
|
|
23
|
+
- `HistoryManager` now receives an internal `llm_caller` lambda that routes through `raw_llm_call`, so it correctly uses `messages_caller` when available instead of always requiring `llm_caller`
|
|
24
|
+
- `raw_llm_call` updated to prefer `messages_caller` over `llm_caller` when a non-empty messages array is present
|
|
25
|
+
- `ModelRouter` classifier response matching now uses word-boundary regex (`/\bsimple\b/`, `/\bcomplex\b/`) to handle decorated responses like `"simple."`, `"**complex**"`, or `"the answer is simple"` — previously only exact string match was used
|
|
26
|
+
- `ModelRouter` classifier failures (any `StandardError`) and unrecognized responses both fall through silently to the word-count heuristic; no exception is raised to the caller
|
|
27
|
+
- `validate_conversation_options!` raises `ConfigurationError` if both `conversation_id` and `messages:` are supplied, or if `conversation_id` is used without `redis_url`
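A rough illustration of the new validation and classifier matching (IDs and replies are made up):
```ruby
# Mutually exclusive options now fail fast with ConfigurationError.
LlmOptimizer.optimize("Hi", conversation_id: "abc",
                            messages: [{ role: "user", content: "earlier turn" }])
# => raises LlmOptimizer::ConfigurationError
# conversation_id without a configured redis_url also raises ConfigurationError.

# Decorated classifier replies now match via word-boundary regex:
"**complex**".match?(/\bcomplex\b/) # => true
"simple.".match?(/\bsimple\b/)      # => true
```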
|
|
28
|
+
|
|
29
|
+
### Fixed
|
|
30
|
+
- `HistoryManager` summarization raised `ConfigurationError: No llm_caller configured` when called inside the pipeline without a bound config — internal lambda now correctly captures `call_config`
|
|
31
|
+
|
|
10
32
|
## [0.1.4] - 2026-04-13
|
|
11
33
|
|
|
12
34
|
### Fixed
|
|
@@ -79,7 +101,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
|
|
79
101
|
- `OptimizeResult` struct with `response`, `model`, `model_tier`, `cache_status`, `original_tokens`, `compressed_tokens`, `latency_ms`, `messages`
|
|
80
102
|
- Unit test suite covering all components with positive and negative scenarios using Minitest + Mocha
|
|
81
103
|
|
|
82
|
-
[Unreleased]: https://github.com/arunkumarry/llm_optimizer/compare/v0.1.
|
|
104
|
+
[Unreleased]: https://github.com/arunkumarry/llm_optimizer/compare/v0.1.5...HEAD
|
|
105
|
+
[0.1.5]: https://github.com/arunkumarry/llm_optimizer/compare/v0.1.4...v0.1.5
|
|
83
106
|
[0.1.4]: https://github.com/arunkumarry/llm_optimizer/compare/v0.1.3...v0.1.4
|
|
84
107
|
[0.1.3]: https://github.com/arunkumarry/llm_optimizer/compare/v0.1.2...v0.1.3
|
|
85
108
|
[0.1.2]: https://github.com/arunkumarry/llm_optimizer/compare/v0.1.1...v0.1.2
|
data/README.md
CHANGED
|
@@ -21,8 +21,8 @@ Stores prompt embeddings in Redis. On subsequent calls, computes cosine similari
|
|
|
21
21
|
|
|
22
22
|
Classifies each prompt and routes it to the appropriate model tier:
|
|
23
23
|
|
|
24
|
-
- **Simple** → cheaper/faster model (e.g. `
|
|
25
|
-
- **Complex** → premium model (e.g. `claude-
|
|
24
|
+
- **Simple** → cheaper/faster model (e.g. `llama3`, `gemini-2.5-flash-lite`)
|
|
25
|
+
- **Complex** → premium model (e.g. `claude-haiku-4-5-20251001`, `gemini-3.0-pro`)
|
|
26
26
|
|
|
27
27
|
Routing uses a three-layer decision chain:
|
|
28
28
|
|
|
@@ -50,7 +50,7 @@ If `classifier_caller` is not set, the router falls back to the word-count heuri
|
|
|
50
50
|
Removes common English stop words from prompts before sending to the LLM. Preserves fenced code block content unchanged. Typically reduces token count by 10–20%.
|
|
51
51
|
|
|
52
52
|
### 4. Conversation History Sliding Window
|
|
53
|
-
When a conversation history exceeds the configured token budget, summarizes the oldest messages using the simple model and replaces them with a single system summary message.
|
|
53
|
+
When a conversation history exceeds the configured token budget, summarizes the oldest messages using the simple model and replaces them with a single system summary message. Uses Redis to store conversation history for fast retrieval and summarization.
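A rough sketch of the sliding window on a single call, with illustrative history and assumed per-call overrides for `manage_history` and `token_budget`:
```ruby
history = [
  { role: "user",      content: "A long earlier question about deployment..." },
  { role: "assistant", content: "A long earlier answer..." }
]

# Oldest messages over the budget are replaced by one system summary message.
result = LlmOptimizer.optimize("Continue from where we left off",
                               messages: history,
                               manage_history: true,
                               token_budget: 2000)
result.messages # the (possibly summarized) array that was actually sent
```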
|
|
54
54
|
|
|
55
55
|
## Installation
|
|
56
56
|
|
|
@@ -99,7 +99,7 @@ result = LlmOptimizer.optimize("What is Redis?")
|
|
|
99
99
|
puts result.response # => "Redis is an in-memory data store..."
|
|
100
100
|
puts result.cache_status # => :hit or :miss
|
|
101
101
|
puts result.model_tier # => :simple or :complex
|
|
102
|
-
puts result.model # => "
|
|
102
|
+
puts result.model # => "gemini-2.5-flash-lite"
|
|
103
103
|
puts result.original_tokens # => 5
|
|
104
104
|
puts result.compressed_tokens # => 4
|
|
105
105
|
puts result.latency_ms # => 12.4
|
|
@@ -110,39 +110,50 @@ puts result.latency_ms # => 12.4
|
|
|
110
110
|
### Rails initializer
|
|
111
111
|
|
|
112
112
|
```ruby
|
|
113
|
+
# config/initializers/llm_optimizer.rb
|
|
114
|
+
require "llm_optimizer"
|
|
115
|
+
|
|
113
116
|
LlmOptimizer.configure do |config|
|
|
114
|
-
# Feature flags
|
|
117
|
+
# --- Feature flags (all off by default) ---
|
|
115
118
|
config.compress_prompt = true # strip stop words before sending to LLM
|
|
116
119
|
config.use_semantic_cache = true # cache responses by vector similarity
|
|
117
120
|
config.manage_history = true # summarize old messages when over token budget
|
|
118
121
|
|
|
119
|
-
# Model routing
|
|
120
|
-
config.route_to = :auto
|
|
121
|
-
config.simple_model = "
|
|
122
|
-
config.complex_model = "claude-
|
|
122
|
+
# --- Model routing ---
|
|
123
|
+
config.route_to = :auto # :auto, :simple, or :complex
|
|
124
|
+
config.simple_model = "gemini-2.5-flash-lite" # used for simple prompts
|
|
125
|
+
config.complex_model = "claude-haiku-4-5-20251001" # used for complex prompts
|
|
123
126
|
|
|
124
|
-
# Redis (required if use_semantic_cache: true)
|
|
127
|
+
# --- Redis (required if use_semantic_cache: true) ---
|
|
125
128
|
config.redis_url = ENV["REDIS_URL"]
|
|
126
129
|
|
|
127
|
-
#
|
|
128
|
-
config.similarity_threshold = 0.96 # cosine similarity cutoff for cache hit
|
|
129
|
-
config.token_budget = 4000 #
|
|
130
|
-
config.cache_ttl = 86400 # cache TTL in seconds (
|
|
130
|
+
# --- Token / cache settings ---
|
|
131
|
+
config.similarity_threshold = 0.96 # cosine similarity cutoff for cache hit
|
|
132
|
+
config.token_budget = 4000 # max tokens before history summarization
|
|
133
|
+
config.cache_ttl = 86400 # cache TTL in seconds (24h)
|
|
131
134
|
config.timeout_seconds = 5 # timeout for external API calls
|
|
132
135
|
|
|
133
|
-
# Logging
|
|
136
|
+
# --- Logging ---
|
|
134
137
|
config.logger = Rails.logger
|
|
135
|
-
config.debug_logging = Rails.env.development?
|
|
138
|
+
config.debug_logging = Rails.env.development? # logs full prompt+response in dev
|
|
136
139
|
|
|
137
|
-
#
|
|
140
|
+
# --- Wire up your app's LLM client ---
|
|
141
|
+
# Replace the body with however your app calls the LLM
|
|
138
142
|
config.llm_caller = ->(prompt, model:) {
|
|
139
|
-
|
|
143
|
+
model ||= "claude-haiku-4-5-20251001"
|
|
144
|
+
provider = if model.include?("claude") then :anthropic
|
|
145
|
+
elsif model.include?("gpt") then :openai
|
|
146
|
+
elsif model.include?("gemini") then :gemini
|
|
147
|
+
else :ollama
|
|
148
|
+
end
|
|
149
|
+
chat = RubyLLM.chat(model: model, provider: provider, assume_model_exists: true)
|
|
150
|
+
chat.ask(prompt).content
|
|
140
151
|
}
|
|
141
152
|
|
|
142
153
|
# Embeddings caller — wire to your embeddings provider (required if use_semantic_cache: true)
|
|
143
|
-
# Falls back to OpenAI via ENV["OPENAI_API_KEY"] if not set
|
|
144
154
|
config.embedding_caller = ->(text) {
|
|
145
|
-
|
|
155
|
+
response = RubyLLM.embed(text, provider: :gemini, model: 'gemini-embedding-001')
|
|
156
|
+
response.vectors
|
|
146
157
|
}
|
|
147
158
|
|
|
148
159
|
# Classifier caller — optional, improves routing accuracy for ambiguous prompts
|
|
@@ -151,7 +162,18 @@ LlmOptimizer.configure do |config|
|
|
|
151
162
|
RubyLLM.chat(model: "amazon.nova-micro-v1:0", provider: :bedrock, assume_model_exists: true)
|
|
152
163
|
.ask(prompt).content.strip.downcase
|
|
153
164
|
}
|
|
165
|
+
|
|
166
|
+
# Messages caller - optional, used for conversation history and history summarization.
|
|
167
|
+
config.system_prompt = "You are a sarcastic comic person who gives witty responses in a non harmful way. If any serious question is asked, handle it in a calm way."
|
|
168
|
+
|
|
169
|
+
config.messages_caller = ->(messages, model:) {
|
|
170
|
+
chat = RubyLLM.chat(model: model)
|
|
171
|
+
messages[0..-2].each { |m| chat.add_message(role: m[:role], content: m[:content]) }
|
|
172
|
+
response = chat.ask(messages.last[:content])
|
|
173
|
+
response.content
|
|
174
|
+
}
|
|
154
175
|
end
|
|
176
|
+
|
|
155
177
|
```
|
|
156
178
|
|
|
157
179
|
### Configuration reference
|
|
@@ -162,19 +184,22 @@ end
|
|
|
162
184
|
| `use_semantic_cache` | Boolean | `false` | Enable Redis-backed semantic cache |
|
|
163
185
|
| `manage_history` | Boolean | `false` | Enable conversation history summarization |
|
|
164
186
|
| `route_to` | Symbol | `:auto` | `:auto`, `:simple`, or `:complex` |
|
|
165
|
-
| `simple_model` | String | `"
|
|
166
|
-
| `complex_model` | String | `"claude-
|
|
187
|
+
| `simple_model` | String | `"gemini-2.5-flash-lite"` | Model for simple prompts |
|
|
188
|
+
| `complex_model` | String | `"claude-haiku-4-5-20251001"` | Model for complex prompts |
|
|
167
189
|
| `similarity_threshold` | Float | `0.96` | Minimum cosine similarity for cache hit |
|
|
168
190
|
| `token_budget` | Integer | `4000` | Token limit before history summarization |
|
|
169
191
|
| `cache_ttl` | Integer | `86400` | Cache entry TTL in seconds |
|
|
170
192
|
| `timeout_seconds` | Integer | `5` | Timeout for external API calls |
|
|
171
193
|
| `redis_url` | String | `nil` | Redis connection URL |
|
|
172
|
-
| `embedding_model` | String | `"
|
|
194
|
+
| `embedding_model` | String | `"gemini-embedding-001"` | Embedding model name (OpenAI fallback) |
|
|
173
195
|
| `logger` | Logger | `Logger.new($stdout)` | Any Logger-compatible object |
|
|
174
196
|
| `debug_logging` | Boolean | `false` | Log full prompt and response at DEBUG level |
|
|
175
197
|
| `llm_caller` | Lambda | `nil` | `(prompt, model:) -> String` |
|
|
176
198
|
| `embedding_caller` | Lambda | `nil` | `(text) -> Array<Float>` |
|
|
177
199
|
| `classifier_caller` | Lambda | `nil` | `(prompt) -> "simple" or "complex"` |
|
|
200
|
+
| `messages_caller` | Lambda | `nil` | `(messages, model:) -> String` — used when `conversation_id` is present; receives full history including current user turn |
|
|
201
|
+
| `system_prompt` | String | `nil` | Seeded as the opening exchange when a new conversation is created via `conversation_id` |
|
|
202
|
+
| `conversation_ttl` | Integer | `86400` | TTL in seconds for Redis-backed conversation history (`0` for no expiry) |
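A minimal sketch wiring only the new conversation-related keys; `MyLlmClient` is a hypothetical stand-in for however your app calls the LLM:

```ruby
LlmOptimizer.configure do |config|
  config.redis_url        = ENV["REDIS_URL"]   # required for conversation_id
  config.system_prompt    = "You are a helpful assistant."
  config.conversation_ttl = 0                  # 0 = keep conversation keys indefinitely

  # Receives the full history plus the current user turn.
  config.messages_caller = ->(messages, model:) {
    MyLlmClient.chat(model: model, messages: messages) # hypothetical client
  }
end
```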
|
|
178
203
|
|
|
179
204
|
## Per-call configuration
|
|
180
205
|
|
|
@@ -200,19 +225,6 @@ messages = [
|
|
|
200
225
|
|
|
201
226
|
result = LlmOptimizer.optimize("What else can it do?", messages: messages)
|
|
202
227
|
|
|
203
|
-
# result.messages contains the (possibly summarized) messages array
|
|
204
|
-
```
|
|
205
|
-
|
|
206
|
-
## Opt-in client wrapping
|
|
207
|
-
|
|
208
|
-
Transparently wrap an existing LLM client class so all calls through it are automatically optimized:
|
|
209
|
-
|
|
210
|
-
```ruby
|
|
211
|
-
LlmOptimizer.wrap_client(OpenAI::Client)
|
|
212
|
-
```
|
|
213
|
-
|
|
214
|
-
This prepends the optimization pipeline into the client's `chat` method. Safe to call multiple times idempotent.
|
|
215
|
-
|
|
216
228
|
## OptimizeResult
|
|
217
229
|
|
|
218
230
|
Every call returns an `OptimizeResult` struct:
|
|
@@ -226,20 +238,9 @@ Every call returns an `OptimizeResult` struct:
|
|
|
226
238
|
| `original_tokens` | Integer | Estimated token count before compression |
|
|
227
239
|
| `compressed_tokens` | Integer | Estimated token count after compression (`nil` if not compressed) |
|
|
228
240
|
| `latency_ms` | Float | Total wall-clock time for the optimize call |
|
|
229
|
-
| `messages` | Array | Final messages array
|
|
230
|
-
|
|
231
|
-
## Error handling
|
|
232
|
-
|
|
233
|
-
The gem defines a hierarchy of errors, all inheriting from `LlmOptimizer::Error`:
|
|
234
|
-
|
|
235
|
-
```
|
|
236
|
-
LlmOptimizer::Error
|
|
237
|
-
├── LlmOptimizer::ConfigurationError # unknown config key, missing llm_caller
|
|
238
|
-
├── LlmOptimizer::EmbeddingError # embedding API failure
|
|
239
|
-
└── LlmOptimizer::TimeoutError # network timeout exceeded
|
|
240
|
-
```
|
|
241
|
+
| `messages` | Array | Final messages array sent to the LLM, after history management and conversation hydration (`nil` on a cache hit) |
|
|
241
242
|
|
|
242
|
-
The
|
|
243
|
+
The `messages` field reflects the actual array passed to `messages_caller` (or built from `conversation_id`), including any summarization applied by the history manager. You can pass it back as `options[:messages]` on the next call to continue a stateless conversation.
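For example, a stateless round-trip might look like this (prompts are illustrative):

```ruby
history = [
  { role: "user",      content: "What is Redis?" },
  { role: "assistant", content: "Redis is an in-memory data store." }
]

first  = LlmOptimizer.optimize("What else can it do?", messages: history)
# Feed the (possibly summarized) array back in on the next turn.
# Without conversation_id, the latest user/assistant turn is not appended
# automatically — add it yourself if the next call should see it.
second = LlmOptimizer.optimize("How does persistence work?", messages: first.messages)
```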
|
|
243
244
|
|
|
244
245
|
## Resilience
|
|
245
246
|
|
|
@@ -249,7 +250,9 @@ The gateway catches all component failures and falls through to a raw LLM call w
|
|
|
249
250
|
| Redis unavailable (write) | Log warning, return LLM result normally |
|
|
250
251
|
| Embedding API failure | Treat as cache miss, continue |
|
|
251
252
|
| Any component exception | Log error, fall through to raw LLM call |
|
|
252
|
-
| History summarization failure | Log
|
|
253
|
+
| History summarization failure | Log warning, return original messages unchanged |
|
|
254
|
+
| Conversation load failure | Log warning, proceed without history |
|
|
255
|
+
| Conversation save failure | Log warning, return result with pre-save messages |
|
|
253
256
|
|
|
254
257
|
## Development
|
|
255
258
|
|
|
@@ -15,8 +15,8 @@ LlmOptimizer.configure do |config|
|
|
|
15
15
|
# --- Model routing ---
|
|
16
16
|
# :auto classifies each prompt; :simple or :complex forces a tier
|
|
17
17
|
config.route_to = :auto
|
|
18
|
-
config.simple_model = "
|
|
19
|
-
config.complex_model = "
|
|
18
|
+
config.simple_model = "gemini-1.5-flash"
|
|
19
|
+
config.complex_model = "claude-haiku-4-5"
|
|
20
20
|
|
|
21
21
|
# --- Redis (required only if use_semantic_cache: true) ---
|
|
22
22
|
config.redis_url = ENV.fetch("REDIS_URL", nil)
|
|
@@ -76,4 +76,42 @@ LlmOptimizer.configure do |config|
|
|
|
76
76
|
# }
|
|
77
77
|
#
|
|
78
78
|
# config.classifier_caller = nil
|
|
79
|
+
|
|
80
|
+
# --- Messages caller (optional) ---
|
|
81
|
+
# Messages caller for the history manager and conversation summaries - optional
|
|
82
|
+
# config.system_prompt = "You are a helpful person who gives responses in a non harmful way. " \
|
|
83
|
+
# "If any serious question is asked, handle it in effectively."
|
|
84
|
+
# OpenAI implementation -
|
|
85
|
+
# config.messages_caller = ->(messages, model:) {
|
|
86
|
+
# response = $openai.chat(
|
|
87
|
+
# parameters: {
|
|
88
|
+
# model: model,
|
|
89
|
+
# messages: messages.map { |m| { role: m[:role], content: m[:content] } }
|
|
90
|
+
# }
|
|
91
|
+
# )
|
|
92
|
+
# response.dig("choices", 0, "message", "content")
|
|
93
|
+
# }
|
|
94
|
+
|
|
95
|
+
# RubyLLM implementation -
|
|
96
|
+
# config.messages_caller = ->(messages, model:) {
|
|
97
|
+
# chat = RubyLLM.chat(model: model)
|
|
98
|
+
# messages[0..-2].each { |m| chat.add_message(role: m[:role], content: m[:content]) }
|
|
99
|
+
# chat.ask(messages.last[:content]).content
|
|
100
|
+
# }
|
|
101
|
+
|
|
102
|
+
# Anthropic implementation -
|
|
103
|
+
# config.messages_caller = ->(messages, model:) {
|
|
104
|
+
# # Anthropic separates system messages from the messages array
|
|
105
|
+
# system_msg = messages.find { |m| m[:role] == "system" }&.dig(:content)
|
|
106
|
+
# chat_msgs = messages.reject { |m| m[:role] == "system" }
|
|
107
|
+
# .map { |m| { role: m[:role], content: m[:content] } }
|
|
108
|
+
|
|
109
|
+
# response = $anthropic.messages(
|
|
110
|
+
# model: model,
|
|
111
|
+
# max_tokens: 1024,
|
|
112
|
+
# system: system_msg,
|
|
113
|
+
# messages: chat_msgs
|
|
114
|
+
# )
|
|
115
|
+
# response["content"].first["text"]
|
|
116
|
+
# }
|
|
79
117
|
end
|
|
@@ -22,6 +22,9 @@ module LlmOptimizer
|
|
|
22
22
|
llm_caller
|
|
23
23
|
embedding_caller
|
|
24
24
|
classifier_caller
|
|
25
|
+
conversation_ttl
|
|
26
|
+
system_prompt
|
|
27
|
+
messages_caller
|
|
25
28
|
].freeze
|
|
26
29
|
|
|
27
30
|
# Define readers for all known keys (setters below track explicit sets)
|
|
@@ -47,6 +50,8 @@ module LlmOptimizer
|
|
|
47
50
|
@llm_caller = nil
|
|
48
51
|
@embedding_caller = nil
|
|
49
52
|
@classifier_caller = nil
|
|
53
|
+
@conversation_ttl = 86_400
|
|
54
|
+
@system_prompt = nil
|
|
50
55
|
end
|
|
51
56
|
|
|
52
57
|
# Copies only explicitly set keys from other_config without resetting unmentioned keys.
|
|
@@ -0,0 +1,83 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module LlmOptimizer
|
|
4
|
+
class ConversationStore
|
|
5
|
+
KEY_NAMESPACE = "llm_optimizer:conversation:"
|
|
6
|
+
|
|
7
|
+
def initialize(redis_client, ttl:, logger:, debug_logging: false, system_prompt: nil)
|
|
8
|
+
@redis = redis_client
|
|
9
|
+
@ttl = ttl
|
|
10
|
+
@logger = logger
|
|
11
|
+
@debug_logging = debug_logging
|
|
12
|
+
@system_prompt = system_prompt
|
|
13
|
+
end
|
|
14
|
+
|
|
15
|
+
# Loads and returns the messages array for conversation_id.
|
|
16
|
+
# Returns [] if no key exists or on Redis error (logs warning).
|
|
17
|
+
def load(conversation_id)
|
|
18
|
+
key = redis_key(conversation_id)
|
|
19
|
+
raw = @redis.get(key)
|
|
20
|
+
|
|
21
|
+
if raw.nil?
|
|
22
|
+
messages = seed_messages
|
|
23
|
+
@logger.info("[llm_optimizer] ConversationStore load: conversation_id=#{conversation_id}, count=#{messages.size}")
|
|
24
|
+
log_debug_history(conversation_id, messages)
|
|
25
|
+
return messages
|
|
26
|
+
end
|
|
27
|
+
|
|
28
|
+
messages = JSON.parse(raw, symbolize_names: true)
|
|
29
|
+
@logger.info("[llm_optimizer] ConversationStore load: conversation_id=#{conversation_id}, count=#{messages.size}")
|
|
30
|
+
log_debug_history(conversation_id, messages)
|
|
31
|
+
messages
|
|
32
|
+
rescue Redis::BaseError => e
|
|
33
|
+
@logger.warn("[llm_optimizer] ConversationStore load failed: conversation_id=#{conversation_id}, error=#{e.message}")
|
|
34
|
+
[]
|
|
35
|
+
end
|
|
36
|
+
|
|
37
|
+
# Appends user + assistant messages to history and persists to Redis.
|
|
38
|
+
# Silently logs warning on Redis error; never raises.
|
|
39
|
+
def save(conversation_id, messages, prompt, response)
|
|
40
|
+
updated_messages = messages + [
|
|
41
|
+
{ role: "user", content: prompt },
|
|
42
|
+
{ role: "assistant", content: response }
|
|
43
|
+
]
|
|
44
|
+
|
|
45
|
+
key = redis_key(conversation_id)
|
|
46
|
+
json = JSON.generate(updated_messages)
|
|
47
|
+
|
|
48
|
+
if @ttl.zero?
|
|
49
|
+
@redis.set(key, json)
|
|
50
|
+
else
|
|
51
|
+
@redis.set(key, json, ex: @ttl)
|
|
52
|
+
end
|
|
53
|
+
|
|
54
|
+
@logger.info("[llm_optimizer] ConversationStore save: conversation_id=#{conversation_id}, count=#{updated_messages.size}")
|
|
55
|
+
log_debug_history(conversation_id, updated_messages)
|
|
56
|
+
updated_messages
|
|
57
|
+
rescue Redis::BaseError => e
|
|
58
|
+
@logger.warn("[llm_optimizer] ConversationStore save failed: conversation_id=#{conversation_id}, error=#{e.message}")
|
|
59
|
+
nil
|
|
60
|
+
end
|
|
61
|
+
|
|
62
|
+
private
|
|
63
|
+
|
|
64
|
+
def redis_key(conversation_id)
|
|
65
|
+
"#{KEY_NAMESPACE}#{conversation_id}"
|
|
66
|
+
end
|
|
67
|
+
|
|
68
|
+
def seed_messages
|
|
69
|
+
return [] unless @system_prompt
|
|
70
|
+
|
|
71
|
+
[
|
|
72
|
+
{ role: "user", content: @system_prompt },
|
|
73
|
+
{ role: "assistant", content: "Got it!" }
|
|
74
|
+
]
|
|
75
|
+
end
|
|
76
|
+
|
|
77
|
+
def log_debug_history(conversation_id, messages)
|
|
78
|
+
return unless @debug_logging
|
|
79
|
+
|
|
80
|
+
@logger.debug("[llm_optimizer] ConversationStore history: conversation_id=#{conversation_id}, messages=#{messages.inspect}")
|
|
81
|
+
end
|
|
82
|
+
end
|
|
83
|
+
end
|
|
@@ -10,10 +10,12 @@ module LlmOptimizer
|
|
|
10
10
|
Classify the following prompt as either 'simple' or 'complex'.
|
|
11
11
|
|
|
12
12
|
Rules:
|
|
13
|
-
- simple: factual questions, basic lookups, short explanations, greetings
|
|
13
|
+
- simple: factual questions, basic lookups, short explanations, greetings, chitchat, general statements, basic arithmetic (addition, subtraction, multiplication, division)
|
|
14
|
+
Examples - Hello, Bye, You are funny, how are you?, what is the capital of France, tell me about yourself, what is 2 + 3 - 1 * 10 / 2, etc.
|
|
14
15
|
- complex: code generation, debugging, architecture, multi-step reasoning, analysis
|
|
16
|
+
Examples - how does pandas extract my information, debug this code, why do RAG apps consume more tokens, give me code to print a star pattern in Python, etc.
|
|
15
17
|
|
|
16
|
-
Reply with exactly one word: simple or complex
|
|
18
|
+
Reply with exactly one word, no punctuation: simple or complex
|
|
17
19
|
|
|
18
20
|
Prompt: %<prompt>s
|
|
19
21
|
PROMPT
|
|
@@ -48,9 +50,12 @@ module LlmOptimizer
|
|
|
48
50
|
def classify_with_llm(prompt)
|
|
49
51
|
classifier_prompt = format(CLASSIFIER_PROMPT, prompt: prompt)
|
|
50
52
|
response = @config.classifier_caller.call(classifier_prompt)
|
|
51
|
-
normalized = response.to_s.strip.downcase
|
|
52
|
-
|
|
53
|
-
|
|
53
|
+
normalized = response.to_s.strip.downcase
|
|
54
|
+
|
|
55
|
+
# Check for word boundary match to handle responses like
|
|
56
|
+
# "simple." / "**simple**" / "the answer is simple"
|
|
57
|
+
return :simple if normalized.match?(/\bsimple\b/)
|
|
58
|
+
return :complex if normalized.match?(/\bcomplex\b/)
|
|
54
59
|
|
|
55
60
|
nil # unrecognized response — fall through to heuristic
|
|
56
61
|
rescue StandardError
|
|
@@ -0,0 +1,173 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module LlmOptimizer
|
|
4
|
+
# Internal pipeline helpers — not part of the public API.
|
|
5
|
+
# Extended into LlmOptimizer as private class methods.
|
|
6
|
+
module Pipeline
|
|
7
|
+
private
|
|
8
|
+
|
|
9
|
+
def build_call_config(options, &block)
|
|
10
|
+
cfg = Configuration.new
|
|
11
|
+
cfg.merge!(configuration)
|
|
12
|
+
options.each do |k, v|
|
|
13
|
+
next unless Configuration::KNOWN_KEYS.include?(k.to_sym)
|
|
14
|
+
|
|
15
|
+
cfg.public_send(:"#{k}=", v)
|
|
16
|
+
end
|
|
17
|
+
block&.call(cfg)
|
|
18
|
+
cfg
|
|
19
|
+
end
|
|
20
|
+
|
|
21
|
+
def validate_conversation_options!(conversation_id, options, call_config)
|
|
22
|
+
if conversation_id && options[:messages]
|
|
23
|
+
raise ConfigurationError,
|
|
24
|
+
"conversation_id and messages: are mutually exclusive — pass one or the other"
|
|
25
|
+
end
|
|
26
|
+
|
|
27
|
+
return unless conversation_id && call_config.redis_url.nil?
|
|
28
|
+
|
|
29
|
+
raise ConfigurationError,
|
|
30
|
+
"redis_url must be configured to use conversation_id"
|
|
31
|
+
end
|
|
32
|
+
|
|
33
|
+
def compress(prompt, config)
|
|
34
|
+
return [prompt, nil] unless config.compress_prompt
|
|
35
|
+
|
|
36
|
+
compressed = Compressor.new.compress(prompt)
|
|
37
|
+
[compressed, Compressor.new.estimate_tokens(compressed)]
|
|
38
|
+
end
|
|
39
|
+
|
|
40
|
+
def route(prompt, config)
|
|
41
|
+
router = ModelRouter.new(config)
|
|
42
|
+
model_tier = router.route(prompt)
|
|
43
|
+
model = model_tier == :simple ? config.simple_model : config.complex_model
|
|
44
|
+
[model_tier, model]
|
|
45
|
+
end
|
|
46
|
+
|
|
47
|
+
def semantic_cache_lookup(prompt, model, model_tier, original_tokens,
|
|
48
|
+
compressed_tokens, original_prompt, start, config)
|
|
49
|
+
return [nil, nil] unless config.use_semantic_cache
|
|
50
|
+
|
|
51
|
+
emb_client = EmbeddingClient.new(
|
|
52
|
+
model: config.embedding_model,
|
|
53
|
+
timeout_seconds: config.timeout_seconds,
|
|
54
|
+
embedding_caller: config.embedding_caller
|
|
55
|
+
)
|
|
56
|
+
embedding = emb_client.embed(prompt)
|
|
57
|
+
embedding, result = check_cache_hit(embedding, prompt, model, model_tier,
|
|
58
|
+
original_tokens, compressed_tokens,
|
|
59
|
+
original_prompt, start, config)
|
|
60
|
+
[embedding, result]
|
|
61
|
+
rescue EmbeddingError => e
|
|
62
|
+
config.logger.warn("[llm_optimizer] EmbeddingError (treating as cache miss): #{e.message}")
|
|
63
|
+
[nil, nil]
|
|
64
|
+
end
|
|
65
|
+
|
|
66
|
+
def load_conversation(conversation_id, options, config)
|
|
67
|
+
return [options[:messages], nil] unless conversation_id
|
|
68
|
+
|
|
69
|
+
redis = build_redis(config.redis_url)
|
|
70
|
+
store = ConversationStore.new(redis,
|
|
71
|
+
ttl: config.conversation_ttl,
|
|
72
|
+
logger: config.logger,
|
|
73
|
+
debug_logging: config.debug_logging,
|
|
74
|
+
system_prompt: config.system_prompt)
|
|
75
|
+
[store.load(conversation_id), store]
|
|
76
|
+
end
|
|
77
|
+
|
|
78
|
+
def apply_history_manager(messages, config)
|
|
79
|
+
return messages unless config.manage_history && messages
|
|
80
|
+
|
|
81
|
+
llm_caller = ->(p, model:) { raw_llm_call(p, model: model, config: config) }
|
|
82
|
+
history_mgr = HistoryManager.new(
|
|
83
|
+
llm_caller: llm_caller,
|
|
84
|
+
simple_model: config.simple_model,
|
|
85
|
+
token_budget: config.token_budget
|
|
86
|
+
)
|
|
87
|
+
history_mgr.process(messages)
|
|
88
|
+
end
|
|
89
|
+
|
|
90
|
+
def persist_conversation(store, conversation_id, messages, prompt, response)
|
|
91
|
+
return messages unless store && conversation_id
|
|
92
|
+
|
|
93
|
+
store.save(conversation_id, messages, prompt, response) || messages
|
|
94
|
+
end
|
|
95
|
+
|
|
96
|
+
def store_in_cache(embedding, response, config)
|
|
97
|
+
return unless config.use_semantic_cache && embedding && config.redis_url
|
|
98
|
+
|
|
99
|
+
redis = build_redis(config.redis_url)
|
|
100
|
+
cache = SemanticCache.new(redis, threshold: config.similarity_threshold, ttl: config.cache_ttl)
|
|
101
|
+
cache.store(embedding, response)
|
|
102
|
+
rescue StandardError => e
|
|
103
|
+
config.logger.warn("[llm_optimizer] SemanticCache store failed: #{e.message}")
|
|
104
|
+
end
|
|
105
|
+
|
|
106
|
+
def build_result(response, model, model_tier, cache_status,
|
|
107
|
+
original_tokens, compressed_tokens, latency_ms, messages)
|
|
108
|
+
OptimizeResult.new(
|
|
109
|
+
response: response, model: model, model_tier: model_tier,
|
|
110
|
+
cache_status: cache_status, original_tokens: original_tokens,
|
|
111
|
+
compressed_tokens: compressed_tokens, latency_ms: latency_ms,
|
|
112
|
+
messages: messages
|
|
113
|
+
)
|
|
114
|
+
end
|
|
115
|
+
|
|
116
|
+
def fallback_result(original_prompt, original_tokens, options, start)
|
|
117
|
+
latency_ms = elapsed_ms(start)
|
|
118
|
+
response = raw_llm_call(original_prompt, model: nil, config: configuration)
|
|
119
|
+
build_result(response, nil, nil, :miss, original_tokens || 0, nil,
|
|
120
|
+
latency_ms, options[:messages])
|
|
121
|
+
end
|
|
122
|
+
|
|
123
|
+
def raw_llm_call(prompt, model:, messages: nil, config: nil)
|
|
124
|
+
if messages && !messages.empty? && config&.messages_caller
|
|
125
|
+
config.messages_caller.call(messages + [{ role: "user", content: prompt }], model: model)
|
|
126
|
+
else
|
|
127
|
+
llm = config&.llm_caller || @_current_llm_caller
|
|
128
|
+
raise ConfigurationError, "No llm_caller configured." unless llm
|
|
129
|
+
|
|
130
|
+
llm.call(prompt, model: model)
|
|
131
|
+
end
|
|
132
|
+
end
|
|
133
|
+
|
|
134
|
+
def elapsed_ms(start)
|
|
135
|
+
((Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000).round(2)
|
|
136
|
+
end
|
|
137
|
+
|
|
138
|
+
def emit_log(logger, config, cache_status:, model_tier:, original_tokens:,
|
|
139
|
+
compressed_tokens:, latency_ms:, prompt:, response:)
|
|
140
|
+
logger.info(
|
|
141
|
+
"[llm_optimizer] { cache_status: #{cache_status.inspect}, " \
|
|
142
|
+
"model_tier: #{model_tier.inspect}, " \
|
|
143
|
+
"original_tokens: #{original_tokens.inspect}, " \
|
|
144
|
+
"compressed_tokens: #{compressed_tokens.inspect}, " \
|
|
145
|
+
"latency_ms: #{latency_ms.inspect} }"
|
|
146
|
+
)
|
|
147
|
+
logger.debug("[llm_optimizer] prompt=#{prompt.inspect} response=#{response.inspect}") if config.debug_logging
|
|
148
|
+
end
|
|
149
|
+
|
|
150
|
+
def build_redis(redis_url)
|
|
151
|
+
require "redis"
|
|
152
|
+
Redis.new(url: redis_url)
|
|
153
|
+
end
|
|
154
|
+
|
|
155
|
+
def check_cache_hit(embedding, _prompt, model, model_tier, original_tokens,
|
|
156
|
+
compressed_tokens, original_prompt, start, config)
|
|
157
|
+
return [embedding, nil] unless config.redis_url
|
|
158
|
+
|
|
159
|
+
redis = build_redis(config.redis_url)
|
|
160
|
+
cache = SemanticCache.new(redis, threshold: config.similarity_threshold, ttl: config.cache_ttl)
|
|
161
|
+
cached = cache.lookup(embedding)
|
|
162
|
+
return [embedding, nil] unless cached
|
|
163
|
+
|
|
164
|
+
latency_ms = elapsed_ms(start)
|
|
165
|
+
emit_log(config.logger, config,
|
|
166
|
+
cache_status: :hit, model_tier: model_tier,
|
|
167
|
+
original_tokens: original_tokens, compressed_tokens: compressed_tokens,
|
|
168
|
+
latency_ms: latency_ms, prompt: original_prompt, response: cached)
|
|
169
|
+
[embedding, build_result(cached, model, model_tier, :hit,
|
|
170
|
+
original_tokens, compressed_tokens, latency_ms, nil)]
|
|
171
|
+
end
|
|
172
|
+
end
|
|
173
|
+
end
|
data/lib/llm_optimizer.rb
CHANGED
|
@@ -8,26 +8,21 @@ require_relative "llm_optimizer/model_router"
|
|
|
8
8
|
require_relative "llm_optimizer/embedding_client"
|
|
9
9
|
require_relative "llm_optimizer/semantic_cache"
|
|
10
10
|
require_relative "llm_optimizer/history_manager"
|
|
11
|
+
require_relative "llm_optimizer/conversation_store"
|
|
12
|
+
require_relative "llm_optimizer/pipeline"
|
|
11
13
|
|
|
12
14
|
require "llm_optimizer/railtie" if defined?(Rails)
|
|
13
15
|
|
|
14
16
|
module LlmOptimizer
|
|
15
|
-
# Base error class for all gem-specific exceptions
|
|
16
17
|
class Error < StandardError; end
|
|
17
|
-
|
|
18
|
-
# Raised when an unrecognized configuration key is set
|
|
19
18
|
class ConfigurationError < Error; end
|
|
20
|
-
|
|
21
|
-
# Raised when the embedding API call fails
|
|
22
19
|
class EmbeddingError < Error; end
|
|
23
|
-
|
|
24
|
-
# Raised when a network timeout is exceeded
|
|
25
20
|
class TimeoutError < Error; end
|
|
26
21
|
|
|
27
|
-
# Global configuration
|
|
28
22
|
@configuration = nil
|
|
29
23
|
|
|
30
|
-
|
|
24
|
+
extend Pipeline
|
|
25
|
+
|
|
31
26
|
def self.configure
|
|
32
27
|
temp = Configuration.new
|
|
33
28
|
yield temp
|
|
@@ -35,7 +30,6 @@ module LlmOptimizer
|
|
|
35
30
|
validate_configuration!(configuration)
|
|
36
31
|
end
|
|
37
32
|
|
|
38
|
-
# Warns about misconfigured options rather than failing silently at call time.
|
|
39
33
|
def self.validate_configuration!(config)
|
|
40
34
|
return unless config.use_semantic_cache && config.embedding_caller.nil?
|
|
41
35
|
|
|
@@ -46,36 +40,32 @@ module LlmOptimizer
|
|
|
46
40
|
config.use_semantic_cache = false
|
|
47
41
|
end
|
|
48
42
|
|
|
49
|
-
# Returns the current global Configuration, lazy-initializing if nil.
|
|
50
43
|
def self.configuration
|
|
51
44
|
@configuration ||= Configuration.new
|
|
52
45
|
end
|
|
53
46
|
|
|
54
|
-
# Replaces the global config with a fresh default Configuration.
|
|
55
|
-
# Useful in tests to avoid state leakage.
|
|
56
47
|
def self.reset_configuration!
|
|
57
48
|
@configuration = Configuration.new
|
|
58
49
|
end
|
|
59
50
|
|
|
60
|
-
|
|
61
|
-
|
|
62
|
-
|
|
63
|
-
|
|
64
|
-
|
|
51
|
+
def self.clear_conversation(conversation_id)
|
|
52
|
+
raise ConfigurationError, "redis_url must be configured to use clear_conversation" unless configuration.redis_url
|
|
53
|
+
|
|
54
|
+
redis = build_redis(configuration.redis_url)
|
|
55
|
+
key = "#{ConversationStore::KEY_NAMESPACE}#{conversation_id}"
|
|
56
|
+
deleted = redis.del(key)
|
|
57
|
+
deleted.positive?
|
|
58
|
+
rescue ::Redis::BaseError => e
|
|
59
|
+
raise LlmOptimizer::Error, "Redis error in clear_conversation: #{e.message}"
|
|
60
|
+
end
|
|
61
|
+
|
|
65
62
|
module WrapperModule
|
|
66
|
-
def chat(params, &
|
|
63
|
+
def chat(params, &)
|
|
67
64
|
config = LlmOptimizer.configuration
|
|
68
65
|
prompt = params[:messages] || params[:prompt]
|
|
69
|
-
|
|
70
|
-
# Run pre-call pipeline: compress, route, cache lookup
|
|
71
66
|
result = LlmOptimizer.optimize_pre_call(prompt, config)
|
|
67
|
+
return result[:response] if result[:cache_status] == :hit
|
|
72
68
|
|
|
73
|
-
# Cache hit — return immediately without calling the LLM
|
|
74
|
-
if result[:cache_status] == :hit
|
|
75
|
-
return result[:response]
|
|
76
|
-
end
|
|
77
|
-
|
|
78
|
-
# Apply compressed prompt and routed model, then delegate to original client
|
|
79
69
|
optimized_params = params.merge(model: result[:model])
|
|
80
70
|
if params[:messages]
|
|
81
71
|
optimized_params = optimized_params.merge(messages: result[:prompt])
|
|
@@ -83,264 +73,79 @@ module LlmOptimizer
|
|
|
83
73
|
optimized_params = optimized_params.merge(prompt: result[:prompt])
|
|
84
74
|
end
|
|
85
75
|
|
|
86
|
-
response = super(optimized_params, &
|
|
87
|
-
|
|
88
|
-
# Store in cache after successful LLM call
|
|
76
|
+
response = super(optimized_params, &)
|
|
89
77
|
LlmOptimizer.optimize_post_call(result, response, config)
|
|
90
|
-
|
|
91
78
|
response
|
|
92
79
|
end
|
|
93
80
|
end
|
|
94
81
|
|
|
95
|
-
# Prepends WrapperModule into client_class; idempotent — safe to call N times.
|
|
96
82
|
def self.wrap_client(client_class)
|
|
97
83
|
return if client_class.ancestors.include?(WrapperModule)
|
|
98
84
|
|
|
99
85
|
client_class.prepend(WrapperModule)
|
|
100
86
|
end
|
|
101
87
|
|
|
102
|
-
|
|
103
|
-
|
|
88
|
+
def self.optimize(prompt, options = {}, &)
|
|
89
|
+
start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
|
|
90
|
+
call_config = build_call_config(options, &)
|
|
91
|
+
conversation_id = options[:conversation_id]
|
|
92
|
+
validate_conversation_options!(conversation_id, options, call_config)
|
|
104
93
|
|
|
105
|
-
|
|
106
|
-
|
|
107
|
-
|
|
108
|
-
|
|
109
|
-
start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
|
|
110
|
-
|
|
111
|
-
# Resolve per-call configuration — only pass known config keys
|
|
112
|
-
call_config = Configuration.new
|
|
113
|
-
call_config.merge!(configuration)
|
|
114
|
-
options.each do |k, v|
|
|
115
|
-
next unless LlmOptimizer::Configuration::KNOWN_KEYS.include?(k.to_sym)
|
|
116
|
-
|
|
117
|
-
call_config.public_send(:"#{k}=", v)
|
|
118
|
-
end
|
|
119
|
-
yield call_config if block_given?
|
|
94
|
+
original_prompt = prompt
|
|
95
|
+
original_tokens = Compressor.new.estimate_tokens(prompt)
|
|
96
|
+
prompt, compressed_tokens = compress(prompt, call_config)
|
|
97
|
+
model_tier, model = route(prompt, call_config)
|
|
120
98
|
|
|
121
|
-
|
|
99
|
+
embedding, cached_result = semantic_cache_lookup(prompt, model, model_tier,
|
|
100
|
+
original_tokens, compressed_tokens,
|
|
101
|
+
original_prompt, start, call_config)
|
|
102
|
+
return cached_result if cached_result
|
|
122
103
|
|
|
123
|
-
|
|
124
|
-
|
|
104
|
+
messages, store = load_conversation(conversation_id, options, call_config)
|
|
105
|
+
messages = apply_history_manager(messages, call_config)
|
|
106
|
+
response = raw_llm_call(prompt, messages: messages, model: model, config: call_config)
|
|
107
|
+
messages = persist_conversation(store, conversation_id, messages, prompt, response)
|
|
108
|
+
store_in_cache(embedding, response, call_config)
|
|
125
109
|
|
|
126
|
-
# Compression
|
|
127
|
-
compressor = Compressor.new
|
|
128
|
-
original_tokens = compressor.estimate_tokens(prompt)
|
|
129
|
-
compressed_tokens = nil
|
|
130
|
-
|
|
131
|
-
if call_config.compress_prompt
|
|
132
|
-
prompt = compressor.compress(prompt)
|
|
133
|
-
compressed_tokens = compressor.estimate_tokens(prompt)
|
|
134
|
-
end
|
|
135
|
-
|
|
136
|
-
# Model routing
|
|
137
|
-
router = ModelRouter.new(call_config)
|
|
138
|
-
model_tier = router.route(prompt)
|
|
139
|
-
model = model_tier == :simple ? call_config.simple_model : call_config.complex_model
|
|
140
|
-
|
|
141
|
-
# Semantic cache lookup
|
|
142
|
-
embedding = nil
|
|
143
|
-
|
|
144
|
-
if call_config.use_semantic_cache
|
|
145
|
-
begin
|
|
146
|
-
emb_client = EmbeddingClient.new(
|
|
147
|
-
model: call_config.embedding_model,
|
|
148
|
-
timeout_seconds: call_config.timeout_seconds,
|
|
149
|
-
embedding_caller: call_config.embedding_caller
|
|
150
|
-
)
|
|
151
|
-
embedding = emb_client.embed(prompt)
|
|
152
|
-
|
|
153
|
-
if call_config.redis_url
|
|
154
|
-
redis = build_redis(call_config.redis_url)
|
|
155
|
-
cache = SemanticCache.new(redis, threshold: call_config.similarity_threshold, ttl: call_config.cache_ttl)
|
|
156
|
-
cached = cache.lookup(embedding)
|
|
157
|
-
|
|
158
|
-
if cached
|
|
159
|
-
latency_ms = elapsed_ms(start)
|
|
160
|
-
emit_log(logger, call_config,
|
|
161
|
-
cache_status: :hit, model_tier: model_tier,
|
|
162
|
-
original_tokens: original_tokens, compressed_tokens: compressed_tokens,
|
|
163
|
-
latency_ms: latency_ms, prompt: original_prompt, response: cached)
|
|
164
|
-
return OptimizeResult.new(
|
|
165
|
-
response: cached,
|
|
166
|
-
model: model,
|
|
167
|
-
model_tier: model_tier,
|
|
168
|
-
cache_status: :hit,
|
|
169
|
-
original_tokens: original_tokens,
|
|
170
|
-
compressed_tokens: compressed_tokens,
|
|
171
|
-
latency_ms: latency_ms,
|
|
172
|
-
messages: options[:messages]
|
|
173
|
-
)
|
|
174
|
-
end
|
|
175
|
-
end
|
|
176
|
-
rescue EmbeddingError => e
|
|
177
|
-
logger.warn("[llm_optimizer] EmbeddingError (treating as cache miss): #{e.message}")
|
|
178
|
-
embedding = nil
|
|
179
|
-
# continue pipeline as cache miss
|
|
180
|
-
end
|
|
181
|
-
end
|
|
182
|
-
|
|
183
|
-
# History management
|
|
184
|
-
messages = options[:messages]
|
|
185
|
-
if call_config.manage_history && messages
|
|
186
|
-
llm_caller = ->(p, model:) { raw_llm_call(p, model: model, config: call_config) }
|
|
187
|
-
history_mgr = HistoryManager.new(
|
|
188
|
-
llm_caller: llm_caller,
|
|
189
|
-
simple_model: call_config.simple_model,
|
|
190
|
-
token_budget: call_config.token_budget
|
|
191
|
-
)
|
|
192
|
-
messages = history_mgr.process(messages)
|
|
193
|
-
end
|
|
194
|
-
|
|
195
|
-
# Raw LLM call
|
|
196
|
-
response = raw_llm_call(prompt, model: model, config: call_config)
|
|
197
|
-
|
|
198
|
-
# Cache store
|
|
199
|
-
if call_config.use_semantic_cache && embedding && call_config.redis_url
|
|
200
|
-
begin
|
|
201
|
-
redis = build_redis(call_config.redis_url)
|
|
202
|
-
cache = SemanticCache.new(redis, threshold: call_config.similarity_threshold, ttl: call_config.cache_ttl)
|
|
203
|
-
cache.store(embedding, response)
|
|
204
|
-
rescue StandardError => e
|
|
205
|
-
logger.warn("[llm_optimizer] SemanticCache store failed: #{e.message}")
|
|
206
|
-
end
|
|
207
|
-
end
|
|
208
|
-
|
|
209
|
-
# Build result
|
|
210
110
|
latency_ms = elapsed_ms(start)
|
|
211
|
-
emit_log(logger, call_config,
|
|
111
|
+
emit_log(call_config.logger, call_config,
|
|
212
112
|
cache_status: :miss, model_tier: model_tier,
|
|
213
113
|
original_tokens: original_tokens, compressed_tokens: compressed_tokens,
|
|
214
114
|
latency_ms: latency_ms, prompt: original_prompt, response: response)
|
|
215
|
-
|
|
216
|
-
|
|
217
|
-
response: response,
|
|
218
|
-
model: model,
|
|
219
|
-
model_tier: model_tier,
|
|
220
|
-
cache_status: :miss,
|
|
221
|
-
original_tokens: original_tokens,
|
|
222
|
-
compressed_tokens: compressed_tokens,
|
|
223
|
-
latency_ms: latency_ms,
|
|
224
|
-
messages: messages
|
|
225
|
-
)
|
|
115
|
+
build_result(response, model, model_tier, :miss, original_tokens, compressed_tokens,
|
|
116
|
+
latency_ms, messages)
|
|
226
117
|
rescue EmbeddingError => e
|
|
227
|
-
|
|
228
|
-
|
|
229
|
-
|
|
230
|
-
|
|
231
|
-
response = raw_llm_call(original_prompt, model: nil, config: configuration)
|
|
232
|
-
OptimizeResult.new(
|
|
233
|
-
response: response,
|
|
234
|
-
model: nil,
|
|
235
|
-
model_tier: nil,
|
|
236
|
-
cache_status: :miss,
|
|
237
|
-
original_tokens: original_tokens || 0,
|
|
238
|
-
compressed_tokens: nil,
|
|
239
|
-
latency_ms: latency_ms,
|
|
240
|
-
messages: options[:messages]
|
|
241
|
-
)
|
|
118
|
+
configuration.logger.warn("[llm_optimizer] EmbeddingError (outer rescue): #{e.message}")
|
|
119
|
+
fallback_result(original_prompt, original_tokens, options, start)
|
|
120
|
+
rescue ConfigurationError
|
|
121
|
+
raise
|
|
242
122
|
rescue LlmOptimizer::Error, StandardError => e
|
|
243
|
-
logger
|
|
244
|
-
|
|
245
|
-
latency_ms = elapsed_ms(start)
|
|
246
|
-
response = raw_llm_call(original_prompt, model: nil, config: configuration)
|
|
247
|
-
OptimizeResult.new(
|
|
248
|
-
response: response,
|
|
249
|
-
model: nil,
|
|
250
|
-
model_tier: nil,
|
|
251
|
-
cache_status: :miss,
|
|
252
|
-
original_tokens: original_tokens || 0,
|
|
253
|
-
compressed_tokens: nil,
|
|
254
|
-
latency_ms: latency_ms,
|
|
255
|
-
messages: options[:messages]
|
|
256
|
-
)
|
|
123
|
+
configuration.logger.error("[llm_optimizer] #{e.class}: #{e.message}\n#{e.backtrace&.first(5)&.join("\n")}")
|
|
124
|
+
fallback_result(original_prompt, original_tokens, options, start)
|
|
257
125
|
end
|
|
258
126
|
|
|
259
|
-
# Pre-call pipeline for wrap_client: compress, route, cache lookup.
|
|
260
|
-
# Returns a hash with :prompt, :model, :model_tier, :embedding, :cache_status, :response.
|
|
261
|
-
# Does NOT make an LLM call — the wrapped client handles that via super.
|
|
262
127
|
def self.optimize_pre_call(prompt, config = configuration)
|
|
263
|
-
|
|
264
|
-
|
|
265
|
-
|
|
266
|
-
router = ModelRouter.new(config)
|
|
267
|
-
model_tier = router.route(prompt)
|
|
128
|
+
prompt = Compressor.new.compress(prompt) if config.compress_prompt
|
|
129
|
+
model_tier = ModelRouter.new(config).route(prompt)
|
|
268
130
|
model = model_tier == :simple ? config.simple_model : config.complex_model
|
|
269
131
|
|
|
270
|
-
|
|
271
|
-
|
|
272
|
-
|
|
273
|
-
|
|
274
|
-
|
|
275
|
-
|
|
276
|
-
|
|
277
|
-
|
|
278
|
-
|
|
279
|
-
|
|
280
|
-
cache = SemanticCache.new(redis, threshold: config.similarity_threshold, ttl: config.cache_ttl)
|
|
281
|
-
cached = cache.lookup(embedding)
|
|
282
|
-
return { prompt: prompt, model: model, model_tier: model_tier,
|
|
283
|
-
embedding: embedding, cache_status: :hit, response: cached } if cached
|
|
284
|
-
rescue EmbeddingError => e
|
|
285
|
-
config.logger.warn("[llm_optimizer] wrap_client EmbeddingError (cache miss): #{e.message}")
|
|
286
|
-
embedding = nil
|
|
287
|
-
end
|
|
132
|
+
unless config.use_semantic_cache && config.redis_url
|
|
133
|
+
return { prompt: prompt, model: model, model_tier: model_tier,
|
|
134
|
+
embedding: nil, cache_status: :miss, response: nil }
|
|
135
|
+
end
|
|
136
|
+
|
|
137
|
+
embedding, result = semantic_cache_lookup(prompt, model, model_tier, nil, nil,
|
|
138
|
+
prompt, Process.clock_gettime(Process::CLOCK_MONOTONIC), config)
|
|
139
|
+
if result
|
|
140
|
+
return { prompt: prompt, model: model, model_tier: model_tier,
|
|
141
|
+
embedding: embedding, cache_status: :hit, response: result.response }
|
|
288
142
|
end
|
|
289
143
|
|
|
290
144
|
{ prompt: prompt, model: model, model_tier: model_tier,
|
|
291
145
|
embedding: embedding, cache_status: :miss, response: nil }
|
|
292
146
|
end
|
|
293
147
|
|
|
294
|
-
# Post-call: store the LLM response in the semantic cache if applicable.
|
|
295
148
|
def self.optimize_post_call(pre_call_result, response, config = configuration)
|
|
296
|
-
|
|
297
|
-
return unless pre_call_result[:embedding]
|
|
298
|
-
|
|
299
|
-
redis = build_redis(config.redis_url)
|
|
300
|
-
cache = SemanticCache.new(redis, threshold: config.similarity_threshold, ttl: config.cache_ttl)
|
|
301
|
-
cache.store(pre_call_result[:embedding], response)
|
|
302
|
-
rescue StandardError => e
|
|
303
|
-
config.logger.warn("[llm_optimizer] wrap_client cache store failed: #{e.message}")
|
|
304
|
-
end
|
|
305
|
-
|
|
306
|
-
# Private helpers
|
|
307
|
-
|
|
308
|
-
class << self
|
|
309
|
-
private
|
|
310
|
-
|
|
311
|
-
def raw_llm_call(prompt, model:, config: nil)
|
|
312
|
-
caller = config&.llm_caller || @_current_llm_caller
|
|
313
|
-
unless caller
|
|
314
|
-
raise ConfigurationError,
|
|
315
|
-
"No llm_caller configured. " \
|
|
316
|
-
"Set it via LlmOptimizer.configure { |c| c.llm_caller = ->(prompt, model:) { ... } }"
|
|
317
|
-
end
|
|
318
|
-
|
|
319
|
-
caller.call(prompt, model: model)
|
|
320
|
-
end
|
|
321
|
-
|
|
322
|
-
def elapsed_ms(start)
|
|
323
|
-
((Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000).round(2)
|
|
324
|
-
end
|
|
325
|
-
|
|
326
|
-
def emit_log(logger, config, cache_status:, model_tier:, original_tokens:,
|
|
327
|
-
compressed_tokens:, latency_ms:, prompt:, response:)
|
|
328
|
-
logger.info(
|
|
329
|
-
"[llm_optimizer] { cache_status: #{cache_status.inspect}, " \
|
|
330
|
-
"model_tier: #{model_tier.inspect}, " \
|
|
331
|
-
"original_tokens: #{original_tokens.inspect}, " \
|
|
332
|
-
"compressed_tokens: #{compressed_tokens.inspect}, " \
|
|
333
|
-
"latency_ms: #{latency_ms.inspect} }"
|
|
334
|
-
)
|
|
335
|
-
|
|
336
|
-
return unless config.debug_logging
|
|
337
|
-
|
|
338
|
-
logger.debug("[llm_optimizer] prompt=#{prompt.inspect} response=#{response.inspect}")
|
|
339
|
-
end
|
|
340
|
-
|
|
341
|
-
def build_redis(redis_url)
|
|
342
|
-
require "redis"
|
|
343
|
-
Redis.new(url: redis_url)
|
|
344
|
-
end
|
|
149
|
+
store_in_cache(pre_call_result[:embedding], response, config)
|
|
345
150
|
end
|
|
346
151
|
end
|
metadata
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: llm_optimizer
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 0.1.
|
|
4
|
+
version: 0.1.5
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- arun kumar
|
|
@@ -100,10 +100,12 @@ files:
|
|
|
100
100
|
- lib/llm_optimizer.rb
|
|
101
101
|
- lib/llm_optimizer/compressor.rb
|
|
102
102
|
- lib/llm_optimizer/configuration.rb
|
|
103
|
+
- lib/llm_optimizer/conversation_store.rb
|
|
103
104
|
- lib/llm_optimizer/embedding_client.rb
|
|
104
105
|
- lib/llm_optimizer/history_manager.rb
|
|
105
106
|
- lib/llm_optimizer/model_router.rb
|
|
106
107
|
- lib/llm_optimizer/optimize_result.rb
|
|
108
|
+
- lib/llm_optimizer/pipeline.rb
|
|
107
109
|
- lib/llm_optimizer/railtie.rb
|
|
108
110
|
- lib/llm_optimizer/semantic_cache.rb
|
|
109
111
|
- lib/llm_optimizer/version.rb
|