llm_optimizer 0.1.4 → 0.1.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: c6903d2d4c2163d93ffe8d0d5ad9708d64a8472a430ed9f266c9237e468c8585
-  data.tar.gz: c7270f4717ece6778976f46f1601f9e5d45939e3e7926ea7e3ed05b3b641f413
+  metadata.gz: 3a0ec4bdfa750f16155927a3e00c9fe2c1c39da7e85866eb6c65855ac6eebaef
+  data.tar.gz: 0e5820f0503fbef14dc1ad858dfaa7527e3dba278fbf7640df377d82fbc61ad7
 SHA512:
-  metadata.gz: 858cad7443f7adcbe42b3d5ce62b4e815081d2238b7711066276ee2a7c0fb6a506d267ccb48dbe611a2ed08b2eab29139057dcddc2d033155561499a0d6f5421
-  data.tar.gz: b3afc392e8fb2ef5b7baa468f74f9def34a15db9f6df898fd738503638d32f5dda9b04a6c8f2e005cd94aa893eca864111f3be0f2e8bfa1cc0aeef6391e0ae2c
+  metadata.gz: 8c2f376e324a7678063e66a89b6ad89e476bd699fd3a816c7c91a79b16ba40e09111cfdfacb1206946e2d111122e63cf70babc09a0467821723b2b286eda235a
+  data.tar.gz: 5bba8c343627f230c13f0671cd8b1374ab0405f6c6369457b92e9093ac1cd2f780797797a26fabdc865981ca1e131b6dc80ae4a97342f3de2f3297255d8e13c9
data/CHANGELOG.md CHANGED
@@ -7,6 +7,44 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+## [0.1.6] - 2026-05-04
+
+### Added
+- `with_tools` configuration option (aliased as `tools`) — allows passing function/tool definitions to LLM calls via the `optimize` method
+- Tool support for both `llm_caller` and `messages_caller` — `tools:` keyword argument is now passed to all underlying LLM callers
+- `with_tools` examples in the README and Rails initializer template
+- `cache_scope` configuration option — isolates semantic cache entries into separate namespaces; useful for ensuring cache hits only occur within specific contexts (e.g., user IDs, account types, or dynamic categories)
+
+### Changed
+- `Pipeline#raw_llm_call` refactored to handle global and per-call tools consistently
+- Refactored `Pipeline` to remove duplicate internal method definitions (`semantic_cache_lookup`, `store_in_cache`)
+- `SemanticCache#lookup` return format updated to `[response, token_info]` to support better metadata tracking
+
+### Fixed
+- RuboCop `Metrics/ParameterLists` offense in `OptimizeResult#initialize` by adding targeted override for the necessary result fields
+
+## [0.1.5] - 2026-04-22
+
+### Added
+- `ConversationStore` — Redis-backed conversation persistence under the `llm_optimizer:conversation:<id>` namespace; handles load, save, TTL, and debug logging
+- `conversation_id` option on `LlmOptimizer.optimize` — pass a stable ID and the gem automatically loads history from Redis, calls the LLM with full context, and saves the updated history back; no manual message management required
+- `messages_caller` config option — injectable lambda `(messages, model:) -> String` for LLM providers that accept a full message array (OpenAI chat, Anthropic messages, etc.); takes priority over `llm_caller` when conversation history is present
+- `system_prompt` config option — seeded as the opening exchange when a new conversation is created via `conversation_id`
+- `conversation_ttl` config option — TTL in seconds for Redis conversation keys (default `86400`; `0` for no expiry)
+- `LlmOptimizer.clear_conversation(conversation_id)` — deletes a conversation key from Redis; returns `true` if deleted, `false` if not found
+- `pipeline#load_conversation` and `pipeline#persist_conversation` — internal helpers wiring `ConversationStore` into the optimize pipeline
+- `pipeline#apply_history_manager` — applies `HistoryManager` sliding-window summarization to loaded conversation history when `manage_history: true`
+
+### Changed
+- `HistoryManager` now receives an internal `llm_caller` lambda that routes through `raw_llm_call`, so it correctly uses `messages_caller` when available instead of always requiring `llm_caller`
+- `raw_llm_call` updated to prefer `messages_caller` over `llm_caller` when a non-empty messages array is present
+- `ModelRouter` classifier response matching now uses word-boundary regex (`/\bsimple\b/`, `/\bcomplex\b/`) to handle decorated responses like `"simple."`, `"**complex**"`, or `"the answer is simple"` — previously only exact string match was used
+- `ModelRouter` classifier failures (any `StandardError`) and unrecognized responses both fall through silently to the word-count heuristic; no exception is raised to the caller
+- `validate_conversation_options!` raises `ConfigurationError` if both `conversation_id` and `messages:` are supplied, or if `conversation_id` is used without `redis_url`
+
+### Fixed
+- `HistoryManager` summarization raised `ConfigurationError: No llm_caller configured` when called inside the pipeline without a bound config — internal lambda now correctly captures `call_config`
+
 ## [0.1.4] - 2026-04-13
 
 ### Fixed
@@ -79,7 +117,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - `OptimizeResult` struct with `response`, `model`, `model_tier`, `cache_status`, `original_tokens`, `compressed_tokens`, `latency_ms`, `messages`
 - Unit test suite covering all components with positive and negative scenarios using Minitest + Mocha
 
-[Unreleased]: https://github.com/arunkumarry/llm_optimizer/compare/v0.1.4...HEAD
+[Unreleased]: https://github.com/arunkumarry/llm_optimizer/compare/v0.1.5...HEAD
+[0.1.5]: https://github.com/arunkumarry/llm_optimizer/compare/v0.1.4...v0.1.5
 [0.1.4]: https://github.com/arunkumarry/llm_optimizer/compare/v0.1.3...v0.1.4
 [0.1.3]: https://github.com/arunkumarry/llm_optimizer/compare/v0.1.2...v0.1.3
 [0.1.2]: https://github.com/arunkumarry/llm_optimizer/compare/v0.1.1...v0.1.2
data/README.md CHANGED
@@ -21,8 +21,8 @@ Stores prompt embeddings in Redis. On subsequent calls, computes cosine similari
 
 Classifies each prompt and routes it to the appropriate model tier:
 
-- **Simple** → cheaper/faster model (e.g. `gpt-4o-mini`, `amazon.nova-micro`)
-- **Complex** → premium model (e.g. `claude-3-5-sonnet`, `gpt-4o`)
+- **Simple** → cheaper/faster model (e.g. `llama3`, `gemini-2.5-flash-lite`)
+- **Complex** → premium model (e.g. `claude-haiku-4-5-20251001`, `gemini-3.0-pro`)
 
 Routing uses a three-layer decision chain:
 
@@ -50,7 +50,7 @@ If `classifier_caller` is not set, the router falls back to the word-count heuri
 Removes common English stop words from prompts before sending to the LLM. Preserves fenced code block content unchanged. Typically reduces token count by 10–20%.
 
 ### 4. Conversation History Sliding Window
-When a conversation history exceeds the configured token budget, summarizes the oldest messages using the simple model and replaces them with a single system summary message.
+When a conversation history exceeds the configured token budget, summarizes the oldest messages using the simple model and replaces them with a single system summary message. Conversation history is stored in Redis for fast retrieval and summarization.
 
 ## Installation
 
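The sliding-window behavior described above can be sketched outside the gem. This is a minimal, hypothetical re-implementation, not the gem's `HistoryManager`: `slide_window`, the word-count token estimate, and the `summarize` lambda (standing in for the simple-model LLM call) are all invented names for illustration.

```ruby
# Hypothetical sketch of the sliding window: when estimated tokens exceed
# the budget, the oldest messages collapse into one system summary message.
def slide_window(messages, token_budget:, summarize:)
  estimate = ->(msgs) { msgs.sum { |m| m[:content].split.size } }
  return messages if estimate.call(messages) <= token_budget

  # Keep the most recent messages that still fit within the budget.
  kept = []
  messages.reverse_each do |m|
    break if estimate.call(kept + [m]) > token_budget

    kept.unshift(m)
  end

  oldest  = messages[0...(messages.size - kept.size)]
  summary = summarize.call(oldest) # the gem would call the simple model here
  [{ role: "system", content: summary }] + kept
end
```

The gem's actual summarization prompt, token estimator, and window boundaries may differ; this only shows the shape of the transformation.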
@@ -99,7 +99,7 @@ result = LlmOptimizer.optimize("What is Redis?")
 puts result.response          # => "Redis is an in-memory data store..."
 puts result.cache_status      # => :hit or :miss
 puts result.model_tier        # => :simple or :complex
-puts result.model             # => "gpt-4o-mini"
+puts result.model             # => "gemini-2.5-flash-lite"
 puts result.original_tokens   # => 5
 puts result.compressed_tokens # => 4
 puts result.latency_ms        # => 12.4
@@ -110,39 +110,50 @@ puts result.latency_ms # => 12.4
 
 ### Rails initializer
 
 ```ruby
+# config/initializers/llm_optimizer.rb
+require "llm_optimizer"
+
 LlmOptimizer.configure do |config|
-  # Feature flags all off by default
+  # --- Feature flags (all off by default) ---
   config.compress_prompt = true    # strip stop words before sending to LLM
   config.use_semantic_cache = true # cache responses by vector similarity
   config.manage_history = true     # summarize old messages when over token budget
 
-  # Model routing
-  config.route_to = :auto                             # :auto | :simple | :complex
-  config.simple_model = "gpt-4o-mini"                 # model used for simple prompts
-  config.complex_model = "claude-3-5-sonnet-20241022" # model used for complex prompts
+  # --- Model routing ---
+  config.route_to = :auto                            # :auto, :simple, or :complex
+  config.simple_model = "gemini-2.5-flash-lite"      # used for simple prompts
+  config.complex_model = "claude-haiku-4-5-20251001" # used for complex prompts
 
-  # Redis (required if use_semantic_cache: true)
+  # --- Redis (required if use_semantic_cache: true) ---
   config.redis_url = ENV["REDIS_URL"]
 
-  # Tuning
-  config.similarity_threshold = 0.96 # cosine similarity cutoff for cache hit (0.0–1.0)
-  config.token_budget = 4000         # token limit before history summarization
-  config.cache_ttl = 86400           # cache TTL in seconds (default: 24h)
+  # --- Token / cache settings ---
+  config.similarity_threshold = 0.96 # cosine similarity cutoff for cache hit
+  config.token_budget = 4000         # max tokens before history summarization
+  config.cache_ttl = 86400           # cache TTL in seconds (24h)
   config.timeout_seconds = 5         # timeout for external API calls
 
-  # Logging
+  # --- Logging ---
   config.logger = Rails.logger
-  config.debug_logging = Rails.env.development? # logs full prompt+response at DEBUG level
+  config.debug_logging = Rails.env.development? # logs full prompt+response in dev
 
-  # LLM caller wire to your existing LLM client (required)
+  # --- Wire up your app's LLM client ---
+  # Replace the body with however your app calls the LLM
   config.llm_caller = ->(prompt, model:) {
-    RubyLLM.chat(model: model, assume_model_exists: true).ask(prompt).content
+    model ||= "claude-haiku-4-5-20251001"
+    provider = if model.include?("claude") then :anthropic
+               elsif model.include?("gpt") then :openai
+               elsif model.include?("gemini") then :gemini
+               else :ollama
+               end
+    chat = RubyLLM.chat(model: model, provider: provider, assume_model_exists: true)
+    chat.ask(prompt).content
   }
 
   # Embeddings caller — wire to your embeddings provider (required if use_semantic_cache: true)
-  # Falls back to OpenAI via ENV["OPENAI_API_KEY"] if not set
   config.embedding_caller = ->(text) {
-    MyEmbeddingService.embed(text)
+    response = RubyLLM.embed(text, provider: :gemini, model: 'gemini-embedding-001')
+    response.vectors
   }
 
   # Classifier caller — optional, improves routing accuracy for ambiguous prompts
@@ -151,7 +162,18 @@ LlmOptimizer.configure do |config|
     RubyLLM.chat(model: "amazon.nova-micro-v1:0", provider: :bedrock, assume_model_exists: true)
            .ask(prompt).content.strip.downcase
   }
+
+  # Messages caller (optional): handles conversation summaries and the history manager.
+  config.system_prompt = "You are a sarcastic comic person who gives witty responses in a non harmful way. If any serious question is asked, handle it in a calm way."
+
+  config.messages_caller = ->(messages, model:) {
+    chat = RubyLLM.chat(model: model)
+    messages[0..-2].each { |m| chat.add_message(role: m[:role], content: m[:content]) }
+    response = chat.ask(messages.last[:content])
+    response.content
+  }
 end
+
 ```
 
 ### Configuration reference
@@ -162,19 +184,23 @@ end
 | `use_semantic_cache` | Boolean | `false` | Enable Redis-backed semantic cache |
 | `manage_history` | Boolean | `false` | Enable conversation history summarization |
 | `route_to` | Symbol | `:auto` | `:auto`, `:simple`, or `:complex` |
-| `simple_model` | String | `"gpt-4o-mini"` | Model for simple prompts |
-| `complex_model` | String | `"claude-3-5-sonnet-20241022"` | Model for complex prompts |
+| `simple_model` | String | `"gemini-2.5-flash-lite"` | Model for simple prompts |
+| `complex_model` | String | `"claude-haiku-4-5-20251001"` | Model for complex prompts |
 | `similarity_threshold` | Float | `0.96` | Minimum cosine similarity for cache hit |
 | `token_budget` | Integer | `4000` | Token limit before history summarization |
 | `cache_ttl` | Integer | `86400` | Cache entry TTL in seconds |
 | `timeout_seconds` | Integer | `5` | Timeout for external API calls |
 | `redis_url` | String | `nil` | Redis connection URL |
-| `embedding_model` | String | `"text-embedding-3-small"` | Embedding model name (OpenAI fallback) |
+| `embedding_model` | String | `"gemini-embedding-001"` | Embedding model name |
 | `logger` | Logger | `Logger.new($stdout)` | Any Logger-compatible object |
 | `debug_logging` | Boolean | `false` | Log full prompt and response at DEBUG level |
 | `llm_caller` | Lambda | `nil` | `(prompt, model:) -> String` |
 | `embedding_caller` | Lambda | `nil` | `(text) -> Array<Float>` |
 | `classifier_caller` | Lambda | `nil` | `(prompt) -> "simple" or "complex"` |
+| `messages_caller` | Lambda | `nil` | `(messages, model:) -> String` — used when `conversation_id` is present; receives full history including current user turn |
+| `system_prompt` | String | `nil` | Seeded as the first system message when a new conversation is created via `conversation_id` |
+| `conversation_ttl` | Integer | `86400` | TTL in seconds for Redis-backed conversation history (`0` for no expiry) |
+| `with_tools` | Array | `nil` | Tools (functions) available to the LLM; passed as `tools:` keyword to callers |
 
 ## Per-call configuration
 
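The `with_tools` entry in the table above takes an array of tool definitions. As a hedged illustration only: the hash below uses the OpenAI function-calling shape mentioned in the gem's initializer template, and `get_weather` with its `city` parameter is an invented example, not part of the gem.

```ruby
# Hypothetical tool definition in the OpenAI function-calling shape.
# The exact schema your provider expects may differ.
weather_tool = {
  type: "function",
  function: {
    name: "get_weather",
    description: "Look up the current weather for a city",
    parameters: {
      type: "object",
      properties: { city: { type: "string" } },
      required: ["city"]
    }
  }
}

tools = [weather_tool]
```

Such an array could then be set globally (`config.with_tools = tools`) or, per the README's per-call configuration section, inside the block passed to `optimize`.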
@@ -187,32 +213,6 @@ result = LlmOptimizer.optimize(prompt) do |config|
 end
 ```
 
-## Conversation history
-
-Pass a `messages` array to enable history management:
-
-```ruby
-messages = [
-  { role: "user", content: "Tell me about Redis" },
-  { role: "assistant", content: "Redis is an in-memory data store..." },
-  # ... more messages
-]
-
-result = LlmOptimizer.optimize("What else can it do?", messages: messages)
-
-# result.messages contains the (possibly summarized) messages array
-```
-
-## Opt-in client wrapping
-
-Transparently wrap an existing LLM client class so all calls through it are automatically optimized:
-
-```ruby
-LlmOptimizer.wrap_client(OpenAI::Client)
-```
-
-This prepends the optimization pipeline into the client's `chat` method. Safe to call multiple times idempotent.
-
 ## OptimizeResult
 
 Every call returns an `OptimizeResult` struct:
@@ -226,20 +226,9 @@ Every call returns an `OptimizeResult` struct:
 | `original_tokens` | Integer | Estimated token count before compression |
 | `compressed_tokens` | Integer | Estimated token count after compression (`nil` if not compressed) |
 | `latency_ms` | Float | Total wall-clock time for the optimize call |
-| `messages` | Array | Final messages array (for history management) |
-
-## Error handling
-
-The gem defines a hierarchy of errors, all inheriting from `LlmOptimizer::Error`:
-
-```
-LlmOptimizer::Error
-├── LlmOptimizer::ConfigurationError # unknown config key, missing llm_caller
-├── LlmOptimizer::EmbeddingError     # embedding API failure
-└── LlmOptimizer::TimeoutError       # network timeout exceeded
-```
+| `messages` | Array | Final messages array sent to the LLM, after history management and conversation hydration (`nil` on a cache hit) |
 
-The gateway catches all component failures and falls through to a raw LLM call with the original prompt. Your app's core functionality is never blocked by the optimizer.
+The `messages` field reflects the actual array passed to `messages_caller` (or built from `conversation_id`), including any summarization applied by the history manager. You can pass it back as `options[:messages]` on the next call to continue a stateless conversation.
 
 ## Resilience
 
@@ -249,7 +238,9 @@ The gateway catches all component failures and falls through to a raw LLM call w
 | Redis unavailable (write) | Log warning, return LLM result normally |
 | Embedding API failure | Treat as cache miss, continue |
 | Any component exception | Log error, fall through to raw LLM call |
-| History summarization failure | Log error, return original messages unchanged |
+| History summarization failure | Log warning, return original messages unchanged |
+| Conversation load failure | Log warning, proceed without history |
+| Conversation save failure | Log warning, return result with pre-save messages |
 
 ## Development
 
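The fall-through row in the resilience table above ("any component exception → raw LLM call") can be sketched in isolation. This is an invented, minimal stand-in for the gateway's behavior, not the gem's actual code; every name here (`optimize_with_fallback`, the `pipeline:`, `raw_llm_call:`, and `logger:` lambdas) is hypothetical.

```ruby
# Minimal sketch of the fall-through pattern: a failing optimization
# pipeline is logged and the original prompt goes straight to the raw call.
def optimize_with_fallback(prompt, pipeline:, raw_llm_call:, logger:)
  pipeline.call(prompt)
rescue StandardError => e
  logger.call("[llm_optimizer] pipeline failed: #{e.message}")
  raw_llm_call.call(prompt)
end
```

The point of the design is that a Redis outage or embedding failure degrades to a plain LLM call rather than surfacing an exception to the host app.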
@@ -15,8 +15,8 @@ LlmOptimizer.configure do |config|
   # --- Model routing ---
   # :auto classifies each prompt; :simple or :complex forces a tier
   config.route_to = :auto
-  config.simple_model = "gpt-4o-mini"
-  config.complex_model = "gpt-4o"
+  config.simple_model = "gemini-1.5-flash"
+  config.complex_model = "claude-haiku-4-5"
 
   # --- Redis (required only if use_semantic_cache: true) ---
   config.redis_url = ENV.fetch("REDIS_URL", nil)
@@ -27,6 +27,9 @@ LlmOptimizer.configure do |config|
   config.cache_ttl = 86_400  # cache entry TTL in seconds (default: 24h)
   config.timeout_seconds = 5 # timeout for embedding / external API calls
 
+  # --- Tools ---
+  # config.with_tools = [] # Array of tool definitions (OpenAI/Anthropic format)
+
   # --- Logging ---
   config.logger = Rails.logger
   config.debug_logging = Rails.env.development?
@@ -76,4 +79,45 @@ LlmOptimizer.configure do |config|
   # }
   #
   # config.classifier_caller = nil
+
+  # --- Messages caller (optional) ---
+  # Handles conversation history and summarization.
+  # config.system_prompt = "You are a helpful person who gives responses in a non harmful way. " \
+  #                        "If any serious question is asked, handle it effectively."
+
+  # OpenAI implementation:
+  # config.messages_caller = ->(messages, model:, tools: nil) {
+  #   parameters = {
+  #     model: model,
+  #     messages: messages.map { |m| { role: m[:role], content: m[:content] } }
+  #   }
+  #   parameters[:tools] = tools if tools&.any?
+  #
+  #   response = $openai.chat(parameters: parameters)
+  #   response.dig("choices", 0, "message", "content")
+  # }
+
+  # RubyLLM implementation:
+  # config.messages_caller = ->(messages, model:, tools: nil) {
+  #   chat = RubyLLM.chat(model: model)
+  #   chat.with_tools(*tools) if tools&.any?
+  #   messages[0..-2].each { |m| chat.add_message(role: m[:role], content: m[:content]) }
+  #   chat.ask(messages.last[:content]).content
+  # }
+
+  # Anthropic implementation:
+  # config.messages_caller = ->(messages, model:, tools: nil) {
+  #   # Anthropic separates system messages from the messages array
+  #   system_msg = messages.find { |m| m[:role] == "system" }&.dig(:content)
+  #   chat_msgs = messages.reject { |m| m[:role] == "system" }
+  #                       .map { |m| { role: m[:role], content: m[:content] } }
+  #
+  #   response = $anthropic.messages(
+  #     model: model,
+  #     max_tokens: 1024,
+  #     system: system_msg,
+  #     messages: chat_msgs,
+  #     tools: tools
+  #   )
+  #   response["content"].first["text"]
+  # }
 end
@@ -22,6 +22,13 @@ module LlmOptimizer
     llm_caller
     embedding_caller
     classifier_caller
+    conversation_ttl
+    system_prompt
+    messages_caller
+    cache_scope
+    tools
+    with_tools
+    tools_caller
   ].freeze
 
   # Define readers for all known keys (setters below track explicit sets)
@@ -47,6 +54,9 @@ module LlmOptimizer
     @llm_caller = nil
     @embedding_caller = nil
     @classifier_caller = nil
+    @conversation_ttl = 86_400
+    @system_prompt = nil
+    @with_tools = nil
   end
 
   # Copies only explicitly set keys from other_config without resetting unmentioned keys.
@@ -0,0 +1,83 @@
+# frozen_string_literal: true
+
+module LlmOptimizer
+  class ConversationStore
+    KEY_NAMESPACE = "llm_optimizer:conversation:"
+
+    def initialize(redis_client, ttl:, logger:, debug_logging: false, system_prompt: nil)
+      @redis = redis_client
+      @ttl = ttl
+      @logger = logger
+      @debug_logging = debug_logging
+      @system_prompt = system_prompt
+    end
+
+    # Loads and returns the messages array for conversation_id.
+    # Returns [] if no key exists or on Redis error (logs warning).
+    def load(conversation_id)
+      key = redis_key(conversation_id)
+      raw = @redis.get(key)
+
+      if raw.nil?
+        messages = seed_messages
+        @logger.info("[llm_optimizer] ConversationStore load: conversation_id=#{conversation_id}, count=#{messages.size}")
+        log_debug_history(conversation_id, messages)
+        return messages
+      end
+
+      messages = JSON.parse(raw, symbolize_names: true)
+      @logger.info("[llm_optimizer] ConversationStore load: conversation_id=#{conversation_id}, count=#{messages.size}")
+      log_debug_history(conversation_id, messages)
+      messages
+    rescue Redis::BaseError => e
+      @logger.warn("[llm_optimizer] ConversationStore load failed: conversation_id=#{conversation_id}, error=#{e.message}")
+      []
+    end
+
+    # Appends user + assistant messages to history and persists to Redis.
+    # Silently logs warning on Redis error; never raises.
+    def save(conversation_id, messages, prompt, response)
+      updated_messages = messages + [
+        { role: "user", content: prompt },
+        { role: "assistant", content: response }
+      ]
+
+      key = redis_key(conversation_id)
+      json = JSON.generate(updated_messages)
+
+      if @ttl.zero?
+        @redis.set(key, json)
+      else
+        @redis.set(key, json, ex: @ttl)
+      end
+
+      @logger.info("[llm_optimizer] ConversationStore save: conversation_id=#{conversation_id}, count=#{updated_messages.size}")
+      log_debug_history(conversation_id, updated_messages)
+      updated_messages
+    rescue Redis::BaseError => e
+      @logger.warn("[llm_optimizer] ConversationStore save failed: conversation_id=#{conversation_id}, error=#{e.message}")
+      nil
+    end
+
+    private
+
+    def redis_key(conversation_id)
+      "#{KEY_NAMESPACE}#{conversation_id}"
+    end
+
+    def seed_messages
+      return [] unless @system_prompt
+
+      [
+        { role: "user", content: @system_prompt },
+        { role: "assistant", content: "Got it!" }
+      ]
+    end
+
+    def log_debug_history(conversation_id, messages)
+      return unless @debug_logging
+
+      @logger.debug("[llm_optimizer] ConversationStore history: conversation_id=#{conversation_id}, messages=#{messages.inspect}")
+    end
+  end
+end
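The load/save round trip of the `ConversationStore` added above can be illustrated with an in-memory stand-in for the Redis client. `FakeRedis` below is invented for demonstration only; the JSON serialization, symbolized keys, and key namespace mirror what the new file does.

```ruby
require "json"

# Invented in-memory stand-in for the Redis client, illustrating the
# ConversationStore round trip (get/set with an `ex:` TTL option).
class FakeRedis
  def initialize
    @store = {}
  end

  def get(key)
    @store[key]
  end

  def set(key, value, ex: nil)
    @store[key] = value
  end
end

redis = FakeRedis.new
key = "llm_optimizer:conversation:user-42"

history = [{ role: "user", content: "Tell me about Redis" },
           { role: "assistant", content: "Redis is an in-memory data store..." }]

# save-side: append the new user/assistant turn and persist as JSON with a TTL
updated = history + [{ role: "user", content: "What else can it do?" },
                     { role: "assistant", content: "Pub/sub, streams, Lua scripting..." }]
redis.set(key, JSON.generate(updated), ex: 86_400)

# load-side: parse back with symbolized keys, as the store does
loaded = JSON.parse(redis.get(key), symbolize_names: true)
```

Each optimize call with a `conversation_id` repeats this cycle, so the key always holds the full (possibly summarized) history.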
@@ -10,10 +10,12 @@ module LlmOptimizer
   Classify the following prompt as either 'simple' or 'complex'.
 
   Rules:
-  - simple: factual questions, basic lookups, short explanations, greetings
+  - simple: factual questions, basic lookups, short explanations, greetings, chitchat, general statements, simple mathematical calculations with additions, subtractions, multiplications and divisions
+    Example - Hello, Bye, You are funny, how are you?, what is the capital of France, tell me about yourself, what is 2 + 3 - 1 * 10 / 2 etc.
   - complex: code generation, debugging, architecture, multi-step reasoning, analysis
+    Example - how does pandas extract my information, debug this code, why is rag apps consume more tokens, give me code to print star in python etc.
 
-  Reply with exactly one word: simple or complex
+  Reply with exactly one word, no punctuation: simple or complex
 
   Prompt: %<prompt>s
 PROMPT
@@ -48,9 +50,12 @@ module LlmOptimizer
     def classify_with_llm(prompt)
       classifier_prompt = format(CLASSIFIER_PROMPT, prompt: prompt)
       response = @config.classifier_caller.call(classifier_prompt)
-      normalized = response.to_s.strip.downcase.gsub(/[^a-z]/, "")
-      return :simple if normalized == "simple"
-      return :complex if normalized == "complex"
+      normalized = response.to_s.strip.downcase
+
+      # Check for word boundary match to handle responses like
+      # "simple." / "**simple**" / "the answer is simple"
+      return :simple if normalized.match?(/\bsimple\b/)
+      return :complex if normalized.match?(/\bcomplex\b/)
 
       nil # unrecognized response — fall through to heuristic
     rescue StandardError
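The effect of the word-boundary change above can be checked in isolation. This is a standalone re-implementation of just the matching logic for demonstration (`classify_response` is an invented name, not the gem's method):

```ruby
# Standalone re-implementation of the classifier-response matching above:
# word-boundary regexes accept decorated responses the old exact match rejected.
def classify_response(response)
  normalized = response.to_s.strip.downcase
  return :simple  if normalized.match?(/\bsimple\b/)
  return :complex if normalized.match?(/\bcomplex\b/)

  nil # unrecognized — the caller falls through to the word-count heuristic
end
```

Note that `\b` treats `*`, `.`, and spaces as non-word characters, which is exactly why `"**complex**"` and `"simple."` now match.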
@@ -1,8 +1,43 @@
 # frozen_string_literal: true
 
 module LlmOptimizer
-  OptimizeResult = Struct.new(
-    :response, :model, :model_tier, :cache_status,
-    :original_tokens, :compressed_tokens, :latency_ms, :messages
-  )
+  class OptimizeResult
+    attr_accessor :response, :model, :model_tier, :cache_status,
+                  :original_tokens, :compressed_tokens, :input_tokens,
+                  :output_tokens, :cached_tokens, :latency_ms, :messages
+
+    # rubocop:disable Metrics/ParameterLists
+    def initialize(response: nil, model: nil, model_tier: nil, cache_status: nil,
+                   original_tokens: 0, compressed_tokens: 0, input_tokens: 0,
+                   output_tokens: 0, cached_tokens: 0, latency_ms: 0, messages: [])
+      @response = response
+      @model = model
+      @model_tier = model_tier
+      @cache_status = cache_status
+      @original_tokens = original_tokens
+      @compressed_tokens = compressed_tokens
+      @input_tokens = input_tokens
+      @output_tokens = output_tokens
+      @cached_tokens = cached_tokens
+      @latency_ms = latency_ms
+      @messages = messages
+    end
+    # rubocop:enable Metrics/ParameterLists
+
+    def to_h
+      {
+        response: @response,
+        model: @model,
+        model_tier: @model_tier,
+        cache_status: @cache_status,
+        original_tokens: @original_tokens,
+        compressed_tokens: @compressed_tokens,
+        input_tokens: @input_tokens,
+        output_tokens: @output_tokens,
+        cached_tokens: @cached_tokens,
+        latency_ms: @latency_ms,
+        messages: @messages
+      }
+    end
+  end
 end
@@ -0,0 +1,174 @@
1
+ # frozen_string_literal: true
2
+
3
+ module LlmOptimizer
4
+ # Internal pipeline helpers — not part of the public API.
5
+ # Extended into LlmOptimizer as private class methods.
6
+ module Pipeline
7
+ private
8
+
9
+ def build_call_config(options, &block)
10
+ cfg = Configuration.new
11
+ cfg.merge!(configuration)
12
+ options.each do |k, v|
13
+ next unless Configuration::KNOWN_KEYS.include?(k.to_sym)
14
+
15
+ cfg.public_send(:"#{k}=", v)
16
+ end
17
+ block&.call(cfg)
18
+ cfg
19
+ end
20
+
21
+ def validate_conversation_options!(conversation_id, options, call_config)
22
+ if conversation_id && options[:messages]
23
+ raise ConfigurationError,
24
+ "conversation_id and messages: are mutually exclusive — pass one or the other"
25
+ end
26
+
27
+ return unless conversation_id && call_config.redis_url.nil?
28
+
29
+ raise ConfigurationError,
30
+ "redis_url must be configured to use conversation_id"
31
+ end
32
+
33
+ def compress(prompt, config)
34
+ return [prompt, nil] unless config.compress_prompt
35
+
36
+ compressed = Compressor.new.compress(prompt)
37
+ [compressed, Compressor.new.estimate_tokens(compressed)]
38
+ end
39
+
40
+ def route(prompt, config)
41
+ router = ModelRouter.new(config)
42
+ model_tier = router.route(prompt)
43
+ model = model_tier == :simple ? config.simple_model : config.complex_model
44
+ [model_tier, model]
45
+ end
46
+
47
+ def load_conversation(conversation_id, options, config)
48
+ return [options[:messages], nil] unless conversation_id
49
+
50
+ redis = build_redis(config.redis_url)
51
+ store = ConversationStore.new(redis,
52
+ ttl: config.conversation_ttl,
53
+ logger: config.logger,
54
+ debug_logging: config.debug_logging,
55
+ system_prompt: config.system_prompt)
56
+ [store.load(conversation_id), store]
57
+ end
58
+
59
+ def apply_history_manager(messages, config)
60
+ return messages unless config.manage_history && messages
61
+
62
+ llm_caller = ->(p, model:) { raw_llm_call(p, model: model, config: config) }
63
+ history_mgr = HistoryManager.new(
64
+ llm_caller: llm_caller,
65
+ simple_model: config.simple_model,
66
+ token_budget: config.token_budget
67
+ )
68
+ history_mgr.process(messages)
69
+ end
70
+
71
+ def persist_conversation(store, conversation_id, messages, prompt, response)
72
+ return messages unless store && conversation_id
73
+
74
+ store.save(conversation_id, messages, prompt, response) || messages
75
+ end
76
+
77
+ def build_result(response, model, model_tier, cache_status,
78
+ original_tokens, compressed_tokens, latency_ms, messages, token_info = {})
79
+ OptimizeResult.new(
80
+ response: response, model: model, model_tier: model_tier,
81
+ cache_status: cache_status, original_tokens: original_tokens,
82
+ compressed_tokens: compressed_tokens,
83
+ input_tokens: token_info[:input_tokens] || compressed_tokens || original_tokens,
84
+ output_tokens: token_info[:output_tokens],
85
+        cached_tokens: token_info[:cached_tokens],
+        latency_ms: latency_ms,
+        messages: messages
+      )
+    end
+
+    def fallback_result(original_prompt, original_tokens, options, start)
+      latency_ms = elapsed_ms(start)
+      response, _token_info = raw_llm_call(original_prompt, model: nil, config: configuration)
+      build_result(response, nil, nil, :miss, original_tokens || 0, nil,
+                   latency_ms, options[:messages])
+    end
+
+    def raw_llm_call(prompt, model:, messages: nil, config: nil)
+      tools = config&.with_tools || config&.tools
+      result = if messages && !messages.empty? && config&.messages_caller
+                 config.messages_caller.call(messages + [{ role: "user", content: prompt }], model: model, tools: tools)
+               else
+                 llm = config&.llm_caller || @_current_llm_caller
+                 raise ConfigurationError, "No llm_caller configured." unless llm
+
+                 llm.call(prompt, model: model, tools: tools)
+               end
+
+      if result.is_a?(Hash)
+        [result[:content], result]
+      else
+        [result, {}]
+      end
+    end
+
+    def elapsed_ms(start)
+      ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000).round(2)
+    end
+
+    def emit_log(logger, config, cache_status:, model_tier:, original_tokens:,
+                 compressed_tokens:, latency_ms:, prompt:, response:)
+      logger.info(
+        "[llm_optimizer] { cache_status: #{cache_status.inspect}, " \
+        "model_tier: #{model_tier.inspect}, " \
+        "original_tokens: #{original_tokens.inspect}, " \
+        "compressed_tokens: #{compressed_tokens.inspect}, " \
+        "latency_ms: #{latency_ms.inspect} }"
+      )
+      logger.debug("[llm_optimizer] prompt=#{prompt.inspect} response=#{response.inspect}") if config.debug_logging
+    end
+
+    def build_redis(redis_url)
+      require "redis"
+      Redis.new(url: redis_url)
+    end
+
+    def semantic_cache_lookup(prompt, model, model_tier, original_tokens,
+                              compressed_tokens, original_prompt, start, config)
+      return [nil, nil] unless config.use_semantic_cache
+
+      embedding = config.embedding_caller.call(prompt)
+      cache = SemanticCache.new(build_redis(config.redis_url),
+                                threshold: config.similarity_threshold,
+                                ttl: config.cache_ttl,
+                                cache_scope: config.cache_scope)
+      cached, token_info = cache.lookup(embedding)
+
+      if cached
+        latency_ms = elapsed_ms(start)
+        emit_log(config.logger, config,
+                 cache_status: :hit, model_tier: model_tier,
+                 original_tokens: original_tokens, compressed_tokens: compressed_tokens,
+                 latency_ms: latency_ms, prompt: original_prompt, response: cached)
+
+        [embedding, build_result(cached, model, model_tier, :hit,
+                                 original_tokens, compressed_tokens, latency_ms, nil, token_info)]
+      else
+        [embedding, nil]
+      end
+    rescue StandardError => e
+      config.logger.warn("[llm_optimizer] semantic_cache_lookup failed: #{e.message}")
+      [nil, nil]
+    end
+
+    def store_in_cache(embedding, response, config, token_info = {})
+      return unless config.use_semantic_cache && embedding
+
+      SemanticCache.new(build_redis(config.redis_url),
+                        threshold: config.similarity_threshold,
+                        ttl: config.cache_ttl,
+                        cache_scope: config.cache_scope).store(embedding, response, token_info)
+    end
+  end
+end
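The new `raw_llm_call` helper above accepts two return shapes from a configured caller: a plain String, or a Hash carrying `:content` plus token metadata. A standalone sketch of that normalization (function and variable names here are illustrative, not the gem's API):

```ruby
# Illustrative re-implementation of the normalization inside raw_llm_call:
# a caller may return a plain String, or a Hash with :content plus token info.
def normalize_llm_result(result)
  if result.is_a?(Hash)
    [result[:content], result]  # keep the whole Hash as token_info
  else
    [result, {}]                # bare String: no token metadata
  end
end

# A caller that returns rich metadata; tools: mirrors the new pass-through.
hash_caller = ->(_prompt, model:, tools: nil) { { content: "rich answer", cached_tokens: 12 } }

content, token_info = normalize_llm_result(hash_caller.call("hi", model: "some-model"))
puts content                     # "rich answer"
puts token_info[:cached_tokens]  # 12
```

Either way the pipeline gets a uniform `[content, token_info]` pair, so cache entries can carry token metadata without forcing every caller to return a Hash.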
@@ -7,20 +7,19 @@ module LlmOptimizer
   class SemanticCache
     KEY_NAMESPACE = "llm_optimizer:cache:"
 
-    def initialize(redis_client, threshold:, ttl:)
-      @redis = redis_client
-      @threshold = threshold
-      @ttl = ttl
+    def initialize(redis_client, threshold:, ttl:, cache_scope: nil)
+      @redis = redis_client
+      @threshold = threshold
+      @ttl = ttl
+      @cache_scope = cache_scope
     end
 
-    def store(embedding, response)
+    def store(embedding, response, token_info = {})
       key = cache_key(embedding)
-      # Serialize embedding as raw 64-bit big-endian doubles to preserve full
-      # Float precision. MessagePack silently downcasts Ruby Float to 32-bit,
-      # which corrupts cosine similarity on deserialization.
       payload = MessagePack.pack({
-        "embedding" => embedding.pack("G*"), # binary string, lossless
-        "response" => response
+        "embedding" => embedding.pack("G*"),
+        "response" => response,
+        "token_info" => token_info
       })
       @redis.set(key, payload, ex: @ttl)
     rescue ::Redis::BaseError => e
@@ -28,28 +27,32 @@ module LlmOptimizer
     end
 
     def lookup(embedding)
-      keys = @redis.keys("#{KEY_NAMESPACE}*")
+      prefix = KEY_NAMESPACE
+      prefix += "#{@cache_scope}:" if @cache_scope
+      keys = @redis.keys("#{prefix}*")
+
+      keys.reject! { |k| k.count(":") > 2 } unless @cache_scope
+
       return nil if keys.empty?
 
       best_score = -Float::INFINITY
-      best_response = nil
+      best_entry = nil
 
       keys.each do |key|
         raw = @redis.get(key)
         next unless raw
 
         entry = MessagePack.unpack(raw)
-        # Unpack the binary string back to 64-bit doubles
         stored_embedding = entry["embedding"].unpack("G*")
         score = cosine_similarity(embedding, stored_embedding)
 
         if score > best_score
           best_score = score
-          best_response = entry["response"]
+          best_entry = entry
         end
       end
 
-      best_score >= @threshold ? best_response : nil
+      [best_entry["response"], best_entry["token_info"] || {}] if best_score >= @threshold
     rescue ::Redis::BaseError => e
       warn "[llm_optimizer] SemanticCache lookup failed: #{e.message}"
       nil
@@ -70,7 +73,9 @@ module LlmOptimizer
       # Use "G*" (64-bit big-endian double) to match Ruby's native Float precision.
       # "f*" (32-bit) truncates precision and produces inconsistent hashes for the
       # same embedding across serialize/deserialize round trips.
-      KEY_NAMESPACE + Digest::SHA256.hexdigest(embedding.pack("G*"))
+      prefix = KEY_NAMESPACE
+      prefix += "#{@cache_scope}:" if @cache_scope
+      prefix + Digest::SHA256.hexdigest(embedding.pack("G*"))
     end
   end
 end
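Two mechanics in this file are worth seeing in isolation: the lossless `pack("G*")` embedding round trip that the comments describe, and the new `cache_scope` segment that lands between the key namespace and the SHA256 digest. A standalone sketch (the helper name is illustrative, not the gem's API):

```ruby
require "digest"

embedding = [0.12345678901234567, -0.9876543210987654]

# "G*" stores each Float as a 64-bit big-endian double, so the round trip
# is exact; "f*" truncates to 32 bits and changes the values.
round_trip = embedding.pack("G*").unpack("G*")
lossy      = embedding.pack("f*").unpack("f*")

# Illustrative version of cache_key: the optional scope is inserted between
# the fixed namespace and the digest, so scoped entries never collide with
# unscoped ones (scoped keys carry one extra ":").
def scoped_cache_key(embedding, cache_scope = nil)
  prefix = "llm_optimizer:cache:"
  prefix += "#{cache_scope}:" if cache_scope
  prefix + Digest::SHA256.hexdigest(embedding.pack("G*"))
end

puts round_trip == embedding                 # true
puts scoped_cache_key(embedding, "user_42")  # llm_optimizer:cache:user_42:<sha256>
```

The extra-colon layout is also what the `lookup` filter relies on: without a scope it rejects keys containing more than two colons, so unscoped lookups skip every scoped entry.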
@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 
 module LlmOptimizer
-  VERSION = "0.1.4"
+  VERSION = "0.1.6"
 end
data/lib/llm_optimizer.rb CHANGED
@@ -8,26 +8,21 @@ require_relative "llm_optimizer/model_router"
 require_relative "llm_optimizer/embedding_client"
 require_relative "llm_optimizer/semantic_cache"
 require_relative "llm_optimizer/history_manager"
+require_relative "llm_optimizer/conversation_store"
+require_relative "llm_optimizer/pipeline"
 
 require "llm_optimizer/railtie" if defined?(Rails)
 
 module LlmOptimizer
-  # Base error class for all gem-specific exceptions
   class Error < StandardError; end
-
-  # Raised when an unrecognized configuration key is set
   class ConfigurationError < Error; end
-
-  # Raised when the embedding API call fails
   class EmbeddingError < Error; end
-
-  # Raised when a network timeout is exceeded
   class TimeoutError < Error; end
 
-  # Global configuration
   @configuration = nil
 
-  # Yields a Configuration instance; merges it into the global config.
+  extend Pipeline
+
   def self.configure
     temp = Configuration.new
     yield temp
@@ -35,7 +30,6 @@ module LlmOptimizer
     validate_configuration!(configuration)
   end
 
-  # Warns about misconfigured options rather than failing silently at call time.
   def self.validate_configuration!(config)
     return unless config.use_semantic_cache && config.embedding_caller.nil?
 
@@ -46,36 +40,32 @@ module LlmOptimizer
     config.use_semantic_cache = false
   end
 
-  # Returns the current global Configuration, lazy-initializing if nil.
   def self.configuration
     @configuration ||= Configuration.new
   end
 
-  # Replaces the global config with a fresh default Configuration.
-  # Useful in tests to avoid state leakage.
   def self.reset_configuration!
     @configuration = Configuration.new
   end
 
-  # Opt-in client wrapping
-  # WrapperModule intercepts `chat` on the wrapped client, runs the pre-call
-  # optimization pipeline (compress, route, cache lookup), and delegates the
-  # actual LLM call to the original client via `super` — so llm_caller is NOT
-  # required when using wrap_client.
+  def self.clear_conversation(conversation_id)
+    raise ConfigurationError, "redis_url must be configured to use clear_conversation" unless configuration.redis_url
+
+    redis = build_redis(configuration.redis_url)
+    key = "#{ConversationStore::KEY_NAMESPACE}#{conversation_id}"
+    deleted = redis.del(key)
+    deleted.positive?
+  rescue ::Redis::BaseError => e
+    raise LlmOptimizer::Error, "Redis error in clear_conversation: #{e.message}"
+  end
+
   module WrapperModule
-    def chat(params, &block)
+    def chat(params, &)
       config = LlmOptimizer.configuration
       prompt = params[:messages] || params[:prompt]
-
-      # Run pre-call pipeline: compress, route, cache lookup
       result = LlmOptimizer.optimize_pre_call(prompt, config)
+      return result[:response] if result[:cache_status] == :hit
 
-      # Cache hit — return immediately without calling the LLM
-      if result[:cache_status] == :hit
-        return result[:response]
-      end
-
-      # Apply compressed prompt and routed model, then delegate to original client
      optimized_params = params.merge(model: result[:model])
      if params[:messages]
        optimized_params = optimized_params.merge(messages: result[:prompt])
@@ -83,264 +73,80 @@ module LlmOptimizer
        optimized_params = optimized_params.merge(prompt: result[:prompt])
      end
 
-      response = super(optimized_params, &block)
-
-      # Store in cache after successful LLM call
+      response = super(optimized_params, &)
      LlmOptimizer.optimize_post_call(result, response, config)
-
      response
    end
  end
 
-  # Prepends WrapperModule into client_class; idempotent — safe to call N times.
  def self.wrap_client(client_class)
    return if client_class.ancestors.include?(WrapperModule)
 
    client_class.prepend(WrapperModule)
  end
 
-  # Primary entry point
-  # Runs the optimization pipeline and returns an OptimizeResult.
-
-  # options hash keys mirror Configuration attr_accessors and are merged over
-  # the global config for this call only. An optional block is yielded a
-  # per-call Configuration for fine-grained control.
-  def self.optimize(prompt, options = {})
-    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
-
-    # Resolve per-call configuration — only pass known config keys
-    call_config = Configuration.new
-    call_config.merge!(configuration)
-    options.each do |k, v|
-      next unless LlmOptimizer::Configuration::KNOWN_KEYS.include?(k.to_sym)
-
-      call_config.public_send(:"#{k}=", v)
-    end
-    yield call_config if block_given?
-
-    logger = call_config.logger
-
-    # Keep a reference to the original prompt for fallback use
-    original_prompt = prompt
-
-    # Compression
-    compressor = Compressor.new
-    original_tokens = compressor.estimate_tokens(prompt)
-    compressed_tokens = nil
-
-    if call_config.compress_prompt
-      prompt = compressor.compress(prompt)
-      compressed_tokens = compressor.estimate_tokens(prompt)
-    end
-
-    # Model routing
-    router = ModelRouter.new(call_config)
-    model_tier = router.route(prompt)
-    model = model_tier == :simple ? call_config.simple_model : call_config.complex_model
-
-    # Semantic cache lookup
-    embedding = nil
-
-    if call_config.use_semantic_cache
-      begin
-        emb_client = EmbeddingClient.new(
-          model: call_config.embedding_model,
-          timeout_seconds: call_config.timeout_seconds,
-          embedding_caller: call_config.embedding_caller
-        )
-        embedding = emb_client.embed(prompt)
-
-        if call_config.redis_url
-          redis = build_redis(call_config.redis_url)
-          cache = SemanticCache.new(redis, threshold: call_config.similarity_threshold, ttl: call_config.cache_ttl)
-          cached = cache.lookup(embedding)
-
-          if cached
-            latency_ms = elapsed_ms(start)
-            emit_log(logger, call_config,
-                     cache_status: :hit, model_tier: model_tier,
-                     original_tokens: original_tokens, compressed_tokens: compressed_tokens,
-                     latency_ms: latency_ms, prompt: original_prompt, response: cached)
-            return OptimizeResult.new(
-              response: cached,
-              model: model,
-              model_tier: model_tier,
-              cache_status: :hit,
-              original_tokens: original_tokens,
-              compressed_tokens: compressed_tokens,
-              latency_ms: latency_ms,
-              messages: options[:messages]
-            )
-          end
-        end
-      rescue EmbeddingError => e
-        logger.warn("[llm_optimizer] EmbeddingError (treating as cache miss): #{e.message}")
-        embedding = nil
-        # continue pipeline as cache miss
-      end
-    end
+  def self.optimize(prompt, options = {}, &)
+    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+    call_config = build_call_config(options, &)
+    conversation_id = options[:conversation_id]
+    validate_conversation_options!(conversation_id, options, call_config)
 
-    # History management
-    messages = options[:messages]
-    if call_config.manage_history && messages
-      llm_caller = ->(p, model:) { raw_llm_call(p, model: model, config: call_config) }
-      history_mgr = HistoryManager.new(
-        llm_caller: llm_caller,
-        simple_model: call_config.simple_model,
-        token_budget: call_config.token_budget
-      )
-      messages = history_mgr.process(messages)
-    end
+    original_prompt = prompt
+    original_tokens = Compressor.new.estimate_tokens(prompt)
+    prompt, compressed_tokens = compress(prompt, call_config)
+    model_tier, model = route(prompt, call_config)
 
-    # Raw LLM call
-    response = raw_llm_call(prompt, model: model, config: call_config)
+    embedding, cached_result = semantic_cache_lookup(prompt, model, model_tier,
+                                                     original_tokens, compressed_tokens,
+                                                     original_prompt, start, call_config)
+    return cached_result if cached_result
 
-    # Cache store
-    if call_config.use_semantic_cache && embedding && call_config.redis_url
-      begin
-        redis = build_redis(call_config.redis_url)
-        cache = SemanticCache.new(redis, threshold: call_config.similarity_threshold, ttl: call_config.cache_ttl)
-        cache.store(embedding, response)
-      rescue StandardError => e
-        logger.warn("[llm_optimizer] SemanticCache store failed: #{e.message}")
-      end
-    end
+    messages, store = load_conversation(conversation_id, options, call_config)
+    messages = apply_history_manager(messages, call_config)
+    response, token_info = raw_llm_call(prompt, messages: messages, model: model, config: call_config)
+    messages = persist_conversation(store, conversation_id, messages, prompt, response)
+    store_in_cache(embedding, response, call_config, token_info)
 
-    # Build result
     latency_ms = elapsed_ms(start)
-    emit_log(logger, call_config,
+    emit_log(call_config.logger, call_config,
             cache_status: :miss, model_tier: model_tier,
             original_tokens: original_tokens, compressed_tokens: compressed_tokens,
             latency_ms: latency_ms, prompt: original_prompt, response: response)
 
-    OptimizeResult.new(
-      response: response,
-      model: model,
-      model_tier: model_tier,
-      cache_status: :miss,
-      original_tokens: original_tokens,
-      compressed_tokens: compressed_tokens,
-      latency_ms: latency_ms,
-      messages: messages
-    )
+    build_result(response, model, model_tier, :miss, original_tokens, compressed_tokens,
+                 latency_ms, messages, token_info)
  rescue EmbeddingError => e
-    # Treat embedding failures as cache miss — continue to raw LLM call
-    logger = configuration.logger
-    logger.warn("[llm_optimizer] EmbeddingError (outer rescue, treating as cache miss): #{e.message}")
-    latency_ms = elapsed_ms(start)
-    response = raw_llm_call(original_prompt, model: nil, config: configuration)
-    OptimizeResult.new(
-      response: response,
-      model: nil,
-      model_tier: nil,
-      cache_status: :miss,
-      original_tokens: original_tokens || 0,
-      compressed_tokens: nil,
-      latency_ms: latency_ms,
-      messages: options[:messages]
-    )
+    configuration.logger.warn("[llm_optimizer] EmbeddingError (outer rescue): #{e.message}")
+    fallback_result(original_prompt, original_tokens, options, start)
+  rescue ConfigurationError
+    raise
  rescue LlmOptimizer::Error, StandardError => e
-    logger = configuration.logger
-    logger.error("[llm_optimizer] #{e.class}: #{e.message}\n#{e.backtrace&.first(5)&.join("\n")}")
-    latency_ms = elapsed_ms(start)
-    response = raw_llm_call(original_prompt, model: nil, config: configuration)
-    OptimizeResult.new(
-      response: response,
-      model: nil,
-      model_tier: nil,
-      cache_status: :miss,
-      original_tokens: original_tokens || 0,
-      compressed_tokens: nil,
-      latency_ms: latency_ms,
-      messages: options[:messages]
-    )
+    configuration.logger.error("[llm_optimizer] #{e.class}: #{e.message}\n#{e.backtrace&.first(5)&.join("\n")}")
+    fallback_result(original_prompt, original_tokens, options, start)
  end
 
-  # Pre-call pipeline for wrap_client: compress, route, cache lookup.
-  # Returns a hash with :prompt, :model, :model_tier, :embedding, :cache_status, :response.
-  # Does NOT make an LLM call — the wrapped client handles that via super.
  def self.optimize_pre_call(prompt, config = configuration)
-    compressor = Compressor.new
-    prompt = compressor.compress(prompt) if config.compress_prompt
-
-    router = ModelRouter.new(config)
-    model_tier = router.route(prompt)
+    prompt = Compressor.new.compress(prompt) if config.compress_prompt
+    model_tier = ModelRouter.new(config).route(prompt)
    model = model_tier == :simple ? config.simple_model : config.complex_model
 
-    embedding = nil
-    if config.use_semantic_cache && config.redis_url
-      begin
-        emb_client = EmbeddingClient.new(
-          model: config.embedding_model,
-          timeout_seconds: config.timeout_seconds,
-          embedding_caller: config.embedding_caller
-        )
-        embedding = emb_client.embed(prompt)
-        redis = build_redis(config.redis_url)
-        cache = SemanticCache.new(redis, threshold: config.similarity_threshold, ttl: config.cache_ttl)
-        cached = cache.lookup(embedding)
-        return { prompt: prompt, model: model, model_tier: model_tier,
-                 embedding: embedding, cache_status: :hit, response: cached } if cached
-      rescue EmbeddingError => e
-        config.logger.warn("[llm_optimizer] wrap_client EmbeddingError (cache miss): #{e.message}")
-        embedding = nil
-      end
+    unless config.use_semantic_cache && config.redis_url
+      return { prompt: prompt, model: model, model_tier: model_tier,
+               embedding: nil, cache_status: :miss, response: nil }
+    end
+
+    embedding, result = semantic_cache_lookup(prompt, model, model_tier, nil, nil,
+                                              prompt, Process.clock_gettime(Process::CLOCK_MONOTONIC), config)
+    if result
+      return { prompt: prompt, model: model, model_tier: model_tier,
+               embedding: embedding, cache_status: :hit, response: result.response }
    end
 
    { prompt: prompt, model: model, model_tier: model_tier,
      embedding: embedding, cache_status: :miss, response: nil }
  end
 
-  # Post-call: store the LLM response in the semantic cache if applicable.
  def self.optimize_post_call(pre_call_result, response, config = configuration)
-    return unless config.use_semantic_cache && config.redis_url
-    return unless pre_call_result[:embedding]
-
-    redis = build_redis(config.redis_url)
-    cache = SemanticCache.new(redis, threshold: config.similarity_threshold, ttl: config.cache_ttl)
-    cache.store(pre_call_result[:embedding], response)
-  rescue StandardError => e
-    config.logger.warn("[llm_optimizer] wrap_client cache store failed: #{e.message}")
-  end
-
-  # Private helpers
-
-  class << self
-    private
-
-    def raw_llm_call(prompt, model:, config: nil)
-      caller = config&.llm_caller || @_current_llm_caller
-      unless caller
-        raise ConfigurationError,
-              "No llm_caller configured. " \
-              "Set it via LlmOptimizer.configure { |c| c.llm_caller = ->(prompt, model:) { ... } }"
-      end
-
-      caller.call(prompt, model: model)
-    end
-
-    def elapsed_ms(start)
-      ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000).round(2)
-    end
-
-    def emit_log(logger, config, cache_status:, model_tier:, original_tokens:,
-                 compressed_tokens:, latency_ms:, prompt:, response:)
-      logger.info(
-        "[llm_optimizer] { cache_status: #{cache_status.inspect}, " \
-        "model_tier: #{model_tier.inspect}, " \
-        "original_tokens: #{original_tokens.inspect}, " \
-        "compressed_tokens: #{compressed_tokens.inspect}, " \
-        "latency_ms: #{latency_ms.inspect} }"
-      )
-
-      return unless config.debug_logging
-
-      logger.debug("[llm_optimizer] prompt=#{prompt.inspect} response=#{response.inspect}")
-    end
-
-    def build_redis(redis_url)
-      require "redis"
-      Redis.new(url: redis_url)
-    end
+    store_in_cache(pre_call_result[:embedding], response, config)
  end
end
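Per the changelog, `with_tools` (aliased as `tools`) and `cache_scope` are plain configuration options, and tool definitions are forwarded verbatim to the configured caller. A hedged usage sketch, not the gem's documented setup: `MyLLM`, the Redis URL, the scope string, and the tool schema are all placeholders.

```ruby
LlmOptimizer.configure do |c|
  c.redis_url = "redis://localhost:6379/0"
  c.llm_caller = lambda do |prompt, model:, tools: nil|
    # Call your LLM SDK here; return a String, or a Hash with :content
    # plus token metadata. MyLLM is a stand-in for a real client.
    MyLLM.chat(prompt, model: model, tools: tools)
  end
  # Tool definitions pass through unchanged to the caller above,
  # so any schema your SDK accepts should work.
  c.with_tools = [{ name: "get_weather", description: "Look up current weather" }]
  # Cache hits now only occur within this namespace.
  c.cache_scope = "billing_tenant"
end
```

Since `optimize` merges recognized option keys over the global config per call, passing `cache_scope:` directly to `optimize` should also work, assuming it is among the merged keys.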
metadata CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: llm_optimizer
 version: !ruby/object:Gem::Version
-  version: 0.1.4
+  version: 0.1.6
 platform: ruby
 authors:
 - arun kumar
@@ -100,10 +100,12 @@ files:
 - lib/llm_optimizer.rb
 - lib/llm_optimizer/compressor.rb
 - lib/llm_optimizer/configuration.rb
+- lib/llm_optimizer/conversation_store.rb
 - lib/llm_optimizer/embedding_client.rb
 - lib/llm_optimizer/history_manager.rb
 - lib/llm_optimizer/model_router.rb
 - lib/llm_optimizer/optimize_result.rb
+- lib/llm_optimizer/pipeline.rb
 - lib/llm_optimizer/railtie.rb
 - lib/llm_optimizer/semantic_cache.rb
 - lib/llm_optimizer/version.rb