llm_optimizer 0.1.0
- checksums.yaml +7 -0
- data/CHANGELOG.md +33 -0
- data/CODE_OF_CONDUCT.md +132 -0
- data/LICENSE.txt +21 -0
- data/README.md +243 -0
- data/Rakefile +15 -0
- data/lib/generators/llm_optimizer/install_generator.rb +17 -0
- data/lib/generators/llm_optimizer/templates/initializer.rb +68 -0
- data/lib/llm_optimizer/compressor.rb +47 -0
- data/lib/llm_optimizer/configuration.rb +79 -0
- data/lib/llm_optimizer/embedding_client.rb +61 -0
- data/lib/llm_optimizer/history_manager.rb +43 -0
- data/lib/llm_optimizer/model_router.rb +32 -0
- data/lib/llm_optimizer/optimize_result.rb +9 -0
- data/lib/llm_optimizer/railtie.rb +11 -0
- data/lib/llm_optimizer/semantic_cache.rb +66 -0
- data/lib/llm_optimizer/version.rb +5 -0
- data/lib/llm_optimizer.rb +273 -0
- data/sig/llm_optimizer.rbs +4 -0
- metadata +135 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
---
SHA256:
  metadata.gz: e68575c7d9fba996b2efeb9dd559635bba72316322b75a1f43b0e6b2a5e28fce
  data.tar.gz: b8b3f8e06da0d860af65a192b3a4fb16bfe8110fd3eca69d328fd5ef71471571
SHA512:
  metadata.gz: 6679cbc09844d71e3c42e74e313d5366bcafbfdeb7e6625a6f0ad591bb8fc98687ba3040a6c54703388bc81be614fd9095d5f71eb0b3eac80fdf0c43299445ee
  data.tar.gz: '095e4ac8ef5f45240f9d0d5068cf462d66900258765d184f1e25b2e9233c782d5363a248dc178167ad83055fa0d6a006bb2631a83e94c482fcd03fed947c41d4'
data/CHANGELOG.md
ADDED
@@ -0,0 +1,33 @@
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

## [0.1.0] - 2026-04-10

### Added

- `LlmOptimizer.optimize(prompt, options = {}, &block)` — primary entry point returning an `OptimizeResult`
- `LlmOptimizer.configure` — global configuration with merge semantics (multiple calls merge without resetting)
- `LlmOptimizer.reset_configuration!` — resets global config to defaults (useful in tests)
- `LlmOptimizer.wrap_client(client_class)` — opt-in idempotent client wrapping via module prepend
- **Semantic Caching** — Redis-backed vector similarity cache using cosine similarity; configurable threshold and TTL
- **Intelligent Model Routing** — heuristic classifier routing prompts to `:simple` or `:complex` model tier based on word count, code blocks, and keywords
- **Token Pruning / Compressor** — English stop-word removal with fenced code block preservation; `estimate_tokens` helper
- **Conversation History Sliding Window** — summarizes oldest messages when token budget is exceeded; falls back to original messages on LLM failure
- **EmbeddingClient** — injectable `embedding_caller` lambda with OpenAI fallback via `OPENAI_API_KEY`
- **`llm_caller`** — injectable lambda to wire any LLM provider (RubyLLM, ruby-openai, Anthropic, Bedrock, etc.)
- **Rails generator** — `rails generate llm_optimizer:install` creates a pre-filled initializer
- **Railtie** — auto-loads generator when used in a Rails app
- **Structured logging** — INFO log per optimize call (no prompt content); DEBUG log with full prompt/response when `debug_logging: true`
- **Resilience** — all component failures fall through to raw LLM call; `EmbeddingError` treated as cache miss
- Full exception hierarchy: `LlmOptimizer::Error`, `ConfigurationError`, `EmbeddingError`, `TimeoutError`
- `OptimizeResult` struct with `response`, `model`, `model_tier`, `cache_status`, `original_tokens`, `compressed_tokens`, `latency_ms`, `messages`
- Unit test suite covering all components with positive and negative scenarios using Minitest + Mocha

[Unreleased]: https://github.com/arunkumarry/llm_optimizer/compare/v0.1.0...HEAD
[0.1.0]: https://github.com/arunkumarry/llm_optimizer/releases/tag/v0.1.0
data/CODE_OF_CONDUCT.md
ADDED
@@ -0,0 +1,132 @@
# Contributor Covenant Code of Conduct

## Our Pledge

We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, caste, color, religion, or sexual
identity and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.

## Our Standards

Examples of behavior that contributes to a positive environment for our
community include:

* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
  and learning from the experience
* Focusing on what is best not just for us as individuals, but for the overall
  community

Examples of unacceptable behavior include:

* The use of sexualized language or imagery, and sexual attention or advances of
  any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email address,
  without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Enforcement Responsibilities

Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.

Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.

## Scope

This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official email address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
[INSERT CONTACT METHOD].
All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the
reporter of any incident.

## Enforcement Guidelines

Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:

### 1. Correction

**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.

**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.

### 2. Warning

**Community Impact**: A violation through a single incident or series of
actions.

**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or permanent
ban.

### 3. Temporary Ban

**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.

**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.

### 4. Permanent Ban

**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.

**Consequence**: A permanent ban from any sort of public interaction within the
community.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.1, available at
[https://www.contributor-covenant.org/version/2/1/code_of_conduct.html][v2.1].

Community Impact Guidelines were inspired by
[Mozilla's code of conduct enforcement ladder][Mozilla CoC].

For answers to common questions about this code of conduct, see the FAQ at
[https://www.contributor-covenant.org/faq][FAQ]. Translations are available at
[https://www.contributor-covenant.org/translations][translations].

[homepage]: https://www.contributor-covenant.org
[v2.1]: https://www.contributor-covenant.org/version/2/1/code_of_conduct.html
[Mozilla CoC]: https://github.com/mozilla/diversity
[FAQ]: https://www.contributor-covenant.org/faq
[translations]: https://www.contributor-covenant.org/translations
data/LICENSE.txt
ADDED
@@ -0,0 +1,21 @@
The MIT License (MIT)

Copyright (c) 2026 arun kumar

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
data/README.md
ADDED
@@ -0,0 +1,243 @@
# llm_optimizer

A Smart Gateway for LLM API calls in Ruby and Rails applications. Reduces token usage and API costs through four composable optimizations — all opt-in, all independently configurable.

## How it works

Every call to `LlmOptimizer.optimize` passes through an ordered pipeline:

```
prompt → Compressor → ModelRouter → SemanticCache lookup → HistoryManager → LLM call → SemanticCache store → OptimizeResult
```

Each stage is independently enabled via configuration flags. If any stage fails, the gem falls through to a raw LLM call — your app never breaks because of the optimizer.

## Optimizations

### 1. Semantic Caching
Stores prompt embeddings in Redis. On subsequent calls, computes cosine similarity against stored embeddings. If similarity ≥ threshold, returns the cached response instantly — no LLM call made.
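The similarity check is a plain cosine similarity between embedding vectors. A minimal sketch (the gem's actual implementation lives in `SemanticCache` and may differ in detail):

```ruby
# Cosine similarity: dot product over the product of magnitudes.
# 1.0 means the vectors point in the same direction; 0.0 means orthogonal.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

cosine_similarity([1.0, 0.0], [1.0, 0.0]) # => 1.0
cosine_similarity([1.0, 0.0], [0.0, 1.0]) # => 0.0
```

With the default `similarity_threshold` of 0.96, only near-identical prompts produce cache hits.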

### 2. Intelligent Model Routing
Classifies each prompt using a heuristic and routes it to the appropriate model tier:

- **Simple** — short prompts (< 20 words), no code blocks, no complex keywords → cheaper/faster model
- **Complex** — code blocks, keywords like `analyze`, `refactor`, `debug`, `architect`, `explain in detail` → premium model
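A sketch of that heuristic (illustrative only; the real rules live in `LlmOptimizer::ModelRouter`):

```ruby
# Hypothetical re-statement of the routing rules described above.
COMPLEX_KEYWORDS = ["analyze", "refactor", "debug", "architect", "explain in detail"].freeze

def classify(prompt)
  text = prompt.downcase
  return :complex if prompt.include?("```")                         # contains a code block
  return :complex if COMPLEX_KEYWORDS.any? { |k| text.include?(k) } # complex keyword
  return :complex if prompt.split.size >= 20                        # long prompt
  :simple
end

classify("What is Redis?")               # => :simple
classify("Please refactor this service") # => :complex
```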

### 3. Token Pruning
Removes common English stop words from prompts before sending to the LLM. Preserves fenced code block content unchanged. Typically reduces token count by 10–20%.
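The token counts reported in `OptimizeResult` come from a simple length-based estimate (`Compressor#estimate_tokens`, whose source appears later in this diff): roughly one token per four characters of English text.

```ruby
# The chars-divided-by-4 heuristic used by Compressor#estimate_tokens.
def estimate_tokens(text)
  (text.length / 4.0).ceil
end

estimate_tokens("Summarize this article") # => 6
```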

### 4. Conversation History Sliding Window
When a conversation history exceeds the configured token budget, summarizes the oldest messages using the simple model and replaces them with a single system summary message.
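The shape of that transformation can be sketched as follows. This is a hypothetical stand-in: the real `HistoryManager` summarizes the oldest messages via the configured simple model, which is stubbed out here.

```ruby
# Sliding-window sketch: if the estimated history size exceeds the budget,
# collapse everything but the most recent messages into one system summary.
def apply_window(messages, token_budget, keep_last: 4)
  estimate = ->(m) { (m[:content].length / 4.0).ceil }
  return messages if messages.sum(&estimate) <= token_budget

  oldest = messages[0...-keep_last]
  recent = messages[-keep_last..]
  # The gem asks the simple model to summarize `oldest`; stubbed here.
  summary = { role: "system", content: "Summary of #{oldest.size} earlier messages." }
  [summary] + recent
end
```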

## Installation

Add to your Gemfile:

```ruby
gem "llm_optimizer"
```

Then run:

```bash
bundle install
```

For Rails apps, generate the initializer:

```bash
rails generate llm_optimizer:install
```

This creates `config/initializers/llm_optimizer.rb` with all options pre-filled and commented.

## Quick Start

```ruby
LlmOptimizer.configure do |config|
  config.compress_prompt = true
  config.use_semantic_cache = true
  config.redis_url = ENV["REDIS_URL"]

  # Wire up your app's LLM client
  config.llm_caller = ->(prompt, model:) {
    # Use whatever LLM client your app already has
    MyLlmService.chat(prompt, model: model)
  }

  # Wire up your embeddings provider (required if use_semantic_cache: true)
  config.embedding_caller = ->(text) {
    MyEmbeddingService.embed(text)
  }
end

result = LlmOptimizer.optimize("What is Redis?")

puts result.response          # => "Redis is an in-memory data store..."
puts result.cache_status      # => :hit or :miss
puts result.model_tier        # => :simple or :complex
puts result.model             # => "gpt-4o-mini"
puts result.original_tokens   # => 5
puts result.compressed_tokens # => 4
puts result.latency_ms        # => 12.4
```

## Configuration

### Rails initializer

```ruby
LlmOptimizer.configure do |config|
  # Feature flags — all off by default
  config.compress_prompt = true    # strip stop words before sending to LLM
  config.use_semantic_cache = true # cache responses by vector similarity
  config.manage_history = true     # summarize old messages when over token budget

  # Model routing
  config.route_to = :auto                             # :auto | :simple | :complex
  config.simple_model = "gpt-4o-mini"                 # model used for simple prompts
  config.complex_model = "claude-3-5-sonnet-20241022" # model used for complex prompts

  # Redis (required if use_semantic_cache: true)
  config.redis_url = ENV["REDIS_URL"]

  # Tuning
  config.similarity_threshold = 0.96 # cosine similarity cutoff for cache hit (0.0–1.0)
  config.token_budget = 4000         # token limit before history summarization
  config.cache_ttl = 86400           # cache TTL in seconds (default: 24h)
  config.timeout_seconds = 5         # timeout for external API calls

  # Logging
  config.logger = Rails.logger
  config.debug_logging = Rails.env.development? # logs full prompt+response at DEBUG level

  # LLM caller — wire to your existing LLM client (required)
  config.llm_caller = ->(prompt, model:) {
    RubyLLM.chat(model: model, assume_model_exists: true).ask(prompt).content
  }

  # Embeddings caller — wire to your embeddings provider (required if use_semantic_cache: true)
  # Falls back to OpenAI via ENV["OPENAI_API_KEY"] if not set
  config.embedding_caller = ->(text) {
    MyEmbeddingService.embed(text)
  }
end
```

### Configuration reference

| Key | Type | Default | Description |
|---|---|---|---|
| `compress_prompt` | Boolean | `false` | Strip stop words before sending to LLM |
| `use_semantic_cache` | Boolean | `false` | Enable Redis-backed semantic cache |
| `manage_history` | Boolean | `false` | Enable conversation history summarization |
| `route_to` | Symbol | `:auto` | `:auto`, `:simple`, or `:complex` |
| `simple_model` | String | `"gpt-4o-mini"` | Model for simple prompts |
| `complex_model` | String | `"claude-3-5-sonnet-20241022"` | Model for complex prompts |
| `similarity_threshold` | Float | `0.96` | Minimum cosine similarity for cache hit |
| `token_budget` | Integer | `4000` | Token limit before history summarization |
| `cache_ttl` | Integer | `86400` | Cache entry TTL in seconds |
| `timeout_seconds` | Integer | `5` | Timeout for external API calls |
| `redis_url` | String | `nil` | Redis connection URL |
| `embedding_model` | String | `"text-embedding-3-small"` | Embedding model name (OpenAI fallback) |
| `logger` | Logger | `Logger.new($stdout)` | Any Logger-compatible object |
| `debug_logging` | Boolean | `false` | Log full prompt and response at DEBUG level |
| `llm_caller` | Lambda | `nil` | `(prompt, model:) -> String` |
| `embedding_caller` | Lambda | `nil` | `(text) -> Array<Float>` |

## Per-call configuration

Override global config for a single call using a block:

```ruby
result = LlmOptimizer.optimize(prompt) do |config|
  config.route_to = :simple
  config.compress_prompt = false
end
```

## Conversation history

Pass a `messages` array to enable history management:

```ruby
messages = [
  { role: "user", content: "Tell me about Redis" },
  { role: "assistant", content: "Redis is an in-memory data store..." },
  # ... more messages
]

result = LlmOptimizer.optimize("What else can it do?", messages: messages)

# result.messages contains the (possibly summarized) messages array
```

## Opt-in client wrapping

Transparently wrap an existing LLM client class so all calls through it are automatically optimized:

```ruby
LlmOptimizer.wrap_client(OpenAI::Client)
```

This prepends the optimization pipeline into the client's `chat` method. Safe to call multiple times — idempotent.
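The idempotency falls out of how `Module#prepend` works: prepending a module that is already in a class's ancestor chain is a no-op. A sketch with a hypothetical module and client (not the gem's actual wrapper code):

```ruby
# Module#prepend inserts the module ahead of the class in the ancestor
# chain, so its #chat runs first and delegates via super.
module OptimizerHook
  def chat(prompt)
    # the optimization pipeline would run here before delegating
    super
  end
end

class FakeClient
  def chat(prompt)
    "raw: #{prompt}"
  end
end

FakeClient.prepend(OptimizerHook)
FakeClient.prepend(OptimizerHook) # second call changes nothing

FakeClient.ancestors.count(OptimizerHook) # => 1
FakeClient.new.chat("hi")                 # => "raw: hi"
```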

## OptimizeResult

Every call returns an `OptimizeResult` struct:

| Field | Type | Description |
|---|---|---|
| `response` | String | The LLM response text |
| `model` | String | Model name actually used |
| `model_tier` | Symbol | `:simple` or `:complex` |
| `cache_status` | Symbol | `:hit` or `:miss` |
| `original_tokens` | Integer | Estimated token count before compression |
| `compressed_tokens` | Integer | Estimated token count after compression (`nil` if not compressed) |
| `latency_ms` | Float | Total wall-clock time for the optimize call |
| `messages` | Array | Final messages array (for history management) |

## Error handling

The gem defines a hierarchy of errors, all inheriting from `LlmOptimizer::Error`:

```
LlmOptimizer::Error
├── LlmOptimizer::ConfigurationError # unknown config key, missing llm_caller
├── LlmOptimizer::EmbeddingError     # embedding API failure
└── LlmOptimizer::TimeoutError       # network timeout exceeded
```

The gateway catches all component failures and falls through to a raw LLM call with the original prompt. Your app's core functionality is never blocked by the optimizer.

## Resilience

| Failure | Behavior |
|---|---|
| Redis unavailable (read) | Treat as cache miss, continue |
| Redis unavailable (write) | Log warning, return LLM result normally |
| Embedding API failure | Treat as cache miss, continue |
| Any component exception | Log error, fall through to raw LLM call |
| History summarization failure | Log error, return original messages unchanged |

## Development

```bash
bundle install
bundle exec rake test     # run tests
bundle exec rake rubocop  # lint
bundle exec rake          # test + lint
```

Generate the Rails initializer in a target app:

```bash
rails generate llm_optimizer:install
```

## License

MIT

---

[GitHub](https://github.com/arunkumarry/llm_optimizer) · [RubyGems](https://rubygems.org/gems/llm_optimizer) · [Changelog](https://github.com/arunkumarry/llm_optimizer/blob/main/CHANGELOG.md)
data/Rakefile
ADDED
@@ -0,0 +1,15 @@
# frozen_string_literal: true

require "bundler/gem_tasks"
require "minitest/test_task"

Minitest::TestTask.create(:test) do |t|
  t.libs << "test"
  t.test_globs = ["test/test_*.rb", "test/unit/test_*.rb"]
end

require "rubocop/rake_task"

RuboCop::RakeTask.new

task default: %i[test rubocop]
data/lib/generators/llm_optimizer/install_generator.rb
ADDED
@@ -0,0 +1,17 @@
# frozen_string_literal: true

require "rails/generators"

module LlmOptimizer
  module Generators
    class InstallGenerator < Rails::Generators::Base
      source_root File.expand_path("templates", __dir__)

      desc "Creates a LlmOptimizer initializer in your Rails app"

      def copy_initializer
        template "initializer.rb", "config/initializers/llm_optimizer.rb"
      end
    end
  end
end
data/lib/generators/llm_optimizer/templates/initializer.rb
ADDED
@@ -0,0 +1,68 @@
# frozen_string_literal: true

# LlmOptimizer initializer
# Run `rails generate llm_optimizer:install` to regenerate this file.
#
# Docs: https://github.com/arunkumarry/llm_optimizer

LlmOptimizer.configure do |config|
  # --- Feature flags ---
  # All optimizations are off by default. Enable what you need.
  config.compress_prompt = false    # strip stop words before sending to LLM
  config.use_semantic_cache = false # cache responses by vector similarity in Redis
  config.manage_history = false     # summarize old messages when over token budget

  # --- Model routing ---
  # :auto classifies each prompt; :simple or :complex forces a tier
  config.route_to = :auto
  config.simple_model = "gpt-4o-mini"
  config.complex_model = "gpt-4o"

  # --- Redis (required only if use_semantic_cache: true) ---
  config.redis_url = ENV.fetch("REDIS_URL", nil)

  # --- Tuning ---
  config.similarity_threshold = 0.96 # cosine similarity cutoff for a cache hit
  config.token_budget = 4000         # token limit before history summarization kicks in
  config.cache_ttl = 86400           # cache entry TTL in seconds (default: 24h)
  config.timeout_seconds = 5         # timeout for embedding / external API calls

  # --- Logging ---
  config.logger = Rails.logger
  config.debug_logging = Rails.env.development?

  # --- LLM caller (required) ---
  # Wire this up to however your app already calls the LLM.
  #
  # Example with ruby-openai:
  #   config.llm_caller = ->(prompt, model:) {
  #     OpenAI::Client.new(access_token: ENV["OPENAI_API_KEY"])
  #       .chat(parameters: { model: model, messages: [{ role: "user", content: prompt }] })
  #       .dig("choices", 0, "message", "content")
  #   }
  #
  # Example with a shared service object:
  #   config.llm_caller = ->(prompt, model:) {
  #     provider = if model.include?("claude") then :anthropic
  #                elsif model.include?("gpt") then :openai
  #                elsif model.include?("gemini") then :gemini
  #                elsif model.include?("nova") || model.include?("amazon") then :bedrock
  #                else :ollama
  #                end
  #     RubyLLM.chat(model: model, provider: provider, assume_model_exists: true).ask(prompt).content
  #   }
  #
  config.llm_caller = ->(prompt, model:) {
    raise NotImplementedError, "[llm_optimizer] llm_caller is not configured. " \
                               "Edit config/initializers/llm_optimizer.rb and wire it to your LLM client."
  }

  # --- Embeddings caller (optional) ---
  # Only needed if use_semantic_cache: true.
  # If omitted, falls back to OpenAI via ENV["OPENAI_API_KEY"].
  #
  # Example:
  #   config.embedding_caller = ->(text) { EmbeddingService.embed(text) }
  #
  # config.embedding_caller = nil
end
data/lib/llm_optimizer/compressor.rb
ADDED
@@ -0,0 +1,47 @@
# frozen_string_literal: true

module LlmOptimizer
  class Compressor
    STOP_WORDS = %w[
      the a an is are was were be been being
      of in to for on at by with from as into
      through during before after above below
      between out off over under again further
      then once
    ].freeze

    FENCE_RE = /(```[\s\S]*?```|~~~[\s\S]*?~~~)/

    def initialize(slm_client: nil)
      @slm_client = slm_client
    end

    def compress(prompt)
      segments = prompt.split(FENCE_RE)

      processed = segments.map.with_index do |segment, i|
        # Odd-indexed segments are fenced code blocks (captured group)
        if i.odd?
          segment
        else
          remove_stop_words(segment)
        end
      end

      result = processed.join
      result.gsub(/\s{2,}/, " ").strip
    end

    def estimate_tokens(text)
      (text.length / 4.0).ceil
    end

    private

    def remove_stop_words(text)
      stop_set = STOP_WORDS.to_set
      words = text.split(" ")
      words.reject { |w| stop_set.include?(w.downcase) }.join(" ")
    end
  end
end
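The code-block preservation in `compress` relies on a property of `String#split`: when the pattern contains a capture group, the captured delimiters are kept in the result, so fenced blocks always land at odd indices. A quick demonstration of that invariant:

```ruby
# Same pattern as Compressor::FENCE_RE; the capture group keeps the fences.
FENCE_RE = /(```[\s\S]*?```|~~~[\s\S]*?~~~)/

segments = "Explain the code ```ruby\nputs 1\n``` in detail".split(FENCE_RE)
# segments[0] and segments[2] are prose; segments[1] is the fence, verbatim
segments.length # => 3
```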
@@ -0,0 +1,79 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
require "logger"
|
|
4
|
+
require "set"
|
|
5
|
+
|
|
6
|
+
module LlmOptimizer
|
|
7
|
+
class Configuration
|
|
8
|
+
KNOWN_KEYS = %i[
|
|
9
|
+
use_semantic_cache
|
|
10
|
+
compress_prompt
|
|
11
|
+
manage_history
|
|
12
|
+
route_to
|
|
13
|
+
similarity_threshold
|
|
14
|
+
token_budget
|
|
15
+      redis_url
+      embedding_model
+      simple_model
+      complex_model
+      logger
+      debug_logging
+      timeout_seconds
+      cache_ttl
+      llm_caller
+      embedding_caller
+    ].freeze
+
+    # Define readers for all known keys (setters below track explicit sets)
+    KNOWN_KEYS.each { |key| define_method(key) { instance_variable_get(:"@#{key}") } }
+
+    def initialize
+      @explicitly_set = Set.new
+
+      @use_semantic_cache = false
+      @compress_prompt = false
+      @manage_history = false
+      @route_to = :auto
+      @similarity_threshold = 0.96
+      @token_budget = 4000
+      @redis_url = nil
+      @embedding_model = "text-embedding-3-small"
+      @simple_model = "gpt-4o-mini"
+      @complex_model = "claude-3-5-sonnet-20241022"
+      @logger = Logger.new($stdout)
+      @debug_logging = false
+      @timeout_seconds = 5
+      @cache_ttl = 86400
+      @llm_caller = nil
+      @embedding_caller = nil
+    end
+
+    # Copies only explicitly set keys from other_config without resetting unmentioned keys.
+    def merge!(other_config)
+      other_config.instance_variable_get(:@explicitly_set).each do |key|
+        public_send(:"#{key}=", other_config.public_send(key))
+      end
+      self
+    end
+
+    def method_missing(name, *args, &block)
+      key = name.to_s.chomp("=").to_sym
+      raise ConfigurationError, "Unknown configuration key: #{key}" unless KNOWN_KEYS.include?(key)
+
+      super
+    end
+
+    def respond_to_missing?(name, include_private = false)
+      key = name.to_s.chomp("=").to_sym
+      KNOWN_KEYS.include?(key) || super
+    end
+
+    # Define setters for all known keys, recording which were explicitly set.
+    KNOWN_KEYS.each do |key|
+      define_method(:"#{key}=") do |value|
+        @explicitly_set << key
+        instance_variable_set(:"@#{key}", value)
+      end
+    end
+  end
+end
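The explicit-set tracking above is what lets `merge!` layer one config over another without clobbering untouched defaults. A minimal standalone sketch of that semantics (`TinyConfig` is an illustrative toy, not the gem's `Configuration` class):

```ruby
require "set"

# Toy config: setters record which keys the user actually assigned,
# so merge! only copies those keys and leaves defaults alone.
class TinyConfig
  KEYS = %i[token_budget route_to].freeze

  attr_reader(*KEYS)

  def initialize
    @explicitly_set = Set.new
    @token_budget = 4000
    @route_to = :auto
  end

  KEYS.each do |key|
    define_method(:"#{key}=") do |value|
      @explicitly_set << key
      instance_variable_set(:"@#{key}", value)
    end
  end

  def merge!(other)
    other.instance_variable_get(:@explicitly_set).each do |key|
      public_send(:"#{key}=", other.public_send(key))
    end
    self
  end
end

base = TinyConfig.new
base.token_budget = 8000     # explicitly set on the base config

override = TinyConfig.new    # token_budget left at its default (not "set")
override.route_to = :simple

base.merge!(override)
puts base.token_budget  # 8000: the override's untouched default did not win
puts base.route_to      # simple
```

A plain `attr_accessor` merge would have reset `token_budget` back to 4000 here; tracking explicit sets avoids that.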
@@ -0,0 +1,61 @@
+# frozen_string_literal: true
+
+require "net/http"
+require "uri"
+require "json"
+
+module LlmOptimizer
+  class EmbeddingClient
+    OPENAI_ENDPOINT = "https://api.openai.com/v1/embeddings"
+
+    def initialize(model:, timeout_seconds:, embedding_caller: nil)
+      @model = model
+      @timeout_seconds = timeout_seconds
+      @embedding_caller = embedding_caller
+    end
+
+    def embed(text)
+      if @embedding_caller
+        @embedding_caller.call(text)
+      else
+        embed_via_openai(text)
+      end
+    rescue EmbeddingError
+      raise
+    rescue StandardError => e
+      raise EmbeddingError, "Embedding request failed: #{e.message}"
+    end
+
+    private
+
+    def embed_via_openai(text)
+      api_key = ENV["OPENAI_API_KEY"]
+      raise EmbeddingError, "OPENAI_API_KEY is not set and no embedding_caller configured" if api_key.nil? || api_key.empty?
+
+      uri = URI(OPENAI_ENDPOINT)
+      body = JSON.generate({ model: @model, input: text })
+
+      http = Net::HTTP.new(uri.host, uri.port)
+      http.use_ssl = true
+      http.open_timeout = @timeout_seconds
+      http.read_timeout = @timeout_seconds
+
+      request = Net::HTTP::Post.new(uri.path)
+      request["Content-Type"] = "application/json"
+      request["Authorization"] = "Bearer #{api_key}"
+      request.body = body
+
+      response = http.request(request)
+
+      unless response.is_a?(Net::HTTPSuccess)
+        raise EmbeddingError, "OpenAI embeddings API returned #{response.code}: #{response.body}"
+      end
+
+      parsed = JSON.parse(response.body)
+      parsed.dig("data", 0, "embedding") or
+        raise EmbeddingError, "Unexpected response shape: #{response.body}"
+    rescue Net::OpenTimeout, Net::ReadTimeout => e
+      raise EmbeddingError, "Embedding request timed out: #{e.message}"
+    end
+  end
+end
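The `embedding_caller` hook above lets callers bypass the built-in OpenAI HTTP path entirely: the contract is any object responding to `call(text)` and returning a vector of floats. A deterministic fake embedder illustrating that contract (purely a toy for tests, not a real model):

```ruby
# Fake embedder: hashes characters into a fixed 4-dimensional vector.
# Deterministic and free, so the same text always maps to the same
# vector; useful for exercising cache logic without network calls.
fake_embedder = lambda do |text|
  vec = [0.0, 0.0, 0.0, 0.0]
  text.each_char.with_index { |ch, i| vec[i % 4] += ch.ord / 1000.0 }
  vec
end

puts fake_embedder.call("hello").length  # 4
```

In a real app this lambda would instead wrap whatever embedding SDK the app already uses, assigned to `config.embedding_caller`.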
@@ -0,0 +1,43 @@
+# frozen_string_literal: true
+
+module LlmOptimizer
+  class HistoryManager
+    SUMMARIZE_COUNT = 10
+
+    def initialize(llm_caller:, simple_model:, token_budget:)
+      @llm_caller = llm_caller
+      @simple_model = simple_model
+      @token_budget = token_budget
+    end
+
+    def estimate_tokens(messages)
+      total_chars = messages.sum { |m| (m[:content] || m["content"] || "").length }
+      total_chars / 4
+    end
+
+    def process(messages)
+      return messages if estimate_tokens(messages) <= @token_budget
+
+      count = [SUMMARIZE_COUNT, messages.length].min
+      to_summarize = messages.first(count)
+      remainder = messages.drop(count)
+
+      summary = summarize(to_summarize)
+      return messages if summary.nil?
+
+      [{ role: "system", content: summary }] + remainder
+    end
+
+    private
+
+    def summarize(messages)
+      conversation = messages.map { |m| "#{m[:role] || m["role"]}: #{m[:content] || m["content"]}" }.join("\n")
+      prompt = "Summarize the following conversation history concisely, preserving key facts and decisions:\n\n#{conversation}"
+
+      @llm_caller.call(prompt, model: @simple_model)
+    rescue StandardError => e
+      warn "[llm_optimizer] HistoryManager summarization failed: #{e.message}"
+      nil
+    end
+  end
+end
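`estimate_tokens` above uses the common chars/4 approximation rather than a real tokenizer, accepting both symbol and string message keys. The same heuristic in standalone form:

```ruby
# chars/4 token estimate: a rough approximation (roughly right for
# English text), not a real tokenizer. Handles both :content and
# "content" keys, treating missing content as empty.
def estimate_tokens(messages)
  messages.sum { |m| (m[:content] || m["content"] || "").length } / 4
end

messages = [
  { role: "user", content: "a" * 400 },
  { "role" => "assistant", "content" => "b" * 200 },
]
puts estimate_tokens(messages)  # 150
```

With the default `token_budget` of 4000, summarization only kicks in once a conversation exceeds roughly 16,000 characters.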
@@ -0,0 +1,32 @@
+# frozen_string_literal: true
+
+module LlmOptimizer
+  class ModelRouter
+    COMPLEX_KEYWORDS = %w[analyze refactor debug architect].freeze
+    COMPLEX_PHRASES = ["explain in detail"].freeze
+    CODE_BLOCK_RE = /```|~~~/
+
+    def initialize(config)
+      @config = config
+    end
+
+    def route(prompt)
+      # explicit override
+      return @config.route_to if @config.route_to == :simple || @config.route_to == :complex
+
+      # fenced code block
+      return :complex if CODE_BLOCK_RE.match?(prompt)
+
+      # complex keywords or phrases
+      lower = prompt.downcase
+      return :complex if COMPLEX_KEYWORDS.any? { |kw| lower.include?(kw) }
+      return :complex if COMPLEX_PHRASES.any? { |ph| lower.include?(ph) }
+
+      # short prompt
+      return :simple if prompt.split.length < 20
+
+      # default
+      :complex
+    end
+  end
+end
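The routing rules above are pure string heuristics, so they can be exercised in isolation. A standalone sketch of the same decision ladder (minus the config override, which needs a `Configuration` object):

```ruby
# Mirror of the heuristic ladder: code fences and "complex" keywords
# force the big model; short keyword-free prompts go to the small one;
# anything else defaults to :complex (fail toward quality, not cost).
COMPLEX_KEYWORDS = %w[analyze refactor debug architect].freeze

def route(prompt)
  return :complex if prompt.match?(/```|~~~/)

  lower = prompt.downcase
  return :complex if COMPLEX_KEYWORDS.any? { |kw| lower.include?(kw) }
  return :complex if lower.include?("explain in detail")

  return :simple if prompt.split.length < 20

  :complex
end

puts route("What is 2 + 2?")               # simple
puts route("Please refactor this method")  # complex (keyword match)
```

Note the keyword check is a plain substring match, so e.g. "debugging" also triggers `:complex`; this is a deliberate cheap heuristic, not NLP.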
@@ -0,0 +1,66 @@
+# frozen_string_literal: true
+
+require "digest"
+require "msgpack"
+
+module LlmOptimizer
+  class SemanticCache
+    KEY_NAMESPACE = "llm_optimizer:cache:"
+
+    def initialize(redis_client, threshold:, ttl:)
+      @redis = redis_client
+      @threshold = threshold
+      @ttl = ttl
+    end
+
+    def store(embedding, response)
+      key = cache_key(embedding)
+      payload = MessagePack.pack({ "embedding" => embedding, "response" => response })
+      @redis.set(key, payload, ex: @ttl)
+    rescue ::Redis::BaseError => e
+      warn "[llm_optimizer] SemanticCache store failed: #{e.message}"
+    end
+
+    def lookup(embedding)
+      keys = @redis.keys("#{KEY_NAMESPACE}*")
+      return nil if keys.empty?
+
+      best_score = -Float::INFINITY
+      best_response = nil
+
+      keys.each do |key|
+        raw = @redis.get(key)
+        next unless raw
+
+        entry = MessagePack.unpack(raw)
+        stored_embedding = entry["embedding"]
+        score = cosine_similarity(embedding, stored_embedding)
+
+        if score > best_score
+          best_score = score
+          best_response = entry["response"]
+        end
+      end
+
+      best_score >= @threshold ? best_response : nil
+    rescue ::Redis::BaseError => e
+      warn "[llm_optimizer] SemanticCache lookup failed: #{e.message}"
+      nil
+    end
+
+    def cosine_similarity(vec_a, vec_b)
+      dot = vec_a.zip(vec_b).sum { |a, b| a * b }
+      mag_a = Math.sqrt(vec_a.sum { |x| x * x })
+      mag_b = Math.sqrt(vec_b.sum { |x| x * x })
+      return 0.0 if mag_a.zero? || mag_b.zero?
+
+      dot / (mag_a * mag_b)
+    end
+  end
+
+  private
+
+  def cache_key(embedding)
+    KEY_NAMESPACE + Digest::SHA256.hexdigest(embedding.pack("f*"))
+  end
+end
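The cache's nearest-neighbour test is plain cosine similarity over stored vectors, with a zero-magnitude guard. The same math in standalone form:

```ruby
# Cosine similarity: dot product over the product of magnitudes.
# Returns 0.0 for a zero vector to avoid dividing by zero.
def cosine_similarity(vec_a, vec_b)
  dot   = vec_a.zip(vec_b).sum { |a, b| a * b }
  mag_a = Math.sqrt(vec_a.sum { |x| x * x })
  mag_b = Math.sqrt(vec_b.sum { |x| x * x })
  return 0.0 if mag_a.zero? || mag_b.zero?

  dot / (mag_a * mag_b)
end

puts cosine_similarity([1.0, 0.0], [1.0, 0.0])  # identical direction: 1.0
puts cosine_similarity([1.0, 0.0], [0.0, 1.0])  # orthogonal: 0.0
```

The default `similarity_threshold` of 0.96 means only near-identical prompts hit the cache; lowering it trades correctness for hit rate. Note also that `lookup` scans every key with `KEYS`, which is a full O(N) pass per lookup; fine for small caches, but a dedicated vector index would scale better.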
@@ -0,0 +1,273 @@
+# frozen_string_literal: true
+
+require_relative "llm_optimizer/version"
+require_relative "llm_optimizer/configuration"
+require_relative "llm_optimizer/optimize_result"
+require_relative "llm_optimizer/compressor"
+require_relative "llm_optimizer/model_router"
+require_relative "llm_optimizer/embedding_client"
+require_relative "llm_optimizer/semantic_cache"
+require_relative "llm_optimizer/history_manager"
+
+require "llm_optimizer/railtie" if defined?(Rails)
+
+module LlmOptimizer
+  # Base error class for all gem-specific exceptions
+  class Error < StandardError; end
+
+  # Raised when an unrecognized configuration key is set
+  class ConfigurationError < Error; end
+
+  # Raised when the embedding API call fails
+  class EmbeddingError < Error; end
+
+  # Raised when a network timeout is exceeded
+  class TimeoutError < Error; end
+
+  # Global configuration
+  @configuration = nil
+
+  # Yields a Configuration instance; merges it into the global config.
+  def self.configure
+    temp = Configuration.new
+    yield temp
+    configuration.merge!(temp)
+    validate_configuration!(configuration)
+  end
+
+  # Warns about misconfigured options rather than failing silently at call time.
+  def self.validate_configuration!(config)
+    if config.use_semantic_cache && config.embedding_caller.nil?
+      config.logger.warn(
+        "[llm_optimizer] use_semantic_cache is true but no embedding_caller is configured. " \
+        "Semantic caching will be skipped. Set config.embedding_caller to enable it."
+      )
+      config.use_semantic_cache = false
+    end
+  end
+
+  # Returns the current global Configuration, lazy-initializing if nil.
+  def self.configuration
+    @configuration ||= Configuration.new
+  end
+
+  # Replaces the global config with a fresh default Configuration.
+  # Useful in tests to avoid state leakage.
+  def self.reset_configuration!
+    @configuration = Configuration.new
+  end
+
+  # Opt-in client wrapping
+  module WrapperModule
+    def chat(params, &block)
+      prompt = params[:messages] || params[:prompt]
+      optimized = LlmOptimizer.optimize(prompt)
+      params = params.merge(messages: optimized.messages, model: optimized.model)
+      super(params, &block)
+    end
+  end
+
+  # Prepends WrapperModule into client_class; idempotent — safe to call N times.
+  def self.wrap_client(client_class)
+    return if client_class.ancestors.include?(WrapperModule)
+
+    client_class.prepend(WrapperModule)
+  end
+
+  # Primary entry point.
+  # Runs the optimization pipeline and returns an OptimizeResult.
+  #
+  # options hash keys mirror Configuration's known keys and are merged over
+  # the global config for this call only. An optional block is yielded a
+  # per-call Configuration for fine-grained control.
+  def self.optimize(prompt, options = {}, &block)
+    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
+
+    # Resolve per-call configuration — only pass known config keys
+    call_config = Configuration.new
+    call_config.merge!(configuration)
+    options.each do |k, v|
+      next unless LlmOptimizer::Configuration::KNOWN_KEYS.include?(k.to_sym)
+
+      call_config.public_send(:"#{k}=", v)
+    end
+    yield call_config if block_given?
+
+    logger = call_config.logger
+
+    # Keep a reference to the original prompt for fallback use
+    original_prompt = prompt
+
+    # Compression
+    compressor = Compressor.new
+    original_tokens = compressor.estimate_tokens(prompt)
+    compressed_tokens = nil
+
+    if call_config.compress_prompt
+      prompt = compressor.compress(prompt)
+      compressed_tokens = compressor.estimate_tokens(prompt)
+    end
+
+    # Model routing
+    router = ModelRouter.new(call_config)
+    model_tier = router.route(prompt)
+    model = model_tier == :simple ? call_config.simple_model : call_config.complex_model
+
+    # Semantic cache lookup
+    embedding = nil
+
+    if call_config.use_semantic_cache
+      begin
+        emb_client = EmbeddingClient.new(
+          model: call_config.embedding_model,
+          timeout_seconds: call_config.timeout_seconds,
+          embedding_caller: call_config.embedding_caller
+        )
+        embedding = emb_client.embed(prompt)
+
+        if call_config.redis_url
+          redis = build_redis(call_config.redis_url)
+          cache = SemanticCache.new(redis, threshold: call_config.similarity_threshold, ttl: call_config.cache_ttl)
+          cached = cache.lookup(embedding)
+
+          if cached
+            latency_ms = elapsed_ms(start)
+            emit_log(logger, call_config,
+                     cache_status: :hit, model_tier: model_tier,
+                     original_tokens: original_tokens, compressed_tokens: compressed_tokens,
+                     latency_ms: latency_ms, prompt: original_prompt, response: cached)
+            return OptimizeResult.new(
+              response: cached,
+              model: model,
+              model_tier: model_tier,
+              cache_status: :hit,
+              original_tokens: original_tokens,
+              compressed_tokens: compressed_tokens,
+              latency_ms: latency_ms,
+              messages: options[:messages]
+            )
+          end
+        end
+      rescue EmbeddingError => e
+        logger.warn("[llm_optimizer] EmbeddingError (treating as cache miss): #{e.message}")
+        embedding = nil
+        # continue pipeline as cache miss
+      end
+    end
+
+    # History management
+    messages = options[:messages]
+    if call_config.manage_history && messages
+      llm_caller = ->(p, model:) { raw_llm_call(p, model: model, config: call_config) }
+      history_mgr = HistoryManager.new(
+        llm_caller: llm_caller,
+        simple_model: call_config.simple_model,
+        token_budget: call_config.token_budget
+      )
+      messages = history_mgr.process(messages)
+    end
+
+    # Raw LLM call
+    response = raw_llm_call(prompt, model: model, config: call_config)
+
+    # Cache store
+    if call_config.use_semantic_cache && embedding && call_config.redis_url
+      begin
+        redis = build_redis(call_config.redis_url)
+        cache = SemanticCache.new(redis, threshold: call_config.similarity_threshold, ttl: call_config.cache_ttl)
+        cache.store(embedding, response)
+      rescue StandardError => e
+        logger.warn("[llm_optimizer] SemanticCache store failed: #{e.message}")
+      end
+    end
+
+    # Build result
+    latency_ms = elapsed_ms(start)
+    emit_log(logger, call_config,
+             cache_status: :miss, model_tier: model_tier,
+             original_tokens: original_tokens, compressed_tokens: compressed_tokens,
+             latency_ms: latency_ms, prompt: original_prompt, response: response)
+
+    OptimizeResult.new(
+      response: response,
+      model: model,
+      model_tier: model_tier,
+      cache_status: :miss,
+      original_tokens: original_tokens,
+      compressed_tokens: compressed_tokens,
+      latency_ms: latency_ms,
+      messages: messages
+    )
+  rescue EmbeddingError => e
+    # Treat embedding failures as cache miss — continue to raw LLM call
+    logger = configuration.logger
+    logger.warn("[llm_optimizer] EmbeddingError (outer rescue, treating as cache miss): #{e.message}")
+    latency_ms = elapsed_ms(start)
+    response = raw_llm_call(original_prompt, model: nil, config: configuration)
+    OptimizeResult.new(
+      response: response,
+      model: nil,
+      model_tier: nil,
+      cache_status: :miss,
+      original_tokens: original_tokens || 0,
+      compressed_tokens: nil,
+      latency_ms: latency_ms,
+      messages: options[:messages]
+    )
+  rescue LlmOptimizer::Error, StandardError => e
+    logger = configuration.logger
+    logger.error("[llm_optimizer] #{e.class}: #{e.message}\n#{e.backtrace&.first(5)&.join("\n")}")
+    latency_ms = elapsed_ms(start)
+    response = raw_llm_call(original_prompt, model: nil, config: configuration)
+    OptimizeResult.new(
+      response: response,
+      model: nil,
+      model_tier: nil,
+      cache_status: :miss,
+      original_tokens: original_tokens || 0,
+      compressed_tokens: nil,
+      latency_ms: latency_ms,
+      messages: options[:messages]
+    )
+  end
+
+  # Private helpers
+
+  class << self
+    private
+
+    def raw_llm_call(prompt, model:, config: nil)
+      caller = config&.llm_caller || @_current_llm_caller
+      raise ConfigurationError,
+            "No llm_caller configured. Set it via LlmOptimizer.configure { |c| c.llm_caller = ->(prompt, model:) { ... } }" unless caller
+
+      caller.call(prompt, model: model)
+    end
+
+    def elapsed_ms(start)
+      ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000).round(2)
+    end
+
+    def emit_log(logger, config, cache_status:, model_tier:, original_tokens:,
+                 compressed_tokens:, latency_ms:, prompt:, response:)
+      logger.info(
+        "[llm_optimizer] { cache_status: #{cache_status.inspect}, " \
+        "model_tier: #{model_tier.inspect}, " \
+        "original_tokens: #{original_tokens.inspect}, " \
+        "compressed_tokens: #{compressed_tokens.inspect}, " \
+        "latency_ms: #{latency_ms.inspect} }"
+      )
+
+      if config.debug_logging
+        logger.debug("[llm_optimizer] prompt=#{prompt.inspect} response=#{response.inspect}")
+      end
+    end
+
+    def build_redis(redis_url)
+      require "redis"
+      Redis.new(url: redis_url)
+    end
+  end
+end
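`wrap_client` above relies on `Module#prepend`: the wrapper's `chat` runs before the client's own and reaches it via `super`, while the `ancestors` check makes repeated wrapping a no-op. A standalone toy illustration of that pattern (`ToyClient` and `Wrapper` are illustrative names, not the gem's API):

```ruby
# Prepended module: its #chat sits ahead of ToyClient#chat in the
# method lookup chain, so it can rewrite params and call super.
module Wrapper
  def chat(params)
    super(params.merge(wrapped: true))
  end
end

class ToyClient
  def chat(params)
    params  # echo params back so the interception is observable
  end
end

# Idempotent wrapping: skip if the module is already in the chain.
def wrap(klass)
  return if klass.ancestors.include?(Wrapper)

  klass.prepend(Wrapper)
end

wrap(ToyClient)
wrap(ToyClient)  # second call is a no-op

p ToyClient.new.chat(q: "hi")  # params arrive with wrapped: true merged in
```

Prepending (rather than monkey-patching `chat` directly) keeps the original method intact and reachable through `super`, which is what makes the wrapping both reversible in spirit and safe to apply once per class.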
metadata
ADDED
@@ -0,0 +1,135 @@
+--- !ruby/object:Gem::Specification
+name: llm_optimizer
+version: !ruby/object:Gem::Version
+  version: 0.1.0
+platform: ruby
+authors:
+- arun kumar
+bindir: exe
+cert_chain: []
+date: 1980-01-02 00:00:00.000000000 Z
+dependencies:
+- !ruby/object:Gem::Dependency
+  name: redis
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '5.0'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '5.0'
+- !ruby/object:Gem::Dependency
+  name: msgpack
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.7'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.7'
+- !ruby/object:Gem::Dependency
+  name: logger
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.6'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.6'
+- !ruby/object:Gem::Dependency
+  name: prop_check
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.0'
+- !ruby/object:Gem::Dependency
+  name: mocha
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '2.0'
+description: llm_optimizer reduces LLM API costs by up to 80% through semantic caching
+  (Redis + vector similarity), intelligent model routing, token pruning, and conversation
+  history summarization. Strictly opt-in and non-invasive.
+email:
+- arunr.rubydev@gmail.com
+executables: []
+extensions: []
+extra_rdoc_files: []
+files:
+- CHANGELOG.md
+- CODE_OF_CONDUCT.md
+- LICENSE.txt
+- README.md
+- Rakefile
+- lib/generators/llm_optimizer/install_generator.rb
+- lib/generators/llm_optimizer/templates/initializer.rb
+- lib/llm_optimizer.rb
+- lib/llm_optimizer/compressor.rb
+- lib/llm_optimizer/configuration.rb
+- lib/llm_optimizer/embedding_client.rb
+- lib/llm_optimizer/history_manager.rb
+- lib/llm_optimizer/model_router.rb
+- lib/llm_optimizer/optimize_result.rb
+- lib/llm_optimizer/railtie.rb
+- lib/llm_optimizer/semantic_cache.rb
+- lib/llm_optimizer/version.rb
+- sig/llm_optimizer.rbs
+homepage: https://github.com/arunkumarry/llm_optimizer
+licenses:
+- MIT
+metadata:
+  allowed_push_host: https://rubygems.org
+  homepage_uri: https://github.com/arunkumarry/llm_optimizer
+  source_code_uri: https://github.com/arunkumarry/llm_optimizer/tree/main
+  changelog_uri: https://github.com/arunkumarry/llm_optimizer/blob/main/CHANGELOG.md
+rdoc_options: []
+require_paths:
+- lib
+required_ruby_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: 3.2.0
+required_rubygems_version: !ruby/object:Gem::Requirement
+  requirements:
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '0'
+requirements: []
+rubygems_version: 3.6.9
+specification_version: 4
+summary: Smart Gateway for LLM calls — semantic caching, model routing, token pruning,
+  and history management.
+test_files: []