legion-llm 0.3.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/.github/workflows/ci.yml +16 -0
- data/.gitignore +18 -0
- data/.rubocop.yml +56 -0
- data/CHANGELOG.md +71 -0
- data/CLAUDE.md +388 -0
- data/Gemfile +14 -0
- data/LICENSE +20 -0
- data/README.md +615 -0
- data/docs/plans/2026-03-15-ollama-discovery-design.md +164 -0
- data/docs/plans/2026-03-15-ollama-discovery-implementation.md +1147 -0
- data/legion-llm.gemspec +32 -0
- data/lib/legion/llm/bedrock_bearer_auth.rb +53 -0
- data/lib/legion/llm/compressor.rb +75 -0
- data/lib/legion/llm/discovery/ollama.rb +88 -0
- data/lib/legion/llm/discovery/system.rb +139 -0
- data/lib/legion/llm/escalation_history.rb +28 -0
- data/lib/legion/llm/helpers/llm.rb +59 -0
- data/lib/legion/llm/providers.rb +88 -0
- data/lib/legion/llm/quality_checker.rb +56 -0
- data/lib/legion/llm/router/escalation_chain.rb +49 -0
- data/lib/legion/llm/router/health_tracker.rb +160 -0
- data/lib/legion/llm/router/resolution.rb +43 -0
- data/lib/legion/llm/router/rule.rb +103 -0
- data/lib/legion/llm/router.rb +279 -0
- data/lib/legion/llm/settings.rb +97 -0
- data/lib/legion/llm/transport/exchanges/escalation.rb +14 -0
- data/lib/legion/llm/transport/messages/escalation_event.rb +13 -0
- data/lib/legion/llm/version.rb +7 -0
- data/lib/legion/llm.rb +264 -0
- metadata +136 -0
data/README.md
ADDED
@@ -0,0 +1,615 @@
# Legion LLM

LLM integration for the [LegionIO](https://github.com/LegionIO/LegionIO) framework. Wraps [ruby_llm](https://github.com/crmne/ruby_llm) to provide chat, embeddings, tool use, and agent capabilities to any Legion extension.

## Installation

```ruby
gem 'legion-llm'
```

Add this line to your Gemfile and run `bundle install`.

## Configuration

Add to your LegionIO settings directory:

```json
{
  "llm": {
    "default_model": "us.anthropic.claude-sonnet-4-6-v1",
    "default_provider": "bedrock",
    "providers": {
      "bedrock": {
        "enabled": true,
        "region": "us-east-2",
        "vault_path": "legion/bedrock"
      },
      "anthropic": {
        "enabled": false,
        "vault_path": "legion/anthropic"
      },
      "openai": {
        "enabled": false
      },
      "ollama": {
        "enabled": false,
        "base_url": "http://localhost:11434"
      }
    }
  }
}
```

Credentials are resolved from Vault automatically when `vault_path` is set and Legion::Crypt is connected.

### Provider Configuration

Each provider supports these common fields:

| Field | Type | Description |
|-------|------|-------------|
| `enabled` | Boolean | Enable this provider (default: `false`) |
| `api_key` | String | API key (resolved from Vault if `vault_path` set) |
| `vault_path` | String | Vault secret path for credential resolution |

Provider-specific fields:

| Provider | Additional Fields |
|----------|------------------|
| **Bedrock** | `secret_key`, `session_token`, `region` (default: `us-east-2`), `bearer_token` (alternative to SigV4 — for AWS Identity Center/SSO) |
| **Ollama** | `base_url` (default: `http://localhost:11434`) |

### Vault Credential Resolution

When `vault_path` is set and `Legion::Crypt::Vault` is connected, credentials are fetched from Vault at startup. The secret keys map to provider fields automatically:

| Provider | Vault Key | Maps To |
|----------|-----------|---------|
| Bedrock | `access_key` / `aws_access_key_id` | `api_key` |
| Bedrock | `secret_key` / `aws_secret_access_key` | `secret_key` |
| Bedrock | `session_token` / `aws_session_token` | `session_token` |
| Bedrock | `bearer_token` / `aws_bearer_token` | `bearer_token` (Identity Center/SSO) |
| Anthropic / OpenAI / Gemini | `api_key` / `token` | `api_key` |

Direct configuration (setting `api_key` in settings) takes precedence over Vault-resolved values.
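The mapping is small enough to show as a sketch. The constant and method names below are illustrative, not the gem's internals; they only demonstrate the alias lookup and the precedence rule described above:

```ruby
# Illustrative only: fold a Vault secret hash into provider settings,
# letting directly-configured values win over Vault-resolved ones.
BEDROCK_ALIASES = {
  'api_key'       => %w[access_key aws_access_key_id],
  'secret_key'    => %w[secret_key aws_secret_access_key],
  'session_token' => %w[session_token aws_session_token],
  'bearer_token'  => %w[bearer_token aws_bearer_token]
}.freeze

def resolve_credentials(provider, settings, secret)
  aliases = provider == :bedrock ? BEDROCK_ALIASES : { 'api_key' => %w[api_key token] }
  aliases.each_with_object(settings.dup) do |(field, keys), resolved|
    next if resolved[field] # direct configuration takes precedence

    value = keys.map { |key| secret[key] }.compact.first
    resolved[field] = value if value
  end
end

resolve_credentials(:bedrock, { 'region' => 'us-east-2' },
                    { 'aws_access_key_id' => 'AKIA...', 'secret_key' => 's3cr3t' })
# => { "region" => "us-east-2", "api_key" => "AKIA...", "secret_key" => "s3cr3t" }
```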
### Auto-Detection

If no `default_model` or `default_provider` is set, legion-llm auto-detects from the first enabled provider in priority order:

| Priority | Provider | Default Model |
|----------|----------|---------------|
| 1 | Bedrock | `us.anthropic.claude-sonnet-4-6-v1` |
| 2 | Anthropic | `claude-sonnet-4-6` |
| 3 | OpenAI | `gpt-4o` |
| 4 | Gemini | `gemini-2.0-flash` |
| 5 | Ollama | `llama3` |
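Conceptually, the fallback is a first-match lookup over that table. A minimal sketch (the constant and method names are illustrative, not the gem's internals):

```ruby
# Illustrative sketch of auto-detection: the first enabled provider in
# priority order supplies both default_provider and default_model.
PROVIDER_DEFAULTS = [
  [:bedrock,   'us.anthropic.claude-sonnet-4-6-v1'],
  [:anthropic, 'claude-sonnet-4-6'],
  [:openai,    'gpt-4o'],
  [:gemini,    'gemini-2.0-flash'],
  [:ollama,    'llama3']
].freeze

def detect_defaults(providers)
  provider, model = PROVIDER_DEFAULTS.find { |name, _| providers.dig(name, :enabled) }
  provider ? { default_provider: provider, default_model: model } : {}
end

detect_defaults(bedrock: { enabled: false }, openai: { enabled: true })
# => { default_provider: :openai, default_model: "gpt-4o" }
```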
## Core API

### Lifecycle

```ruby
Legion::LLM.start     # Configure providers, resolve Vault credentials, warm discovery caches, set defaults, ping provider
Legion::LLM.shutdown  # Mark disconnected, clean up
Legion::LLM.started?  # -> Boolean
Legion::LLM.settings  # -> Hash (current LLM settings)
```

### Chat

Returns a `RubyLLM::Chat` instance for multi-turn conversation:

```ruby
# Use configured defaults
chat = Legion::LLM.chat
response = chat.ask("What is the capital of France?")
puts response.content

# Override model/provider per call
chat = Legion::LLM.chat(model: 'gpt-4o', provider: :openai)

# Multi-turn conversation
chat = Legion::LLM.chat
chat.ask("Remember: my name is Matt")
chat.ask("What's my name?") # -> "Matt"
```
### Embeddings

```ruby
embedding = Legion::LLM.embed("some text to embed")
embedding.vectors # -> Array of floats

# Specific model
embedding = Legion::LLM.embed("text", model: "text-embedding-3-small")
```

### Tool Use

Define tools as Ruby classes and attach them to a chat session. RubyLLM handles the tool-use loop automatically — when the model calls a tool, ruby_llm executes it and feeds the result back:

```ruby
class WeatherLookup < RubyLLM::Tool
  description "Look up current weather for a location"

  param :location, desc: "City name or zip code"
  param :units, desc: "celsius or fahrenheit", required: false

  def execute(location:, units: "fahrenheit")
    # Your weather API call here
    { temperature: 72, conditions: "sunny", location: location }
  end
end

chat = Legion::LLM.chat
chat.with_tools(WeatherLookup)
response = chat.ask("What's the weather in Minneapolis?")
# Model calls WeatherLookup, gets result, responds with natural language
```

### Structured Output

Use `RubyLLM::Schema` to get typed, validated responses:

```ruby
class SentimentResult < RubyLLM::Schema
  string :sentiment, enum: %w[positive negative neutral]
  number :confidence
  string :reasoning
end

chat = Legion::LLM.chat
result = chat.with_output_schema(SentimentResult).ask("Analyze: 'I love this product!'")
result.sentiment  # -> "positive"
result.confidence # -> 0.95
result.reasoning  # -> "Strong positive language..."
```
### Agents

Define reusable agents as `RubyLLM::Agent` subclasses with declarative configuration:

```ruby
class CodeReviewer < RubyLLM::Agent
  model "us.anthropic.claude-sonnet-4-6-v1", provider: :bedrock
  instructions "You review code for bugs, security issues, and style"
  tools CodeAnalyzer, SecurityScanner
  temperature 0.1

  schema do
    string :verdict, enum: %w[approve request_changes]
    array :issues do
      string
    end
  end
end

reviewer = Legion::LLM.agent(CodeReviewer)
result = reviewer.ask(diff_content)
result.verdict # -> "approve" or "request_changes"
result.issues  # -> ["Line 42: potential SQL injection", ...]
```
## Usage in Extensions

Any LEX extension can use LLM capabilities. The gem provides helper methods that are auto-loaded when legion-llm is present.

### Basic Extension Usage

```ruby
module Legion::Extensions::MyLex::Runners
  module Analyzer
    def analyze(text:, **_opts)
      chat = Legion::LLM.chat
      response = chat.ask("Analyze this: #{text}")
      { analysis: response.content }
    end
  end
end
```

### Declaring LLM as Required

Extensions that cannot function without LLM should declare the dependency. Legion will skip loading the extension if LLM is not available:

```ruby
module Legion::Extensions::MyLex
  def self.llm_required?
    true
  end
end
```

### Helper Methods

Include the LLM helper for convenience methods in any runner:

```ruby
# One-shot chat (returns RubyLLM::Response)
result = llm_chat("Summarize this text", instructions: "Be concise")

# Chat with tools
result = llm_chat("Check the weather", tools: [WeatherLookup])

# With prompt compression (reduces input tokens for cost/speed)
result = llm_chat("Summarize the data", instructions: "Be concise", compress: 2)

# Embeddings
embedding = llm_embed("some text to embed")

# Multi-turn session (returns RubyLLM::Chat for continued conversation)
session = llm_session
session.with_instructions("You are a code reviewer")
session.with_tools(CodeAnalyzer, SecurityScanner)
response = session.ask("Review this PR: #{diff}")
```
### Routing

legion-llm includes a dynamic weighted routing engine that dispatches requests across local, fleet, and cloud tiers based on caller intent, priority rules, time schedules, cost multipliers, and real-time provider health. Routing is **disabled by default** — opt in via settings.

#### Three Tiers

```
┌──────────────────────────────────────────────────────────┐
│ Legion::LLM Router (per-node)                            │
│                                                          │
│ Tier 1: LOCAL → Ollama on this machine (direct HTTP)     │
│          Zero network overhead, no Transport             │
│                                                          │
│ Tier 2: FLEET → Ollama on Mac Studios / GPU servers      │
│          Via Legion::Transport (AMQP) when local can't   │
│          serve the model (Phase 2, not yet built)        │
│                                                          │
│ Tier 3: CLOUD → Bedrock / Anthropic / OpenAI / Gemini    │
│          Existing provider API calls                     │
└──────────────────────────────────────────────────────────┘
```

| Tier | Target | Use Case |
|------|--------|----------|
| `local` | Ollama on localhost | Privacy-sensitive, offline, or low-latency workloads |
| `fleet` | Shared hardware via Legion::Transport | Larger models on dedicated GPU servers (Phase 2) |
| `cloud` | API providers (Bedrock, Anthropic, OpenAI, Gemini) | Frontier models, full-capability inference |

#### Intent-Based Dispatch

Pass an `intent:` hash to route based on privacy, capability, or cost requirements:

```ruby
# Route to local tier for strict privacy
result = llm_chat("Summarize this PII data", intent: { privacy: :strict })

# Route to cloud for reasoning tasks
result = llm_chat("Solve this proof", intent: { capability: :reasoning })

# Minimize cost — prefers local/fleet over cloud
result = llm_chat("Translate this", intent: { cost: :minimize })

# Explicit tier override (bypasses rules)
result = llm_chat("Translate this", tier: :cloud, model: "claude-sonnet-4-6")
```

Same parameters work on `Legion::LLM.chat` and `llm_session`:

```ruby
chat = Legion::LLM.chat(intent: { privacy: :strict, capability: :basic })
session = llm_session(tier: :local)
```

#### Intent Dimensions

| Dimension | Values | Default | Effect |
|-----------|--------|---------|--------|
| `privacy` | `:strict`, `:normal` | `:normal` | `:strict` -> never cloud (via constraint rules) |
| `capability` | `:basic`, `:moderate`, `:reasoning` | `:moderate` | Higher prefers larger/cloud models |
| `cost` | `:minimize`, `:normal` | `:normal` | `:minimize` prefers local/fleet |
#### Routing Resolution

```
1. Caller passes intent: { privacy: :strict, capability: :basic }
2. Router merges with default_intent (fills missing dimensions)
3. Load rules from settings, filter by:
   a. Intent match (all `when` conditions must match)
   b. Schedule window (valid_from/valid_until, hours, days)
   c. Constraints (e.g., never_cloud strips cloud-tier rules)
   d. Discovery (Ollama model pulled? Model fits in available RAM?)
   e. Tier availability (is Ollama running? is Transport loaded?)
4. Score remaining candidates:
   effective_priority = rule.priority
                      + health_tracker.adjustment(provider)
                      + (1.0 - cost_multiplier) * 10
5. Return Resolution for highest-scoring candidate
```
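The scoring step is plain arithmetic over the surviving rules. A minimal sketch of how the winner might be picked (the names and data are illustrative; only the formula mirrors the one above):

```ruby
# Illustrative scoring of surviving rule candidates: static priority,
# adjusted by provider health, plus a bonus for cheaper rules.
def effective_priority(rule, adjustments)
  rule.fetch(:priority, 0) +
    adjustments.fetch(rule[:provider], 0) +
    (1.0 - rule.fetch(:cost_multiplier, 1.0)) * 10
end

rules = [
  { name: 'reasoning_cloud', provider: :bedrock,   priority: 50, cost_multiplier: 1.0 },
  { name: 'anthropic_promo', provider: :anthropic, priority: 60, cost_multiplier: 0.5 }
]
adjustments = { anthropic: -50 } # e.g. circuit open after repeated failures

rules.max_by { |rule| effective_priority(rule, adjustments) }[:name]
# => "reasoning_cloud" (50 + 0 + 0 beats 60 - 50 + 5)
```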
#### Settings

Add routing configuration under the `llm` key:

```json
{
  "llm": {
    "routing": {
      "enabled": true,
      "default_intent": { "privacy": "normal", "capability": "moderate", "cost": "normal" },
      "tiers": {
        "local": { "provider": "ollama" },
        "fleet": { "queue": "llm.inference", "timeout_seconds": 30 },
        "cloud": { "providers": ["bedrock", "anthropic"] }
      },
      "health": {
        "window_seconds": 300,
        "circuit_breaker": { "failure_threshold": 3, "cooldown_seconds": 60 },
        "latency_penalty_threshold_ms": 5000
      },
      "rules": [
        {
          "name": "privacy_local",
          "when": { "privacy": "strict" },
          "then": { "tier": "local", "provider": "ollama", "model": "llama3" },
          "priority": 100,
          "constraint": "never_cloud"
        },
        {
          "name": "reasoning_cloud",
          "when": { "capability": "reasoning" },
          "then": { "tier": "cloud", "provider": "bedrock", "model": "us.anthropic.claude-sonnet-4-6-v1" },
          "priority": 50,
          "cost_multiplier": 1.0
        },
        {
          "name": "anthropic_promo",
          "when": { "cost": "normal" },
          "then": { "tier": "cloud", "provider": "anthropic", "model": "claude-sonnet-4-6" },
          "priority": 60,
          "cost_multiplier": 0.5,
          "schedule": {
            "valid_from": "2026-03-15T00:00:00",
            "valid_until": "2026-03-29T23:59:59",
            "hours": ["00:00-06:00", "18:00-23:59"]
          },
          "note": "Double token promotion — off-peak hours only"
        }
      ]
    }
  }
}
```
#### Routing Rules

Each rule is a hash with:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `name` | String | Yes | Unique rule identifier |
| `when` | Hash | Yes | Intent conditions to match (`privacy`, `capability`, `cost`) |
| `then` | Hash | No | Target: `{ tier:, provider:, model: }` |
| `priority` | Integer | No (default 0) | Higher wins when multiple rules match |
| `constraint` | String | No | Hard constraint (e.g., `never_cloud`) |
| `fallback` | String | No | Fallback tier if primary is unavailable |
| `cost_multiplier` | Float | No (default 1.0) | Lower = cheaper = routing bonus |
| `schedule` | Hash | No | Time-based activation window |
| `note` | String | No | Human-readable note |
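For the `schedule` field, a rule like `anthropic_promo` above is only a candidate while the current time falls inside its window. A simplified sketch of that check (it ignores the `days` field and the timezone handling that `tzinfo` provides; key names follow the settings example):

```ruby
require 'time'

# Illustrative schedule-window matching: a rule stays eligible only when
# `now` is between valid_from/valid_until and inside one of the hour ranges.
def schedule_active?(schedule, now = Time.now)
  return true if schedule.nil? || schedule.empty?

  from_t  = schedule['valid_from']  && Time.parse(schedule['valid_from'])
  until_t = schedule['valid_until'] && Time.parse(schedule['valid_until'])
  return false if from_t && now < from_t
  return false if until_t && now > until_t

  hours = schedule['hours']
  return true if hours.nil? || hours.empty?

  hhmm = now.strftime('%H:%M')
  hours.any? do |range|
    start_hhmm, end_hhmm = range.split('-')
    (start_hhmm..end_hhmm).cover?(hhmm)
  end
end

schedule = {
  'valid_from'  => '2026-03-15T00:00:00',
  'valid_until' => '2026-03-29T23:59:59',
  'hours'       => ['00:00-06:00', '18:00-23:59']
}
schedule_active?(schedule, Time.parse('2026-03-20T19:30:00')) # => true
schedule_active?(schedule, Time.parse('2026-03-20T12:00:00')) # => false (outside hours)
```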
#### Health Tracking

The `HealthTracker` adjusts effective priorities at runtime based on provider health signals:

- **Circuit breaker**: After consecutive failures, a provider's circuit opens (penalty: -50) then transitions to half_open (penalty: -25) after a cooldown period
- **Latency penalty**: Rolling window tracks average latency; providers above threshold receive priority penalties
- **Pluggable signals**: Any LEX can feed custom signals (e.g., GPU utilization, budget tracking) via `register_handler`

```ruby
# Report signals (typically called by LEX extensions)
tracker = Legion::LLM::Router.health_tracker
tracker.report(provider: :anthropic, signal: :error, value: 1)
tracker.report(provider: :ollama, signal: :latency, value: 1200)

# Check state
tracker.circuit_state(:anthropic) # -> :closed, :open, or :half_open
tracker.adjustment(:anthropic)    # -> Integer (priority offset)

# Add custom signal handler
tracker.register_handler(:gpu_utilization) { |data| ... }
```
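To make those numbers concrete, the integer offset a provider contributes to routing could be derived roughly as below. This is an illustrative sketch, not the `HealthTracker` implementation, and the latency penalty value is an assumption:

```ruby
# Illustrative: translate circuit state and recent latency into the integer
# offset that gets added to a rule's priority during routing resolution.
CIRCUIT_PENALTY = { closed: 0, half_open: -25, open: -50 }.freeze

def adjustment_for(state, avg_latency_ms, threshold_ms: 5000, latency_penalty: -10)
  penalty = CIRCUIT_PENALTY.fetch(state, 0)
  penalty += latency_penalty if avg_latency_ms && avg_latency_ms > threshold_ms
  penalty
end

adjustment_for(:open, 1_200)   # => -50  (circuit open after repeated failures)
adjustment_for(:closed, 8_000) # => -10  (healthy but slow)
```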
When routing is disabled (the default), `chat`, `llm_chat`, and `llm_session` behave exactly as before — no behavior change until you opt in.

#### Local Model Discovery

When the Ollama provider is enabled, legion-llm discovers which models are actually pulled and checks available system memory before routing to local models. This prevents the router from selecting models that aren't installed or that won't fit in RAM.

Discovery uses lazy TTL-based caching (default: 60 seconds). At startup, caches are warmed and logged:

```
Ollama: 3 models available (llama3.1:8b, qwen2.5:32b, nomic-embed-text)
System: 65536 MB total, 42000 MB available
```

Configure under `discovery`:

```json
{
  "llm": {
    "discovery": {
      "enabled": true,
      "refresh_seconds": 60,
      "memory_floor_mb": 2048
    }
  }
}
```

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `enabled` | Boolean | `true` | Master switch for discovery checks |
| `refresh_seconds` | Integer | `60` | TTL for discovery caches |
| `memory_floor_mb` | Integer | `2048` | Minimum free MB to reserve for OS |

When a routing rule targets a local Ollama model that isn't pulled or won't fit in available memory (minus `memory_floor_mb`), the rule is silently skipped and the next best candidate is used. If discovery fails (Ollama not running, unknown OS), checks are bypassed permissively.
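The fit check itself amounts to a simple comparison. A rough sketch (the method name and the source of the model sizes are illustrative; the gem reads them from Ollama's model list and the OS):

```ruby
# Illustrative: a local rule survives discovery filtering only if the model
# is pulled and its size fits under available memory minus the floor.
def local_candidate_ok?(model, pulled_models, available_mb, memory_floor_mb: 2048)
  size_mb = pulled_models[model] # nil if the model isn't pulled
  return false unless size_mb

  size_mb <= available_mb - memory_floor_mb
end

pulled = { 'llama3.1:8b' => 4_900, 'qwen2.5:32b' => 19_000 }
local_candidate_ok?('qwen2.5:32b', pulled, 42_000)  # => true
local_candidate_ok?('llama3.1:70b', pulled, 42_000) # => false (not pulled)
```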
### Model Escalation

When an LLM call fails (API error, timeout, or quality issue), the escalation system automatically retries with more capable models. If all attempts fail, `Legion::LLM::EscalationExhausted` is raised.

```ruby
# Enable escalation and ask in one call
response = Legion::LLM.chat(
  message: "Generate a SQL query for user analytics",
  escalate: true,
  max_escalations: 3,
  quality_check: ->(r) { r.content.include?('SELECT') }
)

# Check if escalation occurred (true only when more than one attempt was made)
response.escalated?          # => true if >1 attempt was made
response.escalation_history  # => [{model:, provider:, tier:, outcome:, failures:, duration_ms:}, ...]
response.final_resolution    # => Resolution that succeeded
response.escalation_chain    # => EscalationChain used for this call
```

Raises `Legion::LLM::EscalationExhausted` if all attempts are exhausted.
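Callers that want to degrade gracefully can rescue it; a minimal sketch:

```ruby
begin
  response = Legion::LLM.chat(message: "Generate a SQL query", escalate: true)
rescue Legion::LLM::EscalationExhausted
  # Every model in the chain failed or was rejected by the quality check;
  # fall back to whatever degraded behavior makes sense for your extension.
  response = nil
end
```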
Configure globally in settings:

```yaml
llm:
  routing:
    escalation:
      enabled: true
      max_attempts: 3
      quality_threshold: 50
```
### Prompt Compression

`Legion::LLM::Compressor` strips low-signal words from prompts before sending to the API, reducing input token count and cost. Compression is deterministic (same input always produces the same output), preserving prompt caching compatibility.

#### Levels

| Level | Name | What It Removes |
|-------|------|-----------------|
| 0 | None | Nothing |
| 1 | Light | Articles (a, an, the), filler adverbs (just, very, really, basically, ...) |
| 2 | Moderate | + sentence connectives (however, moreover, furthermore, ...) |
| 3 | Aggressive | + low-signal words (also, then, please, note, that, ...) + whitespace normalization |

Code blocks (fenced and inline) are never modified. Negation words are never removed.

#### Usage

```ruby
# Direct API
text = Legion::LLM::Compressor.compress("The very important system prompt", level: 2)

# Via llm_chat helper (compresses both message and instructions)
result = llm_chat("Analyze the data", instructions: "Be very concise", compress: 2)
```
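Conceptually, a light (level 1) pass is deterministic word filtering that leaves code spans untouched. A rough sketch of the idea (the word list and regexes are illustrative, not the gem's actual lists):

```ruby
# Illustrative level-1 compression: drop articles and filler adverbs outside
# of code spans. Purely regex-based, so identical input gives identical output.
FENCE = '`' * 3
CODE_SPAN = /(#{FENCE}.*?#{FENCE}|`[^`]*`)/m # fenced or inline code
LIGHT_WORDS = /\b(?:a|an|the|just|very|really|basically)\b\s*/i

def compress_light(text)
  # Code spans land at odd indices after the split and pass through untouched.
  text.split(CODE_SPAN).map.with_index do |chunk, i|
    i.odd? ? chunk : chunk.gsub(LIGHT_WORDS, '')
  end.join
end

compress_light("Summarize the report, but keep `the_code` intact")
# => "Summarize report, but keep `the_code` intact"
```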
#### Router Integration

Routing rules can specify `compress_level` in their target to auto-compress for cost-sensitive tiers:

```json
{
  "name": "cloud_compressed",
  "priority": 50,
  "when": { "capability": "chat" },
  "then": { "tier": "cloud", "provider": "bedrock", "model": "claude-sonnet-4-6", "compress_level": 2 }
}
```

### Building an LLM-Powered LEX

A complete example of a LEX extension that uses LLM for intelligent processing:

```ruby
# lib/legion/extensions/smart_alerts/runners/evaluate.rb
module Legion::Extensions::SmartAlerts::Runners
  module Evaluate
    def evaluate(alert_data:, **_opts)
      session = llm_session(model: 'us.anthropic.claude-sonnet-4-6-v1')
      session.with_instructions(<<~PROMPT)
        You are an alert triage system. Given alert data, determine:
        1. Severity (critical, warning, info)
        2. Whether it requires immediate human attention
        3. Suggested remediation steps
      PROMPT

      result = session.ask("Evaluate this alert: #{alert_data.to_json}")

      {
        evaluation: result.content,
        timestamp: Time.now.utc,
        model: 'us.anthropic.claude-sonnet-4-6-v1'
      }
    end
  end
end
```
## Providers

| Provider | Config Key | Credential Source | Notes |
|----------|-----------|-------------------|-------|
| AWS Bedrock | `bedrock` | Vault (`access_key`, `secret_key`) or direct | Default region: us-east-2 |
| Anthropic | `anthropic` | Vault (`api_key`) or direct | Direct API access |
| OpenAI | `openai` | Vault (`api_key`) or direct | GPT models |
| Google Gemini | `gemini` | Vault (`api_key`) or direct | Gemini models |
| Ollama | `ollama` | Local, no credentials needed | Local inference |

## Integration with LegionIO

legion-llm follows the standard core gem lifecycle:

```
Legion::Service#initialize
  ...
  setup_data         # Legion::Data
  setup_llm          # Legion::LLM  <-- here
  setup_supervision  # Legion::Supervision
  load_extensions    # LEX extensions (can use LLM if available)
```

- **Service**: `setup_llm` called between data and supervision in startup sequence
- **Extensions**: `llm_required?` method on extension module, checked at load time
- **Helpers**: `Legion::Extensions::Helpers::LLM` auto-loaded when gem is present
- **Readiness**: Registers as `:llm` in `Legion::Readiness`
- **Shutdown**: `Legion::LLM.shutdown` called during service shutdown (reverse order)

## Development

```bash
git clone https://github.com/LegionIO/legion-llm.git
cd legion-llm
bundle install
bundle exec rspec
```

### Running Tests

Tests use stubbed `Legion::Logging` and `Legion::Settings` modules (no need for the full LegionIO stack):

```bash
bundle exec rspec                                  # Run all 269 tests
bundle exec rubocop                                # Lint (0 offenses)
bundle exec rspec spec/legion/llm_spec.rb          # Run specific test file
bundle exec rspec spec/legion/llm/router_spec.rb   # Router tests only
```

## Dependencies

| Gem | Purpose |
|-----|---------|
| `ruby_llm` (>= 1.0) | Multi-provider LLM client |
| `tzinfo` (>= 2.0) | IANA timezone conversion for schedule windows |
| `legion-logging` | Logging |
| `legion-settings` | Configuration |

## License

Apache-2.0