ruby_llm-contract 0.5.2 → 0.6.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +25 -2
- data/Gemfile.lock +2 -2
- data/README.md +167 -137
- data/lib/ruby_llm/contract/concerns/eval_host.rb +25 -6
- data/lib/ruby_llm/contract/eval/model_comparison.rb +33 -11
- data/lib/ruby_llm/contract/eval/recommendation.rb +48 -0
- data/lib/ruby_llm/contract/eval/recommender.rb +132 -0
- data/lib/ruby_llm/contract/eval/report.rb +2 -2
- data/lib/ruby_llm/contract/eval/report_stats.rb +6 -0
- data/lib/ruby_llm/contract/eval/report_storage.rb +18 -12
- data/lib/ruby_llm/contract/eval.rb +2 -0
- data/lib/ruby_llm/contract/step/base.rb +21 -0
- data/lib/ruby_llm/contract/step/retry_executor.rb +9 -3
- data/lib/ruby_llm/contract/step/retry_policy.rb +27 -14
- data/lib/ruby_llm/contract/version.rb +1 -1
- data/lib/ruby_llm/contract.rb +21 -0
- metadata +3 -1
checksums.yaml
CHANGED

```diff
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c86dcf06a34e5ff934367708213a0852191b0be7bd61e61e8577161e32ccf807
+  data.tar.gz: 28a44dd4f4d4c0f74c5da7800d9fa1fd2b8e51242a056f5aaae5f6f9fcf1be69
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c2b6dc71a02519288ae4ee4f74f17d74e6c45d01e888036d18e4c821b8fc63ba43ac0ccee6b2948163597b19c93008a6e5e0375fd65c8f3ad8fcf1285f356e91
+  data.tar.gz: f08576e520ec7397c4b233c5907ce703890cf4616e969e25f4c980ac1b06013a62ff680b3c90878d76fd1da0980f04d4946a2fd11a8e304e43f5155e5c855e8e
```
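The digest entries above are what RubyGems compares against the downloaded `metadata.gz` and `data.tar.gz` archives at install time. A minimal sketch of that comparison using Ruby's standard `Digest` library (the `checksum_ok?` helper name is ours, not a RubyGems API):

```ruby
require "digest"

# Compare the SHA-256 of some downloaded bytes against an expected digest,
# the way the checksums.yaml entries are used (illustrative sketch).
def checksum_ok?(data, expected_sha256)
  Digest::SHA256.hexdigest(data) == expected_sha256
end

# SHA-256 of the empty string, a well-known constant:
empty_digest = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
checksum_ok?("", empty_digest)         # => true
checksum_ok?("tampered", empty_digest) # => false
```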
data/CHANGELOG.md
CHANGED

````diff
@@ -1,5 +1,28 @@
 # Changelog
 
+## 0.6.0 (2026-04-12)
+
+"What should I do?" — model + configuration recommendation.
+
+### Features
+
+- **`Step.recommend`** — `ClassifyTicket.recommend("eval", candidates: [...], min_score: 0.95)` runs eval on all candidates and returns a `Recommendation` with optimal model, retry chain, rationale, savings vs current config, and `to_dsl` code output.
+- **Candidates as configurations** — `candidates:` accepts `{ model:, reasoning_effort: }` hashes, not just model name strings. `gpt-5-mini` with `reasoning_effort: "low"` is a different candidate than with `"high"`.
+- **`compare_models` extended** — new `candidates:` parameter alongside existing `models:` (backward compatible). Candidate labels include reasoning effort in output table.
+- **Per-attempt `reasoning_effort` in retry policies** — `escalate` accepts config hashes: `escalate({ model: "gpt-4.1-nano" }, { model: "gpt-5-mini", reasoning_effort: "high" })`. Each attempt gets its own reasoning_effort forwarded to the provider.
+- **`pass_rate_ratio`** — numeric float (0.0–1.0) on `Report` and `ReportStats`, complementing the string `pass_rate` (`"3/5"`).
+- **History entries enriched** — `save_history!` accepts `reasoning_effort:` and stores `model`, `reasoning_effort`, `pass_rate_ratio` in JSONL entries.
+
+### Game changer continuity
+
+```
+v0.2: "Which model?"            → compare_models (snapshot)
+v0.3: "Did it change?"          → baseline regression (binary)
+v0.4: "Show me the trend"       → eval history (time series)
+v0.5: "Which prompt is better?" → compare_with (A/B testing)
+v0.6: "What should I do?"       → recommend (actionable advice)
+```
+
 ## 0.5.2 (2026-04-06)
 
 ### Features
@@ -8,7 +31,7 @@
 
 ## 0.5.0 (2026-03-25)
 
-Data-Driven Prompt Engineering
+Data-Driven Prompt Engineering.
 
 ### Features
 
@@ -55,7 +78,7 @@ Audit hardening — 18 bugs fixed across 4 audit rounds.
 
 ## 0.4.3 (2026-03-24)
 
-Production feedback release
+Production feedback release.
 
 ### Features
 
````
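The per-attempt `escalate` configs described in the 0.6.0 notes imply a simple selection rule: attempt 1 uses the first config, attempt 2 the second, and attempts beyond the chain stick to the last entry. A standalone sketch under that assumption (`config_for_attempt` is a hypothetical helper, not the gem's internal API):

```ruby
# Escalation chain as the changelog shows it: plain hash per attempt,
# with reasoning_effort carried alongside the model where given.
ESCALATION = [
  { model: "gpt-4.1-nano" },
  { model: "gpt-5-mini", reasoning_effort: "high" }
].freeze

# Hypothetical helper: clamp the attempt number to the chain length so
# retries past the last config keep using the strongest candidate.
def config_for_attempt(chain, attempt)
  chain[[attempt - 1, chain.length - 1].min]
end

config_for_attempt(ESCALATION, 1) # => { model: "gpt-4.1-nano" }
config_for_attempt(ESCALATION, 2) # => { model: "gpt-5-mini", reasoning_effort: "high" }
```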
data/Gemfile.lock
CHANGED

```diff
@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    ruby_llm-contract (0.
+    ruby_llm-contract (0.6.0)
       dry-types (~> 1.7)
       ruby_llm (~> 1.0)
       ruby_llm-schema (~> 0.3)
@@ -258,7 +258,7 @@ CHECKSUMS
   rubocop-ast (1.49.1) sha256=4412f3ee70f6fe4546cc489548e0f6fcf76cafcfa80fa03af67098ffed755035
   ruby-progressbar (1.13.0) sha256=80fc9c47a9b640d6834e0dc7b3c94c9df37f08cb072b7761e4a71e22cff29b33
   ruby_llm (1.14.0) sha256=57c6f7034fc4a44504ea137d70f853b07824f1c1cdbe774ab3ab3522e7098deb
-  ruby_llm-contract (0.
+  ruby_llm-contract (0.6.0)
   ruby_llm-schema (0.3.0) sha256=a591edc5ca1b7f0304f0e2261de61ba4b3bea17be09f5cf7558153adfda3dec6
   ruby_parser (3.22.0) sha256=1eb4937cd9eb220aa2d194e352a24dba90aef00751e24c8dfffdb14000f15d23
   rubycritic (4.12.0) sha256=024fed90fe656fa939f6ea80aab17569699ac3863d0b52fd72cb99892247abc8
```
data/README.md
CHANGED

````diff
@@ -1,24 +1,89 @@
 # ruby_llm-contract
 
-
+The missing link between LLM cost and quality. Stop choosing between "cheap but wrong" and "accurate but expensive" — get both. Contracts, model escalation, and data-driven recommendations for [ruby_llm](https://github.com/crmne/ruby_llm).
 
-
+```
+YOU WRITE                      THE GEM HANDLES                  YOU GET
+─────────                      ───────────────                  ───────
+
+validate { |o| ... }           catch bad answers — combined     Zero garbage
+                               with retry_policy, auto-retry    in production
 
-
+retry_policy                   start cheap, escalate only       Pay for the cheapest
+  models: %w[nano mini full]   when validation fails            model that works
 
-
+max_cost 0.01                  estimate tokens, check price,    No surprise bills
+                               refuse before calling LLM
 
-
+output_schema { ... }          send JSON schema to provider,    Zero parsing code
+                               validate response client-side
+
+define_eval { ... }            test cases + baselines,          Regressions caught
+                               run in CI with real LLM          before deploy
+
+recommend(candidates: [...])   evaluate all configs, pick       Optimal model +
+                               cheapest that passes             retry chain
+```
+
+## Before and after
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│ BEFORE: pick one model, hope for the best                       │
+│                                                                 │
+│ expensive model → accurate, but you overpay on every call       │
+│ cheap model → fast, but wrong answers slip to production        │
+│ prompt change → "looks good to me" → deploy → users suffer      │
+└─────────────────────────────────────────────────────────────────┘
+
+                        ⬇ add ruby_llm-contract
+
+┌─────────────────────────────────────────────────────────────────┐
+│                      YOU DEFINE A CONTRACT                      │
+│                                                                 │
+│ output_schema { string :priority }       ← valid structure      │
+│ validate("valid priority") { |o| ... }   ← business rules       │
+│ retry_policy models: %w[nano mini full]  ← escalation chain     │
+│ max_cost 0.01                            ← budget cap           │
+└───────────────────────────┬─────────────────────────────────────┘
+                            │
+                            ▼
+┌─────────────────────────────────────────────────────────────────┐
+│                    THE GEM HANDLES THE REST                     │
+│                                                                 │
+│  request ──→ ┌──────┐      ┌──────────┐                         │
+│              │ nano │ ──→  │ contract │ ──→ ✓ pass → done       │
+│              └──────┘      └────┬─────┘                         │
+│                                 │ ✗ fail                        │
+│                                 ▼                               │
+│              ┌──────┐      ┌──────────┐                         │
+│              │ mini │ ──→  │ contract │ ──→ ✓ pass → done       │
+│              └──────┘      └────┬─────┘                         │
+│                                 │ ✗ fail                        │
+│                                 ▼                               │
+│              ┌──────┐      ┌──────────┐                         │
+│              │ full │ ──→  │ contract │ ──→ ✓ pass → done       │
+│              └──────┘      └──────────┘                         │
+└───────────────────────────┬─────────────────────────────────────┘
+                            │
+                            ▼
+┌─────────────────────────────────────────────────────────────────┐
+│ YOU GET                                                         │
+│                                                                 │
+│ ✓ Valid output guaranteed — schema + business rules enforced    │
+│ ✓ Cheapest model that works — most requests stay on nano        │
+│ ✓ Cost, latency, tokens — tracked on every call                 │
+│ ✓ Eval scores per model — data instead of gut feeling           │
+│ ✓ Regressions caught — before deploy, not after                 │
+│ ✓ Recommendation — "use nano+mini, drop full, save $X/mo"       │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+## 30-second version
 
 ```ruby
 class ClassifyTicket < RubyLLM::Contract::Step::Base
-  prompt
-    system "You are a support ticket classifier."
-    rule "Return valid JSON only, no markdown."
-    rule "Use exactly one priority: low, medium, high, urgent."
-    example input: "My invoice is wrong", output: '{"priority": "high"}'
-    user "{input}"
-  end
+  prompt "Classify this support ticket by priority and category.\n\n{input}"
 
   output_schema do
     string :priority, enum: %w[low medium high urgent]
@@ -30,29 +95,45 @@ class ClassifyTicket < RubyLLM::Contract::Step::Base
 end
 
 result = ClassifyTicket.run("I was charged twice")
-result.
-result.
-result.trace[:cost]
-result.trace[:model] # => "gpt-4.1-nano"
+result.parsed_output  # => {priority: "high", category: "billing"}
+result.trace[:model]  # => "gpt-4.1-nano" (first model that passed)
+result.trace[:cost]   # => 0.000032
 ```
 
-Bad JSON?
+Bad JSON? Retried automatically. Wrong answer? Escalated to a smarter model. Schema violated? Caught client-side. The contract guarantees every response meets your rules — you pay for the cheapest model that passes.
+
+## Install
+
+```ruby
+gem "ruby_llm-contract"
+```
 
-
+```ruby
+RubyLLM.configure { |c| c.openai_api_key = ENV["OPENAI_API_KEY"] }
+RubyLLM::Contract.configure { |c| c.default_model = "gpt-4.1-mini" }
+```
 
-
-- compare prompt versions on the same dataset
-- merge only when the eval stays green
+Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc.).
 
-
+## Save money with model escalation
 
-
+Without a contract, you use gpt-4.1 for everything because you can't tell when a cheaper model gets it wrong. With a contract, you start on nano and only escalate when the answer fails the contract:
 
-
+```ruby
+retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
+```
 
-
+```
+Attempt 1: gpt-4.1-nano → contract failed  ($0.0001)
+Attempt 2: gpt-4.1-mini → contract passed  ($0.0004)
+           gpt-4.1      → never called     ($0.00)
+```
 
-
+Most requests succeed on the cheapest model. You pay full price only for the ones that need it. How many? Run `compare_models` and find out.
+
+## Know which model to use — with data
+
+Don't guess. Define test cases, compare models, get numbers:
 
 ```ruby
 ClassifyTicket.define_eval("regression") do
@@ -62,171 +143,120 @@ ClassifyTicket.define_eval("regression")
 end
 
 comparison = ClassifyTicket.compare_models("regression",
-  models: %w[gpt-4.1-nano gpt-4.1-mini])
+  models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1])
 ```
 
-Real output from real API calls:
-
 ```
-
+Candidate      Score   Cost      Avg Latency
 ---------------------------------------------------------
-gpt-4.1-nano 0.67 $0.
-gpt-4.1-mini 1.00 $0.
+gpt-4.1-nano   0.67    $0.0001   48ms
+gpt-4.1-mini   1.00    $0.0004   92ms
+gpt-4.1        1.00    $0.0021   210ms
 
 Cheapest at 100%: gpt-4.1-mini
 ```
 
-
-comparison.best_for(min_score: 0.95) # => "gpt-4.1-mini"
-
-# Inspect failures
-comparison.reports["gpt-4.1-nano"].failures.each do |f|
-  puts "#{f.name}: expected #{f.expected}, got #{f.output}"
-  puts "  mismatches: #{f.mismatches}"
-  # => outage: expected {priority: "urgent"}, got {priority: "high"}
-  #    mismatches: {priority: {expected: "urgent", got: "high"}}
-end
-```
+Nano fails on edge cases. Mini and full both score 100% — but mini is **5x cheaper**. Now you know.
 
-##
+## Let the gem tell you what to do
 
-
+Don't read tables — get a recommendation. Supports `model + reasoning_effort` combinations:
 
 ```ruby
-
-
-
-
-
-
-
-
-
-
+rec = ClassifyTicket.recommend("regression",
+  candidates: [
+    { model: "gpt-4.1-nano" },
+    { model: "gpt-4.1-mini" },
+    { model: "gpt-5-mini", reasoning_effort: "low" },
+    { model: "gpt-5-mini", reasoning_effort: "high" },
+  ],
+  min_score: 0.95
+)
+
+rec.best        # => { model: "gpt-4.1-mini" }
+rec.retry_chain # => [{ model: "gpt-4.1-nano" }, { model: "gpt-4.1-mini" }]
+rec.to_dsl      # => "retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]"
+rec.savings     # => savings vs your current model (if configured)
 ```
 
-
-
-```ruby
-# RSpec — block merge if accuracy drops or cost spikes
-expect(ClassifyTicket).to pass_eval("regression")
-  .with_context(model: "gpt-4.1-mini")
-  .with_minimum_score(0.8)
-  .with_maximum_cost(0.01)
-
-# Rake — run all evals across all steps
-require "ruby_llm/contract/rake_task"
-RubyLLM::Contract::RakeTask.new do |t|
-  t.minimum_score = 0.8
-  t.maximum_cost = 0.05
-end
-# bundle exec rake ruby_llm_contract:eval
-```
+Copy `rec.to_dsl` into your step. Done.
 
-##
+## Catch regressions before users do
 
-
+A model update silently dropped your accuracy? A prompt tweak broke an edge case? You'll know before deploying:
 
 ```ruby
+# Save a baseline once:
 report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
 report.save_baseline!(model: "gpt-4.1-nano")
 
-#
-report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
-diff = report.compare_with_baseline(model: "gpt-4.1-nano")
-
-diff.regressed? # => true
-diff.regressions # => [{case: "outage", baseline: {passed: true}, current: {passed: false}}]
-diff.score_delta # => -0.33
-```
-
-```ruby
-# CI: block merge if any previously-passing case now fails
+# In CI — block merge if anything regressed:
 expect(ClassifyTicket).to pass_eval("regression")
   .with_context(model: "gpt-4.1-nano")
   .without_regressions
 ```
 
-## Track quality over time
-
 ```ruby
-
-
-
-
-# View trend
-history = report.eval_history(model: "gpt-4.1-nano")
-history.score_trend # => :stable_or_improving | :declining
-history.drift? # => true (score dropped > 10%)
+diff = report.compare_with_baseline(model: "gpt-4.1-nano")
+diff.regressed?  # => true
+diff.regressions # => [{case: "outage", baseline: {passed: true}, current: {passed: false}}]
+diff.score_delta # => -0.33
 ```
 
-
+No more "it worked in the playground". Regressions are caught in CI, not production.
 
-
-# 4x faster with parallel execution
-report = ClassifyTicket.run_eval("regression",
-  context: { model: "gpt-4.1-nano" },
-  concurrency: 4)
-```
-
-## Prompt A/B testing
+## A/B test your prompts
 
-Changed a prompt? Compare old vs new with regression safety:
+Changed a prompt? Compare old vs new on the same dataset with regression safety:
 
 ```ruby
 diff = ClassifyTicketV2.compare_with(ClassifyTicketV1,
   eval: "regression", model: "gpt-4.1-mini")
 
-diff.safe_to_switch? # => true (no regressions
+diff.safe_to_switch? # => true (no regressions)
 diff.improvements    # => [{case: "outage", ...}]
 diff.score_delta     # => +0.33
 ```
 
-Requires `model:` or `context: { adapter: ... }`.
-`compare_with` ignores `sample_response`; without a real model/adapter both sides are skipped and the A/B result is not meaningful.
-
-CI gate:
 ```ruby
+# CI gate:
 expect(ClassifyTicketV2).to pass_eval("regression")
   .compared_with(ClassifyTicketV1)
   .with_minimum_score(0.8)
 ```
 
-##
+## Chain steps with fail-fast
 
-
+Pipeline stops at the first contract failure — no wasted tokens on downstream steps:
 
 ```ruby
-class
-
-
+class TicketPipeline < RubyLLM::Contract::Pipeline::Base
+  step ClassifyTicket, as: :classify
+  step RouteToTeam, as: :route
+  step DraftResponse, as: :draft
 end
 
-result =
-result.
-result.
+result = TicketPipeline.run("I was charged twice")
+result.outputs_by_step[:classify] # => {priority: "high", category: "billing"}
+result.trace.total_cost           # => $0.000128
 ```
 
-##
+## Gate merges on quality and cost
 
 ```ruby
-
-
-
-
-## Install
-
-```ruby
-gem "ruby_llm-contract"
-```
+# RSpec — block merge if accuracy drops or cost spikes
+expect(ClassifyTicket).to pass_eval("regression")
+  .with_minimum_score(0.8)
+  .with_maximum_cost(0.01)
 
-
-RubyLLM.
-
+# Rake — run all evals across all steps
+RubyLLM::Contract::RakeTask.new do |t|
+  t.minimum_score = 0.8
+  t.maximum_cost = 0.05
+end
+# bundle exec rake ruby_llm_contract:eval
 ```
 
-Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
-
 ## Docs
 
 | Guide | |
@@ -241,13 +271,13 @@ Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
 
 ## Roadmap
 
-**v0.
+**v0.6 (current):** "What should I do?" — `Step.recommend` returns optimal model, reasoning effort, and retry chain. Per-attempt `reasoning_effort` in retry policies.
 
-**v0.
+**v0.5:** Prompt A/B testing with `compare_with`. Soft observations with `observe`.
 
-**v0.
+**v0.4:** Eval history, batch concurrency, pipeline per-step eval, Minitest, structured logging.
 
-**v0.
+**v0.3:** Baseline regression detection, migration guide.
 
 ## License
````
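The escalation economics the README claims can be checked with quick arithmetic on the sample prices from its comparison table (illustrative numbers only; `expected_cost` is our sketch, not a gem API):

```ruby
# Expected cost per call under a nano → mini escalation chain:
# every call pays for nano, and only the failures also pay for mini.
def expected_cost(nano_pass_rate, nano: 0.0001, mini: 0.0004)
  nano + (1 - nano_pass_rate) * mini
end

always_full = 0.0021                 # flat gpt-4.1 price from the table
escalated   = expected_cost(0.67)    # nano passes 67% of the time

# Savings at 10,000 calls/month — the "save $X/mo" figure the README alludes to.
monthly_savings_at_10k = ((always_full - escalated) * 10_000).round(2)
```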
data/lib/ruby_llm/contract/concerns/eval_host.rb
CHANGED

```diff
@@ -70,18 +70,37 @@ module RubyLLM
         Eval::PromptDiff.new(candidate: my_report, baseline: other_report)
       end
 
-      def compare_models(eval_name, models
+      def compare_models(eval_name, models: [], candidates: [], context: {})
+        raise ArgumentError, "Pass either models: or candidates:, not both" if models.any? && candidates.any?
+
         context = safe_context(context)
-
-
-
-
+        candidate_configs = normalize_candidates(models, candidates)
+
+        reports = {}
+        configs = {}
+        candidate_configs.each do |config|
+          label = Eval::ModelComparison.candidate_label(config)
+          model_context = isolate_context(context).merge(model: config[:model])
+          model_context[:reasoning_effort] = config[:reasoning_effort] if config[:reasoning_effort]
+          reports[label] = run_single_eval(eval_name, model_context)
+          configs[label] = config
         end
-
+
+        Eval::ModelComparison.new(eval_name: eval_name, reports: reports, configs: configs)
       end
 
       private
 
+      def normalize_candidates(models, candidates)
+        if candidates.any?
+          candidates.map { |c| RubyLLM::Contract.normalize_candidate_config(c) }.uniq
+        elsif models.any?
+          models.uniq.map { |m| { model: m } }
+        else
+          raise ArgumentError, "Pass models: or candidates: with at least one entry"
+        end
+      end
+
       def comparison_context(context, model)
         base_context = safe_context(context)
         model ? base_context.merge(model: model) : base_context
```
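The normalization step in `compare_models` can be sketched standalone. Since `RubyLLM::Contract.normalize_candidate_config` is not shown in this chunk of the diff, a plausible stand-in is inlined below — an assumption, not the gem's implementation:

```ruby
# Sketch of the normalization compare_models performs: bare model names
# become { model: } hashes, candidate hashes keep model + reasoning_effort,
# and duplicates collapse via uniq.
def normalize_candidates(models, candidates)
  if candidates.any?
    # Stand-in for RubyLLM::Contract.normalize_candidate_config (assumed shape).
    candidates.map { |c| { model: c[:model], reasoning_effort: c[:reasoning_effort] }.compact }.uniq
  elsif models.any?
    models.uniq.map { |m| { model: m } }
  else
    raise ArgumentError, "Pass models: or candidates: with at least one entry"
  end
end

normalize_candidates(%w[nano nano mini], [])
# => [{ model: "nano" }, { model: "mini" }]
```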
data/lib/ruby_llm/contract/eval/model_comparison.rb
CHANGED

```diff
@@ -4,11 +4,17 @@ module RubyLLM
   module Contract
     module Eval
       class ModelComparison
-        attr_reader :eval_name, :reports
+        attr_reader :eval_name, :reports, :configs
 
-        def
+        def self.candidate_label(config)
+          effort = config[:reasoning_effort]
+          effort ? "#{config[:model]} (effort: #{effort})" : config[:model]
+        end
+
+        def initialize(eval_name:, reports:, configs: nil)
           @eval_name = eval_name
-          @reports = reports.dup.freeze
+          @reports = reports.dup.freeze
+          @configs = (configs || default_configs_from_reports).freeze
           freeze
         end
 
@@ -16,12 +22,12 @@ module RubyLLM
           @reports.keys
         end
 
-        def score_for(
-          @reports[
+        def score_for(candidate)
+          @reports[resolve_key(candidate)]&.score
         end
 
-        def cost_for(
-          @reports[
+        def cost_for(candidate)
+          @reports[resolve_key(candidate)]&.total_cost
         end
 
         def best_for(min_score: 0.0)
@@ -38,13 +44,14 @@ module RubyLLM
         end
 
         def table
-
-          lines
+          max_label = [@reports.keys.map(&:length).max || 0, 25].max
+          lines = [format(" %-#{max_label}s Score Cost Avg Latency", "Candidate")]
+          lines << " #{"-" * (max_label + 36)}"
 
-          @reports.each do |
+          @reports.each do |label, report|
             latency = report.avg_latency_ms ? "#{report.avg_latency_ms.round}ms" : "n/a"
             cost = report.total_cost.positive? ? "$#{format("%.4f", report.total_cost)}" : "n/a"
-            lines << format("
+            lines << format(" %-#{max_label}s %6.2f %10s %12s", label, report.score, cost, latency)
           end
 
           lines.join("\n")
@@ -70,10 +77,25 @@ module RubyLLM
             total_cost: report.total_cost,
             avg_latency_ms: report.avg_latency_ms,
             pass_rate: report.pass_rate,
+            pass_rate_ratio: report.pass_rate_ratio,
             passed: report.passed?
           }
         end
       end
+
+        private
+
+        def resolve_key(candidate)
+          case candidate
+          when String then candidate
+          when Hash then self.class.candidate_label(candidate)
+          else candidate.to_s
+          end
+        end
+
+        def default_configs_from_reports
+          @reports.each_with_object({}) { |(key, _), h| h[key] = { model: key } }
+        end
     end
   end
 end
```
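The labeling rule `candidate_label` applies is small enough to restate as a standalone function, taken directly from the method above:

```ruby
# A candidate's label is its model name, with the reasoning effort
# appended when one is configured — so "gpt-5-mini" at low and high
# effort show up as distinct rows in the comparison table.
def candidate_label(config)
  effort = config[:reasoning_effort]
  effort ? "#{config[:model]} (effort: #{effort})" : config[:model]
end

candidate_label({ model: "gpt-4.1-nano" })
# => "gpt-4.1-nano"
candidate_label({ model: "gpt-5-mini", reasoning_effort: "low" })
# => "gpt-5-mini (effort: low)"
```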
data/lib/ruby_llm/contract/eval/recommendation.rb
ADDED

```ruby
# frozen_string_literal: true

module RubyLLM
  module Contract
    module Eval
      class Recommendation
        include Concerns::DeepFreeze

        attr_reader :best, :retry_chain, :score, :cost_per_call,
                    :rationale, :current_config, :savings, :warnings

        def initialize(best:, retry_chain:, score:, cost_per_call:,
                       rationale:, current_config:, savings:, warnings:)
          @best = deep_dup_freeze(best)
          @retry_chain = deep_dup_freeze(retry_chain)
          @score = score
          @cost_per_call = cost_per_call
          @rationale = deep_dup_freeze(rationale)
          @current_config = deep_dup_freeze(current_config)
          @savings = deep_dup_freeze(savings)
          @warnings = deep_dup_freeze(warnings)
          freeze
        end

        def to_dsl
          return "# No recommendation — no candidate met the minimum score" if retry_chain.empty?

          if retry_chain.length == 1 && retry_chain.first.keys == [:model]
            "model \"#{retry_chain.first[:model]}\""
          elsif retry_chain.all? { |c| c.keys == [:model] }
            models_str = retry_chain.map { |c| c[:model] }.join(" ")
            "retry_policy models: %w[#{models_str}]"
          else
            args = retry_chain.map { |c| config_to_ruby(c) }.join(",\n ")
            "retry_policy do\n escalate(#{args})\nend"
          end
        end

        private

        def config_to_ruby(config)
          pairs = config.map { |k, v| "#{k}: #{v.inspect}" }.join(", ")
          "{ #{pairs} }"
        end
      end
    end
  end
end
```
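`to_dsl` picks one of three output shapes depending on the retry chain. The branching can be exercised with a standalone copy of the method (a sketch; only the whitespace inside the `escalate` branch differs cosmetically from the class above):

```ruby
# Standalone copy of Recommendation#to_dsl taking the chain as an argument:
# one plain model, a %w[] chain of plain models, or an escalate block when
# any config carries more than a model name (e.g. reasoning_effort).
def to_dsl(retry_chain)
  return "# No recommendation — no candidate met the minimum score" if retry_chain.empty?

  if retry_chain.length == 1 && retry_chain.first.keys == [:model]
    "model \"#{retry_chain.first[:model]}\""
  elsif retry_chain.all? { |c| c.keys == [:model] }
    "retry_policy models: %w[#{retry_chain.map { |c| c[:model] }.join(" ")}]"
  else
    args = retry_chain.map { |c| "{ #{c.map { |k, v| "#{k}: #{v.inspect}" }.join(", ")} }" }.join(",\n  ")
    "retry_policy do\n  escalate(#{args})\nend"
  end
end

to_dsl([{ model: "gpt-4.1-mini" }])
# => "model \"gpt-4.1-mini\""
to_dsl([{ model: "gpt-4.1-nano" }, { model: "gpt-4.1-mini" }])
# => "retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]"
```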
@@ -0,0 +1,132 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module RubyLLM
|
|
4
|
+
module Contract
|
|
5
|
+
module Eval
|
|
6
|
+
class Recommender
|
|
7
|
+
def initialize(comparison:, min_score:, min_first_try_pass_rate: 0.8, current_config: nil)
|
|
8
|
+
@comparison = comparison
|
|
9
|
+
@min_score = min_score
|
|
10
|
+
@min_first_try_pass_rate = min_first_try_pass_rate
|
|
11
|
+
@current_config = current_config
|
|
12
|
+
end
|
|
13
|
+
|
|
14
|
+
def recommend
|
|
15
|
+
scored = build_scored_candidates
|
|
16
|
+
best = select_best(scored)
|
|
17
|
+
chain = build_retry_chain(scored, best)
|
|
18
|
+
rationale = build_rationale(scored, best)
|
|
19
|
+
warnings = build_warnings(scored)
|
|
20
|
+
savings = best ? calculate_savings(best) : {}
|
|
21
|
+
|
|
22
|
+
Recommendation.new(
|
|
23
|
+
best: best&.dig(:config),
|
|
24
|
+
retry_chain: chain,
|
|
25
|
+
score: best&.dig(:score) || 0.0,
|
|
26
|
+
cost_per_call: best&.dig(:cost_per_call) || 0.0,
|
|
27
|
+
rationale: rationale,
|
|
28
|
+
current_config: @current_config,
|
|
29
|
+
savings: savings,
|
|
30
|
+
warnings: warnings
|
|
31
|
+
)
|
|
32
|
+
end
|
|
33
|
+
|
|
34
|
+
private
|
|
35
|
+
|
|
36
|
+
def build_scored_candidates
|
|
37
|
+
@comparison.configs.filter_map do |label, config|
|
|
38
|
+
report = @comparison.reports[label]
|
|
39
|
+
next nil unless report
|
|
40
|
+
|
|
41
|
+
evaluated_count = report.results.count { |r| r.step_status != :skipped }
|
|
42
|
+
cases_count = [evaluated_count, 1].max
|
|
43
|
+
cost_per_call = report.total_cost.to_f / cases_count
|
|
44
|
+
|
|
45
|
+
{
|
|
46
|
+
label: label,
|
|
47
|
+
config: config,
|
|
48
|
+
score: report.score,
|
|
49
|
+
cost_per_call: cost_per_call,
|
|
50
|
+
latency: report.avg_latency_ms || Float::INFINITY,
|
|
51
|
+
pass_rate_ratio: report.pass_rate_ratio,
|
|
52
|
+
total_cost: report.total_cost
|
|
53
|
+
}
|
|
54
|
+
end
|
|
55
|
+
end
|
|
56
|
+
|
|
57
|
+
def select_best(scored)
|
|
58
|
+
eligible = scored.select { |s| s[:score] >= @min_score && cost_known?(s) }
|
|
59
|
+
eligible.min_by { |s| [s[:cost_per_call], s[:latency], s[:label]] }
|
|
60
|
+
end
|
|
61
|
+
|
|
62
|
+
def build_retry_chain(scored, best)
|
|
63
|
+
return [] unless best
|
|
64
|
+
|
|
65
|
+
first_try = scored
|
|
66
|
+
.select { |s| s[:pass_rate_ratio] >= @min_first_try_pass_rate && cost_known?(s) }
|
|
67
|
+
.min_by { |s| [s[:cost_per_call], s[:latency], s[:label]] }
|
|
68
|
+
|
|
69
|
+
if first_try.nil? || first_try[:label] == best[:label]
|
|
70
|
+
[best[:config]]
|
|
71
|
+
else
|
|
72
|
+
[first_try[:config], best[:config]]
|
|
73
|
+
end
|
|
74
|
+
end
|
|
75
|
+
|
|
76
|
+
+      def build_rationale(scored, best)
+        sorted = scored.sort_by { |s| [cost_known?(s) ? 0 : 1, s[:cost_per_call], s[:latency], s[:label]] }
+        sorted.map { |s| rationale_line(s, best) }
+      end
+
+      def rationale_line(candidate, best)
+        cost_str = cost_known?(candidate) ? "$#{format("%.4f", candidate[:cost_per_call])}/call" : "unknown pricing"
+        header = "#{candidate[:label]}, score #{format("%.2f", candidate[:score])}, at #{cost_str}"
+        notes = rationale_notes(candidate, best)
+        notes.any? ? "#{header} — #{notes.join(", ")}" : header
+      end
+
+      def rationale_notes(candidate, best)
+        notes = []
+        pass_pct = (candidate[:pass_rate_ratio] * 100).round
+        below_threshold = candidate[:score] < @min_score
+
+        if below_threshold && candidate[:pass_rate_ratio] >= @min_first_try_pass_rate
+          notes << "below #{@min_score} threshold, but good first-try (#{pass_pct}% pass rate)"
+        elsif below_threshold
+          notes << "below #{@min_score} threshold"
+        elsif candidate[:pass_rate_ratio] < 1.0
+          notes << "#{pass_pct}% pass rate"
+        end
+        notes << "recommended" if best && candidate[:label] == best[:label]
+        notes << "unknown pricing" unless cost_known?(candidate)
+        notes
+      end
+
+      def build_warnings(scored)
+        scored.reject { |s| cost_known?(s) }
+              .map { |s| "#{s[:label]}: unknown pricing — cost ranking may be inaccurate" }
+      end
+
+      def calculate_savings(best)
+        return {} unless @current_config
+
+        current_label = ModelComparison.candidate_label(@current_config)
+        current_report = @comparison.reports[current_label]
+        return {} unless current_report
+
+        current_evaluated = current_report.results.count { |r| r.step_status != :skipped }
+        current_cases = [current_evaluated, 1].max
+        current_cost = current_report.total_cost.to_f / current_cases
+        diff = current_cost - best[:cost_per_call]
+        return {} unless diff.positive?
+
+        { per_call: diff.round(6), monthly_at: { 10_000 => (diff * 10_000).round(2) } }
+      end
+
+      def cost_known?(scored_candidate)
+        scored_candidate[:cost_per_call]&.positive?
+      end
+    end
+  end
+end
+end
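The rationale ordering above relies on Ruby's array sort keys: a 0/1 flag puts candidates with known pricing first, then cost per call, latency, and label break ties. A minimal standalone sketch of that comparator, with made-up candidate data (not the gem's real structures):

```ruby
# Candidates with known pricing sort first (flag 0 before 1), then by cost
# per call, latency, and label. Data below is illustrative only.
candidates = [
  { label: "model-c", cost_per_call: nil,    latency: 120 },
  { label: "model-a", cost_per_call: 0.0004, latency: 300 },
  { label: "model-b", cost_per_call: 0.0002, latency: 250 }
]

cost_known = ->(c) { !c[:cost_per_call].nil? && c[:cost_per_call].positive? }

sorted = candidates.sort_by do |c|
  # nil cost maps to 0.0 so every tuple position stays comparable
  [cost_known.call(c) ? 0 : 1, c[:cost_per_call] || 0.0, c[:latency], c[:label]]
end
```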
data/lib/ruby_llm/contract/eval/report.rb
CHANGED
@@ -14,8 +14,8 @@ module RubyLLM
       HISTORY_DIR = ".eval_history"
       BASELINE_DIR = ".eval_baselines"
 
-      def_delegators :@stats, :score, :passed, :failed, :skipped, :failures, :pass_rate, :
-                     :passed?
+      def_delegators :@stats, :score, :passed, :failed, :skipped, :failures, :pass_rate, :pass_rate_ratio,
+                     :total_cost, :avg_latency_ms, :passed?
       def_delegators :@presenter, :summary, :to_s, :print_summary
       def_delegators :@storage, :save_history!, :eval_history, :save_baseline!, :compare_with_baseline,
                      :baseline_exists?
data/lib/ruby_llm/contract/eval/report_stats.rb
CHANGED
@@ -35,6 +35,12 @@ module RubyLLM
         "#{passed}/#{evaluated_results.length}"
       end
 
+      def pass_rate_ratio
+        return 0.0 if evaluated_results.empty?
+
+        passed.to_f / evaluated_results.length
+      end
+
       def total_cost
         @results.sum { |result| result.cost || 0.0 }
       end
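`pass_rate_ratio` is the numeric counterpart of the `"3/4"`-style `pass_rate` string: the fraction of evaluated (non-skipped) cases that passed, with an empty-set guard. A minimal standalone sketch of that computation:

```ruby
# Fraction of non-skipped results that passed; 0.0 when nothing was evaluated.
results = %i[passed passed failed skipped]
evaluated = results.reject { |r| r == :skipped }
ratio = evaluated.empty? ? 0.0 : evaluated.count(:passed).to_f / evaluated.length
```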
data/lib/ruby_llm/contract/eval/report_storage.rb
CHANGED
@@ -13,25 +13,29 @@ module RubyLLM
         @stats = stats
       end
 
-      def save_history!(path: nil, model: nil)
-        file = path || storage_path(Report::HISTORY_DIR, "jsonl", model: model)
-
+      def save_history!(path: nil, model: nil, reasoning_effort: nil)
+        file = path || storage_path(Report::HISTORY_DIR, "jsonl", model: model, reasoning_effort: reasoning_effort)
+        entry = history_entry
+        entry[:model] = model if model
+        entry[:reasoning_effort] = reasoning_effort if reasoning_effort
+        EvalHistory.append(file, entry)
         file
       end
 
-      def eval_history(path: nil, model: nil)
-        EvalHistory.load(path || storage_path(Report::HISTORY_DIR, "jsonl", model: model
+      def eval_history(path: nil, model: nil, reasoning_effort: nil)
+        EvalHistory.load(path || storage_path(Report::HISTORY_DIR, "jsonl", model: model,
+                                              reasoning_effort: reasoning_effort))
       end
 
-      def save_baseline!(path: nil, model: nil)
-        file = path || storage_path(Report::BASELINE_DIR, "json", model: model)
+      def save_baseline!(path: nil, model: nil, reasoning_effort: nil)
+        file = path || storage_path(Report::BASELINE_DIR, "json", model: model, reasoning_effort: reasoning_effort)
         FileUtils.mkdir_p(File.dirname(file))
         File.write(file, JSON.pretty_generate(serialize_for_baseline))
         file
       end
 
-      def compare_with_baseline(path: nil, model: nil)
-        file = path || storage_path(Report::BASELINE_DIR, "json", model: model)
+      def compare_with_baseline(path: nil, model: nil, reasoning_effort: nil)
+        file = path || storage_path(Report::BASELINE_DIR, "json", model: model, reasoning_effort: reasoning_effort)
         raise ArgumentError, "No baseline found at #{file}" unless File.exist?(file)
 
         baseline_data = JSON.parse(File.read(file), symbolize_names: true)
@@ -43,8 +47,8 @@ module RubyLLM
         )
       end
 
-      def baseline_exists?(path: nil, model: nil)
-        File.exist?(path || storage_path(Report::BASELINE_DIR, "json", model: model))
+      def baseline_exists?(path: nil, model: nil, reasoning_effort: nil)
+        File.exist?(path || storage_path(Report::BASELINE_DIR, "json", model: model, reasoning_effort: reasoning_effort))
       end
 
       private
@@ -55,6 +59,7 @@ module RubyLLM
           score: @stats.score,
           total_cost: @stats.total_cost,
           pass_rate: @stats.pass_rate,
+          pass_rate_ratio: @stats.pass_rate_ratio,
           cases_count: @stats.evaluated_results_count
         }
       end
@@ -79,12 +84,13 @@ module RubyLLM
         }
       end
 
-      def storage_path(root_dir, extension, model:)
+      def storage_path(root_dir, extension, model:, reasoning_effort: nil)
         parts = [root_dir]
         parts << sanitize_name(@report.step_name) if @report.step_name
 
         dataset_name = sanitize_name(@report.dataset_name)
         dataset_name = "#{dataset_name}_#{sanitize_name(model)}" if model
+        dataset_name = "#{dataset_name}_effort_#{sanitize_name(reasoning_effort)}" if reasoning_effort
 
         File.join(*parts, "#{dataset_name}.#{extension}")
       end
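The new `reasoning_effort` suffix extends the existing file-naming scheme, so histories and baselines for the same model at different effort levels land in separate files. A simplified sketch of the naming logic (sanitization and the step-name subdirectory are omitted; names are illustrative):

```ruby
# Builds e.g. ".eval_baselines/extract_gpt-x_effort_low.json".
def storage_path(root_dir, dataset_name, extension, model: nil, reasoning_effort: nil)
  name = dataset_name
  name = "#{name}_#{model}" if model
  name = "#{name}_effort_#{reasoning_effort}" if reasoning_effort
  File.join(root_dir, "#{name}.#{extension}")
end
```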
data/lib/ruby_llm/contract/concerns/eval_host.rb
CHANGED
@@ -49,6 +49,16 @@ module RubyLLM
         end
       end
 
+      def recommend(eval_name, candidates:, min_score: 0.95, min_first_try_pass_rate: 0.8, context: {})
+        comparison = compare_models(eval_name, candidates: candidates, context: context)
+        Eval::Recommender.new(
+          comparison: comparison,
+          min_score: min_score,
+          min_first_try_pass_rate: min_first_try_pass_rate,
+          current_config: current_model_config
+        ).recommend
+      end
+
       KNOWN_CONTEXT_KEYS = %i[adapter model temperature max_tokens provider assume_model_exists reasoning_effort].freeze
 
       include Concerns::ContextHelpers
@@ -144,6 +154,17 @@ module RubyLLM
         }
       end
 
+      def current_model_config
+        policy = retry_policy
+        if policy && policy.config_list.any?
+          policy.config_list.first
+        elsif respond_to?(:model) && model
+          { model: model }
+        elsif RubyLLM::Contract.configuration.default_model
+          { model: RubyLLM::Contract.configuration.default_model }
+        end
+      end
+
       def resolve_adapter(context)
         adapter = context[:adapter] || RubyLLM::Contract.configuration.default_adapter
         return adapter if adapter
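`current_model_config` answers "what am I running today?" for the savings comparison by walking a fallback chain: the first escalation config, then the step's own model, then the global default. A standalone sketch of that chain, with plain variables standing in for the real policy, step, and configuration objects:

```ruby
# Illustrative stand-ins for the real objects consulted by the fallback chain.
escalation_configs = []            # retry policy's config_list (empty here)
step_model         = nil           # no per-step model declared
default_model      = "base-model"  # global default_model

current_config =
  if escalation_configs.any?
    escalation_configs.first
  elsif step_model
    { model: step_model }
  elsif default_model
    { model: default_model }
  end
```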
data/lib/ruby_llm/contract/step/retry_executor.rb
CHANGED
@@ -10,12 +10,16 @@ module RubyLLM
 
       def run_with_retry(input, adapter:, default_model:, policy:, context_temperature: nil, extra_options: {})
         all_attempts = []
+        default_config = { model: default_model }.merge(extra_options.slice(:reasoning_effort).compact)
 
         policy.max_attempts.times do |attempt_index|
-
+          config = policy.config_for_attempt(attempt_index, default_config)
+          model = config[:model]
+          attempt_extra = extra_options.merge(config.except(:model))
+
           result = run_once(input, adapter: adapter, model: model,
-                            context_temperature: context_temperature, extra_options:
-          all_attempts << { attempt: attempt_index + 1, model: model, result: result }
+                            context_temperature: context_temperature, extra_options: attempt_extra)
+          all_attempts << { attempt: attempt_index + 1, model: model, config: config, result: result }
           break unless policy.retryable?(result)
         end
 
@@ -43,6 +47,8 @@ module RubyLLM
       def build_attempt_entry(attempt)
         trace = attempt[:result].trace
         entry = { attempt: attempt[:attempt], model: attempt[:model], status: attempt[:result].status }
+        config = attempt[:config]
+        entry[:config] = config if config && config.keys != [:model]
         append_trace_fields(entry, trace)
       end
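Each attempt's config is split in two: `:model` picks the model for the call, and every other key (such as `:reasoning_effort`) is merged over the caller's extra options so the escalation entry wins. A standalone sketch of that split (requires Ruby 3.0+ for `Hash#except`; values are illustrative):

```ruby
# Caller-level defaults and one escalation-ladder entry (made-up values).
extra_options = { temperature: 0.2, reasoning_effort: "low" }
config = { model: "bigger-model", reasoning_effort: "high" }

model = config[:model]
# Everything except :model rides along as per-attempt request options,
# overriding the caller's defaults on key collisions.
attempt_extra = extra_options.merge(config.except(:model))
```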
data/lib/ruby_llm/contract/step/retry_policy.rb
CHANGED
@@ -9,13 +9,13 @@ module RubyLLM
       DEFAULT_RETRY_ON = %i[validation_failed parse_error adapter_error].freeze
 
       def initialize(models: nil, attempts: nil, retry_on: nil, &block)
-        @
+        @configs = []
         @retryable_statuses = DEFAULT_RETRY_ON.dup
 
         if block
           @max_attempts = 1
           instance_eval(&block)
-          warn_no_retry! if @max_attempts == 1 && @
+          warn_no_retry! if @max_attempts == 1 && @configs.empty?
         else
           apply_keywords(models: models, attempts: attempts, retry_on: retry_on)
         end
@@ -28,14 +28,18 @@ module RubyLLM
         validate_max_attempts!
       end
 
-      def escalate(*
-        @
-        @max_attempts = @
+      def escalate(*config_list)
+        @configs = config_list.flatten.map { |c| normalize_config(c).freeze }.freeze
+        @max_attempts = @configs.length if @max_attempts < @configs.length
       end
       alias models escalate
 
       def model_list
-        @
+        @configs.map { |c| c[:model] }.freeze
+      end
+
+      def config_list
+        @configs
       end
 
       def retry_on(*statuses)
@@ -46,29 +50,38 @@ module RubyLLM
         retryable_statuses.include?(result.status)
       end
 
-      def
-        if @
-          @
+      def config_for_attempt(attempt, default_config)
+        if @configs.any?
+          @configs[attempt] || @configs.last
         else
-
+          default_config
         end
       end
 
+      def model_for_attempt(attempt, default_model)
+        config_for_attempt(attempt, { model: default_model })[:model]
+      end
+
       private
 
       def apply_keywords(models:, attempts:, retry_on:)
         if models
-          @
-          @max_attempts = @
+          @configs = Array(models).map { |m| normalize_config(m).freeze }.freeze
+          @max_attempts = @configs.length
         else
           @max_attempts = attempts || 1
         end
         @retryable_statuses = Array(retry_on).dup if retry_on
       end
 
+      def normalize_config(entry)
+        RubyLLM::Contract.normalize_candidate_config(entry)
+      end
+
       def warn_no_retry!
-        warn "[ruby_llm-contract] retry_policy has max_attempts=1 with no
-          "This means no actual retry will happen. Add `attempts 2` or
+        warn "[ruby_llm-contract] retry_policy has max_attempts=1 with no configs. " \
+             "This means no actual retry will happen. Add `attempts 2` or " \
+             '`escalate "model1", "model2"`.'
       end
 
       def validate_max_attempts!
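The escalation ladder indexes configs by attempt number and clamps to the last entry, so attempts beyond the list keep retrying the strongest configuration. A sketch of that lookup outside the class, with made-up model names:

```ruby
# Two-rung escalation ladder (illustrative entries).
configs = [
  { model: "cheap-model" },
  { model: "better-model", reasoning_effort: "high" }
]

# Attempt 0 takes the first rung; out-of-range attempts reuse the last rung;
# an empty ladder falls back to the default config.
config_for_attempt = lambda do |attempt, default_config|
  configs.any? ? (configs[attempt] || configs.last) : default_config
end

first_try = config_for_attempt.call(0, { model: "default" })
overflow  = config_for_attempt.call(5, { model: "default" })
```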
data/lib/ruby_llm/contract.rb
CHANGED
@@ -78,6 +78,27 @@ module RubyLLM
       Thread.current[:ruby_llm_contract_reloading] = false
     end
 
+    def normalize_candidate_config(entry)
+      case entry
+      when String
+        raise ArgumentError, "Candidate model must be a non-empty String" if entry.strip.empty?
+
+        { model: entry.strip }
+      when Hash
+        model = entry[:model] || entry["model"]
+        unless model.is_a?(String) && !model.strip.empty?
+          raise ArgumentError, "Candidate config must include a non-empty String :model"
+        end
+
+        normalized = { model: model.strip }
+        effort = entry[:reasoning_effort] || entry["reasoning_effort"]
+        normalized[:reasoning_effort] = effort if effort
+        normalized
+      else
+        raise ArgumentError, "Expected String or Hash, got #{entry.class}"
+      end
+    end
+
     private
 
     # Filter stale hosts, deduplicate by name (last wins), prune registry in-place
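`normalize_candidate_config` gives every candidate a single canonical shape: a bare string becomes `{ model: ... }`, a hash keeps `:model` plus an optional `:reasoning_effort` (string or symbol keys accepted), and anything else raises. A self-contained re-sketch of that contract under a hypothetical method name, for illustration only:

```ruby
# Hypothetical standalone version of the normalization contract above.
def normalize_candidate(entry)
  case entry
  when String
    raise ArgumentError, "empty model" if entry.strip.empty?
    { model: entry.strip }
  when Hash
    model = entry[:model] || entry["model"]
    raise ArgumentError, "missing :model" unless model.is_a?(String) && !model.strip.empty?
    out = { model: model.strip }
    effort = entry[:reasoning_effort] || entry["reasoning_effort"]
    out[:reasoning_effort] = effort if effort
    out
  else
    raise ArgumentError, "Expected String or Hash, got #{entry.class}"
  end
end
```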
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: ruby_llm-contract
 version: !ruby/object:Gem::Version
-  version: 0.5.2
+  version: 0.6.0
 platform: ruby
 authors:
 - Justyna
@@ -130,6 +130,8 @@ files:
 - lib/ruby_llm/contract/eval/prompt_diff_comparator.rb
 - lib/ruby_llm/contract/eval/prompt_diff_presenter.rb
 - lib/ruby_llm/contract/eval/prompt_diff_serializer.rb
+- lib/ruby_llm/contract/eval/recommendation.rb
+- lib/ruby_llm/contract/eval/recommender.rb
 - lib/ruby_llm/contract/eval/report.rb
 - lib/ruby_llm/contract/eval/report_presenter.rb
 - lib/ruby_llm/contract/eval/report_stats.rb