ruby_llm-contract 0.5.0 → 0.6.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +31 -2
- data/Gemfile.lock +2 -2
- data/README.md +173 -130
- data/lib/ruby_llm/contract/adapters/ruby_llm.rb +4 -1
- data/lib/ruby_llm/contract/concerns/eval_host.rb +25 -6
- data/lib/ruby_llm/contract/eval/model_comparison.rb +33 -11
- data/lib/ruby_llm/contract/eval/recommendation.rb +48 -0
- data/lib/ruby_llm/contract/eval/recommender.rb +132 -0
- data/lib/ruby_llm/contract/eval/report.rb +2 -2
- data/lib/ruby_llm/contract/eval/report_stats.rb +6 -0
- data/lib/ruby_llm/contract/eval/report_storage.rb +18 -12
- data/lib/ruby_llm/contract/eval.rb +2 -0
- data/lib/ruby_llm/contract/step/base.rb +23 -2
- data/lib/ruby_llm/contract/step/retry_executor.rb +9 -3
- data/lib/ruby_llm/contract/step/retry_policy.rb +27 -14
- data/lib/ruby_llm/contract/version.rb +1 -1
- data/lib/ruby_llm/contract.rb +21 -0
- metadata +3 -1
checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c86dcf06a34e5ff934367708213a0852191b0be7bd61e61e8577161e32ccf807
+  data.tar.gz: 28a44dd4f4d4c0f74c5da7800d9fa1fd2b8e51242a056f5aaae5f6f9fcf1be69
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c2b6dc71a02519288ae4ee4f74f17d74e6c45d01e888036d18e4c821b8fc63ba43ac0ccee6b2948163597b19c93008a6e5e0375fd65c8f3ad8fcf1285f356e91
+  data.tar.gz: f08576e520ec7397c4b233c5907ce703890cf4616e969e25f4c980ac1b06013a62ff680b3c90878d76fd1da0980f04d4946a2fd11a8e304e43f5155e5c855e8e
data/CHANGELOG.md CHANGED

@@ -1,8 +1,37 @@
 # Changelog
 
+## 0.6.0 (2026-04-12)
+
+"What should I do?" — model + configuration recommendation.
+
+### Features
+
+- **`Step.recommend`** — `ClassifyTicket.recommend("eval", candidates: [...], min_score: 0.95)` runs eval on all candidates and returns a `Recommendation` with optimal model, retry chain, rationale, savings vs current config, and `to_dsl` code output.
+- **Candidates as configurations** — `candidates:` accepts `{ model:, reasoning_effort: }` hashes, not just model name strings. `gpt-5-mini` with `reasoning_effort: "low"` is a different candidate than with `"high"`.
+- **`compare_models` extended** — new `candidates:` parameter alongside existing `models:` (backward compatible). Candidate labels include reasoning effort in output table.
+- **Per-attempt `reasoning_effort` in retry policies** — `escalate` accepts config hashes: `escalate({ model: "gpt-4.1-nano" }, { model: "gpt-5-mini", reasoning_effort: "high" })`. Each attempt gets its own reasoning_effort forwarded to the provider.
+- **`pass_rate_ratio`** — numeric float (0.0–1.0) on `Report` and `ReportStats`, complementing the string `pass_rate` (`"3/5"`).
+- **History entries enriched** — `save_history!` accepts `reasoning_effort:` and stores `model`, `reasoning_effort`, `pass_rate_ratio` in JSONL entries.
+
+### Game changer continuity
+
+```
+v0.2: "Which model?"            → compare_models (snapshot)
+v0.3: "Did it change?"          → baseline regression (binary)
+v0.4: "Show me the trend"       → eval history (time series)
+v0.5: "Which prompt is better?" → compare_with (A/B testing)
+v0.6: "What should I do?"       → recommend (actionable advice)
+```
+
+## 0.5.2 (2026-04-06)
+
+### Features
+
+- **`reasoning_effort` forwarded to provider** — `context: { reasoning_effort: "low" }` now passed through `with_params` to the LLM. Previously accepted as a known context key but silently ignored by the RubyLLM adapter.
+
 ## 0.5.0 (2026-03-25)
 
-Data-Driven Prompt Engineering
+Data-Driven Prompt Engineering.
 
 ### Features
 
@@ -49,7 +78,7 @@ Audit hardening — 18 bugs fixed across 4 audit rounds.
 
 ## 0.4.3 (2026-03-24)
 
-Production feedback release
+Production feedback release.
 
 ### Features
 
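The changelog above notes that the new `pass_rate_ratio` float complements the existing string `pass_rate` (`"3/5"`). A minimal standalone sketch of why the numeric form is easier to gate on; the helper name is hypothetical, not the gem's API:

```ruby
# Hypothetical helper, not from the gem: derive a 0.0-1.0 ratio from the
# string form of pass_rate ("3/5") so CI can gate on a numeric threshold.
def ratio_from_pass_rate(pass_rate)
  passed, total = pass_rate.split("/").map(&:to_f)
  return 0.0 if total.nil? || total.zero?
  passed / total
end

ratio_from_pass_rate("3/5")           # => 0.6
ratio_from_pass_rate("3/5") >= 0.95   # => false (would block the merge)
```

In 0.6.0 the gem exposes this value directly on `Report` and `ReportStats`, so no string parsing is needed.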
data/Gemfile.lock CHANGED

@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    ruby_llm-contract (0.
+    ruby_llm-contract (0.6.0)
       dry-types (~> 1.7)
       ruby_llm (~> 1.0)
       ruby_llm-schema (~> 0.3)
@@ -258,7 +258,7 @@ CHECKSUMS
   rubocop-ast (1.49.1) sha256=4412f3ee70f6fe4546cc489548e0f6fcf76cafcfa80fa03af67098ffed755035
   ruby-progressbar (1.13.0) sha256=80fc9c47a9b640d6834e0dc7b3c94c9df37f08cb072b7761e4a71e22cff29b33
   ruby_llm (1.14.0) sha256=57c6f7034fc4a44504ea137d70f853b07824f1c1cdbe774ab3ab3522e7098deb
-  ruby_llm-contract (0.
+  ruby_llm-contract (0.6.0)
   ruby_llm-schema (0.3.0) sha256=a591edc5ca1b7f0304f0e2261de61ba4b3bea17be09f5cf7558153adfda3dec6
   ruby_parser (3.22.0) sha256=1eb4937cd9eb220aa2d194e352a24dba90aef00751e24c8dfffdb14000f15d23
   rubycritic (4.12.0) sha256=024fed90fe656fa939f6ea80aab17569699ac3863d0b52fd72cb99892247abc8
data/README.md CHANGED

@@ -1,24 +1,89 @@
 # ruby_llm-contract
 
-
+The missing link between LLM cost and quality. Stop choosing between "cheap but wrong" and "accurate but expensive" — get both. Contracts, model escalation, and data-driven recommendations for [ruby_llm](https://github.com/crmne/ruby_llm).
 
-
+```
+YOU WRITE                     THE GEM HANDLES                  YOU GET
+─────────                     ───────────────                  ───────
+
+validate { |o| ... }          catch bad answers — combined     Zero garbage
+                              with retry_policy, auto-retry    in production
+
+retry_policy                  start cheap, escalate only       Pay for the cheapest
+  models: %w[nano mini full]  when validation fails            model that works
+
+max_cost 0.01                 estimate tokens, check price,    No surprise bills
+                              refuse before calling LLM
+
+output_schema { ... }         send JSON schema to provider,    Zero parsing code
+                              validate response client-side
+
+define_eval { ... }           test cases + baselines,          Regressions caught
+                              run in CI with real LLM          before deploy
+
+recommend(candidates: [...])  evaluate all configs, pick       Optimal model +
+                              cheapest that passes             retry chain
+```
 
-##
+## Before and after
 
-
+```
+┌─────────────────────────────────────────────────────────────────┐
+│ BEFORE: pick one model, hope for the best                       │
+│                                                                 │
+│   expensive model → accurate, but you overpay on every call     │
+│   cheap model     → fast, but wrong answers slip to production  │
+│   prompt change   → "looks good to me" → deploy → users suffer  │
+└─────────────────────────────────────────────────────────────────┘
+
+                    ⬇  add ruby_llm-contract
+
+┌─────────────────────────────────────────────────────────────────┐
+│ YOU DEFINE A CONTRACT                                           │
+│                                                                 │
+│   output_schema { string :priority }       ← valid structure    │
+│   validate("valid priority") { |o| ... }   ← business rules     │
+│   retry_policy models: %w[nano mini full]  ← escalation chain   │
+│   max_cost 0.01                            ← budget cap         │
+└───────────────────────────┬─────────────────────────────────────┘
+                            │
+                            ▼
+┌─────────────────────────────────────────────────────────────────┐
+│ THE GEM HANDLES THE REST                                        │
+│                                                                 │
+│   request ──→ ┌──────┐    ┌──────────┐                          │
+│               │ nano │─→  │ contract │──→ ✓ pass → done         │
+│               └──────┘    └────┬─────┘                          │
+│                                │ ✗ fail                         │
+│                                ▼                                │
+│               ┌──────┐    ┌──────────┐                          │
+│               │ mini │─→  │ contract │──→ ✓ pass → done         │
+│               └──────┘    └────┬─────┘                          │
+│                                │ ✗ fail                         │
+│                                ▼                                │
+│               ┌──────┐    ┌──────────┐                          │
+│               │ full │─→  │ contract │──→ ✓ pass → done         │
+│               └──────┘    └──────────┘                          │
+└───────────────────────────┬─────────────────────────────────────┘
+                            │
+                            ▼
+┌─────────────────────────────────────────────────────────────────┐
+│ YOU GET                                                         │
+│                                                                 │
+│   ✓ Valid output guaranteed — schema + business rules enforced  │
+│   ✓ Cheapest model that works — most requests stay on nano      │
+│   ✓ Cost, latency, tokens — tracked on every call               │
+│   ✓ Eval scores per model — data instead of gut feeling         │
+│   ✓ Regressions caught — before deploy, not after               │
+│   ✓ Recommendation — "use nano+mini, drop full, save $X/mo"     │
+└─────────────────────────────────────────────────────────────────┘
+```
 
-##
+## 30-second version
 
 ```ruby
 class ClassifyTicket < RubyLLM::Contract::Step::Base
-  prompt
-  system "You are a support ticket classifier."
-  rule "Return valid JSON only, no markdown."
-  rule "Use exactly one priority: low, medium, high, urgent."
-  example input: "My invoice is wrong", output: '{"priority": "high"}'
-  user "{input}"
-  end
+  prompt "Classify this support ticket by priority and category.\n\n{input}"
 
   output_schema do
     string :priority, enum: %w[low medium high urgent]
@@ -30,17 +95,45 @@ class ClassifyTicket < RubyLLM::Contract::Step::Base
 end
 
 result = ClassifyTicket.run("I was charged twice")
-result.
-result.
-result.trace[:cost]
-
+result.parsed_output  # => {priority: "high", category: "billing"}
+result.trace[:model]  # => "gpt-4.1-nano" (first model that passed)
+result.trace[:cost]   # => 0.000032
+```
+
+Bad JSON? Retried automatically. Wrong answer? Escalated to a smarter model. Schema violated? Caught client-side. The contract guarantees every response meets your rules — you pay for the cheapest model that passes.
+
+## Install
+
+```ruby
+gem "ruby_llm-contract"
+```
+
+```ruby
+RubyLLM.configure { |c| c.openai_api_key = ENV["OPENAI_API_KEY"] }
+RubyLLM::Contract.configure { |c| c.default_model = "gpt-4.1-mini" }
+```
+
+Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
+
+## Save money with model escalation
+
+Without a contract, you use gpt-4.1 for everything because you can't tell when a cheaper model gets it wrong. With a contract, you start on nano and only escalate when the answer fails the contract:
+
+```ruby
+retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
 ```
 
-
+```
+Attempt 1: gpt-4.1-nano → contract failed  ($0.0001)
+Attempt 2: gpt-4.1-mini → contract passed  ($0.0004)
+           gpt-4.1      → never called     ($0.00)
+```
+
+Most requests succeed on the cheapest model. You pay full price only for the ones that need it. How many? Run `compare_models` and find out.
 
-##
+## Know which model to use — with data
 
-Define test cases
+Don't guess. Define test cases, compare models, get numbers:
 
 ```ruby
 ClassifyTicket.define_eval("regression") do
@@ -50,176 +143,126 @@ ClassifyTicket.define_eval("regression") do
 end
 
 comparison = ClassifyTicket.compare_models("regression",
-  models: %w[gpt-4.1-nano gpt-4.1-mini])
+  models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1])
 ```
 
-Real output from real API calls:
-
 ```
-
+Candidate        Score   Cost      Avg Latency
 ---------------------------------------------------------
-gpt-4.1-nano     0.67    $0.
-gpt-4.1-mini     1.00    $0.
+gpt-4.1-nano     0.67    $0.0001   48ms
+gpt-4.1-mini     1.00    $0.0004   92ms
+gpt-4.1          1.00    $0.0021   210ms
 
 Cheapest at 100%: gpt-4.1-mini
 ```
 
-
-comparison.best_for(min_score: 0.95)  # => "gpt-4.1-mini"
-
-# Inspect failures
-comparison.reports["gpt-4.1-nano"].failures.each do |f|
-  puts "#{f.name}: expected #{f.expected}, got #{f.output}"
-  puts "  mismatches: #{f.mismatches}"
-  # => outage: expected {priority: "urgent"}, got {priority: "high"}
-  #    mismatches: {priority: {expected: "urgent", got: "high"}}
-end
-```
+Nano fails on edge cases. Mini and full both score 100% — but mini is **5x cheaper**. Now you know.
 
-##
+## Let the gem tell you what to do
 
-
+Don't read tables — get a recommendation. Supports `model + reasoning_effort` combinations:
 
 ```ruby
-
-
-
-
-
-
-
-
-
-
+rec = ClassifyTicket.recommend("regression",
+  candidates: [
+    { model: "gpt-4.1-nano" },
+    { model: "gpt-4.1-mini" },
+    { model: "gpt-5-mini", reasoning_effort: "low" },
+    { model: "gpt-5-mini", reasoning_effort: "high" },
+  ],
+  min_score: 0.95
+)
+
+rec.best         # => { model: "gpt-4.1-mini" }
+rec.retry_chain  # => [{ model: "gpt-4.1-nano" }, { model: "gpt-4.1-mini" }]
+rec.to_dsl       # => "retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]"
+rec.savings      # => savings vs your current model (if configured)
 ```
 
-
+Copy `rec.to_dsl` into your step. Done.
 
-
-# RSpec — block merge if accuracy drops or cost spikes
-expect(ClassifyTicket).to pass_eval("regression")
-  .with_context(model: "gpt-4.1-mini")
-  .with_minimum_score(0.8)
-  .with_maximum_cost(0.01)
+## Catch regressions before users do
 
-
-require "ruby_llm/contract/rake_task"
-RubyLLM::Contract::RakeTask.new do |t|
-  t.minimum_score = 0.8
-  t.maximum_cost = 0.05
-end
-# bundle exec rake ruby_llm_contract:eval
-```
-
-## Detect quality drops
-
-Save a baseline. Next run, see what regressed.
+A model update silently dropped your accuracy? A prompt tweak broke an edge case? You'll know before deploying:
 
 ```ruby
+# Save a baseline once:
 report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
 report.save_baseline!(model: "gpt-4.1-nano")
 
-#
-report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
-diff = report.compare_with_baseline(model: "gpt-4.1-nano")
-
-diff.regressed?   # => true
-diff.regressions  # => [{case: "outage", baseline: {passed: true}, current: {passed: false}}]
-diff.score_delta  # => -0.33
-```
-
-```ruby
-# CI: block merge if any previously-passing case now fails
+# In CI — block merge if anything regressed:
 expect(ClassifyTicket).to pass_eval("regression")
   .with_context(model: "gpt-4.1-nano")
   .without_regressions
 ```
 
-## Track quality over time
-
 ```ruby
-
-
-
-
-# View trend
-history = report.eval_history(model: "gpt-4.1-nano")
-history.score_trend  # => :stable_or_improving | :declining
-history.drift?       # => true (score dropped > 10%)
+diff = report.compare_with_baseline(model: "gpt-4.1-nano")
+diff.regressed?   # => true
+diff.regressions  # => [{case: "outage", baseline: {passed: true}, current: {passed: false}}]
+diff.score_delta  # => -0.33
 ```
 
-
+No more "it worked in the playground". Regressions are caught in CI, not production.
 
-
-# 4x faster with parallel execution
-report = ClassifyTicket.run_eval("regression",
-  context: { model: "gpt-4.1-nano" },
-  concurrency: 4)
-```
-
-## Prompt A/B testing
+## A/B test your prompts
 
-Changed a prompt? Compare old vs new with regression safety:
+Changed a prompt? Compare old vs new on the same dataset with regression safety:
 
 ```ruby
 diff = ClassifyTicketV2.compare_with(ClassifyTicketV1,
   eval: "regression", model: "gpt-4.1-mini")
 
-diff.safe_to_switch?  # => true (no regressions
+diff.safe_to_switch?  # => true (no regressions)
 diff.improvements     # => [{case: "outage", ...}]
 diff.score_delta      # => +0.33
 ```
 
-Requires `model:` or `context: { adapter: ... }`.
-`compare_with` ignores `sample_response`; without a real model/adapter both sides are skipped and the A/B result is not meaningful.
-
-CI gate:
 ```ruby
+# CI gate:
 expect(ClassifyTicketV2).to pass_eval("regression")
   .compared_with(ClassifyTicketV1)
   .with_minimum_score(0.8)
 ```
 
-##
+## Chain steps with fail-fast
 
-
+Pipeline stops at the first contract failure — no wasted tokens on downstream steps:
 
 ```ruby
-class
-
-
+class TicketPipeline < RubyLLM::Contract::Pipeline::Base
+  step ClassifyTicket, as: :classify
+  step RouteToTeam, as: :route
+  step DraftResponse, as: :draft
 end
 
-result =
-result.
-result.
-```
-
-## Predict cost before running
-
-```ruby
-ClassifyTicket.estimate_eval_cost("regression", models: %w[gpt-4.1-nano gpt-4.1-mini])
-# => { "gpt-4.1-nano" => 0.000024, "gpt-4.1-mini" => 0.000096 }
+result = TicketPipeline.run("I was charged twice")
+result.outputs_by_step[:classify]  # => {priority: "high", category: "billing"}
+result.trace.total_cost            # => $0.000128
 ```
 
-##
+## Gate merges on quality and cost
 
 ```ruby
-
-
+# RSpec — block merge if accuracy drops or cost spikes
+expect(ClassifyTicket).to pass_eval("regression")
+  .with_minimum_score(0.8)
+  .with_maximum_cost(0.01)
 
-
-RubyLLM.
-
+# Rake — run all evals across all steps
+RubyLLM::Contract::RakeTask.new do |t|
+  t.minimum_score = 0.8
+  t.maximum_cost = 0.05
+end
+# bundle exec rake ruby_llm_contract:eval
 ```
 
-Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
-
 ## Docs
 
 | Guide | |
 |-------|-|
 | [Getting Started](docs/guide/getting_started.md) | Features walkthrough, model escalation, eval |
+| [Eval-First](docs/guide/eval_first.md) | Practical workflow for prompt engineering with datasets, baselines, and A/B gates |
 | [Best Practices](docs/guide/best_practices.md) | 6 patterns for bulletproof validates |
 | [Output Schema](docs/guide/output_schema.md) | Full schema reference + constraints |
 | [Pipeline](docs/guide/pipeline.md) | Multi-step composition, timeout, fail-fast |
@@ -228,13 +271,13 @@ Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
 
 ## Roadmap
 
-**v0.
+**v0.6 (current):** "What should I do?" — `Step.recommend` returns optimal model, reasoning effort, and retry chain. Per-attempt `reasoning_effort` in retry policies.
 
-**v0.
+**v0.5:** Prompt A/B testing with `compare_with`. Soft observations with `observe`.
 
-**v0.
+**v0.4:** Eval history, batch concurrency, pipeline per-step eval, Minitest, structured logging.
 
-**v0.
+**v0.3:** Baseline regression detection, migration guide.
 
 ## License
 
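The escalation behavior the README describes (start on the cheapest config, re-ask only when the contract fails) can be sketched independently of the gem. Everything below is illustrative plain Ruby: `escalate_until_valid`, the `answers` table, and the validator lambda are stand-ins, not gem API:

```ruby
# Illustrative sketch of contract-driven escalation (not the gem's code).
# Try each candidate config in order; stop at the first output that
# passes the contract, so cheaper models handle most requests.
def escalate_until_valid(configs, contract)
  configs.each do |config|
    output = yield(config)  # stand-in for the LLM call
    return { config: config, output: output } if contract.call(output)
  end
  nil  # every candidate failed the contract
end

configs  = [{ model: "gpt-4.1-nano" }, { model: "gpt-4.1-mini" }]
contract = ->(o) { %w[low medium high urgent].include?(o[:priority]) }

# Simulate nano answering garbage and mini answering correctly:
answers = { "gpt-4.1-nano" => { priority: "banana" },
            "gpt-4.1-mini" => { priority: "high" } }
result = escalate_until_valid(configs, contract) { |c| answers[c[:model]] }
result[:config]  # => { model: "gpt-4.1-mini" }
```

The gem layers schema validation, cost caps, and per-attempt `reasoning_effort` on top of this basic loop.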
data/lib/ruby_llm/contract/adapters/ruby_llm.rb CHANGED

@@ -52,7 +52,10 @@ module RubyLLM
         CHAT_OPTION_METHODS.each do |key, method_name|
           chat.public_send(method_name, options[key]) if options[key]
         end
-
+        params = {}
+        params[:max_tokens] = options[:max_tokens] if options[:max_tokens]
+        params[:reasoning_effort] = options[:reasoning_effort] if options[:reasoning_effort]
+        chat.with_params(**params) if params.any?
       end
 
       def build_response(response)
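The hunk above only forwards keys that were actually supplied, so `with_params` is skipped entirely when nothing is set and providers never receive explicit nils. The conditional-merge pattern in isolation, as a pure function sketch (no ruby_llm involved; `provider_params` is a stand-in name):

```ruby
# Stand-in sketch of the conditional forwarding pattern used above:
# only include provider params that were actually supplied.
def provider_params(options)
  params = {}
  params[:max_tokens] = options[:max_tokens] if options[:max_tokens]
  params[:reasoning_effort] = options[:reasoning_effort] if options[:reasoning_effort]
  params
end

provider_params(reasoning_effort: "low")  # => { reasoning_effort: "low" }
provider_params({})                       # => {} (so with_params is skipped)
```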
data/lib/ruby_llm/contract/concerns/eval_host.rb CHANGED

@@ -70,18 +70,37 @@ module RubyLLM
       Eval::PromptDiff.new(candidate: my_report, baseline: other_report)
     end
 
-    def compare_models(eval_name, models
+    def compare_models(eval_name, models: [], candidates: [], context: {})
+      raise ArgumentError, "Pass either models: or candidates:, not both" if models.any? && candidates.any?
+
       context = safe_context(context)
-
-
-
-
+      candidate_configs = normalize_candidates(models, candidates)
+
+      reports = {}
+      configs = {}
+      candidate_configs.each do |config|
+        label = Eval::ModelComparison.candidate_label(config)
+        model_context = isolate_context(context).merge(model: config[:model])
+        model_context[:reasoning_effort] = config[:reasoning_effort] if config[:reasoning_effort]
+        reports[label] = run_single_eval(eval_name, model_context)
+        configs[label] = config
       end
-
+
+      Eval::ModelComparison.new(eval_name: eval_name, reports: reports, configs: configs)
     end
 
     private
 
+    def normalize_candidates(models, candidates)
+      if candidates.any?
+        candidates.map { |c| RubyLLM::Contract.normalize_candidate_config(c) }.uniq
+      elsif models.any?
+        models.uniq.map { |m| { model: m } }
+      else
+        raise ArgumentError, "Pass models: or candidates: with at least one entry"
+      end
+    end
+
     def comparison_context(context, model)
       base_context = safe_context(context)
       model ? base_context.merge(model: model) : base_context
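A simplified restatement of the `normalize_candidates` logic above, showing the two input shapes it unifies. The gem additionally passes each candidate through `RubyLLM::Contract.normalize_candidate_config`; here that hook is replaced by a plain pass-through, so this is a sketch rather than the real method:

```ruby
# Simplified sketch: unify `models:` strings and `candidates:` hashes into
# one deduplicated list of config hashes.
def normalize_candidates(models, candidates)
  if candidates.any?
    candidates.uniq
  elsif models.any?
    models.uniq.map { |m| { model: m } }
  else
    raise ArgumentError, "Pass models: or candidates: with at least one entry"
  end
end

normalize_candidates(%w[gpt-4.1-nano gpt-4.1-nano], [])
# => [{ model: "gpt-4.1-nano" }]
```

This is what keeps the old `models:` keyword backward compatible: a bare model string becomes a one-key config hash.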
data/lib/ruby_llm/contract/eval/model_comparison.rb CHANGED

@@ -4,11 +4,17 @@ module RubyLLM
   module Contract
     module Eval
       class ModelComparison
-        attr_reader :eval_name, :reports
+        attr_reader :eval_name, :reports, :configs
 
-        def
+        def self.candidate_label(config)
+          effort = config[:reasoning_effort]
+          effort ? "#{config[:model]} (effort: #{effort})" : config[:model]
+        end
+
+        def initialize(eval_name:, reports:, configs: nil)
           @eval_name = eval_name
-          @reports = reports.dup.freeze
+          @reports = reports.dup.freeze
+          @configs = (configs || default_configs_from_reports).freeze
           freeze
         end
 
@@ -16,12 +22,12 @@ module RubyLLM
           @reports.keys
         end
 
-        def score_for(
-          @reports[
+        def score_for(candidate)
+          @reports[resolve_key(candidate)]&.score
         end
 
-        def cost_for(
-          @reports[
+        def cost_for(candidate)
+          @reports[resolve_key(candidate)]&.total_cost
         end
 
         def best_for(min_score: 0.0)
@@ -38,13 +44,14 @@ module RubyLLM
         end
 
         def table
-
-          lines
+          max_label = [@reports.keys.map(&:length).max || 0, 25].max
+          lines = [format("  %-#{max_label}s  Score  Cost  Avg Latency", "Candidate")]
+          lines << "  #{"-" * (max_label + 36)}"
 
-          @reports.each do |
+          @reports.each do |label, report|
             latency = report.avg_latency_ms ? "#{report.avg_latency_ms.round}ms" : "n/a"
             cost = report.total_cost.positive? ? "$#{format("%.4f", report.total_cost)}" : "n/a"
-            lines << format("
+            lines << format("  %-#{max_label}s  %6.2f  %10s  %12s", label, report.score, cost, latency)
           end
 
           lines.join("\n")
@@ -70,10 +77,25 @@ module RubyLLM
             total_cost: report.total_cost,
             avg_latency_ms: report.avg_latency_ms,
             pass_rate: report.pass_rate,
+            pass_rate_ratio: report.pass_rate_ratio,
             passed: report.passed?
           }
         end
       end
+
+      private
+
+      def resolve_key(candidate)
+        case candidate
+        when String then candidate
+        when Hash then self.class.candidate_label(candidate)
+        else candidate.to_s
+        end
+      end
+
+      def default_configs_from_reports
+        @reports.each_with_object({}) { |(key, _), h| h[key] = { model: key } }
+      end
     end
   end
 end
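Since reports and configs are keyed by these labels, the labeling scheme is worth seeing in isolation. This is the same logic as `candidate_label` above, copied out of the class as a bare method: a `reasoning_effort` makes the same model a distinct candidate.

```ruby
# The label scheme from ModelComparison.candidate_label, restated standalone.
def candidate_label(config)
  effort = config[:reasoning_effort]
  effort ? "#{config[:model]} (effort: #{effort})" : config[:model]
end

candidate_label(model: "gpt-5-mini", reasoning_effort: "low")
# => "gpt-5-mini (effort: low)"
candidate_label(model: "gpt-4.1-nano")
# => "gpt-4.1-nano"
```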
data/lib/ruby_llm/contract/eval/recommendation.rb ADDED

@@ -0,0 +1,48 @@
+# frozen_string_literal: true
+
+module RubyLLM
+  module Contract
+    module Eval
+      class Recommendation
+        include Concerns::DeepFreeze
+
+        attr_reader :best, :retry_chain, :score, :cost_per_call,
+                    :rationale, :current_config, :savings, :warnings
+
+        def initialize(best:, retry_chain:, score:, cost_per_call:,
+                       rationale:, current_config:, savings:, warnings:)
+          @best = deep_dup_freeze(best)
+          @retry_chain = deep_dup_freeze(retry_chain)
+          @score = score
+          @cost_per_call = cost_per_call
+          @rationale = deep_dup_freeze(rationale)
+          @current_config = deep_dup_freeze(current_config)
+          @savings = deep_dup_freeze(savings)
+          @warnings = deep_dup_freeze(warnings)
+          freeze
+        end
+
+        def to_dsl
+          return "# No recommendation — no candidate met the minimum score" if retry_chain.empty?
+
+          if retry_chain.length == 1 && retry_chain.first.keys == [:model]
+            "model \"#{retry_chain.first[:model]}\""
+          elsif retry_chain.all? { |c| c.keys == [:model] }
+            models_str = retry_chain.map { |c| c[:model] }.join(" ")
+            "retry_policy models: %w[#{models_str}]"
+          else
+            args = retry_chain.map { |c| config_to_ruby(c) }.join(",\n    ")
+            "retry_policy do\n  escalate(#{args})\nend"
+          end
+        end
+
+        private
+
+        def config_to_ruby(config)
+          pairs = config.map { |k, v| "#{k}: #{v.inspect}" }.join(", ")
+          "{ #{pairs} }"
+        end
+      end
+    end
+  end
+end
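`to_dsl` above emits three shapes depending on the chain: a single `model` line, a string-only `retry_policy models:` line, or a block-form `escalate` when any config carries extra keys. The same branching logic, exercised standalone (copied out of the `Recommendation` class as a bare method taking the chain directly):

```ruby
# The three to_dsl output shapes, restated as a bare method.
def to_dsl(retry_chain)
  return "# No recommendation — no candidate met the minimum score" if retry_chain.empty?

  if retry_chain.length == 1 && retry_chain.first.keys == [:model]
    "model \"#{retry_chain.first[:model]}\""
  elsif retry_chain.all? { |c| c.keys == [:model] }
    "retry_policy models: %w[#{retry_chain.map { |c| c[:model] }.join(" ")}]"
  else
    args = retry_chain.map { |c| "{ " + c.map { |k, v| "#{k}: #{v.inspect}" }.join(", ") + " }" }
                      .join(",\n    ")
    "retry_policy do\n  escalate(#{args})\nend"
  end
end

to_dsl([{ model: "gpt-4.1-mini" }])
# => "model \"gpt-4.1-mini\""
to_dsl([{ model: "gpt-4.1-nano" }, { model: "gpt-4.1-mini" }])
# => "retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]"
```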
@@ -0,0 +1,132 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module RubyLLM
|
|
4
|
+
module Contract
|
|
5
|
+
module Eval
|
|
6
|
+
class Recommender
|
|
7
|
+
def initialize(comparison:, min_score:, min_first_try_pass_rate: 0.8, current_config: nil)
|
|
8
|
+
@comparison = comparison
|
|
9
|
+
@min_score = min_score
|
|
10
|
+
@min_first_try_pass_rate = min_first_try_pass_rate
|
|
11
|
+
@current_config = current_config
|
|
12
|
+
end
|
|
13
|
+
|
|
14
|
+
def recommend
|
|
15
|
+
scored = build_scored_candidates
|
|
16
|
+
best = select_best(scored)
|
|
17
|
+
chain = build_retry_chain(scored, best)
|
|
18
|
+
rationale = build_rationale(scored, best)
|
|
19
|
+
warnings = build_warnings(scored)
|
|
20
|
+
savings = best ? calculate_savings(best) : {}
|
|
21
|
+
|
|
22
|
+
Recommendation.new(
|
|
23
|
+
best: best&.dig(:config),
|
|
24
|
+
retry_chain: chain,
|
|
25
|
+
score: best&.dig(:score) || 0.0,
|
|
26
|
+
cost_per_call: best&.dig(:cost_per_call) || 0.0,
|
|
27
|
+
rationale: rationale,
|
|
28
|
+
current_config: @current_config,
|
|
29
|
+
savings: savings,
|
|
30
|
+
warnings: warnings
|
|
31
|
+
)
|
|
32
|
+
end
|
|
33
|
+
|
|
34
|
+
private
|
|
35
|
+
|
|
36
|
+
def build_scored_candidates
|
|
37
|
+
@comparison.configs.filter_map do |label, config|
|
|
38
|
+
report = @comparison.reports[label]
|
|
39
|
+
+            next nil unless report
+
+            evaluated_count = report.results.count { |r| r.step_status != :skipped }
+            cases_count = [evaluated_count, 1].max
+            cost_per_call = report.total_cost.to_f / cases_count
+
+            {
+              label: label,
+              config: config,
+              score: report.score,
+              cost_per_call: cost_per_call,
+              latency: report.avg_latency_ms || Float::INFINITY,
+              pass_rate_ratio: report.pass_rate_ratio,
+              total_cost: report.total_cost
+            }
+          end
+        end
+
+        def select_best(scored)
+          eligible = scored.select { |s| s[:score] >= @min_score && cost_known?(s) }
+          eligible.min_by { |s| [s[:cost_per_call], s[:latency], s[:label]] }
+        end
+
+        def build_retry_chain(scored, best)
+          return [] unless best
+
+          first_try = scored
+                      .select { |s| s[:pass_rate_ratio] >= @min_first_try_pass_rate && cost_known?(s) }
+                      .min_by { |s| [s[:cost_per_call], s[:latency], s[:label]] }
+
+          if first_try.nil? || first_try[:label] == best[:label]
+            [best[:config]]
+          else
+            [first_try[:config], best[:config]]
+          end
+        end
+
+        def build_rationale(scored, best)
+          sorted = scored.sort_by { |s| [cost_known?(s) ? 0 : 1, s[:cost_per_call], s[:latency], s[:label]] }
+          sorted.map { |s| rationale_line(s, best) }
+        end
+
+        def rationale_line(candidate, best)
+          cost_str = cost_known?(candidate) ? "$#{format("%.4f", candidate[:cost_per_call])}/call" : "unknown pricing"
+          header = "#{candidate[:label]}, score #{format("%.2f", candidate[:score])}, at #{cost_str}"
+          notes = rationale_notes(candidate, best)
+          notes.any? ? "#{header} — #{notes.join(", ")}" : header
+        end
+
+        def rationale_notes(candidate, best)
+          notes = []
+          pass_pct = (candidate[:pass_rate_ratio] * 100).round
+          below_threshold = candidate[:score] < @min_score
+
+          if below_threshold && candidate[:pass_rate_ratio] >= @min_first_try_pass_rate
+            notes << "below #{@min_score} threshold, but good first-try (#{pass_pct}% pass rate)"
+          elsif below_threshold
+            notes << "below #{@min_score} threshold"
+          elsif candidate[:pass_rate_ratio] < 1.0
+            notes << "#{pass_pct}% pass rate"
+          end
+          notes << "recommended" if best && candidate[:label] == best[:label]
+          notes << "unknown pricing" unless cost_known?(candidate)
+          notes
+        end
+
+        def build_warnings(scored)
+          scored.reject { |s| cost_known?(s) }
+                .map { |s| "#{s[:label]}: unknown pricing — cost ranking may be inaccurate" }
+        end
+
+        def calculate_savings(best)
+          return {} unless @current_config
+
+          current_label = ModelComparison.candidate_label(@current_config)
+          current_report = @comparison.reports[current_label]
+          return {} unless current_report
+
+          current_evaluated = current_report.results.count { |r| r.step_status != :skipped }
+          current_cases = [current_evaluated, 1].max
+          current_cost = current_report.total_cost.to_f / current_cases
+          diff = current_cost - best[:cost_per_call]
+          return {} unless diff.positive?
+
+          { per_call: diff.round(6), monthly_at: { 10_000 => (diff * 10_000).round(2) } }
+        end
+
+        def cost_known?(scored_candidate)
+          scored_candidate[:cost_per_call]&.positive?
+        end
+      end
+    end
+  end
+end
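The selection rule in `select_best` above can be exercised in isolation. A minimal sketch, with illustrative candidate hashes and a hard-coded threshold rather than the gem's instance state:

```ruby
# Sketch of the recommender's selection rule: among candidates whose
# score meets the threshold and whose pricing is known, pick the
# cheapest per call, breaking ties by latency and then by label.
MIN_SCORE = 0.95

def cost_known?(candidate)
  candidate[:cost_per_call]&.positive?
end

def select_best(scored)
  eligible = scored.select { |s| s[:score] >= MIN_SCORE && cost_known?(s) }
  eligible.min_by { |s| [s[:cost_per_call], s[:latency], s[:label]] }
end

scored = [
  { label: "gpt-4o",      score: 0.98, cost_per_call: 0.004,  latency: 900 },
  { label: "gpt-4o-mini", score: 0.96, cost_per_call: 0.001,  latency: 600 },
  { label: "haiku",       score: 0.90, cost_per_call: 0.0005, latency: 400 }
]

# "haiku" is cheapest but misses the 0.95 threshold, so the cheapest
# *eligible* candidate wins.
puts select_best(scored)[:label] # => gpt-4o-mini
```

Sorting on the `[cost, latency, label]` tuple makes the result deterministic even when two candidates cost the same.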
@@ -14,8 +14,8 @@ module RubyLLM
         HISTORY_DIR = ".eval_history"
         BASELINE_DIR = ".eval_baselines"
 
-        def_delegators :@stats, :score, :passed, :failed, :skipped, :failures, :pass_rate, :
-                       :passed?
+        def_delegators :@stats, :score, :passed, :failed, :skipped, :failures, :pass_rate, :pass_rate_ratio,
+                       :total_cost, :avg_latency_ms, :passed?
         def_delegators :@presenter, :summary, :to_s, :print_summary
         def_delegators :@storage, :save_history!, :eval_history, :save_baseline!, :compare_with_baseline,
                        :baseline_exists?
@@ -35,6 +35,12 @@ module RubyLLM
           "#{passed}/#{evaluated_results.length}"
         end
 
+        def pass_rate_ratio
+          return 0.0 if evaluated_results.empty?
+
+          passed.to_f / evaluated_results.length
+        end
+
         def total_cost
           @results.sum { |result| result.cost || 0.0 }
         end
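The `pass_rate_ratio` added above reduces to passed-over-evaluated with an empty-set guard. A standalone sketch over plain hashes (result objects are simplified here to `{ status: ... }`):

```ruby
# pass_rate_ratio: passed cases divided by evaluated (non-skipped)
# cases, returning 0.0 when nothing was evaluated.
def pass_rate_ratio(results)
  evaluated = results.reject { |r| r[:status] == :skipped }
  return 0.0 if evaluated.empty?

  evaluated.count { |r| r[:status] == :passed }.to_f / evaluated.length
end

results = [
  { status: :passed }, { status: :failed },
  { status: :passed }, { status: :skipped }
]

puts pass_rate_ratio(results).round(4) # => 0.6667 (2 of 3 evaluated)
puts pass_rate_ratio([])               # => 0.0
```

Skipped cases are excluded from the denominator, so a run with many skips is not unfairly penalized.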
@@ -13,25 +13,29 @@ module RubyLLM
           @stats = stats
         end
 
-        def save_history!(path: nil, model: nil)
-          file = path || storage_path(Report::HISTORY_DIR, "jsonl", model: model)
-
+        def save_history!(path: nil, model: nil, reasoning_effort: nil)
+          file = path || storage_path(Report::HISTORY_DIR, "jsonl", model: model, reasoning_effort: reasoning_effort)
+          entry = history_entry
+          entry[:model] = model if model
+          entry[:reasoning_effort] = reasoning_effort if reasoning_effort
+          EvalHistory.append(file, entry)
           file
         end
 
-        def eval_history(path: nil, model: nil)
-          EvalHistory.load(path || storage_path(Report::HISTORY_DIR, "jsonl", model: model))
+        def eval_history(path: nil, model: nil, reasoning_effort: nil)
+          EvalHistory.load(path || storage_path(Report::HISTORY_DIR, "jsonl", model: model,
+                                                reasoning_effort: reasoning_effort))
         end
 
-        def save_baseline!(path: nil, model: nil)
-          file = path || storage_path(Report::BASELINE_DIR, "json", model: model)
+        def save_baseline!(path: nil, model: nil, reasoning_effort: nil)
+          file = path || storage_path(Report::BASELINE_DIR, "json", model: model, reasoning_effort: reasoning_effort)
           FileUtils.mkdir_p(File.dirname(file))
           File.write(file, JSON.pretty_generate(serialize_for_baseline))
           file
         end
 
-        def compare_with_baseline(path: nil, model: nil)
-          file = path || storage_path(Report::BASELINE_DIR, "json", model: model)
+        def compare_with_baseline(path: nil, model: nil, reasoning_effort: nil)
+          file = path || storage_path(Report::BASELINE_DIR, "json", model: model, reasoning_effort: reasoning_effort)
           raise ArgumentError, "No baseline found at #{file}" unless File.exist?(file)
 
           baseline_data = JSON.parse(File.read(file), symbolize_names: true)
@@ -43,8 +47,8 @@ module RubyLLM
           )
         end
 
-        def baseline_exists?(path: nil, model: nil)
-          File.exist?(path || storage_path(Report::BASELINE_DIR, "json", model: model))
+        def baseline_exists?(path: nil, model: nil, reasoning_effort: nil)
+          File.exist?(path || storage_path(Report::BASELINE_DIR, "json", model: model, reasoning_effort: reasoning_effort))
        end
 
         private
@@ -55,6 +59,7 @@ module RubyLLM
             score: @stats.score,
             total_cost: @stats.total_cost,
             pass_rate: @stats.pass_rate,
+            pass_rate_ratio: @stats.pass_rate_ratio,
             cases_count: @stats.evaluated_results_count
           }
         end
@@ -79,12 +84,13 @@ module RubyLLM
           }
         end
 
-        def storage_path(root_dir, extension, model:)
+        def storage_path(root_dir, extension, model:, reasoning_effort: nil)
           parts = [root_dir]
           parts << sanitize_name(@report.step_name) if @report.step_name
 
           dataset_name = sanitize_name(@report.dataset_name)
           dataset_name = "#{dataset_name}_#{sanitize_name(model)}" if model
+          dataset_name = "#{dataset_name}_effort_#{sanitize_name(reasoning_effort)}" if reasoning_effort
 
           File.join(*parts, "#{dataset_name}.#{extension}")
         end
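The path-naming change above appends an `_effort_<level>` suffix so histories and baselines for different reasoning efforts do not collide. A hedged sketch of the naming scheme; the sanitize rule here (downcase, non-word runs to `_`) is an assumption, since the gem's actual `sanitize_name` is not shown in this diff:

```ruby
# Sketch of the storage-path naming: dataset name, optional model
# suffix, optional "_effort_<level>" suffix, joined under the root dir.
def sanitize_name(name)
  name.to_s.downcase.gsub(/\W+/, "_") # assumed sanitization rule
end

def storage_path(root_dir, extension, dataset:, model: nil, reasoning_effort: nil)
  name = sanitize_name(dataset)
  name = "#{name}_#{sanitize_name(model)}" if model
  name = "#{name}_effort_#{sanitize_name(reasoning_effort)}" if reasoning_effort
  File.join(root_dir, "#{name}.#{extension}")
end

puts storage_path(".eval_history", "jsonl", dataset: "Billing Cases",
                  model: "gpt-4o-mini", reasoning_effort: :low)
# => .eval_history/billing_cases_gpt_4o_mini_effort_low.jsonl
```

With the suffix in place, re-running an eval at a different effort level writes to its own JSONL file instead of mixing entries into the same history.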
@@ -49,7 +49,17 @@ module RubyLLM
         end
       end
 
-
+      def recommend(eval_name, candidates:, min_score: 0.95, min_first_try_pass_rate: 0.8, context: {})
+        comparison = compare_models(eval_name, candidates: candidates, context: context)
+        Eval::Recommender.new(
+          comparison: comparison,
+          min_score: min_score,
+          min_first_try_pass_rate: min_first_try_pass_rate,
+          current_config: current_model_config
+        ).recommend
+      end
+
+      KNOWN_CONTEXT_KEYS = %i[adapter model temperature max_tokens provider assume_model_exists reasoning_effort].freeze
 
      include Concerns::ContextHelpers
@@ -139,11 +149,22 @@ module RubyLLM
        {
          model: context[:model] || model || RubyLLM::Contract.configuration.default_model,
          temperature: context[:temperature],
-         extra_options: context.slice(:provider, :assume_model_exists, :max_tokens),
+         extra_options: context.slice(:provider, :assume_model_exists, :max_tokens, :reasoning_effort),
          policy: retry_policy
        }
      end
 
+     def current_model_config
+       policy = retry_policy
+       if policy && policy.config_list.any?
+         policy.config_list.first
+       elsif respond_to?(:model) && model
+         { model: model }
+       elsif RubyLLM::Contract.configuration.default_model
+         { model: RubyLLM::Contract.configuration.default_model }
+       end
+     end
+
      def resolve_adapter(context)
        adapter = context[:adapter] || RubyLLM::Contract.configuration.default_adapter
        return adapter if adapter
@@ -10,12 +10,16 @@ module RubyLLM
 
        def run_with_retry(input, adapter:, default_model:, policy:, context_temperature: nil, extra_options: {})
          all_attempts = []
+         default_config = { model: default_model }.merge(extra_options.slice(:reasoning_effort).compact)
 
          policy.max_attempts.times do |attempt_index|
-
+           config = policy.config_for_attempt(attempt_index, default_config)
+           model = config[:model]
+           attempt_extra = extra_options.merge(config.except(:model))
+
            result = run_once(input, adapter: adapter, model: model,
-                             context_temperature: context_temperature, extra_options:
-           all_attempts << { attempt: attempt_index + 1, model: model, result: result }
+                             context_temperature: context_temperature, extra_options: attempt_extra)
+           all_attempts << { attempt: attempt_index + 1, model: model, config: config, result: result }
            break unless policy.retryable?(result)
          end
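The per-attempt merge above folds each attempt's config into the shared options. A standalone sketch of just that merge step (`attempt_options` is an illustrative helper name, not part of the gem; `run_once` itself is omitted):

```ruby
# Sketch of the option merge in run_with_retry: the attempt's config
# supplies the model, and its remaining keys (e.g. reasoning_effort)
# override the shared extra_options for that attempt only.
def attempt_options(config, extra_options)
  { model: config[:model],
    extra_options: extra_options.merge(config.except(:model)) }
end

config = { model: "gpt-4o", reasoning_effort: :high }
opts = attempt_options(config, { max_tokens: 256, provider: :openai })

puts opts[:model]                            # => gpt-4o
puts opts[:extra_options][:reasoning_effort] # => high
puts opts[:extra_options][:max_tokens]       # => 256
```

`Hash#except` (Ruby 3.0+) keeps the model out of `extra_options`, so escalation steps can change both the model and its effort level in one config entry.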
@@ -43,6 +47,8 @@ module RubyLLM
        def build_attempt_entry(attempt)
          trace = attempt[:result].trace
          entry = { attempt: attempt[:attempt], model: attempt[:model], status: attempt[:result].status }
+         config = attempt[:config]
+         entry[:config] = config if config && config.keys != [:model]
          append_trace_fields(entry, trace)
        end
 
@@ -9,13 +9,13 @@ module RubyLLM
        DEFAULT_RETRY_ON = %i[validation_failed parse_error adapter_error].freeze
 
        def initialize(models: nil, attempts: nil, retry_on: nil, &block)
-         @
+         @configs = []
          @retryable_statuses = DEFAULT_RETRY_ON.dup
 
          if block
            @max_attempts = 1
            instance_eval(&block)
-           warn_no_retry! if @max_attempts == 1 && @
+           warn_no_retry! if @max_attempts == 1 && @configs.empty?
          else
            apply_keywords(models: models, attempts: attempts, retry_on: retry_on)
          end
@@ -28,14 +28,18 @@ module RubyLLM
          validate_max_attempts!
        end
 
-       def escalate(*
-         @
-         @max_attempts = @
+       def escalate(*config_list)
+         @configs = config_list.flatten.map { |c| normalize_config(c).freeze }.freeze
+         @max_attempts = @configs.length if @max_attempts < @configs.length
        end
        alias models escalate
 
        def model_list
-         @
+         @configs.map { |c| c[:model] }.freeze
+       end
+
+       def config_list
+         @configs
        end
 
        def retry_on(*statuses)
@@ -46,29 +50,38 @@ module RubyLLM
          retryable_statuses.include?(result.status)
        end
 
-       def 
-         if @
-           @
+       def config_for_attempt(attempt, default_config)
+         if @configs.any?
+           @configs[attempt] || @configs.last
          else
-
+           default_config
          end
        end
 
+       def model_for_attempt(attempt, default_model)
+         config_for_attempt(attempt, { model: default_model })[:model]
+       end
+
        private
 
        def apply_keywords(models:, attempts:, retry_on:)
          if models
-           @
-           @max_attempts = @
+           @configs = Array(models).map { |m| normalize_config(m).freeze }.freeze
+           @max_attempts = @configs.length
          else
            @max_attempts = attempts || 1
          end
          @retryable_statuses = Array(retry_on).dup if retry_on
        end
 
+       def normalize_config(entry)
+         RubyLLM::Contract.normalize_candidate_config(entry)
+       end
+
        def warn_no_retry!
-         warn "[ruby_llm-contract] retry_policy has max_attempts=1 with no
-              "This means no actual retry will happen. Add `attempts 2` or
+         warn "[ruby_llm-contract] retry_policy has max_attempts=1 with no configs. " \
+              "This means no actual retry will happen. Add `attempts 2` or " \
+              '`escalate "model1", "model2"`.'
        end
 
        def validate_max_attempts!
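The escalation lookup in `config_for_attempt` above has a simple contract: attempt N indexes into the configured list, clamping to the last entry, and falls back to the caller's default when no list was set. A standalone sketch (free function instead of the gem's instance method):

```ruby
# Per-attempt escalation lookup: with an escalation list, each attempt
# indexes into it and clamps to the last entry; with no list, the
# caller's default config is used for every attempt.
def config_for_attempt(configs, attempt, default_config)
  return default_config if configs.empty?

  configs[attempt] || configs.last
end

configs = [{ model: "gpt-4o-mini" }, { model: "gpt-4o", reasoning_effort: :high }]

puts config_for_attempt(configs, 0, { model: "fallback" })[:model] # => gpt-4o-mini
puts config_for_attempt(configs, 5, { model: "fallback" })[:model] # => gpt-4o
puts config_for_attempt([], 3, { model: "fallback" })[:model]      # => fallback
```

Clamping to the last entry means `attempts 5` with a two-model escalation simply keeps retrying the strongest config rather than raising.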
data/lib/ruby_llm/contract.rb
CHANGED
@@ -78,6 +78,27 @@ module RubyLLM
      Thread.current[:ruby_llm_contract_reloading] = false
    end
 
+    def normalize_candidate_config(entry)
+      case entry
+      when String
+        raise ArgumentError, "Candidate model must be a non-empty String" if entry.strip.empty?
+
+        { model: entry.strip }
+      when Hash
+        model = entry[:model] || entry["model"]
+        unless model.is_a?(String) && !model.strip.empty?
+          raise ArgumentError, "Candidate config must include a non-empty String :model"
+        end
+
+        normalized = { model: model.strip }
+        effort = entry[:reasoning_effort] || entry["reasoning_effort"]
+        normalized[:reasoning_effort] = effort if effort
+        normalized
+      else
+        raise ArgumentError, "Expected String or Hash, got #{entry.class}"
+      end
+    end
+
    private
 
    # Filter stale hosts, deduplicate by name (last wins), prune registry in-place
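`normalize_candidate_config` is the shared entry point that both `escalate` and the keyword form of `retry_policy` funnel through. Its behavior, reproduced standalone from the diff above: bare strings become `{ model: ... }`, hashes must carry a non-empty `:model` (symbol or string key) and may carry a `:reasoning_effort`, and anything else raises.

```ruby
# Candidate normalization as added in contract.rb: accepts a model
# name String or a config Hash, returns a symbol-keyed config Hash.
def normalize_candidate_config(entry)
  case entry
  when String
    raise ArgumentError, "Candidate model must be a non-empty String" if entry.strip.empty?

    { model: entry.strip }
  when Hash
    model = entry[:model] || entry["model"]
    unless model.is_a?(String) && !model.strip.empty?
      raise ArgumentError, "Candidate config must include a non-empty String :model"
    end

    normalized = { model: model.strip }
    effort = entry[:reasoning_effort] || entry["reasoning_effort"]
    normalized[:reasoning_effort] = effort if effort
    normalized
  else
    raise ArgumentError, "Expected String or Hash, got #{entry.class}"
  end
end

p normalize_candidate_config(" gpt-4o-mini ")
p normalize_candidate_config("model" => "o3", "reasoning_effort" => :low)
```

Normalizing at the boundary means downstream code (retry executor, recommender) can assume a frozen `{ model:, reasoning_effort: }` shape everywhere.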
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: ruby_llm-contract
 version: !ruby/object:Gem::Version
-  version: 0.5.0
+  version: 0.6.0
 platform: ruby
 authors:
 - Justyna
@@ -130,6 +130,8 @@ files:
 - lib/ruby_llm/contract/eval/prompt_diff_comparator.rb
 - lib/ruby_llm/contract/eval/prompt_diff_presenter.rb
 - lib/ruby_llm/contract/eval/prompt_diff_serializer.rb
+- lib/ruby_llm/contract/eval/recommendation.rb
+- lib/ruby_llm/contract/eval/recommender.rb
 - lib/ruby_llm/contract/eval/report.rb
 - lib/ruby_llm/contract/eval/report_presenter.rb
 - lib/ruby_llm/contract/eval/report_stats.rb