ruby_llm-contract 0.5.2 → 0.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 359d08f8cf1e31b84f308c47c7f93c7cee7663054de3ab538a34c1f67873554f
4
- data.tar.gz: 60d8728bed042277d40ec1d231b6712e258b658fd893a73afc6ed1f8e9cff8c8
3
+ metadata.gz: c86dcf06a34e5ff934367708213a0852191b0be7bd61e61e8577161e32ccf807
4
+ data.tar.gz: 28a44dd4f4d4c0f74c5da7800d9fa1fd2b8e51242a056f5aaae5f6f9fcf1be69
5
5
  SHA512:
6
- metadata.gz: 4bd4d7cea9fde7281bf84e1283c4201f8c5e9425cb8357e40b85e5184f19f51eb57a88a35901eddf571defd93ff33ef790e24b5e2eb90add8ef6371e791d37e5
7
- data.tar.gz: e68ca27fc2225224cd900b1afb2180cfd43929e0461420c7fd2987706a2ebaa282b1e659c8b5c14e69e30d1250ede547061e2d2ab74b5c9cc0bb7fdb77109f0a
6
+ metadata.gz: c2b6dc71a02519288ae4ee4f74f17d74e6c45d01e888036d18e4c821b8fc63ba43ac0ccee6b2948163597b19c93008a6e5e0375fd65c8f3ad8fcf1285f356e91
7
+ data.tar.gz: f08576e520ec7397c4b233c5907ce703890cf4616e969e25f4c980ac1b06013a62ff680b3c90878d76fd1da0980f04d4946a2fd11a8e304e43f5155e5c855e8e
data/CHANGELOG.md CHANGED
@@ -1,5 +1,28 @@
1
1
  # Changelog
2
2
 
3
+ ## 0.6.0 (2026-04-12)
4
+
5
+ "What should I do?" — model + configuration recommendation.
6
+
7
+ ### Features
8
+
9
+ - **`Step.recommend`** — `ClassifyTicket.recommend("eval", candidates: [...], min_score: 0.95)` runs eval on all candidates and returns a `Recommendation` with optimal model, retry chain, rationale, savings vs current config, and `to_dsl` code output.
10
+ - **Candidates as configurations** — `candidates:` accepts `{ model:, reasoning_effort: }` hashes, not just model name strings. `gpt-5-mini` with `reasoning_effort: "low"` is a different candidate than with `"high"`.
11
+ - **`compare_models` extended** — new `candidates:` parameter alongside existing `models:` (backward compatible). Candidate labels include reasoning effort in output table.
12
+ - **Per-attempt `reasoning_effort` in retry policies** — `escalate` accepts config hashes: `escalate({ model: "gpt-4.1-nano" }, { model: "gpt-5-mini", reasoning_effort: "high" })`. Each attempt gets its own reasoning_effort forwarded to the provider.
13
+ - **`pass_rate_ratio`** — numeric float (0.0–1.0) on `Report` and `ReportStats`, complementing the string `pass_rate` (`"3/5"`).
14
+ - **History entries enriched** — `save_history!` accepts `reasoning_effort:` and stores `model`, `reasoning_effort`, `pass_rate_ratio` in JSONL entries.
15
+
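The relationship between the string `pass_rate` and the new numeric `pass_rate_ratio` can be sketched in plain Ruby (a stand-in for the gem's `Report`/`ReportStats` computation, using a simple array of result hashes):

```ruby
# Sketch: how the string pass_rate ("3/5") and the numeric
# pass_rate_ratio (0.6) derive from the same evaluated results.
def pass_rate(results)
  "#{results.count { |r| r[:passed] }}/#{results.length}"
end

def pass_rate_ratio(results)
  return 0.0 if results.empty?

  results.count { |r| r[:passed] }.to_f / results.length
end

results = [{ passed: true }, { passed: true }, { passed: true },
           { passed: false }, { passed: false }]
pass_rate(results)       # => "3/5"
pass_rate_ratio(results) # => 0.6
```

The float form makes history entries sortable and trendable, which the string form is not.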
16
+ ### Game-changer continuity
17
+
18
+ ```
19
+ v0.2: "Which model?" → compare_models (snapshot)
20
+ v0.3: "Did it change?" → baseline regression (binary)
21
+ v0.4: "Show me the trend" → eval history (time series)
22
+ v0.5: "Which prompt is better?" → compare_with (A/B testing)
23
+ v0.6: "What should I do?" → recommend (actionable advice)
24
+ ```
25
+
3
26
  ## 0.5.2 (2026-04-06)
4
27
 
5
28
  ### Features
@@ -8,7 +31,7 @@
8
31
 
9
32
  ## 0.5.0 (2026-03-25)
10
33
 
11
- Data-Driven Prompt Engineering — see ADR-0015.
34
+ Data-Driven Prompt Engineering.
12
35
 
13
36
  ### Features
14
37
 
@@ -55,7 +78,7 @@ Audit hardening — 18 bugs fixed across 4 audit rounds.
55
78
 
56
79
  ## 0.4.3 (2026-03-24)
57
80
 
58
- Production feedback release — driven by ADR-0016 (real Rails 8.1 deployment).
81
+ Production feedback release.
59
82
 
60
83
  ### Features
61
84
 
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- ruby_llm-contract (0.5.2)
4
+ ruby_llm-contract (0.6.0)
5
5
  dry-types (~> 1.7)
6
6
  ruby_llm (~> 1.0)
7
7
  ruby_llm-schema (~> 0.3)
@@ -258,7 +258,7 @@ CHECKSUMS
258
258
  rubocop-ast (1.49.1) sha256=4412f3ee70f6fe4546cc489548e0f6fcf76cafcfa80fa03af67098ffed755035
259
259
  ruby-progressbar (1.13.0) sha256=80fc9c47a9b640d6834e0dc7b3c94c9df37f08cb072b7761e4a71e22cff29b33
260
260
  ruby_llm (1.14.0) sha256=57c6f7034fc4a44504ea137d70f853b07824f1c1cdbe774ab3ab3522e7098deb
261
- ruby_llm-contract (0.5.2)
261
+ ruby_llm-contract (0.6.0)
262
262
  ruby_llm-schema (0.3.0) sha256=a591edc5ca1b7f0304f0e2261de61ba4b3bea17be09f5cf7558153adfda3dec6
263
263
  ruby_parser (3.22.0) sha256=1eb4937cd9eb220aa2d194e352a24dba90aef00751e24c8dfffdb14000f15d23
264
264
  rubycritic (4.12.0) sha256=024fed90fe656fa939f6ea80aab17569699ac3863d0b52fd72cb99892247abc8
data/README.md CHANGED
@@ -1,24 +1,89 @@
1
1
  # ruby_llm-contract
2
2
 
3
- Contracts for LLM quality. Know which model to use, what it costs, and when accuracy drops.
3
+ The missing link between LLM cost and quality. Stop choosing between "cheap but wrong" and "accurate but expensive" — get both. Contracts, model escalation, and data-driven recommendations for [ruby_llm](https://github.com/crmne/ruby_llm).
4
4
 
5
- Companion gem for [ruby_llm](https://github.com/crmne/ruby_llm).
5
+ ```
6
+ YOU WRITE THE GEM HANDLES YOU GET
7
+ ───────── ─────────────── ───────
8
+
9
+ validate { |o| ... } catch bad answers — combined Zero garbage
10
+ with retry_policy, auto-retry in production
6
11
 
7
- ## The problem
12
+ retry_policy start cheap, escalate only Pay for the cheapest
13
+ models: %w[nano mini full] when validation fails model that works
8
14
 
9
- Which model should you use? The expensive one is accurate but costs 4x more. The cheap one is fast but hallucinates on edge cases. You tweak a prompt — did accuracy improve or drop? You have no data. Just gut feeling.
15
+ max_cost 0.01 estimate tokens, check price, No surprise bills
16
+ refuse before calling LLM
10
17
 
11
- ## The fix
18
+ output_schema { ... } send JSON schema to provider, Zero parsing code
19
+ validate response client-side
20
+
21
+ define_eval { ... } test cases + baselines, Regressions caught
22
+ run in CI with real LLM before deploy
23
+
24
+ recommend(candidates: [...]) evaluate all configs, pick Optimal model +
25
+ cheapest that passes retry chain
26
+ ```
27
+
28
+ ## Before and after
29
+
30
+ ```
31
+ ┌─────────────────────────────────────────────────────────────────┐
32
+ │ BEFORE: pick one model, hope for the best │
33
+ │ │
34
+ │ expensive model → accurate, but you overpay on every call │
35
+ │ cheap model → fast, but wrong answers slip to production │
36
+ │ prompt change → "looks good to me" → deploy → users suffer │
37
+ └─────────────────────────────────────────────────────────────────┘
38
+
39
+ ⬇ add ruby_llm-contract
40
+
41
+ ┌─────────────────────────────────────────────────────────────────┐
42
+ │ YOU DEFINE A CONTRACT │
43
+ │ │
44
+ │ output_schema { string :priority } ← valid structure │
45
+ │ validate("valid priority") { |o| ... } ← business rules │
46
+ │ retry_policy models: %w[nano mini full] ← escalation chain │
47
+ │ max_cost 0.01 ← budget cap │
48
+ └───────────────────────────┬─────────────────────────────────────┘
49
+
50
+
51
+ ┌─────────────────────────────────────────────────────────────────┐
52
+ │ THE GEM HANDLES THE REST │
53
+ │ │
54
+ │ request ──→ ┌──────┐ ┌──────────┐ │
55
+ │ │ nano │─→ │ contract │──→ ✓ pass → done │
56
+ │ └──────┘ └────┬─────┘ │
57
+ │ │ ✗ fail │
58
+ │ ▼ │
59
+ │ ┌──────┐ ┌──────────┐ │
60
+ │ │ mini │─→ │ contract │──→ ✓ pass → done │
61
+ │ └──────┘ └────┬─────┘ │
62
+ │ │ ✗ fail │
63
+ │ ▼ │
64
+ │ ┌──────┐ ┌──────────┐ │
65
+ │ │ full │─→ │ contract │──→ ✓ pass → done │
66
+ │ └──────┘ └──────────┘ │
67
+ └───────────────────────────┬─────────────────────────────────────┘
68
+
69
+
70
+ ┌─────────────────────────────────────────────────────────────────┐
71
+ │ YOU GET │
72
+ │ │
73
+ │ ✓ Valid output guaranteed — schema + business rules enforced │
74
+ │ ✓ Cheapest model that works — most requests stay on nano │
75
+ │ ✓ Cost, latency, tokens — tracked on every call │
76
+ │ ✓ Eval scores per model — data instead of gut feeling │
77
+ │ ✓ Regressions caught — before deploy, not after │
78
+ │ ✓ Recommendation — "use nano+mini, drop full, save $X/mo" │
79
+ └─────────────────────────────────────────────────────────────────┘
80
+ ```
81
+
82
+ ## 30-second version
12
83
 
13
84
  ```ruby
14
85
  class ClassifyTicket < RubyLLM::Contract::Step::Base
15
- prompt do
16
- system "You are a support ticket classifier."
17
- rule "Return valid JSON only, no markdown."
18
- rule "Use exactly one priority: low, medium, high, urgent."
19
- example input: "My invoice is wrong", output: '{"priority": "high"}'
20
- user "{input}"
21
- end
86
+ prompt "Classify this support ticket by priority and category.\n\n{input}"
22
87
 
23
88
  output_schema do
24
89
  string :priority, enum: %w[low medium high urgent]
@@ -30,29 +95,45 @@ class ClassifyTicket < RubyLLM::Contract::Step::Base
30
95
  end
31
96
 
32
97
  result = ClassifyTicket.run("I was charged twice")
33
- result.ok? # => true
34
- result.parsed_output # => {priority: "high", category: "billing"}
35
- result.trace[:cost] # => 0.000032
36
- result.trace[:model] # => "gpt-4.1-nano"
98
+ result.parsed_output # => {priority: "high", category: "billing"}
99
+ result.trace[:model] # => "gpt-4.1-nano" (first model that passed)
100
+ result.trace[:cost] # => 0.000032
37
101
  ```
38
102
 
39
- Bad JSON? Auto-retry. Wrong value? Escalate to a smarter model. Schema violated? Caught client-side even if the provider ignores it. All with cost tracking.
103
+ Bad JSON? Retried automatically. Wrong answer? Escalated to a smarter model. Schema violated? Caught client-side. The contract guarantees every response meets your rules, and you pay only for the cheapest model that passes.
104
+
105
+ ## Install
106
+
107
+ ```ruby
108
+ gem "ruby_llm-contract"
109
+ ```
40
110
 
41
- ## Start Here: Eval-First
111
+ ```ruby
112
+ RubyLLM.configure { |c| c.openai_api_key = ENV["OPENAI_API_KEY"] }
113
+ RubyLLM::Contract.configure { |c| c.default_model = "gpt-4.1-mini" }
114
+ ```
42
115
 
43
- The most powerful way to use this gem is simple:
116
+ Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc.).
44
117
 
45
- - define evals before changing prompts
46
- - compare prompt versions on the same dataset
47
- - merge only when the eval stays green
118
+ ## Save money with model escalation
48
119
 
49
- Read: [Eval-First](docs/guide/eval_first.md)
120
+ Without a contract, you use gpt-4.1 for everything because you can't tell when a cheaper model gets it wrong. With a contract, you start on nano and only escalate when the answer fails the contract:
50
121
 
51
- This is the workflow that gives prompt engineering teeth. No vibes, no cherry-picked examples, no "it felt better in the playground". Just cases, regressions, baselines, and measured wins.
122
+ ```ruby
123
+ retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
124
+ ```
52
125
 
53
- ## Which model should I use?
126
+ ```
127
+ Attempt 1: gpt-4.1-nano → contract failed ($0.0001)
128
+ Attempt 2: gpt-4.1-mini → contract passed ($0.0004)
129
+ gpt-4.1 → never called ($0.00)
130
+ ```
54
131
 
55
- Define test cases. Compare models. Get data.
132
+ Most requests succeed on the cheapest model. You pay full price only for the ones that need it. How many? Run `compare_models` and find out.
133
+
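The escalation mechanic can be sketched in plain Ruby. This is an illustration of the idea, not the gem's internals; `llm`, `contract`, and the per-model prices are hypothetical stand-ins:

```ruby
# Illustrative escalation loop: try models cheapest-first, stop at the
# first response that satisfies the contract, and track what was spent.
PRICES = { "gpt-4.1-nano" => 0.0001, "gpt-4.1-mini" => 0.0004, "gpt-4.1" => 0.0021 }.freeze

def escalate(input, models:, contract:, llm:)
  cost = 0.0
  models.each do |model|
    output = llm.call(model, input)
    cost += PRICES.fetch(model)
    return { output: output, model: model, cost: cost } if contract.call(output)
  end
  { output: nil, model: nil, cost: cost }
end

# Stub LLM: nano returns an invalid priority, mini returns a valid one.
fake_llm = ->(model, _input) { model == "gpt-4.1-nano" ? "unknown" : "high" }
contract = ->(out) { %w[low medium high urgent].include?(out) }

result = escalate("I was charged twice", models: PRICES.keys, contract: contract, llm: fake_llm)
result[:model]          # => "gpt-4.1-mini"  (gpt-4.1 never called)
result[:cost].round(4)  # => 0.0005
```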
134
+ ## Know which model to use — with data
135
+
136
+ Don't guess. Define test cases, compare models, get numbers:
56
137
 
57
138
  ```ruby
58
139
  ClassifyTicket.define_eval("regression") do
@@ -62,171 +143,120 @@ ClassifyTicket.define_eval("regression") do
62
143
  end
63
144
 
64
145
  comparison = ClassifyTicket.compare_models("regression",
65
- models: %w[gpt-4.1-nano gpt-4.1-mini])
146
+ models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1])
66
147
  ```
67
148
 
68
- Real output from real API calls:
69
-
70
149
  ```
71
- Model Score Cost Avg Latency
150
+ Candidate Score Cost Avg Latency
72
151
  ---------------------------------------------------------
73
- gpt-4.1-nano 0.67 $0.000032 687ms
74
- gpt-4.1-mini 1.00 $0.000102 1070ms
152
+ gpt-4.1-nano 0.67 $0.0001 48ms
153
+ gpt-4.1-mini 1.00 $0.0004 92ms
154
+ gpt-4.1 1.00 $0.0021 210ms
75
155
 
76
156
  Cheapest at 100%: gpt-4.1-mini
77
157
  ```
78
158
 
79
- ```ruby
80
- comparison.best_for(min_score: 0.95) # => "gpt-4.1-mini"
81
-
82
- # Inspect failures
83
- comparison.reports["gpt-4.1-nano"].failures.each do |f|
84
- puts "#{f.name}: expected #{f.expected}, got #{f.output}"
85
- puts " mismatches: #{f.mismatches}"
86
- # => outage: expected {priority: "urgent"}, got {priority: "high"}
87
- # mismatches: {priority: {expected: "urgent", got: "high"}}
88
- end
89
- ```
159
+ Nano fails on edge cases. Mini and full both score 100% — but mini is **5x cheaper**. Now you know.
90
160
 
91
- ## Pipeline
161
+ ## Let the gem tell you what to do
92
162
 
93
- Chain steps with fail-fast. Hallucination in step 1 stops before step 2 spends tokens.
163
+ Don't read tables; get a recommendation. Supports `model` + `reasoning_effort` combinations:
94
164
 
95
165
  ```ruby
96
- class TicketPipeline < RubyLLM::Contract::Pipeline::Base
97
- step ClassifyTicket, as: :classify
98
- step RouteToTeam, as: :route
99
- step DraftResponse, as: :draft
100
- end
101
-
102
- result = TicketPipeline.run("I was charged twice")
103
- result.ok? # => true
104
- result.outputs_by_step[:classify] # => {priority: "high", category: "billing"}
105
- result.trace.total_cost # => 0.000128
166
+ rec = ClassifyTicket.recommend("regression",
167
+ candidates: [
168
+ { model: "gpt-4.1-nano" },
169
+ { model: "gpt-4.1-mini" },
170
+ { model: "gpt-5-mini", reasoning_effort: "low" },
171
+ { model: "gpt-5-mini", reasoning_effort: "high" },
172
+ ],
173
+ min_score: 0.95
174
+ )
175
+
176
+ rec.best # => { model: "gpt-4.1-mini" }
177
+ rec.retry_chain # => [{ model: "gpt-4.1-nano" }, { model: "gpt-4.1-mini" }]
178
+ rec.to_dsl # => "retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]"
179
+ rec.savings # => savings vs your current model (if configured)
106
180
  ```
107
181
 
108
- ## CI gate
109
-
110
- ```ruby
111
- # RSpec — block merge if accuracy drops or cost spikes
112
- expect(ClassifyTicket).to pass_eval("regression")
113
- .with_context(model: "gpt-4.1-mini")
114
- .with_minimum_score(0.8)
115
- .with_maximum_cost(0.01)
116
-
117
- # Rake — run all evals across all steps
118
- require "ruby_llm/contract/rake_task"
119
- RubyLLM::Contract::RakeTask.new do |t|
120
- t.minimum_score = 0.8
121
- t.maximum_cost = 0.05
122
- end
123
- # bundle exec rake ruby_llm_contract:eval
124
- ```
182
+ Copy `rec.to_dsl` into your step. Done.
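`to_dsl` picks the simplest DSL form that expresses the chain. A condensed sketch of that logic, mirroring the `Recommendation#to_dsl` shown in this release's source (simplified to a bare function):

```ruby
# One plain model => `model`; a chain of plain models => `retry_policy models:`;
# any config with extra keys (e.g. reasoning_effort) => an `escalate` block.
def to_dsl(retry_chain)
  return "# No recommendation" if retry_chain.empty?

  if retry_chain.length == 1 && retry_chain.first.keys == [:model]
    "model \"#{retry_chain.first[:model]}\""
  elsif retry_chain.all? { |c| c.keys == [:model] }
    "retry_policy models: %w[#{retry_chain.map { |c| c[:model] }.join(" ")}]"
  else
    args = retry_chain.map { |c| "{ #{c.map { |k, v| "#{k}: #{v.inspect}" }.join(", ")} }" }.join(", ")
    "retry_policy do\n  escalate(#{args})\nend"
  end
end

to_dsl([{ model: "gpt-4.1-nano" }, { model: "gpt-4.1-mini" }])
# => "retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]"
```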
125
183
 
126
- ## Detect quality drops
184
+ ## Catch regressions before users do
127
185
 
128
- Save a baseline. Next run, see what regressed.
186
+ A model update silently dropped your accuracy? A prompt tweak broke an edge case? You'll know before deploying:
129
187
 
130
188
  ```ruby
189
+ # Save a baseline once:
131
190
  report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
132
191
  report.save_baseline!(model: "gpt-4.1-nano")
133
192
 
134
- # Laterafter prompt change, model update, or provider weight shift:
135
- report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
136
- diff = report.compare_with_baseline(model: "gpt-4.1-nano")
137
-
138
- diff.regressed? # => true
139
- diff.regressions # => [{case: "outage", baseline: {passed: true}, current: {passed: false}}]
140
- diff.score_delta # => -0.33
141
- ```
142
-
143
- ```ruby
144
- # CI: block merge if any previously-passing case now fails
193
+ # In CI, block merge if anything regressed:
145
194
  expect(ClassifyTicket).to pass_eval("regression")
146
195
  .with_context(model: "gpt-4.1-nano")
147
196
  .without_regressions
148
197
  ```
149
198
 
150
- ## Track quality over time
151
-
152
199
  ```ruby
153
- # Save every eval run
154
- report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
155
- report.save_history!(model: "gpt-4.1-nano")
156
-
157
- # View trend
158
- history = report.eval_history(model: "gpt-4.1-nano")
159
- history.score_trend # => :stable_or_improving | :declining
160
- history.drift? # => true (score dropped > 10%)
200
+ diff = report.compare_with_baseline(model: "gpt-4.1-nano")
201
+ diff.regressed? # => true
202
+ diff.regressions # => [{case: "outage", baseline: {passed: true}, current: {passed: false}}]
203
+ diff.score_delta # => -0.33
161
204
  ```
162
205
 
163
- ## Run evals fast
206
+ No more "it worked in the playground". Regressions are caught in CI, not production.
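At its core the baseline comparison is a per-case diff of pass/fail. A minimal hash-based illustration (not the gem's `Report` objects, which also carry scores and timing):

```ruby
# A case regresses when it passed in the baseline but fails now.
# Scores here are simply the fraction of passing cases.
def diff_runs(baseline, current)
  regressions = baseline.keys.select { |name| baseline[name] && !current[name] }
  score = ->(run) { run.values.count(true).to_f / run.size }
  { regressions: regressions,
    score_delta: (score.call(current) - score.call(baseline)).round(2) }
end

baseline = { "billing" => true, "refund" => true, "outage" => true }
current  = { "billing" => true, "refund" => true, "outage" => false }

d = diff_runs(baseline, current)
d[:regressions] # => ["outage"]
d[:score_delta] # => -0.33
```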
164
207
 
165
- ```ruby
166
- # 4x faster with parallel execution
167
- report = ClassifyTicket.run_eval("regression",
168
- context: { model: "gpt-4.1-nano" },
169
- concurrency: 4)
170
- ```
171
-
172
- ## Prompt A/B testing
208
+ ## A/B test your prompts
173
209
 
174
- Changed a prompt? Compare old vs new with regression safety:
210
+ Changed a prompt? Compare old vs new on the same dataset with regression safety:
175
211
 
176
212
  ```ruby
177
213
  diff = ClassifyTicketV2.compare_with(ClassifyTicketV1,
178
214
  eval: "regression", model: "gpt-4.1-mini")
179
215
 
180
- diff.safe_to_switch? # => true (no regressions, no per-case score drops)
216
+ diff.safe_to_switch? # => true (no regressions)
181
217
  diff.improvements # => [{case: "outage", ...}]
182
218
  diff.score_delta # => +0.33
183
219
  ```
184
220
 
185
- Requires `model:` or `context: { adapter: ... }`.
186
- `compare_with` ignores `sample_response`; without a real model/adapter both sides are skipped and the A/B result is not meaningful.
187
-
188
- CI gate:
189
221
  ```ruby
222
+ # CI gate:
190
223
  expect(ClassifyTicketV2).to pass_eval("regression")
191
224
  .compared_with(ClassifyTicketV1)
192
225
  .with_minimum_score(0.8)
193
226
  ```
194
227
 
195
- ## Soft observations
228
+ ## Chain steps with fail-fast
196
229
 
197
- Log suspicious-but-not-invalid output without failing the contract:
230
+ Pipeline stops at the first contract failure — no wasted tokens on downstream steps:
198
231
 
199
232
  ```ruby
200
- class EvaluateComparative < RubyLLM::Contract::Step::Base
201
- validate("scores in range") { |o| (1..10).include?(o[:score_a]) }
202
- observe("scores should differ") { |o| o[:score_a] != o[:score_b] }
233
+ class TicketPipeline < RubyLLM::Contract::Pipeline::Base
234
+ step ClassifyTicket, as: :classify
235
+ step RouteToTeam, as: :route
236
+ step DraftResponse, as: :draft
203
237
  end
204
238
 
205
- result = EvaluateComparative.run(input)
206
- result.ok? # => true (observe never fails)
207
- result.observations # => [{description: "scores should differ", passed: false}]
239
+ result = TicketPipeline.run("I was charged twice")
240
+ result.outputs_by_step[:classify] # => {priority: "high", category: "billing"}
241
+ result.trace.total_cost # => 0.000128
208
242
  ```
209
243
 
210
- ## Predict cost before running
244
+ ## Gate merges on quality and cost
211
245
 
212
246
  ```ruby
213
- ClassifyTicket.estimate_eval_cost("regression", models: %w[gpt-4.1-nano gpt-4.1-mini])
214
- # => { "gpt-4.1-nano" => 0.000024, "gpt-4.1-mini" => 0.000096 }
215
- ```
216
-
217
- ## Install
218
-
219
- ```ruby
220
- gem "ruby_llm-contract"
221
- ```
247
+ # RSpec: block merge if accuracy drops or cost spikes
248
+ expect(ClassifyTicket).to pass_eval("regression")
249
+ .with_minimum_score(0.8)
250
+ .with_maximum_cost(0.01)
222
251
 
223
- ```ruby
224
- RubyLLM.configure { |c| c.openai_api_key = ENV["OPENAI_API_KEY"] }
225
- RubyLLM::Contract.configure { |c| c.default_model = "gpt-4.1-mini" }
252
+ # Rake — run all evals across all steps
253
+ RubyLLM::Contract::RakeTask.new do |t|
254
+ t.minimum_score = 0.8
255
+ t.maximum_cost = 0.05
256
+ end
257
+ # bundle exec rake ruby_llm_contract:eval
226
258
  ```
227
259
 
228
- Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
229
-
230
260
  ## Docs
231
261
 
232
262
  | Guide | |
@@ -241,13 +271,13 @@ Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
241
271
 
242
272
  ## Roadmap
243
273
 
244
- **v0.5 (current):** Data-driven prompt engineering — `compare_with(OtherStep)` for prompt A/B testing with regression safety. `observe` DSL for soft observations that log but never fail.
274
+ **v0.6 (current):** "What should I do?" — `Step.recommend` returns optimal model, reasoning effort, and retry chain. Per-attempt `reasoning_effort` in retry policies.
245
275
 
246
- **v0.4:** Observability & scale — eval history with trending, batch eval with concurrency, pipeline per-step eval, Minitest support, structured logging. Audit hardening (18 bugfixes).
276
+ **v0.5:** Prompt A/B testing with `compare_with`. Soft observations with `observe`.
247
277
 
248
- **v0.3:** Baseline regression detection, migration guide, production hardening.
278
+ **v0.4:** Eval history, batch concurrency, pipeline per-step eval, Minitest, structured logging.
249
279
 
250
- **v0.6:** Model recommendation based on eval history data. Cross-provider comparison docs.
280
+ **v0.3:** Baseline regression detection, migration guide.
251
281
 
252
282
  ## License
253
283
 
@@ -70,18 +70,37 @@ module RubyLLM
70
70
  Eval::PromptDiff.new(candidate: my_report, baseline: other_report)
71
71
  end
72
72
 
73
- def compare_models(eval_name, models:, context: {})
73
+ def compare_models(eval_name, models: [], candidates: [], context: {})
74
+ raise ArgumentError, "Pass either models: or candidates:, not both" if models.any? && candidates.any?
75
+
74
76
  context = safe_context(context)
75
- models = models.uniq
76
- reports = models.each_with_object({}) do |model, hash|
77
- model_context = isolate_context(context).merge(model: model)
78
- hash[model] = run_single_eval(eval_name, model_context)
77
+ candidate_configs = normalize_candidates(models, candidates)
78
+
79
+ reports = {}
80
+ configs = {}
81
+ candidate_configs.each do |config|
82
+ label = Eval::ModelComparison.candidate_label(config)
83
+ model_context = isolate_context(context).merge(model: config[:model])
84
+ model_context[:reasoning_effort] = config[:reasoning_effort] if config[:reasoning_effort]
85
+ reports[label] = run_single_eval(eval_name, model_context)
86
+ configs[label] = config
79
87
  end
80
- Eval::ModelComparison.new(eval_name: eval_name, reports: reports)
88
+
89
+ Eval::ModelComparison.new(eval_name: eval_name, reports: reports, configs: configs)
81
90
  end
82
91
 
83
92
  private
84
93
 
94
+ def normalize_candidates(models, candidates)
95
+ if candidates.any?
96
+ candidates.map { |c| RubyLLM::Contract.normalize_candidate_config(c) }.uniq
97
+ elsif models.any?
98
+ models.uniq.map { |m| { model: m } }
99
+ else
100
+ raise ArgumentError, "Pass models: or candidates: with at least one entry"
101
+ end
102
+ end
103
+
85
104
  def comparison_context(context, model)
86
105
  base_context = safe_context(context)
87
106
  model ? base_context.merge(model: model) : base_context
@@ -4,11 +4,17 @@ module RubyLLM
4
4
  module Contract
5
5
  module Eval
6
6
  class ModelComparison
7
- attr_reader :eval_name, :reports
7
+ attr_reader :eval_name, :reports, :configs
8
8
 
9
- def initialize(eval_name:, reports:)
9
+ def self.candidate_label(config)
10
+ effort = config[:reasoning_effort]
11
+ effort ? "#{config[:model]} (effort: #{effort})" : config[:model]
12
+ end
13
+
14
+ def initialize(eval_name:, reports:, configs: nil)
10
15
  @eval_name = eval_name
11
- @reports = reports.dup.freeze # { "model_name" => Report }
16
+ @reports = reports.dup.freeze
17
+ @configs = (configs || default_configs_from_reports).freeze
12
18
  freeze
13
19
  end
14
20
 
@@ -16,12 +22,12 @@ module RubyLLM
16
22
  @reports.keys
17
23
  end
18
24
 
19
- def score_for(model)
20
- @reports[model]&.score
25
+ def score_for(candidate)
26
+ @reports[resolve_key(candidate)]&.score
21
27
  end
22
28
 
23
- def cost_for(model)
24
- @reports[model]&.total_cost
29
+ def cost_for(candidate)
30
+ @reports[resolve_key(candidate)]&.total_cost
25
31
  end
26
32
 
27
33
  def best_for(min_score: 0.0)
@@ -38,13 +44,14 @@ module RubyLLM
38
44
  end
39
45
 
40
46
  def table
41
- lines = [" Model Score Cost Avg Latency"]
42
- lines << " #{"-" * 57}"
47
+ max_label = [@reports.keys.map(&:length).max || 0, 25].max
48
+ lines = [format(" %-#{max_label}s Score Cost Avg Latency", "Candidate")]
49
+ lines << " #{"-" * (max_label + 36)}"
43
50
 
44
- @reports.each do |model, report|
51
+ @reports.each do |label, report|
45
52
  latency = report.avg_latency_ms ? "#{report.avg_latency_ms.round}ms" : "n/a"
46
53
  cost = report.total_cost.positive? ? "$#{format("%.4f", report.total_cost)}" : "n/a"
47
- lines << format(" %-25s %6.2f %10s %12s", model, report.score, cost, latency)
54
+ lines << format(" %-#{max_label}s %6.2f %10s %12s", label, report.score, cost, latency)
48
55
  end
49
56
 
50
57
  lines.join("\n")
@@ -70,10 +77,25 @@ module RubyLLM
70
77
  total_cost: report.total_cost,
71
78
  avg_latency_ms: report.avg_latency_ms,
72
79
  pass_rate: report.pass_rate,
80
+ pass_rate_ratio: report.pass_rate_ratio,
73
81
  passed: report.passed?
74
82
  }
75
83
  end
76
84
  end
85
+
86
+ private
87
+
88
+ def resolve_key(candidate)
89
+ case candidate
90
+ when String then candidate
91
+ when Hash then self.class.candidate_label(candidate)
92
+ else candidate.to_s
93
+ end
94
+ end
95
+
96
+ def default_configs_from_reports
97
+ @reports.each_with_object({}) { |(key, _), h| h[key] = { model: key } }
98
+ end
77
99
  end
78
100
  end
79
101
  end
@@ -0,0 +1,48 @@
1
+ # frozen_string_literal: true
2
+
3
+ module RubyLLM
4
+ module Contract
5
+ module Eval
6
+ class Recommendation
7
+ include Concerns::DeepFreeze
8
+
9
+ attr_reader :best, :retry_chain, :score, :cost_per_call,
10
+ :rationale, :current_config, :savings, :warnings
11
+
12
+ def initialize(best:, retry_chain:, score:, cost_per_call:,
13
+ rationale:, current_config:, savings:, warnings:)
14
+ @best = deep_dup_freeze(best)
15
+ @retry_chain = deep_dup_freeze(retry_chain)
16
+ @score = score
17
+ @cost_per_call = cost_per_call
18
+ @rationale = deep_dup_freeze(rationale)
19
+ @current_config = deep_dup_freeze(current_config)
20
+ @savings = deep_dup_freeze(savings)
21
+ @warnings = deep_dup_freeze(warnings)
22
+ freeze
23
+ end
24
+
25
+ def to_dsl
26
+ return "# No recommendation — no candidate met the minimum score" if retry_chain.empty?
27
+
28
+ if retry_chain.length == 1 && retry_chain.first.keys == [:model]
29
+ "model \"#{retry_chain.first[:model]}\""
30
+ elsif retry_chain.all? { |c| c.keys == [:model] }
31
+ models_str = retry_chain.map { |c| c[:model] }.join(" ")
32
+ "retry_policy models: %w[#{models_str}]"
33
+ else
34
+ args = retry_chain.map { |c| config_to_ruby(c) }.join(",\n ")
35
+ "retry_policy do\n escalate(#{args})\nend"
36
+ end
37
+ end
38
+
39
+ private
40
+
41
+ def config_to_ruby(config)
42
+ pairs = config.map { |k, v| "#{k}: #{v.inspect}" }.join(", ")
43
+ "{ #{pairs} }"
44
+ end
45
+ end
46
+ end
47
+ end
48
+ end
@@ -0,0 +1,132 @@
1
+ # frozen_string_literal: true
2
+
3
+ module RubyLLM
4
+ module Contract
5
+ module Eval
6
+ class Recommender
7
+ def initialize(comparison:, min_score:, min_first_try_pass_rate: 0.8, current_config: nil)
8
+ @comparison = comparison
9
+ @min_score = min_score
10
+ @min_first_try_pass_rate = min_first_try_pass_rate
11
+ @current_config = current_config
12
+ end
13
+
14
+ def recommend
15
+ scored = build_scored_candidates
16
+ best = select_best(scored)
17
+ chain = build_retry_chain(scored, best)
18
+ rationale = build_rationale(scored, best)
19
+ warnings = build_warnings(scored)
20
+ savings = best ? calculate_savings(best) : {}
21
+
22
+ Recommendation.new(
23
+ best: best&.dig(:config),
24
+ retry_chain: chain,
25
+ score: best&.dig(:score) || 0.0,
26
+ cost_per_call: best&.dig(:cost_per_call) || 0.0,
27
+ rationale: rationale,
28
+ current_config: @current_config,
29
+ savings: savings,
30
+ warnings: warnings
31
+ )
32
+ end
33
+
34
+ private
35
+
36
+ def build_scored_candidates
37
+ @comparison.configs.filter_map do |label, config|
38
+ report = @comparison.reports[label]
39
+ next nil unless report
40
+
41
+ evaluated_count = report.results.count { |r| r.step_status != :skipped }
42
+ cases_count = [evaluated_count, 1].max
43
+ cost_per_call = report.total_cost.to_f / cases_count
44
+
45
+ {
46
+ label: label,
47
+ config: config,
48
+ score: report.score,
49
+ cost_per_call: cost_per_call,
50
+ latency: report.avg_latency_ms || Float::INFINITY,
51
+ pass_rate_ratio: report.pass_rate_ratio,
52
+ total_cost: report.total_cost
53
+ }
54
+ end
55
+ end
56
+
57
+ def select_best(scored)
58
+ eligible = scored.select { |s| s[:score] >= @min_score && cost_known?(s) }
59
+ eligible.min_by { |s| [s[:cost_per_call], s[:latency], s[:label]] }
60
+ end
61
+
62
+ def build_retry_chain(scored, best)
63
+ return [] unless best
64
+
65
+ first_try = scored
66
+ .select { |s| s[:pass_rate_ratio] >= @min_first_try_pass_rate && cost_known?(s) }
67
+ .min_by { |s| [s[:cost_per_call], s[:latency], s[:label]] }
68
+
69
+ if first_try.nil? || first_try[:label] == best[:label]
70
+ [best[:config]]
71
+ else
72
+ [first_try[:config], best[:config]]
73
+ end
74
+ end
75
+
76
+ def build_rationale(scored, best)
77
+ sorted = scored.sort_by { |s| [cost_known?(s) ? 0 : 1, s[:cost_per_call], s[:latency], s[:label]] }
78
+ sorted.map { |s| rationale_line(s, best) }
79
+ end
80
+
81
+ def rationale_line(candidate, best)
82
+ cost_str = cost_known?(candidate) ? "$#{format("%.4f", candidate[:cost_per_call])}/call" : "unknown pricing"
83
+ header = "#{candidate[:label]}, score #{format("%.2f", candidate[:score])}, at #{cost_str}"
84
+ notes = rationale_notes(candidate, best)
85
+ notes.any? ? "#{header} — #{notes.join(", ")}" : header
86
+ end
87
+
88
+ def rationale_notes(candidate, best)
89
+ notes = []
90
+ pass_pct = (candidate[:pass_rate_ratio] * 100).round
91
+ below_threshold = candidate[:score] < @min_score
92
+
93
+ if below_threshold && candidate[:pass_rate_ratio] >= @min_first_try_pass_rate
94
+ notes << "below #{@min_score} threshold, but good first-try (#{pass_pct}% pass rate)"
95
+ elsif below_threshold
96
+ notes << "below #{@min_score} threshold"
97
+ elsif candidate[:pass_rate_ratio] < 1.0
98
+ notes << "#{pass_pct}% pass rate"
99
+ end
100
+ notes << "recommended" if best && candidate[:label] == best[:label]
101
+ notes << "unknown pricing" unless cost_known?(candidate)
102
+ notes
103
+ end
104
+
105
+ def build_warnings(scored)
106
+ scored.reject { |s| cost_known?(s) }
107
+ .map { |s| "#{s[:label]}: unknown pricing — cost ranking may be inaccurate" }
108
+ end
109
+
110
+ def calculate_savings(best)
111
+ return {} unless @current_config
112
+
113
+ current_label = ModelComparison.candidate_label(@current_config)
114
+ current_report = @comparison.reports[current_label]
115
+ return {} unless current_report
116
+
117
+ current_evaluated = current_report.results.count { |r| r.step_status != :skipped }
118
+ current_cases = [current_evaluated, 1].max
119
+ current_cost = current_report.total_cost.to_f / current_cases
120
+ diff = current_cost - best[:cost_per_call]
121
+ return {} unless diff.positive?
122
+
123
+ { per_call: diff.round(6), monthly_at: { 10_000 => (diff * 10_000).round(2) } }
124
+ end
125
+
126
+ def cost_known?(scored_candidate)
127
+ scored_candidate[:cost_per_call]&.positive?
128
+ end
129
+ end
130
+ end
131
+ end
132
+ end
@@ -14,8 +14,8 @@ module RubyLLM
  HISTORY_DIR = ".eval_history"
  BASELINE_DIR = ".eval_baselines"
 
- def_delegators :@stats, :score, :passed, :failed, :skipped, :failures, :pass_rate, :total_cost, :avg_latency_ms,
- :passed?
+ def_delegators :@stats, :score, :passed, :failed, :skipped, :failures, :pass_rate, :pass_rate_ratio,
+ :total_cost, :avg_latency_ms, :passed?
  def_delegators :@presenter, :summary, :to_s, :print_summary
  def_delegators :@storage, :save_history!, :eval_history, :save_baseline!, :compare_with_baseline,
  :baseline_exists?
@@ -35,6 +35,12 @@ module RubyLLM
  "#{passed}/#{evaluated_results.length}"
  end
 
+ def pass_rate_ratio
+ return 0.0 if evaluated_results.empty?
+
+ passed.to_f / evaluated_results.length
+ end
+
  def total_cost
  @results.sum { |result| result.cost || 0.0 }
  end
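In the hunk above, `pass_rate` remains a display string (`"passed/total"`), while the new `pass_rate_ratio` is a 0.0–1.0 float suitable for threshold comparisons such as `min_first_try_pass_rate`. A standalone sketch of the two representations (helper signatures are illustrative):

```ruby
# Illustrative sketch: string form for display, ratio form for comparisons.
def pass_rate(passed, evaluated)
  "#{passed}/#{evaluated}"
end

def pass_rate_ratio(passed, evaluated)
  return 0.0 if evaluated.zero?   # empty eval => 0.0, never NaN

  passed.to_f / evaluated
end
```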
@@ -13,25 +13,29 @@ module RubyLLM
  @stats = stats
  end
 
- def save_history!(path: nil, model: nil)
- file = path || storage_path(Report::HISTORY_DIR, "jsonl", model: model)
- EvalHistory.append(file, history_entry)
+ def save_history!(path: nil, model: nil, reasoning_effort: nil)
+ file = path || storage_path(Report::HISTORY_DIR, "jsonl", model: model, reasoning_effort: reasoning_effort)
+ entry = history_entry
+ entry[:model] = model if model
+ entry[:reasoning_effort] = reasoning_effort if reasoning_effort
+ EvalHistory.append(file, entry)
  file
  end
 
- def eval_history(path: nil, model: nil)
- EvalHistory.load(path || storage_path(Report::HISTORY_DIR, "jsonl", model: model))
+ def eval_history(path: nil, model: nil, reasoning_effort: nil)
+ EvalHistory.load(path || storage_path(Report::HISTORY_DIR, "jsonl", model: model,
+ reasoning_effort: reasoning_effort))
  end
 
- def save_baseline!(path: nil, model: nil)
- file = path || storage_path(Report::BASELINE_DIR, "json", model: model)
+ def save_baseline!(path: nil, model: nil, reasoning_effort: nil)
+ file = path || storage_path(Report::BASELINE_DIR, "json", model: model, reasoning_effort: reasoning_effort)
  FileUtils.mkdir_p(File.dirname(file))
  File.write(file, JSON.pretty_generate(serialize_for_baseline))
  file
  end
 
- def compare_with_baseline(path: nil, model: nil)
- file = path || storage_path(Report::BASELINE_DIR, "json", model: model)
+ def compare_with_baseline(path: nil, model: nil, reasoning_effort: nil)
+ file = path || storage_path(Report::BASELINE_DIR, "json", model: model, reasoning_effort: reasoning_effort)
  raise ArgumentError, "No baseline found at #{file}" unless File.exist?(file)
 
  baseline_data = JSON.parse(File.read(file), symbolize_names: true)
@@ -43,8 +47,8 @@ module RubyLLM
  )
  end
 
- def baseline_exists?(path: nil, model: nil)
- File.exist?(path || storage_path(Report::BASELINE_DIR, "json", model: model))
+ def baseline_exists?(path: nil, model: nil, reasoning_effort: nil)
+ File.exist?(path || storage_path(Report::BASELINE_DIR, "json", model: model, reasoning_effort: reasoning_effort))
  end
 
  private
@@ -55,6 +59,7 @@ module RubyLLM
  score: @stats.score,
  total_cost: @stats.total_cost,
  pass_rate: @stats.pass_rate,
+ pass_rate_ratio: @stats.pass_rate_ratio,
  cases_count: @stats.evaluated_results_count
  }
  end
@@ -79,12 +84,13 @@ module RubyLLM
  }
  end
 
- def storage_path(root_dir, extension, model:)
+ def storage_path(root_dir, extension, model:, reasoning_effort: nil)
  parts = [root_dir]
  parts << sanitize_name(@report.step_name) if @report.step_name
 
  dataset_name = sanitize_name(@report.dataset_name)
  dataset_name = "#{dataset_name}_#{sanitize_name(model)}" if model
+ dataset_name = "#{dataset_name}_effort_#{sanitize_name(reasoning_effort)}" if reasoning_effort
 
  File.join(*parts, "#{dataset_name}.#{extension}")
  end
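With `reasoning_effort` folded into the storage key above, histories and baselines for the same model at different efforts (e.g. `gpt-5-mini` at `low` vs `high`) no longer overwrite each other. A sketch of the filename scheme; the sanitizer here is an assumption (the gem's `sanitize_name` may differ), as is the flattened signature:

```ruby
# Illustrative sketch of the storage-path scheme: dataset name, optional
# model suffix, optional "_effort_<level>" suffix, joined under the root dir.
def sanitize_name(name)
  name.to_s.downcase.gsub(/[^a-z0-9]+/, "_")   # assumption: not the gem's exact rule
end

def storage_path(root_dir, extension, dataset:, model: nil, reasoning_effort: nil)
  name = sanitize_name(dataset)
  name = "#{name}_#{sanitize_name(model)}" if model
  name = "#{name}_effort_#{sanitize_name(reasoning_effort)}" if reasoning_effort
  File.join(root_dir, "#{name}.#{extension}")
end
```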
@@ -29,3 +29,5 @@ require_relative "eval/prompt_diff_comparator"
  require_relative "eval/prompt_diff_presenter"
  require_relative "eval/prompt_diff"
  require_relative "eval/eval_history"
+ require_relative "eval/recommendation"
+ require_relative "eval/recommender"
@@ -49,6 +49,16 @@ module RubyLLM
  end
  end
 
+ def recommend(eval_name, candidates:, min_score: 0.95, min_first_try_pass_rate: 0.8, context: {})
+ comparison = compare_models(eval_name, candidates: candidates, context: context)
+ Eval::Recommender.new(
+ comparison: comparison,
+ min_score: min_score,
+ min_first_try_pass_rate: min_first_try_pass_rate,
+ current_config: current_model_config
+ ).recommend
+ end
+
  KNOWN_CONTEXT_KEYS = %i[adapter model temperature max_tokens provider assume_model_exists reasoning_effort].freeze
 
  include Concerns::ContextHelpers
@@ -144,6 +154,17 @@ module RubyLLM
  }
  end
 
+ def current_model_config
+ policy = retry_policy
+ if policy && policy.config_list.any?
+ policy.config_list.first
+ elsif respond_to?(:model) && model
+ { model: model }
+ elsif RubyLLM::Contract.configuration.default_model
+ { model: RubyLLM::Contract.configuration.default_model }
+ end
+ end
+
  def resolve_adapter(context)
  adapter = context[:adapter] || RubyLLM::Contract.configuration.default_adapter
  return adapter if adapter
@@ -10,12 +10,16 @@ module RubyLLM
 
  def run_with_retry(input, adapter:, default_model:, policy:, context_temperature: nil, extra_options: {})
  all_attempts = []
+ default_config = { model: default_model }.merge(extra_options.slice(:reasoning_effort).compact)
 
  policy.max_attempts.times do |attempt_index|
- model = policy.model_for_attempt(attempt_index, default_model)
+ config = policy.config_for_attempt(attempt_index, default_config)
+ model = config[:model]
+ attempt_extra = extra_options.merge(config.except(:model))
+
  result = run_once(input, adapter: adapter, model: model,
- context_temperature: context_temperature, extra_options: extra_options)
- all_attempts << { attempt: attempt_index + 1, model: model, result: result }
+ context_temperature: context_temperature, extra_options: attempt_extra)
+ all_attempts << { attempt: attempt_index + 1, model: model, config: config, result: result }
  break unless policy.retryable?(result)
  end
 
@@ -43,6 +47,8 @@ module RubyLLM
  def build_attempt_entry(attempt)
  trace = attempt[:result].trace
  entry = { attempt: attempt[:attempt], model: attempt[:model], status: attempt[:result].status }
+ config = attempt[:config]
+ entry[:config] = config if config && config.keys != [:model]
  append_trace_fields(entry, trace)
  end
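In the hunk above, each attempt now carries a full config hash, so a per-attempt option like `reasoning_effort` overrides the caller's extras while `:model` is passed separately to `run_once`. A sketch of that merge (function name is illustrative; `reject` mirrors `config.except(:model)`):

```ruby
# Illustrative sketch of the per-attempt option merge: the attempt's config
# wins over caller-supplied extras; :model travels separately.
def attempt_options(extra_options, config)
  extra_options.merge(config.reject { |k, _| k == :model })
end
```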
@@ -9,13 +9,13 @@ module RubyLLM
  DEFAULT_RETRY_ON = %i[validation_failed parse_error adapter_error].freeze
 
  def initialize(models: nil, attempts: nil, retry_on: nil, &block)
- @models = []
+ @configs = []
  @retryable_statuses = DEFAULT_RETRY_ON.dup
 
  if block
  @max_attempts = 1
  instance_eval(&block)
- warn_no_retry! if @max_attempts == 1 && @models.empty?
+ warn_no_retry! if @max_attempts == 1 && @configs.empty?
  else
  apply_keywords(models: models, attempts: attempts, retry_on: retry_on)
  end
@@ -28,14 +28,18 @@ module RubyLLM
  validate_max_attempts!
  end
 
- def escalate(*model_list)
- @models = model_list.flatten
- @max_attempts = @models.length if @max_attempts < @models.length
+ def escalate(*config_list)
+ @configs = config_list.flatten.map { |c| normalize_config(c).freeze }.freeze
+ @max_attempts = @configs.length if @max_attempts < @configs.length
  end
  alias models escalate
 
  def model_list
- @models
+ @configs.map { |c| c[:model] }.freeze
+ end
+
+ def config_list
+ @configs
  end
 
  def retry_on(*statuses)
@@ -46,29 +50,38 @@ module RubyLLM
  retryable_statuses.include?(result.status)
  end
 
- def model_for_attempt(attempt, default_model)
- if @models.any?
- @models[attempt] || @models.last
+ def config_for_attempt(attempt, default_config)
+ if @configs.any?
+ @configs[attempt] || @configs.last
  else
- default_model
+ default_config
  end
  end
 
+ def model_for_attempt(attempt, default_model)
+ config_for_attempt(attempt, { model: default_model })[:model]
+ end
+
  private
 
  def apply_keywords(models:, attempts:, retry_on:)
  if models
- @models = Array(models).dup.freeze
- @max_attempts = @models.length
+ @configs = Array(models).map { |m| normalize_config(m).freeze }.freeze
+ @max_attempts = @configs.length
  else
  @max_attempts = attempts || 1
  end
  @retryable_statuses = Array(retry_on).dup if retry_on
  end
 
+ def normalize_config(entry)
+ RubyLLM::Contract.normalize_candidate_config(entry)
+ end
+
  def warn_no_retry!
- warn "[ruby_llm-contract] retry_policy has max_attempts=1 with no models. " \
- "This means no actual retry will happen. Add `attempts 2` or `escalate %w[model1 model2]`."
+ warn "[ruby_llm-contract] retry_policy has max_attempts=1 with no configs. " \
+ "This means no actual retry will happen. Add `attempts 2` or " \
+ '`escalate "model1", "model2"`.'
  end
 
  def validate_max_attempts!
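`config_for_attempt` above keeps the old escalation-chain semantics: attempts beyond the end of the configured list reuse the last entry, and an empty list falls back to the default config. A standalone sketch (configs passed as a parameter rather than an ivar, purely for illustration):

```ruby
# Illustrative sketch of the escalation lookup: index into the chain,
# clamp to the last entry, fall back to the default when no chain is set.
def config_for_attempt(configs, attempt, default_config)
  return default_config if configs.empty?

  configs[attempt] || configs.last
end
```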
@@ -2,6 +2,6 @@
 
  module RubyLLM
  module Contract
- VERSION = "0.5.2"
+ VERSION = "0.6.0"
  end
  end
@@ -78,6 +78,27 @@ module RubyLLM
  Thread.current[:ruby_llm_contract_reloading] = false
  end
 
+ def normalize_candidate_config(entry)
+ case entry
+ when String
+ raise ArgumentError, "Candidate model must be a non-empty String" if entry.strip.empty?
+
+ { model: entry.strip }
+ when Hash
+ model = entry[:model] || entry["model"]
+ unless model.is_a?(String) && !model.strip.empty?
+ raise ArgumentError, "Candidate config must include a non-empty String :model"
+ end
+
+ normalized = { model: model.strip }
+ effort = entry[:reasoning_effort] || entry["reasoning_effort"]
+ normalized[:reasoning_effort] = effort if effort
+ normalized
+ else
+ raise ArgumentError, "Expected String or Hash, got #{entry.class}"
+ end
+ end
+
  private
 
  # Filter stale hosts, deduplicate by name (last wins), prune registry in-place
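The normalization above accepts candidates either as plain model strings or as hashes with string or symbol keys; symbol-keyed hashes with stripped model names come out the other side. A standalone copy for illustration (the method body mirrors the hunk above; calling it as a bare function is the illustrative part):

```ruby
# Standalone copy of the candidate-normalization rules, for illustration.
def normalize_candidate_config(entry)
  case entry
  when String
    raise ArgumentError, "Candidate model must be a non-empty String" if entry.strip.empty?

    { model: entry.strip }
  when Hash
    model = entry[:model] || entry["model"]
    unless model.is_a?(String) && !model.strip.empty?
      raise ArgumentError, "Candidate config must include a non-empty String :model"
    end

    normalized = { model: model.strip }
    effort = entry[:reasoning_effort] || entry["reasoning_effort"]
    normalized[:reasoning_effort] = effort if effort
    normalized
  else
    raise ArgumentError, "Expected String or Hash, got #{entry.class}"
  end
end

normalize_candidate_config("gpt-5-mini")
# => { model: "gpt-5-mini" }
normalize_candidate_config({ "model" => "gpt-5-mini", "reasoning_effort" => "low" })
# => { model: "gpt-5-mini", reasoning_effort: "low" }
```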
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: ruby_llm-contract
  version: !ruby/object:Gem::Version
- version: 0.5.2
+ version: 0.6.0
  platform: ruby
  authors:
  - Justyna
@@ -130,6 +130,8 @@ files:
  - lib/ruby_llm/contract/eval/prompt_diff_comparator.rb
  - lib/ruby_llm/contract/eval/prompt_diff_presenter.rb
  - lib/ruby_llm/contract/eval/prompt_diff_serializer.rb
+ - lib/ruby_llm/contract/eval/recommendation.rb
+ - lib/ruby_llm/contract/eval/recommender.rb
  - lib/ruby_llm/contract/eval/report.rb
  - lib/ruby_llm/contract/eval/report_presenter.rb
  - lib/ruby_llm/contract/eval/report_stats.rb