ruby_llm-contract 0.5.0 → 0.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 2e6861a314beadad8064d5fe08b85dc0f94032987ba41c27fc9a640788f10c28
-  data.tar.gz: 7efb88142ef8ac8287ed58f6cc9bc93ca78130084e14f8310981e7f90faf9943
+  metadata.gz: c86dcf06a34e5ff934367708213a0852191b0be7bd61e61e8577161e32ccf807
+  data.tar.gz: 28a44dd4f4d4c0f74c5da7800d9fa1fd2b8e51242a056f5aaae5f6f9fcf1be69
 SHA512:
-  metadata.gz: 1bd259ab22d13b2e7cc9848c401b91ebfaa135dc7502b72074fd28e71e6120f7ea183c09361948bf8d68df9cd5190e9450a3f03954bbd2927f8091c987368bb7
-  data.tar.gz: 904185a06d1def6513268033cbe87b30e3af9b92e0a26ec3cf89c93ad88450c4d49aa13d79c2428b1f7f6cc1cbffefeccf146b5a5c7ddf0057ab3db2f1b7dc8d
+  metadata.gz: c2b6dc71a02519288ae4ee4f74f17d74e6c45d01e888036d18e4c821b8fc63ba43ac0ccee6b2948163597b19c93008a6e5e0375fd65c8f3ad8fcf1285f356e91
+  data.tar.gz: f08576e520ec7397c4b233c5907ce703890cf4616e969e25f4c980ac1b06013a62ff680b3c90878d76fd1da0980f04d4946a2fd11a8e304e43f5155e5c855e8e
data/CHANGELOG.md CHANGED
@@ -1,8 +1,37 @@
 # Changelog
 
+## 0.6.0 (2026-04-12)
+
+"What should I do?" — model + configuration recommendation.
+
+### Features
+
+- **`Step.recommend`** — `ClassifyTicket.recommend("eval", candidates: [...], min_score: 0.95)` runs the eval on all candidates and returns a `Recommendation` with the optimal model, retry chain, rationale, savings vs the current config, and `to_dsl` code output.
+- **Candidates as configurations** — `candidates:` accepts `{ model:, reasoning_effort: }` hashes, not just model name strings. `gpt-5-mini` with `reasoning_effort: "low"` is a different candidate than with `"high"`.
+- **`compare_models` extended** — new `candidates:` parameter alongside the existing `models:` (backward compatible). Candidate labels include the reasoning effort in the output table.
+- **Per-attempt `reasoning_effort` in retry policies** — `escalate` accepts config hashes: `escalate({ model: "gpt-4.1-nano" }, { model: "gpt-5-mini", reasoning_effort: "high" })`. Each attempt gets its own `reasoning_effort` forwarded to the provider.
+- **`pass_rate_ratio`** — numeric float (0.0–1.0) on `Report` and `ReportStats`, complementing the string `pass_rate` (`"3/5"`).
+- **History entries enriched** — `save_history!` accepts `reasoning_effort:` and stores `model`, `reasoning_effort`, and `pass_rate_ratio` in JSONL entries.
+
+### Game-changer continuity
+
+```
+v0.2: "Which model?"            → compare_models (snapshot)
+v0.3: "Did it change?"          → baseline regression (binary)
+v0.4: "Show me the trend"       → eval history (time series)
+v0.5: "Which prompt is better?" → compare_with (A/B testing)
+v0.6: "What should I do?"       → recommend (actionable advice)
+```
+
+## 0.5.2 (2026-04-06)
+
+### Features
+
+- **`reasoning_effort` forwarded to provider** — `context: { reasoning_effort: "low" }` is now passed through `with_params` to the LLM. Previously it was accepted as a known context key but silently ignored by the RubyLLM adapter.
+
 ## 0.5.0 (2026-03-25)
 
-Data-Driven Prompt Engineering — see ADR-0015.
+Data-Driven Prompt Engineering.
 
 ### Features
 
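Editor's note: the "Candidates as configurations" bullet above relies on plain hash equality, which is why the same model with different efforts counts twice while exact duplicates collapse. A quick standalone sketch (plain Ruby, no gem APIs assumed):

```ruby
# Candidate configs are plain hashes, so equality is structural: the same
# model with different reasoning_effort values is two distinct candidates,
# while an exact duplicate collapses under uniq.
candidates = [
  { model: "gpt-5-mini", reasoning_effort: "low" },
  { model: "gpt-5-mini", reasoning_effort: "high" },
  { model: "gpt-5-mini", reasoning_effort: "low" } # exact duplicate
]

puts candidates.uniq.length # => 2
```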
@@ -49,7 +78,7 @@ Audit hardening — 18 bugs fixed across 4 audit rounds.
 
 ## 0.4.3 (2026-03-24)
 
-Production feedback release — driven by ADR-0016 (real Rails 8.1 deployment).
+Production feedback release.
 
 ### Features
 
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    ruby_llm-contract (0.5.0)
+    ruby_llm-contract (0.6.0)
       dry-types (~> 1.7)
       ruby_llm (~> 1.0)
       ruby_llm-schema (~> 0.3)
@@ -258,7 +258,7 @@ CHECKSUMS
   rubocop-ast (1.49.1) sha256=4412f3ee70f6fe4546cc489548e0f6fcf76cafcfa80fa03af67098ffed755035
   ruby-progressbar (1.13.0) sha256=80fc9c47a9b640d6834e0dc7b3c94c9df37f08cb072b7761e4a71e22cff29b33
   ruby_llm (1.14.0) sha256=57c6f7034fc4a44504ea137d70f853b07824f1c1cdbe774ab3ab3522e7098deb
-  ruby_llm-contract (0.5.0)
+  ruby_llm-contract (0.6.0)
   ruby_llm-schema (0.3.0) sha256=a591edc5ca1b7f0304f0e2261de61ba4b3bea17be09f5cf7558153adfda3dec6
   ruby_parser (3.22.0) sha256=1eb4937cd9eb220aa2d194e352a24dba90aef00751e24c8dfffdb14000f15d23
   rubycritic (4.12.0) sha256=024fed90fe656fa939f6ea80aab17569699ac3863d0b52fd72cb99892247abc8
data/README.md CHANGED
@@ -1,24 +1,89 @@
 # ruby_llm-contract
 
-Contracts for LLM quality. Know which model to use, what it costs, and when accuracy drops.
+The missing link between LLM cost and quality. Stop choosing between "cheap but wrong" and "accurate but expensive" — get both. Contracts, model escalation, and data-driven recommendations for [ruby_llm](https://github.com/crmne/ruby_llm).
 
-Companion gem for [ruby_llm](https://github.com/crmne/ruby_llm).
+```
+YOU WRITE                     THE GEM HANDLES                  YOU GET
+─────────                     ───────────────                  ───────
+
+validate { |o| ... }          catch bad answers — combined     Zero garbage
+                              with retry_policy, auto-retry    in production
+
+retry_policy                  start cheap, escalate only       Pay for the cheapest
+  models: %w[nano mini full]  when validation fails            model that works
+
+max_cost 0.01                 estimate tokens, check price,    No surprise bills
+                              refuse before calling LLM
+
+output_schema { ... }         send JSON schema to provider,    Zero parsing code
+                              validate response client-side
+
+define_eval { ... }           test cases + baselines,          Regressions caught
+                              run in CI with real LLM          before deploy
+
+recommend(candidates: [...])  evaluate all configs, pick       Optimal model +
+                              cheapest that passes             retry chain
+```
 
-## The problem
+## Before and after
 
-Which model should you use? The expensive one is accurate but costs 4x more. The cheap one is fast but hallucinates on edge cases. You tweak a prompt — did accuracy improve or drop? You have no data. Just gut feeling.
+```
+┌─────────────────────────────────────────────────────────────────┐
+│ BEFORE: pick one model, hope for the best                       │
+│                                                                 │
+│ expensive model → accurate, but you overpay on every call       │
+│ cheap model → fast, but wrong answers slip to production        │
+│ prompt change → "looks good to me" → deploy → users suffer      │
+└─────────────────────────────────────────────────────────────────┘
+
+                    ⬇ add ruby_llm-contract
+
+┌─────────────────────────────────────────────────────────────────┐
+│ YOU DEFINE A CONTRACT                                           │
+│                                                                 │
+│ output_schema { string :priority }         ← valid structure    │
+│ validate("valid priority") { |o| ... }     ← business rules     │
+│ retry_policy models: %w[nano mini full]    ← escalation chain   │
+│ max_cost 0.01                              ← budget cap         │
+└───────────────────────────┬─────────────────────────────────────┘
+
+
+┌─────────────────────────────────────────────────────────────────┐
+│ THE GEM HANDLES THE REST                                        │
+│                                                                 │
+│  request ──→ ┌──────┐     ┌──────────┐                          │
+│              │ nano │ ──→ │ contract │ ──→ ✓ pass → done        │
+│              └──────┘     └────┬─────┘                          │
+│                                │ ✗ fail                         │
+│                                ▼                                │
+│              ┌──────┐     ┌──────────┐                          │
+│              │ mini │ ──→ │ contract │ ──→ ✓ pass → done        │
+│              └──────┘     └────┬─────┘                          │
+│                                │ ✗ fail                         │
+│                                ▼                                │
+│              ┌──────┐     ┌──────────┐                          │
+│              │ full │ ──→ │ contract │ ──→ ✓ pass → done        │
+│              └──────┘     └──────────┘                          │
+└───────────────────────────┬─────────────────────────────────────┘
+
+
+┌─────────────────────────────────────────────────────────────────┐
+│ YOU GET                                                         │
+│                                                                 │
+│ ✓ Valid output guaranteed — schema + business rules enforced    │
+│ ✓ Cheapest model that works — most requests stay on nano        │
+│ ✓ Cost, latency, tokens — tracked on every call                 │
+│ ✓ Eval scores per model — data instead of gut feeling           │
+│ ✓ Regressions caught — before deploy, not after                 │
+│ ✓ Recommendation — "use nano+mini, drop full, save $X/mo"       │
+└─────────────────────────────────────────────────────────────────┘
+```
 
-## The fix
+## 30-second version
 
 ```ruby
 class ClassifyTicket < RubyLLM::Contract::Step::Base
-  prompt do
-    system "You are a support ticket classifier."
-    rule "Return valid JSON only, no markdown."
-    rule "Use exactly one priority: low, medium, high, urgent."
-    example input: "My invoice is wrong", output: '{"priority": "high"}'
-    user "{input}"
-  end
+  prompt "Classify this support ticket by priority and category.\n\n{input}"
 
   output_schema do
     string :priority, enum: %w[low medium high urgent]
@@ -30,17 +95,45 @@ class ClassifyTicket < RubyLLM::Contract::Step::Base
   end
 end
 
 result = ClassifyTicket.run("I was charged twice")
-result.ok?            # => true
-result.parsed_output  # => {priority: "high", category: "billing"}
-result.trace[:cost]   # => 0.000032
-result.trace[:model]  # => "gpt-4.1-nano"
+result.parsed_output  # => {priority: "high", category: "billing"}
+result.trace[:model]  # => "gpt-4.1-nano" (first model that passed)
+result.trace[:cost]   # => 0.000032
+```
+
+Bad JSON? Retried automatically. Wrong answer? Escalated to a smarter model. Schema violated? Caught client-side. The contract guarantees every response meets your rules — you pay for the cheapest model that passes.
+
+## Install
+
+```ruby
+gem "ruby_llm-contract"
+```
+
+```ruby
+RubyLLM.configure { |c| c.openai_api_key = ENV["OPENAI_API_KEY"] }
+RubyLLM::Contract.configure { |c| c.default_model = "gpt-4.1-mini" }
+```
+
+Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc.).
+
+## Save money with model escalation
+
+Without a contract, you use gpt-4.1 for everything because you can't tell when a cheaper model gets it wrong. With a contract, you start on nano and escalate only when the answer fails the contract:
+
+```ruby
+retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
 ```
 
-Bad JSON? Auto-retry. Wrong value? Escalate to a smarter model. Schema violated? Caught client-side even if the provider ignores it. All with cost tracking.
+```
+Attempt 1: gpt-4.1-nano → contract failed  ($0.0001)
+Attempt 2: gpt-4.1-mini → contract passed  ($0.0004)
+           gpt-4.1      → never called     ($0.00)
+```
+
+Most requests succeed on the cheapest model. You pay full price only for the ones that need it. How many? Run `compare_models` and find out.
 
-## Which model should I use?
+## Know which model to use — with data
 
-Define test cases. Compare models. Get data.
+Don't guess. Define test cases, compare models, get numbers:
 
 ```ruby
 ClassifyTicket.define_eval("regression") do
@@ -50,176 +143,126 @@ ClassifyTicket.define_eval("regression") do
   end
 
 comparison = ClassifyTicket.compare_models("regression",
-  models: %w[gpt-4.1-nano gpt-4.1-mini])
+  models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1])
 ```
 
-Real output from real API calls:
-
 ```
- Model                      Score       Cost  Avg Latency
+ Candidate                  Score       Cost  Avg Latency
  ---------------------------------------------------------
- gpt-4.1-nano                0.67  $0.000032        687ms
- gpt-4.1-mini                1.00  $0.000102       1070ms
+ gpt-4.1-nano                0.67    $0.0001         48ms
+ gpt-4.1-mini                1.00    $0.0004         92ms
+ gpt-4.1                     1.00    $0.0021        210ms
 
  Cheapest at 100%: gpt-4.1-mini
 ```
 
-```ruby
-comparison.best_for(min_score: 0.95)  # => "gpt-4.1-mini"
-
-# Inspect failures
-comparison.reports["gpt-4.1-nano"].failures.each do |f|
-  puts "#{f.name}: expected #{f.expected}, got #{f.output}"
-  puts "  mismatches: #{f.mismatches}"
-  # => outage: expected {priority: "urgent"}, got {priority: "high"}
-  #    mismatches: {priority: {expected: "urgent", got: "high"}}
-end
-```
+Nano fails on edge cases. Mini and full both score 100% — but mini is **5x cheaper**. Now you know.
 
-## Pipeline
+## Let the gem tell you what to do
 
-Chain steps with fail-fast. Hallucination in step 1 stops before step 2 spends tokens.
+Don't read tables — get a recommendation. Supports `model` + `reasoning_effort` combinations:
 
 ```ruby
-class TicketPipeline < RubyLLM::Contract::Pipeline::Base
-  step ClassifyTicket, as: :classify
-  step RouteToTeam, as: :route
-  step DraftResponse, as: :draft
-end
-
-result = TicketPipeline.run("I was charged twice")
-result.ok?                         # => true
-result.outputs_by_step[:classify]  # => {priority: "high", category: "billing"}
-result.trace.total_cost            # => 0.000128
+rec = ClassifyTicket.recommend("regression",
+  candidates: [
+    { model: "gpt-4.1-nano" },
+    { model: "gpt-4.1-mini" },
+    { model: "gpt-5-mini", reasoning_effort: "low" },
+    { model: "gpt-5-mini", reasoning_effort: "high" },
+  ],
+  min_score: 0.95
+)
+
+rec.best         # => { model: "gpt-4.1-mini" }
+rec.retry_chain  # => [{ model: "gpt-4.1-nano" }, { model: "gpt-4.1-mini" }]
+rec.to_dsl       # => "retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]"
+rec.savings      # => savings vs your current model (if configured)
 ```
 
-## CI gate
+Copy `rec.to_dsl` into your step. Done.
 
-```ruby
-# RSpec — block merge if accuracy drops or cost spikes
-expect(ClassifyTicket).to pass_eval("regression")
-  .with_context(model: "gpt-4.1-mini")
-  .with_minimum_score(0.8)
-  .with_maximum_cost(0.01)
+## Catch regressions before users do
 
-# Rake — run all evals across all steps
-require "ruby_llm/contract/rake_task"
-RubyLLM::Contract::RakeTask.new do |t|
-  t.minimum_score = 0.8
-  t.maximum_cost = 0.05
-end
-# bundle exec rake ruby_llm_contract:eval
-```
-
-## Detect quality drops
-
-Save a baseline. Next run, see what regressed.
+A model update silently dropped your accuracy? A prompt tweak broke an edge case? You'll know before deploying:
 
 ```ruby
+# Save a baseline once:
 report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
 report.save_baseline!(model: "gpt-4.1-nano")
 
-# Later — after prompt change, model update, or provider weight shift:
-report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
-diff = report.compare_with_baseline(model: "gpt-4.1-nano")
-
-diff.regressed?   # => true
-diff.regressions  # => [{case: "outage", baseline: {passed: true}, current: {passed: false}}]
-diff.score_delta  # => -0.33
-```
-
-```ruby
-# CI: block merge if any previously-passing case now fails
+# In CI, block merge if anything regressed:
 expect(ClassifyTicket).to pass_eval("regression")
   .with_context(model: "gpt-4.1-nano")
   .without_regressions
 ```
 
-## Track quality over time
-
 ```ruby
-# Save every eval run
-report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
-report.save_history!(model: "gpt-4.1-nano")
-
-# View trend
-history = report.eval_history(model: "gpt-4.1-nano")
-history.score_trend  # => :stable_or_improving | :declining
-history.drift?       # => true (score dropped > 10%)
+diff = report.compare_with_baseline(model: "gpt-4.1-nano")
+diff.regressed?   # => true
+diff.regressions  # => [{case: "outage", baseline: {passed: true}, current: {passed: false}}]
+diff.score_delta  # => -0.33
 ```
 
-## Run evals fast
+No more "it worked in the playground". Regressions are caught in CI, not production.
 
-```ruby
-# 4x faster with parallel execution
-report = ClassifyTicket.run_eval("regression",
-  context: { model: "gpt-4.1-nano" },
-  concurrency: 4)
-```
-
-## Prompt A/B testing
+## A/B test your prompts
 
-Changed a prompt? Compare old vs new with regression safety:
+Changed a prompt? Compare old vs new on the same dataset with regression safety:
 
 ```ruby
 diff = ClassifyTicketV2.compare_with(ClassifyTicketV1,
   eval: "regression", model: "gpt-4.1-mini")
 
-diff.safe_to_switch?  # => true (no regressions, no per-case score drops)
+diff.safe_to_switch?  # => true (no regressions)
 diff.improvements     # => [{case: "outage", ...}]
 diff.score_delta      # => +0.33
 ```
 
-Requires `model:` or `context: { adapter: ... }`.
-`compare_with` ignores `sample_response`; without a real model/adapter both sides are skipped and the A/B result is not meaningful.
-
-CI gate:
 ```ruby
+# CI gate:
 expect(ClassifyTicketV2).to pass_eval("regression")
   .compared_with(ClassifyTicketV1)
   .with_minimum_score(0.8)
 ```
 
-## Soft observations
+## Chain steps with fail-fast
 
-Log suspicious-but-not-invalid output without failing the contract:
+Pipeline stops at the first contract failure — no wasted tokens on downstream steps:
 
 ```ruby
-class EvaluateComparative < RubyLLM::Contract::Step::Base
-  validate("scores in range") { |o| (1..10).include?(o[:score_a]) }
-  observe("scores should differ") { |o| o[:score_a] != o[:score_b] }
+class TicketPipeline < RubyLLM::Contract::Pipeline::Base
+  step ClassifyTicket, as: :classify
+  step RouteToTeam, as: :route
+  step DraftResponse, as: :draft
 end
 
-result = EvaluateComparative.run(input)
-result.ok?           # => true (observe never fails)
-result.observations  # => [{description: "scores should differ", passed: false}]
-```
-
-## Predict cost before running
-
-```ruby
-ClassifyTicket.estimate_eval_cost("regression", models: %w[gpt-4.1-nano gpt-4.1-mini])
-# => { "gpt-4.1-nano" => 0.000024, "gpt-4.1-mini" => 0.000096 }
+result = TicketPipeline.run("I was charged twice")
+result.outputs_by_step[:classify]  # => {priority: "high", category: "billing"}
+result.trace.total_cost            # => 0.000128
 ```
 
-## Install
+## Gate merges on quality and cost
 
 ```ruby
-gem "ruby_llm-contract"
-```
+# RSpec — block merge if accuracy drops or cost spikes
+expect(ClassifyTicket).to pass_eval("regression")
+  .with_minimum_score(0.8)
+  .with_maximum_cost(0.01)
 
-```ruby
-RubyLLM.configure { |c| c.openai_api_key = ENV["OPENAI_API_KEY"] }
-RubyLLM::Contract.configure { |c| c.default_model = "gpt-4.1-mini" }
+# Rake — run all evals across all steps
+RubyLLM::Contract::RakeTask.new do |t|
+  t.minimum_score = 0.8
+  t.maximum_cost = 0.05
+end
+# bundle exec rake ruby_llm_contract:eval
 ```
 
-Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
-
 ## Docs
 
 | Guide | |
 |-------|-|
 | [Getting Started](docs/guide/getting_started.md) | Features walkthrough, model escalation, eval |
+| [Eval-First](docs/guide/eval_first.md) | Practical workflow for prompt engineering with datasets, baselines, and A/B gates |
 | [Best Practices](docs/guide/best_practices.md) | 6 patterns for bulletproof validates |
 | [Output Schema](docs/guide/output_schema.md) | Full schema reference + constraints |
 | [Pipeline](docs/guide/pipeline.md) | Multi-step composition, timeout, fail-fast |
@@ -228,13 +271,13 @@ Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
 
 ## Roadmap
 
-**v0.5 (current):** Data-driven prompt engineering — `compare_with(OtherStep)` for prompt A/B testing with regression safety. `observe` DSL for soft observations that log but never fail.
+**v0.6 (current):** "What should I do?" — `Step.recommend` returns the optimal model, reasoning effort, and retry chain. Per-attempt `reasoning_effort` in retry policies.
 
-**v0.4:** Observability & scale — eval history with trending, batch eval with concurrency, pipeline per-step eval, Minitest support, structured logging. Audit hardening (18 bugfixes).
+**v0.5:** Prompt A/B testing with `compare_with`. Soft observations with `observe`.
 
-**v0.3:** Baseline regression detection, migration guide, production hardening.
+**v0.4:** Eval history, batch concurrency, pipeline per-step eval, Minitest, structured logging.
 
-**v0.6:** Model recommendation based on eval history data. Cross-provider comparison docs.
+**v0.3:** Baseline regression detection, migration guide.
 
 ## License
 
@@ -52,7 +52,10 @@ module RubyLLM
       CHAT_OPTION_METHODS.each do |key, method_name|
         chat.public_send(method_name, options[key]) if options[key]
       end
-      chat.with_params(max_tokens: options[:max_tokens]) if options[:max_tokens]
+      params = {}
+      params[:max_tokens] = options[:max_tokens] if options[:max_tokens]
+      params[:reasoning_effort] = options[:reasoning_effort] if options[:reasoning_effort]
+      chat.with_params(**params) if params.any?
     end
 
     def build_response(response)
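Editor's note: the adapter change above batches optional chat params into a single `with_params` call. Sketched as a bare function (`build_params` is a hypothetical name for illustration, not a gem method):

```ruby
# Only options that are actually set end up in the params hash; when the
# hash is empty, the with_params call is skipped entirely.
def build_params(options)
  params = {}
  params[:max_tokens] = options[:max_tokens] if options[:max_tokens]
  params[:reasoning_effort] = options[:reasoning_effort] if options[:reasoning_effort]
  params
end

build_params(max_tokens: 256).key?(:max_tokens)  # true
build_params({}).empty?                          # true — with_params skipped
```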
@@ -70,18 +70,37 @@ module RubyLLM
       Eval::PromptDiff.new(candidate: my_report, baseline: other_report)
     end
 
-    def compare_models(eval_name, models:, context: {})
+    def compare_models(eval_name, models: [], candidates: [], context: {})
+      raise ArgumentError, "Pass either models: or candidates:, not both" if models.any? && candidates.any?
+
       context = safe_context(context)
-      models = models.uniq
-      reports = models.each_with_object({}) do |model, hash|
-        model_context = isolate_context(context).merge(model: model)
-        hash[model] = run_single_eval(eval_name, model_context)
+      candidate_configs = normalize_candidates(models, candidates)
+
+      reports = {}
+      configs = {}
+      candidate_configs.each do |config|
+        label = Eval::ModelComparison.candidate_label(config)
+        model_context = isolate_context(context).merge(model: config[:model])
+        model_context[:reasoning_effort] = config[:reasoning_effort] if config[:reasoning_effort]
+        reports[label] = run_single_eval(eval_name, model_context)
+        configs[label] = config
       end
-      Eval::ModelComparison.new(eval_name: eval_name, reports: reports)
+
+      Eval::ModelComparison.new(eval_name: eval_name, reports: reports, configs: configs)
     end
 
     private
 
+    def normalize_candidates(models, candidates)
+      if candidates.any?
+        candidates.map { |c| RubyLLM::Contract.normalize_candidate_config(c) }.uniq
+      elsif models.any?
+        models.uniq.map { |m| { model: m } }
+      else
+        raise ArgumentError, "Pass models: or candidates: with at least one entry"
+      end
+    end
+
    def comparison_context(context, model)
      base_context = safe_context(context)
      model ? base_context.merge(model: model) : base_context
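Editor's note: `normalize_candidates` above funnels both input styles into one shape. A standalone sketch, substituting an inline `compact` for the gem's `normalize_candidate_config` (which is not shown in this diff):

```ruby
# Both models: (strings) and candidates: (hashes) normalize to an array of
# config hashes; duplicates collapse, and empty input raises.
def normalize_candidates(models, candidates)
  if candidates.any?
    candidates.map { |c| { model: c[:model], reasoning_effort: c[:reasoning_effort] }.compact }.uniq
  elsif models.any?
    models.uniq.map { |m| { model: m } }
  else
    raise ArgumentError, "Pass models: or candidates: with at least one entry"
  end
end

normalize_candidates(%w[a a b], [])  # => [{ model: "a" }, { model: "b" }]
```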
@@ -4,11 +4,17 @@ module RubyLLM
   module Contract
     module Eval
       class ModelComparison
-        attr_reader :eval_name, :reports
+        attr_reader :eval_name, :reports, :configs
 
-        def initialize(eval_name:, reports:)
+        def self.candidate_label(config)
+          effort = config[:reasoning_effort]
+          effort ? "#{config[:model]} (effort: #{effort})" : config[:model]
+        end
+
+        def initialize(eval_name:, reports:, configs: nil)
           @eval_name = eval_name
-          @reports = reports.dup.freeze # { "model_name" => Report }
+          @reports = reports.dup.freeze
+          @configs = (configs || default_configs_from_reports).freeze
           freeze
         end
 
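Editor's note: `candidate_label` re-implemented as a bare method, for illustration of how effort-bearing configs get distinct report keys:

```ruby
# Same branching as ModelComparison.candidate_label above: configs with a
# reasoning effort get a distinct, human-readable label.
def candidate_label(config)
  effort = config[:reasoning_effort]
  effort ? "#{config[:model]} (effort: #{effort})" : config[:model]
end

puts candidate_label(model: "gpt-4.1-nano")
# gpt-4.1-nano
puts candidate_label(model: "gpt-5-mini", reasoning_effort: "high")
# gpt-5-mini (effort: high)
```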
@@ -16,12 +22,12 @@ module RubyLLM
           @reports.keys
         end
 
-        def score_for(model)
-          @reports[model]&.score
+        def score_for(candidate)
+          @reports[resolve_key(candidate)]&.score
         end
 
-        def cost_for(model)
-          @reports[model]&.total_cost
+        def cost_for(candidate)
+          @reports[resolve_key(candidate)]&.total_cost
         end
 
         def best_for(min_score: 0.0)
@@ -38,13 +44,14 @@ module RubyLLM
         end
 
         def table
-          lines = [" Model                      Score       Cost  Avg Latency"]
-          lines << " #{"-" * 57}"
+          max_label = [@reports.keys.map(&:length).max || 0, 25].max
+          lines = [format(" %-#{max_label}s  Score       Cost  Avg Latency", "Candidate")]
+          lines << " #{"-" * (max_label + 36)}"
 
-          @reports.each do |model, report|
+          @reports.each do |label, report|
             latency = report.avg_latency_ms ? "#{report.avg_latency_ms.round}ms" : "n/a"
             cost = report.total_cost.positive? ? "$#{format("%.4f", report.total_cost)}" : "n/a"
-            lines << format(" %-25s %6.2f %10s %12s", model, report.score, cost, latency)
+            lines << format(" %-#{max_label}s %6.2f %10s %12s", label, report.score, cost, latency)
           end
 
           lines.join("\n")
@@ -70,10 +77,25 @@ module RubyLLM
             total_cost: report.total_cost,
             avg_latency_ms: report.avg_latency_ms,
             pass_rate: report.pass_rate,
+            pass_rate_ratio: report.pass_rate_ratio,
             passed: report.passed?
           }
          end
        end
+
+        private
+
+        def resolve_key(candidate)
+          case candidate
+          when String then candidate
+          when Hash then self.class.candidate_label(candidate)
+          else candidate.to_s
+          end
+        end
+
+        def default_configs_from_reports
+          @reports.each_with_object({}) { |(key, _), h| h[key] = { model: key } }
+        end
       end
     end
   end
@@ -0,0 +1,48 @@
+# frozen_string_literal: true
+
+module RubyLLM
+  module Contract
+    module Eval
+      class Recommendation
+        include Concerns::DeepFreeze
+
+        attr_reader :best, :retry_chain, :score, :cost_per_call,
+                    :rationale, :current_config, :savings, :warnings
+
+        def initialize(best:, retry_chain:, score:, cost_per_call:,
+                       rationale:, current_config:, savings:, warnings:)
+          @best = deep_dup_freeze(best)
+          @retry_chain = deep_dup_freeze(retry_chain)
+          @score = score
+          @cost_per_call = cost_per_call
+          @rationale = deep_dup_freeze(rationale)
+          @current_config = deep_dup_freeze(current_config)
+          @savings = deep_dup_freeze(savings)
+          @warnings = deep_dup_freeze(warnings)
+          freeze
+        end
+
+        def to_dsl
+          return "# No recommendation — no candidate met the minimum score" if retry_chain.empty?
+
+          if retry_chain.length == 1 && retry_chain.first.keys == [:model]
+            "model \"#{retry_chain.first[:model]}\""
+          elsif retry_chain.all? { |c| c.keys == [:model] }
+            models_str = retry_chain.map { |c| c[:model] }.join(" ")
+            "retry_policy models: %w[#{models_str}]"
+          else
+            args = retry_chain.map { |c| config_to_ruby(c) }.join(",\n    ")
+            "retry_policy do\n  escalate(#{args})\nend"
+          end
+        end
+
+        private
+
+        def config_to_ruby(config)
+          pairs = config.map { |k, v| "#{k}: #{v.inspect}" }.join(", ")
+          "{ #{pairs} }"
+        end
+      end
+    end
+  end
+end
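Editor's note: the `to_dsl` branching can be exercised as a bare function (same logic as the class above, minus the freeze/attr plumbing):

```ruby
# One model => `model "..."`; a models-only chain => the %w[] shorthand;
# any config with extra keys => the full escalate block.
def to_dsl(retry_chain)
  return "# No recommendation" if retry_chain.empty?

  if retry_chain.length == 1 && retry_chain.first.keys == [:model]
    "model \"#{retry_chain.first[:model]}\""
  elsif retry_chain.all? { |c| c.keys == [:model] }
    "retry_policy models: %w[#{retry_chain.map { |c| c[:model] }.join(" ")}]"
  else
    args = retry_chain.map { |c| "{ #{c.map { |k, v| "#{k}: #{v.inspect}" }.join(", ")} }" }.join(",\n    ")
    "retry_policy do\n  escalate(#{args})\nend"
  end
end

puts to_dsl([{ model: "gpt-4.1-nano" }, { model: "gpt-4.1-mini" }])
# retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]
```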
@@ -0,0 +1,132 @@
1
+ # frozen_string_literal: true
2
+
3
+ module RubyLLM
4
+ module Contract
5
+ module Eval
6
+ class Recommender
7
+ def initialize(comparison:, min_score:, min_first_try_pass_rate: 0.8, current_config: nil)
8
+ @comparison = comparison
9
+ @min_score = min_score
10
+ @min_first_try_pass_rate = min_first_try_pass_rate
11
+ @current_config = current_config
12
+ end
13
+
14
+ def recommend
15
+ scored = build_scored_candidates
16
+ best = select_best(scored)
17
+ chain = build_retry_chain(scored, best)
18
+ rationale = build_rationale(scored, best)
19
+ warnings = build_warnings(scored)
20
+ savings = best ? calculate_savings(best) : {}
21
+
22
+ Recommendation.new(
23
+ best: best&.dig(:config),
24
+ retry_chain: chain,
25
+ score: best&.dig(:score) || 0.0,
26
+ cost_per_call: best&.dig(:cost_per_call) || 0.0,
27
+ rationale: rationale,
28
+ current_config: @current_config,
29
+ savings: savings,
30
+ warnings: warnings
31
+ )
32
+ end
33
+
34
+ private
35
+
36
+ def build_scored_candidates
37
+ @comparison.configs.filter_map do |label, config|
38
+ report = @comparison.reports[label]
39
+ next nil unless report
40
+
41
+ evaluated_count = report.results.count { |r| r.step_status != :skipped }
42
+ cases_count = [evaluated_count, 1].max
43
+ cost_per_call = report.total_cost.to_f / cases_count
44
+
45
+ {
46
+ label: label,
47
+ config: config,
48
+ score: report.score,
49
+ cost_per_call: cost_per_call,
50
+ latency: report.avg_latency_ms || Float::INFINITY,
51
+ pass_rate_ratio: report.pass_rate_ratio,
52
+ total_cost: report.total_cost
53
+ }
54
+ end
55
+ end
56
+
57
+ def select_best(scored)
58
+ eligible = scored.select { |s| s[:score] >= @min_score && cost_known?(s) }
59
+ eligible.min_by { |s| [s[:cost_per_call], s[:latency], s[:label]] }
60
+ end
61
+
62
+ def build_retry_chain(scored, best)
63
+ return [] unless best
64
+
65
+ first_try = scored
66
+ .select { |s| s[:pass_rate_ratio] >= @min_first_try_pass_rate && cost_known?(s) }
67
+ .min_by { |s| [s[:cost_per_call], s[:latency], s[:label]] }
68
+
69
+ if first_try.nil? || first_try[:label] == best[:label]
70
+ [best[:config]]
71
+ else
72
+ [first_try[:config], best[:config]]
73
+ end
+ end
+
+ def build_rationale(scored, best)
+ sorted = scored.sort_by { |s| [cost_known?(s) ? 0 : 1, s[:cost_per_call], s[:latency], s[:label]] }
+ sorted.map { |s| rationale_line(s, best) }
+ end
+
+ def rationale_line(candidate, best)
+ cost_str = cost_known?(candidate) ? "$#{format("%.4f", candidate[:cost_per_call])}/call" : "unknown pricing"
+ header = "#{candidate[:label]}, score #{format("%.2f", candidate[:score])}, at #{cost_str}"
+ notes = rationale_notes(candidate, best)
+ notes.any? ? "#{header} — #{notes.join(", ")}" : header
+ end
+
+ def rationale_notes(candidate, best)
+ notes = []
+ pass_pct = (candidate[:pass_rate_ratio] * 100).round
+ below_threshold = candidate[:score] < @min_score
+
+ if below_threshold && candidate[:pass_rate_ratio] >= @min_first_try_pass_rate
+ notes << "below #{@min_score} threshold, but good first-try (#{pass_pct}% pass rate)"
+ elsif below_threshold
+ notes << "below #{@min_score} threshold"
+ elsif candidate[:pass_rate_ratio] < 1.0
+ notes << "#{pass_pct}% pass rate"
+ end
+ notes << "recommended" if best && candidate[:label] == best[:label]
+ notes << "unknown pricing" unless cost_known?(candidate)
+ notes
+ end
+
+ def build_warnings(scored)
+ scored.reject { |s| cost_known?(s) }
+ .map { |s| "#{s[:label]}: unknown pricing — cost ranking may be inaccurate" }
+ end
+
+ def calculate_savings(best)
+ return {} unless @current_config
+
+ current_label = ModelComparison.candidate_label(@current_config)
+ current_report = @comparison.reports[current_label]
+ return {} unless current_report
+
+ current_evaluated = current_report.results.count { |r| r.step_status != :skipped }
+ current_cases = [current_evaluated, 1].max
+ current_cost = current_report.total_cost.to_f / current_cases
+ diff = current_cost - best[:cost_per_call]
+ return {} unless diff.positive?
+
+ { per_call: diff.round(6), monthly_at: { 10_000 => (diff * 10_000).round(2) } }
+ end
+
+ def cost_known?(scored_candidate)
+ scored_candidate[:cost_per_call]&.positive?
+ end
+ end
+ end
+ end
+ end
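The savings arithmetic in `calculate_savings` can be sketched standalone (the method name and plain-number inputs below are illustrative, not the gem's API; the gem derives costs from eval reports): the per-call difference versus the current config is only reported when positive, and is projected at 10,000 calls per month.

```ruby
# Illustrative sketch of the savings arithmetic in calculate_savings above.
def savings(current_cost_per_call, best_cost_per_call)
  diff = current_cost_per_call - best_cost_per_call
  return {} unless diff.positive?

  { per_call: diff.round(6), monthly_at: { 10_000 => (diff * 10_000).round(2) } }
end

savings(0.0031, 0.0004) # cheaper candidate: per-call and monthly savings reported
savings(0.0004, 0.0031) # current config already cheapest: empty hash
```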
@@ -14,8 +14,8 @@ module RubyLLM
  HISTORY_DIR = ".eval_history"
  BASELINE_DIR = ".eval_baselines"

- def_delegators :@stats, :score, :passed, :failed, :skipped, :failures, :pass_rate, :total_cost, :avg_latency_ms,
- :passed?
+ def_delegators :@stats, :score, :passed, :failed, :skipped, :failures, :pass_rate, :pass_rate_ratio,
+ :total_cost, :avg_latency_ms, :passed?
  def_delegators :@presenter, :summary, :to_s, :print_summary
  def_delegators :@storage, :save_history!, :eval_history, :save_baseline!, :compare_with_baseline,
  :baseline_exists?
@@ -35,6 +35,12 @@ module RubyLLM
  "#{passed}/#{evaluated_results.length}"
  end

+ def pass_rate_ratio
+ return 0.0 if evaluated_results.empty?
+
+ passed.to_f / evaluated_results.length
+ end
+
  def total_cost
  @results.sum { |result| result.cost || 0.0 }
  end
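The distinction the new delegator introduces can be shown with plain numbers (a sketch, not the gem's internals): `pass_rate` stays the human-readable fraction string, while `pass_rate_ratio` exposes the same quantity as a Float so thresholds such as `min_first_try_pass_rate: 0.8` can compare against it.

```ruby
# Sketch of the two accessors' shapes, using plain values instead of results.
passed = 19
evaluated = 20

pass_rate = "#{passed}/#{evaluated}"                               # "19/20"
pass_rate_ratio = evaluated.zero? ? 0.0 : passed.to_f / evaluated  # 0.95
```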
@@ -13,25 +13,29 @@ module RubyLLM
  @stats = stats
  end

- def save_history!(path: nil, model: nil)
- file = path || storage_path(Report::HISTORY_DIR, "jsonl", model: model)
- EvalHistory.append(file, history_entry)
+ def save_history!(path: nil, model: nil, reasoning_effort: nil)
+ file = path || storage_path(Report::HISTORY_DIR, "jsonl", model: model, reasoning_effort: reasoning_effort)
+ entry = history_entry
+ entry[:model] = model if model
+ entry[:reasoning_effort] = reasoning_effort if reasoning_effort
+ EvalHistory.append(file, entry)
  file
  end

- def eval_history(path: nil, model: nil)
- EvalHistory.load(path || storage_path(Report::HISTORY_DIR, "jsonl", model: model))
+ def eval_history(path: nil, model: nil, reasoning_effort: nil)
+ EvalHistory.load(path || storage_path(Report::HISTORY_DIR, "jsonl", model: model,
+ reasoning_effort: reasoning_effort))
  end

- def save_baseline!(path: nil, model: nil)
- file = path || storage_path(Report::BASELINE_DIR, "json", model: model)
+ def save_baseline!(path: nil, model: nil, reasoning_effort: nil)
+ file = path || storage_path(Report::BASELINE_DIR, "json", model: model, reasoning_effort: reasoning_effort)
  FileUtils.mkdir_p(File.dirname(file))
  File.write(file, JSON.pretty_generate(serialize_for_baseline))
  file
  end

- def compare_with_baseline(path: nil, model: nil)
- file = path || storage_path(Report::BASELINE_DIR, "json", model: model)
+ def compare_with_baseline(path: nil, model: nil, reasoning_effort: nil)
+ file = path || storage_path(Report::BASELINE_DIR, "json", model: model, reasoning_effort: reasoning_effort)
  raise ArgumentError, "No baseline found at #{file}" unless File.exist?(file)

  baseline_data = JSON.parse(File.read(file), symbolize_names: true)
@@ -43,8 +47,8 @@ module RubyLLM
  )
  end

- def baseline_exists?(path: nil, model: nil)
- File.exist?(path || storage_path(Report::BASELINE_DIR, "json", model: model))
+ def baseline_exists?(path: nil, model: nil, reasoning_effort: nil)
+ File.exist?(path || storage_path(Report::BASELINE_DIR, "json", model: model, reasoning_effort: reasoning_effort))
  end

  private
@@ -55,6 +59,7 @@ module RubyLLM
  score: @stats.score,
  total_cost: @stats.total_cost,
  pass_rate: @stats.pass_rate,
+ pass_rate_ratio: @stats.pass_rate_ratio,
  cases_count: @stats.evaluated_results_count
  }
  end
@@ -79,12 +84,13 @@ module RubyLLM
  }
  end

- def storage_path(root_dir, extension, model:)
+ def storage_path(root_dir, extension, model:, reasoning_effort: nil)
  parts = [root_dir]
  parts << sanitize_name(@report.step_name) if @report.step_name

  dataset_name = sanitize_name(@report.dataset_name)
  dataset_name = "#{dataset_name}_#{sanitize_name(model)}" if model
+ dataset_name = "#{dataset_name}_effort_#{sanitize_name(reasoning_effort)}" if reasoning_effort

  File.join(*parts, "#{dataset_name}.#{extension}")
  end
@@ -29,3 +29,5 @@ require_relative "eval/prompt_diff_comparator"
  require_relative "eval/prompt_diff_presenter"
  require_relative "eval/prompt_diff"
  require_relative "eval/eval_history"
+ require_relative "eval/recommendation"
+ require_relative "eval/recommender"
@@ -49,7 +49,17 @@ module RubyLLM
  end
  end

- KNOWN_CONTEXT_KEYS = %i[adapter model temperature max_tokens provider assume_model_exists].freeze
+ def recommend(eval_name, candidates:, min_score: 0.95, min_first_try_pass_rate: 0.8, context: {})
+ comparison = compare_models(eval_name, candidates: candidates, context: context)
+ Eval::Recommender.new(
+ comparison: comparison,
+ min_score: min_score,
+ min_first_try_pass_rate: min_first_try_pass_rate,
+ current_config: current_model_config
+ ).recommend
+ end
+
+ KNOWN_CONTEXT_KEYS = %i[adapter model temperature max_tokens provider assume_model_exists reasoning_effort].freeze

  include Concerns::ContextHelpers

@@ -139,11 +149,22 @@ module RubyLLM
  {
  model: context[:model] || model || RubyLLM::Contract.configuration.default_model,
  temperature: context[:temperature],
- extra_options: context.slice(:provider, :assume_model_exists, :max_tokens),
+ extra_options: context.slice(:provider, :assume_model_exists, :max_tokens, :reasoning_effort),
  policy: retry_policy
  }
  end

+ def current_model_config
+ policy = retry_policy
+ if policy && policy.config_list.any?
+ policy.config_list.first
+ elsif respond_to?(:model) && model
+ { model: model }
+ elsif RubyLLM::Contract.configuration.default_model
+ { model: RubyLLM::Contract.configuration.default_model }
+ end
+ end
+
  def resolve_adapter(context)
  adapter = context[:adapter] || RubyLLM::Contract.configuration.default_adapter
  return adapter if adapter
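The precedence in `current_model_config` can be sketched in isolation (the method below is an illustrative stand-in, with the gem's collaborators replaced by plain arguments): the first entry of the retry chain wins, then the step's own model, then the global default, and `nil` when nothing is configured.

```ruby
# Illustrative stand-in for current_model_config's precedence rules.
def pick_current_config(config_list:, step_model: nil, default_model: nil)
  if config_list.any?
    config_list.first
  elsif step_model
    { model: step_model }
  elsif default_model
    { model: default_model }
  end
end

pick_current_config(config_list: [{ model: "gpt-5-mini", reasoning_effort: "low" }])
pick_current_config(config_list: [], step_model: "gpt-5")
pick_current_config(config_list: []) # nothing configured: nil
```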
@@ -10,12 +10,16 @@ module RubyLLM

  def run_with_retry(input, adapter:, default_model:, policy:, context_temperature: nil, extra_options: {})
  all_attempts = []
+ default_config = { model: default_model }.merge(extra_options.slice(:reasoning_effort).compact)

  policy.max_attempts.times do |attempt_index|
- model = policy.model_for_attempt(attempt_index, default_model)
+ config = policy.config_for_attempt(attempt_index, default_config)
+ model = config[:model]
+ attempt_extra = extra_options.merge(config.except(:model))
+
  result = run_once(input, adapter: adapter, model: model,
- context_temperature: context_temperature, extra_options: extra_options)
- all_attempts << { attempt: attempt_index + 1, model: model, result: result }
+ context_temperature: context_temperature, extra_options: attempt_extra)
+ all_attempts << { attempt: attempt_index + 1, model: model, config: config, result: result }
  break unless policy.retryable?(result)
  end

@@ -43,6 +47,8 @@ module RubyLLM
  def build_attempt_entry(attempt)
  trace = attempt[:result].trace
  entry = { attempt: attempt[:attempt], model: attempt[:model], status: attempt[:result].status }
+ config = attempt[:config]
+ entry[:config] = config if config && config.keys != [:model]
  append_trace_fields(entry, trace)
  end

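The per-attempt option merge can be shown with literal hashes (a sketch, not the gem's code): everything in the attempt's config except `:model` is layered over the call-wide `extra_options`, so an escalation entry's `reasoning_effort` overrides the default for that attempt.

```ruby
# Sketch of attempt_extra = extra_options.merge(config.except(:model)).
# Hash#except is core Ruby since 3.0.
extra_options = { max_tokens: 512, reasoning_effort: "low" }
config = { model: "gpt-5", reasoning_effort: "high" }

attempt_extra = extra_options.merge(config.except(:model))
# this attempt runs with reasoning_effort "high" and max_tokens 512
```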
@@ -9,13 +9,13 @@ module RubyLLM
  DEFAULT_RETRY_ON = %i[validation_failed parse_error adapter_error].freeze

  def initialize(models: nil, attempts: nil, retry_on: nil, &block)
- @models = []
+ @configs = []
  @retryable_statuses = DEFAULT_RETRY_ON.dup

  if block
  @max_attempts = 1
  instance_eval(&block)
- warn_no_retry! if @max_attempts == 1 && @models.empty?
+ warn_no_retry! if @max_attempts == 1 && @configs.empty?
  else
  apply_keywords(models: models, attempts: attempts, retry_on: retry_on)
  end
@@ -28,14 +28,18 @@ module RubyLLM
  validate_max_attempts!
  end

- def escalate(*model_list)
- @models = model_list.flatten
- @max_attempts = @models.length if @max_attempts < @models.length
+ def escalate(*config_list)
+ @configs = config_list.flatten.map { |c| normalize_config(c).freeze }.freeze
+ @max_attempts = @configs.length if @max_attempts < @configs.length
  end
  alias models escalate

  def model_list
- @models
+ @configs.map { |c| c[:model] }.freeze
+ end
+
+ def config_list
+ @configs
  end

  def retry_on(*statuses)
@@ -46,29 +50,38 @@ module RubyLLM
  retryable_statuses.include?(result.status)
  end

- def model_for_attempt(attempt, default_model)
- if @models.any?
- @models[attempt] || @models.last
+ def config_for_attempt(attempt, default_config)
+ if @configs.any?
+ @configs[attempt] || @configs.last
  else
- default_model
+ default_config
  end
  end

+ def model_for_attempt(attempt, default_model)
+ config_for_attempt(attempt, { model: default_model })[:model]
+ end
+
  private

  def apply_keywords(models:, attempts:, retry_on:)
  if models
- @models = Array(models).dup.freeze
- @max_attempts = @models.length
+ @configs = Array(models).map { |m| normalize_config(m).freeze }.freeze
+ @max_attempts = @configs.length
  else
  @max_attempts = attempts || 1
  end
  @retryable_statuses = Array(retry_on).dup if retry_on
  end

+ def normalize_config(entry)
+ RubyLLM::Contract.normalize_candidate_config(entry)
+ end
+
  def warn_no_retry!
- warn "[ruby_llm-contract] retry_policy has max_attempts=1 with no models. " \
- "This means no actual retry will happen. Add `attempts 2` or `escalate %w[model1 model2]`."
+ warn "[ruby_llm-contract] retry_policy has max_attempts=1 with no configs. " \
+ "This means no actual retry will happen. Add `attempts 2` or " \
+ '`escalate "model1", "model2"`.'
  end

  def validate_max_attempts!
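How the escalation chain is consumed can be sketched without the class (a minimal re-implementation for illustration only): entries are used in attempt order, the last entry repeats once the chain is exhausted, and the default config applies when no escalation was declared.

```ruby
# Minimal illustration of config_for_attempt's fallback behavior.
def config_for_attempt(configs, attempt, default_config)
  return default_config if configs.empty?

  configs[attempt] || configs.last
end

chain = [{ model: "gpt-5-mini" }, { model: "gpt-5", reasoning_effort: "high" }]
config_for_attempt(chain, 0, { model: "default" }) # first entry
config_for_attempt(chain, 5, { model: "default" }) # past the chain: last entry repeats
config_for_attempt([], 0, { model: "default" })    # no chain declared: default config
```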
@@ -2,6 +2,6 @@

  module RubyLLM
  module Contract
- VERSION = "0.5.0"
+ VERSION = "0.6.0"
  end
  end
@@ -78,6 +78,27 @@ module RubyLLM
  Thread.current[:ruby_llm_contract_reloading] = false
  end

+ def normalize_candidate_config(entry)
+ case entry
+ when String
+ raise ArgumentError, "Candidate model must be a non-empty String" if entry.strip.empty?
+
+ { model: entry.strip }
+ when Hash
+ model = entry[:model] || entry["model"]
+ unless model.is_a?(String) && !model.strip.empty?
+ raise ArgumentError, "Candidate config must include a non-empty String :model"
+ end
+
+ normalized = { model: model.strip }
+ effort = entry[:reasoning_effort] || entry["reasoning_effort"]
+ normalized[:reasoning_effort] = effort if effort
+ normalized
+ else
+ raise ArgumentError, "Expected String or Hash, got #{entry.class}"
+ end
+ end
+
  private

  # Filter stale hosts, deduplicate by name (last wins), prune registry in-place
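The accepted candidate shapes can be exercised with a condensed stand-in (illustrative only; the gem's version is `normalize_candidate_config` above): strings become `{ model: }` hashes, hashes keep an optional `reasoning_effort`, and anything else raises `ArgumentError`.

```ruby
# Condensed stand-in for normalize_candidate_config, for illustration.
def normalize_candidate(entry)
  case entry
  when String
    raise ArgumentError, "non-empty String required" if entry.strip.empty?

    { model: entry.strip }
  when Hash
    model = entry[:model] || entry["model"]
    raise ArgumentError, ":model required" unless model.is_a?(String) && !model.strip.empty?

    config = { model: model.strip }
    effort = entry[:reasoning_effort] || entry["reasoning_effort"]
    config[:reasoning_effort] = effort if effort
    config
  else
    raise ArgumentError, "Expected String or Hash, got #{entry.class}"
  end
end

normalize_candidate("gpt-5-mini")                                  # model-only hash
normalize_candidate({ model: "gpt-5-mini", reasoning_effort: "low" }) # effort preserved
```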
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: ruby_llm-contract
  version: !ruby/object:Gem::Version
- version: 0.5.0
+ version: 0.6.0
  platform: ruby
  authors:
  - Justyna
@@ -130,6 +130,8 @@ files:
  - lib/ruby_llm/contract/eval/prompt_diff_comparator.rb
  - lib/ruby_llm/contract/eval/prompt_diff_presenter.rb
  - lib/ruby_llm/contract/eval/prompt_diff_serializer.rb
+ - lib/ruby_llm/contract/eval/recommendation.rb
+ - lib/ruby_llm/contract/eval/recommender.rb
  - lib/ruby_llm/contract/eval/report.rb
  - lib/ruby_llm/contract/eval/report_presenter.rb
  - lib/ruby_llm/contract/eval/report_stats.rb