ruby_llm-contract 0.5.2 → 0.6.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 359d08f8cf1e31b84f308c47c7f93c7cee7663054de3ab538a34c1f67873554f
4
- data.tar.gz: 60d8728bed042277d40ec1d231b6712e258b658fd893a73afc6ed1f8e9cff8c8
3
+ metadata.gz: 2636c2e59f5fef27f929a94ac9e3194793ce6b51f86cce0c18ca6a5b0caa61ab
4
+ data.tar.gz: c2c81c7cc8fd281bf6c88738f0fb5bc4bdbb86b71d2fb59248e2b0ebb8d648fe
5
5
  SHA512:
6
- metadata.gz: 4bd4d7cea9fde7281bf84e1283c4201f8c5e9425cb8357e40b85e5184f19f51eb57a88a35901eddf571defd93ff33ef790e24b5e2eb90add8ef6371e791d37e5
7
- data.tar.gz: e68ca27fc2225224cd900b1afb2180cfd43929e0461420c7fd2987706a2ebaa282b1e659c8b5c14e69e30d1250ede547061e2d2ab74b5c9cc0bb7fdb77109f0a
6
+ metadata.gz: d9f2fca592fd3a183d987239dea0cdc2456eed639d6cccaea71e9b1ef3a3ff6e32f3e346f640c905168674e7401ccb85e5f342815a1b290cdebe05a9f7b5374f
7
+ data.tar.gz: 00b8d113564871db19f88d9061276a0aee0295da7638974804782fda7929cf893a0004aadfeffc2f25f13d421935e0e9d710e553d8e56b7b5d0229f62edd129e
data/CHANGELOG.md CHANGED
@@ -1,5 +1,52 @@
1
1
  # Changelog
2
2
 
3
+ ## 0.6.2 (2026-04-18)
4
+
5
+ ### Features
6
+
7
+ - **`Step.optimize_retry_policy`** — runs `compare_models` on ALL evals for the step, builds a score matrix, identifies the constraining eval, and suggests a retry chain. Chain's last model always passes all evals (safe fallback).
8
+ - **`rake ruby_llm_contract:optimize`** — one-command retry chain optimization. Prints score table, constraining eval, suggested chain, and copy-paste DSL.
9
+ - **Offline by default** — `optimize` uses `sample_response` (zero API calls) unless `LIVE=1` or `PROVIDER=` is set.
10
+ - **`EVAL_DIRS=` support** — non-Rails setups can specify eval file directories.
11
+ - **Guide: [Optimizing retry_policy](docs/guide/optimizing_retry_policy.md)** — full procedure with prerequisites, troubleshooting, and real-world example.
12
+
13
+ ### Fixes
14
+
15
+ - Chain semantics aligned with `retry_executor` — retry fires on `validation_failed`/`parse_error`, not on low eval score. Disjoint eval coverage (A passes e1, B passes e2, neither passes both) correctly returns empty chain.
16
+ - Removed ActiveSupport dependency from rake task (`.presence` → `.empty?`).
17
+ - Added `require "set"` for non-Rails environments.
18
+
19
+ ## 0.6.1 (2026-04-17)
20
+
21
+ ### Features
22
+
23
+ - **Multi-provider operator tooling** — rake tasks support `PROVIDER=openai|anthropic|ollama`, `CANDIDATES=model@effort,...`, and `REASONING_EFFORT=low|medium|high`.
24
+ - **`rake ruby_llm_contract:recommend`** — wraps `Step.recommend` with CLI interface, prints best config, retry chain, DSL, rationale, and savings.
25
+ - **Ollama support** — `PROVIDER=ollama` with configurable `OLLAMA_API_BASE`.
26
+
27
+ ## 0.6.0 (2026-04-12)
28
+
29
+ "What should I do?" — model + configuration recommendation.
30
+
31
+ ### Features
32
+
33
+ - **`Step.recommend`** — `ClassifyTicket.recommend("eval", candidates: [...], min_score: 0.95)` runs eval on all candidates and returns a `Recommendation` with optimal model, retry chain, rationale, savings vs current config, and `to_dsl` code output.
34
+ - **Candidates as configurations** — `candidates:` accepts `{ model:, reasoning_effort: }` hashes, not just model name strings. `gpt-5-mini` with `reasoning_effort: "low"` is a different candidate than with `"high"`.
35
+ - **`compare_models` extended** — new `candidates:` parameter alongside existing `models:` (backward compatible). Candidate labels include reasoning effort in output table.
36
+ - **Per-attempt `reasoning_effort` in retry policies** — `escalate` accepts config hashes: `escalate({ model: "gpt-4.1-nano" }, { model: "gpt-5-mini", reasoning_effort: "high" })`. Each attempt gets its own reasoning_effort forwarded to the provider.
37
+ - **`pass_rate_ratio`** — numeric float (0.0–1.0) on `Report` and `ReportStats`, complementing the string `pass_rate` (`"3/5"`).
38
+ - **History entries enriched** — `save_history!` accepts `reasoning_effort:` and stores `model`, `reasoning_effort`, `pass_rate_ratio` in JSONL entries.
39
+
40
+ ### Game changer continuity
41
+
42
+ ```
43
+ v0.2: "Which model?" → compare_models (snapshot)
44
+ v0.3: "Did it change?" → baseline regression (binary)
45
+ v0.4: "Show me the trend" → eval history (time series)
46
+ v0.5: "Which prompt is better?" → compare_with (A/B testing)
47
+ v0.6: "What should I do?" → recommend (actionable advice)
48
+ ```
49
+
3
50
  ## 0.5.2 (2026-04-06)
4
51
 
5
52
  ### Features
@@ -8,7 +55,7 @@
8
55
 
9
56
  ## 0.5.0 (2026-03-25)
10
57
 
11
- Data-Driven Prompt Engineering — see ADR-0015.
58
+ Data-Driven Prompt Engineering.
12
59
 
13
60
  ### Features
14
61
 
@@ -55,7 +102,7 @@ Audit hardening — 18 bugs fixed across 4 audit rounds.
55
102
 
56
103
  ## 0.4.3 (2026-03-24)
57
104
 
58
- Production feedback release — driven by ADR-0016 (real Rails 8.1 deployment).
105
+ Production feedback release.
59
106
 
60
107
  ### Features
61
108
 
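Note on the 0.6.2 chain semantics described in the changelog hunk above: the following is a minimal sketch of how a retry chain might be derived from a score matrix. It is not the gem's implementation. `suggest_chain`, `constraining_eval`, and the shapes of the `scores` and `costs` hashes are hypothetical. The sketch assumes the fallback is the cheapest candidate that passes every eval, and that only cheaper candidates passing at least one eval are tried before it.

```ruby
# Hypothetical sketch, not the gem's code.
# scores: { candidate => { eval_name => score } }
# costs:  { candidate => cost per run }
def suggest_chain(scores, costs, min_score: 0.95)
  # A candidate is a safe fallback only if it passes every eval.
  safe = scores.select { |_, evals| evals.values.all? { |s| s >= min_score } }.keys
  return [] if safe.empty? # disjoint coverage: no single candidate passes everything

  fallback = safe.min_by { |c| costs.fetch(c) }

  # Cheaper candidates that pass at least one eval go first, cheapest first.
  cheaper = scores.keys.select do |c|
    costs.fetch(c) < costs.fetch(fallback) &&
      scores[c].values.any? { |s| s >= min_score }
  end
  cheaper.sort_by { |c| costs.fetch(c) } + [fallback]
end

# Assumption: the constraining eval is the one that the fewest candidates pass.
def constraining_eval(scores, min_score: 0.95)
  eval_names = scores.values.flat_map(&:keys).uniq
  eval_names.min_by do |name|
    scores.count { |_, evals| evals.fetch(name, 0.0) >= min_score }
  end
end

scores = {
  "nano" => { "e1" => 1.0, "e2" => 0.5 },
  "mini" => { "e1" => 1.0, "e2" => 1.0 },
  "full" => { "e1" => 1.0, "e2" => 1.0 }
}
costs = { "nano" => 0.0001, "mini" => 0.0004, "full" => 0.0021 }

chain = suggest_chain(scores, costs)
hardest = constraining_eval(scores)

# Disjoint coverage: A passes e1 only, B passes e2 only.
disjoint = {
  "A" => { "e1" => 1.0, "e2" => 0.0 },
  "B" => { "e1" => 0.0, "e2" => 1.0 }
}
empty_chain = suggest_chain(disjoint, { "A" => 0.1, "B" => 0.2 })
```

With these made-up numbers the chain ends on `mini`, the cheapest candidate that passes both evals, and the disjoint case yields an empty chain, matching the behaviour the changelog describes.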
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- ruby_llm-contract (0.5.2)
4
+ ruby_llm-contract (0.6.1)
5
5
  dry-types (~> 1.7)
6
6
  ruby_llm (~> 1.0)
7
7
  ruby_llm-schema (~> 0.3)
@@ -258,7 +258,7 @@ CHECKSUMS
258
258
  rubocop-ast (1.49.1) sha256=4412f3ee70f6fe4546cc489548e0f6fcf76cafcfa80fa03af67098ffed755035
259
259
  ruby-progressbar (1.13.0) sha256=80fc9c47a9b640d6834e0dc7b3c94c9df37f08cb072b7761e4a71e22cff29b33
260
260
  ruby_llm (1.14.0) sha256=57c6f7034fc4a44504ea137d70f853b07824f1c1cdbe774ab3ab3522e7098deb
261
- ruby_llm-contract (0.5.2)
261
+ ruby_llm-contract (0.6.1)
262
262
  ruby_llm-schema (0.3.0) sha256=a591edc5ca1b7f0304f0e2261de61ba4b3bea17be09f5cf7558153adfda3dec6
263
263
  ruby_parser (3.22.0) sha256=1eb4937cd9eb220aa2d194e352a24dba90aef00751e24c8dfffdb14000f15d23
264
264
  rubycritic (4.12.0) sha256=024fed90fe656fa939f6ea80aab17569699ac3863d0b52fd72cb99892247abc8
data/README.md CHANGED
@@ -1,24 +1,89 @@
1
1
  # ruby_llm-contract
2
2
 
3
- Contracts for LLM quality. Know which model to use, what it costs, and when accuracy drops.
3
+ The missing link between LLM cost and quality. Stop choosing between "cheap but wrong" and "accurate but expensive" — get both. Contracts, model escalation, and data-driven recommendations for [ruby_llm](https://github.com/crmne/ruby_llm).
4
4
 
5
- Companion gem for [ruby_llm](https://github.com/crmne/ruby_llm).
5
+ ```
6
+ YOU WRITE                      THE GEM HANDLES                  YOU GET
7
+ ─────────                      ───────────────                  ───────
8
+
9
+ validate { |o| ... }           catch bad answers — combined     Zero garbage
10
+                                with retry_policy, auto-retry    in production
6
11
 
7
- ## The problem
12
+ retry_policy                   start cheap, escalate only       Pay for the cheapest
13
+   models: %w[nano mini full]   when validation fails            model that works
8
14
 
9
- Which model should you use? The expensive one is accurate but costs 4x more. The cheap one is fast but hallucinates on edge cases. You tweak a prompt — did accuracy improve or drop? You have no data. Just gut feeling.
15
+ max_cost 0.01                  estimate tokens, check price,    No surprise bills
16
+                                refuse before calling LLM
10
17
 
11
- ## The fix
18
+ output_schema { ... }          send JSON schema to provider,    Zero parsing code
19
+                                validate response client-side
20
+
21
+ define_eval { ... }            test cases + baselines,          Regressions caught
22
+                                run in CI with real LLM          before deploy
23
+
24
+ recommend(candidates: [...])   evaluate all configs, pick       Optimal model +
25
+                                cheapest that passes             retry chain
26
+ ```
27
+
28
+ ## Before and after
29
+
30
+ ```
31
+ ┌─────────────────────────────────────────────────────────────────┐
32
+ │  BEFORE: pick one model, hope for the best                      │
33
+ │                                                                 │
34
+ │  expensive model → accurate, but you overpay on every call      │
35
+ │  cheap model     → fast, but wrong answers slip to production   │
36
+ │  prompt change   → "looks good to me" → deploy → users suffer   │
37
+ └─────────────────────────────────────────────────────────────────┘
38
+
39
+ ⬇ add ruby_llm-contract
40
+
41
+ ┌─────────────────────────────────────────────────────────────────┐
42
+ │  YOU DEFINE A CONTRACT                                          │
43
+ │                                                                 │
44
+ │  output_schema { string :priority }       ← valid structure     │
45
+ │  validate("valid priority") { |o| ... }   ← business rules      │
46
+ │  retry_policy models: %w[nano mini full]  ← escalation chain    │
47
+ │  max_cost 0.01                            ← budget cap          │
48
+ └───────────────────────────┬─────────────────────────────────────┘
49
+
50
+
51
+ ┌─────────────────────────────────────────────────────────────────┐
52
+ │  THE GEM HANDLES THE REST                                       │
53
+ │                                                                 │
54
+ │  request ──→ ┌──────┐   ┌──────────┐                            │
55
+ │              │ nano │─→ │ contract │──→ ✓ pass → done           │
56
+ │              └──────┘   └────┬─────┘                            │
57
+ │                              │ ✗ fail                           │
58
+ │                              ▼                                  │
59
+ │              ┌──────┐   ┌──────────┐                            │
60
+ │              │ mini │─→ │ contract │──→ ✓ pass → done           │
61
+ │              └──────┘   └────┬─────┘                            │
62
+ │                              │ ✗ fail                           │
63
+ │                              ▼                                  │
64
+ │              ┌──────┐   ┌──────────┐                            │
65
+ │              │ full │─→ │ contract │──→ ✓ pass → done           │
66
+ │              └──────┘   └──────────┘                            │
67
+ └───────────────────────────┬─────────────────────────────────────┘
68
+
69
+
70
+ ┌─────────────────────────────────────────────────────────────────┐
71
+ │  YOU GET                                                        │
72
+ │                                                                 │
73
+ │  ✓ Valid output guaranteed — schema + business rules enforced   │
74
+ │  ✓ Cheapest model that works — most requests stay on nano       │
75
+ │  ✓ Cost, latency, tokens — tracked on every call                │
76
+ │  ✓ Eval scores per model — data instead of gut feeling          │
77
+ │  ✓ Regressions caught — before deploy, not after                │
78
+ │  ✓ Recommendation — "use nano+mini, drop full, save $X/mo"      │
79
+ └─────────────────────────────────────────────────────────────────┘
80
+ ```
81
+
82
+ ## 30-second version
12
83
 
13
84
  ```ruby
14
85
  class ClassifyTicket < RubyLLM::Contract::Step::Base
15
- prompt do
16
- system "You are a support ticket classifier."
17
- rule "Return valid JSON only, no markdown."
18
- rule "Use exactly one priority: low, medium, high, urgent."
19
- example input: "My invoice is wrong", output: '{"priority": "high"}'
20
- user "{input}"
21
- end
86
+ prompt "Classify this support ticket by priority and category.\n\n{input}"
22
87
 
23
88
  output_schema do
24
89
  string :priority, enum: %w[low medium high urgent]
@@ -30,29 +95,45 @@ class ClassifyTicket < RubyLLM::Contract::Step::Base
30
95
  end
31
96
 
32
97
  result = ClassifyTicket.run("I was charged twice")
33
- result.ok? # => true
34
- result.parsed_output # => {priority: "high", category: "billing"}
35
- result.trace[:cost] # => 0.000032
36
- result.trace[:model] # => "gpt-4.1-nano"
98
+ result.parsed_output # => {priority: "high", category: "billing"}
99
+ result.trace[:model] # => "gpt-4.1-nano" (first model that passed)
100
+ result.trace[:cost] # => 0.000032
37
101
  ```
38
102
 
39
- Bad JSON? Auto-retry. Wrong value? Escalate to a smarter model. Schema violated? Caught client-side even if the provider ignores it. All with cost tracking.
103
+ Bad JSON? Retried automatically. Wrong answer? Escalated to a smarter model. Schema violated? Caught client-side. The contract guarantees every response meets your rules, and you pay for the cheapest model that passes.
40
104
 
41
- ## Start Here: Eval-First
105
+ ## Install
42
106
 
43
- The most powerful way to use this gem is simple:
107
+ ```ruby
108
+ gem "ruby_llm-contract"
109
+ ```
110
+
111
+ ```ruby
112
+ RubyLLM.configure { |c| c.openai_api_key = ENV["OPENAI_API_KEY"] }
113
+ RubyLLM::Contract.configure { |c| c.default_model = "gpt-4.1-mini" }
114
+ ```
115
+
116
+ Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
44
117
 
45
- - define evals before changing prompts
46
- - compare prompt versions on the same dataset
47
- - merge only when the eval stays green
118
+ ## Save money with model escalation
119
+
120
+ Without a contract, you use gpt-4.1 for everything because you can't tell when a cheaper model gets it wrong. With a contract, you start on nano and only escalate when the answer fails the contract:
121
+
122
+ ```ruby
123
+ retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
124
+ ```
48
125
 
49
- Read: [Eval-First](docs/guide/eval_first.md)
126
+ ```
127
+ Attempt 1: gpt-4.1-nano → contract failed ($0.0001)
128
+ Attempt 2: gpt-4.1-mini → contract passed ($0.0004)
129
+ gpt-4.1 → never called ($0.00)
130
+ ```
50
131
 
51
- This is the workflow that gives prompt engineering teeth. No vibes, no cherry-picked examples, no "it felt better in the playground". Just cases, regressions, baselines, and measured wins.
132
+ Most requests succeed on the cheapest model. You pay full price only for the ones that need it. How many? Run `compare_models` and find out.
52
133
 
53
- ## Which model should I use?
134
+ ## Know which model to use — with data
54
135
 
55
- Define test cases. Compare models. Get data.
136
+ Don't guess. Define test cases, compare models, get numbers:
56
137
 
57
138
  ```ruby
58
139
  ClassifyTicket.define_eval("regression") do
@@ -62,170 +143,127 @@ ClassifyTicket.define_eval("regression") do
62
143
  end
63
144
 
64
145
  comparison = ClassifyTicket.compare_models("regression",
65
- models: %w[gpt-4.1-nano gpt-4.1-mini])
146
+ models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1])
66
147
  ```
67
148
 
68
- Real output from real API calls:
69
-
70
149
  ```
71
- Model Score Cost Avg Latency
150
+ Candidate Score Cost Avg Latency
72
151
  ---------------------------------------------------------
73
- gpt-4.1-nano 0.67 $0.000032 687ms
74
- gpt-4.1-mini 1.00 $0.000102 1070ms
152
+ gpt-4.1-nano 0.67 $0.0001 48ms
153
+ gpt-4.1-mini 1.00 $0.0004 92ms
154
+ gpt-4.1 1.00 $0.0021 210ms
75
155
 
76
156
  Cheapest at 100%: gpt-4.1-mini
77
157
  ```
78
158
 
79
- ```ruby
80
- comparison.best_for(min_score: 0.95) # => "gpt-4.1-mini"
81
-
82
- # Inspect failures
83
- comparison.reports["gpt-4.1-nano"].failures.each do |f|
84
- puts "#{f.name}: expected #{f.expected}, got #{f.output}"
85
- puts " mismatches: #{f.mismatches}"
86
- # => outage: expected {priority: "urgent"}, got {priority: "high"}
87
- # mismatches: {priority: {expected: "urgent", got: "high"}}
88
- end
89
- ```
159
+ Nano fails on edge cases. Mini and full both score 100% — but mini is **5x cheaper**. Now you know.
90
160
 
91
- ## Pipeline
161
+ ## Let the gem tell you what to do
92
162
 
93
- Chain steps with fail-fast. Hallucination in step 1 stops before step 2 spends tokens.
163
+ Don't read tables; get a recommendation. Supports `model + reasoning_effort` combinations:
94
164
 
95
165
  ```ruby
96
- class TicketPipeline < RubyLLM::Contract::Pipeline::Base
97
- step ClassifyTicket, as: :classify
98
- step RouteToTeam, as: :route
99
- step DraftResponse, as: :draft
100
- end
101
-
102
- result = TicketPipeline.run("I was charged twice")
103
- result.ok? # => true
104
- result.outputs_by_step[:classify] # => {priority: "high", category: "billing"}
105
- result.trace.total_cost # => 0.000128
166
+ rec = ClassifyTicket.recommend("regression",
167
+ candidates: [
168
+ { model: "gpt-4.1-nano" },
169
+ { model: "gpt-4.1-mini" },
170
+ { model: "gpt-5-mini", reasoning_effort: "low" },
171
+ { model: "gpt-5-mini", reasoning_effort: "high" },
172
+ ],
173
+ min_score: 0.95
174
+ )
175
+
176
+ rec.best # => { model: "gpt-4.1-mini" }
177
+ rec.retry_chain # => [{ model: "gpt-4.1-nano" }, { model: "gpt-4.1-mini" }]
178
+ rec.to_dsl # => "retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]"
179
+ rec.savings # => savings vs your current model (if configured)
106
180
  ```
107
181
 
108
- ## CI gate
109
-
110
- ```ruby
111
- # RSpec — block merge if accuracy drops or cost spikes
112
- expect(ClassifyTicket).to pass_eval("regression")
113
- .with_context(model: "gpt-4.1-mini")
114
- .with_minimum_score(0.8)
115
- .with_maximum_cost(0.01)
116
-
117
- # Rake — run all evals across all steps
118
- require "ruby_llm/contract/rake_task"
119
- RubyLLM::Contract::RakeTask.new do |t|
120
- t.minimum_score = 0.8
121
- t.maximum_cost = 0.05
122
- end
123
- # bundle exec rake ruby_llm_contract:eval
124
- ```
182
+ Copy `rec.to_dsl` into your step. Done.
125
183
 
126
- ## Detect quality drops
184
+ ## Catch regressions before users do
127
185
 
128
- Save a baseline. Next run, see what regressed.
186
+ A model update silently dropped your accuracy? A prompt tweak broke an edge case? You'll know before deploying:
129
187
 
130
188
  ```ruby
189
+ # Save a baseline once:
131
190
  report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
132
191
  report.save_baseline!(model: "gpt-4.1-nano")
133
192
 
134
- # Later, after prompt change, model update, or provider weight shift:
135
- report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
136
- diff = report.compare_with_baseline(model: "gpt-4.1-nano")
137
-
138
- diff.regressed? # => true
139
- diff.regressions # => [{case: "outage", baseline: {passed: true}, current: {passed: false}}]
140
- diff.score_delta # => -0.33
141
- ```
142
-
143
- ```ruby
144
- # CI: block merge if any previously-passing case now fails
193
+ # In CI, block merge if anything regressed:
145
194
  expect(ClassifyTicket).to pass_eval("regression")
146
195
  .with_context(model: "gpt-4.1-nano")
147
196
  .without_regressions
148
197
  ```
149
198
 
150
- ## Track quality over time
151
-
152
199
  ```ruby
153
- # Save every eval run
154
- report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
155
- report.save_history!(model: "gpt-4.1-nano")
156
-
157
- # View trend
158
- history = report.eval_history(model: "gpt-4.1-nano")
159
- history.score_trend # => :stable_or_improving | :declining
160
- history.drift? # => true (score dropped > 10%)
200
+ diff = report.compare_with_baseline(model: "gpt-4.1-nano")
201
+ diff.regressed? # => true
202
+ diff.regressions # => [{case: "outage", baseline: {passed: true}, current: {passed: false}}]
203
+ diff.score_delta # => -0.33
161
204
  ```
162
205
 
163
- ## Run evals fast
164
-
165
- ```ruby
166
- # 4x faster with parallel execution
167
- report = ClassifyTicket.run_eval("regression",
168
- context: { model: "gpt-4.1-nano" },
169
- concurrency: 4)
170
- ```
206
+ No more "it worked in the playground". Regressions are caught in CI, not production.
171
207
 
172
- ## Prompt A/B testing
208
+ ## A/B test your prompts
173
209
 
174
- Changed a prompt? Compare old vs new with regression safety:
210
+ Changed a prompt? Compare old vs new on the same dataset with regression safety:
175
211
 
176
212
  ```ruby
177
213
  diff = ClassifyTicketV2.compare_with(ClassifyTicketV1,
178
214
  eval: "regression", model: "gpt-4.1-mini")
179
215
 
180
- diff.safe_to_switch? # => true (no regressions, no per-case score drops)
216
+ diff.safe_to_switch? # => true (no regressions)
181
217
  diff.improvements # => [{case: "outage", ...}]
182
218
  diff.score_delta # => +0.33
183
219
  ```
184
220
 
185
- Requires `model:` or `context: { adapter: ... }`.
186
- `compare_with` ignores `sample_response`; without a real model/adapter both sides are skipped and the A/B result is not meaningful.
187
-
188
- CI gate:
189
221
  ```ruby
222
+ # CI gate:
190
223
  expect(ClassifyTicketV2).to pass_eval("regression")
191
224
  .compared_with(ClassifyTicketV1)
192
225
  .with_minimum_score(0.8)
193
226
  ```
194
227
 
195
- ## Soft observations
228
+ ## Chain steps with fail-fast
196
229
 
197
- Log suspicious-but-not-invalid output without failing the contract:
230
+ Pipeline stops at the first contract failure — no wasted tokens on downstream steps:
198
231
 
199
232
  ```ruby
200
- class EvaluateComparative < RubyLLM::Contract::Step::Base
201
- validate("scores in range") { |o| (1..10).include?(o[:score_a]) }
202
- observe("scores should differ") { |o| o[:score_a] != o[:score_b] }
233
+ class TicketPipeline < RubyLLM::Contract::Pipeline::Base
234
+ step ClassifyTicket, as: :classify
235
+ step RouteToTeam, as: :route
236
+ step DraftResponse, as: :draft
203
237
  end
204
238
 
205
- result = EvaluateComparative.run(input)
206
- result.ok? # => true (observe never fails)
207
- result.observations # => [{description: "scores should differ", passed: false}]
239
+ result = TicketPipeline.run("I was charged twice")
240
+ result.outputs_by_step[:classify] # => {priority: "high", category: "billing"}
241
+ result.trace.total_cost # => $0.000128
208
242
  ```
209
243
 
210
- ## Predict cost before running
244
+ ## Gate merges on quality and cost
211
245
 
212
246
  ```ruby
213
- ClassifyTicket.estimate_eval_cost("regression", models: %w[gpt-4.1-nano gpt-4.1-mini])
214
- # => { "gpt-4.1-nano" => 0.000024, "gpt-4.1-mini" => 0.000096 }
247
+ # RSpec: block merge if accuracy drops or cost spikes
248
+ expect(ClassifyTicket).to pass_eval("regression")
249
+ .with_minimum_score(0.8)
250
+ .with_maximum_cost(0.01)
251
+
252
+ # Rake — run all evals across all steps
253
+ RubyLLM::Contract::RakeTask.new do |t|
254
+ t.minimum_score = 0.8
255
+ t.maximum_cost = 0.05
256
+ end
257
+ # bundle exec rake ruby_llm_contract:eval
215
258
  ```
216
259
 
217
- ## Install
260
+ ## Full power: data-driven retry chains
218
261
 
219
- ```ruby
220
- gem "ruby_llm-contract"
221
- ```
262
+ The pieces above — evals, compare_models, recommend — combine into a workflow that replaces guesswork with measured optimization. You define evals for your step, run `recommend` against all of them, find the eval that actually needs the strongest model, and build a retry chain where each attempt is as cheap as the data allows.
222
263
 
223
- ```ruby
224
- RubyLLM.configure { |c| c.openai_api_key = ENV["OPENAI_API_KEY"] }
225
- RubyLLM::Contract.configure { |c| c.default_model = "gpt-4.1-mini" }
226
- ```
264
+ The difference: instead of "gpt-5-mini seems to work, let's use it everywhere", you get "nano handles 4/6 scenarios, mini@low catches the 5th, full mini only fires on the hardest edge case — first attempt is 4× cheaper."
227
265
 
228
- Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
266
+ Full procedure with examples: **[Optimizing retry_policy](docs/guide/optimizing_retry_policy.md)**
229
267
 
230
268
  ## Docs
231
269
 
@@ -233,6 +271,7 @@ Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
233
271
  |-------|-|
234
272
  | [Getting Started](docs/guide/getting_started.md) | Features walkthrough, model escalation, eval |
235
273
  | [Eval-First](docs/guide/eval_first.md) | Practical workflow for prompt engineering with datasets, baselines, and A/B gates |
274
+ | [Optimizing retry_policy](docs/guide/optimizing_retry_policy.md) | Find the cheapest retry chain that passes all your evals |
236
275
  | [Best Practices](docs/guide/best_practices.md) | 6 patterns for bulletproof validates |
237
276
  | [Output Schema](docs/guide/output_schema.md) | Full schema reference + constraints |
238
277
  | [Pipeline](docs/guide/pipeline.md) | Multi-step composition, timeout, fail-fast |
@@ -241,13 +280,13 @@ Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
241
280
 
242
281
  ## Roadmap
243
282
 
244
- **v0.5 (current):** Data-driven prompt engineering — `compare_with(OtherStep)` for prompt A/B testing with regression safety. `observe` DSL for soft observations that log but never fail.
283
+ **v0.6 (current):** "What should I do?" — `Step.recommend` returns optimal model, reasoning effort, and retry chain. Per-attempt `reasoning_effort` in retry policies.
245
284
 
246
- **v0.4:** Observability & scale — eval history with trending, batch eval with concurrency, pipeline per-step eval, Minitest support, structured logging. Audit hardening (18 bugfixes).
285
+ **v0.5:** Prompt A/B testing with `compare_with`. Soft observations with `observe`.
247
286
 
248
- **v0.3:** Baseline regression detection, migration guide, production hardening.
287
+ **v0.4:** Eval history, batch concurrency, pipeline per-step eval, Minitest, structured logging.
249
288
 
250
- **v0.6:** Model recommendation based on eval history data. Cross-provider comparison docs.
289
+ **v0.3:** Baseline regression detection, migration guide.
251
290
 
252
291
  ## License
253
292
 
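Note on the escalation behaviour described in the README hunks above (start on the cheapest model, move up only when the contract fails): it can be sketched in plain Ruby. This illustrates the idea under assumptions and is not the gem's retry executor. `run_with_escalation`, `call_model`, and `valid` are hypothetical names, and the stubbed responses stand in for real provider calls.

```ruby
# Hypothetical escalation loop; call_model and valid are caller-supplied lambdas.
def run_with_escalation(models, call_model:, valid:)
  attempts = []
  models.each do |model|
    output = call_model.call(model)
    passed = valid.call(output)
    attempts << { model: model, passed: passed }
    # Stop at the first model whose output satisfies the contract.
    return { output: output, model: model, attempts: attempts } if passed
  end
  # Every model in the chain failed the contract.
  { output: nil, model: nil, attempts: attempts }
end

# Stubbed responses instead of real API calls. There is deliberately no entry
# for "gpt-4.1": the loop must stop before reaching it.
responses = {
  "gpt-4.1-nano" => { priority: "critical" }, # not an allowed priority
  "gpt-4.1-mini" => { priority: "high" }
}

result = run_with_escalation(
  %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1],
  call_model: ->(model) { responses.fetch(model) },
  valid: ->(out) { %w[low medium high urgent].include?(out[:priority]) }
)
```

In this stub the first attempt fails validation, the second passes, and the most expensive model is never called, which mirrors the "Attempt 1 / Attempt 2 / never called" trace in the README.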
@@ -70,18 +70,37 @@ module RubyLLM
70
70
  Eval::PromptDiff.new(candidate: my_report, baseline: other_report)
71
71
  end
72
72
 
73
- def compare_models(eval_name, models:, context: {})
73
+ def compare_models(eval_name, models: [], candidates: [], context: {})
74
+ raise ArgumentError, "Pass either models: or candidates:, not both" if models.any? && candidates.any?
75
+
74
76
  context = safe_context(context)
75
- models = models.uniq
76
- reports = models.each_with_object({}) do |model, hash|
77
- model_context = isolate_context(context).merge(model: model)
78
- hash[model] = run_single_eval(eval_name, model_context)
77
+ candidate_configs = normalize_candidates(models, candidates)
78
+
79
+ reports = {}
80
+ configs = {}
81
+ candidate_configs.each do |config|
82
+ label = Eval::ModelComparison.candidate_label(config)
83
+ model_context = isolate_context(context).merge(model: config[:model])
84
+ model_context[:reasoning_effort] = config[:reasoning_effort] if config[:reasoning_effort]
85
+ reports[label] = run_single_eval(eval_name, model_context)
86
+ configs[label] = config
79
87
  end
80
- Eval::ModelComparison.new(eval_name: eval_name, reports: reports)
88
+
89
+ Eval::ModelComparison.new(eval_name: eval_name, reports: reports, configs: configs)
81
90
  end
82
91
 
83
92
  private
84
93
 
94
+ def normalize_candidates(models, candidates)
95
+ if candidates.any?
96
+ candidates.map { |c| RubyLLM::Contract.normalize_candidate_config(c) }.uniq
97
+ elsif models.any?
98
+ models.uniq.map { |m| { model: m } }
99
+ else
100
+ raise ArgumentError, "Pass models: or candidates: with at least one entry"
101
+ end
102
+ end
103
+
85
104
  def comparison_context(context, model)
86
105
  base_context = safe_context(context)
87
106
  model ? base_context.merge(model: model) : base_context
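Note on the `compare_models` hunk above: the argument handling can be exercised in isolation with the sketch below. `resolve_candidates` mirrors the `normalize_candidates` logic from the diff, with the mutual-exclusion check folded in. The body of `RubyLLM::Contract.normalize_candidate_config` is not part of this diff, so the version here is a hypothetical stand-in that only accepts strings and `{ model:, reasoning_effort: }` hashes.

```ruby
# Hypothetical stand-in for RubyLLM::Contract.normalize_candidate_config,
# whose real implementation is not shown in this diff.
def normalize_candidate_config(candidate)
  case candidate
  when String
    { model: candidate }
  when Hash
    config = { model: candidate.fetch(:model) }
    config[:reasoning_effort] = candidate[:reasoning_effort] if candidate[:reasoning_effort]
    config
  else
    raise ArgumentError, "Unsupported candidate: #{candidate.inspect}"
  end
end

# Mirrors the models:/candidates: handling shown in the hunk.
def resolve_candidates(models: [], candidates: [])
  raise ArgumentError, "Pass either models: or candidates:, not both" if models.any? && candidates.any?

  if candidates.any?
    candidates.map { |c| normalize_candidate_config(c) }.uniq
  elsif models.any?
    models.uniq.map { |m| { model: m } }
  else
    raise ArgumentError, "Pass models: or candidates: with at least one entry"
  end
end

from_models = resolve_candidates(models: %w[gpt-4.1-nano gpt-4.1-nano gpt-4.1-mini])
from_candidates = resolve_candidates(candidates: [
  { model: "gpt-5-mini", reasoning_effort: "low" },
  { model: "gpt-5-mini", reasoning_effort: "low" }, # duplicate, removed by uniq
  { model: "gpt-5-mini", reasoning_effort: "high" }
])
```

The same model with a different `reasoning_effort` survives deduplication as a separate candidate, which is the behaviour the 0.6.0 changelog entry describes.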
@@ -4,11 +4,17 @@ module RubyLLM
4
4
  module Contract
5
5
  module Eval
6
6
  class ModelComparison
7
- attr_reader :eval_name, :reports
7
+ attr_reader :eval_name, :reports, :configs
8
8
 
9
- def initialize(eval_name:, reports:)
9
+ def self.candidate_label(config)
10
+ effort = config[:reasoning_effort]
11
+ effort ? "#{config[:model]} (effort: #{effort})" : config[:model]
12
+ end
13
+
14
+ def initialize(eval_name:, reports:, configs: nil)
10
15
  @eval_name = eval_name
11
- @reports = reports.dup.freeze # { "model_name" => Report }
16
+ @reports = reports.dup.freeze
17
+ @configs = (configs || default_configs_from_reports).freeze
12
18
  freeze
13
19
  end
14
20
 
@@ -16,12 +22,12 @@ module RubyLLM
16
22
  @reports.keys
17
23
  end
18
24
 
19
- def score_for(model)
20
- @reports[model]&.score
25
+ def score_for(candidate)
26
+ @reports[resolve_key(candidate)]&.score
21
27
  end
22
28
 
23
- def cost_for(model)
24
- @reports[model]&.total_cost
29
+ def cost_for(candidate)
30
+ @reports[resolve_key(candidate)]&.total_cost
25
31
  end
26
32
 
27
33
  def best_for(min_score: 0.0)
@@ -38,13 +44,14 @@ module RubyLLM
38
44
  end
39
45
 
40
46
  def table
41
- lines = [" Model Score Cost Avg Latency"]
42
- lines << " #{"-" * 57}"
47
+ max_label = [@reports.keys.map(&:length).max || 0, 25].max
48
+ lines = [format(" %-#{max_label}s Score Cost Avg Latency", "Candidate")]
49
+ lines << " #{"-" * (max_label + 36)}"
43
50
 
44
- @reports.each do |model, report|
51
+ @reports.each do |label, report|
45
52
  latency = report.avg_latency_ms ? "#{report.avg_latency_ms.round}ms" : "n/a"
46
53
  cost = report.total_cost.positive? ? "$#{format("%.4f", report.total_cost)}" : "n/a"
47
- lines << format(" %-25s %6.2f %10s %12s", model, report.score, cost, latency)
54
+ lines << format(" %-#{max_label}s %6.2f %10s %12s", label, report.score, cost, latency)
48
55
  end
49
56
 
50
57
  lines.join("\n")
@@ -70,10 +77,25 @@ module RubyLLM
70
77
  total_cost: report.total_cost,
71
78
  avg_latency_ms: report.avg_latency_ms,
72
79
  pass_rate: report.pass_rate,
80
+ pass_rate_ratio: report.pass_rate_ratio,
73
81
  passed: report.passed?
74
82
  }
75
83
  end
76
84
  end
85
+
86
+ private
87
+
88
+ def resolve_key(candidate)
89
+ case candidate
90
+ when String then candidate
91
+ when Hash then self.class.candidate_label(candidate)
92
+ else candidate.to_s
93
+ end
94
+ end
95
+
96
+ def default_configs_from_reports
97
+ @reports.each_with_object({}) { |(key, _), h| h[key] = { model: key } }
98
+ end
77
99
  end
78
100
  end
79
101
  end
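Note on the `ModelComparison` hunks above: the label and lookup logic can be tried out as standalone functions. `candidate_label` and `resolve_key` follow the diff. `label_width` is a hypothetical helper name for the `max_label` expression used in `table`.

```ruby
# Builds the report key for a candidate config, as in the diff.
def candidate_label(config)
  effort = config[:reasoning_effort]
  effort ? "#{config[:model]} (effort: #{effort})" : config[:model]
end

# Accepts a label string, a config hash, or anything responding to to_s.
def resolve_key(candidate)
  case candidate
  when String then candidate
  when Hash then candidate_label(candidate)
  else candidate.to_s
  end
end

# Column width for the table: the longest label, but never narrower than 25.
def label_width(labels)
  [labels.map(&:length).max || 0, 25].max
end

plain = candidate_label({ model: "gpt-4.1-nano" })
with_effort = candidate_label({ model: "gpt-5-mini", reasoning_effort: "low" })
```

Because `score_for` and `cost_for` go through `resolve_key`, callers can pass either the printed label or the original config hash and reach the same report.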