ruby_llm-contract 0.5.2 → 0.6.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +25 -2
- data/Gemfile.lock +2 -2
- data/README.md +167 -137
- data/lib/ruby_llm/contract/concerns/eval_host.rb +25 -6
- data/lib/ruby_llm/contract/eval/model_comparison.rb +33 -11
- data/lib/ruby_llm/contract/eval/recommendation.rb +48 -0
- data/lib/ruby_llm/contract/eval/recommender.rb +132 -0
- data/lib/ruby_llm/contract/eval/report.rb +2 -2
- data/lib/ruby_llm/contract/eval/report_stats.rb +6 -0
- data/lib/ruby_llm/contract/eval/report_storage.rb +18 -12
- data/lib/ruby_llm/contract/eval.rb +2 -0
- data/lib/ruby_llm/contract/step/base.rb +21 -0
- data/lib/ruby_llm/contract/step/retry_executor.rb +9 -3
- data/lib/ruby_llm/contract/step/retry_policy.rb +27 -14
- data/lib/ruby_llm/contract/version.rb +1 -1
- data/lib/ruby_llm/contract.rb +21 -0
- metadata +3 -1
checksums.yaml
CHANGED

```diff
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c86dcf06a34e5ff934367708213a0852191b0be7bd61e61e8577161e32ccf807
+  data.tar.gz: 28a44dd4f4d4c0f74c5da7800d9fa1fd2b8e51242a056f5aaae5f6f9fcf1be69
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c2b6dc71a02519288ae4ee4f74f17d74e6c45d01e888036d18e4c821b8fc63ba43ac0ccee6b2948163597b19c93008a6e5e0375fd65c8f3ad8fcf1285f356e91
+  data.tar.gz: f08576e520ec7397c4b233c5907ce703890cf4616e969e25f4c980ac1b06013a62ff680b3c90878d76fd1da0980f04d4946a2fd11a8e304e43f5155e5c855e8e
```
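The digest entries above are what RubyGems compares against the downloaded `metadata.gz` and `data.tar.gz` archives at install time. A minimal sketch of that comparison using Ruby's standard `Digest` library (the `checksum_ok?` helper name is ours, not a RubyGems API):

```ruby
require "digest"

# Compare the SHA-256 of some downloaded bytes against an expected digest,
# the way the checksums.yaml entries are used (illustrative sketch).
def checksum_ok?(data, expected_sha256)
  Digest::SHA256.hexdigest(data) == expected_sha256
end

# SHA-256 of the empty string, a well-known constant:
empty_digest = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
checksum_ok?("", empty_digest)         # => true
checksum_ok?("tampered", empty_digest) # => false
```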
data/CHANGELOG.md
CHANGED

````diff
@@ -1,5 +1,28 @@
 # Changelog
 
+## 0.6.0 (2026-04-12)
+
+"What should I do?" — model + configuration recommendation.
+
+### Features
+
+- **`Step.recommend`** — `ClassifyTicket.recommend("eval", candidates: [...], min_score: 0.95)` runs eval on all candidates and returns a `Recommendation` with optimal model, retry chain, rationale, savings vs current config, and `to_dsl` code output.
+- **Candidates as configurations** — `candidates:` accepts `{ model:, reasoning_effort: }` hashes, not just model name strings. `gpt-5-mini` with `reasoning_effort: "low"` is a different candidate than with `"high"`.
+- **`compare_models` extended** — new `candidates:` parameter alongside existing `models:` (backward compatible). Candidate labels include reasoning effort in output table.
+- **Per-attempt `reasoning_effort` in retry policies** — `escalate` accepts config hashes: `escalate({ model: "gpt-4.1-nano" }, { model: "gpt-5-mini", reasoning_effort: "high" })`. Each attempt gets its own reasoning_effort forwarded to the provider.
+- **`pass_rate_ratio`** — numeric float (0.0–1.0) on `Report` and `ReportStats`, complementing the string `pass_rate` (`"3/5"`).
+- **History entries enriched** — `save_history!` accepts `reasoning_effort:` and stores `model`, `reasoning_effort`, `pass_rate_ratio` in JSONL entries.
+
+### Game changer continuity
+
+```
+v0.2: "Which model?"            → compare_models (snapshot)
+v0.3: "Did it change?"          → baseline regression (binary)
+v0.4: "Show me the trend"       → eval history (time series)
+v0.5: "Which prompt is better?" → compare_with (A/B testing)
+v0.6: "What should I do?"       → recommend (actionable advice)
+```
+
 ## 0.5.2 (2026-04-06)
 
 ### Features
@@ -8,7 +31,7 @@
 
 ## 0.5.0 (2026-03-25)
 
-Data-Driven Prompt Engineering
+Data-Driven Prompt Engineering.
 
 ### Features
 
@@ -55,7 +78,7 @@ Audit hardening — 18 bugs fixed across 4 audit rounds.
 
 ## 0.4.3 (2026-03-24)
 
-Production feedback release
+Production feedback release.
 
 ### Features
 
````
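The per-attempt `escalate` configs described in the 0.6.0 notes imply a simple selection rule: attempt 1 uses the first config, attempt 2 the second, and attempts beyond the chain stick to the last entry. A standalone sketch under that assumption (`config_for_attempt` is a hypothetical helper, not the gem's internal API):

```ruby
# Escalation chain as the changelog shows it: plain hash per attempt,
# with reasoning_effort carried alongside the model where given.
ESCALATION = [
  { model: "gpt-4.1-nano" },
  { model: "gpt-5-mini", reasoning_effort: "high" }
].freeze

# Hypothetical helper: clamp the attempt number to the chain length so
# retries past the last config keep using the strongest candidate.
def config_for_attempt(chain, attempt)
  chain[[attempt - 1, chain.length - 1].min]
end

config_for_attempt(ESCALATION, 1) # => { model: "gpt-4.1-nano" }
config_for_attempt(ESCALATION, 2) # => { model: "gpt-5-mini", reasoning_effort: "high" }
```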
data/Gemfile.lock
CHANGED

```diff
@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    ruby_llm-contract (0.
+    ruby_llm-contract (0.6.0)
       dry-types (~> 1.7)
       ruby_llm (~> 1.0)
       ruby_llm-schema (~> 0.3)
@@ -258,7 +258,7 @@ CHECKSUMS
   rubocop-ast (1.49.1) sha256=4412f3ee70f6fe4546cc489548e0f6fcf76cafcfa80fa03af67098ffed755035
   ruby-progressbar (1.13.0) sha256=80fc9c47a9b640d6834e0dc7b3c94c9df37f08cb072b7761e4a71e22cff29b33
   ruby_llm (1.14.0) sha256=57c6f7034fc4a44504ea137d70f853b07824f1c1cdbe774ab3ab3522e7098deb
-  ruby_llm-contract (0.
+  ruby_llm-contract (0.6.0)
   ruby_llm-schema (0.3.0) sha256=a591edc5ca1b7f0304f0e2261de61ba4b3bea17be09f5cf7558153adfda3dec6
   ruby_parser (3.22.0) sha256=1eb4937cd9eb220aa2d194e352a24dba90aef00751e24c8dfffdb14000f15d23
   rubycritic (4.12.0) sha256=024fed90fe656fa939f6ea80aab17569699ac3863d0b52fd72cb99892247abc8
```
data/README.md
CHANGED

````diff
@@ -1,24 +1,89 @@
 # ruby_llm-contract
 
-
+The missing link between LLM cost and quality. Stop choosing between "cheap but wrong" and "accurate but expensive" — get both. Contracts, model escalation, and data-driven recommendations for [ruby_llm](https://github.com/crmne/ruby_llm).
 
-
+```
+YOU WRITE                      THE GEM HANDLES                  YOU GET
+─────────                      ───────────────                  ───────
+
+validate { |o| ... }           catch bad answers — combined     Zero garbage
+                               with retry_policy, auto-retry    in production
 
-
+retry_policy                   start cheap, escalate only       Pay for the cheapest
+  models: %w[nano mini full]   when validation fails            model that works
 
-
+max_cost 0.01                  estimate tokens, check price,    No surprise bills
+                               refuse before calling LLM
 
-
+output_schema { ... }          send JSON schema to provider,    Zero parsing code
+                               validate response client-side
+
+define_eval { ... }            test cases + baselines,          Regressions caught
+                               run in CI with real LLM          before deploy
+
+recommend(candidates: [...])   evaluate all configs, pick       Optimal model +
+                               cheapest that passes             retry chain
+```
+
+## Before and after
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│ BEFORE: pick one model, hope for the best                       │
+│                                                                 │
+│ expensive model → accurate, but you overpay on every call       │
+│ cheap model → fast, but wrong answers slip to production        │
+│ prompt change → "looks good to me" → deploy → users suffer      │
+└─────────────────────────────────────────────────────────────────┘
+
+                        ⬇ add ruby_llm-contract
+
+┌─────────────────────────────────────────────────────────────────┐
+│                      YOU DEFINE A CONTRACT                      │
+│                                                                 │
+│ output_schema { string :priority }       ← valid structure      │
+│ validate("valid priority") { |o| ... }   ← business rules       │
+│ retry_policy models: %w[nano mini full]  ← escalation chain     │
+│ max_cost 0.01                            ← budget cap           │
+└───────────────────────────┬─────────────────────────────────────┘
+                            │
+                            ▼
+┌─────────────────────────────────────────────────────────────────┐
+│                    THE GEM HANDLES THE REST                     │
+│                                                                 │
+│  request ──→ ┌──────┐      ┌──────────┐                         │
+│              │ nano │ ──→  │ contract │ ──→ ✓ pass → done       │
+│              └──────┘      └────┬─────┘                         │
+│                                 │ ✗ fail                        │
+│                                 ▼                               │
+│              ┌──────┐      ┌──────────┐                         │
+│              │ mini │ ──→  │ contract │ ──→ ✓ pass → done       │
+│              └──────┘      └────┬─────┘                         │
+│                                 │ ✗ fail                        │
+│                                 ▼                               │
+│              ┌──────┐      ┌──────────┐                         │
+│              │ full │ ──→  │ contract │ ──→ ✓ pass → done       │
+│              └──────┘      └──────────┘                         │
+└───────────────────────────┬─────────────────────────────────────┘
+                            │
+                            ▼
+┌─────────────────────────────────────────────────────────────────┐
+│ YOU GET                                                         │
+│                                                                 │
+│ ✓ Valid output guaranteed — schema + business rules enforced    │
+│ ✓ Cheapest model that works — most requests stay on nano        │
+│ ✓ Cost, latency, tokens — tracked on every call                 │
+│ ✓ Eval scores per model — data instead of gut feeling           │
+│ ✓ Regressions caught — before deploy, not after                 │
+│ ✓ Recommendation — "use nano+mini, drop full, save $X/mo"       │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+## 30-second version
 
 ```ruby
 class ClassifyTicket < RubyLLM::Contract::Step::Base
-  prompt
-    system "You are a support ticket classifier."
-    rule "Return valid JSON only, no markdown."
-    rule "Use exactly one priority: low, medium, high, urgent."
-    example input: "My invoice is wrong", output: '{"priority": "high"}'
-    user "{input}"
-  end
+  prompt "Classify this support ticket by priority and category.\n\n{input}"
 
   output_schema do
     string :priority, enum: %w[low medium high urgent]
@@ -30,29 +95,45 @@ class ClassifyTicket < RubyLLM::Contract::Step::Base
 end
 
 result = ClassifyTicket.run("I was charged twice")
-result.
-result.
-result.trace[:cost]
-result.trace[:model] # => "gpt-4.1-nano"
+result.parsed_output  # => {priority: "high", category: "billing"}
+result.trace[:model]  # => "gpt-4.1-nano" (first model that passed)
+result.trace[:cost]   # => 0.000032
 ```
 
-Bad JSON?
+Bad JSON? Retried automatically. Wrong answer? Escalated to a smarter model. Schema violated? Caught client-side. The contract guarantees every response meets your rules — you pay for the cheapest model that passes.
+
+## Install
+
+```ruby
+gem "ruby_llm-contract"
+```
 
-
+```ruby
+RubyLLM.configure { |c| c.openai_api_key = ENV["OPENAI_API_KEY"] }
+RubyLLM::Contract.configure { |c| c.default_model = "gpt-4.1-mini" }
+```
 
-
-- compare prompt versions on the same dataset
-- merge only when the eval stays green
+Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc.).
 
-
+## Save money with model escalation
 
-
+Without a contract, you use gpt-4.1 for everything because you can't tell when a cheaper model gets it wrong. With a contract, you start on nano and only escalate when the answer fails the contract:
 
-
+```ruby
+retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
+```
 
-
+```
+Attempt 1: gpt-4.1-nano → contract failed  ($0.0001)
+Attempt 2: gpt-4.1-mini → contract passed  ($0.0004)
+           gpt-4.1      → never called     ($0.00)
+```
 
-
+Most requests succeed on the cheapest model. You pay full price only for the ones that need it. How many? Run `compare_models` and find out.
+
+## Know which model to use — with data
+
+Don't guess. Define test cases, compare models, get numbers:
 
 ```ruby
 ClassifyTicket.define_eval("regression") do
@@ -62,171 +143,120 @@ ClassifyTicket.define_eval("regression")
 end
 
 comparison = ClassifyTicket.compare_models("regression",
-  models: %w[gpt-4.1-nano gpt-4.1-mini])
+  models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1])
 ```
 
-Real output from real API calls:
-
 ```
-
+Candidate      Score   Cost      Avg Latency
 ---------------------------------------------------------
-gpt-4.1-nano 0.67 $0.
-gpt-4.1-mini 1.00 $0.
+gpt-4.1-nano   0.67    $0.0001   48ms
+gpt-4.1-mini   1.00    $0.0004   92ms
+gpt-4.1        1.00    $0.0021   210ms
 
 Cheapest at 100%: gpt-4.1-mini
 ```
 
-
-comparison.best_for(min_score: 0.95) # => "gpt-4.1-mini"
-
-# Inspect failures
-comparison.reports["gpt-4.1-nano"].failures.each do |f|
-  puts "#{f.name}: expected #{f.expected}, got #{f.output}"
-  puts "  mismatches: #{f.mismatches}"
-  # => outage: expected {priority: "urgent"}, got {priority: "high"}
-  #    mismatches: {priority: {expected: "urgent", got: "high"}}
-end
-```
+Nano fails on edge cases. Mini and full both score 100% — but mini is **5x cheaper**. Now you know.
 
-##
+## Let the gem tell you what to do
 
-
+Don't read tables — get a recommendation. Supports `model + reasoning_effort` combinations:
 
 ```ruby
-
-
-
-
-
-
-
-
-
-
+rec = ClassifyTicket.recommend("regression",
+  candidates: [
+    { model: "gpt-4.1-nano" },
+    { model: "gpt-4.1-mini" },
+    { model: "gpt-5-mini", reasoning_effort: "low" },
+    { model: "gpt-5-mini", reasoning_effort: "high" },
+  ],
+  min_score: 0.95
+)
+
+rec.best        # => { model: "gpt-4.1-mini" }
+rec.retry_chain # => [{ model: "gpt-4.1-nano" }, { model: "gpt-4.1-mini" }]
+rec.to_dsl      # => "retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]"
+rec.savings     # => savings vs your current model (if configured)
 ```
 
-
-
-```ruby
-# RSpec — block merge if accuracy drops or cost spikes
-expect(ClassifyTicket).to pass_eval("regression")
-  .with_context(model: "gpt-4.1-mini")
-  .with_minimum_score(0.8)
-  .with_maximum_cost(0.01)
-
-# Rake — run all evals across all steps
-require "ruby_llm/contract/rake_task"
-RubyLLM::Contract::RakeTask.new do |t|
-  t.minimum_score = 0.8
-  t.maximum_cost = 0.05
-end
-# bundle exec rake ruby_llm_contract:eval
-```
+Copy `rec.to_dsl` into your step. Done.
 
-##
+## Catch regressions before users do
 
-
+A model update silently dropped your accuracy? A prompt tweak broke an edge case? You'll know before deploying:
 
 ```ruby
+# Save a baseline once:
 report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
 report.save_baseline!(model: "gpt-4.1-nano")
 
-#
-report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
-diff = report.compare_with_baseline(model: "gpt-4.1-nano")
-
-diff.regressed? # => true
-diff.regressions # => [{case: "outage", baseline: {passed: true}, current: {passed: false}}]
-diff.score_delta # => -0.33
-```
-
-```ruby
-# CI: block merge if any previously-passing case now fails
+# In CI — block merge if anything regressed:
 expect(ClassifyTicket).to pass_eval("regression")
   .with_context(model: "gpt-4.1-nano")
   .without_regressions
 ```
 
-## Track quality over time
-
 ```ruby
-
-
-
-
-# View trend
-history = report.eval_history(model: "gpt-4.1-nano")
-history.score_trend # => :stable_or_improving | :declining
-history.drift? # => true (score dropped > 10%)
+diff = report.compare_with_baseline(model: "gpt-4.1-nano")
+diff.regressed?  # => true
+diff.regressions # => [{case: "outage", baseline: {passed: true}, current: {passed: false}}]
+diff.score_delta # => -0.33
 ```
 
-
+No more "it worked in the playground". Regressions are caught in CI, not production.
 
-
-# 4x faster with parallel execution
-report = ClassifyTicket.run_eval("regression",
-  context: { model: "gpt-4.1-nano" },
-  concurrency: 4)
-```
-
-## Prompt A/B testing
+## A/B test your prompts
 
-Changed a prompt? Compare old vs new with regression safety:
+Changed a prompt? Compare old vs new on the same dataset with regression safety:
 
 ```ruby
 diff = ClassifyTicketV2.compare_with(ClassifyTicketV1,
   eval: "regression", model: "gpt-4.1-mini")
 
-diff.safe_to_switch? # => true (no regressions
+diff.safe_to_switch? # => true (no regressions)
 diff.improvements    # => [{case: "outage", ...}]
 diff.score_delta     # => +0.33
 ```
 
-Requires `model:` or `context: { adapter: ... }`.
-`compare_with` ignores `sample_response`; without a real model/adapter both sides are skipped and the A/B result is not meaningful.
-
-CI gate:
 ```ruby
+# CI gate:
 expect(ClassifyTicketV2).to pass_eval("regression")
   .compared_with(ClassifyTicketV1)
   .with_minimum_score(0.8)
 ```
 
-##
+## Chain steps with fail-fast
 
-
+Pipeline stops at the first contract failure — no wasted tokens on downstream steps:
 
 ```ruby
-class
-
-
+class TicketPipeline < RubyLLM::Contract::Pipeline::Base
+  step ClassifyTicket, as: :classify
+  step RouteToTeam, as: :route
+  step DraftResponse, as: :draft
 end
 
-result =
-result.
-result.
+result = TicketPipeline.run("I was charged twice")
+result.outputs_by_step[:classify] # => {priority: "high", category: "billing"}
+result.trace.total_cost           # => $0.000128
 ```
 
-##
+## Gate merges on quality and cost
 
 ```ruby
-
-
-
-
-## Install
-
-```ruby
-gem "ruby_llm-contract"
-```
+# RSpec — block merge if accuracy drops or cost spikes
+expect(ClassifyTicket).to pass_eval("regression")
+  .with_minimum_score(0.8)
+  .with_maximum_cost(0.01)
 
-
-RubyLLM.
-
+# Rake — run all evals across all steps
+RubyLLM::Contract::RakeTask.new do |t|
+  t.minimum_score = 0.8
+  t.maximum_cost = 0.05
+end
+# bundle exec rake ruby_llm_contract:eval
 ```
 
-Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
-
 ## Docs
 
 | Guide | |
@@ -241,13 +271,13 @@ Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
 
 ## Roadmap
 
-**v0.
+**v0.6 (current):** "What should I do?" — `Step.recommend` returns optimal model, reasoning effort, and retry chain. Per-attempt `reasoning_effort` in retry policies.
 
-**v0.
+**v0.5:** Prompt A/B testing with `compare_with`. Soft observations with `observe`.
 
-**v0.
+**v0.4:** Eval history, batch concurrency, pipeline per-step eval, Minitest, structured logging.
 
-**v0.
+**v0.3:** Baseline regression detection, migration guide.
 
 ## License
````
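The escalation economics the README claims can be checked with quick arithmetic on the sample prices from its comparison table (illustrative numbers only; `expected_cost` is our sketch, not a gem API):

```ruby
# Expected cost per call under a nano → mini escalation chain:
# every call pays for nano, and only the failures also pay for mini.
def expected_cost(nano_pass_rate, nano: 0.0001, mini: 0.0004)
  nano + (1 - nano_pass_rate) * mini
end

always_full = 0.0021                 # flat gpt-4.1 price from the table
escalated   = expected_cost(0.67)    # nano passes 67% of the time

# Savings at 10,000 calls/month — the "save $X/mo" figure the README alludes to.
monthly_savings_at_10k = ((always_full - escalated) * 10_000).round(2)
```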
data/lib/ruby_llm/contract/concerns/eval_host.rb
CHANGED

```diff
@@ -70,18 +70,37 @@ module RubyLLM
         Eval::PromptDiff.new(candidate: my_report, baseline: other_report)
       end
 
-      def compare_models(eval_name, models
+      def compare_models(eval_name, models: [], candidates: [], context: {})
+        raise ArgumentError, "Pass either models: or candidates:, not both" if models.any? && candidates.any?
+
         context = safe_context(context)
-
-
-
-
+        candidate_configs = normalize_candidates(models, candidates)
+
+        reports = {}
+        configs = {}
+        candidate_configs.each do |config|
+          label = Eval::ModelComparison.candidate_label(config)
+          model_context = isolate_context(context).merge(model: config[:model])
+          model_context[:reasoning_effort] = config[:reasoning_effort] if config[:reasoning_effort]
+          reports[label] = run_single_eval(eval_name, model_context)
+          configs[label] = config
         end
-
+
+        Eval::ModelComparison.new(eval_name: eval_name, reports: reports, configs: configs)
       end
 
       private
 
+      def normalize_candidates(models, candidates)
+        if candidates.any?
+          candidates.map { |c| RubyLLM::Contract.normalize_candidate_config(c) }.uniq
+        elsif models.any?
+          models.uniq.map { |m| { model: m } }
+        else
+          raise ArgumentError, "Pass models: or candidates: with at least one entry"
+        end
+      end
+
       def comparison_context(context, model)
         base_context = safe_context(context)
         model ? base_context.merge(model: model) : base_context
```
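The normalization step in `compare_models` can be sketched standalone. Since `RubyLLM::Contract.normalize_candidate_config` is not shown in this chunk of the diff, a plausible stand-in is inlined below — an assumption, not the gem's implementation:

```ruby
# Sketch of the normalization compare_models performs: bare model names
# become { model: } hashes, candidate hashes keep model + reasoning_effort,
# and duplicates collapse via uniq.
def normalize_candidates(models, candidates)
  if candidates.any?
    # Stand-in for RubyLLM::Contract.normalize_candidate_config (assumed shape).
    candidates.map { |c| { model: c[:model], reasoning_effort: c[:reasoning_effort] }.compact }.uniq
  elsif models.any?
    models.uniq.map { |m| { model: m } }
  else
    raise ArgumentError, "Pass models: or candidates: with at least one entry"
  end
end

normalize_candidates(%w[nano nano mini], [])
# => [{ model: "nano" }, { model: "mini" }]
```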
data/lib/ruby_llm/contract/eval/model_comparison.rb
CHANGED

```diff
@@ -4,11 +4,17 @@ module RubyLLM
   module Contract
     module Eval
       class ModelComparison
-        attr_reader :eval_name, :reports
+        attr_reader :eval_name, :reports, :configs
 
-        def
+        def self.candidate_label(config)
+          effort = config[:reasoning_effort]
+          effort ? "#{config[:model]} (effort: #{effort})" : config[:model]
+        end
+
+        def initialize(eval_name:, reports:, configs: nil)
           @eval_name = eval_name
-          @reports = reports.dup.freeze
+          @reports = reports.dup.freeze
+          @configs = (configs || default_configs_from_reports).freeze
           freeze
         end
 
@@ -16,12 +22,12 @@ module RubyLLM
           @reports.keys
         end
 
-        def score_for(
-          @reports[
+        def score_for(candidate)
+          @reports[resolve_key(candidate)]&.score
         end
 
-        def cost_for(
-          @reports[
+        def cost_for(candidate)
+          @reports[resolve_key(candidate)]&.total_cost
         end
 
         def best_for(min_score: 0.0)
@@ -38,13 +44,14 @@ module RubyLLM
         end
 
         def table
-
-          lines
+          max_label = [@reports.keys.map(&:length).max || 0, 25].max
+          lines = [format(" %-#{max_label}s Score Cost Avg Latency", "Candidate")]
+          lines << " #{"-" * (max_label + 36)}"
 
-          @reports.each do |
+          @reports.each do |label, report|
             latency = report.avg_latency_ms ? "#{report.avg_latency_ms.round}ms" : "n/a"
             cost = report.total_cost.positive? ? "$#{format("%.4f", report.total_cost)}" : "n/a"
-            lines << format("
+            lines << format(" %-#{max_label}s %6.2f %10s %12s", label, report.score, cost, latency)
           end
 
           lines.join("\n")
@@ -70,10 +77,25 @@ module RubyLLM
             total_cost: report.total_cost,
             avg_latency_ms: report.avg_latency_ms,
             pass_rate: report.pass_rate,
+            pass_rate_ratio: report.pass_rate_ratio,
             passed: report.passed?
           }
         end
       end
+
+        private
+
+        def resolve_key(candidate)
+          case candidate
+          when String then candidate
+          when Hash then self.class.candidate_label(candidate)
+          else candidate.to_s
+          end
+        end
+
+        def default_configs_from_reports
+          @reports.each_with_object({}) { |(key, _), h| h[key] = { model: key } }
+        end
     end
   end
 end
```
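The labeling rule `candidate_label` applies is small enough to restate as a standalone function, taken directly from the method above:

```ruby
# A candidate's label is its model name, with the reasoning effort
# appended when one is configured — so "gpt-5-mini" at low and high
# effort show up as distinct rows in the comparison table.
def candidate_label(config)
  effort = config[:reasoning_effort]
  effort ? "#{config[:model]} (effort: #{effort})" : config[:model]
end

candidate_label({ model: "gpt-4.1-nano" })
# => "gpt-4.1-nano"
candidate_label({ model: "gpt-5-mini", reasoning_effort: "low" })
# => "gpt-5-mini (effort: low)"
```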
data/lib/ruby_llm/contract/eval/recommendation.rb
ADDED

```ruby
# frozen_string_literal: true

module RubyLLM
  module Contract
    module Eval
      class Recommendation
        include Concerns::DeepFreeze

        attr_reader :best, :retry_chain, :score, :cost_per_call,
                    :rationale, :current_config, :savings, :warnings

        def initialize(best:, retry_chain:, score:, cost_per_call:,
                       rationale:, current_config:, savings:, warnings:)
          @best = deep_dup_freeze(best)
          @retry_chain = deep_dup_freeze(retry_chain)
          @score = score
          @cost_per_call = cost_per_call
          @rationale = deep_dup_freeze(rationale)
          @current_config = deep_dup_freeze(current_config)
          @savings = deep_dup_freeze(savings)
          @warnings = deep_dup_freeze(warnings)
          freeze
        end

        def to_dsl
          return "# No recommendation — no candidate met the minimum score" if retry_chain.empty?

          if retry_chain.length == 1 && retry_chain.first.keys == [:model]
            "model \"#{retry_chain.first[:model]}\""
          elsif retry_chain.all? { |c| c.keys == [:model] }
            models_str = retry_chain.map { |c| c[:model] }.join(" ")
            "retry_policy models: %w[#{models_str}]"
          else
            args = retry_chain.map { |c| config_to_ruby(c) }.join(",\n ")
            "retry_policy do\n escalate(#{args})\nend"
          end
        end

        private

        def config_to_ruby(config)
          pairs = config.map { |k, v| "#{k}: #{v.inspect}" }.join(", ")
          "{ #{pairs} }"
        end
      end
    end
  end
end
```
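`to_dsl` picks one of three output shapes depending on the retry chain. The branching can be exercised with a standalone copy of the method (a sketch; only the whitespace inside the `escalate` branch differs cosmetically from the class above):

```ruby
# Standalone copy of Recommendation#to_dsl taking the chain as an argument:
# one plain model, a %w[] chain of plain models, or an escalate block when
# any config carries more than a model name (e.g. reasoning_effort).
def to_dsl(retry_chain)
  return "# No recommendation — no candidate met the minimum score" if retry_chain.empty?

  if retry_chain.length == 1 && retry_chain.first.keys == [:model]
    "model \"#{retry_chain.first[:model]}\""
  elsif retry_chain.all? { |c| c.keys == [:model] }
    "retry_policy models: %w[#{retry_chain.map { |c| c[:model] }.join(" ")}]"
  else
    args = retry_chain.map { |c| "{ #{c.map { |k, v| "#{k}: #{v.inspect}" }.join(", ")} }" }.join(",\n  ")
    "retry_policy do\n  escalate(#{args})\nend"
  end
end

to_dsl([{ model: "gpt-4.1-mini" }])
# => "model \"gpt-4.1-mini\""
to_dsl([{ model: "gpt-4.1-nano" }, { model: "gpt-4.1-mini" }])
# => "retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]"
```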
@@ -0,0 +1,132 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module RubyLLM
|
|
4
|
+
module Contract
|
|
5
|
+
module Eval
|
|
6
|
+
class Recommender
|
|
7
|
+
def initialize(comparison:, min_score:, min_first_try_pass_rate: 0.8, current_config: nil)
|
|
8
|
+
@comparison = comparison
|
|
9
|
+
@min_score = min_score
|
|
10
|
+
@min_first_try_pass_rate = min_first_try_pass_rate
|
|
11
|
+
@current_config = current_config
|
|
12
|
+
end
|
|
13
|
+
|
|
14
|
+
def recommend
|
|
15
|
+
scored = build_scored_candidates
|
|
16
|
+
best = select_best(scored)
|
|
17
|
+
chain = build_retry_chain(scored, best)
|
|
18
|
+
rationale = build_rationale(scored, best)
|
|
19
|
+
warnings = build_warnings(scored)
|
|
20
|
+
savings = best ? calculate_savings(best) : {}
|
|
21
|
+
|
|
22
|
+
Recommendation.new(
|
|
23
|
+
best: best&.dig(:config),
|
|
24
|
+
retry_chain: chain,
|
|
25
|
+
score: best&.dig(:score) || 0.0,
|
|
26
|
+
cost_per_call: best&.dig(:cost_per_call) || 0.0,
|
|
27
|
+
rationale: rationale,
|
|
28
|
+
current_config: @current_config,
|
|
29
|
+
savings: savings,
|
|
30
|
+
warnings: warnings
|
|
31
|
+
)
|
|
32
|
+
end
|
|
33
|
+
|
|
34
|
+
private
|
|
35
|
+
|
|
36
|
+
def build_scored_candidates
|
|
37
|
+
@comparison.configs.filter_map do |label, config|
|
|
38
|
+
report = @comparison.reports[label]
|
|
39
|
+
next nil unless report
|
|
40
|
+
|
|
41
|
+
evaluated_count = report.results.count { |r| r.step_status != :skipped }
|
|
42
|
+
cases_count = [evaluated_count, 1].max
|
|
43
|
+
cost_per_call = report.total_cost.to_f / cases_count
|
|
44
|
+
|
|
45
|
+
{
|
|
46
|
+
label: label,
|
|
47
|
+
config: config,
|
|
48
|
+
score: report.score,
|
|
49
|
+
cost_per_call: cost_per_call,
|
|
50
|
+
latency: report.avg_latency_ms || Float::INFINITY,
|
|
51
|
+
pass_rate_ratio: report.pass_rate_ratio,
|
|
52
|
+
total_cost: report.total_cost
|
|
53
|
+
}
|
|
54
|
+
end
|
|
55
|
+
end
|
|
56
|
+
|
|
57
|
+
def select_best(scored)
|
|
58
|
+
eligible = scored.select { |s| s[:score] >= @min_score && cost_known?(s) }
|
|
59
|
+
eligible.min_by { |s| [s[:cost_per_call], s[:latency], s[:label]] }
|
|
60
|
+
end
|
|
61
|
+
|
|
62
|
+
def build_retry_chain(scored, best)
|
|
63
|
+
return [] unless best
|
|
64
|
+
|
|
65
|
+
first_try = scored
|
|
66
|
+
.select { |s| s[:pass_rate_ratio] >= @min_first_try_pass_rate && cost_known?(s) }
|
|
67
|
+
.min_by { |s| [s[:cost_per_call], s[:latency], s[:label]] }
|
|
68
|
+
|
|
69
|
+
if first_try.nil? || first_try[:label] == best[:label]
|
|
70
|
+
[best[:config]]
|
|
71
|
+
else
|
|
72
|
+
[first_try[:config], best[:config]]
|
|
73
|
+
end
|
|
74
|
+
end
|
|
75
|
+
|
|
76
|
+
+      def build_rationale(scored, best)
+        sorted = scored.sort_by { |s| [cost_known?(s) ? 0 : 1, s[:cost_per_call], s[:latency], s[:label]] }
+        sorted.map { |s| rationale_line(s, best) }
+      end
+
+      def rationale_line(candidate, best)
+        cost_str = cost_known?(candidate) ? "$#{format("%.4f", candidate[:cost_per_call])}/call" : "unknown pricing"
+        header = "#{candidate[:label]}, score #{format("%.2f", candidate[:score])}, at #{cost_str}"
+        notes = rationale_notes(candidate, best)
+        notes.any? ? "#{header} — #{notes.join(", ")}" : header
+      end
+
+      def rationale_notes(candidate, best)
+        notes = []
+        pass_pct = (candidate[:pass_rate_ratio] * 100).round
+        below_threshold = candidate[:score] < @min_score
+
+        if below_threshold && candidate[:pass_rate_ratio] >= @min_first_try_pass_rate
+          notes << "below #{@min_score} threshold, but good first-try (#{pass_pct}% pass rate)"
+        elsif below_threshold
+          notes << "below #{@min_score} threshold"
+        elsif candidate[:pass_rate_ratio] < 1.0
+          notes << "#{pass_pct}% pass rate"
+        end
+        notes << "recommended" if best && candidate[:label] == best[:label]
+        notes << "unknown pricing" unless cost_known?(candidate)
+        notes
+      end
+
+      def build_warnings(scored)
+        scored.reject { |s| cost_known?(s) }
+              .map { |s| "#{s[:label]}: unknown pricing — cost ranking may be inaccurate" }
+      end
+
+      def calculate_savings(best)
+        return {} unless @current_config
+
+        current_label = ModelComparison.candidate_label(@current_config)
+        current_report = @comparison.reports[current_label]
+        return {} unless current_report
+
+        current_evaluated = current_report.results.count { |r| r.step_status != :skipped }
+        current_cases = [current_evaluated, 1].max
+        current_cost = current_report.total_cost.to_f / current_cases
+        diff = current_cost - best[:cost_per_call]
+        return {} unless diff.positive?
+
+        { per_call: diff.round(6), monthly_at: { 10_000 => (diff * 10_000).round(2) } }
+      end
+
+      def cost_known?(scored_candidate)
+        scored_candidate[:cost_per_call]&.positive?
+      end
+    end
+  end
+end
+end
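The rationale ordering above relies on Ruby's array sort keys: a 0/1 flag puts candidates with known pricing first, then cost per call, latency, and label break ties. A minimal standalone sketch of that comparator, with made-up candidate data (not the gem's real structures):

```ruby
# Candidates with known pricing sort first (flag 0 before 1), then by cost
# per call, latency, and label. Data below is illustrative only.
candidates = [
  { label: "model-c", cost_per_call: nil,    latency: 120 },
  { label: "model-a", cost_per_call: 0.0004, latency: 300 },
  { label: "model-b", cost_per_call: 0.0002, latency: 250 }
]

cost_known = ->(c) { !c[:cost_per_call].nil? && c[:cost_per_call].positive? }

sorted = candidates.sort_by do |c|
  # nil cost maps to 0.0 so every tuple position stays comparable
  [cost_known.call(c) ? 0 : 1, c[:cost_per_call] || 0.0, c[:latency], c[:label]]
end
```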
data/lib/ruby_llm/contract/eval/report.rb
CHANGED
@@ -14,8 +14,8 @@ module RubyLLM
       HISTORY_DIR = ".eval_history"
       BASELINE_DIR = ".eval_baselines"
 
-      def_delegators :@stats, :score, :passed, :failed, :skipped, :failures, :pass_rate, :
-                     :passed?
+      def_delegators :@stats, :score, :passed, :failed, :skipped, :failures, :pass_rate, :pass_rate_ratio,
+                     :total_cost, :avg_latency_ms, :passed?
       def_delegators :@presenter, :summary, :to_s, :print_summary
       def_delegators :@storage, :save_history!, :eval_history, :save_baseline!, :compare_with_baseline,
                      :baseline_exists?
data/lib/ruby_llm/contract/eval/report_stats.rb
CHANGED
@@ -35,6 +35,12 @@ module RubyLLM
         "#{passed}/#{evaluated_results.length}"
       end
 
+      def pass_rate_ratio
+        return 0.0 if evaluated_results.empty?
+
+        passed.to_f / evaluated_results.length
+      end
+
       def total_cost
         @results.sum { |result| result.cost || 0.0 }
       end
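`pass_rate_ratio` is the numeric counterpart of the `"3/4"`-style `pass_rate` string: the fraction of evaluated (non-skipped) cases that passed, with an empty-set guard. A minimal standalone sketch of that computation:

```ruby
# Fraction of non-skipped results that passed; 0.0 when nothing was evaluated.
results = %i[passed passed failed skipped]
evaluated = results.reject { |r| r == :skipped }
ratio = evaluated.empty? ? 0.0 : evaluated.count(:passed).to_f / evaluated.length
```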
data/lib/ruby_llm/contract/eval/report_storage.rb
CHANGED
@@ -13,25 +13,29 @@ module RubyLLM
         @stats = stats
       end
 
-      def save_history!(path: nil, model: nil)
-        file = path || storage_path(Report::HISTORY_DIR, "jsonl", model: model)
-
+      def save_history!(path: nil, model: nil, reasoning_effort: nil)
+        file = path || storage_path(Report::HISTORY_DIR, "jsonl", model: model, reasoning_effort: reasoning_effort)
+        entry = history_entry
+        entry[:model] = model if model
+        entry[:reasoning_effort] = reasoning_effort if reasoning_effort
+        EvalHistory.append(file, entry)
         file
       end
 
-      def eval_history(path: nil, model: nil)
-        EvalHistory.load(path || storage_path(Report::HISTORY_DIR, "jsonl", model: model
+      def eval_history(path: nil, model: nil, reasoning_effort: nil)
+        EvalHistory.load(path || storage_path(Report::HISTORY_DIR, "jsonl", model: model,
+                                              reasoning_effort: reasoning_effort))
       end
 
-      def save_baseline!(path: nil, model: nil)
-        file = path || storage_path(Report::BASELINE_DIR, "json", model: model)
+      def save_baseline!(path: nil, model: nil, reasoning_effort: nil)
+        file = path || storage_path(Report::BASELINE_DIR, "json", model: model, reasoning_effort: reasoning_effort)
         FileUtils.mkdir_p(File.dirname(file))
         File.write(file, JSON.pretty_generate(serialize_for_baseline))
         file
       end
 
-      def compare_with_baseline(path: nil, model: nil)
-        file = path || storage_path(Report::BASELINE_DIR, "json", model: model)
+      def compare_with_baseline(path: nil, model: nil, reasoning_effort: nil)
+        file = path || storage_path(Report::BASELINE_DIR, "json", model: model, reasoning_effort: reasoning_effort)
         raise ArgumentError, "No baseline found at #{file}" unless File.exist?(file)
 
         baseline_data = JSON.parse(File.read(file), symbolize_names: true)
@@ -43,8 +47,8 @@ module RubyLLM
         )
       end
 
-      def baseline_exists?(path: nil, model: nil)
-        File.exist?(path || storage_path(Report::BASELINE_DIR, "json", model: model))
+      def baseline_exists?(path: nil, model: nil, reasoning_effort: nil)
+        File.exist?(path || storage_path(Report::BASELINE_DIR, "json", model: model, reasoning_effort: reasoning_effort))
       end
 
       private
@@ -55,6 +59,7 @@ module RubyLLM
           score: @stats.score,
           total_cost: @stats.total_cost,
           pass_rate: @stats.pass_rate,
+          pass_rate_ratio: @stats.pass_rate_ratio,
           cases_count: @stats.evaluated_results_count
         }
       end
@@ -79,12 +84,13 @@ module RubyLLM
         }
       end
 
-      def storage_path(root_dir, extension, model:)
+      def storage_path(root_dir, extension, model:, reasoning_effort: nil)
         parts = [root_dir]
         parts << sanitize_name(@report.step_name) if @report.step_name
 
         dataset_name = sanitize_name(@report.dataset_name)
         dataset_name = "#{dataset_name}_#{sanitize_name(model)}" if model
+        dataset_name = "#{dataset_name}_effort_#{sanitize_name(reasoning_effort)}" if reasoning_effort
 
         File.join(*parts, "#{dataset_name}.#{extension}")
       end
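The new `reasoning_effort` suffix extends the existing file-naming scheme, so histories and baselines for the same model at different effort levels land in separate files. A simplified sketch of the naming logic (sanitization and the step-name subdirectory are omitted; names are illustrative):

```ruby
# Builds e.g. ".eval_baselines/extract_gpt-x_effort_low.json".
def storage_path(root_dir, dataset_name, extension, model: nil, reasoning_effort: nil)
  name = dataset_name
  name = "#{name}_#{model}" if model
  name = "#{name}_effort_#{reasoning_effort}" if reasoning_effort
  File.join(root_dir, "#{name}.#{extension}")
end
```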
data/lib/ruby_llm/contract/concerns/eval_host.rb
CHANGED
@@ -49,6 +49,16 @@ module RubyLLM
         end
       end
 
+      def recommend(eval_name, candidates:, min_score: 0.95, min_first_try_pass_rate: 0.8, context: {})
+        comparison = compare_models(eval_name, candidates: candidates, context: context)
+        Eval::Recommender.new(
+          comparison: comparison,
+          min_score: min_score,
+          min_first_try_pass_rate: min_first_try_pass_rate,
+          current_config: current_model_config
+        ).recommend
+      end
+
       KNOWN_CONTEXT_KEYS = %i[adapter model temperature max_tokens provider assume_model_exists reasoning_effort].freeze
 
       include Concerns::ContextHelpers
@@ -144,6 +154,17 @@ module RubyLLM
         }
       end
 
+      def current_model_config
+        policy = retry_policy
+        if policy && policy.config_list.any?
+          policy.config_list.first
+        elsif respond_to?(:model) && model
+          { model: model }
+        elsif RubyLLM::Contract.configuration.default_model
+          { model: RubyLLM::Contract.configuration.default_model }
+        end
+      end
+
       def resolve_adapter(context)
         adapter = context[:adapter] || RubyLLM::Contract.configuration.default_adapter
         return adapter if adapter
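`current_model_config` answers "what am I running today?" for the savings comparison by walking a fallback chain: the first escalation config, then the step's own model, then the global default. A standalone sketch of that chain, with plain variables standing in for the real policy, step, and configuration objects:

```ruby
# Illustrative stand-ins for the real objects consulted by the fallback chain.
escalation_configs = []            # retry policy's config_list (empty here)
step_model         = nil           # no per-step model declared
default_model      = "base-model"  # global default_model

current_config =
  if escalation_configs.any?
    escalation_configs.first
  elsif step_model
    { model: step_model }
  elsif default_model
    { model: default_model }
  end
```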
data/lib/ruby_llm/contract/step/retry_executor.rb
CHANGED
@@ -10,12 +10,16 @@ module RubyLLM
 
       def run_with_retry(input, adapter:, default_model:, policy:, context_temperature: nil, extra_options: {})
         all_attempts = []
+        default_config = { model: default_model }.merge(extra_options.slice(:reasoning_effort).compact)
 
         policy.max_attempts.times do |attempt_index|
-
+          config = policy.config_for_attempt(attempt_index, default_config)
+          model = config[:model]
+          attempt_extra = extra_options.merge(config.except(:model))
+
           result = run_once(input, adapter: adapter, model: model,
-                            context_temperature: context_temperature, extra_options:
-          all_attempts << { attempt: attempt_index + 1, model: model, result: result }
+                            context_temperature: context_temperature, extra_options: attempt_extra)
+          all_attempts << { attempt: attempt_index + 1, model: model, config: config, result: result }
           break unless policy.retryable?(result)
         end
 
@@ -43,6 +47,8 @@ module RubyLLM
       def build_attempt_entry(attempt)
         trace = attempt[:result].trace
         entry = { attempt: attempt[:attempt], model: attempt[:model], status: attempt[:result].status }
+        config = attempt[:config]
+        entry[:config] = config if config && config.keys != [:model]
         append_trace_fields(entry, trace)
       end
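Each attempt's config is split in two: `:model` picks the model for the call, and every other key (such as `:reasoning_effort`) is merged over the caller's extra options so the escalation entry wins. A standalone sketch of that split (requires Ruby 3.0+ for `Hash#except`; values are illustrative):

```ruby
# Caller-level defaults and one escalation-ladder entry (made-up values).
extra_options = { temperature: 0.2, reasoning_effort: "low" }
config = { model: "bigger-model", reasoning_effort: "high" }

model = config[:model]
# Everything except :model rides along as per-attempt request options,
# overriding the caller's defaults on key collisions.
attempt_extra = extra_options.merge(config.except(:model))
```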
data/lib/ruby_llm/contract/step/retry_policy.rb
CHANGED
@@ -9,13 +9,13 @@ module RubyLLM
       DEFAULT_RETRY_ON = %i[validation_failed parse_error adapter_error].freeze
 
       def initialize(models: nil, attempts: nil, retry_on: nil, &block)
-        @
+        @configs = []
         @retryable_statuses = DEFAULT_RETRY_ON.dup
 
         if block
           @max_attempts = 1
           instance_eval(&block)
-          warn_no_retry! if @max_attempts == 1 && @
+          warn_no_retry! if @max_attempts == 1 && @configs.empty?
         else
           apply_keywords(models: models, attempts: attempts, retry_on: retry_on)
         end
@@ -28,14 +28,18 @@ module RubyLLM
         validate_max_attempts!
       end
 
-      def escalate(*
-        @
-        @max_attempts = @
+      def escalate(*config_list)
+        @configs = config_list.flatten.map { |c| normalize_config(c).freeze }.freeze
+        @max_attempts = @configs.length if @max_attempts < @configs.length
       end
       alias models escalate
 
       def model_list
-        @
+        @configs.map { |c| c[:model] }.freeze
+      end
+
+      def config_list
+        @configs
       end
 
       def retry_on(*statuses)
@@ -46,29 +50,38 @@ module RubyLLM
         retryable_statuses.include?(result.status)
       end
 
-      def
-        if @
-          @
+      def config_for_attempt(attempt, default_config)
+        if @configs.any?
+          @configs[attempt] || @configs.last
         else
-
+          default_config
         end
       end
 
+      def model_for_attempt(attempt, default_model)
+        config_for_attempt(attempt, { model: default_model })[:model]
+      end
+
       private
 
       def apply_keywords(models:, attempts:, retry_on:)
         if models
-          @
-          @max_attempts = @
+          @configs = Array(models).map { |m| normalize_config(m).freeze }.freeze
+          @max_attempts = @configs.length
         else
           @max_attempts = attempts || 1
         end
         @retryable_statuses = Array(retry_on).dup if retry_on
       end
 
+      def normalize_config(entry)
+        RubyLLM::Contract.normalize_candidate_config(entry)
+      end
+
       def warn_no_retry!
-        warn "[ruby_llm-contract] retry_policy has max_attempts=1 with no
-          "This means no actual retry will happen. Add `attempts 2` or
+        warn "[ruby_llm-contract] retry_policy has max_attempts=1 with no configs. " \
+             "This means no actual retry will happen. Add `attempts 2` or " \
+             '`escalate "model1", "model2"`.'
       end
 
       def validate_max_attempts!
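The escalation ladder indexes configs by attempt number and clamps to the last entry, so attempts beyond the list keep retrying the strongest configuration. A sketch of that lookup outside the class, with made-up model names:

```ruby
# Two-rung escalation ladder (illustrative entries).
configs = [
  { model: "cheap-model" },
  { model: "better-model", reasoning_effort: "high" }
]

# Attempt 0 takes the first rung; out-of-range attempts reuse the last rung;
# an empty ladder falls back to the default config.
config_for_attempt = lambda do |attempt, default_config|
  configs.any? ? (configs[attempt] || configs.last) : default_config
end

first_try = config_for_attempt.call(0, { model: "default" })
overflow  = config_for_attempt.call(5, { model: "default" })
```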
data/lib/ruby_llm/contract.rb
CHANGED
@@ -78,6 +78,27 @@ module RubyLLM
       Thread.current[:ruby_llm_contract_reloading] = false
     end
 
+    def normalize_candidate_config(entry)
+      case entry
+      when String
+        raise ArgumentError, "Candidate model must be a non-empty String" if entry.strip.empty?
+
+        { model: entry.strip }
+      when Hash
+        model = entry[:model] || entry["model"]
+        unless model.is_a?(String) && !model.strip.empty?
+          raise ArgumentError, "Candidate config must include a non-empty String :model"
+        end
+
+        normalized = { model: model.strip }
+        effort = entry[:reasoning_effort] || entry["reasoning_effort"]
+        normalized[:reasoning_effort] = effort if effort
+        normalized
+      else
+        raise ArgumentError, "Expected String or Hash, got #{entry.class}"
+      end
+    end
+
     private
 
     # Filter stale hosts, deduplicate by name (last wins), prune registry in-place
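`normalize_candidate_config` gives every candidate a single canonical shape: a bare string becomes `{ model: ... }`, a hash keeps `:model` plus an optional `:reasoning_effort` (string or symbol keys accepted), and anything else raises. A self-contained re-sketch of that contract under a hypothetical method name, for illustration only:

```ruby
# Hypothetical standalone version of the normalization contract above.
def normalize_candidate(entry)
  case entry
  when String
    raise ArgumentError, "empty model" if entry.strip.empty?
    { model: entry.strip }
  when Hash
    model = entry[:model] || entry["model"]
    raise ArgumentError, "missing :model" unless model.is_a?(String) && !model.strip.empty?
    out = { model: model.strip }
    effort = entry[:reasoning_effort] || entry["reasoning_effort"]
    out[:reasoning_effort] = effort if effort
    out
  else
    raise ArgumentError, "Expected String or Hash, got #{entry.class}"
  end
end
```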
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: ruby_llm-contract
 version: !ruby/object:Gem::Version
-  version: 0.5.2
+  version: 0.6.0
 platform: ruby
 authors:
 - Justyna
@@ -130,6 +130,8 @@ files:
 - lib/ruby_llm/contract/eval/prompt_diff_comparator.rb
 - lib/ruby_llm/contract/eval/prompt_diff_presenter.rb
 - lib/ruby_llm/contract/eval/prompt_diff_serializer.rb
+- lib/ruby_llm/contract/eval/recommendation.rb
+- lib/ruby_llm/contract/eval/recommender.rb
 - lib/ruby_llm/contract/eval/report.rb
 - lib/ruby_llm/contract/eval/report_presenter.rb
 - lib/ruby_llm/contract/eval/report_stats.rb