ruby_llm-contract 0.6.2 → 0.6.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (21)

checksums.yaml +4 -4
data/.rubocop.yml +6 -3
data/CHANGELOG.md +28 -0
data/Gemfile.lock +2 -2
data/README.md +4 -0
data/lib/ruby_llm/contract/concerns/eval_host.rb +37 -5
data/lib/ruby_llm/contract/concerns/production_mode_context.rb +35 -0
data/lib/ruby_llm/contract/eval/aggregated_report.rb +135 -0
data/lib/ruby_llm/contract/eval/case_result.rb +5 -3
data/lib/ruby_llm/contract/eval/case_result_builder.rb +8 -1
data/lib/ruby_llm/contract/eval/model_comparison.rb +80 -3
data/lib/ruby_llm/contract/eval/report.rb +3 -1
data/lib/ruby_llm/contract/eval/report_stats.rb +63 -0
data/lib/ruby_llm/contract/eval/retry_optimizer.rb +5 -2
data/lib/ruby_llm/contract/eval.rb +1 -0
data/lib/ruby_llm/contract/rake_task.rb +11 -1
data/lib/ruby_llm/contract/step/base.rb +8 -4
data/lib/ruby_llm/contract/version.rb +1 -1
data/lib/ruby_llm/contract.rb +1 -0
metadata +3 -2
data/ruby_llm-contract.gemspec +0 -35

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 2636c2e59f5fef27f929a94ac9e3194793ce6b51f86cce0c18ca6a5b0caa61ab
-  data.tar.gz: c2c81c7cc8fd281bf6c88738f0fb5bc4bdbb86b71d2fb59248e2b0ebb8d648fe
+  metadata.gz: 655ecfa5adebe03245cb4ee910752ec866b83844b665018277dc653a3e246a63
+  data.tar.gz: fafcdacea3943703fad581f9f1806fe25678fd5da7b01867ecb5aac353da139f
 SHA512:
-  metadata.gz: d9f2fca592fd3a183d987239dea0cdc2456eed639d6cccaea71e9b1ef3a3ff6e32f3e346f640c905168674e7401ccb85e5f342815a1b290cdebe05a9f7b5374f
-  data.tar.gz: 00b8d113564871db19f88d9061276a0aee0295da7638974804782fda7929cf893a0004aadfeffc2f25f13d421935e0e9d710e553d8e56b7b5d0229f62edd129e
+  metadata.gz: c4821220d898468f55ab7085d25f6370819501cdfcb5ffea21a9dbdaaa9348997c1b9e6acbad8eb9ff6328eb531eda17e05bd09b1fcd047429fcd69b820a1819
+  data.tar.gz: 91fb14f2aa6e13e70d29f0222dc160548c9c50e9bf4bd9f27e9042bd0f6a4fae1a557b762747ee066c3444ee90942ea008c048b4bb0efc364d212329726d3888

data/.rubocop.yml CHANGED

@@ -42,14 +42,17 @@ AllCops:
     - 'internal/**/*'
 Metrics/ClassLength:
-  Max: 130
+  Max: 140
+Metrics/ModuleLength:
+  Max: 150
 Metrics/AbcSize:
   Max: 30
 Metrics/ParameterLists:
-  Max: 11
-  MaxOptionalParameters: 9
+  Max: 12
+  MaxOptionalParameters: 10
 Style/FormatStringToken:
   Enabled: false

data/CHANGELOG.md CHANGED

@@ -1,5 +1,33 @@
 # Changelog
+## 0.6.4 (2026-04-20)
+### Features
+- **`production_mode:` on `compare_models` and `optimize_retry_policy`** — measures retry-aware, end-to-end cost per successful output. Pass `production_mode: { fallback: "gpt-5-mini" }` and each candidate runs with a runtime-injected `[candidate, fallback]` retry chain. The report exposes `escalation_rate`, `single_shot_cost`, and `effective_cost` so "the cheaper candidate" decision matches production cost rather than first-attempt cost.
+- **New Report metrics** — `escalation_rate`, `single_shot_cost`, `effective_cost`, `single_shot_latency_ms`, `effective_latency_ms`, `latency_percentiles` (p50/p95/max). `AggregatedReport` averages all of them across `runs:`.
+- **Extended `ModelComparison#table`** — when `production_mode:` is set, renders a `Chain` column (`candidate → fallback`) with `single-shot`, `escalation`, `effective cost`, `latency`, `score`. Edge case `candidate == fallback` renders as a single model and `—` in the escalation column, with retry injection skipped entirely so `effective == single-shot` by construction, not by coincidence.
+- **`context[:retry_policy_override]`** — new context key that nullifies or replaces class-level `retry_policy` for a single call. Used internally by production-mode injection; safe to use directly when you need a transient override that doesn't mutate the step class.
+### Scope
+- Single-fallback (2-tier) chains only. Multi-tier chains can be inspected post-hoc via `trace.attempts` but aren't summarized in the optimize table.
+- Costs with `runs: 3 + production_mode: { fallback: "gpt-5-mini" }` are ≈3× a single-shot eval plus the actual retry attempts — not 6×. Production-mode metrics come from a single pass.
+- **Step-only.** Calling `compare_models` with `production_mode:` on a `Pipeline::Base` subclass raises `ArgumentError` — retry injection is Step-level and pipeline-wide fallback semantics aren't defined yet. Benchmark individual steps.
+### Documentation
+- **Guide: [Production-mode cost measurement](docs/guide/optimizing_retry_policy.md#production-mode-cost-measurement)** — API, metric interpretation, 2-tier scope note.
+## 0.6.3 (2026-04-20)
+### Features
+- **`runs:` parameter on `compare_models` and `optimize_retry_policy`** — runs each candidate N times per eval and aggregates the mean score, mean cost per run, and mean latency. Reduces sampling variance in live mode where LLM outputs are non-deterministic (gpt-5 family enforces `temperature=1.0` server-side, so a single unlucky sample can misclassify a viable candidate as "failing"). Default `runs: 1` — backward compatible.
+- **`RUNS=N` on `rake ruby_llm_contract:optimize`** — CLI flag for variance-aware optimization.
+- **`Eval::AggregatedReport`** — duck-type `Report` exposing `score` (mean), `score_min`/`score_max` (spread), `total_cost` (mean per run), `pass_rate` (clean-pass count x/N), and `clean_passes`.
+- **Guide: [Reducing variance with `runs:`](docs/guide/optimizing_retry_policy.md#reducing-variance-with-runs)** — when to use it and why.
 ## 0.6.2 (2026-04-18)
 ### Features
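
The changelog's distinction between single-shot and effective cost can be illustrated with a small worked example. This is only a sketch of the arithmetic for a 2-tier chain, assuming the candidate runs once per case and the fallback runs only for escalated cases; the prices, the `effective_cost` helper, and the 90% escalation rate are made up for illustration and are not taken from the gem.

```ruby
# Illustrative per-case prices in USD; these numbers are made up for the example.
NANO_COST = 0.001
MINI_COST = 0.004

# Expected cost per case for a 2-tier chain: the candidate always runs once,
# and the fallback runs only on the escalated share of cases.
def effective_cost(candidate_cost, fallback_cost, escalation_rate)
  candidate_cost + (escalation_rate * fallback_cost)
end

nano_chain = effective_cost(NANO_COST, MINI_COST, 0.9) # nano escalates 90% of the time
mini_alone = effective_cost(MINI_COST, MINI_COST, 0.0) # candidate == fallback, no retries

puts format("nano -> mini: $%.4f per case", nano_chain)
puts format("mini alone:   $%.4f per case", mini_alone)
```

Under these assumed numbers the candidate with the lower single-shot price ends up more expensive per case once escalations are counted, which is the situation the `effective_cost` column is meant to surface.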

data/Gemfile.lock CHANGED

@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    ruby_llm-contract (0.6.1)
+    ruby_llm-contract (0.6.4)
       dry-types (~> 1.7)
       ruby_llm (~> 1.0)
       ruby_llm-schema (~> 0.3)
@@ -258,7 +258,7 @@ CHECKSUMS
   rubocop-ast (1.49.1) sha256=4412f3ee70f6fe4546cc489548e0f6fcf76cafcfa80fa03af67098ffed755035
   ruby-progressbar (1.13.0) sha256=80fc9c47a9b640d6834e0dc7b3c94c9df37f08cb072b7761e4a71e22cff29b33
   ruby_llm (1.14.0) sha256=57c6f7034fc4a44504ea137d70f853b07824f1c1cdbe774ab3ab3522e7098deb
-  ruby_llm-contract (0.6.1)
+  ruby_llm-contract (0.6.4)
   ruby_llm-schema (0.3.0) sha256=a591edc5ca1b7f0304f0e2261de61ba4b3bea17be09f5cf7558153adfda3dec6
   ruby_parser (3.22.0) sha256=1eb4937cd9eb220aa2d194e352a24dba90aef00751e24c8dfffdb14000f15d23
   rubycritic (4.12.0) sha256=024fed90fe656fa939f6ea80aab17569699ac3863d0b52fd72cb99892247abc8

data/README.md CHANGED

@@ -158,6 +158,10 @@ Cheapest at 100%: gpt-4.1-mini
 Nano fails on edge cases. Mini and full both score 100% — but mini is **5x cheaper**. Now you know.
+Running live against gpt-5 / o-series? Pass `runs: 3` to average out sampling variance (OpenAI forces `temperature=1.0` server-side, so one unlucky run can misclassify a viable candidate). See [Reducing variance with `runs:`](docs/guide/optimizing_retry_policy.md#reducing-variance-with-runs).
+Want the *effective* cost — first-attempt plus retries — rather than the single-shot headline number? Pass `production_mode: { fallback: "gpt-5-mini" }` and the table gains `escalation`, `effective cost`, and a `Chain` column. See [Production-mode cost measurement](docs/guide/optimizing_retry_policy.md#production-mode-cost-measurement).
 ## Let the gem tell you what to do
 Don't read tables — get a recommendation. Supports `model + reasoning_effort` combinations:

data/lib/ruby_llm/contract/concerns/eval_host.rb CHANGED

@@ -5,6 +5,7 @@ module RubyLLM
     module Concerns
       module EvalHost
         include ContextHelpers
+        include ProductionModeContext
         SAMPLE_RESPONSE_COMPARE_WARNING = "[ruby_llm-contract] compare_with ignores sample_response. " \
                                           "Without model: or context: { adapter: ... }, both sides will be skipped " \
@@ -70,27 +71,58 @@ module RubyLLM
           Eval::PromptDiff.new(candidate: my_report, baseline: other_report)
         end
-        def compare_models(eval_name, models: [], candidates: [], context: {})
+        def compare_models(eval_name, models: [], candidates: [], context: {}, runs: 1, production_mode: nil)
           raise ArgumentError, "Pass either models: or candidates:, not both" if models.any? && candidates.any?
+          runs = coerce_runs(runs)
           context = safe_context(context)
           candidate_configs = normalize_candidates(models, candidates)
+          reject_production_mode_on_pipeline!(production_mode)
+          fallback_config = normalize_production_mode(production_mode)
           reports = {}
           configs = {}
           candidate_configs.each do |config|
             label = Eval::ModelComparison.candidate_label(config)
-            model_context = isolate_context(context).merge(model: config[:model])
-            model_context[:reasoning_effort] = config[:reasoning_effort] if config[:reasoning_effort]
-            reports[label] = run_single_eval(eval_name, model_context)
+            model_context = build_candidate_context(context, config, fallback_config)
+            per_run = Array.new(runs) { run_single_eval(eval_name, model_context) }
+            reports[label] = runs == 1 ? per_run.first : Eval::AggregatedReport.new(per_run)
             configs[label] = config
           end
-          Eval::ModelComparison.new(eval_name: eval_name, reports: reports, configs: configs)
+          Eval::ModelComparison.new(
+            eval_name: eval_name, reports: reports, configs: configs, fallback: fallback_config
+          )
         end
         private
+        def coerce_runs(runs)
+          raise ArgumentError, "runs must be an Integer >= 1, got #{runs.inspect}" unless runs.is_a?(Integer)
+          raise ArgumentError, "runs must be >= 1, got #{runs.inspect}" if runs < 1
+          runs
+        end
+        def reject_production_mode_on_pipeline!(production_mode)
+          return if production_mode.nil? || production_mode == false
+          return unless defined?(Pipeline::Base) && self < Pipeline::Base
+          raise ArgumentError,
+                "production_mode: is not supported on Pipeline (#{self}). Retry injection happens at Step level; " \
+                "call compare_models with production_mode: on individual Step classes instead."
+        end
+        def build_candidate_context(context, config, fallback_config)
+          model_context = isolate_context(context).merge(model: config[:model])
+          model_context[:reasoning_effort] = config[:reasoning_effort] if config[:reasoning_effort]
+          return model_context unless fallback_config
+          model_context[:retry_policy_override] = production_mode_override(config, fallback_config)
+          model_context
+        end
         def normalize_candidates(models, candidates)
           if candidates.any?
             candidates.map { |c| RubyLLM::Contract.normalize_candidate_config(c) }.uniq
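
The `runs:` handling in the hunk above can be tried in isolation. The following is a stand-alone sketch rather than the gem's code: `coerce_runs` copies the validation rules from the diff, while `collect_runs` and the Hash it yields are hypothetical stand-ins for `run_single_eval` and a `Report`.

```ruby
# Validation rules mirrored from the diff: only Integers >= 1 are accepted.
def coerce_runs(runs)
  raise ArgumentError, "runs must be an Integer >= 1, got #{runs.inspect}" unless runs.is_a?(Integer)
  raise ArgumentError, "runs must be >= 1, got #{runs.inspect}" if runs < 1

  runs
end

# Hypothetical stand-in for the per-candidate loop: yields once per run
# and collects the results, like Array.new(runs) { run_single_eval(...) }.
def collect_runs(runs)
  Array.new(coerce_runs(runs)) { |i| yield(i) }
end

reports = collect_runs(3) { |i| { run: i, score: 1.0 } }
puts reports.length
```

Note that a numeric String such as `"3"` is rejected rather than converted, so callers reading from the environment need to parse it first.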

data/lib/ruby_llm/contract/concerns/production_mode_context.rb ADDED

@@ -0,0 +1,35 @@
+# frozen_string_literal: true
+module RubyLLM
+  module Contract
+    module Concerns
+      # Helpers for injecting a retry_policy_override into per-candidate eval
+      # context when compare_models runs in production-mode. When candidate
+      # == fallback, retry injection is skipped so the row degenerates into
+      # a single-shot eval by construction.
+      module ProductionModeContext
+        private
+        def normalize_production_mode(production_mode)
+          return nil if production_mode.nil? || production_mode == false
+          unless production_mode.is_a?(Hash) && production_mode[:fallback]
+            raise ArgumentError, "production_mode: must be a Hash with :fallback, got #{production_mode.inspect}"
+          end
+          RubyLLM::Contract.normalize_candidate_config(production_mode[:fallback])
+        end
+        def production_mode_override(config, fallback_config)
+          return nil if same_candidate?(config, fallback_config)
+          Step::RetryPolicy.new(models: [config, fallback_config])
+        end
+        def same_candidate?(first, second)
+          first[:model] == second[:model] && first[:reasoning_effort] == second[:reasoning_effort]
+        end
+      end
+    end
+  end
+end
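
The override decision in this module can be exercised without the gem. In the sketch below, `RetryPolicySketch` is a hypothetical stand-in for `Step::RetryPolicy` (whose real constructor is only known here from the call in the diff); the two helper methods follow the added code.

```ruby
# Hypothetical stand-in for Step::RetryPolicy; it only records the model chain.
RetryPolicySketch = Struct.new(:models, keyword_init: true)

# Two configs are the same candidate when model and reasoning_effort both match.
def same_candidate?(first, second)
  first[:model] == second[:model] && first[:reasoning_effort] == second[:reasoning_effort]
end

# Returns nil when candidate == fallback, so no retry chain is injected.
def production_mode_override(config, fallback_config)
  return nil if same_candidate?(config, fallback_config)

  RetryPolicySketch.new(models: [config, fallback_config])
end

fallback = { model: "gpt-5-mini" }
nano     = { model: "gpt-5-nano" }
mini_low = { model: "gpt-5-mini", reasoning_effort: "low" }

p production_mode_override(nano, fallback)&.models
p production_mode_override(fallback, fallback)
p production_mode_override(mini_low, fallback)&.models
```

Because `reasoning_effort` is part of the comparison, `gpt-5-mini` at low effort is treated as a different candidate from plain `gpt-5-mini` and still gets a two-step chain.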

data/lib/ruby_llm/contract/eval/aggregated_report.rb ADDED

@@ -0,0 +1,135 @@
+# frozen_string_literal: true
+module RubyLLM
+  module Contract
+    module Eval
+      # Wraps N Reports from repeated runs of the same eval to reduce sampling
+      # variance in live mode (temperature=1 on gpt-5 family). Exposes the same
+      # duck-type as Report — mean score, mean cost per run, mean latency.
+      #
+      # pass_rate reports how many runs passed cleanly (x/N), not case-level
+      # pass rate, since the question is "does this candidate reliably pass?".
+      class AggregatedReport
+        attr_reader :runs, :results
+        def initialize(runs)
+          raise ArgumentError, "runs must not be empty" if runs.empty?
+          @runs = runs.freeze
+          @results = runs.flat_map(&:results).freeze
+          freeze
+        end
+        def dataset_name
+          @runs.first.dataset_name
+        end
+        def step_name
+          @runs.first.step_name
+        end
+        def score
+          @runs.sum(&:score) / @runs.length.to_f
+        end
+        def score_min
+          @runs.map(&:score).min
+        end
+        def score_max
+          @runs.map(&:score).max
+        end
+        def total_cost
+          @runs.sum(&:total_cost) / @runs.length.to_f
+        end
+        def avg_latency_ms
+          latencies = @runs.filter_map(&:avg_latency_ms)
+          return nil if latencies.empty?
+          latencies.sum / latencies.length.to_f
+        end
+        def pass_rate
+          "#{clean_passes}/#{@runs.length}"
+        end
+        def pass_rate_ratio
+          clean_passes.to_f / @runs.length
+        end
+        def each(&block)
+          @results.each(&block)
+        end
+        def summary
+          @runs.first.summary
+        end
+        def to_s
+          @runs.first.to_s
+        end
+        def print_summary(io = $stdout)
+          @runs.first.print_summary(io)
+        end
+        def passed?
+          @runs.all?(&:passed?)
+        end
+        def clean_passes
+          @runs.count(&:passed?)
+        end
+        def failures
+          @runs.flat_map(&:failures)
+        end
+        def production_mode?
+          @runs.any?(&:production_mode?)
+        end
+        def escalation_rate
+          values = @runs.filter_map(&:escalation_rate)
+          return nil if values.empty?
+          values.sum / values.length.to_f
+        end
+        def single_shot_cost
+          values = @runs.filter_map(&:single_shot_cost)
+          return nil if values.empty?
+          values.sum / values.length.to_f
+        end
+        def effective_cost
+          total_cost
+        end
+        def single_shot_latency_ms
+          values = @runs.filter_map(&:single_shot_latency_ms)
+          return nil if values.empty?
+          values.sum / values.length.to_f
+        end
+        def effective_latency_ms
+          avg_latency_ms
+        end
+        def latency_percentiles
+          per_run = @runs.filter_map(&:latency_percentiles)
+          return nil if per_run.empty?
+          %i[p50 p95 max].each_with_object({}) do |key, acc|
+            values = per_run.filter_map { |h| h[key] }
+            acc[key] = values.empty? ? nil : values.sum / values.length.to_f
+          end
+        end
+      end
+    end
+  end
+end
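
The aggregation rules in `AggregatedReport` reduce to a few lines of arithmetic. The sketch below uses `RunSketch`, a hypothetical stand-in for a per-run `Report` that models only the fields needed here, with made-up scores and costs.

```ruby
# Hypothetical stand-in for a per-run Report; only the fields used below are modeled.
RunSketch = Struct.new(:score, :total_cost, :passed, keyword_init: true) do
  def passed?
    passed
  end
end

runs = [
  RunSketch.new(score: 1.0,  total_cost: 0.004, passed: true),
  RunSketch.new(score: 0.5,  total_cost: 0.002, passed: false),
  RunSketch.new(score: 0.75, total_cost: 0.003, passed: false)
]

mean_score   = runs.sum(&:score) / runs.length.to_f            # mean across runs
score_spread = [runs.map(&:score).min, runs.map(&:score).max]  # score_min / score_max
mean_cost    = runs.sum(&:total_cost) / runs.length.to_f       # mean cost per run
clean_passes = runs.count(&:passed?)                           # runs that passed cleanly
pass_rate    = "#{clean_passes}/#{runs.length}"                # run-level, not case-level
all_passed   = runs.all?(&:passed?)                            # aggregate passed?

puts "score=#{mean_score} spread=#{score_spread.inspect} pass_rate=#{pass_rate} passed=#{all_passed}"
```

The aggregate `passed?` requires every run to pass, while `pass_rate` reports how many runs did, so a candidate can show `1/3` and still have a high mean score.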

data/lib/ruby_llm/contract/eval/case_result.rb CHANGED

@@ -5,10 +5,10 @@ module RubyLLM
     module Eval
       class CaseResult
         attr_reader :name, :input, :output, :expected, :step_status,
-                    :score, :details, :duration_ms, :cost
+                    :score, :details, :duration_ms, :cost, :attempts
         def initialize(name:, input:, output:, expected:, step_status:,
-                       score:, passed:, label: nil, details: nil, duration_ms: nil, cost: nil)
+                       score:, passed:, label: nil, details: nil, duration_ms: nil, cost: nil, attempts: nil)
           @name = name
           @input = input
           @output = output
@@ -20,6 +20,7 @@ module RubyLLM
           @details = details
           @duration_ms = duration_ms
           @cost = cost
+          @attempts = attempts
           freeze
         end
@@ -58,7 +59,8 @@ module RubyLLM
             label: label,
             details: @details,
             duration_ms: @duration_ms,
-            cost: @cost
+            cost: @cost,
+            attempts: @attempts
           }
         end

data/lib/ruby_llm/contract/eval/case_result_builder.rb CHANGED

@@ -18,7 +18,8 @@ module RubyLLM
             label: evaluation.label,
             details: evaluation.details,
             duration_ms: trace_metric(trace, :total_latency_ms, :latency_ms),
-            cost: trace_metric(trace, :total_cost, :cost)
+            cost: trace_metric(trace, :total_cost, :cost),
+            attempts: trace_attempts(trace)
           )
         end
@@ -29,6 +30,12 @@ module RubyLLM
           trace.respond_to?(pipeline_key) ? trace.public_send(pipeline_key) : trace[step_key]
         end
+        def trace_attempts(trace)
+          return nil unless trace
+          trace.respond_to?(:attempts) ? trace.attempts : nil
+        end
       end
     end
   end

data/lib/ruby_llm/contract/eval/model_comparison.rb CHANGED

@@ -4,20 +4,25 @@ module RubyLLM
   module Contract
     module Eval
       class ModelComparison
-        attr_reader :eval_name, :reports, :configs
+        attr_reader :eval_name, :reports, :configs, :fallback
         def self.candidate_label(config)
           effort = config[:reasoning_effort]
           effort ? "#{config[:model]} (effort: #{effort})" : config[:model]
         end
-        def initialize(eval_name:, reports:, configs: nil)
+        def initialize(eval_name:, reports:, configs: nil, fallback: nil)
           @eval_name = eval_name
           @reports = reports.dup.freeze
           @configs = (configs || default_configs_from_reports).freeze
+          @fallback = fallback
           freeze
         end
+        def production_mode?
+          !@fallback.nil?
+        end
         def models
           @reports.keys
         end
@@ -44,6 +49,8 @@ module RubyLLM
         end
         def table
+          return production_mode_table if production_mode?
           max_label = [@reports.keys.map(&:length).max || 0, 25].max
           lines = [format("  %-#{max_label}s  Score       Cost  Avg Latency", "Candidate")]
           lines << "  #{"-" * (max_label + 36)}"
@@ -57,6 +64,62 @@ module RubyLLM
           lines.join("\n")
         end
+        def production_mode_table
+          fallback_label = self.class.candidate_label(@fallback)
+          rows = @reports.map do |label, report|
+            chain = chain_label(label, fallback_label)
+            { chain: chain, report: report, same: chain_same_as_fallback?(label, fallback_label) }
+          end
+          chain_width = [rows.map { |r| r[:chain].length }.max || 0, 20].max
+          lines = [format("  %-#{chain_width}s  %-11s  %-10s  %-14s  %-9s  %s",
+                          "Chain", "single-shot", "escalation", "effective cost", "latency", "score")]
+          lines << "  #{"-" * (chain_width + 60)}"
+          rows.each do |row|
+            lines << format_production_row(row, chain_width)
+          end
+          lines.join("\n")
+        end
+        private
+        def chain_label(label, fallback_label)
+          label == fallback_label ? label : "#{label} → #{fallback_label}"
+        end
+        def chain_same_as_fallback?(label, fallback_label)
+          label == fallback_label
+        end
+        def format_production_row(row, chain_width)
+          report = row[:report]
+          format("  %-#{chain_width}s  %-11s  %-10s  %-14s  %-9s  %6.2f",
+                 row[:chain],
+                 format_money(report.single_shot_cost || report.total_cost),
+                 format_escalation(row, report),
+                 format_money(report.effective_cost),
+                 format_latency(report.effective_latency_ms),
+                 report.score)
+        end
+        def format_money(value)
+          value&.positive? ? "$#{format("%.4f", value)}" : "n/a"
+        end
+        def format_latency(value)
+          value ? "#{value.round}ms" : "n/a"
+        end
+        def format_escalation(row, report)
+          return "—" if row[:same]
+          format("%d%%", ((report.escalation_rate || 0) * 100).round)
+        end
+        public
         def print_summary(io = $stdout)
           io.puts "#{@eval_name} — model comparison"
           io.puts
@@ -72,7 +135,7 @@ module RubyLLM
         def to_h
           @reports.transform_values do |report|
-            {
+            base = {
               score: report.score,
               total_cost: report.total_cost,
               avg_latency_ms: report.avg_latency_ms,
@@ -80,11 +143,25 @@ module RubyLLM
               pass_rate_ratio: report.pass_rate_ratio,
               passed: report.passed?
             }
+            production_mode_metrics(report, base)
           end
         end
         private
+        def production_mode_metrics(report, base)
+          return base unless report.respond_to?(:production_mode?) && report.production_mode?
+          base.merge(
+            escalation_rate: report.escalation_rate,
+            single_shot_cost: report.single_shot_cost,
+            effective_cost: report.effective_cost,
+            single_shot_latency_ms: report.single_shot_latency_ms,
+            effective_latency_ms: report.effective_latency_ms,
+            latency_percentiles: report.latency_percentiles
+          )
+        end
         def resolve_key(candidate)
           case candidate
           when String then candidate
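
The cell formatters added for the production-mode table are small enough to check by hand. The sketch below copies `format_money` and `format_latency` from the hunk as free-standing methods; `format_percent` is a hypothetical name for the percentage branch of `format_escalation`, and the input values are invented.

```ruby
# Copied from the diff: nil, zero, and negative values all render as "n/a".
def format_money(value)
  value&.positive? ? "$#{format("%.4f", value)}" : "n/a"
end

# Copied from the diff: latency is rounded to whole milliseconds.
def format_latency(value)
  value ? "#{value.round}ms" : "n/a"
end

# Hypothetical name for the percentage branch of format_escalation;
# a nil rate is treated as 0.
def format_percent(rate)
  format("%d%%", ((rate || 0) * 100).round)
end

puts format_money(0.01234)
puts format_money(0.0)
puts format_latency(1234.6)
puts format_percent(0.25)
```

One consequence of the `positive?` check is that a genuinely free run (cost `0.0`) is displayed the same way as a missing cost.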

data/lib/ruby_llm/contract/eval/report.rb CHANGED

@@ -15,7 +15,9 @@ module RubyLLM
         BASELINE_DIR = ".eval_baselines"
         def_delegators :@stats, :score, :passed, :failed, :skipped, :failures, :pass_rate, :pass_rate_ratio,
-                       :total_cost, :avg_latency_ms, :passed?
+                       :total_cost, :avg_latency_ms, :passed?,
+                       :production_mode?, :escalation_rate, :single_shot_cost, :single_shot_latency_ms,
+                       :effective_cost, :effective_latency_ms, :latency_percentiles
         def_delegators :@presenter, :summary, :to_s, :print_summary
         def_delegators :@storage, :save_history!, :eval_history, :save_baseline!, :compare_with_baseline,
                        :baseline_exists?

data/lib/ruby_llm/contract/eval/report_stats.rb CHANGED

@@ -65,6 +65,69 @@ module RubyLLM
         def evaluated_results_count
           evaluated_results.length
         end
+        def production_mode?
+          evaluated_results.any? { |r| r.respond_to?(:attempts) && r.attempts }
+        end
+        def escalation_rate
+          return nil unless production_mode?
+          return 0.0 if evaluated_results.empty?
+          escalated = evaluated_results.count { |r| (r.attempts || []).length > 1 }
+          escalated.to_f / evaluated_results.length
+        end
+        def single_shot_cost
+          return nil unless production_mode?
+          evaluated_results.sum { |r| first_attempt_cost(r) || r.cost || 0.0 }
+        end
+        def effective_cost
+          total_cost
+        end
+        def single_shot_latency_ms
+          return nil unless production_mode?
+          latencies = evaluated_results.filter_map { |r| first_attempt_latency(r) || r.duration_ms }
+          return nil if latencies.empty?
+          latencies.sum.to_f / latencies.length
+        end
+        def effective_latency_ms
+          avg_latency_ms
+        end
+        def latency_percentiles
+          return nil unless production_mode?
+          latencies = evaluated_results.filter_map(&:duration_ms).sort
+          return nil if latencies.empty?
+          { p50: percentile(latencies, 0.50), p95: percentile(latencies, 0.95), max: latencies.last.to_f }
+        end
+        private
+        def first_attempt_cost(result)
+          first = (result.attempts || []).first
+          first && first[:cost]
+        end
+        def first_attempt_latency(result)
+          first = (result.attempts || []).first
+          first && first[:latency_ms]
+        end
+        def percentile(sorted, fraction)
+          return nil if sorted.empty?
+          idx = (fraction * (sorted.length - 1)).round
+          sorted[idx].to_f
+        end
       end
     end
   end
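
The escalation and percentile arithmetic added to `ReportStats` can be checked on a tiny dataset. This sketch uses plain Hashes in place of `CaseResult` objects and invented costs and latencies; `percentile` is copied from the hunk.

```ruby
# Plain Hashes stand in for CaseResult objects; all numbers are invented.
results = [
  { duration_ms: 100, attempts: [{ cost: 0.001, latency_ms: 100 }] },
  { duration_ms: 400, attempts: [{ cost: 0.001, latency_ms: 150 }, { cost: 0.004, latency_ms: 250 }] },
  { duration_ms: 200, attempts: [{ cost: 0.001, latency_ms: 200 }] },
  { duration_ms: 300, attempts: [{ cost: 0.001, latency_ms: 300 }] }
]

# Copied from the diff: picks the element at the rounded index, no interpolation.
def percentile(sorted, fraction)
  return nil if sorted.empty?

  idx = (fraction * (sorted.length - 1)).round
  sorted[idx].to_f
end

# A case counts as escalated when it needed more than one attempt.
escalated        = results.count { |r| r[:attempts].length > 1 }
escalation_rate  = escalated.to_f / results.length
# Single-shot cost sums only the first attempt of every case.
single_shot_cost = results.sum { |r| r[:attempts].first[:cost] }
latencies        = results.map { |r| r[:duration_ms] }.sort

p50 = percentile(latencies, 0.50)
p95 = percentile(latencies, 0.95)
puts "escalation=#{escalation_rate} p50=#{p50} p95=#{p95}"
```

With four samples the p50 index is `(0.5 * 3).round`, which is 2, so the reported median is the third-smallest latency rather than an interpolated midpoint.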

data/lib/ruby_llm/contract/eval/retry_optimizer.rb CHANGED

@@ -94,11 +94,13 @@ module RubyLLM
           end
         end
-        def initialize(step:, candidates:, context: {}, min_score: 0.95)
+        def initialize(step:, candidates:, context: {}, min_score: 0.95, runs: 1, production_mode: nil)
           @step = step
           @candidates = candidates
           @context = context
           @min_score = min_score
+          @runs = runs
+          @production_mode = production_mode
         end
         def call
@@ -108,7 +110,8 @@ module RubyLLM
           score_matrix = {}
           evals.each do |eval_name|
             comparison = with_retry_disabled do
-              @step.compare_models(eval_name, candidates: @candidates, context: @context)
+              @step.compare_models(eval_name, candidates: @candidates, context: @context,
+                                              runs: @runs, production_mode: @production_mode)
             end
             score_matrix[eval_name] = extract_scores(comparison)
           end

data/lib/ruby_llm/contract/eval.rb CHANGED

@@ -21,6 +21,7 @@ require_relative "eval/report_stats"
 require_relative "eval/report_presenter"
 require_relative "eval/report_storage"
 require_relative "eval/report"
+require_relative "eval/aggregated_report"
 require_relative "eval/eval_definition"
 require_relative "eval/model_comparison"
 require_relative "eval/baseline_diff"

data/lib/ruby_llm/contract/rake_task.rb CHANGED

@@ -150,6 +150,7 @@ module RubyLLM
           raw_candidates = ENV["CANDIDATES"].to_s.strip
           abort("CANDIDATES is required, e.g. CANDIDATES=gpt-5-nano,gpt-5-mini@low,gpt-5-mini") if raw_candidates.empty?
           min_score = ENV.fetch("MIN_SCORE", "0.95").to_f
+          runs = parse_runs(ENV.fetch("RUNS", "1"))
           host = RubyLLM::Contract.eval_hosts.find { |h| h.name == step_name }
           unless host
@@ -163,13 +164,22 @@ module RubyLLM
           result = host.optimize_retry_policy(
             candidates: candidates,
             context: context,
-            min_score: min_score
+            min_score: min_score,
+            runs: runs
           )
           result.print_summary
         end
       end
+      def parse_runs(raw)
+        runs = Integer(raw.to_s.strip, 10)
+        abort("RUNS must be an integer >= 1, e.g. RUNS=1") if runs < 1
+        runs
+      rescue ArgumentError
+        abort("RUNS must be an integer >= 1, e.g. RUNS=1")
+      end
       def parse_candidates(raw)
         entries = if raw.start_with?("[")
                     Array(JSON.parse(raw))
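
The `RUNS` parsing above relies on `Kernel#Integer` with an explicit base, which rejects strings that are not whole decimal numbers. The sketch below shows the same rules in a testable form; it raises `ArgumentError` where the rake task calls `abort`, and the error message wording is an assumption.

```ruby
# Testable variant of parse_runs: raises instead of aborting the process.
def parse_runs(raw)
  runs = begin
    Integer(raw.to_s.strip, 10) # strict base-10 parse; "abc" and "1.5" raise
  rescue ArgumentError
    raise ArgumentError, "RUNS must be an integer >= 1, got #{raw.inspect}"
  end
  raise ArgumentError, "RUNS must be an integer >= 1, got #{raw.inspect}" if runs < 1

  runs
end

puts parse_runs(" 3 ")
```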

data/lib/ruby_llm/contract/step/base.rb CHANGED

@@ -59,16 +59,19 @@ module RubyLLM
             ).recommend
           end
-          def optimize_retry_policy(candidates:, context: {}, min_score: 0.95)
+          def optimize_retry_policy(candidates:, context: {}, min_score: 0.95, runs: 1, production_mode: nil)
             Eval::RetryOptimizer.new(
               step: self,
               candidates: candidates,
               context: context,
-              min_score: min_score
+              min_score: min_score,
+              runs: runs,
+              production_mode: production_mode
             ).call
           end
-          KNOWN_CONTEXT_KEYS = %i[adapter model temperature max_tokens provider assume_model_exists reasoning_effort].freeze
+          KNOWN_CONTEXT_KEYS = %i[adapter model temperature max_tokens provider assume_model_exists
+                                  reasoning_effort retry_policy_override].freeze
           include Concerns::ContextHelpers
@@ -155,11 +158,12 @@ module RubyLLM
           end
           def runtime_settings(context)
+            policy = context.key?(:retry_policy_override) ? context[:retry_policy_override] : retry_policy
             {
               model: context[:model] || model || RubyLLM::Contract.configuration.default_model,
               temperature: context[:temperature],
               extra_options: context.slice(:provider, :assume_model_exists, :max_tokens, :reasoning_effort),
-              policy: retry_policy
+              policy: policy
             }
           end
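
The `context.key?` check in `runtime_settings` matters because an explicit `nil` override has to be distinguishable from an absent key. A minimal sketch of that lookup follows; `resolve_policy` and `CLASS_POLICY` are hypothetical names, and the Hash policies are placeholders for real policy objects.

```ruby
# Placeholder for whatever the step class declares via retry_policy.
CLASS_POLICY = { models: %w[gpt-5-nano gpt-5-mini] }.freeze

# key? distinguishes "no override given" from "override explicitly set to nil".
def resolve_policy(context, class_policy)
  context.key?(:retry_policy_override) ? context[:retry_policy_override] : class_policy
end

p resolve_policy({}, CLASS_POLICY)                              # class-level policy
p resolve_policy({ retry_policy_override: nil }, CLASS_POLICY)  # retries disabled
p resolve_policy({ retry_policy_override: { models: %w[a b] } }, CLASS_POLICY)
```

Writing the lookup as `context[:retry_policy_override] || class_policy` would silently fall back to the class-level policy when the override is `nil`, which would defeat the "nullifies" case described in the changelog.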

data/lib/ruby_llm/contract/version.rb CHANGED

@@ -2,6 +2,6 @@
 module RubyLLM
   module Contract
-    VERSION = "0.6.2"
+    VERSION = "0.6.4"
   end
 end

data/lib/ruby_llm/contract.rb CHANGED

@@ -154,6 +154,7 @@ end
 require_relative "contract/concerns/context_helpers"
 require_relative "contract/concerns/deep_freeze"
 require_relative "contract/concerns/deep_symbolize"
+require_relative "contract/concerns/production_mode_context"
 require_relative "contract/concerns/eval_host"
 require_relative "contract/concerns/trace_equality"
 require_relative "contract/concerns/usage_aggregator"

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: ruby_llm-contract
 version: !ruby/object:Gem::Version
-  version: 0.6.2
+  version: 0.6.4
 platform: ruby
 authors:
 - Justyna
@@ -89,6 +89,7 @@ files:
 - lib/ruby_llm/contract/concerns/deep_freeze.rb
 - lib/ruby_llm/contract/concerns/deep_symbolize.rb
 - lib/ruby_llm/contract/concerns/eval_host.rb
+- lib/ruby_llm/contract/concerns/production_mode_context.rb
 - lib/ruby_llm/contract/concerns/trace_equality.rb
 - lib/ruby_llm/contract/concerns/usage_aggregator.rb
 - lib/ruby_llm/contract/configuration.rb
@@ -109,6 +110,7 @@ files:
 - lib/ruby_llm/contract/dsl.rb
 - lib/ruby_llm/contract/errors.rb
 - lib/ruby_llm/contract/eval.rb
+- lib/ruby_llm/contract/eval/aggregated_report.rb
 - lib/ruby_llm/contract/eval/baseline_diff.rb
 - lib/ruby_llm/contract/eval/case_executor.rb
 - lib/ruby_llm/contract/eval/case_result.rb
@@ -180,7 +182,6 @@ files:
 - lib/ruby_llm/contract/token_estimator.rb
 - lib/ruby_llm/contract/types.rb
 - lib/ruby_llm/contract/version.rb
-- ruby_llm-contract.gemspec
 homepage: https://github.com/justi/ruby_llm-contract
 licenses:
 - MIT

data/ruby_llm-contract.gemspec DELETED

@@ -1,35 +0,0 @@
-# frozen_string_literal: true
-require_relative "lib/ruby_llm/contract/version"
-Gem::Specification.new do |spec|
-  spec.name = "ruby_llm-contract"
-  spec.version = RubyLLM::Contract::VERSION
-  spec.authors = ["Justyna"]
-  spec.summary = "Know which LLM model to use, what it costs, and when accuracy drops"
-  spec.description = "Compare LLM models by accuracy and cost. Regression-test prompts in CI. " \
-                     "Start on nano, auto-escalate to bigger models when quality drops. " \
-                     "Companion gem for ruby_llm."
-  spec.homepage = "https://github.com/justi/ruby_llm-contract"
-  spec.license = "MIT"
-  spec.required_ruby_version = ">= 3.2.0"
-  spec.metadata["homepage_uri"] = spec.homepage
-  spec.metadata["source_code_uri"] = spec.homepage
-  spec.metadata["changelog_uri"] = "#{spec.homepage}/blob/main/CHANGELOG.md"
-  spec.metadata["documentation_uri"] = "#{spec.homepage}#readme"
-  spec.metadata["rubygems_mfa_required"] = "true"
-  spec.files = Dir.chdir(__dir__) do
-    `git ls-files -z`.split("\x0").reject do |f|
-      (File.expand_path(f) == __FILE__) ||
-        f.start_with?("spec/", "docs/", "doc/", ".ai/", ".claude/", ".git")
-    end
-  end
-  spec.require_paths = ["lib"]
-  spec.add_dependency "dry-types", "~> 1.7"
-  spec.add_dependency "ruby_llm", "~> 1.0"
-  spec.add_dependency "ruby_llm-schema", "~> 0.3"
-end