ruby_llm-contract 0.6.2 → 0.6.4

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 2636c2e59f5fef27f929a94ac9e3194793ce6b51f86cce0c18ca6a5b0caa61ab
4
- data.tar.gz: c2c81c7cc8fd281bf6c88738f0fb5bc4bdbb86b71d2fb59248e2b0ebb8d648fe
3
+ metadata.gz: 655ecfa5adebe03245cb4ee910752ec866b83844b665018277dc653a3e246a63
4
+ data.tar.gz: fafcdacea3943703fad581f9f1806fe25678fd5da7b01867ecb5aac353da139f
5
5
  SHA512:
6
- metadata.gz: d9f2fca592fd3a183d987239dea0cdc2456eed639d6cccaea71e9b1ef3a3ff6e32f3e346f640c905168674e7401ccb85e5f342815a1b290cdebe05a9f7b5374f
7
- data.tar.gz: 00b8d113564871db19f88d9061276a0aee0295da7638974804782fda7929cf893a0004aadfeffc2f25f13d421935e0e9d710e553d8e56b7b5d0229f62edd129e
6
+ metadata.gz: c4821220d898468f55ab7085d25f6370819501cdfcb5ffea21a9dbdaaa9348997c1b9e6acbad8eb9ff6328eb531eda17e05bd09b1fcd047429fcd69b820a1819
7
+ data.tar.gz: 91fb14f2aa6e13e70d29f0222dc160548c9c50e9bf4bd9f27e9042bd0f6a4fae1a557b762747ee066c3444ee90942ea008c048b4bb0efc364d212329726d3888
data/.rubocop.yml CHANGED
@@ -42,14 +42,17 @@ AllCops:
42
42
  - 'internal/**/*'
43
43
 
44
44
  Metrics/ClassLength:
45
- Max: 130
45
+ Max: 140
46
+
47
+ Metrics/ModuleLength:
48
+ Max: 150
46
49
 
47
50
  Metrics/AbcSize:
48
51
  Max: 30
49
52
 
50
53
  Metrics/ParameterLists:
51
- Max: 11
52
- MaxOptionalParameters: 9
54
+ Max: 12
55
+ MaxOptionalParameters: 10
53
56
 
54
57
  Style/FormatStringToken:
55
58
  Enabled: false
data/CHANGELOG.md CHANGED
@@ -1,5 +1,33 @@
1
1
  # Changelog
2
2
 
3
+ ## 0.6.4 (2026-04-20)
4
+
5
+ ### Features
6
+
7
+ - **`production_mode:` on `compare_models` and `optimize_retry_policy`** — measures retry-aware, end-to-end cost per successful output. Pass `production_mode: { fallback: "gpt-5-mini" }` and each candidate runs with a runtime-injected `[candidate, fallback]` retry chain. The report exposes `escalation_rate`, `single_shot_cost`, and `effective_cost` so "the cheaper candidate" decision matches production cost rather than first-attempt cost.
8
+ - **New Report metrics** — `escalation_rate`, `single_shot_cost`, `effective_cost`, `single_shot_latency_ms`, `effective_latency_ms`, `latency_percentiles` (p50/p95/max). `AggregatedReport` averages all of them across `runs:`.
9
+ - **Extended `ModelComparison#table`** — when `production_mode:` is set, renders a `Chain` column (`candidate → fallback`) with `single-shot`, `escalation`, `effective cost`, `latency`, `score`. Edge case `candidate == fallback` renders as a single model and `—` in the escalation column, with retry injection skipped entirely so `effective == single-shot` by construction, not by coincidence.
10
+ - **`context[:retry_policy_override]`** — new context key that nullifies or replaces class-level `retry_policy` for a single call. Used internally by production-mode injection; safe to use directly when you need a transient override that doesn't mutate the step class.
11
+
12
+ ### Scope
13
+
14
+ - Single-fallback (2-tier) chains only. Multi-tier chains can be inspected post-hoc via `trace.attempts` but aren't summarized in the optimize table.
15
+ - Costs with `runs: 3 + production_mode: { fallback: "gpt-5-mini" }` are ≈3× a single-shot eval plus the actual retry attempts — not 6×. Production-mode metrics come from a single pass.
16
+ - **Step-only.** Calling `compare_models` with `production_mode:` on a `Pipeline::Base` subclass raises `ArgumentError` — retry injection is Step-level and pipeline-wide fallback semantics aren't defined yet. Benchmark individual steps.
17
+
18
+ ### Documentation
19
+
20
+ - **Guide: [Production-mode cost measurement](docs/guide/optimizing_retry_policy.md#production-mode-cost-measurement)** — API, metric interpretation, 2-tier scope note.
21
+
22
+ ## 0.6.3 (2026-04-20)
23
+
24
+ ### Features
25
+
26
+ - **`runs:` parameter on `compare_models` and `optimize_retry_policy`** — runs each candidate N times per eval and aggregates the mean score, mean cost per run, and mean latency. Reduces sampling variance in live mode where LLM outputs are non-deterministic (gpt-5 family enforces `temperature=1.0` server-side, so a single unlucky sample can misclassify a viable candidate as "failing"). Default `runs: 1` — backward compatible.
27
+ - **`RUNS=N` on `rake ruby_llm_contract:optimize`** — CLI flag for variance-aware optimization.
28
+ - **`Eval::AggregatedReport`** — duck-type `Report` exposing `score` (mean), `score_min`/`score_max` (spread), `total_cost` (mean per run), `pass_rate` (clean-pass count x/N), and `clean_passes`.
29
+ - **Guide: [Reducing variance with `runs:`](docs/guide/optimizing_retry_policy.md#reducing-variance-with-runs)** — when to use it and why.
30
+
3
31
  ## 0.6.2 (2026-04-18)
4
32
 
5
33
  ### Features
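The cost note under "Scope" for 0.6.4 can be checked with a little arithmetic. The sketch below uses made-up per-case prices and a hypothetical helper `pass_cost` (neither comes from the gem) to illustrate why `runs: 3` combined with `production_mode:` costs roughly three single-shot passes plus the retries that actually fired, since single-shot and effective figures are read from the same pass.

```ruby
# Hypothetical per-case prices in micro-dollars (integers avoid float noise).
CANDIDATE_COST = 1_000 # one attempt on the cheap candidate
FALLBACK_COST  = 3_000 # one retry on the fallback model

# Cost of one production-mode pass. `escalated` holds one flag per case:
# true when the candidate failed and the fallback was called.
def pass_cost(escalated)
  escalated.sum { |esc| CANDIDATE_COST + (esc ? FALLBACK_COST : 0) }
end

# runs: 3 over a 4-case dataset; one case escalates in each run.
runs = [
  [false, true, false, false],
  [false, false, false, true],
  [true, false, false, false]
]

single_shot_total = runs.sum { |run| run.length * CANDIDATE_COST }
retry_total       = runs.sum { |run| run.count(true) * FALLBACK_COST }
effective_total   = runs.sum { |run| pass_cost(run) }

puts "single-shot: #{single_shot_total}, retries: #{retry_total}, effective: #{effective_total}"
```

With these illustrative numbers the effective total equals the three single-shot passes plus three retries; no second set of passes is needed to obtain the single-shot figure.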
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- ruby_llm-contract (0.6.1)
4
+ ruby_llm-contract (0.6.4)
5
5
  dry-types (~> 1.7)
6
6
  ruby_llm (~> 1.0)
7
7
  ruby_llm-schema (~> 0.3)
@@ -258,7 +258,7 @@ CHECKSUMS
258
258
  rubocop-ast (1.49.1) sha256=4412f3ee70f6fe4546cc489548e0f6fcf76cafcfa80fa03af67098ffed755035
259
259
  ruby-progressbar (1.13.0) sha256=80fc9c47a9b640d6834e0dc7b3c94c9df37f08cb072b7761e4a71e22cff29b33
260
260
  ruby_llm (1.14.0) sha256=57c6f7034fc4a44504ea137d70f853b07824f1c1cdbe774ab3ab3522e7098deb
261
- ruby_llm-contract (0.6.1)
261
+ ruby_llm-contract (0.6.4)
262
262
  ruby_llm-schema (0.3.0) sha256=a591edc5ca1b7f0304f0e2261de61ba4b3bea17be09f5cf7558153adfda3dec6
263
263
  ruby_parser (3.22.0) sha256=1eb4937cd9eb220aa2d194e352a24dba90aef00751e24c8dfffdb14000f15d23
264
264
  rubycritic (4.12.0) sha256=024fed90fe656fa939f6ea80aab17569699ac3863d0b52fd72cb99892247abc8
data/README.md CHANGED
@@ -158,6 +158,10 @@ Cheapest at 100%: gpt-4.1-mini
158
158
 
159
159
  Nano fails on edge cases. Mini and full both score 100% — but mini is **5x cheaper**. Now you know.
160
160
 
161
+ Running live against gpt-5 / o-series? Pass `runs: 3` to average out sampling variance (OpenAI forces `temperature=1.0` server-side, so one unlucky run can misclassify a viable candidate). See [Reducing variance with `runs:`](docs/guide/optimizing_retry_policy.md#reducing-variance-with-runs).
162
+
163
+ Want the *effective* cost — first-attempt plus retries — rather than the single-shot headline number? Pass `production_mode: { fallback: "gpt-5-mini" }` and the table gains `escalation`, `effective cost`, and a `Chain` column. See [Production-mode cost measurement](docs/guide/optimizing_retry_policy.md#production-mode-cost-measurement).
164
+
161
165
  ## Let the gem tell you what to do
162
166
 
163
167
  Don't read tables — get a recommendation. Supports `model + reasoning_effort` combinations:
data/lib/ruby_llm/contract/concerns/eval_host.rb CHANGED
@@ -5,6 +5,7 @@ module RubyLLM
5
5
  module Concerns
6
6
  module EvalHost
7
7
  include ContextHelpers
8
+ include ProductionModeContext
8
9
 
9
10
  SAMPLE_RESPONSE_COMPARE_WARNING = "[ruby_llm-contract] compare_with ignores sample_response. " \
10
11
  "Without model: or context: { adapter: ... }, both sides will be skipped " \
@@ -70,27 +71,58 @@ module RubyLLM
70
71
  Eval::PromptDiff.new(candidate: my_report, baseline: other_report)
71
72
  end
72
73
 
73
- def compare_models(eval_name, models: [], candidates: [], context: {})
74
+ def compare_models(eval_name, models: [], candidates: [], context: {}, runs: 1, production_mode: nil)
74
75
  raise ArgumentError, "Pass either models: or candidates:, not both" if models.any? && candidates.any?
75
76
 
77
+ runs = coerce_runs(runs)
78
+
76
79
  context = safe_context(context)
77
80
  candidate_configs = normalize_candidates(models, candidates)
81
+ reject_production_mode_on_pipeline!(production_mode)
82
+ fallback_config = normalize_production_mode(production_mode)
78
83
 
79
84
  reports = {}
80
85
  configs = {}
81
86
  candidate_configs.each do |config|
82
87
  label = Eval::ModelComparison.candidate_label(config)
83
- model_context = isolate_context(context).merge(model: config[:model])
84
- model_context[:reasoning_effort] = config[:reasoning_effort] if config[:reasoning_effort]
85
- reports[label] = run_single_eval(eval_name, model_context)
88
+ model_context = build_candidate_context(context, config, fallback_config)
89
+ per_run = Array.new(runs) { run_single_eval(eval_name, model_context) }
90
+ reports[label] = runs == 1 ? per_run.first : Eval::AggregatedReport.new(per_run)
86
91
  configs[label] = config
87
92
  end
88
93
 
89
- Eval::ModelComparison.new(eval_name: eval_name, reports: reports, configs: configs)
94
+ Eval::ModelComparison.new(
95
+ eval_name: eval_name, reports: reports, configs: configs, fallback: fallback_config
96
+ )
90
97
  end
91
98
 
92
99
  private
93
100
 
101
+ def coerce_runs(runs)
102
+ raise ArgumentError, "runs must be an Integer >= 1, got #{runs.inspect}" unless runs.is_a?(Integer)
103
+ raise ArgumentError, "runs must be >= 1, got #{runs.inspect}" if runs < 1
104
+
105
+ runs
106
+ end
107
+
108
+ def reject_production_mode_on_pipeline!(production_mode)
109
+ return if production_mode.nil? || production_mode == false
110
+ return unless defined?(Pipeline::Base) && self < Pipeline::Base
111
+
112
+ raise ArgumentError,
113
+ "production_mode: is not supported on Pipeline (#{self}). Retry injection happens at Step level; " \
114
+ "call compare_models with production_mode: on individual Step classes instead."
115
+ end
116
+
117
+ def build_candidate_context(context, config, fallback_config)
118
+ model_context = isolate_context(context).merge(model: config[:model])
119
+ model_context[:reasoning_effort] = config[:reasoning_effort] if config[:reasoning_effort]
120
+ return model_context unless fallback_config
121
+
122
+ model_context[:retry_policy_override] = production_mode_override(config, fallback_config)
123
+ model_context
124
+ end
125
+
94
126
  def normalize_candidates(models, candidates)
95
127
  if candidates.any?
96
128
  candidates.map { |c| RubyLLM::Contract.normalize_candidate_config(c) }.uniq
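The `runs:` handling in the hunk above can be tried in isolation. This is a minimal sketch: `coerce_runs` is copied from the diff, while `run_once` is a hypothetical stand-in for `run_single_eval` and a plain mean stands in for `Eval::AggregatedReport`.

```ruby
# Standalone copy of the validation shown in the hunk above.
def coerce_runs(runs)
  raise ArgumentError, "runs must be an Integer >= 1, got #{runs.inspect}" unless runs.is_a?(Integer)
  raise ArgumentError, "runs must be >= 1, got #{runs.inspect}" if runs < 1

  runs
end

# Hypothetical stand-in for run_single_eval: yields one fake score per call.
scores = [0.9, 1.0, 0.8].each
run_once = -> { scores.next }

runs = coerce_runs(3)
per_run = Array.new(runs) { run_once.call }

# With runs == 1 the single result is passed through untouched; otherwise
# the per-run results are aggregated (here: a plain mean).
result = runs == 1 ? per_run.first : per_run.sum / per_run.length

# Strings, zero, and floats are all rejected rather than silently coerced.
rejected = ["3", 0, 2.0].map do |bad|
  coerce_runs(bad)
  false
rescue ArgumentError
  true
end

puts "mean score: #{result.round(3)}, rejected: #{rejected.inspect}"
```

Passing the single report through when `runs == 1` is what keeps the default backward compatible: callers that never set `runs:` still receive a plain report.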
@@ -0,0 +1,35 @@
1
+ # frozen_string_literal: true
2
+
3
+ module RubyLLM
4
+ module Contract
5
+ module Concerns
6
+ # Helpers for injecting a retry_policy_override into per-candidate eval
7
+ # context when compare_models runs in production-mode. When candidate
8
+ # == fallback, retry injection is skipped so the row degenerates into
9
+ # a single-shot eval by construction.
10
+ module ProductionModeContext
11
+ private
12
+
13
+ def normalize_production_mode(production_mode)
14
+ return nil if production_mode.nil? || production_mode == false
15
+
16
+ unless production_mode.is_a?(Hash) && production_mode[:fallback]
17
+ raise ArgumentError, "production_mode: must be a Hash with :fallback, got #{production_mode.inspect}"
18
+ end
19
+
20
+ RubyLLM::Contract.normalize_candidate_config(production_mode[:fallback])
21
+ end
22
+
23
+ def production_mode_override(config, fallback_config)
24
+ return nil if same_candidate?(config, fallback_config)
25
+
26
+ Step::RetryPolicy.new(models: [config, fallback_config])
27
+ end
28
+
29
+ def same_candidate?(first, second)
30
+ first[:model] == second[:model] && first[:reasoning_effort] == second[:reasoning_effort]
31
+ end
32
+ end
33
+ end
34
+ end
35
+ end
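A small sketch of the override logic in this new file, runnable without the gem. `FakeRetryPolicy` is a hypothetical stand-in for `Step::RetryPolicy`, and the config hashes assume the normalized shape implied by the diff (`:model` plus optional `:reasoning_effort`).

```ruby
# Hypothetical stand-in for Step::RetryPolicy; it only records the chain.
FakeRetryPolicy = Struct.new(:models, keyword_init: true)

def same_candidate?(first, second)
  first[:model] == second[:model] && first[:reasoning_effort] == second[:reasoning_effort]
end

# Returns nil (no retry injection) when the candidate is the fallback itself.
def production_mode_override(config, fallback_config)
  return nil if same_candidate?(config, fallback_config)

  FakeRetryPolicy.new(models: [config, fallback_config])
end

fallback = { model: "gpt-5-mini" }
nano     = { model: "gpt-5-nano" }
mini_low = { model: "gpt-5-mini", reasoning_effort: "low" }

chain = production_mode_override(nano, fallback)     # two-tier chain
same  = production_mode_override(fallback, fallback) # nil: single-shot row
low   = production_mode_override(mini_low, fallback) # differs by effort only

puts chain.models.map { |m| m[:model] }.join(" -> ")
```

Note that the comparison includes `reasoning_effort`, so the same model at a different effort level still counts as a distinct candidate and gets a chain.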
@@ -0,0 +1,135 @@
1
+ # frozen_string_literal: true
2
+
3
+ module RubyLLM
4
+ module Contract
5
+ module Eval
6
+ # Wraps N Reports from repeated runs of the same eval to reduce sampling
7
+ # variance in live mode (temperature=1 on gpt-5 family). Exposes the same
8
+ # duck-type as Report — mean score, mean cost per run, mean latency.
9
+ #
10
+ # pass_rate reports how many runs passed cleanly (x/N), not case-level
11
+ # pass rate, since the question is "does this candidate reliably pass?".
12
+ class AggregatedReport
13
+ attr_reader :runs, :results
14
+
15
+ def initialize(runs)
16
+ raise ArgumentError, "runs must not be empty" if runs.empty?
17
+
18
+ @runs = runs.freeze
19
+ @results = runs.flat_map(&:results).freeze
20
+ freeze
21
+ end
22
+
23
+ def dataset_name
24
+ @runs.first.dataset_name
25
+ end
26
+
27
+ def step_name
28
+ @runs.first.step_name
29
+ end
30
+
31
+ def score
32
+ @runs.sum(&:score) / @runs.length.to_f
33
+ end
34
+
35
+ def score_min
36
+ @runs.map(&:score).min
37
+ end
38
+
39
+ def score_max
40
+ @runs.map(&:score).max
41
+ end
42
+
43
+ def total_cost
44
+ @runs.sum(&:total_cost) / @runs.length.to_f
45
+ end
46
+
47
+ def avg_latency_ms
48
+ latencies = @runs.filter_map(&:avg_latency_ms)
49
+ return nil if latencies.empty?
50
+
51
+ latencies.sum / latencies.length.to_f
52
+ end
53
+
54
+ def pass_rate
55
+ "#{clean_passes}/#{@runs.length}"
56
+ end
57
+
58
+ def pass_rate_ratio
59
+ clean_passes.to_f / @runs.length
60
+ end
61
+
62
+ def each(&block)
63
+ @results.each(&block)
64
+ end
65
+
66
+ def summary
67
+ @runs.first.summary
68
+ end
69
+
70
+ def to_s
71
+ @runs.first.to_s
72
+ end
73
+
74
+ def print_summary(io = $stdout)
75
+ @runs.first.print_summary(io)
76
+ end
77
+
78
+ def passed?
79
+ @runs.all?(&:passed?)
80
+ end
81
+
82
+ def clean_passes
83
+ @runs.count(&:passed?)
84
+ end
85
+
86
+ def failures
87
+ @runs.flat_map(&:failures)
88
+ end
89
+
90
+ def production_mode?
91
+ @runs.any?(&:production_mode?)
92
+ end
93
+
94
+ def escalation_rate
95
+ values = @runs.filter_map(&:escalation_rate)
96
+ return nil if values.empty?
97
+
98
+ values.sum / values.length.to_f
99
+ end
100
+
101
+ def single_shot_cost
102
+ values = @runs.filter_map(&:single_shot_cost)
103
+ return nil if values.empty?
104
+
105
+ values.sum / values.length.to_f
106
+ end
107
+
108
+ def effective_cost
109
+ total_cost
110
+ end
111
+
112
+ def single_shot_latency_ms
113
+ values = @runs.filter_map(&:single_shot_latency_ms)
114
+ return nil if values.empty?
115
+
116
+ values.sum / values.length.to_f
117
+ end
118
+
119
+ def effective_latency_ms
120
+ avg_latency_ms
121
+ end
122
+
123
+ def latency_percentiles
124
+ per_run = @runs.filter_map(&:latency_percentiles)
125
+ return nil if per_run.empty?
126
+
127
+ %i[p50 p95 max].each_with_object({}) do |key, acc|
128
+ values = per_run.filter_map { |h| h[key] }
129
+ acc[key] = values.empty? ? nil : values.sum / values.length.to_f
130
+ end
131
+ end
132
+ end
133
+ end
134
+ end
135
+ end
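To make the aggregation semantics concrete, here is a minimal sketch that reproduces the mean and pass-rate formulas from `AggregatedReport` on fake data. `FakeReport` is a hypothetical stand-in for `Eval::Report` and carries only the fields used here; the scores and costs are invented.

```ruby
# Hypothetical stand-in for Eval::Report with only the fields used here.
FakeReport = Struct.new(:score, :total_cost, :passed) do
  def passed?
    passed
  end
end

runs = [
  FakeReport.new(1.0, 0.010, true),
  FakeReport.new(0.5, 0.020, false),
  FakeReport.new(1.0, 0.030, true)
]

mean_score   = runs.sum(&:score) / runs.length.to_f      # mean across runs
mean_cost    = runs.sum(&:total_cost) / runs.length.to_f # mean cost per run
clean_passes = runs.count(&:passed?)                     # runs that passed
pass_rate    = "#{clean_passes}/#{runs.length}"          # run-level, not case-level
all_passed   = runs.all?(&:passed?)

puts "score=#{mean_score.round(3)} cost=#{mean_cost.round(4)} pass_rate=#{pass_rate}"
```

The aggregate counts as passed only when every run passed, so one unlucky run lowers `pass_rate` and flips `passed?` even though the mean score may remain high.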
@@ -5,10 +5,10 @@ module RubyLLM
5
5
  module Eval
6
6
  class CaseResult
7
7
  attr_reader :name, :input, :output, :expected, :step_status,
8
- :score, :details, :duration_ms, :cost
8
+ :score, :details, :duration_ms, :cost, :attempts
9
9
 
10
10
  def initialize(name:, input:, output:, expected:, step_status:,
11
- score:, passed:, label: nil, details: nil, duration_ms: nil, cost: nil)
11
+ score:, passed:, label: nil, details: nil, duration_ms: nil, cost: nil, attempts: nil)
12
12
  @name = name
13
13
  @input = input
14
14
  @output = output
@@ -20,6 +20,7 @@ module RubyLLM
20
20
  @details = details
21
21
  @duration_ms = duration_ms
22
22
  @cost = cost
23
+ @attempts = attempts
23
24
  freeze
24
25
  end
25
26
 
@@ -58,7 +59,8 @@ module RubyLLM
58
59
  label: label,
59
60
  details: @details,
60
61
  duration_ms: @duration_ms,
61
- cost: @cost
62
+ cost: @cost,
63
+ attempts: @attempts
62
64
  }
63
65
  end
64
66
 
@@ -18,7 +18,8 @@ module RubyLLM
18
18
  label: evaluation.label,
19
19
  details: evaluation.details,
20
20
  duration_ms: trace_metric(trace, :total_latency_ms, :latency_ms),
21
- cost: trace_metric(trace, :total_cost, :cost)
21
+ cost: trace_metric(trace, :total_cost, :cost),
22
+ attempts: trace_attempts(trace)
22
23
  )
23
24
  end
24
25
 
@@ -29,6 +30,12 @@ module RubyLLM
29
30
 
30
31
  trace.respond_to?(pipeline_key) ? trace.public_send(pipeline_key) : trace[step_key]
31
32
  end
33
+
34
+ def trace_attempts(trace)
35
+ return nil unless trace
36
+
37
+ trace.respond_to?(:attempts) ? trace.attempts : nil
38
+ end
32
39
  end
33
40
  end
34
41
  end
@@ -4,20 +4,25 @@ module RubyLLM
4
4
  module Contract
5
5
  module Eval
6
6
  class ModelComparison
7
- attr_reader :eval_name, :reports, :configs
7
+ attr_reader :eval_name, :reports, :configs, :fallback
8
8
 
9
9
  def self.candidate_label(config)
10
10
  effort = config[:reasoning_effort]
11
11
  effort ? "#{config[:model]} (effort: #{effort})" : config[:model]
12
12
  end
13
13
 
14
- def initialize(eval_name:, reports:, configs: nil)
14
+ def initialize(eval_name:, reports:, configs: nil, fallback: nil)
15
15
  @eval_name = eval_name
16
16
  @reports = reports.dup.freeze
17
17
  @configs = (configs || default_configs_from_reports).freeze
18
+ @fallback = fallback
18
19
  freeze
19
20
  end
20
21
 
22
+ def production_mode?
23
+ !@fallback.nil?
24
+ end
25
+
21
26
  def models
22
27
  @reports.keys
23
28
  end
@@ -44,6 +49,8 @@ module RubyLLM
44
49
  end
45
50
 
46
51
  def table
52
+ return production_mode_table if production_mode?
53
+
47
54
  max_label = [@reports.keys.map(&:length).max || 0, 25].max
48
55
  lines = [format(" %-#{max_label}s Score Cost Avg Latency", "Candidate")]
49
56
  lines << " #{"-" * (max_label + 36)}"
@@ -57,6 +64,62 @@ module RubyLLM
57
64
  lines.join("\n")
58
65
  end
59
66
 
67
+ def production_mode_table
68
+ fallback_label = self.class.candidate_label(@fallback)
69
+ rows = @reports.map do |label, report|
70
+ chain = chain_label(label, fallback_label)
71
+ { chain: chain, report: report, same: chain_same_as_fallback?(label, fallback_label) }
72
+ end
73
+
74
+ chain_width = [rows.map { |r| r[:chain].length }.max || 0, 20].max
75
+ lines = [format(" %-#{chain_width}s %-11s %-10s %-14s %-9s %s",
76
+ "Chain", "single-shot", "escalation", "effective cost", "latency", "score")]
77
+ lines << " #{"-" * (chain_width + 60)}"
78
+
79
+ rows.each do |row|
80
+ lines << format_production_row(row, chain_width)
81
+ end
82
+
83
+ lines.join("\n")
84
+ end
85
+
86
+ private
87
+
88
+ def chain_label(label, fallback_label)
89
+ label == fallback_label ? label : "#{label} → #{fallback_label}"
90
+ end
91
+
92
+ def chain_same_as_fallback?(label, fallback_label)
93
+ label == fallback_label
94
+ end
95
+
96
+ def format_production_row(row, chain_width)
97
+ report = row[:report]
98
+ format(" %-#{chain_width}s %-11s %-10s %-14s %-9s %6.2f",
99
+ row[:chain],
100
+ format_money(report.single_shot_cost || report.total_cost),
101
+ format_escalation(row, report),
102
+ format_money(report.effective_cost),
103
+ format_latency(report.effective_latency_ms),
104
+ report.score)
105
+ end
106
+
107
+ def format_money(value)
108
+ value&.positive? ? "$#{format("%.4f", value)}" : "n/a"
109
+ end
110
+
111
+ def format_latency(value)
112
+ value ? "#{value.round}ms" : "n/a"
113
+ end
114
+
115
+ def format_escalation(row, report)
116
+ return "—" if row[:same]
117
+
118
+ format("%d%%", ((report.escalation_rate || 0) * 100).round)
119
+ end
120
+
121
+ public
122
+
60
123
  def print_summary(io = $stdout)
61
124
  io.puts "#{@eval_name} — model comparison"
62
125
  io.puts
@@ -72,7 +135,7 @@ module RubyLLM
72
135
 
73
136
  def to_h
74
137
  @reports.transform_values do |report|
75
- {
138
+ base = {
76
139
  score: report.score,
77
140
  total_cost: report.total_cost,
78
141
  avg_latency_ms: report.avg_latency_ms,
@@ -80,11 +143,25 @@ module RubyLLM
80
143
  pass_rate_ratio: report.pass_rate_ratio,
81
144
  passed: report.passed?
82
145
  }
146
+ production_mode_metrics(report, base)
83
147
  end
84
148
  end
85
149
 
86
150
  private
87
151
 
152
+ def production_mode_metrics(report, base)
153
+ return base unless report.respond_to?(:production_mode?) && report.production_mode?
154
+
155
+ base.merge(
156
+ escalation_rate: report.escalation_rate,
157
+ single_shot_cost: report.single_shot_cost,
158
+ effective_cost: report.effective_cost,
159
+ single_shot_latency_ms: report.single_shot_latency_ms,
160
+ effective_latency_ms: report.effective_latency_ms,
161
+ latency_percentiles: report.latency_percentiles
162
+ )
163
+ end
164
+
88
165
  def resolve_key(candidate)
89
166
  case candidate
90
167
  when String then candidate
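The private formatting helpers added to `ModelComparison` are small enough to exercise on their own. The sketch below copies them as top-level methods (in the gem they are private instance methods) and simplifies `format_escalation` to take a boolean and a rate, which is an assumption made for brevity.

```ruby
def chain_label(label, fallback_label)
  label == fallback_label ? label : "#{label} → #{fallback_label}"
end

# Zero and nil costs both render as "n/a".
def format_money(value)
  value&.positive? ? "$#{format("%.4f", value)}" : "n/a"
end

# Simplified signature (assumption): `same` is true when candidate == fallback.
def format_escalation(same, escalation_rate)
  return "—" if same

  format("%d%%", ((escalation_rate || 0) * 100).round)
end

chain      = chain_label("gpt-5-nano", "gpt-5-mini")
degenerate = chain_label("gpt-5-mini", "gpt-5-mini")
money      = [format_money(0.0123), format_money(0.0), format_money(nil)]
escalation = [format_escalation(false, 0.25), format_escalation(false, nil), format_escalation(true, 0.25)]

puts chain, degenerate, money.inspect, escalation.inspect
```

One consequence worth noticing: a genuinely free run (cost `0.0`) is indistinguishable from a missing cost in the rendered table, since both print `n/a`.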
@@ -15,7 +15,9 @@ module RubyLLM
15
15
  BASELINE_DIR = ".eval_baselines"
16
16
 
17
17
  def_delegators :@stats, :score, :passed, :failed, :skipped, :failures, :pass_rate, :pass_rate_ratio,
18
- :total_cost, :avg_latency_ms, :passed?
18
+ :total_cost, :avg_latency_ms, :passed?,
19
+ :production_mode?, :escalation_rate, :single_shot_cost, :single_shot_latency_ms,
20
+ :effective_cost, :effective_latency_ms, :latency_percentiles
19
21
  def_delegators :@presenter, :summary, :to_s, :print_summary
20
22
  def_delegators :@storage, :save_history!, :eval_history, :save_baseline!, :compare_with_baseline,
21
23
  :baseline_exists?
@@ -65,6 +65,69 @@ module RubyLLM
65
65
  def evaluated_results_count
66
66
  evaluated_results.length
67
67
  end
68
+
69
+ def production_mode?
70
+ evaluated_results.any? { |r| r.respond_to?(:attempts) && r.attempts }
71
+ end
72
+
73
+ def escalation_rate
74
+ return nil unless production_mode?
75
+ return 0.0 if evaluated_results.empty?
76
+
77
+ escalated = evaluated_results.count { |r| (r.attempts || []).length > 1 }
78
+ escalated.to_f / evaluated_results.length
79
+ end
80
+
81
+ def single_shot_cost
82
+ return nil unless production_mode?
83
+
84
+ evaluated_results.sum { |r| first_attempt_cost(r) || r.cost || 0.0 }
85
+ end
86
+
87
+ def effective_cost
88
+ total_cost
89
+ end
90
+
91
+ def single_shot_latency_ms
92
+ return nil unless production_mode?
93
+
94
+ latencies = evaluated_results.filter_map { |r| first_attempt_latency(r) || r.duration_ms }
95
+ return nil if latencies.empty?
96
+
97
+ latencies.sum.to_f / latencies.length
98
+ end
99
+
100
+ def effective_latency_ms
101
+ avg_latency_ms
102
+ end
103
+
104
+ def latency_percentiles
105
+ return nil unless production_mode?
106
+
107
+ latencies = evaluated_results.filter_map(&:duration_ms).sort
108
+ return nil if latencies.empty?
109
+
110
+ { p50: percentile(latencies, 0.50), p95: percentile(latencies, 0.95), max: latencies.last.to_f }
111
+ end
112
+
113
+ private
114
+
115
+ def first_attempt_cost(result)
116
+ first = (result.attempts || []).first
117
+ first && first[:cost]
118
+ end
119
+
120
+ def first_attempt_latency(result)
121
+ first = (result.attempts || []).first
122
+ first && first[:latency_ms]
123
+ end
124
+
125
+ def percentile(sorted, fraction)
126
+ return nil if sorted.empty?
127
+
128
+ idx = (fraction * (sorted.length - 1)).round
129
+ sorted[idx].to_f
130
+ end
68
131
  end
69
132
  end
70
133
  end
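A worked example of the two less obvious statistics in this hunk. `percentile` is copied from the diff; the latency samples and the attempts lists are invented, and the attempt hashes assume only the `:cost` key that the diff itself reads.

```ruby
# Same index rule as the `percentile` helper above:
# round(fraction * (n - 1)) selects an existing sample, with no interpolation.
def percentile(sorted, fraction)
  return nil if sorted.empty?

  idx = (fraction * (sorted.length - 1)).round
  sorted[idx].to_f
end

latencies = [120, 95, 400, 110, 2300].sort
p50 = percentile(latencies, 0.50) # idx = (0.50 * 4).round = 2
p95 = percentile(latencies, 0.95) # idx = (0.95 * 4).round = 4
max = latencies.last.to_f

# Escalation rate: share of cases whose attempts list has more than one entry.
attempts_per_case = [
  [{ cost: 0.001 }],
  [{ cost: 0.001 }, { cost: 0.004 }], # candidate failed, fallback called
  [{ cost: 0.001 }],
  nil                                 # no attempts recorded for this case
]
escalated       = attempts_per_case.count { |a| (a || []).length > 1 }
escalation_rate = escalated.to_f / attempts_per_case.length

puts "p50=#{p50} p95=#{p95} max=#{max} escalation=#{escalation_rate}"
```

On a dataset this small, `p95` lands on the same sample as `max`, so the two columns only diverge once an eval has enough cases.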
@@ -94,11 +94,13 @@ module RubyLLM
94
94
  end
95
95
  end
96
96
 
97
- def initialize(step:, candidates:, context: {}, min_score: 0.95)
97
+ def initialize(step:, candidates:, context: {}, min_score: 0.95, runs: 1, production_mode: nil)
98
98
  @step = step
99
99
  @candidates = candidates
100
100
  @context = context
101
101
  @min_score = min_score
102
+ @runs = runs
103
+ @production_mode = production_mode
102
104
  end
103
105
 
104
106
  def call
@@ -108,7 +110,8 @@ module RubyLLM
108
110
  score_matrix = {}
109
111
  evals.each do |eval_name|
110
112
  comparison = with_retry_disabled do
111
- @step.compare_models(eval_name, candidates: @candidates, context: @context)
113
+ @step.compare_models(eval_name, candidates: @candidates, context: @context,
114
+ runs: @runs, production_mode: @production_mode)
112
115
  end
113
116
  score_matrix[eval_name] = extract_scores(comparison)
114
117
  end
@@ -21,6 +21,7 @@ require_relative "eval/report_stats"
21
21
  require_relative "eval/report_presenter"
22
22
  require_relative "eval/report_storage"
23
23
  require_relative "eval/report"
24
+ require_relative "eval/aggregated_report"
24
25
  require_relative "eval/eval_definition"
25
26
  require_relative "eval/model_comparison"
26
27
  require_relative "eval/baseline_diff"
@@ -150,6 +150,7 @@ module RubyLLM
150
150
  raw_candidates = ENV["CANDIDATES"].to_s.strip
151
151
  abort("CANDIDATES is required, e.g. CANDIDATES=gpt-5-nano,gpt-5-mini@low,gpt-5-mini") if raw_candidates.empty?
152
152
  min_score = ENV.fetch("MIN_SCORE", "0.95").to_f
153
+ runs = parse_runs(ENV.fetch("RUNS", "1"))
153
154
 
154
155
  host = RubyLLM::Contract.eval_hosts.find { |h| h.name == step_name }
155
156
  unless host
@@ -163,13 +164,22 @@ module RubyLLM
163
164
  result = host.optimize_retry_policy(
164
165
  candidates: candidates,
165
166
  context: context,
166
- min_score: min_score
167
+ min_score: min_score,
168
+ runs: runs
167
169
  )
168
170
 
169
171
  result.print_summary
170
172
  end
171
173
  end
172
174
 
175
+ def parse_runs(raw)
176
+ runs = Integer(raw.to_s.strip, 10)
177
+ abort("RUNS must be an integer >= 1, e.g. RUNS=1") if runs < 1
178
+ runs
179
+ rescue ArgumentError
180
+ abort("RUNS must be an integer >= 1, e.g. RUNS=1")
181
+ end
182
+
173
183
  def parse_candidates(raw)
174
184
  entries = if raw.start_with?("[")
175
185
  Array(JSON.parse(raw))
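`parse_runs` relies on `Integer(raw, 10)` being strict, unlike `String#to_i`. The sketch below uses a hypothetical variant, `parse_runs_or_nil`, that returns `nil` where the rake task would call `abort`, so the accepted and rejected inputs can be inspected without exiting the process.

```ruby
# Hypothetical variant of parse_runs: returns nil instead of calling abort.
def parse_runs_or_nil(raw)
  runs = Integer(raw.to_s.strip, 10) # strict base-10 parse, raises on garbage
  runs >= 1 ? runs : nil
rescue ArgumentError
  nil
end

inputs = ["3", " 2 ", "010", "0", "-1", "2.5", "three", ""]
parsed = inputs.map { |raw| parse_runs_or_nil(raw) }

# For contrast: to_i quietly truncates, which would hide a typo like RUNS=2.5.
lenient = "2.5".to_i

inputs.zip(parsed).each { |raw, value| puts "#{raw.inspect} => #{value.inspect}" }
```

Passing the explicit base also means a leading zero is read as decimal rather than octal.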
@@ -59,16 +59,19 @@ module RubyLLM
59
59
  ).recommend
60
60
  end
61
61
 
62
- def optimize_retry_policy(candidates:, context: {}, min_score: 0.95)
62
+ def optimize_retry_policy(candidates:, context: {}, min_score: 0.95, runs: 1, production_mode: nil)
63
63
  Eval::RetryOptimizer.new(
64
64
  step: self,
65
65
  candidates: candidates,
66
66
  context: context,
67
- min_score: min_score
67
+ min_score: min_score,
68
+ runs: runs,
69
+ production_mode: production_mode
68
70
  ).call
69
71
  end
70
72
 
71
- KNOWN_CONTEXT_KEYS = %i[adapter model temperature max_tokens provider assume_model_exists reasoning_effort].freeze
73
+ KNOWN_CONTEXT_KEYS = %i[adapter model temperature max_tokens provider assume_model_exists
74
+ reasoning_effort retry_policy_override].freeze
72
75
 
73
76
  include Concerns::ContextHelpers
74
77
 
@@ -155,11 +158,12 @@ module RubyLLM
155
158
  end
156
159
 
157
160
  def runtime_settings(context)
161
+ policy = context.key?(:retry_policy_override) ? context[:retry_policy_override] : retry_policy
158
162
  {
159
163
  model: context[:model] || model || RubyLLM::Contract.configuration.default_model,
160
164
  temperature: context[:temperature],
161
165
  extra_options: context.slice(:provider, :assume_model_exists, :max_tokens, :reasoning_effort),
162
- policy: retry_policy
166
+ policy: policy
163
167
  }
164
168
  end
165
169
 
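The `context.key?` check in `runtime_settings` is what lets a caller disable retries with an explicit `nil`. A minimal sketch of that distinction follows; `CLASS_POLICY` and `effective_policy` are hypothetical names used for illustration, not part of the gem.

```ruby
# Hypothetical class-level default; in the gem this comes from the step class.
CLASS_POLICY = :class_level_policy

# Mirrors the lookup in runtime_settings: presence of the key decides,
# not truthiness of the value.
def effective_policy(context)
  context.key?(:retry_policy_override) ? context[:retry_policy_override] : CLASS_POLICY
end

absent   = effective_policy({ model: "gpt-5-nano" })              # key missing
disabled = effective_policy({ retry_policy_override: nil })       # explicit nil
replaced = effective_policy({ retry_policy_override: :custom_chain })

# A truthiness-based lookup could not express "no retries for this call":
naive = { retry_policy_override: nil }[:retry_policy_override] || CLASS_POLICY

puts [absent, disabled, replaced, naive].inspect
```

This is the behaviour the changelog describes as "nullifies or replaces": a `nil` value switches the class-level policy off for one call, while any other value substitutes for it.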
@@ -2,6 +2,6 @@
2
2
 
3
3
  module RubyLLM
4
4
  module Contract
5
- VERSION = "0.6.2"
5
+ VERSION = "0.6.4"
6
6
  end
7
7
  end
@@ -154,6 +154,7 @@ end
154
154
  require_relative "contract/concerns/context_helpers"
155
155
  require_relative "contract/concerns/deep_freeze"
156
156
  require_relative "contract/concerns/deep_symbolize"
157
+ require_relative "contract/concerns/production_mode_context"
157
158
  require_relative "contract/concerns/eval_host"
158
159
  require_relative "contract/concerns/trace_equality"
159
160
  require_relative "contract/concerns/usage_aggregator"
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: ruby_llm-contract
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.6.2
4
+ version: 0.6.4
5
5
  platform: ruby
6
6
  authors:
7
7
  - Justyna
@@ -89,6 +89,7 @@ files:
89
89
  - lib/ruby_llm/contract/concerns/deep_freeze.rb
90
90
  - lib/ruby_llm/contract/concerns/deep_symbolize.rb
91
91
  - lib/ruby_llm/contract/concerns/eval_host.rb
92
+ - lib/ruby_llm/contract/concerns/production_mode_context.rb
92
93
  - lib/ruby_llm/contract/concerns/trace_equality.rb
93
94
  - lib/ruby_llm/contract/concerns/usage_aggregator.rb
94
95
  - lib/ruby_llm/contract/configuration.rb
@@ -109,6 +110,7 @@ files:
109
110
  - lib/ruby_llm/contract/dsl.rb
110
111
  - lib/ruby_llm/contract/errors.rb
111
112
  - lib/ruby_llm/contract/eval.rb
113
+ - lib/ruby_llm/contract/eval/aggregated_report.rb
112
114
  - lib/ruby_llm/contract/eval/baseline_diff.rb
113
115
  - lib/ruby_llm/contract/eval/case_executor.rb
114
116
  - lib/ruby_llm/contract/eval/case_result.rb
@@ -180,7 +182,6 @@ files:
180
182
  - lib/ruby_llm/contract/token_estimator.rb
181
183
  - lib/ruby_llm/contract/types.rb
182
184
  - lib/ruby_llm/contract/version.rb
183
- - ruby_llm-contract.gemspec
184
185
  homepage: https://github.com/justi/ruby_llm-contract
185
186
  licenses:
186
187
  - MIT
@@ -1,35 +0,0 @@
1
- # frozen_string_literal: true
2
-
3
- require_relative "lib/ruby_llm/contract/version"
4
-
5
- Gem::Specification.new do |spec|
6
- spec.name = "ruby_llm-contract"
7
- spec.version = RubyLLM::Contract::VERSION
8
- spec.authors = ["Justyna"]
9
-
10
- spec.summary = "Know which LLM model to use, what it costs, and when accuracy drops"
11
- spec.description = "Compare LLM models by accuracy and cost. Regression-test prompts in CI. " \
12
- "Start on nano, auto-escalate to bigger models when quality drops. " \
13
- "Companion gem for ruby_llm."
14
- spec.homepage = "https://github.com/justi/ruby_llm-contract"
15
- spec.license = "MIT"
16
- spec.required_ruby_version = ">= 3.2.0"
17
-
18
- spec.metadata["homepage_uri"] = spec.homepage
19
- spec.metadata["source_code_uri"] = spec.homepage
20
- spec.metadata["changelog_uri"] = "#{spec.homepage}/blob/main/CHANGELOG.md"
21
- spec.metadata["documentation_uri"] = "#{spec.homepage}#readme"
22
- spec.metadata["rubygems_mfa_required"] = "true"
23
-
24
- spec.files = Dir.chdir(__dir__) do
25
- `git ls-files -z`.split("\x0").reject do |f|
26
- (File.expand_path(f) == __FILE__) ||
27
- f.start_with?("spec/", "docs/", "doc/", ".ai/", ".claude/", ".git")
28
- end
29
- end
30
- spec.require_paths = ["lib"]
31
-
32
- spec.add_dependency "dry-types", "~> 1.7"
33
- spec.add_dependency "ruby_llm", "~> 1.0"
34
- spec.add_dependency "ruby_llm-schema", "~> 0.3"
35
- end