ruby_llm-contract 0.6.0 → 0.6.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: c86dcf06a34e5ff934367708213a0852191b0be7bd61e61e8577161e32ccf807
-  data.tar.gz: 28a44dd4f4d4c0f74c5da7800d9fa1fd2b8e51242a056f5aaae5f6f9fcf1be69
+  metadata.gz: 75884386ae53ddf1985760afa2c94f64ce15274b0cb0829ea40e6711506e60cf
+  data.tar.gz: 203229c4ce8ab0b1ab9e6209871668b0713b114e412eba03e79e9964dc9a43bb
 SHA512:
-  metadata.gz: c2b6dc71a02519288ae4ee4f74f17d74e6c45d01e888036d18e4c821b8fc63ba43ac0ccee6b2948163597b19c93008a6e5e0375fd65c8f3ad8fcf1285f356e91
-  data.tar.gz: f08576e520ec7397c4b233c5907ce703890cf4616e969e25f4c980ac1b06013a62ff680b3c90878d76fd1da0980f04d4946a2fd11a8e304e43f5155e5c855e8e
+  metadata.gz: 0c586ce70e71d8e77c262e0ae21bab2e29ef6ccdb16df4a0e56271aeb98efba59027618db24d4b59470fd3a9d340051201584038f8c67143f7ca1bec0e5e365b
+  data.tar.gz: 6a988c62f6b36a4da860b736c5523abcc568ed1384c1f20cd8ff9659124c12eeeddb36100d5b8542af07b363cef7ade8a1aca01250929ec2c23172d4f3faec5f
data/CHANGELOG.md CHANGED
@@ -1,5 +1,38 @@
 # Changelog
 
+## 0.6.3 (2026-04-20)
+
+### Features
+
+- **`runs:` parameter on `compare_models` and `optimize_retry_policy`** — runs each candidate N times per eval and aggregates the mean score, mean cost per run, and mean latency. Reduces sampling variance in live mode where LLM outputs are non-deterministic (gpt-5 family enforces `temperature=1.0` server-side, so a single unlucky sample can misclassify a viable candidate as "failing"). Default `runs: 1` — backward compatible.
+- **`RUNS=N` on `rake ruby_llm_contract:optimize`** — CLI flag for variance-aware optimization.
+- **`Eval::AggregatedReport`** — duck-type `Report` exposing `score` (mean), `score_min`/`score_max` (spread), `total_cost` (mean per run), `pass_rate` (clean-pass count x/N), and `clean_passes`.
+- **Guide: [Reducing variance with `runs:`](docs/guide/optimizing_retry_policy.md#reducing-variance-with-runs)** — when to use it and why.
+
+## 0.6.2 (2026-04-18)
+
+### Features
+
+- **`Step.optimize_retry_policy`** — runs `compare_models` on ALL evals for the step, builds a score matrix, identifies the constraining eval, and suggests a retry chain. Chain's last model always passes all evals (safe fallback).
+- **`rake ruby_llm_contract:optimize`** — one-command retry chain optimization. Prints score table, constraining eval, suggested chain, and copy-paste DSL.
+- **Offline by default** — `optimize` uses `sample_response` (zero API calls) unless `LIVE=1` or `PROVIDER=` is set.
+- **`EVAL_DIRS=` support** — non-Rails setups can specify eval file directories.
+- **Guide: [Optimizing retry_policy](docs/guide/optimizing_retry_policy.md)** — full procedure with prerequisites, troubleshooting, and real-world example.
+
+### Fixes
+
+- Chain semantics aligned with `retry_executor` — retry fires on `validation_failed`/`parse_error`, not on low eval score. Disjoint eval coverage (A passes e1, B passes e2, neither passes both) correctly returns empty chain.
+- Removed ActiveSupport dependency from rake task (`.presence` → `.empty?`).
+- Added `require "set"` for non-Rails environments.
+
+## 0.6.1 (2026-04-17)
+
+### Features
+
+- **Multi-provider operator tooling** — rake tasks support `PROVIDER=openai|anthropic|ollama`, `CANDIDATES=model@effort,...`, and `REASONING_EFFORT=low|medium|high`.
+- **`rake ruby_llm_contract:recommend`** — wraps `Step.recommend` with CLI interface, prints best config, retry chain, DSL, rationale, and savings.
+- **Ollama support** — `PROVIDER=ollama` with configurable `OLLAMA_API_BASE`.
+
 ## 0.6.0 (2026-04-12)
 
 "What should I do?" — model + configuration recommendation.
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    ruby_llm-contract (0.6.0)
+    ruby_llm-contract (0.6.3)
       dry-types (~> 1.7)
       ruby_llm (~> 1.0)
       ruby_llm-schema (~> 0.3)
@@ -258,7 +258,7 @@ CHECKSUMS
   rubocop-ast (1.49.1) sha256=4412f3ee70f6fe4546cc489548e0f6fcf76cafcfa80fa03af67098ffed755035
   ruby-progressbar (1.13.0) sha256=80fc9c47a9b640d6834e0dc7b3c94c9df37f08cb072b7761e4a71e22cff29b33
   ruby_llm (1.14.0) sha256=57c6f7034fc4a44504ea137d70f853b07824f1c1cdbe774ab3ab3522e7098deb
-  ruby_llm-contract (0.6.0)
+  ruby_llm-contract (0.6.3)
   ruby_llm-schema (0.3.0) sha256=a591edc5ca1b7f0304f0e2261de61ba4b3bea17be09f5cf7558153adfda3dec6
   ruby_parser (3.22.0) sha256=1eb4937cd9eb220aa2d194e352a24dba90aef00751e24c8dfffdb14000f15d23
   rubycritic (4.12.0) sha256=024fed90fe656fa939f6ea80aab17569699ac3863d0b52fd72cb99892247abc8
data/README.md CHANGED
@@ -158,6 +158,8 @@ Cheapest at 100%: gpt-4.1-mini
 
 Nano fails on edge cases. Mini and full both score 100% — but mini is **5x cheaper**. Now you know.
 
+Running live against gpt-5 / o-series? Pass `runs: 3` to average out sampling variance (OpenAI forces `temperature=1.0` server-side, so one unlucky run can misclassify a viable candidate). See [Reducing variance with `runs:`](docs/guide/optimizing_retry_policy.md#reducing-variance-with-runs).
+
 ## Let the gem tell you what to do
 
 Don't read tables — get a recommendation. Supports `model + reasoning_effort` combinations:
@@ -257,12 +259,21 @@ end
 # bundle exec rake ruby_llm_contract:eval
 ```
 
+## Full power: data-driven retry chains
+
+The pieces above — evals, compare_models, recommend — combine into a workflow that replaces guesswork with measured optimization. You define evals for your step, run `recommend` against all of them, find the eval that actually needs the strongest model, and build a retry chain where each attempt is as cheap as the data allows.
+
+The difference: instead of "gpt-5-mini seems to work, let's use it everywhere", you get "nano handles 4/6 scenarios, mini@low catches the 5th, full mini only fires on the hardest edge case — first attempt is 4× cheaper."
+
+Full procedure with examples: **[Optimizing retry_policy](docs/guide/optimizing_retry_policy.md)**
+
 ## Docs
 
 | Guide | |
 |-------|-|
 | [Getting Started](docs/guide/getting_started.md) | Features walkthrough, model escalation, eval |
 | [Eval-First](docs/guide/eval_first.md) | Practical workflow for prompt engineering with datasets, baselines, and A/B gates |
+| [Optimizing retry_policy](docs/guide/optimizing_retry_policy.md) | Find the cheapest retry chain that passes all your evals |
 | [Best Practices](docs/guide/best_practices.md) | 6 patterns for bulletproof validates |
 | [Output Schema](docs/guide/output_schema.md) | Full schema reference + constraints |
 | [Pipeline](docs/guide/pipeline.md) | Multi-step composition, timeout, fail-fast |
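
The README's "first attempt is 4× cheaper" reasoning can be made concrete with a small expected-cost calculation. The sketch below is illustrative only: the relative costs are hypothetical units (only the 4:1 ratio between full mini and nano comes from the README text), the six scenarios are assumed to be equally likely, and every miss is assumed to trigger a retry. `expected_chain_cost` is a name invented for this example, not part of the gem.

```ruby
# Hypothetical relative cost per attempt. Only the 4:1 mini/nano ratio is
# taken from the README; the mini@low figure is an assumption.
CHAIN = [
  { label: "gpt-5-nano",     cost: 1, handles: Rational(4, 6) },
  { label: "gpt-5-mini@low", cost: 2, handles: Rational(1, 6) },
  { label: "gpt-5-mini",     cost: 4, handles: Rational(1, 6) }
].freeze

# Expected cost per input: each attempt's cost is weighted by the
# probability that an input actually reaches that attempt.
def expected_chain_cost(chain)
  reach = Rational(1) # every input reaches the first attempt
  chain.sum do |attempt|
    weighted = reach * attempt[:cost]
    reach -= attempt[:handles] # inputs handled here never escalate
    weighted
  end
end

expected = expected_chain_cost(CHAIN)
always_mini = CHAIN.last[:cost]

puts "expected chain cost: #{expected.to_f.round(2)} units"
puts "always gpt-5-mini:   #{always_mini} units"
```

Under these assumptions the chain averages 7/3 units per input against 4 units for always calling the strongest model. Real numbers depend on actual pricing and on how often validation failures trigger escalation.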
@@ -70,9 +70,11 @@ module RubyLLM
   Eval::PromptDiff.new(candidate: my_report, baseline: other_report)
 end
 
-def compare_models(eval_name, models: [], candidates: [], context: {})
+def compare_models(eval_name, models: [], candidates: [], context: {}, runs: 1)
   raise ArgumentError, "Pass either models: or candidates:, not both" if models.any? && candidates.any?
 
+  runs = coerce_runs(runs)
+
   context = safe_context(context)
   candidate_configs = normalize_candidates(models, candidates)
 
@@ -82,7 +84,8 @@ module RubyLLM
     label = Eval::ModelComparison.candidate_label(config)
     model_context = isolate_context(context).merge(model: config[:model])
     model_context[:reasoning_effort] = config[:reasoning_effort] if config[:reasoning_effort]
-    reports[label] = run_single_eval(eval_name, model_context)
+    per_run = Array.new(runs) { run_single_eval(eval_name, model_context) }
+    reports[label] = runs == 1 ? per_run.first : Eval::AggregatedReport.new(per_run)
     configs[label] = config
   end
 
@@ -91,6 +94,13 @@ module RubyLLM
 
 private
 
+def coerce_runs(runs)
+  raise ArgumentError, "runs must be an Integer >= 1, got #{runs.inspect}" unless runs.is_a?(Integer)
+  raise ArgumentError, "runs must be >= 1, got #{runs.inspect}" if runs < 1
+
+  runs
+end
+
 def normalize_candidates(models, candidates)
   if candidates.any?
     candidates.map { |c| RubyLLM::Contract.normalize_candidate_config(c) }.uniq
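
The `runs:` handling in the hunks above comes down to two steps: validate the argument strictly, then repeat the eval and keep the single-run return shape when `runs == 1`. A minimal standalone sketch of that pattern follows. `FakeReport` and `repeat_eval` are stand-ins invented for this example, and the per-run scores are made up; in the gem the multi-run result is wrapped in `Eval::AggregatedReport` rather than returned as a plain array.

```ruby
# Stand-in for a single-run report; only the field this sketch needs.
FakeReport = Struct.new(:score)

# Same validation as the diff: Integers only, and at least 1.
def coerce_runs(runs)
  raise ArgumentError, "runs must be an Integer >= 1, got #{runs.inspect}" unless runs.is_a?(Integer)
  raise ArgumentError, "runs must be >= 1, got #{runs.inspect}" if runs < 1

  runs
end

# Calls the block `runs` times. With runs == 1 the lone report is returned
# unchanged, so existing callers see no difference.
def repeat_eval(runs)
  per_run = Array.new(coerce_runs(runs)) { yield }
  runs == 1 ? per_run.first : per_run
end

scores = [1.0, 0.5, 1.0].each # hypothetical per-run scores
single = repeat_eval(1) { FakeReport.new(1.0) }
many = repeat_eval(3) { FakeReport.new(scores.next) }

# Strings, zero, floats and nil are all rejected rather than coerced.
rejected = ["3", 0, 2.0, nil].select do |bad|
  begin
    coerce_runs(bad)
    false
  rescue ArgumentError
    true
  end
end

puts single.inspect
puts many.map(&:score).inspect
puts "rejected: #{rejected.inspect}"
```

Rejecting `"3"` instead of converting it keeps the Ruby API strict; string parsing is left to the rake task, which reads `RUNS` from the environment.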
@@ -0,0 +1,92 @@
+# frozen_string_literal: true
+
+module RubyLLM
+  module Contract
+    module Eval
+      # Wraps N Reports from repeated runs of the same eval to reduce sampling
+      # variance in live mode (temperature=1 on gpt-5 family). Exposes the same
+      # duck-type as Report — mean score, mean cost per run, mean latency.
+      #
+      # pass_rate reports how many runs passed cleanly (x/N), not case-level
+      # pass rate, since the question is "does this candidate reliably pass?".
+      class AggregatedReport
+        attr_reader :runs, :results
+
+        def initialize(runs)
+          raise ArgumentError, "runs must not be empty" if runs.empty?
+
+          @runs = runs.freeze
+          @results = runs.flat_map(&:results).freeze
+          freeze
+        end
+
+        def dataset_name
+          @runs.first.dataset_name
+        end
+
+        def step_name
+          @runs.first.step_name
+        end
+
+        def score
+          @runs.sum(&:score) / @runs.length.to_f
+        end
+
+        def score_min
+          @runs.map(&:score).min
+        end
+
+        def score_max
+          @runs.map(&:score).max
+        end
+
+        def total_cost
+          @runs.sum(&:total_cost) / @runs.length.to_f
+        end
+
+        def avg_latency_ms
+          latencies = @runs.filter_map(&:avg_latency_ms)
+          return nil if latencies.empty?
+
+          latencies.sum / latencies.length.to_f
+        end
+
+        def pass_rate
+          "#{clean_passes}/#{@runs.length}"
+        end
+
+        def pass_rate_ratio
+          clean_passes.to_f / @runs.length
+        end
+
+        def each(&block)
+          @results.each(&block)
+        end
+
+        def summary
+          @runs.first.summary
+        end
+
+        def to_s
+          @runs.first.to_s
+        end
+
+        def print_summary(io = $stdout)
+          @runs.first.print_summary(io)
+        end
+
+        def passed?
+          @runs.all?(&:passed?)
+        end
+
+        def clean_passes
+          @runs.count(&:passed?)
+        end
+
+        def failures
+          @runs.flat_map(&:failures)
+        end
+      end
+    end
+  end
+end
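
To see what the aggregate reports for a noisy candidate, here is a condensed standalone version of the same arithmetic. `RunReport` and `AggregateSketch` are stand-ins written for this example (the real class wraps `Eval::Report` objects), and the three runs use made-up numbers.

```ruby
# Stand-in for a single-run report (hypothetical; not the gem's class).
RunReport = Struct.new(:score, :total_cost, :passed) do
  def passed?
    passed
  end
end

# Condensed copy of the aggregation logic from the diff above.
class AggregateSketch
  def initialize(runs)
    raise ArgumentError, "runs must not be empty" if runs.empty?

    @runs = runs
  end

  # Mean score across runs.
  def score
    @runs.sum(&:score) / @runs.length.to_f
  end

  # Worst and best single-run scores, i.e. the spread.
  def score_min
    @runs.map(&:score).min
  end

  def score_max
    @runs.map(&:score).max
  end

  # Mean cost per run, not the sum over all runs.
  def total_cost
    @runs.sum(&:total_cost) / @runs.length.to_f
  end

  # Run-level pass rate: how many of the N runs passed cleanly.
  def pass_rate
    "#{@runs.count(&:passed?)}/#{@runs.length}"
  end

  # The aggregate passes only if every run passed.
  def passed?
    @runs.all?(&:passed?)
  end
end

# Three hypothetical live runs of one candidate; the second one is unlucky.
report = AggregateSketch.new([
  RunReport.new(1.0, 0.25, true),
  RunReport.new(0.5, 0.5, false),
  RunReport.new(1.0, 0.75, true)
])

puts "score:     #{report.score.round(2)} (min #{report.score_min}, max #{report.score_max})"
puts "cost/run:  #{report.total_cost}"
puts "pass rate: #{report.pass_rate}, passed? #{report.passed?}"
```

Note the asymmetry this makes visible: `score` is a mean, so one bad run only lowers it, while `passed?` requires every run to pass.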
@@ -0,0 +1,222 @@
+# frozen_string_literal: true
+
+require "set"
+
+module RubyLLM
+  module Contract
+    module Eval
+      # Runs compare_models on ALL evals for a step, builds a score matrix,
+      # identifies the constraining eval, and suggests an escalation chain.
+      #
+      # optimizer = RetryOptimizer.new(step: MyStep, candidates: [...], context: {})
+      # result = optimizer.call
+      # result.print_summary
+      # result.to_dsl # => copy-paste retry_policy
+      class RetryOptimizer
+        Result = Struct.new(:step_name, :eval_names, :candidate_labels, :score_matrix,
+                            :constraining_eval, :chain, :chain_details, keyword_init: true) do
+          def print_summary(io = $stdout)
+            io.puts "#{step_name} — retry chain optimization"
+            io.puts
+            print_table(io)
+            io.puts
+            print_chain(io)
+            io.puts
+            print_dsl(io)
+          end
+
+          def to_dsl
+            return "# No viable chain — no candidate passes all evals" if chain.empty?
+
+            if chain.all? { |c| c.keys == [:model] }
+              models_str = chain.map { |c| c[:model] }.join(" ")
+              "retry_policy models: %w[#{models_str}]"
+            else
+              args = chain.map { |c| config_to_ruby(c) }.join(",\n ")
+              "retry_policy do\n escalate(\n #{args}\n )\nend"
+            end
+          end
+
+          private
+
+          def print_table(io)
+            short_labels = candidate_labels.map { |l| short_candidate_label(l) }
+            col_width = [short_labels.map(&:length).max || 0, 8].max
+            eval_width = [eval_names.map { |e| e.to_s.length }.max || 0, 12].max
+
+            header = format(" %-#{eval_width}s", "eval") + short_labels.map { |l| format(" %#{col_width}s", l) }.join
+            io.puts header
+            io.puts " #{"-" * (eval_width + (col_width + 2) * short_labels.size)}"
+
+            eval_names.each do |eval_name|
+              row = format(" %-#{eval_width}s", eval_name.to_s)
+              candidate_labels.each do |label|
+                score = score_matrix.dig(eval_name, label) || 0.0
+                marker = eval_name == constraining_eval && score < 1.0 ? " ←" : " "
+                row += format(" %#{col_width - 2}.2f%s", score, marker)
+              end
+              io.puts row
+            end
+
+            io.puts
+            io.puts " Constraining eval: #{constraining_eval}" if constraining_eval
+          end
+
+          def print_chain(io)
+            if chain.empty?
+              io.puts " No viable chain."
+              return
+            end
+
+            io.puts " Suggested chain:"
+            chain_details.each_with_index do |detail, i|
+              suffix = i == chain_details.size - 1 ? "passes all #{eval_names.size} evals" : "covers #{detail[:passes]} eval(s)"
+              io.puts " #{detail[:label]} — #{suffix}"
+            end
+          end
+
+          def short_candidate_label(label)
+            label
+              .sub("gpt-5-", "")
+              .sub("gpt-4.1", "4.1")
+              .sub(" (effort: ", "@")
+              .sub(")", "")
+          end
+
+          def print_dsl(io)
+            io.puts " DSL:"
+            to_dsl.each_line { |line| io.puts " #{line}" }
+          end
+
+          def config_to_ruby(config)
+            pairs = config.map { |k, v| "#{k}: #{v.inspect}" }.join(", ")
+            "{ #{pairs} }"
+          end
+        end
+
+        def initialize(step:, candidates:, context: {}, min_score: 0.95, runs: 1)
+          @step = step
+          @candidates = candidates
+          @context = context
+          @min_score = min_score
+          @runs = runs
+        end
+
+        def call
+          evals = @step.eval_names
+          return empty_result(evals) if evals.empty?
+
+          score_matrix = {}
+          evals.each do |eval_name|
+            comparison = with_retry_disabled do
+              @step.compare_models(eval_name, candidates: @candidates, context: @context, runs: @runs)
+            end
+            score_matrix[eval_name] = extract_scores(comparison)
+          end
+
+          labels = score_matrix.values.flat_map(&:keys).uniq
+          constraining = find_constraining_eval(score_matrix, labels)
+          chain, details = build_chain(score_matrix, labels, evals)
+
+          Result.new(
+            step_name: @step.name || @step.to_s,
+            eval_names: evals,
+            candidate_labels: labels,
+            score_matrix: score_matrix,
+            constraining_eval: constraining,
+            chain: chain,
+            chain_details: details
+          )
+        end
+
+        private
+
+        def extract_scores(comparison)
+          comparison.reports.transform_values(&:score)
+        end
+
+        def find_constraining_eval(matrix, labels)
+          matrix.max_by do |_eval_name, scores|
+            cheapest_passing = labels.find { |l| (scores[l] || 0) >= @min_score }
+            cheapest_passing ? labels.index(cheapest_passing) : labels.size
+          end&.first
+        end
+
+        # Retry escalates on validation_failed/parse_error, NOT on low eval
+        # score. A model that returns :ok with semantically wrong output won't
+        # trigger retry. Therefore the LAST model in the chain must pass ALL
+        # evals — it's the safety net. Cheaper models are prepended as
+        # first-try optimization (they handle easy inputs cheaply; when they
+        # fail validation, retry escalates to the safe fallback).
+        #
+        # Known limitation: intermediate models are assumed safe if their eval
+        # failures correspond to validation failures (retryable). If an
+        # intermediate model returns :ok with semantically wrong output on
+        # some eval, retry won't fire and the safe fallback won't run. This
+        # requires step validates to cover the same semantics as eval verify
+        # checks. A future version could inspect per-case step_status from
+        # compare_models to verify failures are actually retryable.
+        def build_chain(matrix, labels, evals)
+          total = evals.size
+
+          # Find cheapest model that passes every eval — the safe fallback.
+          safe_fallback = labels.find { |l| evals.all? { |e| (matrix.dig(e, l) || 0) >= @min_score } }
+          return [[], []] unless safe_fallback
+
+          # Prepend cheaper models that pass a strict subset.
+          chain = []
+          details = []
+          covered_evals = Set.new
+
+          labels.each do |label|
+            break if label == safe_fallback
+
+            newly_covered = evals.select { |e| (matrix.dig(e, label) || 0) >= @min_score }
+            new_additions = newly_covered.to_set - covered_evals
+            next if new_additions.empty?
+
+            covered_evals.merge(new_additions)
+            chain << parse_label_to_config(label)
+            details << { label: label, passes: new_additions.size, cost: label }
+          end
+
+          # Always end with the safe fallback.
+          chain << parse_label_to_config(safe_fallback)
+          details << { label: safe_fallback, passes: total, cost: safe_fallback }
+
+          [chain, details]
+        end
+
+        def parse_label_to_config(label)
+          if label.match?(/\(effort: (\w+)\)/)
+            model = label.sub(/\s*\(effort:.*/, "").strip
+            effort = label.match(/\(effort: (\w+)\)/)[1]
+            { model: model, reasoning_effort: effort }
+          else
+            { model: label }
+          end
+        end
+
+        def with_retry_disabled(&block)
+          original = @step.retry_policy if @step.respond_to?(:retry_policy)
+          @step.define_singleton_method(:retry_policy) { nil }
+          block.call
+        ensure
+          @step.define_singleton_method(:retry_policy) { original }
+        end
+
+        def empty_result(evals)
+          Result.new(
+            step_name: @step.name || @step.to_s,
+            eval_names: evals,
+            candidate_labels: [],
+            score_matrix: {},
+            constraining_eval: nil,
+            chain: [],
+            chain_details: []
+          )
+        end
+      end
+    end
+  end
+end
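
The chain-building rule described in the comments above can be exercised on a hand-written score matrix. The sketch below reimplements only the selection logic as a free function that returns labels instead of config hashes. The matrices and candidate names are made up, and it assumes, as `build_chain` appears to, that `labels` is ordered cheapest first.

```ruby
require "set"

# Simplified stand-in for RetryOptimizer#build_chain. Returns chain labels.
# Assumes `labels` is ordered from cheapest to most expensive.
def build_chain(matrix, labels, min_score: 0.95)
  evals = matrix.keys
  passes = ->(eval_name, label) { (matrix.dig(eval_name, label) || 0) >= min_score }

  # The first (cheapest) candidate passing every eval is the safe fallback.
  safe_fallback = labels.find { |l| evals.all? { |e| passes.call(e, l) } }
  return [] unless safe_fallback

  chain = []
  covered = Set.new
  labels.each do |label|
    break if label == safe_fallback

    # Keep a cheaper candidate only if it covers an eval nothing earlier did.
    new_additions = evals.select { |e| passes.call(e, label) }.to_set - covered
    next if new_additions.empty?

    covered.merge(new_additions)
    chain << label
  end
  chain << safe_fallback
end

# Hypothetical scores: nano passes the easy eval, mini passes both.
matrix = {
  smoke: { "nano" => 1.0, "mini" => 1.0, "full" => 1.0 },
  edge:  { "nano" => 0.6, "mini" => 1.0, "full" => 1.0 }
}
chain = build_chain(matrix, %w[nano mini full])

# Disjoint coverage: A passes e1, B passes e2, neither passes both.
disjoint = {
  e1: { "A" => 1.0, "B" => 0.0 },
  e2: { "A" => 0.0, "B" => 1.0 }
}
no_chain = build_chain(disjoint, %w[A B])

puts "chain:    #{chain.inspect}"
puts "disjoint: #{no_chain.inspect}"
```

The disjoint case returns an empty chain because no single candidate can act as the safety net, which is the behaviour the 0.6.2 changelog entry describes.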
@@ -21,6 +21,7 @@ require_relative "eval/report_stats"
 require_relative "eval/report_presenter"
 require_relative "eval/report_storage"
 require_relative "eval/report"
+require_relative "eval/aggregated_report"
 require_relative "eval/eval_definition"
 require_relative "eval/model_comparison"
 require_relative "eval/baseline_diff"
@@ -31,3 +32,4 @@ require_relative "eval/prompt_diff"
 require_relative "eval/eval_history"
 require_relative "eval/recommendation"
 require_relative "eval/recommender"
+require_relative "eval/retry_optimizer"
@@ -121,5 +121,98 @@ module RubyLLM
         defined?(::Rails) ? [:environment] : []
       end
     end
+
+    # Standalone task: runs all evals for one step across candidates,
+    # builds a score matrix, and suggests an optimal retry chain.
+    #
+    # Loaded automatically when `require "ruby_llm/contract/rake_task"`.
+    # Usage:
+    # rake ruby_llm_contract:optimize \
+    # STEP=MatchProblemsToPages \
+    # CANDIDATES=gpt-5-nano,gpt-5-mini@low,gpt-5-mini
+    class OptimizeRakeTask < ::Rake::TaskLib
+      def initialize
+        super()
+        define_task
+      end
+
+      private
+
+      def define_task
+        desc "Run all evals for STEP with CANDIDATES and suggest an optimal retry chain"
+        task(:"ruby_llm_contract:optimize" => task_prerequisites) do
+          require "ruby_llm/contract"
+          eval_dirs = ENV["EVAL_DIRS"].to_s.split(",").map(&:strip).reject(&:empty?)
+          RubyLLM::Contract.load_evals!(*eval_dirs)
+
+          step_name = ENV["STEP"].to_s.strip
+          abort("STEP is required, e.g. STEP=MatchProblemsToPages") if step_name.empty?
+          raw_candidates = ENV["CANDIDATES"].to_s.strip
+          abort("CANDIDATES is required, e.g. CANDIDATES=gpt-5-nano,gpt-5-mini@low,gpt-5-mini") if raw_candidates.empty?
+          min_score = ENV.fetch("MIN_SCORE", "0.95").to_f
+          runs = parse_runs(ENV.fetch("RUNS", "1"))
+
+          host = RubyLLM::Contract.eval_hosts.find { |h| h.name == step_name }
+          unless host
+            available = RubyLLM::Contract.eval_hosts.filter_map(&:name).sort
+            abort "Unknown STEP=#{step_name}. Available: #{available.join(", ")}"
+          end
+
+          candidates = parse_candidates(raw_candidates)
+          context = build_context
+
+          result = host.optimize_retry_policy(
+            candidates: candidates,
+            context: context,
+            min_score: min_score,
+            runs: runs
+          )
+
+          result.print_summary
+        end
+      end
+
+      def parse_runs(raw)
+        runs = Integer(raw.to_s.strip, 10)
+        abort("RUNS must be an integer >= 1, e.g. RUNS=1") if runs < 1
+        runs
+      rescue ArgumentError
+        abort("RUNS must be an integer >= 1, e.g. RUNS=1")
+      end
+
+      def parse_candidates(raw)
+        entries = if raw.start_with?("[")
+                    Array(JSON.parse(raw))
+                  else
+                    raw.split(",").map(&:strip).reject(&:empty?).map do |entry|
+                      model, effort = entry.split("@", 2)
+                      config = { model: model.strip }
+                      config[:reasoning_effort] = effort.strip if effort && !effort.empty?
+                      config
+                    end
+                  end
+
+        entries.map { |e| RubyLLM::Contract.normalize_candidate_config(e) }.uniq
+      end
+
+      def build_context
+        ctx = {}
+        provider = ENV["PROVIDER"].to_s.strip
+        # Only inject real adapter when LIVE=1 or PROVIDER is set — otherwise
+        # evals use sample_response (offline mode, zero API calls).
+        if ENV["LIVE"] == "1" || !provider.empty?
+          ctx[:adapter] = RubyLLM::Contract::Adapters::RubyLLM.new
+          ctx[:provider] = provider.downcase.to_sym unless provider.empty?
+        end
+        ctx
+      end
+
+      def task_prerequisites
+        defined?(::Rails) ? [:environment] : []
+      end
+    end
+
+    # Auto-register the optimize task when this file is loaded
+    OptimizeRakeTask.new
   end
 end
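
The environment parsing in the task above can be tried without Rake or the gem. The sketch below copies the comma-separated branch of `parse_candidates`, the strict integer parsing from `parse_runs`, and the offline/live decision from `build_context` into plain functions. It leaves out the JSON branch and the `normalize_candidate_config` call, raises instead of calling `abort`, and `live_mode?` is a helper name invented here.

```ruby
# Comma-separated form only: "model" or "model@effort".
def parse_candidates(raw)
  raw.split(",").map(&:strip).reject(&:empty?).map do |entry|
    model, effort = entry.split("@", 2)
    config = { model: model.strip }
    config[:reasoning_effort] = effort.strip if effort && !effort.empty?
    config
  end.uniq
end

# Integer(..., 10) rejects "abc", "1.5" and "" instead of silently giving 0,
# which String#to_i would do.
def parse_runs(raw)
  runs = Integer(raw.to_s.strip, 10)
  raise ArgumentError, "RUNS must be an integer >= 1" if runs < 1

  runs
end

# Offline unless LIVE=1 or a non-blank PROVIDER is set.
def live_mode?(env)
  env["LIVE"] == "1" || !env["PROVIDER"].to_s.strip.empty?
end

candidates = parse_candidates("gpt-5-nano, gpt-5-mini@low,gpt-5-mini,,gpt-5-nano")
runs = parse_runs(" 3 ")

puts candidates.inspect
puts "runs: #{runs}"
puts "offline by default: #{!live_mode?({})}"
puts "live with PROVIDER: #{live_mode?({ 'PROVIDER' => 'ollama' })}"
```

Passing the environment as a plain hash keeps the decision testable; the real task reads `ENV` directly.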
@@ -59,6 +59,16 @@ module RubyLLM
   ).recommend
 end
 
+def optimize_retry_policy(candidates:, context: {}, min_score: 0.95, runs: 1)
+  Eval::RetryOptimizer.new(
+    step: self,
+    candidates: candidates,
+    context: context,
+    min_score: min_score,
+    runs: runs
+  ).call
+end
+
 KNOWN_CONTEXT_KEYS = %i[adapter model temperature max_tokens provider assume_model_exists reasoning_effort].freeze
 
 include Concerns::ContextHelpers
@@ -2,6 +2,6 @@
 
 module RubyLLM
   module Contract
-    VERSION = "0.6.0"
+    VERSION = "0.6.3"
   end
 end
metadata CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: ruby_llm-contract
 version: !ruby/object:Gem::Version
-  version: 0.6.0
+  version: 0.6.3
 platform: ruby
 authors:
 - Justyna
@@ -109,6 +109,7 @@ files:
 - lib/ruby_llm/contract/dsl.rb
 - lib/ruby_llm/contract/errors.rb
 - lib/ruby_llm/contract/eval.rb
+- lib/ruby_llm/contract/eval/aggregated_report.rb
 - lib/ruby_llm/contract/eval/baseline_diff.rb
 - lib/ruby_llm/contract/eval/case_executor.rb
 - lib/ruby_llm/contract/eval/case_result.rb
@@ -136,6 +137,7 @@ files:
 - lib/ruby_llm/contract/eval/report_presenter.rb
 - lib/ruby_llm/contract/eval/report_stats.rb
 - lib/ruby_llm/contract/eval/report_storage.rb
+- lib/ruby_llm/contract/eval/retry_optimizer.rb
 - lib/ruby_llm/contract/eval/runner.rb
 - lib/ruby_llm/contract/eval/step_expectation_applier.rb
 - lib/ruby_llm/contract/eval/step_result_normalizer.rb