ruby_llm-contract 0.6.0 → 0.6.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: c86dcf06a34e5ff934367708213a0852191b0be7bd61e61e8577161e32ccf807
-   data.tar.gz: 28a44dd4f4d4c0f74c5da7800d9fa1fd2b8e51242a056f5aaae5f6f9fcf1be69
+   metadata.gz: 2636c2e59f5fef27f929a94ac9e3194793ce6b51f86cce0c18ca6a5b0caa61ab
+   data.tar.gz: c2c81c7cc8fd281bf6c88738f0fb5bc4bdbb86b71d2fb59248e2b0ebb8d648fe
  SHA512:
-   metadata.gz: c2b6dc71a02519288ae4ee4f74f17d74e6c45d01e888036d18e4c821b8fc63ba43ac0ccee6b2948163597b19c93008a6e5e0375fd65c8f3ad8fcf1285f356e91
-   data.tar.gz: f08576e520ec7397c4b233c5907ce703890cf4616e969e25f4c980ac1b06013a62ff680b3c90878d76fd1da0980f04d4946a2fd11a8e304e43f5155e5c855e8e
+   metadata.gz: d9f2fca592fd3a183d987239dea0cdc2456eed639d6cccaea71e9b1ef3a3ff6e32f3e346f640c905168674e7401ccb85e5f342815a1b290cdebe05a9f7b5374f
+   data.tar.gz: 00b8d113564871db19f88d9061276a0aee0295da7638974804782fda7929cf893a0004aadfeffc2f25f13d421935e0e9d710e553d8e56b7b5d0229f62edd129e
data/CHANGELOG.md CHANGED
@@ -1,5 +1,29 @@
  # Changelog
 
+ ## 0.6.2 (2026-04-18)
+
+ ### Features
+
+ - **`Step.optimize_retry_policy`** — runs `compare_models` on ALL evals for the step, builds a score matrix, identifies the constraining eval, and suggests a retry chain. Chain's last model always passes all evals (safe fallback).
+ - **`rake ruby_llm_contract:optimize`** — one-command retry chain optimization. Prints score table, constraining eval, suggested chain, and copy-paste DSL.
+ - **Offline by default** — `optimize` uses `sample_response` (zero API calls) unless `LIVE=1` or `PROVIDER=` is set.
+ - **`EVAL_DIRS=` support** — non-Rails setups can specify eval file directories.
+ - **Guide: [Optimizing retry_policy](docs/guide/optimizing_retry_policy.md)** — full procedure with prerequisites, troubleshooting, and real-world example.
+
+ ### Fixes
+
+ - Chain semantics aligned with `retry_executor` — retry fires on `validation_failed`/`parse_error`, not on low eval score. Disjoint eval coverage (A passes e1, B passes e2, neither passes both) correctly returns empty chain.
+ - Removed ActiveSupport dependency from rake task (`.presence` → `.empty?`).
+ - Added `require "set"` for non-Rails environments.
+
+ ## 0.6.1 (2026-04-17)
+
+ ### Features
+
+ - **Multi-provider operator tooling** — rake tasks support `PROVIDER=openai|anthropic|ollama`, `CANDIDATES=model@effort,...`, and `REASONING_EFFORT=low|medium|high`.
+ - **`rake ruby_llm_contract:recommend`** — wraps `Step.recommend` with CLI interface, prints best config, retry chain, DSL, rationale, and savings.
+ - **Ollama support** — `PROVIDER=ollama` with configurable `OLLAMA_API_BASE`.
+
  ## 0.6.0 (2026-04-12)
 
  "What should I do?" — model + configuration recommendation.
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
    remote: .
    specs:
-     ruby_llm-contract (0.6.0)
+     ruby_llm-contract (0.6.1)
        dry-types (~> 1.7)
        ruby_llm (~> 1.0)
        ruby_llm-schema (~> 0.3)
@@ -258,7 +258,7 @@ CHECKSUMS
    rubocop-ast (1.49.1) sha256=4412f3ee70f6fe4546cc489548e0f6fcf76cafcfa80fa03af67098ffed755035
    ruby-progressbar (1.13.0) sha256=80fc9c47a9b640d6834e0dc7b3c94c9df37f08cb072b7761e4a71e22cff29b33
    ruby_llm (1.14.0) sha256=57c6f7034fc4a44504ea137d70f853b07824f1c1cdbe774ab3ab3522e7098deb
-   ruby_llm-contract (0.6.0)
+   ruby_llm-contract (0.6.1)
    ruby_llm-schema (0.3.0) sha256=a591edc5ca1b7f0304f0e2261de61ba4b3bea17be09f5cf7558153adfda3dec6
    ruby_parser (3.22.0) sha256=1eb4937cd9eb220aa2d194e352a24dba90aef00751e24c8dfffdb14000f15d23
    rubycritic (4.12.0) sha256=024fed90fe656fa939f6ea80aab17569699ac3863d0b52fd72cb99892247abc8
data/README.md CHANGED
@@ -257,12 +257,21 @@ end
  # bundle exec rake ruby_llm_contract:eval
  ```
 
+ ## Full power: data-driven retry chains
+
+ The pieces above — evals, compare_models, recommend — combine into a workflow that replaces guesswork with measured optimization. You define evals for your step, run `recommend` against all of them, find the eval that actually needs the strongest model, and build a retry chain where each attempt is as cheap as the data allows.
+
+ The difference: instead of "gpt-5-mini seems to work, let's use it everywhere", you get "nano handles 4/6 scenarios, mini@low catches the 5th, full mini only fires on the hardest edge case — first attempt is 4× cheaper."
+
+ Full procedure with examples: **[Optimizing retry_policy](docs/guide/optimizing_retry_policy.md)**
+
  ## Docs
 
  | Guide | |
  |-------|-|
  | [Getting Started](docs/guide/getting_started.md) | Features walkthrough, model escalation, eval |
  | [Eval-First](docs/guide/eval_first.md) | Practical workflow for prompt engineering with datasets, baselines, and A/B gates |
+ | [Optimizing retry_policy](docs/guide/optimizing_retry_policy.md) | Find the cheapest retry chain that passes all your evals |
  | [Best Practices](docs/guide/best_practices.md) | 6 patterns for bulletproof validates |
  | [Output Schema](docs/guide/output_schema.md) | Full schema reference + constraints |
  | [Pipeline](docs/guide/pipeline.md) | Multi-step composition, timeout, fail-fast |
data/lib/ruby_llm/contract/eval/retry_optimizer.rb ADDED
@@ -0,0 +1,221 @@
+ # frozen_string_literal: true
+
+ require "set"
+
+ module RubyLLM
+   module Contract
+     module Eval
+       # Runs compare_models on ALL evals for a step, builds a score matrix,
+       # identifies the constraining eval, and suggests an escalation chain.
+       #
+       # optimizer = RetryOptimizer.new(step: MyStep, candidates: [...], context: {})
+       # result = optimizer.call
+       # result.print_summary
+       # result.to_dsl # => copy-paste retry_policy
+       class RetryOptimizer
+         Result = Struct.new(:step_name, :eval_names, :candidate_labels, :score_matrix,
+                             :constraining_eval, :chain, :chain_details, keyword_init: true) do
+           def print_summary(io = $stdout)
+             io.puts "#{step_name} — retry chain optimization"
+             io.puts
+             print_table(io)
+             io.puts
+             print_chain(io)
+             io.puts
+             print_dsl(io)
+           end
+
+           def to_dsl
+             return "# No viable chain — no candidate passes all evals" if chain.empty?
+
+             if chain.all? { |c| c.keys == [:model] }
+               models_str = chain.map { |c| c[:model] }.join(" ")
+               "retry_policy models: %w[#{models_str}]"
+             else
+               args = chain.map { |c| config_to_ruby(c) }.join(",\n ")
+               "retry_policy do\n escalate(\n #{args}\n )\nend"
+             end
+           end
+
+           private
+
+           def print_table(io)
+             short_labels = candidate_labels.map { |l| short_candidate_label(l) }
+             col_width = [short_labels.map(&:length).max || 0, 8].max
+             eval_width = [eval_names.map { |e| e.to_s.length }.max || 0, 12].max
+
+             header = format(" %-#{eval_width}s", "eval") + short_labels.map { |l| format(" %#{col_width}s", l) }.join
+             io.puts header
+             io.puts " #{"-" * (eval_width + (col_width + 2) * short_labels.size)}"
+
+             eval_names.each do |eval_name|
+               row = format(" %-#{eval_width}s", eval_name.to_s)
+               candidate_labels.each do |label|
+                 score = score_matrix.dig(eval_name, label) || 0.0
+                 marker = eval_name == constraining_eval && score < 1.0 ? " ←" : " "
+                 row += format(" %#{col_width - 2}.2f%s", score, marker)
+               end
+               io.puts row
+             end
+
+             io.puts
+             io.puts " Constraining eval: #{constraining_eval}" if constraining_eval
+           end
+
+           def print_chain(io)
+             if chain.empty?
+               io.puts " No viable chain."
+               return
+             end
+
+             io.puts " Suggested chain:"
+             chain_details.each_with_index do |detail, i|
+               suffix = i == chain_details.size - 1 ? "passes all #{eval_names.size} evals" : "covers #{detail[:passes]} eval(s)"
+               io.puts " #{detail[:label]} — #{suffix}"
+             end
+           end
+
+           def short_candidate_label(label)
+             label
+               .sub("gpt-5-", "")
+               .sub("gpt-4.1", "4.1")
+               .sub(" (effort: ", "@")
+               .sub(")", "")
+           end
+
+           def print_dsl(io)
+             io.puts " DSL:"
+             to_dsl.each_line { |line| io.puts " #{line}" }
+           end
+
+           def config_to_ruby(config)
+             pairs = config.map { |k, v| "#{k}: #{v.inspect}" }.join(", ")
+             "{ #{pairs} }"
+           end
+         end
+
+         def initialize(step:, candidates:, context: {}, min_score: 0.95)
+           @step = step
+           @candidates = candidates
+           @context = context
+           @min_score = min_score
+         end
+
+         def call
+           evals = @step.eval_names
+           return empty_result(evals) if evals.empty?
+
+           score_matrix = {}
+           evals.each do |eval_name|
+             comparison = with_retry_disabled do
+               @step.compare_models(eval_name, candidates: @candidates, context: @context)
+             end
+             score_matrix[eval_name] = extract_scores(comparison)
+           end
+
+           labels = score_matrix.values.flat_map(&:keys).uniq
+           constraining = find_constraining_eval(score_matrix, labels)
+           chain, details = build_chain(score_matrix, labels, evals)
+
+           Result.new(
+             step_name: @step.name || @step.to_s,
+             eval_names: evals,
+             candidate_labels: labels,
+             score_matrix: score_matrix,
+             constraining_eval: constraining,
+             chain: chain,
+             chain_details: details
+           )
+         end
+
+         private
+
+         def extract_scores(comparison)
+           comparison.reports.transform_values(&:score)
+         end
+
+         def find_constraining_eval(matrix, labels)
+           matrix.max_by do |_eval_name, scores|
+             cheapest_passing = labels.find { |l| (scores[l] || 0) >= @min_score }
+             cheapest_passing ? labels.index(cheapest_passing) : labels.size
+           end&.first
+         end
+
+         # Retry escalates on validation_failed/parse_error, NOT on low eval
+         # score. A model that returns :ok with semantically wrong output won't
+         # trigger retry. Therefore the LAST model in the chain must pass ALL
+         # evals — it's the safety net. Cheaper models are prepended as
+         # first-try optimization (they handle easy inputs cheaply; when they
+         # fail validation, retry escalates to the safe fallback).
+         #
+         # Known limitation: intermediate models are assumed safe if their eval
+         # failures correspond to validation failures (retryable). If an
+         # intermediate model returns :ok with semantically wrong output on
+         # some eval, retry won't fire and the safe fallback won't run. This
+         # requires step validates to cover the same semantics as eval verify
+         # checks. A future version could inspect per-case step_status from
+         # compare_models to verify failures are actually retryable.
+         def build_chain(matrix, labels, evals)
+           total = evals.size
+
+           # Find cheapest model that passes every eval — the safe fallback.
+           safe_fallback = labels.find { |l| evals.all? { |e| (matrix.dig(e, l) || 0) >= @min_score } }
+           return [[], []] unless safe_fallback
+
+           # Prepend cheaper models that pass a strict subset.
+           chain = []
+           details = []
+           covered_evals = Set.new
+
+           labels.each do |label|
+             break if label == safe_fallback
+
+             newly_covered = evals.select { |e| (matrix.dig(e, label) || 0) >= @min_score }
+             new_additions = newly_covered.to_set - covered_evals
+             next if new_additions.empty?
+
+             covered_evals.merge(new_additions)
+             chain << parse_label_to_config(label)
+             details << { label: label, passes: new_additions.size, cost: label }
+           end
+
+           # Always end with the safe fallback.
+           chain << parse_label_to_config(safe_fallback)
+           details << { label: safe_fallback, passes: total, cost: safe_fallback }
+
+           [chain, details]
+         end
+
+         def parse_label_to_config(label)
+           if label.match?(/\(effort: (\w+)\)/)
+             model = label.sub(/\s*\(effort:.*/, "").strip
+             effort = label.match(/\(effort: (\w+)\)/)[1]
+             { model: model, reasoning_effort: effort }
+           else
+             { model: label }
+           end
+         end
+
+         def with_retry_disabled(&block)
+           original = @step.retry_policy if @step.respond_to?(:retry_policy)
+           @step.define_singleton_method(:retry_policy) { nil }
+           block.call
+         ensure
+           @step.define_singleton_method(:retry_policy) { original }
+         end
+
+         def empty_result(evals)
+           Result.new(
+             step_name: @step.name || @step.to_s,
+             eval_names: evals,
+             candidate_labels: [],
+             score_matrix: {},
+             constraining_eval: nil,
+             chain: [],
+             chain_details: []
+           )
+         end
+       end
+     end
+   end
+ end
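The selection logic in `build_chain` and `find_constraining_eval` can be sketched outside the gem with a plain nested hash. This is a minimal illustration rather than the gem's API: `suggest_chain` and `constraining_eval` are hypothetical names, labels stand in for full candidate configs, and candidates are assumed to be listed cheapest first, as the comments in the file imply.

```ruby
require "set"

# Hypothetical helpers for illustration; not part of ruby_llm-contract.
# matrix: { eval_name => { candidate_label => score } }
# labels: candidate labels, assumed to be ordered cheapest first.

# The constraining eval is the one whose cheapest passing candidate sits
# furthest down the list (or that no candidate passes at all).
def constraining_eval(matrix, labels, min_score: 0.95)
  matrix.max_by do |_eval_name, scores|
    labels.index { |label| (scores[label] || 0) >= min_score } || labels.size
  end&.first
end

def suggest_chain(matrix, labels, min_score: 0.95)
  evals = matrix.keys
  passes = ->(eval_name, label) { (matrix.dig(eval_name, label) || 0) >= min_score }

  # The cheapest candidate that passes every eval is the safe fallback.
  fallback = labels.find { |label| evals.all? { |e| passes.call(e, label) } }
  return [] unless fallback

  covered = Set.new
  chain = []
  labels.each do |label|
    break if label == fallback

    # Keep a cheaper candidate only if it passes an eval not yet covered.
    gained = evals.select { |e| passes.call(e, label) }.to_set - covered
    next if gained.empty?

    covered.merge(gained)
    chain << label
  end
  chain << fallback
end

layered = {
  easy: { "nano" => 1.0, "mini" => 1.0 },
  hard: { "nano" => 0.4, "mini" => 1.0 }
}
p constraining_eval(layered, %w[nano mini]) # => :hard
p suggest_chain(layered, %w[nano mini])     # => ["nano", "mini"]

disjoint = {
  e1: { "A" => 1.0, "B" => 0.5 },
  e2: { "A" => 0.5, "B" => 1.0 }
}
p suggest_chain(disjoint, %w[A B])          # => []
```

The disjoint case returns an empty chain because retry escalates on `validation_failed`/`parse_error` rather than on a low eval score, so the last model has to pass every eval on its own.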
@@ -31,3 +31,4 @@ require_relative "eval/prompt_diff"
  require_relative "eval/eval_history"
  require_relative "eval/recommendation"
  require_relative "eval/recommender"
+ require_relative "eval/retry_optimizer"
data/lib/ruby_llm/contract/rake_task.rb CHANGED
@@ -121,5 +121,88 @@ module RubyLLM
          defined?(::Rails) ? [:environment] : []
        end
      end
+
+     # Standalone task: runs all evals for one step across candidates,
+     # builds a score matrix, and suggests an optimal retry chain.
+     #
+     # Loaded automatically when `require "ruby_llm/contract/rake_task"`.
+     # Usage:
+     # rake ruby_llm_contract:optimize \
+     # STEP=MatchProblemsToPages \
+     # CANDIDATES=gpt-5-nano,gpt-5-mini@low,gpt-5-mini
+     class OptimizeRakeTask < ::Rake::TaskLib
+       def initialize
+         super()
+         define_task
+       end
+
+       private
+
+       def define_task
+         desc "Run all evals for STEP with CANDIDATES and suggest an optimal retry chain"
+         task(:"ruby_llm_contract:optimize" => task_prerequisites) do
+           require "ruby_llm/contract"
+           eval_dirs = ENV["EVAL_DIRS"].to_s.split(",").map(&:strip).reject(&:empty?)
+           RubyLLM::Contract.load_evals!(*eval_dirs)
+
+           step_name = ENV["STEP"].to_s.strip
+           abort("STEP is required, e.g. STEP=MatchProblemsToPages") if step_name.empty?
+           raw_candidates = ENV["CANDIDATES"].to_s.strip
+           abort("CANDIDATES is required, e.g. CANDIDATES=gpt-5-nano,gpt-5-mini@low,gpt-5-mini") if raw_candidates.empty?
+           min_score = ENV.fetch("MIN_SCORE", "0.95").to_f
+
+           host = RubyLLM::Contract.eval_hosts.find { |h| h.name == step_name }
+           unless host
+             available = RubyLLM::Contract.eval_hosts.filter_map(&:name).sort
+             abort "Unknown STEP=#{step_name}. Available: #{available.join(", ")}"
+           end
+
+           candidates = parse_candidates(raw_candidates)
+           context = build_context
+
+           result = host.optimize_retry_policy(
+             candidates: candidates,
+             context: context,
+             min_score: min_score
+           )
+
+           result.print_summary
+         end
+       end
+
173
+ def parse_candidates(raw)
174
+ entries = if raw.start_with?("[")
175
+ Array(JSON.parse(raw))
176
+ else
177
+ raw.split(",").map(&:strip).reject(&:empty?).map do |entry|
178
+ model, effort = entry.split("@", 2)
179
+ config = { model: model.strip }
180
+ config[:reasoning_effort] = effort.strip if effort && !effort.empty?
181
+ config
182
+ end
183
+ end
184
+
185
+ entries.map { |e| RubyLLM::Contract.normalize_candidate_config(e) }.uniq
186
+ end
187
+
188
+ def build_context
189
+ ctx = {}
190
+ provider = ENV["PROVIDER"].to_s.strip
191
+ # Only inject real adapter when LIVE=1 or PROVIDER is set — otherwise
192
+ # evals use sample_response (offline mode, zero API calls).
193
+ if ENV["LIVE"] == "1" || !provider.empty?
194
+ ctx[:adapter] = RubyLLM::Contract::Adapters::RubyLLM.new
195
+ ctx[:provider] = provider.downcase.to_sym unless provider.empty?
196
+ end
197
+ ctx
198
+ end
199
+
200
+ def task_prerequisites
201
+ defined?(::Rails) ? [:environment] : []
202
+ end
203
+ end
204
+
205
+ # Auto-register the optimize task when this file is loaded
206
+ OptimizeRakeTask.new
124
207
  end
125
208
  end
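The `CANDIDATES` parsing and the offline default in this task can be sketched as standalone Ruby. This is a simplified illustration under stated assumptions: `parse_candidates` here covers only the comma-separated `model@effort` form (the task also accepts a JSON array and passes each entry through `RubyLLM::Contract.normalize_candidate_config`), and `live_mode?` is a hypothetical helper name that takes the environment as a hash instead of reading `ENV`.

```ruby
# Hypothetical standalone helpers; simplified from the rake task above.

# "gpt-5-nano,gpt-5-mini@low" -> one config hash per entry,
# with :reasoning_effort present only when an @effort suffix is given.
def parse_candidates(raw)
  raw.split(",").map(&:strip).reject(&:empty?).map do |entry|
    model, effort = entry.split("@", 2)
    config = { model: model.strip }
    config[:reasoning_effort] = effort.strip if effort && !effort.strip.empty?
    config
  end.uniq
end

# Live API calls are opted into; with neither variable set, evals stay offline.
def live_mode?(env)
  env["LIVE"] == "1" || !env["PROVIDER"].to_s.strip.empty?
end

candidates = parse_candidates("gpt-5-nano, gpt-5-mini@low,gpt-5-mini")
candidates.each { |config| puts config.map { |k, v| "#{k}=#{v}" }.join(" ") }

p live_mode?({})                         # => false
p live_mode?({ "LIVE" => "1" })          # => true
p live_mode?({ "PROVIDER" => "ollama" }) # => true
```

Passing the environment in as a hash keeps the sketch testable without mutating `ENV`; the task itself reads `ENV` directly.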
@@ -59,6 +59,15 @@ module RubyLLM
    ).recommend
  end
 
+ def optimize_retry_policy(candidates:, context: {}, min_score: 0.95)
+   Eval::RetryOptimizer.new(
+     step: self,
+     candidates: candidates,
+     context: context,
+     min_score: min_score
+   ).call
+ end
+
  KNOWN_CONTEXT_KEYS = %i[adapter model temperature max_tokens provider assume_model_exists reasoning_effort].freeze
 
  include Concerns::ContextHelpers
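The copy-paste DSL that the returned result prints can be sketched as follows, assuming candidate labels look like `model` or `model (effort: level)`, which is the shape `parse_label_to_config` expects. `label_to_config` and `chain_to_dsl` are hypothetical standalone names, and the indentation inside the generated block is a choice of this sketch rather than the gem's exact output.

```ruby
# Hypothetical standalone helpers; they mirror the shape of
# parse_label_to_config and Result#to_dsl without depending on the gem.

# "gpt-5-mini (effort: low)" -> { model: "gpt-5-mini", reasoning_effort: "low" }
def label_to_config(label)
  match = label.match(/\A(.+?)\s*\(effort: (\w+)\)\z/)
  match ? { model: match[1], reasoning_effort: match[2] } : { model: label }
end

def chain_to_dsl(chain)
  return "# No viable chain" if chain.empty?

  if chain.all? { |config| config.keys == [:model] }
    # Model-only chains fit the short one-line form.
    "retry_policy models: %w[#{chain.map { |config| config[:model] }.join(" ")}]"
  else
    # Any per-attempt option (such as reasoning_effort) needs the block form.
    args = chain.map do |config|
      "{ " + config.map { |key, value| "#{key}: #{value.inspect}" }.join(", ") + " }"
    end
    "retry_policy do\n  escalate(\n    #{args.join(",\n    ")}\n  )\nend"
  end
end

labels = ["gpt-5-nano", "gpt-5-mini (effort: low)", "gpt-5-mini"]
puts chain_to_dsl(labels.map { |label| label_to_config(label) })
puts chain_to_dsl([{ model: "gpt-5-nano" }, { model: "gpt-5-mini" }])
```

The first call prints the block form with three `escalate` entries; the second prints `retry_policy models: %w[gpt-5-nano gpt-5-mini]`.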
@@ -2,6 +2,6 @@
 
  module RubyLLM
    module Contract
-     VERSION = "0.6.0"
+     VERSION = "0.6.2"
    end
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: ruby_llm-contract
  version: !ruby/object:Gem::Version
-   version: 0.6.0
+   version: 0.6.2
  platform: ruby
  authors:
  - Justyna
@@ -136,6 +136,7 @@ files:
  - lib/ruby_llm/contract/eval/report_presenter.rb
  - lib/ruby_llm/contract/eval/report_stats.rb
  - lib/ruby_llm/contract/eval/report_storage.rb
+ - lib/ruby_llm/contract/eval/retry_optimizer.rb
  - lib/ruby_llm/contract/eval/runner.rb
  - lib/ruby_llm/contract/eval/step_expectation_applier.rb
  - lib/ruby_llm/contract/eval/step_result_normalizer.rb