ruby_llm-contract 0.6.0 → 0.6.3
- checksums.yaml +4 -4
- data/CHANGELOG.md +33 -0
- data/Gemfile.lock +2 -2
- data/README.md +11 -0
- data/lib/ruby_llm/contract/concerns/eval_host.rb +12 -2
- data/lib/ruby_llm/contract/eval/aggregated_report.rb +92 -0
- data/lib/ruby_llm/contract/eval/retry_optimizer.rb +222 -0
- data/lib/ruby_llm/contract/eval.rb +2 -0
- data/lib/ruby_llm/contract/rake_task.rb +93 -0
- data/lib/ruby_llm/contract/step/base.rb +10 -0
- data/lib/ruby_llm/contract/version.rb +1 -1
- metadata +3 -1
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 75884386ae53ddf1985760afa2c94f64ce15274b0cb0829ea40e6711506e60cf
|
|
4
|
+
data.tar.gz: 203229c4ce8ab0b1ab9e6209871668b0713b114e412eba03e79e9964dc9a43bb
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 0c586ce70e71d8e77c262e0ae21bab2e29ef6ccdb16df4a0e56271aeb98efba59027618db24d4b59470fd3a9d340051201584038f8c67143f7ca1bec0e5e365b
|
|
7
|
+
data.tar.gz: 6a988c62f6b36a4da860b736c5523abcc568ed1384c1f20cd8ff9659124c12eeeddb36100d5b8542af07b363cef7ade8a1aca01250929ec2c23172d4f3faec5f
|
data/CHANGELOG.md
CHANGED
|
@@ -1,5 +1,38 @@
|
|
|
1
1
|
# Changelog
|
|
2
2
|
|
|
3
|
+
## 0.6.3 (2026-04-20)
|
|
4
|
+
|
|
5
|
+
### Features
|
|
6
|
+
|
|
7
|
+
- **`runs:` parameter on `compare_models` and `optimize_retry_policy`** — runs each candidate N times per eval and aggregates the mean score, mean cost per run, and mean latency. Reduces sampling variance in live mode where LLM outputs are non-deterministic (gpt-5 family enforces `temperature=1.0` server-side, so a single unlucky sample can misclassify a viable candidate as "failing"). Default `runs: 1` — backward compatible.
|
|
8
|
+
- **`RUNS=N` on `rake ruby_llm_contract:optimize`** — CLI flag for variance-aware optimization.
|
|
9
|
+
- **`Eval::AggregatedReport`** — duck-type `Report` exposing `score` (mean), `score_min`/`score_max` (spread), `total_cost` (mean per run), `pass_rate` (clean-pass count x/N), and `clean_passes`.
|
|
10
|
+
- **Guide: [Reducing variance with `runs:`](docs/guide/optimizing_retry_policy.md#reducing-variance-with-runs)** — when to use it and why.
|
|
11
|
+
|
|
12
|
+
## 0.6.2 (2026-04-18)
|
|
13
|
+
|
|
14
|
+
### Features
|
|
15
|
+
|
|
16
|
+
- **`Step.optimize_retry_policy`** — runs `compare_models` on ALL evals for the step, builds a score matrix, identifies the constraining eval, and suggests a retry chain. Chain's last model always passes all evals (safe fallback).
|
|
17
|
+
- **`rake ruby_llm_contract:optimize`** — one-command retry chain optimization. Prints score table, constraining eval, suggested chain, and copy-paste DSL.
|
|
18
|
+
- **Offline by default** — `optimize` uses `sample_response` (zero API calls) unless `LIVE=1` or `PROVIDER=` is set.
|
|
19
|
+
- **`EVAL_DIRS=` support** — non-Rails setups can specify eval file directories.
|
|
20
|
+
- **Guide: [Optimizing retry_policy](docs/guide/optimizing_retry_policy.md)** — full procedure with prerequisites, troubleshooting, and real-world example.
|
|
21
|
+
|
|
22
|
+
### Fixes
|
|
23
|
+
|
|
24
|
+
- Chain semantics aligned with `retry_executor` — retry fires on `validation_failed`/`parse_error`, not on low eval score. Disjoint eval coverage (A passes e1, B passes e2, neither passes both) correctly returns empty chain.
|
|
25
|
+
- Removed ActiveSupport dependency from rake task (`.presence` → `.empty?`).
|
|
26
|
+
- Added `require "set"` for non-Rails environments.
|
|
27
|
+
|
|
28
|
+
## 0.6.1 (2026-04-17)
|
|
29
|
+
|
|
30
|
+
### Features
|
|
31
|
+
|
|
32
|
+
- **Multi-provider operator tooling** — rake tasks support `PROVIDER=openai|anthropic|ollama`, `CANDIDATES=model@effort,...`, and `REASONING_EFFORT=low|medium|high`.
|
|
33
|
+
- **`rake ruby_llm_contract:recommend`** — wraps `Step.recommend` with CLI interface, prints best config, retry chain, DSL, rationale, and savings.
|
|
34
|
+
- **Ollama support** — `PROVIDER=ollama` with configurable `OLLAMA_API_BASE`.
|
|
35
|
+
|
|
3
36
|
## 0.6.0 (2026-04-12)
|
|
4
37
|
|
|
5
38
|
"What should I do?" — model + configuration recommendation.
|
data/Gemfile.lock
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
PATH
|
|
2
2
|
remote: .
|
|
3
3
|
specs:
|
|
4
|
-
ruby_llm-contract (0.6.
|
|
4
|
+
ruby_llm-contract (0.6.3)
|
|
5
5
|
dry-types (~> 1.7)
|
|
6
6
|
ruby_llm (~> 1.0)
|
|
7
7
|
ruby_llm-schema (~> 0.3)
|
|
@@ -258,7 +258,7 @@ CHECKSUMS
|
|
|
258
258
|
rubocop-ast (1.49.1) sha256=4412f3ee70f6fe4546cc489548e0f6fcf76cafcfa80fa03af67098ffed755035
|
|
259
259
|
ruby-progressbar (1.13.0) sha256=80fc9c47a9b640d6834e0dc7b3c94c9df37f08cb072b7761e4a71e22cff29b33
|
|
260
260
|
ruby_llm (1.14.0) sha256=57c6f7034fc4a44504ea137d70f853b07824f1c1cdbe774ab3ab3522e7098deb
|
|
261
|
-
ruby_llm-contract (0.6.
|
|
261
|
+
ruby_llm-contract (0.6.3)
|
|
262
262
|
ruby_llm-schema (0.3.0) sha256=a591edc5ca1b7f0304f0e2261de61ba4b3bea17be09f5cf7558153adfda3dec6
|
|
263
263
|
ruby_parser (3.22.0) sha256=1eb4937cd9eb220aa2d194e352a24dba90aef00751e24c8dfffdb14000f15d23
|
|
264
264
|
rubycritic (4.12.0) sha256=024fed90fe656fa939f6ea80aab17569699ac3863d0b52fd72cb99892247abc8
|
data/README.md
CHANGED
|
@@ -158,6 +158,8 @@ Cheapest at 100%: gpt-4.1-mini
|
|
|
158
158
|
|
|
159
159
|
Nano fails on edge cases. Mini and full both score 100% — but mini is **5x cheaper**. Now you know.
|
|
160
160
|
|
|
161
|
+
Running live against gpt-5 / o-series? Pass `runs: 3` to average out sampling variance (OpenAI forces `temperature=1.0` server-side, so one unlucky run can misclassify a viable candidate). See [Reducing variance with `runs:`](docs/guide/optimizing_retry_policy.md#reducing-variance-with-runs).
|
|
162
|
+
|
|
161
163
|
## Let the gem tell you what to do
|
|
162
164
|
|
|
163
165
|
Don't read tables — get a recommendation. Supports `model + reasoning_effort` combinations:
|
|
@@ -257,12 +259,21 @@ end
|
|
|
257
259
|
# bundle exec rake ruby_llm_contract:eval
|
|
258
260
|
```
|
|
259
261
|
|
|
262
|
+
## Full power: data-driven retry chains
|
|
263
|
+
|
|
264
|
+
The pieces above — evals, compare_models, recommend — combine into a workflow that replaces guesswork with measured optimization. You define evals for your step, run `recommend` against all of them, find the eval that actually needs the strongest model, and build a retry chain where each attempt is as cheap as the data allows.
|
|
265
|
+
|
|
266
|
+
The difference: instead of "gpt-5-mini seems to work, let's use it everywhere", you get "nano handles 4/6 scenarios, mini@low catches the 5th, full mini only fires on the hardest edge case — first attempt is 4× cheaper."
|
|
267
|
+
|
|
268
|
+
Full procedure with examples: **[Optimizing retry_policy](docs/guide/optimizing_retry_policy.md)**
|
|
269
|
+
|
|
260
270
|
## Docs
|
|
261
271
|
|
|
262
272
|
| Guide | |
|
|
263
273
|
|-------|-|
|
|
264
274
|
| [Getting Started](docs/guide/getting_started.md) | Features walkthrough, model escalation, eval |
|
|
265
275
|
| [Eval-First](docs/guide/eval_first.md) | Practical workflow for prompt engineering with datasets, baselines, and A/B gates |
|
|
276
|
+
| [Optimizing retry_policy](docs/guide/optimizing_retry_policy.md) | Find the cheapest retry chain that passes all your evals |
|
|
266
277
|
| [Best Practices](docs/guide/best_practices.md) | 6 patterns for bulletproof validates |
|
|
267
278
|
| [Output Schema](docs/guide/output_schema.md) | Full schema reference + constraints |
|
|
268
279
|
| [Pipeline](docs/guide/pipeline.md) | Multi-step composition, timeout, fail-fast |
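The cost argument in the README addition above (a cheap first attempt, with escalation only when validation fails) can be made concrete with a small expected-cost calculation. This is a sketch and not part of the gem: the per-call costs and pass probabilities are made-up numbers, and `expected_cost` is a hypothetical helper.

```ruby
# Hypothetical chain: each step has a made-up relative cost per call and a
# made-up probability that its output passes validation (so no retry fires).
CHAIN = [
  { model: "nano", cost: 1.0, pass_probability: 0.5 },
  { model: "mini", cost: 4.0, pass_probability: 1.0 }
].freeze

# Expected cost per request: an attempt is only paid for when every earlier
# attempt in the chain failed validation.
def expected_cost(chain)
  reach_probability = 1.0
  chain.sum do |step|
    cost = reach_probability * step[:cost]
    reach_probability *= (1.0 - step[:pass_probability])
    cost
  end
end

puts expected_cost(CHAIN)                 # chain: nano first, mini as fallback
puts expected_cost([CHAIN.last])          # baseline: always call mini
```

With these illustrative numbers the chain averages 3.0 cost units per request against 4.0 for always calling the fallback. Real savings depend on measured pass rates from your own evals.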
|
data/lib/ruby_llm/contract/concerns/eval_host.rb
CHANGED
|
@@ -70,9 +70,11 @@ module RubyLLM
|
|
|
70
70
|
Eval::PromptDiff.new(candidate: my_report, baseline: other_report)
|
|
71
71
|
end
|
|
72
72
|
|
|
73
|
-
def compare_models(eval_name, models: [], candidates: [], context: {})
|
|
73
|
+
def compare_models(eval_name, models: [], candidates: [], context: {}, runs: 1)
|
|
74
74
|
raise ArgumentError, "Pass either models: or candidates:, not both" if models.any? && candidates.any?
|
|
75
75
|
|
|
76
|
+
runs = coerce_runs(runs)
|
|
77
|
+
|
|
76
78
|
context = safe_context(context)
|
|
77
79
|
candidate_configs = normalize_candidates(models, candidates)
|
|
78
80
|
|
|
@@ -82,7 +84,8 @@ module RubyLLM
|
|
|
82
84
|
label = Eval::ModelComparison.candidate_label(config)
|
|
83
85
|
model_context = isolate_context(context).merge(model: config[:model])
|
|
84
86
|
model_context[:reasoning_effort] = config[:reasoning_effort] if config[:reasoning_effort]
|
|
85
|
-
|
|
87
|
+
per_run = Array.new(runs) { run_single_eval(eval_name, model_context) }
|
|
88
|
+
reports[label] = runs == 1 ? per_run.first : Eval::AggregatedReport.new(per_run)
|
|
86
89
|
configs[label] = config
|
|
87
90
|
end
|
|
88
91
|
|
|
@@ -91,6 +94,13 @@ module RubyLLM
|
|
|
91
94
|
|
|
92
95
|
private
|
|
93
96
|
|
|
97
|
+
def coerce_runs(runs)
|
|
98
|
+
raise ArgumentError, "runs must be an Integer >= 1, got #{runs.inspect}" unless runs.is_a?(Integer)
|
|
99
|
+
raise ArgumentError, "runs must be >= 1, got #{runs.inspect}" if runs < 1
|
|
100
|
+
|
|
101
|
+
runs
|
|
102
|
+
end
|
|
103
|
+
|
|
94
104
|
def normalize_candidates(models, candidates)
|
|
95
105
|
if candidates.any?
|
|
96
106
|
candidates.map { |c| RubyLLM::Contract.normalize_candidate_config(c) }.uniq
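The `coerce_runs` guard added in this hunk rejects anything that is not an Integer of at least 1. The snippet below is a standalone copy of that method for experimentation outside the gem; `try_runs` is a hypothetical helper added only to show the error messages without aborting.

```ruby
# Standalone copy of the validation shown in the diff.
def coerce_runs(runs)
  raise ArgumentError, "runs must be an Integer >= 1, got #{runs.inspect}" unless runs.is_a?(Integer)
  raise ArgumentError, "runs must be >= 1, got #{runs.inspect}" if runs < 1

  runs
end

# Hypothetical helper: returns the coerced value, or the error message
# when the input is rejected.
def try_runs(value)
  coerce_runs(value)
rescue ArgumentError => e
  e.message
end

[3, 0, "3", 2.0].each do |value|
  puts "#{value.inspect} => #{try_runs(value).inspect}"
end
```

Note that a numeric string such as `"3"` is rejected rather than converted, so callers reading from `ENV` need to parse the value first.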
|
data/lib/ruby_llm/contract/eval/aggregated_report.rb
ADDED
|
@@ -0,0 +1,92 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module RubyLLM
|
|
4
|
+
module Contract
|
|
5
|
+
module Eval
|
|
6
|
+
# Wraps N Reports from repeated runs of the same eval to reduce sampling
|
|
7
|
+
# variance in live mode (temperature=1 on gpt-5 family). Exposes the same
|
|
8
|
+
# duck-type as Report — mean score, mean cost per run, mean latency.
|
|
9
|
+
#
|
|
10
|
+
# pass_rate reports how many runs passed cleanly (x/N), not case-level
|
|
11
|
+
# pass rate, since the question is "does this candidate reliably pass?".
|
|
12
|
+
class AggregatedReport
|
|
13
|
+
attr_reader :runs, :results
|
|
14
|
+
|
|
15
|
+
def initialize(runs)
|
|
16
|
+
raise ArgumentError, "runs must not be empty" if runs.empty?
|
|
17
|
+
|
|
18
|
+
@runs = runs.freeze
|
|
19
|
+
@results = runs.flat_map(&:results).freeze
|
|
20
|
+
freeze
|
|
21
|
+
end
|
|
22
|
+
|
|
23
|
+
def dataset_name
|
|
24
|
+
@runs.first.dataset_name
|
|
25
|
+
end
|
|
26
|
+
|
|
27
|
+
def step_name
|
|
28
|
+
@runs.first.step_name
|
|
29
|
+
end
|
|
30
|
+
|
|
31
|
+
def score
|
|
32
|
+
@runs.sum(&:score) / @runs.length.to_f
|
|
33
|
+
end
|
|
34
|
+
|
|
35
|
+
def score_min
|
|
36
|
+
@runs.map(&:score).min
|
|
37
|
+
end
|
|
38
|
+
|
|
39
|
+
def score_max
|
|
40
|
+
@runs.map(&:score).max
|
|
41
|
+
end
|
|
42
|
+
|
|
43
|
+
def total_cost
|
|
44
|
+
@runs.sum(&:total_cost) / @runs.length.to_f
|
|
45
|
+
end
|
|
46
|
+
|
|
47
|
+
def avg_latency_ms
|
|
48
|
+
latencies = @runs.filter_map(&:avg_latency_ms)
|
|
49
|
+
return nil if latencies.empty?
|
|
50
|
+
|
|
51
|
+
latencies.sum / latencies.length.to_f
|
|
52
|
+
end
|
|
53
|
+
|
|
54
|
+
def pass_rate
|
|
55
|
+
"#{clean_passes}/#{@runs.length}"
|
|
56
|
+
end
|
|
57
|
+
|
|
58
|
+
def pass_rate_ratio
|
|
59
|
+
clean_passes.to_f / @runs.length
|
|
60
|
+
end
|
|
61
|
+
|
|
62
|
+
def each(&block)
|
|
63
|
+
@results.each(&block)
|
|
64
|
+
end
|
|
65
|
+
|
|
66
|
+
def summary
|
|
67
|
+
@runs.first.summary
|
|
68
|
+
end
|
|
69
|
+
|
|
70
|
+
def to_s
|
|
71
|
+
@runs.first.to_s
|
|
72
|
+
end
|
|
73
|
+
|
|
74
|
+
def print_summary(io = $stdout)
|
|
75
|
+
@runs.first.print_summary(io)
|
|
76
|
+
end
|
|
77
|
+
|
|
78
|
+
def passed?
|
|
79
|
+
@runs.all?(&:passed?)
|
|
80
|
+
end
|
|
81
|
+
|
|
82
|
+
def clean_passes
|
|
83
|
+
@runs.count(&:passed?)
|
|
84
|
+
end
|
|
85
|
+
|
|
86
|
+
def failures
|
|
87
|
+
@runs.flat_map(&:failures)
|
|
88
|
+
end
|
|
89
|
+
end
|
|
90
|
+
end
|
|
91
|
+
end
|
|
92
|
+
end
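The arithmetic in `AggregatedReport` can be tried without the gem. The sketch below uses a hypothetical `FakeReport` struct in place of the real `Report` class and a trimmed `MiniAggregate` that mirrors only the mean, spread, and clean-pass calculations. The scores and costs are invented for illustration.

```ruby
# Hypothetical stand-in for a single eval report; the real Report has more fields.
FakeReport = Struct.new(:score, :total_cost, :passed) do
  def passed?
    passed
  end
end

# Trimmed-down aggregate mirroring the calculations in the diff above.
class MiniAggregate
  def initialize(runs)
    raise ArgumentError, "runs must not be empty" if runs.empty?

    @runs = runs
  end

  # Mean score across runs.
  def score
    @runs.sum(&:score) / @runs.length.to_f
  end

  # Lowest and highest single-run score (the spread).
  def score_range
    @runs.map(&:score).minmax
  end

  # Mean cost per run, not the sum over all runs.
  def total_cost
    @runs.sum(&:total_cost) / @runs.length.to_f
  end

  # Run-level pass rate: how many whole runs passed cleanly.
  def pass_rate
    "#{@runs.count(&:passed?)}/#{@runs.length}"
  end
end

runs = [
  FakeReport.new(1.0, 0.5, true),
  FakeReport.new(0.5, 0.25, false),
  FakeReport.new(1.0, 0.5, true),
  FakeReport.new(0.5, 0.75, false)
]
aggregate = MiniAggregate.new(runs)
puts "score=#{aggregate.score} range=#{aggregate.score_range.inspect}"
puts "cost=#{aggregate.total_cost} pass_rate=#{aggregate.pass_rate}"
```

A single run of this candidate would have reported either 1.0 or 0.5. The aggregate reports the mean together with the spread and the number of clean passes, which is the information a single sample hides.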
|
data/lib/ruby_llm/contract/eval/retry_optimizer.rb
ADDED
|
@@ -0,0 +1,222 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
require "set"
|
|
4
|
+
|
|
5
|
+
module RubyLLM
|
|
6
|
+
module Contract
|
|
7
|
+
module Eval
|
|
8
|
+
# Runs compare_models on ALL evals for a step, builds a score matrix,
|
|
9
|
+
# identifies the constraining eval, and suggests an escalation chain.
|
|
10
|
+
#
|
|
11
|
+
# optimizer = RetryOptimizer.new(step: MyStep, candidates: [...], context: {})
|
|
12
|
+
# result = optimizer.call
|
|
13
|
+
# result.print_summary
|
|
14
|
+
# result.to_dsl # => copy-paste retry_policy
|
|
15
|
+
class RetryOptimizer
|
|
16
|
+
Result = Struct.new(:step_name, :eval_names, :candidate_labels, :score_matrix,
|
|
17
|
+
:constraining_eval, :chain, :chain_details, keyword_init: true) do
|
|
18
|
+
def print_summary(io = $stdout)
|
|
19
|
+
io.puts "#{step_name} — retry chain optimization"
|
|
20
|
+
io.puts
|
|
21
|
+
print_table(io)
|
|
22
|
+
io.puts
|
|
23
|
+
print_chain(io)
|
|
24
|
+
io.puts
|
|
25
|
+
print_dsl(io)
|
|
26
|
+
end
|
|
27
|
+
|
|
28
|
+
def to_dsl
|
|
29
|
+
return "# No viable chain — no candidate passes all evals" if chain.empty?
|
|
30
|
+
|
|
31
|
+
if chain.all? { |c| c.keys == [:model] }
|
|
32
|
+
models_str = chain.map { |c| c[:model] }.join(" ")
|
|
33
|
+
"retry_policy models: %w[#{models_str}]"
|
|
34
|
+
else
|
|
35
|
+
args = chain.map { |c| config_to_ruby(c) }.join(",\n ")
|
|
36
|
+
"retry_policy do\n escalate(\n #{args}\n )\nend"
|
|
37
|
+
end
|
|
38
|
+
end
|
|
39
|
+
|
|
40
|
+
private
|
|
41
|
+
|
|
42
|
+
def print_table(io)
|
|
43
|
+
short_labels = candidate_labels.map { |l| short_candidate_label(l) }
|
|
44
|
+
col_width = [short_labels.map(&:length).max || 0, 8].max
|
|
45
|
+
eval_width = [eval_names.map { |e| e.to_s.length }.max || 0, 12].max
|
|
46
|
+
|
|
47
|
+
header = format(" %-#{eval_width}s", "eval") + short_labels.map { |l| format(" %#{col_width}s", l) }.join
|
|
48
|
+
io.puts header
|
|
49
|
+
io.puts " #{"-" * (eval_width + (col_width + 2) * short_labels.size)}"
|
|
50
|
+
|
|
51
|
+
eval_names.each do |eval_name|
|
|
52
|
+
row = format(" %-#{eval_width}s", eval_name.to_s)
|
|
53
|
+
candidate_labels.each do |label|
|
|
54
|
+
score = score_matrix.dig(eval_name, label) || 0.0
|
|
55
|
+
marker = eval_name == constraining_eval && score < 1.0 ? " ←" : " "
|
|
56
|
+
row += format(" %#{col_width - 2}.2f%s", score, marker)
|
|
57
|
+
end
|
|
58
|
+
io.puts row
|
|
59
|
+
end
|
|
60
|
+
|
|
61
|
+
io.puts
|
|
62
|
+
io.puts " Constraining eval: #{constraining_eval}" if constraining_eval
|
|
63
|
+
end
|
|
64
|
+
|
|
65
|
+
def print_chain(io)
|
|
66
|
+
if chain.empty?
|
|
67
|
+
io.puts " No viable chain."
|
|
68
|
+
return
|
|
69
|
+
end
|
|
70
|
+
|
|
71
|
+
io.puts " Suggested chain:"
|
|
72
|
+
chain_details.each_with_index do |detail, i|
|
|
73
|
+
suffix = i == chain_details.size - 1 ? "passes all #{eval_names.size} evals" : "covers #{detail[:passes]} eval(s)"
|
|
74
|
+
io.puts " #{detail[:label]} — #{suffix}"
|
|
75
|
+
end
|
|
76
|
+
end
|
|
77
|
+
|
|
78
|
+
def short_candidate_label(label)
|
|
79
|
+
label
|
|
80
|
+
.sub("gpt-5-", "")
|
|
81
|
+
.sub("gpt-4.1", "4.1")
|
|
82
|
+
.sub(" (effort: ", "@")
|
|
83
|
+
.sub(")", "")
|
|
84
|
+
end
|
|
85
|
+
|
|
86
|
+
def print_dsl(io)
|
|
87
|
+
io.puts " DSL:"
|
|
88
|
+
to_dsl.each_line { |line| io.puts " #{line}" }
|
|
89
|
+
end
|
|
90
|
+
|
|
91
|
+
def config_to_ruby(config)
|
|
92
|
+
pairs = config.map { |k, v| "#{k}: #{v.inspect}" }.join(", ")
|
|
93
|
+
"{ #{pairs} }"
|
|
94
|
+
end
|
|
95
|
+
end
|
|
96
|
+
|
|
97
|
+
def initialize(step:, candidates:, context: {}, min_score: 0.95, runs: 1)
|
|
98
|
+
@step = step
|
|
99
|
+
@candidates = candidates
|
|
100
|
+
@context = context
|
|
101
|
+
@min_score = min_score
|
|
102
|
+
@runs = runs
|
|
103
|
+
end
|
|
104
|
+
|
|
105
|
+
def call
|
|
106
|
+
evals = @step.eval_names
|
|
107
|
+
return empty_result(evals) if evals.empty?
|
|
108
|
+
|
|
109
|
+
score_matrix = {}
|
|
110
|
+
evals.each do |eval_name|
|
|
111
|
+
comparison = with_retry_disabled do
|
|
112
|
+
@step.compare_models(eval_name, candidates: @candidates, context: @context, runs: @runs)
|
|
113
|
+
end
|
|
114
|
+
score_matrix[eval_name] = extract_scores(comparison)
|
|
115
|
+
end
|
|
116
|
+
|
|
117
|
+
labels = score_matrix.values.flat_map(&:keys).uniq
|
|
118
|
+
constraining = find_constraining_eval(score_matrix, labels)
|
|
119
|
+
chain, details = build_chain(score_matrix, labels, evals)
|
|
120
|
+
|
|
121
|
+
Result.new(
|
|
122
|
+
step_name: @step.name || @step.to_s,
|
|
123
|
+
eval_names: evals,
|
|
124
|
+
candidate_labels: labels,
|
|
125
|
+
score_matrix: score_matrix,
|
|
126
|
+
constraining_eval: constraining,
|
|
127
|
+
chain: chain,
|
|
128
|
+
chain_details: details
|
|
129
|
+
)
|
|
130
|
+
end
|
|
131
|
+
|
|
132
|
+
private
|
|
133
|
+
|
|
134
|
+
def extract_scores(comparison)
|
|
135
|
+
comparison.reports.transform_values(&:score)
|
|
136
|
+
end
|
|
137
|
+
|
|
138
|
+
def find_constraining_eval(matrix, labels)
|
|
139
|
+
matrix.max_by do |_eval_name, scores|
|
|
140
|
+
cheapest_passing = labels.find { |l| (scores[l] || 0) >= @min_score }
|
|
141
|
+
cheapest_passing ? labels.index(cheapest_passing) : labels.size
|
|
142
|
+
end&.first
|
|
143
|
+
end
|
|
144
|
+
|
|
145
|
+
# Retry escalates on validation_failed/parse_error, NOT on low eval
|
|
146
|
+
# score. A model that returns :ok with semantically wrong output won't
|
|
147
|
+
# trigger retry. Therefore the LAST model in the chain must pass ALL
|
|
148
|
+
# evals — it's the safety net. Cheaper models are prepended as
|
|
149
|
+
# first-try optimization (they handle easy inputs cheaply; when they
|
|
150
|
+
# fail validation, retry escalates to the safe fallback).
|
|
151
|
+
#
|
|
152
|
+
# Known limitation: intermediate models are assumed safe if their eval
|
|
153
|
+
# failures correspond to validation failures (retryable). If an
|
|
154
|
+
# intermediate model returns :ok with semantically wrong output on
|
|
155
|
+
# some eval, retry won't fire and the safe fallback won't run. This
|
|
156
|
+
# requires step validates to cover the same semantics as eval verify
|
|
157
|
+
# checks. A future version could inspect per-case step_status from
|
|
158
|
+
# compare_models to verify failures are actually retryable.
|
|
159
|
+
def build_chain(matrix, labels, evals)
|
|
160
|
+
total = evals.size
|
|
161
|
+
|
|
162
|
+
# Find cheapest model that passes every eval — the safe fallback.
|
|
163
|
+
safe_fallback = labels.find { |l| evals.all? { |e| (matrix.dig(e, l) || 0) >= @min_score } }
|
|
164
|
+
return [[], []] unless safe_fallback
|
|
165
|
+
|
|
166
|
+
# Prepend cheaper models that pass a strict subset.
|
|
167
|
+
chain = []
|
|
168
|
+
details = []
|
|
169
|
+
covered_evals = Set.new
|
|
170
|
+
|
|
171
|
+
labels.each do |label|
|
|
172
|
+
break if label == safe_fallback
|
|
173
|
+
|
|
174
|
+
newly_covered = evals.select { |e| (matrix.dig(e, label) || 0) >= @min_score }
|
|
175
|
+
new_additions = newly_covered.to_set - covered_evals
|
|
176
|
+
next if new_additions.empty?
|
|
177
|
+
|
|
178
|
+
covered_evals.merge(new_additions)
|
|
179
|
+
chain << parse_label_to_config(label)
|
|
180
|
+
details << { label: label, passes: new_additions.size, cost: label }
|
|
181
|
+
end
|
|
182
|
+
|
|
183
|
+
# Always end with the safe fallback.
|
|
184
|
+
chain << parse_label_to_config(safe_fallback)
|
|
185
|
+
details << { label: safe_fallback, passes: total, cost: safe_fallback }
|
|
186
|
+
|
|
187
|
+
[chain, details]
|
|
188
|
+
end
|
|
189
|
+
|
|
190
|
+
def parse_label_to_config(label)
|
|
191
|
+
if label.match?(/\(effort: (\w+)\)/)
|
|
192
|
+
model = label.sub(/\s*\(effort:.*/, "").strip
|
|
193
|
+
effort = label.match(/\(effort: (\w+)\)/)[1]
|
|
194
|
+
{ model: model, reasoning_effort: effort }
|
|
195
|
+
else
|
|
196
|
+
{ model: label }
|
|
197
|
+
end
|
|
198
|
+
end
|
|
199
|
+
|
|
200
|
+
def with_retry_disabled(&block)
|
|
201
|
+
original = @step.retry_policy if @step.respond_to?(:retry_policy)
|
|
202
|
+
@step.define_singleton_method(:retry_policy) { nil }
|
|
203
|
+
block.call
|
|
204
|
+
ensure
|
|
205
|
+
@step.define_singleton_method(:retry_policy) { original }
|
|
206
|
+
end
|
|
207
|
+
|
|
208
|
+
def empty_result(evals)
|
|
209
|
+
Result.new(
|
|
210
|
+
step_name: @step.name || @step.to_s,
|
|
211
|
+
eval_names: evals,
|
|
212
|
+
candidate_labels: [],
|
|
213
|
+
score_matrix: {},
|
|
214
|
+
constraining_eval: nil,
|
|
215
|
+
chain: [],
|
|
216
|
+
chain_details: []
|
|
217
|
+
)
|
|
218
|
+
end
|
|
219
|
+
end
|
|
220
|
+
end
|
|
221
|
+
end
|
|
222
|
+
end
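The chain-building rule in `build_chain` (find the cheapest candidate that passes every eval, then prepend cheaper candidates that cover at least one new eval) can be exercised on a hand-written score matrix. This is a simplified sketch under stated assumptions: it returns labels instead of config hashes, uses a fixed `MIN_SCORE`, and the candidate labels and scores are hypothetical. Labels must be ordered cheapest first.

```ruby
require "set"

MIN_SCORE = 0.95

# matrix is { eval_name => { label => score } }; labels are ordered cheapest first.
def build_chain(matrix, labels, evals)
  passes = ->(eval_name, label) { (matrix.dig(eval_name, label) || 0) >= MIN_SCORE }

  # The last model in the chain must pass every eval on its own.
  safe_fallback = labels.find { |label| evals.all? { |e| passes.call(e, label) } }
  return [] unless safe_fallback

  chain = []
  covered = Set.new
  labels.each do |label|
    break if label == safe_fallback

    # Keep a cheaper model only if it covers an eval nobody before it covered.
    new_additions = evals.select { |e| passes.call(e, label) }.to_set - covered
    next if new_additions.empty?

    covered.merge(new_additions)
    chain << label
  end

  chain << safe_fallback
end

evals = %i[easy hard]
labels = %w[nano mini full]

layered = {
  easy: { "nano" => 1.0, "mini" => 1.0, "full" => 1.0 },
  hard: { "nano" => 0.4, "mini" => 0.7, "full" => 1.0 }
}
disjoint = {
  easy: { "nano" => 1.0, "mini" => 0.2, "full" => 0.5 },
  hard: { "nano" => 0.3, "mini" => 1.0, "full" => 0.5 }
}

p build_chain(layered, labels, evals)
p build_chain(disjoint, labels, evals)
```

In the `layered` matrix, `mini` is skipped because it adds no coverage beyond `nano`, so the chain is `nano` followed by `full`. In the `disjoint` matrix no single candidate passes both evals, so there is no safe fallback and the chain is empty, matching the behavior described in the 0.6.2 fixes.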
|
data/lib/ruby_llm/contract/eval.rb
CHANGED
|
@@ -21,6 +21,7 @@ require_relative "eval/report_stats"
|
|
|
21
21
|
require_relative "eval/report_presenter"
|
|
22
22
|
require_relative "eval/report_storage"
|
|
23
23
|
require_relative "eval/report"
|
|
24
|
+
require_relative "eval/aggregated_report"
|
|
24
25
|
require_relative "eval/eval_definition"
|
|
25
26
|
require_relative "eval/model_comparison"
|
|
26
27
|
require_relative "eval/baseline_diff"
|
|
@@ -31,3 +32,4 @@ require_relative "eval/prompt_diff"
|
|
|
31
32
|
require_relative "eval/eval_history"
|
|
32
33
|
require_relative "eval/recommendation"
|
|
33
34
|
require_relative "eval/recommender"
|
|
35
|
+
require_relative "eval/retry_optimizer"
|
data/lib/ruby_llm/contract/rake_task.rb
CHANGED
|
@@ -121,5 +121,98 @@ module RubyLLM
|
|
|
121
121
|
defined?(::Rails) ? [:environment] : []
|
|
122
122
|
end
|
|
123
123
|
end
|
|
124
|
+
|
|
125
|
+
# Standalone task: runs all evals for one step across candidates,
|
|
126
|
+
# builds a score matrix, and suggests an optimal retry chain.
|
|
127
|
+
#
|
|
128
|
+
# Loaded automatically when `require "ruby_llm/contract/rake_task"`.
|
|
129
|
+
# Usage:
|
|
130
|
+
# rake ruby_llm_contract:optimize \
|
|
131
|
+
# STEP=MatchProblemsToPages \
|
|
132
|
+
# CANDIDATES=gpt-5-nano,gpt-5-mini@low,gpt-5-mini
|
|
133
|
+
class OptimizeRakeTask < ::Rake::TaskLib
|
|
134
|
+
def initialize
|
|
135
|
+
super()
|
|
136
|
+
define_task
|
|
137
|
+
end
|
|
138
|
+
|
|
139
|
+
private
|
|
140
|
+
|
|
141
|
+
def define_task
|
|
142
|
+
desc "Run all evals for STEP with CANDIDATES and suggest an optimal retry chain"
|
|
143
|
+
task(:"ruby_llm_contract:optimize" => task_prerequisites) do
|
|
144
|
+
require "ruby_llm/contract"
|
|
145
|
+
eval_dirs = ENV["EVAL_DIRS"].to_s.split(",").map(&:strip).reject(&:empty?)
|
|
146
|
+
RubyLLM::Contract.load_evals!(*eval_dirs)
|
|
147
|
+
|
|
148
|
+
step_name = ENV["STEP"].to_s.strip
|
|
149
|
+
abort("STEP is required, e.g. STEP=MatchProblemsToPages") if step_name.empty?
|
|
150
|
+
raw_candidates = ENV["CANDIDATES"].to_s.strip
|
|
151
|
+
abort("CANDIDATES is required, e.g. CANDIDATES=gpt-5-nano,gpt-5-mini@low,gpt-5-mini") if raw_candidates.empty?
|
|
152
|
+
min_score = ENV.fetch("MIN_SCORE", "0.95").to_f
|
|
153
|
+
runs = parse_runs(ENV.fetch("RUNS", "1"))
|
|
154
|
+
|
|
155
|
+
host = RubyLLM::Contract.eval_hosts.find { |h| h.name == step_name }
|
|
156
|
+
unless host
|
|
157
|
+
available = RubyLLM::Contract.eval_hosts.filter_map(&:name).sort
|
|
158
|
+
abort "Unknown STEP=#{step_name}. Available: #{available.join(", ")}"
|
|
159
|
+
end
|
|
160
|
+
|
|
161
|
+
candidates = parse_candidates(raw_candidates)
|
|
162
|
+
context = build_context
|
|
163
|
+
|
|
164
|
+
result = host.optimize_retry_policy(
|
|
165
|
+
candidates: candidates,
|
|
166
|
+
context: context,
|
|
167
|
+
min_score: min_score,
|
|
168
|
+
runs: runs
|
|
169
|
+
)
|
|
170
|
+
|
|
171
|
+
result.print_summary
|
|
172
|
+
end
|
|
173
|
+
end
|
|
174
|
+
|
|
175
|
+
def parse_runs(raw)
|
|
176
|
+
runs = Integer(raw.to_s.strip, 10)
|
|
177
|
+
abort("RUNS must be an integer >= 1, e.g. RUNS=1") if runs < 1
|
|
178
|
+
runs
|
|
179
|
+
rescue ArgumentError
|
|
180
|
+
abort("RUNS must be an integer >= 1, e.g. RUNS=1")
|
|
181
|
+
end
|
|
182
|
+
|
|
183
|
+
def parse_candidates(raw)
|
|
184
|
+
entries = if raw.start_with?("[")
|
|
185
|
+
Array(JSON.parse(raw))
|
|
186
|
+
else
|
|
187
|
+
raw.split(",").map(&:strip).reject(&:empty?).map do |entry|
|
|
188
|
+
model, effort = entry.split("@", 2)
|
|
189
|
+
config = { model: model.strip }
|
|
190
|
+
config[:reasoning_effort] = effort.strip if effort && !effort.empty?
|
|
191
|
+
config
|
|
192
|
+
end
|
|
193
|
+
end
|
|
194
|
+
|
|
195
|
+
entries.map { |e| RubyLLM::Contract.normalize_candidate_config(e) }.uniq
|
|
196
|
+
end
|
|
197
|
+
|
|
198
|
+
def build_context
|
|
199
|
+
ctx = {}
|
|
200
|
+
provider = ENV["PROVIDER"].to_s.strip
|
|
201
|
+
# Only inject real adapter when LIVE=1 or PROVIDER is set — otherwise
|
|
202
|
+
# evals use sample_response (offline mode, zero API calls).
|
|
203
|
+
if ENV["LIVE"] == "1" || !provider.empty?
|
|
204
|
+
ctx[:adapter] = RubyLLM::Contract::Adapters::RubyLLM.new
|
|
205
|
+
ctx[:provider] = provider.downcase.to_sym unless provider.empty?
|
|
206
|
+
end
|
|
207
|
+
ctx
|
|
208
|
+
end
|
|
209
|
+
|
|
210
|
+
def task_prerequisites
|
|
211
|
+
defined?(::Rails) ? [:environment] : []
|
|
212
|
+
end
|
|
213
|
+
end
|
|
214
|
+
|
|
215
|
+
# Auto-register the optimize task when this file is loaded
|
|
216
|
+
OptimizeRakeTask.new
|
|
124
217
|
end
|
|
125
218
|
end
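The `CANDIDATES` and `RUNS` parsing in the task above can be tried in isolation. The sketch below reproduces only the comma-separated branch of `parse_candidates` and the strict integer parse. It assumes no gem is loaded, so it omits the JSON branch and `normalize_candidate_config`, and it raises instead of calling `abort`. The name `parse_candidate_list` is hypothetical.

```ruby
# Splits "model@effort" entries into config hashes, as the non-JSON branch does.
def parse_candidate_list(raw)
  raw.split(",").map(&:strip).reject(&:empty?).map do |entry|
    model, effort = entry.split("@", 2)
    config = { model: model.strip }
    config[:reasoning_effort] = effort.strip if effort && !effort.empty?
    config
  end
end

# Strict base-10 parse: Integer("abc", 10) raises ArgumentError,
# whereas "abc".to_i would silently return 0.
def parse_runs(raw)
  runs = Integer(raw.to_s.strip, 10)
  raise ArgumentError, "RUNS must be an integer >= 1" if runs < 1

  runs
end

p parse_candidate_list("gpt-5-nano, gpt-5-mini@low,,gpt-5-mini")
p parse_runs(" 3 ")
```

Empty entries between commas are dropped, and a candidate without an `@effort` suffix produces a config with only `:model`.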
|
data/lib/ruby_llm/contract/step/base.rb
CHANGED
|
@@ -59,6 +59,16 @@ module RubyLLM
|
|
|
59
59
|
).recommend
|
|
60
60
|
end
|
|
61
61
|
|
|
62
|
+
def optimize_retry_policy(candidates:, context: {}, min_score: 0.95, runs: 1)
|
|
63
|
+
Eval::RetryOptimizer.new(
|
|
64
|
+
step: self,
|
|
65
|
+
candidates: candidates,
|
|
66
|
+
context: context,
|
|
67
|
+
min_score: min_score,
|
|
68
|
+
runs: runs
|
|
69
|
+
).call
|
|
70
|
+
end
|
|
71
|
+
|
|
62
72
|
KNOWN_CONTEXT_KEYS = %i[adapter model temperature max_tokens provider assume_model_exists reasoning_effort].freeze
|
|
63
73
|
|
|
64
74
|
include Concerns::ContextHelpers
|
metadata
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: ruby_llm-contract
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 0.6.
|
|
4
|
+
version: 0.6.3
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Justyna
|
|
@@ -109,6 +109,7 @@ files:
|
|
|
109
109
|
- lib/ruby_llm/contract/dsl.rb
|
|
110
110
|
- lib/ruby_llm/contract/errors.rb
|
|
111
111
|
- lib/ruby_llm/contract/eval.rb
|
|
112
|
+
- lib/ruby_llm/contract/eval/aggregated_report.rb
|
|
112
113
|
- lib/ruby_llm/contract/eval/baseline_diff.rb
|
|
113
114
|
- lib/ruby_llm/contract/eval/case_executor.rb
|
|
114
115
|
- lib/ruby_llm/contract/eval/case_result.rb
|
|
@@ -136,6 +137,7 @@ files:
|
|
|
136
137
|
- lib/ruby_llm/contract/eval/report_presenter.rb
|
|
137
138
|
- lib/ruby_llm/contract/eval/report_stats.rb
|
|
138
139
|
- lib/ruby_llm/contract/eval/report_storage.rb
|
|
140
|
+
- lib/ruby_llm/contract/eval/retry_optimizer.rb
|
|
139
141
|
- lib/ruby_llm/contract/eval/runner.rb
|
|
140
142
|
- lib/ruby_llm/contract/eval/step_expectation_applier.rb
|
|
141
143
|
- lib/ruby_llm/contract/eval/step_result_normalizer.rb
|