ruby_llm-contract 0.6.3 → 0.6.4
- checksums.yaml +4 -4
- data/.rubocop.yml +6 -3
- data/CHANGELOG.md +19 -0
- data/Gemfile.lock +2 -2
- data/README.md +2 -0
- data/lib/ruby_llm/contract/concerns/eval_host.rb +26 -4
- data/lib/ruby_llm/contract/concerns/production_mode_context.rb +35 -0
- data/lib/ruby_llm/contract/eval/aggregated_report.rb +43 -0
- data/lib/ruby_llm/contract/eval/case_result.rb +5 -3
- data/lib/ruby_llm/contract/eval/case_result_builder.rb +8 -1
- data/lib/ruby_llm/contract/eval/model_comparison.rb +80 -3
- data/lib/ruby_llm/contract/eval/report.rb +3 -1
- data/lib/ruby_llm/contract/eval/report_stats.rb +63 -0
- data/lib/ruby_llm/contract/eval/retry_optimizer.rb +4 -2
- data/lib/ruby_llm/contract/step/base.rb +7 -4
- data/lib/ruby_llm/contract/version.rb +1 -1
- data/lib/ruby_llm/contract.rb +1 -0
- metadata +2 -2
- data/ruby_llm-contract.gemspec +0 -35
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 655ecfa5adebe03245cb4ee910752ec866b83844b665018277dc653a3e246a63
|
|
4
|
+
data.tar.gz: fafcdacea3943703fad581f9f1806fe25678fd5da7b01867ecb5aac353da139f
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: c4821220d898468f55ab7085d25f6370819501cdfcb5ffea21a9dbdaaa9348997c1b9e6acbad8eb9ff6328eb531eda17e05bd09b1fcd047429fcd69b820a1819
|
|
7
|
+
data.tar.gz: 91fb14f2aa6e13e70d29f0222dc160548c9c50e9bf4bd9f27e9042bd0f6a4fae1a557b762747ee066c3444ee90942ea008c048b4bb0efc364d212329726d3888
|
data/.rubocop.yml
CHANGED
|
@@ -42,14 +42,17 @@ AllCops:
|
|
|
42
42
|
- 'internal/**/*'
|
|
43
43
|
|
|
44
44
|
Metrics/ClassLength:
|
|
45
|
-
Max:
|
|
45
|
+
Max: 140
|
|
46
|
+
|
|
47
|
+
Metrics/ModuleLength:
|
|
48
|
+
Max: 150
|
|
46
49
|
|
|
47
50
|
Metrics/AbcSize:
|
|
48
51
|
Max: 30
|
|
49
52
|
|
|
50
53
|
Metrics/ParameterLists:
|
|
51
|
-
Max:
|
|
52
|
-
MaxOptionalParameters:
|
|
54
|
+
Max: 12
|
|
55
|
+
MaxOptionalParameters: 10
|
|
53
56
|
|
|
54
57
|
Style/FormatStringToken:
|
|
55
58
|
Enabled: false
|
data/CHANGELOG.md
CHANGED
|
@@ -1,5 +1,24 @@
|
|
|
1
1
|
# Changelog
|
|
2
2
|
|
|
3
|
+
## 0.6.4 (2026-04-20)
|
|
4
|
+
|
|
5
|
+
### Features
|
|
6
|
+
|
|
7
|
+
- **`production_mode:` on `compare_models` and `optimize_retry_policy`** — measures retry-aware, end-to-end cost per successful output. Pass `production_mode: { fallback: "gpt-5-mini" }` and each candidate runs with a runtime-injected `[candidate, fallback]` retry chain. The report exposes `escalation_rate`, `single_shot_cost`, and `effective_cost` so "the cheaper candidate" decision matches production cost rather than first-attempt cost.
|
|
8
|
+
- **New Report metrics** — `escalation_rate`, `single_shot_cost`, `effective_cost`, `single_shot_latency_ms`, `effective_latency_ms`, `latency_percentiles` (p50/p95/max). `AggregatedReport` averages all of them across `runs:`.
|
|
9
|
+
- **Extended `ModelComparison#table`** — when `production_mode:` is set, renders a `Chain` column (`candidate → fallback`) with `single-shot`, `escalation`, `effective cost`, `latency`, `score`. Edge case `candidate == fallback` renders as a single model and `—` in the escalation column, with retry injection skipped entirely so `effective == single-shot` by construction, not by coincidence.
|
|
10
|
+
- **`context[:retry_policy_override]`** — new context key that nullifies or replaces class-level `retry_policy` for a single call. Used internally by production-mode injection; safe to use directly when you need a transient override that doesn't mutate the step class.
|
|
11
|
+
|
|
12
|
+
### Scope
|
|
13
|
+
|
|
14
|
+
- Single-fallback (2-tier) chains only. Multi-tier chains can be inspected post-hoc via `trace.attempts` but aren't summarized in the optimize table.
|
|
15
|
+
- Costs with `runs: 3 + production_mode: { fallback: "gpt-5-mini" }` are ≈3× a single-shot eval plus the actual retry attempts — not 6×. Production-mode metrics come from a single pass.
|
|
16
|
+
- **Step-only.** Calling `compare_models` with `production_mode:` on a `Pipeline::Base` subclass raises `ArgumentError` — retry injection is Step-level and pipeline-wide fallback semantics aren't defined yet. Benchmark individual steps.
|
|
17
|
+
|
|
18
|
+
### Documentation
|
|
19
|
+
|
|
20
|
+
- **Guide: [Production-mode cost measurement](docs/guide/optimizing_retry_policy.md#production-mode-cost-measurement)** — API, metric interpretation, 2-tier scope note.
|
|
21
|
+
|
|
3
22
|
## 0.6.3 (2026-04-20)
|
|
4
23
|
|
|
5
24
|
### Features
|
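The 0.6.4 changelog entry above says that `effective_cost` can reverse a "cheaper candidate" decision made on first-attempt cost alone. The calculation below is a minimal sketch of that effect under assumed numbers. The `Candidate` struct, the prices, and both helper methods are hypothetical and are not part of the gem.

```ruby
# Hypothetical per-call prices in USD; not taken from any real price list.
Candidate = Struct.new(:name, :cost_per_call, :escalation_rate, keyword_init: true)

FALLBACK_COST = 0.0040 # assumed cost of a single fallback call

# Expected cost per eval case in a 2-tier chain: every case pays for the first
# attempt, and the escalated fraction also pays for one fallback attempt.
def expected_effective_cost(candidate, fallback_cost: FALLBACK_COST)
  candidate.cost_per_call + (candidate.escalation_rate * fallback_cost)
end

# Expected cost of a whole eval: one production-mode pass per run, so the
# multiplier is `runs`, not `2 * runs`.
def expected_eval_cost(candidate, cases:, runs:)
  runs * cases * expected_effective_cost(candidate)
end

nano = Candidate.new(name: "nano", cost_per_call: 0.0005, escalation_rate: 0.60)
mini = Candidate.new(name: "mini", cost_per_call: 0.0020, escalation_rate: 0.05)

[nano, mini].each do |c|
  puts format("%-5s single-shot=$%.4f effective=$%.4f",
              c.name, c.cost_per_call, expected_effective_cost(c))
end
```

With these assumed figures the first candidate is cheaper per first attempt but more expensive once escalations are counted, which is the situation the new metrics are meant to expose.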
data/Gemfile.lock
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
PATH
|
|
2
2
|
remote: .
|
|
3
3
|
specs:
|
|
4
|
-
ruby_llm-contract (0.6.
|
|
4
|
+
ruby_llm-contract (0.6.4)
|
|
5
5
|
dry-types (~> 1.7)
|
|
6
6
|
ruby_llm (~> 1.0)
|
|
7
7
|
ruby_llm-schema (~> 0.3)
|
|
@@ -258,7 +258,7 @@ CHECKSUMS
|
|
|
258
258
|
rubocop-ast (1.49.1) sha256=4412f3ee70f6fe4546cc489548e0f6fcf76cafcfa80fa03af67098ffed755035
|
|
259
259
|
ruby-progressbar (1.13.0) sha256=80fc9c47a9b640d6834e0dc7b3c94c9df37f08cb072b7761e4a71e22cff29b33
|
|
260
260
|
ruby_llm (1.14.0) sha256=57c6f7034fc4a44504ea137d70f853b07824f1c1cdbe774ab3ab3522e7098deb
|
|
261
|
-
ruby_llm-contract (0.6.
|
|
261
|
+
ruby_llm-contract (0.6.4)
|
|
262
262
|
ruby_llm-schema (0.3.0) sha256=a591edc5ca1b7f0304f0e2261de61ba4b3bea17be09f5cf7558153adfda3dec6
|
|
263
263
|
ruby_parser (3.22.0) sha256=1eb4937cd9eb220aa2d194e352a24dba90aef00751e24c8dfffdb14000f15d23
|
|
264
264
|
rubycritic (4.12.0) sha256=024fed90fe656fa939f6ea80aab17569699ac3863d0b52fd72cb99892247abc8
|
data/README.md
CHANGED
|
@@ -160,6 +160,8 @@ Nano fails on edge cases. Mini and full both score 100% — but mini is **5x che
|
|
|
160
160
|
|
|
161
161
|
Running live against gpt-5 / o-series? Pass `runs: 3` to average out sampling variance (OpenAI forces `temperature=1.0` server-side, so one unlucky run can misclassify a viable candidate). See [Reducing variance with `runs:`](docs/guide/optimizing_retry_policy.md#reducing-variance-with-runs).
|
|
162
162
|
|
|
163
|
+
Want the *effective* cost — first-attempt plus retries — rather than the single-shot headline number? Pass `production_mode: { fallback: "gpt-5-mini" }` and the table gains `escalation`, `effective cost`, and a `Chain` column. See [Production-mode cost measurement](docs/guide/optimizing_retry_policy.md#production-mode-cost-measurement).
|
|
164
|
+
|
|
163
165
|
## Let the gem tell you what to do
|
|
164
166
|
|
|
165
167
|
Don't read tables — get a recommendation. Supports `model + reasoning_effort` combinations:
|
data/lib/ruby_llm/contract/concerns/eval_host.rb
CHANGED
|
@@ -5,6 +5,7 @@ module RubyLLM
|
|
|
5
5
|
module Concerns
|
|
6
6
|
module EvalHost
|
|
7
7
|
include ContextHelpers
|
|
8
|
+
include ProductionModeContext
|
|
8
9
|
|
|
9
10
|
SAMPLE_RESPONSE_COMPARE_WARNING = "[ruby_llm-contract] compare_with ignores sample_response. " \
|
|
10
11
|
"Without model: or context: { adapter: ... }, both sides will be skipped " \
|
|
@@ -70,26 +71,29 @@ module RubyLLM
|
|
|
70
71
|
Eval::PromptDiff.new(candidate: my_report, baseline: other_report)
|
|
71
72
|
end
|
|
72
73
|
|
|
73
|
-
def compare_models(eval_name, models: [], candidates: [], context: {}, runs: 1)
|
|
74
|
+
def compare_models(eval_name, models: [], candidates: [], context: {}, runs: 1, production_mode: nil)
|
|
74
75
|
raise ArgumentError, "Pass either models: or candidates:, not both" if models.any? && candidates.any?
|
|
75
76
|
|
|
76
77
|
runs = coerce_runs(runs)
|
|
77
78
|
|
|
78
79
|
context = safe_context(context)
|
|
79
80
|
candidate_configs = normalize_candidates(models, candidates)
|
|
81
|
+
reject_production_mode_on_pipeline!(production_mode)
|
|
82
|
+
fallback_config = normalize_production_mode(production_mode)
|
|
80
83
|
|
|
81
84
|
reports = {}
|
|
82
85
|
configs = {}
|
|
83
86
|
candidate_configs.each do |config|
|
|
84
87
|
label = Eval::ModelComparison.candidate_label(config)
|
|
85
|
-
model_context =
|
|
86
|
-
model_context[:reasoning_effort] = config[:reasoning_effort] if config[:reasoning_effort]
|
|
88
|
+
model_context = build_candidate_context(context, config, fallback_config)
|
|
87
89
|
per_run = Array.new(runs) { run_single_eval(eval_name, model_context) }
|
|
88
90
|
reports[label] = runs == 1 ? per_run.first : Eval::AggregatedReport.new(per_run)
|
|
89
91
|
configs[label] = config
|
|
90
92
|
end
|
|
91
93
|
|
|
92
|
-
Eval::ModelComparison.new(
|
|
94
|
+
Eval::ModelComparison.new(
|
|
95
|
+
eval_name: eval_name, reports: reports, configs: configs, fallback: fallback_config
|
|
96
|
+
)
|
|
93
97
|
end
|
|
94
98
|
|
|
95
99
|
private
|
|
@@ -101,6 +105,24 @@ module RubyLLM
|
|
|
101
105
|
runs
|
|
102
106
|
end
|
|
103
107
|
|
|
108
|
+
def reject_production_mode_on_pipeline!(production_mode)
|
|
109
|
+
return if production_mode.nil? || production_mode == false
|
|
110
|
+
return unless defined?(Pipeline::Base) && self < Pipeline::Base
|
|
111
|
+
|
|
112
|
+
raise ArgumentError,
|
|
113
|
+
"production_mode: is not supported on Pipeline (#{self}). Retry injection happens at Step level; " \
|
|
114
|
+
"call compare_models with production_mode: on individual Step classes instead."
|
|
115
|
+
end
|
|
116
|
+
|
|
117
|
+
def build_candidate_context(context, config, fallback_config)
|
|
118
|
+
model_context = isolate_context(context).merge(model: config[:model])
|
|
119
|
+
model_context[:reasoning_effort] = config[:reasoning_effort] if config[:reasoning_effort]
|
|
120
|
+
return model_context unless fallback_config
|
|
121
|
+
|
|
122
|
+
model_context[:retry_policy_override] = production_mode_override(config, fallback_config)
|
|
123
|
+
model_context
|
|
124
|
+
end
|
|
125
|
+
|
|
104
126
|
def normalize_candidates(models, candidates)
|
|
105
127
|
if candidates.any?
|
|
106
128
|
candidates.map { |c| RubyLLM::Contract.normalize_candidate_config(c) }.uniq
|
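The guard in `reject_production_mode_on_pipeline!` above depends on how `Module#<` behaves for unrelated classes. The snippet below is a small standalone sketch of that behaviour. `StepBase`, `MyPipeline`, `MyStep`, and `pipeline_class?` are hypothetical stand-ins, and the `Pipeline::Base` defined here is a local dummy, not the gem's class.

```ruby
# Local dummies so the example runs without the gem.
module Pipeline
  class Base; end
end

class StepBase; end
class MyPipeline < Pipeline::Base; end
class MyStep < StepBase; end

# Module#< returns true for a strict subclass, false for the same class or an
# ancestor, and nil when the two classes are unrelated. Comparing with `true`
# collapses false and nil into a plain boolean.
def pipeline_class?(klass)
  (defined?(Pipeline::Base) && klass < Pipeline::Base) == true
end

puts pipeline_class?(MyPipeline)     # strict subclass
puts pipeline_class?(MyStep)         # unrelated class
puts pipeline_class?(Pipeline::Base) # the base class itself
```

Because `return unless` treats both `false` and `nil` as "do not raise", the guard in the diff only fires for strict subclasses of `Pipeline::Base`.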
data/lib/ruby_llm/contract/concerns/production_mode_context.rb
ADDED
|
@@ -0,0 +1,35 @@
|
|
|
1
|
+
# frozen_string_literal: true
|
|
2
|
+
|
|
3
|
+
module RubyLLM
|
|
4
|
+
module Contract
|
|
5
|
+
module Concerns
|
|
6
|
+
# Helpers for injecting a retry_policy_override into per-candidate eval
|
|
7
|
+
# context when compare_models runs in production-mode. When candidate
|
|
8
|
+
# == fallback, retry injection is skipped so the row degenerates into
|
|
9
|
+
# a single-shot eval by construction.
|
|
10
|
+
module ProductionModeContext
|
|
11
|
+
private
|
|
12
|
+
|
|
13
|
+
def normalize_production_mode(production_mode)
|
|
14
|
+
return nil if production_mode.nil? || production_mode == false
|
|
15
|
+
|
|
16
|
+
unless production_mode.is_a?(Hash) && production_mode[:fallback]
|
|
17
|
+
raise ArgumentError, "production_mode: must be a Hash with :fallback, got #{production_mode.inspect}"
|
|
18
|
+
end
|
|
19
|
+
|
|
20
|
+
RubyLLM::Contract.normalize_candidate_config(production_mode[:fallback])
|
|
21
|
+
end
|
|
22
|
+
|
|
23
|
+
def production_mode_override(config, fallback_config)
|
|
24
|
+
return nil if same_candidate?(config, fallback_config)
|
|
25
|
+
|
|
26
|
+
Step::RetryPolicy.new(models: [config, fallback_config])
|
|
27
|
+
end
|
|
28
|
+
|
|
29
|
+
def same_candidate?(first, second)
|
|
30
|
+
first[:model] == second[:model] && first[:reasoning_effort] == second[:reasoning_effort]
|
|
31
|
+
end
|
|
32
|
+
end
|
|
33
|
+
end
|
|
34
|
+
end
|
|
35
|
+
end
|
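The decision made by `production_mode_override` above can be tried out with plain hashes. This is a sketch only: `chain_for` is a hypothetical helper that returns an array where the gem builds a `Step::RetryPolicy`, and the model names are illustrative.

```ruby
# Mirrors the comparison in the diff: both the model and the reasoning effort
# must match for two configs to count as the same candidate.
def same_candidate?(first, second)
  first[:model] == second[:model] && first[:reasoning_effort] == second[:reasoning_effort]
end

# Returns the model chain to try, or nil when no retry should be injected.
def chain_for(candidate, fallback)
  return nil if same_candidate?(candidate, fallback)

  [candidate, fallback]
end

fallback = { model: "gpt-5-mini" }

p chain_for({ model: "gpt-5-nano" }, fallback)
p chain_for({ model: "gpt-5-mini" }, fallback)
p chain_for({ model: "gpt-5-mini", reasoning_effort: "high" }, fallback)
```

Note the third call: the same model with a different `reasoning_effort` is treated as a distinct candidate, so it still gets a two-element chain.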
data/lib/ruby_llm/contract/eval/aggregated_report.rb
CHANGED
|
@@ -86,6 +86,49 @@ module RubyLLM
|
|
|
86
86
|
def failures
|
|
87
87
|
@runs.flat_map(&:failures)
|
|
88
88
|
end
|
|
89
|
+
|
|
90
|
+
def production_mode?
|
|
91
|
+
@runs.any?(&:production_mode?)
|
|
92
|
+
end
|
|
93
|
+
|
|
94
|
+
def escalation_rate
|
|
95
|
+
values = @runs.filter_map(&:escalation_rate)
|
|
96
|
+
return nil if values.empty?
|
|
97
|
+
|
|
98
|
+
values.sum / values.length.to_f
|
|
99
|
+
end
|
|
100
|
+
|
|
101
|
+
def single_shot_cost
|
|
102
|
+
values = @runs.filter_map(&:single_shot_cost)
|
|
103
|
+
return nil if values.empty?
|
|
104
|
+
|
|
105
|
+
values.sum / values.length.to_f
|
|
106
|
+
end
|
|
107
|
+
|
|
108
|
+
def effective_cost
|
|
109
|
+
total_cost
|
|
110
|
+
end
|
|
111
|
+
|
|
112
|
+
def single_shot_latency_ms
|
|
113
|
+
values = @runs.filter_map(&:single_shot_latency_ms)
|
|
114
|
+
return nil if values.empty?
|
|
115
|
+
|
|
116
|
+
values.sum / values.length.to_f
|
|
117
|
+
end
|
|
118
|
+
|
|
119
|
+
def effective_latency_ms
|
|
120
|
+
avg_latency_ms
|
|
121
|
+
end
|
|
122
|
+
|
|
123
|
+
def latency_percentiles
|
|
124
|
+
per_run = @runs.filter_map(&:latency_percentiles)
|
|
125
|
+
return nil if per_run.empty?
|
|
126
|
+
|
|
127
|
+
%i[p50 p95 max].each_with_object({}) do |key, acc|
|
|
128
|
+
values = per_run.filter_map { |h| h[key] }
|
|
129
|
+
acc[key] = values.empty? ? nil : values.sum / values.length.to_f
|
|
130
|
+
end
|
|
131
|
+
end
|
|
89
132
|
end
|
|
90
133
|
end
|
|
91
134
|
end
|
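The averaging pattern used by the new `AggregatedReport` methods above (`filter_map`, then mean) has two consequences that are easy to miss: runs reporting `nil` are skipped rather than counted as zero, and the aggregated `max` is the mean of per-run maxima, not the overall maximum. The sketch below reproduces the pattern on hypothetical data. The `Run` struct and the helper names are assumptions made for illustration.

```ruby
Run = Struct.new(:escalation_rate, :latency_percentiles, keyword_init: true)

def mean(values)
  return nil if values.empty?

  values.sum / values.length.to_f
end

# filter_map drops nil (and false) but keeps 0.0, which is truthy in Ruby.
def average_escalation(runs)
  mean(runs.filter_map(&:escalation_rate))
end

# Averages each percentile key independently across the runs that report it.
def average_percentiles(runs)
  per_run = runs.filter_map(&:latency_percentiles)
  return nil if per_run.empty?

  %i[p50 p95 max].to_h { |key| [key, mean(per_run.filter_map { |h| h[key] })] }
end

runs = [
  Run.new(escalation_rate: 0.0, latency_percentiles: { p50: 100.0, p95: 300.0, max: 400.0 }),
  Run.new(escalation_rate: nil, latency_percentiles: nil),
  Run.new(escalation_rate: 0.5, latency_percentiles: { p50: 200.0, p95: 500.0, max: 800.0 })
]

p average_escalation(runs)
p average_percentiles(runs)
```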
data/lib/ruby_llm/contract/eval/case_result.rb
CHANGED
|
@@ -5,10 +5,10 @@ module RubyLLM
|
|
|
5
5
|
module Eval
|
|
6
6
|
class CaseResult
|
|
7
7
|
attr_reader :name, :input, :output, :expected, :step_status,
|
|
8
|
-
:score, :details, :duration_ms, :cost
|
|
8
|
+
:score, :details, :duration_ms, :cost, :attempts
|
|
9
9
|
|
|
10
10
|
def initialize(name:, input:, output:, expected:, step_status:,
|
|
11
|
-
score:, passed:, label: nil, details: nil, duration_ms: nil, cost: nil)
|
|
11
|
+
score:, passed:, label: nil, details: nil, duration_ms: nil, cost: nil, attempts: nil)
|
|
12
12
|
@name = name
|
|
13
13
|
@input = input
|
|
14
14
|
@output = output
|
|
@@ -20,6 +20,7 @@ module RubyLLM
|
|
|
20
20
|
@details = details
|
|
21
21
|
@duration_ms = duration_ms
|
|
22
22
|
@cost = cost
|
|
23
|
+
@attempts = attempts
|
|
23
24
|
freeze
|
|
24
25
|
end
|
|
25
26
|
|
|
@@ -58,7 +59,8 @@ module RubyLLM
|
|
|
58
59
|
label: label,
|
|
59
60
|
details: @details,
|
|
60
61
|
duration_ms: @duration_ms,
|
|
61
|
-
cost: @cost
|
|
62
|
+
cost: @cost,
|
|
63
|
+
attempts: @attempts
|
|
62
64
|
}
|
|
63
65
|
end
|
|
64
66
|
|
data/lib/ruby_llm/contract/eval/case_result_builder.rb
CHANGED
|
@@ -18,7 +18,8 @@ module RubyLLM
|
|
|
18
18
|
label: evaluation.label,
|
|
19
19
|
details: evaluation.details,
|
|
20
20
|
duration_ms: trace_metric(trace, :total_latency_ms, :latency_ms),
|
|
21
|
-
cost: trace_metric(trace, :total_cost, :cost)
|
|
21
|
+
cost: trace_metric(trace, :total_cost, :cost),
|
|
22
|
+
attempts: trace_attempts(trace)
|
|
22
23
|
)
|
|
23
24
|
end
|
|
24
25
|
|
|
@@ -29,6 +30,12 @@ module RubyLLM
|
|
|
29
30
|
|
|
30
31
|
trace.respond_to?(pipeline_key) ? trace.public_send(pipeline_key) : trace[step_key]
|
|
31
32
|
end
|
|
33
|
+
|
|
34
|
+
def trace_attempts(trace)
|
|
35
|
+
return nil unless trace
|
|
36
|
+
|
|
37
|
+
trace.respond_to?(:attempts) ? trace.attempts : nil
|
|
38
|
+
end
|
|
32
39
|
end
|
|
33
40
|
end
|
|
34
41
|
end
|
data/lib/ruby_llm/contract/eval/model_comparison.rb
CHANGED
|
@@ -4,20 +4,25 @@ module RubyLLM
|
|
|
4
4
|
module Contract
|
|
5
5
|
module Eval
|
|
6
6
|
class ModelComparison
|
|
7
|
-
attr_reader :eval_name, :reports, :configs
|
|
7
|
+
attr_reader :eval_name, :reports, :configs, :fallback
|
|
8
8
|
|
|
9
9
|
def self.candidate_label(config)
|
|
10
10
|
effort = config[:reasoning_effort]
|
|
11
11
|
effort ? "#{config[:model]} (effort: #{effort})" : config[:model]
|
|
12
12
|
end
|
|
13
13
|
|
|
14
|
-
def initialize(eval_name:, reports:, configs: nil)
|
|
14
|
+
def initialize(eval_name:, reports:, configs: nil, fallback: nil)
|
|
15
15
|
@eval_name = eval_name
|
|
16
16
|
@reports = reports.dup.freeze
|
|
17
17
|
@configs = (configs || default_configs_from_reports).freeze
|
|
18
|
+
@fallback = fallback
|
|
18
19
|
freeze
|
|
19
20
|
end
|
|
20
21
|
|
|
22
|
+
def production_mode?
|
|
23
|
+
!@fallback.nil?
|
|
24
|
+
end
|
|
25
|
+
|
|
21
26
|
def models
|
|
22
27
|
@reports.keys
|
|
23
28
|
end
|
|
@@ -44,6 +49,8 @@ module RubyLLM
|
|
|
44
49
|
end
|
|
45
50
|
|
|
46
51
|
def table
|
|
52
|
+
return production_mode_table if production_mode?
|
|
53
|
+
|
|
47
54
|
max_label = [@reports.keys.map(&:length).max || 0, 25].max
|
|
48
55
|
lines = [format(" %-#{max_label}s Score Cost Avg Latency", "Candidate")]
|
|
49
56
|
lines << " #{"-" * (max_label + 36)}"
|
|
@@ -57,6 +64,62 @@ module RubyLLM
|
|
|
57
64
|
lines.join("\n")
|
|
58
65
|
end
|
|
59
66
|
|
|
67
|
+
def production_mode_table
|
|
68
|
+
fallback_label = self.class.candidate_label(@fallback)
|
|
69
|
+
rows = @reports.map do |label, report|
|
|
70
|
+
chain = chain_label(label, fallback_label)
|
|
71
|
+
{ chain: chain, report: report, same: chain_same_as_fallback?(label, fallback_label) }
|
|
72
|
+
end
|
|
73
|
+
|
|
74
|
+
chain_width = [rows.map { |r| r[:chain].length }.max || 0, 20].max
|
|
75
|
+
lines = [format(" %-#{chain_width}s %-11s %-10s %-14s %-9s %s",
|
|
76
|
+
"Chain", "single-shot", "escalation", "effective cost", "latency", "score")]
|
|
77
|
+
lines << " #{"-" * (chain_width + 60)}"
|
|
78
|
+
|
|
79
|
+
rows.each do |row|
|
|
80
|
+
lines << format_production_row(row, chain_width)
|
|
81
|
+
end
|
|
82
|
+
|
|
83
|
+
lines.join("\n")
|
|
84
|
+
end
|
|
85
|
+
|
|
86
|
+
private
|
|
87
|
+
|
|
88
|
+
def chain_label(label, fallback_label)
|
|
89
|
+
label == fallback_label ? label : "#{label} → #{fallback_label}"
|
|
90
|
+
end
|
|
91
|
+
|
|
92
|
+
def chain_same_as_fallback?(label, fallback_label)
|
|
93
|
+
label == fallback_label
|
|
94
|
+
end
|
|
95
|
+
|
|
96
|
+
def format_production_row(row, chain_width)
|
|
97
|
+
report = row[:report]
|
|
98
|
+
format(" %-#{chain_width}s %-11s %-10s %-14s %-9s %6.2f",
|
|
99
|
+
row[:chain],
|
|
100
|
+
format_money(report.single_shot_cost || report.total_cost),
|
|
101
|
+
format_escalation(row, report),
|
|
102
|
+
format_money(report.effective_cost),
|
|
103
|
+
format_latency(report.effective_latency_ms),
|
|
104
|
+
report.score)
|
|
105
|
+
end
|
|
106
|
+
|
|
107
|
+
def format_money(value)
|
|
108
|
+
value&.positive? ? "$#{format("%.4f", value)}" : "n/a"
|
|
109
|
+
end
|
|
110
|
+
|
|
111
|
+
def format_latency(value)
|
|
112
|
+
value ? "#{value.round}ms" : "n/a"
|
|
113
|
+
end
|
|
114
|
+
|
|
115
|
+
def format_escalation(row, report)
|
|
116
|
+
return "—" if row[:same]
|
|
117
|
+
|
|
118
|
+
format("%d%%", ((report.escalation_rate || 0) * 100).round)
|
|
119
|
+
end
|
|
120
|
+
|
|
121
|
+
public
|
|
122
|
+
|
|
60
123
|
def print_summary(io = $stdout)
|
|
61
124
|
io.puts "#{@eval_name} — model comparison"
|
|
62
125
|
io.puts
|
|
@@ -72,7 +135,7 @@ module RubyLLM
|
|
|
72
135
|
|
|
73
136
|
def to_h
|
|
74
137
|
@reports.transform_values do |report|
|
|
75
|
-
{
|
|
138
|
+
base = {
|
|
76
139
|
score: report.score,
|
|
77
140
|
total_cost: report.total_cost,
|
|
78
141
|
avg_latency_ms: report.avg_latency_ms,
|
|
@@ -80,11 +143,25 @@ module RubyLLM
|
|
|
80
143
|
pass_rate_ratio: report.pass_rate_ratio,
|
|
81
144
|
passed: report.passed?
|
|
82
145
|
}
|
|
146
|
+
production_mode_metrics(report, base)
|
|
83
147
|
end
|
|
84
148
|
end
|
|
85
149
|
|
|
86
150
|
private
|
|
87
151
|
|
|
152
|
+
def production_mode_metrics(report, base)
|
|
153
|
+
return base unless report.respond_to?(:production_mode?) && report.production_mode?
|
|
154
|
+
|
|
155
|
+
base.merge(
|
|
156
|
+
escalation_rate: report.escalation_rate,
|
|
157
|
+
single_shot_cost: report.single_shot_cost,
|
|
158
|
+
effective_cost: report.effective_cost,
|
|
159
|
+
single_shot_latency_ms: report.single_shot_latency_ms,
|
|
160
|
+
effective_latency_ms: report.effective_latency_ms,
|
|
161
|
+
latency_percentiles: report.latency_percentiles
|
|
162
|
+
)
|
|
163
|
+
end
|
|
164
|
+
|
|
88
165
|
def resolve_key(candidate)
|
|
89
166
|
case candidate
|
|
90
167
|
when String then candidate
|
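Two of the formatting helpers added to `ModelComparison` above have edge cases worth seeing in isolation. The snippet copies their logic into top-level methods so it runs without the gem. The sample labels and amounts are made up.

```ruby
# Safe navigation makes nil fall through to "n/a"; a zero cost does too,
# because 0.0.positive? is false.
def format_money(value)
  value&.positive? ? "$#{format('%.4f', value)}" : "n/a"
end

# A candidate that equals the fallback is rendered as a single model.
def chain_label(label, fallback_label)
  label == fallback_label ? label : "#{label} → #{fallback_label}"
end

puts format_money(0.5)
puts format_money(0.0)
puts format_money(nil)
puts chain_label("gpt-5-nano", "gpt-5-mini")
puts chain_label("gpt-5-mini", "gpt-5-mini")
```

One consequence of the `positive?` check is that a run with a recorded cost of exactly zero is displayed the same way as a run with no cost data.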
data/lib/ruby_llm/contract/eval/report.rb
CHANGED
|
@@ -15,7 +15,9 @@ module RubyLLM
|
|
|
15
15
|
BASELINE_DIR = ".eval_baselines"
|
|
16
16
|
|
|
17
17
|
def_delegators :@stats, :score, :passed, :failed, :skipped, :failures, :pass_rate, :pass_rate_ratio,
|
|
18
|
-
:total_cost, :avg_latency_ms, :passed
|
|
18
|
+
:total_cost, :avg_latency_ms, :passed?,
|
|
19
|
+
:production_mode?, :escalation_rate, :single_shot_cost, :single_shot_latency_ms,
|
|
20
|
+
:effective_cost, :effective_latency_ms, :latency_percentiles
|
|
19
21
|
def_delegators :@presenter, :summary, :to_s, :print_summary
|
|
20
22
|
def_delegators :@storage, :save_history!, :eval_history, :save_baseline!, :compare_with_baseline,
|
|
21
23
|
:baseline_exists?
|
data/lib/ruby_llm/contract/eval/report_stats.rb
CHANGED
|
@@ -65,6 +65,69 @@ module RubyLLM
|
|
|
65
65
|
def evaluated_results_count
|
|
66
66
|
evaluated_results.length
|
|
67
67
|
end
|
|
68
|
+
|
|
69
|
+
def production_mode?
|
|
70
|
+
evaluated_results.any? { |r| r.respond_to?(:attempts) && r.attempts }
|
|
71
|
+
end
|
|
72
|
+
|
|
73
|
+
def escalation_rate
|
|
74
|
+
return nil unless production_mode?
|
|
75
|
+
return 0.0 if evaluated_results.empty?
|
|
76
|
+
|
|
77
|
+
escalated = evaluated_results.count { |r| (r.attempts || []).length > 1 }
|
|
78
|
+
escalated.to_f / evaluated_results.length
|
|
79
|
+
end
|
|
80
|
+
|
|
81
|
+
def single_shot_cost
|
|
82
|
+
return nil unless production_mode?
|
|
83
|
+
|
|
84
|
+
evaluated_results.sum { |r| first_attempt_cost(r) || r.cost || 0.0 }
|
|
85
|
+
end
|
|
86
|
+
|
|
87
|
+
def effective_cost
|
|
88
|
+
total_cost
|
|
89
|
+
end
|
|
90
|
+
|
|
91
|
+
def single_shot_latency_ms
|
|
92
|
+
return nil unless production_mode?
|
|
93
|
+
|
|
94
|
+
latencies = evaluated_results.filter_map { |r| first_attempt_latency(r) || r.duration_ms }
|
|
95
|
+
return nil if latencies.empty?
|
|
96
|
+
|
|
97
|
+
latencies.sum.to_f / latencies.length
|
|
98
|
+
end
|
|
99
|
+
|
|
100
|
+
def effective_latency_ms
|
|
101
|
+
avg_latency_ms
|
|
102
|
+
end
|
|
103
|
+
|
|
104
|
+
def latency_percentiles
|
|
105
|
+
return nil unless production_mode?
|
|
106
|
+
|
|
107
|
+
latencies = evaluated_results.filter_map(&:duration_ms).sort
|
|
108
|
+
return nil if latencies.empty?
|
|
109
|
+
|
|
110
|
+
{ p50: percentile(latencies, 0.50), p95: percentile(latencies, 0.95), max: latencies.last.to_f }
|
|
111
|
+
end
|
|
112
|
+
|
|
113
|
+
private
|
|
114
|
+
|
|
115
|
+
def first_attempt_cost(result)
|
|
116
|
+
first = (result.attempts || []).first
|
|
117
|
+
first && first[:cost]
|
|
118
|
+
end
|
|
119
|
+
|
|
120
|
+
def first_attempt_latency(result)
|
|
121
|
+
first = (result.attempts || []).first
|
|
122
|
+
first && first[:latency_ms]
|
|
123
|
+
end
|
|
124
|
+
|
|
125
|
+
def percentile(sorted, fraction)
|
|
126
|
+
return nil if sorted.empty?
|
|
127
|
+
|
|
128
|
+
idx = (fraction * (sorted.length - 1)).round
|
|
129
|
+
sorted[idx].to_f
|
|
130
|
+
end
|
|
68
131
|
end
|
|
69
132
|
end
|
|
70
133
|
end
|
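The statistics added to `ReportStats` above can be exercised on a handful of hand-written results. The sketch below re-implements the same formulas as top-level methods. The `Result` struct and the sample numbers are hypothetical, while the attempt shape (hashes with `:cost` and `:latency_ms`) follows what the diff reads.

```ruby
Result = Struct.new(:attempts, :cost, :duration_ms, keyword_init: true)

# Index-rounding percentile over an already sorted array, as in the diff.
def percentile(sorted, fraction)
  return nil if sorted.empty?

  sorted[(fraction * (sorted.length - 1)).round].to_f
end

# Share of cases that needed more than one attempt.
def escalation_rate(results)
  return 0.0 if results.empty?

  results.count { |r| (r.attempts || []).length > 1 }.to_f / results.length
end

# Sum of first-attempt costs, falling back to the case total when the
# attempt carries no cost.
def single_shot_cost(results)
  results.sum do |r|
    first = (r.attempts || []).first
    (first && first[:cost]) || r.cost || 0.0
  end
end

results = [
  Result.new(attempts: [{ cost: 0.001, latency_ms: 100 }], cost: 0.001, duration_ms: 100),
  Result.new(attempts: [{ cost: 0.001, latency_ms: 120 }, { cost: 0.004, latency_ms: 380 }],
             cost: 0.005, duration_ms: 500),
  Result.new(attempts: [{ cost: 0.001, latency_ms: 200 }], cost: 0.001, duration_ms: 200),
  Result.new(attempts: [{ cost: 0.001, latency_ms: 300 }], cost: 0.001, duration_ms: 300)
]

latencies = results.filter_map(&:duration_ms).sort

p escalation_rate(results)
p single_shot_cost(results)
p percentile(latencies, 0.50)
p percentile(latencies, 0.95)
```

Because `Float#round` rounds halves away from zero, an even-sized sample resolves `p50` to the upper of the two middle values rather than interpolating between them.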
data/lib/ruby_llm/contract/eval/retry_optimizer.rb
CHANGED
|
@@ -94,12 +94,13 @@ module RubyLLM
|
|
|
94
94
|
end
|
|
95
95
|
end
|
|
96
96
|
|
|
97
|
-
def initialize(step:, candidates:, context: {}, min_score: 0.95, runs: 1)
|
|
97
|
+
def initialize(step:, candidates:, context: {}, min_score: 0.95, runs: 1, production_mode: nil)
|
|
98
98
|
@step = step
|
|
99
99
|
@candidates = candidates
|
|
100
100
|
@context = context
|
|
101
101
|
@min_score = min_score
|
|
102
102
|
@runs = runs
|
|
103
|
+
@production_mode = production_mode
|
|
103
104
|
end
|
|
104
105
|
|
|
105
106
|
def call
|
|
@@ -109,7 +110,8 @@ module RubyLLM
|
|
|
109
110
|
score_matrix = {}
|
|
110
111
|
evals.each do |eval_name|
|
|
111
112
|
comparison = with_retry_disabled do
|
|
112
|
-
@step.compare_models(eval_name, candidates: @candidates, context: @context,
|
|
113
|
+
@step.compare_models(eval_name, candidates: @candidates, context: @context,
|
|
114
|
+
runs: @runs, production_mode: @production_mode)
|
|
113
115
|
end
|
|
114
116
|
score_matrix[eval_name] = extract_scores(comparison)
|
|
115
117
|
end
|
data/lib/ruby_llm/contract/step/base.rb
CHANGED
|
@@ -59,17 +59,19 @@ module RubyLLM
|
|
|
59
59
|
).recommend
|
|
60
60
|
end
|
|
61
61
|
|
|
62
|
-
def optimize_retry_policy(candidates:, context: {}, min_score: 0.95, runs: 1)
|
|
62
|
+
def optimize_retry_policy(candidates:, context: {}, min_score: 0.95, runs: 1, production_mode: nil)
|
|
63
63
|
Eval::RetryOptimizer.new(
|
|
64
64
|
step: self,
|
|
65
65
|
candidates: candidates,
|
|
66
66
|
context: context,
|
|
67
67
|
min_score: min_score,
|
|
68
|
-
runs: runs
|
|
68
|
+
runs: runs,
|
|
69
|
+
production_mode: production_mode
|
|
69
70
|
).call
|
|
70
71
|
end
|
|
71
72
|
|
|
72
|
-
KNOWN_CONTEXT_KEYS = %i[adapter model temperature max_tokens provider assume_model_exists
|
|
73
|
+
KNOWN_CONTEXT_KEYS = %i[adapter model temperature max_tokens provider assume_model_exists
|
|
74
|
+
reasoning_effort retry_policy_override].freeze
|
|
73
75
|
|
|
74
76
|
include Concerns::ContextHelpers
|
|
75
77
|
|
|
@@ -156,11 +158,12 @@ module RubyLLM
|
|
|
156
158
|
end
|
|
157
159
|
|
|
158
160
|
def runtime_settings(context)
|
|
161
|
+
policy = context.key?(:retry_policy_override) ? context[:retry_policy_override] : retry_policy
|
|
159
162
|
{
|
|
160
163
|
model: context[:model] || model || RubyLLM::Contract.configuration.default_model,
|
|
161
164
|
temperature: context[:temperature],
|
|
162
165
|
extra_options: context.slice(:provider, :assume_model_exists, :max_tokens, :reasoning_effort),
|
|
163
|
-
policy:
|
|
166
|
+
policy: policy
|
|
164
167
|
}
|
|
165
168
|
end
|
|
166
169
|
|
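The `runtime_settings` change above uses `context.key?(:retry_policy_override)` rather than `||`, which is what lets an explicit `nil` switch retries off for one call. The sketch below isolates that lookup. `CLASS_LEVEL_POLICY` and `resolve_policy` are hypothetical names, and policies are plain hashes here instead of the gem's policy objects.

```ruby
# Stand-in for a retry policy declared on the step class.
CLASS_LEVEL_POLICY = { models: %w[gpt-5-nano gpt-5-mini] }.freeze

# key? distinguishes "override absent" from "override present but nil".
# With `context[:retry_policy_override] || class_policy`, an explicit nil
# would silently fall back to the class-level policy.
def resolve_policy(context, class_policy: CLASS_LEVEL_POLICY)
  context.key?(:retry_policy_override) ? context[:retry_policy_override] : class_policy
end

p resolve_policy({})
p resolve_policy({ retry_policy_override: nil })
p resolve_policy({ retry_policy_override: { models: %w[gpt-5-mini gpt-5] } })
```

This matches the changelog's description of the key as something that "nullifies or replaces" the class-level policy without mutating the step class.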
data/lib/ruby_llm/contract.rb
CHANGED
|
@@ -154,6 +154,7 @@ end
|
|
|
154
154
|
require_relative "contract/concerns/context_helpers"
|
|
155
155
|
require_relative "contract/concerns/deep_freeze"
|
|
156
156
|
require_relative "contract/concerns/deep_symbolize"
|
|
157
|
+
require_relative "contract/concerns/production_mode_context"
|
|
157
158
|
require_relative "contract/concerns/eval_host"
|
|
158
159
|
require_relative "contract/concerns/trace_equality"
|
|
159
160
|
require_relative "contract/concerns/usage_aggregator"
|
metadata
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: ruby_llm-contract
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 0.6.
|
|
4
|
+
version: 0.6.4
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Justyna
|
|
@@ -89,6 +89,7 @@ files:
|
|
|
89
89
|
- lib/ruby_llm/contract/concerns/deep_freeze.rb
|
|
90
90
|
- lib/ruby_llm/contract/concerns/deep_symbolize.rb
|
|
91
91
|
- lib/ruby_llm/contract/concerns/eval_host.rb
|
|
92
|
+
- lib/ruby_llm/contract/concerns/production_mode_context.rb
|
|
92
93
|
- lib/ruby_llm/contract/concerns/trace_equality.rb
|
|
93
94
|
- lib/ruby_llm/contract/concerns/usage_aggregator.rb
|
|
94
95
|
- lib/ruby_llm/contract/configuration.rb
|
|
@@ -181,7 +182,6 @@ files:
|
|
|
181
182
|
- lib/ruby_llm/contract/token_estimator.rb
|
|
182
183
|
- lib/ruby_llm/contract/types.rb
|
|
183
184
|
- lib/ruby_llm/contract/version.rb
|
|
184
|
-
- ruby_llm-contract.gemspec
|
|
185
185
|
homepage: https://github.com/justi/ruby_llm-contract
|
|
186
186
|
licenses:
|
|
187
187
|
- MIT
|
data/ruby_llm-contract.gemspec
DELETED
|
@@ -1,35 +0,0 @@
|
|
|
1
|
-
# frozen_string_literal: true
|
|
2
|
-
|
|
3
|
-
require_relative "lib/ruby_llm/contract/version"
|
|
4
|
-
|
|
5
|
-
Gem::Specification.new do |spec|
|
|
6
|
-
spec.name = "ruby_llm-contract"
|
|
7
|
-
spec.version = RubyLLM::Contract::VERSION
|
|
8
|
-
spec.authors = ["Justyna"]
|
|
9
|
-
|
|
10
|
-
spec.summary = "Know which LLM model to use, what it costs, and when accuracy drops"
|
|
11
|
-
spec.description = "Compare LLM models by accuracy and cost. Regression-test prompts in CI. " \
|
|
12
|
-
"Start on nano, auto-escalate to bigger models when quality drops. " \
|
|
13
|
-
"Companion gem for ruby_llm."
|
|
14
|
-
spec.homepage = "https://github.com/justi/ruby_llm-contract"
|
|
15
|
-
spec.license = "MIT"
|
|
16
|
-
spec.required_ruby_version = ">= 3.2.0"
|
|
17
|
-
|
|
18
|
-
spec.metadata["homepage_uri"] = spec.homepage
|
|
19
|
-
spec.metadata["source_code_uri"] = spec.homepage
|
|
20
|
-
spec.metadata["changelog_uri"] = "#{spec.homepage}/blob/main/CHANGELOG.md"
|
|
21
|
-
spec.metadata["documentation_uri"] = "#{spec.homepage}#readme"
|
|
22
|
-
spec.metadata["rubygems_mfa_required"] = "true"
|
|
23
|
-
|
|
24
|
-
spec.files = Dir.chdir(__dir__) do
|
|
25
|
-
`git ls-files -z`.split("\x0").reject do |f|
|
|
26
|
-
(File.expand_path(f) == __FILE__) ||
|
|
27
|
-
f.start_with?("spec/", "docs/", "doc/", ".ai/", ".claude/", ".git")
|
|
28
|
-
end
|
|
29
|
-
end
|
|
30
|
-
spec.require_paths = ["lib"]
|
|
31
|
-
|
|
32
|
-
spec.add_dependency "dry-types", "~> 1.7"
|
|
33
|
-
spec.add_dependency "ruby_llm", "~> 1.0"
|
|
34
|
-
spec.add_dependency "ruby_llm-schema", "~> 0.3"
|
|
35
|
-
end
|