ruby_llm-contract 0.5.2 → 0.6.2
- checksums.yaml +4 -4
- data/CHANGELOG.md +49 -2
- data/Gemfile.lock +2 -2
- data/README.md +173 -134
- data/lib/ruby_llm/contract/concerns/eval_host.rb +25 -6
- data/lib/ruby_llm/contract/eval/model_comparison.rb +33 -11
- data/lib/ruby_llm/contract/eval/recommendation.rb +48 -0
- data/lib/ruby_llm/contract/eval/recommender.rb +132 -0
- data/lib/ruby_llm/contract/eval/report.rb +2 -2
- data/lib/ruby_llm/contract/eval/report_stats.rb +6 -0
- data/lib/ruby_llm/contract/eval/report_storage.rb +18 -12
- data/lib/ruby_llm/contract/eval/retry_optimizer.rb +221 -0
- data/lib/ruby_llm/contract/eval.rb +3 -0
- data/lib/ruby_llm/contract/rake_task.rb +83 -0
- data/lib/ruby_llm/contract/step/base.rb +30 -0
- data/lib/ruby_llm/contract/step/retry_executor.rb +9 -3
- data/lib/ruby_llm/contract/step/retry_policy.rb +27 -14
- data/lib/ruby_llm/contract/version.rb +1 -1
- data/lib/ruby_llm/contract.rb +21 -0
- metadata +4 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 2636c2e59f5fef27f929a94ac9e3194793ce6b51f86cce0c18ca6a5b0caa61ab
+  data.tar.gz: c2c81c7cc8fd281bf6c88738f0fb5bc4bdbb86b71d2fb59248e2b0ebb8d648fe
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d9f2fca592fd3a183d987239dea0cdc2456eed639d6cccaea71e9b1ef3a3ff6e32f3e346f640c905168674e7401ccb85e5f342815a1b290cdebe05a9f7b5374f
+  data.tar.gz: 00b8d113564871db19f88d9061276a0aee0295da7638974804782fda7929cf893a0004aadfeffc2f25f13d421935e0e9d710e553d8e56b7b5d0229f62edd129e
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,52 @@
 # Changelog
 
+## 0.6.2 (2026-04-18)
+
+### Features
+
+- **`Step.optimize_retry_policy`** — runs `compare_models` on ALL evals for the step, builds a score matrix, identifies the constraining eval, and suggests a retry chain. Chain's last model always passes all evals (safe fallback).
+- **`rake ruby_llm_contract:optimize`** — one-command retry chain optimization. Prints score table, constraining eval, suggested chain, and copy-paste DSL.
+- **Offline by default** — `optimize` uses `sample_response` (zero API calls) unless `LIVE=1` or `PROVIDER=` is set.
+- **`EVAL_DIRS=` support** — non-Rails setups can specify eval file directories.
+- **Guide: [Optimizing retry_policy](docs/guide/optimizing_retry_policy.md)** — full procedure with prerequisites, troubleshooting, and real-world example.
+
+### Fixes
+
+- Chain semantics aligned with `retry_executor` — retry fires on `validation_failed`/`parse_error`, not on low eval score. Disjoint eval coverage (A passes e1, B passes e2, neither passes both) correctly returns empty chain.
+- Removed ActiveSupport dependency from rake task (`.presence` → `.empty?`).
+- Added `require "set"` for non-Rails environments.
+
+## 0.6.1 (2026-04-17)
+
+### Features
+
+- **Multi-provider operator tooling** — rake tasks support `PROVIDER=openai|anthropic|ollama`, `CANDIDATES=model@effort,...`, and `REASONING_EFFORT=low|medium|high`.
+- **`rake ruby_llm_contract:recommend`** — wraps `Step.recommend` with CLI interface, prints best config, retry chain, DSL, rationale, and savings.
+- **Ollama support** — `PROVIDER=ollama` with configurable `OLLAMA_API_BASE`.
+
+## 0.6.0 (2026-04-12)
+
+"What should I do?" — model + configuration recommendation.
+
+### Features
+
+- **`Step.recommend`** — `ClassifyTicket.recommend("eval", candidates: [...], min_score: 0.95)` runs eval on all candidates and returns a `Recommendation` with optimal model, retry chain, rationale, savings vs current config, and `to_dsl` code output.
+- **Candidates as configurations** — `candidates:` accepts `{ model:, reasoning_effort: }` hashes, not just model name strings. `gpt-5-mini` with `reasoning_effort: "low"` is a different candidate than with `"high"`.
+- **`compare_models` extended** — new `candidates:` parameter alongside existing `models:` (backward compatible). Candidate labels include reasoning effort in output table.
+- **Per-attempt `reasoning_effort` in retry policies** — `escalate` accepts config hashes: `escalate({ model: "gpt-4.1-nano" }, { model: "gpt-5-mini", reasoning_effort: "high" })`. Each attempt gets its own reasoning_effort forwarded to the provider.
+- **`pass_rate_ratio`** — numeric float (0.0–1.0) on `Report` and `ReportStats`, complementing the string `pass_rate` (`"3/5"`).
+- **History entries enriched** — `save_history!` accepts `reasoning_effort:` and stores `model`, `reasoning_effort`, `pass_rate_ratio` in JSONL entries.
+
+### Game changer continuity
+
+```
+v0.2: "Which model?"            → compare_models (snapshot)
+v0.3: "Did it change?"          → baseline regression (binary)
+v0.4: "Show me the trend"       → eval history (time series)
+v0.5: "Which prompt is better?" → compare_with (A/B testing)
+v0.6: "What should I do?"       → recommend (actionable advice)
+```
+
 ## 0.5.2 (2026-04-06)
 
 ### Features
@@ -8,7 +55,7 @@
 
 ## 0.5.0 (2026-03-25)
 
-Data-Driven Prompt Engineering
+Data-Driven Prompt Engineering.
 
 ### Features
 
@@ -55,7 +102,7 @@ Audit hardening — 18 bugs fixed across 4 audit rounds.
 
 ## 0.4.3 (2026-03-24)
 
-Production feedback release
+Production feedback release.
 
 ### Features
 
|
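The chain semantics in the 0.6.2 notes can be illustrated with a small standalone sketch. This is not the gem's `RetryOptimizer`: the score matrix, the costs, and the `suggest_chain` helper are all hypothetical. The only rules taken from the changelog are that the chain's last model must pass every eval, and that disjoint coverage (A passes e1, B passes e2, neither passes both) yields an empty chain.

```ruby
# Hypothetical score matrix: eval name => { candidate label => score }.
SCORES = {
  "e1" => { "A" => 1.0, "B" => 0.5, "C" => 1.0 },
  "e2" => { "A" => 0.5, "B" => 1.0, "C" => 1.0 }
}.freeze

# Hypothetical cost per eval run for each candidate.
COSTS = { "A" => 0.0001, "B" => 0.0004, "C" => 0.0021 }.freeze

# Illustrative helper, not part of ruby_llm-contract.
def suggest_chain(scores, costs, min_score: 1.0)
  passes = ->(candidate, row) { row.fetch(candidate, 0.0) >= min_score }

  # Candidates that pass every eval are the only safe fallbacks.
  safe = costs.keys.select { |c| scores.values.all? { |row| passes.call(c, row) } }
  return [] if safe.empty?

  fallback = safe.min_by { |c| costs[c] }

  # Cheaper candidates that pass at least one eval go in front of the fallback.
  cheaper = costs.keys.select do |c|
    costs[c] < costs[fallback] && scores.values.any? { |row| passes.call(c, row) }
  end

  cheaper.sort_by { |c| costs[c] } + [fallback]
end

full_chain = suggest_chain(SCORES, COSTS)

# Drop "C" from every row to reproduce the disjoint-coverage case.
disjoint_scores = SCORES.transform_values { |row| row.reject { |label, _| label == "C" } }
disjoint_chain = suggest_chain(disjoint_scores, COSTS)

puts full_chain.inspect
puts disjoint_chain.inspect
```

Putting cheaper partial passers first is a design choice of this sketch. It reflects the note that retries fire on validation or parse failures at run time rather than on eval scores.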
data/Gemfile.lock
CHANGED
@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    ruby_llm-contract (0.
+    ruby_llm-contract (0.6.1)
       dry-types (~> 1.7)
       ruby_llm (~> 1.0)
       ruby_llm-schema (~> 0.3)
@@ -258,7 +258,7 @@ CHECKSUMS
   rubocop-ast (1.49.1) sha256=4412f3ee70f6fe4546cc489548e0f6fcf76cafcfa80fa03af67098ffed755035
   ruby-progressbar (1.13.0) sha256=80fc9c47a9b640d6834e0dc7b3c94c9df37f08cb072b7761e4a71e22cff29b33
   ruby_llm (1.14.0) sha256=57c6f7034fc4a44504ea137d70f853b07824f1c1cdbe774ab3ab3522e7098deb
-  ruby_llm-contract (0.
+  ruby_llm-contract (0.6.1)
   ruby_llm-schema (0.3.0) sha256=a591edc5ca1b7f0304f0e2261de61ba4b3bea17be09f5cf7558153adfda3dec6
   ruby_parser (3.22.0) sha256=1eb4937cd9eb220aa2d194e352a24dba90aef00751e24c8dfffdb14000f15d23
   rubycritic (4.12.0) sha256=024fed90fe656fa939f6ea80aab17569699ac3863d0b52fd72cb99892247abc8
data/README.md
CHANGED
@@ -1,24 +1,89 @@
 # ruby_llm-contract
 
-
+The missing link between LLM cost and quality. Stop choosing between "cheap but wrong" and "accurate but expensive" — get both. Contracts, model escalation, and data-driven recommendations for [ruby_llm](https://github.com/crmne/ruby_llm).
 
-
+```
+YOU WRITE                       THE GEM HANDLES                  YOU GET
+─────────                       ───────────────                  ───────
+
+validate { |o| ... }            catch bad answers — combined     Zero garbage
+                                with retry_policy, auto-retry    in production
 
-
+retry_policy                    start cheap, escalate only       Pay for the cheapest
+models: %w[nano mini full]      when validation fails            model that works
 
-
+max_cost 0.01                   estimate tokens, check price,    No surprise bills
+                                refuse before calling LLM
 
-
+output_schema { ... }           send JSON schema to provider,    Zero parsing code
+                                validate response client-side
+
+define_eval { ... }             test cases + baselines,          Regressions caught
+                                run in CI with real LLM          before deploy
+
+recommend(candidates: [...])    evaluate all configs, pick       Optimal model +
+                                cheapest that passes             retry chain
+```
+
+## Before and after
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│  BEFORE: pick one model, hope for the best                      │
+│                                                                 │
+│  expensive model → accurate, but you overpay on every call      │
+│  cheap model     → fast, but wrong answers slip to production   │
+│  prompt change   → "looks good to me" → deploy → users suffer   │
+└─────────────────────────────────────────────────────────────────┘
+
+                            ⬇ add ruby_llm-contract
+
+┌─────────────────────────────────────────────────────────────────┐
+│  YOU DEFINE A CONTRACT                                          │
+│                                                                 │
+│  output_schema { string :priority }       ← valid structure     │
+│  validate("valid priority") { |o| ... }   ← business rules      │
+│  retry_policy models: %w[nano mini full]  ← escalation chain    │
+│  max_cost 0.01                            ← budget cap          │
+└───────────────────────────┬─────────────────────────────────────┘
+                            │
+                            ▼
+┌─────────────────────────────────────────────────────────────────┐
+│  THE GEM HANDLES THE REST                                       │
+│                                                                 │
+│  request ──→ ┌──────┐   ┌──────────┐                            │
+│              │ nano │─→ │ contract │──→ ✓ pass → done           │
+│              └──────┘   └────┬─────┘                            │
+│                              │ ✗ fail                           │
+│                              ▼                                  │
+│              ┌──────┐   ┌──────────┐                            │
+│              │ mini │─→ │ contract │──→ ✓ pass → done           │
+│              └──────┘   └────┬─────┘                            │
+│                              │ ✗ fail                           │
+│                              ▼                                  │
+│              ┌──────┐   ┌──────────┐                            │
+│              │ full │─→ │ contract │──→ ✓ pass → done           │
+│              └──────┘   └──────────┘                            │
+└───────────────────────────┬─────────────────────────────────────┘
+                            │
+                            ▼
+┌─────────────────────────────────────────────────────────────────┐
+│  YOU GET                                                        │
+│                                                                 │
+│  ✓ Valid output guaranteed — schema + business rules enforced   │
+│  ✓ Cheapest model that works — most requests stay on nano       │
+│  ✓ Cost, latency, tokens — tracked on every call                │
+│  ✓ Eval scores per model — data instead of gut feeling          │
+│  ✓ Regressions caught — before deploy, not after                │
+│  ✓ Recommendation — "use nano+mini, drop full, save $X/mo"      │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+## 30-second version
 
 ```ruby
 class ClassifyTicket < RubyLLM::Contract::Step::Base
-  prompt
-    system "You are a support ticket classifier."
-    rule "Return valid JSON only, no markdown."
-    rule "Use exactly one priority: low, medium, high, urgent."
-    example input: "My invoice is wrong", output: '{"priority": "high"}'
-    user "{input}"
-  end
+  prompt "Classify this support ticket by priority and category.\n\n{input}"
 
   output_schema do
     string :priority, enum: %w[low medium high urgent]
@@ -30,29 +95,45 @@ class ClassifyTicket < RubyLLM::Contract::Step::Base
 end
 
 result = ClassifyTicket.run("I was charged twice")
-result.
-result.
-result.trace[:cost]
-result.trace[:model] # => "gpt-4.1-nano"
+result.parsed_output # => {priority: "high", category: "billing"}
+result.trace[:model] # => "gpt-4.1-nano" (first model that passed)
+result.trace[:cost] # => 0.000032
 ```
 
-Bad JSON?
+Bad JSON? Retried automatically. Wrong answer? Escalated to a smarter model. Schema violated? Caught client-side. The contract guarantees every response meets your rules — you pay for the cheapest model that passes.
 
-##
+## Install
 
-
+```ruby
+gem "ruby_llm-contract"
+```
+
+```ruby
+RubyLLM.configure { |c| c.openai_api_key = ENV["OPENAI_API_KEY"] }
+RubyLLM::Contract.configure { |c| c.default_model = "gpt-4.1-mini" }
+```
+
+Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
 
-
-
--
+## Save money with model escalation
+
+Without a contract, you use gpt-4.1 for everything because you can't tell when a cheaper model gets it wrong. With a contract, you start on nano and only escalate when the answer fails the contract:
+
+```ruby
+retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
+```
 
-
+```
+Attempt 1: gpt-4.1-nano → contract failed ($0.0001)
+Attempt 2: gpt-4.1-mini → contract passed ($0.0004)
+           gpt-4.1      → never called    ($0.00)
+```
 
-
+Most requests succeed on the cheapest model. You pay full price only for the ones that need it. How many? Run `compare_models` and find out.
 
-##
+## Know which model to use — with data
 
-Define test cases
+Don't guess. Define test cases, compare models, get numbers:
 
 ```ruby
 ClassifyTicket.define_eval("regression") do
@@ -62,170 +143,127 @@ ClassifyTicket.define_eval("regression") do
 end
 
 comparison = ClassifyTicket.compare_models("regression",
-  models: %w[gpt-4.1-nano gpt-4.1-mini])
+  models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1])
 ```
 
-Real output from real API calls:
-
 ```
-
+Candidate                   Score       Cost  Avg Latency
 ---------------------------------------------------------
-gpt-4.1-nano 0.67 $0.
-gpt-4.1-mini 1.00 $0.
+gpt-4.1-nano                 0.67    $0.0001         48ms
+gpt-4.1-mini                 1.00    $0.0004         92ms
+gpt-4.1                      1.00    $0.0021        210ms
 
 Cheapest at 100%: gpt-4.1-mini
 ```
 
-
-comparison.best_for(min_score: 0.95) # => "gpt-4.1-mini"
-
-# Inspect failures
-comparison.reports["gpt-4.1-nano"].failures.each do |f|
-  puts "#{f.name}: expected #{f.expected}, got #{f.output}"
-  puts " mismatches: #{f.mismatches}"
-  # => outage: expected {priority: "urgent"}, got {priority: "high"}
-  # mismatches: {priority: {expected: "urgent", got: "high"}}
-end
-```
+Nano fails on edge cases. Mini and full both score 100% — but mini is **5x cheaper**. Now you know.
 
-##
+## Let the gem tell you what to do
 
-
+Don't read tables — get a recommendation. Supports `model + reasoning_effort` combinations:
 
 ```ruby
-
-
-
-
-
-
-
-
-
-
+rec = ClassifyTicket.recommend("regression",
+  candidates: [
+    { model: "gpt-4.1-nano" },
+    { model: "gpt-4.1-mini" },
+    { model: "gpt-5-mini", reasoning_effort: "low" },
+    { model: "gpt-5-mini", reasoning_effort: "high" },
+  ],
+  min_score: 0.95
+)
+
+rec.best # => { model: "gpt-4.1-mini" }
+rec.retry_chain # => [{ model: "gpt-4.1-nano" }, { model: "gpt-4.1-mini" }]
+rec.to_dsl # => "retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]"
+rec.savings # => savings vs your current model (if configured)
 ```
 
-
-
-```ruby
-# RSpec — block merge if accuracy drops or cost spikes
-expect(ClassifyTicket).to pass_eval("regression")
-  .with_context(model: "gpt-4.1-mini")
-  .with_minimum_score(0.8)
-  .with_maximum_cost(0.01)
-
-# Rake — run all evals across all steps
-require "ruby_llm/contract/rake_task"
-RubyLLM::Contract::RakeTask.new do |t|
-  t.minimum_score = 0.8
-  t.maximum_cost = 0.05
-end
-# bundle exec rake ruby_llm_contract:eval
-```
+Copy `rec.to_dsl` into your step. Done.
 
-##
+## Catch regressions before users do
 
-
+A model update silently dropped your accuracy? A prompt tweak broke an edge case? You'll know before deploying:
 
 ```ruby
+# Save a baseline once:
 report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
 report.save_baseline!(model: "gpt-4.1-nano")
 
-#
-report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
-diff = report.compare_with_baseline(model: "gpt-4.1-nano")
-
-diff.regressed? # => true
-diff.regressions # => [{case: "outage", baseline: {passed: true}, current: {passed: false}}]
-diff.score_delta # => -0.33
-```
-
-```ruby
-# CI: block merge if any previously-passing case now fails
+# In CI — block merge if anything regressed:
 expect(ClassifyTicket).to pass_eval("regression")
   .with_context(model: "gpt-4.1-nano")
   .without_regressions
 ```
 
-## Track quality over time
-
 ```ruby
-
-
-
-
-# View trend
-history = report.eval_history(model: "gpt-4.1-nano")
-history.score_trend # => :stable_or_improving | :declining
-history.drift? # => true (score dropped > 10%)
+diff = report.compare_with_baseline(model: "gpt-4.1-nano")
+diff.regressed? # => true
+diff.regressions # => [{case: "outage", baseline: {passed: true}, current: {passed: false}}]
+diff.score_delta # => -0.33
 ```
 
-
-
-```ruby
-# 4x faster with parallel execution
-report = ClassifyTicket.run_eval("regression",
-  context: { model: "gpt-4.1-nano" },
-  concurrency: 4)
-```
+No more "it worked in the playground". Regressions are caught in CI, not production.
 
-##
+## A/B test your prompts
 
-Changed a prompt? Compare old vs new with regression safety:
+Changed a prompt? Compare old vs new on the same dataset with regression safety:
 
 ```ruby
 diff = ClassifyTicketV2.compare_with(ClassifyTicketV1,
   eval: "regression", model: "gpt-4.1-mini")
 
-diff.safe_to_switch? # => true (no regressions
+diff.safe_to_switch? # => true (no regressions)
 diff.improvements # => [{case: "outage", ...}]
 diff.score_delta # => +0.33
 ```
 
-Requires `model:` or `context: { adapter: ... }`.
-`compare_with` ignores `sample_response`; without a real model/adapter both sides are skipped and the A/B result is not meaningful.
-
-CI gate:
 ```ruby
+# CI gate:
 expect(ClassifyTicketV2).to pass_eval("regression")
   .compared_with(ClassifyTicketV1)
   .with_minimum_score(0.8)
 ```
 
-##
+## Chain steps with fail-fast
 
-
+Pipeline stops at the first contract failure — no wasted tokens on downstream steps:
 
 ```ruby
-class
-
-
+class TicketPipeline < RubyLLM::Contract::Pipeline::Base
+  step ClassifyTicket, as: :classify
+  step RouteToTeam, as: :route
+  step DraftResponse, as: :draft
 end
 
-result =
-result.
-result.
+result = TicketPipeline.run("I was charged twice")
+result.outputs_by_step[:classify] # => {priority: "high", category: "billing"}
+result.trace.total_cost # => $0.000128
 ```
 
-##
+## Gate merges on quality and cost
 
 ```ruby
-
-
+# RSpec — block merge if accuracy drops or cost spikes
+expect(ClassifyTicket).to pass_eval("regression")
+  .with_minimum_score(0.8)
+  .with_maximum_cost(0.01)
+
+# Rake — run all evals across all steps
+RubyLLM::Contract::RakeTask.new do |t|
+  t.minimum_score = 0.8
+  t.maximum_cost = 0.05
+end
+# bundle exec rake ruby_llm_contract:eval
 ```
 
-##
+## Full power: data-driven retry chains
 
-
-gem "ruby_llm-contract"
-```
+The pieces above — evals, compare_models, recommend — combine into a workflow that replaces guesswork with measured optimization. You define evals for your step, run `recommend` against all of them, find the eval that actually needs the strongest model, and build a retry chain where each attempt is as cheap as the data allows.
 
-
-RubyLLM.configure { |c| c.openai_api_key = ENV["OPENAI_API_KEY"] }
-RubyLLM::Contract.configure { |c| c.default_model = "gpt-4.1-mini" }
-```
+The difference: instead of "gpt-5-mini seems to work, let's use it everywhere", you get "nano handles 4/6 scenarios, mini@low catches the 5th, full mini only fires on the hardest edge case — first attempt is 4× cheaper."
 
-
+Full procedure with examples: **[Optimizing retry_policy](docs/guide/optimizing_retry_policy.md)**
 
 ## Docs
 
@@ -233,6 +271,7 @@ Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
 |-------|-|
 | [Getting Started](docs/guide/getting_started.md) | Features walkthrough, model escalation, eval |
 | [Eval-First](docs/guide/eval_first.md) | Practical workflow for prompt engineering with datasets, baselines, and A/B gates |
+| [Optimizing retry_policy](docs/guide/optimizing_retry_policy.md) | Find the cheapest retry chain that passes all your evals |
 | [Best Practices](docs/guide/best_practices.md) | 6 patterns for bulletproof validates |
 | [Output Schema](docs/guide/output_schema.md) | Full schema reference + constraints |
 | [Pipeline](docs/guide/pipeline.md) | Multi-step composition, timeout, fail-fast |
@@ -241,13 +280,13 @@ Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
 
 ## Roadmap
 
-**v0.
+**v0.6 (current):** "What should I do?" — `Step.recommend` returns optimal model, reasoning effort, and retry chain. Per-attempt `reasoning_effort` in retry policies.
 
-**v0.
+**v0.5:** Prompt A/B testing with `compare_with`. Soft observations with `observe`.
 
-**v0.
+**v0.4:** Eval history, batch concurrency, pipeline per-step eval, Minitest, structured logging.
 
-**v0.
+**v0.3:** Baseline regression detection, migration guide.
 
 ## License
 
|
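The escalation flow that the README diff describes (start on the cheapest model, move up only when the contract fails) can be sketched in plain Ruby without the gem. Everything here is an illustrative stand-in: the lambdas play the role of model calls, the costs reuse the example figures from the README, and `run_with_escalation` is a hypothetical helper, not the gem's `retry_executor`.

```ruby
# Illustrative stand-ins: each "model" is a lambda plus an example per-call cost.
CHAIN = [
  { name: "gpt-4.1-nano", cost: 0.0001, call: ->(_input) { { priority: "critical" } } },
  { name: "gpt-4.1-mini", cost: 0.0004, call: ->(_input) { { priority: "high" } } },
  { name: "gpt-4.1",      cost: 0.0021, call: ->(_input) { { priority: "high" } } }
].freeze

VALID_PRIORITIES = %w[low medium high urgent].freeze

# The "contract" is reduced to a single predicate over the parsed output.
CONTRACT = ->(output) { VALID_PRIORITIES.include?(output[:priority]) }

Attempt = Struct.new(:model, :output, :passed, :cost)

# Try each model in order and stop at the first output that satisfies the contract.
def run_with_escalation(chain, input, contract)
  attempts = []
  chain.each do |model|
    output = model[:call].call(input)
    passed = contract.call(output)
    attempts << Attempt.new(model[:name], output, passed, model[:cost])
    break if passed
  end
  attempts
end

attempts = run_with_escalation(CHAIN, "I was charged twice", CONTRACT)
attempts.each do |a|
  puts format("%-14s %-6s $%.4f", a.model, a.passed ? "passed" : "failed", a.cost)
end
total_cost = attempts.sum(&:cost)
```

In this run the stubbed nano answer violates the enum, mini passes, and the most expensive model is never called, so the total is the sum of the first two attempts only.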
data/lib/ruby_llm/contract/concerns/eval_host.rb
CHANGED
@@ -70,18 +70,37 @@ module RubyLLM
           Eval::PromptDiff.new(candidate: my_report, baseline: other_report)
         end
 
-        def compare_models(eval_name, models
+        def compare_models(eval_name, models: [], candidates: [], context: {})
+          raise ArgumentError, "Pass either models: or candidates:, not both" if models.any? && candidates.any?
+
           context = safe_context(context)
-
-
-
-
+          candidate_configs = normalize_candidates(models, candidates)
+
+          reports = {}
+          configs = {}
+          candidate_configs.each do |config|
+            label = Eval::ModelComparison.candidate_label(config)
+            model_context = isolate_context(context).merge(model: config[:model])
+            model_context[:reasoning_effort] = config[:reasoning_effort] if config[:reasoning_effort]
+            reports[label] = run_single_eval(eval_name, model_context)
+            configs[label] = config
           end
-
+
+          Eval::ModelComparison.new(eval_name: eval_name, reports: reports, configs: configs)
         end
 
         private
 
+        def normalize_candidates(models, candidates)
+          if candidates.any?
+            candidates.map { |c| RubyLLM::Contract.normalize_candidate_config(c) }.uniq
+          elsif models.any?
+            models.uniq.map { |m| { model: m } }
+          else
+            raise ArgumentError, "Pass models: or candidates: with at least one entry"
+          end
+        end
+
         def comparison_context(context, model)
           base_context = safe_context(context)
           model ? base_context.merge(model: model) : base_context
|
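The candidate handling in the `eval_host.rb` hunk can be exercised on its own with the sketch below. The branching follows the diff, but `normalize_candidate_config` is a hypothetical stand-in, since the body of `RubyLLM::Contract.normalize_candidate_config` is not shown in this excerpt. The assumption that it accepts strings and string-keyed hashes is mine, and `candidate_configs` is a made-up wrapper that folds in the "not both" guard from `compare_models`.

```ruby
# Hypothetical stand-in for RubyLLM::Contract.normalize_candidate_config;
# the real method's body is not part of this diff excerpt.
def normalize_candidate_config(candidate)
  case candidate
  when String
    { model: candidate }
  when Hash
    config = candidate.transform_keys(&:to_sym).slice(:model, :reasoning_effort).compact
    raise ArgumentError, "candidate needs a :model" unless config[:model]

    config
  else
    raise ArgumentError, "unsupported candidate: #{candidate.inspect}"
  end
end

# Same branching as the hunk, with the models:/candidates: exclusivity check included.
def candidate_configs(models: [], candidates: [])
  raise ArgumentError, "Pass either models: or candidates:, not both" if models.any? && candidates.any?

  if candidates.any?
    candidates.map { |c| normalize_candidate_config(c) }.uniq
  elsif models.any?
    models.uniq.map { |m| { model: m } }
  else
    raise ArgumentError, "Pass models: or candidates: with at least one entry"
  end
end

from_models = candidate_configs(models: %w[gpt-4.1-nano gpt-4.1-nano gpt-4.1-mini])
from_candidates = candidate_configs(candidates: [
  { model: "gpt-5-mini", reasoning_effort: "low" },
  { "model" => "gpt-5-mini", "reasoning_effort" => "low" },
  { model: "gpt-5-mini", reasoning_effort: "high" }
])

puts from_models.inspect
puts from_candidates.inspect
```

Deduplicating after normalization means two spellings of the same configuration collapse into one eval run, while the same model at a different `reasoning_effort` stays a separate candidate.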
data/lib/ruby_llm/contract/eval/model_comparison.rb
CHANGED
@@ -4,11 +4,17 @@ module RubyLLM
   module Contract
     module Eval
       class ModelComparison
-        attr_reader :eval_name, :reports
+        attr_reader :eval_name, :reports, :configs
 
-        def
+        def self.candidate_label(config)
+          effort = config[:reasoning_effort]
+          effort ? "#{config[:model]} (effort: #{effort})" : config[:model]
+        end
+
+        def initialize(eval_name:, reports:, configs: nil)
           @eval_name = eval_name
-          @reports = reports.dup.freeze
+          @reports = reports.dup.freeze
+          @configs = (configs || default_configs_from_reports).freeze
           freeze
         end
 
@@ -16,12 +22,12 @@ module RubyLLM
           @reports.keys
         end
 
-        def score_for(
-          @reports[
+        def score_for(candidate)
+          @reports[resolve_key(candidate)]&.score
         end
 
-        def cost_for(
-          @reports[
+        def cost_for(candidate)
+          @reports[resolve_key(candidate)]&.total_cost
         end
 
         def best_for(min_score: 0.0)
@@ -38,13 +44,14 @@ module RubyLLM
         end
 
         def table
-
-          lines
+          max_label = [@reports.keys.map(&:length).max || 0, 25].max
+          lines = [format(" %-#{max_label}s Score Cost Avg Latency", "Candidate")]
+          lines << " #{"-" * (max_label + 36)}"
 
-          @reports.each do |
+          @reports.each do |label, report|
             latency = report.avg_latency_ms ? "#{report.avg_latency_ms.round}ms" : "n/a"
             cost = report.total_cost.positive? ? "$#{format("%.4f", report.total_cost)}" : "n/a"
-            lines << format("
+            lines << format(" %-#{max_label}s %6.2f %10s %12s", label, report.score, cost, latency)
           end
 
           lines.join("\n")
@@ -70,10 +77,25 @@ module RubyLLM
               total_cost: report.total_cost,
               avg_latency_ms: report.avg_latency_ms,
               pass_rate: report.pass_rate,
+              pass_rate_ratio: report.pass_rate_ratio,
               passed: report.passed?
             }
           end
         end
+
+        private
+
+        def resolve_key(candidate)
+          case candidate
+          when String then candidate
+          when Hash then self.class.candidate_label(candidate)
+          else candidate.to_s
+          end
+        end
+
+        def default_configs_from_reports
+          @reports.each_with_object({}) { |(key, _), h| h[key] = { model: key } }
+        end
       end
     end
   end
|
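The label and lookup logic added to `ModelComparison` is small enough to try outside the class. The sketch below copies `candidate_label`, `resolve_key`, and the label-width rule from the hunk into top-level methods. The `scores` hash with made-up values stands in for the reports, and `label_width` is a name introduced here for the expression used inside `table`.

```ruby
# Copied from the hunk: the label includes the effort only when one is set.
def candidate_label(config)
  effort = config[:reasoning_effort]
  effort ? "#{config[:model]} (effort: #{effort})" : config[:model]
end

# Copied from the hunk: strings pass through, hashes become labels,
# anything else is stringified.
def resolve_key(candidate)
  case candidate
  when String then candidate
  when Hash then candidate_label(candidate)
  else candidate.to_s
  end
end

# Width rule used by `table`: the widest label, but never narrower than 25.
def label_width(labels)
  [labels.map(&:length).max || 0, 25].max
end

# Made-up scores keyed by label, standing in for the reports hash.
scores = {
  candidate_label({ model: "gpt-4.1-nano" }) => 0.67,
  candidate_label({ model: "gpt-5-mini", reasoning_effort: "low" }) => 0.83
}

by_hash = scores[resolve_key({ model: "gpt-5-mini", reasoning_effort: "low" })]
by_string = scores[resolve_key("gpt-4.1-nano")]
by_symbol = scores[resolve_key(:"gpt-4.1-nano")]
missing = scores[resolve_key({ model: "gpt-5-mini", reasoning_effort: "high" })]
width = label_width(scores.keys)

puts scores.keys.inspect
```

Because lookups go through the label, the same model at a different effort level resolves to a different key, which is why the last lookup finds nothing.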