ruby_llm-contract 0.10.1 → 0.10.2
- checksums.yaml +4 -4
- data/CHANGELOG.md +15 -1
- data/Gemfile.lock +2 -2
- data/README.md +2 -2
- data/docs/architecture.md +50 -0
- data/docs/guide/best_practices.md +136 -0
- data/docs/guide/eval_first.md +192 -0
- data/docs/guide/getting_started.md +199 -0
- data/docs/guide/migration.md +185 -0
- data/docs/guide/multimodal_input.md +160 -0
- data/docs/guide/optimizing_retry_policy.md +131 -0
- data/docs/guide/output_schema.md +93 -0
- data/docs/guide/pipeline.md +154 -0
- data/docs/guide/prompt_ast.md +76 -0
- data/docs/guide/rails_integration.md +218 -0
- data/docs/guide/relation_to_agent.md +52 -0
- data/docs/guide/relation_to_tribunal.md +135 -0
- data/docs/guide/testing.md +282 -0
- data/docs/guide/why.md +103 -0
- data/lib/ruby_llm/contract/version.rb +1 -1
- data/ruby_llm-contract.gemspec +1 -1
- metadata +16 -1
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: ac8c19463285a3e2c0f9050e835125454b21d2b1500dcb5bad2dd4a114666bd0
|
|
4
|
+
data.tar.gz: 339c0609e3dccbf55da649c3470ed234c5abafa55c8649aee5d8dd41f9bd82ec
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: c04fea66393868fbdb369f5a75eaeb66a28172c8f9ddabd0a8c053fd55e5daa06174aeeb00c4bb7e9aabaf613bd1ef4ea354ff7d0f4b817f2ec41af1a4c61667
|
|
7
|
+
data.tar.gz: 71f502ebe497ede22df508d3b56396660fbd1eba109066885d5e905724bd82d30a0d46fddfe4f1c1e7efe47f95ea41b4f61341a4698c12d3f3a397cab2510068
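A digest like the ones listed above can be compared against a local file with Ruby's standard `Digest` library. This is a minimal sketch: the helper name `sha256_matches?` is hypothetical, and the demo hashes a throwaway temp file rather than the real `data.tar.gz`.

```ruby
require "digest"
require "tempfile"

# Returns true when the file's SHA256 hex digest equals the expected value.
def sha256_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.downcase
end

# Demo on a temp file, since the real archive is not available here.
Tempfile.create("demo") do |f|
  f.write("hello")
  f.flush
  puts sha256_matches?(f.path, Digest::SHA256.hexdigest("hello")) # true
end
```

For the actual gem, the path would point at an unpacked `data.tar.gz` and the expected value would be the SHA256 entry from `checksums.yaml`.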
|
data/CHANGELOG.md
CHANGED
|
@@ -1,5 +1,19 @@
|
|
|
1
1
|
# Changelog
|
|
2
2
|
|
|
3
|
+
## 0.10.2 (2026-06-10)
|
|
4
|
+
|
|
5
|
+
Patch release: ship the `docs/guide/` directory inside the gem so adopters and LLM integration agents can read the manuals locally (via `bundle show ruby_llm-contract` or `gem unpack`) without an internet round-trip to GitHub. No code behavior change.
|
|
6
|
+
|
|
7
|
+
### Added
|
|
8
|
+
|
|
9
|
+
- **`docs/guide/*` is now packaged with the gem** (14 files, ~120 KB). Previously the README's "See also" links pointed at `docs/guide/getting_started.md`, `docs/guide/optimizing_retry_policy.md`, etc., but those files were stripped from the published gem - LLM integration agents (Cursor, Claude Code, Copilot) reported "no documentation in the gem" because the links 404'd locally. `docs/ideas/` and `doc/decisions/` remain excluded.
|
|
10
|
+
|
|
11
|
+
### Fixed
|
|
12
|
+
|
|
13
|
+
- **`models:` keyword form documented + pinned for hash configs.** `retry_policy models: ["gpt-5-nano", { model: "gpt-5-mini", reasoning_effort: "high" }]` is now covered by a spec and shown in [getting_started.md](docs/guide/getting_started.md). The block form (`escalate(...)`) and the keyword form share the same `@configs` storage; both forms accept config hashes.
|
|
14
|
+
- **`reasoning_effort` examples corrected to gpt-5 family.** Pre-0.10.2 docs and specs paired `reasoning_effort` with `gpt-4.1-*` model names, which is incorrect: `gpt-4.1` is not a reasoning model. Updated across `docs/guide/getting_started.md`, `docs/guide/optimizing_retry_policy.md`, `CHANGELOG.md`, and 7 spec files. Non-reasoning `gpt-4.1` examples (model fallback chains without `reasoning_effort`) are unchanged.
|
|
15
|
+
- **Version mentions corrected in README and multimodal guide.** README FAQ ("Upgraded to 0.9.0 - why?") now reads "Upgraded to 0.10.0 from 0.8.x"; `docs/guide/multimodal_input.md` references `0.10.0+` and `0.10.x` instead of `0.9.0`. 0.9.0 and 0.9.1 were tagged but never published to rubygems; adopters jump from 0.8.0 directly to 0.10.x.
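The release note above says the guides can now be read from the installed gem. A minimal sketch of locating them from Ruby, assuming the `docs/guide/*.md` layout described in this entry (the helper name `packaged_guides` is hypothetical):

```ruby
require "rubygems"

# Lists the guide files shipped inside an installed gem. Returns [] when the
# gem is not installed or ships no docs/guide directory.
def packaged_guides(gem_name)
  spec = Gem::Specification.find_by_name(gem_name)
  Dir.glob(File.join(spec.gem_dir, "docs", "guide", "*.md")).sort
rescue Gem::LoadError
  []
end

puts packaged_guides("ruby_llm-contract")
```

On a machine without the gem this prints nothing, which mirrors the pre-0.10.2 situation where the files were absent from the package.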
|
|
16
|
+
|
|
3
17
|
## 0.10.1 (2026-06-01)
|
|
4
18
|
|
|
5
19
|
Patch release fixing gem packaging. 0.10.0 was yanked from rubygems.org due to the issue documented below; 0.10.1 is the recommended upgrade target. No code behavior change vs 0.10.0.
|
|
@@ -254,7 +268,7 @@ end
|
|
|
254
268
|
- **`Step.recommend`** — `ClassifyTicket.recommend("eval", candidates: [...], min_score: 0.95)` runs eval on all candidates and returns a `Recommendation` with optimal model, retry chain, rationale, savings vs current config, and `to_dsl` code output.
|
|
255
269
|
- **Candidates as configurations** — `candidates:` accepts `{ model:, reasoning_effort: }` hashes, not just model name strings. `gpt-5-mini` with `reasoning_effort: "low"` is a different candidate than with `"high"`.
|
|
256
270
|
- **`compare_models` extended** — new `candidates:` parameter alongside existing `models:` (backward compatible). Candidate labels include reasoning effort in output table.
|
|
257
|
-
- **Per-attempt `reasoning_effort` in retry policies** — `escalate` accepts config hashes: `escalate({ model: "gpt-
|
|
271
|
+
- **Per-attempt `reasoning_effort` in retry policies** — `escalate` accepts config hashes: `escalate({ model: "gpt-5-nano" }, { model: "gpt-5-mini", reasoning_effort: "high" })`. Each attempt gets its own reasoning_effort forwarded to the provider.
|
|
258
272
|
- **`pass_rate_ratio`** — numeric float (0.0–1.0) on `Report` and `ReportStats`, complementing the string `pass_rate` (`"3/5"`).
|
|
259
273
|
- **History entries enriched** — `save_history!` accepts `reasoning_effort:` and stores `model`, `reasoning_effort`, `pass_rate_ratio` in JSONL entries.
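To illustrate the relationship between the string `pass_rate` and the numeric `pass_rate_ratio` mentioned above, here is a plain-Ruby sketch. It is not the gem's implementation; the standalone function and its zero-total handling are assumptions.

```ruby
# Converts a "passed/total" string such as "3/5" into a float in 0.0..1.0.
def pass_rate_ratio(pass_rate)
  passed, total = pass_rate.split("/").map { |n| Integer(n) }
  return 0.0 if total.zero? # assumed behaviour for an empty dataset

  passed.fdiv(total)
end

puts pass_rate_ratio("3/5") # 0.6
```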
|
|
260
274
|
|
data/Gemfile.lock
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
PATH
|
|
2
2
|
remote: .
|
|
3
3
|
specs:
|
|
4
|
-
ruby_llm-contract (0.10.
|
|
4
|
+
ruby_llm-contract (0.10.2)
|
|
5
5
|
dry-types (~> 1.7)
|
|
6
6
|
ruby_llm (~> 1.12)
|
|
7
7
|
ruby_llm-schema (~> 0.3)
|
|
@@ -258,7 +258,7 @@ CHECKSUMS
|
|
|
258
258
|
rubocop-ast (1.49.1) sha256=4412f3ee70f6fe4546cc489548e0f6fcf76cafcfa80fa03af67098ffed755035
|
|
259
259
|
ruby-progressbar (1.13.0) sha256=80fc9c47a9b640d6834e0dc7b3c94c9df37f08cb072b7761e4a71e22cff29b33
|
|
260
260
|
ruby_llm (1.14.0) sha256=57c6f7034fc4a44504ea137d70f853b07824f1c1cdbe774ab3ab3522e7098deb
|
|
261
|
-
ruby_llm-contract (0.10.
|
|
261
|
+
ruby_llm-contract (0.10.2)
|
|
262
262
|
ruby_llm-schema (0.3.0) sha256=a591edc5ca1b7f0304f0e2261de61ba4b3bea17be09f5cf7558153adfda3dec6
|
|
263
263
|
ruby_parser (3.22.0) sha256=1eb4937cd9eb220aa2d194e352a24dba90aef00751e24c8dfffdb14000f15d23
|
|
264
264
|
rubycritic (4.12.0) sha256=024fed90fe656fa939f6ea80aab17569699ac3863d0b52fd72cb99892247abc8
|
data/README.md
CHANGED
|
@@ -157,7 +157,7 @@ Different layers, complementary. [`ruby_llm-tribunal`](https://github.com/Alqemi
|
|
|
157
157
|
|
|
158
158
|
## Status & versioning
|
|
159
159
|
|
|
160
|
-
Pre-1.0 (currently **0.10.
|
|
160
|
+
Pre-1.0 (currently **0.10.2**). Semver tracked; breaking changes flagged in [CHANGELOG](CHANGELOG.md). Pin `~> 0.10.2` until 1.0 ships.
|
|
161
161
|
|
|
162
162
|
## FAQ
|
|
163
163
|
|
|
@@ -167,7 +167,7 @@ Pre-1.0 (currently **0.10.1**). Semver tracked; breaking changes flagged in [CHA
|
|
|
167
167
|
|
|
168
168
|
**Where in a Rails app?** Default `app/contracts/`. The Railtie reloads `app/contracts/eval/` and `app/steps/eval/` in development; any autoloaded directory also works. See [Rails integration](docs/guide/rails_integration.md).
|
|
169
169
|
|
|
170
|
-
**Upgraded to 0.
|
|
170
|
+
**Upgraded to 0.10.0 from 0.8.x and my contract started refusing — why?** 0.10.0 added multimodal input. If your contract has `max_cost` or `max_input` set AND now receives `context: { attachment: ... }`, you must declare `attachment_token_estimate(n)` (conservative input-token budget for the attachment) — otherwise the call fails closed with `:limit_exceeded`. The gem cannot bound vision/PDF cost without your estimate. Opt out per-step with `on_unknown_attachment_size :warn`. Text-only contracts are unaffected. See [multimodal input guide](docs/guide/multimodal_input.md).
|
|
171
171
|
|
|
172
172
|
## License
|
|
173
173
|
|
|
data/docs/architecture.md
|
@@ -0,0 +1,50 @@
|
|
|
1
|
+
# Architecture
|
|
2
|
+
|
|
3
|
+
```
|
|
4
|
+
RubyLLM::Contract::Pipeline::Base # optional: compose steps
|
|
5
|
+
├── Pipeline::Runner # sequential execution, fail-fast, trace, timeout
|
|
6
|
+
└── Pipeline::Result # per-step outputs + aggregated trace
|
|
7
|
+
|
|
8
|
+
RubyLLM::Contract::Step::Base # single contracted step
|
|
9
|
+
├── Step::Dsl # DSL macros (prompt, validate, output_schema, etc.)
|
|
10
|
+
├── Step::RetryPolicy # attempts, models, reasoning_effort, retry_on
|
|
11
|
+
├── Step::RetryExecutor # retry loop driven by RetryPolicy
|
|
12
|
+
├── Step::LimitChecker # preflight cost / input / output checks
|
|
13
|
+
├── Step::Runner # runtime flow (single attempt)
|
|
14
|
+
├── Step::Result # status + outputs + errors + trace
|
|
15
|
+
├── Step::Trace # model, latency, tokens, cost, attempts
|
|
16
|
+
├── Prompt::AST # structured prompt (immutable)
|
|
17
|
+
│ ├── Prompt::Builder # DSL: system, rule, example, user, section
|
|
18
|
+
│ └── Prompt::Renderer # AST → messages array
|
|
19
|
+
├── Contract::Definition # parse strategy + validates
|
|
20
|
+
│ ├── Contract::Parser # :json / :text (auto-inferred from output type)
|
|
21
|
+
│ ├── Contract::Validator # runs parse + schema + validates + observations
|
|
22
|
+
│ └── Contract::SchemaValidator # JSON Schema validation (nested)
|
|
23
|
+
├── CostCalculator # per-step cost estimation from model pricing
|
|
24
|
+
├── TokenEstimator # input-token count estimation for limit checks
|
|
25
|
+
├── estimate_cost # single-call cost estimate (class method)
|
|
26
|
+
├── estimate_eval_cost # cost estimate for a full eval across models
|
|
27
|
+
└── Adapters::Base # provider interface
|
|
28
|
+
├── Adapters::RubyLLM # real LLM calls via ruby_llm (any provider)
|
|
29
|
+
└── Adapters::Test # canned responses for specs and examples
|
|
30
|
+
|
|
31
|
+
RubyLLM::Contract::Eval # quality measurement
|
|
32
|
+
├── Eval::EvalDefinition # define_eval DSL (verify, add_case, default_input, sample_response)
|
|
33
|
+
├── Eval::Dataset # test cases
|
|
34
|
+
├── Eval::Runner # sequential or concurrent execution
|
|
35
|
+
├── Eval::Report # score, pass_rate, per-case results
|
|
36
|
+
├── Eval::AggregatedReport # merged reports across models or runs
|
|
37
|
+
├── Eval::CaseResult # value object (name, passed?, output, expected, mismatches, cost)
|
|
38
|
+
├── Eval::ExpectationEvaluator # expected / expected_traits / evaluator proc
|
|
39
|
+
├── Eval::ModelComparison # compare_models result (table, best_for, candidate configs)
|
|
40
|
+
├── Eval::Recommender # model recommendation algorithm (candidates → optimal config)
|
|
41
|
+
├── Eval::Recommendation # recommendation result (best, retry_chain, savings, to_dsl)
|
|
42
|
+
├── Eval::RetryOptimizer # optimize_retry_policy result (per-eval breakdown, fallback list)
|
|
43
|
+
├── Eval::BaselineDiff # save_baseline! + without_regressions comparison
|
|
44
|
+
├── Eval::PromptDiffComparator # compare_with prompt A/B diff
|
|
45
|
+
└── Eval::EvalHistory # time-series view across saved reports
|
|
46
|
+
|
|
47
|
+
RubyLLM::Contract::RakeTask # rake ruby_llm_contract:eval
|
|
48
|
+
RubyLLM::Contract::OptimizeRakeTask # rake ruby_llm_contract:optimize
|
|
49
|
+
RubyLLM::Contract::Railtie # auto-loads eval files in Rails
|
|
50
|
+
```
|
|
data/docs/guide/best_practices.md
|
@@ -0,0 +1,136 @@
|
|
|
1
|
+
# Best Practices
|
|
2
|
+
|
|
3
|
+
> Read this when you are writing your first `validate` rules and want patterns that catch real bugs instead of restating the schema.
|
|
4
|
+
|
|
5
|
+
Schema guarantees valid JSON structure. An LLM can still return structurally perfect JSON that is **semantically wrong**. Schema handles _shape_; validates handle _meaning_.
|
|
6
|
+
|
|
7
|
+
All examples extend the `SummarizeArticle` step from the [README](../../README.md).
|
|
8
|
+
|
|
9
|
+
## 1. Guard against empty / placeholder output
|
|
10
|
+
|
|
11
|
+
**Why it matters:** a cheap model that answers `{"tldr": "This article discusses...", "takeaways": ["This article discusses X", ...]}` passes the schema but renders a broken UI card that tells the user nothing. The validate catches it before `Article.update!` persists it.
|
|
12
|
+
|
|
13
|
+
```ruby
|
|
14
|
+
output_schema do
|
|
15
|
+
string :tldr
|
|
16
|
+
array :takeaways, of: :string, min_items: 3, max_items: 5
|
|
17
|
+
string :tone, enum: %w[neutral positive negative analytical]
|
|
18
|
+
end
|
|
19
|
+
|
|
20
|
+
validate("tldr is substantive") do |o, _|
|
|
21
|
+
o[:tldr].to_s.split.length >= 5 # at least five words
|
|
22
|
+
end
|
|
23
|
+
|
|
24
|
+
validate("no boilerplate takeaways") do |o, _|
|
|
25
|
+
o[:takeaways].none? { |t| t.downcase.start_with?("this article") }
|
|
26
|
+
end
|
|
27
|
+
```
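The two predicates above can be tried outside the gem. A minimal sketch with plain methods standing in for the `validate` blocks; the method names and sample hashes are made up for illustration:

```ruby
# Plain-Ruby stand-ins for the two validate blocks.
def tldr_substantive?(output)
  output[:tldr].to_s.split.length >= 5
end

def no_boilerplate_takeaways?(output)
  output[:takeaways].none? { |t| t.downcase.start_with?("this article") }
end

good = { tldr: "Ruby 3.4 freezes string literals by default",
         takeaways: ["YJIT is faster", "Parser fixes landed", "Warnings are stricter"] }
lazy = { tldr: "This article discusses...",
         takeaways: ["This article discusses X", "Something else", "One more"] }

[good, lazy].each do |o|
  puts "#{tldr_substantive?(o)} #{no_boilerplate_takeaways?(o)}"
end
```

The first sample passes both checks; the second fails both, even though each would satisfy the schema.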
|
|
28
|
+
|
|
29
|
+
## 2. Cross-validate output against input
|
|
30
|
+
|
|
31
|
+
**Why it matters:** a lazy model will return the article text verbatim as the "summary", or invent takeaways about topics the article never mentions. The 2-arity form is how you catch answers that are internally consistent but unfaithful to the actual input.
|
|
32
|
+
|
|
33
|
+
`validate` blocks with 2-arity `|output, input|` compare the model's answer against what was asked:
|
|
34
|
+
|
|
35
|
+
```ruby
|
|
36
|
+
validate("tldr is shorter than the article") do |output, input|
|
|
37
|
+
output[:tldr].length < input.length / 2
|
|
38
|
+
end
|
|
39
|
+
|
|
40
|
+
validate("every takeaway appears, in spirit, in the article") do |output, input|
|
|
41
|
+
output[:takeaways].all? { |t|
|
|
42
|
+
# cheap keyword overlap heuristic
|
|
43
|
+
t.downcase.split.any? { |w| input.downcase.include?(w) && w.length > 4 }
|
|
44
|
+
}
|
|
45
|
+
end
|
|
46
|
+
```
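The keyword-overlap heuristic is deliberately cheap, so it helps to see what it does and does not accept. A standalone sketch (the method name `grounded?` and the sample texts are hypothetical):

```ruby
# True when at least one word longer than four characters from the takeaway
# also occurs (as a substring) in the article.
def grounded?(takeaway, article)
  text = article.downcase
  takeaway.downcase.split.any? { |w| w.length > 4 && text.include?(w) }
end

article = "Ruby 3.4 ships frozen string literals and faster YJIT."

puts grounded?("Frozen literals are now default", article) # true
puts grounded?("Python adds pattern matching", article)    # false
```

A single shared long word is enough to pass, so this catches takeaways about unrelated topics but not subtle misstatements.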
|
|
47
|
+
|
|
48
|
+
## 3. Conditional logic schema can't express
|
|
49
|
+
|
|
50
|
+
**Why it matters:** customer success filters on `tone == "negative"` to route angry users to a human. If the model labels an outage complaint "negative" but the takeaways are all positive-sounding, the filter runs on a label that doesn't match the content — the routing breaks silently.
|
|
51
|
+
|
|
52
|
+
```ruby
|
|
53
|
+
validate("negative tone requires at least one concrete concern") do |output, _input|
|
|
54
|
+
next true unless output[:tone] == "negative"
|
|
55
|
+
output[:takeaways].any? { |t| t.match?(/fail|break|crash|outage|vulnerab|risk/i) }
|
|
56
|
+
end
|
|
57
|
+
```
|
|
58
|
+
|
|
59
|
+
A model that picks `tone: "negative"` but gives three upbeat takeaways fails this check. Schema can't catch it because each takeaway is, individually, a valid string.
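The same conditional rule, written as a plain method so it can be exercised without the gem (method name and sample data are illustrative only):

```ruby
CONCERN = /fail|break|crash|outage|vulnerab|risk/i

# Negative tone must be backed by at least one concrete concern.
def tone_consistent?(output)
  return true unless output[:tone] == "negative"

  output[:takeaways].any? { |t| t.match?(CONCERN) }
end

puts tone_consistent?(tone: "negative", takeaways: ["Great launch", "Users are happy"])
puts tone_consistent?(tone: "negative", takeaways: ["The outage lasted six hours"])
```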
|
|
60
|
+
|
|
61
|
+
## 4. Content quality
|
|
62
|
+
|
|
63
|
+
**Why it matters:** a TL;DR with `## Summary` leaks markdown into a plain-text card. A one-word takeaway ("Fast.") wastes a UI slot. A leaked `{article}` placeholder reveals the prompt template to end users. All pass schema; all embarrass you in front of customers.
|
|
64
|
+
|
|
65
|
+
```ruby
|
|
66
|
+
validate("no markdown headings in the TL;DR") do |o, _|
|
|
67
|
+
!o[:tldr].match?(/^\#{1,6}\s/)
|
|
68
|
+
end
|
|
69
|
+
|
|
70
|
+
validate("takeaways aren't single words") do |o, _|
|
|
71
|
+
o[:takeaways].all? { |t| t.split.length >= 3 }
|
|
72
|
+
end
|
|
73
|
+
|
|
74
|
+
validate("no template placeholders leaked") do |o, _|
|
|
75
|
+
!(o[:tldr] + o[:takeaways].join(" ")).include?("{")
|
|
76
|
+
end
|
|
77
|
+
```
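One detail of the heading check is worth seeing in isolation: in Ruby, `^` matches at the start of every line, so a heading on a later line is caught too. A small sketch with hypothetical method names:

```ruby
# Rejects markdown headings at the start of any line in the TL;DR.
def clean_tldr?(tldr)
  !tldr.match?(/^\#{1,6}\s/)
end

# Rejects any leftover "{" from an unfilled prompt template.
def no_placeholders?(output)
  !(output[:tldr] + output[:takeaways].join(" ")).include?("{")
end

puts clean_tldr?("Intro line\n## Summary") # false
puts clean_tldr?("C# is still popular")    # true
puts no_placeholders?(tldr: "See {article}", takeaways: ["fine"]) # false
```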
|
|
78
|
+
|
|
79
|
+
## 5. Pipeline: preserve data between steps
|
|
80
|
+
|
|
81
|
+
In a pipeline, each step only sees the previous step's output. If a later step needs data from earlier in the chain, an intermediate step must carry it through. Consider a pipeline `SummarizeArticle → GenerateHashtags`, where `GenerateHashtags` needs the `tone` from the summary:
|
|
82
|
+
|
|
83
|
+
```ruby
|
|
84
|
+
class GenerateHashtags < RubyLLM::Contract::Step::Base
|
|
85
|
+
input_type Hash
|
|
86
|
+
|
|
87
|
+
output_schema do
|
|
88
|
+
# Carry through the fields a downstream step (or the caller) might need
|
|
89
|
+
string :tone, enum: %w[neutral positive negative analytical]
|
|
90
|
+
array :hashtags, of: :string, min_items: 2, max_items: 5
|
|
91
|
+
end
|
|
92
|
+
|
|
93
|
+
prompt do
|
|
94
|
+
rule "Preserve the tone label from the input unchanged."
|
|
95
|
+
user "Summary: {tldr}\nTone: {tone}\nTakeaways: {takeaways}"
|
|
96
|
+
end
|
|
97
|
+
|
|
98
|
+
validate("tone preserved") { |o, input| o[:tone] == input[:tone] }
|
|
99
|
+
end
|
|
100
|
+
```
|
|
101
|
+
|
|
102
|
+
The explicit `validate("tone preserved")` catches the case where the model silently rewrites the tone during a downstream transform.
|
|
103
|
+
|
|
104
|
+
## 6. Model fallback
|
|
105
|
+
|
|
106
|
+
**Why it matters:** 80% of production articles are short and simple — `gpt-4.1-nano` handles them for ~$0.0001. The remaining 20% are dense, critical, or multi-topic — those need `gpt-4.1-mini` or `gpt-4.1`. Paying `gpt-4.1` rates for every call when nano is enough for most is throwing money away. Contracts tell you when nano wasn't enough, so fallback is cost-aware, not hope-based.
|
|
107
|
+
|
|
108
|
+
Small models are cheap but hallucinate. Big models are accurate but expensive. Start cheap, fall back only when validates catch a failure:
|
|
109
|
+
|
|
110
|
+
```ruby
|
|
111
|
+
class SummarizeArticle < RubyLLM::Contract::Step::Base
|
|
112
|
+
output_schema do
|
|
113
|
+
string :tldr
|
|
114
|
+
array :takeaways, of: :string, min_items: 3, max_items: 5
|
|
115
|
+
string :tone, enum: %w[neutral positive negative analytical]
|
|
116
|
+
end
|
|
117
|
+
|
|
118
|
+
validate("TL;DR fits the card") { |o, _| o[:tldr].length <= 200 }
|
|
119
|
+
validate("takeaways are unique") { |o, _| o[:takeaways] == o[:takeaways].uniq }
|
|
120
|
+
|
|
121
|
+
retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
|
|
122
|
+
end
|
|
123
|
+
```
|
|
124
|
+
|
|
125
|
+
**Key insight:** without contracts, you can't do model fallback — you'd have no way to know if the cheap model's output is good enough. Validates are the quality gate that makes cost optimization possible. See [Optimizing retry_policy](optimizing_retry_policy.md) for how to find the cheapest viable fallback list for your step.
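The cost argument can be made concrete with a little arithmetic. The sketch below assumes a two-model chain in which every call that fails on the cheap model succeeds on the stronger one; the 80% figure and the ~$0.0001 nano cost come from the text above, while the $0.00042 cost for the stronger model is an illustrative assumption.

```ruby
# Expected cost per call for a two-model fallback chain.
# pass_rate is the fraction of calls the cheap model handles on its own;
# the remainder pay for both attempts.
def expected_chain_cost(cheap_cost:, strong_cost:, pass_rate:)
  cheap_cost + (1.0 - pass_rate) * strong_cost
end

cost = expected_chain_cost(cheap_cost: 0.0001, strong_cost: 0.00042, pass_rate: 0.8)
puts format("%.6f", cost) # 0.000184
```

Under these assumptions the chain averages about $0.000184 per call, compared with $0.00042 when every call goes straight to the stronger model.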
|
|
126
|
+
|
|
127
|
+
## Summary
|
|
128
|
+
|
|
129
|
+
| What to validate | Use |
|
|
130
|
+
|---|---|
|
|
131
|
+
| Field types, enums, ranges, required fields | `output_schema` |
|
|
132
|
+
| Output makes sense given the input | `validate` (2-arity `\|output, input\|`) |
|
|
133
|
+
| Conditional business rules | `validate` |
|
|
134
|
+
| Content quality (not empty, not template) | `validate` |
|
|
135
|
+
| Data preserved across pipeline steps | `validate` + schema carry-through |
|
|
136
|
+
| Cost optimization via model fallback | `retry_policy` + `validate` as quality gate |
|
|
data/docs/guide/eval_first.md
|
@@ -0,0 +1,192 @@
|
|
|
1
|
+
# Eval-First
|
|
2
|
+
|
|
3
|
+
> Read this when you need to prevent silent prompt regressions in CI. Skip if your LLM output is evaluated only by humans and never gated in an automated pipeline.
|
|
4
|
+
|
|
5
|
+
If you change prompts by feel, you ship regressions by feel.
|
|
6
|
+
|
|
7
|
+
Concrete scenario: `SummarizeArticle` has been running in production for two weeks. Customer success notices that complaints about service outages keep getting `tone: "analytical"` instead of `"negative"` — so their "critical feedback" filter silently misses angry users. Someone tweaks the system prompt to emphasise negative sentiment. It fixes the outage article but now three neutral product-update articles get misclassified as `"negative"`. You find out from a Slack thread.
|
|
8
|
+
|
|
9
|
+
That is the cost of prompt-by-feel. Evals are how you stop it.
|
|
10
|
+
|
|
11
|
+
`ruby_llm-contract` works best when you treat evals as the source of truth:
|
|
12
|
+
|
|
13
|
+
1. Capture real failures from production (the outage article, verbatim).
|
|
14
|
+
2. Turn them into eval cases (`add_case "service outage complaint"`).
|
|
15
|
+
3. Change the prompt.
|
|
16
|
+
4. Re-run the same eval — plus all previously-passing cases.
|
|
17
|
+
5. Merge only if the eval says quality improved or stayed safe on every case.
|
|
18
|
+
|
|
19
|
+
## Core rule
|
|
20
|
+
|
|
21
|
+
**Do not start with the prompt. Start with the eval.**
|
|
22
|
+
|
|
23
|
+
Using the `SummarizeArticle` step from the [README](../../README.md):
|
|
24
|
+
|
|
25
|
+
```ruby
|
|
26
|
+
SummarizeArticle.define_eval("regression") do
|
|
27
|
+
add_case "ruby release",
|
|
28
|
+
input: "Ruby 3.4 shipped with frozen string literals...",
|
|
29
|
+
expected: { tone: "analytical" } # partial match
|
|
30
|
+
|
|
31
|
+
add_case "critical review",
|
|
32
|
+
input: "Mesh networking hardware failed under load...",
|
|
33
|
+
expected: { tone: "negative" }
|
|
34
|
+
end
|
|
35
|
+
```
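The `# partial match` comment suggests that only the keys listed in `expected` are compared. A plain-Ruby sketch of that idea follows; it is an assumption about the semantics, not the gem's matcher, and `partial_match?` is a hypothetical name.

```ruby
# True when every key in `expected` has the same value in `output`.
# Keys that appear only in `output` are ignored.
def partial_match?(expected, output)
  expected.all? { |key, value| output[key] == value }
end

output = { tldr: "Ruby 3.4 shipped", takeaways: %w[a b c], tone: "analytical" }

puts partial_match?({ tone: "analytical" }, output) # true
puts partial_match?({ tone: "negative" }, output)   # false
```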
|
|
36
|
+
|
|
37
|
+
Only after the eval exists, touch: `system`, `rule`, `example`, `validate`, prompt versions.
|
|
38
|
+
|
|
39
|
+
## Three eval kinds
|
|
40
|
+
|
|
41
|
+
### 1. `smoke` — wiring check, offline
|
|
42
|
+
|
|
43
|
+
```ruby
|
|
44
|
+
SummarizeArticle.define_eval("smoke") do
|
|
45
|
+
default_input "Ruby 3.4 shipped with frozen string literals..."
|
|
46
|
+
sample_response({
|
|
47
|
+
tldr: "...",
|
|
48
|
+
takeaways: ["point one", "point two", "point three"],
|
|
49
|
+
tone: "analytical"
|
|
50
|
+
})
|
|
51
|
+
end
|
|
52
|
+
```
|
|
53
|
+
|
|
54
|
+
`sample_response` returns canned data, so the eval makes zero API calls. It verifies that the schema and validates pass on that data and that the step wiring is intact. **Not a quality signal.**
|
|
55
|
+
|
|
56
|
+
### 2. `regression` — real quality measurement
|
|
57
|
+
|
|
58
|
+
Represent real traffic and known failures. Good sources: production logs, bad completions, incidents, QA edge cases, cases a human had to correct.
|
|
59
|
+
|
|
60
|
+
Every production failure becomes `add_case`. That's the flywheel.
|
|
61
|
+
|
|
62
|
+
### 3. `ab` — prompt iteration
|
|
63
|
+
|
|
64
|
+
Compare two prompt versions on the same eval:
|
|
65
|
+
|
|
66
|
+
```ruby
|
|
67
|
+
diff = SummarizeArticleV2.compare_with(
|
|
68
|
+
SummarizeArticleV1,
|
|
69
|
+
eval: "regression",
|
|
70
|
+
model: "gpt-4.1-mini"
|
|
71
|
+
)
|
|
72
|
+
|
|
73
|
+
diff.safe_to_switch? # => true if no cases regressed
|
|
74
|
+
```
|
|
75
|
+
|
|
76
|
+
This is the cleanest eval-first move: same eval, same cases, two prompt versions, one answer.
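One way to picture "no cases regressed" is as a comparison of per-case pass/fail maps. The sketch below is an illustration under that assumption, with hypothetical method names; it does not reproduce the gem's diff object.

```ruby
# Cases that passed with the old prompt but fail with the new one.
def regressed_cases(baseline, candidate)
  baseline.select { |name, passed| passed && !candidate.fetch(name, false) }.keys
end

def safe_to_switch?(baseline, candidate)
  regressed_cases(baseline, candidate).empty?
end

v1 = { "ruby release" => true, "critical review" => false }
v2 = { "ruby release" => true, "critical review" => true }

puts safe_to_switch?(v1, v2) # true
```

A new prompt that fixes "critical review" but breaks "ruby release" would report one regression and fail the check.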
|
|
77
|
+
|
|
78
|
+
## What counts as eval-first
|
|
79
|
+
|
|
80
|
+
**Good** — eval exists before the prompt change:
|
|
81
|
+
|
|
82
|
+
```ruby
|
|
83
|
+
SummarizeArticle.define_eval("regression") do
|
|
84
|
+
add_case "short article", input: "...", expected: { tone: "neutral" }
|
|
85
|
+
end
|
|
86
|
+
|
|
87
|
+
# Prompt iteration happens afterward
|
|
88
|
+
diff = SummarizeArticleV2.compare_with(
|
|
89
|
+
SummarizeArticleV1, eval: "regression", model: "gpt-4.1-mini"
|
|
90
|
+
)
|
|
91
|
+
```
|
|
92
|
+
|
|
93
|
+
**Bad**:
|
|
94
|
+
|
|
95
|
+
```ruby
|
|
96
|
+
# Tweak prompt for an hour
|
|
97
|
+
# Maybe add an example
|
|
98
|
+
# Maybe tighten a rule
|
|
99
|
+
# Then eyeball one or two responses
|
|
100
|
+
```
|
|
101
|
+
|
|
102
|
+
That's prompt guessing, not eval-first.
|
|
103
|
+
|
|
104
|
+
## `sample_response`: useful, but not the main thing
|
|
105
|
+
|
|
106
|
+
Good for: offline smoke tests, local development, testing evaluator wiring, checking schema + validate behavior with zero API calls.
|
|
107
|
+
|
|
108
|
+
Not enough for real prompt decisions. For those:
|
|
109
|
+
|
|
110
|
+
- `run_eval(..., context: { model: "..." })` with a real model, or pass an explicit adapter.
|
|
111
|
+
- `compare_with(...)` for prompt A/B.
|
|
112
|
+
|
|
113
|
+
`compare_with` intentionally ignores `sample_response` — canned data would make both sides look the same.
|
|
114
|
+
|
|
115
|
+
## Parallel eval runs
|
|
116
|
+
|
|
117
|
+
For larger datasets, `run_eval` accepts a `concurrency:` argument — cases run in parallel using a thread pool:
|
|
118
|
+
|
|
119
|
+
```ruby
|
|
120
|
+
report = SummarizeArticle.run_eval("regression",
|
|
121
|
+
context: { model: "gpt-4.1-mini" },
|
|
122
|
+
concurrency: 8)
|
|
123
|
+
```
|
|
124
|
+
|
|
125
|
+
`compare_models` and `optimize_retry_policy` accept the same argument. The thread count is a ceiling, and results are returned in dataset order. Keep it low enough to respect the provider's rate limits.
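A bounded pool that still returns results in input order can be sketched with `Queue` and a fixed number of worker threads. This illustrates the general technique only; `run_in_parallel` is a hypothetical helper, not the gem's runner.

```ruby
# Runs the block over items with at most `concurrency` threads and returns
# results in the same order as the input.
def run_in_parallel(items, concurrency:, &work)
  queue = Queue.new
  items.each_with_index { |item, index| queue << [item, index] }
  results = Array.new(items.size)

  workers = Array.new([concurrency, items.size].min) do
    Thread.new do
      loop do
        item, index = queue.pop(true) # non-blocking; raises when drained
        results[index] = work.call(item)
      end
    rescue ThreadError
      # Queue is empty: this worker is done.
    end
  end

  workers.each(&:join)
  results
end

p run_in_parallel([3, 1, 2], concurrency: 2) { |n| sleep(n * 0.01); n * 10 }
```

Each result is written to the slot matching its input index, so completion order does not affect the returned array.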
|
|
126
|
+
|
|
127
|
+
## Budgeting an eval before you run it
|
|
128
|
+
|
|
129
|
+
`estimate_eval_cost` gives you a cost projection without calling the LLM:
|
|
130
|
+
|
|
131
|
+
```ruby
|
|
132
|
+
SummarizeArticle.estimate_eval_cost("regression",
|
|
133
|
+
models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1])
|
|
134
|
+
# => { "gpt-4.1-nano" => 0.00041, "gpt-4.1-mini" => 0.0018, "gpt-4.1" => 0.0092 }
|
|
135
|
+
```
|
|
136
|
+
|
|
137
|
+
Use it in CI to decide which models are worth running regression on, or to cap worst-case spend per build.
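A projection like this boils down to token counts multiplied by per-token prices. The sketch below shows the arithmetic under loud assumptions: roughly four characters per token, a fixed output size per case, and made-up prices per million tokens. `projected_eval_cost` is a hypothetical name and does not reflect how the gem computes its estimate.

```ruby
# Projects the dollar cost of running every case once on one model.
# price: { input:, output: } in dollars per one million tokens (assumed unit).
def projected_eval_cost(case_inputs, price:, output_tokens_per_case:)
  input_tokens  = case_inputs.sum { |text| (text.length / 4.0).ceil }
  output_tokens = output_tokens_per_case * case_inputs.size
  (input_tokens * price[:input] + output_tokens * price[:output]) / 1_000_000.0
end

cases = ["a" * 400, "b" * 800]
puts projected_eval_cost(cases, price: { input: 1.0, output: 2.0 },
                                output_tokens_per_case: 50)
```

With these inputs the estimate covers 300 input tokens and 100 output tokens, which is the kind of figure a CI job could compare against a spend cap.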
|
|
138
|
+
|
|
139
|
+
## Team workflow
|
|
140
|
+
|
|
141
|
+
1. **Build one eval that matters** — 10–30 cases representing real mistakes and important business paths.
|
|
142
|
+
2. **Gate CI** — `pass_eval("regression").with_context(model: "...").with_minimum_score(0.8)`. See [Getting Started](getting_started.md) for the full matcher chain.
|
|
143
|
+
3. **Save a baseline** — `report.save_baseline!` makes quality drift visible.
|
|
144
|
+
4. **Change prompts only through comparison** — `pass_eval(...).compared_with(SummarizeArticleV1)` in CI so any regression blocks the merge.
|
|
145
|
+
5. **Feed production failures back** — every miss in prod → new `add_case`, then fix. The eval gets stronger over time.
|
|
146
|
+
|
|
147
|
+
## Few-shot examples fit naturally
|
|
148
|
+
|
|
149
|
+
Adding `example input: ..., output: ...` inside the prompt is still a prompt change. The eval-first way:
|
|
150
|
+
|
|
151
|
+
1. Add examples to the prompt.
|
|
152
|
+
2. Rerun the existing regression eval.
|
|
153
|
+
3. `compare_with` against the old prompt.
|
|
154
|
+
|
|
155
|
+
Few-shot isn't the proof. The eval is.
|
|
156
|
+
|
|
157
|
+
## Model selection comes after prompt stability
|
|
158
|
+
|
|
159
|
+
Don't optimize cost before you stabilize quality:
|
|
160
|
+
|
|
161
|
+
1. Build `regression`.
|
|
162
|
+
2. Improve the prompt with `compare_with`.
|
|
163
|
+
3. Lock quality in CI.
|
|
164
|
+
4. Then run `compare_models` (see [Optimizing retry_policy](optimizing_retry_policy.md)).
|
|
165
|
+
|
|
166
|
+
```ruby
|
|
167
|
+
comparison = SummarizeArticle.compare_models(
|
|
168
|
+
"regression",
|
|
169
|
+
candidates: [{ model: "gpt-4.1-nano" }, { model: "gpt-4.1-mini" }, { model: "gpt-4.1" }]
|
|
170
|
+
)
|
|
171
|
+
|
|
172
|
+
comparison.best_for(min_score: 0.95)
|
|
173
|
+
```
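The source does not spell out how `best_for` chooses among candidates. A plausible reading, sketched here as an assumption, is "the cheapest candidate whose score meets the threshold". The `Candidate` struct, the scores, and the costs are all hypothetical.

```ruby
Candidate = Struct.new(:model, :score, :cost, keyword_init: true)

# Cheapest candidate meeting the minimum score, or nil when none qualifies.
def best_for(candidates, min_score:)
  candidates.select { |c| c.score >= min_score }.min_by(&:cost)
end

candidates = [
  Candidate.new(model: "gpt-4.1-nano", score: 0.90, cost: 0.0004),
  Candidate.new(model: "gpt-4.1-mini", score: 0.96, cost: 0.0018),
  Candidate.new(model: "gpt-4.1",      score: 0.98, cost: 0.0092)
]

puts best_for(candidates, min_score: 0.95).model # gpt-4.1-mini
```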
|
|
174
|
+
|
|
175
|
+
## Strong defaults for teams
|
|
176
|
+
|
|
177
|
+
- `smoke` uses `sample_response`.
|
|
178
|
+
- `regression` uses real model calls.
|
|
179
|
+
- Every prompt change uses `compare_with`.
|
|
180
|
+
- Every merge runs `pass_eval`.
|
|
181
|
+
- Every production failure becomes a new `add_case`.
|
|
182
|
+
|
|
183
|
+
## Short version
|
|
184
|
+
|
|
185
|
+
1. Write `define_eval` before touching the prompt.
|
|
186
|
+
2. Treat `sample_response` as smoke only.
|
|
187
|
+
3. Use `run_eval("name", context: { model: "..." })` for real quality measurement.
|
|
188
|
+
4. Use `compare_with` for every serious prompt change.
|
|
189
|
+
5. Gate merges with `pass_eval`.
|
|
190
|
+
6. Feed every production miss back into the dataset.
|
|
191
|
+
|
|
192
|
+
Prompts stop being vibes and start being engineering.
|
|
data/docs/guide/getting_started.md
|
@@ -0,0 +1,199 @@
|
|
|
1
|
+
# Getting Started
|
|
2
|
+
|
|
3
|
+
> Read this to walk through every feature on one concrete step for the first time.
|
|
4
|
+
|
|
5
|
+
The README shows a minimal `SummarizeArticle` step. This guide walks through the features you reach for as production requirements grow: budget caps so runaway inputs don't drain your LLM provider budget, evals so you catch regressions in CI, and CI gating so a merge that lowers accuracy gets blocked.
|
|
6
|
+
|
|
7
|
+
## The walkthrough
|
|
8
|
+
|
|
9
|
+
Start with the README example, then add features one layer at a time. Each is optional — use what you need.
|
|
10
|
+
|
|
11
|
+
```ruby
|
|
12
|
+
class SummarizeArticle < RubyLLM::Contract::Step::Base
|
|
13
|
+
# 1. Prompt (required)
|
|
14
|
+
prompt <<~PROMPT
|
|
15
|
+
Summarize this article for a UI card. Return a short TL;DR,
|
|
16
|
+
3 to 5 key takeaways, and a tone label.
|
|
17
|
+
|
|
18
|
+
{input}
|
|
19
|
+
PROMPT
|
|
20
|
+
|
|
21
|
+
# 2. Schema — sent to the provider via with_schema, validated client-side
|
|
22
|
+
output_schema do
|
|
23
|
+
string :tldr
|
|
24
|
+
array :takeaways, of: :string, min_items: 3, max_items: 5
|
|
25
|
+
string :tone, enum: %w[neutral positive negative analytical]
|
|
26
|
+
end
|
|
27
|
+
|
|
28
|
+
# 3. Business rules — things JSON Schema cannot express
|
|
29
|
+
validate("TL;DR fits the card") { |o, _| o[:tldr].length <= 200 }
|
|
30
|
+
validate("takeaways are unique") { |o, _| o[:takeaways] == o[:takeaways].uniq }
|
|
31
|
+
|
|
32
|
+
# 4. Retry with model fallback on validation_failed / parse_error
|
|
33
|
+
retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
|
|
34
|
+
|
|
35
|
+
# 5. Refuse before calling the LLM if input is too large or estimated cost exceeds the cap
|
|
36
|
+
max_input 2_000
|
|
37
|
+
max_output 4_000
|
|
38
|
+
max_cost 0.01
|
|
39
|
+
end
|
|
40
|
+
```
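Step 5 refuses a call before any request is sent. The idea can be sketched as a preflight function; the four-characters-per-token estimate, the flat per-token price, and the name `preflight` are assumptions for illustration, while `:limit_exceeded` is the status named in the README FAQ.

```ruby
# Returns :ok when the input fits both limits, :limit_exceeded otherwise.
def preflight(input, max_input:, max_cost:, price_per_token:)
  tokens = (input.length / 4.0).ceil # rough estimate, not a real tokenizer
  return :limit_exceeded if tokens > max_input
  return :limit_exceeded if tokens * price_per_token > max_cost

  :ok
end

puts preflight("a" * 4_000,  max_input: 2_000, max_cost: 0.01, price_per_token: 0.000001)
puts preflight("a" * 12_000, max_input: 2_000, max_cost: 0.01, price_per_token: 0.000001)
```

The first call estimates 1,000 tokens and passes; the second estimates 3,000 tokens and is refused without spending anything.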
|
|
41
|
+
|
|
42
|
+
## Validation and retry behavior
|
|
43
|
+
|
|
44
|
+
When the cheap model returns output that fails a `validate` block or can't be parsed, retry falls back to the next model in `models:` and tries again.
|
|
45
|
+
|
|
46
|
+
```ruby
|
|
47
|
+
result = SummarizeArticle.run(article_text)
|
|
48
|
+
|
|
49
|
+
result.status # => :ok
|
|
50
|
+
result.parsed_output # => { tldr: "...", takeaways: [...], tone: "analytical" }
|
|
51
|
+
result.trace[:model] # => "gpt-4.1-mini" (first model that passed)
|
|
52
|
+
result.trace[:cost] # => 0.00052 (sum of all attempts)
|
|
53
|
+
result.trace[:attempts]
|
|
54
|
+
# => [
|
|
55
|
+
# { attempt: 1, model: "gpt-4.1-nano", status: :validation_failed,
|
|
56
|
+
# cost: 0.00010, latency_ms: 45, ... },
|
|
57
|
+
# { attempt: 2, model: "gpt-4.1-mini", status: :ok,
|
|
58
|
+
# cost: 0.00042, latency_ms: 92, ... }
|
|
59
|
+
# ]
|
|
60
|
+
```
|
|
61
|
+
|
|
62
|
+
If the whole chain is exhausted, `result.status` is the status of the last attempt (`:validation_failed` or `:parse_error`) and `result.parsed_output` is the last attempt's output. The caller decides what to do: ship it anyway, fall back to a template, or raise.
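The behaviour described above (try each model in order, stop at the first output that passes, report the last status otherwise, sum the cost of all attempts) can be simulated in plain Ruby. This is a sketch, not the gem's executor: `run_with_fallback` is hypothetical, the block stands in for the LLM call, and a non-empty model list is assumed.

```ruby
# validates: { "name" => ->(output) { true/false } }
# The block receives a model name and returns [output, cost].
def run_with_fallback(models, validates, &call_model)
  attempts = []
  models.each_with_index do |model, index|
    output, cost = call_model.call(model)
    failed = validates.reject { |_name, check| check.call(output) }.keys
    status = failed.empty? ? :ok : :validation_failed
    attempts << { attempt: index + 1, model: model, status: status, cost: cost }
    break if status == :ok
  end
  { status: attempts.last[:status], cost: attempts.sum { |a| a[:cost] }, attempts: attempts }
end

canned = {
  "gpt-4.1-nano" => [{ tldr: "x" * 300 }, 0.00010], # too long for the card
  "gpt-4.1-mini" => [{ tldr: "Short and fine" }, 0.00042]
}
validates = { "TL;DR fits the card" => ->(o) { o[:tldr].length <= 200 } }

result = run_with_fallback(%w[gpt-4.1-nano gpt-4.1-mini gpt-4.1], validates) { |m| canned.fetch(m) }
puts result[:status]
puts result[:attempts].map { |a| a[:model] }.inspect
```

The third model is never called because the second attempt passes, which matches the two-entry `trace[:attempts]` shown earlier.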
|
|
63
|
+
|
|
64
|
+
### Per-attempt reasoning effort
|
|
65
|
+
|
|
66
|
+
`models:` accepts config hashes as well as model-name strings, so a fallback can "try harder" (more reasoning) on retry, not just switch model:
|
|
67
|
+
|
|
68
|
+
```ruby
|
|
69
|
+
retry_policy models: [
|
|
70
|
+
"gpt-5-nano", # attempt 1: cheap + fast
|
|
71
|
+
{ model: "gpt-5-mini", reasoning_effort: "high" } # attempt 2: stronger + more reasoning
|
|
72
|
+
]
|
|
73
|
+
```
|
|
74
|
+
|
|
75
|
+
`reasoning_effort` is a gpt-5 family feature (gpt-4.1 is not a reasoning model). The per-attempt value is forwarded via `with_thinking`, which is provider-agnostic: both OpenAI `reasoning_effort` and the Anthropic extended-thinking budget are supported. If you pass it alongside a non-reasoning model, the value is forwarded unchanged to the provider, which will either ignore it or reject the request; the gem does not guard against this.
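Because `models:` mixes strings and hashes, some normalization step has to turn both into one shape. The sketch below shows one way that could look; the method name and the resulting hash layout are assumptions, not the gem's internal `@configs` format.

```ruby
# Turns a mixed list of model names and config hashes into uniform hashes.
def normalize_configs(models)
  models.map do |entry|
    case entry
    when String then { model: entry }
    when Hash   then entry.transform_keys(&:to_sym)
    else raise ArgumentError, "unsupported model entry: #{entry.inspect}"
    end
  end
end

p normalize_configs(["gpt-5-nano", { model: "gpt-5-mini", reasoning_effort: "high" }])
```

With every attempt in the same shape, a retry loop can read `config[:model]` and an optional `config[:reasoning_effort]` without caring which form the caller used.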
|
|
76
|
+
|
|
77
|
+
## Evals and CI gates
|
|
78
|
+
|
|
79
|
+
An eval is a named scenario you can run to verify the step still works. `sample_response` makes it offline — zero API calls — so CI can run it on every merge without burning budget.
|
|
80
|
+
|
|
81
|
+
```ruby
|
|
82
|
+
SummarizeArticle.define_eval("smoke") do
|
|
83
|
+
default_input <<~ARTICLE
|
|
84
|
+
Ruby 3.4 ships with frozen string literals on by default, measurable YJIT
|
|
85
|
+
speedups on Rails workloads, and tightened Warning.warn category filtering.
|
|
86
|
+
The release notes also mention several parser fixes and faster keyword
|
|
87
|
+
argument handling.
|
|
88
|
+
ARTICLE
|
|
89
|
+
|
|
90
|
+
sample_response({
|
|
91
|
+
tldr: "Ruby 3.4 brings frozen string literals by default, YJIT speedups, and parser fixes.",
|
|
92
|
+
takeaways: [
|
|
93
|
+
"Frozen string literals are the default",
|
|
94
|
+
"YJIT adds measurable speedups on Rails workloads",
|
|
95
|
+
"Warning.warn category filtering is tighter"
|
|
96
|
+
],
|
|
97
|
+
tone: "analytical"
|
|
98
|
+
})
|
|
99
|
+
end
|
|
100
|
+
|
|
101
|
+
report = SummarizeArticle.run_eval("smoke")
|
|
102
|
+
report.passed? # => true — schema + validates pass on the canned response
|
|
103
|
+
report.score # => 1.0
|
|
104
|
+
report.print_summary
|
|
105
|
+
```
|
|
106
|
+
|
|
107
|
+
For real regression testing, define cases with expected output (online — calls the LLM):
|
|
108
|
+
|
|
109
|
+
```ruby
|
|
110
|
+
SummarizeArticle.define_eval("regression") do
|
|
111
|
+
add_case "ruby release",
|
|
112
|
+
input: "Ruby 3.4 was released...",
|
|
113
|
+
expected: { tone: "analytical" } # partial match
|
|
114
|
+
|
|
115
|
+
add_case "critical review",
|
|
116
|
+
input: "The new mesh networking hardware failed under load...",
|
|
117
|
+
expected: { tone: "negative" }
|
|
118
|
+
end
|
|
119
|
+
```
|
|
120
|
+
|
|
121
|
+
Gate CI on score and cost thresholds:
|
|
122
|
+
|
|
123
|
+
```ruby
|
|
124
|
+
# RSpec — blocks merge if accuracy drops or cost spikes
|
|
125
|
+
expect(SummarizeArticle).to pass_eval("regression")
|
|
126
|
+
.with_minimum_score(0.8)
|
|
127
|
+
.with_maximum_cost(0.01)
|
|
128
|
+
```
|
|
129
|
+
|
|
130
|
+
Save a baseline once, then block regressions automatically:
|
|
131
|
+
|
|
132
|
+
```ruby
|
|
133
|
+
report = SummarizeArticle.run_eval("regression")
|
|
134
|
+
report.save_baseline!
|
|
135
|
+
|
|
136
|
+
# In CI:
|
|
137
|
+
expect(SummarizeArticle).to pass_eval("regression").without_regressions
|
|
138
|
+
```
|
|
139
|
+
|
|
140
|
+
`without_regressions` fails the build only if a previously-passing case now fails — a new model version, a prompt tweak, or an upstream change that silently lowered quality.
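
The rule can be sketched in plain Ruby. The data shape here (case name mapped to a pass/fail boolean) and the `regressions` helper are hypothetical illustrations of the comparison, not the gem's baseline format or implementation.

```ruby
# Hypothetical data shape: case name => whether the case passed.
baseline = { "ruby release" => true, "critical review" => false }
current  = { "ruby release" => false, "critical review" => false, "new case" => false }

# A regression is a case that passed in the baseline and fails now.
# Cases that were already failing, or that are new, do not count.
def regressions(baseline, current)
  baseline.select { |name, passed| passed && current[name] == false }.keys
end

regressed = regressions(baseline, current)
puts regressed.inspect
# prints ["ruby release"]
```

Under this rule a build with known-failing cases can still pass, which is what separates it from a plain minimum-score gate.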
|
|
141
|
+
|
|
142
|
+
## Budget caps
|
|
143
|
+
|
|
144
|
+
`max_input`, `max_output`, and `max_cost` are preflight checks — the LLM is never called if an estimate exceeds the limit. Zero tokens spent, zero cost.
|
|
145
|
+
|
|
146
|
+
```ruby
|
|
147
|
+
result = SummarizeArticle.run(giant_10mb_document)
|
|
148
|
+
result.status # => :limit_exceeded
|
|
149
|
+
result.validation_errors
|
|
150
|
+
# => ["Input token limit exceeded: estimated 32000 tokens (heuristic ±30%), max 2000"]
|
|
151
|
+
```
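
To make the idea of a preflight check concrete, here is a rough sketch of an input cap enforced before any request is made. It assumes about 4 characters per token, which is a common rule of thumb and not necessarily the heuristic the gem uses; `estimate_tokens` and `preflight_input_check` are hypothetical names.

```ruby
# Assumption: roughly 4 characters per token; the gem's own heuristic may differ.
CHARS_PER_TOKEN = 4.0

def estimate_tokens(text)
  (text.length / CHARS_PER_TOKEN).ceil
end

# Returns an error message when the estimate exceeds the cap, otherwise nil.
# Nothing is sent to a provider either way.
def preflight_input_check(text, max_input:)
  estimated = estimate_tokens(text)
  return nil if estimated <= max_input

  "Input token limit exceeded: estimated #{estimated} tokens, max #{max_input}"
end

preflight_input_check("short article", max_input: 2000) # => nil
preflight_input_check("x" * 128_000, max_input: 2000)
# => "Input token limit exceeded: estimated 32000 tokens, max 2000"
```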
|
|
152
|
+
|
|
153
|
+
`max_cost` fails closed when the model's pricing isn't known — register custom or fine-tuned models explicitly:
|
|
154
|
+
|
|
155
|
+
```ruby
|
|
156
|
+
RubyLLM::Contract::CostCalculator.register_model("ft:gpt-4o-custom",
|
|
157
|
+
input_per_1m: 3.0, output_per_1m: 6.0)
|
|
158
|
+
```
|
|
159
|
+
|
|
160
|
+
Or opt into a soft warning instead of a refusal when pricing is missing:
|
|
161
|
+
|
|
162
|
+
```ruby
|
|
163
|
+
max_cost 0.01, on_unknown_pricing: :warn
|
|
164
|
+
```
|
|
165
|
+
|
|
166
|
+
Default is `:refuse`. Use `:warn` only when you accept running without a cost ceiling (fine-tuned models you trust, private endpoints).
|
|
167
|
+
|
|
168
|
+
### Preflight cost estimates
|
|
169
|
+
|
|
170
|
+
Check what a call is likely to cost before invoking it:
|
|
171
|
+
|
|
172
|
+
```ruby
|
|
173
|
+
SummarizeArticle.estimate_cost(input: article_text)
|
|
174
|
+
# => {
|
|
175
|
+
# model: "gpt-4.1-mini",
|
|
176
|
+
# input_tokens: 812, output_tokens_estimate: 4000,
|
|
177
|
+
# estimated_cost: 0.00243
|
|
178
|
+
# }
|
|
179
|
+
|
|
180
|
+
# Estimate what a full eval would cost across candidate models
|
|
181
|
+
SummarizeArticle.estimate_eval_cost("regression",
|
|
182
|
+
models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1])
|
|
183
|
+
# => { "gpt-4.1-nano" => 0.00041, "gpt-4.1-mini" => 0.0018, "gpt-4.1" => 0.0092 }
|
|
184
|
+
```
|
|
185
|
+
|
|
186
|
+
`estimate_cost` returns `nil` when pricing isn't registered. `estimate_eval_cost` silently treats unknown-pricing cases as `$0.00` and sums the rest — it does **not** fail closed the way `max_cost` does. Treat its output as a floor, not a guarantee; register pricing via `CostCalculator.register_model` before relying on it for budget decisions.
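
The difference between a floor and a fail-closed total is easy to see with a few numbers. This is an illustrative sketch only: `case_costs` and `strict_total` are hypothetical, with `nil` standing in for a case whose pricing is unknown.

```ruby
# Hypothetical per-case estimates; nil means pricing was unknown for that case.
case_costs = [0.0004, nil, 0.0011]

# Floor-style total: unknown pricing counts as zero, so the sum can understate.
floor_total = case_costs.compact.sum

# Fail-closed total: refuse to report a number if any case is unpriced.
def strict_total(costs)
  raise ArgumentError, "pricing unknown for at least one case" if costs.any?(&:nil?)

  costs.sum
end

puts format("floor total: %.4f", floor_total)

begin
  strict_total(case_costs)
rescue ArgumentError => e
  puts "strict total refused: #{e.message}"
end
```

If a budget decision depends on the number, the strict variant is the safer default; the floor is only useful as a lower bound.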
|
|
187
|
+
|
|
188
|
+
## `output_schema` vs `with_schema`
|
|
189
|
+
|
|
190
|
+
`with_schema` in `ruby_llm` tells the provider to force a specific JSON structure. `output_schema` in this gem does the same thing (calls `with_schema` under the hood) **plus** validates the response client-side. Cheaper models sometimes ignore schema constraints — `with_schema` is a request; `output_schema` is a request plus verification.
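
As a rough illustration of what "request plus verification" means, the sketch below checks a raw response against the expected shape on the client. It is hand-rolled and deliberately tiny: `REQUIRED_KEYS` and `shape_errors` are hypothetical names, and the gem's real schema validation covers far more than key presence and top-level types.

```ruby
require "json"

# Hypothetical, hand-rolled check; the gem's schema validation is richer.
REQUIRED_KEYS = { "tldr" => String, "takeaways" => Array, "tone" => String }.freeze

# Returns a list of problems; an empty list means the shape matches.
def shape_errors(raw_json)
  data = JSON.parse(raw_json)
  return ["expected a JSON object"] unless data.is_a?(Hash)

  REQUIRED_KEYS.filter_map do |key, type|
    if !data.key?(key)
      "missing #{key}"
    elsif !data[key].is_a?(type)
      "#{key} should be #{type}"
    end
  end
rescue JSON::ParserError
  ["response is not valid JSON"]
end

shape_errors('{"tldr":"ok","takeaways":[],"tone":"analytical"}') # => []
shape_errors('{"tldr":"ok","tone":42}')
# => ["missing takeaways", "tone should be String"]
```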
|
|
191
|
+
|
|
192
|
+
## See also
|
|
193
|
+
|
|
194
|
+
- [Prompt AST](prompt_ast.md) — prompt DSL variants: `system`, `rule`, `section`, `example`, `user`, and dynamic prompts with `|input|`.
|
|
195
|
+
- [Eval-First](eval_first.md) — datasets, baselines, A/B gates, the workflow that makes the above evals useful.
|
|
196
|
+
- [Optimizing retry_policy](optimizing_retry_policy.md) — find the cheapest viable fallback list with `compare_models` and `optimize_retry_policy`.
|
|
197
|
+
- [Testing](testing.md) — test adapter, `stub_step`, full RSpec + Minitest matcher reference.
|
|
198
|
+
- [Output Schema](output_schema.md) — nested objects in arrays, constraints, pattern reference.
|
|
199
|
+
- [Rails integration](rails_integration.md) — where step classes live, initializer, jobs, logging, specs, CI gate.
|