ruby_llm-contract 0.10.1 → 0.10.2

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 66d3692be182ac8958132d06b53c1d47e01d07769d0e7a1942f81b71eb23a759
4
- data.tar.gz: f254fcbc6bc8c2a42a78cff9ea3b24b23f04d860c60d79084bbd8a5d6ea6ec0b
3
+ metadata.gz: ac8c19463285a3e2c0f9050e835125454b21d2b1500dcb5bad2dd4a114666bd0
4
+ data.tar.gz: 339c0609e3dccbf55da649c3470ed234c5abafa55c8649aee5d8dd41f9bd82ec
5
5
  SHA512:
6
- metadata.gz: f035aca56a5c0e2ca0607a06431cbb76a6ff4cc45f2f35fe9309d534b881910fa05d63c5c581a6e9cbac5374377f6fc187e521a844a653da5b53d612e9a23a89
7
- data.tar.gz: 31bb34333a28224e7c00ce2dea7b3905dce878c320fc0fb8a0e5db1313259a0e12e136ab6cc74126b25f0a47984830ae2172e3f824ccfdcb28f847929f40080a
6
+ metadata.gz: c04fea66393868fbdb369f5a75eaeb66a28172c8f9ddabd0a8c053fd55e5daa06174aeeb00c4bb7e9aabaf613bd1ef4ea354ff7d0f4b817f2ec41af1a4c61667
7
+ data.tar.gz: 71f502ebe497ede22df508d3b56396660fbd1eba109066885d5e905724bd82d30a0d46fddfe4f1c1e7efe47f95ea41b4f61341a4698c12d3f3a397cab2510068
data/CHANGELOG.md CHANGED
@@ -1,5 +1,19 @@
1
1
  # Changelog
2
2
 
3
+ ## 0.10.2 (2026-06-10)
4
+
5
+ Patch release: ship the `docs/guide/` directory inside the gem so adopters and LLM integration agents can read the manuals locally (via `bundle show ruby_llm-contract` or `gem unpack`) without an internet round-trip to GitHub. No code behavior change.
6
+
7
+ ### Added
8
+
9
+ - **`docs/guide/*` is now packaged with the gem** (14 files, ~120 KB). Previously the README's "See also" links pointed at `docs/guide/getting_started.md`, `docs/guide/optimizing_retry_policy.md`, etc., but those files were stripped from the published gem - LLM integration agents (Cursor, Claude Code, Copilot) reported "no documentation in the gem" because the links 404'd locally. `docs/ideas/` and `doc/decisions/` remain excluded.
10
+
11
+ ### Fixed
12
+
13
+ - **`models:` keyword form documented + pinned for hash configs.** `retry_policy models: ["gpt-5-nano", { model: "gpt-5-mini", reasoning_effort: "high" }]` is now covered by a spec and shown in [getting_started.md](docs/guide/getting_started.md). The block form (`escalate(...)`) and the keyword form share the same `@configs` storage; both forms accept config hashes.
14
+ - **`reasoning_effort` examples corrected to gpt-5 family.** Pre-0.10.2 docs and specs paired `reasoning_effort` with `gpt-4.1-*` model names, which is incorrect: `gpt-4.1` is not a reasoning model. Updated across `docs/guide/getting_started.md`, `docs/guide/optimizing_retry_policy.md`, `CHANGELOG.md`, and 7 spec files. Non-reasoning `gpt-4.1` examples (model fallback chains without `reasoning_effort`) are unchanged.
15
+ - **Version mentions corrected in README and multimodal guide.** README FAQ ("Upgraded to 0.9.0 - why?") now reads "Upgraded to 0.10.0 from 0.8.x"; `docs/guide/multimodal_input.md` references `0.10.0+` and `0.10.x` instead of `0.9.0`. 0.9.0 and 0.9.1 were tagged but never published to rubygems; adopters jump from 0.8.0 directly to 0.10.x.
16
+
3
17
  ## 0.10.1 (2026-06-01)
4
18
 
5
19
  Patch release fixing gem packaging. 0.10.0 was yanked from rubygems.org due to the issue documented below; 0.10.1 is the recommended upgrade target. No code behavior change vs 0.10.0.
@@ -254,7 +268,7 @@ end
254
268
  - **`Step.recommend`** — `ClassifyTicket.recommend("eval", candidates: [...], min_score: 0.95)` runs eval on all candidates and returns a `Recommendation` with optimal model, retry chain, rationale, savings vs current config, and `to_dsl` code output.
255
269
  - **Candidates as configurations** — `candidates:` accepts `{ model:, reasoning_effort: }` hashes, not just model name strings. `gpt-5-mini` with `reasoning_effort: "low"` is a different candidate than with `"high"`.
256
270
  - **`compare_models` extended** — new `candidates:` parameter alongside existing `models:` (backward compatible). Candidate labels include reasoning effort in output table.
257
- - **Per-attempt `reasoning_effort` in retry policies** — `escalate` accepts config hashes: `escalate({ model: "gpt-4.1-nano" }, { model: "gpt-5-mini", reasoning_effort: "high" })`. Each attempt gets its own reasoning_effort forwarded to the provider.
271
+ - **Per-attempt `reasoning_effort` in retry policies** — `escalate` accepts config hashes: `escalate({ model: "gpt-5-nano" }, { model: "gpt-5-mini", reasoning_effort: "high" })`. Each attempt gets its own reasoning_effort forwarded to the provider.
258
272
  - **`pass_rate_ratio`** — numeric float (0.0–1.0) on `Report` and `ReportStats`, complementing the string `pass_rate` (`"3/5"`).
259
273
  - **History entries enriched** — `save_history!` accepts `reasoning_effort:` and stores `model`, `reasoning_effort`, `pass_rate_ratio` in JSONL entries.
260
274
 
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- ruby_llm-contract (0.10.1)
4
+ ruby_llm-contract (0.10.2)
5
5
  dry-types (~> 1.7)
6
6
  ruby_llm (~> 1.12)
7
7
  ruby_llm-schema (~> 0.3)
@@ -258,7 +258,7 @@ CHECKSUMS
258
258
  rubocop-ast (1.49.1) sha256=4412f3ee70f6fe4546cc489548e0f6fcf76cafcfa80fa03af67098ffed755035
259
259
  ruby-progressbar (1.13.0) sha256=80fc9c47a9b640d6834e0dc7b3c94c9df37f08cb072b7761e4a71e22cff29b33
260
260
  ruby_llm (1.14.0) sha256=57c6f7034fc4a44504ea137d70f853b07824f1c1cdbe774ab3ab3522e7098deb
261
- ruby_llm-contract (0.10.1)
261
+ ruby_llm-contract (0.10.2)
262
262
  ruby_llm-schema (0.3.0) sha256=a591edc5ca1b7f0304f0e2261de61ba4b3bea17be09f5cf7558153adfda3dec6
263
263
  ruby_parser (3.22.0) sha256=1eb4937cd9eb220aa2d194e352a24dba90aef00751e24c8dfffdb14000f15d23
264
264
  rubycritic (4.12.0) sha256=024fed90fe656fa939f6ea80aab17569699ac3863d0b52fd72cb99892247abc8
data/README.md CHANGED
@@ -157,7 +157,7 @@ Different layers, complementary. [`ruby_llm-tribunal`](https://github.com/Alqemi
157
157
 
158
158
  ## Status & versioning
159
159
 
160
- Pre-1.0 (currently **0.10.1**). Semver tracked; breaking changes flagged in [CHANGELOG](CHANGELOG.md). Pin `~> 0.10.1` until 1.0 ships.
160
+ Pre-1.0 (currently **0.10.2**). Semver tracked; breaking changes flagged in [CHANGELOG](CHANGELOG.md). Pin `~> 0.10.2` until 1.0 ships.
161
161
 
162
162
  ## FAQ
163
163
 
@@ -167,7 +167,7 @@ Pre-1.0 (currently **0.10.1**). Semver tracked; breaking changes flagged in [CHA
167
167
 
168
168
  **Where in a Rails app?** Default `app/contracts/`. The Railtie reloads `app/contracts/eval/` and `app/steps/eval/` in development; any autoloaded directory also works. See [Rails integration](docs/guide/rails_integration.md).
169
169
 
170
- **Upgraded to 0.9.0 and my contract started refusing — why?** 0.9.0 added multimodal input. If your contract has `max_cost` or `max_input` set AND now receives `context: { attachment: ... }`, you must declare `attachment_token_estimate(n)` (conservative input-token budget for the attachment) — otherwise the call fails closed with `:limit_exceeded`. The gem cannot bound vision/PDF cost without your estimate. Opt out per-step with `on_unknown_attachment_size :warn`. Text-only contracts are unaffected. See [multimodal input guide](docs/guide/multimodal_input.md).
170
+ **Upgraded to 0.10.0 from 0.8.x and my contract started refusing — why?** 0.10.0 added multimodal input. If your contract has `max_cost` or `max_input` set AND now receives `context: { attachment: ... }`, you must declare `attachment_token_estimate(n)` (conservative input-token budget for the attachment) — otherwise the call fails closed with `:limit_exceeded`. The gem cannot bound vision/PDF cost without your estimate. Opt out per-step with `on_unknown_attachment_size :warn`. Text-only contracts are unaffected. See [multimodal input guide](docs/guide/multimodal_input.md).
171
171
 
172
172
  ## License
173
173
 
@@ -0,0 +1,50 @@
1
+ # Architecture
2
+
3
+ ```
4
+ RubyLLM::Contract::Pipeline::Base # optional: compose steps
5
+ ├── Pipeline::Runner # sequential execution, fail-fast, trace, timeout
6
+ └── Pipeline::Result # per-step outputs + aggregated trace
7
+
8
+ RubyLLM::Contract::Step::Base # single contracted step
9
+ ├── Step::Dsl # DSL macros (prompt, validate, output_schema, etc.)
10
+ ├── Step::RetryPolicy # attempts, models, reasoning_effort, retry_on
11
+ ├── Step::RetryExecutor # retry loop driven by RetryPolicy
12
+ ├── Step::LimitChecker # preflight cost / input / output checks
13
+ ├── Step::Runner # runtime flow (single attempt)
14
+ ├── Step::Result # status + outputs + errors + trace
15
+ ├── Step::Trace # model, latency, tokens, cost, attempts
16
+ ├── Prompt::AST # structured prompt (immutable)
17
+ │ ├── Prompt::Builder # DSL: system, rule, example, user, section
18
+ │ └── Prompt::Renderer # AST → messages array
19
+ ├── Contract::Definition # parse strategy + validates
20
+ │ ├── Contract::Parser # :json / :text (auto-inferred from output type)
21
+ │ ├── Contract::Validator # runs parse + schema + validates + observations
22
+ │ └── Contract::SchemaValidator # JSON Schema validation (nested)
23
+ ├── CostCalculator # per-step cost estimation from model pricing
24
+ ├── TokenEstimator # input-token count estimation for limit checks
25
+ ├── estimate_cost # single-call cost estimate (class method)
26
+ ├── estimate_eval_cost # cost estimate for a full eval across models
27
+ └── Adapters::Base # provider interface
28
+ ├── Adapters::RubyLLM # real LLM calls via ruby_llm (any provider)
29
+ └── Adapters::Test # canned responses for specs and examples
30
+
31
+ RubyLLM::Contract::Eval # quality measurement
32
+ ├── Eval::EvalDefinition # define_eval DSL (verify, add_case, default_input, sample_response)
33
+ ├── Eval::Dataset # test cases
34
+ ├── Eval::Runner # sequential or concurrent execution
35
+ ├── Eval::Report # score, pass_rate, per-case results
36
+ ├── Eval::AggregatedReport # merged reports across models or runs
37
+ ├── Eval::CaseResult # value object (name, passed?, output, expected, mismatches, cost)
38
+ ├── Eval::ExpectationEvaluator # expected / expected_traits / evaluator proc
39
+ ├── Eval::ModelComparison # compare_models result (table, best_for, candidate configs)
40
+ ├── Eval::Recommender # model recommendation algorithm (candidates → optimal config)
41
+ ├── Eval::Recommendation # recommendation result (best, retry_chain, savings, to_dsl)
42
+ ├── Eval::RetryOptimizer # optimize_retry_policy result (per-eval breakdown, fallback list)
43
+ ├── Eval::BaselineDiff # save_baseline! + without_regressions comparison
44
+ ├── Eval::PromptDiffComparator # compare_with prompt A/B diff
45
+ └── Eval::EvalHistory # time-series view across saved reports
46
+
47
+ RubyLLM::Contract::RakeTask # rake ruby_llm_contract:eval
48
+ RubyLLM::Contract::OptimizeRakeTask # rake ruby_llm_contract:optimize
49
+ RubyLLM::Contract::Railtie # auto-loads eval files in Rails
50
+ ```
@@ -0,0 +1,136 @@
1
+ # Best Practices
2
+
3
+ > Read this when you are writing your first `validate` rules and want patterns that catch real bugs instead of restating the schema.
4
+
5
+ Schema guarantees valid JSON structure. An LLM can still return structurally perfect JSON that is **semantically wrong**. Schema handles _shape_, validates handle _meaning_.
6
+
7
+ All examples extend the `SummarizeArticle` step from the [README](../../README.md).
8
+
9
+ ## 1. Guard against empty / placeholder output
10
+
11
+ **Why it matters:** a cheap model that answers `{"tldr": "This article discusses...", "takeaways": ["This article discusses X", ...]}` passes the schema but renders a broken UI card that tells the user nothing. The validate catches it before `Article.update!` persists it.
12
+
13
+ ```ruby
14
+ output_schema do
15
+ string :tldr
16
+ array :takeaways, of: :string, min_items: 3, max_items: 5
17
+ string :tone, enum: %w[neutral positive negative analytical]
18
+ end
19
+
20
+ validate("tldr is substantive") do |o, _|
21
+ o[:tldr].to_s.split.length >= 5 # at least five words
22
+ end
23
+
24
+ validate("no boilerplate takeaways") do |o, _|
25
+ o[:takeaways].none? { |t| t.downcase.start_with?("this article") }
26
+ end
27
+ ```
28
+
29
+ ## 2. Cross-validate output against input
30
+
31
+ **Why it matters:** a lazy model will return the article text verbatim as the "summary", or invent takeaways about topics the article never mentions. The 2-arity form is how you catch answers that are internally consistent but unfaithful to the actual input.
32
+
33
+ `validate` blocks with 2-arity `|output, input|` compare the model's answer against what was asked:
34
+
35
+ ```ruby
36
+ validate("tldr is shorter than the article") do |output, input|
37
+ output[:tldr].length < input.length / 2
38
+ end
39
+
40
+ validate("every takeaway appears, in spirit, in the article") do |output, input|
41
+ output[:takeaways].all? { |t|
42
+ # cheap keyword overlap heuristic
43
+ t.downcase.split.any? { |w| input.downcase.include?(w) && w.length > 4 }
44
+ }
45
+ end
46
+ ```
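The keyword-overlap heuristic above can be tried outside a step. This is a minimal sketch in plain Ruby, assuming the step input is a `String`; `grounded?` and `MIN_WORD_LENGTH` are hypothetical names for illustration and are not part of the gem. Unlike the `split`-based version, it strips punctuation before comparing words.

```ruby
# Hypothetical helper, not part of ruby_llm-contract: returns true when a
# takeaway shares at least one "long" word with the article text.
MIN_WORD_LENGTH = 5 # same threshold as `w.length > 4` in the validate above

def grounded?(takeaway, article)
  haystack = article.downcase
  # scan keeps only alphanumeric runs, so "speedups." matches "speedups"
  takeaway.downcase.scan(/[a-z0-9]+/).any? do |word|
    word.length >= MIN_WORD_LENGTH && haystack.include?(word)
  end
end

article = "Ruby 3.4 ships with frozen string literals and YJIT speedups."
puts grounded?("Frozen strings are now the default", article)
puts grounded?("Go adds generics", article)
```

A substring match like this is deliberately cheap and will accept loose overlaps; treat it as a tripwire for invented topics, not as a faithfulness score.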
47
+
48
+ ## 3. Conditional logic schema can't express
49
+
50
+ **Why it matters:** customer success filters on `tone == "negative"` to route angry users to a human. If the model labels an outage complaint "negative" but the takeaways are all positive-sounding, the filter runs on a label that doesn't match the content — the routing breaks silently.
51
+
52
+ ```ruby
53
+ validate("negative tone requires at least one concrete concern") do |output, _input|
54
+ next true unless output[:tone] == "negative"
55
+ output[:takeaways].any? { |t| t.match?(/fail|break|crash|outage|vulnerab|risk/i) }
56
+ end
57
+ ```
58
+
59
+ A model that picks `tone: "negative"` but gives three upbeat takeaways fails this check. Schema can't catch it because each takeaway is, individually, a valid string.
60
+
61
+ ## 4. Content quality
62
+
63
+ **Why it matters:** a TL;DR with `## Summary` leaks markdown into a plain-text card. A one-word takeaway ("Fast.") wastes a UI slot. A leaked `{article}` placeholder reveals the prompt template to end users. All pass schema; all embarrass you in front of customers.
64
+
65
+ ```ruby
66
+ validate("no markdown headings in the TL;DR") do |o, _|
67
+ !o[:tldr].match?(/^\#{1,6}\s/)
68
+ end
69
+
70
+ validate("takeaways aren't single words") do |o, _|
71
+ o[:takeaways].all? { |t| t.split.length >= 3 }
72
+ end
73
+
74
+ validate("no template placeholders leaked") do |o, _|
75
+ !(o[:tldr] + o[:takeaways].join(" ")).include?("{")
76
+ end
77
+ ```
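To see what each of the three checks accepts and rejects, the predicates can be run against sample hashes in plain Ruby. This is only a sketch: the method names and both sample outputs are made up for illustration, and the bodies mirror the `validate` blocks above.

```ruby
# Plain-Ruby copies of the three checks (method names are hypothetical).
def no_markdown_heading?(output)
  # \# keeps Ruby from reading #{...} as interpolation; ^ anchors at line starts
  !output[:tldr].match?(/^\#{1,6}\s/)
end

def multiword_takeaways?(output)
  output[:takeaways].all? { |t| t.split.length >= 3 }
end

def no_placeholder?(output)
  !(output[:tldr] + output[:takeaways].join(" ")).include?("{")
end

good = {
  tldr: "Ruby 3.4 ships frozen string literals by default.",
  takeaways: ["Frozen strings are default", "YJIT is faster on Rails", "Parser bugs were fixed"]
}
bad = {
  tldr: "## Summary\n{article}",
  takeaways: ["Fast.", "YJIT is faster on Rails", "Parser bugs were fixed"]
}

[good, bad].each do |sample|
  puts [no_markdown_heading?(sample), multiword_takeaways?(sample), no_placeholder?(sample)].inspect
end
```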
78
+
79
+ ## 5. Pipeline: preserve data between steps
80
+
81
+ In a pipeline, each step only sees the previous step's output. If a later step needs original article metadata, an intermediate step must carry it through. Suppose a pipeline `SummarizeArticle → GenerateHashtags`, where `GenerateHashtags` needs the `tone` from the summary:
82
+
83
+ ```ruby
84
+ class GenerateHashtags < RubyLLM::Contract::Step::Base
85
+ input_type Hash
86
+
87
+ output_schema do
88
+ # Carry through the fields a downstream step (or the caller) might need
89
+ string :tone, enum: %w[neutral positive negative analytical]
90
+ array :hashtags, of: :string, min_items: 2, max_items: 5
91
+ end
92
+
93
+ prompt do
94
+ rule "Preserve the tone label from the input unchanged."
95
+ user "Summary: {tldr}\nTone: {tone}\nTakeaways: {takeaways}"
96
+ end
97
+
98
+ validate("tone preserved") { |o, input| o[:tone] == input[:tone] }
99
+ end
100
+ ```
101
+
102
+ The explicit `validate("tone preserved")` catches the case where the model silently rewrites the tone during a downstream transform.
103
+
104
+ ## 6. Model fallback
105
+
106
+ **Why it matters:** 80% of production articles are short and simple — `gpt-4.1-nano` handles them for ~$0.0001. The remaining 20% are dense, critical, or multi-topic — those need `gpt-4.1-mini` or `gpt-4.1`. Paying `gpt-4.1` rates for every call when nano is enough for most is throwing money away. Contracts tell you when nano wasn't enough, so fallback is cost-aware, not hope-based.
107
+
108
+ Small models are cheap but hallucinate. Big models are accurate but expensive. Start cheap, fall back only when validates catch a failure:
109
+
110
+ ```ruby
111
+ class SummarizeArticle < RubyLLM::Contract::Step::Base
112
+ output_schema do
113
+ string :tldr
114
+ array :takeaways, of: :string, min_items: 3, max_items: 5
115
+ string :tone, enum: %w[neutral positive negative analytical]
116
+ end
117
+
118
+ validate("TL;DR fits the card") { |o, _| o[:tldr].length <= 200 }
119
+ validate("takeaways are unique") { |o, _| o[:takeaways] == o[:takeaways].uniq }
120
+
121
+ retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
122
+ end
123
+ ```
124
+
125
+ **Key insight:** without contracts, you can't do model fallback — you'd have no way to know if the cheap model's output is good enough. Validates are the quality gate that makes cost optimization possible. See [Optimizing retry_policy](optimizing_retry_policy.md) for how to find the cheapest viable fallback list for your step.
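The idea that validates act as the quality gate for fallback can be sketched without the gem. The code below is an illustration under stated assumptions, not the gem's `RetryExecutor`: `FAKE_RESPONSES` stands in for real LLM calls, the model names are placeholders, and `run_with_fallback` is a hypothetical helper that expects a non-empty model list.

```ruby
# Stand-in for LLM calls: maps a placeholder model name to a canned output.
FAKE_RESPONSES = {
  "cheap-model"  => { tldr: "x" * 300, takeaways: %w[a b c] },
  "strong-model" => { tldr: "Short summary.", takeaways: %w[a b c] }
}.freeze

# Same two rules as the step above, expressed as plain lambdas.
VALIDATORS = {
  "TL;DR fits the card"  => ->(o) { o[:tldr].length <= 200 },
  "takeaways are unique" => ->(o) { o[:takeaways] == o[:takeaways].uniq }
}.freeze

# Tries each model in order and stops at the first output that passes every
# validator. On exhaustion it reports the last attempt's status and output.
def run_with_fallback(models, responses: FAKE_RESPONSES, validators: VALIDATORS)
  attempts = []
  output = nil
  models.each do |model|
    output = responses.fetch(model)
    failed = validators.reject { |_name, check| check.call(output) }.keys
    attempts << { model: model, status: failed.empty? ? :ok : :validation_failed, failed: failed }
    break if failed.empty?
  end
  { status: attempts.last[:status], output: output, attempts: attempts }
end

result = run_with_fallback(%w[cheap-model strong-model])
puts result[:status].inspect
puts result[:attempts].map { |a| [a[:model], a[:status]] }.inspect
```

Without the validators there is nothing to make the loop stop or continue, which is the point of the key insight: the fallback list is only as good as the checks that gate it.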
126
+
127
+ ## Summary
128
+
129
+ | What to validate | Use |
130
+ |---|---|
131
+ | Field types, enums, ranges, required fields | `output_schema` |
132
+ | Output makes sense given the input | `validate` (2-arity `\|output, input\|`) |
133
+ | Conditional business rules | `validate` |
134
+ | Content quality (not empty, not template) | `validate` |
135
+ | Data preserved across pipeline steps | `validate` + schema carry-through |
136
+ | Cost optimization via model fallback | `retry_policy` + `validate` as quality gate |
@@ -0,0 +1,192 @@
1
+ # Eval-First
2
+
3
+ > Read this when you need to prevent silent prompt regressions in CI. Skip if your LLM output is evaluated only by humans and never gated in an automated pipeline.
4
+
5
+ If you change prompts by feel, you ship regressions by feel.
6
+
7
+ Concrete scenario: `SummarizeArticle` has been running in production for two weeks. Customer success notices that complaints about service outages keep getting `tone: "analytical"` instead of `"negative"` — so their "critical feedback" filter silently misses angry users. Someone tweaks the system prompt to emphasise negative sentiment. It fixes the outage article but now three neutral product-update articles get misclassified as `"negative"`. You find out from a Slack thread.
8
+
9
+ That is the cost of prompt-by-feel. Evals are how you stop it.
10
+
11
+ `ruby_llm-contract` works best when you treat evals as the source of truth:
12
+
13
+ 1. Capture real failures from production (the outage article, verbatim).
14
+ 2. Turn them into eval cases (`add_case "service outage complaint"`).
15
+ 3. Change the prompt.
16
+ 4. Re-run the same eval — plus all previously-passing cases.
17
+ 5. Merge only if the eval says quality improved or stayed safe on every case.
18
+
19
+ ## Core rule
20
+
21
+ **Do not start with the prompt. Start with the eval.**
22
+
23
+ Using the `SummarizeArticle` step from the [README](../../README.md):
24
+
25
+ ```ruby
26
+ SummarizeArticle.define_eval("regression") do
27
+ add_case "ruby release",
28
+ input: "Ruby 3.4 shipped with frozen string literals...",
29
+ expected: { tone: "analytical" } # partial match
30
+
31
+ add_case "critical review",
32
+ input: "Mesh networking hardware failed under load...",
33
+ expected: { tone: "negative" }
34
+ end
35
+ ```
36
+
37
+ Only after the eval exists, touch: `system`, `rule`, `example`, `validate`, prompt versions.
38
+
39
+ ## Three eval kinds
40
+
41
+ ### 1. `smoke` — wiring check, offline
42
+
43
+ ```ruby
44
+ SummarizeArticle.define_eval("smoke") do
45
+ default_input "Ruby 3.4 shipped with frozen string literals..."
46
+ sample_response({
47
+ tldr: "...",
48
+ takeaways: ["point one", "point two", "point three"],
49
+ tone: "analytical"
50
+ })
51
+ end
52
+ ```
53
+
54
+ `sample_response` returns canned data, so the eval makes zero API calls. It verifies that the schema and `validate` blocks accept the canned response and that the step wiring is intact. **Not a quality signal.**
55
+
56
+ ### 2. `regression` — real quality measurement
57
+
58
+ Represent real traffic and known failures. Good sources: production logs, bad completions, incidents, QA edge cases, cases a human had to correct.
59
+
60
+ Every production failure becomes `add_case`. That's the flywheel.
61
+
62
+ ### 3. `ab` — prompt iteration
63
+
64
+ Compare two prompt versions on the same eval:
65
+
66
+ ```ruby
67
+ diff = SummarizeArticleV2.compare_with(
68
+ SummarizeArticleV1,
69
+ eval: "regression",
70
+ model: "gpt-4.1-mini"
71
+ )
72
+
73
+ diff.safe_to_switch? # => true if no cases regressed
74
+ ```
75
+
76
+ This is the cleanest eval-first move: same eval, same cases, two prompt versions, one answer.
77
+
78
+ ## What counts as eval-first
79
+
80
+ **Good** — eval exists before the prompt change:
81
+
82
+ ```ruby
83
+ SummarizeArticle.define_eval("regression") do
84
+ add_case "short article", input: "...", expected: { tone: "neutral" }
85
+ end
86
+
87
+ # Prompt iteration happens afterward
88
+ diff = SummarizeArticleV2.compare_with(
89
+ SummarizeArticleV1, eval: "regression", model: "gpt-4.1-mini"
90
+ )
91
+ ```
92
+
93
+ **Bad**:
94
+
95
+ ```ruby
96
+ # Tweak prompt for an hour
97
+ # Maybe add an example
98
+ # Maybe tighten a rule
99
+ # Then eyeball one or two responses
100
+ ```
101
+
102
+ That's prompt guessing, not eval-first.
103
+
104
+ ## `sample_response`: useful, but not the main thing
105
+
106
+ Good for: offline smoke tests, local development, testing evaluator wiring, checking schema + validate behavior with zero API calls.
107
+
108
+ Not enough for real prompt decisions. For those:
109
+
110
+ - `run_eval(..., context: { model: "..." })` with a real model, or pass an explicit adapter.
111
+ - `compare_with(...)` for prompt A/B.
112
+
113
+ `compare_with` intentionally ignores `sample_response` — canned data would make both sides look the same.
114
+
115
+ ## Parallel eval runs
116
+
117
+ For larger datasets, `run_eval` accepts a `concurrency:` argument — cases run in parallel using a thread pool:
118
+
119
+ ```ruby
120
+ report = SummarizeArticle.run_eval("regression",
121
+ context: { model: "gpt-4.1-mini" },
122
+ concurrency: 8)
123
+ ```
124
+
125
+ `compare_models` and `optimize_retry_policy` accept the same `concurrency:` argument. The value is a ceiling on the thread count, and results are returned in dataset order. Keep it low enough to respect the provider's rate limits.
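How a bounded thread pool can keep results in dataset order is easy to show in plain Ruby. This is a sketch of the general technique, not the gem's `Eval::Runner`; `run_in_parallel` is a hypothetical helper and assumes `concurrency` is at least 1.

```ruby
# Runs each case through the block on at most `concurrency` threads and
# returns the results in the same order as `cases`.
def run_in_parallel(cases, concurrency:, &block)
  results = Array.new(cases.length)
  queue = Queue.new
  cases.each_with_index { |kase, index| queue << [kase, index] }
  queue.close # once drained, pop returns nil and the workers exit

  workers = Array.new([concurrency, cases.length].min) do
    Thread.new do
      while (item = queue.pop)
        kase, index = item
        # each worker writes only to its own slot, so order is preserved
        results[index] = block.call(kase)
      end
    end
  end
  workers.each(&:join)
  results
end

# The slowest case is first, yet the output order matches the input order.
p run_in_parallel([3, 1, 2], concurrency: 2) { |n| sleep(n * 0.01); n * 10 }
```

Writing each result into a preallocated slot by index is what decouples completion order from reporting order; the queue only decides which thread picks up which case.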
126
+
127
+ ## Budgeting an eval before you run it
128
+
129
+ `estimate_eval_cost` gives you a cost projection without calling the LLM:
130
+
131
+ ```ruby
132
+ SummarizeArticle.estimate_eval_cost("regression",
133
+ models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1])
134
+ # => { "gpt-4.1-nano" => 0.00041, "gpt-4.1-mini" => 0.0018, "gpt-4.1" => 0.0092 }
135
+ ```
136
+
137
+ Use it in CI to decide which models are worth running regression on, or to cap worst-case spend per build.
138
+
139
+ ## Team workflow
140
+
141
+ 1. **Build one eval that matters** — 10–30 cases representing real mistakes and important business paths.
142
+ 2. **Gate CI** — `pass_eval("regression").with_context(model: "...").with_minimum_score(0.8)`. See [Getting Started](getting_started.md) for the full matcher chain.
143
+ 3. **Save a baseline** — `report.save_baseline!` makes quality drift visible.
144
+ 4. **Change prompts only through comparison** — `pass_eval(...).compared_with(SummarizeArticleV1)` in CI so any regression blocks the merge.
145
+ 5. **Feed production failures back** — every miss in prod → new `add_case`, then fix. The eval gets stronger over time.
146
+
147
+ ## Few-shot examples fit naturally
148
+
149
+ Adding `example input: ..., output: ...` inside the prompt is still a prompt change. The eval-first way:
150
+
151
+ 1. Add examples to the prompt.
152
+ 2. Rerun the existing regression eval.
153
+ 3. `compare_with` against the old prompt.
154
+
155
+ Few-shot isn't the proof. The eval is.
156
+
157
+ ## Model selection comes after prompt stability
158
+
159
+ Don't optimize cost before you stabilize quality:
160
+
161
+ 1. Build `regression`.
162
+ 2. Improve the prompt with `compare_with`.
163
+ 3. Lock quality in CI.
164
+ 4. Then run `compare_models` (see [Optimizing retry_policy](optimizing_retry_policy.md)).
165
+
166
+ ```ruby
167
+ comparison = SummarizeArticle.compare_models(
168
+ "regression",
169
+ candidates: [{ model: "gpt-4.1-nano" }, { model: "gpt-4.1-mini" }, { model: "gpt-4.1" }]
170
+ )
171
+
172
+ comparison.best_for(min_score: 0.95)
173
+ ```
174
+
175
+ ## Strong defaults for teams
176
+
177
+ - `smoke` uses `sample_response`.
178
+ - `regression` uses real model calls.
179
+ - Every prompt change uses `compare_with`.
180
+ - Every merge runs `pass_eval`.
181
+ - Every production failure becomes a new `add_case`.
182
+
183
+ ## Short version
184
+
185
+ 1. Write `define_eval` before touching the prompt.
186
+ 2. Treat `sample_response` as smoke only.
187
+ 3. Use `run_eval("name", context: { model: "..." })` for real quality measurement.
188
+ 4. Use `compare_with` for every serious prompt change.
189
+ 5. Gate merges with `pass_eval`.
190
+ 6. Feed every production miss back into the dataset.
191
+
192
+ Prompts stop being vibes and start being engineering.
@@ -0,0 +1,199 @@
1
+ # Getting Started
2
+
3
+ > Read this to walk through every feature on one concrete step for the first time.
4
+
5
+ The README shows a minimal `SummarizeArticle` step. This guide walks through the features you reach for as production requirements grow: budget caps so runaway inputs don't drain your LLM provider budget, evals so you catch regressions in CI, and CI gating so a merge that lowers accuracy gets blocked.
6
+
7
+ ## The walkthrough
8
+
9
+ Start with the README example, then add features one layer at a time. Each is optional — use what you need.
10
+
11
+ ```ruby
12
+ class SummarizeArticle < RubyLLM::Contract::Step::Base
13
+ # 1. Prompt (required)
14
+ prompt <<~PROMPT
15
+ Summarize this article for a UI card. Return a short TL;DR,
16
+ 3 to 5 key takeaways, and a tone label.
17
+
18
+ {input}
19
+ PROMPT
20
+
21
+ # 2. Schema — sent to the provider via with_schema, validated client-side
22
+ output_schema do
23
+ string :tldr
24
+ array :takeaways, of: :string, min_items: 3, max_items: 5
25
+ string :tone, enum: %w[neutral positive negative analytical]
26
+ end
27
+
28
+ # 3. Business rules — things JSON Schema cannot express
29
+ validate("TL;DR fits the card") { |o, _| o[:tldr].length <= 200 }
30
+ validate("takeaways are unique") { |o, _| o[:takeaways] == o[:takeaways].uniq }
31
+
32
+ # 4. Retry with model fallback on validation_failed / parse_error
33
+ retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
34
+
35
+ # 5. Refuse before calling the LLM if input is too large or estimated cost exceeds the cap
36
+ max_input 2_000
37
+ max_output 4_000
38
+ max_cost 0.01
39
+ end
40
+ ```
41
+
42
+ ## Validation and retry behavior
43
+
44
+ When the cheap model returns output that fails a `validate` block or can't be parsed, retry falls back to the next model in `models:` and tries again.
45
+
46
+ ```ruby
47
+ result = SummarizeArticle.run(article_text)
48
+
49
+ result.status # => :ok
50
+ result.parsed_output # => { tldr: "...", takeaways: [...], tone: "analytical" }
51
+ result.trace[:model] # => "gpt-4.1-mini" (first model that passed)
52
+ result.trace[:cost] # => 0.00052 (sum of all attempts)
53
+ result.trace[:attempts]
54
+ # => [
55
+ # { attempt: 1, model: "gpt-4.1-nano", status: :validation_failed,
56
+ # cost: 0.00010, latency_ms: 45, ... },
57
+ # { attempt: 2, model: "gpt-4.1-mini", status: :ok,
58
+ # cost: 0.00042, latency_ms: 92, ... }
59
+ # ]
60
+ ```
61
+
62
+ If the whole chain exhausts, `result.status` is the status of the last attempt (`:validation_failed` or `:parse_error`) and `result.parsed_output` is the last attempt's output. The caller decides what to do — ship it anyway, fall back to a template, or raise.
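What "the caller decides" might look like is sketched below. `FakeResult` is a stand-in that models only the `status` and `parsed_output` readers mentioned above, and `summary_for_card` and `FALLBACK_SUMMARY` are hypothetical names; the choice of falling back to a template is one option among the three listed, not a recommendation from the gem.

```ruby
# Stand-in for the step result; only the two readers used here are modelled.
FakeResult = Struct.new(:status, :parsed_output, keyword_init: true)

FALLBACK_SUMMARY = { tldr: "Summary unavailable.", takeaways: [], tone: "neutral" }.freeze

# Returns the hash to render on the UI card for a given step result.
def summary_for_card(result)
  case result.status
  when :ok
    result.parsed_output
  when :validation_failed, :parse_error
    FALLBACK_SUMMARY # chain exhausted: show a template instead of bad output
  else
    raise "unexpected step status: #{result.status.inspect}"
  end
end

ok     = FakeResult.new(status: :ok, parsed_output: { tldr: "Ruby 3.4 is out.", takeaways: [], tone: "neutral" })
failed = FakeResult.new(status: :validation_failed, parsed_output: { tldr: "x" * 300 })

puts summary_for_card(ok)[:tldr]
puts summary_for_card(failed)[:tldr]
```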
63
+
64
+ ### Per-attempt reasoning effort
65
+
66
+ `models:` accepts config hashes as well as model-name strings, so a fallback can "try harder" (more reasoning) on retry, not just switch model:
67
+
68
+ ```ruby
69
+ retry_policy models: [
70
+ "gpt-5-nano", # attempt 1: cheap + fast
71
+ { model: "gpt-5-mini", reasoning_effort: "high" } # attempt 2: stronger + more reasoning
72
+ ]
73
+ ```
74
+
75
+ `reasoning_effort` is a gpt-5 family feature (gpt-4.1 is not a reasoning model). The per-attempt value is forwarded via `with_thinking`, which is provider-agnostic: both OpenAI `reasoning_effort` and the Anthropic extended-thinking budget are supported. When it is set on a non-reasoning model, the value is still forwarded unchanged, and the provider will either ignore it or reject the request; the gem does not guard against this.
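A mixed list of strings and hashes is easiest to reason about once every entry has the same shape. The sketch below shows one way such a list could be normalized; it is an assumption for illustration, since the gem's internal handling is not shown in this diff, and `normalize_models` is a hypothetical name.

```ruby
# Hypothetical normalizer: turns a mixed `models:` list into uniform config
# hashes, one per attempt.
def normalize_models(models)
  models.map do |entry|
    config =
      case entry
      when String then { model: entry }
      when Hash   then entry.transform_keys(&:to_sym)
      else raise ArgumentError, "expected a model name or config hash, got #{entry.inspect}"
      end
    raise ArgumentError, "config hash needs a :model key: #{entry.inspect}" unless config[:model]
    config
  end
end

p normalize_models(["gpt-5-nano", { model: "gpt-5-mini", reasoning_effort: "high" }])
```

With every attempt reduced to a hash, a string entry is just the case where no extra options are set.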
76
+
77
+ ## Evals and CI gates
78
+
79
+ An eval is a named scenario you can run to verify the step still works. `sample_response` makes it offline — zero API calls — so CI can run it on every merge without burning budget.
80
+
81
+ ```ruby
82
+ SummarizeArticle.define_eval("smoke") do
83
+ default_input <<~ARTICLE
84
+ Ruby 3.4 ships with frozen string literals on by default, measurable YJIT
85
+ speedups on Rails workloads, and tightened Warning.warn category filtering.
86
+ The release notes also mention several parser fixes and faster keyword
87
+ argument handling.
88
+ ARTICLE
89
+
90
+ sample_response({
91
+ tldr: "Ruby 3.4 brings frozen string literals by default, YJIT speedups, and parser fixes.",
92
+ takeaways: [
93
+ "Frozen string literals are the default",
94
+ "YJIT adds measurable speedups on Rails workloads",
95
+ "Warning.warn category filtering is tighter"
96
+ ],
97
+ tone: "analytical"
98
+ })
99
+ end
100
+
101
+ report = SummarizeArticle.run_eval("smoke")
102
+ report.passed? # => true — schema + validates pass on the canned response
103
+ report.score # => 1.0
104
+ report.print_summary
105
+ ```
106
+
107
+ For real regression testing, define cases with expected output (online — calls the LLM):
108
+
109
+ ```ruby
110
+ SummarizeArticle.define_eval("regression") do
111
+ add_case "ruby release",
112
+ input: "Ruby 3.4 was released...",
113
+ expected: { tone: "analytical" } # partial match
114
+
115
+ add_case "critical review",
116
+ input: "The new mesh networking hardware failed under load...",
117
+ expected: { tone: "negative" }
118
+ end
119
+ ```
120
+
121
+ Gate CI on score and cost thresholds:
122
+
123
+ ```ruby
124
+ # RSpec — blocks merge if accuracy drops or cost spikes
125
+ expect(SummarizeArticle).to pass_eval("regression")
126
+ .with_minimum_score(0.8)
127
+ .with_maximum_cost(0.01)
128
+ ```
129
+
130
+ Save a baseline once, then block regressions automatically:
131
+
132
+ ```ruby
133
+ report = SummarizeArticle.run_eval("regression")
134
+ report.save_baseline!
135
+
136
+ # In CI:
137
+ expect(SummarizeArticle).to pass_eval("regression").without_regressions
138
+ ```
139
+
140
+ `without_regressions` fails the build only if a previously-passing case now fails, for example after a new model version, a prompt tweak, or an upstream change that silently lowered quality.
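The rule "fail only if a previously-passing case now fails" can be written as a small comparison. This is a sketch of the rule as described, not the gem's `Eval::BaselineDiff`; the `regressions` helper and the name-to-boolean hash shape are assumptions made for illustration.

```ruby
# `baseline` and `current` map case name => passed? (true or false).
# A regression is a case that passed in the baseline and fails now.
# Cases that are new, or that already failed in the baseline, are ignored.
def regressions(baseline, current)
  baseline.select { |name, passed| passed && current[name] == false }.keys
end

baseline = { "ruby release" => true,  "critical review" => false }
current  = { "ruby release" => false, "critical review" => false, "new case" => false }

p regressions(baseline, current)
```

Note that a minimum-score gate and a regression gate answer different questions: the first bounds overall quality, the second protects individual cases that used to work.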
141
+
142
+ ## Budget caps
143
+
144
+ `max_input`, `max_output`, and `max_cost` are preflight checks — the LLM is never called if an estimate exceeds the limit. Zero tokens spent, zero cost.
145
+
146
+ ```ruby
147
+ result = SummarizeArticle.run(giant_10mb_document)
148
+ result.status # => :limit_exceeded
149
+ result.validation_errors
150
+ # => ["Input token limit exceeded: estimated 32000 tokens (heuristic ±30%), max 2000"]
151
+ ```
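The shape of a preflight input check can be sketched in a few lines. The four-characters-per-token ratio below is a common rough heuristic and is an assumption here, not the gem's `TokenEstimator`; `preflight` and `CHARS_PER_TOKEN` are hypothetical names.

```ruby
CHARS_PER_TOKEN = 4 # assumed rough ratio, not the gem's actual estimator

# Returns a status hash without calling any model: the point of a preflight
# check is that the refusal costs nothing.
def preflight(input, max_input:)
  estimated = (input.length / CHARS_PER_TOKEN.to_f).ceil
  if estimated > max_input
    { status: :limit_exceeded,
      errors: ["Input token limit exceeded: estimated #{estimated} tokens, max #{max_input}"] }
  else
    { status: :ok, errors: [] }
  end
end

p preflight("a" * 8_004, max_input: 2_000)
p preflight("a" * 8_000, max_input: 2_000)
```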
152
+
153
+ `max_cost` fails closed when the model's pricing isn't known — register custom or fine-tuned models explicitly:
154
+
155
+ ```ruby
156
+ RubyLLM::Contract::CostCalculator.register_model("ft:gpt-4o-custom",
157
+ input_per_1m: 3.0, output_per_1m: 6.0)
158
+ ```
159
+
160
+ Or opt into a soft warning instead of a refusal when pricing is missing:
161
+
162
+ ```ruby
163
+ max_cost 0.01, on_unknown_pricing: :warn
164
+ ```
165
+
166
+ Default is `:refuse`. Use `:warn` only when you accept running without a cost ceiling (fine-tuned models you trust, private endpoints).
167
+
168
+ ### Preflight cost estimates
169
+
170
+ Check what a call is likely to cost before invoking it:
171
+
172
+ ```ruby
173
+ SummarizeArticle.estimate_cost(input: article_text)
174
+ # => {
175
+ # model: "gpt-4.1-mini",
176
+ # input_tokens: 812, output_tokens_estimate: 4000,
177
+ # estimated_cost: 0.00243
178
+ # }
179
+
180
+ # Estimate what a full eval would cost across candidate models
181
+ SummarizeArticle.estimate_eval_cost("regression",
182
+ models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1])
183
+ # => { "gpt-4.1-nano" => 0.00041, "gpt-4.1-mini" => 0.0018, "gpt-4.1" => 0.0092 }
184
+ ```
185
+
186
+ `estimate_cost` returns `nil` when pricing isn't registered. `estimate_eval_cost` silently treats unknown-pricing cases as `$0.00` and sums the rest — it does **not** fail closed the way `max_cost` does. Treat its output as a floor, not a guarantee; register pricing via `CostCalculator.register_model` before relying on it for budget decisions.
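The difference between "floor" and "fail closed" is clearer with numbers. The sketch below uses made-up pricing and hypothetical helper names (`call_cost`, `lenient_total`, `strict_total`); it illustrates the two summing policies described above and is not the gem's `CostCalculator`.

```ruby
# Made-up pricing table: dollars per 1M tokens for one known model.
PRICING = { "known-model" => { input_per_1m: 0.10, output_per_1m: 0.40 } }.freeze

# Cost of one call, or nil when the model's pricing is not registered.
def call_cost(model:, input_tokens:, output_tokens:)
  price = PRICING[model]
  return nil unless price

  (input_tokens * price[:input_per_1m] + output_tokens * price[:output_per_1m]) / 1_000_000.0
end

# Unknown pricing counts as 0.0, so the total is only a lower bound.
def lenient_total(calls)
  calls.sum { |c| call_cost(**c) || 0.0 }
end

# Fail closed: a single unknown price makes the whole estimate unknown.
def strict_total(calls)
  costs = calls.map { |c| call_cost(**c) }
  costs.any?(&:nil?) ? nil : costs.sum
end

calls = [
  { model: "known-model",   input_tokens: 1_000_000, output_tokens: 0 },
  { model: "mystery-model", input_tokens: 1_000_000, output_tokens: 0 }
]
p lenient_total(calls)
p strict_total(calls)
```

The lenient total looks like a complete answer even though half the calls were never priced, which is why it should not be used as a budget ceiling.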
187
+
188
+ ## `output_schema` vs `with_schema`
189
+
190
+ `with_schema` in `ruby_llm` tells the provider to force a specific JSON structure. `output_schema` in this gem does the same thing (calls `with_schema` under the hood) **plus** validates the response client-side. Cheaper models sometimes ignore schema constraints — `with_schema` is a request; `output_schema` is a request plus verification.
191
+
192
+ ## See also
193
+
194
+ - [Prompt AST](prompt_ast.md) — prompt DSL variants: `system`, `rule`, `section`, `example`, `user`, and dynamic prompts with `|input|`.
195
+ - [Eval-First](eval_first.md) — datasets, baselines, A/B gates, the workflow that makes the above evals useful.
196
+ - [Optimizing retry_policy](optimizing_retry_policy.md) — find the cheapest viable fallback list with `compare_models` and `optimize_retry_policy`.
197
+ - [Testing](testing.md) — test adapter, `stub_step`, full RSpec + Minitest matcher reference.
198
+ - [Output Schema](output_schema.md) — nested objects in arrays, constraints, pattern reference.
199
+ - [Rails integration](rails_integration.md) — where step classes live, initializer, jobs, logging, specs, CI gate.