ruby_llm-contract 0.7.0 → 0.7.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 5decfb7456338baa05d0e6bb79287bb3fb3af0e0cc2c3f001e09122fa76ac298
4
- data.tar.gz: be3b46ff015fd651e885c4c078dbdaf06623865f4e4b80b864b66e8dd9e16c34
3
+ metadata.gz: c9b5adda942a6e1f3393d2d05e8226d210b25088c65c2c20160b2ffc6a493533
4
+ data.tar.gz: 91110ddd5d2fc39a8e3f6cd9ec912c707ceef7da2ee40e8805598a787ae1c067
5
5
  SHA512:
6
- metadata.gz: fd3882613ac2b500c46dc6b1084d8f298db96800bf01932e3bc2638a7a3d2a8588610c1de8a1f3754a8e15fb7b1ef29c0c0a7bddd11709cd95bbf12fcd48e83e
7
- data.tar.gz: 4cb584323a5575de4131b0eb82cdb1426743ca6651030708673d17264c9fec60619bb7d5d54f62eee4c1fa357b4022f57bf08b95d81f153eccf8e625ce3ef5e7
6
+ metadata.gz: 543779ceacb617909e7e45355473fb300d9a5f81206ee51019d4f868b3ea66d43272b04be78aa95d0a014ea16619ab57c933eb58f1d17d8fa3845a957d6c750a
7
+ data.tar.gz: eea6b49a1cb01de6b501bddec6fcfcccbc221a5448a72a435f4c04e80131a7994757b3c337390e1b6212a2709ae082a5d55c82158535a5789e606b7ef13d27f9
data/CHANGELOG.md CHANGED
@@ -1,5 +1,71 @@
1
1
  # Changelog
2
2
 
3
+ ## 0.7.3 (2026-04-24)
4
+
5
+ Adoption-friction release. No runtime behavior changes — every delta is in `docs/`, `examples/`, or `spec/integration/` (plus the `version.rb` / Gemfile.lock bumps). Upgrading from 0.7.2 picks up the expanded guide set, the new runnable showcases, and one extra integration spec.
6
+
7
+ ### Documentation
8
+
9
+ - **New guide: `docs/guide/why.md`** — four production failure modes the gem exists for (schema-valid logically wrong, silent prompt regression, sampling variance on fixed-temperature models, runaway cost). Opens from a concrete incident each time; designed for readers who have not yet felt the pain the gem relieves.
10
+ - **New guide: `docs/guide/rails_integration.md`** — seven Rails-specific FAQs with runnable snippets: where step classes live (`app/contracts/`), initializer setup, background jobs, `around_call` observability, RSpec/Minitest stubs, error handling in controllers, CI gate wiring.
11
+ - **README adoption-friction pass** — added a short "Do I need this?" block after Install, a reading-order hint (`README → why.md → getting_started.md`), and outcome-based labels in the docs index ("Prevent silent prompt regressions" instead of "Eval-First", etc.).
12
+ - **TL;DR box at the top of every guide** — single-sentence orientation for readers who land via search; "Skip if" clause added where real confusion exists (`eval_first.md`, `testing.md`, `migration.md`).
13
+ - **API coverage gaps closed** — `estimate_cost` / `estimate_eval_cost`, `max_cost on_unknown_pricing: :warn`, `run_eval(..., concurrency:)`, `around_call` testing patterns now documented in `getting_started.md`, `eval_first.md`, `testing.md`.
14
+ - **Industry-standard terminology** — `temperature-locked` → `fixed-temperature`, `variance-induced` → `sampling variance`, `severity signals` → `severity keywords`, `takeaway drift` → `tone/takeaways mismatch`.
15
+ - **`docs/architecture.md` refresh** — diagram now reflects the current class layout: added `Step::RetryPolicy`, `Pipeline::Result`, `Eval::AggregatedReport`, `Eval::BaselineDiff`, `Eval::PromptDiffComparator`, `Eval::EvalHistory`, `Eval::RetryOptimizer`, `OptimizeRakeTask`. Replaced the outdated `Eval::TraitEvaluator` entry with `Eval::ExpectationEvaluator`.
16
+ - **Business framing added to guides** — every guide opens with a concrete production scenario or "why it matters" hook before the API reference.
17
+
18
+ ### Examples — consolidated on `SummarizeArticle`, renumbered 00-06
19
+
20
+ The previous 12-file set mixed a private Reddit promo planner, customer support, meetings, keyword extraction, and translation. The new set is seven runnable files, each answering one adopter question on the README's `SummarizeArticle` case.
21
+
22
+ | # | File | Answers |
23
+ |---|------|---------|
24
+ | 00 | `00_basics.rb` | How do I start? (seven incremental layers + real-LLM pointer) |
25
+ | 01 | `01_fallback_showcase.rb` | Show me the gem in 30 seconds (zero API keys) |
26
+ | 02 | `02_real_llm_minimal.rb` | How do I plug in a real LLM? (~30 lines) |
27
+ | 03 | `03_summarize_with_keywords.rb` | How does the contract evolve? (growing prompt) |
28
+ | 04 | `04_summarize_and_translate.rb` | Pipeline composition + pipeline-level `run_eval` |
29
+ | 05 | `05_eval_dataset.rb` | How do I stop silent prompt regressions? |
30
+ | 06 | `06_retry_variants.rb` | `attempts: 3`, `reasoning_effort` escalation, cross-provider (Ollama → Anthropic → OpenAI) |
31
+
32
+ Every file carries an "Expected output" block in its header so readers see the result without running the script. The `docs/ideas/` directory is now fully untracked (already in `.gitignore`; one stray file removed from version control).
33
+
34
+ ### Examples — bug fixes carried along
35
+
36
+ - **Schema pitfall fixed in 5 files** — `array :x do; string :y; ...; end` silently produces `items: string` and drops every declaration after the first, matching the documented pitfall in `spec/ruby_llm/contract/nested_schema_spec.rb:71`. Every affected array block is now wrapped in `object do...end`.
37
+ - **`examples/05_eval_dataset.rb` (pre-renumber: `09_eval_dataset.rb`) `result[:passed]` → `result.passed?`** — the previous code called `[]` on an `Eval::CaseResult` and raised `NoMethodError` at runtime.
38
+
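Note on the `result[:passed]` fix above: the failure mode is that a plain Ruby object without a `[]` method raises `NoMethodError` when indexed like a hash. The sketch below reproduces that with a hypothetical `FakeCaseResult` class; it only assumes, as the changelog states, that the real `Eval::CaseResult` answers `passed?` and does not define `[]`.

```ruby
# Hypothetical stand-in for an eval case result (not the gem's class).
class FakeCaseResult
  def initialize(passed)
    @passed = passed
  end

  # Predicate-style accessor, as in the fixed example.
  def passed?
    @passed
  end
end

result = FakeCaseResult.new(true)

# Hash-style access fails because FakeCaseResult defines no `[]` method.
hash_style_error =
  begin
    result[:passed]
    nil
  rescue NoMethodError => e
    e
  end

puts "hash-style access raised: #{hash_style_error.class}"
puts "predicate access returned: #{result.passed?}"
```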
39
+ ### Testing
40
+
41
+ - **New `spec/integration/pipeline_eval_spec.rb`** — three cases guaranteeing pipeline-level `run_eval` stays functional: happy path, final-step mismatch, and fail-fast propagation when an intermediate `validate` rejects. Closes the "09 STEP 5 pipeline evaluation" known issue flagged in the 0.7.2 release. The fail-fast case asserts `step_status == :validation_failed` and the validate's label in `details`, so a regression that short-circuits on schema instead of validate would fail loudly.
42
+
43
+ ### Deleted (private-project cleanup)
44
+
45
+ - `examples/01_classify_threads.rb`, `02_generate_comment.rb`, `03_target_audience.rb`, `10_reddit_full_showcase.rb`, `spec/integration/reddit_pipeline_spec.rb` — Reddit Promo Planner was a separate private project; its examples do not belong in the gem's public repo.
46
+ - `examples/02_output_schema.rb` — fully covered by `docs/guide/output_schema.md`; deleting avoids duplication.
47
+
48
+ ## 0.7.2 (2026-04-22)
49
+
50
+ ### Changed
51
+
52
+ - **Terminal output labels renamed for consistency with README narrative.** `print_summary` now prints `Hardest eval` (was `Constraining eval`), `Suggested fallback list` (was `Suggested chain`), and the production-mode table uses `first-attempt` / `fallback %` as column headers (was `single-shot` / `escalation`). Programmatic metric names unchanged: `single_shot_cost`, `single_shot_latency_ms`, `escalation_rate`. `RetryOptimizer::Result` exposes `hardest_eval` as an alias for `constraining_eval`.
53
+
54
+ ### Documentation
55
+
56
+ - **`docs/guide/optimizing_retry_policy.md` rewritten** — 17.7k → 6.4k characters. Continues the `SummarizeArticle` narrative from README. Offline mode clearly positioned as wiring-check; real optimization runs via `LIVE=1 RUNS=3`. Output samples match actual `print_summary` format.
57
+ - **`docs/guide/getting_started.md` rewritten** — 8.7k → 6.1k. Every example uses `SummarizeArticle`. Evals + CI gates section moved before Budget caps. Structured Prompts / Dynamic Prompts / "Already using ruby_llm?" / Reasoning effort sections removed; content delegated to `prompt_ast.md` and README.
58
+ - **`docs/guide/eval_first.md` refined** — 6.3k → 5.0k. Switched to `SummarizeArticle` case. Team workflow section compressed with links back to `getting_started.md` for the matcher chain.
59
+ - **`docs/guide/testing.md` refined** — 10.7k → 7.4k. Switched to `SummarizeArticle` case. Threshold gating / Rake task / baseline walkthrough / prompt A/B sections delegated back to `getting_started.md` and `eval_first.md`.
60
+ - **`docs/guide/output_schema.md` DSL bug fix** — the Supported constraints table documented JSON Schema camelCase keys (`minLength`, `minItems`, `additionalProperties`) that are not valid DSL arguments. Every copy-paste from the previous table would have raised `ArgumentError`. Switched to snake_case (`min_length`, `min_items`, `additional_properties`) as the DSL actually expects; added a short note on the internal camelCase conversion.
61
+ - **`docs/guide/best_practices.md`, `pipeline.md`, `migration.md` sanity pass** — terminology alignment (model escalation → model fallback where narrative; `escalate` DSL method unchanged) and `SummarizeArticle` case where the guide is not inherently multi-step.
62
+
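Note on the `output_schema.md` fix above: the changelog says the DSL takes snake_case arguments and converts them to camelCase JSON Schema keys internally. One way to picture that conversion is the plain-Ruby sketch below. `camelize_key` and `to_json_schema_keys` are hypothetical helpers written for illustration; the gem's actual conversion code is not shown in this diff and may differ.

```ruby
# Hypothetical helper: :min_length -> :minLength
def camelize_key(key)
  head, *tail = key.to_s.split("_")
  (head + tail.map(&:capitalize).join).to_sym
end

# Hypothetical helper: rewrite every option key to its camelCase form.
def to_json_schema_keys(options)
  options.transform_keys { |key| camelize_key(key) }
end

dsl_options = { min_length: 1, min_items: 3, additional_properties: false }
json_schema_options = to_json_schema_keys(dsl_options)

puts json_schema_options.inspect
```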
63
+ ## 0.7.1 (2026-04-22)
64
+
65
+ ### Changed (behavioral, follow-up to v0.7.0)
66
+
67
+ - **`Step::Base#run_once` no longer swallows adapter-phase `ArgumentError` as `:input_error`.** The previous blanket `rescue ArgumentError` was there to convert DSL misconfiguration (e.g. missing `prompt`) into an `:input_error` Result. Side effect: programmer bugs in adapter code that raised `ArgumentError` (wrong arity, bad config argument) were silently coerced into `:input_error` and retried as if the user had given bad input. Now the rescue is narrowed to the Runner-construction phase only — DSL configuration errors still produce `:input_error` (the `prompt has not been set` case is regression-tested), but `ArgumentError` raised from adapter code during `Runner#call` propagates to the caller. Input-type validation failures continue to produce `:input_error` through `InputValidator`'s own scoped rescue, unchanged.
68
+
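Note on the 0.7.1 entry above: the change narrows a `rescue ArgumentError` from the whole run to the construction phase only. A minimal plain-Ruby sketch of that pattern follows. `build_runner`, `run_once_broad`, `run_once_narrow`, and the hash-shaped result are hypothetical stand-ins chosen for illustration, not the gem's implementation.

```ruby
# Hypothetical construction phase: fails on DSL-style misconfiguration.
def build_runner(prompt)
  raise ArgumentError, "prompt has not been set" if prompt.nil?

  # The returned lambda plays the role of the runner's call phase.
  ->(adapter) { adapter.call(prompt) }
end

# Broad rescue: construction and call both sit under the rescue,
# so a programmer bug in the adapter is reported as bad input.
def run_once_broad(prompt, adapter)
  build_runner(prompt).call(adapter)
rescue ArgumentError => e
  { status: :input_error, message: e.message }
end

# Narrow rescue: only construction is guarded; adapter errors propagate.
def run_once_narrow(prompt, adapter)
  runner =
    begin
      build_runner(prompt)
    rescue ArgumentError => e
      return { status: :input_error, message: e.message }
    end
  runner.call(adapter)
end

buggy_adapter = ->(_prompt) { raise ArgumentError, "wrong number of arguments" }
ok_adapter = ->(prompt) { { status: :ok, output: prompt.upcase } }

puts run_once_broad("hi", buggy_adapter).inspect # adapter bug masked as input error
puts run_once_narrow(nil, ok_adapter).inspect    # configuration error still reported
```

With the narrow version, calling `run_once_narrow("hi", buggy_adapter)` raises `ArgumentError` to the caller instead of returning an `:input_error` result.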
3
69
  ## 0.7.0 (2026-04-21)
4
70
 
5
71
  ### Breaking changes
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- ruby_llm-contract (0.7.0)
4
+ ruby_llm-contract (0.7.3)
5
5
  dry-types (~> 1.7)
6
6
  ruby_llm (~> 1.0)
7
7
  ruby_llm-schema (~> 0.3)
@@ -258,7 +258,7 @@ CHECKSUMS
258
258
  rubocop-ast (1.49.1) sha256=4412f3ee70f6fe4546cc489548e0f6fcf76cafcfa80fa03af67098ffed755035
259
259
  ruby-progressbar (1.13.0) sha256=80fc9c47a9b640d6834e0dc7b3c94c9df37f08cb072b7761e4a71e22cff29b33
260
260
  ruby_llm (1.14.0) sha256=57c6f7034fc4a44504ea137d70f853b07824f1c1cdbe774ab3ab3522e7098deb
261
- ruby_llm-contract (0.7.0)
261
+ ruby_llm-contract (0.7.3)
262
262
  ruby_llm-schema (0.3.0) sha256=a591edc5ca1b7f0304f0e2261de61ba4b3bea17be09f5cf7558153adfda3dec6
263
263
  ruby_parser (3.22.0) sha256=1eb4937cd9eb220aa2d194e352a24dba90aef00751e24c8dfffdb14000f15d23
264
264
  rubycritic (4.12.0) sha256=024fed90fe656fa939f6ea80aab17569699ac3863d0b52fd72cb99892247abc8
data/README.md CHANGED
@@ -1,106 +1,8 @@
1
1
  # ruby_llm-contract
2
2
 
3
- The missing link between LLM cost and quality. Stop choosing between "cheap but wrong" and "accurate but expensive"; get both. Contracts, model escalation, and data-driven recommendations for [ruby_llm](https://github.com/crmne/ruby_llm).
3
+ **Validate and retry LLM outputs for [ruby_llm](https://github.com/crmne/ruby_llm).** Describe the answer you expect (JSON schema + business rules). If the model returns something that doesn't match, retry — optionally falling back to a stronger model until it passes or you hit the budget.
4
4
 
5
- ```
6
- YOU WRITE                     THE GEM HANDLES                 YOU GET
7
- ─────────                     ───────────────                 ───────
8
-
9
- validate { |o| ... }          catch bad answers; combined     Zero garbage
10
-                               with retry_policy, auto-retry   in production
11
-
12
- retry_policy                  start cheap, escalate only      Pay for the cheapest
13
- models: %w[nano mini full]    when validation fails           model that works
14
-
15
- max_cost 0.01                 estimate tokens, check price,   No surprise bills
16
-                               refuse before calling LLM
17
-
18
- output_schema { ... }         send JSON schema to provider,   Zero parsing code
19
-                               validate response client-side
20
-
21
- define_eval { ... }           test cases + baselines,         Regressions caught
22
-                               run in CI with real LLM         before deploy
23
-
24
- recommend(candidates: [...])  evaluate all configs, pick      Optimal model +
25
-                               cheapest that passes            retry chain
26
- ```
27
-
28
- ## Before and after
29
-
30
- ```
31
- ┌─────────────────────────────────────────────────────────────────┐
32
- │ BEFORE: pick one model, hope for the best │
33
- │ │
34
- │ expensive model → accurate, but you overpay on every call │
35
- │ cheap model → fast, but wrong answers slip to production │
36
- │ prompt change → "looks good to me" → deploy → users suffer │
37
- └─────────────────────────────────────────────────────────────────┘
38
-
39
- ⬇ add ruby_llm-contract
40
-
41
- ┌─────────────────────────────────────────────────────────────────┐
42
- │ YOU DEFINE A CONTRACT │
43
- │ │
44
- │ output_schema { string :priority } ← valid structure │
45
- │ validate("valid priority") { |o| ... } ← business rules │
46
- │ retry_policy models: %w[nano mini full] ← escalation chain │
47
- │ max_cost 0.01 ← budget cap │
48
- └───────────────────────────┬─────────────────────────────────────┘
49
-
50
-
51
- ┌─────────────────────────────────────────────────────────────────┐
52
- │ THE GEM HANDLES THE REST │
53
- │ │
54
- │ request ──→ ┌──────┐ ┌──────────┐ │
55
- │ │ nano │─→ │ contract │──→ ✓ pass → done │
56
- │ └──────┘ └────┬─────┘ │
57
- │ │ ✗ fail │
58
- │ ▼ │
59
- │ ┌──────┐ ┌──────────┐ │
60
- │ │ mini │─→ │ contract │──→ ✓ pass → done │
61
- │ └──────┘ └────┬─────┘ │
62
- │ │ ✗ fail │
63
- │ ▼ │
64
- │ ┌──────┐ ┌──────────┐ │
65
- │ │ full │─→ │ contract │──→ ✓ pass → done │
66
- │ └──────┘ └──────────┘ │
67
- └───────────────────────────┬─────────────────────────────────────┘
68
-
69
-
70
- ┌─────────────────────────────────────────────────────────────────┐
71
- │ YOU GET │
72
- │ │
73
- │ ✓ Valid output guaranteed — schema + business rules enforced │
74
- │ ✓ Cheapest model that works — most requests stay on nano │
75
- │ ✓ Cost, latency, tokens — tracked on every call │
76
- │ ✓ Eval scores per model — data instead of gut feeling │
77
- │ ✓ Regressions caught — before deploy, not after │
78
- │ ✓ Recommendation — "use nano+mini, drop full, save $X/mo" │
79
- └─────────────────────────────────────────────────────────────────┘
80
- ```
81
-
82
- ## 30-second version
83
-
84
- ```ruby
85
- class ClassifyTicket < RubyLLM::Contract::Step::Base
86
-   prompt "Classify this support ticket by priority and category.\n\n{input}"
87
-
88
-   output_schema do
89
-     string :priority, enum: %w[low medium high urgent]
90
-     string :category
91
-   end
92
-
93
-   validate("urgent needs justification") { |o, input| o[:priority] != "urgent" || input.length > 20 }
94
-   retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
95
- end
96
-
97
- result = ClassifyTicket.run("I was charged twice")
98
- result.parsed_output # => {priority: "high", category: "billing"}
99
- result.trace[:model] # => "gpt-4.1-nano" (first model that passed)
100
- result.trace[:cost] # => 0.000032
101
- ```
102
-
103
- Bad JSON? Retried automatically. Wrong answer? Escalated to a smarter model. Schema violated? Caught client-side. The contract guarantees every response meets your rules — you pay for the cheapest model that passes.
5
+ `ruby_llm` handles the HTTP side (rate limits, timeouts, streaming, tool calls, embeddings). This gem handles what the model *returned*: schema validation, business rules, retry with model fallback, datasets, regression tests.
104
6
 
105
7
  ## Install
106
8
 
@@ -113,184 +15,81 @@ RubyLLM.configure { |c| c.openai_api_key = ENV["OPENAI_API_KEY"] }
113
15
  RubyLLM::Contract.configure { |c| c.default_model = "gpt-4.1-mini" }
114
16
  ```
115
17
 
116
- Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
117
-
118
- ## Save money with model escalation
119
-
120
- Without a contract, you use gpt-4.1 for everything because you can't tell when a cheaper model gets it wrong. With a contract, you start on nano and only escalate when the answer fails the contract:
121
-
122
- ```ruby
123
- retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
124
- ```
125
-
126
- ```
127
- Attempt 1: gpt-4.1-nano → contract failed ($0.0001)
128
- Attempt 2: gpt-4.1-mini → contract passed ($0.0004)
129
- gpt-4.1 → never called ($0.00)
130
- ```
131
-
132
- Most requests succeed on the cheapest model. You pay full price only for the ones that need it. How many? Run `compare_models` and find out.
133
-
134
- ## Know which model to use — with data
135
-
136
- Don't guess. Define test cases, compare models, get numbers:
137
-
138
- ```ruby
139
- ClassifyTicket.define_eval("regression") do
140
-   add_case "billing", input: "I was charged twice", expected: { priority: "high" }
141
-   add_case "feature", input: "Add dark mode please", expected: { priority: "low" }
142
-   add_case "outage", input: "Database is down", expected: { priority: "urgent" }
143
- end
144
-
145
- comparison = ClassifyTicket.compare_models("regression",
146
- models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1])
147
- ```
148
-
149
- ```
150
- Candidate Score Cost Avg Latency
151
- ---------------------------------------------------------
152
- gpt-4.1-nano 0.67 $0.0001 48ms
153
- gpt-4.1-mini 1.00 $0.0004 92ms
154
- gpt-4.1 1.00 $0.0021 210ms
155
-
156
- Cheapest at 100%: gpt-4.1-mini
157
- ```
158
-
159
- Nano fails on edge cases. Mini and full both score 100% — but mini is **5x cheaper**. Now you know.
160
-
161
- Running live against gpt-5 / o-series? Pass `runs: 3` to average out sampling variance (OpenAI forces `temperature=1.0` server-side, so one unlucky run can misclassify a viable candidate). See [Reducing variance with `runs:`](docs/guide/optimizing_retry_policy.md#reducing-variance-with-runs).
162
-
163
- Want the *effective* cost — first-attempt plus retries — rather than the single-shot headline number? Pass `production_mode: { fallback: "gpt-5-mini" }` and the table gains `escalation`, `effective cost`, and a `Chain` column. See [Production-mode cost measurement](docs/guide/optimizing_retry_policy.md#production-mode-cost-measurement).
18
+ Works with any `ruby_llm` provider (OpenAI, Anthropic, Gemini, etc).
164
19
 
165
- ## Let the gem tell you what to do
20
+ ## Do I need this?
166
21
 
167
- Don't read tables; get a recommendation. Supports `model + reasoning_effort` combinations:
22
+ Use this if LLM output affects production behaviour, money, user trust, or downstream code. You probably don't need it if you have one low-risk prompt, manually inspect every result, or only generate best-effort prose.
168
23
 
169
- ```ruby
170
- rec = ClassifyTicket.recommend("regression",
171
- candidates: [
172
- { model: "gpt-4.1-nano" },
173
- { model: "gpt-4.1-mini" },
174
- { model: "gpt-5-mini", reasoning_effort: "low" },
175
- { model: "gpt-5-mini", reasoning_effort: "high" },
176
- ],
177
- min_score: 0.95
178
- )
179
-
180
- rec.best # => { model: "gpt-4.1-mini" }
181
- rec.retry_chain # => [{ model: "gpt-4.1-nano" }, { model: "gpt-4.1-mini" }]
182
- rec.to_dsl # => "retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]"
183
- rec.savings # => savings vs your current model (if configured)
184
- ```
185
-
186
- Copy `rec.to_dsl` into your step. Done.
187
-
188
- ## Catch regressions before users do
24
+ Already using structured outputs from your provider? This gem adds business-rule validation, retry with model fallback, evals, regression gating, and test stubs on top of them — the layer that stops schema-valid-but-wrong output from reaching users. See [Why contracts?](docs/guide/why.md) for the four production failure modes the gem exists for, or run `ruby examples/01_fallback_showcase.rb` to see the fallback loop in 30 seconds (no API key needed).
189
25
 
190
- A model update silently dropped your accuracy? A prompt tweak broke an edge case? You'll know before deploying:
26
+ ## Example
191
27
 
192
- ```ruby
193
- # Save a baseline once:
194
- report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
195
- report.save_baseline!(model: "gpt-4.1-nano")
196
-
197
- # In CI — block merge if anything regressed:
198
- expect(ClassifyTicket).to pass_eval("regression")
199
- .with_context(model: "gpt-4.1-nano")
200
- .without_regressions
201
- ```
28
+ A Rails app takes article text extracted from a user-submitted URL and wants to show a summary card: a short TL;DR, 3–5 key takeaways, and a tone label. The output has to fit the UI (TL;DR under 200 chars) and the schema has to be strict enough to render without conditionals.
202
29
 
203
30
  ```ruby
204
- diff = report.compare_with_baseline(model: "gpt-4.1-nano")
205
- diff.regressed? # => true
206
- diff.regressions # => [{case: "outage", baseline: {passed: true}, current: {passed: false}}]
207
- diff.score_delta # => -0.33
208
- ```
209
-
210
- No more "it worked in the playground". Regressions are caught in CI, not production.
31
+ class SummarizeArticle < RubyLLM::Contract::Step::Base
32
+   prompt <<~PROMPT
33
+     Summarize this article for a UI card. Return a short TL;DR,
34
+     3 to 5 key takeaways, and a tone label.
211
35
 
212
- ## A/B test your prompts
36
+     {input}
37
+   PROMPT
213
38
 
214
- Changed a prompt? Compare old vs new on the same dataset with regression safety:
215
-
216
- ```ruby
217
- diff = ClassifyTicketV2.compare_with(ClassifyTicketV1,
218
- eval: "regression", model: "gpt-4.1-mini")
219
-
220
- diff.safe_to_switch? # => true (no regressions)
221
- diff.improvements # => [{case: "outage", ...}]
222
- diff.score_delta # => +0.33
223
- ```
224
-
225
- ```ruby
226
- # CI gate:
227
- expect(ClassifyTicketV2).to pass_eval("regression")
228
- .compared_with(ClassifyTicketV1)
229
- .with_minimum_score(0.8)
230
- ```
231
-
232
- ## Chain steps with fail-fast
39
+   output_schema do
40
+     string :tldr
41
+     array :takeaways, of: :string, min_items: 3, max_items: 5
42
+     string :tone, enum: %w[neutral positive negative analytical]
43
+   end
233
44
 
234
- Pipeline stops at the first contract failure, with no wasted tokens on downstream steps:
45
+   validate("TL;DR fits the card") { |o, _| o[:tldr].length <= 200 }
46
+   validate("takeaways are unique") { |o, _| o[:takeaways].uniq.size == o[:takeaways].size }
235
47
 
236
- ```ruby
237
- class TicketPipeline < RubyLLM::Contract::Pipeline::Base
238
-   step ClassifyTicket, as: :classify
239
-   step RouteToTeam, as: :route
240
-   step DraftResponse, as: :draft
48
+   retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
241
49
  end
242
50
 
243
- result = TicketPipeline.run("I was charged twice")
244
- result.outputs_by_step[:classify] # => {priority: "high", category: "billing"}
245
- result.trace.total_cost # => $0.000128
51
+ result = SummarizeArticle.run(article_text)
52
+ result.parsed_output # => { tldr: "...", takeaways: [...], tone: "analytical" }
53
+ result.trace[:model] # => "gpt-4.1-nano" (first model that passed)
54
+ result.trace[:cost] # => 0.000032
246
55
  ```
247
56
 
248
- ## Gate merges on quality and cost
57
+ The model returns JSON matching the schema. If the response is malformed, the TL;DR overflows the card, or the takeaway count is off, the gem retries — moving to the next model in `models:` only when the cheaper one can't satisfy the rules. In this setup cheaper models are tried first and the expensive ones are used only when cheaper models fail.
249
58
 
250
- ```ruby
251
- # RSpec — block merge if accuracy drops or cost spikes
252
- expect(ClassifyTicket).to pass_eval("regression")
253
- .with_minimum_score(0.8)
254
- .with_maximum_cost(0.01)
59
+ You could write this loop yourself once. The gem gives you the loop, a trace of every attempt (model, status, cost, latency), fallback policy, evals, baselines, and CI checks as one contract object — tracked per-step so adding a new LLM feature to your app is one class, not one-off scaffolding.
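Note on the paragraph above: the "loop you could write yourself" can be sketched in plain Ruby roughly as follows. This is an illustration under stated assumptions, not the gem's code: `FAKE_MODELS`, `RULES`, and `run_with_fallback` are hypothetical names, the models are canned lambdas rather than real LLM calls, and cost and latency tracking are left out.

```ruby
require "json"

# Canned stand-ins for three models of increasing strength (hypothetical).
FAKE_MODELS = {
  "nano" => ->(_input) { "not json" },
  "mini" => ->(_input) { JSON.generate({ tldr: "x" * 250, takeaways: %w[a b c], tone: "neutral" }) },
  "full" => ->(_input) { JSON.generate({ tldr: "Short summary.", takeaways: %w[a b c], tone: "neutral" }) }
}.freeze

# Business rules mirroring the two validate blocks in the README example.
RULES = {
  "TL;DR fits the card" => ->(o) { o[:tldr].to_s.length <= 200 },
  "takeaways are unique" => ->(o) { o[:takeaways].is_a?(Array) && o[:takeaways].uniq.size == o[:takeaways].size }
}.freeze

# Try each model in order; stop at the first output that parses and passes every rule.
def run_with_fallback(input, models)
  trace = []
  models.each do |name|
    raw = FAKE_MODELS.fetch(name).call(input)
    begin
      output = JSON.parse(raw, symbolize_names: true)
    rescue JSON::ParserError
      trace << { model: name, status: :parse_error }
      next
    end

    failed = RULES.reject { |_label, rule| rule.call(output) }.keys
    if failed.empty?
      trace << { model: name, status: :ok }
      return { status: :ok, output: output, trace: trace }
    end
    trace << { model: name, status: :validation_failed, details: failed }
  end
  { status: :failed, output: nil, trace: trace }
end

result = run_with_fallback("article text", %w[nano mini full])
result[:trace].each { |attempt| puts attempt.inspect }
```

Here the first model returns malformed JSON, the second breaks the 200-character rule, and the third passes, so the trace records three attempts in order.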
255
60
 
256
- # Rake: run all evals across all steps
257
- RubyLLM::Contract::RakeTask.new do |t|
258
-   t.minimum_score = 0.8
259
-   t.maximum_cost = 0.05
260
- end
261
- # bundle exec rake ruby_llm_contract:eval
262
- ```
61
+ ## Most useful next
263
62
 
264
- ## Full power: data-driven retry chains
63
+ Everything below is optional; the example above is a complete step. Reach for these when one step isn't enough.
265
64
 
266
- The pieces above (evals, compare_models, recommend) combine into a workflow that replaces guesswork with measured optimization. You define evals for your step, run `recommend` against all of them, find the eval that actually needs the strongest model, and build a retry chain where each attempt is as cheap as the data allows.
65
+ - **[CI regression gates](docs/guide/getting_started.md)**: `define_eval` + `save_baseline!` + `pass_eval(...).without_regressions` blocks CI when accuracy drops on a model update or prompt tweak.
66
+ - **[Find the cheapest viable fallback list](docs/guide/optimizing_retry_policy.md)** — `Step.recommend("regression", candidates: [...], min_score: 0.95)` returns the cheapest list of models that still passes your evals. `production_mode:` measures retry-aware cost.
67
+ - **[A/B test prompts](docs/guide/eval_first.md)** — `SummarizeArticleV2.compare_with(SummarizeArticleV1, eval: "regression")` reports whether the new prompt is safe to ship.
68
+ - **[Budget caps](docs/guide/getting_started.md)** — `max_cost`, `max_input`, `max_output` refuse the request before calling the API when an estimate exceeds the limit.
267
69
 
268
- The difference: instead of "gpt-5-mini seems to work, let's use it everywhere", you get "nano handles 4/6 scenarios, mini@low catches the 5th, full mini only fires on the hardest edge case — first attempt is 4× cheaper."
269
-
270
- Full procedure with examples: **[Optimizing retry_policy](docs/guide/optimizing_retry_policy.md)**
70
+ Also supports [multi-step pipelines](docs/guide/pipeline.md) with fail-fast and [best-effort retries without fallback](docs/guide/best_practices.md) (`retry_policy attempts: 3` for sampling variance).
271
71
 
272
72
  ## Docs
273
73
 
274
- | Guide | |
275
- |-------|-|
276
- | [Getting Started](docs/guide/getting_started.md) | Features walkthrough, model escalation, eval |
277
- | [Eval-First](docs/guide/eval_first.md) | Practical workflow for prompt engineering with datasets, baselines, and A/B gates |
278
- | [Optimizing retry_policy](docs/guide/optimizing_retry_policy.md) | Find the cheapest retry chain that passes all your evals |
279
- | [Best Practices](docs/guide/best_practices.md) | 6 patterns for bulletproof validates |
280
- | [Output Schema](docs/guide/output_schema.md) | Full schema reference + constraints |
281
- | [Pipeline](docs/guide/pipeline.md) | Multi-step composition, timeout, fail-fast |
282
- | [Testing](docs/guide/testing.md) | Test adapter, RSpec matchers |
283
- | [Migration](docs/guide/migration.md) | Adopting the gem in existing Rails apps |
74
+ **New here?** Read in order: this README → [Why contracts?](docs/guide/why.md) → [Getting Started](docs/guide/getting_started.md).
75
+
76
+ | Guide | What it does for your app |
77
+ |-------|---------------------------|
78
+ | [Why contracts?](docs/guide/why.md) | Recognise the four production failures the gem exists for |
79
+ | [Getting Started](docs/guide/getting_started.md) | Walk the full feature set on one concrete step |
80
+ | [Rails integration](docs/guide/rails_integration.md) | Directory, initializer, jobs, logging, specs, CI gate — 7 FAQs for Rails devs |
81
+ | [Adopt in an existing Rails app](docs/guide/migration.md) | Replace raw `LlmClient.call` with a contract, Before/After |
82
+ | [Prevent silent prompt regressions](docs/guide/eval_first.md) | Evals, baselines, CI gates that block quality drift |
83
+ | [Control retry cost and fallback behaviour](docs/guide/optimizing_retry_policy.md) | Find the cheapest viable fallback list empirically |
84
+ | [Write validate rules that catch real bugs](docs/guide/best_practices.md) | Patterns for cross-input checks and content-quality rules |
85
+ | [Stub LLM calls in tests](docs/guide/testing.md) | Deterministic specs, RSpec + Minitest matchers |
86
+ | [Chain LLM calls into a pipeline](docs/guide/pipeline.md) | Multi-step with fail-fast and per-step models |
87
+ | [Schema DSL reference](docs/guide/output_schema.md) | Every constraint, nested objects, pattern table |
88
+ | [Prompt DSL reference](docs/guide/prompt_ast.md) | `system` / `rule` / `section` / `example` / `user` nodes |
284
89
 
285
90
  ## Roadmap
286
91
 
287
- **v0.6 (current):** "What should I do?" `Step.recommend` returns optimal model, reasoning effort, and retry chain. Per-attempt `reasoning_effort` in retry policies.
288
-
289
- **v0.5:** Prompt A/B testing with `compare_with`. Soft observations with `observe`.
290
-
291
- **v0.4:** Eval history, batch concurrency, pipeline per-step eval, Minitest, structured logging.
292
-
293
- **v0.3:** Baseline regression detection, migration guide.
92
+ Latest: **v0.7.2** (terminal output labels and guides aligned with the fallback narrative; `output_schema.md` DSL bug fix). See [CHANGELOG](CHANGELOG.md) for history.
294
93
 
295
94
  ## License
296
95