ruby_llm-contract 0.10.1 → 0.10.2

@@ -0,0 +1,282 @@
1
+ # Testing
2
+
3
+ > Read this when writing deterministic unit specs for contract steps. Skip if you only ever run live evals in CI and never stub LLM calls.
4
+
5
+ CI that hits a real LLM on every commit is slow and costly: tests take minutes instead of milliseconds, every rerun spends money, and flakes on provider hiccups block merges. The Test adapter and `stub_step` make LLM-backed specs run like ordinary unit tests: deterministic, offline, free. Live evals stay as an opt-in CI stage for quality gating (see [Eval-First](eval_first.md)).
6
+
7
+ This guide shows how to write deterministic specs and matchers for steps built on `ruby_llm-contract`. Examples use `SummarizeArticle` (the flagship step from the [README](../../README.md)).
8
+
9
+ ## Test adapter
10
+
11
+ Runs specs deterministically with zero API calls. `response:` accepts a JSON String or a Hash; `responses:` accepts an Array with one entry per call:
12
+
13
+ ```ruby
14
+ # String JSON
15
+ adapter = RubyLLM::Contract::Adapters::Test.new(
16
+ response: '{"tldr":"...","takeaways":["a","b","c"],"tone":"neutral"}'
17
+ )
18
+
19
+ # Hash — auto-converted to JSON
20
+ adapter = RubyLLM::Contract::Adapters::Test.new(
21
+ response: { tldr: "...", takeaways: %w[a b c], tone: "neutral" }
22
+ )
23
+
24
+ # Multiple sequential responses (one per call)
25
+ adapter = RubyLLM::Contract::Adapters::Test.new(
26
+ responses: [
27
+ { tldr: "...", takeaways: %w[a b c], tone: "neutral" },
28
+ { tldr: "...", takeaways: %w[x y z], tone: "analytical" }
29
+ ]
30
+ )
31
+
32
+ result = SummarizeArticle.run("article text", context: { adapter: adapter })
33
+ result.ok? # => true
34
+ ```
35
+
36
+ Multi-step pipeline testing with per-step named responses (using `ArticleCardPipeline` from [Pipeline](pipeline.md)):
37
+
38
+ ```ruby
39
+ result = ArticleCardPipeline.test("article text",
40
+ responses: {
41
+ summarize: { tldr: "...", takeaways: %w[a b c], tone: "analytical" },
42
+ tag: { tldr: "...", takeaways: %w[a b c], tone: "analytical",
43
+ hashtags: %w[#ruby #release] },
44
+ card: { headline: "Ruby 3.4 ships", summary: "...",
45
+ hashtags: %w[#ruby #release], sentiment_icon: "🧠" }
46
+ }
47
+ )
48
+ ```
49
+
50
+ ## Output keys are always symbols
51
+
52
+ Parsed output uses **symbol keys**, never strings:
53
+
54
+ ```ruby
55
+ result.parsed_output[:tldr] # => "..." ✓
56
+ result.parsed_output["tldr"] # => nil ✗
57
+ ```
58
+
59
+ The gem warns if a `validate` or `verify` block returns `nil` — usually a sign of string-key access on symbol-keyed data.
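The `nil` on string-key access is ordinary Ruby Hash behaviour, so it can be reproduced without the gem. The sketch below uses only the standard `json` library. The two helper methods are hypothetical names, and `JSON.parse(..., symbolize_names: true)` merely stands in for the gem's own parsing, which is an assumption here.

```ruby
require "json"

# Hypothetical rule written with a string key: silently evaluates to nil.
def fits_card_with_string_key(output)
  tldr = output["tldr"]
  tldr && tldr.length <= 200
end

# The same rule written with a symbol key: evaluates to true or false.
def fits_card_with_symbol_key(output)
  output[:tldr].length <= 200
end

# Stand-in for a step's parsed output (assumption: a symbol-keyed Hash).
raw = '{"tldr":"Ruby 3.4 ships","takeaways":["a","b","c"],"tone":"neutral"}'
parsed = JSON.parse(raw, symbolize_names: true)

puts "string-key rule: #{fits_card_with_string_key(parsed).inspect}"
puts "symbol-key rule: #{fits_card_with_symbol_key(parsed).inspect}"
```

The string-key version returns `nil` rather than raising, which is why a `nil` result from a rule is a useful warning signal.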
60
+
61
+ ## RSpec setup
62
+
63
+ In `spec_helper.rb`:
64
+
65
+ ```ruby
66
+ require "ruby_llm/contract/rspec"
67
+ ```
68
+
69
+ You get the `satisfy_contract` matcher, `pass_eval` matcher, and the `stub_step` helpers.
70
+
71
+ ## stub_step helpers
72
+
73
+ `stub_step` returns a canned response for a single step; other steps run normally.
74
+
75
+ ```ruby
76
+ RSpec.describe SummarizeArticle do
77
+ before do
78
+ stub_step(described_class,
79
+ response: { tldr: "...", takeaways: %w[a b c], tone: "neutral" })
80
+ end
81
+
82
+ it "satisfies its contract" do
83
+ result = described_class.run("article text")
84
+ expect(result).to satisfy_contract
85
+ end
86
+ end
87
+ ```
88
+
89
+ **Sequential responses:**
90
+
91
+ ```ruby
92
+ stub_step(described_class, responses: [
93
+ { tldr: "...", takeaways: %w[a b c], tone: "neutral" },
94
+ { tldr: "...", takeaways: %w[x y z], tone: "analytical" }
95
+ ])
96
+ ```
97
+
98
+ **Block form (auto-cleanup):**
99
+
100
+ ```ruby
101
+ stub_step(SummarizeArticle, response: { tldr: "...", takeaways: %w[a b c], tone: "neutral" }) do
102
+ result = SummarizeArticle.run("article text")
103
+ # stub is active here
104
+ end
105
+ # stub gone — original adapter restored
106
+ ```
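The auto-cleanup of the block form can be pictured as a save, swap, restore sequence guarded by `ensure`. The following is a minimal sketch of that idea in plain Ruby, not the gem's implementation: `ADAPTERS` and `with_stubbed_adapter` are hypothetical names invented for illustration.

```ruby
# Hypothetical registry standing in for wherever per-step adapters are kept.
ADAPTERS = {}

# Hypothetical helper: install a stub for the duration of the block,
# then restore the previous state even if the block raises.
def with_stubbed_adapter(step_name, stub)
  had_previous = ADAPTERS.key?(step_name)
  previous = ADAPTERS[step_name]
  ADAPTERS[step_name] = stub
  yield
ensure
  if had_previous
    ADAPTERS[step_name] = previous
  else
    ADAPTERS.delete(step_name)
  end
end

ADAPTERS[:summarize] = :real_adapter
seen_inside = with_stubbed_adapter(:summarize, :stub_adapter) { ADAPTERS[:summarize] }

puts "inside block: #{seen_inside.inspect}"
puts "after block:  #{ADAPTERS[:summarize].inspect}"
```

Because the restore happens in `ensure`, a failing expectation inside the block does not leak the stub into later examples.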
107
+
108
+ **Multiple steps at once:**
109
+
110
+ ```ruby
111
+ stub_steps(
112
+ SummarizeArticle => { response: { tldr: "...", takeaways: %w[a b c], tone: "neutral" } },
113
+ GenerateHashtags => { response: { tldr: "...", takeaways: %w[a b c],
114
+ tone: "neutral",
115
+ hashtags: %w[#ruby #release] } }
116
+ ) do
117
+ result = ArticleCardPipeline.run("article text")
118
+ end
119
+ ```
120
+
121
+ **Global stub for all steps:**
122
+
123
+ ```ruby
124
+ stub_all_steps(response: { default: true })
125
+ ```
126
+
127
+ In RSpec, non-block stubs auto-clean after each example. In Minitest, `teardown` restores the original adapter (via `MinitestHelpers`).
128
+
129
+ ## Minitest
130
+
131
+ Require `ruby_llm/contract/minitest` in your `test_helper.rb`. You get the same `satisfy_contract` / `pass_eval` assertions and `stub_step` helper, adapted for Minitest syntax.
132
+
133
+ ## RSpec matchers
134
+
135
+ ```ruby
136
+ RSpec.describe SummarizeArticle do
137
+ before do
138
+ stub_step(described_class,
139
+ response: { tldr: "...", takeaways: %w[a b c], tone: "neutral" })
140
+ end
141
+
142
+ it "satisfies its contract" do
143
+ result = described_class.run("article text")
144
+ expect(result).to satisfy_contract
145
+ end
146
+
147
+ it "rejects invalid output" do
148
+ stub_step(described_class, response: { tldr: "x" * 300, takeaways: %w[a b c], tone: "neutral" })
149
+ result = described_class.run("article text")
150
+ expect(result).not_to satisfy_contract # TL;DR > 200 chars fails validate
151
+ end
152
+
153
+ it "passes its eval" do
154
+ expect(described_class).to pass_eval("smoke")
155
+ end
156
+ end
157
+ ```
158
+
159
+ `pass_eval` supports a matcher chain — full reference lives in [Getting Started](getting_started.md) under Evals and CI gates. Quick summary:
160
+
161
+ - `.with_context(model: "gpt-4.1-mini")` — pick model / pass adapter
162
+ - `.with_minimum_score(0.8)` — gate on average score
163
+ - `.with_maximum_cost(0.01)` — gate on total cost
164
+ - `.without_regressions` — block any previously-passing case that now fails (reads the baseline)
165
+ - `.compared_with(SummarizeArticleV1)` — A/B against another step; implies regression check
166
+
167
+ ## Offline vs online eval
168
+
169
+ Evals run in one of two modes depending on how they're defined and what context is passed:
170
+
171
+ | Has `sample_response`? | Context has adapter/model? | Mode | API calls |
172
+ |---|---|---|---|
173
+ | Yes | No | **Offline** — uses `sample_response` as canned answer | Zero |
174
+ | Yes | Yes | **Online** — ignores `sample_response`, calls real LLM | Real |
175
+ | No | Yes | **Online** — calls real LLM | Real |
176
+ | No | No | **Skipped** — returns `:skipped`, excluded from score | Zero |
177
+
178
+ Default is offline. To force online, pass adapter or model in context:
179
+
180
+ ```ruby
181
+ # Online — real LLM call
182
+ report = SummarizeArticle.run_eval("regression", context: { model: "gpt-4.1-nano" })
183
+
184
+ # Offline — uses sample_response
185
+ report = SummarizeArticle.run_eval("smoke")
186
+ ```
187
+
188
+ `compare_with` intentionally ignores `sample_response` because canned data would make both sides look identical. Always pass `model:` or an adapter to A/B.
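The mode table above reduces to a two-question decision. As a sketch only, the hypothetical helper `eval_mode` below restates the table in plain Ruby; it is not part of `ruby_llm-contract` and makes no claim about how the gem implements the choice.

```ruby
# Hypothetical restatement of the mode table; not the gem's code.
# Online wins whenever the context names an adapter or a model.
def eval_mode(has_sample_response:, context: {})
  live = context.key?(:adapter) || context.key?(:model)
  return :online if live

  has_sample_response ? :offline : :skipped
end

puts eval_mode(has_sample_response: true)
puts eval_mode(has_sample_response: true, context: { model: "gpt-4.1-nano" })
puts eval_mode(has_sample_response: false, context: { adapter: :some_adapter })
puts eval_mode(has_sample_response: false)
```

The four calls correspond to the four table rows, in order: offline, online, online, skipped.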
189
+
190
+ ## Inspecting failures
191
+
192
+ `run_eval` returns a `Report`. Drill into per-case failures:
193
+
194
+ ```ruby
195
+ report = SummarizeArticle.run_eval("regression")
196
+
197
+ report.score # => 0.5
198
+ report.pass_rate # => "1/2"
199
+ report.total_cost # => 0.003
200
+
201
+ report.failures.each do |result|
202
+ puts result.name # => "critical review"
203
+ puts result.mismatches # => { tone: { expected: "negative", got: "analytical" } }
204
+ puts result.output # full parsed output hash
205
+ puts result.details # human-readable explanation
206
+ end
207
+ ```
208
+
209
+ `mismatches` is a hash of keys where expected and actual output diverge — pinpoints which field the model got wrong.
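To make the shape of `mismatches` concrete, here is a plain-Ruby sketch that builds the same kind of hash from an expected and an actual output. The standalone `mismatches` method is a hypothetical helper written for illustration; how the gem computes the value internally is not shown in this guide.

```ruby
# Hypothetical helper: for each expected key, record the pair when the
# actual value differs. Keys absent from `expected` are ignored.
def mismatches(expected, actual)
  expected.each_with_object({}) do |(key, want), diff|
    got = actual[key]
    diff[key] = { expected: want, got: got } unless got == want
  end
end

expected = { tone: "negative" }
actual = { tldr: "...", takeaways: %w[a b c], tone: "analytical" }

puts mismatches(expected, actual).inspect
```

Only `:tone` appears in the result, because it is the only expected field whose value diverges.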
210
+
211
+ ## Soft observations
212
+
213
+ Log suspicious-but-not-invalid output without failing the contract:
214
+
215
+ ```ruby
216
+ class CompareArticles < RubyLLM::Contract::Step::Base
217
+ prompt "Score the article pair for relevance. Return JSON: {score_a: 1-10, score_b: 1-10}.\n\n{input}"
218
+
219
+ output_schema do
220
+ integer :score_a, minimum: 1, maximum: 10
221
+ integer :score_b, minimum: 1, maximum: 10
222
+ end
223
+
224
+ validate("scores in range") { |o, _| (1..10).cover?(o[:score_a]) && (1..10).cover?(o[:score_b]) }
225
+ observe("scores should differ") { |o, _| o[:score_a] != o[:score_b] }
226
+ end
227
+
228
+ adapter = RubyLLM::Contract::Adapters::Test.new(response: { score_a: 5, score_b: 5 })
229
+ result = CompareArticles.run("two identical-looking articles", context: { adapter: adapter })
230
+
231
+ result.ok? # => true (observe never fails the contract)
232
+ result.observations # => [{ description: "scores should differ", passed: false }]
233
+ ```
234
+
235
+ `observe` runs only after validation passes. Failed observations are logged via `RubyLLM::Contract.logger` — useful for "I want to know this happened without blocking the response".
236
+
237
+ > **Not the same as `Chat#on_end_message` / `on_tool_call`.** RubyLLM exposes anonymous, global callbacks attached to a chat instance — they receive the raw `Message` and have no inherent pass/fail concept. `Step.observe` is per-step, named with a description, runs against `parsed_output` (already JSON-parsed and schema-validated), and the pass/fail outcome is recorded in `result.observations`. The two can coexist: use Chat callbacks for transport/tool tracing, use `observe` for domain assertions you want captured in the step trace.
238
+
239
+ ## Asserting on `around_call`
240
+
241
+ `around_call` fires **once per run** with the final result (after retry fallback) and exceptions propagate. That makes it straightforward to test:
242
+
243
+ ```ruby
244
+ class LoggedSummarize < RubyLLM::Contract::Step::Base
245
+ prompt "Summarize: {input}"
246
+ output_schema { string :tldr }
247
+
248
+ around_call do |_step, input, result|
249
+ CallLog.record(model: result.trace.model, cost: result.trace.cost, input_size: input.length)
250
+ end
251
+ end
252
+
253
+ RSpec.describe LoggedSummarize do
254
+ it "logs once per run, with final model + total cost" do
255
+ adapter = RubyLLM::Contract::Adapters::Test.new(response: { tldr: "ok" })
256
+ expect(CallLog).to receive(:record).once.with(hash_including(:model, :cost, :input_size))
257
+
258
+ LoggedSummarize.run("article text", context: { adapter: adapter })
259
+ end
260
+ end
261
+ ```
262
+
263
+ The callback receives `(step, input, result)`, where `result` is the same `Result` the caller sees. It is not invoked per attempt inside a `retry_policy` chain; if you need per-attempt visibility, read `result.trace[:attempts]` inside the block.
264
+
265
+ ## Baseline file format
266
+
267
+ Baselines are JSON files in `.eval_baselines/` — commit them to git:
268
+
269
+ ```
270
+ .eval_baselines/
271
+ SummarizeArticle/
272
+ regression_gpt-4_1-nano.json
273
+ regression_gpt-4_1-mini.json
274
+ ```
275
+
276
+ Each file contains dataset name, step name, score, and per-case results. No timestamps — re-saving an identical baseline produces no git diff. Baseline semantics (what counts as a regression, how `compare_with_baseline` works) are covered in [Getting Started](getting_started.md#evals-and-ci-gates) and [Eval-First](eval_first.md).
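The "no timestamps, no git diff" property follows from serialising only stable data in a stable order. The sketch below illustrates that idea with the standard `json` library. `serialize_baseline` and its field names are assumptions modelled on the prose above (dataset name, step name, score, per-case results), not the gem's actual schema.

```ruby
require "json"

# Hypothetical serialiser: no timestamp, fixed key order, cases sorted by name,
# so identical results always produce byte-identical output.
def serialize_baseline(dataset:, step:, score:, cases:)
  payload = {
    "dataset" => dataset,
    "step" => step,
    "score" => score,
    "cases" => cases.sort_by { |c| c["name"] }
  }
  JSON.pretty_generate(payload)
end

cases = [
  { "name" => "outage complaint", "passed" => true },
  { "name" => "neutral product update", "passed" => false }
]

first  = serialize_baseline(dataset: "regression", step: "SummarizeArticle", score: 0.5, cases: cases)
second = serialize_baseline(dataset: "regression", step: "SummarizeArticle", score: 0.5, cases: cases.reverse)

puts first == second
```

Re-saving the same results, even with cases supplied in a different order, yields the same string and therefore nothing to commit.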
277
+
278
+ ## See also
279
+
280
+ - [Getting Started](getting_started.md) — `pass_eval` matcher chain, threshold gating, Rake task, baseline regressions.
281
+ - [Eval-First](eval_first.md) — `compare_with` prompt A/B workflow.
282
+ - [Pipeline](pipeline.md) — pipeline-level testing with named step responses.
data/docs/guide/why.md ADDED
@@ -0,0 +1,103 @@
1
+ # Why contracts?
2
+
3
+ > Read this if you're not sure whether `ruby_llm-contract` solves a problem you actually have. It's the fastest way to recognise the production failure modes the gem exists for.
4
+
5
+ LLMs return JSON that *looks* correct — valid shape, right types, right fields — while being silently wrong in ways that hurt users, burn budget, or break downstream code. Schema validation alone does not catch these. Contracts layer on business rules, retries, evals, and cost caps so the wrong output is caught at the boundary of your system instead of shipping to production.
6
+
7
+ Below are four failure modes teams actually hit. If one looks familiar, the gem is probably worth 30 minutes of your time.
8
+
9
+ ## Failure 1 — Schema-valid, logically wrong
10
+
11
+ A `SummarizeArticle` step produces `{ tldr: "...", takeaways: [...], tone: "analytical" }`. Schema passes. The TL;DR is 520 characters long and overflows the UI card. Or the takeaways are all variations of the same sentence. Or the article was a service-outage complaint and `tone` came back `"analytical"` instead of `"negative"` — so customer success's "critical feedback" filter never sees it.
12
+
13
+ JSON schema enforces **shape**. It cannot enforce *length fits the card*, *takeaways are distinct*, or *tone matches content*. Those are business rules, and without them you find out from a Slack thread or a support ticket.
14
+
15
+ ```ruby
16
+ validate("TL;DR fits the card") { |o, _| o[:tldr].length <= 200 }
17
+ validate("takeaways are unique") { |o, _| o[:takeaways] == o[:takeaways].uniq }
18
+ validate("negative tone requires concrete risk") do |o, _|
19
+ next true unless o[:tone] == "negative"
20
+ o[:takeaways].any? { |t| t.match?(/fail|break|crash|outage|risk/i) }
21
+ end
22
+ ```
23
+
24
+ Wrong output never reaches `Article.update!` — the contract refuses before it persists.
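The three rules above are ordinary predicates, so they can be exercised without the gem. The sketch below restates them as plain lambdas and reports which ones a sample output violates. `RULES` and `failed_rules` are hypothetical names, and the sample output is made up for illustration.

```ruby
# Plain lambdas mirroring the three validate blocks above (no gem required).
RULES = {
  "TL;DR fits the card" => ->(o) { o[:tldr].length <= 200 },
  "takeaways are unique" => ->(o) { o[:takeaways] == o[:takeaways].uniq },
  "negative tone requires concrete risk" => lambda { |o|
    o[:tone] != "negative" || o[:takeaways].any? { |t| t.match?(/fail|break|crash|outage|risk/i) }
  }
}.freeze

# Returns the description of every rule the output violates.
def failed_rules(output)
  RULES.reject { |_description, rule| rule.call(output) }.keys
end

# Schema-valid sample: right keys, right types, still wrong for the UI.
output = {
  tldr: "x" * 520,
  takeaways: ["Checkout outage lasted three hours",
              "Checkout outage lasted three hours",
              "Refunds were issued"],
  tone: "negative"
}

puts failed_rules(output).inspect
```

The sample fails the length and uniqueness rules but passes the tone rule, since one takeaway mentions an outage.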
25
+
26
+ ## Failure 2 — Silent prompt regression
27
+
28
+ `SummarizeArticle` ships and works. Two weeks later, someone tweaks the system prompt to emphasise negative sentiment because customer success complained about missed complaints. The tweak fixes that case and silently breaks three neutral product-update articles that now get labelled `"negative"`. Nobody knows for a week.
29
+
30
+ Without evals, *every prompt change is a blind deploy.* Contracts invert this:
31
+
32
+ ```ruby
33
+ SummarizeArticle.define_eval("regression") do
34
+ add_case "outage complaint", input: "...", expected: { tone: "negative" }
35
+ add_case "neutral product update", input: "...", expected: { tone: "neutral" }
36
+ end
37
+
38
+ # In CI — blocks merge when a prompt tweak regresses any previously-passing case
39
+ expect(SummarizeArticle).to pass_eval("regression").without_regressions
40
+ ```
41
+
42
+ The "tweak helped one case, broke three" scenario is caught at PR review. No Slack-thread surprises.
43
+
44
+ ## Failure 3 — Sampling variance on fixed-temperature models
45
+
46
+ OpenAI's gpt-5 and o-series run with `temperature=1.0` server-side — you cannot lower it. That means the same prompt on the same model can produce different answers between calls. An outage complaint classified `tone: "negative"` on Monday may come back `tone: "positive"` on Tuesday, with no code change in between. Schema passes both. Your customer-success filter silently misroutes the Tuesday case.
47
+
48
+ A `validate` block that cross-checks fields against each other turns a one-in-N flaky output into a deterministic retry:
49
+
50
+ ```ruby
51
+ validate("tone matches severity keywords") do |o, _|
52
+ severity = /fail|crash|outage|broken|bug|error/i
53
+ flagged = o[:takeaways].any? { |t| t.match?(severity) }
54
+ next true unless flagged
55
+ %w[negative analytical].include?(o[:tone])
56
+ end
57
+
58
+ retry_policy models: %w[gpt-5-nano gpt-5-mini gpt-5]
59
+ ```
60
+
61
+ Nano misclassifies the tone on the first attempt → contract rejects → mini gets the call and returns a different sample. Variance absorbed; the user never sees the flaky run. Your logs show the retry rate and the cost delta.
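The escalation described above can be sketched in plain Ruby with canned outputs in place of real model calls. This illustrates the idea only and is not the gem's retry implementation: `run_with_fallback` and the `responses` map are hypothetical, and the predicate restates the `validate` block shown earlier.

```ruby
SEVERITY = /fail|crash|outage|broken|bug|error/i

# Same cross-field rule as the validate block above, as a plain predicate.
def tone_consistent?(output)
  flagged = output[:takeaways].any? { |t| t.match?(SEVERITY) }
  !flagged || %w[negative analytical].include?(output[:tone])
end

# Hypothetical runner: try each model in order, stop at the first output that passes.
# `responses` maps model name => canned output, standing in for real LLM calls.
def run_with_fallback(models, responses)
  attempts = []
  models.each do |model|
    output = responses.fetch(model)
    passed = tone_consistent?(output)
    attempts << { model: model, passed: passed }
    return { output: output, attempts: attempts } if passed
  end
  { output: nil, attempts: attempts }
end

responses = {
  "gpt-5-nano" => { tone: "positive", takeaways: ["Checkout outage lasted three hours"] },
  "gpt-5-mini" => { tone: "negative", takeaways: ["Checkout outage lasted three hours"] },
  "gpt-5"      => { tone: "negative", takeaways: ["Checkout outage lasted three hours"] }
}

result = run_with_fallback(%w[gpt-5-nano gpt-5-mini gpt-5], responses)
result[:attempts].each { |a| puts "#{a[:model]}: #{a[:passed] ? 'accepted' : 'rejected'}" }
```

The nano output is rejected, the mini output is accepted, and the most expensive model is never called.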
62
+
63
+ **See it in 30 seconds:** `ruby examples/01_fallback_showcase.rb` — zero API keys required. The Test adapter simulates a tone/takeaways mismatch on the first attempt and a consistent sample on the retry, then prints the per-attempt trace.
64
+
65
+ `retry_policy` has three other shapes beyond cross-model escalation — same-model `attempts: 3` (absorbs sampling variance without paying for a stronger tier), `reasoning_effort` escalation (low → medium → high on one model), and cross-provider fallback (Ollama → Anthropic → OpenAI — local first because it costs nothing, hosted last because it is the most accurate). `examples/06_retry_variants.rb` runs all three through the Test adapter with the trace printed.
66
+
67
+ ## Failure 4 — Runaway cost and no fallback policy
68
+
69
+ Someone pastes a 40-page PDF into the endpoint that calls `SummarizeArticle`. The prompt expands to 80k tokens. Your provider bill jumps. Meanwhile, a separate team uses `gpt-4.1` for every single call because "quality matters" — even though 80% of their traffic is trivially handled by `gpt-4.1-nano` at 1/30th the cost. Neither situation is visible until the invoice arrives.
70
+
71
+ Contracts make cost a first-class concern:
72
+
73
+ ```ruby
74
+ max_input 2_000 # refuses before calling the API if tokens exceed budget
75
+ max_output 4_000
76
+ max_cost 0.01 # refuses if estimated cost exceeds cap
77
+ retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1] # cheap first, escalate only on failure
78
+ ```
79
+
80
+ The 40-page PDF returns `status: :limit_exceeded` — zero tokens spent. The 80/20 traffic pattern resolves on nano; only the hard 20% escalate. [`optimize_retry_policy`](optimizing_retry_policy.md) tells you empirically which fallback list is cheapest for *your* evals.
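A preflight limit amounts to estimating the input size and refusing before any request is made. The sketch below shows that shape in plain Ruby. `estimate_tokens` uses a crude four-characters-per-token heuristic, which is an assumption made for illustration, and `preflight` is a hypothetical helper; neither reflects how the gem actually estimates tokens or cost.

```ruby
# Assumption: roughly 4 characters per token. The gem's real estimator
# is not described in this guide.
def estimate_tokens(text)
  (text.length / 4.0).ceil
end

# Hypothetical preflight: refuse before any API call when the estimate
# exceeds the budget.
def preflight(input, max_input:)
  tokens = estimate_tokens(input)
  return { status: :limit_exceeded, estimated_tokens: tokens } if tokens > max_input

  { status: :ok, estimated_tokens: tokens }
end

huge_pdf_text = "a" * 320_000 # about 80k tokens under the assumed heuristic
puts preflight(huge_pdf_text, max_input: 2_000).inspect
puts preflight("short article", max_input: 2_000).inspect
```

The oversized input is refused with `:limit_exceeded` before anything is sent, while the short input passes through.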
81
+
82
+ ## Also catches
83
+
84
+ - **Leaked prompt placeholders**: the model echoes `{article}` or `{audience}` into the output because the template string wasn't interpolated. A `validate` string-equality check stops it before a user sees it.
85
+ - **Lazy models echoing input verbatim**: a cheap model returns the article text as the "summary". A 2-arity `validate("tldr shorter than input")` catches it.
86
+ - **Tone mislabel breaking downstream routing**: the content says "negative", the model labels it "analytical", and a customer success filter misses it. A cross-field `validate` catches the label/content drift.
87
+
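The first two checks in the list above can be sketched as plain predicates. Both helper names are hypothetical. The placeholder check here uses a regular expression for any `{name}` token rather than the string-equality comparison mentioned above, which is a deliberate simplification for illustration.

```ruby
# Matches an uninterpolated template token such as {article} or {audience}.
PLACEHOLDER = /\{[a-z_]+\}/

# Hypothetical check: does any string in the output still contain a placeholder?
def leaked_placeholder?(output)
  output.values.flatten.grep(String).any? { |value| value.match?(PLACEHOLDER) }
end

# Hypothetical 2-arity check: a summary should be shorter than what it summarises.
def tldr_shorter_than_input?(output, input)
  output[:tldr].length < input.length
end

article = "Ruby 3.4 was released today with a long list of changes and fixes."
leaky   = { tldr: "A summary for {audience}", takeaways: %w[a b c] }
echoed  = { tldr: article, takeaways: %w[a b c] }

puts leaked_placeholder?(leaky)
puts tldr_shorter_than_input?(echoed, article)
```

The leaky output is flagged, and the echoed output fails the length comparison because its summary is exactly as long as the input.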
88
+ ## Failure → contract mechanism
89
+
90
+ | Failure in production | Contract mechanism |
91
+ |---|---|
92
+ | Schema-valid but logically wrong output | `validate(...) { \|o, i\| ... }` with 2-arity for cross-checks |
93
+ | Silent prompt regression after a tweak | `define_eval` + `pass_eval(...).without_regressions` in CI |
94
+ | Sampling variance on fixed-temperature models (gpt-5 / o-series) | Cross-field `validate(...)` + `retry_policy models: [...]` |
95
+ | Runaway cost on pathological inputs | `max_input`, `max_output`, `max_cost` preflight |
96
+ | 80/20 traffic paying the premium model rate | `retry_policy` + `optimize_retry_policy` |
97
+ | Leaked placeholder / input echo / tone drift | `validate` with content and cross-input checks |
98
+
99
+ ## What next
100
+
101
+ - **If one of the failures above looks familiar** → [Getting Started](getting_started.md) walks through every feature in order on the same `SummarizeArticle` step.
102
+ - **If you're adopting in an existing Rails app** → [Migration](migration.md) shows Before/After for replacing a raw `LlmClient.new.call` service.
103
+ - **If you already ship LLM features and want to make them regression-safe** → [Eval-First](eval_first.md) is the workflow that prevents Failure 2 from ever happening again.
@@ -2,6 +2,6 @@
2
2
 
3
3
  module RubyLLM
4
4
  module Contract
5
- VERSION = "0.10.1"
5
+ VERSION = "0.10.2"
6
6
  end
7
7
  end
@@ -28,7 +28,7 @@ Gem::Specification.new do |spec|
28
28
  excluded_files = %w[TODO.md .rspec .rubycritic.yml .simplecov]
29
29
  `git ls-files -z`.split("\x0").reject do |f|
30
30
  (File.expand_path(f) == __FILE__) ||
31
- f.start_with?("spec/", "docs/", "doc/", ".ai/", ".claude/", ".git", ".revive/") ||
31
+ f.start_with?("spec/", "docs/ideas/", "doc/", ".ai/", ".claude/", ".git", ".revive/") ||
32
32
  excluded_files.include?(f)
33
33
  end
34
34
  end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: ruby_llm-contract
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.10.1
4
+ version: 0.10.2
5
5
  platform: ruby
6
6
  authors:
7
7
  - Justyna
@@ -66,6 +66,21 @@ files:
66
66
  - LICENSE
67
67
  - README.md
68
68
  - Rakefile
69
+ - docs/architecture.md
70
+ - docs/guide/best_practices.md
71
+ - docs/guide/eval_first.md
72
+ - docs/guide/getting_started.md
73
+ - docs/guide/migration.md
74
+ - docs/guide/multimodal_input.md
75
+ - docs/guide/optimizing_retry_policy.md
76
+ - docs/guide/output_schema.md
77
+ - docs/guide/pipeline.md
78
+ - docs/guide/prompt_ast.md
79
+ - docs/guide/rails_integration.md
80
+ - docs/guide/relation_to_agent.md
81
+ - docs/guide/relation_to_tribunal.md
82
+ - docs/guide/testing.md
83
+ - docs/guide/why.md
69
84
  - examples/00_basics.rb
70
85
  - examples/01_fallback_showcase.rb
71
86
  - examples/02_real_llm_minimal.rb