RubyGems - ruby_llm-contract - Versions diffs - 0.8.0 → 0.10.1 - Mend

ruby_llm-contract 0.8.0 → 0.10.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (30)

checksums.yaml +4 -4
data/CHANGELOG.md +64 -0
data/Gemfile.lock +2 -2
data/README.md +96 -37
data/lib/ruby_llm/contract/adapters/ruby_llm.rb +9 -1
data/lib/ruby_llm/contract/concerns/eval_host.rb +6 -9
data/lib/ruby_llm/contract/concerns/stub_helpers.rb +97 -0
data/lib/ruby_llm/contract/contract/definition.rb +2 -0
data/lib/ruby_llm/contract/cost_calculator.rb +11 -2
data/lib/ruby_llm/contract/eval/recommender.rb +3 -1
data/lib/ruby_llm/contract/eval/retry_optimizer.rb +16 -13
data/lib/ruby_llm/contract/eval.rb +13 -0
data/lib/ruby_llm/contract/minitest.rb +6 -108
data/lib/ruby_llm/contract/pipeline/result.rb +1 -1
data/lib/ruby_llm/contract/rake_task/suite_gate.rb +117 -0
data/lib/ruby_llm/contract/rake_task.rb +30 -51
data/lib/ruby_llm/contract/rspec/helpers.rb +9 -123
data/lib/ruby_llm/contract/step/base.rb +56 -24
data/lib/ruby_llm/contract/step/dsl.rb +91 -63
data/lib/ruby_llm/contract/step/limit_checker.rb +34 -1
data/lib/ruby_llm/contract/step/retry_executor.rb +6 -13
data/lib/ruby_llm/contract/step/runner.rb +22 -20
data/lib/ruby_llm/contract/step/runner_config.rb +26 -0
data/lib/ruby_llm/contract/version.rb +1 -1
data/lib/ruby_llm/contract.rb +1 -0
data/ruby_llm-contract.gemspec +5 -1
metadata +3 -4
data/.rspec +0 -3
data/.rubycritic.yml +0 -8
data/.simplecov +0 -22

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 33f6a8a686f7f20791904c4fbfacd19f6ea5b8bad428c374ec14b7e33521354d
-  data.tar.gz: a2b2f7d9ff1e6cd69b39d55a3809b1babbd82a5d42afe3c567733076f03fa317
+  metadata.gz: 66d3692be182ac8958132d06b53c1d47e01d07769d0e7a1942f81b71eb23a759
+  data.tar.gz: f254fcbc6bc8c2a42a78cff9ea3b24b23f04d860c60d79084bbd8a5d6ea6ec0b
 SHA512:
-  metadata.gz: '0895986db9cde7d26d2e91ffc7c6469b7d34df299a361148c9ee339dbf1dc61539e44adf250c00c06383717ed2e47ff250d4490d1cce43c3cdf8c3169529fba5'
-  data.tar.gz: ed35a4b4cc9ab1afd46c427468dcb33844d8d54207531f98a9f1d775004efc5a1e19d64fb22c64b20da81750165a3a979f4de4ecaa46aff4121eb7ba80a27ed2
+  metadata.gz: f035aca56a5c0e2ca0607a06431cbb76a6ff4cc45f2f35fe9309d534b881910fa05d63c5c581a6e9cbac5374377f6fc187e521a844a653da5b53d612e9a23a89
+  data.tar.gz: 31bb34333a28224e7c00ce2dea7b3905dce878c320fc0fb8a0e5db1313259a0e12e136ab6cc74126b25f0a47984830ae2172e3f824ccfdcb28f847929f40080a

data/CHANGELOG.md CHANGED

@@ -1,5 +1,69 @@
 # Changelog
+## 0.10.1 (2026-06-01)
+Patch release fixing gem packaging. 0.10.0 was yanked from rubygems.org due to the issue documented below; 0.10.1 is the recommended upgrade target. No code behavior change vs 0.10.0.
+### Fixed
+- **Gem no longer ships internal tracker / dev configs.** Excluded from `spec.files`: `TODO.md`, `.rspec`, `.rubycritic.yml`, `.simplecov`, and the `.revive/` directory. Pre-0.10.1 the published gem contained these files; adopters who already extracted 0.10.0 can safely delete them.
+## 0.10.0 (2026-06-01)
+First published release since 0.8.0. Consolidates work originally tagged as 0.9.0 (multimodal input) and 0.9.1 (internal quality refactor), neither of which was pushed to rubygems. Adopters upgrading from 0.8.0 should read the **Behavioural change** and **Breaking changes** sections below before installing.
+### Breaking changes
+- **`validate(description, &block)` and `Definition#invariant(description, &block)` now raise `ArgumentError` when `description` is `nil` or empty.** Pre-0.10.0 the empty descriptor was silently accepted and produced `""` entries in `result.validation_errors`, making debugging impossible. Codex audit found zero production use sites across `lib/`, `examples/`, `README` — only the regression-marker test certifying the bug.
+### Migration
+Ensure every `validate` / `invariant` call has a non-empty descriptor (this is already how every README example writes them):
+```ruby
+# Before (silently accepted, produced "" in validation_errors):
+validate("") { |o| o[:score].between?(0, 100) }
+# After (required):
+validate("score in range 0-100") { |o| o[:score].between?(0, 100) }
+```
+### Added
+- **Multimodal input via `context: { attachment: ... }`** — pass a file/IO/URL through `Step.run(input, context: { attachment: path })`; the adapter forwards it to `RubyLLM::Chat#ask(content, with: attachment)`. RubyLLM normalises wire format per provider (Anthropic url/base64, OpenAI `image_url`/`file`, Gemini `inline_data`). Multi-attachment supported natively (`with: [pdf1, pdf2]` or `with: { images: [...], pdfs: [...] }`). See [multimodal input guide](docs/guide/multimodal_input.md) and [ADR-0022](doc/decisions/ADR-0022-v09-multimodal-input.md).
+- **`attachment_token_estimate(n)` class macro** — adopter-declared conservative estimate of attachment input tokens. Applied to BOTH runtime (`limit_checker`) and pre-flight (`estimate_cost`) — same source of truth, no estimate/runtime drift.
+- **`on_unknown_attachment_size(:refuse | :warn)` class macro** — mirrors `on_unknown_pricing` opt-out semantics. Defaults to `:refuse`. Never settable as global default — same invariant as `max_cost` fail-closed.
+### Behavioural change (READ BEFORE UPGRADING)
+- **Contracts with `max_cost` or `max_input` AND `context[:attachment]` set AND no `attachment_token_estimate` declared → REFUSE with `:limit_exceeded`.** This is fail-closed semantics: the gem cannot bound vision/PDF token cost without an adopter-declared estimate. Opt out per-step with `on_unknown_attachment_size :warn`. Text-only contracts and contracts without `max_cost`/`max_input` are unaffected.
+### Changed
+- **`run_eval` (no args) return shape pinned to `Hash<String, Report>` keyed by eval name.** Documents the existing contract used by `RubyLLM::Contract::RakeTask#collect_host_reports` and adopters. No runtime change vs 0.8.0 — only the spec assertion now locks the shape.
+- **`Parser.parse(text, strategy: :json)` first-bracket-wins boundary documented.** Extraction commits to the first balanced `{` or `[` structure and does NOT retry on later candidates. Empty `{}` followed by real JSON parses as the empty Hash; non-JSON `{braces}` before real JSON raises `ParseError`. No runtime change — this codifies long-standing behavior with explicit boundary tests.
+### Fixed
+- **`with_retry_disabled` no longer mutates the step class's singleton method.** The optimizer now passes `retry_policy_override: nil` through `context:` to `compare_models`, which `Step::Base#runtime_settings` already honours. Removes a concurrency hazard where two parallel `optimize_retry_policy` calls on the same step would race on the singleton restore in `ensure`.
+- **`CostCalculator.find_model` exposed as a public class method.** Removes two `CostCalculator.send(:find_model, ...)` workarounds in `Step::Base#estimate_cost`. The `estimated_cost_for` helper is gone — `estimate_cost` now routes through the existing public `CostCalculator.calculate(model_name:, usage:)`.
+- **`stub_step` unified on a single storage path.** Both block and non-block forms now write to `RubyLLM::Contract.step_adapter_overrides` (thread-local). The `around(:each)` hook in `rspec.rb` handles cleanup between examples. Removes the prior `allow(step).to receive(:run)` branch.
+### Internal
+- **Anti-facade audit complete: 89/89 spec files under per-test 17-mode walk** (Phase A: 26 specs, Phase C: 63 specs via parallel Codex fan-out). Net +30 strengthened tests against mutation-blind assertions, zero public API change beyond the breaking entry above.
+- **Dead `ObjectSpace.each_object(Class)` fallback removed** in `concerns/eval_host.rb#register_subclasses`. The gemspec requires Ruby `>= 3.2.0`, so `Class#subclasses` (Ruby ≥ 3.1) is always available; the legacy fallback was unreachable code that would have iterated all loaded classes O(n) and was not thread-safe.
+### Deferred (not in 0.10.x)
+- `add_history` multi-turn replay of prior attachments — single-turn multimodal supported; follow-up questions on the same document deferred to a later release.
+- Streaming + attachment — contract steps remain synchronous.
+- Provider-specific attachment size caps — surface only via `attachment_token_estimate` calibration; consult provider docs.
+### Tests
+- Suite: 1401 examples / 0 failures / 7 pending (was 1346/0/8 at 0.8.0).
 ## 0.8.0 (2026-04-26)
 Narrative repositioning + small API additions. Internal architecture unchanged: no `Step::Base` refactor, no breaking changes to existing DSL.
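
The "first-bracket-wins" boundary described in the 0.10.0 changelog entry can be illustrated with a minimal sketch. This is not the gem's `Parser`; `first_balanced_span` and `parse_first_json` are hypothetical helpers, and the sketch raises `JSON::ParserError` where the gem is documented to raise its own `ParseError`. It also treats `{` and `[` as interchangeable when counting depth, which a real parser would not.

```ruby
require "json"

# Hypothetical helper (not the gem's Parser): commit to the first `{` or `[`
# and return the balanced span starting there, or nil if it never closes.
def first_balanced_span(text)
  start = text.index(/[{\[]/)
  return nil unless start

  depth = 0
  in_string = false
  escaped = false

  text[start..].each_char.with_index do |char, offset|
    if in_string
      # Inside a JSON string: brackets do not count, honour backslash escapes.
      if escaped
        escaped = false
      elsif char == "\\"
        escaped = true
      elsif char == '"'
        in_string = false
      end
      next
    end

    case char
    when '"' then in_string = true
    when "{", "[" then depth += 1
    when "}", "]"
      depth -= 1
      return text[start, offset + 1] if depth.zero?
    end
  end

  nil
end

# Parse only the first candidate; a later valid structure is never tried.
def parse_first_json(text)
  span = first_balanced_span(text)
  raise JSON::ParserError, "no balanced JSON structure found" unless span

  JSON.parse(span, symbolize_names: true)
end
```

Under these assumptions, `parse_first_json('{} then {"a": 1}')` returns the empty Hash, and `parse_first_json('{braces} {"a": 1}')` raises instead of falling through to the later, valid object.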

data/Gemfile.lock CHANGED

@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    ruby_llm-contract (0.8.0)
+    ruby_llm-contract (0.10.1)
       dry-types (~> 1.7)
       ruby_llm (~> 1.12)
       ruby_llm-schema (~> 0.3)
@@ -258,7 +258,7 @@ CHECKSUMS
   rubocop-ast (1.49.1) sha256=4412f3ee70f6fe4546cc489548e0f6fcf76cafcfa80fa03af67098ffed755035
   ruby-progressbar (1.13.0) sha256=80fc9c47a9b640d6834e0dc7b3c94c9df37f08cb072b7761e4a71e22cff29b33
   ruby_llm (1.14.0) sha256=57c6f7034fc4a44504ea137d70f853b07824f1c1cdbe774ab3ab3522e7098deb
-  ruby_llm-contract (0.8.0)
+  ruby_llm-contract (0.10.1)
   ruby_llm-schema (0.3.0) sha256=a591edc5ca1b7f0304f0e2261de61ba4b3bea17be09f5cf7558153adfda3dec6
   ruby_parser (3.22.0) sha256=1eb4937cd9eb220aa2d194e352a24dba90aef00751e24c8dfffdb14000f15d23
   rubycritic (4.12.0) sha256=024fed90fe656fa939f6ea80aab17569699ac3863d0b52fd72cb99892247abc8

data/README.md CHANGED

@@ -2,9 +2,9 @@
 **Contracts + Evals for [ruby_llm](https://github.com/crmne/ruby_llm).**
-Your eval passed. Prod broke anyway? This gem wraps `RubyLLM::Chat` with input/output contracts, business-rule validation, retry with model escalation on validation failure, pre-flight cost ceilings, and an evaluation framework — so a flaky cheap-model call escalates to a stronger model instead of shipping garbage to your user.
+Your eval passed. Prod broke anyway? This gem wraps `RubyLLM::Chat` with input/output contracts, business-rule validation, retry with model escalation on validation failure, pre-flight cost ceilings, and a regression-eval framework — so a flaky cheap-model call escalates to a stronger model instead of shipping garbage to your user.
-`ruby_llm` handles the HTTP side (rate limits, timeouts, streaming, tool calls, embeddings). This gem handles what the model *returned*: schema validation, business rules, model escalation on failed validation, datasets, regression tests.
+`ruby_llm` handles the HTTP side (rate limits, timeouts, streaming, tool calls, embeddings). This gem handles what the model *returned* at **runtime**: schema validation, business rules, model escalation on failed validation, regression datasets that gate prompt/model changes in CI.
 ## Install
@@ -13,23 +13,24 @@ gem "ruby_llm-contract"
 ```
 ```ruby
-RubyLLM.configure { |c| c.openai_api_key = ENV["OPENAI_API_KEY"] }
-RubyLLM::Contract.configure { |c| c.default_model = "gpt-4.1-mini" }
-```
-Works with any `ruby_llm` provider (OpenAI, Anthropic, Gemini, etc).
-## Do I need this?
+RubyLLM.configure do |c|
+  c.openai_api_key = ENV["OPENAI_API_KEY"]
+  c.default_model  = "gpt-4.1-mini"   # used when a Step has no explicit model
+end
-Use this if LLM output affects production behaviour, money, user trust, or downstream code. You probably don't need it if you have one low-risk prompt, manually inspect every result, or only generate best-effort prose.
+# Required: boots the gem so `Step.run` knows how to talk to your LLM.
+# Empty block is fine. Pass options here if you need them (e.g. `c.logger`).
+RubyLLM::Contract.configure { }
+```
-Already using structured outputs from your provider? This gem adds business-rule validation, retry with model fallback, evals, regression gating, and test stubs on top of them — the layer that stops schema-valid-but-wrong output from reaching users. See [Why contracts?](docs/guide/why.md) for the four production failure modes the gem exists for, or run `ruby examples/01_fallback_showcase.rb` to see the fallback loop in 30 seconds (no API key needed).
+Works with any `ruby_llm` provider (OpenAI, Anthropic, Gemini, etc). Requires `ruby_llm ~> 1.12` and Ruby ≥ 3.2.
 ## Example
 A Rails app takes article text extracted from a user-submitted URL and wants to show a summary card: a short TL;DR, 3–5 key takeaways, and a tone label. The output has to fit the UI (TL;DR under 200 chars) and the schema has to be strict enough to render without conditionals.
 ```ruby
+# app/contracts/summarize_article.rb
 class SummarizeArticle < RubyLLM::Contract::Step::Base
   prompt <<~PROMPT
     Summarize this article for a UI card. Return a short TL;DR,
@@ -45,48 +46,93 @@ class SummarizeArticle < RubyLLM::Contract::Step::Base
   end
   validate("TL;DR fits the card")  { |o, _| o[:tldr].length <= 200 }
-  validate("takeaways are unique") { |o, _| o[:takeaways].uniq.size == o[:takeaways].size }
+  validate("takeaways are unique") { |o, _| o[:takeaways] == o[:takeaways].uniq }
-  retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
+  # Cheapest first; last step adds a reasoning model with more thinking.
+  retry_policy do
+    escalate "gpt-4.1-nano",
+             "gpt-4.1-mini",
+             { model: "gpt-5", reasoning_effort: "high" }
+  end
 end
 result = SummarizeArticle.run(article_text)
-result.parsed_output    # => { tldr: "...", takeaways: [...], tone: "analytical" }
-result.trace[:model]    # => "gpt-4.1-nano"  (first model that passed)
-result.trace[:cost]     # => 0.000032
+result.status           # => :ok  (or :validation_failed if all steps fail)
+result.parsed_output    # => { tldr: "...", takeaways: [...], tone: "..." }
+result.trace[:model]    # => "gpt-4.1-mini"  (winning step)
+result.trace[:cost]     # => 0.000520        (total across all attempts)
+result.trace[:attempts]
+# => [
+#      {
+#        attempt: 1,
+#        model: "gpt-4.1-nano",
+#        status: :validation_failed,
+#        usage: { input_tokens: 256, output_tokens: 84 },
+#        latency_ms: 45,
+#        cost: 0.000100
+#      },
+#      {
+#        attempt: 2,
+#        model: "gpt-4.1-mini",
+#        status: :ok,
+#        usage: { input_tokens: 256, output_tokens: 92 },
+#        latency_ms: 92,
+#        cost: 0.000420
+#      }
+#    ]
+```
+If the response is malformed, the TL;DR overflows the card, or the takeaway count is off, the gem moves to the next step. This is model **escalation**, not a fallback list — each step is an independent config (`model`, `reasoning_effort`), so the retry policy spends more compute only when the cheaper one couldn't satisfy the contract.
+### Add a CI gate in 6 lines
+The contract above already runs in production. The same `Step` doubles as the unit your regression eval runs against:
+```ruby
+SummarizeArticle.define_eval("regression") do
+  # `expected:` is a partial hash match — only listed keys check parsed_output.
+  add_case "neutral release",
+           input: "Ruby 3.4 shipped frozen string literals...",
+           expected: { tone: "analytical" }
+  add_case "outage post",
+           input: "Service was down for 4 hours...",
+           expected: { tone: "negative" }
+end
+# in CI (RSpec):
+expect(SummarizeArticle).to pass_eval("regression").without_regressions
 ```
-The model returns JSON matching the schema. If the response is malformed, the TL;DR overflows the card, or the takeaway count is off, the gem retries — moving to the next model in `models:` only when the cheaper one can't satisfy the rules. In this setup cheaper models are tried first and the expensive ones are used only when cheaper models fail.
+A bad prompt edit or model swap that drops accuracy on the frozen dataset → red CI, blocked merge. The first CI run records a baseline; subsequent runs compare against it. Every production miss should become the next `add_case`. See [Prevent silent prompt regressions](docs/guide/eval_first.md) for the full flywheel.
+## Do I need this?
+Use this if LLM output affects production behaviour, money, user trust, or downstream code. You probably don't need it if you have one low-risk prompt, manually inspect every result, or only generate best-effort prose.
-You could write this loop yourself once. The gem gives you the loop, a trace of every attempt (model, status, cost, latency), fallback policy, evals, baselines, and CI checks as one contract object — tracked per-step so adding a new LLM feature to your app is one class, not one-off scaffolding.
+Already using structured outputs from your provider? This gem adds business-rule validation, retry with model escalation, evals, regression gating, and test stubs on top of them — the layer that stops schema-valid-but-wrong output from reaching users. See [Why contracts?](docs/guide/why.md) for the four production failure modes the gem exists for.
 ## Most useful next
 Everything below is optional — the example above is a complete step. Reach for these when one step isn't enough.
-- **[CI regression gates](docs/guide/getting_started.md)** — `define_eval` + `save_baseline!` + `pass_eval(...).without_regressions` blocks CI when accuracy drops on a model update or prompt tweak.
-- **[Find the cheapest viable fallback list](docs/guide/optimizing_retry_policy.md)** — `Step.recommend("regression", candidates: [...], min_score: 0.95)` returns the cheapest list of models that still passes your evals. `production_mode:` measures retry-aware cost.
-- **[A/B test prompts](docs/guide/eval_first.md)** — `SummarizeArticleV2.compare_with(SummarizeArticleV1, eval: "regression")` reports whether the new prompt is safe to ship.
-- **[Budget caps](docs/guide/getting_started.md)** — `max_cost`, `max_input`, `max_output` refuse the request before calling the API when a heuristic estimate (~±30% accuracy) exceeds the limit.
-- **[Reasoning effort / thinking config](docs/guide/optimizing_retry_policy.md)** — `thinking effort: :low` (or alias `reasoning_effort :low`) on the Step class; mirrors `RubyLLM::Agent.thinking` and forwards through `Chat#with_thinking`.
+- **[CI regression gates](docs/guide/getting_started.md)** — block CI when accuracy drops on a model update or prompt tweak.
+- **[Find the cheapest viable fallback list](docs/guide/optimizing_retry_policy.md)** — empirically pick the cheapest model chain that still passes your evals.
+- **[A/B test prompts](docs/guide/eval_first.md)** — measure whether a new prompt is safe to ship before merging.
+- **[Budget caps](docs/guide/getting_started.md)** — refuse the request pre-flight when an estimate exceeds the limit.
+- **[Reasoning effort / thinking config](docs/guide/optimizing_retry_policy.md)** — Anthropic / OpenAI thinking configuration on the Step class.
-Also supports [multi-step pipelines](docs/guide/pipeline.md) with fail-fast and `retry_policy attempts: N` for niche cases (we measured this empirically — for `gpt-4.1-nano` / `gpt-5-nano` on tasks with clear correctness criteria, same-model retry rarely helps; `escalate(model_2)` is the strategy that moves the needle, see [optimizing_retry_policy.md](docs/guide/optimizing_retry_policy.md)).
+Also supports [multi-step pipelines](docs/guide/pipeline.md) with fail-fast and per-step models.
 ## Relation to `RubyLLM::Agent`
-`RubyLLM::Agent` (since RubyLLM 1.12) and `Step::Base` here target the **same niche**: reusable, class-based prompts. They are siblings, not foundation-and-roof.
+`Step::Base` and `RubyLLM::Agent` (since RubyLLM 1.12) are **siblings** targeting the same niche: reusable, class-based prompts. Both call into `RubyLLM::Chat` directly — Step does not wrap Agent. Step adds the contract layer: `validate` (business invariants), `retry_policy escalate(...)` (model escalation on validation failure), `max_cost` pre-flight refusal, regression-eval framework, pipeline composition. **[Full feature mapping →](docs/guide/relation_to_agent.md)**
-| What you write | Where it lives |
-|---|---|
-| `model`, `temperature`, `schema`, `instructions`, `tools`, `thinking` | covered by both — same idea, different DSL surface |
-| `validate :rule do |out| ... end` business invariants | only here |
-| `retry_policy escalate(...)` model escalation on validation failure | only here (different from RubyLLM's network-level retry) |
-| `max_cost` / `max_input` / `max_output` pre-flight refusal | only here |
-| `define_eval` + baseline regression + `compare_models` + `optimize_retry_policy` | only here (RubyLLM does not ship an eval framework) |
-| Pipeline composition with `step SomeStep, as: :alias` | only here (RubyLLM intentionally leaves workflows as plain Ruby) |
-| `around_call`, named `observe` hooks with pass/fail in trace | only here |
+## Relation to `ruby_llm-tribunal`
-`Step::Base` does NOT use `Agent` internally today — both call into `RubyLLM::Chat` directly. The two abstractions can coexist on the same project: use `Agent` for prompt-only reuse, use `Step` when you need any of the contract-layer features above. The retry-strategy framing here (favouring `escalate(...)` over same-model `attempts: N`) is grounded in an empirical comparison; `attempts: N` stays in the API for niche cases.
+Different layers, complementary. [`ruby_llm-tribunal`](https://github.com/Alqemist-labs/ruby_llm-tribunal) is a **test framework** that grades outputs **after they've reached your code**, typically in a spec. `ruby_llm-contract` is **runtime** — schema + `validate` rules gate the call **before the output reaches your code**, retry/escalate attempts to recover from failed outputs, `max_cost` refuses pre-flight. Our `define_eval` is *regression* (does this prompt/model still pass on a frozen dataset?), not *grading*.
+**One-liner:** Tribunal answers *"is this output good?"* (fail → red test in CI). Contract answers *"what do we do when it isn't?"* (fail → retry/escalate, or fail closed). **[Visual flows + coexistence patterns →](docs/guide/relation_to_tribunal.md)**
 ## Docs
@@ -95,6 +141,8 @@ Also supports [multi-step pipelines](docs/guide/pipeline.md) with fail-fast and
 | Guide | What it does for your app |
 |-------|---------------------------|
 | [Why contracts?](docs/guide/why.md) | Recognise the four production failures the gem exists for |
+| [Relation to RubyLLM::Agent](docs/guide/relation_to_agent.md) | Sibling abstractions; what each adds; runtime call path; coexistence patterns |
+| [Relation to ruby_llm-tribunal](docs/guide/relation_to_tribunal.md) | Different layers (test framework vs runtime contract); visual flows; integration recipes |
 | [Getting Started](docs/guide/getting_started.md) | Walk the full feature set on one concrete step |
 | [Rails integration](docs/guide/rails_integration.md) | Directory, initializer, jobs, logging, specs, CI gate — 7 FAQs for Rails devs |
 | [Adopt in an existing Rails app](docs/guide/migration.md) | Replace raw `LlmClient.call` with a contract, Before/After |
@@ -103,12 +151,23 @@ Also supports [multi-step pipelines](docs/guide/pipeline.md) with fail-fast and
 | [Write validate rules that catch real bugs](docs/guide/best_practices.md) | Patterns for cross-input checks and content-quality rules |
 | [Stub LLM calls in tests](docs/guide/testing.md) | Deterministic specs, RSpec + Minitest matchers |
 | [Chain LLM calls into a pipeline](docs/guide/pipeline.md) | Multi-step with fail-fast and per-step models |
+| [Multimodal input (PDF / image / audio)](docs/guide/multimodal_input.md) | Route attachments through the contract; `attachment_token_estimate`, fail-closed cost, calibration table |
 | [Schema DSL reference](docs/guide/output_schema.md) | Every constraint, nested objects, pattern table |
 | [Prompt DSL reference](docs/guide/prompt_ast.md) | `system` / `rule` / `section` / `example` / `user` nodes |
-## Roadmap
+## Status & versioning
+Pre-1.0 (currently **0.10.1**). Semver tracked; breaking changes flagged in [CHANGELOG](CHANGELOG.md). Pin `~> 0.10.1` until 1.0 ships.
+## FAQ
+**Thread-safe / Sidekiq?** Yes. Each `Step.run` builds an isolated `RubyLLM::Chat`; class-level state (`output_schema`, `validate`, `retry_policy`) is set up once at class load and read-only afterwards. Safe to run from concurrent jobs/threads.
+**How do I stub `Step.run` in specs?** Include `RubyLLM::Contract::RSpec::Helpers` and use `stub_step(MyStep, response: { ... })`. The block form scopes the stub to one `it`. See [testing guide](docs/guide/testing.md).
+**Where in a Rails app?** Default `app/contracts/`. The Railtie reloads `app/contracts/eval/` and `app/steps/eval/` in development; any autoloaded directory also works. See [Rails integration](docs/guide/rails_integration.md).
-Latest: **v0.8.0** — tagline + narrative repositioning around "Contracts + Evals for RubyLLM", `thinking` / `reasoning_effort` class macro, TokenEstimator labelled as heuristic, CostCalculator repositioned. See [CHANGELOG](CHANGELOG.md) for history.
+**Upgraded to 0.9.0 and my contract started refusing — why?** 0.9.0 added multimodal input. If your contract has `max_cost` or `max_input` set AND now receives `context: { attachment: ... }`, you must declare `attachment_token_estimate(n)` (conservative input-token budget for the attachment) — otherwise the call fails closed with `:limit_exceeded`. The gem cannot bound vision/PDF cost without your estimate. Opt out per-step with `on_unknown_attachment_size :warn`. Text-only contracts are unaffected. See [multimodal input guide](docs/guide/multimodal_input.md).
 ## License
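
The fail-closed rule from the README FAQ and the changelog's behavioural-change entry can be read as a small decision table. The sketch below is an assumption-laden illustration: `attachment_limit_decision` and its keyword arguments are hypothetical names, and the gem's real check (in its limit checker) is not shown in this diff excerpt.

```ruby
# Hypothetical decision helper mirroring the documented rule; it is not the
# gem's limit checker, only a restatement of the prose as code.
def attachment_limit_decision(limits:, context:, attachment_token_estimate: nil,
                              on_unknown_attachment_size: :refuse)
  bounded = limits[:max_cost] || limits[:max_input]

  # Text-only calls and contracts without cost/input limits are unaffected.
  return :proceed unless bounded && context[:attachment]

  # An adopter-declared estimate lets the limits be evaluated normally.
  return :proceed if attachment_token_estimate

  # Per-step opt-out: warn instead of refusing.
  return :warn_and_proceed if on_unknown_attachment_size == :warn

  :limit_exceeded
end

attachment_limit_decision(limits: { max_cost: 0.01 },
                          context: { attachment: "report.pdf" })
# => :limit_exceeded
```

The design point is that the default branch is the refusal: a missing estimate is treated as unbounded cost unless the step explicitly opts into `:warn`.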

data/lib/ruby_llm/contract/adapters/ruby_llm.rb CHANGED

@@ -13,7 +13,15 @@ module RubyLLM
           chat = build_chat(options, system_contents)
           add_history(chat, conversation[0..-2])
-          response = chat.ask(conversation.last&.fetch(:content, ""))
+          # `with: nil` is a documented no-op in RubyLLM (verified against
+          # 1.15.0: chat.rb:36-37 `build_content(message, nil)` -> content.rb:8-14
+          # `Content.new(text, nil)` keeps text-only path when attachments
+          # are empty; raise only fires when BOTH text and attachments are nil,
+          # and we always pass a non-nil string thanks to `&.fetch(:content, "")`).
+          response = chat.ask(
+            conversation.last&.fetch(:content, ""),
+            with: options[:attachment]
+          )
           build_response(response)
         end

data/lib/ruby_llm/contract/concerns/eval_host.rb CHANGED

@@ -190,16 +190,13 @@ module RubyLLM
           context.merge(adapter: sample_adapter)
         end
+        # `Class#subclasses` is available from Ruby 3.1; the gemspec requires
+        # `>= 3.2.0` so the legacy `ObjectSpace.each_object` fallback would be
+        # dead code on every supported runtime.
         def register_subclasses(klass)
-          if klass.respond_to?(:subclasses)
-            klass.subclasses.each do |sub|
-              Contract.register_eval_host(sub)
-              register_subclasses(sub)
-            end
-          else
-            ObjectSpace.each_object(Class) do |sub|
-              Contract.register_eval_host(sub) if sub < klass
-            end
+          klass.subclasses.each do |sub|
+            Contract.register_eval_host(sub)
+            register_subclasses(sub)
           end
         end
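
The simplified `register_subclasses` relies on `Class#subclasses` (Ruby 3.1+) returning only direct subclasses, so the recursion is what reaches deeper descendants. A minimal sketch of the same walk, using hypothetical stand-in classes rather than real step classes:

```ruby
# Hypothetical stand-ins for step classes; names are illustrative only.
class BaseStep; end
class SummarizeStep < BaseStep; end
class ShortSummarizeStep < SummarizeStep; end
class ClassifyStep < BaseStep; end

# Depth-first walk: each direct subclass, followed by its own descendants.
def descendants_of(klass)
  klass.subclasses.flat_map { |sub| [sub, *descendants_of(sub)] }
end

# Sort by name because the order of `subclasses` should not be relied on.
names = descendants_of(BaseStep).map(&:name).sort
# => ["ClassifyStep", "ShortSummarizeStep", "SummarizeStep"]
```

Without the recursive call, `ShortSummarizeStep` would be missed, since it is a grandchild of `BaseStep`.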

data/lib/ruby_llm/contract/concerns/stub_helpers.rb ADDED

@@ -0,0 +1,97 @@
+# frozen_string_literal: true
+module RubyLLM
+  module Contract
+    module Concerns
+      # Shared implementation of `stub_step`, `stub_steps`, and `stub_all_steps`.
+      # Included by both `RubyLLM::Contract::RSpec::Helpers` and
+      # `RubyLLM::Contract::MinitestHelpers` so the two test-framework
+      # adapters cannot drift on stub semantics (Codex DRY finding #1: the
+      # prior parallel implementations had already diverged on
+      # `normalize_test_response` — RSpec had it, Minitest didn't).
+      #
+      # Cleanup between examples is the responsibility of the host helper:
+      # - RSpec: `around(:each)` hook in `lib/ruby_llm/contract/rspec.rb`
+      #   restores `step_adapter_overrides`.
+      # - Minitest: `teardown` in `MinitestHelpers` clears overrides and
+      #   restores `default_adapter`.
+      module StubHelpers
+        # Stub a single step to return a canned response without API calls.
+        # Block form scopes the stub to the block; non-block form lives
+        # until the host's teardown/around hook fires.
+        def stub_step(step_class, response: nil, responses: nil, &block)
+          adapter = build_test_adapter(response: response, responses: responses)
+          overrides = RubyLLM::Contract.step_adapter_overrides
+          if block
+            previous = overrides[step_class]
+            overrides[step_class] = adapter
+            begin
+              yield
+            ensure
+              previous ? (overrides[step_class] = previous) : overrides.delete(step_class)
+            end
+          else
+            overrides[step_class] = adapter
+          end
+        end
+        # Stub multiple steps with different responses. Requires a block.
+        def stub_steps(stubs, &block)
+          raise ArgumentError, "stub_steps requires a block" unless block
+          overrides = RubyLLM::Contract.step_adapter_overrides
+          previous = {}
+          stubs.each do |step_class, opts|
+            opts = opts.transform_keys(&:to_sym)
+            previous[step_class] = overrides[step_class]
+            overrides[step_class] = build_test_adapter(**opts.slice(:response, :responses))
+          end
+          begin
+            yield
+          ensure
+            stubs.each_key do |step_class|
+              previous[step_class] ? (overrides[step_class] = previous[step_class]) : overrides.delete(step_class)
+            end
+          end
+        end
+        # Set a global test adapter for ALL steps. Block form restores the
+        # previous adapter on exit; non-block form persists until host cleanup.
+        def stub_all_steps(response: nil, responses: nil, &block)
+          adapter = build_test_adapter(response: response, responses: responses)
+          if block
+            previous = RubyLLM::Contract.configuration.default_adapter
+            begin
+              RubyLLM::Contract.configuration.default_adapter = adapter
+              yield
+            ensure
+              RubyLLM::Contract.configuration.default_adapter = previous
+            end
+          else
+            RubyLLM::Contract.configure { |c| c.default_adapter = adapter }
+          end
+        end
+        private
+        def build_test_adapter(response: nil, responses: nil)
+          if responses
+            Adapters::Test.new(responses: responses.map { |r| normalize_test_response(r) })
+          else
+            Adapters::Test.new(response: normalize_test_response(response))
+          end
+        end
+        # Hook for host frameworks to inject custom serialization (e.g.
+        # turning hashes into JSON strings). Default: identity.
+        def normalize_test_response(value)
+          value
+        end
+      end
+    end
+  end
+end
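
The block forms of `stub_step` and `stub_steps` above share one pattern: save the previous override, install the stub, yield, and restore in `ensure`. The sketch below isolates that pattern on a plain Hash. `with_override` is a hypothetical helper, and it tracks key presence with `key?` rather than the truthiness check used in the diff, which is a deliberate variation for the illustration.

```ruby
# Run the block with `key` temporarily set to `value` in `overrides`, then
# restore the previous entry (or remove the key) even if the block raises.
def with_override(overrides, key, value)
  had_key = overrides.key?(key)
  previous = overrides[key]
  overrides[key] = value
  yield
ensure
  had_key ? (overrides[key] = previous) : overrides.delete(key)
end

overrides = { summarize: :real_adapter }

with_override(overrides, :summarize, :stub_adapter) do
  overrides[:summarize] # => :stub_adapter inside the block
end

overrides[:summarize] # => :real_adapter again after the block
```

Because the restore runs in `ensure`, a failing example inside the block cannot leak its stub into later examples.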

data/lib/ruby_llm/contract/contract/definition.rb CHANGED

@@ -18,6 +18,8 @@ module RubyLLM
       end
       def invariant(description, &block)
+        raise ArgumentError, "invariant description must be a non-empty string", caller if description.to_s.empty?
         @invariants << Invariant.new(description, block)
       end
       alias validate invariant
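
The guard added to `invariant` can be exercised in isolation. `MiniDefinition` below is a hypothetical miniature, not the gem's `Definition` class; it keeps only the descriptor check and the alias so the edge cases are visible.

```ruby
# Hypothetical miniature of a contract definition, for illustration only.
class MiniDefinition
  attr_reader :invariants

  def initialize
    @invariants = []
  end

  def invariant(description, &block)
    # Same condition as the diff: nil and "" both become "" via to_s.
    raise ArgumentError, "invariant description must be a non-empty string" if description.to_s.empty?

    @invariants << [description, block]
    self
  end
  alias validate invariant
end

definition = MiniDefinition.new
definition.validate("score in range 0-100") { |o| o[:score].between?(0, 100) }

begin
  definition.validate("") { |o| o[:score] }
rescue ArgumentError => e
  e.message # => "invariant description must be a non-empty string"
end
```

Note that `description.to_s.empty?` is false for a whitespace-only string such as `" "`, so a check written this way accepts blank-looking descriptors.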

data/lib/ruby_llm/contract/cost_calculator.rb CHANGED

@@ -89,8 +89,13 @@ module RubyLLM
         (input_cost + output_cost).round(6)
       end
+      # Provider pricing is denominated per 1M tokens; divide here to get
+      # the dollar cost for the actual usage count. Named constant for
+      # consistency with how RubyLLM and provider docs express prices.
+      TOKENS_PER_MILLION = 1_000_000.0
       def self.token_cost(tokens, price_per_million)
-        (tokens || 0) * (price_per_million || 0) / 1_000_000.0
+        (tokens || 0) * (price_per_million || 0) / TOKENS_PER_MILLION
       end
       def self.find_model(model_name)
@@ -111,7 +116,11 @@ module RubyLLM
         end
       end
-      private_class_method :compute_cost, :token_cost, :find_model, :validate_price!
+      # `find_model` is intentionally public: `Step::Base#estimate_cost` needs
+      # to inspect model pricing before invoking `calculate` (e.g., to short-
+      # circuit estimate when the model is unknown). Exposing it removes
+      # `CostCalculator.send(:find_model)` workarounds at call sites.
+      private_class_method :compute_cost, :token_cost, :validate_price!
     end
   end
 end
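
Worked example for the `token_cost` arithmetic above, as a minimal sketch. The prices are made-up illustration values, and `demo_token_cost` and `DEMO_TOKENS_PER_MILLION` are hypothetical names mirroring the private method in the diff.

```ruby
DEMO_TOKENS_PER_MILLION = 1_000_000.0

# Same formula as the diff: price is quoted per 1M tokens, so divide by 1M.
# nil tokens or nil price are treated as zero.
def demo_token_cost(tokens, price_per_million)
  (tokens || 0) * (price_per_million || 0) / DEMO_TOKENS_PER_MILLION
end

# Illustration prices only (assumption): 3.0 per 1M input, 15.0 per 1M output.
input_cost  = demo_token_cost(1_500, 3.0)   # 1_500 * 3.0 / 1_000_000
output_cost = demo_token_cost(500, 15.0)    # 500 * 15.0 / 1_000_000
total_cost  = (input_cost + output_cost).round(6)

missing_usage_cost = demo_token_cost(nil, 3.0) # nil tokens count as zero
```

With these inputs the input side contributes 0.0045 and the output side 0.0075, so the rounded total is 0.012.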

data/lib/ruby_llm/contract/eval/recommender.rb CHANGED

@@ -4,7 +4,9 @@ module RubyLLM
   module Contract
     module Eval
       class Recommender
-        def initialize(comparison:, min_score:, min_first_try_pass_rate: 0.8, current_config: nil)
+        def initialize(comparison:, min_score:,
+                       min_first_try_pass_rate: DEFAULT_MIN_FIRST_TRY_PASS_RATE,
+                       current_config: nil)
           @comparison = comparison
           @min_score = min_score
           @min_first_try_pass_rate = min_first_try_pass_rate

data/lib/ruby_llm/contract/eval/retry_optimizer.rb CHANGED

@@ -98,7 +98,9 @@ module RubyLLM
           end
         end
-        def initialize(step:, candidates:, context: {}, min_score: 0.95, runs: 1, production_mode: nil)
+        def initialize(step:, candidates:, context: {},
+                       min_score: DEFAULT_MIN_SCORE,
+                       runs: 1, production_mode: nil)
           @step = step
           @candidates = candidates
           @context = context
@@ -113,10 +115,19 @@ module RubyLLM
           score_matrix = {}
           evals.each do |eval_name|
-            comparison = with_retry_disabled do
-              @step.compare_models(eval_name, candidates: @candidates, context: @context,
-                                              runs: @runs, production_mode: @production_mode)
-            end
+            # `retry_policy_override: nil` in context disables the step's
+            # class-level retry policy for this comparison run — see
+            # step/base.rb#runtime_settings, which honours the key when
+            # present (even when value is nil). Replaces the prior
+            # `define_singleton_method(:retry_policy)` mutation, which was
+            # not thread-safe across concurrent optimizer calls.
+            comparison = @step.compare_models(
+              eval_name,
+              candidates: @candidates,
+              context: @context.merge(retry_policy_override: nil),
+              runs: @runs,
+              production_mode: @production_mode
+            )
             score_matrix[eval_name] = extract_scores(comparison)
           end
@@ -203,14 +214,6 @@ module RubyLLM
           end
         end
-        def with_retry_disabled(&block)
-          original = @step.retry_policy if @step.respond_to?(:retry_policy)
-          @step.define_singleton_method(:retry_policy) { nil }
-          block.call
-        ensure
-          @step.define_singleton_method(:retry_policy) { original }
-        end
         def empty_result(evals)
           Result.new(
             step_name: @step.name || @step.to_s,
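
Note on the `retry_policy_override` change above: the diff comment says the key is honoured when present, even with a `nil` value. A lookup like that has to test key presence rather than truthiness. The sketch below shows one way it could work. `DemoStep` and `effective_retry_policy` are hypothetical, and the real `runtime_settings` in `step/base.rb` is not shown in this section and may differ.

```ruby
# Hypothetical step with a class-level retry policy (assumption).
class DemoStep
  CLASS_RETRY_POLICY = { attempts: 3 }.freeze

  # Presence of the key wins, even when its value is nil; `context[:key] ||`
  # would wrongly fall back to the class policy for an explicit nil.
  def self.effective_retry_policy(context)
    if context.key?(:retry_policy_override)
      context[:retry_policy_override]
    else
      CLASS_RETRY_POLICY
    end
  end
end

base_context = { locale: "en" }

# Hash#merge returns a new hash, so the caller's context is left untouched
# and no shared state on the step class is modified.
run_context = base_context.merge(retry_policy_override: nil)

default_policy  = DemoStep.effective_retry_policy(base_context)
disabled_policy = DemoStep.effective_retry_policy(run_context)
```

Because the override travels in the per-call context instead of being written onto the step object, two concurrent callers cannot overwrite each other's setting, which is the thread-safety concern the diff comment raises about `define_singleton_method`.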

data/lib/ruby_llm/contract/eval.rb CHANGED

@@ -33,3 +33,16 @@ require_relative "eval/eval_history"
 require_relative "eval/recommendation"
 require_relative "eval/recommender"
 require_relative "eval/retry_optimizer"
+module RubyLLM
+  module Contract
+    module Eval
+      # Default thresholds shared across `recommend`, `optimize_retry_policy`,
+      # `Recommender`, and `RetryOptimizer`. Centralised so a single change in
+      # what "viable" means (e.g. tightening from 0.95 to 0.97) propagates
+      # everywhere instead of needing the same edit in 4 places.
+      DEFAULT_MIN_SCORE = 0.95
+      DEFAULT_MIN_FIRST_TRY_PASS_RATE = 0.8
+    end
+  end
+end
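
Note on load order: in this hunk the constants are defined after the `require_relative` lines that load `Recommender` and `RetryOptimizer`, yet those classes use them as keyword defaults. That works because Ruby evaluates a default-argument expression when the method is called, not when it is defined. A minimal sketch with hypothetical names (`DemoEval`, `DemoRecommender`, `DEFAULT_RATE`):

```ruby
module DemoEval
  class DemoRecommender
    attr_reader :min_rate

    # DEFAULT_RATE does not exist yet when this method is defined; the
    # default expression is only evaluated on each call to `new`.
    def initialize(min_rate: DEFAULT_RATE)
      @min_rate = min_rate
    end
  end

  # Defined after the class, mirroring the order in eval.rb.
  DEFAULT_RATE = 0.8
end

default_rate  = DemoEval::DemoRecommender.new.min_rate
explicit_rate = DemoEval::DemoRecommender.new(min_rate: 0.9).min_rate
```

The constant is found through the enclosing `DemoEval` module scope, which is why a single definition can serve every class nested under it.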