ruby_llm-contract 0.10.1 → 0.10.2 (RubyGems)

Files changed (22)

checksums.yaml +4 -4
data/CHANGELOG.md +15 -1
data/Gemfile.lock +2 -2
data/README.md +2 -2
data/docs/architecture.md +50 -0
data/docs/guide/best_practices.md +136 -0
data/docs/guide/eval_first.md +192 -0
data/docs/guide/getting_started.md +199 -0
data/docs/guide/migration.md +185 -0
data/docs/guide/multimodal_input.md +160 -0
data/docs/guide/optimizing_retry_policy.md +131 -0
data/docs/guide/output_schema.md +93 -0
data/docs/guide/pipeline.md +154 -0
data/docs/guide/prompt_ast.md +76 -0
data/docs/guide/rails_integration.md +218 -0
data/docs/guide/relation_to_agent.md +52 -0
data/docs/guide/relation_to_tribunal.md +135 -0
data/docs/guide/testing.md +282 -0
data/docs/guide/why.md +103 -0
data/lib/ruby_llm/contract/version.rb +1 -1
data/ruby_llm-contract.gemspec +1 -1
metadata +16 -1

data/docs/guide/testing.md ADDED

@@ -0,0 +1,282 @@
+# Testing
+> Read this when writing deterministic unit specs for contract steps. Skip if you only ever run live evals in CI and never stub LLM calls.
+CI that hits a real LLM on every commit is slow and costly: tests take minutes instead of milliseconds, every rerun spends money, and flakes on provider hiccups block merges. The Test adapter + `stub_step` make LLM-backed specs run like ordinary unit tests — deterministic, offline, free. Live evals stay as an opt-in CI stage for quality gating (see [Eval-First](eval_first.md)).
+This guide shows how to write deterministic specs and matchers for steps built on `ruby_llm-contract`. Examples use `SummarizeArticle` (the flagship step from the [README](../../README.md)).
+## Test adapter
+The Test adapter lets specs run deterministically with zero API calls. It accepts a String or Hash via `response:`, or an Array of sequential responses via `responses:`:
+```ruby
+# String JSON
+adapter = RubyLLM::Contract::Adapters::Test.new(
+  response: '{"tldr":"...","takeaways":["a","b","c"],"tone":"neutral"}'
+)
+# Hash — auto-converted to JSON
+adapter = RubyLLM::Contract::Adapters::Test.new(
+  response: { tldr: "...", takeaways: %w[a b c], tone: "neutral" }
+)
+# Multiple sequential responses (one per call)
+adapter = RubyLLM::Contract::Adapters::Test.new(
+  responses: [
+    { tldr: "...", takeaways: %w[a b c], tone: "neutral" },
+    { tldr: "...", takeaways: %w[x y z], tone: "analytical" }
+  ]
+)
+result = SummarizeArticle.run("article text", context: { adapter: adapter })
+result.ok?  # => true
+```
+Multi-step pipeline testing with per-step named responses (using `ArticleCardPipeline` from [Pipeline](pipeline.md)):
+```ruby
+result = ArticleCardPipeline.test("article text",
+  responses: {
+    summarize: { tldr: "...", takeaways: %w[a b c], tone: "analytical" },
+    tag:       { tldr: "...", takeaways: %w[a b c], tone: "analytical",
+                 hashtags: %w[#ruby #release] },
+    card:      { headline: "Ruby 3.4 ships", summary: "...",
+                 hashtags: %w[#ruby #release], sentiment_icon: "🧠" }
+  }
+)
+```
+## Output keys are always symbols
+Parsed output uses **symbol keys**, never strings:
+```ruby
+result.parsed_output[:tldr]     # => "..." ✓
+result.parsed_output["tldr"]    # => nil ✗
+```
+The gem warns if a `validate` or `verify` block returns `nil` — usually a sign of string-key access on symbol-keyed data.
+## RSpec setup
+In `spec_helper.rb`:
+```ruby
+require "ruby_llm/contract/rspec"
+```
+You get the `satisfy_contract` matcher, `pass_eval` matcher, and the `stub_step` helpers.
+## stub_step helpers
+`stub_step` returns a canned response for a single step; other steps run normally.
+```ruby
+RSpec.describe SummarizeArticle do
+  before do
+    stub_step(described_class,
+              response: { tldr: "...", takeaways: %w[a b c], tone: "neutral" })
+  end
+  it "satisfies its contract" do
+    result = described_class.run("article text")
+    expect(result).to satisfy_contract
+  end
+end
+```
+**Sequential responses:**
+```ruby
+stub_step(described_class, responses: [
+  { tldr: "...", takeaways: %w[a b c], tone: "neutral" },
+  { tldr: "...", takeaways: %w[x y z], tone: "analytical" }
+])
+```
+**Block form (auto-cleanup):**
+```ruby
+stub_step(SummarizeArticle, response: { tldr: "...", takeaways: %w[a b c], tone: "neutral" }) do
+  result = SummarizeArticle.run("article text")
+  # stub is active here
+end
+# stub gone — original adapter restored
+```
+**Multiple steps at once:**
+```ruby
+stub_steps(
+  SummarizeArticle => { response: { tldr: "...", takeaways: %w[a b c], tone: "neutral" } },
+  GenerateHashtags => { response: { tldr: "...", takeaways: %w[a b c],
+                                    tone: "neutral",
+                                    hashtags: %w[#ruby #release] } }
+) do
+  result = ArticleCardPipeline.run("article text")
+end
+```
+**Global stub for all steps:**
+```ruby
+stub_all_steps(response: { default: true })
+```
+In RSpec, non-block stubs auto-clean after each example. In Minitest, `teardown` restores the original adapter (via `MinitestHelpers`).
+## Minitest
+Require `ruby_llm/contract/minitest` in your `test_helper.rb`. You get the same `satisfy_contract` / `pass_eval` assertions and `stub_step` helper, adapted for Minitest syntax.
+## RSpec matchers
+```ruby
+RSpec.describe SummarizeArticle do
+  before do
+    stub_step(described_class,
+              response: { tldr: "...", takeaways: %w[a b c], tone: "neutral" })
+  end
+  it "satisfies its contract" do
+    result = described_class.run("article text")
+    expect(result).to satisfy_contract
+  end
+  it "rejects invalid output" do
+    stub_step(described_class, response: { tldr: "x" * 300, takeaways: %w[a b c], tone: "neutral" })
+    result = described_class.run("article text")
+    expect(result).not_to satisfy_contract  # TL;DR > 200 chars fails validate
+  end
+  it "passes its eval" do
+    expect(described_class).to pass_eval("smoke")
+  end
+end
+```
+`pass_eval` supports a matcher chain — full reference lives in [Getting Started](getting_started.md) under Evals and CI gates. Quick summary:
+- `.with_context(model: "gpt-4.1-mini")` — pick model / pass adapter
+- `.with_minimum_score(0.8)` — gate on average score
+- `.with_maximum_cost(0.01)` — gate on total cost
+- `.without_regressions` — block any previously-passing case that now fails (reads the baseline)
+- `.compared_with(SummarizeArticleV1)` — A/B against another step; implies regression check
+## Offline vs online eval
+Evals run in one of two modes depending on how they're defined and what context is passed:
+| Has `sample_response`? | Context has adapter/model? | Mode | API calls |
+|---|---|---|---|
+| Yes | No | **Offline** — uses `sample_response` as canned answer | Zero |
+| Yes | Yes | **Online** — ignores `sample_response`, calls real LLM | Real |
+| No | Yes | **Online** — calls real LLM | Real |
+| No | No | **Skipped** — returns `:skipped`, excluded from score | Zero |
+When a case defines `sample_response`, the default is offline. To force online, pass an adapter or model in context:
+```ruby
+# Online — real LLM call
+report = SummarizeArticle.run_eval("regression", context: { model: "gpt-4.1-nano" })
+# Offline — uses sample_response
+report = SummarizeArticle.run_eval("smoke")
+```
+`compare_with` intentionally ignores `sample_response` because canned data would make both sides look identical. Always pass `model:` or an adapter to A/B.
+## Inspecting failures
+`run_eval` returns a `Report`. Drill into per-case failures:
+```ruby
+report = SummarizeArticle.run_eval("regression")
+report.score       # => 0.5
+report.pass_rate   # => "1/2"
+report.total_cost  # => 0.003
+report.failures.each do |result|
+  puts result.name        # => "critical review"
+  puts result.mismatches  # => { tone: { expected: "negative", got: "analytical" } }
+  puts result.output      # full parsed output hash
+  puts result.details     # human-readable explanation
+end
+```
+`mismatches` is a hash of keys where expected and actual output diverge — pinpoints which field the model got wrong.
+## Soft observations
+Log suspicious-but-not-invalid output without failing the contract:
+```ruby
+class CompareArticles < RubyLLM::Contract::Step::Base
+  prompt "Score the article pair for relevance. Return JSON: {score_a: 1-10, score_b: 1-10}.\n\n{input}"
+  output_schema do
+    integer :score_a, minimum: 1, maximum: 10
+    integer :score_b, minimum: 1, maximum: 10
+  end
+  validate("scores in range") { |o, _| (1..10).cover?(o[:score_a]) && (1..10).cover?(o[:score_b]) }
+  observe("scores should differ") { |o, _| o[:score_a] != o[:score_b] }
+end
+adapter = RubyLLM::Contract::Adapters::Test.new(response: { score_a: 5, score_b: 5 })
+result  = CompareArticles.run("two identical-looking articles", context: { adapter: adapter })
+result.ok?           # => true (observe never fails the contract)
+result.observations  # => [{ description: "scores should differ", passed: false }]
+```
+`observe` runs only after validation passes. Failed observations are logged via `RubyLLM::Contract.logger` — useful for "I want to know this happened without blocking the response".
+> **Not the same as `Chat#on_end_message` / `on_tool_call`.** RubyLLM exposes anonymous, global callbacks attached to a chat instance — they receive the raw `Message` and have no inherent pass/fail concept. `Step.observe` is per-step, named with a description, runs against `parsed_output` (already JSON-parsed and schema-validated), and the pass/fail outcome is recorded in `result.observations`. The two can coexist: use Chat callbacks for transport/tool tracing, use `observe` for domain assertions you want captured in the step trace.
+## Asserting on `around_call`
+`around_call` fires **once per run** with the final result (after retry fallback) and exceptions propagate. That makes it straightforward to test:
+```ruby
+class LoggedSummarize < RubyLLM::Contract::Step::Base
+  prompt "Summarize: {input}"
+  output_schema { string :tldr }
+  around_call do |_step, input, result|
+    CallLog.record(model: result.trace.model, cost: result.trace.cost, input_size: input.length)
+  end
+end
+RSpec.describe LoggedSummarize do
+  it "logs once per run, with final model + total cost" do
+    adapter = RubyLLM::Contract::Adapters::Test.new(response: { tldr: "ok" })
+    expect(CallLog).to receive(:record).once.with(hash_including(:model, :cost, :input_size))
+    LoggedSummarize.run("article text", context: { adapter: adapter })
+  end
+end
+```
+The callback receives `(step, input, result)` — the same `Result` the caller sees. Not invoked per-attempt inside a `retry_policy` chain; if you need per-attempt visibility, read `result.trace[:attempts]` inside the block.
+## Baseline file format
+Baselines are JSON files in `.eval_baselines/` — commit them to git:
+```
+.eval_baselines/
+  SummarizeArticle/
+    regression_gpt-4_1-nano.json
+    regression_gpt-4_1-mini.json
+```
+Each file contains dataset name, step name, score, and per-case results. No timestamps — re-saving an identical baseline produces no git diff. Baseline semantics (what counts as a regression, how `compare_with_baseline` works) are covered in [Getting Started](getting_started.md#evals-and-ci-gates) and [Eval-First](eval_first.md).
+## See also
+- [Getting Started](getting_started.md) — `pass_eval` matcher chain, threshold gating, Rake task, baseline regressions.
+- [Eval-First](eval_first.md) — `compare_with` prompt A/B workflow.
+- [Pipeline](pipeline.md) — pipeline-level testing with named step responses.
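
A note on the "Output keys are always symbols" section of `testing.md` above: the pitfall can be reproduced with the standard library alone. The sketch below assumes the parsed output behaves like `JSON.parse(..., symbolize_names: true)`. That is an assumption made for illustration, not a statement about how the gem parses responses, and the two lambdas are hypothetical stand-ins for `validate` blocks.

```ruby
require "json"

# Assumption: parsed output is equivalent to JSON parsed with symbolized keys.
RAW_RESPONSE = '{"tldr":"short summary","takeaways":["a","b","c"],"tone":"neutral"}'
PARSED = JSON.parse(RAW_RESPONSE, symbolize_names: true)

# Hypothetical stand-ins for validate blocks (not the gem's API).
# String-key access finds nothing, so the check evaluates to nil, not true/false.
STRING_KEY_CHECK = ->(o) { o["tldr"] && o["tldr"].length <= 200 }
SYMBOL_KEY_CHECK = ->(o) { o[:tldr].length <= 200 }

puts PARSED[:tldr].inspect                 # "short summary"
puts PARSED["tldr"].inspect                # nil
puts STRING_KEY_CHECK.call(PARSED).inspect # nil
puts SYMBOL_KEY_CHECK.call(PARSED).inspect # true
```

A check that evaluates to `nil` is neither a pass nor an explicit failure, which is presumably why the guide describes a warning for `nil`-returning blocks.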

data/docs/guide/why.md ADDED

@@ -0,0 +1,103 @@
+# Why contracts?
+> Read this if you're not sure whether `ruby_llm-contract` solves a problem you actually have. It's the fastest way to recognise the production failure modes the gem exists for.
+LLMs return JSON that *looks* correct — valid shape, right types, right fields — while being silently wrong in ways that hurt users, burn budget, or break downstream code. Schema validation alone does not catch these. Contracts layer on business rules, retries, evals, and cost caps so the wrong output is caught at the boundary of your system instead of shipping to production.
+Below are four failure modes teams actually hit. If one looks familiar, the gem is probably worth 30 minutes of your time.
+## Failure 1 — Schema-valid, logically wrong
+A `SummarizeArticle` step produces `{ tldr: "...", takeaways: [...], tone: "analytical" }`. Schema passes. The TL;DR is 520 characters long and overflows the UI card. Or the takeaways are all variations of the same sentence. Or the article was a service-outage complaint and `tone` came back `"analytical"` instead of `"negative"` — so customer success's "critical feedback" filter never sees it.
+JSON schema enforces **shape**. It cannot enforce *length fits the card*, *takeaways are distinct*, or *tone matches content*. Those are business rules, and without them you find out from a Slack thread or a support ticket.
+```ruby
+validate("TL;DR fits the card")  { |o, _| o[:tldr].length <= 200 }
+validate("takeaways are unique") { |o, _| o[:takeaways] == o[:takeaways].uniq }
+validate("negative tone requires concrete risk") do |o, _|
+  next true unless o[:tone] == "negative"
+  o[:takeaways].any? { |t| t.match?(/fail|break|crash|outage|risk/i) }
+end
+```
+Wrong output never reaches `Article.update!` — the contract refuses before it persists.
+## Failure 2 — Silent prompt regression
+`SummarizeArticle` ships and works. Two weeks later, someone tweaks the system prompt to emphasise negative sentiment because customer success complained about missed complaints. The tweak fixes that case and silently breaks three neutral product-update articles that now get labelled `"negative"`. Nobody knows for a week.
+Without evals, *every prompt change is a blind deploy.* Contracts invert this:
+```ruby
+SummarizeArticle.define_eval("regression") do
+  add_case "outage complaint", input: "...", expected: { tone: "negative" }
+  add_case "neutral product update", input: "...", expected: { tone: "neutral" }
+end
+# In CI — blocks merge when a prompt tweak regresses any previously-passing case
+expect(SummarizeArticle).to pass_eval("regression").without_regressions
+```
+The "tweak helped one case, broke three" scenario is caught at PR review. No Slack-thread surprises.
+## Failure 3 — Sampling variance on fixed-temperature models
+OpenAI's gpt-5 and o-series run with `temperature=1.0` server-side — you cannot lower it. That means the same prompt on the same model can produce different answers between calls. An outage complaint classified `tone: "negative"` on Monday may come back `tone: "positive"` on Tuesday, with no code change in between. Schema passes both. Your customer-success filter silently misroutes the Tuesday case.
+A `validate` block that cross-checks fields against each other turns a one-in-N flaky output into a deterministic retry:
+```ruby
+validate("tone matches severity keywords") do |o, _|
+  severity = /fail|crash|outage|broken|bug|error/i
+  flagged = o[:takeaways].any? { |t| t.match?(severity) }
+  next true unless flagged
+  %w[negative analytical].include?(o[:tone])
+end
+retry_policy models: %w[gpt-5-nano gpt-5-mini gpt-5]
+```
+Nano misclassifies the tone on the first attempt → contract rejects → mini gets the call and returns a different sample. Variance absorbed; the user never sees the flaky run. Your logs show the retry rate and the cost delta.
+**See it in 30 seconds:** `ruby examples/01_fallback_showcase.rb` — zero API keys required. The Test adapter simulates a tone/takeaways mismatch on the first attempt and a consistent sample on the retry, then prints the per-attempt trace.
+`retry_policy` has three other shapes beyond cross-model escalation — same-model `attempts: 3` (absorbs sampling variance without paying for a stronger tier), `reasoning_effort` escalation (low → medium → high on one model), and cross-provider fallback (Ollama → Anthropic → OpenAI — local first because it costs nothing, hosted last because it is the most accurate). `examples/06_retry_variants.rb` runs all three through the Test adapter with the trace printed.
+## Failure 4 — Runaway cost and no fallback policy
+Someone pastes a 40-page PDF into the endpoint that calls `SummarizeArticle`. The prompt expands to 80k tokens. Your provider bill jumps. Meanwhile, a separate team uses `gpt-4.1` for every single call because "quality matters" — even though 80% of their traffic is trivially handled by `gpt-4.1-nano` at 1/30th the cost. Neither situation is visible until the invoice arrives.
+Contracts make cost a first-class concern:
+```ruby
+max_input  2_000   # refuses before calling the API if tokens exceed budget
+max_output 4_000
+max_cost   0.01    # refuses if estimated cost exceeds cap
+retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]  # cheap first, escalate only on failure
+```
+The 40-page PDF returns `status: :limit_exceeded` — zero tokens spent. The 80/20 traffic pattern resolves on nano; only the hard 20% escalate. [`optimize_retry_policy`](optimizing_retry_policy.md) tells you empirically which fallback list is cheapest for *your* evals.
+## Also catches
+- **Leaked prompt placeholders** — the model echoes `{article}` or `{audience}` into the output because the template string wasn't interpolated. A `validate` string-equality check stops it before a user sees it.
+- **Lazy models echoing input verbatim** — cheap model returns the article text as the "summary". 2-arity `validate("tldr shorter than input")` catches it.
+- **Tone mislabel breaking downstream routing** — content says "negative", model labels it "analytical", a customer success filter misses it. Cross-validate catches the label/content drift.
+## Failure → contract mechanism
+| Failure in production | Contract mechanism |
+|---|---|
+| Schema-valid but logically wrong output | `validate(...) { |o, i| ... }` with 2-arity for cross-checks |
+| Silent prompt regression after a tweak | `define_eval` + `pass_eval(...).without_regressions` in CI |
+| Sampling variance on fixed-temperature models (gpt-5 / o-series) | Cross-field `validate(...)` + `retry_policy models: [...]` |
+| Runaway cost on pathological inputs | `max_input`, `max_output`, `max_cost` preflight |
+| 80/20 traffic paying the premium model rate | `retry_policy` + `optimize_retry_policy` |
+| Leaked placeholder / input echo / tone drift | `validate` with content and cross-input checks |
+## What next
+- **If one of the failures above looks familiar** → [Getting Started](getting_started.md) walks through every feature in order on the same `SummarizeArticle` step.
+- **If you're adopting in an existing Rails app** → [Migration](migration.md) shows Before/After for replacing a raw `LlmClient.new.call` service.
+- **If you already ship LLM features and want to make them regression-safe** → [Eval-First](eval_first.md) is the workflow that prevents Failure 2 from ever happening again.
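
The cheap-first escalation described under Failure 3 and Failure 4 of `why.md` can be illustrated in plain Ruby. This is a minimal sketch under stated assumptions, not the gem's implementation: `run_with_fallback`, `Attempt`, `CHECKS`, and `CANNED` are hypothetical names, and the block passed to `run_with_fallback` stands in for a real model call by returning canned hashes.

```ruby
# Hypothetical sketch of "try the cheapest model, escalate only when a check fails".
Attempt = Struct.new(:model, :output, :failed_checks)

def run_with_fallback(models, checks)
  attempts = []
  models.each do |model|
    output = yield(model) # stand-in for a real LLM call
    failed = checks.reject { |_name, check| check.call(output) }.keys
    attempts << Attempt.new(model, output, failed)
    return { status: :ok, model: model, output: output, attempts: attempts } if failed.empty?
  end
  { status: :validation_failed, model: nil, output: nil, attempts: attempts }
end

SEVERITY = /fail|crash|outage|broken|bug|error/i

# Same cross-field rule as the validate block quoted in the guide.
CHECKS = {
  "tone matches severity keywords" => lambda do |o|
    flagged = o[:takeaways].any? { |t| t.match?(SEVERITY) }
    !flagged || %w[negative analytical].include?(o[:tone])
  end
}.freeze

# Canned outputs: the first model mislabels the tone, the second is consistent.
CANNED = {
  "gpt-5-nano" => { takeaways: ["Checkout outage lasted 3 hours"], tone: "positive" },
  "gpt-5-mini" => { takeaways: ["Checkout outage lasted 3 hours"], tone: "negative" }
}.freeze

FALLBACK_RESULT = run_with_fallback(%w[gpt-5-nano gpt-5-mini gpt-5], CHECKS) do |model|
  CANNED.fetch(model)
end

puts FALLBACK_RESULT[:status]        # ok
puts FALLBACK_RESULT[:model]         # gpt-5-mini
puts FALLBACK_RESULT[:attempts].size # 2
```

The third model is never called because the second attempt passes, which is the cost argument the guide makes for ordering the list from cheapest to most capable.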

data/lib/ruby_llm/contract/version.rb CHANGED

@@ -2,6 +2,6 @@
 module RubyLLM
   module Contract
-    VERSION = "0.10.1"
+    VERSION = "0.10.2"
   end
 end

data/ruby_llm-contract.gemspec CHANGED

@@ -28,7 +28,7 @@ Gem::Specification.new do |spec|
     excluded_files = %w[TODO.md .rspec .rubycritic.yml .simplecov]
     `git ls-files -z`.split("\x0").reject do |f|
       (File.expand_path(f) == __FILE__) ||
-        f.start_with?("spec/", "docs/", "doc/", ".ai/", ".claude/", ".git", ".revive/") ||
+        f.start_with?("spec/", "docs/ideas/", "doc/", ".ai/", ".claude/", ".git", ".revive/") ||
         excluded_files.include?(f)
     end
   end
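
The one-line gemspec change above is what makes the `docs/guide/*.md` files appear in the `metadata` file list: narrowing the excluded prefix from `docs/` to `docs/ideas/` stops the filter from rejecting the guide files. The sketch below isolates only the `start_with?` part of the filter. The `packaged?` helper and the path `docs/ideas/draft.md` are hypothetical, and the real gemspec additionally rejects the gemspec file itself and the entries in `excluded_files`.

```ruby
# Prefix lists copied from the old and new gemspec lines.
OLD_PREFIXES = %w[spec/ docs/ doc/ .ai/ .claude/ .git .revive/].freeze
NEW_PREFIXES = %w[spec/ docs/ideas/ doc/ .ai/ .claude/ .git .revive/].freeze

# Hypothetical helper: true when no excluded prefix matches the path.
def packaged?(path, excluded_prefixes)
  !path.start_with?(*excluded_prefixes)
end

# docs/ideas/draft.md is an illustrative path, not one taken from the diff.
SAMPLE_PATHS = %w[
  docs/guide/testing.md
  docs/ideas/draft.md
  spec/step_spec.rb
  lib/ruby_llm/contract/version.rb
].freeze

SAMPLE_PATHS.each do |path|
  puts format("%-34s 0.10.1: %-5s 0.10.2: %s",
              path, packaged?(path, OLD_PREFIXES), packaged?(path, NEW_PREFIXES))
end
```

Only `docs/guide/testing.md` changes from excluded to packaged; the `doc/` prefix does not match `docs/...` paths because the fourth character differs.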

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: ruby_llm-contract
 version: !ruby/object:Gem::Version
-  version: 0.10.1
+  version: 0.10.2
 platform: ruby
 authors:
 - Justyna
@@ -66,6 +66,21 @@ files:
 - LICENSE
 - README.md
 - Rakefile
+- docs/architecture.md
+- docs/guide/best_practices.md
+- docs/guide/eval_first.md
+- docs/guide/getting_started.md
+- docs/guide/migration.md
+- docs/guide/multimodal_input.md
+- docs/guide/optimizing_retry_policy.md
+- docs/guide/output_schema.md
+- docs/guide/pipeline.md
+- docs/guide/prompt_ast.md
+- docs/guide/rails_integration.md
+- docs/guide/relation_to_agent.md
+- docs/guide/relation_to_tribunal.md
+- docs/guide/testing.md
+- docs/guide/why.md
 - examples/00_basics.rb
 - examples/01_fallback_showcase.rb
 - examples/02_real_llm_minimal.rb