ruby_llm-contract 0.10.1 → 0.10.2

@@ -0,0 +1,154 @@
1
+ # Pipeline
2
+
3
+ > Read this when one step isn't enough — you need multi-step with fail-fast, automatic data threading, and per-step models.
4
+
5
+ Chain multiple steps with automatic data threading, fail-fast, per-step models, trace, and timeout.
6
+
7
+ A pipeline needs more than one step to be interesting. This guide grows the `SummarizeArticle` step from the [README](../../README.md) into a three-step content pipeline that tags and routes the summary to a UI card.
8
+
9
+ ## Full example: article → summary → hashtags → card
10
+
11
+ ```ruby
+ # Step 1: the flagship step from README, unchanged.
+ class SummarizeArticle < RubyLLM::Contract::Step::Base
+   prompt <<~PROMPT
+     Summarize this article for a UI card. Return a short TL;DR,
+     3 to 5 key takeaways, and a tone label.
+
+     {input}
+   PROMPT
+
+   output_schema do
+     string :tldr
+     array :takeaways, of: :string, min_items: 3, max_items: 5
+     string :tone, enum: %w[neutral positive negative analytical]
+   end
+
+   validate("TL;DR fits the card") { |o, _| o[:tldr].length <= 200 }
+ end
+
+ # Step 2: reads SummarizeArticle's output, produces hashtags suitable for social posts.
+ class GenerateHashtags < RubyLLM::Contract::Step::Base
+   input_type Hash
+
+   output_schema do
+     # Carry through the summary fields downstream consumers (and the next step) need.
+     string :tldr
+     array :takeaways, of: :string
+     string :tone, enum: %w[neutral positive negative analytical]
+     # Add new field.
+     array :hashtags, of: :string, min_items: 2, max_items: 5
+   end
+
+   prompt do
+     rule "Preserve tldr / takeaways / tone exactly as given."
+     user "Article summary: {tldr}\nTone: {tone}\nGenerate 2 to 5 concise hashtags."
+   end
+
+   validate("tone preserved") { |o, input| o[:tone] == input[:tone] }
+ end
+
+ # Step 3: final shape the UI card consumes.
+ class BuildArticleCard < RubyLLM::Contract::Step::Base
+   input_type Hash
+
+   output_schema do
+     string :headline
+     string :summary
+     array :hashtags, of: :string
+     string :sentiment_icon, enum: %w[😐 🙂 ⚠️ 🧠]
+   end
+
+   prompt do
+     rule "Headline <= 70 chars. Summary is the incoming tldr reprinted verbatim."
+     rule "Pick sentiment_icon from: 😐 neutral, 🙂 positive, ⚠️ negative, 🧠 analytical."
+     user "TL;DR: {tldr}\nTone: {tone}\nHashtags: {hashtags}"
+   end
+
+   validate("summary is the tldr verbatim") { |o, input| o[:summary] == input[:tldr] }
+ end
+
+ # Pipeline: summarize → hashtags → card
+ class ArticleCardPipeline < RubyLLM::Contract::Pipeline::Base
+   step SummarizeArticle, as: :summarize
+   step GenerateHashtags, as: :tag
+   step BuildArticleCard, as: :card
+ end
77
+ ```
78
+
79
+ ## Running and inspecting
80
+
81
+ ```ruby
+ result = ArticleCardPipeline.run(article_text, context: { adapter: adapter })
+ result.ok?                          # => true
+ result.outputs_by_step[:summarize]  # => { tldr: "...", takeaways: [...], tone: "analytical" }
+ result.outputs_by_step[:card]       # => { headline: "...", summary: "...", ... }
+ result.trace.total_cost             # => 0.000128 (all steps combined)
+ result.trace.total_latency_ms       # => 2340
88
+ ```
89
+
90
+ ## Fail-fast behavior
91
+
92
+ When a step fails its preflight check, or its schema or `validate` rejects the output, the pipeline stops there and downstream steps never run:
93
+
94
+ ```ruby
+ # Summarize returns a TL;DR over 200 chars → the "TL;DR fits the card" validate fails
+ adapter = RubyLLM::Contract::Adapters::Test.new(response: {
+   tldr: "x" * 500,
+   takeaways: %w[one two three],
+   tone: "neutral"
+ })
+
+ result = ArticleCardPipeline.run("article text", context: { adapter: adapter })
+ result.failed?      # => true
+ result.failed_step  # => :summarize (validate rejected; retries exhausted)
+ # tag and card never run, so no downstream tokens are spent on garbage
106
+ ```
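
The fail-fast and data-threading behavior described above can be pictured with a small plain-Ruby sketch. It does not use the gem and is not its implementation: `StepResult` and `run_pipeline` are hypothetical names, and each "step" is just a lambda that returns a result object.

```ruby
# Plain-Ruby sketch of fail-fast threading. This is not the gem's implementation;
# StepResult and run_pipeline are hypothetical names.
StepResult = Struct.new(:ok, :output, keyword_init: true)

# steps: Hash of alias => callable returning a StepResult.
def run_pipeline(steps, input)
  outputs = {}
  current = input
  steps.each do |name, callable|
    result = callable.call(current)
    # Stop at the first rejected step; later steps are never invoked.
    return { failed_step: name, outputs_by_step: outputs } unless result.ok

    outputs[name] = result.output
    current = result.output # thread this output into the next step
  end
  { failed_step: nil, outputs_by_step: outputs }
end

steps = {
  summarize: ->(text) { StepResult.new(ok: text.length <= 200, output: { tldr: text }) },
  tag: ->(summary) { StepResult.new(ok: true, output: summary.merge(hashtags: %w[#ruby #llm])) }
}

run_pipeline(steps, "short summary")[:failed_step] # nil: both steps ran
run_pipeline(steps, "x" * 500)[:failed_step]       # :summarize, and tag is never called
```

The early `return` is the whole mechanism: a rejected step ends the run before the next callable is touched, which is why no downstream tokens are spent.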
107
+
108
+ ## Per-step model override
109
+
110
+ ```ruby
+ class ArticleCardPipeline < RubyLLM::Contract::Pipeline::Base
+   step SummarizeArticle, as: :summarize, model: "gpt-4.1-mini"
+   step GenerateHashtags, as: :tag, model: "gpt-4.1-nano"
+   step BuildArticleCard, as: :card, model: "gpt-4.1-nano"
+ end
116
+ ```
117
+
118
+ ## Timeout
119
+
120
+ ```ruby
121
+ result = ArticleCardPipeline.run(article_text, timeout_ms: 30_000)
122
+ ```
123
+
124
+ ## Pipeline eval
125
+
126
+ ```ruby
+ ArticleCardPipeline.define_eval("e2e") do
+   add_case "ruby 3.4 release",
+            input: "Ruby 3.4 ships with frozen string literals by default and better YJIT...",
+            expected: { sentiment_icon: "🧠" }
+ end
+
+ report = ArticleCardPipeline.run_eval("e2e", context: { model: "gpt-4.1-mini" })
+ report.print_summary
135
+ ```
136
+
137
+ ## Pretty print
138
+
139
+ ```ruby
140
+ puts result
141
+ # Pipeline: ok 3 steps 1234ms 450+120 tokens trace=abc12345
142
+
143
+ result.pretty_print
144
+ # Full ASCII table with per-step outputs (Pipeline::Result)
145
+
146
+ # For eval reports, use print_summary instead:
147
+ report.print_summary
148
+ # Tabular pass/fail breakdown (Eval::Report)
149
+ ```
150
+
151
+ ## See also
152
+
153
+ - [Testing](testing.md) — `ArticleCardPipeline.test(..., responses: { summarize: ..., tag: ..., card: ... })` for pipeline-level spec adapters.
154
+ - [Optimizing retry_policy](optimizing_retry_policy.md) — `optimize_retry_policy` runs per-step; pipelines benchmark one step at a time.
@@ -0,0 +1,76 @@
1
+ # Prompt AST
2
+
3
+ > Read this when your prompt has more than one shape (per-tenant, per-language, per-audience) and string concatenation is starting to drift.
4
+
5
+ Prompts are structured data, not strings. That matters the moment `SummarizeArticle` has to ship in more than one shape — a different audience per tenant, a different language per region, a different tone template for a B2B vs consumer card. Building those variants by string-concatenating a monolithic prompt leads to silent drift across environments. The AST gives you typed nodes (`system`, `rule`, `section`, `example`, `user`) that compose, diff, and snapshot-test cleanly.
6
+
7
+ Available node types:
8
+
9
+ ```ruby
+ prompt do
+   system "You summarize articles for a UI card."       # system message
+   rule "Return valid JSON only."                       # appended as separate system message
+   section "AUDIENCE", "Rails developers"               # labeled system message: [AUDIENCE]\n...
+   example input: "Ruby 3.4 ships frozen strings...",   # user/assistant few-shot pair
+           output: '{"tldr":"...","takeaways":[...],"tone":"analytical"}'
+   user "{input}"                                       # user message with interpolation
+ end
18
+ ```
19
+
20
+ Or just a plain string (wraps as a single user message):
21
+
22
+ ```ruby
23
+ prompt "Summarize this article for a UI card. {input}"
24
+ ```
25
+
26
+ The AST is immutable, diffable, and hashable. Useful for snapshot testing and auditing prompt changes.
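
To make "immutable, diffable, and hashable" concrete, here is a minimal plain-Ruby sketch of value-style prompt nodes. The `Node` struct and the `build_prompt` / `prompt_diff` helpers are hypothetical illustrations; the gem's real node classes may look quite different.

```ruby
# Hypothetical node type for illustration; the gem's real node classes may differ.
Node = Struct.new(:type, :content) do
  def initialize(type, content)
    super(type, content.dup.freeze)
    freeze # no setters can be used after construction
  end
end

# Builds a frozen list of nodes from [type, content] pairs.
def build_prompt(*pairs)
  pairs.map { |type, content| Node.new(type, content) }.freeze
end

# Nodes present in one prompt version but not the other.
def prompt_diff(old_nodes, new_nodes)
  { removed: old_nodes - new_nodes, added: new_nodes - old_nodes }
end

v1 = build_prompt([:system, "You summarize articles."], [:rule, "Return valid JSON only."])
v2 = build_prompt([:system, "You summarize articles."], [:rule, "Return JSON with a tldr key."])

prompt_diff(v1, v2) # one removed rule node, one added rule node
```

Because `Struct` compares and hashes by member values, two prompts built from the same pairs are equal and hash the same within a process, which is the property a snapshot test would rely on.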
27
+
28
+ ## Hash inputs with variable interpolation
29
+
30
+ When input is a Hash, each key becomes a template variable. Concrete scenario: a multi-language newsletter product where the same article has to be summarised in Polish for EU subscribers, English for US, with different audiences per tier (Rails developers vs engineering managers). Hash inputs let one step cover all of these without forking the class:
31
+
32
+ ```ruby
+ class SummarizeArticle < RubyLLM::Contract::Step::Base
+   input_type RubyLLM::Contract::Types::Hash.schema(
+     article: RubyLLM::Contract::Types::String,
+     audience: RubyLLM::Contract::Types::String,
+     language: RubyLLM::Contract::Types::String
+   )
+
+   prompt do
+     system "You summarize articles for a UI card."
+     rule "Write the TL;DR and takeaways in {language}."
+     section "AUDIENCE", "{audience}"
+     user "{article}"
+   end
+
+   output_schema do
+     string :tldr
+     array :takeaways, of: :string, min_items: 3, max_items: 5
+     string :tone, enum: %w[neutral positive negative analytical]
+   end
+ end
53
+ ```
54
+
55
+ Every `{key}` in a prompt node is pulled from the input hash at run time. Missing keys raise — making wire-up bugs loud, not silent.
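
The `{key}` substitution rule can be sketched in a few lines of plain Ruby. This is an illustration of the behavior described above, not the gem's code; the helper name `interpolate` and the error message are assumptions.

```ruby
# Illustrative helper; the name `interpolate` is an assumption, not the gem's API.
def interpolate(template, input)
  template.gsub(/\{(\w+)\}/) do
    key = Regexp.last_match(1).to_sym
    # fetch raises instead of silently substituting an empty string
    input.fetch(key) { raise KeyError, "missing template variable: #{key}" }.to_s
  end
end

interpolate("Write the TL;DR in {language}.", { language: "Polish" })
# => "Write the TL;DR in Polish."

begin
  interpolate("Audience: {audience}", { language: "Polish" })
rescue KeyError => e
  warn e.message # missing template variable: audience
end
```

Raising on a missing key is the design choice worth copying: a typo in a hash key fails the run immediately rather than sending a prompt with a hole in it.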
56
+
57
+ > **Not the same as `RubyLLM::Agent.inputs`.** `Step.input_type` is a *runtime type check* on the positional argument passed to `run(input)` — it raises `TypeError` if the input violates the declared shape. `RubyLLM::Agent.inputs` is a list of *named template locals* injected into ERB instructions. They solve different problems (validation vs templating) and can coexist on the same project.
58
+
59
+ > **Not the same as `RubyLLM::Agent` ERB templates either.** The `prompt do ... end` DSL above builds a node AST (`system`, `rule`, `section`, `example`, `user`) that renders to a multi-role message list (system, user, and assistant messages). `Agent.instructions :name` loads a single-string ERB file from `app/prompts/<agent_path>/<name>.txt.erb` for the system prompt only. Different output shape, different scope. Variable interpolation here uses `{key}` substitution; ERB uses full `<%= ruby %>`.
60
+
61
+ ## Cross-validating output against input
62
+
63
+ Validate blocks support 2-arity `|output, input|` so you can check that the model's answer stays faithful to the request:
64
+
65
+ ```ruby
+ validate("tldr is not just the article reprinted") do |output, input|
+   # Guard against lazy models that return the input verbatim.
+   output[:tldr].length < input[:article].length / 2
+ end
+
+ validate("no takeaway repeats the TL;DR") do |output, _input|
+   output[:takeaways].none? { |t| t == output[:tldr] }
+ end
74
+ ```
75
+
76
+ The first example uses `input`; the second ignores it. Both are legal 2-arity blocks; the leading underscore in `_input` is the Ruby convention for a parameter that is intentionally unused.
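
For readers curious how a runner could support both one-parameter and two-parameter checks, here is a plain-Ruby sketch that dispatches on `Proc#arity`. The `run_validations` helper is hypothetical and only illustrates the idea; it is not how the gem is known to be implemented.

```ruby
# Sketch of arity-based dispatch; `run_validations` is a hypothetical name.
# validations: Hash of description => block. Returns descriptions of failed checks.
def run_validations(validations, output, input)
  failed = validations.reject do |_description, check|
    # One-parameter blocks only see the output; two-parameter blocks also get the input.
    check.arity == 1 ? check.call(output) : check.call(output, input)
  end
  failed.keys
end

checks = {
  "TL;DR fits the card" => proc { |o| o[:tldr].length <= 200 },
  "tldr is not just the article reprinted" => proc { |o, i| o[:tldr].length < i[:article].length / 2 }
}

run_validations(checks, { tldr: "a" * 60 }, { article: "b" * 100 })
# => ["tldr is not just the article reprinted"]
```

An unused second parameter written as `_` or `_input` still counts toward arity, so such a block is called with both arguments.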
@@ -0,0 +1,218 @@
1
+ # Rails integration
2
+
3
+ > Read this when you've seen the `SummarizeArticle` example and want to know where contract steps fit in an actual Rails app — directory, initializer, jobs, logging, tests, CI. Skip if you're writing a non-Rails script.
4
+
5
+ Seven pre-emptive answers to the questions that come up first.
6
+
7
+ ## 1. Where do step classes live?
8
+
9
+ **Recommended: `app/contracts/`.** The gem's Railtie auto-reloads eval files under `app/contracts/eval/` and `app/steps/eval/` in development, so picking `app/contracts/` aligns with the default reload paths.
10
+
11
+ ```
12
+ app/contracts/summarize_article.rb # class SummarizeArticle
13
+ ```
14
+
15
+ Any autoloaded directory works (`app/llm_steps/`, `app/services/llm/`, etc.) — Rails 7/8 autoloading resolves them all, and the step class itself does not depend on the path. Pick the default if you have no stronger convention.
16
+
17
+ Keep evals in the same file as the step (`define_eval` block at the bottom of the class) — one source of truth per contract. If your evals grow too large for the class file, move them to `app/contracts/eval/summarize_article_eval.rb` — the Railtie reloads that directory explicitly in development.
18
+
19
+ ## 2. Initializer configuration
20
+
21
+ ```ruby
+ # config/initializers/ruby_llm_contract.rb
+ RubyLLM.configure do |c|
+   c.openai_api_key = ENV.fetch("OPENAI_API_KEY", nil)
+   c.anthropic_api_key = ENV.fetch("ANTHROPIC_API_KEY", nil)
+ end
+
+ RubyLLM::Contract.configure do |c|
+   c.default_model = Rails.env.production? ? "gpt-5-mini" : "gpt-5-nano"
+   c.default_adapter = RubyLLM::Contract::Adapters::RubyLLM.new
+ end
32
+ ```
33
+
34
+ In specs, override the default adapter to `Adapters::Test` in `spec_helper.rb` (or use `stub_step` per-example — see §5).
35
+
36
+ Evals defined inside a step class (the recommended pattern) are picked up as soon as Rails autoloads the class — you do not need the eval file in any special directory. If you move evals into separate files under `app/contracts/eval/` or `app/steps/eval/`, the gem's Railtie reloads those two directories explicitly on each request in development; other directories follow standard Rails autoloading rules.
37
+
38
+ ## 3. Background jobs — never call LLMs inline in a controller
39
+
40
+ LLM calls take 0.8–5 seconds and can fail. Wrap every step invocation in an ActiveJob:
41
+
42
+ ```ruby
+ class SummarizeArticleJob < ApplicationJob
+   queue_as :llm
+
+   def perform(article_id)
+     article = Article.find(article_id)
+     result = SummarizeArticle.run(article.body)
+
+     if result.ok?
+       # parsed_output uses symbol keys in memory. jsonb/json columns round-trip
+       # as strings on reload, so either use deep_stringify_keys before write or
+       # access downstream with string keys; pick one convention and stick to it.
+       article.update!(summary: result.parsed_output.deep_stringify_keys)
+     else
+       article.update!(summary_error: result.validation_errors.join("; "))
+     end
+   end
+ end
60
+ ```
61
+
62
+ `SummarizeArticleJob.perform_later(article.id)` returns in milliseconds; the controller stays responsive. If you use Sidekiq, pair `queue_as :llm` with a dedicated concurrency cap in `sidekiq.yml` so long-running LLM calls do not starve other job queues (mailers, webhooks, cleanups).
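
The symbol-versus-string key caveat in the job's comment is easy to reproduce without a database. The sketch below uses only Ruby's `json` standard library to imitate what a json/jsonb column does between write and reload; the helper name `json_column_round_trip` is made up for this example.

```ruby
require "json"

# Simulates what a json/jsonb column does between write and reload.
def json_column_round_trip(value)
  JSON.parse(JSON.generate(value))
end

parsed_output = { tldr: "Ruby 3.4 ships", takeaways: %w[a b c], tone: "analytical" }
reloaded = json_column_round_trip(parsed_output)

reloaded["tldr"] # "Ruby 3.4 ships"
reloaded[:tldr]  # nil, because the symbol key did not survive the round trip
```

Stringifying keys before the write means the in-memory hash and the reloaded record are read with the same keys, so code does not behave differently before and after a `reload`.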
63
+
64
+ ## 4. Logging and observability
65
+
66
+ `around_call` runs once per `run()` with the final `Result` (after all retries). Use it to write one row per LLM call:
67
+
68
+ ```ruby
+ class SummarizeArticle < RubyLLM::Contract::Step::Base
+   # ... prompt, schema, validates ...
+
+   around_call do |step, input, result|
+     AiCallLog.create!(
+       step: step.name,
+       model: result.trace[:model],
+       status: result.status.to_s,
+       latency_ms: result.trace[:latency_ms],
+       input_tokens: result.trace[:usage]&.dig(:input_tokens),
+       output_tokens: result.trace[:usage]&.dig(:output_tokens),
+       cost: result.trace[:cost],
+       validation_errors: result.validation_errors
+     )
+   end
+ end
85
+ ```
86
+
87
+ The `AiCallLog` model assumed above is a thin audit record. One possible migration:
88
+
89
+ ```ruby
+ # rails g model AiCallLog step:string model:string status:string ...
+ create_table :ai_call_logs do |t|
+   t.string :step, null: false
+   t.string :model
+   t.string :status, null: false
+   t.integer :latency_ms
+   t.integer :input_tokens
+   t.integer :output_tokens
+   t.decimal :cost, precision: 10, scale: 6
+   t.jsonb :validation_errors, default: []
+   t.timestamps
+ end
+ add_index :ai_call_logs, :step
+ add_index :ai_call_logs, :status
104
+ ```
105
+
106
+ For Appsignal / Honeybadger / Datadog, emit an `ActiveSupport::Notifications` event from inside the same `around_call` and subscribe in an initializer:
107
+
108
+ ```ruby
+ class SummarizeArticle < RubyLLM::Contract::Step::Base
+   # ... prompt, schema, validates ...
+
+   around_call do |step, _input, result|
+     ActiveSupport::Notifications.instrument(
+       "ruby_llm_contract.run",
+       step: step.name, model: result.trace[:model], status: result.status
+     )
+   end
+ end
+
+ # config/initializers/observability.rb
+ ActiveSupport::Notifications.subscribe("ruby_llm_contract.run") do |*, payload|
+   Appsignal.increment_counter("llm.run.#{payload[:status]}", 1, step: payload[:step])
+ end
124
+ ```
125
+
126
+ Trace inspection in an admin UI: `result.trace[:attempts]` gives you per-attempt model, status, cost, latency — render it in a partial to debug production failures without re-running.
127
+
128
+ ## 5. Testing — RSpec and Minitest
129
+
130
+ Add to `spec/spec_helper.rb` (or `test_helper.rb`):
131
+
132
+ ```ruby
133
+ require "ruby_llm/contract/rspec" # or ruby_llm/contract/minitest
134
+ ```
135
+
136
+ Then in specs:
137
+
138
+ ```ruby
+ RSpec.describe ArticlesController do
+   it "saves the summary when the step passes" do
+     stub_step(SummarizeArticle, response: {
+       tldr: "...", takeaways: %w[a b c], tone: "analytical"
+     })
+
+     post :summarize, params: { id: article.id }
+
+     # NOTE: jsonb/json column round-trips as string keys on reload.
+     expect(article.reload.summary["tldr"]).to eq("...")
+   end
+ end
151
+ ```
152
+
153
+ For the step itself, use the `satisfy_contract` and `pass_eval` matchers — details in the [Testing guide](testing.md).
154
+
155
+ ## 6. Error handling in controllers
156
+
157
+ Never raise on `result.failed?` — that crashes the request. Branch instead:
158
+
159
+ ```ruby
+ class ArticlesController < ApplicationController
+   def summarize
+     SummarizeArticleJob.perform_later(params[:id])
+     head :accepted
+   end
+
+   # For synchronous cases (admin tools, small content):
+   def preview
+     result = SummarizeArticle.run(@article.body)
+
+     if result.ok?
+       render json: result.parsed_output
+     else
+       Rails.logger.warn "[llm] #{SummarizeArticle.name} failed: #{result.status}"
+       render json: { error: "Could not summarize; try again shortly." }, status: :service_unavailable
+     end
+   end
+ end
178
+ ```
179
+
180
+ When `retry_policy` is exhausted and every model has failed, `result.failed?` is true but `result.parsed_output` still contains the last attempt's output, which is useful for logging what the model *did* return before `validate` rejected it.
181
+
182
+ ## 7. CI gate — block regressions before merge
183
+
184
+ Add to your `Rakefile`:
185
+
186
+ ```ruby
+ require "ruby_llm/contract/rake_task"
+
+ RubyLLM::Contract::RakeTask.new do |t|
+   t.minimum_score = 0.8
+   t.maximum_cost = 0.05
+   t.fail_on_regression = true
+   t.save_baseline = false # read-only in CI; refresh baselines manually
+ end
195
+ ```
196
+
197
+ Then wire it in GitHub Actions:
198
+
199
+ ```yaml
+ - name: LLM contract evals
+   env:
+     OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+   run: bundle exec rake ruby_llm_contract:eval
204
+ ```
205
+
206
+ The job fails when a previously-passing eval case now fails, when the average score drops below the threshold, or when total cost exceeds the cap. That is the signal that blocks a prompt regression or an accidental model upgrade from shipping.
207
+
208
+ **Two practical notes:**
209
+
210
+ - **Live evals spend real money on every run** — provider tokens per case × number of cases × every merge. Keep the dataset small and targeted (5–15 high-value cases), use cheap models where quality allows, and rely on offline `sample_response` smoke tests in the bulk of CI runs. Live evals belong on merge-candidate branches and scheduled nightly runs, not on every commit.
211
+ - **Baselines are checkout-managed** — commit them to git under `.eval_baselines/`. Refresh them in a separate manual workflow (or locally + a dedicated PR) rather than from the merge gate, which would dirty the checkout and race with the regression check it is supposed to run.
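
The three failure conditions of the gate (average score below the threshold, total cost above the cap, a previously passing case now failing) can be written out as a small plain-Ruby check. The `CaseResult` struct and `gate_failures` helper are hypothetical and only mirror the rules stated above; they are not the gem's `Eval::Report` API.

```ruby
# Hypothetical report shape for illustration; not the gem's Eval::Report API.
CaseResult = Struct.new(:name, :score, :passed, :cost, keyword_init: true)

# Returns a list of human-readable reasons the gate should fail (empty means pass).
def gate_failures(cases, baseline_passing:, minimum_score:, maximum_cost:)
  failures = []

  average = cases.sum(&:score) / cases.size.to_f
  failures << "average score #{average.round(2)} is below #{minimum_score}" if average < minimum_score

  total_cost = cases.sum(&:cost)
  failures << "total cost exceeds #{maximum_cost}" if total_cost > maximum_cost

  # A regression is a case that passed in the baseline but fails now.
  regressed = cases.reject(&:passed).map(&:name) & baseline_passing
  failures << "regressed cases: #{regressed.join(', ')}" unless regressed.empty?

  failures
end

cases = [
  CaseResult.new(name: "ruby 3.4 release", score: 1.0, passed: true, cost: 0.01),
  CaseResult.new(name: "security advisory", score: 0.5, passed: false, cost: 0.01)
]

gate_failures(cases, baseline_passing: ["ruby 3.4 release", "security advisory"],
                     minimum_score: 0.8, maximum_cost: 0.05)
```

In this sketch the run fails for two independent reasons (the average of 0.75 is under 0.8, and "security advisory" regressed), which is why a gate usually reports every reason rather than stopping at the first.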
212
+
213
+ ## See also
214
+
215
+ - [Getting Started](getting_started.md) — the feature walkthrough the step above is built on
216
+ - [Migration](migration.md) — before/after for replacing a raw `LlmClient.new.call` service with a contract
217
+ - [Eval-First](eval_first.md) — the workflow behind the CI gate above
218
+ - [Testing](testing.md) — `satisfy_contract` and `pass_eval` matcher chains
@@ -0,0 +1,52 @@
1
+ # Relation to `RubyLLM::Agent`
2
+
3
+ > Read this when you already use `RubyLLM::Agent` (or are about to) and want to understand where `ruby_llm-contract` sits in the same project.
4
+
5
+ `RubyLLM::Agent` shipped in RubyLLM 1.12. `Step::Base` from this gem and `Agent` target the **same niche**: reusable, class-based prompts. They are **siblings**, not foundation-and-roof.
6
+
7
+ ## Feature mapping
8
+
9
+ | What you write | Where it lives |
10
+ |---|---|
11
+ | `model`, `temperature`, `schema`, `instructions`, `tools`, `thinking` | covered by both — same idea, different DSL surface |
12
+ | `validate :rule do ... end` business invariants on output | only in `ruby_llm-contract` |
13
+ | `retry_policy escalate(...)` model escalation on validation failure | only here (different from RubyLLM's network-level retry) |
14
+ | `max_cost` / `max_input` / `max_output` pre-flight refusal | only here |
15
+ | `define_eval` + baseline regression + `compare_models` + `optimize_retry_policy` | only here (RubyLLM does not ship an evaluation framework) |
16
+ | Pipeline composition with `step SomeStep, as: :alias` | only here (RubyLLM intentionally leaves workflows as plain Ruby) |
17
+ | `around_call`, named `observe` hooks with pass/fail recorded in trace | only here |
18
+
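The pre-flight refusal row in the table above is the least familiar idea for most readers, so here is a rough plain-Ruby sketch of what "refuse before spending" can mean. Everything in it is an assumption made for illustration: the four-characters-per-token heuristic, the price constant, the `:refused` status, and the helper names. None of it is the gem's estimator or real provider pricing.

```ruby
# All numbers and names here are illustrative assumptions,
# not the gem's estimator and not real provider pricing.
PRICE_PER_MILLION_INPUT_TOKENS = 0.40 # hypothetical USD price

# Rough heuristic: about four characters per token.
def estimated_input_tokens(prompt)
  (prompt.length / 4.0).ceil
end

# Decides before any network call whether the request fits the budget.
def preflight(prompt, max_cost:)
  tokens = estimated_input_tokens(prompt)
  cost = tokens * PRICE_PER_MILLION_INPUT_TOKENS / 1_000_000
  return { status: :refused, estimated_cost: cost } if cost > max_cost

  { status: :ok, estimated_cost: cost }
end

preflight("x" * 4_000, max_cost: 0.001)[:status]     # :ok
preflight("x" * 4_000_000, max_cost: 0.001)[:status] # :refused
```

The point of the pattern is ordering: the check runs on the rendered prompt before the provider is called, so an oversized input costs nothing.
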
19
+ ## Runtime relationship
20
+
21
+ `Step::Base` does **not** use `Agent` internally today. The actual call path is:
22
+
23
+ ```
+ Step.run(input)
+   → Runner
+     → Adapters::RubyLLM
+       → RubyLLM.chat(model:, ...)
+         → ... .ask(prompt)
29
+ ```
30
+
31
+ `Agent` is a sibling abstraction calling into `RubyLLM::Chat` through its own `apply_configuration` path. Both end up at `Chat`. They do not share the macro-storage layer.
32
+
33
+ This may change in a future release if upstream APIs make a layered design natural. The decision is not committed; it depends on adopter signal.
34
+
35
+ ## Coexistence on the same project
36
+
37
+ The two abstractions can live in the same Rails (or non-Rails) project. Pick one per use case:
38
+
39
+ - **`RubyLLM::Agent`** when you want a reusable prompt with `model` + `instructions` + `schema` + `tools` and that is enough — no retry-on-validation-failure, no business invariants, no eval framework, no budget gating.
40
+ - **`ruby_llm-contract`'s `Step::Base`** when you need any of: invariants (`validate`), retry with model escalation on validation failure, pre-flight cost ceilings, an evaluation framework with baseline regression, or pipeline composition.
41
+
42
+ A common pattern: simple ad-hoc prompts as `Agent`, contracts on the LLM features that touch production behaviour or money as `Step`.
43
+
44
+ ## On retry strategies
45
+
46
+ The retry-strategy framing in this gem favours `retry_policy escalate(model_2, ...)` (model escalation, addresses model bias) over same-model `retry_policy attempts: N` (variance retry).
47
+
48
+ This is grounded in empirical comparison across PDF quiz generation, GSM8K math (n=30 + n=120), and multi-constraint schedule generation: same-model retry produced no useful lift for nano-class models on tasks with clear correctness criteria. Model escalation did move the needle when same-model retry could not.
49
+
50
+ `attempts: N` stays in the gem API (backward compat + niche cases like subjective-criteria tasks, multi-step pipelines, weaker open-source models) but is not marketed as a default retry strategy.
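
The difference between variance retry and model escalation can be shown with a deterministic stand-in for the LLM. This is a plain-Ruby sketch under stated assumptions: `run_with_models`, the `call_model` lambda, the model names, and the returned hash shape are all hypothetical, and the "small model is consistently wrong" behavior is contrived to illustrate model bias, not measured.

```ruby
# Sketch only: `call_model` stands in for a real LLM call, and the model names are made up.
def run_with_models(models, input, call_model:, validator:)
  attempts = []
  last_output = nil
  models.each do |model|
    last_output = call_model.call(model, input)
    passed = validator.call(last_output)
    attempts << { model: model, passed: passed }
    return { status: :ok, output: last_output, attempts: attempts } if passed
  end
  # Keep the last output around so the caller can log what was rejected.
  { status: :validation_failed, output: last_output, attempts: attempts }
end

# A deterministic stand-in: the small model is consistently wrong on this input.
fake_llm = ->(model, _question) { model == "small-model" ? { answer: 41 } : { answer: 42 } }
correct = ->(output) { output[:answer] == 42 }

same_model = run_with_models(%w[small-model small-model small-model], "6 * 7?",
                             call_model: fake_llm, validator: correct)
escalated = run_with_models(%w[small-model bigger-model], "6 * 7?",
                            call_model: fake_llm, validator: correct)

same_model[:status] # :validation_failed after three identical attempts
escalated[:status]  # :ok on the second attempt
```

When a failure comes from a systematic bias rather than sampling noise, repeating the same model repeats the same mistake; only changing the model changes the outcome, and it does so in fewer calls here.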
51
+
52
+ See [Optimize retry policy](optimizing_retry_policy.md) for the empirical tooling.
@@ -0,0 +1,135 @@
1
+ # Relation to `ruby_llm-tribunal`
2
+
3
+ > Read this when you've seen [`ruby_llm-tribunal`](https://github.com/Alqemist-labs/ruby_llm-tribunal) and want to know how it relates to `ruby_llm-contract` — and which one (or both) you need.
4
+
5
+ Both gems sit on top of `ruby_llm`. The space they cover overlaps in vocabulary (both talk about "evals") but they live in different layers and answer different questions. They are not alternatives — they compose.
6
+
7
+ ## The core distinction
8
+
9
+ | | `ruby_llm-tribunal` | `ruby_llm-contract` |
10
+ |---|---|---|
11
+ | Layer | **Test framework** | **Runtime contract** |
12
+ | When it runs | After the LLM call returns, typically in a spec | Before the LLM result reaches your code |
13
+ | Where the output lives at evaluation time | Already in your variable, returned to caller | Still inside the gem's runner, not yet released |
14
+ | What "fail" means | Red test in CI | Trigger retry/escalate on a stronger model, or fail closed |
15
+ | Strongest features | Rich LLM-as-judge (faithful, relevant, hallucination, refusal, bias, toxicity, jailbreak, PII), red-team adversarial prompts, deterministic helpers (`assert_contains`, `assert_levenshtein`, …), RSpec/Minitest matchers, HTML/JUnit/GH reporters | Schema DSL with constraints, `validate` business rules, `retry_policy escalate(...)` model escalation, `max_cost` pre-flight refusal, regression-eval framework (frozen dataset + baseline + min_score gate), pipeline composition |
16
+ | What it does NOT cover | No retry, no model escalation, no pre-flight cost cap, no contract layer between LLM and your code | No 10-judge LLM-as-judge catalog, no red-team generation, no rich deterministic assertion library, no test-framework matchers |
17
+
18
+ ## Visual: where each gem sits in your call
19
+
20
+ ### Tribunal alone (test-time, in CI)
21
+
22
+ ```
23
+ your code ──► LLM ──► output ──► variable ──► [Tribunal assert_*] ──► ✅ / ❌ red test
24
+
25
+ runs in your spec, not in prod
26
+ ```
27
+
28
+ The output **already exists in your code** by the time Tribunal sees it. Tribunal grades it after the fact. A failed grade is a red test — production is unaffected, you fix the prompt or model and re-run.
29
+
30
+ ### Contract alone (runtime, in prod)
31
+
32
+ ```
33
+ your code ──► Step.run ──► LLM ──► [schema + validate]
34
+
35
+ ├── valid ────────────► output ──► your code
36
+
37
+ └── invalid ──► retry/escalate ──► next model
38
+
39
+ all attempts fail ─┘
40
+
41
+ Result(:validation_failed)
42
+ ```
43
+
44
+ The output **never reaches your code** until the contract passes. A failed validation either fixes itself (the retry policy escalates to a stronger model) or fails closed with `Result(:validation_failed)`, so your call site sees a deterministic failure status and never schema-invalid data.
45
+
46
+ ### Both together (Tribunal in CI → then Contract in prod)
47
+
48
+ ```
49
+ CI (before merge):
50
+ define_eval(frozen dataset) ──► run Step ──► [Tribunal grades each case]
51
+
52
+
53
+ regression gate
54
+ (prevents quality drift over time)
55
+
56
+
57
+ ✅ merge allowed
58
+
59
+
60
+ PROD (every request):
61
+ your code ──► Step.run ──► LLM ──► [contract] ──► output ──► your code
62
+
63
+ │ keeps bad outputs out of prod
64
+ ```
65
+
66
+ Tribunal grades **a fixed set of cases on every PR** to catch quality regressions before merge. Once merged, Contract gates **every production call** to keep bad outputs from reaching users. Each gem owns the layer it is best at — and they run in the order developers experience them.
67
+
68
+ ## When to use which
69
+
70
+ **Just Contract.** You ship LLM features whose output drives downstream code, money, or user trust. You need the bad-output-doesn't-reach-prod guarantee, retry escalation, and budget refusal. You are happy to write your own `validate` blocks for content checks; you don't need a 10-judge catalog yet.
71
+
72
+ **Just Tribunal.** You have a stable production path you don't want to wrap, but you want a CI safety net that grades LLM output for faithfulness, hallucination, PII, jailbreak resistance, etc. You're testing a RAG pipeline or chatbot whose runtime is owned by other code.
73
+
74
+ **Both.** You ship contracts in prod (Contract) AND want stronger CI signal beyond schema regression — judge-quality grading on a frozen dataset, plus adversarial red-team probes. Use Contract's `Step` to make the call, run it in `define_eval` over your dataset, and grade each case with Tribunal helpers in your spec or via the dataset's `evaluator:` proc.
75
+
76
+ ## Integration patterns
77
+
78
+ These work today without any code changes in either gem — both use plain Ruby blocks/procs as extension points.
79
+
80
+ ### Tribunal helpers inside Contract `validate`
81
+
82
+ ```ruby
+ class ChatReply < RubyLLM::Contract::Step::Base
+   prompt "Answer this question grounded in the docs:\n{input}"
+
+   # The second block parameter is the step input; it is unused here.
+   validate("no PII in answer") do |output, _input|
+     test_case = RubyLLM::Tribunal::TestCase.new(actual_output: output[:answer])
+     RubyLLM::Tribunal::Assertions.evaluate(:pii, test_case, {}).first == :pass
+   end
+ end
91
+ ```
92
+
93
+ A failed Tribunal grade triggers Contract's retry/escalate just like any other validation failure. You get LLM-as-judge **runtime gating**, not just CI testing.
94
+
95
+ ### Tribunal as `evaluator:` in a Contract dataset
96
+
97
+ ```ruby
+ ChatReply.define_eval "rag_regression" do
+   add_case "policy",
+            input: "What is the return policy?",
+            evaluator: ->(output, _expected, _input) {
+              tc = RubyLLM::Tribunal::TestCase.new(
+                actual_output: output[:answer],
+                context: ["Returns accepted within 30 days with receipt."]
+              )
+              result = RubyLLM::Tribunal::Assertions.evaluate(:faithful, tc, {})
+              score = result.last[:score] || 0.0
+              RubyLLM::Contract::Eval::EvaluationResult.new(score: score, passed: result.first == :pass)
+            }
+ end
111
+ ```
112
+
113
+ Each case is graded by a Tribunal judge; baseline + min_score gate then fails the build on regression. You write the judge once, get the regression gate for free.
114
+
115
+ ### Contract `Step` as Tribunal's `opts[:llm]` injection
116
+
117
+ Tribunal's built-in judges call `RubyLLM.chat(...).ask(...)` and naively `JSON.parse` the result. If you want **schema-validated, retried, cost-capped judge calls**, inject a Contract `Step` as the LLM caller via `opts[:llm]`. This is an advanced pattern: start from the injection point in `Tribunal::Assertions::Judge#run_judge` and write your own judge that wraps a `Step.run`.
118
+
119
+ This is a recipe, not a shipped adapter. Tribunal's `opts[:llm]` API is at v0.x — recipes survive minor changes; a shipped adapter would not.
120
+
121
+ ## What we are NOT doing
122
+
123
+ - **No `Contract::ContainsAssertion` or similar 16-helper deterministic library.** Tribunal owns that layer well. Contract's evaluator surface is intentionally minimal (`Exact`, `Regex`, `JsonIncludes`, `ProcEvaluator`, `TraitEvaluator`); for richer deterministic checks, drop a Tribunal helper into your `evaluator:` proc.
124
+ - **No built-in LLM-as-judge catalog.** `Faithful`, `Hallucination`, `Refusal`, etc. are Tribunal's domain. We provide the runtime contract; they provide the grading vocabulary.
125
+ - **No Tribunal as a hard or soft dependency.** Both gems work standalone. Recipes above are documentation, not code in this gem.
126
+
127
+ ## Summary
128
+
129
+ Three questions, three owners:
130
+
131
+ - **"Is this output good?"** — Tribunal, in CI, on outputs you already hold.
132
+ - **"What do we do when it isn't?"** — Contract, at runtime, before outputs reach your code (retry/escalate, or fail-closed with `Result(:validation_failed)`).
133
+ - **"What do we do when it _is_ good?"** — your application code. Once Contract returns `:ok`, you persist, render, hand off downstream. The gem deliberately doesn't touch the happy path; it owns failure semantics, not domain logic.
134
+
135
+ Use Tribunal, Contract, or both — whichever questions your application needs to answer.