ruby_llm-contract 0.10.4 → 0.10.5
- checksums.yaml +4 -4
- data/CHANGELOG.md +26 -0
- data/README.md +9 -1
- data/docs/guide/best_practices.md +2 -0
- data/docs/guide/eval_first.md +19 -2
- data/docs/guide/getting_started.md +9 -1
- data/docs/guide/llm_judge.md +206 -0
- data/docs/guide/migration.md +1 -1
- data/docs/guide/optimizing_retry_policy.md +4 -2
- data/docs/guide/output_schema.md +1 -1
- data/docs/guide/pipeline.md +2 -0
- data/docs/guide/prompt_ast.md +3 -1
- data/docs/guide/relation_to_tribunal.md +1 -1
- data/docs/guide/testing.md +3 -3
- data/lib/ruby_llm/contract/version.rb +1 -1
- metadata +2 -1
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 233b114efc88c168573b434e9adc17163401e747920a4fc6b587e7a834721723
|
|
4
|
+
data.tar.gz: fd2d57bd72cd72a7b4a405fda8bfcb0865d737a16a78db11e69d25026c89d1a2
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: ff1564d786bc8ab8b391d3dfe0e7e9be406dd4bcaca56fc891fa3ead2a3b8c76cfb6f26c3ad81bc2ffd798e3b9a63a90ab1c4a847e1ad2be1efc1170c6abee83
|
|
7
|
+
data.tar.gz: 33a4a6aa6798f55d6cfe9c4a1684ee3080e1ea42043d7f53ab54b40c5cc927a9dd607a266a3130206b43822ae7c06739856d29517237432d6cdcfda2f1b08ff3
|
data/CHANGELOG.md
CHANGED
|
@@ -1,5 +1,31 @@
|
|
|
1
1
|
# Changelog
|
|
2
2
|
|
|
3
|
+
## 0.10.5 (2026-06-11)
|
|
4
|
+
|
|
5
|
+
Docs release: new `llm_judge.md` guide + comprehensive clarity audit across all 16 shipping documentation files. No code behaviour change.
|
|
6
|
+
|
|
7
|
+
### Added
|
|
8
|
+
|
|
9
|
+
- **`docs/guide/llm_judge.md`** — new guide for the LLM-as-judge pattern. Three-step workflow (build judge as a Contract Step → calibrate against humans on real production data → use as eval gate). Hook is the Cursor "Sam" April 2025 incident (support chatbot hallucinated company policy, schema valid, brand damage). Covers claims-breakdown alternative output schema, the "calibration is itself iterative" failure mode (over-flagging stylistic courtesy), six anti-patterns (stubbing the judge, calibrating on synthetic data, per-language regex, judge on request path, proxy label assertion, calibrating once and shipping), and an escalation pointer to `ruby_llm-tribunal`'s catalog. 206 lines. Linked from `eval_first.md` (oracle-validation rule) and `relation_to_tribunal.md` (composition recipe). Cited research: Eugene Yan on evals, Hamel Husain field guide, May 2025 LLM-hallucination court-case database, Klarna AI customer-service reversal.
|
|
10
|
+
|
|
11
|
+
### Fixed
|
|
12
|
+
|
|
13
|
+
- **`docs/guide/migration.md` — `save_baseline = false` in CI Rakefile.** Pre-0.10.5 the migration template set `save_baseline = true`, which dirties the working checkout on every CI run and races the regression check — directly violating the project invariant ("`save_baseline: true` in CI dirties checkout and races the regression check — keep it false; refresh in a separate job"). Adopters copy-pasting the template hit non-deterministic CI failures + git status noise. The corrected template now sets `false` with an inline comment explaining the baseline-refresh-in-separate-workflow rationale.
|
|
14
|
+
- **`docs/guide/relation_to_tribunal.md` — evaluator lambda arity corrected from 3 to 1.** The example previously used `->(output, _expected, _input)`, but `ProcEvaluator` only accepts arity 1 or 2 — adopters copy-pasting hit `ArgumentError: wrong number of arguments`. Now reads `->(output)`.
|
|
15
|
+
- **`docs/guide/eval_first.md`** — added explicit definition of "partial match" (`expected:` is treated as a subset of `parsed_output`; extra output keys ignored, listed keys must equal), added definition of `adapter` (the layer that actually executes the LLM call), clarified `system`/`rule`/`example`/`validate` as prompt-shaping building blocks with a `prompt_ast.md` link.
|
|
16
|
+
- **`docs/guide/getting_started.md`** — explained the `{X}` prompt template syntax (gem-specific, not ERB, not `String#%`), described `validate(...) { |o, _| ... }` arity convention, named the retry triggers `:validation_failed` / `:parse_error`, differentiated `default_input` from `add_case input:`, explained partial-match semantics.
|
|
17
|
+
- **`README.md`** — three clarity edits to the SummarizeArticle example: `{input}` placeholder syntax explained, `validate` lambda args (`|o, _|`) explained, `retry_policy do escalate(...) end` block form aliased to `retry_policy models: %w[...]` shorthand (both forms share the same DSL — `models` is an alias for `escalate`). Plus the multimodal upgrade FAQ entry was compressed from 7 sentences to a 2-sentence SEO pointer to `multimodal_input.md` (where the full `attachment_token_estimate` setup, fail-closed behaviour, and `on_unknown_attachment_size :warn` opt-out already lived canonically) — reduces README cognitive load for the 90% of adopters not upgrading from pre-0.10.0.
|
|
18
|
+
- **`docs/guide/testing.md`**: clarified pipeline test responses (the keys of the `responses:` hash, e.g. `:summarize`, `:tag`, `:card`, must match the `add_step :name, ...` identifiers), distinguished the `validate` block (Step-level) from the `verify` block (an eval-case evaluator declared via `verify(name) { |output| ... }` inside `define_eval`), and explained the shape semantics of `stub_all_steps(response: { ... })`.
|
|
19
|
+
- **`docs/guide/best_practices.md`** — explained `rule` as a prompt DSL element distinct from `system` / `user` with a link to `prompt_ast.md`.
|
|
20
|
+
- **`docs/guide/optimizing_retry_policy.md`** — explained `gpt-4.1-mini@low` CLI shorthand for `{model:, reasoning_effort:}`, defined `production_mode: { fallback: ... }` as an optional `compare_models` kwarg that reports effective cost (first-try + weighted fallback) instead of first-attempt only, compressed the "two orthogonal dimensions" callout from 7 concepts to 3 (DSL alias details moved to the dedicated `thinking` DSL note at the end).
|
|
21
|
+
- **`docs/guide/output_schema.md`** — compressed the `RubyLLM::Agent.schema` boundary callout from 6 concepts to 3 + pointer (full Agent-vs-Step comparison lives canonically in `relation_to_agent.md`).
|
|
22
|
+
- **`docs/guide/pipeline.md`** — documented that `Pipeline.run_eval` matches **only** the final step's output against `expected:` (assertions on intermediate-step outputs silently never match — gem-level invariant).
|
|
23
|
+
- **`docs/guide/prompt_ast.md`** — explained `Types::Hash.schema(...)` typed-input declaration as distinct from plain `input_type Hash` (typed form raises `TypeError` on missing/wrong-type keys; plain form accepts any hash).
|
|
24
|
+
|
|
25
|
+
### Audit method
|
|
26
|
+
|
|
27
|
+
All 16 documentation files (README + 15 guides) were audited with a new internal skill, `reader-knowledge-audit` (three-axis: assumption-gap A-modes + reading-overhead R-modes + structural-misplacement M-modes). 33 runnable code examples across the guides were verified end-to-end against a live OpenAI API key (separate test scripts in `.revive/` cover `SummarizeArticle`, smoke / regression evals, multi-key `{X}` template interpolation, 3-step pipeline with fail-fast, full LLM-judge lifecycle, and multimodal PNG + PDF attachment ingestion with `attachment_token_estimate` enforcement).
|
|
28
|
+
|
|
3
29
|
## 0.10.4 (2026-06-10)
|
|
4
30
|
|
|
5
31
|
Patch release: the `ruby_llm_contract:optimize` rake task now auto-loads in Rails apps. No behaviour change to the task itself.
|
data/README.md
CHANGED
|
@@ -55,7 +55,15 @@ class SummarizeArticle < RubyLLM::Contract::Step::Base
|
|
|
55
55
|
{ model: "gpt-5", reasoning_effort: "high" }
|
|
56
56
|
end
|
|
57
57
|
end
|
|
58
|
+
```
|
|
59
|
+
|
|
60
|
+
Three things to notice in the snippet above:
|
|
58
61
|
|
|
62
|
+
- **`{input}` is a gem-specific prompt placeholder** — at call time `SummarizeArticle.run(article_text)` interpolates `article_text` into that slot. For `input_type Hash`, each top-level key becomes its own placeholder (e.g. `{title}`, `{body}`). Plain text in, plain text out — no ERB, no `String#%`.
|
|
63
|
+
- **`validate("...") { |o, _| ... }`**: each block receives `(parsed_output, context)`; the convention `|o, _|` reads "output, ignore context". Use `o[:key]` to assert on any field your schema declared. A falsy return sets status `:validation_failed`, and a retry fires if a `retry_policy` is set.
|
|
64
|
+
- **`retry_policy do escalate(...) end`** — block form for mixed configs (strings + hashes with per-attempt options). For uniform model lists you can use the shorthand `retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]` (it's the same DSL — `models` is an alias for `escalate`).
|
|
65
|
+
|
|
66
|
+
```ruby
|
|
59
67
|
result = SummarizeArticle.run(article_text)
|
|
60
68
|
result.status # => :ok (or :validation_failed if all steps fail)
|
|
61
69
|
result.parsed_output # => { tldr: "...", takeaways: [...], tone: "..." }
|
|
@@ -167,7 +175,7 @@ Pre-1.0 (currently **0.10.4**). Semver tracked; breaking changes flagged in [CHA
|
|
|
167
175
|
|
|
168
176
|
**Where in a Rails app?** Default `app/contracts/`. The Railtie reloads `app/contracts/eval/` and `app/steps/eval/` in development; any autoloaded directory also works. See [Rails integration](docs/guide/rails_integration.md).
|
|
169
177
|
|
|
170
|
-
**Upgraded
|
|
178
|
+
**Upgraded from pre-0.10.0 and getting `:limit_exceeded` with attachments?** Multimodal contracts with `max_cost`/`max_input` need `attachment_token_estimate`. See [multimodal input guide](docs/guide/multimodal_input.md#cost-attachment_token_estimate-is-required) for setup, fail-closed behaviour, and `on_unknown_attachment_size :warn` opt-out.
|
|
171
179
|
|
|
172
180
|
## License
|
|
173
181
|
|
|
data/docs/guide/best_practices.md
CHANGED
|
@@ -80,6 +80,8 @@ end
|
|
|
80
80
|
|
|
81
81
|
In a pipeline, each step only sees the previous step's output. If a later step needs original article metadata, an intermediate step must carry it through. Suppose a pipeline `SummarizeArticle → GenerateHashtags`, where `GenerateHashtags` needs the `tone` from the summary:
|
|
82
82
|
|
|
83
|
+
`rule` is a prompt DSL element distinct from `system` / `user` — use it for declarative invariants the model must keep across the transform (full DSL surface in [Prompt AST](prompt_ast.md)).
|
|
84
|
+
|
|
83
85
|
```ruby
|
|
84
86
|
class GenerateHashtags < RubyLLM::Contract::Step::Base
|
|
85
87
|
input_type Hash
|
data/docs/guide/eval_first.md
CHANGED
|
@@ -34,7 +34,24 @@ SummarizeArticle.define_eval("regression") do
|
|
|
34
34
|
end
|
|
35
35
|
```
|
|
36
36
|
|
|
37
|
-
|
|
37
|
+
"Partial match" means `expected` is treated as a subset of `parsed_output` — each key in `expected` must equal the corresponding key in the output, but extra keys in the output are ignored. So the `ruby release` case passes whenever the model produces `tone: "analytical"`, regardless of what shows up in `tldr` or `takeaways`. To assert on multiple keys, list them all in `expected`.
|
|
38
|
+
|
|
39
|
+
Only after the eval exists, touch the prompt-shaping building blocks of the Step: `system`, `rule`, and `example` (blocks inside `prompt do`), `validate` (class-level invariant checks that run after the model returns), or the prompt version itself. See [Getting Started](getting_started.md) for the DSL surface.
|
|
40
|
+
|
|
41
|
+
## Validate your oracle before you trust the score
|
|
42
|
+
|
|
43
|
+
An eval is exactly as good as its ground-truth. Green tells you the system does what you **specified**, never what you **wanted**. Two failure modes show up the moment a regression eval is taken seriously:
|
|
44
|
+
|
|
45
|
+
**1. Wrong oracle (green-but-wrong).** You hand-label the expected output — `expected: { page_type: "Article" }` — the classifier returns `Article`, eval passes 8/8 stable. Then a real audit catches that `Article` was the wrong target all along: it triggers a cascade of downstream `Article`-scoped checks that fail on the actual page. The eval gated the **label**, not the **downstream effect**. Fix: assert the target effect, not a convenient proxy. If `page_type` matters because it changes which audit rules run, the case should assert which rules fire (or don't), not the label string.
|
|
46
|
+
|
|
47
|
+
**2. Unvalidated LLM-judge oracle.** You add a cheap `gpt-5-nano` judge that scores "is this output self-contained?" and gate prompt changes on it. The unit test for the judge **stubs the judge's verdict** and checks the scoring math — it proves the aggregation works, never that the judge's real verdicts agree with reality. A miscalibrated judge gives you a confident percentage that doesn't match what a human reviewer would say. Fix: before the judge gates anything, validate it against a human-labeled golden set of **real production samples** (not synthetic, not edge-cases-you-imagined). Only then do its before/after numbers mean what you think. See [LLM-judge pattern](llm_judge.md) for the full build → calibrate → gate workflow in code.
|
|
48
|
+
|
|
49
|
+
**Rules:**
|
|
50
|
+
|
|
51
|
+
- Assert the target **effect**, not a proxy label — pick the assertion by reading what downstream actually does with the output, not what's convenient to label.
|
|
52
|
+
- Validate the oracle (human labels on real prod samples) before trusting any eval's score.
|
|
53
|
+
- For semantic checks that span multiple inputs/locales, use an LLM-judge (one prompt, all cases) — not per-case regex that drifts.
|
|
54
|
+
- Keep LLM-judges in the eval suite (on-demand or scheduled), **never** on the production request path.
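The first rule above (assert the effect, not the proxy label) can be sketched with two one-argument evaluator lambdas. This is a minimal illustration, not code from the gem: the `audit_rules` mapping and the rule names are hypothetical stand-ins for whatever your downstream code does with `page_type`.

```ruby
# Hypothetical downstream mapping: which audit rules run for each page type.
audit_rules = {
  "Article" => %i[byline_present publish_date_present],
  "Landing" => %i[cta_present]
}

# Proxy assertion: passes as soon as the label string matches.
label_check = ->(output) { output[:page_type] == "Article" ? 1.0 : 0.0 }

# Effect assertion: in this case the page is really a landing page, so no
# Article-scoped rule (here :byline_present) should fire downstream.
effect_check = lambda do |output|
  fired = audit_rules.fetch(output[:page_type], [])
  fired.include?(:byline_present) ? 0.0 : 1.0
end

output = { page_type: "Article" } # what the classifier returned
puts label_check.call(output)     # green on the label
puts effect_check.call(output)    # red on the downstream effect
```

The same classifier output is green under the label check and red under the effect check, which is the green-but-wrong gap described in failure mode 1.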
|
|
38
55
|
|
|
39
56
|
## Three eval kinds
|
|
40
57
|
|
|
@@ -107,7 +124,7 @@ Good for: offline smoke tests, local development, testing evaluator wiring, chec
|
|
|
107
124
|
|
|
108
125
|
Not enough for real prompt decisions. For those:
|
|
109
126
|
|
|
110
|
-
- `run_eval(..., context: { model: "..." })` with a real model, or pass an explicit adapter.
|
|
127
|
+
- `run_eval(..., context: { model: "..." })` with a real model, or pass an explicit adapter via `context: { adapter: ... }`. An *adapter* here is the layer that actually executes the LLM call — `RubyLLM::Contract::Adapters::Test` returns canned data, the default RubyLlm adapter hits the real provider. `model:` chooses the model name; `adapter:` chooses whether the call goes live or stays canned.
|
|
111
128
|
- `compare_with(...)` for prompt A/B.
|
|
112
129
|
|
|
113
130
|
`compare_with` intentionally ignores `sample_response` — canned data would make both sides look the same.
|
data/docs/guide/getting_started.md
CHANGED
|
@@ -8,6 +8,8 @@ The README shows a minimal `SummarizeArticle` step. This guide walks through the
|
|
|
8
8
|
|
|
9
9
|
Start with the README example, then add features one layer at a time. Each is optional — use what you need.
|
|
10
10
|
|
|
11
|
+
The `prompt` body uses a small template syntax — `{X}` placeholders are replaced with values from the input at call time. With the default `input_type String`, the entire `article_text` you pass to `SummarizeArticle.run(article_text)` lands in `{input}`. For `input_type Hash`, each top-level key becomes its own placeholder (e.g. `{summary}`, `{article}`). See [Prompt AST](prompt_ast.md) for the full DSL surface.
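To make the placeholder rule concrete, here is a plain-Ruby sketch that mimics the behaviour described above. It is an illustration only: `render_template` is a hypothetical helper, and the gem's real renderer may treat escaping, missing keys, or nested values differently.

```ruby
# Illustrative only: mimics the `{X}` placeholder behaviour described above.
# String input fills the single `{input}` slot; Hash input exposes each
# top-level key as its own placeholder. Unknown placeholders are left as-is
# (an assumption made for this sketch).
def render_template(template, input)
  values = input.is_a?(Hash) ? input.transform_keys(&:to_s) : { "input" => input }
  template.gsub(/\{(\w+)\}/) do
    values.fetch(Regexp.last_match(1)) { |key| "{#{key}}" }.to_s
  end
end

puts render_template("Summarize: {input}", "Ruby 3.4 was released.")
puts render_template("Article: {article} / Summary: {summary}",
                     { article: "Long text", summary: "Short text" })
```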
|
|
12
|
+
|
|
11
13
|
```ruby
|
|
12
14
|
class SummarizeArticle < RubyLLM::Contract::Step::Base
|
|
13
15
|
# 1. Prompt (required)
|
|
@@ -39,9 +41,11 @@ class SummarizeArticle < RubyLLM::Contract::Step::Base
|
|
|
39
41
|
end
|
|
40
42
|
```
|
|
41
43
|
|
|
44
|
+
Each `validate` block receives `(parsed_output, context)` — the convention `|o, _|` reads "output, ignore context". Use `o[:key]` to assert on any field your `output_schema` declared. The block returns truthy/falsy: `false` → status `:validation_failed`, retry fires if `retry_policy` was set.
|
|
45
|
+
|
|
42
46
|
## Validation and retry behavior
|
|
43
47
|
|
|
44
|
-
When the cheap model returns output that fails a `validate` block or can't be parsed, retry falls back to the next model in `models:` and tries again.
|
|
48
|
+
When the cheap model returns output that fails a `validate` block (status `:validation_failed`) or can't be parsed as JSON (status `:parse_error`), retry falls back to the next model in `models:` and tries again. Transport errors (timeouts, 5xx) retry independently via Faraday, not via `retry_policy`. See [Optimizing retry_policy](optimizing_retry_policy.md) for tuning the chain.
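The escalation described above can be simulated offline. The sketch below is not the gem's implementation: `run_with_escalation` and `fake_attempt` are hypothetical names, and `attempt` stands in for one LLM call that returns a status symbol. Only the two statuses named above trigger the next model.

```ruby
# Illustrative simulation of model escalation; not the gem's code.
# `attempt` is a hypothetical callable: model name in, status symbol out.
def run_with_escalation(models, attempt, retryable: %i[validation_failed parse_error])
  trace = []
  models.each do |model|
    status = attempt.call(model)
    trace << [model, status]
    break unless retryable.include?(status) # stop on the first non-retryable status
  end
  model, status = trace.last
  { status: status, model: model, trace: trace }
end

# Canned behaviour: the two cheaper models fail, the last one succeeds.
fake_attempt = lambda do |model|
  { "gpt-4.1-nano" => :parse_error, "gpt-4.1-mini" => :validation_failed }.fetch(model, :ok)
end

result = run_with_escalation(%w[gpt-4.1-nano gpt-4.1-mini gpt-4.1], fake_attempt)
puts result[:model]
puts result[:trace].inspect
```

If every model in the chain fails, the final status is whatever the last attempt returned, which is why a chain that ends on a capable model matters.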
|
|
45
49
|
|
|
46
50
|
```ruby
|
|
47
51
|
result = SummarizeArticle.run(article_text)
|
|
@@ -104,6 +108,8 @@ report.score # => 1.0
|
|
|
104
108
|
report.print_summary
|
|
105
109
|
```
|
|
106
110
|
|
|
111
|
+
`default_input` sets the input once for every case in this eval — useful for smoke evals where you reuse one synthetic article. For regression evals where each case needs its own input, use `add_case input: "..."` (shown next).
|
|
112
|
+
|
|
107
113
|
For real regression testing, define cases with expected output (online — calls the LLM):
|
|
108
114
|
|
|
109
115
|
```ruby
|
|
@@ -118,6 +124,8 @@ SummarizeArticle.define_eval("regression") do
|
|
|
118
124
|
end
|
|
119
125
|
```
|
|
120
126
|
|
|
127
|
+
`expected:` is treated as a partial match against `parsed_output` — extra output keys are ignored, listed keys must equal. So this case passes whenever the model produces `tone: "analytical"`, regardless of `tldr` content. To assert on multiple keys, list them all in `expected`.
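The subset rule can be written out in a few lines of plain Ruby. This illustrates the semantics described above and is not the gem's actual matcher; `partial_match?` is a hypothetical helper name, and nested values are compared with plain `==` in this sketch.

```ruby
# Hypothetical helper illustrating "partial match": every key listed in
# `expected` must equal the same key in `output`; extra output keys are ignored.
def partial_match?(expected, output)
  expected.all? { |key, value| output.key?(key) && output[key] == value }
end

output = { tldr: "Ruby 3.4 ships.", takeaways: ["faster"], tone: "analytical" }

puts partial_match?({ tone: "analytical" }, output)                # only tone asserted
puts partial_match?({ tone: "analytical", tldr: "other" }, output) # tldr asserted too
```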
|
|
128
|
+
|
|
121
129
|
Gate CI on score and cost thresholds:
|
|
122
130
|
|
|
123
131
|
```ruby
|
data/docs/guide/llm_judge.md
ADDED
|
@@ -0,0 +1,206 @@
|
|
|
1
|
+
# LLM-judge pattern
|
|
2
|
+
|
|
3
|
+
> **Cursor, April 2025.** A support chatbot named "Sam" confidently told customers their subscription was limited to one device — a security policy the company never had. Reddit and Hacker News picked it up within hours. Some customers cancelled. Anysphere (Cursor's creator — an AI-tooling company) had to issue a public apology and refund the user who blew the whistle.
|
|
4
|
+
>
|
|
5
|
+
> Every Sam response was schema valid. JSON parsed, structured output, validate blocks green.
|
|
6
|
+
>
|
|
7
|
+
> Your schemas check shape, not meaning. This guide adds the missing layer: a second LLM that reads the answer and decides whether it's faithful to your source of truth.
|
|
8
|
+
|
|
9
|
+
> ⚠️ **Limit of this layer.** The judge keeps the model **consistent with the source you give it** — nothing more. It does NOT verify the source itself. If your policy document is incomplete or out of date, a consistent answer can still be wrong: the gate passes, the customer wins in court. Treat the source with the same versioning discipline as your code.
|
|
10
|
+
|
|
11
|
+
> 📊 **The pattern repeats.** By May 2025, **129 court cases** involved LLM-generated fake citations submitted as evidence — averaging $4,713 in sanctions per case. **Klarna reversed** its 2024 decision to replace customer service with AI after a year of hallucination complaints. This is not a long-tail risk — it's a systemic, recurring failure mode of LLM applications.
|
|
12
|
+
|
|
13
|
+
## When you need a judge
|
|
14
|
+
|
|
15
|
+
Most outputs can be checked deterministically: the schema matches, an enum is in range, a string fits a length limit. Those go in `validate` blocks or `output_schema`.
|
|
16
|
+
|
|
17
|
+
Some questions cannot be checked that way:
|
|
18
|
+
|
|
19
|
+
- *"Is this summary accurate to the source article?"*
|
|
20
|
+
- *"Does this support reply actually answer the customer's question?"*
|
|
21
|
+
- *"Is this translation faithful to the original tone?"*
|
|
22
|
+
|
|
23
|
+
Reading the LLM's output character-by-character will not tell you. You need a second model to read it and decide. That second model is the **judge**.
|
|
24
|
+
|
|
25
|
+
The catch: a judge is just another LLM. If you trust its scores before you verify it agrees with humans, you have replaced one black box with two.
|
|
26
|
+
|
|
27
|
+
## The pattern in three steps
|
|
28
|
+
|
|
29
|
+
The running example is `SummarizeArticle` from [Getting Started](getting_started.md) — a Step that produces a TL;DR plus a few takeaways. We want to detect when a prompt change makes the summary less accurate to the source article.
|
|
30
|
+
|
|
31
|
+
### Step 1 — Build the judge as a Contract Step
|
|
32
|
+
|
|
33
|
+
The judge is just another Step. The benefit of using `ruby_llm-contract`: you get the same retry, cost cap, and structured-output guarantees that any other Step in this gem provides.
|
|
34
|
+
|
|
35
|
+
The judge takes **two** inputs in a single Hash: the SOURCE you want the answer to be faithful to (here: the article), and the ANSWER under review (here: the summary). Both are required — without the source, the judge has nothing to verify against.
|
|
36
|
+
|
|
37
|
+
The `prompt do` DSL uses a small template syntax: `{key}` placeholders are interpolated from the input. With `input_type Hash`, every top-level key of the hash becomes available as `{key}` — here `{article}` and `{summary}` map to `input[:article]` and `input[:summary]` when you later call `AccuracyJudge.run(article: ..., summary: ...)`. With `input_type String`, you use the special `{input}` placeholder (one input, one slot). It is the gem's own syntax — not ERB, not `String#%`, not string interpolation — and it is what makes the judge inputs visible inside the prompt.
|
|
38
|
+
|
|
39
|
+
```ruby
|
|
40
|
+
# A cheap model reads the summary against its source and decides if the
|
|
41
|
+
# summary's claims all come from the article. Structured boolean +
|
|
42
|
+
# reason so we can debug failures later.
|
|
43
|
+
class AccuracyJudge < RubyLLM::Contract::Step::Base
|
|
44
|
+
input_type Hash # { summary:, article: }
|
|
45
|
+
model "gpt-5-nano"
|
|
46
|
+
|
|
47
|
+
prompt do
|
|
48
|
+
system "You decide whether a summary is accurate to the article."
|
|
49
|
+
user <<~MSG
|
|
50
|
+
Article:
|
|
51
|
+
{article}
|
|
52
|
+
|
|
53
|
+
Summary:
|
|
54
|
+
{summary}
|
|
55
|
+
|
|
56
|
+
Answer two things:
|
|
57
|
+
- accurate: true if every claim in the summary comes from the article
|
|
58
|
+
(no added facts, no contradictions). false otherwise.
|
|
59
|
+
- reason: one short sentence explaining your decision.
|
|
60
|
+
MSG
|
|
61
|
+
end
|
|
62
|
+
|
|
63
|
+
output_schema do
|
|
64
|
+
boolean :accurate
|
|
65
|
+
string :reason
|
|
66
|
+
end
|
|
67
|
+
|
|
68
|
+
retry_policy models: %w[gpt-5-nano gpt-5-mini]
|
|
69
|
+
max_cost 0.001
|
|
70
|
+
end
|
|
71
|
+
```
|
|
72
|
+
|
|
73
|
+
The `retry_policy` escalates to `gpt-5-mini` if the cheaper `gpt-5-nano` fails schema/validate (e.g., returns malformed JSON) — the judge stays reliable without paying for the bigger model on every call. `max_cost 0.001` bounds each call: the judge runs N× per eval suite (one call per case), so capping per-call cost prevents calibration runs from blowing the budget.
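A quick back-of-envelope check shows what the per-call cap buys you. The arithmetic below assumes `max_cost` bounds each judge call as stated above, that each case triggers exactly one judge call, and that the cost is in whatever unit `max_cost` uses; whether escalated retries count against the same cap is not covered here.

```ruby
# Worst-case spend for one calibration run under the stated assumptions.
max_cost_per_call = 0.001 # from the AccuracyJudge definition
cases             = 50    # upper end of the 30 to 50 cases suggested for calibration

worst_case_run = (max_cost_per_call * cases).round(4)
puts worst_case_run
```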
|
|
74
|
+
|
|
75
|
+
This is small on purpose: one boolean, one reason. Easy to read, easy to disagree with.
|
|
76
|
+
|
|
77
|
+
> 💡 **When the judge needs to point at a specific sentence, not just say "drift".**
|
|
78
|
+
>
|
|
79
|
+
> *Drift* here means the answer adds facts not in the source, contradicts the source, or extends scope beyond what the source covers — even though schema and validate blocks pass.
|
|
80
|
+
>
|
|
81
|
+
> Boolean + reason is fine for binary regression checks. When you want the judge to be a debugging tool — telling you which exact claim in the answer drifted — swap the output schema for a per-claim breakdown:
|
|
82
|
+
>
|
|
83
|
+
> ```ruby
|
|
84
|
+
> output_schema do
|
|
85
|
+
> array :claims do
|
|
86
|
+
> object do
|
|
87
|
+
> string :claim
|
|
88
|
+
> string :status, enum: %w[supported contradicted unsupported]
|
|
89
|
+
> end
|
|
90
|
+
> end
|
|
91
|
+
> string :verdict, enum: %w[pass fail]
|
|
92
|
+
> string :reason
|
|
93
|
+
> end
|
|
94
|
+
> ```
|
|
95
|
+
>
|
|
96
|
+
> Now a failed eval gives you a per-claim breakdown like `"Ruby 3.4 has a new pattern matching syntax" → unsupported` (the source article doesn't mention pattern matching) or `"Released in January 2025" → contradicted` (the article says December 2024). Three statuses, each grounded in the SOURCE you passed: **supported** (the article says so), **contradicted** (the article says the opposite), **unsupported** (the article doesn't mention it either way). That's a sentence-level pointer your PR review can act on, not just "score dropped". Costs slightly more tokens per call; pays off the first time you debug a regression. (Reminder: `array :x do object do ... end end` — without the inner `object do`, the schema silently degrades to `items: string`.)
|
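Once the judge returns the per-claim schema, pulling out the drifted claims is a one-liner. The sketch below uses a hand-written hash shaped like that schema; symbol keys inside each claim are an assumption, and the sample data is made up for illustration.

```ruby
# Hypothetical parsed output shaped like the per-claim schema.
verdict = {
  claims: [
    { claim: "Ruby 3.4 has a new pattern matching syntax", status: "unsupported" },
    { claim: "Released in January 2025", status: "contradicted" },
    { claim: "Ruby 3.4 was released", status: "supported" }
  ],
  verdict: "fail",
  reason: "Two claims are not grounded in the article."
}

# Keep only the claims a reviewer needs to look at.
drifted = verdict[:claims].reject { |c| c[:status] == "supported" }
drifted.each { |c| puts "#{c[:status]}: #{c[:claim]}" }
```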
|
97
|
+
|
|
98
|
+
### Step 2 — Calibrate the judge against humans on real production data
|
|
99
|
+
|
|
100
|
+
Before the judge can gate anything, you need to know it agrees with a human reviewer. Pull a handful of real production summaries (not synthetic, not edge cases you imagined) and have a human label each one as accurate or not.
|
|
101
|
+
|
|
102
|
+
```ruby
|
|
103
|
+
AccuracyJudge.define_eval("calibration") do
|
|
104
|
+
add_case "prod_a",
|
|
105
|
+
input: { article: PROD_A_ARTICLE, summary: PROD_A_SUMMARY },
|
|
106
|
+
expected: { accurate: true } # the human said: looks accurate
|
|
107
|
+
|
|
108
|
+
add_case "prod_b",
|
|
109
|
+
input: { article: PROD_B_ARTICLE, summary: PROD_B_SUMMARY },
|
|
110
|
+
expected: { accurate: false } # the human said: summary added a date not in the source
|
|
111
|
+
|
|
112
|
+
# ... 30 to 50 cases, sampled from real production traffic
|
|
113
|
+
end
|
|
114
|
+
```
|
|
115
|
+
|
|
116
|
+
Run it. The score tells you how often the judge agrees with the human:
|
|
117
|
+
|
|
118
|
+
```ruby
|
|
119
|
+
report = AccuracyJudge.run_eval("calibration")
|
|
120
|
+
report.score # => 0.92 → judge agrees with humans 92% of the time
|
|
121
|
+
```
|
|
122
|
+
|
|
123
|
+
A reasonable bar is `score >= 0.85` — an empirical baseline from LLM-eval field practice ([Eugene Yan on evals](https://eugeneyan.com/writing/evals/), [Hamel Husain field guide](https://hamel.dev/blog/posts/field-guide/)). Below that, judge-human agreement drops into a band where the judge's confident percentages stop matching what a human reviewer would say, and gate decisions become unreliable. If your domain is high-stakes (medical, legal, financial), raise the bar to 0.90+. Iterate the judge's prompt, or escalate to a stronger model, until the score on real production data crosses your bar.
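When the score is below the bar, it helps to see which way the judge disagrees with the humans. The following sketch computes the agreement rate by hand and splits the disagreements into two lists. The record shape and the ids are hypothetical; in practice the labels come from your calibration cases and the verdicts from the judge's `parsed_output[:accurate]`.

```ruby
# Hypothetical calibration records: the human label and the judge's verdict.
records = [
  { id: "prod_a", human: true,  judge: true  },
  { id: "prod_b", human: false, judge: false },
  { id: "prod_c", human: true,  judge: false }, # judge over-flagged
  { id: "prod_d", human: false, judge: true  }  # judge missed a real problem
]

agreement    = records.count { |r| r[:human] == r[:judge] }.fdiv(records.size)
over_flagged = records.select { |r| r[:human] && !r[:judge] }.map { |r| r[:id] }
missed       = records.select { |r| !r[:human] && r[:judge] }.map { |r| r[:id] }

bar = 0.85
puts format("agreement %.2f, ready to gate: %s", agreement, agreement >= bar)
puts "over-flagged: #{over_flagged.inspect}, missed: #{missed.inspect}"
```

A long over-flagged list suggests the prompt is too strict; a long missed list is the more dangerous direction, because the gate would pass summaries a human would reject.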
|
|
124
|
+
|
|
125
|
+
**What "iterate the judge's prompt" actually looks like.** The first judge prompt almost always over-flags. Two or three iterations are normal before the judge is ready to gate anything. The common pattern:
|
|
126
|
+
|
|
127
|
+
| Addition in the answer | Naive judge marks | Should be |
|
|
128
|
+
|---|---|---|
|
|
129
|
+
| "Happy to help with anything else" (small talk) | unsupported | supported (stylistic) |
|
|
130
|
+
| "We'll find a flexible solution" (commitment) | unsupported | unsupported (out of policy) |
|
|
131
|
+
|
|
132
|
+
When the per-claim breakdown shows the judge can't distinguish those two categories, refine its prompt to make the distinction explicit:
|
|
133
|
+
|
|
134
|
+
```
|
|
135
|
+
Distinguish two kinds of additions outside the source:
|
|
136
|
+
a) Stylistic courtesy (greetings, empathy, small talk) — treat as supported.
|
|
137
|
+
b) Out-of-policy commitments (offers, promises, gestures) — unsupported.
|
|
138
|
+
```
|
|
139
|
+
|
|
140
|
+
Re-run calibration. Score recovers from 0.6 to 0.92 not because the judge got smarter, but because you taught it your distinction between "polite filler" and "business commitment". Calibration is itself an eval-first loop — not a one-time qualification gate.

|
|
141
|
+
|
|
142
|
+
### Step 3 — Use the calibrated judge as the eval gate
|
|
143
|
+
|
|
144
|
+
Only now, after the judge has earned trust on real data, do you use it as the oracle in a regression eval on the Step you actually care about. The evaluator lambda receives two arguments:
|
|
145
|
+
|
|
146
|
+
- `output` — `parsed_output` of the Step under test. Here `output[:tldr]` is the summary `SummarizeArticle` produced (the same field declared in its `output_schema`).
|
|
147
|
+
- `input` — whatever you passed to `add_case input:`. Here it's the original article string.
|
|
148
|
+
|
|
149
|
+
The lambda must return a `Float` in `[0.0, 1.0]` — the eval framework averages per-case scores into the suite score. Map `true → 1.0`, `false → 0.0`, or use weighted partial scores if your judge returns gradations.
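If the judge uses the per-claim schema, one way to get a graded score is to weight each claim status and average. The weights below are an assumption chosen for illustration, and `score_claims` is a hypothetical name; pick values that match how costly each kind of drift is for you.

```ruby
# Sketch: turn a per-claim verdict into a Float in [0.0, 1.0].
# The weights are illustrative assumptions, not values from the gem.
weights = { "supported" => 1.0, "unsupported" => 0.5, "contradicted" => 0.0 }

score_claims = lambda do |claims|
  return 1.0 if claims.empty? # nothing asserted, nothing to penalise
  total = claims.sum { |c| weights.fetch(c[:status]) }
  (total / claims.size).clamp(0.0, 1.0)
end

claims = [
  { claim: "A", status: "supported" },
  { claim: "B", status: "supported" },
  { claim: "C", status: "unsupported" },
  { claim: "D", status: "contradicted" }
]
puts score_claims.call(claims)
```

Inside an evaluator you would call something like this on the judge's parsed claims instead of mapping a single boolean to `1.0` or `0.0`.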
|
|
150
|
+
|
|
151
|
+
> ⚠️ **Heads-up before you copy the snippet below.** Smoke evals on `SummarizeArticle` (per [Getting Started](getting_started.md)) typically run with `RubyLLM::Contract::Adapters::Test` for determinism — canned summaries, no API calls. **The `AccuracyJudge.run(...)` inside the evaluator does NOT inherit that Test adapter.** Each Step picks its adapter independently (from `RubyLLM::Contract.configure`, `context: { adapter: }`, or `LIVE=1` env). Run accuracy-regression evals as live (`LIVE=1` or explicit `context:`), and rely on `max_cost` on the judge to bound the bill.
|
|
152
|
+
>
|
|
153
|
+
> This is intentional: the judge IS the oracle, so faking its verdicts would defeat the purpose. Keep the Test-adapter smoke evals for schema/wiring checks only.
|
|
154
|
+
|
|
155
|
+
```ruby
|
|
156
|
+
SummarizeArticle.define_eval("accuracy_regression") do
|
|
157
|
+
add_case "prod_article_42",
|
|
158
|
+
input: PROD_ARTICLE_42,
|
|
159
|
+
evaluator: ->(output, input) {
|
|
160
|
+
verdict = AccuracyJudge.run({
|
|
161
|
+
summary: output[:tldr],
|
|
162
|
+
article: input
|
|
163
|
+
})
|
|
164
|
+
verdict.parsed_output[:accurate] ? 1.0 : 0.0
|
|
165
|
+
}
|
|
166
|
+
|
|
167
|
+
# ... one case per real production article you want to defend
|
|
168
|
+
end
|
|
169
|
+
```
|
|
170
|
+
|
|
171
|
+
Now `SummarizeArticle.run_eval("accuracy_regression")` is your real before/after signal:
|
|
172
|
+
|
|
173
|
+
- Run it before a prompt change → baseline score.
|
|
174
|
+
- Run it after the prompt change → if the score drops, the new prompt made the summary less accurate (and a calibrated judge says so, which means a human would likely say so too).
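The before/after comparison can be reduced to a small gate. In the sketch below the scores are hard-coded so it runs offline; in practice they would come from `SummarizeArticle.run_eval("accuracy_regression").score`. `regression?` and its `tolerance` argument are hypothetical, added here to absorb small run-to-run noise in the judge.

```ruby
# Sketch of a before/after gate; `tolerance` is a hypothetical knob.
def regression?(baseline, candidate, tolerance: 0.02)
  candidate < baseline - tolerance
end

baseline_score  = 0.92 # score before the prompt change
candidate_score = 0.80 # score after the prompt change

puts regression?(baseline_score, candidate_score)
```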
|
|
175
|
+
|
|
176
|
+
## Anti-patterns
|
|
177
|
+
|
|
178
|
+
Each of these has been observed in real production projects.
|
|
179
|
+
|
|
180
|
+
**Stubbing the judge's verdict in unit tests.** Your spec replaces `AccuracyJudge.run` with a canned `{ accurate: true }` and then checks that the eval scoring math adds up. The math is correct; you have learned nothing about whether the judge's real verdicts agree with reality. Calibrate the judge against humans (Step 2). Do not stub it.
|
|
181
|
+
|
|
182
|
+
**Calibrating on synthetic data.** You ask another LLM to generate 50 summary+article pairs labeled accurate or not, then check the judge against those labels. The calibration only proves your judge agrees with the labeling LLM. Use real production samples reviewed by a human.
|
|
183
|
+
|
|
184
|
+
**Per-language regex instead of a judge.** "If the summary contains the word 'unfortunately' AND the article does not, flag it as a hallucination." This works once, drifts the moment the prompt changes the tone, and silently breaks on a new locale. A single LLM-judge with one prompt covers all locales; a per-language regex set does not.
|
|
185
|
+
|
|
186
|
+
**Running the judge on the production request path.** The judge adds an extra LLM call per output. In the eval suite (on-demand or CI), that is fine. On every production request, it roughly doubles both latency and cost. Keep judges in the eval suite.
|
|
187
|
+
|
|
188
|
+
**Calibrating once and shipping.** First calibration gives you a baseline. Production drifts — new product lines, new customer behaviors, seasonal patterns, provider model updates — and the calibration drifts with it. A judge calibrated against Q1 customer-support logs may silently misjudge Q3 traffic after a new return policy launched.
|
|
189
|
+
|
|
190
|
+
What to do operationally:
|
|
191
|
+
|
|
192
|
+
- **Re-calibrate** against a fresh human-labeled sample every quarter, or after any material business change (new SKU category, prompt rewrite on the gated Step, model version bump).
|
|
193
|
+
- **Store** calibration scores per run — a DB column, a log line, or a CI artifact.
|
|
194
|
+
- **Alert** when the 30-day rolling average drops by more than 10 points. A 30-day window smooths daily judge-call noise; a 10-point threshold sits above the noise floor while staying sensitive to real drift. Shorten the window for high-volume Steps (7 days) or lengthen it for slow-drift business contexts (60 days).
|
|
195
|
+
|
|
196
|
+
The drop is your signal to refine the judge's prompt, not to lower the gate.
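
The store-and-alert steps above could be sketched roughly as follows. This is a minimal sketch under stated assumptions: `CalibrationRun`, `rolling_average`, and `drift_alert?` are hypothetical names, scores are assumed to be judge-vs-human agreement on a 0 to 100 scale, and the drop is assumed to be measured against the first calibration's baseline score.

```ruby
require "date"

# Hypothetical record of one calibration run; not part of ruby_llm-contract.
# score is assumed to be percent agreement between judge and human labels.
CalibrationRun = Struct.new(:date, :score)

# Mean score of the runs that fall inside the trailing window ending at as_of.
# Returns nil when no run falls inside the window.
def rolling_average(runs, as_of:, window_days: 30)
  cutoff = as_of - window_days
  recent = runs.select { |run| run.date > cutoff && run.date <= as_of }
  return nil if recent.empty?

  recent.sum(&:score) / recent.size.to_f
end

# True when the rolling average sits more than max_drop points below baseline.
def drift_alert?(runs, baseline:, as_of:, window_days: 30, max_drop: 10.0)
  average = rolling_average(runs, as_of: as_of, window_days: window_days)
  return false if average.nil?

  (baseline - average) > max_drop
end

runs = [
  CalibrationRun.new(Date.new(2025, 1, 1), 95.0),  # outside the window
  CalibrationRun.new(Date.new(2025, 3, 1), 80.0),
  CalibrationRun.new(Date.new(2025, 3, 15), 76.0)
]
puts "recalibrate the judge" if drift_alert?(runs, baseline: 92.0, as_of: Date.new(2025, 3, 20))
```

Passing `window_days: 7` or `window_days: 60` covers the high-volume and slow-drift cases mentioned above without changing the logic.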
|
|
197
|
+
|
|
198
|
+
## When to escalate to Tribunal's catalog
|
|
199
|
+
|
|
200
|
+
If you find yourself building three or four judges that all rhyme — "faithful?", "hallucination?", "refusal?", "PII leakage?" — you are reinventing what [`ruby_llm-tribunal`](https://github.com/Alqemist-labs/ruby_llm-tribunal) ships as a built-in catalog. See [Relation to Tribunal](relation_to_tribunal.md) for the integration recipe: Contract Steps make the LLM calls, Tribunal supplies the grading vocabulary, and the two compose in a single `define_eval`.
|
|
201
|
+
|
|
202
|
+
## See also
|
|
203
|
+
|
|
204
|
+
- [Eval-First](eval_first.md) — when to write evals, the three kinds, the oracle-validation rule that this guide implements in code.
|
|
205
|
+
- [Best Practices](best_practices.md) — keeping evals cheap, real, and trustworthy over time.
|
|
206
|
+
- [Optimizing retry_policy](optimizing_retry_policy.md) — once your eval is honest, find the cheapest model chain that still passes it.
|
data/docs/guide/migration.md
CHANGED
|
@@ -116,7 +116,7 @@ RubyLLM::Contract::RakeTask.new do |t|
|
|
|
116
116
|
t.minimum_score = 0.8
|
|
117
117
|
t.maximum_cost = 0.05
|
|
118
118
|
t.fail_on_regression = true
|
|
119
|
-
t.save_baseline =
|
|
119
|
+
t.save_baseline = false # read-only in CI; refresh baselines in a separate workflow
|
|
120
120
|
end
|
|
121
121
|
```
|
|
122
122
|
|
|
data/docs/guide/optimizing_retry_policy.md
CHANGED
|
@@ -10,7 +10,7 @@ You defined `SummarizeArticle` in the [README](../../README.md) with `retry_poli
|
|
|
10
10
|
- **2–3 evals per step.** One eval optimizes for one scenario; with only `smoke`, you get a recommendation that passes smoke but may miss production edge cases. See [eval-first](eval_first.md).
|
|
11
11
|
- **Rake task `ruby_llm_contract:optimize` is registered when `lib/ruby_llm/contract/rake_task.rb` is loaded.** In Rails apps on **0.10.4+** the railtie auto-loads it lazily during `rake` invocation — no manual setup needed. On older Rails versions or in non-Rails projects, add `require "ruby_llm/contract/rake_task"` to your `Rakefile` (or any file under `lib/tasks/`) and set `EVAL_DIRS=...` on the command line.
|
|
12
12
|
|
|
13
|
-
> **Two orthogonal dimensions to a retry chain.** A chain element is `{ model:, reasoning_effort: }` — model identity AND thinking budget. `optimize_retry_policy` explores both. You can
|
|
13
|
+
> **Two orthogonal dimensions to a retry chain.** A chain element is `{ model:, reasoning_effort: }` — model identity AND thinking budget. `optimize_retry_policy` explores both. You can pin the default thinking config at class level too — see the `thinking` DSL note at the end.
|
|
14
14
|
|
|
15
15
|
For this guide, assume `SummarizeArticle` has three evals:
|
|
16
16
|
|
|
@@ -30,6 +30,8 @@ rake ruby_llm_contract:optimize \
|
|
|
30
30
|
CANDIDATES=gpt-4.1-nano,gpt-4.1-mini@low,gpt-4.1-mini,gpt-4.1
|
|
31
31
|
```
|
|
32
32
|
|
|
33
|
+
The `@low` (or `@medium`, `@high`) suffix on a candidate is CLI shorthand for a reasoning effort: `gpt-4.1-mini@low` expands to `{ model: "gpt-4.1-mini", reasoning_effort: :low }`, which saves you from constructing hash strings on the command line. Plain model names without `@` use the Step's default thinking config (or none).
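
To make the shorthand concrete, here is a minimal sketch of how such a `CANDIDATES` string could be expanded. It is not the gem's parser: `parse_candidate`, `parse_candidates`, and `EFFORTS` are hypothetical names, and returning `{ model: ... }` with no `reasoning_effort` key for plain names is an assumption.

```ruby
# Hypothetical parser for illustration; the gem's own implementation may differ.
EFFORTS = %w[low medium high].freeze

# Expand one CLI token such as "gpt-4.1-mini@low" into a chain-element hash.
def parse_candidate(token)
  model, effort = token.strip.split("@", 2)
  # Assumed behaviour: no suffix means no explicit reasoning effort.
  return { model: model } if effort.nil?

  unless EFFORTS.include?(effort)
    raise ArgumentError, "unknown reasoning effort: #{effort.inspect}"
  end

  { model: model, reasoning_effort: effort.to_sym }
end

# Expand a comma-separated CANDIDATES value into a list of chain elements.
def parse_candidates(csv)
  csv.split(",").map { |token| parse_candidate(token) }
end

p parse_candidates("gpt-4.1-nano,gpt-4.1-mini@low,gpt-4.1-mini,gpt-4.1")
```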
|
|
34
|
+
|
|
33
35
|
Offline uses each eval's `sample_response` — zero API calls. **Every candidate gets the same score** because they all receive the canned response. That's fine for a smoke test (verifying evals load, candidates parse, output renders) but it doesn't compare model quality. For real optimization, go live.
|
|
34
36
|
|
|
35
37
|
## Optimize against real models
|
|
@@ -74,7 +76,7 @@ Copy the DSL, paste into your step, verify with `rake ruby_llm_contract:eval`. Y
|
|
|
74
76
|
|
|
75
77
|
`optimize` shows **first-attempt** cost. In production, a candidate whose validator rejects 20% of outputs actually costs `first_try_cost + fallback_cost × 0.20` per successful output. The first-attempt number hides this.
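
The arithmetic behind that formula, as a small worked sketch. `effective_cost` and the dollar figures are hypothetical, chosen only to show how a rejection rate inflates the first-attempt number.

```ruby
# Hypothetical helper for illustration; not part of ruby_llm-contract.
# Expected cost per successful output when rejected first attempts
# fall through to a fallback model.
def effective_cost(first_try_cost:, fallback_cost:, rejection_rate:)
  unless (0.0..1.0).cover?(rejection_rate)
    raise ArgumentError, "rejection_rate must be between 0 and 1"
  end

  first_try_cost + fallback_cost * rejection_rate
end

# Assumed prices: cheap candidate at $0.001 per call, fallback at $0.01 per call,
# validator rejecting 20% of the candidate's outputs.
cost = effective_cost(first_try_cost: 0.001, fallback_cost: 0.01, rejection_rate: 0.20)
puts format("first attempt: $%.4f, effective: $%.4f", 0.001, cost)
```

With these assumed numbers the effective cost is roughly three times the first-attempt cost, which is the gap the first-attempt metric hides.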
|
|
76
78
|
|
|
77
|
-
`production_mode
|
|
79
|
+
`production_mode:` is an optional `compare_models` kwarg. With `{ fallback: "model" }` it runs each candidate with a runtime `[candidate, fallback]` chain and reports **effective cost** (first-try + fallback cost weighted by rejection rate). Without it, you get only first-attempt metrics (the default).
|
|
78
80
|
|
|
79
81
|
```ruby
|
|
80
82
|
SummarizeArticle.compare_models(
|
data/docs/guide/output_schema.md
CHANGED
|
@@ -7,7 +7,7 @@ Declare the expected output structure using [ruby_llm-schema](https://github.com
|
|
|
7
7
|
1. **Output validation** — replaces type and shape checks (enums, ranges, required fields). One declaration instead of many.
|
|
8
8
|
2. **Provider-side request** — with the RubyLLM adapter, the schema is sent to the LLM provider via `chat.with_schema(...)`, asking the model to return JSON matching the shape. Cheaper models sometimes ignore the request, which is why client-side validation (point 1) still matters.
|
|
9
9
|
|
|
10
|
-
> **Same DSL `RubyLLM::Agent.schema` accepts.**
|
|
10
|
+
> **Same DSL `RubyLLM::Agent.schema` accepts.** Both use `RubyLLM::Schema.create(&block)` under the hood. `Step.output_schema` is eager-compiled at class load and drives client-side validation; `Agent.schema` accepts a `Proc` for dynamic per-call schemas. See [Relation to Agent](relation_to_agent.md) for the full comparison.
|
|
11
11
|
|
|
12
12
|
All examples below extend the `SummarizeArticle` step from the [README](../../README.md).
|
|
13
13
|
|
data/docs/guide/pipeline.md
CHANGED
|
@@ -123,6 +123,8 @@ result = ArticleCardPipeline.run(article_text, timeout_ms: 30_000)
|
|
|
123
123
|
|
|
124
124
|
## Pipeline eval
|
|
125
125
|
|
|
126
|
+
Pipeline `run_eval` matches only the **final step's output** against `expected:` (`BuildArticleCard` here). Intermediate step outputs (`SummarizeArticle`, `GenerateHashtags`) are not routed through `expected:` — assertions on `:tone` or `:hashtags` here would silently never match. For per-step assertions, run evals on each step individually.
|
|
127
|
+
|
|
126
128
|
```ruby
|
|
127
129
|
ArticleCardPipeline.define_eval("e2e") do
|
|
128
130
|
add_case "ruby 3.4 release",
|
data/docs/guide/prompt_ast.md
CHANGED
|
@@ -27,7 +27,9 @@ The AST is immutable, diffable, and hashable. Useful for snapshot testing and au
|
|
|
27
27
|
|
|
28
28
|
## Hash inputs with variable interpolation
|
|
29
29
|
|
|
30
|
-
When input is a Hash, each key becomes a template variable. Concrete scenario: a multi-language newsletter product where the same article has to be summarised in Polish for EU subscribers, English for US, with different audiences per tier (Rails developers vs engineering managers). Hash inputs let one step cover all of these without forking the class
|
|
30
|
+
When input is a Hash, each key becomes a template variable. Concrete scenario: a multi-language newsletter product where the same article has to be summarised in Polish for EU subscribers, English for US, with different audiences per tier (Rails developers vs engineering managers). Hash inputs let one step cover all of these without forking the class.
|
|
31
|
+
|
|
32
|
+
The `Types::Hash.schema(...)` form below declares a **typed contract on the input** — `run()` raises `TypeError` if a required key is missing or its value violates the declared type. Use it when the input shape is part of the Step's interface (multi-key inputs, multi-tenant variants); plain `input_type Hash` accepts any hash and is fine for prototypes or when the schema is enforced upstream.
|
|
31
33
|
|
|
32
34
|
```ruby
|
|
33
35
|
class SummarizeArticle < RubyLLM::Contract::Step::Base
|
data/docs/guide/relation_to_tribunal.md
CHANGED
|
@@ -98,7 +98,7 @@ A failed Tribunal grade triggers Contract's retry/escalate just like any other v
|
|
|
98
98
|
ChatReply.define_eval "rag_regression" do
|
|
99
99
|
add_case "policy",
|
|
100
100
|
input: "What is the return policy?",
|
|
101
|
-
evaluator: ->(output
|
|
101
|
+
evaluator: ->(output) {
|
|
102
102
|
tc = RubyLLM::Tribunal::TestCase.new(
|
|
103
103
|
actual_output: output[:answer],
|
|
104
104
|
context: ["Returns accepted within 30 days with receipt."]
|
data/docs/guide/testing.md
CHANGED
|
@@ -33,7 +33,7 @@ result = SummarizeArticle.run("article text", context: { adapter: adapter })
|
|
|
33
33
|
result.ok? # => true
|
|
34
34
|
```
|
|
35
35
|
|
|
36
|
-
Multi-step pipeline testing with per-step named responses (using `ArticleCardPipeline` from [Pipeline](pipeline.md)):
|
|
36
|
+
Multi-step pipeline testing with per-step named responses (using `ArticleCardPipeline` from [Pipeline](pipeline.md)). The keys in `responses:` must match the step identifiers declared with `add_step :name, ...` in your pipeline definition — `:summarize`, `:tag`, `:card` here map to the three steps in that order.
|
|
37
37
|
|
|
38
38
|
```ruby
|
|
39
39
|
result = ArticleCardPipeline.test("article text",
|
|
@@ -56,7 +56,7 @@ result.parsed_output[:tldr] # => "..." ✓
|
|
|
56
56
|
result.parsed_output["tldr"] # => nil ✗
|
|
57
57
|
```
|
|
58
58
|
|
|
59
|
-
The gem warns if a `validate` or `verify` block returns `nil` — usually a sign of string-key access on symbol-keyed data.
|
|
59
|
+
The gem warns if a `validate` block (Step-level invariant check) or a `verify` block (eval-case evaluator declared via `verify(name) { |output| ... }` inside `define_eval`) returns `nil` — usually a sign of string-key access on symbol-keyed data.
|
|
60
60
|
|
|
61
61
|
## RSpec setup
|
|
62
62
|
|
|
@@ -118,7 +118,7 @@ stub_steps(
|
|
|
118
118
|
end
|
|
119
119
|
```
|
|
120
120
|
|
|
121
|
-
**Global stub for all steps
|
|
121
|
+
**Global stub for all steps** — every Step in the example process returns the same canned hash regardless of its `output_schema`. Useful for high-level integration specs where step output content doesn't matter; use `stub_steps` instead when you need per-step shapes.
|
|
122
122
|
|
|
123
123
|
```ruby
|
|
124
124
|
stub_all_steps(response: { default: true })
|
metadata
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: ruby_llm-contract
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 0.10.
|
|
4
|
+
version: 0.10.5
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Justyna
|
|
@@ -70,6 +70,7 @@ files:
|
|
|
70
70
|
- docs/guide/best_practices.md
|
|
71
71
|
- docs/guide/eval_first.md
|
|
72
72
|
- docs/guide/getting_started.md
|
|
73
|
+
- docs/guide/llm_judge.md
|
|
73
74
|
- docs/guide/migration.md
|
|
74
75
|
- docs/guide/multimodal_input.md
|
|
75
76
|
- docs/guide/optimizing_retry_policy.md
|