ruby_llm-contract 0.10.1 → 0.10.2
- checksums.yaml +4 -4
- data/CHANGELOG.md +15 -1
- data/Gemfile.lock +2 -2
- data/README.md +2 -2
- data/docs/architecture.md +50 -0
- data/docs/guide/best_practices.md +136 -0
- data/docs/guide/eval_first.md +192 -0
- data/docs/guide/getting_started.md +199 -0
- data/docs/guide/migration.md +185 -0
- data/docs/guide/multimodal_input.md +160 -0
- data/docs/guide/optimizing_retry_policy.md +131 -0
- data/docs/guide/output_schema.md +93 -0
- data/docs/guide/pipeline.md +154 -0
- data/docs/guide/prompt_ast.md +76 -0
- data/docs/guide/rails_integration.md +218 -0
- data/docs/guide/relation_to_agent.md +52 -0
- data/docs/guide/relation_to_tribunal.md +135 -0
- data/docs/guide/testing.md +282 -0
- data/docs/guide/why.md +103 -0
- data/lib/ruby_llm/contract/version.rb +1 -1
- data/ruby_llm-contract.gemspec +1 -1
- metadata +16 -1
|
@@ -0,0 +1,185 @@
|
|
|
1
|
+
# Migration Guide
|
|
2
|
+
|
|
3
|
+
> Read this when adopting the gem in an existing Rails app with a raw `LlmClient.call` service you want to replace. Skip if starting fresh — go to [Getting Started](getting_started.md) instead.
|
|
4
|
+
|
|
5
|
+
How to adopt `ruby_llm-contract` in an existing Rails app. Examples use `SummarizeArticle` — the flagship step from the [README](../../README.md) — but the pattern applies to any single-input / structured-output service.
|
|
6
|
+
|
|
7
|
+
## Step 1: Start with the simplest service
|
|
8
|
+
|
|
9
|
+
Pick the LLM service with: single input → JSON output → DB save. Don't start with parallel batches or complex pipelines.
|
|
10
|
+
|
|
11
|
+
## Step 2: Define the contract
|
|
12
|
+
|
|
13
|
+
**Before — raw HTTP:**
|
|
14
|
+
|
|
15
|
+
```ruby
|
|
16
|
+
class ArticleSummaryService
|
|
17
|
+
def call(article_text)
|
|
18
|
+
response = LlmClient.new(model: "gpt-4o-mini").call(prompt(article_text))
|
|
19
|
+
JSON.parse(response[:content], symbolize_names: true)
|
|
20
|
+
end
|
|
21
|
+
end
|
|
22
|
+
```
|
|
23
|
+
|
|
24
|
+
**After — contract:**
|
|
25
|
+
|
|
26
|
+
```ruby
|
|
27
|
+
class SummarizeArticle < RubyLLM::Contract::Step::Base
|
|
28
|
+
model "gpt-4.1-mini"
|
|
29
|
+
|
|
30
|
+
prompt do
|
|
31
|
+
system "You summarize articles for a UI card."
|
|
32
|
+
rule "Return valid JSON only."
|
|
33
|
+
user "{input}"
|
|
34
|
+
end
|
|
35
|
+
|
|
36
|
+
output_schema do
|
|
37
|
+
string :tldr
|
|
38
|
+
array :takeaways, of: :string, min_items: 3, max_items: 5
|
|
39
|
+
string :tone, enum: %w[neutral positive negative analytical]
|
|
40
|
+
end
|
|
41
|
+
|
|
42
|
+
validate("TL;DR fits the card") { |o, _| o[:tldr].length <= 200 }
|
|
43
|
+
retry_policy models: %w[gpt-4.1-mini gpt-4.1]
|
|
44
|
+
end
|
|
45
|
+
```
|
|
46
|
+
|
|
47
|
+
## Step 3: Replace the caller
|
|
48
|
+
|
|
49
|
+
```ruby
|
|
50
|
+
# Before
|
|
51
|
+
parsed = ArticleSummaryService.new.call(article_text)
|
|
52
|
+
Article.update!(summary: parsed[:tldr]) # keys are symbols: the service parses with symbolize_names: true
|
|
53
|
+
|
|
54
|
+
# After
|
|
55
|
+
result = SummarizeArticle.run(article_text)
|
|
56
|
+
if result.ok?
|
|
57
|
+
Article.update!(summary: result.parsed_output[:tldr])
|
|
58
|
+
else
|
|
59
|
+
Rails.logger.warn "Summary failed: #{result.status}"
|
|
60
|
+
end
|
|
61
|
+
```
|
|
62
|
+
|
|
63
|
+
## Step 4: Add logging via around_call
|
|
64
|
+
|
|
65
|
+
```ruby
|
|
66
|
+
class SummarizeArticle < RubyLLM::Contract::Step::Base
|
|
67
|
+
# ... prompt, schema, validates ...
|
|
68
|
+
|
|
69
|
+
around_call do |step, input, result|
|
|
70
|
+
AiCallLog.create!(
|
|
71
|
+
ai_model: result.trace.model,
|
|
72
|
+
duration_ms: result.trace.latency_ms,
|
|
73
|
+
input_tokens: result.trace.usage&.dig(:input_tokens),
|
|
74
|
+
output_tokens: result.trace.usage&.dig(:output_tokens),
|
|
75
|
+
cost: result.trace.cost,
|
|
76
|
+
status: result.status.to_s
|
|
77
|
+
)
|
|
78
|
+
end
|
|
79
|
+
end
|
|
80
|
+
```
|
|
81
|
+
|
|
82
|
+
## Step 5: Add eval cases
|
|
83
|
+
|
|
84
|
+
Use real inputs from production logs:
|
|
85
|
+
|
|
86
|
+
```ruby
|
|
87
|
+
SummarizeArticle.define_eval("regression") do
|
|
88
|
+
add_case "short news",
|
|
89
|
+
input: "Ruby 3.4 ships with frozen string literals by default...",
|
|
90
|
+
expected: { tone: "analytical" }
|
|
91
|
+
|
|
92
|
+
add_case "critical review",
|
|
93
|
+
input: "The new mesh networking hardware failed under load...",
|
|
94
|
+
expected: { tone: "negative" }
|
|
95
|
+
end
|
|
96
|
+
```
|
|
97
|
+
|
|
98
|
+
## Step 6: Find the cheapest model
|
|
99
|
+
|
|
100
|
+
```ruby
|
|
101
|
+
comparison = SummarizeArticle.compare_models("regression",
|
|
102
|
+
candidates: [{ model: "gpt-4.1-nano" }, { model: "gpt-4.1-mini" }])
|
|
103
|
+
|
|
104
|
+
comparison.print_summary
|
|
105
|
+
comparison.best_for(min_score: 0.95) # => cheapest model at >= 95%
|
|
106
|
+
```
|
|
107
|
+
|
|
108
|
+
Full optimization workflow — multi-eval, fallback list, production-mode cost — in [Optimizing retry_policy](optimizing_retry_policy.md).
|
|
109
|
+
|
|
110
|
+
## Step 7: Add CI gate
|
|
111
|
+
|
|
112
|
+
```ruby
|
|
113
|
+
# Rakefile
|
|
114
|
+
require "ruby_llm/contract/rake_task"
|
|
115
|
+
RubyLLM::Contract::RakeTask.new do |t|
|
|
116
|
+
t.minimum_score = 0.8
|
|
117
|
+
t.maximum_cost = 0.05
|
|
118
|
+
t.fail_on_regression = true
|
|
119
|
+
t.save_baseline = true
|
|
120
|
+
end
|
|
121
|
+
```
|
|
122
|
+
|
|
123
|
+
**Rails apps:** if your adapter is configured in an initializer, use a Proc so context is resolved after Rails boots:
|
|
124
|
+
|
|
125
|
+
```ruby
|
|
126
|
+
RubyLLM::Contract::RakeTask.new do |t|
|
|
127
|
+
t.context = -> { { adapter: RubyLLM::Contract.configuration.default_adapter } }
|
|
128
|
+
t.minimum_score = 0.8
|
|
129
|
+
end
|
|
130
|
+
```
|
|
131
|
+
|
|
132
|
+
## Common patterns
|
|
133
|
+
|
|
134
|
+
| Old pattern | New pattern |
|
|
135
|
+
|---|---|
|
|
136
|
+
| `LlmClient.new(model:).call(prompt)` | `MyStep.run(input)` |
|
|
137
|
+
| `JSON.parse(response[:content])` | `result.parsed_output` |
|
|
138
|
+
| `begin; rescue; retry; end` | `retry_policy models: [...]` |
|
|
139
|
+
| `body[:temperature] = 0.7` | `temperature 0.7` |
|
|
140
|
+
| `AiCallLog.create(...)` | `around_call { \|s, i, r\| ... }` |
|
|
141
|
+
| `response_format: JsonSchema.build(...)` | `output_schema do...end` |
|
|
142
|
+
| `stub_request(:post, ...)` | `stub_step(MyStep, response: {...})` |
|
|
143
|
+
|
|
144
|
+
## Anti-patterns
|
|
145
|
+
|
|
146
|
+
- **Don't migrate markdown/text output services.** The gem is for structured JSON. Prose output gets no benefit from schema validation.
|
|
147
|
+
- **Don't put parallelism in the gem.** Thread management is your app's concern. The gem provides the contract; you call it however you want.
|
|
148
|
+
- **Don't migrate all services at once.** Start with one. Validate the pattern. Then migrate the next.
|
|
149
|
+
|
|
150
|
+
## Parallel batch generation
|
|
151
|
+
|
|
152
|
+
The gem handles single calls. You handle parallelism:
|
|
153
|
+
|
|
154
|
+
```ruby
|
|
155
|
+
class SummarizeBatch < RubyLLM::Contract::Step::Base
|
|
156
|
+
output_schema do
|
|
157
|
+
array :summaries do
|
|
158
|
+
object do
|
|
159
|
+
string :article_id
|
|
160
|
+
string :tldr
|
|
161
|
+
end
|
|
162
|
+
end
|
|
163
|
+
end
|
|
164
|
+
retry_policy models: %w[gpt-4.1-mini gpt-4.1]
|
|
165
|
+
end
|
|
166
|
+
|
|
167
|
+
# Your orchestrator
|
|
168
|
+
threads = 10.times.map do |i|
|
|
169
|
+
Thread.new { Rails.application.executor.wrap { SummarizeBatch.run(input(i)) } }
|
|
170
|
+
end
|
|
171
|
+
results = threads.map(&:value)
|
|
172
|
+
```
|
|
173
|
+
|
|
174
|
+
**Note:** in tests, `stub_step` overrides are thread-local. If your orchestrator spawns threads, propagate overrides manually:
|
|
175
|
+
|
|
176
|
+
```ruby
|
|
177
|
+
overrides = RubyLLM::Contract.step_adapter_overrides.dup
|
|
178
|
+
Thread.new { RubyLLM::Contract.step_adapter_overrides = overrides; SummarizeBatch.run(input) }
|
|
179
|
+
```
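Why the manual copy is needed can be seen in plain Ruby, without the gem. The sketch below assumes the overrides live in thread-scoped storage; `FakeOverrides` and `overrides_seen_by_child` are hypothetical stand-ins for illustration, not the gem's implementation.

```ruby
# Hypothetical stand-in for thread-scoped override storage (not the gem's code).
module FakeOverrides
  KEY = :fake_step_adapter_overrides

  def self.current
    Thread.current.thread_variable_get(KEY)
  end

  def self.current=(value)
    Thread.current.thread_variable_set(KEY, value)
  end
end

# Returns what a child thread sees, with or without manual propagation.
def overrides_seen_by_child(propagate:)
  FakeOverrides.current = { "SummarizeBatch" => :stubbed }
  parent_overrides = FakeOverrides.current.dup

  Thread.new do
    # A fresh thread starts with empty thread-local storage,
    # so the parent's value must be copied in explicitly.
    FakeOverrides.current = parent_overrides if propagate
    FakeOverrides.current
  end.value
end
```

Without propagation the child sees `nil`; with it, the child sees a copy of the parent's overrides.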
|
|
180
|
+
|
|
181
|
+
## See also
|
|
182
|
+
|
|
183
|
+
- [Getting Started](getting_started.md) — the full walkthrough of every feature `SummarizeArticle` uses.
|
|
184
|
+
- [Testing](testing.md) — `stub_step` reference for migrating your test adapter mocks.
|
|
185
|
+
- [Eval-First](eval_first.md) — how to build the "regression" eval from production logs.
|
|
@@ -0,0 +1,160 @@
|
|
|
1
|
+
# Multimodal input
|
|
2
|
+
|
|
3
|
+
> Read this when your contract needs to send a PDF, image, or audio file to the LLM — not just text.
|
|
4
|
+
|
|
5
|
+
`ruby_llm-contract` 0.10.0+ routes attachments through the contract layer, so `max_cost`, `validate`, `retry_policy escalate(...)`, and trace observability still apply. The gem does **not** ship its own multimodal API — it forwards `with:` to `RubyLLM::Chat#ask`, which RubyLLM 1.15+ normalises per provider (Anthropic, OpenAI, Gemini).
|
|
6
|
+
|
|
7
|
+
## Minimal example
|
|
8
|
+
|
|
9
|
+
```ruby
|
|
10
|
+
# app/contracts/extract_invoice_data.rb
|
|
11
|
+
class ExtractInvoiceData < RubyLLM::Contract::Step::Base
|
|
12
|
+
prompt "Extract invoice fields from the attached PDF. Return JSON."
|
|
13
|
+
|
|
14
|
+
output_schema do
|
|
15
|
+
string :vendor
|
|
16
|
+
string :invoice_number
|
|
17
|
+
number :total_amount
|
|
18
|
+
string :currency, enum: %w[USD EUR PLN GBP]
|
|
19
|
+
end
|
|
20
|
+
|
|
21
|
+
validate("currency present") { |o, _| !o[:currency].nil? }
|
|
22
|
+
|
|
23
|
+
# REQUIRED when max_cost is set and the contract receives an attachment.
|
|
24
|
+
# Conservative estimate of attachment input tokens (provider/model-specific).
|
|
25
|
+
attachment_token_estimate 15_000 # ~12 PDF pages at ~1250 tokens/page
|
|
26
|
+
|
|
27
|
+
max_cost 0.10
|
|
28
|
+
|
|
29
|
+
retry_policy do
|
|
30
|
+
escalate "gpt-4.1-mini",
|
|
31
|
+
{ model: "gpt-5", reasoning_effort: "high" }
|
|
32
|
+
end
|
|
33
|
+
end
|
|
34
|
+
|
|
35
|
+
result = ExtractInvoiceData.run(
|
|
36
|
+
"Look for vendor, amount, currency.",
|
|
37
|
+
context: { attachment: "tmp/invoice.pdf" }
|
|
38
|
+
)
|
|
39
|
+
|
|
40
|
+
result.status # => :ok
|
|
41
|
+
result.parsed_output # => { vendor: "...", invoice_number: "...", ... }
|
|
42
|
+
result.trace[:cost] # => 0.0042 (total across attempts)
|
|
43
|
+
```
|
|
44
|
+
|
|
45
|
+
## How it works
|
|
46
|
+
|
|
47
|
+
1. **Input vs attachment.** The `input` argument to `Step.run` is the text prompt. The attachment travels via `context: { attachment: ... }` — opaque to the contract layer, forwarded to the adapter.
|
|
48
|
+
2. **Adapter forwards `with:`.** `RubyLLM::Contract::Adapters::RubyLLM` reads `options[:attachment]` and passes it to `chat.ask(content, with: attachment)`. RubyLLM picks the right wire format per provider.
|
|
49
|
+
3. **Multi-attachment supported.** `with: [pdf1, pdf2]` or `with: { images: [...], pdfs: [...] }` works natively (`RubyLLM::Content#process_attachments`).
|
|
50
|
+
4. **`with: nil` is a no-op.** Text-only contracts are unaffected because the kwarg defaults to nil.
|
|
51
|
+
|
|
52
|
+
## Cost: `attachment_token_estimate` is required
|
|
53
|
+
|
|
54
|
+
The gem cannot count attachment input tokens precisely — vision/PDF token cost depends on model, image resolution, page count, and provider. To keep `max_cost` and `max_input` fail-closed, you declare a **conservative estimate** of attachment input tokens at the class level:
|
|
55
|
+
|
|
56
|
+
```ruby
|
|
57
|
+
class TranscribePDF < RubyLLM::Contract::Step::Base
|
|
58
|
+
# ...
|
|
59
|
+
attachment_token_estimate 15_000 # safe upper bound for ~12-page docs
|
|
60
|
+
max_cost 0.05
|
|
61
|
+
end
|
|
62
|
+
```
|
|
63
|
+
|
|
64
|
+
The same estimate applies to:
|
|
65
|
+
|
|
66
|
+
- **Runtime** (`limit_checker`) — adds the estimate to `input_tokens` before checking `max_cost`/`max_input`. Refuses pre-flight if budget exceeded.
|
|
67
|
+
- **Pre-flight** (`estimate_cost`) — accepts `attachment:` kwarg; same accounting. No drift between estimate and runtime decisions.
|
|
68
|
+
|
|
69
|
+
### Fail-closed without estimate
|
|
70
|
+
|
|
71
|
+
If your contract has `max_cost` or `max_input` set, receives an attachment, and `attachment_token_estimate` is **not declared**, the call fails with `:limit_exceeded` — the gem refuses to spend money on cost it cannot bound.
|
|
72
|
+
|
|
73
|
+
```ruby
|
|
74
|
+
class MyContract < RubyLLM::Contract::Step::Base
|
|
75
|
+
max_cost 0.05
|
|
76
|
+
# no attachment_token_estimate declared
|
|
77
|
+
end
|
|
78
|
+
|
|
79
|
+
result = MyContract.run("text", context: { attachment: "doc.pdf" })
|
|
80
|
+
result.status # => :limit_exceeded
|
|
81
|
+
result.validation_errors # => ["attachment present but attachment_token_estimate not set; ..."]
|
|
82
|
+
```
|
|
83
|
+
|
|
84
|
+
### Opting out per-step
|
|
85
|
+
|
|
86
|
+
If you do not want fail-closed (e.g., experimental or development contracts), set:
|
|
87
|
+
|
|
88
|
+
```ruby
|
|
89
|
+
class FlexibleContract < RubyLLM::Contract::Step::Base
|
|
90
|
+
on_unknown_attachment_size :warn # log a warning instead of refusing
|
|
91
|
+
max_cost 0.05
|
|
92
|
+
end
|
|
93
|
+
```
|
|
94
|
+
|
|
95
|
+
`:warn` is per-step. There is no global opt-out — the same invariant as `on_unknown_pricing`.
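The fail-closed rule and its per-step opt-out can be summarised as a small decision function. This is a plain-Ruby sketch of the behaviour described above, not the gem's `limit_checker`; the method name, keyword names, the `:refuse` default, and the `:proceed` / `:proceed_with_warning` return values are assumptions made for illustration.

```ruby
# Hypothetical sketch of the pre-flight decision for attachments.
# Names and return values are illustrative, not the gem's internals.
def attachment_preflight(limits_set:, attachment:, estimate:, on_unknown: :refuse)
  # No max_cost/max_input, or no attachment: nothing extra to bound.
  return :proceed unless limits_set && attachment

  # Estimate declared: the normal budget check can account for it.
  return :proceed if estimate

  # Estimate missing: refuse by default, warn only if the step opted out.
  on_unknown == :warn ? :proceed_with_warning : :limit_exceeded
end
```

Only the combination "limit set, attachment present, no estimate" reaches the last line, which matches the `:limit_exceeded` example shown earlier.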
|
|
96
|
+
|
|
97
|
+
## Pre-flight cost estimation
|
|
98
|
+
|
|
99
|
+
`estimate_cost` accepts an optional `attachment:` kwarg for budget planning:
|
|
100
|
+
|
|
101
|
+
```ruby
|
|
102
|
+
ExtractInvoiceData.estimate_cost(
|
|
103
|
+
input: "Look for vendor, amount...",
|
|
104
|
+
attachment: "tmp/invoice.pdf"
|
|
105
|
+
)
|
|
106
|
+
# => { model: "gpt-4.1-mini",
|
|
107
|
+
# input_tokens: 15_320,
|
|
108
|
+
# output_tokens_estimate: 256,
|
|
109
|
+
# estimated_cost: 0.0123 }
|
|
110
|
+
```
|
|
111
|
+
|
|
112
|
+
The `input_tokens` figure includes both the text estimate (chars/4 heuristic) AND the declared `attachment_token_estimate`. Pre-flight refusal mirrors runtime: if `attachment` is passed and `attachment_token_estimate` is not declared, `estimate_cost` returns nil and emits the same fail-closed reason.
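As a rough illustration of that accounting, the sketch below adds a chars/4 text estimate to a declared attachment estimate. `estimated_input_tokens` is a hypothetical helper, and rounding up is an assumption; the gem may round differently and may count prompt text beyond the input string.

```ruby
# Hypothetical helper: chars/4 text heuristic plus the declared attachment estimate.
# Rounding up is an assumption made for this sketch.
def estimated_input_tokens(text, attachment_token_estimate: 0)
  text_tokens = (text.length / 4.0).ceil
  text_tokens + attachment_token_estimate
end
```

For example, 1,280 characters of text plus a 15,000-token attachment estimate gives 15,320 input tokens under these assumptions.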
|
|
113
|
+
|
|
114
|
+
**Note on output tokens.** `attachment_token_estimate` adds to `input_tokens` only — not to `output_tokens_estimate`. Vision-heavy responses (long image descriptions, transcribed paragraphs) may exceed the conservative `output_tokens_estimate` default. Treat `estimated_cost` as a floor for budget planning, not a precise predictor; inflate `max_output` or `max_cost` accordingly if your prompt routinely produces verbose descriptions.
|
|
115
|
+
|
|
116
|
+
## Calibrating `attachment_token_estimate`
|
|
117
|
+
|
|
118
|
+
The number depends on provider, model, and content shape. Some baselines:
|
|
119
|
+
|
|
120
|
+
| Content | Provider | Approx tokens |
|
|
121
|
+
|-------------------------------|----------|---------------|
|
|
122
|
+
| 1 PDF page (text-heavy) | OpenAI | ~1000-1500 |
|
|
123
|
+
| 1 PDF page (text-heavy) | Anthropic | ~1000-1500 |
|
|
124
|
+
| 1 image (1024x1024, low res) | OpenAI | ~85 |
|
|
125
|
+
| 1 image (1024x1024, high res) | OpenAI | ~765 |
|
|
126
|
+
| 1 image | Anthropic | ~1500 max |
|
|
127
|
+
| 1 image | Gemini | ~258 (fixed) |
|
|
128
|
+
|
|
129
|
+
Pick a value at or above the provider's worst case. The estimate is a **conservative upper bound for safety**, not a precise count: use it to gate budget refusal, not to predict exact cost.
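One way to derive the number is to multiply the largest document you expect by the top of the range in the table. The helper below is a hypothetical sketch; the default of 1,500 tokens per page is taken from the upper end of the text-heavy PDF rows and should be re-checked for your provider and model.

```ruby
# Hypothetical helper: worst-case pages times worst-case tokens per page.
# The 1_500 default mirrors the upper end of the PDF rows in the table above.
def conservative_attachment_estimate(pages:, worst_case_tokens_per_page: 1_500)
  pages * worst_case_tokens_per_page
end
```

For a 12-page document this yields 18,000 tokens, a more cautious figure than the 15,000 obtained from a ~1,250 tokens/page average.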
|
|
130
|
+
|
|
131
|
+
## Multi-turn caveat
|
|
132
|
+
|
|
133
|
+
If your contract uses history (`add_history`), attachments from prior turns are **not** replayed in 0.10.x. Single-turn multimodal works; follow-up questions on the same document require additional work that is deferred to a later release. See [ADR-0022](../decisions/ADR-0022-v09-multimodal-input.md) (internal) for the rationale.
|
|
134
|
+
|
|
135
|
+
## Provider notes
|
|
136
|
+
|
|
137
|
+
- **OpenAI** — PDFs sent as `type: 'file'` with `file_data` (base64). Images as `image_url`. Audio as `input_audio`. Vision pricing varies by image detail; check the model card.
|
|
138
|
+
- **Anthropic** — PDFs sent as `type: 'document'` with `source.type: 'base64'` or `'url'` (auto-selected). Images same. Page limit ~100 per call.
|
|
139
|
+
- **Gemini** — Everything via `inline_data` with `mime_type`. Multimodal token counting is unified.
|
|
140
|
+
|
|
141
|
+
RubyLLM dispatches on `attachment.type` (`:image`, `:pdf`, `:audio`, `:text`, `:unknown`). Tempfiles must have the right extension (`.pdf`, `.png`, etc.) — RubyLLM detects MIME from the filename; an unsuffixed tempfile becomes `:unknown` and is rejected by the provider.
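When the attachment comes from memory (an upload, a download), Ruby's standard `Tempfile` keeps a suffix if you pass the name as a `[basename, extension]` pair. A minimal sketch; `pdf_tempfile` is a hypothetical helper name, and the bytes are whatever your app already holds:

```ruby
require "tempfile"

# Writes the given bytes to a tempfile whose name ends in ".pdf".
# Passing [basename, extension] is what preserves the suffix.
def pdf_tempfile(bytes)
  file = Tempfile.new(["invoice", ".pdf"])
  file.binmode
  file.write(bytes)
  file.flush
  file
end
```

The caller owns the returned file and should `close` and `unlink` it once the call has finished.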
|
|
142
|
+
|
|
143
|
+
## Testing contracts with attachments
|
|
144
|
+
|
|
145
|
+
The Test adapter ignores the `attachment` context key (deterministic responses by step). To verify your adapter call shape, stub `RubyLLM::Chat` directly:
|
|
146
|
+
|
|
147
|
+
```ruby
|
|
148
|
+
RSpec.describe ExtractInvoiceData do
|
|
149
|
+
it "forwards attachment to chat.ask" do
|
|
150
|
+
expect_any_instance_of(RubyLLM::Chat).to receive(:ask)
|
|
151
|
+
.with(anything, with: "fixtures/invoice.pdf")
|
|
152
|
+
.and_return(double(content: '{"vendor":"X",...}', input_tokens: 200, output_tokens: 50))
|
|
153
|
+
|
|
154
|
+
result = described_class.run("extract", context: { attachment: "fixtures/invoice.pdf" })
|
|
155
|
+
expect(result.status).to eq(:ok)
|
|
156
|
+
end
|
|
157
|
+
end
|
|
158
|
+
```
|
|
159
|
+
|
|
160
|
+
For pre-flight estimate tests, just call `.estimate_cost(input: ..., attachment: ...)` — no adapter stub needed.
|
|
@@ -0,0 +1,131 @@
|
|
|
1
|
+
# Find the cheapest viable fallback list
|
|
2
|
+
|
|
3
|
+
> Read this when you have 2+ evals and want to know empirically — not by guessing — which models belong in `retry_policy` and in what order.
|
|
4
|
+
|
|
5
|
+
You defined `SummarizeArticle` in the [README](../../README.md) with `retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]`. That list was a guess. `optimize_retry_policy` tells you which models your evals actually need, so you stop paying for the strong model when `nano` was enough — or stop shipping `nano` when the hardest eval proves it isn't.
|
|
6
|
+
|
|
7
|
+
## Requirements
|
|
8
|
+
|
|
9
|
+
- **`SummarizeArticle` already has `retry_policy`.** If your step has none, add one first ([getting started](getting_started.md)).
|
|
10
|
+
- **2–3 evals per step.** One eval optimizes for one scenario; with only `smoke`, you get a recommendation that passes smoke but may miss production edge cases. See [eval-first](eval_first.md).
|
|
11
|
+
- **Rake tasks.** The standard `RubyLLM::Contract::RakeTask` includes `ruby_llm_contract:optimize`. Non-Rails projects: set `EVAL_DIRS=...`.
|
|
12
|
+
|
|
13
|
+
> **Two orthogonal dimensions to a retry chain.** A chain element is `{ model:, reasoning_effort: }` — model identity AND thinking budget. `optimize_retry_policy` explores both. You can also fix the thinking config at class level via `thinking effort: :low` (or alias `reasoning_effort :low`) on the Step — it becomes the default for every chain element unless an override is passed. See the `thinking` DSL note at the bottom of this guide.
|
|
14
|
+
|
|
15
|
+
For this guide, assume `SummarizeArticle` has three evals:
|
|
16
|
+
|
|
17
|
+
```ruby
|
|
18
|
+
SummarizeArticle.define_eval("smoke") { ... } # short news article
|
|
19
|
+
SummarizeArticle.define_eval("dense_article") { ... } # long form, 5 takeaways required
|
|
20
|
+
SummarizeArticle.define_eval("critical_tone") { ... } # negative review, tone must match
|
|
21
|
+
```
|
|
22
|
+
|
|
23
|
+
## Offline check first
|
|
24
|
+
|
|
25
|
+
Run once offline to verify the wiring:
|
|
26
|
+
|
|
27
|
+
```bash
|
|
28
|
+
rake ruby_llm_contract:optimize \
|
|
29
|
+
STEP=SummarizeArticle \
|
|
30
|
+
CANDIDATES=gpt-4.1-nano,gpt-4.1-mini@low,gpt-4.1-mini,gpt-4.1
|
|
31
|
+
```
|
|
32
|
+
|
|
33
|
+
Offline uses each eval's `sample_response` — zero API calls. **Every candidate gets the same score** because they all receive the canned response. That's fine for a smoke test (verifying evals load, candidates parse, output renders) but it doesn't compare model quality. For real optimization, go live.
|
|
34
|
+
|
|
35
|
+
## Optimize against real models
|
|
36
|
+
|
|
37
|
+
```bash
|
|
38
|
+
LIVE=1 RUNS=3 rake ruby_llm_contract:optimize \
|
|
39
|
+
STEP=SummarizeArticle \
|
|
40
|
+
CANDIDATES=gpt-4.1-nano,gpt-4.1-mini@low,gpt-4.1-mini,gpt-4.1
|
|
41
|
+
```
|
|
42
|
+
|
|
43
|
+
`LIVE=1` makes real API calls. `RUNS=3` averages each `(candidate, eval)` pair over three runs — necessary because OpenAI forces `temperature=1.0` on gpt-5 / o-series and the same pair can score `0.00` on one run and `1.00` on the next.
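To see why a single run is not enough, consider one `(candidate, eval)` pair that scores `0.00`, `1.00`, `1.00` across three runs. The sketch below assumes the average is a plain arithmetic mean rounded to two decimals; `mean_score` is a hypothetical helper, not part of the gem.

```ruby
# Hypothetical helper: arithmetic mean of per-run scores, rounded for display.
def mean_score(run_scores)
  (run_scores.sum / run_scores.size.to_f).round(2)
end
```

Here the first run alone would report `0.0`, while the three-run mean is `0.67`.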
|
|
44
|
+
|
|
45
|
+
Output (illustrative):
|
|
46
|
+
|
|
47
|
+
```
|
|
48
|
+
SummarizeArticle — fallback list optimization
|
|
49
|
+
|
|
50
|
+
eval 4.1-nano 4.1-mini@low 4.1-mini 4.1
|
|
51
|
+
---------------------------------------------------------
|
|
52
|
+
smoke 1.00 1.00 1.00 1.00
|
|
53
|
+
dense_article 0.67 ← 1.00 1.00 1.00
|
|
54
|
+
critical_tone 0.50 ← 0.67 ← 1.00 1.00
|
|
55
|
+
|
|
56
|
+
Hardest eval: critical_tone
|
|
57
|
+
|
|
58
|
+
Suggested fallback list:
|
|
59
|
+
gpt-4.1-nano — covers 1 eval(s)
|
|
60
|
+
gpt-4.1-mini — passes all 3 evals
|
|
61
|
+
|
|
62
|
+
DSL:
|
|
63
|
+
retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]
|
|
64
|
+
```
|
|
65
|
+
|
|
66
|
+
Reading the table:
|
|
67
|
+
- **`←` marks scores below threshold in the hardest eval.** Not a selection hint — just "this candidate fails the row that matters most".
|
|
68
|
+
- **Hardest eval** = the one that forces the strong fallback. Here, `critical_tone` demands `gpt-4.1-mini`.
|
|
69
|
+
- **Suggested fallback list** = the shortest chain where each step covers more evals, built greedy-cheapest-first. Stops when all evals pass. Order matters: `gpt-4.1-nano` is tried first; on validation failure, the gem falls back to `gpt-4.1-mini`.
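The "hardest eval" idea can be reproduced by hand from the illustrative table. The sketch below is not the gem's algorithm: it assumes candidates are listed cheapest-first, assumes a pass threshold of 0.95, and defines the hardest eval as the one whose cheapest passing candidate sits furthest down that list. All names are hypothetical.

```ruby
# Scores copied from the illustrative table above.
SCORES = {
  "smoke"         => { "gpt-4.1-nano" => 1.00, "gpt-4.1-mini@low" => 1.00, "gpt-4.1-mini" => 1.00, "gpt-4.1" => 1.00 },
  "dense_article" => { "gpt-4.1-nano" => 0.67, "gpt-4.1-mini@low" => 1.00, "gpt-4.1-mini" => 1.00, "gpt-4.1" => 1.00 },
  "critical_tone" => { "gpt-4.1-nano" => 0.50, "gpt-4.1-mini@low" => 0.67, "gpt-4.1-mini" => 1.00, "gpt-4.1" => 1.00 }
}.freeze

# Assumed cost order, cheapest first.
CHEAPEST_FIRST = %w[gpt-4.1-nano gpt-4.1-mini@low gpt-4.1-mini gpt-4.1].freeze

# Cheapest candidate that meets the threshold for one eval row, or nil.
def cheapest_passing(row, threshold: 0.95)
  CHEAPEST_FIRST.find { |candidate| row[candidate] >= threshold }
end

# The eval whose cheapest passing candidate is the most expensive.
def hardest_eval(scores, threshold: 0.95)
  name, _row = scores.max_by do |_name, row|
    CHEAPEST_FIRST.index(cheapest_passing(row, threshold: threshold)) || CHEAPEST_FIRST.size
  end
  name
end
```

With the table's numbers, `smoke` is satisfied by `gpt-4.1-nano`, while `critical_tone` needs `gpt-4.1-mini`, so `critical_tone` is the hardest eval.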
|
|
70
|
+
|
|
71
|
+
Copy the DSL, paste into your step, verify with `rake ruby_llm_contract:eval`. You just dropped `gpt-4.1` from the chain — most requests finish on nano, mini handles what nano misses, and the strong model was never needed.
|
|
72
|
+
|
|
73
|
+
## Measure effective cost before shipping
|
|
74
|
+
|
|
75
|
+
`optimize` shows **first-attempt** cost. In production, a candidate whose validator rejects 20% of outputs actually costs `first_try_cost + fallback_cost × 0.20` per successful output. The first-attempt number hides this.
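The formula is easy to check by hand. A minimal sketch with made-up prices; `effective_cost` is a hypothetical helper, not a gem method:

```ruby
# Expected cost per successful output for a two-tier chain:
# every request pays the first attempt, and a fraction also pays the fallback.
def effective_cost(first_try_cost:, fallback_cost:, fallback_rate:)
  first_try_cost + fallback_cost * fallback_rate
end
```

With a `$0.0010` first attempt, a `$0.0030` fallback, and a 20% rejection rate, the effective cost is `$0.0016`, which is 60% above the first-attempt figure.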
|
|
76
|
+
|
|
77
|
+
`production_mode: { fallback: "..." }` runs each candidate with a runtime `[candidate, fallback]` chain and reports effective cost:
|
|
78
|
+
|
|
79
|
+
```ruby
|
|
80
|
+
SummarizeArticle.compare_models(
|
|
81
|
+
"dense_article",
|
|
82
|
+
candidates: [{ model: "gpt-5-nano" }, { model: "gpt-5-mini", reasoning_effort: "low" }],
|
|
83
|
+
production_mode: { fallback: "gpt-5-mini" }
|
|
84
|
+
).print_summary
|
|
85
|
+
```
|
|
86
|
+
|
|
87
|
+
Output (live mode, illustrative):
|
|
88
|
+
|
|
89
|
+
```
|
|
90
|
+
dense_article — model comparison
|
|
91
|
+
|
|
92
|
+
Chain first-attempt fallback % effective cost latency score
|
|
93
|
+
---------------------------------------------------------------------------------------------------
|
|
94
|
+
gpt-5-nano → gpt-5-mini $0.0010 33% $0.0018 164ms 1.00
|
|
95
|
+
gpt-5-mini (effort: low) → gpt-5-mini $0.0015 5% $0.0016 210ms 1.00
|
|
96
|
+
gpt-5-mini $0.0030 — $0.0030 220ms 1.00
|
|
97
|
+
```
|
|
98
|
+
|
|
99
|
+
- **first-attempt** — cost of the first run alone.
|
|
100
|
+
- **fallback %** — fraction of cases where the validator rejected and the fallback ran.
|
|
101
|
+
- **effective cost** — total per successful output including retries.
|
|
102
|
+
- **`—`** — candidate equals fallback, no chain to observe.
|
|
103
|
+
|
|
104
|
+
Run this before finalizing: a candidate saving 3× on first-attempt but falling back 60% of the time may save only 1.2× in production.
|
|
105
|
+
|
|
106
|
+
**Scope.** Single-fallback (2-tier) chains only; for multi-tier chains, inspect `trace.attempts`. This is a step-level feature: calling it on `Pipeline::Base` raises `ArgumentError`.
|
|
107
|
+
|
|
108
|
+
## When results look wrong
|
|
109
|
+
|
|
110
|
+
- **"No viable chain" from a single live run.** Re-run with `RUNS=3`. If scores jump, the first run was noise. Never trust single-run results with gpt-5 / o-series in the pool — `temperature=1.0` is server-enforced.
|
|
111
|
+
- **Every candidate fails the same eval**, including the strongest. The eval is rejecting correct answers. Run the step directly (`context: { retry_policy_override: nil, model: "gpt-4.1" }`), inspect the output, compare with the `verify` block. Loosen the eval if the output is correct but not one of the accepted values.
|
|
112
|
+
- **Testing one specific hypothesis.** (e.g. "does `mini@medium` help on `critical_tone`?") Use `SummarizeArticle.compare_models("critical_tone", candidates: [{ model: "gpt-5-mini", reasoning_effort: "medium" }], runs: 3)` directly — three calls instead of rerunning the whole optimize pass.
|
|
113
|
+
|
|
114
|
+
## Programmatic API names
|
|
115
|
+
|
|
116
|
+
Metrics exposed on `Report` / `AggregatedReport` keep their original names: `single_shot_cost`, `single_shot_latency_ms`, `escalation_rate`. The optimize Result struct also exposes `hardest_eval` as an alias for `constraining_eval`.
|
|
117
|
+
|
|
118
|
+
## thinking DSL note
|
|
119
|
+
|
|
120
|
+
Set the default reasoning effort once on the Step class — mirrors `RubyLLM::Agent.thinking` exactly:
|
|
121
|
+
|
|
122
|
+
```ruby
|
|
123
|
+
class SummarizeArticle < RubyLLM::Contract::Step::Base
|
|
124
|
+
model "gpt-5-nano"
|
|
125
|
+
thinking effort: :low # canonical
|
|
126
|
+
# or
|
|
127
|
+
reasoning_effort :low # alias for thinking(effort: :low)
|
|
128
|
+
end
|
|
129
|
+
```
|
|
130
|
+
|
|
131
|
+
Forwarded to `Chat#with_thinking(**)` through the adapter — works provider-agnostically (OpenAI `reasoning_effort`, Anthropic extended-thinking budget). A per-call override via `context: { reasoning_effort: :high }` still wins over the class default.
|
|
@@ -0,0 +1,93 @@
|
|
|
1
|
+
# Output Schema
|
|
2
|
+
|
|
3
|
+
> Read this as a reference for the schema DSL — every constraint, nested objects, arrays of objects, the full pattern table.
|
|
4
|
+
|
|
5
|
+
Declare the expected output structure using [ruby_llm-schema](https://github.com/danielfriis/ruby_llm-schema) DSL. The schema serves **two purposes**:
|
|
6
|
+
|
|
7
|
+
1. **Output validation** — replaces type and shape checks (enums, ranges, required fields). One declaration instead of many.
|
|
8
|
+
2. **Provider-side request** — with the RubyLLM adapter, the schema is sent to the LLM provider via `chat.with_schema(...)`, asking the model to return JSON matching the shape. Cheaper models sometimes ignore the request, which is why client-side validation (point 1) still matters.
|
|
9
|
+
|
|
10
|
+
> **Same DSL `RubyLLM::Agent.schema` accepts.** `output_schema do ... end` here is a wrapper around `RubyLLM::Schema.create(&block)` plus a client-side validation step. `RubyLLM::Agent.schema` accepts the same block — choosing one over the other does not change the schema language. The difference: `Agent.schema` lets you pass a `Proc` evaluated in runtime context (dynamic per-call schema); `Step.output_schema` is eager-compiled at class load and additionally drives `output_type` inference and `Validator.validate`. Both can coexist.
|
|
11
|
+
|
|
12
|
+
All examples below extend the `SummarizeArticle` step from the [README](../../README.md).
|
|
13
|
+
|
|
14
|
+
## Schema replaces type and shape checks
|
|
15
|
+
|
|
16
|
+
```ruby
|
|
17
|
+
# WITHOUT schema — many validates:
|
|
18
|
+
validate("tldr must be a string") { |o| o[:tldr].is_a?(String) }
|
|
19
|
+
validate("takeaways must be an array") { |o| o[:takeaways].is_a?(Array) }
|
|
20
|
+
validate("takeaways 3 to 5") { |o| (3..5).cover?(o[:takeaways].size) }
|
|
21
|
+
ALLOWED_TONES = %w[neutral positive negative analytical].freeze
|
|
22
|
+
validate("tone must be an allowed label") { |o| ALLOWED_TONES.include?(o[:tone]) }
|
|
23
|
+
|
|
24
|
+
# WITH schema — one declaration:
|
|
25
|
+
output_schema do
|
|
26
|
+
string :tldr
|
|
27
|
+
array :takeaways, of: :string, min_items: 3, max_items: 5
|
|
28
|
+
string :tone, enum: %w[neutral positive negative analytical]
|
|
29
|
+
end
|
|
30
|
+
```
|
|
31
|
+
|
|
32
|
+
## Nested objects in arrays
|
|
33
|
+
|
|
34
|
+
Use `object do...end` inside `array` when you need more than a primitive per element. Concrete scenario: the UI card grows a "confidence bar" next to each takeaway so editors can see which points the model was sure about vs guessing. That requires `confidence` paired with `text`, not two parallel arrays that could desync. Nested objects make the pairing a schema invariant:
|
|
35
|
+
|
|
36
|
+
```ruby
|
|
37
|
+
output_schema do
|
|
38
|
+
string :tldr
|
|
39
|
+
array :takeaways, min_items: 3, max_items: 5 do
|
|
40
|
+
object do
|
|
41
|
+
string :text
|
|
42
|
+
number :confidence, minimum: 0.0, maximum: 1.0
|
|
43
|
+
end
|
|
44
|
+
end
|
|
45
|
+
string :tone, enum: %w[neutral positive negative analytical]
|
|
46
|
+
end
|
|
47
|
+
```
|
|
48
|
+
|
|
49
|
+
## Schema pattern reference
|
|
50
|
+
|
|
51
|
+
| Your output looks like | Schema pattern | Example |
|
|
52
|
+
|---|---|---|
|
|
53
|
+
| `{"tldr": "...", "tone": "positive"}` | Flat fields | `string :tldr; string :tone, enum: [...]` |
|
|
54
|
+
| `{"takeaways": ["...", "..."]}` | Array of primitives | `array :takeaways, of: :string, min_items: 3, max_items: 5` |
|
|
55
|
+
| `{"takeaways": [{"text": "...", "confidence": 0.9}]}` | Array of objects | `array :takeaways do; object do; string :text; number :confidence; end; end` |
|
|
56
|
+
|
|
57
|
+
Without `object do...end`, `array :takeaways do; string :text; end` tells the provider "takeaways is an array of strings" — not objects. That's what you get back.
|
|
58
|
+
|
|
59
|
+
## Why schema alone is not enough
|
|
60
|
+
|
|
61
|
+
Schema validates **shape** — correct types, allowed values, field presence. But LLMs can return structurally valid JSON that is **logically wrong**. Validates catch what schema can't:
|
|
62
|
+
|
|
63
|
+
```ruby
|
|
64
|
+
output_schema do
|
|
65
|
+
string :tldr
|
|
66
|
+
array :takeaways, of: :string, min_items: 3, max_items: 5
|
|
67
|
+
string :tone, enum: %w[neutral positive negative analytical]
|
|
68
|
+
end
|
|
69
|
+
|
|
70
|
+
# Schema allows any string for :tldr — but a 500-char "summary" breaks the UI card.
|
|
71
|
+
validate("TL;DR fits the card") { |o, _| o[:tldr].length <= 200 }
|
|
72
|
+
|
|
73
|
+
# Schema enforces 3–5 takeaways — but says nothing about them being distinct.
|
|
74
|
+
validate("takeaways are unique") { |o, _| o[:takeaways] == o[:takeaways].uniq }
|
|
75
|
+
|
|
76
|
+
# Schema can't express cross-field rules.
|
|
77
|
+
validate("critical tone requires at least one concrete risk") do |o, _|
|
|
78
|
+
next true unless o[:tone] == "negative"
|
|
79
|
+
o[:takeaways].any? { |t| t.match?(/fail|break|crash|outage|vulnerab/i) }
|
|
80
|
+
end
|
|
81
|
+
```
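To see these checks reject a structurally valid output, the plain-Ruby sketch below re-implements two of them as lambdas and runs them against a sample hash. It does not use the gem; `CHECKS` and `failed_checks` are hypothetical names for illustration.

```ruby
# Two of the checks above, as plain lambdas keyed by their labels.
CHECKS = {
  "TL;DR fits the card"  => ->(o) { o[:tldr].length <= 200 },
  "takeaways are unique" => ->(o) { o[:takeaways] == o[:takeaways].uniq }
}.freeze

# Returns the labels of every check the output fails.
def failed_checks(output)
  CHECKS.reject { |_label, check| check.call(output) }.keys
end
```

An output with a 500-character `tldr` and a repeated takeaway has the right shape for the schema but fails both checks.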
|
|
82
|
+
|
|
83
|
+
## Supported constraints
|
|
84
|
+
|
|
85
|
+
| Constraint | Types | Example |
|
|
86
|
+
|---|---|---|
|
|
87
|
+
| `enum` | string, integer | `string :tone, enum: %w[neutral positive negative analytical]` |
|
|
88
|
+
| `minimum` / `maximum` | number, integer | `number :confidence, minimum: 0.0, maximum: 1.0` |
|
|
89
|
+
| `min_length` / `max_length` | string | `string :tldr, min_length: 1, max_length: 200` |
|
|
90
|
+
| `min_items` / `max_items` | array | `array :takeaways, of: :string, min_items: 3, max_items: 5` |
|
|
91
|
+
| `additional_properties` | object | Set to `false` in the schema to reject extra keys |
|
|
92
|
+
|
|
93
|
+
Keyword args use Ruby snake_case (`min_length`, `min_items`). The DSL converts them internally to JSON Schema's camelCase (`minLength`, `minItems`) before sending the schema to the provider — you don't need to write camelCase in Ruby.
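The mapping itself is mechanical. A sketch of the idea in plain Ruby, with a hypothetical helper name; this illustrates the naming rule and is not the DSL's actual conversion code:

```ruby
# Converts a snake_case keyword into the camelCase JSON Schema key.
def to_json_schema_key(snake_key)
  snake_key.to_s.gsub(/_([a-z])/) { Regexp.last_match(1).upcase }
end
```

So `:min_length` becomes `"minLength"` and `:additional_properties` becomes `"additionalProperties"`, while single-word keys such as `:enum` pass through unchanged.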
|