ruby_llm-contract 0.10.1 → 0.10.2
- checksums.yaml +4 -4
- data/CHANGELOG.md +15 -1
- data/Gemfile.lock +2 -2
- data/README.md +2 -2
- data/docs/architecture.md +50 -0
- data/docs/guide/best_practices.md +136 -0
- data/docs/guide/eval_first.md +192 -0
- data/docs/guide/getting_started.md +199 -0
- data/docs/guide/migration.md +185 -0
- data/docs/guide/multimodal_input.md +160 -0
- data/docs/guide/optimizing_retry_policy.md +131 -0
- data/docs/guide/output_schema.md +93 -0
- data/docs/guide/pipeline.md +154 -0
- data/docs/guide/prompt_ast.md +76 -0
- data/docs/guide/rails_integration.md +218 -0
- data/docs/guide/relation_to_agent.md +52 -0
- data/docs/guide/relation_to_tribunal.md +135 -0
- data/docs/guide/testing.md +282 -0
- data/docs/guide/why.md +103 -0
- data/lib/ruby_llm/contract/version.rb +1 -1
- data/ruby_llm-contract.gemspec +1 -1
- metadata +16 -1
@@ -0,0 +1,154 @@
# Pipeline

> Read this when one step isn't enough: you need multiple steps with fail-fast behavior, automatic data threading, and per-step models.

Chain multiple steps with automatic data threading, fail-fast, per-step models, trace, and timeout.

A pipeline needs more than one step to be interesting. This guide grows the `SummarizeArticle` step from the [README](../../README.md) into a three-step content pipeline that tags the summary and routes it to a UI card.

## Full example: article → summary → hashtags → card

```ruby
# Step 1 — the flagship step from README, unchanged.
class SummarizeArticle < RubyLLM::Contract::Step::Base
  prompt <<~PROMPT
    Summarize this article for a UI card. Return a short TL;DR,
    3 to 5 key takeaways, and a tone label.

    {input}
  PROMPT

  output_schema do
    string :tldr
    array :takeaways, of: :string, min_items: 3, max_items: 5
    string :tone, enum: %w[neutral positive negative analytical]
  end

  validate("TL;DR fits the card") { |o, _| o[:tldr].length <= 200 }
end

# Step 2 — reads SummarizeArticle's output, produces hashtags suitable for social posts.
class GenerateHashtags < RubyLLM::Contract::Step::Base
  input_type Hash

  output_schema do
    # Carry through the summary fields downstream consumers (and the next step) need.
    string :tldr
    array :takeaways, of: :string
    string :tone, enum: %w[neutral positive negative analytical]
    # Add the new field.
    array :hashtags, of: :string, min_items: 2, max_items: 5
  end

  prompt do
    rule "Preserve tldr / takeaways / tone exactly as given."
    user "Article summary: {tldr}\nTone: {tone}\nGenerate 2 to 5 concise hashtags."
  end

  validate("tone preserved") { |o, input| o[:tone] == input[:tone] }
end

# Step 3 — final shape the UI card consumes.
class BuildArticleCard < RubyLLM::Contract::Step::Base
  input_type Hash

  output_schema do
    string :headline
    string :summary
    array :hashtags, of: :string
    string :sentiment_icon, enum: %w[😐 🙂 ⚠️ 🧠]
  end

  prompt do
    rule "Headline <= 70 chars. Summary is the incoming tldr reprinted verbatim."
    rule "Pick sentiment_icon from: 😐 neutral, 🙂 positive, ⚠️ negative, 🧠 analytical."
    user "TL;DR: {tldr}\nTone: {tone}\nHashtags: {hashtags}"
  end

  validate("summary is the tldr verbatim") { |o, input| o[:summary] == input[:tldr] }
end

# Pipeline: summarize → hashtags → card
class ArticleCardPipeline < RubyLLM::Contract::Pipeline::Base
  step SummarizeArticle, as: :summarize
  step GenerateHashtags, as: :tag
  step BuildArticleCard, as: :card
end
```

## Running and inspecting

```ruby
result = ArticleCardPipeline.run(article_text, context: { adapter: adapter })
result.ok?                          # => true
result.outputs_by_step[:summarize]  # => { tldr: "...", takeaways: [...], tone: "analytical" }
result.outputs_by_step[:card]       # => { headline: "...", summary: "...", ... }
result.trace.total_cost             # => 0.000128 (all steps combined)
result.trace.total_latency_ms       # => 2340
```

## Fail-fast behavior

When a step's schema, validate, or preflight check rejects the output, the pipeline stops there — downstream steps never run:

```ruby
# Summarize returns a TL;DR over 200 chars → the "TL;DR fits the card" validate fails
adapter = RubyLLM::Contract::Adapters::Test.new(response: {
  tldr: "x" * 500,
  takeaways: %w[one two three],
  tone: "neutral"
})

result = ArticleCardPipeline.run("article text", context: { adapter: adapter })
result.failed?      # => true
result.failed_step  # => :summarize (validate rejected; retries exhausted)
# tag and card never run — no downstream tokens spent on garbage
```
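
The stop-at-first-failure behaviour can be pictured in plain Ruby. The sketch below is not the gem's implementation: `StepResult`, `run_fail_fast`, and the three lambdas are hypothetical stand-ins for steps, and the 200-character check only mimics the "TL;DR fits the card" validate. It shows how threading each output into the next step combines with an early return.

```ruby
# Plain-Ruby sketch of fail-fast data threading; not the gem's implementation.
StepResult = Struct.new(:ok, :output)

# Hypothetical helper: runs named steps in order, feeding each output to the next.
def run_fail_fast(steps, input)
  outputs = {}
  steps.each do |name, step|
    result = step.call(input)
    # Stop at the first rejected step; later steps are never invoked.
    return { ok: false, failed_step: name, outputs: outputs } unless result.ok

    outputs[name] = result.output
    input = result.output # thread this step's output into the next step
  end
  { ok: true, failed_step: nil, outputs: outputs }
end

calls = [] # records which stand-in steps actually ran
steps = {
  summarize: ->(text) { calls << :summarize; StepResult.new(text.length <= 200, { tldr: text }) },
  tag:       ->(h)    { calls << :tag;       StepResult.new(true, h.merge(hashtags: %w[#ruby #llm])) },
  card:      ->(h)    { calls << :card;      StepResult.new(true, { summary: h[:tldr] }) }
}

failed = run_fail_fast(steps, "x" * 500)        # rejected by the first step
passed = run_fail_fast(steps, "short article")  # all three steps run
```

After both runs, `calls` shows that `tag` and `card` were invoked only for the passing input, which is the property that saves downstream tokens.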

## Per-step model override

```ruby
class ArticleCardPipeline < RubyLLM::Contract::Pipeline::Base
  step SummarizeArticle, as: :summarize, model: "gpt-4.1-mini"
  step GenerateHashtags, as: :tag, model: "gpt-4.1-nano"
  step BuildArticleCard, as: :card, model: "gpt-4.1-nano"
end
```

## Timeout

```ruby
result = ArticleCardPipeline.run(article_text, timeout_ms: 30_000)
```

## Pipeline eval

```ruby
ArticleCardPipeline.define_eval("e2e") do
  add_case "ruby 3.4 release",
    input: "Ruby 3.4 ships with frozen string literals by default and better YJIT...",
    expected: { sentiment_icon: "🧠" }
end

report = ArticleCardPipeline.run_eval("e2e", context: { model: "gpt-4.1-mini" })
report.print_summary
```

## Pretty print

```ruby
puts result
# Pipeline: ok 3 steps 1234ms 450+120 tokens trace=abc12345

result.pretty_print
# Full ASCII table with per-step outputs (Pipeline::Result)

# For eval reports, use print_summary instead:
report.print_summary
# Tabular pass/fail breakdown (Eval::Report)
```

## See also

- [Testing](testing.md) — `ArticleCardPipeline.test(..., responses: { summarize: ..., tag: ..., card: ... })` for pipeline-level spec adapters.
- [Optimizing retry_policy](optimizing_retry_policy.md) — `optimize_retry_policy` runs per-step; pipelines benchmark one step at a time.
@@ -0,0 +1,76 @@
# Prompt AST

> Read this when your prompt has more than one shape (per-tenant, per-language, per-audience) and string concatenation is starting to drift.

Prompts are structured data, not strings. That matters the moment `SummarizeArticle` has to ship in more than one shape — a different audience per tenant, a different language per region, a different tone template for a B2B vs consumer card. Building those variants by string-concatenating a monolithic prompt leads to silent drift across environments. The AST gives you typed nodes (`system`, `rule`, `section`, `example`, `user`) that compose, diff, and snapshot-test cleanly.

Available node types:

```ruby
prompt do
  system "You summarize articles for a UI card."      # system message
  rule "Return valid JSON only."                      # appended as separate system message
  section "AUDIENCE", "Rails developers"              # labeled system message: [AUDIENCE]\n...
  example input: "Ruby 3.4 ships frozen strings...",  # user/assistant few-shot pair
          output: '{"tldr":"...","takeaways":[...],"tone":"analytical"}'
  user "{input}"                                      # user message with interpolation
end
```

Or just a plain string (wraps as a single user message):

```ruby
prompt "Summarize this article for a UI card. {input}"
```

The AST is immutable, diffable, and hashable. Useful for snapshot testing and auditing prompt changes.

## Hash inputs with variable interpolation

When input is a Hash, each key becomes a template variable. Concrete scenario: a multi-language newsletter product where the same article has to be summarised in Polish for EU subscribers and in English for US subscribers, with different audiences per tier (Rails developers vs engineering managers). Hash inputs let one step cover all of these without forking the class:

```ruby
class SummarizeArticle < RubyLLM::Contract::Step::Base
  input_type RubyLLM::Contract::Types::Hash.schema(
    article: RubyLLM::Contract::Types::String,
    audience: RubyLLM::Contract::Types::String,
    language: RubyLLM::Contract::Types::String
  )

  prompt do
    system "You summarize articles for a UI card."
    rule "Write the TL;DR and takeaways in {language}."
    section "AUDIENCE", "{audience}"
    user "{article}"
  end

  output_schema do
    string :tldr
    array :takeaways, of: :string, min_items: 3, max_items: 5
    string :tone, enum: %w[neutral positive negative analytical]
  end
end
```

Every `{key}` in a prompt node is pulled from the input hash at run time. Missing keys raise — making wire-up bugs loud, not silent.
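
The substitution rule can be illustrated in plain Ruby. This is only a sketch of the idea, not the gem's code: `interpolate` is a hypothetical helper, and raising `KeyError` via `Hash#fetch` is an assumption made for the demo, since this guide does not say which error class the gem raises.

```ruby
# Plain-Ruby sketch of `{key}` substitution; the gem's own interpolation code
# and the exact error it raises are not shown in this guide.
def interpolate(template, input)
  template.gsub(/\{(\w+)\}/) do
    # Hash#fetch raises KeyError for a missing key, so a typo fails loudly.
    input.fetch(Regexp.last_match(1).to_sym).to_s
  end
end

input = { article: "Ruby 3.4 ships...", audience: "Rails developers", language: "Polish" }

rule_text = interpolate("Write the TL;DR and takeaways in {language}.", input)

missing_key_error =
  begin
    interpolate("Audience: {audence}", input) # deliberate typo in the key
  rescue KeyError => e
    e
  end
```

Here `rule_text` carries the language from the hash, while the misspelled `{audence}` surfaces as an exception instead of an empty string in the prompt.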

> **Not the same as `RubyLLM::Agent.inputs`.** `Step.input_type` is a *runtime type check* on the positional argument passed to `run(input)` — it raises `TypeError` if the input violates the declared shape. `RubyLLM::Agent.inputs` is a list of *named template locals* injected into ERB instructions. They solve different problems (validation vs templating) and can coexist on the same project.

> **Not the same as `RubyLLM::Agent` ERB templates either.** The `prompt do ... end` DSL above builds a multi-role message list (`system` / `user` / `assistant` / `example` nodes) into a node-AST. `Agent.instructions :name` loads a single-string ERB file from `app/prompts/<agent_path>/<name>.txt.erb` for the system prompt only. Different output shape, different scope. Variable interpolation here uses `{key}` substitution; ERB uses full `<%= ruby %>`.

## Cross-validating output against input

Validate blocks support 2-arity `|output, input|` so you can check that the model's answer stays faithful to the request:

```ruby
validate("tldr is not just the article reprinted") do |output, input|
  # Guard against lazy models that return the input verbatim.
  output[:tldr].length < input[:article].length / 2
end

validate("no takeaway repeats the TL;DR") do |output, _input|
  output[:takeaways].none? { |t| t == output[:tldr] }
end
```

The first example uses `input`; the second ignores it. Both blocks declare two parameters; the leading underscore in `_input` is the usual Ruby convention for a parameter that is intentionally unused.
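
One way a runner could support both block shapes is to inspect `Proc#arity` before calling. The sketch below is an assumption about how such dispatch might look, not the gem's code; `call_validator` is a hypothetical helper, and whether the gem also accepts one-parameter blocks is inferred from the wording above rather than stated.

```ruby
# Hypothetical dispatcher: passes the input only to blocks that declare two parameters.
def call_validator(block, output, input)
  block.arity == 2 ? block.call(output, input) : block.call(output)
end

short_enough = proc { |output| output[:tldr].length <= 200 }
not_a_copy   = proc { |output, input| output[:tldr].length < input[:article].length / 2 }
no_repeat    = proc { |output, _input| output[:takeaways].none? { |t| t == output[:tldr] } }

input  = { article: "a" * 400 }
output = { tldr: "b" * 120, takeaways: %w[one two three] }

# Every validator passes for a short, non-repeating summary.
results = [short_enough, not_a_copy, no_repeat].map { |v| call_validator(v, output, input) }

# A TL;DR as long as the article is rejected by the cross-validation.
copied = call_validator(not_a_copy, { tldr: "a" * 400, takeaways: [] }, input)
```

The `_input` block still reports an arity of 2, so it is dispatched the same way as the block that actually reads `input`.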
@@ -0,0 +1,218 @@
# Rails integration

> Read this when you've seen the `SummarizeArticle` example and want to know where contract steps fit in an actual Rails app — directory, initializer, jobs, logging, tests, CI. Skip if you're writing a non-Rails script.

Seven pre-emptive answers to the questions that come up first.

## 1. Where do step classes live?

**Recommended: `app/contracts/`.** The gem's Railtie auto-reloads eval files under `app/contracts/eval/` and `app/steps/eval/` in development, so picking `app/contracts/` aligns with the default reload paths.

```
app/contracts/summarize_article.rb  # class SummarizeArticle
```

Any autoloaded directory works (`app/llm_steps/`, `app/services/llm/`, etc.) — Rails 7/8 autoloading resolves them all, and the step class itself does not depend on the path. Pick the default if you have no stronger convention.

Keep evals in the same file as the step (`define_eval` block at the bottom of the class) — one source of truth per contract. If your evals grow too large for the class file, move them to `app/contracts/eval/summarize_article_eval.rb` — the Railtie reloads that directory explicitly in development.

## 2. Initializer configuration

```ruby
# config/initializers/ruby_llm_contract.rb
RubyLLM.configure do |c|
  c.openai_api_key = ENV.fetch("OPENAI_API_KEY", nil)
  c.anthropic_api_key = ENV.fetch("ANTHROPIC_API_KEY", nil)
end

RubyLLM::Contract.configure do |c|
  c.default_model = Rails.env.production? ? "gpt-5-mini" : "gpt-5-nano"
  c.default_adapter = RubyLLM::Contract::Adapters::RubyLLM.new
end
```

In specs, override the default adapter to `Adapters::Test` in `spec_helper.rb` (or use `stub_step` per-example — see §5).

Evals defined inside a step class (the recommended pattern) are picked up as soon as Rails autoloads the class — you do not need the eval file in any special directory. If you move evals into separate files under `app/contracts/eval/` or `app/steps/eval/`, the gem's Railtie reloads those two directories explicitly on each request in development; other directories follow standard Rails autoloading rules.

## 3. Background jobs — never call LLMs inline in a controller

LLM calls take 0.8–5 seconds and can fail, so run step invocations from an ActiveJob rather than inside the request cycle (§6 covers the synchronous exceptions):

```ruby
class SummarizeArticleJob < ApplicationJob
  queue_as :llm

  def perform(article_id)
    article = Article.find(article_id)
    result = SummarizeArticle.run(article.body)

    if result.ok?
      # parsed_output uses symbol keys in memory. jsonb/json columns round-trip
      # as strings on reload, so either use deep_stringify_keys before write or
      # access downstream with string keys — pick one convention and stick to it.
      article.update!(summary: result.parsed_output.deep_stringify_keys)
    else
      article.update!(summary_error: result.validation_errors.join("; "))
    end
  end
end
```

`SummarizeArticleJob.perform_later(article.id)` returns in milliseconds; the controller stays responsive. If you use Sidekiq, pair `queue_as :llm` with a dedicated concurrency cap in `sidekiq.yml` so long-running LLM calls do not starve other job queues (mailers, webhooks, cleanups).
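
The key-convention comment in the job above is easy to verify outside Rails. The sketch below uses only the standard `json` library to imitate what a JSON column does on write and reload; the sample hash is made up for illustration, and the real persistence path goes through ActiveRecord rather than these two calls.

```ruby
require "json"

# Symbol-keyed hash, shaped like the step output in this guide (values are illustrative).
parsed_output = { tldr: "Ruby 3.4 ships", takeaways: %w[a b c], tone: "analytical" }

# Imitate a JSON column: serialize on write, parse on reload.
reloaded = JSON.parse(JSON.generate(parsed_output))

string_lookup = reloaded["tldr"] # present: keys come back as strings
symbol_lookup = reloaded[:tldr]  # nil: the symbol key no longer exists
```

This is why the guide recommends choosing string keys for anything read back from the database.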

## 4. Logging and observability

`around_call` runs once per `run()` with the final `Result` (after all retries). Use it to write one row per LLM call:

```ruby
class SummarizeArticle < RubyLLM::Contract::Step::Base
  # ... prompt, schema, validates ...

  around_call do |step, input, result|
    AiCallLog.create!(
      step: step.name,
      model: result.trace[:model],
      status: result.status.to_s,
      latency_ms: result.trace[:latency_ms],
      input_tokens: result.trace[:usage]&.dig(:input_tokens),
      output_tokens: result.trace[:usage]&.dig(:output_tokens),
      cost: result.trace[:cost],
      validation_errors: result.validation_errors
    )
  end
end
```

The `AiCallLog` model assumed above is a thin audit record. One possible migration:

```ruby
# rails g model AiCallLog step:string model:string status:string ...
create_table :ai_call_logs do |t|
  t.string :step, null: false
  t.string :model
  t.string :status, null: false
  t.integer :latency_ms
  t.integer :input_tokens
  t.integer :output_tokens
  t.decimal :cost, precision: 10, scale: 6
  t.jsonb :validation_errors, default: []
  t.timestamps
end
add_index :ai_call_logs, :step
add_index :ai_call_logs, :status
```

For Appsignal / Honeybadger / Datadog, emit an `ActiveSupport::Notifications` event from inside the same `around_call` and subscribe in an initializer:

```ruby
class SummarizeArticle < RubyLLM::Contract::Step::Base
  # ... prompt, schema, validates ...

  around_call do |step, _input, result|
    ActiveSupport::Notifications.instrument(
      "ruby_llm_contract.run",
      step: step.name, model: result.trace[:model], status: result.status
    )
  end
end

# config/initializers/observability.rb
ActiveSupport::Notifications.subscribe("ruby_llm_contract.run") do |*, payload|
  Appsignal.increment_counter("llm.run.#{payload[:status]}", 1, step: payload[:step])
end
```

Trace inspection in an admin UI: `result.trace[:attempts]` gives you per-attempt model, status, cost, latency — render it in a partial to debug production failures without re-running.

## 5. Testing — RSpec and Minitest

Add to `spec/spec_helper.rb` (or `test_helper.rb`):

```ruby
require "ruby_llm/contract/rspec" # or ruby_llm/contract/minitest
```

Then in specs:

```ruby
RSpec.describe ArticlesController do
  it "saves the summary when the step passes" do
    stub_step(SummarizeArticle, response: {
      tldr: "...", takeaways: %w[a b c], tone: "analytical"
    })

    # Assumes `article` comes from a let/fixture and that enqueued jobs run
    # inline in this spec, because the action only enqueues SummarizeArticleJob.
    post :summarize, params: { id: article.id }

    # NOTE: jsonb/json column round-trips as string keys on reload.
    expect(article.reload.summary["tldr"]).to eq("...")
  end
end
```

For the step itself, use the `satisfy_contract` and `pass_eval` matchers — details in the [Testing guide](testing.md).

## 6. Error handling in controllers

Never raise on `result.failed?` — that crashes the request. Branch instead:

```ruby
class ArticlesController < ApplicationController
  def summarize
    SummarizeArticleJob.perform_later(params[:id])
    head :accepted
  end

  # For synchronous cases (admin tools, small content):
  def preview
    result = SummarizeArticle.run(@article.body)

    if result.ok?
      render json: result.parsed_output
    else
      Rails.logger.warn "[llm] #{SummarizeArticle.name} failed: #{result.status}"
      render json: { error: "Could not summarize; try again shortly." }, status: :service_unavailable
    end
  end
end
```

When `retry_policy` is exhausted and every model has failed, `result.failed?` is true but `result.parsed_output` still contains the last attempt's output, which is useful for logging what the model *did* return before the validate rejected it.

## 7. CI gate — block regressions before merge

Add to your `Rakefile`:

```ruby
require "ruby_llm/contract/rake_task"

RubyLLM::Contract::RakeTask.new do |t|
  t.minimum_score = 0.8
  t.maximum_cost = 0.05
  t.fail_on_regression = true
  t.save_baseline = false # read-only in CI; refresh baselines manually
end
```

Then wire it in GitHub Actions:

```yaml
- name: LLM contract evals
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: bundle exec rake ruby_llm_contract:eval
```

The job fails when a previously-passing eval case now fails, when the average score drops below the threshold, or when total cost exceeds the cap. That is the signal that blocks a prompt regression or an accidental model upgrade from shipping.
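
The three failure conditions can be written down as a small decision function. This is a plain-Ruby sketch of the logic as described above, not the rake task's implementation: `gate_failures`, the shape of the `baseline` and `current` hashes, and the case names are all hypothetical.

```ruby
# Hypothetical helper mirroring the three failure conditions described above.
def gate_failures(baseline:, current:, total_cost:, minimum_score:, maximum_cost:)
  failures = []

  # 1. A case that passed in the baseline fails now.
  regressed = baseline.select { |name, passed| passed && current.dig(name, :passed) == false }.keys
  failures << "regressed: #{regressed.join(', ')}" unless regressed.empty?

  # 2. The average score is below the threshold.
  scores = current.values.map { |c| c[:score] }
  average = scores.sum / scores.size.to_f
  failures << "average score #{average.round(2)} < #{minimum_score}" if average < minimum_score

  # 3. The run cost more than the cap.
  failures << "cost #{total_cost} > #{maximum_cost}" if total_cost > maximum_cost
  failures
end

baseline = { "ruby 3.4 release" => true, "rails 8 launch" => true }
current  = {
  "ruby 3.4 release" => { passed: true,  score: 1.0 },
  "rails 8 launch"   => { passed: false, score: 0.4 }
}

failures = gate_failures(baseline: baseline, current: current, total_cost: 0.01,
                         minimum_score: 0.8, maximum_cost: 0.05)
```

An empty `failures` array would correspond to a green build; here one regressed case also drags the average under the threshold, so two conditions fire at once.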

**Two practical notes:**

- **Live evals spend real money on every run** — provider tokens per case × number of cases × every merge. Keep the dataset small and targeted (5–15 high-value cases), use cheap models where quality allows, and rely on offline `sample_response` smoke tests in the bulk of CI runs. Live evals belong on merge-candidate branches and scheduled nightly runs, not on every commit.
- **Baselines are checkout-managed** — commit them to git under `.eval_baselines/`. Refresh them in a separate manual workflow (or locally + a dedicated PR) rather than from the merge gate, which would dirty the checkout and race with the regression check it is supposed to run.

## See also

- [Getting Started](getting_started.md) — the feature walkthrough the step above is built on
- [Migration](migration.md) — before/after for replacing a raw `LlmClient.new.call` service with a contract
- [Eval-First](eval_first.md) — the workflow behind the CI gate above
- [Testing](testing.md) — `satisfy_contract` and `pass_eval` matcher chains
@@ -0,0 +1,52 @@
# Relation to `RubyLLM::Agent`

> Read this when you already use `RubyLLM::Agent` (or are about to) and want to understand where `ruby_llm-contract` sits in the same project.

`RubyLLM::Agent` shipped in RubyLLM 1.12. `Step::Base` from this gem and `Agent` target the **same niche**: reusable, class-based prompts. They are **siblings**, not foundation-and-roof.

## Feature mapping

| What you write | Where it lives |
|---|---|
| `model`, `temperature`, `schema`, `instructions`, `tools`, `thinking` | covered by both — same idea, different DSL surface |
| `validate :rule do ... end` business invariants on output | only in `ruby_llm-contract` |
| `retry_policy escalate(...)` model escalation on validation failure | only here (different from RubyLLM's network-level retry) |
| `max_cost` / `max_input` / `max_output` pre-flight refusal | only here |
| `define_eval` + baseline regression + `compare_models` + `optimize_retry_policy` | only here (RubyLLM does not ship an evaluation framework) |
| Pipeline composition with `step SomeStep, as: :alias` | only here (RubyLLM intentionally leaves workflows as plain Ruby) |
| `around_call`, named `observe` hooks with pass/fail recorded in trace | only here |

## Runtime relationship

`Step::Base` does **not** use `Agent` internally today. The actual call path is:

```
Step.run(input)
  → Runner
    → Adapters::RubyLLM
      → RubyLLM.chat(model:, ...)
        → ... .ask(prompt)
```

`Agent` is a sibling abstraction calling into `RubyLLM::Chat` through its own `apply_configuration` path. Both end up at `Chat`. They do not share the macro-storage layer.

This may change in a future release if upstream APIs make a layered design natural. The decision is not committed; it depends on adopter signal.

## Coexistence on the same project

The two abstractions can live in the same Rails (or non-Rails) project. Pick one per use case:

- **`RubyLLM::Agent`** when you want a reusable prompt with `model` + `instructions` + `schema` + `tools` and that is enough — no retry-on-validation-failure, no business invariants, no eval framework, no budget gating.
- **`ruby_llm-contract`'s `Step::Base`** when you need any of: invariants (`validate`), retry with model escalation on validation failure, pre-flight cost ceilings, an evaluation framework with baseline regression, or pipeline composition.

A common pattern: simple ad-hoc prompts as `Agent`, contracts on the LLM features that touch production behaviour or money as `Step`.

## On retry strategies

The retry-strategy framing in this gem favours `retry_policy escalate(model_2, ...)` (model escalation, addresses model bias) over same-model `retry_policy attempts: N` (variance retry).

This is grounded in empirical comparison across PDF quiz generation, GSM8K math (n=30 + n=120), and multi-constraint schedule generation: same-model retry produced no useful lift for nano-class models on tasks with clear correctness criteria. Model escalation did move the needle when same-model retry could not.

`attempts: N` stays in the gem API (for backward compatibility and for niche cases such as subjective-criteria tasks, multi-step pipelines, and weaker open-source models) but is not marketed as a default retry strategy.
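
The difference between the two strategies can be seen in a toy loop. The sketch below is plain Ruby under stated assumptions, not the gem's retry code: `run_with_models` is a hypothetical helper, the model names are placeholders, and the stubbed `call_model` hard-codes a small model that is consistently wrong, which is the "model bias" situation the paragraph above describes.

```ruby
# Plain-Ruby sketch contrasting the two retry strategies; not the gem's implementation.
# `call_model` and `valid` stand in for the LLM call and the contract checks.
def run_with_models(models, call_model:, valid:)
  attempts = []
  last_output = nil
  models.each do |model|
    last_output = call_model.call(model)
    ok = valid.call(last_output)
    attempts << { model: model, ok: ok }
    return { status: :ok, output: last_output, attempts: attempts } if ok
  end
  # Fail closed, keeping the last attempt's output for logging.
  { status: :validation_error, output: last_output, attempts: attempts }
end

# Stubbed behaviour (an assumption for the demo): the small model is
# consistently wrong, the larger one is right.
call_model = ->(model) { model == "small-model" ? { answer: 41 } : { answer: 42 } }
valid      = ->(output) { output[:answer] == 42 }

# Same-model retry: three attempts, same biased answer each time.
same_model = run_with_models(["small-model"] * 3, call_model: call_model, valid: valid)

# Escalation: the second attempt moves to a different model.
escalated = run_with_models(%w[small-model large-model], call_model: call_model, valid: valid)
```

With a deterministic bias, repeating the same model only repeats the failure, while the escalated run succeeds on its second attempt. Real models are stochastic, so this toy overstates the gap; it illustrates the mechanism rather than the measured results.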

See [Optimizing retry_policy](optimizing_retry_policy.md) for the empirical tooling.
@@ -0,0 +1,135 @@
# Relation to `ruby_llm-tribunal`

> Read this when you've seen [`ruby_llm-tribunal`](https://github.com/Alqemist-labs/ruby_llm-tribunal) and want to know how it relates to `ruby_llm-contract` — and which one (or both) you need.

Both gems sit on top of `ruby_llm`. The space they cover overlaps in vocabulary (both talk about "evals") but they live in different layers and answer different questions. They are not alternatives — they compose.

## The core distinction

| | `ruby_llm-tribunal` | `ruby_llm-contract` |
|---|---|---|
| Layer | **Test framework** | **Runtime contract** |
| When it runs | After the LLM call returns, typically in a spec | Before the LLM result reaches your code |
| Where the output lives at evaluation time | Already in your variable, returned to caller | Still inside the gem's runner, not yet released |
| What "fail" means | Red test in CI | Trigger retry/escalate on a stronger model, or fail closed |
| Strongest features | Rich LLM-as-judge (faithful, relevant, hallucination, refusal, bias, toxicity, jailbreak, PII), red-team adversarial prompts, deterministic helpers (`assert_contains`, `assert_levenshtein`, …), RSpec/Minitest matchers, HTML/JUnit/GH reporters | Schema DSL with constraints, `validate` business rules, `retry_policy escalate(...)` model escalation, `max_cost` pre-flight refusal, regression-eval framework (frozen dataset + baseline + min_score gate), pipeline composition |
| What it does NOT cover | No retry, no model escalation, no pre-flight cost cap, no contract layer between LLM and your code | No 10-judge LLM-as-judge catalog, no red-team generation, no rich deterministic assertion library, no test-framework matchers |

## Visual: where each gem sits in your call

### Tribunal alone (test-time, in CI)

```
your code ──► LLM ──► output ──► variable ──► [Tribunal assert_*] ──► ✅ / ❌ red test
                                                       ▲
                                         runs in your spec, not in prod
```

The output **already exists in your code** by the time Tribunal sees it. Tribunal grades it after the fact. A failed grade is a red test — production is unaffected, you fix the prompt or model and re-run.

### Contract alone (runtime, in prod)

```
your code ──► Step.run ──► LLM ──► [schema + validate]
                                          │
                                          ├── valid ────────────► output ──► your code
                                          │
                                          └── invalid ──► retry/escalate ──► next model
                                                                                  │
                                                               all attempts fail ─┘
                                                                       ▼
                                                           Result(:validation_error)
```

The output **never reaches your code** until the contract passes. A failed validation either fixes itself (the retry policy escalates to a stronger model) or fails closed with `Result(:validation_error)` — your call site sees a deterministic failure status, never schema-invalid data.

### Both together (Tribunal in CI → then Contract in prod)

```
CI (before merge):
  define_eval(frozen dataset) ──► run Step ──► [Tribunal grades each case]
                                                          │
                                                          ▼
                                                   regression gate
                                         (prevents quality drift over time)
                                                          │
                                                          ▼
                                                   ✅ merge allowed
                                                          │
                                                          ▼
PROD (every request):
  your code ──► Step.run ──► LLM ──► [contract] ──► output ──► your code
                                          ▲
                                          │ keeps bad outputs out of prod
```

Tribunal grades **a fixed set of cases on every PR** to catch quality regressions before merge. Once merged, Contract gates **every production call** to keep bad outputs from reaching users. Each gem owns the layer it is best at — and they run in the order developers experience them.

## When to use which

**Just Contract.** You ship LLM features whose output drives downstream code, money, or user trust. You need the bad-output-doesn't-reach-prod guarantee, retry escalation, and budget refusal. You are happy to write your own `validate` blocks for content checks; you don't need a 10-judge catalog yet.

**Just Tribunal.** You have a stable production path you don't want to wrap, but you want a CI safety net that grades LLM output for faithfulness, hallucination, PII, jailbreak resistance, etc. You're testing a RAG pipeline or chatbot whose runtime is owned by other code.

**Both.** You ship contracts in prod (Contract) AND want stronger CI signal beyond schema regression — judge-quality grading on a frozen dataset, plus adversarial red-team probes. Use Contract's `Step` to make the call, run it in `define_eval` over your dataset, and grade each case with Tribunal helpers in your spec or via the dataset's `evaluator:` proc.
|
|
75
|
+
|
|
76
|
+
## Integration patterns
|
|
77
|
+
|
|
78
|
+
These work today without any code changes in either gem — both use plain Ruby blocks/procs as extension points.
|
|
79
|
+
|
|
80
|
+
### Tribunal helpers inside Contract `validate`
|
|
81
|
+
|
|
82
|
+
```ruby
|
|
83
|
+
class ChatReply < RubyLLM::Contract::Step::Base
|
|
84
|
+
prompt "Answer this question grounded in the docs:\n{input}"
|
|
85
|
+
|
|
86
|
+
validate("no PII in answer") do |output, _ctx|
|
|
87
|
+
test_case = RubyLLM::Tribunal::TestCase.new(actual_output: output[:answer])
|
|
88
|
+
RubyLLM::Tribunal::Assertions.evaluate(:pii, test_case, {}).first == :pass
|
|
89
|
+
end
|
|
90
|
+
end
|
|
91
|
+
```
|
|
92
|
+
|
|
93
|
+
A failed Tribunal grade triggers Contract's retry/escalate just like any other validation failure. You get LLM-as-judge **runtime gating**, not just CI testing.
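Because a `validate` block only needs a boolean, the judge result has to be collapsed into pass or fail somewhere. The sketch below shows one way to do that, optionally with a score floor. It assumes the `[status, details]` shape implied by the snippets in this guide (`.first == :pass`, `.last[:score]`); `judge_passed?` is a hypothetical helper and the judge results are stubbed rather than produced by Tribunal.

```ruby
# Hypothetical helper: collapse a judge result of the form [status, details]
# into the boolean a validation block returns.
def judge_passed?(result, min_score: nil)
  status, details = result
  return false unless status == :pass
  return true if min_score.nil?

  # A missing score counts as 0.0, so an under-specified grade fails closed.
  score = (details || {})[:score] || 0.0
  score >= min_score
end

# Stubbed judge results; a real call would go through Tribunal instead.
clean = [:pass, { score: 0.92 }]
leaky = [:fail, { score: 0.10, reason: "email address found" }]
weak  = [:pass, { score: 0.55 }]

puts judge_passed?(clean)
puts judge_passed?(leaky)
puts judge_passed?(weak, min_score: 0.7)
```

Returning `false` from such a helper is what would hand control to the retry policy; the helper itself does no retrying.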
|
|
94
|
+
|
|
95
|
+
### Tribunal as `evaluator:` in a Contract dataset
|
|
96
|
+
|
|
97
|
+
```ruby
|
|
98
|
+
ChatReply.define_eval "rag_regression" do
|
|
99
|
+
add_case "policy",
|
|
100
|
+
input: "What is the return policy?",
|
|
101
|
+
evaluator: ->(output, _expected, _input) {
|
|
102
|
+
tc = RubyLLM::Tribunal::TestCase.new(
|
|
103
|
+
actual_output: output[:answer],
|
|
104
|
+
context: ["Returns accepted within 30 days with receipt."]
|
|
105
|
+
)
|
|
106
|
+
result = RubyLLM::Tribunal::Assertions.evaluate(:faithful, tc, {})
|
|
107
|
+
score = result.last[:score] || 0.0
|
|
108
|
+
RubyLLM::Contract::Eval::EvaluationResult.new(score: score, passed: result.first == :pass)
|
|
109
|
+
}
|
|
110
|
+
end
|
|
111
|
+
```
|
|
112
|
+
|
|
113
|
+
Each case is graded by a Tribunal judge; the baseline and `min_score` gate then fails the build on regression. You write the judge once and get the regression gate for free.
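The gating arithmetic can be illustrated in a few lines of plain Ruby. This is a sketch of the idea only: `CaseGrade`, `regression_gate`, and the `tolerance` parameter are hypothetical names, and the gem's own aggregation and baseline comparison may differ.

```ruby
# Hypothetical per-case grade; not the gem's EvaluationResult class.
CaseGrade = Struct.new(:name, :score, :passed, keyword_init: true)

# Gate a run: the mean score must reach min_score and must not fall
# below the stored baseline by more than `tolerance`.
def regression_gate(grades, min_score:, baseline: nil, tolerance: 0.0)
  raise ArgumentError, "no graded cases" if grades.empty?

  mean = grades.sum(&:score) / grades.size.to_f
  reasons = []
  reasons << "mean #{mean.round(2)} is below min_score #{min_score}" if mean < min_score
  if baseline && mean < baseline - tolerance
    reasons << "mean #{mean.round(2)} regressed from baseline #{baseline}"
  end
  { mean: mean, passed: reasons.empty?, reasons: reasons }
end

grades = [
  CaseGrade.new(name: "policy",   score: 0.9, passed: true),
  CaseGrade.new(name: "shipping", score: 0.8, passed: true),
  CaseGrade.new(name: "warranty", score: 0.7, passed: true)
]

floor_only    = regression_gate(grades, min_score: 0.75)
with_baseline = regression_gate(grades, min_score: 0.75, baseline: 0.85)

puts floor_only[:passed]
puts with_baseline[:reasons]
```

The second call shows why a baseline matters: every case passes and the mean clears the floor, yet the run is still flagged because it scores lower than the previously recorded run.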
|
|
114
|
+
|
|
115
|
+
### Contract `Step` as Tribunal's `opts[:llm]` injection
|
|
116
|
+
|
|
117
|
+
Tribunal's built-in judges call `RubyLLM.chat(...).ask(...)` and naively `JSON.parse` the result. If you want **schema-validated, retried, cost-capped judge calls**, inject a Contract `Step` as the LLM caller via `opts[:llm]`. This is an advanced pattern: start from the injection point in `Tribunal::Assertions::Judge#run_judge` and write your own judge that wraps a `Step.run` call.
|
|
118
|
+
|
|
119
|
+
This is a recipe, not a shipped adapter. Tribunal's `opts[:llm]` API is at v0.x — recipes survive minor changes; a shipped adapter would not.
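The shape of such a recipe might look like the sketch below. Everything in it is an assumption made for illustration: `FakeJudgeStep` stands in for a real Contract `Step`, `StepResult` is not the gem's result class, and the callable's signature (one prompt string in, one parsed hash out) is a guess that should be checked against `run_judge` before use.

```ruby
require "json"

# Hypothetical result shape; a real Contract Step returns its own result object.
StepResult = Struct.new(:status, :parsed_output, keyword_init: true)

# Stand-in for a Contract Step acting as a judge. It parses a canned reply and
# applies a tiny schema check where a real step would call an LLM.
class FakeJudgeStep
  def self.run(_prompt)
    raw = '{"verdict":"yes","score":0.9,"reason":"grounded in context"}'
    parsed = JSON.parse(raw, symbolize_names: true)
    valid = %w[yes no].include?(parsed[:verdict]) && parsed[:score].is_a?(Numeric)
    StepResult.new(status: valid ? :ok : :validation_failed, parsed_output: valid ? parsed : nil)
  end
end

# Build a callable that could be injected as the judge's LLM caller.
# Assumed signature: takes the judge prompt, returns a validated hash.
def build_judge_caller(step)
  lambda do |prompt|
    result = step.run(prompt)
    raise "judge call failed: #{result.status}" unless result.status == :ok

    result.parsed_output
  end
end

judge_llm = build_judge_caller(FakeJudgeStep)
verdict = judge_llm.call("Is the answer faithful to the context?")
puts verdict[:verdict]
```

Raising on a non-`:ok` status keeps the judge from grading on malformed data, which is the point of routing the judge call through a contract in the first place.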
|
|
120
|
+
|
|
121
|
+
## What we are NOT doing
|
|
122
|
+
|
|
123
|
+
- **No `Contract::ContainsAssertion` or similar 16-helper deterministic library.** Tribunal owns that layer well. Contract's evaluator surface is intentionally minimal (`Exact`, `Regex`, `JsonIncludes`, `ProcEvaluator`, `TraitEvaluator`); for richer deterministic checks, drop a Tribunal helper into your `evaluator:` proc.
|
|
124
|
+
- **No built-in LLM-as-judge catalog.** `Faithful`, `Hallucination`, `Refusal`, etc. are Tribunal's domain. We provide the runtime contract; they provide the grading vocabulary.
|
|
125
|
+
- **No Tribunal as a hard or soft dependency.** Both gems work standalone. Recipes above are documentation, not code in this gem.
|
|
126
|
+
|
|
127
|
+
## Summary
|
|
128
|
+
|
|
129
|
+
Three questions, three owners:
|
|
130
|
+
|
|
131
|
+
- **"Is this output good?"** — Tribunal, in CI, on outputs you already hold.
|
|
132
|
+
- **"What do we do when it isn't?"** — Contract, at runtime, before outputs reach your code (retry/escalate, or fail-closed with `Result(:validation_failed)`).
|
|
133
|
+
- **"What do we do when it _is_ good?"** — your application code. Once Contract returns `:ok`, you persist, render, hand off downstream. The gem deliberately doesn't touch the happy path; it owns failure semantics, not domain logic.
|
|
134
|
+
|
|
135
|
+
Use Tribunal, Contract, or both — whichever questions your application needs to answer.
|