ruby_llm-contract 0.7.1 → 0.8.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +96 -0
- data/Gemfile.lock +3 -3
- data/README.md +64 -316
- data/examples/00_basics.rb +110 -428
- data/examples/01_fallback_showcase.rb +208 -0
- data/examples/02_real_llm_minimal.rb +45 -0
- data/examples/03_summarize_with_keywords.rb +128 -0
- data/examples/04_summarize_and_translate.rb +196 -0
- data/examples/05_eval_dataset.rb +144 -0
- data/examples/06_retry_variants.rb +147 -0
- data/examples/README.md +20 -128
- data/lib/ruby_llm/contract/adapters/ruby_llm.rb +22 -1
- data/lib/ruby_llm/contract/cost_calculator.rb +39 -0
- data/lib/ruby_llm/contract/eval/model_comparison.rb +4 -4
- data/lib/ruby_llm/contract/eval/retry_optimizer.rb +7 -3
- data/lib/ruby_llm/contract/step/base.rb +18 -1
- data/lib/ruby_llm/contract/step/dsl.rb +38 -0
- data/lib/ruby_llm/contract/step/limit_checker.rb +2 -2
- data/lib/ruby_llm/contract/token_estimator.rb +20 -3
- data/lib/ruby_llm/contract/version.rb +1 -1
- data/ruby_llm-contract.gemspec +6 -5
- metadata +14 -16
- data/examples/01_classify_threads.rb +0 -220
- data/examples/02_generate_comment.rb +0 -203
- data/examples/03_target_audience.rb +0 -201
- data/examples/04_real_llm.rb +0 -410
- data/examples/05_output_schema.rb +0 -258
- data/examples/07_keyword_extraction.rb +0 -239
- data/examples/08_translation.rb +0 -353
- data/examples/09_eval_dataset.rb +0 -287
- data/examples/10_reddit_full_showcase.rb +0 -363
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 33f6a8a686f7f20791904c4fbfacd19f6ea5b8bad428c374ec14b7e33521354d
+  data.tar.gz: a2b2f7d9ff1e6cd69b39d55a3809b1babbd82a5d42afe3c567733076f03fa317
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: '0895986db9cde7d26d2e91ffc7c6469b7d34df299a361148c9ee339dbf1dc61539e44adf250c00c06383717ed2e47ff250d4490d1cce43c3cdf8c3169529fba5'
+  data.tar.gz: ed35a4b4cc9ab1afd46c427468dcb33844d8d54207531f98a9f1d775004efc5a1e19d64fb22c64b20da81750165a3a979f4de4ecaa46aff4121eb7ba80a27ed2
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,101 @@
 # Changelog
 
+## 0.8.0 (2026-04-26)
+
+Narrative repositioning + small API additions. Internal architecture unchanged: no `Step::Base` refactor, no breaking changes to existing DSL.
+
+### Added
+
+- **`thinking(effort:, budget:)` class macro on `Step::Base`** — mirrors `RubyLLM::Agent.thinking` signature exactly. Stored as `{ effort:, budget: }` hash; reader returns the hash; supports `:default` reset semantics; superclass inheritance like `model`/`temperature`. The convenience alias `reasoning_effort(:low)` is implemented as `thinking(effort: :low)` — single normalized state, not separate ivar.
+- **Adapter wiring for `with_thinking`** — when `thinking` is set on the Step class, OR when `reasoning_effort:` is passed through context, OR when an attempt config in `retry_policy escalate(...)` carries `reasoning_effort:`, the RubyLLM adapter resolves the effective `{ effort:, budget: }` hash and forwards it via `chat.with_thinking(**)` — provider-agnostic (supports OpenAI `reasoning_effort` AND Anthropic extended-thinking budget). Precedence: per-attempt / context `reasoning_effort` overrides class-level `thinking[:effort]`; budget is taken from class-level `thinking[:budget]`. **Behavioural change vs 0.7.x**: `reasoning_effort` is now forwarded via `with_thinking` instead of `with_params`. Same wire-level OpenAI parameter; provider-agnostic Anthropic support is now automatic.
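The precedence rule in the second bullet can be sketched in plain Ruby. This is an illustration only, not the gem's adapter code: `resolve_thinking` and its keyword names are hypothetical. It encodes just what the changelog states, namely that a per-attempt or context `reasoning_effort` overrides the class-level `thinking[:effort]`, while the budget always comes from the class-level `thinking[:budget]`.

```ruby
# Hypothetical helper, NOT part of ruby_llm-contract's API.
# class_thinking:   the hash stored by the `thinking` class macro (or nil)
# reasoning_effort: a per-attempt / context override (or nil)
def resolve_thinking(class_thinking:, reasoning_effort: nil)
  class_thinking ||= {}
  effort = reasoning_effort || class_thinking[:effort] # override wins
  budget = class_thinking[:budget]                     # class level only
  resolved = { effort: effort, budget: budget }.compact
  resolved.empty? ? nil : resolved # nil means "do not call with_thinking"
end

CLASS_LEVEL_THINKING = { effort: :low, budget: 2048 }.freeze

p resolve_thinking(class_thinking: CLASS_LEVEL_THINKING)
p resolve_thinking(class_thinking: CLASS_LEVEL_THINKING, reasoning_effort: :high)
p resolve_thinking(class_thinking: nil, reasoning_effort: :medium)
p resolve_thinking(class_thinking: nil)
```

Returning `nil` when nothing is configured is a design choice of this sketch; how the real adapter treats the unset case is not stated in the changelog.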
+
+### Dependencies
+
+- **`ruby_llm` constraint bumped from `~> 1.0` to `~> 1.12`** — `Chat#with_thinking` is the canonical path for reasoning effort + extended thinking; it shipped in RubyLLM 1.12. Adopters on `ruby_llm < 1.12` need to bump RubyLLM before upgrading this gem to 0.8.0.
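To see which installed `ruby_llm` versions the new constraint accepts, RubyGems' own `Gem::Requirement` can evaluate the pessimistic operator. The version strings below are sample inputs chosen for illustration; only `~> 1.12` comes from the changelog.

```ruby
require "rubygems"

# "~> 1.12" means ">= 1.12" and "< 2.0".
RUBY_LLM_REQUIREMENT = Gem::Requirement.new("~> 1.12")

def satisfies_ruby_llm_constraint?(version_string)
  RUBY_LLM_REQUIREMENT.satisfied_by?(Gem::Version.new(version_string))
end

%w[1.11.0 1.12.0 1.14.0 2.0.0].each do |version|
  verdict = satisfies_ruby_llm_constraint?(version) ? "ok" : "needs a different ruby_llm"
  puts "ruby_llm #{version}: #{verdict}"
end
```

Versions 1.12.0 and 1.14.0 pass, while 1.11.0 and 2.0.0 do not, which is why projects pinned below 1.12 have to bump RubyLLM first.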
+
+### Changed
+
+- **Tagline + README opening** — repositioned around "Contracts + Evals for RubyLLM". New "Relation to RubyLLM::Agent" section explicitly frames Step as a sibling abstraction (same niche as Agent, wider contract), not an alternative or foundation. README does not claim "Step uses Agent under the hood" — current call path is `Step → Runner → Adapters::RubyLLM → RubyLLM.chat` directly.
+- **`TokenEstimator` documented as heuristic** — module docstring expanded with explicit "±30% accuracy" framing. Refusal messages from `LimitChecker` now include `(heuristic ±30%)` suffix so adopters know the pre-flight number is estimated, not measured. RubyLLM 1.14 also has no pre-flight tokenizer; `RubyLLM::Tokens` is post-hoc only.
+- **`CostCalculator` repositioned in docs** — module narrative reframed from "cost calculator" to "fine-tune pricing registry + lookup with fallback chain". Math methods (`compute_cost`, `token_cost`, etc.) were already private; this release makes the docs match. Public API surface unchanged: `register_model`, `unregister_model`, `reset_custom_models!`, `calculate`.
+- **`output_schema` reframed in docs** — described as "wrapper around `RubyLLM::Schema` + client-side validation step", not a standalone feature. The schema language is identical to what `RubyLLM::Agent.schema` accepts; the difference is what wraps it.
+- **README retry framing** — `retry_policy escalate(...)` (model escalation on validation failure) is the marketed default. `retry_policy attempts: N` (same-model retry) stays in the API for backward compat and niche cases (subjective criteria, multi-step pipelines, weaker models) but is no longer marketed as a recommended default. Empirical basis: four small experiments across PDF quiz generation, GSM8K math (n=30 + n=120), and multi-constraint schedule generation found no useful lift for nano-class models on tasks with clear correctness criteria.
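The `TokenEstimator` entry describes a pre-flight check whose number is estimated rather than measured. A minimal sketch of that idea follows. Everything in it is an assumption made for illustration: the four-characters-per-token ratio, the method names, and the wording of the refusal message are hypothetical, and only the `(heuristic ±30%)` suffix is taken from the changelog.

```ruby
# Hypothetical pre-flight check, NOT the gem's TokenEstimator or LimitChecker.
ASSUMED_CHARS_PER_TOKEN = 4.0 # rough ratio, assumed for this sketch only

def estimate_tokens(text)
  (text.length / ASSUMED_CHARS_PER_TOKEN).ceil
end

# Returns :ok, or a refusal string, without ever calling a model.
def preflight_check(text, max_input_tokens:)
  estimated = estimate_tokens(text)
  return :ok if estimated <= max_input_tokens

  "Refused: estimated #{estimated} input tokens exceeds limit " \
    "#{max_input_tokens} (heuristic ±30%)"
end

puts preflight_check("a" * 400, max_input_tokens: 200)
puts preflight_check("a" * 4000, max_input_tokens: 200)
```

Because the estimate is a heuristic, a limit set close to the real budget can refuse inputs that would have fit, or admit ones that would not, which is the reason the suffix is surfaced to the caller.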
+
+### Documentation
+
+- **New disambiguation paragraphs** in `prompt_ast.md` (`Step.input_type` vs `RubyLLM::Agent.inputs`; `Prompt::Builder` multi-role DSL vs Agent ERB single-string template loader), `testing.md` (`Step.observe` vs `Chat#on_end_message` / `on_tool_call`), `output_schema.md` (relation to `Agent.schema`), and `optimizing_retry_policy.md` (orthogonal model + thinking dimensions).
+- **`getting_started.md` refusal message example** updated to include the new `(heuristic ±30%)` suffix.
+
+### Issues closed
+
+- **#11** (Optimizer is blind to same-model attempts) — closed after empirical experiments. `attempts: N` retry stays in API; not marketed as a default.
+- **#6** (Production cost reporting) — already implemented in 0.7.x; close confirmed.
+
+### Not in this release (deferred)
+
+- `output_schema` Proc form for runtime-input-aware schemas (parity with `Agent.schema` Proc form). Additive, low-risk; deferred to 0.9 to keep 0.8 scope tight.
+- H4 (Step composing `RubyLLM::Agent` internally as config holder) — verified feasible but ROI insufficient for current adopter base; trigger-based revisit, no calendar commitment.
+
+## 0.7.3 (2026-04-24)
+
+Adoption-friction release. No runtime behavior changes — every delta is in `docs/`, `examples/`, or `spec/integration/` (plus the `version.rb` / Gemfile.lock bumps). Upgrading from 0.7.2 picks up the expanded guide set, the new runnable showcases, and one extra integration spec.
+
+### Documentation
+
+- **New guide: `docs/guide/why.md`** — four production failure modes the gem exists for (schema-valid logically wrong, silent prompt regression, sampling variance on fixed-temperature models, runaway cost). Opens from a concrete incident each time; designed for readers who have not yet felt the pain the gem relieves.
+- **New guide: `docs/guide/rails_integration.md`** — seven Rails-specific FAQs with runnable snippets: where step classes live (`app/contracts/`), initializer setup, background jobs, `around_call` observability, RSpec/Minitest stubs, error handling in controllers, CI gate wiring.
+- **README adoption-friction pass** — added a short "Do I need this?" block after Install, a reading-order hint (`README → why.md → getting_started.md`), and outcome-based labels in the docs index ("Prevent silent prompt regressions" instead of "Eval-First", etc.).
+- **TL;DR box at the top of every guide** — single-sentence orientation for readers who land via search; "Skip if" clause added where real confusion exists (`eval_first.md`, `testing.md`, `migration.md`).
+- **API coverage gaps closed** — `estimate_cost` / `estimate_eval_cost`, `max_cost on_unknown_pricing: :warn`, `run_eval(..., concurrency:)`, `around_call` testing patterns now documented in `getting_started.md`, `eval_first.md`, `testing.md`.
+- **Industry-standard terminology** — `temperature-locked` → `fixed-temperature`, `variance-induced` → `sampling variance`, `severity signals` → `severity keywords`, `takeaway drift` → `tone/takeaways mismatch`.
+- **`docs/architecture.md` refresh** — diagram now reflects the current class layout: added `Step::RetryPolicy`, `Pipeline::Result`, `Eval::AggregatedReport`, `Eval::BaselineDiff`, `Eval::PromptDiffComparator`, `Eval::EvalHistory`, `Eval::RetryOptimizer`, `OptimizeRakeTask`. Replaced the outdated `Eval::TraitEvaluator` entry with `Eval::ExpectationEvaluator`.
+- **Business framing added to guides** — every guide opens with a concrete production scenario or "why it matters" hook before the API reference.
+
+### Examples — consolidated on `SummarizeArticle`, renumbered 00-06
+
+The previous 12-file set mixed a private Reddit promo planner, customer support, meetings, keyword extraction, and translation. The new set is seven runnable files, each answering one adopter question on the README's `SummarizeArticle` case.
+
+| # | File | Answers |
+|---|------|---------|
+| 00 | `00_basics.rb` | How do I start? (seven incremental layers + real-LLM pointer) |
+| 01 | `01_fallback_showcase.rb` | Show me the gem in 30 seconds (zero API keys) |
+| 02 | `02_real_llm_minimal.rb` | How do I plug in a real LLM? (~30 lines) |
+| 03 | `03_summarize_with_keywords.rb` | How does the contract evolve? (growing prompt) |
+| 04 | `04_summarize_and_translate.rb` | Pipeline composition + pipeline-level `run_eval` |
+| 05 | `05_eval_dataset.rb` | How do I stop silent prompt regressions? |
+| 06 | `06_retry_variants.rb` | `attempts: 3`, `reasoning_effort` escalation, cross-provider (Ollama → Anthropic → OpenAI) |
+
+Every file carries an "Expected output" block in its header so readers see the result without running the script. The `docs/ideas/` directory is now fully untracked (already in `.gitignore`; one stray file removed from version control).
+
+### Examples — bug fixes carried along
+
+- **Schema pitfall fixed in 5 files** — `array :x do; string :y; ...; end` silently produces `items: string` and drops every declaration after the first, matching the documented pitfall in `spec/ruby_llm/contract/nested_schema_spec.rb:71`. Every affected array block is now wrapped in `object do...end`.
+- **`examples/05_eval_dataset.rb` (pre-renumber: `09_eval_dataset.rb`) `result[:passed]` → `result.passed?`** — the previous code called `[]` on an `Eval::CaseResult` and raised `NoMethodError` at runtime.
+
+### Testing
+
+- **New `spec/integration/pipeline_eval_spec.rb`** — three cases guaranteeing pipeline-level `run_eval` stays functional: happy path, final-step mismatch, and fail-fast propagation when an intermediate `validate` rejects. Closes the "09 STEP 5 pipeline evaluation" known issue flagged in the 0.7.2 release. The fail-fast case asserts `step_status == :validation_failed` and the validate's label in `details`, so a regression that short-circuits on schema instead of validate would fail loudly.
+
+### Deleted (private-project cleanup)
+
+- `examples/01_classify_threads.rb`, `02_generate_comment.rb`, `03_target_audience.rb`, `10_reddit_full_showcase.rb`, `spec/integration/reddit_pipeline_spec.rb` — Reddit Promo Planner was a separate private project; its examples do not belong in the gem's public repo.
+- `examples/02_output_schema.rb` — fully covered by `docs/guide/output_schema.md`; deleting avoids duplication.
+
+## 0.7.2 (2026-04-22)
+
+### Changed
+
+- **Terminal output labels renamed for consistency with README narrative.** `print_summary` now prints `Hardest eval` (was `Constraining eval`), `Suggested fallback list` (was `Suggested chain`), and the production-mode table uses `first-attempt` / `fallback %` as column headers (was `single-shot` / `escalation`). Programmatic metric names unchanged: `single_shot_cost`, `single_shot_latency_ms`, `escalation_rate`. `RetryOptimizer::Result` exposes `hardest_eval` as an alias for `constraining_eval`.
+
+### Documentation
+
+- **`docs/guide/optimizing_retry_policy.md` rewritten** — 17.7k → 6.4k characters. Continues the `SummarizeArticle` narrative from README. Offline mode clearly positioned as wiring-check; real optimization runs via `LIVE=1 RUNS=3`. Output samples match actual `print_summary` format.
+- **`docs/guide/getting_started.md` rewritten** — 8.7k → 6.1k. Every example uses `SummarizeArticle`. Evals + CI gates section moved before Budget caps. Structured Prompts / Dynamic Prompts / "Already using ruby_llm?" / Reasoning effort sections removed; content delegated to `prompt_ast.md` and README.
+- **`docs/guide/eval_first.md` refined** — 6.3k → 5.0k. Switched to `SummarizeArticle` case. Team workflow section compressed with links back to `getting_started.md` for the matcher chain.
+- **`docs/guide/testing.md` refined** — 10.7k → 7.4k. Switched to `SummarizeArticle` case. Threshold gating / Rake task / baseline walkthrough / prompt A/B sections delegated back to `getting_started.md` and `eval_first.md`.
+- **`docs/guide/output_schema.md` DSL bug fix** — the Supported constraints table documented JSON Schema camelCase keys (`minLength`, `minItems`, `additionalProperties`) that are not valid DSL arguments. Every copy-paste from the previous table would have raised `ArgumentError`. Switched to snake_case (`min_length`, `min_items`, `additional_properties`) as the DSL actually expects; added a short note on the internal camelCase conversion.
+- **`docs/guide/best_practices.md`, `pipeline.md`, `migration.md` sanity pass** — terminology alignment (model escalation → model fallback where narrative; `escalate` DSL method unchanged) and `SummarizeArticle` case where the guide is not inherently multi-step.
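The `output_schema.md` entry above mentions that snake_case DSL arguments are converted to JSON Schema camelCase internally. The following is a minimal sketch of such a conversion, written for illustration; `camelize_key` and `camelize_constraints` are hypothetical names, not methods of the gem or of `ruby_llm-schema`.

```ruby
# Hypothetical helpers, NOT the gem's internal conversion code.
def camelize_key(key)
  # min_length -> minLength, additional_properties -> additionalProperties
  key.to_s.gsub(/_([a-z])/) { Regexp.last_match(1).upcase }
end

def camelize_constraints(constraints)
  constraints.transform_keys { |key| camelize_key(key) }
end

DSL_CONSTRAINTS = { min_length: 1, min_items: 3, additional_properties: false }.freeze

p camelize_constraints(DSL_CONSTRAINTS)
# => {"minLength"=>1, "minItems"=>3, "additionalProperties"=>false}
```

The sketch shows why the DSL can stay idiomatic Ruby while the payload sent to a provider still uses the key names JSON Schema defines.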
+
 ## 0.7.1 (2026-04-22)
 
 ### Changed (behavioral, follow-up to v0.7.0)
data/Gemfile.lock
CHANGED
@@ -1,9 +1,9 @@
 PATH
   remote: .
   specs:
-    ruby_llm-contract (0.
+    ruby_llm-contract (0.8.0)
       dry-types (~> 1.7)
-      ruby_llm (~> 1.
+      ruby_llm (~> 1.12)
       ruby_llm-schema (~> 0.3)
 
 GEM
@@ -258,7 +258,7 @@ CHECKSUMS
   rubocop-ast (1.49.1) sha256=4412f3ee70f6fe4546cc489548e0f6fcf76cafcfa80fa03af67098ffed755035
   ruby-progressbar (1.13.0) sha256=80fc9c47a9b640d6834e0dc7b3c94c9df37f08cb072b7761e4a71e22cff29b33
   ruby_llm (1.14.0) sha256=57c6f7034fc4a44504ea137d70f853b07824f1c1cdbe774ab3ab3522e7098deb
-  ruby_llm-contract (0.
+  ruby_llm-contract (0.8.0)
   ruby_llm-schema (0.3.0) sha256=a591edc5ca1b7f0304f0e2261de61ba4b3bea17be09f5cf7558153adfda3dec6
   ruby_parser (3.22.0) sha256=1eb4937cd9eb220aa2d194e352a24dba90aef00751e24c8dfffdb14000f15d23
   rubycritic (4.12.0) sha256=024fed90fe656fa939f6ea80aab17569699ac3863d0b52fd72cb99892247abc8
data/README.md
CHANGED
@@ -1,121 +1,10 @@
 # ruby_llm-contract
 
-**
+**Contracts + Evals for [ruby_llm](https://github.com/crmne/ruby_llm).**
 
-
+Your eval passed. Prod broke anyway? This gem wraps `RubyLLM::Chat` with input/output contracts, business-rule validation, retry with model escalation on validation failure, pre-flight cost ceilings, and an evaluation framework — so a flaky cheap-model call escalates to a stronger model instead of shipping garbage to your user.
 
-
-|---|---|
-| Rate limits, timeouts, 5xx, connection errors | `ruby_llm` (Faraday retry middleware) |
-| Streaming, tool calls, embeddings, images, transcription | `ruby_llm` |
-| Chat history persistence (`acts_as_chat`) | `ruby_llm` |
-| **Malformed JSON / parse errors** | **`ruby_llm-contract`** |
-| **Business rule violations (invariants, schema)** | **`ruby_llm-contract`** |
-| **Retry with model escalation on bad output** | **`ruby_llm-contract`** |
-| **Measuring output variance on datasets** | **`ruby_llm-contract`** |
-| **Regression detection + CI gates** | **`ruby_llm-contract`** |
-
-Put together: `ruby_llm` owns the wire, this gem owns what the model *said*.
-
-```
-YOU WRITE                      THE GEM HANDLES                 YOU GET
-─────────                      ───────────────                 ───────
-
-validate { |o| ... }           catch bad answers — combined    Zero garbage
-                               with retry_policy, auto-retry   in production
-
-retry_policy                   start cheap, escalate only      Pay for the cheapest
-models: %w[nano mini full]     when validation fails           model that works
-
-max_cost 0.01                  estimate tokens, check price,   No surprise bills
-                               refuse before calling LLM
-
-output_schema { ... }          send JSON schema to provider,   Zero parsing code
-                               validate response client-side
-
-define_eval { ... }            test cases + baselines,         Regressions caught
-                               run in CI with real LLM         before deploy
-
-recommend(candidates: [...])   evaluate all configs, pick      Optimal model +
-                               cheapest that passes            retry chain
-```
-
-## Before and after
-
-```
-┌─────────────────────────────────────────────────────────────────┐
-│ BEFORE: pick one model, hope for the best │
-│ │
-│ expensive model → accurate, but you overpay on every call │
-│ cheap model → fast, but wrong answers slip to production │
-│ prompt change → "looks good to me" → deploy → users suffer │
-└─────────────────────────────────────────────────────────────────┘
-
-⬇ add ruby_llm-contract
-
-┌─────────────────────────────────────────────────────────────────┐
-│ YOU DEFINE A CONTRACT │
-│ │
-│ output_schema { string :priority } ← valid structure │
-│ validate("valid priority") { |o| ... } ← business rules │
-│ retry_policy models: %w[nano mini full] ← escalation chain │
-│ max_cost 0.01 ← budget cap │
-└───────────────────────────┬─────────────────────────────────────┘
-│
-▼
-┌─────────────────────────────────────────────────────────────────┐
-│ THE GEM HANDLES THE REST │
-│ │
-│ request ──→ ┌──────┐ ┌──────────┐ │
-│ │ nano │─→ │ contract │──→ ✓ pass → done │
-│ └──────┘ └────┬─────┘ │
-│ │ ✗ fail │
-│ ▼ │
-│ ┌──────┐ ┌──────────┐ │
-│ │ mini │─→ │ contract │──→ ✓ pass → done │
-│ └──────┘ └────┬─────┘ │
-│ │ ✗ fail │
-│ ▼ │
-│ ┌──────┐ ┌──────────┐ │
-│ │ full │─→ │ contract │──→ ✓ pass → done │
-│ └──────┘ └──────────┘ │
-└───────────────────────────┬─────────────────────────────────────┘
-│
-▼
-┌─────────────────────────────────────────────────────────────────┐
-│ YOU GET │
-│ │
-│ ✓ Valid output guaranteed — schema + business rules enforced │
-│ ✓ Cheapest model that works — most requests stay on nano │
-│ ✓ Cost, latency, tokens — tracked on every call │
-│ ✓ Eval scores per model — data instead of gut feeling │
-│ ✓ Regressions caught — before deploy, not after │
-│ ✓ Recommendation — "use nano+mini, drop full, save $X/mo" │
-└─────────────────────────────────────────────────────────────────┘
-```
-
-## 30-second version
-
-```ruby
-class ClassifyTicket < RubyLLM::Contract::Step::Base
-  prompt "Classify this support ticket by priority and category.\n\n{input}"
-
-  output_schema do
-    string :priority, enum: %w[low medium high urgent]
-    string :category
-  end
-
-  validate("urgent needs justification") { |o, input| o[:priority] != "urgent" || input.length > 20 }
-  retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
-end
-
-result = ClassifyTicket.run("I was charged twice")
-result.parsed_output  # => {priority: "high", category: "billing"}
-result.trace[:model]  # => "gpt-4.1-nano" (first model that passed)
-result.trace[:cost]   # => 0.000032
-```
-
-Bad JSON? Retried automatically. Wrong answer? Escalated to a smarter model. Schema violated? Caught client-side. The contract guarantees every response meets your rules — you pay for the cheapest model that passes.
+`ruby_llm` handles the HTTP side (rate limits, timeouts, streaming, tool calls, embeddings). This gem handles what the model *returned*: schema validation, business rules, model escalation on failed validation, datasets, regression tests.
 
 ## Install
 
@@ -128,239 +17,98 @@ RubyLLM.configure { |c| c.openai_api_key = ENV["OPENAI_API_KEY"] }
|
|
|
128
17
|
RubyLLM::Contract.configure { |c| c.default_model = "gpt-4.1-mini" }
|
|
129
18
|
```
|
|
130
19
|
|
|
131
|
-
Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
|
|
132
|
-
|
|
133
|
-
## Handle output variance with model escalation
|
|
134
|
-
|
|
135
|
-
Models are non-deterministic. A prompt that works on 95% of inputs can break on the edge case sitting in your production traffic right now. The defensive response is to pick the strongest model and pay for it on every call. The measured response is to define a contract and let the gem escalate only when the cheaper model's output actually fails the rules:
|
|
136
|
-
|
|
137
|
-
```ruby
|
|
138
|
-
retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
|
|
139
|
-
```
|
|
20
|
+
Works with any `ruby_llm` provider (OpenAI, Anthropic, Gemini, etc).
|
|
140
21
|
|
|
141
|
-
|
|
142
|
-
Attempt 1: gpt-4.1-nano → contract failed ($0.0001)
|
|
143
|
-
Attempt 2: gpt-4.1-mini → contract passed ($0.0004)
|
|
144
|
-
gpt-4.1 → never called ($0.00)
|
|
145
|
-
```
|
|
22
|
+
## Do I need this?
|
|
146
23
|
|
|
147
|
-
|
|
24
|
+
Use this if LLM output affects production behaviour, money, user trust, or downstream code. You probably don't need it if you have one low-risk prompt, manually inspect every result, or only generate best-effort prose.
|
|
148
25
|
|
|
149
|
-
|
|
26
|
+
Already using structured outputs from your provider? This gem adds business-rule validation, retry with model fallback, evals, regression gating, and test stubs on top of them — the layer that stops schema-valid-but-wrong output from reaching users. See [Why contracts?](docs/guide/why.md) for the four production failure modes the gem exists for, or run `ruby examples/01_fallback_showcase.rb` to see the fallback loop in 30 seconds (no API key needed).
|
|
150
27
|
|
|
151
|
-
##
|
|
28
|
+
## Example
|
|
152
29
|
|
|
153
|
-
|
|
30
|
+
A Rails app takes article text extracted from a user-submitted URL and wants to show a summary card: a short TL;DR, 3–5 key takeaways, and a tone label. The output has to fit the UI (TL;DR under 200 chars) and the schema has to be strict enough to render without conditionals.
|
|
154
31
|
|
|
155
32
|
```ruby
|
|
156
|
-
class
|
|
157
|
-
|
|
158
|
-
|
|
159
|
-
|
|
33
|
+
class SummarizeArticle < RubyLLM::Contract::Step::Base
|
|
34
|
+
prompt <<~PROMPT
|
|
35
|
+
Summarize this article for a UI card. Return a short TL;DR,
|
|
36
|
+
3 to 5 key takeaways, and a tone label.
|
|
160
37
|
|
|
161
|
-
|
|
162
|
-
|
|
163
|
-
lens = q[:answer_options].map(&:length)
|
|
164
|
-
next false if lens.empty? || lens.min.zero?
|
|
38
|
+
{input}
|
|
39
|
+
PROMPT
|
|
165
40
|
|
|
166
|
-
|
|
167
|
-
|
|
41
|
+
output_schema do
|
|
42
|
+
string :tldr
|
|
43
|
+
array :takeaways, of: :string, min_items: 3, max_items: 5
|
|
44
|
+
string :tone, enum: %w[neutral positive negative analytical]
|
|
168
45
|
end
|
|
169
46
|
|
|
170
|
-
|
|
171
|
-
|
|
172
|
-
|
|
173
|
-
result = GenerateQuiz.run(document)
|
|
174
|
-
if result.ok?
|
|
175
|
-
publish(result.parsed_output)
|
|
176
|
-
else
|
|
177
|
-
# Three unlucky draws in a row — ship the last one anyway, log the miss.
|
|
178
|
-
Rails.logger.warn "Quiz delivered with soft-validation miss: #{result.validation_errors.join('; ')}"
|
|
179
|
-
publish(result.parsed_output)
|
|
180
|
-
end
|
|
181
|
-
```
|
|
182
|
-
|
|
183
|
-
How this differs from the escalation chain:
|
|
184
|
-
|
|
185
|
-
- `retry_policy models: %w[nano mini full]` — **document hardness.** Retry means "the cheap model isn't enough, use a smarter one."
|
|
186
|
-
- `retry_policy attempts: 3` — **sampling variance.** Retry means "same model, different random seed — the model can do better on a second try."
|
|
187
|
-
|
|
188
|
-
After all `attempts` are exhausted (`attempts: 3` means 3 total tries, not 3 retries on top of the first), the Result carries `status: :validation_failed` plus the last attempt's `parsed_output`. The caller decides: ship anyway, fall back to a template, or surface an error. The gem does not raise on exhaustion — your application policy, your choice.
|
|
189
|
-
|
|
190
|
-
Combine both when helpful:
|
|
191
|
-
|
|
192
|
-
```ruby
|
|
193
|
-
retry_policy do
|
|
194
|
-
escalate({ model: "gpt-4.1-mini" }, { model: "gpt-4.1-mini" }, { model: "gpt-4.1" })
|
|
195
|
-
end
|
|
196
|
-
```
|
|
47
|
+
validate("TL;DR fits the card") { |o, _| o[:tldr].length <= 200 }
|
|
48
|
+
validate("takeaways are unique") { |o, _| o[:takeaways].uniq.size == o[:takeaways].size }
|
|
197
49
|
|
|
198
|
-
|
|
199
|
-
|
|
200
|
-
## Know which model to use — with data
|
|
201
|
-
|
|
202
|
-
Don't guess. Define test cases, compare models, get numbers:
|
|
203
|
-
|
|
204
|
-
```ruby
|
|
205
|
-
ClassifyTicket.define_eval("regression") do
|
|
206
|
-
add_case "billing", input: "I was charged twice", expected: { priority: "high" }
|
|
207
|
-
add_case "feature", input: "Add dark mode please", expected: { priority: "low" }
|
|
208
|
-
add_case "outage", input: "Database is down", expected: { priority: "urgent" }
|
|
50
|
+
retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini gpt-4.1]
|
|
209
51
|
end
|
|
210
52
|
|
|
211
|
-
|
|
212
|
-
|
|
213
|
-
|
|
214
|
-
|
|
215
|
-
```
|
|
216
|
-
Candidate Score Cost Avg Latency
|
|
217
|
-
---------------------------------------------------------
|
|
218
|
-
gpt-4.1-nano 0.67 $0.0001 48ms
|
|
219
|
-
gpt-4.1-mini 1.00 $0.0004 92ms
|
|
220
|
-
gpt-4.1 1.00 $0.0021 210ms
|
|
221
|
-
|
|
222
|
-
Cheapest at 100%: gpt-4.1-mini
|
|
223
|
-
```
|
|
224
|
-
|
|
225
|
-
Nano fails on edge cases. Mini and full both score 100% — but mini is **5x cheaper**. Now you know.
|
|
226
|
-
|
|
227
|
-
Running live against gpt-5 / o-series? Pass `runs: 3` to average out sampling variance (OpenAI forces `temperature=1.0` server-side, so one unlucky run can misclassify a viable candidate). See [Reducing variance with `runs:`](docs/guide/optimizing_retry_policy.md#reducing-variance-with-runs).
|
|
228
|
-
|
|
229
|
-
Want the *effective* cost — first-attempt plus retries — rather than the single-shot headline number? Pass `production_mode: { fallback: "gpt-5-mini" }` and the table gains `escalation`, `effective cost`, and a `Chain` column. See [Production-mode cost measurement](docs/guide/optimizing_retry_policy.md#production-mode-cost-measurement).
|
|
230
|
-
|
|
231
|
-
## Let the gem tell you what to do
|
|
232
|
-
|
|
233
|
-
Don't read tables — get a recommendation. Supports `model + reasoning_effort` combinations:
|
|
234
|
-
|
|
235
|
-
```ruby
|
|
236
|
-
rec = ClassifyTicket.recommend("regression",
|
|
237
|
-
candidates: [
|
|
238
|
-
{ model: "gpt-4.1-nano" },
|
|
239
|
-
{ model: "gpt-4.1-mini" },
|
|
240
|
-
{ model: "gpt-5-mini", reasoning_effort: "low" },
|
|
241
|
-
{ model: "gpt-5-mini", reasoning_effort: "high" },
|
|
242
|
-
],
|
|
243
|
-
min_score: 0.95
|
|
244
|
-
)
|
|
245
|
-
|
|
246
|
-
rec.best # => { model: "gpt-4.1-mini" }
|
|
247
|
-
rec.retry_chain # => [{ model: "gpt-4.1-nano" }, { model: "gpt-4.1-mini" }]
|
|
248
|
-
rec.to_dsl # => "retry_policy models: %w[gpt-4.1-nano gpt-4.1-mini]"
|
|
249
|
-
rec.savings # => savings vs your current model (if configured)
|
|
250
|
-
```
|
|
251
|
-
|
|
252
|
-
Copy `rec.to_dsl` into your step. Done.
|
|
253
|
-
|
|
254
|
-
## Catch regressions before users do
|
|
255
|
-
|
|
256
|
-
A model update silently dropped your accuracy? A prompt tweak broke an edge case? You'll know before deploying:
|
|
257
|
-
|
|
258
|
-
```ruby
|
|
259
|
-
# Save a baseline once:
|
|
260
|
-
report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
|
|
261
|
-
report.save_baseline!(model: "gpt-4.1-nano")
|
|
262
|
-
|
|
263
|
-
# In CI — block merge if anything regressed:
|
|
264
|
-
expect(ClassifyTicket).to pass_eval("regression")
|
|
265
|
-
.with_context(model: "gpt-4.1-nano")
|
|
266
|
-
.without_regressions
|
|
267
|
-
```
|
|
268
|
-
|
|
269
|
-
```ruby
|
|
270
|
-
diff = report.compare_with_baseline(model: "gpt-4.1-nano")
|
|
271
|
-
diff.regressed? # => true
|
|
272
|
-
diff.regressions # => [{case: "outage", baseline: {passed: true}, current: {passed: false}}]
|
|
273
|
-
diff.score_delta # => -0.33
|
|
274
|
-
```
|
|
275
|
-
|
|
276
|
-
No more "it worked in the playground". Regressions are caught in CI, not production.
|
|
277
|
-
|
|
278
|
-
## A/B test your prompts
|
|
279
|
-
|
|
280
|
-
Changed a prompt? Compare old vs new on the same dataset with regression safety:
|
|
281
|
-
|
|
282
|
-
```ruby
|
|
283
|
-
diff = ClassifyTicketV2.compare_with(ClassifyTicketV1,
|
|
284
|
-
eval: "regression", model: "gpt-4.1-mini")
|
|
285
|
-
|
|
286
|
-
diff.safe_to_switch? # => true (no regressions)
|
|
287
|
-
diff.improvements # => [{case: "outage", ...}]
|
|
288
|
-
diff.score_delta # => +0.33
|
|
289
|
-
```
|
|
290
|
-
|
|
291
|
-
```ruby
|
|
292
|
-
# CI gate:
|
|
293
|
-
expect(ClassifyTicketV2).to pass_eval("regression")
|
|
294
|
-
.compared_with(ClassifyTicketV1)
|
|
295
|
-
.with_minimum_score(0.8)
|
|
53
|
+
result = SummarizeArticle.run(article_text)
|
|
54
|
+
result.parsed_output # => { tldr: "...", takeaways: [...], tone: "analytical" }
|
|
55
|
+
result.trace[:model] # => "gpt-4.1-nano" (first model that passed)
|
|
56
|
+
result.trace[:cost] # => 0.000032
|
|
296
57
|
```
|
|
297
58
|
|
|
298
|
-
|
|
59
|
+
The model returns JSON matching the schema. If the response is malformed, the TL;DR overflows the card, or the takeaway count is off, the gem retries, moving to the next model in `models:` only when the cheaper one can't satisfy the rules. Cheaper models are always tried first; the expensive ones run only when every cheaper model has failed.
|
|
299
60
|
|
|
300
|
-
|
|
61
|
+
You could write this loop yourself once. The gem gives you the loop, a trace of every attempt (model, status, cost, latency), a fallback policy, evals, baselines, and CI checks as one contract object. Everything is tracked per step, so adding a new LLM feature to your app means writing one class, not one-off scaffolding.
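For reference, the hand-written version of that loop might look like the sketch below. Everything in it is hypothetical: `run_with_escalation`, the `Attempt` struct, the model names, and the 200-character limit are illustrative stand-ins, and the "models" are stubbed lambdas, not real LLM calls or the gem's API.

```ruby
# Hypothetical sketch: none of these names come from ruby_llm-contract.
Attempt = Struct.new(:model, :status, :output, keyword_init: true)

# Tries each model in order (cheapest first) and stops at the first
# output that passes validation. Returns the trace of all attempts.
def run_with_escalation(models, input, validate:)
  attempts = []
  models.each do |name, call|
    output = call.call(input)
    status = validate.call(output) ? :ok : :validation_failed
    attempts << Attempt.new(model: name, status: status, output: output)
    return attempts if status == :ok
  end
  attempts
end

# Stubbed models: the cheap one returns an overlong TL;DR, the larger one fits.
MODELS = {
  "cheap-model"  => ->(_text) { { tldr: "x" * 300 } },
  "bigger-model" => ->(_text) { { tldr: "Short summary." } }
}.freeze

# Business rule: the TL;DR must fit in 200 characters (an assumed limit).
TLDR_FITS = ->(out) { out[:tldr].is_a?(String) && out[:tldr].length <= 200 }

trace = run_with_escalation(MODELS, "article text", validate: TLDR_FITS)
trace.each { |a| puts "#{a.model}: #{a.status}" }
```

The sketch records only model and status per attempt; cost and latency tracking, evals, and baselines are the parts that make a hand-rolled loop grow.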
|
|
301
62
|
|
|
302
|
-
|
|
303
|
-
class TicketPipeline < RubyLLM::Contract::Pipeline::Base
|
|
304
|
-
step ClassifyTicket, as: :classify
|
|
305
|
-
step RouteToTeam, as: :route
|
|
306
|
-
step DraftResponse, as: :draft
|
|
307
|
-
end
|
|
63
|
+
## Most useful next
|
|
308
64
|
|
|
309
|
-
|
|
310
|
-
result.outputs_by_step[:classify] # => {priority: "high", category: "billing"}
|
|
311
|
-
result.trace.total_cost # => $0.000128
|
|
312
|
-
```
|
|
65
|
+
Everything below is optional — the example above is a complete step. Reach for these when one step isn't enough.
|
|
313
66
|
|
|
314
|
-
|
|
67
|
+
- **[CI regression gates](docs/guide/getting_started.md)** — `define_eval` + `save_baseline!` + `pass_eval(...).without_regressions` blocks CI when accuracy drops on a model update or prompt tweak.
|
|
68
|
+
- **[Find the cheapest viable fallback list](docs/guide/optimizing_retry_policy.md)** — `Step.recommend("regression", candidates: [...], min_score: 0.95)` returns the cheapest list of models that still passes your evals. `production_mode:` measures retry-aware cost.
|
|
69
|
+
- **[A/B test prompts](docs/guide/eval_first.md)** — `SummarizeArticleV2.compare_with(SummarizeArticleV1, eval: "regression")` reports whether the new prompt is safe to ship.
|
|
70
|
+
- **[Budget caps](docs/guide/getting_started.md)** — `max_cost`, `max_input`, `max_output` refuse the request before calling the API when a heuristic estimate (~±30% accuracy) exceeds the limit.
|
|
71
|
+
- **[Reasoning effort / thinking config](docs/guide/optimizing_retry_policy.md)** — `thinking effort: :low` (or alias `reasoning_effort :low`) on the Step class; mirrors `RubyLLM::Agent.thinking` and forwards through `Chat#with_thinking`.
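
The pre-flight refusal behind the budget caps can be pictured with a small sketch. It assumes a rule-of-thumb ratio of about 4 characters per token, which is an assumption made here and not necessarily what the gem's TokenEstimator does; `estimate_tokens` and `preflight` are hypothetical helper names.

```ruby
# Hypothetical sketch; the 4-characters-per-token ratio is an assumed
# rule of thumb, not the gem's TokenEstimator.
CHARS_PER_TOKEN = 4.0

# Rough token estimate from character count.
def estimate_tokens(text)
  (text.length / CHARS_PER_TOKEN).ceil
end

# Decides before any network call whether the prompt fits the input cap.
# Returns [:ok, estimate] or [:limit_exceeded, estimate].
def preflight(prompt, max_input:)
  estimated = estimate_tokens(prompt)
  estimated > max_input ? [:limit_exceeded, estimated] : [:ok, estimated]
end

small = preflight("Summarize this.", max_input: 100)
large = preflight("word " * 1_000, max_input: 100)

puts small.inspect
puts large.inspect
```

Because the estimate is heuristic, a cap set close to the real limit can refuse requests that would have fit, or admit ones that would not; leaving some headroom is the usual trade-off.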
|
|
315
72
|
|
|
316
|
-
|
|
317
|
-
# RSpec — block merge if accuracy drops or cost spikes
|
|
318
|
-
expect(ClassifyTicket).to pass_eval("regression")
|
|
319
|
-
.with_minimum_score(0.8)
|
|
320
|
-
.with_maximum_cost(0.01)
|
|
321
|
-
|
|
322
|
-
# Rake — run all evals across all steps
|
|
323
|
-
RubyLLM::Contract::RakeTask.new do |t|
|
|
324
|
-
t.minimum_score = 0.8
|
|
325
|
-
t.maximum_cost = 0.05
|
|
326
|
-
end
|
|
327
|
-
# bundle exec rake ruby_llm_contract:eval
|
|
328
|
-
```
|
|
73
|
+
Also supports [multi-step pipelines](docs/guide/pipeline.md) with fail-fast, and `retry_policy attempts: N` for niche cases. We measured this empirically: for `gpt-4.1-nano` / `gpt-5-nano` on tasks with clear correctness criteria, same-model retry rarely helps, while `escalate(model_2)` is the strategy that moves the needle. See [optimizing_retry_policy.md](docs/guide/optimizing_retry_policy.md).
|
|
329
74
|
|
|
330
|
-
##
|
|
75
|
+
## Relation to `RubyLLM::Agent`
|
|
331
76
|
|
|
332
|
-
|
|
77
|
+
`RubyLLM::Agent` (since RubyLLM 1.12) and `Step::Base` here target the **same niche**: reusable, class-based prompts. They are siblings: neither is built on top of the other.
|
|
333
78
|
|
|
334
|
-
|
|
79
|
+
| What you write | Where it lives |
|
|
80
|
+
|---|---|
|
|
81
|
+
| `model`, `temperature`, `schema`, `instructions`, `tools`, `thinking` | covered by both — same idea, different DSL surface |
|
|
82
|
+
| `validate :rule do |out| ... end` business invariants | only here |
|
|
83
|
+
| `retry_policy escalate(...)` model escalation on validation failure | only here (different from RubyLLM's network-level retry) |
|
|
84
|
+
| `max_cost` / `max_input` / `max_output` pre-flight refusal | only here |
|
|
85
|
+
| `define_eval` + baseline regression + `compare_models` + `optimize_retry_policy` | only here (RubyLLM does not ship an eval framework) |
|
|
86
|
+
| Pipeline composition with `step SomeStep, as: :alias` | only here (RubyLLM intentionally leaves workflows as plain Ruby) |
|
|
87
|
+
| `around_call`, named `observe` hooks with pass/fail in trace | only here |
|
|
335
88
|
|
|
336
|
-
|
|
89
|
+
`Step::Base` does NOT use `Agent` internally today — both call into `RubyLLM::Chat` directly. The two abstractions can coexist on the same project: use `Agent` for prompt-only reuse, use `Step` when you need any of the contract-layer features above. The retry-strategy framing here (favouring `escalate(...)` over same-model `attempts: N`) is grounded in an empirical comparison; `attempts: N` stays in the API for niche cases.
|
|
337
90
|
|
|
338
91
|
## Docs
|
|
339
92
|
|
|
340
|
-
|
|
341
|
-
|
|
342
|
-
|
|
|
343
|
-
|
|
344
|
-
| [
|
|
345
|
-
| [
|
|
346
|
-
| [
|
|
347
|
-
| [
|
|
348
|
-
| [
|
|
349
|
-
| [
|
|
93
|
+
**New here?** Read in order: this README → [Why contracts?](docs/guide/why.md) → [Getting Started](docs/guide/getting_started.md).
|
|
94
|
+
|
|
95
|
+
| Guide | What it does for your app |
|
|
96
|
+
|-------|---------------------------|
|
|
97
|
+
| [Why contracts?](docs/guide/why.md) | Recognise the four production failures the gem exists for |
|
|
98
|
+
| [Getting Started](docs/guide/getting_started.md) | Walk the full feature set on one concrete step |
|
|
99
|
+
| [Rails integration](docs/guide/rails_integration.md) | Directory, initializer, jobs, logging, specs, CI gate — 7 FAQs for Rails devs |
|
|
100
|
+
| [Adopt in an existing Rails app](docs/guide/migration.md) | Replace raw `LlmClient.call` with a contract, Before/After |
|
|
101
|
+
| [Prevent silent prompt regressions](docs/guide/eval_first.md) | Evals, baselines, CI gates that block quality drift |
|
|
102
|
+
| [Control retry cost and fallback behaviour](docs/guide/optimizing_retry_policy.md) | Find the cheapest viable fallback list empirically |
|
|
103
|
+
| [Write validate rules that catch real bugs](docs/guide/best_practices.md) | Patterns for cross-input checks and content-quality rules |
|
|
104
|
+
| [Stub LLM calls in tests](docs/guide/testing.md) | Deterministic specs, RSpec + Minitest matchers |
|
|
105
|
+
| [Chain LLM calls into a pipeline](docs/guide/pipeline.md) | Multi-step with fail-fast and per-step models |
|
|
106
|
+
| [Schema DSL reference](docs/guide/output_schema.md) | Every constraint, nested objects, pattern table |
|
|
107
|
+
| [Prompt DSL reference](docs/guide/prompt_ast.md) | `system` / `rule` / `section` / `example` / `user` nodes |
|
|
350
108
|
|
|
351
109
|
## Roadmap
|
|
352
110
|
|
|
353
|
-
**v0.
|
|
354
|
-
|
|
355
|
-
**v0.7.0:** Sharpened retry semantics. `DEFAULT_RETRY_ON` now targets LLM output variance only (`:validation_failed`, `:parse_error`); transport errors are delegated to ruby_llm's Faraday retry. `AdapterCaller` narrowed to let programmer errors propagate instead of masking them as retries. Breaking change — see [CHANGELOG](CHANGELOG.md) for migration.
|
|
356
|
-
|
|
357
|
-
**v0.6:** "What should I do?" — `Step.recommend` returns optimal model, reasoning effort, and retry chain. Per-attempt `reasoning_effort` in retry policies.
|
|
358
|
-
|
|
359
|
-
**v0.5:** Prompt A/B testing with `compare_with`. Soft observations with `observe`.
|
|
360
|
-
|
|
361
|
-
**v0.4:** Eval history, batch concurrency, pipeline per-step eval, Minitest, structured logging.
|
|
362
|
-
|
|
363
|
-
**v0.3:** Baseline regression detection, migration guide.
|
|
111
|
+
Latest: **v0.8.0** — tagline + narrative repositioning around "Contracts + Evals for RubyLLM", `thinking` / `reasoning_effort` class macro, TokenEstimator labelled as heuristic, CostCalculator repositioned. See [CHANGELOG](CHANGELOG.md) for history.
|
|
364
112
|
|
|
365
113
|
## License
|
|
366
114
|
|