ruby_llm-contract 0.3.6 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 35a61fe65d6a7939e3ef22bdd37732d2ae6cd5643f51d595a3f26b4281eea396
- data.tar.gz: 9b1b95b29c31e433af60c25e85dfdebf3e8e71cb85c0e568835309a7cd855926
+ metadata.gz: a423ef1b370ae97651d256fdc3776bd895d1eebc81a2b1c4adac305292e2a7a0
+ data.tar.gz: 685ec9b00a369748ca897e38ae498e26d9fc31644aac8c41f096a704bceadd7d
  SHA512:
- metadata.gz: 0bb0333b6c362b1687b51f6bf360fd6d659c066a2a5b4b539bab4795150e5c1c8dbebe8dac6d05791b62958058d60418e5ff1f2b5db1f050f29412ed136494a5
- data.tar.gz: ff5a8e7c30344993617bdd5f85d857e91d0cb633e2b7fe35a08aadf0790a4c7c0389cb017f92a192d199fe1eaba9526c509d5731321b36bd2c6e5fdedb5ca6d0
+ metadata.gz: 34ab0e678a2de57812a7b8391d406cabe3bb13cf0399669a9bcba18609fc69488d0ef4d2e4a6675436da71fc9b67d3f4c9e264e8fdbef475b07d020d9d8b9d34
+ data.tar.gz: 3aca7548473e4f6e32df442296344013d6081564b56fb5d0081aefc1ea0ab6129896ed852e723c2678548d265797e3568fc4bcf0ca219ca6e220c9ecba35bc9c
data/CHANGELOG.md CHANGED
@@ -1,5 +1,35 @@
  # Changelog

+ ## 0.4.0 (2026-03-24)
+
+ Observability & Scale — see what changed, run it fast, debug it easily.
+
+ ### Features
+
+ - **Structured logging** — `Contract.configure { |c| c.logger = Rails.logger }`. Auto-logs model, status, latency, tokens, cost on every `step.run`.
+ - **Batch eval concurrency** — `run_eval("regression", concurrency: 4)`. Parallel case execution via Concurrent::Future. 4x faster CI for large eval suites.
+ - **Eval history & trending** — `report.save_history!` appends to JSONL. `report.eval_history` returns `EvalHistory` with `score_trend`, `drift?`, run-by-run scores.
+ - **Pipeline per-step eval** — `add_case(..., step_expectations: { classify: { priority: "high" } })`. See which step in a pipeline regressed.
+ - **Minitest support** — `assert_satisfies_contract`, `assert_eval_passes`, `stub_step` for Minitest users. `require "ruby_llm/contract/minitest"`.
+
+ ### Game changer continuity
+
+ ```
+ v0.2: "Which model?" → compare_models (snapshot)
+ v0.3: "Did it change?" → baseline regression (binary)
+ v0.4: "Show me the trend" → eval history (time series)
+       "Which step changed?" → pipeline per-step eval
+       "Run it fast" → batch concurrency
+ ```
+
+ ## 0.3.7 (2026-03-24)
+
+ - **Trait missing key = error** — `expected_traits: { title: 0..5 }` on output `{}` now fails instead of silently passing.
+ - **nil input in dynamic prompts** — `run(nil)` with `prompt { |input| ... }` correctly passes nil to the block.
+ - **Defensive sample pre-validation** — `sample_response` uses the same parser as runtime (handles code fences, BOM, prose around JSON).
+ - **Baseline diff excludes skipped** — self-compare with skipped cases no longer shows artificial score delta.
+ - **Zeitwerk eval/ ignore** — `eager_load_contract_dirs!` ignores `eval/` subdirs before eager load.
+
  ## 0.3.6 (2026-03-24)

  - **Recursive array/object validation** — nested arrays (`array of array of string`) validated recursively. Object items validated even without `:properties` (e.g. `additionalProperties: false`).
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
  remote: .
  specs:
- ruby_llm-contract (0.3.6)
+ ruby_llm-contract (0.4.0)
  dry-types (~> 1.7)
  ruby_llm (~> 1.0)
  ruby_llm-schema (~> 0.3)
@@ -165,7 +165,7 @@ CHECKSUMS
  rubocop-ast (1.49.1) sha256=4412f3ee70f6fe4546cc489548e0f6fcf76cafcfa80fa03af67098ffed755035
  ruby-progressbar (1.13.0) sha256=80fc9c47a9b640d6834e0dc7b3c94c9df37f08cb072b7761e4a71e22cff29b33
  ruby_llm (1.14.0) sha256=57c6f7034fc4a44504ea137d70f853b07824f1c1cdbe774ab3ab3522e7098deb
- ruby_llm-contract (0.3.6)
+ ruby_llm-contract (0.4.0)
  ruby_llm-schema (0.3.0) sha256=a591edc5ca1b7f0304f0e2261de61ba4b3bea17be09f5cf7558153adfda3dec6
  unicode-display_width (3.2.0) sha256=0cdd96b5681a5949cdbc2c55e7b420facae74c4aaf9a9815eee1087cb1853c42
  unicode-emoji (4.2.0) sha256=519e69150f75652e40bf736106cfbc8f0f73aa3fb6a65afe62fefa7f80b0f80f
data/README.md CHANGED
@@ -6,7 +6,7 @@ Companion gem for [ruby_llm](https://github.com/crmne/ruby_llm).

  ## The problem

- You call an LLM. It returns bad JSON, wrong values, or costs 4x more than it should. You switch models and quality drops silently. You have no data to decide which model to use.
+ Which model should you use? The expensive one is accurate but costs 4x more. The cheap one is fast but hallucinates on edge cases. You tweak a prompt — did accuracy improve or drop? You have no data. Just gut feeling.

  ## The fix

@@ -168,11 +168,11 @@ Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).

  ## Roadmap

- **v0.3 (current):** Baseline regression detection — `save_baseline!`, `compare_with_baseline`, `without_regressions`. Migration guide.
+ **v0.4 (current):** Observability & scale — eval history with trending, batch eval with concurrency, pipeline per-step eval, Minitest support, structured logging.

- **v0.2:** Model comparison, cost tracking, eval with `add_case`, CI gating, Rails Railtie.
+ **v0.3:** Baseline regression detection, migration guide, production hardening.

- **v0.4:** Auto-routing — learn which model works for which input pattern.
+ **v0.5:** Prompt A/B testing — `compare_with(OtherStep)` for data-driven prompt engineering with regression safety. Cross-provider comparison docs.

  ## License

@@ -7,6 +7,12 @@ module RubyLLM
  def call(messages:, **_options)
  raise NotImplementedError, "Subclasses must implement #call"
  end
+
+ # Override in stateful adapters to provide a fully independent copy
+ # for concurrent eval execution. Default: self (stateless adapters).
+ def clone_for_concurrency
+ self
+ end
  end
  end
  end
@@ -29,6 +29,20 @@ module RubyLLM

  public

+ # Exposes raw responses array for concurrent eval to split per-case
+ def responses_array
+ @responses
+ end
+
+ # Returns a fresh adapter with reset index for concurrent execution
+ def clone_for_concurrency
+ if @responses
+ self.class.new(responses: @responses.dup, usage: @usage.dup)
+ else
+ self.class.new(response: @response, usage: @usage.dup)
+ end
+ end
+
  def call(messages:, **_options) # rubocop:disable Lint/UnusedMethodArgument
  content = if @responses
  c = @responses[@index] || @responses.last
@@ -0,0 +1,31 @@
+ # frozen_string_literal: true
+
+ module RubyLLM
+ module Contract
+ module Concerns
+ # Shared helpers for context hash manipulation.
+ # Used by EvalHost, Runner, Step::Base.
+ module ContextHelpers
+ private
+
+ def safe_context(context)
+ (context || {}).transform_keys { |k| k.respond_to?(:to_sym) ? k.to_sym : k }
+ end
+
+ def isolate_context(context)
+ context.transform_values do |v|
+ if v.respond_to?(:clone_for_concurrency)
+ v.clone_for_concurrency
+ elsif v.respond_to?(:dup)
+ v.dup
+ else
+ v
+ end
+ rescue TypeError
+ v
+ end
+ end
+ end
+ end
+ end
+ end
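The `isolate_context` helper in the new file above hands each context value an independent copy where one can be made. A stdlib-only sketch of the same copy-or-fallback logic — the `Conn` class here is a made-up stand-in for a stateful adapter, not part of the gem:

```ruby
# Copy-or-fallback context isolation, mirroring the helper shown above.
def isolate_context(context)
  context.transform_values do |v|
    if v.respond_to?(:clone_for_concurrency)
      v.clone_for_concurrency # stateful values hand back a fresh twin
    elsif v.respond_to?(:dup)
      v.dup                   # plain values get a shallow copy
    else
      v
    end
  rescue TypeError
    v                         # un-dupable objects pass through unchanged
  end
end

# Hypothetical stateful value: clone_for_concurrency returns a reset copy.
class Conn
  def clone_for_concurrency
    Conn.new
  end
end

original = { adapter: Conn.new, tags: %w[a b], model: :nano }
isolated = isolate_context(original)
# isolated[:tags] is a new array; isolated[:adapter] is a fresh Conn.
```

Each concurrent eval case can then mutate its own copy without racing the others.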
@@ -4,6 +4,7 @@ module RubyLLM
  module Contract
  module Concerns
  module EvalHost
+ include ContextHelpers
  def define_eval(name, &)
  @eval_definitions ||= {}
  @file_sourced_evals ||= Set.new
@@ -35,20 +36,20 @@ module RubyLLM
  !all_eval_definitions.empty?
  end

- def run_eval(name = nil, context: {})
- context ||= {}
+ def run_eval(name = nil, context: {}, concurrency: nil)
+ context = safe_context(context)
  if name
- run_single_eval(name, context)
+ run_single_eval(name, context, concurrency: concurrency)
  else
- run_all_own_evals(context)
+ run_all_own_evals(context, concurrency: concurrency)
  end
  end

  def compare_models(eval_name, models:, context: {})
- context ||= {}
+ context = safe_context(context)
  models = models.uniq
  reports = models.each_with_object({}) do |model, hash|
- model_context = deep_dup_context(context).merge(model: model)
+ model_context = isolate_context(context).merge(model: model)
  hash[model] = run_single_eval(eval_name, model_context)
  end
  Eval::ModelComparison.new(eval_name: eval_name, reports: reports)
@@ -66,24 +67,26 @@ module RubyLLM
  inherited.merge(own)
  end

- def run_single_eval(name, context)
+ def run_single_eval(name, context, concurrency: nil)
  defn = all_eval_definitions[name.to_s]
  raise ArgumentError, "No eval '#{name}' defined. Available: #{all_eval_definitions.keys}" unless defn

  effective_context = eval_context(defn, context)
- Eval::Runner.run(step: self, dataset: defn.build_dataset, context: effective_context)
+ Eval::Runner.run(step: self, dataset: defn.build_dataset, context: effective_context,
+ concurrency: concurrency)
  end

- def run_all_own_evals(context)
+ def run_all_own_evals(context, concurrency: nil)
  all_eval_definitions.transform_values do |defn|
- isolated_context = deep_dup_context(context)
+ isolated_context = isolate_context(context)
  effective_context = eval_context(defn, isolated_context)
- Eval::Runner.run(step: self, dataset: defn.build_dataset, context: effective_context)
+ Eval::Runner.run(step: self, dataset: defn.build_dataset, context: effective_context,
+ concurrency: concurrency)
  end
  end

  def eval_context(defn, context)
- context = (context || {}).transform_keys { |k| k.respond_to?(:to_sym) ? k.to_sym : k }
+ context = safe_context(context)
  return context if context[:adapter]

  sample_adapter = defn.build_adapter
@@ -105,13 +108,6 @@ module RubyLLM
  end
  end

- def deep_dup_context(context)
- context.transform_values do |v|
- v.respond_to?(:dup) ? v.dup : v
- rescue TypeError
- v
- end
- end
  end
  end
  end
@@ -10,11 +10,12 @@ module RubyLLM
  # Then configure contract-specific options:
  # RubyLLM::Contract.configure { |c| c.default_model = "gpt-4.1-mini" }
  class Configuration
- attr_accessor :default_adapter, :default_model
+ attr_accessor :default_adapter, :default_model, :logger

  def initialize
  @default_adapter = nil
  @default_model = nil
+ @logger = nil
  end
  end
  end
@@ -9,8 +9,8 @@ module RubyLLM
  def initialize(baseline_cases:, current_cases:)
  @baseline = index_by_name(baseline_cases)
  @current = index_by_name(current_cases)
- @baseline_score = baseline_cases.empty? ? 0.0 : baseline_cases.sum { |c| c[:score] } / baseline_cases.length
- @current_score = current_cases.empty? ? 0.0 : current_cases.sum { |c| c[:score] } / current_cases.length
+ @baseline_score = compute_score(baseline_cases)
+ @current_score = compute_score(current_cases)
  freeze
  end

@@ -78,6 +78,14 @@ module RubyLLM

  private

+ def compute_score(cases)
+ # Exclude skipped cases from score (consistent with Report#score)
+ evaluated = cases.reject { |c| c[:details]&.start_with?("skipped:") }
+ return 0.0 if evaluated.empty?
+
+ evaluated.sum { |c| c[:score] } / evaluated.length
+ end
+
  def index_by_name(cases)
  cases.each_with_object({}) { |c, h| h[c[:name]] = c }
  end
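The skipped-exclusion average introduced above (the "Baseline diff excludes skipped" fix from 0.3.7) is easy to check in isolation. A stdlib-only sketch with made-up case hashes — scores are floats, as they are in the gem, so the division stays fractional:

```ruby
# Average score over evaluated cases only, mirroring compute_score above.
def compute_score(cases)
  evaluated = cases.reject { |c| c[:details]&.start_with?("skipped:") }
  return 0.0 if evaluated.empty?

  evaluated.sum { |c| c[:score] } / evaluated.length
end

cases = [
  { name: "a", score: 1.0, details: "exact match" },
  { name: "b", score: 0.0, details: "skipped: No adapter configured" },
  { name: "c", score: 0.5, details: "trait mismatch" }
]

# Without exclusion the skipped zero would drag the average to 0.5;
# with it, only the two evaluated cases count: (1.0 + 0.5) / 2 = 0.75.
compute_score(cases) # => 0.75
```

This is why a self-compare with skipped cases no longer shows an artificial score delta: both sides average over the same evaluated set.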
@@ -22,7 +22,7 @@ module RubyLLM
  # dataset.case "name", input: {...}, expected: {...}
  # dataset.case "name", input: {...}, expected_traits: {...}
  # dataset.case "name", input: {...}, evaluator: proc
- def add_case(name = nil, input:, expected: nil, expected_traits: nil, evaluator: nil)
+ def add_case(name = nil, input:, expected: nil, expected_traits: nil, evaluator: nil, step_expectations: nil)
  case_name = name || "case_#{@cases.length + 1}"
  if @cases.any? { |c| c.name == case_name }
  raise ArgumentError, "Duplicate case name '#{case_name}'. Case names must be unique within a dataset."
@@ -33,7 +33,8 @@ module RubyLLM
  input: input,
  expected: expected,
  expected_traits: expected_traits,
- evaluator: evaluator
+ evaluator: evaluator,
+ step_expectations: step_expectations
  )
  end

@@ -44,14 +45,15 @@ module RubyLLM
  class Case
  include Concerns::DeepFreeze

- attr_reader :name, :input, :expected, :expected_traits, :evaluator
+ attr_reader :name, :input, :expected, :expected_traits, :evaluator, :step_expectations

- def initialize(name:, input:, expected: nil, expected_traits: nil, evaluator: nil)
+ def initialize(name:, input:, expected: nil, expected_traits: nil, evaluator: nil, step_expectations: nil)
  @name = name
  @input = deep_dup_freeze(input)
  @expected = deep_dup_freeze(expected)
  @expected_traits = deep_dup_freeze(expected_traits)
  @evaluator = evaluator
+ @step_expectations = deep_dup_freeze(step_expectations)
  freeze
  end
  end
@@ -31,7 +31,7 @@ module RubyLLM
  Adapters::Test.new(response: @sample_response)
  end

- def add_case(description, input: nil, expected: nil, expected_traits: nil, evaluator: nil)
+ def add_case(description, input: nil, expected: nil, expected_traits: nil, evaluator: nil, step_expectations: nil)
  case_input = input.nil? ? @default_input : input
  raise ArgumentError, "add_case requires input (set default_input or pass input:)" if case_input.nil?
  validate_unique_case_name!(description)
@@ -41,7 +41,8 @@ module RubyLLM
  input: case_input,
  expected: expected,
  expected_traits: expected_traits,
- evaluator: evaluator
+ evaluator: evaluator,
+ step_expectations: step_expectations
  }
  end

@@ -72,7 +73,8 @@ module RubyLLM
  eval_cases.each do |eval_case|
  add_case(eval_case[:name], input: eval_case[:input], expected: eval_case[:expected],
  expected_traits: eval_case[:expected_traits],
- evaluator: eval_case[:evaluator])
+ evaluator: eval_case[:evaluator],
+ step_expectations: eval_case[:step_expectations])
  end
  end
  end
@@ -106,15 +108,14 @@ module RubyLLM
  return if errors.empty?

  raise ArgumentError, "sample_response does not satisfy step schema: #{errors.join(", ")}"
- rescue JSON::ParserError => e
- # Non-JSON string with a structured schema = clear error
+ rescue JSON::ParserError, RubyLLM::Contract::ParseError => e
  raise ArgumentError, "sample_response is not valid JSON: #{e.message}"
  end

  def validate_sample_against_schema(schema)
  parsed = case @sample_response
  when Hash, Array then @sample_response
- when String then JSON.parse(@sample_response)
+ when String then Parser.parse(@sample_response, strategy: :json)
  else @sample_response
  end
  symbolized = deep_symbolize(parsed)
@@ -0,0 +1,79 @@
+ # frozen_string_literal: true
+
+ require "json"
+ require "fileutils"
+
+ module RubyLLM
+ module Contract
+ module Eval
+ class EvalHistory
+ attr_reader :runs
+
+ def initialize(runs:)
+ @runs = runs.freeze
+ freeze
+ end
+
+ def self.load(path)
+ return new(runs: []) unless File.exist?(path)
+
+ runs = File.readlines(path).filter_map do |line|
+ JSON.parse(line.strip, symbolize_names: true)
+ rescue JSON::ParserError
+ nil
+ end
+ new(runs: runs)
+ end
+
+ def self.append(path, run_data)
+ FileUtils.mkdir_p(File.dirname(path))
+ File.open(path, "a") { |f| f.puts(run_data.to_json) }
+ end
+
+ def score_trend
+ return :unknown if runs.length < 2
+
+ scores = runs.map { |r| r[:score] }
+ recent = scores.last(3)
+ if recent.all? { |s| s >= scores.first }
+ :stable_or_improving
+ elsif recent.last < scores.max * 0.9
+ :declining
+ else
+ :stable_or_improving
+ end
+ end
+
+ def drift?(threshold: 0.1)
+ return false if runs.length < 2
+
+ baseline_score = runs.first[:score]
+ current_score = runs.last[:score]
+ (baseline_score - current_score) > threshold
+ end
+
+ def scores
+ runs.map { |r| r[:score] }
+ end
+
+ def dates
+ runs.map { |r| r[:date] }
+ end
+
+ def latest
+ runs.last
+ end
+
+ def to_s
+ return "No history" if runs.empty?
+
+ lines = ["#{runs.length} runs"]
+ runs.last(5).each do |r|
+ lines << " #{r[:date]} score=#{r[:score].round(2)} cost=$#{format("%.6f", r[:total_cost] || r[:cost] || 0)}"
+ end
+ lines.join("\n")
+ end
+ end
+ end
+ end
+ end
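The `EvalHistory` format above is plain JSONL: one run per line, appended over time, tolerant of the odd corrupt line on load. A stdlib-only round-trip sketch of the append/load/`drift?` logic against a temp file (the run data here is made up):

```ruby
require "json"
require "tempfile"

# One JSON object per line, appended — same storage shape as EvalHistory.
def append_run(path, run)
  File.open(path, "a") { |f| f.puts(run.to_json) }
end

# Corrupt lines are dropped silently, as in EvalHistory.load.
def load_runs(path)
  File.readlines(path).filter_map do |line|
    JSON.parse(line.strip, symbolize_names: true)
  rescue JSON::ParserError
    nil
  end
end

# drift? compares first vs last score against a threshold, as above.
def drift?(runs, threshold: 0.1)
  runs.length >= 2 && (runs.first[:score] - runs.last[:score]) > threshold
end

file = Tempfile.new(["history", ".jsonl"])
append_run(file.path, { date: "2026-03-01", score: 0.92 })
append_run(file.path, { date: "2026-03-24", score: 0.75 })

runs = load_runs(file.path)
drift?(runs) # 0.92 - 0.75 = 0.17 > 0.1, so the suite has drifted
```

Because the file is append-only JSONL, CI can commit it alongside the baseline and diff it run over run.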
@@ -82,6 +82,24 @@ module RubyLLM
  lines.join("\n")
  end

+ def save_history!(path: nil, model: nil)
+ file = path || default_history_path(model: model)
+ run_data = {
+ date: Time.now.strftime("%Y-%m-%d"),
+ score: score,
+ total_cost: total_cost,
+ pass_rate: pass_rate,
+ cases_count: evaluated_results.length
+ }
+ EvalHistory.append(file, run_data)
+ file
+ end
+
+ def eval_history(path: nil, model: nil)
+ file = path || default_history_path(model: model)
+ EvalHistory.load(file)
+ end
+
  def save_baseline!(path: nil, model: nil)
  file = path || default_baseline_path(model: model)
  FileUtils.mkdir_p(File.dirname(file))
@@ -133,6 +151,15 @@ module RubyLLM
  results.reject { |r| r.step_status == :skipped }
  end

+ def default_history_path(model: nil)
+ parts = [".eval_history"]
+ parts << sanitize_name(@step_name) if @step_name
+ name = sanitize_name(dataset_name)
+ name = "#{name}_#{sanitize_name(model)}" if model
+ parts << "#{name}.jsonl"
+ File.join(*parts)
+ end
+
  def default_baseline_path(model: nil)
  parts = [".eval_baselines"]
  parts << sanitize_name(@step_name) if @step_name
@@ -6,31 +6,100 @@ module RubyLLM
  class Runner
  include TraitEvaluator
  include ContractDetailBuilder
+ include Concerns::ContextHelpers

- def self.run(step:, dataset:, context: {})
- new(step: step, dataset: dataset, context: context).run
+ def self.run(step:, dataset:, context: {}, concurrency: nil)
+ new(step: step, dataset: dataset, context: context, concurrency: concurrency).run
  end

- def initialize(step:, dataset:, context: {})
+ def initialize(step:, dataset:, context: {}, concurrency: nil)
  @step = step
  @dataset = dataset
  @context = context
+ @concurrency = concurrency
  end

  def run
- results = @dataset.cases.map { |test_case| evaluate_case(test_case) }
+ results = if @concurrency && @concurrency > 1
+ run_concurrent
+ else
+ @dataset.cases.map { |test_case| evaluate_case(test_case) }
+ end
  step_name = @step.respond_to?(:name) ? @step.name : @step.to_s
  Report.new(dataset_name: @dataset.name, results: results, step_name: step_name)
  end

  private

+ def run_concurrent
+ require "concurrent"
+ pool = Concurrent::FixedThreadPool.new(@concurrency)
+
+ # Pre-build per-case contexts: if adapter has responses:, each case
+ # gets a single-response adapter with its own response (by index).
+ per_case_contexts = build_per_case_contexts
+
+ futures = @dataset.cases.each_with_index.map do |test_case, i|
+ ctx = per_case_contexts[i]
+ Concurrent::Future.execute(executor: pool) do
+ evaluate_case_with_context(test_case, ctx)
+ end
+ end
+ futures.map(&:value!)
+ ensure
+ pool&.shutdown
+ pool&.wait_for_termination(5)
+ end
+
+ def build_per_case_contexts
+ adapter = @context[:adapter]
+ responses = adapter.respond_to?(:responses_array) ? adapter.responses_array : nil
+
+ @dataset.cases.each_with_index.map do |_, i|
+ if responses
+ # Give each case its own single-response adapter
+ response = responses[i] || responses.last
+ per_case_adapter = Adapters::Test.new(response: response)
+ @context.merge(adapter: per_case_adapter)
+ else
+ isolate_context(@context)
+ end
+ end
+ end
+
+ def evaluate_case_with_context(test_case, context)
+ run_result = @step.run(test_case.input, context: context)
+ step_result = normalize_result(run_result)
+ eval_result = dispatch_evaluation(step_result, test_case)
+
+ result = build_case_result(test_case, step_result, eval_result)
+
+ if test_case.respond_to?(:step_expectations) && test_case.step_expectations &&
+ run_result.respond_to?(:outputs_by_step)
+ evaluate_step_expectations(result, run_result.outputs_by_step, test_case.step_expectations)
+ else
+ result
+ end
+ rescue RubyLLM::Contract::Error => e
+ raise unless e.message.include?("No adapter configured")
+
+ skipped_result(test_case, e.message)
+ end
+
  def evaluate_case(test_case)
  run_result = @step.run(test_case.input, context: @context)
  step_result = normalize_result(run_result)
  eval_result = dispatch_evaluation(step_result, test_case)

- build_case_result(test_case, step_result, eval_result)
+ result = build_case_result(test_case, step_result, eval_result)
+
+ # Pipeline per-step evaluation
+ if test_case.respond_to?(:step_expectations) && test_case.step_expectations &&
+ run_result.respond_to?(:outputs_by_step)
+ evaluate_step_expectations(result, run_result.outputs_by_step, test_case.step_expectations)
+ else
+ result
+ end
  rescue RubyLLM::Contract::Error => e
  raise unless e.message.include?("No adapter configured")

@@ -145,6 +214,38 @@ module RubyLLM
  )
  end

+ def evaluate_step_expectations(result, outputs_by_step, expectations)
+ step_results = {}
+ all_passed = true
+
+ expectations.each do |step_alias, expected|
+ output = outputs_by_step[step_alias]
+ if output.nil?
+ step_results[step_alias] = { passed: false, details: "step not executed" }
+ all_passed = false
+ else
+ eval_res = dispatch_expected_evaluator(output: output, expected: expected, input: nil)
+ step_results[step_alias] = { passed: eval_res.passed, score: eval_res.score, details: eval_res.details }
+ all_passed = false unless eval_res.passed
+ end
+ end
+
+ # Rebuild CaseResult with step_results metadata
+ failed_steps = step_results.select { |_, v| !v[:passed] }
+ failure_details = failed_steps.map { |k, v| "#{k}: #{v[:details]}" }.join("; ")
+
+ CaseResult.new(
+ name: result.name, input: result.input, output: result.output,
+ expected: result.expected,
+ step_status: all_passed ? result.step_status : :step_expectation_failed,
+ score: all_passed ? result.score : 0.0,
+ passed: result.passed? && all_passed,
+ label: all_passed ? result.label : "FAIL",
+ details: all_passed ? result.details : "step expectations failed: #{failure_details}",
+ duration_ms: result.duration_ms, cost: result.cost
+ )
+ end
+
  def skipped_result(test_case, reason)
  CaseResult.new(
  name: test_case.name,
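The `run_concurrent` method above leans on concurrent-ruby's `FixedThreadPool` and `Future`. The same bounded fan-out can be sketched with stdlib threads only — the block passed in here is a made-up stand-in for `evaluate_case_with_context`, and results come back in dataset order, like `futures.map(&:value!)`:

```ruby
# Bounded worker threads pull case indices off a queue; writing results
# by index preserves the original dataset order.
def run_cases_concurrently(cases, concurrency:, &evaluate)
  queue = Queue.new
  cases.each_index { |i| queue << i }
  results = Array.new(cases.length)

  workers = Array.new(concurrency) do
    Thread.new do
      loop do
        i = begin
          queue.pop(true) # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break
        end
        results[i] = evaluate.call(cases[i])
      end
    end
  end
  workers.each(&:join)
  results
end

run_cases_concurrently(%w[a b c d], concurrency: 2) { |c| c.upcase }
# => ["A", "B", "C", "D"]
```

The gem's version adds the per-case adapter split on top, so each worker sees its own single-response test adapter instead of sharing a mutable index.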
@@ -19,8 +19,11 @@ module RubyLLM
  end

  def check_trait(output, key, expectation, errors)
- value = output.is_a?(Hash) ? output[key] : nil
- error_msg = trait_error(key, value, expectation)
+ unless output.is_a?(Hash) && output.key?(key)
+ errors << "#{key}: missing key"
+ return
+ end
+ error_msg = trait_error(key, output[key], expectation)
  errors << error_msg if error_msg
  end

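The change above is the 0.3.7 "Trait missing key = error" fix: an absent key is now reported instead of evaluating `nil` against the expectation. A stdlib-only sketch of the new behavior — `trait_error` here is a simplified stand-in that only handles Range expectations, not the gem's full evaluator:

```ruby
# Simplified stand-in: a Range expectation fails when the value is outside it.
def trait_error(key, value, expectation)
  "#{key}: #{value.inspect} not in #{expectation}" unless expectation.cover?(value)
end

# Post-0.3.7 check: a missing key is recorded instead of silently passing.
def check_trait(output, key, expectation, errors)
  unless output.is_a?(Hash) && output.key?(key)
    errors << "#{key}: missing key"
    return
  end
  error_msg = trait_error(key, output[key], expectation)
  errors << error_msg if error_msg
end

errors = []
check_trait({}, :title, 0..5, errors)           # key absent -> recorded
check_trait({ title: 3 }, :title, 0..5, errors) # in range   -> no error
errors # => ["title: missing key"]
```

Before the fix, `expected_traits: { title: 0..5 }` against `{}` would evaluate `nil` and could slip through; now the empty output fails loudly.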
@@ -15,3 +15,4 @@ require_relative "eval/report"
  require_relative "eval/eval_definition"
  require_relative "eval/model_comparison"
  require_relative "eval/baseline_diff"
+ require_relative "eval/eval_history"
@@ -0,0 +1,46 @@
+ # frozen_string_literal: true
+
+ require "ruby_llm/contract"
+
+ module RubyLLM
+ module Contract
+ module MinitestHelpers
+ def assert_satisfies_contract(result, msg = nil)
+ assert result.ok?, msg || "Expected step result to satisfy contract, " \
+ "but got status: #{result.status}. Errors: #{result.validation_errors.join(", ")}"
+ end
+
+ def refute_satisfies_contract(result, msg = nil)
+ refute result.ok?, msg || "Expected step result NOT to satisfy contract, but it passed"
+ end
+
+ def assert_eval_passes(step, eval_name, minimum_score: nil, maximum_cost: nil, context: {}, msg: nil)
+ report = step.run_eval(eval_name, context: context)
+
+ if minimum_score
+ assert report.score >= minimum_score,
+ msg || "Expected #{eval_name} eval score >= #{minimum_score}, got #{report.score.round(2)} (#{report.pass_rate})"
+ else
+ assert report.passed?,
+ msg || "Expected #{eval_name} eval to pass, got #{report.score.round(2)} (#{report.pass_rate})"
+ end
+
+ if maximum_cost
+ assert report.total_cost <= maximum_cost,
+ msg || "Expected #{eval_name} eval cost <= $#{format("%.4f", maximum_cost)}, got $#{format("%.4f", report.total_cost)}"
+ end
+
+ report
+ end
+
+ def stub_step(step_class, response: nil, responses: nil)
+ adapter = if responses
+ Adapters::Test.new(responses: responses)
+ else
+ Adapters::Test.new(response: response)
+ end
+ RubyLLM::Contract.configure { |c| c.default_adapter = adapter }
+ end
+ end
+ end
+ end
@@ -4,14 +4,16 @@ module RubyLLM
  module Contract
  module Prompt
  class Builder
+ NOT_PROVIDED = Object.new.freeze
+
  def initialize(block)
  @block = block
  @nodes = []
  end

- def build(input = nil)
+ def build(input = NOT_PROVIDED)
  @nodes = []
- if !input.nil? && @block.arity >= 1
+ if input != NOT_PROVIDED && @block.arity >= 1
  instance_exec(input, &@block)
  else
  instance_eval(&@block)
@@ -39,7 +41,7 @@ module RubyLLM
  @nodes << Nodes::SectionNode.new(name, text)
  end

- def self.build(input: nil, &block)
+ def self.build(input: NOT_PROVIDED, &block)
  new(block).build(input)
  end
  end
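The `NOT_PROVIDED` sentinel above is what makes the 0.3.7 "nil input in dynamic prompts" fix work: `build` can now tell "called with nil" apart from "called with no argument", which a `nil` default cannot. A stdlib-only sketch of the pattern — `greet` is a made-up example, not gem API:

```ruby
# A frozen private object no caller can accidentally pass in.
NOT_PROVIDED = Object.new.freeze

# With a nil default, greet(nil) and greet() would be indistinguishable;
# with the sentinel, nil is a real, deliberate argument.
def greet(name = NOT_PROVIDED)
  return "anonymous caller" if name.equal?(NOT_PROVIDED)

  "hello, #{name.inspect}"
end

greet          # => "anonymous caller"
greet(nil)     # => "hello, nil"
greet("ruby")  # => "hello, \"ruby\""
```

That is why `run(nil)` with `prompt { |input| ... }` now correctly hands `nil` to the block instead of falling back to the no-argument path.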
@@ -3,15 +3,21 @@
  module Contract
  class Railtie < ::Rails::Railtie
- # Eval files (e.g. classify_threads_eval.rb) don't define Zeitwerk-compatible
- # constants — they call define_eval on an existing Step class. We use `load`
- # after initialization, and hook into the reloader for development.
+ # Ignore eval/ subdirs BEFORE Zeitwerk setup — eval files don't define
+ # constants, they call define_eval on existing Step classes.
+ initializer "ruby_llm_contract.ignore_eval_dirs", before: :set_autoload_paths do |app|
+ %w[app/contracts/eval app/steps/eval].each do |path|
+ full = app.root.join(path)
+ next unless full.exist?
+
+ Rails.autoloaders.each { |loader| loader.ignore(full.to_s) }
+ end
+ end

  config.after_initialize do
  RubyLLM::Contract.load_evals!
  end

- # Re-load eval files on code reload in development (Spring, zeitwerk:check, etc.)
  config.to_prepare do
  RubyLLM::Contract.load_evals!
  end
@@ -60,8 +60,10 @@ module RubyLLM

  KNOWN_CONTEXT_KEYS = %i[adapter model temperature max_tokens provider assume_model_exists].freeze

+ include Concerns::ContextHelpers
+
  def run(input, context: {})
- context = (context || {}).transform_keys { |k| k.respond_to?(:to_sym) ? k.to_sym : k }
+ context = safe_context(context)
  warn_unknown_context_keys(context)
  adapter = resolve_adapter(context)
  default_model = context[:model] || model || RubyLLM::Contract.configuration.default_model
@@ -77,12 +79,14 @@ module RubyLLM
  context_temperature: ctx_temp, extra_options: extra)
  end

+ log_result(result)
  invoke_around_call(input, result)
  end

  def build_messages(input)
  dynamic = prompt.arity >= 1
- ast = Prompt::Builder.build(input: dynamic ? input : nil, &prompt)
+ builder_input = dynamic ? input : Prompt::Builder::NOT_PROVIDED
+ ast = Prompt::Builder.build(input: builder_input, &prompt)
  variables = dynamic ? {} : { input: input }
  variables.merge!(input.transform_keys(&:to_sym)) if !dynamic && input.is_a?(Hash)
  Prompt::Renderer.render(ast, variables: variables)
@@ -120,6 +124,19 @@ module RubyLLM
  validation_errors: [e.message])
  end

+ def log_result(result)
+ logger = RubyLLM::Contract.configuration.logger
+ return unless logger
+
+ trace = result.trace
+ msg = "[ruby_llm-contract] #{name || self} " \
+ "model=#{trace.model} status=#{result.status} " \
+ "latency=#{trace.latency_ms}ms " \
+ "tokens=#{trace.usage&.dig(:input_tokens) || 0}+#{trace.usage&.dig(:output_tokens) || 0} " \
+ "cost=$#{format("%.6f", trace.cost || 0)}"
+ logger.info(msg)
+ end
+
  def invoke_around_call(input, result)
  return result unless around_call

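The `log_result` above emits one info line per `step.run` when a logger is configured. A stdlib-only sketch of the same line format, using a `Struct` in place of the gem's trace object (the step name and numbers here are made up):

```ruby
require "logger"

# Made-up stand-in for the gem's trace; field names follow log_result above.
Trace = Struct.new(:model, :latency_ms, :usage, :cost, keyword_init: true)

def build_log_line(step_name, status, trace)
  "[ruby_llm-contract] #{step_name} " \
    "model=#{trace.model} status=#{status} " \
    "latency=#{trace.latency_ms}ms " \
    "tokens=#{trace.usage&.dig(:input_tokens) || 0}+#{trace.usage&.dig(:output_tokens) || 0} " \
    "cost=$#{format("%.6f", trace.cost || 0)}"
end

trace = Trace.new(model: "gpt-4.1-mini", latency_ms: 412,
                  usage: { input_tokens: 180, output_tokens: 42 }, cost: 0.000214)
line = build_log_line("ClassifyThread", :ok, trace)
Logger.new($stdout).info(line)
# [ruby_llm-contract] ClassifyThread model=gpt-4.1-mini status=ok latency=412ms tokens=180+42 cost=$0.000214
```

One flat key=value line per call keeps the output grep-friendly in Rails logs without a structured-logging dependency.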
@@ -2,6 +2,6 @@

  module RubyLLM
  module Contract
- VERSION = "0.3.6"
+ VERSION = "0.4.0"
  end
  end
@@ -88,9 +88,9 @@ module RubyLLM
  full = ::Rails.root.join(path)
  next unless full.exist?

+ # eval/ subdirs already ignored by Railtie initializer (before Zeitwerk setup)
  ::Rails.autoloaders.main.eager_load_dir(full.to_s)
  rescue StandardError
- # Zeitwerk not available or dir not managed — skip
  nil
  end
  end
@@ -105,6 +105,7 @@ module RubyLLM
  end
  end

+ require_relative "contract/concerns/context_helpers"
  require_relative "contract/concerns/deep_freeze"
  require_relative "contract/concerns/deep_symbolize"
  require_relative "contract/concerns/eval_host"
@@ -7,9 +7,10 @@ Gem::Specification.new do |spec|
  spec.version = RubyLLM::Contract::VERSION
  spec.authors = ["Justyna"]

- spec.summary = "Contract-first LLM step execution for RubyLLM"
- spec.description = "Turn RubyLLM calls into contracted, validated, testable steps with schema enforcement, " \
- "retry with model escalation, and eval."
+ spec.summary = "Know which LLM model to use, what it costs, and when accuracy drops"
+ spec.description = "Compare LLM models by accuracy and cost. Regression-test prompts in CI. " \
+ "Start on nano, auto-escalate to bigger models when quality drops. " \
+ "Companion gem for ruby_llm."
  spec.homepage = "https://github.com/justi/ruby_llm-contract"
  spec.license = "MIT"
  spec.required_ruby_version = ">= 3.2.0"
@@ -17,6 +18,7 @@ Gem::Specification.new do |spec|
  spec.metadata["homepage_uri"] = spec.homepage
  spec.metadata["source_code_uri"] = spec.homepage
  spec.metadata["changelog_uri"] = "#{spec.homepage}/blob/main/CHANGELOG.md"
+ spec.metadata["documentation_uri"] = "#{spec.homepage}#readme"
  spec.metadata["rubygems_mfa_required"] = "true"

  spec.files = Dir.chdir(__dir__) do
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: ruby_llm-contract
  version: !ruby/object:Gem::Version
- version: 0.3.6
+ version: 0.4.0
  platform: ruby
  authors:
  - Justyna
@@ -51,8 +51,9 @@ dependencies:
  - - "~>"
  - !ruby/object:Gem::Version
  version: '0.3'
- description: Turn RubyLLM calls into contracted, validated, testable steps with schema
- enforcement, retry with model escalation, and eval.
+ description: Compare LLM models by accuracy and cost. Regression-test prompts in CI.
+ Start on nano, auto-escalate to bigger models when quality drops. Companion gem
+ for ruby_llm.
  executables: []
  extensions: []
  extra_rdoc_files: []
@@ -82,6 +83,7 @@ files:
  - lib/ruby_llm/contract/adapters/response.rb
  - lib/ruby_llm/contract/adapters/ruby_llm.rb
  - lib/ruby_llm/contract/adapters/test.rb
+ - lib/ruby_llm/contract/concerns/context_helpers.rb
  - lib/ruby_llm/contract/concerns/deep_freeze.rb
  - lib/ruby_llm/contract/concerns/deep_symbolize.rb
  - lib/ruby_llm/contract/concerns/eval_host.rb
@@ -103,6 +105,7 @@ files:
  - lib/ruby_llm/contract/eval/contract_detail_builder.rb
  - lib/ruby_llm/contract/eval/dataset.rb
  - lib/ruby_llm/contract/eval/eval_definition.rb
+ - lib/ruby_llm/contract/eval/eval_history.rb
  - lib/ruby_llm/contract/eval/evaluation_result.rb
  - lib/ruby_llm/contract/eval/evaluator/exact.rb
  - lib/ruby_llm/contract/eval/evaluator/json_includes.rb
@@ -113,6 +116,7 @@ files:
  - lib/ruby_llm/contract/eval/report.rb
  - lib/ruby_llm/contract/eval/runner.rb
  - lib/ruby_llm/contract/eval/trait_evaluator.rb
+ - lib/ruby_llm/contract/minitest.rb
  - lib/ruby_llm/contract/pipeline.rb
  - lib/ruby_llm/contract/pipeline/base.rb
  - lib/ruby_llm/contract/pipeline/result.rb
@@ -154,6 +158,7 @@ metadata:
  homepage_uri: https://github.com/justi/ruby_llm-contract
  source_code_uri: https://github.com/justi/ruby_llm-contract
  changelog_uri: https://github.com/justi/ruby_llm-contract/blob/main/CHANGELOG.md
+ documentation_uri: https://github.com/justi/ruby_llm-contract#readme
  rubygems_mfa_required: 'true'
  rdoc_options: []
  require_paths:
@@ -171,5 +176,5 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  requirements: []
  rubygems_version: 3.6.7
  specification_version: 4
- summary: Contract-first LLM step execution for RubyLLM
+ summary: Know which LLM model to use, what it costs, and when accuracy drops
  test_files: []