ruby_llm-contract 0.2.3 → 0.3.6

Files changed (38)
  1. checksums.yaml +4 -4
  2. data/CHANGELOG.md +64 -0
  3. data/Gemfile.lock +2 -2
  4. data/README.md +27 -2
  5. data/lib/ruby_llm/contract/adapters/response.rb +4 -2
  6. data/lib/ruby_llm/contract/adapters/ruby_llm.rb +3 -3
  7. data/lib/ruby_llm/contract/adapters/test.rb +3 -2
  8. data/lib/ruby_llm/contract/concerns/deep_freeze.rb +23 -0
  9. data/lib/ruby_llm/contract/concerns/eval_host.rb +11 -2
  10. data/lib/ruby_llm/contract/contract/schema_validator.rb +70 -3
  11. data/lib/ruby_llm/contract/eval/baseline_diff.rb +92 -0
  12. data/lib/ruby_llm/contract/eval/dataset.rb +11 -4
  13. data/lib/ruby_llm/contract/eval/eval_definition.rb +36 -14
  14. data/lib/ruby_llm/contract/eval/model_comparison.rb +1 -1
  15. data/lib/ruby_llm/contract/eval/report.rb +71 -2
  16. data/lib/ruby_llm/contract/eval/runner.rb +5 -3
  17. data/lib/ruby_llm/contract/eval/trait_evaluator.rb +6 -0
  18. data/lib/ruby_llm/contract/eval.rb +1 -0
  19. data/lib/ruby_llm/contract/pipeline/base.rb +1 -1
  20. data/lib/ruby_llm/contract/pipeline/result.rb +1 -1
  21. data/lib/ruby_llm/contract/pipeline/runner.rb +1 -1
  22. data/lib/ruby_llm/contract/pipeline/trace.rb +3 -2
  23. data/lib/ruby_llm/contract/prompt/builder.rb +2 -1
  24. data/lib/ruby_llm/contract/prompt/node.rb +2 -2
  25. data/lib/ruby_llm/contract/prompt/nodes/example_node.rb +2 -2
  26. data/lib/ruby_llm/contract/rake_task.rb +31 -4
  27. data/lib/ruby_llm/contract/rspec/helpers.rb +28 -8
  28. data/lib/ruby_llm/contract/rspec/pass_eval.rb +23 -2
  29. data/lib/ruby_llm/contract/step/base.rb +10 -5
  30. data/lib/ruby_llm/contract/step/dsl.rb +1 -1
  31. data/lib/ruby_llm/contract/step/limit_checker.rb +1 -1
  32. data/lib/ruby_llm/contract/step/retry_executor.rb +3 -2
  33. data/lib/ruby_llm/contract/step/retry_policy.rb +7 -1
  34. data/lib/ruby_llm/contract/step/runner.rb +10 -2
  35. data/lib/ruby_llm/contract/step/trace.rb +5 -4
  36. data/lib/ruby_llm/contract/version.rb +1 -1
  37. data/lib/ruby_llm/contract.rb +36 -17
  38. metadata +3 -1
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: '080fd81afd87ad234cf66f7577080a4ac55a59f890e0c8c479479fccec57ad32'
4
- data.tar.gz: cdabbac3ea1d81e1abd3cb850e927f410d98282bd23111be79463804ea4d84b9
3
+ metadata.gz: 35a61fe65d6a7939e3ef22bdd37732d2ae6cd5643f51d595a3f26b4281eea396
4
+ data.tar.gz: 9b1b95b29c31e433af60c25e85dfdebf3e8e71cb85c0e568835309a7cd855926
5
5
  SHA512:
6
- metadata.gz: 294b36f7264a2ba8b04334f3fd1c6b4433466a04c6be4aaccf23a92df3c7e92d04061ace018aa5243e28a9ef4fe64abc7f6de5ec11143c32bf5466bf591b9130
7
- data.tar.gz: d7447319e3389264571209bc84d7dc84a441ffb76d1f64506d9cac2dc1953d26ba8cf3e1eb4169adf64576f1be7ae182bdf0c6e8e6b876220b102cea1e653fa6
6
+ metadata.gz: 0bb0333b6c362b1687b51f6bf360fd6d659c066a2a5b4b539bab4795150e5c1c8dbebe8dac6d05791b62958058d60418e5ff1f2b5db1f050f29412ed136494a5
7
+ data.tar.gz: ff5a8e7c30344993617bdd5f85d857e91d0cb633e2b7fe35a08aadf0790a4c7c0389cb017f92a192d199fe1eaba9526c509d5731321b36bd2c6e5fdedb5ca6d0
data/CHANGELOG.md CHANGED
@@ -1,5 +1,69 @@
1
1
  # Changelog
2
2
 
3
+ ## 0.3.6 (2026-03-24)
4
+
5
+ - **Recursive array/object validation** — nested arrays (`array of array of string`) validated recursively. Object items validated even without `:properties` (e.g. `additionalProperties: false`).
6
+ - **Deep symbolize in sample pre-validation** — array samples with string keys (`[{"name" => "Alice"}]`) correctly symbolized before schema validation.
7
+
8
+ ## 0.3.5 (2026-03-24)
9
+
10
+ - **String constraints in SchemaValidator** — `minLength`/`maxLength` enforced for root and nested strings.
11
+ - **Array item validation** — scalar items (string, integer) validated against items schema type and constraints.
12
+ - **Non-JSON sample_response fails fast** — `sample_response("hello")` with object schema raises ArgumentError at definition time instead of silently passing.
13
+ - **`max_tokens` in KNOWN_CONTEXT_KEYS** — no more spurious "Unknown context keys" warning.
14
+ - **Duplicate models deduplicated** — `compare_models(models: ["m", "m"])` runs model once.
15
+
16
+ ## 0.3.4 (2026-03-24)
17
+
18
+ - **SchemaValidator validates non-object roots** — boolean, integer, number, array root schemas now enforce type, min/max, enum, minItems/maxItems. Previously only object schemas were validated.
19
+ - **Removed passing cases = regression** — `regressed?` returns true when baseline had passing cases that are now missing. Prevents gate bypass by deleting eval cases.
20
+ - **JSON string sample_response fixed** — `sample_response('{"name":"Alice"}')` correctly parsed for pre-validation instead of double-encoding.
21
+ - **`context[:max_tokens]` forwarded** — overrides step's `max_output` for adapter call AND budget precheck.
22
+
23
+ ## 0.3.3 (2026-03-23)
24
+
25
+ - **Skipped cases visible in regression diff** — baseline PASS → current SKIP now detected as regression by `without_regressions` and `fail_on_regression`.
26
+ - **Skip only on missing adapter** — eval runner no longer masks evaluator errors as SKIP. Only "No adapter configured" triggers skip.
27
+ - **Array/Hash sample pre-validation** — `sample_response([{...}])` correctly validated against schema instead of silently skipping.
28
+ - **`assume_model_exists: false` forwarded** — boolean `false` no longer dropped by truthiness check in adapter options.
29
+ - **Duplicate case names caught at definition** — `add_case`/`verify` with same name raises immediately, not at run time.
30
+
31
+ ## 0.3.2 (2026-03-23)
32
+
33
+ - **Array response preserved** — `Adapters::RubyLLM` no longer stringifies Array content. Steps with `output_type Array` work correctly.
34
+ - **Falsy prompt input** — `run(false)` and `build_messages(false)` pass `false` to dynamic prompt blocks instead of falling back to `instance_eval`.
35
+ - **`retry_on` flatten** — `retry_on([:a, :b])` no longer wraps in nested array.
36
+ - **Builder reset** — `Prompt::Builder` resets nodes on each build (no accumulation on reuse).
37
+ - **Pipeline false output** — `output: false` no longer shows "(no output)" in pretty_print.
38
+
39
+ ## 0.3.1 (2026-03-23)
40
+
41
+ Fixes from persona_tool production deployment (4 services migrated).
42
+
43
+ - **Proc/Lambda in `expected_traits`** — `expected_traits: { score: ->(v) { v > 3 } }` now works.
44
+ - **Zeitwerk eager-load** — `load_evals!` eager-loads `app/contracts/` and `app/steps/` before loading eval files. Fixes uninitialized constant errors in Rake tasks.
45
+ - **Falsy values** — `expected: false`, `input: false`, `sample_response(nil)` all handled correctly.
46
+ - **Context key forwarding** — `provider:` and `assume_model_exists:` forwarded to adapter. `schema:` and `max_tokens:` are step-level only (no split-brain).
47
+ - **Deep-freeze immutability** — constructors never mutate caller's data.
48
+
49
+ ## 0.3.0 (2026-03-23)
50
+
51
+ Baseline regression detection — know when quality drops before users do.
52
+
53
+ ### Features
54
+
55
+ - **`report.save_baseline!`** — serialize eval results to `.eval_baselines/` (JSON, git-tracked)
56
+ - **`report.compare_with_baseline`** — returns `BaselineDiff` with regressions, improvements, score_delta, new/removed cases
57
+ - **`diff.regressed?`** — true when any previously-passing case now fails
58
+ - **`without_regressions` RSpec chain** — `expect(Step).to pass_eval("x").without_regressions`
59
+ - **RakeTask `fail_on_regression`** — blocks CI when regressions detected
60
+ - **RakeTask `save_baseline`** — auto-save after successful run
61
+ - **Migration guide** — `docs/guide/migration.md` with 7 patterns for adopting the gem in existing Rails apps
62
+
63
+ ### Stats
64
+
65
+ - 1086 tests, 0 failures
66
+
3
67
  ## 0.2.3 (2026-03-23)
4
68
 
5
69
  Production hardening from senior Rails review panel.
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
1
1
  PATH
2
2
  remote: .
3
3
  specs:
4
- ruby_llm-contract (0.2.3)
4
+ ruby_llm-contract (0.3.6)
5
5
  dry-types (~> 1.7)
6
6
  ruby_llm (~> 1.0)
7
7
  ruby_llm-schema (~> 0.3)
@@ -165,7 +165,7 @@ CHECKSUMS
165
165
  rubocop-ast (1.49.1) sha256=4412f3ee70f6fe4546cc489548e0f6fcf76cafcfa80fa03af67098ffed755035
166
166
  ruby-progressbar (1.13.0) sha256=80fc9c47a9b640d6834e0dc7b3c94c9df37f08cb072b7761e4a71e22cff29b33
167
167
  ruby_llm (1.14.0) sha256=57c6f7034fc4a44504ea137d70f853b07824f1c1cdbe774ab3ab3522e7098deb
168
- ruby_llm-contract (0.2.3)
168
+ ruby_llm-contract (0.3.6)
169
169
  ruby_llm-schema (0.3.0) sha256=a591edc5ca1b7f0304f0e2261de61ba4b3bea17be09f5cf7558153adfda3dec6
170
170
  unicode-display_width (3.2.0) sha256=0cdd96b5681a5949cdbc2c55e7b420facae74c4aaf9a9815eee1087cb1853c42
171
171
  unicode-emoji (4.2.0) sha256=519e69150f75652e40bf736106cfbc8f0f73aa3fb6a65afe62fefa7f80b0f80f
data/README.md CHANGED
@@ -111,6 +111,30 @@ end
111
111
  # bundle exec rake ruby_llm_contract:eval
112
112
  ```
113
113
 
114
+ ## Detect quality drops
115
+
116
+ Save a baseline. Next run, see what regressed.
117
+
118
+ ```ruby
119
+ report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
120
+ report.save_baseline!(model: "gpt-4.1-nano")
121
+
122
+ # Later — after prompt change, model update, or provider weight shift:
123
+ report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
124
+ diff = report.compare_with_baseline(model: "gpt-4.1-nano")
125
+
126
+ diff.regressed? # => true
127
+ diff.regressions # => [{case: "outage", baseline: {passed: true}, current: {passed: false}}]
128
+ diff.score_delta # => -0.33
129
+ ```
130
+
131
+ ```ruby
132
+ # CI: block merge if any previously-passing case now fails
133
+ expect(ClassifyTicket).to pass_eval("regression")
134
+ .with_context(model: "gpt-4.1-nano")
135
+ .without_regressions
136
+ ```
137
+
114
138
  ## Predict cost before running
115
139
 
116
140
  ```ruby
@@ -140,12 +164,13 @@ Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
140
164
  | [Output Schema](docs/guide/output_schema.md) | Full schema reference + constraints |
141
165
  | [Pipeline](docs/guide/pipeline.md) | Multi-step composition, timeout, fail-fast |
142
166
  | [Testing](docs/guide/testing.md) | Test adapter, RSpec matchers |
167
+ | [Migration](docs/guide/migration.md) | Adopting the gem in existing Rails apps |
143
168
 
144
169
  ## Roadmap
145
170
 
146
- **v0.2 (current):** Model comparison, cost tracking, eval with `add_case`, CI gating, Rails Railtie.
171
+ **v0.3 (current):** Baseline regression detection `save_baseline!`, `compare_with_baseline`, `without_regressions`. Migration guide.
147
172
 
148
- **v0.3:** Regression baselines compare eval results with previous run, detect quality drift.
173
+ **v0.2:** Model comparison, cost tracking, eval with `add_case`, CI gating, Rails Railtie.
149
174
 
150
175
  **v0.4:** Auto-routing — learn which model works for which input pattern.
151
176
 
data/lib/ruby_llm/contract/adapters/response.rb CHANGED
@@ -4,11 +4,13 @@ module RubyLLM
4
4
  module Contract
5
5
  module Adapters
6
6
  class Response
7
+ include Concerns::DeepFreeze
8
+
7
9
  attr_reader :content, :usage
8
10
 
9
11
  def initialize(content:, usage: {})
10
- @content = content
11
- @usage = usage
12
+ @content = deep_dup_freeze(content)
13
+ @usage = deep_dup_freeze(usage)
12
14
  freeze
13
15
  end
14
16
  end
data/lib/ruby_llm/contract/adapters/ruby_llm.rb CHANGED
@@ -43,8 +43,8 @@ module RubyLLM
43
43
 
44
44
  def chat_constructor_options(options)
45
45
  opts = { model: options[:model] }
46
- opts[:provider] = options[:provider] if options[:provider]
47
- opts[:assume_model_exists] = options[:assume_model_exists] if options[:assume_model_exists]
46
+ opts[:provider] = options[:provider] if options.key?(:provider)
47
+ opts[:assume_model_exists] = options[:assume_model_exists] if options.key?(:assume_model_exists)
48
48
  opts
49
49
  end
50
50
 
@@ -57,7 +57,7 @@ module RubyLLM
57
57
 
58
58
  def build_response(response)
59
59
  content = response.content
60
- content = content.to_s unless content.is_a?(Hash)
60
+ content = content.to_s unless content.is_a?(Hash) || content.is_a?(Array)
61
61
 
62
62
  Response.new(
63
63
  content: content,
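The `chat_constructor_options` change above swaps a truthiness check for `key?`, which is what lets an explicit `assume_model_exists: false` reach the adapter. A minimal standalone sketch of the difference, assuming nothing beyond plain hashes (the two helper names are hypothetical and exist only for this comparison):

```ruby
# Hypothetical helpers contrasting the old and new guard; not part of the gem.
def options_by_truthiness(options)
  opts = { model: options[:model] }
  # Old guard: a literal false is falsy, so the key is silently dropped.
  opts[:assume_model_exists] = options[:assume_model_exists] if options[:assume_model_exists]
  opts
end

def options_by_key(options)
  opts = { model: options[:model] }
  # New guard: forward the value whenever the caller supplied the key.
  opts[:assume_model_exists] = options[:assume_model_exists] if options.key?(:assume_model_exists)
  opts
end

input = { model: "gpt-4.1-nano", assume_model_exists: false }

options_by_truthiness(input) # => {model: "gpt-4.1-nano"}
options_by_key(input)        # => {model: "gpt-4.1-nano", assume_model_exists: false}
```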
data/lib/ruby_llm/contract/adapters/test.rb CHANGED
@@ -4,8 +4,9 @@ module RubyLLM
4
4
  module Contract
5
5
  module Adapters
6
6
  class Test < Base
7
- def initialize(response: nil, responses: nil)
7
+ def initialize(response: nil, responses: nil, usage: nil)
8
8
  super()
9
+ @usage = (usage || { input_tokens: 0, output_tokens: 0 }).dup.freeze
9
10
  if responses
10
11
  raise ArgumentError, "responses: must not be empty (use response: nil for nil content)" if responses.empty?
11
12
 
@@ -36,7 +37,7 @@ module RubyLLM
36
37
  else
37
38
  @response
38
39
  end
39
- Response.new(content: content, usage: { input_tokens: 0, output_tokens: 0 })
40
+ Response.new(content: content, usage: @usage)
40
41
  end
41
42
  end
42
43
  end
data/lib/ruby_llm/contract/concerns/deep_freeze.rb ADDED
@@ -0,0 +1,23 @@
1
+ # frozen_string_literal: true
2
+
3
+ module RubyLLM
4
+ module Contract
5
+ module Concerns
6
+ # Deep-duplicate and freeze a value. Creates an independent frozen copy
7
+ # without mutating the original. Handles Hash, Array, String recursively.
8
+ module DeepFreeze
9
+ private
10
+
11
+ def deep_dup_freeze(obj)
12
+ case obj
13
+ when NilClass, Integer, Float, Symbol, TrueClass, FalseClass then obj
14
+ when Hash then obj.transform_values { |v| deep_dup_freeze(v) }.freeze
15
+ when Array then obj.map { |v| deep_dup_freeze(v) }.freeze
16
+ when String then obj.frozen? ? obj : obj.dup.freeze
17
+ else obj.frozen? ? obj : obj.dup.freeze
18
+ end
19
+ end
20
+ end
21
+ end
22
+ end
23
+ end
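To illustrate what "constructors never mutate caller's data" means in practice, here is a standalone sketch. The method body mirrors `deep_dup_freeze` from the diff above (with the two identical String/else branches merged); the module name `DeepFreezeSketch` and the `Snapshot` class are hypothetical and only serve the demonstration.

```ruby
# Standalone copy for illustration; DeepFreezeSketch and Snapshot are made-up names.
module DeepFreezeSketch
  private

  def deep_dup_freeze(obj)
    case obj
    when NilClass, Integer, Float, Symbol, TrueClass, FalseClass then obj
    when Hash then obj.transform_values { |v| deep_dup_freeze(v) }.freeze
    when Array then obj.map { |v| deep_dup_freeze(v) }.freeze
    else obj.frozen? ? obj : obj.dup.freeze
    end
  end
end

class Snapshot
  include DeepFreezeSketch

  attr_reader :data

  def initialize(data)
    @data = deep_dup_freeze(data) # independent frozen copy
    freeze
  end
end

original = { tags: ["a", "b"], meta: { name: +"Alice" } }
snapshot = Snapshot.new(original)

original[:tags] << "c"              # the caller's structure is still mutable
snapshot.data[:tags]                # => ["a", "b"]
snapshot.data.frozen?               # => true
snapshot.data[:meta][:name].frozen? # => true
original[:meta][:name].frozen?      # => false
```

Hash keys are reused rather than copied in this sketch, which matches the `transform_values` call in the diff.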
data/lib/ruby_llm/contract/concerns/eval_host.rb CHANGED
@@ -6,6 +6,7 @@ module RubyLLM
6
6
  module EvalHost
7
7
  def define_eval(name, &)
8
8
  @eval_definitions ||= {}
9
+ @file_sourced_evals ||= Set.new
9
10
  key = name.to_s
10
11
 
11
12
  if @eval_definitions.key?(key) && !Thread.current[:ruby_llm_contract_reloading]
@@ -14,12 +15,16 @@ module RubyLLM
14
15
  end
15
16
 
16
17
  @eval_definitions[key] = Eval::EvalDefinition.new(key, step_class: self, &)
18
+ @file_sourced_evals.add(key) if Thread.current[:ruby_llm_contract_reloading]
17
19
  Contract.register_eval_host(self)
18
20
  register_subclasses(self)
19
21
  end
20
22
 
21
- def clear_eval_definitions!
22
- @eval_definitions = {}
23
+ def clear_file_sourced_evals!
24
+ return unless defined?(@file_sourced_evals) && defined?(@eval_definitions)
25
+
26
+ @file_sourced_evals.each { |key| @eval_definitions.delete(key) }
27
+ @file_sourced_evals.clear
23
28
  end
24
29
 
25
30
  def eval_names
@@ -31,6 +36,7 @@ module RubyLLM
31
36
  end
32
37
 
33
38
  def run_eval(name = nil, context: {})
39
+ context ||= {}
34
40
  if name
35
41
  run_single_eval(name, context)
36
42
  else
@@ -39,6 +45,8 @@ module RubyLLM
39
45
  end
40
46
 
41
47
  def compare_models(eval_name, models:, context: {})
48
+ context ||= {}
49
+ models = models.uniq
42
50
  reports = models.each_with_object({}) do |model, hash|
43
51
  model_context = deep_dup_context(context).merge(model: model)
44
52
  hash[model] = run_single_eval(eval_name, model_context)
@@ -75,6 +83,7 @@ module RubyLLM
75
83
  end
76
84
 
77
85
  def eval_context(defn, context)
86
+ context = (context || {}).transform_keys { |k| k.respond_to?(:to_sym) ? k.to_sym : k }
78
87
  return context if context[:adapter]
79
88
 
80
89
  sample_adapter = defn.build_adapter
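The `eval_context` hunk above guards against a `nil` context and normalizes string keys to symbols, while `compare_models` deduplicates the model list with `uniq`. A small sketch of both behaviors in isolation, assuming plain hashes and arrays (`normalize_context` is a hypothetical helper name, not a method of the gem):

```ruby
# Hypothetical helper reproducing the normalization expression from the diff.
def normalize_context(context)
  (context || {}).transform_keys { |k| k.respond_to?(:to_sym) ? k.to_sym : k }
end

normalize_context(nil)                                       # => {}
normalize_context("model" => "gpt-4.1-nano", adapter: :test) # => {model: "gpt-4.1-nano", adapter: :test}
normalize_context(1 => "x")                                  # => {1 => "x"} (Integer has no to_sym)

# Deduplication as in compare_models: each model is evaluated once.
models = ["m", "m", "n"].uniq # => ["m", "n"]
```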
data/lib/ruby_llm/contract/contract/schema_validator.rb CHANGED
@@ -40,10 +40,77 @@ module RubyLLM
40
40
 
41
41
  def validate_non_hash_output
42
42
  expected_type = @json_schema[:type]&.to_s
43
+
43
44
  if expected_type == "object" || @json_schema.key?(:properties)
44
- ["expected object, got #{@output.class}"]
45
- else
46
- []
45
+ return ["expected object, got #{@output.class}"]
46
+ end
47
+
48
+ errors = []
49
+ validate_type_match(errors, @output, expected_type, "root") if expected_type
50
+ validate_constraints(errors, @output, @json_schema, "root")
51
+
52
+ if expected_type == "array" && @output.is_a?(Array) && @json_schema[:items]
53
+ validate_array_items(errors, @output, @json_schema[:items], "")
54
+ end
55
+
56
+ errors
57
+ end
58
+
59
+ def validate_array_items(errors, array, items_schema, prefix)
60
+ array.each_with_index do |item, i|
61
+ item_prefix = "#{prefix}[#{i}]"
62
+ validate_value(errors, item, items_schema, item_prefix)
63
+ end
64
+ end
65
+
66
+ def validate_value(errors, value, schema, prefix)
67
+ value_type = schema[:type]&.to_s
68
+
69
+ validate_type_match(errors, value, value_type, prefix) if value_type
70
+ validate_constraints(errors, value, schema, prefix)
71
+
72
+ if value.is_a?(Hash) && (schema.key?(:properties) || value_type == "object")
73
+ validate_object(value, schema, prefix: prefix)
74
+ errors.concat(@errors)
75
+ @errors = []
76
+ elsif value.is_a?(Array) && schema[:items]
77
+ validate_array_items(errors, value, schema[:items], prefix)
78
+ end
79
+ end
80
+
81
+ def validate_type_match(errors, value, expected_type, prefix)
82
+ valid = case expected_type
83
+ when "string" then value.is_a?(String)
84
+ when "integer" then value.is_a?(Integer)
85
+ when "number" then value.is_a?(Numeric)
86
+ when "boolean" then value.is_a?(TrueClass) || value.is_a?(FalseClass)
87
+ when "array" then value.is_a?(Array)
88
+ else true
89
+ end
90
+ errors << "#{prefix}: expected #{expected_type}, got #{value.class}" unless valid
91
+ end
92
+
93
+ def validate_constraints(errors, value, schema, prefix)
94
+ if schema[:minimum] && value.is_a?(Numeric) && value < schema[:minimum]
95
+ errors << "#{prefix}: #{value} is less than minimum #{schema[:minimum]}"
96
+ end
97
+ if schema[:maximum] && value.is_a?(Numeric) && value > schema[:maximum]
98
+ errors << "#{prefix}: #{value} is greater than maximum #{schema[:maximum]}"
99
+ end
100
+ if schema[:enum] && !schema[:enum].include?(value)
101
+ errors << "#{prefix}: #{value.inspect} is not in enum #{schema[:enum].inspect}"
102
+ end
103
+ if schema[:minItems] && value.is_a?(Array) && value.length < schema[:minItems]
104
+ errors << "#{prefix}: array has #{value.length} items, minimum #{schema[:minItems]}"
105
+ end
106
+ if schema[:maxItems] && value.is_a?(Array) && value.length > schema[:maxItems]
107
+ errors << "#{prefix}: array has #{value.length} items, maximum #{schema[:maxItems]}"
108
+ end
109
+ if schema[:minLength] && value.is_a?(String) && value.length < schema[:minLength]
110
+ errors << "#{prefix}: string length #{value.length} is less than minLength #{schema[:minLength]}"
111
+ end
112
+ if schema[:maxLength] && value.is_a?(String) && value.length > schema[:maxLength]
113
+ errors << "#{prefix}: string length #{value.length} is greater than maxLength #{schema[:maxLength]}"
47
114
  end
48
115
  end
49
116
 
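The recursion introduced above (`validate_value` calling `validate_array_items`, which calls `validate_value` again) can be seen more easily in a reduced form. The following is a simplified sketch, not the gem's `SchemaValidator`: it handles only a few types plus `minLength`, uses a single free-standing function, and prefixes paths with `root`, all of which are assumptions made for the example.

```ruby
# Simplified recursive validator for illustration only.
def validate_value(value, schema, path, errors = [])
  type_ok = case schema[:type]&.to_s
            when "string"  then value.is_a?(String)
            when "integer" then value.is_a?(Integer)
            when "array"   then value.is_a?(Array)
            else true
            end
  errors << "#{path}: expected #{schema[:type]}, got #{value.class}" unless type_ok

  if schema[:minLength] && value.is_a?(String) && value.length < schema[:minLength]
    errors << "#{path}: string length #{value.length} is less than minLength #{schema[:minLength]}"
  end

  # Recurse into arrays so that nested arrays are checked at every depth.
  if value.is_a?(Array) && schema[:items]
    value.each_with_index do |item, i|
      validate_value(item, schema[:items], "#{path}[#{i}]", errors)
    end
  end

  errors
end

schema = { type: "array", items: { type: "array", items: { type: "string", minLength: 2 } } }
errors = validate_value([["ok", "x"], ["fine", 3]], schema, "root")
# errors contains:
#   "root[0][1]: string length 1 is less than minLength 2"
#   "root[1][1]: expected string, got Integer"
```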
data/lib/ruby_llm/contract/eval/baseline_diff.rb ADDED
@@ -0,0 +1,92 @@
1
+ # frozen_string_literal: true
2
+
3
+ module RubyLLM
4
+ module Contract
5
+ module Eval
6
+ class BaselineDiff
7
+ attr_reader :baseline_score, :current_score
8
+
9
+ def initialize(baseline_cases:, current_cases:)
10
+ @baseline = index_by_name(baseline_cases)
11
+ @current = index_by_name(current_cases)
12
+ @baseline_score = baseline_cases.empty? ? 0.0 : baseline_cases.sum { |c| c[:score] } / baseline_cases.length
13
+ @current_score = current_cases.empty? ? 0.0 : current_cases.sum { |c| c[:score] } / current_cases.length
14
+ freeze
15
+ end
16
+
17
+ def regressions
18
+ @baseline.filter_map do |name, baseline|
19
+ current = @current[name]
20
+ next unless current
21
+ next unless baseline[:passed] && !current[:passed]
22
+
23
+ {
24
+ case: name,
25
+ baseline: { passed: baseline[:passed], score: baseline[:score] },
26
+ current: { passed: current[:passed], score: current[:score] },
27
+ detail: current[:details]
28
+ }
29
+ end
30
+ end
31
+
32
+ def improvements
33
+ @baseline.filter_map do |name, baseline|
34
+ current = @current[name]
35
+ next unless current
36
+ next unless !baseline[:passed] && current[:passed]
37
+
38
+ {
39
+ case: name,
40
+ baseline: { passed: baseline[:passed], score: baseline[:score] },
41
+ current: { passed: current[:passed], score: current[:score] }
42
+ }
43
+ end
44
+ end
45
+
46
+ def score_delta
47
+ (current_score - baseline_score).round(4)
48
+ end
49
+
50
+ def regressed?
51
+ regressions.any? || removed_passing_cases.any?
52
+ end
53
+
54
+ def removed_passing_cases
55
+ removed_cases.select { |name| @baseline[name]&.dig(:passed) }
56
+ end
57
+
58
+ def improved?
59
+ improvements.any?
60
+ end
61
+
62
+ def new_cases
63
+ (@current.keys - @baseline.keys)
64
+ end
65
+
66
+ def removed_cases
67
+ (@baseline.keys - @current.keys)
68
+ end
69
+
70
+ def to_s
71
+ lines = ["Score: #{baseline_score.round(2)} → #{current_score.round(2)} (#{format_delta})"]
72
+ regressions.each { |r| lines << " REGRESSED #{r[:case]}: #{r[:detail]}" }
73
+ improvements.each { |r| lines << " IMPROVED #{r[:case]}" }
74
+ new_cases.each { |c| lines << " NEW #{c}" }
75
+ removed_cases.each { |c| lines << " REMOVED #{c}" }
76
+ lines.join("\n")
77
+ end
78
+
79
+ private
80
+
81
+ def index_by_name(cases)
82
+ cases.each_with_object({}) { |c, h| h[c[:name]] = c }
83
+ end
84
+
85
+ def format_delta
86
+ d = score_delta
87
+ d >= 0 ? "+#{d}" : d.to_s
88
+ end
89
+ end
90
+ end
91
+ end
92
+ end
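The regression rule implemented by `BaselineDiff` above boils down to two checks: a case that passed in the baseline and fails now, or a passing baseline case that has disappeared. The sketch below replays that logic on plain hashes with made-up case data; it does not use the gem's classes, and the variable names are chosen only for this example.

```ruby
# Made-up case data; the hash shape (:name, :passed, :score) follows the diff above.
baseline = [
  { name: "outage",  passed: true,  score: 1.0 },
  { name: "billing", passed: true,  score: 1.0 },
  { name: "spam",    passed: false, score: 0.0 }
]
current = [
  { name: "outage", passed: false, score: 0.0 },
  { name: "spam",   passed: true,  score: 1.0 }
]

def index_by_name(cases)
  cases.each_with_object({}) { |c, h| h[c[:name]] = c }
end

base = index_by_name(baseline)
cur  = index_by_name(current)

# Passed before, present now, failing now.
regressions = base.select { |name, b| cur[name] && b[:passed] && !cur[name][:passed] }.keys
# Passed before, missing now: deleting a case must not bypass the gate.
removed_passing = (base.keys - cur.keys).select { |name| base[name][:passed] }
regressed = regressions.any? || removed_passing.any?

mean = ->(cases) { cases.empty? ? 0.0 : cases.sum { |c| c[:score] } / cases.length }
score_delta = (mean.call(current) - mean.call(baseline)).round(4)

regressions     # => ["outage"]
removed_passing # => ["billing"]
regressed       # => true
score_delta     # => -0.1667
```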
data/lib/ruby_llm/contract/eval/dataset.rb CHANGED
@@ -23,8 +23,13 @@ module RubyLLM
23
23
  # dataset.case "name", input: {...}, expected_traits: {...}
24
24
  # dataset.case "name", input: {...}, evaluator: proc
25
25
  def add_case(name = nil, input:, expected: nil, expected_traits: nil, evaluator: nil)
26
+ case_name = name || "case_#{@cases.length + 1}"
27
+ if @cases.any? { |c| c.name == case_name }
28
+ raise ArgumentError, "Duplicate case name '#{case_name}'. Case names must be unique within a dataset."
29
+ end
30
+
26
31
  @cases << Case.new(
27
- name: name || "case_#{@cases.length + 1}",
32
+ name: case_name,
28
33
  input: input,
29
34
  expected: expected,
30
35
  expected_traits: expected_traits,
@@ -37,13 +42,15 @@ module RubyLLM
37
42
  end
38
43
 
39
44
  class Case
45
+ include Concerns::DeepFreeze
46
+
40
47
  attr_reader :name, :input, :expected, :expected_traits, :evaluator
41
48
 
42
49
  def initialize(name:, input:, expected: nil, expected_traits: nil, evaluator: nil)
43
50
  @name = name
44
- @input = input
45
- @expected = expected
46
- @expected_traits = expected_traits
51
+ @input = deep_dup_freeze(input)
52
+ @expected = deep_dup_freeze(expected)
53
+ @expected_traits = deep_dup_freeze(expected_traits)
47
54
  @evaluator = evaluator
48
55
  freeze
49
56
  end
data/lib/ruby_llm/contract/eval/eval_definition.rb CHANGED
@@ -21,18 +21,20 @@ module RubyLLM
21
21
 
22
22
  def sample_response(response)
23
23
  @sample_response = response
24
+ @has_sample_response = true
24
25
  pre_validate_sample! if @step_class
25
26
  end
26
27
 
27
28
  def build_adapter
28
- return nil unless @sample_response
29
+ return nil unless defined?(@has_sample_response) && @has_sample_response
29
30
 
30
- Adapters::Test.new(response: @sample_response.is_a?(String) ? @sample_response : @sample_response.to_json)
31
+ Adapters::Test.new(response: @sample_response)
31
32
  end
32
33
 
33
34
  def add_case(description, input: nil, expected: nil, expected_traits: nil, evaluator: nil)
34
- case_input = input || @default_input
35
- raise ArgumentError, "add_case requires input (set default_input or pass input:)" unless case_input
35
+ case_input = input.nil? ? @default_input : input
36
+ raise ArgumentError, "add_case requires input (set default_input or pass input:)" if case_input.nil?
37
+ validate_unique_case_name!(description)
36
38
 
37
39
  @cases << {
38
40
  name: description,
@@ -44,13 +46,14 @@ module RubyLLM
44
46
  end
45
47
 
46
48
  def verify(description, expected_or_proc = nil, input: nil, expect: nil)
47
- if expected_or_proc && expect
49
+ if !expected_or_proc.nil? && !expect.nil?
48
50
  raise ArgumentError, "verify accepts either a positional argument or expect: keyword, not both"
49
51
  end
50
52
 
51
- expected_or_proc = expect if expect
52
- case_input = input || @default_input
53
+ expected_or_proc = expect unless expect.nil?
54
+ case_input = input.nil? ? @default_input : input
53
55
  validate_verify_args!(expected_or_proc, case_input)
56
+ validate_unique_case_name!(description)
54
57
 
55
58
  evaluator = expected_or_proc.is_a?(::Proc) ? expected_or_proc : nil
56
59
 
@@ -78,15 +81,21 @@ module RubyLLM
78
81
 
79
82
  def effective_cases
80
83
  return @cases if @cases.any?
81
- return [] unless @default_input
84
+ return [] if @default_input.nil?
82
85
 
83
86
  # Zero-verify: auto-add a contract check case
84
87
  [{ name: "contract check", input: @default_input, expected: nil, evaluator: nil }]
85
88
  end
86
89
 
90
+ def validate_unique_case_name!(name)
91
+ return unless @cases.any? { |c| c[:name] == name }
92
+
93
+ raise ArgumentError, "Duplicate case name '#{name}' in eval '#{@name}'. Case names must be unique."
94
+ end
95
+
87
96
  def validate_verify_args!(expected_or_proc, case_input)
88
- raise ArgumentError, "verify requires either a positional argument or expect: keyword" unless expected_or_proc
89
- raise ArgumentError, "verify requires input (set default_input or pass input:)" unless case_input
97
+ raise ArgumentError, "verify requires either a positional argument or expect: keyword" if expected_or_proc.nil?
98
+ raise ArgumentError, "verify requires input (set default_input or pass input:)" if case_input.nil?
90
99
  end
91
100
 
92
101
  def pre_validate_sample!
@@ -97,15 +106,28 @@ module RubyLLM
97
106
  return if errors.empty?
98
107
 
99
108
  raise ArgumentError, "sample_response does not satisfy step schema: #{errors.join(", ")}"
100
- rescue JSON::ParserError
101
- # Not JSON -- skip pre-validation
109
+ rescue JSON::ParserError => e
110
+ # Non-JSON string with a structured schema = clear error
111
+ raise ArgumentError, "sample_response is not valid JSON: #{e.message}"
102
112
  end
103
113
 
104
114
  def validate_sample_against_schema(schema)
105
- response_hash = @sample_response.is_a?(Hash) ? @sample_response : JSON.parse(@sample_response.to_s)
106
- symbolized = Parser.symbolize_keys(response_hash)
115
+ parsed = case @sample_response
116
+ when Hash, Array then @sample_response
117
+ when String then JSON.parse(@sample_response)
118
+ else @sample_response
119
+ end
120
+ symbolized = deep_symbolize(parsed)
107
121
  SchemaValidator.validate(symbolized, schema)
108
122
  end
123
+
124
+ def deep_symbolize(obj)
125
+ case obj
126
+ when Hash then Parser.symbolize_keys(obj)
127
+ when Array then obj.map { |item| deep_symbolize(item) }
128
+ else obj
129
+ end
130
+ end
109
131
  end
110
132
  end
111
133
  end
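Several changes in this file replace `||` and `unless value` with explicit `nil?` checks so that `false` counts as a real input or expectation. A minimal sketch of why that matters, using two hypothetical lambdas and a made-up default value:

```ruby
default_input = "default ticket" # made-up default for the example

# Old pattern: any falsy input (nil or false) falls back to the default.
pick_with_or = ->(input) { input || default_input }

# New pattern: only nil falls back; false is passed through unchanged.
pick_with_nil_check = ->(input) { input.nil? ? default_input : input }

pick_with_or.call(false)        # => "default ticket"
pick_with_nil_check.call(false) # => false
pick_with_nil_check.call(nil)   # => "default ticket"
```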
data/lib/ruby_llm/contract/eval/model_comparison.rb CHANGED
@@ -8,7 +8,7 @@ module RubyLLM
8
8
 
9
9
  def initialize(eval_name:, reports:)
10
10
  @eval_name = eval_name
11
- @reports = reports.freeze # { "model_name" => Report }
11
+ @reports = reports.dup.freeze # { "model_name" => Report }
12
12
  freeze
13
13
  end
14
14