RubyGems - ruby_llm-contract - Versions diffs - 0.2.3 → 0.3.6 - Mend

ruby_llm-contract 0.2.3 → 0.3.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (38)

checksums.yaml +4 -4
data/CHANGELOG.md +64 -0
data/Gemfile.lock +2 -2
data/README.md +27 -2
data/lib/ruby_llm/contract/adapters/response.rb +4 -2
data/lib/ruby_llm/contract/adapters/ruby_llm.rb +3 -3
data/lib/ruby_llm/contract/adapters/test.rb +3 -2
data/lib/ruby_llm/contract/concerns/deep_freeze.rb +23 -0
data/lib/ruby_llm/contract/concerns/eval_host.rb +11 -2
data/lib/ruby_llm/contract/contract/schema_validator.rb +70 -3
data/lib/ruby_llm/contract/eval/baseline_diff.rb +92 -0
data/lib/ruby_llm/contract/eval/dataset.rb +11 -4
data/lib/ruby_llm/contract/eval/eval_definition.rb +36 -14
data/lib/ruby_llm/contract/eval/model_comparison.rb +1 -1
data/lib/ruby_llm/contract/eval/report.rb +71 -2
data/lib/ruby_llm/contract/eval/runner.rb +5 -3
data/lib/ruby_llm/contract/eval/trait_evaluator.rb +6 -0
data/lib/ruby_llm/contract/eval.rb +1 -0
data/lib/ruby_llm/contract/pipeline/base.rb +1 -1
data/lib/ruby_llm/contract/pipeline/result.rb +1 -1
data/lib/ruby_llm/contract/pipeline/runner.rb +1 -1
data/lib/ruby_llm/contract/pipeline/trace.rb +3 -2
data/lib/ruby_llm/contract/prompt/builder.rb +2 -1
data/lib/ruby_llm/contract/prompt/node.rb +2 -2
data/lib/ruby_llm/contract/prompt/nodes/example_node.rb +2 -2
data/lib/ruby_llm/contract/rake_task.rb +31 -4
data/lib/ruby_llm/contract/rspec/helpers.rb +28 -8
data/lib/ruby_llm/contract/rspec/pass_eval.rb +23 -2
data/lib/ruby_llm/contract/step/base.rb +10 -5
data/lib/ruby_llm/contract/step/dsl.rb +1 -1
data/lib/ruby_llm/contract/step/limit_checker.rb +1 -1
data/lib/ruby_llm/contract/step/retry_executor.rb +3 -2
data/lib/ruby_llm/contract/step/retry_policy.rb +7 -1
data/lib/ruby_llm/contract/step/runner.rb +10 -2
data/lib/ruby_llm/contract/step/trace.rb +5 -4
data/lib/ruby_llm/contract/version.rb +1 -1
data/lib/ruby_llm/contract.rb +36 -17
metadata +3 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: '080fd81afd87ad234cf66f7577080a4ac55a59f890e0c8c479479fccec57ad32'
-  data.tar.gz: cdabbac3ea1d81e1abd3cb850e927f410d98282bd23111be79463804ea4d84b9
+  metadata.gz: 35a61fe65d6a7939e3ef22bdd37732d2ae6cd5643f51d595a3f26b4281eea396
+  data.tar.gz: 9b1b95b29c31e433af60c25e85dfdebf3e8e71cb85c0e568835309a7cd855926
 SHA512:
-  metadata.gz: 294b36f7264a2ba8b04334f3fd1c6b4433466a04c6be4aaccf23a92df3c7e92d04061ace018aa5243e28a9ef4fe64abc7f6de5ec11143c32bf5466bf591b9130
-  data.tar.gz: d7447319e3389264571209bc84d7dc84a441ffb76d1f64506d9cac2dc1953d26ba8cf3e1eb4169adf64576f1be7ae182bdf0c6e8e6b876220b102cea1e653fa6
+  metadata.gz: 0bb0333b6c362b1687b51f6bf360fd6d659c066a2a5b4b539bab4795150e5c1c8dbebe8dac6d05791b62958058d60418e5ff1f2b5db1f050f29412ed136494a5
+  data.tar.gz: ff5a8e7c30344993617bdd5f85d857e91d0cb633e2b7fe35a08aadf0790a4c7c0389cb017f92a192d199fe1eaba9526c509d5731321b36bd2c6e5fdedb5ca6d0

data/CHANGELOG.md CHANGED

@@ -1,5 +1,69 @@
 # Changelog
+## 0.3.6 (2026-03-24)
+- **Recursive array/object validation** — nested arrays (`array of array of string`) validated recursively. Object items validated even without `:properties` (e.g. `additionalProperties: false`).
+- **Deep symbolize in sample pre-validation** — array samples with string keys (`[{"name" => "Alice"}]`) correctly symbolized before schema validation.
+## 0.3.5 (2026-03-24)
+- **String constraints in SchemaValidator** — `minLength`/`maxLength` enforced for root and nested strings.
+- **Array item validation** — scalar items (string, integer) validated against items schema type and constraints.
+- **Non-JSON sample_response fails fast** — `sample_response("hello")` with object schema raises ArgumentError at definition time instead of silently passing.
+- **`max_tokens` in KNOWN_CONTEXT_KEYS** — no more spurious "Unknown context keys" warning.
+- **Duplicate models deduplicated** — `compare_models(models: ["m", "m"])` runs model once.
+## 0.3.4 (2026-03-24)
+- **SchemaValidator validates non-object roots** — boolean, integer, number, array root schemas now enforce type, min/max, enum, minItems/maxItems. Previously only object schemas were validated.
+- **Removed passing cases = regression** — `regressed?` returns true when baseline had passing cases that are now missing. Prevents gate bypass by deleting eval cases.
+- **JSON string sample_response fixed** — `sample_response('{"name":"Alice"}')` correctly parsed for pre-validation instead of double-encoding.
+- **`context[:max_tokens]` forwarded** — overrides step's `max_output` for adapter call AND budget precheck.
+## 0.3.3 (2026-03-23)
+- **Skipped cases visible in regression diff** — baseline PASS → current SKIP now detected as regression by `without_regressions` and `fail_on_regression`.
+- **Skip only on missing adapter** — eval runner no longer masks evaluator errors as SKIP. Only "No adapter configured" triggers skip.
+- **Array/Hash sample pre-validation** — `sample_response([{...}])` correctly validated against schema instead of silently skipping.
+- **`assume_model_exists: false` forwarded** — boolean `false` no longer dropped by truthiness check in adapter options.
+- **Duplicate case names caught at definition** — `add_case`/`verify` with same name raises immediately, not at run time.
+## 0.3.2 (2026-03-23)
+- **Array response preserved** — `Adapters::RubyLLM` no longer stringifies Array content. Steps with `output_type Array` work correctly.
+- **Falsy prompt input** — `run(false)` and `build_messages(false)` pass `false` to dynamic prompt blocks instead of falling back to `instance_eval`.
+- **`retry_on` flatten** — `retry_on([:a, :b])` no longer wraps in nested array.
+- **Builder reset** — `Prompt::Builder` resets nodes on each build (no accumulation on reuse).
+- **Pipeline false output** — `output: false` no longer shows "(no output)" in pretty_print.
+## 0.3.1 (2026-03-23)
+Fixes from persona_tool production deployment (4 services migrated).
+- **Proc/Lambda in `expected_traits`** — `expected_traits: { score: ->(v) { v > 3 } }` now works.
+- **Zeitwerk eager-load** — `load_evals!` eager-loads `app/contracts/` and `app/steps/` before loading eval files. Fixes uninitialized constant errors in Rake tasks.
+- **Falsy values** — `expected: false`, `input: false`, `sample_response(nil)` all handled correctly.
+- **Context key forwarding** — `provider:` and `assume_model_exists:` forwarded to adapter. `schema:` and `max_tokens:` are step-level only (no split-brain).
+- **Deep-freeze immutability** — constructors never mutate caller's data.
+## 0.3.0 (2026-03-23)
+Baseline regression detection — know when quality drops before users do.
+### Features
+- **`report.save_baseline!`** — serialize eval results to `.eval_baselines/` (JSON, git-tracked)
+- **`report.compare_with_baseline`** — returns `BaselineDiff` with regressions, improvements, score_delta, new/removed cases
+- **`diff.regressed?`** — true when any previously-passing case now fails
+- **`without_regressions` RSpec chain** — `expect(Step).to pass_eval("x").without_regressions`
+- **RakeTask `fail_on_regression`** — blocks CI when regressions detected
+- **RakeTask `save_baseline`** — auto-save after successful run
+- **Migration guide** — `docs/guide/migration.md` with 7 patterns for adopting the gem in existing Rails apps
+### Stats
+- 1086 tests, 0 failures
 ## 0.2.3 (2026-03-23)
 Production hardening from senior Rails review panel.

data/Gemfile.lock CHANGED

@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    ruby_llm-contract (0.2.3)
+    ruby_llm-contract (0.3.6)
       dry-types (~> 1.7)
       ruby_llm (~> 1.0)
       ruby_llm-schema (~> 0.3)
@@ -165,7 +165,7 @@ CHECKSUMS
   rubocop-ast (1.49.1) sha256=4412f3ee70f6fe4546cc489548e0f6fcf76cafcfa80fa03af67098ffed755035
   ruby-progressbar (1.13.0) sha256=80fc9c47a9b640d6834e0dc7b3c94c9df37f08cb072b7761e4a71e22cff29b33
   ruby_llm (1.14.0) sha256=57c6f7034fc4a44504ea137d70f853b07824f1c1cdbe774ab3ab3522e7098deb
-  ruby_llm-contract (0.2.3)
+  ruby_llm-contract (0.3.6)
   ruby_llm-schema (0.3.0) sha256=a591edc5ca1b7f0304f0e2261de61ba4b3bea17be09f5cf7558153adfda3dec6
   unicode-display_width (3.2.0) sha256=0cdd96b5681a5949cdbc2c55e7b420facae74c4aaf9a9815eee1087cb1853c42
   unicode-emoji (4.2.0) sha256=519e69150f75652e40bf736106cfbc8f0f73aa3fb6a65afe62fefa7f80b0f80f

data/README.md CHANGED

@@ -111,6 +111,30 @@ end
 # bundle exec rake ruby_llm_contract:eval
 ```
+## Detect quality drops
+Save a baseline. Next run, see what regressed.
+```ruby
+report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
+report.save_baseline!(model: "gpt-4.1-nano")
+# Later — after prompt change, model update, or provider weight shift:
+report = ClassifyTicket.run_eval("regression", context: { model: "gpt-4.1-nano" })
+diff = report.compare_with_baseline(model: "gpt-4.1-nano")
+diff.regressed?    # => true
+diff.regressions   # => [{case: "outage", baseline: {passed: true}, current: {passed: false}}]
+diff.score_delta   # => -0.33
+```
+```ruby
+# CI: block merge if any previously-passing case now fails
+expect(ClassifyTicket).to pass_eval("regression")
+  .with_context(model: "gpt-4.1-nano")
+  .without_regressions
+```
 ## Predict cost before running
 ```ruby
@@ -140,12 +164,13 @@ Works with any ruby_llm provider (OpenAI, Anthropic, Gemini, etc).
 | [Output Schema](docs/guide/output_schema.md) | Full schema reference + constraints |
 | [Pipeline](docs/guide/pipeline.md) | Multi-step composition, timeout, fail-fast |
 | [Testing](docs/guide/testing.md) | Test adapter, RSpec matchers |
+| [Migration](docs/guide/migration.md) | Adopting the gem in existing Rails apps |
 ## Roadmap
-**v0.2 (current):** Model comparison, cost tracking, eval with `add_case`, CI gating, Rails Railtie.
+**v0.3 (current):** Baseline regression detection — `save_baseline!`, `compare_with_baseline`, `without_regressions`. Migration guide.
-**v0.3:** Regression baselines — compare eval results with previous run, detect quality drift.
+**v0.2:** Model comparison, cost tracking, eval with `add_case`, CI gating, Rails Railtie.
 **v0.4:** Auto-routing — learn which model works for which input pattern.

data/lib/ruby_llm/contract/adapters/response.rb CHANGED

@@ -4,11 +4,13 @@ module RubyLLM
   module Contract
     module Adapters
       class Response
+        include Concerns::DeepFreeze
         attr_reader :content, :usage
         def initialize(content:, usage: {})
-          @content = content
-          @usage = usage
+          @content = deep_dup_freeze(content)
+          @usage = deep_dup_freeze(usage)
           freeze
         end
       end

data/lib/ruby_llm/contract/adapters/ruby_llm.rb CHANGED

@@ -43,8 +43,8 @@ module RubyLLM
         def chat_constructor_options(options)
           opts = { model: options[:model] }
-          opts[:provider] = options[:provider] if options[:provider]
-          opts[:assume_model_exists] = options[:assume_model_exists] if options[:assume_model_exists]
+          opts[:provider] = options[:provider] if options.key?(:provider)
+          opts[:assume_model_exists] = options[:assume_model_exists] if options.key?(:assume_model_exists)
           opts
         end
@@ -57,7 +57,7 @@ module RubyLLM
         def build_response(response)
           content = response.content
-          content = content.to_s unless content.is_a?(Hash)
+          content = content.to_s unless content.is_a?(Hash) || content.is_a?(Array)
           Response.new(
             content: content,
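
The move from a truthiness check to `key?` in `chat_constructor_options` matters because `false` is a meaningful value for `assume_model_exists`. The following is a minimal standalone sketch of the difference. The helper names `options_by_truthiness` and `options_by_key` are hypothetical and only mirror the before and after bodies shown in the hunk; they are not part of the gem.

```ruby
# Hypothetical helpers mirroring the old and new bodies of
# chat_constructor_options; neither name exists in the gem.
def options_by_truthiness(options)
  opts = { model: options[:model] }
  # Old check: false is falsy, so the key is silently dropped.
  opts[:assume_model_exists] = options[:assume_model_exists] if options[:assume_model_exists]
  opts
end

def options_by_key(options)
  opts = { model: options[:model] }
  # New check: the key is forwarded whenever the caller supplied it.
  opts[:assume_model_exists] = options[:assume_model_exists] if options.key?(:assume_model_exists)
  opts
end

input = { model: "gpt-4.1-nano", assume_model_exists: false }
old_opts = options_by_truthiness(input)
new_opts = options_by_key(input)

puts "old keeps assume_model_exists? #{old_opts.key?(:assume_model_exists)}"
puts "new keeps assume_model_exists? #{new_opts.key?(:assume_model_exists)}"
```

One consequence of the `key?` form is that an explicit `nil` is forwarded too, since the check only asks whether the key is present.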

data/lib/ruby_llm/contract/adapters/test.rb CHANGED

@@ -4,8 +4,9 @@ module RubyLLM
   module Contract
     module Adapters
       class Test < Base
-        def initialize(response: nil, responses: nil)
+        def initialize(response: nil, responses: nil, usage: nil)
           super()
+          @usage = (usage || { input_tokens: 0, output_tokens: 0 }).dup.freeze
           if responses
             raise ArgumentError, "responses: must not be empty (use response: nil for nil content)" if responses.empty?
@@ -36,7 +37,7 @@ module RubyLLM
                     else
                       @response
                     end
-          Response.new(content: content, usage: { input_tokens: 0, output_tokens: 0 })
+          Response.new(content: content, usage: @usage)
         end
       end
     end

data/lib/ruby_llm/contract/concerns/deep_freeze.rb ADDED

@@ -0,0 +1,23 @@
+# frozen_string_literal: true
+module RubyLLM
+  module Contract
+    module Concerns
+      # Deep-duplicate and freeze a value. Creates an independent frozen copy
+      # without mutating the original. Handles Hash, Array, String recursively.
+      module DeepFreeze
+        private
+        def deep_dup_freeze(obj)
+          case obj
+          when NilClass, Integer, Float, Symbol, TrueClass, FalseClass then obj
+          when Hash then obj.transform_values { |v| deep_dup_freeze(v) }.freeze
+          when Array then obj.map { |v| deep_dup_freeze(v) }.freeze
+          when String then obj.frozen? ? obj : obj.dup.freeze
+          else obj.frozen? ? obj : obj.dup.freeze
+          end
+        end
+      end
+    end
+  end
+end
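
To illustrate what "deep-duplicate and freeze" means for callers, here is a standalone sketch that copies the logic of `deep_dup_freeze` into a hypothetical module, `DeepFreezeSketch`, exposed through `module_function` so it can be called directly. The sample data is made up; only the method body follows the hunk above.

```ruby
# Hypothetical stand-in for Concerns::DeepFreeze, callable without a host class.
module DeepFreezeSketch
  module_function

  def deep_dup_freeze(obj)
    case obj
    when NilClass, Integer, Float, Symbol, TrueClass, FalseClass then obj
    when Hash then obj.transform_values { |v| deep_dup_freeze(v) }.freeze
    when Array then obj.map { |v| deep_dup_freeze(v) }.freeze
    else obj.frozen? ? obj : obj.dup.freeze
    end
  end
end

# Made-up caller data; +"" forces a mutable string.
original = { tags: ["a", "b"], meta: { name: +"Alice" } }
copy = DeepFreezeSketch.deep_dup_freeze(original)

# Mutating the caller's structure afterwards does not leak into the copy.
original[:tags] << "c"
original[:meta][:name] << "!"

puts copy[:tags].inspect
puts copy[:meta][:name]
puts copy.frozen? && copy[:tags].frozen? && copy[:meta][:name].frozen?
```

Note that `transform_values` rebuilds the hash around the same key objects, so only values are duplicated in this approach.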

data/lib/ruby_llm/contract/concerns/eval_host.rb CHANGED

@@ -6,6 +6,7 @@ module RubyLLM
       module EvalHost
         def define_eval(name, &)
           @eval_definitions ||= {}
+          @file_sourced_evals ||= Set.new
           key = name.to_s
           if @eval_definitions.key?(key) && !Thread.current[:ruby_llm_contract_reloading]
@@ -14,12 +15,16 @@ module RubyLLM
           end
           @eval_definitions[key] = Eval::EvalDefinition.new(key, step_class: self, &)
+          @file_sourced_evals.add(key) if Thread.current[:ruby_llm_contract_reloading]
           Contract.register_eval_host(self)
           register_subclasses(self)
         end
-        def clear_eval_definitions!
-          @eval_definitions = {}
+        def clear_file_sourced_evals!
+          return unless defined?(@file_sourced_evals) && defined?(@eval_definitions)
+          @file_sourced_evals.each { |key| @eval_definitions.delete(key) }
+          @file_sourced_evals.clear
         end
         def eval_names
@@ -31,6 +36,7 @@ module RubyLLM
         end
         def run_eval(name = nil, context: {})
+          context ||= {}
           if name
             run_single_eval(name, context)
           else
@@ -39,6 +45,8 @@ module RubyLLM
         end
         def compare_models(eval_name, models:, context: {})
+          context ||= {}
+          models = models.uniq
           reports = models.each_with_object({}) do |model, hash|
             model_context = deep_dup_context(context).merge(model: model)
             hash[model] = run_single_eval(eval_name, model_context)
@@ -75,6 +83,7 @@ module RubyLLM
         end
         def eval_context(defn, context)
+          context = (context || {}).transform_keys { |k| k.respond_to?(:to_sym) ? k.to_sym : k }
           return context if context[:adapter]
           sample_adapter = defn.build_adapter
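
The key normalization added to `eval_context` can be tried in isolation. This is a small sketch under the assumption that the only goal is to tolerate `nil` and string-keyed contexts; `normalize_context` is a hypothetical name wrapping the single expression from the hunk.

```ruby
# Hypothetical wrapper around the normalization expression in eval_context.
def normalize_context(context)
  (context || {}).transform_keys { |k| k.respond_to?(:to_sym) ? k.to_sym : k }
end

from_strings = normalize_context({ "model" => "gpt-4.1-nano", "max_tokens" => 256 })
from_nil     = normalize_context(nil)
# Keys that cannot become symbols (an Integer here) are left untouched.
mixed        = normalize_context({ 1 => "kept", "adapter" => :test })

puts from_strings.keys.inspect
puts from_nil.inspect
puts mixed.keys.inspect
```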

data/lib/ruby_llm/contract/contract/schema_validator.rb CHANGED

@@ -40,10 +40,77 @@ module RubyLLM
       def validate_non_hash_output
         expected_type = @json_schema[:type]&.to_s
         if expected_type == "object" || @json_schema.key?(:properties)
-          ["expected object, got #{@output.class}"]
-        else
-          []
+          return ["expected object, got #{@output.class}"]
+        end
+        errors = []
+        validate_type_match(errors, @output, expected_type, "root") if expected_type
+        validate_constraints(errors, @output, @json_schema, "root")
+        if expected_type == "array" && @output.is_a?(Array) && @json_schema[:items]
+          validate_array_items(errors, @output, @json_schema[:items], "")
+        end
+        errors
+      end
+      def validate_array_items(errors, array, items_schema, prefix)
+        array.each_with_index do |item, i|
+          item_prefix = "#{prefix}[#{i}]"
+          validate_value(errors, item, items_schema, item_prefix)
+        end
+      end
+      def validate_value(errors, value, schema, prefix)
+        value_type = schema[:type]&.to_s
+        validate_type_match(errors, value, value_type, prefix) if value_type
+        validate_constraints(errors, value, schema, prefix)
+        if value.is_a?(Hash) && (schema.key?(:properties) || value_type == "object")
+          validate_object(value, schema, prefix: prefix)
+          errors.concat(@errors)
+          @errors = []
+        elsif value.is_a?(Array) && schema[:items]
+          validate_array_items(errors, value, schema[:items], prefix)
+        end
+      end
+      def validate_type_match(errors, value, expected_type, prefix)
+        valid = case expected_type
+                when "string" then value.is_a?(String)
+                when "integer" then value.is_a?(Integer)
+                when "number" then value.is_a?(Numeric)
+                when "boolean" then value.is_a?(TrueClass) || value.is_a?(FalseClass)
+                when "array" then value.is_a?(Array)
+                else true
+                end
+        errors << "#{prefix}: expected #{expected_type}, got #{value.class}" unless valid
+      end
+      def validate_constraints(errors, value, schema, prefix)
+        if schema[:minimum] && value.is_a?(Numeric) && value < schema[:minimum]
+          errors << "#{prefix}: #{value} is less than minimum #{schema[:minimum]}"
+        end
+        if schema[:maximum] && value.is_a?(Numeric) && value > schema[:maximum]
+          errors << "#{prefix}: #{value} is greater than maximum #{schema[:maximum]}"
+        end
+        if schema[:enum] && !schema[:enum].include?(value)
+          errors << "#{prefix}: #{value.inspect} is not in enum #{schema[:enum].inspect}"
+        end
+        if schema[:minItems] && value.is_a?(Array) && value.length < schema[:minItems]
+          errors << "#{prefix}: array has #{value.length} items, minimum #{schema[:minItems]}"
+        end
+        if schema[:maxItems] && value.is_a?(Array) && value.length > schema[:maxItems]
+          errors << "#{prefix}: array has #{value.length} items, maximum #{schema[:maxItems]}"
+        end
+        if schema[:minLength] && value.is_a?(String) && value.length < schema[:minLength]
+          errors << "#{prefix}: string length #{value.length} is less than minLength #{schema[:minLength]}"
+        end
+        if schema[:maxLength] && value.is_a?(String) && value.length > schema[:maxLength]
+          errors << "#{prefix}: string length #{value.length} is greater than maxLength #{schema[:maxLength]}"
         end
       end
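
The recursion between `validate_value` and `validate_array_items` is what lets nested arrays be checked item by item. Below is a simplified, standalone sketch of that idea. It covers only type matching, `minLength`, and array recursion, uses a plain function instead of the validator class, and starts from a `"root"` prefix, which is a choice made for this sketch.

```ruby
# Simplified sketch: type check, minLength, and recursion into array items.
TYPE_CHECKS = {
  "string"  => ->(v) { v.is_a?(String) },
  "integer" => ->(v) { v.is_a?(Integer) },
  "array"   => ->(v) { v.is_a?(Array) }
}.freeze

def validate_value(errors, value, schema, prefix)
  type = schema[:type]&.to_s
  check = TYPE_CHECKS[type]
  errors << "#{prefix}: expected #{type}, got #{value.class}" if check && !check.call(value)

  if schema[:minLength] && value.is_a?(String) && value.length < schema[:minLength]
    errors << "#{prefix}: string length #{value.length} is less than minLength #{schema[:minLength]}"
  end

  # Recurse: each item is validated against the items schema, which may
  # itself describe an array.
  if value.is_a?(Array) && schema[:items]
    value.each_with_index do |item, i|
      validate_value(errors, item, schema[:items], "#{prefix}[#{i}]")
    end
  end
  errors
end

schema = { type: "array", items: { type: "array", items: { type: "string", minLength: 2 } } }
errors = validate_value([], [["ok", "x"], ["fine", 3]], schema, "root")
puts errors
# root[0][1]: string length 1 is less than minLength 2
# root[1][1]: expected string, got Integer
```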

data/lib/ruby_llm/contract/eval/baseline_diff.rb ADDED

@@ -0,0 +1,92 @@
+# frozen_string_literal: true
+module RubyLLM
+  module Contract
+    module Eval
+      class BaselineDiff
+        attr_reader :baseline_score, :current_score
+        def initialize(baseline_cases:, current_cases:)
+          @baseline = index_by_name(baseline_cases)
+          @current = index_by_name(current_cases)
+          @baseline_score = baseline_cases.empty? ? 0.0 : baseline_cases.sum { |c| c[:score] } / baseline_cases.length
+          @current_score = current_cases.empty? ? 0.0 : current_cases.sum { |c| c[:score] } / current_cases.length
+          freeze
+        end
+        def regressions
+          @baseline.filter_map do |name, baseline|
+            current = @current[name]
+            next unless current
+            next unless baseline[:passed] && !current[:passed]
+            {
+              case: name,
+              baseline: { passed: baseline[:passed], score: baseline[:score] },
+              current: { passed: current[:passed], score: current[:score] },
+              detail: current[:details]
+            }
+          end
+        end
+        def improvements
+          @baseline.filter_map do |name, baseline|
+            current = @current[name]
+            next unless current
+            next unless !baseline[:passed] && current[:passed]
+            {
+              case: name,
+              baseline: { passed: baseline[:passed], score: baseline[:score] },
+              current: { passed: current[:passed], score: current[:score] }
+            }
+          end
+        end
+        def score_delta
+          (current_score - baseline_score).round(4)
+        end
+        def regressed?
+          regressions.any? || removed_passing_cases.any?
+        end
+        def removed_passing_cases
+          removed_cases.select { |name| @baseline[name]&.dig(:passed) }
+        end
+        def improved?
+          improvements.any?
+        end
+        def new_cases
+          (@current.keys - @baseline.keys)
+        end
+        def removed_cases
+          (@baseline.keys - @current.keys)
+        end
+        def to_s
+          lines = ["Score: #{baseline_score.round(2)} → #{current_score.round(2)} (#{format_delta})"]
+          regressions.each { |r| lines << "  REGRESSED  #{r[:case]}: #{r[:detail]}" }
+          improvements.each { |r| lines << "  IMPROVED   #{r[:case]}" }
+          new_cases.each { |c| lines << "  NEW        #{c}" }
+          removed_cases.each { |c| lines << "  REMOVED    #{c}" }
+          lines.join("\n")
+        end
+        private
+        def index_by_name(cases)
+          cases.each_with_object({}) { |c, h| h[c[:name]] = c }
+        end
+        def format_delta
+          d = score_delta
+          d >= 0 ? "+#{d}" : d.to_s
+        end
+      end
+    end
+  end
+end
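
The comparison rules in `BaselineDiff` can be traced by hand on a small data set. The sketch below reimplements the relevant pieces as plain functions rather than calling the gem, and the case hashes are invented for illustration. It shows why a deleted passing case counts as a regression: "billing" passed in the baseline and is absent from the current run.

```ruby
# Standalone re-creation of the BaselineDiff rules on made-up data.
def index_by_name(cases)
  cases.to_h { |c| [c[:name], c] }
end

def mean_score(cases)
  cases.empty? ? 0.0 : cases.sum { |c| c[:score] } / cases.length
end

baseline_cases = [
  { name: "outage",  passed: true,  score: 1.0 },
  { name: "billing", passed: true,  score: 1.0 },
  { name: "spam",    passed: false, score: 0.0 }
]
current_cases = [
  { name: "outage", passed: false, score: 0.0 },
  { name: "spam",   passed: true,  score: 1.0 }
]

baseline = index_by_name(baseline_cases)
current  = index_by_name(current_cases)

# Present in both runs, passed before, fails now.
regressed = baseline.select { |name, b| current[name] && b[:passed] && !current[name][:passed] }.keys
# Passed before, missing now.
removed_passing = (baseline.keys - current.keys).select { |name| baseline[name][:passed] }
# Present in both runs, failed before, passes now.
improved = baseline.select { |name, b| current[name] && !b[:passed] && current[name][:passed] }.keys
score_delta = (mean_score(current_cases) - mean_score(baseline_cases)).round(4)

puts "regressed: #{regressed.inspect}"
puts "removed passing: #{removed_passing.inspect}"
puts "improved: #{improved.inspect}"
puts "regressed? #{regressed.any? || removed_passing.any?}"
puts "score delta: #{score_delta}"
```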

data/lib/ruby_llm/contract/eval/dataset.rb CHANGED

@@ -23,8 +23,13 @@ module RubyLLM
         # dataset.case "name", input: {...}, expected_traits: {...}
         # dataset.case "name", input: {...}, evaluator: proc
         def add_case(name = nil, input:, expected: nil, expected_traits: nil, evaluator: nil)
+          case_name = name || "case_#{@cases.length + 1}"
+          if @cases.any? { |c| c.name == case_name }
+            raise ArgumentError, "Duplicate case name '#{case_name}'. Case names must be unique within a dataset."
+          end
           @cases << Case.new(
-            name: name || "case_#{@cases.length + 1}",
+            name: case_name,
             input: input,
             expected: expected,
             expected_traits: expected_traits,
@@ -37,13 +42,15 @@ module RubyLLM
       end
       class Case
+        include Concerns::DeepFreeze
         attr_reader :name, :input, :expected, :expected_traits, :evaluator
         def initialize(name:, input:, expected: nil, expected_traits: nil, evaluator: nil)
           @name = name
-          @input = input
-          @expected = expected
-          @expected_traits = expected_traits
+          @input = deep_dup_freeze(input)
+          @expected = deep_dup_freeze(expected)
+          @expected_traits = deep_dup_freeze(expected_traits)
           @evaluator = evaluator
           freeze
         end

data/lib/ruby_llm/contract/eval/eval_definition.rb CHANGED

@@ -21,18 +21,20 @@ module RubyLLM
         def sample_response(response)
           @sample_response = response
+          @has_sample_response = true
           pre_validate_sample! if @step_class
         end
         def build_adapter
-          return nil unless @sample_response
+          return nil unless defined?(@has_sample_response) && @has_sample_response
-          Adapters::Test.new(response: @sample_response.is_a?(String) ? @sample_response : @sample_response.to_json)
+          Adapters::Test.new(response: @sample_response)
         end
         def add_case(description, input: nil, expected: nil, expected_traits: nil, evaluator: nil)
-          case_input = input || @default_input
-          raise ArgumentError, "add_case requires input (set default_input or pass input:)" unless case_input
+          case_input = input.nil? ? @default_input : input
+          raise ArgumentError, "add_case requires input (set default_input or pass input:)" if case_input.nil?
+          validate_unique_case_name!(description)
           @cases << {
             name: description,
@@ -44,13 +46,14 @@ module RubyLLM
         end
         def verify(description, expected_or_proc = nil, input: nil, expect: nil)
-          if expected_or_proc && expect
+          if !expected_or_proc.nil? && !expect.nil?
             raise ArgumentError, "verify accepts either a positional argument or expect: keyword, not both"
           end
-          expected_or_proc = expect if expect
-          case_input = input || @default_input
+          expected_or_proc = expect unless expect.nil?
+          case_input = input.nil? ? @default_input : input
           validate_verify_args!(expected_or_proc, case_input)
+          validate_unique_case_name!(description)
           evaluator = expected_or_proc.is_a?(::Proc) ? expected_or_proc : nil
@@ -78,15 +81,21 @@ module RubyLLM
         def effective_cases
           return @cases if @cases.any?
-          return [] unless @default_input
+          return [] if @default_input.nil?
           # Zero-verify: auto-add a contract check case
           [{ name: "contract check", input: @default_input, expected: nil, evaluator: nil }]
         end
+        def validate_unique_case_name!(name)
+          return unless @cases.any? { |c| c[:name] == name }
+          raise ArgumentError, "Duplicate case name '#{name}' in eval '#{@name}'. Case names must be unique."
+        end
         def validate_verify_args!(expected_or_proc, case_input)
-          raise ArgumentError, "verify requires either a positional argument or expect: keyword" unless expected_or_proc
-          raise ArgumentError, "verify requires input (set default_input or pass input:)" unless case_input
+          raise ArgumentError, "verify requires either a positional argument or expect: keyword" if expected_or_proc.nil?
+          raise ArgumentError, "verify requires input (set default_input or pass input:)" if case_input.nil?
         end
         def pre_validate_sample!
@@ -97,15 +106,28 @@ module RubyLLM
           return if errors.empty?
           raise ArgumentError, "sample_response does not satisfy step schema: #{errors.join(", ")}"
-        rescue JSON::ParserError
-          # Not JSON -- skip pre-validation
+        rescue JSON::ParserError => e
+          # Non-JSON string with a structured schema = clear error
+          raise ArgumentError, "sample_response is not valid JSON: #{e.message}"
         end
         def validate_sample_against_schema(schema)
-          response_hash = @sample_response.is_a?(Hash) ? @sample_response : JSON.parse(@sample_response.to_s)
-          symbolized = Parser.symbolize_keys(response_hash)
+          parsed = case @sample_response
+                   when Hash, Array then @sample_response
+                   when String then JSON.parse(@sample_response)
+                   else @sample_response
+                   end
+          symbolized = deep_symbolize(parsed)
           SchemaValidator.validate(symbolized, schema)
         end
+        def deep_symbolize(obj)
+          case obj
+          when Hash then Parser.symbolize_keys(obj)
+          when Array then obj.map { |item| deep_symbolize(item) }
+          else obj
+          end
+        end
       end
     end
   end
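
Several changes in this file replace `||` and `unless x` with explicit `nil?` checks so that `false` survives as a legitimate input or expected value. A minimal sketch of the difference follows; both helper names are hypothetical and only mirror the old and new expressions for `case_input`.

```ruby
# Hypothetical helpers mirroring the old and new case_input expressions.
def pick_input_with_or(input, default_input)
  # Old form: any falsy input (nil or false) falls back to the default.
  input || default_input
end

def pick_input_with_nil_check(input, default_input)
  # New form: only a missing (nil) input falls back to the default.
  input.nil? ? default_input : input
end

default_input = { text: "hello" }

old_pick = pick_input_with_or(false, default_input)
new_pick = pick_input_with_nil_check(false, default_input)
nil_pick = pick_input_with_nil_check(nil, default_input)

puts "old form with false: #{old_pick.inspect}"
puts "new form with false: #{new_pick.inspect}"
puts "new form with nil:   #{nil_pick.inspect}"
```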

data/lib/ruby_llm/contract/eval/model_comparison.rb CHANGED

@@ -8,7 +8,7 @@ module RubyLLM
         def initialize(eval_name:, reports:)
           @eval_name = eval_name
-          @reports = reports.freeze # { "model_name" => Report }
+          @reports = reports.dup.freeze # { "model_name" => Report }
           freeze
         end