ruby_llm-agents 3.6.0 → 3.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (14)

checksums.yaml +4 -4
data/README.md +17 -0
data/app/controllers/ruby_llm/agents/executions_controller.rb +1 -3
data/app/helpers/ruby_llm/agents/application_helper.rb +0 -27
data/app/views/ruby_llm/agents/dashboard/index.html.erb +11 -11
data/app/views/ruby_llm/agents/system_config/show.html.erb +0 -13
data/lib/generators/ruby_llm_agents/templates/initializer.rb.tt +0 -15
data/lib/ruby_llm/agents/core/version.rb +1 -1
data/lib/ruby_llm/agents/eval/eval_result.rb +73 -0
data/lib/ruby_llm/agents/eval/eval_run.rb +124 -0
data/lib/ruby_llm/agents/eval/eval_suite.rb +264 -0
data/lib/ruby_llm/agents/eval.rb +5 -0
data/lib/ruby_llm/agents.rb +3 -0
metadata +5 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: fa3658e8503a35a228e1b1d370c7e9308efb9ec5d0ce7e7a287afa4d69c891a5
-  data.tar.gz: 8bd69d29c6dbdc13f1ba2df4a11bb3032effadf198afff3dc7f2982abf2f62d2
+  metadata.gz: ce728e318b0681df1f65dc93e4c264f35573e863b86841394ab218994dd3dd29
+  data.tar.gz: f036ab7df822277740a0f840afd52d3d68c36bbd37843ae16032da7f9406864e
 SHA512:
-  metadata.gz: f5c1c97afd3b6448c4b9467b969aa8e521fb578e3c84b2985798d4a02fb78cccf624e771459adb9b06ff42018c705deb322b510288db12e7faaf771e19126206
-  data.tar.gz: b4ad0b9e03fef2df24bd70d0ae72af5d866d87daa6352e2948900c9996c135915fa104e0849c00c88fd13ea126fb6643eada544c87960ea1103716c7286ee4b9
+  metadata.gz: 6e63acc86ac7413983957abc46bbf5ba997230ae06480d6aa2eb77d6f25524e02cc58131112cda0b5eb413452078766d8afec001d98e56e6fb3ae2e8a0602dff
+  data.tar.gz: d01dba7531c3e503f7b00156a254c8a71ab582d20b194ad22d1f6bc1734cdd38b814e6581855274e08b49bb2c8d33e7655a02c2ed834771d9a990085ce743c84

data/README.md CHANGED

@@ -162,6 +162,21 @@ result.url          # => "https://..."
 result.save("logo.png")
 ```
+```ruby
+# Evaluate agent quality with built-in scoring
+class SupportRouter::Eval < RubyLLM::Agents::Eval::EvalSuite
+  agent SupportRouter
+  test_case "billing",   input: { message: "charged twice" }, expected: "billing"
+  test_case "technical",  input: { message: "500 error" },     expected: "technical"
+  test_case "greeting",   input: { message: "hello" },         expected: "general"
+end
+run = SupportRouter::Eval.run!
+puts run.summary
+# SupportRouter eval: 3/3 passed (score: 1.0)
+```
 ## Features
 | Feature | Description | Docs |
@@ -184,6 +199,7 @@ result.save("logo.png")
 | **Audio** | Text-to-speech (OpenAI, ElevenLabs), speech-to-text, dynamic pricing, 28+ output formats, dashboard audio playback | [Audio](https://github.com/adham90/ruby_llm-agents/wiki/Audio) |
 | **Agent Composition** | Use agents as tools in other agents with automatic hierarchy tracking | [Tools](https://github.com/adham90/ruby_llm-agents/wiki/Tools) |
 | **Queryable Agents** | Query execution history from agent classes with stats, replay, and cost breakdown | [Querying](https://github.com/adham90/ruby_llm-agents/wiki/Querying-Executions) |
+| **Evaluation** | Test agent quality with exact match, contains, LLM judge, and custom scorers | [Evaluation](https://github.com/adham90/ruby_llm-agents/wiki/Evaluation) |
 | **Alerts** | Slack, webhook, and custom notifications | [Alerts](https://github.com/adham90/ruby_llm-agents/wiki/Alerts) |
 | **AS::Notifications** | 11 instrumentation events across execution, cache, budget, and reliability | [Events](https://github.com/adham90/ruby_llm-agents/wiki/ActiveSupport-Notifications) |
 | **Custom Middleware** | Inject custom middleware globally or per-agent with positioning control | [Middleware](https://github.com/adham90/ruby_llm-agents/wiki/Custom-Middleware) |
@@ -267,6 +283,7 @@ mount RubyLLM::Agents::Engine => "/agents"
 | [Multi-Tenancy](https://github.com/adham90/ruby_llm-agents/wiki/Multi-Tenancy) | Per-tenant budgets, isolation, configuration |
 | [Async/Fiber](https://github.com/adham90/ruby_llm-agents/wiki/Async-Fiber) | Concurrent execution with Ruby fibers |
 | [Testing Agents](https://github.com/adham90/ruby_llm-agents/wiki/Testing-Agents) | RSpec patterns, mocking, dry_run mode |
+| [Evaluation](https://github.com/adham90/ruby_llm-agents/wiki/Evaluation) | Score agent quality with built-in and custom scorers |
 | [Error Handling](https://github.com/adham90/ruby_llm-agents/wiki/Error-Handling) | Error types, recovery patterns |
 | [Routing](https://github.com/adham90/ruby_llm-agents/wiki/Routing) | Message classification, routing DSL, inline classify |
 | [Embeddings](https://github.com/adham90/ruby_llm-agents/wiki/Embeddings) | Vector embeddings, batching, caching, preprocessing |
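
The README row above lists custom scorers next to the built-in ones. As a rough, self-contained sketch of how a custom scorer's return value appears to be normalized (based on the `Score` struct and `coerce_score` method added in `eval_suite.rb` in this release), the snippet below re-declares both as stand-ins instead of loading the gem. The `word_overlap` lambda and its inputs are hypothetical and only illustrate the `(agent_result, expected)` calling convention.

```ruby
# Stand-in for the Score struct added in eval_suite.rb (same clamping rule).
Score = Struct.new(:value, :reason, keyword_init: true) do
  def initialize(value:, reason: nil)
    super(value: value.to_f.clamp(0.0, 1.0), reason: reason)
  end

  def passed?(threshold = 0.5)
    value >= threshold
  end
end

# Mirrors the coercion rules shown in the diff: Score, Numeric, true, false.
def coerce_score(value)
  case value
  when Score then value
  when Numeric then Score.new(value: value)
  when true then Score.new(value: 1.0)
  when false then Score.new(value: 0.0)
  else Score.new(value: 0.0, reason: "Scorer returned #{value.class}")
  end
end

# Hypothetical custom scorer: fraction of expected words found in the reply.
# agent_result is a plain String here purely for illustration.
word_overlap = lambda do |agent_result, expected|
  wanted = expected.downcase.split
  found = wanted.count { |word| agent_result.downcase.include?(word) }
  found.to_f / wanted.size
end

partial = coerce_score(
  word_overlap.call("Refund issued for the duplicate charge", "refund duplicate invoice")
)
puts partial.value.round(2)    # 0.67
puts partial.passed?           # true
puts coerce_score(12).value    # 1.0, out-of-range numbers are clamped
puts coerce_score(nil).reason  # Scorer returned NilClass
```

In the gem itself such a lambda would presumably be passed through the `score:` keyword of `test_case`; that wiring is not exercised here.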

data/app/controllers/ruby_llm/agents/executions_controller.rb CHANGED

@@ -96,9 +96,7 @@ module RubyLLM
       # @param execution [Execution] The execution record
       # @return [String] CSV row string
       def generate_csv_row(execution)
-        redacted_error_message = if execution.error_message.present?
-          Redactor.redact_string(execution.error_message)
-        end
+        redacted_error_message = execution.error_message
         CSV.generate_line([
           execution.id,
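
With this hunk the CSV export writes `execution.error_message` as stored, without passing it through `Redactor`. An application that still wants scrubbed exports could apply its own filter before building the row. The following is a minimal sketch under that assumption: `scrub` and `SENSITIVE_PATTERNS` are hypothetical names, not part of the gem, and the SSN pattern is the one from the example config removed from the initializer template in this release.

```ruby
require "csv"

# Hypothetical patterns; adjust to the data your agents actually handle.
SENSITIVE_PATTERNS = [
  /\b\d{3}-\d{2}-\d{4}\b/,       # SSN-like, as in the removed initializer example
  /\b[\w.+-]+@[\w-]+\.[\w.]+\b/  # email-like
].freeze

# Replaces every match of the patterns above with a placeholder.
def scrub(text, placeholder: "[REDACTED]")
  return nil if text.nil?

  SENSITIVE_PATTERNS.reduce(text.to_s) { |acc, pattern| acc.gsub(pattern, placeholder) }
end

message = "Lookup failed for jane@example.com (SSN 123-45-6789)"
row = CSV.generate_line([42, "error", scrub(message)])
puts row
# 42,error,Lookup failed for [REDACTED] (SSN [REDACTED])
```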

data/app/helpers/ruby_llm/agents/application_helper.rb CHANGED

@@ -120,33 +120,6 @@ module RubyLLM
         end
       end
-      # Redacts sensitive data from an object for display
-      #
-      # Uses the configured redaction rules to mask sensitive fields
-      # and patterns in the data.
-      #
-      # @param obj [Object] The object to redact (Hash, Array, or primitive)
-      # @return [Object] The redacted object
-      # @example
-      #   redact_for_display({ password: "secret", name: "John" })
-      #   #=> { password: "[REDACTED]", name: "John" }
-      def redact_for_display(obj)
-        Redactor.redact(obj)
-      end
-      # Syntax-highlights a redacted Ruby object as pretty-printed JSON
-      #
-      # Combines redaction and highlighting in one call.
-      #
-      # @param obj [Object] Any JSON-serializable Ruby object
-      # @return [ActiveSupport::SafeBuffer] HTML-safe highlighted redacted JSON
-      def highlight_json_redacted(obj)
-        return "" if obj.nil?
-        redacted = redact_for_display(obj)
-        highlight_json(redacted)
-      end
       # Syntax-highlights a Ruby object as pretty-printed JSON
       #
       # Converts the object to JSON and applies color highlighting

data/app/views/ruby_llm/agents/dashboard/index.html.erb CHANGED

@@ -2,27 +2,27 @@
 <%= render partial: "ruby_llm/agents/dashboard/action_center", locals: { critical_alerts: @critical_alerts } %>
 <!-- Stats Strip + Range Selector -->
-<div class="flex items-center justify-between mb-3">
-  <div class="flex items-center gap-4">
-    <h1 class="text-[10px] font-medium text-gray-400 dark:text-gray-500 uppercase tracking-widest font-mono">overview</h1>
-    <div class="flex items-center gap-1.5 font-mono text-xs text-gray-400 dark:text-gray-500">
+<div class="flex flex-wrap items-center justify-between gap-2 mb-3">
+  <div class="flex items-center gap-4 min-w-0">
+    <h1 class="text-[10px] font-medium text-gray-400 dark:text-gray-500 uppercase tracking-widest font-mono shrink-0">overview</h1>
+    <div class="flex flex-wrap items-center gap-x-1.5 gap-y-0.5 font-mono text-xs text-gray-400 dark:text-gray-500">
       <% total = @now_strip[:success_today] + @now_strip[:errors_today] %>
-      <span class="text-gray-800 dark:text-gray-200"><%= number_with_delimiter(total) %></span> runs
+      <span class="whitespace-nowrap"><span class="text-gray-800 dark:text-gray-200"><%= number_with_delimiter(total) %></span> runs</span>
       <span class="text-gray-300 dark:text-gray-700">&middot;</span>
-      <span class="<%= @now_strip[:errors_today] > 0 ? 'text-red-500' : 'text-gray-800 dark:text-gray-200' %>"><%= @now_strip[:errors_today] %></span> errors<% if total > 0 && @now_strip[:errors_today] > 0 %> <span class="text-gray-300 dark:text-gray-600">(<%= (@now_strip[:errors_today].to_f / total * 100).round(1) %>%)</span><% end %>
+      <span class="whitespace-nowrap"><span class="<%= @now_strip[:errors_today] > 0 ? 'text-red-500' : 'text-gray-800 dark:text-gray-200' %>"><%= @now_strip[:errors_today] %></span> errors<% if total > 0 && @now_strip[:errors_today] > 0 %> <span class="text-gray-300 dark:text-gray-600">(<%= (@now_strip[:errors_today].to_f / total * 100).round(1) %>%)</span><% end %></span>
       <span class="text-gray-300 dark:text-gray-700">&middot;</span>
-      <span class="text-gray-800 dark:text-gray-200">$<%= number_with_precision(@now_strip[:cost_today], precision: 2) %></span>
+      <span class="text-gray-800 dark:text-gray-200 whitespace-nowrap">$<%= number_with_precision(@now_strip[:cost_today], precision: 2) %></span>
       <% if @cache_savings[:count] > 0 %>
         <span class="text-gray-300 dark:text-gray-700">&middot;</span>
-        <span class="text-green-500"><%= number_with_delimiter(@cache_savings[:count]) %></span> cache hits
+        <span class="whitespace-nowrap"><span class="text-green-500"><%= number_with_delimiter(@cache_savings[:count]) %></span> cache hits</span>
         <span class="text-gray-300 dark:text-gray-700">&middot;</span>
-        <span class="text-green-500">$<%= number_with_precision(@cache_savings[:estimated_savings], precision: 2) %></span> saved
+        <span class="whitespace-nowrap"><span class="text-green-500">$<%= number_with_precision(@cache_savings[:estimated_savings], precision: 2) %></span> saved</span>
         <span class="text-gray-300 dark:text-gray-700">&middot;</span>
-        <span class="text-gray-800 dark:text-gray-200"><%= @cache_savings[:hit_rate] %>%</span> hit rate
+        <span class="whitespace-nowrap"><span class="text-gray-800 dark:text-gray-200"><%= @cache_savings[:hit_rate] %>%</span> hit rate</span>
       <% end %>
     </div>
   </div>
-  <div class="relative font-mono text-xs" x-data="{ open: false, showCustom: false }" @click.outside="open = false; showCustom = false">
+  <div class="relative font-mono text-xs shrink-0" x-data="{ open: false, showCustom: false }" @click.outside="open = false; showCustom = false">
     <button @click="open = !open" class="flex items-center gap-1 px-2 py-0.5 text-gray-900 dark:text-gray-100 hover:text-gray-600 dark:hover:text-gray-300">
       <% if @selected_range == "custom" && @custom_from && @custom_to %>
         <%= @custom_from.strftime("%b %-d") %> – <%= @custom_to.strftime("%b %-d") %>

data/app/views/ruby_llm/agents/system_config/show.html.erb CHANGED

@@ -6,8 +6,6 @@
   budgets = @config.budgets || {}
   budgets_enabled = @config.budgets_enabled?
   alerts_enabled = @config.on_alert.respond_to?(:call)
-  redaction = @config.redaction || {}
-  redaction_enabled = redaction.present?
 %>
 <!-- ── system config ──────────────── -->
@@ -242,17 +240,6 @@
     <span class="w-36 flex-shrink-0 text-gray-500 dark:text-gray-400">persist responses</span>
     <span class="badge badge-sm <%= @config.persist_responses ? 'badge-success' : 'badge-timeout' %>"><%= @config.persist_responses ? 'on' : 'off' %></span>
   </div>
-  <div class="flex items-center gap-3 py-0.5">
-    <span class="w-36 flex-shrink-0 text-gray-500 dark:text-gray-400">pii redaction</span>
-    <% if redaction_enabled %>
-      <span class="badge badge-sm badge-success">on</span>
-      <span class="text-gray-400 dark:text-gray-600"><%= @config.redaction_fields.count %> fields</span>
-      <span class="text-gray-300 dark:text-gray-700">&middot;</span>
-      <span class="text-gray-400 dark:text-gray-600"><%= @config.redaction_patterns.count %> patterns</span>
-    <% else %>
-      <span class="text-gray-300 dark:text-gray-700">—</span>
-    <% end %>
-  </div>
 </div>
 <!-- Footer note -->

data/lib/generators/ruby_llm_agents/templates/initializer.rb.tt CHANGED

@@ -179,19 +179,4 @@ RubyLLM::Agents.configure do |config|
   # Whether to persist LLM responses in execution records
   # config.persist_responses = true
-  # Redaction configuration for PII and sensitive data
-  # - fields: Parameter names to redact (extends defaults: password, token, api_key, secret, etc.)
-  # - patterns: Regex patterns to match and redact in string values
-  # - placeholder: String to replace redacted values with
-  # - max_value_length: Truncate values longer than this (nil = no limit)
-  # config.redaction = {
-  #   fields: %w[ssn credit_card phone_number email],
-  #   patterns: [
-  #     /\b\d{3}-\d{2}-\d{4}\b/,           # SSN
-  #     /\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b/  # Credit card
-  #   ],
-  #   placeholder: "[REDACTED]",
-  #   max_value_length: 5000
-  # }
 end

data/lib/ruby_llm/agents/core/version.rb CHANGED

@@ -4,6 +4,6 @@ module RubyLLM
   module Agents
     # Current version of the RubyLLM::Agents gem
     # @return [String] Semantic version string
-    VERSION = "3.6.0"
+    VERSION = "3.7.0"
   end
 end

data/lib/ruby_llm/agents/eval/eval_result.rb ADDED

@@ -0,0 +1,73 @@
+# frozen_string_literal: true
+module RubyLLM
+  module Agents
+    module Eval
+      # Holds the result of evaluating a single test case.
+      #
+      # Contains the test case definition, the agent's result, the score,
+      # and any error that occurred during execution.
+      class EvalResult
+        attr_reader :test_case, :agent_result, :score, :execution_id, :error
+        def initialize(test_case:, agent_result:, score:, execution_id: nil, error: nil)
+          @test_case = test_case
+          @agent_result = agent_result
+          @score = score
+          @execution_id = execution_id
+          @error = error
+        end
+        def test_case_name
+          test_case.name
+        end
+        def input
+          test_case.input
+        end
+        def expected
+          test_case.expected
+        end
+        def passed?(threshold = 0.5)
+          score.passed?(threshold)
+        end
+        def failed?(threshold = 0.5)
+          score.failed?(threshold)
+        end
+        def errored?
+          !error.nil?
+        end
+        def actual
+          return nil unless agent_result
+          if agent_result.respond_to?(:route)
+            {route: agent_result.route}
+          elsif agent_result.respond_to?(:content)
+            agent_result.content
+          else
+            agent_result
+          end
+        end
+        def to_h
+          {
+            name: test_case_name,
+            score: score.value,
+            reason: score.reason,
+            passed: passed?,
+            input: input,
+            expected: expected,
+            actual: actual,
+            execution_id: execution_id,
+            error: error&.message
+          }
+        end
+      end
+    end
+  end
+end

data/lib/ruby_llm/agents/eval/eval_run.rb ADDED

@@ -0,0 +1,124 @@
+# frozen_string_literal: true
+module RubyLLM
+  module Agents
+    module Eval
+      # Aggregate results from running an eval suite.
+      #
+      # Provides score calculation, pass/fail counts, failure details,
+      # and a formatted summary string.
+      class EvalRun
+        attr_reader :suite, :results, :model, :pass_threshold,
+          :started_at, :completed_at
+        def initialize(suite:, results:, model:, pass_threshold:, started_at:, completed_at:)
+          @suite = suite
+          @results = results
+          @model = model
+          @pass_threshold = pass_threshold
+          @started_at = started_at
+          @completed_at = completed_at
+        end
+        def agent_class
+          suite.respond_to?(:agent_class) ? suite.agent_class : suite
+        end
+        # Average score across all test cases (0.0 to 1.0)
+        def score
+          return 0.0 if results.empty?
+          results.sum { |r| r.score.value } / results.size.to_f
+        end
+        def score_pct
+          (score * 100).round(1)
+        end
+        def total_cases
+          results.size
+        end
+        def passed
+          results.count { |r| r.passed?(pass_threshold) }
+        end
+        def failed
+          results.count { |r| r.failed?(pass_threshold) }
+        end
+        def failures
+          results.select { |r| r.failed?(pass_threshold) }
+        end
+        def errors
+          results.select(&:errored?)
+        end
+        def total_cost
+          results.sum do |r|
+            next 0 unless r.execution_id
+            if defined?(Execution)
+              Execution.find_by(id: r.execution_id)&.total_cost || 0
+            else
+              0
+            end
+          end
+        rescue
+          0
+        end
+        def duration_ms
+          return 0 unless started_at && completed_at
+          ((completed_at - started_at) * 1000).to_i
+        end
+        def summary
+          agent_name = agent_class.respond_to?(:name) ? agent_class.name : agent_class.to_s
+          lines = ["#{agent_name} Eval — #{started_at.strftime("%Y-%m-%d %H:%M")}"]
+          lines << "Model: #{model} | Score: #{score_pct}% | #{passed}/#{total_cases} passed"
+          lines << "Cost: $#{"%.4f" % total_cost} | Duration: #{(duration_ms / 1000.0).round(1)}s"
+          if failures.any?
+            lines << ""
+            lines << "Failures:"
+            failures.each do |r|
+              lines << "  - #{r.test_case_name}: expected #{r.expected.inspect}, got #{r.actual.inspect} (#{r.score.reason})"
+            end
+          end
+          if errors.any?
+            lines << ""
+            lines << "Errors:"
+            errors.each do |r|
+              lines << "  - #{r.test_case_name}: #{r.error.message}"
+            end
+          end
+          lines.join("\n")
+        end
+        def to_h
+          {
+            agent: agent_class.respond_to?(:name) ? agent_class.name : agent_class.to_s,
+            model: model,
+            score: score,
+            score_pct: score_pct,
+            total_cases: total_cases,
+            passed: passed,
+            failed: failed,
+            total_cost: total_cost,
+            duration_ms: duration_ms,
+            results: results.map(&:to_h)
+          }
+        end
+        def to_json(*args)
+          to_h.to_json(*args)
+        end
+      end
+    end
+  end
+end
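
To make the aggregation in `EvalRun` concrete, here is a small stand-alone sketch of the same arithmetic: the overall score is the mean of per-case scores, while `passed` and `failed` depend on `pass_threshold`. `Result` and `summarize` are simplified stand-ins invented for this illustration. The last row models an errored case, which `EvalSuite` records with a score of 0.0, so it appears to count as both a failure and an error.

```ruby
# Simplified stand-in for EvalResult; the real class wraps a Score object.
Result = Struct.new(:name, :value, :errored) do
  def passed?(threshold = 0.5)
    value >= threshold
  end
end

results = [
  Result.new("billing",   1.0,  false),
  Result.new("technical", 0.75, false),
  Result.new("greeting",  0.25, false),
  Result.new("timeout",   0.0,  true) # errored cases are scored 0.0
]

# Same arithmetic as EvalRun#score, #score_pct, #passed, #failed, #errors.
def summarize(results, pass_threshold:)
  mean = results.empty? ? 0.0 : results.sum(&:value) / results.size.to_f
  {
    score_pct: (mean * 100).round(1),
    passed: results.count { |r| r.passed?(pass_threshold) },
    failed: results.count { |r| !r.passed?(pass_threshold) },
    errors: results.count(&:errored)
  }
end

default_run = summarize(results, pass_threshold: 0.5)
strict_run  = summarize(results, pass_threshold: 0.8)
p default_run
p strict_run
```

Raising the threshold changes the pass count but not the score percentage, which is why the two figures can disagree in a summary.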

data/lib/ruby_llm/agents/eval/eval_suite.rb ADDED

@@ -0,0 +1,264 @@
+# frozen_string_literal: true
+module RubyLLM
+  module Agents
+    module Eval
+      # Score value object — returned by every scorer
+      Score = Struct.new(:value, :reason, keyword_init: true) do
+        def initialize(value:, reason: nil)
+          super(value: value.to_f.clamp(0.0, 1.0), reason: reason)
+        end
+        def passed?(threshold = 0.5)
+          value >= threshold
+        end
+        def failed?(threshold = 0.5)
+          !passed?(threshold)
+        end
+      end
+      # A single test case definition
+      TestCase = Struct.new(:name, :input, :expected, :scorer, :options, keyword_init: true) do
+        def resolve_input
+          input.is_a?(Proc) ? input.call : input
+        end
+      end
+      # Defines test cases for an agent, runs them, scores results.
+      #
+      # @example
+      #   class SupportRouter::Eval < RubyLLM::Agents::EvalSuite
+      #     agent SupportRouter
+      #     test_case "billing", input: { message: "charged twice" }, expected: { route: :billing }
+      #   end
+      #
+      #   run = SupportRouter::Eval.run!
+      #   puts run.summary
+      class EvalSuite
+        class << self
+          attr_reader :agent_class, :test_cases, :eval_options
+          def inherited(subclass)
+            super
+            subclass.instance_variable_set(:@test_cases, [])
+            subclass.instance_variable_set(:@eval_options, {})
+          end
+          # --- DSL ---
+          def agent(klass)
+            @agent_class = klass
+          end
+          def test_case(name, input:, expected: nil, score: nil, **options)
+            @test_cases << TestCase.new(
+              name: name,
+              input: input,
+              expected: expected,
+              scorer: score,
+              options: options
+            )
+          end
+          def dataset(path)
+            full_path = path.start_with?("/") ? path : Rails.root.join(path).to_s
+            cases = YAML.safe_load_file(full_path, permitted_classes: [Symbol], symbolize_names: true)
+            cases.each do |tc|
+              test_case(
+                tc[:name],
+                input: tc[:input],
+                expected: tc[:expected],
+                score: tc[:score]&.to_sym,
+                **tc.except(:name, :input, :expected, :score)
+              )
+            end
+          end
+          def eval_model(value)
+            @eval_options[:model] = value
+          end
+          def eval_temperature(value)
+            @eval_options[:temperature] = value
+          end
+          # --- Running ---
+          def run!(model: nil, only: nil, pass_threshold: 0.5, overrides: {}, **options)
+            validate!
+            cases = only ? @test_cases.select { |tc| Array(only).include?(tc.name) } : @test_cases
+            resolved_model = model || @eval_options[:model]
+            temperature = @eval_options[:temperature]
+            started_at = Time.current
+            results = cases.map do |tc|
+              run_single(tc, model: resolved_model, temperature: temperature, overrides: overrides)
+            end
+            EvalRun.new(
+              suite: self,
+              results: results,
+              model: resolved_model || (agent_class.respond_to?(:model) ? agent_class.model : nil),
+              pass_threshold: pass_threshold,
+              started_at: started_at,
+              completed_at: Time.current
+            )
+          end
+          def validate!
+            raise ConfigurationError, "No agent class set" unless @agent_class
+            raise ConfigurationError, "No test cases defined" if @test_cases.empty?
+            @test_cases.each do |tc|
+              next if tc.input.is_a?(Proc)
+              next unless @agent_class.respond_to?(:params)
+              agent_params = @agent_class.params
+              required = agent_params.select { |_, v| v[:required] }.keys
+              missing = required - tc.input.keys
+              if missing.any?
+                raise ConfigurationError,
+                  "Test case '#{tc.name}' missing required params: #{missing.join(", ")}"
+              end
+            end
+            true
+          end
+          def for(agent_klass, &block)
+            suite = Class.new(self)
+            suite.agent(agent_klass)
+            suite.instance_eval(&block) if block
+            suite
+          end
+          private
+          def run_single(tc, model:, temperature:, overrides:)
+            input = tc.resolve_input
+            call_options = input.dup
+            call_options.merge!(overrides) if overrides.any?
+            call_options[:model] = model if model
+            call_options[:temperature] = temperature if temperature
+            agent_result = agent_class.call(**call_options)
+            score = evaluate(tc, agent_result)
+            EvalResult.new(
+              test_case: tc,
+              agent_result: agent_result,
+              score: score,
+              execution_id: agent_result.respond_to?(:execution_id) ? agent_result.execution_id : nil
+            )
+          rescue ArgumentError
+            raise
+          rescue => e
+            EvalResult.new(
+              test_case: tc,
+              agent_result: nil,
+              score: Score.new(value: 0.0, reason: "Error: #{e.class}: #{e.message}"),
+              error: e
+            )
+          end
+          def evaluate(tc, agent_result)
+            case tc.scorer
+            when Proc
+              coerce_score(tc.scorer.call(agent_result, tc.expected))
+            when :contains
+              score_contains(agent_result, tc.expected)
+            when :llm_judge
+              score_llm_judge(agent_result, tc)
+            when :exact_match, nil
+              score_exact_match(agent_result, tc.expected)
+            else
+              raise ArgumentError, "Unknown scorer: #{tc.scorer}"
+            end
+          end
+          def coerce_score(value)
+            case value
+            when Score then value
+            when Numeric then Score.new(value: value)
+            when true then Score.new(value: 1.0)
+            when false then Score.new(value: 0.0)
+            else Score.new(value: 0.0, reason: "Scorer returned #{value.class}")
+            end
+          end
+          # --- Built-in scorers ---
+          def score_exact_match(result, expected)
+            actual = extract_comparable(result)
+            expected_val = normalize_expected(expected)
+            if actual == expected_val
+              Score.new(value: 1.0)
+            else
+              Score.new(value: 0.0, reason: "Expected #{expected_val.inspect}, got #{actual.inspect}")
+            end
+          end
+          def score_contains(result, expected)
+            content = result.respond_to?(:content) ? result.content.to_s : result.to_s
+            targets = Array(expected)
+            missing = targets.reject { |e| content.downcase.include?(e.to_s.downcase) }
+            if missing.empty?
+              Score.new(value: 1.0)
+            else
+              Score.new(value: 0.0, reason: "Missing: #{missing.join(", ")}")
+            end
+          end
+          def score_llm_judge(result, tc)
+            content = result.respond_to?(:content) ? result.content.to_s : result.to_s
+            criteria = tc.options[:criteria]
+            judge_model = tc.options[:judge_model] || "gpt-4o-mini"
+            prompt = <<~PROMPT
+              You are evaluating an AI agent's response. Score it from 0 to 10.
+              ## Input
+              #{tc.input.inspect}
+              ## Agent Response
+              #{content}
+              ## Criteria
+              #{criteria}
+              Respond with ONLY a JSON object:
+              {"score": <0-10>, "reason": "<brief explanation>"}
+            PROMPT
+            chat = RubyLLM.chat(model: judge_model)
+            parsed = JSON.parse(chat.ask(prompt).content)
+            Score.new(value: parsed["score"].to_f / 10.0, reason: parsed["reason"])
+          rescue => e
+            Score.new(value: 0.0, reason: "Judge error: #{e.message}")
+          end
+          def extract_comparable(result)
+            if result.respond_to?(:route)
+              {route: result.route}
+            elsif result.respond_to?(:content)
+              content = result.content
+              content.is_a?(Hash) ? content.transform_keys(&:to_sym) : content.to_s.strip
+            else
+              result
+            end
+          end
+          def normalize_expected(expected)
+            case expected
+            when Hash then expected.transform_keys(&:to_sym)
+            when String then expected.strip
+            else expected
+            end
+          end
+        end
+      end
+    end
+  end
+end
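
The default `:exact_match` scorer compares a normalized view of the agent result with a normalized `expected` value. The sketch below copies `extract_comparable` and `normalize_expected` from the file above into a stand-alone script to show which `expected` shapes compare equal. `RouteResult`, `TextResult`, and `exact_match?` are hypothetical stand-ins, since the gem's real result classes are not part of this diff.

```ruby
# Stand-ins for agent results; only the methods the scorer probes are defined.
RouteResult = Struct.new(:route)
TextResult  = Struct.new(:content)

# Copied from the diff: results exposing #route are compared as {route: ...}.
def extract_comparable(result)
  if result.respond_to?(:route)
    {route: result.route}
  elsif result.respond_to?(:content)
    content = result.content
    content.is_a?(Hash) ? content.transform_keys(&:to_sym) : content.to_s.strip
  else
    result
  end
end

# Copied from the diff: hash keys are symbolized, strings are stripped.
def normalize_expected(expected)
  case expected
  when Hash then expected.transform_keys(&:to_sym)
  when String then expected.strip
  else expected
  end
end

def exact_match?(result, expected)
  extract_comparable(result) == normalize_expected(expected)
end

puts exact_match?(RouteResult.new(:billing), {"route" => :billing}) # true
puts exact_match?(RouteResult.new(:billing), {route: "billing"})    # false
puts exact_match?(RouteResult.new(:billing), "billing")             # false
puts exact_match?(TextResult.new("  general \n"), "general")        # true
```

Only hash keys are symbolized, not values, so `:billing` and `"billing"` differ. If a router's result does respond to `route`, these rules suggest a bare string `expected` would not match and the hash form from the `EvalSuite` docstring is the safer one; whether `SupportRouter` behaves that way is not visible in this diff.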

data/lib/ruby_llm/agents/eval.rb ADDED

@@ -0,0 +1,5 @@
+# frozen_string_literal: true
+require_relative "eval/eval_suite"
+require_relative "eval/eval_result"
+require_relative "eval/eval_run"

data/lib/ruby_llm/agents.rb CHANGED

@@ -75,6 +75,9 @@ require_relative "agents/image/analyzer"
 require_relative "agents/image/background_remover"
 require_relative "agents/image/pipeline"
+# Evaluation framework
+require_relative "agents/eval"
 # Rails integration
 if defined?(Rails)
   require_relative "agents/core/inflections"

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: ruby_llm-agents
 version: !ruby/object:Gem::Version
-  version: 3.6.0
+  version: 3.7.0
 platform: ruby
 authors:
 - adham90
@@ -237,6 +237,10 @@ files:
 - lib/ruby_llm/agents/dsl/caching.rb
 - lib/ruby_llm/agents/dsl/queryable.rb
 - lib/ruby_llm/agents/dsl/reliability.rb
+- lib/ruby_llm/agents/eval.rb
+- lib/ruby_llm/agents/eval/eval_result.rb
+- lib/ruby_llm/agents/eval/eval_run.rb
+- lib/ruby_llm/agents/eval/eval_suite.rb
 - lib/ruby_llm/agents/image/analyzer.rb
 - lib/ruby_llm/agents/image/analyzer/dsl.rb
 - lib/ruby_llm/agents/image/analyzer/execution.rb