eval-ruby 0.1.1 → 0.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +60 -0
- data/Gemfile.lock +2 -2
- data/MILESTONES.md +13 -0
- data/README.md +39 -0
- data/lib/eval_ruby/comparison.rb +18 -1
- data/lib/eval_ruby/configuration.rb +37 -2
- data/lib/eval_ruby/dataset.rb +118 -13
- data/lib/eval_ruby/embedders/base.rb +29 -0
- data/lib/eval_ruby/embedders/openai.rb +83 -0
- data/lib/eval_ruby/evaluator.rb +36 -0
- data/lib/eval_ruby/judges/anthropic.rb +8 -0
- data/lib/eval_ruby/judges/base.rb +11 -0
- data/lib/eval_ruby/judges/openai.rb +8 -0
- data/lib/eval_ruby/metrics/base.rb +8 -0
- data/lib/eval_ruby/metrics/context_precision.rb +10 -0
- data/lib/eval_ruby/metrics/context_recall.rb +10 -0
- data/lib/eval_ruby/metrics/correctness.rb +13 -0
- data/lib/eval_ruby/metrics/faithfulness.rb +10 -0
- data/lib/eval_ruby/metrics/mrr.rb +8 -0
- data/lib/eval_ruby/metrics/ndcg.rb +10 -0
- data/lib/eval_ruby/metrics/precision_at_k.rb +9 -0
- data/lib/eval_ruby/metrics/recall_at_k.rb +9 -0
- data/lib/eval_ruby/metrics/relevance.rb +10 -0
- data/lib/eval_ruby/metrics/semantic_similarity.rb +72 -0
- data/lib/eval_ruby/report.rb +38 -1
- data/lib/eval_ruby/result.rb +29 -1
- data/lib/eval_ruby/rspec.rb +48 -6
- data/lib/eval_ruby/version.rb +1 -1
- data/lib/eval_ruby.rb +87 -3
- metadata +6 -1
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 6ba2db8c803feba5276939550bf2bd3c9dca548f2eca3223ce1f2df39e83bfa8
+  data.tar.gz: 9a3a0a5dbb7eda4d64669e11a5b76c20f43fd0cdb3946934f190475cc460ccd3
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: a6b52e6caf04da090f8595808f76e8edaa901b7eec90178054de0f2c4d8d27c1e2b46f8c154961b91c5ef3a849e93cbbc411df77e4ec2919ea48dd7ead869a1a
+  data.tar.gz: 06eb01c2e177dc263dcdf69ea38debca5989fc6d55e3be31e50c6b7e5bc8539d06df90b815b35680f8bc292d8bc44cebf38496dde5a65cea9cd8f4b3cedee18f
data/CHANGELOG.md ADDED
@@ -0,0 +1,60 @@
+# Changelog
+
+All notable changes to this project are documented in this file.
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [0.3.0] - 2026-04-13
+
+Reshape release — the original v0.3.0 scope (cost tracking, async/parallel batch, HTML reports) was redistributed to a new **v0.4.0 "Batch, Reporting & Cost"** milestone, and the previous v0.4.0 "Testing Framework Integration" work slid to **v0.5.0**. This release instead prioritizes the API surface needed by the omnibot Wicara eval integration, plus the two quickly actionable items from the original v0.3.0 plan.
+
+### Added
+- `EvalRuby::Embedders::Base` — abstract base class for pluggable embedding backends, mirroring the existing `Judges::Base` pattern.
+- `EvalRuby::Embedders::OpenAI` — OpenAI embeddings backend (`/v1/embeddings`) with retry/timeout handling and out-of-order response reassembly.
+- `EvalRuby::Metrics::SemanticSimilarity` — Ragas-style answer-similarity metric. Computes cosine similarity between answer and ground-truth embeddings via an injected embedder. Judge-free, fast, deterministic; ideal for chatbot regression testing.
+- `Configuration#embedder_llm`, `#embedder_model`, `#embedder_api_key` — new keys controlling the embedder. `embedder_api_key` falls back to `api_key` when unset, so most users only configure one OpenAI key.
+- `EvalRuby.evaluate_batch(dataset) { |progress| ... }` — block form that yields an `EvalRuby::Progress` struct (`current`, `total`, `elapsed`, `percent`) after each sample. Backwards compatible — batch calls without a block behave exactly as before.
+
+### Changed
+- `Dataset.generate` hardened:
+  - validates `questions_per_doc` is a positive integer
+  - validates document paths exist (raises a clear error instead of a `File.read` crash)
+  - expands directory paths via `Dir.glob(**/*)` to support "scan this folder" workflows
+  - accepts a single path string, not just an array
+  - tolerates malformed LLM responses (missing `pairs`, non-array `pairs`, non-hash entries, missing `question`/`answer`) — skips the bad pair rather than crashing the whole generation
+  - tolerates a judge raising mid-batch — logs the failure as a skip and continues with the remaining documents
+  - accepts an injected `judge:` parameter for testing (and for custom judge plumbing)
+
+### Notes
+- `SemanticSimilarity` is **opt-in** — not part of the default `Evaluator` roster. Instantiate it directly when you want reference-based scoring without an LLM judge.
+- Deferred to v0.4.0: cost tracking per evaluation (#10), async/parallel batch evaluation (#11), HTML report generation (#12).
+- Deferred to v0.5.0: CI/test-framework integration (JUnit XML, regression detection, GitHub Actions workflow, expanded matchers/assertions — formerly v0.4.0).
+
+## [0.2.0] - 2026-03-17
+
+### Added
+- Comprehensive test suite covering all metrics, judges, datasets, reports, and error paths.
+- YARD documentation across all public APIs.
+- RSpec matchers and Minitest assertions for integration in user test suites.
+- A/B comparison reports with statistical significance testing.
+
+## [0.1.1] - 2026-03-10
+
+### Fixed
+- Transient API failures now retry with exponential backoff (previously a single timeout raised immediately).
+- `Judges::OpenAI#initialize` rejects missing/empty API keys up front with a clear error message.
+- String-context metrics now handle strings passed in place of arrays without crashing.
+- Standard-deviation computation in `Report#summary` no longer divides by zero for single-sample reports.
+
+### Added
+- Error subclasses (`APIError`, `TimeoutError`, `InvalidResponseError`) so callers can rescue at the right granularity.
+
+## [0.1.0] - 2026-03-09
+
+- Initial release.
+- LLM-as-judge metrics: faithfulness, relevance, correctness, context precision, context recall.
+- Retrieval metrics: precision@k, recall@k, MRR, NDCG, hit rate.
+- OpenAI and Anthropic judge backends.
+- `Dataset` with CSV/JSON import and export.
+- `Report` with per-metric summary, worst-cases, failure filtering, and CSV export.
+- `Configuration` DSL for judge model, API key, threshold, timeout, retries.
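The block form of `evaluate_batch` added in 0.3.0 yields a `Progress` struct with `current`, `total`, `elapsed`, and `percent`. A minimal sketch of how such a struct can behave (field names taken from the changelog; the implementation below is illustrative, not the gem's actual code):

```ruby
# Sketch of a Progress struct matching the fields named in the changelog.
# `percent` is derived from current/total; elapsed is wall-clock seconds.
Progress = Struct.new(:current, :total, :elapsed, keyword_init: true) do
  def percent
    return 0.0 if total.zero?
    (current.to_f / total * 100).round(1)
  end
end

updates = []
start = Time.now
3.times do |i|
  # ... evaluate sample i here ...
  updates << Progress.new(current: i + 1, total: 3, elapsed: Time.now - start)
end

updates.map(&:percent) # => [33.3, 66.7, 100.0]
```

A caller's block would typically print `"#{progress.current}/#{progress.total} (#{progress.percent}%)"` after each sample.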
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    eval-ruby (0.
+    eval-ruby (0.2.0)
       csv
 
 GEM
@@ -39,7 +39,7 @@ CHECKSUMS
   bigdecimal (4.0.1) sha256=8b07d3d065a9f921c80ceaea7c9d4ae596697295b584c296fe599dd0ad01c4a7
   crack (1.0.1) sha256=ff4a10390cd31d66440b7524eb1841874db86201d5b70032028553130b6d4c7e
   csv (3.3.5) sha256=6e5134ac3383ef728b7f02725d9872934f523cb40b961479f69cf3afa6c8e73f
-  eval-ruby (0.
+  eval-ruby (0.2.0)
   hashdiff (1.2.1) sha256=9c079dbc513dfc8833ab59c0c2d8f230fa28499cc5efb4b8dd276cf931457cd1
   minitest (5.27.0) sha256=2d3b17f8a36fe7801c1adcffdbc38233b938eb0b4966e97a6739055a45fa77d5
   public_suffix (7.0.5) sha256=1a8bb08f1bbea19228d3bed6e5ed908d1cb4f7c2726d18bd9cadf60bc676f623
data/MILESTONES.md ADDED
data/README.md CHANGED
@@ -50,6 +50,45 @@ result.overall # => 0.94
 - **NDCG** (Normalized Discounted Cumulative Gain)
 - **Hit Rate**
 
+### Embedding-Based
+
+- **Semantic Similarity** — cosine similarity between answer and ground truth via a pluggable embedder. Judge-free, fast, deterministic; useful for chatbot regression testing where you want a reference-based score without an LLM call.
+
+## Semantic Similarity
+
+`SemanticSimilarity` is opt-in (not part of the default `Evaluator` roster in v0.3.0). Instantiate it directly when you need reference-based scoring without an LLM judge — for example, scoring a chatbot's actual response against a fixed expected response.
+
+```ruby
+EvalRuby.configure do |config|
+  config.api_key = ENV["OPENAI_API_KEY"]           # shared with judge by default
+  config.embedder_model = "text-embedding-3-small" # default; also supports text-embedding-3-large
+  # config.embedder_api_key = ENV["OTHER_KEY"]     # optional; falls back to api_key
+end
+
+embedder = EvalRuby::Embedders::OpenAI.new(EvalRuby.configuration)
+metric = EvalRuby::Metrics::SemanticSimilarity.new(embedder: embedder)
+
+result = metric.call(
+  answer: "Paris is the capital of France",
+  ground_truth: "The capital of France is Paris"
+)
+
+result[:score]            # => 0.92
+result[:details][:cosine] # => 0.92 (raw, pre-clamp)
+result[:details][:model]  # => "text-embedding-3-small"
+```
+
+**When to use `SemanticSimilarity` vs `Correctness`:**
+
+| | `Correctness` | `SemanticSimilarity` |
+|---|---|---|
+| Backend | LLM judge (GPT-4, Claude, …) | Embeddings + cosine |
+| Cost per call | $$ (judge LLM tokens) | $ (embedding tokens) |
+| Latency | High (LLM generation) | Low (embedding lookup) |
+| Determinism | Low (model-dependent) | High |
+| Reasoning | Natural-language rationale in details | Raw cosine value |
+| Best for | Nuanced/subjective answers | Regression tests, bulk scoring |
+
 ## Retrieval Evaluation
 
 ```ruby
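The `SemanticSimilarity` score shown in the README reduces to cosine similarity between two embedding vectors. A toy cosine function, assuming plain `Float` arrays as an embedder would return (illustrative, not the gem's internal code):

```ruby
# Cosine similarity between two equal-length numeric vectors:
# dot(a, b) / (|a| * |b|). Identical directions score 1.0, orthogonal 0.0.
def cosine(a, b)
  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

cosine([1.0, 0.0], [1.0, 0.0]) # => 1.0
cosine([1.0, 0.0], [0.0, 1.0]) # => 0.0
```

Real embedding vectors (e.g. 1536 dimensions for `text-embedding-3-small`) follow the same arithmetic, just with longer arrays.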
data/lib/eval_ruby/comparison.rb CHANGED
@@ -1,14 +1,27 @@
 # frozen_string_literal: true
 
 module EvalRuby
+  # Statistical comparison of two evaluation reports using paired t-tests.
+  #
+  # @example
+  #   comparison = EvalRuby.compare(report_a, report_b)
+  #   puts comparison.summary
+  #   comparison.significant_improvements # => [:faithfulness]
   class Comparison
-
+    # @return [Report] baseline report
+    attr_reader :report_a
 
+    # @return [Report] comparison report
+    attr_reader :report_b
+
+    # @param report_a [Report] baseline
+    # @param report_b [Report] comparison
     def initialize(report_a, report_b)
       @report_a = report_a
       @report_b = report_b
     end
 
+    # @return [String] formatted comparison table with deltas and p-values
     def summary
       lines = [
         format("%-20s | %-10s | %-10s | %-8s | %s", "Metric", "A", "B", "Delta", "p-value"),
@@ -35,6 +48,10 @@ module EvalRuby
       lines.join("\n")
     end
 
+    # Returns metrics where report_b is significantly better than report_a.
+    #
+    # @param alpha [Float] significance level (default 0.05)
+    # @return [Array<Symbol>] metric names with significant improvements
    def significant_improvements(alpha: 0.05)
      all_metrics.select do |metric|
        scores_a = @report_a.results.filter_map { |r| r.scores[metric] }
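`significant_improvements` rests on a paired t-test over per-sample score differences. A sketch of the paired t-statistic in plain Ruby (the gem's internals may differ; the p-value step, which needs the t-distribution CDF, is omitted here):

```ruby
# Paired t-statistic: t = mean(d) / (sd(d) / sqrt(n)), where d are the
# per-sample differences scores_b[i] - scores_a[i].
def paired_t(scores_a, scores_b)
  diffs = scores_b.zip(scores_a).map { |b, a| b - a }
  n     = diffs.length.to_f
  mean  = diffs.sum / n
  var   = diffs.sum { |d| (d - mean)**2 } / (n - 1) # sample variance
  mean / Math.sqrt(var / n)
end

paired_t([0.5, 0.6, 0.7], [1.5, 2.6, 3.7]) # => ~3.46
```

A large |t| relative to the t-distribution's critical value at `alpha` marks the metric's change as significant.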
data/lib/eval_ruby/configuration.rb CHANGED
@@ -1,14 +1,49 @@
 # frozen_string_literal: true
 
 module EvalRuby
+  # Global configuration for EvalRuby.
+  #
+  # @example
+  #   EvalRuby.configure do |config|
+  #     config.judge_llm = :openai
+  #     config.api_key = ENV["OPENAI_API_KEY"]
+  #     config.judge_model = "gpt-4o"
+  #   end
   class Configuration
-
-
+    # @return [Symbol] LLM provider for judge (:openai or :anthropic)
+    attr_accessor :judge_llm
+
+    # @return [String] model name for the judge LLM
+    attr_accessor :judge_model
+
+    # @return [String, nil] API key for the judge LLM provider
+    attr_accessor :api_key
+
+    # @return [Symbol] embedding provider (:openai in v0.3.0)
+    attr_accessor :embedder_llm
+
+    # @return [String] model name for the embedder
+    attr_accessor :embedder_model
+
+    # @return [String, nil] API key for the embedder; falls back to {#api_key} when nil
+    attr_accessor :embedder_api_key
+
+    # @return [Float] default threshold for pass/fail decisions
+    attr_accessor :default_threshold
+
+    # @return [Integer] HTTP request timeout in seconds
+    attr_accessor :timeout
+
+    # @return [Integer] maximum number of retries on transient failures
+    attr_accessor :max_retries
 
     def initialize
       @judge_llm = :openai
       @judge_model = "gpt-4o"
       @api_key = nil
+      @embedder_llm = :openai
+      @embedder_model = "text-embedding-3-small"
+      @embedder_api_key = nil
       @default_threshold = 0.7
       @timeout = 30
       @max_retries = 3
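The `embedder_api_key` fallback documented above reduces to a single `||`. Shown here in isolation with a plain hash standing in for the `Configuration` object (key names from the diff; the helper is hypothetical):

```ruby
# embedder_api_key wins when set; otherwise the judge's api_key is reused,
# so most users configure only one OpenAI key.
def effective_embedder_key(config)
  config[:embedder_api_key] || config[:api_key]
end

effective_embedder_key(api_key: "sk-judge", embedder_api_key: nil)        # => "sk-judge"
effective_embedder_key(api_key: "sk-judge", embedder_api_key: "sk-embed") # => "sk-embed"
```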
data/lib/eval_ruby/dataset.rb CHANGED
@@ -4,16 +4,36 @@ require "csv"
 require "json"
 
 module EvalRuby
+  # Collection of evaluation samples with import/export support.
+  # Supports CSV, JSON, and programmatic construction.
+  #
+  # @example
+  #   dataset = EvalRuby::Dataset.new("my_test_set")
+  #   dataset.add(question: "What is Ruby?", answer: "A language", ground_truth: "A language")
+  #   report = EvalRuby.evaluate_batch(dataset)
   class Dataset
     include Enumerable
 
-
+    # @return [String] dataset name
+    attr_reader :name
 
+    # @return [Array<Hash>] sample entries
+    attr_reader :samples
+
+    # @param name [String] dataset name
     def initialize(name = "default")
       @name = name
       @samples = []
     end
 
+    # Adds a sample to the dataset.
+    #
+    # @param question [String]
+    # @param ground_truth [String, nil]
+    # @param relevant_contexts [Array<String>] alias for context
+    # @param answer [String, nil]
+    # @param context [Array<String>]
+    # @return [self]
     def add(question:, ground_truth: nil, relevant_contexts: [], answer: nil, context: [])
       @samples << {
         question: question,
@@ -24,18 +44,26 @@ module EvalRuby
       self
     end
 
+    # @yield [Hash] each sample
     def each(&block)
       @samples.each(&block)
     end
 
+    # @return [Integer] number of samples
     def size
       @samples.size
     end
 
+    # @param index [Integer]
+    # @return [Hash] sample at index
     def [](index)
       @samples[index]
     end
 
+    # Loads a dataset from a CSV file.
+    #
+    # @param path [String] path to CSV file
+    # @return [Dataset]
     def self.from_csv(path)
       dataset = new(File.basename(path, ".*"))
       CSV.foreach(path, headers: true) do |row|
@@ -49,6 +77,10 @@ module EvalRuby
       dataset
     end
 
+    # Loads a dataset from a JSON file.
+    #
+    # @param path [String] path to JSON file
+    # @return [Dataset]
     def self.from_json(path)
       dataset = new(File.basename(path, ".*"))
       data = JSON.parse(File.read(path))
@@ -64,6 +96,10 @@ module EvalRuby
       dataset
     end
 
+    # Exports dataset to CSV.
+    #
+    # @param path [String] output file path
+    # @return [void]
    def to_csv(path)
      CSV.open(path, "w") do |csv|
        csv << %w[question answer context ground_truth]
@@ -78,21 +114,40 @@ module EvalRuby
       end
     end
 
+    # Exports dataset to JSON.
+    #
+    # @param path [String] output file path
+    # @return [void]
     def to_json(path)
       File.write(path, JSON.pretty_generate({name: @name, samples: @samples}))
     end
 
-
-
-
-
-
-
-
+    # Generates a dataset from documents using an LLM.
+    #
+    # Each document is read, passed to the LLM with a prompt asking for
+    # +questions_per_doc+ QA pairs, and the resulting pairs are appended to
+    # the dataset. Directory paths are expanded via +Dir.glob+. Missing paths
+    # and malformed LLM responses raise or are skipped gracefully rather than
+    # crashing the whole generation.
+    #
+    # @param documents [String, Array<String>] file paths or directory paths
+    # @param questions_per_doc [Integer] number of QA pairs per document (must be > 0)
+    # @param llm [Symbol] LLM provider (:openai or :anthropic)
+    # @param judge [Judges::Base, nil] inject a pre-built judge (mainly for testing)
+    # @return [Dataset]
+    # @raise [EvalRuby::Error] on bad arguments or no documents found
+    def self.generate(documents:, questions_per_doc: 5, llm: :openai, judge: nil)
+      unless questions_per_doc.is_a?(Integer) && questions_per_doc.positive?
+        raise Error, "questions_per_doc must be a positive integer, got #{questions_per_doc.inspect}"
       end
 
+      document_paths = expand_document_paths(documents)
+      raise Error, "No documents found in the provided paths" if document_paths.empty?
+
+      judge ||= build_judge_for(llm)
+
       dataset = new("generated")
-
+      document_paths.each do |doc_path|
         content = File.read(doc_path)
         prompt = <<~PROMPT
           Given the following document, generate #{questions_per_doc} question-answer pairs
@@ -104,14 +159,19 @@ module EvalRuby
           Respond in JSON: {"pairs": [{"question": "...", "answer": "...", "context": "relevant excerpt"}]}
         PROMPT
 
-
-
+        begin
+          result = judge.call(prompt)
+        rescue StandardError
+          next # keep generating from remaining docs; individual failure should not abort the batch
+        end
+
+        extract_pairs(result).each do |pair|
+          next unless valid_pair?(pair)
 
-        result["pairs"].each do |pair|
           dataset.add(
             question: pair["question"],
             answer: pair["answer"],
-            context: [pair["context"]
+            context: [pair["context"].is_a?(String) && !pair["context"].empty? ? pair["context"] : content],
             ground_truth: pair["answer"]
           )
         end
@@ -119,6 +179,51 @@ module EvalRuby
       dataset
     end
 
+    # Expands a list of file/directory paths into a flat list of file paths.
+    # Validates existence — missing paths raise an Error.
+    #
+    # @param paths [String, Array<String>]
+    # @return [Array<String>] absolute-or-relative file paths, each verified to exist
+    # @raise [EvalRuby::Error] if any path does not exist
+    def self.expand_document_paths(paths)
+      result = []
+      Array(paths).each do |path|
+        if File.directory?(path)
+          result.concat(Dir.glob(File.join(path, "**/*")).select { |p| File.file?(p) }.sort)
+        elsif File.file?(path)
+          result << path
+        else
+          raise Error, "Document path does not exist: #{path}"
+        end
+      end
+      result
+    end
+
+    def self.build_judge_for(llm)
+      config = EvalRuby.configuration.dup
+      config.judge_llm = llm
+      case llm
+      when :openai then Judges::OpenAI.new(config)
+      when :anthropic then Judges::Anthropic.new(config)
+      else raise Error, "Unknown LLM: #{llm}"
+      end
+    end
+    private_class_method :build_judge_for
+
+    def self.extract_pairs(result)
+      return [] unless result.is_a?(Hash)
+      pairs = result["pairs"]
+      pairs.is_a?(Array) ? pairs : []
+    end
+    private_class_method :extract_pairs
+
+    def self.valid_pair?(pair)
+      pair.is_a?(Hash) &&
+        pair["question"].is_a?(String) && !pair["question"].strip.empty? &&
+        pair["answer"].is_a?(String) && !pair["answer"].strip.empty?
+    end
+    private_class_method :valid_pair?
+
     private_class_method def self.parse_array_field(value)
       return [] if value.nil? || value.empty?
 
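The malformed-response tolerance in `Dataset.generate` hinges on the `valid_pair?` predicate from the diff. Extracted standalone to show which LLM outputs survive filtering:

```ruby
# Mirrors Dataset.valid_pair? from the diff: a pair survives only if it is
# a Hash with non-blank String "question" and "answer" fields.
def valid_pair?(pair)
  pair.is_a?(Hash) &&
    pair["question"].is_a?(String) && !pair["question"].strip.empty? &&
    pair["answer"].is_a?(String) && !pair["answer"].strip.empty?
end

pairs = [
  {"question" => "What is Ruby?", "answer" => "A language"},
  {"question" => "", "answer" => "orphaned"}, # blank question: skipped
  "not even a hash",                          # wrong type: skipped
  {"question" => "No answer?"}                # missing answer: skipped
]
pairs.select { |p| valid_pair?(p) }.length # => 1
```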
data/lib/eval_ruby/embedders/base.rb ADDED
@@ -0,0 +1,29 @@
+# frozen_string_literal: true
+
+module EvalRuby
+  module Embedders
+    # Abstract base class for text embedders.
+    # Subclasses must implement {#call} to convert a batch of strings
+    # into a batch of float vectors, and {#model} to surface the model
+    # identifier used (shown in metric details).
+    class Base
+      # @param config [Configuration]
+      def initialize(config)
+        @config = config
+      end
+
+      # Embeds a batch of texts.
+      #
+      # @param texts [Array<String>] inputs to embed
+      # @return [Array<Array<Float>>] one vector per input, in the same order
+      def call(texts)
+        raise NotImplementedError, "#{self.class}#call must be implemented"
+      end
+
+      # @return [String] model identifier (e.g. "text-embedding-3-small")
+      def model
+        raise NotImplementedError, "#{self.class}#model must be implemented"
+      end
+    end
+  end
+end
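Since a metric only calls `#call` and `#model` on its embedder, any duck-typed object can stand in. A hypothetical offline embedder for tests (not part of the gem; the hashing scheme is a toy):

```ruby
# Deterministic toy embedder: a 4-dimensional bag-of-characters vector.
# Useful for exercising embedding-based metrics without network calls.
class HashingEmbedder
  def model
    "hashing-demo"
  end

  def call(texts)
    texts.map do |t|
      vec = [0.0, 0.0, 0.0, 0.0]
      t.each_char { |c| vec[c.ord % 4] += 1.0 } # bucket chars by codepoint
      vec
    end
  end
end

HashingEmbedder.new.call(["ab"]) # => [[0.0, 1.0, 1.0, 0.0]]
```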
data/lib/eval_ruby/embedders/openai.rb ADDED
@@ -0,0 +1,83 @@
+# frozen_string_literal: true
+
+require "net/http"
+require "json"
+require "uri"
+
+module EvalRuby
+  module Embedders
+    # OpenAI embeddings backend.
+    # Requires an API key via {Configuration#embedder_api_key} (or falls
+    # back to {Configuration#api_key}). The default model is
+    # +text-embedding-3-small+ (1536 dimensions).
+    class OpenAI < Base
+      API_URL = "https://api.openai.com/v1/embeddings"
+
+      # @param config [Configuration]
+      # @raise [EvalRuby::Error] if no API key is available
+      def initialize(config)
+        super
+        @api_key = @config.embedder_api_key || @config.api_key
+        if @api_key.nil? || @api_key.empty?
+          raise EvalRuby::Error, "API key is required for embedder. Set via EvalRuby.configure { |c| c.embedder_api_key = '...' } or c.api_key = '...'"
+        end
+      end
+
+      # @return [String] configured embedder model
+      def model
+        @config.embedder_model
+      end
+
+      # @param texts [Array<String>] inputs to embed
+      # @return [Array<Array<Float>>] vectors in input order
+      # @raise [EvalRuby::APIError] on non-success HTTP responses
+      # @raise [EvalRuby::TimeoutError] after max retries
+      def call(texts)
+        retries = 0
+        begin
+          uri = URI(API_URL)
+          request = Net::HTTP::Post.new(uri)
+          request["Authorization"] = "Bearer #{@api_key}"
+          request["Content-Type"] = "application/json"
+          request.body = JSON.generate({input: texts, model: model})
+
+          response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true,
+                                     read_timeout: @config.timeout) do |http|
+            http.request(request)
+          end
+
+          unless response.is_a?(Net::HTTPSuccess)
+            raise APIError, "OpenAI embeddings API error: #{response.code} - #{response.body}"
+          end
+
+          parse_vectors(response.body)
+        rescue Net::OpenTimeout, Net::ReadTimeout, Errno::ECONNRESET => e
+          retries += 1
+          if retries <= @config.max_retries
+            sleep(2**(retries - 1))
+            retry
+          end
+          raise EvalRuby::TimeoutError, "Embedder API failed after #{@config.max_retries} retries: #{e.message}"
+        end
+      end
+
+      private
+
+      def parse_vectors(body)
+        parsed = JSON.parse(body)
+        data = parsed["data"]
+        raise InvalidResponseError, "Unexpected embeddings response shape: missing 'data'" unless data.is_a?(Array)
+
+        # OpenAI returns data entries tagged with 'index' to preserve input order;
+        # sort defensively in case the API ever reorders.
+        data.sort_by { |entry| entry["index"].to_i }.map do |entry|
+          vector = entry["embedding"]
+          raise InvalidResponseError, "Embedding entry missing 'embedding' array" unless vector.is_a?(Array)
+          vector
+        end
+      rescue JSON::ParserError => e
+        raise InvalidResponseError, "Failed to parse embeddings response: #{e.message}"
+      end
+    end
+  end
+end
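The embedder's retry loop sleeps `2**(retries - 1)` seconds between attempts, so with the default `max_retries = 3` the backoff schedule works out to:

```ruby
# Exponential backoff delays produced by sleep(2**(retries - 1)):
# attempt 1 waits 1s, attempt 2 waits 2s, attempt 3 waits 4s.
max_retries = 3
delays = (1..max_retries).map { |attempt| 2**(attempt - 1) }
delays # => [1, 2, 4]
```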
data/lib/eval_ruby/evaluator.rb CHANGED
@@ -1,12 +1,25 @@
 # frozen_string_literal: true
 
 module EvalRuby
+  # Runs all configured metrics on a given question/answer/context tuple.
+  #
+  # @example
+  #   evaluator = EvalRuby::Evaluator.new
+  #   result = evaluator.evaluate(question: "...", answer: "...", context: [...])
   class Evaluator
+    # @param config [Configuration] configuration to use
     def initialize(config = EvalRuby.configuration)
       @config = config
       @judge = build_judge(config)
     end
 
+    # Evaluates an LLM response across quality metrics.
+    #
+    # @param question [String] the input question
+    # @param answer [String] the LLM-generated answer
+    # @param context [Array<String>] retrieved context chunks
+    # @param ground_truth [String, nil] expected correct answer
+    # @return [Result]
     def evaluate(question:, answer:, context: [], ground_truth: nil)
       scores = {}
       details = {}
@@ -37,6 +50,12 @@ module EvalRuby
       Result.new(scores: scores, details: details)
     end
 
+    # Evaluates retrieval quality using IR metrics.
+    #
+    # @param question [String] the input question
+    # @param retrieved [Array<String>] retrieved document IDs
+    # @param relevant [Array<String>] ground-truth relevant document IDs
+    # @return [RetrievalResult]
     def evaluate_retrieval(question:, retrieved:, relevant:)
       RetrievalResult.new(retrieved: retrieved, relevant: relevant)
     end
@@ -55,32 +74,49 @@ module EvalRuby
     end
   end
 
+  # Holds retrieval evaluation results with IR metric accessors.
+  #
+  # @example
+  #   result = EvalRuby.evaluate_retrieval(question: "...", retrieved: [...], relevant: [...])
+  #   result.precision_at_k(5) # => 0.6
+  #   result.mrr               # => 1.0
   class RetrievalResult
+    # @param retrieved [Array<String>] retrieved document IDs in ranked order
+    # @param relevant [Array<String>] ground-truth relevant document IDs
    def initialize(retrieved:, relevant:)
      @retrieved = retrieved
      @relevant = relevant
    end
 
+    # @param k [Integer] number of top results to consider
+    # @return [Float] precision at k
    def precision_at_k(k)
      Metrics::PrecisionAtK.new.call(retrieved: @retrieved, relevant: @relevant, k: k)
    end
 
+    # @param k [Integer] number of top results to consider
+    # @return [Float] recall at k
    def recall_at_k(k)
      Metrics::RecallAtK.new.call(retrieved: @retrieved, relevant: @relevant, k: k)
    end
 
+    # @return [Float] mean reciprocal rank
    def mrr
      Metrics::MRR.new.call(retrieved: @retrieved, relevant: @relevant)
    end
 
+    # @param k [Integer, nil] number of top results (nil for all)
+    # @return [Float] normalized discounted cumulative gain
    def ndcg(k: nil)
      Metrics::NDCG.new.call(retrieved: @retrieved, relevant: @relevant, k: k)
    end
 
+    # @return [Float] 1.0 if any relevant doc is retrieved, 0.0 otherwise
    def hit_rate
      @retrieved.any? { |doc| @relevant.include?(doc) } ? 1.0 : 0.0
    end
 
+    # @return [Hash{Symbol => Float}] all retrieval metrics
    def to_h
      {
        precision_at_k: precision_at_k(@retrieved.length),
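For intuition, the ranking metrics that `RetrievalResult` delegates to `Metrics::*` classes can be written in a few lines of plain Ruby (illustrative versions, not the gem's code):

```ruby
# precision@k: fraction of the top-k retrieved docs that are relevant.
def precision_at_k(retrieved, relevant, k)
  retrieved.first(k).count { |doc| relevant.include?(doc) } / k.to_f
end

# MRR: reciprocal rank of the first relevant doc (0.0 if none retrieved).
def mrr(retrieved, relevant)
  rank = retrieved.index { |doc| relevant.include?(doc) }
  rank.nil? ? 0.0 : 1.0 / (rank + 1)
end

retrieved = %w[d3 d1 d7]
relevant  = %w[d1 d2]
precision_at_k(retrieved, relevant, 3) # => ~0.333 (1 relevant in top 3)
mrr(retrieved, relevant)               # => 0.5 (first hit at rank 2)
```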
data/lib/eval_ruby/judges/anthropic.rb CHANGED
@@ -6,14 +6,22 @@ require "uri"
 
 module EvalRuby
   module Judges
+    # Anthropic-based LLM judge using the Messages API.
+    # Requires an API key set via {Configuration#api_key}.
     class Anthropic < Base
       API_URL = "https://api.anthropic.com/v1/messages"
 
+      # @param config [Configuration]
+      # @raise [EvalRuby::Error] if API key is missing
      def initialize(config)
        super
        raise EvalRuby::Error, "API key is required. Set via EvalRuby.configure { |c| c.api_key = '...' }" if @config.api_key.nil? || @config.api_key.empty?
      end
 
+      # @param prompt [String] the evaluation prompt
+      # @return [Hash, nil] parsed JSON response
+      # @raise [EvalRuby::Error] on API errors
+      # @raise [EvalRuby::TimeoutError] after max retries
      def call(prompt)
        retries = 0
        begin
|