eval-ruby 0.2.0 → 0.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +60 -0
- data/README.md +39 -0
- data/lib/eval_ruby/configuration.rb +12 -0
- data/lib/eval_ruby/dataset.rb +73 -14
- data/lib/eval_ruby/embedders/base.rb +29 -0
- data/lib/eval_ruby/embedders/openai.rb +83 -0
- data/lib/eval_ruby/metrics/semantic_similarity.rb +72 -0
- data/lib/eval_ruby/version.rb +1 -1
- data/lib/eval_ruby.rb +35 -3
- metadata +5 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 6ba2db8c803feba5276939550bf2bd3c9dca548f2eca3223ce1f2df39e83bfa8
+  data.tar.gz: 9a3a0a5dbb7eda4d64669e11a5b76c20f43fd0cdb3946934f190475cc460ccd3
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: a6b52e6caf04da090f8595808f76e8edaa901b7eec90178054de0f2c4d8d27c1e2b46f8c154961b91c5ef3a849e93cbbc411df77e4ec2919ea48dd7ead869a1a
+  data.tar.gz: 06eb01c2e177dc263dcdf69ea38debca5989fc6d55e3be31e50c6b7e5bc8539d06df90b815b35680f8bc292d8bc44cebf38496dde5a65cea9cd8f4b3cedee18f
data/CHANGELOG.md
ADDED
@@ -0,0 +1,60 @@
+# Changelog
+
+All notable changes to this project are documented in this file.
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [0.3.0] - 2026-04-13
+
+Reshape release — the original v0.3.0 scope (cost tracking, async/parallel batch, HTML reports) was redistributed to a new **v0.4.0 "Batch, Reporting & Cost"** milestone, and the previous v0.4.0 "Testing Framework Integration" work slid to **v0.5.0**. This release instead prioritizes the API surface needed by the omnibot Wicara eval integration, plus the two quickly actionable items from the original v0.3.0 plan.
+
+### Added
+
+- `EvalRuby::Embedders::Base` — abstract base class for pluggable embedding backends, mirroring the existing `Judges::Base` pattern.
+- `EvalRuby::Embedders::OpenAI` — OpenAI embeddings backend (`/v1/embeddings`) with retry/timeout handling and out-of-order response reassembly.
+- `EvalRuby::Metrics::SemanticSimilarity` — Ragas-style answer-similarity metric. Computes cosine similarity between answer and ground-truth embeddings via an injected embedder. Judge-free, fast, deterministic; ideal for chatbot regression testing.
+- `Configuration#embedder_llm`, `#embedder_model`, `#embedder_api_key` — new keys controlling the embedder. `embedder_api_key` falls back to `api_key` when unset, so most users only configure one OpenAI key.
+- `EvalRuby.evaluate_batch(dataset) { |progress| ... }` — block form that yields an `EvalRuby::Progress` struct (`current`, `total`, `elapsed`, `percent`) after each sample. Backwards compatible — batch calls without a block behave exactly as before.
+
+### Changed
+
+- `Dataset.generate` hardened:
+  - validates `questions_per_doc` is a positive integer
+  - validates document paths exist (raises a clear error instead of a `File.read` crash)
+  - expands directory paths via `Dir.glob("**/*")` to support "scan this folder" workflows
+  - accepts a single path string, not just an array
+  - tolerates malformed LLM responses (missing `pairs`, non-array `pairs`, non-hash entries, missing `question`/`answer`) — skips the bad pair rather than crashing the whole generation
+  - tolerates a judge raising mid-batch — logs the failure as a skip and continues with the remaining documents
+  - accepts an injected `judge:` parameter for testing (and for custom judge plumbing)
+
+### Notes
+
+- `SemanticSimilarity` is **opt-in** — not part of the default `Evaluator` roster. Instantiate it directly when you want reference-based scoring without an LLM judge.
+- Deferred to v0.4.0: cost tracking per evaluation (#10), async/parallel batch evaluation (#11), HTML report generation (#12).
+- Deferred to v0.5.0: CI/test-framework integration (JUnit XML, regression detection, GitHub Actions workflow, expanded matchers/assertions — formerly v0.4.0).
+
+## [0.2.0] - 2026-03-17
+
+### Added
+
+- Comprehensive test suite covering all metrics, judges, datasets, reports, and error paths.
+- YARD documentation across all public APIs.
+- RSpec matchers and Minitest assertions for integration in user test suites.
+- A/B comparison reports with statistical significance testing.
+
+## [0.1.1] - 2026-03-10
+
+### Fixed
+
+- Transient API failures now retry with exponential backoff (previously a single timeout raised immediately).
+- `Judges::OpenAI#initialize` rejects missing/empty API keys up front with a clear error message.
+- String-context metrics now handle strings passed in place of arrays without crashing.
+- Standard-deviation computation in `Report#summary` no longer divides by zero for single-sample reports.
+
+### Added
+
+- Error subclasses (`APIError`, `TimeoutError`, `InvalidResponseError`) so callers can rescue at the right granularity.
+
+## [0.1.0] - 2026-03-09
+
+- Initial release.
+- LLM-as-judge metrics: faithfulness, relevance, correctness, context precision, context recall.
+- Retrieval metrics: precision@k, recall@k, MRR, NDCG, hit rate.
+- OpenAI and Anthropic judge backends.
+- `Dataset` with CSV/JSON import and export.
+- `Report` with per-metric summary, worst-cases, failure filtering, and CSV export.
+- `Configuration` DSL for judge model, API key, threshold, timeout, retries.
data/README.md
CHANGED
@@ -50,6 +50,45 @@ result.overall # => 0.94
 - **NDCG** (Normalized Discounted Cumulative Gain)
 - **Hit Rate**
 
+### Embedding-Based
+
+- **Semantic Similarity** — cosine similarity between answer and ground truth via a pluggable embedder. Judge-free, fast, deterministic; useful for chatbot regression testing where you want a reference-based score without an LLM call.
+
+## Semantic Similarity
+
+`SemanticSimilarity` is opt-in (not part of the default `Evaluator` roster in v0.3.0). Instantiate it directly when you need reference-based scoring without an LLM judge — for example, scoring a chatbot's actual response against a fixed expected response.
+
+```ruby
+EvalRuby.configure do |config|
+  config.api_key = ENV["OPENAI_API_KEY"]           # shared with judge by default
+  config.embedder_model = "text-embedding-3-small" # default; also supports text-embedding-3-large
+  # config.embedder_api_key = ENV["OTHER_KEY"]     # optional; falls back to api_key
+end
+
+embedder = EvalRuby::Embedders::OpenAI.new(EvalRuby.configuration)
+metric = EvalRuby::Metrics::SemanticSimilarity.new(embedder: embedder)
+
+result = metric.call(
+  answer: "Paris is the capital of France",
+  ground_truth: "The capital of France is Paris"
+)
+
+result[:score]            # => 0.92
+result[:details][:cosine] # => 0.92 (raw, pre-clamp)
+result[:details][:model]  # => "text-embedding-3-small"
+```
+
+**When to use `SemanticSimilarity` vs `Correctness`:**
+
+| | `Correctness` | `SemanticSimilarity` |
+|---|---|---|
+| Backend | LLM judge (GPT-4, Claude, …) | Embeddings + cosine |
+| Cost per call | $$ (judge LLM tokens) | $ (embedding tokens) |
+| Latency | High (LLM generation) | Low (embedding lookup) |
+| Determinism | Low (model-dependent) | High |
+| Reasoning | Natural-language rationale in details | Raw cosine value |
+| Best for | Nuanced/subjective answers | Regression tests, bulk scoring |
+
 ## Retrieval Evaluation
 
 ```ruby
data/lib/eval_ruby/configuration.rb
CHANGED
@@ -19,6 +19,15 @@ module EvalRuby
     # @return [String, nil] API key for the judge LLM provider
     attr_accessor :api_key
 
+    # @return [Symbol] embedding provider (:openai in v0.3.0)
+    attr_accessor :embedder_llm
+
+    # @return [String] model name for the embedder
+    attr_accessor :embedder_model
+
+    # @return [String, nil] API key for the embedder; falls back to {#api_key} when nil
+    attr_accessor :embedder_api_key
+
     # @return [Float] default threshold for pass/fail decisions
     attr_accessor :default_threshold
 
@@ -32,6 +41,9 @@ module EvalRuby
       @judge_llm = :openai
       @judge_model = "gpt-4o"
       @api_key = nil
+      @embedder_llm = :openai
+      @embedder_model = "text-embedding-3-small"
+      @embedder_api_key = nil
       @default_threshold = 0.7
       @timeout = 30
       @max_retries = 3
data/lib/eval_ruby/dataset.rb
CHANGED
@@ -124,21 +124,30 @@ module EvalRuby
 
     # Generates a dataset from documents using an LLM.
     #
-    #
-    #
+    # Each document is read, passed to the LLM with a prompt asking for
+    # +questions_per_doc+ QA pairs, and the resulting pairs are appended to
+    # the dataset. Directory paths are expanded via +Dir.glob+. Missing paths
+    # and malformed LLM responses raise or are skipped gracefully rather than
+    # crashing the whole generation.
+    #
+    # @param documents [String, Array<String>] file paths or directory paths
+    # @param questions_per_doc [Integer] number of QA pairs per document (must be > 0)
     # @param llm [Symbol] LLM provider (:openai or :anthropic)
+    # @param judge [Judges::Base, nil] inject a pre-built judge (mainly for testing)
     # @return [Dataset]
-
-
-
-
-      when :openai then Judges::OpenAI.new(config)
-      when :anthropic then Judges::Anthropic.new(config)
-      else raise Error, "Unknown LLM: #{llm}"
+    # @raise [EvalRuby::Error] on bad arguments or no documents found
+    def self.generate(documents:, questions_per_doc: 5, llm: :openai, judge: nil)
+      unless questions_per_doc.is_a?(Integer) && questions_per_doc.positive?
+        raise Error, "questions_per_doc must be a positive integer, got #{questions_per_doc.inspect}"
       end
 
+      document_paths = expand_document_paths(documents)
+      raise Error, "No documents found in the provided paths" if document_paths.empty?
+
+      judge ||= build_judge_for(llm)
+
       dataset = new("generated")
-
+      document_paths.each do |doc_path|
         content = File.read(doc_path)
         prompt = <<~PROMPT
           Given the following document, generate #{questions_per_doc} question-answer pairs
@@ -150,14 +159,19 @@ module EvalRuby
           Respond in JSON: {"pairs": [{"question": "...", "answer": "...", "context": "relevant excerpt"}]}
         PROMPT
 
-
-
+        begin
+          result = judge.call(prompt)
+        rescue StandardError
+          next # keep generating from remaining docs; individual failure should not abort the batch
+        end
+
+        extract_pairs(result).each do |pair|
+          next unless valid_pair?(pair)
 
-        result["pairs"].each do |pair|
           dataset.add(
             question: pair["question"],
             answer: pair["answer"],
-            context: [pair["context"]
+            context: [pair["context"].is_a?(String) && !pair["context"].empty? ? pair["context"] : content],
            ground_truth: pair["answer"]
          )
        end
@@ -165,6 +179,51 @@ module EvalRuby
       dataset
     end
 
+    # Expands a list of file/directory paths into a flat list of file paths.
+    # Validates existence — missing paths raise an Error.
+    #
+    # @param paths [String, Array<String>]
+    # @return [Array<String>] absolute-or-relative file paths, each verified to exist
+    # @raise [EvalRuby::Error] if any path does not exist
+    def self.expand_document_paths(paths)
+      result = []
+      Array(paths).each do |path|
+        if File.directory?(path)
+          result.concat(Dir.glob(File.join(path, "**/*")).select { |p| File.file?(p) }.sort)
+        elsif File.file?(path)
+          result << path
+        else
+          raise Error, "Document path does not exist: #{path}"
+        end
+      end
+      result
+    end
+
+    def self.build_judge_for(llm)
+      config = EvalRuby.configuration.dup
+      config.judge_llm = llm
+      case llm
+      when :openai then Judges::OpenAI.new(config)
+      when :anthropic then Judges::Anthropic.new(config)
+      else raise Error, "Unknown LLM: #{llm}"
+      end
+    end
+    private_class_method :build_judge_for
+
+    def self.extract_pairs(result)
+      return [] unless result.is_a?(Hash)
+      pairs = result["pairs"]
+      pairs.is_a?(Array) ? pairs : []
+    end
+    private_class_method :extract_pairs
+
+    def self.valid_pair?(pair)
+      pair.is_a?(Hash) &&
+        pair["question"].is_a?(String) && !pair["question"].strip.empty? &&
+        pair["answer"].is_a?(String) && !pair["answer"].strip.empty?
+    end
+    private_class_method :valid_pair?
+
     private_class_method def self.parse_array_field(value)
       return [] if value.nil? || value.empty?
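For readers skimming the diff, the tolerance rules in `extract_pairs`/`valid_pair?` can be exercised in isolation. The sketch below restates the two helpers outside the gem (plain methods, no EvalRuby dependency) and feeds them a mix of malformed and valid judge responses, matching the "skip the bad pair rather than crashing" behavior described in the changelog:

```ruby
# Standalone restatement of the tolerant pair-extraction logic (not gem code).

def extract_pairs(result)
  return [] unless result.is_a?(Hash)   # judge returned nil / non-hash
  pairs = result["pairs"]
  pairs.is_a?(Array) ? pairs : []       # "pairs" missing or not an array
end

def valid_pair?(pair)
  pair.is_a?(Hash) &&
    pair["question"].is_a?(String) && !pair["question"].strip.empty? &&
    pair["answer"].is_a?(String) && !pair["answer"].strip.empty?
end

responses = [
  nil,                                  # judge returned nothing usable
  {"pairs" => "oops"},                  # non-array "pairs"
  {"pairs" => [
    {"question" => "Q?", "answer" => "A"},  # valid
    {"question" => ""},                     # blank question, no answer
    "not-a-hash"                            # non-hash entry
  ]}
]

kept = responses.flat_map { |r| extract_pairs(r) }.select { |p| valid_pair?(p) }
puts kept.length # => 1
```

Only the one well-formed pair survives; every malformed shape is silently dropped instead of raising.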
data/lib/eval_ruby/embedders/base.rb
ADDED
@@ -0,0 +1,29 @@
+# frozen_string_literal: true
+
+module EvalRuby
+  module Embedders
+    # Abstract base class for text embedders.
+    # Subclasses must implement {#call} to convert a batch of strings
+    # into a batch of float vectors, and {#model} to surface the model
+    # identifier used (shown in metric details).
+    class Base
+      # @param config [Configuration]
+      def initialize(config)
+        @config = config
+      end
+
+      # Embeds a batch of texts.
+      #
+      # @param texts [Array<String>] inputs to embed
+      # @return [Array<Array<Float>>] one vector per input, in the same order
+      def call(texts)
+        raise NotImplementedError, "#{self.class}#call must be implemented"
+      end
+
+      # @return [String] model identifier (e.g. "text-embedding-3-small")
+      def model
+        raise NotImplementedError, "#{self.class}#model must be implemented"
+      end
+    end
+  end
+end
data/lib/eval_ruby/embedders/openai.rb
ADDED
@@ -0,0 +1,83 @@
+# frozen_string_literal: true
+
+require "net/http"
+require "json"
+require "uri"
+
+module EvalRuby
+  module Embedders
+    # OpenAI embeddings backend.
+    # Requires an API key via {Configuration#embedder_api_key} (or falls
+    # back to {Configuration#api_key}). The default model is
+    # +text-embedding-3-small+ (1536 dimensions).
+    class OpenAI < Base
+      API_URL = "https://api.openai.com/v1/embeddings"
+
+      # @param config [Configuration]
+      # @raise [EvalRuby::Error] if no API key is available
+      def initialize(config)
+        super
+        @api_key = @config.embedder_api_key || @config.api_key
+        if @api_key.nil? || @api_key.empty?
+          raise EvalRuby::Error, "API key is required for embedder. Set via EvalRuby.configure { |c| c.embedder_api_key = '...' } or c.api_key = '...'"
+        end
+      end
+
+      # @return [String] configured embedder model
+      def model
+        @config.embedder_model
+      end
+
+      # @param texts [Array<String>] inputs to embed
+      # @return [Array<Array<Float>>] vectors in input order
+      # @raise [EvalRuby::APIError] on non-success HTTP responses
+      # @raise [EvalRuby::TimeoutError] after max retries
+      def call(texts)
+        retries = 0
+        begin
+          uri = URI(API_URL)
+          request = Net::HTTP::Post.new(uri)
+          request["Authorization"] = "Bearer #{@api_key}"
+          request["Content-Type"] = "application/json"
+          request.body = JSON.generate({input: texts, model: model})
+
+          response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true,
+                                     read_timeout: @config.timeout) do |http|
+            http.request(request)
+          end
+
+          unless response.is_a?(Net::HTTPSuccess)
+            raise APIError, "OpenAI embeddings API error: #{response.code} - #{response.body}"
+          end
+
+          parse_vectors(response.body)
+        rescue Net::OpenTimeout, Net::ReadTimeout, Errno::ECONNRESET => e
+          retries += 1
+          if retries <= @config.max_retries
+            sleep(2**(retries - 1))
+            retry
+          end
+          raise EvalRuby::TimeoutError, "Embedder API failed after #{@config.max_retries} retries: #{e.message}"
+        end
+      end
+
+      private
+
+      def parse_vectors(body)
+        parsed = JSON.parse(body)
+        data = parsed["data"]
+        raise InvalidResponseError, "Unexpected embeddings response shape: missing 'data'" unless data.is_a?(Array)
+
+        # OpenAI returns data entries tagged with 'index' to preserve input order;
+        # sort defensively in case the API ever reorders.
+        data.sort_by { |entry| entry["index"].to_i }.map do |entry|
+          vector = entry["embedding"]
+          raise InvalidResponseError, "Embedding entry missing 'embedding' array" unless vector.is_a?(Array)
+          vector
+        end
+      rescue JSON::ParserError => e
+        raise InvalidResponseError, "Failed to parse embeddings response: #{e.message}"
+      end
+    end
+  end
+end
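The retry branch above sleeps `2**(retries - 1)` seconds between attempts. A quick standalone sketch of the resulting backoff schedule for the default `max_retries = 3` (schedule arithmetic only, no HTTP or sleeping):

```ruby
# Backoff delays produced by sleep(2**(retries - 1)) across the retry attempts.
max_retries = 3
delays = (1..max_retries).map { |attempt| 2**(attempt - 1) }
p delays # => [1, 2, 4]
```

So a fully failing call waits 1 s, 2 s, then 4 s before `TimeoutError` is raised on the fourth failure.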
data/lib/eval_ruby/metrics/semantic_similarity.rb
ADDED
@@ -0,0 +1,72 @@
+# frozen_string_literal: true
+
+module EvalRuby
+  module Metrics
+    # Cosine similarity between an answer and its ground truth via an
+    # injected embedder. A judge-free alternative to {Correctness} when
+    # you want fast, deterministic, reference-based scoring — ideal for
+    # chatbot regression testing.
+    #
+    # @example
+    #   embedder = EvalRuby::Embedders::OpenAI.new(EvalRuby.configuration)
+    #   metric = EvalRuby::Metrics::SemanticSimilarity.new(embedder: embedder)
+    #   metric.call(answer: "Paris is in France", ground_truth: "Paris, France")
+    #   # => { score: 0.91, details: { cosine: 0.91, model: "text-embedding-3-small" } }
+    class SemanticSimilarity < Base
+      # @return [EvalRuby::Embedders::Base, nil] the embedder instance
+      attr_reader :embedder
+
+      # @param embedder [EvalRuby::Embedders::Base] required for this metric
+      # @param judge [EvalRuby::Judges::Base, nil] unused by this metric; accepted
+      #   only for interface compatibility with {Metrics::Base}
+      def initialize(embedder: nil, judge: nil)
+        super(judge: judge)
+        @embedder = embedder
+      end
+
+      # @param answer [String] candidate text (typically the model's answer)
+      # @param ground_truth [String] reference text
+      # @return [Hash] +:score+ (Float 0.0–1.0) and +:details+ (Hash)
+      # @raise [EvalRuby::Error] if no embedder is configured
+      def call(answer:, ground_truth:, **_kwargs)
+        raise EvalRuby::Error, "SemanticSimilarity requires an embedder. Pass `embedder:` in the constructor." unless @embedder
+
+        if answer.to_s.strip.empty? || ground_truth.to_s.strip.empty?
+          return {score: 0.0, details: {reason: :empty_input}}
+        end
+
+        vectors = @embedder.call([answer.to_s, ground_truth.to_s])
+        unless vectors.is_a?(Array) && vectors.length == 2
+          raise EvalRuby::Error, "Embedder returned #{vectors.is_a?(Array) ? vectors.length : vectors.class} vectors; expected 2"
+        end
+
+        cosine = cosine_similarity(vectors[0], vectors[1])
+
+        {
+          score: cosine.clamp(0.0, 1.0),
+          details: {cosine: cosine, model: @embedder.model}
+        }
+      end
+
+      private
+
+      def cosine_similarity(a, b)
+        raise EvalRuby::Error, "Embedding vector dimension mismatch: #{a.length} vs #{b.length}" unless a.length == b.length
+
+        dot = 0.0
+        norm_a = 0.0
+        norm_b = 0.0
+        a.each_with_index do |x, i|
+          y = b[i]
+          dot += x * y
+          norm_a += x * x
+          norm_b += y * y
+        end
+
+        return 0.0 if norm_a.zero? || norm_b.zero?
+
+        dot / (Math.sqrt(norm_a) * Math.sqrt(norm_b))
+      end
+    end
+  end
+end
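The private `cosine_similarity` helper is ordinary vector math and can be sanity-checked outside the gem. This standalone sketch (an equivalent reimplementation, not the gem's code) confirms the two boundary cases: identical vectors score 1.0, orthogonal vectors score 0.0, and the zero-vector guard returns 0.0 instead of dividing by zero:

```ruby
# Equivalent cosine-similarity formula, restated compactly for verification.
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |y| y * y })
  return 0.0 if norm_a.zero? || norm_b.zero?  # same guard as the metric
  dot / (norm_a * norm_b)
end

identical  = cosine([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
degenerate = cosine([0.0, 0.0], [1.0, 1.0])

puts identical.round(2) # => 1.0
puts orthogonal         # => 0.0
puts degenerate         # => 0.0
```

Since `score` is the cosine clamped to 0.0–1.0, antonym-like pairs with a slightly negative cosine floor out at 0.0 while `details[:cosine]` keeps the raw value.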
data/lib/eval_ruby/version.rb
CHANGED
data/lib/eval_ruby.rb
CHANGED
@@ -5,12 +5,15 @@ require_relative "eval_ruby/configuration"
 require_relative "eval_ruby/judges/base"
 require_relative "eval_ruby/judges/openai"
 require_relative "eval_ruby/judges/anthropic"
+require_relative "eval_ruby/embedders/base"
+require_relative "eval_ruby/embedders/openai"
 require_relative "eval_ruby/metrics/base"
 require_relative "eval_ruby/metrics/faithfulness"
 require_relative "eval_ruby/metrics/relevance"
 require_relative "eval_ruby/metrics/correctness"
 require_relative "eval_ruby/metrics/context_precision"
 require_relative "eval_ruby/metrics/context_recall"
+require_relative "eval_ruby/metrics/semantic_similarity"
 require_relative "eval_ruby/metrics/precision_at_k"
 require_relative "eval_ruby/metrics/recall_at_k"
 require_relative "eval_ruby/metrics/mrr"
@@ -48,6 +51,17 @@ module EvalRuby
   class TimeoutError < Error; end
   class InvalidResponseError < Error; end
 
+  # Progress snapshot yielded to the block passed to {.evaluate_batch}.
+  # @!attribute current [Integer] number of samples completed (1-indexed)
+  # @!attribute total [Integer] total samples in the batch
+  # @!attribute elapsed [Float] seconds since batch started
+  Progress = Struct.new(:current, :total, :elapsed, keyword_init: true) do
+    # @return [Float] completion percentage, 0.0–100.0
+    def percent
+      total.zero? ? 0.0 : (current.to_f / total * 100).round(2)
+    end
+  end
+
   class << self
     # @return [Configuration] the current configuration
     def configuration
@@ -101,16 +115,26 @@ module EvalRuby
 
     # Evaluates a batch of samples, optionally running them through a pipeline.
     #
+    # If a block is given, it is called after each sample with a {Progress}
+    # snapshot, useful for rendering progress bars or writing incremental logs.
+    #
     # @param dataset [Dataset, Array<Hash>] samples to evaluate
     # @param pipeline [#query, nil] optional RAG pipeline to run queries through
+    # @yieldparam progress [Progress] progress snapshot after each sample
     # @return [Report]
-
+    #
+    # @example With progress callback
+    #   EvalRuby.evaluate_batch(dataset) do |progress|
+    #     puts "#{progress.current}/#{progress.total} (#{progress.percent}%)"
+    #   end
+    def evaluate_batch(dataset, pipeline: nil, &progress_block)
       samples = dataset.is_a?(Dataset) ? dataset.samples : dataset
       evaluator = Evaluator.new
       start_time = Time.now
+      total = samples.size
 
-      results = samples.map do |sample|
-        if pipeline
+      results = samples.each_with_index.map do |sample, i|
+        result = if pipeline
           response = pipeline.query(sample[:question])
           evaluator.evaluate(
             question: sample[:question],
@@ -121,6 +145,14 @@ module EvalRuby
         else
           evaluator.evaluate(**sample.slice(:question, :answer, :context, :ground_truth))
         end
+
+        progress_block&.call(Progress.new(
+          current: i + 1,
+          total: total,
+          elapsed: Time.now - start_time
+        ))
+
+        result
       end
 
       Report.new(results: results, samples: samples, duration: Time.now - start_time)
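The `percent` arithmetic on `Progress` is easy to verify in isolation. This sketch restates the struct exactly as added in the diff and checks both the normal case and the zero-total guard:

```ruby
# Progress struct as introduced in this release; percent rounds to 2 places
# and returns 0.0 for an empty batch instead of dividing by zero.
Progress = Struct.new(:current, :total, :elapsed, keyword_init: true) do
  def percent
    total.zero? ? 0.0 : (current.to_f / total * 100).round(2)
  end
end

p Progress.new(current: 3, total: 8, elapsed: 1.5).percent # => 37.5
p Progress.new(current: 1, total: 3, elapsed: 0.4).percent # => 33.33
p Progress.new(current: 0, total: 0, elapsed: 0.0).percent # => 0.0
```

A progress block passed to `evaluate_batch` receives one of these snapshots after each sample, so a simple progress bar is just a `print` inside the block.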
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: eval-ruby
 version: !ruby/object:Gem::Version
-  version: 0.2.0
+  version: 0.3.0
 platform: ruby
 authors:
 - Johannes Dwi Cahyo
@@ -72,6 +72,7 @@ executables: []
 extensions: []
 extra_rdoc_files: []
 files:
+- CHANGELOG.md
 - Gemfile
 - Gemfile.lock
 - LICENSE
@@ -83,6 +84,8 @@ files:
 - lib/eval_ruby/comparison.rb
 - lib/eval_ruby/configuration.rb
 - lib/eval_ruby/dataset.rb
+- lib/eval_ruby/embedders/base.rb
+- lib/eval_ruby/embedders/openai.rb
 - lib/eval_ruby/evaluator.rb
 - lib/eval_ruby/judges/anthropic.rb
 - lib/eval_ruby/judges/base.rb
@@ -97,6 +100,7 @@ files:
 - lib/eval_ruby/metrics/precision_at_k.rb
 - lib/eval_ruby/metrics/recall_at_k.rb
 - lib/eval_ruby/metrics/relevance.rb
+- lib/eval_ruby/metrics/semantic_similarity.rb
 - lib/eval_ruby/minitest.rb
 - lib/eval_ruby/report.rb
 - lib/eval_ruby/result.rb