eval-ruby 0.2.0 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 69f4642cd2505ab6b0b54f36cd76fea2043f76306c425fa69e29027e4d0dc901
- data.tar.gz: 70125b286a374c01af966a0a044edb9b0cbca02fa9acd6b944f2b0825ad60981
+ metadata.gz: 6ba2db8c803feba5276939550bf2bd3c9dca548f2eca3223ce1f2df39e83bfa8
+ data.tar.gz: 9a3a0a5dbb7eda4d64669e11a5b76c20f43fd0cdb3946934f190475cc460ccd3
  SHA512:
- metadata.gz: 7ac13b5d60996d964a948b92bec466d3594c614e5a0bc3e138bfba31dda0dcca19876d01e2826ed1522962e7117b2eb69c7e3ea46567e4a09920f77f2031d09c
- data.tar.gz: ca13c6f768516a2f21188c59511f045bd12c9fd4fe51819af705016c1282256b6765805a007265dbb3e299430c19b8fe80e1a29c22038b930b5799ab7d3a8355
+ metadata.gz: a6b52e6caf04da090f8595808f76e8edaa901b7eec90178054de0f2c4d8d27c1e2b46f8c154961b91c5ef3a849e93cbbc411df77e4ec2919ea48dd7ead869a1a
+ data.tar.gz: 06eb01c2e177dc263dcdf69ea38debca5989fc6d55e3be31e50c6b7e5bc8539d06df90b815b35680f8bc292d8bc44cebf38496dde5a65cea9cd8f4b3cedee18f
data/CHANGELOG.md ADDED
@@ -0,0 +1,60 @@
+ # Changelog
+
+ All notable changes to this project are documented in this file.
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+ ## [0.3.0] - 2026-04-13
+
+ Reshape release — the original v0.3.0 scope (cost tracking, async/parallel batch, HTML reports) was redistributed to a new **v0.4.0 "Batch, Reporting & Cost"** milestone, and the previous v0.4.0 "Testing Framework Integration" work slid to **v0.5.0**. This release instead prioritizes the API surface needed by the omnibot Wicara eval integration, plus the two quickly actionable items from the original v0.3.0 plan.
+
+ ### Added
+ - `EvalRuby::Embedders::Base` — abstract base class for pluggable embedding backends, mirroring the existing `Judges::Base` pattern.
+ - `EvalRuby::Embedders::OpenAI` — OpenAI embeddings backend (`/v1/embeddings`) with retry/timeout handling and out-of-order response reassembly.
+ - `EvalRuby::Metrics::SemanticSimilarity` — Ragas-style answer-similarity metric. Computes cosine similarity between answer and ground-truth embeddings via an injected embedder. Judge-free, fast, deterministic; ideal for chatbot regression testing.
+ - `Configuration#embedder_llm`, `#embedder_model`, `#embedder_api_key` — new keys controlling the embedder. `embedder_api_key` falls back to `api_key` when unset, so most users only configure one OpenAI key.
+ - `EvalRuby.evaluate_batch(dataset) { |progress| ... }` — block form that yields an `EvalRuby::Progress` struct (`current`, `total`, `elapsed`, `percent`) after each sample. Backwards compatible — batch calls without a block behave exactly as before.
+
+ ### Changed
+ - `Dataset.generate` hardened:
+   - validates `questions_per_doc` is a positive integer
+   - validates document paths exist (raises a clear error instead of a `File.read` crash)
+   - expands directory paths via `Dir.glob("**/*")` to support "scan this folder" workflows
+   - accepts a single path string, not just an array
+   - tolerates malformed LLM responses (missing `pairs`, non-array `pairs`, non-hash entries, missing `question`/`answer`) — skips the bad pair rather than crashing the whole generation
+   - tolerates a judge raising mid-batch — treats the failure as a skipped document and continues with the remaining documents
+   - accepts an injected `judge:` parameter for testing (and for custom judge plumbing)
+
+ ### Notes
+ - `SemanticSimilarity` is **opt-in** — not part of the default `Evaluator` roster. Instantiate it directly when you want reference-based scoring without an LLM judge.
+ - Deferred to v0.4.0: cost tracking per evaluation (#10), async/parallel batch evaluation (#11), HTML report generation (#12).
+ - Deferred to v0.5.0: CI/test-framework integration (JUnit XML, regression detection, GitHub Actions workflow, expanded matchers/assertions — formerly v0.4.0).
+
+ ## [0.2.0] - 2026-03-17
+
+ ### Added
+ - Comprehensive test suite covering all metrics, judges, datasets, reports, and error paths.
+ - YARD documentation across all public APIs.
+ - RSpec matchers and Minitest assertions for integration in user test suites.
+ - A/B comparison reports with statistical significance testing.
+
+ ## [0.1.1] - 2026-03-10
+
+ ### Fixed
+ - Transient API failures now retry with exponential backoff (previously a single timeout raised immediately).
+ - `Judges::OpenAI#initialize` rejects missing/empty API keys up front with a clear error message.
+ - String-context metrics now handle strings passed in place of arrays without crashing.
+ - Standard-deviation computation in `Report#summary` no longer divides by zero for single-sample reports.
+
+ ### Added
+ - Error subclasses (`APIError`, `TimeoutError`, `InvalidResponseError`) so callers can rescue at the right granularity.
+
+ ## [0.1.0] - 2026-03-09
+
+ - Initial release.
+ - LLM-as-judge metrics: faithfulness, relevance, correctness, context precision, context recall.
+ - Retrieval metrics: precision@k, recall@k, MRR, NDCG, hit rate.
+ - OpenAI and Anthropic judge backends.
+ - `Dataset` with CSV/JSON import and export.
+ - `Report` with per-metric summary, worst-cases, failure filtering, and CSV export.
+ - `Configuration` DSL for judge model, API key, threshold, timeout, retries.
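The `EvalRuby::Progress` struct named in the 0.3.0 notes can be sketched without installing the gem. Below is a minimal stand-in that re-creates the shape described above (`current`, `total`, `elapsed`, plus a `percent` helper guarded against empty batches) — a sketch of the documented interface, not the gem's own source:

```ruby
# Minimal stand-in for the Progress snapshot described in the changelog.
# keyword_init lets callers build it with named fields.
Progress = Struct.new(:current, :total, :elapsed, keyword_init: true) do
  # Completion percentage, 0.0-100.0; empty batches report 0.0 instead of dividing by zero.
  def percent
    total.zero? ? 0.0 : (current.to_f / total * 100).round(2)
  end
end

progress = Progress.new(current: 3, total: 8, elapsed: 1.25)
puts "#{progress.current}/#{progress.total} (#{progress.percent}%)" # => "3/8 (37.5%)"
```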
data/README.md CHANGED
@@ -50,6 +50,45 @@ result.overall # => 0.94
  - **NDCG** (Normalized Discounted Cumulative Gain)
  - **Hit Rate**
 
+ ### Embedding-Based
+
+ - **Semantic Similarity** — cosine similarity between answer and ground truth via a pluggable embedder. Judge-free, fast, deterministic; useful for chatbot regression testing where you want a reference-based score without an LLM call.
+
+ ## Semantic Similarity
+
+ `SemanticSimilarity` is opt-in (not part of the default `Evaluator` roster in v0.3.0). Instantiate it directly when you need reference-based scoring without an LLM judge — for example, scoring a chatbot's actual response against a fixed expected response.
+
+ ```ruby
+ EvalRuby.configure do |config|
+   config.api_key = ENV["OPENAI_API_KEY"]           # shared with judge by default
+   config.embedder_model = "text-embedding-3-small" # default; also supports text-embedding-3-large
+   # config.embedder_api_key = ENV["OTHER_KEY"]     # optional; falls back to api_key
+ end
+
+ embedder = EvalRuby::Embedders::OpenAI.new(EvalRuby.configuration)
+ metric = EvalRuby::Metrics::SemanticSimilarity.new(embedder: embedder)
+
+ result = metric.call(
+   answer: "Paris is the capital of France",
+   ground_truth: "The capital of France is Paris"
+ )
+
+ result[:score]            # => 0.92
+ result[:details][:cosine] # => 0.92 (raw, pre-clamp)
+ result[:details][:model]  # => "text-embedding-3-small"
+ ```
+
+ **When to use `SemanticSimilarity` vs `Correctness`:**
+
+ | | `Correctness` | `SemanticSimilarity` |
+ |---|---|---|
+ | Backend | LLM judge (GPT-4, Claude, …) | Embeddings + cosine |
+ | Cost per call | $$ (judge LLM tokens) | $ (embedding tokens) |
+ | Latency | High (LLM generation) | Low (embedding lookup) |
+ | Determinism | Low (model-dependent) | High |
+ | Reasoning | Natural-language rationale in details | Raw cosine value |
+ | Best for | Nuanced/subjective answers | Regression tests, bulk scoring |
+
  ## Retrieval Evaluation
 
  ```ruby
@@ -19,6 +19,15 @@ module EvalRuby
  # @return [String, nil] API key for the judge LLM provider
  attr_accessor :api_key
 
+ # @return [Symbol] embedding provider (:openai in v0.3.0)
+ attr_accessor :embedder_llm
+
+ # @return [String] model name for the embedder
+ attr_accessor :embedder_model
+
+ # @return [String, nil] API key for the embedder; falls back to {#api_key} when nil
+ attr_accessor :embedder_api_key
+
  # @return [Float] default threshold for pass/fail decisions
  attr_accessor :default_threshold
 
@@ -32,6 +41,9 @@ module EvalRuby
  @judge_llm = :openai
  @judge_model = "gpt-4o"
  @api_key = nil
+ @embedder_llm = :openai
+ @embedder_model = "text-embedding-3-small"
+ @embedder_api_key = nil
  @default_threshold = 0.7
  @timeout = 30
  @max_retries = 3
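The key-fallback rule documented above (`embedder_api_key` if set, otherwise `api_key`) is a plain nil-coalesce. A tiny gem-free sketch — `Config` here is a hypothetical struct standing in for the real `Configuration` class:

```ruby
# Hypothetical stand-in for the configuration object, just to show the fallback.
Config = Struct.new(:api_key, :embedder_api_key, keyword_init: true)

config = Config.new(api_key: "sk-shared", embedder_api_key: nil)

# The embedder resolves its key as: embedder_api_key if set, otherwise api_key.
resolved = config.embedder_api_key || config.api_key
puts resolved # => "sk-shared"

# Setting a dedicated embedder key overrides the shared one.
config.embedder_api_key = "sk-embed-only"
puts config.embedder_api_key || config.api_key # => "sk-embed-only"
```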
@@ -124,21 +124,30 @@ module EvalRuby
 
  # Generates a dataset from documents using an LLM.
  #
- # @param documents [Array<String>] file paths to source documents
- # @param questions_per_doc [Integer] number of QA pairs per document
+ # Each document is read, passed to the LLM with a prompt asking for
+ # +questions_per_doc+ QA pairs, and the resulting pairs are appended to
+ # the dataset. Directory paths are expanded via +Dir.glob+. Missing paths
+ # and malformed LLM responses raise or are skipped gracefully rather than
+ # crashing the whole generation.
+ #
+ # @param documents [String, Array<String>] file paths or directory paths
+ # @param questions_per_doc [Integer] number of QA pairs per document (must be > 0)
  # @param llm [Symbol] LLM provider (:openai or :anthropic)
+ # @param judge [Judges::Base, nil] inject a pre-built judge (mainly for testing)
  # @return [Dataset]
- def self.generate(documents:, questions_per_doc: 5, llm: :openai)
- config = EvalRuby.configuration.dup
- config.judge_llm = llm
- judge = case llm
- when :openai then Judges::OpenAI.new(config)
- when :anthropic then Judges::Anthropic.new(config)
- else raise Error, "Unknown LLM: #{llm}"
+ # @raise [EvalRuby::Error] on bad arguments or no documents found
+ def self.generate(documents:, questions_per_doc: 5, llm: :openai, judge: nil)
+ unless questions_per_doc.is_a?(Integer) && questions_per_doc.positive?
+ raise Error, "questions_per_doc must be a positive integer, got #{questions_per_doc.inspect}"
  end
 
+ document_paths = expand_document_paths(documents)
+ raise Error, "No documents found in the provided paths" if document_paths.empty?
+
+ judge ||= build_judge_for(llm)
+
  dataset = new("generated")
- documents.each do |doc_path|
+ document_paths.each do |doc_path|
  content = File.read(doc_path)
  prompt = <<~PROMPT
  Given the following document, generate #{questions_per_doc} question-answer pairs
@@ -150,14 +159,19 @@ module EvalRuby
  Respond in JSON: {"pairs": [{"question": "...", "answer": "...", "context": "relevant excerpt"}]}
  PROMPT
 
- result = judge.call(prompt)
- next unless result.is_a?(Hash) && result.key?("pairs")
+ begin
+ result = judge.call(prompt)
+ rescue StandardError
+ next # keep generating from remaining docs; individual failure should not abort the batch
+ end
+
+ extract_pairs(result).each do |pair|
+ next unless valid_pair?(pair)
 
- result["pairs"].each do |pair|
  dataset.add(
  question: pair["question"],
  answer: pair["answer"],
- context: [pair["context"] || content],
+ context: [pair["context"].is_a?(String) && !pair["context"].empty? ? pair["context"] : content],
  ground_truth: pair["answer"]
  )
  end
@@ -165,6 +179,51 @@ module EvalRuby
  dataset
  end
 
+ # Expands a list of file/directory paths into a flat list of file paths.
+ # Validates existence — missing paths raise an Error.
+ #
+ # @param paths [String, Array<String>]
+ # @return [Array<String>] absolute-or-relative file paths, each verified to exist
+ # @raise [EvalRuby::Error] if any path does not exist
+ def self.expand_document_paths(paths)
+ result = []
+ Array(paths).each do |path|
+ if File.directory?(path)
+ result.concat(Dir.glob(File.join(path, "**/*")).select { |p| File.file?(p) }.sort)
+ elsif File.file?(path)
+ result << path
+ else
+ raise Error, "Document path does not exist: #{path}"
+ end
+ end
+ result
+ end
+
+ def self.build_judge_for(llm)
+ config = EvalRuby.configuration.dup
+ config.judge_llm = llm
+ case llm
+ when :openai then Judges::OpenAI.new(config)
+ when :anthropic then Judges::Anthropic.new(config)
+ else raise Error, "Unknown LLM: #{llm}"
+ end
+ end
+ private_class_method :build_judge_for
+
+ def self.extract_pairs(result)
+ return [] unless result.is_a?(Hash)
+ pairs = result["pairs"]
+ pairs.is_a?(Array) ? pairs : []
+ end
+ private_class_method :extract_pairs
+
+ def self.valid_pair?(pair)
+ pair.is_a?(Hash) &&
+ pair["question"].is_a?(String) && !pair["question"].strip.empty? &&
+ pair["answer"].is_a?(String) && !pair["answer"].strip.empty?
+ end
+ private_class_method :valid_pair?
+
  private_class_method def self.parse_array_field(value)
  return [] if value.nil? || value.empty?
 
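The hardened path handling listed under Changed can be exercised without the gem. A standalone sketch of the expansion rule — single string or array accepted, directories globbed recursively for files, missing paths raising — using a plain `ArgumentError` in place of the gem's `EvalRuby::Error`:

```ruby
require "tmpdir"
require "fileutils"

# Standalone sketch of Dataset.generate's path expansion rule.
def expand_document_paths(paths)
  Array(paths).flat_map do |path|
    if File.directory?(path)
      # Recurse into directories, keeping only regular files, in stable order.
      Dir.glob(File.join(path, "**/*")).select { |p| File.file?(p) }.sort
    elsif File.file?(path)
      [path]
    else
      raise ArgumentError, "Document path does not exist: #{path}"
    end
  end
end

Dir.mktmpdir do |dir|
  FileUtils.mkdir_p(File.join(dir, "sub"))
  File.write(File.join(dir, "a.md"), "A")
  File.write(File.join(dir, "sub", "b.md"), "B")

  # A bare string works the same as a one-element array.
  puts expand_document_paths(dir).length   # => 2
  puts expand_document_paths([dir]).length # => 2
end
```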
@@ -0,0 +1,29 @@
+ # frozen_string_literal: true
+
+ module EvalRuby
+   module Embedders
+     # Abstract base class for text embedders.
+     # Subclasses must implement {#call} to convert a batch of strings
+     # into a batch of float vectors, and {#model} to surface the model
+     # identifier used (shown in metric details).
+     class Base
+       # @param config [Configuration]
+       def initialize(config)
+         @config = config
+       end
+
+       # Embeds a batch of texts.
+       #
+       # @param texts [Array<String>] inputs to embed
+       # @return [Array<Array<Float>>] one vector per input, in the same order
+       def call(texts)
+         raise NotImplementedError, "#{self.class}#call must be implemented"
+       end
+
+       # @return [String] model identifier (e.g. "text-embedding-3-small")
+       def model
+         raise NotImplementedError, "#{self.class}#model must be implemented"
+       end
+     end
+   end
+ end
@@ -0,0 +1,83 @@
+ # frozen_string_literal: true
+
+ require "net/http"
+ require "json"
+ require "uri"
+
+ module EvalRuby
+   module Embedders
+     # OpenAI embeddings backend.
+     # Requires an API key via {Configuration#embedder_api_key} (or falls
+     # back to {Configuration#api_key}). The default model is
+     # +text-embedding-3-small+ (1536 dimensions).
+     class OpenAI < Base
+       API_URL = "https://api.openai.com/v1/embeddings"
+
+       # @param config [Configuration]
+       # @raise [EvalRuby::Error] if no API key is available
+       def initialize(config)
+         super
+         @api_key = @config.embedder_api_key || @config.api_key
+         if @api_key.nil? || @api_key.empty?
+           raise EvalRuby::Error, "API key is required for embedder. Set via EvalRuby.configure { |c| c.embedder_api_key = '...' } or c.api_key = '...'"
+         end
+       end
+
+       # @return [String] configured embedder model
+       def model
+         @config.embedder_model
+       end
+
+       # @param texts [Array<String>] inputs to embed
+       # @return [Array<Array<Float>>] vectors in input order
+       # @raise [EvalRuby::APIError] on non-success HTTP responses
+       # @raise [EvalRuby::TimeoutError] after max retries
+       def call(texts)
+         retries = 0
+         begin
+           uri = URI(API_URL)
+           request = Net::HTTP::Post.new(uri)
+           request["Authorization"] = "Bearer #{@api_key}"
+           request["Content-Type"] = "application/json"
+           request.body = JSON.generate({input: texts, model: model})
+
+           response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true,
+                                      read_timeout: @config.timeout) do |http|
+             http.request(request)
+           end
+
+           unless response.is_a?(Net::HTTPSuccess)
+             raise APIError, "OpenAI embeddings API error: #{response.code} - #{response.body}"
+           end
+
+           parse_vectors(response.body)
+         rescue Net::OpenTimeout, Net::ReadTimeout, Errno::ECONNRESET => e
+           retries += 1
+           if retries <= @config.max_retries
+             sleep(2**(retries - 1))
+             retry
+           end
+           raise EvalRuby::TimeoutError, "Embedder API failed after #{@config.max_retries} retries: #{e.message}"
+         end
+       end
+
+       private
+
+       def parse_vectors(body)
+         parsed = JSON.parse(body)
+         data = parsed["data"]
+         raise InvalidResponseError, "Unexpected embeddings response shape: missing 'data'" unless data.is_a?(Array)
+
+         # OpenAI returns data entries tagged with 'index' to preserve input order;
+         # sort defensively in case the API ever reorders.
+         data.sort_by { |entry| entry["index"].to_i }.map do |entry|
+           vector = entry["embedding"]
+           raise InvalidResponseError, "Embedding entry missing 'embedding' array" unless vector.is_a?(Array)
+           vector
+         end
+       rescue JSON::ParserError => e
+         raise InvalidResponseError, "Failed to parse embeddings response: #{e.message}"
+       end
+     end
+   end
+ end
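The out-of-order reassembly in `parse_vectors` above relies on the `index` field OpenAI attaches to each embedding entry: sorting by that index restores input order even if the API responds out of order. A gem-free sketch of just that sort-and-extract step, fed a hand-built response body:

```ruby
require "json"

# Simulated embeddings response with entries deliberately out of order.
body = JSON.generate({
  "data" => [
    {"index" => 1, "embedding" => [0.0, 1.0]},
    {"index" => 0, "embedding" => [1.0, 0.0]}
  ]
})

# Sort by the per-entry 'index' tag, then pull out the vectors:
# the result matches the original input order, not the wire order.
vectors = JSON.parse(body)
  .fetch("data")
  .sort_by { |entry| entry["index"].to_i }
  .map { |entry| entry.fetch("embedding") }

p vectors # => [[1.0, 0.0], [0.0, 1.0]]
```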
@@ -0,0 +1,72 @@
+ # frozen_string_literal: true
+
+ module EvalRuby
+   module Metrics
+     # Cosine similarity between an answer and its ground truth via an
+     # injected embedder. A judge-free alternative to {Correctness} when
+     # you want fast, deterministic, reference-based scoring — ideal for
+     # chatbot regression testing.
+     #
+     # @example
+     #   embedder = EvalRuby::Embedders::OpenAI.new(EvalRuby.configuration)
+     #   metric = EvalRuby::Metrics::SemanticSimilarity.new(embedder: embedder)
+     #   metric.call(answer: "Paris is in France", ground_truth: "Paris, France")
+     #   # => { score: 0.91, details: { cosine: 0.91, model: "text-embedding-3-small" } }
+     class SemanticSimilarity < Base
+       # @return [EvalRuby::Embedders::Base, nil] the embedder instance
+       attr_reader :embedder
+
+       # @param embedder [EvalRuby::Embedders::Base] required for this metric
+       # @param judge [EvalRuby::Judges::Base, nil] unused by this metric; accepted
+       #   only for interface compatibility with {Metrics::Base}
+       def initialize(embedder: nil, judge: nil)
+         super(judge: judge)
+         @embedder = embedder
+       end
+
+       # @param answer [String] candidate text (typically the model's answer)
+       # @param ground_truth [String] reference text
+       # @return [Hash] +:score+ (Float 0.0–1.0) and +:details+ (Hash)
+       # @raise [EvalRuby::Error] if no embedder is configured
+       def call(answer:, ground_truth:, **_kwargs)
+         raise EvalRuby::Error, "SemanticSimilarity requires an embedder. Pass `embedder:` in the constructor." unless @embedder
+
+         if answer.to_s.strip.empty? || ground_truth.to_s.strip.empty?
+           return {score: 0.0, details: {reason: :empty_input}}
+         end
+
+         vectors = @embedder.call([answer.to_s, ground_truth.to_s])
+         unless vectors.is_a?(Array) && vectors.length == 2
+           raise EvalRuby::Error, "Embedder returned #{vectors.is_a?(Array) ? vectors.length : vectors.class} vectors; expected 2"
+         end
+
+         cosine = cosine_similarity(vectors[0], vectors[1])
+
+         {
+           score: cosine.clamp(0.0, 1.0),
+           details: {cosine: cosine, model: @embedder.model}
+         }
+       end
+
+       private
+
+       def cosine_similarity(a, b)
+         raise EvalRuby::Error, "Embedding vector dimension mismatch: #{a.length} vs #{b.length}" unless a.length == b.length
+
+         dot = 0.0
+         norm_a = 0.0
+         norm_b = 0.0
+         a.each_with_index do |x, i|
+           y = b[i]
+           dot += x * y
+           norm_a += x * x
+           norm_b += y * y
+         end
+
+         return 0.0 if norm_a.zero? || norm_b.zero?
+
+         dot / (Math.sqrt(norm_a) * Math.sqrt(norm_b))
+       end
+     end
+   end
+ end
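The cosine computation above is plain arithmetic and runs without the gem. A standalone sketch (using `ArgumentError` rather than the gem's `EvalRuby::Error`) that also shows why the metric clamps the raw value into 0.0–1.0:

```ruby
# Standalone cosine similarity over two equal-length float vectors.
def cosine_similarity(a, b)
  raise ArgumentError, "dimension mismatch: #{a.length} vs #{b.length}" unless a.length == b.length

  dot = 0.0
  norm_a = 0.0
  norm_b = 0.0
  a.each_with_index do |x, i|
    y = b[i]
    dot += x * y
    norm_a += x * x
    norm_b += y * y
  end
  # Zero vectors have no direction; report 0.0 instead of dividing by zero.
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (Math.sqrt(norm_a) * Math.sqrt(norm_b))
end

puts cosine_similarity([1.0, 0.0], [1.0, 0.0]) # => 1.0
puts cosine_similarity([1.0, 0.0], [0.0, 1.0]) # => 0.0
# Anti-parallel vectors go negative, which is why the metric clamps scores to 0.0-1.0.
puts cosine_similarity([1.0, 0.0], [-1.0, 0.0]).clamp(0.0, 1.0) # => 0.0
```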
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
 
  module EvalRuby
- VERSION = "0.2.0"
+ VERSION = "0.3.0"
  end
data/lib/eval_ruby.rb CHANGED
@@ -5,12 +5,15 @@ require_relative "eval_ruby/configuration"
  require_relative "eval_ruby/judges/base"
  require_relative "eval_ruby/judges/openai"
  require_relative "eval_ruby/judges/anthropic"
+ require_relative "eval_ruby/embedders/base"
+ require_relative "eval_ruby/embedders/openai"
  require_relative "eval_ruby/metrics/base"
  require_relative "eval_ruby/metrics/faithfulness"
  require_relative "eval_ruby/metrics/relevance"
  require_relative "eval_ruby/metrics/correctness"
  require_relative "eval_ruby/metrics/context_precision"
  require_relative "eval_ruby/metrics/context_recall"
+ require_relative "eval_ruby/metrics/semantic_similarity"
  require_relative "eval_ruby/metrics/precision_at_k"
  require_relative "eval_ruby/metrics/recall_at_k"
  require_relative "eval_ruby/metrics/mrr"
@@ -48,6 +51,17 @@ module EvalRuby
  class TimeoutError < Error; end
  class InvalidResponseError < Error; end
 
+ # Progress snapshot yielded to the block passed to {.evaluate_batch}.
+ # @!attribute current [Integer] number of samples completed (1-indexed)
+ # @!attribute total [Integer] total samples in the batch
+ # @!attribute elapsed [Float] seconds since batch started
+ Progress = Struct.new(:current, :total, :elapsed, keyword_init: true) do
+   # @return [Float] completion percentage, 0.0–100.0
+   def percent
+     total.zero? ? 0.0 : (current.to_f / total * 100).round(2)
+   end
+ end
+
  class << self
  # @return [Configuration] the current configuration
  def configuration
@@ -101,16 +115,26 @@ module EvalRuby
 
  # Evaluates a batch of samples, optionally running them through a pipeline.
  #
+ # If a block is given, it is called after each sample with a {Progress}
+ # snapshot, useful for rendering progress bars or writing incremental logs.
+ #
  # @param dataset [Dataset, Array<Hash>] samples to evaluate
  # @param pipeline [#query, nil] optional RAG pipeline to run queries through
+ # @yieldparam progress [Progress] progress snapshot after each sample
  # @return [Report]
- def evaluate_batch(dataset, pipeline: nil)
+ #
+ # @example With progress callback
+ #   EvalRuby.evaluate_batch(dataset) do |progress|
+ #     puts "#{progress.current}/#{progress.total} (#{progress.percent}%)"
+ #   end
+ def evaluate_batch(dataset, pipeline: nil, &progress_block)
  samples = dataset.is_a?(Dataset) ? dataset.samples : dataset
  evaluator = Evaluator.new
  start_time = Time.now
+ total = samples.size
 
- results = samples.map do |sample|
- if pipeline
+ results = samples.each_with_index.map do |sample, i|
+ result = if pipeline
  response = pipeline.query(sample[:question])
  evaluator.evaluate(
  question: sample[:question],
@@ -121,6 +145,14 @@ module EvalRuby
  else
  evaluator.evaluate(**sample.slice(:question, :answer, :context, :ground_truth))
  end
+
+ progress_block&.call(Progress.new(
+   current: i + 1,
+   total: total,
+   elapsed: Time.now - start_time
+ ))
+
+ result
  end
 
  Report.new(results: results, samples: samples, duration: Time.now - start_time)
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: eval-ruby
  version: !ruby/object:Gem::Version
- version: 0.2.0
+ version: 0.3.0
  platform: ruby
  authors:
  - Johannes Dwi Cahyo
@@ -72,6 +72,7 @@ executables: []
  extensions: []
  extra_rdoc_files: []
  files:
+ - CHANGELOG.md
  - Gemfile
  - Gemfile.lock
  - LICENSE
@@ -83,6 +84,8 @@ files:
  - lib/eval_ruby/comparison.rb
  - lib/eval_ruby/configuration.rb
  - lib/eval_ruby/dataset.rb
+ - lib/eval_ruby/embedders/base.rb
+ - lib/eval_ruby/embedders/openai.rb
  - lib/eval_ruby/evaluator.rb
  - lib/eval_ruby/judges/anthropic.rb
  - lib/eval_ruby/judges/base.rb
@@ -97,6 +100,7 @@ files:
  - lib/eval_ruby/metrics/precision_at_k.rb
  - lib/eval_ruby/metrics/recall_at_k.rb
  - lib/eval_ruby/metrics/relevance.rb
+ - lib/eval_ruby/metrics/semantic_similarity.rb
  - lib/eval_ruby/minitest.rb
  - lib/eval_ruby/report.rb
  - lib/eval_ruby/result.rb