eval-ruby 0.1.1 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 59b7bd64cf696d82a27cb6330ab948306904f444a8a64476f702ee1937bbccab
- data.tar.gz: f9c0c234f0712d37d309d460d1c3204d3e6f70bfbd8fe4dcff95ac508bcb7f34
+ metadata.gz: 6ba2db8c803feba5276939550bf2bd3c9dca548f2eca3223ce1f2df39e83bfa8
+ data.tar.gz: 9a3a0a5dbb7eda4d64669e11a5b76c20f43fd0cdb3946934f190475cc460ccd3
  SHA512:
- metadata.gz: 6a9f0c12a790b0098ba639bc236c6080c64c7fbb4ad9892c36810e93b88486bebaa42dca9245beaf828beca29e31793030d62ab232cad50107bc99905635a069
- data.tar.gz: 7c922b6fd8743d5241a254baf301aae2af6aba936fdeedbcb467f9549a8f16493c7571ec9724a5460a75228e415cbcaa852b915ef78f23aa8f1fcbddb85d9450
+ metadata.gz: a6b52e6caf04da090f8595808f76e8edaa901b7eec90178054de0f2c4d8d27c1e2b46f8c154961b91c5ef3a849e93cbbc411df77e4ec2919ea48dd7ead869a1a
+ data.tar.gz: 06eb01c2e177dc263dcdf69ea38debca5989fc6d55e3be31e50c6b7e5bc8539d06df90b815b35680f8bc292d8bc44cebf38496dde5a65cea9cd8f4b3cedee18f
data/CHANGELOG.md ADDED
@@ -0,0 +1,60 @@
+ # Changelog
+
+ All notable changes to this project are documented in this file.
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+ ## [0.3.0] - 2026-04-13
+
+ Reshape release — the original v0.3.0 scope (cost tracking, async/parallel batch, HTML reports) was redistributed to a new **v0.4.0 "Batch, Reporting & Cost"** milestone, and the previous v0.4.0 "Testing Framework Integration" work slid to **v0.5.0**. This release instead prioritizes the API surface needed by the omnibot Wicara eval integration, plus the two quickly actionable items from the original v0.3.0 plan.
+
+ ### Added
+ - `EvalRuby::Embedders::Base` — abstract base class for pluggable embedding backends, mirroring the existing `Judges::Base` pattern.
+ - `EvalRuby::Embedders::OpenAI` — OpenAI embeddings backend (`/v1/embeddings`) with retry/timeout handling and out-of-order response reassembly.
+ - `EvalRuby::Metrics::SemanticSimilarity` — Ragas-style answer-similarity metric. Computes cosine similarity between answer and ground-truth embeddings via an injected embedder. Judge-free, fast, deterministic; ideal for chatbot regression testing.
+ - `Configuration#embedder_llm`, `#embedder_model`, `#embedder_api_key` — new keys controlling the embedder. `embedder_api_key` falls back to `api_key` when unset, so most users only configure one OpenAI key.
+ - `EvalRuby.evaluate_batch(dataset) { |progress| ... }` — block form that yields an `EvalRuby::Progress` struct (`current`, `total`, `elapsed`, `percent`) after each sample. Backwards compatible — batch calls without a block behave exactly as before.
+
+ ### Changed
+ - `Dataset.generate` hardened:
+   - validates `questions_per_doc` is a positive integer
+   - validates document paths exist (raises a clear error instead of a `File.read` crash)
+   - expands directory paths via `Dir.glob("**/*")` to support "scan this folder" workflows
+   - accepts a single path string, not just an array
+   - tolerates malformed LLM responses (missing `pairs`, non-array `pairs`, non-hash entries, missing `question`/`answer`) — skips the bad pair rather than crashing the whole generation
+   - tolerates a judge raising mid-batch — logs the failure as a skip and continues with the remaining documents
+   - accepts an injected `judge:` parameter for testing (and for custom judge plumbing)
+
+ ### Notes
+ - `SemanticSimilarity` is **opt-in** — not part of the default `Evaluator` roster. Instantiate it directly when you want reference-based scoring without an LLM judge.
+ - Deferred to v0.4.0: cost tracking per evaluation (#10), async/parallel batch evaluation (#11), HTML report generation (#12).
+ - Deferred to v0.5.0: CI/test-framework integration (JUnit XML, regression detection, GitHub Actions workflow, expanded matchers/assertions — formerly v0.4.0).
+
+ ## [0.2.0] - 2026-03-17
+
+ ### Added
+ - Comprehensive test suite covering all metrics, judges, datasets, reports, and error paths.
+ - YARD documentation across all public APIs.
+ - RSpec matchers and Minitest assertions for integration in user test suites.
+ - A/B comparison reports with statistical significance testing.
+
+ ## [0.1.1] - 2026-03-10
+
+ ### Fixed
+ - Transient API failures now retry with exponential backoff (previously a single timeout raised immediately).
+ - `Judges::OpenAI#initialize` rejects missing/empty API keys up front with a clear error message.
+ - String-context metrics now handle strings passed in place of arrays without crashing.
+ - Standard-deviation computation in `Report#summary` no longer divides by zero for single-sample reports.
+
+ ### Added
+ - Error subclasses (`APIError`, `TimeoutError`, `InvalidResponseError`) so callers can rescue at the right granularity.
+
+ ## [0.1.0] - 2026-03-09
+
+ - Initial release.
+ - LLM-as-judge metrics: faithfulness, relevance, correctness, context precision, context recall.
+ - Retrieval metrics: precision@k, recall@k, MRR, NDCG, hit rate.
+ - OpenAI and Anthropic judge backends.
+ - `Dataset` with CSV/JSON import and export.
+ - `Report` with per-metric summary, worst-cases, failure filtering, and CSV export.
+ - `Configuration` DSL for judge model, API key, threshold, timeout, retries.
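The `evaluate_batch` block form described in the 0.3.0 notes can be sketched in plain Ruby. This is an illustrative model of the contract only — the `Progress` field names come from the changelog entry, while the per-sample "evaluation" here is a stand-in, not the gem's implementation:

```ruby
# Illustrative sketch of the progress-yielding contract from the changelog.
Progress = Struct.new(:current, :total, :elapsed, :percent, keyword_init: true)

def evaluate_batch_sketch(samples)
  started = Time.now
  samples.map.with_index(1) do |sample, i|
    result = sample.to_s.length # stand-in for a real per-sample evaluation
    if block_given?
      yield Progress.new(current: i, total: samples.size,
                         elapsed: Time.now - started,
                         percent: (i * 100.0 / samples.size).round(1))
    end
    result
  end
end

seen = []
evaluate_batch_sketch(%w[a bb ccc]) { |p| seen << p.percent }
seen # => [33.3, 66.7, 100.0]
```

Calls without a block skip the `yield` entirely, which is how the real API stays backwards compatible.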
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
  PATH
    remote: .
    specs:
-     eval-ruby (0.1.1)
+     eval-ruby (0.2.0)
        csv

  GEM
@@ -39,7 +39,7 @@ CHECKSUMS
    bigdecimal (4.0.1) sha256=8b07d3d065a9f921c80ceaea7c9d4ae596697295b584c296fe599dd0ad01c4a7
    crack (1.0.1) sha256=ff4a10390cd31d66440b7524eb1841874db86201d5b70032028553130b6d4c7e
    csv (3.3.5) sha256=6e5134ac3383ef728b7f02725d9872934f523cb40b961479f69cf3afa6c8e73f
-   eval-ruby (0.1.1)
+   eval-ruby (0.2.0)
    hashdiff (1.2.1) sha256=9c079dbc513dfc8833ab59c0c2d8f230fa28499cc5efb4b8dd276cf931457cd1
    minitest (5.27.0) sha256=2d3b17f8a36fe7801c1adcffdbc38233b938eb0b4966e97a6739055a45fa77d5
    public_suffix (7.0.5) sha256=1a8bb08f1bbea19228d3bed6e5ed908d1cb4f7c2726d18bd9cadf60bc676f623
data/MILESTONES.md ADDED
@@ -0,0 +1,13 @@
+ # Milestones
+
+ ## v0.1.1 (2026-03-10)
+
+ ### Changes
+ - Retry logic
+ - API key validation
+ - String context fix
+ - Std dev fix
+ - Error subclasses
+
+ ## v0.1.0 (Initial release)
+ - Initial release
data/README.md CHANGED
@@ -50,6 +50,45 @@ result.overall # => 0.94
  - **NDCG** (Normalized Discounted Cumulative Gain)
  - **Hit Rate**

+ ### Embedding-Based
+
+ - **Semantic Similarity** — cosine similarity between answer and ground truth via a pluggable embedder. Judge-free, fast, deterministic; useful for chatbot regression testing where you want a reference-based score without an LLM call.
+
+ ## Semantic Similarity
+
+ `SemanticSimilarity` is opt-in (not part of the default `Evaluator` roster in v0.3.0). Instantiate it directly when you need reference-based scoring without an LLM judge — for example, scoring a chatbot's actual response against a fixed expected response.
+
+ ```ruby
+ EvalRuby.configure do |config|
+   config.api_key = ENV["OPENAI_API_KEY"] # shared with judge by default
+   config.embedder_model = "text-embedding-3-small" # default; also supports text-embedding-3-large
+   # config.embedder_api_key = ENV["OTHER_KEY"] # optional; falls back to api_key
+ end
+
+ embedder = EvalRuby::Embedders::OpenAI.new(EvalRuby.configuration)
+ metric = EvalRuby::Metrics::SemanticSimilarity.new(embedder: embedder)
+
+ result = metric.call(
+   answer: "Paris is the capital of France",
+   ground_truth: "The capital of France is Paris"
+ )
+
+ result[:score] # => 0.92
+ result[:details][:cosine] # => 0.92 (raw, pre-clamp)
+ result[:details][:model] # => "text-embedding-3-small"
+ ```
+
+ **When to use `SemanticSimilarity` vs `Correctness`:**
+
+ | | `Correctness` | `SemanticSimilarity` |
+ |---|---|---|
+ | Backend | LLM judge (GPT-4, Claude, …) | Embeddings + cosine |
+ | Cost per call | $$ (judge LLM tokens) | $ (embedding tokens) |
+ | Latency | High (LLM generation) | Low (embedding lookup) |
+ | Determinism | Low (model-dependent) | High |
+ | Reasoning | Natural-language rationale in details | Raw cosine value |
+ | Best for | Nuanced/subjective answers | Regression tests, bulk scoring |
+
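Under the hood the metric is plain cosine similarity between two embedding vectors. A self-contained sketch of just the math, using hand-made vectors rather than real embeddings:

```ruby
# Cosine similarity: dot product divided by the product of the vector norms.
def cosine(a, b)
  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

cosine([1.0, 0.0], [1.0, 0.0]) # => 1.0 (identical direction)
cosine([1.0, 0.0], [0.0, 1.0]) # => 0.0 (orthogonal)
```

Because the embedder is deterministic for a given model, the resulting score is stable across runs — the property the README leans on for regression testing.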
  ## Retrieval Evaluation

  ```ruby
@@ -1,14 +1,27 @@
  # frozen_string_literal: true

  module EvalRuby
+   # Statistical comparison of two evaluation reports using paired t-tests.
+   #
+   # @example
+   #   comparison = EvalRuby.compare(report_a, report_b)
+   #   puts comparison.summary
+   #   comparison.significant_improvements # => [:faithfulness]
    class Comparison
-     attr_reader :report_a, :report_b
+     # @return [Report] baseline report
+     attr_reader :report_a

+     # @return [Report] comparison report
+     attr_reader :report_b
+
+     # @param report_a [Report] baseline
+     # @param report_b [Report] comparison
      def initialize(report_a, report_b)
        @report_a = report_a
        @report_b = report_b
      end

+     # @return [String] formatted comparison table with deltas and p-values
      def summary
        lines = [
          format("%-20s | %-10s | %-10s | %-8s | %s", "Metric", "A", "B", "Delta", "p-value"),
@@ -35,6 +48,10 @@ module EvalRuby
        lines.join("\n")
      end

+     # Returns metrics where report_b is significantly better than report_a.
+     #
+     # @param alpha [Float] significance level (default 0.05)
+     # @return [Array<Symbol>] metric names with significant improvements
      def significant_improvements(alpha: 0.05)
        all_metrics.select do |metric|
          scores_a = @report_a.results.filter_map { |r| r.scores[metric] }
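The doc comment above mentions paired t-tests. A plain-Ruby sketch of the paired t-statistic such a comparison could compute — illustrative only; the gem's actual significance code is not shown in this diff and may differ:

```ruby
# Paired t-statistic: t = mean(d) / (sd(d) / sqrt(n)), where d are the
# per-sample score differences between the two reports.
def paired_t(scores_a, scores_b)
  diffs = scores_b.zip(scores_a).map { |b, a| b - a }
  n     = diffs.size
  mean  = diffs.sum / n.to_f
  var   = diffs.sum { |d| (d - mean)**2 } / (n - 1).to_f
  mean / Math.sqrt(var / n)
end

# Positive t when report B consistently scores higher than report A:
paired_t([0.5, 0.6, 0.7], [0.62, 0.68, 0.81])
```

The p-value then comes from the t-distribution with n - 1 degrees of freedom; comparing it against `alpha` decides significance.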
@@ -1,14 +1,49 @@
  # frozen_string_literal: true

  module EvalRuby
+   # Global configuration for EvalRuby.
+   #
+   # @example
+   #   EvalRuby.configure do |config|
+   #     config.judge_llm = :openai
+   #     config.api_key = ENV["OPENAI_API_KEY"]
+   #     config.judge_model = "gpt-4o"
+   #   end
    class Configuration
-     attr_accessor :judge_llm, :judge_model, :api_key, :default_threshold,
-                   :timeout, :max_retries
+     # @return [Symbol] LLM provider for judge (:openai or :anthropic)
+     attr_accessor :judge_llm
+
+     # @return [String] model name for the judge LLM
+     attr_accessor :judge_model
+
+     # @return [String, nil] API key for the judge LLM provider
+     attr_accessor :api_key
+
+     # @return [Symbol] embedding provider (:openai in v0.3.0)
+     attr_accessor :embedder_llm
+
+     # @return [String] model name for the embedder
+     attr_accessor :embedder_model
+
+     # @return [String, nil] API key for the embedder; falls back to {#api_key} when nil
+     attr_accessor :embedder_api_key
+
+     # @return [Float] default threshold for pass/fail decisions
+     attr_accessor :default_threshold
+
+     # @return [Integer] HTTP request timeout in seconds
+     attr_accessor :timeout
+
+     # @return [Integer] maximum number of retries on transient failures
+     attr_accessor :max_retries

      def initialize
        @judge_llm = :openai
        @judge_model = "gpt-4o"
        @api_key = nil
+       @embedder_llm = :openai
+       @embedder_model = "text-embedding-3-small"
+       @embedder_api_key = nil
        @default_threshold = 0.7
        @timeout = 30
        @max_retries = 3
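The `embedder_api_key` fallback documented on the accessor is a simple `||`. A minimal self-contained sketch — `ConfigSketch` and `effective_embedder_key` are hypothetical names for illustration; in the gem the fallback happens where the embedder reads its key:

```ruby
# Sketch of the key fallback: embedder_api_key wins when set, api_key otherwise.
class ConfigSketch
  attr_accessor :api_key, :embedder_api_key

  def effective_embedder_key
    embedder_api_key || api_key
  end
end

c = ConfigSketch.new
c.api_key = "shared-key"
c.effective_embedder_key # => "shared-key"
c.embedder_api_key = "separate-key"
c.effective_embedder_key # => "separate-key"
```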
@@ -4,16 +4,36 @@ require "csv"
  require "json"

  module EvalRuby
+   # Collection of evaluation samples with import/export support.
+   # Supports CSV, JSON, and programmatic construction.
+   #
+   # @example
+   #   dataset = EvalRuby::Dataset.new("my_test_set")
+   #   dataset.add(question: "What is Ruby?", answer: "A language", ground_truth: "A language")
+   #   report = EvalRuby.evaluate_batch(dataset)
    class Dataset
      include Enumerable

-     attr_reader :name, :samples
+     # @return [String] dataset name
+     attr_reader :name

+     # @return [Array<Hash>] sample entries
+     attr_reader :samples
+
+     # @param name [String] dataset name
      def initialize(name = "default")
        @name = name
        @samples = []
      end

+     # Adds a sample to the dataset.
+     #
+     # @param question [String]
+     # @param ground_truth [String, nil]
+     # @param relevant_contexts [Array<String>] alias for context
+     # @param answer [String, nil]
+     # @param context [Array<String>]
+     # @return [self]
      def add(question:, ground_truth: nil, relevant_contexts: [], answer: nil, context: [])
        @samples << {
          question: question,
@@ -24,18 +44,26 @@ module EvalRuby
        self
      end

+     # @yield [Hash] each sample
      def each(&block)
        @samples.each(&block)
      end

+     # @return [Integer] number of samples
      def size
        @samples.size
      end

+     # @param index [Integer]
+     # @return [Hash] sample at index
      def [](index)
        @samples[index]
      end

+     # Loads a dataset from a CSV file.
+     #
+     # @param path [String] path to CSV file
+     # @return [Dataset]
      def self.from_csv(path)
        dataset = new(File.basename(path, ".*"))
        CSV.foreach(path, headers: true) do |row|
@@ -49,6 +77,10 @@ module EvalRuby
        dataset
      end

+     # Loads a dataset from a JSON file.
+     #
+     # @param path [String] path to JSON file
+     # @return [Dataset]
      def self.from_json(path)
        dataset = new(File.basename(path, ".*"))
        data = JSON.parse(File.read(path))
@@ -64,6 +96,10 @@ module EvalRuby
        dataset
      end

+     # Exports dataset to CSV.
+     #
+     # @param path [String] output file path
+     # @return [void]
      def to_csv(path)
        CSV.open(path, "w") do |csv|
          csv << %w[question answer context ground_truth]
@@ -78,21 +114,40 @@ module EvalRuby
        end
      end

+     # Exports dataset to JSON.
+     #
+     # @param path [String] output file path
+     # @return [void]
      def to_json(path)
        File.write(path, JSON.pretty_generate({name: @name, samples: @samples}))
      end

-     def self.generate(documents:, questions_per_doc: 5, llm: :openai)
-       config = EvalRuby.configuration.dup
-       config.judge_llm = llm
-       judge = case llm
-               when :openai then Judges::OpenAI.new(config)
-               when :anthropic then Judges::Anthropic.new(config)
-               else raise Error, "Unknown LLM: #{llm}"
+     # Generates a dataset from documents using an LLM.
+     #
+     # Each document is read, passed to the LLM with a prompt asking for
+     # +questions_per_doc+ QA pairs, and the resulting pairs are appended to
+     # the dataset. Directory paths are expanded via +Dir.glob+. Missing paths
+     # and malformed LLM responses raise or are skipped gracefully rather than
+     # crashing the whole generation.
+     #
+     # @param documents [String, Array<String>] file paths or directory paths
+     # @param questions_per_doc [Integer] number of QA pairs per document (must be > 0)
+     # @param llm [Symbol] LLM provider (:openai or :anthropic)
+     # @param judge [Judges::Base, nil] inject a pre-built judge (mainly for testing)
+     # @return [Dataset]
+     # @raise [EvalRuby::Error] on bad arguments or no documents found
+     def self.generate(documents:, questions_per_doc: 5, llm: :openai, judge: nil)
+       unless questions_per_doc.is_a?(Integer) && questions_per_doc.positive?
+         raise Error, "questions_per_doc must be a positive integer, got #{questions_per_doc.inspect}"
        end

+       document_paths = expand_document_paths(documents)
+       raise Error, "No documents found in the provided paths" if document_paths.empty?
+
+       judge ||= build_judge_for(llm)
+
        dataset = new("generated")
-       documents.each do |doc_path|
+       document_paths.each do |doc_path|
          content = File.read(doc_path)
          prompt = <<~PROMPT
            Given the following document, generate #{questions_per_doc} question-answer pairs
@@ -104,14 +159,19 @@ module EvalRuby
            Respond in JSON: {"pairs": [{"question": "...", "answer": "...", "context": "relevant excerpt"}]}
          PROMPT

-         result = judge.call(prompt)
-         next unless result.is_a?(Hash) && result.key?("pairs")
+         begin
+           result = judge.call(prompt)
+         rescue StandardError
+           next # keep generating from remaining docs; individual failure should not abort the batch
+         end
+
+         extract_pairs(result).each do |pair|
+           next unless valid_pair?(pair)

-         result["pairs"].each do |pair|
            dataset.add(
              question: pair["question"],
              answer: pair["answer"],
-             context: [pair["context"] || content],
+             context: [pair["context"].is_a?(String) && !pair["context"].empty? ? pair["context"] : content],
              ground_truth: pair["answer"]
            )
          end
@@ -119,6 +179,51 @@ module EvalRuby
        dataset
      end

+     # Expands a list of file/directory paths into a flat list of file paths.
+     # Validates existence — missing paths raise an Error.
+     #
+     # @param paths [String, Array<String>]
+     # @return [Array<String>] absolute-or-relative file paths, each verified to exist
+     # @raise [EvalRuby::Error] if any path does not exist
+     def self.expand_document_paths(paths)
+       result = []
+       Array(paths).each do |path|
+         if File.directory?(path)
+           result.concat(Dir.glob(File.join(path, "**/*")).select { |p| File.file?(p) }.sort)
+         elsif File.file?(path)
+           result << path
+         else
+           raise Error, "Document path does not exist: #{path}"
+         end
+       end
+       result
+     end
+
+     def self.build_judge_for(llm)
+       config = EvalRuby.configuration.dup
+       config.judge_llm = llm
+       case llm
+       when :openai then Judges::OpenAI.new(config)
+       when :anthropic then Judges::Anthropic.new(config)
+       else raise Error, "Unknown LLM: #{llm}"
+       end
+     end
+     private_class_method :build_judge_for
+
+     def self.extract_pairs(result)
+       return [] unless result.is_a?(Hash)
+       pairs = result["pairs"]
+       pairs.is_a?(Array) ? pairs : []
+     end
+     private_class_method :extract_pairs
+
+     def self.valid_pair?(pair)
+       pair.is_a?(Hash) &&
+         pair["question"].is_a?(String) && !pair["question"].strip.empty? &&
+         pair["answer"].is_a?(String) && !pair["answer"].strip.empty?
+     end
+     private_class_method :valid_pair?
+
      private_class_method def self.parse_array_field(value)
        return [] if value.nil? || value.empty?
@@ -0,0 +1,29 @@
+ # frozen_string_literal: true
+
+ module EvalRuby
+   module Embedders
+     # Abstract base class for text embedders.
+     # Subclasses must implement {#call} to convert a batch of strings
+     # into a batch of float vectors, and {#model} to surface the model
+     # identifier used (shown in metric details).
+     class Base
+       # @param config [Configuration]
+       def initialize(config)
+         @config = config
+       end
+
+       # Embeds a batch of texts.
+       #
+       # @param texts [Array<String>] inputs to embed
+       # @return [Array<Array<Float>>] one vector per input, in the same order
+       def call(texts)
+         raise NotImplementedError, "#{self.class}#call must be implemented"
+       end
+
+       # @return [String] model identifier (e.g. "text-embedding-3-small")
+       def model
+         raise NotImplementedError, "#{self.class}#model must be implemented"
+       end
+     end
+   end
+ end
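A subclass satisfying this contract can be tiny. A sketch with a deterministic fake backend of the kind that is handy in tests — `BaseSketch` and `FakeEmbedder` are hypothetical names, and the base class is re-declared minimally (without the config plumbing) so the snippet runs standalone:

```ruby
# Minimal stand-in for Embedders::Base: call and model must be overridden.
class BaseSketch
  def call(texts)
    raise NotImplementedError, "#{self.class}#call must be implemented"
  end

  def model
    raise NotImplementedError, "#{self.class}#model must be implemented"
  end
end

class FakeEmbedder < BaseSketch
  # One vector per input, in input order -- the contract documented on Base#call.
  def call(texts)
    texts.map { |t| [t.length.to_f, t.bytes.sum.to_f] }
  end

  def model
    "fake-embedder-v1"
  end
end

FakeEmbedder.new.call(%w[a bb]) # => [[1.0, 97.0], [2.0, 196.0]]
```

Any metric that takes an injected embedder (such as `SemanticSimilarity`) can then be tested without network access.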
@@ -0,0 +1,83 @@
+ # frozen_string_literal: true
+
+ require "net/http"
+ require "json"
+ require "uri"
+
+ module EvalRuby
+   module Embedders
+     # OpenAI embeddings backend.
+     # Requires an API key via {Configuration#embedder_api_key} (or falls
+     # back to {Configuration#api_key}). The default model is
+     # +text-embedding-3-small+ (1536 dimensions).
+     class OpenAI < Base
+       API_URL = "https://api.openai.com/v1/embeddings"
+
+       # @param config [Configuration]
+       # @raise [EvalRuby::Error] if no API key is available
+       def initialize(config)
+         super
+         @api_key = @config.embedder_api_key || @config.api_key
+         if @api_key.nil? || @api_key.empty?
+           raise EvalRuby::Error, "API key is required for embedder. Set via EvalRuby.configure { |c| c.embedder_api_key = '...' } or c.api_key = '...'"
+         end
+       end
+
+       # @return [String] configured embedder model
+       def model
+         @config.embedder_model
+       end
+
+       # @param texts [Array<String>] inputs to embed
+       # @return [Array<Array<Float>>] vectors in input order
+       # @raise [EvalRuby::APIError] on non-success HTTP responses
+       # @raise [EvalRuby::TimeoutError] after max retries
+       def call(texts)
+         retries = 0
+         begin
+           uri = URI(API_URL)
+           request = Net::HTTP::Post.new(uri)
+           request["Authorization"] = "Bearer #{@api_key}"
+           request["Content-Type"] = "application/json"
+           request.body = JSON.generate({input: texts, model: model})
+
+           response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true,
+                                      read_timeout: @config.timeout) do |http|
+             http.request(request)
+           end
+
+           unless response.is_a?(Net::HTTPSuccess)
+             raise APIError, "OpenAI embeddings API error: #{response.code} - #{response.body}"
+           end
+
+           parse_vectors(response.body)
+         rescue Net::OpenTimeout, Net::ReadTimeout, Errno::ECONNRESET => e
+           retries += 1
+           if retries <= @config.max_retries
+             sleep(2**(retries - 1))
+             retry
+           end
+           raise EvalRuby::TimeoutError, "Embedder API failed after #{@config.max_retries} retries: #{e.message}"
+         end
+       end
+
+       private
+
+       def parse_vectors(body)
+         parsed = JSON.parse(body)
+         data = parsed["data"]
+         raise InvalidResponseError, "Unexpected embeddings response shape: missing 'data'" unless data.is_a?(Array)
+
+         # OpenAI returns data entries tagged with 'index' to preserve input order;
+         # sort defensively in case the API ever reorders.
+         data.sort_by { |entry| entry["index"].to_i }.map do |entry|
+           vector = entry["embedding"]
+           raise InvalidResponseError, "Embedding entry missing 'embedding' array" unless vector.is_a?(Array)
+           vector
+         end
+       rescue JSON::ParserError => e
+         raise InvalidResponseError, "Failed to parse embeddings response: #{e.message}"
+       end
+     end
+   end
+ end
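The retry path above sleeps `2**(retries - 1)` seconds between attempts, i.e. an exponentially growing delay. The resulting schedule, extracted as a standalone sketch (no sleeping, just the arithmetic):

```ruby
# Delay before each retry attempt: 1s, 2s, 4s, ... doubling per attempt.
def backoff_delays(max_retries)
  (1..max_retries).map { |attempt| 2**(attempt - 1) }
end

backoff_delays(3) # => [1, 2, 4]
```

With the default `max_retries = 3`, a persistently failing request therefore waits at most 1 + 2 + 4 = 7 seconds before `TimeoutError` is raised.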
@@ -1,12 +1,25 @@
  # frozen_string_literal: true

  module EvalRuby
+   # Runs all configured metrics on a given question/answer/context tuple.
+   #
+   # @example
+   #   evaluator = EvalRuby::Evaluator.new
+   #   result = evaluator.evaluate(question: "...", answer: "...", context: [...])
    class Evaluator
+     # @param config [Configuration] configuration to use
      def initialize(config = EvalRuby.configuration)
        @config = config
        @judge = build_judge(config)
      end

+     # Evaluates an LLM response across quality metrics.
+     #
+     # @param question [String] the input question
+     # @param answer [String] the LLM-generated answer
+     # @param context [Array<String>] retrieved context chunks
+     # @param ground_truth [String, nil] expected correct answer
+     # @return [Result]
      def evaluate(question:, answer:, context: [], ground_truth: nil)
        scores = {}
        details = {}
@@ -37,6 +50,12 @@ module EvalRuby
        Result.new(scores: scores, details: details)
      end

+     # Evaluates retrieval quality using IR metrics.
+     #
+     # @param question [String] the input question
+     # @param retrieved [Array<String>] retrieved document IDs
+     # @param relevant [Array<String>] ground-truth relevant document IDs
+     # @return [RetrievalResult]
      def evaluate_retrieval(question:, retrieved:, relevant:)
        RetrievalResult.new(retrieved: retrieved, relevant: relevant)
      end
@@ -55,32 +74,49 @@ module EvalRuby
      end
    end

+   # Holds retrieval evaluation results with IR metric accessors.
+   #
+   # @example
+   #   result = EvalRuby.evaluate_retrieval(question: "...", retrieved: [...], relevant: [...])
+   #   result.precision_at_k(5) # => 0.6
+   #   result.mrr # => 1.0
    class RetrievalResult
+     # @param retrieved [Array<String>] retrieved document IDs in ranked order
+     # @param relevant [Array<String>] ground-truth relevant document IDs
      def initialize(retrieved:, relevant:)
        @retrieved = retrieved
        @relevant = relevant
      end

+     # @param k [Integer] number of top results to consider
+     # @return [Float] precision at k
      def precision_at_k(k)
        Metrics::PrecisionAtK.new.call(retrieved: @retrieved, relevant: @relevant, k: k)
      end

+     # @param k [Integer] number of top results to consider
+     # @return [Float] recall at k
      def recall_at_k(k)
        Metrics::RecallAtK.new.call(retrieved: @retrieved, relevant: @relevant, k: k)
      end

+     # @return [Float] mean reciprocal rank
      def mrr
        Metrics::MRR.new.call(retrieved: @retrieved, relevant: @relevant)
      end

+     # @param k [Integer, nil] number of top results (nil for all)
+     # @return [Float] normalized discounted cumulative gain
      def ndcg(k: nil)
        Metrics::NDCG.new.call(retrieved: @retrieved, relevant: @relevant, k: k)
      end

+     # @return [Float] 1.0 if any relevant doc is retrieved, 0.0 otherwise
      def hit_rate
        @retrieved.any? { |doc| @relevant.include?(doc) } ? 1.0 : 0.0
      end

+     # @return [Hash{Symbol => Float}] all retrieval metrics
      def to_h
        {
          precision_at_k: precision_at_k(@retrieved.length),
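The metrics `RetrievalResult` delegates to follow the standard IR textbook formulas. Plain-Ruby sketches of three of them — illustrative only; the gem's `Metrics` classes are the canonical implementations and may handle edge cases (empty inputs, k larger than the result list) differently:

```ruby
# precision@k: fraction of the top-k retrieved docs that are relevant.
def precision_at_k(retrieved, relevant, k)
  retrieved.first(k).count { |doc| relevant.include?(doc) } / k.to_f
end

# MRR: reciprocal rank of the first relevant doc (0.0 if none retrieved).
def mrr(retrieved, relevant)
  rank = retrieved.index { |doc| relevant.include?(doc) }
  rank.nil? ? 0.0 : 1.0 / (rank + 1)
end

# Hit rate: 1.0 if any relevant doc appears anywhere in the results.
def hit_rate(retrieved, relevant)
  retrieved.any? { |doc| relevant.include?(doc) } ? 1.0 : 0.0
end

retrieved = %w[d3 d1 d7]
relevant  = %w[d1 d2]
mrr(retrieved, relevant)      # => 0.5 (first hit at rank 2)
hit_rate(retrieved, relevant) # => 1.0
```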
@@ -6,14 +6,22 @@ require "uri"

  module EvalRuby
    module Judges
+     # Anthropic-based LLM judge using the Messages API.
+     # Requires an API key set via {Configuration#api_key}.
      class Anthropic < Base
        API_URL = "https://api.anthropic.com/v1/messages"

+       # @param config [Configuration]
+       # @raise [EvalRuby::Error] if API key is missing
        def initialize(config)
          super
          raise EvalRuby::Error, "API key is required. Set via EvalRuby.configure { |c| c.api_key = '...' }" if @config.api_key.nil? || @config.api_key.empty?
        end

+       # @param prompt [String] the evaluation prompt
+       # @return [Hash, nil] parsed JSON response
+       # @raise [EvalRuby::Error] on API errors
+       # @raise [EvalRuby::TimeoutError] after max retries
        def call(prompt)
          retries = 0
          begin