eval-ruby 0.2.0 → 0.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +60 -0
- data/README.md +39 -0
- data/lib/eval_ruby/configuration.rb +12 -0
- data/lib/eval_ruby/dataset.rb +73 -14
- data/lib/eval_ruby/embedders/base.rb +29 -0
- data/lib/eval_ruby/embedders/openai.rb +83 -0
- data/lib/eval_ruby/metrics/semantic_similarity.rb +72 -0
- data/lib/eval_ruby/version.rb +1 -1
- data/lib/eval_ruby.rb +35 -3
- metadata +5 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 6ba2db8c803feba5276939550bf2bd3c9dca548f2eca3223ce1f2df39e83bfa8
+  data.tar.gz: 9a3a0a5dbb7eda4d64669e11a5b76c20f43fd0cdb3946934f190475cc460ccd3
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: a6b52e6caf04da090f8595808f76e8edaa901b7eec90178054de0f2c4d8d27c1e2b46f8c154961b91c5ef3a849e93cbbc411df77e4ec2919ea48dd7ead869a1a
+  data.tar.gz: 06eb01c2e177dc263dcdf69ea38debca5989fc6d55e3be31e50c6b7e5bc8539d06df90b815b35680f8bc292d8bc44cebf38496dde5a65cea9cd8f4b3cedee18f
data/CHANGELOG.md
ADDED
@@ -0,0 +1,60 @@
+# Changelog
+
+All notable changes to this project are documented in this file.
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [0.3.0] - 2026-04-13
+
+Reshape release — the original v0.3.0 scope (cost tracking, async/parallel batch, HTML reports) was redistributed to a new **v0.4.0 "Batch, Reporting & Cost"** milestone, and the previous v0.4.0 "Testing Framework Integration" work slid to **v0.5.0**. This release instead prioritizes the API surface needed by the omnibot Wicara eval integration, plus the two quickly actionable items from the original v0.3.0 plan.
+
+### Added
+
+- `EvalRuby::Embedders::Base` — abstract base class for pluggable embedding backends, mirroring the existing `Judges::Base` pattern.
+- `EvalRuby::Embedders::OpenAI` — OpenAI embeddings backend (`/v1/embeddings`) with retry/timeout handling and out-of-order response reassembly.
+- `EvalRuby::Metrics::SemanticSimilarity` — Ragas-style answer-similarity metric. Computes cosine similarity between answer and ground-truth embeddings via an injected embedder. Judge-free, fast, deterministic; ideal for chatbot regression testing.
+- `Configuration#embedder_llm`, `#embedder_model`, `#embedder_api_key` — new keys controlling the embedder. `embedder_api_key` falls back to `api_key` when unset, so most users only configure one OpenAI key.
+- `EvalRuby.evaluate_batch(dataset) { |progress| ... }` — block form that yields an `EvalRuby::Progress` struct (`current`, `total`, `elapsed`, `percent`) after each sample. Backwards compatible — batch calls without a block behave exactly as before.
+
+### Changed
+
+- `Dataset.generate` hardened:
+  - validates `questions_per_doc` is a positive integer
+  - validates document paths exist (raises a clear error instead of a `File.read` crash)
+  - expands directory paths via `Dir.glob("**/*")` to support "scan this folder" workflows
+  - accepts a single path string, not just an array
+  - tolerates malformed LLM responses (missing `pairs`, non-array `pairs`, non-hash entries, missing `question`/`answer`) — skips the bad pair rather than crashing the whole generation
+  - tolerates a judge raising mid-batch — logs the failure as a skip and continues with the remaining documents
+  - accepts an injected `judge:` parameter for testing (and for custom judge plumbing)
+
+### Notes
+
+- `SemanticSimilarity` is **opt-in** — not part of the default `Evaluator` roster. Instantiate it directly when you want reference-based scoring without an LLM judge.
+- Deferred to v0.4.0: cost tracking per evaluation (#10), async/parallel batch evaluation (#11), HTML report generation (#12).
+- Deferred to v0.5.0: CI/test-framework integration (JUnit XML, regression detection, GitHub Actions workflow, expanded matchers/assertions — formerly v0.4.0).
+
+## [0.2.0] - 2026-03-17
+
+### Added
+
+- Comprehensive test suite covering all metrics, judges, datasets, reports, and error paths.
+- YARD documentation across all public APIs.
+- RSpec matchers and Minitest assertions for integration in user test suites.
+- A/B comparison reports with statistical significance testing.
+
+## [0.1.1] - 2026-03-10
+
+### Fixed
+
+- Transient API failures now retry with exponential backoff (previously a single timeout raised immediately).
+- `Judges::OpenAI#initialize` rejects missing/empty API keys up front with a clear error message.
+- String-context metrics now handle strings passed in place of arrays without crashing.
+- Standard-deviation computation in `Report#summary` no longer divides by zero for single-sample reports.
+
+### Added
+
+- Error subclasses (`APIError`, `TimeoutError`, `InvalidResponseError`) so callers can rescue at the right granularity.
+
+## [0.1.0] - 2026-03-09
+
+- Initial release.
+- LLM-as-judge metrics: faithfulness, relevance, correctness, context precision, context recall.
+- Retrieval metrics: precision@k, recall@k, MRR, NDCG, hit rate.
+- OpenAI and Anthropic judge backends.
+- `Dataset` with CSV/JSON import and export.
+- `Report` with per-metric summary, worst-cases, failure filtering, and CSV export.
+- `Configuration` DSL for judge model, API key, threshold, timeout, retries.
data/README.md
CHANGED
@@ -50,6 +50,45 @@ result.overall # => 0.94
 - **NDCG** (Normalized Discounted Cumulative Gain)
 - **Hit Rate**
 
+### Embedding-Based
+
+- **Semantic Similarity** — cosine similarity between answer and ground truth via a pluggable embedder. Judge-free, fast, deterministic; useful for chatbot regression testing where you want a reference-based score without an LLM call.
+
+## Semantic Similarity
+
+`SemanticSimilarity` is opt-in (not part of the default `Evaluator` roster in v0.3.0). Instantiate it directly when you need reference-based scoring without an LLM judge — for example, scoring a chatbot's actual response against a fixed expected response.
+
+```ruby
+EvalRuby.configure do |config|
+  config.api_key = ENV["OPENAI_API_KEY"]           # shared with judge by default
+  config.embedder_model = "text-embedding-3-small" # default; also supports text-embedding-3-large
+  # config.embedder_api_key = ENV["OTHER_KEY"]     # optional; falls back to api_key
+end
+
+embedder = EvalRuby::Embedders::OpenAI.new(EvalRuby.configuration)
+metric = EvalRuby::Metrics::SemanticSimilarity.new(embedder: embedder)
+
+result = metric.call(
+  answer: "Paris is the capital of France",
+  ground_truth: "The capital of France is Paris"
+)
+
+result[:score]            # => 0.92
+result[:details][:cosine] # => 0.92 (raw, pre-clamp)
+result[:details][:model]  # => "text-embedding-3-small"
+```
+
+**When to use `SemanticSimilarity` vs `Correctness`:**
+
+| | `Correctness` | `SemanticSimilarity` |
+|---|---|---|
+| Backend | LLM judge (GPT-4, Claude, …) | Embeddings + cosine |
+| Cost per call | $$ (judge LLM tokens) | $ (embedding tokens) |
+| Latency | High (LLM generation) | Low (embedding lookup) |
+| Determinism | Low (model-dependent) | High |
+| Reasoning | Natural-language rationale in details | Raw cosine value |
+| Best for | Nuanced/subjective answers | Regression tests, bulk scoring |
+
 ## Retrieval Evaluation
 
 ```ruby
data/lib/eval_ruby/configuration.rb
CHANGED
@@ -19,6 +19,15 @@ module EvalRuby
     # @return [String, nil] API key for the judge LLM provider
     attr_accessor :api_key
 
+    # @return [Symbol] embedding provider (:openai in v0.3.0)
+    attr_accessor :embedder_llm
+
+    # @return [String] model name for the embedder
+    attr_accessor :embedder_model
+
+    # @return [String, nil] API key for the embedder; falls back to {#api_key} when nil
+    attr_accessor :embedder_api_key
+
     # @return [Float] default threshold for pass/fail decisions
     attr_accessor :default_threshold
 
@@ -32,6 +41,9 @@ module EvalRuby
       @judge_llm = :openai
       @judge_model = "gpt-4o"
       @api_key = nil
+      @embedder_llm = :openai
+      @embedder_model = "text-embedding-3-small"
+      @embedder_api_key = nil
       @default_threshold = 0.7
       @timeout = 30
       @max_retries = 3
data/lib/eval_ruby/dataset.rb
CHANGED
@@ -124,21 +124,30 @@ module EvalRuby
 
     # Generates a dataset from documents using an LLM.
     #
-    #
-    #
+    # Each document is read, passed to the LLM with a prompt asking for
+    # +questions_per_doc+ QA pairs, and the resulting pairs are appended to
+    # the dataset. Directory paths are expanded via +Dir.glob+. Missing paths
+    # and malformed LLM responses raise or are skipped gracefully rather than
+    # crashing the whole generation.
+    #
+    # @param documents [String, Array<String>] file paths or directory paths
+    # @param questions_per_doc [Integer] number of QA pairs per document (must be > 0)
     # @param llm [Symbol] LLM provider (:openai or :anthropic)
+    # @param judge [Judges::Base, nil] inject a pre-built judge (mainly for testing)
     # @return [Dataset]
-
-
-
-
-      when :openai then Judges::OpenAI.new(config)
-      when :anthropic then Judges::Anthropic.new(config)
-      else raise Error, "Unknown LLM: #{llm}"
+    # @raise [EvalRuby::Error] on bad arguments or no documents found
+    def self.generate(documents:, questions_per_doc: 5, llm: :openai, judge: nil)
+      unless questions_per_doc.is_a?(Integer) && questions_per_doc.positive?
+        raise Error, "questions_per_doc must be a positive integer, got #{questions_per_doc.inspect}"
       end
 
+      document_paths = expand_document_paths(documents)
+      raise Error, "No documents found in the provided paths" if document_paths.empty?
+
+      judge ||= build_judge_for(llm)
+
       dataset = new("generated")
-
+      document_paths.each do |doc_path|
         content = File.read(doc_path)
         prompt = <<~PROMPT
           Given the following document, generate #{questions_per_doc} question-answer pairs
@@ -150,14 +159,19 @@ module EvalRuby
           Respond in JSON: {"pairs": [{"question": "...", "answer": "...", "context": "relevant excerpt"}]}
         PROMPT
 
-
-
+        begin
+          result = judge.call(prompt)
+        rescue StandardError
+          next # keep generating from remaining docs; individual failure should not abort the batch
+        end
+
+        extract_pairs(result).each do |pair|
+          next unless valid_pair?(pair)
 
-        result["pairs"].each do |pair|
           dataset.add(
             question: pair["question"],
             answer: pair["answer"],
-            context: [pair["context"]
+            context: [pair["context"].is_a?(String) && !pair["context"].empty? ? pair["context"] : content],
            ground_truth: pair["answer"]
          )
        end
@@ -165,6 +179,51 @@ module EvalRuby
       dataset
     end
 
+    # Expands a list of file/directory paths into a flat list of file paths.
+    # Validates existence — missing paths raise an Error.
+    #
+    # @param paths [String, Array<String>]
+    # @return [Array<String>] absolute-or-relative file paths, each verified to exist
+    # @raise [EvalRuby::Error] if any path does not exist
+    def self.expand_document_paths(paths)
+      result = []
+      Array(paths).each do |path|
+        if File.directory?(path)
+          result.concat(Dir.glob(File.join(path, "**/*")).select { |p| File.file?(p) }.sort)
+        elsif File.file?(path)
+          result << path
+        else
+          raise Error, "Document path does not exist: #{path}"
+        end
+      end
+      result
+    end
+
+    def self.build_judge_for(llm)
+      config = EvalRuby.configuration.dup
+      config.judge_llm = llm
+      case llm
+      when :openai then Judges::OpenAI.new(config)
+      when :anthropic then Judges::Anthropic.new(config)
+      else raise Error, "Unknown LLM: #{llm}"
+      end
+    end
+    private_class_method :build_judge_for
+
+    def self.extract_pairs(result)
+      return [] unless result.is_a?(Hash)
+      pairs = result["pairs"]
+      pairs.is_a?(Array) ? pairs : []
+    end
+    private_class_method :extract_pairs
+
+    def self.valid_pair?(pair)
+      pair.is_a?(Hash) &&
+        pair["question"].is_a?(String) && !pair["question"].strip.empty? &&
+        pair["answer"].is_a?(String) && !pair["answer"].strip.empty?
+    end
+    private_class_method :valid_pair?
+
     private_class_method def self.parse_array_field(value)
       return [] if value.nil? || value.empty?
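For readers skimming the diff, the tolerance rules in `extract_pairs`/`valid_pair?` can be exercised in isolation. The sketch below restates the two helpers outside the gem (plain methods, no EvalRuby dependency) and feeds them a mix of malformed and valid judge responses, matching the "skip the bad pair rather than crashing" behavior described in the changelog:

```ruby
# Standalone restatement of the tolerant pair-extraction logic (not gem code).

def extract_pairs(result)
  return [] unless result.is_a?(Hash)   # judge returned nil / non-hash
  pairs = result["pairs"]
  pairs.is_a?(Array) ? pairs : []       # "pairs" missing or not an array
end

def valid_pair?(pair)
  pair.is_a?(Hash) &&
    pair["question"].is_a?(String) && !pair["question"].strip.empty? &&
    pair["answer"].is_a?(String) && !pair["answer"].strip.empty?
end

responses = [
  nil,                                  # judge returned nothing usable
  {"pairs" => "oops"},                  # non-array "pairs"
  {"pairs" => [
    {"question" => "Q?", "answer" => "A"},  # valid
    {"question" => ""},                     # blank question, no answer
    "not-a-hash"                            # non-hash entry
  ]}
]

kept = responses.flat_map { |r| extract_pairs(r) }.select { |p| valid_pair?(p) }
puts kept.length # => 1
```

Only the one well-formed pair survives; every malformed shape is silently dropped instead of raising.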
data/lib/eval_ruby/embedders/base.rb
ADDED
@@ -0,0 +1,29 @@
+# frozen_string_literal: true
+
+module EvalRuby
+  module Embedders
+    # Abstract base class for text embedders.
+    # Subclasses must implement {#call} to convert a batch of strings
+    # into a batch of float vectors, and {#model} to surface the model
+    # identifier used (shown in metric details).
+    class Base
+      # @param config [Configuration]
+      def initialize(config)
+        @config = config
+      end
+
+      # Embeds a batch of texts.
+      #
+      # @param texts [Array<String>] inputs to embed
+      # @return [Array<Array<Float>>] one vector per input, in the same order
+      def call(texts)
+        raise NotImplementedError, "#{self.class}#call must be implemented"
+      end
+
+      # @return [String] model identifier (e.g. "text-embedding-3-small")
+      def model
+        raise NotImplementedError, "#{self.class}#model must be implemented"
+      end
+    end
+  end
+end
data/lib/eval_ruby/embedders/openai.rb
ADDED
@@ -0,0 +1,83 @@
+# frozen_string_literal: true
+
+require "net/http"
+require "json"
+require "uri"
+
+module EvalRuby
+  module Embedders
+    # OpenAI embeddings backend.
+    # Requires an API key via {Configuration#embedder_api_key} (or falls
+    # back to {Configuration#api_key}). The default model is
+    # +text-embedding-3-small+ (1536 dimensions).
+    class OpenAI < Base
+      API_URL = "https://api.openai.com/v1/embeddings"
+
+      # @param config [Configuration]
+      # @raise [EvalRuby::Error] if no API key is available
+      def initialize(config)
+        super
+        @api_key = @config.embedder_api_key || @config.api_key
+        if @api_key.nil? || @api_key.empty?
+          raise EvalRuby::Error, "API key is required for embedder. Set via EvalRuby.configure { |c| c.embedder_api_key = '...' } or c.api_key = '...'"
+        end
+      end
+
+      # @return [String] configured embedder model
+      def model
+        @config.embedder_model
+      end
+
+      # @param texts [Array<String>] inputs to embed
+      # @return [Array<Array<Float>>] vectors in input order
+      # @raise [EvalRuby::APIError] on non-success HTTP responses
+      # @raise [EvalRuby::TimeoutError] after max retries
+      def call(texts)
+        retries = 0
+        begin
+          uri = URI(API_URL)
+          request = Net::HTTP::Post.new(uri)
+          request["Authorization"] = "Bearer #{@api_key}"
+          request["Content-Type"] = "application/json"
+          request.body = JSON.generate({input: texts, model: model})
+
+          response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true,
+                                     read_timeout: @config.timeout) do |http|
+            http.request(request)
+          end
+
+          unless response.is_a?(Net::HTTPSuccess)
+            raise APIError, "OpenAI embeddings API error: #{response.code} - #{response.body}"
+          end
+
+          parse_vectors(response.body)
+        rescue Net::OpenTimeout, Net::ReadTimeout, Errno::ECONNRESET => e
+          retries += 1
+          if retries <= @config.max_retries
+            sleep(2**(retries - 1))
+            retry
+          end
+          raise EvalRuby::TimeoutError, "Embedder API failed after #{@config.max_retries} retries: #{e.message}"
+        end
+      end
+
+      private
+
+      def parse_vectors(body)
+        parsed = JSON.parse(body)
+        data = parsed["data"]
+        raise InvalidResponseError, "Unexpected embeddings response shape: missing 'data'" unless data.is_a?(Array)
+
+        # OpenAI returns data entries tagged with 'index' to preserve input order;
+        # sort defensively in case the API ever reorders.
+        data.sort_by { |entry| entry["index"].to_i }.map do |entry|
+          vector = entry["embedding"]
+          raise InvalidResponseError, "Embedding entry missing 'embedding' array" unless vector.is_a?(Array)
+          vector
+        end
+      rescue JSON::ParserError => e
+        raise InvalidResponseError, "Failed to parse embeddings response: #{e.message}"
+      end
+    end
+  end
+end
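The retry branch above sleeps `2**(retries - 1)` seconds between attempts. A quick standalone sketch of the resulting backoff schedule for the default `max_retries = 3` (schedule arithmetic only, no HTTP or sleeping):

```ruby
# Backoff delays produced by sleep(2**(retries - 1)) across the retry attempts.
max_retries = 3
delays = (1..max_retries).map { |attempt| 2**(attempt - 1) }
p delays # => [1, 2, 4]
```

So a fully failing call waits 1 s, 2 s, then 4 s before `TimeoutError` is raised on the fourth failure.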
data/lib/eval_ruby/metrics/semantic_similarity.rb
ADDED
@@ -0,0 +1,72 @@
+# frozen_string_literal: true
+
+module EvalRuby
+  module Metrics
+    # Cosine similarity between an answer and its ground truth via an
+    # injected embedder. A judge-free alternative to {Correctness} when
+    # you want fast, deterministic, reference-based scoring — ideal for
+    # chatbot regression testing.
+    #
+    # @example
+    #   embedder = EvalRuby::Embedders::OpenAI.new(EvalRuby.configuration)
+    #   metric = EvalRuby::Metrics::SemanticSimilarity.new(embedder: embedder)
+    #   metric.call(answer: "Paris is in France", ground_truth: "Paris, France")
+    #   # => { score: 0.91, details: { cosine: 0.91, model: "text-embedding-3-small" } }
+    class SemanticSimilarity < Base
+      # @return [EvalRuby::Embedders::Base, nil] the embedder instance
+      attr_reader :embedder
+
+      # @param embedder [EvalRuby::Embedders::Base] required for this metric
+      # @param judge [EvalRuby::Judges::Base, nil] unused by this metric; accepted
+      #   only for interface compatibility with {Metrics::Base}
+      def initialize(embedder: nil, judge: nil)
+        super(judge: judge)
+        @embedder = embedder
+      end
+
+      # @param answer [String] candidate text (typically the model's answer)
+      # @param ground_truth [String] reference text
+      # @return [Hash] +:score+ (Float 0.0–1.0) and +:details+ (Hash)
+      # @raise [EvalRuby::Error] if no embedder is configured
+      def call(answer:, ground_truth:, **_kwargs)
+        raise EvalRuby::Error, "SemanticSimilarity requires an embedder. Pass `embedder:` in the constructor." unless @embedder
+
+        if answer.to_s.strip.empty? || ground_truth.to_s.strip.empty?
+          return {score: 0.0, details: {reason: :empty_input}}
+        end
+
+        vectors = @embedder.call([answer.to_s, ground_truth.to_s])
+        unless vectors.is_a?(Array) && vectors.length == 2
+          raise EvalRuby::Error, "Embedder returned #{vectors.is_a?(Array) ? vectors.length : vectors.class} vectors; expected 2"
+        end
+
+        cosine = cosine_similarity(vectors[0], vectors[1])
+
+        {
+          score: cosine.clamp(0.0, 1.0),
+          details: {cosine: cosine, model: @embedder.model}
+        }
+      end
+
+      private
+
+      def cosine_similarity(a, b)
+        raise EvalRuby::Error, "Embedding vector dimension mismatch: #{a.length} vs #{b.length}" unless a.length == b.length
+
+        dot = 0.0
+        norm_a = 0.0
+        norm_b = 0.0
+        a.each_with_index do |x, i|
+          y = b[i]
+          dot += x * y
+          norm_a += x * x
+          norm_b += y * y
+        end
+
+        return 0.0 if norm_a.zero? || norm_b.zero?
+
+        dot / (Math.sqrt(norm_a) * Math.sqrt(norm_b))
+      end
+    end
+  end
+end
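The private `cosine_similarity` helper is ordinary vector math and can be sanity-checked outside the gem. This standalone sketch (an equivalent reimplementation, not the gem's code) confirms the two boundary cases: identical vectors score 1.0, orthogonal vectors score 0.0, and the zero-vector guard returns 0.0 instead of dividing by zero:

```ruby
# Equivalent cosine-similarity formula, restated compactly for verification.
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |y| y * y })
  return 0.0 if norm_a.zero? || norm_b.zero?  # same guard as the metric
  dot / (norm_a * norm_b)
end

identical  = cosine([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
degenerate = cosine([0.0, 0.0], [1.0, 1.0])

puts identical.round(2) # => 1.0
puts orthogonal         # => 0.0
puts degenerate         # => 0.0
```

Since `score` is the cosine clamped to 0.0–1.0, antonym-like pairs with a slightly negative cosine floor out at 0.0 while `details[:cosine]` keeps the raw value.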
data/lib/eval_ruby/version.rb
CHANGED
data/lib/eval_ruby.rb
CHANGED
@@ -5,12 +5,15 @@ require_relative "eval_ruby/configuration"
 require_relative "eval_ruby/judges/base"
 require_relative "eval_ruby/judges/openai"
 require_relative "eval_ruby/judges/anthropic"
+require_relative "eval_ruby/embedders/base"
+require_relative "eval_ruby/embedders/openai"
 require_relative "eval_ruby/metrics/base"
 require_relative "eval_ruby/metrics/faithfulness"
 require_relative "eval_ruby/metrics/relevance"
 require_relative "eval_ruby/metrics/correctness"
 require_relative "eval_ruby/metrics/context_precision"
 require_relative "eval_ruby/metrics/context_recall"
+require_relative "eval_ruby/metrics/semantic_similarity"
 require_relative "eval_ruby/metrics/precision_at_k"
 require_relative "eval_ruby/metrics/recall_at_k"
 require_relative "eval_ruby/metrics/mrr"
@@ -48,6 +51,17 @@ module EvalRuby
   class TimeoutError < Error; end
   class InvalidResponseError < Error; end
 
+  # Progress snapshot yielded to the block passed to {.evaluate_batch}.
+  # @!attribute current [Integer] number of samples completed (1-indexed)
+  # @!attribute total [Integer] total samples in the batch
+  # @!attribute elapsed [Float] seconds since batch started
+  Progress = Struct.new(:current, :total, :elapsed, keyword_init: true) do
+    # @return [Float] completion percentage, 0.0–100.0
+    def percent
+      total.zero? ? 0.0 : (current.to_f / total * 100).round(2)
+    end
+  end
+
   class << self
     # @return [Configuration] the current configuration
     def configuration
@@ -101,16 +115,26 @@ module EvalRuby
 
     # Evaluates a batch of samples, optionally running them through a pipeline.
     #
+    # If a block is given, it is called after each sample with a {Progress}
+    # snapshot, useful for rendering progress bars or writing incremental logs.
+    #
     # @param dataset [Dataset, Array<Hash>] samples to evaluate
     # @param pipeline [#query, nil] optional RAG pipeline to run queries through
+    # @yieldparam progress [Progress] progress snapshot after each sample
     # @return [Report]
-
+    #
+    # @example With progress callback
+    #   EvalRuby.evaluate_batch(dataset) do |progress|
+    #     puts "#{progress.current}/#{progress.total} (#{progress.percent}%)"
+    #   end
+    def evaluate_batch(dataset, pipeline: nil, &progress_block)
       samples = dataset.is_a?(Dataset) ? dataset.samples : dataset
       evaluator = Evaluator.new
       start_time = Time.now
+      total = samples.size
 
-      results = samples.map do |sample|
-        if pipeline
+      results = samples.each_with_index.map do |sample, i|
+        result = if pipeline
           response = pipeline.query(sample[:question])
           evaluator.evaluate(
             question: sample[:question],
@@ -121,6 +145,14 @@ module EvalRuby
         else
           evaluator.evaluate(**sample.slice(:question, :answer, :context, :ground_truth))
         end
+
+        progress_block&.call(Progress.new(
+          current: i + 1,
+          total: total,
+          elapsed: Time.now - start_time
+        ))
+
+        result
       end
 
       Report.new(results: results, samples: samples, duration: Time.now - start_time)
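The `percent` arithmetic on `Progress` is easy to verify in isolation. This sketch restates the struct exactly as added in the diff and checks both the normal case and the zero-total guard:

```ruby
# Progress struct as introduced in this release; percent rounds to 2 places
# and returns 0.0 for an empty batch instead of dividing by zero.
Progress = Struct.new(:current, :total, :elapsed, keyword_init: true) do
  def percent
    total.zero? ? 0.0 : (current.to_f / total * 100).round(2)
  end
end

p Progress.new(current: 3, total: 8, elapsed: 1.5).percent # => 37.5
p Progress.new(current: 1, total: 3, elapsed: 0.4).percent # => 33.33
p Progress.new(current: 0, total: 0, elapsed: 0.0).percent # => 0.0
```

A progress block passed to `evaluate_batch` receives one of these snapshots after each sample, so a simple progress bar is just a `print` inside the block.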
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: eval-ruby
 version: !ruby/object:Gem::Version
-  version: 0.2.0
+  version: 0.3.0
 platform: ruby
 authors:
 - Johannes Dwi Cahyo
@@ -72,6 +72,7 @@ executables: []
 extensions: []
 extra_rdoc_files: []
 files:
+- CHANGELOG.md
 - Gemfile
 - Gemfile.lock
 - LICENSE
@@ -83,6 +84,8 @@ files:
 - lib/eval_ruby/comparison.rb
 - lib/eval_ruby/configuration.rb
 - lib/eval_ruby/dataset.rb
+- lib/eval_ruby/embedders/base.rb
+- lib/eval_ruby/embedders/openai.rb
 - lib/eval_ruby/evaluator.rb
 - lib/eval_ruby/judges/anthropic.rb
 - lib/eval_ruby/judges/base.rb
@@ -97,6 +100,7 @@ files:
 - lib/eval_ruby/metrics/precision_at_k.rb
 - lib/eval_ruby/metrics/recall_at_k.rb
 - lib/eval_ruby/metrics/relevance.rb
+- lib/eval_ruby/metrics/semantic_similarity.rb
 - lib/eval_ruby/minitest.rb
 - lib/eval_ruby/report.rb
 - lib/eval_ruby/result.rb