eval-ruby 0.1.1 → 0.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +60 -0
- data/Gemfile.lock +2 -2
- data/MILESTONES.md +13 -0
- data/README.md +39 -0
- data/lib/eval_ruby/comparison.rb +18 -1
- data/lib/eval_ruby/configuration.rb +37 -2
- data/lib/eval_ruby/dataset.rb +118 -13
- data/lib/eval_ruby/embedders/base.rb +29 -0
- data/lib/eval_ruby/embedders/openai.rb +83 -0
- data/lib/eval_ruby/evaluator.rb +36 -0
- data/lib/eval_ruby/judges/anthropic.rb +8 -0
- data/lib/eval_ruby/judges/base.rb +11 -0
- data/lib/eval_ruby/judges/openai.rb +8 -0
- data/lib/eval_ruby/metrics/base.rb +8 -0
- data/lib/eval_ruby/metrics/context_precision.rb +10 -0
- data/lib/eval_ruby/metrics/context_recall.rb +10 -0
- data/lib/eval_ruby/metrics/correctness.rb +13 -0
- data/lib/eval_ruby/metrics/faithfulness.rb +10 -0
- data/lib/eval_ruby/metrics/mrr.rb +8 -0
- data/lib/eval_ruby/metrics/ndcg.rb +10 -0
- data/lib/eval_ruby/metrics/precision_at_k.rb +9 -0
- data/lib/eval_ruby/metrics/recall_at_k.rb +9 -0
- data/lib/eval_ruby/metrics/relevance.rb +10 -0
- data/lib/eval_ruby/metrics/semantic_similarity.rb +72 -0
- data/lib/eval_ruby/report.rb +38 -1
- data/lib/eval_ruby/result.rb +29 -1
- data/lib/eval_ruby/rspec.rb +48 -6
- data/lib/eval_ruby/version.rb +1 -1
- data/lib/eval_ruby.rb +87 -3
- metadata +6 -1
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 6ba2db8c803feba5276939550bf2bd3c9dca548f2eca3223ce1f2df39e83bfa8
+  data.tar.gz: 9a3a0a5dbb7eda4d64669e11a5b76c20f43fd0cdb3946934f190475cc460ccd3
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: a6b52e6caf04da090f8595808f76e8edaa901b7eec90178054de0f2c4d8d27c1e2b46f8c154961b91c5ef3a849e93cbbc411df77e4ec2919ea48dd7ead869a1a
+  data.tar.gz: 06eb01c2e177dc263dcdf69ea38debca5989fc6d55e3be31e50c6b7e5bc8539d06df90b815b35680f8bc292d8bc44cebf38496dde5a65cea9cd8f4b3cedee18f
data/CHANGELOG.md ADDED
@@ -0,0 +1,60 @@
+# Changelog
+
+All notable changes to this project are documented in this file.
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [0.3.0] - 2026-04-13
+
+Reshape release — the original v0.3.0 scope (cost tracking, async/parallel batch, HTML reports) was redistributed to a new **v0.4.0 "Batch, Reporting & Cost"** milestone, and the previous v0.4.0 "Testing Framework Integration" work slid to **v0.5.0**. This release instead prioritizes the API surface needed by the omnibot Wicara eval integration, plus the two quickly actionable items from the original v0.3.0 plan.
+
+### Added
+- `EvalRuby::Embedders::Base` — abstract base class for pluggable embedding backends, mirroring the existing `Judges::Base` pattern.
+- `EvalRuby::Embedders::OpenAI` — OpenAI embeddings backend (`/v1/embeddings`) with retry/timeout handling and out-of-order response reassembly.
+- `EvalRuby::Metrics::SemanticSimilarity` — Ragas-style answer-similarity metric. Computes cosine similarity between answer and ground-truth embeddings via an injected embedder. Judge-free, fast, deterministic; ideal for chatbot regression testing.
+- `Configuration#embedder_llm`, `#embedder_model`, `#embedder_api_key` — new keys controlling the embedder. `embedder_api_key` falls back to `api_key` when unset, so most users only configure one OpenAI key.
+- `EvalRuby.evaluate_batch(dataset) { |progress| ... }` — block form that yields an `EvalRuby::Progress` struct (`current`, `total`, `elapsed`, `percent`) after each sample. Backwards compatible — batch calls without a block behave exactly as before.
+
+### Changed
+- `Dataset.generate` hardened:
+  - validates `questions_per_doc` is a positive integer
+  - validates document paths exist (raises a clear error instead of a `File.read` crash)
+  - expands directory paths via `Dir.glob(**/*)` to support "scan this folder" workflows
+  - accepts a single path string, not just an array
+  - tolerates malformed LLM responses (missing `pairs`, non-array `pairs`, non-hash entries, missing `question`/`answer`) — skips the bad pair rather than crashing the whole generation
+  - tolerates a judge raising mid-batch — logs the failure as a skip and continues with the remaining documents
+  - accepts an injected `judge:` parameter for testing (and for custom judge plumbing)
+
+### Notes
+- `SemanticSimilarity` is **opt-in** — not part of the default `Evaluator` roster. Instantiate it directly when you want reference-based scoring without an LLM judge.
+- Deferred to v0.4.0: cost tracking per evaluation (#10), async/parallel batch evaluation (#11), HTML report generation (#12).
+- Deferred to v0.5.0: CI/test-framework integration (JUnit XML, regression detection, GitHub Actions workflow, expanded matchers/assertions — formerly v0.4.0).
+
+## [0.2.0] - 2026-03-17
+
+### Added
+- Comprehensive test suite covering all metrics, judges, datasets, reports, and error paths.
+- YARD documentation across all public APIs.
+- RSpec matchers and Minitest assertions for integration in user test suites.
+- A/B comparison reports with statistical significance testing.
+
+## [0.1.1] - 2026-03-10
+
+### Fixed
+- Transient API failures now retry with exponential backoff (previously a single timeout raised immediately).
+- `Judges::OpenAI#initialize` rejects missing/empty API keys up front with a clear error message.
+- String-context metrics now handle strings passed in place of arrays without crashing.
+- Standard-deviation computation in `Report#summary` no longer divides by zero for single-sample reports.
+
+### Added
+- Error subclasses (`APIError`, `TimeoutError`, `InvalidResponseError`) so callers can rescue at the right granularity.
+
+## [0.1.0] - 2026-03-09
+
+- Initial release.
+- LLM-as-judge metrics: faithfulness, relevance, correctness, context precision, context recall.
+- Retrieval metrics: precision@k, recall@k, MRR, NDCG, hit rate.
+- OpenAI and Anthropic judge backends.
+- `Dataset` with CSV/JSON import and export.
+- `Report` with per-metric summary, worst-cases, failure filtering, and CSV export.
+- `Configuration` DSL for judge model, API key, threshold, timeout, retries.
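The block form of `evaluate_batch` added in 0.3.0 yields a `Progress` struct with `current`, `total`, `elapsed`, and `percent`. A minimal sketch of how such a struct can behave (field names taken from the changelog; the implementation below is illustrative, not the gem's actual code):

```ruby
# Sketch of a Progress struct matching the fields named in the changelog.
# `percent` is derived from current/total; elapsed is wall-clock seconds.
Progress = Struct.new(:current, :total, :elapsed, keyword_init: true) do
  def percent
    return 0.0 if total.zero?
    (current.to_f / total * 100).round(1)
  end
end

updates = []
start = Time.now
3.times do |i|
  # ... evaluate sample i here ...
  updates << Progress.new(current: i + 1, total: 3, elapsed: Time.now - start)
end

updates.map(&:percent) # => [33.3, 66.7, 100.0]
```

A caller's block would typically print `"#{progress.current}/#{progress.total} (#{progress.percent}%)"` after each sample.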
data/Gemfile.lock CHANGED
@@ -1,7 +1,7 @@
 PATH
   remote: .
   specs:
-    eval-ruby (0.
+    eval-ruby (0.2.0)
       csv
 
 GEM
@@ -39,7 +39,7 @@ CHECKSUMS
   bigdecimal (4.0.1) sha256=8b07d3d065a9f921c80ceaea7c9d4ae596697295b584c296fe599dd0ad01c4a7
   crack (1.0.1) sha256=ff4a10390cd31d66440b7524eb1841874db86201d5b70032028553130b6d4c7e
   csv (3.3.5) sha256=6e5134ac3383ef728b7f02725d9872934f523cb40b961479f69cf3afa6c8e73f
-  eval-ruby (0.
+  eval-ruby (0.2.0)
   hashdiff (1.2.1) sha256=9c079dbc513dfc8833ab59c0c2d8f230fa28499cc5efb4b8dd276cf931457cd1
   minitest (5.27.0) sha256=2d3b17f8a36fe7801c1adcffdbc38233b938eb0b4966e97a6739055a45fa77d5
   public_suffix (7.0.5) sha256=1a8bb08f1bbea19228d3bed6e5ed908d1cb4f7c2726d18bd9cadf60bc676f623
data/MILESTONES.md ADDED
data/README.md CHANGED
@@ -50,6 +50,45 @@ result.overall # => 0.94
 - **NDCG** (Normalized Discounted Cumulative Gain)
 - **Hit Rate**
 
+### Embedding-Based
+
+- **Semantic Similarity** — cosine similarity between answer and ground truth via a pluggable embedder. Judge-free, fast, deterministic; useful for chatbot regression testing where you want a reference-based score without an LLM call.
+
+## Semantic Similarity
+
+`SemanticSimilarity` is opt-in (not part of the default `Evaluator` roster in v0.3.0). Instantiate it directly when you need reference-based scoring without an LLM judge — for example, scoring a chatbot's actual response against a fixed expected response.
+
+```ruby
+EvalRuby.configure do |config|
+  config.api_key = ENV["OPENAI_API_KEY"]           # shared with judge by default
+  config.embedder_model = "text-embedding-3-small" # default; also supports text-embedding-3-large
+  # config.embedder_api_key = ENV["OTHER_KEY"]     # optional; falls back to api_key
+end
+
+embedder = EvalRuby::Embedders::OpenAI.new(EvalRuby.configuration)
+metric = EvalRuby::Metrics::SemanticSimilarity.new(embedder: embedder)
+
+result = metric.call(
+  answer: "Paris is the capital of France",
+  ground_truth: "The capital of France is Paris"
+)
+
+result[:score]            # => 0.92
+result[:details][:cosine] # => 0.92 (raw, pre-clamp)
+result[:details][:model]  # => "text-embedding-3-small"
+```
+
+**When to use `SemanticSimilarity` vs `Correctness`:**
+
+| | `Correctness` | `SemanticSimilarity` |
+|---|---|---|
+| Backend | LLM judge (GPT-4, Claude, …) | Embeddings + cosine |
+| Cost per call | $$ (judge LLM tokens) | $ (embedding tokens) |
+| Latency | High (LLM generation) | Low (embedding lookup) |
+| Determinism | Low (model-dependent) | High |
+| Reasoning | Natural-language rationale in details | Raw cosine value |
+| Best for | Nuanced/subjective answers | Regression tests, bulk scoring |
+
 ## Retrieval Evaluation
 
 ```ruby
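The `SemanticSimilarity` score shown in the README reduces to cosine similarity between two embedding vectors. A toy cosine function, assuming plain `Float` arrays as an embedder would return (illustrative, not the gem's internal code):

```ruby
# Cosine similarity between two equal-length numeric vectors:
# dot(a, b) / (|a| * |b|). Identical directions score 1.0, orthogonal 0.0.
def cosine(a, b)
  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

cosine([1.0, 0.0], [1.0, 0.0]) # => 1.0
cosine([1.0, 0.0], [0.0, 1.0]) # => 0.0
```

Real embedding vectors (e.g. 1536 dimensions for `text-embedding-3-small`) follow the same arithmetic, just with longer arrays.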
data/lib/eval_ruby/comparison.rb CHANGED
@@ -1,14 +1,27 @@
 # frozen_string_literal: true
 
 module EvalRuby
+  # Statistical comparison of two evaluation reports using paired t-tests.
+  #
+  # @example
+  #   comparison = EvalRuby.compare(report_a, report_b)
+  #   puts comparison.summary
+  #   comparison.significant_improvements # => [:faithfulness]
   class Comparison
-
+    # @return [Report] baseline report
+    attr_reader :report_a
 
+    # @return [Report] comparison report
+    attr_reader :report_b
+
+    # @param report_a [Report] baseline
+    # @param report_b [Report] comparison
     def initialize(report_a, report_b)
       @report_a = report_a
       @report_b = report_b
     end
 
+    # @return [String] formatted comparison table with deltas and p-values
     def summary
       lines = [
         format("%-20s | %-10s | %-10s | %-8s | %s", "Metric", "A", "B", "Delta", "p-value"),
@@ -35,6 +48,10 @@ module EvalRuby
       lines.join("\n")
     end
 
+    # Returns metrics where report_b is significantly better than report_a.
+    #
+    # @param alpha [Float] significance level (default 0.05)
+    # @return [Array<Symbol>] metric names with significant improvements
    def significant_improvements(alpha: 0.05)
      all_metrics.select do |metric|
        scores_a = @report_a.results.filter_map { |r| r.scores[metric] }
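`significant_improvements` rests on a paired t-test over per-sample score differences. A sketch of the paired t-statistic in plain Ruby (the gem's internals may differ; the p-value step, which needs the t-distribution CDF, is omitted here):

```ruby
# Paired t-statistic: t = mean(d) / (sd(d) / sqrt(n)), where d are the
# per-sample differences scores_b[i] - scores_a[i].
def paired_t(scores_a, scores_b)
  diffs = scores_b.zip(scores_a).map { |b, a| b - a }
  n     = diffs.length.to_f
  mean  = diffs.sum / n
  var   = diffs.sum { |d| (d - mean)**2 } / (n - 1) # sample variance
  mean / Math.sqrt(var / n)
end

paired_t([0.5, 0.6, 0.7], [1.5, 2.6, 3.7]) # => ~3.46
```

A large |t| relative to the t-distribution's critical value at `alpha` marks the metric's change as significant.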
data/lib/eval_ruby/configuration.rb CHANGED
@@ -1,14 +1,49 @@
 # frozen_string_literal: true
 
 module EvalRuby
+  # Global configuration for EvalRuby.
+  #
+  # @example
+  #   EvalRuby.configure do |config|
+  #     config.judge_llm = :openai
+  #     config.api_key = ENV["OPENAI_API_KEY"]
+  #     config.judge_model = "gpt-4o"
+  #   end
   class Configuration
-
-
+    # @return [Symbol] LLM provider for judge (:openai or :anthropic)
+    attr_accessor :judge_llm
+
+    # @return [String] model name for the judge LLM
+    attr_accessor :judge_model
+
+    # @return [String, nil] API key for the judge LLM provider
+    attr_accessor :api_key
+
+    # @return [Symbol] embedding provider (:openai in v0.3.0)
+    attr_accessor :embedder_llm
+
+    # @return [String] model name for the embedder
+    attr_accessor :embedder_model
+
+    # @return [String, nil] API key for the embedder; falls back to {#api_key} when nil
+    attr_accessor :embedder_api_key
+
+    # @return [Float] default threshold for pass/fail decisions
+    attr_accessor :default_threshold
+
+    # @return [Integer] HTTP request timeout in seconds
+    attr_accessor :timeout
+
+    # @return [Integer] maximum number of retries on transient failures
+    attr_accessor :max_retries
 
     def initialize
       @judge_llm = :openai
       @judge_model = "gpt-4o"
       @api_key = nil
+      @embedder_llm = :openai
+      @embedder_model = "text-embedding-3-small"
+      @embedder_api_key = nil
       @default_threshold = 0.7
       @timeout = 30
       @max_retries = 3
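The `embedder_api_key` fallback documented above reduces to a single `||`. Shown here in isolation with a plain hash standing in for the `Configuration` object (key names from the diff; the helper is hypothetical):

```ruby
# embedder_api_key wins when set; otherwise the judge's api_key is reused,
# so most users configure only one OpenAI key.
def effective_embedder_key(config)
  config[:embedder_api_key] || config[:api_key]
end

effective_embedder_key(api_key: "sk-judge", embedder_api_key: nil)        # => "sk-judge"
effective_embedder_key(api_key: "sk-judge", embedder_api_key: "sk-embed") # => "sk-embed"
```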
data/lib/eval_ruby/dataset.rb CHANGED
@@ -4,16 +4,36 @@ require "csv"
 require "json"
 
 module EvalRuby
+  # Collection of evaluation samples with import/export support.
+  # Supports CSV, JSON, and programmatic construction.
+  #
+  # @example
+  #   dataset = EvalRuby::Dataset.new("my_test_set")
+  #   dataset.add(question: "What is Ruby?", answer: "A language", ground_truth: "A language")
+  #   report = EvalRuby.evaluate_batch(dataset)
   class Dataset
     include Enumerable
 
-
+    # @return [String] dataset name
+    attr_reader :name
 
+    # @return [Array<Hash>] sample entries
+    attr_reader :samples
+
+    # @param name [String] dataset name
     def initialize(name = "default")
       @name = name
       @samples = []
     end
 
+    # Adds a sample to the dataset.
+    #
+    # @param question [String]
+    # @param ground_truth [String, nil]
+    # @param relevant_contexts [Array<String>] alias for context
+    # @param answer [String, nil]
+    # @param context [Array<String>]
+    # @return [self]
     def add(question:, ground_truth: nil, relevant_contexts: [], answer: nil, context: [])
       @samples << {
         question: question,
@@ -24,18 +44,26 @@ module EvalRuby
       self
     end
 
+    # @yield [Hash] each sample
     def each(&block)
       @samples.each(&block)
     end
 
+    # @return [Integer] number of samples
     def size
       @samples.size
     end
 
+    # @param index [Integer]
+    # @return [Hash] sample at index
     def [](index)
       @samples[index]
     end
 
+    # Loads a dataset from a CSV file.
+    #
+    # @param path [String] path to CSV file
+    # @return [Dataset]
     def self.from_csv(path)
       dataset = new(File.basename(path, ".*"))
       CSV.foreach(path, headers: true) do |row|
@@ -49,6 +77,10 @@ module EvalRuby
       dataset
     end
 
+    # Loads a dataset from a JSON file.
+    #
+    # @param path [String] path to JSON file
+    # @return [Dataset]
     def self.from_json(path)
       dataset = new(File.basename(path, ".*"))
       data = JSON.parse(File.read(path))
@@ -64,6 +96,10 @@ module EvalRuby
       dataset
     end
 
+    # Exports dataset to CSV.
+    #
+    # @param path [String] output file path
+    # @return [void]
    def to_csv(path)
      CSV.open(path, "w") do |csv|
        csv << %w[question answer context ground_truth]
@@ -78,21 +114,40 @@ module EvalRuby
       end
     end
 
+    # Exports dataset to JSON.
+    #
+    # @param path [String] output file path
+    # @return [void]
     def to_json(path)
       File.write(path, JSON.pretty_generate({name: @name, samples: @samples}))
     end
 
-
-
-
-
-
-
-
+    # Generates a dataset from documents using an LLM.
+    #
+    # Each document is read, passed to the LLM with a prompt asking for
+    # +questions_per_doc+ QA pairs, and the resulting pairs are appended to
+    # the dataset. Directory paths are expanded via +Dir.glob+. Missing paths
+    # and malformed LLM responses raise or are skipped gracefully rather than
+    # crashing the whole generation.
+    #
+    # @param documents [String, Array<String>] file paths or directory paths
+    # @param questions_per_doc [Integer] number of QA pairs per document (must be > 0)
+    # @param llm [Symbol] LLM provider (:openai or :anthropic)
+    # @param judge [Judges::Base, nil] inject a pre-built judge (mainly for testing)
+    # @return [Dataset]
+    # @raise [EvalRuby::Error] on bad arguments or no documents found
+    def self.generate(documents:, questions_per_doc: 5, llm: :openai, judge: nil)
+      unless questions_per_doc.is_a?(Integer) && questions_per_doc.positive?
+        raise Error, "questions_per_doc must be a positive integer, got #{questions_per_doc.inspect}"
       end
 
+      document_paths = expand_document_paths(documents)
+      raise Error, "No documents found in the provided paths" if document_paths.empty?
+
+      judge ||= build_judge_for(llm)
+
       dataset = new("generated")
-
+      document_paths.each do |doc_path|
         content = File.read(doc_path)
         prompt = <<~PROMPT
           Given the following document, generate #{questions_per_doc} question-answer pairs
@@ -104,14 +159,19 @@ module EvalRuby
           Respond in JSON: {"pairs": [{"question": "...", "answer": "...", "context": "relevant excerpt"}]}
         PROMPT
 
-
-
+        begin
+          result = judge.call(prompt)
+        rescue StandardError
+          next # keep generating from remaining docs; individual failure should not abort the batch
+        end
+
+        extract_pairs(result).each do |pair|
+          next unless valid_pair?(pair)
 
-        result["pairs"].each do |pair|
           dataset.add(
             question: pair["question"],
             answer: pair["answer"],
-            context: [pair["context"]
+            context: [pair["context"].is_a?(String) && !pair["context"].empty? ? pair["context"] : content],
             ground_truth: pair["answer"]
           )
         end
@@ -119,6 +179,51 @@ module EvalRuby
       dataset
     end
 
+    # Expands a list of file/directory paths into a flat list of file paths.
+    # Validates existence — missing paths raise an Error.
+    #
+    # @param paths [String, Array<String>]
+    # @return [Array<String>] absolute-or-relative file paths, each verified to exist
+    # @raise [EvalRuby::Error] if any path does not exist
+    def self.expand_document_paths(paths)
+      result = []
+      Array(paths).each do |path|
+        if File.directory?(path)
+          result.concat(Dir.glob(File.join(path, "**/*")).select { |p| File.file?(p) }.sort)
+        elsif File.file?(path)
+          result << path
+        else
+          raise Error, "Document path does not exist: #{path}"
+        end
+      end
+      result
+    end
+
+    def self.build_judge_for(llm)
+      config = EvalRuby.configuration.dup
+      config.judge_llm = llm
+      case llm
+      when :openai then Judges::OpenAI.new(config)
+      when :anthropic then Judges::Anthropic.new(config)
+      else raise Error, "Unknown LLM: #{llm}"
+      end
+    end
+    private_class_method :build_judge_for
+
+    def self.extract_pairs(result)
+      return [] unless result.is_a?(Hash)
+      pairs = result["pairs"]
+      pairs.is_a?(Array) ? pairs : []
+    end
+    private_class_method :extract_pairs
+
+    def self.valid_pair?(pair)
+      pair.is_a?(Hash) &&
+        pair["question"].is_a?(String) && !pair["question"].strip.empty? &&
+        pair["answer"].is_a?(String) && !pair["answer"].strip.empty?
+    end
+    private_class_method :valid_pair?
+
     private_class_method def self.parse_array_field(value)
       return [] if value.nil? || value.empty?
 
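The malformed-response tolerance in `Dataset.generate` hinges on the `valid_pair?` predicate from the diff. Extracted standalone to show which LLM outputs survive filtering:

```ruby
# Mirrors Dataset.valid_pair? from the diff: a pair survives only if it is
# a Hash with non-blank String "question" and "answer" fields.
def valid_pair?(pair)
  pair.is_a?(Hash) &&
    pair["question"].is_a?(String) && !pair["question"].strip.empty? &&
    pair["answer"].is_a?(String) && !pair["answer"].strip.empty?
end

pairs = [
  {"question" => "What is Ruby?", "answer" => "A language"},
  {"question" => "", "answer" => "orphaned"}, # blank question: skipped
  "not even a hash",                          # wrong type: skipped
  {"question" => "No answer?"}                # missing answer: skipped
]
pairs.select { |p| valid_pair?(p) }.length # => 1
```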
data/lib/eval_ruby/embedders/base.rb ADDED
@@ -0,0 +1,29 @@
+# frozen_string_literal: true
+
+module EvalRuby
+  module Embedders
+    # Abstract base class for text embedders.
+    # Subclasses must implement {#call} to convert a batch of strings
+    # into a batch of float vectors, and {#model} to surface the model
+    # identifier used (shown in metric details).
+    class Base
+      # @param config [Configuration]
+      def initialize(config)
+        @config = config
+      end
+
+      # Embeds a batch of texts.
+      #
+      # @param texts [Array<String>] inputs to embed
+      # @return [Array<Array<Float>>] one vector per input, in the same order
+      def call(texts)
+        raise NotImplementedError, "#{self.class}#call must be implemented"
+      end
+
+      # @return [String] model identifier (e.g. "text-embedding-3-small")
+      def model
+        raise NotImplementedError, "#{self.class}#model must be implemented"
+      end
+    end
+  end
+end
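Since a metric only calls `#call` and `#model` on its embedder, any duck-typed object can stand in. A hypothetical offline embedder for tests (not part of the gem; the hashing scheme is a toy):

```ruby
# Deterministic toy embedder: a 4-dimensional bag-of-characters vector.
# Useful for exercising embedding-based metrics without network calls.
class HashingEmbedder
  def model
    "hashing-demo"
  end

  def call(texts)
    texts.map do |t|
      vec = [0.0, 0.0, 0.0, 0.0]
      t.each_char { |c| vec[c.ord % 4] += 1.0 } # bucket chars by codepoint
      vec
    end
  end
end

HashingEmbedder.new.call(["ab"]) # => [[0.0, 1.0, 1.0, 0.0]]
```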
data/lib/eval_ruby/embedders/openai.rb ADDED
@@ -0,0 +1,83 @@
+# frozen_string_literal: true
+
+require "net/http"
+require "json"
+require "uri"
+
+module EvalRuby
+  module Embedders
+    # OpenAI embeddings backend.
+    # Requires an API key via {Configuration#embedder_api_key} (or falls
+    # back to {Configuration#api_key}). The default model is
+    # +text-embedding-3-small+ (1536 dimensions).
+    class OpenAI < Base
+      API_URL = "https://api.openai.com/v1/embeddings"
+
+      # @param config [Configuration]
+      # @raise [EvalRuby::Error] if no API key is available
+      def initialize(config)
+        super
+        @api_key = @config.embedder_api_key || @config.api_key
+        if @api_key.nil? || @api_key.empty?
+          raise EvalRuby::Error, "API key is required for embedder. Set via EvalRuby.configure { |c| c.embedder_api_key = '...' } or c.api_key = '...'"
+        end
+      end
+
+      # @return [String] configured embedder model
+      def model
+        @config.embedder_model
+      end
+
+      # @param texts [Array<String>] inputs to embed
+      # @return [Array<Array<Float>>] vectors in input order
+      # @raise [EvalRuby::APIError] on non-success HTTP responses
+      # @raise [EvalRuby::TimeoutError] after max retries
+      def call(texts)
+        retries = 0
+        begin
+          uri = URI(API_URL)
+          request = Net::HTTP::Post.new(uri)
+          request["Authorization"] = "Bearer #{@api_key}"
+          request["Content-Type"] = "application/json"
+          request.body = JSON.generate({input: texts, model: model})
+
+          response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true,
+                                     read_timeout: @config.timeout) do |http|
+            http.request(request)
+          end
+
+          unless response.is_a?(Net::HTTPSuccess)
+            raise APIError, "OpenAI embeddings API error: #{response.code} - #{response.body}"
+          end
+
+          parse_vectors(response.body)
+        rescue Net::OpenTimeout, Net::ReadTimeout, Errno::ECONNRESET => e
+          retries += 1
+          if retries <= @config.max_retries
+            sleep(2**(retries - 1))
+            retry
+          end
+          raise EvalRuby::TimeoutError, "Embedder API failed after #{@config.max_retries} retries: #{e.message}"
+        end
+      end
+
+      private
+
+      def parse_vectors(body)
+        parsed = JSON.parse(body)
+        data = parsed["data"]
+        raise InvalidResponseError, "Unexpected embeddings response shape: missing 'data'" unless data.is_a?(Array)
+
+        # OpenAI returns data entries tagged with 'index' to preserve input order;
+        # sort defensively in case the API ever reorders.
+        data.sort_by { |entry| entry["index"].to_i }.map do |entry|
+          vector = entry["embedding"]
+          raise InvalidResponseError, "Embedding entry missing 'embedding' array" unless vector.is_a?(Array)
+          vector
+        end
+      rescue JSON::ParserError => e
+        raise InvalidResponseError, "Failed to parse embeddings response: #{e.message}"
+      end
+    end
+  end
+end
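The embedder's retry loop sleeps `2**(retries - 1)` seconds between attempts, so with the default `max_retries = 3` the backoff schedule works out to:

```ruby
# Exponential backoff delays produced by sleep(2**(retries - 1)):
# attempt 1 waits 1s, attempt 2 waits 2s, attempt 3 waits 4s.
max_retries = 3
delays = (1..max_retries).map { |attempt| 2**(attempt - 1) }
delays # => [1, 2, 4]
```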
data/lib/eval_ruby/evaluator.rb CHANGED
@@ -1,12 +1,25 @@
 # frozen_string_literal: true
 
 module EvalRuby
+  # Runs all configured metrics on a given question/answer/context tuple.
+  #
+  # @example
+  #   evaluator = EvalRuby::Evaluator.new
+  #   result = evaluator.evaluate(question: "...", answer: "...", context: [...])
   class Evaluator
+    # @param config [Configuration] configuration to use
     def initialize(config = EvalRuby.configuration)
       @config = config
       @judge = build_judge(config)
     end
 
+    # Evaluates an LLM response across quality metrics.
+    #
+    # @param question [String] the input question
+    # @param answer [String] the LLM-generated answer
+    # @param context [Array<String>] retrieved context chunks
+    # @param ground_truth [String, nil] expected correct answer
+    # @return [Result]
     def evaluate(question:, answer:, context: [], ground_truth: nil)
       scores = {}
       details = {}
@@ -37,6 +50,12 @@ module EvalRuby
       Result.new(scores: scores, details: details)
     end
 
+    # Evaluates retrieval quality using IR metrics.
+    #
+    # @param question [String] the input question
+    # @param retrieved [Array<String>] retrieved document IDs
+    # @param relevant [Array<String>] ground-truth relevant document IDs
+    # @return [RetrievalResult]
     def evaluate_retrieval(question:, retrieved:, relevant:)
       RetrievalResult.new(retrieved: retrieved, relevant: relevant)
     end
@@ -55,32 +74,49 @@ module EvalRuby
     end
   end
 
+  # Holds retrieval evaluation results with IR metric accessors.
+  #
+  # @example
+  #   result = EvalRuby.evaluate_retrieval(question: "...", retrieved: [...], relevant: [...])
+  #   result.precision_at_k(5) # => 0.6
+  #   result.mrr               # => 1.0
   class RetrievalResult
+    # @param retrieved [Array<String>] retrieved document IDs in ranked order
+    # @param relevant [Array<String>] ground-truth relevant document IDs
    def initialize(retrieved:, relevant:)
      @retrieved = retrieved
      @relevant = relevant
    end
 
+    # @param k [Integer] number of top results to consider
+    # @return [Float] precision at k
    def precision_at_k(k)
      Metrics::PrecisionAtK.new.call(retrieved: @retrieved, relevant: @relevant, k: k)
    end
 
+    # @param k [Integer] number of top results to consider
+    # @return [Float] recall at k
    def recall_at_k(k)
      Metrics::RecallAtK.new.call(retrieved: @retrieved, relevant: @relevant, k: k)
    end
 
+    # @return [Float] mean reciprocal rank
    def mrr
      Metrics::MRR.new.call(retrieved: @retrieved, relevant: @relevant)
    end
 
+    # @param k [Integer, nil] number of top results (nil for all)
+    # @return [Float] normalized discounted cumulative gain
    def ndcg(k: nil)
      Metrics::NDCG.new.call(retrieved: @retrieved, relevant: @relevant, k: k)
    end
 
+    # @return [Float] 1.0 if any relevant doc is retrieved, 0.0 otherwise
    def hit_rate
      @retrieved.any? { |doc| @relevant.include?(doc) } ? 1.0 : 0.0
    end
 
+    # @return [Hash{Symbol => Float}] all retrieval metrics
    def to_h
      {
        precision_at_k: precision_at_k(@retrieved.length),
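For intuition, the ranking metrics that `RetrievalResult` delegates to `Metrics::*` classes can be written in a few lines of plain Ruby (illustrative versions, not the gem's code):

```ruby
# precision@k: fraction of the top-k retrieved docs that are relevant.
def precision_at_k(retrieved, relevant, k)
  retrieved.first(k).count { |doc| relevant.include?(doc) } / k.to_f
end

# MRR: reciprocal rank of the first relevant doc (0.0 if none retrieved).
def mrr(retrieved, relevant)
  rank = retrieved.index { |doc| relevant.include?(doc) }
  rank.nil? ? 0.0 : 1.0 / (rank + 1)
end

retrieved = %w[d3 d1 d7]
relevant  = %w[d1 d2]
precision_at_k(retrieved, relevant, 3) # => ~0.333 (1 relevant in top 3)
mrr(retrieved, relevant)               # => 0.5 (first hit at rank 2)
```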
data/lib/eval_ruby/judges/anthropic.rb CHANGED
@@ -6,14 +6,22 @@ require "uri"
 
 module EvalRuby
   module Judges
+    # Anthropic-based LLM judge using the Messages API.
+    # Requires an API key set via {Configuration#api_key}.
     class Anthropic < Base
       API_URL = "https://api.anthropic.com/v1/messages"
 
+      # @param config [Configuration]
+      # @raise [EvalRuby::Error] if API key is missing
      def initialize(config)
        super
        raise EvalRuby::Error, "API key is required. Set via EvalRuby.configure { |c| c.api_key = '...' }" if @config.api_key.nil? || @config.api_key.empty?
      end
 
+      # @param prompt [String] the evaluation prompt
+      # @return [Hash, nil] parsed JSON response
+      # @raise [EvalRuby::Error] on API errors
+      # @raise [EvalRuby::TimeoutError] after max retries
      def call(prompt)
        retries = 0
        begin
|