ask-eval 0.1.0 → 0.1.1
- checksums.yaml +4 -4
- data/CHANGELOG.md +5 -1
- data/README.md +57 -3
- data/lib/ask/eval/version.rb +1 -1
- metadata +1 -1
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: bade74b9a66f3d955fea90e17535033015f504f0b6221c332a07c7c947a486c1
|
|
4
|
+
data.tar.gz: 228d85c034b3f9f50fef305c0a5422179959e98906cc3262306bde76219bbabe
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: f393cad79fb781b4b76caa6e3bc036dd8b2787d6389dd9b59bdf23d01f025fbd5dbda0e4b4068ef3abd313e62f1f4c063abee67876b56c5c860a16ee775c3c5b
|
|
7
|
+
data.tar.gz: e8b4d2cb025fbb53cf9d76db336b636c6583c759a76e4b90771e384064e7e57926b4e526dc78c7d0689f7d5b29289998ab4c789345a3e35d29373cf5ac433c19
|
data/CHANGELOG.md
CHANGED
|
@@ -1,3 +1,7 @@
|
|
|
1
|
+
## [0.1.1] - 2026-06-25
|
|
2
|
+
|
|
3
|
+
### Changed
|
|
4
|
+
- Expanded tests: Runner(12t), TestCase(9t), DSL(29t), Configuration(10t), MinitestPlugin(20t), Reporters(16t). Infrastructure: rubocop, overcommit, CI matrix, gemspec, SimpleCov.
|
|
1
5
|
# Changelog
|
|
2
6
|
|
|
3
7
|
## [0.1.0] - 2026-06-10
|
|
@@ -13,7 +17,7 @@
|
|
|
13
17
|
- Batch evaluation runner (`Ask::Eval::Runner`)
|
|
14
18
|
- CI reporters: Console, JUnit XML, GitHub Actions annotations
|
|
15
19
|
- Cost tracking with `CostTracker` — per-model pricing, summary reports
|
|
20
|
+
- Custom judge API — subclass `Ask::Eval::Judge` with `#call`, `#system_prompt`, `#user_message`
|
|
16
21
|
- Zero runtime dependencies — deterministic assertions work standalone
|
|
17
22
|
- Optional ask-llm-providers integration for judge models
|
|
18
23
|
- Tests: 88 minitest tests covering all components
|
|
19
|
-
</RUBY>
|
data/README.md
CHANGED
|
@@ -138,9 +138,9 @@ bundle exec rake test
|
|
|
138
138
|
|
|
139
139
|
## Design Philosophy
|
|
140
140
|
|
|
141
|
-
**This gem
|
|
141
|
+
**This gem is NOT a port of ruby_llm-tribunal.** See the comparison below:
|
|
142
142
|
|
|
143
|
-
| ruby_llm-tribunal
|
|
143
|
+
| ruby_llm-tribunal | ask-eval |
|
|
144
144
|
|---|---|
|
|
145
145
|
| Standalone evaluator with its own API | **Minitest-native assertions** — drops into existing tests |
|
|
146
146
|
| 10 judges (including niche: jailbreak, PII, refusal) | **5 essential judges** — faithful, hallucination, bias, toxicity, correctness |
|
|
@@ -148,10 +148,64 @@ bundle exec rake test
|
|
|
148
148
|
| Dataset management, red teaming, custom judges | **No datasets, no red teaming.** Focus on what matters for 80% of users. |
|
|
149
149
|
| Tied to RubyLLM for judge model | **Any model as judge** — cheap gpt-4o-mini, accurate claude, or local |
|
|
150
150
|
| Cost tracking: none | **Cost tracking per evaluation** |
|
|
151
|
-
| Snapshot testing: none | **Eval snapshots for regression detection** |
|
|
151
|
+
| Snapshot testing: none | **Eval snapshots for regression detection** (v0.2.0) |
|
|
152
152
|
| Test framework integration: requires include | **Minitest plugin** — auto-loads with `require "ask/eval/minitest"` |
|
|
153
153
|
|
|
154
|
+
|
|
155
|
+
|
|
154
156
|
## License
|
|
155
157
|
|
|
156
158
|
MIT
|
|
157
159
|
</RUBY>
|
|
160
|
+
|
|
161
|
+
|
|
162
|
+
## Custom Judges
|
|
163
|
+
|
|
164
|
+
The 5 built-in judges cover common cases, but you can create your own by
|
|
165
|
+
subclassing `Ask::Eval::Judge`:
|
|
166
|
+
|
|
167
|
+
```ruby
|
|
168
|
+
class BrandVoiceJudge < Ask::Eval::Judge
|
|
169
|
+
  def call(tc)
|
|
170
|
+
    query_judge(tc)
|
|
171
|
+
  end
|
|
172
|
+
|
|
173
|
+
  private
|
|
174
|
+
|
|
175
|
+
  def system_prompt
|
|
176
|
+
    <<~PROMPT
|
|
177
|
+
      You are a brand voice evaluator. Determine if the response matches our guidelines:
|
|
178
|
+
      - Friendly but professional tone
|
|
179
|
+
      - No jargon or technical terms
|
|
180
|
+
      - Empathetic and helpful
|
|
181
|
+
|
|
182
|
+
      Respond in JSON format:
|
|
183
|
+
      { "passed": true/false, "score": 0.0-1.0, "reason": "..." }
|
|
184
|
+
    PROMPT
|
|
185
|
+
  end
|
|
186
|
+
|
|
187
|
+
  def user_message(tc)
|
|
188
|
+
    "Response to evaluate: " + tc.actual_output
|
|
189
|
+
  end
|
|
190
|
+
end
|
|
191
|
+
|
|
192
|
+
# Use it directly
|
|
193
|
+
judge = BrandVoiceJudge.new(model: my_model)
|
|
194
|
+
result = judge.call(Ask::Eval::TestCase.new(actual_output: response))
|
|
195
|
+
puts result.reason if result.passed?
|
|
196
|
+
```
|
|
197
|
+
|
|
198
|
+
### Using a lambda for custom evaluation
|
|
199
|
+
|
|
200
|
+
For simple checks, pass a callable directly as the `model:` parameter --
|
|
201
|
+
you do not need a full judge class:
|
|
202
|
+
|
|
203
|
+
```ruby
|
|
204
|
+
assert_faithful response, context: docs, model: ->(messages) {
|
|
205
|
+
  { content: JSON.generate({ passed: true, score: 1.0, reason: "All good" }) }
|
|
206
|
+
}
|
|
207
|
+
```
|
|
208
|
+
|
|
209
|
+
No registration system needed. Subclassing `Judge` and implementing
|
|
210
|
+
`#call`, `#system_prompt`, and `#user_message` is the entire API.
|
|
211
|
+
|
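The callable-model contract shown in the README addition above (a lambda that receives `messages` and returns a hash whose `:content` is a JSON verdict) can be exercised without the gem. The following is a minimal sketch under stated assumptions, not ask-eval's implementation: the `{ role:, content: }` message shape, the `keyword_model` lambda, and the `parse_verdict` helper are all hypothetical and exist only for illustration.

```ruby
require "json"

# Hypothetical stand-in for a judge model: any object responding to #call.
# Assumption: messages is an array of hashes with :role and :content keys.
keyword_model = lambda do |messages|
  text = messages.map { |m| m[:content].to_s }.join(" ")
  passed = !text.match?(/synergy|leverage/i)

  # Same reply shape as the README lambda: a hash with a JSON string in :content.
  {
    content: JSON.generate(
      passed: passed,
      score: passed ? 1.0 : 0.0,
      reason: passed ? "No jargon found" : "Jargon found"
    )
  }
end

# Hypothetical helper (not part of ask-eval): decode the JSON verdict.
def parse_verdict(reply)
  JSON.parse(reply.fetch(:content), symbolize_names: true)
end

ok  = parse_verdict(keyword_model.call([{ role: "user", content: "Happy to help with your order!" }]))
bad = parse_verdict(keyword_model.call([{ role: "user", content: "Let's leverage synergy." }]))

puts ok[:reason]
puts bad[:reason]
```

A deterministic stub like this may be useful in unit tests where calling a real judge model would be slow or costly; whether ask-eval passes messages in exactly this shape is not confirmed by the diff.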
data/lib/ask/eval/version.rb
CHANGED