ask-eval 0.1.0 → 0.1.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 255e463739c832d6784d161c3f6b26b267dbaaf6e29b27268ac8b49bd3a3fa05
-   data.tar.gz: 842364b8ea52fbae673bd53ebec870627481cac8a9214b98d92476daa609789c
+   metadata.gz: bade74b9a66f3d955fea90e17535033015f504f0b6221c332a07c7c947a486c1
+   data.tar.gz: 228d85c034b3f9f50fef305c0a5422179959e98906cc3262306bde76219bbabe
  SHA512:
-   metadata.gz: dade52a1b23dc7415788d02c858ee3ca2a0c14739a45f2bcb172b72818cc003db215c2bda019cf68fe072fdc8f40d3aa0a1ae8628df98e5e9e52b435c14c0d0b
-   data.tar.gz: e82d90fb63e88693df7f5c4e1c89dbb166ade6617bfa3980a0a7ba1cfdcc355b7d8abcc458882d60a84d701f98c21cb741c65b586399b1826701c061b3123936
+   metadata.gz: f393cad79fb781b4b76caa6e3bc036dd8b2787d6389dd9b59bdf23d01f025fbd5dbda0e4b4068ef3abd313e62f1f4c063abee67876b56c5c860a16ee775c3c5b
+   data.tar.gz: e8b4d2cb025fbb53cf9d76db336b636c6583c759a76e4b90771e384064e7e57926b4e526dc78c7d0689f7d5b29289998ab4c789345a3e35d29373cf5ac433c19
data/CHANGELOG.md CHANGED
@@ -1,3 +1,7 @@
+ ## [0.1.1] - 2026-06-25
+
+ ### Changed
+ - Expanded tests: Runner(12t), TestCase(9t), DSL(29t), Configuration(10t), MinitestPlugin(20t), Reporters(16t). Infrastructure: rubocop, overcommit, CI matrix, gemspec, SimpleCov.
  # Changelog
 
  ## [0.1.0] - 2026-06-10
@@ -13,7 +17,7 @@
  - Batch evaluation runner (`Ask::Eval::Runner`)
  - CI reporters: Console, JUnit XML, GitHub Actions annotations
  - Cost tracking with `CostTracker` — per-model pricing, summary reports
+ - Custom judge API — subclass `Ask::Eval::Judge` with `#call`, `#system_prompt`, `#user_message`
  - Zero runtime dependencies — deterministic assertions work standalone
  - Optional ask-llm-providers integration for judge models
  - Tests: 88 minitest tests covering all components
- </RUBY>
data/README.md CHANGED
@@ -138,9 +138,9 @@ bundle exec rake test
 
  ## Design Philosophy
 
- **This gem should NOT be a port of ruby_llm-tribunal.** See the comparison:
+ **This gem is NOT a port of ruby_llm-tribunal.** See the comparison below:
 
- | ruby_llm-tribunal (~500 lines, 25+ files) | ask-eval (~300 lines, 10 files) |
+ | ruby_llm-tribunal | ask-eval |
  |---|---|
  | Standalone evaluator with its own API | **Minitest-native assertions** — drops into existing tests |
  | 10 judges (including niche: jailbreak, PII, refusal) | **5 essential judges** — faithful, hallucination, bias, toxicity, correctness |
@@ -148,10 +148,64 @@ bundle exec rake test
  | Dataset management, red teaming, custom judges | **No datasets, no red teaming.** Focus on what matters for 80% of users. |
  | Tied to RubyLLM for judge model | **Any model as judge** — cheap gpt-4o-mini, accurate claude, or local |
  | Cost tracking: none | **Cost tracking per evaluation** |
- | Snapshot testing: none | **Eval snapshots for regression detection** |
+ | Snapshot testing: none | **Eval snapshots for regression detection** (v0.2.0) |
  | Test framework integration: requires include | **Minitest plugin** — auto-loads with `require "ask/eval/minitest"` |
 
+
+
  ## License
 
  MIT
  </RUBY>
+
+
+ ## Custom Judges
+
+ The 5 built-in judges cover common cases, but you can create your own by
+ subclassing `Ask::Eval::Judge`:
+
+ ```ruby
+ class BrandVoiceJudge < Ask::Eval::Judge
+   def call(tc)
+     query_judge(tc)
+   end
+
+   private
+
+   def system_prompt
+     <<~PROMPT
+       You are a brand voice evaluator. Determine if the response matches our guidelines:
+       - Friendly but professional tone
+       - No jargon or technical terms
+       - Empathetic and helpful
+
+       Respond in JSON format:
+       { "passed": true/false, "score": 0.0-1.0, "reason": "..." }
+     PROMPT
+   end
+
+   def user_message(tc)
+     "Response to evaluate: " + tc.actual_output
+   end
+ end
+
+ # Use it directly
+ judge = BrandVoiceJudge.new(model: my_model)
+ result = judge.call(Ask::Eval::TestCase.new(actual_output: response))
+ puts result.reason if result.passed?
+ ```
+
+ ### Using a lambda for custom evaluation
+
+ For simple checks, pass a callable directly as the `model:` parameter --
+ you do not need a full judge class:
+
+ ```ruby
+ assert_faithful response, context: docs, model: ->(messages) {
+   { content: JSON.generate({ passed: true, score: 1.0, reason: "All good" }) }
+ }
+ ```
+
+ No registration system needed. Subclassing `Judge` and implementing
+ `#call`, `#system_prompt`, and `#user_message` is the entire API.
+
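The README hunk above describes a three-method judge contract (`#call`, `#system_prompt`, `#user_message`) and a callable `model:` that returns a hash whose `:content` is a JSON string. The sketch below exercises that contract without the gem installed. Everything under `Sketch` is a hypothetical stand-in, not ask-eval's code: the way `query_judge` builds the message array, the `:role`/`:content` message keys, and the `Result` and `TestCase` structs are all assumptions made for illustration, and the real `Ask::Eval::Judge` may behave differently.

```ruby
require "json"

# Hypothetical stand-ins so the sketch runs without ask-eval installed.
module Sketch
  # Assumed result shape, based on the JSON keys named in the README prompt.
  Result = Struct.new(:passed, :score, :reason, keyword_init: true) do
    def passed?
      passed
    end
  end

  TestCase = Struct.new(:actual_output, keyword_init: true)

  class Judge
    def initialize(model:)
      @model = model
    end

    private

    # Assumption: send a system and a user message to the callable model,
    # then parse the JSON verdict found under :content.
    def query_judge(tc)
      messages = [
        { role: "system", content: system_prompt },
        { role: "user", content: user_message(tc) }
      ]
      reply = JSON.parse(@model.call(messages).fetch(:content), symbolize_names: true)
      Result.new(passed: reply[:passed], score: reply[:score], reason: reply[:reason])
    end
  end

  # A custom judge following the three-method contract from the README.
  class NoJargonJudge < Judge
    def call(tc)
      query_judge(tc)
    end

    private

    def system_prompt
      'Fail any response that uses jargon. Reply as JSON with "passed", "score", "reason".'
    end

    def user_message(tc)
      "Response to evaluate: #{tc.actual_output}"
    end
  end
end

# Deterministic stub model with the same shape as the README lambda:
# it takes messages and returns { content: <JSON string> }.
stub_model = lambda do |messages|
  user_text = messages.last[:content]
  passed = !user_text.match?(/synergy/i)
  verdict = {
    passed: passed,
    score: passed ? 1.0 : 0.0,
    reason: passed ? "No jargon found" : "Contains jargon"
  }
  { content: JSON.generate(verdict) }
end

judge = Sketch::NoJargonJudge.new(model: stub_model)
good = judge.call(Sketch::TestCase.new(actual_output: "Happy to help with your order."))
bad = judge.call(Sketch::TestCase.new(actual_output: "Let's leverage synergy."))

puts "good: #{good.passed?} (#{good.reason})"
puts "bad: #{bad.passed?} (#{bad.reason})"
```

A stub like `stub_model` keeps tests for a custom judge deterministic and free of network calls; swapping in a real model object should only require that it respond to `#call` with the same return shape, assuming the gem accepts any such callable as the README states.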
@@ -1,5 +1,5 @@
  module Ask
    module Eval
-     VERSION = "0.1.0"
+     VERSION = "0.1.1"
    end
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: ask-eval
  version: !ruby/object:Gem::Version
-   version: 0.1.0
+   version: 0.1.1
  platform: ruby
  authors:
  - Kaka Ruto