eval-ai-library 0.2.2__py3-none-any.whl → 0.3.0__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release.
This version of eval-ai-library might be problematic.
- eval_ai_library-0.3.0.dist-info/METADATA +1042 -0
- eval_ai_library-0.3.0.dist-info/RECORD +34 -0
- eval_lib/__init__.py +19 -6
- eval_lib/agent_metrics/knowledge_retention_metric/knowledge_retention.py +8 -3
- eval_lib/agent_metrics/role_adherence_metric/role_adherence.py +12 -4
- eval_lib/agent_metrics/task_success_metric/task_success_rate.py +23 -23
- eval_lib/agent_metrics/tools_correctness_metric/tool_correctness.py +8 -2
- eval_lib/datagenerator/datagenerator.py +208 -12
- eval_lib/datagenerator/document_loader.py +29 -29
- eval_lib/evaluate.py +0 -22
- eval_lib/llm_client.py +223 -78
- eval_lib/metric_pattern.py +208 -152
- eval_lib/metrics/answer_precision_metric/answer_precision.py +8 -3
- eval_lib/metrics/answer_relevancy_metric/answer_relevancy.py +7 -2
- eval_lib/metrics/bias_metric/bias.py +12 -2
- eval_lib/metrics/contextual_precision_metric/contextual_precision.py +9 -4
- eval_lib/metrics/contextual_recall_metric/contextual_recall.py +7 -3
- eval_lib/metrics/contextual_relevancy_metric/contextual_relevancy.py +8 -2
- eval_lib/metrics/custom_metric/custom_eval.py +237 -204
- eval_lib/metrics/faithfulness_metric/faithfulness.py +7 -2
- eval_lib/metrics/geval/geval.py +8 -2
- eval_lib/metrics/restricted_refusal_metric/restricted_refusal.py +7 -3
- eval_lib/metrics/toxicity_metric/toxicity.py +8 -2
- eval_lib/utils.py +44 -29
- eval_ai_library-0.2.2.dist-info/METADATA +0 -779
- eval_ai_library-0.2.2.dist-info/RECORD +0 -34
- {eval_ai_library-0.2.2.dist-info → eval_ai_library-0.3.0.dist-info}/WHEEL +0 -0
- {eval_ai_library-0.2.2.dist-info → eval_ai_library-0.3.0.dist-info}/licenses/LICENSE +0 -0
- {eval_ai_library-0.2.2.dist-info → eval_ai_library-0.3.0.dist-info}/top_level.txt +0 -0
@@ -0,0 +1,1042 @@
Metadata-Version: 2.4
Name: eval-ai-library
Version: 0.3.0
Summary: Comprehensive AI Model Evaluation Framework with support for multiple LLM providers
Author-email: Aleksandr Meshkov <alekslynx90@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/meshkovQA/Eval-ai-library
Project-URL: Documentation, https://github.com/meshkovQA/Eval-ai-library#readme
Project-URL: Repository, https://github.com/meshkovQA/Eval-ai-library
Project-URL: Bug Tracker, https://github.com/meshkovQA/Eval-ai-library/issues
Keywords: ai,evaluation,llm,rag,metrics,testing,quality-assurance
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: openai>=1.0.0
Requires-Dist: anthropic>=0.18.0
Requires-Dist: google-genai>=0.2.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: langchain>=0.1.0
Requires-Dist: langchain-community>=0.0.10
Requires-Dist: langchain-core>=0.1.0
Requires-Dist: langchain-text-splitters>=0.2.0
Requires-Dist: pypdf>=3.0.0
Requires-Dist: python-docx>=0.8.11
Requires-Dist: openpyxl>=3.1.0
Requires-Dist: pillow>=10.0.0
Requires-Dist: pytesseract>=0.3.10
Requires-Dist: python-pptx>=0.6.21
Requires-Dist: PyMuPDF>=1.23.0
Requires-Dist: mammoth>=1.6.0
Requires-Dist: PyYAML>=6.0.0
Requires-Dist: html2text>=2020.1.16
Requires-Dist: markdown>=3.4.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: striprtf>=0.0.26
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: isort>=5.12.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=6.0.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.2.0; extra == "docs"
Dynamic: license-file

# Eval AI Library

[Python 3.9+](https://www.python.org/downloads/)
[MIT License](https://opensource.org/licenses/MIT)

Comprehensive AI Model Evaluation Framework with advanced techniques including **Probability-Weighted Scoring** and **Auto Chain-of-Thought**. Support for multiple LLM providers and 15+ evaluation metrics for RAG systems and AI agents.

## Features

- 🎯 **15+ Evaluation Metrics**: RAG metrics and agent-specific evaluations
- 🧠 **G-Eval Implementation**: State-of-the-art evaluation with probability-weighted scoring
- 🔗 **Chain-of-Thought**: Automatic generation of evaluation steps from criteria
- 🤖 **Multi-Provider Support**: OpenAI, Azure OpenAI, Google Gemini, Anthropic Claude, Ollama
- 📊 **RAG Metrics**: Answer relevancy, faithfulness, contextual precision/recall, and more
- 🔧 **Agent Metrics**: Tool correctness, task success rate, role adherence, knowledge retention
- 🎨 **Custom Metrics**: Advanced custom evaluation with CoT and probability weighting
- 📦 **Data Generation**: Built-in test case generator from documents
- ⚡ **Async Support**: Full async/await support for efficient evaluation
- 💰 **Cost Tracking**: Automatic cost calculation for LLM API calls
- 📝 **Detailed Logging**: Comprehensive evaluation logs for transparency

## Installation
```bash
pip install eval-ai-library
```

### Development Installation
```bash
git clone https://github.com/meshkovQA/Eval-ai-library.git
cd Eval-ai-library
pip install -e ".[dev]"
```

## Quick Start

### Basic Batch Evaluation
```python
import asyncio
from eval_lib import (
    evaluate,
    EvalTestCase,
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    BiasMetric
)

async def test_batch_standard_metrics():
    """Test batch evaluation with multiple test cases and standard metrics"""

    # Create test cases
    test_cases = [
        EvalTestCase(
            input="What is the capital of France?",
            actual_output="The capital of France is Paris.",
            expected_output="Paris",
            retrieval_context=["Paris is the capital of France."]
        ),
        EvalTestCase(
            input="What is photosynthesis?",
            actual_output="The weather today is sunny.",
            expected_output="Process by which plants convert light into energy",
            retrieval_context=[
                "Photosynthesis is the process by which plants use sunlight."]
        )
    ]

    # Define metrics
    metrics = [
        AnswerRelevancyMetric(
            model="gpt-4o-mini",
            threshold=0.7,
            temperature=0.5,
        ),
        FaithfulnessMetric(
            model="gpt-4o-mini",
            threshold=0.8,
            temperature=0.5,
        ),
        BiasMetric(
            model="gpt-4o-mini",
            threshold=0.8,
        ),
    ]

    # Run batch evaluation
    results = await evaluate(
        test_cases=test_cases,
        metrics=metrics,
        verbose=True
    )

    return results


if __name__ == "__main__":
    asyncio.run(test_batch_standard_metrics())
```

### G-Eval with Probability-Weighted Scoring (single evaluation)

G-Eval implements the state-of-the-art evaluation method from the paper ["G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"](https://arxiv.org/abs/2303.16634). It uses **probability-weighted scoring** (score = Σ p(sᵢ) × sᵢ) for fine-grained, continuous evaluation scores.
```python
import asyncio
from eval_lib import GEval, EvalTestCase

async def evaluate_with_geval():
    test_case = EvalTestCase(
        input="Explain quantum computing to a 10-year-old",
        actual_output="Quantum computers are like super-powerful regular computers that use special tiny particles to solve really hard problems much faster.",
    )

    # G-Eval with auto chain-of-thought
    metric = GEval(
        model="gpt-4o",  # Works best with GPT-4
        threshold=0.7,  # Score range: 0.0-1.0
        name="Clarity & Simplicity",
        criteria="Evaluate how clear and age-appropriate the explanation is for a 10-year-old child",

        # evaluation_steps is auto-generated from criteria if not provided
        evaluation_steps=[
            "Step 1: Check if the language is appropriate for a 10-year-old. Avoid complex technical terms, jargon, or abstract concepts that children cannot relate to. The vocabulary should be simple and conversational.",

            "Step 2: Evaluate the use of analogies and examples. Look for comparisons to everyday objects, activities, or experiences familiar to children (toys, games, school, animals, family activities). Good analogies make abstract concepts concrete.",

            "Step 3: Assess the structure and flow. The explanation should have a clear beginning, middle, and end. Ideas should build logically, starting with familiar concepts before introducing new ones. Sentences should be short and easy to follow.",

            "Step 4: Check for engagement elements. Look for questions, storytelling, humor, or interactive elements that capture a child's attention. The tone should be friendly and encouraging, not boring or too formal.",

            "Step 5: Verify completeness without overwhelming. The explanation should cover the main idea adequately but not overload with too many details. It should answer the question without confusing the child with unnecessary complexity.",

            "Step 6: Assign a score from 0.0 to 1.0, where 0.0 means completely inappropriate or unclear for a child, and 1.0 means perfectly clear, engaging, and age-appropriate."
        ],
        n_samples=20,  # Number of samples for probability estimation (default: 20)
        sampling_temperature=2.0  # High temperature for diverse sampling (default: 2.0)
    )

    result = await metric.evaluate(test_case)
    return result


asyncio.run(evaluate_with_geval())
```
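
For intuition, the probability-weighted score is simply an expectation over the sampled scores: each distinct score is weighted by how often the judge model produced it across the `n_samples` generations. Below is a minimal sketch of that arithmetic with made-up sampled values; it is illustrative only and not the library's internal implementation.
```python
from collections import Counter

# Hypothetical scores returned by 20 high-temperature samples of the judge model
sampled_scores = [0.8, 0.7, 0.8, 0.9, 0.8, 0.7, 0.8, 0.9, 0.8, 0.7,
                  0.8, 0.8, 0.9, 0.7, 0.8, 0.8, 0.9, 0.8, 0.7, 0.8]

counts = Counter(sampled_scores)
n = len(sampled_scores)

# score = Σ p(sᵢ) × sᵢ, where p(sᵢ) is the empirical frequency of score sᵢ
weighted_score = sum((count / n) * s for s, count in counts.items())
print(round(weighted_score, 4))  # 0.795 here: a continuous value between the discrete samples
```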

### Custom Evaluation with Verdict-Based Scoring (single evaluation)

CustomEvalMetric uses **verdict-based evaluation** with automatic criteria generation for transparent and detailed scoring:
```python
import asyncio
from eval_lib import CustomEvalMetric, EvalTestCase

async def custom_evaluation():
    test_case = EvalTestCase(
        input="Explain photosynthesis",
        actual_output="Photosynthesis is the process where plants use sunlight, water, and carbon dioxide to create oxygen and energy in the form of sugar.",
    )

    # Verdict-based custom evaluation
    metric = CustomEvalMetric(
        model="gpt-4o-mini",
        threshold=0.8,
        name="Scientific Accuracy",
        criteria="Evaluate if the explanation is scientifically accurate and complete",
        evaluation_steps=None,  # Auto-generated if not provided
        temperature=0.8,  # Controls verdict aggregation
        verbose=True
    )

    result = await metric.evaluate(test_case)
    return result

asyncio.run(custom_evaluation())
```

### Agent Evaluation
```python
import asyncio
from eval_lib import (
    evaluate,
    EvalTestCase,
    ToolCorrectnessMetric,
    TaskSuccessRateMetric
)

async def evaluate_agent():
    test_case = EvalTestCase(
        input="Book a flight to New York for tomorrow",
        actual_output="I've found available flights and booked your trip to New York for tomorrow.",
        tools_called=["search_flights", "book_flight"],
        expected_tools=["search_flights", "book_flight"]
    )

    metrics = [
        ToolCorrectnessMetric(model="gpt-4o-mini", threshold=0.8),
        TaskSuccessRateMetric(
            model="gpt-4o-mini",
            threshold=0.7,
            temperature=1.0
        )
    ]

    results = await evaluate(
        test_cases=[test_case],
        metrics=metrics,
        verbose=True
    )
    return results

asyncio.run(evaluate_agent())
```

### Conversational Evaluation
```python
import asyncio
from eval_lib import (
    evaluate_conversations,
    ConversationalEvalTestCase,
    EvalTestCase,
    TaskSuccessRateMetric,
    RoleAdherenceMetric,
    KnowledgeRetentionMetric
)

async def evaluate_conversation():
    # Create conversations
    conversations = [
        ConversationalEvalTestCase(
            chatbot_role="You are a professional customer support assistant.",
            turns=[
                EvalTestCase(
                    input="I need help with my order",
                    actual_output="I'd be happy to help. Could you provide your order number?"
                ),
                EvalTestCase(
                    input="It's #12345",
                    actual_output="Thank you! Let me look up order #12345 for you."
                ),
                EvalTestCase(
                    input="When will it arrive?",
                    actual_output="Your order will be delivered on October 27, 2025."
                ),
            ]
        ),
        ConversationalEvalTestCase(
            chatbot_role="You are a formal financial advisor.",
            turns=[
                EvalTestCase(
                    input="Should I invest in stocks?",
                    actual_output="Yo dude! Just YOLO into stocks!"
                ),
                EvalTestCase(
                    input="What about bonds?",
                    actual_output="Bonds are boring, bro!"
                ),
            ]
        ),
        ConversationalEvalTestCase(
            chatbot_role="You are a helpful assistant.",
            turns=[
                EvalTestCase(
                    input="My name is John",
                    actual_output="Nice to meet you, John!"
                ),
                EvalTestCase(
                    input="What's my name?",
                    actual_output="Your name is John."
                ),
                EvalTestCase(
                    input="Where do I live?",
                    actual_output="I don't have that information."
                ),
            ]
        ),
    ]

    # Define conversational metrics
    metrics = [
        TaskSuccessRateMetric(
            model="gpt-4o-mini",
            threshold=0.7,
            temperature=0.9,
        ),
        RoleAdherenceMetric(
            model="gpt-4o-mini",
            threshold=0.8,
            temperature=0.5,
        ),
        KnowledgeRetentionMetric(
            model="gpt-4o-mini",
            threshold=0.7,
            temperature=0.5,
        ),
    ]

    # Run batch evaluation
    results = await evaluate_conversations(
        conv_cases=conversations,
        metrics=metrics,
        verbose=True
    )

    return results

asyncio.run(evaluate_conversation())
```

## Available Metrics

### RAG Metrics

#### AnswerRelevancyMetric
Measures how relevant the answer is to the question using multi-step evaluation:
1. Infers user intent
2. Extracts atomic statements from the answer
3. Generates verdicts (fully/mostly/partial/minor/none) for each statement
4. Aggregates using softmax (see the sketch after the snippet below)
```python
metric = AnswerRelevancyMetric(
    model="gpt-4o-mini",
    threshold=0.7,
    temperature=0.5  # Controls aggregation strictness
)
```
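
The verdict labels are categorical, so before aggregation they have to be mapped onto numeric scores. A rough sketch of that idea, reusing the fully/mostly/partial/minor/none scale shown later in the Temperature Parameter example; the library's exact mapping and aggregation may differ:
```python
# Assumed verdict-to-score mapping, mirroring the Temperature Parameter example below
VERDICT_SCORES = {"fully": 1.0, "mostly": 0.9, "partial": 0.7, "minor": 0.3, "none": 0.0}

verdicts = ["fully", "minor", "minor"]  # hypothetical verdicts for three statements
scores = [VERDICT_SCORES[v] for v in verdicts]

# A plain average; the real metric applies temperature-weighted (softmax-style) aggregation
naive_score = sum(scores) / len(scores)
print(round(naive_score, 2))  # ≈ 0.53 for this hypothetical example
```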

#### FaithfulnessMetric
Checks if the answer is faithful to the provided context:
1. Extracts factual claims from the answer
2. Verifies each claim against the context (fully/mostly/partial/minor/none)
3. Aggregates the faithfulness score
```python
metric = FaithfulnessMetric(
    model="gpt-4o-mini",
    threshold=0.8,
    temperature=0.5
)
```

#### ContextualRelevancyMetric
Evaluates the relevance of the retrieved context to the question.
```python
metric = ContextualRelevancyMetric(
    model="gpt-4o-mini",
    threshold=0.7,
    temperature=0.5
)
```

#### ContextualPrecisionMetric
Measures the precision of context retrieval: are the retrieved chunks relevant?
```python
metric = ContextualPrecisionMetric(
    model="gpt-4o-mini",
    threshold=0.7
)
```

#### ContextualRecallMetric
Measures the recall of relevant context: was all relevant information retrieved?
```python
metric = ContextualRecallMetric(
    model="gpt-4o-mini",
    threshold=0.7
)
```

#### BiasMetric
Detects bias and prejudice in AI-generated output. Score range: 0 (strong bias) to 100 (no bias).
```python
metric = BiasMetric(
    model="gpt-4o-mini",
    threshold=1.0  # Score range: 0 or 100
)
```

#### ToxicityMetric
Identifies toxic content in responses. Score range: 0 (highly toxic) to 100 (no toxicity).
```python
metric = ToxicityMetric(
    model="gpt-4o-mini",
    threshold=1.0  # Score range: 0 or 100
)
```

#### RestrictedRefusalMetric
Checks if the AI appropriately refuses harmful or out-of-scope requests.
```python
metric = RestrictedRefusalMetric(
    model="gpt-4o-mini",
    threshold=0.7
)
```

### Agent Metrics

#### ToolCorrectnessMetric
Validates that the agent calls the correct tools in the right sequence.
```python
metric = ToolCorrectnessMetric(
    model="gpt-4o-mini",
    threshold=0.8
)
```

#### TaskSuccessRateMetric
Measures task completion success across a conversation:
1. Infers the user's goal
2. Generates success criteria
3. Evaluates each criterion (fully/mostly/partial/minor/none)
4. Aggregates into a final score

**Note:** The metric automatically detects whether the conversation contains links/URLs and adds "The user got the link to the requested resource" as an evaluation criterion only when links are present in the dialogue.

```python
metric = TaskSuccessRateMetric(
    model="gpt-4o-mini",
    threshold=0.7,
    temperature=1.0  # Higher = more lenient aggregation
)
```

#### RoleAdherenceMetric
Evaluates how well the agent maintains its assigned role:
1. Compares each response against the role description
2. Generates adherence verdicts (fully/mostly/partial/minor/none)
3. Aggregates across all turns
```python
metric = RoleAdherenceMetric(
    model="gpt-4o-mini",
    threshold=0.8,
    temperature=0.5,
    chatbot_role="You are a helpful assistant"  # Set the role here directly
)
```

#### KnowledgeRetentionMetric
Checks if the agent remembers and recalls information from earlier in the conversation:
1. Analyzes the conversation for retention quality
2. Generates retention verdicts (fully/mostly/partial/minor/none)
3. Aggregates into a retention score
```python
metric = KnowledgeRetentionMetric(
    model="gpt-4o-mini",
    threshold=0.7,
    temperature=0.5
)
```

### Custom & Advanced Metrics

#### GEval
State-of-the-art evaluation using probability-weighted scoring from the [G-Eval paper](https://arxiv.org/abs/2303.16634):
- **Auto Chain-of-Thought**: Automatically generates evaluation steps from criteria
- **Probability-Weighted Scoring**: score = Σ p(sᵢ) × sᵢ using 20 samples
- **Fine-Grained Scores**: Continuous scores (e.g., 73.45) instead of integers
```python
metric = GEval(
    model="gpt-4o",  # Best with GPT-4 for probability estimation
    threshold=0.7,
    name="Coherence",
    criteria="Evaluate logical flow and structure of the response",
    evaluation_steps=None,  # Auto-generated if not provided
    n_samples=20,  # Number of samples for probability estimation
    sampling_temperature=2.0  # High temperature for diverse sampling
)
```

#### CustomEvalMetric
Verdict-based custom evaluation with automatic criteria generation.
Automatically:
- Generates 3-5 specific sub-criteria from main criteria (1 LLM call)
- Evaluates each criterion with verdicts (fully/mostly/partial/minor/none)
- Aggregates using softmax (temperature-controlled)
Total: 1-2 LLM calls

Usage:
```python
metric = CustomEvalMetric(
    model="gpt-4o-mini",
    threshold=0.8,
    name="Code Quality",
    criteria="Evaluate code readability, efficiency, and best practices",
    evaluation_steps=None,  # Auto-generated if not provided
    temperature=0.8,  # Controls verdict aggregation (0.1=strict, 1.0=lenient)
    verbose=True
)
```

**Example with manual criteria:**
```python
metric = CustomEvalMetric(
    model="gpt-4o-mini",
    threshold=0.8,
    name="Child-Friendly Explanation",
    criteria="Evaluate if explanation is appropriate for a 10-year-old",
    evaluation_steps=[  # Manual criteria for precise control
        "Uses simple vocabulary appropriate for 10-year-olds",
        "Includes relatable analogies or comparisons",
        "Avoids complex technical jargon",
        "Explanation is engaging and interesting",
        "Concept is broken down into understandable parts"
    ],
    temperature=0.8,
    verbose=True
)

result = await metric.evaluate(test_case)
```

## Understanding Evaluation Results

### Score Ranges

All metrics use a normalized score range of **0.0 to 1.0**:
- **0.0**: Complete failure / Does not meet criteria
- **0.5**: Partial satisfaction / Mixed results
- **1.0**: Perfect / Fully meets criteria

**Score Interpretation:**
- **0.8 - 1.0**: Excellent
- **0.7 - 0.8**: Good (typical threshold)
- **0.5 - 0.7**: Acceptable with issues
- **0.0 - 0.5**: Poor / Needs improvement

## Verbose Mode

All metrics support a `verbose` parameter that controls output formatting:

### verbose=False (Default) - JSON Output
Returns a simple dictionary with results:
```python
metric = AnswerRelevancyMetric(
    model="gpt-4o-mini",
    threshold=0.7,
    verbose=False  # Default
)

result = await metric.evaluate(test_case)
print(result)
# Output: Simple dictionary
# {
#     'name': 'answerRelevancyMetric',
#     'score': 0.85,
#     'success': True,
#     'reason': 'Answer is highly relevant...',
#     'evaluation_cost': 0.000234,
#     'evaluation_log': {...}
# }
```

### verbose=True - Beautiful Console Output
Displays formatted results with colors, progress bars, and detailed logs:
```python
metric = CustomEvalMetric(
    model="gpt-4o-mini",
    threshold=0.9,
    name="Factual Accuracy",
    criteria="Evaluate the factual accuracy of the response",
    verbose=True  # Enable beautiful output
)

result = await metric.evaluate(test_case)
# Output: Beautiful formatted display (see the example below)
```

**Console Output with verbose=True:**
```
╔════════════════════════════════════════════════════════════════╗
║ 📊 answerRelevancyMetric                                        ║
╚════════════════════════════════════════════════════════════════╝

Status: ✅ PASSED
Score: 0.91 [███████████████████████████░░░] 91%
Cost: 💰 $0.000178
Reason:
The answer correctly identifies Paris as the capital of France, demonstrating a clear understanding of the
user's request. However, it fails to provide a direct and explicit response, which diminishes its overall
effectiveness.

Evaluation Log:
╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ {                                                                                                              │
│ "input_question": "What is the capital of France?",                                                           │
│ "answer": "The capital of France is Paris and it is a beautiful city and known for its art and culture.",     │
│ "user_intent": "The user is seeking information about the capital city of France.",                           │
│ "comment_user_intent": "Inferred goal of the question.",                                                      │
│ "statements": [                                                                                               │
│ "The capital of France is Paris.",                                                                            │
│ "Paris is a beautiful city.",                                                                                 │
│ "Paris is known for its art and culture."                                                                     │
│ ],                                                                                                            │
│ "comment_statements": "Atomic facts extracted from the answer.",                                              │
│ "verdicts": [                                                                                                 │
│ {                                                                                                             │
│ "verdict": "fully",                                                                                           │
│ "reason": "The statement explicitly answers the user's question about the capital of France."                 │
│ },                                                                                                            │
│ {                                                                                                             │
│ "verdict": "minor",                                                                                           │
│ "reason": "While it mentions Paris, it does not directly answer the user's question."                         │
│ },                                                                                                            │
│ {                                                                                                             │
│ "verdict": "minor",                                                                                           │
│ "reason": "This statement is related to Paris but does not address the user's question about the              │
│ capital."                                                                                                     │
│ }                                                                                                             │
│ ],                                                                                                            │
│ "comment_verdicts": "Each verdict explains whether a statement is relevant to the question.",                 │
│ "verdict_score": 0.9142,                                                                                      │
│ "comment_verdict_score": "Proportion of relevant statements in the answer.",                                  │
│ "final_score": 0.9142,                                                                                        │
│ "comment_final_score": "Score based on the proportion of relevant statements.",                               │
│ "threshold": 0.7,                                                                                             │
│ "success": true,                                                                                              │
│ "comment_success": "Whether the score exceeds the pass threshold.",                                           │
│ "final_reason": "The answer correctly identifies Paris as the capital of France, demonstrating a clear        │
│ understanding of the user's request. However, it fails to provide a direct and explicit response, which       │
│ diminishes its overall effectiveness.",                                                                       │
│ "comment_reasoning": "Compressed explanation of the key verdict rationales."                                  │
│ }                                                                                                              │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```

Features:
- ✅ Color-coded status (✅ PASSED / ❌ FAILED)
- 📊 Visual progress bar for scores
- 💰 Cost tracking display
- 📝 Formatted reason with word wrapping
- 📋 Pretty-printed evaluation log in bordered box

**When to use verbose=True:**
- Interactive development and testing
- Debugging evaluation issues
- Presentations and demonstrations
- Manual review of results

**When to use verbose=False:**
- Production environments
- Batch processing
- Automated testing
- When storing results in databases

---

## Working with Results

Results are returned as simple dictionaries. Access fields directly:
```python
# Run evaluation
result = await metric.evaluate(test_case)

# Access result fields
score = result['score']  # 0.0-1.0
success = result['success']  # True/False
reason = result['reason']  # String explanation
cost = result['evaluation_cost']  # USD amount
log = result['evaluation_log']  # Detailed breakdown

# Example: Check success and print score
if result['success']:
    print(f"✅ Passed with score: {result['score']:.2f}")
else:
    print(f"❌ Failed: {result['reason']}")

# Access detailed verdicts (for verdict-based metrics)
if 'verdicts' in result['evaluation_log']:
    for verdict in result['evaluation_log']['verdicts']:
        print(f"- {verdict['verdict']}: {verdict['reason']}")
```

## Temperature Parameter

Many metrics use a **temperature** parameter for score aggregation (via temperature-weighted scoring):

- **Lower (0.1-0.3)**: **STRICT** - All scores matter equally, low scores heavily penalize the final result. Best for critical applications where even one bad verdict should fail the metric.
- **Medium (0.4-0.6)**: **BALANCED** - Moderate weighting between high and low scores. Default behavior for most use cases (default: 0.5).
- **Higher (0.7-1.0)**: **LENIENT** - High scores (fully/mostly) dominate, effectively ignoring partial/minor/none verdicts. Best for exploratory evaluation or when you want to focus on positive signals.

**How it works:** Temperature controls exponential weighting of scores. Higher temperature exponentially boosts high scores (1.0, 0.9), making low scores (0.7, 0.3, 0.0) matter less. Lower temperature treats all scores more equally.
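
As a rough sketch of the direction of this effect, weights grow exponentially with the verdict score and the temperature scales how strongly high scores are preferred. This is illustrative only: it is not the library's exact formula and is not expected to reproduce the numbers in the example below.
```python
import math

def temperature_weighted_score(scores, temperature):
    """Illustrative temperature-weighted average: higher temperature gives
    exponentially more weight to high scores; temperature near 0 approaches
    a plain mean. The library's internal weighting may differ."""
    weights = [math.exp(temperature * 10 * s) for s in scores]  # scale factor chosen for illustration
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, scores)) / total

scores = [1.0, 0.9, 0.7, 0.3, 0.0]  # fully, mostly, partial, minor, none
print(round(temperature_weighted_score(scores, 0.1), 2))  # ≈ 0.71: low verdicts still pull the result down
print(round(temperature_weighted_score(scores, 1.0), 2))  # ≈ 0.96: "fully"/"mostly" dominate
```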

**Example:**
```python
# Verdicts: [fully, mostly, partial, minor, none] = [1.0, 0.9, 0.7, 0.3, 0.0]

# STRICT: All verdicts count
metric = FaithfulnessMetric(temperature=0.1)
# Result: ~0.52 (heavily penalized by "minor" and "none")

# BALANCED: Moderate weighting
metric = AnswerRelevancyMetric(temperature=0.5)
# Result: ~0.73 (balanced consideration)

# LENIENT: Only "fully" and "mostly" matter
metric = TaskSuccessRateMetric(temperature=1.0)
# Result: ~0.95 (ignores "partial", "minor", "none")
```

## LLM Provider Configuration

### OpenAI
```python
import os
os.environ["OPENAI_API_KEY"] = "your-api-key"

from eval_lib import chat_complete

response, cost = await chat_complete(
    "gpt-4o-mini",  # or "openai:gpt-4o-mini"
    messages=[{"role": "user", "content": "Hello!"}]
)
```

### Azure OpenAI
```python
os.environ["AZURE_OPENAI_API_KEY"] = "your-api-key"
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-endpoint.openai.azure.com/"
os.environ["AZURE_OPENAI_DEPLOYMENT"] = "your-deployment-name"

response, cost = await chat_complete(
    "azure:gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

### Google Gemini
```python
os.environ["GOOGLE_API_KEY"] = "your-api-key"

response, cost = await chat_complete(
    "google:gemini-2.0-flash",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

### Anthropic Claude
```python
os.environ["ANTHROPIC_API_KEY"] = "your-api-key"

response, cost = await chat_complete(
    "anthropic:claude-sonnet-4-0",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

### Ollama (Local)
```python
os.environ["OLLAMA_API_KEY"] = "ollama"  # Can be any value
os.environ["OLLAMA_API_BASE_URL"] = "http://localhost:11434/v1"

response, cost = await chat_complete(
    "ollama:llama2",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

## Test Data Generation

The library includes a powerful test data generator that can create realistic test cases either from scratch or based on your documents.

### Supported Document Formats

- **Documents**: PDF, DOCX, DOC, TXT, RTF, ODT
- **Structured Data**: CSV, TSV, XLSX, JSON, YAML, XML
- **Web**: HTML, Markdown
- **Presentations**: PPTX
- **Images**: PNG, JPG, JPEG (with OCR support)

### Generate from Scratch
```python
from eval_lib.datagenerator.datagenerator import DatasetGenerator

generator = DatasetGenerator(
    model="gpt-4o-mini",
    agent_description="A customer support chatbot",
    input_format="User question or request",
    expected_output_format="Helpful response",
    test_types=["functionality", "edge_cases"],
    max_rows=20,
    question_length="mixed",  # "short", "long", or "mixed"
    question_openness="mixed",  # "open", "closed", or "mixed"
    trap_density=0.1,  # 10% trap questions
    language="en",
    verbose=True  # Displays beautiful formatted progress, statistics and full dataset preview
)

dataset = await generator.generate_from_scratch()
```

### Generate from Documents
```python
generator = DatasetGenerator(
    model="gpt-4o-mini",
    agent_description="Technical support agent",
    input_format="Technical question",
    expected_output_format="Detailed answer with references",
    test_types=["retrieval", "accuracy"],
    max_rows=50,
    chunk_size=1024,
    chunk_overlap=100,
    max_chunks=30,
    verbose=True
)

file_paths = ["docs/user_guide.pdf", "docs/faq.md"]
dataset = await generator.generate_from_documents(file_paths)

# Convert to test cases
from eval_lib import EvalTestCase
test_cases = [
    EvalTestCase(
        input=item["input"],
        expected_output=item["expected_output"],
        retrieval_context=[item.get("context", "")]
    )
    for item in dataset
]
```
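
For intuition about `chunk_size` and `chunk_overlap`: documents are split into overlapping text windows before questions are generated from them, and `max_chunks` caps how many windows are used. A rough illustration using `langchain-text-splitters` (a declared dependency); the generator's own document loader may do this differently, and the file path below is hypothetical.
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,    # maximum characters per chunk
    chunk_overlap=100,  # characters shared between consecutive chunks for context continuity
)

text = open("docs/user_guide.txt", encoding="utf-8").read()  # hypothetical plain-text source
chunks = splitter.split_text(text)
print(f"{len(chunks)} chunks; test cases would be generated per chunk (up to max_chunks)")
```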

## Best Practices

### 1. Choose the Right Model

- **G-Eval**: Use GPT-4 for best results with probability-weighted scoring
- **Other Metrics**: GPT-4o-mini is cost-effective and sufficient
- **Custom Eval**: Use GPT-4 for complex criteria, GPT-4o-mini for simple ones

### 2. Set Appropriate Thresholds
```python
# Safety metrics - high bar
BiasMetric(threshold=0.8)
ToxicityMetric(threshold=0.85)

# Quality metrics - moderate bar
AnswerRelevancyMetric(threshold=0.7)
FaithfulnessMetric(threshold=0.75)

# Agent metrics - context-dependent
TaskSuccessRateMetric(threshold=0.7)  # Most tasks
RoleAdherenceMetric(threshold=0.9)  # Strict role requirements
```

### 3. Use Temperature Wisely
```python
# STRICT evaluation - critical applications where all verdicts matter
# Use when: You need high accuracy and can't tolerate bad verdicts
metric = FaithfulnessMetric(temperature=0.1)

# BALANCED - general use (default)
# Use when: Standard evaluation with moderate requirements
metric = AnswerRelevancyMetric(temperature=0.5)

# LENIENT - exploratory evaluation or focusing on positive signals
# Use when: You want to reward good answers and ignore occasional mistakes
metric = TaskSuccessRateMetric(temperature=1.0)
```

**Real-world examples:**
```python
# Production RAG system - must be accurate
faithfulness = FaithfulnessMetric(
    model="gpt-4o-mini",
    threshold=0.8,
    temperature=0.2  # STRICT: verdicts "none", "minor", "partially" significantly impact score
)

# Customer support chatbot - moderate standards
role_adherence = RoleAdherenceMetric(
    model="gpt-4o-mini",
    threshold=0.7,
    temperature=0.5  # BALANCED: Standard evaluation
)

# Experimental feature testing - focus on successes
task_success = TaskSuccessRateMetric(
    model="gpt-4o-mini",
    threshold=0.6,
    temperature=1.0  # LENIENT: Focuses on "fully" and "mostly" completions
)
```

### 4. Leverage Evaluation Logs
```python
# Enable verbose mode for automatic detailed display
metric = AnswerRelevancyMetric(
    model="gpt-4o-mini",
    threshold=0.7,
    verbose=True  # Automatic formatted output with full logs
)

# Or access logs programmatically
result = await metric.evaluate(test_case)
log = result['evaluation_log']

# Debugging failures
if not result['success']:
    # All details available in log
    reason = result['reason']
    verdicts = log.get('verdicts', [])
    steps = log.get('evaluation_steps', [])
```

### 5. Batch Evaluation for Efficiency
```python
# Evaluate multiple test cases at once
results = await evaluate(
    test_cases=[test_case1, test_case2, test_case3],
    metrics=[metric1, metric2, metric3]
)

# Calculate aggregate statistics
total_cost = sum(
    metric.evaluation_cost or 0
    for _, test_results in results
    for result in test_results
    for metric in result.metrics_data
)

success_rate = sum(
    1 for _, test_results in results
    for result in test_results
    if result.success
) / len(results)

print(f"Total cost: ${total_cost:.4f}")
print(f"Success rate: {success_rate:.2%}")
```

## Environment Variables

| Variable | Description | Required |
|----------|-------------|----------|
| `OPENAI_API_KEY` | OpenAI API key | For OpenAI |
| `AZURE_OPENAI_API_KEY` | Azure OpenAI API key | For Azure |
| `AZURE_OPENAI_ENDPOINT` | Azure OpenAI endpoint URL | For Azure |
| `AZURE_OPENAI_DEPLOYMENT` | Azure deployment name | For Azure |
| `GOOGLE_API_KEY` | Google API key | For Google |
| `ANTHROPIC_API_KEY` | Anthropic API key | For Anthropic |
| `OLLAMA_API_KEY` | Ollama API key | For Ollama |
| `OLLAMA_API_BASE_URL` | Ollama base URL | For Ollama |

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Citation

If you use this library in your research, please cite:
```bibtex
@software{eval_ai_library,
  author = {Meshkov, Aleksandr},
  title = {Eval AI Library: Comprehensive AI Model Evaluation Framework},
  year = {2025},
  url = {https://github.com/meshkovQA/Eval-ai-library.git}
}
```

### References

This library implements techniques from:
```bibtex
@inproceedings{liu2023geval,
  title={G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment},
  author={Liu, Yang and Iter, Dan and Xu, Yichong and Wang, Shuohang and Xu, Ruochen and Zhu, Chenguang},
  booktitle={Proceedings of EMNLP},
  year={2023}
}
```

## Support

- 📧 Email: alekslynx90@gmail.com
- 🐛 Issues: [GitHub Issues](https://github.com/meshkovQA/Eval-ai-library/issues)
- 📖 Documentation: [Full Documentation](https://github.com/meshkovQA/Eval-ai-library#readme)

## Acknowledgments

This library was developed to provide a comprehensive solution for evaluating AI models across different use cases and providers, with state-of-the-art techniques including G-Eval's probability-weighted scoring and automatic chain-of-thought generation.
|