eval-ai-library 0.2.2__py3-none-any.whl → 0.3.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.



Files changed (29)
  1. eval_ai_library-0.3.0.dist-info/METADATA +1042 -0
  2. eval_ai_library-0.3.0.dist-info/RECORD +34 -0
  3. eval_lib/__init__.py +19 -6
  4. eval_lib/agent_metrics/knowledge_retention_metric/knowledge_retention.py +8 -3
  5. eval_lib/agent_metrics/role_adherence_metric/role_adherence.py +12 -4
  6. eval_lib/agent_metrics/task_success_metric/task_success_rate.py +23 -23
  7. eval_lib/agent_metrics/tools_correctness_metric/tool_correctness.py +8 -2
  8. eval_lib/datagenerator/datagenerator.py +208 -12
  9. eval_lib/datagenerator/document_loader.py +29 -29
  10. eval_lib/evaluate.py +0 -22
  11. eval_lib/llm_client.py +223 -78
  12. eval_lib/metric_pattern.py +208 -152
  13. eval_lib/metrics/answer_precision_metric/answer_precision.py +8 -3
  14. eval_lib/metrics/answer_relevancy_metric/answer_relevancy.py +7 -2
  15. eval_lib/metrics/bias_metric/bias.py +12 -2
  16. eval_lib/metrics/contextual_precision_metric/contextual_precision.py +9 -4
  17. eval_lib/metrics/contextual_recall_metric/contextual_recall.py +7 -3
  18. eval_lib/metrics/contextual_relevancy_metric/contextual_relevancy.py +8 -2
  19. eval_lib/metrics/custom_metric/custom_eval.py +237 -204
  20. eval_lib/metrics/faithfulness_metric/faithfulness.py +7 -2
  21. eval_lib/metrics/geval/geval.py +8 -2
  22. eval_lib/metrics/restricted_refusal_metric/restricted_refusal.py +7 -3
  23. eval_lib/metrics/toxicity_metric/toxicity.py +8 -2
  24. eval_lib/utils.py +44 -29
  25. eval_ai_library-0.2.2.dist-info/METADATA +0 -779
  26. eval_ai_library-0.2.2.dist-info/RECORD +0 -34
  27. {eval_ai_library-0.2.2.dist-info → eval_ai_library-0.3.0.dist-info}/WHEEL +0 -0
  28. {eval_ai_library-0.2.2.dist-info → eval_ai_library-0.3.0.dist-info}/licenses/LICENSE +0 -0
  29. {eval_ai_library-0.2.2.dist-info → eval_ai_library-0.3.0.dist-info}/top_level.txt +0 -0
@@ -0,0 +1,1042 @@
1
+ Metadata-Version: 2.4
2
+ Name: eval-ai-library
3
+ Version: 0.3.0
4
+ Summary: Comprehensive AI Model Evaluation Framework with support for multiple LLM providers
5
+ Author-email: Aleksandr Meshkov <alekslynx90@gmail.com>
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/meshkovQA/Eval-ai-library
8
+ Project-URL: Documentation, https://github.com/meshkovQA/Eval-ai-library#readme
9
+ Project-URL: Repository, https://github.com/meshkovQA/Eval-ai-library
10
+ Project-URL: Bug Tracker, https://github.com/meshkovQA/Eval-ai-library/issues
11
+ Keywords: ai,evaluation,llm,rag,metrics,testing,quality-assurance
12
+ Classifier: Development Status :: 4 - Beta
13
+ Classifier: Intended Audience :: Developers
14
+ Classifier: Intended Audience :: Science/Research
15
+ Classifier: License :: OSI Approved :: MIT License
16
+ Classifier: Programming Language :: Python :: 3
17
+ Classifier: Programming Language :: Python :: 3.9
18
+ Classifier: Programming Language :: Python :: 3.10
19
+ Classifier: Programming Language :: Python :: 3.11
20
+ Classifier: Programming Language :: Python :: 3.12
21
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
22
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
23
+ Requires-Python: >=3.9
24
+ Description-Content-Type: text/markdown
25
+ License-File: LICENSE
26
+ Requires-Dist: openai>=1.0.0
27
+ Requires-Dist: anthropic>=0.18.0
28
+ Requires-Dist: google-genai>=0.2.0
29
+ Requires-Dist: pydantic>=2.0.0
30
+ Requires-Dist: numpy>=1.24.0
31
+ Requires-Dist: langchain>=0.1.0
32
+ Requires-Dist: langchain-community>=0.0.10
33
+ Requires-Dist: langchain-core>=0.1.0
34
+ Requires-Dist: langchain-text-splitters>=0.2.0
35
+ Requires-Dist: pypdf>=3.0.0
36
+ Requires-Dist: python-docx>=0.8.11
37
+ Requires-Dist: openpyxl>=3.1.0
38
+ Requires-Dist: pillow>=10.0.0
39
+ Requires-Dist: pytesseract>=0.3.10
40
+ Requires-Dist: python-pptx>=0.6.21
41
+ Requires-Dist: PyMuPDF>=1.23.0
42
+ Requires-Dist: mammoth>=1.6.0
43
+ Requires-Dist: PyYAML>=6.0.0
44
+ Requires-Dist: html2text>=2020.1.16
45
+ Requires-Dist: markdown>=3.4.0
46
+ Requires-Dist: pandas>=2.0.0
47
+ Requires-Dist: striprtf>=0.0.26
48
+ Provides-Extra: dev
49
+ Requires-Dist: pytest>=7.0.0; extra == "dev"
50
+ Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
51
+ Requires-Dist: black>=23.0.0; extra == "dev"
52
+ Requires-Dist: flake8>=6.0.0; extra == "dev"
53
+ Requires-Dist: mypy>=1.0.0; extra == "dev"
54
+ Requires-Dist: isort>=5.12.0; extra == "dev"
55
+ Provides-Extra: docs
56
+ Requires-Dist: sphinx>=6.0.0; extra == "docs"
57
+ Requires-Dist: sphinx-rtd-theme>=1.2.0; extra == "docs"
58
+ Dynamic: license-file
59
+
60
+ # Eval AI Library
61
+
62
+ [![Python Version](https://img.shields.io/badge/python-3.9%2B-blue)](https://www.python.org/downloads/)
63
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
64
+
65
+ Comprehensive AI Model Evaluation Framework with advanced techniques including **Probability-Weighted Scoring** and **Auto Chain-of-Thought**. It supports multiple LLM providers and offers 15+ evaluation metrics for RAG systems and AI agents.
66
+
67
+ ## Features
68
+
69
+ - 🎯 **15+ Evaluation Metrics**: RAG metrics and agent-specific evaluations
70
+ - 🧠 **G-Eval Implementation**: State-of-the-art evaluation with probability-weighted scoring
71
+ - 🔗 **Chain-of-Thought**: Automatic generation of evaluation steps from criteria
72
+ - 🤖 **Multi-Provider Support**: OpenAI, Azure OpenAI, Google Gemini, Anthropic Claude, Ollama
73
+ - 📊 **RAG Metrics**: Answer relevancy, faithfulness, contextual precision/recall, and more
74
+ - 🔧 **Agent Metrics**: Tool correctness, task success rate, role adherence, knowledge retention
75
+ - 🎨 **Custom Metrics**: Advanced custom evaluation with CoT and probability weighting
76
+ - 📦 **Data Generation**: Built-in test case generator from documents
77
+ - ⚡ **Async Support**: Full async/await support for efficient evaluation
78
+ - 💰 **Cost Tracking**: Automatic cost calculation for LLM API calls
79
+ - 📝 **Detailed Logging**: Comprehensive evaluation logs for transparency
80
+
81
+ ## Installation
82
+ ```bash
83
+ pip install eval-ai-library
84
+ ```
85
+
86
+ ### Development Installation
87
+ ```bash
88
+ git clone https://github.com/meshkovQA/Eval-ai-library.git
89
+ cd Eval-ai-library
90
+ pip install -e ".[dev]"
91
+ ```
92
+
93
+ ## Quick Start
94
+
95
+ ### Basic Batch Evaluation
96
+ ```python
97
+ import asyncio
98
+ from eval_lib import (
99
+ evaluate,
100
+ EvalTestCase,
101
+ AnswerRelevancyMetric,
102
+ FaithfulnessMetric,
103
+ BiasMetric
104
+ )
105
+
106
+ async def test_batch_standard_metrics():
107
+ """Test batch evaluation with multiple test cases and standard metrics"""
108
+
109
+ # Create test cases
110
+ test_cases = [
111
+ EvalTestCase(
112
+ input="What is the capital of France?",
113
+ actual_output="The capital of France is Paris.",
114
+ expected_output="Paris",
115
+ retrieval_context=["Paris is the capital of France."]
116
+ ),
117
+ EvalTestCase(
118
+ input="What is photosynthesis?",
119
+ actual_output="The weather today is sunny.",
120
+ expected_output="Process by which plants convert light into energy",
121
+ retrieval_context=[
122
+ "Photosynthesis is the process by which plants use sunlight."]
123
+ )
124
+ ]
125
+
126
+ # Define metrics
127
+ metrics = [
128
+ AnswerRelevancyMetric(
129
+ model="gpt-4o-mini",
130
+ threshold=0.7,
131
+ temperature=0.5,
132
+ ),
133
+ FaithfulnessMetric(
134
+ model="gpt-4o-mini",
135
+ threshold=0.8,
136
+ temperature=0.5,
137
+ ),
138
+ BiasMetric(
139
+ model="gpt-4o-mini",
140
+ threshold=0.8,
141
+ ),
142
+ ]
143
+
144
+ # Run batch evaluation
145
+ results = await evaluate(
146
+ test_cases=test_cases,
147
+ metrics=metrics,
148
+ verbose=True
149
+ )
150
+
151
+ return results
152
+
153
+
154
+ if __name__ == "__main__":
155
+ asyncio.run(test_batch_standard_metrics())
156
+ ```
157
+
158
+ ### G-Eval with Probability-Weighted Scoring (single evaluation)
159
+
160
+ G-Eval implements the state-of-the-art evaluation method from the paper ["G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"](https://arxiv.org/abs/2303.16634). It uses **probability-weighted scoring** (score = Σ p(si) × si) for fine-grained, continuous evaluation scores.
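+
+ To make the probability-weighted idea concrete, here is a minimal, illustrative sketch (not the library's internal implementation): the judge model is sampled several times at a high temperature, the relative frequency of each sampled score estimates p(si), and the final score is the weighted sum.
+ ```python
+ from collections import Counter
+
+ def probability_weighted_score(sampled_scores: list[float]) -> float:
+     """Estimate p(s_i) from repeated judge samples and return sum of p(s_i) * s_i."""
+     counts = Counter(sampled_scores)
+     n = len(sampled_scores)
+     return sum((count / n) * score for score, count in counts.items())
+
+ # e.g. 20 judge samples drawn at a high sampling temperature
+ samples = [0.8] * 12 + [0.7] * 5 + [0.9] * 3
+ print(probability_weighted_score(samples))  # ~0.79 -- finer-grained than any single rating
+ ```
+
+ The usage example below shows the actual API; `n_samples` and `sampling_temperature` control this sampling step.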
161
+ ```python
162
+ from eval_lib import GEval, EvalTestCase
163
+
164
+ async def evaluate_with_geval():
165
+ test_case = EvalTestCase(
166
+ input="Explain quantum computing to a 10-year-old",
167
+ actual_output="Quantum computers are like super-powerful regular computers that use special tiny particles to solve really hard problems much faster.",
168
+ )
169
+
170
+ # G-Eval with auto chain-of-thought
171
+ metric = GEval(
172
+ model="gpt-4o", # Works best with GPT-4
173
+ threshold=0.7, # Score range: 0.0-1.0
174
+ name="Clarity & Simplicity",
175
+ criteria="Evaluate how clear and age-appropriate the explanation is for a 10-year-old child",
176
+
177
+ # evaluation_steps are auto-generated from the criteria if not provided
178
+ evaluation_steps=[
179
+ "Step 1: Check if the language is appropriate for a 10-year-old. Avoid complex technical terms, jargon, or abstract concepts that children cannot relate to. The vocabulary should be simple and conversational.",
180
+
181
+ "Step 2: Evaluate the use of analogies and examples. Look for comparisons to everyday objects, activities, or experiences familiar to children (toys, games, school, animals, family activities). Good analogies make abstract concepts concrete.",
182
+
183
+ "Step 3: Assess the structure and flow. The explanation should have a clear beginning, middle, and end. Ideas should build logically, starting with familiar concepts before introducing new ones. Sentences should be short and easy to follow.",
184
+
185
+ "Step 4: Check for engagement elements. Look for questions, storytelling, humor, or interactive elements that capture a child's attention. The tone should be friendly and encouraging, not boring or too formal.",
186
+
187
+ "Step 5: Verify completeness without overwhelming. The explanation should cover the main idea adequately but not overload with too many details. It should answer the question without confusing the child with unnecessary complexity.",
188
+
189
+ "Step 6: Assign a score from 0.0 to 1.0, where 0.0 means completely inappropriate or unclear for a child, and 1.0 means perfectly clear, engaging, and age-appropriate."
190
+ ],
191
+ n_samples=20, # Number of samples for probability estimation (default: 20)
192
+ sampling_temperature=2.0 # High temperature for diverse sampling (default: 2.0)
193
+ )
194
+
195
+ result = await metric.evaluate(test_case)
196
+
197
+
198
+ asyncio.run(evaluate_with_geval())
199
+ ```
200
+
201
+ ### Custom Evaluation with Verdict-Based Scoring (single evaluation)
202
+
203
+ CustomEvalMetric uses **verdict-based evaluation** with automatic criteria generation for transparent and detailed scoring:
204
+ ```python
205
+ from eval_lib import CustomEvalMetric, EvalTestCase
206
+
207
+ async def custom_evaluation():
208
+ test_case = EvalTestCase(
209
+ input="Explain photosynthesis",
210
+ actual_output="Photosynthesis is the process where plants use sunlight, water, and carbon dioxide to create oxygen and energy in the form of sugar.",
211
+ )
212
+
213
+ # Verdict-based custom evaluation
214
+ metric = CustomEvalMetric(
215
+ model="gpt-4o-mini",
216
+ threshold=0.8,
217
+ name="Scientific Accuracy",
218
+ criteria="Evaluate if the explanation is scientifically accurate and complete",
219
+ evaluation_steps=None, # Auto-generated if not provided
220
+ temperature=0.8, # Controls verdict aggregation
221
+ verbose=True
222
+ )
223
+
224
+ result = await metric.evaluate(test_case)
225
+
226
+ asyncio.run(custom_evaluation())
227
+ ```
228
+
229
+ ### Agent Evaluation
230
+ ```python
231
+ from eval_lib import (
232
+ evaluate,
233
+ EvalTestCase,
234
+ ToolCorrectnessMetric,
235
+ TaskSuccessRateMetric
236
+ )
237
+
238
+ async def evaluate_agent():
239
+ test_cases = EvalTestCase(
240
+ input="Book a flight to New York for tomorrow",
241
+ actual_output="I've found available flights and booked your trip to New York for tomorrow.",
242
+ tools_called=["search_flights", "book_flight"],
243
+ expected_tools=["search_flights", "book_flight"]
244
+ )
245
+
246
+ metrics = [
247
+ ToolCorrectnessMetric(model="gpt-4o-mini", threshold=0.8),
248
+ TaskSuccessRateMetric(
249
+ model="gpt-4o-mini",
250
+ threshold=0.7,
251
+ temperature=1.0
252
+ )
253
+ ]
254
+
255
+ results = await evaluate(
256
+ test_cases=[test_cases],
257
+ metrics=metrics,
258
+ verbose=True
259
+ )
260
+ return results
261
+
262
+ asyncio.run(evaluate_agent())
263
+ ```
264
+
265
+ ### Conversational Evaluation
266
+ ```python
267
+ from eval_lib import (
268
+ evaluate_conversations,
269
+ ConversationalEvalTestCase,
270
+ EvalTestCase,
271
+ RoleAdherenceMetric,
272
+ KnowledgeRetentionMetric,
+ TaskSuccessRateMetric
273
+ )
274
+
275
+ async def evaluate_conversation():
276
+ # Create conversations
277
+ conversations = [
278
+ ConversationalEvalTestCase(
279
+ chatbot_role="You are a professional customer support assistant.",
280
+ turns=[
281
+ EvalTestCase(
282
+ input="I need help with my order",
283
+ actual_output="I'd be happy to help. Could you provide your order number?"
284
+ ),
285
+ EvalTestCase(
286
+ input="It's #12345",
287
+ actual_output="Thank you! Let me look up order #12345 for you."
288
+ ),
289
+ EvalTestCase(
290
+ input="When will it arrive?",
291
+ actual_output="Your order will be delivered on October 27, 2025."
292
+ ),
293
+ ]
294
+ ),
295
+ ConversationalEvalTestCase(
296
+ chatbot_role="You are a formal financial advisor.",
297
+ turns=[
298
+ EvalTestCase(
299
+ input="Should I invest in stocks?",
300
+ actual_output="Yo dude! Just YOLO into stocks!"
301
+ ),
302
+ EvalTestCase(
303
+ input="What about bonds?",
304
+ actual_output="Bonds are boring, bro!"
305
+ ),
306
+ ]
307
+ ),
308
+ ConversationalEvalTestCase(
309
+ chatbot_role="You are a helpful assistant.",
310
+ turns=[
311
+ EvalTestCase(
312
+ input="My name is John",
313
+ actual_output="Nice to meet you, John!"
314
+ ),
315
+ EvalTestCase(
316
+ input="What's my name?",
317
+ actual_output="Your name is John."
318
+ ),
319
+ EvalTestCase(
320
+ input="Where do I live?",
321
+ actual_output="I don't have that information."
322
+ ),
323
+ ]
324
+ ),
325
+ ]
326
+
327
+ # Define conversational metrics
328
+ metrics = [
329
+ TaskSuccessRateMetric(
330
+ model="gpt-4o-mini",
331
+ threshold=0.7,
332
+ temperature=0.9,
333
+ ),
334
+ RoleAdherenceMetric(
335
+ model="gpt-4o-mini",
336
+ threshold=0.8,
337
+ temperature=0.5,
338
+ ),
339
+ KnowledgeRetentionMetric(
340
+ model="gpt-4o-mini",
341
+ threshold=0.7,
342
+ temperature=0.5,
343
+ ),
344
+ ]
345
+
346
+ # Run batch evaluation
347
+ results = await evaluate_conversations(
348
+ conv_cases=conversations,
349
+ metrics=metrics,
350
+ verbose=True
351
+ )
352
+
353
+ return results
354
+
355
+ asyncio.run(evaluate_conversation())
356
+ ```
357
+
358
+ ## Available Metrics
359
+
360
+ ### RAG Metrics
361
+
362
+ #### AnswerRelevancyMetric
363
+ Measures how relevant the answer is to the question using multi-step evaluation:
364
+ 1. Infers user intent
365
+ 2. Extracts atomic statements from answer
366
+ 3. Generates verdicts (fully/mostly/partial/minor/none) for each statement
367
+ 4. Aggregates using softmax
368
+ ```python
369
+ metric = AnswerRelevancyMetric(
370
+ model="gpt-4o-mini",
371
+ threshold=0.7,
372
+ temperature=0.5 # Controls aggregation strictness
373
+ )
374
+ ```
375
+
376
+ #### FaithfulnessMetric
377
+ Checks if the answer is faithful to the provided context:
378
+ 1. Extracts factual claims from answer
379
+ 2. Verifies each claim against context (fully/mostly/partial/minor/none)
380
+ 3. Aggregates faithfulness score
381
+ ```python
382
+ metric = FaithfulnessMetric(
383
+ model="gpt-4o-mini",
384
+ threshold=0.8,
385
+ temperature=0.5
386
+ )
387
+ ```
388
+
389
+ #### ContextualRelevancyMetric
390
+ Evaluates relevance of retrieved context to the question.
391
+ ```python
392
+ metric = ContextualRelevancyMetric(
393
+ model="gpt-4o-mini",
394
+ threshold=0.7,
395
+ temperature=0.5
396
+ )
397
+ ```
398
+
399
+ #### ContextualPrecisionMetric
400
+ Measures precision of context retrieval - are the retrieved chunks relevant?
401
+ ```python
402
+ metric = ContextualPrecisionMetric(
403
+ model="gpt-4o-mini",
404
+ threshold=0.7
405
+ )
406
+ ```
407
+
408
+ #### ContextualRecallMetric
409
+ Measures recall of relevant context - was all relevant information retrieved?
410
+ ```python
411
+ metric = ContextualRecallMetric(
412
+ model="gpt-4o-mini",
413
+ threshold=0.7
414
+ )
415
+ ```
416
+
417
+ #### BiasMetric
418
+ Detects bias and prejudice in AI-generated output. Scores are normalized from 0.0 (strong bias) to 1.0 (no bias).
419
+ ```python
420
+ metric = BiasMetric(
421
+ model="gpt-4o-mini",
422
+ threshold=1.0  # Score range: 0.0-1.0
423
+ )
424
+ ```
425
+
426
+ #### ToxicityMetric
427
+ Identifies toxic content in responses. Scores are normalized from 0.0 (highly toxic) to 1.0 (no toxicity).
428
+ ```python
429
+ metric = ToxicityMetric(
430
+ model="gpt-4o-mini",
431
+ threshold=1.0  # Score range: 0.0-1.0
432
+ )
433
+ ```
434
+
435
+ #### RestrictedRefusalMetric
436
+ Checks if the AI appropriately refuses harmful or out-of-scope requests.
437
+ ```python
438
+ metric = RestrictedRefusalMetric(
439
+ model="gpt-4o-mini",
440
+ threshold=0.7
441
+ )
442
+ ```
443
+
444
+ ### Agent Metrics
445
+
446
+ #### ToolCorrectnessMetric
447
+ Validates that the agent calls the correct tools in the right sequence.
448
+ ```python
449
+ metric = ToolCorrectnessMetric(
450
+ model="gpt-4o-mini",
451
+ threshold=0.8
452
+ )
453
+ ```
454
+
455
+ #### TaskSuccessRateMetric
456
+ ````
457
+ **Note:** The metric automatically detects if the conversation contains links/URLs and adds "The user got the link to the requested resource" as an evaluation criterion only when links are present in the dialogue.
458
+ ````
459
+ Measures task completion success across the conversation:
460
+ 1. Infers user's goal
461
+ 2. Generates success criteria
462
+ 3. Evaluates each criterion (fully/mostly/partial/minor/none)
463
+ 4. Aggregates into final score
464
+ ```python
465
+ metric = TaskSuccessRateMetric(
466
+ model="gpt-4o-mini",
467
+ threshold=0.7,
468
+ temperature=1.0 # Higher = more lenient aggregation
469
+ )
470
+ ```
471
+
472
+ #### RoleAdherenceMetric
473
+ Evaluates how well the agent maintains its assigned role:
474
+ 1. Compares each response against role description
475
+ 2. Generates adherence verdicts (fully/mostly/partial/minor/none)
476
+ 3. Aggregates across all turns
477
+ ```python
478
+ metric = RoleAdherenceMetric(
479
+ model="gpt-4o-mini",
480
+ threshold=0.8,
481
+ temperature=0.5,
482
+ chatbot_role="You are a helpful assistant"  # Set role here directly
483
+ )
484
+
485
+ ```
486
+
487
+ #### KnowledgeRetentionMetric
488
+ Checks if the agent remembers and recalls information from earlier in the conversation:
489
+ 1. Analyzes conversation for retention quality
490
+ 2. Generates retention verdicts (fully/mostly/partial/minor/none)
491
+ 3. Aggregates into retention score
492
+ ```python
493
+ metric = KnowledgeRetentionMetric(
494
+ model="gpt-4o-mini",
495
+ threshold=0.7,
496
+ temperature=0.5
497
+ )
498
+ ```
499
+
500
+ ### Custom & Advanced Metrics
501
+
502
+ #### GEval
503
+ State-of-the-art evaluation using probability-weighted scoring from the [G-Eval paper](https://arxiv.org/abs/2303.16634):
504
+ - **Auto Chain-of-Thought**: Automatically generates evaluation steps from criteria
505
+ - **Probability-Weighted Scoring**: score = Σ p(si) × si using 20 samples
506
+ - **Fine-Grained Scores**: Continuous scores (e.g., 0.7345) instead of coarse integer ratings
507
+ ```python
508
+ metric = GEval(
509
+ model="gpt-4o", # Best with GPT-4 for probability estimation
510
+ threshold=0.7,
511
+ name="Coherence",
512
+ criteria="Evaluate logical flow and structure of the response",
513
+ evaluation_steps=None, # Auto-generated if not provided
514
+ n_samples=20, # Number of samples for probability estimation
515
+ sampling_temperature=2.0 # High temperature for diverse sampling
516
+ )
517
+ ```
518
+
519
+ #### CustomEvalMetric
520
+ Verdict-based custom evaluation with automatic criteria generation.
521
+ Automatically:
522
+ - Generates 3-5 specific sub-criteria from main criteria (1 LLM call)
523
+ - Evaluates each criterion with verdicts (fully/mostly/partial/minor/none)
524
+ - Aggregates using softmax (temperature-controlled)
525
+ Total: 1-2 LLM calls
526
+
527
+ Usage:
528
+ ```python
529
+ metric = CustomEvalMetric(
530
+ model="gpt-4o-mini",
531
+ threshold=0.8,
532
+ name="Code Quality",
533
+ criteria="Evaluate code readability, efficiency, and best practices",
534
+ evaluation_steps=None, # Auto-generated if not provided
535
+ temperature=0.8, # Controls verdict aggregation (0.1=strict, 1.0=lenient)
536
+ verbose=True
537
+ )
538
+
539
+ ```
540
+
541
+ **Example with manual criteria:**
542
+ ```python
543
+ metric = CustomEvalMetric(
544
+ model="gpt-4o-mini",
545
+ threshold=0.8,
546
+ name="Child-Friendly Explanation",
547
+ criteria="Evaluate if explanation is appropriate for a 10-year-old",
548
+ evaluation_steps=[ # Manual criteria for precise control
549
+ "Uses simple vocabulary appropriate for 10-year-olds",
550
+ "Includes relatable analogies or comparisons",
551
+ "Avoids complex technical jargon",
552
+ "Explanation is engaging and interesting",
553
+ "Concept is broken down into understandable parts"
554
+ ],
555
+ temperature=0.8,
556
+ verbose=True
557
+ )
558
+
559
+ result = await metric.evaluate(test_case)
560
+ ```
561
+
562
+ ## Understanding Evaluation Results
563
+
564
+ ### Score Ranges
565
+
566
+ All metrics use a normalized score range of **0.0 to 1.0**:
567
+ - **0.0**: Complete failure / Does not meet criteria
568
+ - **0.5**: Partial satisfaction / Mixed results
569
+ - **1.0**: Perfect / Fully meets criteria
570
+
571
+ **Score Interpretation:**
572
+ - **0.8 - 1.0**: Excellent
573
+ - **0.7 - 0.8**: Good (typical threshold)
574
+ - **0.5 - 0.7**: Acceptable with issues
575
+ - **0.0 - 0.5**: Poor / Needs improvement
576
+
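+ If you want to report these bands programmatically, a small helper (not part of the library) can map a score onto them:
+ ```python
+ def interpret_score(score: float) -> str:
+     """Map a normalized 0.0-1.0 score onto the interpretation bands above."""
+     if score >= 0.8:
+         return "Excellent"
+     if score >= 0.7:
+         return "Good"
+     if score >= 0.5:
+         return "Acceptable with issues"
+     return "Poor / Needs improvement"
+
+ print(interpret_score(0.73))  # Good
+ ```
+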
577
+ ## Verbose Mode
578
+
579
+ All metrics support a `verbose` parameter that controls output formatting:
580
+
581
+ ### verbose=False (Default) - JSON Output
582
+ Returns simple dictionary with results:
583
+ ```python
584
+ metric = AnswerRelevancyMetric(
585
+ model="gpt-4o-mini",
586
+ threshold=0.7,
587
+ verbose=False # Default
588
+ )
589
+
590
+ result = await metric.evaluate(test_case)
591
+ print(result)
592
+ # Output: Simple dictionary
593
+ # {
594
+ # 'name': 'answerRelevancyMetric',
595
+ # 'score': 0.85,
596
+ # 'success': True,
597
+ # 'reason': 'Answer is highly relevant...',
598
+ # 'evaluation_cost': 0.000234,
599
+ # 'evaluation_log': {...}
600
+ # }
601
+ ```
602
+
603
+ ### verbose=True - Beautiful Console Output
604
+ Displays formatted results with colors, progress bars, and detailed logs:
605
+ ```python
606
+ metric = CustomEvalMetric(
607
+ model="gpt-4o-mini",
608
+ threshold=0.9,
609
+ name="Factual Accuracy",
610
+ criteria="Evaluate the factual accuracy of the response",
611
+ verbose=True # Enable beautiful output
612
+ )
613
+
614
+ result = await metric.evaluate(test_case)
615
+ # Output: Beautiful formatted display (see console output below)
616
+ ```
617
+
618
+ **Console Output with verbose=True:**
619
+
620
+ **Console Output with verbose=True:**
621
+ ```
622
+ ╔════════════════════════════════════════════════════════════════╗
623
+ ║ 📊answerRelevancyMetric ║
624
+ ╚════════════════════════════════════════════════════════════════╝
625
+
626
+ Status: ✅ PASSED
627
+ Score: 0.91 [███████████████████████████░░░] 91%
628
+ Cost: 💰 $0.000178
629
+ Reason:
630
+ The answer correctly identifies Paris as the capital of France, demonstrating a clear understanding of the
631
+ user's request. However, it fails to provide a direct and explicit response, which diminishes its overall
632
+ effectiveness.
633
+
634
+ Evaluation Log:
635
+ ╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
636
+ │ { │
637
+ │ "input_question": "What is the capital of France?", │
638
+ │ "answer": "The capital of France is Paris and it is a beautiful city and known for its art and culture.", │
639
+ │ "user_intent": "The user is seeking information about the capital city of France.", │
640
+ │ "comment_user_intent": "Inferred goal of the question.", │
641
+ │ "statements": [ │
642
+ │ "The capital of France is Paris.", │
643
+ │ "Paris is a beautiful city.", │
644
+ │ "Paris is known for its art and culture." │
645
+ │ ], │
646
+ │ "comment_statements": "Atomic facts extracted from the answer.", │
647
+ │ "verdicts": [ │
648
+ │ { │
649
+ │ "verdict": "fully", │
650
+ │ "reason": "The statement explicitly answers the user's question about the capital of France." │
651
+ │ }, │
652
+ │ { │
653
+ │ "verdict": "minor", │
654
+ │ "reason": "While it mentions Paris, it does not directly answer the user's question." │
655
+ │ }, │
656
+ │ { │
657
+ │ "verdict": "minor", │
658
+ │ "reason": "This statement is related to Paris but does not address the user's question about the │
659
+ │ capital." │
660
+ │ } │
661
+ │ ], │
662
+ │ "comment_verdicts": "Each verdict explains whether a statement is relevant to the question.", │
663
+ │ "verdict_score": 0.9142, │
664
+ │ "comment_verdict_score": "Proportion of relevant statements in the answer.", │
665
+ │ "final_score": 0.9142, │
666
+ │ "comment_final_score": "Score based on the proportion of relevant statements.", │
667
+ │ "threshold": 0.7, │
668
+ │ "success": true, │
669
+ │ "comment_success": "Whether the score exceeds the pass threshold.", │
670
+ │ "final_reason": "The answer correctly identifies Paris as the capital of France, demonstrating a clear │
671
+ │ understanding of the user's request. However, it fails to provide a direct and explicit response, which │
672
+ │ diminishes its overall effectiveness.", │
673
+ │ "comment_reasoning": "Compressed explanation of the key verdict rationales." │
674
+ │ } │
675
+ ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
676
+ ```
677
+
678
+ Features:
679
+ - ✅ Color-coded status (✅ PASSED / ❌ FAILED)
680
+ - 📊 Visual progress bar for scores
681
+ - 💰 Cost tracking display
682
+ - 📝 Formatted reason with word wrapping
683
+ - 📋 Pretty-printed evaluation log in bordered box
684
+
685
+ **When to use verbose=True:**
686
+ - Interactive development and testing
687
+ - Debugging evaluation issues
688
+ - Presentations and demonstrations
689
+ - Manual review of results
690
+
691
+ **When to use verbose=False:**
692
+ - Production environments
693
+ - Batch processing
694
+ - Automated testing
695
+ - When storing results in databases
696
+
697
+ ---
698
+
699
+ ## Working with Results
700
+
701
+ Results are returned as simple dictionaries. Access fields directly:
702
+ ```python
703
+ # Run evaluation
704
+ result = await metric.evaluate(test_case)
705
+
706
+ # Access result fields
707
+ score = result['score'] # 0.0-1.0
708
+ success = result['success'] # True/False
709
+ reason = result['reason'] # String explanation
710
+ cost = result['evaluation_cost'] # USD amount
711
+ log = result['evaluation_log'] # Detailed breakdown
712
+
713
+ # Example: Check success and print score
714
+ if result['success']:
715
+ print(f"✅ Passed with score: {result['score']:.2f}")
716
+ else:
717
+ print(f"❌ Failed: {result['reason']}")
718
+
719
+ # Access detailed verdicts (for verdict-based metrics)
720
+ if 'verdicts' in result['evaluation_log']:
721
+ for verdict in result['evaluation_log']['verdicts']:
722
+ print(f"- {verdict['verdict']}: {verdict['reason']}")
723
+ ```
724
+
725
+ ## Temperature Parameter
726
+
727
+ Many metrics use a **temperature** parameter for score aggregation (via temperature-weighted scoring):
728
+
729
+ - **Lower (0.1-0.3)**: **STRICT** - All scores matter equally, low scores heavily penalize the final result. Best for critical applications where even one bad verdict should fail the metric.
730
+ - **Medium (0.4-0.6)**: **BALANCED** - Moderate weighting between high and low scores. Default behavior for most use cases (default: 0.5).
731
+ - **Higher (0.7-1.0)**: **LENIENT** - High scores (fully/mostly) dominate, effectively ignoring partial/minor/none verdicts. Best for exploratory evaluation or when you want to focus on positive signals.
732
+
733
+ **How it works:** Temperature controls exponential weighting of scores. Higher temperature exponentially boosts high scores (1.0, 0.9), making low scores (0.7, 0.3, 0.0) matter less. Lower temperature treats all scores more equally.
734
+
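+ The exact weighting is internal to the library; the sketch below is only a rough illustration of the softmax-style idea, reusing the verdict-to-score mapping from the example that follows (the `gain` constant is an arbitrary choice for the illustration, not a library parameter).
+ ```python
+ import math
+
+ def aggregate_verdicts(scores: list[float], temperature: float) -> float:
+     """Illustrative softmax-style aggregation (not the library's exact formula)."""
+     gain = 5.0 * temperature  # arbitrary scaling for the illustration
+     # Higher temperature -> exponentially larger weights for high verdict scores.
+     weights = [math.exp(s * gain) for s in scores]
+     return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
+
+ verdict_scores = [1.0, 0.9, 0.7, 0.3, 0.0]  # fully, mostly, partial, minor, none
+ print(round(aggregate_verdicts(verdict_scores, 0.1), 2))  # strict: low verdicts pull the result down
+ print(round(aggregate_verdicts(verdict_scores, 1.0), 2))  # lenient: dominated by "fully"/"mostly"
+ ```
+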
735
+ **Example:**
736
+ ```python
737
+ # Verdicts: [fully, mostly, partial, minor, none] = [1.0, 0.9, 0.7, 0.3, 0.0]
738
+
739
+ # STRICT: All verdicts count
740
+ metric = FaithfulnessMetric(temperature=0.1)
741
+ # Result: ~0.52 (heavily penalized by "minor" and "none")
742
+
743
+ # BALANCED: Moderate weighting
744
+ metric = AnswerRelevancyMetric(temperature=0.5)
745
+ # Result: ~0.73 (balanced consideration)
746
+
747
+ # LENIENT: Only "fully" and "mostly" matter
748
+ metric = TaskSuccessRateMetric(temperature=1.0)
749
+ # Result: ~0.95 (ignores "partial", "minor", "none")
750
+ ```
751
+
752
+ ## LLM Provider Configuration
753
+
754
+ ### OpenAI
755
+ ```python
756
+ import os
757
+ os.environ["OPENAI_API_KEY"] = "your-api-key"
758
+
759
+ from eval_lib import chat_complete
760
+
761
+ response, cost = await chat_complete(
762
+ "gpt-4o-mini", # or "openai:gpt-4o-mini"
763
+ messages=[{"role": "user", "content": "Hello!"}]
764
+ )
765
+ ```
766
+
767
+ ### Azure OpenAI
768
+ ```python
769
+ os.environ["AZURE_OPENAI_API_KEY"] = "your-api-key"
770
+ os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-endpoint.openai.azure.com/"
771
+ os.environ["AZURE_OPENAI_DEPLOYMENT"] = "your-deployment-name"
772
+
773
+ response, cost = await chat_complete(
774
+ "azure:gpt-4o",
775
+ messages=[{"role": "user", "content": "Hello!"}]
776
+ )
777
+ ```
778
+
779
+ ### Google Gemini
780
+ ```python
781
+ os.environ["GOOGLE_API_KEY"] = "your-api-key"
782
+
783
+ response, cost = await chat_complete(
784
+ "google:gemini-2.0-flash",
785
+ messages=[{"role": "user", "content": "Hello!"}]
786
+ )
787
+ ```
788
+
789
+ ### Anthropic Claude
790
+ ```python
791
+ os.environ["ANTHROPIC_API_KEY"] = "your-api-key"
792
+
793
+ response, cost = await chat_complete(
794
+ "anthropic:claude-sonnet-4-0",
795
+ messages=[{"role": "user", "content": "Hello!"}]
796
+ )
797
+ ```
798
+
799
+ ### Ollama (Local)
800
+ ```python
801
+ os.environ["OLLAMA_API_KEY"] = "ollama" # Can be any value
802
+ os.environ["OLLAMA_API_BASE_URL"] = "http://localhost:11434/v1"
803
+
804
+ response, cost = await chat_complete(
805
+ "ollama:llama2",
806
+ messages=[{"role": "user", "content": "Hello!"}]
807
+ )
808
+ ```
809
+
810
+ ## Test Data Generation
811
+
812
+ The library includes a powerful test data generator that can create realistic test cases either from scratch or based on your documents.
813
+
814
+ ### Supported Document Formats
815
+
816
+ - **Documents**: PDF, DOCX, DOC, TXT, RTF, ODT
817
+ - **Structured Data**: CSV, TSV, XLSX, JSON, YAML, XML
818
+ - **Web**: HTML, Markdown
819
+ - **Presentations**: PPTX
820
+ - **Images**: PNG, JPG, JPEG (with OCR support)
821
+
822
+ ### Generate from Scratch
823
+ ```python
824
+ from eval_lib.datagenerator.datagenerator import DatasetGenerator
825
+
826
+ generator = DatasetGenerator(
827
+ model="gpt-4o-mini",
828
+ agent_description="A customer support chatbot",
829
+ input_format="User question or request",
830
+ expected_output_format="Helpful response",
831
+ test_types=["functionality", "edge_cases"],
832
+ max_rows=20,
833
+ question_length="mixed", # "short", "long", or "mixed"
834
+ question_openness="mixed", # "open", "closed", or "mixed"
835
+ trap_density=0.1, # 10% trap questions
836
+ language="en",
837
+ verbose=True # Displays beautiful formatted progress, statistics and full dataset preview
838
+ )
839
+
840
+ dataset = await generator.generate_from_scratch()
841
+ ```
842
+
843
+ ### Generate from Documents
844
+ ```python
845
+ generator = DatasetGenerator(
846
+ model="gpt-4o-mini",
847
+ agent_description="Technical support agent",
848
+ input_format="Technical question",
849
+ expected_output_format="Detailed answer with references",
850
+ test_types=["retrieval", "accuracy"],
851
+ max_rows=50,
852
+ chunk_size=1024,
853
+ chunk_overlap=100,
854
+ max_chunks=30,
855
+ verbose=True
856
+ )
857
+
858
+ file_paths = ["docs/user_guide.pdf", "docs/faq.md"]
859
+ dataset = await generator.generate_from_documents(file_paths)
860
+
861
+ # Convert to test cases
862
+ from eval_lib import EvalTestCase
863
+ test_cases = [
864
+ EvalTestCase(
865
+ input=item["input"],
866
+ expected_output=item["expected_output"],
867
+ retrieval_context=[item.get("context", "")]
868
+ )
869
+ for item in dataset
870
+ ]
871
+ ```
872
+
873
+ ## Best Practices
874
+
875
+ ### 1. Choose the Right Model
876
+
877
+ - **G-Eval**: Use GPT-4 for best results with probability-weighted scoring
878
+ - **Other Metrics**: GPT-4o-mini is cost-effective and sufficient
879
+ - **Custom Eval**: Use GPT-4 for complex criteria, GPT-4o-mini for simple ones
880
+
881
+ ### 2. Set Appropriate Thresholds
882
+ ```python
883
+ # Safety metrics - high bar
884
+ BiasMetric(threshold=0.8)
885
+ ToxicityMetric(threshold=0.85)
886
+
887
+ # Quality metrics - moderate bar
888
+ AnswerRelevancyMetric(threshold=0.7)
889
+ FaithfulnessMetric(threshold=0.75)
890
+
891
+ # Agent metrics - context-dependent
892
+ TaskSuccessRateMetric(threshold=0.7) # Most tasks
893
+ RoleAdherenceMetric(threshold=0.9) # Strict role requirements
894
+ ```
895
+
896
+ ### 3. Use Temperature Wisely
897
+ ```python
898
+ # STRICT evaluation - critical applications where all verdicts matter
899
+ # Use when: You need high accuracy and can't tolerate bad verdicts
900
+ metric = FaithfulnessMetric(temperature=0.1)
901
+
902
+ # BALANCED - general use (default)
903
+ # Use when: Standard evaluation with moderate requirements
904
+ metric = AnswerRelevancyMetric(temperature=0.5)
905
+
906
+ # LENIENT - exploratory evaluation or focusing on positive signals
907
+ # Use when: You want to reward good answers and ignore occasional mistakes
908
+ metric = TaskSuccessRateMetric(temperature=1.0)
909
+ ```
910
+
911
+ **Real-world examples:**
912
+ ```python
913
+ # Production RAG system - must be accurate
914
+ faithfulness = FaithfulnessMetric(
915
+ model="gpt-4o-mini",
916
+ threshold=0.8,
917
+ temperature=0.2  # STRICT: "none", "minor", and "partial" verdicts significantly lower the score
918
+ )
919
+
920
+ # Customer support chatbot - moderate standards
921
+ role_adherence = RoleAdherenceMetric(
922
+ model="gpt-4o-mini",
923
+ threshold=0.7,
924
+ temperature=0.5 # BALANCED: Standard evaluation
925
+ )
926
+
927
+ # Experimental feature testing - focus on successes
928
+ task_success = TaskSuccessRateMetric(
929
+ model="gpt-4o-mini",
930
+ threshold=0.6,
931
+ temperature=1.0 # LENIENT: Focuses on "fully" and "mostly" completions
932
+ )
933
+ ```
934
+
935
+ ### 4. Leverage Evaluation Logs
936
+ ```python
937
+ # Enable verbose mode for automatic detailed display
938
+ metric = AnswerRelevancyMetric(
939
+ model="gpt-4o-mini",
940
+ threshold=0.7,
941
+ verbose=True # Automatic formatted output with full logs
942
+ )
943
+
944
+ # Or access logs programmatically
945
+ result = await metric.evaluate(test_case)
946
+ log = result['evaluation_log']
947
+
948
+ # Debugging failures
949
+ if not result['success']:
950
+ # All details available in log
951
+ reason = result['reason']
952
+ verdicts = log.get('verdicts', [])
953
+ steps = log.get('evaluation_steps', [])
954
+ ```
955
+
956
+ ### 5. Batch Evaluation for Efficiency
957
+ ```python
958
+ # Evaluate multiple test cases at once
959
+ results = await evaluate(
960
+ test_cases=[test_case1, test_case2, test_case3],
961
+ metrics=[metric1, metric2, metric3]
962
+ )
963
+
964
+ # Calculate aggregate statistics
965
+ total_cost = sum(
966
+ metric.evaluation_cost or 0
967
+ for _, test_results in results
968
+ for result in test_results
969
+ for metric in result.metrics_data
970
+ )
971
+
972
+ success_rate = sum(
973
+ 1 for _, test_results in results
974
+ for result in test_results
975
+ if result.success
976
+ ) / len(results)
977
+
978
+ print(f"Total cost: ${total_cost:.4f}")
979
+ print(f"Success rate: {success_rate:.2%}")
980
+ ```
981
+
982
+
983
+ ## Environment Variables
984
+
985
+ | Variable | Description | Required |
986
+ |----------|-------------|----------|
987
+ | `OPENAI_API_KEY` | OpenAI API key | For OpenAI |
988
+ | `AZURE_OPENAI_API_KEY` | Azure OpenAI API key | For Azure |
989
+ | `AZURE_OPENAI_ENDPOINT` | Azure OpenAI endpoint URL | For Azure |
990
+ | `AZURE_OPENAI_DEPLOYMENT` | Azure deployment name | For Azure |
991
+ | `GOOGLE_API_KEY` | Google API key | For Google |
992
+ | `ANTHROPIC_API_KEY` | Anthropic API key | For Anthropic |
993
+ | `OLLAMA_API_KEY` | Ollama API key | For Ollama |
994
+ | `OLLAMA_API_BASE_URL` | Ollama base URL | For Ollama |
995
+
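+ As a convenience, you can verify up front that the variables for your chosen provider are set; the helper below is a hypothetical sketch, not part of eval-ai-library:
+ ```python
+ import os
+
+ # Required variables per provider, taken from the table above.
+ REQUIRED = {
+     "openai": ["OPENAI_API_KEY"],
+     "azure": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_DEPLOYMENT"],
+     "google": ["GOOGLE_API_KEY"],
+     "anthropic": ["ANTHROPIC_API_KEY"],
+     "ollama": ["OLLAMA_API_KEY", "OLLAMA_API_BASE_URL"],
+ }
+
+ def check_env(provider: str) -> None:
+     missing = [name for name in REQUIRED[provider] if not os.environ.get(name)]
+     if missing:
+         raise RuntimeError(f"Missing environment variables for {provider}: {missing}")
+
+ check_env("openai")
+ ```
+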
996
+ ## Contributing
997
+
998
+ Contributions are welcome! Please feel free to submit a Pull Request.
999
+
1000
+ 1. Fork the repository
1001
+ 2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
1002
+ 3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
1003
+ 4. Push to the branch (`git push origin feature/AmazingFeature`)
1004
+ 5. Open a Pull Request
1005
+
1006
+ ## License
1007
+
1008
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
1009
+
1010
+ ## Citation
1011
+
1012
+ If you use this library in your research, please cite:
1013
+ ```bibtex
1014
+ @software{eval_ai_library,
1015
+ author = {Meshkov, Aleksandr},
1016
+ title = {Eval AI Library: Comprehensive AI Model Evaluation Framework},
1017
+ year = {2025},
1018
+ url = {https://github.com/meshkovQA/Eval-ai-library.git}
1019
+ }
1020
+ ```
1021
+
1022
+ ### References
1023
+
1024
+ This library implements techniques from:
1025
+ ```bibtex
1026
+ @inproceedings{liu2023geval,
1027
+ title={G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment},
1028
+ author={Liu, Yang and Iter, Dan and Xu, Yichong and Wang, Shuohang and Xu, Ruochen and Zhu, Chenguang},
1029
+ booktitle={Proceedings of EMNLP},
1030
+ year={2023}
1031
+ }
1032
+ ```
1033
+
1034
+ ## Support
1035
+
1036
+ - 📧 Email: alekslynx90@gmail.com
1037
+ - 🐛 Issues: [GitHub Issues](https://github.com/meshkovQA/Eval-ai-library/issues)
1038
+ - 📖 Documentation: [Full Documentation](https://github.com/meshkovQA/Eval-ai-library#readme)
1039
+
1040
+ ## Acknowledgments
1041
+
1042
+ This library was developed to provide a comprehensive solution for evaluating AI models across different use cases and providers, with state-of-the-art techniques including G-Eval's probability-weighted scoring and automatic chain-of-thought generation.