eval-ai-library 0.1.0__py3-none-any.whl


Potentially problematic release: this version of eval-ai-library might be problematic.

Files changed (34)
  1. eval_ai_library-0.1.0.dist-info/METADATA +753 -0
  2. eval_ai_library-0.1.0.dist-info/RECORD +34 -0
  3. eval_ai_library-0.1.0.dist-info/WHEEL +5 -0
  4. eval_ai_library-0.1.0.dist-info/licenses/LICENSE +21 -0
  5. eval_ai_library-0.1.0.dist-info/top_level.txt +1 -0
  6. eval_lib/__init__.py +122 -0
  7. eval_lib/agent_metrics/__init__.py +12 -0
  8. eval_lib/agent_metrics/knowledge_retention_metric/knowledge_retention.py +231 -0
  9. eval_lib/agent_metrics/role_adherence_metric/role_adherence.py +251 -0
  10. eval_lib/agent_metrics/task_success_metric/task_success_rate.py +347 -0
  11. eval_lib/agent_metrics/tools_correctness_metric/tool_correctness.py +106 -0
  12. eval_lib/datagenerator/datagenerator.py +230 -0
  13. eval_lib/datagenerator/document_loader.py +510 -0
  14. eval_lib/datagenerator/prompts.py +192 -0
  15. eval_lib/evaluate.py +335 -0
  16. eval_lib/evaluation_schema.py +63 -0
  17. eval_lib/llm_client.py +286 -0
  18. eval_lib/metric_pattern.py +229 -0
  19. eval_lib/metrics/__init__.py +25 -0
  20. eval_lib/metrics/answer_precision_metric/answer_precision.py +405 -0
  21. eval_lib/metrics/answer_relevancy_metric/answer_relevancy.py +195 -0
  22. eval_lib/metrics/bias_metric/bias.py +114 -0
  23. eval_lib/metrics/contextual_precision_metric/contextual_precision.py +102 -0
  24. eval_lib/metrics/contextual_recall_metric/contextual_recall.py +91 -0
  25. eval_lib/metrics/contextual_relevancy_metric/contextual_relevancy.py +169 -0
  26. eval_lib/metrics/custom_metric/custom_eval.py +303 -0
  27. eval_lib/metrics/faithfulness_metric/faithfulness.py +140 -0
  28. eval_lib/metrics/geval/geval.py +326 -0
  29. eval_lib/metrics/restricted_refusal_metric/restricted_refusal.py +102 -0
  30. eval_lib/metrics/toxicity_metric/toxicity.py +113 -0
  31. eval_lib/price.py +37 -0
  32. eval_lib/py.typed +1 -0
  33. eval_lib/testcases_schema.py +27 -0
  34. eval_lib/utils.py +99 -0
@@ -0,0 +1,753 @@
1
+ Metadata-Version: 2.4
2
+ Name: eval-ai-library
3
+ Version: 0.1.0
4
+ Summary: Comprehensive AI Model Evaluation Framework with support for multiple LLM providers
5
+ Author-email: Aleksandr Meshkov <alekslynx90@gmail.com>
6
+ License: MIT
7
+ Project-URL: Homepage, https://github.com/meshkovQA/Eval-ai-library
8
+ Project-URL: Documentation, https://github.com/meshkovQA/Eval-ai-library#readme
9
+ Project-URL: Repository, https://github.com/meshkovQA/Eval-ai-library
10
+ Project-URL: Bug Tracker, https://github.com/meshkovQA/Eval-ai-library/issues
11
+ Keywords: ai,evaluation,llm,rag,metrics,testing,quality-assurance
12
+ Classifier: Development Status :: 4 - Beta
13
+ Classifier: Intended Audience :: Developers
14
+ Classifier: Intended Audience :: Science/Research
15
+ Classifier: License :: OSI Approved :: MIT License
16
+ Classifier: Programming Language :: Python :: 3
17
+ Classifier: Programming Language :: Python :: 3.9
18
+ Classifier: Programming Language :: Python :: 3.10
19
+ Classifier: Programming Language :: Python :: 3.11
20
+ Classifier: Programming Language :: Python :: 3.12
21
+ Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
22
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
23
+ Requires-Python: >=3.9
24
+ Description-Content-Type: text/markdown
25
+ License-File: LICENSE
26
+ Requires-Dist: openai>=1.0.0
27
+ Requires-Dist: anthropic>=0.18.0
28
+ Requires-Dist: google-genai>=0.2.0
29
+ Requires-Dist: pydantic>=2.0.0
30
+ Requires-Dist: numpy>=1.24.0
31
+ Provides-Extra: dev
32
+ Requires-Dist: pytest>=7.0.0; extra == "dev"
33
+ Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
34
+ Requires-Dist: black>=23.0.0; extra == "dev"
35
+ Requires-Dist: flake8>=6.0.0; extra == "dev"
36
+ Requires-Dist: mypy>=1.0.0; extra == "dev"
37
+ Requires-Dist: isort>=5.12.0; extra == "dev"
38
+ Provides-Extra: docs
39
+ Requires-Dist: sphinx>=6.0.0; extra == "docs"
40
+ Requires-Dist: sphinx-rtd-theme>=1.2.0; extra == "docs"
41
+ Dynamic: license-file
42
+
43
+ # Eval AI Library
44
+
45
+ [![Python Version](https://img.shields.io/badge/python-3.9%2B-blue)](https://www.python.org/downloads/)
46
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
47
+
48
+ Comprehensive AI Model Evaluation Framework with advanced techniques including **Probability-Weighted Scoring** and **Auto Chain-of-Thought**. It supports multiple LLM providers and provides 15+ evaluation metrics for RAG systems and AI agents.
49
+
50
+ ## Features
51
+
52
+ - 🎯 **15+ Evaluation Metrics**: RAG metrics and agent-specific evaluations
53
+ - 🧠 **G-Eval Implementation**: State-of-the-art evaluation with probability-weighted scoring
54
+ - 🔗 **Chain-of-Thought**: Automatic generation of evaluation steps from criteria
55
+ - 🤖 **Multi-Provider Support**: OpenAI, Azure OpenAI, Google Gemini, Anthropic Claude, Ollama
56
+ - 📊 **RAG Metrics**: Answer relevancy, faithfulness, contextual precision/recall, and more
57
+ - 🔧 **Agent Metrics**: Tool correctness, task success rate, role adherence, knowledge retention
58
+ - 🎨 **Custom Metrics**: Advanced custom evaluation with CoT and probability weighting
59
+ - 📦 **Data Generation**: Built-in test case generator from documents
60
+ - ⚡ **Async Support**: Full async/await support for efficient evaluation
61
+ - 💰 **Cost Tracking**: Automatic cost calculation for LLM API calls
62
+ - 📝 **Detailed Logging**: Comprehensive evaluation logs for transparency
63
+
64
+ ## Installation
65
+ ```bash
66
+ pip install eval-ai-library
67
+ ```
68
+
69
+ ### Development Installation
70
+ ```bash
71
+ git clone https://github.com/meshkovQA/Eval-ai-library.git
72
+ cd Eval-ai-library
73
+ pip install -e ".[dev]"
74
+ ```
75
+
76
+ ## Quick Start
77
+
78
+ ### Basic RAG Evaluation
79
+ ```python
80
+ import asyncio
81
+ from eval_lib import (
82
+     evaluate,
+     EvalTestCase,
+     AnswerRelevancyMetric,
+     FaithfulnessMetric
+ )
+
+ async def main():
+     # Create test case
+     test_case = EvalTestCase(
+         input="What is the capital of France?",
+         actual_output="The capital of France is Paris, a beautiful city known for its art and culture.",
+         expected_output="Paris",
+         retrieval_context=["Paris is the capital and largest city of France."]
+     )
+
+     # Define metrics
+     metrics = [
+         AnswerRelevancyMetric(
+             model="gpt-4o-mini",
+             threshold=0.7,
+             temperature=0.5 # Softmax temperature for score aggregation
+         ),
+         FaithfulnessMetric(
+             model="gpt-4o-mini",
+             threshold=0.8,
+             temperature=0.5
+         )
+     ]
+
+     # Evaluate
+     results = await evaluate(
+         test_cases=[test_case],
+         metrics=metrics
+     )
+
+     # Print results with detailed logs
+     for _, test_results in results:
+         for result in test_results:
+             print(f"Success: {result.success}")
+             for metric in result.metrics_data:
+                 print(f"{metric.name}: {metric.score:.2f}")
+                 print(f"Reason: {metric.reason}")
+                 print(f"Cost: ${metric.evaluation_cost:.6f}")
+                 # Access detailed evaluation log
+                 if hasattr(metric, 'evaluation_log'):
+                     print(f"Log: {metric.evaluation_log}")
128
+
129
+ asyncio.run(main())
130
+ ```
131
+
132
+ ### G-Eval with Probability-Weighted Scoring
133
+
134
+ G-Eval implements the state-of-the-art evaluation method from the paper ["G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment"](https://arxiv.org/abs/2303.16634). It uses **probability-weighted scoring** (score = Σ p(si) × si) for fine-grained, continuous evaluation scores.
135
+ ```python
136
+ from eval_lib import GEval, EvalTestCase
137
+
138
+ async def evaluate_with_geval():
139
+ test_case = EvalTestCase(
140
+ input="Explain quantum computing to a 10-year-old",
141
+ actual_output="Quantum computers are like super-powerful regular computers that use special tiny particles to solve really hard problems much faster.",
142
+ expected_output="A simple explanation using analogies suitable for children"
143
+ )
144
+
145
+ # G-Eval with auto chain-of-thought
146
+ metric = GEval(
147
+ model="gpt-4o", # Works best with GPT-4
148
+ threshold=0.7, # Score range: 0-100
149
+ name="Clarity & Simplicity",
150
+ criteria="Evaluate how clear and age-appropriate the explanation is for a 10-year-old child",
151
+ # evaluation_steps is auto-generated from criteria if not provided
152
+ n_samples=20, # Number of samples for probability estimation (default: 20)
153
+ sampling_temperature=2.0 # High temperature for diverse sampling (default: 2.0)
154
+ )
155
+
156
+ result = await metric.evaluate(test_case)
157
+
158
+ print(f"Score: {result['score']:.2f}/100") # Fine-grained score like 73.45
159
+ print(f"Success: {result['success']}")
160
+ print(f"Reason: {result['reason']}")
161
+ print(f"Sampled scores: {result['metadata']['sampled_scores']}") # See all 20 samples
162
+ print(f"Score distribution: {result['evaluation_log']['score_distribution']}")
163
+
164
+ asyncio.run(evaluate_with_geval())
165
+ ```
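
For intuition, the probability-weighted score can be reproduced from the sampled scores alone: treat the `n_samples` draws as an empirical estimate of p(si) and take the expectation. A minimal standalone sketch (not the library's internal code):

```python
from collections import Counter

def probability_weighted_score(sampled_scores):
    """Approximate score = sum(p(s_i) * s_i) using the empirical
    frequency of each score value among the sampled draws."""
    counts = Counter(sampled_scores)
    n = len(sampled_scores)
    return sum((count / n) * score for score, count in counts.items())

# 20 hypothetical samples drawn at a high sampling temperature
samples = [80, 70, 80, 90, 70, 80, 80, 60, 70, 80,
           90, 80, 70, 80, 80, 70, 60, 80, 90, 70]
print(probability_weighted_score(samples))  # 76.5 -- a fine-grained score, not a flat integer
```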
166
+
167
+ ### Custom Evaluation with Advanced Features
168
+
169
+ The CustomEvalMetric now includes **Chain-of-Thought** and **Probability-Weighted Scoring** from G-Eval for maximum accuracy:
170
+ ```python
171
+ from eval_lib import CustomEvalMetric
172
+
173
+ async def custom_evaluation():
174
+ test_case = EvalTestCase(
175
+ input="How do I reset my password?",
176
+ actual_output="To reset your password, click 'Forgot Password' on the login page, enter your email, and follow the link sent to your inbox.",
177
+ expected_output="Clear step-by-step instructions"
178
+ )
179
+
180
+ metric = CustomEvalMetric(
181
+ model="gpt-4o",
182
+ threshold=0.7,
183
+ name="HelpfulnessScore",
184
+ criteria="Evaluate if the response provides clear, actionable steps that directly answer the user's question"
185
+ # Auto-generates evaluation steps using CoT
186
+ # Auto-applies probability-weighted scoring (20 samples)
187
+ )
188
+
189
+ result = await metric.evaluate(test_case)
190
+
191
+ # Access detailed evaluation log
192
+ log = result['evaluation_log']
193
+ print(f"Auto-generated steps: {log['evaluation_steps']}")
194
+ print(f"Sampled scores: {log['sampled_scores']}")
195
+ print(f"Score distribution: {log['score_distribution']}")
196
+ print(f"Final score: {log['final_score']:.2f}")
197
+
198
+ asyncio.run(custom_evaluation())
199
+ ```
200
+
201
+ ### Agent Evaluation
202
+ ```python
203
+ from eval_lib import (
204
+ evaluate,
205
+ EvalTestCase,
206
+ ToolCorrectnessMetric,
207
+ TaskSuccessRateMetric
208
+ )
209
+
210
+ async def evaluate_agent():
211
+ test_case = EvalTestCase(
212
+ input="Book a flight to New York for tomorrow",
213
+ actual_output="I've found available flights and booked your trip to New York for tomorrow.",
214
+ tools_called=["search_flights", "book_flight"],
215
+ expected_tools=["search_flights", "book_flight"]
216
+ )
217
+
218
+ metrics = [
219
+ ToolCorrectnessMetric(model="gpt-4o-mini", threshold=0.8),
220
+ TaskSuccessRateMetric(
221
+ model="gpt-4o-mini",
222
+ threshold=0.7,
223
+ temperature=1.1 # Controls score aggregation strictness
224
+ )
225
+ ]
226
+
227
+ results = await evaluate([test_case], metrics)
228
+ return results
229
+
230
+ asyncio.run(evaluate_agent())
231
+ ```
232
+
233
+ ### Conversational Evaluation
234
+ ```python
235
+ from eval_lib import (
236
+ evaluate_conversations,
237
+ ConversationalEvalTestCase,
238
+ EvalTestCase,
239
+ RoleAdherenceMetric,
240
+ KnowledgeRetentionMetric
241
+ )
242
+
243
+ async def evaluate_conversation():
244
+ conversation = ConversationalEvalTestCase(
245
+ chatbot_role="You are a helpful customer support assistant. Be professional and empathetic.",
246
+ turns=[
247
+ EvalTestCase(
248
+ input="I need help with my order",
249
+ actual_output="I'd be happy to help you with your order. Could you please provide your order number?"
250
+ ),
251
+ EvalTestCase(
252
+ input="It's #12345",
253
+ actual_output="Thank you! Let me look up order #12345 for you."
254
+ )
255
+ ]
256
+ )
257
+
258
+ metrics = [
259
+ RoleAdherenceMetric(
260
+ model="gpt-4o-mini",
261
+ threshold=0.8,
262
+ temperature=0.5 # Softmax temperature for verdict aggregation
263
+ ),
264
+ KnowledgeRetentionMetric(
265
+ model="gpt-4o-mini",
266
+ threshold=0.7,
267
+ temperature=0.5
268
+ )
269
+ ]
270
+
271
+ # Set chatbot role for role adherence
272
+ metrics[0].chatbot_role = conversation.chatbot_role
273
+
274
+ results = await evaluate_conversations([conversation], metrics)
275
+
276
+ # Access detailed logs
277
+ for result in results:
278
+ print(f"Dialogue: {result.evaluation_log['dialogue']}")
279
+ print(f"Verdicts: {result.evaluation_log['verdicts']}")
280
+ print(f"Score: {result.score}")
281
+
282
+ return results
283
+
284
+ asyncio.run(evaluate_conversation())
285
+ ```
286
+
287
+ ## Available Metrics
288
+
289
+ ### RAG Metrics
290
+
291
+ #### AnswerRelevancyMetric
292
+ Measures how relevant the answer is to the question using multi-step evaluation:
293
+ 1. Infers user intent
294
+ 2. Extracts atomic statements from answer
295
+ 3. Generates verdicts (fully/mostly/partial/minor/none) for each statement
296
+ 4. Aggregates using softmax
297
+ ```python
298
+ metric = AnswerRelevancyMetric(
299
+ model="gpt-4o-mini",
300
+ threshold=0.7,
301
+ temperature=0.5 # Controls aggregation strictness
302
+ )
303
+ ```
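
A minimal usage sketch for running this metric on its own, mirroring the single-metric `evaluate()` pattern from the G-Eval and Custom Eval examples above (the result keys `score` and `reason` are assumed to match those examples):

```python
import asyncio
from eval_lib import AnswerRelevancyMetric, EvalTestCase

async def check_relevancy():
    test_case = EvalTestCase(
        input="What payment methods do you accept?",
        actual_output="We accept credit cards, PayPal, and bank transfers.",
        retrieval_context=["Payments can be made by credit card, PayPal, or bank transfer."]
    )
    metric = AnswerRelevancyMetric(model="gpt-4o-mini", threshold=0.7, temperature=0.5)
    result = await metric.evaluate(test_case)
    print(result["score"], result["reason"])

asyncio.run(check_relevancy())
```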
304
+
305
+ #### FaithfulnessMetric
306
+ Checks if the answer is faithful to the provided context:
307
+ 1. Extracts factual claims from answer
308
+ 2. Verifies each claim against context (fully/mostly/partial/minor/none)
309
+ 3. Aggregates faithfulness score
310
+ ```python
311
+ metric = FaithfulnessMetric(
312
+ model="gpt-4o-mini",
313
+ threshold=0.8,
314
+ temperature=0.5
315
+ )
316
+ ```
317
+
318
+ #### ContextualRelevancyMetric
319
+ Evaluates relevance of retrieved context to the question.
320
+ ```python
321
+ metric = ContextualRelevancyMetric(
322
+ model="gpt-4o-mini",
323
+ threshold=0.7,
324
+ temperature=0.5
325
+ )
326
+ ```
327
+
328
+ #### ContextualPrecisionMetric
329
+ Measures precision of context retrieval - are the retrieved chunks relevant?
330
+ ```python
331
+ metric = ContextualPrecisionMetric(
332
+ model="gpt-4o-mini",
333
+ threshold=0.7
334
+ )
335
+ ```
336
+
337
+ #### ContextualRecallMetric
338
+ Measures recall of relevant context - was all relevant information retrieved?
339
+ ```python
340
+ metric = ContextualRecallMetric(
341
+ model="gpt-4o-mini",
342
+ threshold=0.7
343
+ )
344
+ ```
345
+
346
+ #### BiasMetric
347
+ Detects bias and prejudice in AI-generated output. Score range: 0 (strong bias) to 100 (no bias).
348
+ ```python
349
+ metric = BiasMetric(
350
+ model="gpt-4o-mini",
351
+ threshold=0.7 # Score range: 0-100
352
+ )
353
+ ```
354
+
355
+ #### ToxicityMetric
356
+ Identifies toxic content in responses. Score range: 0 (highly toxic) to 100 (no toxicity).
357
+ ```python
358
+ metric = ToxicityMetric(
359
+ model="gpt-4o-mini",
360
+ threshold=0.7 # Score range: 0-100
361
+ )
362
+ ```
363
+
364
+ #### RestrictedRefusalMetric
365
+ Checks if the AI appropriately refuses harmful or out-of-scope requests.
366
+ ```python
367
+ metric = RestrictedRefusalMetric(
368
+ model="gpt-4o-mini",
369
+ threshold=0.7
370
+ )
371
+ ```
372
+
373
+ ### Agent Metrics
374
+
375
+ #### ToolCorrectnessMetric
376
+ Validates that the agent calls the correct tools in the right sequence.
377
+ ```python
378
+ metric = ToolCorrectnessMetric(
379
+ model="gpt-4o-mini",
380
+ threshold=0.8
381
+ )
382
+ ```
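
As a quick sanity check outside the batch `evaluate()` pipeline, the metric can also be run on a single test case; in this sketch the agent skipped a required tool, so the score should drop (result keys are assumed to match the earlier examples):

```python
import asyncio
from eval_lib import ToolCorrectnessMetric, EvalTestCase

async def check_tools():
    test_case = EvalTestCase(
        input="Book a flight to New York for tomorrow",
        actual_output="I found several flights to New York for tomorrow.",
        tools_called=["search_flights"],  # booking step was skipped
        expected_tools=["search_flights", "book_flight"]
    )
    metric = ToolCorrectnessMetric(model="gpt-4o-mini", threshold=0.8)
    result = await metric.evaluate(test_case)
    print(result["score"], result["success"])  # expected: low score, success False

asyncio.run(check_tools())
```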
383
+
384
+ #### TaskSuccessRateMetric
385
+ Measures task completion success across conversation:
386
+ 1. Infers user's goal
387
+ 2. Generates success criteria
388
+ 3. Evaluates each criterion (fully/mostly/partial/minor/none)
389
+ 4. Aggregates into final score
390
+ ```python
391
+ metric = TaskSuccessRateMetric(
392
+ model="gpt-4o-mini",
393
+ threshold=0.7,
394
+ temperature=1.1 # Higher = more lenient aggregation
395
+ )
396
+ ```
397
+
398
+ #### RoleAdherenceMetric
399
+ Evaluates how well the agent maintains its assigned role:
400
+ 1. Compares each response against role description
401
+ 2. Generates adherence verdicts (fully/mostly/partial/minor/none)
402
+ 3. Aggregates across all turns
403
+ ```python
404
+ metric = RoleAdherenceMetric(
405
+ model="gpt-4o-mini",
406
+ threshold=0.8,
407
+ temperature=0.5
408
+ )
409
+ # Don't forget to set: metric.chatbot_role = "Your role description"
410
+ ```
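
A compact reminder of that wiring, following the conversational example above (sketch):

```python
import asyncio
from eval_lib import (
    evaluate_conversations,
    ConversationalEvalTestCase,
    EvalTestCase,
    RoleAdherenceMetric
)

async def check_role():
    conversation = ConversationalEvalTestCase(
        chatbot_role="You are a formal banking assistant.",
        turns=[
            EvalTestCase(
                input="Yo, what's my balance?",
                actual_output="Your current balance is $1,250.40. Is there anything else I can help you with?"
            )
        ]
    )
    metric = RoleAdherenceMetric(model="gpt-4o-mini", threshold=0.8, temperature=0.5)
    metric.chatbot_role = conversation.chatbot_role  # required, as noted above
    return await evaluate_conversations([conversation], [metric])

asyncio.run(check_role())
```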
411
+
412
+ #### KnowledgeRetentionMetric
413
+ Checks if the agent remembers and recalls information from earlier in the conversation:
414
+ 1. Analyzes conversation for retention quality
415
+ 2. Generates retention verdicts (fully/mostly/partial/minor/none)
416
+ 3. Aggregates into retention score
417
+ ```python
418
+ metric = KnowledgeRetentionMetric(
419
+ model="gpt-4o-mini",
420
+ threshold=0.7,
421
+ temperature=0.5
422
+ )
423
+ ```
424
+
425
+ ### Custom & Advanced Metrics
426
+
427
+ #### GEval
428
+ State-of-the-art evaluation using probability-weighted scoring from the [G-Eval paper](https://arxiv.org/abs/2303.16634):
429
+ - **Auto Chain-of-Thought**: Automatically generates evaluation steps from criteria
430
+ - **Probability-Weighted Scoring**: score = Σ p(si) × si using 20 samples
431
+ - **Fine-Grained Scores**: Continuous scores (e.g., 73.45) instead of integers
432
+ ```python
433
+ metric = GEval(
434
+ model="gpt-4o", # Best with GPT-4 for probability estimation
435
+ threshold=0.7,
436
+ name="Coherence",
437
+ criteria="Evaluate logical flow and structure of the response",
438
+ evaluation_steps=None, # Auto-generated if not provided
439
+ n_samples=20, # Number of samples for probability estimation
440
+ sampling_temperature=2.0 # High temperature for diverse sampling
441
+ )
442
+ ```
443
+
444
+ #### CustomEvalMetric
445
+ Enhanced custom evaluation with CoT and probability-weighted scoring:
446
+ ```python
447
+ metric = CustomEvalMetric(
448
+ model="gpt-4o",
449
+ threshold=0.7,
450
+ name="QualityScore",
451
+ criteria="Your custom evaluation criteria"
452
+ # Automatically uses:
453
+ # - Chain-of-Thought (generates evaluation steps)
454
+ # - Probability-Weighted Scoring (20 samples, temp=2.0)
455
+ )
456
+ ```
457
+
458
+ ## Understanding Evaluation Results
459
+
460
+ ### Score Ranges
461
+
462
+ - **RAG Metrics** (Answer Relevancy, Faithfulness, etc.): 0.0 - 1.0
463
+ - **Safety Metrics** (Bias, Toxicity): 0.0 - 1.0
464
+ - **G-Eval & Custom Metrics**: 0.0 - 1.0
465
+ - **Agent Metrics** (Task Success, Role Adherence, etc.): 0.0 - 1.0
466
+
467
+ ## Temperature Parameter
468
+
469
+ Many metrics use a **temperature** parameter for score aggregation (via softmax):
470
+
471
+ - **Lower (0.1-0.3)**: **Strict** - low verdicts dominate, so a single weak verdict pulls the score down heavily
472
+ - **Medium (0.4-0.6)**: **Balanced** - default behavior
473
+ - **Higher (0.8-1.5)**: **Lenient** - closer to arithmetic mean, more forgiving
474
+ ```python
475
+ # Strict evaluation - one bad verdict significantly lowers score
476
+ metric = AnswerRelevancyMetric(model="gpt-4o-mini", threshold=0.7, temperature=0.3)
477
+
478
+ # Lenient evaluation - focuses on overall trend
479
+ metric = TaskSuccessRateMetric(model="gpt-4o-mini", threshold=0.7, temperature=1.2)
480
+ ```
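
To see the effect numerically, here is a small, self-contained sketch of softmax-weighted aggregation over a set of verdict scores. The verdict-to-number mapping and the exact weighting scheme are illustrative assumptions for building intuition, not the library's internal implementation:

```python
import numpy as np

def aggregate(scores, temperature):
    """Softmax-weighted average of verdict scores (illustrative).

    Weights favour the lowest scores, so a small temperature lets any weak
    verdict dominate (strict), while a large temperature flattens the weights
    towards a plain arithmetic mean (lenient)."""
    s = np.asarray(scores, dtype=float)
    weights = np.exp(-s / temperature)
    weights /= weights.sum()
    return float(np.dot(weights, s))

verdict_scores = [1.0, 1.0, 0.75, 0.25, 1.0]   # e.g. fully/fully/mostly/minor/fully
print(aggregate(verdict_scores, 0.3))  # ~0.44: strict, dragged towards the 0.25 verdict
print(aggregate(verdict_scores, 1.2))  # ~0.72: lenient, much closer to the plain mean of 0.80
```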
481
+
482
+ ## LLM Provider Configuration
483
+
484
+ ### OpenAI
485
+ ```python
486
+ import os
487
+ os.environ["OPENAI_API_KEY"] = "your-api-key"
488
+
489
+ from eval_lib import chat_complete
490
+
491
+ response, cost = await chat_complete(
492
+ "gpt-4o-mini", # or "openai:gpt-4o-mini"
493
+ messages=[{"role": "user", "content": "Hello!"}]
494
+ )
495
+ ```
496
+
497
+ ### Azure OpenAI
498
+ ```python
499
+ os.environ["AZURE_OPENAI_API_KEY"] = "your-api-key"
500
+ os.environ["AZURE_OPENAI_ENDPOINT"] = "https://your-endpoint.openai.azure.com/"
501
+ os.environ["AZURE_OPENAI_DEPLOYMENT"] = "your-deployment-name"
502
+
503
+ response, cost = await chat_complete(
504
+ "azure:gpt-4o",
505
+ messages=[{"role": "user", "content": "Hello!"}]
506
+ )
507
+ ```
508
+
509
+ ### Google Gemini
510
+ ```python
511
+ os.environ["GOOGLE_API_KEY"] = "your-api-key"
512
+
513
+ response, cost = await chat_complete(
514
+ "google:gemini-2.0-flash",
515
+ messages=[{"role": "user", "content": "Hello!"}]
516
+ )
517
+ ```
518
+
519
+ ### Anthropic Claude
520
+ ```python
521
+ os.environ["ANTHROPIC_API_KEY"] = "your-api-key"
522
+
523
+ response, cost = await chat_complete(
524
+ "anthropic:claude-sonnet-4-0",
525
+ messages=[{"role": "user", "content": "Hello!"}]
526
+ )
527
+ ```
528
+
529
+ ### Ollama (Local)
530
+ ```python
531
+ os.environ["OLLAMA_API_KEY"] = "ollama" # Can be any value
532
+ os.environ["OLLAMA_API_BASE_URL"] = "http://localhost:11434/v1"
533
+
534
+ response, cost = await chat_complete(
535
+ "ollama:llama2",
536
+ messages=[{"role": "user", "content": "Hello!"}]
537
+ )
538
+ ```
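
Since every provider goes through the same `chat_complete` interface, switching providers is only a change of model string. A short sketch comparing the same prompt across providers (it assumes the corresponding API keys are already set as shown above):

```python
import asyncio
from eval_lib import chat_complete

MODELS = [
    "gpt-4o-mini",                   # OpenAI (default provider)
    "anthropic:claude-sonnet-4-0",
    "google:gemini-2.0-flash",
]

async def compare(prompt):
    for model in MODELS:
        response, cost = await chat_complete(
            model,
            messages=[{"role": "user", "content": prompt}]
        )
        print(f"--- {model} (${cost:.6f})\n{response}\n")

asyncio.run(compare("In one sentence, what is retrieval-augmented generation?"))
```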
539
+
540
+ ## Test Data Generation
541
+
542
+ The library includes a powerful test data generator that can create realistic test cases either from scratch or based on your documents.
543
+
544
+ ### Supported Document Formats
545
+
546
+ - **Documents**: PDF, DOCX, DOC, TXT, RTF, ODT
547
+ - **Structured Data**: CSV, TSV, XLSX, JSON, YAML, XML
548
+ - **Web**: HTML, Markdown
549
+ - **Presentations**: PPTX
550
+ - **Images**: PNG, JPG, JPEG (with OCR support)
551
+
552
+ ### Generate from Scratch
553
+ ```python
554
+ from eval_lib.datagenerator.datagenerator import DatasetGenerator
555
+
556
+ generator = DatasetGenerator(
557
+ model="gpt-4o-mini",
558
+ agent_description="A customer support chatbot",
559
+ input_format="User question or request",
560
+ expected_output_format="Helpful response",
561
+ test_types=["functionality", "edge_cases"],
562
+ max_rows=20,
563
+ question_length="mixed", # "short", "long", or "mixed"
564
+ question_openness="mixed", # "open", "closed", or "mixed"
565
+ trap_density=0.1, # 10% trap questions
566
+ language="en"
567
+ )
568
+
569
+ dataset = await generator.generate_from_scratch()
570
+ ```
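
The generator returns plain Python data, so it is easy to persist for review before turning it into test cases. Continuing from the snippet above, a sketch that saves the generated rows to JSON (the exact row schema is whatever the generator produced, e.g. the `input` / `expected_output` keys used in the document-based example below):

```python
import json

dataset = await generator.generate_from_scratch()

with open("generated_dataset.json", "w", encoding="utf-8") as f:
    json.dump(dataset, f, ensure_ascii=False, indent=2)

print(f"Saved {len(dataset)} generated rows")
```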
571
+
572
+ ### Generate from Documents
573
+ ```python
574
+ generator = DatasetGenerator(
575
+ model="gpt-4o-mini",
576
+ agent_description="Technical support agent",
577
+ input_format="Technical question",
578
+ expected_output_format="Detailed answer with references",
579
+ test_types=["retrieval", "accuracy"],
580
+ max_rows=50,
581
+ chunk_size=1024,
582
+ chunk_overlap=100,
583
+ max_chunks=30
584
+ )
585
+
586
+ file_paths = ["docs/user_guide.pdf", "docs/faq.md"]
587
+ dataset = await generator.generate_from_documents(file_paths)
588
+
589
+ # Convert to test cases
590
+ from eval_lib import EvalTestCase
591
+ test_cases = [
592
+ EvalTestCase(
593
+ input=item["input"],
594
+ expected_output=item["expected_output"],
595
+ retrieval_context=[item.get("context", "")]
596
+ )
597
+ for item in dataset
598
+ ]
599
+ ```
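
Putting the pieces together, generated test cases can be fed straight into `evaluate()` with any of the metrics above. A sketch of the full loop; `run_my_agent` is a hypothetical placeholder for the system under test, which has to supply `actual_output` before RAG metrics can be scored:

```python
import asyncio
from eval_lib import evaluate, EvalTestCase, AnswerRelevancyMetric, FaithfulnessMetric
from eval_lib.datagenerator.datagenerator import DatasetGenerator

def run_my_agent(question: str) -> str:
    """Placeholder for the system under test; replace with your own agent/RAG call."""
    return "TODO: answer produced by your system"

async def generate_and_evaluate(file_paths):
    generator = DatasetGenerator(
        model="gpt-4o-mini",
        agent_description="Technical support agent",
        input_format="Technical question",
        expected_output_format="Detailed answer with references",
        test_types=["retrieval", "accuracy"],
        max_rows=10
    )
    dataset = await generator.generate_from_documents(file_paths)

    test_cases = [
        EvalTestCase(
            input=item["input"],
            actual_output=run_my_agent(item["input"]),  # placeholder: your system under test
            expected_output=item["expected_output"],
            retrieval_context=[item.get("context", "")]
        )
        for item in dataset
    ]

    metrics = [
        AnswerRelevancyMetric(model="gpt-4o-mini", threshold=0.7),
        FaithfulnessMetric(model="gpt-4o-mini", threshold=0.8)
    ]
    return await evaluate(test_cases, metrics)

results = asyncio.run(generate_and_evaluate(["docs/user_guide.pdf", "docs/faq.md"]))
```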
600
+
601
+ ## Best Practices
602
+
603
+ ### 1. Choose the Right Model
604
+
605
+ - **G-Eval**: Use GPT-4 for best results with probability-weighted scoring
606
+ - **Other Metrics**: GPT-4o-mini is cost-effective and sufficient
607
+ - **Custom Eval**: Use GPT-4 for complex criteria, GPT-4o-mini for simple ones
608
+
609
+ ### 2. Set Appropriate Thresholds
610
+ ```python
611
+ # Safety metrics - high bar
612
+ BiasMetric(threshold=80.0)
613
+ ToxicityMetric(threshold=85.0)
614
+
615
+ # Quality metrics - moderate bar
616
+ AnswerRelevancyMetric(threshold=0.7)
617
+ FaithfulnessMetric(threshold=0.75)
618
+
619
+ # Agent metrics - context-dependent
620
+ TaskSuccessRateMetric(threshold=0.7) # Most tasks
621
+ RoleAdherenceMetric(threshold=0.9) # Strict role requirements
622
+ ```
623
+
624
+ ### 3. Use Temperature Wisely
625
+ ```python
626
+ # Strict evaluation - critical applications
627
+ metric = FaithfulnessMetric(temperature=0.3)
628
+
629
+ # Balanced - general use (default)
630
+ metric = AnswerRelevancyMetric(temperature=0.5)
631
+
632
+ # Lenient - exploratory evaluation
633
+ metric = TaskSuccessRateMetric(temperature=1.2)
634
+ ```
635
+
636
+ ### 4. Leverage Evaluation Logs
637
+ ```python
638
+ result = await metric.evaluate(test_case)
639
+
640
+ # Always check the log for insights
641
+ log = result['evaluation_log']
642
+
643
+ # For debugging failures:
644
+ if not result['success']:
645
+ print(f"Failed because: {log['final_reason']}")
646
+ print(f"Verdicts: {log.get('verdicts', [])}")
647
+ print(f"Steps taken: {log.get('evaluation_steps', [])}")
648
+ ```
649
+
650
+ ### 5. Batch Evaluation for Efficiency
651
+ ```python
652
+ # Evaluate multiple test cases at once
653
+ results = await evaluate(
+     test_cases=[test_case1, test_case2, test_case3],
+     metrics=[metric1, metric2, metric3]
+ )
+
+ # Calculate aggregate statistics
+ total_cost = sum(
+     metric.evaluation_cost or 0
+     for _, test_results in results
+     for result in test_results
+     for metric in result.metrics_data
+ )
+
+ all_results = [
+     result
+     for _, test_results in results
+     for result in test_results
+ ]
+ success_rate = sum(1 for r in all_results if r.success) / len(all_results)
671
+
672
+ print(f"Total cost: ${total_cost:.4f}")
673
+ print(f"Success rate: {success_rate:.2%}")
674
+ ```
675
+
676
+ ## Cost Tracking
677
+
678
+ All evaluations automatically track API costs:
679
+ ```python
680
+ results = await evaluate(test_cases, metrics)
681
+
682
+ for _, test_results in results:
683
+ for result in test_results:
684
+ for metric in result.metrics_data:
685
+ print(f"{metric.name}: ${metric.evaluation_cost:.6f}")
686
+ ```
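
To see where the budget goes, the same loop can aggregate cost per metric instead of printing every call (a sketch reusing the result structure shown above):

```python
from collections import defaultdict

costs_by_metric = defaultdict(float)

for _, test_results in results:
    for result in test_results:
        for metric in result.metrics_data:
            costs_by_metric[metric.name] += metric.evaluation_cost or 0

for name, cost in sorted(costs_by_metric.items(), key=lambda item: -item[1]):
    print(f"{name}: ${cost:.4f}")
```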
687
+
688
+ **Cost Estimates** (as of 2025):
689
+ - **G-Eval with GPT-4**: ~$0.10-0.15 per evaluation (20 samples)
690
+ - **Custom Eval with GPT-4**: ~$0.10-0.15 per evaluation (20 samples + CoT)
691
+ - **Standard metrics with GPT-4o-mini**: ~$0.001-0.005 per evaluation
692
+ - **Faithfulness/Answer Relevancy**: ~$0.003-0.010 per evaluation (multiple LLM calls)
693
+
694
+ ## Environment Variables
695
+
696
+ | Variable | Description | Required |
697
+ |----------|-------------|----------|
698
+ | `OPENAI_API_KEY` | OpenAI API key | For OpenAI |
699
+ | `AZURE_OPENAI_API_KEY` | Azure OpenAI API key | For Azure |
700
+ | `AZURE_OPENAI_ENDPOINT` | Azure OpenAI endpoint URL | For Azure |
701
+ | `AZURE_OPENAI_DEPLOYMENT` | Azure deployment name | For Azure |
702
+ | `GOOGLE_API_KEY` | Google API key | For Google |
703
+ | `ANTHROPIC_API_KEY` | Anthropic API key | For Anthropic |
704
+ | `OLLAMA_API_KEY` | Ollama API key | For Ollama |
705
+ | `OLLAMA_API_BASE_URL` | Ollama base URL | For Ollama |
706
+
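
A small preflight check can catch missing configuration before any evaluation runs. This helper is only a convenience sketch built on the table above; it is not part of the library:

```python
import os

REQUIRED_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "azure": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_DEPLOYMENT"],
    "google": ["GOOGLE_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "ollama": ["OLLAMA_API_KEY", "OLLAMA_API_BASE_URL"],
}

def check_provider_env(provider: str) -> None:
    """Raise early if the environment variables for a provider are missing."""
    missing = [name for name in REQUIRED_VARS[provider] if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables for {provider}: {', '.join(missing)}")

check_provider_env("openai")
```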
707
+ ## Contributing
708
+
709
+ Contributions are welcome! Please feel free to submit a Pull Request.
710
+
711
+ 1. Fork the repository
712
+ 2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
713
+ 3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
714
+ 4. Push to the branch (`git push origin feature/AmazingFeature`)
715
+ 5. Open a Pull Request
716
+
717
+ ## License
718
+
719
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
720
+
721
+ ## Citation
722
+
723
+ If you use this library in your research, please cite:
724
+ ```bibtex
725
+ @software{eval_ai_library,
726
+ author = {Meshkov, Aleksandr},
727
+ title = {Eval AI Library: Comprehensive AI Model Evaluation Framework},
728
+ year = {2025},
729
+ url = {https://github.com/meshkovQA/Eval-ai-library.git}
730
+ }
731
+ ```
732
+
733
+ ### References
734
+
735
+ This library implements techniques from:
736
+ ```bibtex
737
+ @inproceedings{liu2023geval,
738
+ title={G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment},
739
+ author={Liu, Yang and Iter, Dan and Xu, Yichong and Wang, Shuohang and Xu, Ruochen and Zhu, Chenguang},
740
+ booktitle={Proceedings of EMNLP},
741
+ year={2023}
742
+ }
743
+ ```
744
+
745
+ ## Support
746
+
747
+ - 📧 Email: alekslynx90@gmail.com
748
+ - 🐛 Issues: [GitHub Issues](https://github.com/meshkovQA/Eval-ai-library/issues)
749
+ - 📖 Documentation: [Full Documentation](https://github.com/meshkovQA/Eval-ai-library#readme)
750
+
751
+ ## Acknowledgments
752
+
753
+ This library was developed to provide a comprehensive solution for evaluating AI models across different use cases and providers, with state-of-the-art techniques including G-Eval's probability-weighted scoring and automatic chain-of-thought generation.