graphrag-eval 5.0.2__tar.gz → 5.1.1__tar.gz
- {graphrag_eval-5.0.2 → graphrag_eval-5.1.1}/PKG-INFO +93 -82
- {graphrag_eval-5.0.2 → graphrag_eval-5.1.1}/README.md +86 -77
- graphrag_eval-5.1.1/graphrag_eval/__init__.py +2 -0
- {graphrag_eval-5.0.2 → graphrag_eval-5.1.1}/graphrag_eval/aggregation.py +104 -98
- {graphrag_eval-5.0.2 → graphrag_eval-5.1.1}/graphrag_eval/answer_correctness.py +11 -12
- {graphrag_eval-5.0.2 → graphrag_eval-5.1.1}/graphrag_eval/answer_relevance.py +1 -1
- {graphrag_eval-5.0.2 → graphrag_eval-5.1.1}/graphrag_eval/evaluation.py +1 -1
- graphrag_eval-5.1.1/graphrag_eval/steps/__init__.py +0 -0
- graphrag_eval-5.0.2/graphrag_eval/steps/__init__.py → graphrag_eval-5.1.1/graphrag_eval/steps/evaluation.py +64 -54
- {graphrag_eval-5.0.2 → graphrag_eval-5.1.1}/graphrag_eval/steps/retrieval_answer.py +10 -7
- {graphrag_eval-5.0.2 → graphrag_eval-5.1.1}/graphrag_eval/steps/retrieval_context_texts.py +3 -3
- {graphrag_eval-5.0.2 → graphrag_eval-5.1.1}/graphrag_eval/steps/sparql.py +0 -1
- graphrag_eval-5.1.1/pyproject.toml +51 -0
- graphrag_eval-5.0.2/graphrag_eval/__init__.py +0 -4
- graphrag_eval-5.0.2/pyproject.toml +0 -47
- {graphrag_eval-5.0.2 → graphrag_eval-5.1.1}/LICENSE +0 -0
- {graphrag_eval-5.0.2 → graphrag_eval-5.1.1}/graphrag_eval/prompts/template.md +0 -0
- {graphrag_eval-5.0.2 → graphrag_eval-5.1.1}/graphrag_eval/steps/retrieval_context_ids.py +0 -0
- {graphrag_eval-5.0.2 → graphrag_eval-5.1.1}/graphrag_eval/util.py +0 -0
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
Metadata-Version: 2.3
|
|
2
2
|
Name: graphrag-eval
|
|
3
|
-
Version: 5.
|
|
3
|
+
Version: 5.1.1
|
|
4
4
|
Summary: For assessing question answering systems' final answers and intermediate steps, against a given set of questions, reference answers and steps.
|
|
5
5
|
License: Apache-2.0
|
|
6
6
|
Author: Philip Ganchev
|
|
@@ -9,10 +9,12 @@ Requires-Python: >=3.12,<3.13
|
|
|
9
9
|
Classifier: License :: OSI Approved :: Apache Software License
|
|
10
10
|
Classifier: Programming Language :: Python :: 3
|
|
11
11
|
Classifier: Programming Language :: Python :: 3.12
|
|
12
|
-
Provides-Extra:
|
|
13
|
-
Requires-Dist:
|
|
14
|
-
Requires-Dist:
|
|
15
|
-
Requires-Dist:
|
|
12
|
+
Provides-Extra: ragas
|
|
13
|
+
Requires-Dist: langchain-openai (==0.3.7) ; extra == "ragas"
|
|
14
|
+
Requires-Dist: langchain_community (==0.3.18) ; extra == "ragas"
|
|
15
|
+
Requires-Dist: langevals[ragas] (==0.1.8) ; extra == "ragas"
|
|
16
|
+
Requires-Dist: litellm (==1.61.20) ; extra == "ragas"
|
|
17
|
+
Requires-Dist: ragas (==0.2.9) ; extra == "ragas"
|
|
16
18
|
Project-URL: Repository, https://github.com/Ontotext-AD/graphrag-eval
|
|
17
19
|
Description-Content-Type: text/markdown
|
|
18
20
|
|
|
@@ -22,8 +24,7 @@ Description-Content-Type: text/markdown
|
|
|
22
24
|
|
|
23
25
|
# QA Evaluation
|
|
24
26
|
|
|
25
|
-
This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used
|
|
26
|
-
to reach the answer (such as orchestrated and executed steps), compared to the given reference steps.
|
|
27
|
+
This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used to reach the answer (such as orchestrated and executed steps), compared to the given reference steps.
|
|
27
28
|
|
|
28
29
|
## License
|
|
29
30
|
|
|
@@ -43,12 +44,12 @@ graphrag-eval = "*"
|
|
|
43
44
|
To evaluate answer relevance and answer correctness:
|
|
44
45
|
|
|
45
46
|
```bash
|
|
46
|
-
pip install 'graphrag-eval[
|
|
47
|
+
pip install 'graphrag-eval[ragas]'
|
|
47
48
|
```
|
|
48
49
|
|
|
49
50
|
or add the following dependency in your `pyproject.toml` file:
|
|
50
51
|
```toml
|
|
51
|
-
graphrag-eval = {version = "*", extras = ["
|
|
52
|
+
graphrag-eval = {version = "*", extras = ["ragas"]}
|
|
52
53
|
```
|
|
53
54
|
|
|
54
55
|
## Maintainers
|
|
@@ -61,7 +62,7 @@ For issues or feature requests, please open [a GitHub issue](https://github.com/
|
|
|
61
62
|
To evaluate only correctness of final answers (system responses), you can clone this repository and run the code on the command line:
|
|
62
63
|
|
|
63
64
|
1. Prepare an input TSV file with columns `Question`, `Reference answer` and `Actual answer`
|
|
64
|
-
1. Execute `poetry install --with
|
|
65
|
+
1. Execute `poetry install --with ragas`
|
|
65
66
|
1. Execute `OPENAI_API_KEY=<your_api_key> poetry run answer-correctness -i <input_file.tsv> -o <output_file.tsv>`
|
|
66
67
|
|
|
67
68
|
We plan to improve CLI support in future releases.
|
|
@@ -70,24 +71,24 @@ We plan to improve CLI support in future releases.
|
|
|
70
71
|
|
|
71
72
|
To evaluate answers and/or steps:
|
|
72
73
|
1. Install this package: section [Install](#Installation)
|
|
73
|
-
1. Format the
|
|
74
|
-
1. Format the answers and/or steps you want to evaluate: section [
|
|
74
|
+
1. Format the dataset of questions and reference answers and/or steps: section [Reference Q&A Data](#Reference-qa-Data)
|
|
75
|
+
1. Format the answers and/or steps you want to evaluate: section [Responses to evaluate](#Responses-to-evaluate)
|
|
75
76
|
1. To evaluate answer relevance:
|
|
76
77
|
1. Include `actual_answer` in the target data to evaluate
|
|
77
78
|
1. Set environment variable `OPENAI_API_KEY` appropriately
|
|
78
79
|
1. To evaluate answer correctness:
|
|
79
|
-
1. Include `reference_answer` in the reference
|
|
80
|
+
1. Include `reference_answer` in the reference dataset and `actual_answer` in the target data to evaluate
|
|
80
81
|
1. Set environment variable `OPENAI_API_KEY` appropriately
|
|
81
82
|
1. To evaluate steps:
|
|
82
|
-
1. Include `reference_steps` in the reference
|
|
83
|
-
1. Call the evaluation function with the reference
|
|
83
|
+
1. Include `reference_steps` in the reference data and `actual_steps` in the target data to evaluate
|
|
84
|
+
1. Call the evaluation function with the reference data and target data: section [Usage Code](#Usage-Code)
|
|
84
85
|
1. Call the aggregation function with the evaluation results
|
|
85
86
|
|
|
86
87
|
Answer evaluation (correctness and relevance) uses the LLM `openai/gpt-4o-mini`.
|
|
87
88
|
|
|
88
|
-
### Reference Q&A
|
|
89
|
+
### Reference Q&A Data
|
|
89
90
|
|
|
90
|
-
A reference
|
|
91
|
+
A reference dataset is a list of templates, each of which contains:
|
|
91
92
|
|
|
92
93
|
- `template_id`: Unique template identifier
|
|
93
94
|
- `questions`: A list of questions derived from this template, where each includes:
|
|
@@ -107,9 +108,9 @@ Each step includes:
|
|
|
107
108
|
- `ordered`: (optional, defaults to `false`) For SPARQL query results, whether results order matters. `true` means that the actual result rows must be ordered as the reference result; `false` means that result rows are matched as a set.
|
|
108
109
|
- `required_columns`: (optional) - required only for SPARQL query results; list of binding names, which are required for SPARQL query results to match
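
The `ordered` and `required_columns` semantics described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the function names `project` and `results_match` and the row representation (one dict per result row, keyed by binding name) are assumptions made for this example.

```python
def project(rows, required_columns):
    """Reduce each result row to a tuple of its required bindings (hypothetical helper)."""
    return [tuple(row.get(col) for col in required_columns) for row in rows]


def results_match(reference_rows, actual_rows, required_columns, ordered=False):
    """Compare two SPARQL-like result sets on the required columns only.

    ordered=True:  rows must appear in the same order as in the reference.
    ordered=False: rows are compared as a set, so order is ignored.
    """
    reference = project(reference_rows, required_columns)
    actual = project(actual_rows, required_columns)
    if ordered:
        return reference == actual
    return set(reference) == set(actual)


# Illustrative data: same bindings for "transformer", different row order,
# and an extra column in the actual result that is not required.
reference_rows = [{"transformer": "T1"}, {"transformer": "T2"}]
actual_rows = [
    {"transformer": "T2", "label": "second"},
    {"transformer": "T1", "label": "first"},
]

print(results_match(reference_rows, actual_rows, ["transformer"]))
print(results_match(reference_rows, actual_rows, ["transformer"], ordered=True))
```

With these rows the unordered comparison succeeds and the ordered one fails, because only the row order differs. How the package treats duplicate rows is not stated in the text above, so the set comparison here simply ignores duplicates.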
|
|
109
110
|
|
|
110
|
-
####
|
|
111
|
+
#### Reference Data
|
|
111
112
|
|
|
112
|
-
The example
|
|
113
|
+
The example data below illustrates a minimal but realistic Q&A dataset, showing two templates with associated questions and steps.
|
|
113
114
|
|
|
114
115
|
```yaml
|
|
115
116
|
- template_id: list_all_transformers_within_Substation_SUBSTATION
|
|
@@ -275,9 +276,9 @@ The example corpus below illustrates a minimal but realistic Q&A dataset, showin
|
|
|
275
276
|
|
|
276
277
|
The module is agnostic to the specific LLM agent implementation and model; it depends solely on the format of the response.
|
|
277
278
|
|
|
278
|
-
###
|
|
279
|
+
### Responses to evaluate
|
|
279
280
|
|
|
280
|
-
|
|
281
|
+
If the question-answering system responds successfully to a question, evaluate the response by calling `run_evaluation()` with the response formatted as in the example below. (If an error occurs while generating a response, format it as in [Target Input on Error](#target-input-on-error) instead.)
|
|
281
282
|
|
|
282
283
|
```json
|
|
283
284
|
{
|
|
@@ -330,9 +331,9 @@ Below is an example response from the question-answering system for a single que
|
|
|
330
331
|
}
|
|
331
332
|
```
|
|
332
333
|
|
|
333
|
-
####
|
|
334
|
+
#### Target Input on Error
|
|
334
335
|
|
|
335
|
-
If an error occurs
|
|
336
|
+
If an error occurs while the question-answering system is generating a response, and you want to tally this error, the input to `run_evaluation()` should look like this:
|
|
336
337
|
|
|
337
338
|
```json
|
|
338
339
|
{
|
|
@@ -342,22 +343,22 @@ If an error occurs during generating a response to a question, the expected targ
|
|
|
342
343
|
}
|
|
343
344
|
```
|
|
344
345
|
|
|
345
|
-
###
|
|
346
|
+
### Usage Code
|
|
346
347
|
|
|
347
348
|
```python
|
|
348
349
|
from graphrag_eval import run_evaluation, compute_aggregates
|
|
349
350
|
|
|
350
|
-
reference_qas: list[dict] = [] # read your
|
|
351
|
+
reference_qas: list[dict] = [] # read your reference data
|
|
351
352
|
chat_responses: dict = {} # call your implementation to get the response
|
|
352
353
|
evaluation_results = run_evaluation(reference_qas, chat_responses)
|
|
353
354
|
aggregates = compute_aggregates(evaluation_results)
|
|
354
355
|
```
|
|
355
356
|
|
|
356
|
-
`evaluation_results` is a list of statistics for each question, as in section [
|
|
357
|
+
`evaluation_results` is a list of statistics for each question, as in section [Evaluation Results](#Evaluation-results). The format is explained in section [Output Keys](#output-keys).
|
|
357
358
|
|
|
358
359
|
If your chat responses contain actual answers, set your environment variable `OPENAI_API_KEY` before running the code above.
|
|
359
360
|
|
|
360
|
-
###
|
|
361
|
+
### Evaluation Results
|
|
361
362
|
|
|
362
363
|
The output is a list of statistics for each question from the reference Q&A dataset. Here is an example of statistics for one question:
|
|
363
364
|
|
|
@@ -445,7 +446,6 @@ The output is a list of statistics for each question from the reference Q&A data
|
|
|
445
446
|
retrieval_answer_recall_reason: The context contains all the transformers listed in the reference answer
|
|
446
447
|
retrieval_answer_recall_cost: 0.0007
|
|
447
448
|
retrieval_answer_precision: 1.0
|
|
448
|
-
retrieval_answer_precision_reason: The context contains only transformers listed in the reference answer
|
|
449
449
|
retrieval_answer_precision_cost: 0.0003
|
|
450
450
|
retrieval_answer_f1: 1.0
|
|
451
451
|
retrieval_answer_f1_cost: 0.001
|
|
@@ -570,7 +570,6 @@ All `actual_steps` with `name` "retrieval" contain:
|
|
|
570
570
|
- `retrieval_answer_recall_error`: (optional) error message if `retrieval_answer_recall` evaluation fails
|
|
571
571
|
- `retrieval_answer_recall_cost`: cost of evaluating `retrieval_answer_recall`, in US dollars
|
|
572
572
|
- `retrieval_answer_precision`: (optional) precision of the retrieved context with respect to the reference answer, if evaluation succeeds
|
|
573
|
-
- `retrieval_answer_precision_reason`: (optional) LLM reasoning in evaluating `retrieval_answer_precision`
|
|
574
573
|
- `retrieval_answer_precision_error`: (optional) error message if `retrieval_answer_precision` evaluation fails
|
|
575
574
|
- `retrieval_answer_precision_cost`: cost of evaluating `retrieval_answer_precision`, in US dollars
|
|
576
575
|
- `retrieval_answer_f1`: (optional) F1 score of the retrieved context with respect to the reference answer, if `retrieval_answer_recall` and `retrieval_answer_precision` succeed
|
|
@@ -584,60 +583,72 @@ All `actual_steps` with `name` "retrieval" contain:
|
|
|
584
583
|
|
|
585
584
|
#### Aggregates Keys
|
|
586
585
|
|
|
587
|
-
The `aggregates` object provides aggregated evaluation metrics.
|
|
588
|
-
|
|
589
|
-
|
|
586
|
+
The `aggregates` object provides aggregated evaluation metrics. These aggregates support analysis of agent quality, token efficiency, and execution performance. Aggregates are computed:
|
|
587
|
+
1. per question template, and
|
|
588
|
+
1. over all questions in the dataset, using micro and macro averaging
|
|
589
|
+
|
|
590
590
|
Aggregates are:
|
|
591
591
|
- `per_template`: a dictionary mapping a template identifier to the following statistics:
|
|
592
592
|
- `number_of_error_samples`: number of questions for this template, which resulted in error response
|
|
593
593
|
- `number_of_success_samples`: number of questions for this template, which resulted in successful response
|
|
594
|
-
- `
|
|
595
|
-
|
|
596
|
-
|
|
597
|
-
|
|
598
|
-
|
|
599
|
-
|
|
600
|
-
|
|
601
|
-
|
|
602
|
-
|
|
603
|
-
|
|
604
|
-
- `
|
|
605
|
-
- `
|
|
606
|
-
- `
|
|
607
|
-
- `
|
|
608
|
-
|
|
609
|
-
|
|
610
|
-
|
|
594
|
+
- `sum`, `mean`, `median`, `min` and `max` statistics over all non-error responses for this template for the following metrics:
|
|
595
|
+
- `input_tokens`
|
|
596
|
+
- `output_tokens`
|
|
597
|
+
- `total_tokens`
|
|
598
|
+
- `elapsed_sec`
|
|
599
|
+
- `answer_recall`
|
|
600
|
+
- `answer_precision`
|
|
601
|
+
- `answer_f1`
|
|
602
|
+
- `answer_relevance`
|
|
603
|
+
- `steps_score`
|
|
604
|
+
- `retrieval_answer_recall`
|
|
605
|
+
- `retrieval_answer_precision`
|
|
606
|
+
- `retrieval_answer_f1`
|
|
607
|
+
- `retrieval_context_recall`
|
|
608
|
+
- `retrieval_context_precision`
|
|
609
|
+
- `retrieval_context_f1`
|
|
610
|
+
- `steps`: includes:
|
|
611
|
+
- `steps`: for each step type how many times it was executed
|
|
612
|
+
- `once_per_sample`: how many times each step was executed, counted only once per question
|
|
613
|
+
- `empty_results`: how many times the step was executed and returned empty results
|
|
614
|
+
- `errors`: how many times the step was executed and resulted in error
|
|
611
615
|
- `micro`: statistics across questions, regardless of template. It includes:
|
|
612
616
|
- `number_of_error_samples`: total number of questions, which resulted in error response
|
|
613
617
|
- `number_of_success_samples`: total number of questions, which resulted in successful response
|
|
614
|
-
- `
|
|
615
|
-
|
|
616
|
-
|
|
617
|
-
|
|
618
|
-
|
|
619
|
-
|
|
620
|
-
|
|
621
|
-
|
|
622
|
-
|
|
623
|
-
|
|
624
|
-
|
|
625
|
-
|
|
626
|
-
|
|
627
|
-
- `
|
|
628
|
-
|
|
629
|
-
|
|
630
|
-
|
|
631
|
-
|
|
632
|
-
- `
|
|
633
|
-
- `
|
|
634
|
-
- `
|
|
635
|
-
- `
|
|
636
|
-
- `
|
|
637
|
-
- `
|
|
638
|
-
- `
|
|
639
|
-
- `
|
|
640
|
-
- `
|
|
618
|
+
- `sum`, `mean`, `median`, `min` and `max` statistics over all non-error responses for the following metrics:
|
|
619
|
+
- `input_tokens`
|
|
620
|
+
- `output_tokens`
|
|
621
|
+
- `total_tokens`
|
|
622
|
+
- `elapsed_sec`
|
|
623
|
+
- `answer_recall`
|
|
624
|
+
- `answer_precision`
|
|
625
|
+
- `answer_f1`
|
|
626
|
+
- `answer_relevance`
|
|
627
|
+
- `answer_relevance_cost`
|
|
628
|
+
- `retrieval_answer_recall`
|
|
629
|
+
- `retrieval_answer_precision`
|
|
630
|
+
- `retrieval_answer_f1`
|
|
631
|
+
- `retrieval_context_recall`
|
|
632
|
+
- `retrieval_context_precision`
|
|
633
|
+
- `retrieval_context_f1`
|
|
634
|
+
- `steps_score`
|
|
635
|
+
- `macro`: averages across templates, i.e., for each metric, the mean of the per-template means. It includes the following means:
|
|
636
|
+
- `input_tokens`
|
|
637
|
+
- `output_tokens`
|
|
638
|
+
- `total_tokens`
|
|
639
|
+
- `elapsed_sec`
|
|
640
|
+
- `answer_recall`
|
|
641
|
+
- `answer_precision`
|
|
642
|
+
- `answer_f1`
|
|
643
|
+
- `answer_relevance`
|
|
644
|
+
- `answer_relevance_cost`
|
|
645
|
+
- `retrieval_answer_recall`
|
|
646
|
+
- `retrieval_answer_precision`
|
|
647
|
+
- `retrieval_answer_f1`
|
|
648
|
+
- `retrieval_context_recall`
|
|
649
|
+
- `retrieval_context_precision`
|
|
650
|
+
- `retrieval_context_f1`
|
|
651
|
+
- `steps_score`
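
The difference between `micro` and `macro` averaging described above can be shown with a small sketch. It is not the package's `compute_aggregates()` implementation; the helper name `micro_macro_mean` and the assumption that each per-question result carries a `template_id` key next to the metric value are made up for this illustration.

```python
from collections import defaultdict
from statistics import mean


def micro_macro_mean(samples, metric):
    """Return (micro, macro) means of `metric` over per-question results.

    micro: mean over all questions, regardless of template.
    macro: mean of the per-template means, so every template weighs the same.
    """
    values = []
    by_template = defaultdict(list)
    for sample in samples:
        if metric in sample:
            values.append(sample[metric])
            by_template[sample["template_id"]].append(sample[metric])
    micro = mean(values)
    macro = mean([mean(template_values) for template_values in by_template.values()])
    return micro, macro


# Illustrative results: template "A" has three questions, template "B" has one.
samples = [
    {"template_id": "A", "answer_f1": 1.0},
    {"template_id": "A", "answer_f1": 1.0},
    {"template_id": "A", "answer_f1": 1.0},
    {"template_id": "B", "answer_f1": 0.0},
]

micro, macro = micro_macro_mean(samples, "answer_f1")
print(micro, macro)  # 0.75 0.5
```

The micro mean is dominated by the template with more questions, while the macro mean treats both templates equally, which is why the two values differ here.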
|
|
641
652
|
|
|
642
653
|
#### Example Aggregates
|
|
643
654
|
|
|
@@ -665,11 +676,11 @@ per_template:
|
|
|
665
676
|
min: 1.0
|
|
666
677
|
max: 1.0
|
|
667
678
|
answer_relevance:
|
|
668
|
-
|
|
669
|
-
|
|
670
|
-
|
|
671
|
-
|
|
672
|
-
|
|
679
|
+
min: 0.9
|
|
680
|
+
max: 0.9
|
|
681
|
+
mean: 0.9
|
|
682
|
+
median: 0.9
|
|
683
|
+
sum: 0.9
|
|
673
684
|
answer_relevance_cost:
|
|
674
685
|
min: 0.0007
|
|
675
686
|
max: 0.0007
|
|
@@ -1031,7 +1042,7 @@ The following metrics are based on the content of retrieved documents.
|
|
|
1031
1042
|
|
|
1032
1043
|
#### Context Recall@k
|
|
1033
1044
|
|
|
1034
|
-
The fraction of relevant items among the top *k* recommendations. It answers the question: "Of all items the user cares about, how many did we
|
|
1045
|
+
The fraction of all relevant items that appear among the top *k* retrieved items. It answers the question: "Of all items the user cares about, how many did we include in the first *k* spots?"
|
|
1035
1046
|
* **Formula**:
|
|
1036
1047
|
$`
|
|
1037
1048
|
\frac{\text{Number of relevant items in top k}}{\text{Number of relevant items}}
|
|
@@ -4,8 +4,7 @@
|
|
|
4
4
|
|
|
5
5
|
# QA Evaluation
|
|
6
6
|
|
|
7
|
-
This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used
|
|
8
|
-
to reach the answer (such as orchestrated and executed steps), compared to the given reference steps.
|
|
7
|
+
This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used to reach the answer (such as orchestrated and executed steps), compared to the given reference steps.
|
|
9
8
|
|
|
10
9
|
## License
|
|
11
10
|
|
|
@@ -25,12 +24,12 @@ graphrag-eval = "*"
|
|
|
25
24
|
To evaluate answer relevance and answer correctness:
|
|
26
25
|
|
|
27
26
|
```bash
|
|
28
|
-
pip install 'graphrag-eval[
|
|
27
|
+
pip install 'graphrag-eval[ragas]'
|
|
29
28
|
```
|
|
30
29
|
|
|
31
30
|
or add the following dependency in your `pyproject.toml` file:
|
|
32
31
|
```toml
|
|
33
|
-
graphrag-eval = {version = "*", extras = ["
|
|
32
|
+
graphrag-eval = {version = "*", extras = ["ragas"]}
|
|
34
33
|
```
|
|
35
34
|
|
|
36
35
|
## Maintainers
|
|
@@ -43,7 +42,7 @@ For issues or feature requests, please open [a GitHub issue](https://github.com/
|
|
|
43
42
|
To evaluate only correctness of final answers (system responses), you can clone this repository and run the code on the command line:
|
|
44
43
|
|
|
45
44
|
1. Prepare an input TSV file with columns `Question`, `Reference answer` and `Actual answer`
|
|
46
|
-
1. Execute `poetry install --with
|
|
45
|
+
1. Execute `poetry install --with ragas`
|
|
47
46
|
1. Execute `OPENAI_API_KEY=<your_api_key> poetry run answer-correctness -i <input_file.tsv> -o <output_file.tsv>`
|
|
48
47
|
|
|
49
48
|
We plan to improve CLI support in future releases.
|
|
@@ -52,24 +51,24 @@ We plan to improve CLI support in future releases.
|
|
|
52
51
|
|
|
53
52
|
To evaluate answers and/or steps:
|
|
54
53
|
1. Install this package: section [Install](#Installation)
|
|
55
|
-
1. Format the
|
|
56
|
-
1. Format the answers and/or steps you want to evaluate: section [
|
|
54
|
+
1. Format the dataset of questions and reference answers and/or steps: section [Reference Q&A Data](#Reference-qa-Data)
|
|
55
|
+
1. Format the answers and/or steps you want to evaluate: section [Responses to evaluate](#Responses-to-evaluate)
|
|
57
56
|
1. To evaluate answer relevance:
|
|
58
57
|
1. Include `actual_answer` in the target data to evaluate
|
|
59
58
|
1. Set environment variable `OPENAI_API_KEY` appropriately
|
|
60
59
|
1. To evaluate answer correctness:
|
|
61
|
-
1. Include `reference_answer` in the reference
|
|
60
|
+
1. Include `reference_answer` in the reference dataset and `actual_answer` in the target data to evaluate
|
|
62
61
|
1. Set environment variable `OPENAI_API_KEY` appropriately
|
|
63
62
|
1. To evaluate steps:
|
|
64
|
-
1. Include `reference_steps` in the reference
|
|
65
|
-
1. Call the evaluation function with the reference
|
|
63
|
+
1. Include `reference_steps` in the reference data and `actual_steps` in the target data to evaluate
|
|
64
|
+
1. Call the evaluation function with the reference data and target data: section [Usage Code](#Usage-Code)
|
|
66
65
|
1. Call the aggregation function with the evaluation results
|
|
67
66
|
|
|
68
67
|
Answer evaluation (correctness and relevance) uses the LLM `openai/gpt-4o-mini`.
|
|
69
68
|
|
|
70
|
-
### Reference Q&A
|
|
69
|
+
### Reference Q&A Data
|
|
71
70
|
|
|
72
|
-
A reference
|
|
71
|
+
A reference dataset is a list of templates, each of which contains:
|
|
73
72
|
|
|
74
73
|
- `template_id`: Unique template identifier
|
|
75
74
|
- `questions`: A list of questions derived from this template, where each includes:
|
|
@@ -89,9 +88,9 @@ Each step includes:
|
|
|
89
88
|
- `ordered`: (optional, defaults to `false`) For SPARQL query results, whether results order matters. `true` means that the actual result rows must be ordered as the reference result; `false` means that result rows are matched as a set.
|
|
90
89
|
- `required_columns`: (optional) - required only for SPARQL query results; list of binding names, which are required for SPARQL query results to match
|
|
91
90
|
|
|
92
|
-
####
|
|
91
|
+
#### Reference Data
|
|
93
92
|
|
|
94
|
-
The example
|
|
93
|
+
The example data below illustrates a minimal but realistic Q&A dataset, showing two templates with associated questions and steps.
|
|
95
94
|
|
|
96
95
|
```yaml
|
|
97
96
|
- template_id: list_all_transformers_within_Substation_SUBSTATION
|
|
@@ -257,9 +256,9 @@ The example corpus below illustrates a minimal but realistic Q&A dataset, showin
|
|
|
257
256
|
|
|
258
257
|
The module is agnostic to the specific LLM agent implementation and model; it depends solely on the format of the response.
|
|
259
258
|
|
|
260
|
-
###
|
|
259
|
+
### Responses to evaluate
|
|
261
260
|
|
|
262
|
-
|
|
261
|
+
If the question-answering system responds successfully to a question, evaluate the response by calling `run_evaluation()` with the response formatted as in the example below. (If an error occurs while generating a response, format it as in [Target Input on Error](#target-input-on-error) instead.)
|
|
263
262
|
|
|
264
263
|
```json
|
|
265
264
|
{
|
|
@@ -312,9 +311,9 @@ Below is an example response from the question-answering system for a single que
|
|
|
312
311
|
}
|
|
313
312
|
```
|
|
314
313
|
|
|
315
|
-
####
|
|
314
|
+
#### Target Input on Error
|
|
316
315
|
|
|
317
|
-
If an error occurs
|
|
316
|
+
If an error occurs while the question-answering system is generating a response, and you want to tally this error, the input to `run_evaluation()` should look like this:
|
|
318
317
|
|
|
319
318
|
```json
|
|
320
319
|
{
|
|
@@ -324,22 +323,22 @@ If an error occurs during generating a response to a question, the expected targ
|
|
|
324
323
|
}
|
|
325
324
|
```
|
|
326
325
|
|
|
327
|
-
###
|
|
326
|
+
### Usage Code
|
|
328
327
|
|
|
329
328
|
```python
|
|
330
329
|
from graphrag_eval import run_evaluation, compute_aggregates
|
|
331
330
|
|
|
332
|
-
reference_qas: list[dict] = [] # read your
|
|
331
|
+
reference_qas: list[dict] = [] # read your reference data
|
|
333
332
|
chat_responses: dict = {} # call your implementation to get the response
|
|
334
333
|
evaluation_results = run_evaluation(reference_qas, chat_responses)
|
|
335
334
|
aggregates = compute_aggregates(evaluation_results)
|
|
336
335
|
```
|
|
337
336
|
|
|
338
|
-
`evaluation_results` is a list of statistics for each question, as in section [
|
|
337
|
+
`evaluation_results` is a list of statistics for each question, as in section [Evaluation Results](#Evaluation-results). The format is explained in section [Output Keys](#output-keys).
|
|
339
338
|
|
|
340
339
|
If your chat responses contain actual answers, set your environment variable `OPENAI_API_KEY` before running the code above.
|
|
341
340
|
|
|
342
|
-
###
|
|
341
|
+
### Evaluation Results
|
|
343
342
|
|
|
344
343
|
The output is a list of statistics for each question from the reference Q&A dataset. Here is an example of statistics for one question:
|
|
345
344
|
|
|
@@ -427,7 +426,6 @@ The output is a list of statistics for each question from the reference Q&A data
|
|
|
427
426
|
retrieval_answer_recall_reason: The context contains all the transformers listed in the reference answer
|
|
428
427
|
retrieval_answer_recall_cost: 0.0007
|
|
429
428
|
retrieval_answer_precision: 1.0
|
|
430
|
-
retrieval_answer_precision_reason: The context contains only transformers listed in the reference answer
|
|
431
429
|
retrieval_answer_precision_cost: 0.0003
|
|
432
430
|
retrieval_answer_f1: 1.0
|
|
433
431
|
retrieval_answer_f1_cost: 0.001
|
|
@@ -552,7 +550,6 @@ All `actual_steps` with `name` "retrieval" contain:
|
|
|
552
550
|
- `retrieval_answer_recall_error`: (optional) error message if `retrieval_answer_recall` evaluation fails
|
|
553
551
|
- `retrieval_answer_recall_cost`: cost of evaluating `retrieval_answer_recall`, in US dollars
|
|
554
552
|
- `retrieval_answer_precision`: (optional) precision of the retrieved context with respect to the reference answer, if evaluation succeeds
|
|
555
|
-
- `retrieval_answer_precision_reason`: (optional) LLM reasoning in evaluating `retrieval_answer_precision`
|
|
556
553
|
- `retrieval_answer_precision_error`: (optional) error message if `retrieval_answer_precision` evaluation fails
|
|
557
554
|
- `retrieval_answer_precision_cost`: cost of evaluating `retrieval_answer_precision`, in US dollars
|
|
558
555
|
- `retrieval_answer_f1`: (optional) F1 score of the retrieved context with respect to the reference answer, if `retrieval_answer_recall` and `retrieval_answer_precision` succeed
|
|
@@ -566,60 +563,72 @@ All `actual_steps` with `name` "retrieval" contain:
|
|
|
566
563
|
|
|
567
564
|
#### Aggregates Keys
|
|
568
565
|
|
|
569
|
-
The `aggregates` object provides aggregated evaluation metrics.
|
|
570
|
-
|
|
571
|
-
|
|
566
|
+
The `aggregates` object provides aggregated evaluation metrics. These aggregates support analysis of agent quality, token efficiency, and execution performance. Aggregates are computed:
|
|
567
|
+
1. per question template, and
|
|
568
|
+
1. over all questions in the dataset, using micro and macro averaging
|
|
569
|
+
|
|
572
570
|
Aggregates are:
|
|
573
571
|
- `per_template`: a dictionary mapping a template identifier to the following statistics:
|
|
574
572
|
- `number_of_error_samples`: number of questions for this template, which resulted in error response
|
|
575
573
|
- `number_of_success_samples`: number of questions for this template, which resulted in successful response
|
|
576
|
-
- `
|
|
577
|
-
|
|
578
|
-
|
|
579
|
-
|
|
580
|
-
|
|
581
|
-
|
|
582
|
-
|
|
583
|
-
|
|
584
|
-
|
|
585
|
-
|
|
586
|
-
- `
|
|
587
|
-
- `
|
|
588
|
-
- `
|
|
589
|
-
- `
|
|
590
|
-
|
|
591
|
-
|
|
592
|
-
|
|
574
|
+
- `sum`, `mean`, `median`, `min` and `max` statistics over all non-error responses for this template for the following metrics:
|
|
575
|
+
- `input_tokens`
|
|
576
|
+
- `output_tokens`
|
|
577
|
+
- `total_tokens`
|
|
578
|
+
- `elapsed_sec`
|
|
579
|
+
- `answer_recall`
|
|
580
|
+
- `answer_precision`
|
|
581
|
+
- `answer_f1`
|
|
582
|
+
- `answer_relevance`
|
|
583
|
+
- `steps_score`
|
|
584
|
+
- `retrieval_answer_recall`
|
|
585
|
+
- `retrieval_answer_precision`
|
|
586
|
+
- `retrieval_answer_f1`
|
|
587
|
+
- `retrieval_context_recall`
|
|
588
|
+
- `retrieval_context_precision`
|
|
589
|
+
- `retrieval_context_f1`
|
|
590
|
+
- `steps`: includes:
|
|
591
|
+
- `steps`: for each step type how many times it was executed
|
|
592
|
+
- `once_per_sample`: how many times each step was executed, counted only once per question
|
|
593
|
+
- `empty_results`: how many times the step was executed and returned empty results
|
|
594
|
+
- `errors`: how many times the step was executed and resulted in error
|
|
593
595
|
- `micro`: statistics across questions, regardless of template. It includes:
|
|
594
596
|
- `number_of_error_samples`: total number of questions, which resulted in error response
|
|
595
597
|
- `number_of_success_samples`: total number of questions, which resulted in successful response
|
|
596
|
-
- `
|
|
597
|
-
|
|
598
|
-
|
|
599
|
-
|
|
600
|
-
|
|
601
|
-
|
|
602
|
-
|
|
603
|
-
|
|
604
|
-
|
|
605
|
-
|
|
606
|
-
|
|
607
|
-
|
|
608
|
-
|
|
609
|
-
- `
|
|
610
|
-
|
|
611
|
-
|
|
612
|
-
|
|
613
|
-
|
|
614
|
-
- `
|
|
615
|
-
- `
|
|
616
|
-
- `
|
|
617
|
-
- `
|
|
618
|
-
- `
|
|
619
|
-
- `
|
|
620
|
-
- `
|
|
621
|
-
- `
|
|
622
|
-
- `
|
|
598
|
+
- `sum`, `mean`, `median`, `min` and `max` statistics over all non-error responses for the following metrics:
|
|
599
|
+
- `input_tokens`
|
|
600
|
+
- `output_tokens`
|
|
601
|
+
- `total_tokens`
|
|
602
|
+
- `elapsed_sec`
|
|
603
|
+
- `answer_recall`
|
|
604
|
+
- `answer_precision`
|
|
605
|
+
- `answer_f1`
|
|
606
|
+
- `answer_relevance`
|
|
607
|
+
- `answer_relevance_cost`
|
|
608
|
+
- `retrieval_answer_recall`
|
|
609
|
+
- `retrieval_answer_precision`
|
|
610
|
+
- `retrieval_answer_f1`
|
|
611
|
+
- `retrieval_context_recall`
|
|
612
|
+
- `retrieval_context_precision`
|
|
613
|
+
- `retrieval_context_f1`
|
|
614
|
+
- `steps_score`
|
|
615
|
+
- `macro`: averages across templates, i.e., for each metric, the mean of the per-template means. It includes the following means:
|
|
616
|
+
- `input_tokens`
|
|
617
|
+
- `output_tokens`
|
|
618
|
+
- `total_tokens`
|
|
619
|
+
- `elapsed_sec`
|
|
620
|
+
- `answer_recall`
|
|
621
|
+
- `answer_precision`
|
|
622
|
+
- `answer_f1`
|
|
623
|
+
- `answer_relevance`
|
|
624
|
+
- `answer_relevance_cost`
|
|
625
|
+
- `retrieval_answer_recall`
|
|
626
|
+
- `retrieval_answer_precision`
|
|
627
|
+
- `retrieval_answer_f1`
|
|
628
|
+
- `retrieval_context_recall`
|
|
629
|
+
- `retrieval_context_precision`
|
|
630
|
+
- `retrieval_context_f1`
|
|
631
|
+
- `steps_score`
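
The `sum`, `mean`, `median`, `min` and `max` statistics over non-error responses can be sketched as below. This is an illustration only, not the package's code: the helper name `summarize` and the convention that an error response is marked by an `error` key are assumptions for this example.

```python
from statistics import mean, median


def summarize(samples, metric):
    """Summary statistics of `metric` over the non-error samples that report it.

    A sample is treated as an error response if it has an "error" key
    (hypothetical convention for this sketch). Returns None when no sample
    contributes a value.
    """
    values = [s[metric] for s in samples if "error" not in s and metric in s]
    if not values:
        return None
    return {
        "sum": sum(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


# One successful response and one error response: only the first is counted.
samples = [
    {"answer_relevance": 0.9},
    {"error": "timeout"},
]

print(summarize(samples, "answer_relevance"))
```

With a single successful sample, all five statistics equal that sample's value, which matches the shape of the `answer_relevance` entry in the example aggregates.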
|
|
623
632
|
|
|
624
633
|
#### Example Aggregates
|
|
625
634
|
|
|
@@ -647,11 +656,11 @@ per_template:
|
|
|
647
656
|
min: 1.0
|
|
648
657
|
max: 1.0
|
|
649
658
|
answer_relevance:
|
|
650
|
-
|
|
651
|
-
|
|
652
|
-
|
|
653
|
-
|
|
654
|
-
|
|
659
|
+
min: 0.9
|
|
660
|
+
max: 0.9
|
|
661
|
+
mean: 0.9
|
|
662
|
+
median: 0.9
|
|
663
|
+
sum: 0.9
|
|
655
664
|
answer_relevance_cost:
|
|
656
665
|
min: 0.0007
|
|
657
666
|
max: 0.0007
|
|
@@ -1013,7 +1022,7 @@ The following metrics are based on the content of retrieved documents.
|
|
|
1013
1022
|
|
|
1014
1023
|
#### Context Recall@k
|
|
1015
1024
|
|
|
1016
|
-
The fraction of relevant items among the top *k* recommendations. It answers the question: "Of all items the user cares about, how many did we
|
|
1025
|
+
The fraction of relevant items among the top *k* recommendations. It answers the question: "Of all items the user cares about, how many did we include in the first k spots?"
|
|
1017
1026
|
* **Formula**:
|
|
1018
1027
|
$`
|
|
1019
1028
|
\frac{\text{Number of relevant items in top k}}{\text{Number of relevant items}}
|
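
The Context Recall@k formula above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the package's implementation: the function name `recall_at_k` and the representation of documents as ID strings are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Number of relevant items in the top k, divided by the number of relevant items.

    retrieved_ids: ranked list of retrieved document IDs, best first.
    relevant_ids:  IDs of the documents considered relevant (the reference).
    Returns None when there are no relevant items, since the ratio is undefined.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return None
    hits = len(relevant.intersection(retrieved_ids[:k]))
    return hits / len(relevant)


# Illustrative ranking: two of the three relevant documents are in the top 3.
retrieved = ["d1", "d7", "d3", "d9"]
relevant = ["d1", "d3", "d5"]

print(recall_at_k(retrieved, relevant, k=3))
```

Because the denominator is the total number of relevant items rather than k, the value can never exceed 1 and does not drop when irrelevant items also appear in the top k.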