graphrag-eval 5.0.2__tar.gz → 5.1.1__tar.gz

@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.3
2
2
  Name: graphrag-eval
3
- Version: 5.0.2
3
+ Version: 5.1.1
4
4
  Summary: For assessing question answering systems' final answers and intermediate steps, against a given set of questions, reference answers and steps.
5
5
  License: Apache-2.0
6
6
  Author: Philip Ganchev
@@ -9,10 +9,12 @@ Requires-Python: >=3.12,<3.13
9
9
  Classifier: License :: OSI Approved :: Apache Software License
10
10
  Classifier: Programming Language :: Python :: 3
11
11
  Classifier: Programming Language :: Python :: 3.12
12
- Provides-Extra: openai
13
- Requires-Dist: langevals (==0.1.*) ; extra == "openai"
14
- Requires-Dist: langevals-ragas (>=0.1.12,<0.2.0) ; extra == "openai"
15
- Requires-Dist: openai (>=1.97.0,<2.0.0) ; extra == "openai"
12
+ Provides-Extra: ragas
13
+ Requires-Dist: langchain-openai (==0.3.7) ; extra == "ragas"
14
+ Requires-Dist: langchain_community (==0.3.18) ; extra == "ragas"
15
+ Requires-Dist: langevals[ragas] (==0.1.8) ; extra == "ragas"
16
+ Requires-Dist: litellm (==1.61.20) ; extra == "ragas"
17
+ Requires-Dist: ragas (==0.2.9) ; extra == "ragas"
16
18
  Project-URL: Repository, https://github.com/Ontotext-AD/graphrag-eval
17
19
  Description-Content-Type: text/markdown
18
20
 
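The dependency change in this hunk can be checked mechanically. Below is a minimal sketch, assuming the `Requires-Dist` lines are copied from the 5.1.1 side of the metadata above. The helper name `requirements_for_extra` is hypothetical, and the marker matching is a plain substring check, not a full PEP 508 parser.

```python
from email.parser import Parser

# Metadata excerpt copied from the 5.1.1 side of the diff above.
PKG_INFO = """\
Metadata-Version: 2.3
Name: graphrag-eval
Version: 5.1.1
Provides-Extra: ragas
Requires-Dist: langchain-openai (==0.3.7) ; extra == "ragas"
Requires-Dist: langchain_community (==0.3.18) ; extra == "ragas"
Requires-Dist: langevals[ragas] (==0.1.8) ; extra == "ragas"
Requires-Dist: litellm (==1.61.20) ; extra == "ragas"
Requires-Dist: ragas (==0.2.9) ; extra == "ragas"
"""


def requirements_for_extra(pkg_info: str, extra: str) -> list[str]:
    """Return requirement specs guarded by the given extra (hypothetical helper)."""
    # Core metadata uses RFC 822 style headers, so the stdlib email parser can read it.
    message = Parser().parsestr(pkg_info)
    marker = f'extra == "{extra}"'
    selected = []
    for line in message.get_all("Requires-Dist") or []:
        spec, _, condition = line.partition(";")
        if marker in condition:
            selected.append(spec.strip())
    return selected


ragas_requirements = requirements_for_extra(PKG_INFO, "ragas")
openai_requirements = requirements_for_extra(PKG_INFO, "openai")
print(ragas_requirements)
print(openai_requirements)
```

With the 5.1.1 metadata, the `openai` extra no longer selects any requirement, which matches the rename to `ragas` shown in the hunk.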
@@ -22,8 +24,7 @@ Description-Content-Type: text/markdown
22
24
 
23
25
  # QA Evaluation
24
26
 
25
- This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used
26
- to reach the answer (such as orchestrated and executed steps), compared to the given reference steps.
27
+ This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used to reach the answer (such as orchestrated and executed steps), compared to the given reference steps.
27
28
 
28
29
  ## License
29
30
 
@@ -43,12 +44,12 @@ graphrag-eval = "*"
43
44
  To evaluate answer relevance and answer correctness:
44
45
 
45
46
  ```bash
46
- pip install 'graphrag-eval[openai]'
47
+ pip install 'graphrag-eval[ragas]'
47
48
  ```
48
49
 
49
50
  or add the following dependency in your `pyproject.toml` file:
50
51
  ```toml
51
- graphrag-eval = {version = "*", extras = ["openai"]}
52
+ graphrag-eval = {version = "*", extras = ["ragas"]}
52
53
  ```
53
54
 
54
55
  ## Maintainers
@@ -61,7 +62,7 @@ For issues or feature requests, please open [a GitHub issue](https://github.com/
61
62
  To evaluate only correctness of final answers (system responses), you can clone this repository and run the code on the command line:
62
63
 
63
64
  1. Prepare an input TSV file with columns `Question`, `Reference answer` and `Actual answer`
64
- 1. Execute `poetry install --with openai`
65
+ 1. Execute `poetry install --with ragas`
65
66
  1. Execute `OPENAI_API_KEY=<your_api_key> poetry run answer-correctness -i <input_file.tsv> -o <output_file.tsv>`
66
67
 
67
68
  We plan to improve CLI support in future releases.
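The input file for step 1 could be produced as in the sketch below. The column names come from the README text above; the example row is made up, and whether the CLI needs any particular quoting or encoding is an assumption that the diff does not confirm.

```python
import csv
import io

# Column names are taken from step 1 above; the row content is invented.
COLUMNS = ["Question", "Reference answer", "Actual answer"]
rows = [
    {
        "Question": "What is the capital of France?",
        "Reference answer": "Paris",
        "Actual answer": "The capital of France is Paris.",
    },
]


def write_tsv(rows: list[dict], stream) -> None:
    """Write a header line and one tab-separated line per row."""
    writer = csv.DictWriter(
        stream, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_tsv(rows, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)

# To create the actual input file instead of an in-memory string:
# with open("input_file.tsv", "w", newline="", encoding="utf-8") as f:
#     write_tsv(rows, f)
```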
@@ -70,24 +71,24 @@ We plan to improve CLI support in future releases.
70
71
 
71
72
  To evaluate answers and/or steps:
72
73
  1. Install this package: section [Install](#Installation)
73
- 1. Format the corpus of questions and reference answers and/or steps: section [Reference Q&A Corpus](#reference-qa-corpus)
74
- 1. Format the answers and/or steps you want to evaluate: section [Evaluation Target Corpus](#Evaluation-Target-Corpus)
74
+ 1. Format the dataset of questions and reference answers and/or steps: section [Reference Q&A Data](#Reference-qa-Data)
75
+ 1. Format the answers and/or steps you want to evaluate: section [Responses to evaluate](#Responses-to-evaluate)
75
76
  1. To evaluate answer relevance:
76
77
  1. Include `actual_answer` in the target data to evaluate
77
78
  1. Set environment variable `OPENAI_API_KEY` appropriately
78
79
  1. To evaluate answer correctness:
79
- 1. Include `reference_answer` in the reference corpus and `actual_answer` in the target data to evaluate
80
+ 1. Include `reference_answer` in the reference dataset and `actual_answer` in the target data to evaluate
80
81
  1. Set environment variable `OPENAI_API_KEY` appropriately
81
82
  1. To evaluate steps:
82
- 1. Include `reference_steps` in the reference corpus and `actual_steps` in target data to evaluate
83
- 1. Call the evaluation function with the reference corpus and target corpus: section [Example Usage Code](#Example-Usage-Code)
83
+ 1. Include `reference_steps` in the reference data and `actual_steps` in target data to evaluate
84
+ 1. Call the evaluation function with the reference data and target data: section [Usage Code](#Usage-Code)
84
85
  1. Call the aggregation function with the evaluation results
85
86
 
86
87
  Answer evaluation (correctness and relevance) uses the LLM `openai/gpt-4o-mini`.
87
88
 
88
- ### Reference Q&A Corpus
89
+ ### Reference Q&A Data
89
90
 
90
- A reference corpus is a list of templates, each of which contains:
91
+ A reference dataset is a list of templates, each of which contains:
91
92
 
92
93
  - `template_id`: Unique template identifier
93
94
  - `questions`: A list of questions derived from this template, where each includes:
@@ -107,9 +108,9 @@ Each step includes:
107
108
  - `ordered`: (optional, defaults to `false`) For SPARQL query results, whether results order matters. `true` means that the actual result rows must be ordered as the reference result; `false` means that result rows are matched as a set.
108
109
  - `required_columns`: (optional) - required only for SPARQL query results; list of binding names, which are required for SPARQL query results to match
109
110
 
110
- #### Example Reference Corpus
111
+ #### Reference Data
111
112
 
112
- The example corpus below illustrates a minimal but realistic Q&A dataset, showing two templates with associated questions and steps.
113
+ The example data below illustrates a minimal but realistic Q&A dataset, showing two templates with associated questions and steps.
113
114
 
114
115
  ```yaml
115
116
  - template_id: list_all_transformers_within_Substation_SUBSTATION
@@ -275,9 +276,9 @@ The example corpus below illustrates a minimal but realistic Q&A dataset, showin
275
276
 
276
277
  The module is agnostic to the specific LLM agent implementation and model; it depends solely on the format of the response.
277
278
 
278
- ### Evaluation Target Corpus
279
+ ### Responses to evaluate
279
280
 
280
- Below is an example response from the question-answering system for a single question (unless there is an error in answering: see [Example Target Input on Error](#example-target-input-on-error) below):
281
+ Given a question, if the question-answering system successfully responds, to evaluate the response, call `run_evaluation()` with the response formatted as in the example below. (On the other hand, if an error occurs while generating a response, format it as in [Target Input on Error](#target-input-on-error).)
281
282
 
282
283
  ```json
283
284
  {
@@ -330,9 +331,9 @@ Below is an example response from the question-answering system for a single que
330
331
  }
331
332
  ```
332
333
 
333
- #### Example Target Input on Error
334
+ #### Target Input on Error
334
335
 
335
- If an error occurs during generating a response to a question, the expected target input for evaluation is:
336
+ If an error occurs while the question-answering system is generating a response, and you want to tally this error, the input to `run_evaluation()` should look like:
336
337
 
337
338
  ```json
338
339
  {
@@ -342,22 +343,22 @@ If an error occurs during generating a response to a question, the expected targ
342
343
  }
343
344
  ```
344
345
 
345
- ### Example Usage Code
346
+ ### Usage Code
346
347
 
347
348
  ```python
348
349
  from graphrag_eval import run_evaluation, compute_aggregates
349
350
 
350
- reference_qas: list[dict] = [] # read your corpus
351
+ reference_qas: list[dict] = [] # read your reference data
351
352
  chat_responses: dict = {} # call your implementation to get the response
352
353
  evaluation_results = run_evaluation(reference_qas, chat_responses)
353
354
  aggregates = compute_aggregates(evaluation_results)
354
355
  ```
355
356
 
356
- `evaluation_results` is a list of statistics for each question, as in section [Example Evaluation Results](#example-evaluation-results). The format is explained in section [Output Keys](#output-keys)
357
+ `evaluation_results` is a list of statistics for each question, as in section [Evaluation Results](#Evaluation-results). The format is explained in section [Output Keys](#output-keys)
357
358
 
358
359
  If your chat responses contain actual answers, set your environment variable `OPENAI_API_KEY` before running the code above.
359
360
 
360
- ### Example Evaluation Results
361
+ ### Evaluation Results
361
362
 
362
363
  The output is a list of statistics for each question from the reference Q&A dataset. Here is an example of statistics for one question:
363
364
 
@@ -445,7 +446,6 @@ The output is a list of statistics for each question from the reference Q&A data
445
446
  retrieval_answer_recall_reason: The context contains all the transformers listed in the reference answer
446
447
  retrieval_answer_recall_cost: 0.0007
447
448
  retrieval_answer_precision: 1.0
448
- retrieval_answer_precision_reason: The context contains only transformers listed in the reference answer
449
449
  retrieval_answer_precision_cost: 0.0003
450
450
  retrieval_answer_f1: 1.0
451
451
  retrieval_answer_f1_cost: 0.001
@@ -570,7 +570,6 @@ All `actual_steps` with `name` "retrieval" contain:
570
570
  - `retrieval_answer_recall_error`: (optional) error message if `retrieval_answer_recall` evaluation fails
571
571
  - `retrieval_answer_recall_cost`: cost of evaluating `retrieval_answer_recall`, in US dollars
572
572
  - `retrieval_answer_precision`: (optional) precision of the retrieved context with respect to the reference answer, if evaluation succeeds
573
- - `retrieval_answer_precision_reason`: (optional) LLM reasoning in evaluating `retrieval_answer_precision`
574
573
  - `retrieval_answer_precision_error`: (optional) error message if `retrieval_answer_precision` evaluation fails
575
574
  - `retrieval_answer_precision_cost`: cost of evaluating `retrieval_answer_precision`, in US dollars
576
575
  - `retrieval_answer_f1`: (optional) F1 score of the retrieved context with respect to the reference answer, if `retrieval_answer_recall` and `retrieval_answer_precision` succeed
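The F1 key above depends on both recall and precision being available. A minimal sketch of the standard F1 definition (harmonic mean of precision and recall) follows. The function name `f1_score` is hypothetical, and treating a failed evaluation as `None` and a zero denominator as `0.0` are assumptions, not behavior confirmed by this diff.

```python
from typing import Optional


def f1_score(recall: Optional[float], precision: Optional[float]) -> Optional[float]:
    """Harmonic mean of precision and recall; None if either input is missing."""
    if recall is None or precision is None:
        # Assumption: no F1 is reported when either evaluation failed.
        return None
    if recall + precision == 0:
        # Assumption: avoid division by zero by reporting 0.0.
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(f1_score(1.0, 1.0))
print(f1_score(0.5, 1.0))
print(f1_score(None, 1.0))
```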
@@ -584,60 +583,72 @@ All `actual_steps` with `name` "retrieval" contain:
584
583
 
585
584
  #### Aggregates Keys
586
585
 
587
- The `aggregates` object provides aggregated evaluation metrics.
588
- Aggregates are computed both per-template and overall, using micro and macro averaging strategies.
589
- These aggregates support analysis of agent quality, token efficiency, and execution performance.
586
+ The `aggregates` object provides aggregated evaluation metrics. These aggregates support analysis of agent quality, token efficiency, and execution performance. Aggregates are computed:
587
+ 1. per question template, and
588
+ 1. over all questions in the dataset, using micro and macro averaging
589
+
590
590
  Aggregates are:
591
591
  - `per_template`: a dictionary mapping a template identifier to the following statistics:
592
592
  - `number_of_error_samples`: number of questions for this template, which resulted in error response
593
593
  - `number_of_success_samples`: number of questions for this template, which resulted in successful response
594
- - `input_tokens`: `sum`, `mean`, `median`, `min` and `max` statistics for `input_tokens` of all successful questions for this template
595
- - `output_tokens`: `sum`, `mean`, `median`, `min` and `max` statistics for `output_tokens` of all successful questions for this template
596
- - `total_tokens`: `sum`, `mean`, `median`, `min` and `max` statistics for `total_tokens` of all successful questions for this template
597
- - `elapsed_sec`: `sum`, `mean`, `median`, `min` and `max` statistics for `elapsed_sec` of all successful questions for this template
598
- - `answer_recall`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_recall` of all successful questions for this template
599
- - `answer_precision`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_precision` of all successful questions for this template
600
- - `answer_f1`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_f1` of all successful questions for this template
601
- - `answer_relevance`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_relevance` of all successful questions for this template
602
- - `steps_score`: `sum`, `mean`, `median`, `min` and `max` statistics for `steps_score` of all successful questions for this template
603
- - `steps`: `sum`, `mean`, `median`, `min` and `max` statistics for `steps` of all successful questions for this template. Includes:
604
- - `steps`: for each step type how many times it was executed
605
- - `once_per_sample`: how many times each step was executed, counted only once per question
606
- - `empty_results`: how many times the step was executed and returned empty results
607
- - `errors`: how many times the step was executed and resulted in error
608
- - `retrieval_context_recall`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_recall` for all successful questions in this template
609
- - `retrieval_context_precision`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_precision` for all successful questions in this template
610
- - `retrieval_context_f1`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_f1` for all successful questions in this template
594
+ - `sum`, `mean`, `median`, `min` and `max` statistics over all non-error responses for this template for the following metrics:
595
+ - `input_tokens`
596
+ - `output_tokens`
597
+ - `total_tokens`
598
+ - `elapsed_sec`
599
+ - `answer_recall`
600
+ - `answer_precision`
601
+ - `answer_f1`
602
+ - `answer_relevance`
603
+ - `steps_score`
604
+ - `retrieval_answer_recall`
605
+ - `retrieval_answer_precision`
606
+ - `retrieval_answer_f1`
607
+ - `retrieval_context_recall`
608
+ - `retrieval_context_precision`
609
+ - `retrieval_context_f1`
610
+ - `steps`: includes:
611
+ - `steps`: for each step type how many times it was executed
612
+ - `once_per_sample`: how many times each step was executed, counted only once per question
613
+ - `empty_results`: how many times the step was executed and returned empty results
614
+ - `errors`: how many times the step was executed and resulted in error
611
615
  - `micro`: statistics across questions, regardless of template. It includes:
612
616
  - `number_of_error_samples`: total number of questions, which resulted in error response
613
617
  - `number_of_success_samples`: total number of questions, which resulted in successful response
614
- - `input_tokens`: `sum`, `mean`, `median`, `min` and `max` for `input_tokens` of all successful questions
615
- - `output_tokens`: `sum`, `mean`, `median`, `min` and `max` for `output_tokens` of all successful questions
616
- - `total_tokens`: `sum`, `mean`, `median`, `min` and `max` for `total_tokens` of all successful questions
617
- - `elapsed_sec`: `sum`, `mean`, `median`, `min` and `max` for `elapsed_sec` of all successful questions
618
- - `answer_recall`: `sum`, `mean`, `median`, `min` and `max` for `answer_recall` of all successful questions
619
- - `answer_precision`: `sum`, `mean`, `median`, `min` and `max` for `answer_precision` of all successful questions
620
- - `answer_f1`: `sum`, `mean`, `median`, `min` and `max` for `answer_f1` of all successful questions
621
- - `answer_relevance`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_relevance` of all successful questions
622
- - `answer_relevance_cost`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_relevance_cost` of all successful questions
623
- - `retrieval_context_recall`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_recall` of all successful questions
624
- - `retrieval_context_precision`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_precision` of all successful questions
625
- - `retrieval_context_f1`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_f1` of all successful questions
626
- - `steps_score`: `sum`, `mean`, `median`, `min` and `max` for `steps_score` of all successful questions
627
- - `macro`: averages across templates, i.e., the mean of each metric per template, averaged. It includes:
628
- - `input_tokens`: `mean` for `input_tokens`
629
- - `output_tokens`: `mean` for `output_tokens`
630
- - `total_tokens`: `mean` for `total_tokens`
631
- - `elapsed_sec`: `mean` for `elapsed_sec`
632
- - `answer_recall`: `mean` for `answer_recall`
633
- - `answer_precision`: `mean` for `answer_precision`
634
- - `answer_f1`: `mean` for `answer_f1`
635
- - `answer_relevance`: `mean` for `answer_relevance`
636
- - `answer_relevance_cost`: `mean` for `answer_relevance_cost`
637
- - `retrieval_context_recall`: `mean` for `retrieval_context_recall`
638
- - `retrieval_context_precision`: `mean` for `retrieval_context_precision`
639
- - `retrieval_context_f1`: `mean` for `retrieval_context_f1`
640
- - `steps_score`: `mean` for `steps_score`
618
+ - `sum`, `mean`, `median`, `min` and `max` statistics over all non-error responses for the following metrics:
619
+ - `input_tokens`
620
+ - `output_tokens`
621
+ - `total_tokens`
622
+ - `elapsed_sec`
623
+ - `answer_recall`
624
+ - `answer_precision`
625
+ - `answer_f1`
626
+ - `answer_relevance`
627
+ - `answer_relevance_cost`
628
+ - `retrieval_answer_recall`
629
+ - `retrieval_answer_precision`
630
+ - `retrieval_answer_f1`
631
+ - `retrieval_context_recall`
632
+ - `retrieval_context_precision`
633
+ - `retrieval_context_f1`
634
+ - `steps_score`
635
+ - `macro`: averages across templates, i.e., the mean of each metric per template, averaged. It includes the following means:
636
+ - `input_tokens`
637
+ - `output_tokens`
638
+ - `total_tokens`
639
+ - `elapsed_sec`
640
+ - `answer_recall`
641
+ - `answer_precision`
642
+ - `answer_f1`
643
+ - `answer_relevance`
644
+ - `answer_relevance_cost`
645
+ - `retrieval_answer_recall`
646
+ - `retrieval_answer_precision`
647
+ - `retrieval_answer_f1`
648
+ - `retrieval_context_recall`
649
+ - `retrieval_context_precision`
650
+ - `retrieval_context_f1`
651
+ - `steps_score`
641
652
 
642
653
  #### Example Aggregates
643
654
 
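The difference between the `per_template`, `micro` and `macro` aggregates described in this hunk can be illustrated with a short sketch. The scores and template identifiers below are invented, and `summarize` is a hypothetical helper; the package's own `compute_aggregates` may differ in details such as how missing values are handled.

```python
from statistics import mean, median

# Hypothetical per-question scores for one metric, grouped by template identifier.
scores_by_template = {
    "template_a": [1.0, 0.5, 0.0],
    "template_b": [1.0],
}


def summarize(values: list[float]) -> dict[str, float]:
    """Compute the five statistics named in the README for one metric."""
    return {
        "sum": sum(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


# Per-template: statistics over the questions of each template separately.
per_template = {tid: summarize(vals) for tid, vals in scores_by_template.items()}

# Micro: statistics over all questions, regardless of template.
all_scores = [s for vals in scores_by_template.values() for s in vals]
micro = summarize(all_scores)

# Macro: the mean of the per-template means.
macro = {"mean": mean([stats["mean"] for stats in per_template.values()])}

print(micro["mean"], macro["mean"])
```

In this made-up data the micro mean is pulled down by the template with more questions, while the macro mean weights both templates equally.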
@@ -665,11 +676,11 @@ per_template:
665
676
  min: 1.0
666
677
  max: 1.0
667
678
  answer_relevance:
668
- min: 0.9
669
- max: 0.9
670
- mean: 0.9
671
- median: 0.9
672
- sum: 0.9
679
+ min: 0.9
680
+ max: 0.9
681
+ mean: 0.9
682
+ median: 0.9
683
+ sum: 0.9
673
684
  answer_relevance_cost:
674
685
  min: 0.0007
675
686
  max: 0.0007
@@ -1031,7 +1042,7 @@ The following metrics are based on the content of retrieved documents.
1031
1042
 
1032
1043
  #### Context Recall@k
1033
1044
 
1034
- The fraction of relevant items among the top *k* recommendations. It answers the question: "Of all items the user cares about, how many did we inclide in the first k spots?"
1045
+ The fraction of relevant items among the top *k* recommendations. It answers the question: "Of all items the user cares about, how many did we include in the first k spots?"
1035
1046
  * **Formula**:
1036
1047
  $`
1037
1048
  \frac{\text{Number of relevant items in top k}}{\text{Number of relevant items}}
@@ -4,8 +4,7 @@
4
4
 
5
5
  # QA Evaluation
6
6
 
7
- This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used
8
- to reach the answer (such as orchestrated and executed steps), compared to the given reference steps.
7
+ This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used to reach the answer (such as orchestrated and executed steps), compared to the given reference steps.
9
8
 
10
9
  ## License
11
10
 
@@ -25,12 +24,12 @@ graphrag-eval = "*"
25
24
  To evaluate answer relevance and answer correctness:
26
25
 
27
26
  ```bash
28
- pip install 'graphrag-eval[openai]'
27
+ pip install 'graphrag-eval[ragas]'
29
28
  ```
30
29
 
31
30
  or add the following dependency in your `pyproject.toml` file:
32
31
  ```toml
33
- graphrag-eval = {version = "*", extras = ["openai"]}
32
+ graphrag-eval = {version = "*", extras = ["ragas"]}
34
33
  ```
35
34
 
36
35
  ## Maintainers
@@ -43,7 +42,7 @@ For issues or feature requests, please open [a GitHub issue](https://github.com/
43
42
  To evaluate only correctness of final answers (system responses), you can clone this repository and run the code on the command line:
44
43
 
45
44
  1. Prepare an input TSV file with columns `Question`, `Reference answer` and `Actual answer`
46
- 1. Execute `poetry install --with openai`
45
+ 1. Execute `poetry install --with ragas`
47
46
  1. Execute `OPENAI_API_KEY=<your_api_key> poetry run answer-correctness -i <input_file.tsv> -o <output_file.tsv>`
48
47
 
49
48
  We plan to improve CLI support in future releases.
@@ -52,24 +51,24 @@ We plan to improve CLI support in future releases.
52
51
 
53
52
  To evaluate answers and/or steps:
54
53
  1. Install this package: section [Install](#Installation)
55
- 1. Format the corpus of questions and reference answers and/or steps: section [Reference Q&A Corpus](#reference-qa-corpus)
56
- 1. Format the answers and/or steps you want to evaluate: section [Evaluation Target Corpus](#Evaluation-Target-Corpus)
54
+ 1. Format the dataset of questions and reference answers and/or steps: section [Reference Q&A Data](#Reference-qa-Data)
55
+ 1. Format the answers and/or steps you want to evaluate: section [Responses to evaluate](#Responses-to-evaluate)
57
56
  1. To evaluate answer relevance:
58
57
  1. Include `actual_answer` in the target data to evaluate
59
58
  1. Set environment variable `OPENAI_API_KEY` appropriately
60
59
  1. To evaluate answer correctness:
61
- 1. Include `reference_answer` in the reference corpus and `actual_answer` in the target data to evaluate
60
+ 1. Include `reference_answer` in the reference dataset and `actual_answer` in the target data to evaluate
62
61
  1. Set environment variable `OPENAI_API_KEY` appropriately
63
62
  1. To evaluate steps:
64
- 1. Include `reference_steps` in the reference corpus and `actual_steps` in target data to evaluate
65
- 1. Call the evaluation function with the reference corpus and target corpus: section [Example Usage Code](#Example-Usage-Code)
63
+ 1. Include `reference_steps` in the reference data and `actual_steps` in target data to evaluate
64
+ 1. Call the evaluation function with the reference data and target data: section [Usage Code](#Usage-Code)
66
65
  1. Call the aggregation function with the evaluation results
67
66
 
68
67
  Answer evaluation (correctness and relevance) uses the LLM `openai/gpt-4o-mini`.
69
68
 
70
- ### Reference Q&A Corpus
69
+ ### Reference Q&A Data
71
70
 
72
- A reference corpus is a list of templates, each of which contains:
71
+ A reference dataset is a list of templates, each of which contains:
73
72
 
74
73
  - `template_id`: Unique template identifier
75
74
  - `questions`: A list of questions derived from this template, where each includes:
@@ -89,9 +88,9 @@ Each step includes:
89
88
  - `ordered`: (optional, defaults to `false`) For SPARQL query results, whether results order matters. `true` means that the actual result rows must be ordered as the reference result; `false` means that result rows are matched as a set.
90
89
  - `required_columns`: (optional) - required only for SPARQL query results; list of binding names, which are required for SPARQL query results to match
91
90
 
92
- #### Example Reference Corpus
91
+ #### Reference Data
93
92
 
94
- The example corpus below illustrates a minimal but realistic Q&A dataset, showing two templates with associated questions and steps.
93
+ The example data below illustrates a minimal but realistic Q&A dataset, showing two templates with associated questions and steps.
95
94
 
96
95
  ```yaml
97
96
  - template_id: list_all_transformers_within_Substation_SUBSTATION
@@ -257,9 +256,9 @@ The example corpus below illustrates a minimal but realistic Q&A dataset, showin
257
256
 
258
257
  The module is agnostic to the specific LLM agent implementation and model; it depends solely on the format of the response.
259
258
 
260
- ### Evaluation Target Corpus
259
+ ### Responses to evaluate
261
260
 
262
- Below is an example response from the question-answering system for a single question (unless there is an error in answering: see [Example Target Input on Error](#example-target-input-on-error) below):
261
+ Given a question, if the question-answering system successfully responds, to evaluate the response, call `run_evaluation()` with the response formatted as in the example below. (On the other hand, if an error occurs while generating a response, format it as in [Target Input on Error](#target-input-on-error).)
263
262
 
264
263
  ```json
265
264
  {
@@ -312,9 +311,9 @@ Below is an example response from the question-answering system for a single que
312
311
  }
313
312
  ```
314
313
 
315
- #### Example Target Input on Error
314
+ #### Target Input on Error
316
315
 
317
- If an error occurs during generating a response to a question, the expected target input for evaluation is:
316
+ If an error occurs while the question-answering system is generating a response, and you want to tally this error, the input to `run_evaluation()` should look like:
318
317
 
319
318
  ```json
320
319
  {
@@ -324,22 +323,22 @@ If an error occurs during generating a response to a question, the expected targ
324
323
  }
325
324
  ```
326
325
 
327
- ### Example Usage Code
326
+ ### Usage Code
328
327
 
329
328
  ```python
330
329
  from graphrag_eval import run_evaluation, compute_aggregates
331
330
 
332
- reference_qas: list[dict] = [] # read your corpus
331
+ reference_qas: list[dict] = [] # read your reference data
333
332
  chat_responses: dict = {} # call your implementation to get the response
334
333
  evaluation_results = run_evaluation(reference_qas, chat_responses)
335
334
  aggregates = compute_aggregates(evaluation_results)
336
335
  ```
337
336
 
338
- `evaluation_results` is a list of statistics for each question, as in section [Example Evaluation Results](#example-evaluation-results). The format is explained in section [Output Keys](#output-keys)
337
+ `evaluation_results` is a list of statistics for each question, as in section [Evaluation Results](#Evaluation-results). The format is explained in section [Output Keys](#output-keys)
339
338
 
340
339
  If your chat responses contain actual answers, set your environment variable `OPENAI_API_KEY` before running the code above.
341
340
 
342
- ### Example Evaluation Results
341
+ ### Evaluation Results
343
342
 
344
343
  The output is a list of statistics for each question from the reference Q&A dataset. Here is an example of statistics for one question:
345
344
 
@@ -427,7 +426,6 @@ The output is a list of statistics for each question from the reference Q&A data
427
426
  retrieval_answer_recall_reason: The context contains all the transformers listed in the reference answer
428
427
  retrieval_answer_recall_cost: 0.0007
429
428
  retrieval_answer_precision: 1.0
430
- retrieval_answer_precision_reason: The context contains only transformers listed in the reference answer
431
429
  retrieval_answer_precision_cost: 0.0003
432
430
  retrieval_answer_f1: 1.0
433
431
  retrieval_answer_f1_cost: 0.001
@@ -552,7 +550,6 @@ All `actual_steps` with `name` "retrieval" contain:
552
550
  - `retrieval_answer_recall_error`: (optional) error message if `retrieval_answer_recall` evaluation fails
553
551
  - `retrieval_answer_recall_cost`: cost of evaluating `retrieval_answer_recall`, in US dollars
554
552
  - `retrieval_answer_precision`: (optional) precision of the retrieved context with respect to the reference answer, if evaluation succeeds
555
- - `retrieval_answer_precision_reason`: (optional) LLM reasoning in evaluating `retrieval_answer_precision`
556
553
  - `retrieval_answer_precision_error`: (optional) error message if `retrieval_answer_precision` evaluation fails
557
554
  - `retrieval_answer_precision_cost`: cost of evaluating `retrieval_answer_precision`, in US dollars
558
555
  - `retrieval_answer_f1`: (optional) F1 score of the retrieved context with respect to the reference answer, if `retrieval_answer_recall` and `retrieval_answer_precision` succeed
@@ -566,60 +563,72 @@ All `actual_steps` with `name` "retrieval" contain:
566
563
 
567
564
  #### Aggregates Keys
568
565
 
569
- The `aggregates` object provides aggregated evaluation metrics.
570
- Aggregates are computed both per-template and overall, using micro and macro averaging strategies.
571
- These aggregates support analysis of agent quality, token efficiency, and execution performance.
566
+ The `aggregates` object provides aggregated evaluation metrics. These aggregates support analysis of agent quality, token efficiency, and execution performance. Aggregates are computed:
567
+ 1. per question template, and
568
+ 1. over all questions in the dataset, using micro and macro averaging
569
+
572
570
  Aggregates are:
573
571
  - `per_template`: a dictionary mapping a template identifier to the following statistics:
574
572
  - `number_of_error_samples`: number of questions for this template, which resulted in error response
575
573
  - `number_of_success_samples`: number of questions for this template, which resulted in successful response
576
- - `input_tokens`: `sum`, `mean`, `median`, `min` and `max` statistics for `input_tokens` of all successful questions for this template
577
- - `output_tokens`: `sum`, `mean`, `median`, `min` and `max` statistics for `output_tokens` of all successful questions for this template
578
- - `total_tokens`: `sum`, `mean`, `median`, `min` and `max` statistics for `total_tokens` of all successful questions for this template
579
- - `elapsed_sec`: `sum`, `mean`, `median`, `min` and `max` statistics for `elapsed_sec` of all successful questions for this template
580
- - `answer_recall`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_recall` of all successful questions for this template
581
- - `answer_precision`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_precision` of all successful questions for this template
582
- - `answer_f1`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_f1` of all successful questions for this template
583
- - `answer_relevance`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_relevance` of all successful questions for this template
584
- - `steps_score`: `sum`, `mean`, `median`, `min` and `max` statistics for `steps_score` of all successful questions for this template
585
- - `steps`: `sum`, `mean`, `median`, `min` and `max` statistics for `steps` of all successful questions for this template. Includes:
586
- - `steps`: for each step type how many times it was executed
587
- - `once_per_sample`: how many times each step was executed, counted only once per question
588
- - `empty_results`: how many times the step was executed and returned empty results
589
- - `errors`: how many times the step was executed and resulted in error
590
- - `retrieval_context_recall`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_recall` for all successful questions in this template
591
- - `retrieval_context_precision`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_precision` for all successful questions in this template
592
- - `retrieval_context_f1`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_f1` for all successful questions in this template
574
+ - `sum`, `mean`, `median`, `min` and `max` statistics over all non-error responses for this template for the following metrics:
575
+ - `input_tokens`
576
+ - `output_tokens`
577
+ - `total_tokens`
578
+ - `elapsed_sec`
579
+ - `answer_recall`
580
+ - `answer_precision`
581
+ - `answer_f1`
582
+ - `answer_relevance`
583
+ - `steps_score`
584
+ - `retrieval_answer_recall`
585
+ - `retrieval_answer_precision`
586
+ - `retrieval_answer_f1`
587
+ - `retrieval_context_recall`
588
+ - `retrieval_context_precision`
589
+ - `retrieval_context_f1`
590
+ - `steps`: step execution counts, which include:
591
+ - `steps`: for each step type how many times it was executed
592
+ - `once_per_sample`: how many times each step was executed, counted only once per question
593
+ - `empty_results`: how many times the step was executed and returned empty results
594
+ - `errors`: how many times the step was executed and resulted in error
593
595
  - `micro`: statistics across questions, regardless of template. It includes:
594
596
  - `number_of_error_samples`: total number of questions that resulted in an error response
595
597
  - `number_of_success_samples`: total number of questions that resulted in a successful response
596
- - `input_tokens`: `sum`, `mean`, `median`, `min` and `max` for `input_tokens` of all successful questions
597
- - `output_tokens`: `sum`, `mean`, `median`, `min` and `max` for `output_tokens` of all successful questions
598
- - `total_tokens`: `sum`, `mean`, `median`, `min` and `max` for `total_tokens` of all successful questions
599
- - `elapsed_sec`: `sum`, `mean`, `median`, `min` and `max` for `elapsed_sec` of all successful questions
600
- - `answer_recall`: `sum`, `mean`, `median`, `min` and `max` for `answer_recall` of all successful questions
601
- - `answer_precision`: `sum`, `mean`, `median`, `min` and `max` for `answer_precision` of all successful questions
602
- - `answer_f1`: `sum`, `mean`, `median`, `min` and `max` for `answer_f1` of all successful questions
603
- - `answer_relevance`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_relevance` of all successful questions
604
- - `answer_relevance_cost`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_relevance_cost` of all successful questions
605
- - `retrieval_context_recall`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_recall` of all successful questions
606
- - `retrieval_context_precision`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_precision` of all successful questions
607
- - `retrieval_context_f1`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_f1` of all successful questions
608
- - `steps_score`: `sum`, `mean`, `median`, `min` and `max` for `steps_score` of all successful questions
609
- - `macro`: averages across templates, i.e., the mean of each metric per template, averaged. It includes:
610
- - `input_tokens`: `mean` for `input_tokens`
611
- - `output_tokens`: `mean` for `output_tokens`
612
- - `total_tokens`: `mean` for `total_tokens`
613
- - `elapsed_sec`: `mean` for `elapsed_sec`
614
- - `answer_recall`: `mean` for `answer_recall`
615
- - `answer_precision`: `mean` for `answer_precision`
616
- - `answer_f1`: `mean` for `answer_f1`
617
- - `answer_relevance`: `mean` for `answer_relevance`
618
- - `answer_relevance_cost`: `mean` for `answer_relevance_cost`
619
- - `retrieval_context_recall`: `mean` for `retrieval_context_recall`
620
- - `retrieval_context_precision`: `mean` for `retrieval_context_precision`
621
- - `retrieval_context_f1`: `mean` for `retrieval_context_f1`
622
- - `steps_score`: `mean` for `steps_score`
598
+ - `sum`, `mean`, `median`, `min` and `max` statistics over all non-error responses for the following metrics:
599
+ - `input_tokens`
600
+ - `output_tokens`
601
+ - `total_tokens`
602
+ - `elapsed_sec`
603
+ - `answer_recall`
604
+ - `answer_precision`
605
+ - `answer_f1`
606
+ - `answer_relevance`
607
+ - `answer_relevance_cost`
608
+ - `retrieval_answer_recall`
609
+ - `retrieval_answer_precision`
610
+ - `retrieval_answer_f1`
611
+ - `retrieval_context_recall`
612
+ - `retrieval_context_precision`
613
+ - `retrieval_context_f1`
614
+ - `steps_score`
615
+ - `macro`: averages across templates, i.e., for each metric, the mean of the per-template means. It includes means for the following metrics:
616
+ - `input_tokens`
617
+ - `output_tokens`
618
+ - `total_tokens`
619
+ - `elapsed_sec`
620
+ - `answer_recall`
621
+ - `answer_precision`
622
+ - `answer_f1`
623
+ - `answer_relevance`
624
+ - `answer_relevance_cost`
625
+ - `retrieval_answer_recall`
626
+ - `retrieval_answer_precision`
627
+ - `retrieval_answer_f1`
628
+ - `retrieval_context_recall`
629
+ - `retrieval_context_precision`
630
+ - `retrieval_context_f1`
631
+ - `steps_score`
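
As a rough illustration of how the three aggregation levels relate, the sketch below computes them for a single metric. It is not the package's `compute_aggregates` implementation: the `summarize` and `aggregate` helpers, the `status` and `template_id` field names, and the sample values are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean, median


def summarize(values):
    """Return sum, mean, median, min and max for a non-empty list of numbers."""
    return {
        "sum": sum(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


def aggregate(samples, metric):
    """Aggregate one metric per template, over all questions (micro),
    and across templates (macro). Hypothetical helper, for illustration only."""
    by_template = defaultdict(list)
    for sample in samples:
        # Assumption: error responses carry status == "error" and are skipped.
        if sample.get("status") == "error" or metric not in sample:
            continue
        by_template[sample["template_id"]].append(sample[metric])

    per_template = {t: summarize(vals) for t, vals in by_template.items()}

    # Micro: pool every non-error question, regardless of template.
    pooled = [v for vals in by_template.values() for v in vals]
    micro = summarize(pooled) if pooled else {}

    # Macro: mean of the per-template means, so each template weighs equally.
    macro = (
        {"mean": mean(stats["mean"] for stats in per_template.values())}
        if per_template
        else {}
    )
    return {"per_template": per_template, "micro": micro, "macro": macro}


samples = [
    {"template_id": "t1", "status": "success", "answer_f1": 1.0},
    {"template_id": "t1", "status": "success", "answer_f1": 0.5},
    {"template_id": "t1", "status": "success", "answer_f1": 0.0},
    {"template_id": "t2", "status": "success", "answer_f1": 1.0},
    {"template_id": "t2", "status": "error"},
]
result = aggregate(samples, "answer_f1")
print(result["micro"]["mean"])  # 0.625: mean over the four non-error questions
print(result["macro"]["mean"])  # 0.75: mean of the template means 0.5 and 1.0
```

The micro and macro means differ here because template `t1` contributes three questions to the pooled average but only one per-template mean to the macro average.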
623
632
 
624
633
  #### Example Aggregates
625
634
 
@@ -647,11 +656,11 @@ per_template:
647
656
  min: 1.0
648
657
  max: 1.0
649
658
  answer_relevance:
650
- min: 0.9
651
- max: 0.9
652
- mean: 0.9
653
- median: 0.9
654
- sum: 0.9
659
+ min: 0.9
660
+ max: 0.9
661
+ mean: 0.9
662
+ median: 0.9
663
+ sum: 0.9
655
664
  answer_relevance_cost:
656
665
  min: 0.0007
657
666
  max: 0.0007
@@ -1013,7 +1022,7 @@ The following metrics are based on the content of retrieved documents.
1013
1022
 
1014
1023
  #### Context Recall@k
1015
1024
 
1016
- The fraction of relevant items among the top *k* recommendations. It answers the question: "Of all items the user cares about, how many did we inclide in the first k spots?"
1025
+ The fraction of all relevant items that appear among the top *k* recommendations. It answers the question: "Of all items the user cares about, how many did we include in the first k spots?"
1017
1026
  * **Formula**:
1018
1027
  $`
1019
1028
  \frac{\text{Number of relevant items in top k}}{\text{Number of relevant items}}
@@ -0,0 +1,2 @@
1
+ from .aggregation import compute_aggregates
2
+ from .evaluation import run_evaluation