
graphrag-eval 5.1.0 (tar.gz) → 5.1.2 (tar.gz)


Files changed (18)

{graphrag_eval-5.1.0 → graphrag_eval-5.1.2}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.3
 Name: graphrag-eval
-Version: 5.1.0
+Version: 5.1.2
 Summary: For assessing question answering systems' final answers and intermediate steps, against a given set of questions, reference answers and steps.
 License: Apache-2.0
 Author: Philip Ganchev
@@ -24,8 +24,7 @@ Description-Content-Type: text/markdown
 # QA Evaluation
-This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used
-to reach the answer (such as orchestrated and executed steps), compared to the given reference steps.
+This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used to reach the answer (such as orchestrated and executed steps), compared to the given reference steps.
 ## License
@@ -72,24 +71,24 @@ We plan to improve CLI support in future releases.
 To evaluate answers and/or steps:
 1. Install this package: section [Install](#Installation)
-1. Format the corpus of questions and reference answers and/or steps: section [Reference Q&A Corpus](#reference-qa-corpus)
-1. Format the answers and/or steps you want to evaluate: section [Evaluation Target Corpus](#Evaluation-Target-Corpus)
+1. Format the dataset of questions and reference answers and/or steps: section [Reference Q&A Data](#Reference-qa-Data)
+1. Format the answers and/or steps you want to evaluate: section [Responses to evaluate](#Responses-to-evaluate)
 1. To evaluate answer relevance:
     1. Include `actual_answer` in the target data to evaluate
     1. Set environment variable `OPENAI_API_KEY` appropriately
 1. To evaluate answer correctness:
-    1. Include `reference_answer` in the reference corpus and `actual_answer` in the target data to evaluate
+    1. Include `reference_answer` in the reference dataset and `actual_answer` in the target data to evaluate
     1. Set environment variable `OPENAI_API_KEY` appropriately
 1. To evaluate steps:
-    1. Include `reference_steps` in the reference corpus and `actual_steps` in target data to evaluate
-1. Call the evaluation function with the reference corpus and target corpus: section [Example Usage Code](#Example-Usage-Code)
+    1. Include `reference_steps` in the reference data and `actual_steps` in target data to evaluate
+1. Call the evaluation function with the reference data and target data: section [Usage Code](#Usage-Code)
 1. Call the aggregation function with the evaluation results
 Answer evaluation (correctness and relevance) uses the LLM `openai/gpt-4o-mini`.
-### Reference Q&A Corpus
+### Reference Q&A Data
-A reference corpus is a list of templates, each of which contains:
+A reference dataset is a list of templates, each of which contains:
 - `template_id`: Unique template identifier
 - `questions`: A list of questions derived from this template, where each includes:
@@ -109,9 +108,9 @@ Each step includes:
 - `ordered`: (optional, defaults to `false`) For SPARQL query results, whether results order matters. `true` means that the actual result rows must be ordered as the reference result; `false` means that result rows are matched as a set.
 - `required_columns`: (optional) - required only for SPARQL query results; list of binding names, which are required for SPARQL query results to match
-#### Example Reference Corpus
+#### Reference Data
-The example corpus below illustrates a minimal but realistic Q&A dataset, showing two templates with associated questions and steps.
+The example data below illustrates a minimal but realistic Q&A dataset, showing two templates with associated questions and steps.
 ```yaml
 - template_id: list_all_transformers_within_Substation_SUBSTATION
@@ -277,9 +276,9 @@ The example corpus below illustrates a minimal but realistic Q&A dataset, showin
 The module is agnostic to the specific LLM agent implementation and model; it depends solely on the format of the response.
-### Evaluation Target Corpus
+### Responses to evaluate
-Below is an example response from the question-answering system for a single question (unless there is an error in answering: see [Example Target Input on Error](#example-target-input-on-error) below):
+Given a question, if the question-answering system successfully responds, to evaluate the response, call `run_evaluation()` with the response formatted as in the example below. (On the other hand, if an error occurs while generating a response, format it as in [Target Input on Error](#target-input-on-error).)
 ```json
 {
@@ -332,9 +331,9 @@ Below is an example response from the question-answering system for a single que
 }
 ```
-#### Example Target Input on Error
+#### Target Input on Error
-If an error occurs during generating a response to a question, the expected target input for evaluation is:
+If an error occurs while the question-answering system is generating a response, and you want to tally this error, the input to `run_evaluation()` should be like:
 ```json
 {
@@ -344,22 +343,22 @@ If an error occurs during generating a response to a question, the expected targ
 }
 ```
-### Example Usage Code
+### Usage Code
 ```python
 from graphrag_eval import run_evaluation, compute_aggregates
-reference_qas: list[dict] = [] # read your corpus
+reference_qas: list[dict] = [] # read your reference data
 chat_responses: dict = {} # call your implementation to get the response
 evaluation_results = run_evaluation(reference_qas, chat_responses)
 aggregates = compute_aggregates(evaluation_results)
 ```
-`evaluation_results` is a list of statistics for each question, as in section [Example Evaluation Results](#example-evaluation-results). The format is explained in section [Output Keys](#output-keys)
+`evaluation_results` is a list of statistics for each question, as in section [Evaluation Results](#Evaluation-results). The format is explained in section [Output Keys](#output-keys)
 If your chat responses contain actual answers, set your environment variable `OPENAI_API_KEY` before running the code above.
-### Example Evaluation Results
+### Evaluation Results
 The output is a list of statistics for each question from the reference Q&A dataset. Here is an example of statistics for one question:
@@ -584,69 +583,72 @@ All `actual_steps` with `name` "retrieval" contain:
 #### Aggregates Keys
-The `aggregates` object provides aggregated evaluation metrics.
-Aggregates are computed both per-template and overall, using micro and macro averaging strategies.
-These aggregates support analysis of agent quality, token efficiency, and execution performance.
+The `aggregates` object provides aggregated evaluation metrics. These aggregates support analysis of agent quality, token efficiency, and execution performance. Aggregates are computed:
+1. per question template, and
+1. over all questions in the dataset, using micro and macro averaging
 Aggregates are:
 - `per_template`: a dictionary mapping a template identifier to the following statistics:
   - `number_of_error_samples`: number of questions for this template, which resulted in error response
   - `number_of_success_samples`: number of questions for this template, which resulted in successful response
-  - `input_tokens`: `sum`, `mean`, `median`, `min` and `max` statistics for `input_tokens` of all successful questions for this template
-  - `output_tokens`: `sum`, `mean`, `median`, `min` and `max` statistics for `output_tokens` of all successful questions for this template
-  - `total_tokens`: `sum`, `mean`, `median`, `min` and `max` statistics for `total_tokens` of all successful questions for this template
-  - `elapsed_sec`: `sum`, `mean`, `median`, `min` and `max` statistics for `elapsed_sec` of all successful questions for this template
-  - `answer_recall`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_recall` of all successful questions for this template
-  - `answer_precision`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_precision` of all successful questions for this template
-  - `answer_f1`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_f1` of all successful questions for this template
-  - `answer_relevance`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_relevance` of all successful questions for this template
-  - `steps_score`: `sum`, `mean`, `median`, `min` and `max` statistics for `steps_score` of all successful questions for this template
-  - `steps`: `sum`, `mean`, `median`, `min` and `max` statistics for `steps` of all successful questions for this template. Includes:
-    - `steps`: for each step type how many times it was executed
-    - `once_per_sample`: how many times each step was executed, counted only once per question
-    - `empty_results`: how many times the step was executed and returned empty results
-    - `errors`: how many times the step was executed and resulted in error
-  - `retrieval_answer_recall`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_answer_recall` for all successful questions in this template
-  - `retrieval_answer_precision`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_answer_precision` for all successful questions in this template
-  - `retrieval_answer_f1`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_answer_f1` for all successful questions in this template
-  - `retrieval_context_recall`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_recall` for all successful questions in this template
-  - `retrieval_context_precision`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_precision` for all successful questions in this template
-  - `retrieval_context_f1`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_f1` for all successful questions in this template
+  - `sum`, `mean`, `median`, `min` and `max` statistics over all non-error responses for this template for the following metrics:
+    - `input_tokens`
+    - `output_tokens`
+    - `total_tokens`
+    - `elapsed_sec`
+    - `answer_recall`
+    - `answer_precision`
+    - `answer_f1`
+    - `answer_relevance`
+    - `steps_score`
+    - `retrieval_answer_recall`
+    - `retrieval_answer_precision`
+    - `retrieval_answer_f1`
+    - `retrieval_context_recall`
+    - `retrieval_context_precision`
+    - `retrieval_context_f1`
+    - `steps`: includes:
+      - `steps`: for each step type how many times it was executed
+      - `once_per_sample`: how many times each step was executed, counted only once per question
+      - `empty_results`: how many times the step was executed and returned empty results
+      - `errors`: how many times the step was executed and resulted in error
 - `micro`: statistics across questions, regardless of template. It includes:
   - `number_of_error_samples`: total number of questions, which resulted in error response
   - `number_of_success_samples`: total number of questions, which resulted in successful response
-  - `input_tokens`: `sum`, `mean`, `median`, `min` and `max` for `input_tokens` of all successful questions
-  - `output_tokens`: `sum`, `mean`, `median`, `min` and `max` for `output_tokens` of all successful questions
-  - `total_tokens`: `sum`, `mean`, `median`, `min` and `max` for `total_tokens` of all successful questions
-  - `elapsed_sec`: `sum`, `mean`, `median`, `min` and `max` for `elapsed_sec` of all successful questions
-  - `answer_recall`: `sum`, `mean`, `median`, `min` and `max` for `answer_recall` of all successful questions
-  - `answer_precision`: `sum`, `mean`, `median`, `min` and `max` for `answer_precision` of all successful questions
-  - `answer_f1`: `sum`, `mean`, `median`, `min` and `max` for `answer_f1` of all successful questions
-  - `answer_relevance`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_relevance` of all successful questions
-  - `answer_relevance_cost`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_relevance_cost` of all successful questions
-  - `retrieval_answer_recall`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_answer_recall` of all successful questions
-  - `retrieval_answer_precision`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_answer_precision` of all successful questions
-  - `retrieval_answer_f1`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_answer_f1` of all successful questions
-  - `retrieval_context_recall`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_recall` of all successful questions
-  - `retrieval_context_precision`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_precision` of all successful questions
-  - `retrieval_context_f1`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_f1` of all successful questions
-  - `steps_score`: `sum`, `mean`, `median`, `min` and `max` for `steps_score` of all successful questions
-- `macro`: averages across templates, i.e., the mean of each metric per template, averaged. It includes:
-  - `input_tokens`: `mean` for `input_tokens`
-  - `output_tokens`: `mean` for `output_tokens`
-  - `total_tokens`: `mean` for `total_tokens`
-  - `elapsed_sec`: `mean` for `elapsed_sec`
-  - `answer_recall`: `mean` for `answer_recall`
-  - `answer_precision`: `mean` for `answer_precision`
-  - `answer_f1`: `mean` for `answer_f1`
-  - `answer_relevance`: `mean` for `answer_relevance`
-  - `answer_relevance_cost`: `mean` for `answer_relevance_cost`
-  - `retrieval_answer_recall`: `mean` for `retrieval_answer_recall`
-  - `retrieval_answer_precision`: `mean` for `retrieval_answer_precision`
-  - `retrieval_answer_f1`: `mean` for `retrieval_answer_f1`
-  - `retrieval_context_recall`: `mean` for `retrieval_context_recall`
-  - `retrieval_context_precision`: `mean` for `retrieval_context_precision`
-  - `retrieval_context_f1`: `mean` for `retrieval_context_f1`
-  - `steps_score`: `mean` for `steps_score`
+  - `sum`, `mean`, `median`, `min` and `max` statistics over all non-error responses for the following metrics:
+    - `input_tokens`
+    - `output_tokens`
+    - `total_tokens`
+    - `elapsed_sec`
+    - `answer_recall`
+    - `answer_precision`
+    - `answer_f1`
+    - `answer_relevance`
+    - `answer_relevance_cost`
+    - `retrieval_answer_recall`
+    - `retrieval_answer_precision`
+    - `retrieval_answer_f1`
+    - `retrieval_context_recall`
+    - `retrieval_context_precision`
+    - `retrieval_context_f1`
+    - `steps_score`
+- `macro`: averages across templates, i.e., the mean of each metric per template, averaged. It includes the following means:
+  - `input_tokens`
+  - `output_tokens`
+  - `total_tokens`
+  - `elapsed_sec`
+  - `answer_recall`
+  - `answer_precision`
+  - `answer_f1`
+  - `answer_relevance`
+  - `answer_relevance_cost`
+  - `retrieval_answer_recall`
+  - `retrieval_answer_precision`
+  - `retrieval_answer_f1`
+  - `retrieval_context_recall`
+  - `retrieval_context_precision`
+  - `retrieval_context_f1`
+  - `steps_score`
 #### Example Aggregates
@@ -674,11 +676,11 @@ per_template:
       min: 1.0
       max: 1.0
     answer_relevance:
-        min: 0.9
-        max: 0.9
-        mean: 0.9
-        median: 0.9
-        sum: 0.9
+      min: 0.9
+      max: 0.9
+      mean: 0.9
+      median: 0.9
+      sum: 0.9
     answer_relevance_cost:
       min: 0.0007
       max: 0.0007
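
To make the difference between the `micro` and `macro` aggregates described in the README text above concrete, here is a minimal sketch. It is not the package's implementation: `results`, `micro_mean` and `macro_mean` are hypothetical names, and each per-question record is reduced to a template identifier and a single metric.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-question results: template "A" has three questions, "B" has one.
results = [
    {"template_id": "A", "answer_f1": 1.0},
    {"template_id": "A", "answer_f1": 1.0},
    {"template_id": "A", "answer_f1": 1.0},
    {"template_id": "B", "answer_f1": 0.0},
]


def micro_mean(rows, metric):
    """Mean over all questions, regardless of template."""
    return mean(row[metric] for row in rows)


def macro_mean(rows, metric):
    """Mean of the per-template means, so every template counts equally."""
    by_template = defaultdict(list)
    for row in rows:
        by_template[row["template_id"]].append(row[metric])
    return mean(mean(values) for values in by_template.values())


micro = micro_mean(results, "answer_f1")  # 3 of 4 questions score 1.0 -> 0.75
macro = macro_mean(results, "answer_f1")  # (1.0 + 0.0) / 2 -> 0.5
```

With unbalanced templates the two averages diverge: the micro mean is dominated by the template with more questions, while the macro mean weights each template equally.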

{graphrag_eval-5.1.0 → graphrag_eval-5.1.2}/README.md RENAMED

@@ -4,8 +4,7 @@
 # QA Evaluation
-This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used
-to reach the answer (such as orchestrated and executed steps), compared to the given reference steps.
+This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used to reach the answer (such as orchestrated and executed steps), compared to the given reference steps.
 ## License
@@ -52,24 +51,24 @@ We plan to improve CLI support in future releases.
 To evaluate answers and/or steps:
 1. Install this package: section [Install](#Installation)
-1. Format the corpus of questions and reference answers and/or steps: section [Reference Q&A Corpus](#reference-qa-corpus)
-1. Format the answers and/or steps you want to evaluate: section [Evaluation Target Corpus](#Evaluation-Target-Corpus)
+1. Format the dataset of questions and reference answers and/or steps: section [Reference Q&A Data](#Reference-qa-Data)
+1. Format the answers and/or steps you want to evaluate: section [Responses to evaluate](#Responses-to-evaluate)
 1. To evaluate answer relevance:
     1. Include `actual_answer` in the target data to evaluate
     1. Set environment variable `OPENAI_API_KEY` appropriately
 1. To evaluate answer correctness:
-    1. Include `reference_answer` in the reference corpus and `actual_answer` in the target data to evaluate
+    1. Include `reference_answer` in the reference dataset and `actual_answer` in the target data to evaluate
     1. Set environment variable `OPENAI_API_KEY` appropriately
 1. To evaluate steps:
-    1. Include `reference_steps` in the reference corpus and `actual_steps` in target data to evaluate
-1. Call the evaluation function with the reference corpus and target corpus: section [Example Usage Code](#Example-Usage-Code)
+    1. Include `reference_steps` in the reference data and `actual_steps` in target data to evaluate
+1. Call the evaluation function with the reference data and target data: section [Usage Code](#Usage-Code)
 1. Call the aggregation function with the evaluation results
 Answer evaluation (correctness and relevance) uses the LLM `openai/gpt-4o-mini`.
-### Reference Q&A Corpus
+### Reference Q&A Data
-A reference corpus is a list of templates, each of which contains:
+A reference dataset is a list of templates, each of which contains:
 - `template_id`: Unique template identifier
 - `questions`: A list of questions derived from this template, where each includes:
@@ -89,9 +88,9 @@ Each step includes:
 - `ordered`: (optional, defaults to `false`) For SPARQL query results, whether results order matters. `true` means that the actual result rows must be ordered as the reference result; `false` means that result rows are matched as a set.
 - `required_columns`: (optional) - required only for SPARQL query results; list of binding names, which are required for SPARQL query results to match
-#### Example Reference Corpus
+#### Reference Data
-The example corpus below illustrates a minimal but realistic Q&A dataset, showing two templates with associated questions and steps.
+The example data below illustrates a minimal but realistic Q&A dataset, showing two templates with associated questions and steps.
 ```yaml
 - template_id: list_all_transformers_within_Substation_SUBSTATION
@@ -257,9 +256,9 @@ The example corpus below illustrates a minimal but realistic Q&A dataset, showin
 The module is agnostic to the specific LLM agent implementation and model; it depends solely on the format of the response.
-### Evaluation Target Corpus
+### Responses to evaluate
-Below is an example response from the question-answering system for a single question (unless there is an error in answering: see [Example Target Input on Error](#example-target-input-on-error) below):
+Given a question, if the question-answering system successfully responds, to evaluate the response, call `run_evaluation()` with the response formatted as in the example below. (On the other hand, if an error occurs while generating a response, format it as in [Target Input on Error](#target-input-on-error).)
 ```json
 {
@@ -312,9 +311,9 @@ Below is an example response from the question-answering system for a single que
 }
 ```
-#### Example Target Input on Error
+#### Target Input on Error
-If an error occurs during generating a response to a question, the expected target input for evaluation is:
+If an error occurs while the question-answering system is generating a response, and you want to tally this error, the input to `run_evaluation()` should be like:
 ```json
 {
@@ -324,22 +323,22 @@ If an error occurs during generating a response to a question, the expected targ
 }
 ```
-### Example Usage Code
+### Usage Code
 ```python
 from graphrag_eval import run_evaluation, compute_aggregates
-reference_qas: list[dict] = [] # read your corpus
+reference_qas: list[dict] = [] # read your reference data
 chat_responses: dict = {} # call your implementation to get the response
 evaluation_results = run_evaluation(reference_qas, chat_responses)
 aggregates = compute_aggregates(evaluation_results)
 ```
-`evaluation_results` is a list of statistics for each question, as in section [Example Evaluation Results](#example-evaluation-results). The format is explained in section [Output Keys](#output-keys)
+`evaluation_results` is a list of statistics for each question, as in section [Evaluation Results](#Evaluation-results). The format is explained in section [Output Keys](#output-keys)
 If your chat responses contain actual answers, set your environment variable `OPENAI_API_KEY` before running the code above.
-### Example Evaluation Results
+### Evaluation Results
 The output is a list of statistics for each question from the reference Q&A dataset. Here is an example of statistics for one question:
@@ -564,69 +563,72 @@ All `actual_steps` with `name` "retrieval" contain:
 #### Aggregates Keys
-The `aggregates` object provides aggregated evaluation metrics.
-Aggregates are computed both per-template and overall, using micro and macro averaging strategies.
-These aggregates support analysis of agent quality, token efficiency, and execution performance.
+The `aggregates` object provides aggregated evaluation metrics. These aggregates support analysis of agent quality, token efficiency, and execution performance. Aggregates are computed:
+1. per question template, and
+1. over all questions in the dataset, using micro and macro averaging
 Aggregates are:
 - `per_template`: a dictionary mapping a template identifier to the following statistics:
   - `number_of_error_samples`: number of questions for this template, which resulted in error response
   - `number_of_success_samples`: number of questions for this template, which resulted in successful response
-  - `input_tokens`: `sum`, `mean`, `median`, `min` and `max` statistics for `input_tokens` of all successful questions for this template
-  - `output_tokens`: `sum`, `mean`, `median`, `min` and `max` statistics for `output_tokens` of all successful questions for this template
-  - `total_tokens`: `sum`, `mean`, `median`, `min` and `max` statistics for `total_tokens` of all successful questions for this template
-  - `elapsed_sec`: `sum`, `mean`, `median`, `min` and `max` statistics for `elapsed_sec` of all successful questions for this template
-  - `answer_recall`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_recall` of all successful questions for this template
-  - `answer_precision`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_precision` of all successful questions for this template
-  - `answer_f1`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_f1` of all successful questions for this template
-  - `answer_relevance`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_relevance` of all successful questions for this template
-  - `steps_score`: `sum`, `mean`, `median`, `min` and `max` statistics for `steps_score` of all successful questions for this template
-  - `steps`: `sum`, `mean`, `median`, `min` and `max` statistics for `steps` of all successful questions for this template. Includes:
-    - `steps`: for each step type how many times it was executed
-    - `once_per_sample`: how many times each step was executed, counted only once per question
-    - `empty_results`: how many times the step was executed and returned empty results
-    - `errors`: how many times the step was executed and resulted in error
-  - `retrieval_answer_recall`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_answer_recall` for all successful questions in this template
-  - `retrieval_answer_precision`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_answer_precision` for all successful questions in this template
-  - `retrieval_answer_f1`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_answer_f1` for all successful questions in this template
-  - `retrieval_context_recall`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_recall` for all successful questions in this template
-  - `retrieval_context_precision`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_precision` for all successful questions in this template
-  - `retrieval_context_f1`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_f1` for all successful questions in this template
+  - `sum`, `mean`, `median`, `min` and `max` statistics over all non-error responses for this template for the following metrics:
+    - `input_tokens`
+    - `output_tokens`
+    - `total_tokens`
+    - `elapsed_sec`
+    - `answer_recall`
+    - `answer_precision`
+    - `answer_f1`
+    - `answer_relevance`
+    - `steps_score`
+    - `retrieval_answer_recall`
+    - `retrieval_answer_precision`
+    - `retrieval_answer_f1`
+    - `retrieval_context_recall`
+    - `retrieval_context_precision`
+    - `retrieval_context_f1`
+    - `steps`: includes:
+      - `steps`: for each step type how many times it was executed
+      - `once_per_sample`: how many times each step was executed, counted only once per question
+      - `empty_results`: how many times the step was executed and returned empty results
+      - `errors`: how many times the step was executed and resulted in error
 - `micro`: statistics across questions, regardless of template. It includes:
   - `number_of_error_samples`: total number of questions, which resulted in error response
   - `number_of_success_samples`: total number of questions, which resulted in successful response
-  - `input_tokens`: `sum`, `mean`, `median`, `min` and `max` for `input_tokens` of all successful questions
-  - `output_tokens`: `sum`, `mean`, `median`, `min` and `max` for `output_tokens` of all successful questions
-  - `total_tokens`: `sum`, `mean`, `median`, `min` and `max` for `total_tokens` of all successful questions
-  - `elapsed_sec`: `sum`, `mean`, `median`, `min` and `max` for `elapsed_sec` of all successful questions
-  - `answer_recall`: `sum`, `mean`, `median`, `min` and `max` for `answer_recall` of all successful questions
-  - `answer_precision`: `sum`, `mean`, `median`, `min` and `max` for `answer_precision` of all successful questions
-  - `answer_f1`: `sum`, `mean`, `median`, `min` and `max` for `answer_f1` of all successful questions
-  - `answer_relevance`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_relevance` of all successful questions
-  - `answer_relevance_cost`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_relevance_cost` of all successful questions
-  - `retrieval_answer_recall`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_answer_recall` of all successful questions
-  - `retrieval_answer_precision`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_answer_precision` of all successful questions
-  - `retrieval_answer_f1`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_answer_f1` of all successful questions
-  - `retrieval_context_recall`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_recall` of all successful questions
-  - `retrieval_context_precision`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_precision` of all successful questions
-  - `retrieval_context_f1`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_f1` of all successful questions
-  - `steps_score`: `sum`, `mean`, `median`, `min` and `max` for `steps_score` of all successful questions
-- `macro`: averages across templates, i.e., the mean of each metric per template, averaged. It includes:
-  - `input_tokens`: `mean` for `input_tokens`
-  - `output_tokens`: `mean` for `output_tokens`
-  - `total_tokens`: `mean` for `total_tokens`
-  - `elapsed_sec`: `mean` for `elapsed_sec`
-  - `answer_recall`: `mean` for `answer_recall`
-  - `answer_precision`: `mean` for `answer_precision`
-  - `answer_f1`: `mean` for `answer_f1`
-  - `answer_relevance`: `mean` for `answer_relevance`
-  - `answer_relevance_cost`: `mean` for `answer_relevance_cost`
-  - `retrieval_answer_recall`: `mean` for `retrieval_answer_recall`
-  - `retrieval_answer_precision`: `mean` for `retrieval_answer_precision`
-  - `retrieval_answer_f1`: `mean` for `retrieval_answer_f1`
-  - `retrieval_context_recall`: `mean` for `retrieval_context_recall`
-  - `retrieval_context_precision`: `mean` for `retrieval_context_precision`
-  - `retrieval_context_f1`: `mean` for `retrieval_context_f1`
-  - `steps_score`: `mean` for `steps_score`
+  - `sum`, `mean`, `median`, `min` and `max` statistics over all non-error responses for the following metrics:
+    - `input_tokens`
+    - `output_tokens`
+    - `total_tokens`
+    - `elapsed_sec`
+    - `answer_recall`
+    - `answer_precision`
+    - `answer_f1`
+    - `answer_relevance`
+    - `answer_relevance_cost`
+    - `retrieval_answer_recall`
+    - `retrieval_answer_precision`
+    - `retrieval_answer_f1`
+    - `retrieval_context_recall`
+    - `retrieval_context_precision`
+    - `retrieval_context_f1`
+    - `steps_score`
+- `macro`: averages across templates, i.e., the mean of each metric per template, averaged. It includes the following means:
+  - `input_tokens`
+  - `output_tokens`
+  - `total_tokens`
+  - `elapsed_sec`
+  - `answer_recall`
+  - `answer_precision`
+  - `answer_f1`
+  - `answer_relevance`
+  - `answer_relevance_cost`
+  - `retrieval_answer_recall`
+  - `retrieval_answer_precision`
+  - `retrieval_answer_f1`
+  - `retrieval_context_recall`
+  - `retrieval_context_precision`
+  - `retrieval_context_f1`
+  - `steps_score`
 #### Example Aggregates
@@ -654,11 +656,11 @@ per_template:
       min: 1.0
       max: 1.0
     answer_relevance:
-        min: 0.9
-        max: 0.9
-        mean: 0.9
-        median: 0.9
-        sum: 0.9
+      min: 0.9
+      max: 0.9
+      mean: 0.9
+      median: 0.9
+      sum: 0.9
     answer_relevance_cost:
       min: 0.0007
       max: 0.0007
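
The README text in this diff says that for SPARQL query results the `ordered` flag decides whether result rows must appear in the reference order or are matched as a set. The sketch below illustrates that distinction only. It assumes rows are modeled as tuples, and `rows_match` is a hypothetical helper, not a function of graphrag-eval; the package's actual matching (including its handling of `required_columns`) may differ.

```python
def rows_match(reference, actual, ordered=False):
    """Compare result rows either in sequence or as an unordered set."""
    if ordered:
        # Row order matters: compare position by position.
        return list(reference) == list(actual)
    # Row order is irrelevant: compare as sets of rows.
    return set(reference) == set(actual)


# Hypothetical single-column result rows.
reference_rows = [("T1",), ("T2",)]
actual_rows = [("T2",), ("T1",)]

unordered_ok = rows_match(reference_rows, actual_rows)              # same rows, order ignored
ordered_ok = rows_match(reference_rows, actual_rows, ordered=True)  # same rows, different order
```

Here `unordered_ok` is `True` and `ordered_ok` is `False`, which mirrors the documented default of `ordered: false`.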

graphrag_eval-5.1.2/graphrag_eval/__init__.py ADDED

@@ -0,0 +1,2 @@
+from .aggregation import compute_aggregates
+from .evaluation import run_evaluation

{graphrag_eval-5.1.0 → graphrag_eval-5.1.2}/graphrag_eval/aggregation.py RENAMED

@@ -1,8 +1,8 @@
 import json
 from collections import defaultdict
+from collections.abc import Sequence
 from statistics import mean, median
-from typing import Any, Iterable
+from typing import Any, Collection, Iterable
 METRICS = [
     "answer_recall",
@@ -32,7 +32,7 @@ STEPS_METRICS = {
         "retrieval_context_f1_cost",
     ]
 }
-PROTECTED_METRICS = [
+RETAINED_METRICS = [
     "input_tokens",
     "output_tokens",
     "total_tokens",
@@ -50,124 +50,96 @@ def stats_for_series(values: Iterable[int | float]) -> dict[str, float]:
     }
-def update_step_metrics_per_template(
-    sample: dict,
-    step_metrics_per_template: dict,
-    template_id: str
+def update_step_metrics(
+    sample: dict[str, Any],
+    template_step_metrics: dict[str, list[float]],
 ):
     for step in sample.get("actual_steps", []):
-        if step["name"] in STEPS_METRICS:
-            for metric in STEPS_METRICS[step["name"]]:
-                value = step.get(metric)
-                if value is not None:
-                    step_metrics_per_template[template_id][metric].append(value)
+        for metric in STEPS_METRICS.get(step["name"], []):
+            value = step.get(metric)
+            if value is not None:
+                template_step_metrics[metric].append(value)
-def update_stats_per_template(
-    sample: dict,
-    stats_per_template: dict,
-    template_id: str
+def update_stats(
+    sample: dict[str, float | int],
+    template_stats: dict[str, list[float | int]],
 ):
     for metric in METRICS:
         value = sample.get(metric)
         if value is not None:
-            stats_per_template[template_id][metric].append(value)
+            template_stats[metric].append(value)
-def update_steps_summary_per_template(
-    sample: dict,
-    steps_summary_per_template: dict,
-    template_id: str
+def update_steps_summary(
+    sample: dict[str, Any],
+    template_steps_summary: dict,
 ):
     seen = set()
     for step in sample.get("actual_steps", []):
         name = step["name"]
-        template_steps_summary = steps_summary_per_template[template_id]
         template_steps_summary["total"][name] += 1
         if step["status"] == "error":
             template_steps_summary["errors"][name] += 1
         if name not in seen:
             seen.add(name)
             template_steps_summary["once_per_sample"][name] += 1
         if step["status"] != "error":
             try:
                 res = json.loads(step["output"])
-                if "results" in res and "bindings" in res["results"]:
+                if isinstance(res, dict) and "results" in res and "bindings" in res["results"]:
                     if not res["results"]["bindings"]:
                         template_steps_summary["empty_results"][name] += 1
             except json.decoder.JSONDecodeError:
                 pass
-def compute_aggregates(samples: list[dict]) -> dict:
-    number_of_samples_per_template_by_status = defaultdict(lambda: defaultdict(int))
-    stats_per_template = defaultdict(lambda: defaultdict(list))
-    steps_summary_per_template = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
-    step_metrics_per_template = defaultdict(lambda: defaultdict(list))
-    # Compute per-template stats
-    templates_ids = set()
-    for sample in samples:
-        template_id = sample["template_id"]
-        templates_ids.add(template_id)
-        if "error" in sample:
-            number_of_samples_per_template_by_status[template_id]["error"] += 1
-            continue
-        number_of_samples_per_template_by_status[template_id]["success"] += 1
-        update_stats_per_template(sample, stats_per_template, template_id)
-        update_steps_summary_per_template(
-            sample,
-            steps_summary_per_template,
-            template_id
-        )
-        update_step_metrics_per_template(
-            sample,
-            step_metrics_per_template,
-            template_id
-        )
-    summary = {"per_template": {}}
+def compute_per_template_stats(
+    templates_ids: Collection[str],
+    number_of_samples_per_template_by_status: dict[str, dict[str, int]],
+    stats_per_template: dict[str, dict[str, Sequence[int]]],
+    steps_summary_per_template: dict[str, dict[str, dict[str, int]]],
+    step_metrics_per_template: dict[str, dict[str, Sequence[int]]],
+) -> dict[str, dict[str, Any]]:
+    summary = {}
     # Add per-template stats
     for template_id in templates_ids:
-        template_summary: dict[str, Any] = {
-            "number_of_error_samples": number_of_samples_per_template_by_status[template_id]["error"],
-            "number_of_success_samples": number_of_samples_per_template_by_status[template_id]["success"],
+        n_by_status = number_of_samples_per_template_by_status[template_id]
+        summary[template_id] = {
+            "number_of_error_samples": n_by_status["error"],
+            "number_of_success_samples": n_by_status["success"],
         }
+        for metric in METRICS:
+            series = stats_per_template[template_id].get(metric, [])
+            if series or metric in RETAINED_METRICS:
+                summary[template_id][metric] = stats_for_series(series)
         steps_summary = {
             k1: {k2: v2 for k2, v2 in v1.items()}
             for k1, v1 in steps_summary_per_template[template_id].items()
         }
         if steps_summary:
-            template_summary.update({"steps": steps_summary})
-        for metric in METRICS:
-            results_for_template = stats_per_template[template_id]
-            series = results_for_template.get(metric, [])
-            if series or metric in PROTECTED_METRICS:
-                template_summary[metric] = stats_for_series(series)
+            summary[template_id]["steps"] = steps_summary
         # Add step metrics for the template
         template_step_metrics = {}
         for metric, values in step_metrics_per_template[template_id].items():
             template_step_metrics[metric] = stats_for_series(values)
         if template_step_metrics:
-            template_summary["steps"].update(template_step_metrics)
+            summary[template_id]["steps"].update(template_step_metrics)
+    return summary
-        summary["per_template"][template_id] = template_summary
-    # Add micro stats
-    values_ = number_of_samples_per_template_by_status.values()
-    summary["micro"] = {
-        "number_of_error_samples": sum(
-            values["error"] for values in values_
-        ),
-        "number_of_success_samples": sum(
-            values["success"] for values in values_
-        ),
-    }
+def compute_micro_stats(
+    number_of_samples_per_template_by_status,
+    stats_per_template,
+    step_metrics_per_template
+) -> dict:
+    values = number_of_samples_per_template_by_status.values()
+    micro_summary = defaultdict(dict, {
+        "number_of_error_samples": sum(v["error"] for v in values),
+        "number_of_success_samples": sum(v["success"] for v in values)
+    })
     for metric in METRICS:
         series = [
             i
@@ -175,42 +147,76 @@ def compute_aggregates(samples: list[dict]) -> dict:
             for i in values[metric]
             if values.get(metric) is not None
         ]
-        if series or metric in PROTECTED_METRICS:
-            summary["micro"][metric] = stats_for_series(series)
+        if series or metric in RETAINED_METRICS:
+            micro_summary[metric] = stats_for_series(series)
     # Add micro step metrics
     micro_step_metrics = defaultdict(list)
     for template_metrics in step_metrics_per_template.values():
         for metric, values in template_metrics.items():
             micro_step_metrics[metric].extend(values)
-    step_metrics = {
-        metric: stats_for_series(values)
-        for metric, values in micro_step_metrics.items()
-    }
-    summary["micro"].update(step_metrics)
+    for metric, values in micro_step_metrics.items():
+        micro_summary[metric] = stats_for_series(values)
+    return dict(micro_summary)
-    # Add macro stats
-    summary["macro"] = {}
+def compute_macro_stats(
+    summary_per_template: dict[str, dict[str, dict[str, Any]]]
+) -> dict:
+    macro_summary = defaultdict(dict)
     for metric in METRICS:
         means = [
             values[metric]["mean"]
-            for template_id, values in summary["per_template"].items()
+            for values in summary_per_template.values()
             if values.get(metric) is not None
         ]
-        if means or metric in PROTECTED_METRICS:
-            summary["macro"][metric] = {"mean": mean(means) if means else 0}
+        if means or metric in RETAINED_METRICS:
+            macro_summary[metric]["mean"] = mean(means or [0])
     # Add macro step metrics
     macro_step_metrics = defaultdict(list)
-    for template_id, template_summary in summary["per_template"].items():
-        if "steps" in template_summary:
-            for metric, stats in template_summary["steps"].items():
-                if "mean" in stats:
-                    macro_step_metrics[metric].append(stats["mean"])
-    step_metrics = {
-        metric: {"mean": mean(values) if values else 0}
-        for metric, values in macro_step_metrics.items()
-    }
-    summary["macro"].update(step_metrics)
+    for template_summary in summary_per_template.values():
+        for metric, stats in template_summary.get("steps", {}).items():
+            if "mean" in stats:
+                macro_step_metrics[metric].append(stats["mean"])
+    for metric, values in macro_step_metrics.items():
+        macro_summary[metric]["mean"] = mean(values or [0])
+    return dict(macro_summary)
+def compute_aggregates(samples: list[dict]) -> dict:
+    number_of_samples_per_template_by_status = defaultdict(lambda: defaultdict(int))
+    stats_per_template = defaultdict(lambda: defaultdict(list))
+    steps_summary_per_template = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
+    step_metrics_per_template = defaultdict(lambda: defaultdict(list))
+    # Compute per-template stats
+    templates_ids = set()
+    for sample in samples:
+        template_id = sample["template_id"]
+        templates_ids.add(template_id)
+        if "error" in sample:
+            number_of_samples_per_template_by_status[template_id]["error"] += 1
+            continue
+        number_of_samples_per_template_by_status[template_id]["success"] += 1
+        update_stats(sample, stats_per_template[template_id])
+        update_steps_summary(sample, steps_summary_per_template[template_id])
+        update_step_metrics(sample, step_metrics_per_template[template_id])
+    summary = {
+        "per_template": compute_per_template_stats(
+            templates_ids,
+            number_of_samples_per_template_by_status,
+            stats_per_template,
+            steps_summary_per_template,
+            step_metrics_per_template,
+        ),
+        "micro": compute_micro_stats(
+            number_of_samples_per_template_by_status,
+            stats_per_template,
+            step_metrics_per_template
+        )
+    }
+    summary["macro"] = compute_macro_stats(summary["per_template"])
     return summary
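
One behavioral change in this file is the `isinstance(res, dict)` guard added in `update_steps_summary`. A step output that parses as valid JSON but is not an object (for example a bare number) would otherwise raise `TypeError` at the `"results" in res` test, which the `JSONDecodeError` handler does not catch. The sketch below isolates that check in a hypothetical helper, `has_empty_bindings`, which is not part of the package:

```python
import json


def has_empty_bindings(output: str) -> bool:
    """Hypothetical helper: True if output is SPARQL-results JSON with no bindings."""
    try:
        res = json.loads(output)
    except json.JSONDecodeError:
        # Non-JSON output is ignored, as in the diffed code.
        return False
    # The isinstance check keeps non-object JSON (numbers, lists, strings)
    # from reaching the membership and indexing operations below.
    if isinstance(res, dict) and "results" in res and "bindings" in res["results"]:
        return not res["results"]["bindings"]
    return False


print(has_empty_bindings('{"results": {"bindings": []}}'))
print(has_empty_bindings("42"))
```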

{graphrag_eval-5.1.0 → graphrag_eval-5.1.2}/graphrag_eval/answer_correctness.py RENAMED

@@ -4,6 +4,8 @@ from pathlib import Path
 from openai import OpenAI
 from tqdm import tqdm
+from graphrag_eval.util import compute_f1
 IN_FILE_PATH = "../data/data-1.tsv"
 PROMPT_FILE_PATH = Path(__file__).parent / "prompts" / "template.md"
@@ -21,14 +23,11 @@ def compute_recall_precision_f1(
 ) -> tuple[float | None, float | None, float | None]:
     recall = None
     precision = None
-    f1 = None
     if n_true_pos is not None and n_pos:
         recall = n_true_pos / n_pos
     if n_true_pos is not None and n_pred_pos:
         precision = n_true_pos / n_pred_pos
-    if precision is not None and recall is not None and precision + recall > 0:
-        f1 = 2 * (precision * recall) / (precision + recall)
-    return recall, precision, f1
+    return recall, precision, compute_f1(recall, precision)
 def extract_response_values(
@@ -41,20 +40,20 @@ def extract_response_values(
         return None, None, None, "", msg
     vals = vals[:4]
     try:
-        n_ref, n_target, n_matching = map(int, vals[:3])
+        n_ref, n_actual, n_matching = map(int, vals[:3])
     except ValueError:
-        msg = f"Non-int value: {response}"
+        msg = f"Claims counts should be ints: {vals}"
         return None, None, None, vals[3], msg
     if any([
         n_ref < 1,
-        n_target < 1,
+        n_actual < 1,
         n_matching < 0,
         n_matching > n_ref,
-        n_matching > n_target
+        n_matching > n_actual
     ]):
-        msg = f"Invalid int values: {n_ref}\t{n_target}\t{n_matching}"
+        msg = f"Invalid claims counts combination: {n_ref}\t{n_actual}\t{n_matching}"
         return None, None, None, vals[3], msg
-    return n_ref, n_target, n_matching, vals[3], ""
+    return n_ref, n_actual, n_matching, vals[3], ""
 class AnswerCorrectnessEvaluator:
@@ -96,7 +95,7 @@ class AnswerCorrectnessEvaluator:
     def get_correctness_dict(
         self,
         reference: dict,
-        target: dict,
+        actual: dict,
     ):
         result = {}
         result["reference_answer"] = reference["reference_answer"]
@@ -104,7 +103,7 @@ class AnswerCorrectnessEvaluator:
         self.evaluate_answer(
             reference["question_text"],
             reference["reference_answer"],
-            target["actual_answer"],
+            actual["actual_answer"],
         )
         if error:
             result["answer_eval_error"] = error
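
The hunk above replaces the inline F1 computation with a call to `compute_f1` from `graphrag_eval.util`, whose body is not shown in this diff. Assuming it keeps the behavior of the removed lines, it would look roughly like the following sketch (the name `compute_f1_sketch` is hypothetical, chosen to avoid implying this is the library's actual code):

```python
from typing import Optional


def compute_f1_sketch(
    recall: Optional[float],
    precision: Optional[float],
) -> Optional[float]:
    # Mirrors the inline logic removed from compute_recall_precision_f1:
    # F1 is defined only when both inputs exist and their sum is positive.
    if precision is not None and recall is not None and precision + recall > 0:
        return 2 * (precision * recall) / (precision + recall)
    return None


print(compute_f1_sketch(0.5, 1.0))
print(compute_f1_sketch(None, 1.0))
```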

{graphrag_eval-5.1.0 → graphrag_eval-5.1.2}/graphrag_eval/evaluation.py RENAMED

@@ -1,4 +1,4 @@
-from .steps import get_steps_evaluation_result_dict
+from .steps.evaluation import get_steps_evaluation_result_dict
 def run_evaluation(

graphrag_eval-5.1.2/graphrag_eval/steps/__init__.py ADDED

File without changes

graphrag_eval-5.1.0/graphrag_eval/steps/__init__.py → graphrag_eval-5.1.2/graphrag_eval/steps/evaluation.py RENAMED

@@ -1,41 +1,49 @@
 import json
 from collections import defaultdict
+from typing import Any
+from collections.abc import Sequence
 from .retrieval_context_ids import recall_at_k
 from .sparql import compare_sparql_results
-def compare_steps_outputs(reference: dict, actual: dict) -> float:
-    ref_output = reference.get("output")
-    act_output = actual["output"]
-    assert ref_output, "Reference step output is mandatory"
-    if reference.get("output_media_type") == "application/sparql-results+json":
+Match = tuple[int, int, int, float]
+Step = dict[str, Any]
+StepsGroup = Sequence[Step]  # We will index into a group
+def compare_steps_outputs(reference_step: Step, actual_step: Step) -> float:
+    reference_output = reference_step.get("output")
+    actual_output = actual_step["output"]
+    assert reference_output, "Reference step output is mandatory"
+    reference_output_media_type = reference_step.get("output_media_type")
+    if reference_output_media_type == "application/sparql-results+json":
         return compare_sparql_results(
-            json.loads(ref_output),
-            json.loads(act_output),
-            reference["required_columns"],
-            reference.get("ordered", False),
+            json.loads(reference_output),
+            json.loads(actual_output),
+            reference_step["required_columns"],
+            reference_step.get("ordered", False),
         )
-    if reference.get("output_media_type") == "application/json":
-        return float(json.loads(ref_output) == json.loads(act_output))
-    if reference["name"] == actual["name"] == "retrieval":
-        ref_contexts_ids = [c["id"] for c in json.loads(ref_output)]
-        act_contexts_ids = [c["id"] for c in json.loads(act_output)]
-        k = actual["args"]["k"]
+    if reference_step.get("output_media_type") == "application/json":
+        return float(json.loads(reference_output) == json.loads(actual_output))
+    if reference_step["name"] == actual_step["name"] == "retrieval":
+        ref_contexts_ids = [c["id"] for c in json.loads(reference_output)]
+        act_contexts_ids = [c["id"] for c in json.loads(actual_output)]
+        k = actual_step["args"]["k"]
         return recall_at_k(ref_contexts_ids, act_contexts_ids, k)
-    return float(ref_output == act_output)
+    return float(reference_output == actual_output)
 def match_group_by_output(
-        reference_steps: list[list[dict]],
+        reference_groups: Sequence[StepsGroup],
         group_idx: int,
-        actual_steps: list[dict],
+        actual_steps: Sequence[Step],
         candidates_by_name: dict[str, list[int]],
-) -> list[tuple[int, int, int, float]]:
+) -> list[Match]:
     used_actual_indices = set()
     matches = []
-    reference_group = reference_steps[group_idx]
+    reference_group = reference_groups[group_idx]
     for reference_idx, reference_step in enumerate(reference_group):
         name = reference_step["name"]
         candidates = reversed(candidates_by_name.get(name, []))
@@ -52,8 +60,8 @@ def match_group_by_output(
 def collect_possible_matches_by_name_and_status(
-        group: list[dict],
-        actual_steps: list[dict],
+        group: StepsGroup,
+        actual_steps: Sequence[Step],
         search_upto: int,
 ) -> dict[str, list[int]]:
     group_by_name = defaultdict(list)
@@ -63,14 +71,14 @@ def collect_possible_matches_by_name_and_status(
         if actual_steps[j]["status"] == "success":
             group_by_name[name].append(j)
-    reference_names = {item["name"] for item in group}
+    reference_names = {step["name"] for step in group}
     return {name: group_by_name[name] for name in reference_names if name in group_by_name}
 def get_steps_matches(
-        reference_steps: list[list[dict]],
-        actual_steps: list[dict],
-) -> list[tuple[int, int, int, float]]:
+        reference_groups: Sequence[StepsGroup],
+        actual_steps: Sequence[Step],
+) -> list[Match]:
     # when we have autocomplete
     # matches = []
     # search_upto = len(actual_steps)
@@ -91,57 +99,59 @@ def get_steps_matches(
     # return matches
     # for now, we have only the last step(s)
-    last_group = reference_steps[-1]
-    candidates = collect_possible_matches_by_name_and_status(last_group, actual_steps, len(actual_steps))
-    return match_group_by_output(reference_steps, -1, actual_steps, candidates)
+    last_group = reference_groups[-1]
+    candidates = collect_possible_matches_by_name_and_status(
+        last_group,
+        actual_steps,
+        len(actual_steps)
+    )
+    return match_group_by_output(reference_groups, -1, actual_steps, candidates)
 def evaluate_steps(
-    reference_steps_groups: list[list[dict]],
-    actual_steps: list[dict],
-    matches: list[tuple[int, int, int, float]] | None = None
+    reference_steps_groups: Sequence[StepsGroup],
+    actual_steps: Sequence[Step],
+    matches: Sequence[Match] | None = None
 ) -> float:
     if matches is None:
         matches = get_steps_matches(reference_steps_groups, actual_steps)
-    matches_by_group = defaultdict(list)
     scores_by_group = defaultdict(float)
     for ref_group_idx, ref_match_idx, actual_idx, score in matches:
-        matches_by_group[ref_group_idx].append(ref_match_idx)
         scores_by_group[ref_group_idx] += score
         reference_steps_groups[ref_group_idx][ref_match_idx]["matches"] \
             = actual_steps[actual_idx]["id"]
-    group_ix = -1  # For now, consider only the last reference group of steps
-    return scores_by_group[group_ix] / len(reference_steps_groups[group_ix])
+    group_idx = -1  # For now, consider only the last reference group of steps
+    return scores_by_group[group_idx] / len(reference_steps_groups[group_idx])
-def get_steps_evaluation_result_dict(reference: dict, target: dict) -> dict:
+def get_steps_evaluation_result_dict(reference: dict, actual: dict) -> dict:
     eval_result = {}
-    act_steps = target.get("actual_steps", [])
-    eval_result["actual_steps"] = act_steps
-    for act_step in act_steps:
-        if act_step["name"] == "retrieval":
+    actual_steps = actual.get("actual_steps", [])
+    eval_result["actual_steps"] = actual_steps
+    for actual_step in actual_steps:
+        if actual_step["name"] == "retrieval":
             from .retrieval_answer import get_retrieval_evaluation_dict
             result = get_retrieval_evaluation_dict(
                 question_text=reference["question_text"],
                 reference_answer=reference.get("reference_answer"),
-                actual_answer=target.get("actual_answer"),
-                actual_contexts=json.loads(act_step["output"])
+                actual_answer=actual.get("actual_answer"),
+                actual_contexts=json.loads(actual_step["output"])
             )
-            act_step.update(result)
+            actual_step.update(result)
     if "reference_steps" in reference:
-        ref_steps = reference["reference_steps"]
-        matches = get_steps_matches(ref_steps, act_steps)
-        steps_score = evaluate_steps(ref_steps, act_steps, matches)
-        eval_result["steps_score"] = steps_score
+        reference_steps = reference["reference_steps"]
+        matches = get_steps_matches(reference_steps, actual_steps)
+        eval_result["steps_score"] \
+            = evaluate_steps(reference_steps, actual_steps, matches)
         for ref_group_idx, ref_match_idx, act_idx, _ in matches:
-            ref_step = ref_steps[ref_group_idx][ref_match_idx]
-            act_step = act_steps[act_idx]
-            if ref_step["name"] == "retrieval":
+            reference_step = reference_steps[ref_group_idx][ref_match_idx]
+            actual_step = actual_steps[act_idx]
+            if reference_step["name"] == "retrieval":
                 from .retrieval_context_texts import \
                     get_retrieval_evaluation_dict
                 res = get_retrieval_evaluation_dict(
-                    reference_contexts=json.loads(ref_step["output"]),
-                    actual_contexts=json.loads(act_step["output"])
+                    reference_contexts=json.loads(reference_step["output"]),
+                    actual_contexts=json.loads(actual_step["output"])
                 )
-                act_step.update(res)
+                actual_step.update(res)
     return eval_result
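
As `evaluate_steps` in this file shows, the steps score is the sum of the match scores for the last reference group divided by the number of steps in that group, so unmatched reference steps count as zero. A small sketch of that arithmetic, with made-up match tuples and assuming (as the `-1` passed to `match_group_by_output` suggests) that matches carry the group index `-1`:

```python
from collections import defaultdict

# Tuples follow the shape of the Match alias:
# (ref_group_idx, ref_step_idx, actual_idx, score). Values are illustrative.
matches = [(-1, 0, 3, 1.0), (-1, 1, 2, 0.5)]
last_group_size = 4  # hypothetical: the last reference group has 4 steps

# Accumulate scores per reference group, as evaluate_steps does.
scores_by_group = defaultdict(float)
for group_idx, _ref_idx, _actual_idx, score in matches:
    scores_by_group[group_idx] += score

# Only the last group is considered; two of its four steps are unmatched.
steps_score = scores_by_group[-1] / last_group_size
print(steps_score)
```

Here two of four reference steps are matched with scores 1.0 and 0.5, giving a steps score of 0.375.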

{graphrag_eval-5.1.0 → graphrag_eval-5.1.2}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "graphrag-eval"
-version = "5.1.0"
+version = "5.1.2"
 description = "For assessing question answering systems' final answers and intermediate steps, against a given set of questions, reference answers and steps."
 authors = [
     { name = "Philip Ganchev", email = "philip.ganchev@graphwise.ai" },