PyPI - graphrag-eval - version diff: 5.1.2.tar.gz → 5.3.0.tar.gz

Files changed (17)

{graphrag_eval-5.1.2 → graphrag_eval-5.3.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.3
 Name: graphrag-eval
-Version: 5.1.2
+Version: 5.3.0
 Summary: For assessing question answering systems' final answers and intermediate steps, against a given set of questions, reference answers and steps.
 License: Apache-2.0
 Author: Philip Ganchev
@@ -19,7 +19,7 @@ Project-URL: Repository, https://github.com/Ontotext-AD/graphrag-eval
 Description-Content-Type: text/markdown
 <p align="center">
-  <img alt="Graphwise Logo" src=".github/Graphwise_Logo.jpg">
+  <img alt="Graphwise Logo" src="https://github.com/Ontotext-AD/graphrag-eval/blob/main/.github/Graphwise_Logo.jpg">
 </p>
 # QA Evaluation
@@ -28,7 +28,7 @@ This is a Python module for assessing the quality of question-answering systems
 ## License
-Apache-2.0 License. See [LICENSE](LICENSE) file for details.
+Apache-2.0 License. See [LICENSE](https://github.com/Ontotext-AD/graphrag-eval/blob/main/LICENSE) file for details.
 ## Installation
@@ -107,6 +107,7 @@ Each step includes:
 - `output_media_type`: (optional, missing or one of `application/sparql-results+json`, `application/json`) Indicates how the output of a step must be processed
 - `ordered`: (optional, defaults to `false`) For SPARQL query results, whether results order matters. `true` means that the actual result rows must be ordered as the reference result; `false` means that result rows are matched as a set.
 - `required_columns`: (optional) - required only for SPARQL query results; list of binding names, which are required for SPARQL query results to match
+- `ignore_duplicates`: (optional, defaults to `true`) For SPARQL query results, whether duplicate binding values in the expected or in the actual results should be ignored for the comparison.
 #### Reference Data
@@ -591,7 +592,7 @@ Aggregates are:
 - `per_template`: a dictionary mapping a template identifier to the following statistics:
   - `number_of_error_samples`: number of questions for this template, which resulted in error response
   - `number_of_success_samples`: number of questions for this template, which resulted in successful response
-  - `sum`, `mean`, `median`, `min` and `max` statistics over all non-error responses for this template for the following metrics:
+  - `sum`, `mean`, `median`, `min` and `max` statistics for the following metrics over all questions of this template for which the metrics exist:
     - `input_tokens`
     - `output_tokens`
     - `total_tokens`
@@ -608,14 +609,19 @@ Aggregates are:
     - `retrieval_context_precision`
     - `retrieval_context_f1`
     - `steps`: includes:
-      - `steps`: for each step type how many times it was executed
+      - `total`: for each step type how many times it was executed
       - `once_per_sample`: how many times each step was executed, counted only once per question
       - `empty_results`: how many times the step was executed and returned empty results
       - `errors`: how many times the step was executed and resulted in error
 - `micro`: statistics across questions, regardless of template. It includes:
   - `number_of_error_samples`: total number of questions, which resulted in error response
   - `number_of_success_samples`: total number of questions, which resulted in successful response
-  - `sum`, `mean`, `median`, `min` and `max` statistics over all non-error responses for the following metrics:
+  - `steps`: includes:
+    - `total`: for each step type how many times it was executed
+    - `once_per_sample`: how many times each step was executed, counted only once per question
+    - `empty_results`: how many times the step was executed and returned empty results
+    - `errors`: how many times the step was executed and resulted in error
+  - `sum`, `mean`, `median`, `min` and `max` statistics for the following metrics, over all questions where the metrics exist:
     - `input_tokens`
     - `output_tokens`
     - `total_tokens`

{graphrag_eval-5.1.2 → graphrag_eval-5.3.0}/README.md RENAMED

@@ -1,5 +1,5 @@
 <p align="center">
-  <img alt="Graphwise Logo" src=".github/Graphwise_Logo.jpg">
+  <img alt="Graphwise Logo" src="https://github.com/Ontotext-AD/graphrag-eval/blob/main/.github/Graphwise_Logo.jpg">
 </p>
 # QA Evaluation
@@ -8,7 +8,7 @@ This is a Python module for assessing the quality of question-answering systems
 ## License
-Apache-2.0 License. See [LICENSE](LICENSE) file for details.
+Apache-2.0 License. See [LICENSE](https://github.com/Ontotext-AD/graphrag-eval/blob/main/LICENSE) file for details.
 ## Installation
@@ -87,6 +87,7 @@ Each step includes:
 - `output_media_type`: (optional, missing or one of `application/sparql-results+json`, `application/json`) Indicates how the output of a step must be processed
 - `ordered`: (optional, defaults to `false`) For SPARQL query results, whether results order matters. `true` means that the actual result rows must be ordered as the reference result; `false` means that result rows are matched as a set.
 - `required_columns`: (optional) - required only for SPARQL query results; list of binding names, which are required for SPARQL query results to match
+- `ignore_duplicates`: (optional, defaults to `true`) For SPARQL query results, whether duplicate binding values in the expected or in the actual results should be ignored for the comparison.
 #### Reference Data
@@ -571,7 +572,7 @@ Aggregates are:
 - `per_template`: a dictionary mapping a template identifier to the following statistics:
   - `number_of_error_samples`: number of questions for this template, which resulted in error response
   - `number_of_success_samples`: number of questions for this template, which resulted in successful response
-  - `sum`, `mean`, `median`, `min` and `max` statistics over all non-error responses for this template for the following metrics:
+  - `sum`, `mean`, `median`, `min` and `max` statistics for the following metrics over all questions of this template for which the metrics exist:
     - `input_tokens`
     - `output_tokens`
     - `total_tokens`
@@ -588,14 +589,19 @@ Aggregates are:
     - `retrieval_context_precision`
     - `retrieval_context_f1`
     - `steps`: includes:
-      - `steps`: for each step type how many times it was executed
+      - `total`: for each step type how many times it was executed
       - `once_per_sample`: how many times each step was executed, counted only once per question
       - `empty_results`: how many times the step was executed and returned empty results
       - `errors`: how many times the step was executed and resulted in error
 - `micro`: statistics across questions, regardless of template. It includes:
   - `number_of_error_samples`: total number of questions, which resulted in error response
   - `number_of_success_samples`: total number of questions, which resulted in successful response
-  - `sum`, `mean`, `median`, `min` and `max` statistics over all non-error responses for the following metrics:
+  - `steps`: includes:
+    - `total`: for each step type how many times it was executed
+    - `once_per_sample`: how many times each step was executed, counted only once per question
+    - `empty_results`: how many times the step was executed and returned empty results
+    - `errors`: how many times the step was executed and resulted in error
+  - `sum`, `mean`, `median`, `min` and `max` statistics for the following metrics, over all questions where the metrics exist:
     - `input_tokens`
     - `output_tokens`
     - `total_tokens`

{graphrag_eval-5.1.2 → graphrag_eval-5.3.0}/graphrag_eval/aggregation.py RENAMED

@@ -131,9 +131,10 @@ def compute_per_template_stats(
 def compute_micro_stats(
-    number_of_samples_per_template_by_status,
-    stats_per_template,
-    step_metrics_per_template
+    number_of_samples_per_template_by_status: dict[str, dict[str, int]],
+    stats_per_template: dict[str, dict[str, Sequence[int]]],
+    steps_summary_per_template: dict[str, dict[str, dict[str, int]]],
+    step_metrics_per_template: dict[str, dict[str, Sequence[int]]],
 ) -> dict:
     values = number_of_samples_per_template_by_status.values()
     micro_summary = defaultdict(dict, {
@@ -157,6 +158,16 @@ def compute_micro_stats(
             micro_step_metrics[metric].extend(values)
     for metric, values in micro_step_metrics.items():
         micro_summary[metric] = stats_for_series(values)
+    steps_summary = defaultdict(lambda: defaultdict(int))
+    for template_steps_summary in steps_summary_per_template.values():
+        for summary_name, steps_stats in template_steps_summary.items():
+            for step_id, count in steps_stats.items():
+                steps_summary[summary_name][step_id] += count
+    steps_summary = {k: dict(v) for k, v in steps_summary.items()}
+    if len(steps_summary) > 0:
+        micro_summary["steps"] = steps_summary
     return dict(micro_summary)
@@ -198,8 +209,8 @@ def compute_aggregates(samples: list[dict]) -> dict:
         if "error" in sample:
             number_of_samples_per_template_by_status[template_id]["error"] += 1
-            continue
-        number_of_samples_per_template_by_status[template_id]["success"] += 1
+        else:
+            number_of_samples_per_template_by_status[template_id]["success"] += 1
         update_stats(sample, stats_per_template[template_id])
         update_steps_summary(sample, steps_summary_per_template[template_id])
         update_step_metrics(sample, step_metrics_per_template[template_id])
@@ -215,6 +226,7 @@ def compute_aggregates(samples: list[dict]) -> dict:
         "micro": compute_micro_stats(
             number_of_samples_per_template_by_status,
             stats_per_template,
+            steps_summary_per_template,
             step_metrics_per_template
         )
     }
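
Note on the aggregation.py change: the new loop in `compute_micro_stats` sums the per-template step counters into a single cross-template `steps` summary. The following is a minimal standalone sketch of that summing pattern, not the package's code. The function name `merge_steps_summaries`, the template identifiers and the step names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def merge_steps_summaries(steps_summary_per_template):
    """Sum per-template step counters into one summary across templates.

    Hypothetical helper that mirrors the loop added in 5.3.0. The input maps
    template id -> summary name -> step id -> count.
    """
    merged = defaultdict(lambda: defaultdict(int))
    for template_summary in steps_summary_per_template.values():
        for summary_name, counts in template_summary.items():
            for step_id, count in counts.items():
                merged[summary_name][step_id] += count
    # Convert the nested defaultdicts to plain dicts before returning.
    return {name: dict(counts) for name, counts in merged.items()}


# Made-up example data: two templates with overlapping step types.
per_template = {
    "template_a": {
        "total": {"sparql_query": 3, "retrieval": 1},
        "errors": {"sparql_query": 1},
    },
    "template_b": {
        "total": {"sparql_query": 2},
        "empty_results": {"retrieval": 1},
    },
}
merged = merge_steps_summaries(per_template)
print(merged)
```

In the diff, the merged summary is attached to the micro statistics under the `steps` key only when it is non-empty, so an evaluation without any recorded steps produces no `steps` entry.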

{graphrag_eval-5.1.2 → graphrag_eval-5.3.0}/graphrag_eval/evaluation.py RENAMED

@@ -6,7 +6,7 @@ def run_evaluation(
         responses_dict: dict,
 ) -> list[dict]:
     # Output metrics are not nested, for simpler aggregation
-    answer_correctess_evaluator = None
+    answer_correctness_evaluator = None
     evaluation_results = []
     for template in qa_dataset:
         template_id = template["template_id"]
@@ -26,9 +26,9 @@ def run_evaluation(
                     "status": "error",
                     "error": actual_result["error"],
                 })
-                evaluation_results.append(eval_result)
-                continue
-            eval_result["status"] = "success"
+            else:
+                eval_result["status"] = "success"
             if "actual_answer" in actual_result:
                 eval_result["actual_answer"] = actual_result["actual_answer"]
                 from graphrag_eval import answer_relevance
@@ -38,25 +38,24 @@ def run_evaluation(
                         actual_result["actual_answer"],
                     )
                 )
-            if "reference_answer" in question and "actual_answer" in actual_result:
-                from graphrag_eval.answer_correctness import AnswerCorrectnessEvaluator
-                if not answer_correctess_evaluator:
-                    answer_correctess_evaluator = AnswerCorrectnessEvaluator()
-                eval_result.update(
-                    answer_correctess_evaluator.get_correctness_dict(
-                        question,
-                        actual_result,
+                if "reference_answer" in question:
+                    from graphrag_eval.answer_correctness import AnswerCorrectnessEvaluator
+                    if not answer_correctness_evaluator:
+                        answer_correctness_evaluator = AnswerCorrectnessEvaluator()
+                    eval_result.update(
+                        answer_correctness_evaluator.get_correctness_dict(
+                            question,
+                            actual_result,
+                        )
                     )
-                )
-            if "actual_steps" in actual_result:
-                eval_result.update(
-                    get_steps_evaluation_result_dict(question, actual_result)
-                )
-            eval_result.update({
-                "input_tokens": actual_result["input_tokens"],
-                "output_tokens": actual_result["output_tokens"],
-                "total_tokens": actual_result["total_tokens"],
-                "elapsed_sec": actual_result["elapsed_sec"],
-            })
+            eval_result.update(
+                get_steps_evaluation_result_dict(question, actual_result)
+            )
+            for key in "input_tokens", "output_tokens", "total_tokens", "elapsed_sec":
+                if key in actual_result:
+                    eval_result[key] = actual_result[key]
             evaluation_results.append(eval_result)
     return evaluation_results
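
Note on the evaluation.py change: in 5.3.0 an error response no longer short-circuits the loop, and the token and timing fields are copied only when the response contains them. The sketch below isolates that guarded copy. The helper name `copy_usage_metrics` and the sample dictionaries are hypothetical, and the real function does more work than shown here.

```python
# Keys copied from the response into the evaluation result when present.
OPTIONAL_KEYS = ("input_tokens", "output_tokens", "total_tokens", "elapsed_sec")


def copy_usage_metrics(actual_result, eval_result):
    """Copy usage metrics that exist in actual_result; skip missing ones.

    Hypothetical helper illustrating the guarded copy introduced in 5.3.0.
    """
    for key in OPTIONAL_KEYS:
        if key in actual_result:
            eval_result[key] = actual_result[key]
    return eval_result


# Made-up error response that reports only the elapsed time.
error_response = {"error": "timeout", "elapsed_sec": 30.0}
result = copy_usage_metrics(error_response, {"status": "error"})
print(result)
```

With the 5.1.2 code, the same response would never have reached this point because of the early `continue`, and a response missing any of the four keys would have raised a `KeyError` at the unconditional dictionary lookups.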

{graphrag_eval-5.1.2 → graphrag_eval-5.3.0}/graphrag_eval/steps/evaluation.py RENAMED

@@ -1,12 +1,11 @@
 import json
 from collections import defaultdict
-from typing import Any
 from collections.abc import Sequence
+from typing import Any
 from .retrieval_context_ids import recall_at_k
 from .sparql import compare_sparql_results
 Match = tuple[int, int, int, float]
 Step = dict[str, Any]
 StepsGroup = Sequence[Step]  # We will index into a group
@@ -23,6 +22,7 @@ def compare_steps_outputs(reference_step: Step, actual_step: Step) -> float:
             json.loads(actual_output),
             reference_step["required_columns"],
             reference_step.get("ordered", False),
+            reference_step.get("ignore_duplicates", True),
         )
     if reference_step.get("output_media_type") == "application/json":
         return float(json.loads(reference_output) == json.loads(actual_output))

{graphrag_eval-5.1.2 → graphrag_eval-5.3.0}/graphrag_eval/steps/sparql.py RENAMED

@@ -1,7 +1,7 @@
-from collections import Counter
-from typing import Union
 import itertools
 import math
+from collections import Counter
+from typing import Union
 XSD_NUMERIC_TYPES = {
     "http://www.w3.org/2001/XMLSchema#integer",
@@ -35,7 +35,7 @@ def truncate(number: float, decimals: int = 0) -> float:
     elif decimals == 0:
         return math.trunc(number)
-    factor = 10.0**decimals
+    factor = 10.0 ** decimals
     return math.trunc(number * factor) / factor
@@ -137,8 +137,8 @@ def compare_values(
     actual_vars: Union[list[str], tuple[str, ...]],
     actual_var_to_values: dict[str, list],
     results_are_ordered: bool,
+    ignore_duplicates: bool,
 ) -> bool:
     if len(reference_vars) < len(actual_vars):
         for combination in itertools.combinations(actual_vars, len(reference_vars)):
             if compare_values(
@@ -147,6 +147,7 @@ def compare_values(
                 combination,
                 actual_var_to_values,
                 results_are_ordered,
+                ignore_duplicates,
             ):
                 return True
         return False
@@ -154,9 +155,9 @@ def compare_values(
     table = convert_table_dict2lines(reference_vars, reference_var_to_values)
     for permutation in itertools.permutations(actual_vars):
         actual_table = convert_table_dict2lines(permutation, actual_var_to_values)
-        if (results_are_ordered and table == actual_table) or (
-            not results_are_ordered and Counter(table) == Counter(actual_table)
-        ):
+        if (results_are_ordered and table == actual_table) or \
+            ((not results_are_ordered) and ignore_duplicates and set(table) == set(actual_table)) or \
+            ((not results_are_ordered) and (not ignore_duplicates) and Counter(table) == Counter(actual_table)):
             return True
     return False
@@ -167,6 +168,7 @@ def compare_sparql_results(
     actual_sparql_result: dict,
     required_vars: list[str],
     results_are_ordered: bool = False,
+    ignore_duplicates: bool = True,
 ) -> float:
     # DESCRIBE results
     if isinstance(actual_sparql_result, str):
@@ -208,5 +210,6 @@ def compare_sparql_results(
             actual_vars,
             actual_var_to_values,
             results_are_ordered,
+            ignore_duplicates,
         )
     )
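
Note on the sparql.py change: the new `ignore_duplicates` flag selects between set equality and multiset (`Counter`) equality when results are unordered, and it has no effect when `results_are_ordered` is true. The sketch below shows the three branches on plain tuples. The function name `rows_match` and the sample rows are hypothetical, and it assumes rows are hashable tuples, which the diff does not show directly.

```python
from collections import Counter


def rows_match(reference_rows, actual_rows, ordered=False, ignore_duplicates=True):
    """Compare two result tables given as lists of hashable rows.

    Hypothetical helper mirroring the condition changed in 5.3.0.
    """
    if ordered:
        # Ordered comparison: rows must match position by position.
        return reference_rows == actual_rows
    if ignore_duplicates:
        # Unordered, duplicates ignored: compare as sets.
        return set(reference_rows) == set(actual_rows)
    # Unordered, duplicates significant: compare as multisets.
    return Counter(reference_rows) == Counter(actual_rows)


# Made-up rows: the actual result repeats one row and is in another order.
reference = [("a", "1"), ("b", "2")]
actual = [("b", "2"), ("a", "1"), ("a", "1")]

print(rows_match(reference, actual))                           # True
print(rows_match(reference, actual, ignore_duplicates=False))  # False
print(rows_match(reference, actual, ordered=True))             # False
```

Because the default is `True`, reference steps that do not set `ignore_duplicates` are compared as sets in 5.3.0, whereas 5.1.2 always used the `Counter` comparison for unordered results.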

{graphrag_eval-5.1.2 → graphrag_eval-5.3.0}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "graphrag-eval"
-version = "5.1.2"
+version = "5.3.0"
 description = "For assessing question answering systems' final answers and intermediate steps, against a given set of questions, reference answers and steps."
 authors = [
     { name = "Philip Ganchev", email = "philip.ganchev@graphwise.ai" },