PyPI - graphrag-eval - Versions diffs - 5.2.0__tar.gz → 5.3.1__tar.gz - Mend

graphrag-eval 5.2.0.tar.gz → 5.3.1.tar.gz


Files changed (17)

{graphrag_eval-5.2.0 → graphrag_eval-5.3.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.3
 Name: graphrag-eval
-Version: 5.2.0
+Version: 5.3.1
 Summary: For assessing question answering systems' final answers and intermediate steps, against a given set of questions, reference answers and steps.
 License: Apache-2.0
 Author: Philip Ganchev
@@ -19,7 +19,7 @@ Project-URL: Repository, https://github.com/Ontotext-AD/graphrag-eval
 Description-Content-Type: text/markdown
 <p align="center">
-  <img alt="Graphwise Logo" src=".github/Graphwise_Logo.jpg">
+  <img alt="Graphwise Logo" src="https://github.com/Ontotext-AD/graphrag-eval/blob/main/.github/Graphwise_Logo.jpg">
 </p>
 # QA Evaluation
@@ -28,7 +28,7 @@ This is a Python module for assessing the quality of question-answering systems
 ## License
-Apache-2.0 License. See [LICENSE](LICENSE) file for details.
+Apache-2.0 License. See [LICENSE](https://github.com/Ontotext-AD/graphrag-eval/blob/main/LICENSE) file for details.
 ## Installation
@@ -592,7 +592,7 @@ Aggregates are:
 - `per_template`: a dictionary mapping a template identifier to the following statistics:
   - `number_of_error_samples`: number of questions for this template, which resulted in error response
   - `number_of_success_samples`: number of questions for this template, which resulted in successful response
-  - `sum`, `mean`, `median`, `min` and `max` statistics over all non-error responses for this template for the following metrics:
+  - `sum`, `mean`, `median`, `min` and `max` statistics for the following metrics over all questions of this template for which the metrics exist:
     - `input_tokens`
     - `output_tokens`
     - `total_tokens`
@@ -609,14 +609,19 @@ Aggregates are:
     - `retrieval_context_precision`
     - `retrieval_context_f1`
     - `steps`: includes:
-      - `steps`: for each step type how many times it was executed
+      - `total`: for each step type how many times it was executed
       - `once_per_sample`: how many times each step was executed, counted only once per question
       - `empty_results`: how many times the step was executed and returned empty results
       - `errors`: how many times the step was executed and resulted in error
 - `micro`: statistics across questions, regardless of template. It includes:
   - `number_of_error_samples`: total number of questions, which resulted in error response
   - `number_of_success_samples`: total number of questions, which resulted in successful response
-  - `sum`, `mean`, `median`, `min` and `max` statistics over all non-error responses for the following metrics:
+  - `steps`: includes:
+    - `total`: for each step type how many times it was executed
+    - `once_per_sample`: how many times each step was executed, counted only once per question
+    - `empty_results`: how many times the step was executed and returned empty results
+    - `errors`: how many times the step was executed and resulted in error
+  - `sum`, `mean`, `median`, `min` and `max` statistics for the following metrics, over all questions where the metrics exist:
     - `input_tokens`
     - `output_tokens`
     - `total_tokens`

{graphrag_eval-5.2.0 → graphrag_eval-5.3.1}/README.md RENAMED

@@ -1,5 +1,5 @@
 <p align="center">
-  <img alt="Graphwise Logo" src=".github/Graphwise_Logo.jpg">
+  <img alt="Graphwise Logo" src="https://github.com/Ontotext-AD/graphrag-eval/blob/main/.github/Graphwise_Logo.jpg">
 </p>
 # QA Evaluation
@@ -8,7 +8,7 @@ This is a Python module for assessing the quality of question-answering systems
 ## License
-Apache-2.0 License. See [LICENSE](LICENSE) file for details.
+Apache-2.0 License. See [LICENSE](https://github.com/Ontotext-AD/graphrag-eval/blob/main/LICENSE) file for details.
 ## Installation
@@ -572,7 +572,7 @@ Aggregates are:
 - `per_template`: a dictionary mapping a template identifier to the following statistics:
   - `number_of_error_samples`: number of questions for this template, which resulted in error response
   - `number_of_success_samples`: number of questions for this template, which resulted in successful response
-  - `sum`, `mean`, `median`, `min` and `max` statistics over all non-error responses for this template for the following metrics:
+  - `sum`, `mean`, `median`, `min` and `max` statistics for the following metrics over all questions of this template for which the metrics exist:
     - `input_tokens`
     - `output_tokens`
     - `total_tokens`
@@ -589,14 +589,19 @@ Aggregates are:
     - `retrieval_context_precision`
     - `retrieval_context_f1`
     - `steps`: includes:
-      - `steps`: for each step type how many times it was executed
+      - `total`: for each step type how many times it was executed
       - `once_per_sample`: how many times each step was executed, counted only once per question
       - `empty_results`: how many times the step was executed and returned empty results
       - `errors`: how many times the step was executed and resulted in error
 - `micro`: statistics across questions, regardless of template. It includes:
   - `number_of_error_samples`: total number of questions, which resulted in error response
   - `number_of_success_samples`: total number of questions, which resulted in successful response
-  - `sum`, `mean`, `median`, `min` and `max` statistics over all non-error responses for the following metrics:
+  - `steps`: includes:
+    - `total`: for each step type how many times it was executed
+    - `once_per_sample`: how many times each step was executed, counted only once per question
+    - `empty_results`: how many times the step was executed and returned empty results
+    - `errors`: how many times the step was executed and resulted in error
+  - `sum`, `mean`, `median`, `min` and `max` statistics for the following metrics, over all questions where the metrics exist:
     - `input_tokens`
     - `output_tokens`
     - `total_tokens`
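
The README hunks above document a new `steps` block under `micro`. As a rough illustration of how a consumer might read it, here is a minimal sketch. Only the key names (`steps`, `total`, `once_per_sample`, `empty_results`, `errors`, `number_of_error_samples`, `number_of_success_samples`) come from the README; the step name `sparql_query`, all counts, and the helper `step_error_rate` are hypothetical.

```python
# Hypothetical excerpt of the "micro" aggregate in 5.3.1.
# The step name "sparql_query" and every number are invented for illustration.
micro = {
    "number_of_error_samples": 1,
    "number_of_success_samples": 3,
    "steps": {
        "total": {"sparql_query": 5},
        "once_per_sample": {"sparql_query": 4},
        "empty_results": {"sparql_query": 1},
        "errors": {"sparql_query": 1},
    },
}


def step_error_rate(micro_stats: dict, step: str) -> float:
    """Return the fraction of executions of `step` that ended in an error.

    Hypothetical helper, not part of graphrag-eval. It tolerates a missing
    "steps" block, since the aggregation code only adds it when non-empty.
    """
    steps = micro_stats.get("steps", {})
    total = steps.get("total", {}).get(step, 0)
    errors = steps.get("errors", {}).get(step, 0)
    return errors / total if total else 0.0


rate = step_error_rate(micro, "sparql_query")
```

Reading the block with `.get` keeps the helper safe for runs in which no steps were recorded.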

{graphrag_eval-5.2.0 → graphrag_eval-5.3.1}/graphrag_eval/aggregation.py RENAMED

@@ -131,9 +131,10 @@ def compute_per_template_stats(
 def compute_micro_stats(
-    number_of_samples_per_template_by_status,
-    stats_per_template,
-    step_metrics_per_template
+    number_of_samples_per_template_by_status: dict[str, dict[str, int]],
+    stats_per_template: dict[str, dict[str, Sequence[int]]],
+    steps_summary_per_template: dict[str, dict[str, dict[str, int]]],
+    step_metrics_per_template: dict[str, dict[str, Sequence[int]]],
 ) -> dict:
     values = number_of_samples_per_template_by_status.values()
     micro_summary = defaultdict(dict, {
@@ -157,6 +158,16 @@ def compute_micro_stats(
             micro_step_metrics[metric].extend(values)
     for metric, values in micro_step_metrics.items():
         micro_summary[metric] = stats_for_series(values)
+    steps_summary = defaultdict(lambda: defaultdict(int))
+    for template_steps_summary in steps_summary_per_template.values():
+        for summary_name, steps_stats in template_steps_summary.items():
+            for step_id, count in steps_stats.items():
+                steps_summary[summary_name][step_id] += count
+    steps_summary = {k: dict(v) for k, v in steps_summary.items()}
+    if len(steps_summary) > 0:
+        micro_summary["steps"] = steps_summary
     return dict(micro_summary)
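
The lines added to `compute_micro_stats` in this hunk sum per-template step counts into one cross-template summary. The standalone sketch below mirrors that loop so it can be run in isolation. The function name `merge_steps_summaries`, the template identifiers, and the step names are hypothetical; in the package the logic is inline.

```python
from collections import defaultdict


def merge_steps_summaries(
    steps_summary_per_template: dict[str, dict[str, dict[str, int]]],
) -> dict[str, dict[str, int]]:
    """Sum step counts across templates, keyed by summary name and step id."""
    merged = defaultdict(lambda: defaultdict(int))
    for template_summary in steps_summary_per_template.values():
        for summary_name, step_counts in template_summary.items():
            for step_id, count in step_counts.items():
                merged[summary_name][step_id] += count
    # Convert the nested defaultdicts to plain dicts, as the diff does.
    return {name: dict(counts) for name, counts in merged.items()}


# Invented input: two templates with overlapping step types.
per_template = {
    "template_a": {"total": {"sparql_query": 3}, "errors": {"sparql_query": 1}},
    "template_b": {"total": {"sparql_query": 2, "retrieval": 4}},
}
merged = merge_steps_summaries(per_template)
```

With this input, `merged["total"]` holds 5 for `sparql_query` and 4 for `retrieval`, and `merged["errors"]` holds 1 for `sparql_query`.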
@@ -198,8 +209,8 @@ def compute_aggregates(samples: list[dict]) -> dict:
         if "error" in sample:
             number_of_samples_per_template_by_status[template_id]["error"] += 1
-            continue
-        number_of_samples_per_template_by_status[template_id]["success"] += 1
+        else:
+            number_of_samples_per_template_by_status[template_id]["success"] += 1
         update_stats(sample, stats_per_template[template_id])
         update_steps_summary(sample, steps_summary_per_template[template_id])
         update_step_metrics(sample, step_metrics_per_template[template_id])
@@ -215,6 +226,7 @@ def compute_aggregates(samples: list[dict]) -> dict:
         "micro": compute_micro_stats(
             number_of_samples_per_template_by_status,
             stats_per_template,
+            steps_summary_per_template,
             step_metrics_per_template
         )
     }
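
In `compute_aggregates`, replacing `continue` with an `else` branch means that error samples are still counted as errors but are no longer skipped by the `update_*` calls that follow. The sketch below illustrates that control flow only. It assumes each sample carries a `template_id` key and that a metric is collected only when present in the sample (consistent with the README wording "for which the metrics exist"); the function `count_and_collect` and the sample values are hypothetical.

```python
from collections import defaultdict


def count_and_collect(samples: list[dict]) -> tuple[dict, dict]:
    """Count samples by status and collect `total_tokens` where present."""
    counts = defaultdict(lambda: defaultdict(int))
    collected = defaultdict(list)
    for sample in samples:
        template_id = sample["template_id"]  # assumed key
        if "error" in sample:
            counts[template_id]["error"] += 1
        else:
            counts[template_id]["success"] += 1
        # Reached for both statuses now that the `continue` is gone.
        if "total_tokens" in sample:
            collected[template_id].append(sample["total_tokens"])
    return {k: dict(v) for k, v in counts.items()}, dict(collected)


# Invented samples: one success, one error with token usage, one error without.
samples = [
    {"template_id": "t1", "total_tokens": 10},
    {"template_id": "t1", "error": "timeout", "total_tokens": 4},
    {"template_id": "t1", "error": "boom"},
]
counts, collected = count_and_collect(samples)
```

Here the error sample that reports `total_tokens` contributes to the collected values, while the one without it contributes only to the error count.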

{graphrag_eval-5.2.0 → graphrag_eval-5.3.1}/graphrag_eval/evaluation.py RENAMED

@@ -6,7 +6,7 @@ def run_evaluation(
         responses_dict: dict,
 ) -> list[dict]:
     # Output metrics are not nested, for simpler aggregation
-    answer_correctess_evaluator = None
+    answer_correctness_evaluator = None
     evaluation_results = []
     for template in qa_dataset:
         template_id = template["template_id"]
@@ -26,9 +26,9 @@ def run_evaluation(
                     "status": "error",
                     "error": actual_result["error"],
                 })
-                evaluation_results.append(eval_result)
-                continue
-            eval_result["status"] = "success"
+            else:
+                eval_result["status"] = "success"
             if "actual_answer" in actual_result:
                 eval_result["actual_answer"] = actual_result["actual_answer"]
                 from graphrag_eval import answer_relevance
@@ -38,25 +38,24 @@ def run_evaluation(
                         actual_result["actual_answer"],
                     )
                 )
-            if "reference_answer" in question and "actual_answer" in actual_result:
-                from graphrag_eval.answer_correctness import AnswerCorrectnessEvaluator
-                if not answer_correctess_evaluator:
-                    answer_correctess_evaluator = AnswerCorrectnessEvaluator()
-                eval_result.update(
-                    answer_correctess_evaluator.get_correctness_dict(
-                        question,
-                        actual_result,
+                if "reference_answer" in question:
+                    from graphrag_eval.answer_correctness import AnswerCorrectnessEvaluator
+                    if not answer_correctness_evaluator:
+                        answer_correctness_evaluator = AnswerCorrectnessEvaluator()
+                    eval_result.update(
+                        answer_correctness_evaluator.get_correctness_dict(
+                            question,
+                            actual_result,
+                        )
                     )
-                )
-            if "actual_steps" in actual_result:
-                eval_result.update(
-                    get_steps_evaluation_result_dict(question, actual_result)
-                )
-            eval_result.update({
-                "input_tokens": actual_result["input_tokens"],
-                "output_tokens": actual_result["output_tokens"],
-                "total_tokens": actual_result["total_tokens"],
-                "elapsed_sec": actual_result["elapsed_sec"],
-            })
+            eval_result.update(
+                get_steps_evaluation_result_dict(question, actual_result)
+            )
+            for key in "input_tokens", "output_tokens", "total_tokens", "elapsed_sec":
+                if key in actual_result:
+                    eval_result[key] = actual_result[key]
             evaluation_results.append(eval_result)
     return evaluation_results
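
The last hunk of `run_evaluation` replaces unconditional indexing of the usage fields with a loop that copies each key only if the response contains it, and error responses now flow through the same path. A minimal standalone sketch of that pattern follows. The helper name `copy_usage_fields` and the sample responses are hypothetical; the four key names and the `status` and `error` fields come from the diff.

```python
OPTIONAL_KEYS = ("input_tokens", "output_tokens", "total_tokens", "elapsed_sec")


def copy_usage_fields(actual_result: dict) -> dict:
    """Build a partial evaluation result, copying only the keys that exist."""
    eval_result = {}
    if "error" in actual_result:
        eval_result.update({"status": "error", "error": actual_result["error"]})
    else:
        eval_result["status"] = "success"
    # Missing keys are skipped rather than raising KeyError.
    for key in OPTIONAL_KEYS:
        if key in actual_result:
            eval_result[key] = actual_result[key]
    return eval_result


# Invented responses: an error that still reports timing, and a success.
failed = copy_usage_fields({"error": "timeout", "elapsed_sec": 1.5})
succeeded = copy_usage_fields(
    {"input_tokens": 10, "output_tokens": 5, "total_tokens": 15}
)
```

The membership check is what allows an error response with partial usage data to pass through without raising `KeyError`.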

{graphrag_eval-5.2.0 → graphrag_eval-5.3.1}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "graphrag-eval"
-version = "5.2.0"
+version = "5.3.1"
 description = "For assessing question answering systems' final answers and intermediate steps, against a given set of questions, reference answers and steps."
 authors = [
     { name = "Philip Ganchev", email = "philip.ganchev@graphwise.ai" },