graphrag-eval 4.0.0 (tar.gz) → 5.0.1 (tar.gz)

Files changed (16)

graphrag_eval-4.0.0/README.md → graphrag_eval-5.0.1/PKG-INFO RENAMED

@@ -1,3 +1,21 @@
+Metadata-Version: 2.3
+Name: graphrag-eval
+Version: 5.0.1
+Summary: For assessing question answering systems' final answers and intermediate steps, against a given set of questions, reference answers and steps.
+License: Apache-2.0
+Author: Neli Hateva
+Author-email: neli.hateva@graphwise.ai
+Requires-Python: >=3.12,<3.13
+Classifier: License :: OSI Approved :: Apache Software License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.12
+Provides-Extra: openai
+Requires-Dist: langevals (==0.1.*) ; extra == "openai"
+Requires-Dist: langevals-ragas (>=0.1.12,<0.2.0) ; extra == "openai"
+Requires-Dist: openai (>=1.97.0,<2.0.0) ; extra == "openai"
+Project-URL: Repository, https://github.com/Ontotext-AD/graphrag-eval
+Description-Content-Type: text/markdown
 <p align="center">
   <img alt="Graphwise Logo" src=".github/Graphwise_Logo.jpg">
 </p>
@@ -36,7 +54,7 @@ graphrag-eval = {version = "*", extras = ["openai"]}
 ## Maintainers
 Developed and maintained by [Graphwise](https://graphwise.ai/).
-For issues or feature requests, please open [a GitHub issue](https://github.com/Ontotext-AD/qa-eval/issues).
+For issues or feature requests, please open [a GitHub issue](https://github.com/Ontotext-AD/graphrag-eval/issues).
 ## Command Line Use
@@ -77,13 +95,14 @@ A reference corpus is a list of templates, each of which contains:
   - `question_text`: The natural language query passed to the LLM
   - `reference_steps`: (optional) A list of expected steps grouped by expected order of execution, where all steps in a group can be executed in any order relative to each other, but after all steps in the previous group and before all steps in the next group.
   - `reference_answer`: (optional) The expected answer to the question
 The assumption is that the final answer to the question is derived from the outputs of the steps, which are executed last (last level).
 Each step includes:
 - `name`: The type of step being performed (e.g., `sparql_query`)
 - `args`: Arguments of the step (e.g., arguments to a tool used in the step, such as a SPARQL query)
-- `output`: The expected output from the step
+- `output`: The expected output from the step.
 - `output_media_type`: (optional, missing or one of `application/sparql-results+json`, `application/json`) Indicates how the output of a step must be processed
 - `ordered`: (optional, defaults to `false`) For SPARQL query results, whether results order matters. `true` means that the actual result rows must be ordered as the reference result; `false` means that result rows are matched as a set.
 - `required_columns`: (optional) - required only for SPARQL query results; list of binding names, which are required for SPARQL query results to match
@@ -99,7 +118,22 @@ The example corpus below illustrates a minimal but realistic Q&A dataset, showin
     question_text: List all transformers within Substation OSLO
     reference_answer: OSLO T1, OSLO T2
     reference_steps:
-    - - name: sparql_query
+    - - name: retrieval
+        args:
+          query: transformers Substation OSLO
+          k: 2
+        output: |-
+          [
+            {
+              "id": "http://example.com/resource/doc/1",
+              "text": "Transformer OSLO T1 is in Substation Oslo."
+            },
+            {
+              "id": "http://example.com/resource/doc/2",
+              "text": "Transformer OSLO T2 is in Substation Oslo."
+            }
+          ]
+      - name: sparql_query
         args:
           query: |2
@@ -253,6 +287,16 @@ Below is an example response from the question-answering system for a single que
     "total_tokens": 298753,
     "elapsed_sec": 46.48961806297302,
     "actual_steps": [
+        {
+          "name": "retrieval",
+          "args": {
+            "query": "transformers Substation OSLO",
+            "k": 2
+          },
+          "id": "call_3",
+          "status": "success",
+          "output": "[\n  {\n    \"id\": \"http://example.com/resource/doc/1\",\n    \"text\": \"Transformer OSLO T1 is in Substation Oslo.\"\n  },\n  {\n    \"id\": \"http://example.com/resource/doc/2\",\n    \"text\": \"Transformer OSLO T2 is in Substation Oslo.\"\n  }\n]"
+        },
         {
             "name": "autocomplete_search",
             "args": {
@@ -323,7 +367,23 @@ The output is a list of statistics for each question from the reference Q&A data
   question_text: List all transformers within Substation OSLO
   reference_answer: OSLO T1, OSLO T2
   reference_steps:
-  - - name: sparql_query
+  - - name: retrieval
+      args:
+        query: transformers Substation OSLO
+        k: 2
+      matches: call_3
+      output: |-
+        [
+          {
+            "id": "http://example.com/resource/doc/1",
+            "text": "Transformer OSLO T1 is in Substation Oslo."
+          },
+          {
+            "id": "http://example.com/resource/doc/2",
+            "text": "Transformer OSLO T2 is in Substation Oslo."
+          }
+        ]
+  - name: sparql_query
       args:
         query: |2
@@ -364,6 +424,31 @@ The output is a list of statistics for each question from the reference Q&A data
   answer_relevance: 0.9
   answer_relevance_cost: 0.0007
   actual_steps:
+  - name: retrieval
+    id: call_3
+    args:
+      query: transformers Substation OSLO
+      k: 2
+    status: success
+    output: |-
+      [
+        {
+          "id": "http://example.com/resource/doc/1",
+          "text": "Transformer OSLO T1 is in Substation Oslo."
+        },
+        {
+          "id": "http://example.com/resource/doc/2",
+          "text": "Transformer OSLO T2 is in Substation Oslo."
+        }
+      ]
+    retrieval_answer_recall: 1.0
+    retrieval_answer_recall_reason: The context contains all the transformers listed in the reference answer
+    retrieval_answer_recall_cost: 0.0007
+    retrieval_answer_precision: 1.0
+    retrieval_answer_precision_reason: The context contains only transformers listed in the reference answer
+    retrieval_answer_precision_cost: 0.0003
+    retrieval_answer_f1: 1.0
+    retrieval_answer_f1_cost: 0.001
   - name: autocomplete_search
     args:
       query: OSLO
@@ -470,12 +555,33 @@ The output is a list of statistics for each question from the reference Q&A data
 - `answer_relevance_error`: (optional) error message if answer relevance evaluation failed
 - `answer_relevance_cost`: The LLM use cost of computing `answer_relevance`, in US dollars
 - `actual_steps`: (optional) copy of the steps in the evaluation target, if specified there
-- `steps_score`: a real number between 0 and 1, computed by comparing the results of the last steps that were executed to the reference's last group of steps. If there is no match in the actual steps, then the score is `0`. Otherwise, it is calculated as the number of the matched steps on the last group divided by the total number of steps in the last group.
+- `steps_score`: a real number between 0 and 1, computed by comparing the results of the last executed steps to the output of the reference's last group of steps.
+    - If there is no match in the actual steps, then the score is `0.0`
+    - If the executed step's name is "retrieval" and the last reference group contains a retrieval step, then the score is the [recall at k](#context-recallk) of the retrieved document ids with respect to the reference.
+    - Otherwise, the score is the number of the matched steps on the last group divided by the total number of steps in the last group.
 - `input_tokens`: input tokens usage
 - `output_tokens`: output tokens usage
 - `total_tokens`: total tokens usage
 - `elapsed_sec`: elapsed seconds
+All `actual_steps` with `name` "retrieval" contain:
+- `retrieval_answer_recall`: (optional) recall of the retrieved context with respect to the reference answer, if evaluation succeeds
+- `retrieval_answer_recall_reason`: (optional) LLM reasoning in evaluating `retrieval_answer_recall`
+- `retrieval_answer_recall_error`: (optional) error message if `retrieval_answer_recall` evaluation fails
+- `retrieval_answer_recall_cost`: cost of evaluating `retrieval_answer_recall`, in US dollars
+- `retrieval_answer_precision`: (optional) precision of the retrieved context with respect to the reference answer, if evaluation succeeds
+- `retrieval_answer_precision_reason`: (optional) LLM reasoning in evaluating `retrieval_answer_precision`
+- `retrieval_answer_precision_error`: (optional) error message if `retrieval_answer_precision` evaluation fails
+- `retrieval_answer_precision_cost`: cost of evaluating `retrieval_answer_precision`, in US dollars
+- `retrieval_answer_f1`: (optional) F1 score of the retrieved context with respect to the reference answer, if `retrieval_answer_recall` and `retrieval_answer_precision` succeed
+- `retrieval_answer_f1_cost`: The sum of `retrieval_answer_recall_cost` and `retrieval_answer_precision_cost`
+- `retrieval_context_recall`: (optional) recall of the retrieved context with respect to the reference answer, if evaluation succeeds
+- `retrieval_context_recall_error`: (optional) error message if `retrieval_context_recall` evaluation fails
+- `retrieval_context_precision`: (optional) precision of the retrieved context with respect to the reference answer, if evaluation succeeds
+- `retrieval_context_precision_error`: (optional) error message if `retrieval_context_precision` evaluation fails
+- `retrieval_context_f1`: (optional) F1 score of the retrieved context with respect to the reference answer, if `retrieval_context_recall` and `retrieval_context_precision` succeed
 #### Aggregates Keys
 The `aggregates` object provides aggregated evaluation metrics.
@@ -499,6 +605,9 @@ Aggregates are:
     - `once_per_sample`: how many times each step was executed, counted only once per question
     - `empty_results`: how many times the step was executed and returned empty results
     - `errors`: how many times the step was executed and resulted in error
+  - `retrieval_context_recall`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_recall` for all successful questions in this template
+  - `retrieval_context_precision`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_precision` for all successful questions in this template
+  - `retrieval_context_f1`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_f1` for all successful questions in this template
 - `micro`: statistics across questions, regardless of template. It includes:
   - `number_of_error_samples`: total number of questions, which resulted in error response
   - `number_of_success_samples`: total number of questions, which resulted in successful response
@@ -511,6 +620,9 @@ Aggregates are:
   - `answer_f1`: `sum`, `mean`, `median`, `min` and `max` for `answer_f1` of all successful questions
   - `answer_relevance`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_relevance` of all successful questions
   - `answer_relevance_cost`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_relevance_cost` of all successful questions
+  - `retrieval_context_recall`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_recall` of all successful questions
+  - `retrieval_context_precision`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_precision` of all successful questions
+  - `retrieval_context_f1`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_f1` of all successful questions
   - `steps_score`: `sum`, `mean`, `median`, `min` and `max` for `steps_score` of all successful questions
 - `macro`: averages across templates, i.e., the mean of each metric per template, averaged. It includes:
   - `input_tokens`: `mean` for `input_tokens`
@@ -522,6 +634,9 @@ Aggregates are:
   - `answer_f1`: `mean` for `answer_f1`
   - `answer_relevance`: `mean` for `answer_relevance`
   - `answer_relevance_cost`: `mean` for `answer_relevance_cost`
+  - `retrieval_context_recall`: `mean` for `retrieval_context_recall`
+  - `retrieval_context_precision`: `mean` for `retrieval_context_precision`
+  - `retrieval_context_f1`: `mean` for `retrieval_context_f1`
   - `steps_score`: `mean` for `steps_score`
 #### Example Aggregates
@@ -898,18 +1013,30 @@ macro:
     mean: 25.911653497483996
 ```
+### SPARQL queries comparison
+The algorithm iterates over all subsets of columns in the actual result of the same size as in the reference result.
+For each subset, it compares the set of columns (skipping optional columns).
+It matches floating-point numbers up to a 1e-8 precision. It does not do this for special types such as duration.
+The average time complexity is O(nr\*nc_ref!\*binomial(nc_act, nc_ref)), where
+* *nr* is the number of rows in the actual result
+* *nc_ref* is the number of columns in the reference result
+* *nc_act* is the number of columns in the actual result
 ### Retrieval Evaluation
-The following metrics are based on the ids of retrieved documents.
+The following metrics are based on the content of retrieved documents.
-#### Recall@k Metric
+#### Context Recall@k
 The fraction of relevant items among the top *k* recommendations. It answers the question: "Of all items the user cares about, how many did we include in the first k spots?"
 * **Formula**:
     $`
     \frac{\text{Number of relevant items in top k}}{\text{Number of relevant items}}
     `$
-* **Calculation**: Count the number of relevant items in the top `k` retrieved results; divide that by the *total* number of relevant items.
+* **Calculation**: Count the number of relevant items in the top `k` retrieved results; divide that by the first 'k' relevant items.
 * **Example**: Suppose there are 4 relevant documents for a given query. Suppose our system retrieves 3 of them in the top 5 results (`k=5`). Recall@5 is `3 / 4 = 0.75`.
 ```python
@@ -920,7 +1047,7 @@ recall_at_k(
 )  # => 0.75
 ```
-#### Average Precision (AP) Metric
+#### Context Precision@k
 Evaluates a ranked list of recommendations by looking at the precision at the position of each correctly retrieved item. It rewards systems for placing relevant items higher up in the list. It's more sophisticated than just looking at precision at a single cutoff because it considers the entire ranking.
 * **Formula**:
@@ -950,3 +1077,4 @@ average_precision(
     retrieved_docs=[1, 4, 3, 5, 7]
 ) # ~=> 0.8056
 ```
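
Note on the `SPARQL queries comparison` section added in this hunk: the description (try each selection of actual columns of the same size as the reference, compare cell values, tolerate floating-point differences up to 1e-8) can be illustrated with the sketch below. It is not the package's implementation. The function names (`is_number`, `cells_match`, `rows_match_as_set`, `results_match`) and the row format (a list of dicts keyed by column name) are assumptions made for illustration, and optional columns, `required_columns` and `ordered` handling are left out.

```python
import math
from itertools import permutations

ABS_TOL = 1e-8  # tolerance quoted in the README text


def is_number(value):
    # bool is a subclass of int, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def cells_match(ref, act):
    """Compare two cells; numbers are compared up to ABS_TOL (hypothetical helper)."""
    if is_number(ref) and is_number(act):
        return math.isclose(ref, act, rel_tol=0.0, abs_tol=ABS_TOL)
    return ref == act


def rows_match_as_set(ref_rows, act_rows):
    """Greedy one-to-one matching of rows, ignoring row order."""
    if len(ref_rows) != len(act_rows):
        return False
    unused = list(act_rows)
    for ref_row in ref_rows:
        for i, act_row in enumerate(unused):
            if all(cells_match(r, a) for r, a in zip(ref_row, act_row)):
                del unused[i]
                break
        else:
            return False
    return True


def results_match(ref_cols, ref_rows, act_cols, act_rows):
    """Try every ordered selection of len(ref_cols) actual columns."""
    ref_table = [tuple(row.get(c) for c in ref_cols) for row in ref_rows]
    # permutations(act_cols, n) yields binomial(nc_act, n) * n! candidates
    for candidate in permutations(act_cols, len(ref_cols)):
        act_table = [tuple(row.get(c) for c in candidate) for row in act_rows]
        if rows_match_as_set(ref_table, act_table):
            return True
    return False


reference = [{"name": "OSLO T1", "power": 100.0}, {"name": "OSLO T2", "power": 200.0}]
actual = [
    {"t": "OSLO T2", "p": 200.0 + 1e-10, "extra": "x"},
    {"t": "OSLO T1", "p": 100.0, "extra": "y"},
]
print(results_match(["name", "power"], reference, ["p", "extra", "t"], actual))  # True
```

The candidate loop is where the `nc_ref!*binomial(nc_act, nc_ref)` factor in the stated complexity comes from; the greedy row matching here is quadratic in the number of rows, which is a simplification of this sketch rather than something the README claims.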

graphrag_eval-4.0.0/PKG-INFO → graphrag_eval-5.0.1/README.md RENAMED

@@ -1,17 +1,3 @@
-Metadata-Version: 2.3
-Name: graphrag-eval
-Version: 4.0.0
-Summary: For assessing question answering systems' final answers and intermediate steps, against a given set of questions, reference answers and steps.
-License: Apache-2.0
-Author: Neli Hateva
-Author-email: neli.hateva@graphwise.ai
-Requires-Python: >=3.12,<3.13
-Classifier: License :: OSI Approved :: Apache Software License
-Classifier: Programming Language :: Python :: 3
-Classifier: Programming Language :: Python :: 3.12
-Project-URL: Repository, https://github.com/Ontotext-AD/qa-eval
-Description-Content-Type: text/markdown
 <p align="center">
   <img alt="Graphwise Logo" src=".github/Graphwise_Logo.jpg">
 </p>
@@ -50,7 +36,7 @@ graphrag-eval = {version = "*", extras = ["openai"]}
 ## Maintainers
 Developed and maintained by [Graphwise](https://graphwise.ai/).
-For issues or feature requests, please open [a GitHub issue](https://github.com/Ontotext-AD/qa-eval/issues).
+For issues or feature requests, please open [a GitHub issue](https://github.com/Ontotext-AD/graphrag-eval/issues).
 ## Command Line Use
@@ -91,13 +77,14 @@ A reference corpus is a list of templates, each of which contains:
   - `question_text`: The natural language query passed to the LLM
   - `reference_steps`: (optional) A list of expected steps grouped by expected order of execution, where all steps in a group can be executed in any order relative to each other, but after all steps in the previous group and before all steps in the next group.
   - `reference_answer`: (optional) The expected answer to the question
 The assumption is that the final answer to the question is derived from the outputs of the steps, which are executed last (last level).
 Each step includes:
 - `name`: The type of step being performed (e.g., `sparql_query`)
 - `args`: Arguments of the step (e.g., arguments to a tool used in the step, such as a SPARQL query)
-- `output`: The expected output from the step
+- `output`: The expected output from the step.
 - `output_media_type`: (optional, missing or one of `application/sparql-results+json`, `application/json`) Indicates how the output of a step must be processed
 - `ordered`: (optional, defaults to `false`) For SPARQL query results, whether results order matters. `true` means that the actual result rows must be ordered as the reference result; `false` means that result rows are matched as a set.
 - `required_columns`: (optional) - required only for SPARQL query results; list of binding names, which are required for SPARQL query results to match
@@ -113,7 +100,22 @@ The example corpus below illustrates a minimal but realistic Q&A dataset, showin
     question_text: List all transformers within Substation OSLO
     reference_answer: OSLO T1, OSLO T2
     reference_steps:
-    - - name: sparql_query
+    - - name: retrieval
+        args:
+          query: transformers Substation OSLO
+          k: 2
+        output: |-
+          [
+            {
+              "id": "http://example.com/resource/doc/1",
+              "text": "Transformer OSLO T1 is in Substation Oslo."
+            },
+            {
+              "id": "http://example.com/resource/doc/2",
+              "text": "Transformer OSLO T2 is in Substation Oslo."
+            }
+          ]
+      - name: sparql_query
         args:
           query: |2
@@ -267,6 +269,16 @@ Below is an example response from the question-answering system for a single que
     "total_tokens": 298753,
     "elapsed_sec": 46.48961806297302,
     "actual_steps": [
+        {
+          "name": "retrieval",
+          "args": {
+            "query": "transformers Substation OSLO",
+            "k": 2
+          },
+          "id": "call_3",
+          "status": "success",
+          "output": "[\n  {\n    \"id\": \"http://example.com/resource/doc/1\",\n    \"text\": \"Transformer OSLO T1 is in Substation Oslo.\"\n  },\n  {\n    \"id\": \"http://example.com/resource/doc/2\",\n    \"text\": \"Transformer OSLO T2 is in Substation Oslo.\"\n  }\n]"
+        },
         {
             "name": "autocomplete_search",
             "args": {
@@ -337,7 +349,23 @@ The output is a list of statistics for each question from the reference Q&A data
   question_text: List all transformers within Substation OSLO
   reference_answer: OSLO T1, OSLO T2
   reference_steps:
-  - - name: sparql_query
+  - - name: retrieval
+      args:
+        query: transformers Substation OSLO
+        k: 2
+      matches: call_3
+      output: |-
+        [
+          {
+            "id": "http://example.com/resource/doc/1",
+            "text": "Transformer OSLO T1 is in Substation Oslo."
+          },
+          {
+            "id": "http://example.com/resource/doc/2",
+            "text": "Transformer OSLO T2 is in Substation Oslo."
+          }
+        ]
+  - name: sparql_query
       args:
         query: |2
@@ -378,6 +406,31 @@ The output is a list of statistics for each question from the reference Q&A data
   answer_relevance: 0.9
   answer_relevance_cost: 0.0007
   actual_steps:
+  - name: retrieval
+    id: call_3
+    args:
+      query: transformers Substation OSLO
+      k: 2
+    status: success
+    output: |-
+      [
+        {
+          "id": "http://example.com/resource/doc/1",
+          "text": "Transformer OSLO T1 is in Substation Oslo."
+        },
+        {
+          "id": "http://example.com/resource/doc/2",
+          "text": "Transformer OSLO T2 is in Substation Oslo."
+        }
+      ]
+    retrieval_answer_recall: 1.0
+    retrieval_answer_recall_reason: The context contains all the transformers listed in the reference answer
+    retrieval_answer_recall_cost: 0.0007
+    retrieval_answer_precision: 1.0
+    retrieval_answer_precision_reason: The context contains only transformers listed in the reference answer
+    retrieval_answer_precision_cost: 0.0003
+    retrieval_answer_f1: 1.0
+    retrieval_answer_f1_cost: 0.001
   - name: autocomplete_search
     args:
       query: OSLO
@@ -484,12 +537,33 @@ The output is a list of statistics for each question from the reference Q&A data
 - `answer_relevance_error`: (optional) error message if answer relevance evaluation failed
 - `answer_relevance_cost`: The LLM use cost of computing `answer_relevance`, in US dollars
 - `actual_steps`: (optional) copy of the steps in the evaluation target, if specified there
-- `steps_score`: a real number between 0 and 1, computed by comparing the results of the last steps that were executed to the reference's last group of steps. If there is no match in the actual steps, then the score is `0`. Otherwise, it is calculated as the number of the matched steps on the last group divided by the total number of steps in the last group.
+- `steps_score`: a real number between 0 and 1, computed by comparing the results of the last executed steps to the output of the reference's last group of steps.
+    - If there is no match in the actual steps, then the score is `0.0`
+    - If the executed step's name is "retrieval" and the last reference group contains a retrieval step, then the score is the [recall at k](#context-recallk) of the retrieved document ids with respect to the reference.
+    - Otherwise, the score is the number of the matched steps on the last group divided by the total number of steps in the last group.
 - `input_tokens`: input tokens usage
 - `output_tokens`: output tokens usage
 - `total_tokens`: total tokens usage
 - `elapsed_sec`: elapsed seconds
+All `actual_steps` with `name` "retrieval" contain:
+- `retrieval_answer_recall`: (optional) recall of the retrieved context with respect to the reference answer, if evaluation succeeds
+- `retrieval_answer_recall_reason`: (optional) LLM reasoning in evaluating `retrieval_answer_recall`
+- `retrieval_answer_recall_error`: (optional) error message if `retrieval_answer_recall` evaluation fails
+- `retrieval_answer_recall_cost`: cost of evaluating `retrieval_answer_recall`, in US dollars
+- `retrieval_answer_precision`: (optional) precision of the retrieved context with respect to the reference answer, if evaluation succeeds
+- `retrieval_answer_precision_reason`: (optional) LLM reasoning in evaluating `retrieval_answer_precision`
+- `retrieval_answer_precision_error`: (optional) error message if `retrieval_answer_precision` evaluation fails
+- `retrieval_answer_precision_cost`: cost of evaluating `retrieval_answer_precision`, in US dollars
+- `retrieval_answer_f1`: (optional) F1 score of the retrieved context with respect to the reference answer, if `retrieval_answer_recall` and `retrieval_answer_precision` succeed
+- `retrieval_answer_f1_cost`: The sum of `retrieval_answer_recall_cost` and `retrieval_answer_precision_cost`
+- `retrieval_context_recall`: (optional) recall of the retrieved context with respect to the reference answer, if evaluation succeeds
+- `retrieval_context_recall_error`: (optional) error message if `retrieval_context_recall` evaluation fails
+- `retrieval_context_precision`: (optional) precision of the retrieved context with respect to the reference answer, if evaluation succeeds
+- `retrieval_context_precision_error`: (optional) error message if `retrieval_context_precision` evaluation fails
+- `retrieval_context_f1`: (optional) F1 score of the retrieved context with respect to the reference answer, if `retrieval_context_recall` and `retrieval_context_precision` succeed
 #### Aggregates Keys
 The `aggregates` object provides aggregated evaluation metrics.
@@ -513,6 +587,9 @@ Aggregates are:
     - `once_per_sample`: how many times each step was executed, counted only once per question
     - `empty_results`: how many times the step was executed and returned empty results
     - `errors`: how many times the step was executed and resulted in error
+  - `retrieval_context_recall`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_recall` for all successful questions in this template
+  - `retrieval_context_precision`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_precision` for all successful questions in this template
+  - `retrieval_context_f1`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_f1` for all successful questions in this template
 - `micro`: statistics across questions, regardless of template. It includes:
   - `number_of_error_samples`: total number of questions, which resulted in error response
   - `number_of_success_samples`: total number of questions, which resulted in successful response
@@ -525,6 +602,9 @@ Aggregates are:
   - `answer_f1`: `sum`, `mean`, `median`, `min` and `max` for `answer_f1` of all successful questions
   - `answer_relevance`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_relevance` of all successful questions
   - `answer_relevance_cost`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_relevance_cost` of all successful questions
+  - `retrieval_context_recall`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_recall` of all successful questions
+  - `retrieval_context_precision`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_precision` of all successful questions
+  - `retrieval_context_f1`: `sum`, `mean`, `median`, `min` and `max` statistics for `retrieval_context_f1` of all successful questions
   - `steps_score`: `sum`, `mean`, `median`, `min` and `max` for `steps_score` of all successful questions
 - `macro`: averages across templates, i.e., the mean of each metric per template, averaged. It includes:
   - `input_tokens`: `mean` for `input_tokens`
@@ -536,6 +616,9 @@ Aggregates are:
   - `answer_f1`: `mean` for `answer_f1`
   - `answer_relevance`: `mean` for `answer_relevance`
   - `answer_relevance_cost`: `mean` for `answer_relevance_cost`
+  - `retrieval_context_recall`: `mean` for `retrieval_context_recall`
+  - `retrieval_context_precision`: `mean` for `retrieval_context_precision`
+  - `retrieval_context_f1`: `mean` for `retrieval_context_f1`
   - `steps_score`: `mean` for `steps_score`
 #### Example Aggregates
@@ -912,18 +995,30 @@ macro:
     mean: 25.911653497483996
 ```
+### SPARQL queries comparison
+The algorithm iterates over all subsets of columns in the actual result of the same size as in the reference result.
+For each subset, it compares the set of columns (skipping optional columns).
+It matches floating-point numbers up to a 1e-8 precision. It does not do this for special types such as duration.
+The average time complexity is O(nr\*nc_ref!\*binomial(nc_act, nc_ref)), where
+* *nr* is the number of rows in the actual result
+* *nc_ref* is the number of columns in the reference result
+* *nc_act* is the number of columns in the actual result
 ### Retrieval Evaluation
-The following metrics are based on the ids of retrieved documents.
+The following metrics are based on the content of retrieved documents.
-#### Recall@k Metric
+#### Context Recall@k
 The fraction of relevant items among the top *k* recommendations. It answers the question: "Of all items the user cares about, how many did we include in the first k spots?"
 * **Formula**:
     $`
     \frac{\text{Number of relevant items in top k}}{\text{Number of relevant items}}
     `$
-* **Calculation**: Count the number of relevant items in the top `k` retrieved results; divide that by the *total* number of relevant items.
+* **Calculation**: Count the number of relevant items in the top `k` retrieved results; divide that by the first 'k' relevant items.
 * **Example**: Suppose there are 4 relevant documents for a given query. Suppose our system retrieves 3 of them in the top 5 results (`k=5`). Recall@5 is `3 / 4 = 0.75`.
 ```python
@@ -934,7 +1029,7 @@ recall_at_k(
 )  # => 0.75
 ```
-#### Average Precision (AP) Metric
+#### Context Precision@k
 Evaluates a ranked list of recommendations by looking at the precision at the position of each correctly retrieved item. It rewards systems for placing relevant items higher up in the list. It's more sophisticated than just looking at precision at a single cutoff because it considers the entire ranking.
 * **Formula**:
@@ -964,4 +1059,3 @@ average_precision(
     retrieved_docs=[1, 4, 3, 5, 7]
 ) # ~=> 0.8056
 ```
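
Note on the retrieval metrics described in this hunk: the two id-based calculations can be sketched as follows. This is a minimal illustration, not the package's code. The function names `sketch_recall_at_k` and `sketch_average_precision`, their parameters, and the sample inputs are hypothetical, and the sketch assumes that retrieved ids are distinct and that average precision is normalised by the number of relevant items actually retrieved.

```python
def sketch_recall_at_k(relevant_docs, retrieved_docs, k):
    """Fraction of all relevant ids that appear in the top k retrieved ids."""
    relevant = set(relevant_docs)
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved_docs[:k] if doc in relevant)
    return hits / len(relevant)


def sketch_average_precision(relevant_docs, retrieved_docs):
    """Mean of precision@rank taken at each rank where a relevant id occurs."""
    relevant = set(relevant_docs)
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(retrieved_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumption: normalise by the number of relevant ids that were retrieved
    return precision_sum / hits if hits else 0.0


# 4 relevant documents, 3 of them within the top 5 results
print(sketch_recall_at_k([1, 2, 3, 4], [1, 9, 2, 3, 8, 4], k=5))  # 0.75

# Relevant ids found at ranks 1, 3 and 4: (1/1 + 2/3 + 3/4) / 3
print(round(sketch_average_precision([1, 3, 5], [1, 4, 3, 5, 7]), 4))  # 0.8056
```

Moving a relevant id to an earlier rank raises the second score but leaves the first unchanged as long as the id stays within the top `k`, which is the ranking sensitivity the README text describes.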

{graphrag_eval-4.0.0 → graphrag_eval-5.0.1}/graphrag_eval/aggregation.py RENAMED

@@ -16,7 +16,22 @@ METRICS = [
     "total_tokens",
     "elapsed_sec"
 ]
+STEPS_METRICS = {
+    "retrieval": [
+        "retrieval_answer_precision",
+        "retrieval_answer_precision_cost",
+        "retrieval_answer_recall",
+        "retrieval_answer_recall_cost",
+        "retrieval_answer_f1",
+        "retrieval_answer_f1_cost",
+        "retrieval_context_precision",
+        "retrieval_context_precision_cost",
+        "retrieval_context_recall",
+        "retrieval_context_recall_cost",
+        "retrieval_context_f1",
+        "retrieval_context_f1_cost",
+    ]
+}
 PROTECTED_METRICS = [
     "input_tokens",
     "output_tokens",
@@ -35,6 +50,19 @@ def stats_for_series(values: Iterable[int | float]) -> dict[str, float]:
     }
+def update_step_metrics_per_template(
+    sample: dict,
+    step_metrics_per_template: dict,
+    template_id: str
+):
+    for step in sample.get("actual_steps", []):
+        if step["name"] in STEPS_METRICS:
+            for metric in STEPS_METRICS[step["name"]]:
+                value = step.get(metric)
+                if value is not None:
+                    step_metrics_per_template[template_id][metric].append(value)
 def update_stats_per_template(
     sample: dict,
     stats_per_template: dict,
@@ -76,6 +104,7 @@ def compute_aggregates(samples: list[dict]) -> dict:
     number_of_samples_per_template_by_status = defaultdict(lambda: defaultdict(int))
     stats_per_template = defaultdict(lambda: defaultdict(list))
     steps_summary_per_template = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
+    step_metrics_per_template = defaultdict(lambda: defaultdict(list))
     # Compute per-template stats
     templates_ids = set()
@@ -94,6 +123,11 @@ def compute_aggregates(samples: list[dict]) -> dict:
             steps_summary_per_template,
             template_id
         )
+        update_step_metrics_per_template(
+            sample,
+            step_metrics_per_template,
+            template_id
+        )
     summary = {"per_template": {}}
@@ -115,6 +149,13 @@ def compute_aggregates(samples: list[dict]) -> dict:
             if series or metric in PROTECTED_METRICS:
                 template_summary[metric] = stats_for_series(series)
+        # Add step metrics for the template
+        template_step_metrics = {}
+        for metric, values in step_metrics_per_template[template_id].items():
+            template_step_metrics[metric] = stats_for_series(values)
+        if template_step_metrics:
+            template_summary["steps"].update(template_step_metrics)
         summary["per_template"][template_id] = template_summary
     # Add micro stats
@@ -137,6 +178,17 @@ def compute_aggregates(samples: list[dict]) -> dict:
         if series or metric in PROTECTED_METRICS:
             summary["micro"][metric] = stats_for_series(series)
+    # Add micro step metrics
+    micro_step_metrics = defaultdict(list)
+    for template_metrics in step_metrics_per_template.values():
+        for metric, values in template_metrics.items():
+            micro_step_metrics[metric].extend(values)
+    step_metrics = {
+        metric: stats_for_series(values)
+        for metric, values in micro_step_metrics.items()
+    }
+    summary["micro"].update(step_metrics)
     # Add macro stats
     summary["macro"] = {}
     for metric in METRICS:
@@ -148,4 +200,17 @@ def compute_aggregates(samples: list[dict]) -> dict:
         if means or metric in PROTECTED_METRICS:
             summary["macro"][metric] = {"mean": mean(means) if means else 0}
+    # Add macro step metrics
+    macro_step_metrics = defaultdict(list)
+    for template_id, template_summary in summary["per_template"].items():
+        if "steps" in template_summary:
+            for metric, stats in template_summary["steps"].items():
+                if "mean" in stats:
+                    macro_step_metrics[metric].append(stats["mean"])
+    step_metrics = {
+        metric: {"mean": mean(values) if values else 0}
+        for metric, values in macro_step_metrics.items()
+    }
+    summary["macro"].update(step_metrics)
     return summary
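
The difference between the micro and macro step-metric aggregation added in this hunk can be illustrated with a small, self-contained sketch. It is not the package's code: `mean_only` is a hypothetical stand-in for `stats_for_series` that reports only the mean, and the template IDs and metric values are made up.

```python
from collections import defaultdict
from statistics import mean


def mean_only(values):
    # Hypothetical stand-in for stats_for_series: reports only the mean.
    return {"mean": mean(values)}


# Per-template lists of one step metric, in the shape that
# update_step_metrics_per_template collects them (values are invented).
step_metrics_per_template = {
    "template_a": {"retrieval_context_f1": [1.0, 1.0, 1.0]},
    "template_b": {"retrieval_context_f1": [0.0]},
}

# Micro: pool every sample's value across templates, then summarize once.
pooled = defaultdict(list)
for template_metrics in step_metrics_per_template.values():
    for metric, values in template_metrics.items():
        pooled[metric].extend(values)
micro = {metric: mean_only(values) for metric, values in pooled.items()}

# Macro: summarize each template first, then average the per-template means.
per_template_means = defaultdict(list)
for template_metrics in step_metrics_per_template.values():
    for metric, values in template_metrics.items():
        per_template_means[metric].append(mean_only(values)["mean"])
macro = {
    metric: {"mean": mean(values)}
    for metric, values in per_template_means.items()
}

print(micro)  # {'retrieval_context_f1': {'mean': 0.75}}
print(macro)  # {'retrieval_context_f1': {'mean': 0.5}}
```

With unevenly sized templates the two views diverge: the micro mean is dominated by the template with more samples, while the macro mean weights each template equally.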

{graphrag_eval-4.0.0 → graphrag_eval-5.0.1}/graphrag_eval/evaluation.py RENAMED

@@ -48,7 +48,7 @@ def run_evaluation(
                         actual_result,
                     )
                 )
-            if "steps" in actual_result:
+            if "actual_steps" in actual_result:
                 eval_result.update(
                     get_steps_evaluation_result_dict(question, actual_result)
                 )

{graphrag_eval-4.0.0 → graphrag_eval-5.0.1}/graphrag_eval/steps/__init__.py RENAMED

@@ -1,13 +1,14 @@
 import json
 from collections import defaultdict
-from .retrieval import recall_at_k
+from .retrieval_context_ids import recall_at_k
 from .sparql import compare_sparql_results
 def compare_steps_outputs(reference: dict, actual: dict) -> float:
-    ref_output = reference["output"]
+    ref_output = reference.get("output")
     act_output = actual["output"]
+    assert ref_output, "Reference step output is mandatory"
     if reference.get("output_media_type") == "application/sparql-results+json":
         return compare_sparql_results(
             json.loads(ref_output),
@@ -17,9 +18,11 @@ def compare_steps_outputs(reference: dict, actual: dict) -> float:
         )
     if reference.get("output_media_type") == "application/json":
         return float(json.loads(ref_output) == json.loads(act_output))
-    if reference["name"] == "retrieval":
-        k = reference["args"]["k"]
-        return recall_at_k(ref_output, act_output, k)
+    if reference["name"] == actual["name"] == "retrieval":
+        ref_contexts_ids = [c["id"] for c in json.loads(ref_output)]
+        act_contexts_ids = [c["id"] for c in json.loads(act_output)]
+        k = actual["args"]["k"]
+        return recall_at_k(ref_contexts_ids, act_contexts_ids, k)
     return float(ref_output == act_output)
@@ -95,9 +98,11 @@ def get_steps_matches(
 def evaluate_steps(
     reference_steps_groups: list[list[dict]],
-    actual_steps: list[dict]
+    actual_steps: list[dict],
+    matches: list[tuple[int, int, int, float]] | None = None
 ) -> float:
-    matches = get_steps_matches(reference_steps_groups, actual_steps)
+    if matches is None:
+        matches = get_steps_matches(reference_steps_groups, actual_steps)
     matches_by_group = defaultdict(list)
     scores_by_group = defaultdict(float)
     for ref_group_idx, ref_match_idx, actual_idx, score in matches:
@@ -110,11 +115,33 @@ def evaluate_steps(
 def get_steps_evaluation_result_dict(reference: dict, target: dict) -> dict:
-    act_steps = target["steps"]
     eval_result = {}
+    act_steps = target.get("actual_steps", [])
     eval_result["actual_steps"] = act_steps
+    for act_step in act_steps:
+        if act_step["name"] == "retrieval":
+            from .retrieval_answer import get_retrieval_evaluation_dict
+            result = get_retrieval_evaluation_dict(
+                question_text=reference["question_text"],
+                reference_answer=reference.get("reference_answer"),
+                actual_answer=target.get("actual_answer"),
+                actual_contexts=json.loads(act_step["output"])
+            )
+            act_step.update(result)
     if "reference_steps" in reference:
         ref_steps = reference["reference_steps"]
-        steps_score = evaluate_steps(ref_steps, act_steps)
+        matches = get_steps_matches(ref_steps, act_steps)
+        steps_score = evaluate_steps(ref_steps, act_steps, matches)
         eval_result["steps_score"] = steps_score
+        for ref_group_idx, ref_match_idx, act_idx, _ in matches:
+            ref_step = ref_steps[ref_group_idx][ref_match_idx]
+            act_step = act_steps[act_idx]
+            if ref_step["name"] == "retrieval":
+                from .retrieval_context_texts import \
+                    get_retrieval_evaluation_dict
+                res = get_retrieval_evaluation_dict(
+                    reference_contexts=json.loads(ref_step["output"]),
+                    actual_contexts=json.loads(act_step["output"])
+                )
+                act_step.update(res)
     return eval_result
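
As a rough illustration of the data shape the retrieval branches above expect, the sketch below builds a reference step and an actual step whose `output` is a JSON-encoded list of contexts with `id` and `text` keys, then extracts the IDs in the same way `compare_steps_outputs` does. The helper name `context_ids` and the sample contexts are hypothetical.

```python
import json


def context_ids(step: dict) -> list[str]:
    # Hypothetical helper: decode the JSON-encoded output and keep only the IDs.
    return [context["id"] for context in json.loads(step["output"])]


reference_step = {
    "name": "retrieval",
    "output": json.dumps([
        {"id": "doc-1", "text": "First reference passage."},
        {"id": "doc-2", "text": "Second reference passage."},
    ]),
}
actual_step = {
    "name": "retrieval",
    # In 5.0.1 the cutoff k is read from the actual step, not the reference.
    "args": {"k": 2},
    "output": json.dumps([
        {"id": "doc-2", "text": "Second reference passage."},
        {"id": "doc-9", "text": "An unrelated passage."},
    ]),
}

ref_ids = context_ids(reference_step)
act_ids = context_ids(actual_step)
print(ref_ids, act_ids)  # ['doc-1', 'doc-2'] ['doc-2', 'doc-9']
```

The `text` values are what the LLM-based context and answer metrics consume, while the `id` values feed `recall_at_k`.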

graphrag_eval-5.0.1/graphrag_eval/steps/retrieval_answer.py ADDED

@@ -0,0 +1,62 @@
+from langevals_ragas.response_context_recall import (
+    RagasResponseContextRecallEntry,
+    RagasResponseContextRecallEvaluator,
+)
+from langevals_ragas.response_context_precision import (
+    RagasResponseContextPrecisionEntry,
+    RagasResponseContextPrecisionEvaluator,
+)
+from graphrag_eval.util import get_f1_dict
+def _evaluate(
+    evaluator: RagasResponseContextRecallEvaluator | RagasResponseContextPrecisionEvaluator,
+    entry: RagasResponseContextRecallEntry | RagasResponseContextPrecisionEntry,
+    metric: str
+) -> dict[str, float | str]:
+    try:
+        result = evaluator.evaluate(entry)
+        if result.status == "processed":
+            return {
+                f"retrieval_answer_{metric}": result.score,
+                f"retrieval_answer_{metric}_cost": result.cost.amount,
+                f"retrieval_answer_{metric}_reason": result.details
+            }
+        else:
+            return {
+                f"retrieval_answer_{metric}_error": result.details
+            }
+    except Exception as e:
+        return {
+            f"retrieval_answer_{metric}_error": str(e)
+        }
+def get_retrieval_evaluation_dict(
+    question_text: str,
+    actual_contexts: list[dict[str, str]],
+    reference_answer: str | None = None,
+    actual_answer: str | None = None,
+    model_name : str = "openai/gpt-4o-mini",
+    max_tokens : int = 65_536
+) -> dict:
+    if not reference_answer and not actual_answer:
+        return {}
+    settings_dict = {
+        "model": model_name,
+        "max_tokens": max_tokens
+    }
+    entry = RagasResponseContextPrecisionEntry(
+        input=question_text,
+        expected_output=reference_answer,
+        output=actual_answer,
+        contexts=[a["text"] for a in actual_contexts]
+    )
+    result = {}
+    evaluator = RagasResponseContextRecallEvaluator(settings=settings_dict)
+    result.update(_evaluate(evaluator, entry, "recall"))
+    evaluator = RagasResponseContextPrecisionEvaluator(settings=settings_dict)
+    result.update(_evaluate(evaluator, entry, "precision"))
+    result.update(get_f1_dict(result, "retrieval_answer"))
+    return result

graphrag_eval-5.0.1/graphrag_eval/steps/retrieval_context_ids.py ADDED

@@ -0,0 +1,50 @@
+from typing import Iterable
+def recall_at_k(relevant_ids: list, retrieved_ids: list, k: int = 10) -> float:
+    """
+    Calculates Recall@k.
+    Args:
+        relevant_ids (list): A list of ground truth relevant document IDs.
+        retrieved_ids (list): A list of retrieved document IDs, ordered by rank.
+        k (int): The cutoff for the retrieval list.
+    Returns:
+        float: The Recall@k score.
+    """
+    retrieved_at_k = retrieved_ids[:k]
+    relevant_at_k = relevant_ids[:k]
+    true_positives = len(set(relevant_at_k).intersection(set(retrieved_at_k)))
+    total_relevant = len(relevant_at_k)
+    if total_relevant == 0:
+        return 0.0
+    return true_positives / total_relevant
+def average_precision(relevant_ids: Iterable, retrieved_ids: Iterable) -> float:
+    """
+    Calculates Average Precision (AP) for a single query.
+    Args:
+        relevant_ids (Iterable): A set of ground truth relevant document IDs.
+        retrieved_ids (Iterable): A list of retrieved document IDs, ordered by rank.
+    Returns:
+        float: The Average Precision score.
+    """
+    relevant_set = set(relevant_ids)
+    hits = 0
+    sum_of_precisions = 0.0
+    for i, doc_id in enumerate(retrieved_ids):
+        if doc_id in relevant_set:
+            hits += 1
+            precision_at_k = hits / (i + 1)
+            sum_of_precisions += precision_at_k
+    total_relevant = len(relevant_set)
+    if total_relevant == 0:
+        return 0.0
+    return sum_of_precisions / total_relevant
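
The new `recall_at_k` truncates the relevant list to `k` as well, which changes the denominator compared with the 4.0.0 version in the deleted `retrieval.py`. The sketch below contrasts the two; the function bodies follow the diff, but the names `recall_at_k_v4` and `recall_at_k_v5` and the sample IDs are hypothetical and used only for the comparison.

```python
def recall_at_k_v4(relevant_docs, retrieved_docs, k=10):
    # 4.0.0 behavior: the denominator is the full set of relevant IDs.
    relevant_set = set(relevant_docs)
    retrieved_set = set(retrieved_docs[:k])
    if not relevant_set:
        return 0.0
    return len(relevant_set & retrieved_set) / len(relevant_set)


def recall_at_k_v5(relevant_ids, retrieved_ids, k=10):
    # 5.0.1 behavior: both lists are cut to the first k entries.
    retrieved_at_k = retrieved_ids[:k]
    relevant_at_k = relevant_ids[:k]
    if not relevant_at_k:
        return 0.0
    true_positives = len(set(relevant_at_k) & set(retrieved_at_k))
    return true_positives / len(relevant_at_k)


relevant = ["a", "b", "c"]
retrieved = ["a", "x", "b"]

old_score = recall_at_k_v4(relevant, retrieved, k=2)  # 1 hit out of 3 relevant
new_score = recall_at_k_v5(relevant, retrieved, k=2)  # 1 hit out of 2 relevant
print(old_score, new_score)
```

Because the relevant list is now sliced, its order matters in 5.0.1, and the score can no longer be capped below 1.0 merely because there are more relevant items than `k`.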

graphrag_eval-5.0.1/graphrag_eval/steps/retrieval_context_texts.py ADDED

@@ -0,0 +1,59 @@
+from langevals_ragas.context_precision import (
+    RagasContextPrecisionEntry,
+    RagasContextPrecisionEvaluator,
+)
+from langevals_ragas.context_recall import (
+    RagasContextRecallEntry,
+    RagasContextRecallEvaluator,
+)
+from graphrag_eval.util import get_f1_dict
+def _evaluate(
+    entry: RagasContextRecallEntry | RagasContextPrecisionEntry,
+    evaluator: RagasContextRecallEvaluator | RagasContextPrecisionEvaluator,
+    metric: str
+) -> dict:
+    try:
+        result = evaluator.evaluate(entry)
+        if result.status == "processed":
+            result_dict = {
+                f"retrieval_context_{metric}": result.score,
+            }
+            if result.details:
+                result_dict[f"retrieval_context_{metric}_reason"] = result.details
+            if result.cost is not None:
+                result_dict[f"retrieval_context_{metric}_cost"] = result.cost.amount
+            return result_dict
+        else:
+            return {
+                f"retrieval_context_{metric}_error": result.details,
+            }
+    except Exception as e:
+        return {
+            f"retrieval_context_{metric}_error": str(e),
+        }
+def get_retrieval_evaluation_dict(
+    reference_contexts: list[dict[str, str]],
+    actual_contexts: list[dict[str, str]],
+    model_name : str = "openai/gpt-4o-mini",
+    max_tokens : int = 65_536
+) -> dict:
+    settings_dict = {
+        "model": model_name,
+        "max_tokens": max_tokens
+    }
+    entry = RagasContextRecallEntry(
+        expected_contexts=[a["text"] for a in reference_contexts],
+        contexts=[a["text"] for a in actual_contexts]
+    )
+    result = {}
+    evaluator = RagasContextRecallEvaluator(settings=settings_dict)
+    result.update(_evaluate(entry, evaluator, "recall"))
+    evaluator = RagasContextPrecisionEvaluator(settings=settings_dict)
+    result.update(_evaluate(entry, evaluator, "precision"))
+    result.update(get_f1_dict(result, "retrieval_context"))
+    return result

{graphrag_eval-4.0.0 → graphrag_eval-5.0.1}/graphrag_eval/steps/sparql.py RENAMED

@@ -1,10 +1,31 @@
 from collections import Counter
+import re
 from typing import Union
 import itertools
 import math
-def truncate(number, decimals=0):
+XSD_NUMERIC_TYPES = {
+    "http://www.w3.org/2001/XMLSchema#integer",
+    "http://www.w3.org/2001/XMLSchema#int",
+    "http://www.w3.org/2001/XMLSchema#long",
+    "http://www.w3.org/2001/XMLSchema#short",
+    "http://www.w3.org/2001/XMLSchema#byte",
+    "http://www.w3.org/2001/XMLSchema#nonNegativeInteger",
+    "http://www.w3.org/2001/XMLSchema#positiveInteger",
+    "http://www.w3.org/2001/XMLSchema#unsignedLong",
+    "http://www.w3.org/2001/XMLSchema#unsignedInt",
+    "http://www.w3.org/2001/XMLSchema#unsignedShort",
+    "http://www.w3.org/2001/XMLSchema#unsignedByte",
+}
+XSD_FLOAT_TYPES = {
+    "http://www.w3.org/2001/XMLSchema#decimal",
+    "http://www.w3.org/2001/XMLSchema#double",
+    "http://www.w3.org/2001/XMLSchema#float",
+}
+XSD_BOOLEAN = "http://www.w3.org/2001/XMLSchema#boolean"
+def truncate(number: float, decimals: int = 0) -> float:
     """
     Truncates a float to a certain number of decimal places.
     """
@@ -19,37 +40,92 @@ def truncate(number, decimals=0):
     return math.trunc(number * factor) / factor
+def parse_sparql_term(term: dict) -> Union[str, float, bool, None]:
+    if not isinstance(term, dict):
+        return term
+    term_type = term.get("type")
+    value = term.get("value")
+    if term_type in ("literal", "typed-literal"):
+        datatype = term.get("datatype")
+        if not datatype:
+            return value
+        if datatype in XSD_NUMERIC_TYPES:
+            try:
+                return int(value)
+            except (ValueError, TypeError):
+                return value
+        elif datatype in XSD_FLOAT_TYPES:
+            try:
+                value = float(value)
+                return truncate(value, 5)
+            except (ValueError, TypeError):
+                return value
+        elif datatype == XSD_BOOLEAN:
+            return value.lower() in ("true", "1")
+        else:
+            return value
+    return value
 def get_var_to_values(
     vars_: list[str],
     bindings: list[dict],
 ) -> dict[str, list]:
-    var_to_values = dict()
+    var_to_values = {}
     for var in vars_:
         var_to_values[var] = []
         for binding in bindings:
             if var in binding:
-                var_to_values[var].append(binding[var]["value"])
+                var_to_values[var].append(parse_sparql_term(binding[var]))
             else:
                 var_to_values[var].append(None)
     return dict(var_to_values)
-def parse_dict2table(
+def convert_table_dict2lines(
     reference_vars: Union[list[str], tuple[str, ...]],
     reference_var_to_values: dict[str, list],
 ) -> list[str]:
+    """Converts a dictionary of lists (columns) into a list of row strings.
+    This function takes a dictionary where keys are column headers and values are
+    lists of column data. It transforms this column-oriented data into a list
+    of rows, where each row is a single string formed by concatenating the
+    string representation of its cell values.
+    It assumes that all lists in the `reference_var_to_values` dictionary
+    have the same length.
+    Args:
+        reference_vars: An ordered list or tuple of keys that defines the
+            column order for the output rows.
+        reference_var_to_values: A dictionary mapping column names (keys) to
+            lists of their corresponding values.
+    Returns:
+        A list of strings, where each string is a concatenation of the values
+        for a single row, ordered according to `reference_vars`.
+    Example:
+        >>> columns = ['name', 'age', 'city']
+        >>> data = {
+        ...     'name': ['Alice', 'Bob'],
+        ...     'age': [30, 25],
+        ...     'city': ['New York', 'Los Angeles']
+        ... }
+        >>> convert_table_dict2lines(columns, data)
+        ['Alice30New York', 'Bob25Los Angeles']
+    """
     result = []
     num_rows = len(reference_var_to_values[reference_vars[0]])
     for row_idx in range(num_rows):
         row = []
         for reference_var in reference_vars:
             val = reference_var_to_values[reference_var][row_idx]
-            if isinstance(val, float):
-                val = truncate(val, 5)
-            if isinstance(val, int):
-                print(val)
-                val = float(val)
-                print(str(val))
             val = str(val)
             row.append(val)
         result.append("".join(row))
@@ -64,8 +140,6 @@ def compare_values(
     results_are_ordered: bool,
 ) -> bool:
-    if len(reference_vars) > len(actual_vars):
-        return False
     if len(reference_vars) < len(actual_vars):
         for combination in itertools.combinations(actual_vars, len(reference_vars)):
             if compare_values(
@@ -78,9 +152,9 @@ def compare_values(
                 return True
         return False
-    table = parse_dict2table(reference_vars, reference_var_to_values)
+    table = convert_table_dict2lines(reference_vars, reference_var_to_values)
     for permutation in itertools.permutations(actual_vars):
-        actual_table = parse_dict2table(permutation, actual_var_to_values)
+        actual_table = convert_table_dict2lines(permutation, actual_var_to_values)
         if (results_are_ordered and table == actual_table) or (
             not results_are_ordered and Counter(table) == Counter(actual_table)
         ):
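
The effect of the new typed parsing of SPARQL result terms can be sketched as follows. This is a condensed illustration, not the package's function: `parse_term` covers only three datatypes and omits the `try`/`except` fallbacks shown in the diff, and the `factor = 10 ** decimals` line in `truncate` is an assumption, since the diff shows only that function's final `return`.

```python
import math

XSD = "http://www.w3.org/2001/XMLSchema#"


def truncate(number: float, decimals: int = 0) -> float:
    # Assumed body: the diff shows only the final return statement.
    factor = 10 ** decimals
    return math.trunc(number * factor) / factor


def parse_term(term: dict):
    # Condensed sketch covering a subset of the datatypes handled in the diff.
    value = term.get("value")
    datatype = term.get("datatype")
    if term.get("type") not in ("literal", "typed-literal") or not datatype:
        return value  # IRIs, blank nodes and plain literals stay strings
    if datatype == XSD + "integer":
        return int(value)
    if datatype == XSD + "double":
        return truncate(float(value), 5)
    if datatype == XSD + "boolean":
        return value.lower() in ("true", "1")
    return value


int_term = {"type": "literal", "datatype": XSD + "integer", "value": "42"}
double_term = {"type": "literal", "datatype": XSD + "double", "value": "3.14159265"}
bool_term = {"type": "literal", "datatype": XSD + "boolean", "value": "TRUE"}
iri_term = {"type": "uri", "value": "http://example.org/a"}

parsed = [parse_term(t) for t in (int_term, double_term, bool_term, iri_term)]
print(parsed)  # [42, 3.14159, True, 'http://example.org/a']
```

Since rows are later compared as concatenated strings, parsing by datatype means that typed literals differing only in lexical form, such as `"42"` and `"042"` for an integer, are rendered identically.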

graphrag_eval-5.0.1/graphrag_eval/util.py ADDED

@@ -0,0 +1,25 @@
+def compute_f1(recall: float | str | None, precision: float | str | None) -> float | None:
+    if recall is None or precision is None:
+        return None
+    recall = float(recall)
+    precision = float(precision)
+    if recall == 0.0 and precision == 0.0:
+        return 0.0
+    return 2 * (recall * precision) / (recall + precision)
+def get_f1_dict(
+    input_dict: dict,
+    prefix: str
+) -> dict:
+    recall = input_dict.get(f"{prefix}_recall")
+    precision = input_dict.get(f"{prefix}_precision")
+    f1 = compute_f1(recall, precision)
+    if f1 is None:
+        return {}
+    result = {f"{prefix}_f1": f1}
+    recall_cost = input_dict.get(f"{prefix}_recall_cost")
+    precision_cost = input_dict.get(f"{prefix}_precision_cost")
+    if recall_cost is not None and precision_cost is not None:
+        result[f"{prefix}_f1_cost"] = recall_cost + precision_cost
+    return result
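
A short usage sketch of the two helpers above, with the function bodies copied from the diff and made-up input values. It shows that F1 is only emitted when both recall and precision are present, and that the F1 cost is the sum of the two underlying costs.

```python
def compute_f1(recall, precision):
    # Harmonic mean of recall and precision; None if either is missing.
    if recall is None or precision is None:
        return None
    recall = float(recall)
    precision = float(precision)
    if recall == 0.0 and precision == 0.0:
        return 0.0
    return 2 * (recall * precision) / (recall + precision)


def get_f1_dict(input_dict: dict, prefix: str) -> dict:
    recall = input_dict.get(f"{prefix}_recall")
    precision = input_dict.get(f"{prefix}_precision")
    f1 = compute_f1(recall, precision)
    if f1 is None:
        return {}
    result = {f"{prefix}_f1": f1}
    recall_cost = input_dict.get(f"{prefix}_recall_cost")
    precision_cost = input_dict.get(f"{prefix}_precision_cost")
    if recall_cost is not None and precision_cost is not None:
        result[f"{prefix}_f1_cost"] = recall_cost + precision_cost
    return result


# Invented scores and costs, for illustration only.
complete = {
    "retrieval_answer_recall": 0.5,
    "retrieval_answer_precision": 1.0,
    "retrieval_answer_recall_cost": 0.25,
    "retrieval_answer_precision_cost": 0.5,
}
# Precision failed, so only an error key is present for it.
partial = {
    "retrieval_answer_recall": 0.5,
    "retrieval_answer_precision_error": "timeout",
}

f1_complete = get_f1_dict(complete, "retrieval_answer")
f1_partial = get_f1_dict(partial, "retrieval_answer")
print(f1_complete)
print(f1_partial)  # {}
```

When one of the evaluators returns an error instead of a score, no `_f1` key is produced, so the aggregation skips that sample for the F1 metric rather than counting it as zero.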

{graphrag_eval-4.0.0 → graphrag_eval-5.0.1}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "graphrag-eval"
-version = "4.0.0"
+version = "5.0.1"
 description = "For assessing question answering systems' final answers and intermediate steps, against a given set of questions, reference answers and steps."
 authors = [
     {name = "Neli Hateva", email = "neli.hateva@graphwise.ai"},
@@ -11,20 +11,15 @@ license = "Apache-2.0"
 requires-python = ">=3.12,<3.13"
 [project.urls]
-repository = "https://github.com/Ontotext-AD/qa-eval"
+repository = "https://github.com/Ontotext-AD/graphrag-eval"
-[build-system]
-requires = ["poetry-core>=2.0.0"]
-build-backend = "poetry.core.masonry.api"
-[tool.poetry.group.test.dependencies]
-pytest = "<9,>=8"
-pytest-cov = "<7,>=6"
-jsonlines = "4.0.0"
-pyyaml = "^6.0.2"
+[tool.poetry.dependencies]
+openai = { version = "^1.97.0", optional = true }
+langevals = { version = "0.1.*", optional = true }
+langevals-ragas = { version = "^0.1.12", optional = true }
-[tool.poetry.group.test]
-optional = true
+[tool.poetry.extras]
+openai = ["openai", "langevals", "langevals-ragas"]
 [tool.poetry.group.openai.dependencies]
 openai = "^1.97.0"
@@ -34,5 +29,18 @@ langevals-ragas = "^0.1.12"
 [tool.poetry.group.openai]
 optional = true
+[tool.poetry.group.test.dependencies]
+pytest = "<9,>=8"
+pytest-cov = "<7,>=6"
+jsonlines = "4.0.0"
+pyyaml = "^6.0.2"
+[tool.poetry.group.test]
+optional = true
 [project.scripts]
-answer-correctness = "qa_eval.answer_evaluation:main"
+answer-correctness = "graphrag_eval.answer_correctness:main"
+[build-system]
+requires = ["poetry-core>=2.0.0"]
+build-backend = "poetry.core.masonry.api"

graphrag_eval-4.0.0/graphrag_eval/steps/retrieval.py DELETED

@@ -1,55 +0,0 @@
-from typing import Iterable
-def recall_at_k(relevant_docs: Iterable, retrieved_docs: list, k: int = 10) -> float:
-    """
-    Calculates Recall@k.
-    Args:
-        relevant_docs (Iterable): A set of ground truth relevant document IDs.
-        retrieved_docs (list): A list of retrieved document IDs, ordered by rank.
-        k (int): The cutoff for the retrieval list.
-    Returns:
-        float: The Recall@k score.
-    """
-    retrieved_at_k = retrieved_docs[:k]
-    relevant_set = set(relevant_docs)
-    retrieved_set = set(retrieved_at_k)
-    true_positives = len(relevant_set.intersection(retrieved_set))
-    total_relevant = len(relevant_set)
-    if total_relevant == 0:
-        return 0.0
-    return true_positives / total_relevant
-def average_precision(relevant_docs: Iterable, retrieved_docs: Iterable) -> float:
-    """
-    Calculates Average Precision (AP) for a single query.
-    Args:
-        relevant_docs (Iterable): A set of ground truth relevant document IDs.
-        retrieved_docs (Iterable): A list of retrieved document IDs, ordered by rank.
-    Returns:
-        float: The Average Precision score.
-    """
-    relevant_set = set(relevant_docs)
-    hits = 0
-    sum_of_precisions = 0.0
-    for i, doc_id in enumerate(retrieved_docs):
-        if doc_id in relevant_set:
-            hits += 1
-            precision_at_k = hits / (i + 1)
-            sum_of_precisions += precision_at_k
-    total_relevant = len(relevant_set)
-    if total_relevant == 0:
-        return 0.0
-    return sum_of_precisions / total_relevant

{graphrag_eval-4.0.0 → graphrag_eval-5.0.1}/LICENSE RENAMED

File without changes

{graphrag_eval-4.0.0 → graphrag_eval-5.0.1}/graphrag_eval/__init__.py RENAMED

File without changes

{graphrag_eval-4.0.0 → graphrag_eval-5.0.1}/graphrag_eval/answer_correctness.py RENAMED

File without changes

{graphrag_eval-4.0.0 → graphrag_eval-5.0.1}/graphrag_eval/answer_relevance.py RENAMED

File without changes
