PyPI - graphrag-eval - Versions diff: 5.3.0 (tar.gz) → 6.0.0 (tar.gz)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (26)

graphrag_eval-5.3.0/README.md → graphrag_eval-6.0.0/PKG-INFO RENAMED

@@ -1,10 +1,28 @@
+Metadata-Version: 2.3
+Name: graphrag-eval
+Version: 6.0.0
+Summary: For assessing question answering systems' final answers and intermediate steps, against a given set of questions, reference answers and steps.
+License: Apache-2.0
+Author: Philip Ganchev
+Author-email: philip.ganchev@graphwise.ai
+Requires-Python: >=3.12,<3.13
+Classifier: License :: OSI Approved :: Apache Software License
+Classifier: Programming Language :: Python :: 3
+Classifier: Programming Language :: Python :: 3.12
+Provides-Extra: llm
+Requires-Dist: pydantic (==2.12.5)
+Requires-Dist: python-dateutil (==2.9.0.post0)
+Requires-Dist: ragas (==0.4.3) ; extra == "llm"
+Project-URL: Repository, https://github.com/Ontotext-AD/graphrag-eval
+Description-Content-Type: text/markdown
 <p align="center">
   <img alt="Graphwise Logo" src="https://github.com/Ontotext-AD/graphrag-eval/blob/main/.github/Graphwise_Logo.jpg">
 </p>
 # QA Evaluation
-This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used to reach the answer (such as orchestrated and executed steps), compared to the given reference steps.
+This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used to reach the answer (such as orchestrated and executed steps), compared to the given reference steps. The library provides built-in evaluation metrics and allows the user to define their own (custom) metrics.
 ## License
@@ -21,15 +39,15 @@ or add the following dependency in your `pyproject.toml` file:
 graphrag-eval = "*"
 ```
-To evaluate answer relevance and answer correctness:
+To evaluate answer relevance and answer correctness, or to use a [custom evaluation](#custom-evaluation-custom-metrics):
 ```bash
-pip install 'graphrag-eval[ragas]'
+pip install 'graphrag-eval[llm]'
 ```
 or add the following dependency in your `pyproject.toml` file:
 ```toml
-graphrag-eval = {version = "*", extras = ["ragas"]}
+graphrag-eval = {version = "*", extras = ["llm"]}
 ```
 ## Maintainers
@@ -42,8 +60,12 @@ For issues or feature requests, please open [a GitHub issue](https://github.com/
 To evaluate only correctness of final answers (system responses), you can clone this repository and run the code on the command line:
 1. Prepare an input TSV file with columns `Question`, `Reference answer` and `Actual answer`
-1. Execute `poetry install --with ragas`
-1. Execute `OPENAI_API_KEY=<your_api_key> poetry run answer-correctness -i <input_file.tsv> -o <output_file.tsv>`
+1. Execute `poetry install --with llm`
+1. Execute
+   ```<LLM_ACCESS_VARIABLE>=<your_api_key> poetry run answer-correctness -i <input_file.tsv> -o <output_file.tsv>```
+  replacing `<LLM_ACCESS_VARIABLE>` with the environment variable that your LLM provider uses for its API key.
+  Example:
+    ```OPENAI_API_KEY=XXX poetry run answer-correctness -i reference.tsv -o evaluations.tsv```
 We plan to improve CLI support in future releases.
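
Step 1 of the command-line procedure above expects a TSV file with the columns `Question`, `Reference answer` and `Actual answer`. A minimal sketch of producing such a file with Python's standard `csv` module follows. The example row and the helper name `write_tsv` are hypothetical, and the sketch assumes the CLI accepts standard tab-separated output with a header row, which the README does not spell out.

```python
import csv
import io

# Column names taken from the README step above.
COLUMNS = ["Question", "Reference answer", "Actual answer"]


def write_tsv(rows, stream):
    """Write rows (dicts keyed by COLUMNS) to stream as tab-separated values."""
    writer = csv.DictWriter(
        stream, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical example row; replace with your own questions and answers.
rows = [
    {
        "Question": "Which transformers are in Substation Oslo?",
        "Reference answer": "OSLO T1 and OSLO T2",
        "Actual answer": "Transformers OSLO T1 and OSLO T2",
    }
]

# An in-memory buffer keeps the sketch self-contained;
# use open("reference.tsv", "w", newline="") to write a real file.
buffer = io.StringIO()
write_tsv(rows, buffer)
tsv_text = buffer.getvalue()
```

The resulting file can then be passed to the `answer-correctness` command with `-i`.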
@@ -55,16 +77,153 @@ To evaluate answers and/or steps:
 1. Format the answers and/or steps you want to evaluate: section [Responses to evaluate](#Responses-to-evaluate)
 1. To evaluate answer relevance:
     1. Include `actual_answer` in the target data to evaluate
-    1. Set environment variable `OPENAI_API_KEY` appropriately
+    1. Set the appropriate environment variable (e.g., `OPENAI_API_KEY`) with your LLM access key
 1. To evaluate answer correctness:
     1. Include `reference_answer` in the reference dataset and `actual_answer` in the target data to evaluate
-    1. Set environment variable `OPENAI_API_KEY` appropriately
+    1. Set the appropriate environment variable (e.g., `OPENAI_API_KEY`) with your LLM access key
 1. To evaluate steps:
     1. Include `reference_steps` in the reference data and `actual_steps` in target data to evaluate
+1. If you want to evaluate metrics that require an LLM, write a [configuration file](#configuration).
 1. Call the evaluation function with the reference data and target data: section [Usage Code](#Usage-Code)
-1. Call the aggregation function with the evaluation results
+1. Call the aggregation function with the evaluation results: section [Usage Code](#Usage-Code)
+### LLM use in evaluation
+The following metrics use an LLM which must be configured using a [configuration](#configuration) file:
+* answer metrics
+  * `answer_recall`
+  * `answer_precision`
+  * `answer_f1`
+  * `answer_relevance`
+* retrieval context metrics:
+  * `retrieval_answer_recall`
+  * `retrieval_answer_precision`
+  * `retrieval_answer_f1`
+  * `retrieval_context_recall`
+  * `retrieval_context_precision`
+  * `retrieval_context_f1`
+* [custom evaluation](#custom-evaluation-custom-metrics)
+Supported LLMs are all those supported by the [`litellm`](https://github.com/BerriAI/litellm) library, including all major LLMs and local models via Ollama.
+If no LLM is configured or the `config_file_path` parameter is not provided, these metrics are not evaluated.
+### Configuration
+The configuration has two sections: `llm` and `custom_evaluations`:
+* `llm`: required for [LLM-based metrics](#llm-use-in-evaluation). The following keys are required:
+    * `generation`: required. The following keys are required:
+        * `provider`: (str) name of the organization providing the generation model, as supported by LiteLLM
+        * `model`: (str) name of the generation model
+        * `temperature`: (float in the range [0.0, 2.0]) sampling temperature for generation
+        * `max_tokens`: (int > 0) maximum number of tokens to generate
+        * Optional keys: parameters to be passed to LiteLLM for generation (for [`answer_correctness`](#output-keys) and [custom evaluation](#custom-evaluation-custom-metrics)). Examples:
+          * `base_url`: (str) base URL for the generation model, alternative to the provider's default URL
+          * `api_key`: (str) API key for the generation model, alternative to setting the environment variable corresponding to the provider (e.g. `OPENAI_API_KEY` for OpenAI)
+    * `embedding`: required for [`answer_relevance`](#output-keys).
+        * `provider`: (str) name of the organization providing the embedding model
+        * `model`: (str) name of the embedding model
+* `custom_evaluations`: (list of the following maps) required nonempty for [custom evaluation](#custom-evaluation-custom-metrics). Each map has keys:
+    * `name`: (str) name of the evaluation
+    * `inputs`: (list[str]) list of input variables. Any combination of the following:
+        * `question`
+        * `reference_answer`
+        * `reference_steps`
+        * `actual_answer`
+        * `actual_steps`
+    * `steps_keys`: (list[str]; required if `inputs` contains `actual_steps` or `reference_steps`) one or both of:
+        * `args`
+        * `output`
+    * `steps_name`: (str; required if `inputs` contains `actual_steps` or `reference_steps`) the type (name) of steps to include in the evaluation
+    * `instructions`: (str) instructions for the evaluation
+    * `outputs`: (map[str]) output variable names and descriptions
+#### Example Configuration File With LLM Configuration
+Below is a YAML file that configures the LLM generation (for [metrics that require an LLM](#llm-use-in-evaluation)) and embedding (for [`answer_relevance`](#output-keys)). It assumes that the environment variable `OPENAI_API_KEY` is set with your OpenAI API key.
+```YAML
+llm:
+  generation:
+    provider: openai
+    model: gpt-4o-mini
+    temperature: 0.0
+    max_tokens: 65536
+  embedding:
+    provider: openai
+    model: text-embedding-3-small
+```
+#### Example Configuration File With LLM Configuration and API keys
+Below is a YAML file that configures the LLM generation (for [metrics that require an LLM](#llm-use-in-evaluation)) and embedding (for [`answer_relevance`](#output-keys)) with different API keys in place of environment variables.
+```YAML
+llm:
+  generation:
+    provider: azure
+    model: graphrag-eval-system-tests-gpt-5.2
+    base_url: https://my-generator.openai.azure.com
+    temperature: 0.0
+    max_tokens: 8192
+    api_key: ...
+  embedding:
+    provider: azure
+    model: graphrag-eval-system-tests-text-embedding-3-small
+    api_base: https://my-embedder.openai.azure.com
+    api_key: ...
+```
-Answer evaluation (correctness and relevance) uses the LLM `openai/gpt-4o-mini`.
+#### Example Configuration File With Custom Evaluations
+Below is a YAML file that defines two custom evaluations:
+1. a simple relevance evaluation
+1. a SPARQL retrieval evaluation using the reference answer
+This is an example of the format and may not create accurate evaluations.
+```YAML
+llm:
+  generation:
+    provider: openai
+    model: gpt-4o-mini
+    temperature: 0.0
+    max_tokens: 65536
+  embedding:
+    provider: openai
+    model: text-embedding-3-small
+custom_evaluations:
+  -
+    name: my_answer_relevance
+    inputs:
+      - question
+      - actual_answer
+    instructions: |
+      Evaluate how relevant is the answer to the question.
+    outputs:
+      my_answer_relevance: fraction between 0 and 1
+      my_answer_relevance_reason: reason for your evaluation
+  -
+    name: sparql_llm_evaluation
+    inputs:
+      - question
+      - reference_answer
+      - actual_steps
+    steps_keys:
+      - output
+    steps_name: sparql
+    instructions: |
+      Divide the reference answer into claims and try to match each claim to the
+      SPARQL query results. Count the:
+      - reference claims
+      - SPARQL results
+      - matching claims
+    outputs:
+      sparql_recall: Number of matching claims as a fraction of reference claims (fraction 0-1)
+      sparql_precision: Number of matching claims as a fraction of SPARQL results (fraction 0-1)
+      sparql_reason: reason for your evaluation
+```
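
The configuration rules listed in the new README text (required `llm.generation` keys, the `temperature` and `max_tokens` ranges, and the `steps_keys`/`steps_name` requirement for step inputs) can be checked before running an evaluation. The sketch below is not part of graphrag-eval: `validate_config` is a hypothetical helper, and it works on an already-parsed configuration given as a plain Python dict, so no YAML parser is assumed.

```python
def validate_config(config: dict) -> list[str]:
    """Return a list of problems found; an empty list means none were detected."""
    problems = []

    generation = config.get("llm", {}).get("generation")
    if generation is None:
        problems.append("llm.generation is missing")
    else:
        # Keys the README lists as required for generation.
        for key in ("provider", "model", "temperature", "max_tokens"):
            if key not in generation:
                problems.append(f"llm.generation.{key} is missing")
        temperature = generation.get("temperature")
        if temperature is not None and not 0.0 <= temperature <= 2.0:
            problems.append("llm.generation.temperature must be in [0.0, 2.0]")
        max_tokens = generation.get("max_tokens")
        if max_tokens is not None and max_tokens <= 0:
            problems.append("llm.generation.max_tokens must be > 0")

    # steps_keys and steps_name are required when an evaluation uses steps.
    for evaluation in config.get("custom_evaluations", []):
        inputs = evaluation.get("inputs", [])
        if "actual_steps" in inputs or "reference_steps" in inputs:
            for key in ("steps_keys", "steps_name"):
                if key not in evaluation:
                    name = evaluation.get("name", "<unnamed>")
                    problems.append(f"{name}: {key} is required for step inputs")
    return problems


# Dict equivalent of part of the YAML example above.
example_config = {
    "llm": {
        "generation": {
            "provider": "openai",
            "model": "gpt-4o-mini",
            "temperature": 0.0,
            "max_tokens": 65536,
        },
    },
    "custom_evaluations": [
        {
            "name": "sparql_llm_evaluation",
            "inputs": ["question", "reference_answer", "actual_steps"],
            "steps_keys": ["output"],
            "steps_name": "sparql",
        }
    ],
}

problems = validate_config(example_config)
```

The library may perform its own, stricter validation; this check only mirrors the rules stated in the README text.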
 ### Reference Q&A Data
@@ -81,7 +240,7 @@ The assumption is that the final answer to the question is derived from the outp
 Each step includes:
-- `name`: The type of step being performed (e.g., `sparql_query`)
+- `name`: The name of the step (e.g., `sparql_query`)
 - `args`: Arguments of the step (e.g., arguments to a tool used in the step, such as a SPARQL query)
 - `output`: The expected output from the step.
 - `output_media_type`: (optional, missing or one of `application/sparql-results+json`, `application/json`) Indicates how the output of a step must be processed
@@ -270,14 +429,15 @@ Given a question, if the question-answering system successfully responds, to eva
     "elapsed_sec": 46.48961806297302,
     "actual_steps": [
         {
-          "name": "retrieval",
-          "args": {
-            "query": "transformers Substation OSLO",
-            "k": 2
-          },
-          "id": "call_3",
-          "status": "success",
-          "output": "[\n  {\n    \"id\": \"http://example.com/resource/doc/1\",\n    \"text\": \"Transformer OSLO T1 is in Substation Oslo.\"\n  },\n  {\n    \"id\": \"http://example.com/resource/doc/2\",\n    \"text\": \"Transformer OSLO T2 is in Substation Oslo.\"\n  }\n]"
+            "name": "retrieval",
+            "args": {
+                "query": "transformers Substation OSLO",
+                "k": 2
+            },
+            "id": "call_3",
+            "status": "success",
+            "output": "[\n  {\n    \"id\": \"http://example.com/resource/doc/1\",\n    \"text\": \"Transformer OSLO T1 is in Substation Oslo.\"\n  },\n  {\n    \"id\": \"http://example.com/resource/doc/2\",\n    \"text\": \"Transformer OSLO T2 is in Substation Oslo.\"\n  }\n]",
+            "execution_timestamp": "2025-12-17T09:15:53Z"
         },
         {
             "name": "autocomplete_search",
@@ -287,7 +447,8 @@ Given a question, if the question-answering system successfully responds, to eva
             },
             "id": "call_7amIsfEGelOnVZ1DWtgtc0hc",
             "status": "success",
-            "output": "{\n  \"head\": {\n    \"vars\": [\n      \"iri\",\n      \"name\",\n      \"rank\"\n    ]\n  },\n  \"results\": {\n    \"bindings\": [\n      {\n        \"iri\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769664-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"name\": {\n          \"type\": \"literal\",\n          \"value\": \"STAVANGER\"\n        },\n        \"rank\": {\n          \"datatype\": \"http://www.w3.org/2001/XMLSchema#float\",\n          \"type\": \"literal\",\n          \"value\": \"0.00976\"\n        }\n      }\n    ]\n  }\n}"
+            "output": "{\n  \"head\": {\n    \"vars\": [\n      \"iri\",\n      \"name\",\n      \"rank\"\n    ]\n  },\n  \"results\": {\n    \"bindings\": [\n      {\n        \"iri\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769664-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"name\": {\n          \"type\": \"literal\",\n          \"value\": \"STAVANGER\"\n        },\n        \"rank\": {\n          \"datatype\": \"http://www.w3.org/2001/XMLSchema#float\",\n          \"type\": \"literal\",\n          \"value\": \"0.00976\"\n        }\n      }\n    ]\n  }\n}",
+            "execution_timestamp": "2025-12-17T09:15:58Z"
         },
         {
             "name": "sparql_query",
@@ -296,7 +457,8 @@ Given a question, if the question-answering system successfully responds, to eva
             },
             "id": "call_DbMkZ8kv3qkf49wNNdREkpRN",
             "status": "error",
-            "error": "Error: ValueError('The following prefixes are undefined: urn')\n Please fix your mistakes."
+            "error": "Error: ValueError('The following prefixes are undefined: urn')\n Please fix your mistakes.",
+            "execution_timestamp": "2025-12-17T09:16:03Z"
         },
         {
             "name": "sparql_query",
@@ -305,7 +467,8 @@ Given a question, if the question-answering system successfully responds, to eva
             },
             "id": "call_Qm1mzX7g5q9SVPrR2QzEMTp3",
             "status": "success",
-            "output": "{\n  \"head\": {\n    \"vars\": [\n      \"substation2\",\n      \"substation2Name\",\n      \"line\",\n      \"lineName\"\n    ]\n  },\n  \"results\": {\n    \"bindings\": [\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f176965a-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"KRISTIANSAND\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:47332c85-17a4-7a41-abd5-2b92e4b800ed\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 300KRISTIAN-STAVANGE\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769614-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"DAGALI\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420DAGALI-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"KVILLDAL\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420DAGALI-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": 
\"literal\",\n          \"value\": \"HAGAFOSS\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420DAGALI-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"OSLO\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420SYSLE-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769648-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"SYSLE\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420SYSLE-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"KVILLDAL\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420SYSLE-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": 
\"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"HAGAFOSS\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420SYSLE-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f176962a-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"AURLAND\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420AURLAND-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"KVILLDAL\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420AURLAND-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"HAGAFOSS\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 
420AURLAND-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769624-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"SIMA\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420AURLAND-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769642-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"SYLLING\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420SYLLING-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"OSLO\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420SYLLING-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"KVILLDAL\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": 
\"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420SYLLING-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"HAGAFOSS\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420SYLLING-HAGAFOSS\"\n        }\n      }\n    ]\n  }\n}"
+            "output": "{\n  \"head\": {\n    \"vars\": [\n      \"substation2\",\n      \"substation2Name\",\n      \"line\",\n      \"lineName\"\n    ]\n  },\n  \"results\": {\n    \"bindings\": [\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f176965a-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"KRISTIANSAND\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:47332c85-17a4-7a41-abd5-2b92e4b800ed\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 300KRISTIAN-STAVANGE\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769614-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"DAGALI\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420DAGALI-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"KVILLDAL\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420DAGALI-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": 
\"literal\",\n          \"value\": \"HAGAFOSS\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420DAGALI-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"OSLO\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420SYSLE-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769648-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"SYSLE\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420SYSLE-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"KVILLDAL\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420SYSLE-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": 
\"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"HAGAFOSS\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420SYSLE-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f176962a-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"AURLAND\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420AURLAND-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"KVILLDAL\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420AURLAND-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"HAGAFOSS\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 
420AURLAND-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769624-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"SIMA\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420AURLAND-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769642-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"SYLLING\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420SYLLING-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"OSLO\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420SYLLING-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"KVILLDAL\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": 
\"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420SYLLING-HAGAFOSS\"\n        }\n      },\n      {\n        \"substation2\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n        },\n        \"substation2Name\": {\n          \"type\": \"literal\",\n          \"value\": \"HAGAFOSS\"\n        },\n        \"line\": {\n          \"type\": \"uri\",\n          \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n        },\n        \"lineName\": {\n          \"type\": \"literal\",\n          \"value\": \"LC 420SYLLING-HAGAFOSS\"\n        }\n      }\n    ]\n  }\n}",
+            "execution_timestamp": "2025-12-17T09:16:08Z"
         }
     ],
     "actual_answer": "The substations connected to the substation \"STAVANGER\" via AC or DC lines are:\n\n1. **KRISTIANSAND** via line \"LC 300KRISTIAN-STAVANGE\"\n2. **DAGALI** via line \"LC 420DAGALI-HAGAFOSS\"\n3. **KVILLDAL** via lines:\n   - \"LC 420DAGALI-HAGAFOSS\"\n   - \"LC 420SYSLE-HAGAFOSS\"\n   - \"LC 420AURLAND-HAGAFOSS\"\n   - \"LC 420SYLLING-HAGAFOSS\"\n4. **HAGAFOSS** via lines:\n   - \"LC 420DAGALI-HAGAFOSS\"\n   - \"LC 420SYSLE-HAGAFOSS\"\n   - \"LC 420AURLAND-HAGAFOSS\"\n   - \"LC 420SYLLING-HAGAFOSS\"\n5. **OSLO** via lines:\n   - \"LC 420SYSLE-HAGAFOSS\"\n   - \"LC 420SYLLING-HAGAFOSS\"\n6. **SYSLE** via line \"LC 420SYSLE-HAGAFOSS\"\n7. **AURLAND** via line \"LC 420AURLAND-HAGAFOSS\"\n8. **SIMA** via line \"LC 420AURLAND-HAGAFOSS\"\n9. **SYLLING** via line \"LC 420SYLLING-HAGAFOSS\""
@@ -331,7 +494,7 @@ from graphrag_eval import run_evaluation, compute_aggregates
 reference_qas: list[dict] = [] # read your reference data
 chat_responses: dict = {} # call your implementation to get the response
-evaluation_results = run_evaluation(reference_qas, chat_responses)
+evaluation_results = await run_evaluation(reference_qas, chat_responses)
 aggregates = compute_aggregates(evaluation_results)
 ```
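Because `run_evaluation` is now awaited, callers in synchronous code need an event loop. The following is a minimal sketch of one way to drive it with `asyncio.run`. The `run_evaluation` defined here is a hypothetical stand-in so that the snippet runs without the package installed; the placeholder data shapes are also assumptions, not the library's actual formats.

```python
import asyncio


async def run_evaluation(reference_qas: list[dict], chat_responses: dict) -> list[dict]:
    """Hypothetical stand-in for graphrag_eval.run_evaluation (not the real implementation)."""
    await asyncio.sleep(0)  # yield to the event loop, as a real async call would
    return [{"question_id": qa["question_id"], "status": "success"} for qa in reference_qas]


async def main() -> list[dict]:
    reference_qas = [{"question_id": "q1"}]  # placeholder reference data (assumed shape)
    chat_responses: dict = {}  # placeholder responses
    return await run_evaluation(reference_qas, chat_responses)


# asyncio.run creates an event loop, runs the coroutine to completion, and closes the loop.
evaluation_results = asyncio.run(main())
print(evaluation_results)
```

Inside an environment that already runs an event loop (for example a notebook), `await run_evaluation(...)` can typically be used directly instead of `asyncio.run`.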
@@ -365,7 +528,7 @@ The output is a list of statistics for each question from the reference Q&A data
             "text": "Transformer OSLO T2 is in Substation Oslo."
           }
         ]
-  - name: sparql_query
+    - name: sparql_query
       args:
         query: |2
@@ -404,7 +567,6 @@ The output is a list of statistics for each question from the reference Q&A data
   answer_precision: 1.0
   answer_f1: 1.0
   answer_relevance: 0.9
-  answer_relevance_cost: 0.0007
   actual_steps:
   - name: retrieval
     id: call_3
@@ -423,13 +585,10 @@ The output is a list of statistics for each question from the reference Q&A data
           "text": "Transformer OSLO T2 is in Substation Oslo."
         }
       ]
+    execution_timestamp: '2025-12-17T09:15:53Z'
     retrieval_answer_recall: 1.0
-    retrieval_answer_recall_reason: The context contains all the transformers listed in the reference answer
-    retrieval_answer_recall_cost: 0.0007
     retrieval_answer_precision: 1.0
-    retrieval_answer_precision_cost: 0.0003
     retrieval_answer_f1: 1.0
-    retrieval_answer_f1_cost: 0.001
   - name: autocomplete_search
     args:
       query: OSLO
@@ -465,6 +624,7 @@ The output is a list of statistics for each question from the reference Q&A data
           ]
         }
       }
+    execution_timestamp: '2025-12-17T09:15:58Z'
   - name: sparql_query
     args:
       query: |-
@@ -508,6 +668,7 @@ The output is a list of statistics for each question from the reference Q&A data
           ]
         }
       }
+    execution_timestamp: '2025-12-17T09:16:03Z'
   steps_score: 1
   input_tokens: 221339
   output_tokens: 212
@@ -520,9 +681,11 @@ The output is a list of statistics for each question from the reference Q&A data
 - `template_id`: the template id
 - `question_id`: the question id
 - `question_text`: the natural language query
-- `reference_steps`: (optional) copy of the expected steps in the Q&A dataset, if specified there
-- `reference_answer`: (optional) copy of the expected answer in the Q&A dataset, if specified there
 - `status`: "success" or "error", indicating whether the evaluation succeeded
+- `reference_steps`: (optional) copy of the expected steps in the Q&A dataset,
+if specified there. An additional key "matches" is added to those steps that
+are matched.
+- `reference_answer`: (optional) copy of the expected answer in the Q&A dataset, if specified there
 - `actual_answer`: (optional) copy of the response text in the evaluation target, if specified there
 - `answer_reference_claims_count`: (optional) number of claims extracted from the reference answer, if a reference answer and actual answer are available
 - `answer_actual_claims_count`: (optional) number of claims extracted from the answer being evaluated, if a reference answer and actual answer are available
@@ -534,34 +697,26 @@ The output is a list of statistics for each question from the reference Q&A data
 - `answer_f1`: (optional) Harmonic mean of `answer_recall` and `answer_precision`
 - `answer_relevance`: (optional) The value representing how relevant is the actual answer to the question, computed using [RAGAS answer relevance](https://docs.ragas.io/en/v0.3.3/concepts/metrics/available_metrics/answer_relevance/)
 - `answer_relevance_error`: (optional) error message if answer relevance evaluation failed
-- `answer_relevance_cost`: The LLM use cost of computing `answer_relevance`, in US dollars
 - `actual_steps`: (optional) copy of the steps in the evaluation target, if specified there
-- `steps_score`: a real number between 0 and 1, computed by comparing the results of the last executed steps to the output of the reference's last group of steps.
-    - If there is no match in the actual steps, then the score is `0.0`
-    - If the executed step's name is "retrieval" and the last reference group contains a retrieval step, then the score is the [recall at k](#context-recallk) of the retrieved document ids with respect to the reference.
-    - Otherwise, the score is the number of the matched steps on the last group divided by the total number of steps in the last group.
-- `input_tokens`: input tokens usage
-- `output_tokens`: output tokens usage
-- `total_tokens`: total tokens usage
-- `elapsed_sec`: elapsed seconds
+- `steps_score`: (optional) a real number between 0 and 1; see the section
+[Steps score](#steps-score) for how it is calculated
+- `input_tokens`: (optional) input tokens usage
+- `output_tokens`: (optional) output tokens usage
+- `total_tokens`: (optional) total tokens usage
+- `elapsed_sec`: (optional) elapsed seconds
 All `actual_steps` with `name` "retrieval" contain:
 - `retrieval_answer_recall`: (optional) recall of the retrieved context with respect to the reference answer, if evaluation succeeds
-- `retrieval_answer_recall_reason`: (optional) LLM reasoning in evaluating `retrieval_answer_recall`
 - `retrieval_answer_recall_error`: (optional) error message if `retrieval_answer_recall` evaluation fails
-- `retrieval_answer_recall_cost`: cost of evaluating `retrieval_answer_recall`, in US dollars
 - `retrieval_answer_precision`: (optional) precision of the retrieved context with respect to the reference answer, if evaluation succeeds
 - `retrieval_answer_precision_error`: (optional) error message if `retrieval_answer_precision` evaluation fails
-- `retrieval_answer_precision_cost`: cost of evaluating `retrieval_answer_precision`, in US dollars
 - `retrieval_answer_f1`: (optional) F1 score of the retrieved context with respect to the reference answer, if `retrieval_answer_recall` and `retrieval_answer_precision` succeed
-- `retrieval_answer_f1_cost`: The sum of `retrieval_answer_recall_cost` and `retrieval_answer_precision_cost`
 - `retrieval_context_recall`: (optional) recall of the retrieved context with respect to the reference answer, if evaluation succeeds
 - `retrieval_context_recall_error`: (optional) error message if `retrieval_context_recall` evaluation fails
 - `retrieval_context_precision`: (optional) precision of the retrieved context with respect to the reference answer, if evaluation succeeds
 - `retrieval_context_precision_error`: (optional) error message if `retrieval_context_precision` evaluation fails
 - `retrieval_context_f1`: (optional) F1 score of the retrieved context with respect to the reference answer, if `retrieval_context_recall` and `retrieval_context_precision` succeed
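The F1 keys above are described as the harmonic mean of the corresponding recall and precision. A minimal sketch of that formula follows; the function name is hypothetical, and returning 0.0 when both inputs are zero is an assumed convention rather than something the README states.

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision; 0.0 when both are zero (assumed convention)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


print(f1_score(1.0, 1.0))
print(f1_score(0.75, 0.6))
```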
 #### Aggregates Keys
 The `aggregates` object provides aggregated evaluation metrics. These aggregates support analysis of agent quality, token efficiency, and execution performance. Aggregates are computed:
@@ -610,7 +765,6 @@ Aggregates are:
     - `answer_precision`
     - `answer_f1`
     - `answer_relevance`
-    - `answer_relevance_cost`
     - `retrieval_answer_recall`
     - `retrieval_answer_precision`
     - `retrieval_answer_f1`
@@ -627,7 +781,6 @@ Aggregates are:
   - `answer_precision`
   - `answer_f1`
   - `answer_relevance`
-  - `answer_relevance_cost`
   - `retrieval_answer_recall`
   - `retrieval_answer_precision`
   - `retrieval_answer_f1`
@@ -667,12 +820,6 @@ per_template:
       mean: 0.9
       median: 0.9
       sum: 0.9
-    answer_relevance_cost:
-      min: 0.0007
-      max: 0.0007
-      mean: 0.0007
-      median: 0.0007
-      sum: 0.0007
     steps:
       total:
         autocomplete_search: 10
@@ -739,12 +886,6 @@ per_template:
       mean: 0.9
       median: 0.9
       sum: 0.9
-    answer_relevance_cost:
-      min: 0.0007
-      max: 0.0007
-      mean: 0.0007
-      median: 0.0007
-      sum: 0.0007
     steps:
       total:
         autocomplete_search: 10
@@ -809,12 +950,6 @@ per_template:
       mean: 0.9
       median: 0.9
       sum: 0.9
-    answer_relevance_cost:
-      min: 0.0007
-      max: 0.0007
-      mean: 0.0007
-      median: 0.0007
-      sum: 0.0007
     steps:
       total:
         autocomplete_search: 9
@@ -881,12 +1016,6 @@ per_template:
       mean: 0.9
       median: 0.9
       sum: 0.9
-    answer_relevance_cost:
-      min: 0.0007
-      max: 0.0007
-      mean: 0.0007
-      median: 0.0007
-      sum: 0.0007
     steps:
       total:
         autocomplete_search: 20
@@ -951,12 +1080,6 @@ micro:
     mean: 0.9
     median: 0.9
     sum: 0.9
-  answer_relevance_cost:
-    min: 0.0007
-    max: 0.0007
-    mean: 0.0007
-    median: 0.0007
-    sum: 0.0007
   steps_score:
     sum: 17
     mean: 0.4358974358974359
@@ -996,8 +1119,6 @@ macro:
     mean: 1.0
   answer_relevance:
     mean: 0.9
-  answer_relevance_cost:
-    mean: 0.0007
   steps_score:
     mean: 0.45
   input_tokens:
@@ -1010,7 +1131,54 @@ macro:
     mean: 25.911653497483996
 ```
-### SPARQL queries comparison
+### Steps score
+The steps score is a real number in the interval [0, 1]
+that indicates how closely the actual steps match the reference ones.
+A score of 1 indicates a perfect match.
+Going in reverse order of the reference groups and in reverse order of the
+actual steps, for each reference group we try to match each step in the group
+to an actual step:
+- If all steps from the current reference group are matched, we proceed to the
+previous group, but we only search among the actual steps before the earliest
+actual step already matched for the current group. Thus, we ignore the
+execution order of steps within a reference group, but groups are matched
+in order: actual steps that match earlier groups must be executed before all
+steps matching later groups.
+- If some steps in the current reference group are not matched, then the
+matching stops, and the score is computed from the matches found so far.
+An actual step can match at most one reference step, and only if the actual
+step is successful (i.e., it didn’t result in an error). There are a few ways
+a reference step can match an actual one. In all cases except for
+the "retrieval" steps, the matching score is either 0 or 1. A score above 0
+indicates a match.
+- if both are named "sparql_query" and the "output_media_type" of the
+reference step is "application/sparql-results+json", then we try to match them
+using the [SPARQL queries comparison algorithm](#sparql-queries-comparison).
+The algorithm doesn't handle comparison of DESCRIBE or CONSTRUCT queries output
+with SELECT and ASK queries output.
+- if both are named "retrieval" and the reference step has "output", then we
+compute [recall@k](#context-recallk).
+- if both are named "retrieve_time_series", then we check whether the arguments
+of the steps match.
+- if both are named "retrieve_data_points", then we check whether the arguments
+of the steps match.
+- if the reference step is named "iri_discovery" and the actual step is named
+"autocomplete_search", then we check whether the IRI specified as "output" of
+the "iri_discovery" step is present in the "output" of the "autocomplete_search"
+step.
+- if the reference and actual step names are the same and the
+"output_media_type" of the reference step is "application/json", then the steps
+match if the JSON outputs are the same.
+- otherwise, we fall back to comparing the outputs of the two steps.
+The final steps score is the macro mean of scores over the groups. That is, it
+is the sum of the scores of all reference groups divided by their number. Each
+group score is the sum of scores of its steps divided by their number.
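The reverse group matching and macro mean described in this section can be sketched as follows. This is a simplified illustration, not the library's implementation: `match_score` is a stand-in matcher that only compares names and outputs, the `status` key used to detect failed steps is an assumption, matching within a group is greedy, and every reference group is assumed to be non-empty.

```python
from statistics import mean


def match_score(reference_step: dict, actual_step: dict) -> float:
    """Simplified stand-in matcher: 1.0 if name and output are equal, else 0.0."""
    same = (reference_step["name"] == actual_step["name"]
            and reference_step.get("output") == actual_step.get("output"))
    return 1.0 if same else 0.0


def steps_score(reference_groups: list[list[dict]], actual_steps: list[dict]) -> float:
    group_scores = [0.0] * len(reference_groups)  # unvisited groups keep score 0
    limit = len(actual_steps)  # only actual steps before this index are searched
    for g in range(len(reference_groups) - 1, -1, -1):  # last group first
        used: set[int] = set()  # an actual step matches at most one reference step
        scores = []
        for ref in reference_groups[g]:
            score = 0.0
            for i in range(limit - 1, -1, -1):  # latest actual step first
                step = actual_steps[i]
                if i in used or step.get("status") == "error":  # assumed error flag
                    continue
                s = match_score(ref, step)
                if s > 0:
                    used.add(i)
                    score = s
                    break
            scores.append(score)
        group_scores[g] = mean(scores)
        if any(s == 0 for s in scores):
            break  # an unmatched step stops the matching
        limit = min(used)  # earlier groups must match before this group's earliest match
    return mean(group_scores)  # macro mean over all reference groups


groups = [
    [{"name": "autocomplete_search", "output": "A"}],
    [{"name": "sparql_query", "output": "X"}, {"name": "sparql_query", "output": "Y"}],
]
in_order = [
    {"name": "autocomplete_search", "output": "A"},
    {"name": "sparql_query", "output": "Y"},
    {"name": "sparql_query", "output": "X"},
]
print(steps_score(groups, in_order))
```

In this sketch, swapping the two `sparql_query` steps does not change the result, because order within a group is ignored, whereas moving the `autocomplete_search` step to the end leaves the first group unmatched.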
+#### SPARQL queries comparison
 The algorithm iterates over all subsets of columns in the actual result of the same size as in the reference result.
 For each subset, it compares the set of columns (skipping optional columns).
@@ -1074,3 +1242,68 @@ average_precision(
     retrieved_docs=[1, 4, 3, 5, 7]
 ) # ~=> 0.8056
 ```
+### Custom Evaluation (Custom Metrics)
+You (the user) can define your own metrics of system outputs to be evaluated
+using an LLM. To do this, specify each evaluation's name, inputs, outputs and
+instructions in a YAML file and pass the file path as a parameter to
+`run_evaluation()`. This will return your output metrics alongside the standard
+metrics described in previous sections.
+One configuration file can define multiple custom evaluations, each of which
+will be done as a separate query to the LLM. Each evaluation can have multiple
+outputs. The format is shown in the example sections below.
+See [Example Configuration File](#example-configuration-file).
+#### Example Call to Evaluate Using Custom Metrics
+```python
+evaluation_results = await run_evaluation(
+    reference_qa_dataset,
+    chat_responses,
+    "my_project/custom_eval.yaml"
+)
+```
+#### Example Output for Custom SPARQL Evaluation
+With the
+[custom SPARQL evaluation](#example-configuration-file),
+the output is as for section [Evaluation Results](#evaluation-results), except
+that it has the following additional keys and example values:
+```yaml
+  my_answer_relevance: 0.9
+  my_answer_relevance_eval_reason: The answer contains relevant information except for the sentence about total revenue
+  sparql_recall: 0.75
+  sparql_precision: 0.6
+  sparql_eval_reason: The reference answer has 4 claims; there are 5 SPARQL results; 3 claims match
+```
+#### Output in Case of Evaluation Error
+If there is an error during evaluation, the output will have `null` for all
+output keys specified in the custom evaluation and will have an additional key
+explaining the error. The key will be `{name}_error`, where `name` is the custom
+evaluation name.
+There are three types of errors:
+1. The reference input is missing keys requested in the custom evaluation
+configuration. Example:
+`custom_1_error: Reference missing key 'reference_steps'`
+1. The actual output to be evaluated is missing keys requested in the custom
+evaluation configuration. Example:
+`custom_1_error: Actual output missing 'actual_steps'`
+1. The evaluating LLM output does not conform to the custom evaluation
+configuration. Example:
+`custom_1_error: "Expected 6 tab-separated values, got: 0.1\tCustom answer reason"`
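A caller may want to collect these error keys from a result entry. The following is a minimal sketch of that lookup, relying only on the `{name}_error` naming rule stated above; the helper name, the `custom_1_score` and `custom_2_score` output keys, and the sample entry are hypothetical.

```python
def custom_evaluation_errors(result: dict, evaluation_names: list[str]) -> dict[str, str]:
    """Return {evaluation name: error message} for custom evaluations that failed."""
    errors = {}
    for name in evaluation_names:
        message = result.get(f"{name}_error")  # key follows the {name}_error rule
        if message is not None:
            errors[name] = message
    return errors


result = {
    "question_id": "q1",  # hypothetical result entry
    "custom_1_score": None,  # hypothetical output key, null because evaluation failed
    "custom_1_error": "Reference missing key 'reference_steps'",
    "custom_2_score": 0.8,  # hypothetical output key of a successful evaluation
}
print(custom_evaluation_errors(result, ["custom_1", "custom_2"]))
```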
+#### Recommendations for Custom Evaluations
+1. Specify only a few outputs in each evaluation.
+1. Request an explanation output for output quantities you ask the LLM to count
+or estimate. You can ask for one explanation per quantity or one shared
+explanation for 2-3 quantities.
