graphrag-eval 5.3.0__tar.gz → 6.0.0__tar.gz

Files changed (26)
  1. graphrag_eval-5.3.0/README.md → graphrag_eval-6.0.0/PKG-INFO +313 -80
  2. graphrag_eval-5.3.0/PKG-INFO → graphrag_eval-6.0.0/README.md +293 -100
  3. {graphrag_eval-5.3.0 → graphrag_eval-6.0.0}/graphrag_eval/aggregation.py +52 -20
  4. {graphrag_eval-5.3.0 → graphrag_eval-6.0.0}/graphrag_eval/answer_correctness.py +57 -26
  5. graphrag_eval-6.0.0/graphrag_eval/answer_relevance.py +29 -0
  6. graphrag_eval-6.0.0/graphrag_eval/custom_evaluation.py +204 -0
  7. graphrag_eval-6.0.0/graphrag_eval/evaluation.py +101 -0
  8. graphrag_eval-6.0.0/graphrag_eval/llm.py +55 -0
  9. graphrag_eval-6.0.0/graphrag_eval/steps/evaluation.py +181 -0
  10. graphrag_eval-6.0.0/graphrag_eval/steps/iri_discovery.py +20 -0
  11. graphrag_eval-6.0.0/graphrag_eval/steps/retrieval_answer.py +44 -0
  12. graphrag_eval-6.0.0/graphrag_eval/steps/retrieval_context_texts.py +45 -0
  13. graphrag_eval-6.0.0/graphrag_eval/steps/timeseries.py +182 -0
  14. {graphrag_eval-5.3.0 → graphrag_eval-6.0.0}/graphrag_eval/util.py +13 -6
  15. {graphrag_eval-5.3.0 → graphrag_eval-6.0.0}/pyproject.toml +12 -15
  16. graphrag_eval-5.3.0/graphrag_eval/answer_relevance.py +0 -37
  17. graphrag_eval-5.3.0/graphrag_eval/evaluation.py +0 -61
  18. graphrag_eval-5.3.0/graphrag_eval/steps/evaluation.py +0 -157
  19. graphrag_eval-5.3.0/graphrag_eval/steps/retrieval_answer.py +0 -65
  20. graphrag_eval-5.3.0/graphrag_eval/steps/retrieval_context_texts.py +0 -59
  21. {graphrag_eval-5.3.0 → graphrag_eval-6.0.0}/LICENSE +0 -0
  22. {graphrag_eval-5.3.0 → graphrag_eval-6.0.0}/graphrag_eval/__init__.py +0 -0
  23. {graphrag_eval-5.3.0 → graphrag_eval-6.0.0}/graphrag_eval/prompts/template.md +0 -0
  24. {graphrag_eval-5.3.0 → graphrag_eval-6.0.0}/graphrag_eval/steps/__init__.py +0 -0
  25. {graphrag_eval-5.3.0 → graphrag_eval-6.0.0}/graphrag_eval/steps/retrieval_context_ids.py +0 -0
  26. {graphrag_eval-5.3.0 → graphrag_eval-6.0.0}/graphrag_eval/steps/sparql.py +0 -0
@@ -1,10 +1,28 @@
1
+ Metadata-Version: 2.3
2
+ Name: graphrag-eval
3
+ Version: 6.0.0
4
+ Summary: For assessing question answering systems' final answers and intermediate steps, against a given set of questions, reference answers and steps.
5
+ License: Apache-2.0
6
+ Author: Philip Ganchev
7
+ Author-email: philip.ganchev@graphwise.ai
8
+ Requires-Python: >=3.12,<3.13
9
+ Classifier: License :: OSI Approved :: Apache Software License
10
+ Classifier: Programming Language :: Python :: 3
11
+ Classifier: Programming Language :: Python :: 3.12
12
+ Provides-Extra: llm
13
+ Requires-Dist: pydantic (==2.12.5)
14
+ Requires-Dist: python-dateutil (==2.9.0.post0)
15
+ Requires-Dist: ragas (==0.4.3) ; extra == "llm"
16
+ Project-URL: Repository, https://github.com/Ontotext-AD/graphrag-eval
17
+ Description-Content-Type: text/markdown
18
+
1
19
  <p align="center">
2
20
  <img alt="Graphwise Logo" src="https://github.com/Ontotext-AD/graphrag-eval/blob/main/.github/Graphwise_Logo.jpg">
3
21
  </p>
4
22
 
5
23
  # QA Evaluation
6
24
 
7
- This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used to reach the answer (such as orchestrated and executed steps), compared to the given reference steps.
25
+ This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used to reach the answer (such as orchestrated and executed steps), compared to the given reference steps. The library provides built-in evaluation metrics and allows the user to define their own (custom) metrics.
8
26
 
9
27
  ## License
10
28
 
@@ -21,15 +39,15 @@ or add the following dependency in your `pyproject.toml` file:
21
39
  graphrag-eval = "*"
22
40
  ```
23
41
 
24
- To evaluate answer relevance and answer correctness:
42
+ To evaluate answer relevance and answer correctness, or to use a [custom evaluation](#custom-evaluation-custom-metrics):
25
43
 
26
44
  ```bash
27
- pip install 'graphrag-eval[ragas]'
45
+ pip install 'graphrag-eval[llm]'
28
46
  ```
29
47
 
30
48
  or add the following dependency in your `pyproject.toml` file:
31
49
  ```toml
32
- graphrag-eval = {version = "*", extras = ["ragas"]}
50
+ graphrag-eval = {version = "*", extras = ["llm"]}
33
51
  ```
34
52
 
35
53
  ## Maintainers
@@ -42,8 +60,12 @@ For issues or feature requests, please open [a GitHub issue](https://github.com/
42
60
  To evaluate only correctness of final answers (system responses), you can clone this repository and run the code on the command line:
43
61
 
44
62
  1. Prepare an input TSV file with columns `Question`, `Reference answer` and `Actual answer`
45
- 1. Execute `poetry install --with ragas`
46
- 1. Execute `OPENAI_API_KEY=<your_api_key> poetry run answer-correctness -i <input_file.tsv> -o <output_file.tsv>`
63
+ 1. Execute `poetry install --with llm`
64
+ 1. Execute
65
+ ```<LLM_ACCESS_VARIABLE>=<your_api_key> poetry run answer-correctness -i <input_file.tsv> -o <output_file.tsv>```
66
+ replacing `<LLM_ACCESS_VARIABLE>` with the environment variable that your LLM provider uses for its API key.
67
+ Example:
68
+ ```OPENAI_API_KEY=XXX poetry run answer-correctness -i reference.tsv -o evaluations.tsv```
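
The input file for step 1 can be produced with the standard `csv` module. This is a minimal sketch, not part of the package: the helper name `write_input_tsv` is hypothetical, and it assumes that the CLI reads a plain tab-separated file whose header row holds the three column names listed above.

```python
import csv

# Column names required by the answer-correctness CLI, as listed in step 1.
COLUMNS = ["Question", "Reference answer", "Actual answer"]


def write_input_tsv(path: str, rows: list[dict[str, str]]) -> None:
    """Write rows (dicts keyed by COLUMNS) to a tab-separated file with a header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_input_tsv("reference.tsv", rows)` with one dict per question yields a file that can be passed to the `-i` option.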
47
69
 
48
70
  We plan to improve CLI support in future releases.
49
71
 
@@ -55,16 +77,153 @@ To evaluate answers and/or steps:
55
77
  1. Format the answers and/or steps you want to evaluate: section [Responses to evaluate](#Responses-to-evaluate)
56
78
  1. To evaluate answer relevance:
57
79
  1. Include `actual_answer` in the target data to evaluate
58
- 1. Set environment variable `OPENAI_API_KEY` appropriately
80
+ 1. Set the appropriate environment variable (e.g., `OPENAI_API_KEY`) to your LLM access key
59
81
  1. To evaluate answer correctness:
60
82
  1. Include `reference_answer` in the reference dataset and `actual_answer` in the target data to evaluate
61
- 1. Set environment variable `OPENAI_API_KEY` appropriately
83
+ 1. Set the appropriate environment variable (e.g., `OPENAI_API_KEY`) to your LLM access key
62
84
  1. To evaluate steps:
63
85
  1. Include `reference_steps` in the reference data and `actual_steps` in target data to evaluate
86
+ 1. If you want to evaluate metrics that require an LLM, write a [configuration file](#configuration).
64
87
  1. Call the evaluation function with the reference data and target data: section [Usage Code](#Usage-Code)
65
- 1. Call the aggregation function with the evaluation results
88
+ 1. Call the aggregation function with the evaluation results: section [Usage Code](#Usage-Code)
89
+
90
+ ### LLM use in evaluation
91
+
92
+ The following metrics use an LLM, which must be configured in a [configuration](#configuration) file:
93
+ * answer metrics:
94
+   * `answer_recall`
95
+   * `answer_precision`
96
+   * `answer_f1`
97
+   * `answer_relevance`
98
+ * retrieval context metrics:
99
+   * `retrieval_answer_recall`
100
+   * `retrieval_answer_precision`
101
+   * `retrieval_answer_f1`
102
+   * `retrieval_context_recall`
103
+   * `retrieval_context_precision`
104
+   * `retrieval_context_f1`
105
+ * [custom evaluation](#custom-evaluation-custom-metrics)
106
+
107
+ Supported LLMs are all those supported by the [`litellm`](https://github.com/BerriAI/litellm) library, including all major LLMs and local models via Ollama.
108
+
109
+ If no LLM is configured or the `config_file_path` parameter is not provided, these metrics are not evaluated.
110
+
111
+ ### Configuration
112
+
113
+ The configuration has two sections, `llm` and `custom_evaluations`:
114
+
115
+ * `llm`: required for [LLM-based metrics](#llm-use-in-evaluation). The following keys are required:
116
+   * `generation`: required. The following keys are required:
117
+     * `provider`: (str) name of the organization providing the generation model, as supported by LiteLLM
118
+     * `model`: (str) name of the generation model
119
+     * `temperature`: (float in the range [0.0, 2.0]) sampling temperature for generation
120
+     * `max_tokens`: (int > 0) maximum number of tokens to generate
121
+     * Optional keys: parameters to be passed to LiteLLM for generation (for [`answer_correctness`](#output-keys) and [custom evaluation](#custom-evaluation-custom-metrics)). Examples:
122
+       * `base_url`: (str) base URL for the generation model, alternative to the provider's default URL
123
+       * `api_key`: (str) API key for the generation model, alternative to setting the environment variable corresponding to the provider (e.g. `OPENAI_API_KEY` for OpenAI)
124
+   * `embedding`: required for [`answer_relevance`](#output-keys).
125
+     * `provider`: (str) name of the organization providing the embedding model
126
+     * `model`: (str) name of the embedding model
127
+ * `custom_evaluations`: (list of the following maps) required and non-empty for [custom evaluation](#custom-evaluation-custom-metrics). Each map has keys:
128
+   * `name`: (str) name of the evaluation
129
+   * `inputs`: (list[str]) list of input variables. Any combination of the following:
130
+     * `question`
131
+     * `reference_answer`
132
+     * `reference_steps`
133
+     * `actual_answer`
134
+     * `actual_steps`
135
+   * `steps_keys`: (list[str]; required if `inputs` contains `actual_steps` or `reference_steps`) one or both of:
136
+     * `args`
137
+     * `output`
138
+   * `steps_name`: (str; required if `inputs` contains `actual_steps` or `reference_steps`) the type (name) of steps to include in the evaluation
139
+   * `instructions`: (str) instructions for the evaluation
140
+   * `outputs`: (map[str]) output variable names and descriptions
141
+
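
The constraints listed above can be checked before running an evaluation. The following is a minimal sketch, not the package's own validation: the function name `validate_config` is hypothetical, and it assumes the configuration has already been loaded into a plain `dict` (for example, from the YAML file).

```python
# Input variable names allowed in a custom evaluation, as listed above.
ALLOWED_INPUTS = {
    "question", "reference_answer", "reference_steps",
    "actual_answer", "actual_steps",
}
# Inputs that make `steps_keys` and `steps_name` mandatory.
STEP_INPUTS = {"reference_steps", "actual_steps"}


def validate_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means no problem was found."""
    problems: list[str] = []
    generation = (config.get("llm") or {}).get("generation")
    if generation is None:
        problems.append("llm.generation is missing")
    else:
        for key in ("provider", "model", "temperature", "max_tokens"):
            if key not in generation:
                problems.append(f"llm.generation.{key} is missing")
        temperature = generation.get("temperature")
        if isinstance(temperature, (int, float)) and not 0.0 <= temperature <= 2.0:
            problems.append("llm.generation.temperature must be in [0.0, 2.0]")
        max_tokens = generation.get("max_tokens")
        if isinstance(max_tokens, int) and max_tokens <= 0:
            problems.append("llm.generation.max_tokens must be > 0")
    for evaluation in config.get("custom_evaluations") or []:
        name = evaluation.get("name", "<unnamed>")
        inputs = set(evaluation.get("inputs") or [])
        unknown = sorted(inputs - ALLOWED_INPUTS)
        if unknown:
            problems.append(f"{name}: unknown inputs {unknown}")
        if inputs & STEP_INPUTS:
            # Step inputs require both keys that select which steps to pass on.
            for key in ("steps_keys", "steps_name"):
                if key not in evaluation:
                    problems.append(f"{name}: {key} is required with step inputs")
    return problems
```

Returning a list of messages rather than raising on the first failure lets a caller report every issue in one pass.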
142
+ #### Example Configuration File With LLM Configuration
143
+
144
+ Below is a YAML file that configures the LLM generation (for [metrics that require an LLM](#llm-use-in-evaluation)) and embedding (for [`answer_relevance`](#output-keys)). It assumes that the environment variable `OPENAI_API_KEY` is set with your OpenAI API key.
145
+
146
+ ```YAML
147
+ llm:
148
+   generation:
149
+     provider: openai
150
+     model: gpt-4o-mini
151
+     temperature: 0.0
152
+     max_tokens: 65536
153
+   embedding:
154
+     provider: openai
155
+     model: text-embedding-3-small
156
+ ```
157
+
158
+ #### Example Configuration File With LLM Configuration and API keys
159
+
160
+ Below is a YAML file that configures the LLM generation (for [metrics that require an LLM](#llm-use-in-evaluation)) and embedding (for [`answer_relevance`](#output-keys)) with different API keys in place of environment variables.
161
+
162
+ ```YAML
163
+ llm:
164
+   generation:
165
+     provider: azure
166
+     model: graphrag-eval-system-tests-gpt-5.2
167
+     base_url: https://my-generator.openai.azure.com
168
+     temperature: 0.0
169
+     max_tokens: 8192
170
+     api_key: ...
171
+   embedding:
172
+     provider: azure
173
+     model: graphrag-eval-system-tests-text-embedding-3-small
174
+     api_base: https://my-embedder.openai.azure.com
175
+     api_key: ...
176
+ ```
66
177
 
67
- Answer evaluation (correctness and relevance) uses the LLM `openai/gpt-4o-mini`.
178
+ #### Example Configuration File With Custom Evaluations
179
+
180
+ Below is a YAML file that defines two custom evaluations:
181
+ 1. a simple relevance evaluation
182
+ 1. a SPARQL retrieval evaluation using the reference answer
183
+
184
+ This is an example of the format and may not create accurate evaluations.
185
+
186
+ ```YAML
187
+ llm:
188
+   generation:
189
+     provider: openai
190
+     model: gpt-4o-mini
191
+     temperature: 0.0
192
+     max_tokens: 65536
193
+   embedding:
194
+     provider: openai
195
+     model: text-embedding-3-small
196
+ custom_evaluations:
197
+   -
198
+     name: my_answer_relevance
199
+     inputs:
200
+       - question
201
+       - actual_answer
202
+     instructions: |
203
+       Evaluate how relevant the answer is to the question.
204
+     outputs:
205
+       my_answer_relevance: fraction between 0 and 1
206
+       my_answer_relevance_reason: reason for your evaluation
207
+   -
208
+     name: sparql_llm_evaluation
209
+     inputs:
210
+       - question
211
+       - reference_answer
212
+       - actual_steps
213
+     steps_keys:
214
+       - output
215
+     steps_name: sparql
216
+     instructions: |
217
+       Divide the reference answer into claims and try to match each claim to the
218
+       SPARQL query results. Count the:
219
+       - reference claims
220
+       - SPARQL results
221
+       - matching claims
222
+     outputs:
223
+       sparql_recall: Number of matching claims as a fraction of reference claims (fraction 0-1)
224
+       sparql_precision: Number of matching claims as a fraction of SPARQL results (fraction 0-1)
225
+       sparql_reason: reason for your evaluation
226
+ ```
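
The second custom evaluation above outputs a recall and a precision but no combined score. If one is wanted, the usual F1 (harmonic mean) can be computed from the two values after the evaluation has run. This is a sketch under that assumption; the function name `f1_score` is hypothetical and is not something the package is stated to provide for custom outputs.

```python
def f1_score(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision; 0.0 when both are zero."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Example: half of the reference claims were matched (recall 0.5) and every
# SPARQL result matched a claim (precision 1.0).
combined = f1_score(0.5, 1.0)  # 2 * 0.5 * 1.0 / 1.5, i.e. two thirds
```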
68
227
 
69
228
  ### Reference Q&A Data
70
229
 
@@ -81,7 +240,7 @@ The assumption is that the final answer to the question is derived from the outp
81
240
 
82
241
  Each step includes:
83
242
 
84
- - `name`: The type of step being performed (e.g., `sparql_query`)
243
+ - `name`: The name of the step (e.g., `sparql_query`)
85
244
  - `args`: Arguments of the step (e.g., arguments to a tool used in the step, such as a SPARQL query)
86
245
  - `output`: The expected output from the step.
87
246
  - `output_media_type`: (optional, missing or one of `application/sparql-results+json`, `application/json`) Indicates how the output of a step must be processed
@@ -270,14 +429,15 @@ Given a question, if the question-answering system successfully responds, to eva
270
429
  "elapsed_sec": 46.48961806297302,
271
430
  "actual_steps": [
272
431
  {
273
- "name": "retrieval",
274
- "args": {
275
- "query": "transformers Substation OSLO",
276
- "k": 2
277
- },
278
- "id": "call_3",
279
- "status": "success",
280
- "output": "[\n {\n \"id\": \"http://example.com/resource/doc/1\",\n \"text\": \"Transformer OSLO T1 is in Substation Oslo.\"\n },\n {\n \"id\": \"http://example.com/resource/doc/2\",\n \"text\": \"Transformer OSLO T2 is in Substation Oslo.\"\n }\n]"
432
+ "name": "retrieval",
433
+ "args": {
434
+ "query": "transformers Substation OSLO",
435
+ "k": 2
436
+ },
437
+ "id": "call_3",
438
+ "status": "success",
439
+ "output": "[\n {\n \"id\": \"http://example.com/resource/doc/1\",\n \"text\": \"Transformer OSLO T1 is in Substation Oslo.\"\n },\n {\n \"id\": \"http://example.com/resource/doc/2\",\n \"text\": \"Transformer OSLO T2 is in Substation Oslo.\"\n }\n]",
440
+ "execution_timestamp": "2025-12-17T09:15:53Z"
281
441
  },
282
442
  {
283
443
  "name": "autocomplete_search",
@@ -287,7 +447,8 @@ Given a question, if the question-answering system successfully responds, to eva
287
447
  },
288
448
  "id": "call_7amIsfEGelOnVZ1DWtgtc0hc",
289
449
  "status": "success",
290
- "output": "{\n \"head\": {\n \"vars\": [\n \"iri\",\n \"name\",\n \"rank\"\n ]\n },\n \"results\": {\n \"bindings\": [\n {\n \"iri\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769664-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"name\": {\n \"type\": \"literal\",\n \"value\": \"STAVANGER\"\n },\n \"rank\": {\n \"datatype\": \"http://www.w3.org/2001/XMLSchema#float\",\n \"type\": \"literal\",\n \"value\": \"0.00976\"\n }\n }\n ]\n }\n}"
450
+ "output": "{\n \"head\": {\n \"vars\": [\n \"iri\",\n \"name\",\n \"rank\"\n ]\n },\n \"results\": {\n \"bindings\": [\n {\n \"iri\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769664-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"name\": {\n \"type\": \"literal\",\n \"value\": \"STAVANGER\"\n },\n \"rank\": {\n \"datatype\": \"http://www.w3.org/2001/XMLSchema#float\",\n \"type\": \"literal\",\n \"value\": \"0.00976\"\n }\n }\n ]\n }\n}",
451
+ "execution_timestamp": "2025-12-17T09:15:58Z"
291
452
  },
292
453
  {
293
454
  "name": "sparql_query",
@@ -296,7 +457,8 @@ Given a question, if the question-answering system successfully responds, to eva
296
457
  },
297
458
  "id": "call_DbMkZ8kv3qkf49wNNdREkpRN",
298
459
  "status": "error",
299
- "error": "Error: ValueError('The following prefixes are undefined: urn')\n Please fix your mistakes."
460
+ "error": "Error: ValueError('The following prefixes are undefined: urn')\n Please fix your mistakes.",
461
+ "execution_timestamp": "2025-12-17T09:16:03Z"
300
462
  },
301
463
  {
302
464
  "name": "sparql_query",
@@ -305,7 +467,8 @@ Given a question, if the question-answering system successfully responds, to eva
305
467
  },
306
468
  "id": "call_Qm1mzX7g5q9SVPrR2QzEMTp3",
307
469
  "status": "success",
308
- "output": "{\n \"head\": {\n \"vars\": [\n \"substation2\",\n \"substation2Name\",\n \"line\",\n \"lineName\"\n ]\n },\n \"results\": {\n \"bindings\": [\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176965a-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KRISTIANSAND\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:47332c85-17a4-7a41-abd5-2b92e4b800ed\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 300KRISTIAN-STAVANGE\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769614-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"DAGALI\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420DAGALI-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420DAGALI-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420DAGALI-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"OSLO\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": 
\"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769648-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"SYSLE\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176962a-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"AURLAND\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n 
\"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769624-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"SIMA\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769642-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"SYLLING\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYLLING-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"OSLO\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYLLING-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 
420SYLLING-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYLLING-HAGAFOSS\"\n }\n }\n ]\n }\n}"
470
+ "output": "{\n \"head\": {\n \"vars\": [\n \"substation2\",\n \"substation2Name\",\n \"line\",\n \"lineName\"\n ]\n },\n \"results\": {\n \"bindings\": [\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176965a-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KRISTIANSAND\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:47332c85-17a4-7a41-abd5-2b92e4b800ed\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 300KRISTIAN-STAVANGE\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769614-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"DAGALI\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420DAGALI-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420DAGALI-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420DAGALI-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"OSLO\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": 
\"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769648-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"SYSLE\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176962a-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"AURLAND\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n 
\"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769624-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"SIMA\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769642-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"SYLLING\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYLLING-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"OSLO\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYLLING-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 
420SYLLING-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYLLING-HAGAFOSS\"\n }\n }\n ]\n }\n}",
471
+ "execution_timestamp": "2025-12-17T09:16:08Z"
309
472
  }
310
473
  ],
311
474
  "actual_answer": "The substations connected to the substation \"STAVANGER\" via AC or DC lines are:\n\n1. **KRISTIANSAND** via line \"LC 300KRISTIAN-STAVANGE\"\n2. **DAGALI** via line \"LC 420DAGALI-HAGAFOSS\"\n3. **KVILLDAL** via lines:\n - \"LC 420DAGALI-HAGAFOSS\"\n - \"LC 420SYSLE-HAGAFOSS\"\n - \"LC 420AURLAND-HAGAFOSS\"\n - \"LC 420SYLLING-HAGAFOSS\"\n4. **HAGAFOSS** via lines:\n - \"LC 420DAGALI-HAGAFOSS\"\n - \"LC 420SYSLE-HAGAFOSS\"\n - \"LC 420AURLAND-HAGAFOSS\"\n - \"LC 420SYLLING-HAGAFOSS\"\n5. **OSLO** via lines:\n - \"LC 420SYSLE-HAGAFOSS\"\n - \"LC 420SYLLING-HAGAFOSS\"\n6. **SYSLE** via line \"LC 420SYSLE-HAGAFOSS\"\n7. **AURLAND** via line \"LC 420AURLAND-HAGAFOSS\"\n8. **SIMA** via line \"LC 420AURLAND-HAGAFOSS\"\n9. **SYLLING** via line \"LC 420SYLLING-HAGAFOSS\""
@@ -331,7 +494,7 @@ from graphrag_eval import run_evaluation, compute_aggregates
331
494
 
332
495
  reference_qas: list[dict] = [] # read your reference data
333
496
  chat_responses: dict = {} # call your implementation to get the response
334
- evaluation_results = run_evaluation(reference_qas, chat_responses)
497
+ evaluation_results = await run_evaluation(reference_qas, chat_responses)
335
498
  aggregates = compute_aggregates(evaluation_results)
336
499
  ```
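
Because the call is now awaited, the snippet above only works inside a coroutine. From a regular synchronous script, one option is to wrap the two calls in a coroutine and drive it with `asyncio.run`. The following is a minimal sketch: the wrapper name `evaluate` is hypothetical, and the two library functions are passed in as parameters so that the sketch itself does not depend on the package being installed.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def evaluate(
    run_evaluation: Callable[[list[dict], dict], Awaitable[Any]],
    compute_aggregates: Callable[[Any], Any],
    reference_qas: list[dict],
    chat_responses: dict,
) -> Any:
    """Await the evaluation, then aggregate its results."""
    evaluation_results = await run_evaluation(reference_qas, chat_responses)
    return compute_aggregates(evaluation_results)


# In a script, assuming the imports shown in the usage code above:
# aggregates = asyncio.run(
#     evaluate(run_evaluation, compute_aggregates, reference_qas, chat_responses)
# )
```

In a notebook or another environment that already runs an event loop, `await evaluate(...)` can be used directly instead of `asyncio.run`.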
337
500
 
@@ -365,7 +528,7 @@ The output is a list of statistics for each question from the reference Q&A data
365
528
  "text": "Transformer OSLO T2 is in Substation Oslo."
366
529
  }
367
530
  ]
368
- - name: sparql_query
531
+ - name: sparql_query
369
532
  args:
370
533
  query: |2
371
534
 
@@ -404,7 +567,6 @@ The output is a list of statistics for each question from the reference Q&A data
404
567
  answer_precision: 1.0
405
568
  answer_f1: 1.0
406
569
  answer_relevance: 0.9
407
- answer_relevance_cost: 0.0007
408
570
  actual_steps:
409
571
  - name: retrieval
410
572
  id: call_3
@@ -423,13 +585,10 @@ The output is a list of statistics for each question from the reference Q&A data
423
585
  "text": "Transformer OSLO T2 is in Substation Oslo."
424
586
  }
425
587
  ]
588
+ execution_timestamp: '2025-12-17T09:15:53Z'
426
589
  retrieval_answer_recall: 1.0
427
- retrieval_answer_recall_reason: The context contains all the transformers listed in the reference answer
428
- retrieval_answer_recall_cost: 0.0007
429
590
  retrieval_answer_precision: 1.0
430
- retrieval_answer_precision_cost: 0.0003
431
591
  retrieval_answer_f1: 1.0
432
- retrieval_answer_f1_cost: 0.001
433
592
  - name: autocomplete_search
434
593
  args:
435
594
  query: OSLO
@@ -465,6 +624,7 @@ The output is a list of statistics for each question from the reference Q&A data
465
624
  ]
466
625
  }
467
626
  }
627
+ execution_timestamp: '2025-12-17T09:15:58Z'
468
628
  - name: sparql_query
469
629
  args:
470
630
  query: |-
@@ -508,6 +668,7 @@ The output is a list of statistics for each question from the reference Q&A data
508
668
  ]
509
669
  }
510
670
  }
671
+ execution_timestamp: '2025-12-17T09:16:03Z'
511
672
  steps_score: 1
512
673
  input_tokens: 221339
513
674
  output_tokens: 212
@@ -520,9 +681,11 @@ The output is a list of statistics for each question from the reference Q&A data
520
681
  - `template_id`: the template id
521
682
  - `question_id`: the question id
522
683
  - `question_text`: the natural language query
523
- - `reference_steps`: (optional) copy of the expected steps in the Q&A dataset, if specified there
524
- - `reference_answer`: (optional) copy of the expected answer in the Q&A dataset, if specified there
525
684
  - `status`: "success" or "error", indicating whether the evaluation succeeded
685
+ - `reference_steps`: (optional) copy of the expected steps in the Q&A dataset,
686
+ if specified there. An additional key "matches" is added to each reference
687
+ step that is matched.
688
+ - `reference_answer`: (optional) copy of the expected answer in the Q&A dataset, if specified there
526
689
  - `actual_answer`: (optional) copy of the response text in the evaluation target, if specified there
527
690
  - `answer_reference_claims_count`: (optional) number of claims extracted from the reference answer, if a reference answer and actual answer are available
528
691
  - `answer_actual_claims_count`: (optional) number of claims extracted from the answer being evaluated, if a reference answer and actual answer are available
@@ -534,34 +697,26 @@ The output is a list of statistics for each question from the reference Q&A data
534
697
  - `answer_f1`: (optional) Harmonic mean of `answer_recall` and `answer_precision`
535
698
  - `answer_relevance`: (optional) A value representing how relevant the actual answer is to the question, computed using [RAGAS answer relevance](https://docs.ragas.io/en/v0.3.3/concepts/metrics/available_metrics/answer_relevance/)
536
699
  - `answer_relevance_error`: (optional) error message if answer relevance evaluation failed
537
- - `answer_relevance_cost`: The LLM use cost of computing `answer_relevance`, in US dollars
538
700
  - `actual_steps`: (optional) copy of the steps in the evaluation target, if specified there
539
- - `steps_score`: a real number between 0 and 1, computed by comparing the results of the last executed steps to the output of the reference's last group of steps.
540
- - If there is no match in the actual steps, then the score is `0.0`
541
- - If the executed step's name is "retrieval" and the last reference group contains a retrieval step, then the score is the [recall at k](#context-recallk) of the retrieved document ids with respect to the reference.
542
- - Otherwise, the score is the number of the matched steps on the last group divided by the total number of steps in the last group.
543
- - `input_tokens`: input tokens usage
544
- - `output_tokens`: output tokens usage
545
- - `total_tokens`: total tokens usage
546
- - `elapsed_sec`: elapsed seconds
701
+ - `steps_score`: (optional) a real number between 0 and 1; see the section
702
+ [Steps score](#steps-score) for how it is calculated
703
+ - `input_tokens`: (optional) input tokens usage
704
+ - `output_tokens`: (optional) output tokens usage
705
+ - `total_tokens`: (optional) total tokens usage
706
+ - `elapsed_sec`: (optional) elapsed seconds
547
707
 
548
708
  All `actual_steps` with `name` "retrieval" contain:
549
709
  - `retrieval_answer_recall`: (optional) recall of the retrieved context with respect to the reference answer, if evaluation succeeds
550
- - `retrieval_answer_recall_reason`: (optional) LLM reasoning in evaluating `retrieval_answer_recall`
551
710
  - `retrieval_answer_recall_error`: (optional) error message if `retrieval_answer_recall` evaluation fails
552
- - `retrieval_answer_recall_cost`: cost of evaluating `retrieval_answer_recall`, in US dollars
553
711
  - `retrieval_answer_precision`: (optional) precision of the retrieved context with respect to the reference answer, if evaluation succeeds
554
712
  - `retrieval_answer_precision_error`: (optional) error message if `retrieval_answer_precision` evaluation fails
555
- - `retrieval_answer_precision_cost`: cost of evaluating `retrieval_answer_precision`, in US dollars
556
713
  - `retrieval_answer_f1`: (optional) F1 score of the retrieved context with respect to the reference answer, if `retrieval_answer_recall` and `retrieval_answer_precision` succeed
557
- - `retrieval_answer_f1_cost`: The sum of `retrieval_answer_recall_cost` and `retrieval_answer_precision_cost`
558
714
  - `retrieval_context_recall`: (optional) recall of the retrieved context with respect to the reference answer, if evaluation succeeds
559
715
  - `retrieval_context_recall_error`: (optional) error message if `retrieval_context_recall` evaluation fails
560
716
  - `retrieval_context_precision`: (optional) precision of the retrieved context with respect to the reference answer, if evaluation succeeds
561
717
  - `retrieval_context_precision_error`: (optional) error message if `retrieval_context_precision` evaluation fails
562
718
  - `retrieval_context_f1`: (optional) F1 score of the retrieved context with respect to the reference answer, if `retrieval_context_recall` and `retrieval_context_precision` succeed
563
719
 
564
-
565
720
  #### Aggregates Keys
566
721
 
567
722
  The `aggregates` object provides aggregated evaluation metrics. These aggregates support analysis of agent quality, token efficiency, and execution performance. Aggregates are computed:
@@ -610,7 +765,6 @@ Aggregates are:
610
765
  - `answer_precision`
611
766
  - `answer_f1`
612
767
  - `answer_relevance`
613
- - `answer_relevance_cost`
614
768
  - `retrieval_answer_recall`
615
769
  - `retrieval_answer_precision`
616
770
  - `retrieval_answer_f1`
@@ -627,7 +781,6 @@ Aggregates are:
627
781
  - `answer_precision`
628
782
  - `answer_f1`
629
783
  - `answer_relevance`
630
- - `answer_relevance_cost`
631
784
  - `retrieval_answer_recall`
632
785
  - `retrieval_answer_precision`
633
786
  - `retrieval_answer_f1`
@@ -667,12 +820,6 @@ per_template:
667
820
  mean: 0.9
668
821
  median: 0.9
669
822
  sum: 0.9
670
- answer_relevance_cost:
671
- min: 0.0007
672
- max: 0.0007
673
- mean: 0.0007
674
- median: 0.0007
675
- sum: 0.0007
676
823
  steps:
677
824
  total:
678
825
  autocomplete_search: 10
@@ -739,12 +886,6 @@ per_template:
739
886
  mean: 0.9
740
887
  median: 0.9
741
888
  sum: 0.9
742
- answer_relevance_cost:
743
- min: 0.0007
744
- max: 0.0007
745
- mean: 0.0007
746
- median: 0.0007
747
- sum: 0.0007
748
889
  steps:
749
890
  total:
750
891
  autocomplete_search: 10
@@ -809,12 +950,6 @@ per_template:
809
950
  mean: 0.9
810
951
  median: 0.9
811
952
  sum: 0.9
812
- answer_relevance_cost:
813
- min: 0.0007
814
- max: 0.0007
815
- mean: 0.0007
816
- median: 0.0007
817
- sum: 0.0007
818
953
  steps:
819
954
  total:
820
955
  autocomplete_search: 9
@@ -881,12 +1016,6 @@ per_template:
881
1016
  mean: 0.9
882
1017
  median: 0.9
883
1018
  sum: 0.9
884
- answer_relevance_cost:
885
- min: 0.0007
886
- max: 0.0007
887
- mean: 0.0007
888
- median: 0.0007
889
- sum: 0.0007
890
1019
  steps:
891
1020
  total:
892
1021
  autocomplete_search: 20
@@ -951,12 +1080,6 @@ micro:
951
1080
  mean: 0.9
952
1081
  median: 0.9
953
1082
  sum: 0.9
954
- answer_relevance_cost:
955
- min: 0.0007
956
- max: 0.0007
957
- mean: 0.0007
958
- median: 0.0007
959
- sum: 0.0007
960
1083
  steps_score:
961
1084
  sum: 17
962
1085
  mean: 0.4358974358974359
@@ -996,8 +1119,6 @@ macro:
996
1119
  mean: 1.0
997
1120
  answer_relevance:
998
1121
  mean: 0.9
999
- answer_relevance_cost:
1000
- mean: 0.0007
1001
1122
  steps_score:
1002
1123
  mean: 0.45
1003
1124
  input_tokens:
@@ -1010,7 +1131,54 @@ macro:
1010
1131
  mean: 25.911653497483996
1011
1132
  ```
1012
1133
 
1013
- ### SPARQL queries comparison
1134
+ ### Steps score
1135
+
1136
+ The steps score is a real number in the interval [0, 1],
1137
+ which indicates how closely the actual steps match the reference ones.
1138
+ A score of 1 indicates a perfect match.
1139
+
1140
+ Going in reverse order of the reference groups and in reverse order of the
1141
+ actual steps, for each reference group we try to match each step in the group
1142
+ to an actual step:
1143
+ - If all steps from the current reference group are matched, we proceed to the
1144
+ previous group, but we only search among the actual steps before the earliest
1145
+ actual step already matched for the current group. Thus, we ignore the
1146
+ execution order of steps within a reference group, but groups are matched
1147
+ in order: actual steps that match earlier groups must be executed before all
1148
+ steps matching later groups.
1149
+ - If some steps in the current reference group are not matched, then the
1150
+ matching stops, and the score is computed from the matches found so far.
1151
+
1152
+ An actual step can match at most one reference step, and only if the actual
1153
+ step is successful (i.e., it didn’t result in an error). There are a few ways
1154
+ a reference step can match an actual one. In all cases except for
1155
+ the "retrieval" steps, the matching score is either 0 or 1. A score above 0
1156
+ indicates a match.
1157
+
1158
+ - if both are named "sparql_query" and the "output_media_type" of the
1159
+ reference step is "application/sparql-results+json", then we try to match them
1160
+ using the [SPARQL queries comparison algorithm](#sparql-queries-comparison).
1161
+ The algorithm doesn't handle comparing the output of DESCRIBE or CONSTRUCT
1162
+ queries with the output of SELECT or ASK queries.
1163
+ - if both are named "retrieval" and the reference step has "output", then we
1164
+ compute [recall@k](#context-recallk).
1165
+ - if both are named "retrieve_time_series", then we check if the arguments of
1166
+ the steps match.
1167
+ - if both are named "retrieve_data_points", then we check if the arguments of
1168
+ the steps match.
1169
+ - if the reference step is named "iri_discovery" and the actual step name is
1170
+ "autocomplete_search", then we check if the IRI specified as "output" of the
1171
+ "iri_discovery" step is present in the "output" of the "autocomplete_search".
1172
+ - if the reference and actual step names are the same and the
1173
+ "output_media_type" of the reference step is "application/json", then the steps
1174
+ match if the JSON outputs are the same.
1175
+ - otherwise, we fall back to matching the outputs of the two steps.
1176
+
1177
+ The final steps score is the macro mean of scores over the groups. That is, it
1178
+ is the sum of the scores of all reference groups divided by their number. Each
1179
+ group score is the sum of scores of its steps divided by their number.
1180
+
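The final aggregation described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the helper name `macro_steps_score` is hypothetical, and it assumes the per-step matching scores are already computed, with unmatched steps contributing 0.

```python
def macro_steps_score(group_step_scores: list[list[float]]) -> float:
    """Macro mean: average the per-group means of step scores.

    Hypothetical helper, not part of graphrag_eval. Each inner list holds the
    matching scores (in [0, 1]) of the steps in one non-empty reference group.
    Assumption: steps without a match contribute a score of 0.
    """
    if not group_step_scores:
        raise ValueError("at least one reference group is required")
    # Score of a group = sum of its step scores divided by the number of steps.
    group_means = [sum(scores) / len(scores) for scores in group_step_scores]
    # Final score = sum of group scores divided by the number of groups.
    return sum(group_means) / len(group_means)


# Two reference groups: the first is half matched, the last is fully matched.
score = macro_steps_score([[1.0, 0.0], [1.0]])
print(score)  # 0.75
```

Because each group is averaged first, a group with many steps weighs the same as a group with a single step.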
1181
+ #### SPARQL queries comparison
1014
1182
 
1015
1183
  The algorithm iterates over all subsets of columns in the actual result of the same size as in the reference result.
1016
1184
  For each subset, it compares the set of columns (skipping optional columns).
@@ -1074,3 +1242,68 @@ average_precision(
1074
1242
  retrieved_docs=[1, 4, 3, 5, 7]
1075
1243
  ) # ~=> 0.8056
1076
1244
  ```
1245
+
1246
+ ### Custom Evaluation (Custom Metrics)
1247
+
1248
+ You can define your own metrics of system outputs to be evaluated using an
1249
+ LLM. To do this, specify each evaluation's name, inputs, outputs and instructions in
1250
+ a YAML file and pass the file path as a parameter to `run_evaluation()`. This
1251
+ will return your output metrics alongside the standard metrics described
1252
+ in previous sections.
1253
+
1254
+ One configuration file can define multiple custom evaluations, each of which
1255
+ will be done as a separate query to the LLM. Each evaluation can have multiple
1256
+ outputs. The format is shown in the example sections below.
1257
+
1258
+ See [Example Configuration File](#example-configuration-file).
1259
+
1260
+ #### Example Call to Evaluate Using Custom Metrics
1261
+
1262
+ ```python
1263
+ evaluation_results = await run_evaluation(
1264
+ reference_qa_dataset,
1265
+ chat_responses,
1266
+ "my_project/custom_eval.yaml"
1267
+ )
1268
+ ```
1269
+
1270
+ #### Example Output for Custom SPARQL Evaluation
1271
+
1272
+ With the
1273
+ [custom SPARQL evaluation](#example-custom-evaluation-configuration-file),
1274
+ the output is the same as in the section [Evaluation Results](#evaluation-results), except
1275
+ that it has the following additional keys and example values:
1276
+
1277
+ ```yaml
1278
+ my_answer_relevance: 0.9
1279
+ my_answer_relevance_eval_reason: The answer contains relevant information except for the sentence about total revenue
1280
+ sparql_recall: 0.75
1281
+ sparql_precision: 0.6
1282
+ sparql_eval_reason: The reference answer has 4 claims; there are 5 SPARQL results; 3 claims match
1283
+ ```
1284
+
1285
+ #### Output in Case of Evaluation Error
1286
+
1287
+ If there is an error during evaluation, the output will have `null` for all
1288
+ output keys specified by the custom evaluation and will have an additional key
1289
+ explaining the error. The key will be `{name}_error` where `name` is the custom
1290
+ evaluation name.
1291
+
1292
+ There are three types of errors:
1293
+ 1. The reference input is missing keys requested in the custom evaluation
1294
+ configuration. Example:
1295
+ `custom_1_error: Reference missing key 'reference_steps'`
1296
+ 1. The actual output to be evaluated is missing keys requested in the custom
1297
+ evaluation configuration. Example:
1298
+ `custom_1_error: Actual output missing 'actual_steps'`
1299
+ 1. The evaluating LLM output does not conform to the custom evaluation
1300
+ configuration. Example:
1301
+ `custom_1_error: "Expected 6 tab-separated values, got: 0.1\tCustom answer reason"`
1302
+
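The third error type can be illustrated with a small sketch. It assumes, based only on the example message above, that the evaluating LLM returns one line of tab-separated values, one per configured output key. The function `parse_llm_output` and its parameters are hypothetical and are not the package's API.

```python
def parse_llm_output(raw: str, output_keys: list[str], name: str) -> dict:
    """Map a tab-separated LLM reply onto the configured output keys.

    Hypothetical helper, not part of graphrag_eval. On a count mismatch,
    every output key is set to None and a `{name}_error` key is added.
    Values are returned as strings; numeric conversion is left out.
    """
    values = raw.strip().split("\t")
    if len(values) != len(output_keys):
        result = {key: None for key in output_keys}
        result[f"{name}_error"] = (
            f"Expected {len(output_keys)} tab-separated values, got: {raw}"
        )
        return result
    return dict(zip(output_keys, values))


keys = ["sparql_recall", "sparql_precision", "sparql_eval_reason"]
ok = parse_llm_output("0.75\t0.6\t3 claims match", keys, "sparql")
bad = parse_llm_output("0.1\tCustom answer reason", keys, "sparql")
print(ok["sparql_recall"])   # 0.75
print(bad["sparql_error"])
```

Returning `None` for every output key on failure keeps the result shape stable, so downstream aggregation can skip failed evaluations without special-casing missing keys.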
1303
+ #### Recommendations for Custom Evaluations
1304
+
1305
+ 1. Specify only a few outputs in each evaluation
1306
+ 1. Request an explanation output for output quantities you ask the LLM to count
1307
+ or estimate. You can ask for one explanation per quantity or one shared
1308
+ explanation for 2-3 quantities
1309
+