graphrag-eval 5.3.0__tar.gz → 6.0.0__tar.gz
- graphrag_eval-5.3.0/README.md → graphrag_eval-6.0.0/PKG-INFO +313 -80
- graphrag_eval-5.3.0/PKG-INFO → graphrag_eval-6.0.0/README.md +293 -100
- {graphrag_eval-5.3.0 → graphrag_eval-6.0.0}/graphrag_eval/aggregation.py +52 -20
- {graphrag_eval-5.3.0 → graphrag_eval-6.0.0}/graphrag_eval/answer_correctness.py +57 -26
- graphrag_eval-6.0.0/graphrag_eval/answer_relevance.py +29 -0
- graphrag_eval-6.0.0/graphrag_eval/custom_evaluation.py +204 -0
- graphrag_eval-6.0.0/graphrag_eval/evaluation.py +101 -0
- graphrag_eval-6.0.0/graphrag_eval/llm.py +55 -0
- graphrag_eval-6.0.0/graphrag_eval/steps/evaluation.py +181 -0
- graphrag_eval-6.0.0/graphrag_eval/steps/iri_discovery.py +20 -0
- graphrag_eval-6.0.0/graphrag_eval/steps/retrieval_answer.py +44 -0
- graphrag_eval-6.0.0/graphrag_eval/steps/retrieval_context_texts.py +45 -0
- graphrag_eval-6.0.0/graphrag_eval/steps/timeseries.py +182 -0
- {graphrag_eval-5.3.0 → graphrag_eval-6.0.0}/graphrag_eval/util.py +13 -6
- {graphrag_eval-5.3.0 → graphrag_eval-6.0.0}/pyproject.toml +12 -15
- graphrag_eval-5.3.0/graphrag_eval/answer_relevance.py +0 -37
- graphrag_eval-5.3.0/graphrag_eval/evaluation.py +0 -61
- graphrag_eval-5.3.0/graphrag_eval/steps/evaluation.py +0 -157
- graphrag_eval-5.3.0/graphrag_eval/steps/retrieval_answer.py +0 -65
- graphrag_eval-5.3.0/graphrag_eval/steps/retrieval_context_texts.py +0 -59
- {graphrag_eval-5.3.0 → graphrag_eval-6.0.0}/LICENSE +0 -0
- {graphrag_eval-5.3.0 → graphrag_eval-6.0.0}/graphrag_eval/__init__.py +0 -0
- {graphrag_eval-5.3.0 → graphrag_eval-6.0.0}/graphrag_eval/prompts/template.md +0 -0
- {graphrag_eval-5.3.0 → graphrag_eval-6.0.0}/graphrag_eval/steps/__init__.py +0 -0
- {graphrag_eval-5.3.0 → graphrag_eval-6.0.0}/graphrag_eval/steps/retrieval_context_ids.py +0 -0
- {graphrag_eval-5.3.0 → graphrag_eval-6.0.0}/graphrag_eval/steps/sparql.py +0 -0
|
@@ -1,10 +1,28 @@
|
|
|
1
|
+
Metadata-Version: 2.3
|
|
2
|
+
Name: graphrag-eval
|
|
3
|
+
Version: 6.0.0
|
|
4
|
+
Summary: For assessing question answering systems' final answers and intermediate steps, against a given set of questions, reference answers and steps.
|
|
5
|
+
License: Apache-2.0
|
|
6
|
+
Author: Philip Ganchev
|
|
7
|
+
Author-email: philip.ganchev@graphwise.ai
|
|
8
|
+
Requires-Python: >=3.12,<3.13
|
|
9
|
+
Classifier: License :: OSI Approved :: Apache Software License
|
|
10
|
+
Classifier: Programming Language :: Python :: 3
|
|
11
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
12
|
+
Provides-Extra: llm
|
|
13
|
+
Requires-Dist: pydantic (==2.12.5)
|
|
14
|
+
Requires-Dist: python-dateutil (==2.9.0.post0)
|
|
15
|
+
Requires-Dist: ragas (==0.4.3) ; extra == "llm"
|
|
16
|
+
Project-URL: Repository, https://github.com/Ontotext-AD/graphrag-eval
|
|
17
|
+
Description-Content-Type: text/markdown
|
|
18
|
+
|
|
1
19
|
<p align="center">
|
|
2
20
|
<img alt="Graphwise Logo" src="https://github.com/Ontotext-AD/graphrag-eval/blob/main/.github/Graphwise_Logo.jpg">
|
|
3
21
|
</p>
|
|
4
22
|
|
|
5
23
|
# QA Evaluation
|
|
6
24
|
|
|
7
|
-
This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used to reach the answer (such as orchestrated and executed steps), compared to the given reference steps.
|
|
25
|
+
This is a Python module for assessing the quality of question-answering systems, such as those based on LLM agents, using a set of questions and their reference answers. This includes evaluating the final answer and the steps used to reach it (such as orchestrated and executed steps) against the given reference steps. The library provides built-in evaluation metrics and allows users to define their own (custom) metrics.
|
|
8
26
|
|
|
9
27
|
## License
|
|
10
28
|
|
|
@@ -21,15 +39,15 @@ or add the following dependency in your `pyproject.toml` file:
|
|
|
21
39
|
graphrag-eval = "*"
|
|
22
40
|
```
|
|
23
41
|
|
|
24
|
-
To evaluate answer relevance and answer correctness:
|
|
42
|
+
To evaluate answer relevance and answer correctness, or to use a [custom evaluation](#custom-evaluation-custom-metrics):
|
|
25
43
|
|
|
26
44
|
```bash
|
|
27
|
-
pip install 'graphrag-eval[
|
|
45
|
+
pip install 'graphrag-eval[llm]'
|
|
28
46
|
```
|
|
29
47
|
|
|
30
48
|
or add the following dependency in your `pyproject.toml` file:
|
|
31
49
|
```toml
|
|
32
|
-
graphrag-eval = {version = "*", extras = ["
|
|
50
|
+
graphrag-eval = {version = "*", extras = ["llm"]}
|
|
33
51
|
```
|
|
34
52
|
|
|
35
53
|
## Maintainers
|
|
@@ -42,8 +60,12 @@ For issues or feature requests, please open [a GitHub issue](https://github.com/
|
|
|
42
60
|
To evaluate only correctness of final answers (system responses), you can clone this repository and run the code on the command line:
|
|
43
61
|
|
|
44
62
|
1. Prepare an input TSV file with columns `Question`, `Reference answer` and `Actual answer`
|
|
45
|
-
1. Execute `poetry install --with
|
|
46
|
-
1. Execute
|
|
63
|
+
1. Execute `poetry install --with llm`
|
|
64
|
+
1. Execute
|
|
65
|
+
```<LLM_ACCESS_VARIABLE>=<your_api_key> poetry run answer-correctness -i <input_file.tsv> -o <output_file.tsv>```
|
|
66
|
+
replacing `<LLM_ACCESS_VARIABLE>` with the environment variable that your LLM provider uses for its API key.
|
|
67
|
+
Example:
|
|
68
|
+
```OPENAI_API_KEY=XXX poetry run answer-correctness -i reference.tsv -o evaluations.tsv```
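The input file from step 1 can be produced with Python's standard `csv` module. The sketch below is only an illustration: the column names come from step 1, while the helper name `write_input_tsv`, the sample row and the temporary output location are assumptions made for this example and are not part of the library.

```python
import csv
import tempfile
from pathlib import Path

# Column names taken from step 1 of the list above.
COLUMNS = ["Question", "Reference answer", "Actual answer"]


def write_input_tsv(path: Path, rows: list[dict[str, str]]) -> None:
    """Write rows to a tab-separated file with the three expected columns."""
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical sample data, for illustration only.
sample_rows = [
    {
        "Question": "Which transformers are in Substation Oslo?",
        "Reference answer": "OSLO T1 and OSLO T2",
        "Actual answer": "Transformers OSLO T1 and OSLO T2 are in Substation Oslo.",
    }
]

# Written to the system temporary directory so the sketch has no side effects
# on the working directory; use any path you like in practice.
out_path = Path(tempfile.gettempdir()) / "reference.tsv"
write_input_tsv(out_path, sample_rows)
```

The resulting file can then be passed to the command above with `-i`.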
|
|
47
69
|
|
|
48
70
|
We plan to improve CLI support in future releases.
|
|
49
71
|
|
|
@@ -55,16 +77,153 @@ To evaluate answers and/or steps:
|
|
|
55
77
|
1. Format the answers and/or steps you want to evaluate: section [Responses to evaluate](#Responses-to-evaluate)
|
|
56
78
|
1. To evaluate answer relevance:
|
|
57
79
|
1. Include `actual_answer` in the target data to evaluate
|
|
58
|
-
1. Set environment variable
|
|
80
|
+
1. Set the appropriate environment variable (e.g., `OPENAI_API_KEY`) with your LLM access key
|
|
59
81
|
1. To evaluate answer correctness:
|
|
60
82
|
1. Include `reference_answer` in the reference dataset and `actual_answer` in the target data to evaluate
|
|
61
|
-
1. Set environment variable
|
|
83
|
+
1. Set the appropriate environment variable (e.g., `OPENAI_API_KEY`) with your LLM access key
|
|
62
84
|
1. To evaluate steps:
|
|
63
85
|
1. Include `reference_steps` in the reference data and `actual_steps` in target data to evaluate
|
|
86
|
+
1. If you want to evaluate metrics that require an LLM, write a [configuration file](#configuration).
|
|
64
87
|
1. Call the evaluation function with the reference data and target data: section [Usage Code](#Usage-Code)
|
|
65
|
-
1. Call the aggregation function with the evaluation results
|
|
88
|
+
1. Call the aggregation function with the evaluation results: section [Usage Code](#Usage-Code)
|
|
89
|
+
|
|
90
|
+
### LLM use in evaluation
|
|
91
|
+
|
|
92
|
+
The following metrics use an LLM which must be configured using a [configuration](#configuration) file:
|
|
93
|
+
* answer metrics:
|
|
94
|
+
* `answer_recall`
|
|
95
|
+
* `answer_precision`
|
|
96
|
+
* `answer_f1`
|
|
97
|
+
* `answer_relevance`
|
|
98
|
+
* retrieval context metrics:
|
|
99
|
+
* `retrieval_answer_recall`
|
|
100
|
+
* `retrieval_answer_precision`
|
|
101
|
+
* `retrieval_answer_f1`
|
|
102
|
+
* `retrieval_context_recall`
|
|
103
|
+
* `retrieval_context_precision`
|
|
104
|
+
* `retrieval_context_f1`
|
|
105
|
+
* [custom evaluation](#custom-evaluation-custom-metrics)
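The `*_recall`, `*_precision` and `*_f1` names above follow the usual information-retrieval convention. As a rough illustration, and assuming the library uses the standard definition (this document does not spell out the formula), F1 is the harmonic mean of precision and recall. The function name `f1_score` is hypothetical and not part of the library.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A perfect precision and recall give a perfect F1.
print(f1_score(1.0, 1.0))
# F1 is pulled toward the lower of the two values.
print(f1_score(0.5, 1.0))
```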
|
|
106
|
+
|
|
107
|
+
Supported LLMs are all those supported by the [`litellm`](https://github.com/BerriAI/litellm) library, including all major LLMs and local models via Ollama.
|
|
108
|
+
|
|
109
|
+
If no LLM is configured or the `config_file_path` parameter is not provided, these metrics are not evaluated.
|
|
110
|
+
|
|
111
|
+
### Configuration
|
|
112
|
+
|
|
113
|
+
The configuration has two sections, `llm` and `custom_evaluations`:
|
|
114
|
+
|
|
115
|
+
* `llm`: required for [LLM-based metrics](#llm-use-in-evaluation). The following keys are required:
|
|
116
|
+
* `generation`: required. The following keys are required:
|
|
117
|
+
* `provider`: (str) name of the organization providing the generation model, as supported by LiteLLM
|
|
118
|
+
* `model`: (str) name of the generation model
|
|
119
|
+
* `temperature`: (float in the range [0.0, 2.0]) sampling temperature for generation
|
|
120
|
+
* `max_tokens`: (int > 0) maximum number of tokens to generate
|
|
121
|
+
* Optional keys: parameters to be passed to LiteLLM for generation (for [`answer_correctness`](#output-keys) and [custom evaluation](#custom-evaluation-custom-metrics)). Examples:
|
|
122
|
+
* `base_url`: (str) base URL for the generation model, alternative to the provider's default URL
|
|
123
|
+
* `api_key`: (str) API key for the generation model, alternative to setting the environment variable corresponding to the provider (e.g. `OPENAI_API_KEY` for OpenAI)
|
|
124
|
+
* `embedding`: required for [`answer_relevance`](#output-keys).
|
|
125
|
+
* `provider`: (str) name of the organization providing the embedding model
|
|
126
|
+
* `model`: (str) name of the embedding model
|
|
127
|
+
* `custom_evaluations`: (list of the following maps) required and non-empty for [custom evaluation](#custom-evaluation-custom-metrics). Each map has keys:
|
|
128
|
+
* `name`: (str) name of the evaluation
|
|
129
|
+
* `inputs`: (list[str]) list of input variables. Any combination of the following:
|
|
130
|
+
* `question`
|
|
131
|
+
* `reference_answer`
|
|
132
|
+
* `reference_steps`
|
|
133
|
+
* `actual_answer`
|
|
134
|
+
* `actual_steps`
|
|
135
|
+
* `steps_keys`: (list[str]; required if `inputs` contains `actual_steps` or `reference_steps`) one or both of:
|
|
136
|
+
* `args`
|
|
137
|
+
* `output`
|
|
138
|
+
* `steps_name`: (str; required if `inputs` contains `actual_steps` or `reference_steps`) the type (name) of steps to include in the evaluation
|
|
139
|
+
* `instructions`: (str) instructions for the evaluation
|
|
140
|
+
* `outputs`: (map[str]) output variable names and descriptions
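The rules listed above can be sanity-checked before running an evaluation. The following is a minimal sketch, not the library's own validation: it works on an already-parsed configuration (a plain `dict`, as a YAML loader would produce), and the names `check_config`, `ALLOWED_INPUTS`, `STEP_INPUTS` and `ALLOWED_STEPS_KEYS` are hypothetical.

```python
# Values copied from the key descriptions above.
ALLOWED_INPUTS = {"question", "reference_answer", "reference_steps",
                  "actual_answer", "actual_steps"}
STEP_INPUTS = {"actual_steps", "reference_steps"}
ALLOWED_STEPS_KEYS = {"args", "output"}


def check_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means none were found."""
    problems = []
    generation = config.get("llm", {}).get("generation", {})
    for key in ("provider", "model", "temperature", "max_tokens"):
        if key not in generation:
            problems.append(f"llm.generation.{key} is missing")
    temperature = generation.get("temperature")
    if temperature is not None and not 0.0 <= temperature <= 2.0:
        problems.append("llm.generation.temperature must be in [0.0, 2.0]")
    max_tokens = generation.get("max_tokens")
    if max_tokens is not None and max_tokens <= 0:
        problems.append("llm.generation.max_tokens must be > 0")

    for evaluation in config.get("custom_evaluations", []):
        name = evaluation.get("name", "<unnamed>")
        inputs = set(evaluation.get("inputs", []))
        unknown = inputs - ALLOWED_INPUTS
        if unknown:
            problems.append(f"{name}: unknown inputs {sorted(unknown)}")
        # steps_keys and steps_name are required only when steps are inputs.
        if inputs & STEP_INPUTS:
            for key in ("steps_keys", "steps_name"):
                if key not in evaluation:
                    problems.append(f"{name}: {key} is required with step inputs")
        bad_keys = set(evaluation.get("steps_keys", [])) - ALLOWED_STEPS_KEYS
        if bad_keys:
            problems.append(f"{name}: unknown steps_keys {sorted(bad_keys)}")
    return problems


# Illustrative configuration, written as the dict a YAML loader would return.
example_config = {
    "llm": {
        "generation": {"provider": "openai", "model": "gpt-4o-mini",
                       "temperature": 0.0, "max_tokens": 65536},
        "embedding": {"provider": "openai", "model": "text-embedding-3-small"},
    },
    "custom_evaluations": [
        {"name": "my_answer_relevance",
         "inputs": ["question", "actual_answer"],
         "instructions": "Evaluate how relevant the answer is to the question.",
         "outputs": {"my_answer_relevance": "fraction between 0 and 1"}},
    ],
}
print(check_config(example_config))
```

Returning a list of messages rather than raising on the first problem makes it easier to fix several configuration mistakes in one pass.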
|
|
141
|
+
|
|
142
|
+
#### Example Configuration File With LLM Configuration
|
|
143
|
+
|
|
144
|
+
Below is a YAML file that configures the LLM generation (for [metrics that require an LLM](#llm-use-in-evaluation)) and embedding (for [`answer_relevance`](#output-keys)). It assumes that the environment variable `OPENAI_API_KEY` is set with your OpenAI API key.
|
|
145
|
+
|
|
146
|
+
```YAML
|
|
147
|
+
llm:
|
|
148
|
+
generation:
|
|
149
|
+
provider: openai
|
|
150
|
+
model: gpt-4o-mini
|
|
151
|
+
temperature: 0.0
|
|
152
|
+
max_tokens: 65536
|
|
153
|
+
embedding:
|
|
154
|
+
provider: openai
|
|
155
|
+
model: text-embedding-3-small
|
|
156
|
+
```
|
|
157
|
+
|
|
158
|
+
#### Example Configuration File With LLM Configuration and API keys
|
|
159
|
+
|
|
160
|
+
Below is a YAML file that configures the LLM generation (for [metrics that require an LLM](#llm-use-in-evaluation)) and embedding (for [`answer_relevance`](#output-keys)) with different API keys in place of environment variables.
|
|
161
|
+
|
|
162
|
+
```YAML
|
|
163
|
+
llm:
|
|
164
|
+
generation:
|
|
165
|
+
provider: azure
|
|
166
|
+
model: graphrag-eval-system-tests-gpt-5.2
|
|
167
|
+
base_url: https://my-generator.openai.azure.com
|
|
168
|
+
temperature: 0.0
|
|
169
|
+
max_tokens: 8192
|
|
170
|
+
api_key: ...
|
|
171
|
+
embedding:
|
|
172
|
+
provider: azure
|
|
173
|
+
model: graphrag-eval-system-tests-text-embedding-3-small
|
|
174
|
+
api_base: https://my-embedder.openai.azure.com
|
|
175
|
+
api_key: ...
|
|
176
|
+
```
|
|
66
177
|
|
|
67
|
-
|
|
178
|
+
#### Example Configuration File With Custom Evaluations
|
|
179
|
+
|
|
180
|
+
Below is a YAML file that defines two custom evaluations:
|
|
181
|
+
1. a simple relevance evaluation
|
|
182
|
+
1. a SPARQL retrieval evaluation using the reference answer
|
|
183
|
+
|
|
184
|
+
This is an example of the format and may not create accurate evaluations.
|
|
185
|
+
|
|
186
|
+
```YAML
|
|
187
|
+
llm:
|
|
188
|
+
generation:
|
|
189
|
+
provider: openai
|
|
190
|
+
model: gpt-4o-mini
|
|
191
|
+
temperature: 0.0
|
|
192
|
+
max_tokens: 65536
|
|
193
|
+
embedding:
|
|
194
|
+
provider: openai
|
|
195
|
+
model: text-embedding-3-small
|
|
196
|
+
custom_evaluations:
|
|
197
|
+
-
|
|
198
|
+
name: my_answer_relevance
|
|
199
|
+
inputs:
|
|
200
|
+
- question
|
|
201
|
+
- actual_answer
|
|
202
|
+
instructions: |
|
|
203
|
+
Evaluate how relevant the answer is to the question.
|
|
204
|
+
outputs:
|
|
205
|
+
my_answer_relevance: fraction between 0 and 1
|
|
206
|
+
my_answer_relevance_reason: reason for your evaluation
|
|
207
|
+
-
|
|
208
|
+
name: sparql_llm_evaluation
|
|
209
|
+
inputs:
|
|
210
|
+
- question
|
|
211
|
+
- reference_answer
|
|
212
|
+
- actual_steps
|
|
213
|
+
steps_keys:
|
|
214
|
+
- output
|
|
215
|
+
steps_name: sparql
|
|
216
|
+
instructions: |
|
|
217
|
+
Divide the reference answer into claims and try to match each claim to the
|
|
218
|
+
SPARQL query results. Count the:
|
|
219
|
+
- reference claims
|
|
220
|
+
- SPARQL results
|
|
221
|
+
- matching claims
|
|
222
|
+
outputs:
|
|
223
|
+
sparql_recall: Number of matching claims as a fraction of reference claims (fraction 0-1)
|
|
224
|
+
sparql_precision: Number of matching claims as a fraction of SPARQL results (fraction 0-1)
|
|
225
|
+
sparql_reason: reason for your evaluation
|
|
226
|
+
```
|
|
68
227
|
|
|
69
228
|
### Reference Q&A Data
|
|
70
229
|
|
|
@@ -81,7 +240,7 @@ The assumption is that the final answer to the question is derived from the outp
|
|
|
81
240
|
|
|
82
241
|
Each step includes:
|
|
83
242
|
|
|
84
|
-
- `name`: The
|
|
243
|
+
- `name`: The name of the step (e.g., `sparql_query`)
|
|
85
244
|
- `args`: Arguments of the step (e.g., arguments to a tool used in the step, such as a SPARQL query)
|
|
86
245
|
- `output`: The expected output from the step.
|
|
87
246
|
- `output_media_type`: (optional, missing or one of `application/sparql-results+json`, `application/json`) Indicates how the output of a step must be processed
|
|
@@ -270,14 +429,15 @@ Given a question, if the question-answering system successfully responds, to eva
|
|
|
270
429
|
"elapsed_sec": 46.48961806297302,
|
|
271
430
|
"actual_steps": [
|
|
272
431
|
{
|
|
273
|
-
|
|
274
|
-
|
|
275
|
-
|
|
276
|
-
|
|
277
|
-
|
|
278
|
-
|
|
279
|
-
|
|
280
|
-
|
|
432
|
+
"name": "retrieval",
|
|
433
|
+
"args": {
|
|
434
|
+
"query": "transformers Substation OSLO",
|
|
435
|
+
"k": 2
|
|
436
|
+
},
|
|
437
|
+
"id": "call_3",
|
|
438
|
+
"status": "success",
|
|
439
|
+
"output": "[\n {\n \"id\": \"http://example.com/resource/doc/1\",\n \"text\": \"Transformer OSLO T1 is in Substation Oslo.\"\n },\n {\n \"id\": \"http://example.com/resource/doc/2\",\n \"text\": \"Transformer OSLO T2 is in Substation Oslo.\"\n }\n]",
|
|
440
|
+
"execution_timestamp": "2025-12-17T09:15:53Z"
|
|
281
441
|
},
|
|
282
442
|
{
|
|
283
443
|
"name": "autocomplete_search",
|
|
@@ -287,7 +447,8 @@ Given a question, if the question-answering system successfully responds, to eva
|
|
|
287
447
|
},
|
|
288
448
|
"id": "call_7amIsfEGelOnVZ1DWtgtc0hc",
|
|
289
449
|
"status": "success",
|
|
290
|
-
"output": "{\n \"head\": {\n \"vars\": [\n \"iri\",\n \"name\",\n \"rank\"\n ]\n },\n \"results\": {\n \"bindings\": [\n {\n \"iri\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769664-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"name\": {\n \"type\": \"literal\",\n \"value\": \"STAVANGER\"\n },\n \"rank\": {\n \"datatype\": \"http://www.w3.org/2001/XMLSchema#float\",\n \"type\": \"literal\",\n \"value\": \"0.00976\"\n }\n }\n ]\n }\n}"
|
|
450
|
+
"output": "{\n \"head\": {\n \"vars\": [\n \"iri\",\n \"name\",\n \"rank\"\n ]\n },\n \"results\": {\n \"bindings\": [\n {\n \"iri\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769664-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"name\": {\n \"type\": \"literal\",\n \"value\": \"STAVANGER\"\n },\n \"rank\": {\n \"datatype\": \"http://www.w3.org/2001/XMLSchema#float\",\n \"type\": \"literal\",\n \"value\": \"0.00976\"\n }\n }\n ]\n }\n}",
|
|
451
|
+
"execution_timestamp": "2025-12-17T09:15:58Z"
|
|
291
452
|
},
|
|
292
453
|
{
|
|
293
454
|
"name": "sparql_query",
|
|
@@ -296,7 +457,8 @@ Given a question, if the question-answering system successfully responds, to eva
|
|
|
296
457
|
},
|
|
297
458
|
"id": "call_DbMkZ8kv3qkf49wNNdREkpRN",
|
|
298
459
|
"status": "error",
|
|
299
|
-
"error": "Error: ValueError('The following prefixes are undefined: urn')\n Please fix your mistakes."
|
|
460
|
+
"error": "Error: ValueError('The following prefixes are undefined: urn')\n Please fix your mistakes.",
|
|
461
|
+
"execution_timestamp": "2025-12-17T09:16:03Z"
|
|
300
462
|
},
|
|
301
463
|
{
|
|
302
464
|
"name": "sparql_query",
|
|
@@ -305,7 +467,8 @@ Given a question, if the question-answering system successfully responds, to eva
|
|
|
305
467
|
},
|
|
306
468
|
"id": "call_Qm1mzX7g5q9SVPrR2QzEMTp3",
|
|
307
469
|
"status": "success",
|
|
308
|
-
"output": "{\n \"head\": {\n \"vars\": [\n \"substation2\",\n \"substation2Name\",\n \"line\",\n \"lineName\"\n ]\n },\n \"results\": {\n \"bindings\": [\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176965a-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KRISTIANSAND\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:47332c85-17a4-7a41-abd5-2b92e4b800ed\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 300KRISTIAN-STAVANGE\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769614-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"DAGALI\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420DAGALI-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420DAGALI-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420DAGALI-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"OSLO\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": 
\"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769648-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"SYSLE\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176962a-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"AURLAND\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n 
\"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769624-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"SIMA\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769642-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"SYLLING\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYLLING-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"OSLO\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYLLING-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 
420SYLLING-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYLLING-HAGAFOSS\"\n }\n }\n ]\n }\n}"
|
|
470
|
+
"output": "{\n \"head\": {\n \"vars\": [\n \"substation2\",\n \"substation2Name\",\n \"line\",\n \"lineName\"\n ]\n },\n \"results\": {\n \"bindings\": [\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176965a-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KRISTIANSAND\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:47332c85-17a4-7a41-abd5-2b92e4b800ed\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 300KRISTIAN-STAVANGE\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769614-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"DAGALI\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420DAGALI-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420DAGALI-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420DAGALI-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"OSLO\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": 
\"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769648-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"SYSLE\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176962a-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"AURLAND\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n 
\"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769624-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"SIMA\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769642-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"SYLLING\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYLLING-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"OSLO\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYLLING-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 
420SYLLING-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYLLING-HAGAFOSS\"\n }\n }\n ]\n }\n}",
|
|
471
|
+
"execution_timestamp": "2025-12-17T09:16:08Z"
|
|
309
472
|
}
|
|
310
473
|
],
|
|
311
474
|
"actual_answer": "The substations connected to the substation \"STAVANGER\" via AC or DC lines are:\n\n1. **KRISTIANSAND** via line \"LC 300KRISTIAN-STAVANGE\"\n2. **DAGALI** via line \"LC 420DAGALI-HAGAFOSS\"\n3. **KVILLDAL** via lines:\n - \"LC 420DAGALI-HAGAFOSS\"\n - \"LC 420SYSLE-HAGAFOSS\"\n - \"LC 420AURLAND-HAGAFOSS\"\n - \"LC 420SYLLING-HAGAFOSS\"\n4. **HAGAFOSS** via lines:\n - \"LC 420DAGALI-HAGAFOSS\"\n - \"LC 420SYSLE-HAGAFOSS\"\n - \"LC 420AURLAND-HAGAFOSS\"\n - \"LC 420SYLLING-HAGAFOSS\"\n5. **OSLO** via lines:\n - \"LC 420SYSLE-HAGAFOSS\"\n - \"LC 420SYLLING-HAGAFOSS\"\n6. **SYSLE** via line \"LC 420SYSLE-HAGAFOSS\"\n7. **AURLAND** via line \"LC 420AURLAND-HAGAFOSS\"\n8. **SIMA** via line \"LC 420AURLAND-HAGAFOSS\"\n9. **SYLLING** via line \"LC 420SYLLING-HAGAFOSS\""
|
|
@@ -331,7 +494,7 @@ from graphrag_eval import run_evaluation, compute_aggregates
|
|
|
331
494
|
|
|
332
495
|
reference_qas: list[dict] = [] # read your reference data
|
|
333
496
|
chat_responses: dict = {} # call your implementation to get the response
|
|
334
|
-
evaluation_results = run_evaluation(reference_qas, chat_responses)
|
|
497
|
+
evaluation_results = await run_evaluation(reference_qas, chat_responses)
|
|
335
498
|
aggregates = compute_aggregates(evaluation_results)
|
|
336
499
|
```
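Because `run_evaluation` is awaited in the snippet above, the call has to happen inside a coroutine; a plain script can drive it with `asyncio.run`. The sketch below shows only that pattern. `fake_run_evaluation` is a hypothetical stand-in defined here so the example runs without the library or an LLM; in real code you would await `run_evaluation` from `graphrag_eval` instead, and its return structure will differ.

```python
import asyncio


async def fake_run_evaluation(reference_qas: list[dict], chat_responses: dict) -> list[dict]:
    """Hypothetical stand-in for run_evaluation; not the library's real API."""
    await asyncio.sleep(0)  # placeholder for awaiting LLM calls
    return [{"reference_items": len(reference_qas), "responses": len(chat_responses)}]


async def main() -> list[dict]:
    reference_qas: list[dict] = []  # read your reference data
    chat_responses: dict = {}  # call your implementation to get the response
    # In real code: `await run_evaluation(reference_qas, chat_responses)`.
    return await fake_run_evaluation(reference_qas, chat_responses)


# asyncio.run creates an event loop, runs the coroutine and closes the loop.
evaluation_results = asyncio.run(main())
print(evaluation_results)
```

In an environment that already runs an event loop (for example a notebook), `await` can typically be used directly instead of `asyncio.run`.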
|
|
337
500
|
|
|
@@ -365,7 +528,7 @@ The output is a list of statistics for each question from the reference Q&A data
|
|
|
365
528
|
"text": "Transformer OSLO T2 is in Substation Oslo."
|
|
366
529
|
}
|
|
367
530
|
]
|
|
368
|
-
|
|
531
|
+
- name: sparql_query
|
|
369
532
|
args:
|
|
370
533
|
query: |2
|
|
371
534
|
|
|
@@ -404,7 +567,6 @@ The output is a list of statistics for each question from the reference Q&A data
|
|
|
404
567
|
answer_precision: 1.0
|
|
405
568
|
answer_f1: 1.0
|
|
406
569
|
answer_relevance: 0.9
|
|
407
|
-
answer_relevance_cost: 0.0007
|
|
408
570
|
actual_steps:
|
|
409
571
|
- name: retrieval
|
|
410
572
|
id: call_3
|
|
@@ -423,13 +585,10 @@ The output is a list of statistics for each question from the reference Q&A data
|
|
|
423
585
|
"text": "Transformer OSLO T2 is in Substation Oslo."
|
|
424
586
|
}
|
|
425
587
|
]
|
|
588
|
+
execution_timestamp: '2025-12-17T09:15:53Z'
|
|
426
589
|
retrieval_answer_recall: 1.0
|
|
427
|
-
retrieval_answer_recall_reason: The context contains all the transformers listed in the reference answer
|
|
428
|
-
retrieval_answer_recall_cost: 0.0007
|
|
429
590
|
retrieval_answer_precision: 1.0
|
|
430
|
-
retrieval_answer_precision_cost: 0.0003
|
|
431
591
|
retrieval_answer_f1: 1.0
|
|
432
|
-
retrieval_answer_f1_cost: 0.001
|
|
433
592
|
- name: autocomplete_search
|
|
434
593
|
args:
|
|
435
594
|
query: OSLO
|
|
@@ -465,6 +624,7 @@ The output is a list of statistics for each question from the reference Q&A data
|
|
|
465
624
|
]
|
|
466
625
|
}
|
|
467
626
|
}
|
|
627
|
+
execution_timestamp: '2025-12-17T09:15:58Z'
|
|
468
628
|
- name: sparql_query
|
|
469
629
|
args:
|
|
470
630
|
query: |-
|
|
@@ -508,6 +668,7 @@ The output is a list of statistics for each question from the reference Q&A data
|
|
|
508
668
|
]
|
|
509
669
|
}
|
|
510
670
|
}
|
|
671
|
+
execution_timestamp: '2025-12-17T09:16:03Z'
|
|
511
672
|
steps_score: 1
|
|
512
673
|
input_tokens: 221339
|
|
513
674
|
output_tokens: 212
|
|
@@ -520,9 +681,11 @@ The output is a list of statistics for each question from the reference Q&A data
|
|
|
520
681
|
- `template_id`: the template id
|
|
521
682
|
- `question_id`: the question id
|
|
522
683
|
- `question_text`: the natural language query
|
|
523
|
-
- `reference_steps`: (optional) copy of the expected steps in the Q&A dataset, if specified there
|
|
524
|
-
- `reference_answer`: (optional) copy of the expected answer in the Q&A dataset, if specified there
|
|
525
684
|
- `status`: "success" or "error", indicating whether the evaluation succeeded
|
|
685
|
+
- `reference_steps`: (optional) copy of the expected steps in the Q&A dataset,
|
|
686
|
+
if specified there. An additional key "matches" is added to the steps that are
|
|
687
|
+
matched.
|
|
688
|
+
- `reference_answer`: (optional) copy of the expected answer in the Q&A dataset, if specified there
|
|
526
689
|
- `actual_answer`: (optional) copy of the response text in the evaluation target, if specified there
|
|
527
690
|
- `answer_reference_claims_count`: (optional) number of claims extracted from the reference answer, if a reference answer and actual answer are available
|
|
528
691
|
- `answer_actual_claims_count`: (optional) number of claims extracted from the answer being evaluated, if a reference answer and actual answer are available
|
|
@@ -534,34 +697,26 @@ The output is a list of statistics for each question from the reference Q&A data
|
|
|
534
697
|
- `answer_f1`: (optional) Harmonic mean of `answer_recall` and `answer_precision`
|
|
535
698
|
- `answer_relevance`: (optional) a value representing how relevant the actual answer is to the question, computed using [RAGAS answer relevance](https://docs.ragas.io/en/v0.3.3/concepts/metrics/available_metrics/answer_relevance/)
|
|
536
699
|
- `answer_relevance_error`: (optional) error message if answer relevance evaluation failed
|
|
537
|
-
- `answer_relevance_cost`: The LLM use cost of computing `answer_relevance`, in US dollars
|
|
538
700
|
- `actual_steps`: (optional) copy of the steps in the evaluation target, if specified there
|
|
539
|
-
- `steps_score`: a real number between 0 and 1,
|
|
540
|
-
|
|
541
|
-
|
|
542
|
-
|
|
543
|
-
- `
|
|
544
|
-
- `
|
|
545
|
-
- `total_tokens`: total tokens usage
|
|
546
|
-
- `elapsed_sec`: elapsed seconds
|
|
701
|
+
- `steps_score`: (optional) a real number between 0 and 1; see how the steps score
|
|
702
|
+
is calculated in the section [Steps score](#steps-score)
|
|
703
|
+
- `input_tokens`: (optional) input tokens usage
|
|
704
|
+
- `output_tokens`: (optional) output tokens usage
|
|
705
|
+
- `total_tokens`: (optional) total tokens usage
|
|
706
|
+
- `elapsed_sec`: (optional) elapsed seconds
|
|
547
707
|
|
|
548
708
|
All `actual_steps` with `name` "retrieval" contain:
|
|
549
709
|
- `retrieval_answer_recall`: (optional) recall of the retrieved context with respect to the reference answer, if evaluation succeeds
|
|
550
|
-
- `retrieval_answer_recall_reason`: (optional) LLM reasoning in evaluating `retrieval_answer_recall`
|
|
551
710
|
- `retrieval_answer_recall_error`: (optional) error message if `retrieval_answer_recall` evaluation fails
|
|
552
|
-
- `retrieval_answer_recall_cost`: cost of evaluating `retrieval_answer_recall`, in US dollars
|
|
553
711
|
- `retrieval_answer_precision`: (optional) precision of the retrieved context with respect to the reference answer, if evaluation succeeds
|
|
554
712
|
- `retrieval_answer_precision_error`: (optional) error message if `retrieval_answer_precision` evaluation fails
|
|
555
|
-
- `retrieval_answer_precision_cost`: cost of evaluating `retrieval_answer_precision`, in US dollars
|
|
556
713
|
- `retrieval_answer_f1`: (optional) F1 score of the retrieved context with respect to the reference answer, if `retrieval_answer_recall` and `retrieval_answer_precision` succeed
|
|
557
|
-
- `retrieval_answer_f1_cost`: The sum of `retrieval_answer_recall_cost` and `retrieval_answer_precision_cost`
|
|
558
714
|
- `retrieval_context_recall`: (optional) recall of the retrieved context with respect to the reference answer, if evaluation succeeds
|
|
559
715
|
- `retrieval_context_recall_error`: (optional) error message if `retrieval_context_recall` evaluation fails
|
|
560
716
|
- `retrieval_context_precision`: (optional) precision of the retrieved context with respect to the reference answer, if evaluation succeeds
|
|
561
717
|
- `retrieval_context_precision_error`: (optional) error message if `retrieval_context_precision` evaluation fails
|
|
562
718
|
- `retrieval_context_f1`: (optional) F1 score of the retrieved context with respect to the reference answer, if `retrieval_context_recall` and `retrieval_context_precision` succeed
|
|
563
719
|
|
|
564
|
-
|
|
565
720
|
#### Aggregates Keys
|
|
566
721
|
|
|
567
722
|
The `aggregates` object provides aggregated evaluation metrics. These aggregates support analysis of agent quality, token efficiency, and execution performance. Aggregates are computed:
|
|
@@ -610,7 +765,6 @@ Aggregates are:
|
|
|
610
765
|
- `answer_precision`
|
|
611
766
|
- `answer_f1`
|
|
612
767
|
- `answer_relevance`
|
|
613
|
-
- `answer_relevance_cost`
|
|
614
768
|
- `retrieval_answer_recall`
|
|
615
769
|
- `retrieval_answer_precision`
|
|
616
770
|
- `retrieval_answer_f1`
|
|
@@ -627,7 +781,6 @@ Aggregates are:
|
|
|
627
781
|
- `answer_precision`
|
|
628
782
|
- `answer_f1`
|
|
629
783
|
- `answer_relevance`
|
|
630
|
-
- `answer_relevance_cost`
|
|
631
784
|
- `retrieval_answer_recall`
|
|
632
785
|
- `retrieval_answer_precision`
|
|
633
786
|
- `retrieval_answer_f1`
|
|
@@ -667,12 +820,6 @@ per_template:
|
|
|
667
820
|
mean: 0.9
|
|
668
821
|
median: 0.9
|
|
669
822
|
sum: 0.9
|
|
670
|
-
answer_relevance_cost:
|
|
671
|
-
min: 0.0007
|
|
672
|
-
max: 0.0007
|
|
673
|
-
mean: 0.0007
|
|
674
|
-
median: 0.0007
|
|
675
|
-
sum: 0.0007
|
|
676
823
|
steps:
|
|
677
824
|
total:
|
|
678
825
|
autocomplete_search: 10
|
|
@@ -739,12 +886,6 @@ per_template:
|
|
|
739
886
|
mean: 0.9
|
|
740
887
|
median: 0.9
|
|
741
888
|
sum: 0.9
|
|
742
|
-
answer_relevance_cost:
|
|
743
|
-
min: 0.0007
|
|
744
|
-
max: 0.0007
|
|
745
|
-
mean: 0.0007
|
|
746
|
-
median: 0.0007
|
|
747
|
-
sum: 0.0007
|
|
748
889
|
steps:
|
|
749
890
|
total:
|
|
750
891
|
autocomplete_search: 10
|
|
@@ -809,12 +950,6 @@ per_template:
|
|
|
809
950
|
mean: 0.9
|
|
810
951
|
median: 0.9
|
|
811
952
|
sum: 0.9
|
|
812
|
-
answer_relevance_cost:
|
|
813
|
-
min: 0.0007
|
|
814
|
-
max: 0.0007
|
|
815
|
-
mean: 0.0007
|
|
816
|
-
median: 0.0007
|
|
817
|
-
sum: 0.0007
|
|
818
953
|
steps:
|
|
819
954
|
total:
|
|
820
955
|
autocomplete_search: 9
|
|
@@ -881,12 +1016,6 @@ per_template:
|
|
|
881
1016
|
mean: 0.9
|
|
882
1017
|
median: 0.9
|
|
883
1018
|
sum: 0.9
|
|
884
|
-
answer_relevance_cost:
|
|
885
|
-
min: 0.0007
|
|
886
|
-
max: 0.0007
|
|
887
|
-
mean: 0.0007
|
|
888
|
-
median: 0.0007
|
|
889
|
-
sum: 0.0007
|
|
890
1019
|
steps:
|
|
891
1020
|
total:
|
|
892
1021
|
autocomplete_search: 20
|
|
@@ -951,12 +1080,6 @@ micro:
|
|
|
951
1080
|
mean: 0.9
|
|
952
1081
|
median: 0.9
|
|
953
1082
|
sum: 0.9
|
|
954
|
-
answer_relevance_cost:
|
|
955
|
-
min: 0.0007
|
|
956
|
-
max: 0.0007
|
|
957
|
-
mean: 0.0007
|
|
958
|
-
median: 0.0007
|
|
959
|
-
sum: 0.0007
|
|
960
1083
|
steps_score:
|
|
961
1084
|
sum: 17
|
|
962
1085
|
mean: 0.4358974358974359
|
|
@@ -996,8 +1119,6 @@ macro:
|
|
|
996
1119
|
mean: 1.0
|
|
997
1120
|
answer_relevance:
|
|
998
1121
|
mean: 0.9
|
|
999
|
-
answer_relevance_cost:
|
|
1000
|
-
mean: 0.0007
|
|
1001
1122
|
steps_score:
|
|
1002
1123
|
mean: 0.45
|
|
1003
1124
|
input_tokens:
|
|
@@ -1010,7 +1131,54 @@ macro:
|
|
|
1010
1131
|
mean: 25.911653497483996
|
|
1011
1132
|
```
|
|
1012
1133
|
|
|
1013
|
-
###
|
|
1134
|
+
### Steps score
|
|
1135
|
+
|
|
1136
|
+
The steps score is a real number in the interval [0, 1],
|
|
1137
|
+
which indicates how closely the actual steps match the reference ones.
|
|
1138
|
+
A score of 1 indicates a perfect match.
|
|
1139
|
+
|
|
1140
|
+
Going in reverse order of the reference groups and in reverse order of the
|
|
1141
|
+
actual steps, for each reference group we try to match each step in the group
|
|
1142
|
+
to an actual step:
|
|
1143
|
+
- If all steps from the current reference group are matched, we proceed to the
|
|
1144
|
+
previous group, but we only search among the actual steps before the earliest
|
|
1145
|
+
actual step already matched for the current group. Thus, we ignore the
|
|
1146
|
+
execution order of steps within a reference group, but groups are matched
|
|
1147
|
+
in order: actual steps that match earlier groups must be executed before all
|
|
1148
|
+
steps matching later groups.
|
|
1149
|
+
- If some steps in the current reference group are not matched, then the
|
|
1150
|
+
matching stops, and the score is computed from the matches found so far.
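
The reverse-order group matching described above can be sketched as follows. This is a simplified illustration, not the package's implementation: `match_groups` and `by_name` are hypothetical names, the matching predicate is supplied by the caller, and steps within a group are tried greedily in listed order (an assumption).

```python
from typing import Callable

Step = dict


def match_groups(
    reference_groups: list[list[Step]],
    actual_steps: list[Step],
    is_match: Callable[[Step, Step], bool],
) -> dict[tuple[int, int], int]:
    """Map (group index, step index) of each matched reference step to an actual step index."""
    matches: dict[tuple[int, int], int] = {}
    limit = len(actual_steps)  # only actual steps with index < limit are searched
    for g in range(len(reference_groups) - 1, -1, -1):  # last reference group first
        used: set[int] = set()
        for s, ref in enumerate(reference_groups[g]):
            for a in range(limit - 1, -1, -1):  # latest actual step first
                if a not in used and is_match(ref, actual_steps[a]):
                    matches[(g, s)] = a
                    used.add(a)
                    break
        if len(used) < len(reference_groups[g]):
            break  # some step in this group is unmatched: stop matching
        if used:
            limit = min(used)  # earlier groups must match strictly earlier actual steps
    return matches


def by_name(ref: Step, act: Step) -> bool:
    # Hypothetical predicate: steps match when their names are equal.
    return ref["name"] == act["name"]


groups = [[{"name": "autocomplete_search"}], [{"name": "sparql_query"}]]
in_order = [{"name": "autocomplete_search"}, {"name": "sparql_query"}]
out_of_order = [{"name": "sparql_query"}, {"name": "autocomplete_search"}]
```

With `in_order`, both groups are matched. With `out_of_order`, only the last group is matched, because no actual step precedes the matched `sparql_query` step.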
|
|
1151
|
+
|
|
1152
|
+
An actual step can match at most one reference step, and only if the actual
|
|
1153
|
+
step is successful (i.e., it didn’t result in an error). There are a few ways
|
|
1154
|
+
a reference step can match an actual one. In all cases except for
|
|
1155
|
+
the "retrieval" steps, the matching score is either 0 or 1. A score above 0
|
|
1156
|
+
indicates a match.
|
|
1157
|
+
|
|
1158
|
+
- if both are named "sparql_query" and the "output_media_type" of the
|
|
1159
|
+
reference step is "application/sparql-results+json", then we try to match them
|
|
1160
|
+
using the [SPARQL queries comparison algorithm](#sparql-queries-comparison).
|
|
1161
|
+
The algorithm doesn't handle comparison of DESCRIBE or CONSTRUCT queries output
|
|
1162
|
+
with SELECT and ASK queries output.
|
|
1163
|
+
- if both are named "retrieval" and the reference step has "output", then we
|
|
1164
|
+
compute [recall@k](#context-recallk).
|
|
1165
|
+
- if both are named "retrieve_time_series", then we check if the arguments of
|
|
1166
|
+
the steps are matching.
|
|
1167
|
+
- if both are named "retrieve_data_points", then we check if the arguments of
|
|
1168
|
+
the steps are matching.
|
|
1169
|
+
- if the reference step is named "iri_discovery" and the actual step name is
|
|
1170
|
+
"autocomplete_search", then we check if the IRI specified as "output" of the
|
|
1171
|
+
"iri_discovery" step is present in the "output" of the "autocomplete_search".
|
|
1172
|
+
- if the reference and actual step names are the same and the
|
|
1173
|
+
"output_media_type" of the reference step is "application/json", then the steps
|
|
1174
|
+
match if the JSON outputs are the same.
|
|
1175
|
+
- otherwise, we fall back to comparing the outputs of the two steps.
|
|
1176
|
+
|
|
1177
|
+
The final steps score is the macro mean of scores over the groups. That is, it
|
|
1178
|
+
is the sum of the scores of all reference groups divided by their number. Each
|
|
1179
|
+
group score is the sum of scores of its steps divided by their number.
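
As a worked illustration of the macro mean described above, here is a minimal sketch. The function name `steps_score` and the input layout (one list of per-step scores per reference group, with unmatched steps scored 0) are assumptions made for the example.

```python
def steps_score(group_step_scores: list[list[float]]) -> float:
    """Macro mean: average the per-group means of the step scores.

    Assumes at least one group and no empty groups.
    """
    group_means = [sum(scores) / len(scores) for scores in group_step_scores]
    return sum(group_means) / len(group_means)


# Two reference groups: the first has one matched and one unmatched step,
# the second has a single matched step: (0.5 + 1.0) / 2
example = steps_score([[1.0, 0.0], [1.0]])
```

Each group contributes equally to the result regardless of how many steps it contains.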
|
|
1180
|
+
|
|
1181
|
+
#### SPARQL queries comparison
|
|
1014
1182
|
|
|
1015
1183
|
The algorithm iterates over all subsets of columns in the actual result of the same size as in the reference result.
|
|
1016
1184
|
For each subset, it compares the set of columns (skipping optional columns).
|
|
@@ -1074,3 +1242,68 @@ average_precision(
|
|
|
1074
1242
|
retrieved_docs=[1, 4, 3, 5, 7]
|
|
1075
1243
|
) # ~=> 0.8056
|
|
1076
1244
|
```
|
|
1245
|
+
|
|
1246
|
+
### Custom Evaluation (Custom Metrics)
|
|
1247
|
+
|
|
1248
|
+
You (the user) can define your own metrics of system outputs to be evaluated
|
|
1249
|
+
using an LLM. To do this, specify the evaluation's name, inputs, outputs and instructions in
|
|
1250
|
+
a YAML file and pass the file path as a parameter to `run_evaluation()`. This
|
|
1251
|
+
will return your output metrics alongside the standard metrics described
|
|
1252
|
+
in previous sections.
|
|
1253
|
+
|
|
1254
|
+
One configuration file can define multiple custom evaluations, each of which
|
|
1255
|
+
will be done as a separate query to the LLM. Each evaluation can have multiple
|
|
1256
|
+
outputs. The format is shown in the example sections below.
|
|
1257
|
+
|
|
1258
|
+
See [Example Configuration File](#example-configuration-file).
|
|
1259
|
+
|
|
1260
|
+
#### Example Call to Evaluate Using Custom Metrics
|
|
1261
|
+
|
|
1262
|
+
```python
|
|
1263
|
+
evaluation_results = await run_evaluation(
|
|
1264
|
+
reference_qa_dataset,
|
|
1265
|
+
chat_responses,
|
|
1266
|
+
"my_project/custom_eval.yaml"
|
|
1267
|
+
)
|
|
1268
|
+
```
|
|
1269
|
+
|
|
1270
|
+
#### Example Output for Custom SPARQL Evaluation
|
|
1271
|
+
|
|
1272
|
+
With the
|
|
1273
|
+
[custom SPARQL evaluation](#example-custom-evaluation-configuration-file),
|
|
1274
|
+
the output is the same as in section [Evaluation Results](#evaluation-results), except
|
|
1275
|
+
that it has the following additional keys and example values:
|
|
1276
|
+
|
|
1277
|
+
```yaml
|
|
1278
|
+
my_answer_relevance: 0.9
|
|
1279
|
+
my_answer_relevance_eval_reason: The answer contains relevant information except for the sentence about total revenue
|
|
1280
|
+
sparql_recall: 0.75
|
|
1281
|
+
sparql_precision: 0.6
|
|
1282
|
+
sparql_eval_reason: The reference answer has 4 claims; there are 5 SPARQL results; 3 claims match
|
|
1283
|
+
```
|
|
1284
|
+
|
|
1285
|
+
#### Output in Case of Evaluation Error
|
|
1286
|
+
|
|
1287
|
+
If there is an error during evaluation, the output will have `null` for all
|
|
1288
|
+
custom evaluation specified output keys and will have an additional key
|
|
1289
|
+
explaining the error. The key will be `{name}_error` where `name` is the custom
|
|
1290
|
+
evaluation name.
|
|
1291
|
+
|
|
1292
|
+
There are three types of errors:
|
|
1293
|
+
1. The reference input is missing keys requested in the custom evaluation
|
|
1294
|
+
configuration. Example:
|
|
1295
|
+
`custom_1_error: Reference missing key 'reference_steps'`
|
|
1296
|
+
1. The actual output to be evaluated is missing keys requested in the custom
|
|
1297
|
+
evaluation configuration. Example:
|
|
1298
|
+
`custom_1_error: Actual output missing 'actual_steps'`
|
|
1299
|
+
1. The evaluating LLM output does not conform to the custom evaluation
|
|
1300
|
+
configuration. Example:
|
|
1301
|
+
`custom_1_error: "Expected 6 tab-separated values, got: 0.1\tCustom answer reason"`
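
The third error type can be illustrated with a short sketch. It assumes, based only on the example message above, that the evaluating LLM returns one tab-separated value per configured output; `parse_custom_output` is a hypothetical helper, not part of the package API.

```python
def parse_custom_output(name: str, output_keys: list[str], llm_text: str) -> dict:
    """Split the LLM reply into the configured outputs, or report an error."""
    values = llm_text.split("\t")
    if len(values) != len(output_keys):
        # On error, every configured output key is null and `{name}_error` explains why.
        result: dict = {key: None for key in output_keys}
        result[f"{name}_error"] = (
            f"Expected {len(output_keys)} tab-separated values, got: {llm_text}"
        )
        return result
    return dict(zip(output_keys, values))


ok = parse_custom_output("custom_1", ["score", "reason"], "0.1\tCustom answer reason")
bad = parse_custom_output("custom_1", ["a", "b", "c"], "0.1\tCustom answer reason")
```

Here `ok` holds the two parsed values as strings, while `bad` has `None` for all three outputs plus a `custom_1_error` key.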
|
|
1302
|
+
|
|
1303
|
+
#### Recommendations for Custom Evaluations
|
|
1304
|
+
|
|
1305
|
+
1. Specify only a few outputs in each evaluation
|
|
1306
|
+
1. Request an explanation output for output quantities you ask the LLM to count
|
|
1307
|
+
or estimate. You can ask for one explanation per quantity or one shared
|
|
1308
|
+
explanation for 2-3 quantities
|
|
1309
|
+
|