graphrag-eval 4.0.0__py3-none-any.whl
- graphrag_eval/__init__.py +4 -0
- graphrag_eval/aggregation.py +151 -0
- graphrag_eval/answer_correctness.py +162 -0
- graphrag_eval/answer_relevance.py +37 -0
- graphrag_eval/evaluation.py +62 -0
- graphrag_eval/steps/__init__.py +120 -0
- graphrag_eval/steps/retrieval.py +55 -0
- graphrag_eval/steps/sparql.py +139 -0
- graphrag_eval-4.0.0.dist-info/LICENSE +201 -0
- graphrag_eval-4.0.0.dist-info/METADATA +967 -0
- graphrag_eval-4.0.0.dist-info/RECORD +13 -0
- graphrag_eval-4.0.0.dist-info/WHEEL +4 -0
- graphrag_eval-4.0.0.dist-info/entry_points.txt +3 -0
|
@@ -0,0 +1,967 @@
|
|
|
1
|
+
Metadata-Version: 2.3
|
|
2
|
+
Name: graphrag-eval
|
|
3
|
+
Version: 4.0.0
|
|
4
|
+
Summary: For assessing question answering systems' final answers and intermediate steps, against a given set of questions, reference answers and steps.
|
|
5
|
+
License: Apache-2.0
|
|
6
|
+
Author: Neli Hateva
|
|
7
|
+
Author-email: neli.hateva@graphwise.ai
|
|
8
|
+
Requires-Python: >=3.12,<3.13
|
|
9
|
+
Classifier: License :: OSI Approved :: Apache Software License
|
|
10
|
+
Classifier: Programming Language :: Python :: 3
|
|
11
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
12
|
+
Project-URL: Repository, https://github.com/Ontotext-AD/qa-eval
|
|
13
|
+
Description-Content-Type: text/markdown
|
|
14
|
+
|
|
15
|
+
<p align="center">
|
|
16
|
+
<img alt="Graphwise Logo" src=".github/Graphwise_Logo.jpg">
|
|
17
|
+
</p>
|
|
18
|
+
|
|
19
|
+
# QA Evaluation
|
|
20
|
+
|
|
21
|
+
This is a Python module for assessing the quality of question-answering systems, such as those based on LLM agents, against a set of questions and their reference answers. This includes evaluating the final answer and the steps used
|
|
22
|
+
to reach the answer (such as orchestrated and executed steps), compared to the given reference steps.
|
|
23
|
+
|
|
24
|
+
## License
|
|
25
|
+
|
|
26
|
+
Apache-2.0 License. See [LICENSE](LICENSE) file for details.
|
|
27
|
+
|
|
28
|
+
## Installation
|
|
29
|
+
|
|
30
|
+
To evaluate only steps:
|
|
31
|
+
```bash
pip install graphrag-eval
```
|
|
34
|
+
or add the following dependency in your `pyproject.toml` file:
|
|
35
|
+
```toml
graphrag-eval = "*"
```
|
|
38
|
+
|
|
39
|
+
To evaluate answer relevance and answer correctness:
|
|
40
|
+
|
|
41
|
+
```bash
pip install 'graphrag-eval[openai]'
```
|
|
44
|
+
|
|
45
|
+
or add the following dependency in your `pyproject.toml` file:
|
|
46
|
+
```toml
graphrag-eval = {version = "*", extras = ["openai"]}
```
|
|
49
|
+
|
|
50
|
+
## Maintainers
|
|
51
|
+
|
|
52
|
+
Developed and maintained by [Graphwise](https://graphwise.ai/).
|
|
53
|
+
For issues or feature requests, please open [a GitHub issue](https://github.com/Ontotext-AD/qa-eval/issues).
|
|
54
|
+
|
|
55
|
+
## Command Line Use
|
|
56
|
+
|
|
57
|
+
To evaluate only correctness of final answers (system responses), you can clone this repository and run the code on the command line:
|
|
58
|
+
|
|
59
|
+
1. Prepare an input TSV file with columns `Question`, `Reference answer` and `Actual answer`
|
|
60
|
+
1. Execute `poetry install --with openai`
|
|
61
|
+
1. Execute `OPENAI_API_KEY=<your_api_key> poetry run answer-correctness -i <input_file.tsv> -o <output_file.tsv>`
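The input file from step 1 can be prepared with Python's standard `csv` module. The sketch below is only an illustration: the file name `input_file.tsv`, the helper `write_input_tsv` and the sample row are assumptions; only the three column names come from the steps above.

```python
import csv

# Column names expected in the input file (see step 1 above).
COLUMNS = ["Question", "Reference answer", "Actual answer"]

# Hypothetical sample data; replace with your own questions and answers.
rows = [
    {
        "Question": "List all transformers within Substation OSLO",
        "Reference answer": "OSLO T1, OSLO T2",
        "Actual answer": "The transformers are OSLO T2 and OSLO T1.",
    },
]


def write_input_tsv(path: str, rows: list[dict]) -> None:
    """Write rows to a tab-separated file with the expected header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)


write_input_tsv("input_file.tsv", rows)
```

The resulting file can then be passed to the command in step 3 with `-i input_file.tsv`.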
|
|
62
|
+
|
|
63
|
+
We plan to improve CLI support in future releases.
|
|
64
|
+
|
|
65
|
+
## Use as a Library
|
|
66
|
+
|
|
67
|
+
To evaluate answers and/or steps:
|
|
68
|
+
1. Install this package: section [Installation](#installation)
|
|
69
|
+
1. Format the corpus of questions and reference answers and/or steps: section [Reference Q&A Corpus](#reference-qa-corpus)
|
|
70
|
+
1. Format the answers and/or steps you want to evaluate: section [Evaluation Target Corpus](#evaluation-target-corpus)
|
|
71
|
+
1. To evaluate answer relevance:
|
|
72
|
+
1. Include `actual_answer` in the target data to evaluate
|
|
73
|
+
1. Set environment variable `OPENAI_API_KEY` appropriately
|
|
74
|
+
1. To evaluate answer correctness:
|
|
75
|
+
1. Include `reference_answer` in the reference corpus and `actual_answer` in the target data to evaluate
|
|
76
|
+
1. Set environment variable `OPENAI_API_KEY` appropriately
|
|
77
|
+
1. To evaluate steps:
|
|
78
|
+
1. Include `reference_steps` in the reference corpus and `actual_steps` in target data to evaluate
|
|
79
|
+
1. Call the evaluation function with the reference corpus and target corpus: section [Example Usage Code](#example-usage-code)
|
|
80
|
+
1. Call the aggregation function with the evaluation results
|
|
81
|
+
|
|
82
|
+
Answer evaluation (correctness and relevance) uses the LLM `openai/gpt-4o-mini`.
|
|
83
|
+
|
|
84
|
+
### Reference Q&A Corpus
|
|
85
|
+
|
|
86
|
+
A reference corpus is a list of templates, each of which contains:
|
|
87
|
+
|
|
88
|
+
- `template_id`: Unique template identifier
|
|
89
|
+
- `questions`: A list of questions derived from this template, where each includes:
|
|
90
|
+
- `id`: Unique question identifier
|
|
91
|
+
- `question_text`: The natural language query passed to the LLM
|
|
92
|
+
- `reference_steps`: (optional) A list of expected steps grouped by expected order of execution, where all steps in a group can be executed in any order relative to each other, but after all steps in the previous group and before all steps in the next group.
|
|
93
|
+
- `reference_answer`: (optional) The expected answer to the question
|
|
94
|
+
The final answer to the question is assumed to be derived from the outputs of the steps in the last group (the steps executed last).
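To make the grouping of `reference_steps` concrete, the sketch below builds a one-question corpus as plain Python data and picks out the last group of steps. The field names follow the list above; the template id, question id, `args` and `output` values are placeholders, and the helper `last_group_steps` is hypothetical rather than part of this package.

```python
# A hypothetical one-template corpus; args and outputs are placeholders.
reference_qas = [
    {
        "template_id": "example_template",
        "questions": [
            {
                "id": "q1",
                "question_text": "List all transformers within Substation OSLO",
                "reference_answer": "OSLO T1, OSLO T2",
                # Outer list: groups in expected execution order.
                # Inner list: steps that may run in any order within a group.
                "reference_steps": [
                    [{"name": "autocomplete_search", "args": {"query": "OSLO"}, "output": "..."}],
                    [{"name": "sparql_query", "args": {"query": "..."}, "output": "..."}],
                ],
            }
        ],
    }
]


def last_group_steps(question: dict) -> list[dict]:
    """Return the steps of the last group, or [] if no steps are given."""
    groups = question.get("reference_steps") or []
    return groups[-1] if groups else []


for template in reference_qas:
    for question in template["questions"]:
        names = [step["name"] for step in last_group_steps(question)]
        print(question["id"], names)
```

Here the `sparql_query` step forms the last group, so its output is the one the final answer is expected to rest on.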
|
|
95
|
+
|
|
96
|
+
Each step includes:
|
|
97
|
+
|
|
98
|
+
- `name`: The type of step being performed (e.g., `sparql_query`)
|
|
99
|
+
- `args`: Arguments of the step (e.g., arguments to a tool used in the step, such as a SPARQL query)
|
|
100
|
+
- `output`: The expected output from the step
|
|
101
|
+
- `output_media_type`: (optional, missing or one of `application/sparql-results+json`, `application/json`) Indicates how the output of a step must be processed
|
|
102
|
+
- `ordered`: (optional, defaults to `false`) For SPARQL query results, whether results order matters. `true` means that the actual result rows must be ordered as the reference result; `false` means that result rows are matched as a set.
|
|
103
|
+
- `required_columns`: (optional) Applies only to SPARQL query results: the list of binding names that must be present for the results to match
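The effect of `ordered` and `required_columns` can be illustrated with a small comparison function. This is a sketch of the semantics described above, not the package's implementation: `project` and `rows_match` are hypothetical names, and the sketch assumes that reference and actual results use the same binding names, which the library itself may not require.

```python
import json


def project(bindings: list[dict], columns: list[str]) -> list[tuple]:
    """Reduce each result row to the values of the given columns."""
    return [
        tuple(row.get(col, {}).get("value") for col in columns)
        for row in bindings
    ]


def rows_match(reference_output: str, actual_output: str,
               required_columns: list[str], ordered: bool = False) -> bool:
    """Compare two SPARQL JSON results on the required columns only."""
    ref = project(json.loads(reference_output)["results"]["bindings"], required_columns)
    act = project(json.loads(actual_output)["results"]["bindings"], required_columns)
    if ordered:
        return ref == act        # rows must appear in the same order
    return set(ref) == set(act)  # rows are matched as a set


def make_result(names: list[str]) -> str:
    """Build a tiny SPARQL JSON result with a single `name` column."""
    bindings = [{"name": {"type": "literal", "value": n}} for n in names]
    return json.dumps({"head": {"vars": ["name"]}, "results": {"bindings": bindings}})


ref = make_result(["OSLO T2", "OSLO T1"])
act = make_result(["OSLO T1", "OSLO T2"])
print(rows_match(ref, act, ["name"]))                # True
print(rows_match(ref, act, ["name"], ordered=True))  # False
```

Note that the set comparison in this sketch ignores duplicate rows; whether the package does the same is not stated here.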
|
|
104
|
+
|
|
105
|
+
#### Example Reference Corpus
|
|
106
|
+
|
|
107
|
+
The example corpus below illustrates a minimal but realistic Q&A dataset, showing two templates with associated questions and steps.
|
|
108
|
+
|
|
109
|
+
```yaml
|
|
110
|
+
- template_id: list_all_transformers_within_Substation_SUBSTATION
|
|
111
|
+
questions:
|
|
112
|
+
- id: c10bbc8dce98a4b8832d125134a16153
|
|
113
|
+
question_text: List all transformers within Substation OSLO
|
|
114
|
+
reference_answer: OSLO T1, OSLO T2
|
|
115
|
+
reference_steps:
|
|
116
|
+
- - name: sparql_query
|
|
117
|
+
args:
|
|
118
|
+
query: |2
|
|
119
|
+
|
|
120
|
+
PREFIX cimex: <https://rawgit2.com/statnett/Talk2PowerSystem/main/demo1/cimex/>
|
|
121
|
+
PREFIX cim: <https://cim.ucaiug.io/ns#>
|
|
122
|
+
PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
|
|
123
|
+
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
|
|
124
|
+
select distinct ?transformer ?transformerName
|
|
125
|
+
where {
|
|
126
|
+
bind(<urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f> as ?substation)
|
|
127
|
+
|
|
128
|
+
?transformer a cim:PowerTransformer ;
|
|
129
|
+
cim:Equipment.EquipmentContainer ?substation ;
|
|
130
|
+
cim:IdentifiedObject.name ?transformerName .
|
|
131
|
+
}
|
|
132
|
+
output: '{"head": {"vars": ["transformer", "transformerName"]}, "results":
|
|
133
|
+
{"bindings": [{"transformer": {"type": "uri", "value": "urn:uuid:f1769de8-9aeb-11e5-91da-b8763fd99c5f"},
|
|
134
|
+
"transformerName": {"type": "literal", "value": "OSLO T2"}}, {"transformer":
|
|
135
|
+
{"type": "uri", "value": "urn:uuid:f1769dd6-9aeb-11e5-91da-b8763fd99c5f"},
|
|
136
|
+
"transformerName": {"type": "literal", "value": "OSLO T1"}}]}}'
|
|
137
|
+
output_media_type: application/sparql-results+json
|
|
138
|
+
required_columns:
|
|
139
|
+
- transformer
|
|
140
|
+
- transformerName
|
|
141
|
+
- id: 8bbea9a10876a04ad77a82fd2aedee40
|
|
142
|
+
question_text: List all transformers within Substation STAVANGER
|
|
143
|
+
reference_answer: STAVANGET1
|
|
144
|
+
reference_steps:
|
|
145
|
+
- - name: sparql_query
|
|
146
|
+
args:
|
|
147
|
+
query: |2
|
|
148
|
+
|
|
149
|
+
PREFIX cimex: <https://rawgit2.com/statnett/Talk2PowerSystem/main/demo1/cimex/>
|
|
150
|
+
PREFIX cim: <https://cim.ucaiug.io/ns#>
|
|
151
|
+
PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
|
|
152
|
+
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
|
|
153
|
+
select distinct ?transformer ?transformerName
|
|
154
|
+
where {
|
|
155
|
+
bind(<urn:uuid:f1769664-9aeb-11e5-91da-b8763fd99c5f> as ?substation)
|
|
156
|
+
|
|
157
|
+
?transformer a cim:PowerTransformer ;
|
|
158
|
+
cim:Equipment.EquipmentContainer ?substation ;
|
|
159
|
+
cim:IdentifiedObject.name ?transformerName .
|
|
160
|
+
}
|
|
161
|
+
output: '{"head": {"vars": ["transformer", "transformerName"]}, "results":
|
|
162
|
+
{"bindings": [{"transformer": {"type": "uri", "value": "urn:uuid:f1769e0c-9aeb-11e5-91da-b8763fd99c5f"},
|
|
163
|
+
"transformerName": {"type": "literal", "value": "STAVANGET1"}}]}}'
|
|
164
|
+
output_media_type: application/sparql-results+json
|
|
165
|
+
required_columns:
|
|
166
|
+
- transformer
|
|
167
|
+
- transformerName
|
|
168
|
+
- template_id: list_all_substations_within_bidding_zone_REGION
|
|
169
|
+
questions:
|
|
170
|
+
- id: d566b1e9da418ac83e520a66cc7af4d7
|
|
171
|
+
question_text: List all substations within bidding zone NO2 SGR
|
|
172
|
+
reference_answer: ARENDAL, BLAFALLI, STAVANGER, KRISTIA_HVDC, KVILLDAL, SANDEFJORD, KRISTIANSAND, FEDA_HVDC
|
|
173
|
+
reference_steps:
|
|
174
|
+
- - name: sparql_query
|
|
175
|
+
args:
|
|
176
|
+
query: |2
|
|
177
|
+
|
|
178
|
+
PREFIX cimex: <https://rawgit2.com/statnett/Talk2PowerSystem/main/demo1/cimex/>
|
|
179
|
+
PREFIX cim: <https://cim.ucaiug.io/ns#>
|
|
180
|
+
PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
|
|
181
|
+
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
|
|
182
|
+
select distinct ?substation ?substationName
|
|
183
|
+
where {
|
|
184
|
+
bind(<urn:uuid:f176965f-9aeb-11e5-91da-b8763fd99c5f> as ?region)
|
|
185
|
+
|
|
186
|
+
?substation a cim:Substation ;
|
|
187
|
+
cim:Substation.Region ?region ;
|
|
188
|
+
cim:IdentifiedObject.name ?substationName .
|
|
189
|
+
}
|
|
190
|
+
output: '{"head": {"vars": ["substation", "substationName"]}, "results": {"bindings":
|
|
191
|
+
[{"substation": {"type": "uri", "value": "urn:uuid:f1769670-9aeb-11e5-91da-b8763fd99c5f"},
|
|
192
|
+
"substationName": {"type": "literal", "value": "ARENDAL"}}, {"substation":
|
|
193
|
+
{"type": "uri", "value": "urn:uuid:f176968e-9aeb-11e5-91da-b8763fd99c5f"},
|
|
194
|
+
"substationName": {"type": "literal", "value": "BLAFALLI"}}, {"substation":
|
|
195
|
+
{"type": "uri", "value": "urn:uuid:f1769664-9aeb-11e5-91da-b8763fd99c5f"},
|
|
196
|
+
"substationName": {"type": "literal", "value": "STAVANGER"}}, {"substation":
|
|
197
|
+
{"type": "uri", "value": "urn:uuid:f1769676-9aeb-11e5-91da-b8763fd99c5f"},
|
|
198
|
+
"substationName": {"type": "literal", "value": "KRISTIA_HVDC"}}, {"substation":
|
|
199
|
+
{"type": "uri", "value": "urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f"},
|
|
200
|
+
"substationName": {"type": "literal", "value": "KVILLDAL"}}, {"substation":
|
|
201
|
+
{"type": "uri", "value": "urn:uuid:f176966a-9aeb-11e5-91da-b8763fd99c5f"},
|
|
202
|
+
"substationName": {"type": "literal", "value": "SANDEFJORD"}}, {"substation":
|
|
203
|
+
{"type": "uri", "value": "urn:uuid:f176965a-9aeb-11e5-91da-b8763fd99c5f"},
|
|
204
|
+
"substationName": {"type": "literal", "value": "KRISTIANSAND"}}, {"substation":
|
|
205
|
+
{"type": "uri", "value": "urn:uuid:f176967c-9aeb-11e5-91da-b8763fd99c5f"},
|
|
206
|
+
"substationName": {"type": "literal", "value": "FEDA_HVDC"}}]}}'
|
|
207
|
+
output_media_type: application/sparql-results+json
|
|
208
|
+
required_columns:
|
|
209
|
+
- substation
|
|
210
|
+
- substationName
|
|
211
|
+
ordered: false
|
|
212
|
+
- id: 03d4283773b4387114342518176b128b
|
|
213
|
+
question_text: List all substations within bidding zone NO1 SGR
|
|
214
|
+
reference_answer: HALDEN, KONGSBERG, SYLLING, OSLO, ASKER, SYSLE, SKIEN, TRETTEN
|
|
215
|
+
reference_steps:
|
|
216
|
+
- - name: sparql_query
|
|
217
|
+
args:
|
|
218
|
+
query: |2
|
|
219
|
+
|
|
220
|
+
PREFIX cimex: <https://rawgit2.com/statnett/Talk2PowerSystem/main/demo1/cimex/>
|
|
221
|
+
PREFIX cim: <https://cim.ucaiug.io/ns#>
|
|
222
|
+
PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
|
|
223
|
+
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
|
|
224
|
+
select distinct ?substation ?substationName
|
|
225
|
+
where {
|
|
226
|
+
bind(<urn:uuid:f1769609-9aeb-11e5-91da-b8763fd99c5f> as ?region)
|
|
227
|
+
|
|
228
|
+
?substation a cim:Substation ;
|
|
229
|
+
cim:Substation.Region ?region ;
|
|
230
|
+
cim:IdentifiedObject.name ?substationName .
|
|
231
|
+
}
|
|
232
|
+
output: '{"head": {"vars": ["substation", "substationName"]}, "results": {"bindings":
|
|
233
|
+
[{"substation": {"type": "uri", "value": "urn:uuid:f176960e-9aeb-11e5-91da-b8763fd99c5f"},
|
|
234
|
+
"substationName": {"type": "literal", "value": "HALDEN"}}, {"substation":
|
|
235
|
+
{"type": "uri", "value": "urn:uuid:f176961e-9aeb-11e5-91da-b8763fd99c5f"},
|
|
236
|
+
"substationName": {"type": "literal", "value": "KONGSBERG"}}, {"substation":
|
|
237
|
+
{"type": "uri", "value": "urn:uuid:f1769642-9aeb-11e5-91da-b8763fd99c5f"},
|
|
238
|
+
"substationName": {"type": "literal", "value": "SYLLING"}}, {"substation":
|
|
239
|
+
{"type": "uri", "value": "urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f"},
|
|
240
|
+
"substationName": {"type": "literal", "value": "OSLO"}}, {"substation":
|
|
241
|
+
{"type": "uri", "value": "urn:uuid:f176964e-9aeb-11e5-91da-b8763fd99c5f"},
|
|
242
|
+
"substationName": {"type": "literal", "value": "ASKER"}}, {"substation":
|
|
243
|
+
{"type": "uri", "value": "urn:uuid:f1769648-9aeb-11e5-91da-b8763fd99c5f"},
|
|
244
|
+
"substationName": {"type": "literal", "value": "SYSLE"}}, {"substation":
|
|
245
|
+
{"type": "uri", "value": "urn:uuid:f1769654-9aeb-11e5-91da-b8763fd99c5f"},
|
|
246
|
+
"substationName": {"type": "literal", "value": "SKIEN"}}, {"substation":
|
|
247
|
+
{"type": "uri", "value": "urn:uuid:f1769604-9aeb-11e5-91da-b8763fd99c5f"},
|
|
248
|
+
"substationName": {"type": "literal", "value": "TRETTEN"}}]}}'
|
|
249
|
+
output_media_type: application/sparql-results+json
|
|
250
|
+
required_columns:
|
|
251
|
+
- substation
|
|
252
|
+
- substationName
|
|
253
|
+
ordered: false
|
|
254
|
+
```
|
|
255
|
+
|
|
256
|
+
The module is agnostic to the specific LLM agent implementation and model; it depends solely on the format of the response.
|
|
257
|
+
|
|
258
|
+
### Evaluation Target Corpus
|
|
259
|
+
|
|
260
|
+
Below is an example response from the question-answering system for a single question (unless there is an error in answering: see [Example Target Input on Error](#example-target-input-on-error) below):
|
|
261
|
+
|
|
262
|
+
```json
|
|
263
|
+
{
|
|
264
|
+
"question_id": "f91fc938d606e5f6089912bebfaf114b",
|
|
265
|
+
"input_tokens": 298028,
|
|
266
|
+
"output_tokens": 725,
|
|
267
|
+
"total_tokens": 298753,
|
|
268
|
+
"elapsed_sec": 46.48961806297302,
|
|
269
|
+
"actual_steps": [
|
|
270
|
+
{
|
|
271
|
+
"name": "autocomplete_search",
|
|
272
|
+
"args": {
|
|
273
|
+
"query": "STAVANGER",
|
|
274
|
+
"result_class": "cim:Substation"
|
|
275
|
+
},
|
|
276
|
+
"id": "call_7amIsfEGelOnVZ1DWtgtc0hc",
|
|
277
|
+
"status": "success",
|
|
278
|
+
"output": "{\n \"head\": {\n \"vars\": [\n \"iri\",\n \"name\",\n \"rank\"\n ]\n },\n \"results\": {\n \"bindings\": [\n {\n \"iri\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769664-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"name\": {\n \"type\": \"literal\",\n \"value\": \"STAVANGER\"\n },\n \"rank\": {\n \"datatype\": \"http://www.w3.org/2001/XMLSchema#float\",\n \"type\": \"literal\",\n \"value\": \"0.00976\"\n }\n }\n ]\n }\n}"
|
|
279
|
+
},
|
|
280
|
+
{
|
|
281
|
+
"name": "sparql_query",
|
|
282
|
+
"args": {
|
|
283
|
+
"query": "SELECT DISTINCT ?substation2 ?substation2Name ?line ?lineName WHERE {\n # Substation STAVANGER\n BIND(urn:uuid:f1769664-9aeb-11e5-91da-b8763fd99c5f AS ?substation1) .\n\n # Find lines connected to STAVANGER\n ?line a cim:Line ;\n cimex:connectedThroughPart ?substation1, ?substation2 ;\n cim:IdentifiedObject.name ?lineName .\n\n # Ensure the connected substation is different from STAVANGER\n ?substation2 a cim:Substation ;\n cim:IdentifiedObject.name ?substation2Name .\n FILTER(?substation1 != ?substation2)\n}"
|
|
284
|
+
},
|
|
285
|
+
"id": "call_DbMkZ8kv3qkf49wNNdREkpRN",
|
|
286
|
+
"status": "error",
|
|
287
|
+
"error": "Error: ValueError('The following prefixes are undefined: urn')\n Please fix your mistakes."
|
|
288
|
+
},
|
|
289
|
+
{
|
|
290
|
+
"name": "sparql_query",
|
|
291
|
+
"args": {
|
|
292
|
+
"query": "SELECT DISTINCT ?substation2 ?substation2Name ?line ?lineName WHERE {\n # Substation STAVANGER\n BIND(<urn:uuid:f1769664-9aeb-11e5-91da-b8763fd99c5f> AS ?substation1) .\n\n # Find lines connected to STAVANGER\n ?line a cim:Line ;\n cimex:connectedThroughPart ?substation1, ?substation2 ;\n cim:IdentifiedObject.name ?lineName .\n\n # Ensure the connected substation is different from STAVANGER\n ?substation2 a cim:Substation ;\n cim:IdentifiedObject.name ?substation2Name .\n FILTER(?substation1 != ?substation2)\n}"
|
|
293
|
+
},
|
|
294
|
+
"id": "call_Qm1mzX7g5q9SVPrR2QzEMTp3",
|
|
295
|
+
"status": "success",
|
|
296
|
+
"output": "{\n \"head\": {\n \"vars\": [\n \"substation2\",\n \"substation2Name\",\n \"line\",\n \"lineName\"\n ]\n },\n \"results\": {\n \"bindings\": [\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176965a-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KRISTIANSAND\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:47332c85-17a4-7a41-abd5-2b92e4b800ed\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 300KRISTIAN-STAVANGE\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769614-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"DAGALI\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420DAGALI-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420DAGALI-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420DAGALI-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"OSLO\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": 
\"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769648-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"SYSLE\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176962a-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"AURLAND\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n 
\"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769624-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"SIMA\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769642-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"SYLLING\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYLLING-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"OSLO\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYLLING-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 
420SYLLING-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYLLING-HAGAFOSS\"\n }\n }\n ]\n }\n}"
|
|
297
|
+
}
|
|
298
|
+
],
|
|
299
|
+
"actual_answer": "The substations connected to the substation \"STAVANGER\" via AC or DC lines are:\n\n1. **KRISTIANSAND** via line \"LC 300KRISTIAN-STAVANGE\"\n2. **DAGALI** via line \"LC 420DAGALI-HAGAFOSS\"\n3. **KVILLDAL** via lines:\n - \"LC 420DAGALI-HAGAFOSS\"\n - \"LC 420SYSLE-HAGAFOSS\"\n - \"LC 420AURLAND-HAGAFOSS\"\n - \"LC 420SYLLING-HAGAFOSS\"\n4. **HAGAFOSS** via lines:\n - \"LC 420DAGALI-HAGAFOSS\"\n - \"LC 420SYSLE-HAGAFOSS\"\n - \"LC 420AURLAND-HAGAFOSS\"\n - \"LC 420SYLLING-HAGAFOSS\"\n5. **OSLO** via lines:\n - \"LC 420SYSLE-HAGAFOSS\"\n - \"LC 420SYLLING-HAGAFOSS\"\n6. **SYSLE** via line \"LC 420SYSLE-HAGAFOSS\"\n7. **AURLAND** via line \"LC 420AURLAND-HAGAFOSS\"\n8. **SIMA** via line \"LC 420AURLAND-HAGAFOSS\"\n9. **SYLLING** via line \"LC 420SYLLING-HAGAFOSS\""
|
|
300
|
+
}
|
|
301
|
+
```
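In the response above, each step's `output` is itself a JSON document encoded as a string, so reading it takes a second decoding pass. The sketch below shows one way to do that for successful steps. It assumes every successful output is JSON, which may not hold for all tools; the abbreviated `response` and the helper `successful_outputs` are hypothetical.

```python
import json

# Abbreviated, hypothetical response in the shape shown above.
response = {
    "question_id": "example-question",
    "actual_steps": [
        {"name": "sparql_query", "args": {"query": "..."}, "id": "call_1",
         "status": "error", "error": "Error: ..."},
        {"name": "sparql_query", "args": {"query": "..."}, "id": "call_2",
         "status": "success",
         "output": json.dumps({
             "head": {"vars": ["name"]},
             "results": {"bindings": [
                 {"name": {"type": "literal", "value": "KRISTIANSAND"}},
             ]},
         })},
    ],
    "actual_answer": "...",
}


def successful_outputs(response: dict) -> dict[str, dict]:
    """Map step id to decoded output for steps whose status is "success"."""
    decoded = {}
    for step in response.get("actual_steps", []):
        if step.get("status") == "success" and "output" in step:
            decoded[step["id"]] = json.loads(step["output"])
    return decoded


for step_id, output in successful_outputs(response).items():
    names = [row["name"]["value"] for row in output["results"]["bindings"]]
    print(step_id, names)
```

Steps with status `error` carry an `error` message instead of an `output`, so the sketch skips them.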
|
|
302
|
+
|
|
303
|
+
#### Example Target Input on Error
|
|
304
|
+
|
|
305
|
+
If an error occurs while generating a response to a question, the expected target input for evaluation is:
|
|
306
|
+
|
|
307
|
+
```json
|
|
308
|
+
{
|
|
309
|
+
"question_id": "a8daaf98b84b4f6b0e0052fb942bf6b6",
|
|
310
|
+
"error": "Error message",
|
|
311
|
+
"status": "error"
|
|
312
|
+
}
|
|
313
|
+
```
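One way to produce this error form is to wrap the call to the system under test. The sketch below is an assumption about how a caller might do this, not part of the package: `answer_or_error`, the `ask` callable and `failing_ask` are hypothetical names.

```python
from typing import Callable


def answer_or_error(question_id: str, question_text: str,
                    ask: Callable[[str], dict]) -> dict:
    """Call `ask` and return a target entry, using the error form on failure."""
    try:
        result = ask(question_text)
    except Exception as exc:  # any failure becomes an error entry
        return {"question_id": question_id, "error": str(exc), "status": "error"}
    # On success, `ask` is assumed to return keys such as
    # `actual_answer` and `actual_steps`.
    return {"question_id": question_id, **result}


def failing_ask(question_text: str) -> dict:
    """Hypothetical system under test that always fails."""
    raise RuntimeError("Error message")


entry = answer_or_error("a8daaf98b84b4f6b0e0052fb942bf6b6", "Any question", failing_ask)
print(entry)
```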
|
|
314
|
+
|
|
315
|
+
### Example Usage Code
|
|
316
|
+
|
|
317
|
+
```python
from graphrag_eval import run_evaluation, compute_aggregates

reference_qas: list[dict] = []  # read your corpus
chat_responses: dict = {}  # call your implementation to get the response
evaluation_results = run_evaluation(reference_qas, chat_responses)
aggregates = compute_aggregates(evaluation_results)
```
|
|
325
|
+
|
|
326
|
+
`evaluation_results` is a list of statistics, one entry per question, as in section [Example Evaluation Results](#example-evaluation-results). The format is explained in section [Output Keys](#output-keys).
|
|
327
|
+
|
|
328
|
+
If your chat responses contain actual answers, set your environment variable `OPENAI_API_KEY` before running the code above.
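A quick pre-flight check can catch a missing key before any LLM call is attempted. The sketch below is illustrative only: `missing_api_key` is a hypothetical helper, and passing the responses as a list of per-question dicts is an assumption; adapt it to however you hold your chat responses.

```python
import os


def missing_api_key(responses: list[dict]) -> bool:
    """True if any response has an `actual_answer` but no API key is set."""
    has_answers = any("actual_answer" in r for r in responses)
    return has_answers and not os.environ.get("OPENAI_API_KEY")


# Hypothetical responses: one with an answer, one with steps only.
responses = [
    {"question_id": "q1", "actual_answer": "OSLO T1, OSLO T2"},
    {"question_id": "q2", "actual_steps": []},
]

if missing_api_key(responses):
    print("Set OPENAI_API_KEY before evaluating answers.")
```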
|
|
329
|
+
|
|
330
|
+
### Example Evaluation Results
|
|
331
|
+
|
|
332
|
+
The output is a list of statistics for each question from the reference Q&A dataset. Here is an example of statistics for one question:
|
|
333
|
+
|
|
334
|
+
```yaml
|
|
335
|
+
- template_id: list_all_transformers_within_Substation_SUBSTATION
|
|
336
|
+
question_id: c10bbc8dce98a4b8832d125134a16153
|
|
337
|
+
question_text: List all transformers within Substation OSLO
|
|
338
|
+
reference_answer: OSLO T1, OSLO T2
|
|
339
|
+
reference_steps:
|
|
340
|
+
- - name: sparql_query
|
|
341
|
+
args:
|
|
342
|
+
query: |2
|
|
343
|
+
|
|
344
|
+
PREFIX cimex: <https://rawgit2.com/statnett/Talk2PowerSystem/main/demo1/cimex/>
|
|
345
|
+
PREFIX cim: <https://cim.ucaiug.io/ns#>
|
|
346
|
+
PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
|
|
347
|
+
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
|
|
348
|
+
select distinct ?transformer ?transformerName
|
|
349
|
+
where {
|
|
350
|
+
bind(<urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f> as ?substation)
|
|
351
|
+
|
|
352
|
+
?transformer a cim:PowerTransformer ;
|
|
353
|
+
cim:Equipment.EquipmentContainer ?substation ;
|
|
354
|
+
cim:IdentifiedObject.name ?transformerName .
|
|
355
|
+
}
|
|
356
|
+
output: '{"head": {"vars": ["transformer", "transformerName"]}, "results": {"bindings":
|
|
357
|
+
[{"transformer": {"type": "uri", "value": "urn:uuid:f1769de8-9aeb-11e5-91da-b8763fd99c5f"},
|
|
358
|
+
"transformerName": {"type": "literal", "value": "OSLO T2"}}, {"transformer":
|
|
359
|
+
{"type": "uri", "value": "urn:uuid:f1769dd6-9aeb-11e5-91da-b8763fd99c5f"},
|
|
360
|
+
"transformerName": {"type": "literal", "value": "OSLO T1"}}]}}'
|
|
361
|
+
output_media_type: application/sparql-results+json
|
|
362
|
+
required_columns:
|
|
363
|
+
- transformer
|
|
364
|
+
- transformerName
|
|
365
|
+
matches: call_3b3zHJnBXwYYSg04BiFGAAgO
|
|
366
|
+
status: success
|
|
367
|
+
actual_answer: |-
|
|
368
|
+
The following transformers are located within the Substation OSLO:
|
|
369
|
+
1. **OSLO T2** (IRI: `urn:uuid:f1769de8-9aeb-11e5-91da-b8763fd99c5f`)
|
|
370
|
+
2. **OSLO T1** (IRI: `urn:uuid:f1769dd6-9aeb-11e5-91da-b8763fd99c5f`)
|
|
371
|
+
answer_reference_claims_count: 2
|
|
372
|
+
answer_actual_claims_count: 2
|
|
373
|
+
answer_matching_claims_count: 2
|
|
374
|
+
answer_correctness_reason: The candidate answer contains exactly the transformers listed in the reference answer, asked in the question
|
|
375
|
+
answer_recall: 1.0
|
|
376
|
+
answer_precision: 1.0
|
|
377
|
+
answer_f1: 1.0
|
|
378
|
+
answer_relevance: 0.9
|
|
379
|
+
answer_relevance_cost: 0.0007
|
|
380
|
+
actual_steps:
|
|
381
|
+
- name: autocomplete_search
|
|
382
|
+
args:
|
|
383
|
+
query: OSLO
|
|
384
|
+
result_class: cim:Substation
|
|
385
|
+
id: call_3wIrBHIsInzAWzo8qwwYAkDD
|
|
386
|
+
status: success
|
|
387
|
+
output: |-
|
|
388
|
+
{
|
|
389
|
+
"head": {
|
|
390
|
+
"vars": [
|
|
391
|
+
"iri",
|
|
392
|
+
"name",
|
|
393
|
+
"rank"
|
|
394
|
+
]
|
|
395
|
+
},
|
|
396
|
+
"results": {
|
|
397
|
+
"bindings": [
|
|
398
|
+
{
|
|
399
|
+
"iri": {
|
|
400
|
+
"type": "uri",
|
|
401
|
+
"value": "urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f"
|
|
402
|
+
},
|
|
403
|
+
"name": {
|
|
404
|
+
"type": "literal",
|
|
405
|
+
"value": "OSLO"
|
|
406
|
+
},
|
|
407
|
+
"rank": {
|
|
408
|
+
"datatype": "http://www.w3.org/2001/XMLSchema#float",
|
|
409
|
+
"type": "literal",
|
|
410
|
+
"value": "0.01185"
|
|
411
|
+
}
|
|
412
|
+
}
|
|
413
|
+
]
|
|
414
|
+
}
|
|
415
|
+
}
|
|
416
|
+
- name: sparql_query
|
|
417
|
+
args:
|
|
418
|
+
query: |-
|
|
419
|
+
SELECT ?transformer ?transformerName WHERE {
|
|
420
|
+
?transformer a cim:PowerTransformer ;
|
|
421
|
+
cim:Equipment.EquipmentContainer <urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f> ;
|
|
422
|
+
cim:IdentifiedObject.name ?transformerName .
|
|
423
|
+
}
|
|
424
|
+
id: call_3b3zHJnBXwYYSg04BiFGAAgO
|
|
425
|
+
status: success
|
|
426
|
+
output: |-
|
|
427
|
+
{
|
|
428
|
+
"head": {
|
|
429
|
+
"vars": [
|
|
430
|
+
"transformer",
|
|
431
|
+
"transformerName"
|
|
432
|
+
]
|
|
433
|
+
},
|
|
434
|
+
"results": {
|
|
435
|
+
"bindings": [
|
|
436
|
+
{
|
|
437
|
+
"transformer": {
|
|
438
|
+
"type": "uri",
|
|
439
|
+
"value": "urn:uuid:f1769de8-9aeb-11e5-91da-b8763fd99c5f"
|
|
440
|
+
},
|
|
441
|
+
"transformerName": {
|
|
442
|
+
"type": "literal",
|
|
443
|
+
"value": "OSLO T2"
|
|
444
|
+
}
|
|
445
|
+
},
|
|
446
|
+
{
|
|
447
|
+
"transformer": {
|
|
448
|
+
"type": "uri",
|
|
449
|
+
"value": "urn:uuid:f1769dd6-9aeb-11e5-91da-b8763fd99c5f"
|
|
450
|
+
},
|
|
451
|
+
"transformerName": {
|
|
452
|
+
"type": "literal",
|
|
453
|
+
"value": "OSLO T1"
|
|
454
|
+
}
|
|
455
|
+
}
|
|
456
|
+
]
|
|
457
|
+
}
|
|
458
|
+
}
|
|
459
|
+
steps_score: 1
|
|
460
|
+
input_tokens: 221339
|
|
461
|
+
output_tokens: 212
|
|
462
|
+
total_tokens: 221551
|
|
463
|
+
elapsed_sec: 6.601679801940918
|
|
464
|
+
```
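The answer metrics in the example above follow from the three claim counts, using the formulas given in section [Output Keys](#output-keys). The sketch below reproduces that arithmetic; `answer_metrics` is a hypothetical helper, and returning `0.0` when a denominator is zero is an assumption, not documented behavior of the package.

```python
def answer_metrics(reference_claims: int, actual_claims: int,
                   matching_claims: int) -> dict[str, float]:
    """Compute recall, precision and F1 from claim counts."""
    # Assumption: a zero denominator yields 0.0 instead of raising.
    recall = matching_claims / reference_claims if reference_claims else 0.0
    precision = matching_claims / actual_claims if actual_claims else 0.0
    if recall + precision == 0:
        f1 = 0.0
    else:
        # Harmonic mean of recall and precision.
        f1 = 2 * recall * precision / (recall + precision)
    return {"answer_recall": recall, "answer_precision": precision, "answer_f1": f1}


# Counts from the example above: 2 reference, 2 actual, 2 matching claims.
print(answer_metrics(2, 2, 2))

# A partially complete answer: 4 reference claims, 2 actual, 2 matching.
print(answer_metrics(4, 2, 2))
```

In the second call, recall is 0.5 and precision is 1.0, giving an F1 of about 0.667.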
|
|
465
|
+
|
|
466
|
+
### Output Keys
|
|
467
|
+
|
|
468
|
+
- `template_id`: the template id
|
|
469
|
+
- `question_id`: the question id
|
|
470
|
+
- `question_text`: the natural language query
|
|
471
|
+
- `reference_steps`: (optional) copy of the expected steps in the Q&A dataset, if specified there
|
|
472
|
+
- `reference_answer`: (optional) copy of the expected answer in the Q&A dataset, if specified there
|
|
473
|
+
- `status`: "success" or "error", indicating whether the evaluation succeeded
|
|
474
|
+
- `actual_answer`: (optional) copy of the response text in the evaluation target, if specified there
|
|
475
|
+
- `answer_reference_claims_count`: (optional) number of claims extracted from the reference answer, if a reference answer and actual answer are available
|
|
476
|
+
- `answer_actual_claims_count`: (optional) number of claims extracted from the answer being evaluated, if a reference answer and actual answer are available
|
|
477
|
+
- `answer_matching_claims_count`: (optional) number of matching claims between the reference answer and the actual answer, if a reference answer and actual answer are available
|
|
478
|
+
- `answer_recall`: (optional) `answer_matching_claims_count / answer_reference_claims_count`
|
|
479
|
+
- `answer_precision`: (optional) `answer_matching_claims_count / answer_actual_claims_count`
|
|
480
|
+
- `answer_correctness_reason`: (optional) LLM reasoning in extracting and matching claims from the reference answer and the actual answer
|
|
481
|
+
- `answer_eval_error`: (optional) error message if answer evaluation failed
|
|
482
|
+
- `answer_f1`: (optional) Harmonic mean of `answer_recall` and `answer_precision`
|
|
483
|
+
- `answer_relevance`: (optional) a value representing how relevant the actual answer is to the question, computed using [RAGAS answer relevance](https://docs.ragas.io/en/v0.3.3/concepts/metrics/available_metrics/answer_relevance/)
|
|
484
|
+
- `answer_relevance_error`: (optional) error message if answer relevance evaluation failed
|
|
485
|
+
- `answer_relevance_cost`: the cost of the LLM usage needed to compute `answer_relevance`, in US dollars
|
|
486
|
+
- `actual_steps`: (optional) copy of the steps in the evaluation target, if specified there
|
|
487
|
+
- `steps_score`: a real number between 0 and 1, computed by comparing the results of the last executed steps to the reference's last group of steps. If none of the actual steps match, the score is `0`. Otherwise, it is the number of matched steps in the last group divided by the total number of steps in the last group.
|
|
488
|
+
- `input_tokens`: input tokens usage
|
|
489
|
+
- `output_tokens`: output tokens usage
|
|
490
|
+
- `total_tokens`: total tokens usage
|
|
491
|
+
- `elapsed_sec`: elapsed seconds
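
The relationship between the three claim counts and `answer_recall`, `answer_precision` and `answer_f1` can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name `answer_scores` is hypothetical, and returning `0.0` when a denominator is zero is an assumption made here.

```python
def answer_scores(reference_claims: int, actual_claims: int, matching_claims: int) -> dict:
    """Derive recall, precision and F1 from claim counts (illustrative only)."""
    # Assumption: a zero denominator yields 0.0 instead of raising an error.
    recall = matching_claims / reference_claims if reference_claims else 0.0
    precision = matching_claims / actual_claims if actual_claims else 0.0
    total = recall + precision
    # F1 is the harmonic mean of recall and precision.
    f1 = 2 * recall * precision / total if total else 0.0
    return {
        "answer_recall": recall,
        "answer_precision": precision,
        "answer_f1": f1,
    }


# 4 reference claims, 5 claims in the actual answer, 3 of them matching.
scores = answer_scores(reference_claims=4, actual_claims=5, matching_claims=3)
print(scores)
```

With these counts, recall is `3 / 4`, precision is `3 / 5`, and F1 falls between the two.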
|
|
492
|
+
|
|
493
|
+
#### Aggregates Keys
|
|
494
|
+
|
|
495
|
+
The `aggregates` object provides aggregated evaluation metrics.
|
|
496
|
+
Aggregates are computed both per-template and overall, using micro and macro averaging strategies.
|
|
497
|
+
These aggregates support analysis of agent quality, token efficiency, and execution performance.
|
|
498
|
+
Aggregates are:
|
|
499
|
+
- `per_template`: a dictionary mapping a template identifier to the following statistics:
|
|
500
|
+
- `number_of_error_samples`: number of questions for this template that resulted in an error response
|
|
501
|
+
- `number_of_success_samples`: number of questions for this template that resulted in a successful response
|
|
502
|
+
- `input_tokens`: `sum`, `mean`, `median`, `min` and `max` statistics for `input_tokens` of all successful questions for this template
|
|
503
|
+
- `output_tokens`: `sum`, `mean`, `median`, `min` and `max` statistics for `output_tokens` of all successful questions for this template
|
|
504
|
+
- `total_tokens`: `sum`, `mean`, `median`, `min` and `max` statistics for `total_tokens` of all successful questions for this template
|
|
505
|
+
- `elapsed_sec`: `sum`, `mean`, `median`, `min` and `max` statistics for `elapsed_sec` of all successful questions for this template
|
|
506
|
+
- `answer_recall`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_recall` of all successful questions for this template
|
|
507
|
+
- `answer_precision`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_precision` of all successful questions for this template
|
|
508
|
+
- `answer_f1`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_f1` of all successful questions for this template
|
|
509
|
+
- `answer_relevance`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_relevance` of all successful questions for this template
|
|
510
|
+
- `steps_score`: `sum`, `mean`, `median`, `min` and `max` statistics for `steps_score` of all successful questions for this template
|
|
511
|
+
- `steps`: step execution counts over all successful questions for this template. Includes:
|
|
512
|
+
- `total`: for each step type, how many times it was executed
|
|
513
|
+
- `once_per_sample`: for each step type, the number of questions in which it was executed at least once
|
|
514
|
+
- `empty_results`: how many times the step was executed and returned empty results
|
|
515
|
+
- `errors`: how many times the step was executed and resulted in error
|
|
516
|
+
- `micro`: statistics across questions, regardless of template. It includes:
|
|
517
|
+
- `number_of_error_samples`: total number of questions that resulted in an error response
|
|
518
|
+
- `number_of_success_samples`: total number of questions that resulted in a successful response
|
|
519
|
+
- `input_tokens`: `sum`, `mean`, `median`, `min` and `max` for `input_tokens` of all successful questions
|
|
520
|
+
- `output_tokens`: `sum`, `mean`, `median`, `min` and `max` for `output_tokens` of all successful questions
|
|
521
|
+
- `total_tokens`: `sum`, `mean`, `median`, `min` and `max` for `total_tokens` of all successful questions
|
|
522
|
+
- `elapsed_sec`: `sum`, `mean`, `median`, `min` and `max` for `elapsed_sec` of all successful questions
|
|
523
|
+
- `answer_recall`: `sum`, `mean`, `median`, `min` and `max` for `answer_recall` of all successful questions
|
|
524
|
+
- `answer_precision`: `sum`, `mean`, `median`, `min` and `max` for `answer_precision` of all successful questions
|
|
525
|
+
- `answer_f1`: `sum`, `mean`, `median`, `min` and `max` for `answer_f1` of all successful questions
|
|
526
|
+
- `answer_relevance`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_relevance` of all successful questions
|
|
527
|
+
- `answer_relevance_cost`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_relevance_cost` of all successful questions
|
|
528
|
+
- `steps_score`: `sum`, `mean`, `median`, `min` and `max` for `steps_score` of all successful questions
|
|
529
|
+
- `macro`: averages across templates, i.e., for each metric, the mean of the per-template means. It includes:
|
|
530
|
+
- `input_tokens`: `mean` for `input_tokens`
|
|
531
|
+
- `output_tokens`: `mean` for `output_tokens`
|
|
532
|
+
- `total_tokens`: `mean` for `total_tokens`
|
|
533
|
+
- `elapsed_sec`: `mean` for `elapsed_sec`
|
|
534
|
+
- `answer_recall`: `mean` for `answer_recall`
|
|
535
|
+
- `answer_precision`: `mean` for `answer_precision`
|
|
536
|
+
- `answer_f1`: `mean` for `answer_f1`
|
|
537
|
+
- `answer_relevance`: `mean` for `answer_relevance`
|
|
538
|
+
- `answer_relevance_cost`: `mean` for `answer_relevance_cost`
|
|
539
|
+
- `steps_score`: `mean` for `steps_score`
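
The difference between `micro` and `macro` averaging can be sketched for a single metric as follows. This is an illustration under stated assumptions, not the library's code: the template names and the `scores_by_template` structure are hypothetical, and the per-question values are chosen only so that the per-template sums and sample counts resemble a `steps_score` scenario with four templates.

```python
# Hypothetical per-question steps_score values, grouped by template id.
scores_by_template = {
    "template_a": [1] * 8 + [0] * 2,  # mean 0.8
    "template_b": [0] * 10,           # mean 0.0
    "template_c": [1] * 9,            # mean 1.0 (one question failed and is excluded)
    "template_d": [0] * 10,           # mean 0.0
}

# Micro: pool all questions, regardless of template, then average.
all_scores = [s for scores in scores_by_template.values() for s in scores]
micro_mean = sum(all_scores) / len(all_scores)

# Macro: average within each template first, then average those means.
template_means = [sum(s) / len(s) for s in scores_by_template.values()]
macro_mean = sum(template_means) / len(template_means)

print(f"micro mean: {micro_mean:.4f}")  # 17 / 39
print(f"macro mean: {macro_mean:.4f}")  # (0.8 + 0 + 1 + 0) / 4
```

Micro averaging weights every question equally, so templates with more successful questions have more influence. Macro averaging weights every template equally.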
|
|
540
|
+
|
|
541
|
+
#### Example Aggregates
|
|
542
|
+
|
|
543
|
+
```yaml
|
|
544
|
+
per_template:
|
|
545
|
+
list_all_transformers_within_Substation_SUBSTATION:
|
|
546
|
+
number_of_error_samples: 0
|
|
547
|
+
number_of_success_samples: 10
|
|
548
|
+
answer_recall:
|
|
549
|
+
sum: 1.0
|
|
550
|
+
mean: 1.0
|
|
551
|
+
median: 1.0
|
|
552
|
+
min: 1.0
|
|
553
|
+
max: 1.0
|
|
554
|
+
answer_precision:
|
|
555
|
+
sum: 1.0
|
|
556
|
+
mean: 1.0
|
|
557
|
+
median: 1.0
|
|
558
|
+
min: 1.0
|
|
559
|
+
max: 1.0
|
|
560
|
+
answer_f1:
|
|
561
|
+
sum: 1.0
|
|
562
|
+
mean: 1.0
|
|
563
|
+
median: 1.0
|
|
564
|
+
min: 1.0
|
|
565
|
+
max: 1.0
|
|
566
|
+
answer_relevance:
|
|
567
|
+
min: 0.9
|
|
568
|
+
max: 0.9
|
|
569
|
+
mean: 0.9
|
|
570
|
+
median: 0.9
|
|
571
|
+
sum: 0.9
|
|
572
|
+
answer_relevance_cost:
|
|
573
|
+
min: 0.0007
|
|
574
|
+
max: 0.0007
|
|
575
|
+
mean: 0.0007
|
|
576
|
+
median: 0.0007
|
|
577
|
+
sum: 0.0007
|
|
578
|
+
steps:
|
|
579
|
+
total:
|
|
580
|
+
autocomplete_search: 10
|
|
581
|
+
sparql_query: 8
|
|
582
|
+
once_per_sample:
|
|
583
|
+
autocomplete_search: 10
|
|
584
|
+
sparql_query: 8
|
|
585
|
+
empty_results:
|
|
586
|
+
autocomplete_search: 2
|
|
587
|
+
steps_score:
|
|
588
|
+
sum: 8
|
|
589
|
+
mean: 0.8
|
|
590
|
+
median: 1
|
|
591
|
+
min: 0
|
|
592
|
+
max: 1
|
|
593
|
+
input_tokens:
|
|
594
|
+
sum: 2064559
|
|
595
|
+
mean: 206455.9
|
|
596
|
+
median: 221263.5
|
|
597
|
+
min: 147171
|
|
598
|
+
max: 221339
|
|
599
|
+
output_tokens:
|
|
600
|
+
sum: 1555
|
|
601
|
+
mean: 155.5
|
|
602
|
+
median: 177
|
|
603
|
+
min: 46
|
|
604
|
+
max: 212
|
|
605
|
+
total_tokens:
|
|
606
|
+
sum: 2066114
|
|
607
|
+
mean: 206611.4
|
|
608
|
+
median: 221439.5
|
|
609
|
+
min: 147217
|
|
610
|
+
max: 221551
|
|
611
|
+
elapsed_sec:
|
|
612
|
+
sum: 259.2278094291687
|
|
613
|
+
mean: 25.92278094291687
|
|
614
|
+
median: 9.677194952964783
|
|
615
|
+
min: 5.529741525650024
|
|
616
|
+
max: 55.4010910987854
|
|
617
|
+
list_all_substations_within_bidding_zone_REGION:
|
|
618
|
+
number_of_error_samples: 0
|
|
619
|
+
number_of_success_samples: 10
|
|
620
|
+
answer_recall:
|
|
621
|
+
sum: 1.0
|
|
622
|
+
mean: 1.0
|
|
623
|
+
median: 1.0
|
|
624
|
+
min: 1.0
|
|
625
|
+
max: 1.0
|
|
626
|
+
answer_precision:
|
|
627
|
+
sum: 1.0
|
|
628
|
+
mean: 1.0
|
|
629
|
+
median: 1.0
|
|
630
|
+
min: 1.0
|
|
631
|
+
max: 1.0
|
|
632
|
+
answer_f1:
|
|
633
|
+
sum: 1.0
|
|
634
|
+
mean: 1.0
|
|
635
|
+
median: 1.0
|
|
636
|
+
min: 1.0
|
|
637
|
+
max: 1.0
|
|
638
|
+
answer_relevance:
|
|
639
|
+
min: 0.9
|
|
640
|
+
max: 0.9
|
|
641
|
+
mean: 0.9
|
|
642
|
+
median: 0.9
|
|
643
|
+
sum: 0.9
|
|
644
|
+
answer_relevance_cost:
|
|
645
|
+
min: 0.0007
|
|
646
|
+
max: 0.0007
|
|
647
|
+
mean: 0.0007
|
|
648
|
+
median: 0.0007
|
|
649
|
+
sum: 0.0007
|
|
650
|
+
steps:
|
|
651
|
+
total:
|
|
652
|
+
autocomplete_search: 10
|
|
653
|
+
once_per_sample:
|
|
654
|
+
autocomplete_search: 10
|
|
655
|
+
empty_results:
|
|
656
|
+
autocomplete_search: 10
|
|
657
|
+
steps_score:
|
|
658
|
+
sum: 0
|
|
659
|
+
mean: 0
|
|
660
|
+
median: 0
|
|
661
|
+
min: 0
|
|
662
|
+
max: 0
|
|
663
|
+
input_tokens:
|
|
664
|
+
sum: 1471880
|
|
665
|
+
mean: 147188
|
|
666
|
+
median: 147188
|
|
667
|
+
min: 147188
|
|
668
|
+
max: 147188
|
|
669
|
+
output_tokens:
|
|
670
|
+
sum: 571
|
|
671
|
+
mean: 57.1
|
|
672
|
+
median: 57
|
|
673
|
+
min: 56
|
|
674
|
+
max: 61
|
|
675
|
+
total_tokens:
|
|
676
|
+
sum: 1472451
|
|
677
|
+
mean: 147245.1
|
|
678
|
+
median: 147245
|
|
679
|
+
min: 147244
|
|
680
|
+
max: 147249
|
|
681
|
+
elapsed_sec:
|
|
682
|
+
sum: 185.5483124256134
|
|
683
|
+
mean: 18.55483124256134
|
|
684
|
+
median: 8.886059165000916
|
|
685
|
+
min: 2.8653159141540527
|
|
686
|
+
max: 47.51542258262634
|
|
687
|
+
list_all_substations_that_are_connected_via_an_ac_line_or_a_dc_line_to_substation_named_SUBSTATION:
|
|
688
|
+
number_of_error_samples: 1
|
|
689
|
+
number_of_success_samples: 9
|
|
690
|
+
answer_recall:
|
|
691
|
+
sum: 1.0
|
|
692
|
+
mean: 1.0
|
|
693
|
+
median: 1.0
|
|
694
|
+
min: 1.0
|
|
695
|
+
max: 1.0
|
|
696
|
+
answer_precision:
|
|
697
|
+
sum: 1.0
|
|
698
|
+
mean: 1.0
|
|
699
|
+
median: 1.0
|
|
700
|
+
min: 1.0
|
|
701
|
+
max: 1.0
|
|
702
|
+
answer_f1:
|
|
703
|
+
sum: 1.0
|
|
704
|
+
mean: 1.0
|
|
705
|
+
median: 1.0
|
|
706
|
+
min: 1.0
|
|
707
|
+
max: 1.0
|
|
708
|
+
answer_relevance:
|
|
709
|
+
min: 0.9
|
|
710
|
+
max: 0.9
|
|
711
|
+
mean: 0.9
|
|
712
|
+
median: 0.9
|
|
713
|
+
sum: 0.9
|
|
714
|
+
answer_relevance_cost:
|
|
715
|
+
min: 0.0007
|
|
716
|
+
max: 0.0007
|
|
717
|
+
mean: 0.0007
|
|
718
|
+
median: 0.0007
|
|
719
|
+
sum: 0.0007
|
|
720
|
+
steps:
|
|
721
|
+
total:
|
|
722
|
+
autocomplete_search: 9
|
|
723
|
+
sparql_query: 17
|
|
724
|
+
once_per_sample:
|
|
725
|
+
autocomplete_search: 9
|
|
726
|
+
sparql_query: 9
|
|
727
|
+
errors:
|
|
728
|
+
sparql_query: 8
|
|
729
|
+
steps_score:
|
|
730
|
+
sum: 9
|
|
731
|
+
mean: 1
|
|
732
|
+
median: 1
|
|
733
|
+
min: 1
|
|
734
|
+
max: 1
|
|
735
|
+
input_tokens:
|
|
736
|
+
sum: 2601595
|
|
737
|
+
mean: 289066.1111111111
|
|
738
|
+
median: 297059
|
|
739
|
+
min: 222528
|
|
740
|
+
max: 298028
|
|
741
|
+
output_tokens:
|
|
742
|
+
sum: 6066
|
|
743
|
+
mean: 674
|
|
744
|
+
median: 700
|
|
745
|
+
min: 363
|
|
746
|
+
max: 805
|
|
747
|
+
total_tokens:
|
|
748
|
+
sum: 2607661
|
|
749
|
+
mean: 289740.1111111111
|
|
750
|
+
median: 297759
|
|
751
|
+
min: 222891
|
|
752
|
+
max: 298787
|
|
753
|
+
elapsed_sec:
|
|
754
|
+
sum: 354.82168316841125
|
|
755
|
+
mean: 39.42463146315681
|
|
756
|
+
median: 41.88556528091431
|
|
757
|
+
min: 26.418761014938354
|
|
758
|
+
max: 52.42662525177002
|
|
759
|
+
list_all_ac_lines_that_traverse_bidding_zones_REGION1_and_REGION2:
|
|
760
|
+
number_of_error_samples: 0
|
|
761
|
+
number_of_success_samples: 10
|
|
762
|
+
answer_recall:
|
|
763
|
+
sum: 1.0
|
|
764
|
+
mean: 1.0
|
|
765
|
+
median: 1.0
|
|
766
|
+
min: 1.0
|
|
767
|
+
max: 1.0
|
|
768
|
+
answer_precision:
|
|
769
|
+
sum: 1.0
|
|
770
|
+
mean: 1.0
|
|
771
|
+
median: 1.0
|
|
772
|
+
min: 1.0
|
|
773
|
+
max: 1.0
|
|
774
|
+
answer_f1:
|
|
775
|
+
sum: 1.0
|
|
776
|
+
mean: 1.0
|
|
777
|
+
median: 1.0
|
|
778
|
+
min: 1.0
|
|
779
|
+
max: 1.0
|
|
780
|
+
answer_relevance:
|
|
781
|
+
min: 0.9
|
|
782
|
+
max: 0.9
|
|
783
|
+
mean: 0.9
|
|
784
|
+
median: 0.9
|
|
785
|
+
sum: 0.9
|
|
786
|
+
answer_relevance_cost:
|
|
787
|
+
min: 0.0007
|
|
788
|
+
max: 0.0007
|
|
789
|
+
mean: 0.0007
|
|
790
|
+
median: 0.0007
|
|
791
|
+
sum: 0.0007
|
|
792
|
+
steps:
|
|
793
|
+
total:
|
|
794
|
+
autocomplete_search: 20
|
|
795
|
+
once_per_sample:
|
|
796
|
+
autocomplete_search: 10
|
|
797
|
+
empty_results:
|
|
798
|
+
autocomplete_search: 20
|
|
799
|
+
steps_score:
|
|
800
|
+
sum: 0
|
|
801
|
+
mean: 0
|
|
802
|
+
median: 0
|
|
803
|
+
min: 0
|
|
804
|
+
max: 0
|
|
805
|
+
input_tokens:
|
|
806
|
+
sum: 1472540
|
|
807
|
+
mean: 147254
|
|
808
|
+
median: 147254
|
|
809
|
+
min: 147254
|
|
810
|
+
max: 147254
|
|
811
|
+
output_tokens:
|
|
812
|
+
sum: 1052
|
|
813
|
+
mean: 105.2
|
|
814
|
+
median: 105
|
|
815
|
+
min: 105
|
|
816
|
+
max: 107
|
|
817
|
+
total_tokens:
|
|
818
|
+
sum: 1473592
|
|
819
|
+
mean: 147359.2
|
|
820
|
+
median: 147359
|
|
821
|
+
min: 147359
|
|
822
|
+
max: 147361
|
|
823
|
+
elapsed_sec:
|
|
824
|
+
sum: 197.44370341300964
|
|
825
|
+
mean: 19.744370341300964
|
|
826
|
+
median: 18.030158162117004
|
|
827
|
+
min: 15.56333041191101
|
|
828
|
+
max: 26.422670125961304
|
|
829
|
+
micro:
|
|
830
|
+
number_of_error_samples: 1
|
|
831
|
+
number_of_success_samples: 39
|
|
832
|
+
answer_recall:
|
|
833
|
+
sum: 1.0
|
|
834
|
+
mean: 1.0
|
|
835
|
+
median: 1.0
|
|
836
|
+
min: 1.0
|
|
837
|
+
max: 1.0
|
|
838
|
+
answer_precision:
|
|
839
|
+
sum: 1.0
|
|
840
|
+
mean: 1.0
|
|
841
|
+
median: 1.0
|
|
842
|
+
min: 1.0
|
|
843
|
+
max: 1.0
|
|
844
|
+
answer_f1:
|
|
845
|
+
sum: 1.0
|
|
846
|
+
mean: 1.0
|
|
847
|
+
median: 1.0
|
|
848
|
+
min: 1.0
|
|
849
|
+
max: 1.0
|
|
850
|
+
answer_relevance:
|
|
851
|
+
min: 0.9
|
|
852
|
+
max: 0.9
|
|
853
|
+
mean: 0.9
|
|
854
|
+
median: 0.9
|
|
855
|
+
sum: 0.9
|
|
856
|
+
answer_relevance_cost:
|
|
857
|
+
min: 0.0007
|
|
858
|
+
max: 0.0007
|
|
859
|
+
mean: 0.0007
|
|
860
|
+
median: 0.0007
|
|
861
|
+
sum: 0.0007
|
|
862
|
+
steps_score:
|
|
863
|
+
sum: 17
|
|
864
|
+
mean: 0.4358974358974359
|
|
865
|
+
median: 0
|
|
866
|
+
min: 0
|
|
867
|
+
max: 1
|
|
868
|
+
input_tokens:
|
|
869
|
+
sum: 7610574
|
|
870
|
+
mean: 195142.92307692306
|
|
871
|
+
median: 147254
|
|
872
|
+
min: 147171
|
|
873
|
+
max: 298028
|
|
874
|
+
output_tokens:
|
|
875
|
+
sum: 9244
|
|
876
|
+
mean: 237.02564102564102
|
|
877
|
+
median: 105
|
|
878
|
+
min: 46
|
|
879
|
+
max: 805
|
|
880
|
+
total_tokens:
|
|
881
|
+
sum: 7619818
|
|
882
|
+
mean: 195379.94871794872
|
|
883
|
+
median: 147359
|
|
884
|
+
min: 147217
|
|
885
|
+
max: 298787
|
|
886
|
+
elapsed_sec:
|
|
887
|
+
sum: 997.041508436203
|
|
888
|
+
mean: 25.565166882979565
|
|
889
|
+
median: 18.32871961593628
|
|
890
|
+
min: 2.8653159141540527
|
|
891
|
+
max: 55.4010910987854
|
|
892
|
+
macro:
|
|
893
|
+
answer_recall:
|
|
894
|
+
mean: 1.0
|
|
895
|
+
answer_precision:
|
|
896
|
+
mean: 1.0
|
|
897
|
+
answer_f1:
|
|
898
|
+
mean: 1.0
|
|
899
|
+
answer_relevance:
|
|
900
|
+
mean: 0.9
|
|
901
|
+
answer_relevance_cost:
|
|
902
|
+
mean: 0.0007
|
|
903
|
+
steps_score:
|
|
904
|
+
mean: 0.45
|
|
905
|
+
input_tokens:
|
|
906
|
+
mean: 197491.0027777778
|
|
907
|
+
output_tokens:
|
|
908
|
+
mean: 247.95
|
|
909
|
+
total_tokens:
|
|
910
|
+
mean: 197738.9527777778
|
|
911
|
+
elapsed_sec:
|
|
912
|
+
mean: 25.911653497483996
|
|
913
|
+
```
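
The `steps` counters in the example above (`total`, `once_per_sample`, `errors`) can be sketched with `collections.Counter`. This is a minimal illustration rather than the library's implementation: the `samples` structure is hypothetical, and only the `name` and `status` step fields are used. `empty_results` is omitted because detecting an empty result depends on each step's output format.

```python
from collections import Counter

# Hypothetical records: one list of executed steps per successful question.
samples = [
    [
        {"name": "autocomplete_search", "status": "success"},
        {"name": "sparql_query", "status": "error"},
        {"name": "sparql_query", "status": "success"},
    ],
    [
        {"name": "autocomplete_search", "status": "success"},
        {"name": "sparql_query", "status": "success"},
    ],
]

total = Counter()            # every execution of each step type
once_per_sample = Counter()  # each step type counted at most once per question
errors = Counter()           # executions that ended with status "error"

for steps in samples:
    names = [step["name"] for step in steps]
    total.update(names)
    once_per_sample.update(set(names))
    errors.update(step["name"] for step in steps if step["status"] == "error")

print(dict(total))
print(dict(once_per_sample))
print(dict(errors))
```

A step type that is retried within one question raises `total` but not `once_per_sample`, which is why the two counters can differ for `sparql_query`.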
|
|
914
|
+
|
|
915
|
+
### Retrieval Evaluation
|
|
916
|
+
|
|
917
|
+
The following metrics are based on the ids of retrieved documents.
|
|
918
|
+
|
|
919
|
+
#### Recall@k Metric
|
|
920
|
+
|
|
921
|
+
The fraction of all relevant items that appear among the top *k* retrieved results. It answers the question: "Of all items the user cares about, how many did we include in the first k spots?"
|
|
922
|
+
* **Formula**:
|
|
923
|
+
$`
|
|
924
|
+
\frac{\text{Number of relevant items in top k}}{\text{Number of relevant items}}
|
|
925
|
+
`$
|
|
926
|
+
* **Calculation**: Count the number of relevant items in the top `k` retrieved results; divide that by the *total* number of relevant items.
|
|
927
|
+
* **Example**: Suppose there are 4 relevant documents for a given query. Suppose our system retrieves 3 of them in the top 5 results (`k=5`). Recall@5 is `3 / 4 = 0.75`.
|
|
928
|
+
|
|
929
|
+
```python
|
|
930
|
+
recall_at_k(
|
|
931
|
+
relevant_docs={1, 3, 5, 6},
|
|
932
|
+
retrieved_docs=[1, 4, 3, 5, 7],
|
|
933
|
+
k=5
|
|
934
|
+
) # => 0.75
|
|
935
|
+
```
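
The calculation described above can be sketched in a few lines. This is an illustrative reimplementation, not the package's source: the name `recall_at_k_sketch` is hypothetical, and returning `0.0` for an empty set of relevant documents is an assumption.

```python
def recall_at_k_sketch(relevant_docs: set, retrieved_docs: list, k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_docs:
        return 0.0  # assumption: no relevant documents means a recall of 0.0
    top_k = retrieved_docs[:k]
    # Set intersection counts each relevant document at most once.
    hits = len(relevant_docs.intersection(top_k))
    return hits / len(relevant_docs)


result = recall_at_k_sketch(
    relevant_docs={1, 3, 5, 6},
    retrieved_docs=[1, 4, 3, 5, 7],
    k=5,
)
print(result)  # 3 of the 4 relevant documents are in the top 5
```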
|
|
936
|
+
|
|
937
|
+
#### Average Precision (AP) Metric
|
|
938
|
+
|
|
939
|
+
Evaluates a ranked list of recommendations by looking at the precision at the position of each correctly retrieved item. It rewards systems for placing relevant items higher up in the list. It's more sophisticated than just looking at precision at a single cutoff because it considers the entire ranking.
|
|
940
|
+
* **Formula**:
|
|
941
|
+
$`
|
|
942
|
+
\frac{\sum_{k=1}^{n} (P(k) \times \text{rel}(k))}{\text{Number of relevant items retrieved}}
|
|
943
|
+
`$,\
|
|
944
|
+
where:
|
|
945
|
+
* `P(k)` is the precision at rank `k`
|
|
946
|
+
* `rel(k)` is 1 if the item at rank `k` is relevant and 0 otherwise.
|
|
947
|
+
* **Calculation**:
|
|
948
|
+
1. For each retrieved item, if it is relevant, record the precision at that index (i.e., `number of hits / current rank`).
|
|
949
|
+
2. Sum these precision scores.
|
|
950
|
+
3. Divide the sum by the number of relevant items that were retrieved (the number of hits).
|
|
951
|
+
* **Example**:
|
|
952
|
+
* Suppose:
|
|
953
|
+
* The relevant items are `1, 3, 5, 6`
|
|
954
|
+
* Our system retrieves `1, 4, 3, 5, 7`
|
|
955
|
+
* Calculation:
|
|
956
|
+
* Item at index 1 (item 1) is relevant. Precision@1 = 1/1
|
|
957
|
+
* Item at index 3 (item 3) is relevant. Precision@3 = 2/3
|
|
958
|
+
* Item at index 4 (item 5) is relevant. Precision@4 = 3/4
|
|
959
|
+
* AP = (1.0 + 2/3 + 3/4) / 3 = 0.8055...
|
|
960
|
+
|
|
961
|
+
```python
|
|
962
|
+
average_precision(
|
|
963
|
+
relevant_docs={1, 3, 5, 6},
|
|
964
|
+
retrieved_docs=[1, 4, 3, 5, 7]
|
|
965
|
+
) # ~=> 0.8056
|
|
966
|
+
```
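
The worked example above can be reproduced with a short loop. This is a sketch rather than the package's implementation: the name `average_precision_sketch` is hypothetical. Following the worked example, it divides by the number of relevant items actually retrieved (3 here), and returning `0.0` when nothing relevant is retrieved is an assumption.

```python
def average_precision_sketch(relevant_docs: set, retrieved_docs: list) -> float:
    """Mean of the precision values at each rank where a relevant item occurs."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(retrieved_docs, start=1):
        if doc_id in relevant_docs:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumption: no relevant item retrieved means a score of 0.0.
    return precision_sum / hits if hits else 0.0


ap = average_precision_sketch(
    relevant_docs={1, 3, 5, 6},
    retrieved_docs=[1, 4, 3, 5, 7],
)
print(round(ap, 4))  # (1/1 + 2/3 + 3/4) / 3
```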
|
|
967
|
+
|