graphrag-eval 4.0.0__py3-none-any.whl

@@ -0,0 +1,967 @@
1
+ Metadata-Version: 2.3
2
+ Name: graphrag-eval
3
+ Version: 4.0.0
4
+ Summary: For assessing question answering systems' final answers and intermediate steps, against a given set of questions, reference answers and steps.
5
+ License: Apache-2.0
6
+ Author: Neli Hateva
7
+ Author-email: neli.hateva@graphwise.ai
8
+ Requires-Python: >=3.12,<3.13
9
+ Classifier: License :: OSI Approved :: Apache Software License
10
+ Classifier: Programming Language :: Python :: 3
11
+ Classifier: Programming Language :: Python :: 3.12
12
+ Project-URL: Repository, https://github.com/Ontotext-AD/qa-eval
13
+ Description-Content-Type: text/markdown
14
+
15
+ <p align="center">
16
+ <img alt="Graphwise Logo" src=".github/Graphwise_Logo.jpg">
17
+ </p>
18
+
19
+ # QA Evaluation
20
+
21
+ This is a Python module for assessing the quality of question-answering systems, such as those based on LLM agents, against a set of questions and their reference answers. It evaluates both the final answer and the steps used
22
+ to reach the answer (such as orchestrated and executed steps), compared to the given reference steps.
23
+
24
+ ## License
25
+
26
+ Apache-2.0 License. See [LICENSE](LICENSE) file for details.
27
+
28
+ ## Installation
29
+
30
+ To evaluate only steps:
31
+ ```bash
32
+ pip install graphrag-eval
33
+ ```
34
+ or add the following dependency in your `pyproject.toml` file:
35
+ ```toml
36
+ graphrag-eval = "*"
37
+ ```
38
+
39
+ To evaluate answer relevance and answer correctness:
40
+
41
+ ```bash
42
+ pip install 'graphrag-eval[openai]'
43
+ ```
44
+
45
+ or add the following dependency in your `pyproject.toml` file:
46
+ ```toml
47
+ graphrag-eval = {version = "*", extras = ["openai"]}
48
+ ```
49
+
50
+ ## Maintainers
51
+
52
+ Developed and maintained by [Graphwise](https://graphwise.ai/).
53
+ For issues or feature requests, please open [a GitHub issue](https://github.com/Ontotext-AD/qa-eval/issues).
54
+
55
+ ## Command Line Use
56
+
57
+ To evaluate only correctness of final answers (system responses), you can clone this repository and run the code on the command line:
58
+
59
+ 1. Prepare an input TSV file with columns `Question`, `Reference answer` and `Actual answer`
60
+ 1. Execute `poetry install --with openai`
61
+ 1. Execute `OPENAI_API_KEY=<your_api_key> poetry run answer-correctness -i <input_file.tsv> -o <output_file.tsv>`
62
+
63
+ We plan to improve CLI support in future releases.
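
A minimal sketch of step 1 above, assuming the CLI expects a tab-separated file whose header row carries the three column names. The helper `to_tsv` and the sample row are illustrative and not part of the package:

```python
import csv
import io

# Column names as listed in step 1; everything else here is illustrative.
COLUMNS = ["Question", "Reference answer", "Actual answer"]


def to_tsv(rows: list[dict[str, str]]) -> str:
    """Serialize rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {
        "Question": "List all transformers within Substation OSLO",
        "Reference answer": "OSLO T1, OSLO T2",
        "Actual answer": "The transformers in OSLO are OSLO T2 and OSLO T1.",
    },
]
tsv_text = to_tsv(rows)
```

The resulting text can be saved, for example with `pathlib.Path("input_file.tsv").write_text(tsv_text, encoding="utf-8")`, and passed to the command in step 3 via `-i`.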
64
+
65
+ ## Use as a Library
66
+
67
+ To evaluate answers and/or steps:
68
+ 1. Install this package: section [Installation](#installation)
69
+ 1. Format the corpus of questions and reference answers and/or steps: section [Reference Q&A Corpus](#reference-qa-corpus)
70
+ 1. Format the answers and/or steps you want to evaluate: section [Evaluation Target Corpus](#evaluation-target-corpus)
71
+ 1. To evaluate answer relevance:
72
+ 1. Include `actual_answer` in the target data to evaluate
73
+ 1. Set environment variable `OPENAI_API_KEY` appropriately
74
+ 1. To evaluate answer correctness:
75
+ 1. Include `reference_answer` in the reference corpus and `actual_answer` in the target data to evaluate
76
+ 1. Set environment variable `OPENAI_API_KEY` appropriately
77
+ 1. To evaluate steps:
78
+ 1. Include `reference_steps` in the reference corpus and `actual_steps` in target data to evaluate
79
+ 1. Call the evaluation function with the reference corpus and target corpus: section [Example Usage Code](#example-usage-code)
80
+ 1. Call the aggregation function with the evaluation results
81
+
82
+ Answer evaluation (correctness and relevance) uses the LLM `openai/gpt-4o-mini`.
83
+
84
+ ### Reference Q&A Corpus
85
+
86
+ A reference corpus is a list of templates, each of which contains:
87
+
88
+ - `template_id`: Unique template identifier
89
+ - `questions`: A list of questions derived from this template, where each includes:
90
+ - `id`: Unique question identifier
91
+ - `question_text`: The natural language query passed to the LLM
92
+ - `reference_steps`: (optional) A list of expected steps grouped by expected order of execution, where all steps in a group can be executed in any order relative to each other, but after all steps in the previous group and before all steps in the next group.
93
+ - `reference_answer`: (optional) The expected answer to the question
94
+ The assumption is that the final answer to the question is derived from the outputs of the steps in the last group, which are executed last.
95
+
96
+ Each step includes:
97
+
98
+ - `name`: The type of step being performed (e.g., `sparql_query`)
99
+ - `args`: Arguments of the step (e.g., arguments to a tool used in the step, such as a SPARQL query)
100
+ - `output`: The expected output from the step
101
+ - `output_media_type`: (optional; if present, one of `application/sparql-results+json`, `application/json`) Indicates how the output of a step must be processed
102
+ - `ordered`: (optional, defaults to `false`) For SPARQL query results, whether results order matters. `true` means that the actual result rows must be ordered as the reference result; `false` means that result rows are matched as a set.
103
+ - `required_columns`: (optional) Needed only for SPARQL query results: the list of binding names that are required for the results to match
104
+
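
To illustrate the `ordered` and `required_columns` fields, here is a minimal sketch of comparing two SPARQL JSON results. It is not the package's matching algorithm: the function names are hypothetical, and it assumes the actual results use the same binding names as the reference, which the package may not require.

```python
import json


def project_rows(output: str, required_columns: list[str]) -> list[tuple]:
    """Keep only the required columns of each result row, in row order."""
    bindings = json.loads(output)["results"]["bindings"]
    return [
        tuple(row.get(column, {}).get("value") for column in required_columns)
        for row in bindings
    ]


def results_match(
    reference: str,
    actual: str,
    required_columns: list[str],
    ordered: bool = False,
) -> bool:
    """Compare rows as a sequence if `ordered`, otherwise as a set."""
    reference_rows = project_rows(reference, required_columns)
    actual_rows = project_rows(actual, required_columns)
    if ordered:
        return reference_rows == actual_rows
    return set(reference_rows) == set(actual_rows)


def sparql_json(names: list[str]) -> str:
    """Build a tiny SPARQL JSON results document for this demo."""
    bindings = [
        {"transformerName": {"type": "literal", "value": name}} for name in names
    ]
    return json.dumps(
        {"head": {"vars": ["transformerName"]}, "results": {"bindings": bindings}}
    )


reference = sparql_json(["OSLO T2", "OSLO T1"])
actual = sparql_json(["OSLO T1", "OSLO T2"])

unordered_match = results_match(reference, actual, ["transformerName"])
ordered_match = results_match(reference, actual, ["transformerName"], ordered=True)
```

Here the same two rows arrive in a different order, so `unordered_match` is `True` and `ordered_match` is `False`.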
105
+ #### Example Reference Corpus
106
+
107
+ The example corpus below illustrates a minimal but realistic Q&A dataset, showing two templates with associated questions and steps.
108
+
109
+ ```yaml
110
+ - template_id: list_all_transformers_within_Substation_SUBSTATION
111
+ questions:
112
+ - id: c10bbc8dce98a4b8832d125134a16153
113
+ question_text: List all transformers within Substation OSLO
114
+ reference_answer: OSLO T1, OSLO T2
115
+ reference_steps:
116
+ - - name: sparql_query
117
+ args:
118
+ query: |2
119
+
120
+ PREFIX cimex: <https://rawgit2.com/statnett/Talk2PowerSystem/main/demo1/cimex/>
121
+ PREFIX cim: <https://cim.ucaiug.io/ns#>
122
+ PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
123
+ PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
124
+ select distinct ?transformer ?transformerName
125
+ where {
126
+ bind(<urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f> as ?substation)
127
+
128
+ ?transformer a cim:PowerTransformer ;
129
+ cim:Equipment.EquipmentContainer ?substation ;
130
+ cim:IdentifiedObject.name ?transformerName .
131
+ }
132
+ output: '{"head": {"vars": ["transformer", "transformerName"]}, "results":
133
+ {"bindings": [{"transformer": {"type": "uri", "value": "urn:uuid:f1769de8-9aeb-11e5-91da-b8763fd99c5f"},
134
+ "transformerName": {"type": "literal", "value": "OSLO T2"}}, {"transformer":
135
+ {"type": "uri", "value": "urn:uuid:f1769dd6-9aeb-11e5-91da-b8763fd99c5f"},
136
+ "transformerName": {"type": "literal", "value": "OSLO T1"}}]}}'
137
+ output_media_type: application/sparql-results+json
138
+ required_columns:
139
+ - transformer
140
+ - transformerName
141
+ - id: 8bbea9a10876a04ad77a82fd2aedee40
142
+ question_text: List all transformers within Substation STAVANGER
143
+ reference_answer: STAVANGET1
144
+ reference_steps:
145
+ - - name: sparql_query
146
+ args:
147
+ query: |2
148
+
149
+ PREFIX cimex: <https://rawgit2.com/statnett/Talk2PowerSystem/main/demo1/cimex/>
150
+ PREFIX cim: <https://cim.ucaiug.io/ns#>
151
+ PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
152
+ PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
153
+ select distinct ?transformer ?transformerName
154
+ where {
155
+ bind(<urn:uuid:f1769664-9aeb-11e5-91da-b8763fd99c5f> as ?substation)
156
+
157
+ ?transformer a cim:PowerTransformer ;
158
+ cim:Equipment.EquipmentContainer ?substation ;
159
+ cim:IdentifiedObject.name ?transformerName .
160
+ }
161
+ output: '{"head": {"vars": ["transformer", "transformerName"]}, "results":
162
+ {"bindings": [{"transformer": {"type": "uri", "value": "urn:uuid:f1769e0c-9aeb-11e5-91da-b8763fd99c5f"},
163
+ "transformerName": {"type": "literal", "value": "STAVANGET1"}}]}}'
164
+ output_media_type: application/sparql-results+json
165
+ required_columns:
166
+ - transformer
167
+ - transformerName
168
+ - template_id: list_all_substations_within_bidding_zone_REGION
169
+ questions:
170
+ - id: d566b1e9da418ac83e520a66cc7af4d7
171
+ question_text: List all substations within bidding zone NO2 SGR
172
+ reference_answer: ARENDAL, BLAFALLI, STAVANGER, KRISTIA_HVDC, KVILLDAL, SANDEFJORD, KRISTIANSAND, FEDA_HVDC
173
+ reference_steps:
174
+ - - name: sparql_query
175
+ args:
176
+ query: |2
177
+
178
+ PREFIX cimex: <https://rawgit2.com/statnett/Talk2PowerSystem/main/demo1/cimex/>
179
+ PREFIX cim: <https://cim.ucaiug.io/ns#>
180
+ PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
181
+ PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
182
+ select distinct ?substation ?substationName
183
+ where {
184
+ bind(<urn:uuid:f176965f-9aeb-11e5-91da-b8763fd99c5f> as ?region)
185
+
186
+ ?substation a cim:Substation ;
187
+ cim:Substation.Region ?region ;
188
+ cim:IdentifiedObject.name ?substationName .
189
+ }
190
+ output: '{"head": {"vars": ["substation", "substationName"]}, "results": {"bindings":
191
+ [{"substation": {"type": "uri", "value": "urn:uuid:f1769670-9aeb-11e5-91da-b8763fd99c5f"},
192
+ "substationName": {"type": "literal", "value": "ARENDAL"}}, {"substation":
193
+ {"type": "uri", "value": "urn:uuid:f176968e-9aeb-11e5-91da-b8763fd99c5f"},
194
+ "substationName": {"type": "literal", "value": "BLAFALLI"}}, {"substation":
195
+ {"type": "uri", "value": "urn:uuid:f1769664-9aeb-11e5-91da-b8763fd99c5f"},
196
+ "substationName": {"type": "literal", "value": "STAVANGER"}}, {"substation":
197
+ {"type": "uri", "value": "urn:uuid:f1769676-9aeb-11e5-91da-b8763fd99c5f"},
198
+ "substationName": {"type": "literal", "value": "KRISTIA_HVDC"}}, {"substation":
199
+ {"type": "uri", "value": "urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f"},
200
+ "substationName": {"type": "literal", "value": "KVILLDAL"}}, {"substation":
201
+ {"type": "uri", "value": "urn:uuid:f176966a-9aeb-11e5-91da-b8763fd99c5f"},
202
+ "substationName": {"type": "literal", "value": "SANDEFJORD"}}, {"substation":
203
+ {"type": "uri", "value": "urn:uuid:f176965a-9aeb-11e5-91da-b8763fd99c5f"},
204
+ "substationName": {"type": "literal", "value": "KRISTIANSAND"}}, {"substation":
205
+ {"type": "uri", "value": "urn:uuid:f176967c-9aeb-11e5-91da-b8763fd99c5f"},
206
+ "substationName": {"type": "literal", "value": "FEDA_HVDC"}}]}}'
207
+ output_media_type: application/sparql-results+json
208
+ required_columns:
209
+ - substation
210
+ - substationName
211
+ ordered: false
212
+ - id: 03d4283773b4387114342518176b128b
213
+ question_text: List all substations within bidding zone NO1 SGR
214
+ reference_answer: HALDEN, KONGSBERG, SYLLING, OSLO, ASKER, SYSLE, SKIEN, TRETTEN
215
+ reference_steps:
216
+ - - name: sparql_query
217
+ args:
218
+ query: |2
219
+
220
+ PREFIX cimex: <https://rawgit2.com/statnett/Talk2PowerSystem/main/demo1/cimex/>
221
+ PREFIX cim: <https://cim.ucaiug.io/ns#>
222
+ PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
223
+ PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
224
+ select distinct ?substation ?substationName
225
+ where {
226
+ bind(<urn:uuid:f1769609-9aeb-11e5-91da-b8763fd99c5f> as ?region)
227
+
228
+ ?substation a cim:Substation ;
229
+ cim:Substation.Region ?region ;
230
+ cim:IdentifiedObject.name ?substationName .
231
+ }
232
+ output: '{"head": {"vars": ["substation", "substationName"]}, "results": {"bindings":
233
+ [{"substation": {"type": "uri", "value": "urn:uuid:f176960e-9aeb-11e5-91da-b8763fd99c5f"},
234
+ "substationName": {"type": "literal", "value": "HALDEN"}}, {"substation":
235
+ {"type": "uri", "value": "urn:uuid:f176961e-9aeb-11e5-91da-b8763fd99c5f"},
236
+ "substationName": {"type": "literal", "value": "KONGSBERG"}}, {"substation":
237
+ {"type": "uri", "value": "urn:uuid:f1769642-9aeb-11e5-91da-b8763fd99c5f"},
238
+ "substationName": {"type": "literal", "value": "SYLLING"}}, {"substation":
239
+ {"type": "uri", "value": "urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f"},
240
+ "substationName": {"type": "literal", "value": "OSLO"}}, {"substation":
241
+ {"type": "uri", "value": "urn:uuid:f176964e-9aeb-11e5-91da-b8763fd99c5f"},
242
+ "substationName": {"type": "literal", "value": "ASKER"}}, {"substation":
243
+ {"type": "uri", "value": "urn:uuid:f1769648-9aeb-11e5-91da-b8763fd99c5f"},
244
+ "substationName": {"type": "literal", "value": "SYSLE"}}, {"substation":
245
+ {"type": "uri", "value": "urn:uuid:f1769654-9aeb-11e5-91da-b8763fd99c5f"},
246
+ "substationName": {"type": "literal", "value": "SKIEN"}}, {"substation":
247
+ {"type": "uri", "value": "urn:uuid:f1769604-9aeb-11e5-91da-b8763fd99c5f"},
248
+ "substationName": {"type": "literal", "value": "TRETTEN"}}]}}'
249
+ output_media_type: application/sparql-results+json
250
+ required_columns:
251
+ - substation
252
+ - substationName
253
+ ordered: false
254
+ ```
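
The same structure can be built directly as Python data. The sketch below shows the nesting (templates, questions, groups of steps) and a small sanity check. The checker `check_reference_corpus` is hypothetical and not part of the package; it assumes that `name`, `args` and `output` are mandatory for each step because they are not marked optional above. The query and output are shortened placeholders.

```python
REQUIRED_STEP_KEYS = {"name", "args", "output"}


def check_reference_corpus(corpus: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: list[str] = []
    for template in corpus:
        template_id = template.get("template_id", "<missing template_id>")
        if "template_id" not in template:
            problems.append("template without template_id")
        for question in template.get("questions", []):
            where = f"{template_id}/{question.get('id', '<missing id>')}"
            for key in ("id", "question_text"):
                if key not in question:
                    problems.append(f"{where}: missing {key}")
            # reference_steps is a list of groups; each group is a list of steps.
            for group in question.get("reference_steps", []):
                if not isinstance(group, list):
                    problems.append(f"{where}: each step group must be a list")
                    continue
                for step in group:
                    missing = REQUIRED_STEP_KEYS - step.keys()
                    if missing:
                        problems.append(f"{where}: step missing {sorted(missing)}")
    return problems


reference_qas = [
    {
        "template_id": "list_all_transformers_within_Substation_SUBSTATION",
        "questions": [
            {
                "id": "c10bbc8dce98a4b8832d125134a16153",
                "question_text": "List all transformers within Substation OSLO",
                "reference_answer": "OSLO T1, OSLO T2",
                "reference_steps": [
                    [  # first (and only) group of steps
                        {
                            "name": "sparql_query",
                            "args": {"query": "SELECT ?t WHERE { ?t a cim:PowerTransformer }"},
                            "output": '{"head": {"vars": ["t"]}, "results": {"bindings": []}}',
                            "output_media_type": "application/sparql-results+json",
                        }
                    ]
                ],
            }
        ],
    }
]

problems = check_reference_corpus(reference_qas)
```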
255
+
256
+ The module is agnostic to the specific LLM agent implementation and model; it depends solely on the format of the response.
257
+
258
+ ### Evaluation Target Corpus
259
+
260
+ Below is an example response from the question-answering system for a single question. If an error occurred while answering, the expected format is different: see [Example Target Input on Error](#example-target-input-on-error) below.
261
+
262
+ ```json
263
+ {
264
+ "question_id": "f91fc938d606e5f6089912bebfaf114b",
265
+ "input_tokens": 298028,
266
+ "output_tokens": 725,
267
+ "total_tokens": 298753,
268
+ "elapsed_sec": 46.48961806297302,
269
+ "actual_steps": [
270
+ {
271
+ "name": "autocomplete_search",
272
+ "args": {
273
+ "query": "STAVANGER",
274
+ "result_class": "cim:Substation"
275
+ },
276
+ "id": "call_7amIsfEGelOnVZ1DWtgtc0hc",
277
+ "status": "success",
278
+ "output": "{\n \"head\": {\n \"vars\": [\n \"iri\",\n \"name\",\n \"rank\"\n ]\n },\n \"results\": {\n \"bindings\": [\n {\n \"iri\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769664-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"name\": {\n \"type\": \"literal\",\n \"value\": \"STAVANGER\"\n },\n \"rank\": {\n \"datatype\": \"http://www.w3.org/2001/XMLSchema#float\",\n \"type\": \"literal\",\n \"value\": \"0.00976\"\n }\n }\n ]\n }\n}"
279
+ },
280
+ {
281
+ "name": "sparql_query",
282
+ "args": {
283
+ "query": "SELECT DISTINCT ?substation2 ?substation2Name ?line ?lineName WHERE {\n # Substation STAVANGER\n BIND(urn:uuid:f1769664-9aeb-11e5-91da-b8763fd99c5f AS ?substation1) .\n\n # Find lines connected to STAVANGER\n ?line a cim:Line ;\n cimex:connectedThroughPart ?substation1, ?substation2 ;\n cim:IdentifiedObject.name ?lineName .\n\n # Ensure the connected substation is different from STAVANGER\n ?substation2 a cim:Substation ;\n cim:IdentifiedObject.name ?substation2Name .\n FILTER(?substation1 != ?substation2)\n}"
284
+ },
285
+ "id": "call_DbMkZ8kv3qkf49wNNdREkpRN",
286
+ "status": "error",
287
+ "error": "Error: ValueError('The following prefixes are undefined: urn')\n Please fix your mistakes."
288
+ },
289
+ {
290
+ "name": "sparql_query",
291
+ "args": {
292
+ "query": "SELECT DISTINCT ?substation2 ?substation2Name ?line ?lineName WHERE {\n # Substation STAVANGER\n BIND(<urn:uuid:f1769664-9aeb-11e5-91da-b8763fd99c5f> AS ?substation1) .\n\n # Find lines connected to STAVANGER\n ?line a cim:Line ;\n cimex:connectedThroughPart ?substation1, ?substation2 ;\n cim:IdentifiedObject.name ?lineName .\n\n # Ensure the connected substation is different from STAVANGER\n ?substation2 a cim:Substation ;\n cim:IdentifiedObject.name ?substation2Name .\n FILTER(?substation1 != ?substation2)\n}"
293
+ },
294
+ "id": "call_Qm1mzX7g5q9SVPrR2QzEMTp3",
295
+ "status": "success",
296
+ "output": "{\n \"head\": {\n \"vars\": [\n \"substation2\",\n \"substation2Name\",\n \"line\",\n \"lineName\"\n ]\n },\n \"results\": {\n \"bindings\": [\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176965a-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KRISTIANSAND\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:47332c85-17a4-7a41-abd5-2b92e4b800ed\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 300KRISTIAN-STAVANGE\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769614-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"DAGALI\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420DAGALI-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420DAGALI-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:c5f535a5-f053-a948-9cee-f6a7b07c7d4f\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420DAGALI-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"OSLO\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": 
\"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769648-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"SYSLE\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:a93b83d7-8a39-ef48-8c29-36de1ac0eaf5\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYSLE-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176962a-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"AURLAND\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n 
\"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769624-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"SIMA\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:293e49bc-c995-fc46-a69c-380876b317a1\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420AURLAND-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769642-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"SYLLING\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYLLING-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"OSLO\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYLLING-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769682-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"KVILLDAL\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 
420SYLLING-HAGAFOSS\"\n }\n },\n {\n \"substation2\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:f1769688-9aeb-11e5-91da-b8763fd99c5f\"\n },\n \"substation2Name\": {\n \"type\": \"literal\",\n \"value\": \"HAGAFOSS\"\n },\n \"line\": {\n \"type\": \"uri\",\n \"value\": \"urn:uuid:163037f6-cf93-054e-8a6d-3f58de83b6a3\"\n },\n \"lineName\": {\n \"type\": \"literal\",\n \"value\": \"LC 420SYLLING-HAGAFOSS\"\n }\n }\n ]\n }\n}"
297
+ }
298
+ ],
299
+ "actual_answer": "The substations connected to the substation \"STAVANGER\" via AC or DC lines are:\n\n1. **KRISTIANSAND** via line \"LC 300KRISTIAN-STAVANGE\"\n2. **DAGALI** via line \"LC 420DAGALI-HAGAFOSS\"\n3. **KVILLDAL** via lines:\n - \"LC 420DAGALI-HAGAFOSS\"\n - \"LC 420SYSLE-HAGAFOSS\"\n - \"LC 420AURLAND-HAGAFOSS\"\n - \"LC 420SYLLING-HAGAFOSS\"\n4. **HAGAFOSS** via lines:\n - \"LC 420DAGALI-HAGAFOSS\"\n - \"LC 420SYSLE-HAGAFOSS\"\n - \"LC 420AURLAND-HAGAFOSS\"\n - \"LC 420SYLLING-HAGAFOSS\"\n5. **OSLO** via lines:\n - \"LC 420SYSLE-HAGAFOSS\"\n - \"LC 420SYLLING-HAGAFOSS\"\n6. **SYSLE** via line \"LC 420SYSLE-HAGAFOSS\"\n7. **AURLAND** via line \"LC 420AURLAND-HAGAFOSS\"\n8. **SIMA** via line \"LC 420AURLAND-HAGAFOSS\"\n9. **SYLLING** via line \"LC 420SYLLING-HAGAFOSS\""
300
+ }
301
+ ```
302
+
303
+ #### Example Target Input on Error
304
+
305
+ If an error occurs while generating a response to a question, the expected target input for evaluation is:
306
+
307
+ ```json
308
+ {
309
+ "question_id": "a8daaf98b84b4f6b0e0052fb942bf6b6",
310
+ "error": "Error message",
311
+ "status": "error"
312
+ }
313
+ ```
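
One way to produce both shapes from the same code path is sketched below. The wrapper `build_target_entry` and the callable `ask` are hypothetical: `ask` stands in for your own question-answering call and is assumed to return a dict with keys such as `actual_answer` and `actual_steps`.

```python
from typing import Callable


def build_target_entry(question_id: str, ask: Callable[[], dict]) -> dict:
    """Run `ask` and shape its result, or its failure, as a target entry."""
    try:
        response = ask()
    except Exception as exc:  # any failure becomes an error entry
        return {"question_id": question_id, "error": str(exc), "status": "error"}
    return {"question_id": question_id, **response}


def failing_ask() -> dict:
    """Stand-in for a question-answering call that fails."""
    raise RuntimeError("Error message")


error_entry = build_target_entry("a8daaf98b84b4f6b0e0052fb942bf6b6", failing_ask)
```

With the failing call above, `error_entry` has the same three keys as the JSON example: `question_id`, `error` and `status`.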
314
+
315
+ ### Example Usage Code
316
+
317
+ ```python
318
+ from graphrag_eval import run_evaluation, compute_aggregates
319
+
320
+ reference_qas: list[dict] = [] # read your corpus
321
+ chat_responses: dict = {} # call your implementation to get the response
322
+ evaluation_results = run_evaluation(reference_qas, chat_responses)
323
+ aggregates = compute_aggregates(evaluation_results)
324
+ ```
325
+
326
+ `evaluation_results` is a list of statistics for each question, as in section [Example Evaluation Results](#example-evaluation-results). The format is explained in section [Output Keys](#output-keys).
327
+
328
+ If your chat responses contain actual answers, set your environment variable `OPENAI_API_KEY` before running the code above.
329
+
330
+ ### Example Evaluation Results
331
+
332
+ The output is a list of statistics for each question from the reference Q&A dataset. Here is an example of statistics for one question:
333
+
334
+ ```yaml
335
+ - template_id: list_all_transformers_within_Substation_SUBSTATION
336
+ question_id: c10bbc8dce98a4b8832d125134a16153
337
+ question_text: List all transformers within Substation OSLO
338
+ reference_answer: OSLO T1, OSLO T2
339
+ reference_steps:
340
+ - - name: sparql_query
341
+ args:
342
+ query: |2
343
+
344
+ PREFIX cimex: <https://rawgit2.com/statnett/Talk2PowerSystem/main/demo1/cimex/>
345
+ PREFIX cim: <https://cim.ucaiug.io/ns#>
346
+ PREFIX rank: <http://www.ontotext.com/owlim/RDFRank#>
347
+ PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
348
+ select distinct ?transformer ?transformerName
349
+ where {
350
+ bind(<urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f> as ?substation)
351
+
352
+ ?transformer a cim:PowerTransformer ;
353
+ cim:Equipment.EquipmentContainer ?substation ;
354
+ cim:IdentifiedObject.name ?transformerName .
355
+ }
356
+ output: '{"head": {"vars": ["transformer", "transformerName"]}, "results": {"bindings":
357
+ [{"transformer": {"type": "uri", "value": "urn:uuid:f1769de8-9aeb-11e5-91da-b8763fd99c5f"},
358
+ "transformerName": {"type": "literal", "value": "OSLO T2"}}, {"transformer":
359
+ {"type": "uri", "value": "urn:uuid:f1769dd6-9aeb-11e5-91da-b8763fd99c5f"},
360
+ "transformerName": {"type": "literal", "value": "OSLO T1"}}]}}'
361
+ output_media_type: application/sparql-results+json
362
+ required_columns:
363
+ - transformer
364
+ - transformerName
365
+ matches: call_3b3zHJnBXwYYSg04BiFGAAgO
366
+ status: success
367
+ actual_answer: |-
368
+ The following transformers are located within the Substation OSLO:
369
+ 1. **OSLO T2** (IRI: `urn:uuid:f1769de8-9aeb-11e5-91da-b8763fd99c5f`)
370
+ 2. **OSLO T1** (IRI: `urn:uuid:f1769dd6-9aeb-11e5-91da-b8763fd99c5f`)
371
+ answer_reference_claims_count: 2
372
+ answer_actual_claims_count: 2
373
+ answer_matching_claims_count: 2
374
+ answer_correctness_reason: The candidate answer contains exactly the transformers listed in the reference answer, asked in the question
375
+ answer_recall: 1.0
376
+ answer_precision: 1.0
377
+ answer_f1: 1.0
378
+ answer_relevance: 0.9
379
+ answer_relevance_cost: 0.0007
380
+ actual_steps:
381
+ - name: autocomplete_search
382
+ args:
383
+ query: OSLO
384
+ result_class: cim:Substation
385
+ id: call_3wIrBHIsInzAWzo8qwwYAkDD
386
+ status: success
387
+ output: |-
388
+ {
389
+ "head": {
390
+ "vars": [
391
+ "iri",
392
+ "name",
393
+ "rank"
394
+ ]
395
+ },
396
+ "results": {
397
+ "bindings": [
398
+ {
399
+ "iri": {
400
+ "type": "uri",
401
+ "value": "urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f"
402
+ },
403
+ "name": {
404
+ "type": "literal",
405
+ "value": "OSLO"
406
+ },
407
+ "rank": {
408
+ "datatype": "http://www.w3.org/2001/XMLSchema#float",
409
+ "type": "literal",
410
+ "value": "0.01185"
411
+ }
412
+ }
413
+ ]
414
+ }
415
+ }
416
+ - name: sparql_query
417
+ args:
418
+ query: |-
419
+ SELECT ?transformer ?transformerName WHERE {
420
+ ?transformer a cim:PowerTransformer ;
421
+ cim:Equipment.EquipmentContainer <urn:uuid:f176963c-9aeb-11e5-91da-b8763fd99c5f> ;
422
+ cim:IdentifiedObject.name ?transformerName .
423
+ }
424
+ id: call_3b3zHJnBXwYYSg04BiFGAAgO
425
+ status: success
426
+ output: |-
427
+ {
428
+ "head": {
429
+ "vars": [
430
+ "transformer",
431
+ "transformerName"
432
+ ]
433
+ },
434
+ "results": {
435
+ "bindings": [
436
+ {
437
+ "transformer": {
438
+ "type": "uri",
439
+ "value": "urn:uuid:f1769de8-9aeb-11e5-91da-b8763fd99c5f"
440
+ },
441
+ "transformerName": {
442
+ "type": "literal",
443
+ "value": "OSLO T2"
444
+ }
445
+ },
446
+ {
447
+ "transformer": {
448
+ "type": "uri",
449
+ "value": "urn:uuid:f1769dd6-9aeb-11e5-91da-b8763fd99c5f"
450
+ },
451
+ "transformerName": {
452
+ "type": "literal",
453
+ "value": "OSLO T1"
454
+ }
455
+ }
456
+ ]
457
+ }
458
+ }
459
+ steps_score: 1
460
+ input_tokens: 221339
461
+ output_tokens: 212
462
+ total_tokens: 221551
463
+ elapsed_sec: 6.601679801940918
464
+ ```
465
+
466
+ ### Output Keys
467
+
468
+ - `template_id`: the template id
469
+ - `question_id`: the question id
470
+ - `question_text`: the natural language query
471
+ - `reference_steps`: (optional) copy of the expected steps in the Q&A dataset, if specified there
472
+ - `reference_answer`: (optional) copy of the expected answer in the Q&A dataset, if specified there
473
+ - `status`: "success" or "error", indicating whether the evaluation succeeded
474
+ - `actual_answer`: (optional) copy of the response text in the evaluation target, if specified there
475
+ - `answer_reference_claims_count`: (optional) number of claims extracted from the reference answer, if a reference answer and actual answer are available
476
+ - `answer_actual_claims_count`: (optional) number of claims extracted from the answer being evaluated, if a reference answer and actual answer are available
477
+ - `answer_matching_claims_count`: (optional) number of matching claims between the reference answer and the actual answer, if a reference answer and actual answer are available
478
+ - `answer_recall`: (optional) `answer_matching_claims_count / answer_reference_claims_count`
479
+ - `answer_precision`: (optional) `answer_matching_claims_count / answer_actual_claims_count`
480
+ - `answer_correctness_reason`: (optional) LLM reasoning in extracting and matching claims from the reference answer and the actual answer
481
+ - `answer_eval_error`: (optional) error message if answer evaluation failed
482
+ - `answer_f1`: (optional) Harmonic mean of `answer_recall` and `answer_precision`
483
+ - `answer_relevance`: (optional) A value representing how relevant the actual answer is to the question, computed using [RAGAS answer relevance](https://docs.ragas.io/en/v0.3.3/concepts/metrics/available_metrics/answer_relevance/)
484
+ - `answer_relevance_error`: (optional) error message if answer relevance evaluation failed
485
+ - `answer_relevance_cost`: The cost of the LLM usage for computing `answer_relevance`, in US dollars
486
+ - `actual_steps`: (optional) copy of the steps in the evaluation target, if specified there
487
+ - `steps_score`: a real number between 0 and 1, computed by comparing the results of the last steps that were executed to the reference's last group of steps. If there is no match in the actual steps, the score is `0`. Otherwise, it is the number of matched steps in the last group divided by the total number of steps in the last group.
488
+ - `input_tokens`: input tokens usage
489
+ - `output_tokens`: output tokens usage
490
+ - `total_tokens`: total tokens usage
491
+ - `elapsed_sec`: elapsed seconds
492
+
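
The formulas for `answer_recall`, `answer_precision`, `answer_f1` and `steps_score` above can be sketched as follows. The function names are illustrative, and returning `0.0` when a denominator is zero is an assumption; the package may handle that case differently.

```python
def answer_scores(matching: int, reference: int, actual: int) -> dict[str, float]:
    """Recall, precision and F1 from the three claim counts."""
    recall = matching / reference if reference else 0.0
    precision = matching / actual if actual else 0.0
    total = recall + precision
    # F1 is the harmonic mean of recall and precision.
    f1 = 2 * recall * precision / total if total else 0.0
    return {
        "answer_recall": recall,
        "answer_precision": precision,
        "answer_f1": f1,
    }


def steps_score(matched_in_last_group: int, last_group_size: int) -> float:
    """Share of the reference's last group of steps that was matched."""
    if last_group_size == 0:
        return 0.0
    return matched_in_last_group / last_group_size


# Counts from the example evaluation result: 2 reference, 2 actual, 2 matching.
perfect = answer_scores(matching=2, reference=2, actual=2)

# One of two reference claims matched among four actual claims.
partial = answer_scores(matching=1, reference=2, actual=4)
```

For the second call, recall is 1/2, precision is 1/4, and F1 is 2 · (1/2) · (1/4) / (3/4) = 1/3.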
493
+ #### Aggregates Keys
494
+
495
+ The `aggregates` object provides aggregated evaluation metrics.
496
+ Aggregates are computed both per-template and overall, using micro and macro averaging strategies.
497
+ These aggregates support analysis of agent quality, token efficiency, and execution performance.
498
+ Aggregates are:
499
+ - `per_template`: a dictionary mapping a template identifier to the following statistics:
500
+ - `number_of_error_samples`: number of questions for this template that resulted in an error response
501
+ - `number_of_success_samples`: number of questions for this template that resulted in a successful response
502
+ - `input_tokens`: `sum`, `mean`, `median`, `min` and `max` statistics for `input_tokens` of all successful questions for this template
503
+ - `output_tokens`: `sum`, `mean`, `median`, `min` and `max` statistics for `output_tokens` of all successful questions for this template
504
+ - `total_tokens`: `sum`, `mean`, `median`, `min` and `max` statistics for `total_tokens` of all successful questions for this template
505
+ - `elapsed_sec`: `sum`, `mean`, `median`, `min` and `max` statistics for `elapsed_sec` of all successful questions for this template
506
+ - `answer_recall`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_recall` of all successful questions for this template
507
+ - `answer_precision`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_precision` of all successful questions for this template
508
+ - `answer_f1`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_f1` of all successful questions for this template
509
+ - `answer_relevance`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_relevance` of all successful questions for this template
510
+ - `steps_score`: `sum`, `mean`, `median`, `min` and `max` statistics for `steps_score` of all successful questions for this template
511
+ - `steps`: step execution counts across all successful questions for this template. Includes:
512
+ - `total`: for each step type, how many times it was executed
513
+ - `once_per_sample`: how many times each step was executed, counted only once per question
514
+ - `empty_results`: how many times the step was executed and returned empty results
515
+ - `errors`: how many times the step was executed and resulted in an error
516
+ - `micro`: statistics across questions, regardless of template. It includes:
517
+ - `number_of_error_samples`: total number of questions that resulted in an error response
518
+ - `number_of_success_samples`: total number of questions that resulted in a successful response
519
+ - `input_tokens`: `sum`, `mean`, `median`, `min` and `max` for `input_tokens` of all successful questions
520
+ - `output_tokens`: `sum`, `mean`, `median`, `min` and `max` for `output_tokens` of all successful questions
521
+ - `total_tokens`: `sum`, `mean`, `median`, `min` and `max` for `total_tokens` of all successful questions
522
+ - `elapsed_sec`: `sum`, `mean`, `median`, `min` and `max` for `elapsed_sec` of all successful questions
523
+ - `answer_recall`: `sum`, `mean`, `median`, `min` and `max` for `answer_recall` of all successful questions
524
+ - `answer_precision`: `sum`, `mean`, `median`, `min` and `max` for `answer_precision` of all successful questions
525
+ - `answer_f1`: `sum`, `mean`, `median`, `min` and `max` for `answer_f1` of all successful questions
526
+ - `answer_relevance`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_relevance` of all successful questions
527
+ - `answer_relevance_cost`: `sum`, `mean`, `median`, `min` and `max` statistics for `answer_relevance_cost` of all successful questions
528
+ - `steps_score`: `sum`, `mean`, `median`, `min` and `max` for `steps_score` of all successful questions
529
+ - `macro`: averages across templates, i.e., for each metric, the mean of the per-template means. It includes:
530
+ - `input_tokens`: `mean` for `input_tokens`
531
+ - `output_tokens`: `mean` for `output_tokens`
532
+ - `total_tokens`: `mean` for `total_tokens`
533
+ - `elapsed_sec`: `mean` for `elapsed_sec`
534
+ - `answer_recall`: `mean` for `answer_recall`
535
+ - `answer_precision`: `mean` for `answer_precision`
536
+ - `answer_f1`: `mean` for `answer_f1`
537
+ - `answer_relevance`: `mean` for `answer_relevance`
538
+ - `answer_relevance_cost`: `mean` for `answer_relevance_cost`
539
+ - `steps_score`: `mean` for `steps_score`
540
+
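The difference between micro and macro averaging can be sketched with the standard library. This is an illustration only, not the package's code: the function names `summarize` and `aggregate` and the template names are hypothetical, and the `steps_score` values are chosen to mirror the per-template sums and sample counts in the example aggregates (8 of 10, 0 of 10, 9 of 9, 0 of 10).

```python
from statistics import mean, median


def summarize(values):
    """Return the five statistics reported for each metric."""
    return {
        "sum": sum(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


def aggregate(values_by_template):
    """Hypothetical helper: per-template, micro and macro aggregates for one metric."""
    per_template = {t: summarize(v) for t, v in values_by_template.items()}
    # Micro: pool all questions together, regardless of template.
    pooled = [x for v in values_by_template.values() for x in v]
    micro = summarize(pooled)
    # Macro: mean of the per-template means, so each template weighs equally.
    macro = {"mean": mean(s["mean"] for s in per_template.values())}
    return {"per_template": per_template, "micro": micro, "macro": macro}


steps_scores = {
    "template_a": [1.0] * 8 + [0.0] * 2,
    "template_b": [0.0] * 10,
    "template_c": [1.0] * 9,
    "template_d": [0.0] * 10,
}
result = aggregate(steps_scores)
# micro mean is 17 / 39 (about 0.436); macro mean is (0.8 + 0 + 1 + 0) / 4 = 0.45
```

The two means differ because `template_c` has only 9 successful samples: micro averaging weights every question equally, while macro averaging weights every template equally.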
541
+ #### Example Aggregates
542
+
543
+ ```yaml
544
+ per_template:
545
+ list_all_transformers_within_Substation_SUBSTATION:
546
+ number_of_error_samples: 0
547
+ number_of_success_samples: 10
548
+ answer_recall:
549
+ sum: 1.0
550
+ mean: 1.0
551
+ median: 1.0
552
+ min: 1.0
553
+ max: 1.0
554
+ answer_precision:
555
+ sum: 1.0
556
+ mean: 1.0
557
+ median: 1.0
558
+ min: 1.0
559
+ max: 1.0
560
+ answer_f1:
561
+ sum: 1.0
562
+ mean: 1.0
563
+ median: 1.0
564
+ min: 1.0
565
+ max: 1.0
566
+ answer_relevance:
567
+ min: 0.9
568
+ max: 0.9
569
+ mean: 0.9
570
+ median: 0.9
571
+ sum: 0.9
572
+ answer_relevance_cost:
573
+ min: 0.0007
574
+ max: 0.0007
575
+ mean: 0.0007
576
+ median: 0.0007
577
+ sum: 0.0007
578
+ steps:
579
+ total:
580
+ autocomplete_search: 10
581
+ sparql_query: 8
582
+ once_per_sample:
583
+ autocomplete_search: 10
584
+ sparql_query: 8
585
+ empty_results:
586
+ autocomplete_search: 2
587
+ steps_score:
588
+ sum: 8
589
+ mean: 0.8
590
+ median: 1
591
+ min: 0
592
+ max: 1
593
+ input_tokens:
594
+ sum: 2064559
595
+ mean: 206455.9
596
+ median: 221263.5
597
+ min: 147171
598
+ max: 221339
599
+ output_tokens:
600
+ sum: 1555
601
+ mean: 155.5
602
+ median: 177
603
+ min: 46
604
+ max: 212
605
+ total_tokens:
606
+ sum: 2066114
607
+ mean: 206611.4
608
+ median: 221439.5
609
+ min: 147217
610
+ max: 221551
611
+ elapsed_sec:
612
+ sum: 259.2278094291687
613
+ mean: 25.92278094291687
614
+ median: 9.677194952964783
615
+ min: 5.529741525650024
616
+ max: 55.4010910987854
617
+ list_all_substations_within_bidding_zone_REGION:
618
+ number_of_error_samples: 0
619
+ number_of_success_samples: 10
620
+ answer_recall:
621
+ sum: 1.0
622
+ mean: 1.0
623
+ median: 1.0
624
+ min: 1.0
625
+ max: 1.0
626
+ answer_precision:
627
+ sum: 1.0
628
+ mean: 1.0
629
+ median: 1.0
630
+ min: 1.0
631
+ max: 1.0
632
+ answer_f1:
633
+ sum: 1.0
634
+ mean: 1.0
635
+ median: 1.0
636
+ min: 1.0
637
+ max: 1.0
638
+ answer_relevance:
639
+ min: 0.9
640
+ max: 0.9
641
+ mean: 0.9
642
+ median: 0.9
643
+ sum: 0.9
644
+ answer_relevance_cost:
645
+ min: 0.0007
646
+ max: 0.0007
647
+ mean: 0.0007
648
+ median: 0.0007
649
+ sum: 0.0007
650
+ steps:
651
+ total:
652
+ autocomplete_search: 10
653
+ once_per_sample:
654
+ autocomplete_search: 10
655
+ empty_results:
656
+ autocomplete_search: 10
657
+ steps_score:
658
+ sum: 0
659
+ mean: 0
660
+ median: 0
661
+ min: 0
662
+ max: 0
663
+ input_tokens:
664
+ sum: 1471880
665
+ mean: 147188
666
+ median: 147188
667
+ min: 147188
668
+ max: 147188
669
+ output_tokens:
670
+ sum: 571
671
+ mean: 57.1
672
+ median: 57
673
+ min: 56
674
+ max: 61
675
+ total_tokens:
676
+ sum: 1472451
677
+ mean: 147245.1
678
+ median: 147245
679
+ min: 147244
680
+ max: 147249
681
+ elapsed_sec:
682
+ sum: 185.5483124256134
683
+ mean: 18.55483124256134
684
+ median: 8.886059165000916
685
+ min: 2.8653159141540527
686
+ max: 47.51542258262634
687
+ list_all_substations_that_are_connected_via_an_ac_line_or_a_dc_line_to_substation_named_SUBSTATION:
688
+ number_of_error_samples: 1
689
+ number_of_success_samples: 9
690
+ answer_recall:
691
+ sum: 1.0
692
+ mean: 1.0
693
+ median: 1.0
694
+ min: 1.0
695
+ max: 1.0
696
+ answer_precision:
697
+ sum: 1.0
698
+ mean: 1.0
699
+ median: 1.0
700
+ min: 1.0
701
+ max: 1.0
702
+ answer_f1:
703
+ sum: 1.0
704
+ mean: 1.0
705
+ median: 1.0
706
+ min: 1.0
707
+ max: 1.0
708
+ answer_relevance:
709
+ min: 0.9
710
+ max: 0.9
711
+ mean: 0.9
712
+ median: 0.9
713
+ sum: 0.9
714
+ answer_relevance_cost:
715
+ min: 0.0007
716
+ max: 0.0007
717
+ mean: 0.0007
718
+ median: 0.0007
719
+ sum: 0.0007
720
+ steps:
721
+ total:
722
+ autocomplete_search: 9
723
+ sparql_query: 17
724
+ once_per_sample:
725
+ autocomplete_search: 9
726
+ sparql_query: 9
727
+ errors:
728
+ sparql_query: 8
729
+ steps_score:
730
+ sum: 9
731
+ mean: 1
732
+ median: 1
733
+ min: 1
734
+ max: 1
735
+ input_tokens:
736
+ sum: 2601595
737
+ mean: 289066.1111111111
738
+ median: 297059
739
+ min: 222528
740
+ max: 298028
741
+ output_tokens:
742
+ sum: 6066
743
+ mean: 674
744
+ median: 700
745
+ min: 363
746
+ max: 805
747
+ total_tokens:
748
+ sum: 2607661
749
+ mean: 289740.1111111111
750
+ median: 297759
751
+ min: 222891
752
+ max: 298787
753
+ elapsed_sec:
754
+ sum: 354.82168316841125
755
+ mean: 39.42463146315681
756
+ median: 41.88556528091431
757
+ min: 26.418761014938354
758
+ max: 52.42662525177002
759
+ list_all_ac_lines_that_traverse_bidding_zones_REGION1_and_REGION2:
760
+ number_of_error_samples: 0
761
+ number_of_success_samples: 10
762
+ answer_recall:
763
+ sum: 1.0
764
+ mean: 1.0
765
+ median: 1.0
766
+ min: 1.0
767
+ max: 1.0
768
+ answer_precision:
769
+ sum: 1.0
770
+ mean: 1.0
771
+ median: 1.0
772
+ min: 1.0
773
+ max: 1.0
774
+ answer_f1:
775
+ sum: 1.0
776
+ mean: 1.0
777
+ median: 1.0
778
+ min: 1.0
779
+ max: 1.0
780
+ answer_relevance:
781
+ min: 0.9
782
+ max: 0.9
783
+ mean: 0.9
784
+ median: 0.9
785
+ sum: 0.9
786
+ answer_relevance_cost:
787
+ min: 0.0007
788
+ max: 0.0007
789
+ mean: 0.0007
790
+ median: 0.0007
791
+ sum: 0.0007
792
+ steps:
793
+ total:
794
+ autocomplete_search: 20
795
+ once_per_sample:
796
+ autocomplete_search: 10
797
+ empty_results:
798
+ autocomplete_search: 20
799
+ steps_score:
800
+ sum: 0
801
+ mean: 0
802
+ median: 0
803
+ min: 0
804
+ max: 0
805
+ input_tokens:
806
+ sum: 1472540
807
+ mean: 147254
808
+ median: 147254
809
+ min: 147254
810
+ max: 147254
811
+ output_tokens:
812
+ sum: 1052
813
+ mean: 105.2
814
+ median: 105
815
+ min: 105
816
+ max: 107
817
+ total_tokens:
818
+ sum: 1473592
819
+ mean: 147359.2
820
+ median: 147359
821
+ min: 147359
822
+ max: 147361
823
+ elapsed_sec:
824
+ sum: 197.44370341300964
825
+ mean: 19.744370341300964
826
+ median: 18.030158162117004
827
+ min: 15.56333041191101
828
+ max: 26.422670125961304
829
+ micro:
830
+ number_of_error_samples: 1
831
+ number_of_success_samples: 39
832
+ answer_recall:
833
+ sum: 1.0
834
+ mean: 1.0
835
+ median: 1.0
836
+ min: 1.0
837
+ max: 1.0
838
+ answer_precision:
839
+ sum: 1.0
840
+ mean: 1.0
841
+ median: 1.0
842
+ min: 1.0
843
+ max: 1.0
844
+ answer_f1:
845
+ sum: 1.0
846
+ mean: 1.0
847
+ median: 1.0
848
+ min: 1.0
849
+ max: 1.0
850
+ answer_relevance:
851
+ min: 0.9
852
+ max: 0.9
853
+ mean: 0.9
854
+ median: 0.9
855
+ sum: 0.9
856
+ answer_relevance_cost:
857
+ min: 0.0007
858
+ max: 0.0007
859
+ mean: 0.0007
860
+ median: 0.0007
861
+ sum: 0.0007
862
+ steps_score:
863
+ sum: 17
864
+ mean: 0.4358974358974359
865
+ median: 0
866
+ min: 0
867
+ max: 1
868
+ input_tokens:
869
+ sum: 7610574
870
+ mean: 195142.92307692306
871
+ median: 147254
872
+ min: 147171
873
+ max: 298028
874
+ output_tokens:
875
+ sum: 9244
876
+ mean: 237.02564102564102
877
+ median: 105
878
+ min: 46
879
+ max: 805
880
+ total_tokens:
881
+ sum: 7619818
882
+ mean: 195379.94871794872
883
+ median: 147359
884
+ min: 147217
885
+ max: 298787
886
+ elapsed_sec:
887
+ sum: 997.041508436203
888
+ mean: 25.565166882979565
889
+ median: 18.32871961593628
890
+ min: 2.8653159141540527
891
+ max: 55.4010910987854
892
+ macro:
893
+ answer_recall:
894
+ mean: 1.0
895
+ answer_precision:
896
+ mean: 1.0
897
+ answer_f1:
898
+ mean: 1.0
899
+ answer_relevance:
900
+ mean: 0.9
901
+ answer_relevance_cost:
902
+ mean: 0.0007
903
+ steps_score:
904
+ mean: 0.45
905
+ input_tokens:
906
+ mean: 197491.0027777778
907
+ output_tokens:
908
+ mean: 247.95
909
+ total_tokens:
910
+ mean: 197738.9527777778
911
+ elapsed_sec:
912
+ mean: 25.911653497483996
913
+ ```
914
+
915
+ ### Retrieval Evaluation
916
+
917
+ The following metrics are based on the ids of retrieved documents.
918
+
919
+ #### Recall@k Metric
920
+
921
+ The fraction of all relevant items that appear among the top *k* retrieved results. It answers the question: "Of all items the user cares about, how many did we include in the first k spots?"
922
+ * **Formula**:
923
+ $`
924
+ \frac{\text{Number of relevant items in top k}}{\text{Number of relevant items}}
925
+ `$
926
+ * **Calculation**: Count the number of relevant items in the top `k` retrieved results; divide that by the *total* number of relevant items.
927
+ * **Example**: Suppose there are 4 relevant documents for a given query. Suppose our system retrieves 3 of them in the top 5 results (`k=5`). Recall@5 is `3 / 4 = 0.75`.
928
+
929
+ ```python
930
+ recall_at_k(
931
+ relevant_docs={1, 3, 5, 6},
932
+ retrieved_docs=[1, 4, 3, 5, 7],
933
+ k=5
934
+ ) # => 0.75
935
+ ```
936
+
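A minimal sketch of how Recall@k could be computed from document ids is shown below. It is an assumption-based illustration, not the library's `recall_at_k`: the name `recall_at_k_sketch` is hypothetical, and returning `0.0` when there are no relevant documents is a choice made here.

```python
def recall_at_k_sketch(relevant_docs: set, retrieved_docs: list, k: int) -> float:
    """Hypothetical illustration of Recall@k over document ids."""
    if not relevant_docs:
        return 0.0  # assumption: define recall as 0 when nothing is relevant
    # Use a set so a document retrieved twice in the top k is counted once.
    top_k = set(retrieved_docs[:k])
    hits = len(top_k & relevant_docs)
    return hits / len(relevant_docs)


recall = recall_at_k_sketch({1, 3, 5, 6}, [1, 4, 3, 5, 7], k=5)  # 3 / 4 = 0.75
```

Lowering `k` can only keep or reduce the score, because fewer retrieved results are considered while the denominator stays fixed.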
937
+ #### Average Precision (AP) Metric
938
+
939
+ Evaluates a ranked list of recommendations by looking at the precision at the position of each correctly retrieved item. It rewards systems for placing relevant items higher up in the list. It's more sophisticated than just looking at precision at a single cutoff because it considers the entire ranking.
940
+ * **Formula**:
941
+ $`
942
+ \frac{\sum_{k=1}^{n} (P(k) \times \text{rel}(k))}{\text{Number of relevant items}}
943
+ `$,\
944
+ where:
945
+ * `P(k)` is the precision at rank `k`
946
+ * `rel(k)` is 1 if the item at rank `k` is relevant and 0 otherwise.
947
+ * **Calculation**:
948
+ 1. For each retrieved item, if it is relevant, record the precision at that index (i.e., `number of hits / current rank`).
949
+ 2. Average all of these precision scores.
950
+ 3. Divide that average by the total number of relevant items.
951
+ * **Example**:
952
+ * Suppose:
953
+ * The relevant items are `1, 3, 5, 6`
954
+ * Our system retrieves `1, 4, 3, 5, 7`
955
+ * Calculation:
956
+ * Item at index 1 (item 1) is relevant. Precision@1 = 1/1
957
+ * Item at index 3 (item 3) is relevant. Precision@3 = 2/3
958
+ * Item at index 4 (item 5) is relevant. Precision@4 = 3/4
959
+ * AP = (1.0 + 2/3 + 3/4) / 3 = 0.8055...
960
+
961
+ ```python
962
+ average_precision(
963
+ relevant_docs={1, 3, 5, 6},
964
+ retrieved_docs=[1, 4, 3, 5, 7]
965
+ ) # ~=> 0.8056
966
+ ```
967
+
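The worked example can be reproduced with the short sketch below. It is not the library's `average_precision`: the name `average_precision_sketch` is hypothetical, and the normalization follows the worked example, which averages the precision values over the three relevant items that were retrieved. That normalization is an assumption about the intended behavior.

```python
def average_precision_sketch(relevant_docs: set, retrieved_docs: list) -> float:
    """Hypothetical illustration that reproduces the worked example."""
    hits = 0
    precisions = []
    for rank, doc in enumerate(retrieved_docs, start=1):
        if doc in relevant_docs:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    if not precisions:
        return 0.0  # assumption: no relevant item retrieved gives 0
    # Average over the relevant items that were retrieved (as in the example).
    return sum(precisions) / len(precisions)


ap = average_precision_sketch({1, 3, 5, 6}, [1, 4, 3, 5, 7])
# (1/1 + 2/3 + 3/4) / 3, approximately 0.8056
```

Moving a relevant item to an earlier rank raises its precision term, which is how the metric rewards rankings that place relevant documents first.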