EuroEval 15.12.0__py3-none-any.whl → 16.7.1__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- euroeval/__init__.py +32 -14
- euroeval/benchmark_config_factory.py +92 -180
- euroeval/benchmark_modules/base.py +49 -39
- euroeval/benchmark_modules/fresh.py +35 -21
- euroeval/benchmark_modules/hf.py +280 -244
- euroeval/benchmark_modules/litellm.py +752 -312
- euroeval/benchmark_modules/vllm.py +570 -268
- euroeval/benchmarker.py +651 -528
- euroeval/caching_utils.py +79 -0
- euroeval/callbacks.py +5 -7
- euroeval/cli.py +49 -38
- euroeval/constants.py +44 -25
- euroeval/data_loading.py +111 -55
- euroeval/data_models.py +490 -323
- euroeval/dataset_configs/__init__.py +26 -4
- euroeval/dataset_configs/bosnian.py +39 -0
- euroeval/dataset_configs/bulgarian.py +56 -0
- euroeval/dataset_configs/croatian.py +56 -0
- euroeval/dataset_configs/czech.py +75 -0
- euroeval/dataset_configs/danish.py +78 -50
- euroeval/dataset_configs/dutch.py +74 -44
- euroeval/dataset_configs/english.py +71 -36
- euroeval/dataset_configs/estonian.py +111 -0
- euroeval/dataset_configs/faroese.py +25 -18
- euroeval/dataset_configs/finnish.py +63 -26
- euroeval/dataset_configs/french.py +65 -32
- euroeval/dataset_configs/german.py +77 -36
- euroeval/dataset_configs/greek.py +64 -0
- euroeval/dataset_configs/icelandic.py +68 -57
- euroeval/dataset_configs/italian.py +68 -36
- euroeval/dataset_configs/latvian.py +87 -0
- euroeval/dataset_configs/lithuanian.py +64 -0
- euroeval/dataset_configs/norwegian.py +98 -72
- euroeval/dataset_configs/polish.py +96 -0
- euroeval/dataset_configs/portuguese.py +63 -40
- euroeval/dataset_configs/serbian.py +64 -0
- euroeval/dataset_configs/slovak.py +55 -0
- euroeval/dataset_configs/slovene.py +56 -0
- euroeval/dataset_configs/spanish.py +68 -34
- euroeval/dataset_configs/swedish.py +82 -41
- euroeval/dataset_configs/ukrainian.py +64 -0
- euroeval/enums.py +12 -6
- euroeval/exceptions.py +21 -1
- euroeval/finetuning.py +34 -26
- euroeval/generation.py +76 -41
- euroeval/generation_utils.py +169 -34
- euroeval/languages.py +1020 -188
- euroeval/logging_utils.py +268 -0
- euroeval/metrics/__init__.py +6 -0
- euroeval/metrics/base.py +85 -0
- euroeval/metrics/huggingface.py +216 -0
- euroeval/metrics/llm_as_a_judge.py +260 -0
- euroeval/metrics/pipeline.py +289 -0
- euroeval/metrics/speed.py +48 -0
- euroeval/model_cache.py +40 -21
- euroeval/model_config.py +4 -5
- euroeval/model_loading.py +3 -0
- euroeval/prompt_templates/__init__.py +2 -0
- euroeval/prompt_templates/classification.py +206 -0
- euroeval/prompt_templates/linguistic_acceptability.py +157 -22
- euroeval/prompt_templates/multiple_choice.py +159 -17
- euroeval/prompt_templates/named_entity_recognition.py +318 -21
- euroeval/prompt_templates/reading_comprehension.py +207 -16
- euroeval/prompt_templates/sentiment_classification.py +205 -22
- euroeval/prompt_templates/summarization.py +122 -22
- euroeval/prompt_templates/token_classification.py +279 -0
- euroeval/scores.py +20 -9
- euroeval/speed_benchmark.py +11 -12
- euroeval/task_group_utils/multiple_choice_classification.py +21 -12
- euroeval/task_group_utils/question_answering.py +101 -73
- euroeval/task_group_utils/sequence_classification.py +144 -61
- euroeval/task_group_utils/text_to_text.py +33 -12
- euroeval/task_group_utils/token_classification.py +86 -89
- euroeval/tasks.py +75 -16
- euroeval/tokenisation_utils.py +603 -0
- euroeval/types.py +17 -11
- euroeval/utils.py +332 -137
- euroeval-16.7.1.dist-info/METADATA +623 -0
- euroeval-16.7.1.dist-info/RECORD +84 -0
- {euroeval-15.12.0.dist-info → euroeval-16.7.1.dist-info}/entry_points.txt +0 -1
- euroeval/human_evaluation.py +0 -737
- euroeval/metrics.py +0 -452
- euroeval/tokenization_utils.py +0 -498
- euroeval-15.12.0.dist-info/METADATA +0 -285
- euroeval-15.12.0.dist-info/RECORD +0 -63
- {euroeval-15.12.0.dist-info → euroeval-16.7.1.dist-info}/WHEEL +0 -0
- {euroeval-15.12.0.dist-info → euroeval-16.7.1.dist-info}/licenses/LICENSE +0 -0
@@ -0,0 +1,623 @@
Metadata-Version: 2.4
Name: EuroEval
Version: 16.7.1
Summary: The robust European language model benchmark.
Project-URL: Repository, https://github.com/EuroEval/EuroEval
Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
Author-email: Dan Saattrup Smart <dan.smart@alexandra.dk>
Maintainer-email: Dan Saattrup Smart <dan.smart@alexandra.dk>
License: MIT License

Copyright (c) 2022-2025 Dan Saattrup Smart

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
License-File: LICENSE
Requires-Python: <4.0,>=3.11
Requires-Dist: accelerate>=1.9.0
Requires-Dist: bert-score>=0.3.13
Requires-Dist: click>=8.1.3
Requires-Dist: cloudpickle>=3.1.1
Requires-Dist: datasets>=3.5.0
Requires-Dist: demjson3>=3.0.6
Requires-Dist: evaluate>=0.4.1
Requires-Dist: huggingface-hub>=0.30.1
Requires-Dist: levenshtein>=0.24.0
Requires-Dist: litellm>=1.75.6
Requires-Dist: more-itertools>=10.5.0
Requires-Dist: numpy>=2.0.0
Requires-Dist: ollama>=0.5.1
Requires-Dist: pandas>=2.2.0
Requires-Dist: peft>=0.15.0
Requires-Dist: protobuf>=2.0.0
Requires-Dist: pydantic>=2.6.0
Requires-Dist: pyinfer>=0.0.3
Requires-Dist: python-dotenv>=1.0.1
Requires-Dist: rouge-score>=0.1.2
Requires-Dist: sacremoses>=0.1.1
Requires-Dist: scikit-learn==1.6.1
Requires-Dist: sentencepiece>=0.1.96
Requires-Dist: seqeval>=1.2.2
Requires-Dist: setuptools>=75.8.2
Requires-Dist: tenacity>=9.0.0
Requires-Dist: termcolor>=2.0.0
Requires-Dist: torch>=2.6.0
Requires-Dist: transformers[mistral-common]>=4.56.0
Provides-Extra: all
Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'all'
Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
Requires-Dist: timm>=1.0.19; extra == 'all'
Requires-Dist: vllm[flashinfer]>=0.11.0; (platform_system == 'Linux') and extra == 'all'
Provides-Extra: generative
Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
Requires-Dist: timm>=1.0.19; extra == 'generative'
Requires-Dist: vllm[flashinfer]>=0.11.0; (platform_system == 'Linux') and extra == 'generative'
Description-Content-Type: text/markdown

<!-- This disables the requirement that the first line is a top-level heading -->
<!-- markdownlint-configure-file { "MD041": false } -->

<div align='center'>
<img
  src="https://raw.githubusercontent.com/EuroEval/EuroEval/main/gfx/euroeval.png"
  height="500"
  width="372"
>
</div>

### The robust European language model benchmark

(formerly known as ScandEval)

______________________________________________________________________
[](https://euroeval.com)
[](https://pypi.org/project/euroeval/)
[](https://arxiv.org/abs/2304.00906)
[](https://arxiv.org/abs/2406.13469)
[](https://github.com/EuroEval/EuroEval/blob/main/LICENSE)
[](https://github.com/EuroEval/EuroEval/commits/main)
[](https://github.com/EuroEval/EuroEval/tree/main/tests)
[](https://github.com/EuroEval/EuroEval/blob/main/CODE_OF_CONDUCT.md)

## Maintainer

- Dan Saattrup Smart ([@saattrupdan](https://github.com/saattrupdan), <dan.smart@alexandra.dk>)

## Installation

To install the package, simply run the following command in your favorite terminal:

```bash
pip install euroeval[all]
```

This will install the EuroEval package with all extras. You can also install the
minimal version by leaving out the `[all]`, in which case the package will let you know
when an evaluation requires a certain extra dependency, and how to install it.

## Quickstart

### Benchmarking from the command line

The easiest way to benchmark pretrained models is via the command line interface. After
installing the package, you can benchmark your favorite model like so:

```bash
euroeval --model <model-id>
```

Here `model` is the HuggingFace model ID, which can be found on the [HuggingFace
Hub](https://huggingface.co/models). By default this will benchmark the model on all
available tasks. If you want to benchmark on a particular task, use the
`--task` argument:

```bash
euroeval --model <model-id> --task sentiment-classification
```

We can also narrow down which languages we would like to benchmark on, by setting the
`--language` argument. Here we benchmark the model on the Danish sentiment
classification task:

```bash
euroeval --model <model-id> --task sentiment-classification --language da
```

Multiple models, datasets and/or languages can be specified by simply repeating the
corresponding arguments. Here is an example with two models:

```bash
euroeval --model <model-id1> --model <model-id2>
```

A specific model version/revision can also be used by appending '@' and the revision
to the model ID:

```bash
euroeval --model <model-id>@<commit>
```

This can be a branch name, a tag name or a commit ID. It defaults to 'main', which is
the latest version.

See all the arguments and options available for the `euroeval` command by typing

```bash
euroeval --help
```

### Benchmarking from a script

In a script, the syntax is similar to the command line interface. You simply initialise
an object of the `Benchmarker` class and call its `benchmark` method with your favorite
model:

```python
>>> from euroeval import Benchmarker
>>> benchmarker = Benchmarker()
>>> benchmarker.benchmark(model="<model-id>")
```

To benchmark on a specific task and/or language, you simply specify the `task` or
`language` arguments, shown here with the same example as above:

```python
>>> benchmarker.benchmark(
...     model="<model-id>",
...     task="sentiment-classification",
...     language="da",
... )
```

If you want to benchmark a subset of all the models on the Hugging Face Hub, you can
simply leave out the `model` argument. In this example, we're benchmarking all Danish
models on the Danish sentiment classification task:

```python
>>> benchmarker.benchmark(task="sentiment-classification", language="da")
```

### Benchmarking from Docker

A Dockerfile is provided in the repo, which can be downloaded and run without needing
to clone the repo and install from source. It can be fetched programmatically by
running the following:

```bash
wget https://raw.githubusercontent.com/EuroEval/EuroEval/main/Dockerfile.cuda
```

Next, to be able to build the Docker image, first ensure that the NVIDIA Container
Toolkit is
[installed](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installation)
and
[configured](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker).
Ensure that the CUDA version stated at the top of the Dockerfile matches the CUDA
version installed (which you can check using `nvidia-smi`). After that, we build the
image as follows:

```bash
docker build --pull -t euroeval -f Dockerfile.cuda .
```

With the Docker image built, we can now evaluate any model as follows:

```bash
docker run -e args="<euroeval-arguments>" --gpus 1 --name euroeval --rm euroeval
```

Here `<euroeval-arguments>` consists of the arguments you would normally pass to the
`euroeval` CLI. This could for instance be `--model <model-id> --task
sentiment-classification`.

## Benchmarking custom inference APIs

If the model you want to benchmark is hosted by a custom inference provider, such as a
[vLLM server](https://docs.vllm.ai/en/stable/), then this is also supported in EuroEval.

When benchmarking, you simply have to set the `--api-base` argument (`api_base` when
using the `Benchmarker` API) to the URL of the inference API, and optionally the
`--api-key` argument (`api_key`) to the API key, if authentication is required.

If you're benchmarking an Ollama model, we urge you to add the prefix
`ollama_chat/` to the model name, as this will fetch the model metadata and pull the
model from the Ollama model repository before evaluating it, e.g.:

```bash
euroeval --model ollama_chat/mymodel --api-base http://localhost:11434
```

For all other OpenAI-compatible inference APIs, you simply provide the model name as
is, e.g.:

```bash
euroeval --model my-model --api-base http://localhost:8000
```

Again, if the inference API requires authentication, you simply add the `--api-key`
argument:

```bash
euroeval --model my-model --api-base http://localhost:8000 --api-key my-secret-key
```

If your model is a reasoning model, then you need to specify this as follows:

```bash
euroeval --model my-reasoning-model --api-base http://localhost:8000 --generative-type reasoning
```

Likewise, if it is a pretrained decoder model (also known as a completion model), then
you specify this as follows:

```bash
euroeval --model my-base-decoder-model --api-base http://localhost:8000 --generative-type base
```

When using the `Benchmarker` API, the same applies. Here is an example of benchmarking
an Ollama model hosted locally:

```python
>>> benchmarker.benchmark(
...     model="ollama_chat/mymodel",
...     api_base="http://localhost:11434",
... )
```
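
The CLI flags above map to keyword arguments of the `benchmark` method in the same way
that `--api-base` maps to `api_base`. Here is a minimal sketch of benchmarking a
reasoning model behind an OpenAI-compatible server, assuming that `--generative-type`
likewise maps to a `generative_type` keyword argument; the model name, URL and key are
placeholders:

```python
>>> # Hypothetical sketch: mirrors the CLI examples above, assuming the
>>> # --generative-type flag maps to a `generative_type` keyword argument.
>>> benchmarker.benchmark(
...     model="my-reasoning-model",
...     api_base="http://localhost:8000",
...     api_key="my-secret-key",  # only needed if the API requires authentication
...     generative_type="reasoning",
... )
```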

## Benchmarking in an offline environment

If you need to benchmark in an offline environment, you need to download the models,
datasets and metrics beforehand. This can be done by adding the `--download-only`
argument from the command line, or the `download_only` argument if benchmarking from a
script. For example, to download the model you want and all of the Danish sentiment
classification datasets:

```bash
euroeval --model <model-id> --task sentiment-classification --language da --download-only
```

Or from a script:

```python
>>> benchmarker.benchmark(
...     model="<model-id>",
...     task="sentiment-classification",
...     language="da",
...     download_only=True,
... )
```

Please note: Offline benchmarking of adapter models is not currently supported, meaning
that an internet connection is still required when evaluating these. If offline
support of adapters is important to you, please consider [opening an
issue](https://github.com/EuroEval/EuroEval/issues).

## Benchmarking custom datasets

If you want to benchmark models on your own custom dataset, this is also possible.
First, you need to set up your dataset to be compatible with EuroEval. This means
splitting your dataset into training, validation and test splits, and ensuring that
the column names are correct. We use `text` as the column name for the input text, and
the output column name depends on the type of task:

- **Text or multiple-choice classification**: `label`
- **Token classification**: `labels`
- **Reading comprehension**: `answers`
- **Free-form text generation**: `target_text`

Text and multiple-choice classification tasks are by far the most common. Next, you
store your three dataset splits as three different CSV files with the desired two
columns. Finally, you create a script called `custom_datasets.py` in which you define
the associated `DatasetConfig` objects for your dataset. Here is an example of a
simple text classification dataset with two classes:

```python
from euroeval import DatasetConfig, TEXT_CLASSIFICATION
from euroeval.languages import ENGLISH

MY_CONFIG = DatasetConfig(
    name="my-dataset",
    pretty_name="My Dataset",
    source=dict(train="train.csv", val="val.csv", test="test.csv"),
    task=TEXT_CLASSIFICATION,
    languages=[ENGLISH],
    _labels=["positive", "negative"],
)
```

You can then benchmark your custom dataset by simply running

```bash
euroeval --dataset my-dataset --model <model-id>
```

You can also run the benchmark from a Python script, by simply passing your custom
dataset configuration directly to the `benchmark` method:

```python
from euroeval import Benchmarker

benchmarker = Benchmarker()
benchmarker.benchmark(model="<model-id>", dataset=MY_CONFIG)
```

We have included three convenience tasks to make it easier to set up custom datasets
(see the sketch after this list):

- `TEXT_CLASSIFICATION`, which is used for text classification tasks. This requires you
  to set the `_labels` argument in the `DatasetConfig`, and requires the columns `text`
  and `label` to be present in the dataset.
- `MULTIPLE_CHOICE`, which is used for multiple-choice classification tasks. This
  also requires you to set the `_labels` argument in the `DatasetConfig`. Note that for
  multiple choice tasks, you need to set up your `text` column to also list all the
  choices, and all the samples should have the same number of choices. This requires
  the columns `text` and `label` to be present in the dataset.
- `TOKEN_CLASSIFICATION`, which is used when classifying individual tokens in a text.
  This also requires you to set the `_labels` argument in the `DatasetConfig`. This
  requires the columns `tokens` and `labels` to be present in the dataset, where
  `tokens` is a list of tokens/words in the text, and `labels` is a list of the
  corresponding labels for each token (so the two lists have the same length).
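
As a concrete illustration of the token classification case, here is a minimal sketch
of a custom `DatasetConfig`, assuming that `TOKEN_CLASSIFICATION` can be imported from
`euroeval` in the same way as `TEXT_CLASSIFICATION` above; the dataset name, file names
and labels are placeholders:

```python
from euroeval import DatasetConfig, TOKEN_CLASSIFICATION  # assumed import, mirroring TEXT_CLASSIFICATION
from euroeval.languages import ENGLISH

# Hypothetical dataset: each row has a `tokens` column (list of tokens) and a
# `labels` column (list of labels of the same length).
MY_TOKEN_CONFIG = DatasetConfig(
    name="my-token-dataset",
    pretty_name="My Token Classification Dataset",
    source=dict(train="train.csv", val="val.csv", test="test.csv"),
    task=TOKEN_CLASSIFICATION,
    languages=[ENGLISH],
    _labels=["brand", "product", "other"],
)
```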

On top of these three convenience tasks, there are of course also the tasks used in the
official benchmark, which you can reuse with your own bespoke dataset:

- `LA`, for linguistic acceptability datasets.
- `NER`, for named entity recognition datasets with the standard BIO tagging scheme.
- `RC`, for reading comprehension datasets in the SQuAD format.
- `SENT`, for sentiment classification datasets.
- `SUMM`, for text summarisation datasets.
- `KNOW`, for multiple-choice knowledge datasets (e.g., MMLU).
- `MCRC`, for multiple-choice reading comprehension datasets (e.g., Belebele).
- `COMMON_SENSE`, for multiple-choice common-sense reasoning datasets (e.g., HellaSwag).

These can all be imported from the `euroeval.tasks` module.
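
For instance, reusing the official sentiment classification task with a bespoke dataset
could look like the following minimal sketch, assuming that the official tasks come
with their own default labels so that `_labels` can be omitted; the dataset name and
CSV files are placeholders:

```python
from euroeval import Benchmarker, DatasetConfig
from euroeval.languages import ENGLISH
from euroeval.tasks import SENT

# Hypothetical dataset reusing the official sentiment classification task.
MY_SENT_CONFIG = DatasetConfig(
    name="my-sentiment-dataset",
    pretty_name="My Sentiment Dataset",
    source=dict(train="train.csv", val="val.csv", test="test.csv"),
    task=SENT,
    languages=[ENGLISH],
)

Benchmarker().benchmark(model="<model-id>", dataset=MY_SENT_CONFIG)
```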

### Creating your own custom task

You are of course also free to define your own task from scratch, which allows you to
customise the prompts used when evaluating generative models, for instance. Here is an
example of a custom free-form text generation task, where the goal for the model is to
generate a SQL query based on a natural language input:

```python
from euroeval import DatasetConfig
from euroeval.data_models import Task, PromptConfig
from euroeval.enums import TaskGroup, ModelType
from euroeval.languages import ENGLISH
from euroeval.metrics import rouge_l_metric

sql_generation_task = Task(
    name="sql-generation",
    task_group=TaskGroup.TEXT_TO_TEXT,
    template_dict={
        ENGLISH: PromptConfig(
            default_prompt_prefix="The following are natural language texts and their "
            "corresponding SQL queries.",
            default_prompt_template="Natural language query: {text}\nSQL query: "
            "{target_text}",
            default_instruction_prompt="Generate the SQL query for the following "
            "natural language query:\n{text!r}",
            default_prompt_label_mapping=dict(),
        ),
    },
    metrics=[rouge_l_metric],
    default_num_few_shot_examples=3,
    default_max_generated_tokens=256,
    default_allowed_model_types=[ModelType.GENERATIVE],
)

MY_SQL_DATASET = DatasetConfig(
    name="my-sql-dataset",
    pretty_name="My SQL Dataset",
    source=dict(train="train.csv", val="val.csv", test="test.csv"),
    task=sql_generation_task,
    languages=[ENGLISH],
)
```

Again, with this you can benchmark your custom dataset by simply running

```bash
euroeval --dataset my-sql-dataset --model <model-id>
```

## Reproducing the evaluation datasets

All datasets used in this project are generated using the scripts located in the
[src/scripts](src/scripts) folder. To reproduce a dataset, run the corresponding script
with the following command:

```bash
uv run src/scripts/<name-of-script>.py
```

Replace `<name-of-script>` with the specific script you wish to execute, e.g.:

```bash
uv run src/scripts/create_allocine.py
```

## Contributors :pray:

A huge thank you to all the contributors who have helped make this project a success!

<a href="https://github.com/peter-sk">
|
|
459
|
+
<img
|
|
460
|
+
src="https://avatars.githubusercontent.com/u/6168908"
|
|
461
|
+
width=50
|
|
462
|
+
alt="Contributor avatar for peter-sk"
|
|
463
|
+
/>
|
|
464
|
+
</a>
|
|
465
|
+
<a href="https://github.com/AJDERS">
|
|
466
|
+
<img
|
|
467
|
+
src="https://avatars.githubusercontent.com/u/38854604"
|
|
468
|
+
width=50
|
|
469
|
+
alt="Contributor avatar for AJDERS"
|
|
470
|
+
/>
|
|
471
|
+
</a>
|
|
472
|
+
<a href="https://github.com/oliverkinch">
|
|
473
|
+
<img
|
|
474
|
+
src="https://avatars.githubusercontent.com/u/71556498"
|
|
475
|
+
width=50
|
|
476
|
+
alt="Contributor avatar for oliverkinch"
|
|
477
|
+
/>
|
|
478
|
+
</a>
|
|
479
|
+
<a href="https://github.com/versae">
|
|
480
|
+
<img
|
|
481
|
+
src="https://avatars.githubusercontent.com/u/173537"
|
|
482
|
+
width=50
|
|
483
|
+
alt="Contributor avatar for versae"
|
|
484
|
+
/>
|
|
485
|
+
</a>
|
|
486
|
+
<a href="https://github.com/KennethEnevoldsen">
|
|
487
|
+
<img
|
|
488
|
+
src="https://avatars.githubusercontent.com/u/23721977"
|
|
489
|
+
width=50
|
|
490
|
+
alt="Contributor avatar for KennethEnevoldsen"
|
|
491
|
+
/>
|
|
492
|
+
</a>
|
|
493
|
+
<a href="https://github.com/viggo-gascou">
|
|
494
|
+
<img
|
|
495
|
+
src="https://avatars.githubusercontent.com/u/94069687"
|
|
496
|
+
width=50
|
|
497
|
+
alt="Contributor avatar for viggo-gascou"
|
|
498
|
+
/>
|
|
499
|
+
</a>
|
|
500
|
+
<a href="https://github.com/mathiasesn">
|
|
501
|
+
<img
|
|
502
|
+
src="https://avatars.githubusercontent.com/u/27091759"
|
|
503
|
+
width=50
|
|
504
|
+
alt="Contributor avatar for mathiasesn"
|
|
505
|
+
/>
|
|
506
|
+
</a>
|
|
507
|
+
<a href="https://github.com/Alkarex">
|
|
508
|
+
<img
|
|
509
|
+
src="https://avatars.githubusercontent.com/u/1008324"
|
|
510
|
+
width=50
|
|
511
|
+
alt="Contributor avatar for Alkarex"
|
|
512
|
+
/>
|
|
513
|
+
</a>
|
|
514
|
+
<a href="https://github.com/marksverdhei">
|
|
515
|
+
<img
|
|
516
|
+
src="https://avatars.githubusercontent.com/u/46672778"
|
|
517
|
+
width=50
|
|
518
|
+
alt="Contributor avatar for marksverdhei"
|
|
519
|
+
/>
|
|
520
|
+
</a>
|
|
521
|
+
<a href="https://github.com/Mikeriess">
|
|
522
|
+
<img
|
|
523
|
+
src="https://avatars.githubusercontent.com/u/19728563"
|
|
524
|
+
width=50
|
|
525
|
+
alt="Contributor avatar for Mikeriess"
|
|
526
|
+
/>
|
|
527
|
+
</a>
|
|
528
|
+
<a href="https://github.com/ThomasKluiters">
|
|
529
|
+
<img
|
|
530
|
+
src="https://avatars.githubusercontent.com/u/8137941"
|
|
531
|
+
width=50
|
|
532
|
+
alt="Contributor avatar for ThomasKluiters"
|
|
533
|
+
/>
|
|
534
|
+
</a>
|
|
535
|
+
<a href="https://github.com/BramVanroy">
|
|
536
|
+
<img
|
|
537
|
+
src="https://avatars.githubusercontent.com/u/2779410"
|
|
538
|
+
width=50
|
|
539
|
+
alt="Contributor avatar for BramVanroy"
|
|
540
|
+
/>
|
|
541
|
+
</a>
|
|
542
|
+
<a href="https://github.com/peregilk">
|
|
543
|
+
<img
|
|
544
|
+
src="https://avatars.githubusercontent.com/u/9079808"
|
|
545
|
+
width=50
|
|
546
|
+
alt="Contributor avatar for peregilk"
|
|
547
|
+
/>
|
|
548
|
+
</a>
|
|
549
|
+
<a href="https://github.com/Rijgersberg">
|
|
550
|
+
<img
|
|
551
|
+
src="https://avatars.githubusercontent.com/u/8604946"
|
|
552
|
+
width=50
|
|
553
|
+
alt="Contributor avatar for Rijgersberg"
|
|
554
|
+
/>
|
|
555
|
+
</a>
|
|
556
|
+
<a href="https://github.com/duarteocarmo">
|
|
557
|
+
<img
|
|
558
|
+
src="https://avatars.githubusercontent.com/u/26342344"
|
|
559
|
+
width=50
|
|
560
|
+
alt="Contributor avatar for duarteocarmo"
|
|
561
|
+
/>
|
|
562
|
+
</a>
|
|
563
|
+
<a href="https://github.com/slowwavesleep">
|
|
564
|
+
<img
|
|
565
|
+
src="https://avatars.githubusercontent.com/u/44175589"
|
|
566
|
+
width=50
|
|
567
|
+
alt="Contributor avatar for slowwavesleep"
|
|
568
|
+
/>
|
|
569
|
+
</a>
|
|
570
|
+
<a href="https://github.com/mrkowalski">
|
|
571
|
+
<img
|
|
572
|
+
src="https://avatars.githubusercontent.com/u/6357044"
|
|
573
|
+
width=50
|
|
574
|
+
alt="Contributor avatar for mrkowalski"
|
|
575
|
+
/>
|
|
576
|
+
</a>
|
|
577
|
+
|
|
578
|
+
### Contribute to EuroEval

We welcome contributions to EuroEval! Whether you're fixing bugs, adding features, or
contributing new datasets, your help makes this project better for everyone.

- **General contributions**: Check out our [contribution guidelines](CONTRIBUTING.md)
  for information on how to get started.
- **Adding datasets**: If you're interested in adding a new dataset to EuroEval, we have
  a [dedicated guide](NEW_DATASET_GUIDE.md) with step-by-step instructions.

### Special thanks

- Thanks to [Google](https://google.com/) for sponsoring Gemini credits as part of their
  [Google Cloud for Researchers Program](https://cloud.google.com/edu/researchers).
- Thanks to [@Mikeriess](https://github.com/Mikeriess) for evaluating many of the larger
  models on the leaderboards.
- Thanks to [OpenAI](https://openai.com/) for sponsoring OpenAI credits as part of their
  [Researcher Access Program](https://openai.com/form/researcher-access-program/).
- Thanks to [UWV](https://www.uwv.nl/) and [KU
  Leuven](https://www.arts.kuleuven.be/ling/ccl) for sponsoring the Azure OpenAI
  credits used to evaluate GPT-4-turbo in Dutch.
- Thanks to [Miðeind](https://mideind.is/en) for sponsoring the OpenAI
  credits used to evaluate GPT-4-turbo in Icelandic and Faroese.
- Thanks to [CHC](https://chc.au.dk/) for sponsoring the OpenAI credits used to
  evaluate GPT-4-turbo in German.

## Citing EuroEval

If you want to cite the framework, feel free to use the following:

```bibtex
@article{smart2024encoder,
  title={Encoder vs Decoder: Comparative Analysis of Encoder and Decoder Language Models on Multilingual NLU Tasks},
  author={Smart, Dan Saattrup and Enevoldsen, Kenneth and Schneider-Kamp, Peter},
  journal={arXiv preprint arXiv:2406.13469},
  year={2024}
}
@inproceedings{smart2023scandeval,
  author = {Smart, Dan Saattrup},
  booktitle = {Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)},
  month = may,
  pages = {185--201},
  title = {{ScandEval: A Benchmark for Scandinavian Natural Language Processing}},
  year = {2023}
}
```