ScandEval 16.12.0__py3-none-any.whl → 16.13.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (61)
  1. scandeval/async_utils.py +46 -0
  2. scandeval/benchmark_config_factory.py +26 -2
  3. scandeval/benchmark_modules/fresh.py +2 -1
  4. scandeval/benchmark_modules/hf.py +50 -12
  5. scandeval/benchmark_modules/litellm.py +25 -15
  6. scandeval/benchmark_modules/vllm.py +3 -3
  7. scandeval/benchmarker.py +15 -33
  8. scandeval/cli.py +2 -4
  9. scandeval/constants.py +5 -0
  10. scandeval/custom_dataset_configs.py +152 -0
  11. scandeval/data_loading.py +87 -31
  12. scandeval/data_models.py +396 -225
  13. scandeval/dataset_configs/__init__.py +51 -25
  14. scandeval/dataset_configs/albanian.py +1 -1
  15. scandeval/dataset_configs/belarusian.py +47 -0
  16. scandeval/dataset_configs/bulgarian.py +1 -1
  17. scandeval/dataset_configs/catalan.py +1 -1
  18. scandeval/dataset_configs/croatian.py +1 -1
  19. scandeval/dataset_configs/danish.py +3 -2
  20. scandeval/dataset_configs/dutch.py +7 -6
  21. scandeval/dataset_configs/english.py +4 -3
  22. scandeval/dataset_configs/estonian.py +8 -7
  23. scandeval/dataset_configs/faroese.py +1 -1
  24. scandeval/dataset_configs/finnish.py +5 -4
  25. scandeval/dataset_configs/french.py +6 -5
  26. scandeval/dataset_configs/german.py +4 -3
  27. scandeval/dataset_configs/greek.py +1 -1
  28. scandeval/dataset_configs/hungarian.py +1 -1
  29. scandeval/dataset_configs/icelandic.py +4 -3
  30. scandeval/dataset_configs/italian.py +4 -3
  31. scandeval/dataset_configs/latvian.py +2 -2
  32. scandeval/dataset_configs/lithuanian.py +1 -1
  33. scandeval/dataset_configs/norwegian.py +6 -5
  34. scandeval/dataset_configs/polish.py +4 -3
  35. scandeval/dataset_configs/portuguese.py +5 -4
  36. scandeval/dataset_configs/romanian.py +2 -2
  37. scandeval/dataset_configs/serbian.py +1 -1
  38. scandeval/dataset_configs/slovene.py +1 -1
  39. scandeval/dataset_configs/spanish.py +4 -3
  40. scandeval/dataset_configs/swedish.py +4 -3
  41. scandeval/dataset_configs/ukrainian.py +1 -1
  42. scandeval/generation_utils.py +6 -6
  43. scandeval/metrics/llm_as_a_judge.py +1 -1
  44. scandeval/metrics/pipeline.py +1 -1
  45. scandeval/model_cache.py +34 -4
  46. scandeval/prompt_templates/linguistic_acceptability.py +9 -0
  47. scandeval/prompt_templates/multiple_choice.py +9 -0
  48. scandeval/prompt_templates/named_entity_recognition.py +21 -0
  49. scandeval/prompt_templates/reading_comprehension.py +10 -0
  50. scandeval/prompt_templates/sentiment_classification.py +11 -0
  51. scandeval/string_utils.py +157 -0
  52. scandeval/task_group_utils/sequence_classification.py +2 -5
  53. scandeval/task_group_utils/token_classification.py +2 -4
  54. scandeval/utils.py +6 -323
  55. scandeval-16.13.0.dist-info/METADATA +334 -0
  56. scandeval-16.13.0.dist-info/RECORD +94 -0
  57. scandeval-16.12.0.dist-info/METADATA +0 -667
  58. scandeval-16.12.0.dist-info/RECORD +0 -90
  59. {scandeval-16.12.0.dist-info → scandeval-16.13.0.dist-info}/WHEEL +0 -0
  60. {scandeval-16.12.0.dist-info → scandeval-16.13.0.dist-info}/entry_points.txt +0 -0
  61. {scandeval-16.12.0.dist-info → scandeval-16.13.0.dist-info}/licenses/LICENSE +0 -0
@@ -1,667 +0,0 @@
- Metadata-Version: 2.4
- Name: ScandEval
- Version: 16.12.0
- Summary: The robust European language model benchmark.
- Project-URL: Repository, https://github.com/EuroEval/EuroEval
- Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
- Author-email: Dan Saattrup Smart <dan.smart@alexandra.dk>
- Maintainer-email: Dan Saattrup Smart <dan.smart@alexandra.dk>
- License: MIT License
-
- Copyright (c) 2022-2026 Dan Saattrup Smart
-
- Permission is hereby granted, free of charge, to any person obtaining a copy
- of this software and associated documentation files (the "Software"), to deal
- in the Software without restriction, including without limitation the rights
- to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
- copies of the Software, and to permit persons to whom the Software is
- furnished to do so, subject to the following conditions:
-
- The above copyright notice and this permission notice shall be included in all
- copies or substantial portions of the Software.
-
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
- SOFTWARE.
- License-File: LICENSE
- Requires-Python: <4.0,>=3.12
- Requires-Dist: accelerate>=1.9.0
- Requires-Dist: bert-score>=0.3.13
- Requires-Dist: click>=8.1.3
- Requires-Dist: cloudpickle>=3.1.1
- Requires-Dist: datasets>=3.5.0
- Requires-Dist: demjson3>=3.0.6
- Requires-Dist: evaluate>=0.4.1
- Requires-Dist: huggingface-hub>=0.30.1
- Requires-Dist: levenshtein>=0.24.0
- Requires-Dist: litellm>=1.75.6
- Requires-Dist: mistral-common[soundfile]
- Requires-Dist: more-itertools>=10.5.0
- Requires-Dist: numpy>=2.0.0
- Requires-Dist: ollama>=0.5.1
- Requires-Dist: pandas>=2.2.0
- Requires-Dist: peft>=0.15.0
- Requires-Dist: protobuf>=2.0.0
- Requires-Dist: pydantic>=2.6.0
- Requires-Dist: pyinfer>=0.0.3
- Requires-Dist: python-dotenv>=1.0.1
- Requires-Dist: rouge-score>=0.1.2
- Requires-Dist: sacrebleu>=2.5.1
- Requires-Dist: sacremoses>=0.1.1
- Requires-Dist: scikit-learn==1.6.1
- Requires-Dist: sentencepiece>=0.1.96
- Requires-Dist: seqeval>=1.2.2
- Requires-Dist: setuptools>=75.8.2
- Requires-Dist: tenacity>=9.0.0
- Requires-Dist: termcolor>=2.0.0
- Requires-Dist: torch>=2.6.0
- Requires-Dist: transformers[mistral-common]<5.0.0,>=4.56.0
- Provides-Extra: all
- Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'all'
- Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
- Requires-Dist: ray>=2.53.0; (platform_system == 'Linux') and extra == 'all'
- Requires-Dist: timm>=1.0.19; extra == 'all'
- Requires-Dist: vllm-metal>=0.1.0; (platform_system == 'Darwin') and extra == 'all'
- Requires-Dist: vllm==0.11.0; (platform_system == 'Darwin') and extra == 'all'
- Requires-Dist: vllm[flashinfer]>=0.14.1; (platform_system == 'Linux') and extra == 'all'
- Provides-Extra: generative
- Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
- Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
- Requires-Dist: ray>=2.53.0; (platform_system == 'Linux') and extra == 'generative'
- Requires-Dist: timm>=1.0.19; extra == 'generative'
- Requires-Dist: vllm-metal>=0.1.0; (platform_system == 'Darwin') and extra == 'generative'
- Requires-Dist: vllm==0.11.0; (platform_system == 'Darwin') and extra == 'generative'
- Requires-Dist: vllm[flashinfer]>=0.14.1; (platform_system == 'Linux') and extra == 'generative'
- Description-Content-Type: text/markdown
-
- <!-- This disables the requirement that the first line is a top-level heading -->
- <!-- markdownlint-configure-file { "MD041": false } -->
-
- <div align='center'>
- <img
- src="https://raw.githubusercontent.com/EuroEval/EuroEval/main/gfx/euroeval.png"
- height="500"
- width="372"
- >
- </div>
-
- ### The robust European language model benchmark
-
- (formerly known as ScandEval)
-
- ______________________________________________________________________
- [![Documentation](https://img.shields.io/badge/docs-passing-green)](https://euroeval.com)
- [![PyPI Status](https://badge.fury.io/py/euroeval.svg)](https://pypi.org/project/euroeval/)
- [![First paper](https://img.shields.io/badge/arXiv-2304.00906-b31b1b.svg)](https://arxiv.org/abs/2304.00906)
- [![Second paper](https://img.shields.io/badge/arXiv-2406.13469-b31b1b.svg)](https://arxiv.org/abs/2406.13469)
- [![License](https://img.shields.io/github/license/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/blob/main/LICENSE)
- [![LastCommit](https://img.shields.io/github/last-commit/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/commits/main)
- [![Code Coverage](https://img.shields.io/badge/Coverage-74%25-yellow.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
- [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg)](https://github.com/EuroEval/EuroEval/blob/main/CODE_OF_CONDUCT.md)
-
- ## Maintainer
-
- - Dan Saattrup Smart ([@saattrupdan](https://github.com/saattrupdan), <dan.smart@alexandra.dk>)
-
- ## Installation
-
- To install the package, simply run the following command in your favorite terminal:
-
- ```bash
- pip install euroeval[all]
- ```
-
- This will install the EuroEval package with all extras. You can also install the
- minimal version by leaving out the `[all]`, in which case the package will let you know
- when an evaluation requires a certain extra dependency, and how to install it.
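-
- For instance, going by the `Provides-Extra` fields in the metadata above, a minimal
- install and one with only the generative-model dependencies would look like this:
-
- ```bash
- # Minimal install, without any extras:
- pip install euroeval
-
- # Only the extra dependencies needed for evaluating generative models:
- pip install euroeval[generative]
- ```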
-
- ## Quickstart
-
- ### Benchmarking from the command line
-
- The easiest way to benchmark pretrained models is via the command line interface. After
- installing the package, you can benchmark your favorite model like so:
-
- ```bash
- euroeval --model <model-id-or-path>
- ```
-
- Here `model` is either the Hugging Face model ID, which can be found on the [Hugging
- Face Hub](https://huggingface.co/models), or a local path to a model directory
- (containing the model files as well as the `config.json` file). By default, this will
- benchmark the model on all available tasks. If you want to benchmark on a particular
- task, then use the `--task` argument:
-
- ```bash
- euroeval --model <model-id-or-path> --task sentiment-classification
- ```
-
- We can also narrow down which languages to benchmark on by setting the `--language`
- argument. Here we benchmark the model on the Danish sentiment classification task:
-
- ```bash
- euroeval --model <model-id-or-path> --task sentiment-classification --language da
- ```
-
- Multiple models, datasets and/or languages can be specified by passing the
- corresponding argument multiple times. Here is an example with two models:
-
- ```bash
- euroeval --model <model-id-or-path-1> --model <model-id-or-path-2>
- ```
-
- A specific model version/revision can also be selected by appending '@' and the
- revision:
-
- ```bash
- euroeval --model <model-id-or-path>@<commit>
- ```
-
- This can be a branch name, a tag name, or a commit ID, and defaults to 'main', i.e.,
- the latest revision.
-
- See all the arguments and options available for the `euroeval` command by typing
-
- ```bash
- euroeval --help
- ```
-
- ### Benchmarking from a script
-
- In a script, the syntax is similar to the command line interface. You simply initialise
- a `Benchmarker` object and call its `benchmark` method with your favorite model:
-
- ```python
- >>> from euroeval import Benchmarker
- >>> benchmarker = Benchmarker()
- >>> benchmarker.benchmark(model="<model-id-or-path>")
- ```
-
- To benchmark on a specific task and/or language, you simply specify the `task` or
- `language` arguments, shown here with the same example as above:
-
- ```python
- >>> benchmarker.benchmark(
- ...     model="<model-id-or-path>",
- ...     task="sentiment-classification",
- ...     language="da",
- ... )
- ```
-
- If you want to benchmark a subset of all the models on the Hugging Face Hub, you can
- simply leave out the `model` argument. In this example, we're benchmarking all Danish
- models on the Danish sentiment classification task:
-
- ```python
- >>> benchmarker.benchmark(task="sentiment-classification", language="da")
- ```
-
- ### Benchmarking from Docker
-
- A Dockerfile is provided in the repo, which can be downloaded and run without needing
- to clone the repo and install from source. It can be fetched programmatically by
- running the following:
-
- ```bash
- wget https://raw.githubusercontent.com/EuroEval/EuroEval/main/Dockerfile.cuda
- ```
-
- Next, to be able to build the Docker image, first ensure that the NVIDIA Container
- Toolkit is
- [installed](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installation)
- and
- [configured](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker).
- Ensure that the CUDA version stated at the top of the Dockerfile matches the installed
- CUDA version (which you can check using `nvidia-smi`). After that, we build the
- image as follows:
-
- ```bash
- docker build --pull -t euroeval -f Dockerfile.cuda .
- ```
-
- With the Docker image built, we can now evaluate any model as follows:
-
- ```bash
- docker run -e args="<euroeval-arguments>" --gpus 1 --name euroeval --rm euroeval
- ```
-
- Here `<euroeval-arguments>` consists of the arguments you would otherwise pass to the
- `euroeval` CLI, for instance `--model <model-id-or-path> --task
- sentiment-classification`.
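-
- Putting the two together, a full invocation could look like this (the model ID is a
- placeholder):
-
- ```bash
- docker run -e args="--model <model-id-or-path> --task sentiment-classification" --gpus 1 --name euroeval --rm euroeval
- ```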
-
- ## Benchmarking custom inference APIs
-
- If the model you want to benchmark is hosted by a custom inference provider, such as a
- [vLLM server](https://docs.vllm.ai/en/stable/), then this is also supported in EuroEval.
-
- When benchmarking, you simply have to set the `--api-base` argument (`api_base` when
- using the `Benchmarker` API) to the URL of the inference API, and optionally the
- `--api-key` argument (`api_key`) to the API key, if authentication is required.
-
- If you're benchmarking an Ollama model, then you're urged to add the prefix
- `ollama_chat/` to the model name, as this both fetches the model metadata and pulls
- the model from the Ollama model repository before evaluating it, e.g.:
-
- ```bash
- euroeval --model ollama_chat/mymodel --api-base http://localhost:11434
- ```
-
- For all other OpenAI-compatible inference APIs, you simply provide the model name
- as-is, e.g.:
-
- ```bash
- euroeval --model my-model --api-base http://localhost:8000
- ```
-
- Again, if the inference API requires authentication, you simply add the `--api-key`
- argument:
-
- ```bash
- euroeval --model my-model --api-base http://localhost:8000 --api-key my-secret-key
- ```
-
- If your model is a reasoning model, then you need to specify this as follows:
-
- ```bash
- euroeval --model my-reasoning-model --api-base http://localhost:8000 --generative-type reasoning
- ```
-
- Likewise, if it is a pretrained decoder model (aka a completion model), then you specify
- this as follows:
-
- ```bash
- euroeval --model my-base-decoder-model --api-base http://localhost:8000 --generative-type base
- ```
-
- When using the `Benchmarker` API, the same applies. Here is an example of benchmarking
- an Ollama model hosted locally:
-
- ```python
- >>> benchmarker.benchmark(
- ...     model="ollama_chat/mymodel",
- ...     api_base="http://localhost:11434",
- ... )
- ```
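-
- The same presumably holds for the generative type; here is a minimal sketch for a
- reasoning model, assuming the keyword argument mirrors the `--generative-type` CLI
- flag as `generative_type`:
-
- ```python
- >>> benchmarker.benchmark(
- ...     model="my-reasoning-model",
- ...     api_base="http://localhost:8000",
- ...     generative_type="reasoning",  # assumed kwarg, mirroring --generative-type
- ... )
- ```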
-
- ## Benchmarking in an offline environment
-
- If you need to benchmark in an offline environment, you need to download the models,
- datasets and metrics beforehand. This can be done by adding the `--download-only`
- argument on the command line, or the `download_only` argument if benchmarking from a
- script. For example, to download the model you want and all of the Danish sentiment
- classification datasets:
-
- ```bash
- euroeval --model <model-id-or-path> --task sentiment-classification --language da --download-only
- ```
-
- Or from a script:
-
- ```python
- >>> benchmarker.benchmark(
- ...     model="<model-id-or-path>",
- ...     task="sentiment-classification",
- ...     language="da",
- ...     download_only=True,
- ... )
- ```
-
- Please note: Offline benchmarking of adapter models is not currently supported, meaning
- that an internet connection is still required when evaluating these. If offline
- support of adapters is important to you, please consider [opening an
- issue](https://github.com/EuroEval/EuroEval/issues).
-
- ## Benchmarking custom datasets
-
- If you want to benchmark models on your own custom dataset, this is also possible.
- First, you need to set up your dataset to be compatible with EuroEval. This means
- splitting your dataset into training, validation and test splits, and ensuring that
- the column names are correct. We use `text` as the column name for the input text, and
- the output column name depends on the type of task:
-
- - **Text or multiple-choice classification**: `label`
- - **Token classification**: `labels`
- - **Reading comprehension**: `answers`
- - **Free-form text generation**: `target_text`
-
- Text and multiple-choice classification tasks are by far the most common. Next, you
- store your three dataset splits as three separate CSV files with the desired two
- columns. Finally, you create a script called `custom_datasets.py` in which you
- define the associated `DatasetConfig` objects for your dataset. Here is an example of a
- simple text classification dataset with two classes:
-
- ```python
- from euroeval import DatasetConfig, TEXT_CLASSIFICATION
- from euroeval.languages import ENGLISH
-
- MY_CONFIG = DatasetConfig(
-     name="my-dataset",
-     pretty_name="My Dataset",
-     source=dict(train="train.csv", val="val.csv", test="test.csv"),
-     task=TEXT_CLASSIFICATION,
-     languages=[ENGLISH],
-     _labels=["positive", "negative"],
- )
- ```
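-
- For reference, matching split files could be produced like this (the example rows are
- illustrative):
-
- ```python
- import pandas as pd
-
- # Two example rows with the required `text` and `label` columns.
- df = pd.DataFrame(
-     dict(
-         text=["I loved this film!", "Utterly disappointing."],
-         label=["positive", "negative"],
-     )
- )
- df.to_csv("train.csv", index=False)
- ```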
-
- You can then benchmark your custom dataset by simply running
-
- ```bash
- euroeval --dataset my-dataset --model <model-id-or-path>
- ```
-
- You can also run the benchmark from a Python script, by passing your custom dataset
- configuration directly to the `benchmark` method:
-
- ```python
- from euroeval import Benchmarker
-
- benchmarker = Benchmarker()
- benchmarker.benchmark(model="<model-id-or-path>", dataset=MY_CONFIG)
- ```
-
- We have included three convenience tasks to make it easier to set up custom datasets:
-
- - `TEXT_CLASSIFICATION`, which is used for text classification tasks. This requires you
- to set the `_labels` argument in the `DatasetConfig`, and requires the columns `text`
- and `label` to be present in the dataset.
- - `MULTIPLE_CHOICE`, which is used for multiple-choice classification tasks. This
- also requires you to set the `_labels` argument in the `DatasetConfig`. Note that for
- multiple-choice tasks, you need to set up your `text` column to also list all the
- choices, and all the samples should have the same number of choices. This requires the
- columns `text` and `label` to be present in the dataset.
- - `TOKEN_CLASSIFICATION`, which is used when classifying individual tokens in a text.
- This also requires you to set the `_labels` argument in the `DatasetConfig`. This
- requires the columns `tokens` and `labels` to be present in the dataset, where
- `tokens` is a list of tokens/words in the text, and `labels` is a list of the
- corresponding labels for each token (so the two lists have the same length); see the
- sketch after this list.
-
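- Here is what such a token classification config could look like; a minimal sketch,
- assuming `TOKEN_CLASSIFICATION` can be imported from `euroeval` in the same way as
- `TEXT_CLASSIFICATION` above, with an illustrative BIO label set:
-
- ```python
- from euroeval import DatasetConfig, TOKEN_CLASSIFICATION  # import path assumed
- from euroeval.languages import ENGLISH
-
- MY_TOKEN_CONFIG = DatasetConfig(
-     name="my-token-dataset",
-     pretty_name="My Token Dataset",
-     source=dict(train="train.csv", val="val.csv", test="test.csv"),
-     task=TOKEN_CLASSIFICATION,
-     languages=[ENGLISH],
-     _labels=["O", "B-PER", "I-PER"],  # illustrative label set
- )
- ```
-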
- On top of these three convenience tasks, the tasks that we use in the official
- benchmark are also available, in case you want to use one of them with your own
- bespoke dataset:
-
- - `LA`, for linguistic acceptability datasets.
- - `NER`, for named entity recognition datasets with the standard BIO tagging scheme.
- - `RC`, for reading comprehension datasets in the SQuAD format.
- - `SENT`, for sentiment classification datasets.
- - `SUMM`, for text summarisation datasets.
- - `KNOW`, for multiple-choice knowledge datasets (e.g., MMLU).
- - `MCRC`, for multiple-choice reading comprehension datasets (e.g., Belebele).
- - `COMMON_SENSE`, for multiple-choice common-sense reasoning datasets (e.g., HellaSwag).
-
- These can all be imported from the `euroeval.tasks` module.
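-
- For instance, a sketch of a custom dataset using the official sentiment classification
- task could look like this (assuming the import from `euroeval.tasks` described above):
-
- ```python
- from euroeval import DatasetConfig
- from euroeval.languages import ENGLISH
- from euroeval.tasks import SENT
-
- MY_SENT_CONFIG = DatasetConfig(
-     name="my-sentiment-dataset",
-     pretty_name="My Sentiment Dataset",
-     source=dict(train="train.csv", val="val.csv", test="test.csv"),
-     task=SENT,
-     languages=[ENGLISH],
- )
- ```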
-
- ### Creating your own custom task
-
- You are of course also free to define your own task from scratch, which allows you to
- customise the prompts used when evaluating generative models, for instance. Here is an
- example of a custom free-form text generation task, where the goal for the model is to
- generate a SQL query based on a natural language input:
-
- ```python
- from euroeval import DatasetConfig
- from euroeval.data_models import Task, PromptConfig
- from euroeval.enums import TaskGroup, ModelType
- from euroeval.languages import ENGLISH
- from euroeval.metrics import rouge_l_metric
-
- sql_generation_task = Task(
-     name="sql-generation",
-     task_group=TaskGroup.TEXT_TO_TEXT,
-     template_dict={
-         ENGLISH: PromptConfig(
-             default_prompt_prefix="The following are natural language texts and their "
-             "corresponding SQL queries.",
-             default_prompt_template="Natural language query: {text}\nSQL query: "
-             "{target_text}",
-             default_instruction_prompt="Generate the SQL query for the following "
-             "natural language query:\n{text!r}",
-             default_prompt_label_mapping=dict(),
-         ),
-     },
-     metrics=[rouge_l_metric],
-     default_num_few_shot_examples=3,
-     default_max_generated_tokens=256,
-     default_allowed_model_types=[ModelType.GENERATIVE],
- )
-
- MY_SQL_DATASET = DatasetConfig(
-     name="my-sql-dataset",
-     pretty_name="My SQL Dataset",
-     source=dict(train="train.csv", val="val.csv", test="test.csv"),
-     task=sql_generation_task,
-     languages=[ENGLISH],
- )
- ```
-
- Again, with this you can benchmark your custom dataset by simply running
-
- ```bash
- euroeval --dataset my-sql-dataset --model <model-id-or-path>
- ```
-
- ## Reproducing the evaluation datasets
-
- All datasets used in this project are generated using the scripts located in the
- [src/scripts](src/scripts) folder. To reproduce a dataset, run the corresponding script
- with the following command:
-
- ```bash
- uv run src/scripts/<name-of-script>.py
- ```
-
- Replace `<name-of-script>` with the specific script you wish to execute, e.g.,
-
- ```bash
- uv run src/scripts/create_allocine.py
- ```
-
- ## Contributors :pray:
-
- A huge thank you to all the contributors who have helped make this project a success!
-
- <a href="https://github.com/peter-sk">
- <img
- src="https://avatars.githubusercontent.com/u/6168908"
- width=50
- alt="Contributor avatar for peter-sk"
- />
- </a>
- <a href="https://github.com/AJDERS">
- <img
- src="https://avatars.githubusercontent.com/u/38854604"
- width=50
- alt="Contributor avatar for AJDERS"
- />
- </a>
- <a href="https://github.com/oliverkinch">
- <img
- src="https://avatars.githubusercontent.com/u/71556498"
- width=50
- alt="Contributor avatar for oliverkinch"
- />
- </a>
- <a href="https://github.com/versae">
- <img
- src="https://avatars.githubusercontent.com/u/173537"
- width=50
- alt="Contributor avatar for versae"
- />
- </a>
- <a href="https://github.com/KennethEnevoldsen">
- <img
- src="https://avatars.githubusercontent.com/u/23721977"
- width=50
- alt="Contributor avatar for KennethEnevoldsen"
- />
- </a>
- <a href="https://github.com/viggo-gascou">
- <img
- src="https://avatars.githubusercontent.com/u/94069687"
- width=50
- alt="Contributor avatar for viggo-gascou"
- />
- </a>
- <a href="https://github.com/mathiasesn">
- <img
- src="https://avatars.githubusercontent.com/u/27091759"
- width=50
- alt="Contributor avatar for mathiasesn"
- />
- </a>
- <a href="https://github.com/Alkarex">
- <img
- src="https://avatars.githubusercontent.com/u/1008324"
- width=50
- alt="Contributor avatar for Alkarex"
- />
- </a>
- <a href="https://github.com/marksverdhei">
- <img
- src="https://avatars.githubusercontent.com/u/46672778"
- width=50
- alt="Contributor avatar for marksverdhei"
- />
- </a>
- <a href="https://github.com/Mikeriess">
- <img
- src="https://avatars.githubusercontent.com/u/19728563"
- width=50
- alt="Contributor avatar for Mikeriess"
- />
- </a>
- <a href="https://github.com/ThomasKluiters">
- <img
- src="https://avatars.githubusercontent.com/u/8137941"
- width=50
- alt="Contributor avatar for ThomasKluiters"
- />
- </a>
- <a href="https://github.com/BramVanroy">
- <img
- src="https://avatars.githubusercontent.com/u/2779410"
- width=50
- alt="Contributor avatar for BramVanroy"
- />
- </a>
- <a href="https://github.com/peregilk">
- <img
- src="https://avatars.githubusercontent.com/u/9079808"
- width=50
- alt="Contributor avatar for peregilk"
- />
- </a>
- <a href="https://github.com/Rijgersberg">
- <img
- src="https://avatars.githubusercontent.com/u/8604946"
- width=50
- alt="Contributor avatar for Rijgersberg"
- />
- </a>
- <a href="https://github.com/duarteocarmo">
- <img
- src="https://avatars.githubusercontent.com/u/26342344"
- width=50
- alt="Contributor avatar for duarteocarmo"
- />
- </a>
- <a href="https://github.com/slowwavesleep">
- <img
- src="https://avatars.githubusercontent.com/u/44175589"
- width=50
- alt="Contributor avatar for slowwavesleep"
- />
- </a>
- <a href="https://github.com/mrkowalski">
- <img
- src="https://avatars.githubusercontent.com/u/6357044"
- width=50
- alt="Contributor avatar for mrkowalski"
- />
- </a>
- <a href="https://github.com/simonevanbruggen">
- <img
- src="https://avatars.githubusercontent.com/u/24842609"
- width=50
- alt="Contributor avatar for simonevanbruggen"
- />
- </a>
- <a href="https://github.com/tvosch">
- <img
- src="https://avatars.githubusercontent.com/u/110661769"
- width=50
- alt="Contributor avatar for tvosch"
- />
- </a>
- <a href="https://github.com/Touzen">
- <img
- src="https://avatars.githubusercontent.com/u/1416265"
- width=50
- alt="Contributor avatar for Touzen"
- />
- </a>
- <a href="https://github.com/caldaibis">
- <img
- src="https://avatars.githubusercontent.com/u/16032437"
- width=50
- alt="Contributor avatar for caldaibis"
- />
- </a>
- <a href="https://github.com/SwekeR-463">
- <img
- src="https://avatars.githubusercontent.com/u/114919896?v=4"
- width=50
- alt="Contributor avatar for SwekeR-463"
- />
- </a>
-
- ### Contribute to EuroEval
-
- We welcome contributions to EuroEval! Whether you're fixing bugs, adding features, or
- contributing new datasets, your help makes this project better for everyone.
-
- - **General contributions**: Check out our [contribution guidelines](CONTRIBUTING.md)
- for information on how to get started.
- - **Adding datasets**: If you're interested in adding a new dataset to EuroEval, we have
- a [dedicated guide](NEW_DATASET_GUIDE.md) with step-by-step instructions.
-
- ### Special thanks
-
- - Thanks to [Google](https://google.com/) for sponsoring Gemini credits as part of their
- [Google Cloud for Researchers Program](https://cloud.google.com/edu/researchers).
- - Thanks to [@Mikeriess](https://github.com/Mikeriess) for evaluating many of the larger
- models on the leaderboards.
- - Thanks to [OpenAI](https://openai.com/) for sponsoring OpenAI credits as part of their
- [Researcher Access Program](https://openai.com/form/researcher-access-program/).
- - Thanks to [UWV](https://www.uwv.nl/) and [KU
- Leuven](https://www.arts.kuleuven.be/ling/ccl) for sponsoring the Azure OpenAI
- credits used to evaluate GPT-4-turbo in Dutch.
- - Thanks to [Miðeind](https://mideind.is/en) for sponsoring the OpenAI
- credits used to evaluate GPT-4-turbo in Icelandic and Faroese.
- - Thanks to [CHC](https://chc.au.dk/) for sponsoring the OpenAI credits used to
- evaluate GPT-4-turbo in German.
-
- ## Citing EuroEval
-
- If you want to cite the framework, feel free to use this:
-
- ```bibtex
- @article{smart2024encoder,
-   title={Encoder vs Decoder: Comparative Analysis of Encoder and Decoder Language Models on Multilingual NLU Tasks},
-   author={Smart, Dan Saattrup and Enevoldsen, Kenneth and Schneider-Kamp, Peter},
-   journal={arXiv preprint arXiv:2406.13469},
-   year={2024}
- }
- @inproceedings{smart2023scandeval,
-   author = {Smart, Dan Saattrup},
-   booktitle = {Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)},
-   month = may,
-   pages = {185--201},
-   title = {{ScandEval: A Benchmark for Scandinavian Natural Language Processing}},
-   year = {2023}
- }
- ```