EuroEval 16.3.0.tar.gz → 16.5.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (362)
  1. {euroeval-16.3.0 → euroeval-16.5.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +6 -0
  2. {euroeval-16.3.0 → euroeval-16.5.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +4 -3
  3. {euroeval-16.3.0 → euroeval-16.5.0}/.pre-commit-config.yaml +1 -1
  4. {euroeval-16.3.0 → euroeval-16.5.0}/CHANGELOG.md +122 -4
  5. {euroeval-16.3.0 → euroeval-16.5.0}/PKG-INFO +196 -39
  6. {euroeval-16.3.0 → euroeval-16.5.0}/README.md +193 -36
  7. euroeval-16.5.0/cool_test.csv +7 -0
  8. euroeval-16.5.0/cool_train.csv +13 -0
  9. euroeval-16.5.0/cool_val.csv +5 -0
  10. euroeval-16.5.0/custom_datasets.py +21 -0
  11. euroeval-16.5.0/docs/datasets/bulgarian.md +461 -0
  12. euroeval-16.5.0/docs/datasets/czech.md +600 -0
  13. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/danish.md +84 -83
  14. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/dutch.md +5 -4
  15. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/english.md +2 -1
  16. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/estonian.md +76 -0
  17. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/finnish.md +5 -4
  18. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/french.md +2 -1
  19. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/german.md +2 -1
  20. euroeval-16.5.0/docs/datasets/greek.md +510 -0
  21. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/italian.md +5 -4
  22. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/latvian.md +5 -4
  23. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/lithuanian.md +72 -7
  24. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/norwegian.md +5 -4
  25. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/polish.md +33 -31
  26. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/portuguese.md +5 -4
  27. euroeval-16.5.0/docs/datasets/serbian.md +519 -0
  28. euroeval-16.5.0/docs/datasets/slovak.md +448 -0
  29. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/spanish.md +5 -4
  30. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/swedish.md +83 -82
  31. euroeval-16.5.0/docs/datasets/ukrainian.md +522 -0
  32. euroeval-16.5.0/docs/leaderboards/Monolingual/czech.md +26 -0
  33. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/danish.md +3 -2
  34. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/dutch.md +3 -2
  35. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/english.md +3 -2
  36. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/estonian.md +3 -2
  37. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/faroese.md +3 -2
  38. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/finnish.md +3 -2
  39. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/french.md +3 -2
  40. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/german.md +3 -2
  41. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/icelandic.md +3 -2
  42. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/italian.md +3 -2
  43. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/latvian.md +3 -2
  44. euroeval-16.5.0/docs/leaderboards/Monolingual/lithuanian.md +26 -0
  45. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/norwegian.md +3 -2
  46. euroeval-16.5.0/docs/leaderboards/Monolingual/polish.md +26 -0
  47. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/portuguese.md +3 -2
  48. euroeval-16.5.0/docs/leaderboards/Monolingual/slovak.md +26 -0
  49. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/spanish.md +3 -2
  50. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/swedish.md +3 -2
  51. euroeval-16.5.0/docs/leaderboards/Multilingual/baltic.md +26 -0
  52. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Multilingual/european.md +3 -2
  53. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Multilingual/finnic.md +5 -4
  54. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Multilingual/germanic.md +3 -2
  55. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Multilingual/mainland-scandinavian.md +3 -2
  56. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Multilingual/romance.md +3 -2
  57. euroeval-16.5.0/docs/leaderboards/Multilingual/slavic.md +26 -0
  58. euroeval-16.5.0/gfx/different-poses/pose1.png +0 -0
  59. euroeval-16.5.0/gfx/different-poses/pose2.png +0 -0
  60. euroeval-16.5.0/gfx/different-poses/pose3.png +0 -0
  61. euroeval-16.5.0/gfx/different-poses/pose4.png +0 -0
  62. euroeval-16.5.0/gfx/different-poses/pose5.png +0 -0
  63. euroeval-16.5.0/gfx/different-poses/pose6.png +0 -0
  64. {euroeval-16.3.0 → euroeval-16.5.0}/pyproject.toml +10 -5
  65. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/__init__.py +9 -2
  66. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/benchmark_config_factory.py +51 -50
  67. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/benchmark_modules/base.py +9 -21
  68. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/benchmark_modules/fresh.py +2 -1
  69. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/benchmark_modules/hf.py +101 -71
  70. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/benchmark_modules/litellm.py +115 -53
  71. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/benchmark_modules/vllm.py +107 -92
  72. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/benchmarker.py +144 -121
  73. euroeval-16.5.0/src/euroeval/caching_utils.py +79 -0
  74. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/callbacks.py +5 -7
  75. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/cli.py +86 -8
  76. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/constants.py +9 -0
  77. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/data_loading.py +80 -29
  78. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/data_models.py +338 -330
  79. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/dataset_configs/__init__.py +12 -3
  80. euroeval-16.5.0/src/euroeval/dataset_configs/bulgarian.py +56 -0
  81. euroeval-16.5.0/src/euroeval/dataset_configs/czech.py +75 -0
  82. euroeval-16.5.0/src/euroeval/dataset_configs/danish.py +148 -0
  83. euroeval-16.5.0/src/euroeval/dataset_configs/dutch.py +142 -0
  84. euroeval-16.5.0/src/euroeval/dataset_configs/english.py +132 -0
  85. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/dataset_configs/estonian.py +42 -34
  86. euroeval-16.5.0/src/euroeval/dataset_configs/faroese.py +61 -0
  87. euroeval-16.5.0/src/euroeval/dataset_configs/finnish.py +107 -0
  88. euroeval-16.5.0/src/euroeval/dataset_configs/french.py +116 -0
  89. euroeval-16.5.0/src/euroeval/dataset_configs/german.py +132 -0
  90. euroeval-16.5.0/src/euroeval/dataset_configs/greek.py +64 -0
  91. euroeval-16.5.0/src/euroeval/dataset_configs/icelandic.py +159 -0
  92. euroeval-16.5.0/src/euroeval/dataset_configs/italian.py +123 -0
  93. euroeval-16.5.0/src/euroeval/dataset_configs/latvian.py +87 -0
  94. euroeval-16.5.0/src/euroeval/dataset_configs/lithuanian.py +64 -0
  95. euroeval-16.5.0/src/euroeval/dataset_configs/norwegian.py +212 -0
  96. euroeval-16.5.0/src/euroeval/dataset_configs/polish.py +96 -0
  97. euroeval-16.5.0/src/euroeval/dataset_configs/portuguese.py +97 -0
  98. euroeval-16.5.0/src/euroeval/dataset_configs/serbian.py +64 -0
  99. euroeval-16.5.0/src/euroeval/dataset_configs/slovak.py +55 -0
  100. euroeval-16.5.0/src/euroeval/dataset_configs/spanish.py +123 -0
  101. euroeval-16.5.0/src/euroeval/dataset_configs/swedish.py +141 -0
  102. euroeval-16.5.0/src/euroeval/dataset_configs/ukrainian.py +64 -0
  103. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/exceptions.py +1 -1
  104. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/finetuning.py +24 -17
  105. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/generation.py +15 -14
  106. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/generation_utils.py +8 -8
  107. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/languages.py +395 -323
  108. euroeval-16.5.0/src/euroeval/logging_utils.py +250 -0
  109. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/metrics/base.py +0 -3
  110. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/metrics/huggingface.py +21 -6
  111. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/metrics/llm_as_a_judge.py +6 -4
  112. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/metrics/pipeline.py +17 -9
  113. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/metrics/speed.py +0 -3
  114. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/model_cache.py +17 -19
  115. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/model_config.py +4 -5
  116. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/model_loading.py +3 -0
  117. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/prompt_templates/__init__.py +2 -0
  118. euroeval-16.5.0/src/euroeval/prompt_templates/classification.py +206 -0
  119. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/prompt_templates/linguistic_acceptability.py +99 -42
  120. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/prompt_templates/multiple_choice.py +102 -38
  121. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/prompt_templates/named_entity_recognition.py +172 -51
  122. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/prompt_templates/reading_comprehension.py +119 -42
  123. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/prompt_templates/sentiment_classification.py +110 -40
  124. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/prompt_templates/summarization.py +85 -40
  125. euroeval-16.5.0/src/euroeval/prompt_templates/token_classification.py +279 -0
  126. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/scores.py +11 -10
  127. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/speed_benchmark.py +5 -6
  128. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/task_group_utils/multiple_choice_classification.py +2 -4
  129. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/task_group_utils/question_answering.py +24 -16
  130. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/task_group_utils/sequence_classification.py +48 -35
  131. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/task_group_utils/text_to_text.py +19 -9
  132. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/task_group_utils/token_classification.py +21 -17
  133. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/tasks.py +44 -1
  134. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/tokenisation_utils.py +33 -22
  135. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/types.py +10 -9
  136. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/utils.py +35 -149
  137. euroeval-16.5.0/src/scripts/__init__.py +1 -0
  138. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/constants.py +6 -0
  139. euroeval-16.5.0/src/scripts/create_bg_ner_bsnlp.py +111 -0
  140. euroeval-16.5.0/src/scripts/create_cinexio.py +249 -0
  141. euroeval-16.5.0/src/scripts/create_cross_domain_uk_reviews.py +116 -0
  142. euroeval-16.5.0/src/scripts/create_cs_gec.py +83 -0
  143. euroeval-16.5.0/src/scripts/create_csfd_sentiment.py +97 -0
  144. euroeval-16.5.0/src/scripts/create_csfd_sentiment_sk.py +92 -0
  145. euroeval-16.5.0/src/scripts/create_czech_news.py +75 -0
  146. euroeval-16.5.0/src/scripts/create_elner.py +222 -0
  147. euroeval-16.5.0/src/scripts/create_exams_bg.py +193 -0
  148. euroeval-16.5.0/src/scripts/create_global_mmlu.py +188 -0
  149. euroeval-16.5.0/src/scripts/create_greek_sa.py +130 -0
  150. euroeval-16.5.0/src/scripts/create_greek_wikipedia.py +89 -0
  151. euroeval-16.5.0/src/scripts/create_hellaswag_cs.py +120 -0
  152. euroeval-16.5.0/src/scripts/create_lithuanian_lrytas_summarization.py +87 -0
  153. euroeval-16.5.0/src/scripts/create_lr_sum.py +127 -0
  154. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_lt_history.py +13 -6
  155. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_mmlu.py +15 -1
  156. euroeval-16.5.0/src/scripts/create_mmlu_et.py +162 -0
  157. euroeval-16.5.0/src/scripts/create_mms_sr.py +110 -0
  158. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_multi_wiki_qa.py +5 -0
  159. euroeval-16.5.0/src/scripts/create_ner_uk.py +115 -0
  160. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_norglm_multiqa.py +20 -0
  161. euroeval-16.5.0/src/scripts/create_poner.py +125 -0
  162. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_scala.py +13 -1
  163. euroeval-16.3.0/src/scripts/create_swedish_skolprov.py → euroeval-16.5.0/src/scripts/create_skolprov.py +25 -18
  164. euroeval-16.5.0/src/scripts/create_sqad.py +137 -0
  165. euroeval-16.5.0/src/scripts/create_umimeto_qa.py +114 -0
  166. euroeval-16.5.0/src/scripts/create_uner_sk.py +183 -0
  167. euroeval-16.5.0/src/scripts/create_uner_sr.py +113 -0
  168. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_winogrande.py +28 -3
  169. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/load_ud_pos.py +271 -73
  170. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/versioning.py +1 -1
  171. {euroeval-16.3.0 → euroeval-16.5.0}/tests/conftest.py +32 -23
  172. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_benchmark_config_factory.py +60 -59
  173. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_benchmark_modules/test_hf.py +11 -5
  174. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_benchmarker.py +76 -63
  175. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_cli.py +5 -2
  176. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_constants.py +4 -2
  177. euroeval-16.5.0/tests/test_data_loading.py +165 -0
  178. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_data_models.py +7 -2
  179. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_dataset_configs.py +36 -0
  180. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_model_config.py +1 -0
  181. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_model_loading.py +3 -0
  182. euroeval-16.5.0/tests/test_scripts/__init__.py +1 -0
  183. euroeval-16.5.0/tests/test_scripts/test_create_scala/__init__.py +1 -0
  184. euroeval-16.5.0/tests/test_scripts/test_create_scala/test_create_scala.py +86 -0
  185. euroeval-16.5.0/tests/test_scripts/test_create_scala/test_data/de_gsd-ud-train.conllu.adp_det +12 -0
  186. euroeval-16.5.0/tests/test_scripts/test_create_scala/test_data/empty.file +0 -0
  187. euroeval-16.5.0/tests/test_scripts/test_create_scala/test_data/en_gum-ud-train.conllu.case +70 -0
  188. euroeval-16.5.0/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_01 +11 -0
  189. euroeval-16.5.0/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_02 +14 -0
  190. euroeval-16.5.0/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_03 +16 -0
  191. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_speed_benchmark.py +3 -1
  192. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_tasks.py +0 -1
  193. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_tokenisation_utils.py +12 -3
  194. {euroeval-16.3.0 → euroeval-16.5.0}/uv.lock +162 -158
  195. euroeval-16.3.0/src/euroeval/dataset_configs/danish.py +0 -186
  196. euroeval-16.3.0/src/euroeval/dataset_configs/dutch.py +0 -181
  197. euroeval-16.3.0/src/euroeval/dataset_configs/english.py +0 -164
  198. euroeval-16.3.0/src/euroeval/dataset_configs/faroese.py +0 -102
  199. euroeval-16.3.0/src/euroeval/dataset_configs/finnish.py +0 -140
  200. euroeval-16.3.0/src/euroeval/dataset_configs/french.py +0 -152
  201. euroeval-16.3.0/src/euroeval/dataset_configs/german.py +0 -169
  202. euroeval-16.3.0/src/euroeval/dataset_configs/icelandic.py +0 -196
  203. euroeval-16.3.0/src/euroeval/dataset_configs/italian.py +0 -160
  204. euroeval-16.3.0/src/euroeval/dataset_configs/latvian.py +0 -94
  205. euroeval-16.3.0/src/euroeval/dataset_configs/lithuanian.py +0 -62
  206. euroeval-16.3.0/src/euroeval/dataset_configs/norwegian.py +0 -255
  207. euroeval-16.3.0/src/euroeval/dataset_configs/polish.py +0 -124
  208. euroeval-16.3.0/src/euroeval/dataset_configs/portuguese.py +0 -130
  209. euroeval-16.3.0/src/euroeval/dataset_configs/spanish.py +0 -158
  210. euroeval-16.3.0/src/euroeval/dataset_configs/swedish.py +0 -179
  211. euroeval-16.3.0/tests/test_data_loading.py +0 -141
  212. {euroeval-16.3.0 → euroeval-16.5.0}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
  213. {euroeval-16.3.0 → euroeval-16.5.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
  214. {euroeval-16.3.0 → euroeval-16.5.0}/.github/ISSUE_TEMPLATE/language_request.yaml +0 -0
  215. {euroeval-16.3.0 → euroeval-16.5.0}/.github/workflows/ci.yaml +0 -0
  216. {euroeval-16.3.0 → euroeval-16.5.0}/.gitignore +0 -0
  217. {euroeval-16.3.0 → euroeval-16.5.0}/.markdownlint.jsonc +0 -0
  218. {euroeval-16.3.0 → euroeval-16.5.0}/CITATION.cff +0 -0
  219. {euroeval-16.3.0 → euroeval-16.5.0}/CODE_OF_CONDUCT.md +0 -0
  220. {euroeval-16.3.0 → euroeval-16.5.0}/CONTRIBUTING.md +0 -0
  221. {euroeval-16.3.0 → euroeval-16.5.0}/Dockerfile.cuda +0 -0
  222. {euroeval-16.3.0 → euroeval-16.5.0}/LICENSE +0 -0
  223. {euroeval-16.3.0 → euroeval-16.5.0}/NEW_DATASET_GUIDE.md +0 -0
  224. {euroeval-16.3.0 → euroeval-16.5.0}/docs/CNAME +0 -0
  225. {euroeval-16.3.0 → euroeval-16.5.0}/docs/README.md +0 -0
  226. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/README.md +0 -0
  227. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/faroese.md +0 -0
  228. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/icelandic.md +0 -0
  229. {euroeval-16.3.0 → euroeval-16.5.0}/docs/extras/radial_plotter.md +0 -0
  230. {euroeval-16.3.0 → euroeval-16.5.0}/docs/faq.md +0 -0
  231. {euroeval-16.3.0 → euroeval-16.5.0}/docs/gfx/favicon.png +0 -0
  232. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/README.md +0 -0
  233. {euroeval-16.3.0 → euroeval-16.5.0}/docs/methodology.md +0 -0
  234. {euroeval-16.3.0 → euroeval-16.5.0}/docs/python-package.md +0 -0
  235. {euroeval-16.3.0 → euroeval-16.5.0}/docs/tasks/README.md +0 -0
  236. {euroeval-16.3.0 → euroeval-16.5.0}/docs/tasks/common-sense-reasoning.md +0 -0
  237. {euroeval-16.3.0 → euroeval-16.5.0}/docs/tasks/knowledge.md +0 -0
  238. {euroeval-16.3.0 → euroeval-16.5.0}/docs/tasks/linguistic-acceptability.md +0 -0
  239. {euroeval-16.3.0 → euroeval-16.5.0}/docs/tasks/named-entity-recognition.md +0 -0
  240. {euroeval-16.3.0 → euroeval-16.5.0}/docs/tasks/reading-comprehension.md +0 -0
  241. {euroeval-16.3.0 → euroeval-16.5.0}/docs/tasks/sentiment-classification.md +0 -0
  242. {euroeval-16.3.0 → euroeval-16.5.0}/docs/tasks/speed.md +0 -0
  243. {euroeval-16.3.0 → euroeval-16.5.0}/docs/tasks/summarization.md +0 -0
  244. {euroeval-16.3.0 → euroeval-16.5.0}/gfx/euroeval.png +0 -0
  245. {euroeval-16.3.0 → euroeval-16.5.0}/gfx/euroeval.xcf +0 -0
  246. {euroeval-16.3.0 → euroeval-16.5.0}/gfx/scandeval.png +0 -0
  247. {euroeval-16.3.0 → euroeval-16.5.0}/makefile +0 -0
  248. {euroeval-16.3.0 → euroeval-16.5.0}/mkdocs.yaml +0 -0
  249. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/benchmark_modules/__init__.py +0 -0
  250. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/enums.py +0 -0
  251. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/metrics/__init__.py +0 -0
  252. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/task_group_utils/__init__.py +0 -0
  253. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_allocine.py +0 -0
  254. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_angry_tweets.py +0 -0
  255. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_arc.py +0 -0
  256. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_arc_is.py +0 -0
  257. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_belebele.py +0 -0
  258. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_boolq_pt.py +0 -0
  259. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_cnn_dailymail.py +0 -0
  260. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_conll_en.py +0 -0
  261. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_conll_es.py +0 -0
  262. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_conll_nl.py +0 -0
  263. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_copa_lv.py +0 -0
  264. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_dane.py +0 -0
  265. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_danish_citizen_tests.py +0 -0
  266. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_dansk.py +0 -0
  267. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_danske_talemaader.py +0 -0
  268. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_danske_talemaader_old.py +0 -0
  269. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_dbrd.py +0 -0
  270. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_dutch_cola.py +0 -0
  271. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_eltec.py +0 -0
  272. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_err_news.py +0 -0
  273. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_estner.py +0 -0
  274. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_estonian_valence.py +0 -0
  275. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_european_values.py +0 -0
  276. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_exam_et.py +0 -0
  277. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_fone.py +0 -0
  278. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_foqa.py +0 -0
  279. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_fosent.py +0 -0
  280. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_fquad.py +0 -0
  281. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_fullstack_ner.py +0 -0
  282. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_germanquad.py +0 -0
  283. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_germeval.py +0 -0
  284. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_goldenswag.py +0 -0
  285. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_grammar_et.py +0 -0
  286. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_harem.py +0 -0
  287. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_hellaswag.py +0 -0
  288. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_hellaswag_fi.py +0 -0
  289. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
  290. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_ice_linguistic.py +0 -0
  291. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_icelandic_error_corpus.py +0 -0
  292. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_icelandic_knowledge.py +0 -0
  293. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_icelandic_qa.py +0 -0
  294. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_icesum.py +0 -0
  295. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_idioms_no.py +0 -0
  296. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_ilpost_sum.py +0 -0
  297. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_jentoft.py +0 -0
  298. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_kpwr_ner.py +0 -0
  299. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_latvian_lsm_summary.py +0 -0
  300. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_latvian_twitter_sentiment.py +0 -0
  301. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_life_in_the_uk.py +0 -0
  302. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_llmzszl.py +0 -0
  303. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_lt_emotions.py +0 -0
  304. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_mim_gold_ner.py +0 -0
  305. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_mlqa_es.py +0 -0
  306. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_mlsum_de.py +0 -0
  307. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_mlsum_es.py +0 -0
  308. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_mmlu_lv.py +0 -0
  309. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_multinerd-it.py +0 -0
  310. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_no_cola.py +0 -0
  311. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_no_sammendrag.py +0 -0
  312. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_nor_common_sense_qa.py +0 -0
  313. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_nordjylland_news.py +0 -0
  314. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_norec.py +0 -0
  315. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_norglm_multisum.py +0 -0
  316. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_norne.py +0 -0
  317. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_norquad.py +0 -0
  318. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_nqii.py +0 -0
  319. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_nrk_quiz_qa.py +0 -0
  320. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_orange_sum.py +0 -0
  321. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_personal_sum.py +0 -0
  322. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_polemo2.py +0 -0
  323. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_poquad.py +0 -0
  324. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_psc.py +0 -0
  325. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_publico.py +0 -0
  326. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_rrn.py +0 -0
  327. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_sb10k.py +0 -0
  328. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_scandiqa.py +0 -0
  329. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_scandisent_fi.py +0 -0
  330. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_schibsted.py +0 -0
  331. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_sentiment_headlines_es.py +0 -0
  332. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_sentipolc16.py +0 -0
  333. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_squad.py +0 -0
  334. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_squad_it.py +0 -0
  335. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_squad_nl.py +0 -0
  336. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_squad_nl_old.py +0 -0
  337. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_sst2_pt.py +0 -0
  338. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_sst5.py +0 -0
  339. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_suc3.py +0 -0
  340. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_swedn.py +0 -0
  341. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_swerec.py +0 -0
  342. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_trivia_et.py +0 -0
  343. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_turku_ner_fi.py +0 -0
  344. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_tydiqa_fi.py +0 -0
  345. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_wiki_lingua_nl.py +0 -0
  346. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_wikiann.py +0 -0
  347. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_wikineural-it.py +0 -0
  348. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_winogrande_et.py +0 -0
  349. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_winogrande_is.py +0 -0
  350. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_xlsum_fi.py +0 -0
  351. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_xquad.py +0 -0
  352. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/fix_dot_env_file.py +0 -0
  353. {euroeval-16.3.0 → euroeval-16.5.0}/tests/__init__.py +0 -0
  354. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_benchmark_modules/__init__.py +0 -0
  355. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_callbacks.py +0 -0
  356. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_enums.py +0 -0
  357. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_exceptions.py +0 -0
  358. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_finetuning.py +0 -0
  359. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_languages.py +0 -0
  360. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_scores.py +0 -0
  361. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_types.py +0 -0
  362. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_utils.py +0 -0
{euroeval-16.3.0 → euroeval-16.5.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml

@@ -24,6 +24,8 @@ body:
  label: Dataset languages
  description: What languages is the dataset in?
  options:
+ - label: Bulgarian
+ - label: Czech
  - label: Danish
  - label: Dutch
  - label: English
@@ -32,6 +34,7 @@ body:
  - label: Finnish
  - label: French
  - label: German
+ - label: Greek
  - label: Icelandic
  - label: Italian
  - label: Latvian
@@ -39,8 +42,11 @@ body:
  - label: Norwegian (Bokmål or Nynorsk)
  - label: Polish
  - label: Portuguese
+ - label: Serbian
+ - label: Slovak
  - label: Spanish
  - label: Swedish
+ - label: Ukrainian
  validations:
  required: true
  - type: textarea
{euroeval-16.3.0 → euroeval-16.5.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml

@@ -18,12 +18,13 @@ body:
  What languages should this model be evaluated on? Tick all that apply. If the
  model is multilingual (e.g., Mistral, Llama), then tick all the languages.
  options:
+ - label: Baltic languages (Latvian, Lithuanian)
+ - label: Finnic languages (Estonian, Finnish)
+ - label: Hellenic languages (Greek)
  - label: Romance languages (French, Italian, Portuguese, Spanish)
  - label: Scandinavian languages (Danish, Faroese, Icelandic, Norwegian, Swedish)
+ - label: Slavic languages (Bulgarian, Czech, Polish, Serbian, Slovak, Ukrainian)
  - label: West Germanic languages (Dutch, English, German)
- - label: Finnic languages (Estonian, Finnish)
- - label: Baltic languages (Latvian, Lithuanian)
- - label: Polish
  validations:
  required: true
  - type: dropdown
{euroeval-16.3.0 → euroeval-16.5.0}/.pre-commit-config.yaml

@@ -10,7 +10,7 @@ repos:
  - id: trailing-whitespace
  - id: debug-statements
  - repo: https://github.com/astral-sh/ruff-pre-commit
- rev: v0.13.1
+ rev: v0.14.2
  hooks:
  - id: ruff
  args:
{euroeval-16.3.0 → euroeval-16.5.0}/CHANGELOG.md

@@ -7,15 +7,133 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.

  ## [Unreleased]

+ ## [v16.5.0] - 2025-10-28
+
+ ### Added
+
+ - Added better support for evaluating on custom datasets, by allowing `DatasetConfig`
+ objects directly in the `Benchmarker.benchmark` method. We also support custom
+ datasets with the CLI, by simply defining the desired `DatasetConfig`s in a
+ `custom_datasets.py` file (path can be changed with the `--custom-datasets-file`
+ argument. In the `DatasetConfig`s we also support loading datasets from CSVs directly,
+ with the new `source` argument. This argument can both be the Hugging Face Hub ID of
+ the dataset or a dictionary with 'train', 'val' and 'test', and values the paths to
+ the CSV files.
+ - Added support for Serbian 🇷🇸! This includes the sentiment classification dataset
+ MMS-sr, the linguistic acceptability dataset ScaLA-sr, the named entity recognition
+ dataset UNER-sr, the reading comprehension dataset MultiWikiQA-sr, the summarisation
+ dataset LR-Sum-sr, the knowledge dataset MMLU-sr, and the common-sense reasoning
+ dataset Winogrande-sr. This was contributed by @oliverkinch ✨
+ - Added support for Bulgarian 🇧🇬! This includes the sentiment classification dataset
+ Cinexio, the linguistic acceptability dataset ScaLA-bg, the named entity recognition
+ dataset BG-NER-BSNLP, the reading comprehension dataset MultiWikiQA-bg, the knowledge
+ dataset Exams-bg, and the common-sense reasoning dataset Winogrande-bg. This was
+ contributed by @oliverkinch ✨
+ - Added support for Greek 🇬🇷! This includes the binary sentiment classification dataset
+ Greek-SA, the linguistic acceptability dataset ScaLA-el, the named entity recognition
+ dataset elNER, the reading comprehension dataset MultiWikiQA-el, the summarisation
+ dataset Greek-Wikipedia, the knowledge dataset Global-MMLU-el, and the common-sense
+ reasoning dataset Winogrande-el. This was contributed by @oliverkinch ✨
+ - Added support for Ukrainian 🇺🇦! This includes the sentiment classification dataset
+ Cross-Domain UK Reviews, the linguistic acceptability dataset ScaLA-uk, the named
+ entity recognition dataset NER-uk, the reading comprehension dataset MultiWikiQA-uk,
+ the summarisation dataset LR-Sum-uk, and the knowledge dataset Global-MMLU-uk. This
+ was contributed by @oliverkinch ✨
+
+ ### Changed
+
+ - Now returns all the desired results from the `Benchmarker.benchmark` method, rather
+ than only the ones that were newly computed (so we load all previous results from disk
+ as well).
+
+ ### Fixed
+
+ - Fixed the "double option" problem in Winogrande datasets across all languages.
+ Previously, option labels were duplicated for multiple languages (e.g.,
+ "Svarmuligheder:\na. Valgmulighed A: Natalie\nb. Valgmulighed B: Betty" instead of
+ just "Svarmuligheder:\na. Natalie\nb. Betty").
+ - The previous fix to close arrow writers in metrics did not work as intended, as the
+ "too many open files" error still occurred. We now ensure that the writers are closed
+ properly after each metric computation to avoid this issue.
+ - Now correctly allows specifying inference provider API keys with the `--api-key`
+ argument. Previously, this conflicted with the Hugging Face API key.
+ - Fixed an issue where some pretrained generative models required prefix spaces in the
+ labels for classification tasks, which resulted in faulty structured choice
+ generation. We now correctly take this into account, which significantly increases
+ the classification performance of these models.
+
+ ## [v16.4.0] - 2025-10-21
+
+ ### Added
+
+ - Added support for Slovak 🇸🇰! This includes the sentiment classification dataset
+ CSFD-sentiment-sk, the linguistic acceptability dataset ScaLA-sk, the named entity
+ recognition dataset UNER-sk, the reading comprehension dataset MultiWikiQA-sk, the
+ multiple-choice classification dataset MMLU-sk, and the common-sense reasoning dataset
+ Winogrande-sk. This was contributed by @oliverkinch ✨
+ - Added support for Czech 🇨🇿! This includes the sentiment classification dataset
+ CSFD-sentiment, the linguistic acceptability dataset ScaLA-cs, the linguistic
+ acceptability dataset CS-GEC, the named entity recognition dataset PONER, the reading
+ comprehension dataset SQAD, the summarization dataset Czech News, the common-sense
+ reasoning dataset HellaSwag-cs, and the knowledge dataset Umimeto-qa. This was
+ contributed by @oliverkinch ✨
+ - Added the Lithuanian summarisation dataset Lrytas based on the Lithuanian
+ public media news portal [Lrytas.lt](https://www.lrytas.lt/). This was contributed by
+ @oliverkinch ✨
+ - Added the Estonian translation of MMLU, `mmlu-et`, as an unofficial knowledge
+ dataset.
+
+ ### Changed
+
+ - Updated vLLM to `>=0.11.0`, which features several breaking changes, so we had to
+ force the minimum version. This also features support for multiple new models, such as
+ Qwen3-Next and OLMo3.
+ - Now uses MultiWikiQA-da and MultiWikiQA-sv as the official Danish and Swedish reading
+ comprehension datasets, respectively, as the quality is substantially better than
+ ScandiQA-da and ScandiQA-sv.
+ - Used 128 of the test samples from the Winogrande datasets for validation, as we
+ previously did not use a validation split. This is done for all languages except
+ Icelandic and Estonian, as these are manually translated and corrected splits from a
+ different source. Most of these are unofficial datasets and thus won't affect the
+ leaderboard rankings. The only languages for which these are official are Lithuanian
+ and Polish, which do not have official leaderboards yet - so no leaderboards are
+ affected by this change.
+ - In the same vein as the above, we now use 32 samples for validation for the Lithuanian
+ LT-history dataset and the Swedish Skolprov dataset.
+ - Changed logging styling.
+
+ ### Fixed
+
+ - If a generative model consistently does not adhere to a given JSON schema, we disable
+ structured generation for that model. This was triggered by Claude models not
+ supporting Literal types in JSON schemas.
+ - Removed "e" options from the Skolprov multiple-choice dataset, as this inconsistency
+ in number of options caused issues when evaluating models on it.
+ - Fixed an issue where an uninformative logging message was shown when a model
+ configuration could not be loaded from the Hugging Face Hub, when the model was gated.
+ We now show that this is due to the gatedness, indicating that the user should log in
+ or provide a Hugging Face Hub access token to evaluate the model.
+ - Now caches functions related to loading repo info or fetching model configs from the
+ Hugging Face Hub, to avoid repeated calls to the Hub, resulting in rate limits.
+ - When running an evaluation that required the test split (e.g., European values
+ evaluation) as the last benchmark for a given model, then subsequent models would
+ continue to be evaluated on the test split, even if the user requested to use the
+ validation split. We now reset this not just after each dataset, but also after each
+ model, so that this does not happen.
+ - Now catches more errors when evaluating LiteLLM models, which were related to some
+ generation parameters not being supported (such as stop sequences) for some models.
+ - We now clean up metric writers when we're done with them, which prevents a "too many
+ open files" error when evaluating many models and datasets in a single run.
+
  ## [v16.3.0] - 2025-09-23

  ### Added

  - Added support for Lithuanian 🇱🇹! This includes the sentiment classification dataset
- Lithuanian Emotions, the linguistic acceptability dataset ScaLA-lt, the reading
- comprehension dataset MultiWikiQA-lt, the named entity recognition dataset WikiANN-lt,
- the the history knowledge dataset LT-History, and the common-sense reasoning dataset
- Winogrande-lt. This was contributed by @oliverkinch ✨
+ Lithuanian Emotions, the linguistic acceptability dataset ScaLA-lt (unofficial), the
+ reading comprehension dataset MultiWikiQA-lt, the named entity recognition dataset
+ WikiANN-lt, the the history knowledge dataset LT-History, and the common-sense
+ reasoning dataset Winogrande-lt. This was contributed by @oliverkinch ✨
  - Added "slow-tokenizer" model parameter, which can be used to force the use of a slow
  tokenizer when loading it. Use this by replacing your model ID with
  `<model-id>#slow-tokenizer`.
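To make the custom-dataset support described in the v16.5.0 entry above concrete, here is a minimal sketch of a `custom_datasets.py` file. It follows the example that appears in the package README further down; the dataset names, CSV paths, Hub ID and labels are all placeholders, and the Hub-ID form of `source` is shown as described in the changelog, not taken from a shipped example.

```python
# custom_datasets.py -- a minimal sketch of the custom-dataset support described
# in the v16.5.0 changelog entry above. All names, paths and labels are placeholders.
from euroeval import DatasetConfig, TEXT_CLASSIFICATION
from euroeval.languages import ENGLISH

# `source` given as a dictionary mapping the 'train', 'val' and 'test' splits to
# local CSV files, each with a `text` and a `label` column.
MY_CSV_DATASET = DatasetConfig(
    name="my-csv-dataset",
    source=dict(train="train.csv", val="val.csv", test="test.csv"),
    task=TEXT_CLASSIFICATION,
    languages=[ENGLISH],
    _labels=["positive", "negative"],
)

# `source` can alternatively be the Hugging Face Hub ID of a dataset
# (hypothetical repository shown here).
MY_HUB_DATASET = DatasetConfig(
    name="my-hub-dataset",
    source="my-org/my-dataset",
    task=TEXT_CLASSIFICATION,
    languages=[ENGLISH],
    _labels=["positive", "negative"],
)
```

With such a file in place, the changelog indicates that `euroeval --dataset my-csv-dataset --model <model-id>` picks the configuration up from `custom_datasets.py`, with `--custom-datasets-file` available to point at a differently named file.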
{euroeval-16.3.0 → euroeval-16.5.0}/PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: EuroEval
- Version: 16.3.0
+ Version: 16.5.0
  Summary: The robust European language model benchmark.
  Project-URL: Repository, https://github.com/EuroEval/EuroEval
  Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -62,12 +62,12 @@ Provides-Extra: all
  Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'all'
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
  Requires-Dist: timm>=1.0.19; extra == 'all'
- Requires-Dist: vllm[flashinfer]<0.11.0,>=0.10.1; (platform_system == 'Linux') and extra == 'all'
+ Requires-Dist: vllm[flashinfer]>=0.11.0; (platform_system == 'Linux') and extra == 'all'
  Provides-Extra: generative
  Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
  Requires-Dist: timm>=1.0.19; extra == 'generative'
- Requires-Dist: vllm[flashinfer]<0.11.0,>=0.10.1; (platform_system == 'Linux') and extra == 'generative'
+ Requires-Dist: vllm[flashinfer]>=0.11.0; (platform_system == 'Linux') and extra == 'generative'
  Description-Content-Type: text/markdown

  <!-- This disables the requirement that the first line is a top-level heading -->
@@ -92,7 +92,7 @@ ______________________________________________________________________
  [![Second paper](https://img.shields.io/badge/arXiv-2406.13469-b31b1b.svg)](https://arxiv.org/abs/2406.13469)
  [![License](https://img.shields.io/github/license/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/blob/main/LICENSE)
  [![LastCommit](https://img.shields.io/github/last-commit/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/commits/main)
- [![Code Coverage](https://img.shields.io/badge/Coverage-67%25-yellow.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
+ [![Code Coverage](https://img.shields.io/badge/Coverage-76%25-yellowgreen.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
  [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg)](https://github.com/EuroEval/EuroEval/blob/main/CODE_OF_CONDUCT.md)

  ## Maintainer
@@ -113,7 +113,7 @@ when an evaluation requires a certain extra dependency, and how you install it.

  ## Quickstart

- ### Benchmarking from the Command Line
+ ### Benchmarking from the command line

  The easiest way to benchmark pretrained models is via the command line interface. After
  having installed the package, you can benchmark your favorite model like so:
@@ -160,7 +160,7 @@ See all the arguments and options available for the `euroeval` command by typing
  euroeval --help
  ```

- ### Benchmarking from a Script
+ ### Benchmarking from a script

  In a script, the syntax is similar to the command line interface. You simply initialise
  an object of the `Benchmarker` class, and call this benchmark object with your favorite
@@ -168,15 +168,19 @@ model:

  ```python
  >>> from euroeval import Benchmarker
- >>> benchmark = Benchmarker()
- >>> benchmark(model="<model-id>")
+ >>> benchmarker = Benchmarker()
+ >>> benchmarker.benchmark(model="<model-id>")
  ```

  To benchmark on a specific task and/or language, you simply specify the `task` or
  `language` arguments, shown here with same example as above:

  ```python
- >>> benchmark(model="<model-id>", task="sentiment-classification", language="da")
+ >>> benchmarker.benchmark(
+ ... model="<model-id>",
+ ... task="sentiment-classification",
+ ... language="da",
+ ... )
  ```

  If you want to benchmark a subset of all the models on the Hugging Face Hub, you can
@@ -184,10 +188,61 @@ simply leave out the `model` argument. In this example, we're benchmarking all D
  models on the Danish sentiment classification task:

  ```python
- >>> benchmark(task="sentiment-classification", language="da")
+ >>> benchmarker.benchmark(task="sentiment-classification", language="da")
  ```

- ### Benchmarking in an Offline Environment
+ ### Benchmarking from Docker
+
+ A Dockerfile is provided in the repo, which can be downloaded and run, without needing
+ to clone the repo and installing from source. This can be fetched programmatically by
+ running the following:
+
+ ```bash
+ wget https://raw.githubusercontent.com/EuroEval/EuroEval/main/Dockerfile.cuda
+ ```
+
+ Next, to be able to build the Docker image, first ensure that the NVIDIA Container
+ Toolkit is
+ [installed](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installation)
+ and
+ [configured](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker).
+ Ensure that the the CUDA version stated at the top of the Dockerfile matches the CUDA
+ version installed (which you can check using `nvidia-smi`). After that, we build the
+ image as follows:
+
+ ```bash
+ docker build --pull -t euroeval -f Dockerfile.cuda .
+ ```
+
+ With the Docker image built, we can now evaluate any model as follows:
+
+ ```bash
+ docker run -e args="<euroeval-arguments>" --gpus 1 --name euroeval --rm euroeval
+ ```
+
+ Here `<euroeval-arguments>` consists of the arguments added to the `euroeval` CLI
+ argument. This could for instance be `--model <model-id> --task
+ sentiment-classification`.
+
+ ## Benchmarking custom inference APIs
+
+ If the model you want to benchmark is hosted by a custom inference provider, such as a
+ [vLLM server](https://docs.vllm.ai/en/stable/), then this is also supported in EuroEval.
+ When benchmarking, you simply have to set the `--api-base` argument (`api_base` when
+ using the `Benchmarker` API) to the URL of the inference API, and optionally the
+ `--api-key` argument (`api_key`) to the API key, if authentication is required.
+
+ When benchmarking models hosted on a custom inference API, the model ID
+ (`--model`/`model`) should be the model name as registered on the inference server,
+ potentially with a required prefix, depending on the type of inference server used. For
+ instance, if the model is hosted on a vLLM server, the model ID should be prefixed with
+ `hosted_vllm/`, and if the model is hosted on an Ollama server, the model ID should be
+ prefixed with `ollama_chat/`. See the full list of possible inference providers as well
+ as their corresponding prefixes in the [LiteLLM
+ documentation](https://docs.litellm.ai/docs/providers/), as EuroEval uses LiteLLM to
+ handle evaluation of inference APIs in general.
+
+ ## Benchmarking in an offline environment

  If you need to benchmark in an offline environment, you need to download the models,
  datasets and metrics beforehand. This can be done by adding the `--download-only`
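The "Benchmarking custom inference APIs" section added in the hunk above has no accompanying snippet, so here is a minimal sketch. It assumes a vLLM server and that the `Benchmarker` constructor accepts the `api_base` and `api_key` arguments named in that section; the URL, key and model name are placeholders.

```python
# A minimal sketch of benchmarking a model served by a custom inference API,
# here assumed to be a vLLM server (hence the `hosted_vllm/` model-ID prefix).
from euroeval import Benchmarker

benchmarker = Benchmarker(
    api_base="http://localhost:8000/v1",  # URL of the inference API (placeholder)
    api_key="<api-key>",  # only needed if the server requires authentication
)
benchmarker.benchmark(model="hosted_vllm/<model-name>")
```

A model hosted on an Ollama server would instead use the `ollama_chat/` prefix, per the LiteLLM provider list referenced above.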
@@ -202,7 +257,7 @@ euroeval --model <model-id> --task sentiment-classification --language da --down
  Or from a script:

  ```python
- >>> benchmark(
+ >>> benchmarker.benchmark(
  ... model="<model-id>",
  ... task="sentiment-classification",
  ... language="da",
@@ -210,44 +265,139 @@ Or from a script:
  ... )
  ```

- Please note: Offline benchmarking of adapter models is not currently supported. An
- internet connection will be required during evaluation. If offline support is important
- to you, please consider [opening an issue](https://github.com/EuroEval/EuroEval/issues).
+ Please note: Offline benchmarking of adapter models is not currently supported, meaning
+ that we still require an internet connection during the evaluation of these. If offline
+ support of adapters is important to you, please consider [opening an
+ issue](https://github.com/EuroEval/EuroEval/issues).

- ### Benchmarking from Docker
+ ## Benchmarking custom datasets

- A Dockerfile is provided in the repo, which can be downloaded and run, without needing
- to clone the repo and installing from source. This can be fetched programmatically by
- running the following:
+ If you want to benchmark models on your own custom dataset, this is also possible.
+ First, you need to set up your dataset to be compatible with EuroEval. This means
+ splitting up your dataset in a training, validation and test split, and ensuring that
+ the column names are correct. We use `text` as the column name for the input text, and
+ the output column name depends on the type of task:

- ```bash
- wget https://raw.githubusercontent.com/EuroEval/EuroEval/main/Dockerfile.cuda
+ - **Text or multiple-choice classification**: `label`
+ - **Token classification**: `labels`
+ - **Reading comprehension**: `answers`
+ - **Free-form text generation**: `target_text`
+
+ Text and multiple-choice classification tasks are by far the most common. Next, you
+ store your three dataset splits as three different CSV files with the desired two
+ columns. Finally, you create a file called `custom_datasets.py` script in which you
+ define the associated `DatasetConfig` objects for your dataset. Here is an example of a
+ simple text classification dataset with two classes:
+
+ ```python
+ from euroeval import DatasetConfig, TEXT_CLASSIFICATION
+ from euroeval.languages import ENGLISH
+
+ MY_CONFIG = DatasetConfig(
+ name="my-dataset",
+ source=dict(train="train.csv", val="val.csv", test="test.csv"),
+ task=TEXT_CLASSIFICATION,
+ languages=[ENGLISH],
+ _labels=["positive", "negative"],
+ )
  ```

- Next, to be able to build the Docker image, first ensure that the NVIDIA Container
- Toolkit is
- [installed](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installation)
- and
- [configured](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker).
- Ensure that the the CUDA version stated at the top of the Dockerfile matches the CUDA
- version installed (which you can check using `nvidia-smi`). After that, we build the
- image as follows:
+ You can then benchmark your custom dataset by simply running

  ```bash
- docker build --pull -t euroeval -f Dockerfile.cuda .
+ euroeval --dataset my-dataset --model <model-id>
  ```

- With the Docker image built, we can now evaluate any model as follows:
+ You can also run the benchmark from a Python script, by simply providing your custom
+ dataset configuration directly into the `benchmark` method:

- ```bash
- docker run -e args="<euroeval-arguments>" --gpus 1 --name euroeval --rm euroeval
+ ```python
+ from euroeval import Benchmarker
+
+ benchmarker = Benchmarker()
+ benchmarker.benchmark(model="<model-id>", dataset=MY_CONFIG)
  ```

- Here `<euroeval-arguments>` consists of the arguments added to the `euroeval` CLI
- argument. This could for instance be `--model <model-id> --task
- sentiment-classification`.
+ We have included three convenience tasks to make it easier to set up custom datasets:
+
+ - `TEXT_CLASSIFICATION`, which is used for text classification tasks. This requires you
+ to set the `_labels` argument in the `DatasetConfig`, and requires the columns `text`
+ and `label` to be present in the dataset.
+ - `MULTIPLE_CHOICE`, which is used for multiple-choice classification tasks. This
+ also requires you to set the `_labels` argument in the `DatasetConfig`. Note that for
+ multiple choice tasks, you need to set up your `text` column to also list all the
+ choices, and all the samples should have the same number of choices. This requires the
+ columns `text` and `label` to be present in the dataset.
+ - `TOKEN_CLASSIFICATION`, which is used when classifying individual tokens in a text.
+ This also require you to set the `_labels` argument in the `DatasetConfig`. This
+ requires the columns `tokens` and `labels` to be present in the dataset, where
+ `tokens` is a list of tokens/words in the text, and `labels` is a list of the
+ corresponding labels for each token (so the two lists have the same length).
+
+ On top of these three convenience tasks, there are of course also the tasks that we use
+ in the official benchmark, which you can use if you want to use one of these tasks with
+ your own bespoke dataset:
+
+ - `LA`, for linguistic acceptability datasets.
+ - `NER`, for named entity recognition datasets with the standard BIO tagging scheme.
+ - `RC`, for reading comprehension datasets in the SQuAD format.
+ - `SENT`, for sentiment classification datasets.
+ - `SUMM`, for text summarisation datasets.
+ - `KNOW`, for multiple-choice knowledge datasets (e.g., MMLU).
+ - `MCRC`, for multiple-choice reading comprehension datasets (e.g., Belebele).
+ - `COMMON_SENSE`, for multiple-choice common-sense reasoning datasets (e.g., HellaSwag).
+
+ These can all be imported from `euroeval.tasks` module.
+
+ ### Creating your own custom task
+
+ You are of course also free to define your own task from scratch, which allows you to
+ customise the prompts used when evaluating generative models, for instance. Here is an
+ example of a custom free-form text generation task, where the goal for the model is to
+ generate a SQL query based on a natural language input:
+
+ ```python
+ from euroeval import DatasetConfig
+ from euroeval.data_models import Task, PromptConfig
+ from euroeval.enums import TaskGroup, ModelType
+ from euroeval.languages import ENGLISH
+ from euroeval.metrics import rouge_l_metric
+
+ sql_generation_task = Task(
+ name="sql-generation",
+ task_group=TaskGroup.TEXT_TO_TEXT,
+ template_dict={
+ ENGLISH: PromptConfig(
+ default_prompt_prefix="The following are natural language texts and their "
+ "corresponding SQL queries.",
+ default_prompt_template="Natural language query: {text}\nSQL query: "
+ "{target_text}",
+ default_instruction_prompt="Generate the SQL query for the following "
+ "natural language query:\n{text!r}",
+ default_prompt_label_mapping=dict(),
+ ),
+ },
+ metrics=[rouge_l_metric],
+ default_num_few_shot_examples=3,
+ default_max_generated_tokens=256,
+ default_allowed_model_types=[ModelType.GENERATIVE],
+ )
+
+ MY_SQL_DATASET = DatasetConfig(
+ name="my-sql-dataset",
+ source=dict(train="train.csv", val="val.csv", test="test.csv"),
+ task=sql_generation_task,
+ languages=[ENGLISH],
+ )
+ ```
+
+ Again, with this you can benchmark your custom dataset by simply running
+
+ ```bash
+ euroeval --dataset my-sql-dataset --model <model-id>
+ ```

- ### Reproducing the datasets
+ ## Reproducing the evaluation datasets

  All datasets used in this project are generated using the scripts located in the
  [src/scripts](src/scripts) folder. To reproduce a dataset, run the corresponding script
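Of the convenience tasks listed in the hunk above, `MULTIPLE_CHOICE` is the only one without an accompanying example in the README, so here is a minimal sketch. It assumes `MULTIPLE_CHOICE` is importable from `euroeval` alongside `TEXT_CLASSIFICATION`, and that the option letters serve as the labels; the option formatting inside the `text` column, and all names and paths, are illustrative only.

```python
# A minimal sketch of a custom multiple-choice dataset using the MULTIPLE_CHOICE
# convenience task described above. Letter labels and the option formatting in
# the `text` column are assumptions, not taken from the package documentation.
from euroeval import DatasetConfig, MULTIPLE_CHOICE
from euroeval.languages import ENGLISH

# Each row's `text` column contains the question together with all of its options
# (every sample must have the same number of options), e.g.:
#   "Which planet is closest to the sun?\nOptions:\na. Venus\nb. Mercury\nc. Mars"
# and the `label` column holds the correct option letter (here: "b").
MY_MC_DATASET = DatasetConfig(
    name="my-mc-dataset",
    source=dict(train="train.csv", val="val.csv", test="test.csv"),
    task=MULTIPLE_CHOICE,
    languages=[ENGLISH],
    _labels=["a", "b", "c"],
)
```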
@@ -379,6 +529,13 @@ A huge thank you to all the contributors who have helped make this project a suc
  alt="Contributor avatar for slowwavesleep"
  />
  </a>
+ <a href="https://github.com/mrkowalski">
+ <img
+ src="https://avatars.githubusercontent.com/u/6357044"
+ width=50
+ alt="Contributor avatar for mrkowalski"
+ />
+ </a>

  ### Contribute to EuroEval

@@ -390,7 +547,7 @@ contributing new datasets, your help makes this project better for everyone.
  - **Adding datasets**: If you're interested in adding a new dataset to EuroEval, we have
  a [dedicated guide](NEW_DATASET_GUIDE.md) with step-by-step instructions.

- ### Special Thanks
+ ### Special thanks

  - Thanks to [Google](https://google.com/) for sponsoring Gemini credits as part of their
  [Google Cloud for Researchers Program](https://cloud.google.com/edu/researchers).
@@ -401,7 +558,7 @@ contributing new datasets, your help makes this project better for everyone.
  - Thanks to [UWV](https://www.uwv.nl/) and [KU
  Leuven](https://www.arts.kuleuven.be/ling/ccl) for sponsoring the Azure OpenAI
  credits used to evaluate GPT-4-turbo in Dutch.
- - Thanks to [Miðeind](https://mideind.is/english.html) for sponsoring the OpenAI
+ - Thanks to [Miðeind](https://mideind.is/en) for sponsoring the OpenAI
  credits used to evaluate GPT-4-turbo in Icelandic and Faroese.
  - Thanks to [CHC](https://chc.au.dk/) for sponsoring the OpenAI credits used to
  evaluate GPT-4-turbo in German.