EuroEval 15.6.1.tar.gz → 15.7.1.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.


Files changed (242)
  1. {euroeval-15.6.1 → euroeval-15.7.1}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +1 -0
  2. {euroeval-15.6.1 → euroeval-15.7.1}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +1 -0
  3. {euroeval-15.6.1 → euroeval-15.7.1}/.github/workflows/ci.yaml +6 -2
  4. {euroeval-15.6.1 → euroeval-15.7.1}/.gitignore +2 -2
  5. {euroeval-15.6.1 → euroeval-15.7.1}/.pre-commit-config.yaml +1 -1
  6. {euroeval-15.6.1 → euroeval-15.7.1}/CHANGELOG.md +65 -0
  7. {euroeval-15.6.1 → euroeval-15.7.1}/CONTRIBUTING.md +19 -5
  8. euroeval-15.7.1/NEW_DATASET_GUIDE.md +107 -0
  9. {euroeval-15.6.1 → euroeval-15.7.1}/PKG-INFO +14 -2
  10. {euroeval-15.6.1 → euroeval-15.7.1}/README.md +12 -0
  11. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/dutch.md +1 -62
  12. euroeval-15.7.1/docs/datasets/finnish.md +388 -0
  13. euroeval-15.7.1/docs/leaderboards/Monolingual/spanish.md +15 -0
  14. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Multilingual/romance.md +1 -1
  15. {euroeval-15.6.1 → euroeval-15.7.1}/pyproject.toml +2 -2
  16. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/benchmark_modules/litellm.py +148 -284
  17. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/benchmark_modules/vllm.py +115 -338
  18. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/benchmarker.py +13 -2
  19. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/constants.py +1 -1
  20. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/data_loading.py +48 -26
  21. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/data_models.py +3 -9
  22. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/dutch.py +5 -16
  23. euroeval-15.7.1/src/euroeval/dataset_configs/finnish.py +60 -0
  24. euroeval-15.7.1/src/euroeval/generation_utils.py +346 -0
  25. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/prompt_templates/linguistic_acceptability.py +9 -1
  26. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/prompt_templates/multiple_choice.py +8 -1
  27. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/prompt_templates/named_entity_recognition.py +20 -1
  28. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/prompt_templates/reading_comprehension.py +11 -1
  29. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/prompt_templates/sentiment_classification.py +11 -1
  30. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/prompt_templates/summarization.py +9 -1
  31. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/scores.py +7 -1
  32. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/task_group_utils/sequence_classification.py +27 -32
  33. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/task_group_utils/text_to_text.py +10 -27
  34. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/tasks.py +1 -1
  35. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/tokenization_utils.py +22 -6
  36. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_allocine.py +1 -1
  37. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_arc.py +1 -1
  38. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_arc_is.py +1 -1
  39. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_belebele.py +1 -1
  40. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_cnn_dailymail.py +1 -1
  41. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_conll_en.py +1 -1
  42. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_conll_es.py +1 -1
  43. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_conll_nl.py +1 -1
  44. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_dane.py +1 -1
  45. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_danish_citizen_tests.py +1 -1
  46. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_dansk.py +1 -1
  47. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_danske_talemaader.py +1 -1
  48. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_danske_talemaader_old.py +1 -1
  49. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_dbrd.py +23 -23
  50. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_dutch_cola.py +1 -1
  51. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_eltec.py +1 -1
  52. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_fone.py +1 -1
  53. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_foqa.py +1 -1
  54. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_fosent.py +1 -1
  55. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_fquad.py +1 -1
  56. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_germanquad.py +1 -1
  57. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_germeval.py +1 -1
  58. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_hellaswag.py +1 -1
  59. euroeval-15.7.1/src/scripts/create_hellaswag_fi.py +274 -0
  60. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_hotter_and_colder_sentiment.py +1 -1
  61. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_ice_linguistic.py +1 -1
  62. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_icelandic_error_corpus.py +1 -1
  63. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_icelandic_knowledge.py +1 -1
  64. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_icelandic_qa.py +1 -1
  65. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_icesum.py +1 -1
  66. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_ilpost_sum.py +1 -1
  67. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_jentoft.py +1 -1
  68. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_mlsum_de.py +1 -1
  69. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_mlsum_es.py +1 -1
  70. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_mmlu.py +1 -1
  71. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_multinerd-it.py +1 -1
  72. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_no_cola.py +1 -1
  73. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_no_sammendrag.py +1 -1
  74. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_nor_common_sense_qa.py +1 -1
  75. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_nordjylland_news.py +1 -1
  76. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_norglm_multisum.py +1 -1
  77. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_norne.py +1 -1
  78. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_norquad.py +1 -1
  79. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_nqii.py +1 -1
  80. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_nrk_quiz_qa.py +1 -1
  81. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_orange_sum.py +1 -1
  82. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_personal_sum.py +1 -1
  83. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_rrn.py +1 -1
  84. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_sb10k.py +1 -1
  85. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_scala.py +3 -1
  86. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_scandiqa.py +1 -1
  87. euroeval-15.7.1/src/scripts/create_scandisent_fi.py +93 -0
  88. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_schibsted.py +1 -1
  89. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_sentiment_headlines_es.py +1 -1
  90. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_sentipolc16.py +1 -1
  91. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_squad.py +1 -1
  92. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_squad_it.py +1 -1
  93. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_squad_nl.py +1 -1
  94. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_squad_nl_old.py +1 -1
  95. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_sst5.py +1 -1
  96. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_suc3.py +1 -1
  97. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_swedn.py +1 -1
  98. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_swerec.py +1 -1
  99. euroeval-15.7.1/src/scripts/create_turku_ner_fi.py +117 -0
  100. euroeval-15.7.1/src/scripts/create_tydiqa_fi.py +118 -0
  101. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_wiki_lingua_nl.py +1 -1
  102. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_winogrande_is.py +1 -1
  103. euroeval-15.7.1/src/scripts/create_xlsum_fi.py +78 -0
  104. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/load_ud_pos.py +18 -0
  105. euroeval-15.7.1/tests/test_data_loading.py +120 -0
  106. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_scores.py +1 -0
  107. {euroeval-15.6.1 → euroeval-15.7.1}/uv.lock +2726 -2726
  108. euroeval-15.6.1/src/scripts/create_dutch_social.py +0 -114
  109. euroeval-15.6.1/tests/test_data_loading.py +0 -51
  110. {euroeval-15.6.1 → euroeval-15.7.1}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
  111. {euroeval-15.6.1 → euroeval-15.7.1}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
  112. {euroeval-15.6.1 → euroeval-15.7.1}/CITATION.cff +0 -0
  113. {euroeval-15.6.1 → euroeval-15.7.1}/CODE_OF_CONDUCT.md +0 -0
  114. {euroeval-15.6.1 → euroeval-15.7.1}/Dockerfile.cuda +0 -0
  115. {euroeval-15.6.1 → euroeval-15.7.1}/LICENSE +0 -0
  116. {euroeval-15.6.1 → euroeval-15.7.1}/docs/CNAME +0 -0
  117. {euroeval-15.6.1 → euroeval-15.7.1}/docs/README.md +0 -0
  118. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/README.md +0 -0
  119. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/danish.md +0 -0
  120. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/english.md +0 -0
  121. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/faroese.md +0 -0
  122. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/french.md +0 -0
  123. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/german.md +0 -0
  124. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/icelandic.md +0 -0
  125. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/italian.md +0 -0
  126. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/norwegian.md +0 -0
  127. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/spanish.md +0 -0
  128. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/swedish.md +0 -0
  129. {euroeval-15.6.1 → euroeval-15.7.1}/docs/extras/radial_plotter.md +0 -0
  130. {euroeval-15.6.1 → euroeval-15.7.1}/docs/faq.md +0 -0
  131. {euroeval-15.6.1 → euroeval-15.7.1}/docs/gfx/favicon.png +0 -0
  132. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Monolingual/danish.md +0 -0
  133. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Monolingual/dutch.md +0 -0
  134. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Monolingual/english.md +0 -0
  135. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Monolingual/faroese.md +0 -0
  136. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Monolingual/french.md +0 -0
  137. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Monolingual/german.md +0 -0
  138. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Monolingual/icelandic.md +0 -0
  139. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Monolingual/italian.md +0 -0
  140. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Monolingual/norwegian.md +0 -0
  141. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Monolingual/swedish.md +0 -0
  142. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Multilingual/european.md +0 -0
  143. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Multilingual/germanic.md +0 -0
  144. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
  145. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/README.md +0 -0
  146. {euroeval-15.6.1 → euroeval-15.7.1}/docs/methodology.md +0 -0
  147. {euroeval-15.6.1 → euroeval-15.7.1}/docs/python-package.md +0 -0
  148. {euroeval-15.6.1 → euroeval-15.7.1}/docs/tasks/README.md +0 -0
  149. {euroeval-15.6.1 → euroeval-15.7.1}/docs/tasks/common-sense-reasoning.md +0 -0
  150. {euroeval-15.6.1 → euroeval-15.7.1}/docs/tasks/knowledge.md +0 -0
  151. {euroeval-15.6.1 → euroeval-15.7.1}/docs/tasks/linguistic-acceptability.md +0 -0
  152. {euroeval-15.6.1 → euroeval-15.7.1}/docs/tasks/named-entity-recognition.md +0 -0
  153. {euroeval-15.6.1 → euroeval-15.7.1}/docs/tasks/reading-comprehension.md +0 -0
  154. {euroeval-15.6.1 → euroeval-15.7.1}/docs/tasks/sentiment-classification.md +0 -0
  155. {euroeval-15.6.1 → euroeval-15.7.1}/docs/tasks/speed.md +0 -0
  156. {euroeval-15.6.1 → euroeval-15.7.1}/docs/tasks/summarization.md +0 -0
  157. {euroeval-15.6.1 → euroeval-15.7.1}/gfx/euroeval.png +0 -0
  158. {euroeval-15.6.1 → euroeval-15.7.1}/gfx/euroeval.xcf +0 -0
  159. {euroeval-15.6.1 → euroeval-15.7.1}/gfx/scandeval.png +0 -0
  160. {euroeval-15.6.1 → euroeval-15.7.1}/makefile +0 -0
  161. {euroeval-15.6.1 → euroeval-15.7.1}/mkdocs.yaml +0 -0
  162. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/__init__.py +0 -0
  163. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/benchmark_config_factory.py +0 -0
  164. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/benchmark_modules/__init__.py +0 -0
  165. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/benchmark_modules/base.py +0 -0
  166. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/benchmark_modules/fresh.py +0 -0
  167. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/benchmark_modules/hf.py +0 -0
  168. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/callbacks.py +0 -0
  169. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/cli.py +0 -0
  170. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/__init__.py +0 -0
  171. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/danish.py +0 -0
  172. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/english.py +0 -0
  173. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/faroese.py +0 -0
  174. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/french.py +0 -0
  175. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/german.py +0 -0
  176. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/icelandic.py +0 -0
  177. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/italian.py +0 -0
  178. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/norwegian.py +0 -0
  179. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/spanish.py +0 -0
  180. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/swedish.py +0 -0
  181. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/enums.py +0 -0
  182. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/exceptions.py +0 -0
  183. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/finetuning.py +0 -0
  184. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/generation.py +0 -0
  185. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/human_evaluation.py +0 -0
  186. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/languages.py +0 -0
  187. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/model_cache.py +0 -0
  188. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/model_config.py +0 -0
  189. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/model_loading.py +0 -0
  190. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/prompt_templates/__init__.py +0 -0
  191. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/speed_benchmark.py +0 -0
  192. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/task_group_utils/__init__.py +0 -0
  193. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/task_group_utils/multiple_choice_classification.py +0 -0
  194. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/task_group_utils/question_answering.py +0 -0
  195. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/task_group_utils/token_classification.py +0 -0
  196. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/types.py +0 -0
  197. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/utils.py +0 -0
  198. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/constants.py +0 -0
  199. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_angry_tweets.py +0 -0
  200. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_mim_gold_ner.py +0 -0
  201. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_mlqa_es.py +0 -0
  202. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_norec.py +0 -0
  203. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_norglm_multiqa.py +0 -0
  204. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_wikiann_fo.py +0 -0
  205. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_wikineural-it.py +0 -0
  206. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_xquad_es.py +0 -0
  207. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/fix_dot_env_file.py +0 -0
  208. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/versioning.py +0 -0
  209. {euroeval-15.6.1 → euroeval-15.7.1}/tests/__init__.py +0 -0
  210. {euroeval-15.6.1 → euroeval-15.7.1}/tests/conftest.py +0 -0
  211. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_benchmark_config_factory.py +0 -0
  212. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_benchmark_modules/__init__.py +0 -0
  213. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_benchmark_modules/test_base.py +0 -0
  214. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_benchmark_modules/test_fresh.py +0 -0
  215. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_benchmark_modules/test_hf.py +0 -0
  216. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_benchmark_modules/test_litellm.py +0 -0
  217. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_benchmark_modules/test_vllm.py +0 -0
  218. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_benchmarker.py +0 -0
  219. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_callbacks.py +0 -0
  220. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_cli.py +0 -0
  221. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_constants.py +0 -0
  222. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_data_models.py +0 -0
  223. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_dataset_configs.py +0 -0
  224. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_enums.py +0 -0
  225. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_exceptions.py +0 -0
  226. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_finetuning.py +0 -0
  227. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_generation.py +0 -0
  228. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_human_evaluation.py +0 -0
  229. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_languages.py +0 -0
  230. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_model_cache.py +0 -0
  231. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_model_config.py +0 -0
  232. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_model_loading.py +0 -0
  233. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_speed_benchmark.py +0 -0
  234. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_task_utils/__init__.py +0 -0
  235. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_task_utils/test_question_answering.py +0 -0
  236. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_task_utils/test_sequence_classification.py +0 -0
  237. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_task_utils/test_text_to_text.py +0 -0
  238. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_task_utils/test_token_classification.py +0 -0
  239. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_tasks.py +0 -0
  240. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_tokenization_utils.py +0 -0
  241. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_types.py +0 -0
  242. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_utils.py +0 -0
.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml:
@@ -26,6 +26,7 @@ body:
  - label: Dutch
  - label: English
  - label: Faroese
+ - label: Finnish
  - label: French
  - label: German
  - label: Icelandic
.github/ISSUE_TEMPLATE/model_evaluation_request.yaml:
@@ -21,6 +21,7 @@ body:
  - label: Romance languages (French, Italian, Spanish)
  - label: Scandinavian languages (Danish, Faroese, Icelandic, Norwegian, Swedish)
  - label: West Germanic languages (Dutch, English, German)
+ - label: Finnish
  validations:
  required: true
  - type: dropdown
.github/workflows/ci.yaml:
@@ -24,7 +24,10 @@ jobs:
  - uses: actions/setup-python@v5
  with:
  python-version: "3.11"
- - uses: pre-commit/action@v3.0.1
+ - run: python -m pip install pre-commit
+ shell: bash
+ - run: pre-commit run --show-diff-on-failure --color=always
+ shell: bash

  pytest-linux:
  if: github.event.pull_request.draft == false
@@ -41,8 +44,9 @@ jobs:
  persist-credentials: false

  - name: Install uv and set up Python
- uses: astral-sh/setup-uv@v4
+ uses: astral-sh/setup-uv@v5
  with:
+ enable-cache: false
  python-version: ${{ matrix.python-version }}

  - name: Install Dependencies
.gitignore:
@@ -117,5 +117,5 @@ site/
  docs/datasets/dataset_example_commands.txt

  # Various graphics
- gfx/euroeval-italian.png
- gfx/euroeval-italian.xcf
+ gfx/euroeval-*.png
+ gfx/euroeval-*.xcf
.pre-commit-config.yaml:
@@ -10,7 +10,7 @@ repos:
  - id: trailing-whitespace
  - id: debug-statements
  - repo: https://github.com/astral-sh/ruff-pre-commit
- rev: v0.11.5
+ rev: v0.11.7
  hooks:
  - id: ruff
  args:
CHANGELOG.md:
@@ -10,6 +10,71 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.



+ ## [v15.7.1] - 2025-04-29
+ ### Changed
+ - Marked the DBRD Dutch sentiment classification as official, as the quality is
+ substantially better than the previous Dutch Social.
+
+ ### Fixed
+ - Fixed an issue with NER evaluation of instruction-tuned models, which was caused by
+ the "O" label mistakenly being included in the prompt template, causing an error
+ during evaluation. No evaluations were affected by this, only that some evaluations
+ could not be run.
+
+
+ ## [v15.7.0] - 2025-04-28
+ ### Added
+ - Added support for Finnish 🇫🇮! This includes the Finnish part of the reading
+ comprehension dataset
+ [TydiQA-fi](https://huggingface.co/datasets/google-research-datasets/tydiqa/viewer/secondary_task?views%5B%5D=secondary_task_train),
+ the Finnish part of the binary sentiment classification dataset
+ [ScandiSent](https://github.com/timpal0l/ScandiSent), the linguistic acceptability
+ dataset ScaLA with the [Finnish Universal
+ Dependencies](https://github.com/UniversalDependencies/UD_Finnish-TDT), the NER
+ dataset [Turku NER](https://aclanthology.org/2020.lrec-1.567/), the summarisation
+ dataset [XL-Sum-fi](https://huggingface.co/datasets/TurkuNLP/xlsum-fi), and the
+ common-sense reasoning dataset
+ [HellaSwag-fi](https://huggingface.co/datasets/Finnish-NLP/hellaswag-fi-google-translate).
+ This was contributed by [@oliverkinch](https://github.com/oliverkinch) ✨
+ - Added metadata for GPT-4.1 and Grok-3 models.
+ - Marked Gemini-2.5-flash and Grok-3-mini as reasoning models, giving them more tokens
+ to think.
+
+ ### Changed
+ - Updated `datasets` to `>=3.5.0`, as the previous versions were incompatible with the
+ newer versions of `huggingface_hub`.
+ - Increase the number of allowed reasoning tokens from 8,192 to 32,768 for reasoning
+ models. This is done as several models did not stop reasoning before running out of
+ tokens, yielding a blank output.
+ - API models now use JSON schemas for the NER task if they support it, and if not then
+ they resort to standard JSON mode (which does not enforce a specific schema, just that
+ the output is JSON).
+
+ ### Fixed
+ - If we fail to extract labels using a generative model's logprobs, we now fall back to
+ using word edit distance between the outputted text and the labels instead of throwing
+ an error.
+ - Fixed a bug where we could not use the `thinking` parameter with `claude-3-7-sonnet`,
+ due to a typo. This has been fixed now.
+ - Now catches the error when an API model requires setting temperature to 1.0, and
+ retries the evaluation with temperature set to 1.0.
+ - When benchmarking a model with a revision (i.e., of the form `<model-id>@<revision>`),
+ we now correctly store this full model ID to the benchmark results on disk, including
+ the revision.
+ - Fixed a GPU memory error while computing the BERTScore for the summarisation task,
+ resulting in a memory crash. We have now reduced the batch size to 1 for this task,
+ making it slightly slower but more memory efficient.
+ - Disabled structured outputs and logprobs for reasoning models, to ensure that they
+ are allowed to output reasoning tokens before they output their answer.
+ - Do not supply stop sequences to API models if they do not support it.
+ - If a `SystemError` happens during LiteLLM generation then we now retry the
+ generation.
+ - Handle if a LiteLLM model does not support specifying maxItems in the JSON schema
+ during structured generation.
+ - Truncate prompts to decoder model's maximum sequence length if the model's maximum
+ sequence length is smaller than 5,000 tokens.
+
+
  ## [v15.6.1] - 2025-04-14
  ### Changed
  - Added more info about SQuAD-nl in the documentation. This was contributed by
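The v15.7.0 entry above describes API models using a JSON schema for the NER task when the provider supports structured outputs, and falling back to plain JSON mode otherwise. Below is a minimal sketch of what such a request shape could look like; the entity tag names, the `maxItems` value and the `ner_entities` schema name are illustrative assumptions, not EuroEval's actual implementation (which lives in `src/euroeval/benchmark_modules/litellm.py`).

```python
# Illustrative sketch only: the tag names and schema layout are assumptions,
# not the schema EuroEval actually sends.
ner_schema = {
    "type": "object",
    "properties": {
        tag: {
            "type": "array",
            "items": {"type": "string"},
            # Some providers reject "maxItems"; the changelog notes that such
            # models are handled separately during structured generation.
            "maxItems": 5,
        }
        for tag in ("person", "location", "organization", "miscellaneous")
    },
    "required": ["person", "location", "organization", "miscellaneous"],
    "additionalProperties": False,
}

# Schema-enforced structured output, for providers that support it.
structured_response_format = {
    "type": "json_schema",
    "json_schema": {"name": "ner_entities", "schema": ner_schema, "strict": True},
}

# Fallback: standard JSON mode, which only guarantees valid JSON and does not
# enforce any particular schema.
json_mode_response_format = {"type": "json_object"}
```

Either dictionary could then be passed as the `response_format` argument of a `litellm.completion` call, with the JSON-mode variant used when the provider rejects schema-constrained outputs.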
CONTRIBUTING.md:
@@ -14,14 +14,22 @@ issue, creating a PR, reviewing, and merging the PR.
  To get an overview of the project, read the [README](README.md). Here are some
  resources to help you get started with open source contributions:

- - [Finding ways to contribute to open source on GitHub](https://docs.github.com/en/get-started/exploring-projects-on-github/finding-ways-to-contribute-to-open-source-on-github)
+ - [Finding ways to contribute to open source on
+ GitHub](https://docs.github.com/en/get-started/exploring-projects-on-github/finding-ways-to-contribute-to-open-source-on-github)
  - [Set up Git](https://docs.github.com/en/get-started/quickstart/set-up-git)
  - [GitHub flow](https://docs.github.com/en/get-started/quickstart/github-flow)
- - [Collaborating with pull requests](https://docs.github.com/en/github/collaborating-with-pull-requests)
+ - [Collaborating with pull
+ requests](https://docs.github.com/en/github/collaborating-with-pull-requests)


  ## Getting started

+ ### Adding datasets
+
+ EuroEval welcomes contributions of new datasets that help evaluate language models
+ across European languages. A guide for adding datasets to EuroEval can be found
+ [here](NEW_DATASET_GUIDE.md).
+
  ### Issues

  #### Create a new issue
@@ -42,11 +50,17 @@ find an issue to work on, you are welcome to open a PR with a fix.

  1. Fork the repository.
  - Using GitHub Desktop:
- - [Getting started with GitHub Desktop](https://docs.github.com/en/desktop/installing-and-configuring-github-desktop/getting-started-with-github-desktop) will guide you through setting up Desktop.
- - Once Desktop is set up, you can use it to [fork the repo](https://docs.github.com/en/desktop/contributing-and-collaborating-using-github-desktop/cloning-and-forking-repositories-from-github-desktop)!
+ - [Getting started with GitHub
+ Desktop](https://docs.github.com/en/desktop/installing-and-configuring-github-desktop/getting-started-with-github-desktop)
+ will guide you through setting up Desktop.
+ - Once Desktop is set up, you can use it to [fork the
+ repo](https://docs.github.com/en/desktop/contributing-and-collaborating-using-github-desktop/cloning-and-forking-repositories-from-github-desktop)!

  - Using the command line:
- - [Fork the repo](https://docs.github.com/en/github/getting-started-with-github/fork-a-repo#fork-an-example-repository) so that you can make your changes without affecting the original project until you're ready to merge them.
+ - [Fork the
+ repo](https://docs.github.com/en/github/getting-started-with-github/fork-a-repo#fork-an-example-repository)
+ so that you can make your changes without affecting the original project until
+ you're ready to merge them.

  3. Run `make install` from within the repo to get set up
NEW_DATASET_GUIDE.md (new file):
@@ -0,0 +1,107 @@
+ # Contributing a Dataset to EuroEval
+
+ This guide will walk you through the process of contributing a new dataset to EuroEval.
+
+ For general contribution guidelines, please refer to our [Contributing Guide](CONTRIBUTING.md).
+
+ If you have any questions during this process, please open an issue on the [EuroEval GitHub repository](https://github.com/EuroEval/EuroEval/issues).
+
+
+ ## Step 0: Prerequisites
+
+ Before beginning:
+ 1. Check if your dataset matches [one of the supported tasks](https://euroeval.com/tasks/). If your dataset doesn't match any supported task, you have two options:
+    1. Try to adapt it to fit an existing task (e.g., by reformatting it or adding multiple choice options)
+    2. Open an issue on the EuroEval repository requesting to add a new task type
+ 2. If it does, [fork the EuroEval repository](https://github.com/EuroEval/EuroEval/fork) and create a new branch to work on your dataset contribution
+
+
+ ## Step 1: Create the Dataset Processing Script
+
+ Create a script in the `src/scripts` directory that processes your dataset into the EuroEval format.
+
+ The dataset creation script roughly follows this pattern:
+
+ ```python
+ # Load your dataset.
+ raw_dataset = load_dataset("path_to_your_dataset")
+
+ # Process the dataset to fit the EuroEval format.
+ dataset = process_raw_dataset(raw_dataset=raw_dataset)
+
+ # Push the dataset to the Hugging Face Hub.
+ dataset.push_to_hub("EuroEval/your_dataset_name", private=True)
+ ```
+
+ ### Tips for Dataset Processing:
+ - Examine existing scripts for datasets with the same task for a reference on how to process your dataset.
+ - Take a look at [existing datasets in your language](https://euroeval.com/datasets/) to see how these are usually set up. Study these examples to understand the expected format and structure for your own dataset's entries.
+ - Split your dataset into train / val / test sets, ideally with 1,024 / 256 / 2,048 samples, respectively
+ - If your dataset already has splits, maintain consistency (e.g., the EuroEval train split should be a subset of the original train split)
+
+
+ ## Step 2: Add Dataset Configuration
+
+ Dataset configurations in EuroEval are organised by language, with each language having its own file at `src/euroeval/dataset_configs/{language}.py`. A configuration is made with the `DatasetConfig` class. Here is an example for the fictive English knowledge dataset `Rizzler`.
+
+ ```python
+ RIZZLER_KNOWLEDGE_CONFIG = DatasetConfig(
+     name="rizzler_knowledge",  # The name of the dataset
+     pretty_name="the truncated version of the English knowledge dataset Rizzler",  # The pretty name of the dataset used in logs.
+     huggingface_id="EuroEval/rizzler_knowledge",  # The same id as used in the dataset creation script
+     task=KNOW,  # The task of the dataset
+     languages=[EN],  # The language of the dataset
+     unofficial=True,  # Whether the dataset is unofficial
+ )
+ ```
+
+ Every `src/euroeval/dataset_configs/{language}.py` file has two sections:
+ - `### Official datasets ###`
+ - `### Unofficial datasets ###`
+
+ An unofficial dataset means that the resulting evaluation will not be included in the [official leaderboard](https://euroeval.com/leaderboards/).
+
+ As a starting point, make your dataset unofficial. This can always be changed later.
+
+
+ ## Step 3: Document Your Dataset
+
+ Dataset documentation in EuroEval is organised by language, with each language having its own file at `docs/datasets/{language}.md`. Within each language file, documentation is further organised by task.
+
+ Navigate to the documentation file for your dataset's language and add your dataset's documentation in the appropriate task section.
+
+ The documentation should include the following information:
+
+ 1. **General description**: Explain the dataset's origin and purpose
+ 2. **Split details**: Describe how splits were created and their sizes
+ 3. **Example samples**: Provide 3 representative examples from the training split
+ 4. **Evaluation setup**: Explain how models are evaluated on this dataset
+ 5. **Evaluation command**: Show how to evaluate a model on your dataset
+
+ To do this, you can follow these steps:
+ 1. Find an existing dataset of the same task in `docs/datasets/{language}.md`
+ 2. Copy the entire documentation section for that dataset
+ 3. Use this as a template and modify all details to match your new dataset
+ 4. Ensure you update all dataset-specific information (description, split sizes, example samples, etc.)
+
+
+ ## Step 4: Modify the Change Log
+
+ After completing the previous steps, add an entry to the project's changelog to document your contribution. The entry should be added under the `[Unreleased]` section with a short description of the dataset you have added. Here is an example of a new entry.
+
+ ```md
+ ## [Unreleased]
+ ### Added
+ - Added the English knowledge dataset [rizzler_knowledge](https://huggingface.co/datasets/Example-User/rizzler_knowledge). The split is given by 1,024 / 256 / 2,048 samples for train / val / test, respectively. It is marked as `unofficial` for now. This was contributed by [@your_name](https://github.com/your_name) ✨
+ ```
+
+
+ ## Step 5: Make a Pull Request
+
+ When you have completed all the previous steps, create a pull request to the EuroEval repository.
+
+
+ ### Thank you!
+ This concludes the process of contributing a dataset to EuroEval. Your contribution helps expand the multilingual evaluation capabilities of the benchmark and is greatly appreciated by the research community!
+
+ Thank you for your valuable contribution! 🎉
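As a complement to Step 1 of the guide added above, here is a minimal, self-contained sketch of such a processing script, producing the recommended 1,024 / 256 / 2,048 train / val / test splits. The source dataset ID, column names and target Hub repository are placeholder assumptions; the real scripts in `src/scripts` remain the authoritative reference.

```python
# Minimal sketch of a dataset creation script (placeholder IDs and columns).
from datasets import Dataset, DatasetDict, load_dataset


def main() -> None:
    # Load the raw dataset from the Hugging Face Hub (placeholder ID).
    raw_dataset = load_dataset("example-org/raw-sentiment-dataset", split="train")

    # Keep only the columns EuroEval needs and normalise their names
    # (the source column names here are assumptions about the raw data).
    df = raw_dataset.to_pandas().rename(columns={"review": "text", "rating": "label"})
    df = df[["text", "label"]].dropna().drop_duplicates()

    # Shuffle with a fixed seed and carve out the recommended
    # 1,024 / 256 / 2,048 train / val / test splits (assumes >= 3,328 rows).
    df = df.sample(frac=1.0, random_state=4242).reset_index(drop=True)
    train_df = df.iloc[:1024]
    val_df = df.iloc[1024:1280]
    test_df = df.iloc[1280:3328]

    dataset = DatasetDict(
        train=Dataset.from_pandas(train_df, preserve_index=False),
        val=Dataset.from_pandas(val_df, preserve_index=False),
        test=Dataset.from_pandas(test_df, preserve_index=False),
    )

    # Push the processed dataset to the Hub (placeholder repo ID).
    dataset.push_to_hub("EuroEval/example-sentiment-dataset", private=True)


if __name__ == "__main__":
    main()
```

Shuffling with a fixed seed keeps the splits reproducible when the script is re-run to regenerate the Hub dataset.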
PKG-INFO:
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: EuroEval
- Version: 15.6.1
+ Version: 15.7.1
  Summary: The robust European language model benchmark.
  Project-URL: Repository, https://github.com/EuroEval/EuroEval
  Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -32,7 +32,7 @@ Requires-Python: <4.0,>=3.10
  Requires-Dist: accelerate>=0.34.2
  Requires-Dist: bert-score>=0.3.13
  Requires-Dist: click>=8.1.3
- Requires-Dist: datasets>=2.15.0
+ Requires-Dist: datasets>=3.5.0
  Requires-Dist: demjson3>=3.0.6
  Requires-Dist: evaluate>=0.4.1
  Requires-Dist: huggingface-hub>=0.30.1
@@ -239,6 +239,18 @@ A huge thank you to all the contributors who have helped make this project a suc
  <a href="https://github.com/peregilk"><img src="https://avatars.githubusercontent.com/u/9079808" width=50 alt="Contributor avatar for peregilk"/></a>
  <a href="https://github.com/Rijgersberg"><img src="https://avatars.githubusercontent.com/u/8604946" width=50 alt="Contributor avatar for Rijgersberg"/></a>

+
+ ### Contribute to EuroEval
+
+ We welcome contributions to EuroEval! Whether you're fixing bugs, adding features, or
+ contributing new datasets, your help makes this project better for everyone.
+
+ - **General contributions**: Check out our [contribution guidelines](CONTRIBUTING.md)
+ for information on how to get started.
+ - **Adding datasets**: If you're interested in adding a new dataset to EuroEval, we have
+ a [dedicated guide](NEW_DATASET_GUIDE.md) with step-by-step instructions.
+
+
  ### Special Thanks
  - Thanks to [Google](https://google.com/) for sponsoring Gemini credits as part of their
  [Google Cloud for Researchers Program](https://cloud.google.com/edu/researchers).
README.md:
@@ -163,6 +163,18 @@ A huge thank you to all the contributors who have helped make this project a suc
  <a href="https://github.com/peregilk"><img src="https://avatars.githubusercontent.com/u/9079808" width=50 alt="Contributor avatar for peregilk"/></a>
  <a href="https://github.com/Rijgersberg"><img src="https://avatars.githubusercontent.com/u/8604946" width=50 alt="Contributor avatar for Rijgersberg"/></a>

+
+ ### Contribute to EuroEval
+
+ We welcome contributions to EuroEval! Whether you're fixing bugs, adding features, or
+ contributing new datasets, your help makes this project better for everyone.
+
+ - **General contributions**: Check out our [contribution guidelines](CONTRIBUTING.md)
+ for information on how to get started.
+ - **Adding datasets**: If you're interested in adding a new dataset to EuroEval, we have
+ a [dedicated guide](NEW_DATASET_GUIDE.md) with step-by-step instructions.
+
+
  ### Special Thanks
  - Thanks to [Google](https://google.com/) for sponsoring Gemini credits as part of their
  [Google Cloud for Researchers Program](https://cloud.google.com/edu/researchers).
docs/datasets/dutch.md:
@@ -7,68 +7,7 @@ information about what these constitute.

  ## Sentiment Classification

- ### Dutch Social
-
- This dataset consists of Dutch tweets annotated with sentiment labels. It is not sure
- how the sentiment labels were assigned, this information is pending from the authors.
-
- The original full dataset consists of 162,805 / 54,269 / 54,268 samples for training,
- validation and testing, respectively (so 271,342 samples used in total). We use a 1,024
- / 256 / 1,024 split for training, validation and testing, respectively. All the new
- splits are subsets of the original splits.
-
- Here are a few examples from the training split:
-
- ```json
- {
-   "text": 'Novak Djokovic positief getest op coronavirus na eigen tennistoernooi\n\nhttps://t.co/U7VOcjANh9',
-   "label": 'positive'
- }
- ```
- ```json
- {
-   "text": "via @NYTimes https://t.co/IjbCWIwYvR",
-   "label": "neutral"
- }
- ```
- ```json
- {
-   "text": "@backinflow 30 min Corona tijd....",
-   "label": "negative"
- }
- ```
-
- When evaluating generative models, we use the following setup (see the
- [methodology](/methodology) for more information on how these are used):
-
- - Number of few-shot examples: 12
- - Prefix prompt:
-   ```
-   Hieronder staan tweets en hun sentiment, dat 'positief', 'neutraal' of 'negatief' kan zijn.
-   ```
- - Base prompt template:
-   ```
-   Tweet: {text}
-   Sentiment: {label}
-   ```
- - Instruction-tuned prompt template:
-   ```
-   Tweet: {text}
-
-   Classificeer het sentiment in de tweet. Antwoord met 'positief', 'neutraal' of 'negatief'.
-   ```
- - Label mapping:
-   - `positive` ➡️ `positief`
-   - `neutral` ➡️ `neutraal`
-   - `negative` ➡️ `negatief`
-
- You can evaluate this dataset directly as follows:
-
- ```bash
- $ euroeval --model <model-id> --dataset dutch-social
- ```
-
- ### Unofficial: DBRD
+ ### DBRD

  This dataset was published in [this paper](https://doi.org/10.48550/arXiv.1910.00896)
  and features Dutch book reviews from [Hebban.nl](https://www.hebban.nl), annotated with