EuroEval 15.3.1.tar.gz → 15.4.1.tar.gz
This diff shows the changes between two publicly released versions of the package, as they appear in their public registry, and is provided for informational purposes only.
Potentially problematic release: this version of EuroEval has been flagged as potentially problematic by the registry.
- {euroeval-15.3.1 → euroeval-15.4.1}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +2 -2
- {euroeval-15.3.1 → euroeval-15.4.1}/.github/ISSUE_TEMPLATE/bug.yaml +5 -5
- {euroeval-15.3.1 → euroeval-15.4.1}/.github/ISSUE_TEMPLATE/feature_request.yaml +1 -1
- {euroeval-15.3.1 → euroeval-15.4.1}/.github/workflows/ci.yaml +10 -4
- {euroeval-15.3.1 → euroeval-15.4.1}/.pre-commit-config.yaml +2 -1
- {euroeval-15.3.1 → euroeval-15.4.1}/CHANGELOG.md +61 -1
- {euroeval-15.3.1 → euroeval-15.4.1}/PKG-INFO +22 -7
- {euroeval-15.3.1 → euroeval-15.4.1}/README.md +13 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/datasets/dutch.md +1 -1
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/datasets/faroese.md +2 -2
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/datasets/german.md +2 -2
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/datasets/icelandic.md +1 -1
- euroeval-15.4.1/docs/datasets/spanish.md +529 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/pyproject.toml +9 -7
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/__init__.py +11 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/benchmark_config_factory.py +2 -2
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/benchmark_modules/hf.py +2 -3
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/benchmark_modules/litellm.py +124 -2
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/benchmark_modules/vllm.py +33 -13
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/benchmarker.py +2 -2
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/constants.py +7 -1
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/data_loading.py +2 -1
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/dataset_configs.py +172 -1
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/generation.py +17 -3
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/task_utils/sequence_classification.py +27 -7
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/task_utils/token_classification.py +3 -9
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/utils.py +1 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_allocine.py +33 -3
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_arc.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_arc_is.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_belebele.py +11 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_cnn_dailymail.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_conll_en.py +10 -0
- euroeval-15.4.1/src/scripts/create_conll_es.py +115 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_conll_nl.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_dane.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_danish_citizen_tests.py +11 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_dansk.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_danske_talemaader.py +11 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_danske_talemaader_old.py +11 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_dbrd.py +33 -3
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_dutch_cola.py +33 -3
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_dutch_social.py +33 -3
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_eltec.py +14 -1
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_fone.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_foqa.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_fosent.py +33 -4
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_fquad.py +11 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_germanquad.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_germeval.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_hellaswag.py +12 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_hotter_and_colder_sentiment.py +36 -3
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_ice_linguistic.py +37 -3
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_icelandic_error_corpus.py +40 -6
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_icelandic_knowledge.py +14 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_icelandic_qa.py +12 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_icesum.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_ilpost_sum.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_jentoft.py +38 -4
- euroeval-15.4.1/src/scripts/create_mlqa_es.py +74 -0
- euroeval-15.3.1/src/scripts/create_mlsum.py → euroeval-15.4.1/src/scripts/create_mlsum_de.py +13 -3
- euroeval-15.4.1/src/scripts/create_mlsum_es.py +84 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_mmlu.py +12 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_multinerd-it.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_no_cola.py +38 -3
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_no_sammendrag.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_nor_common_sense_qa.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_nordjylland_news.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_norglm_multiqa.py +12 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_norglm_multisum.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_norne.py +11 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_norquad.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_nqii.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_nrk_quiz_qa.py +11 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_orange_sum.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_personal_sum.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_rrn.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_sb10k.py +33 -3
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_scala.py +43 -4
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_scandiqa.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_schibsted.py +10 -0
- euroeval-15.3.1/src/scripts/create_sentipolc16.py → euroeval-15.4.1/src/scripts/create_sentiment_headlines_es.py +24 -9
- euroeval-15.4.1/src/scripts/create_sentipolc16.py +106 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_squad.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_squad_it.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_squad_nl.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_squad_nl_old.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_sst5.py +30 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_suc3.py +11 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_swedn.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_swerec.py +12 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_wiki_lingua_nl.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_wikineural-it.py +10 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_winogrande_is.py +11 -0
- euroeval-15.4.1/src/scripts/create_xquad_es.py +80 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/fix_dot_env_file.py +5 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/load_ud_pos.py +26 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/versioning.py +5 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/conftest.py +6 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_benchmark_config_factory.py +8 -4
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_benchmarker.py +16 -2
- {euroeval-15.3.1 → euroeval-15.4.1}/uv.lock +858 -493
- {euroeval-15.3.1 → euroeval-15.4.1}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/.gitignore +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/CITATION.cff +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/CODE_OF_CONDUCT.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/CONTRIBUTING.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/Dockerfile.cuda +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/LICENSE +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/CNAME +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/README.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/datasets/README.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/datasets/danish.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/datasets/english.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/datasets/french.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/datasets/italian.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/datasets/norwegian.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/datasets/swedish.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/extras/radial_plotter.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/faq.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/gfx/favicon.png +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/leaderboards/Monolingual/danish.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/leaderboards/Monolingual/dutch.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/leaderboards/Monolingual/english.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/leaderboards/Monolingual/faroese.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/leaderboards/Monolingual/french.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/leaderboards/Monolingual/german.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/leaderboards/Monolingual/icelandic.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/leaderboards/Monolingual/italian.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/leaderboards/Monolingual/norwegian.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/leaderboards/Monolingual/swedish.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/leaderboards/Multilingual/european.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/leaderboards/Multilingual/germanic.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/leaderboards/Multilingual/romance.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/leaderboards/README.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/methodology.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/python-package.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/tasks/README.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/tasks/common-sense-reasoning.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/tasks/knowledge.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/tasks/linguistic-acceptability.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/tasks/named-entity-recognition.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/tasks/reading-comprehension.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/tasks/sentiment-classification.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/tasks/speed.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/docs/tasks/summarization.md +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/gfx/euroeval.png +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/gfx/euroeval.xcf +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/gfx/scandeval.png +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/makefile +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/mkdocs.yaml +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/benchmark_modules/__init__.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/benchmark_modules/base.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/benchmark_modules/fresh.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/callbacks.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/cli.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/data_models.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/enums.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/exceptions.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/finetuning.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/human_evaluation.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/languages.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/model_cache.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/model_config.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/model_loading.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/scores.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/speed_benchmark.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/task_utils/__init__.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/task_utils/multiple_choice_classification.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/task_utils/question_answering.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/task_utils/text_to_text.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/tasks.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/euroeval/types.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/constants.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_angry_tweets.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_mim_gold_ner.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_norec.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/src/scripts/create_wikiann_fo.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/__init__.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_benchmark_modules/__init__.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_benchmark_modules/test_base.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_benchmark_modules/test_fresh.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_benchmark_modules/test_hf.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_benchmark_modules/test_litellm.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_benchmark_modules/test_vllm.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_callbacks.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_cli.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_constants.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_data_loading.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_data_models.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_dataset_configs.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_enums.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_exceptions.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_finetuning.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_generation.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_human_evaluation.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_languages.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_model_cache.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_model_config.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_model_loading.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_scores.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_speed_benchmark.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_task_utils/__init__.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_task_utils/test_question_answering.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_task_utils/test_sequence_classification.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_task_utils/test_text_to_text.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_task_utils/test_token_classification.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_tasks.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_types.py +0 -0
- {euroeval-15.3.1 → euroeval-15.4.1}/tests/test_utils.py +0 -0

{euroeval-15.3.1 → euroeval-15.4.1}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml
@@ -1,5 +1,5 @@
 name: 📚 Benchmark Dataset Request
-description: Do you think a particular benchmark dataset is missing in
+description: Do you think a particular benchmark dataset is missing in EuroEval?
 title: "[BENCHMARK DATASET REQUEST] <dataset-name>"
 labels: "benchmark dataset request"
 
@@ -36,7 +36,7 @@ body:
 - type: textarea
 attributes:
 label: Describe the dataset
-description: Describe what the dataset is measuring, and why you think it is important to include it as a benchmark dataset in
+description: Describe what the dataset is measuring, and why you think it is important to include it as a benchmark dataset in EuroEval.
 validations:
 required: true
 - type: markdown

{euroeval-15.3.1 → euroeval-15.4.1}/.github/ISSUE_TEMPLATE/bug.yaml
@@ -1,5 +1,5 @@
 name: 🐛 Bug Report
-description: Have you experienced a bug using the `
+description: Have you experienced a bug using the `euroeval` package?
 title: "[BUG] <name-of-bug>"
 labels: bug
 
@@ -7,7 +7,7 @@ body:
 - type: markdown
 attributes:
 value: >
-#### Before submitting a bug, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/
+#### Before submitting a bug, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/EuroEval/EuroEval/issues?q=is%3Aissue).
 - type: textarea
 attributes:
 label: 🐛 Describe the bug
@@ -52,9 +52,9 @@ body:
 required: true
 - type: input
 attributes:
-label:
-description: What version of
-placeholder: Output of `pip list | grep
+label: EuroEval version
+description: What version of EuroEval are you using?
+placeholder: Output of `pip list | grep EuroEval`
 validations:
 required: true
 - type: markdown

{euroeval-15.3.1 → euroeval-15.4.1}/.github/workflows/ci.yaml
@@ -49,6 +49,9 @@ jobs:
 - name: Install Dependencies
 run: uv sync --no-dev --extra test
 
+- name: Start Ollama server
+run: curl -fsSL https://ollama.com/install.sh | sh
+
 - name: Test with pytest
 run: uv run pytest
 env:
@@ -57,8 +60,8 @@ jobs:
 OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
 ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
 
-- name: Delete
-run: rm -rf .
+- name: Delete EuroEval cache
+run: rm -rf .euroeval_cache
 
 pytest-macos:
 if: github.event.pull_request.draft == false && contains(github.event.pull_request.labels.*.name, 'macos')
@@ -78,6 +81,9 @@ jobs:
 - name: Install Dependencies
 run: uv sync --no-dev --extra test
 
+- name: Start Ollama server
+run: curl -fsSL https://ollama.com/install.sh | sh
+
 - name: Test with pytest
 run: uv run pytest
 env:
@@ -86,5 +92,5 @@ jobs:
 OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
 ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
 
-- name: Delete
-run: rm -rf .
+- name: Delete EuroEval cache
+run: rm -rf .euroeval_cache
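
The two new "Start Ollama server" steps above install the Ollama server via Ollama's official install script before the tests run (the v15.4.0 changelog below notes that EuroEval now downloads Ollama models itself when evaluating them). As a hedged local sketch of the same setup, using standard Ollama CLI commands rather than anything EuroEval-specific:

```shell
# Install the Ollama server (same command as the CI step above).
curl -fsSL https://ollama.com/install.sh | sh

# Sanity checks: confirm the client/server respond and list the models
# already available locally.
ollama --version
ollama list
```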

{euroeval-15.3.1 → euroeval-15.4.1}/.pre-commit-config.yaml
@@ -10,11 +10,12 @@ repos:
 - id: trailing-whitespace
 - id: debug-statements
 - repo: https://github.com/astral-sh/ruff-pre-commit
-rev: v0.
+rev: v0.11.2
 hooks:
 - id: ruff
 args:
 - --fix
+- --unsafe-fixes
 - --exit-non-zero-on-fix
 types_or:
 - python
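
The pre-commit change above bumps the Ruff hook to v0.11.2 and adds `--unsafe-fixes`, which lets Ruff apply the fixes it classifies as unsafe (they may change program behaviour). A rough local equivalent of what the hook now runs, assuming Ruff is installed on your machine:

```shell
# Lint and auto-fix, including fixes Ruff marks as unsafe, and exit non-zero
# if anything had to be fixed (mirroring the hook's arguments).
ruff check --fix --unsafe-fixes --exit-non-zero-on-fix .
```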

{euroeval-15.3.1 → euroeval-15.4.1}/CHANGELOG.md
@@ -10,9 +10,64 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 
 
 
+## [v15.4.1] - 2025-03-25
+### Fixed
+- Disallow `vllm` v0.8.1, as it causes severe degradation in generation output of
+some models, resulting in artificially low scores.
+- Fixed an issue with text classification tasks if the first token of multiple labels
+are identical, when tokenising with the model's tokeniser.
+
+
+## [v15.4.0] - 2025-03-24
+### Added
+- Added support for Spanish! 🇪🇸This includes two reading comprehension datasets:
+[XQuAD-es](https://huggingface.co/datasets/google/xquad/viewer/xquad.es) and
+[MLQA-es](https://huggingface.co/datasets/facebook/mlqa/viewer/mlqa.es.es),
+[SentimentHeadlines-es](https://huggingface.co/datasets/pysentimiento/spanish-targeted-sentiment-headlines),
+the linguistic acceptability dataset ScaLA with the [Spanish Universal
+Dependencies](https://github.com/UniversalDependencies/UD_Spanish-AnCora),
+[MLSum-es](https://huggingface.co/datasets/reciTAL/mlsum), the knowledge dataset
+[MMLU-es](https://hf.co/datasets/alexandrainst/m_mmlu), the common-sense reasoning
+dataset [HellaSwag-es](https://hf.co/datasets/alexandrainst/m_hellaswag), and the
+named entity recognition dataset [CoNLL-es](https://aclanthology.org/W02-2024/). This
+was contributed by [@oliverkinch](https://github.com/oliverkinch) ✨
+- Now extracts number of parameters and context length for Ollama models, using the
+`ollama` package. Vocabulary size is currently not available available in the `ollama`
+package, so this is not extracted for Ollama models. For this reason, the `ollama`
+package has been added to the core dependencies, as it is very small (~10 KB)
+- Now downloads Ollama models when evaluating them.
+
+### Fixed
+- When models output nested JSON dictionaries and structured generation isn't available,
+we use the inner-most dictionary. This caused issues with Anthropic models, since they
+do not support structured generation, and their output are always {"input": actual
+dictionary}. This has been fixed now.
+- Now handles `ReadTimeout`s when loading datasets, rather than aborting evaluations.
+- Benchmark configurations specified when calling `Benchmarker.benchmark` did not
+properly override the default configurations set during initialisation when
+benchmarking generative models. This has been fixed now.
+- Now sets the `VLLM_WORKER_MULTIPROC_METHOD` environment variable to `spawn`, to avoid
+a `RuntimeError` when using newer versions of vLLM with multiple GPUs.
+- Now also detects reasoning tokens specified in the prompt rather than in the
+completion, which is for instance the case for the QwQ reasoning model.
+- Now recognises models with the pipeline tags `image-text-to-text`,
+`audio-text-to-text` and `video-text-to-text` as generative models, which mistakenly
+were detected as encoder models before.
+
+### Changed
+- Update `vllm` to `>=0.8.0`, `transformers` to `>=4.50.0` and `torch` to `>=2.6.0`.
+- Moved the `demjson3` dependency from the `generative` extra to the main dependencies,
+to allow benchmarking API-based models without any extras.
+- Now does not include the speed benchmark by default, as it is not used in the official
+leaderboards. It can still be used by including `--task speed` when benchmarking a
+model, or by using the `task` argument if using the `Benchmarker` API.
+- Do not use sliding window sizes as candidates for maximum context length anymore, as
+this is no longer needed.
+
+
 ## [v15.3.1] - 2025-03-13
 ### Fixed
-- Now handles`ConnectionError`s when loading datasets, rather than aborting evaluations.
+- Now handles `ConnectionError`s when loading datasets, rather than aborting evaluations.
 
 
 ## [v15.3.0] - 2025-03-12
@@ -85,12 +140,14 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 
 
 ## [v15.1.0] - 2025-02-12
+
 ### Added
 - Added new `--only-allow-safetensors` flag, which disallows evaluating models from the
 Hugging Face Hub if they are not stored as safetensors. This ensures a high level of
 security on the system running the evaluations, if this is necessary. This was
 contributed by [@Mikeriess](https://github.com/Mikeriess) ✨
 
+
 ### Fixed
 - Regex mismatch caused the wrong sequence length for GPT-4o models. This has been fixed
 now.
@@ -104,6 +161,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 
 
 ## [v15.0.0] - 2025-02-02
+
 ### Added
 - Added support for evaluating generative reasoning models, such as OpenAI o1 and
 Deepseek R1. This is done by upping the maximal sequence length to 8,192 tokens, and
@@ -150,6 +208,8 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 
 
 ## [v14.4.0] - 2025-01-22
+
+### Added
 - Added support for French! 🇫🇷This includes the sentiment classification dataset
 [Allocine](https://hf.co/datasets/tblard/allocine), the linguistic acceptability
 dataset ScaLA with the [French Universal
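
Two of the v15.4.0 changes above are directly user-facing: the speed benchmark is no longer part of the default task set, and Ollama models are now downloaded automatically when evaluated. Based only on the changelog wording, a sketch of opting back into the speed task from the CLI, where `<model-id>` is a placeholder for the model you want to benchmark:

```shell
# The speed benchmark is opt-in as of v15.4.0; request it explicitly.
euroeval --model <model-id> --task speed
```

When using the Python API instead, the changelog points to the `task` argument of the `Benchmarker` class for the same purpose.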

{euroeval-15.3.1 → euroeval-15.4.1}/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: EuroEval
-Version: 15.
+Version: 15.4.1
 Summary: The robust European language model benchmark.
 Project-URL: Repository, https://github.com/EuroEval/EuroEval
 Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -33,12 +33,14 @@ Requires-Dist: accelerate>=0.34.2
 Requires-Dist: bert-score>=0.3.13
 Requires-Dist: click>=8.1.3
 Requires-Dist: datasets>=2.15.0
+Requires-Dist: demjson3>=3.0.6
 Requires-Dist: evaluate>=0.4.1
 Requires-Dist: huggingface-hub>=0.24.0
 Requires-Dist: levenshtein>=0.24.0
 Requires-Dist: litellm>=1.61.13
 Requires-Dist: more-itertools>=10.5.0
 Requires-Dist: numpy<2.0.0,>=1.23.0
+Requires-Dist: ollama>=0.4.7
 Requires-Dist: pandas>=2.2.0
 Requires-Dist: protobuf~=3.20.0
 Requires-Dist: pydantic>=2.6.0
@@ -52,19 +54,19 @@ Requires-Dist: seqeval>=1.2.2
 Requires-Dist: setuptools>=75.8.2
 Requires-Dist: tenacity>=9.0.0
 Requires-Dist: termcolor>=2.0.0
-Requires-Dist: torch>=2.
-Requires-Dist: transformers>=4.
+Requires-Dist: torch>=2.6.0
+Requires-Dist: transformers>=4.50.0
 Provides-Extra: all
 Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'all'
-Requires-Dist: demjson3>=3.0.6; extra == 'all'
 Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
 Requires-Dist: gradio>=4.26.0; extra == 'all'
-Requires-Dist:
+Requires-Dist: outlines>=0.1.11; extra == 'all'
+Requires-Dist: vllm!=0.8.1,>=0.8.0; (platform_system == 'Linux') and extra == 'all'
 Provides-Extra: generative
 Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
-Requires-Dist: demjson3>=3.0.6; extra == 'generative'
 Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
-Requires-Dist:
+Requires-Dist: outlines>=0.1.11; extra == 'generative'
+Requires-Dist: vllm!=0.8.1,>=0.8.0; (platform_system == 'Linux') and extra == 'generative'
 Provides-Extra: human-evaluation
 Requires-Dist: gradio>=4.26.0; extra == 'human-evaluation'
 Provides-Extra: test
@@ -202,6 +204,19 @@ argument. This could for instance be `--model <model-id> --task
 sentiment-classification`.
 
 
+### Reproducing the datasets
+All datasets used in this project are generated using the scripts located in the [src/scripts](src/scripts) folder. To reproduce a dataset, run the corresponding script with the following command
+
+```shell
+$ uv run src/scripts/<name-of-script>.py
+```
+
+Replace <name-of-script> with the specific script you wish to execute, e.g.,
+
+```shell
+$ uv run src/scripts/create_allocine.py
+```
+
 ## Special Thanks :pray:
 - Thanks [@Mikeriess](https://github.com/Mikeriess) for evaluating many of the larger
 models on the leaderboards.
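
The dependency changes above move `demjson3` and `ollama` into the core requirements and pin `vllm!=0.8.1,>=0.8.0` together with `outlines>=0.1.11` in the `all` and `generative` extras. A hedged install sketch reflecting those declarations (extra names taken from the `Provides-Extra` fields above; the vllm, bitsandbytes and fbgemm-gpu pins only apply on Linux):

```shell
# Core install: now pulls in demjson3 and ollama, enough for API-based models.
pip install euroeval

# Generative extra: adds outlines and the vllm!=0.8.1,>=0.8.0 constraint.
pip install "euroeval[generative]"
```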

{euroeval-15.3.1 → euroeval-15.4.1}/README.md
@@ -129,6 +129,19 @@ argument. This could for instance be `--model <model-id> --task
 sentiment-classification`.
 
 
+### Reproducing the datasets
+All datasets used in this project are generated using the scripts located in the [src/scripts](src/scripts) folder. To reproduce a dataset, run the corresponding script with the following command
+
+```shell
+$ uv run src/scripts/<name-of-script>.py
+```
+
+Replace <name-of-script> with the specific script you wish to execute, e.g.,
+
+```shell
+$ uv run src/scripts/create_allocine.py
+```
+
 ## Special Thanks :pray:
 - Thanks [@Mikeriess](https://github.com/Mikeriess) for evaluating many of the larger
 models on the leaderboards.

{euroeval-15.3.1 → euroeval-15.4.1}/docs/datasets/dutch.md
@@ -75,7 +75,7 @@ and features Dutch book reviews from [Hebban.nl](https://www.hebban.nl), annotat
 sentiment labels, written by the users of the website.
 
 The original full dataset consists of 20,000 / 2,200 samples for training and testing,
-respectively. We use a 1,
+respectively. We use a 1,014 / 253 / 2,014 split for training, validation and testing,
 respectively (so 3,328 samples used in total). The training and testing splits are
 subsets of the original splits, and the validation split is a disjoint subset of the
 original training split.

{euroeval-15.3.1 → euroeval-15.4.1}/docs/datasets/faroese.md
@@ -17,8 +17,8 @@ labels were manually annotated by two native speakers.
 The original full dataset consists of 245 samples, which consisted of both a news
 article, a chosen sentence from the article, and the sentiment label. We use both the
 news article and the chosen sentence as two separate samples, to increase the size of
-the dataset (keeping them within the same dataset split). In total, we use a
-
+the dataset (keeping them within the same dataset split). In total, we use a 72 / 40 /
+279 split for training, validation and testing, respectively.
 
 Here are a few examples from the training split:
 

{euroeval-15.3.1 → euroeval-15.4.1}/docs/datasets/german.md
@@ -485,7 +485,7 @@ $ euroeval --model <model-id> --dataset hellaswag-de
 
 ## Summarization
 
-### MLSum
+### MLSum-de
 
 This dataset was published in [this
 paper](https://aclanthology.org/2020.emnlp-main.647/) and features news articles and
@@ -541,5 +541,5 @@ When evaluating generative models, we use the following setup (see the
 You can evaluate this dataset directly as follows:
 
 ```bash
-$ euroeval --model <model-id> --dataset mlsum
+$ euroeval --model <model-id> --dataset mlsum-de
 ```

{euroeval-15.3.1 → euroeval-15.4.1}/docs/datasets/icelandic.md
@@ -13,7 +13,7 @@ This dataset is being published in an upcoming paper, and consists of texts from
 Icelandic blog post, annotated with sentiment labels (and many others) via a
 crowdsourcing platform.
 
-The original full dataset consists of 2,901 samples, and we use a 1,
+The original full dataset consists of 2,901 samples, and we use a 1,021 / 255 / 1,607
 split for training, validation and testing, respectively (so all samples are used in
 total).
 