PyPI - EuroEval - Versions diffs - 16.3.0__tar.gz → 16.4.0__tar.gz - Mend

EuroEval 16.3.0__tar.gz → 16.4.0__tar.gz


Potentially problematic release.

This version of EuroEval might be problematic.

Files changed (313)

{euroeval-16.3.0 → euroeval-16.4.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml RENAMED

@@ -24,6 +24,7 @@ body:
     label: Dataset languages
     description: What languages is the dataset in?
     options:
+      - label: Czech
       - label: Danish
       - label: Dutch
       - label: English
@@ -39,6 +40,7 @@ body:
       - label: Norwegian (Bokmål or Nynorsk)
       - label: Polish
       - label: Portuguese
+      - label: Slovak
       - label: Spanish
       - label: Swedish
   validations:

{euroeval-16.3.0 → euroeval-16.4.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml RENAMED

@@ -18,12 +18,12 @@ body:
       What languages should this model be evaluated on? Tick all that apply. If the
       model is multilingual (e.g., Mistral, Llama), then tick all the languages.
     options:
+      - label: Baltic languages (Latvian, Lithuanian)
+      - label: Finnic languages (Estonian, Finnish)
       - label: Romance languages (French, Italian, Portuguese, Spanish)
       - label: Scandinavian languages (Danish, Faroese, Icelandic, Norwegian, Swedish)
+      - label: Slavic languages (Czech, Polish, Slovak)
       - label: West Germanic languages (Dutch, English, German)
-      - label: Finnic languages (Estonian, Finnish)
-      - label: Baltic languages (Latvian, Lithuanian)
-      - label: Polish
   validations:
     required: true
 - type: dropdown

{euroeval-16.3.0 → euroeval-16.4.0}/.pre-commit-config.yaml RENAMED

@@ -10,7 +10,7 @@ repos:
       - id: trailing-whitespace
       - id: debug-statements
 -   repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.13.1
+    rev: v0.14.1
     hooks:
       - id: ruff
         args:

{euroeval-16.3.0 → euroeval-16.4.0}/CHANGELOG.md RENAMED

@@ -7,15 +7,78 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 ## [Unreleased]
+## [v16.4.0] - 2025-10-21
+### Added
+- Added support for Slovak 🇸🇰! This includes the sentiment classification dataset
+  CSFD-sentiment-sk, the linguistic acceptability dataset ScaLA-sk, the named entity
+  recognition dataset UNER-sk, the reading comprehension dataset MultiWikiQA-sk, the
+  multiple-choice classification dataset MMLU-sk, and the common-sense reasoning dataset
+  Winogrande-sk. This was contributed by @oliverkinch ✨
+- Added support for Czech 🇨🇿! This includes the sentiment classification dataset
+  CSFD-sentiment, the linguistic acceptability dataset ScaLA-cs, the linguistic
+  acceptability dataset CS-GEC, the named entity recognition dataset PONER, the reading
+  comprehension dataset SQAD, the summarization dataset Czech News, the common-sense
+  reasoning dataset HellaSwag-cs, and the knowledge dataset Umimeto-qa. This was
+  contributed by @oliverkinch ✨
+- Added the Lithuanian summarisation dataset Lrytas based on the Lithuanian
+  public media news portal [Lrytas.lt](https://www.lrytas.lt/). This was contributed by
+  @oliverkinch ✨
+- Added the Estonian translation of MMLU, `mmlu-et`, as an unofficial knowledge
+  dataset.
+### Changed
+- Updated vLLM to `>=0.11.0`, which features several breaking changes, so we had to
+  force the minimum version. This also features support for multiple new models, such as
+  Qwen3-Next and OLMo3.
+- Now uses MultiWikiQA-da and MultiWikiQA-sv as the official Danish and Swedish reading
+  comprehension datasets, respectively, as the quality is substantially better than
+  ScandiQA-da and ScandiQA-sv.
+- Used 128 of the test samples from the Winogrande datasets for validation, as we
+  previously did not use a validation split. This is done for all languages except
+  Icelandic and Estonian, as these are manually translated and corrected splits from a
+  different source. Most of these are unofficial datasets and thus won't affect the
+  leaderboard rankings. The only languages for which these are official are Lithuanian
+  and Polish, which do not have official leaderboards yet - so no leaderboards are
+  affected by this change.
+- In the same vein as the above, we now use 32 samples for validation for the Lithuanian
+  LT-history dataset and the Swedish Skolprov dataset.
+- Changed logging styling.
+### Fixed
+- If a generative model consistently does not adhere to a given JSON schema, we disable
+  structured generation for that model. This was triggered by Claude models not
+  supporting Literal types in JSON schemas.
+- Removed "e" options from the Skolprov multiple-choice dataset, as this inconsistency
+  in number of options caused issues when evaluating models on it.
+- Fixed an issue where an uninformative logging message was shown when a model
+  configuration could not be loaded from the Hugging Face Hub, when the model was gated.
+  We now show that this is due to the gatedness, indicating that the user should log in
+  or provide a Hugging Face Hub access token to evaluate the model.
+- Now caches functions related to loading repo info or fetching model configs from the
+  Hugging Face Hub, to avoid repeated calls to the Hub, resulting in rate limits.
+- When running an evaluation that required the test split (e.g., European values
+  evaluation) as the last benchmark for a given model, then subsequent models would
+  continue to be evaluated on the test split, even if the user requested to use the
+  validation split. We now reset this not just after each dataset, but also after each
+  model, so that this does not happen.
+- Now catches more errors when evaluating LiteLLM models, which were related to some
+  generation parameters not being supported (such as stop sequences) for some models.
+- We now clean up metric writers when we're done with them, which prevents a "too many
+  open files" error when evaluating many models and datasets in a single run.
 ## [v16.3.0] - 2025-09-23
 ### Added
 - Added support for Lithuanian 🇱🇹! This includes the sentiment classification dataset
-  Lithuanian Emotions, the linguistic acceptability dataset ScaLA-lt, the reading
-  comprehension dataset MultiWikiQA-lt, the named entity recognition dataset WikiANN-lt,
-  the the history knowledge dataset LT-History, and the common-sense reasoning dataset
-  Winogrande-lt. This was contributed by @oliverkinch ✨
+  Lithuanian Emotions, the linguistic acceptability dataset ScaLA-lt (unofficial), the
+  reading comprehension dataset MultiWikiQA-lt, the named entity recognition dataset
+  WikiANN-lt, the the history knowledge dataset LT-History, and the common-sense
+  reasoning dataset Winogrande-lt. This was contributed by @oliverkinch ✨
 - Added "slow-tokenizer" model parameter, which can be used to force the use of a slow
   tokenizer when loading it. Use this by replacing your model ID with
   `<model-id>#slow-tokenizer`.

{euroeval-16.3.0 → euroeval-16.4.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: EuroEval
-Version: 16.3.0
+Version: 16.4.0
 Summary: The robust European language model benchmark.
 Project-URL: Repository, https://github.com/EuroEval/EuroEval
 Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -62,12 +62,12 @@ Provides-Extra: all
 Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'all'
 Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
 Requires-Dist: timm>=1.0.19; extra == 'all'
-Requires-Dist: vllm[flashinfer]<0.11.0,>=0.10.1; (platform_system == 'Linux') and extra == 'all'
+Requires-Dist: vllm[flashinfer]>=0.11.0; (platform_system == 'Linux') and extra == 'all'
 Provides-Extra: generative
 Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
 Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
 Requires-Dist: timm>=1.0.19; extra == 'generative'
-Requires-Dist: vllm[flashinfer]<0.11.0,>=0.10.1; (platform_system == 'Linux') and extra == 'generative'
+Requires-Dist: vllm[flashinfer]>=0.11.0; (platform_system == 'Linux') and extra == 'generative'
 Description-Content-Type: text/markdown
 <!-- This disables the requirement that the first line is a top-level heading -->
@@ -92,7 +92,7 @@ ______________________________________________________________________
 [![Second paper](https://img.shields.io/badge/arXiv-2406.13469-b31b1b.svg)](https://arxiv.org/abs/2406.13469)
 [![License](https://img.shields.io/github/license/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/blob/main/LICENSE)
 [![LastCommit](https://img.shields.io/github/last-commit/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/commits/main)
-[![Code Coverage](https://img.shields.io/badge/Coverage-67%25-yellow.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
+[![Code Coverage](https://img.shields.io/badge/Coverage-70%25-yellow.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
 [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg)](https://github.com/EuroEval/EuroEval/blob/main/CODE_OF_CONDUCT.md)
 ## Maintainer
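
Note on the `vllm[flashinfer]` requirement change above: the old specifier `<0.11.0,>=0.10.1` and the new specifier `>=0.11.0` accept disjoint sets of versions. The sketch below checks a few candidate versions against both ranges. It assumes plain dotted release numbers only and does not implement full PEP 440 matching (no pre-releases, post-releases or local versions). The helper names and the candidate version strings are hypothetical and chosen purely for illustration.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn a plain dotted release such as '0.10.2' into a comparable tuple."""
    parts = [int(part) for part in version.split(".")]
    parts += [0] * (3 - len(parts))  # Pad so that '0.11' compares like '0.11.0'.
    return tuple(parts)


def allowed_by_16_3_0(version: str) -> bool:
    """Old constraint from the diff: vllm>=0.10.1,<0.11.0."""
    return parse("0.10.1") <= parse(version) < parse("0.11.0")


def allowed_by_16_4_0(version: str) -> bool:
    """New constraint from the diff: vllm>=0.11.0."""
    return parse(version) >= parse("0.11.0")


# Candidate version strings are arbitrary examples, not a list of real releases.
for candidate in ["0.10.0", "0.10.2", "0.11.0", "0.12.0"]:
    print(candidate, allowed_by_16_3_0(candidate), allowed_by_16_4_0(candidate))
```

No version satisfies both ranges, so an environment pinned to a vLLM release that 16.3.0 accepted has to upgrade vLLM when moving to 16.4.0.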

{euroeval-16.3.0 → euroeval-16.4.0}/README.md RENAMED

@@ -20,7 +20,7 @@ ______________________________________________________________________
 [![Second paper](https://img.shields.io/badge/arXiv-2406.13469-b31b1b.svg)](https://arxiv.org/abs/2406.13469)
 [![License](https://img.shields.io/github/license/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/blob/main/LICENSE)
 [![LastCommit](https://img.shields.io/github/last-commit/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/commits/main)
-[![Code Coverage](https://img.shields.io/badge/Coverage-67%25-yellow.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
+[![Code Coverage](https://img.shields.io/badge/Coverage-70%25-yellow.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
 [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg)](https://github.com/EuroEval/EuroEval/blob/main/CODE_OF_CONDUCT.md)
 ## Maintainer
