EuroEval 15.9.2__tar.gz → 15.10.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release.
This version of EuroEval might be problematic. Click here for more details.
- {euroeval-15.9.2 → euroeval-15.10.1}/.pre-commit-config.yaml +2 -2
- {euroeval-15.9.2 → euroeval-15.10.1}/CHANGELOG.md +32 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/PKG-INFO +7 -8
- {euroeval-15.9.2 → euroeval-15.10.1}/README.md +1 -1
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/README.md +1 -1
- euroeval-15.10.1/docs/leaderboards/Monolingual/finnish.md +15 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/pyproject.toml +7 -7
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/benchmark_modules/hf.py +3 -3
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/benchmark_modules/litellm.py +158 -122
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/benchmark_modules/vllm.py +47 -143
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/data_loading.py +8 -2
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/finetuning.py +22 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/task_group_utils/multiple_choice_classification.py +11 -1
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/task_group_utils/question_answering.py +14 -4
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/tokenization_utils.py +103 -9
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/utils.py +13 -8
- {euroeval-15.9.2 → euroeval-15.10.1}/uv.lock +1754 -1758
- {euroeval-15.9.2 → euroeval-15.10.1}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/.github/workflows/ci.yaml +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/.gitignore +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/CITATION.cff +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/CODE_OF_CONDUCT.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/CONTRIBUTING.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/Dockerfile.cuda +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/LICENSE +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/NEW_DATASET_GUIDE.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/CNAME +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/datasets/README.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/datasets/danish.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/datasets/dutch.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/datasets/english.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/datasets/faroese.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/datasets/finnish.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/datasets/french.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/datasets/german.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/datasets/icelandic.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/datasets/italian.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/datasets/norwegian.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/datasets/spanish.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/datasets/swedish.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/extras/radial_plotter.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/faq.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/gfx/favicon.png +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/leaderboards/Monolingual/danish.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/leaderboards/Monolingual/dutch.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/leaderboards/Monolingual/english.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/leaderboards/Monolingual/faroese.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/leaderboards/Monolingual/french.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/leaderboards/Monolingual/german.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/leaderboards/Monolingual/icelandic.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/leaderboards/Monolingual/italian.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/leaderboards/Monolingual/norwegian.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/leaderboards/Monolingual/spanish.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/leaderboards/Monolingual/swedish.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/leaderboards/Multilingual/european.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/leaderboards/Multilingual/germanic.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/leaderboards/Multilingual/romance.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/leaderboards/README.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/methodology.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/python-package.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/tasks/README.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/tasks/common-sense-reasoning.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/tasks/knowledge.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/tasks/linguistic-acceptability.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/tasks/named-entity-recognition.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/tasks/reading-comprehension.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/tasks/sentiment-classification.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/tasks/speed.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/docs/tasks/summarization.md +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/gfx/euroeval.png +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/gfx/euroeval.xcf +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/gfx/scandeval.png +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/makefile +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/mkdocs.yaml +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/__init__.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/benchmark_config_factory.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/benchmark_modules/__init__.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/benchmark_modules/base.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/benchmark_modules/fresh.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/benchmarker.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/callbacks.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/cli.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/constants.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/data_models.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/dataset_configs/__init__.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/dataset_configs/danish.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/dataset_configs/dutch.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/dataset_configs/english.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/dataset_configs/faroese.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/dataset_configs/finnish.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/dataset_configs/french.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/dataset_configs/german.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/dataset_configs/icelandic.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/dataset_configs/italian.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/dataset_configs/norwegian.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/dataset_configs/spanish.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/dataset_configs/swedish.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/enums.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/exceptions.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/generation.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/generation_utils.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/human_evaluation.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/languages.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/model_cache.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/model_config.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/model_loading.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/prompt_templates/__init__.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/prompt_templates/linguistic_acceptability.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/prompt_templates/multiple_choice.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/prompt_templates/named_entity_recognition.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/prompt_templates/reading_comprehension.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/prompt_templates/sentiment_classification.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/prompt_templates/summarization.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/scores.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/speed_benchmark.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/task_group_utils/__init__.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/task_group_utils/sequence_classification.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/task_group_utils/text_to_text.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/task_group_utils/token_classification.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/tasks.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/euroeval/types.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/constants.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_allocine.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_angry_tweets.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_arc.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_arc_is.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_belebele.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_cnn_dailymail.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_conll_en.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_conll_es.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_conll_nl.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_dane.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_danish_citizen_tests.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_dansk.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_danske_talemaader.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_danske_talemaader_old.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_dbrd.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_dutch_cola.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_eltec.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_fone.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_foqa.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_fosent.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_fquad.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_germanquad.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_germeval.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_hellaswag.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_hellaswag_fi.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_ice_linguistic.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_icelandic_error_corpus.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_icelandic_knowledge.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_icelandic_qa.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_icesum.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_ilpost_sum.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_jentoft.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_mim_gold_ner.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_mlqa_es.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_mlsum_de.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_mlsum_es.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_mmlu.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_multinerd-it.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_no_cola.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_no_sammendrag.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_nor_common_sense_qa.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_nordjylland_news.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_norec.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_norglm_multiqa.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_norglm_multisum.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_norne.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_norquad.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_nqii.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_nrk_quiz_qa.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_orange_sum.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_personal_sum.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_rrn.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_sb10k.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_scala.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_scandiqa.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_scandisent_fi.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_schibsted.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_sentiment_headlines_es.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_sentipolc16.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_squad.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_squad_it.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_squad_nl.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_squad_nl_old.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_sst5.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_suc3.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_swedn.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_swerec.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_turku_ner_fi.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_tydiqa_fi.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_wiki_lingua_nl.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_wikiann_fo.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_wikineural-it.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_winogrande_is.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_xlsum_fi.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/create_xquad_es.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/fix_dot_env_file.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/load_ud_pos.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/src/scripts/versioning.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/__init__.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/conftest.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_benchmark_config_factory.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_benchmark_modules/__init__.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_benchmark_modules/test_base.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_benchmark_modules/test_fresh.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_benchmark_modules/test_hf.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_benchmark_modules/test_litellm.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_benchmark_modules/test_vllm.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_benchmarker.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_callbacks.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_cli.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_constants.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_data_loading.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_data_models.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_dataset_configs.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_enums.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_exceptions.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_finetuning.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_generation.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_human_evaluation.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_languages.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_model_cache.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_model_config.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_model_loading.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_scores.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_speed_benchmark.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_task_utils/__init__.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_task_utils/test_question_answering.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_task_utils/test_sequence_classification.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_task_utils/test_text_to_text.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_task_utils/test_token_classification.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_tasks.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_tokenization_utils.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_types.py +0 -0
- {euroeval-15.9.2 → euroeval-15.10.1}/tests/test_utils.py +0 -0
|
@@ -10,7 +10,7 @@ repos:
|
|
|
10
10
|
- id: trailing-whitespace
|
|
11
11
|
- id: debug-statements
|
|
12
12
|
- repo: https://github.com/astral-sh/ruff-pre-commit
|
|
13
|
-
rev: v0.
|
|
13
|
+
rev: v0.12.0
|
|
14
14
|
hooks:
|
|
15
15
|
- id: ruff
|
|
16
16
|
args:
|
|
@@ -31,7 +31,7 @@ repos:
|
|
|
31
31
|
hooks:
|
|
32
32
|
- id: nbstripout
|
|
33
33
|
- repo: https://github.com/pre-commit/mirrors-mypy
|
|
34
|
-
rev: v1.16.
|
|
34
|
+
rev: v1.16.1
|
|
35
35
|
hooks:
|
|
36
36
|
- id: mypy
|
|
37
37
|
args:
|
|
@@ -10,6 +10,38 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
|
|
|
10
10
|
|
|
11
11
|
|
|
12
12
|
|
|
13
|
+
## [v15.10.1] - 2025-06-20
|
|
14
|
+
### Fixed
|
|
15
|
+
- Fixed an issue when benchmarking encoder models on reading comprehension tasks, where
|
|
16
|
+
we sometimes would truncate the model outputs when they should not have been.
|
|
17
|
+
|
|
18
|
+
|
|
19
|
+
## [v15.10.0] - 2025-06-17
|
|
20
|
+
### Changed
|
|
21
|
+
- Updated `vllm` to `>=0.9.1`.
|
|
22
|
+
- Updated `litellm` to `>=1.72.2`.
|
|
23
|
+
- Updated `ollama` to `>=0.5.1`.
|
|
24
|
+
Better detection of instruction-tuned models.
|
|
25
|
+
|
|
26
|
+
### Fixed
|
|
27
|
+
- Fixed an issue where the EOS token would be included in the vLLM generation output,
|
|
28
|
+
leading to incorrect evaluation results. We now manually remove all stop tokens from
|
|
29
|
+
the generation output, which fixes this issue.
|
|
30
|
+
- Now correctly detects reasoning models for Ollama models and enables their new "think"
|
|
31
|
+
parameter whenever a reasoning model is detected.
|
|
32
|
+
- Added a cap on the number of concurrent connections when evaluating API models, to
|
|
33
|
+
avoid running into errors related to too many open file descriptors. In case this
|
|
34
|
+
error _still_ occurs, we now give the user an informative error message on how to
|
|
35
|
+
increase the maximum number of open file descriptors on their system.
|
|
36
|
+
- Catch requests.ConnectionError when loading datasets.
|
|
37
|
+
- When benchmarking encoder models on reading comprehension tasks, we allow the model
|
|
38
|
+
outputs to have more than two elements (start and end position logits), where we
|
|
39
|
+
instead just use the first two elements and ignore the rest.
|
|
40
|
+
- When an encoder model outputs additional tensors aside from the logits, we now remove
|
|
41
|
+
these tensors from the output dictionary via the `preprocess_logits_for_metrics`
|
|
42
|
+
argument to `Trainer`.
|
|
43
|
+
|
|
44
|
+
|
|
13
45
|
## [v15.9.2] - 2025-06-04
|
|
14
46
|
### Fixed
|
|
15
47
|
- Allow a model to not have any BOS and EOS tokens.
|
|
@@ -1,11 +1,11 @@
|
|
|
1
1
|
Metadata-Version: 2.4
|
|
2
2
|
Name: EuroEval
|
|
3
|
-
Version: 15.
|
|
3
|
+
Version: 15.10.1
|
|
4
4
|
Summary: The robust European language model benchmark.
|
|
5
5
|
Project-URL: Repository, https://github.com/EuroEval/EuroEval
|
|
6
6
|
Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
|
|
7
7
|
Author-email: Dan Saattrup Nielsen <dan.nielsen@alexandra.dk>
|
|
8
|
-
Maintainer-email: Dan Saattrup Nielsen <dan.nielsen@alexandra.dk
|
|
8
|
+
Maintainer-email: Dan Saattrup Nielsen <dan.nielsen@alexandra.dk>
|
|
9
9
|
License: MIT License
|
|
10
10
|
|
|
11
11
|
Copyright (c) 2022-2024 Dan Saattrup Nielsen
|
|
@@ -37,13 +37,12 @@ Requires-Dist: demjson3>=3.0.6
|
|
|
37
37
|
Requires-Dist: evaluate>=0.4.1
|
|
38
38
|
Requires-Dist: huggingface-hub>=0.30.1
|
|
39
39
|
Requires-Dist: levenshtein>=0.24.0
|
|
40
|
-
Requires-Dist: litellm>=1.
|
|
40
|
+
Requires-Dist: litellm>=1.72.2
|
|
41
41
|
Requires-Dist: more-itertools>=10.5.0
|
|
42
42
|
Requires-Dist: numpy<2.0.0,>=1.23.0
|
|
43
|
-
Requires-Dist: ollama>=0.
|
|
43
|
+
Requires-Dist: ollama>=0.5.1
|
|
44
44
|
Requires-Dist: pandas>=2.2.0
|
|
45
45
|
Requires-Dist: peft>=0.15.0
|
|
46
|
-
Requires-Dist: protobuf~=3.20.0
|
|
47
46
|
Requires-Dist: pydantic>=2.6.0
|
|
48
47
|
Requires-Dist: pyinfer>=0.0.3
|
|
49
48
|
Requires-Dist: python-dotenv>=1.0.1
|
|
@@ -62,12 +61,12 @@ Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == '
|
|
|
62
61
|
Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
|
|
63
62
|
Requires-Dist: gradio>=4.26.0; extra == 'all'
|
|
64
63
|
Requires-Dist: outlines>=0.1.11; extra == 'all'
|
|
65
|
-
Requires-Dist: vllm>=0.9.
|
|
64
|
+
Requires-Dist: vllm>=0.9.1; (platform_system == 'Linux') and extra == 'all'
|
|
66
65
|
Provides-Extra: generative
|
|
67
66
|
Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
|
|
68
67
|
Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
|
|
69
68
|
Requires-Dist: outlines>=0.1.11; extra == 'generative'
|
|
70
|
-
Requires-Dist: vllm>=0.9.
|
|
69
|
+
Requires-Dist: vllm>=0.9.1; (platform_system == 'Linux') and extra == 'generative'
|
|
71
70
|
Provides-Extra: human-evaluation
|
|
72
71
|
Requires-Dist: gradio>=4.26.0; extra == 'human-evaluation'
|
|
73
72
|
Provides-Extra: test
|
|
@@ -93,7 +92,7 @@ ______________________________________________________________________
|
|
|
93
92
|
[](https://github.com/EuroEval/EuroEval/blob/main/CODE_OF_CONDUCT.md)
|
|
94
93
|
|
|
95
94
|
|
|
96
|
-
##
|
|
95
|
+
## Maintainer
|
|
97
96
|
|
|
98
97
|
- Dan Saattrup Nielsen ([@saattrupdan](https://github.com/saattrupdan),
|
|
99
98
|
dan.nielsen@alexandra.dk)
|
|
@@ -17,7 +17,7 @@ ______________________________________________________________________
|
|
|
17
17
|
[](https://github.com/EuroEval/EuroEval/blob/main/CODE_OF_CONDUCT.md)
|
|
18
18
|
|
|
19
19
|
|
|
20
|
-
##
|
|
20
|
+
## Maintainer
|
|
21
21
|
|
|
22
22
|
- Dan Saattrup Nielsen ([@saattrupdan](https://github.com/saattrupdan),
|
|
23
23
|
dan.nielsen@alexandra.dk)
|
|
@@ -29,7 +29,7 @@ or [LM Studio](https://lmstudio.ai/).
|
|
|
29
29
|
The idea of EuroEval grew out of the development of Danish language model RøBÆRTa in
|
|
30
30
|
2021, when we realised that there was no standard way to evaluate Danish language
|
|
31
31
|
models. It started as a hobby project including Danish, Swedish and Norwegian, but has
|
|
32
|
-
since grown to include
|
|
32
|
+
since grown to include 12+ European languages.
|
|
33
33
|
|
|
34
34
|
EuroEval is maintained by [Dan Saattrup Nielsen](https://www.saattrupdan.com/) from the
|
|
35
35
|
[Alexandra Institute](https://alexandra.dk), and is funded by the EU project
|
|
@@ -0,0 +1,15 @@
|
|
|
1
|
+
---
|
|
2
|
+
hide:
|
|
3
|
+
- toc
|
|
4
|
+
---
|
|
5
|
+
# 🇫🇮 Finnish
|
|
6
|
+
|
|
7
|
+
See the [leaderboard page](/leaderboards) for more information about all the columns.
|
|
8
|
+
|
|
9
|
+
/// tab | Generative Leaderboard
|
|
10
|
+
<iframe title="" aria-label="Table" id="datawrapper-chart-ubHSy" src="https://datawrapper.dwcdn.net/ubHSy" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important; border: none;" height="847" data-external="1"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(a){if(void 0!==a.data["datawrapper-height"]){var e=document.querySelectorAll("iframe");for(var t in a.data["datawrapper-height"])for(var r,i=0;r=e[i];i++)if(r.contentWindow===a.source){var d=a.data["datawrapper-height"][t]+"px";r.style.height=d}}}))}();</script>
|
|
11
|
+
///
|
|
12
|
+
|
|
13
|
+
/// tab | NLU Leaderboard
|
|
14
|
+
<iframe title="" aria-label="Table" id="datawrapper-chart-qVbA3" src="https://datawrapper.dwcdn.net/qVbA3/1/" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important; border: none;" height="818" data-external="1"></iframe><script type="text/javascript">!function(){"use strict";window.addEventListener("message",(function(a){if(void 0!==a.data["datawrapper-height"]){var e=document.querySelectorAll("iframe");for(var t in a.data["datawrapper-height"])for(var r,i=0;r=e[i];i++)if(r.contentWindow===a.source){var d=a.data["datawrapper-height"][t]+"px";r.style.height=d}}}))}();</script>
|
|
15
|
+
///
|
|
@@ -1,6 +1,6 @@
|
|
|
1
1
|
[project]
|
|
2
2
|
name = "EuroEval"
|
|
3
|
-
version = "15.
|
|
3
|
+
version = "15.10.1"
|
|
4
4
|
description = "The robust European language model benchmark."
|
|
5
5
|
readme = "README.md"
|
|
6
6
|
authors = [
|
|
@@ -8,7 +8,6 @@ authors = [
|
|
|
8
8
|
]
|
|
9
9
|
maintainers = [
|
|
10
10
|
{name = "Dan Saattrup Nielsen", email = "dan.nielsen@alexandra.dk"},
|
|
11
|
-
{name = "Kenneth Enevoldsen", email = "kenneth.enevoldsen@cas.au.dk"},
|
|
12
11
|
]
|
|
13
12
|
requires-python = ">=3.10,<4.0"
|
|
14
13
|
dependencies = [
|
|
@@ -27,18 +26,17 @@ dependencies = [
|
|
|
27
26
|
"huggingface-hub>=0.30.1",
|
|
28
27
|
"pyinfer>=0.0.3",
|
|
29
28
|
"sentencepiece>=0.1.96",
|
|
30
|
-
"protobuf~=3.20.0",
|
|
31
29
|
"sacremoses>=0.1.1",
|
|
32
30
|
"more-itertools>=10.5.0",
|
|
33
31
|
"tenacity>=9.0.0",
|
|
34
|
-
"litellm>=1.
|
|
32
|
+
"litellm>=1.72.2",
|
|
35
33
|
"rouge-score>=0.1.2",
|
|
36
34
|
"bert-score>=0.3.13",
|
|
37
35
|
"levenshtein>=0.24.0",
|
|
38
36
|
"scikit-learn<1.6.0",
|
|
39
37
|
"setuptools>=75.8.2",
|
|
40
38
|
"demjson3>=3.0.6",
|
|
41
|
-
"ollama>=0.
|
|
39
|
+
"ollama>=0.5.1",
|
|
42
40
|
"peft>=0.15.0",
|
|
43
41
|
]
|
|
44
42
|
|
|
@@ -46,7 +44,7 @@ dependencies = [
|
|
|
46
44
|
generative = [
|
|
47
45
|
"outlines>=0.1.11",
|
|
48
46
|
"bitsandbytes>=0.43.1; platform_system == 'Linux'",
|
|
49
|
-
"vllm>=0.9.
|
|
47
|
+
"vllm>=0.9.1; platform_system == 'Linux'",
|
|
50
48
|
"fbgemm-gpu>=1.0.0; platform_system == 'Linux'",
|
|
51
49
|
]
|
|
52
50
|
human_evaluation = [
|
|
@@ -55,7 +53,7 @@ human_evaluation = [
|
|
|
55
53
|
all = [
|
|
56
54
|
"outlines>=0.1.11",
|
|
57
55
|
"bitsandbytes>=0.43.1; platform_system == 'Linux'",
|
|
58
|
-
"vllm>=0.9.
|
|
56
|
+
"vllm>=0.9.1; platform_system == 'Linux'",
|
|
59
57
|
"fbgemm-gpu>=1.0.0; platform_system == 'Linux'",
|
|
60
58
|
"gradio>=4.26.0",
|
|
61
59
|
]
|
|
@@ -150,6 +148,8 @@ ignore = [
|
|
|
150
148
|
"ANN101",
|
|
151
149
|
# Type annotations for "cls" arguments
|
|
152
150
|
"ANN102",
|
|
151
|
+
# Type annotations for *args
|
|
152
|
+
"ANN002",
|
|
153
153
|
# Type annotations for **kwargs
|
|
154
154
|
"ANN003",
|
|
155
155
|
# Docstrings for **kwargs
|
|
@@ -378,7 +378,7 @@ class HuggingFaceEncoderModel(BenchmarkModule):
|
|
|
378
378
|
tokenizer=self._tokenizer,
|
|
379
379
|
),
|
|
380
380
|
batched=True,
|
|
381
|
-
batch_size=
|
|
381
|
+
batch_size=10,
|
|
382
382
|
remove_columns=dataset["train"].column_names,
|
|
383
383
|
load_from_cache_file=False,
|
|
384
384
|
keep_in_memory=True,
|
|
@@ -389,7 +389,7 @@ class HuggingFaceEncoderModel(BenchmarkModule):
|
|
|
389
389
|
tokenizer=self._tokenizer,
|
|
390
390
|
),
|
|
391
391
|
batched=True,
|
|
392
|
-
batch_size=
|
|
392
|
+
batch_size=10,
|
|
393
393
|
remove_columns=dataset["val"].column_names,
|
|
394
394
|
load_from_cache_file=False,
|
|
395
395
|
keep_in_memory=True,
|
|
@@ -400,7 +400,7 @@ class HuggingFaceEncoderModel(BenchmarkModule):
|
|
|
400
400
|
tokenizer=self._tokenizer,
|
|
401
401
|
),
|
|
402
402
|
batched=True,
|
|
403
|
-
batch_size=
|
|
403
|
+
batch_size=10,
|
|
404
404
|
remove_columns=dataset["test"].column_names,
|
|
405
405
|
load_from_cache_file=False,
|
|
406
406
|
keep_in_memory=True,
|