EuroEval 15.15.0.tar.gz → 16.0.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

Files changed (278)
  1. {euroeval-15.15.0 → euroeval-16.0.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +2 -0
  2. {euroeval-15.15.0 → euroeval-16.0.0}/.github/ISSUE_TEMPLATE/bug.yaml +1 -1
  3. {euroeval-15.15.0 → euroeval-16.0.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +2 -1
  4. {euroeval-15.15.0 → euroeval-16.0.0}/.github/workflows/ci.yaml +26 -16
  5. {euroeval-15.15.0 → euroeval-16.0.0}/.pre-commit-config.yaml +5 -2
  6. {euroeval-15.15.0 → euroeval-16.0.0}/CHANGELOG.md +108 -36
  7. {euroeval-15.15.0 → euroeval-16.0.0}/PKG-INFO +12 -14
  8. {euroeval-15.15.0 → euroeval-16.0.0}/README.md +3 -1
  9. euroeval-16.0.0/docs/datasets/estonian.md +544 -0
  10. {euroeval-15.15.0 → euroeval-16.0.0}/docs/datasets/icelandic.md +11 -11
  11. euroeval-16.0.0/docs/datasets/latvian.md +536 -0
  12. euroeval-16.0.0/docs/leaderboards/Monolingual/portuguese.md +23 -0
  13. {euroeval-15.15.0 → euroeval-16.0.0}/docs/leaderboards/README.md +1 -1
  14. {euroeval-15.15.0 → euroeval-16.0.0}/makefile +2 -2
  15. {euroeval-15.15.0 → euroeval-16.0.0}/pyproject.toml +11 -17
  16. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/__init__.py +3 -7
  17. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/benchmark_config_factory.py +3 -7
  18. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/benchmark_modules/base.py +35 -19
  19. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/benchmark_modules/fresh.py +24 -19
  20. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/benchmark_modules/hf.py +136 -154
  21. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/benchmark_modules/litellm.py +323 -193
  22. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/benchmark_modules/vllm.py +166 -112
  23. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/benchmarker.py +59 -33
  24. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/cli.py +3 -3
  25. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/constants.py +13 -15
  26. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/data_loading.py +33 -28
  27. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/data_models.py +53 -7
  28. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/dataset_configs/__init__.py +2 -0
  29. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/dataset_configs/danish.py +38 -1
  30. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/dataset_configs/dutch.py +38 -1
  31. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/dataset_configs/english.py +38 -1
  32. euroeval-16.0.0/src/euroeval/dataset_configs/estonian.py +95 -0
  33. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/dataset_configs/faroese.py +38 -0
  34. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/dataset_configs/finnish.py +39 -1
  35. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/dataset_configs/french.py +38 -1
  36. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/dataset_configs/german.py +38 -1
  37. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/dataset_configs/icelandic.py +39 -1
  38. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/dataset_configs/italian.py +38 -1
  39. euroeval-16.0.0/src/euroeval/dataset_configs/latvian.py +81 -0
  40. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/dataset_configs/norwegian.py +38 -1
  41. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/dataset_configs/portuguese.py +38 -1
  42. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/dataset_configs/spanish.py +38 -1
  43. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/dataset_configs/swedish.py +38 -1
  44. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/enums.py +0 -6
  45. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/finetuning.py +8 -7
  46. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/generation.py +25 -14
  47. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/generation_utils.py +46 -14
  48. euroeval-16.0.0/src/euroeval/languages.py +966 -0
  49. euroeval-16.0.0/src/euroeval/metrics/__init__.py +6 -0
  50. euroeval-16.0.0/src/euroeval/metrics/base.py +76 -0
  51. euroeval-16.0.0/src/euroeval/metrics/huggingface.py +192 -0
  52. euroeval-16.0.0/src/euroeval/metrics/llm_as_a_judge.py +257 -0
  53. euroeval-16.0.0/src/euroeval/metrics/pipeline.py +234 -0
  54. euroeval-16.0.0/src/euroeval/metrics/speed.py +51 -0
  55. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/prompt_templates/linguistic_acceptability.py +40 -2
  56. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/prompt_templates/multiple_choice.py +23 -2
  57. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/prompt_templates/named_entity_recognition.py +65 -2
  58. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/prompt_templates/reading_comprehension.py +42 -2
  59. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/prompt_templates/sentiment_classification.py +46 -2
  60. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/prompt_templates/summarization.py +24 -4
  61. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/scores.py +7 -2
  62. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/speed_benchmark.py +6 -6
  63. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/task_group_utils/multiple_choice_classification.py +17 -6
  64. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/task_group_utils/question_answering.py +35 -28
  65. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/task_group_utils/sequence_classification.py +96 -23
  66. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/task_group_utils/text_to_text.py +7 -3
  67. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/task_group_utils/token_classification.py +47 -75
  68. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/tasks.py +31 -6
  69. euroeval-16.0.0/src/euroeval/tokenization_utils.py +586 -0
  70. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/utils.py +118 -34
  71. euroeval-16.0.0/src/scripts/create_copa_lv.py +143 -0
  72. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_danish_citizen_tests.py +3 -2
  73. euroeval-16.0.0/src/scripts/create_err_news.py +83 -0
  74. euroeval-16.0.0/src/scripts/create_estner.py +115 -0
  75. euroeval-16.0.0/src/scripts/create_estonian_valence.py +86 -0
  76. euroeval-16.0.0/src/scripts/create_european_values.py +283 -0
  77. euroeval-16.0.0/src/scripts/create_exam_et.py +136 -0
  78. euroeval-16.0.0/src/scripts/create_fullstack_ner.py +248 -0
  79. euroeval-16.0.0/src/scripts/create_grammar_et.py +74 -0
  80. euroeval-16.0.0/src/scripts/create_latvian_lsm_summary.py +92 -0
  81. euroeval-16.0.0/src/scripts/create_latvian_twitter_sentiment.py +109 -0
  82. euroeval-16.0.0/src/scripts/create_mmlu_lv.py +263 -0
  83. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_multi_wiki_qa.py +1 -0
  84. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_scala.py +4 -0
  85. euroeval-16.0.0/src/scripts/create_wikiann_lv.py +116 -0
  86. euroeval-16.0.0/src/scripts/create_winogrande_et.py +90 -0
  87. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/load_ud_pos.py +36 -0
  88. {euroeval-15.15.0 → euroeval-16.0.0}/tests/conftest.py +2 -19
  89. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_benchmark_modules/test_hf.py +10 -13
  90. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_benchmarker.py +0 -44
  91. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_cli.py +2 -2
  92. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_data_loading.py +15 -8
  93. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_data_models.py +2 -2
  94. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_scores.py +1 -1
  95. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_tokenization_utils.py +7 -7
  96. {euroeval-15.15.0 → euroeval-16.0.0}/uv.lock +1335 -2204
  97. euroeval-15.15.0/src/euroeval/human_evaluation.py +0 -738
  98. euroeval-15.15.0/src/euroeval/languages.py +0 -206
  99. euroeval-15.15.0/src/euroeval/metrics.py +0 -468
  100. euroeval-15.15.0/src/euroeval/tokenization_utils.py +0 -498
  101. euroeval-15.15.0/tests/test_human_evaluation.py +0 -8
  102. {euroeval-15.15.0 → euroeval-16.0.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
  103. {euroeval-15.15.0 → euroeval-16.0.0}/.gitignore +0 -0
  104. {euroeval-15.15.0 → euroeval-16.0.0}/CITATION.cff +0 -0
  105. {euroeval-15.15.0 → euroeval-16.0.0}/CODE_OF_CONDUCT.md +0 -0
  106. {euroeval-15.15.0 → euroeval-16.0.0}/CONTRIBUTING.md +0 -0
  107. {euroeval-15.15.0 → euroeval-16.0.0}/Dockerfile.cuda +0 -0
  108. {euroeval-15.15.0 → euroeval-16.0.0}/LICENSE +0 -0
  109. {euroeval-15.15.0 → euroeval-16.0.0}/NEW_DATASET_GUIDE.md +0 -0
  110. {euroeval-15.15.0 → euroeval-16.0.0}/docs/CNAME +0 -0
  111. {euroeval-15.15.0 → euroeval-16.0.0}/docs/README.md +0 -0
  112. {euroeval-15.15.0 → euroeval-16.0.0}/docs/datasets/README.md +0 -0
  113. {euroeval-15.15.0 → euroeval-16.0.0}/docs/datasets/danish.md +0 -0
  114. {euroeval-15.15.0 → euroeval-16.0.0}/docs/datasets/dutch.md +0 -0
  115. {euroeval-15.15.0 → euroeval-16.0.0}/docs/datasets/english.md +0 -0
  116. {euroeval-15.15.0 → euroeval-16.0.0}/docs/datasets/faroese.md +0 -0
  117. {euroeval-15.15.0 → euroeval-16.0.0}/docs/datasets/finnish.md +0 -0
  118. {euroeval-15.15.0 → euroeval-16.0.0}/docs/datasets/french.md +0 -0
  119. {euroeval-15.15.0 → euroeval-16.0.0}/docs/datasets/german.md +0 -0
  120. {euroeval-15.15.0 → euroeval-16.0.0}/docs/datasets/italian.md +0 -0
  121. {euroeval-15.15.0 → euroeval-16.0.0}/docs/datasets/norwegian.md +0 -0
  122. {euroeval-15.15.0 → euroeval-16.0.0}/docs/datasets/portuguese.md +0 -0
  123. {euroeval-15.15.0 → euroeval-16.0.0}/docs/datasets/spanish.md +0 -0
  124. {euroeval-15.15.0 → euroeval-16.0.0}/docs/datasets/swedish.md +0 -0
  125. {euroeval-15.15.0 → euroeval-16.0.0}/docs/extras/radial_plotter.md +0 -0
  126. {euroeval-15.15.0 → euroeval-16.0.0}/docs/faq.md +0 -0
  127. {euroeval-15.15.0 → euroeval-16.0.0}/docs/gfx/favicon.png +0 -0
  128. {euroeval-15.15.0 → euroeval-16.0.0}/docs/leaderboards/Monolingual/danish.md +0 -0
  129. {euroeval-15.15.0 → euroeval-16.0.0}/docs/leaderboards/Monolingual/dutch.md +0 -0
  130. {euroeval-15.15.0 → euroeval-16.0.0}/docs/leaderboards/Monolingual/english.md +0 -0
  131. {euroeval-15.15.0 → euroeval-16.0.0}/docs/leaderboards/Monolingual/faroese.md +0 -0
  132. {euroeval-15.15.0 → euroeval-16.0.0}/docs/leaderboards/Monolingual/finnish.md +0 -0
  133. {euroeval-15.15.0 → euroeval-16.0.0}/docs/leaderboards/Monolingual/french.md +0 -0
  134. {euroeval-15.15.0 → euroeval-16.0.0}/docs/leaderboards/Monolingual/german.md +0 -0
  135. {euroeval-15.15.0 → euroeval-16.0.0}/docs/leaderboards/Monolingual/icelandic.md +0 -0
  136. {euroeval-15.15.0 → euroeval-16.0.0}/docs/leaderboards/Monolingual/italian.md +0 -0
  137. {euroeval-15.15.0 → euroeval-16.0.0}/docs/leaderboards/Monolingual/norwegian.md +0 -0
  138. {euroeval-15.15.0 → euroeval-16.0.0}/docs/leaderboards/Monolingual/spanish.md +0 -0
  139. {euroeval-15.15.0 → euroeval-16.0.0}/docs/leaderboards/Monolingual/swedish.md +0 -0
  140. {euroeval-15.15.0 → euroeval-16.0.0}/docs/leaderboards/Multilingual/european.md +0 -0
  141. {euroeval-15.15.0 → euroeval-16.0.0}/docs/leaderboards/Multilingual/germanic.md +0 -0
  142. {euroeval-15.15.0 → euroeval-16.0.0}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
  143. {euroeval-15.15.0 → euroeval-16.0.0}/docs/leaderboards/Multilingual/romance.md +0 -0
  144. {euroeval-15.15.0 → euroeval-16.0.0}/docs/methodology.md +0 -0
  145. {euroeval-15.15.0 → euroeval-16.0.0}/docs/python-package.md +0 -0
  146. {euroeval-15.15.0 → euroeval-16.0.0}/docs/tasks/README.md +0 -0
  147. {euroeval-15.15.0 → euroeval-16.0.0}/docs/tasks/common-sense-reasoning.md +0 -0
  148. {euroeval-15.15.0 → euroeval-16.0.0}/docs/tasks/knowledge.md +0 -0
  149. {euroeval-15.15.0 → euroeval-16.0.0}/docs/tasks/linguistic-acceptability.md +0 -0
  150. {euroeval-15.15.0 → euroeval-16.0.0}/docs/tasks/named-entity-recognition.md +0 -0
  151. {euroeval-15.15.0 → euroeval-16.0.0}/docs/tasks/reading-comprehension.md +0 -0
  152. {euroeval-15.15.0 → euroeval-16.0.0}/docs/tasks/sentiment-classification.md +0 -0
  153. {euroeval-15.15.0 → euroeval-16.0.0}/docs/tasks/speed.md +0 -0
  154. {euroeval-15.15.0 → euroeval-16.0.0}/docs/tasks/summarization.md +0 -0
  155. {euroeval-15.15.0 → euroeval-16.0.0}/gfx/euroeval.png +0 -0
  156. {euroeval-15.15.0 → euroeval-16.0.0}/gfx/euroeval.xcf +0 -0
  157. {euroeval-15.15.0 → euroeval-16.0.0}/gfx/scandeval.png +0 -0
  158. {euroeval-15.15.0 → euroeval-16.0.0}/mkdocs.yaml +0 -0
  159. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/benchmark_modules/__init__.py +0 -0
  160. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/callbacks.py +0 -0
  161. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/exceptions.py +0 -0
  162. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/model_cache.py +0 -0
  163. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/model_config.py +0 -0
  164. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/model_loading.py +0 -0
  165. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/prompt_templates/__init__.py +0 -0
  166. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/task_group_utils/__init__.py +0 -0
  167. {euroeval-15.15.0 → euroeval-16.0.0}/src/euroeval/types.py +0 -0
  168. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/constants.py +0 -0
  169. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_allocine.py +0 -0
  170. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_angry_tweets.py +0 -0
  171. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_arc.py +0 -0
  172. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_arc_is.py +0 -0
  173. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_belebele.py +0 -0
  174. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_boolq_pt.py +0 -0
  175. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_cnn_dailymail.py +0 -0
  176. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_conll_en.py +0 -0
  177. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_conll_es.py +0 -0
  178. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_conll_nl.py +0 -0
  179. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_dane.py +0 -0
  180. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_dansk.py +0 -0
  181. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_danske_talemaader.py +0 -0
  182. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_danske_talemaader_old.py +0 -0
  183. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_dbrd.py +0 -0
  184. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_dutch_cola.py +0 -0
  185. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_eltec.py +0 -0
  186. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_fone.py +0 -0
  187. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_foqa.py +0 -0
  188. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_fosent.py +0 -0
  189. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_fquad.py +0 -0
  190. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_germanquad.py +0 -0
  191. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_germeval.py +0 -0
  192. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_goldenswag.py +0 -0
  193. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_harem.py +0 -0
  194. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_hellaswag.py +0 -0
  195. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_hellaswag_fi.py +0 -0
  196. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
  197. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_ice_linguistic.py +0 -0
  198. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_icelandic_error_corpus.py +0 -0
  199. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_icelandic_knowledge.py +0 -0
  200. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_icelandic_qa.py +0 -0
  201. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_icesum.py +0 -0
  202. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_idioms_no.py +0 -0
  203. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_ilpost_sum.py +0 -0
  204. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_jentoft.py +0 -0
  205. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_life_in_the_uk.py +0 -0
  206. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_mim_gold_ner.py +0 -0
  207. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_mlqa_es.py +0 -0
  208. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_mlsum_de.py +0 -0
  209. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_mlsum_es.py +0 -0
  210. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_mmlu.py +0 -0
  211. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_multinerd-it.py +0 -0
  212. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_no_cola.py +0 -0
  213. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_no_sammendrag.py +0 -0
  214. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_nor_common_sense_qa.py +0 -0
  215. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_nordjylland_news.py +0 -0
  216. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_norec.py +0 -0
  217. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_norglm_multiqa.py +0 -0
  218. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_norglm_multisum.py +0 -0
  219. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_norne.py +0 -0
  220. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_norquad.py +0 -0
  221. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_nqii.py +0 -0
  222. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_nrk_quiz_qa.py +0 -0
  223. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_orange_sum.py +0 -0
  224. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_personal_sum.py +0 -0
  225. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_publico.py +0 -0
  226. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_rrn.py +0 -0
  227. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_sb10k.py +0 -0
  228. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_scandiqa.py +0 -0
  229. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_scandisent_fi.py +0 -0
  230. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_schibsted.py +0 -0
  231. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_sentiment_headlines_es.py +0 -0
  232. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_sentipolc16.py +0 -0
  233. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_squad.py +0 -0
  234. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_squad_it.py +0 -0
  235. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_squad_nl.py +0 -0
  236. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_squad_nl_old.py +0 -0
  237. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_sst2_pt.py +0 -0
  238. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_sst5.py +0 -0
  239. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_suc3.py +0 -0
  240. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_swedn.py +0 -0
  241. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_swerec.py +0 -0
  242. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_turku_ner_fi.py +0 -0
  243. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_tydiqa_fi.py +0 -0
  244. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_wiki_lingua_nl.py +0 -0
  245. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_wikiann_fo.py +0 -0
  246. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_wikineural-it.py +0 -0
  247. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_winogrande_is.py +0 -0
  248. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_xlsum_fi.py +0 -0
  249. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/create_xquad_es.py +0 -0
  250. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/fix_dot_env_file.py +0 -0
  251. {euroeval-15.15.0 → euroeval-16.0.0}/src/scripts/versioning.py +0 -0
  252. {euroeval-15.15.0 → euroeval-16.0.0}/tests/__init__.py +0 -0
  253. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_benchmark_config_factory.py +0 -0
  254. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_benchmark_modules/__init__.py +0 -0
  255. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_benchmark_modules/test_base.py +0 -0
  256. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_benchmark_modules/test_fresh.py +0 -0
  257. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_benchmark_modules/test_litellm.py +0 -0
  258. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_benchmark_modules/test_vllm.py +0 -0
  259. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_callbacks.py +0 -0
  260. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_constants.py +0 -0
  261. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_dataset_configs.py +0 -0
  262. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_enums.py +0 -0
  263. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_exceptions.py +0 -0
  264. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_finetuning.py +0 -0
  265. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_generation.py +0 -0
  266. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_languages.py +0 -0
  267. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_model_cache.py +0 -0
  268. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_model_config.py +0 -0
  269. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_model_loading.py +0 -0
  270. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_speed_benchmark.py +0 -0
  271. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_task_utils/__init__.py +0 -0
  272. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_task_utils/test_question_answering.py +0 -0
  273. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_task_utils/test_sequence_classification.py +0 -0
  274. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_task_utils/test_text_to_text.py +0 -0
  275. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_task_utils/test_token_classification.py +0 -0
  276. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_tasks.py +0 -0
  277. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_types.py +0 -0
  278. {euroeval-15.15.0 → euroeval-16.0.0}/tests/test_utils.py +0 -0
@@ -25,12 +25,14 @@ body:
  - label: Danish
  - label: Dutch
  - label: English
+ - label: Estonian
  - label: Faroese
  - label: Finnish
  - label: French
  - label: German
  - label: Icelandic
  - label: Italian
+ - label: Latvian
  - label: Norwegian (Bokmål or Nynorsk)
  - label: Portuguese
  - label: Spanish
@@ -55,7 +55,7 @@ body:
  attributes:
  label: EuroEval version
  description: What version of EuroEval are you using?
- placeholder: Output of `pip list | grep EuroEval`
+ placeholder: Output of `pip list | grep euroeval`
  validations:
  required: true
  - type: input
@@ -21,7 +21,8 @@ body:
  - label: Romance languages (French, Italian, Portuguese, Spanish)
  - label: Scandinavian languages (Danish, Faroese, Icelandic, Norwegian, Swedish)
  - label: West Germanic languages (Dutch, English, German)
- - label: Finnish
+ - label: Finnic languages (Estonian, Finnish)
+ - label: Latvian
  validations:
  required: true
  - type: dropdown
@@ -22,16 +22,19 @@ jobs:
  pull-requests: write
  runs-on: ubuntu-latest
  steps:
- - uses: actions/checkout@v4
+ - uses: actions/checkout@v5
  with:
  persist-credentials: false
- - uses: actions/setup-python@v5
+ ref: main
+
+ - name: Install uv and set up Python
+ uses: astral-sh/setup-uv@v6
  with:
+ enable-cache: false
  python-version: "3.11"
- - run: python -m pip install pre-commit
- shell: bash
- - run: pre-commit run --show-diff-on-failure --color=always --all-files
- shell: bash
+
+ - name: Run pre-commit hooks
+ uses: pre-commit/action@v3.0.1

  pytest-linux:
  if: github.event.pull_request.draft == false
@@ -40,24 +43,25 @@ jobs:
  pull-requests: write
  strategy:
  matrix:
- python-version: ["3.10", "3.11", "3.12"]
+ python-version: ["3.11", "3.12", "3.13"]
  runs-on: ubuntu-latest
  steps:
- - uses: actions/checkout@v4
+ - uses: actions/checkout@v5
  with:
  persist-credentials: false
+ ref: main

  - name: Install uv and set up Python
- uses: astral-sh/setup-uv@v5
+ uses: astral-sh/setup-uv@v6
  with:
  enable-cache: false
  python-version: ${{ matrix.python-version }}

  - name: Install Dependencies
- run: uv sync --no-dev --extra test
+ run: uv sync --no-dev

  - name: Start Ollama server
- run: curl -fsSL https://ollama.com/install.sh | sh
+ run: curl -fsSL https://ollama.com/install.sh | sh && ollama serve &

  - name: Test with pytest
  run: uv run pytest
@@ -66,6 +70,8 @@ jobs:
  HF_TOKEN: ${{ secrets.HUGGINGFACE_API_KEY }}
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+ GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
+ XAI_API_KEY: ${{ secrets.XAI_API_KEY }}

  - name: Delete EuroEval cache
  run: rm -rf .euroeval_cache
@@ -77,21 +83,25 @@ jobs:
  pull-requests: write
  runs-on: macos-latest
  steps:
- - uses: actions/checkout@v4
+ - uses: actions/checkout@v5
+ with:
+ persist-credentials: false
+ ref: main

  - name: Install uv and set up Python
- uses: astral-sh/setup-uv@v4
+ uses: astral-sh/setup-uv@v6
  with:
+ enable-cache: false
  python-version: ${{ matrix.python-version }}

  - name: Install Dependencies
- run: uv sync --no-dev --extra test
+ run: uv sync --no-dev

  - name: Start Ollama server
- run: curl -fsSL https://ollama.com/install.sh | sh
+ run: curl -fsSL https://ollama.com/install.sh | sh && ollama serve &

  - name: Test with pytest
- run: uv run pytest
+ run: uv run pytest -vvv
  env:
  HUGGINGFACE_API_KEY: ${{ secrets.HUGGINGFACE_API_KEY }}
  HF_TOKEN: ${{ secrets.HUGGINGFACE_API_KEY }}
@@ -4,24 +4,27 @@ repos:
  hooks:
  - id: python-use-type-annotations
  - repo: https://github.com/pre-commit/pre-commit-hooks
- rev: v5.0.0
+ rev: v6.0.0
  hooks:
  - id: end-of-file-fixer
  - id: trailing-whitespace
  - id: debug-statements
  - repo: https://github.com/astral-sh/ruff-pre-commit
- rev: v0.12.7
+ rev: v0.12.12
  hooks:
  - id: ruff
  args:
  - --fix
  - --unsafe-fixes
  - --exit-non-zero-on-fix
+ - --no-cache
  types_or:
  - python
  - pyi
  - jupyter
  - id: ruff-format
+ args:
+ - --no-cache
  types_or:
  - python
  - pyi
@@ -10,12 +10,85 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.



+ ## [v16.0.0] - 2025-09-05
+ ### Added
+ - Added support for Latvian 🇱🇻! This includes the sentiment classification dataset
+ Latvian Twitter Sentiment, the linguistic acceptability dataset ScaLA-lv, the named
+ entity recognition datasets FullStack-NER-lv and WikiANN-lv, the reading comprehension
+ dataset MultiWikiQA, the knowledge dataset MMLU-lv, the common-sense reasoning
+ dataset COPA-lv, and the summarisation dataset LSM.
+ - Added support for Estonian 🇪🇪! It includes the sentiment classification dataset
+ Estonian Valence, the linguistic acceptability datasets Grammar-et and ScaLA-et, the
+ named entity recognition dataset EstNER, the reading comprehension dataset
+ MultiWikiQA-et, the summarisation dataset ERRNews, the knowledge dataset Exam-et,
+ and the common-sense reasoning dataset Winogrande-et. This was contributed by
+ @slowwavesleep ✨
+ - It is now possible to evaluate how much a model adheres to European values! 🇪🇺 This
+ probes 53 questions from the European Values Survey, which have been chosen based on
+ an optimisation procedure that maximises agreement across the EU. We then measure how
+ well the model's answers align with the distribution of answers across the EU, using
+ tree-based kernel density estimation. This can only be used zero-shot, and only with
+ instruction-based decoder models (including reasoning models).
+
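The European-values entry above packs a lot of method into a few lines. As a rough illustration of the scoring idea, here is a hedged sketch that fits a tree-backed kernel density estimate on a stand-in for the EU-wide answer distribution and scores a model's answers by their log-density under it. All data, dimensions and parameters are hypothetical, and EuroEval's actual implementation (under `src/euroeval/metrics/`) may differ.

```python
# Hedged sketch of tree-based KDE alignment scoring; NOT EuroEval's actual code.
import numpy as np
from sklearn.neighbors import KernelDensity  # scikit-learn is pinned in this release

rng = np.random.default_rng(seed=4242)

# Hypothetical EU-wide answer distribution over the 53 survey questions.
eu_answers = rng.normal(loc=0.0, scale=1.0, size=(10_000, 53))

# Fit a KDE backed by a k-d tree, which is one reading of "tree-based KDE".
kde = KernelDensity(kernel="gaussian", bandwidth=0.5, algorithm="kd_tree")
kde.fit(eu_answers)

# Score a model's answers: a higher log-density means the answers look more
# like the typical answer pattern across the EU.
model_answers = rng.normal(loc=0.2, scale=1.0, size=(1, 53))
print(f"alignment (log-density): {kde.score_samples(model_answers)[0]:.2f}")
```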
+ ### Changed
+ - When evaluating classification tasks, we now force the model to output one of the
+ labels. This is done directly with open models, and done via a JSON schema for API
+ models. This won't change the results for existing tasks, as logprobs are used, but
+ this was required to measure the European values.
+ - Updated `vllm` dependency to `>=0.10.1`, which includes GPT-OSS support.
+ - Updated `numpy` dependency to `>=2.0.0`, as the previous clash no longer applies.
+ - Updated `transformers` dependency to `>=4.56.0`, which includes support for more
+ models.
+ - Now requires Python >=3.11, as Python 3.10 does not support structured generation
+ with a dynamic set of choices (`Literal[*list_of_choices]` is not supported).
+
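The forced-label change and the new Python floor are two sides of the same change: building a schema whose answer field only admits a dynamic list of labels uses `Literal[*list_of_choices]`, which is a syntax error before Python 3.11. A minimal sketch with a hypothetical label set; pydantic is used here purely for illustration and may not be how EuroEval builds the schema.

```python
# Hedged sketch: constrain an API model's output to the task's labels via a
# JSON schema. `Literal[*labels]` only parses on Python >= 3.11.
from typing import Literal

from pydantic import create_model

labels = ["positive", "neutral", "negative"]  # hypothetical sentiment labels

Answer = create_model("Answer", label=(Literal[*labels], ...))

# The schema can be handed to any API that supports structured outputs.
schema = Answer.model_json_schema()
print(schema["properties"]["label"]["enum"])  # ['positive', 'neutral', 'negative']
```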
+ ### Fixed
+ - Enabled support for evaluating Mistral models with their custom `mistral-common`
+ tokeniser, which includes all recent Mistral models. Note that we currently assume
+ that all of these models are instruction-tuned decoder models (which _is_ true
+ currently), which can lead to errors in case they publish different types of models in
+ the future.
+ - Now disables the `seed` parameter if the API inference model does not support it,
+ which previously prevented evaluating some models.
+ - Now correctly detects an API inference model as non-existing, even if LiteLLM *does*
+ see it as existing. We have an additional check during evaluation to ensure this now.
+ - Catch an `ImportError` that sometimes happens when finishing the evaluation of a
+ vLLM model, during shutdown.
+ - Now uses `litellm>=1.75.6`, which fixes an issue related to evaluation of GPT-5 models
+ using Ollama.
+ - Now always uses the `multiprocessing` backend when evaluating vLLM models, rather than
+ reverting to `ray` when using multiple GPUs, as `ray` led to evaluations of several
+ models freezing.
+ - Now does not require the user to be logged in to Hugging Face to benchmark models on
+ the Hugging Face Hub, if the models are public.
+
+ ### Removed
+ - Removed support for human evaluation, as it was not actively maintained and not used.
+
+
+ ## [v15.16.0] - 2025-08-12
+ ### Added
+ - Added metadata for GPT-5 models.
+
+ ### Changed
+ - Updated `transformers` dependency to `>=4.55.0`.
+
+ ### Fixed
+ - If the model uses 'mxfp4' quantisation then we allow the dtype to be bfloat16, rather
+ than forcing float16. This caused issues with the new GPT-OSS models.
+ - Prevent multiple `Model <model-id> does not exist` logs when evaluating a model
+ that does not exist; now only logs this once.
+ - Cleaner error message when attempting to benchmark a generative model without having a
+ GPU available.
+ - Now raises an error if an inference API is used with a parameter that is not supported.
+
+
  ## [v15.15.0] - 2025-08-06
  ### Added
  - Added the common-sense reasoning dataset GoldenSwag for the following
  languages: Danish, German, Spanish, Finnish, French, Italian, Dutch, Swedish.
  The datasets are unofficial for now. This was contributed by
- [@oliverkinch](https://github.com/oliverkinch)
+ @oliverkinch ✨

  ### Changed
  - Now allows metadata to be included in metrics, allowing more flexibility when
@@ -71,7 +144,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  acceptability dataset ScaLA-pt. The machine translated ones include the sentiment
  classification dataset SST-2, the multiple choice reading comprehension dataset BoolQ,
  the knowledge dataset MMLU, and the common-sense reasoning dataset GoldenSwag. This
- was contributed by [@duarteocarmo](https://github.com/duarteocarmo)
+ was contributed by @duarteocarmo ✨
  - Added `--gpu-memory-utilization` argument (`gpu_memory_utilization` in the
  `Benchmarker` API), which can be lowered in case the user is experiencing OOM errors
  when evaluating models. The default is 0.9 (same as previously), which means that vLLM
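For reference, the `gpu_memory_utilization` knob above can be used along these lines; the keyword name is taken from the entry itself, while the surrounding call pattern is an assumption rather than documented API.

```python
# Hedged sketch: lower vLLM's GPU memory utilisation to dodge OOM errors.
from euroeval import Benchmarker

# The keyword comes from the changelog entry above; the default is 0.9.
benchmarker = Benchmarker(gpu_memory_utilization=0.8)
```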
@@ -91,11 +164,11 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  - Added the English knowledge dataset Life in the UK, which has been added as an
  official dataset, replacing the existing English knowledge dataset MMLU, which in turn
  has been marked as unofficial now. This was contributed by
- [@oliverkinch](https://github.com/oliverkinch)
+ @oliverkinch ✨
  - Added the Norwegian knowledge dataset Idioms-no, which is a multiple-choice question
  dataset where the alternative answers have been generated using GPT-4o. This has been
  added as an official dataset, and was contributed by
- [@oliverkinch](https://github.com/oliverkinch)
+ @oliverkinch ✨
  - Added new `LLMAsAJudgeMetric`, which allows evaluating the performance of a model with
  another judge model. This is useful for evaluating models in a reference-free manner,
  or if the metric is sufficiently complex. It is currently not used in any task, but
@@ -199,11 +272,11 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  ### Added
  - Added the BeleBele datasets for Finnish, Italian and Spanish. They are listed as
  unofficial for now. This was contributed by
- [@oliverkinch](https://github.com/oliverkinch)
+ @oliverkinch ✨

  ### Changed
  - Now uses asynchronous requests when dealing with API models, speeding up the generation
- immensely. This was contributed by [@mathiasesn](https://github.com/mathiasesn)
+ immensely. This was contributed by @mathiasesn ✨

  ### Fixed
  - Add HellaSwag-fi back in, as the issue with the labels in the test split has been
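The asynchronous-requests change above is the classic gather-instead-of-loop pattern. This generic sketch, with a hypothetical `query` coroutine standing in for an API call, shows why it speeds up generation so much:

```python
# Hedged sketch: overlap all API requests instead of awaiting them one by one.
import asyncio


async def query(prompt: str) -> str:
    # Hypothetical stand-in for a single API-model call.
    await asyncio.sleep(0.1)  # simulate network latency
    return f"answer to {prompt!r}"


async def main() -> None:
    prompts = [f"question {i}" for i in range(100)]
    # All 100 requests run concurrently: ~0.1 s instead of ~10 s sequentially.
    answers = await asyncio.gather(*(query(p) for p in prompts))
    print(len(answers), "answers")


asyncio.run(main())
```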
@@ -255,7 +328,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  dataset [XL-Sum-fi](https://huggingface.co/datasets/TurkuNLP/xlsum-fi), and the
  common-sense reasoning dataset
  [HellaSwag-fi](https://huggingface.co/datasets/Finnish-NLP/hellaswag-fi-google-translate).
- This was contributed by [@oliverkinch](https://github.com/oliverkinch)
+ This was contributed by @oliverkinch ✨
  - Added metadata for GPT-4.1 and Grok-3 models.
  - Marked Gemini-2.5-flash and Grok-3-mini as reasoning models, giving them more tokens
  to think.
@@ -298,7 +371,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  ## [v15.6.1] - 2025-04-14
  ### Changed
  - Added more info about SQuAD-nl in the documentation. This was contributed by
- [@Rijgersberg](https://github.com/Rijgersberg)
+ @Rijgersberg ✨

  ### Fixed
  - The "E" option for the Norwegian NorCommonSenseQA dataset was not included in the
@@ -326,7 +399,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  - Uniformised the prompt templates used for each task, so that they are more
  consistent across tasks. Evaluation tests across different model types and sizes show
  no significant performance difference between the new and old templates. This was
- contributed by [@viggo-gascou](https://github.com/viggo-gascou)
+ contributed by @viggo-gascou ✨

  ### Fixed
  - Avoid duplicate error messages when a rate limit occurs.
@@ -355,7 +428,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  - Allows all vLLM versions from v0.8.0 again, as the issue with the generation output
  has been resolved.
  - Added overall progress indicator during evaluation. This was contributed by
- [@mathiasesn](https://github.com/mathiasesn)
+ @mathiasesn ✨

  ### Changed
  - Now does not use logprobs in text classification tasks with Google VertexAI models, as
@@ -394,9 +467,9 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  ### Fixed
  - Now uses `fp16` instead of `bf16` when evaluating decoder models on GPUs with CUDA
  compatibility < 8.0. This was contributed by
- [@marksverdhei](https://github.com/marksverdhei)
+ @marksverdhei ✨
  - Corrected the name of the French sentiment dataset AlloCiné. This was contributed by
- [@Alkarex](https://github.com/Alkarex)
+ @Alkarex ✨
  - Evaluating a specific model revision did not work for adapter models, as there was a
  confusion between the revision of the adapter and the revision of the base model. We
  now use the revision for the adapter and use the latest revision for the base model.
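Background for the fp16/bf16 fix above: native bfloat16 support starts at CUDA compute capability 8.0 (Ampere), and PyTorch can query the capability directly. A minimal sketch of that dtype selection, not necessarily EuroEval's exact logic:

```python
# Hedged sketch: fall back to float16 on pre-Ampere GPUs (capability < 8.0),
# since those lack native bfloat16 support.
import torch

if torch.cuda.is_available():
    dtype = (
        torch.bfloat16
        if torch.cuda.get_device_capability() >= (8, 0)
        else torch.float16
    )
else:
    dtype = torch.float32  # CPU fallback
print(dtype)
```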
@@ -422,7 +495,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  `HuggingFaceHubDown` exception.
  - Now uses `fp16` instead of `bf16` when evaluating decoder models on GPUs with CUDA
  compatibility < 8.0. This was contributed by
- [@marksverdhei](https://github.com/marksverdhei)
+ @marksverdhei ✨
  - Fixed docs for ScandiQA-da and ScandiQA-sv, where it was incorrectly stated that
  the splits were made by considering the original train/validation/test splits.

@@ -447,7 +520,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  [MMLU-es](https://hf.co/datasets/alexandrainst/m_mmlu), the common-sense reasoning
  dataset [HellaSwag-es](https://hf.co/datasets/alexandrainst/m_hellaswag), and the
  named entity recognition dataset [CoNLL-es](https://aclanthology.org/W02-2024/). This
- was contributed by [@oliverkinch](https://github.com/oliverkinch)
+ was contributed by @oliverkinch ✨
  - Now extracts number of parameters and context length for Ollama models, using the
  `ollama` package. Vocabulary size is currently not available in the `ollama`
  package, so this is not extracted for Ollama models. For this reason, the `ollama`
@@ -500,7 +573,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  dataset [MMLU-it](https://hf.co/datasets/alexandrainst/m_mmlu), and the named entity
  recognition dataset [MultiNERD IT](https://hf.co/datasets/Babelscape/multinerd) (and
  unofficially [WikiNEuRal IT](https://hf.co/datasets/Babelscape/wikineural)). This was
- contributed by [@viggo-gascou](https://github.com/viggo-gascou)
+ contributed by @viggo-gascou ✨
  - Added the new Norwegian knowledge dataset NRK-Quiz-QA, consisting of quizzes on the
  Norwegian language and culture, in both Bokmål and Nynorsk. The dataset has been split
  into 635 / 256 / 2,048 samples for train, val, and test, respectively. This replaces
@@ -561,7 +634,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  - Added new `--only-allow-safetensors` flag, which disallows evaluating models from the
  Hugging Face Hub if they are not stored as safetensors. This ensures a high level of
  security on the system running the evaluations, if this is necessary. This was
- contributed by [@Mikeriess](https://github.com/Mikeriess)
+ contributed by @Mikeriess ✨


  ### Fixed
@@ -590,19 +663,19 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  [personal-sum](https://github.com/SmartmediaAI/PersonalSum). It has been split into
  121 / 64 / 256 samples for train / validation / test, respectively, and is set to
  `unofficial` for now. This was contributed by
- [@oliverkinch](https://github.com/oliverkinch)
+ @oliverkinch ✨
  - Added the Jentoft dataset - a linguistic acceptability dataset which was published in
  [this Master's thesis](https://www.duo.uio.no/handle/10852/103885) by Matias Jentoft.
  The original dataset consists of 85,771 / 10,827 / 10,487 samples for training,
  validation and test, respectively. We use a split of 1,024 / 256 / 2,048 samples for
  training, validation and test, respectively. In each split, the distribution of
  `correct` and `incorrect` is 50/50. This dataset has been set to `unofficial` for now.
- This was contributed by [@oliverkinch](https://github.com/oliverkinch)
+ This was contributed by @oliverkinch ✨
  - Added the dataset icelandic-knowledge, which is derived from the IcelandicQA dataset,
  reformatted as a knowledge dataset with GPT-4o generated candidate answers. The split
  is given by 845 / 128 / 1,024 for train, val, and test, respectively. It is marked as
  `unofficial` for now. This was contributed by
- [@oliverkinch](https://github.com/oliverkinch)
+ @oliverkinch ✨

  ### Changed
  - Changed the instruction prompts to all text classification tasks by specifying
@@ -640,8 +713,8 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  dataset [OrangeSum](https://hf.co/datasets/EdinburghNLP/orange_sum).
  - Added support for evaluating local models again, which supports models stored in the
  Hugging Face format with a Hugging Face model configuration file (`config.json`) in
- the model directory. This was contributed by [@rlrs](https://github.com/rlrs) and
- [@peter-sk](https://github.com/peter-sk)
+ the model directory. This was contributed by @rlrs and
+ @peter-sk ✨

  ### Changed
  - Changed the Belebele splits, as there were too few training splits for evaluation on
@@ -861,7 +934,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  dataset NO-Multi-QA-Sum (norglm-multi-qa). This dataset is part of the NLEBench
  Norwegian benchmarks. The answers from the original dataset have been rephrased with
  gpt-4o to contain the answer from the context. It has been marked as `unofficial` for
- now. This was contributed by [@viggo-gascou](https://github.com/viggo-gascou)
+ now. This was contributed by @viggo-gascou ✨
  - Added the sentiment classification part of the Icelandic dataset Hotter and Colder,
  being a gold standard dataset. As no Icelandic sentiment classification dataset was
  included in the benchmark previously, this is now the official Icelandic sentiment
@@ -880,18 +953,18 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  - Added the summarisation part of the Norwegian NorGLM multi-task human annotated
  dataset NO-Multi-QA-Sum (`norglm-multi-sum`). This dataset is part of the NLEBench
  Norwegian benchmarks. It has been marked as `unofficial` for now. This was contributed
- by [@viggo-gascou](https://github.com/viggo-gascou)
+ by @viggo-gascou ✨
  - Added `ice-linguistic`, a linguistic acceptability dataset which is a subset of the
  Icelandic Linguistic Benchmarks dataset. It is a small dataset with 94 train
  samples, 32 validation samples, and 256 test samples, and has been marked as
  `unofficial` for now. This was contributed by
- [@oliverkinch](https://github.com/oliverkinch)
+ @oliverkinch ✨
  - Added `icelandic-qa`, an Icelandic question answering dataset about Icelandic culture
  and history. The original dataset has 2000 samples, but only 375 of the samples have
  answers that are found in the context (exact match). An LLM has therefore been used to
  rephrase the answers and we now have 1683 samples where the answers are found in the
  context (531 train, 128 val, 1024 test). It has been set to `unofficial` for now. This
- was contributed by [@oliverkinch](http://github.com/oliverkinch)
+ was contributed by @oliverkinch ✨

  ### Fixed
  - Small typo in prefix prompt used for few-shot evaluation of the English sentiment
@@ -903,21 +976,21 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  ## [v13.1.0] - 2024-10-31
  - Added `ice-ec` (a subset of the dataset) and `ice-ec-full` (the full dataset), an
  Icelandic linguistic acceptability dataset. It has been set to `unofficial` for now.
- This was contributed by [@oliverkinch](https://github.com/oliverkinch)
+ This was contributed by @oliverkinch ✨
  - Added the Schibsted summarisation dataset, which contains summaries of published
  articles from Schibsted Media's Norwegian and Swedish newsrooms. The dataset has been
  split into two separate small datasets, `schibsted-sv` for Swedish and `schibsted-no`
  for Norwegian. Note that both of these datasets are really small (89 and 374 test
  samples in `schibsted-sv` and `schibsted-no`, respectively), and have been set to
  `unofficial` for now. This was contributed by
- [@oliverkinch](https://github.com/oliverkinch)
+ @oliverkinch ✨
  - Added the Icelandic summarisation dataset IceSum. IceSum is a collection of 1,000
  Icelandic news articles from mbl.is, which have been manually annotated with
  summaries. The dataset has been marked as unofficial, meaning that it will not be
  automatically included when benchmarking models, but can be included by specifying the
  dataset explicitly using the --dataset argument (or dataset argument if using the
  Benchmarker API). This was contributed by
- [@viggo-gascou](https://github.com/viggo-gascou)
+ @viggo-gascou ✨
  - Added the new Faroese reading comprehension dataset FoQA. This is now the default
  Faroese reading comprehension benchmark, as there was none previously.
  - Now supports evaluation of models with adapters. This requires that the model
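As the IceSum entry above notes, unofficial datasets are skipped by default and must be requested explicitly. A hedged sketch of the `Benchmarker`-API route it mentions; whether the `dataset` argument is passed at construction or at call time is an assumption here.

```python
# Hedged sketch: opt in to an unofficial dataset via the `dataset` argument.
from euroeval import Benchmarker

benchmarker = Benchmarker(dataset="icesum")  # call pattern is assumed
```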
@@ -1219,7 +1292,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  ### Fixed
  - Move tensor to the correct device when benchmarking seq-to-seq models (#363). Thanks
- to [@ThomasKluiters](https://github.com/ThomasKluiters) for this contribution! :tada:
+ to @ThomasKluiters for this contribution! :tada:
  - Deals with the case where an instruction tuned model does not use any special token
  at the end of the chat, such as `<|im_end|>`. This holds for, e.g., Qwen models.
  - Better auto-detection of pipeline tag for models on the Hugging Face Hub, in case the
@@ -1233,7 +1306,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_API_VERSION` need to have been set, or
  alternatively through the `--azure-openai-api-key`, `--azure-openai-endpoint` and
  `--azure-openai-api-version` arguments. Thanks to
- [@BramVanroy](https://github.com/BramVanroy) for all the help regarding the
+ @BramVanroy for all the help regarding the
  implementation of this :tada:
  - We now use the new JSON mode for newer OpenAI models for the NER task, to ensure
  better JSON generation.
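The Azure OpenAI entry above names the required configuration. A sketch of setting it via environment variables, with placeholder values; the `AZURE_OPENAI_API_KEY` name is inferred from the matching `--azure-openai-api-key` argument rather than stated outright.

```python
# Hedged sketch: the environment variables named in the entry above, with
# placeholder values. A .env file plus python-dotenv (a listed dependency)
# would work just as well.
import os

os.environ["AZURE_OPENAI_API_KEY"] = "<your-key>"
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://<resource>.openai.azure.com"
os.environ["AZURE_OPENAI_API_VERSION"] = "2024-02-01"  # placeholder version
```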
@@ -1744,7 +1817,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  - A `--use-flash-attention` flag has been added, which enables Flash Attention 2.0,
  which is required by some models, such as Mistral-based ones. If `flash-attn` has not
  been installed then an informative error message will be raised. Thanks to
- [@peter-sk](https://github.com/peter-sk) for this contribution! :tada:
+ @peter-sk for this contribution! :tada:

  ### Changed
  - Now uses 8-bit AdamW whenever CUDA is available, as opposed to regular AdamW.
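The 8-bit AdamW line closing this hunk refers to the optimiser from `bitsandbytes`, which this package lists as a Linux-only extra. A generic usage sketch, not the project's actual fine-tuning code:

```python
# Hedged sketch: use the 8-bit AdamW variant when CUDA is available; it keeps
# optimiser state in 8 bits, cutting GPU memory use during fine-tuning.
import torch

model = torch.nn.Linear(768, 2)  # hypothetical tiny model

if torch.cuda.is_available():
    import bitsandbytes as bnb

    optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=2e-5)
else:
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```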
@@ -1764,7 +1837,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  OpenAI models. This currently happens automatically when specifying a generative
  model from the Hugging Face Hub, and with all OpenAI models.
  - Now stores model caches in separate directories, enabling parallel evaluations.
- Thanks to [@KennethEnevoldsen](https://github.com/KennethEnevoldsen) for this
+ Thanks to @KennethEnevoldsen for this
  contribution! :tada:
  - Added `--device` argument to the CLI, which can be used to override the automatic
  detection of device (CPU, CUDA GPU, MPS GPU, TPU) to use.
1833
1906
  - Now added support for benchmarking local models in the Hugging Face format (i.e.,
1834
1907
  saved with the `save_pretrained` method). This automatically detects the framework
1835
1908
  based on the file extension, but can also be set using the new `--model-framework`
1836
- argument. Thanks to [@peter-sk](https://github.com/peter-sk) for implementing this!
1909
+ argument. Thanks to @peter-sk for implementing this!
1837
1910
  :tada:
1838
1911
 
1839
1912
  ### Fixed
@@ -2132,7 +2205,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  - Specific branches/commits/tags can now be benchmarked, using the `@`
  delimiter. For instance, `scandeval -m model_id@commit_hash` will benchmark
  the model with model ID `model_id`, stored at commit with hash `commit_hash`.
- Thanks to [@versae](https://github.com/versae) for contributing! :tada:
+ Thanks to @versae for contributing! :tada:

  ## [v2.2.0] - 2022-01-18
@@ -2142,8 +2215,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.

  ## [v2.1.0] - 2022-01-17
  ### Added
- - Added support for `flax` models. Thanks to
- [@versae](https://github.com/versae) for contributing! :tada:
+ - Added support for `flax` models. Thanks to @versae for contributing! :tada:


  ## [v2.0.0] - 2022-01-07
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: EuroEval
- Version: 15.15.0
+ Version: 16.0.0
  Summary: The robust European language model benchmark.
  Project-URL: Repository, https://github.com/EuroEval/EuroEval
  Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -28,18 +28,19 @@ License: MIT License
  OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  SOFTWARE.
  License-File: LICENSE
- Requires-Python: <4.0,>=3.10
+ Requires-Python: <4.0,>=3.11
  Requires-Dist: accelerate>=1.9.0
  Requires-Dist: bert-score>=0.3.13
  Requires-Dist: click>=8.1.3
+ Requires-Dist: cloudpickle>=3.1.1
  Requires-Dist: datasets>=3.5.0
  Requires-Dist: demjson3>=3.0.6
  Requires-Dist: evaluate>=0.4.1
  Requires-Dist: huggingface-hub>=0.30.1
  Requires-Dist: levenshtein>=0.24.0
- Requires-Dist: litellm>=1.72.2
+ Requires-Dist: litellm>=1.75.6
  Requires-Dist: more-itertools>=10.5.0
- Requires-Dist: numpy<2.0.0,>=1.23.0
+ Requires-Dist: numpy>=2.0.0
  Requires-Dist: ollama>=0.5.1
  Requires-Dist: pandas>=2.2.0
  Requires-Dist: peft>=0.15.0
@@ -49,27 +50,22 @@ Requires-Dist: pyinfer>=0.0.3
  Requires-Dist: python-dotenv>=1.0.1
  Requires-Dist: rouge-score>=0.1.2
  Requires-Dist: sacremoses>=0.1.1
- Requires-Dist: scikit-learn<1.6.0
+ Requires-Dist: scikit-learn==1.6.1
  Requires-Dist: sentencepiece>=0.1.96
  Requires-Dist: seqeval>=1.2.2
  Requires-Dist: setuptools>=75.8.2
  Requires-Dist: tenacity>=9.0.0
  Requires-Dist: termcolor>=2.0.0
  Requires-Dist: torch>=2.6.0
- Requires-Dist: transformers>=4.51.0
+ Requires-Dist: transformers[mistral-common]>=4.56.0
  Provides-Extra: all
  Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'all'
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
- Requires-Dist: gradio>=4.26.0; extra == 'all'
- Requires-Dist: vllm>=0.10.0; (platform_system == 'Linux') and extra == 'all'
+ Requires-Dist: vllm>=0.10.1; (platform_system == 'Linux') and extra == 'all'
  Provides-Extra: generative
  Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
- Requires-Dist: vllm>=0.10.0; (platform_system == 'Linux') and extra == 'generative'
- Provides-Extra: human-evaluation
- Requires-Dist: gradio>=4.26.0; extra == 'human-evaluation'
- Provides-Extra: test
- Requires-Dist: gradio>=4.26.0; extra == 'test'
+ Requires-Dist: vllm>=0.10.1; (platform_system == 'Linux') and extra == 'generative'
  Description-Content-Type: text/markdown

  <div align='center'>
@@ -223,16 +219,18 @@ A huge thank you to all the contributors who have helped make this project a suc
  <a href="https://github.com/AJDERS"><img src="https://avatars.githubusercontent.com/u/38854604" width=50 alt="Contributor avatar for AJDERS"/></a>
  <a href="https://github.com/oliverkinch"><img src="https://avatars.githubusercontent.com/u/71556498" width=50 alt="Contributor avatar for oliverkinch"/></a>
  <a href="https://github.com/versae"><img src="https://avatars.githubusercontent.com/u/173537" width=50 alt="Contributor avatar for versae"/></a>
+ <a href="https://github.com/KennethEnevoldsen"><img src="https://avatars.githubusercontent.com/u/23721977" width=50 alt="Contributor avatar for KennethEnevoldsen"/></a>
  <a href="https://github.com/viggo-gascou"><img src="https://avatars.githubusercontent.com/u/94069687" width=50 alt="Contributor avatar for viggo-gascou"/></a>
  <a href="https://github.com/mathiasesn"><img src="https://avatars.githubusercontent.com/u/27091759" width=50 alt="Contributor avatar for mathiasesn"/></a>
  <a href="https://github.com/Alkarex"><img src="https://avatars.githubusercontent.com/u/1008324" width=50 alt="Contributor avatar for Alkarex"/></a>
  <a href="https://github.com/marksverdhei"><img src="https://avatars.githubusercontent.com/u/46672778" width=50 alt="Contributor avatar for marksverdhei"/></a>
  <a href="https://github.com/Mikeriess"><img src="https://avatars.githubusercontent.com/u/19728563" width=50 alt="Contributor avatar for Mikeriess"/></a>
- <a href="https://github.com/pakagronglb"><img src="https://avatars.githubusercontent.com/u/178713124" width=50 alt="Contributor avatar for pakagronglb"/></a>
  <a href="https://github.com/ThomasKluiters"><img src="https://avatars.githubusercontent.com/u/8137941" width=50 alt="Contributor avatar for ThomasKluiters"/></a>
  <a href="https://github.com/BramVanroy"><img src="https://avatars.githubusercontent.com/u/2779410" width=50 alt="Contributor avatar for BramVanroy"/></a>
  <a href="https://github.com/peregilk"><img src="https://avatars.githubusercontent.com/u/9079808" width=50 alt="Contributor avatar for peregilk"/></a>
  <a href="https://github.com/Rijgersberg"><img src="https://avatars.githubusercontent.com/u/8604946" width=50 alt="Contributor avatar for Rijgersberg"/></a>
+ <a href="https://github.com/duarteocarmo"><img src="https://avatars.githubusercontent.com/u/26342344" width=50 alt="Contributor avatar for duarteocarmo"/></a>
+ <a href="https://github.com/slowwavesleep"><img src="https://avatars.githubusercontent.com/u/44175589" width=50 alt="Contributor avatar for slowwavesleep"/></a>


  ### Contribute to EuroEval
@@ -149,16 +149,18 @@ A huge thank you to all the contributors who have helped make this project a suc
  <a href="https://github.com/AJDERS"><img src="https://avatars.githubusercontent.com/u/38854604" width=50 alt="Contributor avatar for AJDERS"/></a>
  <a href="https://github.com/oliverkinch"><img src="https://avatars.githubusercontent.com/u/71556498" width=50 alt="Contributor avatar for oliverkinch"/></a>
  <a href="https://github.com/versae"><img src="https://avatars.githubusercontent.com/u/173537" width=50 alt="Contributor avatar for versae"/></a>
+ <a href="https://github.com/KennethEnevoldsen"><img src="https://avatars.githubusercontent.com/u/23721977" width=50 alt="Contributor avatar for KennethEnevoldsen"/></a>
  <a href="https://github.com/viggo-gascou"><img src="https://avatars.githubusercontent.com/u/94069687" width=50 alt="Contributor avatar for viggo-gascou"/></a>
  <a href="https://github.com/mathiasesn"><img src="https://avatars.githubusercontent.com/u/27091759" width=50 alt="Contributor avatar for mathiasesn"/></a>
  <a href="https://github.com/Alkarex"><img src="https://avatars.githubusercontent.com/u/1008324" width=50 alt="Contributor avatar for Alkarex"/></a>
  <a href="https://github.com/marksverdhei"><img src="https://avatars.githubusercontent.com/u/46672778" width=50 alt="Contributor avatar for marksverdhei"/></a>
  <a href="https://github.com/Mikeriess"><img src="https://avatars.githubusercontent.com/u/19728563" width=50 alt="Contributor avatar for Mikeriess"/></a>
- <a href="https://github.com/pakagronglb"><img src="https://avatars.githubusercontent.com/u/178713124" width=50 alt="Contributor avatar for pakagronglb"/></a>
  <a href="https://github.com/ThomasKluiters"><img src="https://avatars.githubusercontent.com/u/8137941" width=50 alt="Contributor avatar for ThomasKluiters"/></a>
  <a href="https://github.com/BramVanroy"><img src="https://avatars.githubusercontent.com/u/2779410" width=50 alt="Contributor avatar for BramVanroy"/></a>
  <a href="https://github.com/peregilk"><img src="https://avatars.githubusercontent.com/u/9079808" width=50 alt="Contributor avatar for peregilk"/></a>
  <a href="https://github.com/Rijgersberg"><img src="https://avatars.githubusercontent.com/u/8604946" width=50 alt="Contributor avatar for Rijgersberg"/></a>
+ <a href="https://github.com/duarteocarmo"><img src="https://avatars.githubusercontent.com/u/26342344" width=50 alt="Contributor avatar for duarteocarmo"/></a>
+ <a href="https://github.com/slowwavesleep"><img src="https://avatars.githubusercontent.com/u/44175589" width=50 alt="Contributor avatar for slowwavesleep"/></a>


  ### Contribute to EuroEval