EuroEval 15.4.2.tar.gz → 15.6.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.


This version of EuroEval might be problematic.

Files changed (232)
  1. {euroeval-15.4.2 → euroeval-15.6.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +6 -12
  2. {euroeval-15.4.2 → euroeval-15.6.0}/.github/workflows/ci.yaml +2 -0
  3. {euroeval-15.4.2 → euroeval-15.6.0}/.gitignore +4 -0
  4. {euroeval-15.4.2 → euroeval-15.6.0}/.pre-commit-config.yaml +1 -1
  5. {euroeval-15.4.2 → euroeval-15.6.0}/CHANGELOG.md +86 -12
  6. {euroeval-15.4.2 → euroeval-15.6.0}/PKG-INFO +31 -9
  7. {euroeval-15.4.2 → euroeval-15.6.0}/README.md +25 -3
  8. {euroeval-15.4.2 → euroeval-15.6.0}/docs/datasets/danish.md +4 -2
  9. {euroeval-15.4.2 → euroeval-15.6.0}/docs/datasets/dutch.md +1 -1
  10. {euroeval-15.4.2 → euroeval-15.6.0}/docs/datasets/english.md +1 -1
  11. {euroeval-15.4.2 → euroeval-15.6.0}/docs/datasets/faroese.md +4 -4
  12. {euroeval-15.4.2 → euroeval-15.6.0}/docs/datasets/icelandic.md +17 -13
  13. {euroeval-15.4.2 → euroeval-15.6.0}/docs/datasets/italian.md +5 -6
  14. {euroeval-15.4.2 → euroeval-15.6.0}/docs/datasets/norwegian.md +18 -9
  15. {euroeval-15.4.2 → euroeval-15.6.0}/makefile +4 -6
  16. {euroeval-15.4.2 → euroeval-15.6.0}/pyproject.toml +19 -7
  17. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/__init__.py +2 -2
  18. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/benchmark_modules/base.py +3 -2
  19. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/benchmark_modules/fresh.py +8 -6
  20. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/benchmark_modules/hf.py +44 -33
  21. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/benchmark_modules/litellm.py +314 -120
  22. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/benchmark_modules/vllm.py +99 -59
  23. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/benchmarker.py +52 -21
  24. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/callbacks.py +2 -2
  25. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/constants.py +9 -2
  26. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/data_models.py +258 -44
  27. euroeval-15.6.0/src/euroeval/dataset_configs/__init__.py +61 -0
  28. euroeval-15.6.0/src/euroeval/dataset_configs/danish.py +120 -0
  29. euroeval-15.6.0/src/euroeval/dataset_configs/dutch.py +123 -0
  30. euroeval-15.6.0/src/euroeval/dataset_configs/english.py +88 -0
  31. euroeval-15.6.0/src/euroeval/dataset_configs/faroese.py +53 -0
  32. euroeval-15.6.0/src/euroeval/dataset_configs/french.py +83 -0
  33. euroeval-15.6.0/src/euroeval/dataset_configs/german.py +91 -0
  34. euroeval-15.6.0/src/euroeval/dataset_configs/icelandic.py +148 -0
  35. euroeval-15.6.0/src/euroeval/dataset_configs/italian.py +81 -0
  36. euroeval-15.6.0/src/euroeval/dataset_configs/norwegian.py +178 -0
  37. euroeval-15.6.0/src/euroeval/dataset_configs/spanish.py +78 -0
  38. euroeval-15.6.0/src/euroeval/dataset_configs/swedish.py +100 -0
  39. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/exceptions.py +10 -10
  40. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/finetuning.py +6 -10
  41. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/generation.py +1 -0
  42. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/human_evaluation.py +2 -2
  43. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/languages.py +20 -13
  44. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/model_cache.py +1 -1
  45. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/model_loading.py +1 -12
  46. euroeval-15.6.0/src/euroeval/prompt_templates/__init__.py +8 -0
  47. euroeval-15.6.0/src/euroeval/prompt_templates/linguistic_acceptability.py +112 -0
  48. euroeval-15.6.0/src/euroeval/prompt_templates/multiple_choice.py +97 -0
  49. euroeval-15.6.0/src/euroeval/prompt_templates/named_entity_recognition.py +257 -0
  50. euroeval-15.6.0/src/euroeval/prompt_templates/reading_comprehension.py +118 -0
  51. euroeval-15.6.0/src/euroeval/prompt_templates/sentiment_classification.py +137 -0
  52. euroeval-15.6.0/src/euroeval/prompt_templates/summarization.py +97 -0
  53. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/speed_benchmark.py +1 -1
  54. {euroeval-15.4.2/src/euroeval/task_utils → euroeval-15.6.0/src/euroeval/task_group_utils}/multiple_choice_classification.py +19 -11
  55. {euroeval-15.4.2/src/euroeval/task_utils → euroeval-15.6.0/src/euroeval/task_group_utils}/question_answering.py +31 -30
  56. {euroeval-15.4.2/src/euroeval/task_utils → euroeval-15.6.0/src/euroeval/task_group_utils}/sequence_classification.py +45 -10
  57. {euroeval-15.4.2/src/euroeval/task_utils → euroeval-15.6.0/src/euroeval/task_group_utils}/text_to_text.py +1 -1
  58. {euroeval-15.4.2/src/euroeval/task_utils → euroeval-15.6.0/src/euroeval/task_group_utils}/token_classification.py +3 -2
  59. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/tasks.py +54 -0
  60. euroeval-15.6.0/src/euroeval/tokenization_utils.py +343 -0
  61. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/types.py +3 -1
  62. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/utils.py +5 -254
  63. {euroeval-15.4.2 → euroeval-15.6.0}/tests/conftest.py +16 -4
  64. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_benchmarker.py +42 -33
  65. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_callbacks.py +2 -1
  66. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_data_loading.py +2 -2
  67. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_data_models.py +4 -0
  68. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_finetuning.py +2 -1
  69. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_model_loading.py +4 -4
  70. euroeval-15.4.2/tests/test_utils.py → euroeval-15.6.0/tests/test_tokenization_utils.py +3 -68
  71. euroeval-15.6.0/tests/test_utils.py +67 -0
  72. {euroeval-15.4.2 → euroeval-15.6.0}/uv.lock +822 -739
  73. euroeval-15.4.2/src/euroeval/dataset_configs.py +0 -2408
  74. {euroeval-15.4.2 → euroeval-15.6.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +0 -0
  75. {euroeval-15.4.2 → euroeval-15.6.0}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
  76. {euroeval-15.4.2 → euroeval-15.6.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
  77. {euroeval-15.4.2 → euroeval-15.6.0}/CITATION.cff +0 -0
  78. {euroeval-15.4.2 → euroeval-15.6.0}/CODE_OF_CONDUCT.md +0 -0
  79. {euroeval-15.4.2 → euroeval-15.6.0}/CONTRIBUTING.md +0 -0
  80. {euroeval-15.4.2 → euroeval-15.6.0}/Dockerfile.cuda +0 -0
  81. {euroeval-15.4.2 → euroeval-15.6.0}/LICENSE +0 -0
  82. {euroeval-15.4.2 → euroeval-15.6.0}/docs/CNAME +0 -0
  83. {euroeval-15.4.2 → euroeval-15.6.0}/docs/README.md +0 -0
  84. {euroeval-15.4.2 → euroeval-15.6.0}/docs/datasets/README.md +0 -0
  85. {euroeval-15.4.2 → euroeval-15.6.0}/docs/datasets/french.md +0 -0
  86. {euroeval-15.4.2 → euroeval-15.6.0}/docs/datasets/german.md +0 -0
  87. {euroeval-15.4.2 → euroeval-15.6.0}/docs/datasets/spanish.md +0 -0
  88. {euroeval-15.4.2 → euroeval-15.6.0}/docs/datasets/swedish.md +0 -0
  89. {euroeval-15.4.2 → euroeval-15.6.0}/docs/extras/radial_plotter.md +0 -0
  90. {euroeval-15.4.2 → euroeval-15.6.0}/docs/faq.md +0 -0
  91. {euroeval-15.4.2 → euroeval-15.6.0}/docs/gfx/favicon.png +0 -0
  92. {euroeval-15.4.2 → euroeval-15.6.0}/docs/leaderboards/Monolingual/danish.md +0 -0
  93. {euroeval-15.4.2 → euroeval-15.6.0}/docs/leaderboards/Monolingual/dutch.md +0 -0
  94. {euroeval-15.4.2 → euroeval-15.6.0}/docs/leaderboards/Monolingual/english.md +0 -0
  95. {euroeval-15.4.2 → euroeval-15.6.0}/docs/leaderboards/Monolingual/faroese.md +0 -0
  96. {euroeval-15.4.2 → euroeval-15.6.0}/docs/leaderboards/Monolingual/french.md +0 -0
  97. {euroeval-15.4.2 → euroeval-15.6.0}/docs/leaderboards/Monolingual/german.md +0 -0
  98. {euroeval-15.4.2 → euroeval-15.6.0}/docs/leaderboards/Monolingual/icelandic.md +0 -0
  99. {euroeval-15.4.2 → euroeval-15.6.0}/docs/leaderboards/Monolingual/italian.md +0 -0
  100. {euroeval-15.4.2 → euroeval-15.6.0}/docs/leaderboards/Monolingual/norwegian.md +0 -0
  101. {euroeval-15.4.2 → euroeval-15.6.0}/docs/leaderboards/Monolingual/swedish.md +0 -0
  102. {euroeval-15.4.2 → euroeval-15.6.0}/docs/leaderboards/Multilingual/european.md +0 -0
  103. {euroeval-15.4.2 → euroeval-15.6.0}/docs/leaderboards/Multilingual/germanic.md +0 -0
  104. {euroeval-15.4.2 → euroeval-15.6.0}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
  105. {euroeval-15.4.2 → euroeval-15.6.0}/docs/leaderboards/Multilingual/romance.md +0 -0
  106. {euroeval-15.4.2 → euroeval-15.6.0}/docs/leaderboards/README.md +0 -0
  107. {euroeval-15.4.2 → euroeval-15.6.0}/docs/methodology.md +0 -0
  108. {euroeval-15.4.2 → euroeval-15.6.0}/docs/python-package.md +0 -0
  109. {euroeval-15.4.2 → euroeval-15.6.0}/docs/tasks/README.md +0 -0
  110. {euroeval-15.4.2 → euroeval-15.6.0}/docs/tasks/common-sense-reasoning.md +0 -0
  111. {euroeval-15.4.2 → euroeval-15.6.0}/docs/tasks/knowledge.md +0 -0
  112. {euroeval-15.4.2 → euroeval-15.6.0}/docs/tasks/linguistic-acceptability.md +0 -0
  113. {euroeval-15.4.2 → euroeval-15.6.0}/docs/tasks/named-entity-recognition.md +0 -0
  114. {euroeval-15.4.2 → euroeval-15.6.0}/docs/tasks/reading-comprehension.md +0 -0
  115. {euroeval-15.4.2 → euroeval-15.6.0}/docs/tasks/sentiment-classification.md +0 -0
  116. {euroeval-15.4.2 → euroeval-15.6.0}/docs/tasks/speed.md +0 -0
  117. {euroeval-15.4.2 → euroeval-15.6.0}/docs/tasks/summarization.md +0 -0
  118. {euroeval-15.4.2 → euroeval-15.6.0}/gfx/euroeval.png +0 -0
  119. {euroeval-15.4.2 → euroeval-15.6.0}/gfx/euroeval.xcf +0 -0
  120. {euroeval-15.4.2 → euroeval-15.6.0}/gfx/scandeval.png +0 -0
  121. {euroeval-15.4.2 → euroeval-15.6.0}/mkdocs.yaml +0 -0
  122. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/benchmark_config_factory.py +0 -0
  123. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/benchmark_modules/__init__.py +0 -0
  124. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/cli.py +0 -0
  125. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/data_loading.py +0 -0
  126. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/enums.py +0 -0
  127. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/model_config.py +0 -0
  128. {euroeval-15.4.2 → euroeval-15.6.0}/src/euroeval/scores.py +0 -0
  129. {euroeval-15.4.2/src/euroeval/task_utils → euroeval-15.6.0/src/euroeval/task_group_utils}/__init__.py +0 -0
  130. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/constants.py +0 -0
  131. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_allocine.py +0 -0
  132. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_angry_tweets.py +0 -0
  133. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_arc.py +0 -0
  134. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_arc_is.py +0 -0
  135. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_belebele.py +0 -0
  136. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_cnn_dailymail.py +0 -0
  137. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_conll_en.py +0 -0
  138. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_conll_es.py +0 -0
  139. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_conll_nl.py +0 -0
  140. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_dane.py +0 -0
  141. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_danish_citizen_tests.py +0 -0
  142. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_dansk.py +0 -0
  143. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_danske_talemaader.py +0 -0
  144. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_danske_talemaader_old.py +0 -0
  145. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_dbrd.py +0 -0
  146. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_dutch_cola.py +0 -0
  147. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_dutch_social.py +0 -0
  148. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_eltec.py +0 -0
  149. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_fone.py +0 -0
  150. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_foqa.py +0 -0
  151. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_fosent.py +0 -0
  152. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_fquad.py +0 -0
  153. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_germanquad.py +0 -0
  154. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_germeval.py +0 -0
  155. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_hellaswag.py +0 -0
  156. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
  157. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_ice_linguistic.py +0 -0
  158. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_icelandic_error_corpus.py +0 -0
  159. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_icelandic_knowledge.py +0 -0
  160. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_icelandic_qa.py +0 -0
  161. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_icesum.py +0 -0
  162. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_ilpost_sum.py +0 -0
  163. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_jentoft.py +0 -0
  164. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_mim_gold_ner.py +0 -0
  165. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_mlqa_es.py +0 -0
  166. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_mlsum_de.py +0 -0
  167. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_mlsum_es.py +0 -0
  168. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_mmlu.py +0 -0
  169. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_multinerd-it.py +0 -0
  170. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_no_cola.py +0 -0
  171. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_no_sammendrag.py +0 -0
  172. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_nor_common_sense_qa.py +0 -0
  173. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_nordjylland_news.py +0 -0
  174. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_norec.py +0 -0
  175. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_norglm_multiqa.py +0 -0
  176. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_norglm_multisum.py +0 -0
  177. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_norne.py +0 -0
  178. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_norquad.py +0 -0
  179. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_nqii.py +0 -0
  180. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_nrk_quiz_qa.py +0 -0
  181. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_orange_sum.py +0 -0
  182. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_personal_sum.py +0 -0
  183. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_rrn.py +0 -0
  184. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_sb10k.py +0 -0
  185. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_scala.py +0 -0
  186. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_scandiqa.py +0 -0
  187. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_schibsted.py +0 -0
  188. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_sentiment_headlines_es.py +0 -0
  189. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_sentipolc16.py +0 -0
  190. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_squad.py +0 -0
  191. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_squad_it.py +0 -0
  192. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_squad_nl.py +0 -0
  193. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_squad_nl_old.py +0 -0
  194. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_sst5.py +0 -0
  195. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_suc3.py +0 -0
  196. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_swedn.py +0 -0
  197. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_swerec.py +0 -0
  198. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_wiki_lingua_nl.py +0 -0
  199. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_wikiann_fo.py +0 -0
  200. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_wikineural-it.py +0 -0
  201. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_winogrande_is.py +0 -0
  202. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/create_xquad_es.py +0 -0
  203. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/fix_dot_env_file.py +0 -0
  204. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/load_ud_pos.py +0 -0
  205. {euroeval-15.4.2 → euroeval-15.6.0}/src/scripts/versioning.py +0 -0
  206. {euroeval-15.4.2 → euroeval-15.6.0}/tests/__init__.py +0 -0
  207. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_benchmark_config_factory.py +0 -0
  208. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_benchmark_modules/__init__.py +0 -0
  209. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_benchmark_modules/test_base.py +0 -0
  210. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_benchmark_modules/test_fresh.py +0 -0
  211. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_benchmark_modules/test_hf.py +0 -0
  212. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_benchmark_modules/test_litellm.py +0 -0
  213. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_benchmark_modules/test_vllm.py +0 -0
  214. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_cli.py +0 -0
  215. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_constants.py +0 -0
  216. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_dataset_configs.py +0 -0
  217. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_enums.py +0 -0
  218. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_exceptions.py +0 -0
  219. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_generation.py +0 -0
  220. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_human_evaluation.py +0 -0
  221. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_languages.py +0 -0
  222. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_model_cache.py +0 -0
  223. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_model_config.py +0 -0
  224. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_scores.py +0 -0
  225. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_speed_benchmark.py +0 -0
  226. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_task_utils/__init__.py +0 -0
  227. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_task_utils/test_question_answering.py +0 -0
  228. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_task_utils/test_sequence_classification.py +0 -0
  229. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_task_utils/test_text_to_text.py +0 -0
  230. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_task_utils/test_token_classification.py +0 -0
  231. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_tasks.py +0 -0
  232. {euroeval-15.4.2 → euroeval-15.6.0}/tests/test_types.py +0 -0
@@ -8,7 +8,7 @@ body:
  - type: input
  attributes:
  label: Model ID
- description: What is the Hugging Face model ID?
+ description: What is the model ID, either on the Hugging Face Hub or on LiteLLM?
  validations:
  required: true
  - type: checkboxes
@@ -18,17 +18,9 @@ body:
  What languages should this model be evaluated on? Tick all that apply. If the
  model is multilingual (e.g., Mistral, Llama), then tick all the languages.
  options:
- - label: Danish
- - label: Dutch
- - label: English
- - label: Faroese
- - label: French
- - label: German
- - label: Icelandic
- - label: Italian
- - label: Norwegian (Bokmål or Nynorsk)
- - label: Spanish
- - label: Swedish
+ - label: Romance languages (French, Italian, Spanish)
+ - label: Scandinavian languages (Danish, Faroese, Icelandic, Norwegian, Swedish)
+ - label: West Germanic languages (Dutch, English, German)
  validations:
  required: true
  - type: dropdown
@@ -48,6 +40,7 @@ body:
  options:
  - Small (<=8B parameters)
  - Large (>8B parameters)
+ - N/A
  validations:
  required: true
  - type: dropdown
@@ -57,6 +50,7 @@ body:
  options:
  - Not a merged model
  - Merged model
+ - N/A
  validations:
  required: true
  - type: markdown
@@ -89,6 +89,8 @@ jobs:
  HF_TOKEN: ${{ secrets.HUGGINGFACE_API_KEY }}
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+ GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
+ XAI_API_KEY: ${{ secrets.XAI_API_KEY }}

  - name: Delete EuroEval cache
  run: rm -rf .euroeval_cache
@@ -115,3 +115,7 @@ site/

  # Helper files for docs
  docs/datasets/dataset_example_commands.txt
+
+ # Various graphics
+ gfx/euroeval-italian.png
+ gfx/euroeval-italian.xcf
@@ -10,7 +10,7 @@ repos:
  - id: trailing-whitespace
  - id: debug-statements
  - repo: https://github.com/astral-sh/ruff-pre-commit
- rev: v0.11.2
+ rev: v0.11.5
  hooks:
  - id: ruff
  args:
@@ -10,6 +10,79 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.



+ ## [v15.6.0] - 2025-04-13
+ ### Added
+ - We now support specifying custom inference providers when benchmarking via the Hugging
+ Face inference APIs. This can be done by specifying the model as
+ `huggingface/<inference-provider>/<organisation>/<model>`, as described in [these
+ LiteLLM docs](https://docs.litellm.ai/docs/providers/huggingface).
+
+ ### Changed
+ - Updated `transformers` to `>=4.51.0`, which includes support for Llama-4, Phi-4,
+ Deepseek-v3 and Qwen3. This also includes the `image-text-to-text` pipeline tag
+ properly, so that we do not have to use a custom fix for it anymore.
+ - Updated `vllm` to `>=0.8.3`, which includes support for Llama-4.
+ - Set the maximum amount of logprobs for generative models to 8, as that is the upper
+ bound for xAI models.
+ - When benchmarking Ollama models, if the model is not found, we now also check if the
+ model exists if prefixed with 'hf.co/'.
+ - Uniformised the prompt templates used for each task, so that they are more
+ consistent across tasks. Evaluation tests across different model types and sizes show
+ no significant performance difference between the new and old templates. This was
+ contributed by [@viggo-gascou](https://github.com/viggo-gascou) ✨
+
+ ### Fixed
+ - Avoid duplicate error messages when a rate limit occurs.
+ - ModernBERT models cannot be used on a CPU, which caused an error in our check for
+ maximal context length. In this case we simply skip this check and use the reported
+ maximal context length as-is.
+ - Fixed issue with benchmarking multiple generative models in the same evaluation
+ command. This was caused by vLLM and Ray not being able to release GPU memory
+ properly, but this seems to be released properly now.
+ - Now only logs when encoder models are being benchmarked on generative tasks if the
+ `--verbose` flag is set (or `verbose=True` in the `Benchmarker` API).
+ - All Spanish NER datasets were mistakenly marked as unofficial. The `conll-es` is now
+ marked as official.
+
+
+ ## [v15.5.0] - 2025-04-07
+ ### Added
+ - Now allows supplying a parameter to API models, which is done by using
+ `<model-id>@<parameter>` as the model ID (only a single parameter is supported). The
+ parameters allowed are "low" and "high" for OpenAI models (which is the reasoning
+ effort of the model, supported by the o1- and o3-series, default is "medium"), and
+ "thinking" for Anthropic models, to enable thinking mode (supported for
+ Claude-Sonnet-3.7+). These will appear in the leaderboards as
+ `<model-id>@<parameter>`.
+ - Added metadata for Google Gemini and xAI Grok models.
+ - Allows all vLLM versions from v0.8.0 again, as the issue with the generation output
+ has been resolved.
+ - Added overall progress indicator during evaluation. This was contributed by
+ [@mathiasesn](https://github.com/mathiasesn) ✨
+
+ ### Changed
+ - Now does not use logprobs in text classification tasks with Google VertexAI models, as
+ they heavily rate limit logprobs usage. This shouldn't affect the scores significantly
+ in any case, as the models are very confident in their predictions.
+ - Updated `litellm` to `>=1.63.0`, allowing better support for reasoning models.
+
+ ### Fixed
+ - The Gemini-2.5-pro model uses different error messages than the other Gemini models,
+ which caused an error when evaluating it. This has been fixed now.
+ - Now registers the Gemini-2.5-pro model series as reasoning models, as otherwise they
+ did not generate any text as they were just generating reasoning tokens.
+ - Previously, if there were multiple labels whose first tokens were identical and that
+ the (generative) model did not output the label as the first output token, we would
+ randomly choose one of the labels, resulting in an evaluation error. This is very
+ rare, but *does* happen for very particular (model, dataset) pairs. If we are in this
+ case, we now resort to choosing the label with closest word edit distance instead of
+ relying on logprobs of the first token.
+ - Now defaults to BF16 if the model is registered as using FP32, assuming that BF16 is
+ supported by the GPU.
+ - Improved model existence pipeline for Ollama model IDs with multiple forward slashes
+ in the name, which caused some models to not be detected as existing.
+
+
  ## [v15.4.2] - 2025-03-31
  ### Added
  - Now added version metadata to results, to easier track which versions of the various
@@ -23,7 +96,8 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.

  ### Fixed
  - Now uses `fp16` instead of `bf16` when evaluating decoder models on GPUs with CUDA
- compatibility < 8.0. This was contributed by [@marksverdhei](https://github.com/marksverdhei) ✨
+ compatibility < 8.0. This was contributed by
+ [@marksverdhei](https://github.com/marksverdhei) ✨
  - Corrected the name of the French sentiment dataset AlloCiné. This was contributed by
  [@Alkarex](https://github.com/Alkarex) ✨
  - Evaluating a specific model revision did not work for adapter models, as there was a
@@ -50,7 +124,8 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  as the API sometimes fails. If it still fails after 3 attempts, we raise the
  `HuggingFaceHubDown` exception.
  - Now uses `fp16` instead of `bf16` when evaluating decoder models on GPUs with CUDA
- compatibility < 8.0. This was contributed by [@marksverdhei](https://github.com/marksverdhei) ✨
+ compatibility < 8.0. This was contributed by
+ [@marksverdhei](https://github.com/marksverdhei) ✨
  - Fixed docs for ScandiQA-da and ScandiQA-sv, where it was incorrectly stated that
  the splits were made by considering the original train/validation/test splits.

@@ -118,18 +193,17 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  ## [v15.3.0] - 2025-03-12
  ### Added
  - Added support for evaluating Italian 🇮🇹! This includes the reading comprehension
- dataset [SQuAD-it](https://hf.co/datasets/crux82/squad_it), the summarization
- dataset [IlPost](https://hf.co/datasets/ARTeLab/ilpost), the sentiment
- classification
- [Sentipolc-16](https://hf.co/datasets/cardiffnlp/tweet_sentiment_multilingual),
- the common-sense reasoning dataset
- [HellaSwag-it](https://hf.co/datasets/alexandrainst/m_hellaswag), the linguistic acceptability
- dataset ScaLA with the [Italian Universal Dependencies
+ dataset [SQuAD-it](https://hf.co/datasets/crux82/squad_it), the summarization dataset
+ [IlPost](https://hf.co/datasets/ARTeLab/ilpost), the sentiment classification
+ [Sentipolc-16](https://hf.co/datasets/cardiffnlp/tweet_sentiment_multilingual), the
+ common-sense reasoning dataset
+ [HellaSwag-it](https://hf.co/datasets/alexandrainst/m_hellaswag), the linguistic
+ acceptability dataset ScaLA with the [Italian Universal Dependencies
  treebank](https://github.com/UniversalDependencies/UD_Italian-ISDT), the knowledge
  dataset [MMLU-it](https://hf.co/datasets/alexandrainst/m_mmlu), and the named entity
- recognition dataset [MultiNERD
- IT](https://hf.co/datasets/Babelscape/multinerd) (and unofficially
- [WikiNEuRal IT](https://hf.co/datasets/Babelscape/wikineural)). This was contributed by [@viggo-gascou](https://github.com/viggo-gascou) ✨
+ recognition dataset [MultiNERD IT](https://hf.co/datasets/Babelscape/multinerd) (and
+ unofficially [WikiNEuRal IT](https://hf.co/datasets/Babelscape/wikineural)). This was
+ contributed by [@viggo-gascou](https://github.com/viggo-gascou) ✨
  - Added the new Norwegian knowledge dataset NRK-Quiz-QA, consisting of quizzes on the
  Norwegian language and culture, in both Bokmål and Nynorsk. The dataset has been split
  into 635 / 256 / 2,048 samples for train, val, and test, respectively. This replaces
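For quick reference, here is a hedged sketch of how the new model-ID forms described in the v15.6.0 and v15.5.0 changelog entries above could be passed to the CLI. The `euroeval --model <model-id> --dataset <dataset>` form is the one shown in the README; the concrete model IDs and the `dbrd` dataset below are illustrative placeholders, not values confirmed by this release.

```shell
# Hypothetical invocations; the model IDs below are illustrative placeholders.

# Custom Hugging Face inference provider (v15.6.0), following the
# huggingface/<inference-provider>/<organisation>/<model> format from the LiteLLM docs:
$ euroeval --model huggingface/<inference-provider>/<organisation>/<model> --dataset dbrd

# Single parameter appended to an API model ID (v15.5.0): "low"/"high" reasoning effort
# for OpenAI o1-/o3-series models, or "thinking" for Anthropic Claude-Sonnet-3.7+ models:
$ euroeval --model o3-mini@high --dataset dbrd
$ euroeval --model claude-3-7-sonnet-20250219@thinking --dataset dbrd
```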
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: EuroEval
- Version: 15.4.2
+ Version: 15.6.0
  Summary: The robust European language model benchmark.
  Project-URL: Repository, https://github.com/EuroEval/EuroEval
  Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -35,9 +35,9 @@ Requires-Dist: click>=8.1.3
  Requires-Dist: datasets>=2.15.0
  Requires-Dist: demjson3>=3.0.6
  Requires-Dist: evaluate>=0.4.1
- Requires-Dist: huggingface-hub>=0.24.0
+ Requires-Dist: huggingface-hub>=0.30.1
  Requires-Dist: levenshtein>=0.24.0
- Requires-Dist: litellm>=1.61.13
+ Requires-Dist: litellm>=1.63.0
  Requires-Dist: more-itertools>=10.5.0
  Requires-Dist: numpy<2.0.0,>=1.23.0
  Requires-Dist: ollama>=0.4.7
@@ -56,18 +56,18 @@ Requires-Dist: setuptools>=75.8.2
  Requires-Dist: tenacity>=9.0.0
  Requires-Dist: termcolor>=2.0.0
  Requires-Dist: torch>=2.6.0
- Requires-Dist: transformers>=4.50.0
+ Requires-Dist: transformers>=4.51.0
  Provides-Extra: all
  Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'all'
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
  Requires-Dist: gradio>=4.26.0; extra == 'all'
  Requires-Dist: outlines>=0.1.11; extra == 'all'
- Requires-Dist: vllm==0.8.0; (platform_system == 'Linux') and extra == 'all'
+ Requires-Dist: vllm>=0.8.3; (platform_system == 'Linux') and extra == 'all'
  Provides-Extra: generative
  Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
  Requires-Dist: outlines>=0.1.11; extra == 'generative'
- Requires-Dist: vllm==0.8.0; (platform_system == 'Linux') and extra == 'generative'
+ Requires-Dist: vllm>=0.8.3; (platform_system == 'Linux') and extra == 'generative'
  Provides-Extra: human-evaluation
  Requires-Dist: gradio>=4.26.0; extra == 'human-evaluation'
  Provides-Extra: test
@@ -89,7 +89,7 @@ ______________________________________________________________________
  [![Second paper](https://img.shields.io/badge/arXiv-2406.13469-b31b1b.svg)](https://arxiv.org/abs/2406.13469)
  [![License](https://img.shields.io/github/license/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/blob/main/LICENSE)
  [![LastCommit](https://img.shields.io/github/last-commit/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/commits/main)
- [![Code Coverage](https://img.shields.io/badge/Coverage-65%25-yellow.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
+ [![Code Coverage](https://img.shields.io/badge/Coverage-67%25-yellow.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
  [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg)](https://github.com/EuroEval/EuroEval/blob/main/CODE_OF_CONDUCT.md)


@@ -206,7 +206,9 @@ sentiment-classification`.


  ### Reproducing the datasets
- All datasets used in this project are generated using the scripts located in the [src/scripts](src/scripts) folder. To reproduce a dataset, run the corresponding script with the following command
+ All datasets used in this project are generated using the scripts located in the
+ [src/scripts](src/scripts) folder. To reproduce a dataset, run the corresponding script
+ with the following command

  ```shell
  $ uv run src/scripts/<name-of-script>.py
@@ -218,7 +220,27 @@ Replace <name-of-script> with the specific script you wish to execute, e.g.,
  $ uv run src/scripts/create_allocine.py
  ```

- ## Special Thanks :pray:
+ ## Contributors :pray:
+
+ A huge thank you to all the contributors who have helped make this project a success!
+
+ <a href="https://github.com/peter-sk"><img src="https://avatars.githubusercontent.com/u/6168908" width=50 alt="Contributor avatar for peter-sk"/></a>
+ <a href="https://github.com/AJDERS"><img src="https://avatars.githubusercontent.com/u/38854604" width=50 alt="Contributor avatar for AJDERS"/></a>
+ <a href="https://github.com/oliverkinch"><img src="https://avatars.githubusercontent.com/u/71556498" width=50 alt="Contributor avatar for oliverkinch"/></a>
+ <a href="https://github.com/versae"><img src="https://avatars.githubusercontent.com/u/173537" width=50 alt="Contributor avatar for versae"/></a>
+ <a href="https://github.com/viggo-gascou"><img src="https://avatars.githubusercontent.com/u/94069687" width=50 alt="Contributor avatar for viggo-gascou"/></a>
+ <a href="https://github.com/mathiasesn"><img src="https://avatars.githubusercontent.com/u/27091759" width=50 alt="Contributor avatar for mathiasesn"/></a>
+ <a href="https://github.com/Alkarex"><img src="https://avatars.githubusercontent.com/u/1008324" width=50 alt="Contributor avatar for Alkarex"/></a>
+ <a href="https://github.com/marksverdhei"><img src="https://avatars.githubusercontent.com/u/46672778" width=50 alt="Contributor avatar for marksverdhei"/></a>
+ <a href="https://github.com/Mikeriess"><img src="https://avatars.githubusercontent.com/u/19728563" width=50 alt="Contributor avatar for Mikeriess"/></a>
+ <a href="https://github.com/pakagronglb"><img src="https://avatars.githubusercontent.com/u/178713124" width=50 alt="Contributor avatar for pakagronglb"/></a>
+ <a href="https://github.com/ThomasKluiters"><img src="https://avatars.githubusercontent.com/u/8137941" width=50 alt="Contributor avatar for ThomasKluiters"/></a>
+ <a href="https://github.com/BramVanroy"><img src="https://avatars.githubusercontent.com/u/2779410" width=50 alt="Contributor avatar for BramVanroy"/></a>
+ <a href="https://github.com/peregilk"><img src="https://avatars.githubusercontent.com/u/9079808" width=50 alt="Contributor avatar for peregilk"/></a>
+
+ ### Special Thanks
+ - Thanks to [Google](https://google.com/) for sponsoring Gemini credits as part of their
+ [Google Cloud for Researchers Program](https://cloud.google.com/edu/researchers).
  - Thanks [@Mikeriess](https://github.com/Mikeriess) for evaluating many of the larger
  models on the leaderboards.
  - Thanks to [OpenAI](https://openai.com/) for sponsoring OpenAI credits as part of their
@@ -13,7 +13,7 @@ ______________________________________________________________________
  [![Second paper](https://img.shields.io/badge/arXiv-2406.13469-b31b1b.svg)](https://arxiv.org/abs/2406.13469)
  [![License](https://img.shields.io/github/license/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/blob/main/LICENSE)
  [![LastCommit](https://img.shields.io/github/last-commit/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/commits/main)
- [![Code Coverage](https://img.shields.io/badge/Coverage-65%25-yellow.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
+ [![Code Coverage](https://img.shields.io/badge/Coverage-67%25-yellow.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
  [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg)](https://github.com/EuroEval/EuroEval/blob/main/CODE_OF_CONDUCT.md)


@@ -130,7 +130,9 @@ sentiment-classification`.


  ### Reproducing the datasets
- All datasets used in this project are generated using the scripts located in the [src/scripts](src/scripts) folder. To reproduce a dataset, run the corresponding script with the following command
+ All datasets used in this project are generated using the scripts located in the
+ [src/scripts](src/scripts) folder. To reproduce a dataset, run the corresponding script
+ with the following command

  ```shell
  $ uv run src/scripts/<name-of-script>.py
@@ -142,7 +144,27 @@ Replace <name-of-script> with the specific script you wish to execute, e.g.,
  $ uv run src/scripts/create_allocine.py
  ```

- ## Special Thanks :pray:
+ ## Contributors :pray:
+
+ A huge thank you to all the contributors who have helped make this project a success!
+
+ <a href="https://github.com/peter-sk"><img src="https://avatars.githubusercontent.com/u/6168908" width=50 alt="Contributor avatar for peter-sk"/></a>
+ <a href="https://github.com/AJDERS"><img src="https://avatars.githubusercontent.com/u/38854604" width=50 alt="Contributor avatar for AJDERS"/></a>
+ <a href="https://github.com/oliverkinch"><img src="https://avatars.githubusercontent.com/u/71556498" width=50 alt="Contributor avatar for oliverkinch"/></a>
+ <a href="https://github.com/versae"><img src="https://avatars.githubusercontent.com/u/173537" width=50 alt="Contributor avatar for versae"/></a>
+ <a href="https://github.com/viggo-gascou"><img src="https://avatars.githubusercontent.com/u/94069687" width=50 alt="Contributor avatar for viggo-gascou"/></a>
+ <a href="https://github.com/mathiasesn"><img src="https://avatars.githubusercontent.com/u/27091759" width=50 alt="Contributor avatar for mathiasesn"/></a>
+ <a href="https://github.com/Alkarex"><img src="https://avatars.githubusercontent.com/u/1008324" width=50 alt="Contributor avatar for Alkarex"/></a>
+ <a href="https://github.com/marksverdhei"><img src="https://avatars.githubusercontent.com/u/46672778" width=50 alt="Contributor avatar for marksverdhei"/></a>
+ <a href="https://github.com/Mikeriess"><img src="https://avatars.githubusercontent.com/u/19728563" width=50 alt="Contributor avatar for Mikeriess"/></a>
+ <a href="https://github.com/pakagronglb"><img src="https://avatars.githubusercontent.com/u/178713124" width=50 alt="Contributor avatar for pakagronglb"/></a>
+ <a href="https://github.com/ThomasKluiters"><img src="https://avatars.githubusercontent.com/u/8137941" width=50 alt="Contributor avatar for ThomasKluiters"/></a>
+ <a href="https://github.com/BramVanroy"><img src="https://avatars.githubusercontent.com/u/2779410" width=50 alt="Contributor avatar for BramVanroy"/></a>
+ <a href="https://github.com/peregilk"><img src="https://avatars.githubusercontent.com/u/9079808" width=50 alt="Contributor avatar for peregilk"/></a>
+
+ ### Special Thanks
+ - Thanks to [Google](https://google.com/) for sponsoring Gemini credits as part of their
+ [Google Cloud for Researchers Program](https://cloud.google.com/edu/researchers).
  - Thanks [@Mikeriess](https://github.com/Mikeriess) for evaluating many of the larger
  models on the leaderboards.
  - Thanks to [OpenAI](https://openai.com/) for sponsoring OpenAI credits as part of their
@@ -450,12 +450,14 @@ Here are a few examples from the training split:
  {
  "text": "Hvilket af følgende områder har kommunerne ansvaret for driften af?\nSvarmuligheder:\na. Domstole\nb. Vuggestuer\nc. Sygehuse",
  "label": "b"
- }```
+ }
+ ```
  ```json
  {
  "text": "Hvilken organisation blev Danmark medlem af i 1945?\nSvarmuligheder:\na. Verdenshandelsorganisationen (WTO)\nb. Den Europæiske Union (EU)\nc. De Forenede Nationer (FN)",
  "label": "c"
- }```
+ }
+ ```

  When evaluating generative models, we use the following setup (see the
  [methodology](/methodology) for more information on how these are used):
@@ -133,7 +133,7 @@ $ euroeval --model <model-id> --dataset dbrd

  ## Named Entity Recognition

- ### CoNLL-2002-nl
+ ### CoNLL-nl

  This dataset was published in [this paper](https://aclanthology.org/W02-2024/) and
  consists of named entity recognition annotations of the Belgian newspaper "De Morgen" of
@@ -81,7 +81,7 @@ $ euroeval --model <model-id> --dataset sst5

  ## Named Entity Recognition

- ### CoNLL-2003-En
+ ### CoNLL-en

  This dataset was published in [this paper](https://aclanthology.org/W03-0419/) and was
  part of the CoNNL-2003 shared task. The data comes from the [Reuters
@@ -282,10 +282,10 @@ $ euroeval --model <model-id> --dataset scala-fo

  ### FoQA

- This dataset will be published in an upcoming paper and is based on the Faroese
- Wikipedia. The questions and answers were automatically generated using GPT-4-turbo,
- which were verified by a native speaker, and some of them were also corrected by the
- same native speaker.
+ This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2502.07642)
+ and is based on the Faroese Wikipedia. The questions and answers were automatically
+ generated using GPT-4-turbo, which were verified by a native speaker, and some of them
+ were also corrected by the same native speaker.

  The original full dataset consists of 2,000 samples, and we split these into 848 / 128 /
  1,024 samples for training, validation and testing, respectively.
@@ -9,9 +9,9 @@ information about what these constitute.

  ### Hotter and Colder Sentiment

- This dataset is being published in an upcoming paper, and consists of texts from
- Icelandic blog post, annotated with sentiment labels (and many others) via a
- crowdsourcing platform.
+ This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2502.16987),
+ and consists of texts from Icelandic blog post, annotated with sentiment labels (and
+ many others) via a crowdsourcing platform.

  The original full dataset consists of 2,901 samples, and we use a 1,021 / 255 / 1,607
  split for training, validation and testing, respectively (so all samples are used in
@@ -73,13 +73,14 @@ $ euroeval --model <model-id> --dataset hotter-and-colder-sentiment

  ### MIM-GOLD-NER

- This dataset was published in [this paper]() and is based on the [Tagged Icelandic
- Corpus (MIM)](https://clarin.is/en/resources/mim/), which consists of Icelandic books,
- news articles, periodicals, parliament speeches, legal texts, adjudications and
- government websites. It has been annotated with named entities in a semi-automated
- fashion, where each labels has been manually verified. The entity types in the dataset
- is a superset of the CoNLL-2003 tags, with the following additional labels: `DATE`,
- `TIME`, `MONEY`, `PERCENT`. These labels have been removed.
+ This dataset was published in [this
+ paper](https://repository.clarin.is/repository/xmlui/handle/20.500.12537/230) and is
+ based on the [Tagged Icelandic Corpus (MIM)](https://clarin.is/en/resources/mim/), which
+ consists of Icelandic books, news articles, periodicals, parliament speeches, legal
+ texts, adjudications and government websites. It has been annotated with named entities
+ in a semi-automated fashion, where each labels has been manually verified. The entity
+ types in the dataset is a superset of the CoNLL-2003 tags, with the following additional
+ labels: `DATE`, `TIME`, `MONEY`, `PERCENT`. These labels have been removed.

  The original full dataset consists of 1,000,000 tokens. We use a 1,024 / 256 / 2,048
  split for training, validation and testing, respectively.
@@ -526,17 +527,20 @@ Here are a few examples from the training split:
  {
  "text": "Hver var talinn heilagur maður eftir dauða sinn, er tákngervingur alþýðuhreyfingar vestanlands og talinn góður til áheita?\nSvarmöguleikar:\na. Þórður Jónsson helgi\nb. Guðmundur Arason\nc. Snorri Þorgrímsson\nd. Jón Hreggviðsson",
  "label": "a"
- }```
+ }
+ ```
  ```json
  {
  "text": "Í kringum hvaða ár hófst verslun á Arngerðareyri?\nSvarmöguleikar:\na. 1895\nb. 1884\nc. 1870\nd. 1902",
  "label": "b"
- }```
+ }
+ ```
  ```json
  {
  "text": "Hvenær var ákveðið að uppstigningardagur skyldi vera kirkjudagur aldraðra á Íslandi?\nSvarmöguleikar:\na. Árið 1975\nb. Árið 1985\nc. Árið 1982\nd. Árið 1990",
  "label": "c"
- }```
+ }
+ ```

  When evaluating generative models, we use the following setup (see the
  [methodology](/methodology) for more information on how these are used):
@@ -71,11 +71,10 @@ $ euroeval --model <model-id> --dataset sentipolc16
  ### MultiNERD IT

  This dataset was published in [this
- paper](https://aclanthology.org/2022.findings-naacl.60/) and
- consists of sentences from Wikipedia and Wikinews in 10 different languages. It is an
- extension of the combination of
- (WikiNEuRal)[https://www.github.com/Babelscape/wikineural] and
- (NER4EL)[https://www.github.com/Babelscape/ner4el]. The original test set was created
+ paper](https://aclanthology.org/2022.findings-naacl.60/) and consists of sentences from
+ Wikipedia and Wikinews in 10 different languages. It is an extension of the combination
+ of [WikiNEuRal](https://www.github.com/Babelscape/wikineural) and
+ [NER4EL](https://www.github.com/Babelscape/ner4el). The original test set was created
  from manual annotations, while the training set is based on an automatic annotation
  pipeline.

@@ -519,7 +518,7 @@ $ euroeval --model <model-id> --dataset hellaswag-it

  ## Summarization

- ### IlPost-sum
+ ### IlPost-Sum

  This dataset was published in [this paper](https://www.mdpi.com/2078-2489/13/5/228) and
  consists of news articles from [Il Post](https://www.ilpost.it/). The summaries were
@@ -388,17 +388,20 @@ Here are a few examples from the training split:
  {
  "text": "Vi har hatt krig i nesten ti år. Jeg føler meg noen ganger trist fordi jeg har mistet flere venner og min far på grunn av krigen.",
  "label": "correct"
- }```
+ }
+ ```
  ```json
  {
  "text": "Hvis jeg ikke sier in n genting, kan han spille hele dagen.",
  "label": "incorrect"
- }```
+ }
+ ```
  ```json
  {
  "text": "De føler at samfunnet trenger ikke dem.",
  "label": "incorrect"
- }```
+ }
+ ```

  When evaluating generative models, we use the following setup (see the
  [methodology](/methodology) for more information on how these are used):
@@ -660,17 +663,20 @@ Here are a few examples from the training split:
  {
  "text": "Gunnar har hatt plutselige og sterke smerteanfall siden han var liten gutt. Det var vondt å tisse og det gjorde vondt i ryggen og magen. Det hjalp litt å drikke vann. Reseptbelagte medisiner kan være nødvendig under anfall.\nSvaralternativer:\na. Nyrestein, kronisk\nb. Irritabel tarmsyndrom\nc. Angst\nd. Urinveisinfeksjon",
  "label": "a"
- }```
+ }
+ ```
  ```json
  {
  "text": "80 år gamle Harrison Ford er nok ein gong aktuell i rolla som Indiana Jones. Kva heiter filmen?\nSvaralternativer:\na. Indiana Jones and the Nasty Nazis\nb. Indiana Jones and the Dial of Destiny\nc. Indiana Jones and the Hunt for Power\nd. Indiana Jones Forever",
  "label": "b"
- }```
+ }
+ ```
  ```json
  {
  "text": "I 1980 måtte denne bassisten overnatte ni netter i fengsel i Japan fordi han prøvde å få med seg ca. 200 gram marihuana inn i landet. Hvem var det?\nSvaralternativer:\na. Sting\nb. Lemmy Kilmister\nc. Paul McCartney\nd. Bootsy Collins",
  "label": "c"
- }```
+ }
+ ```

  When evaluating generative models, we use the following setup (see the
  [methodology](/methodology) for more information on how these are used):
@@ -868,17 +874,20 @@ Here are a few examples from the training split:
  {
  "text": "Hvor er det sannsynlig at en fugl lager hjemmet sitt?\nSvaralternativer:\na. I skogen\nb. I et rede\nc. På taket\nd. På blader\ne. I himmelen",
  "label": "a"
- }```
+ }
+ ```
  ```json
  {
  "text": "Hvis et hjem har et abonnoment, hva får de sannsyneligvis hver dag i posten?\nSvaralternativer:\na. Delestykker\nb. En avis\nc. En gate\nd. En vaskemaskin\ne. Jordas overflate",
  "label": "b"
- }```
+ }
+ ```
  ```json
  {
  "text": "Når du ikke klarer å gjøre noe ferdig, hva feilet du i da?\nSvaralternativer:\na. Å vinne\nb. Å bestå\nc. Å fullfør\nd. Å gjøre det bra\ne. Å lykkes",
  "label": "c"
- }```
+ }
+ ```

  When evaluating generative models, we use the following setup (see the
  [methodology](/methodology) for more information on how these are used):
@@ -56,7 +56,6 @@ install-dependencies:
  @if [ "${NO_FLASH_ATTN}" != "1" ] && [ $$(uname) != "Darwin" ]; then \
  uv pip install --no-build-isolation flash-attn>=2.7.0.post2; \
  fi
- @uv sync -U --only-dev

  setup-environment-variables:
  @uv run python src/scripts/fix_dot_env_file.py
@@ -127,8 +126,7 @@ publish:
  echo "No PyPI API token specified in the '.env' file, so cannot publish."; \
  else \
  echo "Publishing to PyPI..."; \
- $(MAKE) --quiet check \
- && $(MAKE) --quiet publish-euroeval \
+ $(MAKE) --quiet publish-euroeval \
  && $(MAKE) --quiet publish-scandeval \
  && $(MAKE) --quiet publish-docs \
  && $(MAKE) --quiet add-dev-version \
@@ -157,8 +155,8 @@ publish-scandeval:
  fi
  @mv src/scandeval src/euroeval

- publish-major: bump-major publish ## Publish a major version
+ publish-major: install check bump-major publish ## Publish a major version

- publish-minor: bump-minor publish ## Publish a minor version
+ publish-minor: install check bump-minor publish ## Publish a minor version

- publish-patch: bump-patch publish ## Publish a patch version
+ publish-patch: install check bump-patch publish ## Publish a patch version