EuroEval 15.10.1.tar.gz → 15.11.0.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {euroeval-15.10.1 → euroeval-15.11.0}/.pre-commit-config.yaml +1 -1
- {euroeval-15.10.1 → euroeval-15.11.0}/CHANGELOG.md +29 -1
- {euroeval-15.10.1 → euroeval-15.11.0}/CITATION.cff +3 -3
- {euroeval-15.10.1 → euroeval-15.11.0}/LICENSE +1 -1
- {euroeval-15.10.1 → euroeval-15.11.0}/PKG-INFO +10 -10
- {euroeval-15.10.1 → euroeval-15.11.0}/README.md +5 -6
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/README.md +1 -1
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/dutch.md +5 -2
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/english.md +79 -3
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/finnish.md +49 -16
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/french.md +13 -9
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/german.md +8 -5
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/icelandic.md +10 -6
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/italian.md +5 -2
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/norwegian.md +90 -11
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/spanish.md +42 -19
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/swedish.md +8 -5
- euroeval-15.11.0/docs/leaderboards/Monolingual/danish.md +23 -0
- euroeval-15.11.0/docs/leaderboards/Monolingual/dutch.md +23 -0
- euroeval-15.11.0/docs/leaderboards/Monolingual/english.md +23 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/leaderboards/Monolingual/faroese.md +4 -0
- euroeval-15.11.0/docs/leaderboards/Monolingual/finnish.md +23 -0
- euroeval-15.11.0/docs/leaderboards/Monolingual/french.md +23 -0
- euroeval-15.11.0/docs/leaderboards/Monolingual/german.md +23 -0
- euroeval-15.11.0/docs/leaderboards/Monolingual/icelandic.md +23 -0
- euroeval-15.11.0/docs/leaderboards/Monolingual/italian.md +23 -0
- euroeval-15.11.0/docs/leaderboards/Monolingual/norwegian.md +23 -0
- euroeval-15.11.0/docs/leaderboards/Monolingual/spanish.md +23 -0
- euroeval-15.11.0/docs/leaderboards/Monolingual/swedish.md +23 -0
- euroeval-15.11.0/docs/leaderboards/Multilingual/european.md +23 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/leaderboards/Multilingual/germanic.md +8 -0
- euroeval-15.11.0/docs/leaderboards/Multilingual/mainland-scandinavian.md +23 -0
- euroeval-15.11.0/docs/leaderboards/Multilingual/romance.md +23 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/leaderboards/README.md +8 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/pyproject.toml +4 -3
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/__init__.py +7 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/benchmark_modules/base.py +29 -29
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/benchmark_modules/fresh.py +31 -19
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/benchmark_modules/hf.py +27 -23
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/benchmark_modules/litellm.py +50 -30
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/benchmark_modules/vllm.py +21 -25
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/benchmarker.py +1 -1
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/callbacks.py +17 -13
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/data_loading.py +10 -5
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/data_models.py +2 -40
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/english.py +13 -4
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/norwegian.py +8 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/finetuning.py +9 -8
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/generation.py +5 -4
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/generation_utils.py +1 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/human_evaluation.py +13 -13
- euroeval-15.11.0/src/euroeval/metrics.py +452 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/scores.py +14 -19
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/speed_benchmark.py +6 -7
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/task_group_utils/multiple_choice_classification.py +6 -4
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/task_group_utils/question_answering.py +5 -28
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/task_group_utils/sequence_classification.py +6 -30
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/task_group_utils/text_to_text.py +19 -34
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/task_group_utils/token_classification.py +18 -30
- euroeval-15.11.0/src/euroeval/tasks.py +131 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/types.py +6 -4
- euroeval-15.11.0/src/scripts/create_idioms_no.py +254 -0
- euroeval-15.11.0/src/scripts/create_life_in_the_uk.py +145 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/conftest.py +4 -3
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_data_models.py +17 -16
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_scores.py +15 -21
- {euroeval-15.10.1 → euroeval-15.11.0}/uv.lock +5 -1
- euroeval-15.10.1/docs/leaderboards/Monolingual/danish.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Monolingual/dutch.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Monolingual/english.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Monolingual/finnish.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Monolingual/french.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Monolingual/german.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Monolingual/icelandic.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Monolingual/italian.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Monolingual/norwegian.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Monolingual/spanish.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Monolingual/swedish.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Multilingual/european.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Multilingual/romance.md +0 -15
- euroeval-15.10.1/src/euroeval/tasks.py +0 -256
- {euroeval-15.10.1 → euroeval-15.11.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/.github/workflows/ci.yaml +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/.gitignore +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/CODE_OF_CONDUCT.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/CONTRIBUTING.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/Dockerfile.cuda +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/NEW_DATASET_GUIDE.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/CNAME +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/README.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/danish.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/faroese.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/extras/radial_plotter.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/faq.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/gfx/favicon.png +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/methodology.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/python-package.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/tasks/README.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/tasks/common-sense-reasoning.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/tasks/knowledge.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/tasks/linguistic-acceptability.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/tasks/named-entity-recognition.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/tasks/reading-comprehension.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/tasks/sentiment-classification.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/tasks/speed.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/tasks/summarization.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/gfx/euroeval.png +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/gfx/euroeval.xcf +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/gfx/scandeval.png +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/makefile +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/mkdocs.yaml +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/benchmark_config_factory.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/benchmark_modules/__init__.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/cli.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/constants.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/__init__.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/danish.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/dutch.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/faroese.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/finnish.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/french.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/german.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/icelandic.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/italian.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/spanish.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/swedish.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/enums.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/exceptions.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/languages.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/model_cache.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/model_config.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/model_loading.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/prompt_templates/__init__.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/prompt_templates/linguistic_acceptability.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/prompt_templates/multiple_choice.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/prompt_templates/named_entity_recognition.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/prompt_templates/reading_comprehension.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/prompt_templates/sentiment_classification.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/prompt_templates/summarization.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/task_group_utils/__init__.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/tokenization_utils.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/utils.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/constants.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_allocine.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_angry_tweets.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_arc.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_arc_is.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_belebele.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_cnn_dailymail.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_conll_en.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_conll_es.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_conll_nl.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_dane.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_danish_citizen_tests.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_dansk.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_danske_talemaader.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_danske_talemaader_old.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_dbrd.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_dutch_cola.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_eltec.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_fone.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_foqa.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_fosent.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_fquad.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_germanquad.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_germeval.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_hellaswag.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_hellaswag_fi.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_ice_linguistic.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_icelandic_error_corpus.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_icelandic_knowledge.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_icelandic_qa.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_icesum.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_ilpost_sum.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_jentoft.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_mim_gold_ner.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_mlqa_es.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_mlsum_de.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_mlsum_es.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_mmlu.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_multinerd-it.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_no_cola.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_no_sammendrag.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_nor_common_sense_qa.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_nordjylland_news.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_norec.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_norglm_multiqa.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_norglm_multisum.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_norne.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_norquad.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_nqii.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_nrk_quiz_qa.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_orange_sum.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_personal_sum.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_rrn.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_sb10k.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_scala.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_scandiqa.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_scandisent_fi.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_schibsted.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_sentiment_headlines_es.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_sentipolc16.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_squad.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_squad_it.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_squad_nl.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_squad_nl_old.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_sst5.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_suc3.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_swedn.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_swerec.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_turku_ner_fi.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_tydiqa_fi.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_wiki_lingua_nl.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_wikiann_fo.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_wikineural-it.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_winogrande_is.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_xlsum_fi.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_xquad_es.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/fix_dot_env_file.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/load_ud_pos.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/versioning.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/__init__.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_benchmark_config_factory.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_benchmark_modules/__init__.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_benchmark_modules/test_base.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_benchmark_modules/test_fresh.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_benchmark_modules/test_hf.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_benchmark_modules/test_litellm.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_benchmark_modules/test_vllm.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_benchmarker.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_callbacks.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_cli.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_constants.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_data_loading.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_dataset_configs.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_enums.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_exceptions.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_finetuning.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_generation.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_human_evaluation.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_languages.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_model_cache.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_model_config.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_model_loading.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_speed_benchmark.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_task_utils/__init__.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_task_utils/test_question_answering.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_task_utils/test_sequence_classification.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_task_utils/test_text_to_text.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_task_utils/test_token_classification.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_tasks.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_tokenization_utils.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_types.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_utils.py +0 -0
@@ -10,8 +10,36 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.



+## [v15.11.0] - 2025-07-15
+### Added
+- Added the English knowledge dataset Life in the UK, which has been added as an
+  official dataset, replacing the existing English knowledge dataset MMLU, which in turn
+  has been marked as unofficial now. This was contributed by
+  [@oliverkinch](https://github.com/oliverkinch) ✨
+- Added the Norwegian knowledge dataset Idioms-no, which is a multiple-choice question
+  dataset where the alternative answers have been generated using GPT-4o. This has been
+  added as an official dataset, and was contributed by
+  [@oliverkinch](https://github.com/oliverkinch) ✨
+- Added new `LLMAsAJudgeMetric`, which allows evaluating the performance of a model with
+  another judge model. This is useful for evaluating models in a reference-free manner,
+  or if the metric is sufficiently complex. It is currently not used in any task, but
+  the functionality is there for future use.
+- Add `no-thinking` and `thinking` options for Gemini-2.5-flash and
+  Gemini-2.5-flash-lite, which allows disabling and enabling the reasoning mode for
+  these models, respectively. Note that the former model has reasoning enabled by
+  default and the latter has it disabled by default (see the defaults in the [Gemini-2.5
+  docs](https://ai.google.dev/gemini-api/docs/thinking#set-budget)).
+
+### Fixed
+- Evaluating freshly initialised encoder models on multiple-choice classification tasks
+  caused an error, as the id-to-label mapping was not set up correctly. This has been
+  fixed now.
+- Now dynamically lowers the maximum amount of reasoning tokens for LiteLLM models if
+  they do not support the full 32,768 tokens.
+
+
 ## [v15.10.1] - 2025-06-20
-###
+### Fixed
 - Fixed an issue when benchmarking encoder models on reading comprehension tasks, where
   we sometimes would truncate the model outputs when they should not have been.

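The changelog above introduces `LLMAsAJudgeMetric` (the new `src/euroeval/metrics.py` module is added in this release). The snippet below is a rough, hypothetical sketch of the general LLM-as-a-judge pattern using LiteLLM, which EuroEval already depends on; the helper name, prompt and 1-to-5 scoring scheme are illustrative assumptions, not the actual EuroEval API.

```python
# Hypothetical sketch of the LLM-as-a-judge pattern; not the EuroEval API.
import re

import litellm


def judge_score(prediction: str, instruction: str, judge_model: str = "gpt-4o") -> float:
    """Ask a judge model to rate a prediction from 1 to 5 and return a score in [0, 1]."""
    prompt = (
        "You are grading a model answer.\n"
        f"Instruction: {instruction}\n"
        f"Answer: {prediction}\n"
        "Rate the answer from 1 (useless) to 5 (excellent). Reply with the number only."
    )
    response = litellm.completion(
        model=judge_model, messages=[{"role": "user", "content": prompt}]
    )
    content = response.choices[0].message.content or ""
    match = re.search(r"[1-5]", content)
    # Fall back to the lowest rating if the judge reply cannot be parsed.
    rating = int(match.group()) if match else 1
    return (rating - 1) / 4
```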
@@ -4,8 +4,8 @@ message: If you use this software, please cite it using the metadata from this f
 type: software
 authors:
   - given-names: Dan Saattrup
-    family-names:
-    email: dan.
+    family-names: Smart
+    email: dan.smart@alexandra.dk
     affiliation: Alexandra Institute
     orcid: 'https://orcid.org/0000-0001-9227-1470'
 identifiers:

@@ -22,7 +22,7 @@ license: MIT
 preferred-citation:
   type: conference-paper
   authors:
-    - family-names: "
+    - family-names: "Smart"
      given-names: "Dan Saattrup"
      orcid: https://orcid.org/0000-0001-9227-1470
  collection-title: "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)"

@@ -1,14 +1,14 @@
 Metadata-Version: 2.4
 Name: EuroEval
-Version: 15.
+Version: 15.11.0
 Summary: The robust European language model benchmark.
 Project-URL: Repository, https://github.com/EuroEval/EuroEval
 Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
-Author-email: Dan Saattrup
-Maintainer-email: Dan Saattrup
+Author-email: Dan Saattrup Smart <dan.smart@alexandra.dk>
+Maintainer-email: Dan Saattrup Smart <dan.smart@alexandra.dk>
 License: MIT License

-Copyright (c) 2022-
+Copyright (c) 2022-2025 Dan Saattrup Smart

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

@@ -43,6 +43,7 @@ Requires-Dist: numpy<2.0.0,>=1.23.0
 Requires-Dist: ollama>=0.5.1
 Requires-Dist: pandas>=2.2.0
 Requires-Dist: peft>=0.15.0
+Requires-Dist: protobuf>=2.0.0
 Requires-Dist: pydantic>=2.6.0
 Requires-Dist: pyinfer>=0.0.3
 Requires-Dist: python-dotenv>=1.0.1

@@ -94,8 +95,7 @@ ______________________________________________________________________

 ## Maintainer

-- Dan Saattrup
-  dan.nielsen@alexandra.dk)
+- Dan Saattrup Smart ([@saattrupdan](https://github.com/saattrupdan), dan.smart@alexandra.dk)


 ## Installation

@@ -268,14 +268,14 @@ contributing new datasets, your help makes this project better for everyone.
 If you want to cite the framework then feel free to use this:

 ```
-@article{
+@article{smart2024encoder,
   title={Encoder vs Decoder: Comparative Analysis of Encoder and Decoder Language Models on Multilingual NLU Tasks},
-  author={
+  author={Smart, Dan Saattrup and Enevoldsen, Kenneth and Schneider-Kamp, Peter},
   journal={arXiv preprint arXiv:2406.13469},
   year={2024}
 }
-@inproceedings{
-  author = {
+@inproceedings{smart2023scandeval,
+  author = {Smart, Dan Saattrup},
   booktitle = {Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)},
   month = may,
   pages = {185--201},

@@ -19,8 +19,7 @@ ______________________________________________________________________

 ## Maintainer

-- Dan Saattrup
-  dan.nielsen@alexandra.dk)
+- Dan Saattrup Smart ([@saattrupdan](https://github.com/saattrupdan), dan.smart@alexandra.dk)


 ## Installation

@@ -193,14 +192,14 @@ contributing new datasets, your help makes this project better for everyone.
 If you want to cite the framework then feel free to use this:

 ```
-@article{
+@article{smart2024encoder,
   title={Encoder vs Decoder: Comparative Analysis of Encoder and Decoder Language Models on Multilingual NLU Tasks},
-  author={
+  author={Smart, Dan Saattrup and Enevoldsen, Kenneth and Schneider-Kamp, Peter},
   journal={arXiv preprint arXiv:2406.13469},
   year={2024}
 }
-@inproceedings{
-  author = {
+@inproceedings{smart2023scandeval,
+  author = {Smart, Dan Saattrup},
   booktitle = {Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)},
   month = may,
   pages = {185--201},

@@ -31,6 +31,6 @@ The idea of EuroEval grew out of the development of Danish language model RøBÆ
 models. It started as a hobby project including Danish, Swedish and Norwegian, but has
 since grown to include 12+ European languages.

-EuroEval is maintained by [Dan Saattrup
+EuroEval is maintained by [Dan Saattrup Smart](https://www.saattrupdan.com/) from the
 [Alexandra Institute](https://alexandra.dk), and is funded by the EU project
 [TrustLLM](https://trustllm.eu/).

@@ -325,9 +325,12 @@ $ euroeval --model <model-id> --dataset squad-nl

 ### Unofficial: BeleBele-nl

-This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+and features multiple-choice reading comprehension questions across 122 languages.

-The original dataset contains 900 unique multiple-choice reading comprehension passages
+The original dataset contains 900 unique multiple-choice reading comprehension passages
+and questions. From these, we use a 256 / 64 / 580 split for training, validation and
+testing, respectively.

 Here are a few examples from the training split:

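The 256 / 64 / 580 split quoted in the BeleBele entries covers all 900 samples (256 + 64 + 580 = 900). Below is a hedged sketch of how such a split could be reproduced with the Hugging Face `datasets` library; the `facebook/belebele` dataset id, config name and seed are assumptions for illustration, not the EuroEval preparation code.

```python
# Illustrative sketch of a 256 / 64 / 580 split of 900 samples; the dataset id,
# config name and seed are assumptions, not the actual EuroEval script.
from datasets import load_dataset

belebele = load_dataset("facebook/belebele", "nld_Latn", split="test")  # 900 samples

# First carve out the 580-sample test split, then split the remaining 320 into 256 / 64.
first = belebele.train_test_split(test_size=580, seed=4242)
second = first["train"].train_test_split(test_size=64, seed=4242)

train, val, test = second["train"], second["test"], first["test"]
print(len(train), len(val), len(test))  # 256 64 580
```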
@@ -297,9 +297,13 @@ $ euroeval --model <model-id> --dataset squad

 ### Unofficial: BeleBele-en

-This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+and features reading comprehension questions across 122 languages. The dataset was
+created by professional translators who translated 900 multiple-choice questions from
+English into other languages, with answers carefully validated by native speakers.

-The original dataset consists of 900 samples, and we use 256 / 64 / 580 samples for
+The original dataset consists of 900 samples, and we use 256 / 64 / 580 samples for
+training, validation and testing, respectively.

 Here are a few examples from the training split:

@@ -354,7 +358,79 @@ $ euroeval --model <model-id> --dataset belebele-en

 ## Knowledge

-###
+### Life in the UK
+
+This dataset was published
+[here](https://huggingface.co/datasets/oliverkinch/life-in-the-uk-multiple-choice) and was
+scraped from [lifeintheuktestweb.co.uk](https://lifeintheuktestweb.co.uk/test-1/) and
+contains multiple choice questions about UK history, culture, and citizenship
+requirements. The website was created to help people pass the Life in the UK Test for UK
+citizenship.
+
+The original dataset consists of 1,450 samples. After processing (removing questions
+with overly short or long texts, repetitive content, and true/false questions), we have
+1,206 samples remaining. From these, we use 438 / 256 / 512 samples for our training,
+validation and test splits, respectively.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "text": "What is the capital of the United Kingdom?\nChoices:\na. London\nb. Manchester\nc. Birmingham\nd. Edinburgh",
+  "label": "a"
+}
+```
+```json
+{
+  "text": "What TWO houses were confronted during the Wars of the Roses?\nChoices:\na. The House of Lancaster\nb. The House of Leicester\nc. The House of Canterbury\nd. The House of York",
+  "label": "a"
+}
+```
+```json
+{
+  "text": "What is the name of the War Memorial located in Whitehall?\nChoices:\na. Dumfries\nb. Cenotaph\nc. Royal Crescent\nd. The White Tower",
+  "label": "b"
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 5
+- Prefix prompt:
+  ```
+  The following are multiple choice questions (with answers).
+  ```
+- Base prompt template:
+  ```
+  Question: {text}
+  Options:
+  a. {option_a}
+  b. {option_b}
+  c. {option_c}
+  d. {option_d}
+  Answer: {label}
+  ```
+- Instruction-tuned prompt template:
+  ```
+  Question: {text}
+  Options:
+  a. {option_a}
+  b. {option_b}
+  c. {option_c}
+  d. {option_d}
+
+  Answer the above question by replying with 'a', 'b', 'c' or 'd', and nothing else.
+  ```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+$ euroeval --model <model-id> --dataset life-in-the-uk
+```
+
+
+### Unofficial: MMLU

 This dataset was published [in this paper](https://doi.org/10.48550/arXiv.2009.03300)
 and features questions within 57 different topics, such as elementary mathematics, US

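The processing described for Life in the UK (dropping overly short or long questions, repetitive content and true/false questions) is implemented in the new `src/scripts/create_life_in_the_uk.py`. The sketch below only illustrates that kind of filtering with the `datasets` library; the column names and length thresholds are assumptions, not the actual script.

```python
# Rough sketch of the described filtering; the thresholds and column names are
# illustrative assumptions, not the actual create script.
from datasets import load_dataset

ds = load_dataset("oliverkinch/life-in-the-uk-multiple-choice", split="train")


def keep(sample: dict) -> bool:
    question = sample["question"]  # assumed column name
    options = sample["options"]  # assumed column name
    # Drop overly short or long questions.
    if not (20 <= len(question) <= 300):
        return False
    # Drop true/false style questions, which only offer two answer options.
    if len(options) < 4:
        return False
    return True


filtered = ds.filter(keep)
print(f"{len(ds)} -> {len(filtered)} samples after filtering")
```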
@@ -8,9 +8,13 @@ information about what these constitute.

 ### ScandiSent-fi

-This dataset consists of reviews from Trustpilot and was published
+This dataset consists of reviews from Trustpilot and was published
+[here](https://aclanthology.org/2021.nodalida-main.42/). It is a binary sentiment
+classification dataset, with labels "positive" and "negative".

-For the Finnish part of the dataset, there are 10,000 training samples. From these
+For the Finnish part of the dataset, there are 10,000 training samples. From these
+samples, we have created a 1,024 / 256 / 2,048 split for the train, validation and test
+splits, respectively.

 Here are a few examples from the training split:

@@ -67,9 +71,14 @@ $ euroeval --model <model-id> --dataset scandisent-fi

 ### Turku-NER-fi

-This dataset was published in [this paper](https://aclanthology.org/2020.lrec-1.567/).
+This dataset was published in [this paper](https://aclanthology.org/2020.lrec-1.567/).
+The dataset is a manually annotated corpus built on the Universal Dependencies Finnish
+corpus. The corpus was created by the Turku NLP group.

-The original dataset contains 12,217 / 1,364 / 1,555 samples for the training,
+The original dataset contains 12,217 / 1,364 / 1,555 samples for the training,
+validation and test splits, respectively. We use 1,024 / 256 / 2,048 samples for our
+training, validation and test splits, respectively. All the new splits are subsets of
+the original splits.

 Here are a few examples from the training split:

@@ -141,9 +150,9 @@ word from a sentence, or by swapping two neighbouring words in a sentence. To en
 that this does indeed break the grammaticality of the sentence, a set of rules were used
 on the part-of-speech tags of the words in the sentence.

-The original dataset consists of 15,136 samples, from which we use 1,024 / 256 / 2,048
-validation and testing, respectively (so 3,328 samples used in
-used as-is in the framework.
+The original dataset consists of 15,136 samples, from which we use 1,024 / 256 / 2,048
+samples for training, validation and testing, respectively (so 3,328 samples used in
+total). These splits are used as-is in the framework.

 Here are a few examples from the training split:

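The ScaLA corruption described above (deleting a word or swapping two neighbouring words) can be illustrated with a toy function; the real datasets additionally apply part-of-speech rules to guarantee that the corrupted sentence is actually ungrammatical, which this sketch omits.

```python
# Toy illustration of the ScaLA corruption idea; the real construction also checks
# part-of-speech tags to make sure the corrupted sentence is ungrammatical.
import random


def corrupt(sentence: str, seed: int = 0) -> str:
    rng = random.Random(seed)
    words = sentence.split()
    if rng.random() < 0.5 and len(words) > 1:
        # Remove a random word.
        del words[rng.randrange(len(words))]
    elif len(words) > 1:
        # Swap two neighbouring words.
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


print(corrupt("Tämä lause on täysin kieliopillinen"))
```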
@@ -199,9 +208,20 @@ $ euroeval --model <model-id> --dataset scala-fi
 ## Reading Comprehension

 ### TydiQA-fi
-This question-answering dataset was published in [this
+This question-answering dataset was published in [this
+paper](https://aclanthology.org/2020.tacl-1.30/). TydiQA is a multilingual dataset
+covering 11 typologically diverse languages with 204K question-answer pairs collected
+from native speakers genuinely seeking information. It was designed to evaluate models
+across languages with varied linguistic features and contains questions written directly
+in each language without translation.
+
+The original Finnish TydiQA dataset contains 6,855 training and 782 validation samples
+(we use the [secondary task
+subset](https://huggingface.co/datasets/google-research-datasets/tydiqa/viewer/secondary_task?views%5B%5D=secondary_task_train)).
+We created a 1,024 / 256 / 2,024 split, where the samples from the train and validation
+split are sampled from the original train and validation splits, respectively. The test
+set consists of the remaining samples from the original validation split + additional
+samples from the original train split.

 Here are a few examples from the training split:

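The TydiQA-fi split construction described above (1,024 / 256 / 2,024, with the test set topped up from leftover validation and train samples) could be sketched as follows with the `datasets` library; the language filtering by id prefix, the seeds and the exact sampling are assumptions, not the actual EuroEval script.

```python
# Hedged sketch of the described split construction; the id-prefix filtering,
# seeds and sampling order are assumptions, not the actual preparation script.
from datasets import concatenate_datasets, load_dataset

tydiqa = load_dataset("google-research-datasets/tydiqa", "secondary_task")
# Assumes the sample ids are prefixed with the language name.
fi_train = tydiqa["train"].filter(lambda x: x["id"].startswith("finnish"))
fi_val = tydiqa["validation"].filter(lambda x: x["id"].startswith("finnish"))

train = fi_train.shuffle(seed=4242).select(range(1_024))
val = fi_val.shuffle(seed=4242).select(range(256))

# The test split takes the remaining validation samples plus extra train samples.
val_rest = fi_val.shuffle(seed=4242).select(range(256, len(fi_val)))
extra_needed = 2_024 - len(val_rest)
train_rest = fi_train.shuffle(seed=4242).select(range(1_024, 1_024 + extra_needed))
test = concatenate_datasets([val_rest, train_rest])
print(len(train), len(val), len(test))  # 1024 256 2024
```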
@@ -268,9 +288,12 @@ $ euroeval --model <model-id> --dataset tydiqa-fi

 ### Unofficial: BeleBele-fi

-This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+and features multiple-choice reading comprehension questions across 122 languages.

-The original dataset contains 900 unique multiple-choice reading comprehension passages
+The original dataset contains 900 unique multiple-choice reading comprehension passages
+and questions. From these, we use a 256 / 64 / 580 split for training, validation and
+testing, respectively.

 Here are a few examples from the training split:

@@ -335,8 +358,11 @@ $ euroeval --model <model-id> --dataset belebele-fi
 ### HellaSwag-fi

 This dataset is a machine translated version of the English [HellaSwag
-dataset](https://aclanthology.org/P19-1472/). The
+dataset](https://aclanthology.org/P19-1472/). The
+[dataset](https://huggingface.co/datasets/Finnish-NLP/hellaswag-fi-google-translate) was
+created by Finnish-NLP using Google Translate. The dataset is designed to be used in
+EuroEval and it therefore already has a 1,024 / 256 / 2,048 split for the train,
+validation and test splits, respectively.

 Here are a few examples from the training split:

@@ -400,9 +426,16 @@ $ euroeval --model <model-id> --dataset hellaswag-fi

 ### XLSum-fi

-This dataset is a machine translation of the XL-Sum dataset, which was published in
+This dataset is a machine translation of the XL-Sum dataset, which was published in
+[this paper](https://aclanthology.org/2021.findings-acl.413/).
+[TurkuNLP](https://huggingface.co/datasets/TurkuNLP) has translated the dataset to
+Finnish using DeepL.

-The original Finnish XL-Sum dataset contains 54,966 / 1,803 / 1,791 training, validation
+The original Finnish XL-Sum dataset contains 54,966 / 1,803 / 1,791 training, validation
+and test samples, respectively. We use 1,024 / 256 / 2,048 samples for our training,
+validation and test splits, respectively. The new training and validation splits are
+subsets of the original splits. The test split is the same as the original test split +
+additional samples from the original validation split.

 Here are a few examples from the training split:

@@ -11,10 +11,11 @@ information about what these constitute.

 This dataset was published in [this Github
 repository](https://github.com/TheophileBlard/french-sentiment-analysis-with-bert) and
-features reviews from the French movie review website
-0.5 to 5 (inclusive), with
-of
-in between were
+features reviews from the French movie review website
+[AlloCiné](https://www.allocine.fr/). The reviews range from 0.5 to 5 (inclusive), with
+steps of 0.5. The negative samples are reviews with a rating of at most 2, and the
+positive ones are reviews with a rating of at least 4. The reviews in between were
+discarded.

 The original full dataset consists of 160,000 / 20,000 / 20,000 samples for training,
 validation, and testing, respectively. We use 1,024 / 256 / 2,048 samples for training,

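The AlloCiné labelling rule above (ratings of at most 2 are negative, at least 4 are positive, anything in between is discarded) boils down to a small mapping; the `rating` field name in this sketch is an assumption for illustration.

```python
# Illustrative mapping of AlloCiné review ratings to sentiment labels; the
# field name "rating" is an assumption for the sake of the example.
def rating_to_label(rating: float) -> str | None:
    if rating <= 2:
        return "negative"
    if rating >= 4:
        return "positive"
    return None  # Reviews between 2 and 4 are discarded.


assert rating_to_label(1.5) == "negative"
assert rating_to_label(4.5) == "positive"
assert rating_to_label(3.0) is None
```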
@@ -163,9 +164,9 @@ word from a sentence, or by swapping two neighbouring words in a sentence. To en
 that this does indeed break the grammaticality of the sentence, a set of rules were used
 on the part-of-speech tags of the words in the sentence.

-The original dataset consists of 16,342 samples, from which we use 1,024 / 256 / 2,048
-validation and testing, respectively (so 3,328 samples used in
-used as-is in the framework.
+The original dataset consists of 16,342 samples, from which we use 1,024 / 256 / 2,048
+samples for training, validation and testing, respectively (so 3,328 samples used in
+total). These splits are used as-is in the framework.

 Here are a few examples from the training split:

@@ -298,9 +299,12 @@ $ euroeval --model <model-id> --dataset fquad

 ### Unofficial: BeleBele-fr

-This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+and features multiple-choice reading comprehension questions across 122 languages.

-The original dataset contains 900 unique multiple-choice reading comprehension passages
+The original dataset contains 900 unique multiple-choice reading comprehension passages
+and questions. From these, we use a 256 / 64 / 580 split for training, validation and
+testing, respectively.

 Here are a few examples from the training split:

@@ -153,9 +153,9 @@ word from a sentence, or by swapping two neighbouring words in a sentence. To en
 that this does indeed break the grammaticality of the sentence, a set of rules were used
 on the part-of-speech tags of the words in the sentence.

-The original dataset consists of 15,590 samples, from which we use 1,024 / 256 / 2,048
-validation and testing, respectively (so 3,328 samples used in
-used as-is in the framework.
+The original dataset consists of 15,590 samples, from which we use 1,024 / 256 / 2,048
+samples for training, validation and testing, respectively (so 3,328 samples used in
+total). These splits are used as-is in the framework.

 Here are a few examples from the training split:

@@ -286,9 +286,12 @@ $ euroeval --model <model-id> --dataset germanquad

 ### Unofficial: BeleBele-de

-This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+and features multiple-choice reading comprehension questions across 122 languages.

-The original dataset contains 900 unique multiple-choice reading comprehension passages
+The original dataset contains 900 unique multiple-choice reading comprehension passages
+and questions. From these, we use a 256 / 64 / 580 split for training, validation and
+testing, respectively.

 Here are a few examples from the training split:

@@ -155,9 +155,9 @@ from a sentence, or by swapping two neighbouring words in a sentence. To ensure
 this does indeed break the grammaticality of the sentence, a set of rules were used on
 the part-of-speech tags of the words in the sentence.

-The original dataset consists of 3,535 samples, from which we use 1,024 / 256 / 2,048
-validation and testing, respectively (so 3,328 samples used in
-used as-is in the framework.
+The original dataset consists of 3,535 samples, from which we use 1,024 / 256 / 2,048
+samples for training, validation and testing, respectively (so 3,328 samples used in
+total). These splits are used as-is in the framework.

 Here are a few examples from the training split:

@@ -491,9 +491,12 @@ $ euroeval --model <model-id> --dataset icelandic-qa

 ### Unofficial: BeleBele-is

-This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+and features multiple-choice reading comprehension questions across 122 languages.

-The original dataset contains 900 unique multiple-choice reading comprehension passages
+The original dataset contains 900 unique multiple-choice reading comprehension passages
+and questions. From these, we use a 256 / 64 / 580 split for training, validation and
+testing, respectively.

 Here are a few examples from the training split:

@@ -579,7 +582,8 @@ completion = client.beta.chat.completions.parse(
 )
 ```

-where `CandidateAnswers` is a Pydantic model that is used to ensure [structured
+where `CandidateAnswers` is a Pydantic model that is used to ensure [structured
+outputs](https://platform.openai.com/docs/guides/structured-outputs).

 The original dataset has 2,000 samples, but only 1,994 unique questions, and the total
 length of this dataset is therefore 1,994. The split is given by 842 / 128 / 1024 for

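The `CandidateAnswers` model referenced above constrains the generating model's reply to a fixed JSON schema via OpenAI's structured outputs. A minimal sketch of how such a Pydantic model is typically passed to `client.beta.chat.completions.parse` is shown below; the field name and prompt are assumptions, not the actual EuroEval dataset script.

```python
# Minimal structured-outputs sketch; the field name and prompt are assumptions.
from openai import OpenAI
from pydantic import BaseModel


class CandidateAnswers(BaseModel):
    """Schema the model must follow when proposing alternative answer options."""

    candidate_answers: list[str]


client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Suggest three plausible but wrong answers to: ..."}
    ],
    response_format=CandidateAnswers,
)
parsed = completion.choices[0].message.parsed  # -> CandidateAnswers instance
```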
@@ -373,9 +373,12 @@ $ euroeval --model <model-id> --dataset squad-it

 ### Unofficial: BeleBele-it

-This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+and features multiple-choice reading comprehension questions across 122 languages.

-The original dataset contains 900 unique multiple-choice reading comprehension passages
+The original dataset contains 900 unique multiple-choice reading comprehension passages
+and questions. From these, we use a 256 / 64 / 580 split for training, validation and
+testing, respectively.

 Here are a few examples from the training split:
