EuroEval 16.0.0.tar.gz → 16.1.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (296)
  1. {euroeval-16.0.0 → euroeval-16.1.0}/.gitignore +3 -1
  2. {euroeval-16.0.0 → euroeval-16.1.0}/.pre-commit-config.yaml +1 -1
  3. {euroeval-16.0.0 → euroeval-16.1.0}/CHANGELOG.md +68 -0
  4. {euroeval-16.0.0 → euroeval-16.1.0}/PKG-INFO +3 -1
  5. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/danish.md +83 -7
  6. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/dutch.md +81 -8
  7. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/english.md +138 -3
  8. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/estonian.md +83 -10
  9. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/faroese.md +3 -2
  10. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/finnish.md +78 -5
  11. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/french.md +78 -5
  12. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/german.md +139 -3
  13. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/icelandic.md +5 -4
  14. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/italian.md +78 -5
  15. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/latvian.md +97 -10
  16. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/norwegian.md +68 -3
  17. euroeval-16.1.0/docs/datasets/polish.md +640 -0
  18. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/portuguese.md +68 -3
  19. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/spanish.md +68 -3
  20. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/swedish.md +132 -3
  21. euroeval-16.1.0/docs/leaderboards/Monolingual/estonian.md +23 -0
  22. euroeval-16.1.0/docs/leaderboards/Multilingual/finnic.md +23 -0
  23. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Multilingual/romance.md +1 -1
  24. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/README.md +5 -15
  25. euroeval-16.1.0/generated_contracts/employment_contract_001.md +137 -0
  26. euroeval-16.1.0/generated_contracts/employment_contract_002.md +152 -0
  27. euroeval-16.1.0/generated_contracts/employment_contract_003.md +144 -0
  28. euroeval-16.1.0/generated_contracts/employment_contract_004.md +139 -0
  29. euroeval-16.1.0/generated_contracts/employment_contract_005.md +146 -0
  30. euroeval-16.1.0/generated_contracts/employment_contract_006.md +127 -0
  31. euroeval-16.1.0/generated_contracts/employment_contract_007.md +147 -0
  32. euroeval-16.1.0/generated_contracts/employment_contract_008.md +136 -0
  33. euroeval-16.1.0/generated_contracts/employment_contract_009.md +143 -0
  34. euroeval-16.1.0/generated_contracts/employment_contract_010.md +148 -0
  35. {euroeval-16.0.0 → euroeval-16.1.0}/makefile +3 -0
  36. {euroeval-16.0.0 → euroeval-16.1.0}/pyproject.toml +3 -1
  37. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/__init__.py +5 -0
  38. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/benchmark_config_factory.py +6 -1
  39. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/benchmark_modules/base.py +2 -0
  40. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/benchmark_modules/fresh.py +7 -1
  41. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/benchmark_modules/hf.py +26 -21
  42. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/benchmark_modules/litellm.py +258 -131
  43. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/benchmark_modules/vllm.py +120 -68
  44. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/benchmarker.py +11 -2
  45. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/cli.py +14 -1
  46. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/constants.py +7 -1
  47. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/data_models.py +95 -20
  48. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/__init__.py +1 -0
  49. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/danish.py +14 -3
  50. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/dutch.py +14 -0
  51. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/english.py +22 -0
  52. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/estonian.py +15 -7
  53. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/finnish.py +14 -0
  54. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/french.py +14 -0
  55. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/german.py +23 -0
  56. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/italian.py +14 -0
  57. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/latvian.py +14 -0
  58. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/norwegian.py +14 -0
  59. euroeval-16.1.0/src/euroeval/dataset_configs/polish.py +126 -0
  60. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/portuguese.py +14 -0
  61. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/spanish.py +14 -0
  62. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/swedish.py +25 -0
  63. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/enums.py +12 -0
  64. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/generation.py +17 -8
  65. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/generation_utils.py +102 -16
  66. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/metrics/pipeline.py +51 -9
  67. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/model_cache.py +13 -1
  68. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/prompt_templates/linguistic_acceptability.py +9 -0
  69. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/prompt_templates/multiple_choice.py +27 -1
  70. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/prompt_templates/named_entity_recognition.py +20 -0
  71. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/prompt_templates/reading_comprehension.py +11 -0
  72. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/prompt_templates/sentiment_classification.py +15 -0
  73. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/prompt_templates/summarization.py +27 -1
  74. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/scores.py +5 -0
  75. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/task_group_utils/multiple_choice_classification.py +2 -2
  76. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/task_group_utils/question_answering.py +29 -29
  77. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/task_group_utils/sequence_classification.py +71 -81
  78. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/task_group_utils/token_classification.py +17 -3
  79. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/tasks.py +12 -10
  80. euroeval-16.0.0/src/euroeval/tokenization_utils.py → euroeval-16.1.0/src/euroeval/tokenisation_utils.py +41 -25
  81. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/utils.py +67 -3
  82. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/constants.py +20 -0
  83. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_allocine.py +1 -6
  84. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_arc.py +9 -26
  85. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_arc_is.py +1 -6
  86. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_belebele.py +4 -21
  87. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_boolq_pt.py +1 -5
  88. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_cnn_dailymail.py +1 -6
  89. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_conll_en.py +1 -6
  90. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_conll_es.py +1 -6
  91. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_conll_nl.py +1 -6
  92. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_copa_lv.py +3 -7
  93. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_dane.py +1 -6
  94. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_danish_citizen_tests.py +3 -7
  95. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_dansk.py +1 -6
  96. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_danske_talemaader.py +3 -6
  97. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_danske_talemaader_old.py +3 -7
  98. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_dbrd.py +1 -6
  99. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_dutch_cola.py +2 -0
  100. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_eltec.py +1 -5
  101. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_err_news.py +1 -6
  102. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_estner.py +1 -6
  103. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_estonian_valence.py +1 -6
  104. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_european_values.py +56 -46
  105. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_exam_et.py +2 -1
  106. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_fone.py +1 -6
  107. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_foqa.py +1 -6
  108. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_fosent.py +1 -6
  109. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_fquad.py +1 -6
  110. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_fullstack_ner.py +1 -6
  111. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_germanquad.py +1 -6
  112. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_germeval.py +1 -6
  113. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_goldenswag.py +4 -19
  114. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_grammar_et.py +1 -6
  115. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_harem.py +1 -6
  116. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_hellaswag.py +6 -22
  117. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_hellaswag_fi.py +3 -7
  118. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_hotter_and_colder_sentiment.py +1 -5
  119. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_ice_linguistic.py +1 -6
  120. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_icelandic_error_corpus.py +2 -7
  121. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_icelandic_knowledge.py +8 -8
  122. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_icelandic_qa.py +1 -6
  123. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_icesum.py +1 -6
  124. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_idioms_no.py +3 -7
  125. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_ilpost_sum.py +1 -6
  126. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_jentoft.py +1 -6
  127. euroeval-16.1.0/src/scripts/create_kpwr_ner.py +140 -0
  128. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_latvian_lsm_summary.py +1 -6
  129. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_latvian_twitter_sentiment.py +1 -6
  130. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_life_in_the_uk.py +3 -7
  131. euroeval-16.1.0/src/scripts/create_llmzszl.py +153 -0
  132. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_mlqa_es.py +1 -6
  133. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_mlsum_de.py +1 -6
  134. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_mlsum_es.py +2 -11
  135. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_mmlu.py +6 -23
  136. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_mmlu_lv.py +3 -7
  137. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_multi_wiki_qa.py +4 -8
  138. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_multinerd-it.py +1 -6
  139. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_no_cola.py +1 -6
  140. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_no_sammendrag.py +1 -6
  141. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_nor_common_sense_qa.py +3 -7
  142. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_nordjylland_news.py +3 -12
  143. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_norglm_multiqa.py +2 -11
  144. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_norglm_multisum.py +2 -11
  145. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_norne.py +2 -11
  146. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_norquad.py +3 -12
  147. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_nqii.py +3 -12
  148. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_nrk_quiz_qa.py +4 -12
  149. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_orange_sum.py +3 -12
  150. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_personal_sum.py +6 -12
  151. euroeval-16.1.0/src/scripts/create_polemo2.py +130 -0
  152. euroeval-16.1.0/src/scripts/create_poquad.py +109 -0
  153. euroeval-16.1.0/src/scripts/create_psc.py +85 -0
  154. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_publico.py +2 -10
  155. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_rrn.py +3 -12
  156. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_sb10k.py +2 -11
  157. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_scala.py +4 -11
  158. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_scandiqa.py +3 -12
  159. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_scandisent_fi.py +2 -11
  160. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_schibsted.py +1 -8
  161. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_sentiment_headlines_es.py +2 -11
  162. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_sentipolc16.py +2 -11
  163. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_squad.py +3 -12
  164. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_squad_it.py +3 -12
  165. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_squad_nl.py +3 -12
  166. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_squad_nl_old.py +3 -12
  167. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_sst2_pt.py +2 -6
  168. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_sst5.py +2 -11
  169. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_suc3.py +2 -11
  170. euroeval-16.1.0/src/scripts/create_swedish_skolprov.py +167 -0
  171. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_swedn.py +3 -12
  172. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_swerec.py +2 -11
  173. euroeval-16.1.0/src/scripts/create_trivia_et.py +70 -0
  174. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_turku_ner_fi.py +3 -10
  175. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_tydiqa_fi.py +3 -12
  176. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_wiki_lingua_nl.py +3 -12
  177. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_wikiann_lv.py +2 -11
  178. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_wikineural-it.py +2 -11
  179. euroeval-16.1.0/src/scripts/create_winogrande.py +156 -0
  180. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_winogrande_et.py +6 -8
  181. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_winogrande_is.py +4 -12
  182. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_xlsum_fi.py +3 -12
  183. euroeval-16.1.0/src/scripts/create_xquad.py +73 -0
  184. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/load_ud_pos.py +18 -0
  185. {euroeval-16.0.0 → euroeval-16.1.0}/tests/conftest.py +2 -0
  186. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_benchmark_config_factory.py +1 -1
  187. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_benchmarker.py +1 -1
  188. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_callbacks.py +1 -1
  189. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_cli.py +3 -1
  190. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_data_loading.py +6 -1
  191. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_data_models.py +3 -3
  192. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_dataset_configs.py +3 -3
  193. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_exceptions.py +1 -1
  194. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_finetuning.py +0 -12
  195. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_languages.py +2 -2
  196. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_model_loading.py +1 -1
  197. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_scores.py +4 -3
  198. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_speed_benchmark.py +2 -2
  199. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_tasks.py +2 -2
  200. euroeval-16.0.0/tests/test_tokenization_utils.py → euroeval-16.1.0/tests/test_tokenisation_utils.py +5 -3
  201. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_types.py +1 -1
  202. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_utils.py +41 -3
  203. {euroeval-16.0.0 → euroeval-16.1.0}/uv.lock +58 -3
  204. euroeval-16.0.0/src/scripts/create_wikiann_fo.py +0 -1
  205. euroeval-16.0.0/src/scripts/create_xquad_es.py +0 -80
  206. euroeval-16.0.0/tests/test_benchmark_modules/test_base.py +0 -1
  207. euroeval-16.0.0/tests/test_benchmark_modules/test_fresh.py +0 -1
  208. euroeval-16.0.0/tests/test_benchmark_modules/test_litellm.py +0 -1
  209. euroeval-16.0.0/tests/test_benchmark_modules/test_vllm.py +0 -1
  210. euroeval-16.0.0/tests/test_generation.py +0 -19
  211. euroeval-16.0.0/tests/test_model_cache.py +0 -46
  212. euroeval-16.0.0/tests/test_task_utils/__init__.py +0 -1
  213. euroeval-16.0.0/tests/test_task_utils/test_question_answering.py +0 -1
  214. euroeval-16.0.0/tests/test_task_utils/test_sequence_classification.py +0 -1
  215. euroeval-16.0.0/tests/test_task_utils/test_text_to_text.py +0 -1
  216. euroeval-16.0.0/tests/test_task_utils/test_token_classification.py +0 -1
  217. {euroeval-16.0.0 → euroeval-16.1.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +0 -0
  218. {euroeval-16.0.0 → euroeval-16.1.0}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
  219. {euroeval-16.0.0 → euroeval-16.1.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
  220. {euroeval-16.0.0 → euroeval-16.1.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +0 -0
  221. {euroeval-16.0.0 → euroeval-16.1.0}/.github/workflows/ci.yaml +0 -0
  222. {euroeval-16.0.0 → euroeval-16.1.0}/CITATION.cff +0 -0
  223. {euroeval-16.0.0 → euroeval-16.1.0}/CODE_OF_CONDUCT.md +0 -0
  224. {euroeval-16.0.0 → euroeval-16.1.0}/CONTRIBUTING.md +0 -0
  225. {euroeval-16.0.0 → euroeval-16.1.0}/Dockerfile.cuda +0 -0
  226. {euroeval-16.0.0 → euroeval-16.1.0}/LICENSE +0 -0
  227. {euroeval-16.0.0 → euroeval-16.1.0}/NEW_DATASET_GUIDE.md +0 -0
  228. {euroeval-16.0.0 → euroeval-16.1.0}/README.md +0 -0
  229. {euroeval-16.0.0 → euroeval-16.1.0}/docs/CNAME +0 -0
  230. {euroeval-16.0.0 → euroeval-16.1.0}/docs/README.md +0 -0
  231. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/README.md +0 -0
  232. {euroeval-16.0.0 → euroeval-16.1.0}/docs/extras/radial_plotter.md +0 -0
  233. {euroeval-16.0.0 → euroeval-16.1.0}/docs/faq.md +0 -0
  234. {euroeval-16.0.0 → euroeval-16.1.0}/docs/gfx/favicon.png +0 -0
  235. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/danish.md +0 -0
  236. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/dutch.md +0 -0
  237. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/english.md +0 -0
  238. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/faroese.md +0 -0
  239. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/finnish.md +0 -0
  240. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/french.md +0 -0
  241. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/german.md +0 -0
  242. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/icelandic.md +0 -0
  243. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/italian.md +0 -0
  244. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/norwegian.md +0 -0
  245. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/portuguese.md +0 -0
  246. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/spanish.md +0 -0
  247. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/swedish.md +0 -0
  248. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Multilingual/european.md +0 -0
  249. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Multilingual/germanic.md +0 -0
  250. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
  251. {euroeval-16.0.0 → euroeval-16.1.0}/docs/methodology.md +0 -0
  252. {euroeval-16.0.0 → euroeval-16.1.0}/docs/python-package.md +0 -0
  253. {euroeval-16.0.0 → euroeval-16.1.0}/docs/tasks/README.md +0 -0
  254. {euroeval-16.0.0 → euroeval-16.1.0}/docs/tasks/common-sense-reasoning.md +0 -0
  255. {euroeval-16.0.0 → euroeval-16.1.0}/docs/tasks/knowledge.md +0 -0
  256. {euroeval-16.0.0 → euroeval-16.1.0}/docs/tasks/linguistic-acceptability.md +0 -0
  257. {euroeval-16.0.0 → euroeval-16.1.0}/docs/tasks/named-entity-recognition.md +0 -0
  258. {euroeval-16.0.0 → euroeval-16.1.0}/docs/tasks/reading-comprehension.md +0 -0
  259. {euroeval-16.0.0 → euroeval-16.1.0}/docs/tasks/sentiment-classification.md +0 -0
  260. {euroeval-16.0.0 → euroeval-16.1.0}/docs/tasks/speed.md +0 -0
  261. {euroeval-16.0.0 → euroeval-16.1.0}/docs/tasks/summarization.md +0 -0
  262. {euroeval-16.0.0 → euroeval-16.1.0}/gfx/euroeval.png +0 -0
  263. {euroeval-16.0.0 → euroeval-16.1.0}/gfx/euroeval.xcf +0 -0
  264. {euroeval-16.0.0 → euroeval-16.1.0}/gfx/scandeval.png +0 -0
  265. {euroeval-16.0.0 → euroeval-16.1.0}/mkdocs.yaml +0 -0
  266. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/benchmark_modules/__init__.py +0 -0
  267. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/callbacks.py +0 -0
  268. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/data_loading.py +0 -0
  269. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/faroese.py +0 -0
  270. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/icelandic.py +0 -0
  271. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/exceptions.py +0 -0
  272. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/finetuning.py +0 -0
  273. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/languages.py +0 -0
  274. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/metrics/__init__.py +0 -0
  275. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/metrics/base.py +0 -0
  276. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/metrics/huggingface.py +0 -0
  277. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/metrics/llm_as_a_judge.py +0 -0
  278. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/metrics/speed.py +0 -0
  279. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/model_config.py +0 -0
  280. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/model_loading.py +0 -0
  281. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/prompt_templates/__init__.py +0 -0
  282. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/speed_benchmark.py +0 -0
  283. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/task_group_utils/__init__.py +0 -0
  284. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/task_group_utils/text_to_text.py +0 -0
  285. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/types.py +0 -0
  286. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_angry_tweets.py +0 -0
  287. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_mim_gold_ner.py +0 -0
  288. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_norec.py +0 -0
  289. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/fix_dot_env_file.py +0 -0
  290. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/versioning.py +0 -0
  291. {euroeval-16.0.0 → euroeval-16.1.0}/tests/__init__.py +0 -0
  292. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_benchmark_modules/__init__.py +0 -0
  293. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_benchmark_modules/test_hf.py +0 -0
  294. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_constants.py +0 -0
  295. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_enums.py +0 -0
  296. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_model_config.py +0 -0

{euroeval-16.0.0 → euroeval-16.1.0}/.gitignore

@@ -34,7 +34,7 @@ var/
  pip-log.txt
  pip-delete-this-directory.txt

- # Unit test / coverage reports
+ # Tests / coverage reports
  htmlcov/
  .tox/
  .coverage
@@ -118,4 +118,6 @@ docs/datasets/dataset_example_commands.txt

  # Various graphics
  gfx/euroeval-*.png
+ gfx/euroeval-*.jpeg
+ gfx/euroeval-*.jpg
  gfx/euroeval-*.xcf

{euroeval-16.0.0 → euroeval-16.1.0}/.pre-commit-config.yaml

@@ -10,7 +10,7 @@ repos:
  - id: trailing-whitespace
  - id: debug-statements
  - repo: https://github.com/astral-sh/ruff-pre-commit
- rev: v0.12.12
+ rev: v0.13.0
  hooks:
  - id: ruff
  args:

{euroeval-16.0.0 → euroeval-16.1.0}/CHANGELOG.md

@@ -10,6 +10,74 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.



+ ## [v16.1.0] - 2025-09-11
+ ### Added
+ - Added support for Polish 🇵🇱! This includes the reading comprehension dataset PoQuAD,
+ the sentiment classification dataset PolEmo 2.0, the linguistic acceptability dataset
+ ScaLA-pl, the named entity recognition dataset KPWr-NER, the summarisation dataset
+ PSC, the knowledge dataset LLMzSzŁ and the common-sense reasoning dataset
+ Winogrande-pl. Also added MultiWikiQA-pl and GoldenSwag-pl as unofficial reading
+ comprehension and common-sense reasoning datasets, respectively. This was contributed
+ by @oliverkinch ✨
+ - Added the Swedish knowledge dataset Skolprov. It is unofficial for now. This was
+ contributed by @oliverkinch ✨
+ - Added the knowledge dataset Trivia-et for Estonian. The dataset contains 800 trivia
+ questions about Estonia. In this version we rearrange the examples in
+ 240 / 60 / 500 samples for training, validation and test splits, respectively.
+ This replaces Exam-et as the official Estonian knowledge dataset. This was contributed
+ by @slowwavesleep ✨
+ - Added the English and German versions of XQuAD as unofficial reading comprehension
+ datasets.
+ - Added the English common-sense reasoning dataset Winogrande and its translated
+ versions of Winogrande for Danish, German, Spanish, Finnish, French, Italian, Latvian,
+ Dutch, Norwegian, Polish, Portuguese and Swedish. These are unofficial for now.
+ - Added new `--generative-type` argument, which can be used to override the automatic
+ detection of the generative type (base decoder, instruction-tuned decoder, or
+ reasoning decoder) of a decoder model. This can be useful if the automatic detection
+ fails for a specific model.
+ - Now supports evaluating base decoders on inference servers. This requires the
+ `--generative-type base` argument to be set, as the automatic detection will not work
+ for these models.
+
+ ### Changed
+ - Changed the model ID syntax, where we now use `#` to indicate parameters and still use
+ `@` to indicate revision. For instance, `o3#low` indicates the `o3` model with the
+ low reasoning effort, and `tencent/Hunyuan-1.8B-Instruct@v1#no-thinking` indicates the
+ Hunyuan model from the `v1` branch and with the `enable_thinking=False` parameter set.
+ This is fully backwards compatible, in the sense that API models still support using
+ `@` for parameters as well, just like previously, but you will get a warning that this
+ syntax is deprecated.
+ - Added `thinking` and `no-thinking` parameters for all open-weight models now. Of
+ course, it only makes a difference for models that supports this flag.
+ - Reduced the number of tokens used for reasoning models from 32,768 to 8,192, as models
+ reaching the full 32,768 tokens were because they ended up repeating themselves,
+ making the evaluation slower without any benefit.
+
+ ### Fixed
+ - Some generative models consistently generated empty dictionaries when using structured
+ generation. We now catch this and retry the evaluation without structured generation.
+
+
+ ## [v16.0.1] - 2025-09-07
+ ### Fixed
+ - Fixed a bug causing encoders to fail when evaluating on the Exam-et dataset.
+ - Previously we would abort an evaluation completely if the model outputted a single
+ invalid output on a classification task. As individual samples rarely have a great
+ influence on the overall score, we now just assign the closest label to the sample and
+ continue the evaluation. This will be logged to the user, so that they are aware of
+ this. Some tasks are more sensitive to individual samples, such as European values,
+ where we still abort the evaluation if a single sample is invalid.
+ - Fixed a bug where logprobs were not used for classification tasks when evaluating
+ generative models, due to the fact that we raised the number of generated tokens to 10
+ for such tasks. This did not affect the results, but it meant that some evaluations
+ failed.
+ - Now includes FlashInfer as a dependency, as it is required by vLLM.
+ - Changed the choices in European values to use letters, like the other multiple
+ choice tasks, rather than numbers. Aside from ensuring consistency, we also avoid the
+ issue where '10' and '1' often both have the same first token ('1'), causing us not to
+ be able to use logprobs to determine the answer.
+
+
  ## [v16.0.0] - 2025-09-05
  ### Added
  - Added support for Latvian 🇱🇻! This includes the sentiment classification dataset
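
As a quick illustration of the changelog entries above, the new `#` parameter syntax, the `@` revision syntax and the `--generative-type` flag combine on the command line roughly as sketched below. This is illustrative only: `o3#low` and `tencent/Hunyuan-1.8B-Instruct@v1#no-thinking` are the examples given in the changelog itself, `scandiqa-da` is simply one of the datasets documented further down, and `<base-model-id>` is a placeholder.

```bash
# '#' passes a parameter to the model, here the reasoning effort of an API model
$ euroeval --model o3#low --dataset scandiqa-da

# '@' still selects a revision and can be combined with a '#' parameter
$ euroeval --model tencent/Hunyuan-1.8B-Instruct@v1#no-thinking --dataset scandiqa-da

# override the automatic generative-type detection, e.g. for a base decoder
# served from an inference server, where detection does not work
$ euroeval --model <base-model-id> --generative-type base --dataset scandiqa-da
```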

{euroeval-16.0.0 → euroeval-16.1.0}/PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: EuroEval
- Version: 16.0.0
+ Version: 16.1.0
  Summary: The robust European language model benchmark.
  Project-URL: Repository, https://github.com/EuroEval/EuroEval
  Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -61,10 +61,12 @@ Requires-Dist: transformers[mistral-common]>=4.56.0
  Provides-Extra: all
  Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'all'
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
+ Requires-Dist: flashinfer-python>=0.3.1; (platform_system == 'Linux') and extra == 'all'
  Requires-Dist: vllm>=0.10.1; (platform_system == 'Linux') and extra == 'all'
  Provides-Extra: generative
  Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
+ Requires-Dist: flashinfer-python>=0.3.1; (platform_system == 'Linux') and extra == 'generative'
  Requires-Dist: vllm>=0.10.1; (platform_system == 'Linux') and extra == 'generative'
  Description-Content-Type: text/markdown
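
The new `flashinfer-python` requirement only applies to the optional extras. A minimal sketch of how it gets pulled in, assuming the `all` and `generative` extras declared in this PKG-INFO (Linux only, per the environment markers):

```bash
# either extra pulls in vllm, bitsandbytes, fbgemm-gpu and now flashinfer-python
$ pip install "euroeval[generative]"
# or everything:
$ pip install "euroeval[all]"
```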

{euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/danish.md

@@ -355,9 +355,12 @@ $ euroeval --model <model-id> --dataset scandiqa-da

  ### Unofficial: BeleBele-da

- This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features multiple-choice reading comprehension questions across 122 languages.
+ This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+ and features multiple-choice reading comprehension questions across 122 languages.

- The original dataset contains 900 unique multiple-choice reading comprehension passages and questions. From these, we use a 256 / 64 / 580 split for training, validation and testing, respectively.
+ The original dataset contains 900 unique multiple-choice reading comprehension passages
+ and questions. From these, we use a 256 / 64 / 580 split for training, validation and
+ testing, respectively.

  Here are a few examples from the training split:

@@ -418,8 +421,9 @@ $ euroeval --model <model-id> --dataset belebele-da

  ### Unofficial: MultiWikiQA-da

- This dataset will be published in an upcoming paper, and contains Danish Wikipedia
- articles with generated questions and answers, using the LLM Gemini-1.5-pro.
+ This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2509.04111)
+ and contains Wikipedia articles with LLM-generated questions and answers in 300+
+ languages.

  The original full dataset consists of 5,000 samples in a single split. We use a 1,024 /
  256 / 2,048 split for training, validation and testing, respectively, sampled randomly.
@@ -831,9 +835,17 @@ $ euroeval --model <model-id> --dataset hellaswag-da

  ### Unofficial: GoldenSwag-da

- This dataset is a filtered and machine translated version of the English [HellaSwag dataset](https://aclanthology.org/P19-1472/), featuring both video descriptions from ActivityNet as well as how-to articles from WikiHow. The machine translated version was published in [this paper](https://doi.org/10.48550/arXiv.2410.08928) and was done using DeepL, and the filtering was published in [this paper](https://doi.org/10.48550/arXiv.2504.07825), which resulted in higher quality samples.
+ This dataset is a filtered and machine translated version of the English [HellaSwag
+ dataset](https://aclanthology.org/P19-1472/), featuring both video descriptions from
+ ActivityNet as well as how-to articles from WikiHow. The machine translated version was
+ published in [this paper](https://doi.org/10.48550/arXiv.2410.08928) and was done using
+ DeepL, and the filtering was published in [this
+ paper](https://doi.org/10.48550/arXiv.2504.07825), which resulted in higher quality
+ samples.

- The original full dataset consists of 1530 / 1530 samples for training and validation, respectively. However, they are exactly equal. We use a split of 660 / 256 / 2,048 samples for training, validation, and testing, respectively.
+ The original full dataset consists of 1530 / 1530 samples for training and validation,
+ respectively. However, they are exactly equal. We use a split of 660 / 256 / 2,048
+ samples for training, validation, and testing, respectively.

  Here are a few examples from the training split:

@@ -894,8 +906,72 @@ You can evaluate this dataset directly as follows:
  $ euroeval --model <model-id> --dataset goldenswag-da
  ```

+ ### Unofficial: Winogrande-da

- ## Summarization
+ This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2506.19468)
+ and is a translated and filtered version of the English [Winogrande
+ dataset](https://doi.org/10.1145/3474381).
+
+ The original full dataset consists of 47 / 1,210 samples for training and testing, and
+ we use the same splits.
+
+ Here are a few examples from the training split:
+
+ ```json
+ {
+ "text": "Natalie synes, at smaragder er smukke ædelstene, men Betty gør ikke. _ købte en halskæde med en stor smaragd. Hvad refererer det tomme _ til?\nSvarmuligheder:\na. Valgmulighed A: Natalie\nb. Valgmulighed B: Betty",
+ "label": "a"
+ }
+ ```
+
+ ```json
+ {
+ "text": "Natalie synes, at smaragder er smukke ædelstene, men Betty gør ikke. _ købte en halskæde med en stor smaragd. Hvad refererer det tomme _ til?\nSvarmuligheder:\na. Valgmulighed A: Natalie\nb. Valgmulighed B: Betty",
+ "label": "a"
+ }
+ ```
+
+ ```json
+ {
+ "text": "At håndtere nødsituationer var aldrig særlig svært for Kevin, men det var det for Nelson, fordi _ ikke var i stand til at forblive rolig under pres. Hvad refererer det tomme _ til?\nSvarmuligheder:\na. Valgmulighed A: Kevin\nb. Valgmulighed B: Nelson",
+ "label": "b"
+ }
+ ```
+
+ When evaluating generative models, we use the following setup (see the
+ [methodology](/methodology) for more information on how these are used):
+
+ - Number of few-shot examples: 5
+ - Prefix prompt:
+ ```
+ Følgende er multiple choice spørgsmål (med svar).
+ ```
+ - Base prompt template:
+ ```
+ Spørgsmål: {text}
+ Svarmuligheder:
+ a. {option_a}
+ b. {option_b}
+ Svar: {label}
+ ```
+ - Instruction-tuned prompt template:
+ ```
+ Spørgsmål: {text}
+ Svarmuligheder:
+ a. {option_a}
+ b. {option_b}
+
+ Besvar ovenstående spørgsmål ved at svare med 'a' eller 'b', og intet andet.
+ ```
+
+ You can evaluate this dataset directly as follows:
+
+ ```bash
+ $ euroeval --model <model-id> --dataset winogrande-da
+ ```
+
+
+ ## Summarisation

  ### Nordjylland News

{euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/dutch.md

@@ -153,9 +153,9 @@ from a sentence, or by swapping two neighbouring words in a sentence. To ensure
  this does indeed break the grammaticality of the sentence, a set of rules were used on
  the part-of-speech tags of the words in the sentence.

- The original dataset consists of 13,603 samples, from which we use 1,024 / 256 / 2,048 samples for training,
- validation and testing, respectively (so 3,328 samples used in total). These splits are
- used as-is in the framework.
+ The original dataset consists of 13,603 samples, from which we use 1,024 / 256 / 2,048
+ samples for training, validation and testing, respectively (so 3,328 samples used in
+ total). These splits are used as-is in the framework.

  Here are a few examples from the training split:

@@ -390,8 +390,9 @@ $ euroeval --model <model-id> --dataset belebele-nl

  ### Unofficial: MultiWikiQA-nl

- This dataset will be published in an upcoming paper, and contains Dutch Wikipedia
- articles with generated questions and answers, using the LLM Gemini-1.5-pro.
+ This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2509.04111)
+ and contains Wikipedia articles with LLM-generated questions and answers in 300+
+ languages.

  The original full dataset consists of 5,000 samples in a single split. We use a 1,024 /
  256 / 2,048 split for training, validation and testing, respectively, sampled randomly.
@@ -676,9 +677,17 @@ $ euroeval --model <model-id> --dataset hellaswag-nl

  ### Unofficial: GoldenSwag-nl

- This dataset is a filtered and machine translated version of the English [HellaSwag dataset](https://aclanthology.org/P19-1472/), featuring both video descriptions from ActivityNet as well as how-to articles from WikiHow. The machine translated version was published in [this paper](https://doi.org/10.48550/arXiv.2410.08928) and was done using DeepL, and the filtering was published in [this paper](https://doi.org/10.48550/arXiv.2504.07825), which resulted in higher quality samples.
+ This dataset is a filtered and machine translated version of the English [HellaSwag
+ dataset](https://aclanthology.org/P19-1472/), featuring both video descriptions from
+ ActivityNet as well as how-to articles from WikiHow. The machine translated version was
+ published in [this paper](https://doi.org/10.48550/arXiv.2410.08928) and was done using
+ DeepL, and the filtering was published in [this
+ paper](https://doi.org/10.48550/arXiv.2504.07825), which resulted in higher quality
+ samples.

- The original full dataset consists of 1530 / 1530 samples for training and validation, respectively. However, they are exactly equal. We use a split of 660 / 256 / 2,048 samples for training, validation, and testing, respectively.
+ The original full dataset consists of 1530 / 1530 samples for training and validation,
+ respectively. However, they are exactly equal. We use a split of 660 / 256 / 2,048
+ samples for training, validation, and testing, respectively.

  Here are a few examples from the training split:

@@ -739,8 +748,72 @@ You can evaluate this dataset directly as follows:
  $ euroeval --model <model-id> --dataset goldenswag-nl
  ```

+ ### Unofficial: Winogrande-nl
+
+ This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2506.19468)
+ and is a translated and filtered version of the English [Winogrande
+ dataset](https://doi.org/10.1145/3474381).
+
+ The original full dataset consists of 47 / 1,210 samples for training and testing, and
+ we use the same splits.
+
+ Here are a few examples from the training split:
+
+ ```json
+ {
+ "text": "Emily vroeg haar zus Sarah of ze tampons of maandverband nodig had uit de winkel, hoewel _ dat niet nodig had omdat ze was overgestapt op het gebruik van menstruatiecups. Waar verwijst de lege _ naar?\nAntwoordopties:\na. Optie A: Emily\nb. Optie B: Sarah",
+ "label": "a"
+ }
+ ```
+
+ ```json
+ {
+ "text": "Bij het kopen van een huis heeft Patricia niet zoveel geld te besteden als Tanya, dus _ koopt een huis met 1 slaapkamer. Waar verwijst de lege _ naar?\nAntwoordopties:\na. Optie A: Patricia\nb. Optie B: Tanya",
+ "label": "a"
+ }
+ ```
+
+ ```json
+ {
+ "text": "Eenmaal in Polen genoot Dennis meer van de reis dan Jason omdat _ een oppervlakkige kennis van de Poolse taal had. Waar verwijst de lege _ naar?\nAntwoordopties:\na. Optie A: Dennis\nb. Optie B: Jason",
+ "label": "b"
+ }
+ ```
+
+ When evaluating generative models, we use the following setup (see the
+ [methodology](/methodology) for more information on how these are used):
+
+ - Number of few-shot examples: 5
+ - Prefix prompt:
+ ```
+ Hieronder staan meerkeuzevragen (met antwoorden).
+ ```
+ - Base prompt template:
+ ```
+ Vraag: {text}
+ Antwoordopties:
+ a. {option_a}
+ b. {option_b}
+ Antwoord: {label}
+ ```
+ - Instruction-tuned prompt template:
+ ```
+ Vraag: {text}
+ Antwoordopties:
+ a. {option_a}
+ b. {option_b}
+
+ Beantwoord de bovenstaande vraag met 'a' of 'b', en niets anders.
+ ```
+
+ You can evaluate this dataset directly as follows:
+
+ ```bash
+ $ euroeval --model <model-id> --dataset winogrande-nl
+ ```
+

- ## Summarization
+ ## Summarisation

  ### WikiLingua-nl

{euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/english.md

@@ -295,6 +295,79 @@ $ euroeval --model <model-id> --dataset squad
  ```


+ ### Unofficial: XQuAD-en
+
+ This dataset was published in [this paper](https://aclanthology.org/2020.acl-main.421/)
+ and contains 1190 question-answer pairs from [SQuAD
+ v1.1](https://rajpurkar.github.io/SQuAD-explorer/) translated into ten languages by
+ professional translators.
+
+ The dataset is split intro 550 / 128 / 512 question-answer pairs for training,
+ validation, and testing, respectively.
+
+ Here are a few examples from the training split:
+
+ ```json
+ {
+ "context": "Newcastle replaced him in January 1756 with Lord Loudoun, with Major General James Abercrombie as his second in command. Neither of these men had as much campaign experience as the trio of officers France sent to North America. French regular army reinforcements arrived in New France in May 1756, led by Major General Louis-Joseph de Montcalm and seconded by the Chevalier de Lévis and Colonel François-Charles de Bourlamaque, all experienced veterans from the War of the Austrian Succession. During that time in Europe, on May 18, 1756, England formally declared war on France, which expanded the war into Europe, which was later to be known as the Seven Years" War.",
+ "question": "Who led New France reinforcements in 1756?",
+ "answers": {
+ "answer_start": array([305], dtype=int32),
+ "text": array(["Major General Louis-Joseph de Montcalm"], dtype=object)
+ }
+ }
+ ```
+ ```json
+ {
+ "context": "Jacksonville is in the First Coast region of northeast Florida and is centered on the banks of the St. Johns River, about 25 miles (40 km) south of the Georgia state line and about 340 miles (550 km) north of Miami. The Jacksonville Beaches communities are along the adjacent Atlantic coast. The area was originally inhabited by the Timucua people, and in 1564 was the site of the French colony of Fort Caroline, one of the earliest European settlements in what is now the continental United States. Under British rule, settlement grew at the narrow point in the river where cattle crossed, known as Wacca Pilatka to the Seminole and the Cow Ford to the British. A platted town was established there in 1822, a year after the United States gained Florida from Spain; it was named after Andrew Jackson, the first military governor of the Florida Territory and seventh President of the United States.",
+ "question": "Prior to the arrival of the French, the area now known as Jacksonville was previously inhabited by what people?",
+ "answers": {
+ "answer_start": array([329], dtype=int32),
+ "text": array(["the Timucua"], dtype=object)
+ }
+ }
+ ```
+ ```json
+ {
+ "context": "Luther\"s hymns were frequently evoked by particular events in his life and the unfolding Reformation. This behavior started with his learning of the execution of Johann Esch and Heinrich Voes, the first individuals to be martyred by the Roman Catholic Church for Lutheran views, prompting Luther to write the hymn "Ein neues Lied wir heben an" ("A new song we raise"), which is generally known in English by John C. Messenger\"s translation by the title and first line "Flung to the Heedless Winds" and sung to the tune Ibstone composed in 1875 by Maria C. Tiddeman.",
+ "question": "What is the hymn known as in English?",
+ "answers": {
+ "answer_start": array([469], dtype=int32),
+ "text": array(["Flung to the Heedless Winds"], dtype=object)
+ }
+ }
+ ```
+
+ When evaluating generative models, we use the following setup (see the
+ [methodology](/methodology) for more information on how these are used):
+
+ - Number of few-shot examples: 4
+ - Prefix prompt:
+ ```
+ The following are texts with accompanying questions and answers.
+ ```
+ - Base prompt template:
+ ```
+ Text: {text}
+ Question: {question}
+ Answer in max 3 words:
+ ```
+ - Instruction-tuned prompt template:
+ ```
+ Text: {text}
+
+ Answer the following question about the above text in at most 3 words.
+
+ Question: {question}
+ ```
+
+ You can evaluate this dataset directly as follows:
+
+ ```bash
+ $ euroeval --model <model-id> --dataset xquad-en
+ ```
+
+
  ### Unofficial: BeleBele-en

  This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
@@ -358,8 +431,9 @@ $ euroeval --model <model-id> --dataset belebele-en

  ### Unofficial: MultiWikiQA-en

- This dataset will be published in an upcoming paper, and contains English Wikipedia
- articles with generated questions and answers, using the LLM Gemini-1.5-pro.
+ This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2509.04111)
+ and contains Wikipedia articles with LLM-generated questions and answers in 300+
+ languages.

  The original full dataset consists of 5,000 samples in a single split. We use a 1,024 /
  256 / 2,048 split for training, validation and testing, respectively, sampled randomly.
@@ -707,8 +781,69 @@ You can evaluate this dataset directly as follows:
  $ euroeval --model <model-id> --dataset hellaswag
  ```

+ ### Unofficial: Winogrande
+
+ This dataset was published in [this paper](https://doi.org/10.1145/3474381). The
+ original full dataset consists of 47 / 1,210 samples for training and testing, and we
+ use the same splits.
+
+ Here are a few examples from the training split:
+
+ ```json
+ {
+ "text": "Elena would grab their inventory in the back of the store for Megan to sell each time because _ was a businessperson. What does the blank _ refer to?\nChoices:\na. Elena\nb. Megan",
+ "label": "a"
+ }
+ ```
+
+ ```json
+ {
+ "text": "Once in Poland, Dennis enjoyed the trip more than Jason because _ had a deeper understanding of the Polish language. What does the blank _ refer to?\nChoices:\na. Dennis\nb. Jason",
+ "label": "a"
+ }
+ ```
+
+ ```json
+ {
+ "text": "Handling emergencies was never very difficult for Kevin but it was for Nelson because _ wasn't able to remain calm under pressure. What does the blank _ refer to?\nChoices:\na. Kevin\nb. Nelson",
+ "label": "b"
+ }
+ ```
+
+ When evaluating generative models, we use the following setup (see the
+ [methodology](/methodology) for more information on how these are used):
+
+ - Number of few-shot examples: 5
+ - Prefix prompt:
+ ```
+ The following are multiple choice questions (with answers).
+ ```
+ - Base prompt template:
+ ```
+ Question: {text}
+ Options:
+ a. {option_a}
+ b. {option_b}
+ Answer: {label}
+ ```
+ - Instruction-tuned prompt template:
+ ```
+ Question: {text}
+ Options:
+ a. {option_a}
+ b. {option_b}
+
+ Answer the above question by replying with 'a' or 'b', and nothing else.
+ ```
+
+ You can evaluate this dataset directly as follows:
+
+ ```bash
+ $ euroeval --model <model-id> --dataset winogrande
+ ```
+

- ## Summarization
+ ## Summarisation

  ### CNN/DailyMail

{euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/estonian.md

@@ -280,8 +280,9 @@ $ euroeval --model <model-id> --dataset scala-et

  ### MultiWikiQA-et

- This dataset will be published in an upcoming paper, and contains Estonian Wikipedia
- articles with generated questions and answers, using the LLM Gemini-1.5-pro.
+ This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2509.04111)
+ and contains Wikipedia articles with LLM-generated questions and answers in 300+
+ languages.

  The original full dataset consists of 5,000 samples in a single split. We use a 1,024 /
  256 / 2,048 split for training, validation and testing, respectively, sampled randomly.
@@ -351,7 +352,77 @@ $ euroeval --model <model-id> --dataset multi-wiki-qa-et

  ## Knowledge

- ### Exam-et
+ ### Trivia-et
+
+ This dataset was published [here](https://huggingface.co/datasets/TalTechNLP/trivia_et).
+ It was extracted from the "Eesti Mäng" board game, and contains trivia questions about
+ Estonia.
+
+ The original dataset contains 800 examples. From these, we use 240 / 60 / 500 samples
+ for our training, validation and test splits, respectively.
+
+ Note that this is a gated dataset, and we would like to avoid contaminating LLM
+ pre-training data as much as possible. Accordingly, we selected more generic questions
+ not representative of the full dataset in terms of question content to show here:
+
+ ```json
+ {
+ "text": "Mis on isoterm?\nVastusevariandid:\na. samatemperatuurijoon\nb. samaõhurõhujoon\nc. samapingejoon\nd. samakõrgusjoon",
+ "label": "a"
+ }
+ ```
+
+ ```json
+ {
+ "text": "Mis on isobaat?\nVastusevariandid:\na. samasügavusjoon\nb. samaõhurõhujoon\nc. samatemperatuurijoon\nd. samakõrgusjoon",
+ "label": "a"
+ }
+ ```
+
+ ```json
+ {
+ "text": "Mida mõõdetakse baromeetriga?\nVastusevariandid:\na. veekogude sügavust\nb. temperatuuri\nc. jõgede voolukiirust\nd. õhurõhku",
+ "label": "d"
+ ```
+
+ When evaluating generative models, we use the following setup (see the
+ [methodology](/methodology) for more information on how these are used):
+
+ - Number of few-shot examples: 5
+ - Prefix prompt:
+ ```
+ Järgnevad on vastusevariantidega küsimused (koos vastustega).
+ ```
+ - Base prompt template:
+ ```
+ Küsimus: {text}
+ Vastusevariandid:
+ a. {option_a}
+ b. {option_b}
+ c. {option_c}
+ d. {option_d}
+ Vastus: {label}
+ ```
+ - Instruction-tuned prompt template:
+ ```
+ Küsimus: {text}
+ Vastusevariandid:
+ a. {option_a}
+ b. {option_b}
+ c. {option_c}
+ d. {option_d}
+
+ Võimalikud vastused: 'a', 'b', 'c' or 'd'. Muud vastused ei ole lubatud.
+ ```
+
+ You can evaluate this dataset directly as follows:
+
+ ```bash
+ $ euroeval --model <model-id> --dataset trivia-et
+ ```
+
+
+ ### Unofficial: Exam-et

  This dataset was released in [this
  repository](https://huggingface.co/datasets/TalTechNLP/exam_et) and contains questions
@@ -420,9 +491,9 @@ $ euroeval --model <model-id> --dataset exam-et

  ## Common-sense Reasoning

- ### WinoGrande-ET
+ ### Winogrande-et

- The dataset includes the [WinoGrande](https://doi.org/10.48550/arXiv.1907.10641) test
+ The dataset includes the [Winogrande](https://doi.org/10.48550/arXiv.1907.10641) test
  set translated and culturally adapted by hand by a professional translator (citation
  TBA). The structure of the dataset is identical to the original. Since train and dev
  splits were not translated manually, we employ the GPT-4o model to translate the
@@ -430,7 +501,8 @@ expected number of examples starting from the beginning of the respective splits
  final dataset size is 1,024 / 256 / 1,767 for the training, validation and test splits,
  respectively.

- Here are a few examples from the training split (note that unlike the test split these are machine translated):
+ Here are a few examples from the training split (note that unlike the test split these
+ are machine translated):

  ```json
  {
@@ -440,7 +512,8 @@ Here are a few examples from the training split (note that unlike the test split
  ```
  ```json
  {
- "text": "Ian vabatahtlikult sõi Dennise menudo pärast seda, kui oli juba kausitäie söönud, sest _ nautis soolte söömist.\nVastusevariandid:\na. Ian\nb. Dennis", "label": "a"
+ "text": "Ian vabatahtlikult sõi Dennise menudo pärast seda, kui oli juba kausitäie söönud, sest _ nautis soolte söömist.\nVastusevariandid:\na. Ian\nb. Dennis",
+ "label": "a"
  }
  ```
  ```json
@@ -483,7 +556,7 @@ $ euroeval --model <model-id> --dataset winogrande-et
  ```


- ## Summarization
+ ## Summarisation

  ### ERRNews

@@ -495,8 +568,8 @@ pipeline paired with the human written summary from the archive.

  The original full dataset consists of 10,420 / 523 / 523 samples for training,
  validation and testing, respectively. We use a 1,024 / 256 / 2,048 split for training,
- validation and testing, respectively. The test split is extended with additional examples
- from the train split.
+ validation and testing, respectively. The test split is extended with additional
+ examples from the train split.

  ```json
  {
{euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/faroese.md

@@ -355,8 +355,9 @@ $ euroeval --model <model-id> --dataset foqa

  ### Unofficial: MultiWikiQA-fo

- This dataset will be published in an upcoming paper, and contains Faroese Wikipedia
- articles with generated questions and answers, using the LLM Gemini-1.5-pro.
+ This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2509.04111)
+ and contains Wikipedia articles with LLM-generated questions and answers in 300+
+ languages.

  The original full dataset consists of 5,000 samples in a single split. We use a 1,024 /
  256 / 2,048 split for training, validation and testing, respectively, sampled randomly.