EuroEval 15.7.1.tar.gz → 15.8.0.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {euroeval-15.7.1 → euroeval-15.8.0}/.pre-commit-config.yaml +1 -1
- {euroeval-15.7.1 → euroeval-15.8.0}/CHANGELOG.md +35 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/PKG-INFO +1 -1
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/datasets/danish.md +66 -2
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/datasets/dutch.md +64 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/datasets/english.md +57 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/datasets/finnish.md +65 -1
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/datasets/french.md +66 -2
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/datasets/german.md +59 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/datasets/icelandic.md +64 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/datasets/italian.md +66 -2
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/datasets/norwegian.md +64 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/datasets/spanish.md +66 -1
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/datasets/swedish.md +64 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/pyproject.toml +1 -1
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/benchmark_config_factory.py +1 -1
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/benchmark_modules/litellm.py +341 -150
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/benchmark_modules/vllm.py +1 -1
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/benchmarker.py +24 -12
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/dataset_configs/__init__.py +1 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/dataset_configs/english.py +1 -1
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/dataset_configs/finnish.py +11 -1
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/dataset_configs/italian.py +11 -1
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/dataset_configs/spanish.py +11 -1
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/finetuning.py +29 -31
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/languages.py +1 -1
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/task_group_utils/sequence_classification.py +46 -11
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/tokenization_utils.py +52 -16
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/utils.py +41 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_belebele.py +12 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_eltec.py +1 -1
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_hellaswag_fi.py +67 -62
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_hotter_and_colder_sentiment.py +1 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_icelandic_knowledge.py +3 -2
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_jentoft.py +3 -2
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_no_cola.py +3 -2
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_personal_sum.py +3 -2
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_scala.py +0 -6
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_scandisent_fi.py +11 -1
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_turku_ner_fi.py +10 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_tydiqa_fi.py +10 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_xlsum_fi.py +11 -1
- {euroeval-15.7.1 → euroeval-15.8.0}/uv.lock +2724 -2724
- {euroeval-15.7.1 → euroeval-15.8.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/.github/workflows/ci.yaml +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/.gitignore +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/CITATION.cff +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/CODE_OF_CONDUCT.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/CONTRIBUTING.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/Dockerfile.cuda +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/LICENSE +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/NEW_DATASET_GUIDE.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/README.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/CNAME +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/README.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/datasets/README.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/datasets/faroese.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/extras/radial_plotter.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/faq.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/gfx/favicon.png +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/leaderboards/Monolingual/danish.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/leaderboards/Monolingual/dutch.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/leaderboards/Monolingual/english.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/leaderboards/Monolingual/faroese.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/leaderboards/Monolingual/french.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/leaderboards/Monolingual/german.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/leaderboards/Monolingual/icelandic.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/leaderboards/Monolingual/italian.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/leaderboards/Monolingual/norwegian.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/leaderboards/Monolingual/spanish.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/leaderboards/Monolingual/swedish.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/leaderboards/Multilingual/european.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/leaderboards/Multilingual/germanic.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/leaderboards/Multilingual/romance.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/leaderboards/README.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/methodology.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/python-package.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/tasks/README.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/tasks/common-sense-reasoning.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/tasks/knowledge.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/tasks/linguistic-acceptability.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/tasks/named-entity-recognition.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/tasks/reading-comprehension.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/tasks/sentiment-classification.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/tasks/speed.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/docs/tasks/summarization.md +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/gfx/euroeval.png +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/gfx/euroeval.xcf +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/gfx/scandeval.png +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/makefile +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/mkdocs.yaml +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/__init__.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/benchmark_modules/__init__.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/benchmark_modules/base.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/benchmark_modules/fresh.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/benchmark_modules/hf.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/callbacks.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/cli.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/constants.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/data_loading.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/data_models.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/dataset_configs/danish.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/dataset_configs/dutch.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/dataset_configs/faroese.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/dataset_configs/french.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/dataset_configs/german.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/dataset_configs/icelandic.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/dataset_configs/norwegian.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/dataset_configs/swedish.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/enums.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/exceptions.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/generation.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/generation_utils.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/human_evaluation.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/model_cache.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/model_config.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/model_loading.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/prompt_templates/__init__.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/prompt_templates/linguistic_acceptability.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/prompt_templates/multiple_choice.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/prompt_templates/named_entity_recognition.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/prompt_templates/reading_comprehension.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/prompt_templates/sentiment_classification.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/prompt_templates/summarization.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/scores.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/speed_benchmark.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/task_group_utils/__init__.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/task_group_utils/multiple_choice_classification.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/task_group_utils/question_answering.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/task_group_utils/text_to_text.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/task_group_utils/token_classification.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/tasks.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/euroeval/types.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/constants.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_allocine.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_angry_tweets.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_arc.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_arc_is.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_cnn_dailymail.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_conll_en.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_conll_es.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_conll_nl.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_dane.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_danish_citizen_tests.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_dansk.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_danske_talemaader.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_danske_talemaader_old.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_dbrd.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_dutch_cola.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_fone.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_foqa.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_fosent.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_fquad.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_germanquad.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_germeval.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_hellaswag.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_ice_linguistic.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_icelandic_error_corpus.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_icelandic_qa.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_icesum.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_ilpost_sum.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_mim_gold_ner.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_mlqa_es.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_mlsum_de.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_mlsum_es.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_mmlu.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_multinerd-it.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_no_sammendrag.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_nor_common_sense_qa.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_nordjylland_news.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_norec.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_norglm_multiqa.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_norglm_multisum.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_norne.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_norquad.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_nqii.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_nrk_quiz_qa.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_orange_sum.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_rrn.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_sb10k.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_scandiqa.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_schibsted.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_sentiment_headlines_es.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_sentipolc16.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_squad.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_squad_it.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_squad_nl.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_squad_nl_old.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_sst5.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_suc3.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_swedn.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_swerec.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_wiki_lingua_nl.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_wikiann_fo.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_wikineural-it.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_winogrande_is.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/create_xquad_es.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/fix_dot_env_file.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/load_ud_pos.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/src/scripts/versioning.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/__init__.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/conftest.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_benchmark_config_factory.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_benchmark_modules/__init__.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_benchmark_modules/test_base.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_benchmark_modules/test_fresh.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_benchmark_modules/test_hf.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_benchmark_modules/test_litellm.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_benchmark_modules/test_vllm.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_benchmarker.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_callbacks.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_cli.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_constants.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_data_loading.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_data_models.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_dataset_configs.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_enums.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_exceptions.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_finetuning.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_generation.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_human_evaluation.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_languages.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_model_cache.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_model_config.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_model_loading.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_scores.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_speed_benchmark.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_task_utils/__init__.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_task_utils/test_question_answering.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_task_utils/test_sequence_classification.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_task_utils/test_text_to_text.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_task_utils/test_token_classification.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_tasks.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_tokenization_utils.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_types.py +0 -0
- {euroeval-15.7.1 → euroeval-15.8.0}/tests/test_utils.py +0 -0
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -10,6 +10,41 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
 
 
 
+## [v15.8.0] - 2025-05-07
+### Added
+- Added the BeleBele datasets for Finnish, Italian and Spanish. They are listed as
+  unofficial for now. This was contributed by
+  [@oliverkinch](https://github.com/oliverkinch) ✨
+
+### Changed
+- Now uses asynchronous requests when dealing with API models, speeding up generation
+  immensely. This was contributed by [@mathiasesn](https://github.com/mathiasesn) ✨
+
+### Fixed
+- Added HellaSwag-fi back in, as the issue with the labels in the test split has been
+  fixed.
+- Now uses `eval_accumulation_steps` (set to 32) when evaluating encoder models, to
+  avoid running out of memory during evaluation.
+- Now also looks for `<|startoftext|>` as the BOS token if the BOS token is not set in
+  the model's config.
+
+
+## [v15.7.2] - 2025-05-02
+### Fixed
+- Now does not check whether a model exists if it has already been evaluated. This was
+  an issue when evaluating Ollama models while the Ollama server was not running.
+- When evaluating instruction-tuned models on text classification tasks, the chat
+  template sometimes ends with special symbols, such as a newline, which can change the
+  tokenisation of the generated label. When evaluating the model using logprobs, we
+  were thus looking for the wrong label in these cases. We now take this into account
+  and log it to the user if the labels are not found, to avoid confusion.
+- Finnish datasets were not included in the default "all" dataset list, which is used
+  when no datasets are specified. This has been fixed.
+- Temporarily disabled HellaSwag-fi, as there is an issue with the labels in the test
+  split, causing errors during evaluation. We will re-enable it in a future release,
+  once this has been fixed.
+
+
 ## [v15.7.1] - 2025-04-29
 ### Changed
 - Marked the DBRD Dutch sentiment classification as official, as the quality is
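The asynchronous-requests change in v15.8.0 accounts for most of the churn in `src/euroeval/benchmark_modules/litellm.py` (+341 / -150 above). A minimal sketch of the underlying pattern, combining LiteLLM's async `acompletion` API with `asyncio.gather`; this is not EuroEval's actual implementation:

```python
import asyncio

import litellm


async def generate(model: str, prompt: str) -> str:
    """Send one chat-completion request without blocking the event loop."""
    response = await litellm.acompletion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


async def generate_all(model: str, prompts: list[str]) -> list[str]:
    """Fire all requests concurrently instead of awaiting them one by one."""
    return await asyncio.gather(*(generate(model, p) for p in prompts))


# Example: answers = asyncio.run(generate_all("gpt-4o-mini", benchmark_prompts))
```

With sequential requests, wall-clock time scales with the number of samples times the per-request latency; with concurrent requests it is dominated by the slowest request in flight, which is what the changelog's "speeding up generation immensely" refers to.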
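The `eval_accumulation_steps` fix lands in `src/euroeval/finetuning.py`. The option is a standard Hugging Face `TrainingArguments` field; roughly (a sketch of the setting, not EuroEval's exact configuration):

```python
from transformers import TrainingArguments

# Without eval_accumulation_steps, every prediction tensor stays on the GPU
# until the whole evaluation loop has finished; setting it moves accumulated
# predictions to the CPU every 32 steps, bounding peak GPU memory.
args = TrainingArguments(
    output_dir="model-output",   # placeholder value
    eval_accumulation_steps=32,  # the value named in the changelog
)
```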
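The BOS-token fallback touches `src/euroeval/tokenization_utils.py`. The gist can be sketched as follows; the helper name is invented and the real logic may differ:

```python
from transformers import PreTrainedTokenizerBase

FALLBACK_BOS = "<|startoftext|>"


def resolve_bos_token(tokenizer: PreTrainedTokenizerBase) -> str | None:
    """Return the configured BOS token, else fall back to '<|startoftext|>'."""
    if tokenizer.bos_token is not None:
        return tokenizer.bos_token
    # Only accept the fallback if the vocabulary actually contains it.
    if FALLBACK_BOS in tokenizer.get_vocab():
        return FALLBACK_BOS
    return None
```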
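The v15.7.2 logprob fix (the `src/euroeval/task_group_utils/sequence_classification.py` change) amounts to matching generated tokens against several possible tokenisations of each label, since a chat template ending in, say, a newline changes how the first generated token is split. Schematically (an invented helper, for illustration only):

```python
def label_candidates(label: str) -> set[str]:
    """Spellings of a label that may appear as the first generated token."""
    # A leading space or newline can be glued onto the label by the
    # tokeniser, depending on how the chat template ends.
    return {label, " " + label, "\n" + label}
```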
--- a/docs/datasets/danish.md
+++ b/docs/datasets/danish.md
@@ -353,6 +353,70 @@ $ euroeval --model <model-id> --dataset scandiqa-da
 ```
 
 
+### Unofficial: BeleBele-da
+
+This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features multiple-choice reading comprehension questions across 122 languages.
+
+The original dataset contains 900 unique multiple-choice reading comprehension passages and questions. From these, we use a 256 / 64 / 580 split for training, validation and testing, respectively.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "text": "Tekst: Prognoserne siger, at stormen, der er omkring 645 mil (1040 km) vest for Kap Verde-øerne, sandsynligvis vil forsvinde, før den truer nogen landområder. Fred har i øjeblikket vinde på 165 km/t og bevæger sig mod nordvest. Fred er den heftigste tropiske cyklon, der nogensinde er blevet registreret så sydligt og østligt i Atlanterhavet, siden man begyndte at bruge satellitbilleder, og kun den tredje store orkan, der er registreret øst for 35°V.\nSpørgsmål: Da Fred befandt sig nær Kap Verde-øerne, hvilken retning bevægede den sig så mod?\nSvarmuligheder:\na. Vest\nb. Syd\nc. Øst\nd. Nordvest",
+  "label": "d"
+}
+```
+```json
+{
+  "text": "Tekst: "Siden Pakistan i 1947 blev uafhængigt af det britiske styre, har den pakistanske præsident udpeget ""politiske agenter"", som styrer FATA, og som har næsten fuldstændig kontrol over områderne. Disse agenter er ansvarlige for at levere regerings- og retstjenester i henhold til artikel 247 i den pakistanske forfatning."\nSpørgsmål: Hvem leverer retslige tjenester til FATA?\nSvarmuligheder:\na. Den pakistanske regering\nb. Politiske agenter\nc. Pakistans præsident\nd. Den britiske regering",
+  "label": "b"
+}
+```
+```json
+{
+  "text": "Tekst: Alle er en del af samfundet og benytter transportsystemerne. Næsten alle klager over transportsystemerne. I udviklede lande hører du sjældent ligeså mange klager over vandkvalitet eller broer, der styrter sammen. Hvorfor giver transportsystemerne anledning til sådanne klager, hvorfor svigter de på daglig basis? Er transportingeniører blot inkompetente? Eller foregår der noget mere fundamentalt?\nSpørgsmål: Hvilken offentlig service siges at skabe størst utilfredshed i udviklede lande?\nSvarmuligheder:\na. Vandkvalitet\nb. Brobyggelse\nc. Offentlig transport\nd. Uddannelse",
+  "label": "c"
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 5
+- Prefix prompt:
+  ```
+  Følgende er multiple choice spørgsmål (med svar).
+  ```
+- Base prompt template:
+  ```
+  Spørgsmål: {text}
+  Svarmuligheder:
+  a. {option_a}
+  b. {option_b}
+  c. {option_c}
+  d. {option_d}
+  Svar: {label}
+  ```
+- Instruction-tuned prompt template:
+  ```
+  Spørgsmål: {text}
+  Svarmuligheder:
+  a. {option_a}
+  b. {option_b}
+  c. {option_c}
+  d. {option_d}
+
+  Besvar ovenstående spørgsmål ved at svare med 'a', 'b', 'c' eller 'd', og intet andet.
+  ```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+$ euroeval --model <model-id> --dataset belebele-da
+```
+
+
 ## Knowledge
 
 ### Danske Talemåder
@@ -608,7 +672,7 @@ When evaluating generative models, we use the following setup (see the
   a. {option_a}
   b. {option_b}
   c. {option_c}
-  d. {
+  d. {option_d}
   Svar: {label}
   ```
 - Instruction-tuned prompt template:
@@ -673,7 +737,7 @@ When evaluating generative models, we use the following setup (see the
   a. {option_a}
   b. {option_b}
   c. {option_c}
-  d. {
+  d. {option_d}
   Svar: {label}
   ```
 - Instruction-tuned prompt template:
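The two `d. {` → `d. {option_d}` hunks above repair a truncated line in the documented prompt templates. For illustration, this is how the (now complete) Danish base template is filled for a single few-shot example; the sample values here are made up:

```python
BASE_PROMPT_TEMPLATE = (
    "Spørgsmål: {text}\n"
    "Svarmuligheder:\n"
    "a. {option_a}\n"
    "b. {option_b}\n"
    "c. {option_c}\n"
    "d. {option_d}\n"
    "Svar: {label}"
)

# Hypothetical sample, just to show the substitution.
example = {
    "text": "Hvilken by er Danmarks hovedstad?",
    "option_a": "Aarhus",
    "option_b": "København",
    "option_c": "Odense",
    "option_d": "Aalborg",
    "label": "b",
}

print(BASE_PROMPT_TEMPLATE.format(**example))
```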
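All of the `$ euroeval ...` commands in these dataset sections can also be run from Python via the package's `Benchmarker` class (documented in `docs/python-package.md`). A sketch, assuming the documented call signature:

```python
from euroeval import Benchmarker

# Python equivalent of `euroeval --model <model-id> --dataset belebele-da`.
benchmarker = Benchmarker()
benchmarker(model="<model-id>", dataset="belebele-da")
```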
--- a/docs/datasets/dutch.md
+++ b/docs/datasets/dutch.md
@@ -323,6 +323,70 @@ $ euroeval --model <model-id> --dataset squad-nl
 ```
 
 
+### Unofficial: BeleBele-nl
+
+This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features multiple-choice reading comprehension questions across 122 languages.
+
+The original dataset contains 900 unique multiple-choice reading comprehension passages and questions. From these, we use a 256 / 64 / 580 split for training, validation and testing, respectively.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "text": "Tekst: Mystiek is het geloven in, identificeren met of bewustzijn van een ultieme werkelijkheid, goddelijkheid, spirituele waarheid of God. De kerkganger streeft naar een directe gewaarwording, intuïtie of inzicht in de goddelijke werkelijkheid. Volgers streven een bepaalde manier van leven na of willen ervaringen opdoen die ze datzelfde gevoel geven. In tegenstelling tot andere religieuze overtuigingen en aanbidding, legt mystiek nadruk op de rechtstreekse persoonlijke beleving van een unieke staat van bewustzijn, vooral van een vredige, inzichtelijke, gelukzalige of extatische aard.\nVraag: Wat is geen juiste omschrijving van mystiek?\nAntwoordopties:\na. De nadruk ligt op het ervaren van een vredige, gelukzalige staat van bewustzijn\nb. Volgers van mystiek streven bewustwording na van een spirituele werkelijkheid\nc. Volgers van mystiek passen gebruiken toe die hun inzicht in een goddelijke werkelijkheid vergroten\nd. De nadruk op het streven naar een directe persoonlijke beleving is vergelijkbaar met veel andere vormen van religieuze overtuiging en aanbidding",
+  "label": "d"
+}
+```
+```json
+{
+  "text": "Tekst: Het favoriete maaltje van ocelotten zijn kleine dieren. Ze vangen apen, slangen, knaagdieren en vogels als dat lukt. De ocelot jaagt bijna uitsluitend op dieren die veel kleiner zijn dan hij zelf is. Geleerden vermoeden dat ocelotten hun reukvermogen gebruiken om op kleine dieren (hun prooi) te jagen, door aan de grond te ruiken waar deze zijn geweest. Ze kunnen door nachtvisie heel goed in het donker zien en bewegen zich heel onopvallend voort. Ocelotten jagen op prooi door zich één te maken met de omgeving en vervolgens op hun prooi te springen.\nVraag: Welke uitspraak over een ocelot is onjuist?\nAntwoordopties:\na. Ze kunnen goed in het donker jagen\nb. Ze bewegen zich in stilte voort\nc. Hun reukvermogen is zwak\nd. Ze jagen het liefst op kleine dieren",
+  "label": "c"
+}
+```
+```json
+{
+  "text": "Tekst: Er was 120-160 kubieke meter brandstof aan boord van de Luno toen het schip motorproblemen kreeg en door de harde wind en golven tegen de golfbreker werd geduwd. De twaalf crewleden zijn met helikopters in veiligheid gebracht, met als enige verwonding een gebroken neus. Het 100 meter lange schip was onderweg om de gebruikelijke lading kunstmest op te halen. In eerste instantie vreesden autoriteiten dat het vaartuig met de lading zou kunnen gaan lekken.\nVraag: Waar vreesden de autoriteiten volgens de tekst in eerste instantie voor wat betreft de Luno?\nAntwoordopties:\na. Gebrek aan een lading kunstmest\nb. Golven en harde wind\nc. Lekken van brandstof\nd. Verwondingen van bemanningsleden",
+  "label": "c"
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 5
+- Prefix prompt:
+  ```
+  Hieronder staan meerkeuzevragen (met antwoorden).
+  ```
+- Base prompt template:
+  ```
+  Vraag: {text}
+  Antwoordopties:
+  a. {option_a}
+  b. {option_b}
+  c. {option_c}
+  d. {option_d}
+  Antwoord: {label}
+  ```
+- Instruction-tuned prompt template:
+  ```
+  Vraag: {text}
+  Antwoordopties:
+  a. {option_a}
+  b. {option_b}
+  c. {option_c}
+  d. {option_d}
+
+  Beantwoord de bovenstaande vraag met 'a', 'b', 'c' of 'd', en niets anders.
+  ```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+$ euroeval --model <model-id> --dataset belebele-nl
+```
+
+
 ## Knowledge
 
 ### MMLU-nl
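Every BeleBele section uses the same 256 / 64 / 580 split of the 900 original samples; the real preprocessing is `src/scripts/create_belebele.py` (+12 above). As a rough sketch of deriving such a split from the Hugging Face `facebook/belebele` dataset, where the shuffle seed is an assumption and not necessarily the script's actual choice:

```python
from datasets import load_dataset

# Belebele ships each language's 900 samples as a single "test" split.
ds = load_dataset("facebook/belebele", "nld_Latn", split="test")
ds = ds.shuffle(seed=4242)  # assumed seed, for illustration only

train = ds.select(range(256))
val = ds.select(range(256, 320))
test = ds.select(range(320, 900))
```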
--- a/docs/datasets/english.md
+++ b/docs/datasets/english.md
@@ -295,6 +295,63 @@ $ euroeval --model <model-id> --dataset squad
 ```
 
 
+### Unofficial: BeleBele-en
+
+This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features reading comprehension questions across 122 languages. The dataset was created by professional translators who translated 900 multiple-choice questions from English into other languages, with answers carefully validated by native speakers.
+
+The original dataset consists of 900 samples, and we use 256 / 64 / 580 samples for training, validation and testing, respectively.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "text": 'Text: """We will endeavour to cut carbon dioxide emissions per unit of GDP by a notable margin by 2020 from the 2005 level,"" Hu said. He did not set a figure for the cuts, saying they will be made based on China\'s economic output. Hu encouraged developing countries ""to avoid the old path of polluting first and cleaning up later."" He added that ""they should not, however, be asked to take on obligations that go beyond their development stage, responsibility and capabilities."""\nQuestion: What did Hu suggest that developing countries do?\nChoices:\na. Take on obligations that push their development stage\nb. Focus on economic output\nc. Go beyond their current responsibilities\nd. Avoiding old paths of pollution',
+  "label": "d"
+}
+```
+```json
+{
+  "text": 'Text: "All of the cave entrances, which were named ""The Seven Sisters"", are at least 100 to 250 meters (328 to 820 feet) in diameter. Infrared images show that the temperature variations from night and day show that they are likely caves. ""They are cooler than the surrounding surface in the day and warmer at night. Their thermal behavior is not as steady as large caves on Earth that often maintain a fairly constant temperature, but it is consistent with these being deep holes in the ground,"" said Glen Cushing of the United States Geological Survey (USGS) Astrogeology Team and of Northern Arizona University located in Flagstaff, Arizona."\nQuestion: What information suggests that The Seven Sisters are caves?\nChoices:\na. Temperature variations\nb. The diameter of the cave entrances\nc. Geological surveys\nd. Pictures of caves on Earth',
+  "label": "a"
+}
+```
+```json
+{
+  "text": 'Text: The proposed amendment already passed both houses in 2011. A change was made this legislative session when the second sentence was deleted first by the House of Representatives and then was passed in a similar form by the Senate Monday. The failure of the second sentence, which proposes to ban same-sex civil unions, could possibly open the door for civil unions in the future. Following the process, HJR-3 will be reviewed again by the next elected legislature in either 2015 or 2016 to remain in process.\nQuestion: According to the passage, when was the second sentence deleted?\nChoices:\na. During the legislative session\nb. In 2011\nc. On Monday\nd. In 2015',
+  "label": "a"
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 4
+- Prefix prompt:
+  ```
+  The following are texts with accompanying questions and answers.
+  ```
+- Base prompt template:
+  ```
+  Text: {text}
+  Question: {question}
+  Answer in max 3 words:
+  ```
+- Instruction-tuned prompt template:
+  ```
+  Text: {text}
+
+  Answer the following question about the above text in at most 3 words.
+
+  Question: {question}
+  ```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+$ euroeval --model <model-id> --dataset belebele-en
+```
+
+
 ## Knowledge
 
 ### MMLU
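Throughout these sections, "Number of few-shot examples: n" means the final prompt is the prefix prompt, then n completed base templates, then the query with the answer left blank. A schematic sketch of that assembly; EuroEval's own prompt construction is more involved (chat templates, truncation to the context window), so treat this as illustration only:

```python
def build_few_shot_prompt(
    prefix: str,
    template: str,
    few_shot_examples: list[dict[str, str]],
    query: dict[str, str],
) -> str:
    """Prefix, then each completed example, then the query with an empty label."""
    shots = [template.format(**example) for example in few_shot_examples]
    final = template.format(**{**query, "label": ""}).rstrip()
    return "\n\n".join([prefix, *shots, final])
```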
--- a/docs/datasets/finnish.md
+++ b/docs/datasets/finnish.md
@@ -266,6 +266,70 @@ $ euroeval --model <model-id> --dataset tydiqa-fi
 ```
 
 
+### Unofficial: BeleBele-fi
+
+This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features multiple-choice reading comprehension questions across 122 languages.
+
+The original dataset contains 900 unique multiple-choice reading comprehension passages and questions. From these, we use a 256 / 64 / 580 split for training, validation and testing, respectively.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "text": "Toisin kuin muut kädelliset, isot ihmisapinat eivät enää käytä käsiään liikkumiseen, painon kannattelemiseen tai liikkumiseen puissa itseään heilautellen. Simpanssin käsi ja jalka ovat samankokoisia ja -pituisia, mikä viittaa siihen, että kädelle varataan painoa rystykävelyssä. Ihmisen käsi on lyhyempi kuin jalka, ja sen sormiluut ovat suoremmat. Kahden-kolmen miljoonan vuoden ikäiset käsiluiden fossiilit paljastavat käden erikoistumisessa tämän muutoksen liikkumisesta käyttelyyn.\nKysymys: Mikä seuraavista kuvaa tarkasti simpanssin sormiluita?\nVaihtoehdot:\na. Ne ovat suoremmat kuin ihmisillä\nb. Niiden kädet ja jalat ovat erikokoisia\nc. Niitä käytetään painon kannattelemiseen\nd. Niitä käytetään pääasiassa käyttelyyn",
+  "label": "c"
+}
+```
+```json
+{
+  "text": "Panaman paperit on yläkäsite panamalaisen lakiyrityksen Mossack Fonsecan noin kymmenelle miljoonalle asiakirjalle, jotka vuodettiin lehdistölle keväällä 2016. Asiakirjoista selvisi, että neljätoista pankkia auttoi varakkaita asiakkaita piilottamaan miljardeja USA:n dollareita verojen ja muiden sääntelyjen välttämiseksi. Brittiläisen sanomalehden The Guardianin mukaan Deutsche Bank hallitsi tämän toteuttamiseen käytetyistä 1 200 postilaatikkoyrityksestä suunnilleen kolmasosaa. Seurasi maailmanlaajuisia protesteja ja useita rikossyytteitä, ja Islannin ja Pakistanin hallitusten johtajat kumpikin erosivat.\nKysymys: Kuka brittiläisen lehdistön väitteen mukaan hallinnoi monia varojen piilottamisessa käytettyjä yrityksiä tekstikatkelman mukaan?\nVaihtoehdot:\na. Eri pankkien varakkaat asiakkaat\nb. Panamalainen lakiyritys\nc. Deutsche Bank\nd. Pakistanin hallitus",
+  "label": "c"
+}
+```
+```json
+{
+  "text": "Teksti: Sundarban on maailman suurin mangrovemetsäalue. Se ulottuu 80 kilometriä (50 mailia) rannikolta Bangladeshin ja Intian takamaille. Sundarban on julistettu Unescon maailmanperintökohteeksi. Metsän Intian puolella sijaitsevaa osaa kutsutaan Sundarbanin kansallispuistoksi. Metsät eivät kuitenkaan ole pelkkiä mangrovesoita, vaan niihin kuuluu joitakin viimeisiä jäänteitä niistä mahtavista viidakoista, jotka aikoinaan peittivät koko Gangesin tasangon. Sundarban kattaa 3 850 neliökilometrin alueen, josta noin kolmasosa on vesi- tai suoalueiden peitossa. Vuodesta 1966 asti Sundarbans on ollut villieläinten suojelualue. Arvioidaan, että siellä on nykyään 400 intiantiikeriä ja suunnilleen 30 000 aksishirveä.\nKysymys: Mikä metsän osa on Intian puolella?\nVaihtoehdot:\na. Sundarbanin kansallispuisto\nb. Villieläinten suojelualue\nc. Maailmanperintökohde\nd. Gangesin tasanko",
+  "label": "a"
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 5
+- Prefix prompt:
+  ```
+  Seuraavat ovat monivalintakysymyksiä (vastauksineen).
+  ```
+- Base prompt template:
+  ```
+  Kysymys: {text}
+  Vaihtoehdot:
+  a. {option_a}
+  b. {option_b}
+  c. {option_c}
+  d. {option_d}
+  Vastaus: {label}
+  ```
+- Instruction-tuned prompt template:
+  ```
+  Kysymys: {text}
+  Vaihtoehdot:
+  a. {option_a}
+  b. {option_b}
+  c. {option_c}
+  d. {option_d}
+
+  Vastaa yllä olevaan kysymykseen käyttämällä 'a', 'b', 'c' tai 'd', äläkä mitään muuta.
+  ```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+$ euroeval --model <model-id> --dataset belebele-fi
+```
+
+
 ## Common-sense Reasoning
 
 ### HellaSwag-fi
@@ -310,7 +374,7 @@ When evaluating generative models, we use the following setup (see the
   a. {option_a}
   b. {option_b}
   c. {option_c}
-  d. {
+  d. {option_d}
   Vastaus: {label}
   ```
 - Instruction-tuned prompt template:
--- a/docs/datasets/french.md
+++ b/docs/datasets/french.md
@@ -296,6 +296,70 @@ $ euroeval --model <model-id> --dataset fquad
 ```
 
 
+### Unofficial: BeleBele-fr
+
+This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features multiple-choice reading comprehension questions across 122 languages.
+
+The original dataset contains 900 unique multiple-choice reading comprehension passages and questions. From these, we use a 256 / 64 / 580 split for training, validation and testing, respectively.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "text": "Texte: Lorsqu’un petit groupe d’êtres vivants (une petite population) est séparé de la population principale dont il est issu (par exemple, s’il se déplace au-dessus d’une chaîne de montagnes ou d’une rivière, ou s’il se déplace vers une nouvelle île de sorte qu’il ne peut pas facilement revenir en arrière), il se retrouve souvent dans un environnement différent de celui dans lequel il était auparavant. Ce nouvel environnement a des ressources et des concurrents différents, de sorte que la nouvelle population aura besoin de caractéristiques ou d'adaptations nouvelles pour être un concurrent puissant par rapport à ce dont elle avait besoin auparavant. La population d'origine n'a pas changé du tout,\xa0elle a toujours besoin des mêmes adaptations. Au fil du temps, à mesure que la nouvelle population s'adapte à son nouvel environnement, elle commence à ressembler de moins en moins à l'autre population. Enfin, après des milliers ou même des millions d'années, les deux populations paraîtront tellement différentes qu'elles ne pourront plus être considérées comme appartenant à la même espèce. Nous appelons ce processus «\u2009spéciation\u2009», ce qui signifie simplement la formation de nouvelles espèces. La spéciation est une conséquence inévitable et une partie très importante de l’évolution.\nQuestion: D’après l’extrait et parmi les exemples ci-dessous, qu’est-ce qui gênerait le processus d’évolution\xa0?\nChoix:\na. La difficulté pour un petit groupe à s’épanouir dans un nouvel endroit\nb. La migration d’une portion d’une population vers un nouvel environnement\nc. L’ajustement par une population de son adaptation à un nouvel environnement\nd. Le fait qu’une population finisse par devenir deux populations distinctes",
+  "label": "a"
+}
+```
+```json
+{
+  "text": "Texte: Le pillage généralisé se serait poursuivi pendant la nuit, les forces de l'ordre n'étant pas présentes dans les rues de Bichkek. Un observateur a décrit Bichkek comme étant en train de sombrer dans un état d’« anarchie », tandis que la population se déplaçait en bandes dans les rues et pillait les magasins de biens de consommation. Plusieurs habitants de Bichkek ont reproché les manifestants du sud d'être responsables de l'anarchie.\nQuestion: Qui a accusé les manifestants du sud de pillage\xa0?\nChoix:\na. Des habitants de Bichkek\nb. Les forces de l’ordre\nc. Les anarchistes\nd. Des bandes de personnes",
+  "label": "a"
+}
+```
+```json
+{
+  "text": "Texte: Dans de nombreuses régions du monde, faire un signe de la main est un geste amical signifiant «\u2009bonjour\u2009». En revanche, en Malaisie, du moins chez les Malais des zones rurales, cela signifie « viens par ici », comme le fait de plier l'index vers soi, geste utilisé dans certains pays occidentaux, et il ne devrait être utilisé qu'en ce sens. De même, un voyageur britannique en Espagne pourrait confondre un signe d'adieu fait par une personne qui tourne la paume de sa main vers elle-même (plutôt que vers la personne à qui elle adresse le signe) avec une invitation à revenir.\nQuestion: Dans les zones rurales de la Malaisie, quel geste signifie « viens par ici » ?\nChoix:\na. Plier l’index\nb. Faire un signe de la main\nc. Faire un « high five »\nd. Lever le pouce",
+  "label": "b"
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 5
+- Prefix prompt:
+  ```
+  Les questions suivantes sont des questions à choix multiples (avec réponses).
+  ```
+- Base prompt template:
+  ```
+  Question: {text}
+  Choix:
+  a. {option_a}
+  b. {option_b}
+  c. {option_c}
+  d. {option_d}
+  Réponse: {label}
+  ```
+- Instruction-tuned prompt template:
+  ```
+  Question: {text}
+  Choix:
+  a. {option_a}
+  b. {option_b}
+  c. {option_c}
+  d. {option_d}
+
+  Répondez à la question ci-dessus par 'a', 'b', 'c' ou 'd', et rien d'autre.
+  ```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+$ euroeval --model <model-id> --dataset belebele-fr
+```
+
+
 ## Knowledge
 
 ### MMLU-fr
@@ -348,7 +412,7 @@ When evaluating generative models, we use the following setup (see the
   a. {option_a}
   b. {option_b}
   c. {option_c}
-  d. {
+  d. {option_d}
   Réponse: {label}
   ```
 - Instruction-tuned prompt template:
@@ -419,7 +483,7 @@ When evaluating generative models, we use the following setup (see the
   a. {option_a}
   b. {option_b}
   c. {option_c}
-  d. {
+  d. {option_d}
   Réponse: {label}
   ```
 - Instruction-tuned prompt template:
--- a/docs/datasets/german.md
+++ b/docs/datasets/german.md
@@ -284,6 +284,65 @@ $ euroeval --model <model-id> --dataset germanquad
 ```
 
 
+### Unofficial: BeleBele-de
+
+This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features multiple-choice reading comprehension questions across 122 languages.
+
+The original dataset contains 900 unique multiple-choice reading comprehension passages and questions. From these, we use a 256 / 64 / 580 split for training, validation and testing, respectively.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "text": "Text: Es gibt viele Dinge, die Sie vor und während einer Reise berücksichtigen müssen. Erwarten Sie nicht, dass die Dinge beim Reisen genau so sind wie „zuhause“. Umgangsformen, Gesetze, Essen, Verkehr, Unterkünfte, Standards, Spache und so weiter werden zu einem gewissen Grad anders sein als dort, wo Sie leben. Dies ist etwas, was man immer im Hinterkopf behalten sollte, um Enttäuschung oder gar Abneigung über lokale Vorgehensweisen zu vermeiden.\nFragen: Was kann Reisenden dem Abschnitt nach helfen, Enttäuschung beim Besuch neuer Orte zu vermeiden?\nAntwortmöglichkeiten:\na. Ähnliche Standards wie zuhause erwarten\nb. Essen probieren, das ungewohnt ist\nc. Die gleichen Gesetze wie zuhause einhalten\nd. Nicht vorher nach Unterkünften recherchieren",
+  "label": "b"
+}
+```
+```json
+{
+  "text": "Text: Genehmigungen müssen im Voraus bestellt werden. Sie benötigen eine Genehmigung, um in La Sirena zu übernachten. Sirena ist die einzige Rangerstation, die neben Zelten auch Übernachtung im Schlafsaal und warme Mahlzeiten anbietet. La Leona, San Pedrillo und Los Patos bieten nur Camping ohne Verpflegung an. Es ist möglich, eine Parklizenz direkt bei der Rangerstation in Puerto Jiménez zu bekommen, aber sie akzeptieren keine Kreditkarten Die Parkverwaltung (MINAE) stellt Genehmigungen für den Park nicht früher als einen Monat vor der geplanten Ankunft aus. CafeNet El Sol bietet einen Reservierungsservice gegen eine Gebühr von 30 US-Dollar bzw. 10 US-Dollar für Tageskarten an. Einzelheiten dazu findet man auf deren Corcovado-Seite.\nFragen: Welche der folgenden Rangerstationen bietet zwei Übernachtungsmöglichkeiten an?\nAntwortmöglichkeiten:\na. Sirena\nb. Los Patos\nc. La Leona\nd. San Pedrillo",
+  "label": "a"
+}
+```
+```json
+{
+  "text": "Text: Naturnaher Tourismus zieht Leute an, die daran interessiert sind, Naturgebiete zu besuchen, um die Landschaft zu genießen, einschließlich der wilden Pflanzen und Tiere. Beispiele für Aktivitäten vor Ort sind Jagen, Angeln, Fotografie, Vogelbeobachtung, der Besuch von Parks und das Lernen von Informationen über das Ökosystem. Ein Beispiel dafür ist der Besuch, das Fotografieren und das Studieren von Orangutangs in Borneo.\nFragen: Welche der folgenden Aktivitäten ist kein Beispiel für naturnahen Tourismus?\nAntwortmöglichkeiten:\na. Wandern zu einem Wasserfall\nb. Fotografieren von Wildblumen\nc. Besuch eines Wissenschaftsmuseum\nd. Fliegenfischen",
+  "label": "c"
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 5
+- Prefix prompt:
+  ```
+  Die folgenden Fragen sind Multiple-Choice-Fragen (mit Antworten).
+  ```
+- Base prompt template:
+  ```
+  Frage: {text}
+  Antwort: {label}
+  ```
+- Instruction-tuned prompt template:
+  ```
+  Frage: {text}
+  Antwortmöglichkeiten:
+  a. {option_a}
+  b. {option_b}
+  c. {option_c}
+  d. {option_d}
+
+  Beantworten Sie die obige Frage mit 'a', 'b', 'c' oder 'd', und nichts anderes.
+  ```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+$ euroeval --model <model-id> --dataset belebele-de
+```
+
+
 ## Knowledge
 
 ### MMLU-de
--- a/docs/datasets/icelandic.md
+++ b/docs/datasets/icelandic.md
@@ -489,6 +489,70 @@ $ euroeval --model <model-id> --dataset icelandic-qa
 ```
 
 
+### Unofficial: BeleBele-is
+
+This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features multiple-choice reading comprehension questions across 122 languages.
+
+The original dataset contains 900 unique multiple-choice reading comprehension passages and questions. From these, we use a 256 / 64 / 580 split for training, validation and testing, respectively.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "text": "Texti: Í Frelsisstríðinu mynduðu ríkin þrettán veikburða ríkisstjórn – með Þjóðþingið sem eina þátt þess – skv. fyrstu stjórnarskránni. Þingið var ekki með nægar valdheimildir til að leggja á skatta, og vegna þess að ekki var neinn alríkisstjóri eða dómsvald til staðar, treysti það á yfirvöld í hverju ríki fyrir sig, sem voru oft og tíðum ósamvinnuþýð, til að framfylgja lögum þess. Það hafði heldur engar valdheimildir til að fella niður skattalög og tolla á milli ríkja. Greinarnar gerðu kröfu um samhljóða samþykki allra ríkjanna áður en hægt var að breyta þeim og ríkin sýndu ríkisvaldinu svo mikla lítilsvirðingu að fulltrúar þeirra voru oft fjarverandi.\nSpurning: Samkvæmt því sem fram kemur í kaflanum, hvaða fullyrðing á nákvæmlega við um ástand ríkisvaldsins í frelsisstríðinu?\nSvarmöguleikar:\na. Skattar voru innheimtir af þinginu og ríkisstofnunum\nb. Breytingar á stjórnarskránni þurftu samþykki þingsins\nc. Fulltrúar ríkjanna voru oft fjarverandi\nd. Hin miðlæga ríkisstjórn var mynduð í kringum tvo meginþætti",
+  "label": "c"
+}
+```
+```json
+{
+  "text": "Texti: İzmir er þriðja stærsta borg Tyrklands með um 3,7 milljónir íbúa, næststærstu höfnina á eftir Istanbúl og er mjög góð samgöngumiðstöð. Hin forna borg Smyrna er núna nútímaleg, þróuð og iðandi viðskiptamiðstöð sem staðsett er við gríðarstóran flóa og umkringd er fjöllum. Hinar breiðu breiðgötur, byggingar með framhliðum úr gleri og nútímalegar verslunarmiðstöðvar með hefðbundnum rauðum þakskífum, 18. aldar markaðurinn og gamlar moskur og kirkjur, þó að andrúmsloft borgarinnar tengist meira Miðjarðarhafssvæði Evrópu en hefðbundnu Tyrklandi.\nSpurning: Hvert eftirfarandi einkennir Izmir er frá fornri tíð?\nSvarmöguleikar:\na. Breiðar breiðgötur\nb. Byggingar með framhliðum úr gleri\nc. Verslanamiðstöðvar\nd. rauðar þakskífur",
+  "label": "d"
+}
+```
+```json
+{
+  "text": "Texti: Dæmigert fyrir það tímabil er Kirby Muxloe Castle sem er frekar víggirt hús en raunverulegur kastali. Stóru gljáðu gluggarnir og þunnu veggirnir hefðu ekki getað staðist stórárás í langan tíma. Árið 1480, þegar Hastings lávarður hóf byggingarframkvæmdirnar, ríkti friður í nánast öllu landinu og aðeins var þörf á varnarmúrum gegn litlum ræningjahópum.\nSpurning: Hvert af eftirtöldu hefði verið talið óvenjulegt við byggingu Kirby Muxloe kastala á þeim tíma sem talað er um í kaflanum?\nSvarmöguleikar:\na. Stórir gluggar\nb. Grunnur sem á að standast árásir\nc. Minna af varnarútbúnaði en í öðrum köstulum\nd. Þunnir veggir",
+  "label": "b"
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 5
+- Prefix prompt:
+  ```
+  Eftirfarandi eru fjölvalsspurningar (með svörum).
+  ```
+- Base prompt template:
+  ```
+  Spurningar: {text}
+  Svarmöguleikar:
+  a. {option_a}
+  b. {option_b}
+  c. {option_c}
+  d. {option_d}
+  Svara: {label}
+  ```
+- Instruction-tuned prompt template:
+  ```
+  Spurningar: {text}
+  Svarmöguleikar:
+  a. {option_a}
+  b. {option_b}
+  c. {option_c}
+  d. {option_d}
+
+  Svaraðu eftirfarandi spurningum með 'a', 'b', 'c' eða 'd', og engu öðru.
+  ```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+$ euroeval --model <model-id> --dataset belebele-is
+```
+
+
 ## Knowledge
 
 ### IcelandicKnowledge