EuroEval 15.2.0.tar.gz → 15.3.1.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {euroeval-15.2.0 → euroeval-15.3.1}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +1 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +1 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/.pre-commit-config.yaml +1 -1
- {euroeval-15.2.0 → euroeval-15.3.1}/CHANGELOG.md +56 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/PKG-INFO +4 -1
- {euroeval-15.2.0 → euroeval-15.3.1}/README.md +2 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/README.md +1 -1
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/datasets/danish.md +11 -10
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/datasets/icelandic.md +65 -55
- euroeval-15.3.1/docs/datasets/italian.md +577 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/datasets/norwegian.md +205 -6
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Monolingual/danish.md +2 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Monolingual/dutch.md +2 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Monolingual/english.md +2 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Monolingual/faroese.md +1 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Monolingual/french.md +2 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Monolingual/german.md +2 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Monolingual/icelandic.md +2 -2
- euroeval-15.3.1/docs/leaderboards/Monolingual/italian.md +15 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Monolingual/norwegian.md +2 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Monolingual/swedish.md +2 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Multilingual/european.md +2 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Multilingual/germanic.md +2 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Multilingual/mainland-scandinavian.md +2 -2
- euroeval-15.3.1/docs/leaderboards/Multilingual/romance.md +15 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/makefile +24 -10
- {euroeval-15.2.0 → euroeval-15.3.1}/pyproject.toml +4 -1
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/benchmark_modules/fresh.py +3 -1
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/benchmark_modules/vllm.py +6 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/benchmarker.py +10 -12
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/data_loading.py +9 -3
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/dataset_configs.py +242 -6
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/task_utils/question_answering.py +10 -7
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/task_utils/sequence_classification.py +11 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/task_utils/text_to_text.py +10 -1
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/task_utils/token_classification.py +9 -3
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/utils.py +2 -2
- euroeval-15.3.1/src/scripts/create_danish_citizen_tests.py +136 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_hellaswag.py +1 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_icelandic_knowledge.py +39 -9
- euroeval-15.3.1/src/scripts/create_ilpost_sum.py +83 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_mmlu.py +1 -0
- euroeval-15.3.1/src/scripts/create_multinerd-it.py +114 -0
- euroeval-15.3.1/src/scripts/create_no_cola.py +138 -0
- euroeval-15.3.1/src/scripts/create_nor_common_sense_qa.py +141 -0
- euroeval-15.3.1/src/scripts/create_nrk_quiz_qa.py +153 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_scala.py +2 -0
- euroeval-15.3.1/src/scripts/create_sentipolc16.py +76 -0
- euroeval-15.3.1/src/scripts/create_squad_it.py +107 -0
- euroeval-15.3.1/src/scripts/create_wikineural-it.py +109 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/fix_dot_env_file.py +4 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/load_ud_pos.py +18 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/versioning.py +1 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_benchmarker.py +4 -3
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_finetuning.py +0 -1
- {euroeval-15.2.0 → euroeval-15.3.1}/uv.lock +478 -431
- euroeval-15.2.0/gfx/euroeval-no-bg.png +0 -0
- euroeval-15.2.0/gfx/euroeval-orig.png +0 -0
- euroeval-15.2.0/src/scripts/create_danish_citizen_tests.py +0 -95
- {euroeval-15.2.0 → euroeval-15.3.1}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/.github/workflows/ci.yaml +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/.gitignore +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/CITATION.cff +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/CODE_OF_CONDUCT.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/CONTRIBUTING.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/Dockerfile.cuda +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/LICENSE +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/CNAME +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/datasets/README.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/datasets/dutch.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/datasets/english.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/datasets/faroese.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/datasets/french.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/datasets/german.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/datasets/swedish.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/extras/radial_plotter.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/faq.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/gfx/favicon.png +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/README.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/methodology.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/python-package.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/tasks/README.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/tasks/common-sense-reasoning.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/tasks/knowledge.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/tasks/linguistic-acceptability.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/tasks/named-entity-recognition.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/tasks/reading-comprehension.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/tasks/sentiment-classification.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/tasks/speed.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/tasks/summarization.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/gfx/euroeval.png +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/gfx/euroeval.xcf +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/gfx/scandeval.png +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/mkdocs.yaml +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/__init__.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/benchmark_config_factory.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/benchmark_modules/__init__.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/benchmark_modules/base.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/benchmark_modules/hf.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/benchmark_modules/litellm.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/callbacks.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/cli.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/constants.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/data_models.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/enums.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/exceptions.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/finetuning.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/generation.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/human_evaluation.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/languages.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/model_cache.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/model_config.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/model_loading.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/scores.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/speed_benchmark.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/task_utils/__init__.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/task_utils/multiple_choice_classification.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/tasks.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/types.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/constants.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_allocine.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_angry_tweets.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_arc.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_arc_is.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_belebele.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_cnn_dailymail.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_conll_en.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_conll_nl.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_dane.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_dansk.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_danske_talemaader.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_danske_talemaader_old.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_dbrd.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_dutch_cola.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_dutch_social.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_eltec.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_fone.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_foqa.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_fosent.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_fquad.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_germanquad.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_germeval.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_ice_linguistic.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_icelandic_error_corpus.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_icelandic_qa.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_icesum.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_jentoft.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_mim_gold_ner.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_mlsum.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_no_sammendrag.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_nordjylland_news.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_norec.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_norglm_multiqa.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_norglm_multisum.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_norne.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_norquad.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_nqii.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_orange_sum.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_personal_sum.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_rrn.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_sb10k.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_scandiqa.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_schibsted.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_squad.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_squad_nl.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_squad_nl_old.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_sst5.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_suc3.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_swedn.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_swerec.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_wiki_lingua_nl.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_wikiann_fo.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_winogrande_is.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/__init__.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/conftest.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_benchmark_config_factory.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_benchmark_modules/__init__.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_benchmark_modules/test_base.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_benchmark_modules/test_fresh.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_benchmark_modules/test_hf.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_benchmark_modules/test_litellm.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_benchmark_modules/test_vllm.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_callbacks.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_cli.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_constants.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_data_loading.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_data_models.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_dataset_configs.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_enums.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_exceptions.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_generation.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_human_evaluation.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_languages.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_model_cache.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_model_config.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_model_loading.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_scores.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_speed_benchmark.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_task_utils/__init__.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_task_utils/test_question_answering.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_task_utils/test_sequence_classification.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_task_utils/test_text_to_text.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_task_utils/test_token_classification.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_tasks.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_types.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_utils.py +0 -0
CHANGELOG.md
@@ -10,6 +10,62 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
 
 
 
+## [v15.3.1] - 2025-03-13
+### Fixed
+- Now handles `ConnectionError`s when loading datasets, rather than aborting evaluations.
+
+
+## [v15.3.0] - 2025-03-12
+### Added
+- Added support for evaluating Italian 🇮🇹! This includes the reading comprehension
+  dataset [SQuAD-it](https://hf.co/datasets/crux82/squad_it), the summarization
+  dataset [IlPost](https://hf.co/datasets/ARTeLab/ilpost), the sentiment classification
+  dataset [Sentipolc-16](https://hf.co/datasets/cardiffnlp/tweet_sentiment_multilingual),
+  the common-sense reasoning dataset
+  [HellaSwag-it](https://hf.co/datasets/alexandrainst/m_hellaswag), the linguistic
+  acceptability dataset ScaLA with the [Italian Universal Dependencies
+  treebank](https://github.com/UniversalDependencies/UD_Italian-ISDT), the knowledge
+  dataset [MMLU-it](https://hf.co/datasets/alexandrainst/m_mmlu), and the named entity
+  recognition dataset [MultiNERD IT](https://hf.co/datasets/Babelscape/multinerd)
+  (and unofficially [WikiNEuRal IT](https://hf.co/datasets/Babelscape/wikineural)).
+  This was contributed by [@viggo-gascou](https://github.com/viggo-gascou) ✨
+- Added the new Norwegian knowledge dataset NRK-Quiz-QA, consisting of quizzes on the
+  Norwegian language and culture, in both Bokmål and Nynorsk. The dataset has been split
+  into 635 / 256 / 2,048 samples for train, val, and test, respectively. This replaces
+  the old MMLU-no as the official Norwegian knowledge dataset.
+- Added the new Norwegian common-sense reasoning dataset NorCommonSenseQA, which is a
+  manually translated and localised version of the English CommonsenseQA dataset, in
+  both Bokmål and Nynorsk. The dataset has been split into 128 / 128 / 787 samples for
+  train, val, and test, respectively. This replaces the old HellaSwag-no as the official
+  Norwegian common-sense reasoning dataset.
+- Added the Norwegian linguistic acceptability dataset NoCoLA, which is based on the
+  annotated language learner corpus ASK. The dataset has been split into 1,024 / 256 /
+  2,048 samples, converted into a binary correct/incorrect dataset, and stratified
+  across the error categories.
+
+### Changed
+- Updated the Danish Citizen Tests dataset to include the newer 2024 tests. Further,
+  rather than splitting the dataset randomly, we include all the citizenship tests in
+  the test split, and prioritise the newer permanent residence tests in the test and
+  validation splits.
+- Changed the IcelandicKnowledge dataset to be the new official Icelandic knowledge
+  dataset, as it is more specific to Icelandic culture and history than the previous
+  machine translated ARC-is dataset. It has also been improved, as some of the generated
+  alternative answers were formatted incorrectly.
+
+### Fixed
+- A bug caused fresh encoder models to not be benchmarkable on the speed benchmark;
+  this has been fixed now.
+- Some encoder models could not be evaluated on reading comprehension if their
+  tokenizers did not subclass `PreTrainedTokenizer`. This requirement has been relaxed
+  to `PreTrainedTokenizerBase` instead.
+- Newer versions of the `transformers` package changed the model output format, causing
+  errors when evaluating encoder models on some tasks. This has been fixed now.
+- Added `setuptools` to the dependencies, as it is required for the package to be
+  installed correctly.
+
+
 ## [v15.2.0] - 2025-02-28
 ### Changed
 - Changed the name of the benchmark to `EuroEval`, to reflect the fact that the
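The v15.3.1 fix ("Now handles `ConnectionError`s when loading datasets") corresponds to the `src/euroeval/data_loading.py` change in the file list above. As a minimal sketch of such a retry pattern, using `tenacity` (already a dependency per the PKG-INFO below); EuroEval's actual retry parameters and function names are not shown in this diff and may differ:

```python
# Sketch only: illustrates retrying dataset loading on transient network
# errors instead of aborting; the real implementation may differ.
from datasets import load_dataset
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


@retry(
    retry=retry_if_exception_type(ConnectionError),  # only retry network drops
    stop=stop_after_attempt(5),                      # give up after 5 tries
    wait=wait_exponential(multiplier=1, max=10),     # back off between tries
    reraise=True,                                    # surface the final error
)
def load_benchmark_dataset(dataset_id: str):
    """Load a Hugging Face dataset, retrying transient ConnectionErrors."""
    return load_dataset(dataset_id)
```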
PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: EuroEval
-Version: 15.2.0
+Version: 15.3.1
 Summary: The robust European language model benchmark.
 Project-URL: Repository, https://github.com/EuroEval/EuroEval
 Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -49,6 +49,7 @@ Requires-Dist: sacremoses>=0.1.1
 Requires-Dist: scikit-learn<1.6.0
 Requires-Dist: sentencepiece>=0.1.96
 Requires-Dist: seqeval>=1.2.2
+Requires-Dist: setuptools>=75.8.2
 Requires-Dist: tenacity>=9.0.0
 Requires-Dist: termcolor>=2.0.0
 Requires-Dist: torch>=2.3.0
@@ -76,6 +77,8 @@ Description-Content-Type: text/markdown
 
 ### The robust European language model benchmark.
 
+_(formerly known as ScandEval)_
+
 ______________________________________________________________________
 [](https://euroeval.com)
 [](https://pypi.org/project/euroeval/)
README.md
@@ -4,6 +4,8 @@
 
 ### The robust European language model benchmark.
 
+_(formerly known as ScandEval)_
+
 ______________________________________________________________________
 [](https://euroeval.com)
 [](https://pypi.org/project/euroeval/)
docs/README.md
@@ -6,7 +6,7 @@ hide:
 #
 <div align='center'>
 <img src="https://raw.githubusercontent.com/EuroEval/EuroEval/main/gfx/euroeval.png" height="500" width="372">
-<h3>
+<h3>The robust European language model benchmark.</h3>
 </div>
 
 --------------------------
docs/datasets/danish.md
@@ -429,33 +429,34 @@ $ euroeval --model <model-id> --dataset danske-talemaader
 ### Danish Citizen Tests
 
 This dataset was created by scraping the Danish citizenship tests (indfødsretsprøven)
-and permanent residency tests (medborgerskabsprøven) from 2016 to
+and permanent residency tests (medborgerskabsprøven) from 2016 to 2024. These are
 available on the [official website of the Danish Ministry of International Recruitment
 and Integration](https://danskogproever.dk/).
 
-The original full dataset consists of
-training, validation and testing, respectively
+The original full dataset consists of 870 samples. We use a 345 / 90 / 525 split for
+training, validation and testing, respectively. Here all the citizenship tests belong to
+the test split, as well as the newest permanent residency tests. The validation split
+contains the newer permanent residency tests after the ones in the test split, and the
+training split contains the oldest permanent residency tests.
 
 Here are a few examples from the training split:
 
 ```json
 {
-  "text": "
+  "text": "Hvilket parti tilhørte Lars Løkke Rasmussen, da han var statsminister i perioderne 2009-11 og 2015-19?\nSvarmuligheder:\na. Venstre\nb. Socialdemokratiet\nc. Det Konservative Folkeparti",
   "label": "a"
 }
 ```
 ```json
 {
-  "text": "
+  "text": "Hvilket af følgende områder har kommunerne ansvaret for driften af?\nSvarmuligheder:\na. Domstole\nb. Vuggestuer\nc. Sygehuse",
   "label": "b"
-}
-```
+}```
 ```json
 {
-  "text": "
+  "text": "Hvilken organisation blev Danmark medlem af i 1945?\nSvarmuligheder:\na. Verdenshandelsorganisationen (WTO)\nb. Den Europæiske Union (EU)\nc. De Forenede Nationer (FN)",
   "label": "c"
-}
-```
+}```
 
 When evaluating generative models, we use the following setup (see the
 [methodology](/methodology) for more information on how these are used):
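The new split description above is deterministic rather than random. A minimal sketch of the selection logic it describes, assuming hypothetical `test_type` and `year` fields per sample (the real logic lives in `src/scripts/create_danish_citizen_tests.py`, which this diff rewrites, and may differ in detail):

```python
# Sketch of the described 345 / 90 / 525 split: all citizenship tests go to
# the test split, the newest residency tests fill the rest of the test split,
# the next-newest form the validation split, and the oldest form the train
# split. Field names are assumptions for illustration.
def split_samples(samples: list[dict]) -> tuple[list[dict], list[dict], list[dict]]:
    citizenship = [s for s in samples if s["test_type"] == "citizenship"]
    residency = sorted(
        [s for s in samples if s["test_type"] == "residency"],
        key=lambda s: s["year"],
        reverse=True,  # newest residency tests first
    )
    n_test_residency = 525 - len(citizenship)  # top up the test split to 525
    test = citizenship + residency[:n_test_residency]
    val = residency[n_test_residency : n_test_residency + 90]
    train = residency[n_test_residency + 90 :]
    return train, val, test
```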
docs/datasets/icelandic.md
@@ -413,7 +413,7 @@ $ euroeval --model <model-id> --dataset nqii
 ### Unofficial: IcelandicQA
 
 This dataset was published
-[here](https://huggingface.co/datasets/mideind/
+[here](https://huggingface.co/datasets/mideind/icelandic_qa_scandeval) and consists of
 an automatically created Icelandic question-answering dataset based on the Icelandic
 Wikipedia as well as Icelandic news articles from the RÚV corpus.
 
@@ -490,37 +490,53 @@ $ euroeval --model <model-id> --dataset icelandic-qa
 
 ## Knowledge
 
-###
+### IcelandicKnowledge
 
-This dataset
+This dataset was published
+[here](https://huggingface.co/datasets/mideind/icelandic_qa_scandeval) and consists of
+an automatically created Icelandic question-answering dataset based on the Icelandic
+Wikipedia as well as Icelandic news articles from the RÚV corpus.
 
-The
+The dataset was converted into a multiple-choice knowledge dataset by removing the
+contexts and using GPT-4o to generate 3 plausible wrong answers for each correct answer,
+using the following prompt for each `row` in the original dataset:
+
+```python
+messages = [
+    {
+        "role": "user",
+        "content": f"For the question: {row.question} where the correct answer is: {row.answer}, please provide 3 plausible alternatives in Icelandic. You should return the alternatives in a JSON dictionary, with keys 'first', 'second', and 'third'. The values should be the alternatives only, without any numbering or formatting. The alternatives should be unique and not contain the correct answer."
+    }
+]
+
+completion = client.beta.chat.completions.parse(
+    model="gpt-4o", messages=messages, response_format=CandidateAnswers
+)
+```
+
+where `CandidateAnswers` is a Pydantic model that is used to ensure [structured outputs](https://platform.openai.com/docs/guides/structured-outputs).
+
+The original dataset has 2,000 samples, but only 1,994 unique questions, and the total
+length of this dataset is therefore 1,994. The split is given by 842 / 128 / 1,024 for
+train, val, and test, respectively.
 
 Here are a few examples from the training split:
 
 ```json
 {
-  "text": "
+  "text": "Hver var talinn heilagur maður eftir dauða sinn, er tákngervingur alþýðuhreyfingar vestanlands og talinn góður til áheita?\nSvarmöguleikar:\na. Þórður Jónsson helgi\nb. Guðmundur Arason\nc. Snorri Þorgrímsson\nd. Jón Hreggviðsson",
   "label": "a"
-}
-```
+}```
 ```json
 {
-  "text": "
+  "text": "Í kringum hvaða ár hófst verslun á Arngerðareyri?\nSvarmöguleikar:\na. 1895\nb. 1884\nc. 1870\nd. 1902",
   "label": "b"
-}
-```
+}```
 ```json
 {
-  "text": "
+  "text": "Hvenær var ákveðið að uppstigningardagur skyldi vera kirkjudagur aldraðra á Íslandi?\nSvarmöguleikar:\na. Árið 1975\nb. Árið 1985\nc. Árið 1982\nd. Árið 1990",
   "label": "c"
-}
-```
+}```
 
 When evaluating generative models, we use the following setup (see the
 [methodology](/methodology) for more information on how these are used):
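The snippet in the hunk above references `CandidateAnswers` without defining it. A definition consistent with the keys requested in the prompt would be a Pydantic model like the following sketch; the actual model in `src/scripts/create_icelandic_knowledge.py` may differ in naming or validation:

```python
# Sketch of a structured-outputs schema matching the prompt above, which asks
# GPT-4o for a JSON dictionary with keys 'first', 'second', and 'third'.
from pydantic import BaseModel


class CandidateAnswers(BaseModel):
    """Three plausible wrong answers for one question."""

    first: str
    second: str
    third: str
```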
@@ -555,40 +571,38 @@ When evaluating generative models, we use the following setup (see the
 You can evaluate this dataset directly as follows:
 
 ```bash
-$ euroeval --model <model-id> --dataset
+$ euroeval --model <model-id> --dataset icelandic-knowledge
 ```
 
 
-### Unofficial:
+### Unofficial: ARC-is
 
-This dataset is a machine translated version of the English [
-dataset](https://
-translated using [Miðeind](https://mideind.is/english.html)'s Greynir translation model.
+This dataset is a machine translated version of the English [ARC
+dataset](https://doi.org/10.48550/arXiv.1803.05457) and features US grade-school science
+questions. The dataset was translated by Miðeind using the Claude 3.5 Sonnet model.
 
-The original full dataset consists of
-validation and testing, respectively. We use a 1,024 / 256 /
-validation and testing, respectively (so
-our validation and test sets.
+The original full dataset consists of 1,110 / 297 / 1,170 samples for training,
+validation and testing, respectively. We use a 1,024 / 256 / 1,024 split for training,
+validation and testing, respectively (so 2,304 samples used in total). All new splits
+are subsets of the original splits.
 
 Here are a few examples from the training split:
 
 ```json
 {
-  "text": "
+  "text": "Líkamar manna hafa flókna uppbyggingu sem styður vöxt og lífslíkur. Hver er grundvallaruppbygging líkamans sem stuðlar að vexti og lífslíkum?\nSvarmöguleikar:\na. fruma\nb. vefur\nc. líffæri\nd. líffærakerfi",
   "label": "a"
 }
 ```
 ```json
 {
-  "text": "Hvaða
+  "text": "Veðurfræðingur skráir gögn fyrir borg á ákveðnum degi. Gögnin innihalda hitastig, skýjahulu, vindhraða, loftþrýsting og vindátt. Hvaða aðferð ætti veðurfræðingurinn að nota til að skrá þessi gögn fyrir fljótlega tilvísun?\nSvarmöguleikar:\na. skriflega lýsingu\nb. töflu\nc. stöðvarlíkan\nd. veðurkort",
   "label": "b"
 }
 ```
 ```json
 {
-  "text": "
+  "text": "Hvaða breytingar urðu þegar reikistjörnurnar hitnuðu á meðan þær mynduðust?\nSvarmöguleikar:\na. Massi þeirra jókst.\nb. Þær töpuðu meirihluta geislavirkra samsæta sinna.\nc. Uppbygging þeirra aðgreindist í mismunandi lög.\nd. Þær byrjuðu að snúast í kringum sólina.",
   "label": "c"
 }
 ```
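Since the new ARC-is splits are stated to be subsets of the original splits, the resplitting amounts to downsampling each original split. A sketch with the `datasets` library; the repo ID, split names, seed, and sampling method are assumptions, as the diff shows only the split sizes:

```python
# Sketch: downsample each original ARC-is split to the EuroEval sizes.
from datasets import load_dataset

ARC_IS_REPO = "path/to/arc-is"  # placeholder; real repo ID not shown in the diff

ds = load_dataset(ARC_IS_REPO)
train = ds["train"].shuffle(seed=4242).select(range(1024))
val = ds["validation"].shuffle(seed=4242).select(range(256))
test = ds["test"].shuffle(seed=4242).select(range(1024))
```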
@@ -626,46 +640,41 @@ When evaluating generative models, we use the following setup (see the
 You can evaluate this dataset directly as follows:
 
 ```bash
-$ euroeval --model <model-id> --dataset
+$ euroeval --model <model-id> --dataset arc-is
 ```
 
-### Unofficial: IcelandicKnowledge
-This dataset is based on the IcelandicQA dataset, which was published [here](https://huggingface.co/datasets/mideind/icelandic_qa_euroeval), but is here phrased as a knowledge dataset. The candidate answers have been generated by GPT-4o, using the following prompt for each `row` in the original dataset:
 
-messages = [
-    {
-        "role": "user",
-        "content": f"For the question: {row.question} where the correct answer is: {row.answer}, please provide 3 plausible alternatives in Icelandic.",
-    }
-]
+### Unofficial: MMLU-is
 
-where `CandidateAnswers` is a Pydantic model that is used to ensure [structured outputs](https://platform.openai.com/docs/guides/structured-outputs).
+This dataset is a machine translated version of the English [MMLU
+dataset](https://openreview.net/forum?id=d7KBjmI3GmQ) and features questions within 57
+different topics, such as elementary mathematics, US history and law. The dataset was
+translated using [Miðeind](https://mideind.is/english.html)'s Greynir translation model.
 
-The original dataset
+The original full dataset consists of 269 / 1,410 / 13,200 samples for training,
+validation and testing, respectively. We use a 1,024 / 256 / 2,048 split for training,
+validation and testing, respectively (so 3,328 samples used in total). These splits are
+new and there can thus be some overlap between the original validation and test sets and
+our validation and test sets.
 
 Here are a few examples from the training split:
 
 ```json
 {
+  "text": "Af hverju er öruggara að horfa á tunglið en að horfa á sólina?\nSvarmöguleikar:\na. Tunglið er minna bjart.\nb. Tunglið er nær jörðinni.\nc. Tunglið skín aðallega á nóttunni.\nd. Tunglið er aðeins fullt einu sinni í mánuði.",
+  "label": "a"
 }
 ```
 ```json
 {
+  "text": "Hvaða lög jarðar eru aðallega gerð úr föstu efni?\nSvarmöguleikar:\na. innri kjarni og ytri kjarni\nb. skorpu og innri kjarni\nc. skorpu og möttli\nd. möttli og ytri kjarni",
+  "label": "b"
 }
 ```
 ```json
 {
+  "text": "Bekkur er að rannsaka þéttleika bergsýna. Hvaða vísindalegan búnað þurfa þau til að ákvarða þéttleika bergsýnanna?\nSvarmöguleikar:\na. smásjá og vog\nb. bikar og mæliglös\nc. mæliglös og vog\nd. smásjá og mæliglös",
+  "label": "c"
 }
 ```
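In contrast to ARC-is, the MMLU-is splits are drawn anew, which is why the page above notes possible overlap with the original validation and test sets. A sketch of that pooling-and-resplitting approach; the repo ID, config name, and seed are assumptions:

```python
# Sketch: pool all original MMLU-is splits, shuffle, and cut new splits.
# Because the pool mixes the original splits, the new val/test sets can
# overlap the original val/test sets, as noted above.
from datasets import concatenate_datasets, load_dataset

ds = load_dataset("alexandrainst/m_mmlu", "is")  # config name is an assumption
pooled = concatenate_datasets([ds[split] for split in ds]).shuffle(seed=4242)

train = pooled.select(range(1024))
val = pooled.select(range(1024, 1024 + 256))
test = pooled.select(range(1024 + 256, 1024 + 256 + 2048))
```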
@@ -702,9 +711,10 @@ When evaluating generative models, we use the following setup (see the
 You can evaluate this dataset directly as follows:
 
 ```bash
-$ euroeval --model <model-id> --dataset
+$ euroeval --model <model-id> --dataset mmlu-is
 ```
 
+
 ## Common-sense Reasoning
 
 ### Winogrande-is