EuroEval 15.2.0.tar.gz → 15.3.1.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {euroeval-15.2.0 → euroeval-15.3.1}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +1 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +1 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/.pre-commit-config.yaml +1 -1
- {euroeval-15.2.0 → euroeval-15.3.1}/CHANGELOG.md +56 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/PKG-INFO +4 -1
- {euroeval-15.2.0 → euroeval-15.3.1}/README.md +2 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/README.md +1 -1
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/datasets/danish.md +11 -10
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/datasets/icelandic.md +65 -55
- euroeval-15.3.1/docs/datasets/italian.md +577 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/datasets/norwegian.md +205 -6
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Monolingual/danish.md +2 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Monolingual/dutch.md +2 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Monolingual/english.md +2 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Monolingual/faroese.md +1 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Monolingual/french.md +2 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Monolingual/german.md +2 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Monolingual/icelandic.md +2 -2
- euroeval-15.3.1/docs/leaderboards/Monolingual/italian.md +15 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Monolingual/norwegian.md +2 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Monolingual/swedish.md +2 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Multilingual/european.md +2 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Multilingual/germanic.md +2 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/Multilingual/mainland-scandinavian.md +2 -2
- euroeval-15.3.1/docs/leaderboards/Multilingual/romance.md +15 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/makefile +24 -10
- {euroeval-15.2.0 → euroeval-15.3.1}/pyproject.toml +4 -1
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/benchmark_modules/fresh.py +3 -1
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/benchmark_modules/vllm.py +6 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/benchmarker.py +10 -12
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/data_loading.py +9 -3
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/dataset_configs.py +242 -6
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/task_utils/question_answering.py +10 -7
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/task_utils/sequence_classification.py +11 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/task_utils/text_to_text.py +10 -1
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/task_utils/token_classification.py +9 -3
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/utils.py +2 -2
- euroeval-15.3.1/src/scripts/create_danish_citizen_tests.py +136 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_hellaswag.py +1 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_icelandic_knowledge.py +39 -9
- euroeval-15.3.1/src/scripts/create_ilpost_sum.py +83 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_mmlu.py +1 -0
- euroeval-15.3.1/src/scripts/create_multinerd-it.py +114 -0
- euroeval-15.3.1/src/scripts/create_no_cola.py +138 -0
- euroeval-15.3.1/src/scripts/create_nor_common_sense_qa.py +141 -0
- euroeval-15.3.1/src/scripts/create_nrk_quiz_qa.py +153 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_scala.py +2 -0
- euroeval-15.3.1/src/scripts/create_sentipolc16.py +76 -0
- euroeval-15.3.1/src/scripts/create_squad_it.py +107 -0
- euroeval-15.3.1/src/scripts/create_wikineural-it.py +109 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/fix_dot_env_file.py +4 -2
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/load_ud_pos.py +18 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/versioning.py +1 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_benchmarker.py +4 -3
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_finetuning.py +0 -1
- {euroeval-15.2.0 → euroeval-15.3.1}/uv.lock +478 -431
- euroeval-15.2.0/gfx/euroeval-no-bg.png +0 -0
- euroeval-15.2.0/gfx/euroeval-orig.png +0 -0
- euroeval-15.2.0/src/scripts/create_danish_citizen_tests.py +0 -95
- {euroeval-15.2.0 → euroeval-15.3.1}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/.github/workflows/ci.yaml +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/.gitignore +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/CITATION.cff +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/CODE_OF_CONDUCT.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/CONTRIBUTING.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/Dockerfile.cuda +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/LICENSE +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/CNAME +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/datasets/README.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/datasets/dutch.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/datasets/english.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/datasets/faroese.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/datasets/french.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/datasets/german.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/datasets/swedish.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/extras/radial_plotter.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/faq.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/gfx/favicon.png +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/leaderboards/README.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/methodology.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/python-package.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/tasks/README.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/tasks/common-sense-reasoning.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/tasks/knowledge.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/tasks/linguistic-acceptability.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/tasks/named-entity-recognition.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/tasks/reading-comprehension.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/tasks/sentiment-classification.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/tasks/speed.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/docs/tasks/summarization.md +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/gfx/euroeval.png +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/gfx/euroeval.xcf +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/gfx/scandeval.png +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/mkdocs.yaml +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/__init__.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/benchmark_config_factory.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/benchmark_modules/__init__.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/benchmark_modules/base.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/benchmark_modules/hf.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/benchmark_modules/litellm.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/callbacks.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/cli.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/constants.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/data_models.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/enums.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/exceptions.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/finetuning.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/generation.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/human_evaluation.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/languages.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/model_cache.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/model_config.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/model_loading.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/scores.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/speed_benchmark.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/task_utils/__init__.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/task_utils/multiple_choice_classification.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/tasks.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/euroeval/types.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/constants.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_allocine.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_angry_tweets.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_arc.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_arc_is.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_belebele.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_cnn_dailymail.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_conll_en.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_conll_nl.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_dane.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_dansk.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_danske_talemaader.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_danske_talemaader_old.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_dbrd.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_dutch_cola.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_dutch_social.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_eltec.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_fone.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_foqa.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_fosent.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_fquad.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_germanquad.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_germeval.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_ice_linguistic.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_icelandic_error_corpus.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_icelandic_qa.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_icesum.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_jentoft.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_mim_gold_ner.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_mlsum.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_no_sammendrag.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_nordjylland_news.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_norec.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_norglm_multiqa.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_norglm_multisum.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_norne.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_norquad.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_nqii.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_orange_sum.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_personal_sum.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_rrn.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_sb10k.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_scandiqa.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_schibsted.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_squad.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_squad_nl.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_squad_nl_old.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_sst5.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_suc3.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_swedn.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_swerec.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_wiki_lingua_nl.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_wikiann_fo.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/src/scripts/create_winogrande_is.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/__init__.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/conftest.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_benchmark_config_factory.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_benchmark_modules/__init__.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_benchmark_modules/test_base.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_benchmark_modules/test_fresh.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_benchmark_modules/test_hf.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_benchmark_modules/test_litellm.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_benchmark_modules/test_vllm.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_callbacks.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_cli.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_constants.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_data_loading.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_data_models.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_dataset_configs.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_enums.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_exceptions.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_generation.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_human_evaluation.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_languages.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_model_cache.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_model_config.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_model_loading.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_scores.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_speed_benchmark.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_task_utils/__init__.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_task_utils/test_question_answering.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_task_utils/test_sequence_classification.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_task_utils/test_text_to_text.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_task_utils/test_token_classification.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_tasks.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_types.py +0 -0
- {euroeval-15.2.0 → euroeval-15.3.1}/tests/test_utils.py +0 -0
CHANGELOG.md
@@ -10,6 +10,62 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).
 
 
 
+## [v15.3.1] - 2025-03-13
+### Fixed
+- Now handles `ConnectionError`s when loading datasets, rather than aborting evaluations.
+
+
+## [v15.3.0] - 2025-03-12
+### Added
+- Added support for evaluating Italian 🇮🇹! This includes the reading comprehension
+  dataset [SQuAD-it](https://hf.co/datasets/crux82/squad_it), the summarization
+  dataset [IlPost](https://hf.co/datasets/ARTeLab/ilpost), the sentiment classification
+  dataset [Sentipolc-16](https://hf.co/datasets/cardiffnlp/tweet_sentiment_multilingual),
+  the common-sense reasoning dataset
+  [HellaSwag-it](https://hf.co/datasets/alexandrainst/m_hellaswag), the linguistic
+  acceptability dataset ScaLA with the [Italian Universal Dependencies
+  treebank](https://github.com/UniversalDependencies/UD_Italian-ISDT), the knowledge
+  dataset [MMLU-it](https://hf.co/datasets/alexandrainst/m_mmlu), and the named entity
+  recognition dataset [MultiNERD IT](https://hf.co/datasets/Babelscape/multinerd)
+  (and unofficially [WikiNEuRal IT](https://hf.co/datasets/Babelscape/wikineural)).
+  This was contributed by [@viggo-gascou](https://github.com/viggo-gascou) ✨
+- Added the new Norwegian knowledge dataset NRK-Quiz-QA, consisting of quizzes on the
+  Norwegian language and culture, in both Bokmål and Nynorsk. The dataset has been split
+  into 635 / 256 / 2,048 samples for train, val, and test, respectively. This replaces
+  the old MMLU-no as the official Norwegian knowledge dataset.
+- Added the new Norwegian common-sense reasoning dataset NorCommonSenseQA, which is a
+  manually translated and localised version of the English CommonsenseQA dataset, in
+  both Bokmål and Nynorsk. The dataset has been split into 128 / 128 / 787 samples for
+  train, val, and test, respectively. This replaces the old HellaSwag-no as the official
+  Norwegian common-sense reasoning dataset.
+- Added the Norwegian linguistic acceptability dataset NoCoLA, which is based on the
+  annotated language learner corpus ASK. The dataset has been split into 1,024 / 256 /
+  2,048 samples, converted into a binary correct/incorrect dataset, and stratified
+  across the error categories.
+
+### Changed
+- Updated the Danish Citizen Tests dataset to include the newer 2024 tests. Further,
+  rather than splitting the dataset randomly, we include all the citizenship tests in
+  the test split, and prioritise the newer permanent residence tests in the test and
+  validation splits.
+- Changed the IcelandicKnowledge dataset to be the new official Icelandic knowledge
+  dataset, as it is more specific to Icelandic culture and history than the previous
+  machine translated ARC-is dataset. It has also been improved, as some of the generated
+  alternative answers were formatted incorrectly.
+
+### Fixed
+- A bug caused fresh encoder models to not be benchmarkable on the speed benchmark;
+  this has been fixed now.
+- Some encoder models could not be evaluated on reading comprehension if their
+  tokenizers did not subclass `PreTrainedTokenizer`. This requirement has been relaxed
+  to `PreTrainedTokenizerBase` instead.
+- Newer versions of the `transformers` package changed the model output format, causing
+  errors when evaluating encoder models on some tasks. This has been fixed now.
+- Added `setuptools` to the dependencies, as it is required for the package to be
+  installed correctly.
+
+
 ## [v15.2.0] - 2025-02-28
 ### Changed
 - Changed the name of the benchmark to `EuroEval`, to reflect the fact that the
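The v15.3.1 fix ("Now handles `ConnectionError`s when loading datasets") corresponds to the `src/euroeval/data_loading.py` change in the file list above. As a minimal sketch of such a retry pattern, using `tenacity` (already a dependency per the PKG-INFO below); EuroEval's actual retry parameters and function names are not shown in this diff and may differ:

```python
# Sketch only: illustrates retrying dataset loading on transient network
# errors instead of aborting; the real implementation may differ.
from datasets import load_dataset
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential


@retry(
    retry=retry_if_exception_type(ConnectionError),  # only retry network drops
    stop=stop_after_attempt(5),                      # give up after 5 tries
    wait=wait_exponential(multiplier=1, max=10),     # back off between tries
    reraise=True,                                    # surface the final error
)
def load_benchmark_dataset(dataset_id: str):
    """Load a Hugging Face dataset, retrying transient ConnectionErrors."""
    return load_dataset(dataset_id)
```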
PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: EuroEval
-Version: 15.2.0
+Version: 15.3.1
 Summary: The robust European language model benchmark.
 Project-URL: Repository, https://github.com/EuroEval/EuroEval
 Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -49,6 +49,7 @@ Requires-Dist: sacremoses>=0.1.1
 Requires-Dist: scikit-learn<1.6.0
 Requires-Dist: sentencepiece>=0.1.96
 Requires-Dist: seqeval>=1.2.2
+Requires-Dist: setuptools>=75.8.2
 Requires-Dist: tenacity>=9.0.0
 Requires-Dist: termcolor>=2.0.0
 Requires-Dist: torch>=2.3.0
@@ -76,6 +77,8 @@ Description-Content-Type: text/markdown
 
 ### The robust European language model benchmark.
 
+_(formerly known as ScandEval)_
+
 ______________________________________________________________________
 [](https://euroeval.com)
 [](https://pypi.org/project/euroeval/)
README.md
@@ -4,6 +4,8 @@
 
 ### The robust European language model benchmark.
 
+_(formerly known as ScandEval)_
+
 ______________________________________________________________________
 [](https://euroeval.com)
 [](https://pypi.org/project/euroeval/)
docs/README.md
@@ -6,7 +6,7 @@ hide:
 #
 <div align='center'>
 <img src="https://raw.githubusercontent.com/EuroEval/EuroEval/main/gfx/euroeval.png" height="500" width="372">
-<h3>
+<h3>The robust European language model benchmark.</h3>
 </div>
 
 --------------------------
docs/datasets/danish.md
@@ -429,33 +429,34 @@ $ euroeval --model <model-id> --dataset danske-talemaader
 ### Danish Citizen Tests
 
 This dataset was created by scraping the Danish citizenship tests (indfødsretsprøven)
-and permanent residency tests (medborgerskabsprøven) from 2016 to
+and permanent residency tests (medborgerskabsprøven) from 2016 to 2024. These are
 available on the [official website of the Danish Ministry of International Recruitment
 and Integration](https://danskogproever.dk/).
 
-The original full dataset consists of
-training, validation and testing, respectively
+The original full dataset consists of 870 samples. We use a 345 / 90 / 525 split for
+training, validation and testing, respectively. Here all the citizenship tests belong to
+the test split, as well as the newest permanent residency tests. The validation split
+contains the newer permanent residency tests after the ones in the test split, and the
+training split contains the oldest permanent residency tests.
 
 Here are a few examples from the training split:
 
 ```json
 {
-  "text": "
+  "text": "Hvilket parti tilhørte Lars Løkke Rasmussen, da han var statsminister i perioderne 2009-11 og 2015-19?\nSvarmuligheder:\na. Venstre\nb. Socialdemokratiet\nc. Det Konservative Folkeparti",
   "label": "a"
 }
 ```
 ```json
 {
-  "text": "
+  "text": "Hvilket af følgende områder har kommunerne ansvaret for driften af?\nSvarmuligheder:\na. Domstole\nb. Vuggestuer\nc. Sygehuse",
   "label": "b"
-}
-```
+}```
 ```json
 {
-  "text": "
+  "text": "Hvilken organisation blev Danmark medlem af i 1945?\nSvarmuligheder:\na. Verdenshandelsorganisationen (WTO)\nb. Den Europæiske Union (EU)\nc. De Forenede Nationer (FN)",
   "label": "c"
-}
-```
+}```
 
 When evaluating generative models, we use the following setup (see the
 [methodology](/methodology) for more information on how these are used):
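The new split description above is deterministic rather than random. A minimal sketch of the selection logic it describes, assuming hypothetical `test_type` and `year` fields per sample (the real logic lives in `src/scripts/create_danish_citizen_tests.py`, which this diff rewrites, and may differ in detail):

```python
# Sketch of the described 345 / 90 / 525 split: all citizenship tests go to
# the test split, the newest residency tests fill the rest of the test split,
# the next-newest form the validation split, and the oldest form the train
# split. Field names are assumptions for illustration.
def split_samples(samples: list[dict]) -> tuple[list[dict], list[dict], list[dict]]:
    citizenship = [s for s in samples if s["test_type"] == "citizenship"]
    residency = sorted(
        [s for s in samples if s["test_type"] == "residency"],
        key=lambda s: s["year"],
        reverse=True,  # newest residency tests first
    )
    n_test_residency = 525 - len(citizenship)  # top up the test split to 525
    test = citizenship + residency[:n_test_residency]
    val = residency[n_test_residency : n_test_residency + 90]
    train = residency[n_test_residency + 90 :]
    return train, val, test
```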
docs/datasets/icelandic.md
@@ -413,7 +413,7 @@ $ euroeval --model <model-id> --dataset nqii
 ### Unofficial: IcelandicQA
 
 This dataset was published
-[here](https://huggingface.co/datasets/mideind/
+[here](https://huggingface.co/datasets/mideind/icelandic_qa_scandeval) and consists of
 an automatically created Icelandic question-answering dataset based on the Icelandic
 Wikipedia as well as Icelandic news articles from the RÚV corpus.
 
@@ -490,37 +490,53 @@ $ euroeval --model <model-id> --dataset icelandic-qa
 
 ## Knowledge
 
-###
+### IcelandicKnowledge
 
-This dataset
+This dataset was published
+[here](https://huggingface.co/datasets/mideind/icelandic_qa_scandeval) and consists of
+an automatically created Icelandic question-answering dataset based on the Icelandic
+Wikipedia as well as Icelandic news articles from the RÚV corpus.
 
-The
+The dataset was converted into a multiple-choice knowledge dataset by removing the
+contexts and using GPT-4o to generate 3 plausible wrong answers for each correct answer,
+using the following prompt for each `row` in the original dataset:
+
+```python
+messages = [
+    {
+        "role": "user",
+        "content": f"For the question: {row.question} where the correct answer is: {row.answer}, please provide 3 plausible alternatives in Icelandic. You should return the alternatives in a JSON dictionary, with keys 'first', 'second', and 'third'. The values should be the alternatives only, without any numbering or formatting. The alternatives should be unique and not contain the correct answer."
+    }
+]
+
+completion = client.beta.chat.completions.parse(
+    model="gpt-4o", messages=messages, response_format=CandidateAnswers
+)
+```
+
+where `CandidateAnswers` is a Pydantic model that is used to ensure [structured outputs](https://platform.openai.com/docs/guides/structured-outputs).
+
+The original dataset has 2,000 samples, but only 1,994 unique questions, and the total
+length of this dataset is therefore 1,994. The split is given by 842 / 128 / 1,024 for
+train, val, and test, respectively.
 
 Here are a few examples from the training split:
 
 ```json
 {
-  "text": "
+  "text": "Hver var talinn heilagur maður eftir dauða sinn, er tákngervingur alþýðuhreyfingar vestanlands og talinn góður til áheita?\nSvarmöguleikar:\na. Þórður Jónsson helgi\nb. Guðmundur Arason\nc. Snorri Þorgrímsson\nd. Jón Hreggviðsson",
   "label": "a"
-}
-```
+}```
 ```json
 {
-  "text": "
+  "text": "Í kringum hvaða ár hófst verslun á Arngerðareyri?\nSvarmöguleikar:\na. 1895\nb. 1884\nc. 1870\nd. 1902",
   "label": "b"
-}
-```
+}```
 ```json
 {
-  "text": "
+  "text": "Hvenær var ákveðið að uppstigningardagur skyldi vera kirkjudagur aldraðra á Íslandi?\nSvarmöguleikar:\na. Árið 1975\nb. Árið 1985\nc. Árið 1982\nd. Árið 1990",
   "label": "c"
-}
-```
+}```
 
 When evaluating generative models, we use the following setup (see the
 [methodology](/methodology) for more information on how these are used):
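The snippet in the hunk above references `CandidateAnswers` without defining it. A definition consistent with the keys requested in the prompt would be a Pydantic model like the following sketch; the actual model in `src/scripts/create_icelandic_knowledge.py` may differ in naming or validation:

```python
# Sketch of a structured-outputs schema matching the prompt above, which asks
# GPT-4o for a JSON dictionary with keys 'first', 'second', and 'third'.
from pydantic import BaseModel


class CandidateAnswers(BaseModel):
    """Three plausible wrong answers for one question."""

    first: str
    second: str
    third: str
```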
@@ -555,40 +571,38 @@ When evaluating generative models, we use the following setup (see the
 You can evaluate this dataset directly as follows:
 
 ```bash
-$ euroeval --model <model-id> --dataset
+$ euroeval --model <model-id> --dataset icelandic-knowledge
 ```
 
 
-### Unofficial:
+### Unofficial: ARC-is
 
-This dataset is a machine translated version of the English [
-dataset](https://
-translated using [Miðeind](https://mideind.is/english.html)'s Greynir translation model.
+This dataset is a machine translated version of the English [ARC
+dataset](https://doi.org/10.48550/arXiv.1803.05457) and features US grade-school science
+questions. The dataset was translated by Miðeind using the Claude 3.5 Sonnet model.
 
-The original full dataset consists of
-validation and testing, respectively. We use a 1,024 / 256 /
-validation and testing, respectively (so
-our validation and test sets.
+The original full dataset consists of 1,110 / 297 / 1,170 samples for training,
+validation and testing, respectively. We use a 1,024 / 256 / 1,024 split for training,
+validation and testing, respectively (so 2,304 samples used in total). All new splits
+are subsets of the original splits.
 
 Here are a few examples from the training split:
 
 ```json
 {
-  "text": "
+  "text": "Líkamar manna hafa flókna uppbyggingu sem styður vöxt og lífslíkur. Hver er grundvallaruppbygging líkamans sem stuðlar að vexti og lífslíkum?\nSvarmöguleikar:\na. fruma\nb. vefur\nc. líffæri\nd. líffærakerfi",
   "label": "a"
 }
 ```
 ```json
 {
-  "text": "Hvaða
+  "text": "Veðurfræðingur skráir gögn fyrir borg á ákveðnum degi. Gögnin innihalda hitastig, skýjahulu, vindhraða, loftþrýsting og vindátt. Hvaða aðferð ætti veðurfræðingurinn að nota til að skrá þessi gögn fyrir fljótlega tilvísun?\nSvarmöguleikar:\na. skriflega lýsingu\nb. töflu\nc. stöðvarlíkan\nd. veðurkort",
   "label": "b"
 }
 ```
 ```json
 {
-  "text": "
+  "text": "Hvaða breytingar urðu þegar reikistjörnurnar hitnuðu á meðan þær mynduðust?\nSvarmöguleikar:\na. Massi þeirra jókst.\nb. Þær töpuðu meirihluta geislavirkra samsæta sinna.\nc. Uppbygging þeirra aðgreindist í mismunandi lög.\nd. Þær byrjuðu að snúast í kringum sólina.",
   "label": "c"
 }
 ```
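Since the new ARC-is splits are stated to be subsets of the original splits, the resplitting amounts to downsampling each original split. A sketch with the `datasets` library; the repo ID, split names, seed, and sampling method are assumptions, as the diff shows only the split sizes:

```python
# Sketch: downsample each original ARC-is split to the EuroEval sizes.
from datasets import load_dataset

ARC_IS_REPO = "path/to/arc-is"  # placeholder; real repo ID not shown in the diff

ds = load_dataset(ARC_IS_REPO)
train = ds["train"].shuffle(seed=4242).select(range(1024))
val = ds["validation"].shuffle(seed=4242).select(range(256))
test = ds["test"].shuffle(seed=4242).select(range(1024))
```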
@@ -626,46 +640,41 @@ When evaluating generative models, we use the following setup (see the
 You can evaluate this dataset directly as follows:
 
 ```bash
-$ euroeval --model <model-id> --dataset
+$ euroeval --model <model-id> --dataset arc-is
 ```
 
-### Unofficial: IcelandicKnowledge
-This dataset is based on the IcelandicQA dataset, which was published [here](https://huggingface.co/datasets/mideind/icelandic_qa_euroeval), but is here phrased as a knowledge dataset. The candidate answers have been generated by GPT-4o, using the following prompt for each `row` in the original dataset:
 
-messages = [
-    {
-        "role": "user",
-        "content": f"For the question: {row.question} where the correct answer is: {row.answer}, please provide 3 plausible alternatives in Icelandic.",
-    }
-]
+### Unofficial: MMLU-is
 
-where `CandidateAnswers` is a Pydantic model that is used to ensure [structured outputs](https://platform.openai.com/docs/guides/structured-outputs).
+This dataset is a machine translated version of the English [MMLU
+dataset](https://openreview.net/forum?id=d7KBjmI3GmQ) and features questions within 57
+different topics, such as elementary mathematics, US history and law. The dataset was
+translated using [Miðeind](https://mideind.is/english.html)'s Greynir translation model.
 
-The original dataset
+The original full dataset consists of 269 / 1,410 / 13,200 samples for training,
+validation and testing, respectively. We use a 1,024 / 256 / 2,048 split for training,
+validation and testing, respectively (so 3,328 samples used in total). These splits are
+new and there can thus be some overlap between the original validation and test sets and
+our validation and test sets.
 
 Here are a few examples from the training split:
 
 ```json
 {
+  "text": "Af hverju er öruggara að horfa á tunglið en að horfa á sólina?\nSvarmöguleikar:\na. Tunglið er minna bjart.\nb. Tunglið er nær jörðinni.\nc. Tunglið skín aðallega á nóttunni.\nd. Tunglið er aðeins fullt einu sinni í mánuði.",
+  "label": "a"
 }
 ```
 ```json
 {
+  "text": "Hvaða lög jarðar eru aðallega gerð úr föstu efni?\nSvarmöguleikar:\na. innri kjarni og ytri kjarni\nb. skorpu og innri kjarni\nc. skorpu og möttli\nd. möttli og ytri kjarni",
+  "label": "b"
 }
 ```
 ```json
 {
+  "text": "Bekkur er að rannsaka þéttleika bergsýna. Hvaða vísindalegan búnað þurfa þau til að ákvarða þéttleika bergsýnanna?\nSvarmöguleikar:\na. smásjá og vog\nb. bikar og mæliglös\nc. mæliglös og vog\nd. smásjá og mæliglös",
+  "label": "c"
 }
 ```
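In contrast to ARC-is, the MMLU-is splits are drawn anew, which is why the page above notes possible overlap with the original validation and test sets. A sketch of that pooling-and-resplitting approach; the repo ID, config name, and seed are assumptions:

```python
# Sketch: pool all original MMLU-is splits, shuffle, and cut new splits.
# Because the pool mixes the original splits, the new val/test sets can
# overlap the original val/test sets, as noted above.
from datasets import concatenate_datasets, load_dataset

ds = load_dataset("alexandrainst/m_mmlu", "is")  # config name is an assumption
pooled = concatenate_datasets([ds[split] for split in ds]).shuffle(seed=4242)

train = pooled.select(range(1024))
val = pooled.select(range(1024, 1024 + 256))
test = pooled.select(range(1024 + 256, 1024 + 256 + 2048))
```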
@@ -702,9 +711,10 @@ When evaluating generative models, we use the following setup (see the
 You can evaluate this dataset directly as follows:
 
 ```bash
-$ euroeval --model <model-id> --dataset
+$ euroeval --model <model-id> --dataset mmlu-is
 ```
 
+
 ## Common-sense Reasoning
 
 ### Winogrande-is