EuroEval 15.10.1.tar.gz → 15.11.0.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {euroeval-15.10.1 → euroeval-15.11.0}/.pre-commit-config.yaml +1 -1
- {euroeval-15.10.1 → euroeval-15.11.0}/CHANGELOG.md +29 -1
- {euroeval-15.10.1 → euroeval-15.11.0}/CITATION.cff +3 -3
- {euroeval-15.10.1 → euroeval-15.11.0}/LICENSE +1 -1
- {euroeval-15.10.1 → euroeval-15.11.0}/PKG-INFO +10 -10
- {euroeval-15.10.1 → euroeval-15.11.0}/README.md +5 -6
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/README.md +1 -1
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/dutch.md +5 -2
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/english.md +79 -3
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/finnish.md +49 -16
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/french.md +13 -9
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/german.md +8 -5
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/icelandic.md +10 -6
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/italian.md +5 -2
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/norwegian.md +90 -11
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/spanish.md +42 -19
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/swedish.md +8 -5
- euroeval-15.11.0/docs/leaderboards/Monolingual/danish.md +23 -0
- euroeval-15.11.0/docs/leaderboards/Monolingual/dutch.md +23 -0
- euroeval-15.11.0/docs/leaderboards/Monolingual/english.md +23 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/leaderboards/Monolingual/faroese.md +4 -0
- euroeval-15.11.0/docs/leaderboards/Monolingual/finnish.md +23 -0
- euroeval-15.11.0/docs/leaderboards/Monolingual/french.md +23 -0
- euroeval-15.11.0/docs/leaderboards/Monolingual/german.md +23 -0
- euroeval-15.11.0/docs/leaderboards/Monolingual/icelandic.md +23 -0
- euroeval-15.11.0/docs/leaderboards/Monolingual/italian.md +23 -0
- euroeval-15.11.0/docs/leaderboards/Monolingual/norwegian.md +23 -0
- euroeval-15.11.0/docs/leaderboards/Monolingual/spanish.md +23 -0
- euroeval-15.11.0/docs/leaderboards/Monolingual/swedish.md +23 -0
- euroeval-15.11.0/docs/leaderboards/Multilingual/european.md +23 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/leaderboards/Multilingual/germanic.md +8 -0
- euroeval-15.11.0/docs/leaderboards/Multilingual/mainland-scandinavian.md +23 -0
- euroeval-15.11.0/docs/leaderboards/Multilingual/romance.md +23 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/leaderboards/README.md +8 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/pyproject.toml +4 -3
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/__init__.py +7 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/benchmark_modules/base.py +29 -29
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/benchmark_modules/fresh.py +31 -19
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/benchmark_modules/hf.py +27 -23
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/benchmark_modules/litellm.py +50 -30
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/benchmark_modules/vllm.py +21 -25
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/benchmarker.py +1 -1
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/callbacks.py +17 -13
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/data_loading.py +10 -5
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/data_models.py +2 -40
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/english.py +13 -4
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/norwegian.py +8 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/finetuning.py +9 -8
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/generation.py +5 -4
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/generation_utils.py +1 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/human_evaluation.py +13 -13
- euroeval-15.11.0/src/euroeval/metrics.py +452 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/scores.py +14 -19
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/speed_benchmark.py +6 -7
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/task_group_utils/multiple_choice_classification.py +6 -4
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/task_group_utils/question_answering.py +5 -28
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/task_group_utils/sequence_classification.py +6 -30
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/task_group_utils/text_to_text.py +19 -34
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/task_group_utils/token_classification.py +18 -30
- euroeval-15.11.0/src/euroeval/tasks.py +131 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/types.py +6 -4
- euroeval-15.11.0/src/scripts/create_idioms_no.py +254 -0
- euroeval-15.11.0/src/scripts/create_life_in_the_uk.py +145 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/conftest.py +4 -3
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_data_models.py +17 -16
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_scores.py +15 -21
- {euroeval-15.10.1 → euroeval-15.11.0}/uv.lock +5 -1
- euroeval-15.10.1/docs/leaderboards/Monolingual/danish.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Monolingual/dutch.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Monolingual/english.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Monolingual/finnish.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Monolingual/french.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Monolingual/german.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Monolingual/icelandic.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Monolingual/italian.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Monolingual/norwegian.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Monolingual/spanish.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Monolingual/swedish.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Multilingual/european.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -15
- euroeval-15.10.1/docs/leaderboards/Multilingual/romance.md +0 -15
- euroeval-15.10.1/src/euroeval/tasks.py +0 -256
- {euroeval-15.10.1 → euroeval-15.11.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/.github/workflows/ci.yaml +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/.gitignore +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/CODE_OF_CONDUCT.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/CONTRIBUTING.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/Dockerfile.cuda +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/NEW_DATASET_GUIDE.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/CNAME +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/README.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/danish.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/datasets/faroese.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/extras/radial_plotter.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/faq.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/gfx/favicon.png +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/methodology.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/python-package.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/tasks/README.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/tasks/common-sense-reasoning.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/tasks/knowledge.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/tasks/linguistic-acceptability.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/tasks/named-entity-recognition.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/tasks/reading-comprehension.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/tasks/sentiment-classification.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/tasks/speed.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/docs/tasks/summarization.md +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/gfx/euroeval.png +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/gfx/euroeval.xcf +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/gfx/scandeval.png +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/makefile +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/mkdocs.yaml +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/benchmark_config_factory.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/benchmark_modules/__init__.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/cli.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/constants.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/__init__.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/danish.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/dutch.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/faroese.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/finnish.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/french.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/german.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/icelandic.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/italian.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/spanish.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/dataset_configs/swedish.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/enums.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/exceptions.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/languages.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/model_cache.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/model_config.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/model_loading.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/prompt_templates/__init__.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/prompt_templates/linguistic_acceptability.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/prompt_templates/multiple_choice.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/prompt_templates/named_entity_recognition.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/prompt_templates/reading_comprehension.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/prompt_templates/sentiment_classification.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/prompt_templates/summarization.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/task_group_utils/__init__.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/tokenization_utils.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/euroeval/utils.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/constants.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_allocine.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_angry_tweets.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_arc.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_arc_is.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_belebele.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_cnn_dailymail.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_conll_en.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_conll_es.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_conll_nl.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_dane.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_danish_citizen_tests.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_dansk.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_danske_talemaader.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_danske_talemaader_old.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_dbrd.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_dutch_cola.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_eltec.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_fone.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_foqa.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_fosent.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_fquad.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_germanquad.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_germeval.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_hellaswag.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_hellaswag_fi.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_ice_linguistic.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_icelandic_error_corpus.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_icelandic_knowledge.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_icelandic_qa.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_icesum.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_ilpost_sum.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_jentoft.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_mim_gold_ner.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_mlqa_es.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_mlsum_de.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_mlsum_es.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_mmlu.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_multinerd-it.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_no_cola.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_no_sammendrag.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_nor_common_sense_qa.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_nordjylland_news.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_norec.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_norglm_multiqa.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_norglm_multisum.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_norne.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_norquad.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_nqii.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_nrk_quiz_qa.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_orange_sum.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_personal_sum.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_rrn.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_sb10k.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_scala.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_scandiqa.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_scandisent_fi.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_schibsted.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_sentiment_headlines_es.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_sentipolc16.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_squad.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_squad_it.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_squad_nl.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_squad_nl_old.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_sst5.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_suc3.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_swedn.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_swerec.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_turku_ner_fi.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_tydiqa_fi.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_wiki_lingua_nl.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_wikiann_fo.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_wikineural-it.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_winogrande_is.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_xlsum_fi.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/create_xquad_es.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/fix_dot_env_file.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/load_ud_pos.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/src/scripts/versioning.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/__init__.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_benchmark_config_factory.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_benchmark_modules/__init__.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_benchmark_modules/test_base.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_benchmark_modules/test_fresh.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_benchmark_modules/test_hf.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_benchmark_modules/test_litellm.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_benchmark_modules/test_vllm.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_benchmarker.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_callbacks.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_cli.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_constants.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_data_loading.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_dataset_configs.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_enums.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_exceptions.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_finetuning.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_generation.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_human_evaluation.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_languages.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_model_cache.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_model_config.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_model_loading.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_speed_benchmark.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_task_utils/__init__.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_task_utils/test_question_answering.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_task_utils/test_sequence_classification.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_task_utils/test_text_to_text.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_task_utils/test_token_classification.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_tasks.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_tokenization_utils.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_types.py +0 -0
- {euroeval-15.10.1 → euroeval-15.11.0}/tests/test_utils.py +0 -0
@@ -10,8 +10,36 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.



+## [v15.11.0] - 2025-07-15
+### Added
+- Added the English knowledge dataset Life in the UK, which has been added as an
+  official dataset, replacing the existing English knowledge dataset MMLU, which in turn
+  has been marked as unofficial now. This was contributed by
+  [@oliverkinch](https://github.com/oliverkinch) ✨
+- Added the Norwegian knowledge dataset Idioms-no, which is a multiple-choice question
+  dataset where the alternative answers have been generated using GPT-4o. This has been
+  added as an official dataset, and was contributed by
+  [@oliverkinch](https://github.com/oliverkinch) ✨
+- Added new `LLMAsAJudgeMetric`, which allows evaluating the performance of a model with
+  another judge model. This is useful for evaluating models in a reference-free manner,
+  or if the metric is sufficiently complex. It is currently not used in any task, but
+  the functionality is there for future use.
+- Add `no-thinking` and `thinking` options for Gemini-2.5-flash and
+  Gemini-2.5-flash-lite, which allows disabling and enabling the reasoning mode for
+  these models, respectively. Note that the former model has reasoning enabled by
+  default and the latter has it disabled by default (see the defaults in the [Gemini-2.5
+  docs](https://ai.google.dev/gemini-api/docs/thinking#set-budget)).
+
+### Fixed
+- Evaluating freshly initialised encoder models on multiple-choice classification tasks
+  caused an error, as the id-to-label mapping was not set up correctly. This has been
+  fixed now.
+- Now dynamically lowers the maximum amount of reasoning tokens for LiteLLM models if
+  they do not support the full 32,768 tokens.
+
+
 ## [v15.10.1] - 2025-06-20
-###
+### Fixed
 - Fixed an issue when benchmarking encoder models on reading comprehension tasks, where
   we sometimes would truncate the model outputs when they should not have been.

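The changelog above introduces `LLMAsAJudgeMetric` (the new `src/euroeval/metrics.py` module is added in this release). The snippet below is a rough, hypothetical sketch of the general LLM-as-a-judge pattern using LiteLLM, which EuroEval already depends on; the helper name, prompt and 1-to-5 scoring scheme are illustrative assumptions, not the actual EuroEval API.

```python
# Hypothetical sketch of the LLM-as-a-judge pattern; not the EuroEval API.
import re

import litellm


def judge_score(prediction: str, instruction: str, judge_model: str = "gpt-4o") -> float:
    """Ask a judge model to rate a prediction from 1 to 5 and return a score in [0, 1]."""
    prompt = (
        "You are grading a model answer.\n"
        f"Instruction: {instruction}\n"
        f"Answer: {prediction}\n"
        "Rate the answer from 1 (useless) to 5 (excellent). Reply with the number only."
    )
    response = litellm.completion(
        model=judge_model, messages=[{"role": "user", "content": prompt}]
    )
    content = response.choices[0].message.content or ""
    match = re.search(r"[1-5]", content)
    # Fall back to the lowest rating if the judge reply cannot be parsed.
    rating = int(match.group()) if match else 1
    return (rating - 1) / 4
```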
@@ -4,8 +4,8 @@ message: If you use this software, please cite it using the metadata from this f
 type: software
 authors:
   - given-names: Dan Saattrup
-    family-names:
-    email: dan.
+    family-names: Smart
+    email: dan.smart@alexandra.dk
     affiliation: Alexandra Institute
     orcid: 'https://orcid.org/0000-0001-9227-1470'
 identifiers:

@@ -22,7 +22,7 @@ license: MIT
 preferred-citation:
   type: conference-paper
   authors:
-    - family-names: "
+    - family-names: "Smart"
      given-names: "Dan Saattrup"
      orcid: https://orcid.org/0000-0001-9227-1470
  collection-title: "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)"

@@ -1,14 +1,14 @@
 Metadata-Version: 2.4
 Name: EuroEval
-Version: 15.
+Version: 15.11.0
 Summary: The robust European language model benchmark.
 Project-URL: Repository, https://github.com/EuroEval/EuroEval
 Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
-Author-email: Dan Saattrup
-Maintainer-email: Dan Saattrup
+Author-email: Dan Saattrup Smart <dan.smart@alexandra.dk>
+Maintainer-email: Dan Saattrup Smart <dan.smart@alexandra.dk>
 License: MIT License

-Copyright (c) 2022-
+Copyright (c) 2022-2025 Dan Saattrup Smart

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

@@ -43,6 +43,7 @@ Requires-Dist: numpy<2.0.0,>=1.23.0
 Requires-Dist: ollama>=0.5.1
 Requires-Dist: pandas>=2.2.0
 Requires-Dist: peft>=0.15.0
+Requires-Dist: protobuf>=2.0.0
 Requires-Dist: pydantic>=2.6.0
 Requires-Dist: pyinfer>=0.0.3
 Requires-Dist: python-dotenv>=1.0.1

@@ -94,8 +95,7 @@ ______________________________________________________________________

 ## Maintainer

-- Dan Saattrup
-  dan.nielsen@alexandra.dk)
+- Dan Saattrup Smart ([@saattrupdan](https://github.com/saattrupdan), dan.smart@alexandra.dk)


 ## Installation

@@ -268,14 +268,14 @@ contributing new datasets, your help makes this project better for everyone.
 If you want to cite the framework then feel free to use this:

 ```
-@article{
+@article{smart2024encoder,
   title={Encoder vs Decoder: Comparative Analysis of Encoder and Decoder Language Models on Multilingual NLU Tasks},
-  author={
+  author={Smart, Dan Saattrup and Enevoldsen, Kenneth and Schneider-Kamp, Peter},
   journal={arXiv preprint arXiv:2406.13469},
   year={2024}
 }
-@inproceedings{
-  author = {
+@inproceedings{smart2023scandeval,
+  author = {Smart, Dan Saattrup},
   booktitle = {Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)},
   month = may,
   pages = {185--201},

@@ -19,8 +19,7 @@ ______________________________________________________________________

 ## Maintainer

-- Dan Saattrup
-  dan.nielsen@alexandra.dk)
+- Dan Saattrup Smart ([@saattrupdan](https://github.com/saattrupdan), dan.smart@alexandra.dk)


 ## Installation

@@ -193,14 +192,14 @@ contributing new datasets, your help makes this project better for everyone.
 If you want to cite the framework then feel free to use this:

 ```
-@article{
+@article{smart2024encoder,
   title={Encoder vs Decoder: Comparative Analysis of Encoder and Decoder Language Models on Multilingual NLU Tasks},
-  author={
+  author={Smart, Dan Saattrup and Enevoldsen, Kenneth and Schneider-Kamp, Peter},
   journal={arXiv preprint arXiv:2406.13469},
   year={2024}
 }
-@inproceedings{
-  author = {
+@inproceedings{smart2023scandeval,
+  author = {Smart, Dan Saattrup},
   booktitle = {Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)},
   month = may,
   pages = {185--201},

@@ -31,6 +31,6 @@ The idea of EuroEval grew out of the development of Danish language model RøBÆ
 models. It started as a hobby project including Danish, Swedish and Norwegian, but has
 since grown to include 12+ European languages.

-EuroEval is maintained by [Dan Saattrup
+EuroEval is maintained by [Dan Saattrup Smart](https://www.saattrupdan.com/) from the
 [Alexandra Institute](https://alexandra.dk), and is funded by the EU project
 [TrustLLM](https://trustllm.eu/).

@@ -325,9 +325,12 @@ $ euroeval --model <model-id> --dataset squad-nl

 ### Unofficial: BeleBele-nl

-This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+and features multiple-choice reading comprehension questions across 122 languages.

-The original dataset contains 900 unique multiple-choice reading comprehension passages
+The original dataset contains 900 unique multiple-choice reading comprehension passages
+and questions. From these, we use a 256 / 64 / 580 split for training, validation and
+testing, respectively.

 Here are a few examples from the training split:

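The 256 / 64 / 580 split quoted in the BeleBele entries covers all 900 samples (256 + 64 + 580 = 900). Below is a hedged sketch of how such a split could be reproduced with the Hugging Face `datasets` library; the `facebook/belebele` dataset id, config name and seed are assumptions for illustration, not the EuroEval preparation code.

```python
# Illustrative sketch of a 256 / 64 / 580 split of 900 samples; the dataset id,
# config name and seed are assumptions, not the actual EuroEval script.
from datasets import load_dataset

belebele = load_dataset("facebook/belebele", "nld_Latn", split="test")  # 900 samples

# First carve out the 580-sample test split, then split the remaining 320 into 256 / 64.
first = belebele.train_test_split(test_size=580, seed=4242)
second = first["train"].train_test_split(test_size=64, seed=4242)

train, val, test = second["train"], second["test"], first["test"]
print(len(train), len(val), len(test))  # 256 64 580
```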
@@ -297,9 +297,13 @@ $ euroeval --model <model-id> --dataset squad

 ### Unofficial: BeleBele-en

-This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+and features reading comprehension questions across 122 languages. The dataset was
+created by professional translators who translated 900 multiple-choice questions from
+English into other languages, with answers carefully validated by native speakers.

-The original dataset consists of 900 samples, and we use 256 / 64 / 580 samples for
+The original dataset consists of 900 samples, and we use 256 / 64 / 580 samples for
+training, validation and testing, respectively.

 Here are a few examples from the training split:

@@ -354,7 +358,79 @@ $ euroeval --model <model-id> --dataset belebele-en

 ## Knowledge

-###
+### Life in the UK
+
+This dataset was published
+[here](https://huggingface.co/datasets/oliverkinch/life-in-the-uk-multiple-choice) and was
+scraped from [lifeintheuktestweb.co.uk](https://lifeintheuktestweb.co.uk/test-1/) and
+contains multiple choice questions about UK history, culture, and citizenship
+requirements. The website was created to help people pass the Life in the UK Test for UK
+citizenship.
+
+The original dataset consists of 1,450 samples. After processing (removing questions
+with overly short or long texts, repetitive content, and true/false questions), we have
+1,206 samples remaining. From these, we use 438 / 256 / 512 samples for our training,
+validation and test splits, respectively.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "text": "What is the capital of the United Kingdom?\nChoices:\na. London\nb. Manchester\nc. Birmingham\nd. Edinburgh",
+  "label": "a"
+}
+```
+```json
+{
+  "text": "What TWO houses were confronted during the Wars of the Roses?\nChoices:\na. The House of Lancaster\nb. The House of Leicester\nc. The House of Canterbury\nd. The House of York",
+  "label": "a"
+}
+```
+```json
+{
+  "text": "What is the name of the War Memorial located in Whitehall?\nChoices:\na. Dumfries\nb. Cenotaph\nc. Royal Crescent\nd. The White Tower",
+  "label": "b"
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 5
+- Prefix prompt:
+  ```
+  The following are multiple choice questions (with answers).
+  ```
+- Base prompt template:
+  ```
+  Question: {text}
+  Options:
+  a. {option_a}
+  b. {option_b}
+  c. {option_c}
+  d. {option_d}
+  Answer: {label}
+  ```
+- Instruction-tuned prompt template:
+  ```
+  Question: {text}
+  Options:
+  a. {option_a}
+  b. {option_b}
+  c. {option_c}
+  d. {option_d}
+
+  Answer the above question by replying with 'a', 'b', 'c' or 'd', and nothing else.
+  ```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+$ euroeval --model <model-id> --dataset life-in-the-uk
+```
+
+
+### Unofficial: MMLU

 This dataset was published [in this paper](https://doi.org/10.48550/arXiv.2009.03300)
 and features questions within 57 different topics, such as elementary mathematics, US

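The processing described for Life in the UK (dropping overly short or long questions, repetitive content and true/false questions) is implemented in the new `src/scripts/create_life_in_the_uk.py`. The sketch below only illustrates that kind of filtering with the `datasets` library; the column names and length thresholds are assumptions, not the actual script.

```python
# Rough sketch of the described filtering; the thresholds and column names are
# illustrative assumptions, not the actual create script.
from datasets import load_dataset

ds = load_dataset("oliverkinch/life-in-the-uk-multiple-choice", split="train")


def keep(sample: dict) -> bool:
    question = sample["question"]  # assumed column name
    options = sample["options"]  # assumed column name
    # Drop overly short or long questions.
    if not (20 <= len(question) <= 300):
        return False
    # Drop true/false style questions, which only offer two answer options.
    if len(options) < 4:
        return False
    return True


filtered = ds.filter(keep)
print(f"{len(ds)} -> {len(filtered)} samples after filtering")
```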
@@ -8,9 +8,13 @@ information about what these constitute.

 ### ScandiSent-fi

-This dataset consists of reviews from Trustpilot and was published
+This dataset consists of reviews from Trustpilot and was published
+[here](https://aclanthology.org/2021.nodalida-main.42/). It is a binary sentiment
+classification dataset, with labels "positive" and "negative".

-For the Finnish part of the dataset, there are 10,000 training samples. From these
+For the Finnish part of the dataset, there are 10,000 training samples. From these
+samples, we have created a 1,024 / 256 / 2,048 split for the train, validation and test
+splits, respectively.

 Here are a few examples from the training split:

@@ -67,9 +71,14 @@ $ euroeval --model <model-id> --dataset scandisent-fi

 ### Turku-NER-fi

-This dataset was published in [this paper](https://aclanthology.org/2020.lrec-1.567/).
+This dataset was published in [this paper](https://aclanthology.org/2020.lrec-1.567/).
+The dataset is a manually annotated corpus built on the Universal Dependencies Finnish
+corpus. The corpus was created by the Turku NLP group.

-The original dataset contains 12,217 / 1,364 / 1,555 samples for the training,
+The original dataset contains 12,217 / 1,364 / 1,555 samples for the training,
+validation and test splits, respectively. We use 1,024 / 256 / 2,048 samples for our
+training, validation and test splits, respectively. All the new splits are subsets of
+the original splits.

 Here are a few examples from the training split:

@@ -141,9 +150,9 @@ word from a sentence, or by swapping two neighbouring words in a sentence. To en
 that this does indeed break the grammaticality of the sentence, a set of rules were used
 on the part-of-speech tags of the words in the sentence.

-The original dataset consists of 15,136 samples, from which we use 1,024 / 256 / 2,048
-validation and testing, respectively (so 3,328 samples used in
-used as-is in the framework.
+The original dataset consists of 15,136 samples, from which we use 1,024 / 256 / 2,048
+samples for training, validation and testing, respectively (so 3,328 samples used in
+total). These splits are used as-is in the framework.

 Here are a few examples from the training split:

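The ScaLA corruption described above (deleting a word or swapping two neighbouring words) can be illustrated with a toy function; the real datasets additionally apply part-of-speech rules to guarantee that the corrupted sentence is actually ungrammatical, which this sketch omits.

```python
# Toy illustration of the ScaLA corruption idea; the real construction also checks
# part-of-speech tags to make sure the corrupted sentence is ungrammatical.
import random


def corrupt(sentence: str, seed: int = 0) -> str:
    rng = random.Random(seed)
    words = sentence.split()
    if rng.random() < 0.5 and len(words) > 1:
        # Remove a random word.
        del words[rng.randrange(len(words))]
    elif len(words) > 1:
        # Swap two neighbouring words.
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


print(corrupt("Tämä lause on täysin kieliopillinen"))
```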
@@ -199,9 +208,20 @@ $ euroeval --model <model-id> --dataset scala-fi
 ## Reading Comprehension

 ### TydiQA-fi
-This question-answering dataset was published in [this
+This question-answering dataset was published in [this
+paper](https://aclanthology.org/2020.tacl-1.30/). TydiQA is a multilingual dataset
+covering 11 typologically diverse languages with 204K question-answer pairs collected
+from native speakers genuinely seeking information. It was designed to evaluate models
+across languages with varied linguistic features and contains questions written directly
+in each language without translation.
+
+The original Finnish TydiQA dataset contains 6,855 training and 782 validation samples
+(we use the [secondary task
+subset](https://huggingface.co/datasets/google-research-datasets/tydiqa/viewer/secondary_task?views%5B%5D=secondary_task_train)).
+We created a 1,024 / 256 / 2,024 split, where the samples from the train and validation
+split are sampled from the original train and validation splits, respectively. The test
+set consists of the remaining samples from the original validation split + additional
+samples from the original train split.

 Here are a few examples from the training split:

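The TydiQA-fi split construction described above (1,024 / 256 / 2,024, with the test set topped up from leftover validation and train samples) could be sketched as follows with the `datasets` library; the language filtering by id prefix, the seeds and the exact sampling are assumptions, not the actual EuroEval script.

```python
# Hedged sketch of the described split construction; the id-prefix filtering,
# seeds and sampling order are assumptions, not the actual preparation script.
from datasets import concatenate_datasets, load_dataset

tydiqa = load_dataset("google-research-datasets/tydiqa", "secondary_task")
# Assumes the sample ids are prefixed with the language name.
fi_train = tydiqa["train"].filter(lambda x: x["id"].startswith("finnish"))
fi_val = tydiqa["validation"].filter(lambda x: x["id"].startswith("finnish"))

train = fi_train.shuffle(seed=4242).select(range(1_024))
val = fi_val.shuffle(seed=4242).select(range(256))

# The test split takes the remaining validation samples plus extra train samples.
val_rest = fi_val.shuffle(seed=4242).select(range(256, len(fi_val)))
extra_needed = 2_024 - len(val_rest)
train_rest = fi_train.shuffle(seed=4242).select(range(1_024, 1_024 + extra_needed))
test = concatenate_datasets([val_rest, train_rest])
print(len(train), len(val), len(test))  # 1024 256 2024
```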
@@ -268,9 +288,12 @@ $ euroeval --model <model-id> --dataset tydiqa-fi

 ### Unofficial: BeleBele-fi

-This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+and features multiple-choice reading comprehension questions across 122 languages.

-The original dataset contains 900 unique multiple-choice reading comprehension passages
+The original dataset contains 900 unique multiple-choice reading comprehension passages
+and questions. From these, we use a 256 / 64 / 580 split for training, validation and
+testing, respectively.

 Here are a few examples from the training split:

@@ -335,8 +358,11 @@ $ euroeval --model <model-id> --dataset belebele-fi
 ### HellaSwag-fi

 This dataset is a machine translated version of the English [HellaSwag
-dataset](https://aclanthology.org/P19-1472/). The
+dataset](https://aclanthology.org/P19-1472/). The
+[dataset](https://huggingface.co/datasets/Finnish-NLP/hellaswag-fi-google-translate) was
+created by Finnish-NLP using Google Translate. The dataset is designed to be used in
+EuroEval and it therefore already has a 1,024 / 256 / 2,048 split for the train,
+validation and test splits, respectively.

 Here are a few examples from the training split:

@@ -400,9 +426,16 @@ $ euroeval --model <model-id> --dataset hellaswag-fi

 ### XLSum-fi

-This dataset is a machine translation of the XL-Sum dataset, which was published in
+This dataset is a machine translation of the XL-Sum dataset, which was published in
+[this paper](https://aclanthology.org/2021.findings-acl.413/).
+[TurkuNLP](https://huggingface.co/datasets/TurkuNLP) has translated the dataset to
+Finnish using DeepL.

-The original Finnish XL-Sum dataset contains 54,966 / 1,803 / 1,791 training, validation
+The original Finnish XL-Sum dataset contains 54,966 / 1,803 / 1,791 training, validation
+and test samples, respectively. We use 1,024 / 256 / 2,048 samples for our training,
+validation and test splits, respectively. The new training and validation splits are
+subsets of the original splits. The test split is the same as the original test split +
+additional samples from the original validation split.

 Here are a few examples from the training split:

@@ -11,10 +11,11 @@ information about what these constitute.

 This dataset was published in [this Github
 repository](https://github.com/TheophileBlard/french-sentiment-analysis-with-bert) and
-features reviews from the French movie review website
-0.5 to 5 (inclusive), with
-of
-in between were
+features reviews from the French movie review website
+[AlloCiné](https://www.allocine.fr/). The reviews range from 0.5 to 5 (inclusive), with
+steps of 0.5. The negative samples are reviews with a rating of at most 2, and the
+positive ones are reviews with a rating of at least 4. The reviews in between were
+discarded.

 The original full dataset consists of 160,000 / 20,000 / 20,000 samples for training,
 validation, and testing, respectively. We use 1,024 / 256 / 2,048 samples for training,

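The AlloCiné labelling rule above (ratings of at most 2 are negative, at least 4 are positive, anything in between is discarded) boils down to a small mapping; the `rating` field name in this sketch is an assumption for illustration.

```python
# Illustrative mapping of AlloCiné review ratings to sentiment labels; the
# field name "rating" is an assumption for the sake of the example.
def rating_to_label(rating: float) -> str | None:
    if rating <= 2:
        return "negative"
    if rating >= 4:
        return "positive"
    return None  # Reviews between 2 and 4 are discarded.


assert rating_to_label(1.5) == "negative"
assert rating_to_label(4.5) == "positive"
assert rating_to_label(3.0) is None
```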
@@ -163,9 +164,9 @@ word from a sentence, or by swapping two neighbouring words in a sentence. To en
 that this does indeed break the grammaticality of the sentence, a set of rules were used
 on the part-of-speech tags of the words in the sentence.

-The original dataset consists of 16,342 samples, from which we use 1,024 / 256 / 2,048
-validation and testing, respectively (so 3,328 samples used in
-used as-is in the framework.
+The original dataset consists of 16,342 samples, from which we use 1,024 / 256 / 2,048
+samples for training, validation and testing, respectively (so 3,328 samples used in
+total). These splits are used as-is in the framework.

 Here are a few examples from the training split:

@@ -298,9 +299,12 @@ $ euroeval --model <model-id> --dataset fquad

 ### Unofficial: BeleBele-fr

-This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+and features multiple-choice reading comprehension questions across 122 languages.

-The original dataset contains 900 unique multiple-choice reading comprehension passages
+The original dataset contains 900 unique multiple-choice reading comprehension passages
+and questions. From these, we use a 256 / 64 / 580 split for training, validation and
+testing, respectively.

 Here are a few examples from the training split:

@@ -153,9 +153,9 @@ word from a sentence, or by swapping two neighbouring words in a sentence. To en
 that this does indeed break the grammaticality of the sentence, a set of rules were used
 on the part-of-speech tags of the words in the sentence.

-The original dataset consists of 15,590 samples, from which we use 1,024 / 256 / 2,048
-validation and testing, respectively (so 3,328 samples used in
-used as-is in the framework.
+The original dataset consists of 15,590 samples, from which we use 1,024 / 256 / 2,048
+samples for training, validation and testing, respectively (so 3,328 samples used in
+total). These splits are used as-is in the framework.

 Here are a few examples from the training split:

@@ -286,9 +286,12 @@ $ euroeval --model <model-id> --dataset germanquad

 ### Unofficial: BeleBele-de

-This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+and features multiple-choice reading comprehension questions across 122 languages.

-The original dataset contains 900 unique multiple-choice reading comprehension passages
+The original dataset contains 900 unique multiple-choice reading comprehension passages
+and questions. From these, we use a 256 / 64 / 580 split for training, validation and
+testing, respectively.

 Here are a few examples from the training split:

@@ -155,9 +155,9 @@ from a sentence, or by swapping two neighbouring words in a sentence. To ensure
 this does indeed break the grammaticality of the sentence, a set of rules were used on
 the part-of-speech tags of the words in the sentence.

-The original dataset consists of 3,535 samples, from which we use 1,024 / 256 / 2,048
-validation and testing, respectively (so 3,328 samples used in
-used as-is in the framework.
+The original dataset consists of 3,535 samples, from which we use 1,024 / 256 / 2,048
+samples for training, validation and testing, respectively (so 3,328 samples used in
+total). These splits are used as-is in the framework.

 Here are a few examples from the training split:

@@ -491,9 +491,12 @@ $ euroeval --model <model-id> --dataset icelandic-qa

 ### Unofficial: BeleBele-is

-This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+and features multiple-choice reading comprehension questions across 122 languages.

-The original dataset contains 900 unique multiple-choice reading comprehension passages
+The original dataset contains 900 unique multiple-choice reading comprehension passages
+and questions. From these, we use a 256 / 64 / 580 split for training, validation and
+testing, respectively.

 Here are a few examples from the training split:

@@ -579,7 +582,8 @@ completion = client.beta.chat.completions.parse(
 )
 ```

-where `CandidateAnswers` is a Pydantic model that is used to ensure [structured
+where `CandidateAnswers` is a Pydantic model that is used to ensure [structured
+outputs](https://platform.openai.com/docs/guides/structured-outputs).

 The original dataset has 2,000 samples, but only 1,994 unique questions, and the total
 length of this dataset is therefore 1,994. The split is given by 842 / 128 / 1024 for

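The `CandidateAnswers` model referenced above constrains the generating model's reply to a fixed JSON schema via OpenAI's structured outputs. A minimal sketch of how such a Pydantic model is typically passed to `client.beta.chat.completions.parse` is shown below; the field name and prompt are assumptions, not the actual EuroEval dataset script.

```python
# Minimal structured-outputs sketch; the field name and prompt are assumptions.
from openai import OpenAI
from pydantic import BaseModel


class CandidateAnswers(BaseModel):
    """Schema the model must follow when proposing alternative answer options."""

    candidate_answers: list[str]


client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Suggest three plausible but wrong answers to: ..."}
    ],
    response_format=CandidateAnswers,
)
parsed = completion.choices[0].message.parsed  # -> CandidateAnswers instance
```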
@@ -373,9 +373,12 @@ $ euroeval --model <model-id> --dataset squad-it

 ### Unofficial: BeleBele-it

-This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+and features multiple-choice reading comprehension questions across 122 languages.

-The original dataset contains 900 unique multiple-choice reading comprehension passages
+The original dataset contains 900 unique multiple-choice reading comprehension passages
+and questions. From these, we use a 256 / 64 / 580 split for training, validation and
+testing, respectively.

 Here are a few examples from the training split:
