EuroEval 16.0.1.tar.gz → 16.1.0.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release.
This version of EuroEval might be problematic.
- {euroeval-16.0.1 → euroeval-16.1.0}/.gitignore +3 -1
- {euroeval-16.0.1 → euroeval-16.1.0}/.pre-commit-config.yaml +1 -1
- {euroeval-16.0.1 → euroeval-16.1.0}/CHANGELOG.md +48 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/PKG-INFO +1 -1
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/datasets/danish.md +83 -7
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/datasets/dutch.md +81 -8
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/datasets/english.md +138 -3
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/datasets/estonian.md +83 -10
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/datasets/faroese.md +3 -2
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/datasets/finnish.md +78 -5
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/datasets/french.md +78 -5
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/datasets/german.md +139 -3
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/datasets/icelandic.md +5 -4
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/datasets/italian.md +78 -5
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/datasets/latvian.md +97 -10
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/datasets/norwegian.md +68 -3
- euroeval-16.1.0/docs/datasets/polish.md +640 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/datasets/portuguese.md +68 -3
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/datasets/spanish.md +68 -3
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/datasets/swedish.md +132 -3
- euroeval-16.1.0/docs/leaderboards/Monolingual/estonian.md +23 -0
- euroeval-16.1.0/docs/leaderboards/Multilingual/finnic.md +23 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/leaderboards/Multilingual/romance.md +1 -1
- euroeval-16.1.0/generated_contracts/employment_contract_001.md +137 -0
- euroeval-16.1.0/generated_contracts/employment_contract_002.md +152 -0
- euroeval-16.1.0/generated_contracts/employment_contract_003.md +144 -0
- euroeval-16.1.0/generated_contracts/employment_contract_004.md +139 -0
- euroeval-16.1.0/generated_contracts/employment_contract_005.md +146 -0
- euroeval-16.1.0/generated_contracts/employment_contract_006.md +127 -0
- euroeval-16.1.0/generated_contracts/employment_contract_007.md +147 -0
- euroeval-16.1.0/generated_contracts/employment_contract_008.md +136 -0
- euroeval-16.1.0/generated_contracts/employment_contract_009.md +143 -0
- euroeval-16.1.0/generated_contracts/employment_contract_010.md +148 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/pyproject.toml +1 -1
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/benchmark_config_factory.py +6 -1
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/benchmark_modules/base.py +2 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/benchmark_modules/fresh.py +7 -1
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/benchmark_modules/hf.py +26 -21
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/benchmark_modules/litellm.py +258 -131
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/benchmark_modules/vllm.py +79 -40
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/benchmarker.py +11 -2
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/cli.py +14 -1
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/constants.py +1 -1
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/data_models.py +77 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/dataset_configs/__init__.py +1 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/dataset_configs/danish.py +14 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/dataset_configs/dutch.py +14 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/dataset_configs/english.py +22 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/dataset_configs/estonian.py +15 -7
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/dataset_configs/finnish.py +14 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/dataset_configs/french.py +14 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/dataset_configs/german.py +23 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/dataset_configs/italian.py +14 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/dataset_configs/latvian.py +14 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/dataset_configs/norwegian.py +14 -0
- euroeval-16.1.0/src/euroeval/dataset_configs/polish.py +126 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/dataset_configs/portuguese.py +14 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/dataset_configs/spanish.py +14 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/dataset_configs/swedish.py +25 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/enums.py +12 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/generation.py +17 -8
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/generation_utils.py +58 -10
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/metrics/pipeline.py +1 -1
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/prompt_templates/linguistic_acceptability.py +9 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/prompt_templates/multiple_choice.py +27 -1
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/prompt_templates/named_entity_recognition.py +20 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/prompt_templates/reading_comprehension.py +11 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/prompt_templates/sentiment_classification.py +15 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/prompt_templates/summarization.py +27 -1
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/scores.py +5 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/task_group_utils/question_answering.py +29 -29
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/task_group_utils/sequence_classification.py +10 -33
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/task_group_utils/token_classification.py +3 -3
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/tasks.py +4 -4
- euroeval-16.0.1/src/euroeval/tokenization_utils.py → euroeval-16.1.0/src/euroeval/tokenisation_utils.py +40 -23
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/utils.py +36 -3
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/constants.py +20 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_allocine.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_arc.py +9 -26
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_arc_is.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_belebele.py +4 -21
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_boolq_pt.py +1 -5
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_cnn_dailymail.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_conll_en.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_conll_es.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_conll_nl.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_copa_lv.py +3 -7
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_dane.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_danish_citizen_tests.py +3 -7
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_dansk.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_danske_talemaader.py +3 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_danske_talemaader_old.py +3 -7
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_dbrd.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_dutch_cola.py +2 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_eltec.py +1 -5
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_err_news.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_estner.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_estonian_valence.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_european_values.py +24 -20
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_exam_et.py +2 -1
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_fone.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_foqa.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_fosent.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_fquad.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_fullstack_ner.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_germanquad.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_germeval.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_goldenswag.py +4 -19
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_grammar_et.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_harem.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_hellaswag.py +6 -22
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_hellaswag_fi.py +3 -7
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_hotter_and_colder_sentiment.py +1 -5
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_ice_linguistic.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_icelandic_error_corpus.py +2 -7
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_icelandic_knowledge.py +8 -8
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_icelandic_qa.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_icesum.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_idioms_no.py +3 -7
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_ilpost_sum.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_jentoft.py +1 -6
- euroeval-16.1.0/src/scripts/create_kpwr_ner.py +140 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_latvian_lsm_summary.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_latvian_twitter_sentiment.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_life_in_the_uk.py +3 -7
- euroeval-16.1.0/src/scripts/create_llmzszl.py +153 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_mlqa_es.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_mlsum_de.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_mlsum_es.py +2 -11
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_mmlu.py +6 -23
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_mmlu_lv.py +3 -7
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_multi_wiki_qa.py +4 -8
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_multinerd-it.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_no_cola.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_no_sammendrag.py +1 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_nor_common_sense_qa.py +3 -7
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_nordjylland_news.py +3 -12
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_norglm_multiqa.py +2 -11
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_norglm_multisum.py +2 -11
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_norne.py +2 -11
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_norquad.py +3 -12
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_nqii.py +3 -12
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_nrk_quiz_qa.py +4 -12
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_orange_sum.py +3 -12
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_personal_sum.py +6 -12
- euroeval-16.1.0/src/scripts/create_polemo2.py +130 -0
- euroeval-16.1.0/src/scripts/create_poquad.py +109 -0
- euroeval-16.1.0/src/scripts/create_psc.py +85 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_publico.py +2 -10
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_rrn.py +3 -12
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_sb10k.py +2 -11
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_scala.py +4 -11
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_scandiqa.py +3 -12
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_scandisent_fi.py +2 -11
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_schibsted.py +1 -8
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_sentiment_headlines_es.py +2 -11
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_sentipolc16.py +2 -11
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_squad.py +3 -12
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_squad_it.py +3 -12
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_squad_nl.py +3 -12
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_squad_nl_old.py +3 -12
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_sst2_pt.py +2 -6
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_sst5.py +2 -11
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_suc3.py +2 -11
- euroeval-16.1.0/src/scripts/create_swedish_skolprov.py +167 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_swedn.py +3 -12
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_swerec.py +2 -11
- euroeval-16.1.0/src/scripts/create_trivia_et.py +70 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_turku_ner_fi.py +3 -10
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_tydiqa_fi.py +3 -12
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_wiki_lingua_nl.py +3 -12
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_wikiann_lv.py +2 -11
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_wikineural-it.py +2 -11
- euroeval-16.1.0/src/scripts/create_winogrande.py +156 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_winogrande_et.py +6 -8
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_winogrande_is.py +4 -12
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_xlsum_fi.py +3 -12
- euroeval-16.1.0/src/scripts/create_xquad.py +73 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/load_ud_pos.py +18 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/tests/conftest.py +2 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/tests/test_benchmark_config_factory.py +1 -1
- {euroeval-16.0.1 → euroeval-16.1.0}/tests/test_benchmarker.py +1 -1
- {euroeval-16.0.1 → euroeval-16.1.0}/tests/test_callbacks.py +1 -1
- {euroeval-16.0.1 → euroeval-16.1.0}/tests/test_cli.py +3 -1
- {euroeval-16.0.1 → euroeval-16.1.0}/tests/test_data_loading.py +6 -1
- {euroeval-16.0.1 → euroeval-16.1.0}/tests/test_data_models.py +3 -3
- {euroeval-16.0.1 → euroeval-16.1.0}/tests/test_dataset_configs.py +3 -3
- {euroeval-16.0.1 → euroeval-16.1.0}/tests/test_exceptions.py +1 -1
- {euroeval-16.0.1 → euroeval-16.1.0}/tests/test_finetuning.py +0 -12
- {euroeval-16.0.1 → euroeval-16.1.0}/tests/test_languages.py +2 -2
- {euroeval-16.0.1 → euroeval-16.1.0}/tests/test_model_loading.py +1 -1
- {euroeval-16.0.1 → euroeval-16.1.0}/tests/test_scores.py +4 -3
- {euroeval-16.0.1 → euroeval-16.1.0}/tests/test_speed_benchmark.py +2 -2
- {euroeval-16.0.1 → euroeval-16.1.0}/tests/test_tasks.py +2 -2
- euroeval-16.0.1/tests/test_tokenization_utils.py → euroeval-16.1.0/tests/test_tokenisation_utils.py +5 -3
- {euroeval-16.0.1 → euroeval-16.1.0}/tests/test_types.py +1 -1
- {euroeval-16.0.1 → euroeval-16.1.0}/tests/test_utils.py +41 -3
- {euroeval-16.0.1 → euroeval-16.1.0}/uv.lock +1 -3
- euroeval-16.0.1/src/scripts/create_wikiann_fo.py +0 -1
- euroeval-16.0.1/src/scripts/create_xquad_es.py +0 -80
- euroeval-16.0.1/tests/test_benchmark_modules/test_base.py +0 -1
- euroeval-16.0.1/tests/test_benchmark_modules/test_fresh.py +0 -1
- euroeval-16.0.1/tests/test_benchmark_modules/test_litellm.py +0 -1
- euroeval-16.0.1/tests/test_benchmark_modules/test_vllm.py +0 -1
- euroeval-16.0.1/tests/test_generation.py +0 -19
- euroeval-16.0.1/tests/test_model_cache.py +0 -46
- euroeval-16.0.1/tests/test_task_utils/__init__.py +0 -1
- euroeval-16.0.1/tests/test_task_utils/test_question_answering.py +0 -1
- euroeval-16.0.1/tests/test_task_utils/test_sequence_classification.py +0 -1
- euroeval-16.0.1/tests/test_task_utils/test_text_to_text.py +0 -1
- euroeval-16.0.1/tests/test_task_utils/test_token_classification.py +0 -1
- {euroeval-16.0.1 → euroeval-16.1.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/.github/workflows/ci.yaml +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/CITATION.cff +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/CODE_OF_CONDUCT.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/CONTRIBUTING.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/Dockerfile.cuda +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/LICENSE +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/NEW_DATASET_GUIDE.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/README.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/CNAME +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/README.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/datasets/README.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/extras/radial_plotter.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/faq.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/gfx/favicon.png +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/leaderboards/Monolingual/danish.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/leaderboards/Monolingual/dutch.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/leaderboards/Monolingual/english.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/leaderboards/Monolingual/faroese.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/leaderboards/Monolingual/finnish.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/leaderboards/Monolingual/french.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/leaderboards/Monolingual/german.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/leaderboards/Monolingual/icelandic.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/leaderboards/Monolingual/italian.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/leaderboards/Monolingual/norwegian.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/leaderboards/Monolingual/portuguese.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/leaderboards/Monolingual/spanish.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/leaderboards/Monolingual/swedish.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/leaderboards/Multilingual/european.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/leaderboards/Multilingual/germanic.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/leaderboards/README.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/methodology.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/python-package.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/tasks/README.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/tasks/common-sense-reasoning.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/tasks/knowledge.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/tasks/linguistic-acceptability.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/tasks/named-entity-recognition.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/tasks/reading-comprehension.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/tasks/sentiment-classification.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/tasks/speed.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/docs/tasks/summarization.md +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/gfx/euroeval.png +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/gfx/euroeval.xcf +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/gfx/scandeval.png +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/makefile +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/mkdocs.yaml +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/__init__.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/benchmark_modules/__init__.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/callbacks.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/data_loading.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/dataset_configs/faroese.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/dataset_configs/icelandic.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/exceptions.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/finetuning.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/languages.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/metrics/__init__.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/metrics/base.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/metrics/huggingface.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/metrics/llm_as_a_judge.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/metrics/speed.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/model_cache.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/model_config.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/model_loading.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/prompt_templates/__init__.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/speed_benchmark.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/task_group_utils/__init__.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/task_group_utils/multiple_choice_classification.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/task_group_utils/text_to_text.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/euroeval/types.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_angry_tweets.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_mim_gold_ner.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/create_norec.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/fix_dot_env_file.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/src/scripts/versioning.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/tests/__init__.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/tests/test_benchmark_modules/__init__.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/tests/test_benchmark_modules/test_hf.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/tests/test_constants.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/tests/test_enums.py +0 -0
- {euroeval-16.0.1 → euroeval-16.1.0}/tests/test_model_config.py +0 -0
--- euroeval-16.0.1/.gitignore
+++ euroeval-16.1.0/.gitignore
@@ -34,7 +34,7 @@ var/
 pip-log.txt
 pip-delete-this-directory.txt

-#
+# Tests / coverage reports
 htmlcov/
 .tox/
 .coverage
@@ -118,4 +118,6 @@ docs/datasets/dataset_example_commands.txt

 # Various graphics
 gfx/euroeval-*.png
+gfx/euroeval-*.jpeg
+gfx/euroeval-*.jpg
 gfx/euroeval-*.xcf
--- euroeval-16.0.1/CHANGELOG.md
+++ euroeval-16.1.0/CHANGELOG.md
@@ -10,6 +10,54 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.



+## [v16.1.0] - 2025-09-11
+### Added
+- Added support for Polish 🇵🇱! This includes the reading comprehension dataset PoQuAD,
+  the sentiment classification dataset PolEmo 2.0, the linguistic acceptability dataset
+  ScaLA-pl, the named entity recognition dataset KPWr-NER, the summarisation dataset
+  PSC, the knowledge dataset LLMzSzŁ and the common-sense reasoning dataset
+  Winogrande-pl. Also added MultiWikiQA-pl and GoldenSwag-pl as unofficial reading
+  comprehension and common-sense reasoning datasets, respectively. This was contributed
+  by @oliverkinch ✨
+- Added the Swedish knowledge dataset Skolprov. It is unofficial for now. This was
+  contributed by @oliverkinch ✨
+- Added the knowledge dataset Trivia-et for Estonian. The dataset contains 800 trivia
+  questions about Estonia. In this version we rearrange the examples into
+  240 / 60 / 500 samples for the training, validation and test splits, respectively.
+  This replaces Exam-et as the official Estonian knowledge dataset. This was contributed
+  by @slowwavesleep ✨
+- Added the English and German versions of XQuAD as unofficial reading comprehension
+  datasets.
+- Added the English common-sense reasoning dataset Winogrande and its translated
+  versions for Danish, German, Spanish, Finnish, French, Italian, Latvian,
+  Dutch, Norwegian, Polish, Portuguese and Swedish. These are unofficial for now.
+- Added a new `--generative-type` argument, which can be used to override the automatic
+  detection of the generative type (base decoder, instruction-tuned decoder, or
+  reasoning decoder) of a decoder model. This can be useful if the automatic detection
+  fails for a specific model.
+- Now supports evaluating base decoders on inference servers. This requires the
+  `--generative-type base` argument to be set, as the automatic detection will not work
+  for these models.
+
+### Changed
+- Changed the model ID syntax: we now use `#` to indicate parameters and still use
+  `@` to indicate revision. For instance, `o3#low` indicates the `o3` model with low
+  reasoning effort, and `tencent/Hunyuan-1.8B-Instruct@v1#no-thinking` indicates the
+  Hunyuan model from the `v1` branch with the `enable_thinking=False` parameter set.
+  This is fully backwards compatible, in the sense that API models still support using
+  `@` for parameters as well, just like previously, but you will get a warning that this
+  syntax is deprecated.
+- Added `thinking` and `no-thinking` parameters for all open-weight models. Of course,
+  they only make a difference for models that support this flag.
+- Reduced the number of tokens used for reasoning models from 32,768 to 8,192, as models
+  that reached the full 32,768 tokens did so because they ended up repeating themselves,
+  making the evaluation slower without any benefit.
+
+### Fixed
+- Some generative models consistently generated empty dictionaries when using structured
+  generation. We now catch this and retry the evaluation without structured generation.
+
+
 ## [v16.0.1] - 2025-09-07
 ### Fixed
 - Fixed a bug causing encoders to fail when evaluating on the Exam-et dataset.
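Taken together, the new model ID syntax and the new `--generative-type` argument change how evaluations are launched from the command line. Below is a minimal sketch of what such invocations could look like, reusing the example model IDs quoted in the changelog entry above; the dataset names are taken from elsewhere in this release and the model/dataset pairings are purely illustrative.

```bash
# Parameters are now given with '#', revisions still with '@' (the model IDs below
# are the examples quoted in the changelog; the dataset choices are placeholders).
$ euroeval --model o3#low --dataset winogrande
$ euroeval --model tencent/Hunyuan-1.8B-Instruct@v1#no-thinking --dataset scandiqa-da

# Evaluating a base decoder on an inference server requires overriding the automatic
# generative-type detection, since it cannot be detected for such models.
$ euroeval --model <model-id> --generative-type base --dataset belebele-da
```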
--- euroeval-16.0.1/docs/datasets/danish.md
+++ euroeval-16.1.0/docs/datasets/danish.md
@@ -355,9 +355,12 @@ $ euroeval --model <model-id> --dataset scandiqa-da

 ### Unofficial: BeleBele-da

-This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+and features multiple-choice reading comprehension questions across 122 languages.

-The original dataset contains 900 unique multiple-choice reading comprehension passages
+The original dataset contains 900 unique multiple-choice reading comprehension passages
+and questions. From these, we use a 256 / 64 / 580 split for training, validation and
+testing, respectively.

 Here are a few examples from the training split:

@@ -418,8 +421,9 @@ $ euroeval --model <model-id> --dataset belebele-da

 ### Unofficial: MultiWikiQA-da

-This dataset
-articles with generated questions and answers
+This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2509.04111)
+and contains Wikipedia articles with LLM-generated questions and answers in 300+
+languages.

 The original full dataset consists of 5,000 samples in a single split. We use a 1,024 /
 256 / 2,048 split for training, validation and testing, respectively, sampled randomly.
@@ -831,9 +835,17 @@ $ euroeval --model <model-id> --dataset hellaswag-da

 ### Unofficial: GoldenSwag-da

-This dataset is a filtered and machine translated version of the English [HellaSwag
+This dataset is a filtered and machine translated version of the English [HellaSwag
+dataset](https://aclanthology.org/P19-1472/), featuring both video descriptions from
+ActivityNet as well as how-to articles from WikiHow. The machine translated version was
+published in [this paper](https://doi.org/10.48550/arXiv.2410.08928) and was done using
+DeepL, and the filtering was published in [this
+paper](https://doi.org/10.48550/arXiv.2504.07825), which resulted in higher quality
+samples.

-The original full dataset consists of 1530 / 1530 samples for training and validation,
+The original full dataset consists of 1530 / 1530 samples for training and validation,
+respectively. However, they are exactly equal. We use a split of 660 / 256 / 2,048
+samples for training, validation, and testing, respectively.

 Here are a few examples from the training split:

@@ -894,8 +906,72 @@ You can evaluate this dataset directly as follows:
 $ euroeval --model <model-id> --dataset goldenswag-da
 ```

+### Unofficial: Winogrande-da

-
+This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2506.19468)
+and is a translated and filtered version of the English [Winogrande
+dataset](https://doi.org/10.1145/3474381).
+
+The original full dataset consists of 47 / 1,210 samples for training and testing, and
+we use the same splits.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "text": "Natalie synes, at smaragder er smukke ædelstene, men Betty gør ikke. _ købte en halskæde med en stor smaragd. Hvad refererer det tomme _ til?\nSvarmuligheder:\na. Valgmulighed A: Natalie\nb. Valgmulighed B: Betty",
+  "label": "a"
+}
+```
+
+```json
+{
+  "text": "Natalie synes, at smaragder er smukke ædelstene, men Betty gør ikke. _ købte en halskæde med en stor smaragd. Hvad refererer det tomme _ til?\nSvarmuligheder:\na. Valgmulighed A: Natalie\nb. Valgmulighed B: Betty",
+  "label": "a"
+}
+```
+
+```json
+{
+  "text": "At håndtere nødsituationer var aldrig særlig svært for Kevin, men det var det for Nelson, fordi _ ikke var i stand til at forblive rolig under pres. Hvad refererer det tomme _ til?\nSvarmuligheder:\na. Valgmulighed A: Kevin\nb. Valgmulighed B: Nelson",
+  "label": "b"
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 5
+- Prefix prompt:
+  ```
+  Følgende er multiple choice spørgsmål (med svar).
+  ```
+- Base prompt template:
+  ```
+  Spørgsmål: {text}
+  Svarmuligheder:
+  a. {option_a}
+  b. {option_b}
+  Svar: {label}
+  ```
+- Instruction-tuned prompt template:
+  ```
+  Spørgsmål: {text}
+  Svarmuligheder:
+  a. {option_a}
+  b. {option_b}
+
+  Besvar ovenstående spørgsmål ved at svare med 'a' eller 'b', og intet andet.
+  ```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+$ euroeval --model <model-id> --dataset winogrande-da
+```
+
+
+## Summarisation

 ### Nordjylland News

--- euroeval-16.0.1/docs/datasets/dutch.md
+++ euroeval-16.1.0/docs/datasets/dutch.md
@@ -153,9 +153,9 @@ from a sentence, or by swapping two neighbouring words in a sentence. To ensure
 this does indeed break the grammaticality of the sentence, a set of rules were used on
 the part-of-speech tags of the words in the sentence.

-The original dataset consists of 13,603 samples, from which we use 1,024 / 256 / 2,048
-validation and testing, respectively (so 3,328 samples used in
-used as-is in the framework.
+The original dataset consists of 13,603 samples, from which we use 1,024 / 256 / 2,048
+samples for training, validation and testing, respectively (so 3,328 samples used in
+total). These splits are used as-is in the framework.

 Here are a few examples from the training split:

@@ -390,8 +390,9 @@ $ euroeval --model <model-id> --dataset belebele-nl

 ### Unofficial: MultiWikiQA-nl

-This dataset
-articles with generated questions and answers
+This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2509.04111)
+and contains Wikipedia articles with LLM-generated questions and answers in 300+
+languages.

 The original full dataset consists of 5,000 samples in a single split. We use a 1,024 /
 256 / 2,048 split for training, validation and testing, respectively, sampled randomly.
@@ -676,9 +677,17 @@ $ euroeval --model <model-id> --dataset hellaswag-nl

 ### Unofficial: GoldenSwag-nl

-This dataset is a filtered and machine translated version of the English [HellaSwag
+This dataset is a filtered and machine translated version of the English [HellaSwag
+dataset](https://aclanthology.org/P19-1472/), featuring both video descriptions from
+ActivityNet as well as how-to articles from WikiHow. The machine translated version was
+published in [this paper](https://doi.org/10.48550/arXiv.2410.08928) and was done using
+DeepL, and the filtering was published in [this
+paper](https://doi.org/10.48550/arXiv.2504.07825), which resulted in higher quality
+samples.

-The original full dataset consists of 1530 / 1530 samples for training and validation,
+The original full dataset consists of 1530 / 1530 samples for training and validation,
+respectively. However, they are exactly equal. We use a split of 660 / 256 / 2,048
+samples for training, validation, and testing, respectively.

 Here are a few examples from the training split:

@@ -739,8 +748,72 @@ You can evaluate this dataset directly as follows:
 $ euroeval --model <model-id> --dataset goldenswag-nl
 ```

+### Unofficial: Winogrande-nl
+
+This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2506.19468)
+and is a translated and filtered version of the English [Winogrande
+dataset](https://doi.org/10.1145/3474381).
+
+The original full dataset consists of 47 / 1,210 samples for training and testing, and
+we use the same splits.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "text": "Emily vroeg haar zus Sarah of ze tampons of maandverband nodig had uit de winkel, hoewel _ dat niet nodig had omdat ze was overgestapt op het gebruik van menstruatiecups. Waar verwijst de lege _ naar?\nAntwoordopties:\na. Optie A: Emily\nb. Optie B: Sarah",
+  "label": "a"
+}
+```
+
+```json
+{
+  "text": "Bij het kopen van een huis heeft Patricia niet zoveel geld te besteden als Tanya, dus _ koopt een huis met 1 slaapkamer. Waar verwijst de lege _ naar?\nAntwoordopties:\na. Optie A: Patricia\nb. Optie B: Tanya",
+  "label": "a"
+}
+```
+
+```json
+{
+  "text": "Eenmaal in Polen genoot Dennis meer van de reis dan Jason omdat _ een oppervlakkige kennis van de Poolse taal had. Waar verwijst de lege _ naar?\nAntwoordopties:\na. Optie A: Dennis\nb. Optie B: Jason",
+  "label": "b"
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 5
+- Prefix prompt:
+  ```
+  Hieronder staan meerkeuzevragen (met antwoorden).
+  ```
+- Base prompt template:
+  ```
+  Vraag: {text}
+  Antwoordopties:
+  a. {option_a}
+  b. {option_b}
+  Antwoord: {label}
+  ```
+- Instruction-tuned prompt template:
+  ```
+  Vraag: {text}
+  Antwoordopties:
+  a. {option_a}
+  b. {option_b}
+
+  Beantwoord de bovenstaande vraag met 'a' of 'b', en niets anders.
+  ```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+$ euroeval --model <model-id> --dataset winogrande-nl
+```
+

-##
+## Summarisation

 ### WikiLingua-nl

--- euroeval-16.0.1/docs/datasets/english.md
+++ euroeval-16.1.0/docs/datasets/english.md
@@ -295,6 +295,79 @@ $ euroeval --model <model-id> --dataset squad
 ```


+### Unofficial: XQuAD-en
+
+This dataset was published in [this paper](https://aclanthology.org/2020.acl-main.421/)
+and contains 1190 question-answer pairs from [SQuAD
+v1.1](https://rajpurkar.github.io/SQuAD-explorer/) translated into ten languages by
+professional translators.
+
+The dataset is split into 550 / 128 / 512 question-answer pairs for training,
+validation, and testing, respectively.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "context": "Newcastle replaced him in January 1756 with Lord Loudoun, with Major General James Abercrombie as his second in command. Neither of these men had as much campaign experience as the trio of officers France sent to North America. French regular army reinforcements arrived in New France in May 1756, led by Major General Louis-Joseph de Montcalm and seconded by the Chevalier de Lévis and Colonel François-Charles de Bourlamaque, all experienced veterans from the War of the Austrian Succession. During that time in Europe, on May 18, 1756, England formally declared war on France, which expanded the war into Europe, which was later to be known as the Seven Years' War.",
+  "question": "Who led New France reinforcements in 1756?",
+  "answers": {
+    "answer_start": array([305], dtype=int32),
+    "text": array(["Major General Louis-Joseph de Montcalm"], dtype=object)
+  }
+}
+```
+```json
+{
+  "context": "Jacksonville is in the First Coast region of northeast Florida and is centered on the banks of the St. Johns River, about 25 miles (40 km) south of the Georgia state line and about 340 miles (550 km) north of Miami. The Jacksonville Beaches communities are along the adjacent Atlantic coast. The area was originally inhabited by the Timucua people, and in 1564 was the site of the French colony of Fort Caroline, one of the earliest European settlements in what is now the continental United States. Under British rule, settlement grew at the narrow point in the river where cattle crossed, known as Wacca Pilatka to the Seminole and the Cow Ford to the British. A platted town was established there in 1822, a year after the United States gained Florida from Spain; it was named after Andrew Jackson, the first military governor of the Florida Territory and seventh President of the United States.",
+  "question": "Prior to the arrival of the French, the area now known as Jacksonville was previously inhabited by what people?",
+  "answers": {
+    "answer_start": array([329], dtype=int32),
+    "text": array(["the Timucua"], dtype=object)
+  }
+}
+```
+```json
+{
+  "context": "Luther's hymns were frequently evoked by particular events in his life and the unfolding Reformation. This behavior started with his learning of the execution of Johann Esch and Heinrich Voes, the first individuals to be martyred by the Roman Catholic Church for Lutheran views, prompting Luther to write the hymn \"Ein neues Lied wir heben an\" (\"A new song we raise\"), which is generally known in English by John C. Messenger's translation by the title and first line \"Flung to the Heedless Winds\" and sung to the tune Ibstone composed in 1875 by Maria C. Tiddeman.",
+  "question": "What is the hymn known as in English?",
+  "answers": {
+    "answer_start": array([469], dtype=int32),
+    "text": array(["Flung to the Heedless Winds"], dtype=object)
+  }
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 4
+- Prefix prompt:
+  ```
+  The following are texts with accompanying questions and answers.
+  ```
+- Base prompt template:
+  ```
+  Text: {text}
+  Question: {question}
+  Answer in max 3 words:
+  ```
+- Instruction-tuned prompt template:
+  ```
+  Text: {text}
+
+  Answer the following question about the above text in at most 3 words.
+
+  Question: {question}
+  ```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+$ euroeval --model <model-id> --dataset xquad-en
+```
+
+
 ### Unofficial: BeleBele-en

 This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
@@ -358,8 +431,9 @@ $ euroeval --model <model-id> --dataset belebele-en

 ### Unofficial: MultiWikiQA-en

-This dataset
-articles with generated questions and answers
+This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2509.04111)
+and contains Wikipedia articles with LLM-generated questions and answers in 300+
+languages.

 The original full dataset consists of 5,000 samples in a single split. We use a 1,024 /
 256 / 2,048 split for training, validation and testing, respectively, sampled randomly.
@@ -707,8 +781,69 @@ You can evaluate this dataset directly as follows:
 $ euroeval --model <model-id> --dataset hellaswag
 ```

+### Unofficial: Winogrande
+
+This dataset was published in [this paper](https://doi.org/10.1145/3474381). The
+original full dataset consists of 47 / 1,210 samples for training and testing, and we
+use the same splits.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "text": "Elena would grab their inventory in the back of the store for Megan to sell each time because _ was a businessperson. What does the blank _ refer to?\nChoices:\na. Elena\nb. Megan",
+  "label": "a"
+}
+```
+
+```json
+{
+  "text": "Once in Poland, Dennis enjoyed the trip more than Jason because _ had a deeper understanding of the Polish language. What does the blank _ refer to?\nChoices:\na. Dennis\nb. Jason",
+  "label": "a"
+}
+```
+
+```json
+{
+  "text": "Handling emergencies was never very difficult for Kevin but it was for Nelson because _ wasn't able to remain calm under pressure. What does the blank _ refer to?\nChoices:\na. Kevin\nb. Nelson",
+  "label": "b"
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 5
+- Prefix prompt:
+  ```
+  The following are multiple choice questions (with answers).
+  ```
+- Base prompt template:
+  ```
+  Question: {text}
+  Options:
+  a. {option_a}
+  b. {option_b}
+  Answer: {label}
+  ```
+- Instruction-tuned prompt template:
+  ```
+  Question: {text}
+  Options:
+  a. {option_a}
+  b. {option_b}
+
+  Answer the above question by replying with 'a' or 'b', and nothing else.
+  ```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+$ euroeval --model <model-id> --dataset winogrande
+```
+

-##
+## Summarisation

 ### CNN/DailyMail

--- euroeval-16.0.1/docs/datasets/estonian.md
+++ euroeval-16.1.0/docs/datasets/estonian.md
@@ -280,8 +280,9 @@ $ euroeval --model <model-id> --dataset scala-et

 ### MultiWikiQA-et

-This dataset
-articles with generated questions and answers
+This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2509.04111)
+and contains Wikipedia articles with LLM-generated questions and answers in 300+
+languages.

 The original full dataset consists of 5,000 samples in a single split. We use a 1,024 /
 256 / 2,048 split for training, validation and testing, respectively, sampled randomly.
@@ -351,7 +352,77 @@ $ euroeval --model <model-id> --dataset multi-wiki-qa-et

 ## Knowledge

-###
+### Trivia-et
+
+This dataset was published [here](https://huggingface.co/datasets/TalTechNLP/trivia_et).
+It was extracted from the "Eesti Mäng" board game, and contains trivia questions about
+Estonia.
+
+The original dataset contains 800 examples. From these, we use 240 / 60 / 500 samples
+for our training, validation and test splits, respectively.
+
+Note that this is a gated dataset, and we would like to avoid contaminating LLM
+pre-training data as much as possible. Accordingly, we selected more generic questions
+not representative of the full dataset in terms of question content to show here:
+
+```json
+{
+  "text": "Mis on isoterm?\nVastusevariandid:\na. samatemperatuurijoon\nb. samaõhurõhujoon\nc. samapingejoon\nd. samakõrgusjoon",
+  "label": "a"
+}
+```
+
+```json
+{
+  "text": "Mis on isobaat?\nVastusevariandid:\na. samasügavusjoon\nb. samaõhurõhujoon\nc. samatemperatuurijoon\nd. samakõrgusjoon",
+  "label": "a"
+}
+```
+
+```json
+{
+  "text": "Mida mõõdetakse baromeetriga?\nVastusevariandid:\na. veekogude sügavust\nb. temperatuuri\nc. jõgede voolukiirust\nd. õhurõhku",
+  "label": "d"
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 5
+- Prefix prompt:
+  ```
+  Järgnevad on vastusevariantidega küsimused (koos vastustega).
+  ```
+- Base prompt template:
+  ```
+  Küsimus: {text}
+  Vastusevariandid:
+  a. {option_a}
+  b. {option_b}
+  c. {option_c}
+  d. {option_d}
+  Vastus: {label}
+  ```
+- Instruction-tuned prompt template:
+  ```
+  Küsimus: {text}
+  Vastusevariandid:
+  a. {option_a}
+  b. {option_b}
+  c. {option_c}
+  d. {option_d}
+
+  Võimalikud vastused: 'a', 'b', 'c' or 'd'. Muud vastused ei ole lubatud.
+  ```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+$ euroeval --model <model-id> --dataset trivia-et
+```
+
+
+### Unofficial: Exam-et

 This dataset was released in [this
 repository](https://huggingface.co/datasets/TalTechNLP/exam_et) and contains questions
@@ -420,9 +491,9 @@ $ euroeval --model <model-id> --dataset exam-et

 ## Common-sense Reasoning

-###
+### Winogrande-et

-The dataset includes the [
+The dataset includes the [Winogrande](https://doi.org/10.48550/arXiv.1907.10641) test
 set translated and culturally adapted by hand by a professional translator (citation
 TBA). The structure of the dataset is identical to the original. Since train and dev
 splits were not translated manually, we employ the GPT-4o model to translate the
@@ -430,7 +501,8 @@ expected number of examples starting from the beginning of the respective splits
 final dataset size is 1,024 / 256 / 1,767 for the training, validation and test splits,
 respectively.

-Here are a few examples from the training split (note that unlike the test split these
+Here are a few examples from the training split (note that unlike the test split these
+are machine translated):

 ```json
 {
@@ -440,7 +512,8 @@ Here are a few examples from the training split (note that unlike the test split
 ```
 ```json
 {
-  "text": "Ian vabatahtlikult sõi Dennise menudo pärast seda, kui oli juba kausitäie söönud, sest _ nautis soolte söömist.\nVastusevariandid:\na. Ian\nb. Dennis",
+  "text": "Ian vabatahtlikult sõi Dennise menudo pärast seda, kui oli juba kausitäie söönud, sest _ nautis soolte söömist.\nVastusevariandid:\na. Ian\nb. Dennis",
+  "label": "a"
 }
 ```
 ```json
@@ -483,7 +556,7 @@ $ euroeval --model <model-id> --dataset winogrande-et
 ```


-##
+## Summarisation

 ### ERRNews

@@ -495,8 +568,8 @@ pipeline paired with the human written summary from the archive.

 The original full dataset consists of 10,420 / 523 / 523 samples for training,
 validation and testing, respectively. We use a 1,024 / 256 / 2,048 split for training,
-validation and testing, respectively. The test split is extended with additional
-from the train split.
+validation and testing, respectively. The test split is extended with additional
+examples from the train split.

 ```json
 {
--- euroeval-16.0.1/docs/datasets/faroese.md
+++ euroeval-16.1.0/docs/datasets/faroese.md
@@ -355,8 +355,9 @@ $ euroeval --model <model-id> --dataset foqa

 ### Unofficial: MultiWikiQA-fo

-This dataset
-articles with generated questions and answers
+This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2509.04111)
+and contains Wikipedia articles with LLM-generated questions and answers in 300+
+languages.

 The original full dataset consists of 5,000 samples in a single split. We use a 1,024 /
 256 / 2,048 split for training, validation and testing, respectively, sampled randomly.