EuroEval 15.14.0.tar.gz → 15.16.0.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release.
This version of EuroEval might be problematic.
- {euroeval-15.14.0 → euroeval-15.16.0}/.github/ISSUE_TEMPLATE/bug.yaml +1 -1
- {euroeval-15.14.0 → euroeval-15.16.0}/.github/workflows/ci.yaml +4 -2
- {euroeval-15.14.0 → euroeval-15.16.0}/.pre-commit-config.yaml +3 -3
- {euroeval-15.14.0 → euroeval-15.16.0}/CHANGELOG.md +41 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/PKG-INFO +5 -6
- {euroeval-15.14.0 → euroeval-15.16.0}/README.md +1 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/datasets/danish.md +66 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/datasets/dutch.md +66 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/datasets/finnish.md +66 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/datasets/french.md +66 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/datasets/german.md +61 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/datasets/italian.md +68 -2
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/datasets/spanish.md +67 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/datasets/swedish.md +66 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/methodology.md +1 -1
- {euroeval-15.14.0 → euroeval-15.16.0}/pyproject.toml +4 -6
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/__init__.py +7 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/benchmark_modules/litellm.py +155 -105
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/benchmark_modules/vllm.py +21 -15
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/benchmarker.py +10 -11
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/data_models.py +1 -1
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/dataset_configs/danish.py +10 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/dataset_configs/dutch.py +10 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/dataset_configs/finnish.py +10 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/dataset_configs/french.py +10 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/dataset_configs/german.py +10 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/dataset_configs/italian.py +10 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/dataset_configs/spanish.py +10 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/dataset_configs/swedish.py +10 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/finetuning.py +2 -1
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/generation.py +1 -1
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/human_evaluation.py +2 -1
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/metrics.py +22 -4
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/prompt_templates/multiple_choice.py +1 -1
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/task_group_utils/question_answering.py +7 -1
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/task_group_utils/sequence_classification.py +8 -1
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/task_group_utils/text_to_text.py +8 -1
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/task_group_utils/token_classification.py +9 -2
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/types.py +5 -0
- euroeval-15.14.0/src/scripts/create_goldenswag_pt.py → euroeval-15.16.0/src/scripts/create_goldenswag.py +63 -36
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_data_models.py +1 -1
- {euroeval-15.14.0 → euroeval-15.16.0}/uv.lock +478 -446
- {euroeval-15.14.0 → euroeval-15.16.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/.gitignore +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/CITATION.cff +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/CODE_OF_CONDUCT.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/CONTRIBUTING.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/Dockerfile.cuda +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/LICENSE +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/NEW_DATASET_GUIDE.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/CNAME +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/README.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/datasets/README.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/datasets/english.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/datasets/faroese.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/datasets/icelandic.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/datasets/norwegian.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/datasets/portuguese.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/extras/radial_plotter.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/faq.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/gfx/favicon.png +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/leaderboards/Monolingual/danish.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/leaderboards/Monolingual/dutch.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/leaderboards/Monolingual/english.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/leaderboards/Monolingual/faroese.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/leaderboards/Monolingual/finnish.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/leaderboards/Monolingual/french.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/leaderboards/Monolingual/german.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/leaderboards/Monolingual/icelandic.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/leaderboards/Monolingual/italian.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/leaderboards/Monolingual/norwegian.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/leaderboards/Monolingual/spanish.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/leaderboards/Monolingual/swedish.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/leaderboards/Multilingual/european.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/leaderboards/Multilingual/germanic.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/leaderboards/Multilingual/romance.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/leaderboards/README.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/python-package.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/tasks/README.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/tasks/common-sense-reasoning.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/tasks/knowledge.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/tasks/linguistic-acceptability.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/tasks/named-entity-recognition.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/tasks/reading-comprehension.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/tasks/sentiment-classification.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/tasks/speed.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/docs/tasks/summarization.md +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/gfx/euroeval.png +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/gfx/euroeval.xcf +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/gfx/scandeval.png +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/makefile +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/mkdocs.yaml +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/benchmark_config_factory.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/benchmark_modules/__init__.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/benchmark_modules/base.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/benchmark_modules/fresh.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/benchmark_modules/hf.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/callbacks.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/cli.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/constants.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/data_loading.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/dataset_configs/__init__.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/dataset_configs/english.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/dataset_configs/faroese.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/dataset_configs/icelandic.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/dataset_configs/norwegian.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/dataset_configs/portuguese.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/enums.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/exceptions.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/generation_utils.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/languages.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/model_cache.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/model_config.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/model_loading.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/prompt_templates/__init__.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/prompt_templates/linguistic_acceptability.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/prompt_templates/named_entity_recognition.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/prompt_templates/reading_comprehension.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/prompt_templates/sentiment_classification.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/prompt_templates/summarization.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/scores.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/speed_benchmark.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/task_group_utils/__init__.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/task_group_utils/multiple_choice_classification.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/tasks.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/tokenization_utils.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/euroeval/utils.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/constants.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_allocine.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_angry_tweets.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_arc.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_arc_is.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_belebele.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_boolq_pt.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_cnn_dailymail.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_conll_en.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_conll_es.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_conll_nl.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_dane.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_danish_citizen_tests.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_dansk.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_danske_talemaader.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_danske_talemaader_old.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_dbrd.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_dutch_cola.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_eltec.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_fone.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_foqa.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_fosent.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_fquad.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_germanquad.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_germeval.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_harem.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_hellaswag.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_hellaswag_fi.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_ice_linguistic.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_icelandic_error_corpus.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_icelandic_knowledge.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_icelandic_qa.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_icesum.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_idioms_no.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_ilpost_sum.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_jentoft.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_life_in_the_uk.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_mim_gold_ner.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_mlqa_es.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_mlsum_de.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_mlsum_es.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_mmlu.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_multi_wiki_qa.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_multinerd-it.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_no_cola.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_no_sammendrag.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_nor_common_sense_qa.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_nordjylland_news.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_norec.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_norglm_multiqa.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_norglm_multisum.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_norne.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_norquad.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_nqii.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_nrk_quiz_qa.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_orange_sum.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_personal_sum.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_publico.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_rrn.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_sb10k.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_scala.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_scandiqa.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_scandisent_fi.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_schibsted.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_sentiment_headlines_es.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_sentipolc16.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_squad.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_squad_it.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_squad_nl.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_squad_nl_old.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_sst2_pt.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_sst5.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_suc3.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_swedn.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_swerec.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_turku_ner_fi.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_tydiqa_fi.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_wiki_lingua_nl.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_wikiann_fo.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_wikineural-it.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_winogrande_is.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_xlsum_fi.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/create_xquad_es.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/fix_dot_env_file.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/load_ud_pos.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/src/scripts/versioning.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/__init__.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/conftest.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_benchmark_config_factory.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_benchmark_modules/__init__.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_benchmark_modules/test_base.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_benchmark_modules/test_fresh.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_benchmark_modules/test_hf.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_benchmark_modules/test_litellm.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_benchmark_modules/test_vllm.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_benchmarker.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_callbacks.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_cli.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_constants.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_data_loading.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_dataset_configs.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_enums.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_exceptions.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_finetuning.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_generation.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_human_evaluation.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_languages.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_model_cache.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_model_config.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_model_loading.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_scores.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_speed_benchmark.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_task_utils/__init__.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_task_utils/test_question_answering.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_task_utils/test_sequence_classification.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_task_utils/test_text_to_text.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_task_utils/test_token_classification.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_tasks.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_tokenization_utils.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_types.py +0 -0
- {euroeval-15.14.0 → euroeval-15.16.0}/tests/test_utils.py +0 -0
{euroeval-15.14.0 → euroeval-15.16.0}/.github/ISSUE_TEMPLATE/bug.yaml
@@ -55,7 +55,7 @@ body:
     attributes:
       label: EuroEval version
       description: What version of EuroEval are you using?
-      placeholder: Output of `pip list | grep
+      placeholder: Output of `pip list | grep euroeval`
     validations:
       required: true
   - type: input
{euroeval-15.14.0 → euroeval-15.16.0}/.github/workflows/ci.yaml
@@ -57,7 +57,7 @@ jobs:
         run: uv sync --no-dev --extra test

       - name: Start Ollama server
-        run: curl -fsSL https://ollama.com/install.sh | sh
+        run: curl -fsSL https://ollama.com/install.sh | sh && ollama serve &

       - name: Test with pytest
         run: uv run pytest
{euroeval-15.14.0 → euroeval-15.16.0}/.github/workflows/ci.yaml
@@ -66,6 +66,8 @@ jobs:
           HF_TOKEN: ${{ secrets.HUGGINGFACE_API_KEY }}
           OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
           ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
+          XAI_API_KEY: ${{ secrets.XAI_API_KEY }}

       - name: Delete EuroEval cache
         run: rm -rf .euroeval_cache
{euroeval-15.14.0 → euroeval-15.16.0}/.github/workflows/ci.yaml
@@ -88,7 +90,7 @@ jobs:
         run: uv sync --no-dev --extra test

       - name: Start Ollama server
-        run: curl -fsSL https://ollama.com/install.sh | sh
+        run: curl -fsSL https://ollama.com/install.sh | sh && ollama serve &

       - name: Test with pytest
         run: uv run pytest
{euroeval-15.14.0 → euroeval-15.16.0}/.pre-commit-config.yaml
@@ -4,13 +4,13 @@ repos:
     hooks:
       - id: python-use-type-annotations
   - repo: https://github.com/pre-commit/pre-commit-hooks
-    rev:
+    rev: v6.0.0
     hooks:
       - id: end-of-file-fixer
       - id: trailing-whitespace
       - id: debug-statements
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.12.
+    rev: v0.12.8
     hooks:
       - id: ruff
         args:
{euroeval-15.14.0 → euroeval-15.16.0}/.pre-commit-config.yaml
@@ -31,7 +31,7 @@ repos:
     hooks:
      - id: nbstripout
   - repo: https://github.com/pre-commit/mirrors-mypy
-    rev: v1.17.
+    rev: v1.17.1
     hooks:
       - id: mypy
         args:
{euroeval-15.14.0 → euroeval-15.16.0}/CHANGELOG.md
@@ -10,6 +10,47 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.



+## [v15.16.0] - 2025-08-12
+### Added
+- Added metadata for GPT-5 models.
+
+### Changed
+- Updated `transformers` dependency to `>=4.55.0`.
+
+### Fixed
+- If the model uses 'mxfp4' quantisation then we allow the dtype to be bfloat16, rather
+  than forcing float16. This caused issues with the new GPT-OSS models.
+- Prevent multiple `Model <model-id> does not exist` logs when evaluating a model
+  that does not exist - now only logs this once.
+- Cleaner error message when attempting to benchmark a generative model without having a
+  GPU available.
+- Now raises error if an inference API is used with a parameter that is not supported.
+
+
+## [v15.15.0] - 2025-08-06
+### Added
+- Added the common-sense reasoning dataset GoldenSwag for the following
+  languages: Danish, German, Spanish, Finnish, French, Italian, Dutch, Swedish.
+  The datasets are unofficial for now. This was contributed by
+  [@oliverkinch](https://github.com/oliverkinch) ✨
+
+### Changed
+- Now allows metadata to be included in metrics, allowing more flexibility when
+  implementing custom metrics. This is not used in any task yet.
+- Changed structured decoding backend from Outlines to XGrammar, as the latter was more
+  robust and now supports all the JSON features we need.
+- Updated vLLM to `>=0.10.0`, which includes the updated XGrammar version.
+- Now uses the V1 engine of vLLM, as we only used the V0 engine because XGrammar did not
+  support all the JSON features we needed.
+
+### Fixed
+- Now sets `VLLM_ALLOW_LONG_MAX_MODEL_LEN=1` to ignore the vLLM error that happens when
+  vLLM cannot determine the maximum context length of a model correctly, so that it
+  thinks that the model's maximum context length is smaller than the amount that we
+  allow it to generate. This is basically since we're doing a more thorough check
+  through the config than vLLM does, so we can safely ignore this error.
+
+
 ## [v15.14.0] - 2025-07-30
 ### Changed
 - Now runs a "test run" for API inference models with a single conversation to check for
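The `VLLM_ALLOW_LONG_MAX_MODEL_LEN` fix in the changelog above amounts to setting an environment variable before vLLM is initialised. A minimal sketch of that pattern (the flag name is taken verbatim from the changelog entry; the surrounding code is illustrative, not EuroEval's actual implementation):

```python
import os

# Per the changelog entry, EuroEval sets this flag so that vLLM does not abort
# when vLLM's own probe underestimates the model's true maximum context length.
# The variable must be set before vLLM reads its configuration.
os.environ["VLLM_ALLOW_LONG_MAX_MODEL_LEN"] = "1"

print(os.environ["VLLM_ALLOW_LONG_MAX_MODEL_LEN"])
```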
{euroeval-15.14.0 → euroeval-15.16.0}/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: EuroEval
-Version: 15.
+Version: 15.16.0
 Summary: The robust European language model benchmark.
 Project-URL: Repository, https://github.com/EuroEval/EuroEval
 Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
{euroeval-15.14.0 → euroeval-15.16.0}/PKG-INFO
@@ -56,18 +56,16 @@ Requires-Dist: setuptools>=75.8.2
 Requires-Dist: tenacity>=9.0.0
 Requires-Dist: termcolor>=2.0.0
 Requires-Dist: torch>=2.6.0
-Requires-Dist: transformers>=4.
+Requires-Dist: transformers>=4.55.0
 Provides-Extra: all
 Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'all'
 Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
 Requires-Dist: gradio>=4.26.0; extra == 'all'
-Requires-Dist:
-Requires-Dist: vllm>=0.9.1; (platform_system == 'Linux') and extra == 'all'
+Requires-Dist: vllm>=0.10.0; (platform_system == 'Linux') and extra == 'all'
 Provides-Extra: generative
 Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
 Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
-Requires-Dist:
-Requires-Dist: vllm>=0.9.1; (platform_system == 'Linux') and extra == 'generative'
+Requires-Dist: vllm>=0.10.0; (platform_system == 'Linux') and extra == 'generative'
 Provides-Extra: human-evaluation
 Requires-Dist: gradio>=4.26.0; extra == 'human-evaluation'
 Provides-Extra: test
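The dependency bumps above (`transformers>=4.55.0`, `vllm>=0.10.0`) are plain lower-bound version constraints. A stdlib-only sketch of how such a bound can be checked against an installed version (real tooling would use the third-party `packaging` library; the helper names here are invented for illustration):

```python
def parse(version: str) -> tuple[int, ...]:
    """Split a dotted release version like '4.55.0' into comparable integers."""
    return tuple(int(part) for part in version.split("."))

# Lower bounds introduced by this diff.
MINIMUMS = {"transformers": "4.55.0", "vllm": "0.10.0"}

def satisfies(installed: str, minimum: str) -> bool:
    # Tuple comparison is element-wise, so (4, 54, 1) < (4, 55, 0).
    return parse(installed) >= parse(minimum)

print(satisfies("4.54.1", MINIMUMS["transformers"]))  # False: below the new bound
print(satisfies("0.10.0", MINIMUMS["vllm"]))          # True: exactly at the bound
```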
{euroeval-15.14.0 → euroeval-15.16.0}/PKG-INFO
@@ -235,6 +233,7 @@ A huge thank you to all the contributors who have helped make this project a suc
 <a href="https://github.com/BramVanroy"><img src="https://avatars.githubusercontent.com/u/2779410" width=50 alt="Contributor avatar for BramVanroy"/></a>
 <a href="https://github.com/peregilk"><img src="https://avatars.githubusercontent.com/u/9079808" width=50 alt="Contributor avatar for peregilk"/></a>
 <a href="https://github.com/Rijgersberg"><img src="https://avatars.githubusercontent.com/u/8604946" width=50 alt="Contributor avatar for Rijgersberg"/></a>
+<a href="https://github.com/duarteocarmo"><img src="https://avatars.githubusercontent.com/u/26342344" width=50 alt="Contributor avatar for duarteocarmo"/></a>


 ### Contribute to EuroEval
{euroeval-15.14.0 → euroeval-15.16.0}/README.md
@@ -159,6 +159,7 @@ A huge thank you to all the contributors who have helped make this project a suc
 <a href="https://github.com/BramVanroy"><img src="https://avatars.githubusercontent.com/u/2779410" width=50 alt="Contributor avatar for BramVanroy"/></a>
 <a href="https://github.com/peregilk"><img src="https://avatars.githubusercontent.com/u/9079808" width=50 alt="Contributor avatar for peregilk"/></a>
 <a href="https://github.com/Rijgersberg"><img src="https://avatars.githubusercontent.com/u/8604946" width=50 alt="Contributor avatar for Rijgersberg"/></a>
+<a href="https://github.com/duarteocarmo"><img src="https://avatars.githubusercontent.com/u/26342344" width=50 alt="Contributor avatar for duarteocarmo"/></a>


 ### Contribute to EuroEval
@@ -829,6 +829,72 @@ $ euroeval --model <model-id> --dataset hellaswag-da
|
|
|
829
829
|
```
|
|
830
830
|
|
|
831
831
|
|
|
832
|
+
### Unofficial: GoldenSwag-da
|
|
833
|
+
|
|
834
|
+
This dataset is a filtered and machine translated version of the English [HellaSwag dataset](https://aclanthology.org/P19-1472/), featuring both video descriptions from ActivityNet as well as how-to articles from WikiHow. The machine translated version was published in [this paper](https://doi.org/10.48550/arXiv.2410.08928) and was done using DeepL, and the filtering was published in [this paper](https://doi.org/10.48550/arXiv.2504.07825), which resulted in higher quality samples.
|
|
835
|
+
|
|
836
|
+
The original full dataset consists of 1530 / 1530 samples for training and validation, respectively. However, they are exactly equal. We use a split of 660 / 256 / 2,048 samples for training, validation, and testing, respectively.
|
|
837
|
+
|
|
838
|
+
Here are a few examples from the training split:
|
|
839
|
+
|
|
840
|
+
```json
|
|
841
|
+
{
|
|
842
|
+
"text": "Sådan giver du dig selv en fransk manicure ved hjælp af tape. Gnid en vatpind med neglelakfjerner på alle dine negle. Det vil ikke kun fjerne afskallet lak eller rester af lak, men det vil også fjerne fugtighedscreme fra neglen. Hvis du har et fugtighedsbevarende middel, såsom lotion eller olie, på neglen, vil lakken ikke sidde ordentligt fast.\nSvarmuligheder:\na. Kom lakfjerneren i en lille skål. Du skal bruge den om et par minutter til at få denne opløsning på tæerne.\nb. Fordel et fugtgivende pulver over alle dine negle med cirkulære bevægelser, indtil du kommer i kontakt med huden. Polér altid neglene, inden du går i gang.\nc. Skum vattet i lakfjerneren. Brug en blød vaskeklud til at samle lakken op.\nd. Sørg for, at du har skabt et perfekt lærred til din franske manicure. Påfør din basisfarve på hele neglen.",
|
|
843
|
+
"label": "d"
|
|
844
|
+
}
|
|
845
|
+
```
|
|
846
|
+
|
|
847
|
+
```json
|
|
848
|
+
{
|
|
849
|
+
"text": "Sådan forbedrer du et lille barns tale. Kom ned på deres niveau. Sæt dig på hug eller på gulvet. Det vil få deres opmærksomhed.\nSvarmuligheder:\na. Du vil tale med dit barn i stedet for til det. Hun vil også kunne se din mund og få visuelle tegn på, hvordan man siger bestemte lyde.\nb. Løft om nødvendigt hænderne sammen til knytnæver. Hvis du strækker dine hænder til knytnæver og gør det, mens du taler, vil dit barn sandsynligvis gøre det samme.\nc. Prøv at være så stille som muligt, og tal kun til dem, når de er rolige. Hvis du taler længe nok, vil de til sidst høre din stemme.\nd. Lad dem bede dig om at rykke tættere på dem. Hvis det er muligt, så brug en siddepind i hovedhøjde.",
|
|
850
|
+
"label": "a"
|
|
851
|
+
}
|
|
852
|
+
```
|
|
853
|
+
|
|
854
|
+
```json
|
|
855
|
+
{
|
|
856
|
+
"text": "Sådan bruger du en bodysuit. Vælg en bodysuit, der smigrer dine yndlingstræk. Med så mange muligheder og stilarter kan bodysuiten virkelig være universelt flatterende. For at finde en body, der ser godt ud på dig, skal du overveje, hvilken del af din krop du vil fremhæve.\nSvarmuligheder:\na. Det kan være underarmene, benene eller andre steder, der stikker ud. Måske har du for eksempel en flot læbespalte, som du gerne vil fremhæve.\nb. Find ud af, hvilken del af din krop du vil fremhæve, og skær så ned på det, der fremhæver denne del. Hvis du for eksempel ønsker, at overdelene skal fremhæve dine bryster mest muligt, kan bikinitrusserne også bæres omkring det område.\nc. Hvis du for eksempel er stolt af dine tonede arme, skal du vælge en body uden ærmer eller med halterneck. Start med en bodysuit i t-shirt-stil, hvis du er ved at varme op til trenden.\nd. Beslut dig for, hvor mange forskellige dele af dig, din body skal fremhæve. Hvis du for eksempel vil have et sporty look, skal din body også fremhæve en del af din krop i stedet for en særlig iøjnefaldende del.",
|
|
857
|
+
"label": "c"
|
|
858
|
+
}
|
|
859
|
+
```
|
|
860
|
+
|
|
861
|
+
When evaluating generative models, we use the following setup (see the
|
|
862
|
+
[methodology](/methodology) for more information on how these are used):
|
|
863
|
+
|
|
864
|
+
- Number of few-shot examples: 5
|
|
865
|
+
- Prefix prompt:
|
|
866
|
+
```
|
|
867
|
+
Følgende er multiple choice spørgsmål (med svar).
|
|
868
|
+
```
|
|
869
|
+
- Base prompt template:
|
|
870
|
+
```
|
|
871
|
+
Spørgsmål: {text}
|
|
872
|
+
Svarmuligheder:
|
|
873
|
+
a. {option_a}
|
|
874
|
+
b. {option_b}
|
|
875
|
+
c. {option_c}
|
|
876
|
+
d. {option_d}
|
|
877
|
+
Svar: {label}
|
|
878
|
+
```
|
|
879
|
+
- Instruction-tuned prompt template:
|
|
880
|
+
```
|
|
881
|
+
Spørgsmål: {text}
|
|
882
|
+
Svarmuligheder:
|
|
883
|
+
a. {option_a}
|
|
884
|
+
b. {option_b}
|
|
885
|
+
c. {option_c}
|
|
886
|
+
d. {option_d}
|
|
887
|
+
|
|
888
|
+
Besvar ovenstående spørgsmål ved at svare med 'a', 'b', 'c' eller 'd', og intet andet.
|
|
889
|
+
```
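The few-shot setup above amounts to a simple string-assembly step: the prefix prompt, then each few-shot example rendered through the base prompt template, and finally the test question with its label left blank. The following is a minimal sketch of that idea, not EuroEval's actual implementation; the function name and sample dictionaries are hypothetical:

```python
# Hypothetical sketch of few-shot prompt assembly for a multiple-choice task.
PREFIX = "Følgende er multiple choice spørgsmål (med svar)."
TEMPLATE = (
    "Spørgsmål: {text}\n"
    "Svarmuligheder:\n"
    "a. {option_a}\n"
    "b. {option_b}\n"
    "c. {option_c}\n"
    "d. {option_d}\n"
    "Svar: {label}"
)


def build_prompt(few_shot: list[dict], test_sample: dict) -> str:
    """Join the prefix, the filled-in few-shot examples, and the test
    question, whose label is left blank for the model to complete."""
    blocks = [PREFIX]
    blocks += [TEMPLATE.format(**sample) for sample in few_shot]
    blocks.append(TEMPLATE.format(**{**test_sample, "label": ""}).rstrip())
    return "\n\n".join(blocks)
```

With 5 few-shot examples, `few_shot` would hold five labelled samples drawn from the training split, and the model is scored on how it completes the final `Svar:` line.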

You can evaluate this dataset directly as follows:

```bash
$ euroeval --model <model-id> --dataset goldenswag-da
```

## Summarization

### Nordjylland News
@@ -674,6 +674,72 @@ $ euroeval --model <model-id> --dataset hellaswag-nl
```

### Unofficial: GoldenSwag-nl

This dataset is a filtered, machine-translated version of the English [HellaSwag dataset](https://aclanthology.org/P19-1472/), featuring both video descriptions from ActivityNet and how-to articles from WikiHow. The machine translation was produced with DeepL and published in [this paper](https://doi.org/10.48550/arXiv.2410.08928), while the filtering, described in [this paper](https://doi.org/10.48550/arXiv.2504.07825), resulted in higher-quality samples.

The original full dataset consists of 1,530 / 1,530 samples for training and validation, respectively; however, the two splits are identical. We use a split of 660 / 256 / 2,048 samples for training, validation, and testing, respectively.
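The claim that the published training and validation splits are identical is easy to check programmatically. Here is a toy sketch, with plain Python records standing in for the real dataset and a hypothetical function name:

```python
def splits_identical(train: list[dict], val: list[dict]) -> bool:
    """Return True when two splits contain exactly the same records in the
    same order, as is the case for GoldenSwag's published training and
    validation splits."""
    return len(train) == len(val) and all(a == b for a, b in zip(train, val))
```

When the check passes, one of the two copies can simply be discarded before carving out new training, validation, and test splits.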

Here are a few examples from the training split:

```json
{
"text": "Hoe leer je je kind een potlood vasthouden? Koop het juiste potlood voor je kind. Het gebruik van korte potloden, zoals golfpotloden of gewone potloden die doormidden zijn gebroken, kan kinderen helpen om zelf de juiste greep te vinden. Korte potloden hebben minder ruimte voor overbodige vingers, dus je kind heeft weinig keus dan de juiste drievingerige greep te gebruiken.\nAntwoordopties:\na. Je kunt korte potloden kopen bij de meeste hobby- en kantoorboekhandels. Help je kind met een potloodgreep.\nb. Goede potloden om mee te beginnen zijn de \"p\" en \"g\" potloden. Begin in de onderste potloodhouder in het midden en let goed op hoe je kind het potlood op zijn plaats probeert te houden.\nc. Ga voor meer informatie over het hanteren van een potlood naar. Maak onderscheid tussen potloden die met de handen worden aangedreven en potloden die met beide handen worden gebruikt.\nd. Met een vinger met een langere potloodpunt kunnen kinderen potloden met veel meer controle vasthouden. Met een vinger met een lagere punt kun je proberen om zowat elke vingerpositie onder controle te houden.",
"label": "a"
}
```

```json
{
"text": "Hoe ontstop je een langzaam lopende afvoer van de badkamer gootsteen. Verzamel je materialen. In plaats van te vertrouwen op afvoerreinigingsproducten, die vaak bijtend zijn en allergische reacties en ademhalingsproblemen kunnen veroorzaken, kun je huishoudelijke artikelen gebruiken die je waarschijnlijk al in huis hebt. Je hebt nodig: Doekjes zuiveringszout azijn citroen kokend water. Meet je ingrediënten af.\nAntwoordopties:\na. Een afvoer met een diameter van ongeveer 0,64 centimeter. Was gootsteenontstoppingsproducten met de hand is een gebruikelijke methode, maar je kunt ze bij de meeste bouwmarkten kopen.\nb. Je hebt als basis bloem, zuiveringszout, witte azijn en water nodig. Je kunt een maatbeker gebruiken om ze af te meten, of zelfs een waterkoker.\nc. Neem ¼ kopje zuiveringszout, 1 kopje witte azijn en 1 grote pan water om te koken. Zorg dat je een vod of gootsteenstopper bij de hand hebt.\nd. Hoewel niet alle kleine gootstenen verstopt zijn, kan heet water helpen om de verstopte gaten te verwijderen. Het gebruik van een maatbeker om je ingrediënten af te meten is vooral belangrijk omdat kokend water ook vuil zoals lichaamsresten, klei en zelfs dierlijke uitwerpselen introduceert.",
"label": "c"
}
```

```json
{
"text": "Hoe doe je een dip powder manicure? Gebruik nagellakremover en een nagelriemduwer. Als je nagellak op je nagels hebt, verwijder deze dan met nagellakremover zonder aceton op een niet-pluizend wattenschijfje. Gebruik een nagelriemduwer om je nagelriemen voorzichtig een beetje naar achteren te duwen.\nAntwoordopties:\na. Gebruik alleen nagellakremover of een nagelriemduwer als je vieze handen hebt. Schrijf op je nagels met de nagelriemduwer.\nb. Duw ze niet krachtig terug zodat je geen pijn veroorzaakt aan je vingers of teennagels. Druk ze echter wel stevig aan met je vingers.\nc. Beweeg de drukker in een ronddraaiende beweging om de doorbloeding te stimuleren. Breng een leave-in conditioner aan nadat je je nagelriemen hebt ingesmeerd.\nd. Verwijder voorzichtig overtollige nagelriemen met een nagelriemtrimmer of schraper. Dit zorgt ervoor dat nieuwe nagelgroei zichtbaar wordt, zodat je manicure langer meegaat voordat je hem moet opvullen.",
"label": "d"
}
```

When evaluating generative models, we use the following setup (see the
[methodology](/methodology) for more information on how these are used):

- Number of few-shot examples: 5
- Prefix prompt:
  ```
  Hieronder staan meerkeuzevragen (met antwoorden).
  ```
- Base prompt template:
  ```
  Vraag: {text}
  Antwoordopties:
  a. {option_a}
  b. {option_b}
  c. {option_c}
  d. {option_d}
  Antwoord: {label}
  ```
- Instruction-tuned prompt template:
  ```
  Vraag: {text}
  Antwoordopties:
  a. {option_a}
  b. {option_b}
  c. {option_c}
  d. {option_d}

  Beantwoord de bovenstaande vraag met 'a', 'b', 'c' of 'd', en niets anders.
  ```

You can evaluate this dataset directly as follows:

```bash
$ euroeval --model <model-id> --dataset goldenswag-nl
```

## Summarization

### WikiLingua-nl
@@ -494,6 +494,72 @@ $ euroeval --model <model-id> --dataset hellaswag-fi
```

### Unofficial: GoldenSwag-fi

This dataset is a filtered, machine-translated version of the English [HellaSwag dataset](https://aclanthology.org/P19-1472/), featuring both video descriptions from ActivityNet and how-to articles from WikiHow. The machine translation was produced with DeepL and published in [this paper](https://doi.org/10.48550/arXiv.2410.08928), while the filtering, described in [this paper](https://doi.org/10.48550/arXiv.2504.07825), resulted in higher-quality samples.

The original full dataset consists of 1,530 / 1,530 samples for training and validation, respectively; however, the two splits are identical. We use a split of 660 / 256 / 2,048 samples for training, validation, and testing, respectively.

Here are a few examples from the training split:

```json
{
"text": "Miten auton ulkoinen pesu tehdään oikein. Ensimmäinen asia, joka sinun on tehtävä kunnolla, on pestä autosi tehokkaasti. Ei ole mitään järkeä yrittää tehdä auton ulkoista detaljointia, jos päädyt vain naarmuttamaan ducosi entistä enemmän, koska jätit autoosi likaa. Sinun on ensin huuhdeltava autosi letkulla kovalla paineella.\nVastausvaihtoehdot:\na. Tämä poistaa suurimman osan liasta moottoristasi ja pitää moottorin moitteettomana. Käytä autosi pesemiseen korkeapainepesukoneita.\nb. Sitten sinun on alettava imuroida likaa pois. Kun olet poistanut mahdollisimman paljon likaa, voit palata ajoneuvon luokse keräämään roskia.\nc. Vie letku kaasuttimesta moottorilohkon yläosaan, odota viisi minuuttia, sulje sitten vesi ja päästä ilma ulos jäähdyttimestä. Irrota vanhat tiivisteet ja aloita vedellä pesu moottorin kannesta alas.\nd. Älä käytä letkusta lasertyyppistä pesua, vaan mieluummin pientä suppiloa. Aloita aina ylhäältä ja etene alaspäin.",
"label": "d"
}
```

```json
{
"text": "Miten kylpeä merisuolalla. Varaa itsellesi riittävästi aikaa 15-20 minuutin kylpyyn. Kylpy ei ole kuin suihku, jossa usein kiirehditään. Sen sijaan niiden on tarkoitus kestää pidempään, jotta keho ja mieli voivat rentoutua.\nVastausvaihtoehdot:\na. Ennen kylpyä haluat, että kehosi rentoutuu, ota päivittäin noin minuutti rentoutumista. Kylvystä voi saada samoja hyötyjä: suolahoito on helpompaa, mikä voi vähentää stressiä.\nb. Jotta saisit kylvystäsi suurimman hyödyn, suunnittele, että vietät vedessä 15-20 minuuttia. Ota suolakylpy illalla, jos haluat hoitaa unettomuutta.\nc. Jos haluat nopean kylpyläkokemuksen, 15-20 minuutin kylpy voi olla hyvä valinta. Anna itsellesi muutama tunti aikaa tottua lämpimään, rentouttavaan veteen.\nd. Jos sinulla on kiire, saatat jännittyä niin paljon, että menetät ajantajusi. Jos väsyt, ota myös nopea 15-20 minuutin kylpy.",
"label": "b"
}
```

```json
{
"text": "Kuinka tehdä ylösnousemussämpylöitä. Kaada maito kulhoon. Jotta hiiva aktivoituu, sinun on sekoitettava se lämpimään nesteeseen. Lisää ½ kupillista (118 ml) lämmintä maitoa tehosekoittimen kulhoon.\nVastausvaihtoehdot:\na. Jos haluat pidemmän prosessin, voit juoksuttaa hiivan lavuaarissa ennen kuin jatkat.... Sekoita maito ja seos vähitellen vispilällä.\nb. Sekoita, kunnes maito on hyvin vaaleaa (noin 110 ml). Jos maito on liian pehmeää tähän reseptiin, lisää 1/2 kupillista (120 ml) smetanaa.\nc. Maidon lämpötilan tulisi olla 105 °f (41 °c). Voit käyttää 1- tai 2-prosenttista maitoa, mutta täysmaidosta saadaan yleensä parhaat sämpylät.\nd. Jos sinulla on sauvasekoitin, voit tehdä sämpylöiden taikinan itse. Tarvitset vain 2 kuppia (500 ml) maitoa.",
"label": "c"
}
```

When evaluating generative models, we use the following setup (see the
[methodology](/methodology) for more information on how these are used):

- Number of few-shot examples: 5
- Prefix prompt:
  ```
  Seuraavat ovat monivalintakysymyksiä (vastauksineen).
  ```
- Base prompt template:
  ```
  Kysymys: {text}
  Vastausvaihtoehdot:
  a. {option_a}
  b. {option_b}
  c. {option_c}
  d. {option_d}
  Vastaus: {label}
  ```
- Instruction-tuned prompt template:
  ```
  Kysymys: {text}
  Vastausvaihtoehdot:
  a. {option_a}
  b. {option_b}
  c. {option_c}
  d. {option_d}

  Vastaa yllä olevaan kysymykseen käyttämällä 'a', 'b', 'c' tai 'd', äläkä mitään muuta.
  ```

You can evaluate this dataset directly as follows:

```bash
$ euroeval --model <model-id> --dataset goldenswag-fi
```

## Summarization

### XLSum-fi
@@ -580,6 +580,72 @@ $ euroeval --model <model-id> --dataset hellaswag-fr
```

### Unofficial: GoldenSwag-fr

This dataset is a filtered, machine-translated version of the English [HellaSwag dataset](https://aclanthology.org/P19-1472/), featuring both video descriptions from ActivityNet and how-to articles from WikiHow. The machine translation was produced with DeepL and published in [this paper](https://doi.org/10.48550/arXiv.2410.08928), while the filtering, described in [this paper](https://doi.org/10.48550/arXiv.2504.07825), resulted in higher-quality samples.

The original full dataset consists of 1,530 / 1,530 samples for training and validation, respectively; however, the two splits are identical. We use a split of 660 / 256 / 2,048 samples for training, validation, and testing, respectively.

Here are a few examples from the training split:

```json
{
"text": "Comment réparer des lunettes tordues. Prenez une paire de pinces à becs en plastique. Les pinces vous permettront d'effectuer des micro-ajustements sur les montures tordues de manière plus sûre qu'en essayant de les forcer à se mettre en forme à la main. Si possible, équipez-vous d'une paire de pinces dont les pointes sont recouvertes d'un revêtement en plastique souple.\nChoix:\na. Les pinces en métal ordinaires risquent de rayer, voire de casser, les montures en fil métallique fin. Si vous ne disposez pas d'une pince appropriée, une pince à main en plastique ou une paire de pinces peut également faire l'affaire.\nb. Sinon, vous pouvez simplement tenir la pince dans votre main et la laisser glisser. Soulevez la lentille avec les pointes de la pince.\nc. Les boîtiers métalliques sont parmi les matériaux les moins chers disponibles, mais ils rendent la tâche beaucoup plus difficile. Si vous ne trouvez pas de pince à bouts en plastique, votre dentiste optera probablement pour des étuis en verre.\nd. Le plastique souple peut être meilleur que le plastique dur. Le but du plastique est d'améliorer l'apparence des lentilles, tout en les rendant plus faciles à nettoyer et à remplacer.",
"label": "a"
}
```

```json
{
"text": "Comment être une meilleure personne à l'école. Développez votre sens du bien et du mal. Le monde d'aujourd'hui est rapide et impatient, mais pour devenir une meilleure personne, il faut prendre le temps de travailler sur ses valeurs. Décidez quelles sont les valeurs et les vertus les plus importantes pour vous.\nChoix:\na. Si vous pratiquez un sport, profitez-en pour vous entraîner. Si vous passez vos journées de gym à garder vos muscles immobiles, assurez-vous de prendre le temps de faire cet exercice.\nb. Efforcez-vous de voir toutes vos situations idéales en termes de bonne et de mauvaise situation afin d'avoir une meilleure attitude à l'égard de ces choses. Pensez à la façon dont vous aborderiez la situation dans laquelle vous avez l'intention de faire ce qu'il faut.\nc. Créez un système personnel de moralité en rejoignant des clubs et des organisations qui vous aideront à développer vos vertus, comme une équipe sportive, des clubs de service communautaire, une chorale ou un gouvernement étudiant. L'empathie, l'honnêteté, la patience, l'humour et la persévérance ne sont que quelques exemples de bonnes valeurs.\nd. La dernière chose que vous souhaitez, c'est de vous retrouver coincé dans un bar, de passer une mauvaise journée ou de vouloir faire du bénévolat pour votre cause. Pratiquez l'empathie et essayez de vivre votre vie sous un meilleur angle.",
"label": "c"
}
```

```json
{
"text": "Comment préparer une pommade antibactérienne à la maison. Choisissez vos huiles. L'huile de coco est naturellement antivirale, antibactérienne et antifongique. L'huile de coco devrait être le premier ingrédient, représentant environ la moitié de votre base d'huile (environ ½ tasse).\nChoix:\na. Vous ne devez pas en utiliser trop - 1-1 pour cent est une quantité excessive qui endommage facilement la peau du bébé et l'irrite. Vous n'avez pas besoin d'utiliser toutes vos huiles, mais essayez-en quelques-unes pour les peaux sensibles.\nb. Mais l'huile de coco peut aussi être rigide et difficile à travailler, vous devriez donc envisager d'utiliser ½ tasse d'une autre huile. D'excellents choix incluent l'huile d'olive, l'huile de jojoba ou l'huile d'amande.\nc. Utilisez 1 à 2 gouttes de votre huile essentielle préférée comme antibactérien. L'huile de coco est naturellement antibactérienne.\nd. L'huile peut être un ingrédient irritant pour la peau, provoquant irritation, sécheresse et inflammation. Appliquez de l'huile de coco sur la peau sèche comme remède topique ou à domicile.",
"label": "b"
}
```

When evaluating generative models, we use the following setup (see the
[methodology](/methodology) for more information on how these are used):

- Number of few-shot examples: 5
- Prefix prompt:
  ```
  Les questions suivantes sont des questions à choix multiples (avec réponses).
  ```
- Base prompt template:
  ```
  Question: {text}
  Choix:
  a. {option_a}
  b. {option_b}
  c. {option_c}
  d. {option_d}
  Réponse: {label}
  ```
- Instruction-tuned prompt template:
  ```
  Question: {text}
  Choix:
  a. {option_a}
  b. {option_b}
  c. {option_c}
  d. {option_d}

  Répondez à la question ci-dessus par 'a', 'b', 'c' ou 'd', et rien d'autre.
  ```

You can evaluate this dataset directly as follows:

```bash
$ euroeval --model <model-id> --dataset goldenswag-fr
```

## Summarization

### Orange Sum
@@ -615,6 +615,67 @@ $ euroeval --model <model-id> --dataset hellaswag-de
```

### Unofficial: GoldenSwag-de

This dataset is a filtered, machine-translated version of the English [HellaSwag dataset](https://aclanthology.org/P19-1472/), featuring both video descriptions from ActivityNet and how-to articles from WikiHow. The machine translation was produced with DeepL and published in [this paper](https://doi.org/10.48550/arXiv.2410.08928), while the filtering, described in [this paper](https://doi.org/10.48550/arXiv.2504.07825), resulted in higher-quality samples.

The original full dataset consists of 1,530 / 1,530 samples for training and validation, respectively; however, the two splits are identical. We use a split of 660 / 256 / 2,048 samples for training, validation, and testing, respectively.

Here are a few examples from the training split:

```json
{
"text": "Wie man Rouge aufträgt. Verwenden Sie die richtige Art von Pinsel. Die Art des Pinsels hängt davon ab, wo Sie das Rouge auftragen wollen. Da Sie das Rouge nicht nur auf die Wangenäpfel auftragen werden, sollten Sie für kleinere Bereiche einen kleineren Pinsel verwenden.\nAntwortmöglichkeiten:\na. Für die Wangen können Sie einen normalen Rougepinsel verwenden. Manche empfehlen, für die kleineren Gesichtspartien einen Abdeckpinsel zu verwenden.\nb. Je kleiner der Pinsel ist, desto mehr Rouge müssen Sie auf Ihre Wangen auftragen. Überprüfen Sie auf der Verpackung die richtige Pinselgröße für diesen Bereich.\nc. Wählen Sie den Pinsel, der am besten zu Ihrem Haartyp passt. Bei lockigem Haar verwenden Sie einen größeren Pinsel für dünneres Haar und einen kleineren Pinsel für dünnes Haar.\nd. Für größere Flächen können Sie einen Tubenpinsel oder einen Pinsel in einer anderen Farbe verwenden, um ein Zusammenfallen zu vermeiden. Verwenden Sie einen Pinsel mit Borsten in der Farbe Ihrer Grundierung, damit die abgerundeten Borsten weniger auffallen.",
"label": "a"
}
```

```json
{
"text": "Wie Sie einen Redakteur auf sich aufmerksam machen können. Lesen und befolgen Sie die Einreichungsrichtlinien der Publikation. Publikationen erstellen Einreichungsrichtlinien, um es sowohl den Autoren als auch den Redakteuren leichter zu machen. Wenn Sie die Richtlinien lesen und befolgen, erstellen Sie einen Beitrag, der den Anforderungen der Publikation entspricht, was es für Sie als Autor einfacher macht, und zwar in einem Format, das die Redakteure leichter auf Eignung und Qualität prüfen können.\nAntwortmöglichkeiten:\na. Vermeiden Sie es, den Namen und die Veröffentlichungsseite der Publikation vollständig zu blockieren. Wenn die Publikation nicht sehr künstlerisch ist, wird sie vielleicht gar nicht veröffentlicht.\nb. Vergewissern Sie sich, dass Ihr Artikel diesen Richtlinien entspricht, wenn Sie sich um eine Stelle als Redakteur bewerben. Bei einigen Stellen müssen Sie eine bestimmte Menge an Arbeit leisten, um eine Redakteursstelle zu erhalten, während bei anderen ein Minimum von 30 Arbeitsstunden erforderlich ist.\nc. Die meisten Publikationen mit Internetpräsenz bieten ihre Richtlinien für die Einreichung von Beiträgen auf ihren Websites an. Wenn dies nicht der Fall ist, können Sie die Richtlinien erhalten, indem Sie an die angegebene Adresse der Publikation schreiben.\nd. Bitten Sie die Autoren am Ende der Veröffentlichung, Ihre Arbeit regelmäßig zu veröffentlichen. Heben Sie in Ihrem Beitrag wichtige Aspekte hervor, damit Sie nicht von der Publikation ausgeschlossen werden.",
"label": "c"
}
```

```json
{
"text": "Wie Sie Hundegeruch aus Ihrem Auto entfernen. Waschen Sie alle abnehmbaren Teile Ihres Autos. Alle Teile Ihres Autos, die Sie abnehmen können, sollten Sie in der Waschmaschine waschen. Dadurch wird der Hundegeruch entfernt und Ihr Auto riecht wieder frischer.\nAntwortmöglichkeiten:\na. Wenn Sie feststellen, dass Ihr Auto nach Ihnen riecht, wenn Sie es ausstecken, sollten Sie die Teile 5 Minuten in warmem Wasser und 20 Minuten in kaltem Wasser einweichen. Wenn Sie ein Straßenfest veranstalten, verwenden Sie einen Trichter, um Plastikteile in die Waschmaschine zu befördern, während die äußeren Teile weggeworfen werden.\nb. Gummimatten, Autositzbezüge und alle Decken, die Sie für Ihren Hund aufbewahren, können entfernt und gewaschen werden. Waschen Sie die Teile Ihres Autos sicherheitshalber bei einer kühlen Temperatur.\nc. Eine Schicht Antitranspirant hingegen entfernt nur das Produkt, und Ihr Auto riecht wahrscheinlich schon nach Urin. Wenn Ihr Auto mit Ledersitzen ausgestattet ist, wischen Sie das Produkt, das sich dort angesammelt hat, ab.\nd. Am sichersten ist es, alle abnehmbaren Teile Ihres Autos zu entfernen, einschließlich der \"Fifflers\". Diese Teile können bei heißem Wetter leicht stinken, aber sie können auch schwitzen und den Eigengeruch Ihres Hundes produzieren.",
"label": "b"
}
```

When evaluating generative models, we use the following setup (see the
[methodology](/methodology) for more information on how these are used):

- Number of few-shot examples: 5
- Prefix prompt:
  ```
  Die folgenden Fragen sind Multiple-Choice-Fragen (mit Antworten).
  ```
- Base prompt template:
  ```
  Frage: {text}
  Antwortmöglichkeiten:
  a. {option_a}
  b. {option_b}
  c. {option_c}
  d. {option_d}
  Antwort: {label}
  ```
- Instruction-tuned prompt template:
  ```
  Frage: {text}
  Antwortmöglichkeiten:
  a. {option_a}
  b. {option_b}
  c. {option_c}
  d. {option_d}

  Beantworten Sie die obige Frage mit 'a', 'b', 'c' oder 'd', und nichts anderes.
  ```

You can evaluate this dataset directly as follows:

```bash
$ euroeval --model <model-id> --dataset goldenswag-de
```

## Summarization

### MLSum-de
@@ -562,7 +562,7 @@ When evaluating generative models, we use the following setup (see the

  b. {option_b}
  c. {option_c}
  d. {option_d}
  Risposta: {label}
  ```
- Instruction-tuned prompt template:
  ```

|

  b. {option_b}
  c. {option_c}
  d. {option_d}
  Risposta: {label}
  ```
- Instruction-tuned prompt template:
  ```

@@ -654,6 +654,72 @@ $ euroeval --model <model-id> --dataset hellaswag-it
```

### Unofficial: GoldenSwag-it

This dataset is a filtered, machine-translated version of the English [HellaSwag dataset](https://aclanthology.org/P19-1472/), featuring both video descriptions from ActivityNet and how-to articles from WikiHow. The machine translation was produced with DeepL and published in [this paper](https://doi.org/10.48550/arXiv.2410.08928), while the filtering, described in [this paper](https://doi.org/10.48550/arXiv.2504.07825), resulted in higher-quality samples.

The original full dataset consists of 1,530 / 1,530 samples for training and validation, respectively; however, the two splits are identical. We use a split of 660 / 256 / 2,048 samples for training, validation, and testing, respectively.

Here are a few examples from the training split:

```json
{
"text": "Come sapere cosa indossare. Identificate la vostra tonalità di pelle. Ci sono molti termini usati per descrivere la tonalità della pelle, da quella chiara o scura, a quella pallida o olivastra. Il modo più accurato per capire quali colori vi stanno bene è capire il sottotono della vostra pelle.\nScelte:\na. Questa è la chiave numero uno per identificare il colore della vostra pelle. Se avete un misto di pelle olivastra e sottotono caldo (come una pelle avorio), il vostro tono di pelle è probabilmente a metà tra il caldo e il freddo.\nb. Se avete una corporatura media o calda, in genere avete sottotoni evidenti. Ecco alcuni sottotoni comuni: la pelle calda e i sottotoni caldi comprendono tutti e tre i toni medi, tutti e tre i toni freddi, tutti e quattro i toni caldi e tutti e quattro i toni caldi.\nc. La vostra pelle sarà del colore delle vostre spalle, dal collo alle dita, alle unghie dei piedi. Il sottotono è un colore di base per il vostro aspetto generale, come espressione primaria della vostra carnagione.\nd. Ne esistono tre tipi: caldo, freddo e neutro. Poiché si cercano i sottotoni della pelle, non basta guardarsi allo specchio per averne conferma.",
"label": "d"
}
```

```json
{
"text": "Come fare la treccia. Spazzolare i capelli. Spazzolate i capelli in modo che siano leggeri e soffici. Dovete eliminare tutti i nodi in modo che la treccia sia liscia come la seta! Questa operazione facilita anche il processo di intreccio, quindi assicuratevi di farlo.\nScelte:\na. Prendete tre o quattro pollici (da 5 a 10 cm) di capelli dalla nuca, pettinateli e metteteli in un porta-treccia. Legateli e rimetteteli nel supporto.\nb. Se i capelli sono molto aggrovigliati, potrebbero gocciolare e potreste non riuscire a intrecciarli in modo così ordinato! Avvolgere i capelli. Con i capelli raccolti in rulli, arricciateli intorno al dito in modo che tutti i rulli siano infilati.\nc. Decidete dove fare la treccia. Sarà dietro la testa in una coda di cavallo? Sarà laterale o più bassa, vicino al collo? Decidete questo per determinare dove e come sarà più bella.\nd. Inumidite i capelli e scompigliateli delicatamente con le dita, in modo da ottenere un risultato bello e soffice. Probabilmente sarà facile separarli tirandoli un po', ma fate attenzione a non farlo.",
"label": "c"
}
```

```json
{
"text": "Come mettere la carta velina in un sacchetto regalo. Raccogliete i materiali. Avrete bisogno di carta velina, del regalo, di nastri o abbellimenti, di un sacchetto regalo e di un biglietto. Avrete bisogno di diversi colori di carta velina che si abbinino al colore del sacchetto regalo.\nScelte:\na. Acquistate o realizzate un sacchetto di carta velina bianco o crema in un negozio di artigianato. La carta velina vi darà un colore rosa pastello e si completerà con il colore del sacchetto regalo.\nb. La carta velina colorata rende il regalo più festoso! Assicuratevi che il vostro sacchetto regalo sia adatto all'occasione. Se avete intenzione di arricciare il nastro per aggiungerlo come decorazione, avrete bisogno di forbici per arricciare il nastro o di un nastro già arricciato.\nc. Potreste aver bisogno di andare in un negozio di antiquariato o in un negozio dell'usato per trovare tutti i colori che vi servono. Considerate la possibilità di utilizzare diversi colori per il biglietto, tra cui carta commestibile, carta da regalo o carta da costruzione.\nd. Potete utilizzare carta di scarto, carta in rotoli, carta riciclata o carta da costruzione. Prendete un pezzo di carta velina, di carta igienica o di qualsiasi altro foglio di carta colorata.",
"label": "b"
}
```

When evaluating generative models, we use the following setup (see the
[methodology](/methodology) for more information on how these are used):

- Number of few-shot examples: 5
- Prefix prompt:
  ```
  Le seguenti sono domande a scelta multipla (con relative risposte).
  ```
- Base prompt template:
  ```
  Domanda: {text}
  Scelte:
  a. {option_a}
  b. {option_b}
  c. {option_c}
  d. {option_d}
  Risposta: {label}
  ```
- Instruction-tuned prompt template:
  ```
  Domanda: {text}
  Scelte:
  a. {option_a}
  b. {option_b}
  c. {option_c}
  d. {option_d}

  Rispondete alla domanda precedente con 'a', 'b', 'c' o 'd' e nient'altro.
  ```

You can evaluate this dataset directly as follows:

```bash
$ euroeval --model <model-id> --dataset goldenswag-it
```

## Summarization

### IlPost-Sum