EuroEval 15.4.1__tar.gz → 15.5.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release: this version of EuroEval might be problematic.
- {euroeval-15.4.1 → euroeval-15.5.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +2 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/.github/ISSUE_TEMPLATE/bug.yaml +17 -2
- {euroeval-15.4.1 → euroeval-15.5.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +1 -11
- {euroeval-15.4.1 → euroeval-15.5.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +21 -16
- {euroeval-15.4.1 → euroeval-15.5.0}/.github/workflows/ci.yaml +2 -2
- {euroeval-15.4.1 → euroeval-15.5.0}/.gitignore +4 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/.pre-commit-config.yaml +1 -1
- {euroeval-15.4.1 → euroeval-15.5.0}/CHANGELOG.md +95 -11
- {euroeval-15.4.1 → euroeval-15.5.0}/PKG-INFO +6 -4
- {euroeval-15.4.1 → euroeval-15.5.0}/README.md +1 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/danish.md +8 -7
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/dutch.md +1 -1
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/english.md +1 -1
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/faroese.md +4 -4
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/french.md +2 -2
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/icelandic.md +17 -13
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/italian.md +5 -6
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/norwegian.md +18 -9
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/spanish.md +1 -1
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/swedish.md +4 -5
- {euroeval-15.4.1 → euroeval-15.5.0}/makefile +1 -2
- {euroeval-15.4.1 → euroeval-15.5.0}/pyproject.toml +7 -6
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/__init__.py +2 -2
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/benchmark_modules/hf.py +79 -39
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/benchmark_modules/litellm.py +204 -74
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/benchmark_modules/vllm.py +106 -42
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/benchmarker.py +35 -6
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/constants.py +11 -1
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/data_models.py +6 -2
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/dataset_configs.py +6 -6
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/task_utils/sequence_classification.py +70 -30
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/types.py +3 -3
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/utils.py +131 -32
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_mlsum_de.py +1 -1
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_mlsum_es.py +1 -1
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/conftest.py +12 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_benchmarker.py +29 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_constants.py +1 -1
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_data_models.py +4 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_utils.py +0 -11
- {euroeval-15.4.1 → euroeval-15.5.0}/uv.lock +981 -889
- {euroeval-15.4.1 → euroeval-15.5.0}/CITATION.cff +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/CODE_OF_CONDUCT.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/CONTRIBUTING.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/Dockerfile.cuda +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/LICENSE +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/CNAME +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/README.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/README.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/german.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/extras/radial_plotter.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/faq.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/gfx/favicon.png +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Monolingual/danish.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Monolingual/dutch.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Monolingual/english.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Monolingual/faroese.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Monolingual/french.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Monolingual/german.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Monolingual/icelandic.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Monolingual/italian.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Monolingual/norwegian.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Monolingual/swedish.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Multilingual/european.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Multilingual/germanic.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Multilingual/romance.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/README.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/methodology.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/python-package.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/tasks/README.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/tasks/common-sense-reasoning.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/tasks/knowledge.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/tasks/linguistic-acceptability.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/tasks/named-entity-recognition.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/tasks/reading-comprehension.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/tasks/sentiment-classification.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/tasks/speed.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/docs/tasks/summarization.md +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/gfx/euroeval.png +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/gfx/euroeval.xcf +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/gfx/scandeval.png +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/mkdocs.yaml +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/benchmark_config_factory.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/benchmark_modules/__init__.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/benchmark_modules/base.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/benchmark_modules/fresh.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/callbacks.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/cli.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/data_loading.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/enums.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/exceptions.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/finetuning.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/generation.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/human_evaluation.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/languages.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/model_cache.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/model_config.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/model_loading.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/scores.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/speed_benchmark.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/task_utils/__init__.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/task_utils/multiple_choice_classification.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/task_utils/question_answering.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/task_utils/text_to_text.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/task_utils/token_classification.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/tasks.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/constants.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_allocine.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_angry_tweets.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_arc.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_arc_is.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_belebele.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_cnn_dailymail.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_conll_en.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_conll_es.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_conll_nl.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_dane.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_danish_citizen_tests.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_dansk.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_danske_talemaader.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_danske_talemaader_old.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_dbrd.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_dutch_cola.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_dutch_social.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_eltec.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_fone.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_foqa.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_fosent.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_fquad.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_germanquad.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_germeval.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_hellaswag.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_ice_linguistic.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_icelandic_error_corpus.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_icelandic_knowledge.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_icelandic_qa.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_icesum.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_ilpost_sum.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_jentoft.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_mim_gold_ner.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_mlqa_es.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_mmlu.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_multinerd-it.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_no_cola.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_no_sammendrag.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_nor_common_sense_qa.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_nordjylland_news.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_norec.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_norglm_multiqa.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_norglm_multisum.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_norne.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_norquad.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_nqii.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_nrk_quiz_qa.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_orange_sum.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_personal_sum.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_rrn.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_sb10k.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_scala.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_scandiqa.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_schibsted.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_sentiment_headlines_es.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_sentipolc16.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_squad.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_squad_it.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_squad_nl.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_squad_nl_old.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_sst5.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_suc3.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_swedn.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_swerec.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_wiki_lingua_nl.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_wikiann_fo.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_wikineural-it.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_winogrande_is.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_xquad_es.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/fix_dot_env_file.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/load_ud_pos.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/versioning.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/__init__.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_benchmark_config_factory.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_benchmark_modules/__init__.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_benchmark_modules/test_base.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_benchmark_modules/test_fresh.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_benchmark_modules/test_hf.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_benchmark_modules/test_litellm.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_benchmark_modules/test_vllm.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_callbacks.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_cli.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_data_loading.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_dataset_configs.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_enums.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_exceptions.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_finetuning.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_generation.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_human_evaluation.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_languages.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_model_cache.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_model_config.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_model_loading.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_scores.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_speed_benchmark.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_task_utils/__init__.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_task_utils/test_question_answering.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_task_utils/test_sequence_classification.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_task_utils/test_text_to_text.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_task_utils/test_token_classification.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_tasks.py +0 -0
- {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_types.py +0 -0
@@ -2,6 +2,7 @@ name: 📚 Benchmark Dataset Request
 description: Do you think a particular benchmark dataset is missing in EuroEval?
 title: "[BENCHMARK DATASET REQUEST] <dataset-name>"
 labels: "benchmark dataset request"
+type: task
 
 body:
   - type: input

@@ -30,6 +31,7 @@ body:
         - label: Icelandic
         - label: Italian
         - label: Norwegian (Bokmål or Nynorsk)
+        - label: Spanish
         - label: Swedish
     validations:
       required: true

@@ -1,7 +1,7 @@
 name: 🐛 Bug Report
 description: Have you experienced a bug using the `euroeval` package?
 title: "[BUG] <name-of-bug>"
-
+type: bug
 
 body:
   - type: markdown

@@ -46,8 +46,9 @@ body:
         - 3.10.x
         - 3.11.x
         - 3.12.x
+        - 3.13.x
         - Older than 3.10.x
-        - Newer than 3.
+        - Newer than 3.13.x
     validations:
       required: true
   - type: input

@@ -57,6 +58,20 @@ body:
       placeholder: Output of `pip list | grep EuroEval`
     validations:
       required: true
+  - type: input
+    attributes:
+      label: Transformers version
+      description: What version of 🤗 transformers are you using?
+      placeholder: Output of `pip list | grep transformers`
+    validations:
+      required: true
+  - type: input
+    attributes:
+      label: vLLM version
+      description: What version of vLLM are you using?
+      placeholder: Output of `pip list | grep vllm`
+    validations:
+      required: true
   - type: markdown
     attributes:
       value: >

@@ -1,7 +1,7 @@
 name: 🚀 Feature Request
 description: Is the EuroEval benchmark missing a feature?
 title: "[FEATURE REQUEST] <name-of-feature>"
-
+type: feature
 
 body:
   - type: textarea

@@ -11,16 +11,6 @@ body:
       A clear and concise description of the feature proposal. Please outline the motivation for the proposal. Is your feature request related to a specific problem? e.g., *"I'm working on X and would like Y to be possible"*.
     validations:
       required: true
-  - type: textarea
-    attributes:
-      label: Alternatives
-      description: >
-        A description of any alternative solutions or features you've considered, if any.
-  - type: textarea
-    attributes:
-      label: Additional context
-      description: >
-        Add any other context or screenshots about the feature request.
   - type: markdown
     attributes:
       value: >

@@ -2,12 +2,25 @@ name: 📊 Model Evaluation Request
 description: Would you like to have a particular model included in the leaderboards?
 title: "[MODEL EVALUATION REQUEST] <model-name>"
 labels: "model evaluation request"
+type: task
 
 body:
   - type: input
     attributes:
       label: Model ID
-      description: What is the Hugging Face
+      description: What is the model ID, either on the Hugging Face Hub or on LiteLLM?
+    validations:
+      required: true
+  - type: checkboxes
+    attributes:
+      label: Evaluation languages
+      description: >
+        What languages should this model be evaluated on? Tick all that apply. If the
+        model is multilingual (e.g., Mistral, Llama), then tick all the languages.
+      options:
+        - label: Romance languages (French, Italian, Spanish)
+        - label: Scandinavian languages (Danish, Faroese, Icelandic, Norwegian, Swedish)
+        - label: West Germanic languages (Dutch, English, German)
     validations:
       required: true
   - type: dropdown

@@ -20,23 +33,14 @@ body:
         - Sequence-to-sequence model (e.g., T5)
     validations:
       required: true
-  - type:
+  - type: dropdown
     attributes:
-      label:
-      description:
-        What languages should this model be evaluated on? Tick all that apply. If the
-        model is multilingual (e.g., Mistral, Llama), then tick all the languages.
+      label: Model size
+      description: What is the size of the model?
       options:
-        -
-        -
-        -
-        - label: Faroese
-        - label: French
-        - label: German
-        - label: Icelandic
-        - label: Italian
-        - label: Norwegian (Bokmål or Nynorsk)
-        - label: Swedish
+        - Small (<=8B parameters)
+        - Large (>8B parameters)
+        - N/A
     validations:
       required: true
   - type: dropdown

@@ -46,6 +50,7 @@ body:
       options:
         - Not a merged model
        - Merged model
+       - N/A
     validations:
       required: true
   - type: markdown

@@ -43,7 +43,6 @@ jobs:
       - name: Install uv and set up Python
         uses: astral-sh/setup-uv@v4
         with:
-          enable-cache: true
           python-version: ${{ matrix.python-version }}
 
       - name: Install Dependencies

@@ -75,7 +74,6 @@ jobs:
       - name: Install uv and set up Python
         uses: astral-sh/setup-uv@v4
         with:
-          enable-cache: true
           python-version: ${{ matrix.python-version }}
 
       - name: Install Dependencies

@@ -91,6 +89,8 @@ jobs:
           HF_TOKEN: ${{ secrets.HUGGINGFACE_API_KEY }}
           OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
           ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
+          XAI_API_KEY: ${{ secrets.XAI_API_KEY }}
 
       - name: Delete EuroEval cache
         run: rm -rf .euroeval_cache
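The CI hunk above wires two new secrets, `GEMINI_API_KEY` and `XAI_API_KEY`, into the test environment alongside the existing OpenAI and Anthropic keys, matching the new Gemini and Grok support in this release. A hedged local equivalent, assuming EuroEval reads the same environment variables outside CI; the model and dataset IDs are illustrative placeholders, not taken from this diff:

```
# Assumption: the same variables used in the CI workflow above are read from the
# local environment when evaluating API-hosted models.
$ export GEMINI_API_KEY=...   # for Google Gemini models
$ export XAI_API_KEY=...      # for xAI Grok models
$ euroeval --model gemini/gemini-2.0-flash --dataset angry-tweets
```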
@@ -10,6 +10,91 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 
 
 
+## [v15.5.0] - 2025-04-07
+### Added
+- Now allows supplying a parameter to API models, which is done by using
+  `<model-id>@<parameter>` as the model ID (only a single parameter is supported). The
+  parameters allowed are "low" and "high" for OpenAI models (which is the reasoning
+  effort of the model, supported by the o1- and o3-series, default is "medium"), and
+  "thinking" for Anthropic models, to enable thinking mode (supported for
+  Claude-Sonnet-3.7+). These will appear in the leaderboards as
+  `<model-id>@<parameter>`.
+- Added metadata for Google Gemini and xAI Grok models.
+- Allows all vLLM versions from v0.8.0 again, as the issue with the generation output
+  has been resolved.
+- Added overall progress indicator during evaluation. This was contributed by
+  [@mathiasesn](https://github.com/mathiasesn) ✨
+
+### Changed
+- Now does not use logprobs in text classification tasks with Google VertexAI models, as
+  they heavily rate limit logprobs usage. This shouldn't affect the scores significantly
+  in any case, as the models are very confident in their predictions.
+- Updated `litellm` to `>=1.63.0`, allowing better support for reasoning models.
+
+### Fixed
+- The Gemini-2.5-pro model uses different error messages than the other Gemini models,
+  which caused an error when evaluating it. This has been fixed now.
+- Now registers the Gemini-2.5-pro model series as reasoning models, as otherwise they
+  did not generate any text as they were just generating reasoning tokens.
+- Previously, if there were multiple labels whose first tokens were identical and that
+  the (generative) model did not output the label as the first output token, we would
+  randomly choose one of the labels, resulting in an evaluation error. This is very
+  rare, but *does* happen for very particular (model, dataset) pairs. If we are in this
+  case, we now resort to choosing the label with closest word edit distance instead of
+  relying on logprobs of the first token.
+- Now defaults to BF16 if the model is registered as using FP32, assuming that BF16 is
+  supported by the GPU.
+- Improved model existence pipeline for Ollama model IDs with multiple forward slashes
+  in the name, which caused some models to not be detected as existing.
+
+
+## [v15.4.2] - 2025-03-31
+### Added
+- Now added version metadata to results, to easier track which versions of the various
+  dependencies were used when evaluating a model. This currently includes
+  `transformers`, `torch`, `vllm` and `outlines`.
+
+### Changed
+- Changed the name of the German 'mlsum' summarisation dataset to 'mlsum-de', to reflect
+  that it is the German version of the dataset, and to avoid confusion with the Spanish
+  'mlsum-es' dataset.
+
+### Fixed
+- Now uses `fp16` instead of `bf16` when evaluating decoder models on GPUs with CUDA
+  compatibility < 8.0. This was contributed by
+  [@marksverdhei](https://github.com/marksverdhei) ✨
+- Corrected the name of the French sentiment dataset AlloCiné. This was contributed by
+  [@Alkarex](https://github.com/Alkarex) ✨
+- Evaluating a specific model revision did not work for adapter models, as there was a
+  confusion between the revision of the adapter and the revision of the base model. We
+  now use the revision for the adapter and use the latest revision for the base model.
+- In the (very unlikely) scenario that the model's tokeniser has the same first token
+  for two different labels in a text classification task, we now also use the second
+  token to ensure that we determine the correct label. If this is not possible, then we
+  warn the user.
+- Now catches `TypeError` when trying to generate with vLLM, and retries 3 times before
+  giving up on evaluating the dataset.
+- A bug in `transformers` caused models with the `image-text-to-text` pipeline tag to
+  not be detected as generative models. This has been patched now, and will be fixed
+  properly when [this transformers
+  PR](https://github.com/huggingface/transformers/pull/37107) has been merged.
+- Force `vllm` v0.8.0 for now, as the severe degradation in generation output of some
+  models has not been resolved in versions v0.8.2 and v0.8.3.
+- Only accepts the local labels for text classification tasks when evaluating decoder
+  models now, where we before accepted both the local and English labels. The reason is
+  that this caused a confusion mat times when there was a unique local label starting
+  with a particular letter, but a different English label starting with the same letter,
+  causing some models to be evaluated on the wrong label.
+- When fetching the model information from the Hugging Face API we now attempt 3 times,
+  as the API sometimes fails. If it still fails after 3 attempts, we raise the
+  `HuggingFaceHubDown` exception.
+- Now uses `fp16` instead of `bf16` when evaluating decoder models on GPUs with CUDA
+  compatibility < 8.0. This was contributed by
+  [@marksverdhei](https://github.com/marksverdhei) ✨
+- Fixed docs for ScandiQA-da and ScandiQA-sv, where it was incorrectly stated that
+  the splits were made by considering the original train/validation/test splits.
+
+
 ## [v15.4.1] - 2025-03-25
 ### Fixed
 - Disallow `vllm` v0.8.1, as it causes severe degradation in generation output of
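The `<model-id>@<parameter>` syntax introduced in v15.5.0 above is passed straight to the existing CLI. A minimal sketch of how that would look; the model IDs and the `angry-tweets` dataset are illustrative placeholders rather than commands taken from this diff:

```
# Hedged examples of the v15.5.0 parameter syntax; model IDs are placeholders.
# Reasoning effort for an OpenAI o-series model (allowed values: "low", "high"):
$ euroeval --model o3-mini@high --dataset angry-tweets

# Thinking mode for an Anthropic Claude-Sonnet-3.7+ model:
$ euroeval --model claude-3-7-sonnet@thinking --dataset angry-tweets
```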
@@ -73,18 +158,17 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 ## [v15.3.0] - 2025-03-12
 ### Added
 - Added support for evaluating Italian 🇮🇹! This includes the reading comprehension
-  dataset [SQuAD-it](https://hf.co/datasets/crux82/squad_it), the summarization
-
-
-
-
-
-  dataset ScaLA with the [Italian Universal Dependencies
+  dataset [SQuAD-it](https://hf.co/datasets/crux82/squad_it), the summarization dataset
+  [IlPost](https://hf.co/datasets/ARTeLab/ilpost), the sentiment classification
+  [Sentipolc-16](https://hf.co/datasets/cardiffnlp/tweet_sentiment_multilingual), the
+  common-sense reasoning dataset
+  [HellaSwag-it](https://hf.co/datasets/alexandrainst/m_hellaswag), the linguistic
+  acceptability dataset ScaLA with the [Italian Universal Dependencies
   treebank](https://github.com/UniversalDependencies/UD_Italian-ISDT), the knowledge
   dataset [MMLU-it](https://hf.co/datasets/alexandrainst/m_mmlu), and the named entity
-  recognition dataset [MultiNERD
-  IT](https://hf.co/datasets/Babelscape/
-
+  recognition dataset [MultiNERD IT](https://hf.co/datasets/Babelscape/multinerd) (and
+  unofficially [WikiNEuRal IT](https://hf.co/datasets/Babelscape/wikineural)). This was
+  contributed by [@viggo-gascou](https://github.com/viggo-gascou) ✨
 - Added the new Norwegian knowledge dataset NRK-Quiz-QA, consisting of quizzes on the
   Norwegian language and culture, in both Bokmål and Nynorsk. The dataset has been split
   into 635 / 256 / 2,048 samples for train, val, and test, respectively. This replaces

@@ -211,7 +295,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 
 ### Added
 - Added support for French! 🇫🇷This includes the sentiment classification dataset
-  [
+  [AlloCiné](https://hf.co/datasets/tblard/allocine), the linguistic acceptability
   dataset ScaLA with the [French Universal
   Dependencies](https://github.com/UniversalDependencies/UD_French-GSD), the reading
   comprehension dataset [FQuAD](https://hf.co/datasets/illuin/fquad) (and unofficially
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: EuroEval
-Version: 15.
+Version: 15.5.0
 Summary: The robust European language model benchmark.
 Project-URL: Repository, https://github.com/EuroEval/EuroEval
 Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues

@@ -37,11 +37,12 @@ Requires-Dist: demjson3>=3.0.6
 Requires-Dist: evaluate>=0.4.1
 Requires-Dist: huggingface-hub>=0.24.0
 Requires-Dist: levenshtein>=0.24.0
-Requires-Dist: litellm>=1.
+Requires-Dist: litellm>=1.63.0
 Requires-Dist: more-itertools>=10.5.0
 Requires-Dist: numpy<2.0.0,>=1.23.0
 Requires-Dist: ollama>=0.4.7
 Requires-Dist: pandas>=2.2.0
+Requires-Dist: peft>=0.15.0
 Requires-Dist: protobuf~=3.20.0
 Requires-Dist: pydantic>=2.6.0
 Requires-Dist: pyinfer>=0.0.3

@@ -61,12 +62,12 @@ Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == '
 Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
 Requires-Dist: gradio>=4.26.0; extra == 'all'
 Requires-Dist: outlines>=0.1.11; extra == 'all'
-Requires-Dist: vllm
+Requires-Dist: vllm>=0.8.0; (platform_system == 'Linux') and extra == 'all'
 Provides-Extra: generative
 Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
 Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
 Requires-Dist: outlines>=0.1.11; extra == 'generative'
-Requires-Dist: vllm
+Requires-Dist: vllm>=0.8.0; (platform_system == 'Linux') and extra == 'generative'
 Provides-Extra: human-evaluation
 Requires-Dist: gradio>=4.26.0; extra == 'human-evaluation'
 Provides-Extra: test
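The metadata hunks above bump `litellm` to `>=1.63.0`, add `peft>=0.15.0`, and re-open vLLM to `>=0.8.0` for the `all` and `generative` extras on Linux. A hedged install sketch using standard pip extras syntax; the exact command is not part of this diff:

```
# Illustrative only: installing 15.5.0 with the 'generative' extra, which on Linux
# resolves vllm>=0.8.0 and litellm>=1.63.0 per the metadata above.
$ pip install "euroeval[generative]==15.5.0"
```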
@@ -217,6 +218,7 @@ Replace <name-of-script> with the specific script you wish to execute, e.g.,
 $ uv run src/scripts/create_allocine.py
 ```
 
+
 ## Special Thanks :pray:
 - Thanks [@Mikeriess](https://github.com/Mikeriess) for evaluating many of the larger
   models on the leaderboards.

@@ -142,6 +142,7 @@ Replace <name-of-script> with the specific script you wish to execute, e.g.,
 $ uv run src/scripts/create_allocine.py
 ```
 
+
 ## Special Thanks :pray:
 - Thanks [@Mikeriess](https://github.com/Mikeriess) for evaluating many of the larger
   models on the leaderboards.
@@ -285,11 +285,10 @@ the translated contexts still contained the answer to the question, potentially
 changing the answers slightly.
 
 The original full dataset consists of 6,810 / 500 / 500 samples for training,
-validation and testing, respectively
-
-
-
-was sampled from the original training set.
+validation and testing, respectively (so 3,328 samples used in total).
+We use a 1,024 / 256 / 2,048 split for training, validation and testing, respectively,
+where the splits are made by randomly sampling from the full dataset without considering
+the original train/validation/test splits.
 
 Here are a few examples from the training split:
 

@@ -451,12 +450,14 @@ Here are a few examples from the training split:
 {
   "text": "Hvilket af følgende områder har kommunerne ansvaret for driften af?\nSvarmuligheder:\na. Domstole\nb. Vuggestuer\nc. Sygehuse",
   "label": "b"
-}
+}
+```
 ```json
 {
   "text": "Hvilken organisation blev Danmark medlem af i 1945?\nSvarmuligheder:\na. Verdenshandelsorganisationen (WTO)\nb. Den Europæiske Union (EU)\nc. De Forenede Nationer (FN)",
   "label": "c"
-}
+}
+```
 
 When evaluating generative models, we use the following setup (see the
 [methodology](/methodology) for more information on how these are used):
@@ -133,7 +133,7 @@ $ euroeval --model <model-id> --dataset dbrd
 
 ## Named Entity Recognition
 
-### CoNLL-
+### CoNLL-nl
 
 This dataset was published in [this paper](https://aclanthology.org/W02-2024/) and
 consists of named entity recognition annotations of the Belgian newspaper "De Morgen" of

@@ -81,7 +81,7 @@ $ euroeval --model <model-id> --dataset sst5
 
 ## Named Entity Recognition
 
-### CoNLL-
+### CoNLL-en
 
 This dataset was published in [this paper](https://aclanthology.org/W03-0419/) and was
 part of the CoNNL-2003 shared task. The data comes from the [Reuters

@@ -282,10 +282,10 @@ $ euroeval --model <model-id> --dataset scala-fo
 
 ### FoQA
 
-This dataset
-Wikipedia. The questions and answers were automatically
-which were verified by a native speaker, and some of them
-same native speaker.
+This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2502.07642)
+and is based on the Faroese Wikipedia. The questions and answers were automatically
+generated using GPT-4-turbo, which were verified by a native speaker, and some of them
+were also corrected by the same native speaker.
 
 The original full dataset consists of 2,000 samples, and we split these into 848 / 128 /
 1,024 samples for training, validation and testing, respectively.
@@ -7,11 +7,11 @@ information about what these constitute.
 
 ## Sentiment Classification
 
-###
+### AlloCiné
 
 This dataset was published in [this Github
 repository](https://github.com/TheophileBlard/french-sentiment-analysis-with-bert) and
-features reviews from the French movie review website
+features reviews from the French movie review website [AlloCiné](https://www.allocine.fr/). The reviews range from
 0.5 to 5 (inclusive), with steps of 0.5. The negative samples are reviews with a rating
 of at most 2, and the positive ones are reviews with a rating of at least 4. The reviews
 in between were discarded.

@@ -9,9 +9,9 @@ information about what these constitute.
 
 ### Hotter and Colder Sentiment
 
-This dataset
-Icelandic blog post, annotated with sentiment labels (and
-crowdsourcing platform.
+This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2502.16987),
+and consists of texts from Icelandic blog post, annotated with sentiment labels (and
+many others) via a crowdsourcing platform.
 
 The original full dataset consists of 2,901 samples, and we use a 1,021 / 255 / 1,607
 split for training, validation and testing, respectively (so all samples are used in

@@ -73,13 +73,14 @@ $ euroeval --model <model-id> --dataset hotter-and-colder-sentiment
 
 ### MIM-GOLD-NER
 
-This dataset was published in [this
-
-
-
-
-
-
+This dataset was published in [this
+paper](https://repository.clarin.is/repository/xmlui/handle/20.500.12537/230) and is
+based on the [Tagged Icelandic Corpus (MIM)](https://clarin.is/en/resources/mim/), which
+consists of Icelandic books, news articles, periodicals, parliament speeches, legal
+texts, adjudications and government websites. It has been annotated with named entities
+in a semi-automated fashion, where each labels has been manually verified. The entity
+types in the dataset is a superset of the CoNLL-2003 tags, with the following additional
+labels: `DATE`, `TIME`, `MONEY`, `PERCENT`. These labels have been removed.
 
 The original full dataset consists of 1,000,000 tokens. We use a 1,024 / 256 / 2,048
 split for training, validation and testing, respectively.

@@ -526,17 +527,20 @@ Here are a few examples from the training split:
 {
   "text": "Hver var talinn heilagur maður eftir dauða sinn, er tákngervingur alþýðuhreyfingar vestanlands og talinn góður til áheita?\nSvarmöguleikar:\na. Þórður Jónsson helgi\nb. Guðmundur Arason\nc. Snorri Þorgrímsson\nd. Jón Hreggviðsson",
   "label": "a"
-}
+}
+```
 ```json
 {
   "text": "Í kringum hvaða ár hófst verslun á Arngerðareyri?\nSvarmöguleikar:\na. 1895\nb. 1884\nc. 1870\nd. 1902",
   "label": "b"
-}
+}
+```
 ```json
 {
   "text": "Hvenær var ákveðið að uppstigningardagur skyldi vera kirkjudagur aldraðra á Íslandi?\nSvarmöguleikar:\na. Árið 1975\nb. Árið 1985\nc. Árið 1982\nd. Árið 1990",
   "label": "c"
-}
+}
+```
 
 When evaluating generative models, we use the following setup (see the
 [methodology](/methodology) for more information on how these are used):
@@ -71,11 +71,10 @@ $ euroeval --model <model-id> --dataset sentipolc16
 ### MultiNERD IT
 
 This dataset was published in [this
-paper](https://aclanthology.org/2022.findings-naacl.60/) and
-
-
-(
-(NER4EL)[https://www.github.com/Babelscape/ner4el]. The original test set was created
+paper](https://aclanthology.org/2022.findings-naacl.60/) and consists of sentences from
+Wikipedia and Wikinews in 10 different languages. It is an extension of the combination
+of [WikiNEuRal](https://www.github.com/Babelscape/wikineural) and
+[NER4EL](https://www.github.com/Babelscape/ner4el). The original test set was created
 from manual annotations, while the training set is based on an automatic annotation
 pipeline.
 

@@ -519,7 +518,7 @@ $ euroeval --model <model-id> --dataset hellaswag-it
 
 ## Summarization
 
-### IlPost-
+### IlPost-Sum
 
 This dataset was published in [this paper](https://www.mdpi.com/2078-2489/13/5/228) and
 consists of news articles from [Il Post](https://www.ilpost.it/). The summaries were
@@ -388,17 +388,20 @@ Here are a few examples from the training split:
 {
   "text": "Vi har hatt krig i nesten ti år. Jeg føler meg noen ganger trist fordi jeg har mistet flere venner og min far på grunn av krigen.",
   "label": "correct"
-}
+}
+```
 ```json
 {
   "text": "Hvis jeg ikke sier in n genting, kan han spille hele dagen.",
   "label": "incorrect"
-}
+}
+```
 ```json
 {
   "text": "De føler at samfunnet trenger ikke dem.",
   "label": "incorrect"
-}
+}
+```
 
 When evaluating generative models, we use the following setup (see the
 [methodology](/methodology) for more information on how these are used):

@@ -660,17 +663,20 @@ Here are a few examples from the training split:
 {
   "text": "Gunnar har hatt plutselige og sterke smerteanfall siden han var liten gutt. Det var vondt å tisse og det gjorde vondt i ryggen og magen. Det hjalp litt å drikke vann. Reseptbelagte medisiner kan være nødvendig under anfall.\nSvaralternativer:\na. Nyrestein, kronisk\nb. Irritabel tarmsyndrom\nc. Angst\nd. Urinveisinfeksjon",
   "label": "a"
-}
+}
+```
 ```json
 {
   "text": "80 år gamle Harrison Ford er nok ein gong aktuell i rolla som Indiana Jones. Kva heiter filmen?\nSvaralternativer:\na. Indiana Jones and the Nasty Nazis\nb. Indiana Jones and the Dial of Destiny\nc. Indiana Jones and the Hunt for Power\nd. Indiana Jones Forever",
   "label": "b"
-}
+}
+```
 ```json
 {
   "text": "I 1980 måtte denne bassisten overnatte ni netter i fengsel i Japan fordi han prøvde å få med seg ca. 200 gram marihuana inn i landet. Hvem var det?\nSvaralternativer:\na. Sting\nb. Lemmy Kilmister\nc. Paul McCartney\nd. Bootsy Collins",
   "label": "c"
-}
+}
+```
 
 When evaluating generative models, we use the following setup (see the
 [methodology](/methodology) for more information on how these are used):

@@ -868,17 +874,20 @@ Here are a few examples from the training split:
 {
   "text": "Hvor er det sannsynlig at en fugl lager hjemmet sitt?\nSvaralternativer:\na. I skogen\nb. I et rede\nc. På taket\nd. På blader\ne. I himmelen",
   "label": "a"
-}
+}
+```
 ```json
 {
   "text": "Hvis et hjem har et abonnoment, hva får de sannsyneligvis hver dag i posten?\nSvaralternativer:\na. Delestykker\nb. En avis\nc. En gate\nd. En vaskemaskin\ne. Jordas overflate",
   "label": "b"
-}
+}
+```
 ```json
 {
   "text": "Når du ikke klarer å gjøre noe ferdig, hva feilet du i da?\nSvaralternativer:\na. Å vinne\nb. Å bestå\nc. Å fullfør\nd. Å gjøre det bra\ne. Å lykkes",
   "label": "c"
-}
+}
+```
 
 When evaluating generative models, we use the following setup (see the
 [methodology](/methodology) for more information on how these are used):
@@ -475,7 +475,7 @@ $ euroeval --model <model-id> --dataset hellaswag-es
 
 ## Summarization
 
-### MLSum-es
+### MLSum-es
 
 The dataset was published in [this paper](https://aclanthology.org/2020.emnlp-main.647/) and is obtained from online newspapers.
 

@@ -231,11 +231,10 @@ the translated contexts still contained the answer to the question, potentially
 changing the answers slightly.
 
 The original full dataset consists of 6,810 / 500 / 500 samples for training,
-validation and testing, respectively
-
-
-
-was sampled from the original training set.
+validation and testing, respectively (so 3,328 samples used in total).
+We use a 1,024 / 256 / 2,048 split for training, validation and testing, respectively,
+where the splits are made by randomly sampling from the full dataset without considering
+the original train/validation/test splits.
 
 Here are a few examples from the training split:
 