ScandEval 16.10.1.tar.gz → 16.11.0.tar.gz
- {scandeval-16.10.1 → scandeval-16.11.0}/.pre-commit-config.yaml +4 -4
- {scandeval-16.10.1 → scandeval-16.11.0}/CHANGELOG.md +40 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/CONTRIBUTING.md +1 -1
- {scandeval-16.10.1 → scandeval-16.11.0}/LICENSE +1 -1
- {scandeval-16.10.1 → scandeval-16.11.0}/PKG-INFO +27 -19
- {scandeval-16.10.1 → scandeval-16.11.0}/README.md +25 -17
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/danish.md +78 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/dutch.md +78 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/english.md +78 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/estonian.md +79 -1
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/finnish.md +78 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/french.md +78 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/german.md +101 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/icelandic.md +78 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/italian.md +78 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/norwegian.md +78 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/polish.md +78 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/portuguese.md +87 -9
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/spanish.md +85 -7
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/swedish.md +84 -6
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/README.md +4 -7
- scandeval-16.11.0/docs/tasks/european-values.md +33 -0
- scandeval-16.11.0/docs/tasks/simplification.md +36 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/pyproject.toml +1 -1
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmark_modules/hf.py +14 -1
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmark_modules/litellm.py +111 -22
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmark_modules/vllm.py +111 -56
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmarker.py +13 -6
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/data_models.py +2 -2
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/logging_utils.py +1 -1
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/metrics/huggingface.py +3 -2
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/metrics/llm_as_a_judge.py +79 -15
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/model_loading.py +2 -1
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/task_group_utils/sequence_classification.py +12 -3
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/types.py +39 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/utils.py +29 -4
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/load_ud_pos.py +11 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/uv.lock +1 -1
- scandeval-16.10.1/docs/tasks/simplification.md +0 -42
- {scandeval-16.10.1 → scandeval-16.11.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/.github/ISSUE_TEMPLATE/language_request.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/.github/workflows/ci.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/.gitignore +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/.markdownlint.jsonc +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/CITATION.cff +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/CODE_OF_CONDUCT.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/Dockerfile.cuda +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/NEW_DATASET_GUIDE.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/CNAME +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/README.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/README.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/albanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/bosnian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/bulgarian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/catalan.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/croatian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/czech.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/faroese.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/greek.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/hungarian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/latvian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/lithuanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/romanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/serbian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/slovak.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/slovene.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/ukrainian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/extras/radial_plotter.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/faq.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/gfx/favicon.png +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/albanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/bosnian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/bulgarian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/catalan.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/croatian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/czech.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/danish.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/dutch.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/english.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/estonian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/faroese.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/finnish.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/french.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/german.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/greek.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/hungarian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/icelandic.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/italian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/latvian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/lithuanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/norwegian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/polish.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/portuguese.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/romanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/serbian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/slovak.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/slovene.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/spanish.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/swedish.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/ukrainian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Multilingual/baltic.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Multilingual/european.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Multilingual/finnic.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Multilingual/germanic.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Multilingual/romance.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Multilingual/slavic.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/README.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/methodology.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/python-package.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/common-sense-reasoning.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/knowledge.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/linguistic-acceptability.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/named-entity-recognition.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/reading-comprehension.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/sentiment-classification.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/speed.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/summarization.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/gfx/euroeval.png +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/gfx/euroeval.xcf +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/gfx/scandeval.png +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/makefile +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/mkdocs.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmark_config_factory.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmark_modules/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmark_modules/base.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmark_modules/fresh.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/caching_utils.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/callbacks.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/cli.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/constants.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/data_loading.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/albanian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/bosnian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/bulgarian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/catalan.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/croatian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/czech.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/danish.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/dutch.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/english.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/estonian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/faroese.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/finnish.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/french.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/german.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/greek.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/hungarian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/icelandic.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/italian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/latvian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/lithuanian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/norwegian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/polish.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/portuguese.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/romanian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/serbian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/slovak.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/slovene.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/spanish.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/swedish.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/ukrainian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/enums.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/exceptions.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/finetuning.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/generation.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/generation_utils.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/languages.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/metrics/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/metrics/base.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/metrics/pipeline.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/metrics/speed.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/model_cache.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/model_config.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/classification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/linguistic_acceptability.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/multiple_choice.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/named_entity_recognition.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/reading_comprehension.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/sentiment_classification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/simplification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/summarization.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/token_classification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/scores.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/speed_benchmark.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/task_group_utils/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/task_group_utils/multiple_choice_classification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/task_group_utils/question_answering.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/task_group_utils/text_to_text.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/task_group_utils/token_classification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/tasks.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/tokenisation_utils.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/constants.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_allocine.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_angry_tweets.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_arc.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_arc_is.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_atsiliepimai.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_belebele.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_bg_ner_bsnlp.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_boolq_pt.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_cinexio.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_cnn_dailymail.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_conll_en.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_conll_es.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_conll_nl.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_copa_lv.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_copa_nl.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_cross_domain_uk_reviews.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_cs_gec.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_csfd_sentiment.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_csfd_sentiment_sk.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_czech_news.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_dacsa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_dane.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_danish_citizen_tests.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_dansk.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_danske_talemaader.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_danske_talemaader_old.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_dbrd.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_duidelijke_taal.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_dutch_cola.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_elner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_eltec.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_err_news.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_estner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_estonian_valence.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_european_values.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_exam_et.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_exams_bg.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_fone.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_foqa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_fosent.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_fquad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_fullstack_ner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_germanquad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_germeval.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_global_mmlu.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_goldenswag.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_grammar_et.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_greek_sa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_greek_wikipedia.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_guia_cat.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_harem.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_hellaswag.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_hellaswag_cs.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_hellaswag_fi.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_hun_sum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_husst.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_ice_linguistic.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_icelandic_error_corpus.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_icelandic_knowledge.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_icelandic_qa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_icesum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_idioms_no.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_ilpost_sum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_jentoft.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_kpwr_ner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_latvian_lsm_summary.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_latvian_twitter_sentiment.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_life_in_the_uk.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_lithuanian_lrytas_summarization.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_llmzszl.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_lr_sum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_lt_emotions.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_lt_history.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mim_gold_ner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mlqa_es.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mlsum_de.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mlsum_es.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mmlu.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mmlu_et.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mmlu_hr.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mmlu_lv.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mms.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_multi_wiki_qa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_multinerd-it.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_ner_uk.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_no_cola.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_no_sammendrag.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_nor_common_sense_qa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_nordjylland_news.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_norec.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_norglm_multiqa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_norglm_multisum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_norne.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_norquad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_nqii.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_nrk_quiz_qa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_orange_sum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_personal_sum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_polemo2.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_poner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_poquad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_psc.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_publico.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_ronec.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_rosent.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_rrn.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sb10k.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_scala.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_scandiqa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_scandisent_fi.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_schibsted.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sentiment_headlines_es.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sentinews.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sentipolc16.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_skolprov.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sqad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_squad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_squad_it.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_squad_nl.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_squad_nl_old.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_ssj500k_ner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sst2_pt.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sst5.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_suc3.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sumo_ro.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_swedish_facts.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_swedn.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_swerec.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_szeged_ner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_trivia_et.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_turku_ner_fi.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_tydiqa_fi.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_umimeto_qa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_uner_sk.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_uner_sr.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_wiki_lingua_nl.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_wikiann.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_wikineural-it.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_winogrande.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_winogrande_et.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_winogrande_is.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_xlsum_fi.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_xquad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/fix_dot_env_file.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/versioning.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/conftest.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_benchmark_config_factory.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_benchmark_modules/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_benchmark_modules/test_hf.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_benchmarker.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_callbacks.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_cli.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_constants.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_data_loading.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_data_models.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_dataset_configs.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_enums.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_exceptions.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_finetuning.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_languages.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_model_config.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_model_loading.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scores.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/test_create_scala.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/test_data/de_gsd-ud-train.conllu.adp_det +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/test_data/empty.file +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/test_data/en_gum-ud-train.conllu.case +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_01 +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_02 +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_03 +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_speed_benchmark.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_tokenisation_utils.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_types.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_utils.py +0 -0
--- scandeval-16.10.1/.pre-commit-config.yaml
+++ scandeval-16.11.0/.pre-commit-config.yaml
@@ -8,9 +8,9 @@ repos:
     hooks:
       - id: end-of-file-fixer
       - id: trailing-whitespace
-
+      - id: debug-statements
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.14.
+    rev: v0.14.13
     hooks:
       - id: ruff
         args:
@@ -30,11 +30,11 @@ repos:
           - pyi
           - jupyter
   - repo: https://github.com/kynan/nbstripout
-    rev: 0.
+    rev: 0.9.0
    hooks:
       - id: nbstripout
   - repo: https://github.com/facebook/pyrefly-pre-commit
-    rev: 0.
+    rev: 0.49.0
     hooks:
       - id: pyrefly-check
         name: Pyrefly (type checking)

--- scandeval-16.10.1/CHANGELOG.md
+++ scandeval-16.11.0/CHANGELOG.md
@@ -7,6 +7,46 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 
 ## [Unreleased]
 
+## [v16.11.0] - 2026-01-21
+
+### Added
+
+- Added model metadata for GPT 5.2.
+- Added better support for unofficial inference providers, allowing model prefixes even
+  if they're not in LiteLLM's official list of providers. Currently this only works with
+  the "ordbogen/" prefix for models available on ordbogen.dk.
+
+### Changed
+
+- LLM-as-a-Judge metrics now support batch scoring across multiple judge outputs.
+- When evaluating datasets with no validation split, we now set the `validation_split`
+  in the resulting JSONL file to `null` rather than `True`, to avoid confusion.
+  Likewise, if a task requires zero-shot evaluation, we set `few_shot` to `null` rather
+  than a Boolean value.
+- When evaluating a reasoning model on a sequence classification task, if the model
+  outputs an answer that starts with one of the candidate labels, we now use that label
+  as the predicted label. Previously, we would have conducted a word edit distance
+  search to find the closest candidate label, which was almost always correct, but not
+  in all cases.
+
+### Fixed
+
+- Quantized models in vLLM now have their dtype inferred automatically, removing
+  explicit dtype casting based on GPU compute capability. This was contributed by
+  @tvosch ✨
+- Evaluation of local vLLM models when no internet connection was available did not
+  work correctly; this has been fixed now. This was contributed by @Touzen ✨
+- More robust detection and handling of errors related to too long inputs for vLLM
+  models.
+- Some API models need the `logprobs` argument to be a Boolean rather than an integer.
+  This has been fixed now.
+- Better handling of rate limits when evaluating API models, by backing off more
+  aggressively when hitting rate limits.
+- Now truncates prompts for instruction-following models in a smarter way, by removing
+  few-shot examples one by one until the prompt is short enough, rather than just
+  truncating the prompt to the maximum length. This only affects models whose maximum
+  model length is quite small (roughly 5,000 tokens or less).
+
 ## [v16.10.1] - 2026-01-02
 
 ### Changed

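The smarter truncation described in the `### Fixed` entries above lends itself to a short illustration. Below is a minimal sketch of the idea, not EuroEval's actual implementation: the tokeniser is a stand-in, and all names are made up for the example.

```python
def build_prompt(instruction: str, few_shot_examples: list[str], max_tokens: int) -> str:
    """Drop few-shot examples one by one until the prompt fits the model length."""

    def num_tokens(text: str) -> int:
        # Stand-in for a real tokeniser; roughly four characters per token.
        return len(text) // 4

    examples = list(few_shot_examples)
    while True:
        prompt = "\n\n".join(examples + [instruction])
        if num_tokens(prompt) <= max_tokens or not examples:
            return prompt
        # Prompt is still too long: remove the last few-shot example and retry.
        examples.pop()


# With a tiny budget, both few-shot examples get dropped and only the
# instruction itself remains.
print(build_prompt("Classify: 'Fantastisk film!'", ["Example 1", "Example 2"], max_tokens=10))
```
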
--- scandeval-16.10.1/CONTRIBUTING.md
+++ scandeval-16.11.0/CONTRIBUTING.md
@@ -72,7 +72,7 @@ guide](https://github.com/atom/atom/blob/master/CONTRIBUTING.md#git-commit-messa
 know how to use emoji for commit messages.
 
 Once your changes are ready, don't forget to
-
+self-review to speed up the review process:zap:.
 
 ### Pull Request
 

--- scandeval-16.10.1/PKG-INFO
+++ scandeval-16.11.0/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: ScandEval
-Version: 16.
+Version: 16.11.0
 Summary: The robust European language model benchmark.
 Project-URL: Repository, https://github.com/EuroEval/EuroEval
 Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -8,7 +8,7 @@ Author-email: Dan Saattrup Smart <dan.smart@alexandra.dk>
 Maintainer-email: Dan Saattrup Smart <dan.smart@alexandra.dk>
 License: MIT License
 
-Copyright (c) 2022-
+Copyright (c) 2022-2026 Dan Saattrup Smart
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
@@ -123,16 +123,17 @@ The easiest way to benchmark pretrained models is via the command line interface
 having installed the package, you can benchmark your favorite model like so:
 
 ```bash
-euroeval --model <model-id>
+euroeval --model <model-id-or-path>
 ```
 
-Here `model` is the HuggingFace model ID, which can be found on the [HuggingFace
-Hub](https://huggingface.co/models)
-the
-
+Here `model` is either the HuggingFace model ID, which can be found on the [HuggingFace
+Hub](https://huggingface.co/models), or a local path to a model directory (containing
+the model files as well as the `config.json` file). By default this will benchmark the
+model on all the tasks available. If you want to benchmark on a particular task, then
+use the `--task` argument:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification
+euroeval --model <model-id-or-path> --task sentiment-classification
 ```
 
 We can also narrow down which languages we would like to benchmark on. This can be done
@@ -140,20 +141,20 @@ by setting the `--language` argument. Here we thus benchmark the model on the Da
 sentiment classification task:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification --language da
+euroeval --model <model-id-or-path> --task sentiment-classification --language da
 ```
 
 Multiple models, datasets and/or languages can be specified by just attaching multiple
 arguments. Here is an example with two models:
 
 ```bash
-euroeval --model <model-
+euroeval --model <model-id-or-path-1> --model <model-id-or-path-2>
 ```
 
 The specific model version/revision to use can also be added after the suffix '@':
 
 ```bash
-euroeval --model <model-id>@<commit>
+euroeval --model <model-id-or-path>@<commit>
 ```
 
 This can be a branch name, a tag name, or a commit id. It defaults to 'main' for latest.
@@ -173,7 +174,7 @@ model:
 ```python
 >>> from euroeval import Benchmarker
 >>> benchmarker = Benchmarker()
->>> benchmarker.benchmark(model="<model-id>")
+>>> benchmarker.benchmark(model="<model-id-or-path>")
 ```
 
 To benchmark on a specific task and/or language, you simply specify the `task` or
@@ -181,7 +182,7 @@ To benchmark on a specific task and/or language, you simply specify the `task` o
 
 ```python
 >>> benchmarker.benchmark(
-...     model="<model-id>",
+...     model="<model-id-or-path>",
 ...     task="sentiment-classification",
 ...     language="da",
 ... )
@@ -225,7 +226,7 @@ docker run -e args="<euroeval-arguments>" --gpus 1 --name euroeval --rm euroeval
 ```
 
 Here `<euroeval-arguments>` consists of the arguments added to the `euroeval` CLI
-argument. This could for instance be `--model <model-id> --task
+argument. This could for instance be `--model <model-id-or-path> --task
 sentiment-classification`.
 
 ## Benchmarking custom inference APIs
@@ -291,14 +292,14 @@ script. For example to download the model you want and all of the Danish sentime
 classification datasets:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification --language da --download-only
+euroeval --model <model-id-or-path> --task sentiment-classification --language da --download-only
 ```
 
 Or from a script:
 
 ```python
 >>> benchmarker.benchmark(
-...     model="<model-id>",
+...     model="<model-id-or-path>",
 ...     task="sentiment-classification",
 ...     language="da",
 ...     download_only=True,
@@ -346,7 +347,7 @@ MY_CONFIG = DatasetConfig(
 You can then benchmark your custom dataset by simply running
 
 ```bash
-euroeval --dataset my-dataset --model <model-id>
+euroeval --dataset my-dataset --model <model-id-or-path>
 ```
 
 You can also run the benchmark from a Python script, by simply providing your custom
@@ -356,7 +357,7 @@ dataset configuration directly into the `benchmark` method:
 from euroeval import Benchmarker
 
 benchmarker = Benchmarker()
-benchmarker.benchmark(model="<model-id>", dataset=MY_CONFIG)
+benchmarker.benchmark(model="<model-id-or-path>", dataset=MY_CONFIG)
 ```
 
 We have included three convenience tasks to make it easier to set up custom datasets:
@@ -436,7 +437,7 @@ MY_SQL_DATASET = DatasetConfig(
 Again, with this you can benchmark your custom dataset by simply running
 
 ```bash
-euroeval --dataset my-sql-dataset --model <model-id>
+euroeval --dataset my-sql-dataset --model <model-id-or-path>
 ```
 
 ## Reproducing the evaluation datasets
@@ -592,6 +593,13 @@ A huge thank you to all the contributors who have helped make this project a suc
     alt="Contributor avatar for tvosch"
   />
 </a>
+<a href="https://github.com/Touzen">
+  <img
+    src="https://avatars.githubusercontent.com/u/1416265"
+    width=50
+    alt="Contributor avatar for Touzen"
+  />
+</a>
 
 ### Contribute to EuroEval
 

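The `<model-id-or-path>` change running through the hunks above means a local model directory can now be passed wherever a Hub ID was accepted before. A minimal sketch of what that looks like from Python, assuming a hypothetical local directory `./models/my-finetuned-model` that contains the model files and `config.json`:

```python
from euroeval import Benchmarker

benchmarker = Benchmarker()

# Local model directory instead of a HuggingFace Hub ID; the path is made up
# for this example.
benchmarker.benchmark(
    model="./models/my-finetuned-model",
    task="sentiment-classification",
    language="da",
)
```
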
--- scandeval-16.10.1/README.md
+++ scandeval-16.11.0/README.md
@@ -47,16 +47,17 @@ The easiest way to benchmark pretrained models is via the command line interface
 having installed the package, you can benchmark your favorite model like so:
 
 ```bash
-euroeval --model <model-id>
+euroeval --model <model-id-or-path>
 ```
 
-Here `model` is the HuggingFace model ID, which can be found on the [HuggingFace
-Hub](https://huggingface.co/models)
-the
-
+Here `model` is either the HuggingFace model ID, which can be found on the [HuggingFace
+Hub](https://huggingface.co/models), or a local path to a model directory (containing
+the model files as well as the `config.json` file). By default this will benchmark the
+model on all the tasks available. If you want to benchmark on a particular task, then
+use the `--task` argument:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification
+euroeval --model <model-id-or-path> --task sentiment-classification
 ```
 
 We can also narrow down which languages we would like to benchmark on. This can be done
@@ -64,20 +65,20 @@ by setting the `--language` argument. Here we thus benchmark the model on the Da
 sentiment classification task:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification --language da
+euroeval --model <model-id-or-path> --task sentiment-classification --language da
 ```
 
 Multiple models, datasets and/or languages can be specified by just attaching multiple
 arguments. Here is an example with two models:
 
 ```bash
-euroeval --model <model-
+euroeval --model <model-id-or-path-1> --model <model-id-or-path-2>
 ```
 
 The specific model version/revision to use can also be added after the suffix '@':
 
 ```bash
-euroeval --model <model-id>@<commit>
+euroeval --model <model-id-or-path>@<commit>
 ```
 
 This can be a branch name, a tag name, or a commit id. It defaults to 'main' for latest.
@@ -97,7 +98,7 @@ model:
 ```python
 >>> from euroeval import Benchmarker
 >>> benchmarker = Benchmarker()
->>> benchmarker.benchmark(model="<model-id>")
+>>> benchmarker.benchmark(model="<model-id-or-path>")
 ```
 
 To benchmark on a specific task and/or language, you simply specify the `task` or
@@ -105,7 +106,7 @@ To benchmark on a specific task and/or language, you simply specify the `task` o
 
 ```python
 >>> benchmarker.benchmark(
-...     model="<model-id>",
+...     model="<model-id-or-path>",
 ...     task="sentiment-classification",
 ...     language="da",
 ... )
@@ -149,7 +150,7 @@ docker run -e args="<euroeval-arguments>" --gpus 1 --name euroeval --rm euroeval
 ```
 
 Here `<euroeval-arguments>` consists of the arguments added to the `euroeval` CLI
-argument. This could for instance be `--model <model-id> --task
+argument. This could for instance be `--model <model-id-or-path> --task
 sentiment-classification`.
 
 ## Benchmarking custom inference APIs
@@ -215,14 +216,14 @@ script. For example to download the model you want and all of the Danish sentime
 classification datasets:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification --language da --download-only
+euroeval --model <model-id-or-path> --task sentiment-classification --language da --download-only
 ```
 
 Or from a script:
 
 ```python
 >>> benchmarker.benchmark(
-...     model="<model-id>",
+...     model="<model-id-or-path>",
 ...     task="sentiment-classification",
 ...     language="da",
 ...     download_only=True,
@@ -270,7 +271,7 @@ MY_CONFIG = DatasetConfig(
 You can then benchmark your custom dataset by simply running
 
 ```bash
-euroeval --dataset my-dataset --model <model-id>
+euroeval --dataset my-dataset --model <model-id-or-path>
 ```
 
 You can also run the benchmark from a Python script, by simply providing your custom
@@ -280,7 +281,7 @@ dataset configuration directly into the `benchmark` method:
 from euroeval import Benchmarker
 
 benchmarker = Benchmarker()
-benchmarker.benchmark(model="<model-id>", dataset=MY_CONFIG)
+benchmarker.benchmark(model="<model-id-or-path>", dataset=MY_CONFIG)
 ```
 
 We have included three convenience tasks to make it easier to set up custom datasets:
@@ -360,7 +361,7 @@ MY_SQL_DATASET = DatasetConfig(
 Again, with this you can benchmark your custom dataset by simply running
 
 ```bash
-euroeval --dataset my-sql-dataset --model <model-id>
+euroeval --dataset my-sql-dataset --model <model-id-or-path>
 ```
 
 ## Reproducing the evaluation datasets
@@ -516,6 +517,13 @@ A huge thank you to all the contributors who have helped make this project a suc
     alt="Contributor avatar for tvosch"
   />
 </a>
+<a href="https://github.com/Touzen">
+  <img
+    src="https://avatars.githubusercontent.com/u/1416265"
+    width=50
+    alt="Contributor avatar for Touzen"
+  />
+</a>
 
 ### Contribute to EuroEval
 

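The README's `--download-only` flow pairs naturally with this release's fix for offline evaluation of local vLLM models. A sketch of the two-step workflow, with `<model-id-or-path>` kept as a placeholder:

```python
from euroeval import Benchmarker

benchmarker = Benchmarker()

# Step 1, on a machine with internet access: fetch the model and datasets
# without running any evaluation.
benchmarker.benchmark(
    model="<model-id-or-path>",
    task="sentiment-classification",
    language="da",
    download_only=True,
)

# Step 2, later and possibly offline: the same call without `download_only`
# runs the evaluation against the cached artefacts.
benchmarker.benchmark(
    model="<model-id-or-path>",
    task="sentiment-classification",
    language="da",
)
```
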
--- scandeval-16.10.1/docs/datasets/danish.md
+++ scandeval-16.11.0/docs/datasets/danish.md
@@ -1116,3 +1116,81 @@ You can evaluate this dataset directly as follows:
 ```bash
 euroeval --model <model-id> --dataset nordjylland-news
 ```
+
+## European Values
+
+### ValEU-da
+
+This dataset is the official Danish version of questions from the [European values
+study](https://europeanvaluesstudy.eu/). The dataset contains multiple-choice
+questions regarding people's values and beliefs across a variety of topics, such as
+politics, religion and society.
+
+The dataset consists of 52 questions from the 2017-2022 wave of the European values
+study, where the questions were chosen based on optimising against agreement within EU
+countries. We use only zero-shot evaluation on this dataset, and thus require no splits.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "question_id": "C039",
+  "text": "Hvor enig eller uenig er du i følgende udsagn?\nDet er ens pligt over for samfundet at arbejde.\nSvarmuligheder:\na. Helt enig\nb. Enig\nc. Hverken enig eller uenig\nd. Uenig\ne. Helt uenig",
+}
+```
+
+```json
+{
+  "question_id": "F122",
+  "text": "Fortæl for hver af handlingerne på dette kort, i hvilken grad du billiger handlingen. 1 betyder, at du slet ikke billiger dem, 10 betyder, at du i høj grad billiger dem\nAktiv dødshjælp\nSvarmuligheder:\na. Aldrig\nb. 2\nc. 3\nd. 4\ne. 5\nf. 6\ng. 7\nh. 8\ni. 9\nj. Altid",
+}
+```
+
+```json
+{
+  "question_id": "C041",
+  "text": "Hvor enig eller uenig er du i følgende udsagn?\nArbejde kommer først, også selv om det betyder mindre fritid.\nSvarmuligheder:\na. Helt enig\nb. Enig\nc. Hverken enig eller uenig\nd. Uenig\ne. Helt uenig"
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 0
+- Prefix prompt:
+
+  ```text
+  Følgende er multiple choice spørgsmål (med svar).
+  ```
+
+- Base prompt template:
+
+  ```text
+  Spørgsmål: {text}
+  Svarmuligheder:
+  a. {option_a}
+  b. {option_b}
+  (...)
+  k. {option_k}
+  Svar: {label}
+  ```
+
+- Instruction-tuned prompt template:
+
+  ```text
+  Spørgsmål: {text}
+  Svarmuligheder:
+  a. {option_a}
+  b. {option_b}
+  (...)
+  k. {option_k}
+
+  Besvar ovenstående spørgsmål ved at svare med 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h',
+  'i', 'j' eller 'k', og intet andet.
+  ```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+euroeval --model <model-id> --dataset valeu-da
+```

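For completeness, a sketch of the equivalent Python call for the new ValEU-da dataset, assuming the `dataset` argument also accepts a dataset name as a string, mirroring the CLI's `--dataset` flag; the model ID is a placeholder:

```python
from euroeval import Benchmarker

benchmarker = Benchmarker()

# Zero-shot European-values evaluation on the Danish ValEU questions.
benchmarker.benchmark(model="<model-id-or-path>", dataset="valeu-da")
```
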
--- scandeval-16.10.1/docs/datasets/dutch.md
+++ scandeval-16.11.0/docs/datasets/dutch.md
@@ -1100,3 +1100,81 @@ You can evaluate this dataset directly as follows:
 ```bash
 euroeval --model <model-id> --dataset duidelijke-taal
 ```
+
+## European Values
+
+### ValEU-nl
+
+This dataset is the official Dutch version of questions from the [European values
+study](https://europeanvaluesstudy.eu/). The dataset contains multiple-choice
+questions regarding people's values and beliefs across a variety of topics, such as
+politics, religion and society.
+
+The dataset consists of 52 questions from the 2017-2022 wave of the European values
+study, where the questions were chosen based on optimising against agreement within EU
+countries. We use only zero-shot evaluation on this dataset, and thus require no splits.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "question_id": "E069_01",
+  "text": "Wilt u mij voor elk van de instellingen op deze kaart vertellen of u er heel veel, tamelijk veel, niet zo veel of helemaal geen vertrouwen in heeft?\nDe kerk\nAntwoordopties:\na. Heel veel\nb. Tamelijk veel\nc. Niet zo veel\nd. Helemaal geen"
+}
+```
+
+```json
+{
+  "question_id": "E028",
+  "text": "Wilt u nu deze lijst erbij houden? Ik ga u nu een aantal verschillende soorten van politieke actie noemen die men kan voeren. Wilt u mij van elke actie vertellen of u het zelf ooit heeft gedaan, of u het zelf misschien zou doen als u het nodig vond, of dat u het zeker nooit zult doen?\nMeedoen aan een wilde staking\nAntwoordopties:\na. Zelf gedaan\nb. Zou dat misschien doen\nc. Zou dat nooit doen"
+}
+```
+
+```json
+{
+  "question_id": "E265_07",
+  "text": "Hoe vaak gebeuren volgens u de volgende dingen tijdens verkiezingen in dit land?\nRijke mensen kopen de verkiezingsuitslag\nAntwoordopties:\na. Zeer vaak\nb. Tamelijk vaak\nc. Niet zo vaak\nd. Helemaal niet vaak"
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 0
+- Prefix prompt:
+
+  ```text
+  Hieronder staan meerkeuzevragen (met antwoorden).
+  ```
+
+- Base prompt template:
+
+  ```text
+  Vraag: {text}
+  Antwoordopties:
+  a. {option_a}
+  b. {option_b}
+  (...)
+  k. {option_k}
+  Antwoord: {label}
+  ```
+
+- Instruction-tuned prompt template:
+
+  ```text
+  Vraag: {text}
+  Antwoordopties:
+  a. {option_a}
+  b. {option_b}
+  (...)
+  k. {option_k}
+
+  Beantwoord de bovenstaande vraag met 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'
+  of 'k', en niets anders.
+  ```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+euroeval --model <model-id> --dataset valeu-nl
+```

--- scandeval-16.10.1/docs/datasets/english.md
+++ scandeval-16.11.0/docs/datasets/english.md
@@ -983,3 +983,81 @@ You can evaluate this dataset directly as follows:
 ```bash
 euroeval --model <model-id> --dataset cnn-dailymail
 ```
+
+## European Values
+
+### ValEU-en
+
+This dataset is the official English version of questions from the [European values
+study](https://europeanvaluesstudy.eu/). The dataset contains multiple-choice
+questions regarding people's values and beliefs across a variety of topics, such as
+politics, religion and society.
+
+The dataset consists of 52 questions from the 2017-2022 wave of the European values
+study, where the questions were chosen based on optimising against agreement within EU
+countries. We use only zero-shot evaluation on this dataset, and thus require no splits.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "question_id": "A072",
+  "text": "Please look carefully at the following list of voluntary organisations and say which, if any, do you belong to?\nProfessional associations\nChoices:\na. No\nb. Yes"
+}
+```
+
+```json
+{
+  "question_id": "F025",
+  "text": "Do you belong to a religious denomination? If yes, which one?\nChoices:\na. Do not belong to a denomination\nb. Roman Catholic\nc. Protestant\nd. Orthodox (Russian/Greek/etc.)\ne. Jew\nf. Muslim\ng. Hindu\nh. Buddhist\ni. Other Christian (Evangelical/Pentecostal/Free church/etc.)\nj. Other"
+}
+```
+
+```json
+{
+  "question_id": "F118",
+  "text": "Please tell me for each of the following whether you think it can always be justified, never be justified, or something in between.\nHomosexuality\nChoices:\na. Never justifiable\nb. 2\nc. 3\nd. 4\ne. 5\nf. 6\ng. 7\nh. 8\ni. 9\nj. Always justifiable"
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 0
+- Prefix prompt:
+
+  ```text
+  The following are multiple choice questions (with answers).
+  ```
+
+- Base prompt template:
+
+  ```text
+  Question: {text}
+  Options:
+  a. {option_a}
+  b. {option_b}
+  (...)
+  k. {option_k}
+  Answer: {label}
+  ```
+
+- Instruction-tuned prompt template:
+
+  ```text
+  Question: {text}
+  Options:
+  a. {option_a}
+  b. {option_b}
+  (...)
+  k. {option_k}
+
+  Answer the above question by replying with 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h',
+  'i', 'j', or 'k', and nothing else.
+  ```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+euroeval --model <model-id> --dataset valeu-en
+```

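Since the three new ValEU datasets share the same structure, a single run can plausibly cover all of them. A sketch, assuming the `dataset` argument accepts a list of names the way the CLI accepts repeated `--dataset` arguments; the model ID is a placeholder:

```python
from euroeval import Benchmarker

benchmarker = Benchmarker()

# One run over the Danish, Dutch and English European-values datasets.
benchmarker.benchmark(
    model="<model-id-or-path>",
    dataset=["valeu-da", "valeu-nl", "valeu-en"],
)
```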