ScandEval 16.10.1.tar.gz → 16.12.0.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- scandeval-16.12.0/.github/auto_assign.yaml +29 -0
- scandeval-16.12.0/.github/workflows/auto_assign_reviewers.yaml +15 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/.github/workflows/ci.yaml +4 -4
- {scandeval-16.10.1 → scandeval-16.12.0}/.pre-commit-config.yaml +5 -5
- {scandeval-16.10.1 → scandeval-16.12.0}/CHANGELOG.md +75 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/CONTRIBUTING.md +1 -1
- {scandeval-16.10.1 → scandeval-16.12.0}/Dockerfile.cuda +1 -1
- {scandeval-16.10.1 → scandeval-16.12.0}/LICENSE +1 -1
- {scandeval-16.10.1 → scandeval-16.12.0}/PKG-INFO +50 -24
- {scandeval-16.10.1 → scandeval-16.12.0}/README.md +40 -18
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/danish.md +79 -1
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/dutch.md +170 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/english.md +78 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/estonian.md +79 -1
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/finnish.md +78 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/french.md +78 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/german.md +101 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/icelandic.md +78 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/italian.md +78 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/norwegian.md +78 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/polish.md +78 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/portuguese.md +87 -9
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/spanish.md +85 -7
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/swedish.md +84 -6
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/faq.md +4 -2
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/python-package.md +33 -67
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/README.md +5 -7
- scandeval-16.12.0/docs/tasks/bias-detection.md +29 -0
- scandeval-16.12.0/docs/tasks/european-values.md +33 -0
- scandeval-16.12.0/docs/tasks/simplification.md +36 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/makefile +2 -2
- {scandeval-16.10.1 → scandeval-16.12.0}/mkdocs.yaml +7 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/pyproject.toml +16 -8
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/__init__.py +0 -9
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmark_config_factory.py +5 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmark_modules/hf.py +36 -8
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmark_modules/litellm.py +119 -22
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmark_modules/vllm.py +202 -94
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmarker.py +28 -7
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/cli.py +13 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/constants.py +31 -2
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/data_models.py +12 -2
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/dutch.py +10 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/logging_utils.py +1 -1
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/metrics/__init__.py +1 -0
- scandeval-16.12.0/src/scandeval/metrics/bias.py +237 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/metrics/huggingface.py +5 -3
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/metrics/llm_as_a_judge.py +79 -15
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/model_loading.py +2 -1
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/task_group_utils/sequence_classification.py +12 -3
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/tasks.py +22 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/tokenisation_utils.py +12 -1
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/types.py +39 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/utils.py +38 -66
- scandeval-16.12.0/src/scripts/create_mbbq_nl.py +213 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/load_ud_pos.py +11 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/conftest.py +1 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_benchmark_config_factory.py +10 -10
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_benchmarker.py +44 -17
- scandeval-16.12.0/tests/test_bias_metrics.py +144 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_cli.py +1 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_data_loading.py +1 -1
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_dataset_configs.py +3 -2
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_model_loading.py +7 -9
- {scandeval-16.10.1 → scandeval-16.12.0}/uv.lock +1781 -1755
- scandeval-16.10.1/docs/tasks/simplification.md +0 -42
- {scandeval-16.10.1 → scandeval-16.12.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/.github/ISSUE_TEMPLATE/language_request.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/.gitignore +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/.markdownlint.jsonc +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/CITATION.cff +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/CODE_OF_CONDUCT.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/NEW_DATASET_GUIDE.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/CNAME +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/README.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/README.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/albanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/bosnian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/bulgarian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/catalan.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/croatian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/czech.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/faroese.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/greek.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/hungarian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/latvian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/lithuanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/romanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/serbian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/slovak.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/slovene.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/ukrainian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/extras/radial_plotter.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/gfx/favicon.png +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/albanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/bosnian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/bulgarian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/catalan.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/croatian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/czech.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/danish.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/dutch.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/english.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/estonian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/faroese.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/finnish.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/french.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/german.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/greek.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/hungarian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/icelandic.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/italian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/latvian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/lithuanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/norwegian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/polish.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/portuguese.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/romanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/serbian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/slovak.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/slovene.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/spanish.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/swedish.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/ukrainian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Multilingual/baltic.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Multilingual/european.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Multilingual/finnic.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Multilingual/germanic.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Multilingual/romance.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Multilingual/slavic.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/README.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/methodology.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/common-sense-reasoning.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/knowledge.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/linguistic-acceptability.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/named-entity-recognition.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/reading-comprehension.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/sentiment-classification.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/speed.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/summarization.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/gfx/euroeval.png +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/gfx/euroeval.xcf +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/gfx/scandeval.png +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmark_modules/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmark_modules/base.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmark_modules/fresh.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/caching_utils.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/callbacks.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/data_loading.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/albanian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/bosnian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/bulgarian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/catalan.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/croatian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/czech.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/danish.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/english.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/estonian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/faroese.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/finnish.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/french.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/german.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/greek.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/hungarian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/icelandic.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/italian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/latvian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/lithuanian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/norwegian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/polish.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/portuguese.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/romanian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/serbian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/slovak.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/slovene.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/spanish.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/swedish.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/ukrainian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/enums.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/exceptions.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/finetuning.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/generation.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/generation_utils.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/languages.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/metrics/base.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/metrics/pipeline.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/metrics/speed.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/model_cache.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/model_config.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/classification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/linguistic_acceptability.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/multiple_choice.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/named_entity_recognition.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/reading_comprehension.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/sentiment_classification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/simplification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/summarization.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/token_classification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/scores.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/speed_benchmark.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/task_group_utils/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/task_group_utils/multiple_choice_classification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/task_group_utils/question_answering.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/task_group_utils/text_to_text.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/task_group_utils/token_classification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/constants.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_allocine.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_angry_tweets.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_arc.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_arc_is.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_atsiliepimai.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_belebele.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_bg_ner_bsnlp.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_boolq_pt.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_cinexio.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_cnn_dailymail.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_conll_en.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_conll_es.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_conll_nl.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_copa_lv.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_copa_nl.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_cross_domain_uk_reviews.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_cs_gec.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_csfd_sentiment.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_csfd_sentiment_sk.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_czech_news.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_dacsa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_dane.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_danish_citizen_tests.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_dansk.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_danske_talemaader.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_danske_talemaader_old.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_dbrd.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_duidelijke_taal.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_dutch_cola.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_elner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_eltec.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_err_news.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_estner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_estonian_valence.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_european_values.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_exam_et.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_exams_bg.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_fone.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_foqa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_fosent.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_fquad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_fullstack_ner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_germanquad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_germeval.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_global_mmlu.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_goldenswag.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_grammar_et.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_greek_sa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_greek_wikipedia.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_guia_cat.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_harem.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_hellaswag.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_hellaswag_cs.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_hellaswag_fi.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_hun_sum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_husst.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_ice_linguistic.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_icelandic_error_corpus.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_icelandic_knowledge.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_icelandic_qa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_icesum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_idioms_no.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_ilpost_sum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_jentoft.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_kpwr_ner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_latvian_lsm_summary.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_latvian_twitter_sentiment.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_life_in_the_uk.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_lithuanian_lrytas_summarization.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_llmzszl.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_lr_sum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_lt_emotions.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_lt_history.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mim_gold_ner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mlqa_es.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mlsum_de.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mlsum_es.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mmlu.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mmlu_et.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mmlu_hr.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mmlu_lv.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mms.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_multi_wiki_qa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_multinerd-it.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_ner_uk.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_no_cola.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_no_sammendrag.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_nor_common_sense_qa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_nordjylland_news.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_norec.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_norglm_multiqa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_norglm_multisum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_norne.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_norquad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_nqii.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_nrk_quiz_qa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_orange_sum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_personal_sum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_polemo2.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_poner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_poquad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_psc.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_publico.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_ronec.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_rosent.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_rrn.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sb10k.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_scala.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_scandiqa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_scandisent_fi.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_schibsted.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sentiment_headlines_es.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sentinews.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sentipolc16.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_skolprov.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sqad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_squad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_squad_it.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_squad_nl.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_squad_nl_old.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_ssj500k_ner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sst2_pt.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sst5.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_suc3.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sumo_ro.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_swedish_facts.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_swedn.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_swerec.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_szeged_ner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_trivia_et.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_turku_ner_fi.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_tydiqa_fi.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_umimeto_qa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_uner_sk.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_uner_sr.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_wiki_lingua_nl.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_wikiann.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_wikineural-it.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_winogrande.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_winogrande_et.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_winogrande_is.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_xlsum_fi.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_xquad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/fix_dot_env_file.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/versioning.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_benchmark_modules/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_benchmark_modules/test_hf.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_callbacks.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_constants.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_data_models.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_enums.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_exceptions.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_finetuning.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_languages.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_model_config.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scores.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/test_create_scala.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/test_data/de_gsd-ud-train.conllu.adp_det +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/test_data/empty.file +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/test_data/en_gum-ud-train.conllu.case +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_01 +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_02 +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_03 +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_speed_benchmark.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_tokenisation_utils.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_types.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_utils.py +0 -0
`.github/auto_assign.yaml` (new file):

```diff
@@ -0,0 +1,29 @@
+# Set to true to add reviewers to pull requests
+addReviewers: true
+
+# Set to true to add assignees to pull requests
+addAssignees: true
+
+# A list of reviewers to be added to pull requests (GitHub user name)
+reviewers:
+  - saattrupdan
+
+# A number of reviewers added to the pull request
+# Set 0 to add all the reviewers (default: 0)
+numberOfReviewers: 0
+
+# Whether to run the action on draft pull requests
+runOnDraft: true
+
+# A list of assignees, overrides reviewers if set
+# assignees:
+#   - assigneeA
+
+# A number of assignees to add to the pull request
+# Set to 0 to add all of the assignees.
+# Uses numberOfReviewers if unset.
+# numberOfAssignees: 2
+
+# A list of keywords to be skipped the process that add reviewers if pull requests include it
+# skipKeywords:
+#   - wip
```
`.github/workflows/auto_assign_reviewers.yaml` (new file):

```diff
@@ -0,0 +1,15 @@
+name: 'Auto Assign'
+on:
+  pull_request:
+    types: [opened, ready_for_review]
+
+jobs:
+  add-reviews:
+    permissions:
+      contents: read
+      pull-requests: write
+    runs-on: ubuntu-latest
+    steps:
+      - uses: kentaro-m/auto-assign-action@v2.0.1
+        with:
+          configuration-path: .github/auto_assign.yaml
```
`.github/workflows/ci.yaml`:

```diff
@@ -31,7 +31,7 @@ jobs:
         uses: astral-sh/setup-uv@v6
         with:
           enable-cache: false
-          python-version: "3.
+          python-version: "3.12"
 
       - name: Run pre-commit hooks
         uses: pre-commit/action@v3.0.1
@@ -43,7 +43,7 @@ jobs:
       pull-requests: write
     strategy:
       matrix:
-        python-version: ["3.
+        python-version: ["3.12", "3.13"]
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v5
@@ -58,7 +58,7 @@ jobs:
          python-version: ${{ matrix.python-version }}
 
       - name: Install Dependencies
-        run: uv sync --no-dev
+        run: uv sync --no-dev --all-extras
 
       - name: Start Ollama server
         run: curl -fsSL https://ollama.com/install.sh | sh && ollama serve &
@@ -95,7 +95,7 @@ jobs:
          python-version: ${{ matrix.python-version }}
 
       - name: Install Dependencies
-        run: uv sync --no-dev
+        run: uv sync --no-dev --all-extras
 
       - name: Start Ollama server
         run: curl -fsSL https://ollama.com/install.sh | sh && ollama serve &
```
`.pre-commit-config.yaml`:

```diff
@@ -8,9 +8,9 @@ repos:
     hooks:
       - id: end-of-file-fixer
       - id: trailing-whitespace
-
+      - id: debug-statements
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.14.
+    rev: v0.14.14
     hooks:
       - id: ruff
         args:
@@ -30,15 +30,15 @@ repos:
           - pyi
           - jupyter
   - repo: https://github.com/kynan/nbstripout
-    rev: 0.
+    rev: 0.9.0
     hooks:
       - id: nbstripout
   - repo: https://github.com/facebook/pyrefly-pre-commit
-    rev: 0.
+    rev: 0.50.1
     hooks:
       - id: pyrefly-check
         name: Pyrefly (type checking)
-        pass_filenames:
+        pass_filenames: false
   - repo: https://github.com/DavidAnson/markdownlint-cli2
     rev: v0.20.0
     hooks:
```
`CHANGELOG.md`:

```diff
@@ -7,6 +7,81 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 
 ## [Unreleased]
 
+## [v16.12.0] - 2026-02-02
+
+### Added
+
+- Added the bias detection task (`multiple-choice-stereotype-bias`) along with the Dutch
+  dataset MBBQ-NL. This was added by @caldaibis ✨
+- Added support for vLLM Metal, so that generative models can now be evaluated on Apple
+  Silicon. Note that this currently does not support structured generation, which means
+  that classification and named entity recognition tasks unfortunately won't work yet.
+  This is due to [this xgrammar
+  issue](https://github.com/vllm-project/vllm/issues/31901).
+
+### Changed
+
+- Replaced the deprecated `VLLM_ATTENTION_BACKEND` environment variable with vLLM's
+  `AttentionConfig` API. Added the `--attention-backend` CLI option to configure the
+  attention backend. Defaults to FLASHINFER. This was added by @SwekeR-463 ✨
+- Now requires Python >=3.12, as Python 3.11 does not support some dependencies.
+- We now raise the vLLM maximum context length for reasoning models from 8,192 to
+  16,384, to accommodate reasoning tokens on datasets that have long documents.
+- We opened up the pinned vLLM version, now set to `>=0.14.1`.
+- Made changes to the codebase that make it compatible with Transformers 5.0, for when
+  vLLM starts supporting it.
+
+### Fixed
+
+- Fixed an issue where a model was incorrectly classified as an encoder model if it had
+  no pipeline tag on the Hugging Face Hub and it relied on a custom implementation that
+  isn't integrated into the `transformers` library.
+- Fixed an issue when a model config had no `pad_token_id` and/or `eos_token_id`.
+- There was an error when evaluating local adapter models, which has now been fixed.
+- Now ensures that the vLLM argument `max_num_batched_tokens` is at least as large as the
+  maximum context length of the model, which gave errors with models that had a maximum
+  context length of less than 8,192.
+
+## [v16.11.0] - 2026-01-21
+
+### Added
+
+- Added model metadata for GPT 5.2.
+- Added better support for unofficial inference providers, allowing model prefixes even
+  if they're not in LiteLLM's official list of providers. Currently this only works with
+  the "ordbogen/" prefix for models available on ordbogen.dk.
+
+### Changed
+
+- LLM-as-a-Judge metrics now support batch scoring across multiple judge outputs.
+- When evaluating datasets with no validation split, we now set the `validation_split`
+  in the resulting JSONL file to `null` rather than `True`, to avoid confusion.
+  Likewise, if a task requires zero-shot evaluation, we set `few_shot` to `null` rather
+  than a Boolean value.
+- When evaluating a reasoning model on a sequence classification task, if the model
+  outputs an answer that starts with one of the candidate labels, we now use that label
+  as the predicted label. Previously, we would have conducted a word edit distance
+  search to find the closest candidate label, which was almost always correct, but not
+  in all cases.
+
+### Fixed
+
+- Quantized models in vLLM now have their dtype inferred automatically, removing
+  explicit dtype casting based on GPU compute capability. This was contributed by
+  @tvosch ✨
+- Evaluation of local vLLM models when no internet connection was available did not work
+  correctly; this has now been fixed. This was contributed by @Touzen ✨
+- More robust detection and handling of errors related to too-long inputs for vLLM
+  models.
+- Some API models need the `logprobs` argument to be a Boolean rather than an integer.
+  This has now been fixed.
+- Better handling of rate limits when evaluating API models, by backing off more
+  aggressively when hitting rate limits.
+- Now truncates prompts for instruction-following models in a smarter way, by removing
+  few-shot examples one by one until the prompt is short enough, rather than just
+  truncating the prompt to the maximum length. This only affects models whose maximum
+  model length is quite small (roughly 5,000 tokens or less).
+
 ## [v16.10.1] - 2026-01-02
 
 ### Changed
```
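The two v16.12.0 additions above introduce a new CLI option and a new task. A minimal sketch of how they could be invoked, assuming the flag value format shown and that the task identifier quoted in the changelog doubles as a `--task` value (with `nl` assumed as the language code for the Dutch MBBQ-NL dataset; model placeholders follow the README convention below):

```bash
# Hedged sketch, not taken from the diff itself:
# --attention-backend and its FLASHINFER default come from the v16.12.0 entry above;
# using multiple-choice-stereotype-bias as a --task value is an assumption.
euroeval --model <model-id-or-path> --attention-backend FLASHINFER
euroeval --model <model-id-or-path> --task multiple-choice-stereotype-bias --language nl
```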
`CONTRIBUTING.md`:

```diff
@@ -72,7 +72,7 @@ guide](https://github.com/atom/atom/blob/master/CONTRIBUTING.md#git-commit-messa
 know how to use emoji for commit messages.
 
 Once your changes are ready, don't forget to
-
+self-review to speed up the review process:zap:.
 
 ### Pull Request
 
```
`Dockerfile.cuda`:

```diff
@@ -3,7 +3,7 @@ FROM nvidia/cuda:12.2.0-base-ubuntu22.04
 # Install dependencies
 RUN apt-get -y update && \
     apt-get -y upgrade && \
-    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends gcc python3.
+    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends gcc python3.12 python3-pip python3-dev git-all && \
     python3 -m pip install --upgrade pip wheel && \
     python3 -m pip install euroeval[all]
 
```
`PKG-INFO`:

````diff
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: ScandEval
-Version: 16.
+Version: 16.12.0
 Summary: The robust European language model benchmark.
 Project-URL: Repository, https://github.com/EuroEval/EuroEval
 Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -8,7 +8,7 @@ Author-email: Dan Saattrup Smart <dan.smart@alexandra.dk>
 Maintainer-email: Dan Saattrup Smart <dan.smart@alexandra.dk>
 License: MIT License
 
-Copyright (c) 2022-
+Copyright (c) 2022-2026 Dan Saattrup Smart
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
@@ -28,7 +28,7 @@ License: MIT License
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 SOFTWARE.
 License-File: LICENSE
-Requires-Python: <4.0,>=3.
+Requires-Python: <4.0,>=3.12
 Requires-Dist: accelerate>=1.9.0
 Requires-Dist: bert-score>=0.3.13
 Requires-Dist: click>=8.1.3
@@ -59,19 +59,23 @@ Requires-Dist: setuptools>=75.8.2
 Requires-Dist: tenacity>=9.0.0
 Requires-Dist: termcolor>=2.0.0
 Requires-Dist: torch>=2.6.0
-Requires-Dist: transformers[mistral-common]
+Requires-Dist: transformers[mistral-common]<5.0.0,>=4.56.0
 Provides-Extra: all
 Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'all'
 Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
 Requires-Dist: ray>=2.53.0; (platform_system == 'Linux') and extra == 'all'
 Requires-Dist: timm>=1.0.19; extra == 'all'
-Requires-Dist: vllm
+Requires-Dist: vllm-metal>=0.1.0; (platform_system == 'Darwin') and extra == 'all'
+Requires-Dist: vllm==0.11.0; (platform_system == 'Darwin') and extra == 'all'
+Requires-Dist: vllm[flashinfer]>=0.14.1; (platform_system == 'Linux') and extra == 'all'
 Provides-Extra: generative
 Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
 Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
 Requires-Dist: ray>=2.53.0; (platform_system == 'Linux') and extra == 'generative'
 Requires-Dist: timm>=1.0.19; extra == 'generative'
-Requires-Dist: vllm
+Requires-Dist: vllm-metal>=0.1.0; (platform_system == 'Darwin') and extra == 'generative'
+Requires-Dist: vllm==0.11.0; (platform_system == 'Darwin') and extra == 'generative'
+Requires-Dist: vllm[flashinfer]>=0.14.1; (platform_system == 'Linux') and extra == 'generative'
 Description-Content-Type: text/markdown
 
 <!-- This disables the requirement that the first line is a top-level heading -->
@@ -96,7 +100,7 @@ ______________________________________________________________________
 [](https://arxiv.org/abs/2406.13469)
 [](https://github.com/EuroEval/EuroEval/blob/main/LICENSE)
 [](https://github.com/EuroEval/EuroEval/commits/main)
-[](https://github.com/EuroEval/EuroEval/tree/main/tests)
 [](https://github.com/EuroEval/EuroEval/blob/main/CODE_OF_CONDUCT.md)
 
 ## Maintainer
@@ -123,16 +127,17 @@ The easiest way to benchmark pretrained models is via the command line interface
 having installed the package, you can benchmark your favorite model like so:
 
 ```bash
-euroeval --model <model-id>
+euroeval --model <model-id-or-path>
 ```
 
-Here `model` is the HuggingFace model ID, which can be found on the [HuggingFace
-Hub](https://huggingface.co/models)
-the
-
+Here `model` is either the HuggingFace model ID, which can be found on the [HuggingFace
+Hub](https://huggingface.co/models), or a local path to a model directory (containing
+the model files as well as the `config.json` file). By default this will benchmark the
+model on all the tasks available. If you want to benchmark on a particular task, then
+use the `--task` argument:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification
+euroeval --model <model-id-or-path> --task sentiment-classification
 ```
 
 We can also narrow down which languages we would like to benchmark on. This can be done
@@ -140,20 +145,20 @@ by setting the `--language` argument. Here we thus benchmark the model on the Da
 sentiment classification task:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification --language da
+euroeval --model <model-id-or-path> --task sentiment-classification --language da
 ```
 
 Multiple models, datasets and/or languages can be specified by just attaching multiple
 arguments. Here is an example with two models:
 
 ```bash
-euroeval --model <model-
+euroeval --model <model-id-or-path-1> --model <model-id-or-path-2>
 ```
 
 The specific model version/revision to use can also be added after the suffix '@':
 
 ```bash
-euroeval --model <model-id>@<commit>
+euroeval --model <model-id-or-path>@<commit>
 ```
 
 This can be a branch name, a tag name, or a commit id. It defaults to 'main' for latest.
@@ -173,7 +178,7 @@ model:
 ```python
 >>> from euroeval import Benchmarker
 >>> benchmarker = Benchmarker()
->>> benchmarker.benchmark(model="<model-id>")
+>>> benchmarker.benchmark(model="<model-id-or-path>")
 ```
 
 To benchmark on a specific task and/or language, you simply specify the `task` or
@@ -181,7 +186,7 @@ To benchmark on a specific task and/or language, you simply specify the `task` o
 
 ```python
 >>> benchmarker.benchmark(
-...     model="<model-id>",
+...     model="<model-id-or-path>",
 ...     task="sentiment-classification",
 ...     language="da",
 ... )
@@ -225,7 +230,7 @@ docker run -e args="<euroeval-arguments>" --gpus 1 --name euroeval --rm euroeval
 ```
 
 Here `<euroeval-arguments>` consists of the arguments added to the `euroeval` CLI
-argument. This could for instance be `--model <model-id> --task
+argument. This could for instance be `--model <model-id-or-path> --task
 sentiment-classification`.
 
 ## Benchmarking custom inference APIs
@@ -291,14 +296,14 @@ script. For example to download the model you want and all of the Danish sentime
 classification datasets:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification --language da --download-only
+euroeval --model <model-id-or-path> --task sentiment-classification --language da --download-only
 ```
 
 Or from a script:
 
 ```python
 >>> benchmarker.benchmark(
-...     model="<model-id>",
+...     model="<model-id-or-path>",
 ...     task="sentiment-classification",
 ...     language="da",
 ...     download_only=True,
@@ -346,7 +351,7 @@ MY_CONFIG = DatasetConfig(
 You can then benchmark your custom dataset by simply running
 
 ```bash
-euroeval --dataset my-dataset --model <model-id>
+euroeval --dataset my-dataset --model <model-id-or-path>
 ```
 
 You can also run the benchmark from a Python script, by simply providing your custom
@@ -356,7 +361,7 @@ dataset configuration directly into the `benchmark` method:
 from euroeval import Benchmarker
 
 benchmarker = Benchmarker()
-benchmarker.benchmark(model="<model-id>", dataset=MY_CONFIG)
+benchmarker.benchmark(model="<model-id-or-path>", dataset=MY_CONFIG)
 ```
 
 We have included three convenience tasks to make it easier to set up custom datasets:
@@ -436,7 +441,7 @@ MY_SQL_DATASET = DatasetConfig(
 Again, with this you can benchmark your custom dataset by simply running
 
 ```bash
-euroeval --dataset my-sql-dataset --model <model-id>
+euroeval --dataset my-sql-dataset --model <model-id-or-path>
 ```
 
 ## Reproducing the evaluation datasets
@@ -592,6 +597,27 @@ A huge thank you to all the contributors who have helped make this project a suc
     alt="Contributor avatar for tvosch"
   />
 </a>
+<a href="https://github.com/Touzen">
+  <img
+    src="https://avatars.githubusercontent.com/u/1416265"
+    width=50
+    alt="Contributor avatar for Touzen"
+  />
+</a>
+<a href="https://github.com/caldaibis">
+  <img
+    src="https://avatars.githubusercontent.com/u/16032437"
+    width=50
+    alt="Contributor avatar for caldaibis"
+  />
+</a>
+<a href="https://github.com/SwekeR-463">
+  <img
+    src="https://avatars.githubusercontent.com/u/114919896?v=4"
+    width=50
+    alt="Contributor avatar for SwekeR-463"
+  />
+</a>
 
 ### Contribute to EuroEval
 
````
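The new environment markers above split the vLLM requirement by platform. A hedged sketch of what an install resolves to under standard pip marker evaluation (the install command itself appears in `Dockerfile.cuda` above):

```bash
# Hedged sketch: per the metadata above, the 'all' extra pulls vllm[flashinfer]>=0.14.1
# on Linux and vllm-metal>=0.1.0 plus vllm==0.11.0 on macOS (Darwin).
pip install 'euroeval[all]'
```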
`README.md`:

````diff
@@ -20,7 +20,7 @@ ______________________________________________________________________
 [](https://arxiv.org/abs/2406.13469)
 [](https://github.com/EuroEval/EuroEval/blob/main/LICENSE)
 [](https://github.com/EuroEval/EuroEval/commits/main)
-[](https://github.com/EuroEval/EuroEval/tree/main/tests)
 [](https://github.com/EuroEval/EuroEval/blob/main/CODE_OF_CONDUCT.md)
 
 ## Maintainer
@@ -47,16 +47,17 @@ The easiest way to benchmark pretrained models is via the command line interface
 having installed the package, you can benchmark your favorite model like so:
 
 ```bash
-euroeval --model <model-id>
+euroeval --model <model-id-or-path>
 ```
 
-Here `model` is the HuggingFace model ID, which can be found on the [HuggingFace
-Hub](https://huggingface.co/models)
-the
-
+Here `model` is either the HuggingFace model ID, which can be found on the [HuggingFace
+Hub](https://huggingface.co/models), or a local path to a model directory (containing
+the model files as well as the `config.json` file). By default this will benchmark the
+model on all the tasks available. If you want to benchmark on a particular task, then
+use the `--task` argument:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification
+euroeval --model <model-id-or-path> --task sentiment-classification
 ```
 
 We can also narrow down which languages we would like to benchmark on. This can be done
@@ -64,20 +65,20 @@ by setting the `--language` argument. Here we thus benchmark the model on the Da
 sentiment classification task:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification --language da
+euroeval --model <model-id-or-path> --task sentiment-classification --language da
 ```
 
 Multiple models, datasets and/or languages can be specified by just attaching multiple
 arguments. Here is an example with two models:
 
 ```bash
-euroeval --model <model-
+euroeval --model <model-id-or-path-1> --model <model-id-or-path-2>
 ```
 
 The specific model version/revision to use can also be added after the suffix '@':
 
 ```bash
-euroeval --model <model-id>@<commit>
+euroeval --model <model-id-or-path>@<commit>
 ```
 
 This can be a branch name, a tag name, or a commit id. It defaults to 'main' for latest.
@@ -97,7 +98,7 @@ model:
 ```python
 >>> from euroeval import Benchmarker
 >>> benchmarker = Benchmarker()
->>> benchmarker.benchmark(model="<model-id>")
+>>> benchmarker.benchmark(model="<model-id-or-path>")
 ```
 
 To benchmark on a specific task and/or language, you simply specify the `task` or
@@ -105,7 +106,7 @@ To benchmark on a specific task and/or language, you simply specify the `task` o
 
 ```python
 >>> benchmarker.benchmark(
-...     model="<model-id>",
+...     model="<model-id-or-path>",
 ...     task="sentiment-classification",
 ...     language="da",
 ... )
@@ -149,7 +150,7 @@ docker run -e args="<euroeval-arguments>" --gpus 1 --name euroeval --rm euroeval
 ```
 
 Here `<euroeval-arguments>` consists of the arguments added to the `euroeval` CLI
-argument. This could for instance be `--model <model-id> --task
+argument. This could for instance be `--model <model-id-or-path> --task
 sentiment-classification`.
 
 ## Benchmarking custom inference APIs
@@ -215,14 +216,14 @@ script. For example to download the model you want and all of the Danish sentime
 classification datasets:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification --language da --download-only
+euroeval --model <model-id-or-path> --task sentiment-classification --language da --download-only
 ```
 
 Or from a script:
 
 ```python
 >>> benchmarker.benchmark(
-...     model="<model-id>",
+...     model="<model-id-or-path>",
 ...     task="sentiment-classification",
 ...     language="da",
 ...     download_only=True,
@@ -270,7 +271,7 @@ MY_CONFIG = DatasetConfig(
 You can then benchmark your custom dataset by simply running
 
 ```bash
-euroeval --dataset my-dataset --model <model-id>
+euroeval --dataset my-dataset --model <model-id-or-path>
 ```
 
 You can also run the benchmark from a Python script, by simply providing your custom
@@ -280,7 +281,7 @@ dataset configuration directly into the `benchmark` method:
 from euroeval import Benchmarker
 
 benchmarker = Benchmarker()
-benchmarker.benchmark(model="<model-id>", dataset=MY_CONFIG)
+benchmarker.benchmark(model="<model-id-or-path>", dataset=MY_CONFIG)
 ```
 
 We have included three convenience tasks to make it easier to set up custom datasets:
@@ -360,7 +361,7 @@ MY_SQL_DATASET = DatasetConfig(
 Again, with this you can benchmark your custom dataset by simply running
 
 ```bash
-euroeval --dataset my-sql-dataset --model <model-id>
+euroeval --dataset my-sql-dataset --model <model-id-or-path>
 ```
 
 ## Reproducing the evaluation datasets
@@ -516,6 +517,27 @@ A huge thank you to all the contributors who have helped make this project a suc
     alt="Contributor avatar for tvosch"
  />
 </a>
+<a href="https://github.com/Touzen">
+  <img
+    src="https://avatars.githubusercontent.com/u/1416265"
+    width=50
+    alt="Contributor avatar for Touzen"
+  />
+</a>
+<a href="https://github.com/caldaibis">
+  <img
+    src="https://avatars.githubusercontent.com/u/16032437"
+    width=50
+    alt="Contributor avatar for caldaibis"
+  />
+</a>
+<a href="https://github.com/SwekeR-463">
+  <img
+    src="https://avatars.githubusercontent.com/u/114919896?v=4"
+    width=50
+    alt="Contributor avatar for SwekeR-463"
+  />
+</a>
 
 ### Contribute to EuroEval
 
````
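The README change above replaces the `<model-id>` placeholder with `<model-id-or-path>`, meaning a local directory containing the model files and `config.json` is now accepted in place of a Hub model ID. A minimal sketch with a placeholder path:

```bash
# Hedged sketch: the local path is a placeholder; the task and language values are the
# ones used in the README examples above.
euroeval --model /path/to/local/model --task sentiment-classification --language da
```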
`docs/datasets/danish.md`:

````diff
@@ -1002,7 +1002,7 @@ Here are a few examples from the training split:
 
 ```json
 {
-  "text": "
+  "text": "Jeg kunne ikke kontrollere fugten, som jeg kontrollerede regnen, fordi _ kom ind overalt. Hvad refererer det tomme _ til?\nSvarmuligheder:\na. fugt\nb. regn",
   "label": "a"
 }
 ```
@@ -1116,3 +1116,81 @@ You can evaluate this dataset directly as follows:
 ```bash
 euroeval --model <model-id> --dataset nordjylland-news
 ```
+
+## European Values
+
+### ValEU-da
+
+This dataset is the official Danish version of questions from the [European values
+study](https://europeanvaluesstudy.eu/). The dataset contains multiple-choice
+questions regarding people's values and beliefs across a variety of topics, such as
+politics, religion and society.
+
+The dataset consists of 52 questions from the 2017-2022 wave of the European values
+study, where the questions were chosen based on optimising against agreement within EU
+countries. We use only zero-shot evaluation on this dataset, and thus require no splits.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "question_id": "C039",
+  "text": "Hvor enig eller uenig er du i følgende udsagn?\nDet er ens pligt over for samfundet at arbejde.\nSvarmuligheder:\na. Helt enig\nb. Enig\nc. Hverken enig eller uenig\nd. Uenig\ne. Helt uenig"
+}
+```
+
+```json
+{
+  "question_id": "F122",
+  "text": "Fortæl for hver af handlingerne på dette kort, i hvilken grad du billiger handlingen. 1 betyder, at du slet ikke billiger dem, 10 betyder, at du i høj grad billiger dem\nAktiv dødshjælp\nSvarmuligheder:\na. Aldrig\nb. 2\nc. 3\nd. 4\ne. 5\nf. 6\ng. 7\nh. 8\ni. 9\nj. Altid"
+}
+```
+
+```json
+{
+  "question_id": "C041",
+  "text": "Hvor enig eller uenig er du i følgende udsagn?\nArbejde kommer først, også selv om det betyder mindre fritid.\nSvarmuligheder:\na. Helt enig\nb. Enig\nc. Hverken enig eller uenig\nd. Uenig\ne. Helt uenig"
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 0
+- Prefix prompt:
+
+  ```text
+  Følgende er multiple choice spørgsmål (med svar).
+  ```
+
+- Base prompt template:
+
+  ```text
+  Spørgsmål: {text}
+  Svarmuligheder:
+  a. {option_a}
+  b. {option_b}
+  (...)
+  k. {option_k}
+  Svar: {label}
+  ```
+
+- Instruction-tuned prompt template:
+
+  ```text
+  Spørgsmål: {text}
+  Svarmuligheder:
+  a. {option_a}
+  b. {option_b}
+  (...)
+  k. {option_k}
+
+  Besvar ovenstående spørgsmål ved at svare med 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h',
+  'i', 'j' eller 'k', og intet andet.
+  ```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+euroeval --model <model-id> --dataset valeu-da
+```
````