EuroEval 16.3.0.tar.gz → 16.5.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (362)
  1. {euroeval-16.3.0 → euroeval-16.5.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +6 -0
  2. {euroeval-16.3.0 → euroeval-16.5.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +4 -3
  3. {euroeval-16.3.0 → euroeval-16.5.0}/.pre-commit-config.yaml +1 -1
  4. {euroeval-16.3.0 → euroeval-16.5.0}/CHANGELOG.md +122 -4
  5. {euroeval-16.3.0 → euroeval-16.5.0}/PKG-INFO +196 -39
  6. {euroeval-16.3.0 → euroeval-16.5.0}/README.md +193 -36
  7. euroeval-16.5.0/cool_test.csv +7 -0
  8. euroeval-16.5.0/cool_train.csv +13 -0
  9. euroeval-16.5.0/cool_val.csv +5 -0
  10. euroeval-16.5.0/custom_datasets.py +21 -0
  11. euroeval-16.5.0/docs/datasets/bulgarian.md +461 -0
  12. euroeval-16.5.0/docs/datasets/czech.md +600 -0
  13. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/danish.md +84 -83
  14. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/dutch.md +5 -4
  15. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/english.md +2 -1
  16. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/estonian.md +76 -0
  17. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/finnish.md +5 -4
  18. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/french.md +2 -1
  19. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/german.md +2 -1
  20. euroeval-16.5.0/docs/datasets/greek.md +510 -0
  21. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/italian.md +5 -4
  22. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/latvian.md +5 -4
  23. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/lithuanian.md +72 -7
  24. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/norwegian.md +5 -4
  25. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/polish.md +33 -31
  26. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/portuguese.md +5 -4
  27. euroeval-16.5.0/docs/datasets/serbian.md +519 -0
  28. euroeval-16.5.0/docs/datasets/slovak.md +448 -0
  29. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/spanish.md +5 -4
  30. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/swedish.md +83 -82
  31. euroeval-16.5.0/docs/datasets/ukrainian.md +522 -0
  32. euroeval-16.5.0/docs/leaderboards/Monolingual/czech.md +26 -0
  33. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/danish.md +3 -2
  34. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/dutch.md +3 -2
  35. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/english.md +3 -2
  36. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/estonian.md +3 -2
  37. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/faroese.md +3 -2
  38. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/finnish.md +3 -2
  39. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/french.md +3 -2
  40. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/german.md +3 -2
  41. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/icelandic.md +3 -2
  42. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/italian.md +3 -2
  43. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/latvian.md +3 -2
  44. euroeval-16.5.0/docs/leaderboards/Monolingual/lithuanian.md +26 -0
  45. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/norwegian.md +3 -2
  46. euroeval-16.5.0/docs/leaderboards/Monolingual/polish.md +26 -0
  47. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/portuguese.md +3 -2
  48. euroeval-16.5.0/docs/leaderboards/Monolingual/slovak.md +26 -0
  49. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/spanish.md +3 -2
  50. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Monolingual/swedish.md +3 -2
  51. euroeval-16.5.0/docs/leaderboards/Multilingual/baltic.md +26 -0
  52. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Multilingual/european.md +3 -2
  53. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Multilingual/finnic.md +5 -4
  54. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Multilingual/germanic.md +3 -2
  55. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Multilingual/mainland-scandinavian.md +3 -2
  56. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/Multilingual/romance.md +3 -2
  57. euroeval-16.5.0/docs/leaderboards/Multilingual/slavic.md +26 -0
  58. euroeval-16.5.0/gfx/different-poses/pose1.png +0 -0
  59. euroeval-16.5.0/gfx/different-poses/pose2.png +0 -0
  60. euroeval-16.5.0/gfx/different-poses/pose3.png +0 -0
  61. euroeval-16.5.0/gfx/different-poses/pose4.png +0 -0
  62. euroeval-16.5.0/gfx/different-poses/pose5.png +0 -0
  63. euroeval-16.5.0/gfx/different-poses/pose6.png +0 -0
  64. {euroeval-16.3.0 → euroeval-16.5.0}/pyproject.toml +10 -5
  65. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/__init__.py +9 -2
  66. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/benchmark_config_factory.py +51 -50
  67. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/benchmark_modules/base.py +9 -21
  68. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/benchmark_modules/fresh.py +2 -1
  69. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/benchmark_modules/hf.py +101 -71
  70. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/benchmark_modules/litellm.py +115 -53
  71. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/benchmark_modules/vllm.py +107 -92
  72. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/benchmarker.py +144 -121
  73. euroeval-16.5.0/src/euroeval/caching_utils.py +79 -0
  74. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/callbacks.py +5 -7
  75. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/cli.py +86 -8
  76. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/constants.py +9 -0
  77. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/data_loading.py +80 -29
  78. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/data_models.py +338 -330
  79. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/dataset_configs/__init__.py +12 -3
  80. euroeval-16.5.0/src/euroeval/dataset_configs/bulgarian.py +56 -0
  81. euroeval-16.5.0/src/euroeval/dataset_configs/czech.py +75 -0
  82. euroeval-16.5.0/src/euroeval/dataset_configs/danish.py +148 -0
  83. euroeval-16.5.0/src/euroeval/dataset_configs/dutch.py +142 -0
  84. euroeval-16.5.0/src/euroeval/dataset_configs/english.py +132 -0
  85. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/dataset_configs/estonian.py +42 -34
  86. euroeval-16.5.0/src/euroeval/dataset_configs/faroese.py +61 -0
  87. euroeval-16.5.0/src/euroeval/dataset_configs/finnish.py +107 -0
  88. euroeval-16.5.0/src/euroeval/dataset_configs/french.py +116 -0
  89. euroeval-16.5.0/src/euroeval/dataset_configs/german.py +132 -0
  90. euroeval-16.5.0/src/euroeval/dataset_configs/greek.py +64 -0
  91. euroeval-16.5.0/src/euroeval/dataset_configs/icelandic.py +159 -0
  92. euroeval-16.5.0/src/euroeval/dataset_configs/italian.py +123 -0
  93. euroeval-16.5.0/src/euroeval/dataset_configs/latvian.py +87 -0
  94. euroeval-16.5.0/src/euroeval/dataset_configs/lithuanian.py +64 -0
  95. euroeval-16.5.0/src/euroeval/dataset_configs/norwegian.py +212 -0
  96. euroeval-16.5.0/src/euroeval/dataset_configs/polish.py +96 -0
  97. euroeval-16.5.0/src/euroeval/dataset_configs/portuguese.py +97 -0
  98. euroeval-16.5.0/src/euroeval/dataset_configs/serbian.py +64 -0
  99. euroeval-16.5.0/src/euroeval/dataset_configs/slovak.py +55 -0
  100. euroeval-16.5.0/src/euroeval/dataset_configs/spanish.py +123 -0
  101. euroeval-16.5.0/src/euroeval/dataset_configs/swedish.py +141 -0
  102. euroeval-16.5.0/src/euroeval/dataset_configs/ukrainian.py +64 -0
  103. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/exceptions.py +1 -1
  104. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/finetuning.py +24 -17
  105. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/generation.py +15 -14
  106. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/generation_utils.py +8 -8
  107. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/languages.py +395 -323
  108. euroeval-16.5.0/src/euroeval/logging_utils.py +250 -0
  109. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/metrics/base.py +0 -3
  110. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/metrics/huggingface.py +21 -6
  111. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/metrics/llm_as_a_judge.py +6 -4
  112. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/metrics/pipeline.py +17 -9
  113. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/metrics/speed.py +0 -3
  114. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/model_cache.py +17 -19
  115. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/model_config.py +4 -5
  116. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/model_loading.py +3 -0
  117. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/prompt_templates/__init__.py +2 -0
  118. euroeval-16.5.0/src/euroeval/prompt_templates/classification.py +206 -0
  119. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/prompt_templates/linguistic_acceptability.py +99 -42
  120. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/prompt_templates/multiple_choice.py +102 -38
  121. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/prompt_templates/named_entity_recognition.py +172 -51
  122. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/prompt_templates/reading_comprehension.py +119 -42
  123. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/prompt_templates/sentiment_classification.py +110 -40
  124. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/prompt_templates/summarization.py +85 -40
  125. euroeval-16.5.0/src/euroeval/prompt_templates/token_classification.py +279 -0
  126. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/scores.py +11 -10
  127. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/speed_benchmark.py +5 -6
  128. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/task_group_utils/multiple_choice_classification.py +2 -4
  129. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/task_group_utils/question_answering.py +24 -16
  130. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/task_group_utils/sequence_classification.py +48 -35
  131. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/task_group_utils/text_to_text.py +19 -9
  132. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/task_group_utils/token_classification.py +21 -17
  133. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/tasks.py +44 -1
  134. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/tokenisation_utils.py +33 -22
  135. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/types.py +10 -9
  136. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/utils.py +35 -149
  137. euroeval-16.5.0/src/scripts/__init__.py +1 -0
  138. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/constants.py +6 -0
  139. euroeval-16.5.0/src/scripts/create_bg_ner_bsnlp.py +111 -0
  140. euroeval-16.5.0/src/scripts/create_cinexio.py +249 -0
  141. euroeval-16.5.0/src/scripts/create_cross_domain_uk_reviews.py +116 -0
  142. euroeval-16.5.0/src/scripts/create_cs_gec.py +83 -0
  143. euroeval-16.5.0/src/scripts/create_csfd_sentiment.py +97 -0
  144. euroeval-16.5.0/src/scripts/create_csfd_sentiment_sk.py +92 -0
  145. euroeval-16.5.0/src/scripts/create_czech_news.py +75 -0
  146. euroeval-16.5.0/src/scripts/create_elner.py +222 -0
  147. euroeval-16.5.0/src/scripts/create_exams_bg.py +193 -0
  148. euroeval-16.5.0/src/scripts/create_global_mmlu.py +188 -0
  149. euroeval-16.5.0/src/scripts/create_greek_sa.py +130 -0
  150. euroeval-16.5.0/src/scripts/create_greek_wikipedia.py +89 -0
  151. euroeval-16.5.0/src/scripts/create_hellaswag_cs.py +120 -0
  152. euroeval-16.5.0/src/scripts/create_lithuanian_lrytas_summarization.py +87 -0
  153. euroeval-16.5.0/src/scripts/create_lr_sum.py +127 -0
  154. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_lt_history.py +13 -6
  155. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_mmlu.py +15 -1
  156. euroeval-16.5.0/src/scripts/create_mmlu_et.py +162 -0
  157. euroeval-16.5.0/src/scripts/create_mms_sr.py +110 -0
  158. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_multi_wiki_qa.py +5 -0
  159. euroeval-16.5.0/src/scripts/create_ner_uk.py +115 -0
  160. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_norglm_multiqa.py +20 -0
  161. euroeval-16.5.0/src/scripts/create_poner.py +125 -0
  162. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_scala.py +13 -1
  163. euroeval-16.3.0/src/scripts/create_swedish_skolprov.py → euroeval-16.5.0/src/scripts/create_skolprov.py +25 -18
  164. euroeval-16.5.0/src/scripts/create_sqad.py +137 -0
  165. euroeval-16.5.0/src/scripts/create_umimeto_qa.py +114 -0
  166. euroeval-16.5.0/src/scripts/create_uner_sk.py +183 -0
  167. euroeval-16.5.0/src/scripts/create_uner_sr.py +113 -0
  168. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_winogrande.py +28 -3
  169. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/load_ud_pos.py +271 -73
  170. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/versioning.py +1 -1
  171. {euroeval-16.3.0 → euroeval-16.5.0}/tests/conftest.py +32 -23
  172. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_benchmark_config_factory.py +60 -59
  173. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_benchmark_modules/test_hf.py +11 -5
  174. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_benchmarker.py +76 -63
  175. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_cli.py +5 -2
  176. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_constants.py +4 -2
  177. euroeval-16.5.0/tests/test_data_loading.py +165 -0
  178. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_data_models.py +7 -2
  179. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_dataset_configs.py +36 -0
  180. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_model_config.py +1 -0
  181. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_model_loading.py +3 -0
  182. euroeval-16.5.0/tests/test_scripts/__init__.py +1 -0
  183. euroeval-16.5.0/tests/test_scripts/test_create_scala/__init__.py +1 -0
  184. euroeval-16.5.0/tests/test_scripts/test_create_scala/test_create_scala.py +86 -0
  185. euroeval-16.5.0/tests/test_scripts/test_create_scala/test_data/de_gsd-ud-train.conllu.adp_det +12 -0
  186. euroeval-16.5.0/tests/test_scripts/test_create_scala/test_data/empty.file +0 -0
  187. euroeval-16.5.0/tests/test_scripts/test_create_scala/test_data/en_gum-ud-train.conllu.case +70 -0
  188. euroeval-16.5.0/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_01 +11 -0
  189. euroeval-16.5.0/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_02 +14 -0
  190. euroeval-16.5.0/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_03 +16 -0
  191. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_speed_benchmark.py +3 -1
  192. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_tasks.py +0 -1
  193. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_tokenisation_utils.py +12 -3
  194. {euroeval-16.3.0 → euroeval-16.5.0}/uv.lock +162 -158
  195. euroeval-16.3.0/src/euroeval/dataset_configs/danish.py +0 -186
  196. euroeval-16.3.0/src/euroeval/dataset_configs/dutch.py +0 -181
  197. euroeval-16.3.0/src/euroeval/dataset_configs/english.py +0 -164
  198. euroeval-16.3.0/src/euroeval/dataset_configs/faroese.py +0 -102
  199. euroeval-16.3.0/src/euroeval/dataset_configs/finnish.py +0 -140
  200. euroeval-16.3.0/src/euroeval/dataset_configs/french.py +0 -152
  201. euroeval-16.3.0/src/euroeval/dataset_configs/german.py +0 -169
  202. euroeval-16.3.0/src/euroeval/dataset_configs/icelandic.py +0 -196
  203. euroeval-16.3.0/src/euroeval/dataset_configs/italian.py +0 -160
  204. euroeval-16.3.0/src/euroeval/dataset_configs/latvian.py +0 -94
  205. euroeval-16.3.0/src/euroeval/dataset_configs/lithuanian.py +0 -62
  206. euroeval-16.3.0/src/euroeval/dataset_configs/norwegian.py +0 -255
  207. euroeval-16.3.0/src/euroeval/dataset_configs/polish.py +0 -124
  208. euroeval-16.3.0/src/euroeval/dataset_configs/portuguese.py +0 -130
  209. euroeval-16.3.0/src/euroeval/dataset_configs/spanish.py +0 -158
  210. euroeval-16.3.0/src/euroeval/dataset_configs/swedish.py +0 -179
  211. euroeval-16.3.0/tests/test_data_loading.py +0 -141
  212. {euroeval-16.3.0 → euroeval-16.5.0}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
  213. {euroeval-16.3.0 → euroeval-16.5.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
  214. {euroeval-16.3.0 → euroeval-16.5.0}/.github/ISSUE_TEMPLATE/language_request.yaml +0 -0
  215. {euroeval-16.3.0 → euroeval-16.5.0}/.github/workflows/ci.yaml +0 -0
  216. {euroeval-16.3.0 → euroeval-16.5.0}/.gitignore +0 -0
  217. {euroeval-16.3.0 → euroeval-16.5.0}/.markdownlint.jsonc +0 -0
  218. {euroeval-16.3.0 → euroeval-16.5.0}/CITATION.cff +0 -0
  219. {euroeval-16.3.0 → euroeval-16.5.0}/CODE_OF_CONDUCT.md +0 -0
  220. {euroeval-16.3.0 → euroeval-16.5.0}/CONTRIBUTING.md +0 -0
  221. {euroeval-16.3.0 → euroeval-16.5.0}/Dockerfile.cuda +0 -0
  222. {euroeval-16.3.0 → euroeval-16.5.0}/LICENSE +0 -0
  223. {euroeval-16.3.0 → euroeval-16.5.0}/NEW_DATASET_GUIDE.md +0 -0
  224. {euroeval-16.3.0 → euroeval-16.5.0}/docs/CNAME +0 -0
  225. {euroeval-16.3.0 → euroeval-16.5.0}/docs/README.md +0 -0
  226. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/README.md +0 -0
  227. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/faroese.md +0 -0
  228. {euroeval-16.3.0 → euroeval-16.5.0}/docs/datasets/icelandic.md +0 -0
  229. {euroeval-16.3.0 → euroeval-16.5.0}/docs/extras/radial_plotter.md +0 -0
  230. {euroeval-16.3.0 → euroeval-16.5.0}/docs/faq.md +0 -0
  231. {euroeval-16.3.0 → euroeval-16.5.0}/docs/gfx/favicon.png +0 -0
  232. {euroeval-16.3.0 → euroeval-16.5.0}/docs/leaderboards/README.md +0 -0
  233. {euroeval-16.3.0 → euroeval-16.5.0}/docs/methodology.md +0 -0
  234. {euroeval-16.3.0 → euroeval-16.5.0}/docs/python-package.md +0 -0
  235. {euroeval-16.3.0 → euroeval-16.5.0}/docs/tasks/README.md +0 -0
  236. {euroeval-16.3.0 → euroeval-16.5.0}/docs/tasks/common-sense-reasoning.md +0 -0
  237. {euroeval-16.3.0 → euroeval-16.5.0}/docs/tasks/knowledge.md +0 -0
  238. {euroeval-16.3.0 → euroeval-16.5.0}/docs/tasks/linguistic-acceptability.md +0 -0
  239. {euroeval-16.3.0 → euroeval-16.5.0}/docs/tasks/named-entity-recognition.md +0 -0
  240. {euroeval-16.3.0 → euroeval-16.5.0}/docs/tasks/reading-comprehension.md +0 -0
  241. {euroeval-16.3.0 → euroeval-16.5.0}/docs/tasks/sentiment-classification.md +0 -0
  242. {euroeval-16.3.0 → euroeval-16.5.0}/docs/tasks/speed.md +0 -0
  243. {euroeval-16.3.0 → euroeval-16.5.0}/docs/tasks/summarization.md +0 -0
  244. {euroeval-16.3.0 → euroeval-16.5.0}/gfx/euroeval.png +0 -0
  245. {euroeval-16.3.0 → euroeval-16.5.0}/gfx/euroeval.xcf +0 -0
  246. {euroeval-16.3.0 → euroeval-16.5.0}/gfx/scandeval.png +0 -0
  247. {euroeval-16.3.0 → euroeval-16.5.0}/makefile +0 -0
  248. {euroeval-16.3.0 → euroeval-16.5.0}/mkdocs.yaml +0 -0
  249. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/benchmark_modules/__init__.py +0 -0
  250. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/enums.py +0 -0
  251. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/metrics/__init__.py +0 -0
  252. {euroeval-16.3.0 → euroeval-16.5.0}/src/euroeval/task_group_utils/__init__.py +0 -0
  253. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_allocine.py +0 -0
  254. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_angry_tweets.py +0 -0
  255. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_arc.py +0 -0
  256. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_arc_is.py +0 -0
  257. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_belebele.py +0 -0
  258. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_boolq_pt.py +0 -0
  259. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_cnn_dailymail.py +0 -0
  260. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_conll_en.py +0 -0
  261. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_conll_es.py +0 -0
  262. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_conll_nl.py +0 -0
  263. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_copa_lv.py +0 -0
  264. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_dane.py +0 -0
  265. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_danish_citizen_tests.py +0 -0
  266. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_dansk.py +0 -0
  267. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_danske_talemaader.py +0 -0
  268. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_danske_talemaader_old.py +0 -0
  269. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_dbrd.py +0 -0
  270. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_dutch_cola.py +0 -0
  271. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_eltec.py +0 -0
  272. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_err_news.py +0 -0
  273. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_estner.py +0 -0
  274. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_estonian_valence.py +0 -0
  275. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_european_values.py +0 -0
  276. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_exam_et.py +0 -0
  277. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_fone.py +0 -0
  278. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_foqa.py +0 -0
  279. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_fosent.py +0 -0
  280. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_fquad.py +0 -0
  281. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_fullstack_ner.py +0 -0
  282. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_germanquad.py +0 -0
  283. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_germeval.py +0 -0
  284. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_goldenswag.py +0 -0
  285. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_grammar_et.py +0 -0
  286. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_harem.py +0 -0
  287. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_hellaswag.py +0 -0
  288. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_hellaswag_fi.py +0 -0
  289. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
  290. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_ice_linguistic.py +0 -0
  291. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_icelandic_error_corpus.py +0 -0
  292. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_icelandic_knowledge.py +0 -0
  293. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_icelandic_qa.py +0 -0
  294. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_icesum.py +0 -0
  295. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_idioms_no.py +0 -0
  296. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_ilpost_sum.py +0 -0
  297. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_jentoft.py +0 -0
  298. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_kpwr_ner.py +0 -0
  299. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_latvian_lsm_summary.py +0 -0
  300. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_latvian_twitter_sentiment.py +0 -0
  301. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_life_in_the_uk.py +0 -0
  302. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_llmzszl.py +0 -0
  303. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_lt_emotions.py +0 -0
  304. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_mim_gold_ner.py +0 -0
  305. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_mlqa_es.py +0 -0
  306. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_mlsum_de.py +0 -0
  307. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_mlsum_es.py +0 -0
  308. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_mmlu_lv.py +0 -0
  309. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_multinerd-it.py +0 -0
  310. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_no_cola.py +0 -0
  311. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_no_sammendrag.py +0 -0
  312. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_nor_common_sense_qa.py +0 -0
  313. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_nordjylland_news.py +0 -0
  314. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_norec.py +0 -0
  315. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_norglm_multisum.py +0 -0
  316. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_norne.py +0 -0
  317. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_norquad.py +0 -0
  318. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_nqii.py +0 -0
  319. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_nrk_quiz_qa.py +0 -0
  320. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_orange_sum.py +0 -0
  321. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_personal_sum.py +0 -0
  322. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_polemo2.py +0 -0
  323. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_poquad.py +0 -0
  324. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_psc.py +0 -0
  325. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_publico.py +0 -0
  326. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_rrn.py +0 -0
  327. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_sb10k.py +0 -0
  328. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_scandiqa.py +0 -0
  329. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_scandisent_fi.py +0 -0
  330. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_schibsted.py +0 -0
  331. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_sentiment_headlines_es.py +0 -0
  332. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_sentipolc16.py +0 -0
  333. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_squad.py +0 -0
  334. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_squad_it.py +0 -0
  335. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_squad_nl.py +0 -0
  336. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_squad_nl_old.py +0 -0
  337. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_sst2_pt.py +0 -0
  338. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_sst5.py +0 -0
  339. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_suc3.py +0 -0
  340. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_swedn.py +0 -0
  341. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_swerec.py +0 -0
  342. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_trivia_et.py +0 -0
  343. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_turku_ner_fi.py +0 -0
  344. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_tydiqa_fi.py +0 -0
  345. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_wiki_lingua_nl.py +0 -0
  346. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_wikiann.py +0 -0
  347. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_wikineural-it.py +0 -0
  348. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_winogrande_et.py +0 -0
  349. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_winogrande_is.py +0 -0
  350. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_xlsum_fi.py +0 -0
  351. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/create_xquad.py +0 -0
  352. {euroeval-16.3.0 → euroeval-16.5.0}/src/scripts/fix_dot_env_file.py +0 -0
  353. {euroeval-16.3.0 → euroeval-16.5.0}/tests/__init__.py +0 -0
  354. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_benchmark_modules/__init__.py +0 -0
  355. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_callbacks.py +0 -0
  356. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_enums.py +0 -0
  357. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_exceptions.py +0 -0
  358. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_finetuning.py +0 -0
  359. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_languages.py +0 -0
  360. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_scores.py +0 -0
  361. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_types.py +0 -0
  362. {euroeval-16.3.0 → euroeval-16.5.0}/tests/test_utils.py +0 -0
{euroeval-16.3.0 → euroeval-16.5.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml

@@ -24,6 +24,8 @@ body:
  label: Dataset languages
  description: What languages is the dataset in?
  options:
+ - label: Bulgarian
+ - label: Czech
  - label: Danish
  - label: Dutch
  - label: English
@@ -32,6 +34,7 @@ body:
  - label: Finnish
  - label: French
  - label: German
+ - label: Greek
  - label: Icelandic
  - label: Italian
  - label: Latvian
@@ -39,8 +42,11 @@ body:
  - label: Norwegian (Bokmål or Nynorsk)
  - label: Polish
  - label: Portuguese
+ - label: Serbian
+ - label: Slovak
  - label: Spanish
  - label: Swedish
+ - label: Ukrainian
  validations:
  required: true
  - type: textarea
{euroeval-16.3.0 → euroeval-16.5.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml

@@ -18,12 +18,13 @@ body:
  What languages should this model be evaluated on? Tick all that apply. If the
  model is multilingual (e.g., Mistral, Llama), then tick all the languages.
  options:
+ - label: Baltic languages (Latvian, Lithuanian)
+ - label: Finnic languages (Estonian, Finnish)
+ - label: Hellenic languages (Greek)
  - label: Romance languages (French, Italian, Portuguese, Spanish)
  - label: Scandinavian languages (Danish, Faroese, Icelandic, Norwegian, Swedish)
+ - label: Slavic languages (Bulgarian, Czech, Polish, Serbian, Slovak, Ukrainian)
  - label: West Germanic languages (Dutch, English, German)
- - label: Finnic languages (Estonian, Finnish)
- - label: Baltic languages (Latvian, Lithuanian)
- - label: Polish
  validations:
  required: true
  - type: dropdown
{euroeval-16.3.0 → euroeval-16.5.0}/.pre-commit-config.yaml

@@ -10,7 +10,7 @@ repos:
  - id: trailing-whitespace
  - id: debug-statements
  - repo: https://github.com/astral-sh/ruff-pre-commit
- rev: v0.13.1
+ rev: v0.14.2
  hooks:
  - id: ruff
  args:
{euroeval-16.3.0 → euroeval-16.5.0}/CHANGELOG.md

@@ -7,15 +7,133 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.

  ## [Unreleased]

+ ## [v16.5.0] - 2025-10-28
+
+ ### Added
+
+ - Added better support for evaluating on custom datasets, by allowing `DatasetConfig`
+ objects directly in the `Benchmarker.benchmark` method. We also support custom
+ datasets with the CLI, by simply defining the desired `DatasetConfig`s in a
+ `custom_datasets.py` file (path can be changed with the `--custom-datasets-file`
+ argument. In the `DatasetConfig`s we also support loading datasets from CSVs directly,
+ with the new `source` argument. This argument can both be the Hugging Face Hub ID of
+ the dataset or a dictionary with 'train', 'val' and 'test', and values the paths to
+ the CSV files.
+ - Added support for Serbian 🇷🇸! This includes the sentiment classification dataset
+ MMS-sr, the linguistic acceptability dataset ScaLA-sr, the named entity recognition
+ dataset UNER-sr, the reading comprehension dataset MultiWikiQA-sr, the summarisation
+ dataset LR-Sum-sr, the knowledge dataset MMLU-sr, and the common-sense reasoning
+ dataset Winogrande-sr. This was contributed by @oliverkinch ✨
+ - Added support for Bulgarian 🇧🇬! This includes the sentiment classification dataset
+ Cinexio, the linguistic acceptability dataset ScaLA-bg, the named entity recognition
+ dataset BG-NER-BSNLP, the reading comprehension dataset MultiWikiQA-bg, the knowledge
+ dataset Exams-bg, and the common-sense reasoning dataset Winogrande-bg. This was
+ contributed by @oliverkinch ✨
+ - Added support for Greek 🇬🇷! This includes the binary sentiment classification dataset
+ Greek-SA, the linguistic acceptability dataset ScaLA-el, the named entity recognition
+ dataset elNER, the reading comprehension dataset MultiWikiQA-el, the summarisation
+ dataset Greek-Wikipedia, the knowledge dataset Global-MMLU-el, and the common-sense
+ reasoning dataset Winogrande-el. This was contributed by @oliverkinch ✨
+ - Added support for Ukrainian 🇺🇦! This includes the sentiment classification dataset
+ Cross-Domain UK Reviews, the linguistic acceptability dataset ScaLA-uk, the named
+ entity recognition dataset NER-uk, the reading comprehension dataset MultiWikiQA-uk,
+ the summarisation dataset LR-Sum-uk, and the knowledge dataset Global-MMLU-uk. This
+ was contributed by @oliverkinch ✨
+
+ ### Changed
+
+ - Now returns all the desired results from the `Benchmarker.benchmark` method, rather
+ than only the ones that were newly computed (so we load all previous results from disk
+ as well).
+
+ ### Fixed
+
+ - Fixed the "double option" problem in Winogrande datasets across all languages.
+ Previously, option labels were duplicated for multiple languages (e.g.,
+ "Svarmuligheder:\na. Valgmulighed A: Natalie\nb. Valgmulighed B: Betty" instead of
+ just "Svarmuligheder:\na. Natalie\nb. Betty").
+ - The previous fix to close arrow writers in metrics did not work as intended, as the
+ "too many open files" error still occurred. We now ensure that the writers are closed
+ properly after each metric computation to avoid this issue.
+ - Now correctly allows specifying inference provider API keys with the `--api-key`
+ argument. Previously, this conflicted with the Hugging Face API key.
+ - Fixed an issue where some pretrained generative models required prefix spaces in the
+ labels for classification tasks, which resulted in faulty structured choice
+ generation. We now correctly take this into account, which significantly increases
+ the classification performance of these models.
+
+ ## [v16.4.0] - 2025-10-21
+
+ ### Added
+
+ - Added support for Slovak 🇸🇰! This includes the sentiment classification dataset
+ CSFD-sentiment-sk, the linguistic acceptability dataset ScaLA-sk, the named entity
+ recognition dataset UNER-sk, the reading comprehension dataset MultiWikiQA-sk, the
+ multiple-choice classification dataset MMLU-sk, and the common-sense reasoning dataset
+ Winogrande-sk. This was contributed by @oliverkinch ✨
+ - Added support for Czech 🇨🇿! This includes the sentiment classification dataset
+ CSFD-sentiment, the linguistic acceptability dataset ScaLA-cs, the linguistic
+ acceptability dataset CS-GEC, the named entity recognition dataset PONER, the reading
+ comprehension dataset SQAD, the summarization dataset Czech News, the common-sense
+ reasoning dataset HellaSwag-cs, and the knowledge dataset Umimeto-qa. This was
+ contributed by @oliverkinch ✨
+ - Added the Lithuanian summarisation dataset Lrytas based on the Lithuanian
+ public media news portal [Lrytas.lt](https://www.lrytas.lt/). This was contributed by
+ @oliverkinch ✨
+ - Added the Estonian translation of MMLU, `mmlu-et`, as an unofficial knowledge
+ dataset.
+
+ ### Changed
+
+ - Updated vLLM to `>=0.11.0`, which features several breaking changes, so we had to
+ force the minimum version. This also features support for multiple new models, such as
+ Qwen3-Next and OLMo3.
+ - Now uses MultiWikiQA-da and MultiWikiQA-sv as the official Danish and Swedish reading
+ comprehension datasets, respectively, as the quality is substantially better than
+ ScandiQA-da and ScandiQA-sv.
+ - Used 128 of the test samples from the Winogrande datasets for validation, as we
+ previously did not use a validation split. This is done for all languages except
+ Icelandic and Estonian, as these are manually translated and corrected splits from a
+ different source. Most of these are unofficial datasets and thus won't affect the
+ leaderboard rankings. The only languages for which these are official are Lithuanian
+ and Polish, which do not have official leaderboards yet - so no leaderboards are
+ affected by this change.
+ - In the same vein as the above, we now use 32 samples for validation for the Lithuanian
+ LT-history dataset and the Swedish Skolprov dataset.
+ - Changed logging styling.
+
+ ### Fixed
+
+ - If a generative model consistently does not adhere to a given JSON schema, we disable
+ structured generation for that model. This was triggered by Claude models not
+ supporting Literal types in JSON schemas.
+ - Removed "e" options from the Skolprov multiple-choice dataset, as this inconsistency
+ in number of options caused issues when evaluating models on it.
+ - Fixed an issue where an uninformative logging message was shown when a model
+ configuration could not be loaded from the Hugging Face Hub, when the model was gated.
+ We now show that this is due to the gatedness, indicating that the user should log in
+ or provide a Hugging Face Hub access token to evaluate the model.
+ - Now caches functions related to loading repo info or fetching model configs from the
+ Hugging Face Hub, to avoid repeated calls to the Hub, resulting in rate limits.
+ - When running an evaluation that required the test split (e.g., European values
+ evaluation) as the last benchmark for a given model, then subsequent models would
+ continue to be evaluated on the test split, even if the user requested to use the
+ validation split. We now reset this not just after each dataset, but also after each
+ model, so that this does not happen.
+ - Now catches more errors when evaluating LiteLLM models, which were related to some
+ generation parameters not being supported (such as stop sequences) for some models.
+ - We now clean up metric writers when we're done with them, which prevents a "too many
+ open files" error when evaluating many models and datasets in a single run.
+
  ## [v16.3.0] - 2025-09-23

  ### Added

  - Added support for Lithuanian 🇱🇹! This includes the sentiment classification dataset
- Lithuanian Emotions, the linguistic acceptability dataset ScaLA-lt, the reading
- comprehension dataset MultiWikiQA-lt, the named entity recognition dataset WikiANN-lt,
- the the history knowledge dataset LT-History, and the common-sense reasoning dataset
- Winogrande-lt. This was contributed by @oliverkinch ✨
+ Lithuanian Emotions, the linguistic acceptability dataset ScaLA-lt (unofficial), the
+ reading comprehension dataset MultiWikiQA-lt, the named entity recognition dataset
+ WikiANN-lt, the the history knowledge dataset LT-History, and the common-sense
+ reasoning dataset Winogrande-lt. This was contributed by @oliverkinch ✨
  - Added "slow-tokenizer" model parameter, which can be used to force the use of a slow
  tokenizer when loading it. Use this by replacing your model ID with
  `<model-id>#slow-tokenizer`.
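To make the custom-dataset support described in the v16.5.0 entry above concrete, here is a minimal sketch of a `custom_datasets.py` file. It follows the example that appears in the package README further down; the dataset names, CSV paths, Hub ID and labels are all placeholders, and the Hub-ID form of `source` is shown as described in the changelog, not taken from a shipped example.

```python
# custom_datasets.py -- a minimal sketch of the custom-dataset support described
# in the v16.5.0 changelog entry above. All names, paths and labels are placeholders.
from euroeval import DatasetConfig, TEXT_CLASSIFICATION
from euroeval.languages import ENGLISH

# `source` given as a dictionary mapping the 'train', 'val' and 'test' splits to
# local CSV files, each with a `text` and a `label` column.
MY_CSV_DATASET = DatasetConfig(
    name="my-csv-dataset",
    source=dict(train="train.csv", val="val.csv", test="test.csv"),
    task=TEXT_CLASSIFICATION,
    languages=[ENGLISH],
    _labels=["positive", "negative"],
)

# `source` can alternatively be the Hugging Face Hub ID of a dataset
# (hypothetical repository shown here).
MY_HUB_DATASET = DatasetConfig(
    name="my-hub-dataset",
    source="my-org/my-dataset",
    task=TEXT_CLASSIFICATION,
    languages=[ENGLISH],
    _labels=["positive", "negative"],
)
```

With such a file in place, the changelog indicates that `euroeval --dataset my-csv-dataset --model <model-id>` picks the configuration up from `custom_datasets.py`, with `--custom-datasets-file` available to point at a differently named file.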
{euroeval-16.3.0 → euroeval-16.5.0}/PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: EuroEval
- Version: 16.3.0
+ Version: 16.5.0
  Summary: The robust European language model benchmark.
  Project-URL: Repository, https://github.com/EuroEval/EuroEval
  Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -62,12 +62,12 @@ Provides-Extra: all
  Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'all'
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
  Requires-Dist: timm>=1.0.19; extra == 'all'
- Requires-Dist: vllm[flashinfer]<0.11.0,>=0.10.1; (platform_system == 'Linux') and extra == 'all'
+ Requires-Dist: vllm[flashinfer]>=0.11.0; (platform_system == 'Linux') and extra == 'all'
  Provides-Extra: generative
  Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
  Requires-Dist: timm>=1.0.19; extra == 'generative'
- Requires-Dist: vllm[flashinfer]<0.11.0,>=0.10.1; (platform_system == 'Linux') and extra == 'generative'
+ Requires-Dist: vllm[flashinfer]>=0.11.0; (platform_system == 'Linux') and extra == 'generative'
  Description-Content-Type: text/markdown

  <!-- This disables the requirement that the first line is a top-level heading -->
@@ -92,7 +92,7 @@ ______________________________________________________________________
  [![Second paper](https://img.shields.io/badge/arXiv-2406.13469-b31b1b.svg)](https://arxiv.org/abs/2406.13469)
  [![License](https://img.shields.io/github/license/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/blob/main/LICENSE)
  [![LastCommit](https://img.shields.io/github/last-commit/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/commits/main)
- [![Code Coverage](https://img.shields.io/badge/Coverage-67%25-yellow.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
+ [![Code Coverage](https://img.shields.io/badge/Coverage-76%25-yellowgreen.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
  [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg)](https://github.com/EuroEval/EuroEval/blob/main/CODE_OF_CONDUCT.md)

  ## Maintainer
@@ -113,7 +113,7 @@ when an evaluation requires a certain extra dependency, and how you install it.

  ## Quickstart

- ### Benchmarking from the Command Line
+ ### Benchmarking from the command line

  The easiest way to benchmark pretrained models is via the command line interface. After
  having installed the package, you can benchmark your favorite model like so:
@@ -160,7 +160,7 @@ See all the arguments and options available for the `euroeval` command by typing
  euroeval --help
  ```

- ### Benchmarking from a Script
+ ### Benchmarking from a script

  In a script, the syntax is similar to the command line interface. You simply initialise
  an object of the `Benchmarker` class, and call this benchmark object with your favorite
@@ -168,15 +168,19 @@ model:

  ```python
  >>> from euroeval import Benchmarker
- >>> benchmark = Benchmarker()
- >>> benchmark(model="<model-id>")
+ >>> benchmarker = Benchmarker()
+ >>> benchmarker.benchmark(model="<model-id>")
  ```

  To benchmark on a specific task and/or language, you simply specify the `task` or
  `language` arguments, shown here with same example as above:

  ```python
- >>> benchmark(model="<model-id>", task="sentiment-classification", language="da")
+ >>> benchmarker.benchmark(
+ ... model="<model-id>",
+ ... task="sentiment-classification",
+ ... language="da",
+ ... )
  ```

  If you want to benchmark a subset of all the models on the Hugging Face Hub, you can
@@ -184,10 +188,61 @@ simply leave out the `model` argument. In this example, we're benchmarking all D
  models on the Danish sentiment classification task:

  ```python
- >>> benchmark(task="sentiment-classification", language="da")
+ >>> benchmarker.benchmark(task="sentiment-classification", language="da")
  ```

- ### Benchmarking in an Offline Environment
+ ### Benchmarking from Docker
+
+ A Dockerfile is provided in the repo, which can be downloaded and run, without needing
+ to clone the repo and installing from source. This can be fetched programmatically by
+ running the following:
+
+ ```bash
+ wget https://raw.githubusercontent.com/EuroEval/EuroEval/main/Dockerfile.cuda
+ ```
+
+ Next, to be able to build the Docker image, first ensure that the NVIDIA Container
+ Toolkit is
+ [installed](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installation)
+ and
+ [configured](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker).
+ Ensure that the the CUDA version stated at the top of the Dockerfile matches the CUDA
+ version installed (which you can check using `nvidia-smi`). After that, we build the
+ image as follows:
+
+ ```bash
+ docker build --pull -t euroeval -f Dockerfile.cuda .
+ ```
+
+ With the Docker image built, we can now evaluate any model as follows:
+
+ ```bash
+ docker run -e args="<euroeval-arguments>" --gpus 1 --name euroeval --rm euroeval
+ ```
+
+ Here `<euroeval-arguments>` consists of the arguments added to the `euroeval` CLI
+ argument. This could for instance be `--model <model-id> --task
+ sentiment-classification`.
+
+ ## Benchmarking custom inference APIs
+
+ If the model you want to benchmark is hosted by a custom inference provider, such as a
+ [vLLM server](https://docs.vllm.ai/en/stable/), then this is also supported in EuroEval.
+ When benchmarking, you simply have to set the `--api-base` argument (`api_base` when
+ using the `Benchmarker` API) to the URL of the inference API, and optionally the
+ `--api-key` argument (`api_key`) to the API key, if authentication is required.
+
+ When benchmarking models hosted on a custom inference API, the model ID
+ (`--model`/`model`) should be the model name as registered on the inference server,
+ potentially with a required prefix, depending on the type of inference server used. For
+ instance, if the model is hosted on a vLLM server, the model ID should be prefixed with
+ `hosted_vllm/`, and if the model is hosted on an Ollama server, the model ID should be
+ prefixed with `ollama_chat/`. See the full list of possible inference providers as well
+ as their corresponding prefixes in the [LiteLLM
+ documentation](https://docs.litellm.ai/docs/providers/), as EuroEval uses LiteLLM to
+ handle evaluation of inference APIs in general.
+
+ ## Benchmarking in an offline environment

  If you need to benchmark in an offline environment, you need to download the models,
  datasets and metrics beforehand. This can be done by adding the `--download-only`
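The "Benchmarking custom inference APIs" section added in the hunk above has no accompanying snippet, so here is a minimal sketch. It assumes a vLLM server and that the `Benchmarker` constructor accepts the `api_base` and `api_key` arguments named in that section; the URL, key and model name are placeholders.

```python
# A minimal sketch of benchmarking a model served by a custom inference API,
# here assumed to be a vLLM server (hence the `hosted_vllm/` model-ID prefix).
from euroeval import Benchmarker

benchmarker = Benchmarker(
    api_base="http://localhost:8000/v1",  # URL of the inference API (placeholder)
    api_key="<api-key>",  # only needed if the server requires authentication
)
benchmarker.benchmark(model="hosted_vllm/<model-name>")
```

A model hosted on an Ollama server would instead use the `ollama_chat/` prefix, per the LiteLLM provider list referenced above.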
@@ -202,7 +257,7 @@ euroeval --model <model-id> --task sentiment-classification --language da --down
  Or from a script:

  ```python
- >>> benchmark(
+ >>> benchmarker.benchmark(
  ... model="<model-id>",
  ... task="sentiment-classification",
  ... language="da",
@@ -210,44 +265,139 @@ Or from a script:
  ... )
  ```

- Please note: Offline benchmarking of adapter models is not currently supported. An
- internet connection will be required during evaluation. If offline support is important
- to you, please consider [opening an issue](https://github.com/EuroEval/EuroEval/issues).
+ Please note: Offline benchmarking of adapter models is not currently supported, meaning
+ that we still require an internet connection during the evaluation of these. If offline
+ support of adapters is important to you, please consider [opening an
+ issue](https://github.com/EuroEval/EuroEval/issues).

- ### Benchmarking from Docker
+ ## Benchmarking custom datasets

- A Dockerfile is provided in the repo, which can be downloaded and run, without needing
- to clone the repo and installing from source. This can be fetched programmatically by
- running the following:
+ If you want to benchmark models on your own custom dataset, this is also possible.
+ First, you need to set up your dataset to be compatible with EuroEval. This means
+ splitting up your dataset in a training, validation and test split, and ensuring that
+ the column names are correct. We use `text` as the column name for the input text, and
+ the output column name depends on the type of task:

- ```bash
- wget https://raw.githubusercontent.com/EuroEval/EuroEval/main/Dockerfile.cuda
+ - **Text or multiple-choice classification**: `label`
+ - **Token classification**: `labels`
+ - **Reading comprehension**: `answers`
+ - **Free-form text generation**: `target_text`
+
+ Text and multiple-choice classification tasks are by far the most common. Next, you
+ store your three dataset splits as three different CSV files with the desired two
+ columns. Finally, you create a file called `custom_datasets.py` script in which you
+ define the associated `DatasetConfig` objects for your dataset. Here is an example of a
+ simple text classification dataset with two classes:
+
+ ```python
+ from euroeval import DatasetConfig, TEXT_CLASSIFICATION
+ from euroeval.languages import ENGLISH
+
+ MY_CONFIG = DatasetConfig(
+ name="my-dataset",
+ source=dict(train="train.csv", val="val.csv", test="test.csv"),
+ task=TEXT_CLASSIFICATION,
+ languages=[ENGLISH],
+ _labels=["positive", "negative"],
+ )
  ```

- Next, to be able to build the Docker image, first ensure that the NVIDIA Container
- Toolkit is
- [installed](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installation)
- and
- [configured](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-docker).
- Ensure that the the CUDA version stated at the top of the Dockerfile matches the CUDA
- version installed (which you can check using `nvidia-smi`). After that, we build the
- image as follows:
+ You can then benchmark your custom dataset by simply running

  ```bash
- docker build --pull -t euroeval -f Dockerfile.cuda .
+ euroeval --dataset my-dataset --model <model-id>
  ```

- With the Docker image built, we can now evaluate any model as follows:
+ You can also run the benchmark from a Python script, by simply providing your custom
+ dataset configuration directly into the `benchmark` method:

- ```bash
- docker run -e args="<euroeval-arguments>" --gpus 1 --name euroeval --rm euroeval
+ ```python
+ from euroeval import Benchmarker
+
+ benchmarker = Benchmarker()
+ benchmarker.benchmark(model="<model-id>", dataset=MY_CONFIG)
  ```

- Here `<euroeval-arguments>` consists of the arguments added to the `euroeval` CLI
- argument. This could for instance be `--model <model-id> --task
- sentiment-classification`.
+ We have included three convenience tasks to make it easier to set up custom datasets:
+
+ - `TEXT_CLASSIFICATION`, which is used for text classification tasks. This requires you
+ to set the `_labels` argument in the `DatasetConfig`, and requires the columns `text`
+ and `label` to be present in the dataset.
+ - `MULTIPLE_CHOICE`, which is used for multiple-choice classification tasks. This
+ also requires you to set the `_labels` argument in the `DatasetConfig`. Note that for
+ multiple choice tasks, you need to set up your `text` column to also list all the
+ choices, and all the samples should have the same number of choices. This requires the
+ columns `text` and `label` to be present in the dataset.
+ - `TOKEN_CLASSIFICATION`, which is used when classifying individual tokens in a text.
+ This also require you to set the `_labels` argument in the `DatasetConfig`. This
+ requires the columns `tokens` and `labels` to be present in the dataset, where
+ `tokens` is a list of tokens/words in the text, and `labels` is a list of the
+ corresponding labels for each token (so the two lists have the same length).
+
+ On top of these three convenience tasks, there are of course also the tasks that we use
+ in the official benchmark, which you can use if you want to use one of these tasks with
+ your own bespoke dataset:
+
+ - `LA`, for linguistic acceptability datasets.
+ - `NER`, for named entity recognition datasets with the standard BIO tagging scheme.
+ - `RC`, for reading comprehension datasets in the SQuAD format.
+ - `SENT`, for sentiment classification datasets.
+ - `SUMM`, for text summarisation datasets.
+ - `KNOW`, for multiple-choice knowledge datasets (e.g., MMLU).
+ - `MCRC`, for multiple-choice reading comprehension datasets (e.g., Belebele).
+ - `COMMON_SENSE`, for multiple-choice common-sense reasoning datasets (e.g., HellaSwag).
+
+ These can all be imported from `euroeval.tasks` module.
+
+ ### Creating your own custom task
+
+ You are of course also free to define your own task from scratch, which allows you to
+ customise the prompts used when evaluating generative models, for instance. Here is an
+ example of a custom free-form text generation task, where the goal for the model is to
+ generate a SQL query based on a natural language input:
+
+ ```python
+ from euroeval import DatasetConfig
+ from euroeval.data_models import Task, PromptConfig
+ from euroeval.enums import TaskGroup, ModelType
+ from euroeval.languages import ENGLISH
+ from euroeval.metrics import rouge_l_metric
+
+ sql_generation_task = Task(
+ name="sql-generation",
+ task_group=TaskGroup.TEXT_TO_TEXT,
+ template_dict={
+ ENGLISH: PromptConfig(
+ default_prompt_prefix="The following are natural language texts and their "
+ "corresponding SQL queries.",
+ default_prompt_template="Natural language query: {text}\nSQL query: "
+ "{target_text}",
+ default_instruction_prompt="Generate the SQL query for the following "
+ "natural language query:\n{text!r}",
+ default_prompt_label_mapping=dict(),
+ ),
+ },
+ metrics=[rouge_l_metric],
+ default_num_few_shot_examples=3,
+ default_max_generated_tokens=256,
+ default_allowed_model_types=[ModelType.GENERATIVE],
+ )
+
+ MY_SQL_DATASET = DatasetConfig(
+ name="my-sql-dataset",
+ source=dict(train="train.csv", val="val.csv", test="test.csv"),
+ task=sql_generation_task,
+ languages=[ENGLISH],
+ )
+ ```
+
+ Again, with this you can benchmark your custom dataset by simply running
+
+ ```bash
+ euroeval --dataset my-sql-dataset --model <model-id>
+ ```

- ### Reproducing the datasets
+ ## Reproducing the evaluation datasets

  All datasets used in this project are generated using the scripts located in the
  [src/scripts](src/scripts) folder. To reproduce a dataset, run the corresponding script
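Of the convenience tasks listed in the hunk above, `MULTIPLE_CHOICE` is the only one without an accompanying example in the README, so here is a minimal sketch. It assumes `MULTIPLE_CHOICE` is importable from `euroeval` alongside `TEXT_CLASSIFICATION`, and that the option letters serve as the labels; the option formatting inside the `text` column, and all names and paths, are illustrative only.

```python
# A minimal sketch of a custom multiple-choice dataset using the MULTIPLE_CHOICE
# convenience task described above. Letter labels and the option formatting in
# the `text` column are assumptions, not taken from the package documentation.
from euroeval import DatasetConfig, MULTIPLE_CHOICE
from euroeval.languages import ENGLISH

# Each row's `text` column contains the question together with all of its options
# (every sample must have the same number of options), e.g.:
#   "Which planet is closest to the sun?\nOptions:\na. Venus\nb. Mercury\nc. Mars"
# and the `label` column holds the correct option letter (here: "b").
MY_MC_DATASET = DatasetConfig(
    name="my-mc-dataset",
    source=dict(train="train.csv", val="val.csv", test="test.csv"),
    task=MULTIPLE_CHOICE,
    languages=[ENGLISH],
    _labels=["a", "b", "c"],
)
```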
@@ -379,6 +529,13 @@ A huge thank you to all the contributors who have helped make this project a suc
  alt="Contributor avatar for slowwavesleep"
  />
  </a>
+ <a href="https://github.com/mrkowalski">
+ <img
+ src="https://avatars.githubusercontent.com/u/6357044"
+ width=50
+ alt="Contributor avatar for mrkowalski"
+ />
+ </a>

  ### Contribute to EuroEval

@@ -390,7 +547,7 @@ contributing new datasets, your help makes this project better for everyone.
  - **Adding datasets**: If you're interested in adding a new dataset to EuroEval, we have
  a [dedicated guide](NEW_DATASET_GUIDE.md) with step-by-step instructions.

- ### Special Thanks
+ ### Special thanks

  - Thanks to [Google](https://google.com/) for sponsoring Gemini credits as part of their
  [Google Cloud for Researchers Program](https://cloud.google.com/edu/researchers).
@@ -401,7 +558,7 @@ contributing new datasets, your help makes this project better for everyone.
  - Thanks to [UWV](https://www.uwv.nl/) and [KU
  Leuven](https://www.arts.kuleuven.be/ling/ccl) for sponsoring the Azure OpenAI
  credits used to evaluate GPT-4-turbo in Dutch.
- - Thanks to [Miðeind](https://mideind.is/english.html) for sponsoring the OpenAI
+ - Thanks to [Miðeind](https://mideind.is/en) for sponsoring the OpenAI
  credits used to evaluate GPT-4-turbo in Icelandic and Faroese.
  - Thanks to [CHC](https://chc.au.dk/) for sponsoring the OpenAI credits used to
  evaluate GPT-4-turbo in German.