EuroEval 16.3.0__tar.gz → 16.4.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of EuroEval might be problematic.

Files changed (313)
  1. {euroeval-16.3.0 → euroeval-16.4.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +2 -0
  2. {euroeval-16.3.0 → euroeval-16.4.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +3 -3
  3. {euroeval-16.3.0 → euroeval-16.4.0}/.pre-commit-config.yaml +1 -1
  4. {euroeval-16.3.0 → euroeval-16.4.0}/CHANGELOG.md +67 -4
  5. {euroeval-16.3.0 → euroeval-16.4.0}/PKG-INFO +4 -4
  6. {euroeval-16.3.0 → euroeval-16.4.0}/README.md +1 -1
  7. euroeval-16.4.0/docs/datasets/czech.md +671 -0
  8. {euroeval-16.3.0 → euroeval-16.4.0}/docs/datasets/danish.md +81 -80
  9. {euroeval-16.3.0 → euroeval-16.4.0}/docs/datasets/dutch.md +2 -1
  10. {euroeval-16.3.0 → euroeval-16.4.0}/docs/datasets/english.md +2 -1
  11. {euroeval-16.3.0 → euroeval-16.4.0}/docs/datasets/estonian.md +76 -0
  12. {euroeval-16.3.0 → euroeval-16.4.0}/docs/datasets/finnish.md +2 -1
  13. {euroeval-16.3.0 → euroeval-16.4.0}/docs/datasets/french.md +2 -1
  14. {euroeval-16.3.0 → euroeval-16.4.0}/docs/datasets/german.md +2 -1
  15. {euroeval-16.3.0 → euroeval-16.4.0}/docs/datasets/italian.md +2 -1
  16. {euroeval-16.3.0 → euroeval-16.4.0}/docs/datasets/latvian.md +2 -1
  17. {euroeval-16.3.0 → euroeval-16.4.0}/docs/datasets/lithuanian.md +69 -4
  18. {euroeval-16.3.0 → euroeval-16.4.0}/docs/datasets/norwegian.md +2 -1
  19. {euroeval-16.3.0 → euroeval-16.4.0}/docs/datasets/polish.md +30 -28
  20. {euroeval-16.3.0 → euroeval-16.4.0}/docs/datasets/portuguese.md +2 -1
  21. euroeval-16.4.0/docs/datasets/slovak.md +446 -0
  22. {euroeval-16.3.0 → euroeval-16.4.0}/docs/datasets/spanish.md +2 -1
  23. {euroeval-16.3.0 → euroeval-16.4.0}/docs/datasets/swedish.md +83 -82
  24. euroeval-16.4.0/docs/leaderboards/Monolingual/czech.md +26 -0
  25. {euroeval-16.3.0 → euroeval-16.4.0}/docs/leaderboards/Monolingual/danish.md +3 -2
  26. {euroeval-16.3.0 → euroeval-16.4.0}/docs/leaderboards/Monolingual/dutch.md +3 -2
  27. {euroeval-16.3.0 → euroeval-16.4.0}/docs/leaderboards/Monolingual/english.md +3 -2
  28. {euroeval-16.3.0 → euroeval-16.4.0}/docs/leaderboards/Monolingual/estonian.md +3 -2
  29. {euroeval-16.3.0 → euroeval-16.4.0}/docs/leaderboards/Monolingual/faroese.md +3 -2
  30. {euroeval-16.3.0 → euroeval-16.4.0}/docs/leaderboards/Monolingual/finnish.md +3 -2
  31. {euroeval-16.3.0 → euroeval-16.4.0}/docs/leaderboards/Monolingual/french.md +3 -2
  32. {euroeval-16.3.0 → euroeval-16.4.0}/docs/leaderboards/Monolingual/german.md +3 -2
  33. {euroeval-16.3.0 → euroeval-16.4.0}/docs/leaderboards/Monolingual/icelandic.md +3 -2
  34. {euroeval-16.3.0 → euroeval-16.4.0}/docs/leaderboards/Monolingual/italian.md +3 -2
  35. {euroeval-16.3.0 → euroeval-16.4.0}/docs/leaderboards/Monolingual/latvian.md +3 -2
  36. euroeval-16.4.0/docs/leaderboards/Monolingual/lithuanian.md +26 -0
  37. {euroeval-16.3.0 → euroeval-16.4.0}/docs/leaderboards/Monolingual/norwegian.md +3 -2
  38. euroeval-16.4.0/docs/leaderboards/Monolingual/polish.md +26 -0
  39. {euroeval-16.3.0 → euroeval-16.4.0}/docs/leaderboards/Monolingual/portuguese.md +3 -2
  40. {euroeval-16.3.0 → euroeval-16.4.0}/docs/leaderboards/Monolingual/spanish.md +3 -2
  41. {euroeval-16.3.0 → euroeval-16.4.0}/docs/leaderboards/Monolingual/swedish.md +3 -2
  42. euroeval-16.4.0/docs/leaderboards/Multilingual/baltic.md +26 -0
  43. {euroeval-16.3.0 → euroeval-16.4.0}/docs/leaderboards/Multilingual/european.md +3 -2
  44. {euroeval-16.3.0 → euroeval-16.4.0}/docs/leaderboards/Multilingual/finnic.md +5 -4
  45. {euroeval-16.3.0 → euroeval-16.4.0}/docs/leaderboards/Multilingual/germanic.md +3 -2
  46. {euroeval-16.3.0 → euroeval-16.4.0}/docs/leaderboards/Multilingual/mainland-scandinavian.md +3 -2
  47. {euroeval-16.3.0 → euroeval-16.4.0}/docs/leaderboards/Multilingual/romance.md +3 -2
  48. euroeval-16.4.0/docs/leaderboards/Multilingual/slavic.md +26 -0
  49. {euroeval-16.3.0 → euroeval-16.4.0}/pyproject.toml +8 -3
  50. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/__init__.py +3 -2
  51. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/benchmark_config_factory.py +0 -4
  52. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/benchmark_modules/base.py +3 -16
  53. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/benchmark_modules/fresh.py +2 -1
  54. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/benchmark_modules/hf.py +99 -62
  55. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/benchmark_modules/litellm.py +101 -41
  56. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/benchmark_modules/vllm.py +91 -83
  57. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/benchmarker.py +84 -78
  58. euroeval-16.4.0/src/euroeval/caching_utils.py +79 -0
  59. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/callbacks.py +5 -7
  60. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/constants.py +6 -0
  61. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/data_loading.py +14 -11
  62. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/data_models.py +12 -4
  63. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/dataset_configs/__init__.py +2 -0
  64. euroeval-16.4.0/src/euroeval/dataset_configs/czech.py +79 -0
  65. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/dataset_configs/danish.py +10 -11
  66. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/dataset_configs/dutch.py +0 -1
  67. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/dataset_configs/english.py +0 -1
  68. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/dataset_configs/estonian.py +11 -1
  69. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/dataset_configs/finnish.py +0 -1
  70. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/dataset_configs/french.py +0 -1
  71. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/dataset_configs/german.py +0 -1
  72. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/dataset_configs/italian.py +0 -1
  73. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/dataset_configs/latvian.py +0 -1
  74. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/dataset_configs/lithuanian.py +9 -3
  75. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/dataset_configs/norwegian.py +0 -1
  76. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/dataset_configs/polish.py +0 -1
  77. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/dataset_configs/portuguese.py +0 -1
  78. euroeval-16.4.0/src/euroeval/dataset_configs/slovak.py +60 -0
  79. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/dataset_configs/spanish.py +0 -1
  80. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/dataset_configs/swedish.py +10 -12
  81. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/finetuning.py +21 -15
  82. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/generation.py +10 -10
  83. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/generation_utils.py +2 -3
  84. euroeval-16.4.0/src/euroeval/logging_utils.py +250 -0
  85. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/metrics/base.py +0 -3
  86. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/metrics/huggingface.py +9 -5
  87. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/metrics/llm_as_a_judge.py +5 -3
  88. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/metrics/pipeline.py +17 -9
  89. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/metrics/speed.py +0 -3
  90. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/model_cache.py +11 -14
  91. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/model_config.py +4 -5
  92. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/model_loading.py +3 -0
  93. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/prompt_templates/linguistic_acceptability.py +21 -3
  94. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/prompt_templates/multiple_choice.py +25 -1
  95. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/prompt_templates/named_entity_recognition.py +51 -11
  96. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/prompt_templates/reading_comprehension.py +31 -3
  97. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/prompt_templates/sentiment_classification.py +23 -1
  98. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/prompt_templates/summarization.py +26 -6
  99. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/scores.py +7 -7
  100. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/speed_benchmark.py +3 -5
  101. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/task_group_utils/multiple_choice_classification.py +0 -3
  102. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/task_group_utils/question_answering.py +0 -3
  103. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/task_group_utils/sequence_classification.py +43 -31
  104. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/task_group_utils/text_to_text.py +17 -8
  105. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/task_group_utils/token_classification.py +10 -9
  106. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/tokenisation_utils.py +14 -12
  107. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/utils.py +29 -146
  108. euroeval-16.4.0/src/scripts/__init__.py +1 -0
  109. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/constants.py +2 -0
  110. euroeval-16.4.0/src/scripts/create_cs_gec.py +83 -0
  111. euroeval-16.4.0/src/scripts/create_csfd_sentiment.py +97 -0
  112. euroeval-16.4.0/src/scripts/create_csfd_sentiment_sk.py +92 -0
  113. euroeval-16.4.0/src/scripts/create_czech_news.py +75 -0
  114. euroeval-16.4.0/src/scripts/create_hellaswag_cs.py +120 -0
  115. euroeval-16.4.0/src/scripts/create_lithuanian_lrytas_summarization.py +87 -0
  116. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_lt_history.py +13 -6
  117. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_mmlu.py +1 -1
  118. euroeval-16.4.0/src/scripts/create_mmlu_et.py +162 -0
  119. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_multi_wiki_qa.py +1 -0
  120. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_norglm_multiqa.py +20 -0
  121. euroeval-16.4.0/src/scripts/create_poner.py +125 -0
  122. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_scala.py +4 -0
  123. euroeval-16.3.0/src/scripts/create_swedish_skolprov.py → euroeval-16.4.0/src/scripts/create_skolprov.py +25 -18
  124. euroeval-16.4.0/src/scripts/create_sqad.py +137 -0
  125. euroeval-16.4.0/src/scripts/create_umimeto_qa.py +114 -0
  126. euroeval-16.4.0/src/scripts/create_uner_sk.py +183 -0
  127. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_winogrande.py +20 -1
  128. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/load_ud_pos.py +199 -73
  129. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/versioning.py +1 -1
  130. {euroeval-16.3.0 → euroeval-16.4.0}/tests/conftest.py +14 -11
  131. {euroeval-16.3.0 → euroeval-16.4.0}/tests/test_benchmark_modules/test_hf.py +11 -5
  132. {euroeval-16.3.0 → euroeval-16.4.0}/tests/test_benchmarker.py +10 -4
  133. {euroeval-16.3.0 → euroeval-16.4.0}/tests/test_constants.py +4 -2
  134. euroeval-16.4.0/tests/test_data_loading.py +166 -0
  135. {euroeval-16.3.0 → euroeval-16.4.0}/tests/test_data_models.py +2 -1
  136. {euroeval-16.3.0 → euroeval-16.4.0}/tests/test_dataset_configs.py +36 -0
  137. {euroeval-16.3.0 → euroeval-16.4.0}/tests/test_model_config.py +1 -0
  138. {euroeval-16.3.0 → euroeval-16.4.0}/tests/test_model_loading.py +3 -0
  139. euroeval-16.4.0/tests/test_scripts/__init__.py +1 -0
  140. euroeval-16.4.0/tests/test_scripts/test_create_scala/__init__.py +1 -0
  141. euroeval-16.4.0/tests/test_scripts/test_create_scala/test_create_scala.py +86 -0
  142. euroeval-16.4.0/tests/test_scripts/test_create_scala/test_data/de_gsd-ud-train.conllu.adp_det +12 -0
  143. euroeval-16.4.0/tests/test_scripts/test_create_scala/test_data/empty.file +0 -0
  144. euroeval-16.4.0/tests/test_scripts/test_create_scala/test_data/en_gum-ud-train.conllu.case +70 -0
  145. euroeval-16.4.0/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_01 +11 -0
  146. euroeval-16.4.0/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_02 +14 -0
  147. euroeval-16.4.0/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_03 +16 -0
  148. {euroeval-16.3.0 → euroeval-16.4.0}/tests/test_speed_benchmark.py +1 -0
  149. {euroeval-16.3.0 → euroeval-16.4.0}/tests/test_tokenisation_utils.py +6 -2
  150. {euroeval-16.3.0 → euroeval-16.4.0}/uv.lock +142 -152
  151. euroeval-16.3.0/tests/test_data_loading.py +0 -141
  152. {euroeval-16.3.0 → euroeval-16.4.0}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
  153. {euroeval-16.3.0 → euroeval-16.4.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
  154. {euroeval-16.3.0 → euroeval-16.4.0}/.github/ISSUE_TEMPLATE/language_request.yaml +0 -0
  155. {euroeval-16.3.0 → euroeval-16.4.0}/.github/workflows/ci.yaml +0 -0
  156. {euroeval-16.3.0 → euroeval-16.4.0}/.gitignore +0 -0
  157. {euroeval-16.3.0 → euroeval-16.4.0}/.markdownlint.jsonc +0 -0
  158. {euroeval-16.3.0 → euroeval-16.4.0}/CITATION.cff +0 -0
  159. {euroeval-16.3.0 → euroeval-16.4.0}/CODE_OF_CONDUCT.md +0 -0
  160. {euroeval-16.3.0 → euroeval-16.4.0}/CONTRIBUTING.md +0 -0
  161. {euroeval-16.3.0 → euroeval-16.4.0}/Dockerfile.cuda +0 -0
  162. {euroeval-16.3.0 → euroeval-16.4.0}/LICENSE +0 -0
  163. {euroeval-16.3.0 → euroeval-16.4.0}/NEW_DATASET_GUIDE.md +0 -0
  164. {euroeval-16.3.0 → euroeval-16.4.0}/docs/CNAME +0 -0
  165. {euroeval-16.3.0 → euroeval-16.4.0}/docs/README.md +0 -0
  166. {euroeval-16.3.0 → euroeval-16.4.0}/docs/datasets/README.md +0 -0
  167. {euroeval-16.3.0 → euroeval-16.4.0}/docs/datasets/faroese.md +0 -0
  168. {euroeval-16.3.0 → euroeval-16.4.0}/docs/datasets/icelandic.md +0 -0
  169. {euroeval-16.3.0 → euroeval-16.4.0}/docs/extras/radial_plotter.md +0 -0
  170. {euroeval-16.3.0 → euroeval-16.4.0}/docs/faq.md +0 -0
  171. {euroeval-16.3.0 → euroeval-16.4.0}/docs/gfx/favicon.png +0 -0
  172. {euroeval-16.3.0 → euroeval-16.4.0}/docs/leaderboards/README.md +0 -0
  173. {euroeval-16.3.0 → euroeval-16.4.0}/docs/methodology.md +0 -0
  174. {euroeval-16.3.0 → euroeval-16.4.0}/docs/python-package.md +0 -0
  175. {euroeval-16.3.0 → euroeval-16.4.0}/docs/tasks/README.md +0 -0
  176. {euroeval-16.3.0 → euroeval-16.4.0}/docs/tasks/common-sense-reasoning.md +0 -0
  177. {euroeval-16.3.0 → euroeval-16.4.0}/docs/tasks/knowledge.md +0 -0
  178. {euroeval-16.3.0 → euroeval-16.4.0}/docs/tasks/linguistic-acceptability.md +0 -0
  179. {euroeval-16.3.0 → euroeval-16.4.0}/docs/tasks/named-entity-recognition.md +0 -0
  180. {euroeval-16.3.0 → euroeval-16.4.0}/docs/tasks/reading-comprehension.md +0 -0
  181. {euroeval-16.3.0 → euroeval-16.4.0}/docs/tasks/sentiment-classification.md +0 -0
  182. {euroeval-16.3.0 → euroeval-16.4.0}/docs/tasks/speed.md +0 -0
  183. {euroeval-16.3.0 → euroeval-16.4.0}/docs/tasks/summarization.md +0 -0
  184. {euroeval-16.3.0 → euroeval-16.4.0}/gfx/euroeval.png +0 -0
  185. {euroeval-16.3.0 → euroeval-16.4.0}/gfx/euroeval.xcf +0 -0
  186. {euroeval-16.3.0 → euroeval-16.4.0}/gfx/scandeval.png +0 -0
  187. {euroeval-16.3.0 → euroeval-16.4.0}/makefile +0 -0
  188. {euroeval-16.3.0 → euroeval-16.4.0}/mkdocs.yaml +0 -0
  189. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/benchmark_modules/__init__.py +0 -0
  190. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/cli.py +0 -0
  191. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/dataset_configs/faroese.py +0 -0
  192. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/dataset_configs/icelandic.py +0 -0
  193. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/enums.py +0 -0
  194. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/exceptions.py +0 -0
  195. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/languages.py +0 -0
  196. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/metrics/__init__.py +0 -0
  197. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/prompt_templates/__init__.py +0 -0
  198. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/task_group_utils/__init__.py +0 -0
  199. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/tasks.py +0 -0
  200. {euroeval-16.3.0 → euroeval-16.4.0}/src/euroeval/types.py +0 -0
  201. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_allocine.py +0 -0
  202. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_angry_tweets.py +0 -0
  203. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_arc.py +0 -0
  204. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_arc_is.py +0 -0
  205. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_belebele.py +0 -0
  206. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_boolq_pt.py +0 -0
  207. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_cnn_dailymail.py +0 -0
  208. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_conll_en.py +0 -0
  209. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_conll_es.py +0 -0
  210. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_conll_nl.py +0 -0
  211. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_copa_lv.py +0 -0
  212. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_dane.py +0 -0
  213. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_danish_citizen_tests.py +0 -0
  214. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_dansk.py +0 -0
  215. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_danske_talemaader.py +0 -0
  216. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_danske_talemaader_old.py +0 -0
  217. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_dbrd.py +0 -0
  218. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_dutch_cola.py +0 -0
  219. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_eltec.py +0 -0
  220. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_err_news.py +0 -0
  221. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_estner.py +0 -0
  222. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_estonian_valence.py +0 -0
  223. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_european_values.py +0 -0
  224. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_exam_et.py +0 -0
  225. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_fone.py +0 -0
  226. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_foqa.py +0 -0
  227. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_fosent.py +0 -0
  228. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_fquad.py +0 -0
  229. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_fullstack_ner.py +0 -0
  230. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_germanquad.py +0 -0
  231. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_germeval.py +0 -0
  232. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_goldenswag.py +0 -0
  233. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_grammar_et.py +0 -0
  234. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_harem.py +0 -0
  235. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_hellaswag.py +0 -0
  236. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_hellaswag_fi.py +0 -0
  237. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
  238. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_ice_linguistic.py +0 -0
  239. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_icelandic_error_corpus.py +0 -0
  240. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_icelandic_knowledge.py +0 -0
  241. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_icelandic_qa.py +0 -0
  242. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_icesum.py +0 -0
  243. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_idioms_no.py +0 -0
  244. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_ilpost_sum.py +0 -0
  245. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_jentoft.py +0 -0
  246. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_kpwr_ner.py +0 -0
  247. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_latvian_lsm_summary.py +0 -0
  248. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_latvian_twitter_sentiment.py +0 -0
  249. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_life_in_the_uk.py +0 -0
  250. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_llmzszl.py +0 -0
  251. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_lt_emotions.py +0 -0
  252. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_mim_gold_ner.py +0 -0
  253. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_mlqa_es.py +0 -0
  254. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_mlsum_de.py +0 -0
  255. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_mlsum_es.py +0 -0
  256. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_mmlu_lv.py +0 -0
  257. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_multinerd-it.py +0 -0
  258. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_no_cola.py +0 -0
  259. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_no_sammendrag.py +0 -0
  260. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_nor_common_sense_qa.py +0 -0
  261. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_nordjylland_news.py +0 -0
  262. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_norec.py +0 -0
  263. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_norglm_multisum.py +0 -0
  264. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_norne.py +0 -0
  265. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_norquad.py +0 -0
  266. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_nqii.py +0 -0
  267. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_nrk_quiz_qa.py +0 -0
  268. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_orange_sum.py +0 -0
  269. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_personal_sum.py +0 -0
  270. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_polemo2.py +0 -0
  271. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_poquad.py +0 -0
  272. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_psc.py +0 -0
  273. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_publico.py +0 -0
  274. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_rrn.py +0 -0
  275. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_sb10k.py +0 -0
  276. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_scandiqa.py +0 -0
  277. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_scandisent_fi.py +0 -0
  278. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_schibsted.py +0 -0
  279. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_sentiment_headlines_es.py +0 -0
  280. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_sentipolc16.py +0 -0
  281. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_squad.py +0 -0
  282. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_squad_it.py +0 -0
  283. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_squad_nl.py +0 -0
  284. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_squad_nl_old.py +0 -0
  285. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_sst2_pt.py +0 -0
  286. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_sst5.py +0 -0
  287. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_suc3.py +0 -0
  288. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_swedn.py +0 -0
  289. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_swerec.py +0 -0
  290. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_trivia_et.py +0 -0
  291. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_turku_ner_fi.py +0 -0
  292. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_tydiqa_fi.py +0 -0
  293. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_wiki_lingua_nl.py +0 -0
  294. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_wikiann.py +0 -0
  295. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_wikineural-it.py +0 -0
  296. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_winogrande_et.py +0 -0
  297. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_winogrande_is.py +0 -0
  298. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_xlsum_fi.py +0 -0
  299. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/create_xquad.py +0 -0
  300. {euroeval-16.3.0 → euroeval-16.4.0}/src/scripts/fix_dot_env_file.py +0 -0
  301. {euroeval-16.3.0 → euroeval-16.4.0}/tests/__init__.py +0 -0
  302. {euroeval-16.3.0 → euroeval-16.4.0}/tests/test_benchmark_config_factory.py +0 -0
  303. {euroeval-16.3.0 → euroeval-16.4.0}/tests/test_benchmark_modules/__init__.py +0 -0
  304. {euroeval-16.3.0 → euroeval-16.4.0}/tests/test_callbacks.py +0 -0
  305. {euroeval-16.3.0 → euroeval-16.4.0}/tests/test_cli.py +0 -0
  306. {euroeval-16.3.0 → euroeval-16.4.0}/tests/test_enums.py +0 -0
  307. {euroeval-16.3.0 → euroeval-16.4.0}/tests/test_exceptions.py +0 -0
  308. {euroeval-16.3.0 → euroeval-16.4.0}/tests/test_finetuning.py +0 -0
  309. {euroeval-16.3.0 → euroeval-16.4.0}/tests/test_languages.py +0 -0
  310. {euroeval-16.3.0 → euroeval-16.4.0}/tests/test_scores.py +0 -0
  311. {euroeval-16.3.0 → euroeval-16.4.0}/tests/test_tasks.py +0 -0
  312. {euroeval-16.3.0 → euroeval-16.4.0}/tests/test_types.py +0 -0
  313. {euroeval-16.3.0 → euroeval-16.4.0}/tests/test_utils.py +0 -0
{euroeval-16.3.0 → euroeval-16.4.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml
@@ -24,6 +24,7 @@ body:
  label: Dataset languages
  description: What languages is the dataset in?
  options:
+ - label: Czech
  - label: Danish
  - label: Dutch
  - label: English
@@ -39,6 +40,7 @@ body:
  - label: Norwegian (Bokmål or Nynorsk)
  - label: Polish
  - label: Portuguese
+ - label: Slovak
  - label: Spanish
  - label: Swedish
  validations:
{euroeval-16.3.0 → euroeval-16.4.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml
@@ -18,12 +18,12 @@ body:
  What languages should this model be evaluated on? Tick all that apply. If the
  model is multilingual (e.g., Mistral, Llama), then tick all the languages.
  options:
+ - label: Baltic languages (Latvian, Lithuanian)
+ - label: Finnic languages (Estonian, Finnish)
  - label: Romance languages (French, Italian, Portuguese, Spanish)
  - label: Scandinavian languages (Danish, Faroese, Icelandic, Norwegian, Swedish)
+ - label: Slavic languages (Czech, Polish, Slovak)
  - label: West Germanic languages (Dutch, English, German)
- - label: Finnic languages (Estonian, Finnish)
- - label: Baltic languages (Latvian, Lithuanian)
- - label: Polish
  validations:
  required: true
  - type: dropdown
{euroeval-16.3.0 → euroeval-16.4.0}/.pre-commit-config.yaml
@@ -10,7 +10,7 @@ repos:
  - id: trailing-whitespace
  - id: debug-statements
  - repo: https://github.com/astral-sh/ruff-pre-commit
- rev: v0.13.1
+ rev: v0.14.1
  hooks:
  - id: ruff
  args:
{euroeval-16.3.0 → euroeval-16.4.0}/CHANGELOG.md
@@ -7,15 +7,78 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 
  ## [Unreleased]
 
+ ## [v16.4.0] - 2025-10-21
+
+ ### Added
+
+ - Added support for Slovak 🇸🇰! This includes the sentiment classification dataset
+ CSFD-sentiment-sk, the linguistic acceptability dataset ScaLA-sk, the named entity
+ recognition dataset UNER-sk, the reading comprehension dataset MultiWikiQA-sk, the
+ multiple-choice classification dataset MMLU-sk, and the common-sense reasoning dataset
+ Winogrande-sk. This was contributed by @oliverkinch ✨
+ - Added support for Czech 🇨🇿! This includes the sentiment classification dataset
+ CSFD-sentiment, the linguistic acceptability dataset ScaLA-cs, the linguistic
+ acceptability dataset CS-GEC, the named entity recognition dataset PONER, the reading
+ comprehension dataset SQAD, the summarization dataset Czech News, the common-sense
+ reasoning dataset HellaSwag-cs, and the knowledge dataset Umimeto-qa. This was
+ contributed by @oliverkinch ✨
+ - Added the Lithuanian summarisation dataset Lrytas based on the Lithuanian
+ public media news portal [Lrytas.lt](https://www.lrytas.lt/). This was contributed by
+ @oliverkinch ✨
+ - Added the Estonian translation of MMLU, `mmlu-et`, as an unofficial knowledge
+ dataset.
+
+ ### Changed
+
+ - Updated vLLM to `>=0.11.0`, which features several breaking changes, so we had to
+ force the minimum version. This also features support for multiple new models, such as
+ Qwen3-Next and OLMo3.
+ - Now uses MultiWikiQA-da and MultiWikiQA-sv as the official Danish and Swedish reading
+ comprehension datasets, respectively, as the quality is substantially better than
+ ScandiQA-da and ScandiQA-sv.
+ - Used 128 of the test samples from the Winogrande datasets for validation, as we
+ previously did not use a validation split. This is done for all languages except
+ Icelandic and Estonian, as these are manually translated and corrected splits from a
+ different source. Most of these are unofficial datasets and thus won't affect the
+ leaderboard rankings. The only languages for which these are official are Lithuanian
+ and Polish, which do not have official leaderboards yet - so no leaderboards are
+ affected by this change.
+ - In the same vein as the above, we now use 32 samples for validation for the Lithuanian
+ LT-history dataset and the Swedish Skolprov dataset.
+ - Changed logging styling.
+
+ ### Fixed
+
+ - If a generative model consistently does not adhere to a given JSON schema, we disable
+ structured generation for that model. This was triggered by Claude models not
+ supporting Literal types in JSON schemas.
+ - Removed "e" options from the Skolprov multiple-choice dataset, as this inconsistency
+ in number of options caused issues when evaluating models on it.
+ - Fixed an issue where an uninformative logging message was shown when a model
+ configuration could not be loaded from the Hugging Face Hub, when the model was gated.
+ We now show that this is due to the gatedness, indicating that the user should log in
+ or provide a Hugging Face Hub access token to evaluate the model.
+ - Now caches functions related to loading repo info or fetching model configs from the
+ Hugging Face Hub, to avoid repeated calls to the Hub, resulting in rate limits.
+ - When running an evaluation that required the test split (e.g., European values
+ evaluation) as the last benchmark for a given model, then subsequent models would
+ continue to be evaluated on the test split, even if the user requested to use the
+ validation split. We now reset this not just after each dataset, but also after each
+ model, so that this does not happen.
+ - Now catches more errors when evaluating LiteLLM models, which were related to some
+ generation parameters not being supported (such as stop sequences) for some models.
+ - We now clean up metric writers when we're done with them, which prevents a "too many
+ open files" error when evaluating many models and datasets in a single run.
+
  ## [v16.3.0] - 2025-09-23
 
  ### Added
 
  - Added support for Lithuanian 🇱🇹! This includes the sentiment classification dataset
- Lithuanian Emotions, the linguistic acceptability dataset ScaLA-lt, the reading
- comprehension dataset MultiWikiQA-lt, the named entity recognition dataset WikiANN-lt,
- the the history knowledge dataset LT-History, and the common-sense reasoning dataset
- Winogrande-lt. This was contributed by @oliverkinch ✨
+ Lithuanian Emotions, the linguistic acceptability dataset ScaLA-lt (unofficial), the
+ reading comprehension dataset MultiWikiQA-lt, the named entity recognition dataset
+ WikiANN-lt, the the history knowledge dataset LT-History, and the common-sense
+ reasoning dataset Winogrande-lt. This was contributed by @oliverkinch ✨
  - Added "slow-tokenizer" model parameter, which can be used to force the use of a slow
  tokenizer when loading it. Use this by replacing your model ID with
  `<model-id>#slow-tokenizer`.
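The "Fixed" entry in the changelog hunk above says that functions which load repo info or fetch model configs from the Hugging Face Hub are now cached to avoid rate limits. The contents of the new `caching_utils.py` are not shown in this diff, so the following is only a minimal sketch of the general idea using the standard library's `functools.cache`. The names `fetch_repo_info`, `fetch_repo_info_uncached` and `call_log` are hypothetical, and the network call is replaced by a stub.

```python
from functools import cache

# Records every simulated Hub request, so repeated calls can be counted.
call_log: list[str] = []


def fetch_repo_info_uncached(model_id: str) -> dict[str, str]:
    """Hypothetical stand-in for a network request to the Hugging Face Hub."""
    call_log.append(model_id)
    return {"model_id": model_id}


@cache
def fetch_repo_info(model_id: str) -> dict[str, str]:
    """Return repo info, issuing at most one (simulated) request per model ID."""
    return fetch_repo_info_uncached(model_id)


if __name__ == "__main__":
    for _ in range(3):
        fetch_repo_info("some-org/some-model")
    print(len(call_log))  # prints 1
```

Note that a cached function hands the same dictionary object to every caller, so in a sketch like this callers should treat the result as read-only.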
{euroeval-16.3.0 → euroeval-16.4.0}/PKG-INFO
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: EuroEval
- Version: 16.3.0
+ Version: 16.4.0
  Summary: The robust European language model benchmark.
  Project-URL: Repository, https://github.com/EuroEval/EuroEval
  Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -62,12 +62,12 @@ Provides-Extra: all
  Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'all'
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
  Requires-Dist: timm>=1.0.19; extra == 'all'
- Requires-Dist: vllm[flashinfer]<0.11.0,>=0.10.1; (platform_system == 'Linux') and extra == 'all'
+ Requires-Dist: vllm[flashinfer]>=0.11.0; (platform_system == 'Linux') and extra == 'all'
  Provides-Extra: generative
  Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
  Requires-Dist: timm>=1.0.19; extra == 'generative'
- Requires-Dist: vllm[flashinfer]<0.11.0,>=0.10.1; (platform_system == 'Linux') and extra == 'generative'
+ Requires-Dist: vllm[flashinfer]>=0.11.0; (platform_system == 'Linux') and extra == 'generative'
  Description-Content-Type: text/markdown
 
  <!-- This disables the requirement that the first line is a top-level heading -->
@@ -92,7 +92,7 @@ ______________________________________________________________________
  [![Second paper](https://img.shields.io/badge/arXiv-2406.13469-b31b1b.svg)](https://arxiv.org/abs/2406.13469)
  [![License](https://img.shields.io/github/license/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/blob/main/LICENSE)
  [![LastCommit](https://img.shields.io/github/last-commit/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/commits/main)
- [![Code Coverage](https://img.shields.io/badge/Coverage-67%25-yellow.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
+ [![Code Coverage](https://img.shields.io/badge/Coverage-70%25-yellow.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
  [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg)](https://github.com/EuroEval/EuroEval/blob/main/CODE_OF_CONDUCT.md)
 
  ## Maintainer
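The `Requires-Dist` change in the PKG-INFO hunks above replaces `vllm[flashinfer]<0.11.0,>=0.10.1` with `vllm[flashinfer]>=0.11.0`. The two ranges do not overlap, so a vLLM version accepted by 16.3.0 is rejected by 16.4.0 and vice versa. The sketch below illustrates this with plain tuple comparison. It assumes full `X.Y.Z` release strings and ignores pre-releases and other PEP 440 details that a real resolver handles; the function names are hypothetical.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn a plain 'X.Y.Z' release string into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def allowed_by_16_3_0(vllm_version: str) -> bool:
    """Old constraint in PKG-INFO: vllm>=0.10.1,<0.11.0."""
    return parse("0.10.1") <= parse(vllm_version) < parse("0.11.0")


def allowed_by_16_4_0(vllm_version: str) -> bool:
    """New constraint in PKG-INFO: vllm>=0.11.0."""
    return parse(vllm_version) >= parse("0.11.0")


if __name__ == "__main__":
    for candidate in ("0.10.2", "0.11.0"):
        print(candidate, allowed_by_16_3_0(candidate), allowed_by_16_4_0(candidate))
```

In practice this means that upgrading EuroEval across these two releases with the `generative` or `all` extra on Linux also forces a vLLM upgrade.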
{euroeval-16.3.0 → euroeval-16.4.0}/README.md
@@ -20,7 +20,7 @@ ______________________________________________________________________
  [![Second paper](https://img.shields.io/badge/arXiv-2406.13469-b31b1b.svg)](https://arxiv.org/abs/2406.13469)
  [![License](https://img.shields.io/github/license/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/blob/main/LICENSE)
  [![LastCommit](https://img.shields.io/github/last-commit/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/commits/main)
- [![Code Coverage](https://img.shields.io/badge/Coverage-67%25-yellow.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
+ [![Code Coverage](https://img.shields.io/badge/Coverage-70%25-yellow.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
  [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg)](https://github.com/EuroEval/EuroEval/blob/main/CODE_OF_CONDUCT.md)
 
  ## Maintainer