EuroEval 16.0.0.tar.gz → 16.1.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (296)
  1. {euroeval-16.0.0 → euroeval-16.1.0}/.gitignore +3 -1
  2. {euroeval-16.0.0 → euroeval-16.1.0}/.pre-commit-config.yaml +1 -1
  3. {euroeval-16.0.0 → euroeval-16.1.0}/CHANGELOG.md +68 -0
  4. {euroeval-16.0.0 → euroeval-16.1.0}/PKG-INFO +3 -1
  5. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/danish.md +83 -7
  6. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/dutch.md +81 -8
  7. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/english.md +138 -3
  8. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/estonian.md +83 -10
  9. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/faroese.md +3 -2
  10. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/finnish.md +78 -5
  11. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/french.md +78 -5
  12. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/german.md +139 -3
  13. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/icelandic.md +5 -4
  14. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/italian.md +78 -5
  15. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/latvian.md +97 -10
  16. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/norwegian.md +68 -3
  17. euroeval-16.1.0/docs/datasets/polish.md +640 -0
  18. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/portuguese.md +68 -3
  19. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/spanish.md +68 -3
  20. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/swedish.md +132 -3
  21. euroeval-16.1.0/docs/leaderboards/Monolingual/estonian.md +23 -0
  22. euroeval-16.1.0/docs/leaderboards/Multilingual/finnic.md +23 -0
  23. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Multilingual/romance.md +1 -1
  24. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/README.md +5 -15
  25. euroeval-16.1.0/generated_contracts/employment_contract_001.md +137 -0
  26. euroeval-16.1.0/generated_contracts/employment_contract_002.md +152 -0
  27. euroeval-16.1.0/generated_contracts/employment_contract_003.md +144 -0
  28. euroeval-16.1.0/generated_contracts/employment_contract_004.md +139 -0
  29. euroeval-16.1.0/generated_contracts/employment_contract_005.md +146 -0
  30. euroeval-16.1.0/generated_contracts/employment_contract_006.md +127 -0
  31. euroeval-16.1.0/generated_contracts/employment_contract_007.md +147 -0
  32. euroeval-16.1.0/generated_contracts/employment_contract_008.md +136 -0
  33. euroeval-16.1.0/generated_contracts/employment_contract_009.md +143 -0
  34. euroeval-16.1.0/generated_contracts/employment_contract_010.md +148 -0
  35. {euroeval-16.0.0 → euroeval-16.1.0}/makefile +3 -0
  36. {euroeval-16.0.0 → euroeval-16.1.0}/pyproject.toml +3 -1
  37. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/__init__.py +5 -0
  38. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/benchmark_config_factory.py +6 -1
  39. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/benchmark_modules/base.py +2 -0
  40. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/benchmark_modules/fresh.py +7 -1
  41. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/benchmark_modules/hf.py +26 -21
  42. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/benchmark_modules/litellm.py +258 -131
  43. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/benchmark_modules/vllm.py +120 -68
  44. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/benchmarker.py +11 -2
  45. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/cli.py +14 -1
  46. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/constants.py +7 -1
  47. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/data_models.py +95 -20
  48. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/__init__.py +1 -0
  49. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/danish.py +14 -3
  50. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/dutch.py +14 -0
  51. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/english.py +22 -0
  52. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/estonian.py +15 -7
  53. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/finnish.py +14 -0
  54. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/french.py +14 -0
  55. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/german.py +23 -0
  56. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/italian.py +14 -0
  57. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/latvian.py +14 -0
  58. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/norwegian.py +14 -0
  59. euroeval-16.1.0/src/euroeval/dataset_configs/polish.py +126 -0
  60. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/portuguese.py +14 -0
  61. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/spanish.py +14 -0
  62. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/swedish.py +25 -0
  63. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/enums.py +12 -0
  64. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/generation.py +17 -8
  65. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/generation_utils.py +102 -16
  66. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/metrics/pipeline.py +51 -9
  67. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/model_cache.py +13 -1
  68. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/prompt_templates/linguistic_acceptability.py +9 -0
  69. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/prompt_templates/multiple_choice.py +27 -1
  70. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/prompt_templates/named_entity_recognition.py +20 -0
  71. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/prompt_templates/reading_comprehension.py +11 -0
  72. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/prompt_templates/sentiment_classification.py +15 -0
  73. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/prompt_templates/summarization.py +27 -1
  74. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/scores.py +5 -0
  75. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/task_group_utils/multiple_choice_classification.py +2 -2
  76. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/task_group_utils/question_answering.py +29 -29
  77. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/task_group_utils/sequence_classification.py +71 -81
  78. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/task_group_utils/token_classification.py +17 -3
  79. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/tasks.py +12 -10
  80. euroeval-16.0.0/src/euroeval/tokenization_utils.py → euroeval-16.1.0/src/euroeval/tokenisation_utils.py +41 -25
  81. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/utils.py +67 -3
  82. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/constants.py +20 -0
  83. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_allocine.py +1 -6
  84. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_arc.py +9 -26
  85. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_arc_is.py +1 -6
  86. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_belebele.py +4 -21
  87. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_boolq_pt.py +1 -5
  88. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_cnn_dailymail.py +1 -6
  89. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_conll_en.py +1 -6
  90. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_conll_es.py +1 -6
  91. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_conll_nl.py +1 -6
  92. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_copa_lv.py +3 -7
  93. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_dane.py +1 -6
  94. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_danish_citizen_tests.py +3 -7
  95. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_dansk.py +1 -6
  96. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_danske_talemaader.py +3 -6
  97. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_danske_talemaader_old.py +3 -7
  98. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_dbrd.py +1 -6
  99. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_dutch_cola.py +2 -0
  100. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_eltec.py +1 -5
  101. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_err_news.py +1 -6
  102. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_estner.py +1 -6
  103. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_estonian_valence.py +1 -6
  104. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_european_values.py +56 -46
  105. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_exam_et.py +2 -1
  106. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_fone.py +1 -6
  107. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_foqa.py +1 -6
  108. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_fosent.py +1 -6
  109. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_fquad.py +1 -6
  110. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_fullstack_ner.py +1 -6
  111. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_germanquad.py +1 -6
  112. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_germeval.py +1 -6
  113. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_goldenswag.py +4 -19
  114. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_grammar_et.py +1 -6
  115. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_harem.py +1 -6
  116. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_hellaswag.py +6 -22
  117. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_hellaswag_fi.py +3 -7
  118. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_hotter_and_colder_sentiment.py +1 -5
  119. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_ice_linguistic.py +1 -6
  120. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_icelandic_error_corpus.py +2 -7
  121. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_icelandic_knowledge.py +8 -8
  122. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_icelandic_qa.py +1 -6
  123. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_icesum.py +1 -6
  124. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_idioms_no.py +3 -7
  125. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_ilpost_sum.py +1 -6
  126. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_jentoft.py +1 -6
  127. euroeval-16.1.0/src/scripts/create_kpwr_ner.py +140 -0
  128. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_latvian_lsm_summary.py +1 -6
  129. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_latvian_twitter_sentiment.py +1 -6
  130. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_life_in_the_uk.py +3 -7
  131. euroeval-16.1.0/src/scripts/create_llmzszl.py +153 -0
  132. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_mlqa_es.py +1 -6
  133. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_mlsum_de.py +1 -6
  134. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_mlsum_es.py +2 -11
  135. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_mmlu.py +6 -23
  136. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_mmlu_lv.py +3 -7
  137. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_multi_wiki_qa.py +4 -8
  138. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_multinerd-it.py +1 -6
  139. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_no_cola.py +1 -6
  140. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_no_sammendrag.py +1 -6
  141. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_nor_common_sense_qa.py +3 -7
  142. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_nordjylland_news.py +3 -12
  143. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_norglm_multiqa.py +2 -11
  144. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_norglm_multisum.py +2 -11
  145. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_norne.py +2 -11
  146. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_norquad.py +3 -12
  147. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_nqii.py +3 -12
  148. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_nrk_quiz_qa.py +4 -12
  149. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_orange_sum.py +3 -12
  150. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_personal_sum.py +6 -12
  151. euroeval-16.1.0/src/scripts/create_polemo2.py +130 -0
  152. euroeval-16.1.0/src/scripts/create_poquad.py +109 -0
  153. euroeval-16.1.0/src/scripts/create_psc.py +85 -0
  154. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_publico.py +2 -10
  155. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_rrn.py +3 -12
  156. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_sb10k.py +2 -11
  157. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_scala.py +4 -11
  158. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_scandiqa.py +3 -12
  159. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_scandisent_fi.py +2 -11
  160. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_schibsted.py +1 -8
  161. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_sentiment_headlines_es.py +2 -11
  162. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_sentipolc16.py +2 -11
  163. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_squad.py +3 -12
  164. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_squad_it.py +3 -12
  165. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_squad_nl.py +3 -12
  166. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_squad_nl_old.py +3 -12
  167. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_sst2_pt.py +2 -6
  168. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_sst5.py +2 -11
  169. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_suc3.py +2 -11
  170. euroeval-16.1.0/src/scripts/create_swedish_skolprov.py +167 -0
  171. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_swedn.py +3 -12
  172. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_swerec.py +2 -11
  173. euroeval-16.1.0/src/scripts/create_trivia_et.py +70 -0
  174. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_turku_ner_fi.py +3 -10
  175. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_tydiqa_fi.py +3 -12
  176. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_wiki_lingua_nl.py +3 -12
  177. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_wikiann_lv.py +2 -11
  178. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_wikineural-it.py +2 -11
  179. euroeval-16.1.0/src/scripts/create_winogrande.py +156 -0
  180. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_winogrande_et.py +6 -8
  181. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_winogrande_is.py +4 -12
  182. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_xlsum_fi.py +3 -12
  183. euroeval-16.1.0/src/scripts/create_xquad.py +73 -0
  184. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/load_ud_pos.py +18 -0
  185. {euroeval-16.0.0 → euroeval-16.1.0}/tests/conftest.py +2 -0
  186. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_benchmark_config_factory.py +1 -1
  187. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_benchmarker.py +1 -1
  188. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_callbacks.py +1 -1
  189. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_cli.py +3 -1
  190. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_data_loading.py +6 -1
  191. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_data_models.py +3 -3
  192. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_dataset_configs.py +3 -3
  193. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_exceptions.py +1 -1
  194. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_finetuning.py +0 -12
  195. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_languages.py +2 -2
  196. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_model_loading.py +1 -1
  197. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_scores.py +4 -3
  198. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_speed_benchmark.py +2 -2
  199. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_tasks.py +2 -2
  200. euroeval-16.0.0/tests/test_tokenization_utils.py → euroeval-16.1.0/tests/test_tokenisation_utils.py +5 -3
  201. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_types.py +1 -1
  202. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_utils.py +41 -3
  203. {euroeval-16.0.0 → euroeval-16.1.0}/uv.lock +58 -3
  204. euroeval-16.0.0/src/scripts/create_wikiann_fo.py +0 -1
  205. euroeval-16.0.0/src/scripts/create_xquad_es.py +0 -80
  206. euroeval-16.0.0/tests/test_benchmark_modules/test_base.py +0 -1
  207. euroeval-16.0.0/tests/test_benchmark_modules/test_fresh.py +0 -1
  208. euroeval-16.0.0/tests/test_benchmark_modules/test_litellm.py +0 -1
  209. euroeval-16.0.0/tests/test_benchmark_modules/test_vllm.py +0 -1
  210. euroeval-16.0.0/tests/test_generation.py +0 -19
  211. euroeval-16.0.0/tests/test_model_cache.py +0 -46
  212. euroeval-16.0.0/tests/test_task_utils/__init__.py +0 -1
  213. euroeval-16.0.0/tests/test_task_utils/test_question_answering.py +0 -1
  214. euroeval-16.0.0/tests/test_task_utils/test_sequence_classification.py +0 -1
  215. euroeval-16.0.0/tests/test_task_utils/test_text_to_text.py +0 -1
  216. euroeval-16.0.0/tests/test_task_utils/test_token_classification.py +0 -1
  217. {euroeval-16.0.0 → euroeval-16.1.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +0 -0
  218. {euroeval-16.0.0 → euroeval-16.1.0}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
  219. {euroeval-16.0.0 → euroeval-16.1.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
  220. {euroeval-16.0.0 → euroeval-16.1.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +0 -0
  221. {euroeval-16.0.0 → euroeval-16.1.0}/.github/workflows/ci.yaml +0 -0
  222. {euroeval-16.0.0 → euroeval-16.1.0}/CITATION.cff +0 -0
  223. {euroeval-16.0.0 → euroeval-16.1.0}/CODE_OF_CONDUCT.md +0 -0
  224. {euroeval-16.0.0 → euroeval-16.1.0}/CONTRIBUTING.md +0 -0
  225. {euroeval-16.0.0 → euroeval-16.1.0}/Dockerfile.cuda +0 -0
  226. {euroeval-16.0.0 → euroeval-16.1.0}/LICENSE +0 -0
  227. {euroeval-16.0.0 → euroeval-16.1.0}/NEW_DATASET_GUIDE.md +0 -0
  228. {euroeval-16.0.0 → euroeval-16.1.0}/README.md +0 -0
  229. {euroeval-16.0.0 → euroeval-16.1.0}/docs/CNAME +0 -0
  230. {euroeval-16.0.0 → euroeval-16.1.0}/docs/README.md +0 -0
  231. {euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/README.md +0 -0
  232. {euroeval-16.0.0 → euroeval-16.1.0}/docs/extras/radial_plotter.md +0 -0
  233. {euroeval-16.0.0 → euroeval-16.1.0}/docs/faq.md +0 -0
  234. {euroeval-16.0.0 → euroeval-16.1.0}/docs/gfx/favicon.png +0 -0
  235. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/danish.md +0 -0
  236. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/dutch.md +0 -0
  237. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/english.md +0 -0
  238. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/faroese.md +0 -0
  239. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/finnish.md +0 -0
  240. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/french.md +0 -0
  241. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/german.md +0 -0
  242. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/icelandic.md +0 -0
  243. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/italian.md +0 -0
  244. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/norwegian.md +0 -0
  245. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/portuguese.md +0 -0
  246. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/spanish.md +0 -0
  247. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Monolingual/swedish.md +0 -0
  248. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Multilingual/european.md +0 -0
  249. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Multilingual/germanic.md +0 -0
  250. {euroeval-16.0.0 → euroeval-16.1.0}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
  251. {euroeval-16.0.0 → euroeval-16.1.0}/docs/methodology.md +0 -0
  252. {euroeval-16.0.0 → euroeval-16.1.0}/docs/python-package.md +0 -0
  253. {euroeval-16.0.0 → euroeval-16.1.0}/docs/tasks/README.md +0 -0
  254. {euroeval-16.0.0 → euroeval-16.1.0}/docs/tasks/common-sense-reasoning.md +0 -0
  255. {euroeval-16.0.0 → euroeval-16.1.0}/docs/tasks/knowledge.md +0 -0
  256. {euroeval-16.0.0 → euroeval-16.1.0}/docs/tasks/linguistic-acceptability.md +0 -0
  257. {euroeval-16.0.0 → euroeval-16.1.0}/docs/tasks/named-entity-recognition.md +0 -0
  258. {euroeval-16.0.0 → euroeval-16.1.0}/docs/tasks/reading-comprehension.md +0 -0
  259. {euroeval-16.0.0 → euroeval-16.1.0}/docs/tasks/sentiment-classification.md +0 -0
  260. {euroeval-16.0.0 → euroeval-16.1.0}/docs/tasks/speed.md +0 -0
  261. {euroeval-16.0.0 → euroeval-16.1.0}/docs/tasks/summarization.md +0 -0
  262. {euroeval-16.0.0 → euroeval-16.1.0}/gfx/euroeval.png +0 -0
  263. {euroeval-16.0.0 → euroeval-16.1.0}/gfx/euroeval.xcf +0 -0
  264. {euroeval-16.0.0 → euroeval-16.1.0}/gfx/scandeval.png +0 -0
  265. {euroeval-16.0.0 → euroeval-16.1.0}/mkdocs.yaml +0 -0
  266. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/benchmark_modules/__init__.py +0 -0
  267. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/callbacks.py +0 -0
  268. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/data_loading.py +0 -0
  269. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/faroese.py +0 -0
  270. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/dataset_configs/icelandic.py +0 -0
  271. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/exceptions.py +0 -0
  272. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/finetuning.py +0 -0
  273. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/languages.py +0 -0
  274. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/metrics/__init__.py +0 -0
  275. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/metrics/base.py +0 -0
  276. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/metrics/huggingface.py +0 -0
  277. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/metrics/llm_as_a_judge.py +0 -0
  278. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/metrics/speed.py +0 -0
  279. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/model_config.py +0 -0
  280. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/model_loading.py +0 -0
  281. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/prompt_templates/__init__.py +0 -0
  282. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/speed_benchmark.py +0 -0
  283. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/task_group_utils/__init__.py +0 -0
  284. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/task_group_utils/text_to_text.py +0 -0
  285. {euroeval-16.0.0 → euroeval-16.1.0}/src/euroeval/types.py +0 -0
  286. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_angry_tweets.py +0 -0
  287. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_mim_gold_ner.py +0 -0
  288. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/create_norec.py +0 -0
  289. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/fix_dot_env_file.py +0 -0
  290. {euroeval-16.0.0 → euroeval-16.1.0}/src/scripts/versioning.py +0 -0
  291. {euroeval-16.0.0 → euroeval-16.1.0}/tests/__init__.py +0 -0
  292. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_benchmark_modules/__init__.py +0 -0
  293. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_benchmark_modules/test_hf.py +0 -0
  294. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_constants.py +0 -0
  295. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_enums.py +0 -0
  296. {euroeval-16.0.0 → euroeval-16.1.0}/tests/test_model_config.py +0 -0

{euroeval-16.0.0 → euroeval-16.1.0}/.gitignore

@@ -34,7 +34,7 @@ var/
  pip-log.txt
  pip-delete-this-directory.txt

- # Unit test / coverage reports
+ # Tests / coverage reports
  htmlcov/
  .tox/
  .coverage
@@ -118,4 +118,6 @@ docs/datasets/dataset_example_commands.txt

  # Various graphics
  gfx/euroeval-*.png
+ gfx/euroeval-*.jpeg
+ gfx/euroeval-*.jpg
  gfx/euroeval-*.xcf

{euroeval-16.0.0 → euroeval-16.1.0}/.pre-commit-config.yaml

@@ -10,7 +10,7 @@ repos:
  - id: trailing-whitespace
  - id: debug-statements
  - repo: https://github.com/astral-sh/ruff-pre-commit
- rev: v0.12.12
+ rev: v0.13.0
  hooks:
  - id: ruff
  args:

{euroeval-16.0.0 → euroeval-16.1.0}/CHANGELOG.md

@@ -10,6 +10,74 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.



+ ## [v16.1.0] - 2025-09-11
+ ### Added
+ - Added support for Polish 🇵🇱! This includes the reading comprehension dataset PoQuAD,
+ the sentiment classification dataset PolEmo 2.0, the linguistic acceptability dataset
+ ScaLA-pl, the named entity recognition dataset KPWr-NER, the summarisation dataset
+ PSC, the knowledge dataset LLMzSzŁ and the common-sense reasoning dataset
+ Winogrande-pl. Also added MultiWikiQA-pl and GoldenSwag-pl as unofficial reading
+ comprehension and common-sense reasoning datasets, respectively. This was contributed
+ by @oliverkinch ✨
+ - Added the Swedish knowledge dataset Skolprov. It is unofficial for now. This was
+ contributed by @oliverkinch ✨
+ - Added the knowledge dataset Trivia-et for Estonian. The dataset contains 800 trivia
+ questions about Estonia. In this version we rearrange the examples in
+ 240 / 60 / 500 samples for training, validation and test splits, respectively.
+ This replaces Exam-et as the official Estonian knowledge dataset. This was contributed
+ by @slowwavesleep ✨
+ - Added the English and German versions of XQuAD as unofficial reading comprehension
+ datasets.
+ - Added the English common-sense reasoning dataset Winogrande and its translated
+ versions of Winogrande for Danish, German, Spanish, Finnish, French, Italian, Latvian,
+ Dutch, Norwegian, Polish, Portuguese and Swedish. These are unofficial for now.
+ - Added new `--generative-type` argument, which can be used to override the automatic
+ detection of the generative type (base decoder, instruction-tuned decoder, or
+ reasoning decoder) of a decoder model. This can be useful if the automatic detection
+ fails for a specific model.
+ - Now supports evaluating base decoders on inference servers. This requires the
+ `--generative-type base` argument to be set, as the automatic detection will not work
+ for these models.
+
+ ### Changed
+ - Changed the model ID syntax, where we now use `#` to indicate parameters and still use
+ `@` to indicate revision. For instance, `o3#low` indicates the `o3` model with the
+ low reasoning effort, and `tencent/Hunyuan-1.8B-Instruct@v1#no-thinking` indicates the
+ Hunyuan model from the `v1` branch and with the `enable_thinking=False` parameter set.
+ This is fully backwards compatible, in the sense that API models still support using
+ `@` for parameters as well, just like previously, but you will get a warning that this
+ syntax is deprecated.
+ - Added `thinking` and `no-thinking` parameters for all open-weight models now. Of
+ course, it only makes a difference for models that supports this flag.
+ - Reduced the number of tokens used for reasoning models from 32,768 to 8,192, as models
+ reaching the full 32,768 tokens were because they ended up repeating themselves,
+ making the evaluation slower without any benefit.
+
+ ### Fixed
+ - Some generative models consistently generated empty dictionaries when using structured
+ generation. We now catch this and retry the evaluation without structured generation.
+
+
+ ## [v16.0.1] - 2025-09-07
+ ### Fixed
+ - Fixed a bug causing encoders to fail when evaluating on the Exam-et dataset.
+ - Previously we would abort an evaluation completely if the model outputted a single
+ invalid output on a classification task. As individual samples rarely have a great
+ influence on the overall score, we now just assign the closest label to the sample and
+ continue the evaluation. This will be logged to the user, so that they are aware of
+ this. Some tasks are more sensitive to individual samples, such as European values,
+ where we still abort the evaluation if a single sample is invalid.
+ - Fixed a bug where logprobs were not used for classification tasks when evaluating
+ generative models, due to the fact that we raised the number of generated tokens to 10
+ for such tasks. This did not affect the results, but it meant that some evaluations
+ failed.
+ - Now includes FlashInfer as a dependency, as it is required by vLLM.
+ - Changed the choices in European values to use letters, like the other multiple
+ choice tasks, rather than numbers. Aside from ensuring consistency, we also avoid the
+ issue where '10' and '1' often both have the same first token ('1'), causing us not to
+ be able to use logprobs to determine the answer.
+
+
  ## [v16.0.0] - 2025-09-05
  ### Added
  - Added support for Latvian 🇱🇻! This includes the sentiment classification dataset
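
As a quick illustration of the changelog entries above, the new `#` parameter syntax, the `@` revision syntax and the `--generative-type` flag combine on the command line roughly as sketched below. This is illustrative only: `o3#low` and `tencent/Hunyuan-1.8B-Instruct@v1#no-thinking` are the examples given in the changelog itself, `scandiqa-da` is simply one of the datasets documented further down, and `<base-model-id>` is a placeholder.

```bash
# '#' passes a parameter to the model, here the reasoning effort of an API model
$ euroeval --model o3#low --dataset scandiqa-da

# '@' still selects a revision and can be combined with a '#' parameter
$ euroeval --model tencent/Hunyuan-1.8B-Instruct@v1#no-thinking --dataset scandiqa-da

# override the automatic generative-type detection, e.g. for a base decoder
# served from an inference server, where detection does not work
$ euroeval --model <base-model-id> --generative-type base --dataset scandiqa-da
```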

{euroeval-16.0.0 → euroeval-16.1.0}/PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: EuroEval
- Version: 16.0.0
+ Version: 16.1.0
  Summary: The robust European language model benchmark.
  Project-URL: Repository, https://github.com/EuroEval/EuroEval
  Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -61,10 +61,12 @@ Requires-Dist: transformers[mistral-common]>=4.56.0
  Provides-Extra: all
  Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'all'
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
+ Requires-Dist: flashinfer-python>=0.3.1; (platform_system == 'Linux') and extra == 'all'
  Requires-Dist: vllm>=0.10.1; (platform_system == 'Linux') and extra == 'all'
  Provides-Extra: generative
  Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
+ Requires-Dist: flashinfer-python>=0.3.1; (platform_system == 'Linux') and extra == 'generative'
  Requires-Dist: vllm>=0.10.1; (platform_system == 'Linux') and extra == 'generative'
  Description-Content-Type: text/markdown
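
The new `flashinfer-python` requirement only applies to the optional extras. A minimal sketch of how it gets pulled in, assuming the `all` and `generative` extras declared in this PKG-INFO (Linux only, per the environment markers):

```bash
# either extra pulls in vllm, bitsandbytes, fbgemm-gpu and now flashinfer-python
$ pip install "euroeval[generative]"
# or everything:
$ pip install "euroeval[all]"
```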

{euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/danish.md

@@ -355,9 +355,12 @@ $ euroeval --model <model-id> --dataset scandiqa-da

  ### Unofficial: BeleBele-da

- This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features multiple-choice reading comprehension questions across 122 languages.
+ This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+ and features multiple-choice reading comprehension questions across 122 languages.

- The original dataset contains 900 unique multiple-choice reading comprehension passages and questions. From these, we use a 256 / 64 / 580 split for training, validation and testing, respectively.
+ The original dataset contains 900 unique multiple-choice reading comprehension passages
+ and questions. From these, we use a 256 / 64 / 580 split for training, validation and
+ testing, respectively.

  Here are a few examples from the training split:

@@ -418,8 +421,9 @@ $ euroeval --model <model-id> --dataset belebele-da

  ### Unofficial: MultiWikiQA-da

- This dataset will be published in an upcoming paper, and contains Danish Wikipedia
- articles with generated questions and answers, using the LLM Gemini-1.5-pro.
+ This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2509.04111)
+ and contains Wikipedia articles with LLM-generated questions and answers in 300+
+ languages.

  The original full dataset consists of 5,000 samples in a single split. We use a 1,024 /
  256 / 2,048 split for training, validation and testing, respectively, sampled randomly.
@@ -831,9 +835,17 @@ $ euroeval --model <model-id> --dataset hellaswag-da

  ### Unofficial: GoldenSwag-da

- This dataset is a filtered and machine translated version of the English [HellaSwag dataset](https://aclanthology.org/P19-1472/), featuring both video descriptions from ActivityNet as well as how-to articles from WikiHow. The machine translated version was published in [this paper](https://doi.org/10.48550/arXiv.2410.08928) and was done using DeepL, and the filtering was published in [this paper](https://doi.org/10.48550/arXiv.2504.07825), which resulted in higher quality samples.
+ This dataset is a filtered and machine translated version of the English [HellaSwag
+ dataset](https://aclanthology.org/P19-1472/), featuring both video descriptions from
+ ActivityNet as well as how-to articles from WikiHow. The machine translated version was
+ published in [this paper](https://doi.org/10.48550/arXiv.2410.08928) and was done using
+ DeepL, and the filtering was published in [this
+ paper](https://doi.org/10.48550/arXiv.2504.07825), which resulted in higher quality
+ samples.

- The original full dataset consists of 1530 / 1530 samples for training and validation, respectively. However, they are exactly equal. We use a split of 660 / 256 / 2,048 samples for training, validation, and testing, respectively.
+ The original full dataset consists of 1530 / 1530 samples for training and validation,
+ respectively. However, they are exactly equal. We use a split of 660 / 256 / 2,048
+ samples for training, validation, and testing, respectively.

  Here are a few examples from the training split:

@@ -894,8 +906,72 @@ You can evaluate this dataset directly as follows:
  $ euroeval --model <model-id> --dataset goldenswag-da
  ```

+ ### Unofficial: Winogrande-da

- ## Summarization
+ This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2506.19468)
+ and is a translated and filtered version of the English [Winogrande
+ dataset](https://doi.org/10.1145/3474381).
+
+ The original full dataset consists of 47 / 1,210 samples for training and testing, and
+ we use the same splits.
+
+ Here are a few examples from the training split:
+
+ ```json
+ {
+ "text": "Natalie synes, at smaragder er smukke ædelstene, men Betty gør ikke. _ købte en halskæde med en stor smaragd. Hvad refererer det tomme _ til?\nSvarmuligheder:\na. Valgmulighed A: Natalie\nb. Valgmulighed B: Betty",
+ "label": "a"
+ }
+ ```
+
+ ```json
+ {
+ "text": "Natalie synes, at smaragder er smukke ædelstene, men Betty gør ikke. _ købte en halskæde med en stor smaragd. Hvad refererer det tomme _ til?\nSvarmuligheder:\na. Valgmulighed A: Natalie\nb. Valgmulighed B: Betty",
+ "label": "a"
+ }
+ ```
+
+ ```json
+ {
+ "text": "At håndtere nødsituationer var aldrig særlig svært for Kevin, men det var det for Nelson, fordi _ ikke var i stand til at forblive rolig under pres. Hvad refererer det tomme _ til?\nSvarmuligheder:\na. Valgmulighed A: Kevin\nb. Valgmulighed B: Nelson",
+ "label": "b"
+ }
+ ```
+
+ When evaluating generative models, we use the following setup (see the
+ [methodology](/methodology) for more information on how these are used):
+
+ - Number of few-shot examples: 5
+ - Prefix prompt:
+ ```
+ Følgende er multiple choice spørgsmål (med svar).
+ ```
+ - Base prompt template:
+ ```
+ Spørgsmål: {text}
+ Svarmuligheder:
+ a. {option_a}
+ b. {option_b}
+ Svar: {label}
+ ```
+ - Instruction-tuned prompt template:
+ ```
+ Spørgsmål: {text}
+ Svarmuligheder:
+ a. {option_a}
+ b. {option_b}
+
+ Besvar ovenstående spørgsmål ved at svare med 'a' eller 'b', og intet andet.
+ ```
+
+ You can evaluate this dataset directly as follows:
+
+ ```bash
+ $ euroeval --model <model-id> --dataset winogrande-da
+ ```
+
+
+ ## Summarisation

  ### Nordjylland News

{euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/dutch.md

@@ -153,9 +153,9 @@ from a sentence, or by swapping two neighbouring words in a sentence. To ensure
  this does indeed break the grammaticality of the sentence, a set of rules were used on
  the part-of-speech tags of the words in the sentence.

- The original dataset consists of 13,603 samples, from which we use 1,024 / 256 / 2,048 samples for training,
- validation and testing, respectively (so 3,328 samples used in total). These splits are
- used as-is in the framework.
+ The original dataset consists of 13,603 samples, from which we use 1,024 / 256 / 2,048
+ samples for training, validation and testing, respectively (so 3,328 samples used in
+ total). These splits are used as-is in the framework.

  Here are a few examples from the training split:

@@ -390,8 +390,9 @@ $ euroeval --model <model-id> --dataset belebele-nl

  ### Unofficial: MultiWikiQA-nl

- This dataset will be published in an upcoming paper, and contains Dutch Wikipedia
- articles with generated questions and answers, using the LLM Gemini-1.5-pro.
+ This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2509.04111)
+ and contains Wikipedia articles with LLM-generated questions and answers in 300+
+ languages.

  The original full dataset consists of 5,000 samples in a single split. We use a 1,024 /
  256 / 2,048 split for training, validation and testing, respectively, sampled randomly.
@@ -676,9 +677,17 @@ $ euroeval --model <model-id> --dataset hellaswag-nl

  ### Unofficial: GoldenSwag-nl

- This dataset is a filtered and machine translated version of the English [HellaSwag dataset](https://aclanthology.org/P19-1472/), featuring both video descriptions from ActivityNet as well as how-to articles from WikiHow. The machine translated version was published in [this paper](https://doi.org/10.48550/arXiv.2410.08928) and was done using DeepL, and the filtering was published in [this paper](https://doi.org/10.48550/arXiv.2504.07825), which resulted in higher quality samples.
+ This dataset is a filtered and machine translated version of the English [HellaSwag
+ dataset](https://aclanthology.org/P19-1472/), featuring both video descriptions from
+ ActivityNet as well as how-to articles from WikiHow. The machine translated version was
+ published in [this paper](https://doi.org/10.48550/arXiv.2410.08928) and was done using
+ DeepL, and the filtering was published in [this
+ paper](https://doi.org/10.48550/arXiv.2504.07825), which resulted in higher quality
+ samples.

- The original full dataset consists of 1530 / 1530 samples for training and validation, respectively. However, they are exactly equal. We use a split of 660 / 256 / 2,048 samples for training, validation, and testing, respectively.
+ The original full dataset consists of 1530 / 1530 samples for training and validation,
+ respectively. However, they are exactly equal. We use a split of 660 / 256 / 2,048
+ samples for training, validation, and testing, respectively.

  Here are a few examples from the training split:

@@ -739,8 +748,72 @@ You can evaluate this dataset directly as follows:
  $ euroeval --model <model-id> --dataset goldenswag-nl
  ```

+ ### Unofficial: Winogrande-nl
+
+ This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2506.19468)
+ and is a translated and filtered version of the English [Winogrande
+ dataset](https://doi.org/10.1145/3474381).
+
+ The original full dataset consists of 47 / 1,210 samples for training and testing, and
+ we use the same splits.
+
+ Here are a few examples from the training split:
+
+ ```json
+ {
+ "text": "Emily vroeg haar zus Sarah of ze tampons of maandverband nodig had uit de winkel, hoewel _ dat niet nodig had omdat ze was overgestapt op het gebruik van menstruatiecups. Waar verwijst de lege _ naar?\nAntwoordopties:\na. Optie A: Emily\nb. Optie B: Sarah",
+ "label": "a"
+ }
+ ```
+
+ ```json
+ {
+ "text": "Bij het kopen van een huis heeft Patricia niet zoveel geld te besteden als Tanya, dus _ koopt een huis met 1 slaapkamer. Waar verwijst de lege _ naar?\nAntwoordopties:\na. Optie A: Patricia\nb. Optie B: Tanya",
+ "label": "a"
+ }
+ ```
+
+ ```json
+ {
+ "text": "Eenmaal in Polen genoot Dennis meer van de reis dan Jason omdat _ een oppervlakkige kennis van de Poolse taal had. Waar verwijst de lege _ naar?\nAntwoordopties:\na. Optie A: Dennis\nb. Optie B: Jason",
+ "label": "b"
+ }
+ ```
+
+ When evaluating generative models, we use the following setup (see the
+ [methodology](/methodology) for more information on how these are used):
+
+ - Number of few-shot examples: 5
+ - Prefix prompt:
+ ```
+ Hieronder staan meerkeuzevragen (met antwoorden).
+ ```
+ - Base prompt template:
+ ```
+ Vraag: {text}
+ Antwoordopties:
+ a. {option_a}
+ b. {option_b}
+ Antwoord: {label}
+ ```
+ - Instruction-tuned prompt template:
+ ```
+ Vraag: {text}
+ Antwoordopties:
+ a. {option_a}
+ b. {option_b}
+
+ Beantwoord de bovenstaande vraag met 'a' of 'b', en niets anders.
+ ```
+
+ You can evaluate this dataset directly as follows:
+
+ ```bash
+ $ euroeval --model <model-id> --dataset winogrande-nl
+ ```
+

- ## Summarization
+ ## Summarisation

  ### WikiLingua-nl

{euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/english.md

@@ -295,6 +295,79 @@ $ euroeval --model <model-id> --dataset squad
  ```


+ ### Unofficial: XQuAD-en
+
+ This dataset was published in [this paper](https://aclanthology.org/2020.acl-main.421/)
+ and contains 1190 question-answer pairs from [SQuAD
+ v1.1](https://rajpurkar.github.io/SQuAD-explorer/) translated into ten languages by
+ professional translators.
+
+ The dataset is split intro 550 / 128 / 512 question-answer pairs for training,
+ validation, and testing, respectively.
+
+ Here are a few examples from the training split:
+
+ ```json
+ {
+ "context": "Newcastle replaced him in January 1756 with Lord Loudoun, with Major General James Abercrombie as his second in command. Neither of these men had as much campaign experience as the trio of officers France sent to North America. French regular army reinforcements arrived in New France in May 1756, led by Major General Louis-Joseph de Montcalm and seconded by the Chevalier de Lévis and Colonel François-Charles de Bourlamaque, all experienced veterans from the War of the Austrian Succession. During that time in Europe, on May 18, 1756, England formally declared war on France, which expanded the war into Europe, which was later to be known as the Seven Years" War.",
+ "question": "Who led New France reinforcements in 1756?",
+ "answers": {
+ "answer_start": array([305], dtype=int32),
+ "text": array(["Major General Louis-Joseph de Montcalm"], dtype=object)
+ }
+ }
+ ```
+ ```json
+ {
+ "context": "Jacksonville is in the First Coast region of northeast Florida and is centered on the banks of the St. Johns River, about 25 miles (40 km) south of the Georgia state line and about 340 miles (550 km) north of Miami. The Jacksonville Beaches communities are along the adjacent Atlantic coast. The area was originally inhabited by the Timucua people, and in 1564 was the site of the French colony of Fort Caroline, one of the earliest European settlements in what is now the continental United States. Under British rule, settlement grew at the narrow point in the river where cattle crossed, known as Wacca Pilatka to the Seminole and the Cow Ford to the British. A platted town was established there in 1822, a year after the United States gained Florida from Spain; it was named after Andrew Jackson, the first military governor of the Florida Territory and seventh President of the United States.",
+ "question": "Prior to the arrival of the French, the area now known as Jacksonville was previously inhabited by what people?",
+ "answers": {
+ "answer_start": array([329], dtype=int32),
+ "text": array(["the Timucua"], dtype=object)
+ }
+ }
+ ```
+ ```json
+ {
+ "context": "Luther\"s hymns were frequently evoked by particular events in his life and the unfolding Reformation. This behavior started with his learning of the execution of Johann Esch and Heinrich Voes, the first individuals to be martyred by the Roman Catholic Church for Lutheran views, prompting Luther to write the hymn "Ein neues Lied wir heben an" ("A new song we raise"), which is generally known in English by John C. Messenger\"s translation by the title and first line "Flung to the Heedless Winds" and sung to the tune Ibstone composed in 1875 by Maria C. Tiddeman.",
+ "question": "What is the hymn known as in English?",
+ "answers": {
+ "answer_start": array([469], dtype=int32),
+ "text": array(["Flung to the Heedless Winds"], dtype=object)
+ }
+ }
+ ```
+
+ When evaluating generative models, we use the following setup (see the
+ [methodology](/methodology) for more information on how these are used):
+
+ - Number of few-shot examples: 4
+ - Prefix prompt:
+ ```
+ The following are texts with accompanying questions and answers.
+ ```
+ - Base prompt template:
+ ```
+ Text: {text}
+ Question: {question}
+ Answer in max 3 words:
+ ```
+ - Instruction-tuned prompt template:
+ ```
+ Text: {text}
+
+ Answer the following question about the above text in at most 3 words.
+
+ Question: {question}
+ ```
+
+ You can evaluate this dataset directly as follows:
+
+ ```bash
+ $ euroeval --model <model-id> --dataset xquad-en
+ ```
+
+
  ### Unofficial: BeleBele-en

  This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
@@ -358,8 +431,9 @@ $ euroeval --model <model-id> --dataset belebele-en

  ### Unofficial: MultiWikiQA-en

- This dataset will be published in an upcoming paper, and contains English Wikipedia
- articles with generated questions and answers, using the LLM Gemini-1.5-pro.
+ This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2509.04111)
+ and contains Wikipedia articles with LLM-generated questions and answers in 300+
+ languages.

  The original full dataset consists of 5,000 samples in a single split. We use a 1,024 /
  256 / 2,048 split for training, validation and testing, respectively, sampled randomly.
@@ -707,8 +781,69 @@ You can evaluate this dataset directly as follows:
  $ euroeval --model <model-id> --dataset hellaswag
  ```

+ ### Unofficial: Winogrande
+
+ This dataset was published in [this paper](https://doi.org/10.1145/3474381). The
+ original full dataset consists of 47 / 1,210 samples for training and testing, and we
+ use the same splits.
+
+ Here are a few examples from the training split:
+
+ ```json
+ {
+ "text": "Elena would grab their inventory in the back of the store for Megan to sell each time because _ was a businessperson. What does the blank _ refer to?\nChoices:\na. Elena\nb. Megan",
+ "label": "a"
+ }
+ ```
+
+ ```json
+ {
+ "text": "Once in Poland, Dennis enjoyed the trip more than Jason because _ had a deeper understanding of the Polish language. What does the blank _ refer to?\nChoices:\na. Dennis\nb. Jason",
+ "label": "a"
+ }
+ ```
+
+ ```json
+ {
+ "text": "Handling emergencies was never very difficult for Kevin but it was for Nelson because _ wasn't able to remain calm under pressure. What does the blank _ refer to?\nChoices:\na. Kevin\nb. Nelson",
+ "label": "b"
+ }
+ ```
+
+ When evaluating generative models, we use the following setup (see the
+ [methodology](/methodology) for more information on how these are used):
+
+ - Number of few-shot examples: 5
+ - Prefix prompt:
+ ```
+ The following are multiple choice questions (with answers).
+ ```
+ - Base prompt template:
+ ```
+ Question: {text}
+ Options:
+ a. {option_a}
+ b. {option_b}
+ Answer: {label}
+ ```
+ - Instruction-tuned prompt template:
+ ```
+ Question: {text}
+ Options:
+ a. {option_a}
+ b. {option_b}
+
+ Answer the above question by replying with 'a' or 'b', and nothing else.
+ ```
+
+ You can evaluate this dataset directly as follows:
+
+ ```bash
+ $ euroeval --model <model-id> --dataset winogrande
+ ```
+

- ## Summarization
+ ## Summarisation

  ### CNN/DailyMail

{euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/estonian.md

@@ -280,8 +280,9 @@ $ euroeval --model <model-id> --dataset scala-et

  ### MultiWikiQA-et

- This dataset will be published in an upcoming paper, and contains Estonian Wikipedia
- articles with generated questions and answers, using the LLM Gemini-1.5-pro.
+ This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2509.04111)
+ and contains Wikipedia articles with LLM-generated questions and answers in 300+
+ languages.

  The original full dataset consists of 5,000 samples in a single split. We use a 1,024 /
  256 / 2,048 split for training, validation and testing, respectively, sampled randomly.
@@ -351,7 +352,77 @@ $ euroeval --model <model-id> --dataset multi-wiki-qa-et

  ## Knowledge

- ### Exam-et
+ ### Trivia-et
+
+ This dataset was published [here](https://huggingface.co/datasets/TalTechNLP/trivia_et).
+ It was extracted from the "Eesti Mäng" board game, and contains trivia questions about
+ Estonia.
+
+ The original dataset contains 800 examples. From these, we use 240 / 60 / 500 samples
+ for our training, validation and test splits, respectively.
+
+ Note that this is a gated dataset, and we would like to avoid contaminating LLM
+ pre-training data as much as possible. Accordingly, we selected more generic questions
+ not representative of the full dataset in terms of question content to show here:
+
+ ```json
+ {
+ "text": "Mis on isoterm?\nVastusevariandid:\na. samatemperatuurijoon\nb. samaõhurõhujoon\nc. samapingejoon\nd. samakõrgusjoon",
+ "label": "a"
+ }
+ ```
+
+ ```json
+ {
+ "text": "Mis on isobaat?\nVastusevariandid:\na. samasügavusjoon\nb. samaõhurõhujoon\nc. samatemperatuurijoon\nd. samakõrgusjoon",
+ "label": "a"
+ }
+ ```
+
+ ```json
+ {
+ "text": "Mida mõõdetakse baromeetriga?\nVastusevariandid:\na. veekogude sügavust\nb. temperatuuri\nc. jõgede voolukiirust\nd. õhurõhku",
+ "label": "d"
+ ```
+
+ When evaluating generative models, we use the following setup (see the
+ [methodology](/methodology) for more information on how these are used):
+
+ - Number of few-shot examples: 5
+ - Prefix prompt:
+ ```
+ Järgnevad on vastusevariantidega küsimused (koos vastustega).
+ ```
+ - Base prompt template:
+ ```
+ Küsimus: {text}
+ Vastusevariandid:
+ a. {option_a}
+ b. {option_b}
+ c. {option_c}
+ d. {option_d}
+ Vastus: {label}
+ ```
+ - Instruction-tuned prompt template:
+ ```
+ Küsimus: {text}
+ Vastusevariandid:
+ a. {option_a}
+ b. {option_b}
+ c. {option_c}
+ d. {option_d}
+
+ Võimalikud vastused: 'a', 'b', 'c' or 'd'. Muud vastused ei ole lubatud.
+ ```
+
+ You can evaluate this dataset directly as follows:
+
+ ```bash
+ $ euroeval --model <model-id> --dataset trivia-et
+ ```
+
+
+ ### Unofficial: Exam-et

  This dataset was released in [this
  repository](https://huggingface.co/datasets/TalTechNLP/exam_et) and contains questions
@@ -420,9 +491,9 @@ $ euroeval --model <model-id> --dataset exam-et

  ## Common-sense Reasoning

- ### WinoGrande-ET
+ ### Winogrande-et

- The dataset includes the [WinoGrande](https://doi.org/10.48550/arXiv.1907.10641) test
+ The dataset includes the [Winogrande](https://doi.org/10.48550/arXiv.1907.10641) test
  set translated and culturally adapted by hand by a professional translator (citation
  TBA). The structure of the dataset is identical to the original. Since train and dev
  splits were not translated manually, we employ the GPT-4o model to translate the
@@ -430,7 +501,8 @@ expected number of examples starting from the beginning of the respective splits
  final dataset size is 1,024 / 256 / 1,767 for the training, validation and test splits,
  respectively.

- Here are a few examples from the training split (note that unlike the test split these are machine translated):
+ Here are a few examples from the training split (note that unlike the test split these
+ are machine translated):

  ```json
  {
@@ -440,7 +512,8 @@ Here are a few examples from the training split (note that unlike the test split
  ```
  ```json
  {
- "text": "Ian vabatahtlikult sõi Dennise menudo pärast seda, kui oli juba kausitäie söönud, sest _ nautis soolte söömist.\nVastusevariandid:\na. Ian\nb. Dennis", "label": "a"
+ "text": "Ian vabatahtlikult sõi Dennise menudo pärast seda, kui oli juba kausitäie söönud, sest _ nautis soolte söömist.\nVastusevariandid:\na. Ian\nb. Dennis",
+ "label": "a"
  }
  ```
  ```json
@@ -483,7 +556,7 @@ $ euroeval --model <model-id> --dataset winogrande-et
  ```


- ## Summarization
+ ## Summarisation

  ### ERRNews

@@ -495,8 +568,8 @@ pipeline paired with the human written summary from the archive.

  The original full dataset consists of 10,420 / 523 / 523 samples for training,
  validation and testing, respectively. We use a 1,024 / 256 / 2,048 split for training,
- validation and testing, respectively. The test split is extended with additional examples
- from the train split.
+ validation and testing, respectively. The test split is extended with additional
+ examples from the train split.

  ```json
  {
{euroeval-16.0.0 → euroeval-16.1.0}/docs/datasets/faroese.md

@@ -355,8 +355,9 @@ $ euroeval --model <model-id> --dataset foqa

  ### Unofficial: MultiWikiQA-fo

- This dataset will be published in an upcoming paper, and contains Faroese Wikipedia
- articles with generated questions and answers, using the LLM Gemini-1.5-pro.
+ This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2509.04111)
+ and contains Wikipedia articles with LLM-generated questions and answers in 300+
+ languages.

  The original full dataset consists of 5,000 samples in a single split. We use a 1,024 /
  256 / 2,048 split for training, validation and testing, respectively, sampled randomly.