EuroEval 15.6.0.tar.gz → 15.7.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.


Files changed (241)
  1. {euroeval-15.6.0 → euroeval-15.7.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +1 -0
  2. {euroeval-15.6.0 → euroeval-15.7.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +1 -0
  3. {euroeval-15.6.0 → euroeval-15.7.0}/.github/workflows/ci.yaml +6 -2
  4. {euroeval-15.6.0 → euroeval-15.7.0}/.gitignore +2 -2
  5. {euroeval-15.6.0 → euroeval-15.7.0}/.pre-commit-config.yaml +2 -2
  6. {euroeval-15.6.0 → euroeval-15.7.0}/CHANGELOG.md +65 -0
  7. {euroeval-15.6.0 → euroeval-15.7.0}/CONTRIBUTING.md +19 -5
  8. euroeval-15.7.0/NEW_DATASET_GUIDE.md +107 -0
  9. {euroeval-15.6.0 → euroeval-15.7.0}/PKG-INFO +15 -2
  10. {euroeval-15.6.0 → euroeval-15.7.0}/README.md +13 -0
  11. {euroeval-15.6.0 → euroeval-15.7.0}/docs/datasets/dutch.md +8 -6
  12. euroeval-15.7.0/docs/datasets/finnish.md +388 -0
  13. euroeval-15.7.0/docs/leaderboards/Monolingual/spanish.md +15 -0
  14. {euroeval-15.6.0 → euroeval-15.7.0}/docs/leaderboards/Multilingual/romance.md +1 -1
  15. {euroeval-15.6.0 → euroeval-15.7.0}/makefile +2 -15
  16. {euroeval-15.6.0 → euroeval-15.7.0}/pyproject.toml +2 -2
  17. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/benchmark_modules/litellm.py +136 -31
  18. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/benchmark_modules/vllm.py +105 -38
  19. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/benchmarker.py +12 -2
  20. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/constants.py +1 -1
  21. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/data_loading.py +48 -26
  22. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/data_models.py +8 -12
  23. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/dataset_configs/faroese.py +1 -0
  24. euroeval-15.7.0/src/euroeval/dataset_configs/finnish.py +60 -0
  25. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/dataset_configs/norwegian.py +1 -1
  26. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/prompt_templates/linguistic_acceptability.py +9 -1
  27. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/prompt_templates/multiple_choice.py +8 -1
  28. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/prompt_templates/named_entity_recognition.py +20 -1
  29. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/prompt_templates/reading_comprehension.py +11 -1
  30. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/prompt_templates/sentiment_classification.py +11 -1
  31. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/prompt_templates/summarization.py +9 -1
  32. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/task_group_utils/sequence_classification.py +27 -32
  33. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/task_group_utils/text_to_text.py +10 -27
  34. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/tasks.py +1 -1
  35. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/tokenization_utils.py +22 -6
  36. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_allocine.py +1 -1
  37. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_arc.py +1 -1
  38. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_arc_is.py +1 -1
  39. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_belebele.py +1 -1
  40. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_cnn_dailymail.py +1 -1
  41. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_conll_en.py +1 -1
  42. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_conll_es.py +1 -1
  43. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_conll_nl.py +1 -1
  44. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_dane.py +1 -1
  45. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_danish_citizen_tests.py +1 -1
  46. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_dansk.py +1 -1
  47. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_danske_talemaader.py +1 -1
  48. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_danske_talemaader_old.py +1 -1
  49. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_dbrd.py +1 -1
  50. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_dutch_cola.py +1 -1
  51. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_dutch_social.py +1 -1
  52. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_eltec.py +1 -1
  53. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_fone.py +1 -1
  54. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_foqa.py +1 -1
  55. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_fosent.py +1 -1
  56. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_fquad.py +1 -1
  57. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_germanquad.py +1 -1
  58. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_germeval.py +1 -1
  59. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_hellaswag.py +1 -1
  60. euroeval-15.7.0/src/scripts/create_hellaswag_fi.py +274 -0
  61. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_hotter_and_colder_sentiment.py +1 -1
  62. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_ice_linguistic.py +1 -1
  63. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_icelandic_error_corpus.py +1 -1
  64. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_icelandic_knowledge.py +1 -1
  65. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_icelandic_qa.py +1 -1
  66. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_icesum.py +1 -1
  67. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_ilpost_sum.py +1 -1
  68. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_jentoft.py +1 -1
  69. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_mlsum_de.py +1 -1
  70. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_mlsum_es.py +1 -1
  71. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_mmlu.py +1 -1
  72. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_multinerd-it.py +1 -1
  73. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_no_cola.py +1 -1
  74. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_no_sammendrag.py +1 -1
  75. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_nor_common_sense_qa.py +1 -1
  76. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_nordjylland_news.py +1 -1
  77. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_norglm_multisum.py +1 -1
  78. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_norne.py +1 -1
  79. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_norquad.py +1 -1
  80. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_nqii.py +1 -1
  81. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_nrk_quiz_qa.py +1 -1
  82. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_orange_sum.py +1 -1
  83. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_personal_sum.py +1 -1
  84. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_rrn.py +1 -1
  85. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_sb10k.py +1 -1
  86. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_scala.py +3 -1
  87. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_scandiqa.py +1 -1
  88. euroeval-15.7.0/src/scripts/create_scandisent_fi.py +93 -0
  89. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_schibsted.py +1 -1
  90. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_sentiment_headlines_es.py +1 -1
  91. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_sentipolc16.py +1 -1
  92. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_squad.py +1 -1
  93. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_squad_it.py +1 -1
  94. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_squad_nl.py +1 -1
  95. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_squad_nl_old.py +1 -1
  96. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_sst5.py +1 -1
  97. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_suc3.py +1 -1
  98. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_swedn.py +1 -1
  99. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_swerec.py +1 -1
  100. euroeval-15.7.0/src/scripts/create_turku_ner_fi.py +117 -0
  101. euroeval-15.7.0/src/scripts/create_tydiqa_fi.py +118 -0
  102. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_wiki_lingua_nl.py +1 -1
  103. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_winogrande_is.py +1 -1
  104. euroeval-15.7.0/src/scripts/create_xlsum_fi.py +78 -0
  105. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/load_ud_pos.py +18 -0
  106. euroeval-15.7.0/tests/test_data_loading.py +107 -0
  107. {euroeval-15.6.0 → euroeval-15.7.0}/uv.lock +2726 -2726
  108. euroeval-15.6.0/tests/test_data_loading.py +0 -51
  109. {euroeval-15.6.0 → euroeval-15.7.0}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
  110. {euroeval-15.6.0 → euroeval-15.7.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
  111. {euroeval-15.6.0 → euroeval-15.7.0}/CITATION.cff +0 -0
  112. {euroeval-15.6.0 → euroeval-15.7.0}/CODE_OF_CONDUCT.md +0 -0
  113. {euroeval-15.6.0 → euroeval-15.7.0}/Dockerfile.cuda +0 -0
  114. {euroeval-15.6.0 → euroeval-15.7.0}/LICENSE +0 -0
  115. {euroeval-15.6.0 → euroeval-15.7.0}/docs/CNAME +0 -0
  116. {euroeval-15.6.0 → euroeval-15.7.0}/docs/README.md +0 -0
  117. {euroeval-15.6.0 → euroeval-15.7.0}/docs/datasets/README.md +0 -0
  118. {euroeval-15.6.0 → euroeval-15.7.0}/docs/datasets/danish.md +0 -0
  119. {euroeval-15.6.0 → euroeval-15.7.0}/docs/datasets/english.md +0 -0
  120. {euroeval-15.6.0 → euroeval-15.7.0}/docs/datasets/faroese.md +0 -0
  121. {euroeval-15.6.0 → euroeval-15.7.0}/docs/datasets/french.md +0 -0
  122. {euroeval-15.6.0 → euroeval-15.7.0}/docs/datasets/german.md +0 -0
  123. {euroeval-15.6.0 → euroeval-15.7.0}/docs/datasets/icelandic.md +0 -0
  124. {euroeval-15.6.0 → euroeval-15.7.0}/docs/datasets/italian.md +0 -0
  125. {euroeval-15.6.0 → euroeval-15.7.0}/docs/datasets/norwegian.md +0 -0
  126. {euroeval-15.6.0 → euroeval-15.7.0}/docs/datasets/spanish.md +0 -0
  127. {euroeval-15.6.0 → euroeval-15.7.0}/docs/datasets/swedish.md +0 -0
  128. {euroeval-15.6.0 → euroeval-15.7.0}/docs/extras/radial_plotter.md +0 -0
  129. {euroeval-15.6.0 → euroeval-15.7.0}/docs/faq.md +0 -0
  130. {euroeval-15.6.0 → euroeval-15.7.0}/docs/gfx/favicon.png +0 -0
  131. {euroeval-15.6.0 → euroeval-15.7.0}/docs/leaderboards/Monolingual/danish.md +0 -0
  132. {euroeval-15.6.0 → euroeval-15.7.0}/docs/leaderboards/Monolingual/dutch.md +0 -0
  133. {euroeval-15.6.0 → euroeval-15.7.0}/docs/leaderboards/Monolingual/english.md +0 -0
  134. {euroeval-15.6.0 → euroeval-15.7.0}/docs/leaderboards/Monolingual/faroese.md +0 -0
  135. {euroeval-15.6.0 → euroeval-15.7.0}/docs/leaderboards/Monolingual/french.md +0 -0
  136. {euroeval-15.6.0 → euroeval-15.7.0}/docs/leaderboards/Monolingual/german.md +0 -0
  137. {euroeval-15.6.0 → euroeval-15.7.0}/docs/leaderboards/Monolingual/icelandic.md +0 -0
  138. {euroeval-15.6.0 → euroeval-15.7.0}/docs/leaderboards/Monolingual/italian.md +0 -0
  139. {euroeval-15.6.0 → euroeval-15.7.0}/docs/leaderboards/Monolingual/norwegian.md +0 -0
  140. {euroeval-15.6.0 → euroeval-15.7.0}/docs/leaderboards/Monolingual/swedish.md +0 -0
  141. {euroeval-15.6.0 → euroeval-15.7.0}/docs/leaderboards/Multilingual/european.md +0 -0
  142. {euroeval-15.6.0 → euroeval-15.7.0}/docs/leaderboards/Multilingual/germanic.md +0 -0
  143. {euroeval-15.6.0 → euroeval-15.7.0}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
  144. {euroeval-15.6.0 → euroeval-15.7.0}/docs/leaderboards/README.md +0 -0
  145. {euroeval-15.6.0 → euroeval-15.7.0}/docs/methodology.md +0 -0
  146. {euroeval-15.6.0 → euroeval-15.7.0}/docs/python-package.md +0 -0
  147. {euroeval-15.6.0 → euroeval-15.7.0}/docs/tasks/README.md +0 -0
  148. {euroeval-15.6.0 → euroeval-15.7.0}/docs/tasks/common-sense-reasoning.md +0 -0
  149. {euroeval-15.6.0 → euroeval-15.7.0}/docs/tasks/knowledge.md +0 -0
  150. {euroeval-15.6.0 → euroeval-15.7.0}/docs/tasks/linguistic-acceptability.md +0 -0
  151. {euroeval-15.6.0 → euroeval-15.7.0}/docs/tasks/named-entity-recognition.md +0 -0
  152. {euroeval-15.6.0 → euroeval-15.7.0}/docs/tasks/reading-comprehension.md +0 -0
  153. {euroeval-15.6.0 → euroeval-15.7.0}/docs/tasks/sentiment-classification.md +0 -0
  154. {euroeval-15.6.0 → euroeval-15.7.0}/docs/tasks/speed.md +0 -0
  155. {euroeval-15.6.0 → euroeval-15.7.0}/docs/tasks/summarization.md +0 -0
  156. {euroeval-15.6.0 → euroeval-15.7.0}/gfx/euroeval.png +0 -0
  157. {euroeval-15.6.0 → euroeval-15.7.0}/gfx/euroeval.xcf +0 -0
  158. {euroeval-15.6.0 → euroeval-15.7.0}/gfx/scandeval.png +0 -0
  159. {euroeval-15.6.0 → euroeval-15.7.0}/mkdocs.yaml +0 -0
  160. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/__init__.py +0 -0
  161. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/benchmark_config_factory.py +0 -0
  162. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/benchmark_modules/__init__.py +0 -0
  163. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/benchmark_modules/base.py +0 -0
  164. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/benchmark_modules/fresh.py +0 -0
  165. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/benchmark_modules/hf.py +0 -0
  166. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/callbacks.py +0 -0
  167. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/cli.py +0 -0
  168. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/dataset_configs/__init__.py +0 -0
  169. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/dataset_configs/danish.py +0 -0
  170. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/dataset_configs/dutch.py +0 -0
  171. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/dataset_configs/english.py +0 -0
  172. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/dataset_configs/french.py +0 -0
  173. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/dataset_configs/german.py +0 -0
  174. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/dataset_configs/icelandic.py +0 -0
  175. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/dataset_configs/italian.py +0 -0
  176. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/dataset_configs/spanish.py +0 -0
  177. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/dataset_configs/swedish.py +0 -0
  178. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/enums.py +0 -0
  179. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/exceptions.py +0 -0
  180. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/finetuning.py +0 -0
  181. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/generation.py +0 -0
  182. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/human_evaluation.py +0 -0
  183. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/languages.py +0 -0
  184. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/model_cache.py +0 -0
  185. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/model_config.py +0 -0
  186. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/model_loading.py +0 -0
  187. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/prompt_templates/__init__.py +0 -0
  188. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/scores.py +0 -0
  189. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/speed_benchmark.py +0 -0
  190. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/task_group_utils/__init__.py +0 -0
  191. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/task_group_utils/multiple_choice_classification.py +0 -0
  192. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/task_group_utils/question_answering.py +0 -0
  193. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/task_group_utils/token_classification.py +0 -0
  194. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/types.py +0 -0
  195. {euroeval-15.6.0 → euroeval-15.7.0}/src/euroeval/utils.py +0 -0
  196. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/constants.py +0 -0
  197. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_angry_tweets.py +0 -0
  198. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_mim_gold_ner.py +0 -0
  199. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_mlqa_es.py +0 -0
  200. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_norec.py +0 -0
  201. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_norglm_multiqa.py +0 -0
  202. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_wikiann_fo.py +0 -0
  203. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_wikineural-it.py +0 -0
  204. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/create_xquad_es.py +0 -0
  205. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/fix_dot_env_file.py +0 -0
  206. {euroeval-15.6.0 → euroeval-15.7.0}/src/scripts/versioning.py +0 -0
  207. {euroeval-15.6.0 → euroeval-15.7.0}/tests/__init__.py +0 -0
  208. {euroeval-15.6.0 → euroeval-15.7.0}/tests/conftest.py +0 -0
  209. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_benchmark_config_factory.py +0 -0
  210. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_benchmark_modules/__init__.py +0 -0
  211. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_benchmark_modules/test_base.py +0 -0
  212. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_benchmark_modules/test_fresh.py +0 -0
  213. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_benchmark_modules/test_hf.py +0 -0
  214. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_benchmark_modules/test_litellm.py +0 -0
  215. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_benchmark_modules/test_vllm.py +0 -0
  216. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_benchmarker.py +0 -0
  217. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_callbacks.py +0 -0
  218. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_cli.py +0 -0
  219. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_constants.py +0 -0
  220. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_data_models.py +0 -0
  221. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_dataset_configs.py +0 -0
  222. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_enums.py +0 -0
  223. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_exceptions.py +0 -0
  224. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_finetuning.py +0 -0
  225. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_generation.py +0 -0
  226. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_human_evaluation.py +0 -0
  227. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_languages.py +0 -0
  228. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_model_cache.py +0 -0
  229. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_model_config.py +0 -0
  230. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_model_loading.py +0 -0
  231. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_scores.py +0 -0
  232. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_speed_benchmark.py +0 -0
  233. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_task_utils/__init__.py +0 -0
  234. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_task_utils/test_question_answering.py +0 -0
  235. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_task_utils/test_sequence_classification.py +0 -0
  236. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_task_utils/test_text_to_text.py +0 -0
  237. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_task_utils/test_token_classification.py +0 -0
  238. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_tasks.py +0 -0
  239. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_tokenization_utils.py +0 -0
  240. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_types.py +0 -0
  241. {euroeval-15.6.0 → euroeval-15.7.0}/tests/test_utils.py +0 -0
--- a/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml
+++ b/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml
@@ -26,6 +26,7 @@ body:
   - label: Dutch
   - label: English
   - label: Faroese
+  - label: Finnish
   - label: French
   - label: German
   - label: Icelandic
--- a/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml
+++ b/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml
@@ -21,6 +21,7 @@ body:
   - label: Romance languages (French, Italian, Spanish)
   - label: Scandinavian languages (Danish, Faroese, Icelandic, Norwegian, Swedish)
   - label: West Germanic languages (Dutch, English, German)
+  - label: Finnish
 validations:
   required: true
 - type: dropdown
--- a/.github/workflows/ci.yaml
+++ b/.github/workflows/ci.yaml
@@ -24,7 +24,10 @@ jobs:
       - uses: actions/setup-python@v5
         with:
           python-version: "3.11"
-      - uses: pre-commit/action@v3.0.1
+      - run: python -m pip install pre-commit
+        shell: bash
+      - run: pre-commit run --show-diff-on-failure --color=always
+        shell: bash

   pytest-linux:
     if: github.event.pull_request.draft == false
@@ -41,8 +44,9 @@ jobs:
           persist-credentials: false

       - name: Install uv and set up Python
-        uses: astral-sh/setup-uv@v4
+        uses: astral-sh/setup-uv@v5
         with:
+          enable-cache: false
           python-version: ${{ matrix.python-version }}

       - name: Install Dependencies
--- a/.gitignore
+++ b/.gitignore
@@ -117,5 +117,5 @@ site/
 docs/datasets/dataset_example_commands.txt

 # Various graphics
-gfx/euroeval-italian.png
-gfx/euroeval-italian.xcf
+gfx/euroeval-*.png
+gfx/euroeval-*.xcf
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -8,9 +8,9 @@ repos:
     hooks:
       - id: end-of-file-fixer
       - id: trailing-whitespace
-      - id: debug-statements
+      # - id: debug-statements
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.11.5
+    rev: v0.11.7
     hooks:
       - id: ruff
         args:
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -10,6 +10,71 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.



+## [v15.7.0] - 2025-04-28
+### Added
+- Added support for Finnish 🇫🇮! This includes the Finnish part of the reading
+  comprehension dataset
+  [TydiQA-fi](https://huggingface.co/datasets/google-research-datasets/tydiqa/viewer/secondary_task?views%5B%5D=secondary_task_train),
+  the Finnish part of the binary sentiment classification dataset
+  [ScandiSent](https://github.com/timpal0l/ScandiSent), the linguistic acceptability
+  dataset ScaLA with the [Finnish Universal
+  Dependencies](https://github.com/UniversalDependencies/UD_Finnish-TDT), the NER
+  dataset [Turku NER](https://aclanthology.org/2020.lrec-1.567/), the summarisation
+  dataset [XL-Sum-fi](https://huggingface.co/datasets/TurkuNLP/xlsum-fi), and the
+  common-sense reasoning dataset
+  [HellaSwag-fi](https://huggingface.co/datasets/Finnish-NLP/hellaswag-fi-google-translate).
+  This was contributed by [@oliverkinch](https://github.com/oliverkinch) ✨
+- Added metadata for the GPT-4.1 and Grok-3 models.
+- Marked Gemini-2.5-flash and Grok-3-mini as reasoning models, giving them more tokens
+  to think.
+
+### Changed
+- Updated `datasets` to `>=3.5.0`, as the previous versions were incompatible with
+  newer versions of `huggingface_hub`.
+- Increased the number of allowed reasoning tokens from 8,192 to 32,768 for reasoning
+  models, as several models did not stop reasoning before running out of tokens,
+  yielding a blank output.
+- API models now use JSON schemas for the NER task if they support them; if not, they
+  fall back to standard JSON mode (which does not enforce a specific schema, just that
+  the output is JSON).
+
+### Fixed
+- If we fail to extract labels using a generative model's logprobs, we now fall back to
+  using the word edit distance between the outputted text and the labels instead of
+  throwing an error.
+- Fixed a typo that prevented the `thinking` parameter from being used with
+  `claude-3-7-sonnet`.
+- We now catch the error raised when an API model requires the temperature to be set to
+  1.0, and retry the evaluation with temperature 1.0.
+- When benchmarking a model with a revision (i.e., of the form `<model-id>@<revision>`),
+  we now correctly store the full model ID, including the revision, in the benchmark
+  results on disk.
+- Fixed a GPU out-of-memory crash when computing BERTScore for the summarisation task.
+  The batch size is now reduced to 1 for this task, making it slightly slower but more
+  memory efficient.
+- Disabled structured outputs and logprobs for reasoning models, to ensure that they
+  are allowed to output reasoning tokens before they output their answer.
+- Stop sequences are no longer supplied to API models that do not support them.
+- If a `SystemError` occurs during LiteLLM generation, we now retry the generation.
+- Handled the case where a LiteLLM model does not support specifying `maxItems` in the
+  JSON schema during structured generation.
+- Prompts are now truncated to the decoder model's maximum sequence length if that
+  length is smaller than 5,000 tokens.
+
+
+## [v15.6.1] - 2025-04-14
+### Changed
+- Added more information about SQuAD-nl to the documentation. This was contributed by
+  [@Rijgersberg](https://github.com/Rijgersberg) ✨
+
+### Fixed
+- The "E" option for the Norwegian NorCommonSenseQA dataset was not included in the
+  refactor in v15.6.0, leading to evaluation errors. This has now been fixed.
+- The number of few-shot examples for FoSent was not reduced back to 5 during the
+  refactor in v15.6.0, leading to evaluation errors. This has now been fixed.
+
+
 ## [v15.6.0] - 2025-04-13
 ### Added
 - We now support specifying custom inference providers when benchmarking via the Hugging
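The label-extraction fallback in the changelog above is easy to illustrate. Below is a minimal, hypothetical sketch of the idea: when logprobs are unavailable or unusable, pick the candidate label whose word sequence is closest to the generated text by edit distance. The helper names are invented for illustration; this is not EuroEval's actual implementation (which is presumably reflected in the `sequence_classification.py` changes listed above).

```python
# Hypothetical sketch of the word-edit-distance fallback described in the
# v15.7.0 changelog; not EuroEval's actual code.


def word_edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two word sequences."""
    dp = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        prev, dp[0] = dp[0], i
        for j, word_b in enumerate(b, start=1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,  # delete word_a
                dp[j - 1] + 1,  # insert word_b
                prev + (word_a != word_b),  # substitute, free if equal
            )
    return dp[-1]


def closest_label(generated: str, labels: list[str]) -> str:
    """Fall back to the label closest to the generated text."""
    words = generated.lower().split()
    return min(labels, key=lambda lbl: word_edit_distance(words, lbl.lower().split()))


print(closest_label("the sentiment is positive", ["positive", "negative", "neutral"]))
# -> "positive"
```

If no candidate label shares any words with the output, ties are possible, in which case a deterministic tie-break (here, list order) applies.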
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -14,14 +14,22 @@ issue, creating a PR, reviewing, and merging the PR.
 To get an overview of the project, read the [README](README.md). Here are some
 resources to help you get started with open source contributions:

-- [Finding ways to contribute to open source on GitHub](https://docs.github.com/en/get-started/exploring-projects-on-github/finding-ways-to-contribute-to-open-source-on-github)
+- [Finding ways to contribute to open source on
+  GitHub](https://docs.github.com/en/get-started/exploring-projects-on-github/finding-ways-to-contribute-to-open-source-on-github)
 - [Set up Git](https://docs.github.com/en/get-started/quickstart/set-up-git)
 - [GitHub flow](https://docs.github.com/en/get-started/quickstart/github-flow)
-- [Collaborating with pull requests](https://docs.github.com/en/github/collaborating-with-pull-requests)
+- [Collaborating with pull
+  requests](https://docs.github.com/en/github/collaborating-with-pull-requests)


 ## Getting started

+### Adding datasets
+
+EuroEval welcomes contributions of new datasets that help evaluate language models
+across European languages. A guide for adding datasets to EuroEval can be found
+[here](NEW_DATASET_GUIDE.md).
+
 ### Issues

 #### Create a new issue
@@ -42,11 +50,17 @@ find an issue to work on, you are welcome to open a PR with a fix.

 1. Fork the repository.
    - Using GitHub Desktop:
-     - [Getting started with GitHub Desktop](https://docs.github.com/en/desktop/installing-and-configuring-github-desktop/getting-started-with-github-desktop) will guide you through setting up Desktop.
-     - Once Desktop is set up, you can use it to [fork the repo](https://docs.github.com/en/desktop/contributing-and-collaborating-using-github-desktop/cloning-and-forking-repositories-from-github-desktop)!
+     - [Getting started with GitHub
+       Desktop](https://docs.github.com/en/desktop/installing-and-configuring-github-desktop/getting-started-with-github-desktop)
+       will guide you through setting up Desktop.
+     - Once Desktop is set up, you can use it to [fork the
+       repo](https://docs.github.com/en/desktop/contributing-and-collaborating-using-github-desktop/cloning-and-forking-repositories-from-github-desktop)!

    - Using the command line:
-     - [Fork the repo](https://docs.github.com/en/github/getting-started-with-github/fork-a-repo#fork-an-example-repository) so that you can make your changes without affecting the original project until you're ready to merge them.
+     - [Fork the
+       repo](https://docs.github.com/en/github/getting-started-with-github/fork-a-repo#fork-an-example-repository)
+       so that you can make your changes without affecting the original project until
+       you're ready to merge them.

 3. Run `make install` from within the repo to get set up

--- /dev/null
+++ b/NEW_DATASET_GUIDE.md
@@ -0,0 +1,107 @@
+# Contributing a Dataset to EuroEval
+
+This guide will walk you through the process of contributing a new dataset to EuroEval.
+
+For general contribution guidelines, please refer to our [Contributing Guide](CONTRIBUTING.md).
+
+If you have any questions during this process, please open an issue on the [EuroEval GitHub repository](https://github.com/EuroEval/EuroEval/issues).
+
+
+## Step 0: Prerequisites
+
+Before beginning:
+1. Check whether your dataset matches [one of the supported tasks](https://euroeval.com/tasks/). If it doesn't match any supported task, you have two options:
+   1. Try to adapt it to fit an existing task (e.g., by reformatting it or adding multiple-choice options)
+   2. Open an issue on the EuroEval repository requesting a new task type
+2. If it does, [fork the EuroEval repository](https://github.com/EuroEval/EuroEval/fork) and create a new branch to work on your dataset contribution
+
+
+## Step 1: Create the Dataset Processing Script
+
+Create a script in the `src/scripts` directory that processes your dataset into the EuroEval format.
+
+The dataset creation script roughly follows this pattern (where `process_raw_dataset` stands in for your own processing logic):
+
+```python
+from datasets import load_dataset
+
+# Load your dataset.
+raw_dataset = load_dataset("path_to_your_dataset")
+
+# Process the dataset to fit the EuroEval format.
+dataset = process_raw_dataset(raw_dataset=raw_dataset)
+
+# Push the dataset to the Hugging Face Hub.
+dataset.push_to_hub("EuroEval/your_dataset_name", private=True)
+```
+
+### Tips for Dataset Processing
+- Examine existing scripts for datasets with the same task as a reference for how to process your dataset.
+- Take a look at [existing datasets in your language](https://euroeval.com/datasets/) to see how these are usually set up. Study these examples to understand the expected format and structure for your own dataset's entries.
+- Split your dataset into train / val / test sets, ideally with 1,024 / 256 / 2,048 samples, respectively.
+- If your dataset already has splits, maintain consistency (e.g., the EuroEval train split should be a subset of the original train split).
+
+
+## Step 2: Add the Dataset Configuration
+
+Dataset configurations in EuroEval are organised by language, with each language having its own file at `src/euroeval/dataset_configs/{language}.py`. A configuration is made with the `DatasetConfig` class. Here is an example for the fictional English knowledge dataset `Rizzler`:
+
+```python
+RIZZLER_KNOWLEDGE_CONFIG = DatasetConfig(
+    name="rizzler_knowledge",  # The name of the dataset
+    pretty_name="the truncated version of the English knowledge dataset Rizzler",  # Used in logs
+    huggingface_id="EuroEval/rizzler_knowledge",  # The same ID as in the dataset creation script
+    task=KNOW,  # The task of the dataset
+    languages=[EN],  # The language(s) of the dataset
+    unofficial=True,  # Whether the dataset is unofficial
+)
+```
+
+Every `src/euroeval/dataset_configs/{language}.py` file has two sections:
+- `### Official datasets ###`
+- `### Unofficial datasets ###`
+
+An unofficial dataset means that the resulting evaluation will not be included in the [official leaderboard](https://euroeval.com/leaderboards/).
+
+As a starting point, make your dataset unofficial. This can always be changed later.
+
+
+## Step 3: Document Your Dataset
+
+Dataset documentation in EuroEval is organised by language, with each language having its own file at `docs/datasets/{language}.md`. Within each language file, documentation is further organised by task.
+
+Navigate to the documentation file for your dataset's language and add your dataset's documentation in the appropriate task section.
+
+The documentation should include the following information:
+
+1. **General description**: Explain the dataset's origin and purpose
+2. **Split details**: Describe how the splits were created and their sizes
+3. **Example samples**: Provide 3 representative examples from the training split
+4. **Evaluation setup**: Explain how models are evaluated on this dataset
+5. **Evaluation command**: Show how to evaluate a model on your dataset
+
+To do this, you can follow these steps:
+1. Find an existing dataset of the same task in `docs/datasets/{language}.md`
+2. Copy the entire documentation section for that dataset
+3. Use this as a template and modify all details to match your new dataset
+4. Ensure you update all dataset-specific information (description, split sizes, example samples, etc.)
+
+
+## Step 4: Update the Changelog
+
+After completing the previous steps, add an entry to the project's changelog to document your contribution. The entry should be added under the `[Unreleased]` section with a short description of the dataset you have added. Here is an example of a new entry:
+
+```md
+## [Unreleased]
+### Added
+- Added the English knowledge dataset [rizzler_knowledge](https://huggingface.co/datasets/Example-User/rizzler_knowledge). The split is given by 1,024 / 256 / 2,048 samples for train / val / test, respectively. It is marked as `unofficial` for now. This was contributed by [@your_name](https://github.com/your_name) ✨
+```
+
+
+## Step 5: Make a Pull Request
+
+When you have completed all the previous steps, create a pull request to the EuroEval repository.
+
+
+### Thank you!
+
+This concludes the process of contributing a dataset to EuroEval. Your contribution helps expand the multilingual evaluation capabilities of the benchmark and is greatly appreciated by the research community!
+
+Thank you for your valuable contribution! 🎉
1
1
  Metadata-Version: 2.4
2
2
  Name: EuroEval
3
- Version: 15.6.0
3
+ Version: 15.7.0
4
4
  Summary: The robust European language model benchmark.
5
5
  Project-URL: Repository, https://github.com/EuroEval/EuroEval
6
6
  Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -32,7 +32,7 @@ Requires-Python: <4.0,>=3.10
32
32
  Requires-Dist: accelerate>=0.34.2
33
33
  Requires-Dist: bert-score>=0.3.13
34
34
  Requires-Dist: click>=8.1.3
35
- Requires-Dist: datasets>=2.15.0
35
+ Requires-Dist: datasets>=3.5.0
36
36
  Requires-Dist: demjson3>=3.0.6
37
37
  Requires-Dist: evaluate>=0.4.1
38
38
  Requires-Dist: huggingface-hub>=0.30.1
@@ -237,6 +237,19 @@ A huge thank you to all the contributors who have helped make this project a suc
237
237
  <a href="https://github.com/ThomasKluiters"><img src="https://avatars.githubusercontent.com/u/8137941" width=50 alt="Contributor avatar for ThomasKluiters"/></a>
238
238
  <a href="https://github.com/BramVanroy"><img src="https://avatars.githubusercontent.com/u/2779410" width=50 alt="Contributor avatar for BramVanroy"/></a>
239
239
  <a href="https://github.com/peregilk"><img src="https://avatars.githubusercontent.com/u/9079808" width=50 alt="Contributor avatar for peregilk"/></a>
240
+ <a href="https://github.com/Rijgersberg"><img src="https://avatars.githubusercontent.com/u/8604946" width=50 alt="Contributor avatar for Rijgersberg"/></a>
241
+
242
+
243
+ ### Contribute to EuroEval
244
+
245
+ We welcome contributions to EuroEval! Whether you're fixing bugs, adding features, or
246
+ contributing new datasets, your help makes this project better for everyone.
247
+
248
+ - **General contributions**: Check out our [contribution guidelines](CONTRIBUTING.md)
249
+ for information on how to get started.
250
+ - **Adding datasets**: If you're interested in adding a new dataset to EuroEval, we have
251
+ a [dedicated guide](NEW_DATASET_GUIDE.md) with step-by-step instructions.
252
+
240
253
 
241
254
  ### Special Thanks
242
255
  - Thanks to [Google](https://google.com/) for sponsoring Gemini credits as part of their
--- a/README.md
+++ b/README.md
@@ -161,6 +161,19 @@ A huge thank you to all the contributors who have helped make this project a suc
 <a href="https://github.com/ThomasKluiters"><img src="https://avatars.githubusercontent.com/u/8137941" width=50 alt="Contributor avatar for ThomasKluiters"/></a>
 <a href="https://github.com/BramVanroy"><img src="https://avatars.githubusercontent.com/u/2779410" width=50 alt="Contributor avatar for BramVanroy"/></a>
 <a href="https://github.com/peregilk"><img src="https://avatars.githubusercontent.com/u/9079808" width=50 alt="Contributor avatar for peregilk"/></a>
+<a href="https://github.com/Rijgersberg"><img src="https://avatars.githubusercontent.com/u/8604946" width=50 alt="Contributor avatar for Rijgersberg"/></a>
+
+
+### Contribute to EuroEval
+
+We welcome contributions to EuroEval! Whether you're fixing bugs, adding features, or
+contributing new datasets, your help makes this project better for everyone.
+
+- **General contributions**: Check out our [contribution guidelines](CONTRIBUTING.md)
+  for information on how to get started.
+- **Adding datasets**: If you're interested in adding a new dataset to EuroEval, we have
+  a [dedicated guide](NEW_DATASET_GUIDE.md) with step-by-step instructions.
+

 ### Special Thanks
 - Thanks to [Google](https://google.com/) for sponsoring Gemini credits as part of their
--- a/docs/datasets/dutch.md
+++ b/docs/datasets/dutch.md
@@ -310,12 +310,14 @@ Here are a few examples from the training split:
 This dataset is published
 [here](https://huggingface.co/datasets/GroNLP/squad-nl-v2.0) and is a machine translated
 dataset of the English [SQuAD](https://aclanthology.org/D16-1264/) and
-[XQuAD](https://aclanthology.org/2020.acl-main.421/) datasets. Google Translate was used
-to translate the original datasets to Dutch.
-
-These are based on English Wikipedia articles and the questions and answers are written
-by crowdworkers. It is not clear how the translations were done, this information is
-pending from the authors.
+[XQuAD](https://aclanthology.org/2020.acl-main.421/) datasets, created for the
+Dutch-language [DUMB](https://dumbench.nl/) benchmark. Google Translate was used to
+translate the original datasets to Dutch. The test data
+[was manually corrected](https://aclanthology.org/2023.emnlp-main.447/) by eight BSc
+students as part of their thesis work.
+
+The original SQuAD and XQuAD datasets are based on English Wikipedia articles and the
+questions and answers are written by crowdworkers.

 Here are a few examples from the training split: