EuroEval 15.6.1.tar.gz → 15.7.1.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.


Files changed (242)
  1. {euroeval-15.6.1 → euroeval-15.7.1}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +1 -0
  2. {euroeval-15.6.1 → euroeval-15.7.1}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +1 -0
  3. {euroeval-15.6.1 → euroeval-15.7.1}/.github/workflows/ci.yaml +6 -2
  4. {euroeval-15.6.1 → euroeval-15.7.1}/.gitignore +2 -2
  5. {euroeval-15.6.1 → euroeval-15.7.1}/.pre-commit-config.yaml +1 -1
  6. {euroeval-15.6.1 → euroeval-15.7.1}/CHANGELOG.md +65 -0
  7. {euroeval-15.6.1 → euroeval-15.7.1}/CONTRIBUTING.md +19 -5
  8. euroeval-15.7.1/NEW_DATASET_GUIDE.md +107 -0
  9. {euroeval-15.6.1 → euroeval-15.7.1}/PKG-INFO +14 -2
  10. {euroeval-15.6.1 → euroeval-15.7.1}/README.md +12 -0
  11. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/dutch.md +1 -62
  12. euroeval-15.7.1/docs/datasets/finnish.md +388 -0
  13. euroeval-15.7.1/docs/leaderboards/Monolingual/spanish.md +15 -0
  14. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Multilingual/romance.md +1 -1
  15. {euroeval-15.6.1 → euroeval-15.7.1}/pyproject.toml +2 -2
  16. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/benchmark_modules/litellm.py +148 -284
  17. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/benchmark_modules/vllm.py +115 -338
  18. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/benchmarker.py +13 -2
  19. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/constants.py +1 -1
  20. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/data_loading.py +48 -26
  21. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/data_models.py +3 -9
  22. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/dutch.py +5 -16
  23. euroeval-15.7.1/src/euroeval/dataset_configs/finnish.py +60 -0
  24. euroeval-15.7.1/src/euroeval/generation_utils.py +346 -0
  25. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/prompt_templates/linguistic_acceptability.py +9 -1
  26. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/prompt_templates/multiple_choice.py +8 -1
  27. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/prompt_templates/named_entity_recognition.py +20 -1
  28. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/prompt_templates/reading_comprehension.py +11 -1
  29. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/prompt_templates/sentiment_classification.py +11 -1
  30. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/prompt_templates/summarization.py +9 -1
  31. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/scores.py +7 -1
  32. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/task_group_utils/sequence_classification.py +27 -32
  33. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/task_group_utils/text_to_text.py +10 -27
  34. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/tasks.py +1 -1
  35. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/tokenization_utils.py +22 -6
  36. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_allocine.py +1 -1
  37. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_arc.py +1 -1
  38. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_arc_is.py +1 -1
  39. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_belebele.py +1 -1
  40. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_cnn_dailymail.py +1 -1
  41. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_conll_en.py +1 -1
  42. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_conll_es.py +1 -1
  43. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_conll_nl.py +1 -1
  44. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_dane.py +1 -1
  45. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_danish_citizen_tests.py +1 -1
  46. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_dansk.py +1 -1
  47. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_danske_talemaader.py +1 -1
  48. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_danske_talemaader_old.py +1 -1
  49. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_dbrd.py +23 -23
  50. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_dutch_cola.py +1 -1
  51. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_eltec.py +1 -1
  52. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_fone.py +1 -1
  53. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_foqa.py +1 -1
  54. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_fosent.py +1 -1
  55. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_fquad.py +1 -1
  56. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_germanquad.py +1 -1
  57. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_germeval.py +1 -1
  58. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_hellaswag.py +1 -1
  59. euroeval-15.7.1/src/scripts/create_hellaswag_fi.py +274 -0
  60. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_hotter_and_colder_sentiment.py +1 -1
  61. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_ice_linguistic.py +1 -1
  62. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_icelandic_error_corpus.py +1 -1
  63. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_icelandic_knowledge.py +1 -1
  64. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_icelandic_qa.py +1 -1
  65. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_icesum.py +1 -1
  66. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_ilpost_sum.py +1 -1
  67. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_jentoft.py +1 -1
  68. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_mlsum_de.py +1 -1
  69. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_mlsum_es.py +1 -1
  70. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_mmlu.py +1 -1
  71. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_multinerd-it.py +1 -1
  72. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_no_cola.py +1 -1
  73. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_no_sammendrag.py +1 -1
  74. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_nor_common_sense_qa.py +1 -1
  75. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_nordjylland_news.py +1 -1
  76. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_norglm_multisum.py +1 -1
  77. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_norne.py +1 -1
  78. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_norquad.py +1 -1
  79. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_nqii.py +1 -1
  80. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_nrk_quiz_qa.py +1 -1
  81. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_orange_sum.py +1 -1
  82. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_personal_sum.py +1 -1
  83. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_rrn.py +1 -1
  84. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_sb10k.py +1 -1
  85. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_scala.py +3 -1
  86. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_scandiqa.py +1 -1
  87. euroeval-15.7.1/src/scripts/create_scandisent_fi.py +93 -0
  88. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_schibsted.py +1 -1
  89. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_sentiment_headlines_es.py +1 -1
  90. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_sentipolc16.py +1 -1
  91. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_squad.py +1 -1
  92. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_squad_it.py +1 -1
  93. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_squad_nl.py +1 -1
  94. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_squad_nl_old.py +1 -1
  95. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_sst5.py +1 -1
  96. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_suc3.py +1 -1
  97. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_swedn.py +1 -1
  98. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_swerec.py +1 -1
  99. euroeval-15.7.1/src/scripts/create_turku_ner_fi.py +117 -0
  100. euroeval-15.7.1/src/scripts/create_tydiqa_fi.py +118 -0
  101. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_wiki_lingua_nl.py +1 -1
  102. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_winogrande_is.py +1 -1
  103. euroeval-15.7.1/src/scripts/create_xlsum_fi.py +78 -0
  104. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/load_ud_pos.py +18 -0
  105. euroeval-15.7.1/tests/test_data_loading.py +120 -0
  106. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_scores.py +1 -0
  107. {euroeval-15.6.1 → euroeval-15.7.1}/uv.lock +2726 -2726
  108. euroeval-15.6.1/src/scripts/create_dutch_social.py +0 -114
  109. euroeval-15.6.1/tests/test_data_loading.py +0 -51
  110. {euroeval-15.6.1 → euroeval-15.7.1}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
  111. {euroeval-15.6.1 → euroeval-15.7.1}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
  112. {euroeval-15.6.1 → euroeval-15.7.1}/CITATION.cff +0 -0
  113. {euroeval-15.6.1 → euroeval-15.7.1}/CODE_OF_CONDUCT.md +0 -0
  114. {euroeval-15.6.1 → euroeval-15.7.1}/Dockerfile.cuda +0 -0
  115. {euroeval-15.6.1 → euroeval-15.7.1}/LICENSE +0 -0
  116. {euroeval-15.6.1 → euroeval-15.7.1}/docs/CNAME +0 -0
  117. {euroeval-15.6.1 → euroeval-15.7.1}/docs/README.md +0 -0
  118. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/README.md +0 -0
  119. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/danish.md +0 -0
  120. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/english.md +0 -0
  121. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/faroese.md +0 -0
  122. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/french.md +0 -0
  123. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/german.md +0 -0
  124. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/icelandic.md +0 -0
  125. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/italian.md +0 -0
  126. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/norwegian.md +0 -0
  127. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/spanish.md +0 -0
  128. {euroeval-15.6.1 → euroeval-15.7.1}/docs/datasets/swedish.md +0 -0
  129. {euroeval-15.6.1 → euroeval-15.7.1}/docs/extras/radial_plotter.md +0 -0
  130. {euroeval-15.6.1 → euroeval-15.7.1}/docs/faq.md +0 -0
  131. {euroeval-15.6.1 → euroeval-15.7.1}/docs/gfx/favicon.png +0 -0
  132. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Monolingual/danish.md +0 -0
  133. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Monolingual/dutch.md +0 -0
  134. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Monolingual/english.md +0 -0
  135. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Monolingual/faroese.md +0 -0
  136. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Monolingual/french.md +0 -0
  137. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Monolingual/german.md +0 -0
  138. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Monolingual/icelandic.md +0 -0
  139. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Monolingual/italian.md +0 -0
  140. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Monolingual/norwegian.md +0 -0
  141. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Monolingual/swedish.md +0 -0
  142. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Multilingual/european.md +0 -0
  143. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Multilingual/germanic.md +0 -0
  144. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
  145. {euroeval-15.6.1 → euroeval-15.7.1}/docs/leaderboards/README.md +0 -0
  146. {euroeval-15.6.1 → euroeval-15.7.1}/docs/methodology.md +0 -0
  147. {euroeval-15.6.1 → euroeval-15.7.1}/docs/python-package.md +0 -0
  148. {euroeval-15.6.1 → euroeval-15.7.1}/docs/tasks/README.md +0 -0
  149. {euroeval-15.6.1 → euroeval-15.7.1}/docs/tasks/common-sense-reasoning.md +0 -0
  150. {euroeval-15.6.1 → euroeval-15.7.1}/docs/tasks/knowledge.md +0 -0
  151. {euroeval-15.6.1 → euroeval-15.7.1}/docs/tasks/linguistic-acceptability.md +0 -0
  152. {euroeval-15.6.1 → euroeval-15.7.1}/docs/tasks/named-entity-recognition.md +0 -0
  153. {euroeval-15.6.1 → euroeval-15.7.1}/docs/tasks/reading-comprehension.md +0 -0
  154. {euroeval-15.6.1 → euroeval-15.7.1}/docs/tasks/sentiment-classification.md +0 -0
  155. {euroeval-15.6.1 → euroeval-15.7.1}/docs/tasks/speed.md +0 -0
  156. {euroeval-15.6.1 → euroeval-15.7.1}/docs/tasks/summarization.md +0 -0
  157. {euroeval-15.6.1 → euroeval-15.7.1}/gfx/euroeval.png +0 -0
  158. {euroeval-15.6.1 → euroeval-15.7.1}/gfx/euroeval.xcf +0 -0
  159. {euroeval-15.6.1 → euroeval-15.7.1}/gfx/scandeval.png +0 -0
  160. {euroeval-15.6.1 → euroeval-15.7.1}/makefile +0 -0
  161. {euroeval-15.6.1 → euroeval-15.7.1}/mkdocs.yaml +0 -0
  162. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/__init__.py +0 -0
  163. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/benchmark_config_factory.py +0 -0
  164. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/benchmark_modules/__init__.py +0 -0
  165. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/benchmark_modules/base.py +0 -0
  166. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/benchmark_modules/fresh.py +0 -0
  167. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/benchmark_modules/hf.py +0 -0
  168. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/callbacks.py +0 -0
  169. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/cli.py +0 -0
  170. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/__init__.py +0 -0
  171. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/danish.py +0 -0
  172. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/english.py +0 -0
  173. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/faroese.py +0 -0
  174. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/french.py +0 -0
  175. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/german.py +0 -0
  176. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/icelandic.py +0 -0
  177. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/italian.py +0 -0
  178. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/norwegian.py +0 -0
  179. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/spanish.py +0 -0
  180. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/dataset_configs/swedish.py +0 -0
  181. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/enums.py +0 -0
  182. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/exceptions.py +0 -0
  183. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/finetuning.py +0 -0
  184. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/generation.py +0 -0
  185. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/human_evaluation.py +0 -0
  186. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/languages.py +0 -0
  187. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/model_cache.py +0 -0
  188. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/model_config.py +0 -0
  189. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/model_loading.py +0 -0
  190. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/prompt_templates/__init__.py +0 -0
  191. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/speed_benchmark.py +0 -0
  192. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/task_group_utils/__init__.py +0 -0
  193. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/task_group_utils/multiple_choice_classification.py +0 -0
  194. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/task_group_utils/question_answering.py +0 -0
  195. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/task_group_utils/token_classification.py +0 -0
  196. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/types.py +0 -0
  197. {euroeval-15.6.1 → euroeval-15.7.1}/src/euroeval/utils.py +0 -0
  198. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/constants.py +0 -0
  199. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_angry_tweets.py +0 -0
  200. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_mim_gold_ner.py +0 -0
  201. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_mlqa_es.py +0 -0
  202. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_norec.py +0 -0
  203. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_norglm_multiqa.py +0 -0
  204. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_wikiann_fo.py +0 -0
  205. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_wikineural-it.py +0 -0
  206. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/create_xquad_es.py +0 -0
  207. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/fix_dot_env_file.py +0 -0
  208. {euroeval-15.6.1 → euroeval-15.7.1}/src/scripts/versioning.py +0 -0
  209. {euroeval-15.6.1 → euroeval-15.7.1}/tests/__init__.py +0 -0
  210. {euroeval-15.6.1 → euroeval-15.7.1}/tests/conftest.py +0 -0
  211. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_benchmark_config_factory.py +0 -0
  212. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_benchmark_modules/__init__.py +0 -0
  213. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_benchmark_modules/test_base.py +0 -0
  214. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_benchmark_modules/test_fresh.py +0 -0
  215. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_benchmark_modules/test_hf.py +0 -0
  216. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_benchmark_modules/test_litellm.py +0 -0
  217. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_benchmark_modules/test_vllm.py +0 -0
  218. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_benchmarker.py +0 -0
  219. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_callbacks.py +0 -0
  220. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_cli.py +0 -0
  221. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_constants.py +0 -0
  222. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_data_models.py +0 -0
  223. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_dataset_configs.py +0 -0
  224. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_enums.py +0 -0
  225. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_exceptions.py +0 -0
  226. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_finetuning.py +0 -0
  227. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_generation.py +0 -0
  228. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_human_evaluation.py +0 -0
  229. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_languages.py +0 -0
  230. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_model_cache.py +0 -0
  231. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_model_config.py +0 -0
  232. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_model_loading.py +0 -0
  233. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_speed_benchmark.py +0 -0
  234. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_task_utils/__init__.py +0 -0
  235. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_task_utils/test_question_answering.py +0 -0
  236. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_task_utils/test_sequence_classification.py +0 -0
  237. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_task_utils/test_text_to_text.py +0 -0
  238. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_task_utils/test_token_classification.py +0 -0
  239. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_tasks.py +0 -0
  240. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_tokenization_utils.py +0 -0
  241. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_types.py +0 -0
  242. {euroeval-15.6.1 → euroeval-15.7.1}/tests/test_utils.py +0 -0
.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml:
@@ -26,6 +26,7 @@ body:
  - label: Dutch
  - label: English
  - label: Faroese
+ - label: Finnish
  - label: French
  - label: German
  - label: Icelandic
.github/ISSUE_TEMPLATE/model_evaluation_request.yaml:
@@ -21,6 +21,7 @@ body:
  - label: Romance languages (French, Italian, Spanish)
  - label: Scandinavian languages (Danish, Faroese, Icelandic, Norwegian, Swedish)
  - label: West Germanic languages (Dutch, English, German)
+ - label: Finnish
  validations:
  required: true
  - type: dropdown
.github/workflows/ci.yaml:
@@ -24,7 +24,10 @@ jobs:
  - uses: actions/setup-python@v5
  with:
  python-version: "3.11"
- - uses: pre-commit/action@v3.0.1
+ - run: python -m pip install pre-commit
+ shell: bash
+ - run: pre-commit run --show-diff-on-failure --color=always
+ shell: bash

  pytest-linux:
  if: github.event.pull_request.draft == false
@@ -41,8 +44,9 @@ jobs:
  persist-credentials: false

  - name: Install uv and set up Python
- uses: astral-sh/setup-uv@v4
+ uses: astral-sh/setup-uv@v5
  with:
+ enable-cache: false
  python-version: ${{ matrix.python-version }}

  - name: Install Dependencies
.gitignore:
@@ -117,5 +117,5 @@ site/
  docs/datasets/dataset_example_commands.txt

  # Various graphics
- gfx/euroeval-italian.png
- gfx/euroeval-italian.xcf
+ gfx/euroeval-*.png
+ gfx/euroeval-*.xcf
.pre-commit-config.yaml:
@@ -10,7 +10,7 @@ repos:
  - id: trailing-whitespace
  - id: debug-statements
  - repo: https://github.com/astral-sh/ruff-pre-commit
- rev: v0.11.5
+ rev: v0.11.7
  hooks:
  - id: ruff
  args:
CHANGELOG.md:
@@ -10,6 +10,71 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.



+ ## [v15.7.1] - 2025-04-29
+ ### Changed
+ - Marked the DBRD Dutch sentiment classification as official, as the quality is
+ substantially better than the previous Dutch Social.
+
+ ### Fixed
+ - Fixed an issue with NER evaluation of instruction-tuned models, which was caused by
+ the "O" label mistakenly being included in the prompt template, causing an error
+ during evaluation. No evaluations were affected by this, only that some evaluations
+ could not be run.
+
+
+ ## [v15.7.0] - 2025-04-28
+ ### Added
+ - Added support for Finnish 🇫🇮! This includes the Finnish part of the reading
+ comprehension dataset
+ [TydiQA-fi](https://huggingface.co/datasets/google-research-datasets/tydiqa/viewer/secondary_task?views%5B%5D=secondary_task_train),
+ the Finnish part of the binary sentiment classification dataset
+ [ScandiSent](https://github.com/timpal0l/ScandiSent), the linguistic acceptability
+ dataset ScaLA with the [Finnish Universal
+ Dependencies](https://github.com/UniversalDependencies/UD_Finnish-TDT), the NER
+ dataset [Turku NER](https://aclanthology.org/2020.lrec-1.567/), the summarisation
+ dataset [XL-Sum-fi](https://huggingface.co/datasets/TurkuNLP/xlsum-fi), and the
+ common-sense reasoning dataset
+ [HellaSwag-fi](https://huggingface.co/datasets/Finnish-NLP/hellaswag-fi-google-translate).
+ This was contributed by [@oliverkinch](https://github.com/oliverkinch) ✨
+ - Added metadata for GPT-4.1 and Grok-3 models.
+ - Marked Gemini-2.5-flash and Grok-3-mini as reasoning models, giving them more tokens
+ to think.
+
+ ### Changed
+ - Updated `datasets` to `>=3.5.0`, as the previous versions were incompatible with the
+ newer versions of `huggingface_hub`.
+ - Increase the number of allowed reasoning tokens from 8,192 to 32,768 for reasoning
+ models. This is done as several models did not stop reasoning before running out of
+ tokens, yielding a blank output.
+ - API models now use JSON schemas for the NER task if they support it, and if not then
+ they resort to standard JSON mode (which does not enforce a specific schema, just that
+ the output is JSON).
+
+ ### Fixed
+ - If we fail to extract labels using a generative model's logprobs, we now fall back to
+ using word edit distance between the outputted text and the labels instead of throwing
+ an error.
+ - Fixed a bug where we could not use the `thinking` parameter with `claude-3-7-sonnet`,
+ due to a typo. This has been fixed now.
+ - Now catches the error when an API model requires setting temperature to 1.0, and
+ retries the evaluation with temperature set to 1.0.
+ - When benchmarking a model with a revision (i.e., of the form `<model-id>@<revision>`),
+ we now correctly store this full model ID to the benchmark results on disk, including
+ the revision.
+ - Fixed a GPU memory error while computing the BERTScore for the summarisation task,
+ resulting in a memory crash. We have now reduced the batch size to 1 for this task,
+ making it slightly slower but more memory efficient.
+ - Disabled structured outputs and logprobs for reasoning models, to ensure that they
+ are allowed to output reasoning tokens before they output their answer.
+ - Do not supply stop sequences to API models if they do not support it.
+ - If a `SystemError` happens during LiteLLM generation then we now retry the
+ generation.
+ - Handle if a LiteLLM model does not support specifying maxItems in the JSON schema
+ during structured generation.
+ - Truncate prompts to decoder model's maximum sequence length if the model's maximum
+ sequence length is smaller than 5,000 tokens.
+
+
  ## [v15.6.1] - 2025-04-14
  ### Changed
  - Added more info about SQuAD-nl in the documentation. This was contributed by
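The v15.7.0 entry above describes API models using a JSON schema for the NER task when the provider supports structured outputs, and falling back to plain JSON mode otherwise. Below is a minimal sketch of what such a request shape could look like; the entity tag names, the `maxItems` value and the `ner_entities` schema name are illustrative assumptions, not EuroEval's actual implementation (which lives in `src/euroeval/benchmark_modules/litellm.py`).

```python
# Illustrative sketch only: the tag names and schema layout are assumptions,
# not the schema EuroEval actually sends.
ner_schema = {
    "type": "object",
    "properties": {
        tag: {
            "type": "array",
            "items": {"type": "string"},
            # Some providers reject "maxItems"; the changelog notes that such
            # models are handled separately during structured generation.
            "maxItems": 5,
        }
        for tag in ("person", "location", "organization", "miscellaneous")
    },
    "required": ["person", "location", "organization", "miscellaneous"],
    "additionalProperties": False,
}

# Schema-enforced structured output, for providers that support it.
structured_response_format = {
    "type": "json_schema",
    "json_schema": {"name": "ner_entities", "schema": ner_schema, "strict": True},
}

# Fallback: standard JSON mode, which only guarantees valid JSON and does not
# enforce any particular schema.
json_mode_response_format = {"type": "json_object"}
```

Either dictionary could then be passed as the `response_format` argument of a `litellm.completion` call, with the JSON-mode variant used when the provider rejects schema-constrained outputs.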
CONTRIBUTING.md:
@@ -14,14 +14,22 @@ issue, creating a PR, reviewing, and merging the PR.
  To get an overview of the project, read the [README](README.md). Here are some
  resources to help you get started with open source contributions:

- - [Finding ways to contribute to open source on GitHub](https://docs.github.com/en/get-started/exploring-projects-on-github/finding-ways-to-contribute-to-open-source-on-github)
+ - [Finding ways to contribute to open source on
+ GitHub](https://docs.github.com/en/get-started/exploring-projects-on-github/finding-ways-to-contribute-to-open-source-on-github)
  - [Set up Git](https://docs.github.com/en/get-started/quickstart/set-up-git)
  - [GitHub flow](https://docs.github.com/en/get-started/quickstart/github-flow)
- - [Collaborating with pull requests](https://docs.github.com/en/github/collaborating-with-pull-requests)
+ - [Collaborating with pull
+ requests](https://docs.github.com/en/github/collaborating-with-pull-requests)


  ## Getting started

+ ### Adding datasets
+
+ EuroEval welcomes contributions of new datasets that help evaluate language models
+ across European languages. A guide for adding datasets to EuroEval can be found
+ [here](NEW_DATASET_GUIDE.md).
+
  ### Issues

  #### Create a new issue
@@ -42,11 +50,17 @@ find an issue to work on, you are welcome to open a PR with a fix.

  1. Fork the repository.
  - Using GitHub Desktop:
- - [Getting started with GitHub Desktop](https://docs.github.com/en/desktop/installing-and-configuring-github-desktop/getting-started-with-github-desktop) will guide you through setting up Desktop.
- - Once Desktop is set up, you can use it to [fork the repo](https://docs.github.com/en/desktop/contributing-and-collaborating-using-github-desktop/cloning-and-forking-repositories-from-github-desktop)!
+ - [Getting started with GitHub
+ Desktop](https://docs.github.com/en/desktop/installing-and-configuring-github-desktop/getting-started-with-github-desktop)
+ will guide you through setting up Desktop.
+ - Once Desktop is set up, you can use it to [fork the
+ repo](https://docs.github.com/en/desktop/contributing-and-collaborating-using-github-desktop/cloning-and-forking-repositories-from-github-desktop)!

  - Using the command line:
- - [Fork the repo](https://docs.github.com/en/github/getting-started-with-github/fork-a-repo#fork-an-example-repository) so that you can make your changes without affecting the original project until you're ready to merge them.
+ - [Fork the
+ repo](https://docs.github.com/en/github/getting-started-with-github/fork-a-repo#fork-an-example-repository)
+ so that you can make your changes without affecting the original project until
+ you're ready to merge them.

  3. Run `make install` from within the repo to get set up
NEW_DATASET_GUIDE.md (new file):
@@ -0,0 +1,107 @@
+ # Contributing a Dataset to EuroEval
+
+ This guide will walk you through the process of contributing a new dataset to EuroEval.
+
+ For general contribution guidelines, please refer to our [Contributing Guide](CONTRIBUTING.md).
+
+ If you have any questions during this process, please open an issue on the [EuroEval GitHub repository](https://github.com/EuroEval/EuroEval/issues).
+
+
+ ## Step 0: Prerequisites
+
+ Before beginning:
+ 1. Check if your dataset matches [one of the supported tasks](https://euroeval.com/tasks/). If your dataset doesn't match any supported task, you have two options:
+    1. Try to adapt it to fit an existing task (e.g., by reformatting it or adding multiple choice options)
+    2. Open an issue on the EuroEval repository requesting to add a new task type
+ 2. If it does, [fork the EuroEval repository](https://github.com/EuroEval/EuroEval/fork) and create a new branch to work on your dataset contribution
+
+
+ ## Step 1: Create the Dataset Processing Script
+
+ Create a script in the `src/scripts` directory that processes your dataset into the EuroEval format.
+
+ The dataset creation script roughly follows this pattern:
+
+ ```python
+ # Load your dataset.
+ raw_dataset = load_dataset("path_to_your_dataset")
+
+ # Process the dataset to fit the EuroEval format.
+ dataset = process_raw_dataset(raw_dataset=raw_dataset)
+
+ # Push the dataset to the Hugging Face Hub.
+ dataset.push_to_hub("EuroEval/your_dataset_name", private=True)
+ ```
+
+ ### Tips for Dataset Processing:
+ - Examine existing scripts for datasets with the same task for a reference on how to process your dataset.
+ - Take a look at [existing datasets in your language](https://euroeval.com/datasets/) to see how these are usually set up. Study these examples to understand the expected format and structure for your own dataset's entries.
+ - Split your dataset into train / val / test sets, ideally with 1,024 / 256 / 2,048 samples, respectively
+ - If your dataset already has splits, maintain consistency (e.g., the EuroEval train split should be a subset of the original train split)
+
+
+ ## Step 2: Add Dataset Configuration
+
+ Dataset configurations in EuroEval are organised by language, with each language having its own file at `src/euroeval/dataset_configs/{language}.py`. A configuration is made with the `DatasetConfig` class. Here is an example for the fictive English knowledge dataset `Rizzler`.
+
+ ```python
+ RIZZLER_KNOWLEDGE_CONFIG = DatasetConfig(
+     name="rizzler_knowledge",  # The name of the dataset
+     pretty_name="the truncated version of the English knowledge dataset Rizzler",  # The pretty name of the dataset used in logs.
+     huggingface_id="EuroEval/rizzler_knowledge",  # The same id as used in the dataset creation script
+     task=KNOW,  # The task of the dataset
+     languages=[EN],  # The language of the dataset
+     unofficial=True,  # Whether the dataset is unofficial
+ )
+ ```
+
+ Every `src/euroeval/dataset_configs/{language}.py` file has two sections:
+ - `### Official datasets ###`
+ - `### Unofficial datasets ###`
+
+ An unofficial dataset means that the resulting evaluation will not be included in the [official leaderboard](https://euroeval.com/leaderboards/).
+
+ As a starting point, make your dataset unofficial. This can always be changed later.
+
+
+ ## Step 3: Document Your Dataset
+
+ Dataset documentation in EuroEval is organised by language, with each language having its own file at `docs/datasets/{language}.md`. Within each language file, documentation is further organised by task.
+
+ Navigate to the documentation file for your dataset's language and add your dataset's documentation in the appropriate task section.
+
+ The documentation should include the following information:
+
+ 1. **General description**: Explain the dataset's origin and purpose
+ 2. **Split details**: Describe how splits were created and their sizes
+ 3. **Example samples**: Provide 3 representative examples from the training split
+ 4. **Evaluation setup**: Explain how models are evaluated on this dataset
+ 5. **Evaluation command**: Show how to evaluate a model on your dataset
+
+ To do this, you can follow these steps:
+ 1. Find an existing dataset of the same task in `docs/datasets/{language}.md`
+ 2. Copy the entire documentation section for that dataset
+ 3. Use this as a template and modify all details to match your new dataset
+ 4. Ensure you update all dataset-specific information (description, split sizes, example samples, etc.)
+
+
+ ## Step 4: Modify the Change Log
+
+ After completing the previous steps, add an entry to the project's changelog to document your contribution. The entry should be added under the `[Unreleased]` section with a short description of the dataset you have added. Here is an example of a new entry.
+
+ ```md
+ ## [Unreleased]
+ ### Added
+ - Added the English knowledge dataset [rizzler_knowledge](https://huggingface.co/datasets/Example-User/rizzler_knowledge). The split is given by 1,024 / 256 / 2,048 samples for train / val / test, respectively. It is marked as `unofficial` for now. This was contributed by [@your_name](https://github.com/your_name) ✨
+ ```
+
+
+ ## Step 5: Make a Pull Request
+
+ When you have completed all the previous steps, create a pull request to the EuroEval repository.
+
+
+ ### Thank you!
+ This concludes the process of contributing a dataset to EuroEval. Your contribution helps expand the multilingual evaluation capabilities of the benchmark and is greatly appreciated by the research community!
+
+ Thank you for your valuable contribution! 🎉
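As a complement to Step 1 of the guide added above, here is a minimal, self-contained sketch of such a processing script, producing the recommended 1,024 / 256 / 2,048 train / val / test splits. The source dataset ID, column names and target Hub repository are placeholder assumptions; the real scripts in `src/scripts` remain the authoritative reference.

```python
# Minimal sketch of a dataset creation script (placeholder IDs and columns).
from datasets import Dataset, DatasetDict, load_dataset


def main() -> None:
    # Load the raw dataset from the Hugging Face Hub (placeholder ID).
    raw_dataset = load_dataset("example-org/raw-sentiment-dataset", split="train")

    # Keep only the columns EuroEval needs and normalise their names
    # (the source column names here are assumptions about the raw data).
    df = raw_dataset.to_pandas().rename(columns={"review": "text", "rating": "label"})
    df = df[["text", "label"]].dropna().drop_duplicates()

    # Shuffle with a fixed seed and carve out the recommended
    # 1,024 / 256 / 2,048 train / val / test splits (assumes >= 3,328 rows).
    df = df.sample(frac=1.0, random_state=4242).reset_index(drop=True)
    train_df = df.iloc[:1024]
    val_df = df.iloc[1024:1280]
    test_df = df.iloc[1280:3328]

    dataset = DatasetDict(
        train=Dataset.from_pandas(train_df, preserve_index=False),
        val=Dataset.from_pandas(val_df, preserve_index=False),
        test=Dataset.from_pandas(test_df, preserve_index=False),
    )

    # Push the processed dataset to the Hub (placeholder repo ID).
    dataset.push_to_hub("EuroEval/example-sentiment-dataset", private=True)


if __name__ == "__main__":
    main()
```

Shuffling with a fixed seed keeps the splits reproducible when the script is re-run to regenerate the Hub dataset.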
PKG-INFO:
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: EuroEval
- Version: 15.6.1
+ Version: 15.7.1
  Summary: The robust European language model benchmark.
  Project-URL: Repository, https://github.com/EuroEval/EuroEval
  Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -32,7 +32,7 @@ Requires-Python: <4.0,>=3.10
  Requires-Dist: accelerate>=0.34.2
  Requires-Dist: bert-score>=0.3.13
  Requires-Dist: click>=8.1.3
- Requires-Dist: datasets>=2.15.0
+ Requires-Dist: datasets>=3.5.0
  Requires-Dist: demjson3>=3.0.6
  Requires-Dist: evaluate>=0.4.1
  Requires-Dist: huggingface-hub>=0.30.1
@@ -239,6 +239,18 @@ A huge thank you to all the contributors who have helped make this project a suc
  <a href="https://github.com/peregilk"><img src="https://avatars.githubusercontent.com/u/9079808" width=50 alt="Contributor avatar for peregilk"/></a>
  <a href="https://github.com/Rijgersberg"><img src="https://avatars.githubusercontent.com/u/8604946" width=50 alt="Contributor avatar for Rijgersberg"/></a>

+
+ ### Contribute to EuroEval
+
+ We welcome contributions to EuroEval! Whether you're fixing bugs, adding features, or
+ contributing new datasets, your help makes this project better for everyone.
+
+ - **General contributions**: Check out our [contribution guidelines](CONTRIBUTING.md)
+ for information on how to get started.
+ - **Adding datasets**: If you're interested in adding a new dataset to EuroEval, we have
+ a [dedicated guide](NEW_DATASET_GUIDE.md) with step-by-step instructions.
+
+
  ### Special Thanks
  - Thanks to [Google](https://google.com/) for sponsoring Gemini credits as part of their
  [Google Cloud for Researchers Program](https://cloud.google.com/edu/researchers).
README.md:
@@ -163,6 +163,18 @@ A huge thank you to all the contributors who have helped make this project a suc
  <a href="https://github.com/peregilk"><img src="https://avatars.githubusercontent.com/u/9079808" width=50 alt="Contributor avatar for peregilk"/></a>
  <a href="https://github.com/Rijgersberg"><img src="https://avatars.githubusercontent.com/u/8604946" width=50 alt="Contributor avatar for Rijgersberg"/></a>

+
+ ### Contribute to EuroEval
+
+ We welcome contributions to EuroEval! Whether you're fixing bugs, adding features, or
+ contributing new datasets, your help makes this project better for everyone.
+
+ - **General contributions**: Check out our [contribution guidelines](CONTRIBUTING.md)
+ for information on how to get started.
+ - **Adding datasets**: If you're interested in adding a new dataset to EuroEval, we have
+ a [dedicated guide](NEW_DATASET_GUIDE.md) with step-by-step instructions.
+
+
  ### Special Thanks
  - Thanks to [Google](https://google.com/) for sponsoring Gemini credits as part of their
  [Google Cloud for Researchers Program](https://cloud.google.com/edu/researchers).
docs/datasets/dutch.md:
@@ -7,68 +7,7 @@ information about what these constitute.

  ## Sentiment Classification

- ### Dutch Social
-
- This dataset consists of Dutch tweets annotated with sentiment labels. It is not sure
- how the sentiment labels were assigned, this information is pending from the authors.
-
- The original full dataset consists of 162,805 / 54,269 / 54,268 samples for training,
- validation and testing, respectively (so 271,342 samples used in total). We use a 1,024
- / 256 / 1,024 split for training, validation and testing, respectively. All the new
- splits are subsets of the original splits.
-
- Here are a few examples from the training split:
-
- ```json
- {
-   "text": 'Novak Djokovic positief getest op coronavirus na eigen tennistoernooi\n\nhttps://t.co/U7VOcjANh9',
-   "label": 'positive'
- }
- ```
- ```json
- {
-   "text": "via @NYTimes https://t.co/IjbCWIwYvR",
-   "label": "neutral"
- }
- ```
- ```json
- {
-   "text": "@backinflow 30 min Corona tijd....",
-   "label": "negative"
- }
- ```
-
- When evaluating generative models, we use the following setup (see the
- [methodology](/methodology) for more information on how these are used):
-
- - Number of few-shot examples: 12
- - Prefix prompt:
-   ```
-   Hieronder staan tweets en hun sentiment, dat 'positief', 'neutraal' of 'negatief' kan zijn.
-   ```
- - Base prompt template:
-   ```
-   Tweet: {text}
-   Sentiment: {label}
-   ```
- - Instruction-tuned prompt template:
-   ```
-   Tweet: {text}
-
-   Classificeer het sentiment in de tweet. Antwoord met 'positief', 'neutraal' of 'negatief'.
-   ```
- - Label mapping:
-   - `positive` ➡️ `positief`
-   - `neutral` ➡️ `neutraal`
-   - `negative` ➡️ `negatief`
-
- You can evaluate this dataset directly as follows:
-
- ```bash
- $ euroeval --model <model-id> --dataset dutch-social
- ```
-
- ### Unofficial: DBRD
+ ### DBRD

  This dataset was published in [this paper](https://doi.org/10.48550/arXiv.1910.00896)
  and features Dutch book reviews from [Hebban.nl](https://www.hebban.nl), annotated with