EuroEval 15.3.0.tar.gz → 15.4.0.tar.gz

This diff shows the content of publicly released package versions as they appear in their respective public registries, and is provided for informational purposes only.

Potentially problematic release: this version of EuroEval might be problematic.

Files changed (211)
  1. {euroeval-15.3.0 → euroeval-15.4.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +2 -2
  2. {euroeval-15.3.0 → euroeval-15.4.0}/.github/ISSUE_TEMPLATE/bug.yaml +5 -5
  3. {euroeval-15.3.0 → euroeval-15.4.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +1 -1
  4. {euroeval-15.3.0 → euroeval-15.4.0}/.github/workflows/ci.yaml +10 -4
  5. {euroeval-15.3.0 → euroeval-15.4.0}/.pre-commit-config.yaml +2 -1
  6. {euroeval-15.3.0 → euroeval-15.4.0}/CHANGELOG.md +57 -0
  7. {euroeval-15.3.0 → euroeval-15.4.0}/PKG-INFO +22 -7
  8. {euroeval-15.3.0 → euroeval-15.4.0}/README.md +13 -0
  9. {euroeval-15.3.0 → euroeval-15.4.0}/docs/datasets/dutch.md +1 -1
  10. {euroeval-15.3.0 → euroeval-15.4.0}/docs/datasets/faroese.md +2 -2
  11. {euroeval-15.3.0 → euroeval-15.4.0}/docs/datasets/german.md +2 -2
  12. {euroeval-15.3.0 → euroeval-15.4.0}/docs/datasets/icelandic.md +1 -1
  13. euroeval-15.4.0/docs/datasets/spanish.md +529 -0
  14. {euroeval-15.3.0 → euroeval-15.4.0}/docs/leaderboards/Monolingual/danish.md +2 -2
  15. {euroeval-15.3.0 → euroeval-15.4.0}/docs/leaderboards/Monolingual/dutch.md +2 -2
  16. {euroeval-15.3.0 → euroeval-15.4.0}/docs/leaderboards/Monolingual/english.md +2 -2
  17. {euroeval-15.3.0 → euroeval-15.4.0}/docs/leaderboards/Monolingual/faroese.md +1 -2
  18. {euroeval-15.3.0 → euroeval-15.4.0}/docs/leaderboards/Monolingual/french.md +2 -2
  19. {euroeval-15.3.0 → euroeval-15.4.0}/docs/leaderboards/Monolingual/german.md +2 -2
  20. {euroeval-15.3.0 → euroeval-15.4.0}/docs/leaderboards/Monolingual/icelandic.md +2 -2
  21. euroeval-15.4.0/docs/leaderboards/Monolingual/italian.md +15 -0
  22. {euroeval-15.3.0 → euroeval-15.4.0}/docs/leaderboards/Monolingual/norwegian.md +2 -2
  23. {euroeval-15.3.0 → euroeval-15.4.0}/docs/leaderboards/Monolingual/swedish.md +2 -2
  24. {euroeval-15.3.0 → euroeval-15.4.0}/docs/leaderboards/Multilingual/european.md +2 -2
  25. {euroeval-15.3.0 → euroeval-15.4.0}/docs/leaderboards/Multilingual/germanic.md +2 -2
  26. {euroeval-15.3.0 → euroeval-15.4.0}/docs/leaderboards/Multilingual/mainland-scandinavian.md +2 -2
  27. euroeval-15.4.0/docs/leaderboards/Multilingual/romance.md +15 -0
  28. {euroeval-15.3.0 → euroeval-15.4.0}/pyproject.toml +9 -7
  29. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/__init__.py +11 -0
  30. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/benchmark_config_factory.py +2 -2
  31. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/benchmark_modules/hf.py +2 -3
  32. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/benchmark_modules/litellm.py +124 -2
  33. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/benchmark_modules/vllm.py +33 -13
  34. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/benchmarker.py +12 -14
  35. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/constants.py +7 -1
  36. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/data_loading.py +10 -3
  37. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/dataset_configs.py +172 -1
  38. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/task_utils/token_classification.py +3 -9
  39. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/utils.py +1 -0
  40. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_allocine.py +33 -3
  41. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_arc.py +10 -0
  42. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_arc_is.py +10 -0
  43. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_belebele.py +11 -0
  44. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_cnn_dailymail.py +10 -0
  45. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_conll_en.py +10 -0
  46. euroeval-15.4.0/src/scripts/create_conll_es.py +115 -0
  47. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_conll_nl.py +10 -0
  48. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_dane.py +10 -0
  49. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_danish_citizen_tests.py +11 -0
  50. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_dansk.py +10 -0
  51. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_danske_talemaader.py +11 -0
  52. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_danske_talemaader_old.py +11 -0
  53. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_dbrd.py +33 -3
  54. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_dutch_cola.py +33 -3
  55. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_dutch_social.py +33 -3
  56. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_eltec.py +14 -1
  57. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_fone.py +10 -0
  58. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_foqa.py +10 -0
  59. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_fosent.py +33 -4
  60. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_fquad.py +11 -0
  61. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_germanquad.py +10 -0
  62. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_germeval.py +10 -0
  63. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_hellaswag.py +12 -0
  64. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_hotter_and_colder_sentiment.py +36 -3
  65. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_ice_linguistic.py +37 -3
  66. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_icelandic_error_corpus.py +40 -6
  67. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_icelandic_knowledge.py +14 -0
  68. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_icelandic_qa.py +12 -0
  69. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_icesum.py +10 -0
  70. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_ilpost_sum.py +10 -0
  71. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_jentoft.py +38 -4
  72. euroeval-15.4.0/src/scripts/create_mlqa_es.py +74 -0
  73. euroeval-15.3.0/src/scripts/create_mlsum.py → euroeval-15.4.0/src/scripts/create_mlsum_de.py +13 -3
  74. euroeval-15.4.0/src/scripts/create_mlsum_es.py +84 -0
  75. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_mmlu.py +12 -0
  76. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_multinerd-it.py +10 -0
  77. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_no_cola.py +38 -3
  78. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_no_sammendrag.py +10 -0
  79. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_nor_common_sense_qa.py +10 -0
  80. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_nordjylland_news.py +10 -0
  81. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_norglm_multiqa.py +12 -0
  82. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_norglm_multisum.py +10 -0
  83. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_norne.py +11 -0
  84. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_norquad.py +10 -0
  85. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_nqii.py +10 -0
  86. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_nrk_quiz_qa.py +11 -0
  87. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_orange_sum.py +10 -0
  88. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_personal_sum.py +10 -0
  89. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_rrn.py +10 -0
  90. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_sb10k.py +33 -3
  91. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_scala.py +43 -4
  92. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_scandiqa.py +10 -0
  93. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_schibsted.py +10 -0
  94. euroeval-15.3.0/src/scripts/create_sentipolc16.py → euroeval-15.4.0/src/scripts/create_sentiment_headlines_es.py +24 -9
  95. euroeval-15.4.0/src/scripts/create_sentipolc16.py +106 -0
  96. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_squad.py +10 -0
  97. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_squad_it.py +10 -0
  98. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_squad_nl.py +10 -0
  99. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_squad_nl_old.py +10 -0
  100. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_sst5.py +30 -0
  101. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_suc3.py +11 -0
  102. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_swedn.py +10 -0
  103. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_swerec.py +12 -0
  104. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_wiki_lingua_nl.py +10 -0
  105. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_wikineural-it.py +10 -0
  106. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_winogrande_is.py +11 -0
  107. euroeval-15.4.0/src/scripts/create_xquad_es.py +80 -0
  108. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/fix_dot_env_file.py +5 -0
  109. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/load_ud_pos.py +26 -0
  110. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/versioning.py +6 -0
  111. {euroeval-15.3.0 → euroeval-15.4.0}/tests/conftest.py +6 -0
  112. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_benchmark_config_factory.py +8 -4
  113. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_benchmarker.py +16 -2
  114. {euroeval-15.3.0 → euroeval-15.4.0}/uv.lock +816 -453
  115. {euroeval-15.3.0 → euroeval-15.4.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +0 -0
  116. {euroeval-15.3.0 → euroeval-15.4.0}/.gitignore +0 -0
  117. {euroeval-15.3.0 → euroeval-15.4.0}/CITATION.cff +0 -0
  118. {euroeval-15.3.0 → euroeval-15.4.0}/CODE_OF_CONDUCT.md +0 -0
  119. {euroeval-15.3.0 → euroeval-15.4.0}/CONTRIBUTING.md +0 -0
  120. {euroeval-15.3.0 → euroeval-15.4.0}/Dockerfile.cuda +0 -0
  121. {euroeval-15.3.0 → euroeval-15.4.0}/LICENSE +0 -0
  122. {euroeval-15.3.0 → euroeval-15.4.0}/docs/CNAME +0 -0
  123. {euroeval-15.3.0 → euroeval-15.4.0}/docs/README.md +0 -0
  124. {euroeval-15.3.0 → euroeval-15.4.0}/docs/datasets/README.md +0 -0
  125. {euroeval-15.3.0 → euroeval-15.4.0}/docs/datasets/danish.md +0 -0
  126. {euroeval-15.3.0 → euroeval-15.4.0}/docs/datasets/english.md +0 -0
  127. {euroeval-15.3.0 → euroeval-15.4.0}/docs/datasets/french.md +0 -0
  128. {euroeval-15.3.0 → euroeval-15.4.0}/docs/datasets/italian.md +0 -0
  129. {euroeval-15.3.0 → euroeval-15.4.0}/docs/datasets/norwegian.md +0 -0
  130. {euroeval-15.3.0 → euroeval-15.4.0}/docs/datasets/swedish.md +0 -0
  131. {euroeval-15.3.0 → euroeval-15.4.0}/docs/extras/radial_plotter.md +0 -0
  132. {euroeval-15.3.0 → euroeval-15.4.0}/docs/faq.md +0 -0
  133. {euroeval-15.3.0 → euroeval-15.4.0}/docs/gfx/favicon.png +0 -0
  134. {euroeval-15.3.0 → euroeval-15.4.0}/docs/leaderboards/README.md +0 -0
  135. {euroeval-15.3.0 → euroeval-15.4.0}/docs/methodology.md +0 -0
  136. {euroeval-15.3.0 → euroeval-15.4.0}/docs/python-package.md +0 -0
  137. {euroeval-15.3.0 → euroeval-15.4.0}/docs/tasks/README.md +0 -0
  138. {euroeval-15.3.0 → euroeval-15.4.0}/docs/tasks/common-sense-reasoning.md +0 -0
  139. {euroeval-15.3.0 → euroeval-15.4.0}/docs/tasks/knowledge.md +0 -0
  140. {euroeval-15.3.0 → euroeval-15.4.0}/docs/tasks/linguistic-acceptability.md +0 -0
  141. {euroeval-15.3.0 → euroeval-15.4.0}/docs/tasks/named-entity-recognition.md +0 -0
  142. {euroeval-15.3.0 → euroeval-15.4.0}/docs/tasks/reading-comprehension.md +0 -0
  143. {euroeval-15.3.0 → euroeval-15.4.0}/docs/tasks/sentiment-classification.md +0 -0
  144. {euroeval-15.3.0 → euroeval-15.4.0}/docs/tasks/speed.md +0 -0
  145. {euroeval-15.3.0 → euroeval-15.4.0}/docs/tasks/summarization.md +0 -0
  146. {euroeval-15.3.0 → euroeval-15.4.0}/gfx/euroeval.png +0 -0
  147. {euroeval-15.3.0 → euroeval-15.4.0}/gfx/euroeval.xcf +0 -0
  148. {euroeval-15.3.0 → euroeval-15.4.0}/gfx/scandeval.png +0 -0
  149. {euroeval-15.3.0 → euroeval-15.4.0}/makefile +0 -0
  150. {euroeval-15.3.0 → euroeval-15.4.0}/mkdocs.yaml +0 -0
  151. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/benchmark_modules/__init__.py +0 -0
  152. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/benchmark_modules/base.py +0 -0
  153. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/benchmark_modules/fresh.py +0 -0
  154. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/callbacks.py +0 -0
  155. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/cli.py +0 -0
  156. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/data_models.py +0 -0
  157. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/enums.py +0 -0
  158. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/exceptions.py +0 -0
  159. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/finetuning.py +0 -0
  160. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/generation.py +0 -0
  161. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/human_evaluation.py +0 -0
  162. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/languages.py +0 -0
  163. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/model_cache.py +0 -0
  164. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/model_config.py +0 -0
  165. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/model_loading.py +0 -0
  166. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/scores.py +0 -0
  167. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/speed_benchmark.py +0 -0
  168. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/task_utils/__init__.py +0 -0
  169. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/task_utils/multiple_choice_classification.py +0 -0
  170. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/task_utils/question_answering.py +0 -0
  171. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/task_utils/sequence_classification.py +0 -0
  172. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/task_utils/text_to_text.py +0 -0
  173. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/tasks.py +0 -0
  174. {euroeval-15.3.0 → euroeval-15.4.0}/src/euroeval/types.py +0 -0
  175. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/constants.py +0 -0
  176. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_angry_tweets.py +0 -0
  177. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_mim_gold_ner.py +0 -0
  178. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_norec.py +0 -0
  179. {euroeval-15.3.0 → euroeval-15.4.0}/src/scripts/create_wikiann_fo.py +0 -0
  180. {euroeval-15.3.0 → euroeval-15.4.0}/tests/__init__.py +0 -0
  181. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_benchmark_modules/__init__.py +0 -0
  182. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_benchmark_modules/test_base.py +0 -0
  183. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_benchmark_modules/test_fresh.py +0 -0
  184. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_benchmark_modules/test_hf.py +0 -0
  185. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_benchmark_modules/test_litellm.py +0 -0
  186. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_benchmark_modules/test_vllm.py +0 -0
  187. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_callbacks.py +0 -0
  188. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_cli.py +0 -0
  189. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_constants.py +0 -0
  190. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_data_loading.py +0 -0
  191. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_data_models.py +0 -0
  192. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_dataset_configs.py +0 -0
  193. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_enums.py +0 -0
  194. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_exceptions.py +0 -0
  195. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_finetuning.py +0 -0
  196. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_generation.py +0 -0
  197. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_human_evaluation.py +0 -0
  198. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_languages.py +0 -0
  199. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_model_cache.py +0 -0
  200. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_model_config.py +0 -0
  201. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_model_loading.py +0 -0
  202. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_scores.py +0 -0
  203. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_speed_benchmark.py +0 -0
  204. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_task_utils/__init__.py +0 -0
  205. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_task_utils/test_question_answering.py +0 -0
  206. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_task_utils/test_sequence_classification.py +0 -0
  207. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_task_utils/test_text_to_text.py +0 -0
  208. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_task_utils/test_token_classification.py +0 -0
  209. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_tasks.py +0 -0
  210. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_types.py +0 -0
  211. {euroeval-15.3.0 → euroeval-15.4.0}/tests/test_utils.py +0 -0

{euroeval-15.3.0 → euroeval-15.4.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml

@@ -1,5 +1,5 @@
  name: 📚 Benchmark Dataset Request
- description: Do you think a particular benchmark dataset is missing in ScandEval?
+ description: Do you think a particular benchmark dataset is missing in EuroEval?
  title: "[BENCHMARK DATASET REQUEST] <dataset-name>"
  labels: "benchmark dataset request"

@@ -36,7 +36,7 @@ body:
  - type: textarea
  attributes:
  label: Describe the dataset
- description: Describe what the dataset is measuring, and why you think it is important to include it as a benchmark dataset in ScandEval.
+ description: Describe what the dataset is measuring, and why you think it is important to include it as a benchmark dataset in EuroEval.
  validations:
  required: true
  - type: markdown

{euroeval-15.3.0 → euroeval-15.4.0}/.github/ISSUE_TEMPLATE/bug.yaml

@@ -1,5 +1,5 @@
  name: 🐛 Bug Report
- description: Have you experienced a bug using the `scandeval` package?
+ description: Have you experienced a bug using the `euroeval` package?
  title: "[BUG] <name-of-bug>"
  labels: bug

@@ -7,7 +7,7 @@ body:
  - type: markdown
  attributes:
  value: >
- #### Before submitting a bug, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/Scandeval/ScandEval/issues?q=is%3Aissue).
+ #### Before submitting a bug, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/EuroEval/EuroEval/issues?q=is%3Aissue).
  - type: textarea
  attributes:
  label: 🐛 Describe the bug

@@ -52,9 +52,9 @@ body:
  required: true
  - type: input
  attributes:
- label: ScandEval version
- description: What version of ScandEval are you using?
- placeholder: Output of `pip list | grep ScandEval`
+ label: EuroEval version
+ description: What version of EuroEval are you using?
+ placeholder: Output of `pip list | grep EuroEval`
  validations:
  required: true
  - type: markdown

{euroeval-15.3.0 → euroeval-15.4.0}/.github/ISSUE_TEMPLATE/feature_request.yaml

@@ -1,5 +1,5 @@
  name: 🚀 Feature Request
- description: Is the ScandEval benchmark missing a feature?
+ description: Is the EuroEval benchmark missing a feature?
  title: "[FEATURE REQUEST] <name-of-feature>"
  labels: enhancement

{euroeval-15.3.0 → euroeval-15.4.0}/.github/workflows/ci.yaml

@@ -49,6 +49,9 @@ jobs:
  - name: Install Dependencies
  run: uv sync --no-dev --extra test

+ - name: Start Ollama server
+ run: curl -fsSL https://ollama.com/install.sh | sh
+
  - name: Test with pytest
  run: uv run pytest
  env:

@@ -57,8 +60,8 @@ jobs:
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

- - name: Delete ScandEval cache
- run: rm -rf .scandeval_cache
+ - name: Delete EuroEval cache
+ run: rm -rf .euroeval_cache

  pytest-macos:
  if: github.event.pull_request.draft == false && contains(github.event.pull_request.labels.*.name, 'macos')

@@ -78,6 +81,9 @@ jobs:
  - name: Install Dependencies
  run: uv sync --no-dev --extra test

+ - name: Start Ollama server
+ run: curl -fsSL https://ollama.com/install.sh | sh
+
  - name: Test with pytest
  run: uv run pytest
  env:

@@ -86,5 +92,5 @@
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

- - name: Delete ScandEval cache
- run: rm -rf .scandeval_cache
+ - name: Delete EuroEval cache
+ run: rm -rf .euroeval_cache

{euroeval-15.3.0 → euroeval-15.4.0}/.pre-commit-config.yaml

@@ -10,11 +10,12 @@ repos:
  - id: trailing-whitespace
  - id: debug-statements
  - repo: https://github.com/astral-sh/ruff-pre-commit
- rev: v0.9.10
+ rev: v0.11.2
  hooks:
  - id: ruff
  args:
  - --fix
+ - --unsafe-fixes
  - --exit-non-zero-on-fix
  types_or:
  - python

{euroeval-15.3.0 → euroeval-15.4.0}/CHANGELOG.md

@@ -10,6 +10,58 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.


+ ## [v15.4.0] - 2025-03-24
+ ### Added
+ - Added support for Spanish! 🇪🇸 This includes the reading comprehension datasets
+ [XQuAD-es](https://huggingface.co/datasets/google/xquad/viewer/xquad.es) and
+ [MLQA-es](https://huggingface.co/datasets/facebook/mlqa/viewer/mlqa.es.es), the
+ sentiment classification dataset
+ [SentimentHeadlines-es](https://huggingface.co/datasets/pysentimiento/spanish-targeted-sentiment-headlines),
+ the linguistic acceptability dataset ScaLA with the [Spanish Universal
+ Dependencies](https://github.com/UniversalDependencies/UD_Spanish-AnCora), the
+ summarization dataset [MLSum-es](https://huggingface.co/datasets/reciTAL/mlsum),
+ the knowledge dataset [MMLU-es](https://hf.co/datasets/alexandrainst/m_mmlu), the
+ common-sense reasoning dataset
+ [HellaSwag-es](https://hf.co/datasets/alexandrainst/m_hellaswag), and the
+ named entity recognition dataset [CoNLL-es](https://aclanthology.org/W02-2024/). This
+ was contributed by [@oliverkinch](https://github.com/oliverkinch) ✨
+ - Now extracts the number of parameters and context length for Ollama models, using the
+ `ollama` package. Vocabulary size is currently not available in the `ollama`
+ package, so this is not extracted for Ollama models. For this reason, the `ollama`
+ package has been added to the core dependencies, as it is very small (~10 KB).
+ - Now downloads Ollama models when evaluating them.
+
+ ### Fixed
+ - When models output nested JSON dictionaries and structured generation isn't available,
+ we use the inner-most dictionary. This caused issues with Anthropic models, since they
+ do not support structured generation and their output is always {"input": actual
+ dictionary}. This has been fixed now.
+ - Now handles `ReadTimeout`s when loading datasets, rather than aborting evaluations.
+ - Benchmark configurations specified when calling `Benchmarker.benchmark` did not
+ properly override the default configurations set during initialisation when
+ benchmarking generative models. This has been fixed now.
+ - Now sets the `VLLM_WORKER_MULTIPROC_METHOD` environment variable to `spawn`, to avoid
+ a `RuntimeError` when using newer versions of vLLM with multiple GPUs.
+ - Now also detects reasoning tokens specified in the prompt rather than in the
+ completion, which is for instance the case for the QwQ reasoning model.
+ - Now recognises models with the pipeline tags `image-text-to-text`,
+ `audio-text-to-text` and `video-text-to-text` as generative models, which mistakenly
+ were detected as encoder models before.
+
+ ### Changed
+ - Update `vllm` to `>=0.8.0`, `transformers` to `>=4.50.0` and `torch` to `>=2.6.0`.
+ - Moved the `demjson3` dependency from the `generative` extra to the main dependencies,
+ to allow benchmarking API-based models without any extras.
+ - Now does not include the speed benchmark by default, as it is not used in the official
+ leaderboards. It can still be used by including `--task speed` when benchmarking a
+ model, or by using the `task` argument if using the `Benchmarker` API.
+ - Do not use sliding window sizes as candidates for maximum context length anymore, as
+ this is no longer needed.
+
+
+ ## [v15.3.1] - 2025-03-13
+ ### Fixed
+ - Now handles `ConnectionError`s when loading datasets, rather than aborting evaluations.
+
+
  ## [v15.3.0] - 2025-03-12
  ### Added
  - Added support for evaluating Italian 🇮🇹! This includes the reading comprehension
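
To illustrate the new Ollama handling described in the changelog above (pulling the model and reading its number of parameters and context length via the `ollama` package), here is a minimal sketch. It is an illustration only, not EuroEval's actual implementation: the model name is hypothetical, the metadata key names vary by model architecture, and a local Ollama server must already be running.

```python
# Illustrative sketch only -- not EuroEval's actual code.
# Assumes the `ollama` package is installed and a local Ollama server is running.
import ollama

model_id = "llama3.2"  # hypothetical model name

ollama.pull(model_id)  # ensure the model is downloaded locally
resp = ollama.show(model_id)

# GGUF-style metadata; the attribute name may differ between `ollama` versions,
# so fall back to an empty dict if it is not present.
metadata = getattr(resp, "modelinfo", None) or {}

# Key names are assumptions and depend on the model family
# (e.g. "llama.context_length" for Llama-based models).
num_params = metadata.get("general.parameter_count")
context_length = next(
    (value for key, value in metadata.items() if key.endswith(".context_length")),
    None,
)
print(f"{model_id}: parameters={num_params}, context length={context_length}")
```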
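The "inner-most dictionary" fix for models without structured generation could look roughly like the helper below. This is a hedged sketch of the idea (the function name and exact unwrapping rule are assumptions), not the code that ships in EuroEval.

```python
# Minimal sketch (assumption, not the actual EuroEval implementation) of
# unwrapping single-key wrapper dicts such as {"input": {...}} down to the
# inner-most dictionary.
import json


def innermost_dict(obj: dict) -> dict:
    """Drill into single-key dicts whose only value is itself a dict."""
    while isinstance(obj, dict) and len(obj) == 1:
        (value,) = obj.values()
        if not isinstance(value, dict):
            break
        obj = value
    return obj


raw_output = '{"input": {"persons": ["John"], "locations": ["Paris"]}}'
print(innermost_dict(json.loads(raw_output)))
# -> {'persons': ['John'], 'locations': ['Paris']}
```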
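Likewise, the `VLLM_WORKER_MULTIPROC_METHOD` fix amounts to setting the environment variable before vLLM spins up its worker processes; a minimal sketch follows (where exactly EuroEval sets it is an assumption).

```python
# Minimal sketch: force vLLM's worker multiprocessing method to "spawn" to
# avoid the multi-GPU RuntimeError mentioned in the changelog. The variable
# must be set before vLLM is imported and the engine is created.
import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

# from vllm import LLM  # heavyweight import, shown only for context
```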
@@ -80,12 +132,14 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.


  ## [v15.1.0] - 2025-02-12
+
  ### Added
  - Added new `--only-allow-safetensors` flag, which disallows evaluating models from the
  Hugging Face Hub if they are not stored as safetensors. This ensures a high level of
  security on the system running the evaluations, if this is necessary. This was
  contributed by [@Mikeriess](https://github.com/Mikeriess) ✨

+
  ### Fixed
  - Regex mismatch caused the wrong sequence length for GPT-4o models. This has been fixed
  now.

@@ -99,6 +153,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.


  ## [v15.0.0] - 2025-02-02
+
  ### Added
  - Added support for evaluating generative reasoning models, such as OpenAI o1 and
  Deepseek R1. This is done by upping the maximal sequence length to 8,192 tokens, and

@@ -145,6 +200,8 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.


  ## [v14.4.0] - 2025-01-22
+
+ ### Added
  - Added support for French! 🇫🇷 This includes the sentiment classification dataset
  [Allocine](https://hf.co/datasets/tblard/allocine), the linguistic acceptability
  dataset ScaLA with the [French Universal

{euroeval-15.3.0 → euroeval-15.4.0}/PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: EuroEval
- Version: 15.3.0
+ Version: 15.4.0
  Summary: The robust European language model benchmark.
  Project-URL: Repository, https://github.com/EuroEval/EuroEval
  Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues

@@ -33,12 +33,14 @@ Requires-Dist: accelerate>=0.34.2
  Requires-Dist: bert-score>=0.3.13
  Requires-Dist: click>=8.1.3
  Requires-Dist: datasets>=2.15.0
+ Requires-Dist: demjson3>=3.0.6
  Requires-Dist: evaluate>=0.4.1
  Requires-Dist: huggingface-hub>=0.24.0
  Requires-Dist: levenshtein>=0.24.0
  Requires-Dist: litellm>=1.61.13
  Requires-Dist: more-itertools>=10.5.0
  Requires-Dist: numpy<2.0.0,>=1.23.0
+ Requires-Dist: ollama>=0.4.7
  Requires-Dist: pandas>=2.2.0
  Requires-Dist: protobuf~=3.20.0
  Requires-Dist: pydantic>=2.6.0

@@ -52,19 +54,19 @@ Requires-Dist: seqeval>=1.2.2
  Requires-Dist: setuptools>=75.8.2
  Requires-Dist: tenacity>=9.0.0
  Requires-Dist: termcolor>=2.0.0
- Requires-Dist: torch>=2.3.0
- Requires-Dist: transformers>=4.47.0
+ Requires-Dist: torch>=2.6.0
+ Requires-Dist: transformers>=4.50.0
  Provides-Extra: all
  Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'all'
- Requires-Dist: demjson3>=3.0.6; extra == 'all'
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
  Requires-Dist: gradio>=4.26.0; extra == 'all'
- Requires-Dist: vllm<0.6.5,>=0.6.3; (platform_system == 'Linux') and extra == 'all'
+ Requires-Dist: outlines>=0.1.11; extra == 'all'
+ Requires-Dist: vllm>=0.8.0; (platform_system == 'Linux') and extra == 'all'
  Provides-Extra: generative
  Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
- Requires-Dist: demjson3>=3.0.6; extra == 'generative'
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
- Requires-Dist: vllm<0.6.5,>=0.6.3; (platform_system == 'Linux') and extra == 'generative'
+ Requires-Dist: outlines>=0.1.11; extra == 'generative'
+ Requires-Dist: vllm>=0.8.0; (platform_system == 'Linux') and extra == 'generative'
  Provides-Extra: human-evaluation
  Requires-Dist: gradio>=4.26.0; extra == 'human-evaluation'
  Provides-Extra: test

@@ -202,6 +204,19 @@ argument. This could for instance be `--model <model-id> --task
  sentiment-classification`.


+ ### Reproducing the datasets
+ All datasets used in this project are generated using the scripts located in the [src/scripts](src/scripts) folder. To reproduce a dataset, run the corresponding script with the following command
+
+ ```shell
+ $ uv run src/scripts/<name-of-script>.py
+ ```
+
+ Replace <name-of-script> with the specific script you wish to execute, e.g.,
+
+ ```shell
+ $ uv run src/scripts/create_allocine.py
+ ```
+
  ## Special Thanks :pray:
  - Thanks [@Mikeriess](https://github.com/Mikeriess) for evaluating many of the larger
  models on the leaderboards.

{euroeval-15.3.0 → euroeval-15.4.0}/README.md

@@ -129,6 +129,19 @@ argument. This could for instance be `--model <model-id> --task
  sentiment-classification`.


+ ### Reproducing the datasets
+ All datasets used in this project are generated using the scripts located in the [src/scripts](src/scripts) folder. To reproduce a dataset, run the corresponding script with the following command
+
+ ```shell
+ $ uv run src/scripts/<name-of-script>.py
+ ```
+
+ Replace <name-of-script> with the specific script you wish to execute, e.g.,
+
+ ```shell
+ $ uv run src/scripts/create_allocine.py
+ ```
+
  ## Special Thanks :pray:
  - Thanks [@Mikeriess](https://github.com/Mikeriess) for evaluating many of the larger
  models on the leaderboards.

{euroeval-15.3.0 → euroeval-15.4.0}/docs/datasets/dutch.md

@@ -75,7 +75,7 @@ and features Dutch book reviews from [Hebban.nl](https://www.hebban.nl), annotat
  sentiment labels, written by the users of the website.

  The original full dataset consists of 20,000 / 2,200 samples for training and testing,
- respectively. We use a 1,024 / 256 / 2,048 split for training, validation and testing,
+ respectively. We use a 1,014 / 253 / 2,014 split for training, validation and testing,
  respectively (so 3,328 samples used in total). The training and testing splits are
  subsets of the original splits, and the validation split is a disjoint subset of the
  original training split.

{euroeval-15.3.0 → euroeval-15.4.0}/docs/datasets/faroese.md

@@ -17,8 +17,8 @@ labels were manually annotated by two native speakers.
  The original full dataset consists of 245 samples, which consisted of both a news
  article, a chosen sentence from the article, and the sentiment label. We use both the
  news article and the chosen sentence as two separate samples, to increase the size of
- the dataset (keeping them within the same dataset split). In total, we use a 74 / 35 /
- 283 split for training, validation and testing, respectively.
+ the dataset (keeping them within the same dataset split). In total, we use a 72 / 40 /
+ 279 split for training, validation and testing, respectively.

  Here are a few examples from the training split:

{euroeval-15.3.0 → euroeval-15.4.0}/docs/datasets/german.md

@@ -485,7 +485,7 @@ $ euroeval --model <model-id> --dataset hellaswag-de

  ## Summarization

- ### MLSum
+ ### MLSum-de

  This dataset was published in [this
  paper](https://aclanthology.org/2020.emnlp-main.647/) and features news articles and

@@ -541,5 +541,5 @@ When evaluating generative models, we use the following setup (see the
  You can evaluate this dataset directly as follows:

  ```bash
- $ euroeval --model <model-id> --dataset mlsum
+ $ euroeval --model <model-id> --dataset mlsum-de
  ```

{euroeval-15.3.0 → euroeval-15.4.0}/docs/datasets/icelandic.md

@@ -13,7 +13,7 @@ This dataset is being published in an upcoming paper, and consists of texts from
  Icelandic blog post, annotated with sentiment labels (and many others) via a
  crowdsourcing platform.

- The original full dataset consists of 2,901 samples, and we use a 1,024 / 256 / 1,621
+ The original full dataset consists of 2,901 samples, and we use a 1,021 / 255 / 1,607
  split for training, validation and testing, respectively (so all samples are used in
  total).