EuroEval 15.2.0.tar.gz → 15.3.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of EuroEval might be problematic.

Files changed (206)
  1. {euroeval-15.2.0 → euroeval-15.3.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +1 -0
  2. {euroeval-15.2.0 → euroeval-15.3.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +1 -0
  3. {euroeval-15.2.0 → euroeval-15.3.0}/.pre-commit-config.yaml +1 -1
  4. {euroeval-15.2.0 → euroeval-15.3.0}/CHANGELOG.md +51 -0
  5. {euroeval-15.2.0 → euroeval-15.3.0}/PKG-INFO +4 -1
  6. {euroeval-15.2.0 → euroeval-15.3.0}/README.md +2 -0
  7. {euroeval-15.2.0 → euroeval-15.3.0}/docs/README.md +1 -1
  8. {euroeval-15.2.0 → euroeval-15.3.0}/docs/datasets/danish.md +11 -10
  9. {euroeval-15.2.0 → euroeval-15.3.0}/docs/datasets/icelandic.md +65 -55
  10. euroeval-15.3.0/docs/datasets/italian.md +577 -0
  11. {euroeval-15.2.0 → euroeval-15.3.0}/docs/datasets/norwegian.md +205 -6
  12. {euroeval-15.2.0 → euroeval-15.3.0}/makefile +24 -10
  13. {euroeval-15.2.0 → euroeval-15.3.0}/pyproject.toml +4 -1
  14. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/benchmark_modules/fresh.py +3 -1
  15. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/benchmark_modules/vllm.py +6 -2
  16. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/dataset_configs.py +242 -6
  17. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/task_utils/question_answering.py +10 -7
  18. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/task_utils/sequence_classification.py +11 -2
  19. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/task_utils/text_to_text.py +10 -1
  20. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/task_utils/token_classification.py +9 -3
  21. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/utils.py +2 -2
  22. euroeval-15.3.0/src/scripts/create_danish_citizen_tests.py +136 -0
  23. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_hellaswag.py +1 -0
  24. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_icelandic_knowledge.py +39 -9
  25. euroeval-15.3.0/src/scripts/create_ilpost_sum.py +83 -0
  26. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_mmlu.py +1 -0
  27. euroeval-15.3.0/src/scripts/create_multinerd-it.py +114 -0
  28. euroeval-15.3.0/src/scripts/create_no_cola.py +138 -0
  29. euroeval-15.3.0/src/scripts/create_nor_common_sense_qa.py +141 -0
  30. euroeval-15.3.0/src/scripts/create_nrk_quiz_qa.py +153 -0
  31. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_scala.py +2 -0
  32. euroeval-15.3.0/src/scripts/create_sentipolc16.py +76 -0
  33. euroeval-15.3.0/src/scripts/create_squad_it.py +107 -0
  34. euroeval-15.3.0/src/scripts/create_wikineural-it.py +109 -0
  35. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/fix_dot_env_file.py +4 -2
  36. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/load_ud_pos.py +18 -0
  37. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_benchmarker.py +4 -3
  38. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_finetuning.py +0 -1
  39. {euroeval-15.2.0 → euroeval-15.3.0}/uv.lock +462 -415
  40. euroeval-15.2.0/gfx/euroeval-no-bg.png +0 -0
  41. euroeval-15.2.0/gfx/euroeval-orig.png +0 -0
  42. euroeval-15.2.0/src/scripts/create_danish_citizen_tests.py +0 -95
  43. {euroeval-15.2.0 → euroeval-15.3.0}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
  44. {euroeval-15.2.0 → euroeval-15.3.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
  45. {euroeval-15.2.0 → euroeval-15.3.0}/.github/workflows/ci.yaml +0 -0
  46. {euroeval-15.2.0 → euroeval-15.3.0}/.gitignore +0 -0
  47. {euroeval-15.2.0 → euroeval-15.3.0}/CITATION.cff +0 -0
  48. {euroeval-15.2.0 → euroeval-15.3.0}/CODE_OF_CONDUCT.md +0 -0
  49. {euroeval-15.2.0 → euroeval-15.3.0}/CONTRIBUTING.md +0 -0
  50. {euroeval-15.2.0 → euroeval-15.3.0}/Dockerfile.cuda +0 -0
  51. {euroeval-15.2.0 → euroeval-15.3.0}/LICENSE +0 -0
  52. {euroeval-15.2.0 → euroeval-15.3.0}/docs/CNAME +0 -0
  53. {euroeval-15.2.0 → euroeval-15.3.0}/docs/datasets/README.md +0 -0
  54. {euroeval-15.2.0 → euroeval-15.3.0}/docs/datasets/dutch.md +0 -0
  55. {euroeval-15.2.0 → euroeval-15.3.0}/docs/datasets/english.md +0 -0
  56. {euroeval-15.2.0 → euroeval-15.3.0}/docs/datasets/faroese.md +0 -0
  57. {euroeval-15.2.0 → euroeval-15.3.0}/docs/datasets/french.md +0 -0
  58. {euroeval-15.2.0 → euroeval-15.3.0}/docs/datasets/german.md +0 -0
  59. {euroeval-15.2.0 → euroeval-15.3.0}/docs/datasets/swedish.md +0 -0
  60. {euroeval-15.2.0 → euroeval-15.3.0}/docs/extras/radial_plotter.md +0 -0
  61. {euroeval-15.2.0 → euroeval-15.3.0}/docs/faq.md +0 -0
  62. {euroeval-15.2.0 → euroeval-15.3.0}/docs/gfx/favicon.png +0 -0
  63. {euroeval-15.2.0 → euroeval-15.3.0}/docs/leaderboards/Monolingual/danish.md +0 -0
  64. {euroeval-15.2.0 → euroeval-15.3.0}/docs/leaderboards/Monolingual/dutch.md +0 -0
  65. {euroeval-15.2.0 → euroeval-15.3.0}/docs/leaderboards/Monolingual/english.md +0 -0
  66. {euroeval-15.2.0 → euroeval-15.3.0}/docs/leaderboards/Monolingual/faroese.md +0 -0
  67. {euroeval-15.2.0 → euroeval-15.3.0}/docs/leaderboards/Monolingual/french.md +0 -0
  68. {euroeval-15.2.0 → euroeval-15.3.0}/docs/leaderboards/Monolingual/german.md +0 -0
  69. {euroeval-15.2.0 → euroeval-15.3.0}/docs/leaderboards/Monolingual/icelandic.md +0 -0
  70. {euroeval-15.2.0 → euroeval-15.3.0}/docs/leaderboards/Monolingual/norwegian.md +0 -0
  71. {euroeval-15.2.0 → euroeval-15.3.0}/docs/leaderboards/Monolingual/swedish.md +0 -0
  72. {euroeval-15.2.0 → euroeval-15.3.0}/docs/leaderboards/Multilingual/european.md +0 -0
  73. {euroeval-15.2.0 → euroeval-15.3.0}/docs/leaderboards/Multilingual/germanic.md +0 -0
  74. {euroeval-15.2.0 → euroeval-15.3.0}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
  75. {euroeval-15.2.0 → euroeval-15.3.0}/docs/leaderboards/README.md +0 -0
  76. {euroeval-15.2.0 → euroeval-15.3.0}/docs/methodology.md +0 -0
  77. {euroeval-15.2.0 → euroeval-15.3.0}/docs/python-package.md +0 -0
  78. {euroeval-15.2.0 → euroeval-15.3.0}/docs/tasks/README.md +0 -0
  79. {euroeval-15.2.0 → euroeval-15.3.0}/docs/tasks/common-sense-reasoning.md +0 -0
  80. {euroeval-15.2.0 → euroeval-15.3.0}/docs/tasks/knowledge.md +0 -0
  81. {euroeval-15.2.0 → euroeval-15.3.0}/docs/tasks/linguistic-acceptability.md +0 -0
  82. {euroeval-15.2.0 → euroeval-15.3.0}/docs/tasks/named-entity-recognition.md +0 -0
  83. {euroeval-15.2.0 → euroeval-15.3.0}/docs/tasks/reading-comprehension.md +0 -0
  84. {euroeval-15.2.0 → euroeval-15.3.0}/docs/tasks/sentiment-classification.md +0 -0
  85. {euroeval-15.2.0 → euroeval-15.3.0}/docs/tasks/speed.md +0 -0
  86. {euroeval-15.2.0 → euroeval-15.3.0}/docs/tasks/summarization.md +0 -0
  87. {euroeval-15.2.0 → euroeval-15.3.0}/gfx/euroeval.png +0 -0
  88. {euroeval-15.2.0 → euroeval-15.3.0}/gfx/euroeval.xcf +0 -0
  89. {euroeval-15.2.0 → euroeval-15.3.0}/gfx/scandeval.png +0 -0
  90. {euroeval-15.2.0 → euroeval-15.3.0}/mkdocs.yaml +0 -0
  91. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/__init__.py +0 -0
  92. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/benchmark_config_factory.py +0 -0
  93. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/benchmark_modules/__init__.py +0 -0
  94. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/benchmark_modules/base.py +0 -0
  95. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/benchmark_modules/hf.py +0 -0
  96. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/benchmark_modules/litellm.py +0 -0
  97. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/benchmarker.py +0 -0
  98. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/callbacks.py +0 -0
  99. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/cli.py +0 -0
  100. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/constants.py +0 -0
  101. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/data_loading.py +0 -0
  102. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/data_models.py +0 -0
  103. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/enums.py +0 -0
  104. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/exceptions.py +0 -0
  105. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/finetuning.py +0 -0
  106. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/generation.py +0 -0
  107. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/human_evaluation.py +0 -0
  108. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/languages.py +0 -0
  109. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/model_cache.py +0 -0
  110. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/model_config.py +0 -0
  111. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/model_loading.py +0 -0
  112. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/scores.py +0 -0
  113. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/speed_benchmark.py +0 -0
  114. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/task_utils/__init__.py +0 -0
  115. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/task_utils/multiple_choice_classification.py +0 -0
  116. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/tasks.py +0 -0
  117. {euroeval-15.2.0 → euroeval-15.3.0}/src/euroeval/types.py +0 -0
  118. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/constants.py +0 -0
  119. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_allocine.py +0 -0
  120. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_angry_tweets.py +0 -0
  121. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_arc.py +0 -0
  122. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_arc_is.py +0 -0
  123. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_belebele.py +0 -0
  124. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_cnn_dailymail.py +0 -0
  125. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_conll_en.py +0 -0
  126. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_conll_nl.py +0 -0
  127. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_dane.py +0 -0
  128. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_dansk.py +0 -0
  129. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_danske_talemaader.py +0 -0
  130. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_danske_talemaader_old.py +0 -0
  131. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_dbrd.py +0 -0
  132. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_dutch_cola.py +0 -0
  133. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_dutch_social.py +0 -0
  134. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_eltec.py +0 -0
  135. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_fone.py +0 -0
  136. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_foqa.py +0 -0
  137. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_fosent.py +0 -0
  138. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_fquad.py +0 -0
  139. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_germanquad.py +0 -0
  140. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_germeval.py +0 -0
  141. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
  142. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_ice_linguistic.py +0 -0
  143. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_icelandic_error_corpus.py +0 -0
  144. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_icelandic_qa.py +0 -0
  145. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_icesum.py +0 -0
  146. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_jentoft.py +0 -0
  147. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_mim_gold_ner.py +0 -0
  148. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_mlsum.py +0 -0
  149. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_no_sammendrag.py +0 -0
  150. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_nordjylland_news.py +0 -0
  151. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_norec.py +0 -0
  152. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_norglm_multiqa.py +0 -0
  153. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_norglm_multisum.py +0 -0
  154. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_norne.py +0 -0
  155. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_norquad.py +0 -0
  156. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_nqii.py +0 -0
  157. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_orange_sum.py +0 -0
  158. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_personal_sum.py +0 -0
  159. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_rrn.py +0 -0
  160. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_sb10k.py +0 -0
  161. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_scandiqa.py +0 -0
  162. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_schibsted.py +0 -0
  163. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_squad.py +0 -0
  164. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_squad_nl.py +0 -0
  165. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_squad_nl_old.py +0 -0
  166. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_sst5.py +0 -0
  167. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_suc3.py +0 -0
  168. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_swedn.py +0 -0
  169. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_swerec.py +0 -0
  170. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_wiki_lingua_nl.py +0 -0
  171. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_wikiann_fo.py +0 -0
  172. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/create_winogrande_is.py +0 -0
  173. {euroeval-15.2.0 → euroeval-15.3.0}/src/scripts/versioning.py +0 -0
  174. {euroeval-15.2.0 → euroeval-15.3.0}/tests/__init__.py +0 -0
  175. {euroeval-15.2.0 → euroeval-15.3.0}/tests/conftest.py +0 -0
  176. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_benchmark_config_factory.py +0 -0
  177. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_benchmark_modules/__init__.py +0 -0
  178. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_benchmark_modules/test_base.py +0 -0
  179. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_benchmark_modules/test_fresh.py +0 -0
  180. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_benchmark_modules/test_hf.py +0 -0
  181. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_benchmark_modules/test_litellm.py +0 -0
  182. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_benchmark_modules/test_vllm.py +0 -0
  183. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_callbacks.py +0 -0
  184. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_cli.py +0 -0
  185. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_constants.py +0 -0
  186. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_data_loading.py +0 -0
  187. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_data_models.py +0 -0
  188. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_dataset_configs.py +0 -0
  189. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_enums.py +0 -0
  190. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_exceptions.py +0 -0
  191. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_generation.py +0 -0
  192. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_human_evaluation.py +0 -0
  193. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_languages.py +0 -0
  194. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_model_cache.py +0 -0
  195. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_model_config.py +0 -0
  196. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_model_loading.py +0 -0
  197. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_scores.py +0 -0
  198. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_speed_benchmark.py +0 -0
  199. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_task_utils/__init__.py +0 -0
  200. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_task_utils/test_question_answering.py +0 -0
  201. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_task_utils/test_sequence_classification.py +0 -0
  202. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_task_utils/test_text_to_text.py +0 -0
  203. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_task_utils/test_token_classification.py +0 -0
  204. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_tasks.py +0 -0
  205. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_types.py +0 -0
  206. {euroeval-15.2.0 → euroeval-15.3.0}/tests/test_utils.py +0 -0
@@ -28,6 +28,7 @@ body:
   - label: French
   - label: German
   - label: Icelandic
+  - label: Italian
   - label: Norwegian (Bokmål or Nynorsk)
   - label: Swedish
   validations:
@@ -34,6 +34,7 @@ body:
   - label: French
   - label: German
   - label: Icelandic
+  - label: Italian
   - label: Norwegian (Bokmål or Nynorsk)
   - label: Swedish
   validations:
@@ -10,7 +10,7 @@ repos:
   - id: trailing-whitespace
   - id: debug-statements
 - repo: https://github.com/astral-sh/ruff-pre-commit
-  rev: v0.9.8
+  rev: v0.9.10
   hooks:
   - id: ruff
     args:
@@ -10,6 +10,57 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 
 
 
+## [v15.3.0] - 2025-03-12
+### Added
+- Added support for evaluating Italian 🇮🇹! This includes the reading comprehension
+  dataset [SQuAD-it](https://hf.co/datasets/crux82/squad_it), the summarization dataset
+  [IlPost](https://hf.co/datasets/ARTeLab/ilpost), the sentiment classification dataset
+  [Sentipolc-16](https://hf.co/datasets/cardiffnlp/tweet_sentiment_multilingual), the
+  common-sense reasoning dataset
+  [HellaSwag-it](https://hf.co/datasets/alexandrainst/m_hellaswag), the linguistic
+  acceptability dataset ScaLA with the [Italian Universal Dependencies
+  treebank](https://github.com/UniversalDependencies/UD_Italian-ISDT), the knowledge
+  dataset [MMLU-it](https://hf.co/datasets/alexandrainst/m_mmlu), and the named entity
+  recognition dataset [MultiNERD IT](https://hf.co/datasets/Babelscape/multinerd) (and
+  unofficially [WikiNEuRal IT](https://hf.co/datasets/Babelscape/wikineural)). This was
+  contributed by [@viggo-gascou](https://github.com/viggo-gascou) ✨
+- Added the new Norwegian knowledge dataset NRK-Quiz-QA, consisting of quizzes on the
+  Norwegian language and culture, in both Bokmål and Nynorsk. The dataset has been
+  split into 635 / 256 / 2,048 samples for train, val, and test, respectively. This
+  replaces the old MMLU-no as the official Norwegian knowledge dataset.
+- Added the new Norwegian common-sense reasoning dataset NorCommonSenseQA, which is a
+  manually translated and localised version of the English CommonsenseQA dataset, in
+  both Bokmål and Nynorsk. The dataset has been split into 128 / 128 / 787 samples for
+  train, val, and test, respectively. This replaces the old HellaSwag-no as the
+  official Norwegian common-sense reasoning dataset.
+- Added the Norwegian linguistic acceptability dataset NoCoLA, which is based on the
+  annotated language learner corpus ASK. The dataset has been split into 1,024 / 256 /
+  2,048 samples and converted into a binary correct/incorrect dataset, stratified
+  across the error categories.
+
+### Changed
+- Updated the Danish Citizen Tests dataset to include the newer 2024 tests. Further,
+  rather than splitting the dataset randomly, we include all the citizenship tests in
+  the test split, and prioritise the newer permanent residence tests in the test and
+  validation splits.
+- Changed the IcelandicKnowledge dataset to be the new official Icelandic knowledge
+  dataset, as it is more specific to Icelandic culture and history than the previous
+  machine-translated ARC-is dataset. It has also been improved, as some of the
+  generated alternative answers were formatted incorrectly.
+
+### Fixed
+- A bug caused fresh encoder models to not be benchmarkable on the speed benchmark -
+  this has now been fixed.
+- Some encoder models could not be evaluated on reading comprehension if their
+  tokenizers did not subclass `PreTrainedTokenizer`. This check has been relaxed to
+  `PreTrainedTokenizerBase` instead.
+- Newer versions of the `transformers` package changed the model output format,
+  causing errors when evaluating encoder models on some tasks. This has now been
+  fixed.
+- Added `setuptools` to the dependencies, as it is required for the package to be
+  installed correctly.
+
+
 ## [v15.2.0] - 2025-02-28
 ### Changed
 - Changed the name of the benchmark to `EuroEval`, to reflect the fact that the
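For context on the reading-comprehension fix above: in `transformers`, fast tokenizers subclass `PreTrainedTokenizerBase` but not `PreTrainedTokenizer`, so a strict `isinstance` check against the latter rejects them. A minimal sketch of the difference (illustrative only, not EuroEval's actual check; the `bert-base-cased` checkpoint is just an example that loads a fast tokenizer by default):

```python
from transformers import AutoTokenizer, PreTrainedTokenizer, PreTrainedTokenizerBase

# Example checkpoint; AutoTokenizer returns a PreTrainedTokenizerFast here by default.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

print(isinstance(tokenizer, PreTrainedTokenizer))      # False: fast tokenizers do not subclass this
print(isinstance(tokenizer, PreTrainedTokenizerBase))  # True: the relaxed check accepts them
```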
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: EuroEval
-Version: 15.2.0
+Version: 15.3.0
 Summary: The robust European language model benchmark.
 Project-URL: Repository, https://github.com/EuroEval/EuroEval
 Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -49,6 +49,7 @@ Requires-Dist: sacremoses>=0.1.1
 Requires-Dist: scikit-learn<1.6.0
 Requires-Dist: sentencepiece>=0.1.96
 Requires-Dist: seqeval>=1.2.2
+Requires-Dist: setuptools>=75.8.2
 Requires-Dist: tenacity>=9.0.0
 Requires-Dist: termcolor>=2.0.0
 Requires-Dist: torch>=2.3.0
@@ -76,6 +77,8 @@ Description-Content-Type: text/markdown
 
 ### The robust European language model benchmark.
 
+_(formerly known as ScandEval)_
+
 ______________________________________________________________________
 [![Documentation](https://img.shields.io/badge/docs-passing-green)](https://euroeval.com)
 [![PyPI Status](https://badge.fury.io/py/euroeval.svg)](https://pypi.org/project/euroeval/)
@@ -4,6 +4,8 @@
 
 ### The robust European language model benchmark.
 
+_(formerly known as ScandEval)_
+
 ______________________________________________________________________
 [![Documentation](https://img.shields.io/badge/docs-passing-green)](https://euroeval.com)
 [![PyPI Status](https://badge.fury.io/py/euroeval.svg)](https://pypi.org/project/euroeval/)
@@ -6,7 +6,7 @@ hide:
 #
 <div align='center'>
 <img src="https://raw.githubusercontent.com/EuroEval/EuroEval/main/gfx/euroeval.png" height="500" width="372">
-<h3>A Robust Multilingual Evaluation Framework for Language Models</h3>
+<h3>The robust European language model benchmark.</h3>
 </div>
 
 --------------------------
@@ -429,33 +429,34 @@ $ euroeval --model <model-id> --dataset danske-talemaader
 ### Danish Citizen Tests
 
 This dataset was created by scraping the Danish citizenship tests (indfødsretsprøven)
-and permanent residency tests (medborgerskabsprøven) from 2016 to 2023. These are
+and permanent residency tests (medborgerskabsprøven) from 2016 to 2024. These are
 available on the [official website of the Danish Ministry of International Recruitment
 and Integration](https://danskogproever.dk/).
 
-The original full dataset consists of 720 samples. We use an 80 / 128 / 512 split for
-training, validation and testing, respectively (so 720 samples used in total).
+The original full dataset consists of 870 samples. We use a 345 / 90 / 525 split for
+training, validation and testing, respectively. Here all the citizenship tests belong to
+the test split, as well as the newest permanent residency tests. The validation split
+contains the newer permanent residency tests after the ones in the test split, and the
+training split contains the oldest permanent residency tests.
 
 Here are a few examples from the training split:
 
 ```json
 {
-  "text": "Hvilke lande er med i rigsfællesskabet?\nSvarmuligheder:\na. Danmark, Grønland og Færøerne\nb. Danmark, Island og Norge",
+  "text": "Hvilket parti tilhørte Lars Løkke Rasmussen, da han var statsminister i perioderne 2009-11 og 2015-19?\nSvarmuligheder:\na. Venstre\nb. Socialdemokratiet\nc. Det Konservative Folkeparti",
   "label": "a"
 }
 ```
 ```json
 {
-  "text": "Hvor mange medlemmer har Folketinget?\nSvarmuligheder:\na. 87\nb. 179\nc. 265",
+  "text": "Hvilket af følgende områder har kommunerne ansvaret for driften af?\nSvarmuligheder:\na. Domstole\nb. Vuggestuer\nc. Sygehuse",
   "label": "b"
-}
-```
+}```
 ```json
 {
-  "text": "Hvem kan blive biskop i den danske folkekirke?\nSvarmuligheder:\na. Kun mænd\nb. Kun kvinder\nc. Både mænd og kvinder",
+  "text": "Hvilken organisation blev Danmark medlem af i 1945?\nSvarmuligheder:\na. Verdenshandelsorganisationen (WTO)\nb. Den Europæiske Union (EU)\nc. De Forenede Nationer (FN)",
   "label": "c"
-}
-```
+}```
 
 When evaluating generative models, we use the following setup (see the
 [methodology](/methodology) for more information on how these are used):
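As a rough illustration of the split strategy described above, the assignment could be sketched as follows. This is a hypothetical sketch only: the field names (`test_type`, `year`) and the helper itself are assumptions for illustration, not the contents of `src/scripts/create_danish_citizen_tests.py`.

```python
def assign_splits(samples: list[dict]) -> dict[str, list[dict]]:
    """Hypothetical sketch of the 345 / 90 / 525 split described above."""
    # All citizenship tests (indfødsretsprøven) go straight into the test split.
    citizenship = [s for s in samples if s["test_type"] == "indfoedsretsproeven"]
    # Permanent residency tests (medborgerskabsprøven), sorted newest first.
    residency = sorted(
        (s for s in samples if s["test_type"] == "medborgerskabsproeven"),
        key=lambda s: s["year"],
        reverse=True,
    )
    n_test_residency = 525 - len(citizenship)  # newest residency tests top up the test split
    test = citizenship + residency[:n_test_residency]
    val = residency[n_test_residency : n_test_residency + 90]  # the next-newest tests
    train = residency[n_test_residency + 90 :]  # the oldest tests end up in train
    return {"train": train, "val": val, "test": test}
```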
@@ -413,7 +413,7 @@ $ euroeval --model <model-id> --dataset nqii
 ### Unofficial: IcelandicQA
 
 This dataset was published
-[here](https://huggingface.co/datasets/mideind/icelandic_qa_euroeval) and consists of
+[here](https://huggingface.co/datasets/mideind/icelandic_qa_scandeval) and consists of
 an automatically created Icelandic question-answering dataset based on the Icelandic
 Wikipedia as well as Icelandic news articles from the RÚV corpus.
 
@@ -490,37 +490,53 @@ $ euroeval --model <model-id> --dataset icelandic-qa
 
 ## Knowledge
 
-### ARC-is
+### IcelandicKnowledge
 
-This dataset is a machine translated version of the English [ARC
-dataset](https://doi.org/10.48550/arXiv.1803.05457) and features US grade-school science
-questions. The dataset was translated by Miðeind using the Claude 3.5 Sonnet model.
+This dataset was published
+[here](https://huggingface.co/datasets/mideind/icelandic_qa_scandeval) and consists of
+an automatically created Icelandic question-answering dataset based on the Icelandic
+Wikipedia as well as Icelandic news articles from the RÚV corpus.
 
-The original full dataset consists of 1,110 / 297 / 1,170 samples for training,
-validation and testing, respectively. We use a 1,024 / 256 / 1,024 split for training,
-validation and testing, respectively (so 2,304 samples used in total). All new splits
-are subsets of the original splits.
+The dataset was converted into a multiple-choice knowledge dataset by removing the
+contexts and using GPT-4o to generate 3 plausible wrong answers for each correct answer,
+using the following prompt for each `row` in the original dataset:
+
+```python
+messages = [
+    {
+        "role": "user",
+        "content": f"For the question: {row.question} where the correct answer is: {row.answer}, please provide 3 plausible alternatives in Icelandic. You should return the alternatives in a JSON dictionary, with keys 'first', 'second', and 'third'. The values should be the alternatives only, without any numbering or formatting. The alternatives should be unique and not contain the correct answer."
+    }
+]
+
+completion = client.beta.chat.completions.parse(
+    model="gpt-4o", messages=messages, response_format=CandidateAnswers
+)
+```
+
+where `CandidateAnswers` is a Pydantic model that is used to ensure [structured outputs](https://platform.openai.com/docs/guides/structured-outputs).
+
+The original dataset has 2,000 samples, but only 1,994 unique questions, and the total
+length of this dataset is therefore 1,994. The split is given by 842 / 128 / 1024 for
+train, val, and test, respectively.
 
 Here are a few examples from the training split:
 
 ```json
 {
-  "text": "Líkamar manna hafa flókna uppbyggingu sem styður vöxt og lífslíkur. Hver er grundvallaruppbygging líkamans sem stuðlar vexti og lífslíkum?\nSvarmöguleikar:\na. fruma\nb. vefur\nc. líffæri\nd. líffærakerfi",
+  "text": "Hver var talinn heilagur maður eftir dauða sinn, er tákngervingur alþýðuhreyfingar vestanlands og talinn góður til áheita?\nSvarmöguleikar:\na. Þórður Jónsson helgi\nb. Guðmundur Arason\nc. Snorri Þorgrímsson\nd. Jón Hreggviðsson",
   "label": "a"
-}
-```
+}```
 ```json
 {
-  "text": "Veðurfræðingur skráir gögn fyrir borg á ákveðnum degi. Gögnin innihalda hitastig, skýjahulu, vindhraða, loftþrýsting og vindátt. Hvaða aðferð ætti veðurfræðingurinn að nota til að skrá þessi gögn fyrir fljótlega tilvísun?\nSvarmöguleikar:\na. skriflega lýsingu\nb. töflu\nc. stöðvarlíkan\nd. veðurkort",
+  "text": "Í kringum hvaða ár hófst verslun á Arngerðareyri?\nSvarmöguleikar:\na. 1895\nb. 1884\nc. 1870\nd. 1902",
   "label": "b"
-}
-```
+}```
 ```json
 {
-  "text": "Hvaða breytingar urðu þegar reikistjörnurnar hitnnuðu á meðan þær mynduðust?\nSvarmöguleikar:\na. Massi þeirra jókst.\nb. Þær töpuðu meirihluta geislavirkra samsæta sinna.\nc. Uppbygging þeirra aðgreindist í mismunandi lög.\nd. Þær byrjuðu að snúast í kringum sólina.",
+  "text": "Hvenær var ákveðið uppstigningardagur skyldi vera kirkjudagur aldraðra á Íslandi?\nSvarmöguleikar:\na. Árið 1975\nb. Árið 1985\nc. Árið 1982\nd. Árið 1990",
   "label": "c"
-}
-```
+}```
 
 When evaluating generative models, we use the following setup (see the
 [methodology](/methodology) for more information on how these are used):
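The `CandidateAnswers` model itself is not shown in the diff, but since the prompt asks for a JSON dictionary with the keys 'first', 'second' and 'third', it plausibly looks something like the following sketch (an assumption, not the actual model from the script):

```python
from pydantic import BaseModel


class CandidateAnswers(BaseModel):
    """Three plausible but incorrect alternatives for a question (assumed field layout)."""

    first: str
    second: str
    third: str
```

Passing a Pydantic model as `response_format` to `client.beta.chat.completions.parse` makes the OpenAI SDK constrain the output to that JSON schema and return the validated object via `completion.choices[0].message.parsed`.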
@@ -555,40 +571,38 @@ When evaluating generative models, we use the following setup (see the
 You can evaluate this dataset directly as follows:
 
 ```bash
-$ euroeval --model <model-id> --dataset arc-is
+$ euroeval --model <model-id> --dataset icelandic-knowledge
 ```
 
 
-### Unofficial: MMLU-is
+### Unofficial: ARC-is
 
-This dataset is a machine translated version of the English [MMLU
-dataset](https://openreview.net/forum?id=d7KBjmI3GmQ) and features questions within 57
-different topics, such as elementary mathematics, US history and law. The dataset was
-translated using [Miðeind](https://mideind.is/english.html)'s Greynir translation model.
+This dataset is a machine translated version of the English [ARC
+dataset](https://doi.org/10.48550/arXiv.1803.05457) and features US grade-school science
+questions. The dataset was translated by Miðeind using the Claude 3.5 Sonnet model.
 
-The original full dataset consists of 269 / 1,410 / 13,200 samples for training,
-validation and testing, respectively. We use a 1,024 / 256 / 2,048 split for training,
-validation and testing, respectively (so 3,328 samples used in total). These splits are
-new and there can thus be some overlap between the original validation and test sets and
-our validation and test sets.
+The original full dataset consists of 1,110 / 297 / 1,170 samples for training,
+validation and testing, respectively. We use a 1,024 / 256 / 1,024 split for training,
+validation and testing, respectively (so 2,304 samples used in total). All new splits
+are subsets of the original splits.
 
 Here are a few examples from the training split:
 
 ```json
 {
-  "text": "Af hverju er öruggara horfa á tunglið enhorfa á sólina?\nSvarmöguleikar:\na. Tunglið er minna bjart.\nb. Tunglið er nær jörðinni.\nc. Tunglið skín aðallega á nóttunni.\nd. Tunglið er aðeins fullt einu sinni í mánuði.",
+  "text": "Líkamar manna hafa flókna uppbyggingu sem styður vöxt og lífslíkur. Hver er grundvallaruppbygging líkamans sem stuðlar vexti og lífslíkum?\nSvarmöguleikar:\na. fruma\nb. vefur\nc. líffæri\nd. líffærakerfi",
   "label": "a"
 }
 ```
 ```json
 {
-  "text": "Hvaða lög jarðar eru aðallega gerð úr föstu efni?\nSvarmöguleikar:\na. innri kjarni og ytri kjarni\nb. skorpu og innri kjarni\nc. skorpu og möttli\nd. möttli og ytri kjarni",
+  "text": "Veðurfræðingur skráir gögn fyrir borg á ákveðnum degi. Gögnin innihalda hitastig, skýjahulu, vindhraða, loftþrýsting og vindátt. Hvaða aðferð ætti veðurfræðingurinn nota til skrá þessi gögn fyrir fljótlega tilvísun?\nSvarmöguleikar:\na. skriflega lýsingu\nb. töflu\nc. stöðvarlíkan\nd. veðurkort",
   "label": "b"
 }
 ```
 ```json
 {
-  "text": "Bekkur er að rannsaka þéttleika bergsýna. Hvaða vísindalegan búnað þurfa þau til ákvarða þéttleika bergsýnanna?\nSvarmöguleikar:\na. smásjá og vog\nb. bikar og mæliglös\nc. mæliglös og vog\nd. smásjá og mæliglös",
+  "text": "Hvaða breytingar urðu þegar reikistjörnurnar hitnnuðu á meðan þær mynduðust?\nSvarmöguleikar:\na. Massi þeirra jókst.\nb. Þær töpuðu meirihluta geislavirkra samsæta sinna.\nc. Uppbygging þeirra aðgreindist í mismunandi lög.\nd. Þær byrjuðu að snúast í kringum sólina.",
   "label": "c"
 }
 ```
@@ -626,46 +640,41 @@ When evaluating generative models, we use the following setup (see the
 You can evaluate this dataset directly as follows:
 
 ```bash
-$ euroeval --model <model-id> --dataset mmlu-is
+$ euroeval --model <model-id> --dataset arc-is
 ```
 
-### Unofficial: IcelandicKnowledge
-This dataset is based on the IcelandicQA dataset, which was published [here](https://huggingface.co/,datasets/mideind/icelandic_qa_euroeval), but is here phrased as a knowledge dataset. The candidate answers has been generated by GPT-4o, using the following prompt for each `row` in the original dataset:
 
-```python
-messages = [
-    {
-        "role": "user",
-        "content": f"For the question: {row.question} where the correct answer is: {row.answer}, please provide 3 plausible alternatives in Icelandic.",
-    }
-]
+### Unofficial: MMLU-is
 
-completion = client.beta.chat.completions.parse(
-    model="gpt-4o", messages=messages, response_format=CandidateAnswers
-)
-```
-where `CandidateAnswers` is a Pydantic model that is used to ensure [structured outputs](https://platform.openai.com/docs/guides/structured-outputs).
+This dataset is a machine translated version of the English [MMLU
+dataset](https://openreview.net/forum?id=d7KBjmI3GmQ) and features questions within 57
+different topics, such as elementary mathematics, US history and law. The dataset was
+translated using [Miðeind](https://mideind.is/english.html)'s Greynir translation model.
 
-The original dataset has 2,000 samples, but only 1,997 unique questions, and the total length of this dataset is therefore 1,997. The split is given by 845 / 128 / 1024 for train, val, and test, respectively.
+The original full dataset consists of 269 / 1,410 / 13,200 samples for training,
+validation and testing, respectively. We use a 1,024 / 256 / 2,048 split for training,
+validation and testing, respectively (so 3,328 samples used in total). These splits are
+new and there can thus be some overlap between the original validation and test sets and
+our validation and test sets.
 
 Here are a few examples from the training split:
 
 ```json
 {
-  "text": "Hvaða gamla verðeining var jafngildi einnar kýrverðmæti?\nSvarmöguleikar:\na. Sauðfé\nb. Kúgildi\nc. Mjólkurtollur\nd. Hrossgildi",
-  "label": "b"
+  "text": "Af hverju er öruggara horfa á tunglið en horfa á sólina?\nSvarmöguleikar:\na. Tunglið er minna bjart.\nb. Tunglið er nær jörðinni.\nc. Tunglið skín aðallega á nóttunni.\nd. Tunglið er aðeins fullt einu sinni í mánuði.",
+  "label": "a"
 }
 ```
 ```json
 {
-  "text": "Hvenær komu Íslendingar fyrst til Gimli í Manitoba?\nSvarmöguleikar:\na. 15. september 1875\nb. 25. október 1874\nc. 10. október 1876\nd. 21. október 1875",
-  "label": "d"
+  "text": "Hvaða lög jarðar eru aðallega gerð úr föstu efni?\nSvarmöguleikar:\na. innri kjarni og ytri kjarni\nb. skorpu og innri kjarni\nc. skorpu og möttli\nd. möttli og ytri kjarni",
+  "label": "b"
 }
 ```
 ```json
 {
-  "text": "Hvaða ár var byggingin sem gaf Barónsstíg í Reykjavík nafn reist?\nSvarmöguleikar:\na. 1901\nb. 1897\nc. 1899\nd. 1898",
-  "label": "c"
+  "text": "Bekkur er að rannsaka þéttleika bergsýna. Hvaða vísindalegan búnað þurfa þau til ákvarða þéttleika bergsýnanna?\nSvarmöguleikar:\na. smásjá og vog\nb. bikar og mæliglös\nc. mæliglös og vog\nd. smásjá og mæliglös",
+  "label": "c"
 }
 ```
 
@@ -702,9 +711,10 @@ When evaluating generative models, we use the following setup (see the
 You can evaluate this dataset directly as follows:
 
 ```bash
-$ euroeval --model <model-id> --dataset icelandic-knowledge
+$ euroeval --model <model-id> --dataset mmlu-is
 ```
 
+
 ## Common-sense Reasoning
 
 ### Winogrande-is