EuroEval 15.4.1.tar.gz → 15.5.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

Files changed (211)
  1. {euroeval-15.4.1 → euroeval-15.5.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +2 -0
  2. {euroeval-15.4.1 → euroeval-15.5.0}/.github/ISSUE_TEMPLATE/bug.yaml +17 -2
  3. {euroeval-15.4.1 → euroeval-15.5.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +1 -11
  4. {euroeval-15.4.1 → euroeval-15.5.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +21 -16
  5. {euroeval-15.4.1 → euroeval-15.5.0}/.github/workflows/ci.yaml +2 -2
  6. {euroeval-15.4.1 → euroeval-15.5.0}/.gitignore +4 -0
  7. {euroeval-15.4.1 → euroeval-15.5.0}/.pre-commit-config.yaml +1 -1
  8. {euroeval-15.4.1 → euroeval-15.5.0}/CHANGELOG.md +95 -11
  9. {euroeval-15.4.1 → euroeval-15.5.0}/PKG-INFO +6 -4
  10. {euroeval-15.4.1 → euroeval-15.5.0}/README.md +1 -0
  11. {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/danish.md +8 -7
  12. {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/dutch.md +1 -1
  13. {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/english.md +1 -1
  14. {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/faroese.md +4 -4
  15. {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/french.md +2 -2
  16. {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/icelandic.md +17 -13
  17. {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/italian.md +5 -6
  18. {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/norwegian.md +18 -9
  19. {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/spanish.md +1 -1
  20. {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/swedish.md +4 -5
  21. {euroeval-15.4.1 → euroeval-15.5.0}/makefile +1 -2
  22. {euroeval-15.4.1 → euroeval-15.5.0}/pyproject.toml +7 -6
  23. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/__init__.py +2 -2
  24. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/benchmark_modules/hf.py +79 -39
  25. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/benchmark_modules/litellm.py +204 -74
  26. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/benchmark_modules/vllm.py +106 -42
  27. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/benchmarker.py +35 -6
  28. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/constants.py +11 -1
  29. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/data_models.py +6 -2
  30. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/dataset_configs.py +6 -6
  31. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/task_utils/sequence_classification.py +70 -30
  32. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/types.py +3 -3
  33. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/utils.py +131 -32
  34. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_mlsum_de.py +1 -1
  35. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_mlsum_es.py +1 -1
  36. {euroeval-15.4.1 → euroeval-15.5.0}/tests/conftest.py +12 -0
  37. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_benchmarker.py +29 -0
  38. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_constants.py +1 -1
  39. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_data_models.py +4 -0
  40. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_utils.py +0 -11
  41. {euroeval-15.4.1 → euroeval-15.5.0}/uv.lock +981 -889
  42. {euroeval-15.4.1 → euroeval-15.5.0}/CITATION.cff +0 -0
  43. {euroeval-15.4.1 → euroeval-15.5.0}/CODE_OF_CONDUCT.md +0 -0
  44. {euroeval-15.4.1 → euroeval-15.5.0}/CONTRIBUTING.md +0 -0
  45. {euroeval-15.4.1 → euroeval-15.5.0}/Dockerfile.cuda +0 -0
  46. {euroeval-15.4.1 → euroeval-15.5.0}/LICENSE +0 -0
  47. {euroeval-15.4.1 → euroeval-15.5.0}/docs/CNAME +0 -0
  48. {euroeval-15.4.1 → euroeval-15.5.0}/docs/README.md +0 -0
  49. {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/README.md +0 -0
  50. {euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/german.md +0 -0
  51. {euroeval-15.4.1 → euroeval-15.5.0}/docs/extras/radial_plotter.md +0 -0
  52. {euroeval-15.4.1 → euroeval-15.5.0}/docs/faq.md +0 -0
  53. {euroeval-15.4.1 → euroeval-15.5.0}/docs/gfx/favicon.png +0 -0
  54. {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Monolingual/danish.md +0 -0
  55. {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Monolingual/dutch.md +0 -0
  56. {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Monolingual/english.md +0 -0
  57. {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Monolingual/faroese.md +0 -0
  58. {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Monolingual/french.md +0 -0
  59. {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Monolingual/german.md +0 -0
  60. {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Monolingual/icelandic.md +0 -0
  61. {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Monolingual/italian.md +0 -0
  62. {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Monolingual/norwegian.md +0 -0
  63. {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Monolingual/swedish.md +0 -0
  64. {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Multilingual/european.md +0 -0
  65. {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Multilingual/germanic.md +0 -0
  66. {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
  67. {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/Multilingual/romance.md +0 -0
  68. {euroeval-15.4.1 → euroeval-15.5.0}/docs/leaderboards/README.md +0 -0
  69. {euroeval-15.4.1 → euroeval-15.5.0}/docs/methodology.md +0 -0
  70. {euroeval-15.4.1 → euroeval-15.5.0}/docs/python-package.md +0 -0
  71. {euroeval-15.4.1 → euroeval-15.5.0}/docs/tasks/README.md +0 -0
  72. {euroeval-15.4.1 → euroeval-15.5.0}/docs/tasks/common-sense-reasoning.md +0 -0
  73. {euroeval-15.4.1 → euroeval-15.5.0}/docs/tasks/knowledge.md +0 -0
  74. {euroeval-15.4.1 → euroeval-15.5.0}/docs/tasks/linguistic-acceptability.md +0 -0
  75. {euroeval-15.4.1 → euroeval-15.5.0}/docs/tasks/named-entity-recognition.md +0 -0
  76. {euroeval-15.4.1 → euroeval-15.5.0}/docs/tasks/reading-comprehension.md +0 -0
  77. {euroeval-15.4.1 → euroeval-15.5.0}/docs/tasks/sentiment-classification.md +0 -0
  78. {euroeval-15.4.1 → euroeval-15.5.0}/docs/tasks/speed.md +0 -0
  79. {euroeval-15.4.1 → euroeval-15.5.0}/docs/tasks/summarization.md +0 -0
  80. {euroeval-15.4.1 → euroeval-15.5.0}/gfx/euroeval.png +0 -0
  81. {euroeval-15.4.1 → euroeval-15.5.0}/gfx/euroeval.xcf +0 -0
  82. {euroeval-15.4.1 → euroeval-15.5.0}/gfx/scandeval.png +0 -0
  83. {euroeval-15.4.1 → euroeval-15.5.0}/mkdocs.yaml +0 -0
  84. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/benchmark_config_factory.py +0 -0
  85. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/benchmark_modules/__init__.py +0 -0
  86. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/benchmark_modules/base.py +0 -0
  87. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/benchmark_modules/fresh.py +0 -0
  88. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/callbacks.py +0 -0
  89. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/cli.py +0 -0
  90. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/data_loading.py +0 -0
  91. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/enums.py +0 -0
  92. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/exceptions.py +0 -0
  93. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/finetuning.py +0 -0
  94. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/generation.py +0 -0
  95. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/human_evaluation.py +0 -0
  96. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/languages.py +0 -0
  97. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/model_cache.py +0 -0
  98. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/model_config.py +0 -0
  99. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/model_loading.py +0 -0
  100. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/scores.py +0 -0
  101. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/speed_benchmark.py +0 -0
  102. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/task_utils/__init__.py +0 -0
  103. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/task_utils/multiple_choice_classification.py +0 -0
  104. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/task_utils/question_answering.py +0 -0
  105. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/task_utils/text_to_text.py +0 -0
  106. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/task_utils/token_classification.py +0 -0
  107. {euroeval-15.4.1 → euroeval-15.5.0}/src/euroeval/tasks.py +0 -0
  108. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/constants.py +0 -0
  109. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_allocine.py +0 -0
  110. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_angry_tweets.py +0 -0
  111. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_arc.py +0 -0
  112. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_arc_is.py +0 -0
  113. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_belebele.py +0 -0
  114. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_cnn_dailymail.py +0 -0
  115. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_conll_en.py +0 -0
  116. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_conll_es.py +0 -0
  117. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_conll_nl.py +0 -0
  118. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_dane.py +0 -0
  119. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_danish_citizen_tests.py +0 -0
  120. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_dansk.py +0 -0
  121. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_danske_talemaader.py +0 -0
  122. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_danske_talemaader_old.py +0 -0
  123. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_dbrd.py +0 -0
  124. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_dutch_cola.py +0 -0
  125. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_dutch_social.py +0 -0
  126. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_eltec.py +0 -0
  127. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_fone.py +0 -0
  128. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_foqa.py +0 -0
  129. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_fosent.py +0 -0
  130. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_fquad.py +0 -0
  131. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_germanquad.py +0 -0
  132. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_germeval.py +0 -0
  133. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_hellaswag.py +0 -0
  134. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
  135. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_ice_linguistic.py +0 -0
  136. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_icelandic_error_corpus.py +0 -0
  137. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_icelandic_knowledge.py +0 -0
  138. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_icelandic_qa.py +0 -0
  139. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_icesum.py +0 -0
  140. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_ilpost_sum.py +0 -0
  141. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_jentoft.py +0 -0
  142. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_mim_gold_ner.py +0 -0
  143. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_mlqa_es.py +0 -0
  144. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_mmlu.py +0 -0
  145. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_multinerd-it.py +0 -0
  146. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_no_cola.py +0 -0
  147. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_no_sammendrag.py +0 -0
  148. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_nor_common_sense_qa.py +0 -0
  149. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_nordjylland_news.py +0 -0
  150. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_norec.py +0 -0
  151. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_norglm_multiqa.py +0 -0
  152. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_norglm_multisum.py +0 -0
  153. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_norne.py +0 -0
  154. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_norquad.py +0 -0
  155. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_nqii.py +0 -0
  156. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_nrk_quiz_qa.py +0 -0
  157. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_orange_sum.py +0 -0
  158. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_personal_sum.py +0 -0
  159. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_rrn.py +0 -0
  160. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_sb10k.py +0 -0
  161. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_scala.py +0 -0
  162. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_scandiqa.py +0 -0
  163. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_schibsted.py +0 -0
  164. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_sentiment_headlines_es.py +0 -0
  165. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_sentipolc16.py +0 -0
  166. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_squad.py +0 -0
  167. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_squad_it.py +0 -0
  168. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_squad_nl.py +0 -0
  169. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_squad_nl_old.py +0 -0
  170. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_sst5.py +0 -0
  171. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_suc3.py +0 -0
  172. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_swedn.py +0 -0
  173. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_swerec.py +0 -0
  174. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_wiki_lingua_nl.py +0 -0
  175. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_wikiann_fo.py +0 -0
  176. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_wikineural-it.py +0 -0
  177. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_winogrande_is.py +0 -0
  178. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/create_xquad_es.py +0 -0
  179. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/fix_dot_env_file.py +0 -0
  180. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/load_ud_pos.py +0 -0
  181. {euroeval-15.4.1 → euroeval-15.5.0}/src/scripts/versioning.py +0 -0
  182. {euroeval-15.4.1 → euroeval-15.5.0}/tests/__init__.py +0 -0
  183. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_benchmark_config_factory.py +0 -0
  184. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_benchmark_modules/__init__.py +0 -0
  185. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_benchmark_modules/test_base.py +0 -0
  186. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_benchmark_modules/test_fresh.py +0 -0
  187. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_benchmark_modules/test_hf.py +0 -0
  188. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_benchmark_modules/test_litellm.py +0 -0
  189. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_benchmark_modules/test_vllm.py +0 -0
  190. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_callbacks.py +0 -0
  191. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_cli.py +0 -0
  192. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_data_loading.py +0 -0
  193. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_dataset_configs.py +0 -0
  194. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_enums.py +0 -0
  195. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_exceptions.py +0 -0
  196. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_finetuning.py +0 -0
  197. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_generation.py +0 -0
  198. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_human_evaluation.py +0 -0
  199. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_languages.py +0 -0
  200. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_model_cache.py +0 -0
  201. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_model_config.py +0 -0
  202. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_model_loading.py +0 -0
  203. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_scores.py +0 -0
  204. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_speed_benchmark.py +0 -0
  205. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_task_utils/__init__.py +0 -0
  206. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_task_utils/test_question_answering.py +0 -0
  207. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_task_utils/test_sequence_classification.py +0 -0
  208. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_task_utils/test_text_to_text.py +0 -0
  209. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_task_utils/test_token_classification.py +0 -0
  210. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_tasks.py +0 -0
  211. {euroeval-15.4.1 → euroeval-15.5.0}/tests/test_types.py +0 -0

{euroeval-15.4.1 → euroeval-15.5.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml
@@ -2,6 +2,7 @@ name: 📚 Benchmark Dataset Request
  description: Do you think a particular benchmark dataset is missing in EuroEval?
  title: "[BENCHMARK DATASET REQUEST] <dataset-name>"
  labels: "benchmark dataset request"
+ type: task

  body:
  - type: input
@@ -30,6 +31,7 @@ body:
  - label: Icelandic
  - label: Italian
  - label: Norwegian (Bokmål or Nynorsk)
+ - label: Spanish
  - label: Swedish
  validations:
  required: true

{euroeval-15.4.1 → euroeval-15.5.0}/.github/ISSUE_TEMPLATE/bug.yaml
@@ -1,7 +1,7 @@
  name: 🐛 Bug Report
  description: Have you experienced a bug using the `euroeval` package?
  title: "[BUG] <name-of-bug>"
- labels: bug
+ type: bug

  body:
  - type: markdown
@@ -46,8 +46,9 @@ body:
  - 3.10.x
  - 3.11.x
  - 3.12.x
+ - 3.13.x
  - Older than 3.10.x
- - Newer than 3.12.x
+ - Newer than 3.13.x
  validations:
  required: true
  - type: input
@@ -57,6 +58,20 @@ body:
  placeholder: Output of `pip list | grep EuroEval`
  validations:
  required: true
+ - type: input
+ attributes:
+ label: Transformers version
+ description: What version of 🤗 transformers are you using?
+ placeholder: Output of `pip list | grep transformers`
+ validations:
+ required: true
+ - type: input
+ attributes:
+ label: vLLM version
+ description: What version of vLLM are you using?
+ placeholder: Output of `pip list | grep vllm`
+ validations:
+ required: true
  - type: markdown
  attributes:
  value: >

{euroeval-15.4.1 → euroeval-15.5.0}/.github/ISSUE_TEMPLATE/feature_request.yaml
@@ -1,7 +1,7 @@
  name: 🚀 Feature Request
  description: Is the EuroEval benchmark missing a feature?
  title: "[FEATURE REQUEST] <name-of-feature>"
- labels: enhancement
+ type: feature

  body:
  - type: textarea
@@ -11,16 +11,6 @@ body:
  A clear and concise description of the feature proposal. Please outline the motivation for the proposal. Is your feature request related to a specific problem? e.g., *"I'm working on X and would like Y to be possible"*.
  validations:
  required: true
- - type: textarea
- attributes:
- label: Alternatives
- description: >
- A description of any alternative solutions or features you've considered, if any.
- - type: textarea
- attributes:
- label: Additional context
- description: >
- Add any other context or screenshots about the feature request.
  - type: markdown
  attributes:
  value: >

{euroeval-15.4.1 → euroeval-15.5.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml
@@ -2,12 +2,25 @@ name: 📊 Model Evaluation Request
  description: Would you like to have a particular model included in the leaderboards?
  title: "[MODEL EVALUATION REQUEST] <model-name>"
  labels: "model evaluation request"
+ type: task

  body:
  - type: input
  attributes:
  label: Model ID
- description: What is the Hugging Face model ID?
+ description: What is the model ID, either on the Hugging Face Hub or on LiteLLM?
+ validations:
+ required: true
+ - type: checkboxes
+ attributes:
+ label: Evaluation languages
+ description: >
+ What languages should this model be evaluated on? Tick all that apply. If the
+ model is multilingual (e.g., Mistral, Llama), then tick all the languages.
+ options:
+ - label: Romance languages (French, Italian, Spanish)
+ - label: Scandinavian languages (Danish, Faroese, Icelandic, Norwegian, Swedish)
+ - label: West Germanic languages (Dutch, English, German)
  validations:
  required: true
  - type: dropdown
@@ -20,23 +33,14 @@ body:
  - Sequence-to-sequence model (e.g., T5)
  validations:
  required: true
- - type: checkboxes
+ - type: dropdown
  attributes:
- label: Evaluation languages
- description: >
- What languages should this model be evaluated on? Tick all that apply. If the
- model is multilingual (e.g., Mistral, Llama), then tick all the languages.
+ label: Model size
+ description: What is the size of the model?
  options:
- - label: Danish
- - label: Dutch
- - label: English
- - label: Faroese
- - label: French
- - label: German
- - label: Icelandic
- - label: Italian
- - label: Norwegian (Bokmål or Nynorsk)
- - label: Swedish
+ - Small (<=8B parameters)
+ - Large (>8B parameters)
+ - N/A
  validations:
  required: true
  - type: dropdown
@@ -46,6 +50,7 @@ body:
  options:
  - Not a merged model
  - Merged model
+ - N/A
  validations:
  required: true
  - type: markdown

{euroeval-15.4.1 → euroeval-15.5.0}/.github/workflows/ci.yaml
@@ -43,7 +43,6 @@ jobs:
  - name: Install uv and set up Python
  uses: astral-sh/setup-uv@v4
  with:
- enable-cache: true
  python-version: ${{ matrix.python-version }}

  - name: Install Dependencies
@@ -75,7 +74,6 @@ jobs:
  - name: Install uv and set up Python
  uses: astral-sh/setup-uv@v4
  with:
- enable-cache: true
  python-version: ${{ matrix.python-version }}

  - name: Install Dependencies
@@ -91,6 +89,8 @@ jobs:
  HF_TOKEN: ${{ secrets.HUGGINGFACE_API_KEY }}
  OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+ GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
+ XAI_API_KEY: ${{ secrets.XAI_API_KEY }}

  - name: Delete EuroEval cache
  run: rm -rf .euroeval_cache
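
The two new CI secrets mirror what a local evaluation of an API-hosted model needs: the provider key is read from the environment and the model is referenced by its LiteLLM model ID (the model evaluation request template below mentions both Hugging Face Hub and LiteLLM IDs). A minimal sketch, assuming the standard `euroeval` CLI used throughout the docs; the Gemini model ID is illustrative and not taken from this diff:

```bash
# Hypothetical local run against an API-hosted model.
export GEMINI_API_KEY="..."   # same variable the CI workflow now injects
euroeval --model gemini/gemini-2.0-flash --dataset dbrd
```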

{euroeval-15.4.1 → euroeval-15.5.0}/.gitignore
@@ -115,3 +115,7 @@ site/

  # Helper files for docs
  docs/datasets/dataset_example_commands.txt
+
+ # Various graphics
+ gfx/euroeval-italian.png
+ gfx/euroeval-italian.xcf

{euroeval-15.4.1 → euroeval-15.5.0}/.pre-commit-config.yaml
@@ -10,7 +10,7 @@ repos:
  - id: trailing-whitespace
  - id: debug-statements
  - repo: https://github.com/astral-sh/ruff-pre-commit
- rev: v0.11.2
+ rev: v0.11.4
  hooks:
  - id: ruff
  args:

{euroeval-15.4.1 → euroeval-15.5.0}/CHANGELOG.md
@@ -10,6 +10,91 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.



+ ## [v15.5.0] - 2025-04-07
+ ### Added
+ - Now allows supplying a parameter to API models, which is done by using
+ `<model-id>@<parameter>` as the model ID (only a single parameter is supported). The
+ parameters allowed are "low" and "high" for OpenAI models (which is the reasoning
+ effort of the model, supported by the o1- and o3-series, default is "medium"), and
+ "thinking" for Anthropic models, to enable thinking mode (supported for
+ Claude-Sonnet-3.7+). These will appear in the leaderboards as
+ `<model-id>@<parameter>`.
+ - Added metadata for Google Gemini and xAI Grok models.
+ - Allows all vLLM versions from v0.8.0 again, as the issue with the generation output
+ has been resolved.
+ - Added overall progress indicator during evaluation. This was contributed by
+ [@mathiasesn](https://github.com/mathiasesn) ✨
+
+ ### Changed
+ - Now does not use logprobs in text classification tasks with Google VertexAI models, as
+ they heavily rate limit logprobs usage. This shouldn't affect the scores significantly
+ in any case, as the models are very confident in their predictions.
+ - Updated `litellm` to `>=1.63.0`, allowing better support for reasoning models.
+
+ ### Fixed
+ - The Gemini-2.5-pro model uses different error messages than the other Gemini models,
+ which caused an error when evaluating it. This has been fixed now.
+ - Now registers the Gemini-2.5-pro model series as reasoning models, as otherwise they
+ did not generate any text as they were just generating reasoning tokens.
+ - Previously, if there were multiple labels whose first tokens were identical and that
+ the (generative) model did not output the label as the first output token, we would
+ randomly choose one of the labels, resulting in an evaluation error. This is very
+ rare, but *does* happen for very particular (model, dataset) pairs. If we are in this
+ case, we now resort to choosing the label with closest word edit distance instead of
+ relying on logprobs of the first token.
+ - Now defaults to BF16 if the model is registered as using FP32, assuming that BF16 is
+ supported by the GPU.
+ - Improved model existence pipeline for Ollama model IDs with multiple forward slashes
+ in the name, which caused some models to not be detected as existing.
+
+
+ ## [v15.4.2] - 2025-03-31
+ ### Added
+ - Now added version metadata to results, to easier track which versions of the various
+ dependencies were used when evaluating a model. This currently includes
+ `transformers`, `torch`, `vllm` and `outlines`.
+
+ ### Changed
+ - Changed the name of the German 'mlsum' summarisation dataset to 'mlsum-de', to reflect
+ that it is the German version of the dataset, and to avoid confusion with the Spanish
+ 'mlsum-es' dataset.
+
+ ### Fixed
+ - Now uses `fp16` instead of `bf16` when evaluating decoder models on GPUs with CUDA
+ compatibility < 8.0. This was contributed by
+ [@marksverdhei](https://github.com/marksverdhei) ✨
+ - Corrected the name of the French sentiment dataset AlloCiné. This was contributed by
+ [@Alkarex](https://github.com/Alkarex) ✨
+ - Evaluating a specific model revision did not work for adapter models, as there was a
+ confusion between the revision of the adapter and the revision of the base model. We
+ now use the revision for the adapter and use the latest revision for the base model.
+ - In the (very unlikely) scenario that the model's tokeniser has the same first token
+ for two different labels in a text classification task, we now also use the second
+ token to ensure that we determine the correct label. If this is not possible, then we
+ warn the user.
+ - Now catches `TypeError` when trying to generate with vLLM, and retries 3 times before
+ giving up on evaluating the dataset.
+ - A bug in `transformers` caused models with the `image-text-to-text` pipeline tag to
+ not be detected as generative models. This has been patched now, and will be fixed
+ properly when [this transformers
+ PR](https://github.com/huggingface/transformers/pull/37107) has been merged.
+ - Force `vllm` v0.8.0 for now, as the severe degradation in generation output of some
+ models has not been resolved in versions v0.8.2 and v0.8.3.
+ - Only accepts the local labels for text classification tasks when evaluating decoder
+ models now, where we before accepted both the local and English labels. The reason is
+ that this caused a confusion mat times when there was a unique local label starting
+ with a particular letter, but a different English label starting with the same letter,
+ causing some models to be evaluated on the wrong label.
+ - When fetching the model information from the Hugging Face API we now attempt 3 times,
+ as the API sometimes fails. If it still fails after 3 attempts, we raise the
+ `HuggingFaceHubDown` exception.
+ - Now uses `fp16` instead of `bf16` when evaluating decoder models on GPUs with CUDA
+ compatibility < 8.0. This was contributed by
+ [@marksverdhei](https://github.com/marksverdhei) ✨
+ - Fixed docs for ScandiQA-da and ScandiQA-sv, where it was incorrectly stated that
+ the splits were made by considering the original train/validation/test splits.
+
+
  ## [v15.4.1] - 2025-03-25
  ### Fixed
  - Disallow `vllm` v0.8.1, as it causes severe degradation in generation output of
@@ -73,18 +158,17 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
  ## [v15.3.0] - 2025-03-12
  ### Added
  - Added support for evaluating Italian 🇮🇹! This includes the reading comprehension
- dataset [SQuAD-it](https://hf.co/datasets/crux82/squad_it), the summarization
- dataset [IlPost](https://hf.co/datasets/ARTeLab/ilpost), the sentiment
- classification
- [Sentipolc-16](https://hf.co/datasets/cardiffnlp/tweet_sentiment_multilingual),
- the common-sense reasoning dataset
- [HellaSwag-it](https://hf.co/datasets/alexandrainst/m_hellaswag), the linguistic acceptability
- dataset ScaLA with the [Italian Universal Dependencies
+ dataset [SQuAD-it](https://hf.co/datasets/crux82/squad_it), the summarization dataset
+ [IlPost](https://hf.co/datasets/ARTeLab/ilpost), the sentiment classification
+ [Sentipolc-16](https://hf.co/datasets/cardiffnlp/tweet_sentiment_multilingual), the
+ common-sense reasoning dataset
+ [HellaSwag-it](https://hf.co/datasets/alexandrainst/m_hellaswag), the linguistic
+ acceptability dataset ScaLA with the [Italian Universal Dependencies
  treebank](https://github.com/UniversalDependencies/UD_Italian-ISDT), the knowledge
  dataset [MMLU-it](https://hf.co/datasets/alexandrainst/m_mmlu), and the named entity
- recognition dataset [MultiNERD
- IT](https://hf.co/datasets/Babelscape/multinerd) (and unofficially
- [WikiNEuRal IT](https://hf.co/datasets/Babelscape/wikineural)). This was contributed by [@viggo-gascou](https://github.com/viggo-gascou) ✨
+ recognition dataset [MultiNERD IT](https://hf.co/datasets/Babelscape/multinerd) (and
+ unofficially [WikiNEuRal IT](https://hf.co/datasets/Babelscape/wikineural)). This was
+ contributed by [@viggo-gascou](https://github.com/viggo-gascou) ✨
  - Added the new Norwegian knowledge dataset NRK-Quiz-QA, consisting of quizzes on the
  Norwegian language and culture, in both Bokmål and Nynorsk. The dataset has been split
  into 635 / 256 / 2,048 samples for train, val, and test, respectively. This replaces
@@ -211,7 +295,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.

  ### Added
  - Added support for French! 🇫🇷This includes the sentiment classification dataset
- [Allocine](https://hf.co/datasets/tblard/allocine), the linguistic acceptability
+ [AlloCiné](https://hf.co/datasets/tblard/allocine), the linguistic acceptability
  dataset ScaLA with the [French Universal
  Dependencies](https://github.com/UniversalDependencies/UD_French-GSD), the reading
  comprehension dataset [FQuAD](https://hf.co/datasets/illuin/fquad) (and unofficially
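
The `<model-id>@<parameter>` syntax introduced in the v15.5.0 changelog entry above is driven entirely through the model ID. A minimal sketch of how it might be invoked, assuming the standard `euroeval` CLI; the specific model and dataset names are illustrative only:

```bash
# Reasoning effort for an OpenAI o-series model ("low" or "high"; "medium" is the default)
euroeval --model o3-mini@high --dataset dbrd

# Thinking mode for Claude-Sonnet-3.7 or newer Anthropic models
euroeval --model claude-3-7-sonnet-20250219@thinking --dataset dbrd
```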

{euroeval-15.4.1 → euroeval-15.5.0}/PKG-INFO
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: EuroEval
- Version: 15.4.1
+ Version: 15.5.0
  Summary: The robust European language model benchmark.
  Project-URL: Repository, https://github.com/EuroEval/EuroEval
  Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -37,11 +37,12 @@ Requires-Dist: demjson3>=3.0.6
  Requires-Dist: evaluate>=0.4.1
  Requires-Dist: huggingface-hub>=0.24.0
  Requires-Dist: levenshtein>=0.24.0
- Requires-Dist: litellm>=1.61.13
+ Requires-Dist: litellm>=1.63.0
  Requires-Dist: more-itertools>=10.5.0
  Requires-Dist: numpy<2.0.0,>=1.23.0
  Requires-Dist: ollama>=0.4.7
  Requires-Dist: pandas>=2.2.0
+ Requires-Dist: peft>=0.15.0
  Requires-Dist: protobuf~=3.20.0
  Requires-Dist: pydantic>=2.6.0
  Requires-Dist: pyinfer>=0.0.3
@@ -61,12 +62,12 @@ Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == '
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
  Requires-Dist: gradio>=4.26.0; extra == 'all'
  Requires-Dist: outlines>=0.1.11; extra == 'all'
- Requires-Dist: vllm!=0.8.1,>=0.8.0; (platform_system == 'Linux') and extra == 'all'
+ Requires-Dist: vllm>=0.8.0; (platform_system == 'Linux') and extra == 'all'
  Provides-Extra: generative
  Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
  Requires-Dist: outlines>=0.1.11; extra == 'generative'
- Requires-Dist: vllm!=0.8.1,>=0.8.0; (platform_system == 'Linux') and extra == 'generative'
+ Requires-Dist: vllm>=0.8.0; (platform_system == 'Linux') and extra == 'generative'
  Provides-Extra: human-evaluation
  Requires-Dist: gradio>=4.26.0; extra == 'human-evaluation'
  Provides-Extra: test
@@ -217,6 +218,7 @@ Replace <name-of-script> with the specific script you wish to execute, e.g.,
  $ uv run src/scripts/create_allocine.py
  ```

+
  ## Special Thanks :pray:
  - Thanks [@Mikeriess](https://github.com/Mikeriess) for evaluating many of the larger
  models on the leaderboards.
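
The relaxed vLLM requirement above only takes effect through the optional extras, since `vllm` is not listed among the core dependencies. A hedged example of installing them, using the extras names declared in the metadata:

```bash
# "generative" pulls in vllm>=0.8.0 on Linux now that the !=0.8.1 exclusion is gone;
# "all" additionally includes the other optional dependencies (e.g. gradio).
pip install "euroeval[generative]"
pip install "euroeval[all]"
```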

{euroeval-15.4.1 → euroeval-15.5.0}/README.md
@@ -142,6 +142,7 @@ Replace <name-of-script> with the specific script you wish to execute, e.g.,
  $ uv run src/scripts/create_allocine.py
  ```

+
  ## Special Thanks :pray:
  - Thanks [@Mikeriess](https://github.com/Mikeriess) for evaluating many of the larger
  models on the leaderboards.

{euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/danish.md
@@ -285,11 +285,10 @@ the translated contexts still contained the answer to the question, potentially
  changing the answers slightly.

  The original full dataset consists of 6,810 / 500 / 500 samples for training,
- validation and testing, respectively. We use a 1,024 / 256 / 2,048 split for training,
- validation and testing, respectively (so 3,328 samples used in total). All validation
- samples in our version also belong to the original validation set, and all original test
- samples are included in our test set. The remaining 1,548 test samples in our version
- was sampled from the original training set.
+ validation and testing, respectively (so 3,328 samples used in total).
+ We use a 1,024 / 256 / 2,048 split for training, validation and testing, respectively,
+ where the splits are made by randomly sampling from the full dataset without considering
+ the original train/validation/test splits.

  Here are a few examples from the training split:

@@ -451,12 +450,14 @@ Here are a few examples from the training split:
  {
  "text": "Hvilket af følgende områder har kommunerne ansvaret for driften af?\nSvarmuligheder:\na. Domstole\nb. Vuggestuer\nc. Sygehuse",
  "label": "b"
- }```
+ }
+ ```
  ```json
  {
  "text": "Hvilken organisation blev Danmark medlem af i 1945?\nSvarmuligheder:\na. Verdenshandelsorganisationen (WTO)\nb. Den Europæiske Union (EU)\nc. De Forenede Nationer (FN)",
  "label": "c"
- }```
+ }
+ ```

  When evaluating generative models, we use the following setup (see the
  [methodology](/methodology) for more information on how these are used):

{euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/dutch.md
@@ -133,7 +133,7 @@ $ euroeval --model <model-id> --dataset dbrd

  ## Named Entity Recognition

- ### CoNLL-2002-nl
+ ### CoNLL-nl

  This dataset was published in [this paper](https://aclanthology.org/W02-2024/) and
  consists of named entity recognition annotations of the Belgian newspaper "De Morgen" of

{euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/english.md
@@ -81,7 +81,7 @@ $ euroeval --model <model-id> --dataset sst5

  ## Named Entity Recognition

- ### CoNLL-2003-En
+ ### CoNLL-en

  This dataset was published in [this paper](https://aclanthology.org/W03-0419/) and was
  part of the CoNNL-2003 shared task. The data comes from the [Reuters

{euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/faroese.md
@@ -282,10 +282,10 @@ $ euroeval --model <model-id> --dataset scala-fo

  ### FoQA

- This dataset will be published in an upcoming paper and is based on the Faroese
- Wikipedia. The questions and answers were automatically generated using GPT-4-turbo,
- which were verified by a native speaker, and some of them were also corrected by the
- same native speaker.
+ This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2502.07642)
+ and is based on the Faroese Wikipedia. The questions and answers were automatically
+ generated using GPT-4-turbo, which were verified by a native speaker, and some of them
+ were also corrected by the same native speaker.

  The original full dataset consists of 2,000 samples, and we split these into 848 / 128 /
  1,024 samples for training, validation and testing, respectively.

{euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/french.md
@@ -7,11 +7,11 @@ information about what these constitute.

  ## Sentiment Classification

- ### Allocine
+ ### AlloCiné

  This dataset was published in [this Github
  repository](https://github.com/TheophileBlard/french-sentiment-analysis-with-bert) and
- features reviews from the French movie review website Allocine. The reviews range from
+ features reviews from the French movie review website [AlloCiné](https://www.allocine.fr/). The reviews range from
  0.5 to 5 (inclusive), with steps of 0.5. The negative samples are reviews with a rating
  of at most 2, and the positive ones are reviews with a rating of at least 4. The reviews
  in between were discarded.

{euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/icelandic.md
@@ -9,9 +9,9 @@ information about what these constitute.

  ### Hotter and Colder Sentiment

- This dataset is being published in an upcoming paper, and consists of texts from
- Icelandic blog post, annotated with sentiment labels (and many others) via a
- crowdsourcing platform.
+ This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2502.16987),
+ and consists of texts from Icelandic blog post, annotated with sentiment labels (and
+ many others) via a crowdsourcing platform.

  The original full dataset consists of 2,901 samples, and we use a 1,021 / 255 / 1,607
  split for training, validation and testing, respectively (so all samples are used in
@@ -73,13 +73,14 @@ $ euroeval --model <model-id> --dataset hotter-and-colder-sentiment

  ### MIM-GOLD-NER

- This dataset was published in [this paper]() and is based on the [Tagged Icelandic
- Corpus (MIM)](https://clarin.is/en/resources/mim/), which consists of Icelandic books,
- news articles, periodicals, parliament speeches, legal texts, adjudications and
- government websites. It has been annotated with named entities in a semi-automated
- fashion, where each labels has been manually verified. The entity types in the dataset
- is a superset of the CoNLL-2003 tags, with the following additional labels: `DATE`,
- `TIME`, `MONEY`, `PERCENT`. These labels have been removed.
+ This dataset was published in [this
+ paper](https://repository.clarin.is/repository/xmlui/handle/20.500.12537/230) and is
+ based on the [Tagged Icelandic Corpus (MIM)](https://clarin.is/en/resources/mim/), which
+ consists of Icelandic books, news articles, periodicals, parliament speeches, legal
+ texts, adjudications and government websites. It has been annotated with named entities
+ in a semi-automated fashion, where each labels has been manually verified. The entity
+ types in the dataset is a superset of the CoNLL-2003 tags, with the following additional
+ labels: `DATE`, `TIME`, `MONEY`, `PERCENT`. These labels have been removed.

  The original full dataset consists of 1,000,000 tokens. We use a 1,024 / 256 / 2,048
  split for training, validation and testing, respectively.
@@ -526,17 +527,20 @@ Here are a few examples from the training split:
  {
  "text": "Hver var talinn heilagur maður eftir dauða sinn, er tákngervingur alþýðuhreyfingar vestanlands og talinn góður til áheita?\nSvarmöguleikar:\na. Þórður Jónsson helgi\nb. Guðmundur Arason\nc. Snorri Þorgrímsson\nd. Jón Hreggviðsson",
  "label": "a"
- }```
+ }
+ ```
  ```json
  {
  "text": "Í kringum hvaða ár hófst verslun á Arngerðareyri?\nSvarmöguleikar:\na. 1895\nb. 1884\nc. 1870\nd. 1902",
  "label": "b"
- }```
+ }
+ ```
  ```json
  {
  "text": "Hvenær var ákveðið að uppstigningardagur skyldi vera kirkjudagur aldraðra á Íslandi?\nSvarmöguleikar:\na. Árið 1975\nb. Árið 1985\nc. Árið 1982\nd. Árið 1990",
  "label": "c"
- }```
+ }
+ ```

  When evaluating generative models, we use the following setup (see the
  [methodology](/methodology) for more information on how these are used):

{euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/italian.md
@@ -71,11 +71,10 @@ $ euroeval --model <model-id> --dataset sentipolc16
  ### MultiNERD IT

  This dataset was published in [this
- paper](https://aclanthology.org/2022.findings-naacl.60/) and
- consists of sentences from Wikipedia and Wikinews in 10 different languages. It is an
- extension of the combination of
- (WikiNEuRal)[https://www.github.com/Babelscape/wikineural] and
- (NER4EL)[https://www.github.com/Babelscape/ner4el]. The original test set was created
+ paper](https://aclanthology.org/2022.findings-naacl.60/) and consists of sentences from
+ Wikipedia and Wikinews in 10 different languages. It is an extension of the combination
+ of [WikiNEuRal](https://www.github.com/Babelscape/wikineural) and
+ [NER4EL](https://www.github.com/Babelscape/ner4el). The original test set was created
  from manual annotations, while the training set is based on an automatic annotation
  pipeline.

@@ -519,7 +518,7 @@ $ euroeval --model <model-id> --dataset hellaswag-it

  ## Summarization

- ### IlPost-sum
+ ### IlPost-Sum

  This dataset was published in [this paper](https://www.mdpi.com/2078-2489/13/5/228) and
  consists of news articles from [Il Post](https://www.ilpost.it/). The summaries were

{euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/norwegian.md
@@ -388,17 +388,20 @@ Here are a few examples from the training split:
  {
  "text": "Vi har hatt krig i nesten ti år. Jeg føler meg noen ganger trist fordi jeg har mistet flere venner og min far på grunn av krigen.",
  "label": "correct"
- }```
+ }
+ ```
  ```json
  {
  "text": "Hvis jeg ikke sier in n genting, kan han spille hele dagen.",
  "label": "incorrect"
- }```
+ }
+ ```
  ```json
  {
  "text": "De føler at samfunnet trenger ikke dem.",
  "label": "incorrect"
- }```
+ }
+ ```

  When evaluating generative models, we use the following setup (see the
  [methodology](/methodology) for more information on how these are used):
@@ -660,17 +663,20 @@ Here are a few examples from the training split:
  {
  "text": "Gunnar har hatt plutselige og sterke smerteanfall siden han var liten gutt. Det var vondt å tisse og det gjorde vondt i ryggen og magen. Det hjalp litt å drikke vann. Reseptbelagte medisiner kan være nødvendig under anfall.\nSvaralternativer:\na. Nyrestein, kronisk\nb. Irritabel tarmsyndrom\nc. Angst\nd. Urinveisinfeksjon",
  "label": "a"
- }```
+ }
+ ```
  ```json
  {
  "text": "80 år gamle Harrison Ford er nok ein gong aktuell i rolla som Indiana Jones. Kva heiter filmen?\nSvaralternativer:\na. Indiana Jones and the Nasty Nazis\nb. Indiana Jones and the Dial of Destiny\nc. Indiana Jones and the Hunt for Power\nd. Indiana Jones Forever",
  "label": "b"
- }```
+ }
+ ```
  ```json
  {
  "text": "I 1980 måtte denne bassisten overnatte ni netter i fengsel i Japan fordi han prøvde å få med seg ca. 200 gram marihuana inn i landet. Hvem var det?\nSvaralternativer:\na. Sting\nb. Lemmy Kilmister\nc. Paul McCartney\nd. Bootsy Collins",
  "label": "c"
- }```
+ }
+ ```

  When evaluating generative models, we use the following setup (see the
  [methodology](/methodology) for more information on how these are used):
@@ -868,17 +874,20 @@ Here are a few examples from the training split:
  {
  "text": "Hvor er det sannsynlig at en fugl lager hjemmet sitt?\nSvaralternativer:\na. I skogen\nb. I et rede\nc. På taket\nd. På blader\ne. I himmelen",
  "label": "a"
- }```
+ }
+ ```
  ```json
  {
  "text": "Hvis et hjem har et abonnoment, hva får de sannsyneligvis hver dag i posten?\nSvaralternativer:\na. Delestykker\nb. En avis\nc. En gate\nd. En vaskemaskin\ne. Jordas overflate",
  "label": "b"
- }```
+ }
+ ```
  ```json
  {
  "text": "Når du ikke klarer å gjøre noe ferdig, hva feilet du i da?\nSvaralternativer:\na. Å vinne\nb. Å bestå\nc. Å fullfør\nd. Å gjøre det bra\ne. Å lykkes",
  "label": "c"
- }```
+ }
+ ```

  When evaluating generative models, we use the following setup (see the
  [methodology](/methodology) for more information on how these are used):

{euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/spanish.md
@@ -475,7 +475,7 @@ $ euroeval --model <model-id> --dataset hellaswag-es

  ## Summarization

- ### MLSum-es-mini
+ ### MLSum-es

  The dataset was published in [this paper](https://aclanthology.org/2020.emnlp-main.647/) and is obtained from online newspapers.


{euroeval-15.4.1 → euroeval-15.5.0}/docs/datasets/swedish.md
@@ -231,11 +231,10 @@ the translated contexts still contained the answer to the question, potentially
  changing the answers slightly.

  The original full dataset consists of 6,810 / 500 / 500 samples for training,
- validation and testing, respectively. We use a 1,024 / 256 / 2,048 split for training,
- validation and testing, respectively (so 3,328 samples used in total). All validation
- samples in our version also belong to the original validation set, and all original test
- samples are included in our test set. The remaining 1,548 test samples in our version
- was sampled from the original training set.
+ validation and testing, respectively (so 3,328 samples used in total).
+ We use a 1,024 / 256 / 2,048 split for training, validation and testing, respectively,
+ where the splits are made by randomly sampling from the full dataset without considering
+ the original train/validation/test splits.

  Here are a few examples from the training split: