EuroEval 15.10.0.tar.gz → 15.11.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of EuroEval might be problematic (see the release details for more information).

Files changed (258)
  1. {euroeval-15.10.0 → euroeval-15.11.0}/.pre-commit-config.yaml +1 -1
  2. {euroeval-15.10.0 → euroeval-15.11.0}/CHANGELOG.md +34 -0
  3. {euroeval-15.10.0 → euroeval-15.11.0}/CITATION.cff +3 -3
  4. {euroeval-15.10.0 → euroeval-15.11.0}/LICENSE +1 -1
  5. {euroeval-15.10.0 → euroeval-15.11.0}/PKG-INFO +10 -10
  6. {euroeval-15.10.0 → euroeval-15.11.0}/README.md +5 -6
  7. {euroeval-15.10.0 → euroeval-15.11.0}/docs/README.md +2 -2
  8. {euroeval-15.10.0 → euroeval-15.11.0}/docs/datasets/dutch.md +5 -2
  9. {euroeval-15.10.0 → euroeval-15.11.0}/docs/datasets/english.md +79 -3
  10. {euroeval-15.10.0 → euroeval-15.11.0}/docs/datasets/finnish.md +49 -16
  11. {euroeval-15.10.0 → euroeval-15.11.0}/docs/datasets/french.md +13 -9
  12. {euroeval-15.10.0 → euroeval-15.11.0}/docs/datasets/german.md +8 -5
  13. {euroeval-15.10.0 → euroeval-15.11.0}/docs/datasets/icelandic.md +10 -6
  14. {euroeval-15.10.0 → euroeval-15.11.0}/docs/datasets/italian.md +5 -2
  15. {euroeval-15.10.0 → euroeval-15.11.0}/docs/datasets/norwegian.md +90 -11
  16. {euroeval-15.10.0 → euroeval-15.11.0}/docs/datasets/spanish.md +42 -19
  17. {euroeval-15.10.0 → euroeval-15.11.0}/docs/datasets/swedish.md +8 -5
  18. euroeval-15.11.0/docs/leaderboards/Monolingual/danish.md +23 -0
  19. euroeval-15.11.0/docs/leaderboards/Monolingual/dutch.md +23 -0
  20. euroeval-15.11.0/docs/leaderboards/Monolingual/english.md +23 -0
  21. {euroeval-15.10.0 → euroeval-15.11.0}/docs/leaderboards/Monolingual/faroese.md +4 -0
  22. euroeval-15.11.0/docs/leaderboards/Monolingual/finnish.md +23 -0
  23. euroeval-15.11.0/docs/leaderboards/Monolingual/french.md +23 -0
  24. euroeval-15.11.0/docs/leaderboards/Monolingual/german.md +23 -0
  25. euroeval-15.11.0/docs/leaderboards/Monolingual/icelandic.md +23 -0
  26. euroeval-15.11.0/docs/leaderboards/Monolingual/italian.md +23 -0
  27. euroeval-15.11.0/docs/leaderboards/Monolingual/norwegian.md +23 -0
  28. euroeval-15.11.0/docs/leaderboards/Monolingual/spanish.md +23 -0
  29. euroeval-15.11.0/docs/leaderboards/Monolingual/swedish.md +23 -0
  30. euroeval-15.11.0/docs/leaderboards/Multilingual/european.md +23 -0
  31. {euroeval-15.10.0 → euroeval-15.11.0}/docs/leaderboards/Multilingual/germanic.md +8 -0
  32. euroeval-15.11.0/docs/leaderboards/Multilingual/mainland-scandinavian.md +23 -0
  33. euroeval-15.11.0/docs/leaderboards/Multilingual/romance.md +23 -0
  34. {euroeval-15.10.0 → euroeval-15.11.0}/docs/leaderboards/README.md +8 -0
  35. {euroeval-15.10.0 → euroeval-15.11.0}/pyproject.toml +4 -3
  36. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/__init__.py +7 -0
  37. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/benchmark_modules/base.py +29 -29
  38. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/benchmark_modules/fresh.py +31 -19
  39. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/benchmark_modules/hf.py +27 -23
  40. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/benchmark_modules/litellm.py +50 -30
  41. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/benchmark_modules/vllm.py +21 -25
  42. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/benchmarker.py +1 -1
  43. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/callbacks.py +17 -13
  44. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/data_loading.py +10 -5
  45. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/data_models.py +2 -40
  46. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/dataset_configs/english.py +13 -4
  47. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/dataset_configs/norwegian.py +8 -0
  48. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/finetuning.py +10 -9
  49. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/generation.py +5 -4
  50. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/generation_utils.py +1 -0
  51. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/human_evaluation.py +13 -13
  52. euroeval-15.11.0/src/euroeval/metrics.py +452 -0
  53. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/scores.py +14 -19
  54. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/speed_benchmark.py +6 -7
  55. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/task_group_utils/multiple_choice_classification.py +6 -4
  56. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/task_group_utils/question_answering.py +5 -28
  57. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/task_group_utils/sequence_classification.py +6 -30
  58. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/task_group_utils/text_to_text.py +19 -34
  59. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/task_group_utils/token_classification.py +18 -30
  60. euroeval-15.11.0/src/euroeval/tasks.py +131 -0
  61. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/types.py +6 -4
  62. euroeval-15.11.0/src/scripts/create_idioms_no.py +254 -0
  63. euroeval-15.11.0/src/scripts/create_life_in_the_uk.py +145 -0
  64. {euroeval-15.10.0 → euroeval-15.11.0}/tests/conftest.py +4 -3
  65. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_data_models.py +17 -16
  66. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_scores.py +15 -21
  67. {euroeval-15.10.0 → euroeval-15.11.0}/uv.lock +5 -1
  68. euroeval-15.10.0/docs/leaderboards/Monolingual/danish.md +0 -15
  69. euroeval-15.10.0/docs/leaderboards/Monolingual/dutch.md +0 -15
  70. euroeval-15.10.0/docs/leaderboards/Monolingual/english.md +0 -15
  71. euroeval-15.10.0/docs/leaderboards/Monolingual/french.md +0 -15
  72. euroeval-15.10.0/docs/leaderboards/Monolingual/german.md +0 -15
  73. euroeval-15.10.0/docs/leaderboards/Monolingual/icelandic.md +0 -15
  74. euroeval-15.10.0/docs/leaderboards/Monolingual/italian.md +0 -15
  75. euroeval-15.10.0/docs/leaderboards/Monolingual/norwegian.md +0 -15
  76. euroeval-15.10.0/docs/leaderboards/Monolingual/spanish.md +0 -15
  77. euroeval-15.10.0/docs/leaderboards/Monolingual/swedish.md +0 -15
  78. euroeval-15.10.0/docs/leaderboards/Multilingual/european.md +0 -15
  79. euroeval-15.10.0/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -15
  80. euroeval-15.10.0/docs/leaderboards/Multilingual/romance.md +0 -15
  81. euroeval-15.10.0/src/euroeval/tasks.py +0 -256
  82. {euroeval-15.10.0 → euroeval-15.11.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +0 -0
  83. {euroeval-15.10.0 → euroeval-15.11.0}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
  84. {euroeval-15.10.0 → euroeval-15.11.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
  85. {euroeval-15.10.0 → euroeval-15.11.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +0 -0
  86. {euroeval-15.10.0 → euroeval-15.11.0}/.github/workflows/ci.yaml +0 -0
  87. {euroeval-15.10.0 → euroeval-15.11.0}/.gitignore +0 -0
  88. {euroeval-15.10.0 → euroeval-15.11.0}/CODE_OF_CONDUCT.md +0 -0
  89. {euroeval-15.10.0 → euroeval-15.11.0}/CONTRIBUTING.md +0 -0
  90. {euroeval-15.10.0 → euroeval-15.11.0}/Dockerfile.cuda +0 -0
  91. {euroeval-15.10.0 → euroeval-15.11.0}/NEW_DATASET_GUIDE.md +0 -0
  92. {euroeval-15.10.0 → euroeval-15.11.0}/docs/CNAME +0 -0
  93. {euroeval-15.10.0 → euroeval-15.11.0}/docs/datasets/README.md +0 -0
  94. {euroeval-15.10.0 → euroeval-15.11.0}/docs/datasets/danish.md +0 -0
  95. {euroeval-15.10.0 → euroeval-15.11.0}/docs/datasets/faroese.md +0 -0
  96. {euroeval-15.10.0 → euroeval-15.11.0}/docs/extras/radial_plotter.md +0 -0
  97. {euroeval-15.10.0 → euroeval-15.11.0}/docs/faq.md +0 -0
  98. {euroeval-15.10.0 → euroeval-15.11.0}/docs/gfx/favicon.png +0 -0
  99. {euroeval-15.10.0 → euroeval-15.11.0}/docs/methodology.md +0 -0
  100. {euroeval-15.10.0 → euroeval-15.11.0}/docs/python-package.md +0 -0
  101. {euroeval-15.10.0 → euroeval-15.11.0}/docs/tasks/README.md +0 -0
  102. {euroeval-15.10.0 → euroeval-15.11.0}/docs/tasks/common-sense-reasoning.md +0 -0
  103. {euroeval-15.10.0 → euroeval-15.11.0}/docs/tasks/knowledge.md +0 -0
  104. {euroeval-15.10.0 → euroeval-15.11.0}/docs/tasks/linguistic-acceptability.md +0 -0
  105. {euroeval-15.10.0 → euroeval-15.11.0}/docs/tasks/named-entity-recognition.md +0 -0
  106. {euroeval-15.10.0 → euroeval-15.11.0}/docs/tasks/reading-comprehension.md +0 -0
  107. {euroeval-15.10.0 → euroeval-15.11.0}/docs/tasks/sentiment-classification.md +0 -0
  108. {euroeval-15.10.0 → euroeval-15.11.0}/docs/tasks/speed.md +0 -0
  109. {euroeval-15.10.0 → euroeval-15.11.0}/docs/tasks/summarization.md +0 -0
  110. {euroeval-15.10.0 → euroeval-15.11.0}/gfx/euroeval.png +0 -0
  111. {euroeval-15.10.0 → euroeval-15.11.0}/gfx/euroeval.xcf +0 -0
  112. {euroeval-15.10.0 → euroeval-15.11.0}/gfx/scandeval.png +0 -0
  113. {euroeval-15.10.0 → euroeval-15.11.0}/makefile +0 -0
  114. {euroeval-15.10.0 → euroeval-15.11.0}/mkdocs.yaml +0 -0
  115. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/benchmark_config_factory.py +0 -0
  116. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/benchmark_modules/__init__.py +0 -0
  117. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/cli.py +0 -0
  118. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/constants.py +0 -0
  119. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/dataset_configs/__init__.py +0 -0
  120. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/dataset_configs/danish.py +0 -0
  121. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/dataset_configs/dutch.py +0 -0
  122. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/dataset_configs/faroese.py +0 -0
  123. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/dataset_configs/finnish.py +0 -0
  124. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/dataset_configs/french.py +0 -0
  125. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/dataset_configs/german.py +0 -0
  126. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/dataset_configs/icelandic.py +0 -0
  127. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/dataset_configs/italian.py +0 -0
  128. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/dataset_configs/spanish.py +0 -0
  129. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/dataset_configs/swedish.py +0 -0
  130. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/enums.py +0 -0
  131. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/exceptions.py +0 -0
  132. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/languages.py +0 -0
  133. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/model_cache.py +0 -0
  134. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/model_config.py +0 -0
  135. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/model_loading.py +0 -0
  136. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/prompt_templates/__init__.py +0 -0
  137. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/prompt_templates/linguistic_acceptability.py +0 -0
  138. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/prompt_templates/multiple_choice.py +0 -0
  139. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/prompt_templates/named_entity_recognition.py +0 -0
  140. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/prompt_templates/reading_comprehension.py +0 -0
  141. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/prompt_templates/sentiment_classification.py +0 -0
  142. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/prompt_templates/summarization.py +0 -0
  143. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/task_group_utils/__init__.py +0 -0
  144. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/tokenization_utils.py +0 -0
  145. {euroeval-15.10.0 → euroeval-15.11.0}/src/euroeval/utils.py +0 -0
  146. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/constants.py +0 -0
  147. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_allocine.py +0 -0
  148. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_angry_tweets.py +0 -0
  149. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_arc.py +0 -0
  150. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_arc_is.py +0 -0
  151. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_belebele.py +0 -0
  152. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_cnn_dailymail.py +0 -0
  153. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_conll_en.py +0 -0
  154. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_conll_es.py +0 -0
  155. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_conll_nl.py +0 -0
  156. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_dane.py +0 -0
  157. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_danish_citizen_tests.py +0 -0
  158. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_dansk.py +0 -0
  159. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_danske_talemaader.py +0 -0
  160. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_danske_talemaader_old.py +0 -0
  161. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_dbrd.py +0 -0
  162. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_dutch_cola.py +0 -0
  163. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_eltec.py +0 -0
  164. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_fone.py +0 -0
  165. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_foqa.py +0 -0
  166. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_fosent.py +0 -0
  167. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_fquad.py +0 -0
  168. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_germanquad.py +0 -0
  169. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_germeval.py +0 -0
  170. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_hellaswag.py +0 -0
  171. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_hellaswag_fi.py +0 -0
  172. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
  173. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_ice_linguistic.py +0 -0
  174. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_icelandic_error_corpus.py +0 -0
  175. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_icelandic_knowledge.py +0 -0
  176. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_icelandic_qa.py +0 -0
  177. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_icesum.py +0 -0
  178. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_ilpost_sum.py +0 -0
  179. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_jentoft.py +0 -0
  180. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_mim_gold_ner.py +0 -0
  181. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_mlqa_es.py +0 -0
  182. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_mlsum_de.py +0 -0
  183. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_mlsum_es.py +0 -0
  184. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_mmlu.py +0 -0
  185. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_multinerd-it.py +0 -0
  186. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_no_cola.py +0 -0
  187. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_no_sammendrag.py +0 -0
  188. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_nor_common_sense_qa.py +0 -0
  189. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_nordjylland_news.py +0 -0
  190. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_norec.py +0 -0
  191. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_norglm_multiqa.py +0 -0
  192. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_norglm_multisum.py +0 -0
  193. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_norne.py +0 -0
  194. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_norquad.py +0 -0
  195. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_nqii.py +0 -0
  196. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_nrk_quiz_qa.py +0 -0
  197. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_orange_sum.py +0 -0
  198. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_personal_sum.py +0 -0
  199. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_rrn.py +0 -0
  200. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_sb10k.py +0 -0
  201. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_scala.py +0 -0
  202. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_scandiqa.py +0 -0
  203. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_scandisent_fi.py +0 -0
  204. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_schibsted.py +0 -0
  205. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_sentiment_headlines_es.py +0 -0
  206. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_sentipolc16.py +0 -0
  207. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_squad.py +0 -0
  208. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_squad_it.py +0 -0
  209. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_squad_nl.py +0 -0
  210. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_squad_nl_old.py +0 -0
  211. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_sst5.py +0 -0
  212. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_suc3.py +0 -0
  213. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_swedn.py +0 -0
  214. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_swerec.py +0 -0
  215. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_turku_ner_fi.py +0 -0
  216. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_tydiqa_fi.py +0 -0
  217. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_wiki_lingua_nl.py +0 -0
  218. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_wikiann_fo.py +0 -0
  219. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_wikineural-it.py +0 -0
  220. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_winogrande_is.py +0 -0
  221. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_xlsum_fi.py +0 -0
  222. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/create_xquad_es.py +0 -0
  223. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/fix_dot_env_file.py +0 -0
  224. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/load_ud_pos.py +0 -0
  225. {euroeval-15.10.0 → euroeval-15.11.0}/src/scripts/versioning.py +0 -0
  226. {euroeval-15.10.0 → euroeval-15.11.0}/tests/__init__.py +0 -0
  227. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_benchmark_config_factory.py +0 -0
  228. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_benchmark_modules/__init__.py +0 -0
  229. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_benchmark_modules/test_base.py +0 -0
  230. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_benchmark_modules/test_fresh.py +0 -0
  231. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_benchmark_modules/test_hf.py +0 -0
  232. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_benchmark_modules/test_litellm.py +0 -0
  233. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_benchmark_modules/test_vllm.py +0 -0
  234. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_benchmarker.py +0 -0
  235. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_callbacks.py +0 -0
  236. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_cli.py +0 -0
  237. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_constants.py +0 -0
  238. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_data_loading.py +0 -0
  239. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_dataset_configs.py +0 -0
  240. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_enums.py +0 -0
  241. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_exceptions.py +0 -0
  242. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_finetuning.py +0 -0
  243. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_generation.py +0 -0
  244. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_human_evaluation.py +0 -0
  245. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_languages.py +0 -0
  246. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_model_cache.py +0 -0
  247. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_model_config.py +0 -0
  248. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_model_loading.py +0 -0
  249. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_speed_benchmark.py +0 -0
  250. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_task_utils/__init__.py +0 -0
  251. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_task_utils/test_question_answering.py +0 -0
  252. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_task_utils/test_sequence_classification.py +0 -0
  253. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_task_utils/test_text_to_text.py +0 -0
  254. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_task_utils/test_token_classification.py +0 -0
  255. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_tasks.py +0 -0
  256. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_tokenization_utils.py +0 -0
  257. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_types.py +0 -0
  258. {euroeval-15.10.0 → euroeval-15.11.0}/tests/test_utils.py +0 -0
@@ -10,7 +10,7 @@ repos:
  - id: trailing-whitespace
  - id: debug-statements
  - repo: https://github.com/astral-sh/ruff-pre-commit
- rev: v0.11.13
+ rev: v0.12.3
  hooks:
  - id: ruff
  args:
@@ -10,6 +10,40 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.



+ ## [v15.11.0] - 2025-07-15
+ ### Added
+ - Added the English knowledge dataset Life in the UK, which has been added as an
+ official dataset, replacing the existing English knowledge dataset MMLU, which in turn
+ has been marked as unofficial now. This was contributed by
+ [@oliverkinch](https://github.com/oliverkinch) ✨
+ - Added the Norwegian knowledge dataset Idioms-no, which is a multiple-choice question
+ dataset where the alternative answers have been generated using GPT-4o. This has been
+ added as an official dataset, and was contributed by
+ [@oliverkinch](https://github.com/oliverkinch) ✨
+ - Added new `LLMAsAJudgeMetric`, which allows evaluating the performance of a model with
+ another judge model. This is useful for evaluating models in a reference-free manner,
+ or if the metric is sufficiently complex. It is currently not used in any task, but
+ the functionality is there for future use.
+ - Add `no-thinking` and `thinking` options for Gemini-2.5-flash and
+ Gemini-2.5-flash-lite, which allows disabling and enabling the reasoning mode for
+ these models, respectively. Note that the former model has reasoning enabled by
+ default and the latter has it disabled by default (see the defaults in the [Gemini-2.5
+ docs](https://ai.google.dev/gemini-api/docs/thinking#set-budget)).
+
+ ### Fixed
+ - Evaluating freshly initialised encoder models on multiple-choice classification tasks
+ caused an error, as the id-to-label mapping was not set up correctly. This has been
+ fixed now.
+ - Now dynamically lowers the maximum amount of reasoning tokens for LiteLLM models if
+ they do not support the full 32,768 tokens.
+
+
+ ## [v15.10.1] - 2025-06-20
+ ### Fixed
+ - Fixed an issue when benchmarking encoder models on reading comprehension tasks, where
+ we sometimes would truncate the model outputs when they should not have been.
+
+
  ## [v15.10.0] - 2025-06-17
  ### Changed
  - Updated `vllm` to `>=0.9.1`.
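The changelog only names the new `LLMAsAJudgeMetric`; as a rough, hypothetical sketch of the idea (not EuroEval's actual implementation in `src/euroeval/metrics.py`), an LLM-as-a-judge metric asks a separate judge model to rate each prediction and averages the ratings. The judge model, prompt wording and 1-5 scale below are assumptions.

```python
# Hypothetical illustration of an LLM-as-a-judge metric; not EuroEval's API.
# The judge model, prompt wording and 1-5 scale are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def llm_judge_score(predictions: list[str], references: list[str]) -> float:
    """Return the mean judge rating, normalised to the [0, 1] range."""
    ratings = []
    for prediction, reference in zip(predictions, references):
        response = client.chat.completions.create(
            model="gpt-4o",  # assumed judge model
            messages=[
                {
                    "role": "user",
                    "content": (
                        "Rate from 1 to 5 how well the prediction matches the "
                        "reference. Reply with a single digit.\n"
                        f"Reference: {reference}\nPrediction: {prediction}"
                    ),
                }
            ],
        )
        ratings.append(int(response.choices[0].message.content.strip()[0]))
    return (sum(ratings) / len(ratings) - 1) / 4
```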
@@ -4,8 +4,8 @@ message: If you use this software, please cite it using the metadata from this f
  type: software
  authors:
  - given-names: Dan Saattrup
- family-names: Nielsen
- email: dan.nielsen@alexandra.dk
+ family-names: Smart
+ email: dan.smart@alexandra.dk
  affiliation: Alexandra Institute
  orcid: 'https://orcid.org/0000-0001-9227-1470'
  identifiers:
@@ -22,7 +22,7 @@ license: MIT
  preferred-citation:
  type: conference-paper
  authors:
- - family-names: "Nielsen"
+ - family-names: "Smart"
  given-names: "Dan Saattrup"
  orcid: https://orcid.org/0000-0001-9227-1470
  collection-title: "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)"
@@ -1,6 +1,6 @@
  MIT License

- Copyright (c) 2022-2024 Dan Saattrup Nielsen
+ Copyright (c) 2022-2025 Dan Saattrup Smart

  Permission is hereby granted, free of charge, to any person obtaining a copy
  of this software and associated documentation files (the "Software"), to deal
@@ -1,14 +1,14 @@
  Metadata-Version: 2.4
  Name: EuroEval
- Version: 15.10.0
+ Version: 15.11.0
  Summary: The robust European language model benchmark.
  Project-URL: Repository, https://github.com/EuroEval/EuroEval
  Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
- Author-email: Dan Saattrup Nielsen <dan.nielsen@alexandra.dk>
- Maintainer-email: Dan Saattrup Nielsen <dan.nielsen@alexandra.dk>
+ Author-email: Dan Saattrup Smart <dan.smart@alexandra.dk>
+ Maintainer-email: Dan Saattrup Smart <dan.smart@alexandra.dk>
  License: MIT License

- Copyright (c) 2022-2024 Dan Saattrup Nielsen
+ Copyright (c) 2022-2025 Dan Saattrup Smart

  Permission is hereby granted, free of charge, to any person obtaining a copy
  of this software and associated documentation files (the "Software"), to deal
@@ -43,6 +43,7 @@ Requires-Dist: numpy<2.0.0,>=1.23.0
  Requires-Dist: ollama>=0.5.1
  Requires-Dist: pandas>=2.2.0
  Requires-Dist: peft>=0.15.0
+ Requires-Dist: protobuf>=2.0.0
  Requires-Dist: pydantic>=2.6.0
  Requires-Dist: pyinfer>=0.0.3
  Requires-Dist: python-dotenv>=1.0.1
@@ -94,8 +95,7 @@ ______________________________________________________________________

  ## Maintainer

- - Dan Saattrup Nielsen ([@saattrupdan](https://github.com/saattrupdan),
- dan.nielsen@alexandra.dk)
+ - Dan Saattrup Smart ([@saattrupdan](https://github.com/saattrupdan), dan.smart@alexandra.dk)


  ## Installation
@@ -268,14 +268,14 @@ contributing new datasets, your help makes this project better for everyone.
  If you want to cite the framework then feel free to use this:

  ```
- @article{nielsen2024encoder,
+ @article{smart2024encoder,
  title={Encoder vs Decoder: Comparative Analysis of Encoder and Decoder Language Models on Multilingual NLU Tasks},
- author={Nielsen, Dan Saattrup and Enevoldsen, Kenneth and Schneider-Kamp, Peter},
+ author={Smart, Dan Saattrup and Enevoldsen, Kenneth and Schneider-Kamp, Peter},
  journal={arXiv preprint arXiv:2406.13469},
  year={2024}
  }
- @inproceedings{nielsen2023scandeval,
- author = {Nielsen, Dan Saattrup},
+ @inproceedings{smart2023scandeval,
+ author = {Smart, Dan Saattrup},
  booktitle = {Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)},
  month = may,
  pages = {185--201},
@@ -19,8 +19,7 @@ ______________________________________________________________________

  ## Maintainer

- - Dan Saattrup Nielsen ([@saattrupdan](https://github.com/saattrupdan),
- dan.nielsen@alexandra.dk)
+ - Dan Saattrup Smart ([@saattrupdan](https://github.com/saattrupdan), dan.smart@alexandra.dk)


  ## Installation
@@ -193,14 +192,14 @@ contributing new datasets, your help makes this project better for everyone.
  If you want to cite the framework then feel free to use this:

  ```
- @article{nielsen2024encoder,
+ @article{smart2024encoder,
  title={Encoder vs Decoder: Comparative Analysis of Encoder and Decoder Language Models on Multilingual NLU Tasks},
- author={Nielsen, Dan Saattrup and Enevoldsen, Kenneth and Schneider-Kamp, Peter},
+ author={Smart, Dan Saattrup and Enevoldsen, Kenneth and Schneider-Kamp, Peter},
  journal={arXiv preprint arXiv:2406.13469},
  year={2024}
  }
- @inproceedings{nielsen2023scandeval,
- author = {Nielsen, Dan Saattrup},
+ @inproceedings{smart2023scandeval,
+ author = {Smart, Dan Saattrup},
  booktitle = {Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)},
  month = may,
  pages = {185--201},
@@ -29,8 +29,8 @@ or [LM Studio](https://lmstudio.ai/).
  The idea of EuroEval grew out of the development of Danish language model RøBÆRTa in
  2021, when we realised that there was no standard way to evaluate Danish language
  models. It started as a hobby project including Danish, Swedish and Norwegian, but has
- since grown to include 8+ European languages.
+ since grown to include 12+ European languages.

- EuroEval is maintained by [Dan Saattrup Nielsen](https://www.saattrupdan.com/) from the
+ EuroEval is maintained by [Dan Saattrup Smart](https://www.saattrupdan.com/) from the
  [Alexandra Institute](https://alexandra.dk), and is funded by the EU project
  [TrustLLM](https://trustllm.eu/).
@@ -325,9 +325,12 @@ $ euroeval --model <model-id> --dataset squad-nl

  ### Unofficial: BeleBele-nl

- This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features multiple-choice reading comprehension questions across 122 languages.
+ This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+ and features multiple-choice reading comprehension questions across 122 languages.

- The original dataset contains 900 unique multiple-choice reading comprehension passages and questions. From these, we use a 256 / 64 / 580 split for training, validation and testing, respectively.
+ The original dataset contains 900 unique multiple-choice reading comprehension passages
+ and questions. From these, we use a 256 / 64 / 580 split for training, validation and
+ testing, respectively.

  Here are a few examples from the training split:

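The BeleBele hunks in this release all describe the same 256 / 64 / 580 re-split of the 900 original samples. A minimal sketch of how such a split can be produced with the `datasets` library is shown below; the dataset id, config name and seed are assumptions, and the actual logic lives in `src/scripts/create_belebele.py`.

```python
# Hedged sketch of a 256 / 64 / 580 re-split of the 900 BeleBele samples.
# The dataset id, config name and seed are assumptions; EuroEval's own split
# logic lives in src/scripts/create_belebele.py.
from datasets import load_dataset

samples = load_dataset("facebook/belebele", "nld_Latn", split="test").shuffle(seed=4242)

train = samples.select(range(256))
val = samples.select(range(256, 320))
test = samples.select(range(320, 900))

print(len(train), len(val), len(test))  # 256 64 580
```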
@@ -297,9 +297,13 @@ $ euroeval --model <model-id> --dataset squad

  ### Unofficial: BeleBele-en

- This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features reading comprehension questions across 122 languages. The dataset was created by professional translators who translated 900 multiple-choice questions from English into other languages, with answers carefully validated by native speakers.
+ This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+ and features reading comprehension questions across 122 languages. The dataset was
+ created by professional translators who translated 900 multiple-choice questions from
+ English into other languages, with answers carefully validated by native speakers.

- The original dataset consists of 900 samples, and we use 256 / 64 / 580 samples for training, validation and testing, respectively.
+ The original dataset consists of 900 samples, and we use 256 / 64 / 580 samples for
+ training, validation and testing, respectively.

  Here are a few examples from the training split:

@@ -354,7 +358,79 @@ $ euroeval --model <model-id> --dataset belebele-en

  ## Knowledge

- ### MMLU
+ ### Life in the UK
+
+ This dataset was published
+ [here](https://huggingface.co/datasets/oliverkinch/life-in-the-uk-multiple-choice) was
+ scraped from [lifeintheuktestweb.co.uk](https://lifeintheuktestweb.co.uk/test-1/) and
+ contains multiple choice questions about UK history, culture, and citizenship
+ requirements. The website was created to help people pass the Life in the UK Test for UK
+ citizenship.
+
+ The original dataset consists of 1,450 samples. After processing (removing questions
+ with overly short or long texts, repetitive content, and true/false questions), we have
+ 1,206 samples remaining. From these, we use 438 / 256 / 512 samples for our training,
+ validation and test splits, respectively.
+
+ Here are a few examples from the training split:
+
+ ```json
+ {
+ "text": "What is the capital of the United Kingdom?\nChoices:\na. London\nb. Manchester\nc. Birmingham\nd. Edinburgh",
+ "label": "a"
+ }
+ ```
+ ```json
+ {
+ "text": "What TWO houses were confronted during the Wars of the Roses?\nChoices:\na. The House of Lancaster\nb. The House of Leicester\nc. The House of Canterbury\nd. The House of York",
+ "label": "a"
+ }
+ ```
+ ```json
+ {
+ "text": "What is the name of the War Memorial located in Whitehall?\nChoices:\na. Dumfries\nb. Cenotaph\nc. Royal Crescent\nd. The White Tower",
+ "label": "b"
+ }
+ ```
+
+ When evaluating generative models, we use the following setup (see the
+ [methodology](/methodology) for more information on how these are used):
+
+ - Number of few-shot examples: 5
+ - Prefix prompt:
+ ```
+ The following are multiple choice questions (with answers).
+ ```
+ - Base prompt template:
+ ```
+ Question: {text}
+ Options:
+ a. {option_a}
+ b. {option_b}
+ c. {option_c}
+ d. {option_d}
+ Answer: {label}
+ ```
+ - Instruction-tuned prompt template:
+ ```
+ Question: {text}
+ Options:
+ a. {option_a}
+ b. {option_b}
+ c. {option_c}
+ d. {option_d}
+
+ Answer the above question by replying with 'a', 'b', 'c' or 'd', and nothing else.
+ ```
+
+ You can evaluate this dataset directly as follows:
+
+ ```bash
+ $ euroeval --model <model-id> --dataset life-in-the-uk
+ ```
+
+
+ ### Unofficial: MMLU

  This dataset was published [in this paper](https://doi.org/10.48550/arXiv.2009.03300)
  and features questions within 57 different topics, such as elementary mathematics, US
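The new section shows the CLI invocation for the dataset; the same evaluation can also be run from Python through the package's `Benchmarker` class (a minimal sketch based on docs/python-package.md; the exact argument names may differ).

```python
# Minimal sketch of the Python equivalent of the CLI call above; argument
# names are assumed from docs/python-package.md and may differ slightly.
from euroeval import Benchmarker

benchmarker = Benchmarker()
benchmarker.benchmark(model="<model-id>", dataset="life-in-the-uk")
```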
@@ -8,9 +8,13 @@ information about what these constitute.

  ### ScandiSent-fi

- This dataset consists of reviews from Trustpilot and was published [here](https://aclanthology.org/2021.nodalida-main.42/). It is a binary sentiment classification dataset, with labels "positive" and "negative".
+ This dataset consists of reviews from Trustpilot and was published
+ [here](https://aclanthology.org/2021.nodalida-main.42/). It is a binary sentiment
+ classification dataset, with labels "positive" and "negative".

- For the Finnish part of the dataset, there are 10,000 training samples. From these samples, we have created a 1,024 / 256 / 2,048 split for the train, validation and test splits, respectively.
+ For the Finnish part of the dataset, there are 10,000 training samples. From these
+ samples, we have created a 1,024 / 256 / 2,048 split for the train, validation and test
+ splits, respectively.

  Here are a few examples from the training split:

@@ -67,9 +71,14 @@ $ euroeval --model <model-id> --dataset scandisent-fi

  ### Turku-NER-fi

- This dataset was published in [this paper](https://aclanthology.org/2020.lrec-1.567/). The dataset is a manually annotated corpus built on the Universal Dependencies Finnish corpus. The corpus was created by the Turku NLP group.
+ This dataset was published in [this paper](https://aclanthology.org/2020.lrec-1.567/).
+ The dataset is a manually annotated corpus built on the Universal Dependencies Finnish
+ corpus. The corpus was created by the Turku NLP group.

- The original dataset contains 12,217 / 1,364 / 1,555 samples for the training, validation and test splits, respectively. We use 1,024 / 256 / 2,048 samples for our training, validation and test splits, respectively. All the new splits are subsets of the original splits.
+ The original dataset contains 12,217 / 1,364 / 1,555 samples for the training,
+ validation and test splits, respectively. We use 1,024 / 256 / 2,048 samples for our
+ training, validation and test splits, respectively. All the new splits are subsets of
+ the original splits.

  Here are a few examples from the training split:

@@ -141,9 +150,9 @@ word from a sentence, or by swapping two neighbouring words in a sentence. To en
  that this does indeed break the grammaticality of the sentence, a set of rules were used
  on the part-of-speech tags of the words in the sentence.

- The original dataset consists of 15,136 samples, from which we use 1,024 / 256 / 2,048 samples for training,
- validation and testing, respectively (so 3,328 samples used in total). These splits are
- used as-is in the framework.
+ The original dataset consists of 15,136 samples, from which we use 1,024 / 256 / 2,048
+ samples for training, validation and testing, respectively (so 3,328 samples used in
+ total). These splits are used as-is in the framework.

  Here are a few examples from the training split:

@@ -199,9 +208,20 @@ $ euroeval --model <model-id> --dataset scala-fi
  ## Reading Comprehension
  ### TydiQA-fi
- This question-answering dataset was published in [this paper](https://aclanthology.org/2020.tacl-1.30/). TydiQA is a multilingual dataset covering 11 typologically diverse languages with 204K question-answer pairs collected from native speakers genuinely seeking information. It was designed to evaluate models across languages with varied linguistic features and contains questions written directly in each language without translation.
-
- The original Finnish TydiQA dataset contains 6,855 training and 782 validation samples (we use the [secondary task subset](https://huggingface.co/datasets/google-research-datasets/tydiqa/viewer/secondary_task?views%5B%5D=secondary_task_train)). We created a 1,024 / 256 / 2,024 split, where the samples from the train and validation split are sampled from the original train and validation splits, respectively. The test set consists of the remaining samples from the original validation split + additional samples from the original train split.
+ This question-answering dataset was published in [this
+ paper](https://aclanthology.org/2020.tacl-1.30/). TydiQA is a multilingual dataset
+ covering 11 typologically diverse languages with 204K question-answer pairs collected
+ from native speakers genuinely seeking information. It was designed to evaluate models
+ across languages with varied linguistic features and contains questions written directly
+ in each language without translation.
+
+ The original Finnish TydiQA dataset contains 6,855 training and 782 validation samples
+ (we use the [secondary task
+ subset](https://huggingface.co/datasets/google-research-datasets/tydiqa/viewer/secondary_task?views%5B%5D=secondary_task_train)).
+ We created a 1,024 / 256 / 2,024 split, where the samples from the train and validation
+ split are sampled from the original train and validation splits, respectively. The test
+ set consists of the remaining samples from the original validation split + additional
+ samples from the original train split.

  Here are a few examples from the training split:

  ```json
@@ -268,9 +288,12 @@ $ euroeval --model <model-id> --dataset tydiqa-fi

  ### Unofficial: BeleBele-fi

- This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features multiple-choice reading comprehension questions across 122 languages.
+ This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+ and features multiple-choice reading comprehension questions across 122 languages.

- The original dataset contains 900 unique multiple-choice reading comprehension passages and questions. From these, we use a 256 / 64 / 580 split for training, validation and testing, respectively.
+ The original dataset contains 900 unique multiple-choice reading comprehension passages
+ and questions. From these, we use a 256 / 64 / 580 split for training, validation and
+ testing, respectively.

  Here are a few examples from the training split:

@@ -335,8 +358,11 @@ $ euroeval --model <model-id> --dataset belebele-fi
  ### HellaSwag-fi

  This dataset is a machine translated version of the English [HellaSwag
- dataset](https://aclanthology.org/P19-1472/). The [dataset](https://huggingface.co/datasets/Finnish-NLP/hellaswag-fi-google-translate) was created by Finnish-NLP using Google Translate. The dataset is designed to
- be used in EuroEval and it therefore already has a 1,024 / 256 / 2,048 split for the train, validation and test splits, respectively.
+ dataset](https://aclanthology.org/P19-1472/). The
+ [dataset](https://huggingface.co/datasets/Finnish-NLP/hellaswag-fi-google-translate) was
+ created by Finnish-NLP using Google Translate. The dataset is designed to be used in
+ EuroEval and it therefore already has a 1,024 / 256 / 2,048 split for the train,
+ validation and test splits, respectively.

  Here are a few examples from the training split:

@@ -400,9 +426,16 @@ $ euroeval --model <model-id> --dataset hellaswag-fi

  ### XLSum-fi

- This dataset is a machine translation of the XL-Sum dataset, which was published in [this paper](https://aclanthology.org/2021.findings-acl.413/). [TurkuNLP](https://huggingface.co/datasets/TurkuNLP) has translated the dataset to Finnish using DeepL.
+ This dataset is a machine translation of the XL-Sum dataset, which was published in
+ [this paper](https://aclanthology.org/2021.findings-acl.413/).
+ [TurkuNLP](https://huggingface.co/datasets/TurkuNLP) has translated the dataset to
+ Finnish using DeepL.

- The original Finnish XL-Sum dataset contains 54,966 / 1,803 / 1,791 training, validation and test samples, respectively. We use 1,024 / 256 / 2,048 samples for our training, validation and test splits, respectively. The new training and validation splits are subsets of the original splits. The test split is the same as the original test split + additional samples from the original validation split.
+ The original Finnish XL-Sum dataset contains 54,966 / 1,803 / 1,791 training, validation
+ and test samples, respectively. We use 1,024 / 256 / 2,048 samples for our training,
+ validation and test splits, respectively. The new training and validation splits are
+ subsets of the original splits. The test split is the same as the original test split +
+ additional samples from the original validation split.

  Here are a few examples from the training split:

@@ -11,10 +11,11 @@ information about what these constitute.

  This dataset was published in [this Github
  repository](https://github.com/TheophileBlard/french-sentiment-analysis-with-bert) and
- features reviews from the French movie review website [AlloCiné](https://www.allocine.fr/). The reviews range from
- 0.5 to 5 (inclusive), with steps of 0.5. The negative samples are reviews with a rating
- of at most 2, and the positive ones are reviews with a rating of at least 4. The reviews
- in between were discarded.
+ features reviews from the French movie review website
+ [AlloCiné](https://www.allocine.fr/). The reviews range from 0.5 to 5 (inclusive), with
+ steps of 0.5. The negative samples are reviews with a rating of at most 2, and the
+ positive ones are reviews with a rating of at least 4. The reviews in between were
+ discarded.

  The original full dataset consists of 160,000 / 20,000 / 20,000 samples for training,
  validation, and testing, respectively. We use 1,024 / 256 / 2,048 samples for training,
@@ -163,9 +164,9 @@ word from a sentence, or by swapping two neighbouring words in a sentence. To en
  that this does indeed break the grammaticality of the sentence, a set of rules were used
  on the part-of-speech tags of the words in the sentence.

- The original dataset consists of 16,342 samples, from which we use 1,024 / 256 / 2,048 samples for training,
- validation and testing, respectively (so 3,328 samples used in total). These splits are
- used as-is in the framework.
+ The original dataset consists of 16,342 samples, from which we use 1,024 / 256 / 2,048
+ samples for training, validation and testing, respectively (so 3,328 samples used in
+ total). These splits are used as-is in the framework.

  Here are a few examples from the training split:

@@ -298,9 +299,12 @@ $ euroeval --model <model-id> --dataset fquad

  ### Unofficial: BeleBele-fr

- This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features multiple-choice reading comprehension questions across 122 languages.
+ This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+ and features multiple-choice reading comprehension questions across 122 languages.

- The original dataset contains 900 unique multiple-choice reading comprehension passages and questions. From these, we use a 256 / 64 / 580 split for training, validation and testing, respectively.
+ The original dataset contains 900 unique multiple-choice reading comprehension passages
+ and questions. From these, we use a 256 / 64 / 580 split for training, validation and
+ testing, respectively.

  Here are a few examples from the training split:

@@ -153,9 +153,9 @@ word from a sentence, or by swapping two neighbouring words in a sentence. To en
  that this does indeed break the grammaticality of the sentence, a set of rules were used
  on the part-of-speech tags of the words in the sentence.

- The original dataset consists of 15,590 samples, from which we use 1,024 / 256 / 2,048 samples for training,
- validation and testing, respectively (so 3,328 samples used in total). These splits are
- used as-is in the framework.
+ The original dataset consists of 15,590 samples, from which we use 1,024 / 256 / 2,048
+ samples for training, validation and testing, respectively (so 3,328 samples used in
+ total). These splits are used as-is in the framework.

  Here are a few examples from the training split:

@@ -286,9 +286,12 @@ $ euroeval --model <model-id> --dataset germanquad

  ### Unofficial: BeleBele-de

- This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features multiple-choice reading comprehension questions across 122 languages.
+ This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+ and features multiple-choice reading comprehension questions across 122 languages.

- The original dataset contains 900 unique multiple-choice reading comprehension passages and questions. From these, we use a 256 / 64 / 580 split for training, validation and testing, respectively.
+ The original dataset contains 900 unique multiple-choice reading comprehension passages
+ and questions. From these, we use a 256 / 64 / 580 split for training, validation and
+ testing, respectively.

  Here are a few examples from the training split:

@@ -155,9 +155,9 @@ from a sentence, or by swapping two neighbouring words in a sentence. To ensure
  this does indeed break the grammaticality of the sentence, a set of rules were used on
  the part-of-speech tags of the words in the sentence.

- The original dataset consists of 3,535 samples, from which we use 1,024 / 256 / 2,048 samples for training,
- validation and testing, respectively (so 3,328 samples used in total). These splits are
- used as-is in the framework.
+ The original dataset consists of 3,535 samples, from which we use 1,024 / 256 / 2,048
+ samples for training, validation and testing, respectively (so 3,328 samples used in
+ total). These splits are used as-is in the framework.

  Here are a few examples from the training split:

@@ -491,9 +491,12 @@ $ euroeval --model <model-id> --dataset icelandic-qa

  ### Unofficial: BeleBele-is

- This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features multiple-choice reading comprehension questions across 122 languages.
+ This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+ and features multiple-choice reading comprehension questions across 122 languages.

- The original dataset contains 900 unique multiple-choice reading comprehension passages and questions. From these, we use a 256 / 64 / 580 split for training, validation and testing, respectively.
+ The original dataset contains 900 unique multiple-choice reading comprehension passages
+ and questions. From these, we use a 256 / 64 / 580 split for training, validation and
+ testing, respectively.

  Here are a few examples from the training split:

@@ -579,7 +582,8 @@ completion = client.beta.chat.completions.parse(
  )
  ```

- where `CandidateAnswers` is a Pydantic model that is used to ensure [structured outputs](https://platform.openai.com/docs/guides/structured-outputs).
+ where `CandidateAnswers` is a Pydantic model that is used to ensure [structured
+ outputs](https://platform.openai.com/docs/guides/structured-outputs).

  The original dataset has 2,000 samples, but only 1,994 unique questions, and the total
  length of this dataset is therefore 1,994. The split is given by 842 / 128 / 1024 for
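The hunk above only rewraps the sentence about `CandidateAnswers`; for context, a structured-outputs call of this kind pairs a Pydantic model with the OpenAI client's `parse` method, roughly as sketched below (the field name and prompt are assumptions, not the actual schema used to build the dataset).

```python
# Hedged sketch of structured outputs with a Pydantic model; the field name
# and prompt are assumptions, not the actual schema used for this dataset.
from openai import OpenAI
from pydantic import BaseModel


class CandidateAnswers(BaseModel):
    candidate_answers: list[str]  # assumed field name


client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",  # assumed model
    messages=[{"role": "user", "content": "Suggest three plausible but incorrect answers."}],
    response_format=CandidateAnswers,
)
print(completion.choices[0].message.parsed.candidate_answers)
```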
@@ -373,9 +373,12 @@ $ euroeval --model <model-id> --dataset squad-it

  ### Unofficial: BeleBele-it

- This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features multiple-choice reading comprehension questions across 122 languages.
+ This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/)
+ and features multiple-choice reading comprehension questions across 122 languages.

- The original dataset contains 900 unique multiple-choice reading comprehension passages and questions. From these, we use a 256 / 64 / 580 split for training, validation and testing, respectively.
+ The original dataset contains 900 unique multiple-choice reading comprehension passages
+ and questions. From these, we use a 256 / 64 / 580 split for training, validation and
+ testing, respectively.

  Here are a few examples from the training split: