EuroEval 15.7.2.tar.gz → 15.8.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of EuroEval might be problematic.

Files changed (240)
  1. {euroeval-15.7.2 → euroeval-15.8.0}/CHANGELOG.md +19 -0
  2. {euroeval-15.7.2 → euroeval-15.8.0}/PKG-INFO +1 -1
  3. {euroeval-15.7.2 → euroeval-15.8.0}/docs/datasets/danish.md +66 -2
  4. {euroeval-15.7.2 → euroeval-15.8.0}/docs/datasets/dutch.md +64 -0
  5. {euroeval-15.7.2 → euroeval-15.8.0}/docs/datasets/english.md +57 -0
  6. {euroeval-15.7.2 → euroeval-15.8.0}/docs/datasets/finnish.md +65 -1
  7. {euroeval-15.7.2 → euroeval-15.8.0}/docs/datasets/french.md +66 -2
  8. {euroeval-15.7.2 → euroeval-15.8.0}/docs/datasets/german.md +59 -0
  9. {euroeval-15.7.2 → euroeval-15.8.0}/docs/datasets/icelandic.md +64 -0
  10. {euroeval-15.7.2 → euroeval-15.8.0}/docs/datasets/italian.md +66 -2
  11. {euroeval-15.7.2 → euroeval-15.8.0}/docs/datasets/norwegian.md +64 -0
  12. {euroeval-15.7.2 → euroeval-15.8.0}/docs/datasets/spanish.md +66 -1
  13. {euroeval-15.7.2 → euroeval-15.8.0}/docs/datasets/swedish.md +64 -0
  14. {euroeval-15.7.2 → euroeval-15.8.0}/pyproject.toml +1 -1
  15. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/benchmark_modules/litellm.py +326 -145
  16. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/benchmarker.py +11 -1
  17. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/dataset_configs/english.py +1 -1
  18. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/dataset_configs/finnish.py +19 -11
  19. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/dataset_configs/italian.py +11 -1
  20. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/dataset_configs/spanish.py +11 -1
  21. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/finetuning.py +29 -31
  22. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/tokenization_utils.py +2 -2
  23. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/utils.py +41 -0
  24. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_belebele.py +12 -0
  25. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_eltec.py +1 -1
  26. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_hellaswag_fi.py +49 -58
  27. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_hotter_and_colder_sentiment.py +1 -0
  28. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_icelandic_knowledge.py +3 -2
  29. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_jentoft.py +3 -2
  30. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_no_cola.py +3 -2
  31. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_personal_sum.py +3 -2
  32. {euroeval-15.7.2 → euroeval-15.8.0}/uv.lock +1 -1
  33. {euroeval-15.7.2 → euroeval-15.8.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +0 -0
  34. {euroeval-15.7.2 → euroeval-15.8.0}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
  35. {euroeval-15.7.2 → euroeval-15.8.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
  36. {euroeval-15.7.2 → euroeval-15.8.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +0 -0
  37. {euroeval-15.7.2 → euroeval-15.8.0}/.github/workflows/ci.yaml +0 -0
  38. {euroeval-15.7.2 → euroeval-15.8.0}/.gitignore +0 -0
  39. {euroeval-15.7.2 → euroeval-15.8.0}/.pre-commit-config.yaml +0 -0
  40. {euroeval-15.7.2 → euroeval-15.8.0}/CITATION.cff +0 -0
  41. {euroeval-15.7.2 → euroeval-15.8.0}/CODE_OF_CONDUCT.md +0 -0
  42. {euroeval-15.7.2 → euroeval-15.8.0}/CONTRIBUTING.md +0 -0
  43. {euroeval-15.7.2 → euroeval-15.8.0}/Dockerfile.cuda +0 -0
  44. {euroeval-15.7.2 → euroeval-15.8.0}/LICENSE +0 -0
  45. {euroeval-15.7.2 → euroeval-15.8.0}/NEW_DATASET_GUIDE.md +0 -0
  46. {euroeval-15.7.2 → euroeval-15.8.0}/README.md +0 -0
  47. {euroeval-15.7.2 → euroeval-15.8.0}/docs/CNAME +0 -0
  48. {euroeval-15.7.2 → euroeval-15.8.0}/docs/README.md +0 -0
  49. {euroeval-15.7.2 → euroeval-15.8.0}/docs/datasets/README.md +0 -0
  50. {euroeval-15.7.2 → euroeval-15.8.0}/docs/datasets/faroese.md +0 -0
  51. {euroeval-15.7.2 → euroeval-15.8.0}/docs/extras/radial_plotter.md +0 -0
  52. {euroeval-15.7.2 → euroeval-15.8.0}/docs/faq.md +0 -0
  53. {euroeval-15.7.2 → euroeval-15.8.0}/docs/gfx/favicon.png +0 -0
  54. {euroeval-15.7.2 → euroeval-15.8.0}/docs/leaderboards/Monolingual/danish.md +0 -0
  55. {euroeval-15.7.2 → euroeval-15.8.0}/docs/leaderboards/Monolingual/dutch.md +0 -0
  56. {euroeval-15.7.2 → euroeval-15.8.0}/docs/leaderboards/Monolingual/english.md +0 -0
  57. {euroeval-15.7.2 → euroeval-15.8.0}/docs/leaderboards/Monolingual/faroese.md +0 -0
  58. {euroeval-15.7.2 → euroeval-15.8.0}/docs/leaderboards/Monolingual/french.md +0 -0
  59. {euroeval-15.7.2 → euroeval-15.8.0}/docs/leaderboards/Monolingual/german.md +0 -0
  60. {euroeval-15.7.2 → euroeval-15.8.0}/docs/leaderboards/Monolingual/icelandic.md +0 -0
  61. {euroeval-15.7.2 → euroeval-15.8.0}/docs/leaderboards/Monolingual/italian.md +0 -0
  62. {euroeval-15.7.2 → euroeval-15.8.0}/docs/leaderboards/Monolingual/norwegian.md +0 -0
  63. {euroeval-15.7.2 → euroeval-15.8.0}/docs/leaderboards/Monolingual/spanish.md +0 -0
  64. {euroeval-15.7.2 → euroeval-15.8.0}/docs/leaderboards/Monolingual/swedish.md +0 -0
  65. {euroeval-15.7.2 → euroeval-15.8.0}/docs/leaderboards/Multilingual/european.md +0 -0
  66. {euroeval-15.7.2 → euroeval-15.8.0}/docs/leaderboards/Multilingual/germanic.md +0 -0
  67. {euroeval-15.7.2 → euroeval-15.8.0}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
  68. {euroeval-15.7.2 → euroeval-15.8.0}/docs/leaderboards/Multilingual/romance.md +0 -0
  69. {euroeval-15.7.2 → euroeval-15.8.0}/docs/leaderboards/README.md +0 -0
  70. {euroeval-15.7.2 → euroeval-15.8.0}/docs/methodology.md +0 -0
  71. {euroeval-15.7.2 → euroeval-15.8.0}/docs/python-package.md +0 -0
  72. {euroeval-15.7.2 → euroeval-15.8.0}/docs/tasks/README.md +0 -0
  73. {euroeval-15.7.2 → euroeval-15.8.0}/docs/tasks/common-sense-reasoning.md +0 -0
  74. {euroeval-15.7.2 → euroeval-15.8.0}/docs/tasks/knowledge.md +0 -0
  75. {euroeval-15.7.2 → euroeval-15.8.0}/docs/tasks/linguistic-acceptability.md +0 -0
  76. {euroeval-15.7.2 → euroeval-15.8.0}/docs/tasks/named-entity-recognition.md +0 -0
  77. {euroeval-15.7.2 → euroeval-15.8.0}/docs/tasks/reading-comprehension.md +0 -0
  78. {euroeval-15.7.2 → euroeval-15.8.0}/docs/tasks/sentiment-classification.md +0 -0
  79. {euroeval-15.7.2 → euroeval-15.8.0}/docs/tasks/speed.md +0 -0
  80. {euroeval-15.7.2 → euroeval-15.8.0}/docs/tasks/summarization.md +0 -0
  81. {euroeval-15.7.2 → euroeval-15.8.0}/gfx/euroeval.png +0 -0
  82. {euroeval-15.7.2 → euroeval-15.8.0}/gfx/euroeval.xcf +0 -0
  83. {euroeval-15.7.2 → euroeval-15.8.0}/gfx/scandeval.png +0 -0
  84. {euroeval-15.7.2 → euroeval-15.8.0}/makefile +0 -0
  85. {euroeval-15.7.2 → euroeval-15.8.0}/mkdocs.yaml +0 -0
  86. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/__init__.py +0 -0
  87. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/benchmark_config_factory.py +0 -0
  88. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/benchmark_modules/__init__.py +0 -0
  89. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/benchmark_modules/base.py +0 -0
  90. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/benchmark_modules/fresh.py +0 -0
  91. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/benchmark_modules/hf.py +0 -0
  92. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/benchmark_modules/vllm.py +0 -0
  93. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/callbacks.py +0 -0
  94. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/cli.py +0 -0
  95. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/constants.py +0 -0
  96. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/data_loading.py +0 -0
  97. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/data_models.py +0 -0
  98. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/dataset_configs/__init__.py +0 -0
  99. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/dataset_configs/danish.py +0 -0
  100. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/dataset_configs/dutch.py +0 -0
  101. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/dataset_configs/faroese.py +0 -0
  102. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/dataset_configs/french.py +0 -0
  103. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/dataset_configs/german.py +0 -0
  104. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/dataset_configs/icelandic.py +0 -0
  105. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/dataset_configs/norwegian.py +0 -0
  106. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/dataset_configs/swedish.py +0 -0
  107. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/enums.py +0 -0
  108. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/exceptions.py +0 -0
  109. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/generation.py +0 -0
  110. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/generation_utils.py +0 -0
  111. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/human_evaluation.py +0 -0
  112. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/languages.py +0 -0
  113. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/model_cache.py +0 -0
  114. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/model_config.py +0 -0
  115. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/model_loading.py +0 -0
  116. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/prompt_templates/__init__.py +0 -0
  117. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/prompt_templates/linguistic_acceptability.py +0 -0
  118. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/prompt_templates/multiple_choice.py +0 -0
  119. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/prompt_templates/named_entity_recognition.py +0 -0
  120. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/prompt_templates/reading_comprehension.py +0 -0
  121. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/prompt_templates/sentiment_classification.py +0 -0
  122. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/prompt_templates/summarization.py +0 -0
  123. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/scores.py +0 -0
  124. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/speed_benchmark.py +0 -0
  125. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/task_group_utils/__init__.py +0 -0
  126. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/task_group_utils/multiple_choice_classification.py +0 -0
  127. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/task_group_utils/question_answering.py +0 -0
  128. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/task_group_utils/sequence_classification.py +0 -0
  129. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/task_group_utils/text_to_text.py +0 -0
  130. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/task_group_utils/token_classification.py +0 -0
  131. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/tasks.py +0 -0
  132. {euroeval-15.7.2 → euroeval-15.8.0}/src/euroeval/types.py +0 -0
  133. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/constants.py +0 -0
  134. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_allocine.py +0 -0
  135. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_angry_tweets.py +0 -0
  136. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_arc.py +0 -0
  137. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_arc_is.py +0 -0
  138. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_cnn_dailymail.py +0 -0
  139. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_conll_en.py +0 -0
  140. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_conll_es.py +0 -0
  141. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_conll_nl.py +0 -0
  142. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_dane.py +0 -0
  143. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_danish_citizen_tests.py +0 -0
  144. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_dansk.py +0 -0
  145. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_danske_talemaader.py +0 -0
  146. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_danske_talemaader_old.py +0 -0
  147. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_dbrd.py +0 -0
  148. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_dutch_cola.py +0 -0
  149. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_fone.py +0 -0
  150. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_foqa.py +0 -0
  151. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_fosent.py +0 -0
  152. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_fquad.py +0 -0
  153. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_germanquad.py +0 -0
  154. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_germeval.py +0 -0
  155. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_hellaswag.py +0 -0
  156. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_ice_linguistic.py +0 -0
  157. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_icelandic_error_corpus.py +0 -0
  158. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_icelandic_qa.py +0 -0
  159. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_icesum.py +0 -0
  160. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_ilpost_sum.py +0 -0
  161. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_mim_gold_ner.py +0 -0
  162. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_mlqa_es.py +0 -0
  163. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_mlsum_de.py +0 -0
  164. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_mlsum_es.py +0 -0
  165. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_mmlu.py +0 -0
  166. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_multinerd-it.py +0 -0
  167. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_no_sammendrag.py +0 -0
  168. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_nor_common_sense_qa.py +0 -0
  169. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_nordjylland_news.py +0 -0
  170. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_norec.py +0 -0
  171. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_norglm_multiqa.py +0 -0
  172. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_norglm_multisum.py +0 -0
  173. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_norne.py +0 -0
  174. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_norquad.py +0 -0
  175. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_nqii.py +0 -0
  176. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_nrk_quiz_qa.py +0 -0
  177. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_orange_sum.py +0 -0
  178. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_rrn.py +0 -0
  179. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_sb10k.py +0 -0
  180. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_scala.py +0 -0
  181. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_scandiqa.py +0 -0
  182. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_scandisent_fi.py +0 -0
  183. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_schibsted.py +0 -0
  184. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_sentiment_headlines_es.py +0 -0
  185. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_sentipolc16.py +0 -0
  186. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_squad.py +0 -0
  187. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_squad_it.py +0 -0
  188. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_squad_nl.py +0 -0
  189. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_squad_nl_old.py +0 -0
  190. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_sst5.py +0 -0
  191. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_suc3.py +0 -0
  192. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_swedn.py +0 -0
  193. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_swerec.py +0 -0
  194. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_turku_ner_fi.py +0 -0
  195. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_tydiqa_fi.py +0 -0
  196. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_wiki_lingua_nl.py +0 -0
  197. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_wikiann_fo.py +0 -0
  198. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_wikineural-it.py +0 -0
  199. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_winogrande_is.py +0 -0
  200. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_xlsum_fi.py +0 -0
  201. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/create_xquad_es.py +0 -0
  202. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/fix_dot_env_file.py +0 -0
  203. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/load_ud_pos.py +0 -0
  204. {euroeval-15.7.2 → euroeval-15.8.0}/src/scripts/versioning.py +0 -0
  205. {euroeval-15.7.2 → euroeval-15.8.0}/tests/__init__.py +0 -0
  206. {euroeval-15.7.2 → euroeval-15.8.0}/tests/conftest.py +0 -0
  207. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_benchmark_config_factory.py +0 -0
  208. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_benchmark_modules/__init__.py +0 -0
  209. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_benchmark_modules/test_base.py +0 -0
  210. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_benchmark_modules/test_fresh.py +0 -0
  211. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_benchmark_modules/test_hf.py +0 -0
  212. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_benchmark_modules/test_litellm.py +0 -0
  213. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_benchmark_modules/test_vllm.py +0 -0
  214. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_benchmarker.py +0 -0
  215. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_callbacks.py +0 -0
  216. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_cli.py +0 -0
  217. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_constants.py +0 -0
  218. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_data_loading.py +0 -0
  219. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_data_models.py +0 -0
  220. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_dataset_configs.py +0 -0
  221. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_enums.py +0 -0
  222. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_exceptions.py +0 -0
  223. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_finetuning.py +0 -0
  224. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_generation.py +0 -0
  225. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_human_evaluation.py +0 -0
  226. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_languages.py +0 -0
  227. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_model_cache.py +0 -0
  228. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_model_config.py +0 -0
  229. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_model_loading.py +0 -0
  230. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_scores.py +0 -0
  231. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_speed_benchmark.py +0 -0
  232. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_task_utils/__init__.py +0 -0
  233. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_task_utils/test_question_answering.py +0 -0
  234. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_task_utils/test_sequence_classification.py +0 -0
  235. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_task_utils/test_text_to_text.py +0 -0
  236. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_task_utils/test_token_classification.py +0 -0
  237. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_tasks.py +0 -0
  238. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_tokenization_utils.py +0 -0
  239. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_types.py +0 -0
  240. {euroeval-15.7.2 → euroeval-15.8.0}/tests/test_utils.py +0 -0
@@ -10,6 +10,25 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.



+ ## [v15.8.0] - 2025-05-07
+ ### Added
+ - Added the BeleBele datasets for Finnish, Italian and Spanish. They are listed as
+ unofficial for now. This was contributed by
+ [@oliverkinch](https://github.com/oliverkinch) ✨
+
+ ### Changed
+ - Now uses asynchronous requests when dealing with API models, speeding up the generation
+ immensely. This was contributed by [@mathiasesn](https://github.com/mathiasesn) ✨
+
+ ### Fixed
+ - Add HellaSwag-fi back in, as the issue with the labels in the test split has been
+ fixed.
+ - Now uses `eval_accumulation_steps` (set to 32) when evaluating encoder models, to
+ avoid running out of memory during evaluation.
+ - Now also looks for `<|startoftext|>` as BOS token if the BOS token is not set in the
+ model's config.
+
+
  ## [v15.7.2] - 2025-05-02
  ### Fixed
  - Now does not check if a model exists if it has already been evaluated. This is an
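The asynchronous-requests change above lands in `src/euroeval/benchmark_modules/litellm.py` (+326/-145 in the file list). As a rough illustration only, not the package's actual implementation: issuing the API calls concurrently with `asyncio.gather` is the standard way to get this kind of speedup. The model name and prompts below are placeholders.

```python
import asyncio

import litellm


async def generate_all(prompts: list[str], model: str = "gpt-4o-mini") -> list[str]:
    """Fire off all API requests concurrently instead of one at a time."""
    tasks = [
        litellm.acompletion(
            model=model,  # placeholder model id
            messages=[{"role": "user", "content": prompt}],
        )
        for prompt in prompts
    ]
    responses = await asyncio.gather(*tasks)
    return [response.choices[0].message.content for response in responses]


# Example: asyncio.run(generate_all(["First prompt", "Second prompt"]))
```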
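Similarly, the `eval_accumulation_steps` fix maps onto a standard Hugging Face `TrainingArguments` option used when evaluating finetuned encoder models (the change itself sits in `src/euroeval/finetuning.py` per the file list). A minimal sketch with illustrative values, not EuroEval's actual configuration:

```python
from transformers import TrainingArguments

# With eval_accumulation_steps set, accumulated prediction tensors are moved from the
# GPU to the CPU every 32 evaluation steps instead of only once at the end, which keeps
# peak GPU memory during evaluation bounded.
training_args = TrainingArguments(
    output_dir="finetuned-model",    # placeholder path
    per_device_eval_batch_size=8,    # placeholder batch size
    eval_accumulation_steps=32,
)
```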
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: EuroEval
- Version: 15.7.2
+ Version: 15.8.0
  Summary: The robust European language model benchmark.
  Project-URL: Repository, https://github.com/EuroEval/EuroEval
  Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -353,6 +353,70 @@ $ euroeval --model <model-id> --dataset scandiqa-da
  ```


+ ### Unofficial: BeleBele-da
+
+ This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features multiple-choice reading comprehension questions across 122 languages.
+
+ The original dataset contains 900 unique multiple-choice reading comprehension passages and questions. From these, we use a 256 / 64 / 580 split for training, validation and testing, respectively.
+
+ Here are a few examples from the training split:
+
+ ```json
+ {
+ "text": "Tekst: Prognoserne siger, at stormen, der er omkring 645 mil (1040 km) vest for Kap Verde-øerne, sandsynligvis vil forsvinde, før den truer nogen landområder. Fred har i øjeblikket vinde på 165 km/t og bevæger sig mod nordvest. Fred er den heftigste tropiske cyklon, der nogensinde er blevet registreret så sydligt og østligt i Atlanterhavet, siden man begyndte at bruge satellitbilleder, og kun den tredje store orkan, der er registreret øst for 35°V.\nSpørgsmål: Da Fred befandt sig nær Kap Verde-øerne, hvilken retning bevægede den sig så mod?\nSvarmuligheder:\na. Vest\nb. Syd\nc. Øst\nd. Nordvest",
+ "label": "d"
+ }
+ ```
+ ```json
+ {
+ "text": "Tekst: "Siden Pakistan i 1947 blev uafhængigt af det britiske styre, har den pakistanske præsident udpeget ""politiske agenter"", som styrer FATA, og som har næsten fuldstændig kontrol over områderne. Disse agenter er ansvarlige for at levere regerings- og retstjenester i henhold til artikel 247 i den pakistanske forfatning."\nSpørgsmål: Hvem leverer retslige tjenester til FATA?\nSvarmuligheder:\na. Den pakistanske regering\nb. Politiske agenter\nc. Pakistans præsident\nd. Den britiske regering",
+ "label": "b"
+ }
+ ```
+ ```json
+ {
+ "text": "Tekst: Alle er en del af samfundet og benytter transportsystemerne. Næsten alle klager over transportsystemerne. I udviklede lande hører du sjældent ligeså mange klager over vandkvalitet eller broer, der styrter sammen. Hvorfor giver transportsystemerne anledning til sådanne klager, hvorfor svigter de på daglig basis? Er transportingeniører blot inkompetente? Eller foregår der noget mere fundamentalt?\nSpørgsmål: Hvilken offentlig service siges at skabe størst utilfredshed i udviklede lande?\nSvarmuligheder:\na. Vandkvalitet\nb. Brobyggelse\nc. Offentlig transport\nd. Uddannelse",
+ "label": "c"
+ }
+ ```
+
+ When evaluating generative models, we use the following setup (see the
+ [methodology](/methodology) for more information on how these are used):
+
+ - Number of few-shot examples: 5
+ - Prefix prompt:
+ ```
+ Følgende er multiple choice spørgsmål (med svar).
+ ```
+ - Base prompt template:
+ ```
+ Spørgsmål: {text}
+ Svarmuligheder:
+ a. {option_a}
+ b. {option_b}
+ c. {option_c}
+ d. {option_d}
+ Svar: {label}
+ ```
+ - Instruction-tuned prompt template:
+ ```
+ Spørgsmål: {text}
+ Svarmuligheder:
+ a. {option_a}
+ b. {option_b}
+ c. {option_c}
+ d. {option_d}
+
+ Besvar ovenstående spørgsmål ved at svare med 'a', 'b', 'c' eller 'd', og intet andet.
+ ```
+
+ You can evaluate this dataset directly as follows:
+
+ ```bash
+ $ euroeval --model <model-id> --dataset belebele-da
+ ```
+
+
  ## Knowledge

  ### Danske Talemåder
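Each of the new BeleBele sections ends with a CLI invocation like the `belebele-da` one above. A hedged Python-API equivalent is sketched below, assuming the `Benchmarker` entry point from the package (see `docs/python-package.md` in the file list); the method and keyword names are illustrative and may differ from the actual signature.

```python
from euroeval import Benchmarker

# Sketch only: method and argument names are assumptions, not the verified API.
benchmarker = Benchmarker()
results = benchmarker.benchmark(model="<model-id>", dataset="belebele-da")
```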
@@ -608,7 +672,7 @@ When evaluating generative models, we use the following setup (see the
  a. {option_a}
  b. {option_b}
  c. {option_c}
- d. {option_c}
+ d. {option_d}
  Svar: {label}
  ```
  - Instruction-tuned prompt template:
@@ -673,7 +737,7 @@ When evaluating generative models, we use the following setup (see the
  a. {option_a}
  b. {option_b}
  c. {option_c}
- d. {option_c}
+ d. {option_d}
  Svar: {label}
  ```
  - Instruction-tuned prompt template:
@@ -323,6 +323,70 @@ $ euroeval --model <model-id> --dataset squad-nl
  ```


+ ### Unofficial: BeleBele-nl
+
+ This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features multiple-choice reading comprehension questions across 122 languages.
+
+ The original dataset contains 900 unique multiple-choice reading comprehension passages and questions. From these, we use a 256 / 64 / 580 split for training, validation and testing, respectively.
+
+ Here are a few examples from the training split:
+
+ ```json
+ {
+ "text": "Tekst: Mystiek is het geloven in, identificeren met of bewustzijn van een ultieme werkelijkheid, goddelijkheid, spirituele waarheid of God. De kerkganger streeft naar een directe gewaarwording, intuïtie of inzicht in de goddelijke werkelijkheid. Volgers streven een bepaalde manier van leven na of willen ervaringen opdoen die ze datzelfde gevoel geven. In tegenstelling tot andere religieuze overtuigingen en aanbidding, legt mystiek nadruk op de rechtstreekse persoonlijke beleving van een unieke staat van bewustzijn, vooral van een vredige, inzichtelijke, gelukzalige of extatische aard.\nVraag: Wat is geen juiste omschrijving van mystiek?\nAntwoordopties:\na. De nadruk ligt op het ervaren van een vredige, gelukzalige staat van bewustzijn\nb. Volgers van mystiek streven bewustwording na van een spirituele werkelijkheid\nc. Volgers van mystiek passen gebruiken toe die hun inzicht in een goddelijke werkelijkheid vergroten\nd. De nadruk op het streven naar een directe persoonlijke beleving is vergelijkbaar met veel andere vormen van religieuze overtuiging en aanbidding",
+ "label": "d"
+ }
+ ```
+ ```json
+ {
+ "text": "Tekst: Het favoriete maaltje van ocelotten zijn kleine dieren. Ze vangen apen, slangen, knaagdieren en vogels als dat lukt. De ocelot jaagt bijna uitsluitend op dieren die veel kleiner zijn dan hij zelf is. Geleerden vermoeden dat ocelotten hun reukvermogen gebruiken om op kleine dieren (hun prooi) te jagen, door aan de grond te ruiken waar deze zijn geweest. Ze kunnen door nachtvisie heel goed in het donker zien en bewegen zich heel onopvallend voort. Ocelotten jagen op prooi door zich één te maken met de omgeving en vervolgens op hun prooi te springen.\nVraag: Welke uitspraak over een ocelot is onjuist?\nAntwoordopties:\na. Ze kunnen goed in het donker jagen\nb. Ze bewegen zich in stilte voort\nc. Hun reukvermogen is zwak\nd. Ze jagen het liefst op kleine dieren",
+ "label": "c"
+ }
+ ```
+ ```json
+ {
+ "text": "Tekst: Er was 120-160 kubieke meter brandstof aan boord van de Luno toen het schip motorproblemen kreeg en door de harde wind en golven tegen de golfbreker werd geduwd. De twaalf crewleden zijn met helikopters in veiligheid gebracht, met als enige verwonding een gebroken neus. Het 100 meter lange schip was onderweg om de gebruikelijke lading kunstmest op te halen. In eerste instantie vreesden autoriteiten dat het vaartuig met de lading zou kunnen gaan lekken.\nVraag: Waar vreesden de autoriteiten volgens de tekst in eerste instantie voor wat betreft de Luno?\nAntwoordopties:\na. Gebrek aan een lading kunstmest\nb. Golven en harde wind\nc. Lekken van brandstof\nd. Verwondingen van bemanningsleden",
+ "label": "c"
+ }
+ ```
+
+ When evaluating generative models, we use the following setup (see the
+ [methodology](/methodology) for more information on how these are used):
+
+ - Number of few-shot examples: 5
+ - Prefix prompt:
+ ```
+ Hieronder staan meerkeuzevragen (met antwoorden).
+ ```
+ - Base prompt template:
+ ```
+ Vraag: {text}
+ Antwoordopties:
+ a. {option_a}
+ b. {option_b}
+ c. {option_c}
+ d. {option_d}
+ Antwoord: {label}
+ ```
+ - Instruction-tuned prompt template:
+ ```
+ Vraag: {text}
+ Antwoordopties:
+ a. {option_a}
+ b. {option_b}
+ c. {option_c}
+ d. {option_d}
+
+ Beantwoord de bovenstaande vraag met 'a', 'b', 'c' of 'd', en niets anders.
+
+
+ You can evaluate this dataset directly as follows:
+
+ ```bash
+ $ euroeval --model <model-id> --dataset belebele-nl
+ ```
+
+
  ## Knowledge

  ### MMLU-nl
@@ -295,6 +295,63 @@ $ euroeval --model <model-id> --dataset squad
  ```


+ ### Unofficial: BeleBele-en
+
+ This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features reading comprehension questions across 122 languages. The dataset was created by professional translators who translated 900 multiple-choice questions from English into other languages, with answers carefully validated by native speakers.
+
+ The original dataset consists of 900 samples, and we use 256 / 64 / 580 samples for training, validation and testing, respectively.
+
+ Here are a few examples from the training split:
+
+ ```json
+ {
+ "text": 'Text: """We will endeavour to cut carbon dioxide emissions per unit of GDP by a notable margin by 2020 from the 2005 level,"" Hu said. He did not set a figure for the cuts, saying they will be made based on China\'s economic output. Hu encouraged developing countries ""to avoid the old path of polluting first and cleaning up later."" He added that ""they should not, however, be asked to take on obligations that go beyond their development stage, responsibility and capabilities."""\nQuestion: What did Hu suggest that developing countries do?\nChoices:\na. Take on obligations that push their development stage\nb. Focus on economic output\nc. Go beyond their current responsibilities\nd. Avoiding old paths of pollution',
+ "label": "d"
+ }
+ ```
+ ```json
+ {
+ "text": 'Text: "All of the cave entrances, which were named ""The Seven Sisters"", are at least 100 to 250 meters (328 to 820 feet) in diameter. Infrared images show that the temperature variations from night and day show that they are likely caves. ""They are cooler than the surrounding surface in the day and warmer at night. Their thermal behavior is not as steady as large caves on Earth that often maintain a fairly constant temperature, but it is consistent with these being deep holes in the ground,"" said Glen Cushing of the United States Geological Survey (USGS) Astrogeology Team and of Northern Arizona University located in Flagstaff, Arizona."\nQuestion: What information suggests that The Seven Sisters are caves?\nChoices:\na. Temperature variations\nb. The diameter of the cave entrances\nc. Geological surveys\nd. Pictures of caves on Earth',
+ "label": "a"
+ }
+ ```
+ ```json
+ {
+ "text": 'Text: The proposed amendment already passed both houses in 2011. A change was made this legislative session when the second sentence was deleted first by the House of Representatives and then was passed in a similar form by the Senate Monday. The failure of the second sentence, which proposes to ban same-sex civil unions, could possibly open the door for civil unions in the future. Following the process, HJR-3 will be reviewed again by the next elected legislature in either 2015 or 2016 to remain in process.\nQuestion: According to the passage, when was the second sentence deleted?\nChoices:\na. During the legislative session\nb. In 2011\nc. On Monday\nd. In 2015',
+ "label": "a"
+ }
+ ```
+
+ When evaluating generative models, we use the following setup (see the
+ [methodology](/methodology) for more information on how these are used):
+
+ - Number of few-shot examples: 4
+ - Prefix prompt:
+ ```
+ The following are texts with accompanying questions and answers.
+ ```
+ - Base prompt template:
+ ```
+ Text: {text}
+ Question: {question}
+ Answer in max 3 words:
+ ```
+ - Instruction-tuned prompt template:
+ ```
+ Text: {text}
+
+ Answer the following question about the above text in at most 3 words.
+
+ Question: {question}
+ ```
+
+ You can evaluate this dataset directly as follows:
+
+ ```bash
+ $ euroeval --model <model-id> --dataset belebele-en
+ ```
+
+
  ## Knowledge

  ### MMLU
@@ -266,6 +266,70 @@ $ euroeval --model <model-id> --dataset tydiqa-fi
  ```


+ ### Unofficial: BeleBele-fi
+
+ This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features multiple-choice reading comprehension questions across 122 languages.
+
+ The original dataset contains 900 unique multiple-choice reading comprehension passages and questions. From these, we use a 256 / 64 / 580 split for training, validation and testing, respectively.
+
+ Here are a few examples from the training split:
+
+ ```json
+ {
+ "text": "Toisin kuin muut kädelliset, isot ihmisapinat eivät enää käytä käsiään liikkumiseen, painon kannattelemiseen tai liikkumiseen puissa itseään heilautellen. Simpanssin käsi ja jalka ovat samankokoisia ja -pituisia, mikä viittaa siihen, että kädelle varataan painoa rystykävelyssä. Ihmisen käsi on lyhyempi kuin jalka, ja sen sormiluut ovat suoremmat. Kahden-kolmen miljoonan vuoden ikäiset käsiluiden fossiilit paljastavat käden erikoistumisessa tämän muutoksen liikkumisesta käyttelyyn.\nKysymys: Mikä seuraavista kuvaa tarkasti simpanssin sormiluita?\nVaihtoehdot:\na. Ne ovat suoremmat kuin ihmisillä\nb. Niiden kädet ja jalat ovat erikokoisia\nc. Niitä käytetään painon kannattelemiseen\nd. Niitä käytetään pääasiassa käyttelyyn",
+ "label": "c"
+ }
+ ```
+ ```json
+ {
+ "text": "Panaman paperit on yläkäsite panamalaisen lakiyrityksen Mossack Fonsecan noin kymmenelle miljoonalle asiakirjalle, jotka vuodettiin lehdistölle keväällä 2016. Asiakirjoista selvisi, että neljätoista pankkia auttoi varakkaita asiakkaita piilottamaan miljardeja USA:n dollareita verojen ja muiden sääntelyjen välttämiseksi. Brittiläisen sanomalehden The Guardianin mukaan Deutsche Bank hallitsi tämän toteuttamiseen käytetyistä 1 200 postilaatikkoyrityksestä suunnilleen kolmasosaa. Seurasi maailmanlaajuisia protesteja ja useita rikossyytteitä, ja Islannin ja Pakistanin hallitusten johtajat kumpikin erosivat.\nKysymys: Kuka brittiläisen lehdistön väitteen mukaan hallinnoi monia varojen piilottamisessa käytettyjä yrityksiä tekstikatkelman mukaan?\nVaihtoehdot:\na. Eri pankkien varakkaat asiakkaat\nb. Panamalainen lakiyritys\nc. Deutsche Bank\nd. Pakistanin hallitus",
+ "label": "c"
+ }
+ ```
+ ```json
+ {
+ "text": "Teksti: Sundarban on maailman suurin mangrovemetsäalue. Se ulottuu 80 kilometriä (50 mailia) rannikolta Bangladeshin ja Intian takamaille. Sundarban on julistettu Unescon maailmanperintökohteeksi. Metsän Intian puolella sijaitsevaa osaa kutsutaan Sundarbanin kansallispuistoksi. Metsät eivät kuitenkaan ole pelkkiä mangrovesoita, vaan niihin kuuluu joitakin viimeisiä jäänteitä niistä mahtavista viidakoista, jotka aikoinaan peittivät koko Gangesin tasangon. Sundarban kattaa 3 850 neliökilometrin alueen, josta noin kolmasosa on vesi- tai suoalueiden peitossa. Vuodesta 1966 asti Sundarbans on ollut villieläinten suojelualue. Arvioidaan, että siellä on nykyään 400 intiantiikeriä ja suunnilleen 30 000 aksishirveä.\nKysymys: Mikä metsän osa on Intian puolella?\nVaihtoehdot:\na. Sundarbanin kansallispuisto\nb. Villieläinten suojelualue\nc. Maailmanperintökohde\nd. Gangesin tasanko",
+ "label": "a"
+ }
+ ```
+
+ When evaluating generative models, we use the following setup (see the
+ [methodology](/methodology) for more information on how these are used):
+
+ - Number of few-shot examples: 5
+ - Prefix prompt:
+ ```
+ Seuraavat ovat monivalintakysymyksiä (vastauksineen).
+ ```
+ - Base prompt template:
+ ```
+ Kysymys: {text}
+ Vaihtoehdot:
+ a. {option_a}
+ b. {option_b}
+ c. {option_c}
+ d. {option_d}
+ Vastaus: {label}
+ ```
+ - Instruction-tuned prompt template:
+ ```
+ Kysymys: {text}
+ Vaihtoehdot:
+ a. {option_a}
+ b. {option_b}
+ c. {option_c}
+ d. {option_d}
+
+ Vastaa yllä olevaan kysymykseen käyttämällä 'a', 'b', 'c' tai 'd', äläkä mitään muuta.
+ ```
+
+ You can evaluate this dataset directly as follows:
+
+ ```bash
+ $ euroeval --model <model-id> --dataset belebele-fi
+ ```
+
+
  ## Common-sense Reasoning

  ### HellaSwag-fi
@@ -310,7 +374,7 @@ When evaluating generative models, we use the following setup (see the
  a. {option_a}
  b. {option_b}
  c. {option_c}
- d. {option_c}
+ d. {option_d}
  Vastaus: {label}
  ```
  - Instruction-tuned prompt template:
@@ -296,6 +296,70 @@ $ euroeval --model <model-id> --dataset fquad
  ```


+ ### Unofficial: BeleBele-fr
+
+ This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features multiple-choice reading comprehension questions across 122 languages.
+
+ The original dataset contains 900 unique multiple-choice reading comprehension passages and questions. From these, we use a 256 / 64 / 580 split for training, validation and testing, respectively.
+
+ Here are a few examples from the training split:
+
+ ```json
+ {
+ "text": "Texte: Lorsqu’un petit groupe d’êtres vivants (une petite population) est séparé de la population principale dont il est issu (par exemple, s’il se déplace au-dessus d’une chaîne de montagnes ou d’une rivière, ou s’il se déplace vers une nouvelle île de sorte qu’il ne peut pas facilement revenir en arrière), il se retrouve souvent dans un environnement différent de celui dans lequel il était auparavant. Ce nouvel environnement a des ressources et des concurrents différents, de sorte que la nouvelle population aura besoin de caractéristiques ou d'adaptations nouvelles pour être un concurrent puissant par rapport à ce dont elle avait besoin auparavant. La population d'origine n'a pas changé du tout,\xa0elle a toujours besoin des mêmes adaptations. Au fil du temps, à mesure que la nouvelle population s'adapte à son nouvel environnement, elle commence à ressembler de moins en moins à l'autre population. Enfin, après des milliers ou même des millions d'années, les deux populations paraîtront tellement différentes qu'elles ne pourront plus être considérées comme appartenant à la même espèce. Nous appelons ce processus «\u2009spéciation\u2009», ce qui signifie simplement la formation de nouvelles espèces. La spéciation est une conséquence inévitable et une partie très importante de l’évolution.\nQuestion: D’après l’extrait et parmi les exemples ci-dessous, qu’est-ce qui gênerait le processus d’évolution\xa0?\nChoix:\na. La difficulté pour un petit groupe à s’épanouir dans un nouvel endroit\nb. La migration d’une portion d’une population vers un nouvel environnement\nc. L’ajustement par une population de son adaptation à un nouvel environnement\nd. Le fait qu’une population finisse par devenir deux populations distinctes",
+ "label": "a"
+ }
+ ```
+ ```json
+ {
+ "text": "Texte: Le pillage généralisé se serait poursuivi pendant la nuit, les forces de l'ordre n'étant pas présentes dans les rues de Bichkek. Un observateur a décrit Bichkek comme étant en train de sombrer dans un état d’« anarchie », tandis que la population se déplaçait en bandes dans les rues et pillait les magasins de biens de consommation. Plusieurs habitants de Bichkek ont reproché les manifestants du sud d'être responsables de l'anarchie.\nQuestion: Qui a accusé les manifestants du sud de pillage\xa0?\nChoix:\na. Des habitants de Bichkek\nb. Les forces de l’ordre\nc. Les anarchistes\nd. Des bandes de personnes",
+ "label": "a"
+ }
+ ```
+ ```json
+ {
+ "text": "Texte: Dans de nombreuses régions du monde, faire un signe de la main est un geste amical signifiant «\u2009bonjour\u2009». En revanche, en Malaisie, du moins chez les Malais des zones rurales, cela signifie « viens par ici », comme le fait de plier l'index vers soi, geste utilisé dans certains pays occidentaux, et il ne devrait être utilisé qu'en ce sens. De même, un voyageur britannique en Espagne pourrait confondre un signe d'adieu fait par une personne qui tourne la paume de sa main vers elle-même (plutôt que vers la personne à qui elle adresse le signe) avec une invitation à revenir.\nQuestion: Dans les zones rurales de la Malaisie, quel geste signifie « viens par ici » ?\nChoix:\na. Plier l’index\nb. Faire un signe de la main\nc. Faire un « high five »\nd. Lever le pouce",
+ "label": "b"
+ }
+ ```
+
+ When evaluating generative models, we use the following setup (see the
+ [methodology](/methodology) for more information on how these are used):
+
+ - Number of few-shot examples: 5
+ - Prefix prompt:
+ ```
+ Les questions suivantes sont des questions à choix multiples (avec réponses).
+ ```
+ - Base prompt template:
+ ```
+ Question: {text}
+ Choix:
+ a. {option_a}
+ b. {option_b}
+ c. {option_c}
+ d. {option_d}
+ Réponse: {label}
+ ```
+ - Instruction-tuned prompt template:
+ ```
+ Question: {text}
+ Choix:
+ a. {option_a}
+ b. {option_b}
+ c. {option_c}
+ d. {option_d}
+
+ Répondez à la question ci-dessus par 'a', 'b', 'c' ou 'd', et rien d'autre.
+ ```
+
+ You can evaluate this dataset directly as follows:
+
+ ```bash
+ $ euroeval --model <model-id> --dataset belebele-fr
+ ```
+
+
  ## Knowledge

  ### MMLU-fr
@@ -348,7 +412,7 @@ When evaluating generative models, we use the following setup (see the
  a. {option_a}
  b. {option_b}
  c. {option_c}
- d. {option_c}
+ d. {option_d}
  Réponse: {label}
  ```
  - Instruction-tuned prompt template:
@@ -419,7 +483,7 @@ When evaluating generative models, we use the following setup (see the
  a. {option_a}
  b. {option_b}
  c. {option_c}
- d. {option_c}
+ d. {option_d}
  Réponse: {label}
  ```
  - Instruction-tuned prompt template:
@@ -284,6 +284,65 @@ $ euroeval --model <model-id> --dataset germanquad
  ```


+ ### Unofficial: BeleBele-de
+
+ This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features multiple-choice reading comprehension questions across 122 languages.
+
+ The original dataset contains 900 unique multiple-choice reading comprehension passages and questions. From these, we use a 256 / 64 / 580 split for training, validation and testing, respectively.
+
+ Here are a few examples from the training split:
+
+ ```json
+ {
+ "text": "Text: Es gibt viele Dinge, die Sie vor und während einer Reise berücksichtigen müssen. Erwarten Sie nicht, dass die Dinge beim Reisen genau so sind wie „zuhause“. Umgangsformen, Gesetze, Essen, Verkehr, Unterkünfte, Standards, Spache und so weiter werden zu einem gewissen Grad anders sein als dort, wo Sie leben. Dies ist etwas, was man immer im Hinterkopf behalten sollte, um Enttäuschung oder gar Abneigung über lokale Vorgehensweisen zu vermeiden.\nFragen: Was kann Reisenden dem Abschnitt nach helfen, Enttäuschung beim Besuch neuer Orte zu vermeiden?\nAntwortmöglichkeiten:\na. Ähnliche Standards wie zuhause erwarten\nb. Essen probieren, das ungewohnt ist\nc. Die gleichen Gesetze wie zuhause einhalten\nd. Nicht vorher nach Unterkünften recherchieren",
+ "label": "b"
+ }
+ ```
+ ```json
+ {
+ "text": "Text: Genehmigungen müssen im Voraus bestellt werden. Sie benötigen eine Genehmigung, um in La Sirena zu übernachten. Sirena ist die einzige Rangerstation, die neben Zelten auch Übernachtung im Schlafsaal und warme Mahlzeiten anbietet. La Leona, San Pedrillo und Los Patos bieten nur Camping ohne Verpflegung an. Es ist möglich, eine Parklizenz direkt bei der Rangerstation in Puerto Jiménez zu bekommen, aber sie akzeptieren keine Kreditkarten Die Parkverwaltung (MINAE) stellt Genehmigungen für den Park nicht früher als einen Monat vor der geplanten Ankunft aus. CafeNet El Sol bietet einen Reservierungsservice gegen eine Gebühr von 30 US-Dollar bzw. 10 US-Dollar für Tageskarten an. Einzelheiten dazu findet man auf deren Corcovado-Seite.\nFragen: Welche der folgenden Rangerstationen bietet zwei Übernachtungsmöglichkeiten an?\nAntwortmöglichkeiten:\na. Sirena\nb. Los Patos\nc. La Leona\nd. San Pedrillo",
+ "label": "a"
+ }
+ ```
+ ```json
+ {
+ "text": "Text: Naturnaher Tourismus zieht Leute an, die daran interessiert sind, Naturgebiete zu besuchen, um die Landschaft zu genießen, einschließlich der wilden Pflanzen und Tiere. Beispiele für Aktivitäten vor Ort sind Jagen, Angeln, Fotografie, Vogelbeobachtung, der Besuch von Parks und das Lernen von Informationen über das Ökosystem. Ein Beispiel dafür ist der Besuch, das Fotografieren und das Studieren von Orangutangs in Borneo.\nFragen: Welche der folgenden Aktivitäten ist kein Beispiel für naturnahen Tourismus?\nAntwortmöglichkeiten:\na. Wandern zu einem Wasserfall\nb. Fotografieren von Wildblumen\nc. Besuch eines Wissenschaftsmuseum\nd. Fliegenfischen",
+ "label": "c"
+ }
+ ```
+
+ When evaluating generative models, we use the following setup (see the
+ [methodology](/methodology) for more information on how these are used):
+
+ - Number of few-shot examples: 5
+ - Prefix prompt:
+ ```
+ Die folgenden Fragen sind Multiple-Choice-Fragen (mit Antworten).
+ ```
+ - Base prompt template:
+ ```
+ Frage: {text}
+ Antwort: {label}
+ ```
+ - Instruction-tuned prompt template:
+ ```
+ Frage: {text}
+ Antwortmöglichkeiten:
+ a. {option_a}
+ b. {option_b}
+ c. {option_c}
+ d. {option_d}
+
+ Beantworten Sie die obige Frage mit 'a', 'b', 'c' oder 'd', und nichts anderes.
+ ```
+
+ You can evaluate this dataset directly as follows:
+
+ ```bash
+ $ euroeval --model <model-id> --dataset belebele-de
+ ```
+
+
  ## Knowledge

  ### MMLU-de
@@ -489,6 +489,70 @@ $ euroeval --model <model-id> --dataset icelandic-qa
  ```


+ ### Unofficial: BeleBele-is
+
+ This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features multiple-choice reading comprehension questions across 122 languages.
+
+ The original dataset contains 900 unique multiple-choice reading comprehension passages and questions. From these, we use a 256 / 64 / 580 split for training, validation and testing, respectively.
+
+ Here are a few examples from the training split:
+
+ ```json
+ {
+ "text": "Texti: Í Frelsisstríðinu mynduðu ríkin þrettán veikburða ríkisstjórn – með Þjóðþingið sem eina þátt þess – skv. fyrstu stjórnarskránni. Þingið var ekki með nægar valdheimildir til að leggja á skatta, og vegna þess að ekki var neinn alríkisstjóri eða dómsvald til staðar, treysti það á yfirvöld í hverju ríki fyrir sig, sem voru oft og tíðum ósamvinnuþýð, til að framfylgja lögum þess. Það hafði heldur engar valdheimildir til að fella niður skattalög og tolla á milli ríkja. Greinarnar gerðu kröfu um samhljóða samþykki allra ríkjanna áður en hægt var að breyta þeim og ríkin sýndu ríkisvaldinu svo mikla lítilsvirðingu að fulltrúar þeirra voru oft fjarverandi.\nSpurning: Samkvæmt því sem fram kemur í kaflanum, hvaða fullyrðing á nákvæmlega við um ástand ríkisvaldsins í frelsisstríðinu?\nSvarmöguleikar:\na. Skattar voru innheimtir af þinginu og ríkisstofnunum\nb. Breytingar á stjórnarskránni þurftu samþykki þingsins\nc. Fulltrúar ríkjanna voru oft fjarverandi\nd. Hin miðlæga ríkisstjórn var mynduð í kringum tvo meginþætti",
+ "label": "c"
+ }
+ ```
+ ```json
+ {
+ "text": "Texti: İzmir er þriðja stærsta borg Tyrklands með um 3,7 milljónir íbúa, næststærstu höfnina á eftir Istanbúl og er mjög góð samgöngumiðstöð. Hin forna borg Smyrna er núna nútímaleg, þróuð og iðandi viðskiptamiðstöð sem staðsett er við gríðarstóran flóa og umkringd er fjöllum. Hinar breiðu breiðgötur, byggingar með framhliðum úr gleri og nútímalegar verslunarmiðstöðvar með hefðbundnum rauðum þakskífum, 18. aldar markaðurinn og gamlar moskur og kirkjur, þó að andrúmsloft borgarinnar tengist meira Miðjarðarhafssvæði Evrópu en hefðbundnu Tyrklandi.\nSpurning: Hvert eftirfarandi einkennir Izmir er frá fornri tíð?\nSvarmöguleikar:\na. Breiðar breiðgötur\nb. Byggingar með framhliðum úr gleri\nc. Verslanamiðstöðvar\nd. rauðar þakskífur",
+ "label": "d"
+ }
+ ```
+ ```json
+ {
+ "text": "Texti: Dæmigert fyrir það tímabil er Kirby Muxloe Castle sem er frekar víggirt hús en raunverulegur kastali. Stóru gljáðu gluggarnir og þunnu veggirnir hefðu ekki getað staðist stórárás í langan tíma. Árið 1480, þegar Hastings lávarður hóf byggingarframkvæmdirnar, ríkti friður í nánast öllu landinu og aðeins var þörf á varnarmúrum gegn litlum ræningjahópum.\nSpurning: Hvert af eftirtöldu hefði verið talið óvenjulegt við byggingu Kirby Muxloe kastala á þeim tíma sem talað er um í kaflanum?\nSvarmöguleikar:\na. Stórir gluggar\nb. Grunnur sem á að standast árásir\nc. Minna af varnarútbúnaði en í öðrum köstulum\nd. Þunnir veggir",
+ "label": "b"
+ }
+ ```
+
+ When evaluating generative models, we use the following setup (see the
+ [methodology](/methodology) for more information on how these are used):
+
+ - Number of few-shot examples: 5
+ - Prefix prompt:
+ ```
+ Eftirfarandi eru fjölvalsspurningar (með svörum).
+ ```
+ - Base prompt template:
+ ```
+ Spurningar: {text}
+ Svarmöguleikar:
+ a. {option_a}
+ b. {option_b}
+ c. {option_c}
+ d. {option_d}
+ Svara: {label}
+ ```
+ - Instruction-tuned prompt template:
+ ```
+ Spurningar: {text}
+ Svarmöguleikar:
+ a. {option_a}
+ b. {option_b}
+ c. {option_c}
+ d. {option_d}
+
+ Svaraðu eftirfarandi spurningum með 'a', 'b', 'c' eða 'd', og engu öðru.
+ ```
+
+ You can evaluate this dataset directly as follows:
+
+ ```bash
+ $ euroeval --model <model-id> --dataset belebele-is
+ ```
+
+
  ## Knowledge

  ### IcelandicKnowledge
@@ -371,6 +371,70 @@ $ euroeval --model <model-id> --dataset squad-it
  ```


+ ### Unofficial: BeleBele-it
+
+ This dataset was published in [this paper](https://aclanthology.org/2024.acl-long.44/) and features multiple-choice reading comprehension questions across 122 languages.
+
+ The original dataset contains 900 unique multiple-choice reading comprehension passages and questions. From these, we use a 256 / 64 / 580 split for training, validation and testing, respectively.
+
+ Here are a few examples from the training split:
+
+ ```json
+ {
+ "text": "Testo: Con la decisione del signor Rudd di firmare l’accordo sul clima di Kyoto, gli Stati Uniti, che ora saranno l’unica nazione sviluppata a non averlo ratificato, rimangono isolati. Il precedente governo conservatore australiano aveva rifiutato di ratificare gli accordi di Kyoto asserendo che avrebbero danneggiato l'economia, data la pesante dipendenza dalle esportazioni di carbone, mentre gli obiettivi sulle emissioni non sarebbero stati vincolanti per Paesi come l'India e la Cina.\nDomanda: Il precedente governo australiano pensava che la ratifica di Kyoto avrebbe causato danni a cosa?\nOpzioni:\na. Stati Uniti\nb. Economia del Paese\nc. Esportazioni di carbone\nd. Gli obiettivi di emissione del Paese",
+ "label": "b"
+ }
+ ```
+ ```json
+ {
+ "text": "Testo: "I commenti, in diretta televisiva, hanno rappresentato la prima occasione per autorevoli fonti iraniane per ammettere che le sanzioni sono efficaci. Esse comprendono limitazioni finanziarie e il divieto dell\'Unione europea all\'esportazione di petrolio greggio, che rappresenta l\'80% del reddito estero nell\'economia dell\'Iran. Secondo l\'ultimo rapporto mensile dell’OPEC, il volume delle esportazioni di greggio è sceso al livello più basso degli ultimi vent\'anni, con 2,8 milioni di barili al giorno. Il leader supremo del Paese, l’Ayatollah Ali Khamenei, ha parlato della dipendenza dal petrolio paragonandola ad ""una trappola"" che risale al periodo precedente la rivoluzione islamica iraniana del 1979 e dalla quale il Paese si dovrebbe liberare."\nDomanda: Secondo il passaggio, chi ha ammesso gli effetti delle sanzioni sull\'economia iraniana?\nOpzioni:\na. Autorevoli fonti\nb. OPEC\nc. Ayatollah Ali Khamenei\nd. L\'Unione Europea",
+ "label": "a"
+ }
+ ```
+ ```json
+ {
+ "text": "Testo: Il dottor Lee si è detto preoccupato anche in merito ai rapporti che rivelano che i bambini in Turchia ora sono stati contagiati dal virus dell'influenza aviaria A(H5N1) senza ammalarsi. Ha sottolineato che secondo alcuni studi la malattia diventerà meno mortale prima che possa causare un'epidemia globale. Si teme che se permangono sintomi influenzali di lieve entità, i pazienti possano continuare a contagiare più persone durante la loro routine quotidiana.\nDomanda: Secondo il brano, cosa dovrebbe accadere alla malattia prima di causare un'epidemia globale?\nOpzioni:\na. Deve diventare meno letale\nb. I sintomi devono rimanere lievi\nc. Occorre che più pazienti vengano infettati\nd. I bambini devono manifestare i sintomi",
+ "label": "a"
+ }
+ ```
+
+ When evaluating generative models, we use the following setup (see the
+ [methodology](/methodology) for more information on how these are used):
+
+ - Number of few-shot examples: 5
+ - Prefix prompt:
+ ```
+ Le seguenti sono domande a scelta multipla (con relative risposte).
+ ```
+ - Base prompt template:
+ ```
+ Domanda: {text}
+ Opzioni:
+ a. {option_a}
+ b. {option_b}
+ c. {option_c}
+ d. {option_d}
+ Risposta: {label}
+ ```
+ - Instruction-tuned prompt template:
+ ```
+ Domanda: {text}
+ Opzioni:
+ a. {option_a}
+ b. {option_b}
+ c. {option_c}
+ d. {option_d}
+
+ Rispondete alla domanda precedente con 'a', 'b', 'c' o 'd', e nient'altro.
+ ```
+
+ You can evaluate this dataset directly as follows:
+
+ ```bash
+ $ euroeval --model <model-id> --dataset belebele-it
+ ```
+
+
  ## Knowledge

  ### MMLU-it
@@ -423,7 +487,7 @@ When evaluating generative models, we use the following setup (see the
  a. {option_a}
  b. {option_b}
  c. {option_c}
- d. {option_c}
+ d. {option_d}
  Réponse: {label}
  ```
  - Instruction-tuned prompt template:
@@ -494,7 +558,7 @@ When evaluating generative models, we use the following setup (see the
  a. {option_a}
  b. {option_b}
  c. {option_c}
- d. {option_c}
+ d. {option_d}
  Réponse: {label}
  ```
  - Instruction-tuned prompt template: