ScandEval 16.10.1.tar.gz → 16.12.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (384)
  1. scandeval-16.12.0/.github/auto_assign.yaml +29 -0
  2. scandeval-16.12.0/.github/workflows/auto_assign_reviewers.yaml +15 -0
  3. {scandeval-16.10.1 → scandeval-16.12.0}/.github/workflows/ci.yaml +4 -4
  4. {scandeval-16.10.1 → scandeval-16.12.0}/.pre-commit-config.yaml +5 -5
  5. {scandeval-16.10.1 → scandeval-16.12.0}/CHANGELOG.md +75 -0
  6. {scandeval-16.10.1 → scandeval-16.12.0}/CONTRIBUTING.md +1 -1
  7. {scandeval-16.10.1 → scandeval-16.12.0}/Dockerfile.cuda +1 -1
  8. {scandeval-16.10.1 → scandeval-16.12.0}/LICENSE +1 -1
  9. {scandeval-16.10.1 → scandeval-16.12.0}/PKG-INFO +50 -24
  10. {scandeval-16.10.1 → scandeval-16.12.0}/README.md +40 -18
  11. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/danish.md +79 -1
  12. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/dutch.md +170 -0
  13. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/english.md +78 -0
  14. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/estonian.md +79 -1
  15. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/finnish.md +78 -0
  16. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/french.md +78 -0
  17. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/german.md +101 -0
  18. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/icelandic.md +78 -0
  19. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/italian.md +78 -0
  20. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/norwegian.md +78 -0
  21. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/polish.md +78 -0
  22. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/portuguese.md +87 -9
  23. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/spanish.md +85 -7
  24. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/swedish.md +84 -6
  25. {scandeval-16.10.1 → scandeval-16.12.0}/docs/faq.md +4 -2
  26. {scandeval-16.10.1 → scandeval-16.12.0}/docs/python-package.md +33 -67
  27. {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/README.md +5 -7
  28. scandeval-16.12.0/docs/tasks/bias-detection.md +29 -0
  29. scandeval-16.12.0/docs/tasks/european-values.md +33 -0
  30. scandeval-16.12.0/docs/tasks/simplification.md +36 -0
  31. {scandeval-16.10.1 → scandeval-16.12.0}/makefile +2 -2
  32. {scandeval-16.10.1 → scandeval-16.12.0}/mkdocs.yaml +7 -0
  33. {scandeval-16.10.1 → scandeval-16.12.0}/pyproject.toml +16 -8
  34. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/__init__.py +0 -9
  35. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmark_config_factory.py +5 -0
  36. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmark_modules/hf.py +36 -8
  37. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmark_modules/litellm.py +119 -22
  38. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmark_modules/vllm.py +202 -94
  39. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmarker.py +28 -7
  40. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/cli.py +13 -0
  41. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/constants.py +31 -2
  42. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/data_models.py +12 -2
  43. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/dutch.py +10 -0
  44. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/logging_utils.py +1 -1
  45. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/metrics/__init__.py +1 -0
  46. scandeval-16.12.0/src/scandeval/metrics/bias.py +237 -0
  47. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/metrics/huggingface.py +5 -3
  48. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/metrics/llm_as_a_judge.py +79 -15
  49. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/model_loading.py +2 -1
  50. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/task_group_utils/sequence_classification.py +12 -3
  51. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/tasks.py +22 -0
  52. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/tokenisation_utils.py +12 -1
  53. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/types.py +39 -0
  54. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/utils.py +38 -66
  55. scandeval-16.12.0/src/scripts/create_mbbq_nl.py +213 -0
  56. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/load_ud_pos.py +11 -0
  57. {scandeval-16.10.1 → scandeval-16.12.0}/tests/conftest.py +1 -0
  58. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_benchmark_config_factory.py +10 -10
  59. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_benchmarker.py +44 -17
  60. scandeval-16.12.0/tests/test_bias_metrics.py +144 -0
  61. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_cli.py +1 -0
  62. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_data_loading.py +1 -1
  63. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_dataset_configs.py +3 -2
  64. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_model_loading.py +7 -9
  65. {scandeval-16.10.1 → scandeval-16.12.0}/uv.lock +1781 -1755
  66. scandeval-16.10.1/docs/tasks/simplification.md +0 -42
  67. {scandeval-16.10.1 → scandeval-16.12.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +0 -0
  68. {scandeval-16.10.1 → scandeval-16.12.0}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
  69. {scandeval-16.10.1 → scandeval-16.12.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
  70. {scandeval-16.10.1 → scandeval-16.12.0}/.github/ISSUE_TEMPLATE/language_request.yaml +0 -0
  71. {scandeval-16.10.1 → scandeval-16.12.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +0 -0
  72. {scandeval-16.10.1 → scandeval-16.12.0}/.gitignore +0 -0
  73. {scandeval-16.10.1 → scandeval-16.12.0}/.markdownlint.jsonc +0 -0
  74. {scandeval-16.10.1 → scandeval-16.12.0}/CITATION.cff +0 -0
  75. {scandeval-16.10.1 → scandeval-16.12.0}/CODE_OF_CONDUCT.md +0 -0
  76. {scandeval-16.10.1 → scandeval-16.12.0}/NEW_DATASET_GUIDE.md +0 -0
  77. {scandeval-16.10.1 → scandeval-16.12.0}/docs/CNAME +0 -0
  78. {scandeval-16.10.1 → scandeval-16.12.0}/docs/README.md +0 -0
  79. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/README.md +0 -0
  80. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/albanian.md +0 -0
  81. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/bosnian.md +0 -0
  82. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/bulgarian.md +0 -0
  83. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/catalan.md +0 -0
  84. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/croatian.md +0 -0
  85. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/czech.md +0 -0
  86. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/faroese.md +0 -0
  87. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/greek.md +0 -0
  88. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/hungarian.md +0 -0
  89. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/latvian.md +0 -0
  90. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/lithuanian.md +0 -0
  91. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/romanian.md +0 -0
  92. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/serbian.md +0 -0
  93. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/slovak.md +0 -0
  94. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/slovene.md +0 -0
  95. {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/ukrainian.md +0 -0
  96. {scandeval-16.10.1 → scandeval-16.12.0}/docs/extras/radial_plotter.md +0 -0
  97. {scandeval-16.10.1 → scandeval-16.12.0}/docs/gfx/favicon.png +0 -0
  98. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/albanian.md +0 -0
  99. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/bosnian.md +0 -0
  100. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/bulgarian.md +0 -0
  101. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/catalan.md +0 -0
  102. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/croatian.md +0 -0
  103. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/czech.md +0 -0
  104. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/danish.md +0 -0
  105. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/dutch.md +0 -0
  106. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/english.md +0 -0
  107. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/estonian.md +0 -0
  108. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/faroese.md +0 -0
  109. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/finnish.md +0 -0
  110. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/french.md +0 -0
  111. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/german.md +0 -0
  112. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/greek.md +0 -0
  113. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/hungarian.md +0 -0
  114. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/icelandic.md +0 -0
  115. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/italian.md +0 -0
  116. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/latvian.md +0 -0
  117. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/lithuanian.md +0 -0
  118. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/norwegian.md +0 -0
  119. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/polish.md +0 -0
  120. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/portuguese.md +0 -0
  121. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/romanian.md +0 -0
  122. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/serbian.md +0 -0
  123. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/slovak.md +0 -0
  124. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/slovene.md +0 -0
  125. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/spanish.md +0 -0
  126. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/swedish.md +0 -0
  127. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/ukrainian.md +0 -0
  128. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Multilingual/baltic.md +0 -0
  129. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Multilingual/european.md +0 -0
  130. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Multilingual/finnic.md +0 -0
  131. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Multilingual/germanic.md +0 -0
  132. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
  133. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Multilingual/romance.md +0 -0
  134. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Multilingual/slavic.md +0 -0
  135. {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/README.md +0 -0
  136. {scandeval-16.10.1 → scandeval-16.12.0}/docs/methodology.md +0 -0
  137. {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/common-sense-reasoning.md +0 -0
  138. {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/knowledge.md +0 -0
  139. {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/linguistic-acceptability.md +0 -0
  140. {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/named-entity-recognition.md +0 -0
  141. {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/reading-comprehension.md +0 -0
  142. {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/sentiment-classification.md +0 -0
  143. {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/speed.md +0 -0
  144. {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/summarization.md +0 -0
  145. {scandeval-16.10.1 → scandeval-16.12.0}/gfx/euroeval.png +0 -0
  146. {scandeval-16.10.1 → scandeval-16.12.0}/gfx/euroeval.xcf +0 -0
  147. {scandeval-16.10.1 → scandeval-16.12.0}/gfx/scandeval.png +0 -0
  148. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmark_modules/__init__.py +0 -0
  149. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmark_modules/base.py +0 -0
  150. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmark_modules/fresh.py +0 -0
  151. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/caching_utils.py +0 -0
  152. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/callbacks.py +0 -0
  153. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/data_loading.py +0 -0
  154. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/__init__.py +0 -0
  155. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/albanian.py +0 -0
  156. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/bosnian.py +0 -0
  157. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/bulgarian.py +0 -0
  158. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/catalan.py +0 -0
  159. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/croatian.py +0 -0
  160. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/czech.py +0 -0
  161. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/danish.py +0 -0
  162. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/english.py +0 -0
  163. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/estonian.py +0 -0
  164. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/faroese.py +0 -0
  165. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/finnish.py +0 -0
  166. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/french.py +0 -0
  167. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/german.py +0 -0
  168. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/greek.py +0 -0
  169. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/hungarian.py +0 -0
  170. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/icelandic.py +0 -0
  171. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/italian.py +0 -0
  172. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/latvian.py +0 -0
  173. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/lithuanian.py +0 -0
  174. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/norwegian.py +0 -0
  175. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/polish.py +0 -0
  176. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/portuguese.py +0 -0
  177. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/romanian.py +0 -0
  178. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/serbian.py +0 -0
  179. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/slovak.py +0 -0
  180. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/slovene.py +0 -0
  181. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/spanish.py +0 -0
  182. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/swedish.py +0 -0
  183. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/ukrainian.py +0 -0
  184. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/enums.py +0 -0
  185. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/exceptions.py +0 -0
  186. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/finetuning.py +0 -0
  187. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/generation.py +0 -0
  188. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/generation_utils.py +0 -0
  189. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/languages.py +0 -0
  190. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/metrics/base.py +0 -0
  191. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/metrics/pipeline.py +0 -0
  192. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/metrics/speed.py +0 -0
  193. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/model_cache.py +0 -0
  194. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/model_config.py +0 -0
  195. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/__init__.py +0 -0
  196. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/classification.py +0 -0
  197. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/linguistic_acceptability.py +0 -0
  198. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/multiple_choice.py +0 -0
  199. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/named_entity_recognition.py +0 -0
  200. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/reading_comprehension.py +0 -0
  201. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/sentiment_classification.py +0 -0
  202. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/simplification.py +0 -0
  203. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/summarization.py +0 -0
  204. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/token_classification.py +0 -0
  205. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/scores.py +0 -0
  206. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/speed_benchmark.py +0 -0
  207. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/task_group_utils/__init__.py +0 -0
  208. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/task_group_utils/multiple_choice_classification.py +0 -0
  209. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/task_group_utils/question_answering.py +0 -0
  210. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/task_group_utils/text_to_text.py +0 -0
  211. {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/task_group_utils/token_classification.py +0 -0
  212. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/__init__.py +0 -0
  213. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/constants.py +0 -0
  214. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_allocine.py +0 -0
  215. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_angry_tweets.py +0 -0
  216. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_arc.py +0 -0
  217. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_arc_is.py +0 -0
  218. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_atsiliepimai.py +0 -0
  219. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_belebele.py +0 -0
  220. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_bg_ner_bsnlp.py +0 -0
  221. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_boolq_pt.py +0 -0
  222. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_cinexio.py +0 -0
  223. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_cnn_dailymail.py +0 -0
  224. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_conll_en.py +0 -0
  225. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_conll_es.py +0 -0
  226. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_conll_nl.py +0 -0
  227. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_copa_lv.py +0 -0
  228. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_copa_nl.py +0 -0
  229. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_cross_domain_uk_reviews.py +0 -0
  230. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_cs_gec.py +0 -0
  231. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_csfd_sentiment.py +0 -0
  232. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_csfd_sentiment_sk.py +0 -0
  233. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_czech_news.py +0 -0
  234. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_dacsa.py +0 -0
  235. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_dane.py +0 -0
  236. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_danish_citizen_tests.py +0 -0
  237. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_dansk.py +0 -0
  238. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_danske_talemaader.py +0 -0
  239. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_danske_talemaader_old.py +0 -0
  240. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_dbrd.py +0 -0
  241. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_duidelijke_taal.py +0 -0
  242. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_dutch_cola.py +0 -0
  243. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_elner.py +0 -0
  244. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_eltec.py +0 -0
  245. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_err_news.py +0 -0
  246. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_estner.py +0 -0
  247. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_estonian_valence.py +0 -0
  248. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_european_values.py +0 -0
  249. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_exam_et.py +0 -0
  250. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_exams_bg.py +0 -0
  251. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_fone.py +0 -0
  252. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_foqa.py +0 -0
  253. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_fosent.py +0 -0
  254. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_fquad.py +0 -0
  255. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_fullstack_ner.py +0 -0
  256. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_germanquad.py +0 -0
  257. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_germeval.py +0 -0
  258. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_global_mmlu.py +0 -0
  259. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_goldenswag.py +0 -0
  260. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_grammar_et.py +0 -0
  261. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_greek_sa.py +0 -0
  262. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_greek_wikipedia.py +0 -0
  263. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_guia_cat.py +0 -0
  264. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_harem.py +0 -0
  265. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_hellaswag.py +0 -0
  266. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_hellaswag_cs.py +0 -0
  267. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_hellaswag_fi.py +0 -0
  268. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
  269. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_hun_sum.py +0 -0
  270. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_husst.py +0 -0
  271. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_ice_linguistic.py +0 -0
  272. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_icelandic_error_corpus.py +0 -0
  273. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_icelandic_knowledge.py +0 -0
  274. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_icelandic_qa.py +0 -0
  275. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_icesum.py +0 -0
  276. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_idioms_no.py +0 -0
  277. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_ilpost_sum.py +0 -0
  278. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_jentoft.py +0 -0
  279. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_kpwr_ner.py +0 -0
  280. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_latvian_lsm_summary.py +0 -0
  281. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_latvian_twitter_sentiment.py +0 -0
  282. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_life_in_the_uk.py +0 -0
  283. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_lithuanian_lrytas_summarization.py +0 -0
  284. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_llmzszl.py +0 -0
  285. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_lr_sum.py +0 -0
  286. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_lt_emotions.py +0 -0
  287. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_lt_history.py +0 -0
  288. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mim_gold_ner.py +0 -0
  289. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mlqa_es.py +0 -0
  290. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mlsum_de.py +0 -0
  291. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mlsum_es.py +0 -0
  292. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mmlu.py +0 -0
  293. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mmlu_et.py +0 -0
  294. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mmlu_hr.py +0 -0
  295. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mmlu_lv.py +0 -0
  296. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mms.py +0 -0
  297. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_multi_wiki_qa.py +0 -0
  298. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_multinerd-it.py +0 -0
  299. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_ner_uk.py +0 -0
  300. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_no_cola.py +0 -0
  301. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_no_sammendrag.py +0 -0
  302. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_nor_common_sense_qa.py +0 -0
  303. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_nordjylland_news.py +0 -0
  304. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_norec.py +0 -0
  305. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_norglm_multiqa.py +0 -0
  306. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_norglm_multisum.py +0 -0
  307. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_norne.py +0 -0
  308. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_norquad.py +0 -0
  309. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_nqii.py +0 -0
  310. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_nrk_quiz_qa.py +0 -0
  311. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_orange_sum.py +0 -0
  312. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_personal_sum.py +0 -0
  313. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_polemo2.py +0 -0
  314. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_poner.py +0 -0
  315. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_poquad.py +0 -0
  316. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_psc.py +0 -0
  317. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_publico.py +0 -0
  318. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_ronec.py +0 -0
  319. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_rosent.py +0 -0
  320. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_rrn.py +0 -0
  321. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sb10k.py +0 -0
  322. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_scala.py +0 -0
  323. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_scandiqa.py +0 -0
  324. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_scandisent_fi.py +0 -0
  325. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_schibsted.py +0 -0
  326. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sentiment_headlines_es.py +0 -0
  327. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sentinews.py +0 -0
  328. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sentipolc16.py +0 -0
  329. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_skolprov.py +0 -0
  330. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sqad.py +0 -0
  331. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_squad.py +0 -0
  332. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_squad_it.py +0 -0
  333. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_squad_nl.py +0 -0
  334. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_squad_nl_old.py +0 -0
  335. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_ssj500k_ner.py +0 -0
  336. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sst2_pt.py +0 -0
  337. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sst5.py +0 -0
  338. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_suc3.py +0 -0
  339. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sumo_ro.py +0 -0
  340. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_swedish_facts.py +0 -0
  341. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_swedn.py +0 -0
  342. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_swerec.py +0 -0
  343. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_szeged_ner.py +0 -0
  344. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_trivia_et.py +0 -0
  345. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_turku_ner_fi.py +0 -0
  346. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_tydiqa_fi.py +0 -0
  347. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_umimeto_qa.py +0 -0
  348. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_uner_sk.py +0 -0
  349. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_uner_sr.py +0 -0
  350. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_wiki_lingua_nl.py +0 -0
  351. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_wikiann.py +0 -0
  352. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_wikineural-it.py +0 -0
  353. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_winogrande.py +0 -0
  354. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_winogrande_et.py +0 -0
  355. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_winogrande_is.py +0 -0
  356. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_xlsum_fi.py +0 -0
  357. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_xquad.py +0 -0
  358. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/fix_dot_env_file.py +0 -0
  359. {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/versioning.py +0 -0
  360. {scandeval-16.10.1 → scandeval-16.12.0}/tests/__init__.py +0 -0
  361. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_benchmark_modules/__init__.py +0 -0
  362. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_benchmark_modules/test_hf.py +0 -0
  363. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_callbacks.py +0 -0
  364. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_constants.py +0 -0
  365. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_data_models.py +0 -0
  366. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_enums.py +0 -0
  367. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_exceptions.py +0 -0
  368. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_finetuning.py +0 -0
  369. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_languages.py +0 -0
  370. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_model_config.py +0 -0
  371. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scores.py +0 -0
  372. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/__init__.py +0 -0
  373. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/__init__.py +0 -0
  374. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/test_create_scala.py +0 -0
  375. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/test_data/de_gsd-ud-train.conllu.adp_det +0 -0
  376. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/test_data/empty.file +0 -0
  377. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/test_data/en_gum-ud-train.conllu.case +0 -0
  378. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_01 +0 -0
  379. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_02 +0 -0
  380. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_03 +0 -0
  381. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_speed_benchmark.py +0 -0
  382. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_tokenisation_utils.py +0 -0
  383. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_types.py +0 -0
  384. {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_utils.py +0 -0
@@ -0,0 +1,29 @@
+ # Set to true to add reviewers to pull requests
+ addReviewers: true
+
+ # Set to true to add assignees to pull requests
+ addAssignees: true
+
+ # A list of reviewers to be added to pull requests (GitHub user names)
+ reviewers:
+   - saattrupdan
+
+ # The number of reviewers added to the pull request
+ # Set to 0 to add all of the reviewers (default: 0)
+ numberOfReviewers: 0
+
+ # Whether to run the action on draft pull requests
+ runOnDraft: true
+
+ # A list of assignees; overrides reviewers if set
+ # assignees:
+ #   - assigneeA
+
+ # The number of assignees to add to the pull request
+ # Set to 0 to add all of the assignees.
+ # Uses numberOfReviewers if unset.
+ # numberOfAssignees: 2
+
+ # A list of keywords; if a pull request title includes any of them, reviewers are not added
+ # skipKeywords:
+ #   - wip
@@ -0,0 +1,15 @@
+ name: 'Auto Assign'
+ on:
+   pull_request:
+     types: [opened, ready_for_review]
+
+ jobs:
+   add-reviews:
+     permissions:
+       contents: read
+       pull-requests: write
+     runs-on: ubuntu-latest
+     steps:
+       - uses: kentaro-m/auto-assign-action@v2.0.1
+         with:
+           configuration-path: .github/auto_assign.yaml
@@ -31,7 +31,7 @@ jobs:
        uses: astral-sh/setup-uv@v6
        with:
          enable-cache: false
-         python-version: "3.11"
+         python-version: "3.12"

      - name: Run pre-commit hooks
        uses: pre-commit/action@v3.0.1
@@ -43,7 +43,7 @@ jobs:
      pull-requests: write
    strategy:
      matrix:
-       python-version: ["3.11", "3.12", "3.13"]
+       python-version: ["3.12", "3.13"]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
@@ -58,7 +58,7 @@ jobs:
          python-version: ${{ matrix.python-version }}

      - name: Install Dependencies
-       run: uv sync --no-dev
+       run: uv sync --no-dev --all-extras

      - name: Start Ollama server
        run: curl -fsSL https://ollama.com/install.sh | sh && ollama serve &
@@ -95,7 +95,7 @@ jobs:
          python-version: ${{ matrix.python-version }}

      - name: Install Dependencies
-       run: uv sync --no-dev
+       run: uv sync --no-dev --all-extras

      - name: Start Ollama server
        run: curl -fsSL https://ollama.com/install.sh | sh && ollama serve &
@@ -8,9 +8,9 @@ repos:
    hooks:
      - id: end-of-file-fixer
      - id: trailing-whitespace
-     # - id: debug-statements
+     - id: debug-statements
  - repo: https://github.com/astral-sh/ruff-pre-commit
-   rev: v0.14.10
+   rev: v0.14.14
    hooks:
      - id: ruff
        args:
@@ -30,15 +30,15 @@ repos:
          - pyi
          - jupyter
  - repo: https://github.com/kynan/nbstripout
-   rev: 0.8.2
+   rev: 0.9.0
    hooks:
      - id: nbstripout
  - repo: https://github.com/facebook/pyrefly-pre-commit
-   rev: 0.46.3
+   rev: 0.50.1
    hooks:
      - id: pyrefly-check
        name: Pyrefly (type checking)
-       pass_filenames: true
+       pass_filenames: false
  - repo: https://github.com/DavidAnson/markdownlint-cli2
    rev: v0.20.0
    hooks:
@@ -7,6 +7,81 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.

  ## [Unreleased]

+ ## [v16.12.0] - 2026-02-02
+
+ ### Added
+
+ - Added the bias detection task (`multiple-choice-stereotype-bias`) along with the
+   Dutch dataset MBBQ-NL. This was added by @caldaibis ✨
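A hedged illustration of running the new task (the dataset slug `mbbq-nl` is our assumption, inferred from the `create_mbbq_nl.py` script in this release; `<model-id>` is a placeholder):

```bash
euroeval --model <model-id> --dataset mbbq-nl
```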
+ - Added support for vLLM Metal, so that generative models can now be evaluated on
+   Apple Silicon. Note that this currently does not support structured generation,
+   which means that classification and named entity recognition tasks unfortunately
+   won't work yet. This is due to [this xgrammar
+   issue](https://github.com/vllm-project/vllm/issues/31901).
+
+ ### Changed
+
+ - Replaced the deprecated `VLLM_ATTENTION_BACKEND` environment variable with vLLM's
+   `AttentionConfig` API, and added an `--attention-backend` CLI option to configure
+   the attention backend (defaults to FLASHINFER). This was added by @SwekeR-463 ✨
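A hedged example of the new option (only the flag name and its FLASHINFER default come from the entry above; other values depend on what the installed vLLM supports):

```bash
euroeval --model <model-id> --attention-backend FLASHINFER
```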
+ - Now requires Python >=3.12, as some dependencies no longer support Python 3.11.
+ - We now raise the vLLM maximum context length for reasoning models from 8,192 to
+   16,384 tokens, to accommodate reasoning tokens on datasets with long documents.
+ - We relaxed the vLLM version pin, which is now set to `>=0.14.1`.
+ - Made the codebase compatible with Transformers 5.0, for when vLLM starts
+   supporting it.
+
+ ### Fixed
+
+ - Fixed an issue where a model was incorrectly classified as an encoder model if it
+   had no pipeline tag on the Hugging Face Hub and relied on a custom implementation
+   that isn't integrated into the `transformers` library.
+ - Fixed an issue when a model config had no `pad_token_id` and/or `eos_token_id`.
+ - Fixed an error when evaluating local adapter models.
+ - Now ensures that the vLLM argument `max_num_batched_tokens` is at least as large
+   as the maximum context length of the model; previously, models with a maximum
+   context length below 8,192 raised errors.
+
+ ## [v16.11.0] - 2026-01-21
+
+ ### Added
+
+ - Added model metadata for GPT 5.2.
+ - Added better support for unofficial inference providers, allowing model prefixes
+   even if they're not in LiteLLM's official list of providers. Currently this only
+   works with the "ordbogen/" prefix for models available on ordbogen.dk.
53
+
54
+ ### Changed
55
+
56
+ - LLM-as-a-Judge metrics now support batch scoring across multiple judge outputs.
57
+ - When evaluating datasets with no validation split, we now set the `validation_split`
58
+ in the resulting JSONL file to `null` rather than `True`, to avoid confusion.
59
+ Likewise, if a task requires zero-shot evaluation, we set `few_shot` to null rather
60
+ than a Boolean value.
61
+ - When evaluating a reasoning model on a sequence classification task, if the model
62
+ outputs an answer that starts with one of candidate labels, we now use that label as
63
+ the predicted label. Previously, we would have conducted a word edit distance search
64
+ to find the closest candidate label, which was almost always correct, but not in all
65
+ cases.
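A minimal sketch of the matching logic described in that entry (all names here are hypothetical; this is not the actual EuroEval implementation):

```python
def match_label(answer: str, candidates: list[str]) -> str:
    """Map a raw model answer to one of the candidate labels."""
    answer = answer.strip().lower()

    # New behaviour: if the answer starts with a candidate label, use it directly.
    for label in candidates:
        if answer.startswith(label.lower()):
            return label

    # Fallback (the previous behaviour): closest label by word edit distance.
    def edit_distance(a: list[str], b: list[str]) -> int:
        # Standard dynamic-programming Levenshtein distance over word sequences.
        prev = list(range(len(b) + 1))
        for i, wa in enumerate(a, start=1):
            curr = [i]
            for j, wb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (wa != wb)))
            prev = curr
        return prev[-1]

    return min(candidates, key=lambda lab: edit_distance(answer.split(), lab.lower().split()))
```

For example, `match_label("positive. The review praises the product.", ["positive", "negative"])` now returns `"positive"` via the prefix rule rather than via the distance fallback.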
+
+ ### Fixed
+
+ - Quantized models in vLLM now have their dtype inferred automatically, removing
+   explicit dtype casting based on GPU compute capability. This was contributed by
+   @tvosch ✨
+ - Evaluation of local vLLM models did not work correctly when no internet connection
+   was available; this has now been fixed. This was contributed by @Touzen ✨
+ - More robust detection and handling of errors related to too-long inputs for vLLM
+   models.
+ - Some API models need the `logprobs` argument to be a Boolean rather than an
+   integer. This has now been fixed.
+ - Better handling of rate limits when evaluating API models, by backing off more
+   aggressively when hitting rate limits.
+ - Now truncates prompts for instruction-following models in a smarter way, by
+   removing few-shot examples one by one until the prompt is short enough, rather
+   than just truncating the prompt to the maximum length (sketched below). This only
+   affects models whose maximum model length is quite small (roughly 5,000 tokens or
+   less).
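A minimal sketch of that truncation strategy (the function name and token counter are hypothetical; this is not the actual EuroEval code):

```python
def build_prompt(few_shot_examples, final_query, max_tokens, count_tokens):
    """Drop few-shot examples one at a time until the full prompt fits."""
    examples = list(few_shot_examples)
    while True:
        prompt = "\n\n".join(examples + [final_query])
        if count_tokens(prompt) <= max_tokens or not examples:
            # Either the prompt now fits, or there are no examples left to drop.
            return prompt
        examples.pop()  # remove one few-shot example and retry

# Usage with a toy whitespace "tokeniser":
# build_prompt(["Q: ... A: ...", "Q: ... A: ..."], "Q: ...", 50, lambda t: len(t.split()))
```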
+
  ## [v16.10.1] - 2026-01-02

  ### Changed
@@ -72,7 +72,7 @@ guide](https://github.com/atom/atom/blob/master/CONTRIBUTING.md#git-commit-messa
  know how to use emoji for commit messages.

  Once your changes are ready, don't forget to
- [self-review](/contributing/self-review.md) to speed up the review process:zap:.
+ self-review to speed up the review process:zap:.

  ### Pull Request

@@ -3,7 +3,7 @@ FROM nvidia/cuda:12.2.0-base-ubuntu22.04
  # Install dependencies
  RUN apt-get -y update && \
      apt-get -y upgrade && \
-     DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends gcc python3.11 python3-pip python3-dev git-all && \
+     DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends gcc python3.12 python3-pip python3-dev git-all && \
      python3 -m pip install --upgrade pip wheel && \
      python3 -m pip install euroeval[all]

@@ -1,6 +1,6 @@
  MIT License

- Copyright (c) 2022-2025 Dan Saattrup Smart
+ Copyright (c) 2022-2026 Dan Saattrup Smart

  Permission is hereby granted, free of charge, to any person obtaining a copy
  of this software and associated documentation files (the "Software"), to deal
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: ScandEval
- Version: 16.10.1
+ Version: 16.12.0
  Summary: The robust European language model benchmark.
  Project-URL: Repository, https://github.com/EuroEval/EuroEval
  Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -8,7 +8,7 @@ Author-email: Dan Saattrup Smart <dan.smart@alexandra.dk>
  Maintainer-email: Dan Saattrup Smart <dan.smart@alexandra.dk>
  License: MIT License

- Copyright (c) 2022-2025 Dan Saattrup Smart
+ Copyright (c) 2022-2026 Dan Saattrup Smart

  Permission is hereby granted, free of charge, to any person obtaining a copy
  of this software and associated documentation files (the "Software"), to deal
@@ -28,7 +28,7 @@ License: MIT License
  OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
  SOFTWARE.
  License-File: LICENSE
- Requires-Python: <4.0,>=3.11
+ Requires-Python: <4.0,>=3.12
  Requires-Dist: accelerate>=1.9.0
  Requires-Dist: bert-score>=0.3.13
  Requires-Dist: click>=8.1.3
@@ -59,19 +59,23 @@ Requires-Dist: setuptools>=75.8.2
  Requires-Dist: tenacity>=9.0.0
  Requires-Dist: termcolor>=2.0.0
  Requires-Dist: torch>=2.6.0
- Requires-Dist: transformers[mistral-common]>=4.56.0
+ Requires-Dist: transformers[mistral-common]<5.0.0,>=4.56.0
  Provides-Extra: all
  Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'all'
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
  Requires-Dist: ray>=2.53.0; (platform_system == 'Linux') and extra == 'all'
  Requires-Dist: timm>=1.0.19; extra == 'all'
- Requires-Dist: vllm[flashinfer]==0.11.0; (platform_system == 'Linux') and extra == 'all'
+ Requires-Dist: vllm-metal>=0.1.0; (platform_system == 'Darwin') and extra == 'all'
+ Requires-Dist: vllm==0.11.0; (platform_system == 'Darwin') and extra == 'all'
+ Requires-Dist: vllm[flashinfer]>=0.14.1; (platform_system == 'Linux') and extra == 'all'
  Provides-Extra: generative
  Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
  Requires-Dist: ray>=2.53.0; (platform_system == 'Linux') and extra == 'generative'
  Requires-Dist: timm>=1.0.19; extra == 'generative'
- Requires-Dist: vllm[flashinfer]==0.11.0; (platform_system == 'Linux') and extra == 'generative'
+ Requires-Dist: vllm-metal>=0.1.0; (platform_system == 'Darwin') and extra == 'generative'
+ Requires-Dist: vllm==0.11.0; (platform_system == 'Darwin') and extra == 'generative'
+ Requires-Dist: vllm[flashinfer]>=0.14.1; (platform_system == 'Linux') and extra == 'generative'
  Description-Content-Type: text/markdown

  <!-- This disables the requirement that the first line is a top-level heading -->
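Because of these environment markers, the same extra resolves differently per platform. For instance (mirroring the Dockerfile in this diff, which installs `euroeval[all]`):

```bash
# Pulls vllm-metal plus vllm==0.11.0 on macOS, and vllm[flashinfer]>=0.14.1 on Linux
pip install 'euroeval[generative]'
```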
@@ -96,7 +100,7 @@ ______________________________________________________________________
  [![Second paper](https://img.shields.io/badge/arXiv-2406.13469-b31b1b.svg)](https://arxiv.org/abs/2406.13469)
  [![License](https://img.shields.io/github/license/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/blob/main/LICENSE)
  [![LastCommit](https://img.shields.io/github/last-commit/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/commits/main)
- [![Code Coverage](https://img.shields.io/badge/Coverage-70%25-yellow.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
+ [![Code Coverage](https://img.shields.io/badge/Coverage-74%25-yellow.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
  [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg)](https://github.com/EuroEval/EuroEval/blob/main/CODE_OF_CONDUCT.md)

  ## Maintainer
@@ -123,16 +127,17 @@ The easiest way to benchmark pretrained models is via the command line interface
  having installed the package, you can benchmark your favorite model like so:

  ```bash
- euroeval --model <model-id>
+ euroeval --model <model-id-or-path>
  ```

- Here `model` is the HuggingFace model ID, which can be found on the [HuggingFace
- Hub](https://huggingface.co/models). By default this will benchmark the model on all
- the tasks available. If you want to benchmark on a particular task, then use the
- `--task` argument:
+ Here `model` is either the HuggingFace model ID, which can be found on the [HuggingFace
+ Hub](https://huggingface.co/models), or a local path to a model directory (containing
+ the model files as well as the `config.json` file). By default this will benchmark the
+ model on all the tasks available. If you want to benchmark on a particular task, then
+ use the `--task` argument:

  ```bash
- euroeval --model <model-id> --task sentiment-classification
+ euroeval --model <model-id-or-path> --task sentiment-classification
  ```

  We can also narrow down which languages we would like to benchmark on. This can be done
@@ -140,20 +145,20 @@ by setting the `--language` argument. Here we thus benchmark the model on the Da
  sentiment classification task:

  ```bash
- euroeval --model <model-id> --task sentiment-classification --language da
+ euroeval --model <model-id-or-path> --task sentiment-classification --language da
  ```

  Multiple models, datasets and/or languages can be specified by just attaching multiple
  arguments. Here is an example with two models:

  ```bash
- euroeval --model <model-id1> --model <model-id2>
+ euroeval --model <model-id-or-path-1> --model <model-id-or-path-2>
  ```

  The specific model version/revision to use can also be added after the suffix '@':

  ```bash
- euroeval --model <model-id>@<commit>
+ euroeval --model <model-id-or-path>@<commit>
  ```

  This can be a branch name, a tag name, or a commit id. It defaults to 'main' for latest.
@@ -173,7 +178,7 @@ model:
  ```python
  >>> from euroeval import Benchmarker
  >>> benchmarker = Benchmarker()
- >>> benchmarker.benchmark(model="<model-id>")
+ >>> benchmarker.benchmark(model="<model-id-or-path>")
  ```

  To benchmark on a specific task and/or language, you simply specify the `task` or
@@ -181,7 +186,7 @@ To benchmark on a specific task and/or language, you simply specify the `task` o
  ```python
  >>> benchmarker.benchmark(
- ...     model="<model-id>",
+ ...     model="<model-id-or-path>",
  ...     task="sentiment-classification",
  ...     language="da",
  ... )
@@ -225,7 +230,7 @@ docker run -e args="<euroeval-arguments>" --gpus 1 --name euroeval --rm euroeval
  ```

  Here `<euroeval-arguments>` consists of the arguments added to the `euroeval` CLI
- argument. This could for instance be `--model <model-id> --task
+ argument. This could for instance be `--model <model-id-or-path> --task
  sentiment-classification`.

  ## Benchmarking custom inference APIs
@@ -291,14 +296,14 @@ script. For example to download the model you want and all of the Danish sentime
  classification datasets:

  ```bash
- euroeval --model <model-id> --task sentiment-classification --language da --download-only
+ euroeval --model <model-id-or-path> --task sentiment-classification --language da --download-only
  ```

  Or from a script:

  ```python
  >>> benchmarker.benchmark(
- ...     model="<model-id>",
+ ...     model="<model-id-or-path>",
  ...     task="sentiment-classification",
  ...     language="da",
  ...     download_only=True,
@@ -346,7 +351,7 @@ MY_CONFIG = DatasetConfig(
  You can then benchmark your custom dataset by simply running

  ```bash
- euroeval --dataset my-dataset --model <model-id>
+ euroeval --dataset my-dataset --model <model-id-or-path>
  ```

  You can also run the benchmark from a Python script, by simply providing your custom
@@ -356,7 +361,7 @@ dataset configuration directly into the `benchmark` method:
  from euroeval import Benchmarker

  benchmarker = Benchmarker()
- benchmarker.benchmark(model="<model-id>", dataset=MY_CONFIG)
+ benchmarker.benchmark(model="<model-id-or-path>", dataset=MY_CONFIG)
  ```

  We have included three convenience tasks to make it easier to set up custom datasets:
@@ -436,7 +441,7 @@ MY_SQL_DATASET = DatasetConfig(
  Again, with this you can benchmark your custom dataset by simply running

  ```bash
- euroeval --dataset my-sql-dataset --model <model-id>
+ euroeval --dataset my-sql-dataset --model <model-id-or-path>
  ```

  ## Reproducing the evaluation datasets
@@ -592,6 +597,27 @@ A huge thank you to all the contributors who have helped make this project a suc
      alt="Contributor avatar for tvosch"
    />
  </a>
+ <a href="https://github.com/Touzen">
+   <img
+     src="https://avatars.githubusercontent.com/u/1416265"
+     width=50
+     alt="Contributor avatar for Touzen"
+   />
+ </a>
+ <a href="https://github.com/caldaibis">
+   <img
+     src="https://avatars.githubusercontent.com/u/16032437"
+     width=50
+     alt="Contributor avatar for caldaibis"
+   />
+ </a>
+ <a href="https://github.com/SwekeR-463">
+   <img
+     src="https://avatars.githubusercontent.com/u/114919896?v=4"
+     width=50
+     alt="Contributor avatar for SwekeR-463"
+   />
+ </a>

  ### Contribute to EuroEval

@@ -20,7 +20,7 @@ ______________________________________________________________________
  [![Second paper](https://img.shields.io/badge/arXiv-2406.13469-b31b1b.svg)](https://arxiv.org/abs/2406.13469)
  [![License](https://img.shields.io/github/license/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/blob/main/LICENSE)
  [![LastCommit](https://img.shields.io/github/last-commit/EuroEval/EuroEval)](https://github.com/EuroEval/EuroEval/commits/main)
- [![Code Coverage](https://img.shields.io/badge/Coverage-70%25-yellow.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
+ [![Code Coverage](https://img.shields.io/badge/Coverage-74%25-yellow.svg)](https://github.com/EuroEval/EuroEval/tree/main/tests)
  [![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg)](https://github.com/EuroEval/EuroEval/blob/main/CODE_OF_CONDUCT.md)

  ## Maintainer
@@ -47,16 +47,17 @@ The easiest way to benchmark pretrained models is via the command line interface
  having installed the package, you can benchmark your favorite model like so:

  ```bash
- euroeval --model <model-id>
+ euroeval --model <model-id-or-path>
  ```

- Here `model` is the HuggingFace model ID, which can be found on the [HuggingFace
- Hub](https://huggingface.co/models). By default this will benchmark the model on all
- the tasks available. If you want to benchmark on a particular task, then use the
- `--task` argument:
+ Here `model` is either the HuggingFace model ID, which can be found on the [HuggingFace
+ Hub](https://huggingface.co/models), or a local path to a model directory (containing
+ the model files as well as the `config.json` file). By default this will benchmark the
+ model on all the tasks available. If you want to benchmark on a particular task, then
+ use the `--task` argument:

  ```bash
- euroeval --model <model-id> --task sentiment-classification
+ euroeval --model <model-id-or-path> --task sentiment-classification
  ```

  We can also narrow down which languages we would like to benchmark on. This can be done
@@ -64,20 +65,20 @@ by setting the `--language` argument. Here we thus benchmark the model on the Da
  sentiment classification task:

  ```bash
- euroeval --model <model-id> --task sentiment-classification --language da
+ euroeval --model <model-id-or-path> --task sentiment-classification --language da
  ```

  Multiple models, datasets and/or languages can be specified by just attaching multiple
  arguments. Here is an example with two models:

  ```bash
- euroeval --model <model-id1> --model <model-id2>
+ euroeval --model <model-id-or-path-1> --model <model-id-or-path-2>
  ```

  The specific model version/revision to use can also be added after the suffix '@':

  ```bash
- euroeval --model <model-id>@<commit>
+ euroeval --model <model-id-or-path>@<commit>
  ```

  This can be a branch name, a tag name, or a commit id. It defaults to 'main' for latest.
@@ -97,7 +98,7 @@ model:
  ```python
  >>> from euroeval import Benchmarker
  >>> benchmarker = Benchmarker()
- >>> benchmarker.benchmark(model="<model-id>")
+ >>> benchmarker.benchmark(model="<model-id-or-path>")
  ```

  To benchmark on a specific task and/or language, you simply specify the `task` or
@@ -105,7 +106,7 @@ To benchmark on a specific task and/or language, you simply specify the `task` o
  ```python
  >>> benchmarker.benchmark(
- ...     model="<model-id>",
+ ...     model="<model-id-or-path>",
  ...     task="sentiment-classification",
  ...     language="da",
  ... )
@@ -149,7 +150,7 @@ docker run -e args="<euroeval-arguments>" --gpus 1 --name euroeval --rm euroeval
  ```

  Here `<euroeval-arguments>` consists of the arguments added to the `euroeval` CLI
- argument. This could for instance be `--model <model-id> --task
+ argument. This could for instance be `--model <model-id-or-path> --task
  sentiment-classification`.

  ## Benchmarking custom inference APIs
@@ -215,14 +216,14 @@ script. For example to download the model you want and all of the Danish sentime
  classification datasets:

  ```bash
- euroeval --model <model-id> --task sentiment-classification --language da --download-only
+ euroeval --model <model-id-or-path> --task sentiment-classification --language da --download-only
  ```

  Or from a script:

  ```python
  >>> benchmarker.benchmark(
- ...     model="<model-id>",
+ ...     model="<model-id-or-path>",
  ...     task="sentiment-classification",
  ...     language="da",
  ...     download_only=True,
@@ -270,7 +271,7 @@ MY_CONFIG = DatasetConfig(
  You can then benchmark your custom dataset by simply running

  ```bash
- euroeval --dataset my-dataset --model <model-id>
+ euroeval --dataset my-dataset --model <model-id-or-path>
  ```

  You can also run the benchmark from a Python script, by simply providing your custom
@@ -280,7 +281,7 @@ dataset configuration directly into the `benchmark` method:
  from euroeval import Benchmarker

  benchmarker = Benchmarker()
- benchmarker.benchmark(model="<model-id>", dataset=MY_CONFIG)
+ benchmarker.benchmark(model="<model-id-or-path>", dataset=MY_CONFIG)
  ```

  We have included three convenience tasks to make it easier to set up custom datasets:
@@ -360,7 +361,7 @@ MY_SQL_DATASET = DatasetConfig(
  Again, with this you can benchmark your custom dataset by simply running

  ```bash
- euroeval --dataset my-sql-dataset --model <model-id>
+ euroeval --dataset my-sql-dataset --model <model-id-or-path>
  ```

  ## Reproducing the evaluation datasets
@@ -516,6 +517,27 @@ A huge thank you to all the contributors who have helped make this project a suc
      alt="Contributor avatar for tvosch"
    />
  </a>
+ <a href="https://github.com/Touzen">
+   <img
+     src="https://avatars.githubusercontent.com/u/1416265"
+     width=50
+     alt="Contributor avatar for Touzen"
+   />
+ </a>
+ <a href="https://github.com/caldaibis">
+   <img
+     src="https://avatars.githubusercontent.com/u/16032437"
+     width=50
+     alt="Contributor avatar for caldaibis"
+   />
+ </a>
+ <a href="https://github.com/SwekeR-463">
+   <img
+     src="https://avatars.githubusercontent.com/u/114919896?v=4"
+     width=50
+     alt="Contributor avatar for SwekeR-463"
+   />
+ </a>

  ### Contribute to EuroEval

@@ -1002,7 +1002,7 @@ Here are a few examples from the training split:

  ```json
  {
-   "text": "Natalie synes, at smaragder er smukke ædelstene, men Betty gør ikke. _ købte en halskæde med en stor smaragd. Hvad refererer det tomme _ til?\nSvarmuligheder:\na. Natalie\nb. Betty",
+   "text": "Jeg kunne ikke kontrollere fugten, som jeg kontrollerede regnen, fordi _ kom ind overalt. Hvad refererer det tomme _ til?\nSvarmuligheder:\na. fugt\nb. regn",
    "label": "a"
  }
  ```
@@ -1116,3 +1116,81 @@
  ```bash
  euroeval --model <model-id> --dataset nordjylland-news
  ```
+
+ ## European Values
+
+ ### ValEU-da
+
+ This dataset is the official Danish version of questions from the [European values
+ study](https://europeanvaluesstudy.eu/). The dataset contains multiple-choice
+ questions regarding people's values and beliefs across a variety of topics, such as
+ politics, religion and society.
+
+ The dataset consists of 52 questions from the 2017-2022 wave of the European values
+ study, where the questions were chosen by optimising for agreement within EU
+ countries. We use only zero-shot evaluation on this dataset, and thus require no
+ splits.
+
+ Here are a few examples from the training split:
+
+ ```json
+ {
+   "question_id": "C039",
+   "text": "Hvor enig eller uenig er du i følgende udsagn?\nDet er ens pligt over for samfundet at arbejde.\nSvarmuligheder:\na. Helt enig\nb. Enig\nc. Hverken enig eller uenig\nd. Uenig\ne. Helt uenig"
+ }
+ ```
+
+ ```json
+ {
+   "question_id": "F122",
+   "text": "Fortæl for hver af handlingerne på dette kort, i hvilken grad du billiger handlingen. 1 betyder, at du slet ikke billiger dem, 10 betyder, at du i høj grad billiger dem\nAktiv dødshjælp\nSvarmuligheder:\na. Aldrig\nb. 2\nc. 3\nd. 4\ne. 5\nf. 6\ng. 7\nh. 8\ni. 9\nj. Altid"
+ }
+ ```
+
+ ```json
+ {
+   "question_id": "C041",
+   "text": "Hvor enig eller uenig er du i følgende udsagn?\nArbejde kommer først, også selv om det betyder mindre fritid.\nSvarmuligheder:\na. Helt enig\nb. Enig\nc. Hverken enig eller uenig\nd. Uenig\ne. Helt uenig"
+ }
+ ```
+
+ When evaluating generative models, we use the following setup (see the
+ [methodology](/methodology) for more information on how these are used):
+
+ - Number of few-shot examples: 0
+ - Prefix prompt:
+
+   ```text
+   Følgende er multiple choice spørgsmål (med svar).
+   ```
+
+ - Base prompt template:
+
+   ```text
+   Spørgsmål: {text}
+   Svarmuligheder:
+   a. {option_a}
+   b. {option_b}
+   (...)
+   k. {option_k}
+   Svar: {label}
+   ```
+
+ - Instruction-tuned prompt template:
+
+   ```text
+   Spørgsmål: {text}
+   Svarmuligheder:
+   a. {option_a}
+   b. {option_b}
+   (...)
+   k. {option_k}
+
+   Besvar ovenstående spørgsmål ved at svare med 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h',
+   'i', 'j' eller 'k', og intet andet.
+   ```
+
+ You can evaluate this dataset directly as follows:
+
+ ```bash
+ euroeval --model <model-id> --dataset valeu-da
+ ```