ScandEval 16.10.1.tar.gz → 16.11.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (378)
  1. {scandeval-16.10.1 → scandeval-16.11.0}/.pre-commit-config.yaml +4 -4
  2. {scandeval-16.10.1 → scandeval-16.11.0}/CHANGELOG.md +40 -0
  3. {scandeval-16.10.1 → scandeval-16.11.0}/CONTRIBUTING.md +1 -1
  4. {scandeval-16.10.1 → scandeval-16.11.0}/LICENSE +1 -1
  5. {scandeval-16.10.1 → scandeval-16.11.0}/PKG-INFO +27 -19
  6. {scandeval-16.10.1 → scandeval-16.11.0}/README.md +25 -17
  7. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/danish.md +78 -0
  8. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/dutch.md +78 -0
  9. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/english.md +78 -0
  10. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/estonian.md +79 -1
  11. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/finnish.md +78 -0
  12. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/french.md +78 -0
  13. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/german.md +101 -0
  14. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/icelandic.md +78 -0
  15. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/italian.md +78 -0
  16. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/norwegian.md +78 -0
  17. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/polish.md +78 -0
  18. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/portuguese.md +87 -9
  19. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/spanish.md +85 -7
  20. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/swedish.md +84 -6
  21. {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/README.md +4 -7
  22. scandeval-16.11.0/docs/tasks/european-values.md +33 -0
  23. scandeval-16.11.0/docs/tasks/simplification.md +36 -0
  24. {scandeval-16.10.1 → scandeval-16.11.0}/pyproject.toml +1 -1
  25. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmark_modules/hf.py +14 -1
  26. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmark_modules/litellm.py +111 -22
  27. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmark_modules/vllm.py +111 -56
  28. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmarker.py +13 -6
  29. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/data_models.py +2 -2
  30. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/logging_utils.py +1 -1
  31. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/metrics/huggingface.py +3 -2
  32. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/metrics/llm_as_a_judge.py +79 -15
  33. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/model_loading.py +2 -1
  34. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/task_group_utils/sequence_classification.py +12 -3
  35. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/types.py +39 -0
  36. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/utils.py +29 -4
  37. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/load_ud_pos.py +11 -0
  38. {scandeval-16.10.1 → scandeval-16.11.0}/uv.lock +1 -1
  39. scandeval-16.10.1/docs/tasks/simplification.md +0 -42
  40. {scandeval-16.10.1 → scandeval-16.11.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +0 -0
  41. {scandeval-16.10.1 → scandeval-16.11.0}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
  42. {scandeval-16.10.1 → scandeval-16.11.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
  43. {scandeval-16.10.1 → scandeval-16.11.0}/.github/ISSUE_TEMPLATE/language_request.yaml +0 -0
  44. {scandeval-16.10.1 → scandeval-16.11.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +0 -0
  45. {scandeval-16.10.1 → scandeval-16.11.0}/.github/workflows/ci.yaml +0 -0
  46. {scandeval-16.10.1 → scandeval-16.11.0}/.gitignore +0 -0
  47. {scandeval-16.10.1 → scandeval-16.11.0}/.markdownlint.jsonc +0 -0
  48. {scandeval-16.10.1 → scandeval-16.11.0}/CITATION.cff +0 -0
  49. {scandeval-16.10.1 → scandeval-16.11.0}/CODE_OF_CONDUCT.md +0 -0
  50. {scandeval-16.10.1 → scandeval-16.11.0}/Dockerfile.cuda +0 -0
  51. {scandeval-16.10.1 → scandeval-16.11.0}/NEW_DATASET_GUIDE.md +0 -0
  52. {scandeval-16.10.1 → scandeval-16.11.0}/docs/CNAME +0 -0
  53. {scandeval-16.10.1 → scandeval-16.11.0}/docs/README.md +0 -0
  54. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/README.md +0 -0
  55. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/albanian.md +0 -0
  56. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/bosnian.md +0 -0
  57. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/bulgarian.md +0 -0
  58. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/catalan.md +0 -0
  59. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/croatian.md +0 -0
  60. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/czech.md +0 -0
  61. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/faroese.md +0 -0
  62. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/greek.md +0 -0
  63. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/hungarian.md +0 -0
  64. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/latvian.md +0 -0
  65. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/lithuanian.md +0 -0
  66. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/romanian.md +0 -0
  67. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/serbian.md +0 -0
  68. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/slovak.md +0 -0
  69. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/slovene.md +0 -0
  70. {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/ukrainian.md +0 -0
  71. {scandeval-16.10.1 → scandeval-16.11.0}/docs/extras/radial_plotter.md +0 -0
  72. {scandeval-16.10.1 → scandeval-16.11.0}/docs/faq.md +0 -0
  73. {scandeval-16.10.1 → scandeval-16.11.0}/docs/gfx/favicon.png +0 -0
  74. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/albanian.md +0 -0
  75. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/bosnian.md +0 -0
  76. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/bulgarian.md +0 -0
  77. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/catalan.md +0 -0
  78. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/croatian.md +0 -0
  79. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/czech.md +0 -0
  80. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/danish.md +0 -0
  81. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/dutch.md +0 -0
  82. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/english.md +0 -0
  83. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/estonian.md +0 -0
  84. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/faroese.md +0 -0
  85. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/finnish.md +0 -0
  86. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/french.md +0 -0
  87. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/german.md +0 -0
  88. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/greek.md +0 -0
  89. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/hungarian.md +0 -0
  90. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/icelandic.md +0 -0
  91. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/italian.md +0 -0
  92. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/latvian.md +0 -0
  93. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/lithuanian.md +0 -0
  94. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/norwegian.md +0 -0
  95. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/polish.md +0 -0
  96. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/portuguese.md +0 -0
  97. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/romanian.md +0 -0
  98. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/serbian.md +0 -0
  99. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/slovak.md +0 -0
  100. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/slovene.md +0 -0
  101. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/spanish.md +0 -0
  102. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/swedish.md +0 -0
  103. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/ukrainian.md +0 -0
  104. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Multilingual/baltic.md +0 -0
  105. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Multilingual/european.md +0 -0
  106. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Multilingual/finnic.md +0 -0
  107. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Multilingual/germanic.md +0 -0
  108. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
  109. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Multilingual/romance.md +0 -0
  110. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Multilingual/slavic.md +0 -0
  111. {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/README.md +0 -0
  112. {scandeval-16.10.1 → scandeval-16.11.0}/docs/methodology.md +0 -0
  113. {scandeval-16.10.1 → scandeval-16.11.0}/docs/python-package.md +0 -0
  114. {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/common-sense-reasoning.md +0 -0
  115. {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/knowledge.md +0 -0
  116. {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/linguistic-acceptability.md +0 -0
  117. {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/named-entity-recognition.md +0 -0
  118. {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/reading-comprehension.md +0 -0
  119. {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/sentiment-classification.md +0 -0
  120. {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/speed.md +0 -0
  121. {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/summarization.md +0 -0
  122. {scandeval-16.10.1 → scandeval-16.11.0}/gfx/euroeval.png +0 -0
  123. {scandeval-16.10.1 → scandeval-16.11.0}/gfx/euroeval.xcf +0 -0
  124. {scandeval-16.10.1 → scandeval-16.11.0}/gfx/scandeval.png +0 -0
  125. {scandeval-16.10.1 → scandeval-16.11.0}/makefile +0 -0
  126. {scandeval-16.10.1 → scandeval-16.11.0}/mkdocs.yaml +0 -0
  127. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/__init__.py +0 -0
  128. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmark_config_factory.py +0 -0
  129. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmark_modules/__init__.py +0 -0
  130. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmark_modules/base.py +0 -0
  131. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmark_modules/fresh.py +0 -0
  132. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/caching_utils.py +0 -0
  133. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/callbacks.py +0 -0
  134. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/cli.py +0 -0
  135. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/constants.py +0 -0
  136. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/data_loading.py +0 -0
  137. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/__init__.py +0 -0
  138. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/albanian.py +0 -0
  139. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/bosnian.py +0 -0
  140. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/bulgarian.py +0 -0
  141. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/catalan.py +0 -0
  142. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/croatian.py +0 -0
  143. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/czech.py +0 -0
  144. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/danish.py +0 -0
  145. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/dutch.py +0 -0
  146. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/english.py +0 -0
  147. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/estonian.py +0 -0
  148. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/faroese.py +0 -0
  149. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/finnish.py +0 -0
  150. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/french.py +0 -0
  151. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/german.py +0 -0
  152. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/greek.py +0 -0
  153. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/hungarian.py +0 -0
  154. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/icelandic.py +0 -0
  155. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/italian.py +0 -0
  156. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/latvian.py +0 -0
  157. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/lithuanian.py +0 -0
  158. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/norwegian.py +0 -0
  159. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/polish.py +0 -0
  160. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/portuguese.py +0 -0
  161. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/romanian.py +0 -0
  162. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/serbian.py +0 -0
  163. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/slovak.py +0 -0
  164. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/slovene.py +0 -0
  165. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/spanish.py +0 -0
  166. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/swedish.py +0 -0
  167. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/ukrainian.py +0 -0
  168. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/enums.py +0 -0
  169. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/exceptions.py +0 -0
  170. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/finetuning.py +0 -0
  171. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/generation.py +0 -0
  172. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/generation_utils.py +0 -0
  173. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/languages.py +0 -0
  174. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/metrics/__init__.py +0 -0
  175. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/metrics/base.py +0 -0
  176. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/metrics/pipeline.py +0 -0
  177. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/metrics/speed.py +0 -0
  178. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/model_cache.py +0 -0
  179. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/model_config.py +0 -0
  180. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/__init__.py +0 -0
  181. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/classification.py +0 -0
  182. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/linguistic_acceptability.py +0 -0
  183. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/multiple_choice.py +0 -0
  184. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/named_entity_recognition.py +0 -0
  185. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/reading_comprehension.py +0 -0
  186. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/sentiment_classification.py +0 -0
  187. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/simplification.py +0 -0
  188. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/summarization.py +0 -0
  189. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/token_classification.py +0 -0
  190. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/scores.py +0 -0
  191. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/speed_benchmark.py +0 -0
  192. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/task_group_utils/__init__.py +0 -0
  193. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/task_group_utils/multiple_choice_classification.py +0 -0
  194. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/task_group_utils/question_answering.py +0 -0
  195. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/task_group_utils/text_to_text.py +0 -0
  196. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/task_group_utils/token_classification.py +0 -0
  197. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/tasks.py +0 -0
  198. {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/tokenisation_utils.py +0 -0
  199. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/__init__.py +0 -0
  200. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/constants.py +0 -0
  201. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_allocine.py +0 -0
  202. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_angry_tweets.py +0 -0
  203. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_arc.py +0 -0
  204. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_arc_is.py +0 -0
  205. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_atsiliepimai.py +0 -0
  206. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_belebele.py +0 -0
  207. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_bg_ner_bsnlp.py +0 -0
  208. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_boolq_pt.py +0 -0
  209. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_cinexio.py +0 -0
  210. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_cnn_dailymail.py +0 -0
  211. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_conll_en.py +0 -0
  212. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_conll_es.py +0 -0
  213. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_conll_nl.py +0 -0
  214. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_copa_lv.py +0 -0
  215. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_copa_nl.py +0 -0
  216. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_cross_domain_uk_reviews.py +0 -0
  217. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_cs_gec.py +0 -0
  218. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_csfd_sentiment.py +0 -0
  219. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_csfd_sentiment_sk.py +0 -0
  220. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_czech_news.py +0 -0
  221. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_dacsa.py +0 -0
  222. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_dane.py +0 -0
  223. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_danish_citizen_tests.py +0 -0
  224. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_dansk.py +0 -0
  225. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_danske_talemaader.py +0 -0
  226. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_danske_talemaader_old.py +0 -0
  227. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_dbrd.py +0 -0
  228. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_duidelijke_taal.py +0 -0
  229. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_dutch_cola.py +0 -0
  230. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_elner.py +0 -0
  231. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_eltec.py +0 -0
  232. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_err_news.py +0 -0
  233. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_estner.py +0 -0
  234. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_estonian_valence.py +0 -0
  235. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_european_values.py +0 -0
  236. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_exam_et.py +0 -0
  237. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_exams_bg.py +0 -0
  238. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_fone.py +0 -0
  239. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_foqa.py +0 -0
  240. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_fosent.py +0 -0
  241. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_fquad.py +0 -0
  242. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_fullstack_ner.py +0 -0
  243. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_germanquad.py +0 -0
  244. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_germeval.py +0 -0
  245. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_global_mmlu.py +0 -0
  246. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_goldenswag.py +0 -0
  247. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_grammar_et.py +0 -0
  248. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_greek_sa.py +0 -0
  249. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_greek_wikipedia.py +0 -0
  250. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_guia_cat.py +0 -0
  251. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_harem.py +0 -0
  252. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_hellaswag.py +0 -0
  253. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_hellaswag_cs.py +0 -0
  254. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_hellaswag_fi.py +0 -0
  255. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
  256. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_hun_sum.py +0 -0
  257. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_husst.py +0 -0
  258. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_ice_linguistic.py +0 -0
  259. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_icelandic_error_corpus.py +0 -0
  260. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_icelandic_knowledge.py +0 -0
  261. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_icelandic_qa.py +0 -0
  262. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_icesum.py +0 -0
  263. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_idioms_no.py +0 -0
  264. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_ilpost_sum.py +0 -0
  265. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_jentoft.py +0 -0
  266. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_kpwr_ner.py +0 -0
  267. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_latvian_lsm_summary.py +0 -0
  268. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_latvian_twitter_sentiment.py +0 -0
  269. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_life_in_the_uk.py +0 -0
  270. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_lithuanian_lrytas_summarization.py +0 -0
  271. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_llmzszl.py +0 -0
  272. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_lr_sum.py +0 -0
  273. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_lt_emotions.py +0 -0
  274. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_lt_history.py +0 -0
  275. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mim_gold_ner.py +0 -0
  276. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mlqa_es.py +0 -0
  277. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mlsum_de.py +0 -0
  278. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mlsum_es.py +0 -0
  279. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mmlu.py +0 -0
  280. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mmlu_et.py +0 -0
  281. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mmlu_hr.py +0 -0
  282. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mmlu_lv.py +0 -0
  283. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mms.py +0 -0
  284. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_multi_wiki_qa.py +0 -0
  285. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_multinerd-it.py +0 -0
  286. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_ner_uk.py +0 -0
  287. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_no_cola.py +0 -0
  288. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_no_sammendrag.py +0 -0
  289. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_nor_common_sense_qa.py +0 -0
  290. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_nordjylland_news.py +0 -0
  291. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_norec.py +0 -0
  292. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_norglm_multiqa.py +0 -0
  293. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_norglm_multisum.py +0 -0
  294. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_norne.py +0 -0
  295. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_norquad.py +0 -0
  296. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_nqii.py +0 -0
  297. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_nrk_quiz_qa.py +0 -0
  298. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_orange_sum.py +0 -0
  299. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_personal_sum.py +0 -0
  300. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_polemo2.py +0 -0
  301. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_poner.py +0 -0
  302. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_poquad.py +0 -0
  303. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_psc.py +0 -0
  304. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_publico.py +0 -0
  305. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_ronec.py +0 -0
  306. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_rosent.py +0 -0
  307. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_rrn.py +0 -0
  308. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sb10k.py +0 -0
  309. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_scala.py +0 -0
  310. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_scandiqa.py +0 -0
  311. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_scandisent_fi.py +0 -0
  312. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_schibsted.py +0 -0
  313. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sentiment_headlines_es.py +0 -0
  314. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sentinews.py +0 -0
  315. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sentipolc16.py +0 -0
  316. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_skolprov.py +0 -0
  317. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sqad.py +0 -0
  318. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_squad.py +0 -0
  319. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_squad_it.py +0 -0
  320. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_squad_nl.py +0 -0
  321. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_squad_nl_old.py +0 -0
  322. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_ssj500k_ner.py +0 -0
  323. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sst2_pt.py +0 -0
  324. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sst5.py +0 -0
  325. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_suc3.py +0 -0
  326. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sumo_ro.py +0 -0
  327. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_swedish_facts.py +0 -0
  328. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_swedn.py +0 -0
  329. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_swerec.py +0 -0
  330. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_szeged_ner.py +0 -0
  331. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_trivia_et.py +0 -0
  332. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_turku_ner_fi.py +0 -0
  333. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_tydiqa_fi.py +0 -0
  334. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_umimeto_qa.py +0 -0
  335. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_uner_sk.py +0 -0
  336. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_uner_sr.py +0 -0
  337. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_wiki_lingua_nl.py +0 -0
  338. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_wikiann.py +0 -0
  339. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_wikineural-it.py +0 -0
  340. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_winogrande.py +0 -0
  341. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_winogrande_et.py +0 -0
  342. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_winogrande_is.py +0 -0
  343. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_xlsum_fi.py +0 -0
  344. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_xquad.py +0 -0
  345. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/fix_dot_env_file.py +0 -0
  346. {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/versioning.py +0 -0
  347. {scandeval-16.10.1 → scandeval-16.11.0}/tests/__init__.py +0 -0
  348. {scandeval-16.10.1 → scandeval-16.11.0}/tests/conftest.py +0 -0
  349. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_benchmark_config_factory.py +0 -0
  350. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_benchmark_modules/__init__.py +0 -0
  351. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_benchmark_modules/test_hf.py +0 -0
  352. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_benchmarker.py +0 -0
  353. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_callbacks.py +0 -0
  354. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_cli.py +0 -0
  355. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_constants.py +0 -0
  356. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_data_loading.py +0 -0
  357. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_data_models.py +0 -0
  358. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_dataset_configs.py +0 -0
  359. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_enums.py +0 -0
  360. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_exceptions.py +0 -0
  361. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_finetuning.py +0 -0
  362. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_languages.py +0 -0
  363. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_model_config.py +0 -0
  364. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_model_loading.py +0 -0
  365. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scores.py +0 -0
  366. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/__init__.py +0 -0
  367. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/__init__.py +0 -0
  368. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/test_create_scala.py +0 -0
  369. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/test_data/de_gsd-ud-train.conllu.adp_det +0 -0
  370. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/test_data/empty.file +0 -0
  371. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/test_data/en_gum-ud-train.conllu.case +0 -0
  372. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_01 +0 -0
  373. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_02 +0 -0
  374. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_03 +0 -0
  375. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_speed_benchmark.py +0 -0
  376. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_tokenisation_utils.py +0 -0
  377. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_types.py +0 -0
  378. {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_utils.py +0 -0
{scandeval-16.10.1 → scandeval-16.11.0}/.pre-commit-config.yaml

@@ -8,9 +8,9 @@ repos:
   hooks:
   - id: end-of-file-fixer
   - id: trailing-whitespace
-  # - id: debug-statements
+  - id: debug-statements
   - repo: https://github.com/astral-sh/ruff-pre-commit
-  rev: v0.14.10
+  rev: v0.14.13
   hooks:
   - id: ruff
   args:
@@ -30,11 +30,11 @@ repos:
   - pyi
   - jupyter
   - repo: https://github.com/kynan/nbstripout
-  rev: 0.8.2
+  rev: 0.9.0
   hooks:
   - id: nbstripout
   - repo: https://github.com/facebook/pyrefly-pre-commit
-  rev: 0.46.3
+  rev: 0.49.0
   hooks:
   - id: pyrefly-check
   name: Pyrefly (type checking)
{scandeval-16.10.1 → scandeval-16.11.0}/CHANGELOG.md

@@ -7,6 +7,46 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.

 ## [Unreleased]

+## [v16.11.0] - 2026-01-21
+
+### Added
+
+- Added model metadata for GPT 5.2.
+- Added better support for unofficial inference providers, allowing model prefixes even
+  if they're not in LiteLLM's official list of providers. Currently this only works with
+  the "ordbogen/" prefix for models available on ordbogen.dk.
+
+### Changed
+
+- LLM-as-a-Judge metrics now support batch scoring across multiple judge outputs.
+- When evaluating datasets with no validation split, we now set the `validation_split`
+  in the resulting JSONL file to `null` rather than `True`, to avoid confusion.
+  Likewise, if a task requires zero-shot evaluation, we set `few_shot` to `null` rather
+  than a Boolean value.
+- When evaluating a reasoning model on a sequence classification task, if the model
+  outputs an answer that starts with one of the candidate labels, we now use that label
+  as the predicted label. Previously, we would have conducted a word edit distance
+  search to find the closest candidate label, which was almost always correct, but not
+  in all cases.
+
+### Fixed
+
+- Quantized models in vLLM now have their dtype inferred automatically, removing
+  explicit dtype casting based on GPU compute capability. This was contributed by
+  @tvosch ✨
+- Evaluation of local vLLM models when no internet connection was available did not
+  work correctly; this has been fixed now. This was contributed by @Touzen ✨
+- More robust detection and handling of errors related to too-long inputs for vLLM
+  models.
+- Some API models need the `logprobs` argument to be a Boolean rather than an integer.
+  This has been fixed now.
+- Better handling of rate limits when evaluating API models, by backing off more
+  aggressively when hitting rate limits.
+- Now truncates prompts for instruction-following models in a smarter way, by removing
+  few-shot examples one by one until the prompt is short enough, rather than just
+  truncating the prompt to the maximum length. This only affects models whose maximum
+  model length is quite small (roughly 5,000 tokens or less).
+
 ## [v16.10.1] - 2026-01-02

 ### Changed
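
The label-matching change above is easiest to see in code. Below is a minimal sketch, not EuroEval's actual implementation: `output` and `candidate_labels` are hypothetical inputs, and a character-level Levenshtein distance stands in for the word edit distance mentioned in the changelog.

```python
def closest_label(output: str, candidate_labels: list[str]) -> str:
    """Map a model's free-form answer to one of the candidate labels."""
    answer = output.strip().lower()

    # New behaviour: if the answer starts with a candidate label, use it.
    for label in candidate_labels:
        if answer.startswith(label.lower()):
            return label

    # Previous behaviour: fall back to an edit distance search.
    def edit_distance(a: str, b: str) -> int:
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            cur = [i]
            for j, cb in enumerate(b, start=1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    return min(candidate_labels, key=lambda label: edit_distance(answer, label.lower()))


print(closest_label("positive, because the review praises the film", ["positive", "negative"]))
# -> positive
```

Likewise, the smarter prompt truncation can be pictured as dropping few-shot examples one at a time until the prompt fits, instead of hard-truncating the prompt. This is also just an illustrative sketch, with `count_tokens` standing in for a tokeniser.

```python
def build_prompt(
    instruction: str, few_shot_examples: list[str], max_tokens: int, count_tokens
) -> str:
    """Drop few-shot examples one by one until the prompt fits in max_tokens."""
    examples = list(few_shot_examples)
    while examples:
        prompt = "\n\n".join(examples + [instruction])
        if count_tokens(prompt) <= max_tokens:
            return prompt
        examples.pop()  # drop one few-shot example and try again
    return instruction  # zero-shot as a last resort
```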
{scandeval-16.10.1 → scandeval-16.11.0}/CONTRIBUTING.md

@@ -72,7 +72,7 @@ guide](https://github.com/atom/atom/blob/master/CONTRIBUTING.md#git-commit-messa
 know how to use emoji for commit messages.

 Once your changes are ready, don't forget to
-[self-review](/contributing/self-review.md) to speed up the review process:zap:.
+self-review to speed up the review process:zap:.

 ### Pull Request

{scandeval-16.10.1 → scandeval-16.11.0}/LICENSE

@@ -1,6 +1,6 @@
 MIT License

-Copyright (c) 2022-2025 Dan Saattrup Smart
+Copyright (c) 2022-2026 Dan Saattrup Smart

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
{scandeval-16.10.1 → scandeval-16.11.0}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: ScandEval
-Version: 16.10.1
+Version: 16.11.0
 Summary: The robust European language model benchmark.
 Project-URL: Repository, https://github.com/EuroEval/EuroEval
 Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -8,7 +8,7 @@ Author-email: Dan Saattrup Smart <dan.smart@alexandra.dk>
 Maintainer-email: Dan Saattrup Smart <dan.smart@alexandra.dk>
 License: MIT License

-Copyright (c) 2022-2025 Dan Saattrup Smart
+Copyright (c) 2022-2026 Dan Saattrup Smart

 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
@@ -123,16 +123,17 @@ The easiest way to benchmark pretrained models is via the command line interface
 having installed the package, you can benchmark your favorite model like so:

 ```bash
-euroeval --model <model-id>
+euroeval --model <model-id-or-path>
 ```

-Here `model` is the HuggingFace model ID, which can be found on the [HuggingFace
-Hub](https://huggingface.co/models). By default this will benchmark the model on all
-the tasks available. If you want to benchmark on a particular task, then use the
-`--task` argument:
+Here `model` is either the HuggingFace model ID, which can be found on the [HuggingFace
+Hub](https://huggingface.co/models), or a local path to a model directory (containing
+the model files as well as the `config.json` file). By default this will benchmark the
+model on all the tasks available. If you want to benchmark on a particular task, then
+use the `--task` argument:

 ```bash
-euroeval --model <model-id> --task sentiment-classification
+euroeval --model <model-id-or-path> --task sentiment-classification
 ```

 We can also narrow down which languages we would like to benchmark on. This can be done
@@ -140,20 +141,20 @@ by setting the `--language` argument. Here we thus benchmark the model on the Da
 sentiment classification task:

 ```bash
-euroeval --model <model-id> --task sentiment-classification --language da
+euroeval --model <model-id-or-path> --task sentiment-classification --language da
 ```

 Multiple models, datasets and/or languages can be specified by just attaching multiple
 arguments. Here is an example with two models:

 ```bash
-euroeval --model <model-id1> --model <model-id2>
+euroeval --model <model-id-or-path-1> --model <model-id-or-path-2>
 ```

 The specific model version/revision to use can also be added after the suffix '@':

 ```bash
-euroeval --model <model-id>@<commit>
+euroeval --model <model-id-or-path>@<commit>
 ```

 This can be a branch name, a tag name, or a commit id. It defaults to 'main' for latest.
@@ -173,7 +174,7 @@ model:
 ```python
 >>> from euroeval import Benchmarker
 >>> benchmarker = Benchmarker()
->>> benchmarker.benchmark(model="<model-id>")
+>>> benchmarker.benchmark(model="<model-id-or-path>")
 ```

 To benchmark on a specific task and/or language, you simply specify the `task` or
@@ -181,7 +182,7 @@ To benchmark on a specific task and/or language, you simply specify the `task` o

 ```python
 >>> benchmarker.benchmark(
-...     model="<model-id>",
+...     model="<model-id-or-path>",
 ...     task="sentiment-classification",
 ...     language="da",
 ... )
@@ -225,7 +226,7 @@ docker run -e args="<euroeval-arguments>" --gpus 1 --name euroeval --rm euroeval
 ```

 Here `<euroeval-arguments>` consists of the arguments added to the `euroeval` CLI
-argument. This could for instance be `--model <model-id> --task
+argument. This could for instance be `--model <model-id-or-path> --task
 sentiment-classification`.

 ## Benchmarking custom inference APIs
@@ -291,14 +292,14 @@ script. For example to download the model you want and all of the Danish sentime
 classification datasets:

 ```bash
-euroeval --model <model-id> --task sentiment-classification --language da --download-only
+euroeval --model <model-id-or-path> --task sentiment-classification --language da --download-only
 ```

 Or from a script:

 ```python
 >>> benchmarker.benchmark(
-...     model="<model-id>",
+...     model="<model-id-or-path>",
 ...     task="sentiment-classification",
 ...     language="da",
 ...     download_only=True,
@@ -346,7 +347,7 @@ MY_CONFIG = DatasetConfig(
 You can then benchmark your custom dataset by simply running

 ```bash
-euroeval --dataset my-dataset --model <model-id>
+euroeval --dataset my-dataset --model <model-id-or-path>
 ```

 You can also run the benchmark from a Python script, by simply providing your custom
@@ -356,7 +357,7 @@ dataset configuration directly into the `benchmark` method:
 from euroeval import Benchmarker

 benchmarker = Benchmarker()
-benchmarker.benchmark(model="<model-id>", dataset=MY_CONFIG)
+benchmarker.benchmark(model="<model-id-or-path>", dataset=MY_CONFIG)
 ```

 We have included three convenience tasks to make it easier to set up custom datasets:
@@ -436,7 +437,7 @@ MY_SQL_DATASET = DatasetConfig(
 Again, with this you can benchmark your custom dataset by simply running

 ```bash
-euroeval --dataset my-sql-dataset --model <model-id>
+euroeval --dataset my-sql-dataset --model <model-id-or-path>
 ```

 ## Reproducing the evaluation datasets
@@ -592,6 +593,13 @@ A huge thank you to all the contributors who have helped make this project a suc
     alt="Contributor avatar for tvosch"
   />
 </a>
+<a href="https://github.com/Touzen">
+  <img
+    src="https://avatars.githubusercontent.com/u/1416265"
+    width=50
+    alt="Contributor avatar for Touzen"
+  />
+</a>

 ### Contribute to EuroEval

{scandeval-16.10.1 → scandeval-16.11.0}/README.md

@@ -47,16 +47,17 @@ The easiest way to benchmark pretrained models is via the command line interface
 having installed the package, you can benchmark your favorite model like so:

 ```bash
-euroeval --model <model-id>
+euroeval --model <model-id-or-path>
 ```

-Here `model` is the HuggingFace model ID, which can be found on the [HuggingFace
-Hub](https://huggingface.co/models). By default this will benchmark the model on all
-the tasks available. If you want to benchmark on a particular task, then use the
-`--task` argument:
+Here `model` is either the HuggingFace model ID, which can be found on the [HuggingFace
+Hub](https://huggingface.co/models), or a local path to a model directory (containing
+the model files as well as the `config.json` file). By default this will benchmark the
+model on all the tasks available. If you want to benchmark on a particular task, then
+use the `--task` argument:

 ```bash
-euroeval --model <model-id> --task sentiment-classification
+euroeval --model <model-id-or-path> --task sentiment-classification
 ```

 We can also narrow down which languages we would like to benchmark on. This can be done
@@ -64,20 +65,20 @@ by setting the `--language` argument. Here we thus benchmark the model on the Da
 sentiment classification task:

 ```bash
-euroeval --model <model-id> --task sentiment-classification --language da
+euroeval --model <model-id-or-path> --task sentiment-classification --language da
 ```

 Multiple models, datasets and/or languages can be specified by just attaching multiple
 arguments. Here is an example with two models:

 ```bash
-euroeval --model <model-id1> --model <model-id2>
+euroeval --model <model-id-or-path-1> --model <model-id-or-path-2>
 ```

 The specific model version/revision to use can also be added after the suffix '@':

 ```bash
-euroeval --model <model-id>@<commit>
+euroeval --model <model-id-or-path>@<commit>
 ```

 This can be a branch name, a tag name, or a commit id. It defaults to 'main' for latest.
@@ -97,7 +98,7 @@ model:
 ```python
 >>> from euroeval import Benchmarker
 >>> benchmarker = Benchmarker()
->>> benchmarker.benchmark(model="<model-id>")
+>>> benchmarker.benchmark(model="<model-id-or-path>")
 ```

 To benchmark on a specific task and/or language, you simply specify the `task` or
@@ -105,7 +106,7 @@ To benchmark on a specific task and/or language, you simply specify the `task` o

 ```python
 >>> benchmarker.benchmark(
-...     model="<model-id>",
+...     model="<model-id-or-path>",
 ...     task="sentiment-classification",
 ...     language="da",
 ... )
@@ -149,7 +150,7 @@ docker run -e args="<euroeval-arguments>" --gpus 1 --name euroeval --rm euroeval
 ```

 Here `<euroeval-arguments>` consists of the arguments added to the `euroeval` CLI
-argument. This could for instance be `--model <model-id> --task
+argument. This could for instance be `--model <model-id-or-path> --task
 sentiment-classification`.

 ## Benchmarking custom inference APIs
@@ -215,14 +216,14 @@ script. For example to download the model you want and all of the Danish sentime
 classification datasets:

 ```bash
-euroeval --model <model-id> --task sentiment-classification --language da --download-only
+euroeval --model <model-id-or-path> --task sentiment-classification --language da --download-only
 ```

 Or from a script:

 ```python
 >>> benchmarker.benchmark(
-...     model="<model-id>",
+...     model="<model-id-or-path>",
 ...     task="sentiment-classification",
 ...     language="da",
 ...     download_only=True,
@@ -270,7 +271,7 @@ MY_CONFIG = DatasetConfig(
 You can then benchmark your custom dataset by simply running

 ```bash
-euroeval --dataset my-dataset --model <model-id>
+euroeval --dataset my-dataset --model <model-id-or-path>
 ```

 You can also run the benchmark from a Python script, by simply providing your custom
@@ -280,7 +281,7 @@ dataset configuration directly into the `benchmark` method:
 from euroeval import Benchmarker

 benchmarker = Benchmarker()
-benchmarker.benchmark(model="<model-id>", dataset=MY_CONFIG)
+benchmarker.benchmark(model="<model-id-or-path>", dataset=MY_CONFIG)
 ```

 We have included three convenience tasks to make it easier to set up custom datasets:
@@ -360,7 +361,7 @@ MY_SQL_DATASET = DatasetConfig(
 Again, with this you can benchmark your custom dataset by simply running

 ```bash
-euroeval --dataset my-sql-dataset --model <model-id>
+euroeval --dataset my-sql-dataset --model <model-id-or-path>
 ```

 ## Reproducing the evaluation datasets
@@ -516,6 +517,13 @@ A huge thank you to all the contributors who have helped make this project a suc
     alt="Contributor avatar for tvosch"
   />
 </a>
+<a href="https://github.com/Touzen">
+  <img
+    src="https://avatars.githubusercontent.com/u/1416265"
+    width=50
+    alt="Contributor avatar for Touzen"
+  />
+</a>

 ### Contribute to EuroEval

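With the `<model-id-or-path>` change above, an evaluation of a model stored on disk might look as follows. This is a sketch, where `./models/my-model` is a hypothetical directory containing the model weights and its `config.json` file.

```python
from euroeval import Benchmarker

benchmarker = Benchmarker()
benchmarker.benchmark(
    model="./models/my-model",  # a local path instead of a Hub model ID
    task="sentiment-classification",
    language="da",
)
```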
{scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/danish.md

@@ -1116,3 +1116,81 @@ You can evaluate this dataset directly as follows:
 ```bash
 euroeval --model <model-id> --dataset nordjylland-news
 ```
+
+## European Values
+
+### ValEU-da
+
+This dataset is the official Danish version of questions from the [European values
+study](https://europeanvaluesstudy.eu/). The dataset contains multiple-choice
+questions regarding people's values and beliefs across a variety of topics, such as
+politics, religion and society.
+
+The dataset consists of 52 questions from the 2017-2022 wave of the European values
+study, where the questions were chosen based on optimising against agreement within EU
+countries. We use only zero-shot evaluation on this dataset, and thus require no splits.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "question_id": "C039",
+  "text": "Hvor enig eller uenig er du i følgende udsagn?\nDet er ens pligt over for samfundet at arbejde.\nSvarmuligheder:\na. Helt enig\nb. Enig\nc. Hverken enig eller uenig\nd. Uenig\ne. Helt uenig"
+}
+```
+
+```json
+{
+  "question_id": "F122",
+  "text": "Fortæl for hver af handlingerne på dette kort, i hvilken grad du billiger handlingen. 1 betyder, at du slet ikke billiger dem, 10 betyder, at du i høj grad billiger dem\nAktiv dødshjælp\nSvarmuligheder:\na. Aldrig\nb. 2\nc. 3\nd. 4\ne. 5\nf. 6\ng. 7\nh. 8\ni. 9\nj. Altid"
+}
+```
+
+```json
+{
+  "question_id": "C041",
+  "text": "Hvor enig eller uenig er du i følgende udsagn?\nArbejde kommer først, også selv om det betyder mindre fritid.\nSvarmuligheder:\na. Helt enig\nb. Enig\nc. Hverken enig eller uenig\nd. Uenig\ne. Helt uenig"
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 0
+- Prefix prompt:
+
+```text
+Følgende er multiple choice spørgsmål (med svar).
+```
+
+- Base prompt template:
+
+```text
+Spørgsmål: {text}
+Svarmuligheder:
+a. {option_a}
+b. {option_b}
+(...)
+k. {option_k}
+Svar: {label}
+```
+
+- Instruction-tuned prompt template:
+
+```text
+Spørgsmål: {text}
+Svarmuligheder:
+a. {option_a}
+b. {option_b}
+(...)
+k. {option_k}
+
+Besvar ovenstående spørgsmål ved at svare med 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h',
+'i', 'j' eller 'k', og intet andet.
+```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+euroeval --model <model-id> --dataset valeu-da
+```
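The Dutch and English variants of this dataset (ValEU-nl and ValEU-en) are documented below. As a sketch, all three could be evaluated from Python in one go, assuming the `benchmark` method accepts dataset names the same way the `--dataset` CLI flag does; `<model-id>` is a placeholder.

```python
from euroeval import Benchmarker

benchmarker = Benchmarker()
for dataset in ["valeu-da", "valeu-nl", "valeu-en"]:
    benchmarker.benchmark(model="<model-id>", dataset=dataset)
```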
{scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/dutch.md

@@ -1100,3 +1100,81 @@ You can evaluate this dataset directly as follows:
 ```bash
 euroeval --model <model-id> --dataset duidelijke-taal
 ```
+
+## European Values
+
+### ValEU-nl
+
+This dataset is the official Dutch version of questions from the [European values
+study](https://europeanvaluesstudy.eu/). The dataset contains multiple-choice
+questions regarding people's values and beliefs across a variety of topics, such as
+politics, religion and society.
+
+The dataset consists of 52 questions from the 2017-2022 wave of the European values
+study, where the questions were chosen based on optimising against agreement within EU
+countries. We use only zero-shot evaluation on this dataset, and thus require no splits.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "question_id": "E069_01",
+  "text": "Wilt u mij voor elk van de instellingen op deze kaart vertellen of u er heel veel, tamelijk veel, niet zo veel of helemaal geen vertrouwen in heeft?\nDe kerk\nAntwoordopties:\na. Heel veel\nb. Tamelijk veel\nc. Niet zo veel\nd. Helemaal geen"
+}
+```
+
+```json
+{
+  "question_id": "E028",
+  "text": "Wilt u nu deze lijst erbij houden? Ik ga u nu een aantal verschillende soorten van politieke actie noemen die men kan voeren. Wilt u mij van elke actie vertellen of u het zelf ooit heeft gedaan, of u het zelf misschien zou doen als u het nodig vond, of dat u het zeker nooit zult doen?\nMeedoen aan een wilde staking\nAntwoordopties:\na. Zelf gedaan\nb. Zou dat misschien doen\nc. Zou dat nooit doen"
+}
+```
+
+```json
+{
+  "question_id": "E265_07",
+  "text": "Hoe vaak gebeuren volgens u de volgende dingen tijdens verkiezingen in dit land?\nRijke mensen kopen de verkiezingsuitslag\nAntwoordopties:\na. Zeer vaak\nb. Tamelijk vaak\nc. Niet zo vaak\nd. Helemaal niet vaak"
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 0
+- Prefix prompt:
+
+```text
+Hieronder staan meerkeuzevragen (met antwoorden).
+```
+
+- Base prompt template:
+
+```text
+Vraag: {text}
+Antwoordopties:
+a. {option_a}
+b. {option_b}
+(...)
+k. {option_k}
+Antwoord: {label}
+```
+
+- Instruction-tuned prompt template:
+
+```text
+Vraag: {text}
+Antwoordopties:
+a. {option_a}
+b. {option_b}
+(...)
+k. {option_k}
+
+Beantwoord de bovenstaande vraag met 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'
+of 'k', en niets anders.
+```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+euroeval --model <model-id> --dataset valeu-nl
+```
{scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/english.md

@@ -983,3 +983,81 @@ You can evaluate this dataset directly as follows:
 ```bash
 euroeval --model <model-id> --dataset cnn-dailymail
 ```
+
+## European Values
+
+### ValEU-en
+
+This dataset is the official English version of questions from the [European values
+study](https://europeanvaluesstudy.eu/). The dataset contains multiple-choice
+questions regarding people's values and beliefs across a variety of topics, such as
+politics, religion and society.
+
+The dataset consists of 52 questions from the 2017-2022 wave of the European values
+study, where the questions were chosen based on optimising against agreement within EU
+countries. We use only zero-shot evaluation on this dataset, and thus require no splits.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "question_id": "A072",
+  "text": "Please look carefully at the following list of voluntary organisations and say which, if any, do you belong to?\nProfessional associations\nChoices:\na. No\nb. Yes"
+}
+```
+
+```json
+{
+  "question_id": "F025",
+  "text": "Do you belong to a religious denomination? If yes, which one?\nChoices:\na. Do not belong to a denomination\nb. Roman Catholic\nc. Protestant\nd. Orthodox (Russian/Greek/etc.)\ne. Jew\nf. Muslim\ng. Hindu\nh. Buddhist\ni. Other Christian (Evangelical/Pentecostal/Free church/etc.)\nj. Other"
+}
+```
+
+```json
+{
+  "question_id": "F118",
+  "text": "Please tell me for each of the following whether you think it can always be justified, never be justified, or something in between.\nHomosexuality\nChoices:\na. Never justifiable\nb. 2\nc. 3\nd. 4\ne. 5\nf. 6\ng. 7\nh. 8\ni. 9\nj. Always justifiable"
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 0
+- Prefix prompt:
+
+```text
+The following are multiple choice questions (with answers).
+```
+
+- Base prompt template:
+
+```text
+Question: {text}
+Options:
+a. {option_a}
+b. {option_b}
+(...)
+k. {option_k}
+Answer: {label}
+```
+
+- Instruction-tuned prompt template:
+
+```text
+Question: {text}
+Options:
+a. {option_a}
+b. {option_b}
+(...)
+k. {option_k}
+
+Answer the above question by replying with 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h',
+'i', 'j', or 'k', and nothing else.
+```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+euroeval --model <model-id> --dataset valeu-en
+```
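To make the `(...)` ellipsis in the templates above concrete, here is a sketch of how the instruction-tuned prompt could be assembled for a question with a variable number of options. The helper is hypothetical; EuroEval builds these prompts internally from the dataset fields.

```python
def build_instruction_prompt(text: str, options: list[str]) -> str:
    """Fill the instruction-tuned template for between 2 and 11 options."""
    letters = "abcdefghijk"[: len(options)]
    option_lines = "\n".join(f"{letter}. {option}" for letter, option in zip(letters, options))
    letter_list = ", ".join(f"'{letter}'" for letter in letters[:-1])
    return (
        f"Question: {text}\n"
        f"Options:\n{option_lines}\n\n"
        f"Answer the above question by replying with {letter_list}, "
        f"or '{letters[-1]}', and nothing else."
    )


print(build_instruction_prompt("Do you belong to a religious denomination?", ["No", "Yes"]))
```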