PyPI - ScandEval - Versions diffs - 16.8.0__tar.gz → 16.9.0__tar.gz - Mend

ScandEval 16.8.0tar.gz → 16.9.0tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (370)

{scandeval-16.8.0 → scandeval-16.9.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml RENAMED

@@ -20,11 +20,12 @@ body:
     options:
       - label: Baltic languages (Latvian, Lithuanian)
       - label: Finnic languages (Estonian, Finnish)
-      - label: Greek
       - label: Romance languages (Catalan, French, Italian, Portuguese, Romanian, Spanish)
       - label: Scandinavian languages (Danish, Faroese, Icelandic, Norwegian, Swedish)
-      - label: Slavic languages (Bulgarian, Bosnian, Croatian, Czech, Hungarian, Polish, Serbian, Slovak, Slovenian, Ukrainian)
+      - label: Slavic languages (Bulgarian, Bosnian, Croatian, Czech, Polish, Serbian, Slovak, Slovenian, Ukrainian)
       - label: West Germanic languages (Dutch, English, German)
+      - label: Greek
+      - label: Hungarian
   validations:
     required: true
 - type: dropdown

{scandeval-16.8.0 → scandeval-16.9.0}/.pre-commit-config.yaml RENAMED

@@ -10,7 +10,7 @@ repos:
       - id: trailing-whitespace
       - id: debug-statements
 -   repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.14.6
+    rev: v0.14.9
     hooks:
       - id: ruff
         args:
@@ -34,13 +34,13 @@ repos:
     hooks:
     -   id: nbstripout
 -   repo: https://github.com/facebook/pyrefly-pre-commit
-    rev: 0.0.1
+    rev: 0.46.0
     hooks:
-    -   id: pyrefly-typecheck-system
+    -   id: pyrefly-check
         name: Pyrefly (type checking)
         pass_filenames: true
 -   repo: https://github.com/DavidAnson/markdownlint-cli2
-    rev: v0.19.1
+    rev: v0.20.0
     hooks:
     -   id: markdownlint-cli2
         args:

{scandeval-16.8.0 → scandeval-16.9.0}/CHANGELOG.md RENAMED

@@ -7,6 +7,29 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 ## [Unreleased]
+## [v16.9.0] - 2025-12-16
+### Added
+- Added the Swedish factual knowledge dataset SwedishFacts, which is based on the
+  [liu-nlp/swedish-facts-v1](https://huggingface.co/datasets/liu-nlp/swedish-facts-v1)
+  dataset. This was contributed by @oliverkinch ✨
+### Fixed
+- When a model has registered the number of parameters wrongly within their safetensors
+  files, we collect all the potential parameter counts from the safetensors file and
+  pick the largest one.
+- We now pinned vLLM to v0.11.0, as all future versions (up to and including v0.12.0)
+  have breaking changes regarding loading of Mistral models. We aim to unpin this when a
+  new vLLM version fixes this.
+- Removed mentions of `hf_transfer` and the associated environment variable
+  `HF_HUB_ENABLE_HF_TRANSFER`, since this has been removed from the `transformers`
+  library now.
+- Marked the `PleIAs/Pleias-3b-Preview` as requiring the `TRITON_ATTN` backend over the
+  default `FLASHINFER` backend, as the model architecture is currently not supported by
+  the default backend.
 ## [v16.8.0] - 2025-11-25
 ### Added
@@ -2735,8 +2758,8 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 ### Deprecated
-- Deprecated support for evaluating finetuned models, as the package was primarily used to
-  benchmark pretrained models anyway, and the change in datasets means that many
+- Deprecated support for evaluating finetuned models, as the package was primarily used
+  to benchmark pretrained models anyway, and the change in datasets means that many
   finetuned models would have been trained on (part of) the test sets, resulting in
   artificially large scores. For evaluation of finetuned models, please check out the
   `aiai_eval` Python package instead (under development).
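
The first "Fixed" entry for v16.9.0 describes collecting every candidate parameter count from a model's safetensors files and keeping the largest. A minimal sketch of that selection step, assuming the candidates are already available as a mapping from a label (for example a dtype name) to a count; `pick_parameter_count` and the sample numbers are hypothetical and are not taken from the package:

```python
def pick_parameter_count(candidates: dict[str, int]) -> int:
    """Return the largest positive candidate parameter count.

    `candidates` maps an arbitrary label (e.g. a dtype name) to a count
    reported for the model. This mapping shape is an assumption made for
    illustration only.
    """
    positive = [count for count in candidates.values() if count > 0]
    if not positive:
        raise ValueError("No usable parameter count was found.")
    # A wrongly registered count is assumed to be too small, so the
    # largest candidate is treated as the most plausible one.
    return max(positive)


# Hypothetical example: one entry only counts a handful of buffers.
example_candidates = {"F32": 1_024, "BF16": 3_000_000_000}
chosen = pick_parameter_count(example_candidates)
print(chosen)
```

Taking the maximum is a heuristic: it helps when a count is under-reported, but it cannot detect a count that is registered too high.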

{scandeval-16.8.0 → scandeval-16.9.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: ScandEval
-Version: 16.8.0
+Version: 16.9.0
 Summary: The robust European language model benchmark.
 Project-URL: Repository, https://github.com/EuroEval/EuroEval
 Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -39,6 +39,7 @@ Requires-Dist: evaluate>=0.4.1
 Requires-Dist: huggingface-hub>=0.30.1
 Requires-Dist: levenshtein>=0.24.0
 Requires-Dist: litellm>=1.75.6
+Requires-Dist: mistral-common[soundfile]
 Requires-Dist: more-itertools>=10.5.0
 Requires-Dist: numpy>=2.0.0
 Requires-Dist: ollama>=0.5.1
@@ -62,12 +63,12 @@ Provides-Extra: all
 Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'all'
 Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
 Requires-Dist: timm>=1.0.19; extra == 'all'
-Requires-Dist: vllm[flashinfer]>=0.11.0; (platform_system == 'Linux') and extra == 'all'
+Requires-Dist: vllm[flashinfer]==0.11.0; (platform_system == 'Linux') and extra == 'all'
 Provides-Extra: generative
 Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
 Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
 Requires-Dist: timm>=1.0.19; extra == 'generative'
-Requires-Dist: vllm[flashinfer]>=0.11.0; (platform_system == 'Linux') and extra == 'generative'
+Requires-Dist: vllm[flashinfer]==0.11.0; (platform_system == 'Linux') and extra == 'generative'
 Description-Content-Type: text/markdown
 <!-- This disables the requirement that the first line is a top-level heading -->
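
The change from `vllm[flashinfer]>=0.11.0` to `vllm[flashinfer]==0.11.0` narrows which vLLM releases an installer may select. The sketch below illustrates the difference with a deliberately simplified comparison that handles plain dotted integers only. Real installers follow PEP 440, and `parse_version` and `satisfies` are hypothetical helpers, not part of any library:

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Split a plain dotted version such as '0.11.0' into integers."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version: str, operator: str, bound: str) -> bool:
    """Check a version against a single '>=' or '==' constraint.

    Simplified on purpose: no pre-releases, no zero padding, no wildcards.
    """
    candidate, limit = parse_version(version), parse_version(bound)
    if operator == ">=":
        return candidate >= limit
    if operator == "==":
        return candidate == limit
    raise ValueError(f"Unsupported operator: {operator!r}")


# Under the old constraint a later release was acceptable...
old_allows_0_12 = satisfies("0.12.0", ">=", "0.11.0")
# ...while the new pin only accepts the exact release.
new_allows_0_12 = satisfies("0.12.0", "==", "0.11.0")
new_allows_0_11 = satisfies("0.11.0", "==", "0.11.0")
```

This matches the changelog's stated reason for the pin: releases after v0.11.0, up to and including v0.12.0, are reported to break loading of Mistral models.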

{scandeval-16.8.0 → scandeval-16.9.0}/docs/datasets/bosnian.md RENAMED

@@ -9,8 +9,8 @@ information about what these constitute.
 ### MMS-bs
 This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2306.07902).
-The corpus consists of 79 manually selected datasets from over 350 datasets reported in the
-scientific literature based on strict quality criteria.
+The corpus consists of 79 manually selected datasets from over 350 datasets reported in
+the scientific literature based on strict quality criteria.
 The original dataset contains a single split with 36,183 Bosnian samples.
 We use 1,024 / 256 / 2,048 samples for our training, validation, and test splits,

{scandeval-16.8.0 → scandeval-16.9.0}/docs/datasets/croatian.md RENAMED

@@ -9,8 +9,8 @@ information about what these constitute.
 ### MMS-hr
 This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2306.07902).
-The corpus consists of 79 manually selected datasets from over 350 datasets reported in the
-scientific literature based on strict quality criteria.
+The corpus consists of 79 manually selected datasets from over 350 datasets reported in
+the scientific literature based on strict quality criteria.
 The original dataset contains a single split with 77,594 Croatian samples.
 We use 1,024 / 256 / 2,048 samples for our training, validation, and test splits,

{scandeval-16.8.0 → scandeval-16.9.0}/docs/datasets/czech.md RENAMED

@@ -12,10 +12,10 @@ This dataset was published in [this paper](https://aclanthology.org/R13-1016/) a
 consists of reviews from the the Czech Movie
 Database (CSFD).
-The original dataset contains 85,948 / 894 / 1503 samples for the training, validation, and
-and test splits, respectively. We use 1,024 / 256 / 2,048 samples for our training,
-validation and test splits, respectively. The train and validation splits are subsets
-of the original splits. For the test split, we use all available test samples and
+The original dataset contains 85,948 / 894 / 1503 samples for the training, validation,
+and and test splits, respectively. We use 1,024 / 256 / 2,048 samples for our training,
+validation and test splits, respectively. The train and validation splits are subsets of
+the original splits. For the test split, we use all available test samples and
 supplement with additional samples from the training set to reach 2,048 samples in
 total.

{scandeval-16.8.0 → scandeval-16.9.0}/docs/datasets/lithuanian.md RENAMED

@@ -9,9 +9,9 @@ information about what these constitute.
 ### Atsiliepimai
 This dataset was published
-[here](https://huggingface.co/datasets/alexandrainst/lithuanian-sentiment-analysis). It was
-scraped from [atsiliepimai.lt](https://atsiliepimai.lt/) and
-contains reviews similar to trustpilot reviews.
+[here](https://huggingface.co/datasets/alexandrainst/lithuanian-sentiment-analysis). It
+was scraped from [atsiliepimai.lt](https://atsiliepimai.lt/) and contains reviews
+similar to trustpilot reviews.
 The original dataset consists of 1,796 samples. We use 512 / 256 / 1,028
 samples for our training, validation and test splits, respectively.

{scandeval-16.8.0 → scandeval-16.9.0}/docs/datasets/serbian.md RENAMED

@@ -9,8 +9,8 @@ information about what these constitute.
 ### MMS-sr
 This dataset was published in [this paper](https://doi.org/10.48550/arXiv.2306.07902).
-The corpus consists of 79 manually selected datasets from over 350 datasets reported in the
-scientific literature based on strict quality criteria.
+The corpus consists of 79 manually selected datasets from over 350 datasets reported in
+the scientific literature based on strict quality criteria.
 The original dataset contains a single split with 76,368 Serbian samples. We use
 1,024 / 256 / 2,048 samples for our training, validation and test splits, respectively.

{scandeval-16.8.0 → scandeval-16.9.0}/docs/datasets/swedish.md RENAMED

@@ -699,6 +699,83 @@ You can evaluate this dataset directly as follows:
 euroeval --model <model-id> --dataset skolprov
 ```
+### Unofficial: SwedishFacts
+This is a benchmark for factual knowledge about Sweden.
+The questions are based on topics related to the hosts of the Swedish radio program
+[Sommar i P1](https://www.sverigesradio.se/sommar-i-p1) as well as Swedish sporting
+events, such as those featured in [En Svensk Klassiker](https://ensvenskklassiker.se).
+In the [dataset card](https://huggingface.co/datasets/liu-nlp/swedish-facts-v1)
+it is mentioned that a paper with more information is coming soon.
+Since the dataset does not include candidate answers, we generate them using GPT-4o.
+The original dataset consists of 1,289 samples. We
+use a 128 / 64 / 1,097 split for training, validation and testing, respectively.
+Here are a few examples from the training split:
+```json
+{
+  "text": "Hur många gånger befodrades Micael Bydén till en högre militär grad under 1990-talet?\nSvarsalternativ:\na. Tre, 3\nb. Fyra\nc. Fem\nd. Två",
+  "label": "a"
+}
+```
+```json
+{
+  "text": "Vad heter skivbolaget Titiyo Jah kontrakt med år 1988?\nSvarsalternativ:\na. Virgin Records\nb. Telegram\nc. Sony Music\nd. Warner Music",
+  "label": "b"
+}
+```
+```json
+{
+  "text": "I vilken ort föddes PM Nilsson?\nSvarsalternativ:\na. Göteborg\nb. Lund\nc. Helsingborg\nd. Malmö",
+  "label": "b"
+}
+```
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+- Number of few-shot examples: 5
+- Prefix prompt:
+  ```text
+  Följande är flervalsfrågor (med svar).
+  ```
+- Base prompt template:
+  ```text
+  Fråga: {text}
+  Svarsalternativ:
+  a. {option_a}
+  b. {option_b}
+  c. {option_c}
+  d. {option_d}
+  Svar: {label}
+  ```
+- Instruction-tuned prompt template:
+  ```text
+  Fråga: {text}
+  Svarsalternativ:
+  a. {option_a}
+  b. {option_b}
+  c. {option_c}
+  d. {option_d}
+  Besvara följande fråga med 'a', 'b', 'c' eller 'd', och inget annat.
+  ```
+You can evaluate this dataset directly as follows:
+```bash
+euroeval --model <model-id> --dataset swedish-facts
+```
 ## Common-sense Reasoning
 ### HellaSwag-sv
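
The SwedishFacts section added in this hunk documents a prefix prompt and a base prompt template for few-shot evaluation. The sketch below shows one way such a prompt could be assembled from those two pieces. The prefix and template strings are copied from the hunk; the `Example` class, the `render` and `build_prompt` helpers, and the blank-line separator between blocks are assumptions for illustration and may differ from what the package does:

```python
from dataclasses import dataclass, replace

PREFIX = "Följande är flervalsfrågor (med svar)."
TEMPLATE = (
    "Fråga: {text}\n"
    "Svarsalternativ:\n"
    "a. {option_a}\n"
    "b. {option_b}\n"
    "c. {option_c}\n"
    "d. {option_d}\n"
    "Svar: {label}"
)


@dataclass(frozen=True)
class Example:
    text: str
    options: tuple[str, str, str, str]
    label: str


def render(example: Example) -> str:
    """Fill the base template with one question and its four options."""
    a, b, c, d = example.options
    return TEMPLATE.format(
        text=example.text,
        option_a=a,
        option_b=b,
        option_c=c,
        option_d=d,
        label=example.label,
    )


def build_prompt(few_shot: list[Example], query: Example) -> str:
    """Join the prefix, the solved examples, and the unanswered query."""
    blocks = [PREFIX] + [render(example) for example in few_shot]
    # The query is rendered with an empty label so the model completes it.
    blocks.append(render(replace(query, label="")).rstrip())
    return "\n\n".join(blocks)


shot = Example(
    text="I vilken ort föddes PM Nilsson?",
    options=("Göteborg", "Lund", "Helsingborg", "Malmö"),
    label="b",
)
query = Example(
    text="Vad heter skivbolaget Titiyo Jah kontrakt med år 1988?",
    options=("Virgin Records", "Telegram", "Sony Music", "Warner Music"),
    label="b",
)
prompt = build_prompt([shot], query)
print(prompt)
```

The documented setup uses 5 few-shot examples; a single one is used here to keep the output short.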

scandeval-16.9.0/docs/leaderboards/Monolingual/bosnian.md ADDED

@@ -0,0 +1,26 @@
+---
+hide:
+    - toc
+---
+# 🇧🇦 Bosnian
+See the [leaderboard page](/leaderboards) for more information about all the columns.
+/// tab | Generative Leaderboard
+<iframe title="" aria-label="Table" id="datawrapper-chart-Qs9Zq" src="https://datawrapper.dwcdn.net/Qs9Zq" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important; border: none;" height="1016" data-external="1"></iframe><script type="text/javascript">window.addEventListener("message",function(a){if(void 0!==a.data["datawrapper-height"]){var e=document.querySelectorAll("iframe");for(var t in a.data["datawrapper-height"])for(var r,i=0;r=e[i];i++)if(r.contentWindow===a.source){var d=a.data["datawrapper-height"][t]+"px";r.style.height=d}}});</script>
+///
+/// tab | NLU Leaderboard
+<iframe title="" aria-label="Table" id="datawrapper-chart-PrIYR" src="https://datawrapper.dwcdn.net/PrIYR" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important; border: none;" height="950" data-external="1"></iframe><script type="text/javascript">window.addEventListener("message",function(a){if(void 0!==a.data["datawrapper-height"]){var e=document.querySelectorAll("iframe");for(var t in a.data["datawrapper-height"])for(var r,i=0;r=e[i];i++)if(r.contentWindow===a.source){var d=a.data["datawrapper-height"][t]+"px";r.style.height=d}}});</script>
+///
+/// tab | Generative Scatter Plot
+<iframe title="Few-shot Performance of Generative Language Models on Bosnian Tasks by Model Size" aria-label="Scatter Plot" id="datawrapper-chart-nLYsL" src="https://datawrapper.dwcdn.net/nLYsL" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important; border: none;" height="687" data-external="1"></iframe><script type="text/javascript">window.addEventListener("message",function(a){if(void 0!==a.data["datawrapper-height"]){var e=document.querySelectorAll("iframe");for(var t in a.data["datawrapper-height"])for(var r,i=0;r=e[i];i++)if(r.contentWindow===a.source){var d=a.data["datawrapper-height"][t]+"px";r.style.height=d}}});</script>
+///
+/// tab | NLU Scatter Plot
+<iframe title="Few-shot Performance of Language Models on Bosnian NLU Tasks by Model Size" aria-label="Scatter Plot" id="datawrapper-chart-ibhn6" src="https://datawrapper.dwcdn.net/ibhn6" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important; border: none;" height="687" data-external="1"></iframe><script type="text/javascript">window.addEventListener("message",function(a){if(void 0!==a.data["datawrapper-height"]){var e=document.querySelectorAll("iframe");for(var t in a.data["datawrapper-height"])for(var r,i=0;r=e[i];i++)if(r.contentWindow===a.source){var d=a.data["datawrapper-height"][t]+"px";r.style.height=d}}});</script>
+///
+<!-- This disables the requirement that all lines must be shorter than 88 characters -->
+<!-- markdownlint-configure-file { "MD013": false } -->

scandeval-16.9.0/docs/leaderboards/Monolingual/catalan.md ADDED

@@ -0,0 +1,26 @@
+---
+hide:
+    - toc
+---
+# <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/ce/Flag_of_Catalonia.svg/960px-Flag_of_Catalonia.svg.png" width="35" alt="Flag of Catalonia"/> Catalan
+See the [leaderboard page](/leaderboards) for more information about all the columns.
+/// tab | Generative Leaderboard
+<iframe title="" aria-label="Table" id="datawrapper-chart-pC6Eu" src="https://datawrapper.dwcdn.net/pC6Eu/2/" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important; border: none;" height="886" data-external="1"></iframe><script type="text/javascript">window.addEventListener("message",function(a){if(void 0!==a.data["datawrapper-height"]){var e=document.querySelectorAll("iframe");for(var t in a.data["datawrapper-height"])for(var r,i=0;r=e[i];i++)if(r.contentWindow===a.source){var d=a.data["datawrapper-height"][t]+"px";r.style.height=d}}});</script>
+///
+/// tab | NLU Leaderboard
+<iframe title="" aria-label="Table" id="datawrapper-chart-rYG8W" src="https://datawrapper.dwcdn.net/rYG8W/2/" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important; border: none;" height="902" data-external="1"></iframe><script type="text/javascript">window.addEventListener("message",function(a){if(void 0!==a.data["datawrapper-height"]){var e=document.querySelectorAll("iframe");for(var t in a.data["datawrapper-height"])for(var r,i=0;r=e[i];i++)if(r.contentWindow===a.source){var d=a.data["datawrapper-height"][t]+"px";r.style.height=d}}});</script>
+///
+/// tab | Generative Scatter Plot
+<iframe title="Performance of Generative Language Models on Catalan Tasks by Model Size" aria-label="Scatter Plot" id="datawrapper-chart-noEdf" src="https://datawrapper.dwcdn.net/noEdf" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important; border: none;" height="687" data-external="1"></iframe><script type="text/javascript">window.addEventListener("message",function(a){if(void 0!==a.data["datawrapper-height"]){var e=document.querySelectorAll("iframe");for(var t in a.data["datawrapper-height"])for(var r,i=0;r=e[i];i++)if(r.contentWindow===a.source){var d=a.data["datawrapper-height"][t]+"px";r.style.height=d}}});</script>
+///
+/// tab | NLU Scatter Plot
+<iframe title="Performance of Language Models on Catalan NLU Tasks by Model Size" aria-label="Scatter Plot" id="datawrapper-chart-DH0vi" src="https://datawrapper.dwcdn.net/DH0vi" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important; border: none;" height="687" data-external="1"></iframe><script type="text/javascript">window.addEventListener("message",function(a){if(void 0!==a.data["datawrapper-height"]){var e=document.querySelectorAll("iframe");for(var t in a.data["datawrapper-height"])for(var r,i=0;r=e[i];i++)if(r.contentWindow===a.source){var d=a.data["datawrapper-height"][t]+"px";r.style.height=d}}});</script>
+///
+<!-- This disables the requirement that all lines must be shorter than 88 characters -->
+<!-- markdownlint-configure-file { "MD013": false } -->

scandeval-16.9.0/docs/leaderboards/Monolingual/hungarian.md ADDED

@@ -0,0 +1,26 @@
+---
+hide:
+    - toc
+---
+# 🇭🇺 Hungarian
+See the [leaderboard page](/leaderboards) for more information about all the columns.
+/// tab | Generative Leaderboard
+<iframe title="" aria-label="Table" id="datawrapper-chart-H91LQ" src="https://datawrapper.dwcdn.net/H91LQ" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important; border: none;" height="855" data-external="1"></iframe><script type="text/javascript">window.addEventListener("message",function(a){if(void 0!==a.data["datawrapper-height"]){var e=document.querySelectorAll("iframe");for(var t in a.data["datawrapper-height"])for(var r,i=0;r=e[i];i++)if(r.contentWindow===a.source){var d=a.data["datawrapper-height"][t]+"px";r.style.height=d}}});</script>
+///
+/// tab | NLU Leaderboard
+<iframe title="" aria-label="Table" id="datawrapper-chart-bLogV" src="https://datawrapper.dwcdn.net/bLogV" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important; border: none;" height="826" data-external="1"></iframe><script type="text/javascript">window.addEventListener("message",function(a){if(void 0!==a.data["datawrapper-height"]){var e=document.querySelectorAll("iframe");for(var t in a.data["datawrapper-height"])for(var r,i=0;r=e[i];i++)if(r.contentWindow===a.source){var d=a.data["datawrapper-height"][t]+"px";r.style.height=d}}});</script>
+///
+/// tab | Generative Scatter Plot
+<iframe title="Performance of Generative Language Models on Hungarian Tasks by Model Size" aria-label="Scatter Plot" id="datawrapper-chart-7Qn2I" src="https://datawrapper.dwcdn.net/7Qn2I" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important; border: none;" height="687" data-external="1"></iframe><script type="text/javascript">window.addEventListener("message",function(a){if(void 0!==a.data["datawrapper-height"]){var e=document.querySelectorAll("iframe");for(var t in a.data["datawrapper-height"])for(var r,i=0;r=e[i];i++)if(r.contentWindow===a.source){var d=a.data["datawrapper-height"][t]+"px";r.style.height=d}}});</script>
+///
+/// tab | NLU Scatter Plot
+<iframe title="Performance of Language Models on Hungarian NLU Tasks by Model Size" aria-label="Scatter Plot" id="datawrapper-chart-F5I8e" src="https://datawrapper.dwcdn.net/F5I8e" scrolling="no" frameborder="0" style="width: 0; min-width: 100% !important; border: none;" height="687" data-external="1"></iframe><script type="text/javascript">window.addEventListener("message",function(a){if(void 0!==a.data["datawrapper-height"]){var e=document.querySelectorAll("iframe");for(var t in a.data["datawrapper-height"])for(var r,i=0;r=e[i];i++)if(r.contentWindow===a.source){var d=a.data["datawrapper-height"][t]+"px";r.style.height=d}}});</script>
+///
+<!-- This disables the requirement that all lines must be shorter than 88 characters -->
+<!-- markdownlint-configure-file { "MD013": false } -->

{scandeval-16.8.0 → scandeval-16.9.0}/docs/leaderboards/Multilingual/romance.md RENAMED

@@ -2,7 +2,7 @@
 hide:
     - toc
 ---
-# 🇫🇷🇮🇹🇵🇹🇪🇸 Romance
+# <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/ce/Flag_of_Catalonia.svg/960px-Flag_of_Catalonia.svg.png" width="35" alt="Flag of Catalonia"/>🇫🇷🇮🇹🇵🇹🇪🇸 Romance
 See the [leaderboard page](/leaderboards) for more information about all the columns.

{scandeval-16.8.0 → scandeval-16.9.0}/docs/leaderboards/Multilingual/slavic.md RENAMED

@@ -2,7 +2,7 @@
 hide:
     - toc
 ---
-# 🇧🇬🇭🇷🇨🇿🇵🇱🇷🇸🇸🇰🇸🇮🇺🇦 Slavic
+# 🇧🇦🇧🇬🇭🇷🇨🇿🇵🇱🇷🇸🇸🇰🇸🇮🇺🇦 Slavic
 See the [leaderboard page](/leaderboards) for more information about all the columns.
