ScandEval 16.10.1.tar.gz → 16.11.0.tar.gz
- {scandeval-16.10.1 → scandeval-16.11.0}/.pre-commit-config.yaml +4 -4
- {scandeval-16.10.1 → scandeval-16.11.0}/CHANGELOG.md +40 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/CONTRIBUTING.md +1 -1
- {scandeval-16.10.1 → scandeval-16.11.0}/LICENSE +1 -1
- {scandeval-16.10.1 → scandeval-16.11.0}/PKG-INFO +27 -19
- {scandeval-16.10.1 → scandeval-16.11.0}/README.md +25 -17
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/danish.md +78 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/dutch.md +78 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/english.md +78 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/estonian.md +79 -1
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/finnish.md +78 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/french.md +78 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/german.md +101 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/icelandic.md +78 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/italian.md +78 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/norwegian.md +78 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/polish.md +78 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/portuguese.md +87 -9
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/spanish.md +85 -7
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/swedish.md +84 -6
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/README.md +4 -7
- scandeval-16.11.0/docs/tasks/european-values.md +33 -0
- scandeval-16.11.0/docs/tasks/simplification.md +36 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/pyproject.toml +1 -1
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmark_modules/hf.py +14 -1
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmark_modules/litellm.py +111 -22
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmark_modules/vllm.py +111 -56
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmarker.py +13 -6
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/data_models.py +2 -2
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/logging_utils.py +1 -1
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/metrics/huggingface.py +3 -2
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/metrics/llm_as_a_judge.py +79 -15
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/model_loading.py +2 -1
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/task_group_utils/sequence_classification.py +12 -3
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/types.py +39 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/utils.py +29 -4
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/load_ud_pos.py +11 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/uv.lock +1 -1
- scandeval-16.10.1/docs/tasks/simplification.md +0 -42
- {scandeval-16.10.1 → scandeval-16.11.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/.github/ISSUE_TEMPLATE/language_request.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/.github/workflows/ci.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/.gitignore +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/.markdownlint.jsonc +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/CITATION.cff +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/CODE_OF_CONDUCT.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/Dockerfile.cuda +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/NEW_DATASET_GUIDE.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/CNAME +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/README.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/README.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/albanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/bosnian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/bulgarian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/catalan.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/croatian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/czech.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/faroese.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/greek.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/hungarian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/latvian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/lithuanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/romanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/serbian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/slovak.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/slovene.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/datasets/ukrainian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/extras/radial_plotter.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/faq.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/gfx/favicon.png +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/albanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/bosnian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/bulgarian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/catalan.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/croatian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/czech.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/danish.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/dutch.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/english.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/estonian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/faroese.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/finnish.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/french.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/german.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/greek.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/hungarian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/icelandic.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/italian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/latvian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/lithuanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/norwegian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/polish.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/portuguese.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/romanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/serbian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/slovak.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/slovene.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/spanish.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/swedish.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Monolingual/ukrainian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Multilingual/baltic.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Multilingual/european.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Multilingual/finnic.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Multilingual/germanic.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Multilingual/romance.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/Multilingual/slavic.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/leaderboards/README.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/methodology.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/python-package.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/common-sense-reasoning.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/knowledge.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/linguistic-acceptability.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/named-entity-recognition.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/reading-comprehension.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/sentiment-classification.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/speed.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/docs/tasks/summarization.md +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/gfx/euroeval.png +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/gfx/euroeval.xcf +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/gfx/scandeval.png +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/makefile +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/mkdocs.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmark_config_factory.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmark_modules/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmark_modules/base.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/benchmark_modules/fresh.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/caching_utils.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/callbacks.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/cli.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/constants.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/data_loading.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/albanian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/bosnian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/bulgarian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/catalan.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/croatian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/czech.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/danish.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/dutch.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/english.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/estonian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/faroese.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/finnish.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/french.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/german.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/greek.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/hungarian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/icelandic.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/italian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/latvian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/lithuanian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/norwegian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/polish.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/portuguese.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/romanian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/serbian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/slovak.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/slovene.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/spanish.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/swedish.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/dataset_configs/ukrainian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/enums.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/exceptions.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/finetuning.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/generation.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/generation_utils.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/languages.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/metrics/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/metrics/base.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/metrics/pipeline.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/metrics/speed.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/model_cache.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/model_config.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/classification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/linguistic_acceptability.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/multiple_choice.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/named_entity_recognition.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/reading_comprehension.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/sentiment_classification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/simplification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/summarization.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/prompt_templates/token_classification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/scores.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/speed_benchmark.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/task_group_utils/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/task_group_utils/multiple_choice_classification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/task_group_utils/question_answering.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/task_group_utils/text_to_text.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/task_group_utils/token_classification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/tasks.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scandeval/tokenisation_utils.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/constants.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_allocine.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_angry_tweets.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_arc.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_arc_is.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_atsiliepimai.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_belebele.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_bg_ner_bsnlp.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_boolq_pt.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_cinexio.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_cnn_dailymail.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_conll_en.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_conll_es.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_conll_nl.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_copa_lv.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_copa_nl.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_cross_domain_uk_reviews.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_cs_gec.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_csfd_sentiment.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_csfd_sentiment_sk.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_czech_news.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_dacsa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_dane.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_danish_citizen_tests.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_dansk.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_danske_talemaader.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_danske_talemaader_old.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_dbrd.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_duidelijke_taal.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_dutch_cola.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_elner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_eltec.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_err_news.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_estner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_estonian_valence.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_european_values.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_exam_et.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_exams_bg.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_fone.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_foqa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_fosent.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_fquad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_fullstack_ner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_germanquad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_germeval.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_global_mmlu.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_goldenswag.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_grammar_et.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_greek_sa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_greek_wikipedia.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_guia_cat.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_harem.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_hellaswag.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_hellaswag_cs.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_hellaswag_fi.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_hun_sum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_husst.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_ice_linguistic.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_icelandic_error_corpus.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_icelandic_knowledge.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_icelandic_qa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_icesum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_idioms_no.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_ilpost_sum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_jentoft.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_kpwr_ner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_latvian_lsm_summary.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_latvian_twitter_sentiment.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_life_in_the_uk.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_lithuanian_lrytas_summarization.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_llmzszl.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_lr_sum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_lt_emotions.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_lt_history.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mim_gold_ner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mlqa_es.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mlsum_de.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mlsum_es.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mmlu.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mmlu_et.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mmlu_hr.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mmlu_lv.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_mms.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_multi_wiki_qa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_multinerd-it.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_ner_uk.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_no_cola.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_no_sammendrag.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_nor_common_sense_qa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_nordjylland_news.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_norec.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_norglm_multiqa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_norglm_multisum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_norne.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_norquad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_nqii.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_nrk_quiz_qa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_orange_sum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_personal_sum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_polemo2.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_poner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_poquad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_psc.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_publico.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_ronec.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_rosent.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_rrn.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sb10k.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_scala.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_scandiqa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_scandisent_fi.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_schibsted.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sentiment_headlines_es.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sentinews.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sentipolc16.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_skolprov.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sqad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_squad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_squad_it.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_squad_nl.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_squad_nl_old.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_ssj500k_ner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sst2_pt.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sst5.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_suc3.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_sumo_ro.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_swedish_facts.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_swedn.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_swerec.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_szeged_ner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_trivia_et.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_turku_ner_fi.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_tydiqa_fi.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_umimeto_qa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_uner_sk.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_uner_sr.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_wiki_lingua_nl.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_wikiann.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_wikineural-it.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_winogrande.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_winogrande_et.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_winogrande_is.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_xlsum_fi.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/create_xquad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/fix_dot_env_file.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/src/scripts/versioning.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/conftest.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_benchmark_config_factory.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_benchmark_modules/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_benchmark_modules/test_hf.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_benchmarker.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_callbacks.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_cli.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_constants.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_data_loading.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_data_models.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_dataset_configs.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_enums.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_exceptions.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_finetuning.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_languages.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_model_config.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_model_loading.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scores.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/test_create_scala.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/test_data/de_gsd-ud-train.conllu.adp_det +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/test_data/empty.file +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/test_data/en_gum-ud-train.conllu.case +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_01 +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_02 +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_03 +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_speed_benchmark.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_tokenisation_utils.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_types.py +0 -0
- {scandeval-16.10.1 → scandeval-16.11.0}/tests/test_utils.py +0 -0
--- scandeval-16.10.1/.pre-commit-config.yaml
+++ scandeval-16.11.0/.pre-commit-config.yaml
@@ -8,9 +8,9 @@ repos:
     hooks:
       - id: end-of-file-fixer
       - id: trailing-whitespace
-
+      - id: debug-statements
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.14.
+    rev: v0.14.13
     hooks:
       - id: ruff
         args:
@@ -30,11 +30,11 @@ repos:
           - pyi
           - jupyter
   - repo: https://github.com/kynan/nbstripout
-    rev: 0.
+    rev: 0.9.0
    hooks:
       - id: nbstripout
   - repo: https://github.com/facebook/pyrefly-pre-commit
-    rev: 0.
+    rev: 0.49.0
     hooks:
       - id: pyrefly-check
         name: Pyrefly (type checking)

--- scandeval-16.10.1/CHANGELOG.md
+++ scandeval-16.11.0/CHANGELOG.md
@@ -7,6 +7,46 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 
 ## [Unreleased]
 
+## [v16.11.0] - 2026-01-21
+
+### Added
+
+- Added model metadata for GPT 5.2.
+- Added better support for unofficial inference providers, allowing model prefixes even
+  if they're not in LiteLLM's official list of providers. Currently this only works with
+  the "ordbogen/" prefix for models available on ordbogen.dk.
+
+### Changed
+
+- LLM-as-a-Judge metrics now support batch scoring across multiple judge outputs.
+- When evaluating datasets with no validation split, we now set the `validation_split`
+  in the resulting JSONL file to `null` rather than `True`, to avoid confusion.
+  Likewise, if a task requires zero-shot evaluation, we set `few_shot` to `null` rather
+  than a Boolean value.
+- When evaluating a reasoning model on a sequence classification task, if the model
+  outputs an answer that starts with one of the candidate labels, we now use that label
+  as the predicted label. Previously, we would have conducted a word edit distance
+  search to find the closest candidate label, which was almost always correct, but not
+  in all cases.
+
+### Fixed
+
+- Quantized models in vLLM now have their dtype inferred automatically, removing
+  explicit dtype casting based on GPU compute capability. This was contributed by
+  @tvosch ✨
+- Evaluation of local vLLM models when no internet connection was available did not
+  work correctly; this has been fixed now. This was contributed by @Touzen ✨
+- More robust detection and handling of errors related to too long inputs for vLLM
+  models.
+- Some API models need the `logprobs` argument to be a Boolean rather than an integer.
+  This has been fixed now.
+- Better handling of rate limits when evaluating API models, by backing off more
+  aggressively when hitting rate limits.
+- Now truncates prompts for instruction-following models in a smarter way, by removing
+  few-shot examples one by one until the prompt is short enough, rather than just
+  truncating the prompt to the maximum length. This only affects models whose maximum
+  model length is quite small (roughly 5,000 tokens or less).
+
 ## [v16.10.1] - 2026-01-02
 
 ### Changed

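The smarter truncation described in the `### Fixed` entries above lends itself to a short illustration. Below is a minimal sketch of the idea, not EuroEval's actual implementation: the tokeniser is a stand-in, and all names are made up for the example.

```python
def build_prompt(instruction: str, few_shot_examples: list[str], max_tokens: int) -> str:
    """Drop few-shot examples one by one until the prompt fits the model length."""

    def num_tokens(text: str) -> int:
        # Stand-in for a real tokeniser; roughly four characters per token.
        return len(text) // 4

    examples = list(few_shot_examples)
    while True:
        prompt = "\n\n".join(examples + [instruction])
        if num_tokens(prompt) <= max_tokens or not examples:
            return prompt
        # Prompt is still too long: remove the last few-shot example and retry.
        examples.pop()


# With a tiny budget, both few-shot examples get dropped and only the
# instruction itself remains.
print(build_prompt("Classify: 'Fantastisk film!'", ["Example 1", "Example 2"], max_tokens=10))
```
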
--- scandeval-16.10.1/CONTRIBUTING.md
+++ scandeval-16.11.0/CONTRIBUTING.md
@@ -72,7 +72,7 @@ guide](https://github.com/atom/atom/blob/master/CONTRIBUTING.md#git-commit-messa
 know how to use emoji for commit messages.
 
 Once your changes are ready, don't forget to
-
+self-review to speed up the review process:zap:.
 
 ### Pull Request
 

--- scandeval-16.10.1/PKG-INFO
+++ scandeval-16.11.0/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: ScandEval
-Version: 16.
+Version: 16.11.0
 Summary: The robust European language model benchmark.
 Project-URL: Repository, https://github.com/EuroEval/EuroEval
 Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -8,7 +8,7 @@ Author-email: Dan Saattrup Smart <dan.smart@alexandra.dk>
 Maintainer-email: Dan Saattrup Smart <dan.smart@alexandra.dk>
 License: MIT License
 
-Copyright (c) 2022-
+Copyright (c) 2022-2026 Dan Saattrup Smart
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
@@ -123,16 +123,17 @@ The easiest way to benchmark pretrained models is via the command line interface
 having installed the package, you can benchmark your favorite model like so:
 
 ```bash
-euroeval --model <model-id>
+euroeval --model <model-id-or-path>
 ```
 
-Here `model` is the HuggingFace model ID, which can be found on the [HuggingFace
-Hub](https://huggingface.co/models)
-the
-
+Here `model` is either the HuggingFace model ID, which can be found on the [HuggingFace
+Hub](https://huggingface.co/models), or a local path to a model directory (containing
+the model files as well as the `config.json` file). By default this will benchmark the
+model on all the tasks available. If you want to benchmark on a particular task, then
+use the `--task` argument:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification
+euroeval --model <model-id-or-path> --task sentiment-classification
 ```
 
 We can also narrow down which languages we would like to benchmark on. This can be done
@@ -140,20 +141,20 @@ by setting the `--language` argument. Here we thus benchmark the model on the Da
 sentiment classification task:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification --language da
+euroeval --model <model-id-or-path> --task sentiment-classification --language da
 ```
 
 Multiple models, datasets and/or languages can be specified by just attaching multiple
 arguments. Here is an example with two models:
 
 ```bash
-euroeval --model <model-
+euroeval --model <model-id-or-path-1> --model <model-id-or-path-2>
 ```
 
 The specific model version/revision to use can also be added after the suffix '@':
 
 ```bash
-euroeval --model <model-id>@<commit>
+euroeval --model <model-id-or-path>@<commit>
 ```
 
 This can be a branch name, a tag name, or a commit id. It defaults to 'main' for latest.
@@ -173,7 +174,7 @@ model:
 ```python
 >>> from euroeval import Benchmarker
 >>> benchmarker = Benchmarker()
->>> benchmarker.benchmark(model="<model-id>")
+>>> benchmarker.benchmark(model="<model-id-or-path>")
 ```
 
 To benchmark on a specific task and/or language, you simply specify the `task` or
@@ -181,7 +182,7 @@ To benchmark on a specific task and/or language, you simply specify the `task` o
 
 ```python
 >>> benchmarker.benchmark(
-...     model="<model-id>",
+...     model="<model-id-or-path>",
 ...     task="sentiment-classification",
 ...     language="da",
 ... )
@@ -225,7 +226,7 @@ docker run -e args="<euroeval-arguments>" --gpus 1 --name euroeval --rm euroeval
 ```
 
 Here `<euroeval-arguments>` consists of the arguments added to the `euroeval` CLI
-argument. This could for instance be `--model <model-id> --task
+argument. This could for instance be `--model <model-id-or-path> --task
 sentiment-classification`.
 
 ## Benchmarking custom inference APIs
@@ -291,14 +292,14 @@ script. For example to download the model you want and all of the Danish sentime
 classification datasets:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification --language da --download-only
+euroeval --model <model-id-or-path> --task sentiment-classification --language da --download-only
 ```
 
 Or from a script:
 
 ```python
 >>> benchmarker.benchmark(
-...     model="<model-id>",
+...     model="<model-id-or-path>",
 ...     task="sentiment-classification",
 ...     language="da",
 ...     download_only=True,
@@ -346,7 +347,7 @@ MY_CONFIG = DatasetConfig(
 You can then benchmark your custom dataset by simply running
 
 ```bash
-euroeval --dataset my-dataset --model <model-id>
+euroeval --dataset my-dataset --model <model-id-or-path>
 ```
 
 You can also run the benchmark from a Python script, by simply providing your custom
@@ -356,7 +357,7 @@ dataset configuration directly into the `benchmark` method:
 from euroeval import Benchmarker
 
 benchmarker = Benchmarker()
-benchmarker.benchmark(model="<model-id>", dataset=MY_CONFIG)
+benchmarker.benchmark(model="<model-id-or-path>", dataset=MY_CONFIG)
 ```
 
 We have included three convenience tasks to make it easier to set up custom datasets:
@@ -436,7 +437,7 @@ MY_SQL_DATASET = DatasetConfig(
 Again, with this you can benchmark your custom dataset by simply running
 
 ```bash
-euroeval --dataset my-sql-dataset --model <model-id>
+euroeval --dataset my-sql-dataset --model <model-id-or-path>
 ```
 
 ## Reproducing the evaluation datasets
@@ -592,6 +593,13 @@ A huge thank you to all the contributors who have helped make this project a suc
     alt="Contributor avatar for tvosch"
   />
 </a>
+<a href="https://github.com/Touzen">
+  <img
+    src="https://avatars.githubusercontent.com/u/1416265"
+    width=50
+    alt="Contributor avatar for Touzen"
+  />
+</a>
 
 ### Contribute to EuroEval
 

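The `<model-id-or-path>` change running through the hunks above means a local model directory can now be passed wherever a Hub ID was accepted before. A minimal sketch of what that looks like from Python, assuming a hypothetical local directory `./models/my-finetuned-model` that contains the model files and `config.json`:

```python
from euroeval import Benchmarker

benchmarker = Benchmarker()

# Local model directory instead of a HuggingFace Hub ID; the path is made up
# for this example.
benchmarker.benchmark(
    model="./models/my-finetuned-model",
    task="sentiment-classification",
    language="da",
)
```
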
--- scandeval-16.10.1/README.md
+++ scandeval-16.11.0/README.md
@@ -47,16 +47,17 @@ The easiest way to benchmark pretrained models is via the command line interface
 having installed the package, you can benchmark your favorite model like so:
 
 ```bash
-euroeval --model <model-id>
+euroeval --model <model-id-or-path>
 ```
 
-Here `model` is the HuggingFace model ID, which can be found on the [HuggingFace
-Hub](https://huggingface.co/models)
-the
-
+Here `model` is either the HuggingFace model ID, which can be found on the [HuggingFace
+Hub](https://huggingface.co/models), or a local path to a model directory (containing
+the model files as well as the `config.json` file). By default this will benchmark the
+model on all the tasks available. If you want to benchmark on a particular task, then
+use the `--task` argument:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification
+euroeval --model <model-id-or-path> --task sentiment-classification
 ```
 
 We can also narrow down which languages we would like to benchmark on. This can be done
@@ -64,20 +65,20 @@ by setting the `--language` argument. Here we thus benchmark the model on the Da
 sentiment classification task:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification --language da
+euroeval --model <model-id-or-path> --task sentiment-classification --language da
 ```
 
 Multiple models, datasets and/or languages can be specified by just attaching multiple
 arguments. Here is an example with two models:
 
 ```bash
-euroeval --model <model-
+euroeval --model <model-id-or-path-1> --model <model-id-or-path-2>
 ```
 
 The specific model version/revision to use can also be added after the suffix '@':
 
 ```bash
-euroeval --model <model-id>@<commit>
+euroeval --model <model-id-or-path>@<commit>
 ```
 
 This can be a branch name, a tag name, or a commit id. It defaults to 'main' for latest.
@@ -97,7 +98,7 @@ model:
 ```python
 >>> from euroeval import Benchmarker
 >>> benchmarker = Benchmarker()
->>> benchmarker.benchmark(model="<model-id>")
+>>> benchmarker.benchmark(model="<model-id-or-path>")
 ```
 
 To benchmark on a specific task and/or language, you simply specify the `task` or
@@ -105,7 +106,7 @@ To benchmark on a specific task and/or language, you simply specify the `task` o
 
 ```python
 >>> benchmarker.benchmark(
-...     model="<model-id>",
+...     model="<model-id-or-path>",
 ...     task="sentiment-classification",
 ...     language="da",
 ... )
@@ -149,7 +150,7 @@ docker run -e args="<euroeval-arguments>" --gpus 1 --name euroeval --rm euroeval
 ```
 
 Here `<euroeval-arguments>` consists of the arguments added to the `euroeval` CLI
-argument. This could for instance be `--model <model-id> --task
+argument. This could for instance be `--model <model-id-or-path> --task
 sentiment-classification`.
 
 ## Benchmarking custom inference APIs
@@ -215,14 +216,14 @@ script. For example to download the model you want and all of the Danish sentime
 classification datasets:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification --language da --download-only
+euroeval --model <model-id-or-path> --task sentiment-classification --language da --download-only
 ```
 
 Or from a script:
 
 ```python
 >>> benchmarker.benchmark(
-...     model="<model-id>",
+...     model="<model-id-or-path>",
 ...     task="sentiment-classification",
 ...     language="da",
 ...     download_only=True,
@@ -270,7 +271,7 @@ MY_CONFIG = DatasetConfig(
 You can then benchmark your custom dataset by simply running
 
 ```bash
-euroeval --dataset my-dataset --model <model-id>
+euroeval --dataset my-dataset --model <model-id-or-path>
 ```
 
 You can also run the benchmark from a Python script, by simply providing your custom
@@ -280,7 +281,7 @@ dataset configuration directly into the `benchmark` method:
 from euroeval import Benchmarker
 
 benchmarker = Benchmarker()
-benchmarker.benchmark(model="<model-id>", dataset=MY_CONFIG)
+benchmarker.benchmark(model="<model-id-or-path>", dataset=MY_CONFIG)
 ```
 
 We have included three convenience tasks to make it easier to set up custom datasets:
@@ -360,7 +361,7 @@ MY_SQL_DATASET = DatasetConfig(
 Again, with this you can benchmark your custom dataset by simply running
 
 ```bash
-euroeval --dataset my-sql-dataset --model <model-id>
+euroeval --dataset my-sql-dataset --model <model-id-or-path>
 ```
 
 ## Reproducing the evaluation datasets
@@ -516,6 +517,13 @@ A huge thank you to all the contributors who have helped make this project a suc
     alt="Contributor avatar for tvosch"
   />
 </a>
+<a href="https://github.com/Touzen">
+  <img
+    src="https://avatars.githubusercontent.com/u/1416265"
+    width=50
+    alt="Contributor avatar for Touzen"
+  />
+</a>
 
 ### Contribute to EuroEval
 

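The README's `--download-only` flow pairs naturally with this release's fix for offline evaluation of local vLLM models. A sketch of the two-step workflow, with `<model-id-or-path>` kept as a placeholder:

```python
from euroeval import Benchmarker

benchmarker = Benchmarker()

# Step 1, on a machine with internet access: fetch the model and datasets
# without running any evaluation.
benchmarker.benchmark(
    model="<model-id-or-path>",
    task="sentiment-classification",
    language="da",
    download_only=True,
)

# Step 2, later and possibly offline: the same call without `download_only`
# runs the evaluation against the cached artefacts.
benchmarker.benchmark(
    model="<model-id-or-path>",
    task="sentiment-classification",
    language="da",
)
```
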
--- scandeval-16.10.1/docs/datasets/danish.md
+++ scandeval-16.11.0/docs/datasets/danish.md
@@ -1116,3 +1116,81 @@ You can evaluate this dataset directly as follows:
 ```bash
 euroeval --model <model-id> --dataset nordjylland-news
 ```
+
+## European Values
+
+### ValEU-da
+
+This dataset is the official Danish version of questions from the [European values
+study](https://europeanvaluesstudy.eu/). The dataset contains multiple-choice
+questions regarding people's values and beliefs across a variety of topics, such as
+politics, religion and society.
+
+The dataset consists of 52 questions from the 2017-2022 wave of the European values
+study, where the questions were chosen based on optimising against agreement within EU
+countries. We use only zero-shot evaluation on this dataset, and thus require no splits.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "question_id": "C039",
+  "text": "Hvor enig eller uenig er du i følgende udsagn?\nDet er ens pligt over for samfundet at arbejde.\nSvarmuligheder:\na. Helt enig\nb. Enig\nc. Hverken enig eller uenig\nd. Uenig\ne. Helt uenig",
+}
+```
+
+```json
+{
+  "question_id": "F122",
+  "text": "Fortæl for hver af handlingerne på dette kort, i hvilken grad du billiger handlingen. 1 betyder, at du slet ikke billiger dem, 10 betyder, at du i høj grad billiger dem\nAktiv dødshjælp\nSvarmuligheder:\na. Aldrig\nb. 2\nc. 3\nd. 4\ne. 5\nf. 6\ng. 7\nh. 8\ni. 9\nj. Altid",
+}
+```
+
+```json
+{
+  "question_id": "C041",
+  "text": "Hvor enig eller uenig er du i følgende udsagn?\nArbejde kommer først, også selv om det betyder mindre fritid.\nSvarmuligheder:\na. Helt enig\nb. Enig\nc. Hverken enig eller uenig\nd. Uenig\ne. Helt uenig"
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 0
+- Prefix prompt:
+
+  ```text
+  Følgende er multiple choice spørgsmål (med svar).
+  ```
+
+- Base prompt template:
+
+  ```text
+  Spørgsmål: {text}
+  Svarmuligheder:
+  a. {option_a}
+  b. {option_b}
+  (...)
+  k. {option_k}
+  Svar: {label}
+  ```
+
+- Instruction-tuned prompt template:
+
+  ```text
+  Spørgsmål: {text}
+  Svarmuligheder:
+  a. {option_a}
+  b. {option_b}
+  (...)
+  k. {option_k}
+
+  Besvar ovenstående spørgsmål ved at svare med 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h',
+  'i', 'j' eller 'k', og intet andet.
+  ```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+euroeval --model <model-id> --dataset valeu-da
+```

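For completeness, a sketch of the equivalent Python call for the new ValEU-da dataset, assuming the `dataset` argument also accepts a dataset name as a string, mirroring the CLI's `--dataset` flag; the model ID is a placeholder:

```python
from euroeval import Benchmarker

benchmarker = Benchmarker()

# Zero-shot European-values evaluation on the Danish ValEU questions.
benchmarker.benchmark(model="<model-id-or-path>", dataset="valeu-da")
```
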
--- scandeval-16.10.1/docs/datasets/dutch.md
+++ scandeval-16.11.0/docs/datasets/dutch.md
@@ -1100,3 +1100,81 @@ You can evaluate this dataset directly as follows:
 ```bash
 euroeval --model <model-id> --dataset duidelijke-taal
 ```
+
+## European Values
+
+### ValEU-nl
+
+This dataset is the official Dutch version of questions from the [European values
+study](https://europeanvaluesstudy.eu/). The dataset contains multiple-choice
+questions regarding people's values and beliefs across a variety of topics, such as
+politics, religion and society.
+
+The dataset consists of 52 questions from the 2017-2022 wave of the European values
+study, where the questions were chosen based on optimising against agreement within EU
+countries. We use only zero-shot evaluation on this dataset, and thus require no splits.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "question_id": "E069_01",
+  "text": "Wilt u mij voor elk van de instellingen op deze kaart vertellen of u er heel veel, tamelijk veel, niet zo veel of helemaal geen vertrouwen in heeft?\nDe kerk\nAntwoordopties:\na. Heel veel\nb. Tamelijk veel\nc. Niet zo veel\nd. Helemaal geen"
+}
+```
+
+```json
+{
+  "question_id": "E028",
+  "text": "Wilt u nu deze lijst erbij houden? Ik ga u nu een aantal verschillende soorten van politieke actie noemen die men kan voeren. Wilt u mij van elke actie vertellen of u het zelf ooit heeft gedaan, of u het zelf misschien zou doen als u het nodig vond, of dat u het zeker nooit zult doen?\nMeedoen aan een wilde staking\nAntwoordopties:\na. Zelf gedaan\nb. Zou dat misschien doen\nc. Zou dat nooit doen"
+}
+```
+
+```json
+{
+  "question_id": "E265_07",
+  "text": "Hoe vaak gebeuren volgens u de volgende dingen tijdens verkiezingen in dit land?\nRijke mensen kopen de verkiezingsuitslag\nAntwoordopties:\na. Zeer vaak\nb. Tamelijk vaak\nc. Niet zo vaak\nd. Helemaal niet vaak"
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 0
+- Prefix prompt:
+
+  ```text
+  Hieronder staan meerkeuzevragen (met antwoorden).
+  ```
+
+- Base prompt template:
+
+  ```text
+  Vraag: {text}
+  Antwoordopties:
+  a. {option_a}
+  b. {option_b}
+  (...)
+  k. {option_k}
+  Antwoord: {label}
+  ```
+
+- Instruction-tuned prompt template:
+
+  ```text
+  Vraag: {text}
+  Antwoordopties:
+  a. {option_a}
+  b. {option_b}
+  (...)
+  k. {option_k}
+
+  Beantwoord de bovenstaande vraag met 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'
+  of 'k', en niets anders.
+  ```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+euroeval --model <model-id> --dataset valeu-nl
+```

--- scandeval-16.10.1/docs/datasets/english.md
+++ scandeval-16.11.0/docs/datasets/english.md
@@ -983,3 +983,81 @@ You can evaluate this dataset directly as follows:
 ```bash
 euroeval --model <model-id> --dataset cnn-dailymail
 ```
+
+## European Values
+
+### ValEU-en
+
+This dataset is the official English version of questions from the [European values
+study](https://europeanvaluesstudy.eu/). The dataset contains multiple-choice
+questions regarding people's values and beliefs across a variety of topics, such as
+politics, religion and society.
+
+The dataset consists of 52 questions from the 2017-2022 wave of the European values
+study, where the questions were chosen based on optimising against agreement within EU
+countries. We use only zero-shot evaluation on this dataset, and thus require no splits.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "question_id": "A072",
+  "text": "Please look carefully at the following list of voluntary organisations and say which, if any, do you belong to?\nProfessional associations\nChoices:\na. No\nb. Yes"
+}
+```
+
+```json
+{
+  "question_id": "F025",
+  "text": "Do you belong to a religious denomination? If yes, which one?\nChoices:\na. Do not belong to a denomination\nb. Roman Catholic\nc. Protestant\nd. Orthodox (Russian/Greek/etc.)\ne. Jew\nf. Muslim\ng. Hindu\nh. Buddhist\ni. Other Christian (Evangelical/Pentecostal/Free church/etc.)\nj. Other"
+}
+```
+
+```json
+{
+  "question_id": "F118",
+  "text": "Please tell me for each of the following whether you think it can always be justified, never be justified, or something in between.\nHomosexuality\nChoices:\na. Never justifiable\nb. 2\nc. 3\nd. 4\ne. 5\nf. 6\ng. 7\nh. 8\ni. 9\nj. Always justifiable"
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 0
+- Prefix prompt:
+
+  ```text
+  The following are multiple choice questions (with answers).
+  ```
+
+- Base prompt template:
+
+  ```text
+  Question: {text}
+  Options:
+  a. {option_a}
+  b. {option_b}
+  (...)
+  k. {option_k}
+  Answer: {label}
+  ```
+
+- Instruction-tuned prompt template:
+
+  ```text
+  Question: {text}
+  Options:
+  a. {option_a}
+  b. {option_b}
+  (...)
+  k. {option_k}
+
+  Answer the above question by replying with 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h',
+  'i', 'j', or 'k', and nothing else.
+  ```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+euroeval --model <model-id> --dataset valeu-en
+```

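Since the three new ValEU datasets share the same structure, a single run can plausibly cover all of them. A sketch, assuming the `dataset` argument accepts a list of names the way the CLI accepts repeated `--dataset` arguments; the model ID is a placeholder:

```python
from euroeval import Benchmarker

benchmarker = Benchmarker()

# One run over the Danish, Dutch and English European-values datasets.
benchmarker.benchmark(
    model="<model-id-or-path>",
    dataset=["valeu-da", "valeu-nl", "valeu-en"],
)
```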