ScandEval 16.10.1.tar.gz → 16.12.0.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- scandeval-16.12.0/.github/auto_assign.yaml +29 -0
- scandeval-16.12.0/.github/workflows/auto_assign_reviewers.yaml +15 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/.github/workflows/ci.yaml +4 -4
- {scandeval-16.10.1 → scandeval-16.12.0}/.pre-commit-config.yaml +5 -5
- {scandeval-16.10.1 → scandeval-16.12.0}/CHANGELOG.md +75 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/CONTRIBUTING.md +1 -1
- {scandeval-16.10.1 → scandeval-16.12.0}/Dockerfile.cuda +1 -1
- {scandeval-16.10.1 → scandeval-16.12.0}/LICENSE +1 -1
- {scandeval-16.10.1 → scandeval-16.12.0}/PKG-INFO +50 -24
- {scandeval-16.10.1 → scandeval-16.12.0}/README.md +40 -18
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/danish.md +79 -1
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/dutch.md +170 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/english.md +78 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/estonian.md +79 -1
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/finnish.md +78 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/french.md +78 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/german.md +101 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/icelandic.md +78 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/italian.md +78 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/norwegian.md +78 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/polish.md +78 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/portuguese.md +87 -9
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/spanish.md +85 -7
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/swedish.md +84 -6
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/faq.md +4 -2
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/python-package.md +33 -67
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/README.md +5 -7
- scandeval-16.12.0/docs/tasks/bias-detection.md +29 -0
- scandeval-16.12.0/docs/tasks/european-values.md +33 -0
- scandeval-16.12.0/docs/tasks/simplification.md +36 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/makefile +2 -2
- {scandeval-16.10.1 → scandeval-16.12.0}/mkdocs.yaml +7 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/pyproject.toml +16 -8
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/__init__.py +0 -9
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmark_config_factory.py +5 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmark_modules/hf.py +36 -8
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmark_modules/litellm.py +119 -22
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmark_modules/vllm.py +202 -94
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmarker.py +28 -7
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/cli.py +13 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/constants.py +31 -2
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/data_models.py +12 -2
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/dutch.py +10 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/logging_utils.py +1 -1
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/metrics/__init__.py +1 -0
- scandeval-16.12.0/src/scandeval/metrics/bias.py +237 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/metrics/huggingface.py +5 -3
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/metrics/llm_as_a_judge.py +79 -15
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/model_loading.py +2 -1
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/task_group_utils/sequence_classification.py +12 -3
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/tasks.py +22 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/tokenisation_utils.py +12 -1
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/types.py +39 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/utils.py +38 -66
- scandeval-16.12.0/src/scripts/create_mbbq_nl.py +213 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/load_ud_pos.py +11 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/conftest.py +1 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_benchmark_config_factory.py +10 -10
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_benchmarker.py +44 -17
- scandeval-16.12.0/tests/test_bias_metrics.py +144 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_cli.py +1 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_data_loading.py +1 -1
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_dataset_configs.py +3 -2
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_model_loading.py +7 -9
- {scandeval-16.10.1 → scandeval-16.12.0}/uv.lock +1781 -1755
- scandeval-16.10.1/docs/tasks/simplification.md +0 -42
- {scandeval-16.10.1 → scandeval-16.12.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/.github/ISSUE_TEMPLATE/language_request.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/.gitignore +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/.markdownlint.jsonc +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/CITATION.cff +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/CODE_OF_CONDUCT.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/NEW_DATASET_GUIDE.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/CNAME +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/README.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/README.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/albanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/bosnian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/bulgarian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/catalan.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/croatian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/czech.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/faroese.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/greek.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/hungarian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/latvian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/lithuanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/romanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/serbian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/slovak.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/slovene.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/datasets/ukrainian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/extras/radial_plotter.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/gfx/favicon.png +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/albanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/bosnian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/bulgarian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/catalan.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/croatian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/czech.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/danish.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/dutch.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/english.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/estonian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/faroese.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/finnish.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/french.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/german.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/greek.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/hungarian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/icelandic.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/italian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/latvian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/lithuanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/norwegian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/polish.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/portuguese.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/romanian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/serbian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/slovak.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/slovene.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/spanish.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/swedish.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Monolingual/ukrainian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Multilingual/baltic.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Multilingual/european.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Multilingual/finnic.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Multilingual/germanic.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Multilingual/romance.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/Multilingual/slavic.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/leaderboards/README.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/methodology.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/common-sense-reasoning.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/knowledge.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/linguistic-acceptability.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/named-entity-recognition.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/reading-comprehension.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/sentiment-classification.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/speed.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/docs/tasks/summarization.md +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/gfx/euroeval.png +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/gfx/euroeval.xcf +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/gfx/scandeval.png +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmark_modules/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmark_modules/base.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/benchmark_modules/fresh.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/caching_utils.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/callbacks.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/data_loading.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/albanian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/bosnian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/bulgarian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/catalan.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/croatian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/czech.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/danish.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/english.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/estonian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/faroese.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/finnish.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/french.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/german.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/greek.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/hungarian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/icelandic.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/italian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/latvian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/lithuanian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/norwegian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/polish.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/portuguese.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/romanian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/serbian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/slovak.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/slovene.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/spanish.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/swedish.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/dataset_configs/ukrainian.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/enums.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/exceptions.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/finetuning.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/generation.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/generation_utils.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/languages.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/metrics/base.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/metrics/pipeline.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/metrics/speed.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/model_cache.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/model_config.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/classification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/linguistic_acceptability.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/multiple_choice.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/named_entity_recognition.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/reading_comprehension.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/sentiment_classification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/simplification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/summarization.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/prompt_templates/token_classification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/scores.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/speed_benchmark.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/task_group_utils/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/task_group_utils/multiple_choice_classification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/task_group_utils/question_answering.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/task_group_utils/text_to_text.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scandeval/task_group_utils/token_classification.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/constants.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_allocine.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_angry_tweets.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_arc.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_arc_is.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_atsiliepimai.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_belebele.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_bg_ner_bsnlp.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_boolq_pt.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_cinexio.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_cnn_dailymail.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_conll_en.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_conll_es.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_conll_nl.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_copa_lv.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_copa_nl.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_cross_domain_uk_reviews.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_cs_gec.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_csfd_sentiment.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_csfd_sentiment_sk.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_czech_news.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_dacsa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_dane.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_danish_citizen_tests.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_dansk.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_danske_talemaader.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_danske_talemaader_old.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_dbrd.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_duidelijke_taal.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_dutch_cola.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_elner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_eltec.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_err_news.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_estner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_estonian_valence.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_european_values.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_exam_et.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_exams_bg.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_fone.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_foqa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_fosent.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_fquad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_fullstack_ner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_germanquad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_germeval.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_global_mmlu.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_goldenswag.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_grammar_et.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_greek_sa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_greek_wikipedia.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_guia_cat.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_harem.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_hellaswag.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_hellaswag_cs.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_hellaswag_fi.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_hun_sum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_husst.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_ice_linguistic.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_icelandic_error_corpus.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_icelandic_knowledge.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_icelandic_qa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_icesum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_idioms_no.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_ilpost_sum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_jentoft.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_kpwr_ner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_latvian_lsm_summary.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_latvian_twitter_sentiment.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_life_in_the_uk.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_lithuanian_lrytas_summarization.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_llmzszl.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_lr_sum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_lt_emotions.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_lt_history.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mim_gold_ner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mlqa_es.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mlsum_de.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mlsum_es.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mmlu.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mmlu_et.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mmlu_hr.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mmlu_lv.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_mms.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_multi_wiki_qa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_multinerd-it.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_ner_uk.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_no_cola.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_no_sammendrag.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_nor_common_sense_qa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_nordjylland_news.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_norec.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_norglm_multiqa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_norglm_multisum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_norne.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_norquad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_nqii.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_nrk_quiz_qa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_orange_sum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_personal_sum.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_polemo2.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_poner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_poquad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_psc.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_publico.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_ronec.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_rosent.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_rrn.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sb10k.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_scala.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_scandiqa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_scandisent_fi.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_schibsted.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sentiment_headlines_es.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sentinews.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sentipolc16.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_skolprov.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sqad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_squad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_squad_it.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_squad_nl.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_squad_nl_old.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_ssj500k_ner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sst2_pt.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sst5.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_suc3.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_sumo_ro.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_swedish_facts.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_swedn.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_swerec.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_szeged_ner.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_trivia_et.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_turku_ner_fi.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_tydiqa_fi.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_umimeto_qa.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_uner_sk.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_uner_sr.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_wiki_lingua_nl.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_wikiann.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_wikineural-it.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_winogrande.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_winogrande_et.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_winogrande_is.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_xlsum_fi.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/create_xquad.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/fix_dot_env_file.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/src/scripts/versioning.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_benchmark_modules/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_benchmark_modules/test_hf.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_callbacks.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_constants.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_data_models.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_enums.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_exceptions.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_finetuning.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_languages.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_model_config.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scores.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/__init__.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/test_create_scala.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/test_data/de_gsd-ud-train.conllu.adp_det +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/test_data/empty.file +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/test_data/en_gum-ud-train.conllu.case +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_01 +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_02 +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_scripts/test_create_scala/test_data/pl_pdb-ud-train.conllu.aux_clitic_03 +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_speed_benchmark.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_tokenisation_utils.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_types.py +0 -0
- {scandeval-16.10.1 → scandeval-16.12.0}/tests/test_utils.py +0 -0
`.github/auto_assign.yaml` (new file):

```diff
@@ -0,0 +1,29 @@
+# Set to true to add reviewers to pull requests
+addReviewers: true
+
+# Set to true to add assignees to pull requests
+addAssignees: true
+
+# A list of reviewers to be added to pull requests (GitHub user name)
+reviewers:
+  - saattrupdan
+
+# A number of reviewers added to the pull request
+# Set 0 to add all the reviewers (default: 0)
+numberOfReviewers: 0
+
+# Whether to run the action on draft pull requests
+runOnDraft: true
+
+# A list of assignees, overrides reviewers if set
+# assignees:
+#   - assigneeA
+
+# A number of assignees to add to the pull request
+# Set to 0 to add all of the assignees.
+# Uses numberOfReviewers if unset.
+# numberOfAssignees: 2
+
+# A list of keywords to be skipped the process that add reviewers if pull requests include it
+# skipKeywords:
+#   - wip
```
`.github/workflows/auto_assign_reviewers.yaml` (new file):

```diff
@@ -0,0 +1,15 @@
+name: 'Auto Assign'
+on:
+  pull_request:
+    types: [opened, ready_for_review]
+
+jobs:
+  add-reviews:
+    permissions:
+      contents: read
+      pull-requests: write
+    runs-on: ubuntu-latest
+    steps:
+      - uses: kentaro-m/auto-assign-action@v2.0.1
+        with:
+          configuration-path: .github/auto_assign.yaml
```
`.github/workflows/ci.yaml`:

```diff
@@ -31,7 +31,7 @@ jobs:
         uses: astral-sh/setup-uv@v6
         with:
           enable-cache: false
-          python-version: "3.
+          python-version: "3.12"
 
       - name: Run pre-commit hooks
         uses: pre-commit/action@v3.0.1
@@ -43,7 +43,7 @@ jobs:
       pull-requests: write
     strategy:
       matrix:
-        python-version: ["3.
+        python-version: ["3.12", "3.13"]
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v5
@@ -58,7 +58,7 @@ jobs:
          python-version: ${{ matrix.python-version }}
 
       - name: Install Dependencies
-        run: uv sync --no-dev
+        run: uv sync --no-dev --all-extras
 
       - name: Start Ollama server
         run: curl -fsSL https://ollama.com/install.sh | sh && ollama serve &
@@ -95,7 +95,7 @@ jobs:
          python-version: ${{ matrix.python-version }}
 
       - name: Install Dependencies
-        run: uv sync --no-dev
+        run: uv sync --no-dev --all-extras
 
       - name: Start Ollama server
         run: curl -fsSL https://ollama.com/install.sh | sh && ollama serve &
```
`.pre-commit-config.yaml`:

```diff
@@ -8,9 +8,9 @@ repos:
     hooks:
       - id: end-of-file-fixer
       - id: trailing-whitespace
-
+      - id: debug-statements
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.14.
+    rev: v0.14.14
     hooks:
       - id: ruff
         args:
@@ -30,15 +30,15 @@ repos:
           - pyi
           - jupyter
   - repo: https://github.com/kynan/nbstripout
-    rev: 0.
+    rev: 0.9.0
     hooks:
       - id: nbstripout
   - repo: https://github.com/facebook/pyrefly-pre-commit
-    rev: 0.
+    rev: 0.50.1
     hooks:
       - id: pyrefly-check
         name: Pyrefly (type checking)
-        pass_filenames:
+        pass_filenames: false
   - repo: https://github.com/DavidAnson/markdownlint-cli2
     rev: v0.20.0
     hooks:
```
`CHANGELOG.md`:

```diff
@@ -7,6 +7,81 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 
 ## [Unreleased]
 
+## [v16.12.0] - 2026-02-02
+
+### Added
+
+- Added the bias detection task (`multiple-choice-stereotype-bias`) along with the Dutch
+  dataset MBBQ-NL. This was added by @caldaibis ✨
+- Added support for vLLM Metal, so that generative models can now be evaluated on Apple
+  Silicon. Note that this currently does not support structured generation, which means
+  that classification and named entity recognition tasks unfortunately won't work yet.
+  This is due to [this xgrammar
+  issue](https://github.com/vllm-project/vllm/issues/31901).
+
+### Changed
+
+- Replaced the deprecated `VLLM_ATTENTION_BACKEND` environment variable with vLLM's
+  `AttentionConfig` API. Added the `--attention-backend` CLI option to configure the
+  attention backend. Defaults to FLASHINFER. This was added by @SwekeR-463 ✨
+- Now requires Python >=3.12, as Python 3.11 does not support some dependencies.
+- We now raise the vLLM maximum context length for reasoning models from 8,192 to
+  16,384, to accommodate reasoning tokens on datasets that have long documents.
+- We opened up the pinned vLLM version, now set to `>=0.14.1`.
+- Made changes to the codebase that make it compatible with Transformers 5.0, for when
+  vLLM starts supporting it.
+
+### Fixed
+
+- Fixed an issue where a model was incorrectly classified as an encoder model if it had
+  no pipeline tag on the Hugging Face Hub and it relied on a custom implementation that
+  isn't integrated into the `transformers` library.
+- Fixed an issue when a model config had no `pad_token_id` and/or `eos_token_id`.
+- There was an error when evaluating local adapter models, which has now been fixed.
+- Now ensures that the vLLM argument `max_num_batched_tokens` is at least as large as the
+  maximum context length of the model, which gave errors with models that had a maximum
+  context length of less than 8,192.
+
+## [v16.11.0] - 2026-01-21
+
+### Added
+
+- Added model metadata for GPT 5.2.
+- Added better support for unofficial inference providers, allowing model prefixes even
+  if they're not in LiteLLM's official list of providers. Currently this only works with
+  the "ordbogen/" prefix for models available on ordbogen.dk.
+
+### Changed
+
+- LLM-as-a-Judge metrics now support batch scoring across multiple judge outputs.
+- When evaluating datasets with no validation split, we now set the `validation_split`
+  in the resulting JSONL file to `null` rather than `True`, to avoid confusion.
+  Likewise, if a task requires zero-shot evaluation, we set `few_shot` to `null` rather
+  than a Boolean value.
+- When evaluating a reasoning model on a sequence classification task, if the model
+  outputs an answer that starts with one of the candidate labels, we now use that label
+  as the predicted label. Previously, we would have conducted a word edit distance
+  search to find the closest candidate label, which was almost always correct, but not
+  in all cases.
+
+### Fixed
+
+- Quantized models in vLLM now have their dtype inferred automatically, removing
+  explicit dtype casting based on GPU compute capability. This was contributed by
+  @tvosch ✨
+- Evaluation of local vLLM models when no internet connection was available did not work
+  correctly; this has now been fixed. This was contributed by @Touzen ✨
+- More robust detection and handling of errors related to too-long inputs for vLLM
+  models.
+- Some API models need the `logprobs` argument to be a Boolean rather than an integer.
+  This has now been fixed.
+- Better handling of rate limits when evaluating API models, by backing off more
+  aggressively when hitting rate limits.
+- Now truncates prompts for instruction-following models in a smarter way, by removing
+  few-shot examples one by one until the prompt is short enough, rather than just
+  truncating the prompt to the maximum length. This only affects models whose maximum
+  model length is quite small (roughly 5,000 tokens or less).
+
 ## [v16.10.1] - 2026-01-02
 
 ### Changed
```
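The two v16.12.0 additions above introduce a new CLI option and a new task. A minimal sketch of how they could be invoked, assuming the flag value format shown and that the task identifier quoted in the changelog doubles as a `--task` value (with `nl` assumed as the language code for the Dutch MBBQ-NL dataset; model placeholders follow the README convention below):

```bash
# Hedged sketch, not taken from the diff itself:
# --attention-backend and its FLASHINFER default come from the v16.12.0 entry above;
# using multiple-choice-stereotype-bias as a --task value is an assumption.
euroeval --model <model-id-or-path> --attention-backend FLASHINFER
euroeval --model <model-id-or-path> --task multiple-choice-stereotype-bias --language nl
```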
`CONTRIBUTING.md`:

```diff
@@ -72,7 +72,7 @@ guide](https://github.com/atom/atom/blob/master/CONTRIBUTING.md#git-commit-messa
 know how to use emoji for commit messages.
 
 Once your changes are ready, don't forget to
-
+self-review to speed up the review process:zap:.
 
 ### Pull Request
 
```
`Dockerfile.cuda`:

```diff
@@ -3,7 +3,7 @@ FROM nvidia/cuda:12.2.0-base-ubuntu22.04
 # Install dependencies
 RUN apt-get -y update && \
     apt-get -y upgrade && \
-    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends gcc python3.
+    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends gcc python3.12 python3-pip python3-dev git-all && \
     python3 -m pip install --upgrade pip wheel && \
     python3 -m pip install euroeval[all]
 
```
`PKG-INFO`:

````diff
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: ScandEval
-Version: 16.
+Version: 16.12.0
 Summary: The robust European language model benchmark.
 Project-URL: Repository, https://github.com/EuroEval/EuroEval
 Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -8,7 +8,7 @@ Author-email: Dan Saattrup Smart <dan.smart@alexandra.dk>
 Maintainer-email: Dan Saattrup Smart <dan.smart@alexandra.dk>
 License: MIT License
 
-Copyright (c) 2022-
+Copyright (c) 2022-2026 Dan Saattrup Smart
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
@@ -28,7 +28,7 @@ License: MIT License
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 SOFTWARE.
 License-File: LICENSE
-Requires-Python: <4.0,>=3.
+Requires-Python: <4.0,>=3.12
 Requires-Dist: accelerate>=1.9.0
 Requires-Dist: bert-score>=0.3.13
 Requires-Dist: click>=8.1.3
@@ -59,19 +59,23 @@ Requires-Dist: setuptools>=75.8.2
 Requires-Dist: tenacity>=9.0.0
 Requires-Dist: termcolor>=2.0.0
 Requires-Dist: torch>=2.6.0
-Requires-Dist: transformers[mistral-common]
+Requires-Dist: transformers[mistral-common]<5.0.0,>=4.56.0
 Provides-Extra: all
 Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'all'
 Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
 Requires-Dist: ray>=2.53.0; (platform_system == 'Linux') and extra == 'all'
 Requires-Dist: timm>=1.0.19; extra == 'all'
-Requires-Dist: vllm
+Requires-Dist: vllm-metal>=0.1.0; (platform_system == 'Darwin') and extra == 'all'
+Requires-Dist: vllm==0.11.0; (platform_system == 'Darwin') and extra == 'all'
+Requires-Dist: vllm[flashinfer]>=0.14.1; (platform_system == 'Linux') and extra == 'all'
 Provides-Extra: generative
 Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
 Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
 Requires-Dist: ray>=2.53.0; (platform_system == 'Linux') and extra == 'generative'
 Requires-Dist: timm>=1.0.19; extra == 'generative'
-Requires-Dist: vllm
+Requires-Dist: vllm-metal>=0.1.0; (platform_system == 'Darwin') and extra == 'generative'
+Requires-Dist: vllm==0.11.0; (platform_system == 'Darwin') and extra == 'generative'
+Requires-Dist: vllm[flashinfer]>=0.14.1; (platform_system == 'Linux') and extra == 'generative'
 Description-Content-Type: text/markdown
 
 <!-- This disables the requirement that the first line is a top-level heading -->
@@ -96,7 +100,7 @@ ______________________________________________________________________
 [](https://arxiv.org/abs/2406.13469)
 [](https://github.com/EuroEval/EuroEval/blob/main/LICENSE)
 [](https://github.com/EuroEval/EuroEval/commits/main)
-[](https://github.com/EuroEval/EuroEval/tree/main/tests)
 [](https://github.com/EuroEval/EuroEval/blob/main/CODE_OF_CONDUCT.md)
 
 ## Maintainer
@@ -123,16 +127,17 @@ The easiest way to benchmark pretrained models is via the command line interface
 having installed the package, you can benchmark your favorite model like so:
 
 ```bash
-euroeval --model <model-id>
+euroeval --model <model-id-or-path>
 ```
 
-Here `model` is the HuggingFace model ID, which can be found on the [HuggingFace
-Hub](https://huggingface.co/models)
-the
-
+Here `model` is either the HuggingFace model ID, which can be found on the [HuggingFace
+Hub](https://huggingface.co/models), or a local path to a model directory (containing
+the model files as well as the `config.json` file). By default this will benchmark the
+model on all the tasks available. If you want to benchmark on a particular task, then
+use the `--task` argument:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification
+euroeval --model <model-id-or-path> --task sentiment-classification
 ```
 
 We can also narrow down which languages we would like to benchmark on. This can be done
@@ -140,20 +145,20 @@ by setting the `--language` argument. Here we thus benchmark the model on the Da
 sentiment classification task:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification --language da
+euroeval --model <model-id-or-path> --task sentiment-classification --language da
 ```
 
 Multiple models, datasets and/or languages can be specified by just attaching multiple
 arguments. Here is an example with two models:
 
 ```bash
-euroeval --model <model-
+euroeval --model <model-id-or-path-1> --model <model-id-or-path-2>
 ```
 
 The specific model version/revision to use can also be added after the suffix '@':
 
 ```bash
-euroeval --model <model-id>@<commit>
+euroeval --model <model-id-or-path>@<commit>
 ```
 
 This can be a branch name, a tag name, or a commit id. It defaults to 'main' for latest.
@@ -173,7 +178,7 @@ model:
 ```python
 >>> from euroeval import Benchmarker
 >>> benchmarker = Benchmarker()
->>> benchmarker.benchmark(model="<model-id>")
+>>> benchmarker.benchmark(model="<model-id-or-path>")
 ```
 
 To benchmark on a specific task and/or language, you simply specify the `task` or
@@ -181,7 +186,7 @@ To benchmark on a specific task and/or language, you simply specify the `task` o
 
 ```python
 >>> benchmarker.benchmark(
-...     model="<model-id>",
+...     model="<model-id-or-path>",
 ...     task="sentiment-classification",
 ...     language="da",
 ... )
@@ -225,7 +230,7 @@ docker run -e args="<euroeval-arguments>" --gpus 1 --name euroeval --rm euroeval
 ```
 
 Here `<euroeval-arguments>` consists of the arguments added to the `euroeval` CLI
-argument. This could for instance be `--model <model-id> --task
+argument. This could for instance be `--model <model-id-or-path> --task
 sentiment-classification`.
 
 ## Benchmarking custom inference APIs
@@ -291,14 +296,14 @@ script. For example to download the model you want and all of the Danish sentime
 classification datasets:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification --language da --download-only
+euroeval --model <model-id-or-path> --task sentiment-classification --language da --download-only
 ```
 
 Or from a script:
 
 ```python
 >>> benchmarker.benchmark(
-...     model="<model-id>",
+...     model="<model-id-or-path>",
 ...     task="sentiment-classification",
 ...     language="da",
 ...     download_only=True,
@@ -346,7 +351,7 @@ MY_CONFIG = DatasetConfig(
 You can then benchmark your custom dataset by simply running
 
 ```bash
-euroeval --dataset my-dataset --model <model-id>
+euroeval --dataset my-dataset --model <model-id-or-path>
 ```
 
 You can also run the benchmark from a Python script, by simply providing your custom
@@ -356,7 +361,7 @@ dataset configuration directly into the `benchmark` method:
 from euroeval import Benchmarker
 
 benchmarker = Benchmarker()
-benchmarker.benchmark(model="<model-id>", dataset=MY_CONFIG)
+benchmarker.benchmark(model="<model-id-or-path>", dataset=MY_CONFIG)
 ```
 
 We have included three convenience tasks to make it easier to set up custom datasets:
@@ -436,7 +441,7 @@ MY_SQL_DATASET = DatasetConfig(
 Again, with this you can benchmark your custom dataset by simply running
 
 ```bash
-euroeval --dataset my-sql-dataset --model <model-id>
+euroeval --dataset my-sql-dataset --model <model-id-or-path>
 ```
 
 ## Reproducing the evaluation datasets
@@ -592,6 +597,27 @@ A huge thank you to all the contributors who have helped make this project a suc
     alt="Contributor avatar for tvosch"
   />
 </a>
+<a href="https://github.com/Touzen">
+  <img
+    src="https://avatars.githubusercontent.com/u/1416265"
+    width=50
+    alt="Contributor avatar for Touzen"
+  />
+</a>
+<a href="https://github.com/caldaibis">
+  <img
+    src="https://avatars.githubusercontent.com/u/16032437"
+    width=50
+    alt="Contributor avatar for caldaibis"
+  />
+</a>
+<a href="https://github.com/SwekeR-463">
+  <img
+    src="https://avatars.githubusercontent.com/u/114919896?v=4"
+    width=50
+    alt="Contributor avatar for SwekeR-463"
+  />
+</a>
 
 ### Contribute to EuroEval
 
````
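The new environment markers above split the vLLM requirement by platform. A hedged sketch of what an install resolves to under standard pip marker evaluation (the install command itself appears in `Dockerfile.cuda` above):

```bash
# Hedged sketch: per the metadata above, the 'all' extra pulls vllm[flashinfer]>=0.14.1
# on Linux and vllm-metal>=0.1.0 plus vllm==0.11.0 on macOS (Darwin).
pip install 'euroeval[all]'
```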
`README.md`:

````diff
@@ -20,7 +20,7 @@ ______________________________________________________________________
 [](https://arxiv.org/abs/2406.13469)
 [](https://github.com/EuroEval/EuroEval/blob/main/LICENSE)
 [](https://github.com/EuroEval/EuroEval/commits/main)
-[](https://github.com/EuroEval/EuroEval/tree/main/tests)
 [](https://github.com/EuroEval/EuroEval/blob/main/CODE_OF_CONDUCT.md)
 
 ## Maintainer
@@ -47,16 +47,17 @@ The easiest way to benchmark pretrained models is via the command line interface
 having installed the package, you can benchmark your favorite model like so:
 
 ```bash
-euroeval --model <model-id>
+euroeval --model <model-id-or-path>
 ```
 
-Here `model` is the HuggingFace model ID, which can be found on the [HuggingFace
-Hub](https://huggingface.co/models)
-the
-
+Here `model` is either the HuggingFace model ID, which can be found on the [HuggingFace
+Hub](https://huggingface.co/models), or a local path to a model directory (containing
+the model files as well as the `config.json` file). By default this will benchmark the
+model on all the tasks available. If you want to benchmark on a particular task, then
+use the `--task` argument:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification
+euroeval --model <model-id-or-path> --task sentiment-classification
 ```
 
 We can also narrow down which languages we would like to benchmark on. This can be done
@@ -64,20 +65,20 @@ by setting the `--language` argument. Here we thus benchmark the model on the Da
 sentiment classification task:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification --language da
+euroeval --model <model-id-or-path> --task sentiment-classification --language da
 ```
 
 Multiple models, datasets and/or languages can be specified by just attaching multiple
 arguments. Here is an example with two models:
 
 ```bash
-euroeval --model <model-
+euroeval --model <model-id-or-path-1> --model <model-id-or-path-2>
 ```
 
 The specific model version/revision to use can also be added after the suffix '@':
 
 ```bash
-euroeval --model <model-id>@<commit>
+euroeval --model <model-id-or-path>@<commit>
 ```
 
 This can be a branch name, a tag name, or a commit id. It defaults to 'main' for latest.
@@ -97,7 +98,7 @@ model:
 ```python
 >>> from euroeval import Benchmarker
 >>> benchmarker = Benchmarker()
->>> benchmarker.benchmark(model="<model-id>")
+>>> benchmarker.benchmark(model="<model-id-or-path>")
 ```
 
 To benchmark on a specific task and/or language, you simply specify the `task` or
@@ -105,7 +106,7 @@ To benchmark on a specific task and/or language, you simply specify the `task` o
 
 ```python
 >>> benchmarker.benchmark(
-...     model="<model-id>",
+...     model="<model-id-or-path>",
 ...     task="sentiment-classification",
 ...     language="da",
 ... )
@@ -149,7 +150,7 @@ docker run -e args="<euroeval-arguments>" --gpus 1 --name euroeval --rm euroeval
 ```
 
 Here `<euroeval-arguments>` consists of the arguments added to the `euroeval` CLI
-argument. This could for instance be `--model <model-id> --task
+argument. This could for instance be `--model <model-id-or-path> --task
 sentiment-classification`.
 
 ## Benchmarking custom inference APIs
@@ -215,14 +216,14 @@ script. For example to download the model you want and all of the Danish sentime
 classification datasets:
 
 ```bash
-euroeval --model <model-id> --task sentiment-classification --language da --download-only
+euroeval --model <model-id-or-path> --task sentiment-classification --language da --download-only
 ```
 
 Or from a script:
 
 ```python
 >>> benchmarker.benchmark(
-...     model="<model-id>",
+...     model="<model-id-or-path>",
 ...     task="sentiment-classification",
 ...     language="da",
 ...     download_only=True,
@@ -270,7 +271,7 @@ MY_CONFIG = DatasetConfig(
 You can then benchmark your custom dataset by simply running
 
 ```bash
-euroeval --dataset my-dataset --model <model-id>
+euroeval --dataset my-dataset --model <model-id-or-path>
 ```
 
 You can also run the benchmark from a Python script, by simply providing your custom
@@ -280,7 +281,7 @@ dataset configuration directly into the `benchmark` method:
 from euroeval import Benchmarker
 
 benchmarker = Benchmarker()
-benchmarker.benchmark(model="<model-id>", dataset=MY_CONFIG)
+benchmarker.benchmark(model="<model-id-or-path>", dataset=MY_CONFIG)
 ```
 
 We have included three convenience tasks to make it easier to set up custom datasets:
@@ -360,7 +361,7 @@ MY_SQL_DATASET = DatasetConfig(
 Again, with this you can benchmark your custom dataset by simply running
 
 ```bash
-euroeval --dataset my-sql-dataset --model <model-id>
+euroeval --dataset my-sql-dataset --model <model-id-or-path>
 ```
 
 ## Reproducing the evaluation datasets
@@ -516,6 +517,27 @@ A huge thank you to all the contributors who have helped make this project a suc
     alt="Contributor avatar for tvosch"
  />
 </a>
+<a href="https://github.com/Touzen">
+  <img
+    src="https://avatars.githubusercontent.com/u/1416265"
+    width=50
+    alt="Contributor avatar for Touzen"
+  />
+</a>
+<a href="https://github.com/caldaibis">
+  <img
+    src="https://avatars.githubusercontent.com/u/16032437"
+    width=50
+    alt="Contributor avatar for caldaibis"
+  />
+</a>
+<a href="https://github.com/SwekeR-463">
+  <img
+    src="https://avatars.githubusercontent.com/u/114919896?v=4"
+    width=50
+    alt="Contributor avatar for SwekeR-463"
+  />
+</a>
 
 ### Contribute to EuroEval
 
````
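The README change above replaces the `<model-id>` placeholder with `<model-id-or-path>`, meaning a local directory containing the model files and `config.json` is now accepted in place of a Hub model ID. A minimal sketch with a placeholder path:

```bash
# Hedged sketch: the local path is a placeholder; the task and language values are the
# ones used in the README examples above.
euroeval --model /path/to/local/model --task sentiment-classification --language da
```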
`docs/datasets/danish.md`:

````diff
@@ -1002,7 +1002,7 @@ Here are a few examples from the training split:
 
 ```json
 {
-  "text": "
+  "text": "Jeg kunne ikke kontrollere fugten, som jeg kontrollerede regnen, fordi _ kom ind overalt. Hvad refererer det tomme _ til?\nSvarmuligheder:\na. fugt\nb. regn",
   "label": "a"
 }
 ```
@@ -1116,3 +1116,81 @@ You can evaluate this dataset directly as follows:
 ```bash
 euroeval --model <model-id> --dataset nordjylland-news
 ```
+
+## European Values
+
+### ValEU-da
+
+This dataset is the official Danish version of questions from the [European values
+study](https://europeanvaluesstudy.eu/). The dataset contains multiple-choice
+questions regarding people's values and beliefs across a variety of topics, such as
+politics, religion and society.
+
+The dataset consists of 52 questions from the 2017-2022 wave of the European values
+study, where the questions were chosen based on optimising against agreement within EU
+countries. We use only zero-shot evaluation on this dataset, and thus require no splits.
+
+Here are a few examples from the training split:
+
+```json
+{
+  "question_id": "C039",
+  "text": "Hvor enig eller uenig er du i følgende udsagn?\nDet er ens pligt over for samfundet at arbejde.\nSvarmuligheder:\na. Helt enig\nb. Enig\nc. Hverken enig eller uenig\nd. Uenig\ne. Helt uenig"
+}
+```
+
+```json
+{
+  "question_id": "F122",
+  "text": "Fortæl for hver af handlingerne på dette kort, i hvilken grad du billiger handlingen. 1 betyder, at du slet ikke billiger dem, 10 betyder, at du i høj grad billiger dem\nAktiv dødshjælp\nSvarmuligheder:\na. Aldrig\nb. 2\nc. 3\nd. 4\ne. 5\nf. 6\ng. 7\nh. 8\ni. 9\nj. Altid"
+}
+```
+
+```json
+{
+  "question_id": "C041",
+  "text": "Hvor enig eller uenig er du i følgende udsagn?\nArbejde kommer først, også selv om det betyder mindre fritid.\nSvarmuligheder:\na. Helt enig\nb. Enig\nc. Hverken enig eller uenig\nd. Uenig\ne. Helt uenig"
+}
+```
+
+When evaluating generative models, we use the following setup (see the
+[methodology](/methodology) for more information on how these are used):
+
+- Number of few-shot examples: 0
+- Prefix prompt:
+
+  ```text
+  Følgende er multiple choice spørgsmål (med svar).
+  ```
+
+- Base prompt template:
+
+  ```text
+  Spørgsmål: {text}
+  Svarmuligheder:
+  a. {option_a}
+  b. {option_b}
+  (...)
+  k. {option_k}
+  Svar: {label}
+  ```
+
+- Instruction-tuned prompt template:
+
+  ```text
+  Spørgsmål: {text}
+  Svarmuligheder:
+  a. {option_a}
+  b. {option_b}
+  (...)
+  k. {option_k}
+
+  Besvar ovenstående spørgsmål ved at svare med 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h',
+  'i', 'j' eller 'k', og intet andet.
+  ```
+
+You can evaluate this dataset directly as follows:
+
+```bash
+euroeval --model <model-id> --dataset valeu-da
+```
````