EuroEval 16.1.1.tar.gz → 16.2.1.tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release.
This version of EuroEval might be problematic.
- {euroeval-16.1.1 → euroeval-16.2.1}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +1 -0
- euroeval-16.2.1/.github/ISSUE_TEMPLATE/language_request.yaml +49 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +1 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/CHANGELOG.md +34 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/PKG-INFO +31 -7
- {euroeval-16.1.1 → euroeval-16.2.1}/README.md +26 -2
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/datasets/icelandic.md +10 -10
- {euroeval-16.1.1 → euroeval-16.2.1}/pyproject.toml +10 -7
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/__init__.py +7 -6
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/benchmark_config_factory.py +41 -125
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/benchmark_modules/hf.py +31 -16
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/benchmark_modules/litellm.py +2 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/benchmark_modules/vllm.py +24 -9
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/benchmarker.py +138 -16
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/cli.py +8 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/data_models.py +5 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/generation.py +3 -1
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/metrics/base.py +12 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/metrics/huggingface.py +23 -2
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/prompt_templates/linguistic_acceptability.py +6 -5
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/prompt_templates/named_entity_recognition.py +3 -3
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/prompt_templates/sentiment_classification.py +5 -5
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/tasks.py +3 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/tokenisation_utils.py +0 -6
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/types.py +2 -2
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/utils.py +77 -5
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/conftest.py +1 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/test_benchmarker.py +56 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/test_cli.py +2 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/test_data_loading.py +9 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/uv.lock +668 -522
- {euroeval-16.1.1 → euroeval-16.2.1}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/.github/workflows/ci.yaml +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/.gitignore +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/.pre-commit-config.yaml +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/CITATION.cff +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/CODE_OF_CONDUCT.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/CONTRIBUTING.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/Dockerfile.cuda +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/LICENSE +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/NEW_DATASET_GUIDE.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/CNAME +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/README.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/datasets/README.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/datasets/danish.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/datasets/dutch.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/datasets/english.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/datasets/estonian.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/datasets/faroese.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/datasets/finnish.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/datasets/french.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/datasets/german.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/datasets/italian.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/datasets/latvian.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/datasets/norwegian.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/datasets/polish.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/datasets/portuguese.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/datasets/spanish.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/datasets/swedish.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/extras/radial_plotter.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/faq.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/gfx/favicon.png +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/leaderboards/Monolingual/danish.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/leaderboards/Monolingual/dutch.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/leaderboards/Monolingual/english.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/leaderboards/Monolingual/estonian.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/leaderboards/Monolingual/faroese.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/leaderboards/Monolingual/finnish.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/leaderboards/Monolingual/french.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/leaderboards/Monolingual/german.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/leaderboards/Monolingual/icelandic.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/leaderboards/Monolingual/italian.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/leaderboards/Monolingual/norwegian.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/leaderboards/Monolingual/portuguese.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/leaderboards/Monolingual/spanish.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/leaderboards/Monolingual/swedish.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/leaderboards/Multilingual/european.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/leaderboards/Multilingual/finnic.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/leaderboards/Multilingual/germanic.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/leaderboards/Multilingual/romance.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/leaderboards/README.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/methodology.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/python-package.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/tasks/README.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/tasks/common-sense-reasoning.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/tasks/knowledge.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/tasks/linguistic-acceptability.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/tasks/named-entity-recognition.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/tasks/reading-comprehension.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/tasks/sentiment-classification.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/tasks/speed.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/docs/tasks/summarization.md +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/gfx/euroeval.png +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/gfx/euroeval.xcf +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/gfx/scandeval.png +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/makefile +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/mkdocs.yaml +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/benchmark_modules/__init__.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/benchmark_modules/base.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/benchmark_modules/fresh.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/callbacks.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/constants.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/data_loading.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/dataset_configs/__init__.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/dataset_configs/danish.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/dataset_configs/dutch.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/dataset_configs/english.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/dataset_configs/estonian.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/dataset_configs/faroese.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/dataset_configs/finnish.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/dataset_configs/french.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/dataset_configs/german.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/dataset_configs/icelandic.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/dataset_configs/italian.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/dataset_configs/latvian.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/dataset_configs/norwegian.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/dataset_configs/polish.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/dataset_configs/portuguese.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/dataset_configs/spanish.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/dataset_configs/swedish.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/enums.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/exceptions.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/finetuning.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/generation_utils.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/languages.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/metrics/__init__.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/metrics/llm_as_a_judge.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/metrics/pipeline.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/metrics/speed.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/model_cache.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/model_config.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/model_loading.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/prompt_templates/__init__.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/prompt_templates/multiple_choice.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/prompt_templates/reading_comprehension.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/prompt_templates/summarization.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/scores.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/speed_benchmark.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/task_group_utils/__init__.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/task_group_utils/multiple_choice_classification.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/task_group_utils/question_answering.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/task_group_utils/sequence_classification.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/task_group_utils/text_to_text.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/task_group_utils/token_classification.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/constants.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_allocine.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_angry_tweets.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_arc.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_arc_is.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_belebele.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_boolq_pt.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_cnn_dailymail.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_conll_en.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_conll_es.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_conll_nl.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_copa_lv.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_dane.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_danish_citizen_tests.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_dansk.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_danske_talemaader.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_danske_talemaader_old.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_dbrd.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_dutch_cola.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_eltec.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_err_news.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_estner.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_estonian_valence.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_european_values.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_exam_et.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_fone.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_foqa.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_fosent.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_fquad.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_fullstack_ner.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_germanquad.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_germeval.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_goldenswag.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_grammar_et.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_harem.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_hellaswag.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_hellaswag_fi.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_ice_linguistic.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_icelandic_error_corpus.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_icelandic_knowledge.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_icelandic_qa.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_icesum.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_idioms_no.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_ilpost_sum.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_jentoft.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_kpwr_ner.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_latvian_lsm_summary.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_latvian_twitter_sentiment.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_life_in_the_uk.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_llmzszl.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_mim_gold_ner.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_mlqa_es.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_mlsum_de.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_mlsum_es.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_mmlu.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_mmlu_lv.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_multi_wiki_qa.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_multinerd-it.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_no_cola.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_no_sammendrag.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_nor_common_sense_qa.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_nordjylland_news.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_norec.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_norglm_multiqa.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_norglm_multisum.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_norne.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_norquad.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_nqii.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_nrk_quiz_qa.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_orange_sum.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_personal_sum.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_polemo2.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_poquad.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_psc.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_publico.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_rrn.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_sb10k.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_scala.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_scandiqa.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_scandisent_fi.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_schibsted.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_sentiment_headlines_es.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_sentipolc16.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_squad.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_squad_it.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_squad_nl.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_squad_nl_old.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_sst2_pt.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_sst5.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_suc3.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_swedish_skolprov.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_swedn.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_swerec.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_trivia_et.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_turku_ner_fi.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_tydiqa_fi.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_wiki_lingua_nl.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_wikiann_lv.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_wikineural-it.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_winogrande.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_winogrande_et.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_winogrande_is.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_xlsum_fi.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/create_xquad.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/fix_dot_env_file.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/load_ud_pos.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/src/scripts/versioning.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/__init__.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/test_benchmark_config_factory.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/test_benchmark_modules/__init__.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/test_benchmark_modules/test_hf.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/test_callbacks.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/test_constants.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/test_data_models.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/test_dataset_configs.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/test_enums.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/test_exceptions.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/test_finetuning.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/test_languages.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/test_model_config.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/test_model_loading.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/test_scores.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/test_speed_benchmark.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/test_tasks.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/test_tokenisation_utils.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/test_types.py +0 -0
- {euroeval-16.1.1 → euroeval-16.2.1}/tests/test_utils.py +0 -0
euroeval-16.2.1/.github/ISSUE_TEMPLATE/language_request.yaml (new file)
@@ -0,0 +1,49 @@
+name: 🌍 Language Request
+description: Is there a European language missing in EuroEval?
+title: "[LANGUAGE REQUEST] <language-name>"
+labels: "new language"
+type: task
+
+body:
+  - type: input
+    attributes:
+      label: Language name and code
+      description: What is the name and ISO 639 code of the language?
+    validations:
+      required: true
+  - type: markdown
+    attributes:
+      value: >
+        Here are some existing evaluation datasets in the language, that could be used:
+  - type: textarea
+    attributes:
+      label: Sentiment classification dataset
+      description: Link to one or more datasets in the language (leave blank if unknown)
+  - type: textarea
+    attributes:
+      label: Linguistic acceptability dataset
+      description: Link to one or more datasets in the language (leave blank if unknown)
+  - type: textarea
+    attributes:
+      label: Named entity recognition dataset
+      description: Link to one or more datasets in the language (leave blank if unknown)
+  - type: textarea
+    attributes:
+      label: Reading comprehension dataset
+      description: Link to one or more datasets in the language (leave blank if unknown)
+  - type: textarea
+    attributes:
+      label: Summarisation dataset
+      description: Link to one or more datasets in the language (leave blank if unknown)
+  - type: textarea
+    attributes:
+      label: Knowledge dataset
+      description: Link to one or more datasets in the language (leave blank if unknown)
+  - type: textarea
+    attributes:
+      label: Common-sense reasoning dataset
+      description: Link to one or more datasets in the language (leave blank if unknown)
+  - type: markdown
+    attributes:
+      value: >
+        Thanks for contributing 🎉!
{euroeval-16.1.1 → euroeval-16.2.1}/CHANGELOG.md
@@ -10,6 +10,40 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.



+## [v16.2.1] - 2025-09-15
+### Fixed
+- Some of the `download_only` arguments were missing in the code, and have now been
+  added.
+
+
+## [v16.2.0] - 2025-09-15
+### Added
+- Now supports evaluating models in an offline environment. This is done by first
+  downloading all necessary models, datasets, metrics and other artifacts while online,
+  using the new `--download-only` flag (or `download_only=True` in the `Benchmarker`
+  API). Then you can safely disable internet access and run the evaluation as normal,
+  and it will use the cached models, datasets and metrics. This was contributed by
+  @viggo-gascou ✨
+- Added the `timm` package to the set of `generative` extra dependencies, as it is
+  required to load some multimodal models, such as Gemma-3n.
+
+### Changed
+- Now does not benchmark encoder models on multiple-choice classification tasks, as they
+  get near-random performance and these scores are not used in the leaderboards. We can
+  change this in the future if we find a way to make encoder models work better on these
+  tasks.
+- For generative vLLM models that can swap between reasoning and non-reasoning modes,
+  we previously defaulted to reasoning. We now default to what the model uses by
+  default, which is non-reasoning for most models.
+
+### Fixed
+- Fixed an issue where old evaluation records could not be loaded, as the format had
+  changed. We are now able to load old records again.
+- Fixed some grammatical errors in the Icelandic prompts.
+- Now stores model IDs with parameters (e.g., `o3#low`) correctly in the benchmark
+  results, rather than just the base model ID (e.g., `o3`).
+
+
 ## [v16.1.1] - 2025-09-12
 ### Fixed
 - Fixed an issue from v16.1.0, where reasoning models were not using the tokeniser's
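Putting the offline-evaluation entry above into practice, here is a minimal sketch that mirrors the snippets added to the README in this release; the `<model-id>` placeholder and the Danish sentiment-classification example come straight from that README section, so replace them with your own values:

```python
from euroeval import Benchmarker

benchmark = Benchmarker()

# Step 1 (while online): cache the model, datasets and metrics locally.
benchmark(
    model="<model-id>",  # placeholder, as in the README
    task="sentiment-classification",
    language="da",
    download_only=True,
)

# Step 2 (offline): run the same evaluation; the cached artifacts are reused.
benchmark(
    model="<model-id>",
    task="sentiment-classification",
    language="da",
)
```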
{euroeval-16.1.1 → euroeval-16.2.1}/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: EuroEval
-Version: 16.1.1
+Version: 16.2.1
 Summary: The robust European language model benchmark.
 Project-URL: Repository, https://github.com/EuroEval/EuroEval
 Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -61,13 +61,13 @@ Requires-Dist: transformers[mistral-common]>=4.56.0
 Provides-Extra: all
 Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'all'
 Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
-Requires-Dist:
-Requires-Dist: vllm>=0.10.1; (platform_system == 'Linux') and extra == 'all'
+Requires-Dist: timm>=1.0.19; extra == 'all'
+Requires-Dist: vllm[flashinfer]>=0.10.1; (platform_system == 'Linux') and extra == 'all'
 Provides-Extra: generative
 Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
 Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
-Requires-Dist:
-Requires-Dist: vllm>=0.10.1; (platform_system == 'Linux') and extra == 'generative'
+Requires-Dist: timm>=1.0.19; extra == 'generative'
+Requires-Dist: vllm[flashinfer]>=0.10.1; (platform_system == 'Linux') and extra == 'generative'
 Description-Content-Type: text/markdown

 <div align='center'>
@@ -152,13 +152,13 @@ model:
 ```
 >>> from euroeval import Benchmarker
 >>> benchmark = Benchmarker()
->>> benchmark(model="<model>")
+>>> benchmark(model="<model-id>")
 ```

 To benchmark on a specific task and/or language, you simply specify the `task` or
 `language` arguments, shown here with same example as above:
 ```
->>> benchmark(model="<model>", task="sentiment-classification", language="da")
+>>> benchmark(model="<model-id>", task="sentiment-classification", language="da")
 ```

 If you want to benchmark a subset of all the models on the Hugging Face Hub, you can
@@ -168,6 +168,30 @@ models on the Danish sentiment classification task:
 >>> benchmark(task="sentiment-classification", language="da")
 ```

+### Benchmarking in an Offline Environment
+If you need to benchmark in an offline environment, you need to download the models,
+datasets and metrics beforehand. This can be done by adding the `--download-only`
+argument, from the command line, or the `download_only` argument, if benchmarking from a
+script. For example to download the model you want and all of the Danish sentiment
+classification datasets:
+```
+$ euroeval --model <model-id> --task sentiment-classification --language da --download-only
+```
+
+Or from a script:
+```
+>>> benchmark(
+...     model="<model-id>",
+...     task="sentiment-classification",
+...     language="da",
+...     download_only=True,
+... )
+```
+
+Please note: Offline benchmarking of adapter models is not currently supported. An
+internet connection will be required during evaluation. If offline support is important
+to you, please consider [opening an issue](https://github.com/EuroEval/EuroEval/issues).
+
 ### Benchmarking from Docker
 A Dockerfile is provided in the repo, which can be downloaded and run, without needing
 to clone the repo and installing from source. This can be fetched programmatically by
{euroeval-16.1.1 → euroeval-16.2.1}/README.md
@@ -80,13 +80,13 @@ model:
 ```
 >>> from euroeval import Benchmarker
 >>> benchmark = Benchmarker()
->>> benchmark(model="<model>")
+>>> benchmark(model="<model-id>")
 ```

 To benchmark on a specific task and/or language, you simply specify the `task` or
 `language` arguments, shown here with same example as above:
 ```
->>> benchmark(model="<model>", task="sentiment-classification", language="da")
+>>> benchmark(model="<model-id>", task="sentiment-classification", language="da")
 ```

 If you want to benchmark a subset of all the models on the Hugging Face Hub, you can
@@ -96,6 +96,30 @@ models on the Danish sentiment classification task:
 >>> benchmark(task="sentiment-classification", language="da")
 ```

+### Benchmarking in an Offline Environment
+If you need to benchmark in an offline environment, you need to download the models,
+datasets and metrics beforehand. This can be done by adding the `--download-only`
+argument, from the command line, or the `download_only` argument, if benchmarking from a
+script. For example to download the model you want and all of the Danish sentiment
+classification datasets:
+```
+$ euroeval --model <model-id> --task sentiment-classification --language da --download-only
+```
+
+Or from a script:
+```
+>>> benchmark(
+...     model="<model-id>",
+...     task="sentiment-classification",
+...     language="da",
+...     download_only=True,
+... )
+```
+
+Please note: Offline benchmarking of adapter models is not currently supported. An
+internet connection will be required during evaluation. If offline support is important
+to you, please consider [opening an issue](https://github.com/EuroEval/EuroEval/issues).
+
 ### Benchmarking from Docker
 A Dockerfile is provided in the repo, which can be downloaded and run, without needing
 to clone the repo and installing from source. This can be fetched programmatically by
{euroeval-16.1.1 → euroeval-16.2.1}/docs/datasets/icelandic.md
@@ -44,11 +44,11 @@ When evaluating generative models, we use the following setup (see the
 - Number of few-shot examples: 12
 - Prefix prompt:
 ```
-
+Hér fyrir neðan eru textabrot ásamt lyndisgildi þeirra sem getur verið 'jákvætt', 'hlutlaust' eða 'neikvætt'.
 ```
 - Base prompt template:
 ```
-
+Textabrot: {text}
 Lyndi: {label}
 ```
 - Instruction-tuned prompt template:
@@ -117,13 +117,13 @@ When evaluating generative models, we use the following setup (see the
 - Base prompt template:
 ```
 Setning: {text}
-
+Nafneiningar: {label}
 ```
 - Instruction-tuned prompt template:
 ```
 Setning: {text}

-
+Greindu nefndu einingarnar í setningunni. Þú ættir að skila þessu sem JSON orðabók með lyklunum 'einstaklingur', 'staðsetning', 'stofnun' og 'ýmislegt'. Gildin ættu að vera listi yfir nefndu einingarnar af þeirri gerð, nákvæmlega eins og þær koma fram í setningunni.
 ```
 - Label mapping:
     - `B-PER` ➡️ `einstaklingur`
@@ -186,7 +186,7 @@ When evaluating generative models, we use the following setup (see the
 - Number of few-shot examples: 12
 - Prefix prompt:
 ```
-
+Hér fyrir neðan eru setningar ásamt mati á því hvort þær eru málfræðilega réttar.
 ```
 - Base prompt template:
 ```
@@ -197,7 +197,7 @@ When evaluating generative models, we use the following setup (see the
 ```
 Setning: {text}

-
+Greindu hvort setningin er málfræðilega rétt. Svaraðu með 'já' ef setningin er rétt og 'nei' ef hún er það ekki.
 ```
 - Label mapping:
     - `correct` ➡️ `já`
@@ -249,7 +249,7 @@ When evaluating generative models, we use the following setup (see the
 - Number of few-shot examples: 12
 - Prefix prompt:
 ```
-
+Hér fyrir neðan eru setningar ásamt mati á því hvort þær eru málfræðilega réttar.
 ```
 - Base prompt template:
 ```
@@ -260,7 +260,7 @@ When evaluating generative models, we use the following setup (see the
 ```
 Setning: {text}

-
+Greindu hvort setningin er málfræðilega rétt. Svaraðu með 'já' ef setningin er rétt og 'nei' ef hún er það ekki.
 ```
 - Label mapping:
     - `correct` ➡️ `já`
@@ -310,7 +310,7 @@ When evaluating generative models, we use the following setup (see the
 - Number of few-shot examples: 12
 - Prefix prompt:
 ```
-
+Hér fyrir neðan eru setningar ásamt mati á því hvort þær eru málfræðilega réttar.
 ```
 - Base prompt template:
 ```
@@ -321,7 +321,7 @@ When evaluating generative models, we use the following setup (see the
 ```
 Setning: {text}

-
+Greindu hvort setningin er málfræðilega rétt. Svaraðu með 'já' ef setningin er rétt og 'nei' ef hún er það ekki.
 ```
 - Label mapping:
     - `correct` ➡️ `já`
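The instruction-tuned NER prompt above asks the model to answer with a JSON dictionary keyed by the four entity types from the label mapping. An illustrative sketch of that expected shape, written as a Python dict with made-up entity values:

```python
# Keys come from the prompt and label mapping above; the values are invented
# purely to illustrate the format.
expected_ner_answer = {
    "einstaklingur": ["Jón Jónsson"],  # person entities, verbatim from the sentence
    "staðsetning": ["Reykjavík"],      # location entities
    "stofnun": [],                     # organisation entities
    "ýmislegt": [],                    # miscellaneous entities
}
```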
{euroeval-16.1.1 → euroeval-16.2.1}/pyproject.toml
@@ -1,6 +1,6 @@
 [project]
 name = "EuroEval"
-version = "16.1.1"
+version = "16.2.1"
 description = "The robust European language model benchmark."
 readme = "README.md"
 authors = [
@@ -33,7 +33,7 @@ dependencies = [
     "rouge-score>=0.1.2",
     "bert-score>=0.3.13",
     "levenshtein>=0.24.0",
-    "scikit-learn==1.6.1",
+    "scikit-learn==1.6.1",  # Required for loading European values pipeline
     "setuptools>=75.8.2",
     "demjson3>=3.0.6",
     "ollama>=0.5.1",
@@ -45,15 +45,15 @@ dependencies = [
 [project.optional-dependencies]
 generative = [
     "bitsandbytes>=0.43.1; platform_system == 'Linux'",
-    "vllm>=0.10.1; platform_system == 'Linux'",
-    "flashinfer-python>=0.3.1; platform_system == 'Linux'",
+    "vllm[flashinfer]>=0.10.1; platform_system == 'Linux'",
     "fbgemm-gpu>=1.0.0; platform_system == 'Linux'",
+    "timm>=1.0.19",
 ]
 all = [
     "bitsandbytes>=0.43.1; platform_system == 'Linux'",
-    "vllm>=0.10.1; platform_system == 'Linux'",
-    "flashinfer-python>=0.3.1; platform_system == 'Linux'",
+    "vllm[flashinfer]>=0.10.1; platform_system == 'Linux'",
     "fbgemm-gpu>=1.0.0; platform_system == 'Linux'",
+    "timm>=1.0.19",
 ]

 [project.urls]
@@ -100,6 +100,8 @@ dev-dependencies = [
     "types-ujson>=5.10.0.20240515",
     "types-simplejson>=3.2.0.2025032",
     "debugpy>=1.8.13",
+    "pytest-socket>=0.7.0",
+    "pytest-dependency>=0.6.0",
 ]

 [tool.ruff]
@@ -170,6 +172,7 @@ addopts = [
     "--cov=src/euroeval",
     "--color=yes",
     "-vvv",
+    "--allow-unix-socket"
 ]
 xfail_strict = true
 filterwarnings = [
@@ -181,7 +184,7 @@ filterwarnings = [
     "ignore::ResourceWarning",
     "ignore::FutureWarning",
 ]
-log_cli_level = "
+log_cli_level = "INFO"
 testpaths = [
     "tests",
     "src/euroeval",
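The new `pytest-socket` dev dependency and the `--allow-unix-socket` entry added to pytest's `addopts` suggest that parts of the test suite now run with network access blocked, plausibly to cover the new offline mode. As an assumption about what such a test could look like (not a copy of the actual tests):

```python
import pytest


@pytest.mark.disable_socket
def test_runs_without_network() -> None:
    """With pytest-socket, any attempt to open a network socket here fails the test."""
    assert sum([1, 2, 3]) == 6
```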
{euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/__init__.py
@@ -12,12 +12,13 @@ import warnings
 from termcolor import colored

 # Block specific warnings before importing anything else, as they can be noisy
-
-warnings.filterwarnings("ignore", category=
-
-logging.getLogger("
-logging.getLogger("
-
+if os.getenv("FULL_LOG") != "1":
+    warnings.filterwarnings("ignore", category=UserWarning)
+    warnings.filterwarnings("ignore", category=FutureWarning)
+    logging.getLogger("httpx").setLevel(logging.CRITICAL)
+    logging.getLogger("datasets").setLevel(logging.CRITICAL)
+    logging.getLogger("vllm").setLevel(logging.CRITICAL)
+    os.environ["VLLM_CONFIGURE_LOGGING"] = "0"

 # Set up logging
 fmt = colored("%(asctime)s", "light_blue") + " ⋅ " + colored("%(message)s", "green")
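The `__init__.py` change above gates the warning filters and third-party logger silencing behind a `FULL_LOG` environment variable. A small sketch of opting back into the full output, assuming the variable only needs to be set before `euroeval` is first imported (the gate runs at import time):

```python
import os

# Assumption: FULL_LOG must be "1" in the environment before euroeval is imported.
os.environ["FULL_LOG"] = "1"

import euroeval  # noqa: E402  # warnings and httpx/datasets/vllm logs stay visible
```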
{euroeval-16.1.1 → euroeval-16.2.1}/src/euroeval/benchmark_config_factory.py
@@ -6,9 +6,9 @@ import typing as t

 import torch

-from .data_models import BenchmarkConfig
+from .data_models import BenchmarkConfig, BenchmarkConfigParams
 from .dataset_configs import get_all_dataset_configs
-from .enums import Device
+from .enums import Device
 from .exceptions import InvalidBenchmark
 from .languages import get_all_languages
 from .tasks import SPEED, get_all_tasks
@@ -21,150 +21,66 @@ logger = logging.getLogger("euroeval")


 def build_benchmark_config(
-
-    save_results: bool,
-    task: str | list[str] | None,
-    dataset: str | list[str] | None,
-    language: str | list[str],
-    model_language: str | list[str] | None,
-    dataset_language: str | list[str] | None,
-    device: Device | None,
-    batch_size: int,
-    raise_errors: bool,
-    cache_dir: str,
-    api_key: str | None,
-    force: bool,
-    verbose: bool,
-    trust_remote_code: bool,
-    clear_model_cache: bool,
-    evaluate_test_split: bool,
-    few_shot: bool,
-    num_iterations: int,
-    api_base: str | None,
-    api_version: str | None,
-    gpu_memory_utilization: float,
-    generative_type: GenerativeType | None,
-    debug: bool,
-    run_with_cli: bool,
-    requires_safetensors: bool,
+    benchmark_config_params: BenchmarkConfigParams,
 ) -> BenchmarkConfig:
     """Create a benchmark configuration.

     Args:
-
-
-        save_results:
-            Whether to save the benchmark results to a file.
-        task:
-            The tasks to include for dataset. If None then datasets will not be
-            filtered based on their task.
-        dataset:
-            The datasets to include for task. If None then all datasets will be
-            included, limited by the `task` parameter.
-        language:
-            The language codes of the languages to include, both for models and
-            datasets. Here 'no' means both Bokmål (nb) and Nynorsk (nn). Set this
-            to 'all' if all languages should be considered.
-        model_language:
-            The language codes of the languages to include for models. If None then
-            the `language` parameter will be used.
-        dataset_language:
-            The language codes of the languages to include for datasets. If None then
-            the `language` parameter will be used.
-        device:
-            The device to use for running the models. If None then the device will be
-            set automatically.
-        batch_size:
-            The batch size to use for running the models.
-        raise_errors:
-            Whether to raise errors when running the benchmark.
-        cache_dir:
-            The directory to use for caching the models.
-        api_key:
-            The API key to use for a given inference server.
-        force:
-            Whether to force the benchmark to run even if the results are already
-            cached.
-        verbose:
-            Whether to print verbose output when running the benchmark. This is
-            automatically set if `debug` is True.
-        trust_remote_code:
-            Whether to trust remote code when running the benchmark.
-        clear_model_cache:
-            Whether to clear the model cache before running the benchmark.
-        evaluate_test_split:
-            Whether to use the test split for the datasets.
-        few_shot:
-            Whether to use few-shot learning for the models.
-        num_iterations:
-            The number of iterations each model should be evaluated for.
-        api_base:
-            The base URL for a given inference API. Only relevant if `model` refers to a
-            model on an inference API.
-        api_version:
-            The version of the API to use for a given inference API.
-        gpu_memory_utilization:
-            The GPU memory utilization to use for vLLM. A larger value will result in
-            faster evaluation, but at the risk of running out of GPU memory. Only reduce
-            this if you are running out of GPU memory. Only relevant if the model is
-            generative.
-        generative_type:
-            The type of generative model. Only relevant if the model is generative. If
-            not specified, the type will be inferred automatically.
-        debug:
-            Whether to run the benchmark in debug mode.
-        run_with_cli:
-            Whether the benchmark is being run with the CLI.
-        requires_safetensors:
-            Whether to only allow evaluations of models stored as safetensors.
+        benchmark_config_params:
+            The parameters for creating the benchmark configuration.

     Returns:
         The benchmark configuration.
     """
-    language_codes = get_correct_language_codes(
+    language_codes = get_correct_language_codes(
+        language_codes=benchmark_config_params.language
+    )
     model_languages = prepare_languages(
-        language_codes=model_language,
+        language_codes=benchmark_config_params.model_language,
+        default_language_codes=language_codes,
     )
     dataset_languages = prepare_languages(
-        language_codes=dataset_language,
+        language_codes=benchmark_config_params.dataset_language,
+        default_language_codes=language_codes,
     )

     tasks, datasets = prepare_tasks_and_datasets(
-        task=task,
+        task=benchmark_config_params.task,
+        dataset=benchmark_config_params.dataset,
+        dataset_languages=dataset_languages,
     )

-    torch_device = prepare_device(device=device)
-
-    # Set variable with number of iterations
-    if hasattr(sys, "_called_from_test"):
-        num_iterations = 1
-
     return BenchmarkConfig(
         model_languages=model_languages,
         dataset_languages=dataset_languages,
         tasks=tasks,
         datasets=datasets,
-        batch_size=batch_size,
-        raise_errors=raise_errors,
-        cache_dir=cache_dir,
-        api_key=api_key,
-        force=force,
-        progress_bar=progress_bar,
-        save_results=save_results,
-        verbose=verbose or debug,
-        device=
-        trust_remote_code=trust_remote_code,
-        clear_model_cache=clear_model_cache,
-        evaluate_test_split=evaluate_test_split,
-        few_shot=few_shot,
-        num_iterations=
-
-
-
-
-
-
-
+        batch_size=benchmark_config_params.batch_size,
+        raise_errors=benchmark_config_params.raise_errors,
+        cache_dir=benchmark_config_params.cache_dir,
+        api_key=benchmark_config_params.api_key,
+        force=benchmark_config_params.force,
+        progress_bar=benchmark_config_params.progress_bar,
+        save_results=benchmark_config_params.save_results,
+        verbose=benchmark_config_params.verbose or benchmark_config_params.debug,
+        device=prepare_device(device=benchmark_config_params.device),
+        trust_remote_code=benchmark_config_params.trust_remote_code,
+        clear_model_cache=benchmark_config_params.clear_model_cache,
+        evaluate_test_split=benchmark_config_params.evaluate_test_split,
+        few_shot=benchmark_config_params.few_shot,
+        num_iterations=(
+            1
+            if hasattr(sys, "_called_from_test")
+            else benchmark_config_params.num_iterations
+        ),
+        api_base=benchmark_config_params.api_base,
+        api_version=benchmark_config_params.api_version,
+        gpu_memory_utilization=benchmark_config_params.gpu_memory_utilization,
+        generative_type=benchmark_config_params.generative_type,
+        debug=benchmark_config_params.debug,
+        run_with_cli=benchmark_config_params.run_with_cli,
+        requires_safetensors=benchmark_config_params.requires_safetensors,
+        download_only=benchmark_config_params.download_only,
     )

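The `build_benchmark_config` refactor above replaces roughly twenty-five keyword arguments with a single `BenchmarkConfigParams` object that is read off attribute by attribute. The pattern in miniature, using generic names rather than the actual EuroEval classes (the real `BenchmarkConfigParams` lives in `src/euroeval/data_models.py` and has many more fields):

```python
from dataclasses import dataclass


@dataclass
class ConfigParams:
    """Generic stand-in for a parameter object such as BenchmarkConfigParams."""

    language: list[str]
    batch_size: int = 32
    download_only: bool = False


def build_config(params: ConfigParams) -> dict:
    """Consume the parameter object instead of a long keyword-argument list."""
    return {
        "language": params.language,
        "batch_size": params.batch_size,
        "download_only": params.download_only,
    }


print(build_config(ConfigParams(language=["da"], download_only=True)))
```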