EuroEval 16.1.0.tar.gz → 16.2.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.


This version of EuroEval might be problematic.

Files changed (284)
  1. {euroeval-16.1.0 → euroeval-16.2.0}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +1 -0
  2. euroeval-16.2.0/.github/ISSUE_TEMPLATE/language_request.yaml +49 -0
  3. {euroeval-16.1.0 → euroeval-16.2.0}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +1 -0
  4. {euroeval-16.1.0 → euroeval-16.2.0}/.gitignore +3 -0
  5. {euroeval-16.1.0 → euroeval-16.2.0}/.pre-commit-config.yaml +1 -1
  6. {euroeval-16.1.0 → euroeval-16.2.0}/CHANGELOG.md +36 -0
  7. {euroeval-16.1.0 → euroeval-16.2.0}/PKG-INFO +31 -7
  8. {euroeval-16.1.0 → euroeval-16.2.0}/README.md +26 -2
  9. {euroeval-16.1.0 → euroeval-16.2.0}/docs/datasets/icelandic.md +10 -10
  10. {euroeval-16.1.0 → euroeval-16.2.0}/pyproject.toml +10 -7
  11. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/__init__.py +7 -6
  12. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/benchmark_config_factory.py +4 -0
  13. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/benchmark_modules/hf.py +31 -16
  14. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/benchmark_modules/litellm.py +2 -0
  15. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/benchmark_modules/vllm.py +24 -9
  16. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/benchmarker.py +127 -14
  17. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/cli.py +8 -0
  18. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/data_models.py +4 -0
  19. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/generation.py +3 -1
  20. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/generation_utils.py +10 -4
  21. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/metrics/base.py +12 -0
  22. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/metrics/huggingface.py +23 -2
  23. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/prompt_templates/linguistic_acceptability.py +6 -5
  24. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/prompt_templates/named_entity_recognition.py +3 -3
  25. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/prompt_templates/sentiment_classification.py +5 -5
  26. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/task_group_utils/sequence_classification.py +1 -1
  27. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/tasks.py +3 -0
  28. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/tokenisation_utils.py +12 -13
  29. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/types.py +2 -2
  30. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/utils.py +77 -5
  31. {euroeval-16.1.0 → euroeval-16.2.0}/tests/conftest.py +1 -0
  32. {euroeval-16.1.0 → euroeval-16.2.0}/tests/test_benchmarker.py +56 -0
  33. {euroeval-16.1.0 → euroeval-16.2.0}/tests/test_cli.py +2 -0
  34. {euroeval-16.1.0 → euroeval-16.2.0}/tests/test_data_loading.py +10 -1
  35. {euroeval-16.1.0 → euroeval-16.2.0}/uv.lock +668 -522
  36. euroeval-16.1.0/generated_contracts/employment_contract_001.md +0 -137
  37. euroeval-16.1.0/generated_contracts/employment_contract_002.md +0 -152
  38. euroeval-16.1.0/generated_contracts/employment_contract_003.md +0 -144
  39. euroeval-16.1.0/generated_contracts/employment_contract_004.md +0 -139
  40. euroeval-16.1.0/generated_contracts/employment_contract_005.md +0 -146
  41. euroeval-16.1.0/generated_contracts/employment_contract_006.md +0 -127
  42. euroeval-16.1.0/generated_contracts/employment_contract_007.md +0 -147
  43. euroeval-16.1.0/generated_contracts/employment_contract_008.md +0 -136
  44. euroeval-16.1.0/generated_contracts/employment_contract_009.md +0 -143
  45. euroeval-16.1.0/generated_contracts/employment_contract_010.md +0 -148
  46. {euroeval-16.1.0 → euroeval-16.2.0}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
  47. {euroeval-16.1.0 → euroeval-16.2.0}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
  48. {euroeval-16.1.0 → euroeval-16.2.0}/.github/workflows/ci.yaml +0 -0
  49. {euroeval-16.1.0 → euroeval-16.2.0}/CITATION.cff +0 -0
  50. {euroeval-16.1.0 → euroeval-16.2.0}/CODE_OF_CONDUCT.md +0 -0
  51. {euroeval-16.1.0 → euroeval-16.2.0}/CONTRIBUTING.md +0 -0
  52. {euroeval-16.1.0 → euroeval-16.2.0}/Dockerfile.cuda +0 -0
  53. {euroeval-16.1.0 → euroeval-16.2.0}/LICENSE +0 -0
  54. {euroeval-16.1.0 → euroeval-16.2.0}/NEW_DATASET_GUIDE.md +0 -0
  55. {euroeval-16.1.0 → euroeval-16.2.0}/docs/CNAME +0 -0
  56. {euroeval-16.1.0 → euroeval-16.2.0}/docs/README.md +0 -0
  57. {euroeval-16.1.0 → euroeval-16.2.0}/docs/datasets/README.md +0 -0
  58. {euroeval-16.1.0 → euroeval-16.2.0}/docs/datasets/danish.md +0 -0
  59. {euroeval-16.1.0 → euroeval-16.2.0}/docs/datasets/dutch.md +0 -0
  60. {euroeval-16.1.0 → euroeval-16.2.0}/docs/datasets/english.md +0 -0
  61. {euroeval-16.1.0 → euroeval-16.2.0}/docs/datasets/estonian.md +0 -0
  62. {euroeval-16.1.0 → euroeval-16.2.0}/docs/datasets/faroese.md +0 -0
  63. {euroeval-16.1.0 → euroeval-16.2.0}/docs/datasets/finnish.md +0 -0
  64. {euroeval-16.1.0 → euroeval-16.2.0}/docs/datasets/french.md +0 -0
  65. {euroeval-16.1.0 → euroeval-16.2.0}/docs/datasets/german.md +0 -0
  66. {euroeval-16.1.0 → euroeval-16.2.0}/docs/datasets/italian.md +0 -0
  67. {euroeval-16.1.0 → euroeval-16.2.0}/docs/datasets/latvian.md +0 -0
  68. {euroeval-16.1.0 → euroeval-16.2.0}/docs/datasets/norwegian.md +0 -0
  69. {euroeval-16.1.0 → euroeval-16.2.0}/docs/datasets/polish.md +0 -0
  70. {euroeval-16.1.0 → euroeval-16.2.0}/docs/datasets/portuguese.md +0 -0
  71. {euroeval-16.1.0 → euroeval-16.2.0}/docs/datasets/spanish.md +0 -0
  72. {euroeval-16.1.0 → euroeval-16.2.0}/docs/datasets/swedish.md +0 -0
  73. {euroeval-16.1.0 → euroeval-16.2.0}/docs/extras/radial_plotter.md +0 -0
  74. {euroeval-16.1.0 → euroeval-16.2.0}/docs/faq.md +0 -0
  75. {euroeval-16.1.0 → euroeval-16.2.0}/docs/gfx/favicon.png +0 -0
  76. {euroeval-16.1.0 → euroeval-16.2.0}/docs/leaderboards/Monolingual/danish.md +0 -0
  77. {euroeval-16.1.0 → euroeval-16.2.0}/docs/leaderboards/Monolingual/dutch.md +0 -0
  78. {euroeval-16.1.0 → euroeval-16.2.0}/docs/leaderboards/Monolingual/english.md +0 -0
  79. {euroeval-16.1.0 → euroeval-16.2.0}/docs/leaderboards/Monolingual/estonian.md +0 -0
  80. {euroeval-16.1.0 → euroeval-16.2.0}/docs/leaderboards/Monolingual/faroese.md +0 -0
  81. {euroeval-16.1.0 → euroeval-16.2.0}/docs/leaderboards/Monolingual/finnish.md +0 -0
  82. {euroeval-16.1.0 → euroeval-16.2.0}/docs/leaderboards/Monolingual/french.md +0 -0
  83. {euroeval-16.1.0 → euroeval-16.2.0}/docs/leaderboards/Monolingual/german.md +0 -0
  84. {euroeval-16.1.0 → euroeval-16.2.0}/docs/leaderboards/Monolingual/icelandic.md +0 -0
  85. {euroeval-16.1.0 → euroeval-16.2.0}/docs/leaderboards/Monolingual/italian.md +0 -0
  86. {euroeval-16.1.0 → euroeval-16.2.0}/docs/leaderboards/Monolingual/norwegian.md +0 -0
  87. {euroeval-16.1.0 → euroeval-16.2.0}/docs/leaderboards/Monolingual/portuguese.md +0 -0
  88. {euroeval-16.1.0 → euroeval-16.2.0}/docs/leaderboards/Monolingual/spanish.md +0 -0
  89. {euroeval-16.1.0 → euroeval-16.2.0}/docs/leaderboards/Monolingual/swedish.md +0 -0
  90. {euroeval-16.1.0 → euroeval-16.2.0}/docs/leaderboards/Multilingual/european.md +0 -0
  91. {euroeval-16.1.0 → euroeval-16.2.0}/docs/leaderboards/Multilingual/finnic.md +0 -0
  92. {euroeval-16.1.0 → euroeval-16.2.0}/docs/leaderboards/Multilingual/germanic.md +0 -0
  93. {euroeval-16.1.0 → euroeval-16.2.0}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
  94. {euroeval-16.1.0 → euroeval-16.2.0}/docs/leaderboards/Multilingual/romance.md +0 -0
  95. {euroeval-16.1.0 → euroeval-16.2.0}/docs/leaderboards/README.md +0 -0
  96. {euroeval-16.1.0 → euroeval-16.2.0}/docs/methodology.md +0 -0
  97. {euroeval-16.1.0 → euroeval-16.2.0}/docs/python-package.md +0 -0
  98. {euroeval-16.1.0 → euroeval-16.2.0}/docs/tasks/README.md +0 -0
  99. {euroeval-16.1.0 → euroeval-16.2.0}/docs/tasks/common-sense-reasoning.md +0 -0
  100. {euroeval-16.1.0 → euroeval-16.2.0}/docs/tasks/knowledge.md +0 -0
  101. {euroeval-16.1.0 → euroeval-16.2.0}/docs/tasks/linguistic-acceptability.md +0 -0
  102. {euroeval-16.1.0 → euroeval-16.2.0}/docs/tasks/named-entity-recognition.md +0 -0
  103. {euroeval-16.1.0 → euroeval-16.2.0}/docs/tasks/reading-comprehension.md +0 -0
  104. {euroeval-16.1.0 → euroeval-16.2.0}/docs/tasks/sentiment-classification.md +0 -0
  105. {euroeval-16.1.0 → euroeval-16.2.0}/docs/tasks/speed.md +0 -0
  106. {euroeval-16.1.0 → euroeval-16.2.0}/docs/tasks/summarization.md +0 -0
  107. {euroeval-16.1.0 → euroeval-16.2.0}/gfx/euroeval.png +0 -0
  108. {euroeval-16.1.0 → euroeval-16.2.0}/gfx/euroeval.xcf +0 -0
  109. {euroeval-16.1.0 → euroeval-16.2.0}/gfx/scandeval.png +0 -0
  110. {euroeval-16.1.0 → euroeval-16.2.0}/makefile +0 -0
  111. {euroeval-16.1.0 → euroeval-16.2.0}/mkdocs.yaml +0 -0
  112. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/benchmark_modules/__init__.py +0 -0
  113. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/benchmark_modules/base.py +0 -0
  114. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/benchmark_modules/fresh.py +0 -0
  115. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/callbacks.py +0 -0
  116. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/constants.py +0 -0
  117. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/data_loading.py +0 -0
  118. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/dataset_configs/__init__.py +0 -0
  119. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/dataset_configs/danish.py +0 -0
  120. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/dataset_configs/dutch.py +0 -0
  121. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/dataset_configs/english.py +0 -0
  122. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/dataset_configs/estonian.py +0 -0
  123. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/dataset_configs/faroese.py +0 -0
  124. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/dataset_configs/finnish.py +0 -0
  125. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/dataset_configs/french.py +0 -0
  126. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/dataset_configs/german.py +0 -0
  127. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/dataset_configs/icelandic.py +0 -0
  128. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/dataset_configs/italian.py +0 -0
  129. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/dataset_configs/latvian.py +0 -0
  130. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/dataset_configs/norwegian.py +0 -0
  131. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/dataset_configs/polish.py +0 -0
  132. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/dataset_configs/portuguese.py +0 -0
  133. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/dataset_configs/spanish.py +0 -0
  134. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/dataset_configs/swedish.py +0 -0
  135. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/enums.py +0 -0
  136. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/exceptions.py +0 -0
  137. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/finetuning.py +0 -0
  138. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/languages.py +0 -0
  139. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/metrics/__init__.py +0 -0
  140. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/metrics/llm_as_a_judge.py +0 -0
  141. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/metrics/pipeline.py +0 -0
  142. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/metrics/speed.py +0 -0
  143. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/model_cache.py +0 -0
  144. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/model_config.py +0 -0
  145. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/model_loading.py +0 -0
  146. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/prompt_templates/__init__.py +0 -0
  147. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/prompt_templates/multiple_choice.py +0 -0
  148. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/prompt_templates/reading_comprehension.py +0 -0
  149. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/prompt_templates/summarization.py +0 -0
  150. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/scores.py +0 -0
  151. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/speed_benchmark.py +0 -0
  152. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/task_group_utils/__init__.py +0 -0
  153. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/task_group_utils/multiple_choice_classification.py +0 -0
  154. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/task_group_utils/question_answering.py +0 -0
  155. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/task_group_utils/text_to_text.py +0 -0
  156. {euroeval-16.1.0 → euroeval-16.2.0}/src/euroeval/task_group_utils/token_classification.py +0 -0
  157. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/constants.py +0 -0
  158. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_allocine.py +0 -0
  159. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_angry_tweets.py +0 -0
  160. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_arc.py +0 -0
  161. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_arc_is.py +0 -0
  162. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_belebele.py +0 -0
  163. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_boolq_pt.py +0 -0
  164. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_cnn_dailymail.py +0 -0
  165. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_conll_en.py +0 -0
  166. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_conll_es.py +0 -0
  167. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_conll_nl.py +0 -0
  168. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_copa_lv.py +0 -0
  169. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_dane.py +0 -0
  170. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_danish_citizen_tests.py +0 -0
  171. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_dansk.py +0 -0
  172. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_danske_talemaader.py +0 -0
  173. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_danske_talemaader_old.py +0 -0
  174. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_dbrd.py +0 -0
  175. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_dutch_cola.py +0 -0
  176. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_eltec.py +0 -0
  177. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_err_news.py +0 -0
  178. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_estner.py +0 -0
  179. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_estonian_valence.py +0 -0
  180. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_european_values.py +0 -0
  181. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_exam_et.py +0 -0
  182. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_fone.py +0 -0
  183. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_foqa.py +0 -0
  184. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_fosent.py +0 -0
  185. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_fquad.py +0 -0
  186. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_fullstack_ner.py +0 -0
  187. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_germanquad.py +0 -0
  188. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_germeval.py +0 -0
  189. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_goldenswag.py +0 -0
  190. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_grammar_et.py +0 -0
  191. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_harem.py +0 -0
  192. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_hellaswag.py +0 -0
  193. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_hellaswag_fi.py +0 -0
  194. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
  195. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_ice_linguistic.py +0 -0
  196. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_icelandic_error_corpus.py +0 -0
  197. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_icelandic_knowledge.py +0 -0
  198. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_icelandic_qa.py +0 -0
  199. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_icesum.py +0 -0
  200. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_idioms_no.py +0 -0
  201. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_ilpost_sum.py +0 -0
  202. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_jentoft.py +0 -0
  203. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_kpwr_ner.py +0 -0
  204. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_latvian_lsm_summary.py +0 -0
  205. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_latvian_twitter_sentiment.py +0 -0
  206. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_life_in_the_uk.py +0 -0
  207. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_llmzszl.py +0 -0
  208. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_mim_gold_ner.py +0 -0
  209. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_mlqa_es.py +0 -0
  210. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_mlsum_de.py +0 -0
  211. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_mlsum_es.py +0 -0
  212. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_mmlu.py +0 -0
  213. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_mmlu_lv.py +0 -0
  214. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_multi_wiki_qa.py +0 -0
  215. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_multinerd-it.py +0 -0
  216. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_no_cola.py +0 -0
  217. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_no_sammendrag.py +0 -0
  218. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_nor_common_sense_qa.py +0 -0
  219. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_nordjylland_news.py +0 -0
  220. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_norec.py +0 -0
  221. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_norglm_multiqa.py +0 -0
  222. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_norglm_multisum.py +0 -0
  223. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_norne.py +0 -0
  224. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_norquad.py +0 -0
  225. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_nqii.py +0 -0
  226. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_nrk_quiz_qa.py +0 -0
  227. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_orange_sum.py +0 -0
  228. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_personal_sum.py +0 -0
  229. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_polemo2.py +0 -0
  230. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_poquad.py +0 -0
  231. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_psc.py +0 -0
  232. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_publico.py +0 -0
  233. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_rrn.py +0 -0
  234. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_sb10k.py +0 -0
  235. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_scala.py +0 -0
  236. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_scandiqa.py +0 -0
  237. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_scandisent_fi.py +0 -0
  238. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_schibsted.py +0 -0
  239. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_sentiment_headlines_es.py +0 -0
  240. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_sentipolc16.py +0 -0
  241. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_squad.py +0 -0
  242. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_squad_it.py +0 -0
  243. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_squad_nl.py +0 -0
  244. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_squad_nl_old.py +0 -0
  245. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_sst2_pt.py +0 -0
  246. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_sst5.py +0 -0
  247. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_suc3.py +0 -0
  248. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_swedish_skolprov.py +0 -0
  249. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_swedn.py +0 -0
  250. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_swerec.py +0 -0
  251. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_trivia_et.py +0 -0
  252. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_turku_ner_fi.py +0 -0
  253. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_tydiqa_fi.py +0 -0
  254. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_wiki_lingua_nl.py +0 -0
  255. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_wikiann_lv.py +0 -0
  256. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_wikineural-it.py +0 -0
  257. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_winogrande.py +0 -0
  258. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_winogrande_et.py +0 -0
  259. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_winogrande_is.py +0 -0
  260. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_xlsum_fi.py +0 -0
  261. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/create_xquad.py +0 -0
  262. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/fix_dot_env_file.py +0 -0
  263. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/load_ud_pos.py +0 -0
  264. {euroeval-16.1.0 → euroeval-16.2.0}/src/scripts/versioning.py +0 -0
  265. {euroeval-16.1.0 → euroeval-16.2.0}/tests/__init__.py +0 -0
  266. {euroeval-16.1.0 → euroeval-16.2.0}/tests/test_benchmark_config_factory.py +0 -0
  267. {euroeval-16.1.0 → euroeval-16.2.0}/tests/test_benchmark_modules/__init__.py +0 -0
  268. {euroeval-16.1.0 → euroeval-16.2.0}/tests/test_benchmark_modules/test_hf.py +0 -0
  269. {euroeval-16.1.0 → euroeval-16.2.0}/tests/test_callbacks.py +0 -0
  270. {euroeval-16.1.0 → euroeval-16.2.0}/tests/test_constants.py +0 -0
  271. {euroeval-16.1.0 → euroeval-16.2.0}/tests/test_data_models.py +0 -0
  272. {euroeval-16.1.0 → euroeval-16.2.0}/tests/test_dataset_configs.py +0 -0
  273. {euroeval-16.1.0 → euroeval-16.2.0}/tests/test_enums.py +0 -0
  274. {euroeval-16.1.0 → euroeval-16.2.0}/tests/test_exceptions.py +0 -0
  275. {euroeval-16.1.0 → euroeval-16.2.0}/tests/test_finetuning.py +0 -0
  276. {euroeval-16.1.0 → euroeval-16.2.0}/tests/test_languages.py +0 -0
  277. {euroeval-16.1.0 → euroeval-16.2.0}/tests/test_model_config.py +0 -0
  278. {euroeval-16.1.0 → euroeval-16.2.0}/tests/test_model_loading.py +0 -0
  279. {euroeval-16.1.0 → euroeval-16.2.0}/tests/test_scores.py +0 -0
  280. {euroeval-16.1.0 → euroeval-16.2.0}/tests/test_speed_benchmark.py +0 -0
  281. {euroeval-16.1.0 → euroeval-16.2.0}/tests/test_tasks.py +0 -0
  282. {euroeval-16.1.0 → euroeval-16.2.0}/tests/test_tokenisation_utils.py +0 -0
  283. {euroeval-16.1.0 → euroeval-16.2.0}/tests/test_types.py +0 -0
  284. {euroeval-16.1.0 → euroeval-16.2.0}/tests/test_utils.py +0 -0
@@ -34,6 +34,7 @@ body:
  - label: Italian
  - label: Latvian
  - label: Norwegian (Bokmål or Nynorsk)
+ - label: Polish
  - label: Portuguese
  - label: Spanish
  - label: Swedish
@@ -0,0 +1,49 @@
+ name: 🌍 Language Request
+ description: Is there a European language missing in EuroEval?
+ title: "[LANGUAGE REQUEST] <language-name>"
+ labels: "new language"
+ type: task
+
+ body:
+ - type: input
+ attributes:
+ label: Language name and code
+ description: What is the name and ISO 639 code of the language?
+ validations:
+ required: true
+ - type: markdown
+ attributes:
+ value: >
+ Here are some existing evaluation datasets in the language, that could be used:
+ - type: textarea
+ attributes:
+ label: Sentiment classification dataset
+ description: Link to one or more datasets in the language (leave blank if unknown)
+ - type: textarea
+ attributes:
+ label: Linguistic acceptability dataset
+ description: Link to one or more datasets in the language (leave blank if unknown)
+ - type: textarea
+ attributes:
+ label: Named entity recognition dataset
+ description: Link to one or more datasets in the language (leave blank if unknown)
+ - type: textarea
+ attributes:
+ label: Reading comprehension dataset
+ description: Link to one or more datasets in the language (leave blank if unknown)
+ - type: textarea
+ attributes:
+ label: Summarisation dataset
+ description: Link to one or more datasets in the language (leave blank if unknown)
+ - type: textarea
+ attributes:
+ label: Knowledge dataset
+ description: Link to one or more datasets in the language (leave blank if unknown)
+ - type: textarea
+ attributes:
+ label: Common-sense reasoning dataset
+ description: Link to one or more datasets in the language (leave blank if unknown)
+ - type: markdown
+ attributes:
+ value: >
+ Thanks for contributing 🎉!
@@ -23,6 +23,7 @@ body:
  - label: West Germanic languages (Dutch, English, German)
  - label: Finnic languages (Estonian, Finnish)
  - label: Latvian
+ - label: Polish
  validations:
  required: true
  - type: dropdown
@@ -121,3 +121,6 @@ gfx/euroeval-*.png
  gfx/euroeval-*.jpeg
  gfx/euroeval-*.jpg
  gfx/euroeval-*.xcf
+
+ # Contracts
+ generated_contracts/
@@ -34,7 +34,7 @@ repos:
  hooks:
  - id: nbstripout
  - repo: https://github.com/pre-commit/mirrors-mypy
- rev: v1.17.1
+ rev: v1.18.1
  hooks:
  - id: mypy
  args:
@@ -10,6 +10,42 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.



+ ## [v16.2.0] - 2025-09-15
+ ### Added
+ - Now supports evaluating models in an offline environment. This is done by first
+ downloading all necessary models, datasets, metrics and other artifacts while online,
+ using the new `--download-only` flag (or `download_only=True` in the `Benchmarker`
+ API). Then you can safely disable internet access and run the evaluation as normal,
+ and it will use the cached models, datasets and metrics. This was contributed by
+ @viggo-gascou ✨
+ - Added the `timm` package to the set of `generative` extra dependencies, as it is
+ required to load some multimodal models, such as Gemma-3n.
+
+ ### Changed
+ - Now does not benchmark encoder models on multiple-choice classification tasks, as they
+ get near-random performance and these scores are not used in the leaderboards. We can
+ change this in the future if we find a way to make encoder models work better on these
+ tasks.
+ - For generative vLLM models that can swap between reasoning and non-reasoning modes,
+ we previously defaulted to reasoning. We now default to what the model uses by
+ default, which is non-reasoning for most models.
+
+ ### Fixed
+ - Fixed an issue where old evaluation records could not be loaded, as the format had
+ changed. We are now able to load old records again.
+ - Fixed some grammatical errors in the Icelandic prompts.
+ - Now stores model IDs with parameters (e.g., `o3#low`) correctly in the benchmark
+ results, rather than just the base model ID (e.g., `o3`).
+
+
+ ## [v16.1.1] - 2025-09-12
+ ### Fixed
+ - Fixed an issue from v16.1.0, where reasoning models were not using the tokeniser's
+ chat template.
+ - Fixed an issue with some of the prompts for base decoders, that the list of possible
+ labels for sequence classification tasks was not included in the prompt.
+
+
  ## [v16.1.0] - 2025-09-11
  ### Added
  - Added support for Polish 🇵🇱! This includes the reading comprehension dataset PoQuAD,
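The offline-evaluation entry above amounts to a two-pass workflow: cache everything while online, then evaluate with internet access disabled. A minimal Python sketch of that workflow, using only the `download_only` argument documented in this release (model ID and arguments are placeholders):

```python
# Sketch of the two-pass offline workflow from the changelog entry above.
from euroeval import Benchmarker

benchmark = Benchmarker()

# Pass 1 (while online): cache the model, datasets and metrics without evaluating.
benchmark(
    model="<model-id>",
    task="sentiment-classification",
    language="da",
    download_only=True,
)

# Pass 2 (later, with internet access disabled): the same call without
# `download_only` runs the evaluation from the local cache.
benchmark(
    model="<model-id>",
    task="sentiment-classification",
    language="da",
)
```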
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: EuroEval
- Version: 16.1.0
+ Version: 16.2.0
  Summary: The robust European language model benchmark.
  Project-URL: Repository, https://github.com/EuroEval/EuroEval
  Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -61,13 +61,13 @@ Requires-Dist: transformers[mistral-common]>=4.56.0
  Provides-Extra: all
  Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'all'
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
- Requires-Dist: flashinfer-python>=0.3.1; (platform_system == 'Linux') and extra == 'all'
- Requires-Dist: vllm>=0.10.1; (platform_system == 'Linux') and extra == 'all'
+ Requires-Dist: timm>=1.0.19; extra == 'all'
+ Requires-Dist: vllm[flashinfer]>=0.10.1; (platform_system == 'Linux') and extra == 'all'
  Provides-Extra: generative
  Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
- Requires-Dist: flashinfer-python>=0.3.1; (platform_system == 'Linux') and extra == 'generative'
- Requires-Dist: vllm>=0.10.1; (platform_system == 'Linux') and extra == 'generative'
+ Requires-Dist: timm>=1.0.19; extra == 'generative'
+ Requires-Dist: vllm[flashinfer]>=0.10.1; (platform_system == 'Linux') and extra == 'generative'
  Description-Content-Type: text/markdown

  <div align='center'>
@@ -152,13 +152,13 @@ model:
  ```
  >>> from euroeval import Benchmarker
  >>> benchmark = Benchmarker()
- >>> benchmark(model="<model>")
+ >>> benchmark(model="<model-id>")
  ```

  To benchmark on a specific task and/or language, you simply specify the `task` or
  `language` arguments, shown here with same example as above:
  ```
- >>> benchmark(model="<model>", task="sentiment-classification", language="da")
+ >>> benchmark(model="<model-id>", task="sentiment-classification", language="da")
  ```

  If you want to benchmark a subset of all the models on the Hugging Face Hub, you can
@@ -168,6 +168,30 @@ models on the Danish sentiment classification task:
  >>> benchmark(task="sentiment-classification", language="da")
  ```

+ ### Benchmarking in an Offline Environment
+ If you need to benchmark in an offline environment, you need to download the models,
+ datasets and metrics beforehand. This can be done by adding the `--download-only`
+ argument, from the command line, or the `download_only` argument, if benchmarking from a
+ script. For example to download the model you want and all of the Danish sentiment
+ classification datasets:
+ ```
+ $ euroeval --model <model-id> --task sentiment-classification --language da --download-only
+ ```
+
+ Or from a script:
+ ```
+ >>> benchmark(
+ ...     model="<model-id>",
+ ...     task="sentiment-classification",
+ ...     language="da",
+ ...     download_only=True,
+ ... )
+ ```
+
+ Please note: Offline benchmarking of adapter models is not currently supported. An
+ internet connection will be required during evaluation. If offline support is important
+ to you, please consider [opening an issue](https://github.com/EuroEval/EuroEval/issues).
+
  ### Benchmarking from Docker
  A Dockerfile is provided in the repo, which can be downloaded and run, without needing
  to clone the repo and installing from source. This can be fetched programmatically by
@@ -80,13 +80,13 @@ model:
  ```
  >>> from euroeval import Benchmarker
  >>> benchmark = Benchmarker()
- >>> benchmark(model="<model>")
+ >>> benchmark(model="<model-id>")
  ```

  To benchmark on a specific task and/or language, you simply specify the `task` or
  `language` arguments, shown here with same example as above:
  ```
- >>> benchmark(model="<model>", task="sentiment-classification", language="da")
+ >>> benchmark(model="<model-id>", task="sentiment-classification", language="da")
  ```

  If you want to benchmark a subset of all the models on the Hugging Face Hub, you can
@@ -96,6 +96,30 @@ models on the Danish sentiment classification task:
  >>> benchmark(task="sentiment-classification", language="da")
  ```

+ ### Benchmarking in an Offline Environment
+ If you need to benchmark in an offline environment, you need to download the models,
+ datasets and metrics beforehand. This can be done by adding the `--download-only`
+ argument, from the command line, or the `download_only` argument, if benchmarking from a
+ script. For example to download the model you want and all of the Danish sentiment
+ classification datasets:
+ ```
+ $ euroeval --model <model-id> --task sentiment-classification --language da --download-only
+ ```
+
+ Or from a script:
+ ```
+ >>> benchmark(
+ ...     model="<model-id>",
+ ...     task="sentiment-classification",
+ ...     language="da",
+ ...     download_only=True,
+ ... )
+ ```
+
+ Please note: Offline benchmarking of adapter models is not currently supported. An
+ internet connection will be required during evaluation. If offline support is important
+ to you, please consider [opening an issue](https://github.com/EuroEval/EuroEval/issues).
+
  ### Benchmarking from Docker
  A Dockerfile is provided in the repo, which can be downloaded and run, without needing
  to clone the repo and installing from source. This can be fetched programmatically by
@@ -44,11 +44,11 @@ When evaluating generative models, we use the following setup (see the
  - Number of few-shot examples: 12
  - Prefix prompt:
  ```
- Eftirfarandi eru yfirferðir ásamt lyndisgildi þeirra, sem getur verið 'jákvætt', 'hlutlaust' eða 'neikvætt'.
+ Hér fyrir neðan eru textabrot ásamt lyndisgildi þeirra sem getur verið 'jákvætt', 'hlutlaust' eða 'neikvætt'.
  ```
  - Base prompt template:
  ```
- Yfirferð: {text}
+ Textabrot: {text}
  Lyndi: {label}
  ```
  - Instruction-tuned prompt template:
@@ -117,13 +117,13 @@ When evaluating generative models, we use the following setup (see the
  - Base prompt template:
  ```
  Setning: {text}
- Nefndar einingar: {label}
+ Nafneiningar: {label}
  ```
  - Instruction-tuned prompt template:
  ```
  Setning: {text}

- Greinið nefndu einingarnar í setningunni. Þú ættir að skila þessu sem JSON orðabók með lyklunum 'einstaklingur', 'staðsetning', 'stofnun' og 'ýmislegt'. Gildin ættu að vera listi yfir nefndu einingarnar af þeirri gerð, nákvæmlega eins og þær koma fram í setningunni.
+ Greindu nefndu einingarnar í setningunni. Þú ættir að skila þessu sem JSON orðabók með lyklunum 'einstaklingur', 'staðsetning', 'stofnun' og 'ýmislegt'. Gildin ættu að vera listi yfir nefndu einingarnar af þeirri gerð, nákvæmlega eins og þær koma fram í setningunni.
  ```
  - Label mapping:
  - `B-PER` ➡️ `einstaklingur`
@@ -186,7 +186,7 @@ When evaluating generative models, we use the following setup (see the
  - Number of few-shot examples: 12
  - Prefix prompt:
  ```
- Eftirfarandi eru setningar og hvort þær eru málfræðilega réttar.
+ Hér fyrir neðan eru setningar ásamt mati á því hvort þær eru málfræðilega réttar.
  ```
  - Base prompt template:
  ```
@@ -197,7 +197,7 @@ When evaluating generative models, we use the following setup (see the
  ```
  Setning: {text}

- Greinið hvort setningin er málfræðilega rétt eða ekki. Svarið skal vera 'já' ef setningin er rétt og 'nei' ef hún er ekki.
+ Greindu hvort setningin er málfræðilega rétt. Svaraðu með 'já' ef setningin er rétt og 'nei' ef hún er það ekki.
  ```
  - Label mapping:
  - `correct` ➡️ `já`
@@ -249,7 +249,7 @@ When evaluating generative models, we use the following setup (see the
  - Number of few-shot examples: 12
  - Prefix prompt:
  ```
- Eftirfarandi eru setningar og hvort þær eru málfræðilega réttar.
+ Hér fyrir neðan eru setningar ásamt mati á því hvort þær eru málfræðilega réttar.
  ```
  - Base prompt template:
  ```
@@ -260,7 +260,7 @@ When evaluating generative models, we use the following setup (see the
  ```
  Setning: {text}

- Greinið hvort setningin er málfræðilega rétt eða ekki. Svarið skal vera 'já' ef setningin er rétt og 'nei' ef hún er ekki.
+ Greindu hvort setningin er málfræðilega rétt. Svaraðu með 'já' ef setningin er rétt og 'nei' ef hún er það ekki.
  ```
  - Label mapping:
  - `correct` ➡️ `já`
@@ -310,7 +310,7 @@ When evaluating generative models, we use the following setup (see the
  - Number of few-shot examples: 12
  - Prefix prompt:
  ```
- Eftirfarandi eru setningar og hvort þær eru málfræðilega réttar.
+ Hér fyrir neðan eru setningar ásamt mati á því hvort þær eru málfræðilega réttar.
  ```
  - Base prompt template:
  ```
@@ -321,7 +321,7 @@ When evaluating generative models, we use the following setup (see the
  ```
  Setning: {text}

- Greinið hvort setningin er málfræðilega rétt eða ekki. Svarið skal vera 'já' ef setningin er rétt og 'nei' ef hún er ekki.
+ Greindu hvort setningin er málfræðilega rétt. Svaraðu með 'já' ef setningin er rétt og 'nei' ef hún er það ekki.
  ```
  - Label mapping:
  - `correct` ➡️ `já`
@@ -1,6 +1,6 @@
  [project]
  name = "EuroEval"
- version = "16.1.0"
+ version = "16.2.0"
  description = "The robust European language model benchmark."
  readme = "README.md"
  authors = [
@@ -33,7 +33,7 @@ dependencies = [
  "rouge-score>=0.1.2",
  "bert-score>=0.3.13",
  "levenshtein>=0.24.0",
- "scikit-learn==1.6.1", # Required for loading European values pipeline
+ "scikit-learn==1.6.1", # Required for loading European values pipeline
  "setuptools>=75.8.2",
  "demjson3>=3.0.6",
  "ollama>=0.5.1",
@@ -45,15 +45,15 @@ dependencies = [
  [project.optional-dependencies]
  generative = [
  "bitsandbytes>=0.43.1; platform_system == 'Linux'",
- "vllm>=0.10.1; platform_system == 'Linux'",
- "flashinfer-python>=0.3.1; platform_system == 'Linux'",
+ "vllm[flashinfer]>=0.10.1; platform_system == 'Linux'",
  "fbgemm-gpu>=1.0.0; platform_system == 'Linux'",
+ "timm>=1.0.19",
  ]
  all = [
  "bitsandbytes>=0.43.1; platform_system == 'Linux'",
- "vllm>=0.10.1; platform_system == 'Linux'",
- "flashinfer-python>=0.3.1; platform_system == 'Linux'",
+ "vllm[flashinfer]>=0.10.1; platform_system == 'Linux'",
  "fbgemm-gpu>=1.0.0; platform_system == 'Linux'",
+ "timm>=1.0.19",
  ]

  [project.urls]
@@ -100,6 +100,8 @@ dev-dependencies = [
  "types-ujson>=5.10.0.20240515",
  "types-simplejson>=3.2.0.2025032",
  "debugpy>=1.8.13",
+ "pytest-socket>=0.7.0",
+ "pytest-dependency>=0.6.0",
  ]

  [tool.ruff]
@@ -170,6 +172,7 @@ addopts = [
  "--cov=src/euroeval",
  "--color=yes",
  "-vvv",
+ "--allow-unix-socket"
  ]
  xfail_strict = true
  filterwarnings = [
@@ -181,7 +184,7 @@ filterwarnings = [
  "ignore::ResourceWarning",
  "ignore::FutureWarning",
  ]
- log_cli_level = "info"
+ log_cli_level = "INFO"
  testpaths = [
  "tests",
  "src/euroeval",
@@ -12,12 +12,13 @@ import warnings
  from termcolor import colored

  # Block specific warnings before importing anything else, as they can be noisy
- warnings.filterwarnings("ignore", category=UserWarning)
- warnings.filterwarnings("ignore", category=FutureWarning)
- logging.getLogger("httpx").setLevel(logging.CRITICAL)
- logging.getLogger("datasets").setLevel(logging.CRITICAL)
- logging.getLogger("vllm").setLevel(logging.CRITICAL)
- os.environ["VLLM_CONFIGURE_LOGGING"] = "0"
+ if os.getenv("FULL_LOG") != "1":
+     warnings.filterwarnings("ignore", category=UserWarning)
+     warnings.filterwarnings("ignore", category=FutureWarning)
+     logging.getLogger("httpx").setLevel(logging.CRITICAL)
+     logging.getLogger("datasets").setLevel(logging.CRITICAL)
+     logging.getLogger("vllm").setLevel(logging.CRITICAL)
+     os.environ["VLLM_CONFIGURE_LOGGING"] = "0"

  # Set up logging
  fmt = colored("%(asctime)s", "light_blue") + " ⋅ " + colored("%(message)s", "green")
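Judging from the change above, setting `FULL_LOG=1` in the environment (before the package is imported, e.g. `FULL_LOG=1 euroeval --model <model-id> ...` on the command line) keeps the warnings and third-party log output that EuroEval otherwise silences by default.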
@@ -47,6 +47,7 @@ def build_benchmark_config(
  debug: bool,
  run_with_cli: bool,
  requires_safetensors: bool,
+ download_only: bool,
  ) -> BenchmarkConfig:
  """Create a benchmark configuration.

@@ -117,6 +118,8 @@
  Whether the benchmark is being run with the CLI.
  requires_safetensors:
  Whether to only allow evaluations of models stored as safetensors.
+ download_only:
+ Whether to only download the requested model weights and datasets.

  Returns:
  The benchmark configuration.
@@ -165,6 +168,7 @@
  debug=debug,
  run_with_cli=run_with_cli,
  requires_safetensors=requires_safetensors,
+ download_only=download_only,
  )

@@ -146,21 +146,25 @@ class HuggingFaceEncoderModel(BenchmarkModule):
  Returns:
  The number of parameters in the model.
  """
- token = get_hf_token(api_key=self.benchmark_config.api_key)
- hf_api = HfApi(token=token)
- try:
- repo_info = hf_api.model_info(
- repo_id=self.model_config.adapter_base_model_id
- or self.model_config.model_id,
- revision=self.model_config.revision,
- )
- except (
- RepositoryNotFoundError,
- RevisionNotFoundError,
- RequestException,
- HFValidationError,
- ):
+ # No need to try to use the API if we have no internet.
+ if not internet_connection_available():
  repo_info = None
+ else:
+ token = get_hf_token(api_key=self.benchmark_config.api_key)
+ hf_api = HfApi(token=token)
+ try:
+ repo_info = hf_api.model_info(
+ repo_id=self.model_config.adapter_base_model_id
+ or self.model_config.model_id,
+ revision=self.model_config.revision,
+ )
+ except (
+ RepositoryNotFoundError,
+ RevisionNotFoundError,
+ RequestException,
+ HFValidationError,
+ ):
+ repo_info = None

  if (
  repo_info is not None
@@ -558,7 +562,7 @@ def load_model_and_tokeniser(
  The benchmark configuration

  Returns:
- The loaded model and tokeniser.
+ A pair (model, tokeniser), with the loaded model and tokeniser
  """
  config: "PretrainedConfig"
  block_terminal_output()
@@ -686,6 +690,7 @@
  model=model,
  model_id=model_id,
  trust_remote_code=benchmark_config.trust_remote_code,
+ model_cache_dir=model_config.model_cache_dir,
  )

  return model, tokeniser
@@ -722,6 +727,11 @@ def get_model_repo_info(
  ):
  model_info = HfApiModelInfo(id=model_id, tags=None, pipeline_tag=None)

+ # If we have not internet, and the model_id is not a directory for a local model
+ # we also just create a dummy model info object.
+ elif not internet_connection_available():
+ model_info = HfApiModelInfo(id=model_id, tags=None, pipeline_tag=None)
+
  # If the model does not exist locally, then we get the model info from the Hugging
  # Face Hub, if possible
  if model_info is None:
@@ -867,7 +877,10 @@


  def load_tokeniser(
- model: "PreTrainedModel | None", model_id: str, trust_remote_code: bool
+ model: "PreTrainedModel | None",
+ model_id: str,
+ trust_remote_code: bool,
+ model_cache_dir: str,
  ) -> "PreTrainedTokenizer":
  """Load the tokeniser.

@@ -889,6 +902,7 @@
  trust_remote_code=trust_remote_code,
  padding_side="right",
  truncation_side="right",
+ cache_dir=model_cache_dir,
  )

  # If the model is a subclass of a certain model types then we have to add a prefix
@@ -999,6 +1013,7 @@ def load_hf_model_config(
  token=get_hf_token(api_key=api_key),
  trust_remote_code=trust_remote_code,
  cache_dir=model_cache_dir,
+ local_files_only=not internet_connection_available(),
  )
  if config.eos_token_id is not None and config.pad_token_id is None:
  if isinstance(config.eos_token_id, list):
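Several of the changes above route around the Hugging Face Hub via a new `internet_connection_available()` helper from `euroeval.utils`; the `utils.py` hunk itself is not included in this diff, so the following is only an illustrative sketch of what such a connectivity check typically looks like, not the actual implementation:

```python
# Illustrative sketch only: the real euroeval.utils.internet_connection_available
# is not shown in this diff, so its actual behaviour may differ.
import socket


def internet_connection_available(
    host: str = "huggingface.co", port: int = 443, timeout: float = 3.0
) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```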
@@ -984,6 +984,7 @@ class LiteLLMModel(BenchmarkModule):
  model=None,
  model_id=model_id,
  trust_remote_code=self.benchmark_config.trust_remote_code,
+ model_cache_dir=self.model_config.model_cache_dir,
  )

  if (
@@ -1066,6 +1067,7 @@
  model=None,
  model_id=model_id,
  trust_remote_code=self.benchmark_config.trust_remote_code,
+ model_cache_dir=self.model_config.model_cache_dir,
  )

  all_max_lengths: list[int] = list()
@@ -72,7 +72,9 @@ from ..utils import (
  create_model_cache_dir,
  get_hf_token,
  get_min_cuda_compute_capability,
+ internet_connection_available,
  log_once,
+ resolve_model_path,
  split_model_id,
  )
  from .hf import HuggingFaceEncoderModel, get_model_repo_info, load_hf_model_config
@@ -146,7 +148,7 @@ class VLLMModel(HuggingFaceEncoderModel):
  )

  self.end_of_reasoning_token = get_end_of_reasoning_token(
- model=self._model, tokeniser=self._tokeniser, model_id=model_config.model_id
+ model=self._model, tokeniser=self._tokeniser, model_config=model_config
  )
  self.end_of_chat_token_ids = get_end_of_chat_token_ids(
  tokeniser=self._tokeniser, generative_type=self.generative_type
@@ -834,10 +836,15 @@ def load_model_and_tokeniser(

  clear_vllm()

+ # if we do not have an internet connection we need to give the path to the folder
+ # that contains the model weights and config files, otherwise vLLM will try to
+ # download them regardless if they are already present in the download_dir
+ model_path = resolve_model_path(download_dir)
+
  try:
  model = LLM(
- model=model_id,
- tokenizer=model_id,
+ model=model_id if internet_connection_available() else model_path,
+ tokenizer=model_id if internet_connection_available() else model_path,
  gpu_memory_utilization=benchmark_config.gpu_memory_utilization,
  max_model_len=min(true_max_model_len, MAX_CONTEXT_LENGTH),
  download_dir=download_dir,
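The comment in the hunk above explains the role of `resolve_model_path(download_dir)`: without an internet connection, vLLM has to be pointed at the local folder holding the weights and config files rather than the Hub model ID. The helper's body is not part of this diff; a plausible sketch (an assumption, not the real code) would search the download directory for a cached snapshot:

```python
# Illustrative sketch only: euroeval.utils.resolve_model_path is not shown in this
# diff. The idea is to return a filesystem path that vLLM can load offline, i.e. a
# cached folder that contains the model's config.json.
from pathlib import Path


def resolve_model_path(download_dir: str) -> str:
    """Return the first folder under download_dir that contains a config.json."""
    for config_file in sorted(Path(download_dir).rglob("config.json")):
        return str(config_file.parent)
    raise FileNotFoundError(f"No cached model found under {download_dir!r}")
```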
@@ -925,6 +932,7 @@ def load_tokeniser(
  cache_dir=model_cache_dir,
  token=token,
  trust_remote_code=trust_remote_code,
+ local_files_only=not internet_connection_available(),
  )
  num_retries = 5
  for _ in range(num_retries):
@@ -937,8 +945,10 @@
  padding_side="left",
  truncation_side="left",
  model_max_length=model_max_length,
+ cache_dir=model_cache_dir,
  config=config,
  token=token,
+ local_files_only=not internet_connection_available(),
  )
  break
  except (json.JSONDecodeError, OSError, TypeError) as e:
@@ -996,7 +1006,7 @@ def clear_vllm() -> None:


  def get_end_of_reasoning_token(
- model: "LLM", tokeniser: "PreTrainedTokenizer", model_id: str
+ model: "LLM", tokeniser: "PreTrainedTokenizer", model_config: "ModelConfig"
  ) -> str | None:
  """Get the end-of-reasoning token for a generative model.

@@ -1005,21 +1015,26 @@
  The vLLM model.
  tokeniser:
  The tokeniser.
- model_id:
- The model ID.
+ model_config:
+ The model configuration.

  Returns:
  The end of reasoning token, or None if it could not be found.
  """
+ model_id = model_config.model_id
+
  # Create a prompt to check if the model uses the reasoning tokens
  prompt = "What is your name?"
  if has_chat_template(tokeniser=tokeniser):
+ extra_kwargs = dict()
+ if model_config.param in {"thinking", "no-thinking"}:
+ extra_kwargs["enable_thinking"] = model_config.param == "thinking"
  templated_prompt = apply_chat_template(
  conversation=[dict(role="user", content=prompt)],
  tokeniser=tokeniser,
  tokenise=False,
  add_generation_prompt=True,
- enable_thinking=True,
+ **extra_kwargs,
  )
  assert isinstance(templated_prompt, str)
  prompt = templated_prompt
@@ -1042,8 +1057,8 @@
  if not bor_reasoning_matches:
  log_once(
  f"The model {model_id!r} did not generate any beginning-of-reasoning "
- "tokens in the prompt or the completion. Assuming the model is not "
- "a reasoning model.",
+ "tokens in the prompt or the completion. Assuming the model is not a "
+ "reasoning model.",
  level=logging.DEBUG,
  )
  return None
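The last two hunks tie the reasoning default to `model_config.param`, which matches the changelog's `o3#low` example of a model ID carrying a `#`-suffixed parameter. `split_model_id` is imported from `euroeval.utils` in the vllm.py hunk above but its body is not shown in this diff; a hypothetical sketch of that kind of split:

```python
# Hypothetical sketch: the real euroeval.utils.split_model_id is not shown in this
# diff. The changelog's `o3#low` example suggests model IDs may carry an optional
# '#<param>' suffix, e.g. '<model-id>#thinking' to opt in to reasoning mode.
def split_model_id(model_id: str) -> tuple[str, str]:
    """Split 'model#param' into ('model', 'param'); param is '' when absent."""
    base_model_id, _, param = model_id.partition("#")
    return base_model_id, param


assert split_model_id("o3#low") == ("o3", "low")
assert split_model_id("o3") == ("o3", "")
```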