EuroEval 15.4.1.tar.gz → 15.4.2.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

Files changed (211)
  1. {euroeval-15.4.1 → euroeval-15.4.2}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +2 -0
  2. {euroeval-15.4.1 → euroeval-15.4.2}/.github/ISSUE_TEMPLATE/bug.yaml +17 -2
  3. {euroeval-15.4.1 → euroeval-15.4.2}/.github/ISSUE_TEMPLATE/feature_request.yaml +1 -11
  4. {euroeval-15.4.1 → euroeval-15.4.2}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +21 -10
  5. {euroeval-15.4.1 → euroeval-15.4.2}/.github/workflows/ci.yaml +0 -2
  6. {euroeval-15.4.1 → euroeval-15.4.2}/CHANGELOG.md +46 -1
  7. {euroeval-15.4.1 → euroeval-15.4.2}/PKG-INFO +4 -3
  8. {euroeval-15.4.1 → euroeval-15.4.2}/docs/datasets/danish.md +4 -5
  9. {euroeval-15.4.1 → euroeval-15.4.2}/docs/datasets/french.md +2 -2
  10. {euroeval-15.4.1 → euroeval-15.4.2}/docs/datasets/spanish.md +1 -1
  11. {euroeval-15.4.1 → euroeval-15.4.2}/docs/datasets/swedish.md +4 -5
  12. {euroeval-15.4.1 → euroeval-15.4.2}/pyproject.toml +4 -4
  13. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/benchmark_modules/hf.py +68 -37
  14. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/benchmark_modules/vllm.py +47 -8
  15. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/constants.py +3 -0
  16. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/data_models.py +7 -2
  17. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/dataset_configs.py +5 -5
  18. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/task_utils/sequence_classification.py +32 -27
  19. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/types.py +3 -3
  20. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/utils.py +32 -29
  21. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_mlsum_de.py +1 -1
  22. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_mlsum_es.py +1 -1
  23. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_constants.py +1 -1
  24. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_utils.py +0 -11
  25. {euroeval-15.4.1 → euroeval-15.4.2}/uv.lock +408 -372
  26. {euroeval-15.4.1 → euroeval-15.4.2}/.gitignore +0 -0
  27. {euroeval-15.4.1 → euroeval-15.4.2}/.pre-commit-config.yaml +0 -0
  28. {euroeval-15.4.1 → euroeval-15.4.2}/CITATION.cff +0 -0
  29. {euroeval-15.4.1 → euroeval-15.4.2}/CODE_OF_CONDUCT.md +0 -0
  30. {euroeval-15.4.1 → euroeval-15.4.2}/CONTRIBUTING.md +0 -0
  31. {euroeval-15.4.1 → euroeval-15.4.2}/Dockerfile.cuda +0 -0
  32. {euroeval-15.4.1 → euroeval-15.4.2}/LICENSE +0 -0
  33. {euroeval-15.4.1 → euroeval-15.4.2}/README.md +0 -0
  34. {euroeval-15.4.1 → euroeval-15.4.2}/docs/CNAME +0 -0
  35. {euroeval-15.4.1 → euroeval-15.4.2}/docs/README.md +0 -0
  36. {euroeval-15.4.1 → euroeval-15.4.2}/docs/datasets/README.md +0 -0
  37. {euroeval-15.4.1 → euroeval-15.4.2}/docs/datasets/dutch.md +0 -0
  38. {euroeval-15.4.1 → euroeval-15.4.2}/docs/datasets/english.md +0 -0
  39. {euroeval-15.4.1 → euroeval-15.4.2}/docs/datasets/faroese.md +0 -0
  40. {euroeval-15.4.1 → euroeval-15.4.2}/docs/datasets/german.md +0 -0
  41. {euroeval-15.4.1 → euroeval-15.4.2}/docs/datasets/icelandic.md +0 -0
  42. {euroeval-15.4.1 → euroeval-15.4.2}/docs/datasets/italian.md +0 -0
  43. {euroeval-15.4.1 → euroeval-15.4.2}/docs/datasets/norwegian.md +0 -0
  44. {euroeval-15.4.1 → euroeval-15.4.2}/docs/extras/radial_plotter.md +0 -0
  45. {euroeval-15.4.1 → euroeval-15.4.2}/docs/faq.md +0 -0
  46. {euroeval-15.4.1 → euroeval-15.4.2}/docs/gfx/favicon.png +0 -0
  47. {euroeval-15.4.1 → euroeval-15.4.2}/docs/leaderboards/Monolingual/danish.md +0 -0
  48. {euroeval-15.4.1 → euroeval-15.4.2}/docs/leaderboards/Monolingual/dutch.md +0 -0
  49. {euroeval-15.4.1 → euroeval-15.4.2}/docs/leaderboards/Monolingual/english.md +0 -0
  50. {euroeval-15.4.1 → euroeval-15.4.2}/docs/leaderboards/Monolingual/faroese.md +0 -0
  51. {euroeval-15.4.1 → euroeval-15.4.2}/docs/leaderboards/Monolingual/french.md +0 -0
  52. {euroeval-15.4.1 → euroeval-15.4.2}/docs/leaderboards/Monolingual/german.md +0 -0
  53. {euroeval-15.4.1 → euroeval-15.4.2}/docs/leaderboards/Monolingual/icelandic.md +0 -0
  54. {euroeval-15.4.1 → euroeval-15.4.2}/docs/leaderboards/Monolingual/italian.md +0 -0
  55. {euroeval-15.4.1 → euroeval-15.4.2}/docs/leaderboards/Monolingual/norwegian.md +0 -0
  56. {euroeval-15.4.1 → euroeval-15.4.2}/docs/leaderboards/Monolingual/swedish.md +0 -0
  57. {euroeval-15.4.1 → euroeval-15.4.2}/docs/leaderboards/Multilingual/european.md +0 -0
  58. {euroeval-15.4.1 → euroeval-15.4.2}/docs/leaderboards/Multilingual/germanic.md +0 -0
  59. {euroeval-15.4.1 → euroeval-15.4.2}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
  60. {euroeval-15.4.1 → euroeval-15.4.2}/docs/leaderboards/Multilingual/romance.md +0 -0
  61. {euroeval-15.4.1 → euroeval-15.4.2}/docs/leaderboards/README.md +0 -0
  62. {euroeval-15.4.1 → euroeval-15.4.2}/docs/methodology.md +0 -0
  63. {euroeval-15.4.1 → euroeval-15.4.2}/docs/python-package.md +0 -0
  64. {euroeval-15.4.1 → euroeval-15.4.2}/docs/tasks/README.md +0 -0
  65. {euroeval-15.4.1 → euroeval-15.4.2}/docs/tasks/common-sense-reasoning.md +0 -0
  66. {euroeval-15.4.1 → euroeval-15.4.2}/docs/tasks/knowledge.md +0 -0
  67. {euroeval-15.4.1 → euroeval-15.4.2}/docs/tasks/linguistic-acceptability.md +0 -0
  68. {euroeval-15.4.1 → euroeval-15.4.2}/docs/tasks/named-entity-recognition.md +0 -0
  69. {euroeval-15.4.1 → euroeval-15.4.2}/docs/tasks/reading-comprehension.md +0 -0
  70. {euroeval-15.4.1 → euroeval-15.4.2}/docs/tasks/sentiment-classification.md +0 -0
  71. {euroeval-15.4.1 → euroeval-15.4.2}/docs/tasks/speed.md +0 -0
  72. {euroeval-15.4.1 → euroeval-15.4.2}/docs/tasks/summarization.md +0 -0
  73. {euroeval-15.4.1 → euroeval-15.4.2}/gfx/euroeval.png +0 -0
  74. {euroeval-15.4.1 → euroeval-15.4.2}/gfx/euroeval.xcf +0 -0
  75. {euroeval-15.4.1 → euroeval-15.4.2}/gfx/scandeval.png +0 -0
  76. {euroeval-15.4.1 → euroeval-15.4.2}/makefile +0 -0
  77. {euroeval-15.4.1 → euroeval-15.4.2}/mkdocs.yaml +0 -0
  78. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/__init__.py +0 -0
  79. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/benchmark_config_factory.py +0 -0
  80. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/benchmark_modules/__init__.py +0 -0
  81. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/benchmark_modules/base.py +0 -0
  82. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/benchmark_modules/fresh.py +0 -0
  83. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/benchmark_modules/litellm.py +0 -0
  84. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/benchmarker.py +0 -0
  85. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/callbacks.py +0 -0
  86. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/cli.py +0 -0
  87. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/data_loading.py +0 -0
  88. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/enums.py +0 -0
  89. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/exceptions.py +0 -0
  90. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/finetuning.py +0 -0
  91. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/generation.py +0 -0
  92. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/human_evaluation.py +0 -0
  93. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/languages.py +0 -0
  94. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/model_cache.py +0 -0
  95. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/model_config.py +0 -0
  96. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/model_loading.py +0 -0
  97. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/scores.py +0 -0
  98. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/speed_benchmark.py +0 -0
  99. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/task_utils/__init__.py +0 -0
  100. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/task_utils/multiple_choice_classification.py +0 -0
  101. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/task_utils/question_answering.py +0 -0
  102. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/task_utils/text_to_text.py +0 -0
  103. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/task_utils/token_classification.py +0 -0
  104. {euroeval-15.4.1 → euroeval-15.4.2}/src/euroeval/tasks.py +0 -0
  105. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/constants.py +0 -0
  106. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_allocine.py +0 -0
  107. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_angry_tweets.py +0 -0
  108. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_arc.py +0 -0
  109. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_arc_is.py +0 -0
  110. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_belebele.py +0 -0
  111. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_cnn_dailymail.py +0 -0
  112. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_conll_en.py +0 -0
  113. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_conll_es.py +0 -0
  114. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_conll_nl.py +0 -0
  115. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_dane.py +0 -0
  116. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_danish_citizen_tests.py +0 -0
  117. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_dansk.py +0 -0
  118. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_danske_talemaader.py +0 -0
  119. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_danske_talemaader_old.py +0 -0
  120. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_dbrd.py +0 -0
  121. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_dutch_cola.py +0 -0
  122. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_dutch_social.py +0 -0
  123. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_eltec.py +0 -0
  124. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_fone.py +0 -0
  125. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_foqa.py +0 -0
  126. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_fosent.py +0 -0
  127. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_fquad.py +0 -0
  128. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_germanquad.py +0 -0
  129. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_germeval.py +0 -0
  130. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_hellaswag.py +0 -0
  131. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
  132. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_ice_linguistic.py +0 -0
  133. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_icelandic_error_corpus.py +0 -0
  134. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_icelandic_knowledge.py +0 -0
  135. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_icelandic_qa.py +0 -0
  136. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_icesum.py +0 -0
  137. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_ilpost_sum.py +0 -0
  138. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_jentoft.py +0 -0
  139. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_mim_gold_ner.py +0 -0
  140. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_mlqa_es.py +0 -0
  141. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_mmlu.py +0 -0
  142. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_multinerd-it.py +0 -0
  143. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_no_cola.py +0 -0
  144. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_no_sammendrag.py +0 -0
  145. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_nor_common_sense_qa.py +0 -0
  146. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_nordjylland_news.py +0 -0
  147. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_norec.py +0 -0
  148. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_norglm_multiqa.py +0 -0
  149. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_norglm_multisum.py +0 -0
  150. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_norne.py +0 -0
  151. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_norquad.py +0 -0
  152. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_nqii.py +0 -0
  153. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_nrk_quiz_qa.py +0 -0
  154. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_orange_sum.py +0 -0
  155. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_personal_sum.py +0 -0
  156. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_rrn.py +0 -0
  157. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_sb10k.py +0 -0
  158. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_scala.py +0 -0
  159. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_scandiqa.py +0 -0
  160. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_schibsted.py +0 -0
  161. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_sentiment_headlines_es.py +0 -0
  162. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_sentipolc16.py +0 -0
  163. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_squad.py +0 -0
  164. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_squad_it.py +0 -0
  165. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_squad_nl.py +0 -0
  166. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_squad_nl_old.py +0 -0
  167. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_sst5.py +0 -0
  168. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_suc3.py +0 -0
  169. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_swedn.py +0 -0
  170. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_swerec.py +0 -0
  171. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_wiki_lingua_nl.py +0 -0
  172. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_wikiann_fo.py +0 -0
  173. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_wikineural-it.py +0 -0
  174. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_winogrande_is.py +0 -0
  175. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/create_xquad_es.py +0 -0
  176. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/fix_dot_env_file.py +0 -0
  177. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/load_ud_pos.py +0 -0
  178. {euroeval-15.4.1 → euroeval-15.4.2}/src/scripts/versioning.py +0 -0
  179. {euroeval-15.4.1 → euroeval-15.4.2}/tests/__init__.py +0 -0
  180. {euroeval-15.4.1 → euroeval-15.4.2}/tests/conftest.py +0 -0
  181. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_benchmark_config_factory.py +0 -0
  182. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_benchmark_modules/__init__.py +0 -0
  183. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_benchmark_modules/test_base.py +0 -0
  184. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_benchmark_modules/test_fresh.py +0 -0
  185. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_benchmark_modules/test_hf.py +0 -0
  186. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_benchmark_modules/test_litellm.py +0 -0
  187. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_benchmark_modules/test_vllm.py +0 -0
  188. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_benchmarker.py +0 -0
  189. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_callbacks.py +0 -0
  190. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_cli.py +0 -0
  191. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_data_loading.py +0 -0
  192. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_data_models.py +0 -0
  193. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_dataset_configs.py +0 -0
  194. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_enums.py +0 -0
  195. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_exceptions.py +0 -0
  196. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_finetuning.py +0 -0
  197. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_generation.py +0 -0
  198. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_human_evaluation.py +0 -0
  199. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_languages.py +0 -0
  200. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_model_cache.py +0 -0
  201. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_model_config.py +0 -0
  202. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_model_loading.py +0 -0
  203. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_scores.py +0 -0
  204. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_speed_benchmark.py +0 -0
  205. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_task_utils/__init__.py +0 -0
  206. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_task_utils/test_question_answering.py +0 -0
  207. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_task_utils/test_sequence_classification.py +0 -0
  208. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_task_utils/test_text_to_text.py +0 -0
  209. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_task_utils/test_token_classification.py +0 -0
  210. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_tasks.py +0 -0
  211. {euroeval-15.4.1 → euroeval-15.4.2}/tests/test_types.py +0 -0

.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml
@@ -2,6 +2,7 @@ name: 📚 Benchmark Dataset Request
  description: Do you think a particular benchmark dataset is missing in EuroEval?
  title: "[BENCHMARK DATASET REQUEST] <dataset-name>"
  labels: "benchmark dataset request"
+ type: task

  body:
  - type: input
@@ -30,6 +31,7 @@ body:
  - label: Icelandic
  - label: Italian
  - label: Norwegian (Bokmål or Nynorsk)
+ - label: Spanish
  - label: Swedish
  validations:
  required: true

.github/ISSUE_TEMPLATE/bug.yaml
@@ -1,7 +1,7 @@
  name: 🐛 Bug Report
  description: Have you experienced a bug using the `euroeval` package?
  title: "[BUG] <name-of-bug>"
- labels: bug
+ type: bug

  body:
  - type: markdown
@@ -46,8 +46,9 @@ body:
  - 3.10.x
  - 3.11.x
  - 3.12.x
+ - 3.13.x
  - Older than 3.10.x
- - Newer than 3.12.x
+ - Newer than 3.13.x
  validations:
  required: true
  - type: input
@@ -57,6 +58,20 @@ body:
  placeholder: Output of `pip list | grep EuroEval`
  validations:
  required: true
+ - type: input
+ attributes:
+ label: Transformers version
+ description: What version of 🤗 transformers are you using?
+ placeholder: Output of `pip list | grep transformers`
+ validations:
+ required: true
+ - type: input
+ attributes:
+ label: vLLM version
+ description: What version of vLLM are you using?
+ placeholder: Output of `pip list | grep vllm`
+ validations:
+ required: true
  - type: markdown
  attributes:
  value: >

.github/ISSUE_TEMPLATE/feature_request.yaml
@@ -1,7 +1,7 @@
  name: 🚀 Feature Request
  description: Is the EuroEval benchmark missing a feature?
  title: "[FEATURE REQUEST] <name-of-feature>"
- labels: enhancement
+ type: feature

  body:
  - type: textarea
@@ -11,16 +11,6 @@ body:
  A clear and concise description of the feature proposal. Please outline the motivation for the proposal. Is your feature request related to a specific problem? e.g., *"I'm working on X and would like Y to be possible"*.
  validations:
  required: true
- - type: textarea
- attributes:
- label: Alternatives
- description: >
- A description of any alternative solutions or features you've considered, if any.
- - type: textarea
- attributes:
- label: Additional context
- description: >
- Add any other context or screenshots about the feature request.
  - type: markdown
  attributes:
  value: >

.github/ISSUE_TEMPLATE/model_evaluation_request.yaml
@@ -2,6 +2,7 @@ name: 📊 Model Evaluation Request
  description: Would you like to have a particular model included in the leaderboards?
  title: "[MODEL EVALUATION REQUEST] <model-name>"
  labels: "model evaluation request"
+ type: task

  body:
  - type: input
@@ -10,16 +11,6 @@ body:
  description: What is the Hugging Face model ID?
  validations:
  required: true
- - type: dropdown
- attributes:
- label: Model type
- description: What is the architecture of the model?
- options:
- - Decoder model (e.g., GPT)
- - Encoder model (e.g., BERT)
- - Sequence-to-sequence model (e.g., T5)
- validations:
- required: true
  - type: checkboxes
  attributes:
  label: Evaluation languages
@@ -36,9 +27,29 @@ body:
  - label: Icelandic
  - label: Italian
  - label: Norwegian (Bokmål or Nynorsk)
+ - label: Spanish
  - label: Swedish
  validations:
  required: true
+ - type: dropdown
+ attributes:
+ label: Model type
+ description: What is the architecture of the model?
+ options:
+ - Decoder model (e.g., GPT)
+ - Encoder model (e.g., BERT)
+ - Sequence-to-sequence model (e.g., T5)
+ validations:
+ required: true
+ - type: dropdown
+ attributes:
+ label: Model size
+ description: What is the size of the model?
+ options:
+ - Small (<=8B parameters)
+ - Large (>8B parameters)
+ validations:
+ required: true
  - type: dropdown
  attributes:
  label: Merged model

.github/workflows/ci.yaml
@@ -43,7 +43,6 @@ jobs:
  - name: Install uv and set up Python
  uses: astral-sh/setup-uv@v4
  with:
- enable-cache: true
  python-version: ${{ matrix.python-version }}

  - name: Install Dependencies
@@ -75,7 +74,6 @@ jobs:
  - name: Install uv and set up Python
  uses: astral-sh/setup-uv@v4
  with:
- enable-cache: true
  python-version: ${{ matrix.python-version }}

  - name: Install Dependencies

CHANGELOG.md
@@ -10,6 +10,51 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.



+ ## [v15.4.2] - 2025-03-31
+ ### Added
+ - Now added version metadata to results, to more easily track which versions of the various
+ dependencies were used when evaluating a model. This currently includes
+ `transformers`, `torch`, `vllm` and `outlines`.
+
+ ### Changed
+ - Changed the name of the German 'mlsum' summarisation dataset to 'mlsum-de', to reflect
+ that it is the German version of the dataset, and to avoid confusion with the Spanish
+ 'mlsum-es' dataset.
+
+ ### Fixed
+ - Now uses `fp16` instead of `bf16` when evaluating decoder models on GPUs with CUDA
+ compatibility < 8.0. This was contributed by [@marksverdhei](https://github.com/marksverdhei) ✨
+ - Corrected the name of the French sentiment dataset AlloCiné. This was contributed by
+ [@Alkarex](https://github.com/Alkarex) ✨
+ - Evaluating a specific model revision did not work for adapter models, as there was a
+ confusion between the revision of the adapter and the revision of the base model. We
+ now use the revision for the adapter and use the latest revision for the base model.
+ - In the (very unlikely) scenario that the model's tokeniser has the same first token
+ for two different labels in a text classification task, we now also use the second
+ token to ensure that we determine the correct label. If this is not possible, then we
+ warn the user.
+ - Now catches `TypeError` when trying to generate with vLLM, and retries 3 times before
+ giving up on evaluating the dataset.
+ - A bug in `transformers` caused models with the `image-text-to-text` pipeline tag to
+ not be detected as generative models. This has been patched now, and will be fixed
+ properly when [this transformers
+ PR](https://github.com/huggingface/transformers/pull/37107) has been merged.
+ - Force `vllm` v0.8.0 for now, as the severe degradation in generation output of some
+ models has not been resolved in versions v0.8.2 and v0.8.3.
+ - Only accepts the local labels for text classification tasks when evaluating decoder
+ models now, where we before accepted both the local and English labels. The reason is
+ that this caused confusion at times when there was a unique local label starting
+ with a particular letter, but a different English label starting with the same letter,
+ causing some models to be evaluated on the wrong label.
+ - When fetching the model information from the Hugging Face API we now attempt 3 times,
+ as the API sometimes fails. If it still fails after 3 attempts, we raise the
+ `HuggingFaceHubDown` exception.
+ - Now uses `fp16` instead of `bf16` when evaluating decoder models on GPUs with CUDA
+ compatibility < 8.0. This was contributed by [@marksverdhei](https://github.com/marksverdhei) ✨
+ - Fixed docs for ScandiQA-da and ScandiQA-sv, where it was incorrectly stated that
+ the splits were made by considering the original train/validation/test splits.
+
+
  ## [v15.4.1] - 2025-03-25
  ### Fixed
  - Disallow `vllm` v0.8.1, as it causes severe degradation in generation output of
@@ -211,7 +256,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.

  ### Added
  - Added support for French! 🇫🇷This includes the sentiment classification dataset
- [Allocine](https://hf.co/datasets/tblard/allocine), the linguistic acceptability
+ [AlloCiné](https://hf.co/datasets/tblard/allocine), the linguistic acceptability
  dataset ScaLA with the [French Universal
  Dependencies](https://github.com/UniversalDependencies/UD_French-GSD), the reading
  comprehension dataset [FQuAD](https://hf.co/datasets/illuin/fquad) (and unofficially
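
The "version metadata" entry in the changelog above is backed by a new `get_package_version` helper in `euroeval.utils`, which is used in the `src/euroeval/data_models.py` hunk near the end of this diff but is not itself shown here. A minimal sketch, assuming the helper simply wraps `importlib.metadata` and returns `None` for packages that are not installed:

```python
import importlib.metadata


def get_package_version(package_name: str) -> str | None:
    """Return the installed version of `package_name`, or None if it is not installed."""
    try:
        return importlib.metadata.version(package_name)
    except importlib.metadata.PackageNotFoundError:
        return None


# Values like this end up as e.g. `vllm_version` in each benchmark result.
print(get_package_version("vllm"))  # e.g. "0.8.0" with this release's pin, or None if absent
```

Recording these versions alongside `euroeval_version` makes it possible to reconstruct the evaluation environment for a given result later on.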

PKG-INFO
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: EuroEval
- Version: 15.4.1
+ Version: 15.4.2
  Summary: The robust European language model benchmark.
  Project-URL: Repository, https://github.com/EuroEval/EuroEval
  Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -42,6 +42,7 @@ Requires-Dist: more-itertools>=10.5.0
  Requires-Dist: numpy<2.0.0,>=1.23.0
  Requires-Dist: ollama>=0.4.7
  Requires-Dist: pandas>=2.2.0
+ Requires-Dist: peft>=0.15.0
  Requires-Dist: protobuf~=3.20.0
  Requires-Dist: pydantic>=2.6.0
  Requires-Dist: pyinfer>=0.0.3
@@ -61,12 +62,12 @@ Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == '
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
  Requires-Dist: gradio>=4.26.0; extra == 'all'
  Requires-Dist: outlines>=0.1.11; extra == 'all'
- Requires-Dist: vllm!=0.8.1,>=0.8.0; (platform_system == 'Linux') and extra == 'all'
+ Requires-Dist: vllm==0.8.0; (platform_system == 'Linux') and extra == 'all'
  Provides-Extra: generative
  Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
  Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
  Requires-Dist: outlines>=0.1.11; extra == 'generative'
- Requires-Dist: vllm!=0.8.1,>=0.8.0; (platform_system == 'Linux') and extra == 'generative'
+ Requires-Dist: vllm==0.8.0; (platform_system == 'Linux') and extra == 'generative'
  Provides-Extra: human-evaluation
  Requires-Dist: gradio>=4.26.0; extra == 'human-evaluation'
  Provides-Extra: test

docs/datasets/danish.md
@@ -285,11 +285,10 @@ the translated contexts still contained the answer to the question, potentially
  changing the answers slightly.

  The original full dataset consists of 6,810 / 500 / 500 samples for training,
- validation and testing, respectively. We use a 1,024 / 256 / 2,048 split for training,
- validation and testing, respectively (so 3,328 samples used in total). All validation
- samples in our version also belong to the original validation set, and all original test
- samples are included in our test set. The remaining 1,548 test samples in our version
- was sampled from the original training set.
+ validation and testing, respectively (so 3,328 samples used in total).
+ We use a 1,024 / 256 / 2,048 split for training, validation and testing, respectively,
+ where the splits are made by randomly sampling from the full dataset without considering
+ the original train/validation/test splits.

  Here are a few examples from the training split:


docs/datasets/french.md
@@ -7,11 +7,11 @@ information about what these constitute.

  ## Sentiment Classification

- ### Allocine
+ ### AlloCiné

  This dataset was published in [this Github
  repository](https://github.com/TheophileBlard/french-sentiment-analysis-with-bert) and
- features reviews from the French movie review website Allocine. The reviews range from
+ features reviews from the French movie review website [AlloCiné](https://www.allocine.fr/). The reviews range from
  0.5 to 5 (inclusive), with steps of 0.5. The negative samples are reviews with a rating
  of at most 2, and the positive ones are reviews with a rating of at least 4. The reviews
  in between were discarded.

docs/datasets/spanish.md
@@ -475,7 +475,7 @@ $ euroeval --model <model-id> --dataset hellaswag-es

  ## Summarization

- ### MLSum-es-mini
+ ### MLSum-es

  The dataset was published in [this paper](https://aclanthology.org/2020.emnlp-main.647/) and is obtained from online newspapers.


docs/datasets/swedish.md
@@ -231,11 +231,10 @@ the translated contexts still contained the answer to the question, potentially
  changing the answers slightly.

  The original full dataset consists of 6,810 / 500 / 500 samples for training,
- validation and testing, respectively. We use a 1,024 / 256 / 2,048 split for training,
- validation and testing, respectively (so 3,328 samples used in total). All validation
- samples in our version also belong to the original validation set, and all original test
- samples are included in our test set. The remaining 1,548 test samples in our version
- was sampled from the original training set.
+ validation and testing, respectively (so 3,328 samples used in total).
+ We use a 1,024 / 256 / 2,048 split for training, validation and testing, respectively,
+ where the splits are made by randomly sampling from the full dataset without considering
+ the original train/validation/test splits.

  Here are a few examples from the training split:


pyproject.toml
@@ -1,6 +1,6 @@
  [project]
  name = "EuroEval"
- version = "15.4.1"
+ version = "15.4.2"
  description = "The robust European language model benchmark."
  readme = "README.md"
  authors = [
@@ -39,13 +39,14 @@ dependencies = [
  "setuptools>=75.8.2",
  "demjson3>=3.0.6",
  "ollama>=0.4.7",
+ "peft>=0.15.0",
  ]

  [project.optional-dependencies]
  generative = [
  "outlines>=0.1.11",
  "bitsandbytes>=0.43.1; platform_system == 'Linux'",
- "vllm>=0.8.0,!=0.8.1; platform_system == 'Linux'",
+ "vllm==0.8.0; platform_system == 'Linux'",
  "fbgemm-gpu>=1.0.0; platform_system == 'Linux'",
  ]
  human_evaluation = [
@@ -54,7 +55,7 @@ human_evaluation = [
  all = [
  "outlines>=0.1.11",
  "bitsandbytes>=0.43.1; platform_system == 'Linux'",
- "vllm>=0.8.0,!=0.8.1; platform_system == 'Linux'",
+ "vllm==0.8.0; platform_system == 'Linux'",
  "fbgemm-gpu>=1.0.0; platform_system == 'Linux'",
  "gradio>=4.26.0",
  ]
@@ -86,7 +87,6 @@ dev-dependencies = [
  "nbstripout>=0.7.1",
  "coverage>=5.5",
  "lxml>=5.1.0",
- "peft>=0.13.2",
  "mkdocs-material>=9.5.45",
  "mkdocs-include-markdown-plugin>=7.0.1",
  "mkdocs-include-dir-to-nav>=1.2.0",

src/euroeval/benchmark_modules/hf.py
@@ -20,6 +20,7 @@ from huggingface_hub.utils import (
  HFValidationError,
  LocalTokenNotFoundError,
  )
+ from peft import PeftConfig
  from requests.exceptions import RequestException
  from torch import nn
  from transformers import (
@@ -34,6 +35,9 @@ from transformers import (
  Trainer,
  )
  from transformers.modelcard import TASK_MAPPING
+ from transformers.models.auto.modeling_auto import (
+ MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES,
+ )
  from urllib3.exceptions import RequestError

  from ..constants import (
@@ -73,6 +77,7 @@ from ..utils import (
  get_class_by_name,
  get_eos_token,
  internet_connection_available,
+ log_once,
  )
  from .base import BenchmarkModule

@@ -727,53 +732,54 @@ def get_model_repo_info(
  # If the model does not exist locally, then we get the model info from the Hugging
  # Face Hub
  if model_info is None:
- try:
- model_info = hf_api.model_info(
- repo_id=model_id, revision=revision, token=token
- )
- except (GatedRepoError, LocalTokenNotFoundError) as e:
+ num_attempts = 3
+ for _ in range(num_attempts):
  try:
- hf_whoami(token=token)
- logger.warning(
- f"Could not access the model {model_id} with the revision "
- f"{revision}. The error was {str(e)!r}."
+ model_info = hf_api.model_info(
+ repo_id=model_id, revision=revision, token=token
  )
+ break
+ except (GatedRepoError, LocalTokenNotFoundError) as e:
+ try:
+ hf_whoami(token=token)
+ logger.warning(
+ f"Could not access the model {model_id} with the revision "
+ f"{revision}. The error was {str(e)!r}."
+ )
+ return None
+ except LocalTokenNotFoundError:
+ raise NeedsAdditionalArgument(
+ cli_argument="--api-key",
+ script_argument="api_key=<your-api-key>",
+ run_with_cli=benchmark_config.run_with_cli,
+ )
+ except (RepositoryNotFoundError, HFValidationError):
  return None
- except LocalTokenNotFoundError:
- raise NeedsAdditionalArgument(
- cli_argument="--api-key",
- script_argument="api_key=<your-api-key>",
- run_with_cli=benchmark_config.run_with_cli,
- )
- except (RepositoryNotFoundError, HFValidationError):
- return None
- except (OSError, RequestException):
- if internet_connection_available():
- raise HuggingFaceHubDown()
- else:
+ except (OSError, RequestException):
+ if internet_connection_available():
+ continue
  raise NoInternetConnection()
+ else:
+ raise HuggingFaceHubDown()

  # Get all the Hugging Face repository tags for the model. If the model is an adapter
  # model, then we also get the tags for the base model
  tags = model_info.tags or list()
- has_base_model_tag = any(
- tag.startswith("base_model:") and tag.count(":") == 1 for tag in tags
- )
  base_model_id: str | None = None
- if has_base_model_tag:
- has_adapter_config = model_info.siblings is not None and any(
- sibling.rfilename == "adapter_config.json"
- for sibling in model_info.siblings
+ has_adapter_config = model_info.siblings is not None and any(
+ sibling.rfilename == "adapter_config.json" for sibling in model_info.siblings
+ )
+ if has_adapter_config:
+ adapter_config = PeftConfig.from_pretrained(model_id, revision=revision)
+ base_model_id = adapter_config.base_model_name_or_path
+ log_once(
+ f"Model {model_id!r} identified as an adapter model, with base model "
+ f"{base_model_id!r}.",
+ level=logging.DEBUG,
  )
- if has_adapter_config:
- base_model_id = [
- tag.split(":")[1]
- for tag in tags
- if tag.startswith("base_model:") and tag.count(":") == 1
- ][0]
+ if base_model_id is not None:
  base_model_info = hf_api.model_info(
  repo_id=base_model_id,
- revision=revision,
  token=benchmark_config.api_key
  or os.getenv("HUGGINGFACE_API_KEY")
  or True,
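
The rewritten `get_model_repo_info` above retries the Hub call up to three times using Python's `for`/`else` idiom: the `else` branch of a `for` loop runs only when the loop finishes without hitting `break`, i.e. when every attempt failed, which is where `HuggingFaceHubDown` is raised. A minimal self-contained sketch of the pattern, where `flaky_call` is a hypothetical stand-in for `hf_api.model_info`:

```python
import random


def flaky_call() -> int:
    """Hypothetical stand-in for hf_api.model_info: fails randomly to simulate API errors."""
    if random.random() < 0.5:
        raise OSError("transient failure")
    return 42


num_attempts = 3
for _ in range(num_attempts):
    try:
        result = flaky_call()
        break  # success: the loop's else branch is skipped
    except OSError:
        continue  # transient error: try the next attempt
else:
    # Reached only when no attempt succeeded (no break), mirroring the
    # HuggingFaceHubDown branch in the hunk above.
    raise RuntimeError(f"Giving up after {num_attempts} attempts")

print(result)
```

The real code additionally distinguishes a missing internet connection (raising `NoInternetConnection` immediately) from transient Hub errors (which are retried).
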
@@ -781,12 +787,18 @@
  tags += base_model_info.tags or list()
  tags = list(set(tags))

+ # TEMP: This extends the `TASK_MAPPING` dictionary to include the missing
+ # 'image-text-to-text' pipeline tag. This will be added as part of `TASK_MAPPING`
+ # when this PR has been merged in and published:
+ # https://github.com/huggingface/transformers/pull/37107
+ TASK_MAPPING["image-text-to-text"] = MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES
+
  # Get the pipeline tag for the model. If it is not specified, then we determine it
  # by checking the model's architecture as written in the model's Hugging Face config
  pipeline_tag = model_info.pipeline_tag
  if pipeline_tag is None:
  hf_config = load_hf_model_config(
- model_id=model_id,
+ model_id=base_model_id or model_id,
  num_labels=0,
  id2label=dict(),
  label2id=dict(),
@@ -812,7 +824,6 @@
  pipeline_tag = "fill-mask"

  if benchmark_config.only_allow_safetensors:
- # Check if any file ends with .safetensors
  repo_files = hf_api.list_repo_files(repo_id=model_id, revision=revision)
  has_safetensors = any(f.endswith(".safetensors") for f in repo_files)
  if not has_safetensors:
@@ -826,6 +837,26 @@
  )
  raise InvalidModel(msg)

+ # Also check base model if we are evaluating an adapter
+ if base_model_id is not None:
+ base_repo_files = hf_api.list_repo_files(repo_id=base_model_id)
+ base_has_safetensors = any(
+ f.endswith(".safetensors") for f in base_repo_files
+ )
+ if not base_has_safetensors:
+ msg = (
+ f"Base model {base_model_id} does not have safetensors weights "
+ "available."
+ )
+ if benchmark_config.run_with_cli:
+ msg += " Skipping since the `--only-allow-safetensors` flag is set."
+ else:
+ msg += (
+ " Skipping since the `only_allow_safetensors` argument is set "
+ "to `True`."
+ )
+ raise InvalidModel(msg)
+
  return HFModelInfo(
  pipeline_tag=pipeline_tag, tags=tags, adapter_base_model_id=base_model_id
  )

src/euroeval/benchmark_modules/vllm.py
@@ -30,6 +30,7 @@ from ..constants import (
  REASONING_MAX_TOKENS,
  TASK_GROUPS_USING_LOGPROBS,
  TASKS_USING_JSON,
+ VLLM_BF16_MIN_CUDA_COMPUTE_CAPABILITY,
  )
  from ..data_models import (
  BenchmarkConfig,
@@ -65,6 +66,7 @@ from ..utils import (
  get_bos_token,
  get_end_of_chat_token_ids,
  get_eos_token,
+ get_min_cuda_compute_capability,
  log_once,
  should_prompts_be_stripped,
  )
@@ -145,6 +147,7 @@ class VLLMModel(HuggingFaceEncoderModel):
  if self.model_config.adapter_base_model_id is not None:
  adapter_path = snapshot_download(
  repo_id=self.model_config.model_id,
+ revision=self.model_config.revision,
  cache_dir=Path(self.model_config.model_cache_dir),
  )
  self.buffer["lora_request"] = LoRARequest(
@@ -373,12 +376,27 @@

  # Generate sequences using vLLM
  input_is_a_test = len(prompts) == 1 and len(set(prompts[0])) == 1
- raw_outputs = self._model.generate(
- prompts=prompts,
- sampling_params=sampling_params,
- use_tqdm=(not input_is_a_test),
- lora_request=self.buffer.get("lora_request"),
- )
+ num_attempts = 3
+ for _ in range(num_attempts):
+ try:
+ raw_outputs = self._model.generate(
+ prompts=prompts,
+ sampling_params=sampling_params,
+ use_tqdm=(not input_is_a_test),
+ lora_request=self.buffer.get("lora_request"),
+ )
+ break
+ except TypeError as e:
+ logger.debug(
+ f"Encountered error during vLLM generation: {str(e)}. Retrying..."
+ )
+ sleep(1)
+ else:
+ raise InvalidBenchmark(
+ f"Could not generate sequences after {num_attempts} attempts."
+ )
+
+ # Parse the raw model outputs
  completion_ids: list[list[int]] = [
  output.outputs[0].token_ids for output in raw_outputs
  ]
@@ -846,13 +864,16 @@ def load_model_and_tokenizer(
  # Prefer base model ID if the model is an adapter - the adapter will be added on
  # during inference in this case
  model_id = model_config.adapter_base_model_id or model_config.model_id
+ revision = (
+ model_config.revision if model_config.adapter_base_model_id is None else "main"
+ )

  hf_model_config = load_hf_model_config(
  model_id=model_id,
  num_labels=0,
  id2label=dict(),
  label2id=dict(),
- revision=model_config.revision,
+ revision=revision,
  model_cache_dir=model_config.model_cache_dir,
  api_key=benchmark_config.api_key,
  trust_remote_code=benchmark_config.trust_remote_code,
@@ -881,6 +902,23 @@
  )
  dtype = torch.float16

+ if hf_model_config.torch_dtype == torch.bfloat16:
+ min_cuda_compute_capability = get_min_cuda_compute_capability()
+ required_capability = VLLM_BF16_MIN_CUDA_COMPUTE_CAPABILITY
+
+ if min_cuda_compute_capability is not None:
+ if min_cuda_compute_capability < required_capability:
+ logger.info(
+ "You are loading a model with "
+ f"dtype {hf_model_config.torch_dtype}, "
+ "which vLLM only supports for CUDA devices with"
+ f"CUDA compute capability >={required_capability}. "
+ "You are using one or more devices with "
+ f"compute capability {min_cuda_compute_capability}. "
+ "Setting dtype to float16 instead."
+ )
+ dtype = torch.float16
+
  if model_config.adapter_base_model_id is not None:
  download_dir = str(Path(model_config.model_cache_dir) / "base_model")
  else:
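
The new `bfloat16` fallback above relies on `get_min_cuda_compute_capability`, which is imported from `..utils` but not part of this diff. A plausible sketch, assuming it returns the lowest compute capability across all visible CUDA devices and `None` when CUDA is unavailable:

```python
import torch


def get_min_cuda_compute_capability() -> float | None:
    """Return the lowest CUDA compute capability among visible GPUs, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    capabilities = [
        # get_device_capability returns e.g. (8, 0) for an A100 or (7, 5) for a T4
        float("{}.{}".format(*torch.cuda.get_device_capability(device)))
        for device in range(torch.cuda.device_count())
    ]
    return min(capabilities)
```

Combined with the `VLLM_BF16_MIN_CUDA_COMPUTE_CAPABILITY = 8.0` constant added in `src/euroeval/constants.py` below, this means pre-Ampere GPUs such as V100s (7.0) and T4s (7.5) fall back to `float16`.
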
@@ -916,7 +954,7 @@
  max_model_len=min(true_max_model_len, 5_000),
  download_dir=download_dir,
  trust_remote_code=benchmark_config.trust_remote_code,
- revision=model_config.revision,
+ revision=revision,
  seed=4242,
  distributed_executor_backend=executor_backend,
  tensor_parallel_size=torch.cuda.device_count(),
@@ -994,6 +1032,7 @@ def load_tokenizer(
  Returns:
  The loaded tokenizer.
  """
+ revision = revision if adapter_base_model_id is None else "main"
  config = AutoConfig.from_pretrained(
  adapter_base_model_id or model_id,
  revision=revision,

src/euroeval/constants.py
@@ -54,3 +54,6 @@ METRIC_ATTRIBUTES_TAKING_UP_MEMORY = ["cached_bertscorer"]

  # Hugging Face Hub tags used to classify models as merge models
  MERGE_TAGS = ["merge", "mergekit"]
+
+ # The minimum required CUDA compute capability for using bfloat16 in vLLM
+ VLLM_BF16_MIN_CUDA_COMPUTE_CAPABILITY = 8.0

src/euroeval/data_models.py
@@ -1,7 +1,6 @@
  """Data models used in EuroEval."""

  import collections.abc as c
- import importlib.metadata
  import json
  import pathlib
  import re
@@ -11,6 +10,8 @@ from dataclasses import dataclass, field
  import pydantic
  import torch

+ from euroeval.utils import get_package_version
+
  from .enums import Device, InferenceBackend, ModelType, TaskGroup
  from .types import ScoreDict

@@ -228,7 +229,11 @@ class BenchmarkResult(pydantic.BaseModel):
  generative_type: str | None
  few_shot: bool
  validation_split: bool
- euroeval_version: str = importlib.metadata.version("euroeval")
+ euroeval_version: str | None = get_package_version("euroeval")
+ transformers_version: str | None = get_package_version("transformers")
+ torch_version: str | None = get_package_version("torch")
+ vllm_version: str | None = get_package_version("vllm")
+ outlines_version: str | None = get_package_version("outlines")

  @classmethod
  def from_dict(cls, config: dict) -> "BenchmarkResult":