EuroEval 16.0.0__tar.gz → 16.0.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release: this version of EuroEval might be problematic according to the registry's automated checks.
- {euroeval-16.0.0 → euroeval-16.0.1}/CHANGELOG.md +20 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/PKG-INFO +3 -1
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/leaderboards/README.md +5 -15
- {euroeval-16.0.0 → euroeval-16.0.1}/makefile +3 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/pyproject.toml +3 -1
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/__init__.py +5 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/benchmark_modules/vllm.py +41 -28
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/constants.py +6 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/data_models.py +20 -16
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/dataset_configs/danish.py +0 -3
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/generation_utils.py +44 -6
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/metrics/pipeline.py +50 -8
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/model_cache.py +13 -1
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/task_group_utils/multiple_choice_classification.py +2 -2
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/task_group_utils/sequence_classification.py +66 -53
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/task_group_utils/token_classification.py +14 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/tasks.py +9 -7
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/tokenization_utils.py +1 -2
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/utils.py +32 -1
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_european_values.py +33 -27
- {euroeval-16.0.0 → euroeval-16.0.1}/uv.lock +58 -1
- {euroeval-16.0.0 → euroeval-16.0.1}/.github/ISSUE_TEMPLATE/benchmark_dataset_request.yaml +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/.github/ISSUE_TEMPLATE/bug.yaml +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/.github/ISSUE_TEMPLATE/feature_request.yaml +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/.github/ISSUE_TEMPLATE/model_evaluation_request.yaml +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/.github/workflows/ci.yaml +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/.gitignore +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/.pre-commit-config.yaml +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/CITATION.cff +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/CODE_OF_CONDUCT.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/CONTRIBUTING.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/Dockerfile.cuda +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/LICENSE +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/NEW_DATASET_GUIDE.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/README.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/CNAME +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/README.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/datasets/README.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/datasets/danish.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/datasets/dutch.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/datasets/english.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/datasets/estonian.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/datasets/faroese.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/datasets/finnish.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/datasets/french.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/datasets/german.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/datasets/icelandic.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/datasets/italian.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/datasets/latvian.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/datasets/norwegian.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/datasets/portuguese.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/datasets/spanish.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/datasets/swedish.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/extras/radial_plotter.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/faq.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/gfx/favicon.png +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/leaderboards/Monolingual/danish.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/leaderboards/Monolingual/dutch.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/leaderboards/Monolingual/english.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/leaderboards/Monolingual/faroese.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/leaderboards/Monolingual/finnish.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/leaderboards/Monolingual/french.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/leaderboards/Monolingual/german.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/leaderboards/Monolingual/icelandic.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/leaderboards/Monolingual/italian.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/leaderboards/Monolingual/norwegian.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/leaderboards/Monolingual/portuguese.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/leaderboards/Monolingual/spanish.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/leaderboards/Monolingual/swedish.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/leaderboards/Multilingual/european.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/leaderboards/Multilingual/germanic.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/leaderboards/Multilingual/mainland-scandinavian.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/leaderboards/Multilingual/romance.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/methodology.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/python-package.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/tasks/README.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/tasks/common-sense-reasoning.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/tasks/knowledge.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/tasks/linguistic-acceptability.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/tasks/named-entity-recognition.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/tasks/reading-comprehension.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/tasks/sentiment-classification.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/tasks/speed.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/docs/tasks/summarization.md +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/gfx/euroeval.png +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/gfx/euroeval.xcf +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/gfx/scandeval.png +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/mkdocs.yaml +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/benchmark_config_factory.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/benchmark_modules/__init__.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/benchmark_modules/base.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/benchmark_modules/fresh.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/benchmark_modules/hf.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/benchmark_modules/litellm.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/benchmarker.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/callbacks.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/cli.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/data_loading.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/dataset_configs/__init__.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/dataset_configs/dutch.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/dataset_configs/english.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/dataset_configs/estonian.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/dataset_configs/faroese.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/dataset_configs/finnish.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/dataset_configs/french.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/dataset_configs/german.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/dataset_configs/icelandic.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/dataset_configs/italian.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/dataset_configs/latvian.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/dataset_configs/norwegian.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/dataset_configs/portuguese.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/dataset_configs/spanish.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/dataset_configs/swedish.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/enums.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/exceptions.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/finetuning.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/generation.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/languages.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/metrics/__init__.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/metrics/base.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/metrics/huggingface.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/metrics/llm_as_a_judge.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/metrics/speed.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/model_config.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/model_loading.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/prompt_templates/__init__.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/prompt_templates/linguistic_acceptability.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/prompt_templates/multiple_choice.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/prompt_templates/named_entity_recognition.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/prompt_templates/reading_comprehension.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/prompt_templates/sentiment_classification.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/prompt_templates/summarization.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/scores.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/speed_benchmark.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/task_group_utils/__init__.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/task_group_utils/question_answering.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/task_group_utils/text_to_text.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/types.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/constants.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_allocine.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_angry_tweets.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_arc.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_arc_is.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_belebele.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_boolq_pt.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_cnn_dailymail.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_conll_en.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_conll_es.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_conll_nl.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_copa_lv.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_dane.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_danish_citizen_tests.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_dansk.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_danske_talemaader.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_danske_talemaader_old.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_dbrd.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_dutch_cola.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_eltec.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_err_news.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_estner.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_estonian_valence.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_exam_et.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_fone.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_foqa.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_fosent.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_fquad.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_fullstack_ner.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_germanquad.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_germeval.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_goldenswag.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_grammar_et.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_harem.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_hellaswag.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_hellaswag_fi.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_hotter_and_colder_sentiment.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_ice_linguistic.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_icelandic_error_corpus.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_icelandic_knowledge.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_icelandic_qa.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_icesum.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_idioms_no.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_ilpost_sum.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_jentoft.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_latvian_lsm_summary.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_latvian_twitter_sentiment.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_life_in_the_uk.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_mim_gold_ner.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_mlqa_es.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_mlsum_de.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_mlsum_es.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_mmlu.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_mmlu_lv.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_multi_wiki_qa.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_multinerd-it.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_no_cola.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_no_sammendrag.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_nor_common_sense_qa.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_nordjylland_news.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_norec.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_norglm_multiqa.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_norglm_multisum.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_norne.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_norquad.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_nqii.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_nrk_quiz_qa.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_orange_sum.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_personal_sum.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_publico.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_rrn.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_sb10k.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_scala.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_scandiqa.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_scandisent_fi.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_schibsted.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_sentiment_headlines_es.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_sentipolc16.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_squad.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_squad_it.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_squad_nl.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_squad_nl_old.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_sst2_pt.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_sst5.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_suc3.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_swedn.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_swerec.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_turku_ner_fi.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_tydiqa_fi.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_wiki_lingua_nl.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_wikiann_fo.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_wikiann_lv.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_wikineural-it.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_winogrande_et.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_winogrande_is.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_xlsum_fi.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/create_xquad_es.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/fix_dot_env_file.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/load_ud_pos.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/src/scripts/versioning.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/__init__.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/conftest.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_benchmark_config_factory.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_benchmark_modules/__init__.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_benchmark_modules/test_base.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_benchmark_modules/test_fresh.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_benchmark_modules/test_hf.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_benchmark_modules/test_litellm.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_benchmark_modules/test_vllm.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_benchmarker.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_callbacks.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_cli.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_constants.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_data_loading.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_data_models.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_dataset_configs.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_enums.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_exceptions.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_finetuning.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_generation.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_languages.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_model_cache.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_model_config.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_model_loading.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_scores.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_speed_benchmark.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_task_utils/__init__.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_task_utils/test_question_answering.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_task_utils/test_sequence_classification.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_task_utils/test_text_to_text.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_task_utils/test_token_classification.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_tasks.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_tokenization_utils.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_types.py +0 -0
- {euroeval-16.0.0 → euroeval-16.0.1}/tests/test_utils.py +0 -0
{euroeval-16.0.0 → euroeval-16.0.1}/CHANGELOG.md

@@ -10,6 +10,26 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.

+## [v16.0.1] - 2025-09-07
+### Fixed
+- Fixed a bug causing encoders to fail when evaluating on the Exam-et dataset.
+- Previously we would abort an evaluation completely if the model outputted a single
+  invalid output on a classification task. As individual samples rarely have a great
+  influence on the overall score, we now just assign the closest label to the sample and
+  continue the evaluation. This will be logged to the user, so that they are aware of
+  this. Some tasks are more sensitive to individual samples, such as European values,
+  where we still abort the evaluation if a single sample is invalid.
+- Fixed a bug where logprobs were not used for classification tasks when evaluating
+  generative models, due to the fact that we raised the number of generated tokens to 10
+  for such tasks. This did not affect the results, but it meant that some evaluations
+  failed.
+- Now includes FlashInfer as a dependency, as it is required by vLLM.
+- Changed the choices in European values to use letters, like the other multiple
+  choice tasks, rather than numbers. Aside from ensuring consistency, we also avoid the
+  issue where '10' and '1' often both have the same first token ('1'), causing us not to
+  be able to use logprobs to determine the answer.
+
+
 ## [v16.0.0] - 2025-09-05
 ### Added
 - Added support for Latvian 🇱🇻! This includes the sentiment classification dataset
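The "closest label" fallback described in the changelog can be pictured with a small standalone sketch. This is illustrative only and not EuroEval's actual implementation (which lives in the task utilities shown further down): an invalid classification output is mapped to the most similar allowed label instead of aborting the run.

```python
import difflib

def map_to_closest_label(output: str, labels: list[str]) -> str:
    """Map a raw model output onto the most similar allowed label (sketch only)."""
    cleaned = output.strip().lower()
    if cleaned in labels:
        return cleaned
    # Fall back to the allowed label with the highest string similarity to the output
    return difflib.get_close_matches(cleaned, labels, n=1, cutoff=0.0)[0]

print(map_to_closest_label("positivee", ["positive", "negative", "neutral"]))  # positive
```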
{euroeval-16.0.0 → euroeval-16.0.1}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: EuroEval
-Version: 16.0.0
+Version: 16.0.1
 Summary: The robust European language model benchmark.
 Project-URL: Repository, https://github.com/EuroEval/EuroEval
 Project-URL: Issues, https://github.com/EuroEval/EuroEval/issues
@@ -61,10 +61,12 @@ Requires-Dist: transformers[mistral-common]>=4.56.0
 Provides-Extra: all
 Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'all'
 Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'all'
+Requires-Dist: flashinfer-python>=0.3.1; (platform_system == 'Linux') and extra == 'all'
 Requires-Dist: vllm>=0.10.1; (platform_system == 'Linux') and extra == 'all'
 Provides-Extra: generative
 Requires-Dist: bitsandbytes>=0.43.1; (platform_system == 'Linux') and extra == 'generative'
 Requires-Dist: fbgemm-gpu>=1.0.0; (platform_system == 'Linux') and extra == 'generative'
+Requires-Dist: flashinfer-python>=0.3.1; (platform_system == 'Linux') and extra == 'generative'
 Requires-Dist: vllm>=0.10.1; (platform_system == 'Linux') and extra == 'generative'
 Description-Content-Type: text/markdown

{euroeval-16.0.0 → euroeval-16.0.1}/docs/leaderboards/README.md

@@ -14,9 +14,8 @@ Each language has two leaderboards:
 - **Generative Leaderboard**: This leaderboard shows the performance of models that can
   generate text. These models have been evaluated on _all_ [tasks](/tasks), both NLU and
   NLG.
-- **NLU Leaderboard**: This leaderboard shows the performance of models that can
-  understand text,
-  the NLU tasks only.
+- **NLU Leaderboard**: This leaderboard shows the performance of models that can
+  understand text, which includes both generative and non-generative models.


 ## 📊 How to Read the Leaderboards
@@ -26,15 +25,14 @@ model across all the tasks in the leaderboard. The lower the rank, the better th

 The columns that follow the rank columns are metadata about the model:

-- `Parameters`: The total number of parameters in the model, in millions.
-- `Vocabulary`: The size of the model's vocabulary, in thousands.
-- `Context`: The maximum number of tokens that the model can process at a time.
-- `Speed`: The inference time of the model - see more [here](/tasks/speed).
 - `Type`: The type of model:
   - 🔍 indicates that it is an encoder model (e.g., BERT)
   - 🧠 indicates that it is a base generative model (e.g., GPT-2)
   - 📝 indicates that it is an instruction-tuned model (e.g., ChatGPT)
   - 🤔 indicates that it is a reasoning model (e.g., o1)
+- `Parameters`: The total number of parameters in the model, in millions.
+- `Vocabulary`: The size of the model's vocabulary, in thousands.
+- `Context`: The maximum number of tokens that the model can process at a time.
 - `Commercial`: Whether the model can be used for commercial purposes. See [here](/faq)
   for more information.
 - `Merge`: Whether the model is a merge of other models.
@@ -47,11 +45,3 @@ the given model on each of the datasets.
 To read more about the individual datasets, see the [datasets](/datasets) page. If
 you're interested in the methodology behind the benchmark, see the
 [methodology](/methodology) page.
-
-/// tab | Generative Scatter Plot
-
-///
-
-/// tab | NLU Scatter Plot
-
-///
{euroeval-16.0.0 → euroeval-16.0.1}/makefile

@@ -144,3 +144,6 @@ publish-major: install check bump-major publish ## Publish a major version
 publish-minor: install check bump-minor publish ## Publish a minor version

 publish-patch: install check bump-patch publish ## Publish a patch version
+
+loc: ## Count the number of lines of code in the project
+	@git ls-files | grep '\.py' | xargs wc -l | tail -n 1
{euroeval-16.0.0 → euroeval-16.0.1}/pyproject.toml

@@ -1,6 +1,6 @@
 [project]
 name = "EuroEval"
-version = "16.0.0"
+version = "16.0.1"
 description = "The robust European language model benchmark."
 readme = "README.md"
 authors = [
@@ -46,11 +46,13 @@ dependencies = [
 generative = [
     "bitsandbytes>=0.43.1; platform_system == 'Linux'",
     "vllm>=0.10.1; platform_system == 'Linux'",
+    "flashinfer-python>=0.3.1; platform_system == 'Linux'",
     "fbgemm-gpu>=1.0.0; platform_system == 'Linux'",
 ]
 all = [
     "bitsandbytes>=0.43.1; platform_system == 'Linux'",
     "vllm>=0.10.1; platform_system == 'Linux'",
+    "flashinfer-python>=0.3.1; platform_system == 'Linux'",
     "fbgemm-gpu>=1.0.0; platform_system == 'Linux'",
 ]

{euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/__init__.py

@@ -13,6 +13,7 @@ from termcolor import colored

 # Block specific warnings before importing anything else, as they can be noisy
 warnings.filterwarnings("ignore", category=UserWarning)
+warnings.filterwarnings("ignore", category=FutureWarning)
 logging.getLogger("httpx").setLevel(logging.CRITICAL)
 logging.getLogger("datasets").setLevel(logging.CRITICAL)
 logging.getLogger("vllm").setLevel(logging.CRITICAL)
@@ -101,6 +102,10 @@ os.environ["DISABLE_AIOHTTP_TRANSPORT"] = "True"
 os.environ["VLLM_USE_V1"] = "1"


+# Use the FlashInfer flash-attention backend for vLLM
+os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
+
+
 # Set the HF_TOKEN env var to copy the HUGGINGFACE_API_KEY env var, as vLLM uses the
 # former and LiteLLM uses the latter
 if os.getenv("HUGGINGFACE_API_KEY"):
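Forcing `VLLM_ATTENTION_BACKEND=FLASHINFER` at import time assumes the new `flashinfer-python` dependency is installed. A downstream script on a machine without FlashInfer could guard the override itself; the snippet below is a hypothetical user-side sketch, not something the package does.

```python
import importlib.util
import os

# Only force the FlashInfer attention backend if the package is importable;
# otherwise let vLLM pick its default backend. (Hypothetical guard.)
if importlib.util.find_spec("flashinfer") is not None:
    os.environ.setdefault("VLLM_ATTENTION_BACKEND", "FLASHINFER")
```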
{euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/benchmark_modules/vllm.py

@@ -337,31 +337,6 @@ class VLLMModel(HuggingFaceEncoderModel):
         if end_of_chat_token:
             stop_tokens.append(end_of_chat_token)

-        structured_generation_schema = None
-        if self.dataset_config.task.uses_structured_output:
-            if self.generative_type == GenerativeType.REASONING:
-                log_once(
-                    f"The model {self.model_config.model_id!r} is a reasoning model "
-                    "and thus does not support structured generation, so we do not "
-                    "enable it.",
-                    level=logging.DEBUG,
-                )
-            else:
-                ner_tag_names = list(self.dataset_config.prompt_label_mapping.values())
-                keys_and_their_types: dict[str, t.Any] = {
-                    tag_name: (conlist(str, max_length=5), ...)
-                    for tag_name in ner_tag_names
-                }
-                answer_format_class = create_model(
-                    "AnswerFormat", **keys_and_their_types
-                )
-                structured_generation_schema = answer_format_class.model_json_schema()
-                log_once(
-                    "Using structured generation with the JSON schema "
-                    f"{structured_generation_schema}",
-                    level=logging.DEBUG,
-                )
-
         # Get the mapping from labels to the first token in the label. We call this each
         # time we generate a new dataset since the dataset config can change
         self.buffer["first_label_token_mapping"] = get_first_label_token_mapping(
@@ -382,8 +357,29 @@ class VLLMModel(HuggingFaceEncoderModel):
                 "error was. Skipping this evaluation."
             )

-
-        if
+        structured_generation_schema = None
+        if (
+            self.dataset_config.task.uses_structured_output
+            or (self.dataset_config.task.uses_logprobs and self.dataset_config.labels)
+        ) and self.generative_type == GenerativeType.REASONING:
+            guided_decoding = None
+            logger.debug(
+                "The dataset uses structured output, but we are not using it as the "
+                "model is a reasoning model."
+            )
+        elif self.dataset_config.task.uses_structured_output:
+            ner_tag_names = list(self.dataset_config.prompt_label_mapping.values())
+            keys_and_their_types: dict[str, t.Any] = {
+                tag_name: (conlist(str, max_length=5), ...)
+                for tag_name in ner_tag_names
+            }
+            answer_format_class = create_model("AnswerFormat", **keys_and_their_types)
+            structured_generation_schema = answer_format_class.model_json_schema()
+            log_once(
+                "Using structured generation with the JSON schema: "
+                f"{json.dumps(structured_generation_schema)}",
+                level=logging.DEBUG,
+            )
             guided_decoding = GuidedDecodingParams(json=structured_generation_schema)
         elif self.dataset_config.task.uses_logprobs and self.dataset_config.labels:
             guided_decoding = GuidedDecodingParams(
@@ -392,8 +388,17 @@ class VLLMModel(HuggingFaceEncoderModel):
                     for label in self.dataset_config.labels
                 ]
             )
+            log_once(
+                "Using structured generation with the choices: "
+                f"{guided_decoding.choice!r}.",
+                level=logging.DEBUG,
+            )
         else:
             guided_decoding = None
+            log_once(
+                "Not using structured generation as the dataset does not require it.",
+                level=logging.DEBUG,
+            )

         # Define the parameters used for vLLM generation
         max_tokens: int = (
@@ -439,6 +444,7 @@ class VLLMModel(HuggingFaceEncoderModel):
         # Generate sequences using vLLM
         input_is_a_test = len(prompts) == 1 and len(set(prompts[0])) == 1
         num_attempts = 3
+        truncation_attempts = 0
         for _ in range(num_attempts):
             try:
                 raw_outputs = self._model.generate(
@@ -466,12 +472,19 @@ class VLLMModel(HuggingFaceEncoderModel):
                     "Prompts are too long, so truncating them and trying again..."
                 )
                 logger.debug(f"The error message was: {str(e)}")
+
+                # If we have already tried truncating the prompts a few times, then
+                # we truncate a bit more aggressively
+                extra_truncation = 50 * truncation_attempts
+                truncation_attempts += 1
+
                 tokenized_prompts = self._tokeniser(
                     text=prompts,
                     truncation=True,
                     max_length=max(
                         min(self._tokeniser.model_max_length, MAX_CONTEXT_LENGTH)
-                        - max_tokens
+                        - max_tokens
+                        - extra_truncation,
                         0,
                     ),
                 )
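The refactor above builds `GuidedDecodingParams(choice=...)` for classification datasets so that generation is constrained to the allowed labels while logprobs remain usable. Below is a minimal standalone sketch of that idea with vLLM; the model id and labels are placeholders, and the import paths are as in recent vLLM releases and may differ in other versions.

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

labels = ["positiv", "negativ", "neutral"]  # hypothetical prompt-language labels
sampling_params = SamplingParams(
    max_tokens=10,
    logprobs=10,  # keep logprobs so the first generated token can identify the label
    guided_decoding=GuidedDecodingParams(choice=labels),
)
llm = LLM(model="your-org/your-model")  # placeholder model id
outputs = llm.generate(["Anmeldelse: 'Fantastisk film!'\nSentiment:"], sampling_params)
print(outputs[0].outputs[0].text)  # one of the three labels
```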
{euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/constants.py

@@ -75,3 +75,9 @@ LITELLM_CLASSIFICATION_OUTPUT_KEY = "label"

 # These characters are stripped from JSON output when trying to identify the label
 JSON_STRIP_CHARACTERS = ' {}\n\r":'
+
+
+# The number of tokens we generate when evaluating generative models on classification
+# tasks. We also use this to determine whether we should store logprobs in the model
+# outputs (and cache).
+NUM_GENERATION_TOKENS_FOR_CLASSIFICATION = 10
{euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/data_models.py

@@ -125,6 +125,12 @@ class Task:
             A list of generative model types that are allowed to be evaluated on this
             task. If None, all generative model types are allowed. Only relevant if
             `allowed_model_types` includes generative models.
+        allow_invalid_model_outputs (optional):
+            Whether to allow invalid model outputs. This is only relevant for generative
+            models on classification tasks, where the model may generate an output
+            which is not one of the allowed labels. If True, the model output will be
+            mapped to the closest valid label. If False, the model output will be
+            considered incorrect and the evaluation will be aborted. Defaults to True.
     """

     name: str
@@ -148,6 +154,7 @@ class Task:
             GenerativeType.REASONING,
         ]
     )
+    allow_invalid_model_outputs: bool = True

     def __post_init__(self) -> None:
         """Post-initialisation checks."""
@@ -430,7 +437,6 @@ class DatasetConfig:
             if self._prompt_prefix is None
             else self._prompt_prefix
         )
-        prompt_prefix = prompt_prefix.replace("{labels_str}", self._labels_str)
         return prompt_prefix

     @property
@@ -443,7 +449,6 @@ class DatasetConfig:
             if self._prompt_template is None
             else self._prompt_template
         )
-        prompt_template = prompt_template.replace("{labels_str}", self._labels_str)
         return prompt_template

     @property
@@ -456,9 +461,6 @@ class DatasetConfig:
             if self._instruction_prompt is None
             else self._instruction_prompt
         )
-        instruction_prompt = instruction_prompt.replace(
-            "{labels_str}", self._labels_str
-        )
         return instruction_prompt

     @property
@@ -519,15 +521,16 @@ class DatasetConfig:
         """Return a hash of the dataset configuration."""
         return hash(self.name)

-
-    def _labels_str(self) -> str:
+    def get_labels_str(self, labels: list[str] | None = None) -> str:
         """Converts a set of labels to a natural string, in the specified language.

         If the task is NER, we separate using 'and' and use the mapped labels instead of
         the BIO NER labels.

         Args:
-
+            labels (optional):
+                The labels to convert to a natural string. If None, uses all the labels
+                in the dataset. Defaults to None.

         Returns:
             The natural string representation of the labels in specified language.
@@ -539,16 +542,17 @@ class DatasetConfig:
         else:
             sep_word = main_language.or_separator

-
-
-
-
-
-
-
+        if labels is None:
+            labels = list()
+            for english_label in self.labels:
+                if english_label not in self.prompt_label_mapping:
+                    continue
+                label = self.prompt_label_mapping[english_label]
+                if label not in labels:
+                    labels.append(label)

         # Convert labels to single-quoted labels - and remove duplicates
-        quoted_labels = [f"'{label}'" for label in
+        quoted_labels = [f"'{label}'" for label in labels]

         if not quoted_labels:
             return ""
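The new `get_labels_str` builds a natural-language enumeration of the (optionally restricted) prompt labels. A standalone sketch of the same logic with made-up Danish labels follows; it is not the actual `DatasetConfig` class, just the core transformation.

```python
labels = ["correct", "incorrect"]                     # hypothetical English-side labels
prompt_label_mapping = {"correct": "ja", "incorrect": "nej"}
or_separator = "eller"                                # hypothetical Danish 'or'

# Map the English labels to their prompt-language equivalents, dropping duplicates
mapped: list[str] = []
for english_label in labels:
    label = prompt_label_mapping.get(english_label)
    if label is not None and label not in mapped:
        mapped.append(label)

quoted = [f"'{label}'" for label in mapped]
labels_str = (
    f"{', '.join(quoted[:-1])} {or_separator} {quoted[-1]}" if len(quoted) > 1 else quoted[0]
)
print(labels_str)  # 'ja' eller 'nej'
```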
{euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/dataset_configs/danish.py

@@ -84,7 +84,6 @@ EUROPEAN_VALUES_DA_CONFIG = DatasetConfig(
     languages=[DA],
     splits=["test"],
     bootstrap_samples=False,
-    _instruction_prompt="{text}",
 )


@@ -159,7 +158,6 @@ EUROPEAN_VALUES_SITUATIONAL_DA_CONFIG = DatasetConfig(
     languages=[DA],
     splits=["test"],
     bootstrap_samples=False,
-    _instruction_prompt="{text}",
     unofficial=True,
 )

@@ -172,6 +170,5 @@ EUROPEAN_VALUES_COMPLETIONS_DA_CONFIG = DatasetConfig(
     languages=[DA],
     splits=["test"],
     bootstrap_samples=False,
-    _instruction_prompt="{text}",
     unofficial=True,
 )
{euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/generation_utils.py

@@ -9,7 +9,7 @@ import typing as t
 from .enums import TaskGroup
 from .exceptions import InvalidBenchmark
 from .tokenization_utils import apply_chat_template
-from .utils import log_once
+from .utils import extract_multiple_choice_labels, log_once

 if t.TYPE_CHECKING:
     from datasets import DatasetDict
@@ -230,18 +230,49 @@ def apply_prompt(
         return dataset_config.prompt_template.format(**kwargs), ""

     match dataset_config.task.task_group:
-        case
-
-
+        case TaskGroup.SEQUENCE_CLASSIFICATION:
+            labels_str = dataset_config.get_labels_str()
+            few_shot_sections = [
+                create_prompt(
+                    text=example["text"].replace("\n", " ").strip(),
+                    label=example["label"].replace("\n", " ").strip(),
+                    labels_str=labels_str,
+                )
+                for example in few_shot_examples
+            ]
+            new_sections = [
+                create_prompt(
+                    text=text.replace("\n", " ").strip(),
+                    label="",
+                    labels_str=labels_str,
+                )
+                for text in examples["text"]
+            ]
+
+        case TaskGroup.MULTIPLE_CHOICE_CLASSIFICATION:
             few_shot_sections = [
                 create_prompt(
                     text=example["text"].replace("\n", " ").strip(),
                     label=example["label"].replace("\n", " ").strip(),
+                    labels_str=dataset_config.get_labels_str(
+                        labels=extract_multiple_choice_labels(
+                            prompt=example["text"],
+                            candidate_labels=dataset_config.labels,
+                        )
+                    ),
                 )
                 for example in few_shot_examples
             ]
             new_sections = [
-                create_prompt(
+                create_prompt(
+                    text=text.replace("\n", " ").strip(),
+                    label="",
+                    labels_str=dataset_config.get_labels_str(
+                        labels=extract_multiple_choice_labels(
+                            prompt=text, candidate_labels=dataset_config.labels
+                        )
+                    ),
+                )
                 for text in examples["text"]
             ]

@@ -259,6 +290,7 @@ def apply_prompt(
             ]

         case TaskGroup.TOKEN_CLASSIFICATION:
+            labels_str = dataset_config.get_labels_str()

             def create_label(example: dict) -> str:
                 prompt_labels = dataset_config.prompt_label_mapping.values()
@@ -280,12 +312,15 @@ def apply_prompt(
                 create_prompt(
                     text=" ".join(example["tokens"]).replace("\n", " ").strip(),
                     label=create_label(example=example),
+                    labels_str=labels_str,
                 )
                 for example in few_shot_examples
             ]
             new_sections = [
                 create_prompt(
-                    text=" ".join(tokens).replace("\n", " ").strip(),
+                    text=" ".join(tokens).replace("\n", " ").strip(),
+                    label="",
+                    labels_str=labels_str,
                 )
                 for tokens in examples["tokens"]
             ]
@@ -375,4 +410,7 @@ def apply_prompt(
         for new_prompt, _ in new_sections
     ]

+    # Always add the final prompts without few-shot examples, too, for analysis
+    examples["prompt"] = [new_prompt for new_prompt, _ in new_sections]
+
     return examples
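`extract_multiple_choice_labels` is imported from `utils.py`, whose diff is not shown in this section, so its exact behaviour is not visible here. Judging from how it is called, it presumably returns only the option letters that actually occur in a given prompt; the function below is a guess at that behaviour, for illustration only.

```python
import re

def extract_multiple_choice_labels(prompt: str, candidate_labels: list[str]) -> list[str]:
    """Keep only the candidate option letters that appear as 'x. ...' lines (guess)."""
    present = {
        match.group(1)
        for match in re.finditer(r"^([a-z])\.\s", prompt, flags=re.MULTILINE)
    }
    return [label for label in candidate_labels if label in present]

prompt = "Question?\nChoices:\na. first option\nb. second option\n"
print(extract_multiple_choice_labels(prompt, ["a", "b", "c", "d"]))  # ['a', 'b']
```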
{euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/metrics/pipeline.py

@@ -26,6 +26,27 @@ logger: logging.Logger = logging.getLogger("euroeval")
 T = t.TypeVar("T", bound=int | float | str | bool)


+class PreprocessingFunction(t.Protocol):
+    """A protocol for a preprocessing function."""
+
+    def __call__(
+        self, predictions: c.Sequence[int], dataset: "Dataset"
+    ) -> c.Sequence[int]:
+        """Preprocess the model predictions before they are passed to the pipeline.
+
+        Args:
+            predictions:
+                The model predictions.
+            dataset:
+                The dataset used for evaluation. This is only used in case any
+                additional metadata is used to compute the metrics.
+
+        Returns:
+            The preprocessed model predictions.
+        """
+        ...
+
+
 class PipelineMetric(Metric):
     """Load a scikit-learn pipeline and use it to get scores from the predictions."""

@@ -36,7 +57,7 @@ class PipelineMetric(Metric):
         pipeline_repo: str,
         pipeline_scoring_function: c.Callable[["Pipeline", c.Sequence], float],
         pipeline_file_name: str = "pipeline.pkl",
-        preprocessing_fn:
+        preprocessing_fn: PreprocessingFunction | None = None,
         postprocessing_fn: c.Callable[[float], tuple[float, str]] | None = None,
     ) -> None:
         """Initialise the pipeline transform metric.
@@ -101,7 +122,10 @@ class PipelineMetric(Metric):
         """
         if self.pipeline is None:
             self.pipeline = self._download_pipeline()
-
+        if self.preprocessing_fn is not None:
+            predictions = self.preprocessing_fn(
+                predictions=predictions, dataset=dataset
+            )
         return self.pipeline_scoring_function(self.pipeline, predictions)

     def _download_pipeline(self) -> "Pipeline":
@@ -133,13 +157,18 @@ class PipelineMetric(Metric):
 ### European Values Metric ###


-def european_values_preprocessing_fn(
+def european_values_preprocessing_fn(
+    predictions: c.Sequence[int], dataset: "Dataset"
+) -> c.Sequence[int]:
     """Preprocess the model predictions for the European Values metric.

     Args:
         predictions:
             The model predictions, a sequence of integers representing the predicted
             choices for each question.
+        dataset:
+            The dataset used for evaluation. This is only used in case any additional
+            metadata is used to compute the metrics.

     Returns:
         The preprocessed model predictions, a sequence of integers representing the
@@ -154,6 +183,17 @@ def european_values_preprocessing_fn(predictions: c.Sequence[int]) -> c.Sequence
     num_questions = 53
     num_phrasings_per_question = 5

+    # Convert the predictions to integers
+    integer_predictions = []
+    for prediction, idx_to_choice in zip(predictions, dataset["idx_to_choice"]):
+        idx_to_choice = {
+            int(idx): int(choice)
+            for idx, choice in idx_to_choice.items()
+            if choice is not None
+        }
+        integer_prediction = idx_to_choice[prediction]
+        integer_predictions.append(integer_prediction)
+
     assert len(predictions) % num_questions == 0, (
         f"The number of predictions ({len(predictions)}) is not a multiple of "
         f"{num_questions}, which is required for the European Values metric."
@@ -171,7 +211,7 @@ def european_values_preprocessing_fn(predictions: c.Sequence[int]) -> c.Sequence
     # Shape: (num_questions, num_phrasings_per_question)
     arr = np.array(
         [
-
+            integer_predictions[i : i + num_phrasings_per_question]
             for i in range(0, len(predictions), num_phrasings_per_question)
         ]
     )
@@ -188,7 +228,7 @@ def european_values_preprocessing_fn(predictions: c.Sequence[int]) -> c.Sequence
     arr = np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=1, arr=arr)

     # Convert the array to a list
-
+    integer_predictions = arr.tolist()

     # Some of the questions are categorical and we're only interested in whether the
     # model chooses a specific choice or not. This mapping takes the question index
@@ -208,11 +248,13 @@ def european_values_preprocessing_fn(predictions: c.Sequence[int]) -> c.Sequence
     }

     # Map the predictions to the choices we're interested in
-
+    integer_predictions = list(integer_predictions)
     for question_idx, choice in question_choices.items():
-
+        integer_predictions[question_idx] = (
+            1 if integer_predictions[question_idx] == choice else 0
+        )

-    return
+    return integer_predictions


 def european_values_scoring_function(

{euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/model_cache.py

@@ -10,7 +10,9 @@ from dataclasses import asdict

 from tqdm.auto import tqdm

+from .constants import NUM_GENERATION_TOKENS_FOR_CLASSIFICATION
 from .data_models import GenerativeModelOutput, SingleGenerativeModelOutput
+from .utils import log_once

 if t.TYPE_CHECKING:
     from pathlib import Path
@@ -189,10 +191,20 @@ class ModelCache:
                 # the indices of the top scores, to save space. Further, we only store
                 # the scores if the generated sequence is shorter than the maximum
                 # length
-                if
+                if (
+                    model_output.scores is not None
+                    and self.max_generated_tokens
+                    <= NUM_GENERATION_TOKENS_FOR_CLASSIFICATION
+                ):
                     assert model_output.scores is not None
                     scores = model_output.scores[sample_idx]
                 else:
+                    if model_output.scores is not None:
+                        log_once(
+                            "The generated sequence is longer than the maximum "
+                            "length for classification. Not caching the scores.",
+                            level=logging.DEBUG,
+                        )
                     scores = None
                 self[model_input] = SingleGenerativeModelOutput(
                     sequence=model_output.sequences[sample_idx], scores=scores
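The reshaping and majority-vote step in `european_values_preprocessing_fn` is easier to see with toy numbers (two questions with three phrasings each, instead of the real 53 and 5):

```python
import numpy as np

integer_predictions = [1, 1, 2, 3, 3, 3]  # toy data: 2 questions x 3 phrasings
num_phrasings_per_question = 3

# Shape: (num_questions, num_phrasings_per_question)
arr = np.array(
    [
        integer_predictions[i : i + num_phrasings_per_question]
        for i in range(0, len(integer_predictions), num_phrasings_per_question)
    ]
)

# Majority vote over the phrasings of each question
majority = np.apply_along_axis(lambda x: np.bincount(x).argmax(), axis=1, arr=arr)
print(majority.tolist())  # [1, 3]
```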
{euroeval-16.0.0 → euroeval-16.0.1}/src/euroeval/task_group_utils/multiple_choice_classification.py
RENAMED

@@ -126,7 +126,7 @@ def prepare_examples(
     ):
         choice_idxs.append(idx)

-    choices = [sections[idx] for idx in choice_idxs]
+    choices = [sections[idx] for idx in reversed(choice_idxs)]

     # Check that the choices are present, and that all of them are at the end
     assert len(choices) > 0, "No choices found in the document."
@@ -146,7 +146,7 @@ def prepare_examples(
     )
     new_examples["label"] = [
         int(choice.startswith(f"{letter}. ") and letter == examples["label"][0])
-        for letter, choice in zip("
+        for letter, choice in zip("abcdefghijklmnopqrstuvwxyz", choices)
     ]
     new_examples["id"] = [hashlib.md5(string=doc.encode()).hexdigest()] * len(choices)
     return new_examples
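The zip over the full lowercase alphabet pairs each extracted choice with its option letter and marks the gold one; a toy run with made-up choices shows the resulting label vector:

```python
choices = ["a. Oslo", "b. Bergen", "c. Stavanger"]  # made-up extracted choice sections
gold_letter = "b"

labels = [
    int(choice.startswith(f"{letter}. ") and letter == gold_letter)
    for letter, choice in zip("abcdefghijklmnopqrstuvwxyz", choices)
]
print(labels)  # [0, 1, 0]
```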