eval-framework 0.2.1.tar.gz → 0.2.3.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (170)
  1. {eval_framework-0.2.1 → eval_framework-0.2.3}/PKG-INFO +54 -35
  2. {eval_framework-0.2.1 → eval_framework-0.2.3}/README.md +51 -34
  3. {eval_framework-0.2.1 → eval_framework-0.2.3}/pyproject.toml +11 -2
  4. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/context/determined.py +15 -5
  5. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/context/eval.py +4 -0
  6. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/context/local.py +4 -2
  7. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/llm/aleph_alpha.py +13 -1
  8. eval_framework-0.2.3/src/eval_framework/llm/base.py +180 -0
  9. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/llm/huggingface.py +99 -53
  10. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/llm/mistral.py +25 -10
  11. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/llm/openai.py +24 -3
  12. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/llm/vllm.py +94 -43
  13. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/main.py +31 -62
  14. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/json_format.py +9 -1
  15. eval_framework-0.2.3/src/eval_framework/metrics/llm/base.py +33 -0
  16. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/llm_judge_mtbench_pair.py +20 -21
  17. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/llm_judge_mtbench_single.py +20 -21
  18. eval_framework-0.2.3/src/eval_framework/metrics/loglikelihood/base.py +50 -0
  19. eval_framework-0.2.3/src/eval_framework/metrics/loglikelihood/confidence_weighted_accuracy.py +25 -0
  20. eval_framework-0.2.3/src/eval_framework/metrics/loglikelihood/dcs.py +43 -0
  21. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/loglikelihood/probability_mass.py +9 -12
  22. eval_framework-0.2.3/src/eval_framework/metrics/loglikelihood/ternary.py +42 -0
  23. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/response_generator.py +6 -2
  24. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/result_processors/base.py +14 -0
  25. eval_framework-0.2.3/src/eval_framework/result_processors/hf_uploader.py +75 -0
  26. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/result_processors/result_processor.py +7 -7
  27. eval_framework-0.2.3/src/eval_framework/result_processors/wandb_uploader.py +141 -0
  28. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/run.py +26 -0
  29. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/gsm8k.py +7 -5
  30. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/openbookqa.py +25 -3
  31. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/opengptx_eu20.py +3 -3
  32. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/sciq.py +22 -1
  33. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/sphyr.py +6 -2
  34. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/truthfulqa.py +5 -5
  35. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/eval_config.py +27 -4
  36. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/task_names.py +3 -1
  37. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/utils/file_ops.py +61 -40
  38. eval_framework-0.2.1/src/eval_framework/llm/base.py +0 -97
  39. eval_framework-0.2.1/src/eval_framework/metrics/llm/base.py +0 -8
  40. eval_framework-0.2.1/src/eval_framework/result_processors/hf_processor.py +0 -87
  41. eval_framework-0.2.1/src/template_formatting/tests/test_formatter_eval.py +0 -408
  42. eval_framework-0.2.1/src/template_formatting/tests/test_formatter_scaling.py +0 -253
  43. eval_framework-0.2.1/src/template_formatting/tests/test_mistral_formatter.py +0 -136
  44. {eval_framework-0.2.1 → eval_framework-0.2.3}/LICENSE +0 -0
  45. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/__init__.py +0 -0
  46. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/base_config.py +0 -0
  47. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/context/__init__.py +0 -0
  48. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/evaluation_generator.py +0 -0
  49. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/exceptions.py +0 -0
  50. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/external/ifeval_impl/README.md +0 -0
  51. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/external/ifeval_impl/instructions.py +0 -0
  52. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/external/ifeval_impl/instructions_registry.py +0 -0
  53. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/external/ifeval_impl/instructions_util.py +0 -0
  54. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/external/ifeval_impl/utils.py +0 -0
  55. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/llm/__init__.py +0 -0
  56. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/llm/models.py +0 -0
  57. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/logger.py +0 -0
  58. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/__init__.py +0 -0
  59. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/base.py +0 -0
  60. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/__init__.py +0 -0
  61. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/accuracy_completion.py +0 -0
  62. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/bleu.py +0 -0
  63. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/chrf.py +0 -0
  64. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/code_assertion.py +0 -0
  65. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/code_execution_pass_at_one.py +0 -0
  66. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/comet.py +0 -0
  67. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/concordance_index.py +0 -0
  68. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/csv_format.py +0 -0
  69. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/cwe_accuracy.py +0 -0
  70. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/exponential_similarity.py +0 -0
  71. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/f1.py +0 -0
  72. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/format_checker.py +0 -0
  73. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/grid_difference.py +0 -0
  74. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/ifeval.py +0 -0
  75. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/language_checker.py +0 -0
  76. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/length_control.py +0 -0
  77. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/math_reasoning_completion.py +0 -0
  78. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/niah_accuracy.py +0 -0
  79. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/placeholder_checker.py +0 -0
  80. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/repetition.py +0 -0
  81. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/rouge_1.py +0 -0
  82. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/rouge_2.py +0 -0
  83. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/rouge_geometric_mean.py +0 -0
  84. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/rouge_l.py +0 -0
  85. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/struct_eval_metrics.py +0 -0
  86. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/ter.py +0 -0
  87. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/completion/text_counter.py +0 -0
  88. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/efficiency/__init__.py +0 -0
  89. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/efficiency/bytes_per_sequence_position.py +0 -0
  90. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/__init__.py +0 -0
  91. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/graders/chatbot_style_grader.py +0 -0
  92. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/graders/comparison_grader.py +0 -0
  93. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/graders/conciseness_grader.py +0 -0
  94. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/graders/contains_names_grader.py +0 -0
  95. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/graders/format_correctness_grader.py +0 -0
  96. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/graders/instruction_grader.py +0 -0
  97. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/graders/language.py +0 -0
  98. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/graders/long_context_grader.py +0 -0
  99. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/graders/models.py +0 -0
  100. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/graders/refusal_grader.py +0 -0
  101. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/graders/sql_quality_grader.py +0 -0
  102. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/graders/summary_world_knowledge_grader.py +0 -0
  103. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/llm_judge_chatbot_style.py +0 -0
  104. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/llm_judge_completion_accuracy.py +0 -0
  105. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/llm_judge_conciseness.py +0 -0
  106. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/llm_judge_contains_names.py +0 -0
  107. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/llm_judge_format_correctness.py +0 -0
  108. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/llm_judge_instruction.py +0 -0
  109. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/llm_judge_refusal.py +0 -0
  110. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/llm_judge_sql.py +0 -0
  111. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/llm/llm_judge_world_knowledge.py +0 -0
  112. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/loglikelihood/__init__.py +0 -0
  113. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/metrics/loglikelihood/accuracy_loglikelihood.py +0 -0
  114. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/py.typed +0 -0
  115. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/result_processors/__init__.py +0 -0
  116. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/run_direct.py +0 -0
  117. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/shared/types.py +0 -0
  118. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/__init__.py +0 -0
  119. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/base.py +0 -0
  120. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/__init__.py +0 -0
  121. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/arc.py +0 -0
  122. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/arc_de.py +0 -0
  123. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/arc_fi.py +0 -0
  124. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/belebele.py +0 -0
  125. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/bigcodebench.py +0 -0
  126. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/casehold.py +0 -0
  127. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/chembench.py +0 -0
  128. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/copa.py +0 -0
  129. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/duc.py +0 -0
  130. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/flores200.py +0 -0
  131. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/flores_plus.py +0 -0
  132. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/gpqa.py +0 -0
  133. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/hellaswag.py +0 -0
  134. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/hellaswag_de.py +0 -0
  135. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/humaneval.py +0 -0
  136. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/ifeval.py +0 -0
  137. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/include.py +0 -0
  138. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/infinitebench.py +0 -0
  139. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/math_reasoning.py +0 -0
  140. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/mbpp.py +0 -0
  141. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/mmlu.py +0 -0
  142. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/mmlu_de.py +0 -0
  143. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/mmlu_pro.py +0 -0
  144. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/mmmlu.py +0 -0
  145. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/pawsx.py +0 -0
  146. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/piqa.py +0 -0
  147. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/quality.py +0 -0
  148. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/squad.py +0 -0
  149. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/struct_eval.py +0 -0
  150. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/tablebench.py +0 -0
  151. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/triviaqa.py +0 -0
  152. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/winogender.py +0 -0
  153. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/winogrande.py +0 -0
  154. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/winox.py +0 -0
  155. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/wmt.py +0 -0
  156. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/benchmarks/zero_scrolls.py +0 -0
  157. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/perturbation.py +0 -0
  158. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/registry.py +0 -0
  159. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/task_loader.py +0 -0
  160. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/tasks/utils.py +0 -0
  161. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/utils/constants.py +0 -0
  162. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/utils/generate_task_docs.py +0 -0
  163. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/utils/helpers.py +0 -0
  164. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/utils/logging.py +0 -0
  165. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/utils/packaging.py +0 -0
  166. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/template_formatting/README.md +0 -0
  167. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/template_formatting/__init__.py +0 -0
  168. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/template_formatting/formatter.py +0 -0
  169. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/template_formatting/mistral_formatter.py +0 -0
  170. {eval_framework-0.2.1 → eval_framework-0.2.3}/src/template_formatting/py.typed +0 -0

{eval_framework-0.2.1 → eval_framework-0.2.3}/PKG-INFO

@@ -1,6 +1,6 @@
  Metadata-Version: 2.3
  Name: eval-framework
- Version: 0.2.1
+ Version: 0.2.3
  Summary: Evalulation Framework
  Author: Aleph Alpha Research
  License: Apache License
@@ -233,6 +233,7 @@ Requires-Dist: jsonlines>=4,<5
  Requires-Dist: lxml>=6,<7
  Requires-Dist: python-iso639>=2025.2.18
  Requires-Dist: wandb>=0.21.1,<1
+ Requires-Dist: boto3>=1.40.54,<2
  Requires-Dist: accelerate ; extra == 'accelerate'
  Requires-Dist: eval-framework[determined,api,openai,transformers,accelerate,vllm,comet,optional,mistral] ; extra == 'all'
  Requires-Dist: aleph-alpha-client>=10,<11 ; extra == 'api'
@@ -248,6 +249,7 @@ Requires-Dist: transformers>=4.45.2,<5 ; extra == 'optional'
  Requires-Dist: jinja2>=3.1.6,<4 ; extra == 'optional'
  Requires-Dist: transformers>=4.45.2,<5 ; extra == 'transformers'
  Requires-Dist: torch>=2.5,<3 ; extra == 'transformers'
+ Requires-Dist: accelerate>=0.30.0,<1 ; extra == 'transformers'
  Requires-Dist: vllm>=0.8.5,<0.9 ; extra == 'vllm'
  Requires-Dist: torch>=2.5,<3 ; extra == 'vllm'
  Requires-Python: >=3.12, <3.13
@@ -305,13 +307,25 @@ There are optional extras available to unlock specific features of the library:
 
  As a short hand, the `all` extra installs all of the above.
 
- For development, you can instead install it directly from the repository. Please first install
- [uv](https://docs.astral.sh/uv/getting-started/installation/)
+ We use `uv` to better resolve dependencies when downloading the extras. You can install uv with:
+ ```bash
+ curl -LsSf https://astral.sh/uv/install.sh | sh
+ ```
+ or by follwing the `uv` [installation docs.](https://docs.astral.sh/uv/getting-started/installation/)
 
- To install the project with all optional extras use
+ Now, you can safely install the project with all optional extras:
  ```bash
  uv sync --all-extras
  ```
+ or with pip
+ ```bash
+ uv pip install eval_framework[all]
+ ```
+
+ Tip: ensure python is properly installed with uv:
+ ```
+ uv python install 3.12 --reinstall
+ ```
 
  We provide custom groups to control optional extras.
  - `flash_attn`: Install `flash_attn` with correct handling of build isolation
@@ -327,8 +341,9 @@ To evaluate a single benchmark locally, you can use the following command:
  eval_framework \
      --models src/eval_framework/llm/models.py \
      --llm-name Smollm135MInstruct \
-     --task-name "GSM8K" \
-     --output-dir ./eval \
+     --task-name "MMLU" \
+     --task-subjects "abstract_algebra" \
+     --output-dir ./eval_results \
      --num-fewshot 5 \
      --num-samples 10
  ```
@@ -414,35 +429,37 @@ pip install eval_framework[transformers]
 
  2. **Create and run your first evaluation using HuggingFace model**:
 
- ```python
- from pathlib import Path
-
- from eval_framework.llm.huggingface import HFLLM
- from eval_framework.main import main
- from eval_framework.tasks.eval_config import EvalConfig
- from template_formatting.formatter import HFFormatter
-
- # Define your model
- class MyHuggingFaceModel(HFLLM):
-     LLM_NAME = "microsoft/DialoGPT-medium"
-     DEFAULT_FORMATTER = partial(HFFormatter, "microsoft/DialoGPT-medium")
-
- if __name__ == "__main__":
-     # Initialize your model
-     llm = MyHuggingFaceModel()
-
-     # Running evaluation on GSM8K task using 5 few-shot examples and 10 samples
-     config = EvalConfig(
-         output_dir=Path("./eval_results"),
-         num_fewshot=5,
-         num_samples=10,
-         task_name="GSM8K",
-         llm_class=MyHuggingFaceModel,
-     )
-
-     # Run evaluation and get results
-     results = main(llm=llm, config=config)
- ```
+ ```python
+ from functools import partial
+ from pathlib import Path
+
+ from eval_framework.llm.huggingface import HFLLM
+ from eval_framework.main import main
+ from eval_framework.tasks.eval_config import EvalConfig
+ from template_formatting.formatter import HFFormatter
+
+ # Define your model
+ class MyHuggingFaceModel(HFLLM):
+     LLM_NAME = "microsoft/DialoGPT-medium"
+     DEFAULT_FORMATTER = partial(HFFormatter, "microsoft/DialoGPT-medium")
+
+ if __name__ == "__main__":
+     # Initialize your model
+     llm = MyHuggingFaceModel()
+
+     # Running evaluation on MMLU abstract algebra task using 5 few-shot examples and 10 samples
+     config = EvalConfig(
+         output_dir=Path("./eval_results"),
+         num_fewshot=5,
+         num_samples=10,
+         task_name="MMLU",
+         task_subjects=["abstract_algebra", "astronomy"],
+         llm_class=MyHuggingFaceModel,
+     )
+
+     # Run evaluation and get results
+     results = main(llm=llm, config=config)
+ ```
 
  3. **Review results** - Check `./eval_results/` for detailed outputs and use our [results guide](docs/understanding_results_guide.md) to interpret them
 
@@ -450,6 +467,7 @@ pip install eval_framework[transformers]
 
  - **Use CLI interface**: See [CLI usage guide](docs/cli_usage.md) for command-line evaluation options
  - **Evaluate HuggingFace models**: Follow our [HuggingFace evaluation guide](docs/evaluate_huggingface_model.md)
+ - **Understand model arguments**: Read out [Model Arguments guide](docs/model_arguments.md)
  - **Create custom benchmarks**: Follow our [benchmark creation guide](docs/add_new_benchmark_guide.md)
  - **Scale your evaluations**: Use [Determined AI integration](docs/using_determined.md) for distributed evaluation
  - **Understand your results**: Read our [results interpretation guide](docs/understanding_results_guide.md)
@@ -465,6 +483,7 @@ pip install eval_framework[transformers]
 
  ### Advanced Usage
 
+ - **[Understanding Model Arguments](docs/model_arguments.md)** - Thorough guide on each constructor argument for salient model classes
  - **[Adding New Benchmarks](docs/add_new_benchmark_guide.md)** - Complete guide with practical examples for adding new benchmarks
  - **[Benchmarks and Metrics](docs/benchmarks_and_metrics.md)** - Comprehensive overview of all available benchmarks and evaluation metrics
  - **[Overview of Dataloading](docs/overview_dataloading.md)** - Explanation of dataloading and task/sample/message structure

{eval_framework-0.2.1 → eval_framework-0.2.3}/README.md

@@ -39,13 +39,25 @@ There are optional extras available to unlock specific features of the library:
 
  As a short hand, the `all` extra installs all of the above.
 
- For development, you can instead install it directly from the repository. Please first install
- [uv](https://docs.astral.sh/uv/getting-started/installation/)
+ We use `uv` to better resolve dependencies when downloading the extras. You can install uv with:
+ ```bash
+ curl -LsSf https://astral.sh/uv/install.sh | sh
+ ```
+ or by follwing the `uv` [installation docs.](https://docs.astral.sh/uv/getting-started/installation/)
 
- To install the project with all optional extras use
+ Now, you can safely install the project with all optional extras:
  ```bash
  uv sync --all-extras
  ```
+ or with pip
+ ```bash
+ uv pip install eval_framework[all]
+ ```
+
+ Tip: ensure python is properly installed with uv:
+ ```
+ uv python install 3.12 --reinstall
+ ```
 
  We provide custom groups to control optional extras.
  - `flash_attn`: Install `flash_attn` with correct handling of build isolation
@@ -61,8 +73,9 @@ To evaluate a single benchmark locally, you can use the following command:
  eval_framework \
      --models src/eval_framework/llm/models.py \
      --llm-name Smollm135MInstruct \
-     --task-name "GSM8K" \
-     --output-dir ./eval \
+     --task-name "MMLU" \
+     --task-subjects "abstract_algebra" \
+     --output-dir ./eval_results \
      --num-fewshot 5 \
      --num-samples 10
  ```
@@ -148,35 +161,37 @@ pip install eval_framework[transformers]
 
  2. **Create and run your first evaluation using HuggingFace model**:
 
- ```python
- from pathlib import Path
-
- from eval_framework.llm.huggingface import HFLLM
- from eval_framework.main import main
- from eval_framework.tasks.eval_config import EvalConfig
- from template_formatting.formatter import HFFormatter
-
- # Define your model
- class MyHuggingFaceModel(HFLLM):
-     LLM_NAME = "microsoft/DialoGPT-medium"
-     DEFAULT_FORMATTER = partial(HFFormatter, "microsoft/DialoGPT-medium")
-
- if __name__ == "__main__":
-     # Initialize your model
-     llm = MyHuggingFaceModel()
-
-     # Running evaluation on GSM8K task using 5 few-shot examples and 10 samples
-     config = EvalConfig(
-         output_dir=Path("./eval_results"),
-         num_fewshot=5,
-         num_samples=10,
-         task_name="GSM8K",
-         llm_class=MyHuggingFaceModel,
-     )
-
-     # Run evaluation and get results
-     results = main(llm=llm, config=config)
- ```
+ ```python
+ from functools import partial
+ from pathlib import Path
+
+ from eval_framework.llm.huggingface import HFLLM
+ from eval_framework.main import main
+ from eval_framework.tasks.eval_config import EvalConfig
+ from template_formatting.formatter import HFFormatter
+
+ # Define your model
+ class MyHuggingFaceModel(HFLLM):
+     LLM_NAME = "microsoft/DialoGPT-medium"
+     DEFAULT_FORMATTER = partial(HFFormatter, "microsoft/DialoGPT-medium")
+
+ if __name__ == "__main__":
+     # Initialize your model
+     llm = MyHuggingFaceModel()
+
+     # Running evaluation on MMLU abstract algebra task using 5 few-shot examples and 10 samples
+     config = EvalConfig(
+         output_dir=Path("./eval_results"),
+         num_fewshot=5,
+         num_samples=10,
+         task_name="MMLU",
+         task_subjects=["abstract_algebra", "astronomy"],
+         llm_class=MyHuggingFaceModel,
+     )
+
+     # Run evaluation and get results
+     results = main(llm=llm, config=config)
+ ```
 
  3. **Review results** - Check `./eval_results/` for detailed outputs and use our [results guide](docs/understanding_results_guide.md) to interpret them
 
@@ -184,6 +199,7 @@ pip install eval_framework[transformers]
 
  - **Use CLI interface**: See [CLI usage guide](docs/cli_usage.md) for command-line evaluation options
  - **Evaluate HuggingFace models**: Follow our [HuggingFace evaluation guide](docs/evaluate_huggingface_model.md)
+ - **Understand model arguments**: Read out [Model Arguments guide](docs/model_arguments.md)
  - **Create custom benchmarks**: Follow our [benchmark creation guide](docs/add_new_benchmark_guide.md)
  - **Scale your evaluations**: Use [Determined AI integration](docs/using_determined.md) for distributed evaluation
  - **Understand your results**: Read our [results interpretation guide](docs/understanding_results_guide.md)
@@ -199,6 +215,7 @@ pip install eval_framework[transformers]
 
  ### Advanced Usage
 
+ - **[Understanding Model Arguments](docs/model_arguments.md)** - Thorough guide on each constructor argument for salient model classes
  - **[Adding New Benchmarks](docs/add_new_benchmark_guide.md)** - Complete guide with practical examples for adding new benchmarks
  - **[Benchmarks and Metrics](docs/benchmarks_and_metrics.md)** - Comprehensive overview of all available benchmarks and evaluation metrics
  - **[Overview of Dataloading](docs/overview_dataloading.md)** - Explanation of dataloading and task/sample/message structure
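
Note: the Python quickstart above passes several MMLU subjects while the CLI example passes one. A sketch of the equivalent multi-subject CLI call, assuming `--task-subjects` accepts multiple values (the flag name comes from the diff; the multi-value syntax is an assumption):

```bash
# Hypothetical multi-subject invocation; flag syntax is assumed, not confirmed by the diff
eval_framework \
  --models src/eval_framework/llm/models.py \
  --llm-name Smollm135MInstruct \
  --task-name "MMLU" \
  --task-subjects "abstract_algebra" "astronomy" \
  --output-dir ./eval_results \
  --num-fewshot 5 \
  --num-samples 10
```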

{eval_framework-0.2.1 → eval_framework-0.2.3}/pyproject.toml

@@ -1,6 +1,6 @@
  [project]
  name = "eval-framework"
- version = "0.2.1"
+ version = "0.2.3"
  description = "Evalulation Framework"
  readme = "README.md"
  license = { file = "LICENSE" }
@@ -40,6 +40,7 @@ dependencies = [
      "lxml>=6,<7",
      "python-iso639>=2025.2.18",
      "wandb>=0.21.1,<1",
+     "boto3>=1.40.54,<2",
  ]
 
  [project.optional-dependencies]
@@ -53,7 +54,11 @@ openai = [
      "openai>=1.62,<2",
      "tiktoken>=0.9,<0.10"
  ]
- transformers = ["transformers>=4.45.2,<5", "torch>=2.5,<3"]
+ transformers = [
+     "transformers>=4.45.2,<5",
+     "torch>=2.5,<3",
+     "accelerate>=0.30.0,<1",
+ ]
  accelerate = ["accelerate"]
  vllm = [
      "vllm>=0.8.5,<0.9",
@@ -87,6 +92,7 @@ eval_framework = "eval_framework.run:run"
  dev = [
      "mypy>=1.10,<2",
      "pytest>=8.3.3,<9",
+     "pytest-mock>=3.14.1",
      "pytest-xdist>=3.6.1,<4",
      "pytest-sugar>1.1,<2",
      "types-pyyaml>=6.0.12.20240917,<7",
@@ -172,3 +178,6 @@ markers = [
  filterwarnings = [
      "ignore::DeprecationWarning:datasets.utils._dill:",
  ]
+ env = [
+     "WANDB_MODE = disabled",
+ ]
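
The `transformers` extra now pulls in `accelerate` alongside `transformers` and `torch`. A minimal install of the released 0.2.3 sdist with that extra (one option; `uv pip` works equally, as the README notes):

```bash
# Installs eval-framework 0.2.3 plus transformers, torch, and the newly added accelerate
pip install "eval-framework[transformers]==0.2.3"
```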

{eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/context/determined.py

@@ -42,10 +42,12 @@ class Hyperparameters(BaseModel):
      wandb_project: str | None = None
      wandb_entity: str | None = None
      wandb_run_id: str | None = None
+     wandb_upload_results: bool | None = None
      description: str | None = None
      task_args: TaskArgs
      llm_args: dict[str, Any] | None = {}
      extra_task_modules: list[str] | None = None
+     delete_output_dir_after_upload: bool | None = None
 
 
  class DeterminedContext(EvalContext):
@@ -88,7 +90,9 @@ class DeterminedContext(EvalContext):
              "wandb_project",
              "wandb_entity",
              "wandb_run_id",
+             "wandb_upload_results",
              "description",
+             "delete_output_dir_after_upload",
          ]:
              val_cli = getattr(self, name, None)
              val_hparams = getattr(self.hparams, name, None)
@@ -112,13 +116,16 @@ class DeterminedContext(EvalContext):
              if val_cli and val_hparams and val_cli != val_hparams:
                  logger.info(f"CLI argument {name} ({val_cli}) is being overridden by hyperparameters: ({val_hparams}).")
 
-         llm_name = getattr(self.hparams, "llm_name", self.llm_name)
-         judge_model_name = getattr(self.hparams.task_args, "judge_model_name", self.judge_model_name)
+         # Hyperparameters take precedence over core context
+         llm_name = self.hparams.llm_name or self.llm_name
+         judge_model_name = self.hparams.task_args.judge_model_name or self.judge_model_name
 
          llm_class = _load_model(llm_name, models_path=self.models_path)
-         llm_judge_class: type[BaseLLM] | None = None
-         if judge_model_name is not None:
-             llm_judge_class = _load_model(judge_model_name, models_path=self.judge_models_path, info="judge")
+         llm_judge_class: type[BaseLLM] | None = (
+             _load_model(judge_model_name, models_path=self.judge_models_path, info="judge")
+             if judge_model_name
+             else None
+         )
 
          # for all optional hyperparameters, resort to the respective CLI argument if the hyperparameter is not set
          self.config = EvalConfig(
@@ -139,8 +146,11 @@ class DeterminedContext(EvalContext):
              wandb_project=self.hparams.wandb_project or self.wandb_project,
              wandb_entity=self.hparams.wandb_entity or self.wandb_entity,
              wandb_run_id=self.hparams.wandb_run_id or self.wandb_run_id,
+             wandb_upload_results=self.hparams.wandb_upload_results or self.wandb_upload_results,
              batch_size=self.hparams.task_args.batch_size or self.batch_size,
              description=self.hparams.description or self.description,
+             delete_output_dir_after_upload=self.hparams.delete_output_dir_after_upload
+             or self.delete_output_dir_after_upload,
          )
 
          return self
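
The switch from `getattr(...)` to `x or y` changes the merge semantics: a hyperparameter now wins whenever it is truthy and falls back to the CLI/context value otherwise. A small sketch of the rule and its one caveat:

```python
# Sketch of the precedence rule used in DeterminedContext above.
# `or` treats falsy values (False, 0, "") as "unset", so an explicit
# False in the hyperparameters still falls back to the CLI value.
def resolve(hparam_value, cli_value):
    return hparam_value or cli_value

assert resolve(None, "cli-run") == "cli-run"     # unset hyperparameter -> CLI value
assert resolve("hp-run", "cli-run") == "hp-run"  # hyperparameter wins
assert resolve(False, True) is True              # caveat: explicit False is overridden
```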

{eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/context/eval.py

@@ -61,6 +61,7 @@ class EvalContext(AbstractContextManager):
          wandb_project: str | None = None,
          wandb_entity: str | None = None,
          wandb_run_id: str | None = None,
+         wandb_upload_results: bool | None = None,
          hf_upload_dir: str | None = None,
          hf_upload_repo: str | None = None,
          llm_args: dict[str, Any] | None = None,
@@ -72,6 +73,7 @@ class EvalContext(AbstractContextManager):
          perturbation_type: str | None = None,
          perturbation_probability: float | None = None,
          perturbation_seed: int | None = None,
+         delete_output_dir_after_upload: bool | None = None,
      ) -> None:
          self.llm_name = llm_name
          self.models_path = models_path
@@ -85,6 +87,7 @@ class EvalContext(AbstractContextManager):
          self.wandb_project = wandb_project
          self.wandb_entity = wandb_entity
          self.wandb_run_id = wandb_run_id
+         self.wandb_upload_results = wandb_upload_results
          self.hf_upload_dir = hf_upload_dir
          self.hf_upload_repo = hf_upload_repo
          self.llm_args = llm_args if llm_args is not None else {}
@@ -93,6 +96,7 @@ class EvalContext(AbstractContextManager):
          self.judge_model_args = judge_model_args if judge_model_args is not None else {}
          self.batch_size = batch_size
          self.description = description
+         self.delete_output_dir_after_upload = delete_output_dir_after_upload
 
          if perturbation_type or perturbation_probability is not None:
              perturbation = {
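
Both new flags are plain pass-throughs on `EvalContext`. A hypothetical construction showing them threaded through (illustrative only; other constructor arguments are omitted and keep their defaults, keyword names match the signature above):

```python
from eval_framework.context.eval import EvalContext

# Illustrative only: the remaining EvalContext arguments keep their defaults
ctx = EvalContext(
    llm_name="Smollm135MInstruct",
    wandb_upload_results=True,            # new in 0.2.3: push results to W&B
    delete_output_dir_after_upload=True,  # new in 0.2.3: clean up after upload
)
```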

{eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/context/local.py

@@ -20,7 +20,7 @@ def _load_model(llm_name: str, models_path: str | PathLike | None, *, info: str
      if models_path is None or "." in llm_name:
          # The llm_name must a a fully qualified module path
          if "." not in llm_name:
-             raise ValueError(f"LLM {info}'{llm_name}' is not a fully qualified module path.")
+             raise ValueError(f"LLM {info} '{llm_name}' is not a fully qualified module path.")
          module_path, llm_class_name = llm_name.rsplit(".", 1)
          module = importlib.import_module(module_path)
          if not hasattr(module, llm_class_name):
@@ -31,7 +31,7 @@ def _load_model(llm_name: str, models_path: str | PathLike | None, *, info: str
      if llm_name not in models_dict:
          if info:
              info = f"{info.strip()} "
-         raise ValueError(f"LLM {info}'{llm_name}' not found in {models_path}.")
+         raise ValueError(f"LLM {info} '{llm_name}' not found in {models_path}.")
      return models_dict[llm_name]
 
 
@@ -58,10 +58,12 @@ class LocalContext(EvalContext):
              wandb_entity=self.wandb_entity,
              wandb_project=self.wandb_project,
              wandb_run_id=self.wandb_run_id,
+             wandb_upload_results=self.wandb_upload_results,
              llm_judge_class=self.llm_judge_class,
              judge_model_args=self.judge_model_args,
              batch_size=self.batch_size,
              description=self.description,
+             delete_output_dir_after_upload=self.delete_output_dir_after_upload,
          )
 
          return self
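
For context, `_load_model` resolves a model in one of two ways: a dotted name is treated as a fully qualified module path and imported directly, while a bare name is looked up in the `--models` file. A sketch using names from the CLI example earlier (whether that exact class lives at that path is an assumption):

```python
from eval_framework.context.local import _load_model

# Bare class name, resolved via the models file
llm_cls = _load_model("Smollm135MInstruct", models_path="src/eval_framework/llm/models.py")

# Fully qualified module path, imported directly (models_path may be None)
llm_cls = _load_model("eval_framework.llm.models.Smollm135MInstruct", models_path=None)
```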

{eval_framework-0.2.1 → eval_framework-0.2.3}/src/eval_framework/llm/aleph_alpha.py

@@ -1,6 +1,7 @@
  import asyncio
  import json
  import logging
+ import math
  import os
  import random
  import re
@@ -43,6 +44,7 @@ def safe_json_loads(s: str) -> dict:
  class AlephAlphaAPIModel(BaseLLM):
      LLM_NAME: str
      DEFAULT_FORMATTER: Callable[[], BaseFormatter] | None = None
+     BYTES_PER_TOKEN: float = 4.0  # rule of thumb according to https://platform.openai.com/tokenizer
 
      def __init__(
          self,
@@ -53,6 +55,7 @@ class AlephAlphaAPIModel(BaseLLM):
          max_async_concurrent_requests: int = 32,
          request_timeout_seconds: int = 30 * 60 + 5,
          queue_full_timeout_seconds: int = 30 * 60 + 5,
+         bytes_per_token: float | None = None,
      ) -> None:
          self._formatter: BaseFormatter
          if formatter is None:
@@ -67,6 +70,12 @@ class AlephAlphaAPIModel(BaseLLM):
          self.request_timeout_seconds = request_timeout_seconds
          self.queue_full_timeout_seconds = queue_full_timeout_seconds
          self._validate_model_availability()
+         # set bytes_per_token_scalar for non-standard models
+         if bytes_per_token is not None and bytes_per_token <= 0:
+             raise ValueError("bytes_per_token must be positive")
+         self.bytes_per_token_scalar = (
+             4.0 / bytes_per_token if bytes_per_token is not None else 4.0 / self.BYTES_PER_TOKEN
+         )
 
      def _validate_model_availability(self) -> None:
          """
@@ -250,11 +259,14 @@ class AlephAlphaAPIModel(BaseLLM):
 
          requests = []
 
+         # Adjust max tokens based on bytes_per_token_scalar so that non-standard models generate full responses
+         scaled_max_tokens = math.ceil(max_tokens * self.bytes_per_token_scalar) if max_tokens is not None else None
+
          for single_messages in messages:
              requests.append(
                  CompletionRequest(
                      prompt=Prompt.from_text(self._formatter.format(single_messages, output_mode="string")),
-                     maximum_tokens=max_tokens,
+                     maximum_tokens=scaled_max_tokens,
                      stop_sequences=stop_sequences,
                      temperature=effective_temperature,
                  )
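
The arithmetic here: `bytes_per_token_scalar = 4.0 / bytes_per_token`, so models whose tokens encode fewer bytes than the 4-byte rule of thumb get a proportionally larger token budget. A worked example:

```python
import math

# A model averaging 2 bytes per token gets a 2x budget, so a byte-oriented
# max_tokens limit yields roughly the same amount of generated text.
BYTES_PER_TOKEN = 4.0  # class default (rule of thumb)
bytes_per_token = 2.0  # constructor override for a non-standard model
scalar = BYTES_PER_TOKEN / bytes_per_token  # 2.0
max_tokens = 100
scaled_max_tokens = math.ceil(max_tokens * scalar)
assert scaled_max_tokens == 200
```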

eval_framework-0.2.3/src/eval_framework/llm/base.py

@@ -0,0 +1,180 @@
+ from abc import ABC, abstractmethod
+ from collections.abc import Sequence
+ from pathlib import Path
+ from typing import Any
+
+ from eval_framework.shared.types import RawCompletion, RawLoglikelihood
+ from eval_framework.tasks.base import Sample
+ from template_formatting.formatter import BaseFormatter, Message
+
+
+ class BaseLLM(ABC):
+     @property
+     def name(self) -> str:
+         """
+         This property is used to name the results folder and identify the eval results.
+         Overwrite this property in the subclass with e.g. the checkpoint name/huggingface model name."""
+         return self.__class__.__name__
+
+     @abstractmethod
+     def generate_from_messages(
+         self,
+         messages: list[Sequence[Message]],
+         stop_sequences: list[str] | None = None,
+         max_tokens: int | None = None,
+         temperature: float | None = None,
+     ) -> list[RawCompletion]:
+         """
+         stop_sequences and max_tokens are injected by the task if exist. They should be overwritten or
+         extended with the properties of the model. This includes but is not limited to the stop tokens
+         by the evaluated checkpoint (e.g. <|eot_id|> for an instruction finetuned Llama3.1, <|endoftext|>
+         for a pretrained Llama3.1).
+
+         This function is expected to raise errors which are caught and reported when running the eval.
+         Please also make sure to raise an error in case of sequence length issues. We expect to always
+         raise an error if something impedes the expected completion of a task.
+
+         Important! The completion is expected to be detokenized and to NOT contain special tokens.
+
+         Returns: List[RawCompletion]
+         """
+         raise NotImplementedError
+
+     def generate_from_samples(
+         self,
+         samples: list[Sample],
+         stop_sequences: list[str] | None = None,
+         max_tokens: int | None = None,
+         temperature: float | None = None,
+     ) -> list[RawCompletion]:
+         """
+         stop_sequences and max_tokens are injected by the task if exist. They should be overwritten or
+         extended with the properties of the model. This includes but is not limited to the stop tokens
+         by the evaluated checkpoint (e.g. <|eot_id|> for an instruction finetuned Llama3.1, <|endoftext|>
+         for a pretrained Llama3.1).
+
+         This function is expected to raise errors which are caught and reported when running the eval.
+         Please also make sure to raise an error in case of sequence length issues. We expect to always
+         raise an error if something impedes the expected completion of a task.
+
+         Important! The completion is expected to be detokenized and to NOT contain special tokens.
+
+         Returns: List[RawCompletion]
+         """
+         raise NotImplementedError
+
+     @abstractmethod
+     def logprobs(self, samples: list[Sample]) -> list[RawLoglikelihood]:
+         """
+         This function is expected to raise errors which are caught and reported when running the eval.
+         Please also make sure to raise an error in case of sequence length issues. We expect to always
+         raise an error if something prevents the expected completion of a task.
+         """
+         raise NotImplementedError
+
+     def generate(
+         self,
+         samples: list[Sample],
+         stop_sequences: list[str] | None = None,
+         max_tokens: int | None = None,
+         temperature: float | None = None,
+     ) -> list[RawCompletion]:
+         """Generates a model response for each sample.
+
+         Uses 'generate_from_samples' to generate responses if implemented,
+         otherwise falls back to 'generate_from_messages'.
+         """
+         try:
+             return self.generate_from_samples(samples, stop_sequences, max_tokens, temperature)
+         except NotImplementedError:
+             messages: list[Sequence[Message]] = [sample.messages for sample in samples]
+             return self.generate_from_messages(messages, stop_sequences, max_tokens, temperature)
+
+     def post_process_completion(self, completion: str, sample: Sample) -> str:
+         """
+         Model-specific post-processing of generated completions.
+
+         Override this method to apply model-specific cleanup or transformations
+         (e.g., removing specific artifacts such as reasoning traces, handling special tokens).
+
+         Args:
+             completion: The raw completion string from the model
+             sample: The sample that was used to generate the completion
+
+         Returns:
+             The post-processed completion string
+         """
+         return completion
+
+     def __del__(self) -> None:
+         """
+         Method for custom resource cleanup (particularly GPUs)
+         """
+         pass
+
+     def _get_final_checkpoint(
+         self, checkpoint_path: str | Path | None = None, model_name: str | None = None, artifact_name: str | None = None
+     ) -> tuple[str | Path | None, str | None]:
+         if (num_provided := sum(x is not None for x in [checkpoint_path, model_name, artifact_name])) == 0:
+             if not getattr(self, "LLM_NAME", ""):
+                 raise ValueError("Either LLM_NAME, checkpoint_path, model_name, or artifact_name must be provided.")
+             return None, None  # no argument given, so will use the LLM_NAME of the class
+         elif num_provided > 1:
+             raise ValueError("At most one of `checkpoint_path`, `model_name`, or `artifact_name` must be provided.")
+
+         elif checkpoint_path is not None:
+             return checkpoint_path, str(checkpoint_path)
+
+         elif model_name is not None:
+             return model_name, model_name
+
+         else:
+             from eval_framework.utils.file_ops import WandbFs
+
+             assert artifact_name is not None
+             artifact_base, version = artifact_name.split(":", 1) if ":" in artifact_name else (artifact_name, "latest")
+             with WandbFs() as wandb_fs:
+                 self.artifact = wandb_fs.get_artifact(artifact_base, version)  # self.artifact being read in main()
+                 wandb_fs.download_artifact(self.artifact)
+                 file_root = wandb_fs.find_hf_checkpoint_root_from_path_list()
+             if file_root is None:
+                 raise ValueError(f"Could not find HuggingFace checkpoint in artifact {artifact_base}:{version}")
+             return file_root, artifact_name
+
+     def _get_final_formatter(
+         self,
+         formatter: BaseFormatter | None = None,
+         formatter_name: str | None = None,
+         formatter_kwargs: dict[str, Any] | None = None,
+     ) -> BaseFormatter | None:
+         if (num_provided := sum(x is not None for x in [formatter, formatter_name])) == 0:
+             return None  # none given, so will use the default of the class
+         elif num_provided > 1:
+             raise ValueError("At most one of `formatter` or `formatter_name` must be provided.")
+
+         if formatter:
+             if formatter_kwargs:
+                 raise ValueError("Cannot provide `formatter_kwargs` when `formatter` is provided.")
+             return formatter
+         elif formatter_name:
+             kwargs = formatter_kwargs or {}
+             match formatter_name:
+                 case "Llama3Formatter":
+                     from template_formatting.formatter import Llama3Formatter
+
+                     return Llama3Formatter()
+                 case "MistralFormatter" | "MagistralFormatter":
+                     from eval_framework.llm.mistral import MagistralFormatter
+
+                     return MagistralFormatter(**kwargs)
+                 case "ConcatFormatter":
+                     from template_formatting.formatter import ConcatFormatter
+
+                     return ConcatFormatter()
+                 case "HFFormatter":
+                     from template_formatting.formatter import HFFormatter
+
+                     return HFFormatter(**kwargs)
+                 case _:
+                     raise ValueError(f"Unsupported formatter: {formatter_name}.")
+         return None
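
A minimal sketch of a concrete subclass against the interface above; the backend calls are left unimplemented because `RawCompletion`/`RawLoglikelihood` construction is backend-specific, and the `</think>` delimiter in the override is purely illustrative:

```python
from eval_framework.llm.base import BaseLLM


class MyLLM(BaseLLM):
    def generate_from_messages(self, messages, stop_sequences=None, max_tokens=None, temperature=None):
        raise NotImplementedError  # call your inference backend and wrap results in RawCompletion

    def logprobs(self, samples):
        raise NotImplementedError  # score each choice and wrap results in RawLoglikelihood

    def post_process_completion(self, completion: str, sample) -> str:
        # model-specific cleanup, e.g. dropping a reasoning trace before grading
        return completion.split("</think>")[-1].strip()
```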