PyPI - eval-framework - Versions diffs - 0.2.0__tar.gz → 0.2.2__tar.gz - Mend

eval-framework 0.2.0__tar.gz → 0.2.2__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (165)

{eval_framework-0.2.0 → eval_framework-0.2.2}/LICENSE RENAMED

@@ -186,7 +186,7 @@
       same "printed page" as the copyright notice for easier
       identification within third-party archives.
-   Copyright [yyyy] [name of copyright owner]
+   Copyright 2025 Aleph Alpha Research GmbH
    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.

{eval_framework-0.2.0 → eval_framework-0.2.2}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.3
 Name: eval-framework
-Version: 0.2.0
+Version: 0.2.2
 Summary: Evalulation Framework
 Author: Aleph Alpha Research
 License:                                  Apache License
@@ -191,7 +191,7 @@ License:                                  Apache License
                same "printed page" as the copyright notice for easier
                identification within third-party archives.
-            Copyright [yyyy] [name of copyright owner]
+            Copyright 2025 Aleph Alpha Research GmbH
             Licensed under the Apache License, Version 2.0 (the "License");
             you may not use this file except in compliance with the License.
@@ -218,8 +218,6 @@ Requires-Dist: datasets>=2.19.1,<4
 Requires-Dist: sacrebleu>=2.4.3,<3
 Requires-Dist: pycountry>=24.6.1,<25
 Requires-Dist: nltk>=3.9.1,<4
-Requires-Dist: types-pyyaml>=6.0.12.20240917,<7
-Requires-Dist: psutil>=6.1,<7
 Requires-Dist: python-dotenv>=1.0.1,<2
 Requires-Dist: lingua-language-detector>=2.0.2,<3
 Requires-Dist: google-crc32c>=1.5.0,<2
@@ -235,7 +233,6 @@ Requires-Dist: jsonlines>=4,<5
 Requires-Dist: lxml>=6,<7
 Requires-Dist: python-iso639>=2025.2.18
 Requires-Dist: wandb>=0.21.1,<1
-Requires-Dist: torch
 Requires-Dist: accelerate ; extra == 'accelerate'
 Requires-Dist: eval-framework[determined,api,openai,transformers,accelerate,vllm,comet,optional,mistral] ; extra == 'all'
 Requires-Dist: aleph-alpha-client>=10,<11 ; extra == 'api'
@@ -270,21 +267,73 @@ Description-Content-Type: text/markdown
 # Aleph Alpha Eval-Framework
 > **Comprehensive LLM evaluation at scale** - A production-ready framework for evaluating large language models across 90+ benchmarks.
+![eval-framework](docs/eval-framework.png "https://github.com/Aleph-Alpha-Research/eval-framework/blob/main/docs/eval-framework.png")
-## Features
+## Why Choose This Framework?
+- **Scalability**: Built for distributed evaluation. Currently providing an integration with Determined AI.
+- **Extensibility**: Easily add custom models, benchmarks, and metrics with object-oriented base classes.
+- **Comprehensive**: Comes pre-loaded with over 90 tasks covering a broad and diverse range, from reasoning and coding to safety and long-context. Also comes with a comprehensive set of metrics, including LLM-as-a-judge evaluations.
+## Other features
-- 90+ Benchmarks: Covers reasoning, knowledge, coding, long-context, and safety tasks.
-- Custom Benchmarks: Easily add new benchmarks with minimal code using the BaseTask class.
-- Distributed Evaluation: Integration with Determined AI for scalable distributed evaluation.
-- Docker Support: Pre-configured Dockerfiles for local and distributed setups.
 - Flexible Model Integration: Supports models loaded via HuggingFace Transformers or custom implementations using the BaseLLM class.
+- Custom Benchmarks: Easily add new benchmarks with minimal code using the BaseTask class.
 - Custom Metrics: Easily define new metrics using the BaseMetric class.
-- Rich Outputs: Generates JSON results, plots, and detailed analysis reports.
 - Perturbation Testing: Robustness analysis with configurable perturbation types and probabilities.
+- Rich Outputs: Generates JSON results, plots, and detailed analysis reports.
 - Statistical Analysis: Includes confidence intervals and significance testing for reliable comparisons.
-- LLM-as-a-Judge: Evaluation using LLM judges.
+- Docker Support: Pre-configured Dockerfiles for local and distributed setups.
+## Quick Start
+The codebase is tested and compatible with Python 3.12 and PyTorch 2.5.
+You will also need the appropriate CUDA dependencies and version installed on your system for GPU support. Detailed installation instructions can be found [here](docs/installation.md).
+The easiest way to get started is by installing the library via `pip` and use it as an external dependency.
+```
+pip install eval_framework
+```
+There are optional extras available to unlock specific features of the library:
+- `api` for inference using the aleph-alpha client.
+- `comet` for the COMET metric.
+- `determined` for running jobs via determined.
+- `mistral` for inference on Mistral models.
+- `transformers` for inference using the transformers library.
+- `vllm` for inference via VLLM.
+As a short hand, the `all` extra installs all of the above.
+For development, you can instead install it directly from the repository. Please first install
+ [uv](https://docs.astral.sh/uv/getting-started/installation/)
-![eval-framework](docs/eval-framework.png "eval-framework")
+To install the project with all optional extras use
+```bash
+uv sync --all-extras
+```
+We provide custom groups to control optional extras.
+- `flash_attn`: Install `flash_attn` with correct handling of build isolation
+Thus, the following will setup the project with `flash_attn`
+```bash
+uv sync --all-extras --group flash_attn
+```
+To evaluate a single benchmark locally, you can use the following command:
+```bash
+eval_framework \
+    --models src/eval_framework/llm/models.py \
+    --llm-name Smollm135MInstruct \
+    --task-name "GSM8K" \
+    --output-dir ./eval \
+    --num-fewshot 5 \
+    --num-samples 10
+```
+For more detailed CLI usage instructions, see the [CLI Usage Guide](docs/cli_usage.md).
 ## Benchmark Coverage & Task Categories
@@ -336,51 +385,6 @@ Evaluation metrics include:
 For the full list of tasks and metrics, see [Detailed Task Table](docs/benchmarks_and_metrics.md).
-## Quick Start
-The codebase is tested and compatible with Python 3.12 and PyTorch 2.5.
-You will also need the appropriate CUDA dependencies and version installed on your system for GPU support.
-The easiest way to get started is by installing the library via `pip` and use it as an external dependency.
-```
-pip install eval_framework
-```
-There are optional extras available to unlock specific features of the library:
-- `mistral` for inference on Mistral models
-- `transformers` for inference using the transformers library
-- `api` for inference using the aleph-alpha client.
-- `vllm` for inference via VLLM
-- `determined` for running jobs via determined
-- `comet` for the COMET metric
-As a short hand, the `all` extra installs all of the above.
-For development, you can instead install it directly from the repository instead, please first install
- [uv](https://docs.astral.sh/uv/getting-started/installation/)
-To install the project with all optional extras use
-```bash
-uv sync --all-extras
-```
-We provide custom groups to control optional extras.
-- `cpu`: Use the CPU backend for torch
-- `cu124`: Use the CUDA 12.4 backend
-- `flash_attn`: Install `flash_attn` with correct handling of build isolation
-Thus, the following will setup the project with `flash_attn` and CUDA 12.4
-```bash
-uv sync --all-extras --group flash_attn --group cu124
-```
-There is also a pre-commit hook to help with development:
-```
-uv run pre-commit install
-```
-After installation, task documentation can be generated with `uv run python src/eval_framework/utils/generate_task_docs.py` (see [docs/installation.md(docs/installation.md)) for more details.
 ## Getting Started
 ### Understanding the Evaluation Framework
@@ -449,22 +453,7 @@ pip install eval_framework[transformers]
 - **Create custom benchmarks**: Follow our [benchmark creation guide](docs/add_new_benchmark_guide.md)
 - **Scale your evaluations**: Use [Determined AI integration](docs/using_determined.md) for distributed evaluation
 - **Understand your results**: Read our [results interpretation guide](docs/understanding_results_guide.md)
-### Example CLI Usage
-To evaluate a single benchmark locally, you can use the following command:
-```bash
-eval_framework \
-    --models src/eval_framework/llm/models.py \
-    --llm-name Smollm135MInstruct \
-    --task-name "GSM8K" \
-    --output-dir ./eval \
-    --num-fewshot 5 \
-    --num-samples 10
-```
-For more detailed CLI usage instructions, see the [CLI Usage Guide](docs/cli_usage.md).
+- **Log results in WandB**: See how [we integrate WandB](docs/wandb_integration.md) for metric and lineage tracking
 ## Documentation
@@ -485,6 +474,10 @@ For more detailed CLI usage instructions, see the [CLI Usage Guide](docs/cli_usa
 - **[Using Determined](docs/using_determined.md)** - Guide for distributed evaluation using Determined AI
 - **[Controlling Upload Results](docs/controlling_upload_results.md)** - How to manage and control the upload of evaluation results
+### Contributing
+- **[Contributing Guide](CONTRIBUTING.md)** - Guide for contributing to this project
 ### Citation
 If you use `eval-framework` in your research, please cite:
@@ -509,6 +502,6 @@ This project has received funding from the European Union’s Digital Europe Pro
 The contents of this publication are the sole responsibility of the OpenEuroLLM consortium and do not necessarily reflect the opinion of the European Union.
 <p align="center">
-  <img src="docs/OELLM_1.png" alt="OELLM 1" width="100" style="margin-right: 50px;"/>
-  <img src="docs/OELLM_2.png" alt="OELLM 2" width="350"/>
+  <img src="docs/OELLM_1.png" alt="https://github.com/Aleph-Alpha-Research/eval-framework/raw/main/docs/OELLM_1.png" width="100" style="margin-right: 50px;"/>
+  <img src="docs/OELLM_2.png" alt="https://github.com/Aleph-Alpha-Research/eval-framework/raw/main/docs/OELLM_2.png" width="350"/>
 </p>

{eval_framework-0.2.0 → eval_framework-0.2.2}/README.md RENAMED

@@ -1,21 +1,73 @@
 # Aleph Alpha Eval-Framework
 > **Comprehensive LLM evaluation at scale** - A production-ready framework for evaluating large language models across 90+ benchmarks.
+![eval-framework](docs/eval-framework.png "https://github.com/Aleph-Alpha-Research/eval-framework/blob/main/docs/eval-framework.png")
-## Features
+## Why Choose This Framework?
+- **Scalability**: Built for distributed evaluation. Currently providing an integration with Determined AI.
+- **Extensibility**: Easily add custom models, benchmarks, and metrics with object-oriented base classes.
+- **Comprehensive**: Comes pre-loaded with over 90 tasks covering a broad and diverse range, from reasoning and coding to safety and long-context. Also comes with a comprehensive set of metrics, including LLM-as-a-judge evaluations.
+## Other features
-- 90+ Benchmarks: Covers reasoning, knowledge, coding, long-context, and safety tasks.
-- Custom Benchmarks: Easily add new benchmarks with minimal code using the BaseTask class.
-- Distributed Evaluation: Integration with Determined AI for scalable distributed evaluation.
-- Docker Support: Pre-configured Dockerfiles for local and distributed setups.
 - Flexible Model Integration: Supports models loaded via HuggingFace Transformers or custom implementations using the BaseLLM class.
+- Custom Benchmarks: Easily add new benchmarks with minimal code using the BaseTask class.
 - Custom Metrics: Easily define new metrics using the BaseMetric class.
-- Rich Outputs: Generates JSON results, plots, and detailed analysis reports.
 - Perturbation Testing: Robustness analysis with configurable perturbation types and probabilities.
+- Rich Outputs: Generates JSON results, plots, and detailed analysis reports.
 - Statistical Analysis: Includes confidence intervals and significance testing for reliable comparisons.
-- LLM-as-a-Judge: Evaluation using LLM judges.
+- Docker Support: Pre-configured Dockerfiles for local and distributed setups.
+## Quick Start
+The codebase is tested and compatible with Python 3.12 and PyTorch 2.5.
+You will also need the appropriate CUDA dependencies and version installed on your system for GPU support. Detailed installation instructions can be found [here](docs/installation.md).
+The easiest way to get started is by installing the library via `pip` and use it as an external dependency.
+```
+pip install eval_framework
+```
+There are optional extras available to unlock specific features of the library:
+- `api` for inference using the aleph-alpha client.
+- `comet` for the COMET metric.
+- `determined` for running jobs via determined.
+- `mistral` for inference on Mistral models.
+- `transformers` for inference using the transformers library.
+- `vllm` for inference via VLLM.
+As a short hand, the `all` extra installs all of the above.
+For development, you can instead install it directly from the repository. Please first install
+ [uv](https://docs.astral.sh/uv/getting-started/installation/)
-![eval-framework](docs/eval-framework.png "eval-framework")
+To install the project with all optional extras use
+```bash
+uv sync --all-extras
+```
+We provide custom groups to control optional extras.
+- `flash_attn`: Install `flash_attn` with correct handling of build isolation
+Thus, the following will setup the project with `flash_attn`
+```bash
+uv sync --all-extras --group flash_attn
+```
+To evaluate a single benchmark locally, you can use the following command:
+```bash
+eval_framework \
+    --models src/eval_framework/llm/models.py \
+    --llm-name Smollm135MInstruct \
+    --task-name "GSM8K" \
+    --output-dir ./eval \
+    --num-fewshot 5 \
+    --num-samples 10
+```
+For more detailed CLI usage instructions, see the [CLI Usage Guide](docs/cli_usage.md).
 ## Benchmark Coverage & Task Categories
@@ -67,51 +119,6 @@ Evaluation metrics include:
 For the full list of tasks and metrics, see [Detailed Task Table](docs/benchmarks_and_metrics.md).
-## Quick Start
-The codebase is tested and compatible with Python 3.12 and PyTorch 2.5.
-You will also need the appropriate CUDA dependencies and version installed on your system for GPU support.
-The easiest way to get started is by installing the library via `pip` and use it as an external dependency.
-```
-pip install eval_framework
-```
-There are optional extras available to unlock specific features of the library:
-- `mistral` for inference on Mistral models
-- `transformers` for inference using the transformers library
-- `api` for inference using the aleph-alpha client.
-- `vllm` for inference via VLLM
-- `determined` for running jobs via determined
-- `comet` for the COMET metric
-As a short hand, the `all` extra installs all of the above.
-For development, you can instead install it directly from the repository instead, please first install
- [uv](https://docs.astral.sh/uv/getting-started/installation/)
-To install the project with all optional extras use
-```bash
-uv sync --all-extras
-```
-We provide custom groups to control optional extras.
-- `cpu`: Use the CPU backend for torch
-- `cu124`: Use the CUDA 12.4 backend
-- `flash_attn`: Install `flash_attn` with correct handling of build isolation
-Thus, the following will setup the project with `flash_attn` and CUDA 12.4
-```bash
-uv sync --all-extras --group flash_attn --group cu124
-```
-There is also a pre-commit hook to help with development:
-```
-uv run pre-commit install
-```
-After installation, task documentation can be generated with `uv run python src/eval_framework/utils/generate_task_docs.py` (see [docs/installation.md(docs/installation.md)) for more details.
 ## Getting Started
 ### Understanding the Evaluation Framework
@@ -180,22 +187,7 @@ pip install eval_framework[transformers]
 - **Create custom benchmarks**: Follow our [benchmark creation guide](docs/add_new_benchmark_guide.md)
 - **Scale your evaluations**: Use [Determined AI integration](docs/using_determined.md) for distributed evaluation
 - **Understand your results**: Read our [results interpretation guide](docs/understanding_results_guide.md)
-### Example CLI Usage
-To evaluate a single benchmark locally, you can use the following command:
-```bash
-eval_framework \
-    --models src/eval_framework/llm/models.py \
-    --llm-name Smollm135MInstruct \
-    --task-name "GSM8K" \
-    --output-dir ./eval \
-    --num-fewshot 5 \
-    --num-samples 10
-```
-For more detailed CLI usage instructions, see the [CLI Usage Guide](docs/cli_usage.md).
+- **Log results in WandB**: See how [we integrate WandB](docs/wandb_integration.md) for metric and lineage tracking
 ## Documentation
@@ -216,6 +208,10 @@ For more detailed CLI usage instructions, see the [CLI Usage Guide](docs/cli_usa
 - **[Using Determined](docs/using_determined.md)** - Guide for distributed evaluation using Determined AI
 - **[Controlling Upload Results](docs/controlling_upload_results.md)** - How to manage and control the upload of evaluation results
+### Contributing
+- **[Contributing Guide](CONTRIBUTING.md)** - Guide for contributing to this project
 ### Citation
 If you use `eval-framework` in your research, please cite:
@@ -240,6 +236,6 @@ This project has received funding from the European Union’s Digital Europe Pro
 The contents of this publication are the sole responsibility of the OpenEuroLLM consortium and do not necessarily reflect the opinion of the European Union.
 <p align="center">
-  <img src="docs/OELLM_1.png" alt="OELLM 1" width="100" style="margin-right: 50px;"/>
-  <img src="docs/OELLM_2.png" alt="OELLM 2" width="350"/>
+  <img src="docs/OELLM_1.png" alt="https://github.com/Aleph-Alpha-Research/eval-framework/raw/main/docs/OELLM_1.png" width="100" style="margin-right: 50px;"/>
+  <img src="docs/OELLM_2.png" alt="https://github.com/Aleph-Alpha-Research/eval-framework/raw/main/docs/OELLM_2.png" width="350"/>
 </p>

{eval_framework-0.2.0 → eval_framework-0.2.2}/pyproject.toml RENAMED

@@ -1,6 +1,6 @@
 [project]
 name = "eval-framework"
-version = "0.2.0"
+version = "0.2.2"
 description = "Evalulation Framework"
 readme = "README.md"
 license = { file = "LICENSE" }
@@ -25,25 +25,21 @@ dependencies = [
   "sacrebleu>=2.4.3,<3",
   "pycountry>=24.6.1,<25",
   "nltk>=3.9.1,<4",
-  "types-pyyaml>=6.0.12.20240917,<7",
-  "psutil>=6.1,<7",
   "python-dotenv>=1.0.1,<2",
   "lingua-language-detector>=2.0.2,<3",
   "google-crc32c>=1.5.0,<2",
-  "kubernetes>=31.0.0,<32",  # required by llm-sandbox though actually not needed
-  "langdetect>=1.0.9,<2",  # required by the original ifeval implementation
+  "kubernetes>=31.0.0,<32", # required by llm-sandbox though actually not needed
+  "langdetect>=1.0.9,<2", # required by the original ifeval implementation
   "spacy>=3.8.3,<4",
   "jsonschema>=4.23.0,<5",
-  "mysql-connector-python>=9.0.0,<10",  # required for sql-related tasks
-  "psycopg2-binary>=2.9.9,<3",  # required for sql-related tasks
+  "mysql-connector-python>=9.0.0,<10", # required for sql-related tasks
+  "psycopg2-binary>=2.9.9,<3", # required for sql-related tasks
   "sympy>=1.13.1,<2",
   "llm-sandbox[docker]>=0.1.8,<0.2",
   "jsonlines>=4,<5",
   "lxml>=6,<7",
   "python-iso639>=2025.2.18",
   "wandb>=0.21.1,<1",
-  # Needed for uv bug: https://github.com/astral-sh/uv/issues/15661
-  "torch",
 ]
 [project.optional-dependencies]
@@ -99,9 +95,10 @@ dev = [
   "plotly>=5.24.1,<6",
   "ruff>=0.12.8",
 ]
-flash-attn = ["flash-attn>=2.7.2.post1,<2.8"]
-cu124 = ["torch"]
-cpu = ["torch"]
+flash-attn = [
+  "flash-attn>=2.7.2.post1,<2.8",
+  "torch"
+]
 [build-system]
 requires = ["uv_build>=0.8.10,<0.9.0"]
@@ -114,17 +111,11 @@ module-name = ["eval_framework", "template_formatting"]
 override-dependencies = [
   "requests>=2.32,<3",  # fix for determined
 ]
-conflicts = [
-  [
-    { group = "cpu" },
-    { group = "cu124" },
-  ],
-]
 [tool.uv.sources]
 torch = [
-  { index = "pytorch-cu124", group = "cu124"},
-  { index = "pytorch-cpu", group = "cpu"},
+  { index = "pytorch-default", marker = "sys_platform != 'linux'" },
+  { index = "pytorch-cu124", marker = "sys_platform == 'linux'" },
 ]
 [[tool.uv.index]]
@@ -133,8 +124,8 @@ url = "https://download.pytorch.org/whl/cu124"
 explicit = true
 [[tool.uv.index]]
-name = "pytorch-cpu"
-url = "https://download.pytorch.org/whl/cpu"
+name = "pytorch-default"
+url = "https://pypi.org/simple"
 explicit = true
 [tool.uv.extra-build-dependencies]
@@ -152,6 +143,12 @@ select = [
     "UP", # Auto-upgrading of new Python features
     "I", # Sort imports
 ]
+[tool.ruff.lint.isort]
+# https://github.com/astral-sh/ruff-pre-commit/issues/121
+# https://github.com/astral-sh/ruff/issues/10519
+# wandb creates a folder called 'wandb' during local runs (not logged in)
+# this needs to be added to prevent isort from incorrectly sorting
+known-third-party = ["wandb"]
 [tool.ruff.lint.extend-per-file-ignores]
 "__init__.py" = ["F401"]

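Note on the `[tool.uv.sources]` hunk above: the torch index is now selected by environment marker rather than by dependency group. As a rough illustration only (uv evaluates PEP 508 markers itself, and `select_torch_index` is a hypothetical helper that is not part of uv or of this package), the two markers partition platforms as follows:

```python
import sys


def select_torch_index(sys_platform: str) -> str:
    """Mirror the two markers from the diff: linux vs. everything else."""
    if sys_platform == "linux":
        return "pytorch-cu124"
    return "pytorch-default"


# The PEP 508 marker variable sys_platform corresponds to sys.platform.
current_index = select_torch_index(sys.platform)
```

Because the two markers are mutually exclusive, the `conflicts` table removed in the same hunk is presumably no longer needed; the diff itself does not state this.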
{eval_framework-0.2.0 → eval_framework-0.2.2}/src/eval_framework/context/determined.py RENAMED

@@ -8,7 +8,8 @@ from determined.core._context import init as determined_core_init
 from determined.core._distributed import DummyDistributedContext
 from pydantic import AfterValidator, BaseModel, ConfigDict
-from eval_framework.context.eval import EvalContext, import_models
+from eval_framework.context.eval import EvalContext
+from eval_framework.context.local import _load_model
 from eval_framework.llm.base import BaseLLM
 from eval_framework.tasks.eval_config import EvalConfig
 from eval_framework.tasks.perturbation import PerturbationConfig
@@ -111,18 +112,16 @@ class DeterminedContext(EvalContext):
             if val_cli and val_hparams and val_cli != val_hparams:
                 logger.info(f"CLI argument {name} ({val_cli}) is being overridden by hyperparameters: ({val_hparams}).")
-        models = import_models(self.models_path)
-        if self.hparams.llm_name not in models:
-            raise ValueError(f"LLM '{self.hparams.llm_name}' not found.")
-        llm_class = models[self.hparams.llm_name]
-        llm_judge_class: type[BaseLLM] | None = None
+        # Hyperparameters take precedence over core context
+        llm_name = self.hparams.llm_name or self.llm_name
         judge_model_name = self.hparams.task_args.judge_model_name or self.judge_model_name
-        if self.judge_models_path is not None and judge_model_name is not None:
-            judge_models = import_models(self.judge_models_path)
-            if judge_model_name not in judge_models:
-                raise ValueError(f"LLM judge '{judge_model_name}' not found.")
-            llm_judge_class = judge_models[judge_model_name]
+        llm_class = _load_model(llm_name, models_path=self.models_path)
+        llm_judge_class: type[BaseLLM] | None = (
+            _load_model(judge_model_name, models_path=self.judge_models_path, info="judge")
+            if judge_model_name
+            else None
+        )
         # for all optional hyperparameters, resort to the respective CLI argument if the hyperparameter is not set
         self.config = EvalConfig(

{eval_framework-0.2.0 → eval_framework-0.2.2}/src/eval_framework/context/eval.py RENAMED

@@ -2,6 +2,7 @@ import importlib.util
 import inspect
 import sys
 from contextlib import AbstractContextManager
+from os import PathLike
 from pathlib import Path
 from typing import Any
@@ -11,7 +12,7 @@ from eval_framework.tasks.eval_config import EvalConfig
 from eval_framework.tasks.perturbation import PerturbationConfig
-def import_models(models_file: Path | str) -> dict[str, type[BaseLLM]]:
+def import_models(models_file: PathLike | str) -> dict[str, type[BaseLLM]]:
     models_file = Path(models_file).resolve()
     library_path = Path(eval_framework.__path__[0]).resolve()
@@ -86,10 +87,10 @@ class EvalContext(AbstractContextManager):
         self.wandb_run_id = wandb_run_id
         self.hf_upload_dir = hf_upload_dir
         self.hf_upload_repo = hf_upload_repo
-        self.llm_args = llm_args
+        self.llm_args = llm_args if llm_args is not None else {}
         self.judge_models_path = judge_models_path
         self.judge_model_name = judge_model_name
-        self.judge_model_args = judge_model_args
+        self.judge_model_args = judge_model_args if judge_model_args is not None else {}
         self.batch_size = batch_size
         self.description = description

eval_framework-0.2.2/src/eval_framework/context/local.py ADDED

@@ -0,0 +1,75 @@
+import importlib
+from os import PathLike
+from typing import Any
+from eval_framework.context.eval import EvalContext, import_models
+from eval_framework.llm.base import BaseLLM
+from eval_framework.tasks.eval_config import EvalConfig
+def _load_model(llm_name: str, models_path: str | PathLike | None, *, info: str = "") -> type[BaseLLM]:
+    """Load a model class either from a models file or as a fully qualified module path.
+    Args:
+        llm_name: The name of the model class to load, or a fully qualified module path.
+        models_path: The path to a Python file containing model class definitions
+        info: Additional info to include in error messages.
+    Returns:
+        The model class.
+    """
+    if models_path is None or "." in llm_name:
+        # The llm_name must a a fully qualified module path
+        if "." not in llm_name:
+            raise ValueError(f"LLM {info} '{llm_name}' is not a fully qualified module path.")
+        module_path, llm_class_name = llm_name.rsplit(".", 1)
+        module = importlib.import_module(module_path)
+        if not hasattr(module, llm_class_name):
+            raise ValueError(f"LLM '{llm_class_name}' not found in module '{module_path}'.")
+        return getattr(module, llm_class_name)
+    else:
+        models_dict = import_models(models_path)
+        if llm_name not in models_dict:
+            if info:
+                info = f"{info.strip()} "
+            raise ValueError(f"LLM {info} '{llm_name}' not found in {models_path}.")
+        return models_dict[llm_name]
+class LocalContext(EvalContext):
+    def __enter__(self) -> "LocalContext":
+        llm_class = _load_model(self.llm_name, models_path=self.models_path)
+        self.llm_judge_class: type[BaseLLM] | None = None
+        if self.judge_model_name is not None:
+            self.llm_judge_class = _load_model(self.judge_model_name, models_path=self.judge_models_path, info="judge")
+        self.config = EvalConfig(
+            llm_class=llm_class,
+            llm_args=self.llm_args,
+            num_samples=self.num_samples,
+            max_tokens=self.max_tokens,
+            num_fewshot=self.num_fewshot,
+            perturbation_config=self.perturbation_config,
+            task_name=self.task_name,
+            task_subjects=self.task_subjects,
+            hf_revision=self.hf_revision,
+            output_dir=self.output_dir,
+            hf_upload_dir=self.hf_upload_dir,
+            hf_upload_repo=self.hf_upload_repo,
+            wandb_entity=self.wandb_entity,
+            wandb_project=self.wandb_project,
+            wandb_run_id=self.wandb_run_id,
+            llm_judge_class=self.llm_judge_class,
+            judge_model_args=self.judge_model_args,
+            batch_size=self.batch_size,
+            description=self.description,
+        )
+        return self
+    def __exit__(
+        self,
+        exc_type: type[BaseException] | None,
+        exc_value: BaseException | None,
+        traceback: Any | None,
+    ) -> None:
+        pass
