eval-framework 0.2.6__tar.gz → 0.2.8__tar.gz
This diff compares publicly available package versions as released to a supported registry. It is provided for informational purposes only and reflects the packages as they appear in the public registry.
- {eval_framework-0.2.6 → eval_framework-0.2.8}/PKG-INFO +47 -29
- {eval_framework-0.2.6 → eval_framework-0.2.8}/README.md +45 -28
- {eval_framework-0.2.6 → eval_framework-0.2.8}/pyproject.toml +4 -1
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/context/determined.py +1 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/context/eval.py +2 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/context/local.py +1 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/evaluation_generator.py +4 -1
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/llm/aleph_alpha.py +10 -6
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/math_reasoning_completion.py +10 -9
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/base.py +2 -1
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/graders/comparison_grader.py +56 -4
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/llm_judge_mtbench_pair.py +110 -25
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/llm_judge_mtbench_single.py +9 -0
- eval_framework-0.2.8/src/eval_framework/metrics/llm/utils.py +20 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/run.py +6 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/eval_config.py +1 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/utils/generate_task_docs.py +24 -6
- {eval_framework-0.2.6 → eval_framework-0.2.8}/LICENSE +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/__init__.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/base_config.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/context/__init__.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/exceptions.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/external/ifeval_impl/README.md +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/external/ifeval_impl/instructions.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/external/ifeval_impl/instructions_registry.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/external/ifeval_impl/instructions_util.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/external/ifeval_impl/utils.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/llm/__init__.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/llm/base.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/llm/huggingface.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/llm/mistral.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/llm/models.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/llm/openai.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/llm/vllm.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/logger.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/main.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/__init__.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/base.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/__init__.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/accuracy_completion.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/aidanbench.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/bleu.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/chrf.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/code_assertion.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/code_execution_pass_at_one.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/comet.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/concordance_index.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/csv_format.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/cwe_accuracy.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/exponential_similarity.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/f1.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/format_checker.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/grid_difference.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/ifeval.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/json_format.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/language_checker.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/length_control.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/niah_accuracy.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/placeholder_checker.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/repetition.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/rouge_1.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/rouge_2.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/rouge_geometric_mean.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/rouge_l.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/struct_eval_metrics.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/ter.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/text_counter.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/efficiency/__init__.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/efficiency/bytes_per_sequence_position.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/__init__.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/graders/chatbot_style_grader.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/graders/coherence_grader.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/graders/conciseness_grader.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/graders/contains_names_grader.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/graders/format_correctness_grader.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/graders/instruction_grader.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/graders/language.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/graders/long_context_grader.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/graders/models.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/graders/refusal_grader.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/graders/sql_quality_grader.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/graders/summary_world_knowledge_grader.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/llm_judge_chatbot_style.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/llm_judge_coherence.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/llm_judge_completion_accuracy.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/llm_judge_conciseness.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/llm_judge_contains_names.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/llm_judge_format_correctness.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/llm_judge_instruction.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/llm_judge_refusal.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/llm_judge_sql.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/llm_judge_world_knowledge.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/loglikelihood/__init__.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/loglikelihood/accuracy_loglikelihood.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/loglikelihood/base.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/loglikelihood/confidence_weighted_accuracy.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/loglikelihood/dcs.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/loglikelihood/probability_mass.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/loglikelihood/ternary.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/py.typed +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/response_generator.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/result_processors/__init__.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/result_processors/base.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/result_processors/hf_uploader.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/result_processors/result_processor.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/result_processors/wandb_uploader.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/run_direct.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/shared/types.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/__init__.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/base.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/__init__.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/aidanbench.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/arc.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/arc_de.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/arc_fi.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/belebele.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/bigcodebench.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/casehold.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/chembench.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/copa.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/duc.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/flores200.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/flores_plus.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/gpqa.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/gsm8k.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/hellaswag.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/hellaswag_de.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/humaneval.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/ifeval.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/include.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/infinitebench.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/math_reasoning.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/mbpp.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/mmlu.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/mmlu_de.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/mmlu_pro.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/mmmlu.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/openbookqa.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/opengptx_eu20.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/pawsx.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/piqa.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/quality.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/sciq.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/sphyr.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/squad.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/struct_eval.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/tablebench.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/triviaqa.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/truthfulqa.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/winogender.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/winogrande.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/winox.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/wmt.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/benchmarks/zero_scrolls.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/perturbation.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/registry.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/task_loader.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/task_names.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/tasks/utils.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/utils/constants.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/utils/file_ops.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/utils/helpers.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/utils/logging.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/utils/packaging.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/utils/tqdm_handler.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/template_formatting/README.md +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/template_formatting/__init__.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/template_formatting/formatter.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/template_formatting/mistral_formatter.py +0 -0
- {eval_framework-0.2.6 → eval_framework-0.2.8}/src/template_formatting/py.typed +0 -0
{eval_framework-0.2.6 → eval_framework-0.2.8}/PKG-INFO

````diff
@@ -1,6 +1,6 @@
 Metadata-Version: 2.3
 Name: eval-framework
-Version: 0.2.6
+Version: 0.2.8
 Summary: Evalulation Framework
 Author: Aleph Alpha Research
 License: Apache License
@@ -235,6 +235,7 @@ Requires-Dist: python-iso639>=2025.2.18
 Requires-Dist: wandb>=0.23.0,<1
 Requires-Dist: boto3>=1.40.54,<2
 Requires-Dist: numpy>=1.26.4
+Requires-Dist: antlr4-python3-runtime==4.11.0
 Requires-Dist: accelerate ; extra == 'accelerate'
 Requires-Dist: eval-framework[determined,api,openai,transformers,accelerate,vllm,comet,optional,mistral] ; extra == 'all'
 Requires-Dist: aleph-alpha-client>=10,<11 ; extra == 'api'
@@ -268,10 +269,24 @@ Provides-Extra: transformers
 Provides-Extra: vllm
 Description-Content-Type: text/markdown
 
+<!-- Badges -->
+<div align="center">
+
 # Aleph Alpha Eval-Framework
 
-
-
+**Comprehensive LLM evaluation at scale** - A production-ready framework for evaluating large language models across 90+ benchmarks.
+
+[](https://github.com/Aleph-Alpha-Research/eval-framework/actions)
+[](https://github.com/Aleph-Alpha-Research/eval-framework/releases)
+[](https://pypi.org/project/eval-framework/)
+[](LICENSE)
+
+[](https://aleph-alpha-research.github.io/eval-framework/)
+[](https://github.com/Aleph-Alpha-Research/eval-framework/stargazers)
+
+![]()
+
+</div>
 
 ## Why Choose This Framework?
 
@@ -289,10 +304,12 @@ Description-Content-Type: text/markdown
 - Statistical Analysis: Includes confidence intervals and significance testing for reliable comparisons.
 - Docker Support: Pre-configured Dockerfiles for local and distributed setups.
 
+For full documentation, visit our [Docs Page](https://aleph-alpha-research.github.io/eval-framework/).
+
 ## Quick Start
 
 The codebase is tested and compatible with Python 3.12 and PyTorch 2.5.
-You will also need the appropriate CUDA dependencies and version installed on your system for GPU support. Detailed installation instructions can be found [here](
+You will also need the appropriate CUDA dependencies and version installed on your system for GPU support. Detailed installation instructions can be found [here](https://aleph-alpha-research.github.io/eval-framework/installation.html).
 
 The easiest way to get started is by installing the library via `pip` and use it as an external dependency.
 ```
@@ -350,7 +367,7 @@ eval_framework \
     --num-samples 10
 ```
 
-For more detailed CLI usage instructions, see the [CLI Usage Guide](
+For more detailed CLI usage instructions, see the [CLI Usage Guide](https://aleph-alpha-research.github.io/eval-framework/cli_usage.html).
 
 ## Benchmark Coverage & Task Categories
 
@@ -403,7 +420,7 @@ Evaluation metrics include:
 - **LLM Metrics:** Chatbot Style Judge, Instruction Judge
 - **Efficiency Metrics:** Bytes per Sequence Position
 
-For the full list of tasks and metrics, see [Detailed Task Table](
+For the full list of tasks and metrics, see [Detailed Task Table](https://aleph-alpha-research.github.io/eval-framework/benchmarks_and_metrics.html).
 
 ## Getting Started
 
@@ -419,9 +436,9 @@ Eval-Framework provides a unified interface for evaluating language models acros
 
 ### Core Components
 
-- **Models**: Defined via [`BaseLLM`](
-- **Tasks**: Inherit from [`BaseTask`](
-- **Metrics**: Automatic scoring via [`BaseMetric`](
+- **Models**: Defined via [`BaseLLM`](https://aleph-alpha-research.github.io/eval-framework/evaluate_huggingface_model.html) interface (HuggingFace, OpenAI, custom APIs)
+- **Tasks**: Inherit from [`BaseTask`](https://aleph-alpha-research.github.io/eval-framework/add_new_benchmark_guide.html) (completion, loglikelihood, or LLM-judge based)
+- **Metrics**: Automatic scoring via [`BaseMetric`](https://aleph-alpha-research.github.io/eval-framework/benchmarks_and_metrics.html) classes
 - **Formatters**: Handle prompt construction and model-specific formatting
 - **Results**: Structured outputs with sample-level details and aggregated statistics
 
@@ -466,41 +483,42 @@ if __name__ == "__main__":
     results = main(llm=llm, config=config)
 ```
 
-3. **Review results** - Check `./eval_results/` for detailed outputs and use our [results guide](
+3. **Review results** - Check `./eval_results/` for detailed outputs and use our [results guide](https://aleph-alpha-research.github.io/eval-framework/understanding_results_guide.html) to interpret them
 
 ### Next Steps
 
-- **Use CLI interface**: See [CLI usage guide](
-- **Evaluate HuggingFace models**: Follow our [HuggingFace evaluation guide](
-- **Understand model arguments**: Read out [Model Arguments guide](
-- **Create custom benchmarks**: Follow our [benchmark creation guide](
-- **Scale your evaluations**: Use [Determined AI integration](
-- **Understand your results**: Read our [results interpretation guide](
-- **Log results in WandB**: See how [we integrate WandB](
+- **Use CLI interface**: See [CLI usage guide](https://aleph-alpha-research.github.io/eval-framework/cli_usage.html) for command-line evaluation options
+- **Evaluate HuggingFace models**: Follow our [HuggingFace evaluation guide](https://aleph-alpha-research.github.io/eval-framework/evaluate_huggingface_model.html)
+- **Understand model arguments**: Read out [Model Arguments guide](https://aleph-alpha-research.github.io/eval-framework/model_arguments.html)
+- **Create custom benchmarks**: Follow our [benchmark creation guide](https://aleph-alpha-research.github.io/eval-framework/add_new_benchmark_guide.html)
+- **Scale your evaluations**: Use [Determined AI integration](https://aleph-alpha-research.github.io/eval-framework/using_determined.html) for distributed evaluation
+- **Understand your results**: Read our [results interpretation guide](https://aleph-alpha-research.github.io/eval-framework/understanding_results_guide.html)
+- **Log results in WandB**: See how [we integrate WandB](https://aleph-alpha-research.github.io/eval-framework/wandb_integration.html) for metric and lineage tracking
 
 ## Documentation
 
 ### Getting Started
 
-- **[CLI Usage Guide](
-- **[Evaluating HuggingFace Models](
-- **[Understanding Results](
+- **[CLI Usage Guide](https://aleph-alpha-research.github.io/eval-framework/cli_usage.html)** - Detailed instructions for using the command-line interface
+- **[Evaluating HuggingFace Models](https://aleph-alpha-research.github.io/eval-framework/evaluate_huggingface_model.html)** - Complete guide for evaluating HuggingFace models
+- **[Understanding Results](https://aleph-alpha-research.github.io/eval-framework/understanding_results_guide.html)** - How to read and interpret evaluation results
 
 ### Advanced Usage
 
-- **[Understanding Model Arguments](
-- **[Adding New Benchmarks](
-- **[Benchmarks and Metrics](
-- **[Overview of Dataloading](
+- **[Understanding Model Arguments](https://aleph-alpha-research.github.io/eval-framework/model_arguments.html)** - Thorough guide on each constructor argument for salient model classes
+- **[Adding New Benchmarks](https://aleph-alpha-research.github.io/eval-framework/add_new_benchmark_guide.html)** - Complete guide with practical examples for adding new benchmarks
+- **[Benchmarks and Metrics](https://aleph-alpha-research.github.io/eval-framework/benchmarks_and_metrics.html)** - Comprehensive overview of all available benchmarks and evaluation metrics
+- **[Overview of Dataloading](https://aleph-alpha-research.github.io/eval-framework/overview_dataloading.html)** - Explanation of dataloading and task/sample/message structure
 
 ### Scaling & Production
 
-- **[Using Determined](
-- **[Controlling Upload Results](
+- **[Using Determined](https://aleph-alpha-research.github.io/eval-framework/using_determined.html)** - Guide for distributed evaluation using Determined AI
+- **[Controlling Upload Results](https://aleph-alpha-research.github.io/eval-framework/controlling_upload_results.html)** - How to manage and control the upload of evaluation results
 
 ### Contributing
 
-- **[Contributing Guide](CONTRIBUTING.
+- **[Contributing Guide](https://aleph-alpha-research.github.io/eval-framework/CONTRIBUTING.html)** - Guide for contributing to this project
+- **[Testing](https://aleph-alpha-research.github.io/eval-framework/testing.html)** - Guide for running tests comparable to the CI pipelines
 
 ### Citation
 
@@ -526,6 +544,6 @@ This project has received funding from the European Union’s Digital Europe Pro
 The contents of this publication are the sole responsibility of the OpenEuroLLM consortium and do not necessarily reflect the opinion of the European Union.
 
 <p align="center">
-<img src="
-<img src="
+<img src="https://raw.githubusercontent.com/Aleph-Alpha-Research/eval-framework/main/docs/OELLM_1.png" width="100" style="margin-right: 50px;"/>
+<img src="https://raw.githubusercontent.com/Aleph-Alpha-Research/eval-framework/main/docs/OELLM_2.png" width="350"/>
 </p>
````
{eval_framework-0.2.6 → eval_framework-0.2.8}/README.md

The seven README.md hunks (`@@ -1,7 +1,21 @@`, `@@ -19,10 +33,12 @@`, `@@ -80,7 +96,7 @@`, `@@ -133,7 +149,7 @@`, `@@ -149,9 +165,9 @@`, `@@ -196,41 +212,42 @@`, `@@ -256,6 +273,6 @@`) repeat, line for line at README-local offsets, the same changes shown in the README portion of PKG-INFO above: the centered badge block added under the title, the documentation links filled in with `https://aleph-alpha-research.github.io/eval-framework/` URLs, the new Testing entry under Contributing, and the OpenEuroLLM `<img>` tags pointed at `raw.githubusercontent.com`.
{eval_framework-0.2.6 → eval_framework-0.2.8}/pyproject.toml

```diff
@@ -1,6 +1,6 @@
 [project]
 name = "eval-framework"
-version = "0.2.6"
+version = "0.2.8"
 description = "Evalulation Framework"
 readme = "README.md"
 license = { file = "LICENSE" }
@@ -42,6 +42,9 @@ dependencies = [
     "wandb>=0.23.0,<1",
     "boto3>=1.40.54,<2",
     "numpy>=1.26.4",
+    # is a dependency of sympy, but not explicitly listed in the requirements.txt
+    # https://github.com/sympy/sympy/blob/0204fa34e8f6f6f8ccb4de01209be9a2345c9d6e/doc/src/contributing/dependencies.md?plain=1#L125
+    "antlr4-python3-runtime==4.11.0",
 ]
 
 [project.optional-dependencies]
```
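The comment in the hunk explains the new pin: sympy's LaTeX parser is backed by a generated ANTLR grammar, and `antlr4-python3-runtime` is only imported when the parser is actually called, so a missing or mismatched runtime surfaces as a runtime error rather than at install time. A minimal sanity check of the call this pin protects (assumes `sympy` is installed; this is the same `parse_latex` used by the math-reasoning metric further down):

```python
# Quick check that the pinned ANTLR runtime satisfies sympy's LaTeX parser.
# parse_latex loads antlr4-python3-runtime lazily and raises at call time
# when the runtime is absent or its version mismatches the generated parser.
from sympy.parsing.latex import parse_latex

expr = parse_latex(r"\frac{1}{2} x + 1")
print(expr)  # x/2 + 1
```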
{eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/context/determined.py

```diff
@@ -149,6 +149,7 @@ class DeterminedContext(EvalContext):
             wandb_upload_results=self.hparams.wandb_upload_results or self.wandb_upload_results,
             batch_size=self.hparams.task_args.batch_size or self.batch_size,
             description=self.hparams.description or self.description,
+            randomize_judge_order=self.randomize_judge_order,
             delete_output_dir_after_upload=self.hparams.delete_output_dir_after_upload
             or self.delete_output_dir_after_upload,
         )
```
{eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/context/eval.py

```diff
@@ -73,6 +73,7 @@ class EvalContext(AbstractContextManager):
         perturbation_type: str | None = None,
         perturbation_probability: float | None = None,
         perturbation_seed: int | None = None,
+        randomize_judge_order: bool = False,
         delete_output_dir_after_upload: bool | None = None,
     ) -> None:
         self.llm_name = llm_name
@@ -96,6 +97,7 @@ class EvalContext(AbstractContextManager):
         self.judge_model_args = judge_model_args if judge_model_args is not None else {}
         self.batch_size = batch_size
         self.description = description
+        self.randomize_judge_order = randomize_judge_order
         self.delete_output_dir_after_upload = delete_output_dir_after_upload
 
         if perturbation_type or perturbation_probability is not None:
```
{eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/context/local.py

```diff
@@ -63,6 +63,7 @@ class LocalContext(EvalContext):
             judge_model_args=self.judge_model_args,
             batch_size=self.batch_size,
             description=self.description,
+            randomize_judge_order=self.randomize_judge_order,
             delete_output_dir_after_upload=self.delete_output_dir_after_upload,
         )
 
```
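Taken together, the three context hunks above thread a single new flag from the entry point down to the metric layer: `EvalContext.__init__` grows a `randomize_judge_order: bool = False` parameter, and `LocalContext`/`DeterminedContext` forward it into the config they build (the one-line change to `tasks/eval_config.py` in the file list adds the matching field). A hedged usage sketch; the `LocalContext` signature is abridged to the arguments visible in the hunks:

```python
from eval_framework.context.local import LocalContext

# Hypothetical caller enabling the new flag; all other constructor
# arguments are unchanged by this release and omitted here.
context = LocalContext(
    llm_name="my-model",
    randomize_judge_order=True,  # new in 0.2.8, defaults to False
)
# Both LocalContext and DeterminedContext copy the flag into the config
# they build, where EvaluationGenerator (next hunk) reads it as
# config.randomize_judge_order.
```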
{eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/evaluation_generator.py

```diff
@@ -67,7 +67,10 @@ class EvaluationGenerator:
             if llm_judge is None:
                 assert self.config.llm_judge_class is not None, "The llm_judge_class must be defined in the config."
                 llm_judge = self.config.llm_judge_class(**self.config.judge_model_args)
-            metric = metric_class(
+            metric = metric_class(
+                llm_judge=llm_judge,
+                randomize_order=self.config.randomize_judge_order,
+            )
         else:
             metric = metric_class()
 
```
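One consequence of this call-site change: any judge-backed metric handed to `EvaluationGenerator` must now accept a `randomize_order` keyword. With the `BaseLLMJudgeMetric` change further down, a conforming subclass needs nothing beyond passing the flag through; a sketch (the subclass itself is hypothetical):

```python
from eval_framework.llm.base import BaseLLM
from eval_framework.metrics.llm.base import BaseLLMJudgeMetric

# Hypothetical judge metric compatible with the new construction call.
# BaseLLMJudgeMetric (see the metrics/llm/base.py hunk below) stores the
# flag as self._randomize_order for subclasses to consult.
class MyComparisonJudgeMetric(BaseLLMJudgeMetric):
    def __init__(self, llm_judge: BaseLLM, randomize_order: bool = False) -> None:
        super().__init__(llm_judge, randomize_order=randomize_order)
```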
{eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/llm/aleph_alpha.py

```diff
@@ -55,6 +55,8 @@ class AlephAlphaAPIModel(BaseLLM):
         request_timeout_seconds: int = 30 * 60 + 5,
         queue_full_timeout_seconds: int = 30 * 60 + 5,
         bytes_per_token: float | None = None,
+        token: str = os.getenv("AA_TOKEN", "dummy"),
+        base_url: str = os.getenv("AA_INFERENCE_ENDPOINT", "dummy_endpoint"),
     ) -> None:
         self._formatter: BaseFormatter
         if formatter is None:
@@ -69,7 +71,9 @@ class AlephAlphaAPIModel(BaseLLM):
         self.max_retries = max_retries
         self.request_timeout_seconds = request_timeout_seconds
         self.queue_full_timeout_seconds = queue_full_timeout_seconds
-        self.
+        self.token = token
+        self.base_url = base_url
+        self._validate_model_availability(base_url, token)
         # set bytes_per_token_scalar for non-standard models
         if bytes_per_token is not None and bytes_per_token <= 0:
             raise ValueError("bytes_per_token must be positive")
@@ -77,15 +81,15 @@ class AlephAlphaAPIModel(BaseLLM):
             4.0 / bytes_per_token if bytes_per_token is not None else 4.0 / self.BYTES_PER_TOKEN
         )
 
-    def _validate_model_availability(self) -> None:
+    def _validate_model_availability(self, base_url: str, token: str) -> None:
         """
         Validate that the model name is available by making a test request.
         """
         try:
             # 'Client' object does not support the context manager protocol
             client = Client(
-                host=
-                token=
+                host=base_url,
+                token=token,
             )
 
             request = CompletionRequest(
@@ -190,10 +194,10 @@ class AlephAlphaAPIModel(BaseLLM):
         """Process multiple requests concurrently, returning request/response pairs."""
         semaphore = asyncio.Semaphore(self.max_async_concurrent_requests)
         async with AsyncClient(
-            host=
+            host=self.base_url,
             nice=True,
             request_timeout_seconds=self.request_timeout_seconds,
-            token=
+            token=self.token,
             total_retries=0,  # we have a custom retry policy in _request_with_backoff()
         ) as client:
             tasks = (
```
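The constructor now takes the token and endpoint as parameters whose defaults read `AA_TOKEN` and `AA_INFERENCE_ENDPOINT`, stores them on the instance, and passes them through to both the validation `Client` and the `AsyncClient`. Note the usual Python caveat: default expressions like `os.getenv(...)` are evaluated once, when the module is imported, so changing the environment afterwards does not affect the defaults. A usage sketch, shown on the base class for brevity (the endpoint URL is a placeholder; other constructor arguments are unchanged and omitted):

```python
import os

from eval_framework.llm.aleph_alpha import AlephAlphaAPIModel

# Explicit arguments override the import-time env defaults ("dummy" /
# "dummy_endpoint" when AA_TOKEN / AA_INFERENCE_ENDPOINT are unset).
model = AlephAlphaAPIModel(
    token=os.environ["AA_TOKEN"],
    base_url="https://inference.example.com",  # hypothetical endpoint
)
```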
{eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/completion/math_reasoning_completion.py

```diff
@@ -204,10 +204,15 @@ class MathReasoningCompletion(BaseMetric[Completion]):
         timeout = 10
         # latex parse all ingested ground truth values for math reasoning
         for gt in response.ground_truth_list:
+            if gt is None:
+                continue
             signal.signal(signal.SIGALRM, timeout_handler)  # Set timeout signal
             signal.alarm(timeout)  # Set timeout duration
             try:
-
+                gt_normalized = self.normalize_expression(gt)
+                gt_parsed = parse_latex(
+                    gt_normalized
+                )  # NOTE: parses f(x)=0,\quadf(x)=x-1,\quadf(x)=-x+1 to Eq(f(x), 0) ONLY
                 ground_truths.append(gt_parsed)
             except Exception:
                 ground_truths.append(gt)
@@ -229,15 +234,11 @@ class MathReasoningCompletion(BaseMetric[Completion]):
             )
         ]
         else:
-
-
-            assert isinstance(response.ground_truth, str)
-            str_is_correct = self._is_str_correct(normalized_response, response.ground_truth)
-            return [
-                MetricResult(
-                    metric_name=self.NAME, value=float(str_is_correct), higher_is_better=True, error=response.error
-                )
+            normalized_ground_truths = [
+                self.normalize_expression(gt) for gt in response.ground_truth_list if gt is not None
             ]
+            res = self._any_str_correct([normalized_response], normalized_ground_truths)
+            return [MetricResult(metric_name=self.NAME, value=float(res), higher_is_better=True, error=response.error)]
 
     def _any_str_correct(self, response_list: list, ground_truths: list) -> bool:
         """
```
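Two behavioral changes here: `None` entries in `ground_truth_list` are now skipped rather than reaching the parser, and the string-comparison fallback checks the response against every non-`None` ground truth instead of the single `response.ground_truth`. The SIGALRM guard around `parse_latex` (which can hang on pathological input) is worth seeing in isolation; a self-contained sketch of the pattern, Unix-only since `signal.SIGALRM` does not exist on Windows (the metric additionally runs its own `normalize_expression` first, omitted here):

```python
import signal

from sympy.parsing.latex import parse_latex

def timeout_handler(signum, frame):
    raise TimeoutError("parse_latex timed out")

def parse_ground_truth(gt: str, timeout: int = 10):
    """Parse a LaTeX ground truth, falling back to the raw string on failure."""
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(timeout)  # deliver SIGALRM after `timeout` seconds
    try:
        return parse_latex(gt)  # e.g. r"f(x) = 0" -> Eq(f(x), 0)
    except Exception:  # parse errors and the timeout alike fall back to the string
        return gt
    finally:
        signal.alarm(0)  # cancel any pending alarm

print(parse_ground_truth(r"f(x) = 0"))
```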
{eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/base.py

```diff
@@ -6,8 +6,9 @@ from eval_framework.shared.types import Completion, Error
 
 
 class BaseLLMJudgeMetric(BaseMetric[Completion]):
-    def __init__(self, llm_judge: BaseLLM) -> None:
+    def __init__(self, llm_judge: BaseLLM, randomize_order: bool = False) -> None:
         self._llm_judge = llm_judge
+        self._randomize_order = randomize_order
 
     def _create_metric_result(
         self,
```
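The flag is stored but not used in this base class; it is consumed by the pairwise judges (`llm_judge_mtbench_pair.py` gains 110 lines in this release). A hedged sketch of the hand-off a subclass would make when calling the comparison grader below; deriving the seed from a per-sample identifier, for reproducible swaps, is an assumption not shown in this diff:

```python
from eval_framework.metrics.llm.base import BaseLLMJudgeMetric

# Hypothetical consumer: the flag stored on the metric becomes the grader's
# randomize_order argument (signature per the comparison_grader hunk below).
class MyPairwiseJudgeMetric(BaseLLMJudgeMetric):
    def _judge_pair(self, grader, instruction, candidate, reference, language, sample_id):
        return grader.grade(
            instruction=instruction,
            completion_1=candidate,
            completion_2=reference,
            language=language,
            randomize_order=self._randomize_order,
            seed=sample_id,  # assumption: a stable per-sample seed
        )
```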
{eval_framework-0.2.6 → eval_framework-0.2.8}/src/eval_framework/metrics/llm/graders/comparison_grader.py

```diff
@@ -1,3 +1,4 @@
+import random
 from collections.abc import Mapping
 from enum import Enum
 
@@ -8,6 +9,7 @@ from eval_framework.metrics.llm.graders.models import (
     PromptTemplateWithParseMap,
     parse_json_output,
 )
+from eval_framework.metrics.llm.utils import order_answers_for_comparison
 
 
 class MatchOutcome(str, Enum):
@@ -23,6 +25,14 @@ class MatchOutcome(str, Enum):
             return (0.5, 0.5)
         return (0, 1)
 
+    def flip(self) -> "MatchOutcome":
+        """Flip the outcome (A_WINS <-> B_WINS, DRAW stays DRAW)."""
+        if self == self.A_WINS:
+            return MatchOutcome.B_WINS
+        if self == self.B_WINS:
+            return MatchOutcome.A_WINS
+        return self  # DRAW stays DRAW
+
     @staticmethod
     def from_rank_literal(rank: int) -> "MatchOutcome":
         match rank:
@@ -122,25 +132,67 @@ Answer 2:
         self._prompt_templates = prompt_templates
 
     def grade(
-        self,
+        self,
+        instruction: str,
+        completion_1: str,
+        completion_2: str,
+        language: Language,
+        randomize_order: bool = False,
+        seed: int | None = None,
     ) -> ComparisonGradingOutput:
+        """Grade two completions by comparing them.
+
+        Args:
+            instruction: The instruction/task that was given.
+            completion_1: The first completion (typically the candidate).
+            completion_2: The second completion (typically the reference).
+            language: The language for the grading prompts.
+            randomize_order: If True, randomly swap the order of completions to eliminate
+                position bias.
+            seed: Optional random seed for reproducibility. If None and randomize_order
+                is True, uses a random swap decision.
+
+        Returns:
+            ComparisonGradingOutput with the outcome corrected for any position swap,
+            so outcome always reflects completion_1 vs completion_2 regardless of
+            presentation order to the judge.
+        """
         prompt_template = language.language_config(self._prompt_templates)
+
+        # Determine whether to swap the order
+        if randomize_order:
+            rng = random.Random(seed)
+            swap_order = rng.choice([True, False])
+        else:
+            swap_order = False
+
+        # Apply the swap if needed
+        actual_answer_1, actual_answer_2 = order_answers_for_comparison(completion_1, completion_2, swap_order)
+
         messages = prompt_template.to_messages(
             [],
             [
                 (self.INSTRUCTION_KEY, instruction),
-                (self.ANSWER_1_KEY,
-                (self.ANSWER_2_KEY,
+                (self.ANSWER_1_KEY, actual_answer_1),
+                (self.ANSWER_2_KEY, actual_answer_2),
             ],
         )
 
         raw_completion = self._grading_model.generate_from_messages([messages])[0]
         loaded_json = parse_json_output(raw_completion.completion)
+
+        # Get the raw outcome from the judge
+        raw_outcome: MatchOutcome | None = prompt_template.parse_map.get(
+            str(loaded_json.get(self.BETTER_ANSWER_KEY, None)), None
+        )
+
+        # Correct the outcome if we swapped the order
+        # If swapped: "Answer 1 is better" means completion_2 is better (B_WINS from completion_1's perspective)
+        final_outcome = raw_outcome.flip() if swap_order and raw_outcome is not None else raw_outcome
+
         return ComparisonGradingOutput(
             reasoning=loaded_json.get(self.REASONING_KEY, None),
-            outcome=
+            outcome=final_outcome,
             judge_prompt=raw_completion.prompt,
             judge_response=raw_completion.completion,
         )
```
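`order_answers_for_comparison` comes from the new `metrics/llm/utils.py` (+20 lines), whose body is not part of the hunks shown; judging from its call site it is plausibly a simple conditional swap. A sketch under that assumption, plus the invariant `flip()` restores (member names as in the class above): the returned outcome is always expressed as completion_1 vs completion_2, whatever order the judge actually saw.

```python
from eval_framework.metrics.llm.graders.comparison_grader import MatchOutcome

# Plausible shape of the unshown helper: swap presentation order on demand.
def order_answers_for_comparison(answer_1: str, answer_2: str, swap_order: bool) -> tuple[str, str]:
    return (answer_2, answer_1) if swap_order else (answer_1, answer_2)

# If the judge saw swapped answers and picked "Answer 1", the real winner
# is completion_2, which is exactly what flip() encodes:
assert MatchOutcome.A_WINS.flip() is MatchOutcome.B_WINS
assert MatchOutcome.B_WINS.flip() is MatchOutcome.A_WINS
assert MatchOutcome.DRAW.flip() is MatchOutcome.DRAW
```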