evalscope 0.17.0__tar.gz → 0.17.1__tar.gz
This diff shows the changes between two publicly released versions of the package, as they appear in their public registry, and is provided for informational purposes only.
Potentially problematic release: this version of evalscope might be problematic.
- {evalscope-0.17.0/evalscope.egg-info → evalscope-0.17.1}/PKG-INFO +44 -30
- {evalscope-0.17.0 → evalscope-0.17.1}/README.md +38 -26
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bfcl/bfcl_adapter.py +1 -1
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/data_adapter.py +9 -4
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/general_mcq/general_mcq_adapter.py +2 -1
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/general_qa/general_qa_adapter.py +2 -1
- evalscope-0.17.1/evalscope/benchmarks/hle/hle_adapter.py +118 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/humaneval/humaneval_adapter.py +5 -21
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/mmlu/mmlu_adapter.py +1 -1
- evalscope-0.17.1/evalscope/benchmarks/tau_bench/tau_bench_adapter.py +110 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/tool_bench/tool_bench_adapter.py +7 -1
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/utils.py +1 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/constants.py +5 -21
- evalscope-0.17.1/evalscope/evaluator/__init__.py +3 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/evaluator/evaluator.py +5 -3
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/__init__.py +3 -1
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/completion_parsers.py +7 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/llm_judge.py +6 -5
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/metrics.py +19 -7
- {evalscope-0.17.0/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models → evalscope-0.17.1/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward}/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/perf/utils → evalscope-0.17.1/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/models/__init__.py +4 -8
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/models/adapters/__init__.py +4 -9
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/models/adapters/base_adapter.py +4 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/models/adapters/bfcl_adapter.py +2 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/models/adapters/chat_adapter.py +3 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/models/adapters/choice_adapter.py +4 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/models/adapters/custom_adapter.py +7 -3
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/models/adapters/server_adapter.py +2 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/models/adapters/t2i_adapter.py +3 -0
- evalscope-0.17.1/evalscope/models/adapters/tau_bench_adapter.py +189 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/models/register.py +0 -14
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/perf/arguments.py +13 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/perf/benchmark.py +38 -39
- evalscope-0.17.1/evalscope/perf/http_client.py +120 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/perf/main.py +2 -2
- evalscope-0.17.1/evalscope/perf/plugin/__init__.py +3 -0
- evalscope-0.17.1/evalscope/perf/plugin/api/__init__.py +4 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/perf/plugin/api/base.py +22 -4
- evalscope-0.17.1/evalscope/perf/plugin/api/custom_api.py +249 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/perf/plugin/api/dashscope_api.py +4 -10
- evalscope-0.17.1/evalscope/perf/plugin/api/default_api.py +105 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/perf/plugin/api/openai_api.py +17 -19
- evalscope-0.17.1/evalscope/perf/plugin/datasets/__init__.py +10 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/perf/plugin/datasets/base.py +22 -1
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/perf/plugin/datasets/custom.py +2 -1
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/perf/plugin/datasets/flickr8k.py +4 -27
- evalscope-0.17.1/evalscope/perf/plugin/datasets/kontext_bench.py +28 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/perf/plugin/datasets/line_by_line.py +2 -1
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/perf/plugin/datasets/longalpaca.py +2 -1
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/perf/plugin/datasets/openqa.py +2 -1
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/perf/plugin/datasets/random_dataset.py +15 -4
- evalscope-0.17.1/evalscope/perf/plugin/datasets/random_vl_dataset.py +80 -0
- evalscope-0.17.1/evalscope/perf/plugin/registry.py +74 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/perf/utils/benchmark_util.py +14 -20
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/perf/utils/db_util.py +79 -61
- evalscope-0.17.1/evalscope/third_party/thinkbench/tools/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/utils/io_utils.py +10 -0
- evalscope-0.17.1/evalscope/version.py +4 -0
- {evalscope-0.17.0 → evalscope-0.17.1/evalscope.egg-info}/PKG-INFO +44 -30
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope.egg-info/SOURCES.txt +8 -1
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope.egg-info/requires.txt +12 -4
- {evalscope-0.17.0 → evalscope-0.17.1}/requirements/app.txt +1 -1
- {evalscope-0.17.0 → evalscope-0.17.1}/requirements/dev.txt +1 -1
- {evalscope-0.17.0 → evalscope-0.17.1}/setup.cfg +1 -1
- {evalscope-0.17.0 → evalscope-0.17.1}/setup.py +33 -15
- {evalscope-0.17.0 → evalscope-0.17.1}/tests/cli/test_all.py +18 -2
- {evalscope-0.17.0 → evalscope-0.17.1}/tests/cli/test_run.py +25 -37
- {evalscope-0.17.0 → evalscope-0.17.1}/tests/perf/test_perf.py +29 -2
- evalscope-0.17.1/tests/rag/__init__.py +0 -0
- evalscope-0.17.0/evalscope/evaluator/__init__.py +0 -3
- evalscope-0.17.0/evalscope/models/model.py +0 -189
- evalscope-0.17.0/evalscope/perf/http_client.py +0 -176
- evalscope-0.17.0/evalscope/perf/plugin/__init__.py +0 -2
- evalscope-0.17.0/evalscope/perf/plugin/api/__init__.py +0 -3
- evalscope-0.17.0/evalscope/perf/plugin/api/custom_api.py +0 -92
- evalscope-0.17.0/evalscope/perf/plugin/datasets/__init__.py +0 -7
- evalscope-0.17.0/evalscope/perf/plugin/registry.py +0 -54
- evalscope-0.17.0/evalscope/version.py +0 -4
- {evalscope-0.17.0 → evalscope-0.17.1}/LICENSE +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/MANIFEST.in +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/app/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/app/app.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/app/arguments.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/app/constants.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/app/ui/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/app/ui/app_ui.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/app/ui/multi_model.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/app/ui/sidebar.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/app/ui/single_model.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/app/ui/visualization.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/app/utils/data_utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/app/utils/localization.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/app/utils/text_utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/app/utils/visualization.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/arguments.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/base.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/opencompass/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/opencompass/api_meta_template.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/opencompass/backend_manager.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/opencompass/tasks/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/opencompass/tasks/eval_api.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/opencompass/tasks/eval_datasets.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/backend_manager.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/clip_benchmark/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/clip_benchmark/arguments.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/clip_benchmark/dataset_builder.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/clip_benchmark/task_template.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/image_caption.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_classification.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_retrieval.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/clip_benchmark/utils/webdataset_convert.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/clip_benchmark/utils/webdatasets.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/arguments.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/base.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/task_template.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/tasks/Classification.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/tasks/Clustering.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/tasks/CustomTask.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/tasks/PairClassification.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/tasks/Reranking.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/tasks/Retrieval.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/tasks/STS.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/tasks/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/ragas/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/ragas/arguments.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/ragas/prompts/persona_prompt.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/ragas/task_template.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/ragas/tasks/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/ragas/tasks/build_distribution.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/ragas/tasks/build_transform.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/ragas/tasks/testset_generation.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/ragas/tasks/translate_prompt.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/utils/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/utils/clip.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/utils/embedding.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/utils/llm.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/rag_eval/utils/tools.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/vlm_eval_kit/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/backend/vlm_eval_kit/backend_manager.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/aigc/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/aigc/t2i/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/aigc/t2i/base.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/aigc/t2i/evalmuse_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/aigc/t2i/genai_bench_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/aigc/t2i/general_t2i_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/aigc/t2i/hpdv2_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/aigc/t2i/tifa_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/aime/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/aime/aime24_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/aime/aime25_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/alpaca_eval/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/alpaca_eval/alpaca_eval_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/arc/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/arc/ai2_arc.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/arc/arc_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/arena_hard/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/arena_hard/arena_hard_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/arena_hard/utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/bbh_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/boolean_expressions.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/causal_judgement.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/date_understanding.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/disambiguation_qa.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/dyck_languages.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/formal_fallacies.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/geometric_shapes.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/hyperbaton.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_five_objects.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_seven_objects.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_three_objects.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/movie_recommendation.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/multistep_arithmetic_two.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/navigate.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/object_counting.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/penguins_in_a_table.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/reasoning_about_colored_objects.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/ruin_names.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/salient_translation_error_detection.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/snarks.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/sports_understanding.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/temporal_sequences.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_five_objects.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_seven_objects.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_three_objects.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/web_of_lies.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/word_sorting.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/benchmark.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bfcl/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/ceval/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/ceval/ceval_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/ceval/ceval_exam.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/chinese_simple_qa/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/chinese_simple_qa/csimple_qa_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/cmmlu/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/cmmlu/cmmlu.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/cmmlu/cmmlu_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/cmmlu/samples.jsonl +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/competition_math/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/competition_math/competition_math.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/competition_math/competition_math_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/data_collection/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/data_collection/data_collection_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/docmath/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/docmath/docmath_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/docmath/utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/drop/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/drop/drop_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/drop/utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/filters.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/frames/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/frames/frames_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/frames/utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/general_arena/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/general_arena/general_arena_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/general_arena/utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/general_mcq/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/general_qa/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/gpqa/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/gpqa/chain_of_thought.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/gpqa/gpqa_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/gsm8k/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/gsm8k/gsm8k.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/gsm8k/gsm8k_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/hellaswag/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/hellaswag/hellaswag.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/hellaswag/hellaswag_adapter.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/ifeval → evalscope-0.17.1/evalscope/benchmarks/hle}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/humaneval/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/humaneval/humaneval.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/iquiz → evalscope-0.17.1/evalscope/benchmarks/ifeval}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/ifeval/ifeval_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/ifeval/instructions.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/ifeval/instructions_registry.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/ifeval/instructions_util.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/ifeval/utils.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/live_code_bench → evalscope-0.17.1/evalscope/benchmarks/iquiz}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/iquiz/iquiz_adapter.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/maritime_bench → evalscope-0.17.1/evalscope/benchmarks/live_code_bench}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/live_code_bench/evaluate_utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/live_code_bench/extract_utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/live_code_bench/live_code_bench_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/live_code_bench/load_utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/live_code_bench/pass_k_utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/live_code_bench/prompts.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/live_code_bench/testing_util.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/math_500 → evalscope-0.17.1/evalscope/benchmarks/maritime_bench}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/maritime_bench/maritime_bench_adapter.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/mmlu_pro → evalscope-0.17.1/evalscope/benchmarks/math_500}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/math_500/math_500_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/mmlu/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/mmlu/mmlu.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/mmlu/samples.jsonl +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/mmlu_redux → evalscope-0.17.1/evalscope/benchmarks/mmlu_pro}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/mmlu_pro/mmlu_pro_adapter.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/musr → evalscope-0.17.1/evalscope/benchmarks/mmlu_redux}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/mmlu_redux/mmlu_redux_adapter.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/needle_haystack → evalscope-0.17.1/evalscope/benchmarks/musr}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/musr/musr_adapter.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/process_bench → evalscope-0.17.1/evalscope/benchmarks/needle_haystack}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/needle_haystack/needle_haystack_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/needle_haystack/utils.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/simple_qa → evalscope-0.17.1/evalscope/benchmarks/process_bench}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/process_bench/critique_template.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/process_bench/process_bench_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/race/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/race/race.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/race/race_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/race/samples.jsonl +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/super_gpqa → evalscope-0.17.1/evalscope/benchmarks/simple_qa}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/simple_qa/simple_qa_adapter.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/tool_bench → evalscope-0.17.1/evalscope/benchmarks/super_gpqa}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/super_gpqa/five_shot_prompt.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/super_gpqa/super_gpqa_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/super_gpqa/utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/super_gpqa/zero_shot_prompt.txt +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/winogrande → evalscope-0.17.1/evalscope/benchmarks/tau_bench}/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/metrics/t2v_metrics/models → evalscope-0.17.1/evalscope/benchmarks/tool_bench}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/tool_bench/utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/trivia_qa/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/trivia_qa/samples.jsonl +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/trivia_qa/trivia_qa.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/trivia_qa/trivia_qa_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/truthful_qa/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/truthful_qa/truthful_qa.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/truthful_qa/truthful_qa_adapter.py +0 -0
- {evalscope-0.17.0/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model → evalscope-0.17.1/evalscope/benchmarks/winogrande}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/winogrande/winogrande_adapter.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/cli/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/cli/base.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/cli/cli.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/cli/start_app.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/cli/start_eval.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/cli/start_perf.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/cli/start_server.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/collections/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/collections/evaluator.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/collections/sampler.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/collections/schema.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/config.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/bundled_rouge_score/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/bundled_rouge_score/rouge_scorer.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/math_parser.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/named_metrics.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/rouge_metric.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/clipscore.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/constants.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/itmscore.py +0 -0
- {evalscope-0.17.0/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward → evalscope-0.17.1/evalscope/metrics/t2v_metrics/models}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5 → evalscope-0.17.1/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model/base_model.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model/clip_model.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model/cross_modeling.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/clip_model.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/hpsv2_model.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/mps_model.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/pickscore_model.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/itmscore_models/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/itmscore_models/blip2_itm_model.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/itmscore_models/fga_blip2_model.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward/ImageReward.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward/blip_pretrain.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward_model.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/model.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/perf → evalscope-0.17.1/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/language_model/clip_t5.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/multimodal_encoder/builder.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/multimodal_encoder/clip_encoder.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/multimodal_projector/builder.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5_model.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/gpt4v_model.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/config.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/dist_utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/gradcam.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/logger.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/optims.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/registry.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/vqa_tools/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/vqa_tools/vqa.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/vqa_tools/vqa_eval.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/default.yaml +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_caption_flant5xl.yaml +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_caption_opt2.7b.yaml +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_caption_opt6.7b.yaml +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_coco.yaml +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_flant5xl.yaml +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_flant5xxl.yaml +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_vicuna13b.yaml +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_vicuna7b.yaml +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain.yaml +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl.yaml +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl_iter_80k_total_100k_no_prefix.yaml +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl_iter_80k_total_100k_prefix.yaml +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl_vitL.yaml +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xxl.yaml +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_opt2.7b.yaml +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_opt6.7b.yaml +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_vitL.yaml +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_vicuna13b.yaml +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_vicuna7b.yaml +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/med_config.json +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/med_config_albef.json +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/med_large_config.json +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/base_model.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/Qformer.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_image_text_matching.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_qformer.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_t5.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_t5_instruct.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/fga_blip2.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/modeling_llama.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/modeling_t5.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_caption.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_classification.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_feature_extractor.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_image_text_matching.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_nlvr.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_outputs.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_pretrain.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_vqa.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/nlvr_encoder.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/clip_vit.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/eva_vit.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/med.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/vit.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/base_processor.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/blip_processors.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/randaugment.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/mm_utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/vqa_model.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/score.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/vqascore.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/models/custom/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/models/custom/custom_model.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/models/custom/dummy_model.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/models/local_model.py +0 -0
- {evalscope-0.17.0/evalscope/third_party/thinkbench/tools → evalscope-0.17.1/evalscope/perf}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/perf/plugin/datasets/speed_benchmark.py +0 -0
- {evalscope-0.17.0/tests/rag → evalscope-0.17.1/evalscope/perf/utils}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/perf/utils/analysis_result.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/perf/utils/handler.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/perf/utils/local_server.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/perf/utils/log_utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/perf/utils/rich_display.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/report/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/report/combinator.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/report/generator.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/report/utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/run.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/summarizer.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/longbench_write/README.md +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/longbench_write/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/longbench_write/default_task.json +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/longbench_write/default_task.yaml +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/longbench_write/eval.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/longbench_write/infer.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/longbench_write/longbench_write.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/longbench_write/resources/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/longbench_write/resources/judge.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/longbench_write/resources/longbench_write.jsonl +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/longbench_write/resources/longbench_write_en.jsonl +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/longbench_write/resources/longwrite_ruler.jsonl +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/longbench_write/tools/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/longbench_write/tools/data_etl.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/longbench_write/tools/openai_api.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/longbench_write/utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/thinkbench/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/thinkbench/eval.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/thinkbench/infer.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/thinkbench/resources/critique_template.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/thinkbench/resources/reformat_template.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/thinkbench/tools/llm.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/thinkbench/tools/utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/toolbench_static/README.md +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/toolbench_static/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/toolbench_static/config_default.json +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/toolbench_static/config_default.yaml +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/toolbench_static/eval.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/toolbench_static/infer.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/toolbench_static/llm/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/toolbench_static/llm/swift_infer.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/toolbench_static/requirements.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/third_party/toolbench_static/toolbench_static.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/utils/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/utils/argument_utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/utils/chat_service.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/utils/deprecation_utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/utils/import_utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/utils/logger.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope/utils/model_utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope.egg-info/dependency_links.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope.egg-info/entry_points.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope.egg-info/not-zip-safe +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/evalscope.egg-info/top_level.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/requirements/aigc.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/requirements/docs.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/requirements/framework.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/requirements/opencompass.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/requirements/perf.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/requirements/rag.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/requirements/vlmeval.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/requirements.txt +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/tests/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/tests/aigc/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/tests/aigc/test_t2i.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/tests/cli/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/tests/cli/test_collection.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/tests/cli/test_custom.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/tests/perf/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/tests/rag/test_clip_benchmark.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/tests/rag/test_mteb.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/tests/rag/test_ragas.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/tests/swift/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/tests/swift/test_run_swift_eval.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/tests/swift/test_run_swift_vlm_eval.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/tests/swift/test_run_swift_vlm_jugde_eval.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/tests/test_run_all.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/tests/utils.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/tests/vlm/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-0.17.1}/tests/vlm/test_vlmeval.py +0 -0
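The file list above adds two new benchmark adapters (evalscope/benchmarks/hle/hle_adapter.py and evalscope/benchmarks/tau_bench/tau_bench_adapter.py). A minimal sketch of how they might be invoked through the existing `run_task`/`TaskConfig` API follows; the dataset names `hle` and `tau_bench` are assumptions inferred from the adapter file names and are not confirmed by this diff.

```python
# Hedged sketch: dataset names 'hle' and 'tau_bench' are assumed from the new
# adapter file names in this release; verify them against the evalscope docs.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',  # any locally loadable or API-served chat model
    datasets=['hle', 'tau_bench'],       # assumed registry names for the new benchmarks
    limit=5,                             # small sample for a quick smoke test
)
run_task(task_cfg=task_cfg)
```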
@@ -1,19 +1,20 @@
  Metadata-Version: 2.1
  Name: evalscope
- Version: 0.17.0
+ Version: 0.17.1
  Summary: EvalScope: Lightweight LLMs Evaluation Framework
  Home-page: https://github.com/modelscope/evalscope
  Author: ModelScope team
  Author-email: contact@modelscope.cn
+ License: Apache License 2.0
  Keywords: python,llm,evaluation
  Classifier: Development Status :: 4 - Beta
- Classifier: License :: OSI Approved :: Apache Software License
  Classifier: Operating System :: OS Independent
  Classifier: Programming Language :: Python :: 3
- Classifier: Programming Language :: Python :: 3.8
  Classifier: Programming Language :: Python :: 3.9
  Classifier: Programming Language :: Python :: 3.10
-
+ Classifier: Programming Language :: Python :: 3.11
+ Classifier: Programming Language :: Python :: 3.12
+ Requires-Python: >=3.9
  Description-Content-Type: text/markdown
  Provides-Extra: opencompass
  Provides-Extra: vlmeval
@@ -22,6 +23,7 @@ Provides-Extra: perf
  Provides-Extra: app
  Provides-Extra: aigc
  Provides-Extra: dev
+ Provides-Extra: docs
  Provides-Extra: all
  License-File: LICENSE

@@ -64,16 +66,17 @@ License-File: LICENSE
  - [Basic Parameter](#basic-parameter)
  - [Output Results](#output-results)
  - [📈 Visualization of Evaluation Results](#-visualization-of-evaluation-results)
- - [🌐 Evaluation of
+ - [🌐 Evaluation of Model API](#-evaluation-of-model-api)
  - [⚙️ Custom Parameter Evaluation](#️-custom-parameter-evaluation)
- - [Parameter](#parameter)
- - [Evaluation
+ - [Parameter Description](#parameter-description)
+ - [🧪 Other Evaluation Backends](#-other-evaluation-backends)
  - [📈 Model Serving Performance Evaluation](#-model-serving-performance-evaluation)
  - [🖊️ Custom Dataset Evaluation](#️-custom-dataset-evaluation)
- - [
+ - [⚔️ Arena Mode](#️-arena-mode)
  - [👷♂️ Contribution](#️-contribution)
+ - [📚 Citation](#-citation)
  - [🔜 Roadmap](#-roadmap)
- - [Star History](
+ - [⭐ Star History](#-star-history)


  ## 📝 Introduction
@@ -137,7 +140,9 @@ Please scan the QR code below to join our community groups:


  ## 🎉 News
-
+ - 🔥 **[2025.07.18]** The model stress testing now supports randomly generating image-text data for multimodal model evaluation. For usage instructions, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/examples.html#id4).
+ - 🔥 **[2025.07.16]** Support for [τ-bench](https://github.com/sierra-research/tau-bench) has been added, enabling the evaluation of AI Agent performance and reliability in real-world scenarios involving dynamic user and tool interactions. For usage instructions, please refer to the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/get_started/supported_dataset/llm.html#bench).
+ - 🔥 **[2025.07.14]** Support for "Humanity's Last Exam" ([Humanity's-Last-Exam](https://modelscope.cn/datasets/cais/hle)), a highly challenging evaluation benchmark. For usage instructions, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset/llm.html#humanity-s-last-exam).
  - 🔥 **[2025.07.03]** Refactored Arena Mode: now supports custom model battles, outputs a model leaderboard, and provides battle result visualization. See [reference](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html) for details.
  - 🔥 **[2025.06.28]** Optimized custom dataset evaluation: now supports evaluation without reference answers. Enhanced LLM judge usage, with built-in modes for "scoring directly without reference answers" and "checking answer consistency with reference answers". See [reference](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset/llm.html#qa) for details.
  - 🔥 **[2025.06.19]** Added support for the [BFCL-v3](https://modelscope.cn/datasets/AI-ModelScope/bfcl_v3) benchmark, designed to evaluate model function-calling capabilities across various scenarios. For more information, refer to the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/third_party/bfcl_v3.html).
@@ -149,6 +154,8 @@ Please scan the QR code below to join our community groups:
  - 🔥 **[2025.04.27]** Support for text-to-image evaluation: Supports 8 metrics including MPS, HPSv2.1Score, etc., and evaluation benchmarks such as EvalMuse, GenAI-Bench. Refer to the [user documentation](https://evalscope.readthedocs.io/en/latest/user_guides/aigc/t2i.html) for more details.
  - 🔥 **[2025.04.10]** Model service stress testing tool now supports the `/v1/completions` endpoint (the default endpoint for vLLM benchmarking)
  - 🔥 **[2025.04.08]** Support for evaluating embedding model services compatible with the OpenAI API has been added. For more details, check the [user guide](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/mteb.html#configure-evaluation-parameters).
+ <details><summary>More</summary>
+
  - 🔥 **[2025.03.27]** Added support for [AlpacaEval](https://www.modelscope.cn/datasets/AI-ModelScope/alpaca_eval/dataPeview) and [ArenaHard](https://modelscope.cn/datasets/AI-ModelScope/arena-hard-auto-v0.1/summary) evaluation benchmarks. For usage notes, please refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset.html)
  - 🔥 **[2025.03.20]** The model inference service stress testing now supports generating prompts of specified length using random values. Refer to the [user guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/examples.html#using-the-random-dataset) for more details.
  - 🔥 **[2025.03.13]** Added support for the [LiveCodeBench](https://www.modelscope.cn/datasets/AI-ModelScope/code_generation_lite/summary) code evaluation benchmark, which can be used by specifying `live_code_bench`. Supports evaluating QwQ-32B on LiveCodeBench, refer to the [best practices](https://evalscope.readthedocs.io/en/latest/best_practice/eval_qwq.html).
@@ -158,8 +165,6 @@ Please scan the QR code below to join our community groups:
  - 🔥 **[2025.03.03]** Added support for evaluating the IQ and EQ of models. Refer to [📖 Best Practices for IQ and EQ Evaluation](https://evalscope.readthedocs.io/en/latest/best_practice/iquiz.html) to find out how smart your AI is!
  - 🔥 **[2025.02.27]** Added support for evaluating the reasoning efficiency of models. Refer to [📖 Best Practices for Evaluating Thinking Efficiency](https://evalscope.readthedocs.io/en/latest/best_practice/think_eval.html). This implementation is inspired by the works [Overthinking](https://doi.org/10.48550/arXiv.2412.21187) and [Underthinking](https://doi.org/10.48550/arXiv.2501.18585).
  - 🔥 **[2025.02.25]** Added support for two model inference-related evaluation benchmarks: [MuSR](https://modelscope.cn/datasets/AI-ModelScope/MuSR) and [ProcessBench](https://www.modelscope.cn/datasets/Qwen/ProcessBench/summary). To use them, simply specify `musr` and `process_bench` respectively in the datasets parameter.
- <details><summary>More</summary>
-
  - 🔥 **[2025.02.18]** Supports the AIME25 dataset, which contains 15 questions (Grok3 scored 93 on this dataset).
  - 🔥 **[2025.02.13]** Added support for evaluating DeepSeek distilled models, including AIME24, MATH-500, and GPQA-Diamond datasets,refer to [best practice](https://evalscope.readthedocs.io/en/latest/best_practice/deepseek_r1_distill.html); Added support for specifying the `eval_batch_size` parameter to accelerate model evaluation.
  - 🔥 **[2025.01.20]** Support for visualizing evaluation results, including single model evaluation results and multi-model comparison, refer to the [📖 Visualizing Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html) for more details; Added [`iquiz`](https://modelscope.cn/datasets/AI-ModelScope/IQuiz/summary) evaluation example, evaluating the IQ and EQ of the model.
@@ -255,33 +260,31 @@ evalscope eval \

 When using Python code for evaluation, you need to submit the evaluation task using the `run_task` function, passing a `TaskConfig` as a parameter. It can also be a Python dictionary, yaml file path, or json file path, for example:

-**Using
+**Using `TaskConfig`**

 ```python
-from evalscope
+from evalscope import run_task, TaskConfig

-task_cfg =
-
-
-
-
+task_cfg = TaskConfig(
+    model='Qwen/Qwen2.5-0.5B-Instruct',
+    datasets=['gsm8k', 'arc'],
+    limit=5
+)

 run_task(task_cfg=task_cfg)
 ```
-
 <details><summary>More Startup Methods</summary>

-**Using
+**Using Python Dictionary**

 ```python
 from evalscope.run import run_task
-from evalscope.config import TaskConfig

-task_cfg =
-model
-datasets
-limit
-
+task_cfg = {
+    'model': 'Qwen/Qwen2.5-0.5B-Instruct',
+    'datasets': ['gsm8k', 'arc'],
+    'limit': 5
+}

 run_task(task_cfg=task_cfg)
 ```
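The surrounding prose also states that `run_task` accepts a YAML or JSON file path, which the hunk above does not illustrate. Below is a minimal sketch of that variant, assuming the YAML keys mirror the `TaskConfig` fields shown in the added lines; the file name `eval_config.yaml` is illustrative.

```python
# Minimal sketch: passing a config file path to run_task.
# Assumes the YAML keys mirror the TaskConfig fields used above;
# the file name 'eval_config.yaml' is illustrative.
from evalscope import run_task

# Contents of eval_config.yaml (assumed layout):
# model: Qwen/Qwen2.5-0.5B-Instruct
# datasets:
#   - gsm8k
#   - arc
# limit: 5

run_task(task_cfg='eval_config.yaml')
```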
@@ -384,7 +387,7 @@ To create a public link, set `share=True` in `launch()`.

 For more details, refer to: [📖 Visualization of Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html)

-## 🌐 Evaluation of
+## 🌐 Evaluation of Model API

 Specify the model API service address (api_url) and API Key (api_key) to evaluate the deployed model API service. In this case, the `eval-type` parameter must be specified as `service`, for example:

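The renamed section relies on three settings: the endpoint address, the API key, and `eval-type=service`. As a rough sketch of the equivalent Python invocation, assuming `TaskConfig` exposes `api_url`, `api_key`, and `eval_type` keywords that mirror those CLI flags; the endpoint address and key below are placeholders.

```python
# Sketch only: evaluating a deployed OpenAI-compatible model API service.
# Assumes TaskConfig accepts api_url / api_key / eval_type keywords that
# mirror the CLI flags mentioned above; URL and key are placeholders.
from evalscope import run_task, TaskConfig

task_cfg = TaskConfig(
    model='qwen2.5-7b-instruct',  # model name exposed by the served endpoint
    api_url='http://127.0.0.1:8801/v1/chat/completions',  # placeholder address
    api_key='EMPTY',              # placeholder key
    eval_type='service',          # evaluate a model API service
    datasets=['gsm8k'],
    limit=5,
)
run_task(task_cfg=task_cfg)
```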
@@ -435,7 +438,7 @@ evalscope eval \
 Reference: [Full Parameter Description](https://evalscope.readthedocs.io/en/latest/get_started/parameters.html)


-## Evaluation
+## 🧪 Other Evaluation Backends
 EvalScope supports using third-party evaluation frameworks to initiate evaluation tasks, which we call Evaluation Backend. Currently supported Evaluation Backend includes:
 - **Native**: EvalScope's own **default evaluation framework**, supporting various evaluation modes including single model evaluation, arena mode, and baseline model comparison mode.
 - [OpenCompass](https://github.com/open-compass/opencompass): Initiate OpenCompass evaluation tasks through EvalScope. Lightweight, easy to customize, supports seamless integration with the LLM fine-tuning framework ms-swift. [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/backend/opencompass_backend.html)
@@ -508,6 +511,17 @@ EvalScope, as the official evaluation tool of [ModelScope](https://modelscope.cn
 </table>
 </a>

+## 📚 Citation
+
+```bibtex
+@misc{evalscope_2024,
+    title={{EvalScope}: Evaluation Framework for Large Models},
+    author={ModelScope Team},
+    year={2024},
+    url={https://github.com/modelscope/evalscope}
+}
+```
+
 ## 🔜 Roadmap
 - [x] Support for better evaluation report visualization
 - [x] Support for mixed evaluations across multiple datasets
@@ -523,6 +537,6 @@ EvalScope, as the official evaluation tool of [ModelScope](https://modelscope.cn
 - [x] MBPP


-## Star History
+## ⭐ Star History

 [](https://star-history.com/#modelscope/evalscope&Date)
{evalscope-0.17.0 → evalscope-0.17.1}/README.md
RENAMED
@@ -37,16 +37,17 @@
 - [Basic Parameter](#basic-parameter)
 - [Output Results](#output-results)
 - [📈 Visualization of Evaluation Results](#-visualization-of-evaluation-results)
-- [🌐 Evaluation of
+- [🌐 Evaluation of Model API](#-evaluation-of-model-api)
 - [⚙️ Custom Parameter Evaluation](#️-custom-parameter-evaluation)
-- [Parameter](#parameter)
-- [Evaluation
+- [Parameter Description](#parameter-description)
+- [🧪 Other Evaluation Backends](#-other-evaluation-backends)
 - [📈 Model Serving Performance Evaluation](#-model-serving-performance-evaluation)
 - [🖊️ Custom Dataset Evaluation](#️-custom-dataset-evaluation)
-- [
+- [⚔️ Arena Mode](#️-arena-mode)
 - [👷♂️ Contribution](#️-contribution)
+- [📚 Citation](#-citation)
 - [🔜 Roadmap](#-roadmap)
-- [Star History](
+- [⭐ Star History](#-star-history)


 ## 📝 Introduction
@@ -110,7 +111,9 @@ Please scan the QR code below to join our community groups:


 ## 🎉 News
-
+- 🔥 **[2025.07.18]** The model stress testing now supports randomly generating image-text data for multimodal model evaluation. For usage instructions, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/examples.html#id4).
+- 🔥 **[2025.07.16]** Support for [τ-bench](https://github.com/sierra-research/tau-bench) has been added, enabling the evaluation of AI Agent performance and reliability in real-world scenarios involving dynamic user and tool interactions. For usage instructions, please refer to the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/get_started/supported_dataset/llm.html#bench).
+- 🔥 **[2025.07.14]** Support for "Humanity's Last Exam" ([Humanity's-Last-Exam](https://modelscope.cn/datasets/cais/hle)), a highly challenging evaluation benchmark. For usage instructions, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset/llm.html#humanity-s-last-exam).
 - 🔥 **[2025.07.03]** Refactored Arena Mode: now supports custom model battles, outputs a model leaderboard, and provides battle result visualization. See [reference](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html) for details.
 - 🔥 **[2025.06.28]** Optimized custom dataset evaluation: now supports evaluation without reference answers. Enhanced LLM judge usage, with built-in modes for "scoring directly without reference answers" and "checking answer consistency with reference answers". See [reference](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset/llm.html#qa) for details.
 - 🔥 **[2025.06.19]** Added support for the [BFCL-v3](https://modelscope.cn/datasets/AI-ModelScope/bfcl_v3) benchmark, designed to evaluate model function-calling capabilities across various scenarios. For more information, refer to the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/third_party/bfcl_v3.html).
@@ -122,6 +125,8 @@ Please scan the QR code below to join our community groups:
 - 🔥 **[2025.04.27]** Support for text-to-image evaluation: Supports 8 metrics including MPS, HPSv2.1Score, etc., and evaluation benchmarks such as EvalMuse, GenAI-Bench. Refer to the [user documentation](https://evalscope.readthedocs.io/en/latest/user_guides/aigc/t2i.html) for more details.
 - 🔥 **[2025.04.10]** Model service stress testing tool now supports the `/v1/completions` endpoint (the default endpoint for vLLM benchmarking)
 - 🔥 **[2025.04.08]** Support for evaluating embedding model services compatible with the OpenAI API has been added. For more details, check the [user guide](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/mteb.html#configure-evaluation-parameters).
+<details><summary>More</summary>
+
 - 🔥 **[2025.03.27]** Added support for [AlpacaEval](https://www.modelscope.cn/datasets/AI-ModelScope/alpaca_eval/dataPeview) and [ArenaHard](https://modelscope.cn/datasets/AI-ModelScope/arena-hard-auto-v0.1/summary) evaluation benchmarks. For usage notes, please refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset.html)
 - 🔥 **[2025.03.20]** The model inference service stress testing now supports generating prompts of specified length using random values. Refer to the [user guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/examples.html#using-the-random-dataset) for more details.
 - 🔥 **[2025.03.13]** Added support for the [LiveCodeBench](https://www.modelscope.cn/datasets/AI-ModelScope/code_generation_lite/summary) code evaluation benchmark, which can be used by specifying `live_code_bench`. Supports evaluating QwQ-32B on LiveCodeBench, refer to the [best practices](https://evalscope.readthedocs.io/en/latest/best_practice/eval_qwq.html).
@@ -131,8 +136,6 @@ Please scan the QR code below to join our community groups:
 - 🔥 **[2025.03.03]** Added support for evaluating the IQ and EQ of models. Refer to [📖 Best Practices for IQ and EQ Evaluation](https://evalscope.readthedocs.io/en/latest/best_practice/iquiz.html) to find out how smart your AI is!
 - 🔥 **[2025.02.27]** Added support for evaluating the reasoning efficiency of models. Refer to [📖 Best Practices for Evaluating Thinking Efficiency](https://evalscope.readthedocs.io/en/latest/best_practice/think_eval.html). This implementation is inspired by the works [Overthinking](https://doi.org/10.48550/arXiv.2412.21187) and [Underthinking](https://doi.org/10.48550/arXiv.2501.18585).
 - 🔥 **[2025.02.25]** Added support for two model inference-related evaluation benchmarks: [MuSR](https://modelscope.cn/datasets/AI-ModelScope/MuSR) and [ProcessBench](https://www.modelscope.cn/datasets/Qwen/ProcessBench/summary). To use them, simply specify `musr` and `process_bench` respectively in the datasets parameter.
-<details><summary>More</summary>
-
 - 🔥 **[2025.02.18]** Supports the AIME25 dataset, which contains 15 questions (Grok3 scored 93 on this dataset).
 - 🔥 **[2025.02.13]** Added support for evaluating DeepSeek distilled models, including AIME24, MATH-500, and GPQA-Diamond datasets,refer to [best practice](https://evalscope.readthedocs.io/en/latest/best_practice/deepseek_r1_distill.html); Added support for specifying the `eval_batch_size` parameter to accelerate model evaluation.
 - 🔥 **[2025.01.20]** Support for visualizing evaluation results, including single model evaluation results and multi-model comparison, refer to the [📖 Visualizing Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html) for more details; Added [`iquiz`](https://modelscope.cn/datasets/AI-ModelScope/IQuiz/summary) evaluation example, evaluating the IQ and EQ of the model.
@@ -228,33 +231,31 @@ evalscope eval \

 When using Python code for evaluation, you need to submit the evaluation task using the `run_task` function, passing a `TaskConfig` as a parameter. It can also be a Python dictionary, yaml file path, or json file path, for example:

-**Using
+**Using `TaskConfig`**

 ```python
-from evalscope
+from evalscope import run_task, TaskConfig

-task_cfg =
-
-
-
-
+task_cfg = TaskConfig(
+    model='Qwen/Qwen2.5-0.5B-Instruct',
+    datasets=['gsm8k', 'arc'],
+    limit=5
+)

 run_task(task_cfg=task_cfg)
 ```
-
 <details><summary>More Startup Methods</summary>

-**Using
+**Using Python Dictionary**

 ```python
 from evalscope.run import run_task
-from evalscope.config import TaskConfig

-task_cfg =
-model
-datasets
-limit
-
+task_cfg = {
+    'model': 'Qwen/Qwen2.5-0.5B-Instruct',
+    'datasets': ['gsm8k', 'arc'],
+    'limit': 5
+}

 run_task(task_cfg=task_cfg)
 ```
@@ -357,7 +358,7 @@ To create a public link, set `share=True` in `launch()`.

 For more details, refer to: [📖 Visualization of Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html)

-## 🌐 Evaluation of
+## 🌐 Evaluation of Model API

 Specify the model API service address (api_url) and API Key (api_key) to evaluate the deployed model API service. In this case, the `eval-type` parameter must be specified as `service`, for example:

@@ -408,7 +409,7 @@ evalscope eval \
 Reference: [Full Parameter Description](https://evalscope.readthedocs.io/en/latest/get_started/parameters.html)


-## Evaluation
+## 🧪 Other Evaluation Backends
 EvalScope supports using third-party evaluation frameworks to initiate evaluation tasks, which we call Evaluation Backend. Currently supported Evaluation Backend includes:
 - **Native**: EvalScope's own **default evaluation framework**, supporting various evaluation modes including single model evaluation, arena mode, and baseline model comparison mode.
 - [OpenCompass](https://github.com/open-compass/opencompass): Initiate OpenCompass evaluation tasks through EvalScope. Lightweight, easy to customize, supports seamless integration with the LLM fine-tuning framework ms-swift. [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/backend/opencompass_backend.html)
@@ -481,6 +482,17 @@ EvalScope, as the official evaluation tool of [ModelScope](https://modelscope.cn
 </table>
 </a>

+## 📚 Citation
+
+```bibtex
+@misc{evalscope_2024,
+    title={{EvalScope}: Evaluation Framework for Large Models},
+    author={ModelScope Team},
+    year={2024},
+    url={https://github.com/modelscope/evalscope}
+}
+```
+
 ## 🔜 Roadmap
 - [x] Support for better evaluation report visualization
 - [x] Support for mixed evaluations across multiple datasets
@@ -496,6 +508,6 @@ EvalScope, as the official evaluation tool of [ModelScope](https://modelscope.cn
 - [x] MBPP


-## Star History
+## ⭐ Star History

 [](https://star-history.com/#modelscope/evalscope&Date)
{evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/bfcl/bfcl_adapter.py
RENAMED
@@ -35,7 +35,7 @@ SUBJECT_MAPPING = {
 @Benchmark.register(
     name='bfcl_v3',
     pretty_name='BFCL-v3',
-    tags=['Agent'],
+    tags=['Agent', 'Function Calling'],
     description=
     'Berkeley Function Calling Leaderboard (BFCL), the **first comprehensive and executable function call evaluation** '
     'dedicated to assessing Large Language Models\' (LLMs) ability to invoke functions. Unlike previous evaluations, '
{evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/data_adapter.py
RENAMED
@@ -168,6 +168,11 @@ class DataAdapter(ABC):
         If you want to support local dataset, please rewrite this method in xxx_data_adapter.
         Use modelscope.msdatasets.MsDataset.load to load the dataset from local by default.
         """
+        # remove dataset_infos.json file if exists, since MsDataset will occur an error if it exists.
+        dataset_infos_path = os.path.join(dataset_name_or_path, 'dataset_infos.json')
+        if os.path.exists(dataset_infos_path):
+            logger.info(f'Removing dataset_infos.json file at {dataset_infos_path} to avoid MsDataset errors.')
+            os.remove(dataset_infos_path)
         return self.load_from_hub(dataset_name_or_path, subset_list, None, **kwargs)

     def load_with_snapshot(self,
@@ -382,7 +387,7 @@ class DataAdapter(ABC):
         pass

     def gen_prompt_data(self,
-                        prompt: str,
+                        prompt: str = '',
                         system_prompt: Optional[str] = None,
                         choices: Optional[List[str]] = None,
                         index: Optional[Union[int, str]] = None,
@@ -413,7 +418,8 @@ class DataAdapter(ABC):
             system_prompt=system_prompt or self.system_prompt,
             index=index or 0,
             id=id,
-            messages=messages
+            messages=messages,
+            extra_data=kwargs.get('extra_data', None))
         return prompt_data.to_dict()

     def gen_prompt(self, input_d: dict, subset_name: str, few_shot_list: list, **kwargs) -> Any:
@@ -477,7 +483,6 @@ class DataAdapter(ABC):
         """
         return result

-    @abstractmethod
     def match(self, gold: Any, pred: Any) -> Any:
         """
         Match the gold answer and the predicted answer.
@@ -491,7 +496,7 @@ class DataAdapter(ABC):
         Returns:
             The match result. Usually a score (float) for chat/multiple-choice-questions.
         """
-
+        return 1.0 if gold == pred else 0.0

     def llm_match(self, gold: Any, pred: Any, judge: Optional[LLMJudge] = None, **kwargs) -> float:
         """
{evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/general_mcq/general_mcq_adapter.py
RENAMED
@@ -17,7 +17,8 @@ logger = get_logger()
 @Benchmark.register(
     name='general_mcq',
     pretty_name='General-MCQ',
-    description='A general multiple-choice question answering dataset.'
+    description='A general multiple-choice question answering dataset for custom evaluation. '
+    'For detailed instructions on how to use this benchmark, please refer to the [User Guide](https://evalscope.readthedocs.io/zh-cn/latest/advanced_guides/custom_dataset/llm.html#mcq).',
     tags=['MCQ', 'Custom'],
     dataset_id='general_mcq',
     model_adapter=OutputType.GENERATION,
{evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/general_qa/general_qa_adapter.py
RENAMED
@@ -14,7 +14,8 @@ logger = get_logger()
 @Benchmark.register(
     name='general_qa',
     pretty_name='General-QA',
-    description='
+    description='A general question answering dataset for custom evaluation. '
+    'For detailed instructions on how to use this benchmark, please refer to the [User Guide](https://evalscope.readthedocs.io/zh-cn/latest/advanced_guides/custom_dataset/llm.html#qa).',  # noqa: E501
     tags=['QA', 'Custom'],
     dataset_id='general_qa',
     subset_list=['default'],
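The updated description points custom QA users to the linked User Guide. Below is a rough sketch of wiring `general_qa` to a local dataset; the `dataset_args` / `local_path` / `subset_list` names follow my reading of that guide and should be treated as assumptions to verify there.

```python
# Rough sketch of pointing the general_qa benchmark at a local custom dataset.
# The dataset_args / local_path / subset_list names are assumptions drawn from
# the linked User Guide; verify the exact schema there before relying on this.
from evalscope import run_task, TaskConfig

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',
    datasets=['general_qa'],
    dataset_args={
        'general_qa': {
            'local_path': 'custom_eval/text/qa',  # illustrative folder holding the JSONL files
            'subset_list': ['example'],           # assumed to match JSONL file names without extension
        }
    },
)
run_task(task_cfg=task_cfg)
```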
evalscope-0.17.1/evalscope/benchmarks/hle/hle_adapter.py
ADDED
@@ -0,0 +1,118 @@
+import re
+from collections import defaultdict
+from typing import Any, List
+
+from evalscope.benchmarks import Benchmark, DataAdapter
+from evalscope.metrics import DEFAULT_PROMPT_TEMPLATE, LLMJudge, exact_match, mean
+from evalscope.utils.logger import get_logger
+
+# flake8: noqa
+
+logger = get_logger()
+
+SUBSET_LIST = [
+    'Biology/Medicine',
+    'Chemistry',
+    'Computer Science/AI',
+    'Engineering',
+    'Humanities/Social Science',
+    'Math',
+    'Physics',
+    'Other',
+]
+
+
+@Benchmark.register(
+    name='hle',
+    pretty_name="Humanity's-Last-Exam",
+    tags=['Knowledge', 'QA'],
+    description=
+    'Humanity\'s Last Exam (HLE) is a language model benchmark consisting of 2,500 questions across a broad range of subjects. It was created jointly by the Center for AI Safety and Scale AI. The benchmark classifies the questions into the following broad subjects: mathematics (41%), physics (9%), biology/medicine (11%), humanities/social science (9%), computer science/artificial intelligence (10%), engineering (4%), chemistry (7%), and other (9%). Around 14% of the questions require the ability to understand both text and images, i.e., multi-modality. 24% of the questions are multiple-choice; the rest are short-answer, exact-match questions.',  # noqa: E501
+    dataset_id='cais/hle',
+    subset_list=SUBSET_LIST,
+    metric_list=['AverageAccuracy'],
+    few_shot_num=0,
+    train_split=None,
+    eval_split='test',
+    prompt_template='{query}\n\nPlease reason step by step, and put your final answer within \\boxed{{}}.',
+)
+class HLEAdapter(DataAdapter):
+
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+
+        self.llm_as_a_judge = True
+
+    def load(self, **kwargs):
+        kwargs['subset_list'] = ['default']
+        data_dict = super().load(**kwargs)
+        return self.reformat_subset(data_dict, subset_key='category', format='{}')
+
+    def gen_prompt(self, input_d: dict, subset_name: str, few_shot_list: list, **kwargs) -> dict:
+        # remove image preview
+        input_d.pop('image_preview', None)
+        input_d.pop('rationale_image', None)
+        # generate prompt
+        question = input_d['question']
+        prompt = self.prompt_template.format(query=question)
+        image = input_d.get('image', None)
+        # build messages for multi-modal input
+        messages = []
+        if self.system_prompt:
+            messages.append({'role': 'system', 'content': self.system_prompt})
+        if image:
+            messages.append({
+                'role':
+                'user',
+                'content': [{
+                    'type': 'text',
+                    'text': prompt
+                }, {
+                    'type': 'image_url',
+                    'image_url': {
+                        'url': image
+                    }
+                }]
+            })
+        else:
+            messages.append({'role': 'user', 'content': prompt})
+        return self.gen_prompt_data(prompt='', messages=messages)
+
+    def get_gold_answer(self, input_d: dict) -> str:
+        return input_d['answer']
+
+    def parse_pred_result(self, result: str, raw_input_d: dict = None, **kwargs) -> str:
+        # Extract the answer from the model output \boxed{answer}
+        match = re.search(r'\\boxed{([^}]*)}', result)
+        if match:
+            return match.group(1).strip()
+        else:
+            logger.warning(f'No answer found in the model output: {result}')
+            return ''
+
+    def llm_parse_pred_result(self, result, raw_input_d=None, **kwargs) -> str:
+        return result.strip()
+
+    def match(self, gold: str, pred: str) -> dict:
+        # simple match
+        return {
+            'AverageAccuracy': 1.0 if exact_match(gold, pred) else 0.0,
+        }
+
+    def llm_match(self, gold: Any, pred: Any, judge: LLMJudge, **kwargs) -> dict:
+        raw_input = kwargs.get('raw_input', None)
+        question = raw_input['question']
+        # get grading response
+        prompt = judge.build_prompt(pred, gold, question)
+        judge_response = judge(prompt)
+        score = judge.get_score(judge_response)
+        return {
+            'AverageAccuracy': score,
+            'response': judge_response,
+        }
+
+    def compute_metric(self, review_res_list: List[dict], **kwargs) -> List[dict]:
+        # zip dict answers
+        res_dict = super().compute_dict_metric(review_res_list, **kwargs)
+
+        return super().compute_metric(res_dict, **kwargs)
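The new adapter registers the benchmark under the name `hle` and sets `llm_as_a_judge = True`, so grading normally runs through an LLM judge. Below is a sketch of invoking it; `judge_model_args` is an assumed parameter name for configuring the judge model, and the endpoint details are placeholders, so consult the EvalScope parameter docs for the authoritative spelling.

```python
# Sketch: running the newly added Humanity's Last Exam benchmark.
# The dataset name 'hle' comes from the registration above; judge_model_args
# is an assumed keyword for configuring the grading model (placeholder values).
from evalscope import run_task, TaskConfig

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-7B-Instruct',
    datasets=['hle'],          # registered name of Humanity's Last Exam
    limit=10,                  # small sample for a quick smoke test
    judge_model_args={         # assumed parameter name for the LLM judge
        'model_id': 'qwen2.5-72b-instruct',
        'api_url': 'http://127.0.0.1:8801/v1/chat/completions',
        'api_key': 'EMPTY',
    },
)
run_task(task_cfg=task_cfg)
```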
{evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/humaneval/humaneval_adapter.py
RENAMED
@@ -22,7 +22,8 @@ logger = get_logger()
     few_shot_num=0,
     train_split=None,
     eval_split='test',
-    prompt_template=
+    prompt_template=
+    'Read the following function signature and docstring, and fully implement the function described. Your response should only contain the code for this function.\n{query}',  # noqa: E501
     extra_params={
         'num_workers': 4,
         'timeout': 4
@@ -76,26 +77,9 @@ class HumanevalAdapter(DataAdapter):

     @classmethod
     def _postprocess(cls, text: str) -> str:
-
-
-
-            text = text.split('```')[1]  # fall back to default strategy
-        else:
-            text = blocks[0]  # fetch the first code block
-            if not text.startswith('\n'):  # in case starting with ```python
-                text = text[max(text.find('\n') + 1, 0):]
-        if text.strip().startswith('from') or text.strip().startswith('import'):
-            def_idx = text.find('def')
-            if def_idx != -1:
-                text = text[max(text.find('\n', def_idx) + 1, 0):]
-        text = text.split('\n\n')[0]
-        if text.strip().startswith('def'):
-            text = '\n'.join(text.split('\n')[1:])
-        if not text.startswith(' '):
-            if text.startswith(' '):
-                text = ' ' + text.lstrip()
-            else:
-                text = '\n'.join([' ' + line for line in text.split('\n')])
+        blocks = re.findall(r'```\w*\n(.*?)```', text, re.DOTALL)
+        if len(blocks) >= 1:
+            text = blocks[0]
         return text

     def parse_pred_result(self, result: str, raw_input_d: dict = None, eval_type: str = 'checkpoint') -> str:
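The rewritten `_postprocess` drops the old multi-step cleanup heuristics in favour of a single fenced-code-block extraction. A small standalone check of what that regex captures is shown below; the sample completion is illustrative.

```python
# Standalone check of the new code-block extraction used by _postprocess:
# take the body of the first fenced code block, if any (sample text is illustrative).
import re

sample = (
    "Here is the implementation:\n"
    "```python\n"
    "def add(a, b):\n"
    "    return a + b\n"
    "```\n"
    "Hope this helps."
)

blocks = re.findall(r'```\w*\n(.*?)```', sample, re.DOTALL)
text = blocks[0] if blocks else sample  # fall back to the raw completion
print(text)
# def add(a, b):
#     return a + b
```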
{evalscope-0.17.0 → evalscope-0.17.1}/evalscope/benchmarks/mmlu/mmlu_adapter.py
RENAMED
@@ -144,7 +144,7 @@ SUBJECT_MAPPING = {
     output_types=[OutputType.MULTIPLE_CHOICE, OutputType.GENERATION],
     subset_list=SUBSET_LIST,
     metric_list=['AverageAccuracy'],
-    few_shot_num=
+    few_shot_num=0,
     train_split='train',
     eval_split='test',
     prompt_template=