evalscope 0.17.0__tar.gz → 1.0.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release.
This version of evalscope might be problematic.
- {evalscope-0.17.0 → evalscope-1.0.0}/PKG-INFO +120 -70
- evalscope-0.17.0/evalscope.egg-info/PKG-INFO → evalscope-1.0.0/README.md +114 -93
- evalscope-1.0.0/evalscope/__init__.py +8 -0
- evalscope-1.0.0/evalscope/api/benchmark/__init__.py +3 -0
- evalscope-1.0.0/evalscope/api/benchmark/adapters/__init__.py +3 -0
- evalscope-1.0.0/evalscope/api/benchmark/adapters/default_data_adapter.py +683 -0
- evalscope-1.0.0/evalscope/api/benchmark/adapters/multi_choice_adapter.py +83 -0
- evalscope-1.0.0/evalscope/api/benchmark/adapters/text2image_adapter.py +155 -0
- evalscope-1.0.0/evalscope/api/benchmark/benchmark.py +321 -0
- evalscope-1.0.0/evalscope/api/benchmark/meta.py +115 -0
- evalscope-1.0.0/evalscope/api/dataset/__init__.py +2 -0
- evalscope-1.0.0/evalscope/api/dataset/dataset.py +349 -0
- evalscope-1.0.0/evalscope/api/dataset/loader.py +261 -0
- evalscope-1.0.0/evalscope/api/dataset/utils.py +143 -0
- evalscope-1.0.0/evalscope/api/evaluator/__init__.py +3 -0
- evalscope-1.0.0/evalscope/api/evaluator/cache.py +355 -0
- evalscope-1.0.0/evalscope/api/evaluator/evaluator.py +56 -0
- evalscope-1.0.0/evalscope/api/evaluator/state.py +264 -0
- evalscope-1.0.0/evalscope/api/filter/__init__.py +1 -0
- evalscope-1.0.0/evalscope/api/filter/filter.py +72 -0
- evalscope-1.0.0/evalscope/api/messages/__init__.py +11 -0
- evalscope-1.0.0/evalscope/api/messages/chat_message.py +198 -0
- evalscope-1.0.0/evalscope/api/messages/content.py +102 -0
- evalscope-1.0.0/evalscope/api/messages/utils.py +35 -0
- evalscope-1.0.0/evalscope/api/metric/__init__.py +2 -0
- evalscope-1.0.0/evalscope/api/metric/metric.py +55 -0
- evalscope-1.0.0/evalscope/api/metric/scorer.py +105 -0
- evalscope-1.0.0/evalscope/api/mixin/__init__.py +2 -0
- evalscope-1.0.0/evalscope/api/mixin/dataset_mixin.py +105 -0
- evalscope-1.0.0/evalscope/api/mixin/llm_judge_mixin.py +168 -0
- evalscope-1.0.0/evalscope/api/model/__init__.py +12 -0
- evalscope-1.0.0/evalscope/api/model/generate_config.py +157 -0
- evalscope-1.0.0/evalscope/api/model/model.py +383 -0
- evalscope-1.0.0/evalscope/api/model/model_output.py +285 -0
- evalscope-1.0.0/evalscope/api/registry.py +182 -0
- evalscope-1.0.0/evalscope/api/tool/__init__.py +3 -0
- evalscope-1.0.0/evalscope/api/tool/tool_call.py +101 -0
- evalscope-1.0.0/evalscope/api/tool/tool_info.py +173 -0
- evalscope-1.0.0/evalscope/api/tool/utils.py +64 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/app/ui/app_ui.py +2 -1
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/app/ui/multi_model.py +50 -25
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/app/ui/single_model.py +23 -11
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/app/utils/data_utils.py +42 -26
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/app/utils/text_utils.py +0 -2
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/app/utils/visualization.py +9 -4
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/arguments.py +6 -7
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/opencompass/api_meta_template.py +2 -1
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/opencompass/backend_manager.py +6 -3
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/clip_benchmark/dataset_builder.py +10 -10
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/clip_benchmark/task_template.py +8 -4
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/ragas/task_template.py +2 -1
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/ragas/tasks/build_distribution.py +2 -1
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/ragas/tasks/build_transform.py +7 -4
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/ragas/tasks/testset_generation.py +2 -1
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/ragas/tasks/translate_prompt.py +2 -1
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/utils/embedding.py +2 -1
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/utils/llm.py +13 -12
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/__init__.py +0 -2
- evalscope-0.17.0/evalscope/benchmarks/aigc/t2i/general_t2i_adapter.py → evalscope-1.0.0/evalscope/benchmarks/aigc/i2i/general_i2i_adapter.py +1 -15
- evalscope-1.0.0/evalscope/benchmarks/aigc/t2i/evalmuse_adapter.py +76 -0
- evalscope-1.0.0/evalscope/benchmarks/aigc/t2i/genai_bench_adapter.py +53 -0
- evalscope-1.0.0/evalscope/benchmarks/aigc/t2i/general_t2i_adapter.py +42 -0
- evalscope-1.0.0/evalscope/benchmarks/aigc/t2i/hpdv2_adapter.py +47 -0
- evalscope-1.0.0/evalscope/benchmarks/aigc/t2i/tifa_adapter.py +26 -0
- evalscope-1.0.0/evalscope/benchmarks/aime/aime24_adapter.py +50 -0
- evalscope-1.0.0/evalscope/benchmarks/aime/aime25_adapter.py +46 -0
- evalscope-1.0.0/evalscope/benchmarks/alpaca_eval/alpaca_eval_adapter.py +133 -0
- evalscope-1.0.0/evalscope/benchmarks/arc/arc_adapter.py +46 -0
- evalscope-1.0.0/evalscope/benchmarks/arena_hard/arena_hard_adapter.py +148 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/arena_hard/utils.py +37 -1
- evalscope-1.0.0/evalscope/benchmarks/bbh/bbh_adapter.py +175 -0
- evalscope-1.0.0/evalscope/benchmarks/bfcl/bfcl_adapter.py +258 -0
- evalscope-1.0.0/evalscope/benchmarks/bfcl/generation.py +222 -0
- evalscope-1.0.0/evalscope/benchmarks/ceval/ceval_adapter.py +170 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/chinese_simple_qa/csimple_qa_adapter.py +85 -82
- evalscope-1.0.0/evalscope/benchmarks/cmmlu/cmmlu_adapter.py +122 -0
- evalscope-1.0.0/evalscope/benchmarks/competition_math/competition_math_adapter.py +73 -0
- evalscope-1.0.0/evalscope/benchmarks/data_collection/data_collection_adapter.py +210 -0
- evalscope-1.0.0/evalscope/benchmarks/docmath/docmath_adapter.py +143 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/docmath/utils.py +4 -5
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/drop/drop_adapter.py +88 -40
- evalscope-1.0.0/evalscope/benchmarks/frames/frames_adapter.py +174 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/general_arena/general_arena_adapter.py +136 -98
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/general_arena/utils.py +23 -27
- evalscope-1.0.0/evalscope/benchmarks/general_mcq/general_mcq_adapter.py +58 -0
- evalscope-1.0.0/evalscope/benchmarks/general_qa/general_qa_adapter.py +94 -0
- evalscope-1.0.0/evalscope/benchmarks/gpqa/gpqa_adapter.py +90 -0
- evalscope-0.17.0/evalscope/benchmarks/gpqa/chain_of_thought.txt → evalscope-1.0.0/evalscope/benchmarks/gpqa/prompt.py +12 -5
- evalscope-1.0.0/evalscope/benchmarks/gsm8k/gsm8k_adapter.py +76 -0
- evalscope-1.0.0/evalscope/benchmarks/hellaswag/hellaswag_adapter.py +62 -0
- evalscope-1.0.0/evalscope/benchmarks/hle/hle_adapter.py +152 -0
- evalscope-1.0.0/evalscope/benchmarks/humaneval/humaneval_adapter.py +124 -0
- evalscope-1.0.0/evalscope/benchmarks/ifeval/ifeval_adapter.py +83 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/ifeval/instructions.py +109 -64
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/ifeval/instructions_registry.py +1 -1
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/ifeval/utils.py +6 -7
- evalscope-1.0.0/evalscope/benchmarks/iquiz/iquiz_adapter.py +35 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/live_code_bench/evaluate_utils.py +2 -2
- evalscope-1.0.0/evalscope/benchmarks/live_code_bench/live_code_bench_adapter.py +138 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/live_code_bench/load_utils.py +13 -21
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/live_code_bench/testing_util.py +6 -2
- evalscope-1.0.0/evalscope/benchmarks/maritime_bench/maritime_bench_adapter.py +56 -0
- evalscope-1.0.0/evalscope/benchmarks/math_500/math_500_adapter.py +51 -0
- evalscope-1.0.0/evalscope/benchmarks/mmlu/mmlu_adapter.py +107 -0
- evalscope-1.0.0/evalscope/benchmarks/mmlu_pro/mmlu_pro_adapter.py +94 -0
- evalscope-1.0.0/evalscope/benchmarks/mmlu_redux/mmlu_redux_adapter.py +139 -0
- evalscope-1.0.0/evalscope/benchmarks/musr/musr_adapter.py +43 -0
- evalscope-1.0.0/evalscope/benchmarks/needle_haystack/needle_haystack_adapter.py +388 -0
- evalscope-1.0.0/evalscope/benchmarks/process_bench/process_bench_adapter.py +170 -0
- evalscope-1.0.0/evalscope/benchmarks/race/race_adapter.py +49 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/simple_qa/simple_qa_adapter.py +72 -70
- evalscope-0.17.0/evalscope/benchmarks/super_gpqa/five_shot_prompt.txt → evalscope-1.0.0/evalscope/benchmarks/super_gpqa/prompt.py +14 -16
- evalscope-1.0.0/evalscope/benchmarks/super_gpqa/super_gpqa_adapter.py +165 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/super_gpqa/utils.py +2 -1
- evalscope-1.0.0/evalscope/benchmarks/tau_bench/generation.py +147 -0
- evalscope-1.0.0/evalscope/benchmarks/tau_bench/tau_bench_adapter.py +168 -0
- evalscope-1.0.0/evalscope/benchmarks/tool_bench/tool_bench_adapter.py +102 -0
- evalscope-1.0.0/evalscope/benchmarks/trivia_qa/trivia_qa_adapter.py +74 -0
- evalscope-1.0.0/evalscope/benchmarks/truthful_qa/truthful_qa_adapter.py +92 -0
- evalscope-1.0.0/evalscope/benchmarks/winogrande/winogrande_adapter.py +34 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/cli/cli.py +2 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/cli/start_server.py +6 -3
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/collections/__init__.py +2 -10
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/collections/sampler.py +10 -10
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/collections/schema.py +13 -11
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/config.py +95 -54
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/constants.py +34 -82
- evalscope-1.0.0/evalscope/evaluator/__init__.py +3 -0
- evalscope-1.0.0/evalscope/evaluator/evaluator.py +337 -0
- evalscope-1.0.0/evalscope/filters/__init__.py +2 -0
- evalscope-1.0.0/evalscope/filters/extraction.py +126 -0
- evalscope-1.0.0/evalscope/filters/selection.py +57 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/__init__.py +16 -14
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/llm_judge.py +37 -34
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/math_parser.py +27 -22
- evalscope-1.0.0/evalscope/metrics/metric.py +307 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/metrics.py +41 -25
- {evalscope-0.17.0/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models → evalscope-1.0.0/evalscope/metrics/t2v_metrics}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model/clip_model.py +4 -2
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model/cross_modeling.py +9 -13
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/clipscore_models/clip_model.py +2 -1
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/clipscore_models/hpsv2_model.py +3 -2
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/clipscore_models/mps_model.py +2 -1
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/clipscore_models/pickscore_model.py +2 -2
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/itmscore_models/blip2_itm_model.py +2 -1
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/itmscore_models/fga_blip2_model.py +4 -2
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward/ImageReward.py +10 -5
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward/blip_pretrain.py +4 -2
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward_model.py +2 -1
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/language_model/clip_t5.py +15 -9
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/multimodal_encoder/clip_encoder.py +4 -2
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5_model.py +15 -10
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/gpt4v_model.py +9 -6
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/config.py +2 -2
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/gradcam.py +4 -2
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/logger.py +4 -2
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/optims.py +3 -9
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/registry.py +16 -10
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/vqa_tools/vqa.py +3 -2
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/vqa_tools/vqa_eval.py +4 -2
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/__init__.py +8 -4
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/Qformer.py +47 -25
- evalscope-1.0.0/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_qformer.py +12 -7
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_t5.py +23 -17
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_t5_instruct.py +33 -23
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/fga_blip2.py +2 -1
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/modeling_llama.py +46 -30
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/modeling_t5.py +69 -37
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/__init__.py +7 -5
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip.py +6 -4
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_image_text_matching.py +7 -5
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_nlvr.py +3 -2
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_outputs.py +5 -2
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_vqa.py +17 -13
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/nlvr_encoder.py +35 -19
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/clip_vit.py +14 -12
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/eva_vit.py +63 -52
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/med.py +63 -38
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/vit.py +6 -3
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/__init__.py +6 -2
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/randaugment.py +3 -2
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/mm_utils.py +15 -13
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/vqa_model.py +3 -2
- evalscope-1.0.0/evalscope/models/__init__.py +26 -0
- evalscope-1.0.0/evalscope/models/mockllm.py +65 -0
- evalscope-1.0.0/evalscope/models/model_apis.py +47 -0
- evalscope-1.0.0/evalscope/models/modelscope.py +455 -0
- evalscope-1.0.0/evalscope/models/openai_compatible.py +123 -0
- evalscope-1.0.0/evalscope/models/text2image_model.py +124 -0
- evalscope-1.0.0/evalscope/models/utils/openai.py +698 -0
- evalscope-1.0.0/evalscope/perf/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/perf/arguments.py +13 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/perf/benchmark.py +39 -39
- evalscope-1.0.0/evalscope/perf/http_client.py +122 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/perf/main.py +2 -2
- evalscope-1.0.0/evalscope/perf/plugin/__init__.py +3 -0
- evalscope-1.0.0/evalscope/perf/plugin/api/__init__.py +4 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/perf/plugin/api/base.py +22 -4
- evalscope-1.0.0/evalscope/perf/plugin/api/custom_api.py +250 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/perf/plugin/api/dashscope_api.py +4 -10
- evalscope-1.0.0/evalscope/perf/plugin/api/default_api.py +105 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/perf/plugin/api/openai_api.py +28 -28
- evalscope-1.0.0/evalscope/perf/plugin/datasets/__init__.py +10 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/perf/plugin/datasets/base.py +22 -1
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/perf/plugin/datasets/custom.py +4 -2
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/perf/plugin/datasets/flickr8k.py +4 -27
- evalscope-1.0.0/evalscope/perf/plugin/datasets/kontext_bench.py +28 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/perf/plugin/datasets/line_by_line.py +4 -2
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/perf/plugin/datasets/longalpaca.py +4 -2
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/perf/plugin/datasets/openqa.py +6 -3
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/perf/plugin/datasets/random_dataset.py +15 -4
- evalscope-1.0.0/evalscope/perf/plugin/datasets/random_vl_dataset.py +80 -0
- evalscope-1.0.0/evalscope/perf/plugin/registry.py +74 -0
- evalscope-1.0.0/evalscope/perf/utils/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/perf/utils/benchmark_util.py +18 -22
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/perf/utils/db_util.py +81 -60
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/perf/utils/local_server.py +8 -3
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/perf/utils/rich_display.py +16 -10
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/report/__init__.py +2 -2
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/report/combinator.py +18 -12
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/report/generator.py +101 -6
- evalscope-0.17.0/evalscope/report/utils.py → evalscope-1.0.0/evalscope/report/report.py +8 -6
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/run.py +26 -44
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/summarizer.py +1 -1
- evalscope-1.0.0/evalscope/third_party/thinkbench/tools/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/utils/__init__.py +21 -2
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/utils/chat_service.py +2 -1
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/utils/deprecation_utils.py +12 -1
- evalscope-1.0.0/evalscope/utils/function_utils.py +29 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/utils/io_utils.py +110 -5
- evalscope-1.0.0/evalscope/utils/json_schema.py +208 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/utils/logger.py +51 -12
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/utils/model_utils.py +10 -7
- evalscope-1.0.0/evalscope/utils/multi_choices.py +271 -0
- evalscope-1.0.0/evalscope/utils/url_utils.py +65 -0
- evalscope-1.0.0/evalscope/version.py +4 -0
- evalscope-0.17.0/README.md → evalscope-1.0.0/evalscope.egg-info/PKG-INFO +143 -66
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope.egg-info/SOURCES.txt +67 -42
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope.egg-info/requires.txt +30 -9
- {evalscope-0.17.0 → evalscope-1.0.0}/requirements/aigc.txt +1 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/requirements/app.txt +1 -1
- evalscope-1.0.0/requirements/dev.txt +5 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/requirements/framework.txt +7 -4
- {evalscope-0.17.0 → evalscope-1.0.0}/setup.cfg +15 -6
- {evalscope-0.17.0 → evalscope-1.0.0}/setup.py +33 -15
- {evalscope-0.17.0 → evalscope-1.0.0}/tests/aigc/test_t2i.py +22 -4
- evalscope-1.0.0/tests/benchmark/test_eval.py +386 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/tests/cli/test_all.py +21 -7
- {evalscope-0.17.0 → evalscope-1.0.0}/tests/cli/test_collection.py +13 -4
- {evalscope-0.17.0 → evalscope-1.0.0}/tests/cli/test_custom.py +22 -15
- {evalscope-0.17.0 → evalscope-1.0.0}/tests/perf/test_perf.py +29 -2
- evalscope-1.0.0/tests/rag/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/tests/rag/test_clip_benchmark.py +1 -0
- evalscope-1.0.0/tests/vlm/__init__.py +1 -0
- evalscope-0.17.0/evalscope/__init__.py +0 -5
- evalscope-0.17.0/evalscope/benchmarks/aigc/t2i/base.py +0 -56
- evalscope-0.17.0/evalscope/benchmarks/aigc/t2i/evalmuse_adapter.py +0 -78
- evalscope-0.17.0/evalscope/benchmarks/aigc/t2i/genai_bench_adapter.py +0 -58
- evalscope-0.17.0/evalscope/benchmarks/aigc/t2i/hpdv2_adapter.py +0 -57
- evalscope-0.17.0/evalscope/benchmarks/aigc/t2i/tifa_adapter.py +0 -37
- evalscope-0.17.0/evalscope/benchmarks/aime/aime24_adapter.py +0 -52
- evalscope-0.17.0/evalscope/benchmarks/aime/aime25_adapter.py +0 -52
- evalscope-0.17.0/evalscope/benchmarks/alpaca_eval/alpaca_eval_adapter.py +0 -107
- evalscope-0.17.0/evalscope/benchmarks/arc/ai2_arc.py +0 -151
- evalscope-0.17.0/evalscope/benchmarks/arc/arc_adapter.py +0 -159
- evalscope-0.17.0/evalscope/benchmarks/arena_hard/arena_hard_adapter.py +0 -122
- evalscope-0.17.0/evalscope/benchmarks/bbh/bbh_adapter.py +0 -247
- evalscope-0.17.0/evalscope/benchmarks/benchmark.py +0 -81
- evalscope-0.17.0/evalscope/benchmarks/bfcl/bfcl_adapter.py +0 -237
- evalscope-0.17.0/evalscope/benchmarks/ceval/ceval_adapter.py +0 -238
- evalscope-0.17.0/evalscope/benchmarks/ceval/ceval_exam.py +0 -146
- evalscope-0.17.0/evalscope/benchmarks/cmmlu/cmmlu.py +0 -161
- evalscope-0.17.0/evalscope/benchmarks/cmmlu/cmmlu_adapter.py +0 -213
- evalscope-0.17.0/evalscope/benchmarks/cmmlu/samples.jsonl +0 -5
- evalscope-0.17.0/evalscope/benchmarks/competition_math/competition_math.py +0 -79
- evalscope-0.17.0/evalscope/benchmarks/competition_math/competition_math_adapter.py +0 -125
- evalscope-0.17.0/evalscope/benchmarks/data_adapter.py +0 -523
- evalscope-0.17.0/evalscope/benchmarks/data_collection/data_collection_adapter.py +0 -72
- evalscope-0.17.0/evalscope/benchmarks/docmath/docmath_adapter.py +0 -85
- evalscope-0.17.0/evalscope/benchmarks/filters.py +0 -59
- evalscope-0.17.0/evalscope/benchmarks/frames/frames_adapter.py +0 -91
- evalscope-0.17.0/evalscope/benchmarks/general_mcq/general_mcq_adapter.py +0 -118
- evalscope-0.17.0/evalscope/benchmarks/general_qa/general_qa_adapter.py +0 -154
- evalscope-0.17.0/evalscope/benchmarks/gpqa/gpqa_adapter.py +0 -129
- evalscope-0.17.0/evalscope/benchmarks/gsm8k/gsm8k.py +0 -121
- evalscope-0.17.0/evalscope/benchmarks/gsm8k/gsm8k_adapter.py +0 -156
- evalscope-0.17.0/evalscope/benchmarks/hellaswag/hellaswag.py +0 -112
- evalscope-0.17.0/evalscope/benchmarks/hellaswag/hellaswag_adapter.py +0 -151
- evalscope-0.17.0/evalscope/benchmarks/humaneval/humaneval.py +0 -79
- evalscope-0.17.0/evalscope/benchmarks/humaneval/humaneval_adapter.py +0 -109
- evalscope-0.17.0/evalscope/benchmarks/ifeval/ifeval_adapter.py +0 -54
- evalscope-0.17.0/evalscope/benchmarks/iquiz/iquiz_adapter.py +0 -70
- evalscope-0.17.0/evalscope/benchmarks/live_code_bench/live_code_bench_adapter.py +0 -88
- evalscope-0.17.0/evalscope/benchmarks/maritime_bench/maritime_bench_adapter.py +0 -82
- evalscope-0.17.0/evalscope/benchmarks/math_500/math_500_adapter.py +0 -58
- evalscope-0.17.0/evalscope/benchmarks/mmlu/mmlu.py +0 -160
- evalscope-0.17.0/evalscope/benchmarks/mmlu/mmlu_adapter.py +0 -280
- evalscope-0.17.0/evalscope/benchmarks/mmlu/samples.jsonl +0 -5
- evalscope-0.17.0/evalscope/benchmarks/mmlu_pro/mmlu_pro_adapter.py +0 -113
- evalscope-0.17.0/evalscope/benchmarks/mmlu_redux/mmlu_redux_adapter.py +0 -185
- evalscope-0.17.0/evalscope/benchmarks/musr/musr_adapter.py +0 -74
- evalscope-0.17.0/evalscope/benchmarks/needle_haystack/needle_haystack_adapter.py +0 -348
- evalscope-0.17.0/evalscope/benchmarks/process_bench/critique_template.txt +0 -13
- evalscope-0.17.0/evalscope/benchmarks/process_bench/process_bench_adapter.py +0 -102
- evalscope-0.17.0/evalscope/benchmarks/race/race.py +0 -104
- evalscope-0.17.0/evalscope/benchmarks/race/race_adapter.py +0 -135
- evalscope-0.17.0/evalscope/benchmarks/race/samples.jsonl +0 -5
- evalscope-0.17.0/evalscope/benchmarks/super_gpqa/super_gpqa_adapter.py +0 -209
- evalscope-0.17.0/evalscope/benchmarks/super_gpqa/zero_shot_prompt.txt +0 -4
- evalscope-0.17.0/evalscope/benchmarks/tool_bench/tool_bench_adapter.py +0 -75
- evalscope-0.17.0/evalscope/benchmarks/trivia_qa/trivia_qa.py +0 -89
- evalscope-0.17.0/evalscope/benchmarks/trivia_qa/trivia_qa_adapter.py +0 -142
- evalscope-0.17.0/evalscope/benchmarks/truthful_qa/truthful_qa.py +0 -163
- evalscope-0.17.0/evalscope/benchmarks/truthful_qa/truthful_qa_adapter.py +0 -287
- evalscope-0.17.0/evalscope/benchmarks/utils.py +0 -59
- evalscope-0.17.0/evalscope/benchmarks/winogrande/winogrande_adapter.py +0 -60
- evalscope-0.17.0/evalscope/collections/evaluator.py +0 -375
- evalscope-0.17.0/evalscope/evaluator/__init__.py +0 -3
- evalscope-0.17.0/evalscope/evaluator/evaluator.py +0 -481
- evalscope-0.17.0/evalscope/metrics/completion_parsers.py +0 -220
- evalscope-0.17.0/evalscope/metrics/named_metrics.py +0 -55
- evalscope-0.17.0/evalscope/metrics/t2v_metrics/__init__.py +0 -52
- evalscope-0.17.0/evalscope/models/__init__.py +0 -53
- evalscope-0.17.0/evalscope/models/adapters/__init__.py +0 -19
- evalscope-0.17.0/evalscope/models/adapters/base_adapter.py +0 -80
- evalscope-0.17.0/evalscope/models/adapters/bfcl_adapter.py +0 -244
- evalscope-0.17.0/evalscope/models/adapters/chat_adapter.py +0 -204
- evalscope-0.17.0/evalscope/models/adapters/choice_adapter.py +0 -218
- evalscope-0.17.0/evalscope/models/adapters/custom_adapter.py +0 -67
- evalscope-0.17.0/evalscope/models/adapters/server_adapter.py +0 -234
- evalscope-0.17.0/evalscope/models/adapters/t2i_adapter.py +0 -76
- evalscope-0.17.0/evalscope/models/custom/__init__.py +0 -4
- evalscope-0.17.0/evalscope/models/custom/custom_model.py +0 -50
- evalscope-0.17.0/evalscope/models/custom/dummy_model.py +0 -99
- evalscope-0.17.0/evalscope/models/local_model.py +0 -128
- evalscope-0.17.0/evalscope/models/model.py +0 -189
- evalscope-0.17.0/evalscope/models/register.py +0 -55
- evalscope-0.17.0/evalscope/perf/http_client.py +0 -176
- evalscope-0.17.0/evalscope/perf/plugin/__init__.py +0 -2
- evalscope-0.17.0/evalscope/perf/plugin/api/__init__.py +0 -3
- evalscope-0.17.0/evalscope/perf/plugin/api/custom_api.py +0 -92
- evalscope-0.17.0/evalscope/perf/plugin/datasets/__init__.py +0 -7
- evalscope-0.17.0/evalscope/perf/plugin/registry.py +0 -54
- evalscope-0.17.0/evalscope/version.py +0 -4
- evalscope-0.17.0/requirements/dev.txt +0 -5
- evalscope-0.17.0/tests/cli/test_run.py +0 -501
- {evalscope-0.17.0 → evalscope-1.0.0}/LICENSE +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/MANIFEST.in +0 -0
- {evalscope-0.17.0/evalscope/backend → evalscope-1.0.0/evalscope/api}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/app/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/app/app.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/app/arguments.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/app/constants.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/app/ui/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/app/ui/sidebar.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/app/ui/visualization.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/app/utils/localization.py +0 -0
- {evalscope-0.17.0/evalscope/backend/rag_eval/clip_benchmark/tasks → evalscope-1.0.0/evalscope/backend}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/base.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/opencompass/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/opencompass/tasks/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/opencompass/tasks/eval_api.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/opencompass/tasks/eval_datasets.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/backend_manager.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/clip_benchmark/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/clip_benchmark/arguments.py +0 -0
- {evalscope-0.17.0/evalscope/backend/rag_eval/utils → evalscope-1.0.0/evalscope/backend/rag_eval/clip_benchmark/tasks}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/clip_benchmark/tasks/image_caption.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_classification.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_retrieval.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/clip_benchmark/utils/webdataset_convert.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/clip_benchmark/utils/webdatasets.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/cmteb/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/cmteb/arguments.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/cmteb/base.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/cmteb/task_template.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/cmteb/tasks/Classification.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/cmteb/tasks/Clustering.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/cmteb/tasks/CustomTask.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/cmteb/tasks/PairClassification.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/cmteb/tasks/Reranking.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/cmteb/tasks/Retrieval.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/cmteb/tasks/STS.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/cmteb/tasks/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/ragas/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/ragas/arguments.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/ragas/prompts/persona_prompt.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/ragas/tasks/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/aigc → evalscope-1.0.0/evalscope/backend/rag_eval/utils}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/utils/clip.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/rag_eval/utils/tools.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/vlm_eval_kit/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/backend/vlm_eval_kit/backend_manager.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/aigc/t2i → evalscope-1.0.0/evalscope/benchmarks/aigc}/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/aime → evalscope-1.0.0/evalscope/benchmarks/aigc/i2i}/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/alpaca_eval → evalscope-1.0.0/evalscope/benchmarks/aigc/t2i}/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/arena_hard → evalscope-1.0.0/evalscope/benchmarks/aime}/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/bfcl → evalscope-1.0.0/evalscope/benchmarks/alpaca_eval}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/arc/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/chinese_simple_qa → evalscope-1.0.0/evalscope/benchmarks/arena_hard}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/boolean_expressions.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/causal_judgement.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/date_understanding.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/disambiguation_qa.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/dyck_languages.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/formal_fallacies.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/geometric_shapes.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/hyperbaton.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_five_objects.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_seven_objects.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_three_objects.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/movie_recommendation.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/multistep_arithmetic_two.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/navigate.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/object_counting.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/penguins_in_a_table.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/reasoning_about_colored_objects.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/ruin_names.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/salient_translation_error_detection.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/snarks.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/sports_understanding.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/temporal_sequences.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_five_objects.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_seven_objects.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_three_objects.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/web_of_lies.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/bbh/cot_prompts/word_sorting.txt +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/data_collection → evalscope-1.0.0/evalscope/benchmarks/bfcl}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/ceval/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/docmath → evalscope-1.0.0/evalscope/benchmarks/chinese_simple_qa}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/cmmlu/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/competition_math/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/drop → evalscope-1.0.0/evalscope/benchmarks/data_collection}/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/frames → evalscope-1.0.0/evalscope/benchmarks/docmath}/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/general_arena → evalscope-1.0.0/evalscope/benchmarks/drop}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/drop/utils.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/general_mcq → evalscope-1.0.0/evalscope/benchmarks/frames}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/frames/utils.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/gpqa → evalscope-1.0.0/evalscope/benchmarks/general_arena}/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/ifeval → evalscope-1.0.0/evalscope/benchmarks/general_mcq}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/general_qa/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/iquiz → evalscope-1.0.0/evalscope/benchmarks/gpqa}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/gsm8k/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/hellaswag/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/live_code_bench → evalscope-1.0.0/evalscope/benchmarks/hle}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/humaneval/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/maritime_bench → evalscope-1.0.0/evalscope/benchmarks/ifeval}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/ifeval/instructions_util.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/math_500 → evalscope-1.0.0/evalscope/benchmarks/iquiz}/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/mmlu_pro → evalscope-1.0.0/evalscope/benchmarks/live_code_bench}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/live_code_bench/extract_utils.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/live_code_bench/pass_k_utils.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/live_code_bench/prompts.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/mmlu_redux → evalscope-1.0.0/evalscope/benchmarks/maritime_bench}/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/musr → evalscope-1.0.0/evalscope/benchmarks/math_500}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/mmlu/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/needle_haystack → evalscope-1.0.0/evalscope/benchmarks/mmlu_pro}/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/process_bench → evalscope-1.0.0/evalscope/benchmarks/mmlu_redux}/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/simple_qa → evalscope-1.0.0/evalscope/benchmarks/musr}/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/super_gpqa → evalscope-1.0.0/evalscope/benchmarks/needle_haystack}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/needle_haystack/utils.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/tool_bench → evalscope-1.0.0/evalscope/benchmarks/process_bench}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/race/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/benchmarks/winogrande → evalscope-1.0.0/evalscope/benchmarks/simple_qa}/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/metrics/t2v_metrics/models → evalscope-1.0.0/evalscope/benchmarks/super_gpqa}/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model → evalscope-1.0.0/evalscope/benchmarks/tau_bench}/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward → evalscope-1.0.0/evalscope/benchmarks/tool_bench}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/tool_bench/utils.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/trivia_qa/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/trivia_qa/samples.jsonl +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/benchmarks/truthful_qa/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5 → evalscope-1.0.0/evalscope/benchmarks/winogrande}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/cli/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/cli/base.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/cli/start_app.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/cli/start_eval.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/cli/start_perf.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/bundled_rouge_score/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/bundled_rouge_score/rouge_scorer.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/rouge_metric.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/clipscore.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/constants.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/itmscore.py +0 -0
- {evalscope-0.17.0/evalscope/perf → evalscope-1.0.0/evalscope/metrics/t2v_metrics/models}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/clipscore_models/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/perf/utils → evalscope-1.0.0/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model/base_model.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/itmscore_models/__init__.py +0 -0
- {evalscope-0.17.0/evalscope/third_party/thinkbench/tools → evalscope-1.0.0/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/model.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/utils.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/__init__.py +0 -0
- {evalscope-0.17.0/tests/rag → evalscope-1.0.0/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/multimodal_encoder/builder.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/multimodal_projector/builder.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/dist_utils.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/utils.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/vqa_tools/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/default.yaml +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_caption_flant5xl.yaml +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_caption_opt2.7b.yaml +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_caption_opt6.7b.yaml +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_coco.yaml +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_flant5xl.yaml +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_flant5xxl.yaml +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_vicuna13b.yaml +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_vicuna7b.yaml +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain.yaml +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl.yaml +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl_iter_80k_total_100k_no_prefix.yaml +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl_iter_80k_total_100k_prefix.yaml +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl_vitL.yaml +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xxl.yaml +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_opt2.7b.yaml +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_opt6.7b.yaml +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_vitL.yaml +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_vicuna13b.yaml +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_vicuna7b.yaml +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/med_config.json +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/med_config_albef.json +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/med_large_config.json +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/base_model.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_image_text_matching.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_caption.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_classification.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_feature_extractor.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_pretrain.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/base_processor.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/blip_processors.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/score.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/metrics/t2v_metrics/vqascore.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/perf/plugin/datasets/speed_benchmark.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/perf/utils/analysis_result.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/perf/utils/handler.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/perf/utils/log_utils.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/longbench_write/README.md +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/longbench_write/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/longbench_write/default_task.json +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/longbench_write/default_task.yaml +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/longbench_write/eval.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/longbench_write/infer.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/longbench_write/longbench_write.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/longbench_write/resources/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/longbench_write/resources/judge.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/longbench_write/resources/longbench_write.jsonl +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/longbench_write/resources/longbench_write_en.jsonl +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/longbench_write/resources/longwrite_ruler.jsonl +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/longbench_write/tools/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/longbench_write/tools/data_etl.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/longbench_write/tools/openai_api.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/longbench_write/utils.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/thinkbench/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/thinkbench/eval.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/thinkbench/infer.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/thinkbench/resources/critique_template.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/thinkbench/resources/reformat_template.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/thinkbench/tools/llm.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/thinkbench/tools/utils.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/toolbench_static/README.md +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/toolbench_static/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/toolbench_static/config_default.json +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/toolbench_static/config_default.yaml +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/toolbench_static/eval.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/toolbench_static/infer.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/toolbench_static/llm/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/toolbench_static/llm/swift_infer.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/toolbench_static/requirements.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/third_party/toolbench_static/toolbench_static.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/utils/argument_utils.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope/utils/import_utils.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope.egg-info/dependency_links.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope.egg-info/entry_points.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope.egg-info/not-zip-safe +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/evalscope.egg-info/top_level.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/requirements/docs.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/requirements/opencompass.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/requirements/perf.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/requirements/rag.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/requirements/vlmeval.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/requirements.txt +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/tests/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/tests/aigc/__init__.py +0 -0
- {evalscope-0.17.0/tests/cli → evalscope-1.0.0/tests/benchmark}/__init__.py +0 -0
- {evalscope-0.17.0/tests/perf → evalscope-1.0.0/tests/cli}/__init__.py +0 -0
- {evalscope-0.17.0/tests/swift → evalscope-1.0.0/tests/perf}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/tests/rag/test_mteb.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/tests/rag/test_ragas.py +0 -0
- {evalscope-0.17.0/tests/vlm → evalscope-1.0.0/tests/swift}/__init__.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/tests/swift/test_run_swift_eval.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/tests/swift/test_run_swift_vlm_eval.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/tests/swift/test_run_swift_vlm_jugde_eval.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/tests/test_run_all.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/tests/utils.py +0 -0
- {evalscope-0.17.0 → evalscope-1.0.0}/tests/vlm/test_vlmeval.py +0 -0
@@ -1,19 +1,20 @@
 Metadata-Version: 2.1
 Name: evalscope
-Version: 0.17.0
+Version: 1.0.0
 Summary: EvalScope: Lightweight LLMs Evaluation Framework
 Home-page: https://github.com/modelscope/evalscope
 Author: ModelScope team
 Author-email: contact@modelscope.cn
+License: Apache License 2.0
 Keywords: python,llm,evaluation
 Classifier: Development Status :: 4 - Beta
-Classifier: License :: OSI Approved :: Apache Software License
 Classifier: Operating System :: OS Independent
 Classifier: Programming Language :: Python :: 3
-Classifier: Programming Language :: Python :: 3.8
 Classifier: Programming Language :: Python :: 3.9
 Classifier: Programming Language :: Python :: 3.10
-
+Classifier: Programming Language :: Python :: 3.11
+Classifier: Programming Language :: Python :: 3.12
+Requires-Python: >=3.9
 Description-Content-Type: text/markdown
 Provides-Extra: opencompass
 Provides-Extra: vlmeval
@@ -22,6 +23,7 @@ Provides-Extra: perf
 Provides-Extra: app
 Provides-Extra: aigc
 Provides-Extra: dev
+Provides-Extra: docs
 Provides-Extra: all
 License-File: LICENSE

@@ -55,25 +57,26 @@ License-File: LICENSE
 - [📝 Introduction](#-introduction)
 - [☎ User Groups](#-user-groups)
 - [🎉 News](#-news)
-- [🛠️
-- [Method 1
-- [Method 2
+- [🛠️ Environment Setup](#️-environment-setup)
+- [Method 1. Install via pip](#method-1-install-via-pip)
+- [Method 2. Install from source](#method-2-install-from-source)
 - [🚀 Quick Start](#-quick-start)
 - [Method 1. Using Command Line](#method-1-using-command-line)
 - [Method 2. Using Python Code](#method-2-using-python-code)
 - [Basic Parameter](#basic-parameter)
 - [Output Results](#output-results)
 - [📈 Visualization of Evaluation Results](#-visualization-of-evaluation-results)
-- [🌐 Evaluation of
+- [🌐 Evaluation of Model API](#-evaluation-of-model-api)
 - [⚙️ Custom Parameter Evaluation](#️-custom-parameter-evaluation)
-- [Parameter](#parameter)
-- [Evaluation
+- [Parameter Description](#parameter-description)
+- [🧪 Other Evaluation Backends](#-other-evaluation-backends)
 - [📈 Model Serving Performance Evaluation](#-model-serving-performance-evaluation)
 - [🖊️ Custom Dataset Evaluation](#️-custom-dataset-evaluation)
-- [
+- [⚔️ Arena Mode](#️-arena-mode)
 - [👷♂️ Contribution](#️-contribution)
+- [📚 Citation](#-citation)
 - [🔜 Roadmap](#-roadmap)
-- [Star History](
+- [⭐ Star History](#-star-history)


 ## 📝 Introduction
@@ -138,6 +141,15 @@ Please scan the QR code below to join our community groups:
|
|
|
138
141
|
|
|
139
142
|
## 🎉 News
|
|
140
143
|
|
|
144
|
+
> [!IMPORTANT]
|
|
145
|
+
> **Version 1.0 Refactoring**
|
|
146
|
+
>
|
|
147
|
+
> Version 1.0 introduces a major overhaul of the evaluation framework, establishing a new, more modular and extensible API layer under `evalscope/api`. Key improvements include standardized data models for benchmarks, samples, and results; a registry-based design for components such as benchmarks and metrics; and a rewritten core evaluator that orchestrates the new architecture. Existing benchmark adapters have been migrated to this API, resulting in cleaner, more consistent, and easier-to-maintain implementations.
|
|
148
|
+
|
|
149
|
+
- 🔥 **[2025.08.22]** Version 1.0 Refactoring.
|
|
150
|
+
- 🔥 **[2025.07.18]** The model stress testing now supports randomly generating image-text data for multimodal model evaluation. For usage instructions, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/examples.html#id4).
|
|
151
|
+
- 🔥 **[2025.07.16]** Support for [τ-bench](https://github.com/sierra-research/tau-bench) has been added, enabling the evaluation of AI Agent performance and reliability in real-world scenarios involving dynamic user and tool interactions. For usage instructions, please refer to the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/get_started/supported_dataset/llm.html#bench).
|
|
152
|
+
- 🔥 **[2025.07.14]** Support for "Humanity's Last Exam" ([Humanity's-Last-Exam](https://modelscope.cn/datasets/cais/hle)), a highly challenging evaluation benchmark. For usage instructions, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset/llm.html#humanity-s-last-exam).
|
|
141
153
|
- 🔥 **[2025.07.03]** Refactored Arena Mode: now supports custom model battles, outputs a model leaderboard, and provides battle result visualization. See [reference](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html) for details.
|
|
142
154
|
- 🔥 **[2025.06.28]** Optimized custom dataset evaluation: now supports evaluation without reference answers. Enhanced LLM judge usage, with built-in modes for "scoring directly without reference answers" and "checking answer consistency with reference answers". See [reference](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset/llm.html#qa) for details.
|
|
143
155
|
- 🔥 **[2025.06.19]** Added support for the [BFCL-v3](https://modelscope.cn/datasets/AI-ModelScope/bfcl_v3) benchmark, designed to evaluate model function-calling capabilities across various scenarios. For more information, refer to the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/third_party/bfcl_v3.html).
|
|
@@ -145,6 +157,8 @@ Please scan the QR code below to join our community groups:
|
|
|
145
157
|
- 🔥 **[2025.05.29]** Added support for two long document evaluation benchmarks: [DocMath](https://modelscope.cn/datasets/yale-nlp/DocMath-Eval/summary) and [FRAMES](https://modelscope.cn/datasets/iic/frames/summary). For usage guidelines, please refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset.html).
|
|
146
158
|
- 🔥 **[2025.05.16]** Model service performance stress testing now supports setting various levels of concurrency and outputs a performance test report. [Reference example](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#id3).
|
|
147
159
|
- 🔥 **[2025.05.13]** Added support for the [ToolBench-Static](https://modelscope.cn/datasets/AI-ModelScope/ToolBench-Static) dataset to evaluate model's tool-calling capabilities. Refer to the [documentation](https://evalscope.readthedocs.io/en/latest/third_party/toolbench.html) for usage instructions. Also added support for the [DROP](https://modelscope.cn/datasets/AI-ModelScope/DROP/dataPeview) and [Winogrande](https://modelscope.cn/datasets/AI-ModelScope/winogrande_val) benchmarks to assess the reasoning capabilities of models.
|
|
160
|
+
<details><summary>More</summary>
|
|
161
|
+
|
|
148
162
|
- 🔥 **[2025.04.29]** Added Qwen3 Evaluation Best Practices, [welcome to read 📖](https://evalscope.readthedocs.io/en/latest/best_practice/qwen3.html)
|
|
149
163
|
- 🔥 **[2025.04.27]** Support for text-to-image evaluation: Supports 8 metrics including MPS, HPSv2.1Score, etc., and evaluation benchmarks such as EvalMuse, GenAI-Bench. Refer to the [user documentation](https://evalscope.readthedocs.io/en/latest/user_guides/aigc/t2i.html) for more details.
|
|
150
164
|
- 🔥 **[2025.04.10]** Model service stress testing tool now supports the `/v1/completions` endpoint (the default endpoint for vLLM benchmarking)
|
|
@@ -158,8 +172,6 @@ Please scan the QR code below to join our community groups:
|
|
|
158
172
|
- 🔥 **[2025.03.03]** Added support for evaluating the IQ and EQ of models. Refer to [📖 Best Practices for IQ and EQ Evaluation](https://evalscope.readthedocs.io/en/latest/best_practice/iquiz.html) to find out how smart your AI is!
|
|
159
173
|
- 🔥 **[2025.02.27]** Added support for evaluating the reasoning efficiency of models. Refer to [📖 Best Practices for Evaluating Thinking Efficiency](https://evalscope.readthedocs.io/en/latest/best_practice/think_eval.html). This implementation is inspired by the works [Overthinking](https://doi.org/10.48550/arXiv.2412.21187) and [Underthinking](https://doi.org/10.48550/arXiv.2501.18585).
|
|
160
174
|
- 🔥 **[2025.02.25]** Added support for two model inference-related evaluation benchmarks: [MuSR](https://modelscope.cn/datasets/AI-ModelScope/MuSR) and [ProcessBench](https://www.modelscope.cn/datasets/Qwen/ProcessBench/summary). To use them, simply specify `musr` and `process_bench` respectively in the datasets parameter.
|
|
161
|
-
<details><summary>More</summary>
|
|
162
|
-
|
|
163
175
|
- 🔥 **[2025.02.18]** Supports the AIME25 dataset, which contains 15 questions (Grok3 scored 93 on this dataset).
|
|
164
176
|
- 🔥 **[2025.02.13]** Added support for evaluating DeepSeek distilled models, including AIME24, MATH-500, and GPQA-Diamond datasets,refer to [best practice](https://evalscope.readthedocs.io/en/latest/best_practice/deepseek_r1_distill.html); Added support for specifying the `eval_batch_size` parameter to accelerate model evaluation.
|
|
165
177
|
- 🔥 **[2025.01.20]** Support for visualizing evaluation results, including single model evaluation results and multi-model comparison, refer to the [📖 Visualizing Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html) for more details; Added [`iquiz`](https://modelscope.cn/datasets/AI-ModelScope/IQuiz/summary) evaluation example, evaluating the IQ and EQ of the model.
|
|
@@ -183,58 +195,87 @@ Please scan the QR code below to join our community groups:
|
|
|
183
195
|
|
|
184
196
|
</details>
|
|
185
197
|
|
|
186
|
-
## 🛠️
|
|
187
|
-
|
|
188
|
-
|
|
198
|
+
## 🛠️ Environment Setup
|
|
199
|
+
|
|
200
|
+
### Method 1. Install via pip
|
|
201
|
+
|
|
202
|
+
We recommend using conda to manage your environment and pip to install dependencies. This allows you to use the latest evalscope PyPI package.
|
|
189
203
|
|
|
190
204
|
1. Create a conda environment (optional)
|
|
205
|
+
```shell
|
|
206
|
+
# Python 3.10 is recommended
|
|
207
|
+
conda create -n evalscope python=3.10
|
|
208
|
+
|
|
209
|
+
# Activate the conda environment
|
|
210
|
+
conda activate evalscope
|
|
211
|
+
```
|
|
212
|
+
2. Install dependencies via pip
|
|
213
|
+
```shell
|
|
214
|
+
pip install evalscope
|
|
215
|
+
```
|
|
216
|
+
3. Install additional dependencies (optional)
|
|
217
|
+
- To use model service inference benchmarking features, install the perf dependency:
|
|
191
218
|
```shell
|
|
192
|
-
|
|
193
|
-
conda create -n evalscope python=3.10
|
|
194
|
-
# Activate the conda environment
|
|
195
|
-
conda activate evalscope
|
|
219
|
+
pip install 'evalscope[perf]'
|
|
196
220
|
```
|
|
197
|
-
|
|
198
|
-
|
|
221
|
+
- To use visualization features, install the app dependency:
|
|
222
|
+
```shell
|
|
223
|
+
pip install 'evalscope[app]'
|
|
224
|
+
```
|
|
225
|
+
- If you need to use other evaluation backends, you can install OpenCompass, VLMEvalKit, or RAGEval as needed:
|
|
199
226
|
```shell
|
|
200
|
-
pip install evalscope
|
|
201
|
-
|
|
202
|
-
pip install 'evalscope[
|
|
203
|
-
|
|
204
|
-
|
|
205
|
-
|
|
206
|
-
pip install 'evalscope[
|
|
207
|
-
pip install 'evalscope[all]' # Install all backends (Native, OpenCompass, VLMEvalKit, RAGEval)
|
|
227
|
+
pip install 'evalscope[opencompass]'
|
|
228
|
+
pip install 'evalscope[vlmeval]'
|
|
229
|
+
pip install 'evalscope[rag]'
|
|
230
|
+
```
|
|
231
|
+
- To install all dependencies:
|
|
232
|
+
```shell
|
|
233
|
+
pip install 'evalscope[all]'
|
|
208
234
|
```
|
|
209
235
|
|
|
210
|
-
> [!
|
|
211
|
-
>
|
|
236
|
+
> [!NOTE]
|
|
237
|
+
> The project has been renamed to `evalscope`. For version `v0.4.3` or earlier, you can install it with:
|
|
212
238
|
> ```shell
|
|
213
|
-
>
|
|
239
|
+
> pip install llmuses<=0.4.3
|
|
214
240
|
> ```
|
|
215
|
-
>
|
|
216
|
-
> ```
|
|
241
|
+
> Then, import related dependencies using `llmuses`:
|
|
242
|
+
> ```python
|
|
217
243
|
> from llmuses import ...
|
|
218
244
|
> ```
|
|
219
245
|
|
|
220
|
-
### Method 2
|
|
221
|
-
1. Download the source code
|
|
222
|
-
```shell
|
|
223
|
-
git clone https://github.com/modelscope/evalscope.git
|
|
224
|
-
```
|
|
246
|
+
### Method 2. Install from source
|
|
225
247
|
|
|
248
|
+
Installing from source allows you to use the latest code and makes it easier for further development and debugging.
|
|
249
|
+
|
|
250
|
+
1. Clone the source code
|
|
251
|
+
```shell
|
|
252
|
+
git clone https://github.com/modelscope/evalscope.git
|
|
253
|
+
```
|
|
226
254
|
2. Install dependencies
|
|
227
|
-
|
|
228
|
-
|
|
229
|
-
|
|
230
|
-
|
|
231
|
-
|
|
232
|
-
|
|
233
|
-
|
|
234
|
-
|
|
235
|
-
|
|
236
|
-
|
|
237
|
-
|
|
255
|
+
```shell
|
|
256
|
+
cd evalscope/
|
|
257
|
+
|
|
258
|
+
pip install -e .
|
|
259
|
+
```
|
|
260
|
+
3. Install additional dependencies
|
|
261
|
+
- To use model service inference benchmarking features, install the perf dependency:
|
|
262
|
+
```shell
|
|
263
|
+
pip install '.[perf]'
|
|
264
|
+
```
|
|
265
|
+
- To use visualization features, install the app dependency:
|
|
266
|
+
```shell
|
|
267
|
+
pip install '.[app]'
|
|
268
|
+
```
|
|
269
|
+
- If you need to use other evaluation backends, you can install OpenCompass, VLMEvalKit, or RAGEval as needed:
|
|
270
|
+
```shell
|
|
271
|
+
pip install '.[opencompass]'
|
|
272
|
+
pip install '.[vlmeval]'
|
|
273
|
+
pip install '.[rag]'
|
|
274
|
+
```
|
|
275
|
+
- To install all dependencies:
|
|
276
|
+
```shell
|
|
277
|
+
pip install '.[all]'
|
|
278
|
+
```
|
|
238
279
|
|
|
239
280
|
|
|
240
281
|
## 🚀 Quick Start
|
|
@@ -255,33 +296,31 @@ evalscope eval \
|
|
|
255
296
|
|
|
256
297
|
When using Python code for evaluation, you need to submit the evaluation task using the `run_task` function, passing a `TaskConfig` as a parameter. It can also be a Python dictionary, yaml file path, or json file path, for example:
|
|
257
298
|
|
|
258
|
-
**Using
|
|
299
|
+
**Using `TaskConfig`**
|
|
259
300
|
|
|
260
301
|
```python
|
|
261
|
-
from evalscope
|
|
302
|
+
from evalscope import run_task, TaskConfig
|
|
262
303
|
|
|
263
|
-
task_cfg =
|
|
264
|
-
|
|
265
|
-
|
|
266
|
-
|
|
267
|
-
|
|
304
|
+
task_cfg = TaskConfig(
|
|
305
|
+
model='Qwen/Qwen2.5-0.5B-Instruct',
|
|
306
|
+
datasets=['gsm8k', 'arc'],
|
|
307
|
+
limit=5
|
|
308
|
+
)
|
|
268
309
|
|
|
269
310
|
run_task(task_cfg=task_cfg)
|
|
270
311
|
```
|
|
271
|
-
|
|
272
312
|
<details><summary>More Startup Methods</summary>
|
|
273
313
|
|
|
274
|
-
**Using
|
|
314
|
+
**Using Python Dictionary**
|
|
275
315
|
|
|
276
316
|
```python
|
|
277
317
|
from evalscope.run import run_task
|
|
278
|
-
from evalscope.config import TaskConfig
|
|
279
318
|
|
|
280
|
-
task_cfg =
|
|
281
|
-
model
|
|
282
|
-
datasets
|
|
283
|
-
limit
|
|
284
|
-
|
|
319
|
+
task_cfg = {
|
|
320
|
+
'model': 'Qwen/Qwen2.5-0.5B-Instruct',
|
|
321
|
+
'datasets': ['gsm8k', 'arc'],
|
|
322
|
+
'limit': 5
|
|
323
|
+
}
|
|
285
324
|
|
|
286
325
|
run_task(task_cfg=task_cfg)
|
|
287
326
|
```
|
|
@@ -384,7 +423,7 @@ To create a public link, set `share=True` in `launch()`.
|
|
|
384
423
|
|
|
385
424
|
For more details, refer to: [📖 Visualization of Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html)
|
|
386
425
|
|
|
387
|
-
## 🌐 Evaluation of
|
|
426
|
+
## 🌐 Evaluation of Model API
|
|
388
427
|
|
|
389
428
|
Specify the model API service address (api_url) and API Key (api_key) to evaluate the deployed model API service. In this case, the `eval-type` parameter must be specified as `service`, for example:
|
|
390
429
|
|
|
@@ -435,7 +474,7 @@ evalscope eval \
|
|
|
435
474
|
Reference: [Full Parameter Description](https://evalscope.readthedocs.io/en/latest/get_started/parameters.html)
|
|
436
475
|
|
|
437
476
|
|
|
438
|
-
## Evaluation
|
|
477
|
+
## 🧪 Other Evaluation Backends
|
|
439
478
|
EvalScope supports using third-party evaluation frameworks to initiate evaluation tasks, which we call Evaluation Backend. Currently supported Evaluation Backend includes:
|
|
440
479
|
- **Native**: EvalScope's own **default evaluation framework**, supporting various evaluation modes including single model evaluation, arena mode, and baseline model comparison mode.
|
|
441
480
|
- [OpenCompass](https://github.com/open-compass/opencompass): Initiate OpenCompass evaluation tasks through EvalScope. Lightweight, easy to customize, supports seamless integration with the LLM fine-tuning framework ms-swift. [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/backend/opencompass_backend.html)
|
|
@@ -508,6 +547,17 @@ EvalScope, as the official evaluation tool of [ModelScope](https://modelscope.cn
|
|
|
508
547
|
</table>
|
|
509
548
|
</a>
|
|
510
549
|
|
|
550
|
+
## 📚 Citation
|
|
551
|
+
|
|
552
|
+
```bibtex
|
|
553
|
+
@misc{evalscope_2024,
|
|
554
|
+
title={{EvalScope}: Evaluation Framework for Large Models},
|
|
555
|
+
author={ModelScope Team},
|
|
556
|
+
year={2024},
|
|
557
|
+
url={https://github.com/modelscope/evalscope}
|
|
558
|
+
}
|
|
559
|
+
```
|
|
560
|
+
|
|
511
561
|
## 🔜 Roadmap
|
|
512
562
|
- [x] Support for better evaluation report visualization
|
|
513
563
|
- [x] Support for mixed evaluations across multiple datasets
|
|
@@ -523,6 +573,6 @@ EvalScope, as the official evaluation tool of [ModelScope](https://modelscope.cn
|
|
|
523
573
|
- [x] MBPP
|
|
524
574
|
|
|
525
575
|
|
|
526
|
-
## Star History
|
|
576
|
+
## ⭐ Star History
|
|
527
577
|
|
|
528
578
|
[](https://star-history.com/#modelscope/evalscope&Date)
|
|
@@ -1,30 +1,3 @@
-Metadata-Version: 2.1
-Name: evalscope
-Version: 0.17.0
-Summary: EvalScope: Lightweight LLMs Evaluation Framework
-Home-page: https://github.com/modelscope/evalscope
-Author: ModelScope team
-Author-email: contact@modelscope.cn
-Keywords: python,llm,evaluation
-Classifier: Development Status :: 4 - Beta
-Classifier: License :: OSI Approved :: Apache Software License
-Classifier: Operating System :: OS Independent
-Classifier: Programming Language :: Python :: 3
-Classifier: Programming Language :: Python :: 3.8
-Classifier: Programming Language :: Python :: 3.9
-Classifier: Programming Language :: Python :: 3.10
-Requires-Python: >=3.8
-Description-Content-Type: text/markdown
-Provides-Extra: opencompass
-Provides-Extra: vlmeval
-Provides-Extra: rag
-Provides-Extra: perf
-Provides-Extra: app
-Provides-Extra: aigc
-Provides-Extra: dev
-Provides-Extra: all
-License-File: LICENSE
-
 <p align="center">
 <br>
 <img src="docs/en/_static/images/evalscope_logo.png"/>
@@ -55,25 +28,26 @@ License-File: LICENSE
 - [📝 Introduction](#-introduction)
 - [☎ User Groups](#-user-groups)
 - [🎉 News](#-news)
-- [🛠️
-- [Method 1
-- [Method 2
+- [🛠️ Environment Setup](#️-environment-setup)
+- [Method 1. Install via pip](#method-1-install-via-pip)
+- [Method 2. Install from source](#method-2-install-from-source)
 - [🚀 Quick Start](#-quick-start)
 - [Method 1. Using Command Line](#method-1-using-command-line)
 - [Method 2. Using Python Code](#method-2-using-python-code)
 - [Basic Parameter](#basic-parameter)
 - [Output Results](#output-results)
 - [📈 Visualization of Evaluation Results](#-visualization-of-evaluation-results)
-- [🌐 Evaluation of
+- [🌐 Evaluation of Model API](#-evaluation-of-model-api)
 - [⚙️ Custom Parameter Evaluation](#️-custom-parameter-evaluation)
-- [Parameter](#parameter)
-- [Evaluation
+- [Parameter Description](#parameter-description)
+- [🧪 Other Evaluation Backends](#-other-evaluation-backends)
 - [📈 Model Serving Performance Evaluation](#-model-serving-performance-evaluation)
 - [🖊️ Custom Dataset Evaluation](#️-custom-dataset-evaluation)
-- [
+- [⚔️ Arena Mode](#️-arena-mode)
 - [👷♂️ Contribution](#️-contribution)
+- [📚 Citation](#-citation)
 - [🔜 Roadmap](#-roadmap)
-- [Star History](
+- [⭐ Star History](#-star-history)
 
 
 ## 📝 Introduction
@@ -138,6 +112,15 @@ Please scan the QR code below to join our community groups:
 
 ## 🎉 News
 
+> [!IMPORTANT]
+> **Version 1.0 Refactoring**
+>
+> Version 1.0 introduces a major overhaul of the evaluation framework, establishing a new, more modular and extensible API layer under `evalscope/api`. Key improvements include standardized data models for benchmarks, samples, and results; a registry-based design for components such as benchmarks and metrics; and a rewritten core evaluator that orchestrates the new architecture. Existing benchmark adapters have been migrated to this API, resulting in cleaner, more consistent, and easier-to-maintain implementations.
+
+- 🔥 **[2025.08.22]** Version 1.0 Refactoring.
+- 🔥 **[2025.07.18]** The model stress testing now supports randomly generating image-text data for multimodal model evaluation. For usage instructions, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/examples.html#id4).
+- 🔥 **[2025.07.16]** Support for [τ-bench](https://github.com/sierra-research/tau-bench) has been added, enabling the evaluation of AI Agent performance and reliability in real-world scenarios involving dynamic user and tool interactions. For usage instructions, please refer to the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/get_started/supported_dataset/llm.html#bench).
+- 🔥 **[2025.07.14]** Support for "Humanity's Last Exam" ([Humanity's-Last-Exam](https://modelscope.cn/datasets/cais/hle)), a highly challenging evaluation benchmark. For usage instructions, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset/llm.html#humanity-s-last-exam).
 - 🔥 **[2025.07.03]** Refactored Arena Mode: now supports custom model battles, outputs a model leaderboard, and provides battle result visualization. See [reference](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html) for details.
 - 🔥 **[2025.06.28]** Optimized custom dataset evaluation: now supports evaluation without reference answers. Enhanced LLM judge usage, with built-in modes for "scoring directly without reference answers" and "checking answer consistency with reference answers". See [reference](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset/llm.html#qa) for details.
 - 🔥 **[2025.06.19]** Added support for the [BFCL-v3](https://modelscope.cn/datasets/AI-ModelScope/bfcl_v3) benchmark, designed to evaluate model function-calling capabilities across various scenarios. For more information, refer to the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/third_party/bfcl_v3.html).
@@ -145,6 +128,8 @@ Please scan the QR code below to join our community groups:
 - 🔥 **[2025.05.29]** Added support for two long document evaluation benchmarks: [DocMath](https://modelscope.cn/datasets/yale-nlp/DocMath-Eval/summary) and [FRAMES](https://modelscope.cn/datasets/iic/frames/summary). For usage guidelines, please refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset.html).
 - 🔥 **[2025.05.16]** Model service performance stress testing now supports setting various levels of concurrency and outputs a performance test report. [Reference example](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#id3).
 - 🔥 **[2025.05.13]** Added support for the [ToolBench-Static](https://modelscope.cn/datasets/AI-ModelScope/ToolBench-Static) dataset to evaluate model's tool-calling capabilities. Refer to the [documentation](https://evalscope.readthedocs.io/en/latest/third_party/toolbench.html) for usage instructions. Also added support for the [DROP](https://modelscope.cn/datasets/AI-ModelScope/DROP/dataPeview) and [Winogrande](https://modelscope.cn/datasets/AI-ModelScope/winogrande_val) benchmarks to assess the reasoning capabilities of models.
+<details><summary>More</summary>
+
 - 🔥 **[2025.04.29]** Added Qwen3 Evaluation Best Practices, [welcome to read 📖](https://evalscope.readthedocs.io/en/latest/best_practice/qwen3.html)
 - 🔥 **[2025.04.27]** Support for text-to-image evaluation: Supports 8 metrics including MPS, HPSv2.1Score, etc., and evaluation benchmarks such as EvalMuse, GenAI-Bench. Refer to the [user documentation](https://evalscope.readthedocs.io/en/latest/user_guides/aigc/t2i.html) for more details.
 - 🔥 **[2025.04.10]** Model service stress testing tool now supports the `/v1/completions` endpoint (the default endpoint for vLLM benchmarking)
@@ -158,8 +143,6 @@ Please scan the QR code below to join our community groups:
 - 🔥 **[2025.03.03]** Added support for evaluating the IQ and EQ of models. Refer to [📖 Best Practices for IQ and EQ Evaluation](https://evalscope.readthedocs.io/en/latest/best_practice/iquiz.html) to find out how smart your AI is!
 - 🔥 **[2025.02.27]** Added support for evaluating the reasoning efficiency of models. Refer to [📖 Best Practices for Evaluating Thinking Efficiency](https://evalscope.readthedocs.io/en/latest/best_practice/think_eval.html). This implementation is inspired by the works [Overthinking](https://doi.org/10.48550/arXiv.2412.21187) and [Underthinking](https://doi.org/10.48550/arXiv.2501.18585).
 - 🔥 **[2025.02.25]** Added support for two model inference-related evaluation benchmarks: [MuSR](https://modelscope.cn/datasets/AI-ModelScope/MuSR) and [ProcessBench](https://www.modelscope.cn/datasets/Qwen/ProcessBench/summary). To use them, simply specify `musr` and `process_bench` respectively in the datasets parameter.
-<details><summary>More</summary>
-
 - 🔥 **[2025.02.18]** Supports the AIME25 dataset, which contains 15 questions (Grok3 scored 93 on this dataset).
 - 🔥 **[2025.02.13]** Added support for evaluating DeepSeek distilled models, including AIME24, MATH-500, and GPQA-Diamond datasets,refer to [best practice](https://evalscope.readthedocs.io/en/latest/best_practice/deepseek_r1_distill.html); Added support for specifying the `eval_batch_size` parameter to accelerate model evaluation.
 - 🔥 **[2025.01.20]** Support for visualizing evaluation results, including single model evaluation results and multi-model comparison, refer to the [📖 Visualizing Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html) for more details; Added [`iquiz`](https://modelscope.cn/datasets/AI-ModelScope/IQuiz/summary) evaluation example, evaluating the IQ and EQ of the model.
@@ -183,58 +166,87 @@ Please scan the QR code below to join our community groups:
 
 </details>
 
-## 🛠️
-
-
+## 🛠️ Environment Setup
+
+### Method 1. Install via pip
+
+We recommend using conda to manage your environment and pip to install dependencies. This allows you to use the latest evalscope PyPI package.
 
 1. Create a conda environment (optional)
+```shell
+# Python 3.10 is recommended
+conda create -n evalscope python=3.10
+
+# Activate the conda environment
+conda activate evalscope
+```
+2. Install dependencies via pip
+```shell
+pip install evalscope
+```
+3. Install additional dependencies (optional)
+- To use model service inference benchmarking features, install the perf dependency:
 ```shell
-
-conda create -n evalscope python=3.10
-# Activate the conda environment
-conda activate evalscope
+pip install 'evalscope[perf]'
 ```
-
-2. Install dependencies using pip
+- To use visualization features, install the app dependency:
 ```shell
-pip install evalscope
-
-
-
-pip install 'evalscope[
-pip install 'evalscope[
-pip install 'evalscope[
-
+pip install 'evalscope[app]'
+```
+- If you need to use other evaluation backends, you can install OpenCompass, VLMEvalKit, or RAGEval as needed:
+```shell
+pip install 'evalscope[opencompass]'
+pip install 'evalscope[vlmeval]'
+pip install 'evalscope[rag]'
+```
+- To install all dependencies:
+```shell
+pip install 'evalscope[all]'
 ```
 
-> [!
->
+> [!NOTE]
+> The project has been renamed to `evalscope`. For version `v0.4.3` or earlier, you can install it with:
 > ```shell
->
+> pip install llmuses<=0.4.3
 > ```
->
-> ```
+> Then, import related dependencies using `llmuses`:
+> ```python
 > from llmuses import ...
 > ```
 
-### Method 2
-
-
-git clone https://github.com/modelscope/evalscope.git
-```
+### Method 2. Install from source
+
+Installing from source allows you to use the latest code and makes it easier for further development and debugging.
 
+1. Clone the source code
+```shell
+git clone https://github.com/modelscope/evalscope.git
+```
 2. Install dependencies
-
-
-
-
-
-
-
-
-
-
-
+```shell
+cd evalscope/
+
+pip install -e .
+```
+3. Install additional dependencies
+- To use model service inference benchmarking features, install the perf dependency:
+```shell
+pip install '.[perf]'
+```
+- To use visualization features, install the app dependency:
+```shell
+pip install '.[app]'
+```
+- If you need to use other evaluation backends, you can install OpenCompass, VLMEvalKit, or RAGEval as needed:
+```shell
+pip install '.[opencompass]'
+pip install '.[vlmeval]'
+pip install '.[rag]'
+```
+- To install all dependencies:
+```shell
+pip install '.[all]'
+```
 
 
 ## 🚀 Quick Start
@@ -255,33 +267,31 @@ evalscope eval \
 
 When using Python code for evaluation, you need to submit the evaluation task using the `run_task` function, passing a `TaskConfig` as a parameter. It can also be a Python dictionary, yaml file path, or json file path, for example:
 
-**Using
+**Using `TaskConfig`**
 
 ```python
-from evalscope
+from evalscope import run_task, TaskConfig
 
-task_cfg =
-
-
-
-
+task_cfg = TaskConfig(
+    model='Qwen/Qwen2.5-0.5B-Instruct',
+    datasets=['gsm8k', 'arc'],
+    limit=5
+)
 
 run_task(task_cfg=task_cfg)
 ```
-
 <details><summary>More Startup Methods</summary>
 
-**Using
+**Using Python Dictionary**
 
 ```python
 from evalscope.run import run_task
-from evalscope.config import TaskConfig
 
-task_cfg =
-model
-datasets
-limit
-
+task_cfg = {
+    'model': 'Qwen/Qwen2.5-0.5B-Instruct',
+    'datasets': ['gsm8k', 'arc'],
+    'limit': 5
+}
 
 run_task(task_cfg=task_cfg)
 ```
@@ -384,7 +394,7 @@ To create a public link, set `share=True` in `launch()`.
 
 For more details, refer to: [📖 Visualization of Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html)
 
-## 🌐 Evaluation of
+## 🌐 Evaluation of Model API
 
 Specify the model API service address (api_url) and API Key (api_key) to evaluate the deployed model API service. In this case, the `eval-type` parameter must be specified as `service`, for example:
 
@@ -435,7 +445,7 @@ evalscope eval \
 Reference: [Full Parameter Description](https://evalscope.readthedocs.io/en/latest/get_started/parameters.html)
 
 
-## Evaluation
+## 🧪 Other Evaluation Backends
 EvalScope supports using third-party evaluation frameworks to initiate evaluation tasks, which we call Evaluation Backend. Currently supported Evaluation Backend includes:
 - **Native**: EvalScope's own **default evaluation framework**, supporting various evaluation modes including single model evaluation, arena mode, and baseline model comparison mode.
 - [OpenCompass](https://github.com/open-compass/opencompass): Initiate OpenCompass evaluation tasks through EvalScope. Lightweight, easy to customize, supports seamless integration with the LLM fine-tuning framework ms-swift. [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/backend/opencompass_backend.html)
@@ -508,6 +518,17 @@ EvalScope, as the official evaluation tool of [ModelScope](https://modelscope.cn
 </table>
 </a>
 
+## 📚 Citation
+
+```bibtex
+@misc{evalscope_2024,
+    title={{EvalScope}: Evaluation Framework for Large Models},
+    author={ModelScope Team},
+    year={2024},
+    url={https://github.com/modelscope/evalscope}
+}
+```
+
 ## 🔜 Roadmap
 - [x] Support for better evaluation report visualization
 - [x] Support for mixed evaluations across multiple datasets
@@ -523,6 +544,6 @@ EvalScope, as the official evaluation tool of [ModelScope](https://modelscope.cn
 - [x] MBPP
 
 
-## Star History
+## ⭐ Star History
 
 [![Star History Chart](https://api.star-history.com/svg?repos=modelscope/evalscope&type=Date)](https://star-history.com/#modelscope/evalscope&Date)
@@ -0,0 +1,8 @@
+# Copyright (c) Alibaba, Inc. and its affiliates.
+from evalscope.benchmarks import *  # registered benchmarks
+from evalscope.config import TaskConfig
+from evalscope.filters import extraction, selection  # registered filters
+from evalscope.metrics import metric  # registered metrics
+from evalscope.models import model_apis  # need for register model apis
+from evalscope.run import run_task
+from .version import __release_datetime__, __version__
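The new top-level `evalscope/__init__.py` above re-exports `TaskConfig` and `run_task` at the package root, which is the entry point the updated README's Quick Start relies on. Below is a minimal usage sketch assembled from the Quick Start hunks in the diff; the model ID `Qwen/Qwen2.5-0.5B-Instruct`, the dataset names `gsm8k`/`arc`, and `limit=5` are simply the example values shown in that diff, not requirements.

```python
# Minimal sketch of the evalscope 1.0 entry points, mirroring the Quick Start
# in the README diff above (example values only; adjust for your own setup).
from evalscope import TaskConfig, run_task  # re-exported by the new evalscope/__init__.py

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',  # model to evaluate (example value from the diff)
    datasets=['gsm8k', 'arc'],           # registered benchmark names (example values from the diff)
    limit=5,                             # evaluate only the first 5 samples per dataset
)

run_task(task_cfg=task_cfg)
```

As the README's "More Startup Methods" section in the diff notes, the same configuration can also be passed to `run_task` as a plain Python dictionary, a yaml file path, or a json file path.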