evalscope 0.8.2__tar.gz → 0.10.0__tar.gz
This diff shows the contents of publicly available package versions as released to a supported registry. It is provided for informational purposes only and reflects the changes between the two package versions as they appear in their public registry.
Potentially problematic release.
- {evalscope-0.8.2/evalscope.egg-info → evalscope-0.10.0}/PKG-INFO +115 -21
- {evalscope-0.8.2 → evalscope-0.10.0}/README.md +109 -20
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/__init__.py +2 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/arguments.py +11 -3
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_classification.py +0 -1
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/utils/llm.py +1 -1
- evalscope-0.10.0/evalscope/benchmarks/__init__.py +23 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/arc/arc_adapter.py +24 -102
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/bbh_adapter.py +20 -90
- evalscope-0.10.0/evalscope/benchmarks/benchmark.py +76 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/ceval/ceval_adapter.py +24 -125
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/cmmlu/cmmlu_adapter.py +22 -117
- evalscope-0.10.0/evalscope/benchmarks/competition_math/competition_math_adapter.py +126 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/data_adapter.py +115 -87
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/general_qa/general_qa_adapter.py +23 -79
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/gsm8k/gsm8k_adapter.py +21 -101
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/hellaswag/hellaswag_adapter.py +32 -99
- evalscope-0.10.0/evalscope/benchmarks/humaneval/humaneval_adapter.py +104 -0
- evalscope-0.10.0/evalscope/benchmarks/ifeval/ifeval_adapter.py +57 -0
- evalscope-0.10.0/evalscope/benchmarks/ifeval/instructions.py +1478 -0
- evalscope-0.10.0/evalscope/benchmarks/ifeval/instructions_registry.py +188 -0
- evalscope-0.10.0/evalscope/benchmarks/ifeval/instructions_util.py +1670 -0
- evalscope-0.10.0/evalscope/benchmarks/ifeval/utils.py +134 -0
- evalscope-0.10.0/evalscope/benchmarks/iquiz/iquiz_adapter.py +63 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/mmlu/mmlu_adapter.py +32 -130
- evalscope-0.10.0/evalscope/benchmarks/mmlu_pro/mmlu_pro_adapter.py +110 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/race/race_adapter.py +26 -123
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/trivia_qa/trivia_qa_adapter.py +23 -99
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/truthful_qa/truthful_qa_adapter.py +29 -88
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/cli/cli.py +2 -0
- evalscope-0.10.0/evalscope/cli/start_app.py +29 -0
- evalscope-0.10.0/evalscope/collections/__init__.py +3 -0
- evalscope-0.10.0/evalscope/collections/evaluator.py +198 -0
- evalscope-0.10.0/evalscope/collections/sampler.py +138 -0
- evalscope-0.10.0/evalscope/collections/schema.py +126 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/config.py +7 -5
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/constants.py +9 -26
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/evaluator/evaluator.py +87 -121
- evalscope-0.10.0/evalscope/evaluator/reviewer/__init__.py +1 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/evaluator/reviewer/auto_reviewer.py +12 -4
- evalscope-0.10.0/evalscope/metrics/__init__.py +4 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/metrics/bundled_rouge_score/rouge_scorer.py +1 -1
- evalscope-0.10.0/evalscope/metrics/math_accuracy.py +200 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/metrics/metrics.py +18 -6
- evalscope-0.10.0/evalscope/metrics/named_metrics.py +17 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/metrics/rouge_metric.py +13 -8
- evalscope-0.10.0/evalscope/models/__init__.py +16 -0
- evalscope-0.10.0/evalscope/models/base_adapter.py +52 -0
- evalscope-0.10.0/evalscope/models/chat_adapter.py +138 -0
- evalscope-0.10.0/evalscope/models/choice_adapter.py +211 -0
- evalscope-0.10.0/evalscope/models/custom_adapter.py +67 -0
- evalscope-0.10.0/evalscope/models/local_model.py +74 -0
- evalscope-0.10.0/evalscope/models/model.py +229 -0
- evalscope-0.10.0/evalscope/models/server_adapter.py +111 -0
- evalscope-0.10.0/evalscope/perf/__init__.py +1 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/main.py +0 -1
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/plugin/api/custom_api.py +1 -1
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/plugin/api/openai_api.py +1 -1
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/plugin/datasets/flickr8k.py +1 -1
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/plugin/datasets/longalpaca.py +1 -1
- evalscope-0.10.0/evalscope/perf/utils/__init__.py +0 -0
- evalscope-0.10.0/evalscope/registry/__init__.py +1 -0
- evalscope-0.10.0/evalscope/report/__init__.py +5 -0
- evalscope-0.10.0/evalscope/report/app.py +506 -0
- evalscope-0.10.0/evalscope/report/combinator.py +73 -0
- evalscope-0.10.0/evalscope/report/generator.py +80 -0
- evalscope-0.10.0/evalscope/report/utils.py +133 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/run.py +48 -72
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/run_arena.py +1 -1
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/summarizer.py +1 -1
- evalscope-0.10.0/evalscope/third_party/__init__.py +1 -0
- evalscope-0.10.0/evalscope/third_party/longbench_write/resources/__init__.py +1 -0
- evalscope-0.10.0/evalscope/third_party/longbench_write/tools/__init__.py +1 -0
- evalscope-0.10.0/evalscope/third_party/toolbench_static/llm/__init__.py +1 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/utils/__init__.py +1 -1
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/utils/chat_service.py +5 -4
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/utils/io_utils.py +8 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/utils/logger.py +5 -0
- evalscope-0.10.0/evalscope/utils/model_utils.py +24 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/utils/utils.py +3 -25
- evalscope-0.10.0/evalscope/version.py +4 -0
- {evalscope-0.8.2 → evalscope-0.10.0/evalscope.egg-info}/PKG-INFO +115 -21
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope.egg-info/SOURCES.txt +31 -9
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope.egg-info/requires.txt +6 -0
- evalscope-0.10.0/requirements/app.txt +2 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/setup.py +2 -0
- evalscope-0.10.0/tests/__init__.py +1 -0
- evalscope-0.10.0/tests/cli/__init__.py +1 -0
- evalscope-0.10.0/tests/cli/test_collection.py +57 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/tests/cli/test_run.py +52 -1
- evalscope-0.10.0/tests/perf/__init__.py +1 -0
- evalscope-0.10.0/tests/rag/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/tests/rag/test_mteb.py +3 -2
- evalscope-0.10.0/tests/swift/__init__.py +1 -0
- evalscope-0.10.0/tests/vlm/__init__.py +1 -0
- evalscope-0.8.2/evalscope/benchmarks/__init__.py +0 -4
- evalscope-0.8.2/evalscope/benchmarks/arc/__init__.py +0 -6
- evalscope-0.8.2/evalscope/benchmarks/bbh/__init__.py +0 -5
- evalscope-0.8.2/evalscope/benchmarks/benchmark.py +0 -65
- evalscope-0.8.2/evalscope/benchmarks/ceval/__init__.py +0 -6
- evalscope-0.8.2/evalscope/benchmarks/cmmlu/__init__.py +0 -6
- evalscope-0.8.2/evalscope/benchmarks/competition_math/__init__.py +0 -6
- evalscope-0.8.2/evalscope/benchmarks/competition_math/competition_math_adapter.py +0 -468
- evalscope-0.8.2/evalscope/benchmarks/general_qa/__init__.py +0 -6
- evalscope-0.8.2/evalscope/benchmarks/gsm8k/__init__.py +0 -5
- evalscope-0.8.2/evalscope/benchmarks/hellaswag/__init__.py +0 -6
- evalscope-0.8.2/evalscope/benchmarks/humaneval/__init__.py +0 -5
- evalscope-0.8.2/evalscope/benchmarks/humaneval/humaneval_adapter.py +0 -206
- evalscope-0.8.2/evalscope/benchmarks/mmlu/__init__.py +0 -6
- evalscope-0.8.2/evalscope/benchmarks/race/__init__.py +0 -6
- evalscope-0.8.2/evalscope/benchmarks/trivia_qa/__init__.py +0 -6
- evalscope-0.8.2/evalscope/benchmarks/truthful_qa/__init__.py +0 -6
- evalscope-0.8.2/evalscope/metrics/math_accuracy.py +0 -57
- evalscope-0.8.2/evalscope/models/__init__.py +0 -3
- evalscope-0.8.2/evalscope/models/api/__init__.py +0 -3
- evalscope-0.8.2/evalscope/models/dummy_chat_model.py +0 -49
- evalscope-0.8.2/evalscope/models/model.py +0 -88
- evalscope-0.8.2/evalscope/models/model_adapter.py +0 -525
- evalscope-0.8.2/evalscope/models/openai_model.py +0 -103
- evalscope-0.8.2/evalscope/tools/combine_reports.py +0 -133
- evalscope-0.8.2/evalscope/tools/gen_mmlu_subject_mapping.py +0 -90
- evalscope-0.8.2/evalscope/utils/model_utils.py +0 -11
- evalscope-0.8.2/evalscope/version.py +0 -4
- {evalscope-0.8.2 → evalscope-0.10.0}/LICENSE +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/MANIFEST.in +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/base.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/opencompass/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/opencompass/api_meta_template.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/opencompass/backend_manager.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/opencompass/tasks/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/opencompass/tasks/eval_api.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/opencompass/tasks/eval_datasets.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/backend_manager.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/clip_benchmark/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/clip_benchmark/arguments.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/clip_benchmark/dataset_builder.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/clip_benchmark/task_template.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/clip_benchmark/tasks/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/clip_benchmark/tasks/image_caption.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_retrieval.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/clip_benchmark/utils/webdataset_convert.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/clip_benchmark/utils/webdatasets.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/cmteb/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/cmteb/arguments.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/cmteb/base.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/cmteb/task_template.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/cmteb/tasks/Classification.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/cmteb/tasks/Clustering.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/cmteb/tasks/CustomTask.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/cmteb/tasks/PairClassification.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/cmteb/tasks/Reranking.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/cmteb/tasks/Retrieval.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/cmteb/tasks/STS.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/cmteb/tasks/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/ragas/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/ragas/arguments.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/ragas/prompts/persona_prompt.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/ragas/task_template.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/ragas/tasks/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/ragas/tasks/build_distribution.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/ragas/tasks/build_transform.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/ragas/tasks/testset_generation.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/ragas/tasks/translate_prompt.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/utils/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/utils/clip.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/utils/embedding.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/utils/tools.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/vlm_eval_kit/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/vlm_eval_kit/backend_manager.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/vlm_eval_kit/custom_dataset.py +0 -0
- {evalscope-0.8.2/evalscope/cli → evalscope-0.10.0/evalscope/benchmarks/arc}/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/arc/ai2_arc.py +0 -0
- {evalscope-0.8.2/evalscope/evaluator/reviewer → evalscope-0.10.0/evalscope/benchmarks/bbh}/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/boolean_expressions.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/causal_judgement.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/date_understanding.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/disambiguation_qa.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/dyck_languages.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/formal_fallacies.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/geometric_shapes.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/hyperbaton.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_five_objects.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_seven_objects.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_three_objects.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/movie_recommendation.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/multistep_arithmetic_two.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/navigate.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/object_counting.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/penguins_in_a_table.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/reasoning_about_colored_objects.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/ruin_names.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/salient_translation_error_detection.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/snarks.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/sports_understanding.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/temporal_sequences.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_five_objects.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_seven_objects.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_three_objects.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/web_of_lies.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/bbh/cot_prompts/word_sorting.txt +0 -0
- {evalscope-0.8.2/evalscope/metrics → evalscope-0.10.0/evalscope/benchmarks/ceval}/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/ceval/ceval_exam.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/ceval/samples.jsonl +0 -0
- {evalscope-0.8.2/evalscope/registry → evalscope-0.10.0/evalscope/benchmarks/cmmlu}/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/cmmlu/cmmlu.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/cmmlu/samples.jsonl +0 -0
- {evalscope-0.8.2/evalscope/third_party → evalscope-0.10.0/evalscope/benchmarks/competition_math}/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/competition_math/competition_math.py +0 -0
- {evalscope-0.8.2/evalscope/third_party/longbench_write/resources → evalscope-0.10.0/evalscope/benchmarks/general_qa}/__init__.py +0 -0
- {evalscope-0.8.2/evalscope/third_party/longbench_write/tools → evalscope-0.10.0/evalscope/benchmarks/gsm8k}/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/gsm8k/gsm8k.py +0 -0
- {evalscope-0.8.2/evalscope/third_party/toolbench_static/llm → evalscope-0.10.0/evalscope/benchmarks/hellaswag}/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/hellaswag/hellaswag.py +0 -0
- {evalscope-0.8.2/evalscope/tools → evalscope-0.10.0/evalscope/benchmarks/humaneval}/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/humaneval/humaneval.py +0 -0
- {evalscope-0.8.2/evalscope/perf → evalscope-0.10.0/evalscope/benchmarks/ifeval}/__init__.py +0 -0
- {evalscope-0.8.2/evalscope/perf/utils → evalscope-0.10.0/evalscope/benchmarks/iquiz}/__init__.py +0 -0
- {evalscope-0.8.2/tests → evalscope-0.10.0/evalscope/benchmarks/mmlu}/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/mmlu/mmlu.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/mmlu/samples.jsonl +0 -0
- {evalscope-0.8.2/tests/rag → evalscope-0.10.0/evalscope/benchmarks/mmlu_pro}/__init__.py +0 -0
- {evalscope-0.8.2/tests/cli → evalscope-0.10.0/evalscope/benchmarks/race}/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/race/race.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/race/samples.jsonl +0 -0
- {evalscope-0.8.2/tests/perf → evalscope-0.10.0/evalscope/benchmarks/trivia_qa}/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/trivia_qa/samples.jsonl +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/trivia_qa/trivia_qa.py +0 -0
- {evalscope-0.8.2/tests/swift → evalscope-0.10.0/evalscope/benchmarks/truthful_qa}/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/benchmarks/truthful_qa/truthful_qa.py +0 -0
- {evalscope-0.8.2/tests/vlm → evalscope-0.10.0/evalscope/cli}/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/cli/base.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/cli/start_eval.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/cli/start_perf.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/cli/start_server.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/evaluator/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/evaluator/rating_eval.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/metrics/bundled_rouge_score/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/metrics/code_metric.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/metrics/resources/gpt2-zhcn3-v4.bpe +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/metrics/resources/gpt2-zhcn3-v4.json +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/models/custom/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/models/custom/custom_model.py +0 -0
- /evalscope-0.8.2/evalscope/tools/rewrite_eval_results.py → /evalscope-0.10.0/evalscope/models/custom/dummy_model.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/arguments.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/benchmark.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/http_client.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/plugin/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/plugin/api/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/plugin/api/base.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/plugin/api/dashscope_api.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/plugin/datasets/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/plugin/datasets/base.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/plugin/datasets/custom.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/plugin/datasets/line_by_line.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/plugin/datasets/openqa.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/plugin/datasets/speed_benchmark.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/plugin/registry.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/utils/analysis_result.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/utils/benchmark_util.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/utils/db_util.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/utils/handler.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/perf/utils/local_server.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/registry/config/cfg_arena.yaml +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/registry/config/cfg_arena_zhihu.yaml +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/registry/config/cfg_pairwise_baseline.yaml +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/registry/config/cfg_single.yaml +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/registry/data/prompt_template/lmsys_v2.jsonl +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/registry/data/prompt_template/prompt_templates.jsonl +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/registry/data/qa_browser/battle.jsonl +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/registry/data/qa_browser/category_mapping.yaml +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/registry/data/question.jsonl +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/registry/tasks/arc.yaml +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/registry/tasks/bbh.yaml +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/registry/tasks/bbh_mini.yaml +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/registry/tasks/ceval.yaml +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/registry/tasks/ceval_mini.yaml +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/registry/tasks/cmmlu.yaml +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/registry/tasks/eval_qwen-7b-chat_v100.yaml +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/registry/tasks/general_qa.yaml +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/registry/tasks/gsm8k.yaml +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/registry/tasks/mmlu.yaml +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/registry/tasks/mmlu_mini.yaml +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/third_party/longbench_write/README.md +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/third_party/longbench_write/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/third_party/longbench_write/default_task.json +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/third_party/longbench_write/default_task.yaml +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/third_party/longbench_write/eval.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/third_party/longbench_write/infer.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/third_party/longbench_write/longbench_write.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/third_party/longbench_write/resources/judge.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/third_party/longbench_write/resources/longbench_write.jsonl +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/third_party/longbench_write/resources/longbench_write_en.jsonl +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/third_party/longbench_write/resources/longwrite_ruler.jsonl +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/third_party/longbench_write/tools/data_etl.py +0 -0
- {evalscope-0.8.2/evalscope/models/api → evalscope-0.10.0/evalscope/third_party/longbench_write/tools}/openai_api.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/third_party/longbench_write/utils.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/third_party/toolbench_static/README.md +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/third_party/toolbench_static/__init__.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/third_party/toolbench_static/config_default.json +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/third_party/toolbench_static/config_default.yaml +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/third_party/toolbench_static/eval.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/third_party/toolbench_static/infer.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/third_party/toolbench_static/llm/swift_infer.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/third_party/toolbench_static/requirements.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/third_party/toolbench_static/toolbench_static.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/utils/arena_utils.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope/utils/completion_parsers.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope.egg-info/dependency_links.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope.egg-info/entry_points.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope.egg-info/not-zip-safe +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/evalscope.egg-info/top_level.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/requirements/docs.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/requirements/framework.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/requirements/inner.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/requirements/opencompass.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/requirements/perf.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/requirements/rag.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/requirements/tests.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/requirements/vlmeval.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/requirements.txt +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/setup.cfg +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/tests/perf/test_perf.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/tests/rag/test_clip_benchmark.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/tests/rag/test_ragas.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/tests/swift/test_run_swift_eval.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/tests/swift/test_run_swift_vlm_eval.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/tests/swift/test_run_swift_vlm_jugde_eval.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/tests/test_run_all.py +0 -0
- {evalscope-0.8.2 → evalscope-0.10.0}/tests/vlm/test_vlmeval.py +0 -0
{evalscope-0.8.2/evalscope.egg-info → evalscope-0.10.0}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: evalscope
-Version: 0.8.2
+Version: 0.10.0
 Summary: EvalScope: Lightweight LLMs Evaluation Framework
 Home-page: https://github.com/modelscope/evalscope
 Author: ModelScope team
@@ -63,6 +63,9 @@ Requires-Dist: numpy; extra == "perf"
 Requires-Dist: sse_starlette; extra == "perf"
 Requires-Dist: transformers; extra == "perf"
 Requires-Dist: unicorn; extra == "perf"
+Provides-Extra: app
+Requires-Dist: gradio>=5.4.0; extra == "app"
+Requires-Dist: plotly>=5.23.0; extra == "app"
 Provides-Extra: inner
 Requires-Dist: absl-py; extra == "inner"
 Requires-Dist: accelerate; extra == "inner"
@@ -133,6 +136,8 @@ Requires-Dist: numpy; extra == "all"
 Requires-Dist: sse_starlette; extra == "all"
 Requires-Dist: transformers; extra == "all"
 Requires-Dist: unicorn; extra == "all"
+Requires-Dist: gradio>=5.4.0; extra == "all"
+Requires-Dist: plotly>=5.23.0; extra == "all"

 <p align="center">
     <br>
@@ -160,14 +165,16 @@ Requires-Dist: unicorn; extra == "all"
 > ⭐ If you like this project, please click the "Star" button at the top right to support us. Your support is our motivation to keep going!

 ## 📋 Contents
-- [Introduction](
-- [News](
-- [Installation](
-- [Quick Start](
+- [Introduction](#-introduction)
+- [News](#-news)
+- [Installation](#️-installation)
+- [Quick Start](#-quick-start)
 - [Evaluation Backend](#evaluation-backend)
-- [Custom Dataset Evaluation](
-- [Model Serving Performance Evaluation](
-- [Arena Mode](
+- [Custom Dataset Evaluation](#️-custom-dataset-evaluation)
+- [Model Serving Performance Evaluation](#-model-serving-performance-evaluation)
+- [Arena Mode](#-arena-mode)
+- [Contribution](#️-contribution)
+- [Roadmap](#-roadmap)


 ## 📝 Introduction
@@ -208,11 +215,17 @@ Please scan the QR code below to join our community groups:


 ## 🎉 News
+- 🔥 **[2025.01.20]** Support for visualizing evaluation results, including single model evaluation results and multi-model comparison, refer to the [📖 Visualizing Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visulization.html) for more details; Added [`iquiz`](https://modelscope.cn/datasets/AI-ModelScope/IQuiz/summary) evaluation example, evaluating the IQ and EQ of the model.
+- 🔥 **[2025.01.07]** Native backend: Support for model API evaluation is now available. Refer to the [📖 Model API Evaluation Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#api) for more details. Additionally, support for the `ifeval` evaluation benchmark has been added.
+- 🔥🔥 **[2024.12.31]** Support for adding benchmark evaluations, refer to the [📖 Benchmark Evaluation Addition Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/add_benchmark.html); support for custom mixed dataset evaluations, allowing for more comprehensive model evaluations with less data, refer to the [📖 Mixed Dataset Evaluation Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/collection/index.html).
 - 🔥 **[2024.12.13]** Model evaluation optimization: no need to pass the `--template-type` parameter anymore; supports starting evaluation with `evalscope eval --args`. Refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html) for more details.
 - 🔥 **[2024.11.26]** The model inference service performance evaluator has been completely refactored: it now supports local inference service startup and Speed Benchmark; asynchronous call error handling has been optimized. For more details, refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/index.html).
 - 🔥 **[2024.10.31]** The best practice for evaluating Multimodal-RAG has been updated, please check the [📖 Blog](https://evalscope.readthedocs.io/zh-cn/latest/blog/RAG/multimodal_RAG.html#multimodal-rag) for more details.
 - 🔥 **[2024.10.23]** Supports multimodal RAG evaluation, including the assessment of image-text retrieval using [CLIP_Benchmark](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/clip_benchmark.html), and extends [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html) to support end-to-end multimodal metrics evaluation.
 - 🔥 **[2024.10.8]** Support for RAG evaluation, including independent evaluation of embedding models and rerankers using [MTEB/CMTEB](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/mteb.html), as well as end-to-end evaluation using [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html).
+
+<details><summary>More</summary>
+
 - 🔥 **[2024.09.18]** Our documentation has been updated to include a blog module, featuring some technical research and discussions related to evaluations. We invite you to [📖 read it](https://evalscope.readthedocs.io/en/refact_readme/blog/index.html).
 - 🔥 **[2024.09.12]** Support for LongWriter evaluation, which supports 10,000+ word generation. You can use the benchmark [LongBench-Write](evalscope/third_party/longbench_write/README.md) to measure the long output quality as well as the output length.
 - 🔥 **[2024.08.30]** Support for custom dataset evaluations, including text datasets and multimodal image-text datasets.
@@ -224,7 +237,7 @@ Please scan the QR code below to join our community groups:
 - 🔥 **[2024.06.13]** EvalScope seamlessly integrates with the fine-tuning framework SWIFT, providing full-chain support from LLM training to evaluation.
 - 🔥 **[2024.06.13]** Integrated the Agent evaluation dataset ToolBench.

-
+</details>

 ## 🛠️ Installation
 ### Method 1: Install Using pip
@@ -368,15 +381,85 @@ run_task(task_cfg="config.json")
 - `--limit`: Maximum amount of evaluation data for each dataset. If not specified, it defaults to evaluating all data. Can be used for quick validation

 ### Output Results
+```text
++-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
+| Model Name            | Dataset Name   | Metric Name     | Category Name   | Subset Name   |   Num |   Score |
++=======================+================+=================+=================+===============+=======+=========+
+| Qwen2.5-0.5B-Instruct | gsm8k          | AverageAccuracy | default         | main          |     5 |     0.4 |
++-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
+| Qwen2.5-0.5B-Instruct | ai2_arc        | AverageAccuracy | default         | ARC-Easy      |     5 |     0.8 |
++-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
+| Qwen2.5-0.5B-Instruct | ai2_arc        | AverageAccuracy | default         | ARC-Challenge |     5 |     0.4 |
++-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
 ```
-
-
-
-
-
+
+## 📈 Visualization of Evaluation Results
+
+1. Install the dependencies required for visualization, including gradio, plotly, etc.
+```bash
+pip install 'evalscope[app]'
+```
+
+2. Start the Visualization Service
+
+Run the following command to start the visualization service.
+```bash
+evalscope app
 ```
+You can access the visualization service in the browser if the following output appears.
+```text
+* Running on local URL:  http://127.0.0.1:7861
+
+To create a public link, set `share=True` in `launch()`.
+```
+
+<table>
+<tr>
+<td style="text-align: center;">
+<img src="docs/zh/get_started/images/setting.png" alt="Setting" style="width: 100%;" />
+<p>Setting Interface</p>
+</td>
+<td style="text-align: center;">
+<img src="docs/zh/get_started/images/model_compare.png" alt="Model Compare" style="width: 100%;" />
+<p>Model Comparison</p>
+</td>
+</tr>
+<tr>
+<td style="text-align: center;">
+<img src="docs/zh/get_started/images/report_overview.png" alt="Report Overview" style="width: 100%;" />
+<p>Report Overview</p>
+</td>
+<td style="text-align: center;">
+<img src="docs/zh/get_started/images/report_details.png" alt="Report Details" style="width: 100%;" />
+<p>Report Details</p>
+</td>
+</tr>
+</table>
+
+For more details, refer to: [📖 Visualization of Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visulization.html)
+
+## 🌐 Evaluation of Specified Model API
+
+Specify the model API service address (api_url) and API Key (api_key) to evaluate the deployed model API service. In this case, the `eval-type` parameter must be specified as `service`, for example:
+
+For example, to launch a model service using [vLLM](https://github.com/vllm-project/vllm):
+
+```shell
+export VLLM_USE_MODELSCOPE=True && python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-0.5B-Instruct --served-model-name qwen2.5 --trust_remote_code --port 8801
+```
+Then, you can use the following command to evaluate the model API service:
+```shell
+evalscope eval \
+ --model qwen2.5 \
+ --api-url http://127.0.0.1:8801/v1/chat/completions \
+ --api-key EMPTY \
+ --eval-type service \
+ --datasets gsm8k \
+ --limit 10
+```
+
+## ⚙️ Custom Parameter Evaluation

-## ⚙️ Complex Evaluation
 For more customized evaluations, such as customizing model parameters or dataset parameters, you can use the following command. The evaluation startup method is the same as simple evaluation. Below shows how to start the evaluation using the `eval` command:

 ```shell
@@ -414,7 +497,7 @@ EvalScope supports using third-party evaluation frameworks to initiate evaluatio
 - **ThirdParty**: Third-party evaluation tasks, such as [ToolBench](https://evalscope.readthedocs.io/en/latest/third_party/toolbench.html) and [LongBench-Write](https://evalscope.readthedocs.io/en/latest/third_party/longwriter.html).


-## Model Serving Performance Evaluation
+## 📈 Model Serving Performance Evaluation
 A stress testing tool focused on large language models, which can be customized to support various dataset formats and different API protocol formats.

 Reference: Performance Testing [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/index.html)
@@ -439,19 +522,32 @@ Speed Benchmark Results:
 +---------------+-----------------+----------------+
 ```

-## Custom Dataset Evaluation
+## 🖊️ Custom Dataset Evaluation
 EvalScope supports custom dataset evaluation. For detailed information, please refer to the Custom Dataset Evaluation [📖User Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset/index.html)


-## Arena Mode
+## 🏟️ Arena Mode
 The Arena mode allows multiple candidate models to be evaluated through pairwise battles, and can choose to use the AI Enhanced Auto-Reviewer (AAR) automatic evaluation process or manual evaluation to obtain the evaluation report.

 Refer to: Arena Mode [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html)

+## 👷♂️ Contribution

+EvalScope, as the official evaluation tool of [ModelScope](https://modelscope.cn), is continuously optimizing its benchmark evaluation features! We invite you to refer to the [Contribution Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/add_benchmark.html) to easily add your own evaluation benchmarks and share your contributions with the community. Let’s work together to support the growth of EvalScope and make our tools even better! Join us now!

+<a href="https://github.com/modelscope/evalscope/graphs/contributors" target="_blank">
+<table>
+<tr>
+<th colspan="2">
+<br><img src="https://contrib.rocks/image?repo=modelscope/evalscope"><br><br>
+</th>
+</tr>
+</table>
+</a>

-##
+## 🔜 Roadmap
+- [ ] Support for better evaluation report visualization
+- [x] Support for mixed evaluations across multiple datasets
 - [x] RAG evaluation
 - [x] VLM evaluation
 - [x] Agents evaluation
@@ -462,8 +558,6 @@ Refer to: Arena Mode [📖 User Guide](https://evalscope.readthedocs.io/en/lates
 - [ ] GAIA
 - [ ] GPQA
 - [x] MBPP
-- [ ] Auto-reviewer
-- [ ] Qwen-max


 ## Star History
{evalscope-0.8.2 → evalscope-0.10.0}/README.md

@@ -24,14 +24,16 @@
 > ⭐ If you like this project, please click the "Star" button at the top right to support us. Your support is our motivation to keep going!

 ## 📋 Contents
-- [Introduction](
-- [News](
-- [Installation](
-- [Quick Start](
+- [Introduction](#-introduction)
+- [News](#-news)
+- [Installation](#️-installation)
+- [Quick Start](#-quick-start)
 - [Evaluation Backend](#evaluation-backend)
-- [Custom Dataset Evaluation](
-- [Model Serving Performance Evaluation](
-- [Arena Mode](
+- [Custom Dataset Evaluation](#️-custom-dataset-evaluation)
+- [Model Serving Performance Evaluation](#-model-serving-performance-evaluation)
+- [Arena Mode](#-arena-mode)
+- [Contribution](#️-contribution)
+- [Roadmap](#-roadmap)


 ## 📝 Introduction
@@ -72,11 +74,17 @@ Please scan the QR code below to join our community groups:


 ## 🎉 News
+- 🔥 **[2025.01.20]** Support for visualizing evaluation results, including single model evaluation results and multi-model comparison, refer to the [📖 Visualizing Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visulization.html) for more details; Added [`iquiz`](https://modelscope.cn/datasets/AI-ModelScope/IQuiz/summary) evaluation example, evaluating the IQ and EQ of the model.
+- 🔥 **[2025.01.07]** Native backend: Support for model API evaluation is now available. Refer to the [📖 Model API Evaluation Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#api) for more details. Additionally, support for the `ifeval` evaluation benchmark has been added.
+- 🔥🔥 **[2024.12.31]** Support for adding benchmark evaluations, refer to the [📖 Benchmark Evaluation Addition Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/add_benchmark.html); support for custom mixed dataset evaluations, allowing for more comprehensive model evaluations with less data, refer to the [📖 Mixed Dataset Evaluation Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/collection/index.html).
 - 🔥 **[2024.12.13]** Model evaluation optimization: no need to pass the `--template-type` parameter anymore; supports starting evaluation with `evalscope eval --args`. Refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html) for more details.
 - 🔥 **[2024.11.26]** The model inference service performance evaluator has been completely refactored: it now supports local inference service startup and Speed Benchmark; asynchronous call error handling has been optimized. For more details, refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/index.html).
 - 🔥 **[2024.10.31]** The best practice for evaluating Multimodal-RAG has been updated, please check the [📖 Blog](https://evalscope.readthedocs.io/zh-cn/latest/blog/RAG/multimodal_RAG.html#multimodal-rag) for more details.
 - 🔥 **[2024.10.23]** Supports multimodal RAG evaluation, including the assessment of image-text retrieval using [CLIP_Benchmark](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/clip_benchmark.html), and extends [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html) to support end-to-end multimodal metrics evaluation.
 - 🔥 **[2024.10.8]** Support for RAG evaluation, including independent evaluation of embedding models and rerankers using [MTEB/CMTEB](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/mteb.html), as well as end-to-end evaluation using [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html).
+
+<details><summary>More</summary>
+
 - 🔥 **[2024.09.18]** Our documentation has been updated to include a blog module, featuring some technical research and discussions related to evaluations. We invite you to [📖 read it](https://evalscope.readthedocs.io/en/refact_readme/blog/index.html).
 - 🔥 **[2024.09.12]** Support for LongWriter evaluation, which supports 10,000+ word generation. You can use the benchmark [LongBench-Write](evalscope/third_party/longbench_write/README.md) to measure the long output quality as well as the output length.
 - 🔥 **[2024.08.30]** Support for custom dataset evaluations, including text datasets and multimodal image-text datasets.
@@ -88,7 +96,7 @@ Please scan the QR code below to join our community groups:
 - 🔥 **[2024.06.13]** EvalScope seamlessly integrates with the fine-tuning framework SWIFT, providing full-chain support from LLM training to evaluation.
 - 🔥 **[2024.06.13]** Integrated the Agent evaluation dataset ToolBench.

-
+</details>

 ## 🛠️ Installation
 ### Method 1: Install Using pip
@@ -232,15 +240,85 @@ run_task(task_cfg="config.json")
 - `--limit`: Maximum amount of evaluation data for each dataset. If not specified, it defaults to evaluating all data. Can be used for quick validation

 ### Output Results
+```text
++-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
+| Model Name            | Dataset Name   | Metric Name     | Category Name   | Subset Name   |   Num |   Score |
++=======================+================+=================+=================+===============+=======+=========+
+| Qwen2.5-0.5B-Instruct | gsm8k          | AverageAccuracy | default         | main          |     5 |     0.4 |
++-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
+| Qwen2.5-0.5B-Instruct | ai2_arc        | AverageAccuracy | default         | ARC-Easy      |     5 |     0.8 |
++-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
+| Qwen2.5-0.5B-Instruct | ai2_arc        | AverageAccuracy | default         | ARC-Challenge |     5 |     0.4 |
++-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
 ```
-
-
-
-
-
+
+## 📈 Visualization of Evaluation Results
+
+1. Install the dependencies required for visualization, including gradio, plotly, etc.
+```bash
+pip install 'evalscope[app]'
+```
+
+2. Start the Visualization Service
+
+Run the following command to start the visualization service.
+```bash
+evalscope app
 ```
+You can access the visualization service in the browser if the following output appears.
+```text
+* Running on local URL:  http://127.0.0.1:7861
+
+To create a public link, set `share=True` in `launch()`.
+```
+
+<table>
+<tr>
+<td style="text-align: center;">
+<img src="docs/zh/get_started/images/setting.png" alt="Setting" style="width: 100%;" />
+<p>Setting Interface</p>
+</td>
+<td style="text-align: center;">
+<img src="docs/zh/get_started/images/model_compare.png" alt="Model Compare" style="width: 100%;" />
+<p>Model Comparison</p>
+</td>
+</tr>
+<tr>
+<td style="text-align: center;">
+<img src="docs/zh/get_started/images/report_overview.png" alt="Report Overview" style="width: 100%;" />
+<p>Report Overview</p>
+</td>
+<td style="text-align: center;">
+<img src="docs/zh/get_started/images/report_details.png" alt="Report Details" style="width: 100%;" />
+<p>Report Details</p>
+</td>
+</tr>
+</table>
+
+For more details, refer to: [📖 Visualization of Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visulization.html)
+
+## 🌐 Evaluation of Specified Model API
+
+Specify the model API service address (api_url) and API Key (api_key) to evaluate the deployed model API service. In this case, the `eval-type` parameter must be specified as `service`, for example:
+
+For example, to launch a model service using [vLLM](https://github.com/vllm-project/vllm):
+
+```shell
+export VLLM_USE_MODELSCOPE=True && python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-0.5B-Instruct --served-model-name qwen2.5 --trust_remote_code --port 8801
+```
+Then, you can use the following command to evaluate the model API service:
+```shell
+evalscope eval \
+ --model qwen2.5 \
+ --api-url http://127.0.0.1:8801/v1/chat/completions \
+ --api-key EMPTY \
+ --eval-type service \
+ --datasets gsm8k \
+ --limit 10
+```
+
+## ⚙️ Custom Parameter Evaluation

-## ⚙️ Complex Evaluation
 For more customized evaluations, such as customizing model parameters or dataset parameters, you can use the following command. The evaluation startup method is the same as simple evaluation. Below shows how to start the evaluation using the `eval` command:

 ```shell
@@ -278,7 +356,7 @@ EvalScope supports using third-party evaluation frameworks to initiate evaluatio
 - **ThirdParty**: Third-party evaluation tasks, such as [ToolBench](https://evalscope.readthedocs.io/en/latest/third_party/toolbench.html) and [LongBench-Write](https://evalscope.readthedocs.io/en/latest/third_party/longwriter.html).


-## Model Serving Performance Evaluation
+## 📈 Model Serving Performance Evaluation
 A stress testing tool focused on large language models, which can be customized to support various dataset formats and different API protocol formats.

 Reference: Performance Testing [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/index.html)
@@ -303,19 +381,32 @@ Speed Benchmark Results:
 +---------------+-----------------+----------------+
 ```

-## Custom Dataset Evaluation
+## 🖊️ Custom Dataset Evaluation
 EvalScope supports custom dataset evaluation. For detailed information, please refer to the Custom Dataset Evaluation [📖User Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset/index.html)


-## Arena Mode
+## 🏟️ Arena Mode
 The Arena mode allows multiple candidate models to be evaluated through pairwise battles, and can choose to use the AI Enhanced Auto-Reviewer (AAR) automatic evaluation process or manual evaluation to obtain the evaluation report.

 Refer to: Arena Mode [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html)

+## 👷♂️ Contribution

+EvalScope, as the official evaluation tool of [ModelScope](https://modelscope.cn), is continuously optimizing its benchmark evaluation features! We invite you to refer to the [Contribution Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/add_benchmark.html) to easily add your own evaluation benchmarks and share your contributions with the community. Let’s work together to support the growth of EvalScope and make our tools even better! Join us now!

+<a href="https://github.com/modelscope/evalscope/graphs/contributors" target="_blank">
+<table>
+<tr>
+<th colspan="2">
+<br><img src="https://contrib.rocks/image?repo=modelscope/evalscope"><br><br>
+</th>
+</tr>
+</table>
+</a>

-##
+## 🔜 Roadmap
+- [ ] Support for better evaluation report visualization
+- [x] Support for mixed evaluations across multiple datasets
 - [x] RAG evaluation
 - [x] VLM evaluation
 - [x] Agents evaluation
@@ -326,8 +417,6 @@ Refer to: Arena Mode [📖 User Guide](https://evalscope.readthedocs.io/en/lates
 - [ ] GAIA
 - [ ] GPQA
 - [x] MBPP
-- [ ] Auto-reviewer
-- [ ] Qwen-max


 ## Star History
{evalscope-0.8.2 → evalscope-0.10.0}/evalscope/arguments.py

@@ -1,6 +1,8 @@
 import argparse
 import json

+from evalscope.constants import EvalBackend, EvalStage, EvalType
+

 class ParseStrArgsAction(argparse.Action):

@@ -31,6 +33,7 @@ def add_argument(parser: argparse.ArgumentParser):
     # yapf: disable
     # Model-related arguments
     parser.add_argument('--model', type=str, required=False, help='The model id on modelscope, or local model dir.')
+    parser.add_argument('--model-id', type=str, required=False, help='The model id for model name in report.')
     parser.add_argument('--model-args', type=str, action=ParseStrArgsAction, help='The model args, should be a string.')

     # Template-related arguments
@@ -47,10 +50,13 @@ def add_argument(parser: argparse.ArgumentParser):
     parser.add_argument('--generation-config', type=str, action=ParseStrArgsAction, help='The generation config, should be a string.')  # noqa: E501

     # Evaluation-related arguments
-    parser.add_argument('--eval-type', type=str, help='The type for evaluating.'
-
+    parser.add_argument('--eval-type', type=str, help='The type for evaluating.',
+                        choices=[EvalType.CHECKPOINT, EvalType.CUSTOM, EvalType.SERVICE])
+    parser.add_argument('--eval-backend', type=str, help='The evaluation backend to use.',
+                        choices=[EvalBackend.NATIVE, EvalBackend.OPEN_COMPASS, EvalBackend.VLM_EVAL_KIT, EvalBackend.RAG_EVAL])  # noqa: E501
     parser.add_argument('--eval-config', type=str, required=False, help='The eval task config file path for evaluation backend.')  # noqa: E501
-    parser.add_argument('--stage', type=str, default='all', help='The stage of evaluation pipeline.'
+    parser.add_argument('--stage', type=str, default='all', help='The stage of evaluation pipeline.',
+                        choices=[EvalStage.ALL, EvalStage.INFER, EvalStage.REVIEW])
     parser.add_argument('--limit', type=int, default=None, help='Max evaluation samples num for each subset.')

     # Cache and working directory arguments
@@ -62,6 +68,8 @@ def add_argument(parser: argparse.ArgumentParser):
     parser.add_argument('--debug', action='store_true', default=False, help='Debug mode, will print information for debugging.')  # noqa: E501
     parser.add_argument('--dry-run', action='store_true', default=False, help='Dry run in single processing mode.')
     parser.add_argument('--seed', type=int, default=42, help='Random seed for reproducibility.')
+    parser.add_argument('--api-key', type=str, default='EMPTY', help='The API key for the remote API model.')
+    parser.add_argument('--api-url', type=str, default=None, help='The API url for the remote API model.')
     # yapf: enable

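The `choices` lists added above mean that an invalid `--eval-type`, `--eval-backend`, or `--stage` value now fails at argument-parsing time instead of later in the run. Below is a minimal standalone sketch of the same argparse pattern; the literal strings are assumptions for illustration only, since the actual values of `EvalType`, `EvalStage`, and `EvalBackend` are defined in `evalscope/constants.py`, which is not shown in this diff.

```python
import argparse

# Hypothetical stand-ins for the EvalType constants (actual values live in evalscope.constants).
EVAL_TYPES = ['checkpoint', 'custom', 'service']

parser = argparse.ArgumentParser()
# Same pattern as the new arguments above: restrict accepted values with `choices`.
parser.add_argument('--eval-type', type=str, choices=EVAL_TYPES, help='The type for evaluating.')
parser.add_argument('--api-url', type=str, default=None, help='The API url for the remote API model.')
parser.add_argument('--api-key', type=str, default='EMPTY', help='The API key for the remote API model.')

args = parser.parse_args(['--eval-type', 'service', '--api-url', 'http://127.0.0.1:8801/v1/chat/completions'])
print(args)
# parser.parse_args(['--eval-type', 'bogus'])  # argparse exits with: invalid choice: 'bogus'
```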
{evalscope-0.8.2 → evalscope-0.10.0}/evalscope/backend/rag_eval/utils/llm.py

@@ -6,7 +6,7 @@ from modelscope.utils.hf_util import GenerationConfig
 from typing import Any, Dict, Iterator, List, Mapping, Optional

 from evalscope.constants import DEFAULT_MODEL_REVISION
-from evalscope.models
+from evalscope.models import ChatGenerationModelAdapter


 class LLM:
evalscope-0.10.0/evalscope/benchmarks/__init__.py

@@ -0,0 +1,23 @@
+# Copyright (c) Alibaba, Inc. and its affiliates.
+import glob
+import importlib
+import os
+
+from evalscope.benchmarks.benchmark import Benchmark, BenchmarkMeta
+from evalscope.benchmarks.data_adapter import DataAdapter
+from evalscope.utils import get_logger
+
+logger = get_logger()
+
+# Using glob to find all files matching the pattern
+pattern = os.path.join(os.path.dirname(__file__), '*', '*_adapter.py')
+files = glob.glob(pattern, recursive=False)
+
+for file_path in files:
+    if file_path.endswith('.py') and not os.path.basename(file_path).startswith('_'):
+        # Convert file path to a module path
+        relative_path = os.path.relpath(file_path, os.path.dirname(__file__))
+        module_path = relative_path[:-3].replace(os.path.sep, '.')  # strip '.py' and convert to module path
+        full_path = f'evalscope.benchmarks.{module_path}'
+        importlib.import_module(full_path)
+        # print(f'Importing {full_path}')