evalscope 0.16.3__tar.gz → 0.17.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {evalscope-0.16.3/evalscope.egg-info → evalscope-0.17.1}/PKG-INFO +81 -150
- {evalscope-0.16.3 → evalscope-0.17.1}/README.md +73 -43
- evalscope-0.17.1/evalscope/app/app.py +35 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/app/constants.py +1 -0
- evalscope-0.17.1/evalscope/app/ui/__init__.py +20 -0
- evalscope-0.17.1/evalscope/app/ui/app_ui.py +52 -0
- evalscope-0.17.1/evalscope/app/ui/multi_model.py +323 -0
- evalscope-0.17.1/evalscope/app/ui/sidebar.py +42 -0
- evalscope-0.17.1/evalscope/app/ui/single_model.py +202 -0
- evalscope-0.17.1/evalscope/app/ui/visualization.py +36 -0
- evalscope-0.17.1/evalscope/app/utils/data_utils.py +178 -0
- evalscope-0.17.1/evalscope/app/utils/localization.py +221 -0
- evalscope-0.17.1/evalscope/app/utils/text_utils.py +119 -0
- evalscope-0.17.1/evalscope/app/utils/visualization.py +91 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/opencompass/backend_manager.py +2 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/backend_manager.py +2 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/utils/embedding.py +1 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/vlm_eval_kit/backend_manager.py +4 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/__init__.py +15 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/aime/aime24_adapter.py +2 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/aime/aime25_adapter.py +2 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/alpaca_eval/alpaca_eval_adapter.py +1 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/arc/arc_adapter.py +1 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/arena_hard/arena_hard_adapter.py +1 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/arena_hard/utils.py +0 -12
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bfcl/bfcl_adapter.py +1 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/ceval/ceval_adapter.py +5 -16
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/cmmlu/cmmlu_adapter.py +9 -21
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/competition_math/competition_math_adapter.py +2 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/data_adapter.py +29 -9
- evalscope-0.17.1/evalscope/benchmarks/general_arena/general_arena_adapter.py +411 -0
- evalscope-0.17.1/evalscope/benchmarks/general_arena/utils.py +226 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/general_mcq/general_mcq_adapter.py +3 -2
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/general_qa/general_qa_adapter.py +44 -30
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/hellaswag/hellaswag_adapter.py +1 -1
- evalscope-0.17.1/evalscope/benchmarks/hle/hle_adapter.py +118 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/humaneval/humaneval_adapter.py +5 -21
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/ifeval/ifeval_adapter.py +2 -4
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/iquiz/iquiz_adapter.py +1 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/live_code_bench/live_code_bench_adapter.py +0 -6
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/maritime_bench/maritime_bench_adapter.py +1 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/math_500/math_500_adapter.py +2 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/mmlu/mmlu_adapter.py +2 -2
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/mmlu_pro/mmlu_pro_adapter.py +1 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/mmlu_redux/mmlu_redux_adapter.py +1 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/musr/musr_adapter.py +1 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/race/race_adapter.py +1 -1
- evalscope-0.17.1/evalscope/benchmarks/tau_bench/tau_bench_adapter.py +110 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/tool_bench/tool_bench_adapter.py +7 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/trivia_qa/trivia_qa_adapter.py +9 -4
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/utils.py +2 -2
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/winogrande/winogrande_adapter.py +1 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/config.py +8 -123
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/constants.py +5 -21
- evalscope-0.17.1/evalscope/evaluator/__init__.py +3 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/evaluator/evaluator.py +20 -15
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/__init__.py +9 -1
- evalscope-0.16.3/evalscope/utils/utils.py → evalscope-0.17.1/evalscope/metrics/completion_parsers.py +71 -176
- evalscope-0.17.1/evalscope/metrics/llm_judge.py +197 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/metrics.py +20 -8
- {evalscope-0.16.3/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models → evalscope-0.17.1/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model}/__init__.py +0 -0
- {evalscope-0.16.3/evalscope/third_party/thinkbench/tools → evalscope-0.17.1/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models}/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/models/__init__.py +4 -8
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/models/adapters/__init__.py +4 -9
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/models/adapters/base_adapter.py +4 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/models/adapters/bfcl_adapter.py +2 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/models/adapters/chat_adapter.py +3 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/models/adapters/choice_adapter.py +4 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/models/adapters/custom_adapter.py +7 -3
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/models/adapters/server_adapter.py +4 -2
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/models/adapters/t2i_adapter.py +3 -0
- evalscope-0.17.1/evalscope/models/adapters/tau_bench_adapter.py +189 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/models/custom/dummy_model.py +3 -3
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/models/register.py +0 -14
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/perf/arguments.py +15 -16
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/perf/benchmark.py +38 -39
- evalscope-0.17.1/evalscope/perf/http_client.py +120 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/perf/main.py +3 -3
- evalscope-0.17.1/evalscope/perf/plugin/__init__.py +3 -0
- evalscope-0.17.1/evalscope/perf/plugin/api/__init__.py +4 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/perf/plugin/api/base.py +22 -4
- evalscope-0.17.1/evalscope/perf/plugin/api/custom_api.py +249 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/perf/plugin/api/dashscope_api.py +4 -10
- evalscope-0.17.1/evalscope/perf/plugin/api/default_api.py +105 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/perf/plugin/api/openai_api.py +17 -19
- evalscope-0.17.1/evalscope/perf/plugin/datasets/__init__.py +10 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/perf/plugin/datasets/base.py +22 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/perf/plugin/datasets/custom.py +2 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/perf/plugin/datasets/flickr8k.py +4 -27
- evalscope-0.17.1/evalscope/perf/plugin/datasets/kontext_bench.py +28 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/perf/plugin/datasets/line_by_line.py +2 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/perf/plugin/datasets/longalpaca.py +2 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/perf/plugin/datasets/openqa.py +2 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/perf/plugin/datasets/random_dataset.py +15 -4
- evalscope-0.17.1/evalscope/perf/plugin/datasets/random_vl_dataset.py +80 -0
- evalscope-0.17.1/evalscope/perf/plugin/registry.py +74 -0
- evalscope-0.17.1/evalscope/perf/utils/__init__.py +0 -0
- evalscope-0.17.1/evalscope/perf/utils/analysis_result.py +30 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/perf/utils/benchmark_util.py +14 -20
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/perf/utils/db_util.py +79 -61
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/report/__init__.py +1 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/report/utils.py +34 -15
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/run.py +1 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/summarizer.py +1 -2
- evalscope-0.17.1/evalscope/third_party/thinkbench/tools/__init__.py +0 -0
- evalscope-0.17.1/evalscope/utils/__init__.py +65 -0
- evalscope-0.17.1/evalscope/utils/argument_utils.py +64 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/utils/import_utils.py +16 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/utils/io_utils.py +55 -4
- evalscope-0.17.1/evalscope/utils/model_utils.py +76 -0
- evalscope-0.17.1/evalscope/version.py +4 -0
- {evalscope-0.16.3 → evalscope-0.17.1/evalscope.egg-info}/PKG-INFO +81 -150
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope.egg-info/SOURCES.txt +27 -30
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope.egg-info/requires.txt +24 -4
- evalscope-0.17.1/requirements/dev.txt +5 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/requirements/framework.txt +2 -2
- {evalscope-0.16.3 → evalscope-0.17.1}/setup.cfg +1 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/setup.py +35 -15
- {evalscope-0.16.3 → evalscope-0.17.1}/tests/aigc/test_t2i.py +1 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/tests/cli/test_all.py +68 -4
- {evalscope-0.16.3 → evalscope-0.17.1}/tests/cli/test_collection.py +1 -1
- evalscope-0.17.1/tests/cli/test_custom.py +261 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/tests/cli/test_run.py +34 -70
- {evalscope-0.16.3 → evalscope-0.17.1}/tests/perf/test_perf.py +31 -4
- evalscope-0.17.1/tests/rag/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/tests/rag/test_clip_benchmark.py +2 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/tests/rag/test_mteb.py +3 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/tests/rag/test_ragas.py +3 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/tests/swift/test_run_swift_eval.py +2 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/tests/swift/test_run_swift_vlm_eval.py +2 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/tests/swift/test_run_swift_vlm_jugde_eval.py +2 -1
- evalscope-0.17.1/tests/utils.py +13 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/tests/vlm/test_vlmeval.py +8 -2
- evalscope-0.16.3/evalscope/app/app.py +0 -788
- evalscope-0.16.3/evalscope/evaluator/__init__.py +0 -3
- evalscope-0.16.3/evalscope/evaluator/rating_eval.py +0 -157
- evalscope-0.16.3/evalscope/evaluator/reviewer/auto_reviewer.py +0 -391
- evalscope-0.16.3/evalscope/metrics/llm_judge.py +0 -111
- evalscope-0.16.3/evalscope/models/model.py +0 -189
- evalscope-0.16.3/evalscope/perf/http_client.py +0 -176
- evalscope-0.16.3/evalscope/perf/plugin/__init__.py +0 -2
- evalscope-0.16.3/evalscope/perf/plugin/api/__init__.py +0 -3
- evalscope-0.16.3/evalscope/perf/plugin/api/custom_api.py +0 -92
- evalscope-0.16.3/evalscope/perf/plugin/datasets/__init__.py +0 -7
- evalscope-0.16.3/evalscope/perf/plugin/registry.py +0 -54
- evalscope-0.16.3/evalscope/perf/utils/analysis_result.py +0 -29
- evalscope-0.16.3/evalscope/registry/config/cfg_arena.yaml +0 -77
- evalscope-0.16.3/evalscope/registry/config/cfg_arena_zhihu.yaml +0 -63
- evalscope-0.16.3/evalscope/registry/config/cfg_pairwise_baseline.yaml +0 -83
- evalscope-0.16.3/evalscope/registry/config/cfg_single.yaml +0 -78
- evalscope-0.16.3/evalscope/registry/data/prompt_template/lmsys_v2.jsonl +0 -8
- evalscope-0.16.3/evalscope/registry/data/prompt_template/prompt_templates.jsonl +0 -8
- evalscope-0.16.3/evalscope/registry/data/qa_browser/battle.jsonl +0 -634
- evalscope-0.16.3/evalscope/registry/data/qa_browser/category_mapping.yaml +0 -10
- evalscope-0.16.3/evalscope/registry/data/question.jsonl +0 -80
- evalscope-0.16.3/evalscope/registry/tasks/arc.yaml +0 -28
- evalscope-0.16.3/evalscope/registry/tasks/bbh.yaml +0 -26
- evalscope-0.16.3/evalscope/registry/tasks/bbh_mini.yaml +0 -26
- evalscope-0.16.3/evalscope/registry/tasks/ceval.yaml +0 -27
- evalscope-0.16.3/evalscope/registry/tasks/ceval_mini.yaml +0 -26
- evalscope-0.16.3/evalscope/registry/tasks/cmmlu.yaml +0 -27
- evalscope-0.16.3/evalscope/registry/tasks/eval_qwen-7b-chat_v100.yaml +0 -28
- evalscope-0.16.3/evalscope/registry/tasks/general_qa.yaml +0 -27
- evalscope-0.16.3/evalscope/registry/tasks/gsm8k.yaml +0 -29
- evalscope-0.16.3/evalscope/registry/tasks/mmlu.yaml +0 -29
- evalscope-0.16.3/evalscope/registry/tasks/mmlu_mini.yaml +0 -27
- evalscope-0.16.3/evalscope/run_arena.py +0 -202
- evalscope-0.16.3/evalscope/utils/__init__.py +0 -4
- evalscope-0.16.3/evalscope/utils/arena_utils.py +0 -217
- evalscope-0.16.3/evalscope/utils/completion_parsers.py +0 -82
- evalscope-0.16.3/evalscope/utils/model_utils.py +0 -40
- evalscope-0.16.3/evalscope/version.py +0 -4
- evalscope-0.16.3/tests/swift/__init__.py +0 -1
- evalscope-0.16.3/tests/vlm/__init__.py +0 -1
- {evalscope-0.16.3 → evalscope-0.17.1}/LICENSE +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/MANIFEST.in +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/app/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/app/arguments.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/arguments.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/base.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/opencompass/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/opencompass/api_meta_template.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/opencompass/tasks/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/opencompass/tasks/eval_api.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/opencompass/tasks/eval_datasets.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/clip_benchmark/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/clip_benchmark/arguments.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/clip_benchmark/dataset_builder.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/clip_benchmark/task_template.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/image_caption.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_classification.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_retrieval.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/clip_benchmark/utils/webdataset_convert.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/clip_benchmark/utils/webdatasets.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/arguments.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/base.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/task_template.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/tasks/Classification.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/tasks/Clustering.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/tasks/CustomTask.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/tasks/PairClassification.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/tasks/Reranking.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/tasks/Retrieval.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/tasks/STS.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/cmteb/tasks/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/ragas/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/ragas/arguments.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/ragas/prompts/persona_prompt.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/ragas/task_template.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/ragas/tasks/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/ragas/tasks/build_distribution.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/ragas/tasks/build_transform.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/ragas/tasks/testset_generation.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/ragas/tasks/translate_prompt.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/utils/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/utils/clip.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/utils/llm.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/rag_eval/utils/tools.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/backend/vlm_eval_kit/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/aigc/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/aigc/t2i/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/aigc/t2i/base.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/aigc/t2i/evalmuse_adapter.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/aigc/t2i/genai_bench_adapter.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/aigc/t2i/general_t2i_adapter.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/aigc/t2i/hpdv2_adapter.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/aigc/t2i/tifa_adapter.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/aime/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/alpaca_eval/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/arc/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/arc/ai2_arc.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/arena_hard/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/bbh_adapter.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/boolean_expressions.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/causal_judgement.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/date_understanding.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/disambiguation_qa.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/dyck_languages.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/formal_fallacies.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/geometric_shapes.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/hyperbaton.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_five_objects.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_seven_objects.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_three_objects.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/movie_recommendation.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/multistep_arithmetic_two.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/navigate.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/object_counting.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/penguins_in_a_table.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/reasoning_about_colored_objects.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/ruin_names.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/salient_translation_error_detection.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/snarks.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/sports_understanding.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/temporal_sequences.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_five_objects.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_seven_objects.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_three_objects.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/web_of_lies.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bbh/cot_prompts/word_sorting.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/benchmark.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/bfcl/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/ceval/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/ceval/ceval_exam.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/chinese_simple_qa/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/chinese_simple_qa/csimple_qa_adapter.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/cmmlu/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/cmmlu/cmmlu.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/cmmlu/samples.jsonl +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/competition_math/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/competition_math/competition_math.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/data_collection/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/data_collection/data_collection_adapter.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/docmath/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/docmath/docmath_adapter.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/docmath/utils.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/drop/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/drop/drop_adapter.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/drop/utils.py +0 -0
- {evalscope-0.16.3/evalscope/utils → evalscope-0.17.1/evalscope/benchmarks}/filters.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/frames/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/frames/frames_adapter.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/frames/utils.py +0 -0
- {evalscope-0.16.3/evalscope/benchmarks/general_mcq → evalscope-0.17.1/evalscope/benchmarks/general_arena}/__init__.py +0 -0
- {evalscope-0.16.3/evalscope/benchmarks/gpqa → evalscope-0.17.1/evalscope/benchmarks/general_mcq}/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/general_qa/__init__.py +0 -0
- {evalscope-0.16.3/evalscope/benchmarks/ifeval → evalscope-0.17.1/evalscope/benchmarks/gpqa}/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/gpqa/chain_of_thought.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/gpqa/gpqa_adapter.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/gsm8k/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/gsm8k/gsm8k.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/gsm8k/gsm8k_adapter.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/hellaswag/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/hellaswag/hellaswag.py +0 -0
- {evalscope-0.16.3/evalscope/benchmarks/iquiz → evalscope-0.17.1/evalscope/benchmarks/hle}/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/humaneval/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/humaneval/humaneval.py +0 -0
- {evalscope-0.16.3/evalscope/benchmarks/live_code_bench → evalscope-0.17.1/evalscope/benchmarks/ifeval}/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/ifeval/instructions.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/ifeval/instructions_registry.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/ifeval/instructions_util.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/ifeval/utils.py +0 -0
- {evalscope-0.16.3/evalscope/benchmarks/maritime_bench → evalscope-0.17.1/evalscope/benchmarks/iquiz}/__init__.py +0 -0
- {evalscope-0.16.3/evalscope/benchmarks/math_500 → evalscope-0.17.1/evalscope/benchmarks/live_code_bench}/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/live_code_bench/evaluate_utils.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/live_code_bench/extract_utils.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/live_code_bench/load_utils.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/live_code_bench/pass_k_utils.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/live_code_bench/prompts.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/live_code_bench/testing_util.py +0 -0
- {evalscope-0.16.3/evalscope/benchmarks/mmlu_pro → evalscope-0.17.1/evalscope/benchmarks/maritime_bench}/__init__.py +0 -0
- {evalscope-0.16.3/evalscope/benchmarks/mmlu_redux → evalscope-0.17.1/evalscope/benchmarks/math_500}/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/mmlu/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/mmlu/mmlu.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/mmlu/samples.jsonl +0 -0
- {evalscope-0.16.3/evalscope/benchmarks/musr → evalscope-0.17.1/evalscope/benchmarks/mmlu_pro}/__init__.py +0 -0
- {evalscope-0.16.3/evalscope/benchmarks/needle_haystack → evalscope-0.17.1/evalscope/benchmarks/mmlu_redux}/__init__.py +0 -0
- {evalscope-0.16.3/evalscope/benchmarks/process_bench → evalscope-0.17.1/evalscope/benchmarks/musr}/__init__.py +0 -0
- {evalscope-0.16.3/evalscope/benchmarks/simple_qa → evalscope-0.17.1/evalscope/benchmarks/needle_haystack}/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/needle_haystack/needle_haystack_adapter.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/needle_haystack/utils.py +0 -0
- {evalscope-0.16.3/evalscope/benchmarks/super_gpqa → evalscope-0.17.1/evalscope/benchmarks/process_bench}/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/process_bench/critique_template.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/process_bench/process_bench_adapter.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/race/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/race/race.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/race/samples.jsonl +0 -0
- {evalscope-0.16.3/evalscope/benchmarks/tool_bench → evalscope-0.17.1/evalscope/benchmarks/simple_qa}/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/simple_qa/simple_qa_adapter.py +0 -0
- {evalscope-0.16.3/evalscope/benchmarks/winogrande → evalscope-0.17.1/evalscope/benchmarks/super_gpqa}/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/super_gpqa/five_shot_prompt.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/super_gpqa/super_gpqa_adapter.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/super_gpqa/utils.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/super_gpqa/zero_shot_prompt.txt +0 -0
- {evalscope-0.16.3/evalscope/metrics/t2v_metrics/models → evalscope-0.17.1/evalscope/benchmarks/tau_bench}/__init__.py +0 -0
- {evalscope-0.16.3/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model → evalscope-0.17.1/evalscope/benchmarks/tool_bench}/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/tool_bench/utils.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/trivia_qa/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/trivia_qa/samples.jsonl +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/trivia_qa/trivia_qa.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/truthful_qa/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/truthful_qa/truthful_qa.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/benchmarks/truthful_qa/truthful_qa_adapter.py +0 -0
- {evalscope-0.16.3/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward → evalscope-0.17.1/evalscope/benchmarks/winogrande}/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/cli/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/cli/base.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/cli/cli.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/cli/start_app.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/cli/start_eval.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/cli/start_perf.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/cli/start_server.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/collections/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/collections/evaluator.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/collections/sampler.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/collections/schema.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/bundled_rouge_score/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/bundled_rouge_score/rouge_scorer.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/math_parser.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/named_metrics.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/rouge_metric.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/clipscore.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/constants.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/itmscore.py +0 -0
- {evalscope-0.16.3/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5 → evalscope-0.17.1/evalscope/metrics/t2v_metrics/models}/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model/base_model.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model/clip_model.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model/cross_modeling.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/clip_model.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/hpsv2_model.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/mps_model.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/pickscore_model.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/itmscore_models/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/itmscore_models/blip2_itm_model.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/itmscore_models/fga_blip2_model.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward/ImageReward.py +0 -0
- {evalscope-0.16.3/evalscope/perf → evalscope-0.17.1/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward}/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward/blip_pretrain.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward_model.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/model.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/utils.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/__init__.py +0 -0
- {evalscope-0.16.3/evalscope/perf/utils → evalscope-0.17.1/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5}/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/language_model/clip_t5.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/multimodal_encoder/builder.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/multimodal_encoder/clip_encoder.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/multimodal_projector/builder.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5_model.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/gpt4v_model.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/config.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/dist_utils.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/gradcam.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/logger.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/optims.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/registry.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/utils.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/vqa_tools/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/vqa_tools/vqa.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/vqa_tools/vqa_eval.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/default.yaml +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_caption_flant5xl.yaml +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_caption_opt2.7b.yaml +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_caption_opt6.7b.yaml +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_coco.yaml +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_flant5xl.yaml +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_flant5xxl.yaml +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_vicuna13b.yaml +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_vicuna7b.yaml +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain.yaml +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl.yaml +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl_iter_80k_total_100k_no_prefix.yaml +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl_iter_80k_total_100k_prefix.yaml +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl_vitL.yaml +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xxl.yaml +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_opt2.7b.yaml +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_opt6.7b.yaml +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_vitL.yaml +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_vicuna13b.yaml +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_vicuna7b.yaml +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/med_config.json +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/med_config_albef.json +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/med_large_config.json +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/base_model.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/Qformer.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_image_text_matching.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_qformer.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_t5.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_t5_instruct.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/fga_blip2.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/modeling_llama.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/modeling_t5.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_caption.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_classification.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_feature_extractor.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_image_text_matching.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_nlvr.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_outputs.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_pretrain.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_vqa.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/nlvr_encoder.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/clip_vit.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/eva_vit.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/med.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/vit.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/base_processor.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/blip_processors.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/randaugment.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/mm_utils.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/vqa_model.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/score.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/metrics/t2v_metrics/vqascore.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/models/custom/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/models/custom/custom_model.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/models/local_model.py +0 -0
- {evalscope-0.16.3/tests/rag → evalscope-0.17.1/evalscope/perf}/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/perf/plugin/datasets/speed_benchmark.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/perf/utils/handler.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/perf/utils/local_server.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/perf/utils/log_utils.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/perf/utils/rich_display.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/report/combinator.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/report/generator.py +0 -0
- {evalscope-0.16.3/evalscope/evaluator/reviewer → evalscope-0.17.1/evalscope/third_party}/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/longbench_write/README.md +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/longbench_write/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/longbench_write/default_task.json +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/longbench_write/default_task.yaml +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/longbench_write/eval.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/longbench_write/infer.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/longbench_write/longbench_write.py +0 -0
- {evalscope-0.16.3/evalscope/registry → evalscope-0.17.1/evalscope/third_party/longbench_write/resources}/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/longbench_write/resources/judge.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/longbench_write/resources/longbench_write.jsonl +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/longbench_write/resources/longbench_write_en.jsonl +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/longbench_write/resources/longwrite_ruler.jsonl +0 -0
- {evalscope-0.16.3/evalscope/third_party → evalscope-0.17.1/evalscope/third_party/longbench_write/tools}/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/longbench_write/tools/data_etl.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/longbench_write/tools/openai_api.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/longbench_write/utils.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/thinkbench/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/thinkbench/eval.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/thinkbench/infer.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/thinkbench/resources/critique_template.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/thinkbench/resources/reformat_template.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/thinkbench/tools/llm.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/thinkbench/tools/utils.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/toolbench_static/README.md +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/toolbench_static/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/toolbench_static/config_default.json +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/toolbench_static/config_default.yaml +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/toolbench_static/eval.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/toolbench_static/infer.py +0 -0
- {evalscope-0.16.3/evalscope/third_party/longbench_write/resources → evalscope-0.17.1/evalscope/third_party/toolbench_static/llm}/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/toolbench_static/llm/swift_infer.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/toolbench_static/requirements.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/third_party/toolbench_static/toolbench_static.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/utils/chat_service.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/utils/deprecation_utils.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope/utils/logger.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope.egg-info/dependency_links.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope.egg-info/entry_points.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope.egg-info/not-zip-safe +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/evalscope.egg-info/top_level.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/requirements/aigc.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/requirements/app.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/requirements/docs.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/requirements/opencompass.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/requirements/perf.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/requirements/rag.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/requirements/vlmeval.txt +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/requirements.txt +0 -0
- {evalscope-0.16.3/evalscope/third_party/longbench_write/tools → evalscope-0.17.1/tests}/__init__.py +0 -0
- {evalscope-0.16.3/evalscope/third_party/toolbench_static/llm → evalscope-0.17.1/tests/aigc}/__init__.py +0 -0
- {evalscope-0.16.3/tests → evalscope-0.17.1/tests/cli}/__init__.py +0 -0
- {evalscope-0.16.3/tests/aigc → evalscope-0.17.1/tests/perf}/__init__.py +0 -0
- {evalscope-0.16.3/tests/cli → evalscope-0.17.1/tests/swift}/__init__.py +0 -0
- {evalscope-0.16.3 → evalscope-0.17.1}/tests/test_run_all.py +0 -0
- {evalscope-0.16.3/tests/perf → evalscope-0.17.1/tests/vlm}/__init__.py +0 -0
|
@@ -1,130 +1,31 @@
|
|
|
1
1
|
Metadata-Version: 2.1
|
|
2
2
|
Name: evalscope
|
|
3
|
-
Version: 0.
|
|
3
|
+
Version: 0.17.1
|
|
4
4
|
Summary: EvalScope: Lightweight LLMs Evaluation Framework
|
|
5
5
|
Home-page: https://github.com/modelscope/evalscope
|
|
6
6
|
Author: ModelScope team
|
|
7
7
|
Author-email: contact@modelscope.cn
|
|
8
|
+
License: Apache License 2.0
|
|
8
9
|
Keywords: python,llm,evaluation
|
|
9
10
|
Classifier: Development Status :: 4 - Beta
|
|
10
|
-
Classifier: License :: OSI Approved :: Apache Software License
|
|
11
11
|
Classifier: Operating System :: OS Independent
|
|
12
12
|
Classifier: Programming Language :: Python :: 3
|
|
13
|
-
Classifier: Programming Language :: Python :: 3.8
|
|
14
13
|
Classifier: Programming Language :: Python :: 3.9
|
|
15
14
|
Classifier: Programming Language :: Python :: 3.10
|
|
16
|
-
|
|
15
|
+
Classifier: Programming Language :: Python :: 3.11
|
|
16
|
+
Classifier: Programming Language :: Python :: 3.12
|
|
17
|
+
Requires-Python: >=3.9
|
|
17
18
|
Description-Content-Type: text/markdown
|
|
18
|
-
License-File: LICENSE
|
|
19
|
-
Requires-Dist: accelerate
|
|
20
|
-
Requires-Dist: datasets>=3.0
|
|
21
|
-
Requires-Dist: immutabledict
|
|
22
|
-
Requires-Dist: jieba
|
|
23
|
-
Requires-Dist: jsonlines
|
|
24
|
-
Requires-Dist: langdetect
|
|
25
|
-
Requires-Dist: latex2sympy2_extended
|
|
26
|
-
Requires-Dist: matplotlib
|
|
27
|
-
Requires-Dist: modelscope[framework]
|
|
28
|
-
Requires-Dist: nltk>=3.9
|
|
29
|
-
Requires-Dist: openai
|
|
30
|
-
Requires-Dist: pandas
|
|
31
|
-
Requires-Dist: pillow
|
|
32
|
-
Requires-Dist: pyarrow
|
|
33
|
-
Requires-Dist: pyyaml>=5.1
|
|
34
|
-
Requires-Dist: requests
|
|
35
|
-
Requires-Dist: rouge-chinese
|
|
36
|
-
Requires-Dist: rouge-score>=0.1.0
|
|
37
|
-
Requires-Dist: sacrebleu
|
|
38
|
-
Requires-Dist: scikit-learn
|
|
39
|
-
Requires-Dist: seaborn
|
|
40
|
-
Requires-Dist: sympy
|
|
41
|
-
Requires-Dist: tabulate
|
|
42
|
-
-Requires-Dist: torch
-Requires-Dist: tqdm
-Requires-Dist: transformers>=4.33
-Requires-Dist: word2number
 Provides-Extra: opencompass
-Requires-Dist: ms-opencompass>=0.1.6; extra == "opencompass"
 Provides-Extra: vlmeval
-Requires-Dist: ms-vlmeval>=0.0.17; extra == "vlmeval"
 Provides-Extra: rag
-Requires-Dist: langchain<0.4.0,>=0.3.0; extra == "rag"
-Requires-Dist: langchain-community<0.4.0,>=0.3.0; extra == "rag"
-Requires-Dist: langchain-core<0.4.0,>=0.3.0; extra == "rag"
-Requires-Dist: langchain-openai<0.4.0,>=0.3.0; extra == "rag"
-Requires-Dist: mteb==1.38.20; extra == "rag"
-Requires-Dist: ragas==0.2.14; extra == "rag"
-Requires-Dist: webdataset>0.2.0; extra == "rag"
 Provides-Extra: perf
-Requires-Dist: aiohttp; extra == "perf"
-Requires-Dist: fastapi; extra == "perf"
-Requires-Dist: numpy; extra == "perf"
-Requires-Dist: rich; extra == "perf"
-Requires-Dist: sse_starlette; extra == "perf"
-Requires-Dist: transformers; extra == "perf"
-Requires-Dist: uvicorn; extra == "perf"
 Provides-Extra: app
-Requires-Dist: gradio==5.4.0; extra == "app"
-Requires-Dist: plotly<6.0.0,>=5.23.0; extra == "app"
 Provides-Extra: aigc
-
-
-Requires-Dist: omegaconf; extra == "aigc"
-Requires-Dist: open_clip_torch; extra == "aigc"
-Requires-Dist: opencv-python; extra == "aigc"
-Requires-Dist: torchvision; extra == "aigc"
+Provides-Extra: dev
+Provides-Extra: docs
 Provides-Extra: all
-
-Requires-Dist: datasets>=3.0; extra == "all"
-Requires-Dist: immutabledict; extra == "all"
-Requires-Dist: jieba; extra == "all"
-Requires-Dist: jsonlines; extra == "all"
-Requires-Dist: langdetect; extra == "all"
-Requires-Dist: latex2sympy2_extended; extra == "all"
-Requires-Dist: matplotlib; extra == "all"
-Requires-Dist: modelscope[framework]; extra == "all"
-Requires-Dist: nltk>=3.9; extra == "all"
-Requires-Dist: openai; extra == "all"
-Requires-Dist: pandas; extra == "all"
-Requires-Dist: pillow; extra == "all"
-Requires-Dist: pyarrow; extra == "all"
-Requires-Dist: pyyaml>=5.1; extra == "all"
-Requires-Dist: requests; extra == "all"
-Requires-Dist: rouge-chinese; extra == "all"
-Requires-Dist: rouge-score>=0.1.0; extra == "all"
-Requires-Dist: sacrebleu; extra == "all"
-Requires-Dist: scikit-learn; extra == "all"
-Requires-Dist: seaborn; extra == "all"
-Requires-Dist: sympy; extra == "all"
-Requires-Dist: tabulate; extra == "all"
-Requires-Dist: torch; extra == "all"
-Requires-Dist: tqdm; extra == "all"
-Requires-Dist: transformers>=4.33; extra == "all"
-Requires-Dist: word2number; extra == "all"
-Requires-Dist: ms-opencompass>=0.1.6; extra == "all"
-Requires-Dist: ms-vlmeval>=0.0.17; extra == "all"
-Requires-Dist: langchain<0.4.0,>=0.3.0; extra == "all"
-Requires-Dist: langchain-community<0.4.0,>=0.3.0; extra == "all"
-Requires-Dist: langchain-core<0.4.0,>=0.3.0; extra == "all"
-Requires-Dist: langchain-openai<0.4.0,>=0.3.0; extra == "all"
-Requires-Dist: mteb==1.38.20; extra == "all"
-Requires-Dist: ragas==0.2.14; extra == "all"
-Requires-Dist: webdataset>0.2.0; extra == "all"
-Requires-Dist: aiohttp; extra == "all"
-Requires-Dist: fastapi; extra == "all"
-Requires-Dist: numpy; extra == "all"
-Requires-Dist: rich; extra == "all"
-Requires-Dist: sse_starlette; extra == "all"
-Requires-Dist: transformers; extra == "all"
-Requires-Dist: uvicorn; extra == "all"
-Requires-Dist: gradio==5.4.0; extra == "all"
-Requires-Dist: plotly<6.0.0,>=5.23.0; extra == "all"
-Requires-Dist: diffusers; extra == "all"
-Requires-Dist: iopath; extra == "all"
-Requires-Dist: omegaconf; extra == "all"
-Requires-Dist: open_clip_torch; extra == "all"
-Requires-Dist: opencv-python; extra == "all"
-Requires-Dist: torchvision; extra == "all"
+License-File: LICENSE
 
 <p align="center">
 <br>
@@ -165,16 +66,17 @@ Requires-Dist: torchvision; extra == "all"
 - [Basic Parameter](#basic-parameter)
 - [Output Results](#output-results)
 - [📈 Visualization of Evaluation Results](#-visualization-of-evaluation-results)
-- [🌐 Evaluation of
+- [🌐 Evaluation of Model API](#-evaluation-of-model-api)
 - [⚙️ Custom Parameter Evaluation](#️-custom-parameter-evaluation)
-- [Parameter](#parameter)
-- [Evaluation
+- [Parameter Description](#parameter-description)
+- [🧪 Other Evaluation Backends](#-other-evaluation-backends)
 - [📈 Model Serving Performance Evaluation](#-model-serving-performance-evaluation)
 - [🖊️ Custom Dataset Evaluation](#️-custom-dataset-evaluation)
-- [
+- [⚔️ Arena Mode](#️-arena-mode)
 - [👷♂️ Contribution](#️-contribution)
+- [📚 Citation](#-citation)
 - [🔜 Roadmap](#-roadmap)
-- [Star History](
+- [⭐ Star History](#-star-history)


 ## 📝 Introduction
@@ -198,24 +100,33 @@ EvalScope is not merely an evaluation tool; it is a valuable ally in your model
 Below is the overall architecture diagram of EvalScope:

 <p align="center">
-<img src="
+<img src="https://sail-moe.oss-cn-hangzhou.aliyuncs.com/yunlin/images/evalscope/doc/EvalScope%E6%9E%B6%E6%9E%84%E5%9B%BE.png" width="70%">
 <br>EvalScope Framework.
 </p>

 <details><summary>Framework Description</summary>

 The architecture includes the following modules:
-1.
-
-
-
-
-
-
-
-
-
+1. Input Layer
+- **Model Sources**: API models (OpenAI API), local models (ModelScope)
+- **Datasets**: Standard evaluation benchmarks (MMLU/GSM8k, etc.), custom data (MCQ/QA)
+
+2. Core Functions
+- **Multi-backend Evaluation**
+- Native backends: Unified evaluation for LLM/VLM/Embedding/T2I models
+- Integrated frameworks: OpenCompass/MTEB/VLMEvalKit/RAGAS
+
+- **Performance Monitoring**
+- Model plugins: Supports various model service APIs
+- Data plugins: Supports multiple data formats
+- Metric tracking: TTFT/TPOP/Stability and other metrics
+
+- **Tool Extensions**
+- Integration: Tool-Bench/Needle-in-a-Haystack/BFCL-v3
+
+3. Output Layer
+- **Structured Reports**: Supports JSON/Tables/Logs
+- **Visualization Platforms**: Supports Gradio/Wandb/SwanLab

 </details>

@@ -229,8 +140,12 @@ Please scan the QR code below to join our community groups:


 ## 🎉 News
-
-- 🔥 **[2025.
+- 🔥 **[2025.07.18]** The model stress testing now supports randomly generating image-text data for multimodal model evaluation. For usage instructions, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/examples.html#id4).
+- 🔥 **[2025.07.16]** Support for [τ-bench](https://github.com/sierra-research/tau-bench) has been added, enabling the evaluation of AI Agent performance and reliability in real-world scenarios involving dynamic user and tool interactions. For usage instructions, please refer to the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/get_started/supported_dataset/llm.html#bench).
+- 🔥 **[2025.07.14]** Support for "Humanity's Last Exam" ([Humanity's-Last-Exam](https://modelscope.cn/datasets/cais/hle)), a highly challenging evaluation benchmark. For usage instructions, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset/llm.html#humanity-s-last-exam).
+- 🔥 **[2025.07.03]** Refactored Arena Mode: now supports custom model battles, outputs a model leaderboard, and provides battle result visualization. See [reference](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html) for details.
+- 🔥 **[2025.06.28]** Optimized custom dataset evaluation: now supports evaluation without reference answers. Enhanced LLM judge usage, with built-in modes for "scoring directly without reference answers" and "checking answer consistency with reference answers". See [reference](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset/llm.html#qa) for details.
+- 🔥 **[2025.06.19]** Added support for the [BFCL-v3](https://modelscope.cn/datasets/AI-ModelScope/bfcl_v3) benchmark, designed to evaluate model function-calling capabilities across various scenarios. For more information, refer to the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/third_party/bfcl_v3.html).
 - 🔥 **[2025.06.02]** Added support for the Needle-in-a-Haystack test. Simply specify `needle_haystack` to conduct the test, and a corresponding heatmap will be generated in the `outputs/reports` folder, providing a visual representation of the model's performance. Refer to the [documentation](https://evalscope.readthedocs.io/en/latest/third_party/needle_haystack.html) for more details.
 - 🔥 **[2025.05.29]** Added support for two long document evaluation benchmarks: [DocMath](https://modelscope.cn/datasets/yale-nlp/DocMath-Eval/summary) and [FRAMES](https://modelscope.cn/datasets/iic/frames/summary). For usage guidelines, please refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset.html).
 - 🔥 **[2025.05.16]** Model service performance stress testing now supports setting various levels of concurrency and outputs a performance test report. [Reference example](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#id3).
@@ -239,6 +154,8 @@ Please scan the QR code below to join our community groups:
 - 🔥 **[2025.04.27]** Support for text-to-image evaluation: Supports 8 metrics including MPS, HPSv2.1Score, etc., and evaluation benchmarks such as EvalMuse, GenAI-Bench. Refer to the [user documentation](https://evalscope.readthedocs.io/en/latest/user_guides/aigc/t2i.html) for more details.
 - 🔥 **[2025.04.10]** Model service stress testing tool now supports the `/v1/completions` endpoint (the default endpoint for vLLM benchmarking)
 - 🔥 **[2025.04.08]** Support for evaluating embedding model services compatible with the OpenAI API has been added. For more details, check the [user guide](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/mteb.html#configure-evaluation-parameters).
+<details><summary>More</summary>
+
 - 🔥 **[2025.03.27]** Added support for [AlpacaEval](https://www.modelscope.cn/datasets/AI-ModelScope/alpaca_eval/dataPeview) and [ArenaHard](https://modelscope.cn/datasets/AI-ModelScope/arena-hard-auto-v0.1/summary) evaluation benchmarks. For usage notes, please refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset.html)
 - 🔥 **[2025.03.20]** The model inference service stress testing now supports generating prompts of specified length using random values. Refer to the [user guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/examples.html#using-the-random-dataset) for more details.
 - 🔥 **[2025.03.13]** Added support for the [LiveCodeBench](https://www.modelscope.cn/datasets/AI-ModelScope/code_generation_lite/summary) code evaluation benchmark, which can be used by specifying `live_code_bench`. Supports evaluating QwQ-32B on LiveCodeBench, refer to the [best practices](https://evalscope.readthedocs.io/en/latest/best_practice/eval_qwq.html).
@@ -252,8 +169,6 @@ Please scan the QR code below to join our community groups:
 - 🔥 **[2025.02.13]** Added support for evaluating DeepSeek distilled models, including AIME24, MATH-500, and GPQA-Diamond datasets,refer to [best practice](https://evalscope.readthedocs.io/en/latest/best_practice/deepseek_r1_distill.html); Added support for specifying the `eval_batch_size` parameter to accelerate model evaluation.
 - 🔥 **[2025.01.20]** Support for visualizing evaluation results, including single model evaluation results and multi-model comparison, refer to the [📖 Visualizing Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html) for more details; Added [`iquiz`](https://modelscope.cn/datasets/AI-ModelScope/IQuiz/summary) evaluation example, evaluating the IQ and EQ of the model.
 - 🔥 **[2025.01.07]** Native backend: Support for model API evaluation is now available. Refer to the [📖 Model API Evaluation Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#api) for more details. Additionally, support for the `ifeval` evaluation benchmark has been added.
-<details><summary>More</summary>
-
 - 🔥🔥 **[2024.12.31]** Support for adding benchmark evaluations, refer to the [📖 Benchmark Evaluation Addition Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/add_benchmark.html); support for custom mixed dataset evaluations, allowing for more comprehensive model evaluations with less data, refer to the [📖 Mixed Dataset Evaluation Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/collection/index.html).
 - 🔥 **[2024.12.13]** Model evaluation optimization: no need to pass the `--template-type` parameter anymore; supports starting evaluation with `evalscope eval --args`. Refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html) for more details.
 - 🔥 **[2024.11.26]** The model inference service performance evaluator has been completely refactored: it now supports local inference service startup and Speed Benchmark; asynchronous call error handling has been optimized. For more details, refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/index.html).
@@ -345,33 +260,31 @@ evalscope eval \

 When using Python code for evaluation, you need to submit the evaluation task using the `run_task` function, passing a `TaskConfig` as a parameter. It can also be a Python dictionary, yaml file path, or json file path, for example:

-**Using
+**Using `TaskConfig`**

 ```python
-from evalscope
+from evalscope import run_task, TaskConfig

-task_cfg =
-
-
-
-
+task_cfg = TaskConfig(
+    model='Qwen/Qwen2.5-0.5B-Instruct',
+    datasets=['gsm8k', 'arc'],
+    limit=5
+)

 run_task(task_cfg=task_cfg)
 ```
-
 <details><summary>More Startup Methods</summary>

-**Using
+**Using Python Dictionary**

 ```python
 from evalscope.run import run_task
-from evalscope.config import TaskConfig

-task_cfg =
-model
-datasets
-limit
-
+task_cfg = {
+    'model': 'Qwen/Qwen2.5-0.5B-Instruct',
+    'datasets': ['gsm8k', 'arc'],
+    'limit': 5
+}

 run_task(task_cfg=task_cfg)
 ```
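The paragraph in the hunk above also allows a YAML or JSON file path to be passed to `run_task`, but the diff only shows the `TaskConfig` and dictionary variants. As an illustration only, with a hypothetical file name and keys that simply mirror the dictionary example, the file-path form would look like this:

```python
# Hypothetical config file mirroring the dictionary example above (eval_config.yaml):
#   model: Qwen/Qwen2.5-0.5B-Instruct
#   datasets: [gsm8k, arc]
#   limit: 5
from evalscope.run import run_task

run_task(task_cfg='eval_config.yaml')  # a .json path is stated to work the same way
```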
@@ -474,7 +387,7 @@ To create a public link, set `share=True` in `launch()`.

 For more details, refer to: [📖 Visualization of Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html)

-## 🌐 Evaluation of
+## 🌐 Evaluation of Model API

 Specify the model API service address (api_url) and API Key (api_key) to evaluate the deployed model API service. In this case, the `eval-type` parameter must be specified as `service`, for example:

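For the API-service case described in the hunk above, a minimal Python sketch is shown below. The prose names `api_url`, `api_key`, and an `eval-type` of `service`; the keyword spellings (`eval_type`, `api_url`, `api_key`) and the endpoint, key, model, and dataset values here are assumptions made purely for illustration.

```python
from evalscope import run_task, TaskConfig

task_cfg = TaskConfig(
    model='qwen2.5-7b-instruct',          # model name exposed by the deployed service (placeholder)
    eval_type='service',                  # evaluate an API service instead of local weights
    api_url='http://127.0.0.1:8801/v1',   # OpenAI-compatible endpoint (placeholder)
    api_key='EMPTY',                      # placeholder key
    datasets=['gsm8k'],
    limit=5,
)
run_task(task_cfg=task_cfg)
```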
@@ -525,7 +438,7 @@ evalscope eval \
 Reference: [Full Parameter Description](https://evalscope.readthedocs.io/en/latest/get_started/parameters.html)


-## Evaluation
+## 🧪 Other Evaluation Backends
 EvalScope supports using third-party evaluation frameworks to initiate evaluation tasks, which we call Evaluation Backend. Currently supported Evaluation Backend includes:
 - **Native**: EvalScope's own **default evaluation framework**, supporting various evaluation modes including single model evaluation, arena mode, and baseline model comparison mode.
 - [OpenCompass](https://github.com/open-compass/opencompass): Initiate OpenCompass evaluation tasks through EvalScope. Lightweight, easy to customize, supports seamless integration with the LLM fine-tuning framework ms-swift. [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/backend/opencompass_backend.html)
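To route a task to one of these backends from Python, the configuration presumably selects the backend and carries backend-specific settings. The sketch below is illustrative only: the `eval_backend` and `eval_config` keywords and the keys inside `eval_config` are assumptions, so the linked backend user guides should be treated as authoritative.

```python
from evalscope import run_task, TaskConfig

task_cfg = TaskConfig(
    eval_backend='OpenCompass',   # hand the task off to a third-party backend
    eval_config={                 # backend-specific settings (hypothetical keys)
        'datasets': ['gsm8k'],
        'models': [{'path': 'Qwen/Qwen2.5-0.5B-Instruct'}],
    },
)
run_task(task_cfg=task_cfg)
```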
@@ -572,10 +485,17 @@ Speed Benchmark Results:
 EvalScope supports custom dataset evaluation. For detailed information, please refer to the Custom Dataset Evaluation [📖User Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset/index.html)


-##
-The Arena mode allows multiple candidate models to be evaluated through pairwise battles, and can choose to use the AI Enhanced Auto-Reviewer (AAR) automatic evaluation process or manual evaluation to obtain the evaluation report.
+## ⚔️ Arena Mode

-Refer to: Arena Mode [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html)
+Arena mode allows you to configure multiple candidate models and specify a baseline model. Evaluation is performed by pairwise battles between each candidate model and the baseline model, with the final output including each model's win rate and ranking. This method is suitable for comparative evaluation among multiple models, providing an intuitive reflection of each model's strengths and weaknesses. Refer to: Arena Mode [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html)
+
+```text
+Model        WinRate (%)   CI (%)
+------------ ------------- ---------------
+qwen2.5-72b  69.3          (-13.3 / +12.2)
+qwen2.5-7b   50            (+0.0 / +0.0)
+qwen2.5-0.5b 4.7           (-2.5 / +4.4)
+```

 ## 👷♂️ Contribution

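The CI column above reports lower/upper offsets around each win rate. One common way to obtain such intervals is to bootstrap over the per-battle outcomes; the sketch below is a generic illustration of that idea, not EvalScope's actual implementation.

```python
# Generic bootstrap confidence interval for a win rate (illustration only).
import random

def winrate_ci(outcomes, n_boot=2000, alpha=0.05, seed=0):
    """outcomes: per-battle scores vs. the baseline (1 = win, 0.5 = tie, 0 = loss)."""
    rng = random.Random(seed)
    point = sum(outcomes) / len(outcomes)
    samples = sorted(
        sum(rng.choices(outcomes, k=len(outcomes))) / len(outcomes)
        for _ in range(n_boot)
    )
    lower = samples[int(n_boot * alpha / 2)]
    upper = samples[int(n_boot * (1 - alpha / 2)) - 1]
    # point estimate plus (lower, upper) offsets, in the same shape as the CI (%) column
    return point, lower - point, upper - point

print(winrate_ci([1, 1, 0.5, 0, 1, 1, 0, 1]))
```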
@@ -591,6 +511,17 @@ EvalScope, as the official evaluation tool of [ModelScope](https://modelscope.cn
 </table>
 </a>

+## 📚 Citation
+
+```bibtex
+@misc{evalscope_2024,
+    title={{EvalScope}: Evaluation Framework for Large Models},
+    author={ModelScope Team},
+    year={2024},
+    url={https://github.com/modelscope/evalscope}
+}
+```
+
 ## 🔜 Roadmap
 - [x] Support for better evaluation report visualization
 - [x] Support for mixed evaluations across multiple datasets
@@ -601,11 +532,11 @@ EvalScope, as the official evaluation tool of [ModelScope](https://modelscope.cn
 - [ ] Distributed evaluating
 - [x] Multi-modal evaluation
 - [ ] Benchmarks
-- [
+- [x] BFCL-v3
 - [x] GPQA
 - [x] MBPP


-## Star History
+## ⭐ Star History

 [](https://star-history.com/#modelscope/evalscope&Date)
@@ -37,16 +37,17 @@
 - [Basic Parameter](#basic-parameter)
 - [Output Results](#output-results)
 - [📈 Visualization of Evaluation Results](#-visualization-of-evaluation-results)
-- [🌐 Evaluation of
+- [🌐 Evaluation of Model API](#-evaluation-of-model-api)
 - [⚙️ Custom Parameter Evaluation](#️-custom-parameter-evaluation)
-- [Parameter](#parameter)
-- [Evaluation
+- [Parameter Description](#parameter-description)
+- [🧪 Other Evaluation Backends](#-other-evaluation-backends)
 - [📈 Model Serving Performance Evaluation](#-model-serving-performance-evaluation)
 - [🖊️ Custom Dataset Evaluation](#️-custom-dataset-evaluation)
-- [
+- [⚔️ Arena Mode](#️-arena-mode)
 - [👷♂️ Contribution](#️-contribution)
+- [📚 Citation](#-citation)
 - [🔜 Roadmap](#-roadmap)
-- [Star History](
+- [⭐ Star History](#-star-history)


 ## 📝 Introduction
@@ -70,24 +71,33 @@ EvalScope is not merely an evaluation tool; it is a valuable ally in your model
 Below is the overall architecture diagram of EvalScope:

 <p align="center">
-<img src="
+<img src="https://sail-moe.oss-cn-hangzhou.aliyuncs.com/yunlin/images/evalscope/doc/EvalScope%E6%9E%B6%E6%9E%84%E5%9B%BE.png" width="70%">
 <br>EvalScope Framework.
 </p>

 <details><summary>Framework Description</summary>

 The architecture includes the following modules:
-1.
-
-
-
-
-
-
-
-
-
+1. Input Layer
+- **Model Sources**: API models (OpenAI API), local models (ModelScope)
+- **Datasets**: Standard evaluation benchmarks (MMLU/GSM8k, etc.), custom data (MCQ/QA)
+
+2. Core Functions
+- **Multi-backend Evaluation**
+- Native backends: Unified evaluation for LLM/VLM/Embedding/T2I models
+- Integrated frameworks: OpenCompass/MTEB/VLMEvalKit/RAGAS
+
+- **Performance Monitoring**
+- Model plugins: Supports various model service APIs
+- Data plugins: Supports multiple data formats
+- Metric tracking: TTFT/TPOP/Stability and other metrics
+
+- **Tool Extensions**
+- Integration: Tool-Bench/Needle-in-a-Haystack/BFCL-v3
+
+3. Output Layer
+- **Structured Reports**: Supports JSON/Tables/Logs
+- **Visualization Platforms**: Supports Gradio/Wandb/SwanLab

 </details>

@@ -101,8 +111,12 @@ Please scan the QR code below to join our community groups:


 ## 🎉 News
-
-- 🔥 **[2025.
+- 🔥 **[2025.07.18]** The model stress testing now supports randomly generating image-text data for multimodal model evaluation. For usage instructions, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/examples.html#id4).
+- 🔥 **[2025.07.16]** Support for [τ-bench](https://github.com/sierra-research/tau-bench) has been added, enabling the evaluation of AI Agent performance and reliability in real-world scenarios involving dynamic user and tool interactions. For usage instructions, please refer to the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/get_started/supported_dataset/llm.html#bench).
+- 🔥 **[2025.07.14]** Support for "Humanity's Last Exam" ([Humanity's-Last-Exam](https://modelscope.cn/datasets/cais/hle)), a highly challenging evaluation benchmark. For usage instructions, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset/llm.html#humanity-s-last-exam).
+- 🔥 **[2025.07.03]** Refactored Arena Mode: now supports custom model battles, outputs a model leaderboard, and provides battle result visualization. See [reference](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html) for details.
+- 🔥 **[2025.06.28]** Optimized custom dataset evaluation: now supports evaluation without reference answers. Enhanced LLM judge usage, with built-in modes for "scoring directly without reference answers" and "checking answer consistency with reference answers". See [reference](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset/llm.html#qa) for details.
+- 🔥 **[2025.06.19]** Added support for the [BFCL-v3](https://modelscope.cn/datasets/AI-ModelScope/bfcl_v3) benchmark, designed to evaluate model function-calling capabilities across various scenarios. For more information, refer to the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/third_party/bfcl_v3.html).
 - 🔥 **[2025.06.02]** Added support for the Needle-in-a-Haystack test. Simply specify `needle_haystack` to conduct the test, and a corresponding heatmap will be generated in the `outputs/reports` folder, providing a visual representation of the model's performance. Refer to the [documentation](https://evalscope.readthedocs.io/en/latest/third_party/needle_haystack.html) for more details.
 - 🔥 **[2025.05.29]** Added support for two long document evaluation benchmarks: [DocMath](https://modelscope.cn/datasets/yale-nlp/DocMath-Eval/summary) and [FRAMES](https://modelscope.cn/datasets/iic/frames/summary). For usage guidelines, please refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset.html).
 - 🔥 **[2025.05.16]** Model service performance stress testing now supports setting various levels of concurrency and outputs a performance test report. [Reference example](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#id3).
@@ -111,6 +125,8 @@ Please scan the QR code below to join our community groups:
 - 🔥 **[2025.04.27]** Support for text-to-image evaluation: Supports 8 metrics including MPS, HPSv2.1Score, etc., and evaluation benchmarks such as EvalMuse, GenAI-Bench. Refer to the [user documentation](https://evalscope.readthedocs.io/en/latest/user_guides/aigc/t2i.html) for more details.
 - 🔥 **[2025.04.10]** Model service stress testing tool now supports the `/v1/completions` endpoint (the default endpoint for vLLM benchmarking)
 - 🔥 **[2025.04.08]** Support for evaluating embedding model services compatible with the OpenAI API has been added. For more details, check the [user guide](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/mteb.html#configure-evaluation-parameters).
+<details><summary>More</summary>
+
 - 🔥 **[2025.03.27]** Added support for [AlpacaEval](https://www.modelscope.cn/datasets/AI-ModelScope/alpaca_eval/dataPeview) and [ArenaHard](https://modelscope.cn/datasets/AI-ModelScope/arena-hard-auto-v0.1/summary) evaluation benchmarks. For usage notes, please refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset.html)
 - 🔥 **[2025.03.20]** The model inference service stress testing now supports generating prompts of specified length using random values. Refer to the [user guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/examples.html#using-the-random-dataset) for more details.
 - 🔥 **[2025.03.13]** Added support for the [LiveCodeBench](https://www.modelscope.cn/datasets/AI-ModelScope/code_generation_lite/summary) code evaluation benchmark, which can be used by specifying `live_code_bench`. Supports evaluating QwQ-32B on LiveCodeBench, refer to the [best practices](https://evalscope.readthedocs.io/en/latest/best_practice/eval_qwq.html).
@@ -124,8 +140,6 @@ Please scan the QR code below to join our community groups:
 - 🔥 **[2025.02.13]** Added support for evaluating DeepSeek distilled models, including AIME24, MATH-500, and GPQA-Diamond datasets,refer to [best practice](https://evalscope.readthedocs.io/en/latest/best_practice/deepseek_r1_distill.html); Added support for specifying the `eval_batch_size` parameter to accelerate model evaluation.
 - 🔥 **[2025.01.20]** Support for visualizing evaluation results, including single model evaluation results and multi-model comparison, refer to the [📖 Visualizing Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html) for more details; Added [`iquiz`](https://modelscope.cn/datasets/AI-ModelScope/IQuiz/summary) evaluation example, evaluating the IQ and EQ of the model.
 - 🔥 **[2025.01.07]** Native backend: Support for model API evaluation is now available. Refer to the [📖 Model API Evaluation Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#api) for more details. Additionally, support for the `ifeval` evaluation benchmark has been added.
-<details><summary>More</summary>
-
 - 🔥🔥 **[2024.12.31]** Support for adding benchmark evaluations, refer to the [📖 Benchmark Evaluation Addition Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/add_benchmark.html); support for custom mixed dataset evaluations, allowing for more comprehensive model evaluations with less data, refer to the [📖 Mixed Dataset Evaluation Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/collection/index.html).
 - 🔥 **[2024.12.13]** Model evaluation optimization: no need to pass the `--template-type` parameter anymore; supports starting evaluation with `evalscope eval --args`. Refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html) for more details.
 - 🔥 **[2024.11.26]** The model inference service performance evaluator has been completely refactored: it now supports local inference service startup and Speed Benchmark; asynchronous call error handling has been optimized. For more details, refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/index.html).
@@ -217,33 +231,31 @@ evalscope eval \

 When using Python code for evaluation, you need to submit the evaluation task using the `run_task` function, passing a `TaskConfig` as a parameter. It can also be a Python dictionary, yaml file path, or json file path, for example:

-**Using
+**Using `TaskConfig`**

 ```python
-from evalscope
+from evalscope import run_task, TaskConfig

-task_cfg =
-
-
-
-
+task_cfg = TaskConfig(
+    model='Qwen/Qwen2.5-0.5B-Instruct',
+    datasets=['gsm8k', 'arc'],
+    limit=5
+)

 run_task(task_cfg=task_cfg)
 ```
-
 <details><summary>More Startup Methods</summary>

-**Using
+**Using Python Dictionary**

 ```python
 from evalscope.run import run_task
-from evalscope.config import TaskConfig

-task_cfg =
-model
-datasets
-limit
-
+task_cfg = {
+    'model': 'Qwen/Qwen2.5-0.5B-Instruct',
+    'datasets': ['gsm8k', 'arc'],
+    'limit': 5
+}

 run_task(task_cfg=task_cfg)
 ```
@@ -346,7 +358,7 @@ To create a public link, set `share=True` in `launch()`.

 For more details, refer to: [📖 Visualization of Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html)

-## 🌐 Evaluation of
+## 🌐 Evaluation of Model API

 Specify the model API service address (api_url) and API Key (api_key) to evaluate the deployed model API service. In this case, the `eval-type` parameter must be specified as `service`, for example:

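Before launching a `service` evaluation against the address and key described above, it can help to confirm that the endpoint actually answers. This pre-flight check is an editorial illustration rather than part of EvalScope; it uses the `openai` client (listed among the project's dependencies), and the endpoint and key below are placeholders.

```python
from openai import OpenAI

client = OpenAI(base_url='http://127.0.0.1:8801/v1', api_key='EMPTY')  # placeholders
print([m.id for m in client.models.list().data])  # served model names to pass as `model`
```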
@@ -397,7 +409,7 @@ evalscope eval \
 Reference: [Full Parameter Description](https://evalscope.readthedocs.io/en/latest/get_started/parameters.html)


-## Evaluation
+## 🧪 Other Evaluation Backends
 EvalScope supports using third-party evaluation frameworks to initiate evaluation tasks, which we call Evaluation Backend. Currently supported Evaluation Backend includes:
 - **Native**: EvalScope's own **default evaluation framework**, supporting various evaluation modes including single model evaluation, arena mode, and baseline model comparison mode.
 - [OpenCompass](https://github.com/open-compass/opencompass): Initiate OpenCompass evaluation tasks through EvalScope. Lightweight, easy to customize, supports seamless integration with the LLM fine-tuning framework ms-swift. [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/backend/opencompass_backend.html)
@@ -444,10 +456,17 @@ Speed Benchmark Results:
 EvalScope supports custom dataset evaluation. For detailed information, please refer to the Custom Dataset Evaluation [📖User Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset/index.html)


-##
-The Arena mode allows multiple candidate models to be evaluated through pairwise battles, and can choose to use the AI Enhanced Auto-Reviewer (AAR) automatic evaluation process or manual evaluation to obtain the evaluation report.
+## ⚔️ Arena Mode

-Refer to: Arena Mode [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html)
+Arena mode allows you to configure multiple candidate models and specify a baseline model. Evaluation is performed by pairwise battles between each candidate model and the baseline model, with the final output including each model's win rate and ranking. This method is suitable for comparative evaluation among multiple models, providing an intuitive reflection of each model's strengths and weaknesses. Refer to: Arena Mode [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html)
+
+```text
+Model        WinRate (%)   CI (%)
+------------ ------------- ---------------
+qwen2.5-72b  69.3          (-13.3 / +12.2)
+qwen2.5-7b   50            (+0.0 / +0.0)
+qwen2.5-0.5b 4.7           (-2.5 / +4.4)
+```

 ## 👷♂️ Contribution

@@ -463,6 +482,17 @@ EvalScope, as the official evaluation tool of [ModelScope](https://modelscope.cn
 </table>
 </a>

+## 📚 Citation
+
+```bibtex
+@misc{evalscope_2024,
+    title={{EvalScope}: Evaluation Framework for Large Models},
+    author={ModelScope Team},
+    year={2024},
+    url={https://github.com/modelscope/evalscope}
+}
+```
+
 ## 🔜 Roadmap
 - [x] Support for better evaluation report visualization
 - [x] Support for mixed evaluations across multiple datasets
@@ -473,11 +503,11 @@ EvalScope, as the official evaluation tool of [ModelScope](https://modelscope.cn
 - [ ] Distributed evaluating
 - [x] Multi-modal evaluation
 - [ ] Benchmarks
-- [
+- [x] BFCL-v3
 - [x] GPQA
 - [x] MBPP


-## Star History
+## ⭐ Star History

 [](https://star-history.com/#modelscope/evalscope&Date)
@@ -0,0 +1,35 @@
+"""
+Main application module for the Evalscope dashboard.
+"""
+import argparse
+
+from evalscope.utils.logger import configure_logging
+from .arguments import add_argument
+from .ui import create_app_ui
+
+
+def create_app(args: argparse.Namespace):
+    """
+    Create and launch the Evalscope dashboard application.
+
+    Args:
+        args: Command line arguments.
+    """
+    configure_logging(debug=args.debug)
+
+    demo = create_app_ui(args)
+
+    demo.launch(
+        share=args.share,
+        server_name=args.server_name,
+        server_port=args.server_port,
+        debug=args.debug,
+        allowed_paths=args.allowed_paths,
+    )
+
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    add_argument(parser)
+    args = parser.parse_args()
+    create_app(args)
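Judging from the new module above, the dashboard can also be started programmatically by reusing the same argument parser. The sketch below is a guess at such a smoke test: it assumes every option registered by `add_argument` carries a default, and that the import paths follow the file locations (`evalscope/app/app.py` and the `from .arguments import add_argument` shown in the diff).

```python
import argparse

from evalscope.app.app import create_app          # path inferred from the new file above
from evalscope.app.arguments import add_argument  # implied by `from .arguments import add_argument`

parser = argparse.ArgumentParser()
add_argument(parser)
create_app(parser.parse_args([]))  # [] -> rely on the registered defaults
```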