evalscope 0.6.0rc0__py3-none-any.whl → 0.7.0__py3-none-any.whl
This diff shows the contents of publicly released package versions as they appear in their respective public registries. It is provided for informational purposes only.
- evalscope/backend/opencompass/tasks/eval_datasets.py +1 -1
- evalscope/backend/rag_eval/clip_benchmark/utils/webdataset_convert.py +230 -0
- evalscope/backend/rag_eval/clip_benchmark/utils/webdatasets.txt +43 -0
- evalscope/backend/rag_eval/ragas/prompts/chinese/AnswerCorrectness/correctness_prompt_chinese.json +87 -0
- evalscope/backend/rag_eval/ragas/prompts/chinese/AnswerCorrectness/long_form_answer_prompt_chinese.json +36 -0
- evalscope/backend/rag_eval/ragas/prompts/chinese/AnswerRelevancy/question_generation_chinese.json +26 -0
- evalscope/backend/rag_eval/ragas/prompts/chinese/ContextPrecision/context_precision_prompt_chinese.json +41 -0
- evalscope/backend/rag_eval/ragas/prompts/chinese/Faithfulness/nli_statements_message_chinese.json +60 -0
- evalscope/backend/rag_eval/ragas/prompts/chinese/Faithfulness/statement_prompt_chinese.json +36 -0
- evalscope/backend/rag_eval/ragas/prompts/chinese/HeadlinesExtractor/prompt_chinese.json +22 -0
- evalscope/backend/rag_eval/ragas/prompts/chinese/MultiHopAbstractQuerySynthesizer/concept_combination_prompt_chinese.json +35 -0
- evalscope/backend/rag_eval/ragas/prompts/chinese/MultiHopAbstractQuerySynthesizer/generate_query_reference_prompt_chinese.json +7 -0
- evalscope/backend/rag_eval/ragas/prompts/chinese/MultiHopAbstractQuerySynthesizer/theme_persona_matching_prompt_chinese.json +39 -0
- evalscope/backend/rag_eval/ragas/prompts/chinese/MultiHopSpecificQuerySynthesizer/generate_query_reference_prompt_chinese.json +7 -0
- evalscope/backend/rag_eval/ragas/prompts/chinese/MultiHopSpecificQuerySynthesizer/theme_persona_matching_prompt_chinese.json +39 -0
- evalscope/backend/rag_eval/ragas/prompts/chinese/MultiModalFaithfulness/faithfulness_prompt_chinese.json +34 -0
- evalscope/backend/rag_eval/ragas/prompts/chinese/MultiModalRelevance/relevance_prompt_chinese.json +36 -0
- evalscope/backend/rag_eval/ragas/prompts/chinese/NERExtractor/prompt_chinese.json +25 -0
- evalscope/backend/rag_eval/ragas/prompts/chinese/SingleHopSpecificQuerySynthesizer/generate_query_reference_prompt_chinese.json +7 -0
- evalscope/backend/rag_eval/ragas/prompts/chinese/SingleHopSpecificQuerySynthesizer/theme_persona_matching_prompt_chinese.json +39 -0
- evalscope/backend/rag_eval/ragas/prompts/chinese/SummaryExtractor/prompt_chinese.json +16 -0
- evalscope/backend/rag_eval/ragas/prompts/chinese/ThemesExtractor/prompt_chinese.json +24 -0
- evalscope/backend/rag_eval/ragas/prompts/persona_prompt.py +18 -0
- evalscope/backend/rag_eval/ragas/tasks/testset_generation.py +120 -100
- evalscope/backend/rag_eval/utils/clip.py +149 -0
- evalscope/backend/rag_eval/utils/embedding.py +183 -0
- evalscope/backend/rag_eval/utils/llm.py +72 -0
- evalscope/backend/rag_eval/utils/tools.py +63 -0
- evalscope/backend/vlm_eval_kit/backend_manager.py +23 -21
- evalscope/benchmarks/ceval/samples.jsonl +1 -0
- evalscope/benchmarks/cmmlu/samples.jsonl +5 -0
- evalscope/benchmarks/mmlu/samples.jsonl +5 -0
- evalscope/benchmarks/race/samples.jsonl +5 -0
- evalscope/benchmarks/trivia_qa/samples.jsonl +5 -0
- evalscope/cli/start_perf.py +8 -11
- evalscope/metrics/bundled_rouge_score/rouge_scorer.py +1 -1
- evalscope/metrics/resources/gpt2-zhcn3-v4.bpe +58485 -0
- evalscope/metrics/resources/gpt2-zhcn3-v4.json +1 -0
- evalscope/metrics/rouge_metric.py +30 -15
- evalscope/perf/arguments.py +179 -0
- evalscope/perf/benchmark.py +245 -0
- evalscope/perf/http_client.py +127 -711
- evalscope/perf/main.py +35 -0
- evalscope/perf/plugin/__init__.py +2 -0
- evalscope/perf/plugin/api/__init__.py +3 -0
- evalscope/perf/{api_plugin_base.py → plugin/api/base.py} +17 -18
- evalscope/perf/{custom_api.py → plugin/api/custom_api.py} +25 -19
- evalscope/perf/{dashscope_api.py → plugin/api/dashscope_api.py} +28 -14
- evalscope/perf/{openai_api.py → plugin/api/openai_api.py} +51 -27
- evalscope/perf/plugin/datasets/__init__.py +6 -0
- evalscope/perf/{dataset_plugin_base.py → plugin/datasets/base.py} +13 -10
- evalscope/perf/plugin/datasets/custom.py +21 -0
- evalscope/perf/plugin/datasets/flickr8k.py +51 -0
- evalscope/perf/{datasets → plugin/datasets}/line_by_line.py +9 -5
- evalscope/perf/plugin/datasets/longalpaca.py +28 -0
- evalscope/perf/plugin/datasets/openqa.py +38 -0
- evalscope/perf/plugin/datasets/speed_benchmark.py +50 -0
- evalscope/perf/plugin/registry.py +54 -0
- evalscope/perf/{how_to_analysis_result.py → utils/analysis_result.py} +11 -5
- evalscope/perf/utils/benchmark_util.py +135 -0
- evalscope/perf/utils/chat_service.py +252 -0
- evalscope/perf/utils/db_util.py +200 -0
- evalscope/perf/utils/handler.py +46 -0
- evalscope/perf/utils/local_server.py +139 -0
- evalscope/registry/config/cfg_arena.yaml +77 -0
- evalscope/registry/config/cfg_arena_zhihu.yaml +63 -0
- evalscope/registry/config/cfg_pairwise_baseline.yaml +83 -0
- evalscope/registry/config/cfg_single.yaml +78 -0
- evalscope/registry/data/prompt_template/lmsys_v2.jsonl +8 -0
- evalscope/registry/data/prompt_template/prompt_templates.jsonl +8 -0
- evalscope/registry/data/qa_browser/battle.jsonl +634 -0
- evalscope/registry/data/qa_browser/category_mapping.yaml +10 -0
- evalscope/registry/data/question.jsonl +80 -0
- evalscope/third_party/longbench_write/README.md +118 -0
- evalscope/third_party/longbench_write/default_task.json +27 -0
- evalscope/third_party/longbench_write/default_task.yaml +24 -0
- evalscope/third_party/toolbench_static/README.md +118 -0
- evalscope/third_party/toolbench_static/config_default.json +15 -0
- evalscope/third_party/toolbench_static/config_default.yaml +12 -0
- evalscope/third_party/toolbench_static/requirements.txt +2 -0
- evalscope/utils/logger.py +18 -20
- evalscope/utils/utils.py +41 -42
- evalscope/version.py +2 -2
- evalscope-0.7.0.dist-info/LICENSE +203 -0
- {evalscope-0.6.0rc0.dist-info → evalscope-0.7.0.dist-info}/METADATA +162 -103
- {evalscope-0.6.0rc0.dist-info → evalscope-0.7.0.dist-info}/RECORD +107 -32
- {evalscope-0.6.0rc0.dist-info → evalscope-0.7.0.dist-info}/WHEEL +1 -1
- {evalscope-0.6.0rc0.dist-info → evalscope-0.7.0.dist-info}/top_level.txt +1 -0
- tests/cli/__init__.py +1 -0
- tests/cli/test_run.py +76 -0
- tests/perf/__init__.py +1 -0
- tests/perf/test_perf.py +96 -0
- tests/rag/__init__.py +0 -0
- tests/rag/test_clip_benchmark.py +85 -0
- tests/rag/test_mteb.py +136 -0
- tests/rag/test_ragas.py +120 -0
- tests/swift/__init__.py +1 -0
- tests/swift/test_run_swift_eval.py +146 -0
- tests/swift/test_run_swift_vlm_eval.py +128 -0
- tests/swift/test_run_swift_vlm_jugde_eval.py +157 -0
- tests/test_run_all.py +12 -0
- tests/vlm/__init__.py +1 -0
- tests/vlm/test_vlmeval.py +59 -0
- evalscope/perf/_logging.py +0 -32
- evalscope/perf/datasets/longalpaca_12k.py +0 -20
- evalscope/perf/datasets/openqa.py +0 -22
- evalscope/perf/plugin_registry.py +0 -35
- evalscope/perf/query_parameters.py +0 -42
- evalscope/perf/server_sent_event.py +0 -43
- evalscope/preprocess/tokenizers/gpt2_tokenizer.py +0 -221
- /evalscope/{perf/datasets → backend/rag_eval/utils}/__init__.py +0 -0
- /evalscope/{preprocess/tokenizers → perf/utils}/__init__.py +0 -0
- {evalscope-0.6.0rc0.dist-info → evalscope-0.7.0.dist-info}/entry_points.txt +0 -0
- {evalscope/preprocess → tests}/__init__.py +0 -0
{evalscope-0.6.0rc0.dist-info → evalscope-0.7.0.dist-info}/METADATA

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: evalscope
-Version: 0.6.0rc0
+Version: 0.7.0
 Summary: EvalScope: Lightweight LLMs Evaluation Framework
 Home-page: https://github.com/modelscope/evalscope
 Author: ModelScope team
@@ -15,26 +15,28 @@ Classifier: Programming Language :: Python :: 3.9
 Classifier: Programming Language :: Python :: 3.10
 Requires-Python: >=3.8
 Description-Content-Type: text/markdown
-
+License-File: LICENSE
 Requires-Dist: absl-py
 Requires-Dist: accelerate
 Requires-Dist: cachetools
-Requires-Dist: datasets
+Requires-Dist: datasets<=3.0.1,>=3.0.0
 Requires-Dist: editdistance
+Requires-Dist: jieba
 Requires-Dist: jsonlines
 Requires-Dist: matplotlib
 Requires-Dist: modelscope[framework]
-Requires-Dist: nltk
+Requires-Dist: nltk>=3.9
 Requires-Dist: openai
 Requires-Dist: pandas
 Requires-Dist: plotly
-Requires-Dist: pyarrow
+Requires-Dist: pyarrow<=17.0.0
 Requires-Dist: pympler
 Requires-Dist: pyyaml
 Requires-Dist: regex
 Requires-Dist: requests
 Requires-Dist: requests-toolbelt
-Requires-Dist: rouge-
+Requires-Dist: rouge-chinese
+Requires-Dist: rouge-score>=0.1.0
 Requires-Dist: sacrebleu
 Requires-Dist: scikit-learn
 Requires-Dist: seaborn
@@ -42,83 +44,95 @@ Requires-Dist: sentencepiece
 Requires-Dist: simple-ddl-parser
 Requires-Dist: tabulate
 Requires-Dist: tiktoken
+Requires-Dist: torch
 Requires-Dist: tqdm
-Requires-Dist: transformers
+Requires-Dist: transformers>=4.33
 Requires-Dist: transformers-stream-generator
-Requires-Dist: jieba
-Requires-Dist: rouge-chinese
 Provides-Extra: all
-Requires-Dist:
-Requires-Dist:
-Requires-Dist:
-Requires-Dist:
-Requires-Dist:
-Requires-Dist:
-Requires-Dist: jsonlines
-Requires-Dist: matplotlib
-Requires-Dist: modelscope[framework]
-Requires-Dist: nltk
-Requires-Dist: openai
-Requires-Dist: pandas
-Requires-Dist: plotly
-Requires-Dist: pyarrow
-Requires-Dist: pympler
-Requires-Dist: pyyaml
-Requires-Dist: regex
-Requires-Dist: requests
-Requires-Dist: requests-toolbelt
-Requires-Dist: rouge-
-Requires-Dist:
-Requires-Dist:
-Requires-Dist:
-Requires-Dist:
-Requires-Dist:
-Requires-Dist:
-Requires-Dist:
-Requires-Dist:
-Requires-Dist:
-Requires-Dist:
-Requires-Dist:
-Requires-Dist:
-Requires-Dist: ms-opencompass
-Requires-Dist: ms-vlmeval
-Requires-Dist: mteb
-Requires-Dist: ragas
-Requires-Dist: webdataset
+Requires-Dist: absl-py; extra == "all"
+Requires-Dist: accelerate; extra == "all"
+Requires-Dist: cachetools; extra == "all"
+Requires-Dist: datasets<=3.0.1,>=3.0.0; extra == "all"
+Requires-Dist: editdistance; extra == "all"
+Requires-Dist: jieba; extra == "all"
+Requires-Dist: jsonlines; extra == "all"
+Requires-Dist: matplotlib; extra == "all"
+Requires-Dist: modelscope[framework]; extra == "all"
+Requires-Dist: nltk>=3.9; extra == "all"
+Requires-Dist: openai; extra == "all"
+Requires-Dist: pandas; extra == "all"
+Requires-Dist: plotly; extra == "all"
+Requires-Dist: pyarrow<=17.0.0; extra == "all"
+Requires-Dist: pympler; extra == "all"
+Requires-Dist: pyyaml; extra == "all"
+Requires-Dist: regex; extra == "all"
+Requires-Dist: requests; extra == "all"
+Requires-Dist: requests-toolbelt; extra == "all"
+Requires-Dist: rouge-chinese; extra == "all"
+Requires-Dist: rouge-score>=0.1.0; extra == "all"
+Requires-Dist: sacrebleu; extra == "all"
+Requires-Dist: scikit-learn; extra == "all"
+Requires-Dist: seaborn; extra == "all"
+Requires-Dist: sentencepiece; extra == "all"
+Requires-Dist: simple-ddl-parser; extra == "all"
+Requires-Dist: tabulate; extra == "all"
+Requires-Dist: tiktoken; extra == "all"
+Requires-Dist: torch; extra == "all"
+Requires-Dist: tqdm; extra == "all"
+Requires-Dist: transformers>=4.33; extra == "all"
+Requires-Dist: transformers-stream-generator; extra == "all"
+Requires-Dist: ms-opencompass>=0.1.3; extra == "all"
+Requires-Dist: ms-vlmeval>=0.0.9; extra == "all"
+Requires-Dist: mteb==1.19.4; extra == "all"
+Requires-Dist: ragas==0.2.5; extra == "all"
+Requires-Dist: webdataset>0.2.0; extra == "all"
+Requires-Dist: aiohttp; extra == "all"
+Requires-Dist: fastapi; extra == "all"
+Requires-Dist: numpy; extra == "all"
+Requires-Dist: sse-starlette; extra == "all"
+Requires-Dist: transformers; extra == "all"
+Requires-Dist: unicorn; extra == "all"
 Provides-Extra: inner
-Requires-Dist: absl-py
-Requires-Dist: accelerate
-Requires-Dist: alibaba-itag-sdk
-Requires-Dist: dashscope
-Requires-Dist: editdistance
-Requires-Dist: jsonlines
-Requires-Dist: nltk
-Requires-Dist: openai
-Requires-Dist: pandas
-Requires-Dist: plotly
-Requires-Dist: pyarrow
-Requires-Dist: pyodps
-Requires-Dist: pyyaml
-Requires-Dist: regex
-Requires-Dist: requests
-Requires-Dist: requests-toolbelt
-Requires-Dist: rouge-score
-Requires-Dist: sacrebleu
-Requires-Dist: scikit-learn
-Requires-Dist: seaborn
-Requires-Dist: simple-ddl-parser
-Requires-Dist: streamlit
-Requires-Dist: tqdm
-Requires-Dist: transformers
-Requires-Dist: transformers-stream-generator
+Requires-Dist: absl-py; extra == "inner"
+Requires-Dist: accelerate; extra == "inner"
+Requires-Dist: alibaba-itag-sdk; extra == "inner"
+Requires-Dist: dashscope; extra == "inner"
+Requires-Dist: editdistance; extra == "inner"
+Requires-Dist: jsonlines; extra == "inner"
+Requires-Dist: nltk; extra == "inner"
+Requires-Dist: openai; extra == "inner"
+Requires-Dist: pandas==1.5.3; extra == "inner"
+Requires-Dist: plotly; extra == "inner"
+Requires-Dist: pyarrow; extra == "inner"
+Requires-Dist: pyodps; extra == "inner"
+Requires-Dist: pyyaml; extra == "inner"
+Requires-Dist: regex; extra == "inner"
+Requires-Dist: requests==2.28.1; extra == "inner"
+Requires-Dist: requests-toolbelt==0.10.1; extra == "inner"
+Requires-Dist: rouge-score; extra == "inner"
+Requires-Dist: sacrebleu; extra == "inner"
+Requires-Dist: scikit-learn; extra == "inner"
+Requires-Dist: seaborn; extra == "inner"
+Requires-Dist: simple-ddl-parser; extra == "inner"
+Requires-Dist: streamlit; extra == "inner"
+Requires-Dist: tqdm; extra == "inner"
+Requires-Dist: transformers<4.43,>=4.33; extra == "inner"
+Requires-Dist: transformers-stream-generator; extra == "inner"
 Provides-Extra: opencompass
-Requires-Dist: ms-opencompass
+Requires-Dist: ms-opencompass>=0.1.3; extra == "opencompass"
+Provides-Extra: perf
+Requires-Dist: aiohttp; extra == "perf"
+Requires-Dist: fastapi; extra == "perf"
+Requires-Dist: numpy; extra == "perf"
+Requires-Dist: sse-starlette; extra == "perf"
+Requires-Dist: transformers; extra == "perf"
+Requires-Dist: unicorn; extra == "perf"
 Provides-Extra: rag
-Requires-Dist: mteb
-Requires-Dist: ragas
-Requires-Dist: webdataset
+Requires-Dist: mteb==1.19.4; extra == "rag"
+Requires-Dist: ragas==0.2.5; extra == "rag"
+Requires-Dist: webdataset>0.2.0; extra == "rag"
 Provides-Extra: vlmeval
-Requires-Dist: ms-vlmeval
+Requires-Dist: ms-vlmeval>=0.0.9; extra == "vlmeval"



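The dependency hunk above pins several core packages (`datasets`, `nltk`, `pyarrow`, `rouge-score`, `transformers`) and regroups the optional dependencies into extras (`all`, `inner`, `opencompass`, `perf`, `rag`, `vlmeval`) using PEP 508 environment markers. As a hedged illustration of how extras of this shape are typically declared with setuptools (this is not taken from evalscope's actual `setup.py`, and the base list is abbreviated):

```python
# Illustrative only: a setuptools declaration that would produce metadata like the
# hunk above. Package names and pins are copied from the diff; the grouping into
# variables is an assumption, not evalscope's real setup.py.
from setuptools import setup, find_packages

base = [  # abbreviated; the full install_requires list is longer
    "datasets<=3.0.1,>=3.0.0", "nltk>=3.9", "pyarrow<=17.0.0",
    "rouge-chinese", "rouge-score>=0.1.0", "torch", "transformers>=4.33",
]
backends = {
    "opencompass": ["ms-opencompass>=0.1.3"],
    "vlmeval": ["ms-vlmeval>=0.0.9"],
    "rag": ["mteb==1.19.4", "ragas==0.2.5", "webdataset>0.2.0"],
    "perf": ["aiohttp", "fastapi", "numpy", "sse-starlette", "transformers", "unicorn"],
}
# "all" bundles the base requirements plus every backend extra, which is what the
# repeated `extra == "all"` markers in the METADATA express.
backends["all"] = base + [dep for deps in backends.values() for dep in deps]

setup(
    name="evalscope",
    version="0.7.0",
    packages=find_packages(),
    install_requires=base,
    extras_require=backends,
)
```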
@@ -129,16 +143,18 @@ Requires-Dist: ms-vlmeval (>=0.0.5) ; extra == 'vlmeval'
 </p>

 <p align="center">
-<a href="https://badge.fury.io/py/evalscope"><img src="https://badge.fury.io/py/evalscope.svg" alt="PyPI version" height="18"></a>
-<a href="https://pypi.org/project/evalscope"><img alt="PyPI - Downloads" src="https://static.pepy.tech/badge/evalscope">
-</a>
-<a href=
-
-
-
-
+<a href="https://badge.fury.io/py/evalscope"><img src="https://badge.fury.io/py/evalscope.svg" alt="PyPI version" height="18"></a>
+<a href="https://pypi.org/project/evalscope"><img alt="PyPI - Downloads" src="https://static.pepy.tech/badge/evalscope">
+</a>
+<a href="https://github.com/modelscope/evalscope/pulls"><img src="https://img.shields.io/badge/PR-welcome-55EB99.svg"></a>
+<a href='https://evalscope.readthedocs.io/en/latest/?badge=latest'>
+    <img src='https://readthedocs.org/projects/evalscope-en/badge/?version=latest' alt='Documentation Status' />
+</a>
+<br>
+<a href="https://evalscope.readthedocs.io/en/latest/">📖 Documents</a>
 <p>

+> ⭐ If you like this project, please click the "Star" button at the top right to support us. Your support is our motivation to keep going!

 ## 📋 Table of Contents
 - [Introduction](#introduction)
@@ -164,7 +180,7 @@ EvalScope is the official model evaluation and performance benchmarking framework
 The architecture includes the following modules:
 1. **Model Adapter**: The model adapter is used to convert the outputs of specific models into the format required by the framework, supporting both API call models and locally run models.
 2. **Data Adapter**: The data adapter is responsible for converting and processing input data to meet various evaluation needs and formats.
-3. **Evaluation Backend**: 
+3. **Evaluation Backend**:
 - **Native**: EvalScope’s own **default evaluation framework**, supporting various evaluation modes, including single model evaluation, arena mode, baseline model comparison mode, etc.
 - **OpenCompass**: Supports [OpenCompass](https://github.com/open-compass/opencompass) as the evaluation backend, providing advanced encapsulation and task simplification, allowing you to submit tasks for evaluation more easily.
 - **VLMEvalKit**: Supports [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) as the evaluation backend, enabling easy initiation of multi-modal evaluation tasks, supporting various multi-modal models and datasets.
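For context on how the backend choice described in the hunk above surfaces in practice: the backend user guides route a task to OpenCompass, VLMEvalKit, or RAGEval through an `eval_backend` field in the task configuration. The sketch below assumes the `eval_backend`/`eval_config` keys and the `run_task` entry point documented there; the backend-specific `eval_config` schema is not part of this diff and is left as a placeholder.

```python
# Hedged sketch: routing an evaluation to a non-native backend. The eval_backend /
# eval_config keys follow the backend user guides; the eval_config body differs per
# backend and is intentionally left as a placeholder rather than guessed.
from evalscope.run import run_task

task_cfg = {
    "eval_backend": "OpenCompass",   # or "VLMEvalKit", "RAGEval", "Native"
    "eval_config": {
        # Backend-specific settings (models, datasets, work_dir, ...) go here;
        # see the corresponding backend section of the documentation.
    },
}

run_task(task_cfg=task_cfg)
```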
@@ -176,6 +192,7 @@ The architecture includes the following modules:


 ## 🎉 News
+- 🔥 **[2024.11.26]** The model inference service performance evaluator has been completely refactored: it now supports local inference service startup and Speed Benchmark; asynchronous call error handling has been optimized. For more details, refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/index.html).
 - 🔥 **[2024.10.31]** The best practice for evaluating Multimodal-RAG has been updated, please check the [📖 Blog](https://evalscope.readthedocs.io/zh-cn/latest/blog/RAG/multimodal_RAG.html#multimodal-rag) for more details.
 - 🔥 **[2024.10.23]** Supports multimodal RAG evaluation, including the assessment of image-text retrieval using [CLIP_Benchmark](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/clip_benchmark.html), and extends [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html) to support end-to-end multimodal metrics evaluation.
 - 🔥 **[2024.10.8]** Support for RAG evaluation, including independent evaluation of embedding models and rerankers using [MTEB/CMTEB](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/mteb.html), as well as end-to-end evaluation using [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html).
@@ -210,7 +227,9 @@ We recommend using conda to manage your environment and installing dependencies
 # Additional options
 pip install evalscope[opencompass] # Install OpenCompass backend
 pip install evalscope[vlmeval] # Install VLMEvalKit backend
-pip install evalscope[
+pip install evalscope[rag] # Install RAGEval backend
+pip install evalscope[perf] # Install Perf dependencies
+pip install evalscope[all] # Install all backends (Native, OpenCompass, VLMEvalKit, RAGEval)
 ```

 > [!WARNING]
@@ -236,7 +255,9 @@
 # Additional options
 pip install -e '.[opencompass]' # Install OpenCompass backend
 pip install -e '.[vlmeval]' # Install VLMEvalKit backend
-pip install -e '.[
+pip install -e '.[rag]' # Install RAGEval backend
+pip install -e '.[perf]' # Install Perf dependencies
+pip install -e '.[all]' # Install all backends (Native, OpenCompass, VLMEvalKit, RAGEval)
 ```


@@ -245,31 +266,47 @@ We recommend using conda to manage your environment and installing dependencies
 ### 1. Simple Evaluation
 To evaluate a model using default settings on specified datasets, follow the process below:

-####
-
+#### Installation using pip
+
+You can execute this in any directory:
 ```bash
 python -m evalscope.run \
- --model
+ --model Qwen/Qwen2.5-0.5B-Instruct \
  --template-type qwen \
- --datasets
+ --datasets gsm8k ceval \
+ --limit 10
 ```

-####
-
+#### Installation from source
+
+You need to execute this in the `evalscope` directory:
 ```bash
 python evalscope/run.py \
- --model
+ --model Qwen/Qwen2.5-0.5B-Instruct \
  --template-type qwen \
- --datasets
+ --datasets gsm8k ceval \
+ --limit 10
 ```

-If prompted with `Do you wish to run the custom code? [y/N]`, please type `y`.
+> If prompted with `Do you wish to run the custom code? [y/N]`, please type `y`.
+
+**Results (tested with only 10 samples)**
+```text
+Report table:
++-----------------------+--------------------+-----------------+
+| Model                 | ceval              | gsm8k           |
++=======================+====================+=================+
+| Qwen2.5-0.5B-Instruct | (ceval/acc) 0.5577 | (gsm8k/acc) 0.5 |
++-----------------------+--------------------+-----------------+
+```


 #### Basic Parameter Descriptions
 - `--model`: Specifies the `model_id` of the model on [ModelScope](https://modelscope.cn/), allowing automatic download. For example, see the [Qwen2-0.5B-Instruct model link](https://modelscope.cn/models/qwen/Qwen2-0.5B-Instruct/summary); you can also use a local path, such as `/path/to/model`.
 - `--template-type`: Specifies the template type corresponding to the model. Refer to the `Default Template` field in the [template table](https://swift.readthedocs.io/en/latest/Instruction/Supported-models-datasets.html#llm) for filling in this field.
 - `--datasets`: The dataset name, allowing multiple datasets to be specified, separated by spaces; these datasets will be automatically downloaded. Refer to the [supported datasets list](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset.html) for available options.
+- `--limit`: Maximum number of evaluation samples per dataset; if not specified, all will be evaluated, which is useful for quick validation.
+

 ### 2. Parameterized Evaluation
 If you wish to conduct a more customized evaluation, such as modifying model parameters or dataset parameters, you can use the following commands:
@@ -309,7 +346,7 @@ In addition to the three [basic parameters](#basic-parameter-descriptions), the
 - `--dataset-args`: Evaluation dataset configuration parameters, provided in JSON format, where the key is the dataset name and the value is the parameter; note that these must correspond one-to-one with the values in `--datasets`.
 - `--few_shot_num`: Number of few-shot examples.
 - `--few_shot_random`: Whether to randomly sample few-shot data; if not specified, defaults to `true`.
-
+

 ### 3. Use the run_task Function to Submit an Evaluation Task
 Using the `run_task` function to submit an evaluation task requires the same parameters as the command line. You need to pass a dictionary as the parameter, which includes the following fields:
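The hunk above ends at the sentence introducing `run_task`; the field list it refers to is unchanged in this release and therefore not shown in the diff. As a minimal illustration, the dictionary below mirrors the CLI flags from the Quick Start section (`model`, `template_type`, `datasets`, `limit`); these keys are assumed to correspond to those flags rather than quoted from the full schema.

```python
# Minimal sketch of submitting the same evaluation as the CLI example via run_task.
# The keys mirror the command-line flags shown earlier; consult the documented
# field list for the complete and authoritative schema.
from evalscope.run import run_task

task_cfg = {
    "model": "Qwen/Qwen2.5-0.5B-Instruct",   # ModelScope model_id or a local path
    "template_type": "qwen",                 # chat template, per the swift template table
    "datasets": ["gsm8k", "ceval"],          # downloaded automatically
    "limit": 10,                             # evaluate only the first 10 samples per dataset
}

run_task(task_cfg=task_cfg)
```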
@@ -354,24 +391,46 @@ EvalScope supports using third-party evaluation frameworks to initiate evaluation
 - **RAGEval**: Initiate RAG evaluation tasks through EvalScope, supporting independent evaluation of embedding models and rerankers using [MTEB/CMTEB](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/mteb.html), as well as end-to-end evaluation using [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html): [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/index.html)
 - **ThirdParty**: Third-party evaluation tasks, such as [ToolBench](https://evalscope.readthedocs.io/en/latest/third_party/toolbench.html) and [LongBench-Write](https://evalscope.readthedocs.io/en/latest/third_party/longwriter.html).

+
+## Model Serving Performance Evaluation
+A stress testing tool focused on large language models, which can be customized to support various dataset formats and different API protocol formats.
+
+Reference: Performance Testing [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/index.html)
+
+**Supports wandb for recording results**
+
+
+
+**Supports Speed Benchmark**
+
+It supports speed testing and provides speed benchmarks similar to those found in the [official Qwen](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html) reports:
+
+```text
+Speed Benchmark Results:
++---------------+-----------------+----------------+
+| Prompt Tokens | Speed(tokens/s) | GPU Memory(GB) |
++---------------+-----------------+----------------+
+|       1       |      50.69      |      0.97      |
+|     6144      |      51.36      |      1.23      |
+|     14336     |      49.93      |      1.59      |
+|     30720     |      49.56      |      2.34      |
++---------------+-----------------+----------------+
+```
+
 ## Custom Dataset Evaluation
 EvalScope supports custom dataset evaluation. For detailed information, please refer to the Custom Dataset Evaluation [📖User Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset.html)

 ## Offline Evaluation
-You can use local dataset to evaluate the model without internet connection. 
+You can use local dataset to evaluate the model without internet connection.

 Refer to: Offline Evaluation [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/offline_evaluation.html)


 ## Arena Mode
-The Arena mode allows multiple candidate models to be evaluated through pairwise battles, and can choose to use the AI Enhanced Auto-Reviewer (AAR) automatic evaluation process or manual evaluation to obtain the evaluation report. 
+The Arena mode allows multiple candidate models to be evaluated through pairwise battles, and can choose to use the AI Enhanced Auto-Reviewer (AAR) automatic evaluation process or manual evaluation to obtain the evaluation report.

 Refer to: Arena Mode [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html)

-## Model Serving Performance Evaluation
-A stress testing tool that focuses on large language models and can be customized to support various data set formats and different API protocol formats.
-
-Refer to : Model Serving Performance Evaluation [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test.html)



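The stress-test section added above notes that the tool "can be customized to support various dataset formats and different API protocol formats", and the file list shows the new plugin layout behind this (`evalscope/perf/plugin/registry.py`, `plugin/datasets/base.py`, `plugin/datasets/custom.py`, plus the `plugin/api/*` modules). The sketch below is only a rough illustration of what a custom dataset plugin might look like under that layout; the class, decorator, and hook names are assumptions for illustration, not the confirmed 0.7.0 API, so check `plugin/datasets/custom.py` and the stress-test user guide for the real interface.

```python
# Hypothetical sketch of a custom dataset plugin for the perf stress-test tool.
# Import paths mirror the renamed modules in the file list above; the names
# DatasetPluginBase, register_dataset, and build_messages are assumed, not verified.
import json

from evalscope.perf.plugin.datasets.base import DatasetPluginBase  # assumed class name
from evalscope.perf.plugin.registry import register_dataset        # assumed decorator


@register_dataset("my_jsonl")  # hypothetical registry key
class MyJsonlDatasetPlugin(DatasetPluginBase):
    """Yield one chat-format request per line of a local JSONL file."""

    def build_messages(self):  # assumed hook; the real base class may use another name
        with open("questions.jsonl", encoding="utf-8") as f:  # path hard-coded for brevity
            for line in f:
                prompt = json.loads(line)["question"]
                yield [{"role": "user", "content": prompt}]
```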