PyPI - evalscope - Versions diffs - 0.13.2__tar.gz → 0.15.0__tar.gz - Mend

evalscope 0.13.2 (tar.gz) → 0.15.0 (tar.gz)


Potentially problematic release.

This version of evalscope might be problematic.

Files changed (483)

{evalscope-0.13.2/evalscope.egg-info → evalscope-0.15.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: evalscope
-Version: 0.13.2
+Version: 0.15.0
 Summary: EvalScope: Lightweight LLMs Evaluation Framework
 Home-page: https://github.com/modelscope/evalscope
 Author: ModelScope team
@@ -26,8 +26,10 @@ Requires-Dist: latex2sympy2
 Requires-Dist: matplotlib
 Requires-Dist: modelscope[framework]
 Requires-Dist: nltk>=3.9
+Requires-Dist: omegaconf
 Requires-Dist: openai
 Requires-Dist: pandas
+Requires-Dist: pillow
 Requires-Dist: pyarrow
 Requires-Dist: pyyaml
 Requires-Dist: requests
@@ -39,6 +41,7 @@ Requires-Dist: seaborn
 Requires-Dist: sympy
 Requires-Dist: tabulate
 Requires-Dist: torch
+Requires-Dist: torchvision
 Requires-Dist: tqdm
 Requires-Dist: transformers>=4.33
 Requires-Dist: word2number
@@ -47,12 +50,12 @@ Requires-Dist: ms-opencompass>=0.1.4; extra == "opencompass"
 Provides-Extra: vlmeval
 Requires-Dist: ms-vlmeval>=0.0.9; extra == "vlmeval"
 Provides-Extra: rag
-Requires-Dist: langchain<0.3.0; extra == "rag"
-Requires-Dist: langchain-community<0.3.0; extra == "rag"
-Requires-Dist: langchain-core<0.3.0; extra == "rag"
-Requires-Dist: langchain-openai<0.3.0; extra == "rag"
+Requires-Dist: langchain<0.4.0,>=0.3.0; extra == "rag"
+Requires-Dist: langchain-community<0.4.0,>=0.3.0; extra == "rag"
+Requires-Dist: langchain-core<0.4.0,>=0.3.0; extra == "rag"
+Requires-Dist: langchain-openai<0.4.0,>=0.3.0; extra == "rag"
 Requires-Dist: mteb==1.19.4; extra == "rag"
-Requires-Dist: ragas==0.2.9; extra == "rag"
+Requires-Dist: ragas==0.2.14; extra == "rag"
 Requires-Dist: webdataset>0.2.0; extra == "rag"
 Provides-Extra: perf
 Requires-Dist: aiohttp; extra == "perf"
@@ -64,6 +67,11 @@ Requires-Dist: unicorn; extra == "perf"
 Provides-Extra: app
 Requires-Dist: gradio==5.4.0; extra == "app"
 Requires-Dist: plotly<6.0.0,>=5.23.0; extra == "app"
+Provides-Extra: aigc
+Requires-Dist: diffusers; extra == "aigc"
+Requires-Dist: iopath; extra == "aigc"
+Requires-Dist: open_clip_torch; extra == "aigc"
+Requires-Dist: opencv-python; extra == "aigc"
 Provides-Extra: all
 Requires-Dist: accelerate; extra == "all"
 Requires-Dist: datasets<=3.2.0,>=3.0.0; extra == "all"
@@ -75,8 +83,10 @@ Requires-Dist: latex2sympy2; extra == "all"
 Requires-Dist: matplotlib; extra == "all"
 Requires-Dist: modelscope[framework]; extra == "all"
 Requires-Dist: nltk>=3.9; extra == "all"
+Requires-Dist: omegaconf; extra == "all"
 Requires-Dist: openai; extra == "all"
 Requires-Dist: pandas; extra == "all"
+Requires-Dist: pillow; extra == "all"
 Requires-Dist: pyarrow; extra == "all"
 Requires-Dist: pyyaml; extra == "all"
 Requires-Dist: requests; extra == "all"
@@ -88,17 +98,18 @@ Requires-Dist: seaborn; extra == "all"
 Requires-Dist: sympy; extra == "all"
 Requires-Dist: tabulate; extra == "all"
 Requires-Dist: torch; extra == "all"
+Requires-Dist: torchvision; extra == "all"
 Requires-Dist: tqdm; extra == "all"
 Requires-Dist: transformers>=4.33; extra == "all"
 Requires-Dist: word2number; extra == "all"
 Requires-Dist: ms-opencompass>=0.1.4; extra == "all"
 Requires-Dist: ms-vlmeval>=0.0.9; extra == "all"
-Requires-Dist: langchain<0.3.0; extra == "all"
-Requires-Dist: langchain-community<0.3.0; extra == "all"
-Requires-Dist: langchain-core<0.3.0; extra == "all"
-Requires-Dist: langchain-openai<0.3.0; extra == "all"
+Requires-Dist: langchain<0.4.0,>=0.3.0; extra == "all"
+Requires-Dist: langchain-community<0.4.0,>=0.3.0; extra == "all"
+Requires-Dist: langchain-core<0.4.0,>=0.3.0; extra == "all"
+Requires-Dist: langchain-openai<0.4.0,>=0.3.0; extra == "all"
 Requires-Dist: mteb==1.19.4; extra == "all"
-Requires-Dist: ragas==0.2.9; extra == "all"
+Requires-Dist: ragas==0.2.14; extra == "all"
 Requires-Dist: webdataset>0.2.0; extra == "all"
 Requires-Dist: aiohttp; extra == "all"
 Requires-Dist: fastapi; extra == "all"
@@ -108,6 +119,10 @@ Requires-Dist: transformers; extra == "all"
 Requires-Dist: unicorn; extra == "all"
 Requires-Dist: gradio==5.4.0; extra == "all"
 Requires-Dist: plotly<6.0.0,>=5.23.0; extra == "all"
+Requires-Dist: diffusers; extra == "all"
+Requires-Dist: iopath; extra == "all"
+Requires-Dist: open_clip_torch; extra == "all"
+Requires-Dist: opencv-python; extra == "all"
 <p align="center">
     <br>
@@ -121,7 +136,7 @@ Requires-Dist: plotly<6.0.0,>=5.23.0; extra == "all"
 </p>
 <p align="center">
-<img src="https://img.shields.io/badge/python-%E2%89%A53.8-5be.svg">
+<img src="https://img.shields.io/badge/python-%E2%89%A53.9-5be.svg">
 <a href="https://badge.fury.io/py/evalscope"><img src="https://badge.fury.io/py/evalscope.svg" alt="PyPI version" height="18"></a>
 <a href="https://pypi.org/project/evalscope"><img alt="PyPI - Downloads" src="https://static.pepy.tech/badge/evalscope"></a>
 <a href="https://github.com/modelscope/evalscope/pulls"><img src="https://img.shields.io/badge/PR-welcome-55EB99.svg"></a>
@@ -199,6 +214,10 @@ Please scan the QR code below to join our community groups:
 ## 🎉 News
+- 🔥 **[2025.04.29]** Added Qwen3 Evaluation Best Practices, [welcome to read 📖](https://evalscope.readthedocs.io/en/latest/best_practice/qwen3.html)
+- 🔥 **[2025.04.27]** Support for text-to-image evaluation: Supports 8 metrics including MPS, HPSv2.1Score, etc., and evaluation benchmarks such as EvalMuse, GenAI-Bench. Refer to the [user documentation](https://evalscope.readthedocs.io/en/latest/user_guides/aigc/t2i.html) for more details.
+- 🔥 **[2025.04.10]** Model service stress testing tool now supports the `/v1/completions` endpoint (the default endpoint for vLLM benchmarking)
+- 🔥 **[2025.04.08]** Support for evaluating embedding model services compatible with the OpenAI API has been added. For more details, check the [user guide](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/mteb.html#configure-evaluation-parameters).
 - 🔥 **[2025.03.27]** Added support for [AlpacaEval](https://www.modelscope.cn/datasets/AI-ModelScope/alpaca_eval/dataPeview) and [ArenaHard](https://modelscope.cn/datasets/AI-ModelScope/arena-hard-auto-v0.1/summary) evaluation benchmarks. For usage notes, please refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset.html)
 - 🔥 **[2025.03.20]** The model inference service stress testing now supports generating prompts of specified length using random values. Refer to the [user guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/examples.html#using-the-random-dataset) for more details.
 - 🔥 **[2025.03.13]** Added support for the [LiveCodeBench](https://www.modelscope.cn/datasets/AI-ModelScope/code_generation_lite/summary) code evaluation benchmark, which can be used by specifying `live_code_bench`. Supports evaluating QwQ-32B on LiveCodeBench, refer to the [best practices](https://evalscope.readthedocs.io/en/latest/best_practice/eval_qwq.html).
@@ -212,15 +231,14 @@ Please scan the QR code below to join our community groups:
 - 🔥 **[2025.02.13]** Added support for evaluating DeepSeek distilled models, including AIME24, MATH-500, and GPQA-Diamond datasets，refer to [best practice](https://evalscope.readthedocs.io/en/latest/best_practice/deepseek_r1_distill.html); Added support for specifying the `eval_batch_size` parameter to accelerate model evaluation.
 - 🔥 **[2025.01.20]** Support for visualizing evaluation results, including single model evaluation results and multi-model comparison, refer to the [📖 Visualizing Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html) for more details; Added [`iquiz`](https://modelscope.cn/datasets/AI-ModelScope/IQuiz/summary) evaluation example, evaluating the IQ and EQ of the model.
 - 🔥 **[2025.01.07]** Native backend: Support for model API evaluation is now available. Refer to the [📖 Model API Evaluation Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#api) for more details. Additionally, support for the `ifeval` evaluation benchmark has been added.
+<details><summary>More</summary>
 - 🔥🔥 **[2024.12.31]** Support for adding benchmark evaluations, refer to the [📖 Benchmark Evaluation Addition Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/add_benchmark.html); support for custom mixed dataset evaluations, allowing for more comprehensive model evaluations with less data, refer to the [📖 Mixed Dataset Evaluation Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/collection/index.html).
 - 🔥 **[2024.12.13]** Model evaluation optimization: no need to pass the `--template-type` parameter anymore; supports starting evaluation with `evalscope eval --args`. Refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html) for more details.
 - 🔥 **[2024.11.26]** The model inference service performance evaluator has been completely refactored: it now supports local inference service startup and Speed Benchmark; asynchronous call error handling has been optimized. For more details, refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/index.html).
 - 🔥 **[2024.10.31]** The best practice for evaluating Multimodal-RAG has been updated, please check the [📖 Blog](https://evalscope.readthedocs.io/zh-cn/latest/blog/RAG/multimodal_RAG.html#multimodal-rag) for more details.
 - 🔥 **[2024.10.23]** Supports multimodal RAG evaluation, including the assessment of image-text retrieval using [CLIP_Benchmark](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/clip_benchmark.html), and extends [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html) to support end-to-end multimodal metrics evaluation.
 - 🔥 **[2024.10.8]** Support for RAG evaluation, including independent evaluation of embedding models and rerankers using [MTEB/CMTEB](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/mteb.html), as well as end-to-end evaluation using [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html).
-<details><summary>More</summary>
 - 🔥 **[2024.09.18]** Our documentation has been updated to include a blog module, featuring some technical research and discussions related to evaluations. We invite you to [📖 read it](https://evalscope.readthedocs.io/en/refact_readme/blog/index.html).
 - 🔥 **[2024.09.12]** Support for LongWriter evaluation, which supports 10,000+ word generation. You can use the benchmark [LongBench-Write](evalscope/third_party/longbench_write/README.md) to measure the long output quality as well as the output length.
 - 🔥 **[2024.08.30]** Support for custom dataset evaluations, including text datasets and multimodal image-text datasets.
@@ -503,6 +521,10 @@ Reference: Performance Testing [📖 User Guide](https://evalscope.readthedocs.i
 ![wandb sample](https://modelscope.oss-cn-beijing.aliyuncs.com/resource/wandb_sample.png)
+**Supports swanlab for recording results**
+![swanlab sample](https://sail-moe.oss-cn-hangzhou.aliyuncs.com/yunlin/images/evalscope/swanlab.png)
 **Supports Speed Benchmark**
 It supports speed testing and provides speed benchmarks similar to those found in the [official Qwen](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html) reports:

{evalscope-0.13.2 → evalscope-0.15.0}/README.md RENAMED

@@ -10,7 +10,7 @@
 </p>
 <p align="center">
-<img src="https://img.shields.io/badge/python-%E2%89%A53.8-5be.svg">
+<img src="https://img.shields.io/badge/python-%E2%89%A53.9-5be.svg">
 <a href="https://badge.fury.io/py/evalscope"><img src="https://badge.fury.io/py/evalscope.svg" alt="PyPI version" height="18"></a>
 <a href="https://pypi.org/project/evalscope"><img alt="PyPI - Downloads" src="https://static.pepy.tech/badge/evalscope"></a>
 <a href="https://github.com/modelscope/evalscope/pulls"><img src="https://img.shields.io/badge/PR-welcome-55EB99.svg"></a>
@@ -88,6 +88,10 @@ Please scan the QR code below to join our community groups:
 ## 🎉 News
+- 🔥 **[2025.04.29]** Added Qwen3 Evaluation Best Practices, [welcome to read 📖](https://evalscope.readthedocs.io/en/latest/best_practice/qwen3.html)
+- 🔥 **[2025.04.27]** Support for text-to-image evaluation: Supports 8 metrics including MPS, HPSv2.1Score, etc., and evaluation benchmarks such as EvalMuse, GenAI-Bench. Refer to the [user documentation](https://evalscope.readthedocs.io/en/latest/user_guides/aigc/t2i.html) for more details.
+- 🔥 **[2025.04.10]** Model service stress testing tool now supports the `/v1/completions` endpoint (the default endpoint for vLLM benchmarking)
+- 🔥 **[2025.04.08]** Support for evaluating embedding model services compatible with the OpenAI API has been added. For more details, check the [user guide](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/mteb.html#configure-evaluation-parameters).
 - 🔥 **[2025.03.27]** Added support for [AlpacaEval](https://www.modelscope.cn/datasets/AI-ModelScope/alpaca_eval/dataPeview) and [ArenaHard](https://modelscope.cn/datasets/AI-ModelScope/arena-hard-auto-v0.1/summary) evaluation benchmarks. For usage notes, please refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset.html)
 - 🔥 **[2025.03.20]** The model inference service stress testing now supports generating prompts of specified length using random values. Refer to the [user guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/examples.html#using-the-random-dataset) for more details.
 - 🔥 **[2025.03.13]** Added support for the [LiveCodeBench](https://www.modelscope.cn/datasets/AI-ModelScope/code_generation_lite/summary) code evaluation benchmark, which can be used by specifying `live_code_bench`. Supports evaluating QwQ-32B on LiveCodeBench, refer to the [best practices](https://evalscope.readthedocs.io/en/latest/best_practice/eval_qwq.html).
@@ -101,15 +105,14 @@ Please scan the QR code below to join our community groups:
 - 🔥 **[2025.02.13]** Added support for evaluating DeepSeek distilled models, including AIME24, MATH-500, and GPQA-Diamond datasets，refer to [best practice](https://evalscope.readthedocs.io/en/latest/best_practice/deepseek_r1_distill.html); Added support for specifying the `eval_batch_size` parameter to accelerate model evaluation.
 - 🔥 **[2025.01.20]** Support for visualizing evaluation results, including single model evaluation results and multi-model comparison, refer to the [📖 Visualizing Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html) for more details; Added [`iquiz`](https://modelscope.cn/datasets/AI-ModelScope/IQuiz/summary) evaluation example, evaluating the IQ and EQ of the model.
 - 🔥 **[2025.01.07]** Native backend: Support for model API evaluation is now available. Refer to the [📖 Model API Evaluation Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#api) for more details. Additionally, support for the `ifeval` evaluation benchmark has been added.
+<details><summary>More</summary>
 - 🔥🔥 **[2024.12.31]** Support for adding benchmark evaluations, refer to the [📖 Benchmark Evaluation Addition Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/add_benchmark.html); support for custom mixed dataset evaluations, allowing for more comprehensive model evaluations with less data, refer to the [📖 Mixed Dataset Evaluation Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/collection/index.html).
 - 🔥 **[2024.12.13]** Model evaluation optimization: no need to pass the `--template-type` parameter anymore; supports starting evaluation with `evalscope eval --args`. Refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html) for more details.
 - 🔥 **[2024.11.26]** The model inference service performance evaluator has been completely refactored: it now supports local inference service startup and Speed Benchmark; asynchronous call error handling has been optimized. For more details, refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/index.html).
 - 🔥 **[2024.10.31]** The best practice for evaluating Multimodal-RAG has been updated, please check the [📖 Blog](https://evalscope.readthedocs.io/zh-cn/latest/blog/RAG/multimodal_RAG.html#multimodal-rag) for more details.
 - 🔥 **[2024.10.23]** Supports multimodal RAG evaluation, including the assessment of image-text retrieval using [CLIP_Benchmark](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/clip_benchmark.html), and extends [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html) to support end-to-end multimodal metrics evaluation.
 - 🔥 **[2024.10.8]** Support for RAG evaluation, including independent evaluation of embedding models and rerankers using [MTEB/CMTEB](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/mteb.html), as well as end-to-end evaluation using [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html).
-<details><summary>More</summary>
 - 🔥 **[2024.09.18]** Our documentation has been updated to include a blog module, featuring some technical research and discussions related to evaluations. We invite you to [📖 read it](https://evalscope.readthedocs.io/en/refact_readme/blog/index.html).
 - 🔥 **[2024.09.12]** Support for LongWriter evaluation, which supports 10,000+ word generation. You can use the benchmark [LongBench-Write](evalscope/third_party/longbench_write/README.md) to measure the long output quality as well as the output length.
 - 🔥 **[2024.08.30]** Support for custom dataset evaluations, including text datasets and multimodal image-text datasets.
@@ -392,6 +395,10 @@ Reference: Performance Testing [📖 User Guide](https://evalscope.readthedocs.i
 ![wandb sample](https://modelscope.oss-cn-beijing.aliyuncs.com/resource/wandb_sample.png)
+**Supports swanlab for recording results**
+![swanlab sample](https://sail-moe.oss-cn-hangzhou.aliyuncs.com/yunlin/images/evalscope/swanlab.png)
 **Supports Speed Benchmark**
 It supports speed testing and provides speed benchmarks similar to those found in the [official Qwen](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html) reports:

{evalscope-0.13.2 → evalscope-0.15.0}/evalscope/arguments.py RENAMED

@@ -1,7 +1,7 @@
 import argparse
 import json
-from evalscope.constants import EvalBackend, EvalStage, EvalType, JudgeStrategy, OutputType
+from evalscope.constants import EvalBackend, EvalStage, EvalType, JudgeStrategy, ModelTask, OutputType
 class ParseStrArgsAction(argparse.Action):
@@ -35,6 +35,7 @@ def add_argument(parser: argparse.ArgumentParser):
     parser.add_argument('--model', type=str, required=False, help='The model id on modelscope, or local model dir.')
     parser.add_argument('--model-id', type=str, required=False, help='The model id for model name in report.')
     parser.add_argument('--model-args', type=str, action=ParseStrArgsAction, help='The model args, should be a string.')
+    parser.add_argument('--model-task', type=str, default=ModelTask.TEXT_GENERATION, choices=[ModelTask.TEXT_GENERATION, ModelTask.IMAGE_GENERATION], help='The model task for model id.')  # noqa: E501
     # Template-related arguments
     parser.add_argument('--template-type', type=str, required=False, help='Deprecated, will be removed in v1.0.0.')
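
The new `--model-task` flag restricts its value with `choices`, so an unsupported task is rejected at parse time. A minimal sketch of that behaviour, assuming a stand-in `ModelTask` class whose string values are guesses (the real constants live in `evalscope.constants` and are not shown in this diff):

```python
import argparse


class ModelTask:
    # Stand-in for evalscope.constants.ModelTask; these string values are assumed.
    TEXT_GENERATION = 'text_generation'
    IMAGE_GENERATION = 'image_generation'


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with only the --model-task argument from the diff."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--model-task',
        type=str,
        default=ModelTask.TEXT_GENERATION,
        choices=[ModelTask.TEXT_GENERATION, ModelTask.IMAGE_GENERATION],
        help='The model task for model id.',
    )
    return parser


parser = build_parser()
# Omitting the flag falls back to text generation.
default_args = parser.parse_args([])
# Passing the flag selects the image generation task.
image_args = parser.parse_args(['--model-task', ModelTask.IMAGE_GENERATION])
```

Because the default is text generation, existing command lines that never pass the flag should keep their previous behaviour.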

{evalscope-0.13.2 → evalscope-0.15.0}/evalscope/backend/rag_eval/__init__.py RENAMED

@@ -1,4 +1,4 @@
-from evalscope.backend.rag_eval.backend_manager import RAGEvalBackendManager
+from evalscope.backend.rag_eval.backend_manager import RAGEvalBackendManager, Tools
 from evalscope.backend.rag_eval.utils.clip import VisionModel
 from evalscope.backend.rag_eval.utils.embedding import EmbeddingModel
 from evalscope.backend.rag_eval.utils.llm import LLM, ChatOpenAI, LocalLLM

{evalscope-0.13.2 → evalscope-0.15.0}/evalscope/backend/rag_eval/backend_manager.py RENAMED

@@ -8,6 +8,12 @@ from evalscope.utils.logger import get_logger
 logger = get_logger()
+class Tools:
+    MTEB = 'mteb'
+    RAGAS = 'ragas'
+    CLIP_BENCHMARK = 'clip_benchmark'
 class RAGEvalBackendManager(BackendManager):
     def __init__(self, config: Union[str, dict], **kwargs):
@@ -47,9 +53,19 @@ class RAGEvalBackendManager(BackendManager):
         from evalscope.backend.rag_eval.ragas.tasks import generate_testset
         if testset_args is not None:
-            generate_testset(TestsetGenerationArguments(**testset_args))
+            if isinstance(testset_args, dict):
+                generate_testset(TestsetGenerationArguments(**testset_args))
+            elif isinstance(testset_args, TestsetGenerationArguments):
+                generate_testset(testset_args)
+            else:
+                raise ValueError('Please provide the testset generation arguments.')
         if eval_args is not None:
-            rag_eval(EvaluationArguments(**eval_args))
+            if isinstance(eval_args, dict):
+                rag_eval(EvaluationArguments(**eval_args))
+            elif isinstance(eval_args, EvaluationArguments):
+                rag_eval(eval_args)
+            else:
+                raise ValueError('Please provide the evaluation arguments.')
     @staticmethod
     def run_clip_benchmark(args):
@@ -59,17 +75,17 @@ class RAGEvalBackendManager(BackendManager):
     def run(self, *args, **kwargs):
         tool = self.config_d.pop('tool')
-        if tool.lower() == 'mteb':
+        if tool.lower() == Tools.MTEB:
             self._check_env('mteb')
             model_args = self.config_d['model']
             eval_args = self.config_d['eval']
             self.run_mteb(model_args, eval_args)
-        elif tool.lower() == 'ragas':
+        elif tool.lower() == Tools.RAGAS:
             self._check_env('ragas')
             testset_args = self.config_d.get('testset_generation', None)
             eval_args = self.config_d.get('eval', None)
             self.run_ragas(testset_args, eval_args)
-        elif tool.lower() == 'clip_benchmark':
+        elif tool.lower() == Tools.CLIP_BENCHMARK:
             self._check_env('webdataset')
             self.run_clip_benchmark(self.config_d['eval'])
         else:
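
The `run_ragas` change accepts either a plain dict or an already-built arguments object and raises `ValueError` for anything else. The same pattern can be sketched in isolation; `coerce_args` and the fields of the stand-in `EvaluationArguments` dataclass below are hypothetical and only illustrate the dispatch, not the real classes:

```python
from dataclasses import dataclass
from typing import Any, Type, TypeVar

T = TypeVar('T')


@dataclass
class EvaluationArguments:
    # Hypothetical stand-in; the real dataclass has different fields.
    testset_file: str = ''
    language: str = 'english'


def coerce_args(raw: Any, cls: Type[T]) -> T:
    """Return an instance of cls from either a dict or an existing instance."""
    if isinstance(raw, dict):
        return cls(**raw)
    if isinstance(raw, cls):
        return raw
    raise ValueError(f'Expected a dict or {cls.__name__}, got {type(raw).__name__}.')


from_dict = coerce_args({'testset_file': 'qa.json'}, EvaluationArguments)
existing = EvaluationArguments(testset_file='qa.json')
from_instance = coerce_args(existing, EvaluationArguments)
```

Centralising the check in one helper is a design choice of this sketch; the diff repeats the `isinstance` branches inline for each argument type.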

{evalscope-0.13.2 → evalscope-0.15.0}/evalscope/backend/rag_eval/cmteb/arguments.py RENAMED

@@ -20,6 +20,12 @@ class ModelArguments:
     encode_kwargs: dict = field(default_factory=lambda: {'show_progress_bar': True, 'batch_size': 32})
     hub: str = 'modelscope'  # modelscope or huggingface
+    # for API embedding model
+    model_name: Optional[str] = None
+    api_base: Optional[str] = None
+    api_key: Optional[str] = None
+    dimensions: Optional[int] = None
     def to_dict(self) -> Dict[str, Any]:
         return {
             'model_name_or_path': self.model_name_or_path,
@@ -31,6 +37,10 @@ class ModelArguments:
             'config_kwargs': self.config_kwargs,
             'encode_kwargs': self.encode_kwargs,
             'hub': self.hub,
+            'model_name': self.model_name,
+            'api_base': self.api_base,
+            'api_key': self.api_key,
+            'dimensions': self.dimensions,
         }

{evalscope-0.13.2 → evalscope-0.15.0}/evalscope/backend/rag_eval/ragas/arguments.py RENAMED

@@ -21,7 +21,6 @@ class TestsetGenerationArguments:
     """
     generator_llm: Dict = field(default_factory=dict)
     embeddings: Dict = field(default_factory=dict)
-    distribution: str = field(default_factory=lambda: {'simple': 0.5, 'multi_context': 0.4, 'reasoning': 0.1})
     # For LLM based evaluation
     # available: ['english', 'hindi', 'marathi', 'chinese', 'spanish', 'amharic', 'arabic',
     # 'armenian', 'bulgarian', 'urdu', 'russian', 'polish', 'persian', 'dutch', 'danish',

{evalscope-0.13.2 → evalscope-0.15.0}/evalscope/backend/rag_eval/ragas/tasks/testset_generation.py RENAMED

@@ -67,9 +67,14 @@ def get_persona(llm, kg, language):
 def load_data(file_path):
-    from langchain_community.document_loaders import UnstructuredFileLoader
+    import nltk
+    from langchain_unstructured import UnstructuredLoader
-    loader = UnstructuredFileLoader(file_path, mode='single')
+    if nltk.data.find('taggers/averaged_perceptron_tagger_eng') is False:
+        # need to download nltk data for the first time
+        nltk.download('averaged_perceptron_tagger_eng')
+    loader = UnstructuredLoader(file_path)
     data = loader.load()
     return data
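
Review note on the added `nltk` check: `nltk.data.find` raises `LookupError` when a resource is missing rather than returning `False`, so the `is False` comparison would likely never trigger the download. A minimal sketch of an exception-based check follows; `ensure_nltk_resource` is a hypothetical helper, and the `find` and `download` parameters exist only so the logic can be exercised without `nltk` installed:

```python
from typing import Callable, Optional


def ensure_nltk_resource(
    resource_path: str,
    package: str,
    find: Optional[Callable[[str], object]] = None,
    download: Optional[Callable[[str], object]] = None,
) -> bool:
    """Download `package` if `resource_path` is missing.

    Returns True when a download was triggered, False otherwise.
    """
    if find is None or download is None:
        import nltk  # imported lazily so the helper can be tested with fakes
        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource_path)
        return False
    except LookupError:
        # nltk.data.find signals a missing resource by raising LookupError.
        download(package)
        return True
```

With `nltk` installed, the call matching the diff would be `ensure_nltk_resource('taggers/averaged_perceptron_tagger_eng', 'averaged_perceptron_tagger_eng')`.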

{evalscope-0.13.2 → evalscope-0.15.0}/evalscope/backend/rag_eval/ragas/tasks/translate_prompt.py RENAMED

@@ -2,7 +2,6 @@ import asyncio
 import os
 from ragas.llms import BaseRagasLLM
 from ragas.prompt import PromptMixin, PydanticPrompt
-from ragas.utils import RAGAS_SUPPORTED_LANGUAGE_CODES
 from typing import List
 from evalscope.utils.logger import get_logger
@@ -16,10 +15,6 @@ async def translate_prompt(
     llm: BaseRagasLLM,
     adapt_instruction: bool = False,
 ):
-    if target_lang not in RAGAS_SUPPORTED_LANGUAGE_CODES:
-        logger.warning(f'{target_lang} is not in supported language: {list(RAGAS_SUPPORTED_LANGUAGE_CODES)}')
-        return
     if not issubclass(type(prompt_user), PromptMixin):
         logger.info(f"{prompt_user} is not a PromptMixin, don't translate it")
         return

{evalscope-0.13.2 → evalscope-0.15.0}/evalscope/backend/rag_eval/utils/embedding.py RENAMED

@@ -1,10 +1,12 @@
 import os
 import torch
 from langchain_core.embeddings import Embeddings
+from langchain_openai.embeddings import OpenAIEmbeddings
 from sentence_transformers import models
 from sentence_transformers.cross_encoder import CrossEncoder
 from sentence_transformers.SentenceTransformer import SentenceTransformer
 from torch import Tensor
+from tqdm import tqdm
 from typing import Dict, List, Optional, Union
 from evalscope.backend.rag_eval.utils.tools import download_model
@@ -18,10 +20,10 @@ class BaseModel(Embeddings):
     def __init__(
         self,
-        model_name_or_path: str,
+        model_name_or_path: str = '',
         max_seq_length: int = 512,
         prompt: str = '',
-        revision: Optional[str] = None,
+        revision: Optional[str] = 'master',
         **kwargs,
     ):
         self.model_name_or_path = model_name_or_path
@@ -139,7 +141,7 @@ class CrossEncoderModel(BaseModel):
             max_length=self.max_seq_length,
         )
-    def predict(self, sentences: List[List[str]], **kwargs) -> List[List[float]]:
+    def predict(self, sentences: List[List[str]], **kwargs) -> Tensor:
         self.encode_kwargs.update(kwargs)
         if len(sentences[0]) == 3:  # Note: For mteb retrieval task
@@ -154,6 +156,46 @@ class CrossEncoderModel(BaseModel):
         return embeddings
+class APIEmbeddingModel(BaseModel):
+    def __init__(self, **kwargs):
+        self.model_name = kwargs.get('model_name')
+        self.openai_api_base = kwargs.get('api_base')
+        self.openai_api_key = kwargs.get('api_key')
+        self.dimensions = kwargs.get('dimensions')
+        self.model = OpenAIEmbeddings(
+            model=self.model_name,
+            openai_api_base=self.openai_api_base,
+            openai_api_key=self.openai_api_key,
+            dimensions=self.dimensions,
+            check_embedding_ctx_length=False)
+        super().__init__(model_name_or_path=self.model_name, **kwargs)
+        self.batch_size = self.encode_kwargs.get('batch_size', 10)
+    def encode(self, texts: Union[str, List[str]], **kwargs) -> Tensor:
+        if isinstance(texts, str):
+            texts = [texts]
+        embeddings: List[List[float]] = []
+        for i in tqdm(range(0, len(texts), self.batch_size)):
+            response = self.model.embed_documents(texts[i:i + self.batch_size], chunk_size=self.batch_size)
+            embeddings.extend(response)
+        return torch.tensor(embeddings)
+    def encode_queries(self, queries, **kwargs):
+        return self.encode(queries, **kwargs)
+    def encode_corpus(self, corpus, **kwargs):
+        if isinstance(corpus[0], dict):
+            input_texts = ['{} {}'.format(doc.get('title', ''), doc['text']).strip() for doc in corpus]
+        else:
+            input_texts = corpus
+        return self.encode(input_texts, **kwargs)
 class EmbeddingModel:
     """Custom embeddings"""
@@ -165,6 +207,10 @@ class EmbeddingModel:
         revision: Optional[str] = 'master',
         **kwargs,
     ):
+        if kwargs.get('model_name'):
+            # If model_name is provided, use OpenAIEmbeddings
+            return APIEmbeddingModel(**kwargs)
         # If model path does not exist and hub is 'modelscope', download the model
         if not os.path.exists(model_name_or_path) and hub == HubType.MODELSCOPE:
             model_name_or_path = download_model(model_name_or_path, revision)
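
`APIEmbeddingModel.encode` sends texts to the embedding service in slices of `batch_size`, and `encode_corpus` joins each document's optional title with its text first. The sketch below reproduces just that batching and text preparation with plain lists; `embed_in_batches`, `corpus_to_texts`, and `fake_embed` are hypothetical names, and `fake_embed` stands in for the remote `embed_documents` call:

```python
from typing import Callable, Dict, List, Sequence, Union


def embed_in_batches(
    texts: Union[str, Sequence[str]],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 10,
) -> List[List[float]]:
    """Call embed_fn on consecutive slices of at most batch_size texts."""
    if isinstance(texts, str):
        texts = [texts]
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        embeddings.extend(embed_fn(list(texts[start:start + batch_size])))
    return embeddings


def corpus_to_texts(corpus: Sequence[Union[str, Dict[str, str]]]) -> List[str]:
    """Join title and text for dict documents; pass plain strings through."""
    if corpus and isinstance(corpus[0], dict):
        return ['{} {}'.format(doc.get('title', ''), doc['text']).strip() for doc in corpus]
    return list(corpus)


batch_sizes: List[int] = []


def fake_embed(batch: List[str]) -> List[List[float]]:
    # Stand-in for the remote call: records the batch size, returns toy vectors.
    batch_sizes.append(len(batch))
    return [[float(len(text)), 1.0] for text in batch]


vectors = embed_in_batches(['a', 'bb', 'ccc', 'dddd', 'eeeee'], fake_embed, batch_size=2)
texts = corpus_to_texts([{'title': 'T', 'text': 'body'}, {'text': 'only'}])
```

In the diff the collected vectors are wrapped in `torch.tensor` before being returned; the sketch keeps them as lists to stay dependency-free.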

{evalscope-0.13.2 → evalscope-0.15.0}/evalscope/backend/rag_eval/utils/llm.py RENAMED

@@ -2,7 +2,7 @@ import os
 from langchain_core.callbacks.manager import CallbackManagerForLLMRun
 from langchain_core.language_models.llms import LLM as BaseLLM
 from langchain_openai import ChatOpenAI
-from modelscope.utils.hf_util import GenerationConfig
+from transformers.generation.configuration_utils import GenerationConfig
 from typing import Any, Dict, Iterator, List, Mapping, Optional
 from evalscope.constants import DEFAULT_MODEL_REVISION
@@ -16,9 +16,9 @@ class LLM:
         api_base = kw.get('api_base', None)
         if api_base:
             return ChatOpenAI(
-                model_name=kw.get('model_name', ''),
-                openai_api_base=api_base,
-                openai_api_key=kw.get('api_key', 'EMPTY'),
+                model=kw.get('model_name', ''),
+                base_url=api_base,
+                api_key=kw.get('api_key', 'EMPTY'),
             )
         else:
             return LocalLLM(**kw)

{evalscope-0.13.2 → evalscope-0.15.0}/evalscope/backend/vlm_eval_kit/backend_manager.py RENAMED

@@ -1,4 +1,5 @@
 import copy
+import os
 import subprocess
 from functools import partial
 from typing import Optional, Union
@@ -66,8 +67,9 @@ class VLMEvalKitBackendManager(BackendManager):
                     del remain_cfg['name']  # remove not used args
                     del remain_cfg['type']  # remove not used args
-                    self.valid_models.update({model_type: partial(model_class, model=model_type, **remain_cfg)})
-                    new_model_names.append(model_type)
+                    norm_model_type = os.path.basename(model_type).replace(':', '-').replace('.', '_')
+                    self.valid_models.update({norm_model_type: partial(model_class, model=model_type, **remain_cfg)})
+                    new_model_names.append(norm_model_type)
                 else:
                     remain_cfg = copy.deepcopy(model_cfg)
                     del remain_cfg['name']  # remove not used args
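
The registry key for a custom model is now a normalized name instead of the raw `model_type`, while the raw value is still passed to the model class. A minimal sketch of the normalization expression taken from the diff; the function name and the sample model identifier are hypothetical:

```python
import os


def normalize_model_name(model_type: str) -> str:
    """Keep the last path component and replace ':' with '-' and '.' with '_'."""
    return os.path.basename(model_type).replace(':', '-').replace('.', '_')


# 'org/my-model.v1:latest' is an invented identifier used only for illustration.
normalized = normalize_model_name('org/my-model.v1:latest')
```

One consequence worth noting: two different identifiers that share the same last path component would map to the same key under this scheme.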

{evalscope-0.13.2 → evalscope-0.15.0}/evalscope/benchmarks/__init__.py RENAMED

@@ -10,8 +10,8 @@ from evalscope.utils import get_logger
 logger = get_logger()
 # Using glob to find all files matching the pattern
-pattern = os.path.join(os.path.dirname(__file__), '*', '*_adapter.py')
-files = glob.glob(pattern, recursive=False)
+pattern = os.path.join(os.path.dirname(__file__), '*', '**', '*_adapter.py')
+files = glob.glob(pattern, recursive=True)
 for file_path in files:
     if file_path.endswith('.py') and not os.path.basename(file_path).startswith('_'):
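
Note on the hunk above: the old pattern only finds adapters exactly one directory below `benchmarks/`. The new one, with `recursive=True`, lets `**` match zero or more intermediate directories, which is what lets a nested file such as `aigc/t2i/evalmuse_adapter.py` be discovered. The sketch below reproduces both patterns against a throwaway directory tree. The helper names and the demo tree are assumptions for illustration, not part of evalscope.

```python
import glob
import os
import tempfile


def find_adapters(root: str, recursive: bool) -> list:
    # Build the old (flat) or new (recursive) pattern from the diff.
    if recursive:
        pattern = os.path.join(root, '*', '**', '*_adapter.py')
    else:
        pattern = os.path.join(root, '*', '*_adapter.py')
    found = glob.glob(pattern, recursive=recursive)
    # Report paths relative to root with '/' separators for readability.
    return sorted(os.path.relpath(p, root).replace(os.sep, '/') for p in found)


def make_demo_tree(root: str) -> None:
    # Hypothetical layout: one adapter one level deep, one two levels deep.
    for rel in ('arc/arc_adapter.py', 'aigc/t2i/evalmuse_adapter.py'):
        path = os.path.join(root, *rel.split('/'))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, 'w').close()


with tempfile.TemporaryDirectory() as root:
    make_demo_tree(root)
    old_matches = find_adapters(root, recursive=False)
    new_matches = find_adapters(root, recursive=True)

print('old pattern:', old_matches)
print('new pattern:', new_matches)
```

Because `**` can match zero directories, the one-level adapters found by the old pattern are still found by the new one.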

evalscope-0.15.0/evalscope/benchmarks/aigc/t2i/base.py ADDED

@@ -0,0 +1,56 @@
+from typing import List, Optional, Union
+from evalscope.benchmarks import DataAdapter
+from evalscope.metrics import mean, metric_registry
+from evalscope.utils.logger import get_logger
+logger = get_logger()
+class T2IBaseAdapter(DataAdapter):
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+        logger.info(f'Initializing metrics: {self.metric_list}')
+        self.metrics = {m: metric_registry.get(m).object() for m in self.metric_list}
+    def gen_prompt(self, input_d: dict, subset_name: str, few_shot_list: list, **kwargs) -> dict:
+        # dummy prompt for general t2i
+        return self.gen_prompt_data(prompt=input_d.get('prompt', ''), id=input_d.get('id', 0))
+    def get_gold_answer(self, input_d: dict) -> str:
+        # dummy gold answer for general t2i
+        return input_d.get('prompt', '')
+    def parse_pred_result(self, result: str, raw_input_d: dict = None, eval_type: str = 'checkpoint') -> str:
+        # dummy parse pred result for general t2i
+        return result or raw_input_d.get('image_path', '')
+    def match(self, gold: str, pred: str) -> dict:
+        # dummy match for general t2i
+        # pred is the image path, gold is the prompt
+        res = {}
+        for metric_name, metric_func in self.metrics.items():
+            score = metric_func(images=[pred], texts=[gold])[0][0]
+            if isinstance(score, dict):
+                for k, v in score.items():
+                    res[f'{metric_name}_{k}'] = v.cpu().item()
+            else:
+                res[metric_name] = score.cpu().item()  # Updated to use score.cpu().item()
+        return res
+    def compute_metric(self, review_res_list: Union[List[dict], List[List[dict]]], **kwargs) -> List[dict]:
+        """
+        compute weighted mean of the bleu score of all samples
+        Args:
+            review_res_list: [score1, score2, ...]
+        Returns:
+            avg_res: List[dict]
+        """
+        items = super().compute_dict_metric(review_res_list, **kwargs)
+        return [{'metric_name': k, 'score': mean(v), 'num': len(v)} for k, v in items.items()]

evalscope-0.15.0/evalscope/benchmarks/aigc/t2i/evalmuse_adapter.py ADDED

@@ -0,0 +1,77 @@
+# Copyright (c) Alibaba, Inc. and its affiliates.
+import os.path
+from collections import defaultdict
+from typing import List, Optional, Union
+from evalscope.benchmarks import Benchmark
+from evalscope.constants import OutputType
+from evalscope.metrics import mean
+from evalscope.utils.io_utils import jsonl_to_list
+from evalscope.utils.logger import get_logger
+from .base import T2IBaseAdapter
+logger = get_logger()
+@Benchmark.register(
+    name='evalmuse',
+    dataset_id='AI-ModelScope/T2V-Eval-Prompts',
+    model_adapter=OutputType.IMAGE_GENERATION,
+    output_types=[OutputType.IMAGE_GENERATION],
+    subset_list=['EvalMuse'],
+    metric_list=['FGA_BLIP2Score'],
+    few_shot_num=0,
+    train_split=None,
+    eval_split='test',
+)
+class EvalMuseAdapter(T2IBaseAdapter):
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+    def load(self, **kwargs) -> dict:
+        if os.path.isfile(self.dataset_id):
+            data_list = jsonl_to_list(self.dataset_id)
+            data_dict = {self.subset_list[0]: {'test': data_list}}
+            return data_dict
+        else:
+            return super().load(**kwargs)
+    def get_gold_answer(self, input_d: dict) -> dict:
+        # return prompt and elements dict
+        return {'prompt': input_d.get('prompt'), 'tags': input_d.get('tags', {})}
+    def match(self, gold: dict, pred: str) -> dict:
+        # dummy match for general t2i
+        # pred is the image path, gold is the prompt
+        res = {}
+        for metric_name, metric_func in self.metrics.items():
+            if metric_name == 'FGA_BLIP2Score':
+                # For FGA_BLIP2Score, we need to pass the dictionary
+                score = metric_func(images=[pred], texts=[gold])[0][0]
+            else:
+                score = metric_func(images=[pred], texts=[gold['prompt']])[0][0]
+            if isinstance(score, dict):
+                for k, v in score.items():
+                    res[f'{metric_name}:{k}'] = v.cpu().item()
+            else:
+                res[metric_name] = score.cpu().item()
+        return res
+    def compute_metric(self, review_res_list: Union[List[dict], List[List[dict]]], **kwargs) -> List[dict]:
+        """
+        compute weighted mean of the bleu score of all samples
+        """
+        items = super().compute_dict_metric(review_res_list, **kwargs)
+        # add statistics for each metric
+        new_items = defaultdict(list)
+        for metric_name, value_list in items.items():
+            if 'FGA_BLIP2Score' in metric_name and '(' in metric_name:  # FGA_BLIP2Score element score
+                metrics_prefix = metric_name.split(':')[0]
+                category = metric_name.rpartition('(')[-1].split(')')[0]
+                new_items[f'{metrics_prefix}:{category}'].extend(value_list)
+            else:
+                new_items[metric_name].extend(value_list)
+        # calculate mean for each metric
+        return [{'metric_name': k, 'score': mean(v), 'num': len(v)} for k, v in new_items.items()]
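
Note on `compute_metric` above: per-element `FGA_BLIP2Score` entries whose name contains a parenthesised suffix are pooled by that suffix before averaging, and every other metric is averaged as-is. The sketch below isolates that grouping step with plain floats. The function name `group_scores`, the use of `statistics.mean` in place of `evalscope.metrics.mean`, and the sample key format `element(category)` are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean  # stand-in for evalscope.metrics.mean (assumption)


def group_scores(items: dict) -> list:
    grouped = defaultdict(list)
    for metric_name, value_list in items.items():
        if 'FGA_BLIP2Score' in metric_name and '(' in metric_name:
            # Text before the first ':' is the metric prefix.
            prefix = metric_name.split(':')[0]
            # Text inside the last pair of parentheses is the category.
            category = metric_name.rpartition('(')[-1].split(')')[0]
            grouped[f'{prefix}:{category}'].extend(value_list)
        else:
            grouped[metric_name].extend(value_list)
    return [{'metric_name': k, 'score': mean(v), 'num': len(v)} for k, v in grouped.items()]


# Hypothetical per-sample scores already collected per metric key.
sample_items = {
    'FGA_BLIP2Score:overall_score': [3.0, 4.0],
    'FGA_BLIP2Score:dog(animal)': [1.0],
    'FGA_BLIP2Score:cat(animal)': [0.0],
}
result = group_scores(sample_items)
for row in result:
    print(row)
```

In this sketch the two element-level entries collapse into a single `FGA_BLIP2Score:animal` row whose `num` counts the pooled samples.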
