evalscope 0.15.1__tar.gz → 0.16.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release.
This version of evalscope might be problematic.
- {evalscope-0.15.1/evalscope.egg-info → evalscope-0.16.1}/PKG-INFO +57 -31
- {evalscope-0.15.1 → evalscope-0.16.1}/README.md +42 -18
- evalscope-0.16.1/evalscope/app/__init__.py +28 -0
- {evalscope-0.15.1/evalscope/report → evalscope-0.16.1/evalscope/app}/app.py +67 -59
- evalscope-0.16.1/evalscope/app/constants.py +21 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/arguments.py +12 -1
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/opencompass/backend_manager.py +2 -1
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/cmteb/arguments.py +4 -1
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/cmteb/task_template.py +19 -3
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/utils/embedding.py +75 -35
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/utils/llm.py +1 -1
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/alpaca_eval/alpaca_eval_adapter.py +0 -6
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/benchmark.py +1 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/chinese_simple_qa/csimple_qa_adapter.py +1 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/data_adapter.py +101 -18
- evalscope-0.16.1/evalscope/benchmarks/docmath/docmath_adapter.py +84 -0
- evalscope-0.16.1/evalscope/benchmarks/docmath/utils.py +220 -0
- evalscope-0.16.1/evalscope/benchmarks/drop/drop_adapter.py +133 -0
- evalscope-0.16.1/evalscope/benchmarks/drop/utils.py +59 -0
- evalscope-0.16.1/evalscope/benchmarks/frames/frames_adapter.py +90 -0
- evalscope-0.16.1/evalscope/benchmarks/frames/utils.py +37 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/general_qa/general_qa_adapter.py +5 -1
- evalscope-0.16.1/evalscope/benchmarks/needle_haystack/needle_haystack_adapter.py +341 -0
- evalscope-0.16.1/evalscope/benchmarks/needle_haystack/utils.py +79 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/simple_qa/simple_qa_adapter.py +1 -0
- {evalscope-0.15.1/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models → evalscope-0.16.1/evalscope/benchmarks/tool_bench}/__init__.py +0 -0
- evalscope-0.16.1/evalscope/benchmarks/tool_bench/tool_bench_adapter.py +70 -0
- evalscope-0.16.1/evalscope/benchmarks/tool_bench/utils.py +203 -0
- evalscope-0.16.1/evalscope/benchmarks/utils.py +60 -0
- evalscope-0.16.1/evalscope/benchmarks/winogrande/winogrande_adapter.py +57 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/cli/start_app.py +2 -2
- evalscope-0.16.1/evalscope/collections/__init__.py +35 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/collections/evaluator.py +94 -32
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/config.py +54 -17
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/evaluator/evaluator.py +80 -41
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/__init__.py +3 -1
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/bundled_rouge_score/rouge_scorer.py +20 -15
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/llm_judge.py +15 -8
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/math_parser.py +1 -1
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/rouge_metric.py +11 -13
- evalscope-0.16.1/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/__init__.py +0 -0
- evalscope-0.16.1/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/models/adapters/chat_adapter.py +51 -34
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/models/adapters/server_adapter.py +17 -25
- evalscope-0.16.1/evalscope/perf/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/arguments.py +16 -7
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/benchmark.py +0 -15
- evalscope-0.16.1/evalscope/perf/main.py +103 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/plugin/datasets/custom.py +15 -0
- evalscope-0.16.1/evalscope/perf/utils/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/utils/benchmark_util.py +34 -16
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/utils/db_util.py +25 -15
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/utils/local_server.py +1 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/utils/log_utils.py +12 -5
- evalscope-0.16.1/evalscope/perf/utils/rich_display.py +186 -0
- evalscope-0.16.1/evalscope/report/__init__.py +38 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/report/combinator.py +8 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/report/generator.py +33 -9
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/report/utils.py +61 -4
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/run.py +12 -0
- evalscope-0.16.1/evalscope/third_party/thinkbench/tools/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/toolbench_static/toolbench_static.py +2 -1
- evalscope-0.16.1/evalscope/utils/deprecation_utils.py +42 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/utils/logger.py +1 -1
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/utils/utils.py +12 -0
- evalscope-0.16.1/evalscope/version.py +4 -0
- {evalscope-0.15.1 → evalscope-0.16.1/evalscope.egg-info}/PKG-INFO +57 -31
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope.egg-info/SOURCES.txt +23 -2
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope.egg-info/requires.txt +14 -12
- {evalscope-0.15.1 → evalscope-0.16.1}/requirements/framework.txt +2 -2
- evalscope-0.16.1/requirements/opencompass.txt +1 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/requirements/perf.txt +2 -1
- {evalscope-0.15.1 → evalscope-0.16.1}/requirements/rag.txt +1 -1
- evalscope-0.16.1/requirements/vlmeval.txt +1 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/tests/aigc/test_t2i.py +40 -3
- {evalscope-0.15.1 → evalscope-0.16.1}/tests/cli/test_all.py +39 -32
- {evalscope-0.15.1 → evalscope-0.16.1}/tests/cli/test_collection.py +8 -6
- {evalscope-0.15.1 → evalscope-0.16.1}/tests/cli/test_run.py +43 -17
- {evalscope-0.15.1 → evalscope-0.16.1}/tests/perf/test_perf.py +23 -0
- evalscope-0.16.1/tests/rag/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/tests/rag/test_mteb.py +5 -5
- evalscope-0.15.1/evalscope/benchmarks/utils.py +0 -34
- evalscope-0.15.1/evalscope/collections/__init__.py +0 -3
- evalscope-0.15.1/evalscope/perf/main.py +0 -46
- evalscope-0.15.1/evalscope/report/__init__.py +0 -6
- evalscope-0.15.1/evalscope/version.py +0 -4
- evalscope-0.15.1/requirements/opencompass.txt +0 -1
- evalscope-0.15.1/requirements/vlmeval.txt +0 -1
- {evalscope-0.15.1 → evalscope-0.16.1}/LICENSE +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/MANIFEST.in +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/__init__.py +0 -0
- evalscope-0.15.1/evalscope/report/app_arguments.py → evalscope-0.16.1/evalscope/app/arguments.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/base.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/opencompass/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/opencompass/api_meta_template.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/opencompass/tasks/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/opencompass/tasks/eval_api.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/opencompass/tasks/eval_datasets.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/backend_manager.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/clip_benchmark/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/clip_benchmark/arguments.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/clip_benchmark/dataset_builder.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/clip_benchmark/task_template.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/image_caption.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_classification.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_retrieval.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/clip_benchmark/utils/webdataset_convert.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/clip_benchmark/utils/webdatasets.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/cmteb/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/cmteb/base.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/cmteb/tasks/Classification.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/cmteb/tasks/Clustering.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/cmteb/tasks/CustomTask.py +1 -1
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/cmteb/tasks/PairClassification.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/cmteb/tasks/Reranking.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/cmteb/tasks/Retrieval.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/cmteb/tasks/STS.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/cmteb/tasks/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/ragas/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/ragas/arguments.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/ragas/prompts/persona_prompt.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/ragas/task_template.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/ragas/tasks/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/ragas/tasks/build_distribution.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/ragas/tasks/build_transform.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/ragas/tasks/testset_generation.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/ragas/tasks/translate_prompt.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/utils/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/utils/clip.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/rag_eval/utils/tools.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/vlm_eval_kit/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/backend/vlm_eval_kit/backend_manager.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/aigc/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/aigc/t2i/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/aigc/t2i/base.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/aigc/t2i/evalmuse_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/aigc/t2i/genai_bench_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/aigc/t2i/general_t2i_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/aigc/t2i/hpdv2_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/aigc/t2i/tifa_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/aime/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/aime/aime24_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/aime/aime25_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/alpaca_eval/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/arc/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/arc/ai2_arc.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/arc/arc_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/arena_hard/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/arena_hard/arena_hard_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/arena_hard/utils.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/bbh_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/boolean_expressions.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/causal_judgement.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/date_understanding.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/disambiguation_qa.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/dyck_languages.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/formal_fallacies.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/geometric_shapes.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/hyperbaton.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_five_objects.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_seven_objects.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_three_objects.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/movie_recommendation.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/multistep_arithmetic_two.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/navigate.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/object_counting.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/penguins_in_a_table.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/reasoning_about_colored_objects.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/ruin_names.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/salient_translation_error_detection.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/snarks.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/sports_understanding.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/temporal_sequences.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_five_objects.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_seven_objects.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_three_objects.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/web_of_lies.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/bbh/cot_prompts/word_sorting.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/ceval/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/ceval/ceval_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/ceval/ceval_exam.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/chinese_simple_qa/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/cmmlu/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/cmmlu/cmmlu.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/cmmlu/cmmlu_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/cmmlu/samples.jsonl +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/competition_math/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/competition_math/competition_math.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/competition_math/competition_math_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/data_collection/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/data_collection/data_collection_adapter.py +0 -0
- {evalscope-0.15.1/evalscope/benchmarks/general_mcq → evalscope-0.16.1/evalscope/benchmarks/docmath}/__init__.py +0 -0
- {evalscope-0.15.1/evalscope/benchmarks/gpqa → evalscope-0.16.1/evalscope/benchmarks/drop}/__init__.py +0 -0
- {evalscope-0.15.1/evalscope/benchmarks/ifeval → evalscope-0.16.1/evalscope/benchmarks/frames}/__init__.py +0 -0
- {evalscope-0.15.1/evalscope/benchmarks/iquiz → evalscope-0.16.1/evalscope/benchmarks/general_mcq}/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/general_mcq/general_mcq_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/general_qa/__init__.py +0 -0
- {evalscope-0.15.1/evalscope/benchmarks/live_code_bench → evalscope-0.16.1/evalscope/benchmarks/gpqa}/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/gpqa/chain_of_thought.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/gpqa/gpqa_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/gsm8k/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/gsm8k/gsm8k.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/gsm8k/gsm8k_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/hellaswag/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/hellaswag/hellaswag.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/hellaswag/hellaswag_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/humaneval/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/humaneval/humaneval.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/humaneval/humaneval_adapter.py +0 -0
- {evalscope-0.15.1/evalscope/benchmarks/maritime_bench → evalscope-0.16.1/evalscope/benchmarks/ifeval}/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/ifeval/ifeval_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/ifeval/instructions.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/ifeval/instructions_registry.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/ifeval/instructions_util.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/ifeval/utils.py +0 -0
- {evalscope-0.15.1/evalscope/benchmarks/math_500 → evalscope-0.16.1/evalscope/benchmarks/iquiz}/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/iquiz/iquiz_adapter.py +0 -0
- {evalscope-0.15.1/evalscope/benchmarks/mmlu_pro → evalscope-0.16.1/evalscope/benchmarks/live_code_bench}/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/live_code_bench/evaluate_utils.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/live_code_bench/extract_utils.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/live_code_bench/live_code_bench_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/live_code_bench/load_utils.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/live_code_bench/pass_k_utils.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/live_code_bench/prompts.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/live_code_bench/testing_util.py +0 -0
- {evalscope-0.15.1/evalscope/benchmarks/mmlu_redux → evalscope-0.16.1/evalscope/benchmarks/maritime_bench}/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/maritime_bench/maritime_bench_adapter.py +0 -0
- {evalscope-0.15.1/evalscope/benchmarks/musr → evalscope-0.16.1/evalscope/benchmarks/math_500}/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/math_500/math_500_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/mmlu/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/mmlu/mmlu.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/mmlu/mmlu_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/mmlu/samples.jsonl +0 -0
- {evalscope-0.15.1/evalscope/benchmarks/process_bench → evalscope-0.16.1/evalscope/benchmarks/mmlu_pro}/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/mmlu_pro/mmlu_pro_adapter.py +0 -0
- {evalscope-0.15.1/evalscope/benchmarks/simple_qa → evalscope-0.16.1/evalscope/benchmarks/mmlu_redux}/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/mmlu_redux/mmlu_redux_adapter.py +0 -0
- {evalscope-0.15.1/evalscope/benchmarks/super_gpqa → evalscope-0.16.1/evalscope/benchmarks/musr}/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/musr/musr_adapter.py +0 -0
- {evalscope-0.15.1/evalscope/metrics/t2v_metrics/models → evalscope-0.16.1/evalscope/benchmarks/needle_haystack}/__init__.py +0 -0
- {evalscope-0.15.1/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model → evalscope-0.16.1/evalscope/benchmarks/process_bench}/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/process_bench/critique_template.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/process_bench/process_bench_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/race/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/race/race.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/race/race_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/race/samples.jsonl +0 -0
- {evalscope-0.15.1/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward → evalscope-0.16.1/evalscope/benchmarks/simple_qa}/__init__.py +0 -0
- {evalscope-0.15.1/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5 → evalscope-0.16.1/evalscope/benchmarks/super_gpqa}/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/super_gpqa/five_shot_prompt.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/super_gpqa/super_gpqa_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/super_gpqa/utils.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/super_gpqa/zero_shot_prompt.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/trivia_qa/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/trivia_qa/samples.jsonl +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/trivia_qa/trivia_qa.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/trivia_qa/trivia_qa_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/truthful_qa/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/truthful_qa/truthful_qa.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/benchmarks/truthful_qa/truthful_qa_adapter.py +0 -0
- {evalscope-0.15.1/evalscope/perf → evalscope-0.16.1/evalscope/benchmarks/winogrande}/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/cli/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/cli/base.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/cli/cli.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/cli/start_eval.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/cli/start_perf.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/cli/start_server.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/collections/sampler.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/collections/schema.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/constants.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/evaluator/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/evaluator/rating_eval.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/evaluator/reviewer/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/evaluator/reviewer/auto_reviewer.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/bundled_rouge_score/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/metrics.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/named_metrics.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/clipscore.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/constants.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/itmscore.py +0 -0
- {evalscope-0.15.1/evalscope/perf/utils → evalscope-0.16.1/evalscope/metrics/t2v_metrics/models}/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/__init__.py +0 -0
- {evalscope-0.15.1/evalscope/third_party/thinkbench/tools → evalscope-0.16.1/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model}/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model/base_model.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model/clip_model.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model/cross_modeling.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/clip_model.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/hpsv2_model.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/mps_model.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/clipscore_models/pickscore_model.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/itmscore_models/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/itmscore_models/blip2_itm_model.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/itmscore_models/fga_blip2_model.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward/ImageReward.py +0 -0
- {evalscope-0.15.1/tests/rag → evalscope-0.16.1/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward}/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward/blip_pretrain.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward_model.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/model.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/utils.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/language_model/clip_t5.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/multimodal_encoder/builder.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/multimodal_encoder/clip_encoder.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/multimodal_projector/builder.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5_model.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/gpt4v_model.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/config.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/dist_utils.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/gradcam.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/logger.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/optims.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/registry.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/utils.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/vqa_tools/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/vqa_tools/vqa.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/vqa_tools/vqa_eval.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/default.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_caption_flant5xl.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_caption_opt2.7b.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_caption_opt6.7b.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_coco.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_flant5xl.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_flant5xxl.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_vicuna13b.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_vicuna7b.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl_iter_80k_total_100k_no_prefix.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl_iter_80k_total_100k_prefix.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl_vitL.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xxl.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_opt2.7b.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_opt6.7b.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_vitL.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_vicuna13b.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_vicuna7b.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/med_config.json +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/med_config_albef.json +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/med_large_config.json +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/base_model.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/Qformer.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_image_text_matching.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_qformer.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_t5.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_t5_instruct.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/fga_blip2.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/modeling_llama.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/modeling_t5.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_caption.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_classification.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_feature_extractor.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_image_text_matching.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_nlvr.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_outputs.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_pretrain.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_vqa.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/nlvr_encoder.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/clip_vit.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/eva_vit.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/med.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/vit.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/base_processor.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/blip_processors.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/randaugment.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/mm_utils.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/models/vqascore_models/vqa_model.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/score.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/metrics/t2v_metrics/vqascore.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/models/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/models/adapters/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/models/adapters/base_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/models/adapters/choice_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/models/adapters/custom_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/models/adapters/t2i_adapter.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/models/custom/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/models/custom/custom_model.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/models/custom/dummy_model.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/models/local_model.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/models/model.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/models/register.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/http_client.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/plugin/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/plugin/api/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/plugin/api/base.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/plugin/api/custom_api.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/plugin/api/dashscope_api.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/plugin/api/openai_api.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/plugin/datasets/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/plugin/datasets/base.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/plugin/datasets/flickr8k.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/plugin/datasets/line_by_line.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/plugin/datasets/longalpaca.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/plugin/datasets/openqa.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/plugin/datasets/random_dataset.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/plugin/datasets/speed_benchmark.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/plugin/registry.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/utils/analysis_result.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/perf/utils/handler.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/registry/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/registry/config/cfg_arena.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/registry/config/cfg_arena_zhihu.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/registry/config/cfg_pairwise_baseline.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/registry/config/cfg_single.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/registry/data/prompt_template/lmsys_v2.jsonl +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/registry/data/prompt_template/prompt_templates.jsonl +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/registry/data/qa_browser/battle.jsonl +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/registry/data/qa_browser/category_mapping.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/registry/data/question.jsonl +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/registry/tasks/arc.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/registry/tasks/bbh.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/registry/tasks/bbh_mini.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/registry/tasks/ceval.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/registry/tasks/ceval_mini.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/registry/tasks/cmmlu.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/registry/tasks/eval_qwen-7b-chat_v100.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/registry/tasks/general_qa.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/registry/tasks/gsm8k.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/registry/tasks/mmlu.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/registry/tasks/mmlu_mini.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/run_arena.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/summarizer.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/longbench_write/README.md +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/longbench_write/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/longbench_write/default_task.json +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/longbench_write/default_task.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/longbench_write/eval.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/longbench_write/infer.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/longbench_write/longbench_write.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/longbench_write/resources/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/longbench_write/resources/judge.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/longbench_write/resources/longbench_write.jsonl +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/longbench_write/resources/longbench_write_en.jsonl +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/longbench_write/resources/longwrite_ruler.jsonl +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/longbench_write/tools/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/longbench_write/tools/data_etl.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/longbench_write/tools/openai_api.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/longbench_write/utils.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/thinkbench/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/thinkbench/eval.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/thinkbench/infer.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/thinkbench/resources/critique_template.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/thinkbench/resources/reformat_template.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/thinkbench/tools/llm.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/thinkbench/tools/utils.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/toolbench_static/README.md +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/toolbench_static/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/toolbench_static/config_default.json +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/toolbench_static/config_default.yaml +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/toolbench_static/eval.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/toolbench_static/infer.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/toolbench_static/llm/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/toolbench_static/llm/swift_infer.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/third_party/toolbench_static/requirements.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/utils/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/utils/arena_utils.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/utils/chat_service.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/utils/completion_parsers.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/utils/filters.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/utils/import_utils.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/utils/io_utils.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope/utils/model_utils.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope.egg-info/dependency_links.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope.egg-info/entry_points.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope.egg-info/not-zip-safe +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/evalscope.egg-info/top_level.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/requirements/aigc.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/requirements/app.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/requirements/docs.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/requirements.txt +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/setup.cfg +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/setup.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/tests/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/tests/aigc/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/tests/cli/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/tests/perf/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/tests/rag/test_clip_benchmark.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/tests/rag/test_ragas.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/tests/swift/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/tests/swift/test_run_swift_eval.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/tests/swift/test_run_swift_vlm_eval.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/tests/swift/test_run_swift_vlm_jugde_eval.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/tests/test_run_all.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/tests/vlm/__init__.py +0 -0
- {evalscope-0.15.1 → evalscope-0.16.1}/tests/vlm/test_vlmeval.py +0 -0
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: evalscope
-Version: 0.15.1
+Version: 0.16.1
 Summary: EvalScope: Lightweight LLMs Evaluation Framework
 Home-page: https://github.com/modelscope/evalscope
 Author: ModelScope team
@@ -17,12 +17,12 @@ Requires-Python: >=3.8
 Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: accelerate
-Requires-Dist: datasets
+Requires-Dist: datasets>=3.0
 Requires-Dist: immutabledict
 Requires-Dist: jieba
 Requires-Dist: jsonlines
 Requires-Dist: langdetect
-Requires-Dist:
+Requires-Dist: latex2sympy2_extended
 Requires-Dist: matplotlib
 Requires-Dist: modelscope[framework]
 Requires-Dist: nltk>=3.9
@@ -45,24 +45,25 @@ Requires-Dist: tqdm
 Requires-Dist: transformers>=4.33
 Requires-Dist: word2number
 Provides-Extra: opencompass
-Requires-Dist: ms-opencompass>=0.1.
+Requires-Dist: ms-opencompass>=0.1.6; extra == "opencompass"
 Provides-Extra: vlmeval
-Requires-Dist: ms-vlmeval>=0.0.
+Requires-Dist: ms-vlmeval>=0.0.17; extra == "vlmeval"
 Provides-Extra: rag
 Requires-Dist: langchain<0.4.0,>=0.3.0; extra == "rag"
 Requires-Dist: langchain-community<0.4.0,>=0.3.0; extra == "rag"
 Requires-Dist: langchain-core<0.4.0,>=0.3.0; extra == "rag"
 Requires-Dist: langchain-openai<0.4.0,>=0.3.0; extra == "rag"
-Requires-Dist: mteb==1.
+Requires-Dist: mteb==1.38.20; extra == "rag"
 Requires-Dist: ragas==0.2.14; extra == "rag"
 Requires-Dist: webdataset>0.2.0; extra == "rag"
 Provides-Extra: perf
 Requires-Dist: aiohttp; extra == "perf"
 Requires-Dist: fastapi; extra == "perf"
 Requires-Dist: numpy; extra == "perf"
+Requires-Dist: rich; extra == "perf"
 Requires-Dist: sse_starlette; extra == "perf"
 Requires-Dist: transformers; extra == "perf"
-Requires-Dist:
+Requires-Dist: uvicorn; extra == "perf"
 Provides-Extra: app
 Requires-Dist: gradio==5.4.0; extra == "app"
 Requires-Dist: plotly<6.0.0,>=5.23.0; extra == "app"
@@ -74,12 +75,12 @@ Requires-Dist: open_clip_torch; extra == "aigc"
 Requires-Dist: opencv-python; extra == "aigc"
 Provides-Extra: all
 Requires-Dist: accelerate; extra == "all"
-Requires-Dist: datasets
+Requires-Dist: datasets>=3.0; extra == "all"
 Requires-Dist: immutabledict; extra == "all"
 Requires-Dist: jieba; extra == "all"
 Requires-Dist: jsonlines; extra == "all"
 Requires-Dist: langdetect; extra == "all"
-Requires-Dist:
+Requires-Dist: latex2sympy2_extended; extra == "all"
 Requires-Dist: matplotlib; extra == "all"
 Requires-Dist: modelscope[framework]; extra == "all"
 Requires-Dist: nltk>=3.9; extra == "all"
@@ -101,21 +102,22 @@ Requires-Dist: torchvision; extra == "all"
 Requires-Dist: tqdm; extra == "all"
 Requires-Dist: transformers>=4.33; extra == "all"
 Requires-Dist: word2number; extra == "all"
-Requires-Dist: ms-opencompass>=0.1.
-Requires-Dist: ms-vlmeval>=0.0.
+Requires-Dist: ms-opencompass>=0.1.6; extra == "all"
+Requires-Dist: ms-vlmeval>=0.0.17; extra == "all"
 Requires-Dist: langchain<0.4.0,>=0.3.0; extra == "all"
 Requires-Dist: langchain-community<0.4.0,>=0.3.0; extra == "all"
 Requires-Dist: langchain-core<0.4.0,>=0.3.0; extra == "all"
 Requires-Dist: langchain-openai<0.4.0,>=0.3.0; extra == "all"
-Requires-Dist: mteb==1.
+Requires-Dist: mteb==1.38.20; extra == "all"
 Requires-Dist: ragas==0.2.14; extra == "all"
 Requires-Dist: webdataset>0.2.0; extra == "all"
 Requires-Dist: aiohttp; extra == "all"
 Requires-Dist: fastapi; extra == "all"
 Requires-Dist: numpy; extra == "all"
+Requires-Dist: rich; extra == "all"
 Requires-Dist: sse_starlette; extra == "all"
 Requires-Dist: transformers; extra == "all"
-Requires-Dist:
+Requires-Dist: uvicorn; extra == "all"
 Requires-Dist: gradio==5.4.0; extra == "all"
 Requires-Dist: plotly<6.0.0,>=5.23.0; extra == "all"
 Requires-Dist: diffusers; extra == "all"
@@ -177,9 +179,23 @@ Requires-Dist: opencv-python; extra == "all"

 ## 📝 Introduction

-EvalScope is [ModelScope](https://modelscope.cn/)
+EvalScope is a comprehensive model evaluation and performance benchmarking framework meticulously crafted by the [ModelScope Community](https://modelscope.cn/), offering a one-stop solution for your model assessment needs. Regardless of the type of model you are developing, EvalScope is equipped to cater to your requirements:

-
+- 🧠 Large Language Models
+- 🎨 Multimodal Models
+- 🔍 Embedding Models
+- 🏆 Reranker Models
+- 🖼️ CLIP Models
+- 🎭 AIGC Models (Image-to-Text/Video)
+- ...and more!
+
+EvalScope is not merely an evaluation tool; it is a valuable ally in your model optimization journey:
+
+- 🏅 Equipped with multiple industry-recognized benchmarks and evaluation metrics: MMLU, CMMLU, C-Eval, GSM8K, etc.
+- 📊 Model inference performance stress testing: Ensuring your model excels in real-world applications.
+- 🚀 Seamless integration with the [ms-swift](https://github.com/modelscope/ms-swift) training framework, enabling one-click evaluations and providing full-chain support from training to assessment for your model development.
+
+Below is the overall architecture diagram of EvalScope:

 <p align="center">
 <img src="docs/en/_static/images/evalscope_framework.png" width="70%">
@@ -214,6 +230,10 @@ Please scan the QR code below to join our community groups:

 ## 🎉 News

+- 🔥 **[2025.06.02]** Added support for the Needle-in-a-Haystack test. Simply specify `needle_haystack` to conduct the test, and a corresponding heatmap will be generated in the `outputs/reports` folder, providing a visual representation of the model's performance. Refer to the [documentation](https://evalscope.readthedocs.io/en/latest/third_party/needle_haystack.html) for more details.
+- 🔥 **[2025.05.29]** Added support for two long document evaluation benchmarks: [DocMath](https://modelscope.cn/datasets/yale-nlp/DocMath-Eval/summary) and [FRAMES](https://modelscope.cn/datasets/iic/frames/summary). For usage guidelines, please refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset.html).
+- 🔥 **[2025.05.16]** Model service performance stress testing now supports setting various levels of concurrency and outputs a performance test report. [Reference example](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#id3).
+- 🔥 **[2025.05.13]** Added support for the [ToolBench-Static](https://modelscope.cn/datasets/AI-ModelScope/ToolBench-Static) dataset to evaluate model's tool-calling capabilities. Refer to the [documentation](https://evalscope.readthedocs.io/en/latest/third_party/toolbench.html) for usage instructions. Also added support for the [DROP](https://modelscope.cn/datasets/AI-ModelScope/DROP/dataPeview) and [Winogrande](https://modelscope.cn/datasets/AI-ModelScope/winogrande_val) benchmarks to assess the reasoning capabilities of models.
 - 🔥 **[2025.04.29]** Added Qwen3 Evaluation Best Practices, [welcome to read 📖](https://evalscope.readthedocs.io/en/latest/best_practice/qwen3.html)
 - 🔥 **[2025.04.27]** Support for text-to-image evaluation: Supports 8 metrics including MPS, HPSv2.1Score, etc., and evaluation benchmarks such as EvalMuse, GenAI-Bench. Refer to the [user documentation](https://evalscope.readthedocs.io/en/latest/user_guides/aigc/t2i.html) for more details.
 - 🔥 **[2025.04.10]** Model service stress testing tool now supports the `/v1/completions` endpoint (the default endpoint for vLLM benchmarking)
@@ -479,26 +499,27 @@ For more customized evaluations, such as customizing model parameters or dataset
 
 ```shell
 evalscope eval \
- --model Qwen/
- --model-args revision
- --generation-config do_sample
+ --model Qwen/Qwen3-0.6B \
+ --model-args '{"revision": "master", "precision": "torch.float16", "device_map": "auto"}' \
+ --generation-config '{"do_sample":true,"temperature":0.6,"max_new_tokens":512,"chat_template_kwargs":{"enable_thinking": false}}' \
  --dataset-args '{"gsm8k": {"few_shot_num": 0, "few_shot_random": false}}' \
  --datasets gsm8k \
  --limit 10
 ```
 
-### Parameter
-- `--model-args`: Model loading parameters,
-  - `revision`: Model version
-  - `precision`: Model precision
-  - `device_map`:
-- `--generation-config`: Generation parameters,
-  - `do_sample`: Whether to use sampling
-  - `
-  - `max_new_tokens`: Maximum length of
-  -
+### Parameter Description
+- `--model-args`: Model loading parameters, passed as a JSON string:
+  - `revision`: Model version
+  - `precision`: Model precision
+  - `device_map`: Device allocation for the model
+- `--generation-config`: Generation parameters, passed as a JSON string and parsed as a dictionary:
+  - `do_sample`: Whether to use sampling
+  - `temperature`: Generation temperature
+  - `max_new_tokens`: Maximum length of generated tokens
+  - `chat_template_kwargs`: Model inference template parameters
+- `--dataset-args`: Settings for the evaluation dataset, passed as a JSON string where the key is the dataset name and the value is the parameters. Note that these need to correspond one-to-one with the values in the `--datasets` parameter:
   - `few_shot_num`: Number of few-shot examples
-  - `few_shot_random`: Whether to randomly sample few-shot data
+  - `few_shot_random`: Whether to randomly sample few-shot data; if not set, defaults to `true`
 
 Reference: [Full Parameter Description](https://evalscope.readthedocs.io/en/latest/get_started/parameters.html)
 
@@ -517,6 +538,11 @@ A stress testing tool focused on large language models, which can be customized
 
 Reference: Performance Testing [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/index.html)
 
+**Output example**
+
+
+
+
 **Supports wandb for recording results**
 
 
@@ -565,7 +591,7 @@ EvalScope, as the official evaluation tool of [ModelScope](https://modelscope.cn
 </a>
 
 ## 🔜 Roadmap
-- [
+- [x] Support for better evaluation report visualization
 - [x] Support for mixed evaluations across multiple datasets
 - [x] RAG evaluation
 - [x] VLM evaluation
@@ -575,7 +601,7 @@ EvalScope, as the official evaluation tool of [ModelScope](https://modelscope.cn
 - [x] Multi-modal evaluation
 - [ ] Benchmarks
   - [ ] GAIA
-  - [
+  - [x] GPQA
   - [x] MBPP
 
 
@@ -51,9 +51,23 @@
 
 ## 📝 Introduction
 
-EvalScope is [ModelScope](https://modelscope.cn/)
+EvalScope is a comprehensive model evaluation and performance benchmarking framework meticulously crafted by the [ModelScope Community](https://modelscope.cn/), offering a one-stop solution for your model assessment needs. Regardless of the type of model you are developing, EvalScope is equipped to cater to your requirements:
 
-
+- 🧠 Large Language Models
+- 🎨 Multimodal Models
+- 🔍 Embedding Models
+- 🏆 Reranker Models
+- 🖼️ CLIP Models
+- 🎭 AIGC Models (Image-to-Text/Video)
+- ...and more!
+
+EvalScope is not merely an evaluation tool; it is a valuable ally in your model optimization journey:
+
+- 🏅 Equipped with multiple industry-recognized benchmarks and evaluation metrics: MMLU, CMMLU, C-Eval, GSM8K, etc.
+- 📊 Model inference performance stress testing: Ensuring your model excels in real-world applications.
+- 🚀 Seamless integration with the [ms-swift](https://github.com/modelscope/ms-swift) training framework, enabling one-click evaluations and providing full-chain support from training to assessment for your model development.
+
+Below is the overall architecture diagram of EvalScope:
 
 <p align="center">
 <img src="docs/en/_static/images/evalscope_framework.png" width="70%">
@@ -88,6 +102,10 @@ Please scan the QR code below to join our community groups:
 
 ## 🎉 News
 
+- 🔥 **[2025.06.02]** Added support for the Needle-in-a-Haystack test. Simply specify `needle_haystack` to conduct the test, and a corresponding heatmap will be generated in the `outputs/reports` folder, providing a visual representation of the model's performance. Refer to the [documentation](https://evalscope.readthedocs.io/en/latest/third_party/needle_haystack.html) for more details.
+- 🔥 **[2025.05.29]** Added support for two long document evaluation benchmarks: [DocMath](https://modelscope.cn/datasets/yale-nlp/DocMath-Eval/summary) and [FRAMES](https://modelscope.cn/datasets/iic/frames/summary). For usage guidelines, please refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset.html).
+- 🔥 **[2025.05.16]** Model service performance stress testing now supports setting various levels of concurrency and outputs a performance test report. [Reference example](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#id3).
+- 🔥 **[2025.05.13]** Added support for the [ToolBench-Static](https://modelscope.cn/datasets/AI-ModelScope/ToolBench-Static) dataset to evaluate model's tool-calling capabilities. Refer to the [documentation](https://evalscope.readthedocs.io/en/latest/third_party/toolbench.html) for usage instructions. Also added support for the [DROP](https://modelscope.cn/datasets/AI-ModelScope/DROP/dataPeview) and [Winogrande](https://modelscope.cn/datasets/AI-ModelScope/winogrande_val) benchmarks to assess the reasoning capabilities of models.
 - 🔥 **[2025.04.29]** Added Qwen3 Evaluation Best Practices, [welcome to read 📖](https://evalscope.readthedocs.io/en/latest/best_practice/qwen3.html)
 - 🔥 **[2025.04.27]** Support for text-to-image evaluation: Supports 8 metrics including MPS, HPSv2.1Score, etc., and evaluation benchmarks such as EvalMuse, GenAI-Bench. Refer to the [user documentation](https://evalscope.readthedocs.io/en/latest/user_guides/aigc/t2i.html) for more details.
 - 🔥 **[2025.04.10]** Model service stress testing tool now supports the `/v1/completions` endpoint (the default endpoint for vLLM benchmarking)
@@ -353,26 +371,27 @@ For more customized evaluations, such as customizing model parameters or dataset
 
 ```shell
 evalscope eval \
- --model Qwen/
- --model-args revision
- --generation-config do_sample
+ --model Qwen/Qwen3-0.6B \
+ --model-args '{"revision": "master", "precision": "torch.float16", "device_map": "auto"}' \
+ --generation-config '{"do_sample":true,"temperature":0.6,"max_new_tokens":512,"chat_template_kwargs":{"enable_thinking": false}}' \
  --dataset-args '{"gsm8k": {"few_shot_num": 0, "few_shot_random": false}}' \
  --datasets gsm8k \
  --limit 10
 ```
 
-### Parameter
-- `--model-args`: Model loading parameters,
-  - `revision`: Model version
-  - `precision`: Model precision
-  - `device_map`:
-- `--generation-config`: Generation parameters,
-  - `do_sample`: Whether to use sampling
-  - `
-  - `max_new_tokens`: Maximum length of
-  -
+### Parameter Description
+- `--model-args`: Model loading parameters, passed as a JSON string:
+  - `revision`: Model version
+  - `precision`: Model precision
+  - `device_map`: Device allocation for the model
+- `--generation-config`: Generation parameters, passed as a JSON string and parsed as a dictionary:
+  - `do_sample`: Whether to use sampling
+  - `temperature`: Generation temperature
+  - `max_new_tokens`: Maximum length of generated tokens
+  - `chat_template_kwargs`: Model inference template parameters
+- `--dataset-args`: Settings for the evaluation dataset, passed as a JSON string where the key is the dataset name and the value is the parameters. Note that these need to correspond one-to-one with the values in the `--datasets` parameter:
   - `few_shot_num`: Number of few-shot examples
-  - `few_shot_random`: Whether to randomly sample few-shot data
+  - `few_shot_random`: Whether to randomly sample few-shot data; if not set, defaults to `true`
 
 Reference: [Full Parameter Description](https://evalscope.readthedocs.io/en/latest/get_started/parameters.html)
 
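For reference, the same run can also be driven from Python rather than the CLI. The sketch below simply mirrors the shell command above; it assumes the `TaskConfig`/`run_task` entry points described in the EvalScope quick-start documentation, so treat the exact names as an assumption rather than something shown in this diff.

```python
# A sketch of the equivalent Python call; assumes evalscope exposes
# TaskConfig / run_task as documented in its quick-start guide.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen3-0.6B',
    model_args={'revision': 'master', 'precision': 'torch.float16', 'device_map': 'auto'},
    generation_config={
        'do_sample': True,
        'temperature': 0.6,
        'max_new_tokens': 512,
        'chat_template_kwargs': {'enable_thinking': False},
    },
    datasets=['gsm8k'],
    dataset_args={'gsm8k': {'few_shot_num': 0, 'few_shot_random': False}},
    limit=10,
)

run_task(task_cfg=task_cfg)
```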
@@ -391,6 +410,11 @@ A stress testing tool focused on large language models, which can be customized
 
 Reference: Performance Testing [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/index.html)
 
+**Output example**
+
+
+
+
 **Supports wandb for recording results**
 
 
@@ -439,7 +463,7 @@ EvalScope, as the official evaluation tool of [ModelScope](https://modelscope.cn
 </a>
 
 ## 🔜 Roadmap
-- [
+- [x] Support for better evaluation report visualization
 - [x] Support for mixed evaluations across multiple datasets
 - [x] RAG evaluation
 - [x] VLM evaluation
@@ -449,7 +473,7 @@ EvalScope, as the official evaluation tool of [ModelScope](https://modelscope.cn
 - [x] Multi-modal evaluation
 - [ ] Benchmarks
   - [ ] GAIA
-  - [
+  - [x] GPQA
   - [x] MBPP
 
 
@@ -0,0 +1,28 @@
+# Copyright (c) Alibaba, Inc. and its affiliates.
+from typing import TYPE_CHECKING
+
+from evalscope.utils.import_utils import _LazyModule
+
+if TYPE_CHECKING:
+    from .app import create_app
+    from .arguments import add_argument
+
+else:
+    _import_structure = {
+        'app': [
+            'create_app',
+        ],
+        'arguments': [
+            'add_argument',
+        ],
+    }
+
+    import sys
+
+    sys.modules[__name__] = _LazyModule(
+        __name__,
+        globals()['__file__'],
+        _import_structure,
+        module_spec=__spec__,
+        extra_objects={},
+    )
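The new `evalscope.app` package initializer above wires its public names (`create_app`, `add_argument`) through `_LazyModule`, so heavy UI dependencies are only imported when an attribute is first accessed. The internals of `_LazyModule` are not part of this diff; the snippet below is a generic, hypothetical sketch of the lazy-loading pattern, not the actual implementation.

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Generic lazy loader: resolve attributes to submodule objects on first access."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Map each exported attribute to the submodule that defines it.
        self._attr_to_module = {attr: mod for mod, attrs in import_structure.items() for attr in attrs}

    def __getattr__(self, attr):
        module_name = self._attr_to_module.get(attr)
        if module_name is None:
            raise AttributeError(f'module {self.__name__!r} has no attribute {attr!r}')
        # Import the submodule only now, then cache the resolved attribute.
        module = importlib.import_module(f'.{module_name}', self.__name__)
        value = getattr(module, attr)
        setattr(self, attr, value)
        return value
```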
@@ -11,35 +11,15 @@ from dataclasses import dataclass
 from typing import Any, List, Union
 
 from evalscope.constants import DataCollection
-from evalscope.report import Report, ReportKey,
+from evalscope.report import Report, ReportKey, get_data_frame, get_report_list
 from evalscope.utils.io_utils import OutputsStructure, yaml_to_dict
 from evalscope.utils.logger import configure_logging, get_logger
 from evalscope.version import __version__
+from .arguments import add_argument
+from .constants import DATASET_TOKEN, LATEX_DELIMITERS, MODEL_TOKEN, PLOTLY_THEME, REPORT_TOKEN
 
 logger = get_logger()
 
-PLOTLY_THEME = 'plotly_dark'
-REPORT_TOKEN = '@@'
-MODEL_TOKEN = '::'
-DATASET_TOKEN = ', '
-LATEX_DELIMITERS = [{
-    'left': '$$',
-    'right': '$$',
-    'display': True
-}, {
-    'left': '$',
-    'right': '$',
-    'display': False
-}, {
-    'left': '\\(',
-    'right': '\\)',
-    'display': False
-}, {
-    'left': '\\[',
-    'right': '\\]',
-    'display': True
-}]
-
 
 def scan_for_report_folders(root_path):
     """Scan for folders containing reports subdirectories"""
@@ -185,6 +165,13 @@ def get_single_dataset_df(df: pd.DataFrame, dataset_name: str):
     return df, styler
 
 
+def get_report_analysis(report_list: List[Report], dataset_name: str) -> str:
+    for report in report_list:
+        if report.dataset_name == dataset_name:
+            return report.analysis
+    return 'N/A'
+
+
 def plot_single_dataset_scores(df: pd.DataFrame):
     # TODO: add metric radio and relace category name
     plot = px.bar(
@@ -223,6 +210,33 @@ def plot_multi_report_radar(df: pd.DataFrame):
     return fig
 
 
+def convert_markdown_image(text):
+    if not os.path.isfile(text):
+        return text
+    # Convert the image path to a markdown image tag
+    if text.endswith('.png') or text.endswith('.jpg') or text.endswith('.jpeg'):
+        text = os.path.abspath(text)
+        image_tag = f''
+        logger.debug(f'Converting image path to markdown: {text} -> {image_tag}')
+        return image_tag
+    return text
+
+
+def convert_html_tags(text):
+    # match begin label
+    text = re.sub(r'<(\w+)>', r'[\1]', text)
+    # match end label
+    text = re.sub(r'</(\w+)>', r'[/\1]', text)
+    return text
+
+
+def process_string(string: str, max_length: int = 2048) -> str:
+    string = convert_html_tags(string)  # for display labels e.g. `<think>`
+    if max_length and len(string) > max_length:
+        return f'{string[:max_length // 2]}......{string[-max_length // 2:]}'
+    return string
+
+
 def dict_to_markdown(data) -> str:
     markdown_lines = []
 
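These helpers prepare raw model output for the Gradio report viewer: angle-bracket tags are rewritten to square brackets so they render literally, and over-long strings are clipped in the middle. A standalone, simplified check of that behaviour (the sample input is made up):

```python
import re


def convert_html_tags(text):
    # Rewrite <tag> / </tag> into [tag] / [/tag] so Markdown shows them literally.
    text = re.sub(r'<(\w+)>', r'[\1]', text)
    text = re.sub(r'</(\w+)>', r'[/\1]', text)
    return text


def clip_middle(string: str, max_length: int = 2048) -> str:
    # Keep the head and tail of an over-long string, dropping the middle.
    if max_length and len(string) > max_length:
        return f'{string[:max_length // 2]}......{string[-max_length // 2:]}'
    return string


print(convert_html_tags('<think>some chain of thought</think> final answer'))
# -> [think]some chain of thought[/think] final answer
print(clip_middle('x' * 100, max_length=10))
# -> xxxxx......xxxxx
```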
@@ -230,55 +244,41 @@ def dict_to_markdown(data) -> str:
         bold_key = f'**{key}**'
 
         if isinstance(value, list):
-            value_str = '\n' + '\n'.join([f'
+            value_str = '\n' + '\n'.join([f'- {process_model_prediction(item, max_length=None)}' for item in value])
         elif isinstance(value, dict):
             value_str = dict_to_markdown(value)
         else:
             value_str = str(value)
 
-        value_str = process_string(value_str)
-        markdown_line = f'{bold_key}
+        value_str = process_string(value_str, max_length=None)  # Convert HTML tags but don't truncate
+        markdown_line = f'{bold_key}:\n{value_str}'
         markdown_lines.append(markdown_line)
 
     return '\n\n'.join(markdown_lines)
 
 
-def
-
-
-    # match end label
-    text = re.sub(r'</(\w+)>', r'[/\1]', text)
-    return text
+def process_model_prediction(item: Any, max_length: int = 2048) -> str:
+    """
+    Process model prediction output into a formatted string.
 
+    Args:
+        item: The item to process. Can be a string, list, or dictionary.
+        max_length: The maximum length of the output string.
 
-
-
-
-    # Convert the image path to a markdown image tag
-    if text.endswith('.png') or text.endswith('.jpg') or text.endswith('.jpeg'):
-        text = os.path.abspath(text)
-        image_tag = f''
-        logger.debug(f'Converting image path to markdown: {text} -> {image_tag}')
-        return image_tag
-    return text
-
-
-def process_string(string: str, max_length: int = 2048) -> str:
-    string = convert_html_tags(string)  # for display labels e.g. `<think>`
-    if len(string) > max_length:
-        return f'{string[:max_length // 2]}......{string[-max_length // 2:]}'
-    return string
-
-
-def process_model_prediction(item: Any):
+    Returns:
+        A formatted string representation of the input.
+    """
     if isinstance(item, dict):
-
-        return process_string(res)
+        result = dict_to_markdown(item)
     elif isinstance(item, list):
-
-        return process_string(res)
+        result = '\n'.join([f'- {process_model_prediction(i, max_length=None)}' for i in item])
     else:
-
+        result = str(item)
+
+    # Apply HTML tag conversion and truncation only at the final output
+    if max_length is not None:
+        return process_string(result, max_length)
+    return result
 
 
 def normalize_score(score):
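The reworked `process_model_prediction` recurses without truncation and clips only the final assembled string, so nested lists and dictionaries are no longer cut off piecewise. A self-contained imitation of that strategy (names are illustrative, not the module's API):

```python
def clip(s: str, max_length: int = 2048) -> str:
    # Drop the middle of over-long strings, keeping head and tail.
    if max_length and len(s) > max_length:
        return f'{s[:max_length // 2]}......{s[-max_length // 2:]}'
    return s


def render(item, max_length: int = 2048) -> str:
    # Recurse without truncation, then clip once at the outermost call.
    if isinstance(item, dict):
        result = '\n'.join(f'**{k}**:\n{render(v, max_length=None)}' for k, v in item.items())
    elif isinstance(item, list):
        result = '\n'.join(f'- {render(i, max_length=None)}' for i in item)
    else:
        result = str(item)
    return clip(result, max_length) if max_length is not None else result


print(render({'prediction': ['a' * 40, {'answer': 42}]}, max_length=60))
```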
@@ -443,6 +443,10 @@ def create_single_model_tab(sidebar: SidebarComponents, lang: str):
             'zh': '数据集分数',
             'en': 'Dataset Scores'
         },
+        'report_analysis': {
+            'zh': '报告智能分析',
+            'en': 'Report Intelligent Analysis'
+        },
         'dataset_scores_table': {
             'zh': '数据集分数表',
             'en': 'Dataset Scores Table'
@@ -498,6 +502,9 @@ def create_single_model_tab(sidebar: SidebarComponents, lang: str):
         with gr.Tab(locale_dict['dataset_details'][lang]):
             dataset_radio = gr.Radio(
                 label=locale_dict['select_dataset'][lang], choices=[], show_label=True, interactive=True)
+            # show dataset details
+            with gr.Accordion(locale_dict['report_analysis'][lang], open=True):
+                report_analysis = gr.Markdown(value='N/A', show_copy_button=True)
             gr.Markdown(f'### {locale_dict["dataset_scores"][lang]}')
             dataset_plot = gr.Plot(value=None, scale=1, label=locale_dict['dataset_scores'][lang])
             gr.Markdown(f'### {locale_dict["dataset_scores_table"][lang]}')
@@ -573,15 +580,16 @@ def create_single_model_tab(sidebar: SidebarComponents, lang: str):
     @gr.on(
         triggers=[dataset_radio.change, report_list.change],
         inputs=[dataset_radio, report_list],
-        outputs=[dataset_plot, dataset_table, subset_select, data_review_df])
+        outputs=[dataset_plot, dataset_table, subset_select, data_review_df, report_analysis])
     def update_single_report_dataset(dataset_name, report_list):
         logger.debug(f'Updating single report dataset: {dataset_name}')
         report_df = get_data_frame(report_list)
+        analysis = get_report_analysis(report_list, dataset_name)
         data_score_df, styler = get_single_dataset_df(report_df, dataset_name)
         data_score_plot = plot_single_dataset_scores(data_score_df)
         subsets = data_score_df[ReportKey.subset_name].unique().tolist()
         logger.debug(f'subsets: {subsets}')
-        return data_score_plot, styler, gr.update(choices=subsets, value=None), None
+        return data_score_plot, styler, gr.update(choices=subsets, value=None), None, analysis
 
     @gr.on(
         triggers=[subset_select.change],
@@ -0,0 +1,21 @@
+PLOTLY_THEME = 'plotly_dark'
+REPORT_TOKEN = '@@'
+MODEL_TOKEN = '::'
+DATASET_TOKEN = ', '
+LATEX_DELIMITERS = [{
+    'left': '$$',
+    'right': '$$',
+    'display': True
+}, {
+    'left': '$',
+    'right': '$',
+    'display': False
+}, {
+    'left': '\\(',
+    'right': '\\)',
+    'display': False
+}, {
+    'left': '\\[',
+    'right': '\\]',
+    'display': True
+}]
@@ -9,6 +9,15 @@ class ParseStrArgsAction(argparse.Action):
     def __call__(self, parser, namespace, values, option_string=None):
         assert isinstance(values, str), 'args should be a string.'
 
+        # try json load first
+        try:
+            arg_dict = json.loads(values)
+            setattr(namespace, self.dest, arg_dict)
+            return
+        except (json.JSONDecodeError, ValueError):
+            pass
+
+        # If JSON load fails, fall back to parsing as key=value pairs
         arg_dict = {}
         for arg in values.strip().split(','):
             key, value = map(str.strip, arg.split('=', 1))  # Use maxsplit=1 to handle multiple '='
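With JSON tried first, string-valued options such as `--model-args` and `--generation-config` accept either a JSON object or the older `key=value,key=value` form. A minimal standalone sketch of the same fallback logic (illustrative only, not the evalscope class itself):

```python
import json


def parse_str_args(values: str) -> dict:
    # Try to interpret the whole string as a JSON object first.
    try:
        return json.loads(values)
    except (json.JSONDecodeError, ValueError):
        pass
    # Fall back to comma-separated key=value pairs.
    arg_dict = {}
    for arg in values.strip().split(','):
        key, value = map(str.strip, arg.split('=', 1))
        arg_dict[key] = value
    return arg_dict


print(parse_str_args('{"revision": "master", "device_map": "auto"}'))
print(parse_str_args('revision=master,device_map=auto'))
```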
@@ -58,7 +67,7 @@ def add_argument(parser: argparse.ArgumentParser):
     parser.add_argument('--eval-config', type=str, required=False, help='The eval task config file path for evaluation backend.')  # noqa: E501
     parser.add_argument('--stage', type=str, default='all', help='The stage of evaluation pipeline.',
                         choices=[EvalStage.ALL, EvalStage.INFER, EvalStage.REVIEW])
-    parser.add_argument('--limit', type=
+    parser.add_argument('--limit', type=float, default=None, help='Max evaluation samples num for each subset.')
     parser.add_argument('--eval-batch-size', type=int, default=1, help='The batch size for evaluation.')
 
     # Cache and working directory arguments
@@ -67,6 +76,7 @@ def add_argument(parser: argparse.ArgumentParser):
     parser.add_argument('--work-dir', type=str, help='The root cache dir.')
 
     # Debug and runtime mode arguments
+    parser.add_argument('--ignore-errors', action='store_true', default=False, help='Ignore errors during evaluation.')
     parser.add_argument('--debug', action='store_true', default=False, help='Debug mode, will print information for debugging.')  # noqa: E501
     parser.add_argument('--dry-run', action='store_true', default=False, help='Dry run in single processing mode.')
     parser.add_argument('--seed', type=int, default=42, help='Random seed for reproducibility.')
@@ -79,6 +89,7 @@ def add_argument(parser: argparse.ArgumentParser):
     parser.add_argument('--judge-strategy', type=str, default=JudgeStrategy.AUTO, help='The judge strategy.')
     parser.add_argument('--judge-model-args', type=json.loads, default='{}', help='The judge model args, should be a json string.')  # noqa: E501
     parser.add_argument('--judge-worker-num', type=int, default=1, help='The number of workers for the judge model.')
+    parser.add_argument('--analysis-report', action='store_true', default=False, help='Generate analysis report for the evaluation results using judge model.')  # noqa: E501
     # yapf: enable
 
 
@@ -1,4 +1,5 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.
+import os
 import subprocess
 import tempfile
 from dataclasses import asdict
@@ -204,7 +205,7 @@ class OpenCompassBackendManager(BackendManager):
            model_d['meta_template'] = get_template(model_d['meta_template'])
 
            # set the 'abbr' as the 'path' if 'abbr' is not specified
-           model_d['abbr'] = model_d['path']
+           model_d['abbr'] = os.path.basename(model_d['path'])
 
            model_config = ApiModelConfig(**model_d)
            models.append(asdict(model_config))
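Taking the basename keeps the OpenCompass result abbreviation short when the configured `path` is a full model id or filesystem path, for example:

```python
import os

# Hypothetical model paths, shown only to illustrate the abbreviation change.
print(os.path.basename('Qwen/Qwen3-0.6B'))                # Qwen3-0.6B
print(os.path.basename('/data/models/Qwen/Qwen3-0.6B'))   # Qwen3-0.6B
```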
@@ -11,7 +11,9 @@ class ModelArguments:
     pooling_mode: Optional[str] = None
     max_seq_length: int = 512  # max sequence length
     # prompt for llm based model
-    prompt: str =
+    prompt: Optional[str] = None
+    # prompts dictionary for different tasks, if prompt is not set
+    prompts: Optional[Dict[str, str]] = None
     # model kwargs
     model_kwargs: dict = field(default_factory=dict)
     # config kwargs
@@ -33,6 +35,7 @@ class ModelArguments:
            'pooling_mode': self.pooling_mode,
            'max_seq_length': self.max_seq_length,
            'prompt': self.prompt,
+           'prompts': self.prompts,
            'model_kwargs': self.model_kwargs,
            'config_kwargs': self.config_kwargs,
            'encode_kwargs': self.encode_kwargs,