evalscope 0.12.0__tar.gz → 0.13.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {evalscope-0.12.0/evalscope.egg-info → evalscope-0.13.0}/PKG-INFO +31 -12
- {evalscope-0.12.0 → evalscope-0.13.0}/README.md +30 -11
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/arguments.py +6 -1
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/aime/aime24_adapter.py +3 -3
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/aime/aime25_adapter.py +3 -3
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/arc/arc_adapter.py +15 -18
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/bbh_adapter.py +6 -6
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/benchmark.py +12 -11
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/ceval/ceval_adapter.py +12 -16
- evalscope-0.13.0/evalscope/benchmarks/chinese_simple_qa/csimple_qa_adapter.py +168 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/cmmlu/cmmlu_adapter.py +13 -17
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/competition_math/competition_math_adapter.py +3 -3
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/data_adapter.py +59 -21
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/data_collection/data_collection_adapter.py +0 -1
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/general_mcq/general_mcq_adapter.py +9 -12
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/general_qa/general_qa_adapter.py +30 -15
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/gpqa/gpqa_adapter.py +12 -7
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/gsm8k/gsm8k_adapter.py +2 -3
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/hellaswag/hellaswag_adapter.py +23 -31
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/humaneval/humaneval_adapter.py +10 -7
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/ifeval/ifeval_adapter.py +2 -3
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/iquiz/iquiz_adapter.py +9 -5
- evalscope-0.13.0/evalscope/benchmarks/live_code_bench/evaluate_utils.py +193 -0
- evalscope-0.13.0/evalscope/benchmarks/live_code_bench/execute_utils.py +267 -0
- evalscope-0.13.0/evalscope/benchmarks/live_code_bench/extract_utils.py +70 -0
- evalscope-0.13.0/evalscope/benchmarks/live_code_bench/live_code_bench_adapter.py +90 -0
- evalscope-0.13.0/evalscope/benchmarks/live_code_bench/load_utils.py +71 -0
- evalscope-0.13.0/evalscope/benchmarks/live_code_bench/pass_k_utils.py +56 -0
- evalscope-0.13.0/evalscope/benchmarks/live_code_bench/prompts.py +207 -0
- evalscope-0.13.0/evalscope/benchmarks/live_code_bench/testing_util.py +721 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/math_500/math_500_adapter.py +2 -6
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/mmlu/mmlu_adapter.py +13 -17
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/mmlu_pro/mmlu_pro_adapter.py +9 -5
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/musr/musr_adapter.py +8 -5
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/process_bench/process_bench_adapter.py +8 -5
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/race/race_adapter.py +12 -16
- evalscope-0.13.0/evalscope/benchmarks/simple_qa/simple_qa_adapter.py +167 -0
- evalscope-0.13.0/evalscope/benchmarks/super_gpqa/five_shot_prompt.txt +89 -0
- evalscope-0.13.0/evalscope/benchmarks/super_gpqa/super_gpqa_adapter.py +191 -0
- evalscope-0.13.0/evalscope/benchmarks/super_gpqa/utils.py +85 -0
- evalscope-0.13.0/evalscope/benchmarks/super_gpqa/zero_shot_prompt.txt +3 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/trivia_qa/trivia_qa_adapter.py +3 -4
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/truthful_qa/truthful_qa_adapter.py +6 -13
- evalscope-0.13.0/evalscope/benchmarks/utils.py +43 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/collections/evaluator.py +14 -5
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/config.py +15 -2
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/constants.py +14 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/evaluator/evaluator.py +51 -13
- evalscope-0.13.0/evalscope/metrics/llm_judge.py +104 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/metrics/named_metrics.py +1 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/models/__init__.py +2 -1
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/models/base_adapter.py +25 -5
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/models/chat_adapter.py +3 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/models/choice_adapter.py +4 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/models/custom_adapter.py +2 -0
- evalscope-0.13.0/evalscope/models/register.py +28 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/models/server_adapter.py +35 -8
- evalscope-0.13.0/evalscope/perf/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/arguments.py +13 -7
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/benchmark.py +5 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/http_client.py +15 -5
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/main.py +1 -0
- evalscope-0.13.0/evalscope/perf/utils/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/utils/analysis_result.py +1 -1
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/report/app.py +3 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/report/combinator.py +2 -2
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/run.py +6 -5
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/longbench_write/infer.py +1 -1
- evalscope-0.13.0/evalscope/third_party/thinkbench/eval.py +429 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/thinkbench/infer.py +37 -7
- evalscope-0.13.0/evalscope/third_party/thinkbench/tools/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/thinkbench/tools/llm.py +1 -0
- evalscope-0.13.0/evalscope/third_party/toolbench_static/llm/swift_infer.py +67 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/utils/chat_service.py +1 -0
- evalscope-0.13.0/evalscope/utils/filters.py +59 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/utils/logger.py +3 -3
- evalscope-0.13.0/evalscope/version.py +4 -0
- {evalscope-0.12.0 → evalscope-0.13.0/evalscope.egg-info}/PKG-INFO +31 -12
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope.egg-info/SOURCES.txt +23 -0
- evalscope-0.13.0/tests/cli/test_all.py +144 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/tests/cli/test_collection.py +28 -2
- {evalscope-0.12.0 → evalscope-0.13.0}/tests/cli/test_run.py +201 -32
- evalscope-0.13.0/tests/rag/__init__.py +0 -0
- evalscope-0.12.0/evalscope/third_party/thinkbench/eval.py +0 -264
- evalscope-0.12.0/evalscope/third_party/toolbench_static/llm/swift_infer.py +0 -37
- evalscope-0.12.0/evalscope/version.py +0 -4
- {evalscope-0.12.0 → evalscope-0.13.0}/LICENSE +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/MANIFEST.in +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/base.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/opencompass/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/opencompass/api_meta_template.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/opencompass/backend_manager.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/opencompass/tasks/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/opencompass/tasks/eval_api.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/opencompass/tasks/eval_datasets.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/backend_manager.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/clip_benchmark/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/clip_benchmark/arguments.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/clip_benchmark/dataset_builder.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/clip_benchmark/task_template.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/clip_benchmark/tasks/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/clip_benchmark/tasks/image_caption.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_classification.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_retrieval.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/clip_benchmark/utils/webdataset_convert.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/clip_benchmark/utils/webdatasets.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/cmteb/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/cmteb/arguments.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/cmteb/base.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/cmteb/task_template.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/cmteb/tasks/Classification.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/cmteb/tasks/Clustering.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/cmteb/tasks/CustomTask.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/cmteb/tasks/PairClassification.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/cmteb/tasks/Reranking.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/cmteb/tasks/Retrieval.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/cmteb/tasks/STS.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/cmteb/tasks/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/ragas/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/ragas/arguments.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/ragas/prompts/persona_prompt.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/ragas/task_template.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/ragas/tasks/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/ragas/tasks/build_distribution.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/ragas/tasks/build_transform.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/ragas/tasks/testset_generation.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/ragas/tasks/translate_prompt.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/utils/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/utils/clip.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/utils/embedding.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/utils/llm.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/rag_eval/utils/tools.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/vlm_eval_kit/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/vlm_eval_kit/backend_manager.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/backend/vlm_eval_kit/custom_dataset.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/aime/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/arc/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/arc/ai2_arc.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/boolean_expressions.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/causal_judgement.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/date_understanding.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/disambiguation_qa.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/dyck_languages.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/formal_fallacies.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/geometric_shapes.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/hyperbaton.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_five_objects.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_seven_objects.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_three_objects.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/movie_recommendation.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/multistep_arithmetic_two.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/navigate.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/object_counting.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/penguins_in_a_table.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/reasoning_about_colored_objects.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/ruin_names.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/salient_translation_error_detection.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/snarks.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/sports_understanding.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/temporal_sequences.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_five_objects.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_seven_objects.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_three_objects.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/web_of_lies.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/bbh/cot_prompts/word_sorting.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/ceval/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/ceval/ceval_exam.py +0 -0
- {evalscope-0.12.0/evalscope/benchmarks/data_collection → evalscope-0.13.0/evalscope/benchmarks/chinese_simple_qa}/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/cmmlu/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/cmmlu/cmmlu.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/cmmlu/samples.jsonl +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/competition_math/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/competition_math/competition_math.py +0 -0
- {evalscope-0.12.0/evalscope/benchmarks/general_mcq → evalscope-0.13.0/evalscope/benchmarks/data_collection}/__init__.py +0 -0
- {evalscope-0.12.0/evalscope/benchmarks/gpqa → evalscope-0.13.0/evalscope/benchmarks/general_mcq}/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/general_qa/__init__.py +0 -0
- {evalscope-0.12.0/evalscope/benchmarks/ifeval → evalscope-0.13.0/evalscope/benchmarks/gpqa}/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/gpqa/chain_of_thought.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/gsm8k/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/gsm8k/gsm8k.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/hellaswag/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/hellaswag/hellaswag.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/humaneval/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/humaneval/humaneval.py +0 -0
- {evalscope-0.12.0/evalscope/benchmarks/iquiz → evalscope-0.13.0/evalscope/benchmarks/ifeval}/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/ifeval/instructions.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/ifeval/instructions_registry.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/ifeval/instructions_util.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/ifeval/utils.py +0 -0
- {evalscope-0.12.0/evalscope/benchmarks/math_500 → evalscope-0.13.0/evalscope/benchmarks/iquiz}/__init__.py +0 -0
- {evalscope-0.12.0/evalscope/benchmarks/mmlu_pro → evalscope-0.13.0/evalscope/benchmarks/live_code_bench}/__init__.py +0 -0
- {evalscope-0.12.0/evalscope/benchmarks/musr → evalscope-0.13.0/evalscope/benchmarks/math_500}/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/mmlu/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/mmlu/mmlu.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/mmlu/samples.jsonl +0 -0
- {evalscope-0.12.0/evalscope/benchmarks/process_bench → evalscope-0.13.0/evalscope/benchmarks/mmlu_pro}/__init__.py +0 -0
- {evalscope-0.12.0/evalscope/perf → evalscope-0.13.0/evalscope/benchmarks/musr}/__init__.py +0 -0
- {evalscope-0.12.0/evalscope/perf/utils → evalscope-0.13.0/evalscope/benchmarks/process_bench}/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/process_bench/critique_template.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/race/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/race/race.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/race/samples.jsonl +0 -0
- {evalscope-0.12.0/evalscope/third_party/thinkbench/tools → evalscope-0.13.0/evalscope/benchmarks/simple_qa}/__init__.py +0 -0
- {evalscope-0.12.0/tests/rag → evalscope-0.13.0/evalscope/benchmarks/super_gpqa}/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/trivia_qa/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/trivia_qa/samples.jsonl +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/trivia_qa/trivia_qa.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/truthful_qa/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/benchmarks/truthful_qa/truthful_qa.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/cli/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/cli/base.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/cli/cli.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/cli/start_app.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/cli/start_eval.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/cli/start_perf.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/cli/start_server.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/collections/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/collections/sampler.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/collections/schema.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/evaluator/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/evaluator/rating_eval.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/evaluator/reviewer/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/evaluator/reviewer/auto_reviewer.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/metrics/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/metrics/bundled_rouge_score/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/metrics/bundled_rouge_score/rouge_scorer.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/metrics/code_metric.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/metrics/math_parser.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/metrics/metrics.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/metrics/resources/gpt2-zhcn3-v4.bpe +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/metrics/resources/gpt2-zhcn3-v4.json +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/metrics/rouge_metric.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/models/custom/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/models/custom/custom_model.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/models/custom/dummy_model.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/models/local_model.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/models/model.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/plugin/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/plugin/api/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/plugin/api/base.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/plugin/api/custom_api.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/plugin/api/dashscope_api.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/plugin/api/openai_api.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/plugin/datasets/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/plugin/datasets/base.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/plugin/datasets/custom.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/plugin/datasets/flickr8k.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/plugin/datasets/line_by_line.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/plugin/datasets/longalpaca.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/plugin/datasets/openqa.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/plugin/datasets/speed_benchmark.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/plugin/registry.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/utils/benchmark_util.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/utils/db_util.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/utils/handler.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/perf/utils/local_server.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/registry/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/registry/config/cfg_arena.yaml +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/registry/config/cfg_arena_zhihu.yaml +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/registry/config/cfg_pairwise_baseline.yaml +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/registry/config/cfg_single.yaml +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/registry/data/prompt_template/lmsys_v2.jsonl +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/registry/data/prompt_template/prompt_templates.jsonl +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/registry/data/qa_browser/battle.jsonl +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/registry/data/qa_browser/category_mapping.yaml +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/registry/data/question.jsonl +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/registry/tasks/arc.yaml +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/registry/tasks/bbh.yaml +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/registry/tasks/bbh_mini.yaml +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/registry/tasks/ceval.yaml +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/registry/tasks/ceval_mini.yaml +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/registry/tasks/cmmlu.yaml +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/registry/tasks/eval_qwen-7b-chat_v100.yaml +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/registry/tasks/general_qa.yaml +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/registry/tasks/gsm8k.yaml +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/registry/tasks/mmlu.yaml +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/registry/tasks/mmlu_mini.yaml +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/report/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/report/generator.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/report/utils.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/run_arena.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/summarizer.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/longbench_write/README.md +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/longbench_write/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/longbench_write/default_task.json +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/longbench_write/default_task.yaml +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/longbench_write/eval.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/longbench_write/longbench_write.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/longbench_write/resources/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/longbench_write/resources/judge.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/longbench_write/resources/longbench_write.jsonl +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/longbench_write/resources/longbench_write_en.jsonl +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/longbench_write/resources/longwrite_ruler.jsonl +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/longbench_write/tools/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/longbench_write/tools/data_etl.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/longbench_write/tools/openai_api.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/longbench_write/utils.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/thinkbench/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/thinkbench/resources/critique_template.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/thinkbench/resources/reformat_template.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/thinkbench/tools/utils.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/toolbench_static/README.md +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/toolbench_static/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/toolbench_static/config_default.json +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/toolbench_static/config_default.yaml +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/toolbench_static/eval.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/toolbench_static/infer.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/toolbench_static/llm/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/toolbench_static/requirements.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/third_party/toolbench_static/toolbench_static.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/utils/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/utils/arena_utils.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/utils/completion_parsers.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/utils/io_utils.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/utils/model_utils.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope/utils/utils.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope.egg-info/dependency_links.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope.egg-info/entry_points.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope.egg-info/not-zip-safe +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope.egg-info/requires.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/evalscope.egg-info/top_level.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/requirements/app.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/requirements/docs.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/requirements/framework.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/requirements/inner.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/requirements/opencompass.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/requirements/perf.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/requirements/rag.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/requirements/tests.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/requirements/vlmeval.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/requirements.txt +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/setup.cfg +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/setup.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/tests/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/tests/cli/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/tests/perf/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/tests/perf/test_perf.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/tests/rag/test_clip_benchmark.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/tests/rag/test_mteb.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/tests/rag/test_ragas.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/tests/swift/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/tests/swift/test_run_swift_eval.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/tests/swift/test_run_swift_vlm_eval.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/tests/swift/test_run_swift_vlm_jugde_eval.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/tests/test_run_all.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/tests/vlm/__init__.py +0 -0
- {evalscope-0.12.0 → evalscope-0.13.0}/tests/vlm/test_vlmeval.py +0 -0
--- evalscope-0.12.0/evalscope.egg-info/PKG-INFO
+++ evalscope-0.13.0/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: evalscope
-Version: 0.12.0
+Version: 0.13.0
 Summary: EvalScope: Lightweight LLMs Evaluation Framework
 Home-page: https://github.com/modelscope/evalscope
 Author: ModelScope team
@@ -175,16 +175,29 @@ Requires-Dist: plotly<6.0.0,>=5.23.0; extra == "all"
 > ⭐ If you like this project, please click the "Star" button at the top right to support us. Your support is our motivation to keep going!
 
 ## 📋 Contents
-- [
-- [
-- [
-- [
+- [📋 Contents](#-contents)
+- [📝 Introduction](#-introduction)
+- [☎ User Groups](#-user-groups)
+- [🎉 News](#-news)
+- [🛠️ Installation](#️-installation)
+- [Method 1: Install Using pip](#method-1-install-using-pip)
+- [Method 2: Install from Source](#method-2-install-from-source)
+- [🚀 Quick Start](#-quick-start)
+- [Method 1. Using Command Line](#method-1-using-command-line)
+- [Method 2. Using Python Code](#method-2-using-python-code)
+- [Basic Parameter](#basic-parameter)
+- [Output Results](#output-results)
+- [📈 Visualization of Evaluation Results](#-visualization-of-evaluation-results)
+- [🌐 Evaluation of Specified Model API](#-evaluation-of-specified-model-api)
+- [⚙️ Custom Parameter Evaluation](#️-custom-parameter-evaluation)
+- [Parameter](#parameter)
 - [Evaluation Backend](#evaluation-backend)
-- [
-- [
-- [Arena Mode](
-- [Contribution](#️-contribution)
-- [Roadmap](#-roadmap)
+- [📈 Model Serving Performance Evaluation](#-model-serving-performance-evaluation)
+- [🖊️ Custom Dataset Evaluation](#️-custom-dataset-evaluation)
+- [🏟️ Arena Mode](#️-arena-mode)
+- [👷♂️ Contribution](#️-contribution)
+- [🔜 Roadmap](#-roadmap)
+- [Star History](#star-history)
 
 
 ## 📝 Introduction
@@ -225,10 +238,16 @@ Please scan the QR code below to join our community groups:
 
 
 ## 🎉 News
-
+
+- 🔥 **[2025.03.13]** Added support for the [LiveCodeBench](https://www.modelscope.cn/datasets/AI-ModelScope/code_generation_lite/summary) code evaluation benchmark. You can use it by specifying `live_code_bench`.
+- 🔥 **[2025.03.11]** Added support for the [SimpleQA](https://modelscope.cn/datasets/AI-ModelScope/SimpleQA/summary) and [Chinese SimpleQA](https://modelscope.cn/datasets/AI-ModelScope/Chinese-SimpleQA/summary) evaluation benchmarks. These are used to assess the factual accuracy of models, and you can specify `simple_qa` and `chinese_simpleqa` for use. Support for specifying a judge model is also available. For more details, refer to the [relevant parameter documentation](https://evalscope.readthedocs.io/en/latest/get_started/parameters.html).
+- 🔥 **[2025.03.07]** Added support for the [QwQ-32B](https://modelscope.cn/models/Qwen/QwQ-32B/summary) model, evaluate the model's reasoning ability and reasoning efficiency, refer to [📖 Best Practices for QwQ-32B Evaluation](https://evalscope.readthedocs.io/en/latest/best_practice/eval_qwq.html) for more details.
+- 🔥 **[2025.03.04]** Added support for the [SuperGPQA](https://modelscope.cn/datasets/m-a-p/SuperGPQA/summary) dataset, which covers 13 categories, 72 first-level disciplines, and 285 second-level disciplines, totaling 26,529 questions. You can use it by specifying `super_gpqa`.
+- 🔥 **[2025.03.03]** Added support for evaluating the IQ and EQ of models. Refer to [📖 Best Practices for IQ and EQ Evaluation](https://evalscope.readthedocs.io/en/latest/best_practice/iquiz.html) to find out how smart your AI is!
+- 🔥 **[2025.02.27]** Added support for evaluating the reasoning efficiency of models. Refer to [📖 Best Practices for Evaluating Thinking Efficiency](https://evalscope.readthedocs.io/en/latest/best_practice/think_eval.html). This implementation is inspired by the works [Overthinking](https://doi.org/10.48550/arXiv.2412.21187) and [Underthinking](https://doi.org/10.48550/arXiv.2501.18585).
 - 🔥 **[2025.02.25]** Added support for two model inference-related evaluation benchmarks: [MuSR](https://modelscope.cn/datasets/AI-ModelScope/MuSR) and [ProcessBench](https://www.modelscope.cn/datasets/Qwen/ProcessBench/summary). To use them, simply specify `musr` and `process_bench` respectively in the datasets parameter.
 - 🔥 **[2025.02.18]** Supports the AIME25 dataset, which contains 15 questions (Grok3 scored 93 on this dataset).
-- 🔥 **[2025.02.13]** Added support for evaluating DeepSeek distilled models, including AIME24, MATH-500, and GPQA-Diamond datasets,refer to [best practice](https://evalscope.readthedocs.io/
+- 🔥 **[2025.02.13]** Added support for evaluating DeepSeek distilled models, including AIME24, MATH-500, and GPQA-Diamond datasets,refer to [best practice](https://evalscope.readthedocs.io/en/latest/best_practice/deepseek_r1_distill.html); Added support for specifying the `eval_batch_size` parameter to accelerate model evaluation.
 - 🔥 **[2025.01.20]** Support for visualizing evaluation results, including single model evaluation results and multi-model comparison, refer to the [📖 Visualizing Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html) for more details; Added [`iquiz`](https://modelscope.cn/datasets/AI-ModelScope/IQuiz/summary) evaluation example, evaluating the IQ and EQ of the model.
 - 🔥 **[2025.01.07]** Native backend: Support for model API evaluation is now available. Refer to the [📖 Model API Evaluation Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#api) for more details. Additionally, support for the `ifeval` evaluation benchmark has been added.
 - 🔥🔥 **[2024.12.31]** Support for adding benchmark evaluations, refer to the [📖 Benchmark Evaluation Addition Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/add_benchmark.html); support for custom mixed dataset evaluations, allowing for more comprehensive model evaluations with less data, refer to the [📖 Mixed Dataset Evaluation Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/collection/index.html).
--- evalscope-0.12.0/README.md
+++ evalscope-0.13.0/README.md
@@ -24,16 +24,29 @@
 > ⭐ If you like this project, please click the "Star" button at the top right to support us. Your support is our motivation to keep going!
 
 ## 📋 Contents
-- [
-- [
-- [
-- [
+- [📋 Contents](#-contents)
+- [📝 Introduction](#-introduction)
+- [☎ User Groups](#-user-groups)
+- [🎉 News](#-news)
+- [🛠️ Installation](#️-installation)
+- [Method 1: Install Using pip](#method-1-install-using-pip)
+- [Method 2: Install from Source](#method-2-install-from-source)
+- [🚀 Quick Start](#-quick-start)
+- [Method 1. Using Command Line](#method-1-using-command-line)
+- [Method 2. Using Python Code](#method-2-using-python-code)
+- [Basic Parameter](#basic-parameter)
+- [Output Results](#output-results)
+- [📈 Visualization of Evaluation Results](#-visualization-of-evaluation-results)
+- [🌐 Evaluation of Specified Model API](#-evaluation-of-specified-model-api)
+- [⚙️ Custom Parameter Evaluation](#️-custom-parameter-evaluation)
+- [Parameter](#parameter)
 - [Evaluation Backend](#evaluation-backend)
-- [
-- [
-- [Arena Mode](
-- [Contribution](#️-contribution)
-- [Roadmap](#-roadmap)
+- [📈 Model Serving Performance Evaluation](#-model-serving-performance-evaluation)
+- [🖊️ Custom Dataset Evaluation](#️-custom-dataset-evaluation)
+- [🏟️ Arena Mode](#️-arena-mode)
+- [👷♂️ Contribution](#️-contribution)
+- [🔜 Roadmap](#-roadmap)
+- [Star History](#star-history)
 
 
 ## 📝 Introduction
@@ -74,10 +87,16 @@ Please scan the QR code below to join our community groups:
 
 
 ## 🎉 News
-
+
+- 🔥 **[2025.03.13]** Added support for the [LiveCodeBench](https://www.modelscope.cn/datasets/AI-ModelScope/code_generation_lite/summary) code evaluation benchmark. You can use it by specifying `live_code_bench`.
+- 🔥 **[2025.03.11]** Added support for the [SimpleQA](https://modelscope.cn/datasets/AI-ModelScope/SimpleQA/summary) and [Chinese SimpleQA](https://modelscope.cn/datasets/AI-ModelScope/Chinese-SimpleQA/summary) evaluation benchmarks. These are used to assess the factual accuracy of models, and you can specify `simple_qa` and `chinese_simpleqa` for use. Support for specifying a judge model is also available. For more details, refer to the [relevant parameter documentation](https://evalscope.readthedocs.io/en/latest/get_started/parameters.html).
+- 🔥 **[2025.03.07]** Added support for the [QwQ-32B](https://modelscope.cn/models/Qwen/QwQ-32B/summary) model, evaluate the model's reasoning ability and reasoning efficiency, refer to [📖 Best Practices for QwQ-32B Evaluation](https://evalscope.readthedocs.io/en/latest/best_practice/eval_qwq.html) for more details.
+- 🔥 **[2025.03.04]** Added support for the [SuperGPQA](https://modelscope.cn/datasets/m-a-p/SuperGPQA/summary) dataset, which covers 13 categories, 72 first-level disciplines, and 285 second-level disciplines, totaling 26,529 questions. You can use it by specifying `super_gpqa`.
+- 🔥 **[2025.03.03]** Added support for evaluating the IQ and EQ of models. Refer to [📖 Best Practices for IQ and EQ Evaluation](https://evalscope.readthedocs.io/en/latest/best_practice/iquiz.html) to find out how smart your AI is!
+- 🔥 **[2025.02.27]** Added support for evaluating the reasoning efficiency of models. Refer to [📖 Best Practices for Evaluating Thinking Efficiency](https://evalscope.readthedocs.io/en/latest/best_practice/think_eval.html). This implementation is inspired by the works [Overthinking](https://doi.org/10.48550/arXiv.2412.21187) and [Underthinking](https://doi.org/10.48550/arXiv.2501.18585).
 - 🔥 **[2025.02.25]** Added support for two model inference-related evaluation benchmarks: [MuSR](https://modelscope.cn/datasets/AI-ModelScope/MuSR) and [ProcessBench](https://www.modelscope.cn/datasets/Qwen/ProcessBench/summary). To use them, simply specify `musr` and `process_bench` respectively in the datasets parameter.
 - 🔥 **[2025.02.18]** Supports the AIME25 dataset, which contains 15 questions (Grok3 scored 93 on this dataset).
-- 🔥 **[2025.02.13]** Added support for evaluating DeepSeek distilled models, including AIME24, MATH-500, and GPQA-Diamond datasets,refer to [best practice](https://evalscope.readthedocs.io/
+- 🔥 **[2025.02.13]** Added support for evaluating DeepSeek distilled models, including AIME24, MATH-500, and GPQA-Diamond datasets,refer to [best practice](https://evalscope.readthedocs.io/en/latest/best_practice/deepseek_r1_distill.html); Added support for specifying the `eval_batch_size` parameter to accelerate model evaluation.
 - 🔥 **[2025.01.20]** Support for visualizing evaluation results, including single model evaluation results and multi-model comparison, refer to the [📖 Visualizing Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html) for more details; Added [`iquiz`](https://modelscope.cn/datasets/AI-ModelScope/IQuiz/summary) evaluation example, evaluating the IQ and EQ of the model.
 - 🔥 **[2025.01.07]** Native backend: Support for model API evaluation is now available. Refer to the [📖 Model API Evaluation Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#api) for more details. Additionally, support for the `ifeval` evaluation benchmark has been added.
 - 🔥🔥 **[2024.12.31]** Support for adding benchmark evaluations, refer to the [📖 Benchmark Evaluation Addition Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/add_benchmark.html); support for custom mixed dataset evaluations, allowing for more comprehensive model evaluations with less data, refer to the [📖 Mixed Dataset Evaluation Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/collection/index.html).
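The news entries above only name the new benchmarks (`live_code_bench`, `super_gpqa`, `simple_qa`, `chinese_simpleqa`). A minimal sketch of selecting them from Python, assuming the existing `run_task` entry point in `evalscope/run.py` and a dict-style task config; the model id, `limit`, and batch size below are placeholders, not values taken from this release:

```python
# Sketch only: assumes evalscope's run_task(task_cfg=...) entry point and the
# dict-style config documented for earlier releases.
from evalscope.run import run_task

task_cfg = {
    'model': 'Qwen/Qwen2.5-7B-Instruct',            # placeholder model id
    'datasets': ['live_code_bench', 'super_gpqa'],  # benchmark names added in 0.13.0
    'eval_batch_size': 8,                           # parameter mentioned in the 2025.02.13 note
    'limit': 5,                                     # small smoke-test subset
}
run_task(task_cfg=task_cfg)
```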
--- evalscope-0.12.0/evalscope/arguments.py
+++ evalscope-0.13.0/evalscope/arguments.py
@@ -1,7 +1,7 @@
 import argparse
 import json
 
-from evalscope.constants import EvalBackend, EvalStage, EvalType
+from evalscope.constants import EvalBackend, EvalStage, EvalType, JudgeStrategy, OutputType
 
 
 class ParseStrArgsAction(argparse.Action):
@@ -73,6 +73,11 @@ def add_argument(parser: argparse.ArgumentParser):
     parser.add_argument('--api-url', type=str, default=None, help='The API url for the remote API model.')
     parser.add_argument('--timeout', type=float, default=None, help='The timeout for the remote API model.')
     parser.add_argument('--stream', action='store_true', default=False, help='Stream mode.')  # noqa: E501
+
+    # LLMJudge arguments
+    parser.add_argument('--judge-strategy', type=str, default=JudgeStrategy.AUTO, help='The judge strategy.')
+    parser.add_argument('--judge-model-args', type=json.loads, default='{}', help='The judge model args, should be a json string.')  # noqa: E501
+    parser.add_argument('--judge-worker-num', type=int, default=8, help='The number of workers for the judge model.')
     # yapf: enable
 
 
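For reference, the three new judge flags behave as plain argparse options. The standalone sketch below mirrors the `add_argument` calls from the hunk above, with a plain string standing in for `JudgeStrategy.AUTO` (the enum lives in `evalscope.constants` and its literal value is not shown in this diff); the model id and URL are placeholders:

```python
import argparse
import json

# Mirrors the LLMJudge arguments added in evalscope/arguments.py.
# 'auto' stands in for JudgeStrategy.AUTO, whose literal value is not part of this diff.
parser = argparse.ArgumentParser()
parser.add_argument('--judge-strategy', type=str, default='auto', help='The judge strategy.')
parser.add_argument('--judge-model-args', type=json.loads, default='{}', help='The judge model args, should be a json string.')
parser.add_argument('--judge-worker-num', type=int, default=8, help='The number of workers for the judge model.')

# type=json.loads parses the JSON string into a dict; the values here are placeholders.
args = parser.parse_args(['--judge-model-args', '{"model_id": "qwen-max", "api_url": "http://127.0.0.1:8000/v1"}'])
print(args.judge_strategy, args.judge_worker_num, args.judge_model_args['model_id'])
```

Note that the `'{}'` default is stored as a raw string unless the flag is actually passed, since argparse applies `type` only to command-line values.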
--- evalscope-0.12.0/evalscope/benchmarks/aime/aime24_adapter.py
+++ evalscope-0.13.0/evalscope/benchmarks/aime/aime24_adapter.py
@@ -1,6 +1,6 @@
 from evalscope.benchmarks import Benchmark, DataAdapter
+from evalscope.constants import OutputType
 from evalscope.metrics.math_parser import extract_answer, math_equal, strip_answer_string
-from evalscope.models import ChatGenerationModelAdapter
 from evalscope.utils.logger import get_logger
 
 # flake8: noqa
@@ -10,8 +10,8 @@ logger = get_logger()
 
 @Benchmark.register(
     name='aime24',
+    pretty_name='AIME-2024',
     dataset_id='HuggingFaceH4/aime_2024',
-    model_adapter=ChatGenerationModelAdapter,
     subset_list=['default'],
     metric_list=['AveragePass@1'],
     few_shot_num=0,
@@ -31,7 +31,7 @@ class AIME24Adapter(DataAdapter):
         problem = input_d['problem']
         full_prompt = self.prompt_template.format(query=problem)
 
-        return
+        return self.gen_prompt_data(full_prompt)
 
     def get_gold_answer(self, input_d: dict) -> str:
         # Extract the gold answer from the input dict.
--- evalscope-0.12.0/evalscope/benchmarks/aime/aime25_adapter.py
+++ evalscope-0.13.0/evalscope/benchmarks/aime/aime25_adapter.py
@@ -1,6 +1,6 @@
 from evalscope.benchmarks import Benchmark, DataAdapter
+from evalscope.constants import OutputType
 from evalscope.metrics.math_parser import extract_answer, math_equal, strip_answer_string
-from evalscope.models import ChatGenerationModelAdapter
 from evalscope.utils.logger import get_logger
 
 # flake8: noqa
@@ -10,8 +10,8 @@ logger = get_logger()
 
 @Benchmark.register(
     name='aime25',
+    pretty_name='AIME-2025',
     dataset_id='TIGER-Lab/AIME25',
-    model_adapter=ChatGenerationModelAdapter,
     subset_list=['default'],
     metric_list=['AveragePass@1'],
     few_shot_num=0,
@@ -31,7 +31,7 @@ class AIME25Adapter(DataAdapter):
         problem = input_d['question']
         full_prompt = self.prompt_template.format(query=problem)
 
-        return
+        return self.gen_prompt_data(full_prompt)
 
     def get_gold_answer(self, input_d: dict) -> str:
         # Extract the gold answer from the input dict.
--- evalscope-0.12.0/evalscope/benchmarks/arc/arc_adapter.py
+++ evalscope-0.13.0/evalscope/benchmarks/arc/arc_adapter.py
@@ -4,9 +4,8 @@ import json
 import os
 
 from evalscope.benchmarks import Benchmark, DataAdapter
-from evalscope.constants import EvalType
+from evalscope.constants import EvalType, OutputType
 from evalscope.metrics import exact_match
-from evalscope.models import MultiChoiceModelAdapter
 from evalscope.utils import ResponseParser
 from evalscope.utils.logger import get_logger
 
@@ -17,19 +16,20 @@ logger = get_logger()
 
 @Benchmark.register(
     name='arc',
+    pretty_name='ARC',
     dataset_id='modelscope/ai2_arc',
-    model_adapter=
+    model_adapter=OutputType.MULTIPLE_CHOICE,
+    output_types=[OutputType.MULTIPLE_CHOICE, OutputType.GENERATION],
     subset_list=['ARC-Easy', 'ARC-Challenge'],
     metric_list=['AverageAccuracy'],
     few_shot_num=0,
     train_split='train',
     eval_split='test',
-    prompt_template=
+    prompt_template=
+    'Given the following question and four candidate answers (A, B, C and D), choose the best answer.\n{query}\nYour response should end with "The best answer is [the_answer_letter]" where the [the_answer_letter] is one of A, B, C or D.',  # noqa
 )
 class ARCAdapter(DataAdapter):
 
-    choices = ['A', 'B', 'C', 'D']
-
     def __init__(self, **kwargs):
         few_shot_num = kwargs.get('few_shot_num', None)
         if few_shot_num is None:
@@ -42,6 +42,8 @@ class ARCAdapter(DataAdapter):
 
         super().__init__(**kwargs)
 
+        self.choices = ['A', 'B', 'C', 'D']
+
     def load_from_disk(self, dataset_name_or_path, subset_list, work_dir, **kwargs) -> dict:
         """
         Load the dataset from local disk.
@@ -60,7 +62,7 @@ class ARCAdapter(DataAdapter):
             for split_name in ['Train', 'Test']:
                 split_path = os.path.join(subset_path, f'{subset_name}-{split_name}.jsonl')
                 if os.path.exists(split_path):
-                    with open(split_path, 'r', errors='ignore') as in_f:
+                    with open(split_path, 'r', errors='ignore', encoding='utf-8') as in_f:
                         rows = []
                         for line in in_f:
                             item = json.loads(line.strip())
@@ -107,12 +109,11 @@ class ARCAdapter(DataAdapter):
             {'data': ['xxx'], 'multi_choices': ['A', 'B', 'C', 'D']}
         """
         few_shot_prompts = [self._generate_prompt(input_d=sample, include_answer=True) for sample in few_shot_list]
-        context
+        context = '\n'.join(few_shot_prompts) + self._generate_prompt(input_d=input_d, include_answer=False)
 
-
-        full_prompt: str = context + self._generate_prompt(input_d=input_d, include_answer=False)
+        full_prompt = self.prompt_template.format(query=context)
 
-        return
+        return self.gen_prompt_data(full_prompt)
 
     def get_gold_answer(self, input_d: dict) -> str:
         # Get the gold choice
@@ -130,14 +131,10 @@ class ARCAdapter(DataAdapter):
         Returns:
             The parsed answer. Depending on the dataset. Usually a string for chat.
         """
-        if
+        if self.model_adapter == OutputType.MULTIPLE_CHOICE:
             return result
-        elif eval_type == EvalType.SERVICE:
-            return ResponseParser.parse_first_option_with_choices(text=result, options=self.choices)
-        elif eval_type == EvalType.CUSTOM:
-            return ResponseParser.parse_first_option_with_choices(text=result, options=self.choices)
         else:
-
+            return ResponseParser.parse_first_option(text=result)
 
     def match(self, gold: str, pred: str) -> float:
         return exact_match(gold=gold, pred=pred)
@@ -152,8 +149,8 @@ class ARCAdapter(DataAdapter):
         choices_prompts: str = '\n'.join([label + '. ' + text for text, label in zip(choices_texts, choices_labels)])
         example += '\n' + choices_prompts
 
-        example += '\nAnswer:'
         if include_answer:
+            example += '\nAnswer:'
             example += ' {}\n\n'.format(input_d['answerKey'])
 
         return example
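The net effect of the `gen_prompt` changes above is that the few-shot context is now folded into the new `prompt_template` before being wrapped by `gen_prompt_data`. The following is a self-contained sketch of that flow only: `gen_prompt_data` belongs to the shared `DataAdapter` base class and is stubbed here as a plain dict, and the sample data is made up for illustration.

```python
# Standalone sketch of the reworked ARC prompt assembly.
PROMPT_TEMPLATE = (
    'Given the following question and four candidate answers (A, B, C and D), '
    'choose the best answer.\n{query}\n'
    'Your response should end with "The best answer is [the_answer_letter]" '
    'where the [the_answer_letter] is one of A, B, C or D.')

def format_example(item, include_answer=True):
    # Question plus labeled choices; the answer line is appended only for few-shot examples.
    example = item['question'] + '\n' + '\n'.join(
        f'{label}. {text}' for label, text in zip(item['labels'], item['choices']))
    if include_answer:
        example += '\nAnswer: {}\n\n'.format(item['answerKey'])
    return example

def gen_prompt(input_d, few_shot_list):
    few_shot_prompts = [format_example(s, include_answer=True) for s in few_shot_list]
    context = '\n'.join(few_shot_prompts) + format_example(input_d, include_answer=False)
    full_prompt = PROMPT_TEMPLATE.format(query=context)
    return {'data': [full_prompt]}  # stand-in for self.gen_prompt_data(full_prompt)

sample = {'question': 'Which gas do plants absorb?', 'labels': ['A', 'B', 'C', 'D'],
          'choices': ['Oxygen', 'Carbon dioxide', 'Nitrogen', 'Helium'], 'answerKey': 'B'}
print(gen_prompt(sample, few_shot_list=[])['data'][0])
```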
--- evalscope-0.12.0/evalscope/benchmarks/bbh/bbh_adapter.py
+++ evalscope-0.13.0/evalscope/benchmarks/bbh/bbh_adapter.py
@@ -8,8 +8,6 @@ import re
 from evalscope.benchmarks import Benchmark, DataAdapter
 from evalscope.constants import AnswerKeys
 from evalscope.metrics import exact_match
-from evalscope.models.chat_adapter import ChatGenerationModelAdapter
-from evalscope.utils import ResponseParser
 from evalscope.utils.logger import get_logger
 
 # flake8: noqa
@@ -60,8 +58,8 @@ SUBSET_LIST = MULTIPLE_CHOICE_LIST + FREE_FORM_LIST
 
 @Benchmark.register(
     name='bbh',
+    pretty_name='BBH',
     dataset_id='modelscope/bbh',
-    model_adapter=ChatGenerationModelAdapter,
     subset_list=SUBSET_LIST,
     metric_list=['AverageAccuracy'],
     few_shot_num=3,
@@ -94,7 +92,7 @@ class BBHAdapter(DataAdapter):
             else:
                 file_path: str = os.path.join(work_dir, dataset_name_or_path, f'{subset_name}.json')
                 if os.path.exists(file_path):
-                    with open(file_path, 'r') as f:
+                    with open(file_path, 'r', encoding='utf-8') as f:
                         examples = json.load(f)['examples']
                         if subset_name in data_dict:
                             data_dict[subset_name].update({split_name: examples})
@@ -125,7 +123,7 @@ class BBHAdapter(DataAdapter):
             cot_prompts = ''
         full_prompt = cot_prompts + self.prompt_template.format(query=input_d['input'])
 
-        return
+        return self.gen_prompt_data(full_prompt)
 
     def gen_prompts(self, data_dict: dict) -> dict:
         """
@@ -153,7 +151,9 @@ class BBHAdapter(DataAdapter):
         for sub_name, sub_data_dict in data_dict.items():
             few_shot_data = []
             if self.few_shot_num > 0:
-                with open(
+                with open(
+                        os.path.join(os.path.dirname(__file__), 'cot_prompts', f'{sub_name}.txt'), 'r',
+                        encoding='utf-8') as f:
                     cot_prompt_str = f.read()
                     few_shot_data = [cot_prompt_str]
 
--- evalscope-0.12.0/evalscope/benchmarks/benchmark.py
+++ evalscope-0.13.0/evalscope/benchmarks/benchmark.py
@@ -1,12 +1,13 @@
 import copy
-from
+from collections import OrderedDict
+from dataclasses import dataclass, field, fields
 from typing import TYPE_CHECKING, Dict, List, Optional
 
+from evalscope.constants import OutputType
+
 if TYPE_CHECKING:
     from evalscope.benchmarks import DataAdapter
 
-    from evalscope.models import BaseModelAdapter
-
 BENCHMARK_MAPPINGS = {}
 
 
@@ -15,8 +16,9 @@ class BenchmarkMeta:
     name: str
     dataset_id: str
     data_adapter: 'DataAdapter'
-    model_adapter:
-
+    model_adapter: Optional[str] = OutputType.GENERATION
+    output_types: Optional[List[str]] = field(default_factory=lambda: [OutputType.GENERATION])
+    subset_list: List[str] = field(default_factory=lambda: ['default'])
     metric_list: List[str] = field(default_factory=list)
     few_shot_num: int = 0
     few_shot_random: bool = False
@@ -26,6 +28,8 @@ class BenchmarkMeta:
     system_prompt: Optional[str] = None
     query_template: Optional[str] = None
     pretty_name: Optional[str] = None
+    filters: Optional[OrderedDict] = None
+    extra_params: Optional[Dict] = field(default_factory=dict)
 
     def _update(self, args: dict):
         if args.get('local_path'):
@@ -37,12 +41,9 @@ class BenchmarkMeta:
         return self.__dict__
 
     def to_string_dict(self) -> dict:
-        cur_dict = copy.deepcopy(self.
+        cur_dict = copy.deepcopy(self.to_dict())
         # cur_dict['data_adapter'] = self.data_adapter.__name__
-        # cur_dict['model_adapter'] = self.model_adapter.__name__
-        # cur_dict['metric_list'] = [metric['name'] for metric in self.metric_list]
         del cur_dict['data_adapter']
-        del cur_dict['model_adapter']
         return cur_dict
 
     def get_data_adapter(self, config: dict = {}) -> 'DataAdapter':
@@ -66,13 +67,13 @@ class Benchmark:
         return benchmark
 
     @classmethod
-    def register(cls, name: str, dataset_id: str,
+    def register(cls, name: str, dataset_id: str, **kwargs):
 
         def register_wrapper(data_adapter):
             if name in BENCHMARK_MAPPINGS:
                 raise Exception(f'Benchmark {name} already registered')
             BENCHMARK_MAPPINGS[name] = BenchmarkMeta(
-                name=name, data_adapter=data_adapter,
+                name=name, data_adapter=data_adapter, dataset_id=dataset_id, **kwargs)
             return data_adapter
 
         return register_wrapper
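The register/BenchmarkMeta change means adapter modules can now pass arbitrary metadata (such as the new `pretty_name` and `output_types` used in the adapter hunks above) straight through the decorator. The sketch below is a simplified stand-in for that pattern, not the library code itself; plain strings replace the `OutputType` constants and the adapter class is a dummy.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Simplified stand-in for evalscope's BenchmarkMeta/Benchmark.register pattern after
# this change: register() forwards arbitrary keyword arguments into the metadata
# dataclass. Field names mirror the hunk above; everything else is illustrative.
BENCHMARK_MAPPINGS = {}

@dataclass
class BenchmarkMeta:
    name: str
    dataset_id: str
    data_adapter: type
    model_adapter: Optional[str] = 'generation'   # stands in for OutputType.GENERATION
    output_types: List[str] = field(default_factory=lambda: ['generation'])
    subset_list: List[str] = field(default_factory=lambda: ['default'])
    metric_list: List[str] = field(default_factory=list)
    few_shot_num: int = 0
    pretty_name: Optional[str] = None

def register(name: str, dataset_id: str, **kwargs):
    def register_wrapper(data_adapter):
        if name in BENCHMARK_MAPPINGS:
            raise Exception(f'Benchmark {name} already registered')
        BENCHMARK_MAPPINGS[name] = BenchmarkMeta(
            name=name, data_adapter=data_adapter, dataset_id=dataset_id, **kwargs)
        return data_adapter
    return register_wrapper

@register(name='aime24_demo', dataset_id='HuggingFaceH4/aime_2024',
          pretty_name='AIME-2024', metric_list=['AveragePass@1'])
class DemoAdapter:
    pass

print(BENCHMARK_MAPPINGS['aime24_demo'].pretty_name)  # AIME-2024
```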
--- evalscope-0.12.0/evalscope/benchmarks/ceval/ceval_adapter.py
+++ evalscope-0.13.0/evalscope/benchmarks/ceval/ceval_adapter.py
@@ -3,9 +3,8 @@ import csv
 import os
 
 from evalscope.benchmarks import Benchmark, DataAdapter
-from evalscope.constants import EvalType
+from evalscope.constants import EvalType, OutputType
 from evalscope.metrics.metrics import exact_match
-from evalscope.models import MultiChoiceModelAdapter
 from evalscope.utils import ResponseParser
 from evalscope.utils.logger import get_logger
 
@@ -126,19 +125,20 @@ SUBJECT_MAPPING = {
 
 @Benchmark.register(
     name='ceval',
+    pretty_name='C-Eval',
     dataset_id='modelscope/ceval-exam',
-    model_adapter=
+    model_adapter=OutputType.MULTIPLE_CHOICE,
+    output_types=[OutputType.MULTIPLE_CHOICE, OutputType.GENERATION],
     subset_list=SUBSET_LIST,
     metric_list=['AverageAccuracy'],
     few_shot_num=0,
     train_split='dev',
    eval_split='val',
-    prompt_template=
+    prompt_template=
+    '以下是中国关于{subset_name}考试的单项选择题,请选出其中的正确答案。你的回答的最后一行应该是这样的格式:“答案是:LETTER”(不带引号),其中 LETTER 是 A、B、C、D 中的一个。\n{query}',
 )
 class CEVALAdapter(DataAdapter):
 
-    choices = ['A', 'B', 'C', 'D']
-
     def __init__(self, **kwargs):
 
         few_shot_num = kwargs.get('few_shot_num', 0)
@@ -148,6 +148,7 @@ class CEVALAdapter(DataAdapter):
         super().__init__(**kwargs)
 
         self.category_map = {k: v[-1] for k, v in SUBJECT_MAPPING.items()}
+        self.choices = ['A', 'B', 'C', 'D']
 
     def load_from_disk(self, dataset_name_or_path, subset_list, work_dir, **kwargs) -> dict:
         data_dict = {}
@@ -207,7 +208,7 @@ class CEVALAdapter(DataAdapter):
         subject_name: str = SUBJECT_MAPPING.get(subset_name)[1] if SUBJECT_MAPPING.get(subset_name) else subset_name
         full_prompt = self.prompt_template.format(subset_name=subject_name, query=query)
 
-        return
+        return self.gen_prompt_data(full_prompt)
 
     def get_gold_answer(self, input_d: dict) -> str:
         # Get the gold choice
@@ -225,22 +226,17 @@ class CEVALAdapter(DataAdapter):
         Returns:
             The parsed answer. Depending on the dataset. Usually a string for chat.
         """
-        if
+        if self.model_adapter == OutputType.MULTIPLE_CHOICE:
             return result
-        elif eval_type == EvalType.SERVICE:
-            return ResponseParser.parse_first_option_with_choices(result, self.choices)
-        elif eval_type == EvalType.CUSTOM:
-            return ResponseParser.parse_first_option_with_choices(result, self.choices)
         else:
-
+            return ResponseParser.parse_first_option_with_choices(text=result, options=self.choices)
 
     def match(self, gold: str, pred: str) -> float:
         return exact_match(gold=gold, pred=pred)
 
-
-    def _format_example(cls, input_d: dict, include_answer=True):
+    def _format_example(self, input_d: dict, include_answer=True):
         example = '问题:' + input_d['question']
-        for choice in
+        for choice in self.choices:
             example += f'\n{choice}. {input_d[f"{choice}"]}'
 
         if include_answer: