evalscope 0.10.1__tar.gz → 0.11.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release.
This version of evalscope might be problematic.
- {evalscope-0.10.1/evalscope.egg-info → evalscope-0.11.0}/PKG-INFO +14 -5
- {evalscope-0.10.1 → evalscope-0.11.0}/README.md +1 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/arguments.py +1 -0
- evalscope-0.11.0/evalscope/benchmarks/aime24/aime24_adapter.py +49 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/arc/arc_adapter.py +5 -7
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/bbh_adapter.py +17 -9
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/benchmark.py +2 -2
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/ceval/ceval_adapter.py +9 -9
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/cmmlu/cmmlu_adapter.py +9 -11
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/competition_math/competition_math_adapter.py +34 -23
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/data_adapter.py +18 -12
- evalscope-0.11.0/evalscope/benchmarks/data_collection/data_collection_adapter.py +71 -0
- evalscope-0.11.0/evalscope/benchmarks/general_mcq/general_mcq_adapter.py +129 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/general_qa/general_qa_adapter.py +6 -6
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/gpqa/gpqa_adapter.py +26 -8
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/gsm8k/gsm8k_adapter.py +8 -13
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/hellaswag/hellaswag_adapter.py +3 -7
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/humaneval/humaneval_adapter.py +5 -6
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/ifeval/ifeval_adapter.py +14 -13
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/iquiz/iquiz_adapter.py +5 -5
- evalscope-0.11.0/evalscope/benchmarks/math_500/__init__.py +0 -0
- evalscope-0.11.0/evalscope/benchmarks/math_500/math_500_adapter.py +49 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/mmlu/mmlu_adapter.py +7 -11
- evalscope-0.11.0/evalscope/benchmarks/mmlu_pro/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/mmlu_pro/mmlu_pro_adapter.py +27 -15
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/race/race_adapter.py +3 -3
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/trivia_qa/trivia_qa_adapter.py +1 -2
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/truthful_qa/truthful_qa_adapter.py +8 -8
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/collections/evaluator.py +103 -39
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/collections/sampler.py +2 -1
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/collections/schema.py +1 -2
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/config.py +1 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/evaluator/evaluator.py +78 -64
- evalscope-0.11.0/evalscope/metrics/math_parser.py +526 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/metrics/metrics.py +16 -1
- evalscope-0.11.0/evalscope/metrics/named_metrics.py +41 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/models/chat_adapter.py +69 -49
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/models/choice_adapter.py +52 -45
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/models/custom_adapter.py +2 -2
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/models/local_model.py +4 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/models/server_adapter.py +28 -34
- evalscope-0.11.0/evalscope/perf/utils/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/report/app.py +30 -15
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/run.py +10 -7
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/utils/chat_service.py +2 -2
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/utils/io_utils.py +1 -1
- evalscope-0.11.0/evalscope/version.py +4 -0
- {evalscope-0.10.1 → evalscope-0.11.0/evalscope.egg-info}/PKG-INFO +14 -5
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope.egg-info/SOURCES.txt +9 -2
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope.egg-info/requires.txt +12 -4
- {evalscope-0.10.1 → evalscope-0.11.0}/requirements/app.txt +1 -1
- {evalscope-0.10.1 → evalscope-0.11.0}/requirements/framework.txt +6 -2
- {evalscope-0.10.1 → evalscope-0.11.0}/tests/cli/test_run.py +93 -16
- evalscope-0.11.0/tests/rag/__init__.py +0 -0
- evalscope-0.10.1/evalscope/benchmarks/ceval/samples.jsonl +0 -1
- evalscope-0.10.1/evalscope/metrics/math_accuracy.py +0 -200
- evalscope-0.10.1/evalscope/metrics/named_metrics.py +0 -17
- evalscope-0.10.1/evalscope/version.py +0 -4
- {evalscope-0.10.1 → evalscope-0.11.0}/LICENSE +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/MANIFEST.in +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/base.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/opencompass/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/opencompass/api_meta_template.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/opencompass/backend_manager.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/opencompass/tasks/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/opencompass/tasks/eval_api.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/opencompass/tasks/eval_datasets.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/backend_manager.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/clip_benchmark/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/clip_benchmark/arguments.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/clip_benchmark/dataset_builder.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/clip_benchmark/task_template.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/clip_benchmark/tasks/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/clip_benchmark/tasks/image_caption.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_classification.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_retrieval.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/clip_benchmark/utils/webdataset_convert.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/clip_benchmark/utils/webdatasets.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/cmteb/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/cmteb/arguments.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/cmteb/base.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/cmteb/task_template.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/cmteb/tasks/Classification.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/cmteb/tasks/Clustering.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/cmteb/tasks/CustomTask.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/cmteb/tasks/PairClassification.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/cmteb/tasks/Reranking.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/cmteb/tasks/Retrieval.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/cmteb/tasks/STS.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/cmteb/tasks/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/ragas/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/ragas/arguments.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/ragas/prompts/persona_prompt.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/ragas/task_template.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/ragas/tasks/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/ragas/tasks/build_distribution.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/ragas/tasks/build_transform.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/ragas/tasks/testset_generation.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/ragas/tasks/translate_prompt.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/utils/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/utils/clip.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/utils/embedding.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/utils/llm.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/rag_eval/utils/tools.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/vlm_eval_kit/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/vlm_eval_kit/backend_manager.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/backend/vlm_eval_kit/custom_dataset.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/__init__.py +0 -0
- {evalscope-0.10.1/evalscope/benchmarks/gpqa → evalscope-0.11.0/evalscope/benchmarks/aime24}/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/arc/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/arc/ai2_arc.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/boolean_expressions.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/causal_judgement.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/date_understanding.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/disambiguation_qa.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/dyck_languages.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/formal_fallacies.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/geometric_shapes.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/hyperbaton.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_five_objects.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_seven_objects.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_three_objects.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/movie_recommendation.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/multistep_arithmetic_two.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/navigate.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/object_counting.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/penguins_in_a_table.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/reasoning_about_colored_objects.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/ruin_names.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/salient_translation_error_detection.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/snarks.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/sports_understanding.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/temporal_sequences.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_five_objects.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_seven_objects.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_three_objects.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/web_of_lies.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/cot_prompts/word_sorting.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/ceval/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/ceval/ceval_exam.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/cmmlu/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/cmmlu/cmmlu.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/cmmlu/samples.jsonl +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/competition_math/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/competition_math/competition_math.py +0 -0
- {evalscope-0.10.1/evalscope/benchmarks/ifeval → evalscope-0.11.0/evalscope/benchmarks/data_collection}/__init__.py +0 -0
- {evalscope-0.10.1/evalscope/benchmarks/iquiz → evalscope-0.11.0/evalscope/benchmarks/general_mcq}/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/general_qa/__init__.py +0 -0
- {evalscope-0.10.1/evalscope/benchmarks/mmlu_pro → evalscope-0.11.0/evalscope/benchmarks/gpqa}/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/gpqa/chain_of_thought.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/gsm8k/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/gsm8k/gsm8k.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/hellaswag/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/hellaswag/hellaswag.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/humaneval/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/humaneval/humaneval.py +0 -0
- {evalscope-0.10.1/evalscope/perf/utils → evalscope-0.11.0/evalscope/benchmarks/ifeval}/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/ifeval/instructions.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/ifeval/instructions_registry.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/ifeval/instructions_util.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/ifeval/utils.py +0 -0
- {evalscope-0.10.1/tests/rag → evalscope-0.11.0/evalscope/benchmarks/iquiz}/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/mmlu/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/mmlu/mmlu.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/mmlu/samples.jsonl +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/race/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/race/race.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/race/samples.jsonl +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/trivia_qa/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/trivia_qa/samples.jsonl +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/trivia_qa/trivia_qa.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/truthful_qa/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/truthful_qa/truthful_qa.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/cli/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/cli/base.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/cli/cli.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/cli/start_app.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/cli/start_eval.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/cli/start_perf.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/cli/start_server.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/collections/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/constants.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/evaluator/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/evaluator/rating_eval.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/evaluator/reviewer/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/evaluator/reviewer/auto_reviewer.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/metrics/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/metrics/bundled_rouge_score/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/metrics/bundled_rouge_score/rouge_scorer.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/metrics/code_metric.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/metrics/resources/gpt2-zhcn3-v4.bpe +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/metrics/resources/gpt2-zhcn3-v4.json +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/metrics/rouge_metric.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/models/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/models/base_adapter.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/models/custom/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/models/custom/custom_model.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/models/custom/dummy_model.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/models/model.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/arguments.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/benchmark.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/http_client.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/main.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/plugin/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/plugin/api/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/plugin/api/base.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/plugin/api/custom_api.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/plugin/api/dashscope_api.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/plugin/api/openai_api.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/plugin/datasets/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/plugin/datasets/base.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/plugin/datasets/custom.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/plugin/datasets/flickr8k.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/plugin/datasets/line_by_line.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/plugin/datasets/longalpaca.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/plugin/datasets/openqa.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/plugin/datasets/speed_benchmark.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/plugin/registry.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/utils/analysis_result.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/utils/benchmark_util.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/utils/db_util.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/utils/handler.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/perf/utils/local_server.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/registry/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/registry/config/cfg_arena.yaml +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/registry/config/cfg_arena_zhihu.yaml +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/registry/config/cfg_pairwise_baseline.yaml +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/registry/config/cfg_single.yaml +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/registry/data/prompt_template/lmsys_v2.jsonl +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/registry/data/prompt_template/prompt_templates.jsonl +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/registry/data/qa_browser/battle.jsonl +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/registry/data/qa_browser/category_mapping.yaml +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/registry/data/question.jsonl +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/registry/tasks/arc.yaml +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/registry/tasks/bbh.yaml +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/registry/tasks/bbh_mini.yaml +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/registry/tasks/ceval.yaml +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/registry/tasks/ceval_mini.yaml +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/registry/tasks/cmmlu.yaml +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/registry/tasks/eval_qwen-7b-chat_v100.yaml +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/registry/tasks/general_qa.yaml +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/registry/tasks/gsm8k.yaml +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/registry/tasks/mmlu.yaml +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/registry/tasks/mmlu_mini.yaml +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/report/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/report/combinator.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/report/generator.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/report/utils.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/run_arena.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/summarizer.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/longbench_write/README.md +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/longbench_write/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/longbench_write/default_task.json +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/longbench_write/default_task.yaml +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/longbench_write/eval.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/longbench_write/infer.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/longbench_write/longbench_write.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/longbench_write/resources/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/longbench_write/resources/judge.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/longbench_write/resources/longbench_write.jsonl +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/longbench_write/resources/longbench_write_en.jsonl +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/longbench_write/resources/longwrite_ruler.jsonl +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/longbench_write/tools/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/longbench_write/tools/data_etl.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/longbench_write/tools/openai_api.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/longbench_write/utils.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/toolbench_static/README.md +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/toolbench_static/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/toolbench_static/config_default.json +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/toolbench_static/config_default.yaml +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/toolbench_static/eval.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/toolbench_static/infer.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/toolbench_static/llm/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/toolbench_static/llm/swift_infer.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/toolbench_static/requirements.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/third_party/toolbench_static/toolbench_static.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/utils/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/utils/arena_utils.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/utils/completion_parsers.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/utils/logger.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/utils/model_utils.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope/utils/utils.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope.egg-info/dependency_links.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope.egg-info/entry_points.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope.egg-info/not-zip-safe +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/evalscope.egg-info/top_level.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/requirements/docs.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/requirements/inner.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/requirements/opencompass.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/requirements/perf.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/requirements/rag.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/requirements/tests.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/requirements/vlmeval.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/requirements.txt +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/setup.cfg +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/setup.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/tests/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/tests/cli/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/tests/cli/test_collection.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/tests/perf/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/tests/perf/test_perf.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/tests/rag/test_clip_benchmark.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/tests/rag/test_mteb.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/tests/rag/test_ragas.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/tests/swift/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/tests/swift/test_run_swift_eval.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/tests/swift/test_run_swift_vlm_eval.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/tests/swift/test_run_swift_vlm_jugde_eval.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/tests/test_run_all.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/tests/vlm/__init__.py +0 -0
- {evalscope-0.10.1 → evalscope-0.11.0}/tests/vlm/test_vlmeval.py +0 -0

{evalscope-0.10.1/evalscope.egg-info → evalscope-0.11.0}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: evalscope
-Version: 0.10.1
+Version: 0.11.0
 Summary: EvalScope: Lightweight LLMs Evaluation Framework
 Home-page: https://github.com/modelscope/evalscope
 Author: ModelScope team
@@ -19,10 +19,12 @@ License-File: LICENSE
 Requires-Dist: absl-py
 Requires-Dist: accelerate
 Requires-Dist: cachetools
-Requires-Dist: datasets<=3.0
+Requires-Dist: datasets<=3.2.0,>=3.0.0
 Requires-Dist: editdistance
 Requires-Dist: jieba
 Requires-Dist: jsonlines
+Requires-Dist: langdetect
+Requires-Dist: latex2sympy2
 Requires-Dist: matplotlib
 Requires-Dist: modelscope[framework]
 Requires-Dist: nltk>=3.9
@@ -42,12 +44,14 @@ Requires-Dist: scikit-learn
 Requires-Dist: seaborn
 Requires-Dist: sentencepiece
 Requires-Dist: simple-ddl-parser
+Requires-Dist: sympy
 Requires-Dist: tabulate
 Requires-Dist: tiktoken
 Requires-Dist: torch
 Requires-Dist: tqdm
 Requires-Dist: transformers>=4.33
 Requires-Dist: transformers_stream_generator
+Requires-Dist: word2number
 Provides-Extra: opencompass
 Requires-Dist: ms-opencompass>=0.1.4; extra == "opencompass"
 Provides-Extra: vlmeval
@@ -64,7 +68,7 @@ Requires-Dist: sse_starlette; extra == "perf"
 Requires-Dist: transformers; extra == "perf"
 Requires-Dist: unicorn; extra == "perf"
 Provides-Extra: app
-Requires-Dist: gradio
+Requires-Dist: gradio==5.4.0; extra == "app"
 Requires-Dist: plotly>=5.23.0; extra == "app"
 Provides-Extra: inner
 Requires-Dist: absl-py; extra == "inner"
@@ -96,10 +100,12 @@ Provides-Extra: all
 Requires-Dist: absl-py; extra == "all"
 Requires-Dist: accelerate; extra == "all"
 Requires-Dist: cachetools; extra == "all"
-Requires-Dist: datasets<=3.0
+Requires-Dist: datasets<=3.2.0,>=3.0.0; extra == "all"
 Requires-Dist: editdistance; extra == "all"
 Requires-Dist: jieba; extra == "all"
 Requires-Dist: jsonlines; extra == "all"
+Requires-Dist: langdetect; extra == "all"
+Requires-Dist: latex2sympy2; extra == "all"
 Requires-Dist: matplotlib; extra == "all"
 Requires-Dist: modelscope[framework]; extra == "all"
 Requires-Dist: nltk>=3.9; extra == "all"
@@ -119,12 +125,14 @@ Requires-Dist: scikit-learn; extra == "all"
 Requires-Dist: seaborn; extra == "all"
 Requires-Dist: sentencepiece; extra == "all"
 Requires-Dist: simple-ddl-parser; extra == "all"
+Requires-Dist: sympy; extra == "all"
 Requires-Dist: tabulate; extra == "all"
 Requires-Dist: tiktoken; extra == "all"
 Requires-Dist: torch; extra == "all"
 Requires-Dist: tqdm; extra == "all"
 Requires-Dist: transformers>=4.33; extra == "all"
 Requires-Dist: transformers_stream_generator; extra == "all"
+Requires-Dist: word2number; extra == "all"
 Requires-Dist: ms-opencompass>=0.1.4; extra == "all"
 Requires-Dist: ms-vlmeval>=0.0.9; extra == "all"
 Requires-Dist: mteb==1.19.4; extra == "all"
@@ -136,7 +144,7 @@ Requires-Dist: numpy; extra == "all"
 Requires-Dist: sse_starlette; extra == "all"
 Requires-Dist: transformers; extra == "all"
 Requires-Dist: unicorn; extra == "all"
-Requires-Dist: gradio
+Requires-Dist: gradio==5.4.0; extra == "all"
 Requires-Dist: plotly>=5.23.0; extra == "all"
 
 <p align="center">
@@ -215,6 +223,7 @@ Please scan the QR code below to join our community groups:
 
 
 ## 🎉 News
+- 🔥 **[2025.02.13]** Added support for evaluating DeepSeek distilled models, including AIME24, MATH-500, and GPQA-Diamond datasets,refer to [best practice](https://evalscope.readthedocs.io/zh-cn/latest/best_practice/deepseek_r1_distill.html); Added support for specifying the `eval_batch_size` parameter to accelerate model evaluation.
 - 🔥 **[2025.01.20]** Support for visualizing evaluation results, including single model evaluation results and multi-model comparison, refer to the [📖 Visualizing Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html) for more details; Added [`iquiz`](https://modelscope.cn/datasets/AI-ModelScope/IQuiz/summary) evaluation example, evaluating the IQ and EQ of the model.
 - 🔥 **[2025.01.07]** Native backend: Support for model API evaluation is now available. Refer to the [📖 Model API Evaluation Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#api) for more details. Additionally, support for the `ifeval` evaluation benchmark has been added.
 - 🔥🔥 **[2024.12.31]** Support for adding benchmark evaluations, refer to the [📖 Benchmark Evaluation Addition Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/add_benchmark.html); support for custom mixed dataset evaluations, allowing for more comprehensive model evaluations with less data, refer to the [📖 Mixed Dataset Evaluation Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/collection/index.html).
{evalscope-0.10.1 → evalscope-0.11.0}/README.md

@@ -74,6 +74,7 @@ Please scan the QR code below to join our community groups:
 
 
 ## 🎉 News
+- 🔥 **[2025.02.13]** Added support for evaluating DeepSeek distilled models, including AIME24, MATH-500, and GPQA-Diamond datasets,refer to [best practice](https://evalscope.readthedocs.io/zh-cn/latest/best_practice/deepseek_r1_distill.html); Added support for specifying the `eval_batch_size` parameter to accelerate model evaluation.
 - 🔥 **[2025.01.20]** Support for visualizing evaluation results, including single model evaluation results and multi-model comparison, refer to the [📖 Visualizing Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html) for more details; Added [`iquiz`](https://modelscope.cn/datasets/AI-ModelScope/IQuiz/summary) evaluation example, evaluating the IQ and EQ of the model.
 - 🔥 **[2025.01.07]** Native backend: Support for model API evaluation is now available. Refer to the [📖 Model API Evaluation Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#api) for more details. Additionally, support for the `ifeval` evaluation benchmark has been added.
 - 🔥🔥 **[2024.12.31]** Support for adding benchmark evaluations, refer to the [📖 Benchmark Evaluation Addition Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/add_benchmark.html); support for custom mixed dataset evaluations, allowing for more comprehensive model evaluations with less data, refer to the [📖 Mixed Dataset Evaluation Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/collection/index.html).
{evalscope-0.10.1 → evalscope-0.11.0}/evalscope/arguments.py

@@ -58,6 +58,7 @@ def add_argument(parser: argparse.ArgumentParser):
     parser.add_argument('--stage', type=str, default='all', help='The stage of evaluation pipeline.',
                         choices=[EvalStage.ALL, EvalStage.INFER, EvalStage.REVIEW])
     parser.add_argument('--limit', type=int, default=None, help='Max evaluation samples num for each subset.')
+    parser.add_argument('--eval-batch-size', type=int, default=1, help='The batch size for evaluation.')
 
     # Cache and working directory arguments
     parser.add_argument('--mem-cache', action='store_true', default=False, help='Deprecated, will be removed in v1.0.0.')  # noqa: E501
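The new `--eval-batch-size` flag (default 1) is what the 2025.02.13 news entry refers to: inference requests can now be batched during evaluation. A minimal sketch of the same option from the Python side, assuming the flag's counterpart field `eval_batch_size` is exposed on `TaskConfig` (the model id and dataset below are placeholders, not taken from this diff):

```python
# Sketch only: batched evaluation through the Python entry points shipped in
# this release (evalscope/config.py and evalscope/run.py).
from evalscope.config import TaskConfig
from evalscope.run import run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',  # placeholder model id
    datasets=['gsm8k'],                  # any registered benchmark name
    limit=16,                            # mirrors --limit: cap samples per subset
    eval_batch_size=8,                   # assumed TaskConfig counterpart of --eval-batch-size
)

run_task(task_cfg)
```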
evalscope-0.11.0/evalscope/benchmarks/aime24/aime24_adapter.py (new file)

@@ -0,0 +1,49 @@
+from evalscope.benchmarks import Benchmark, DataAdapter
+from evalscope.metrics.math_parser import extract_answer, math_equal, strip_answer_string
+from evalscope.models import ChatGenerationModelAdapter
+from evalscope.utils.logger import get_logger
+
+# flake8: noqa
+
+logger = get_logger()
+
+
+@Benchmark.register(
+    name='aime24',
+    dataset_id='HuggingFaceH4/aime_2024',
+    model_adapter=ChatGenerationModelAdapter,
+    subset_list=['default'],
+    metric_list=['AveragePass@1'],
+    few_shot_num=0,
+    train_split=None,
+    eval_split='train',  # Only train set is available
+    prompt_template='{query}\nPlease reason step by step, and put your final answer within \\boxed{{}}.',
+)
+class AIME24Adapter(DataAdapter):
+
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+
+    def gen_prompt(self, input_d: dict, few_shot_list: list, **kwargs) -> dict:
+        """
+        Generate the prompt for the model input.
+        """
+        problem = input_d['problem']
+        full_prompt = self.prompt_template.format(query=problem)
+
+        return {'data': [full_prompt], 'system_prompt': self.system_prompt}
+
+    def get_gold_answer(self, input_d: dict) -> str:
+        # Extract the gold answer from the input dict.
+        return strip_answer_string(input_d['answer'])
+
+    def parse_pred_result(self, result: str, raw_input_d: dict = None, eval_type: str = 'checkpoint') -> str:
+        """
+        Parse the model output to get the answer. Could be the best choice index.
+        """
+        # Note: Use same extraction method for both of checkpoint/service/custom
+        result = strip_answer_string(extract_answer(result))
+        return result
+
+    def match(self, gold: str, pred: str) -> float:
+        return math_equal(pred, gold)
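With the registration above, the benchmark becomes selectable by its `name`. A usage sketch (the model id and sample limit are illustrative, not part of this release):

```python
# Sketch only: running the newly added AIME24 benchmark end to end.
from evalscope.config import TaskConfig
from evalscope.run import run_task

task_cfg = TaskConfig(
    model='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B',  # placeholder model id
    datasets=['aime24'],  # resolved to AIME24Adapter via @Benchmark.register above
    limit=5,              # small smoke-test run
)

run_task(task_cfg)
```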
{evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/arc/arc_adapter.py

@@ -5,7 +5,7 @@ import os
 
 from evalscope.benchmarks import Benchmark, DataAdapter
 from evalscope.constants import EvalType
-from evalscope.metrics import
+from evalscope.metrics import exact_match
 from evalscope.models import MultiChoiceModelAdapter
 from evalscope.utils import ResponseParser
 from evalscope.utils.logger import get_logger
@@ -20,7 +20,7 @@ logger = get_logger()
     dataset_id='modelscope/ai2_arc',
     model_adapter=MultiChoiceModelAdapter,
     subset_list=['ARC-Easy', 'ARC-Challenge'],
-    metric_list=[AverageAccuracy],
+    metric_list=['AverageAccuracy'],
     few_shot_num=0,
     train_split='train',
     eval_split='test',
@@ -112,7 +112,7 @@ class ARCAdapter(DataAdapter):
         # context = f'The following are multiple choice questions, please output correct answer in the form of A or B or C or D, do not output explanation:\n {context}'
         full_prompt: str = context + self._generate_prompt(input_d=input_d, include_answer=False)
 
-        return {'data': [full_prompt], 'multi_choices': self.choices, 'system_prompt': self.
+        return {'data': [full_prompt], 'multi_choices': self.choices, 'system_prompt': self.system_prompt}
 
     def get_gold_answer(self, input_d: dict) -> str:
         # Get the gold choice
@@ -133,11 +133,9 @@ class ARCAdapter(DataAdapter):
         if eval_type == EvalType.CHECKPOINT:
             return result
         elif eval_type == EvalType.SERVICE:
-            return ResponseParser.parse_first_option_with_choices(
-                text=result, options=self.choices)  # TODO: to be checked !
+            return ResponseParser.parse_first_option_with_choices(text=result, options=self.choices)
        elif eval_type == EvalType.CUSTOM:
-            return ResponseParser.parse_first_option_with_choices(
-                text=result, options=self.choices)  # TODO: to be checked !
+            return ResponseParser.parse_first_option_with_choices(text=result, options=self.choices)
         else:
             raise ValueError(f'Invalid eval_type: {eval_type}')
 
{evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/bbh/bbh_adapter.py

@@ -7,7 +7,7 @@ import re
 
 from evalscope.benchmarks import Benchmark, DataAdapter
 from evalscope.constants import AnswerKeys
-from evalscope.metrics import
+from evalscope.metrics import exact_match
 from evalscope.models.chat_adapter import ChatGenerationModelAdapter
 from evalscope.utils import ResponseParser
 from evalscope.utils.logger import get_logger
@@ -63,11 +63,11 @@ SUBSET_LIST = MULTIPLE_CHOICE_LIST + FREE_FORM_LIST
     dataset_id='modelscope/bbh',
     model_adapter=ChatGenerationModelAdapter,
     subset_list=SUBSET_LIST,
-    metric_list=[AverageAccuracy],
+    metric_list=['AverageAccuracy'],
     few_shot_num=3,
     train_split=None,
     eval_split='test',
-    prompt_template='
+    prompt_template="Q: {query}\nA: Let's think step by step.",
 )
 class BBHAdapter(DataAdapter):
     """
@@ -119,10 +119,13 @@ class BBHAdapter(DataAdapter):
             {'data': ['xxx']}
         """
         # few_shot_list: should be ['xxxx']
-
-
+        if len(few_shot_list) > 0:
+            cot_prompts = 'Follow the given examples and answer the question.\n' + few_shot_list[0]
+        else:
+            cot_prompts = ''
+        full_prompt = cot_prompts + self.prompt_template.format(query=input_d['input'])
 
-        return {'data': [full_prompt], 'system_prompt': self.
+        return {'data': [full_prompt], 'system_prompt': self.system_prompt}
 
     def gen_prompts(self, data_dict: dict) -> dict:
         """
@@ -177,9 +180,11 @@ class BBHAdapter(DataAdapter):
 
     def get_gold_answer(self, input_d: dict) -> str:
         # Get the gold choice
-        gold = input_d.get('target')
+        gold = input_d.get('target', '')
+        # remove brackets
         if gold is None:
             logger.error(f'BBHAdapter: gold is None.')
+        gold = gold.replace('(', '').replace(')', '')
         return gold
 
     def parse_pred_result(self, result: str, raw_input_d: dict = None, eval_type: str = 'checkpoint') -> str:
@@ -228,8 +233,11 @@ class BBHAdapter(DataAdapter):
         """
         Extract the answer from the model output for Free-form task.
         """
-
-
+        pattern = r'answer is\s+(.*?)\.'
+
+        match = re.search(pattern, ans)
+        if match:
+            res = match.group(1)
             return res
 
         ans_line = ans.split('answer is ')
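The free-form extraction added above is a plain regular expression over the model's chain-of-thought text; a standalone illustration of the pattern (the sample completion is made up):

```python
import re

# Same pattern as in the hunk above: capture the text between "answer is"
# and the next period of a free-form completion.
pattern = r'answer is\s+(.*?)\.'

ans = "Let's think step by step. 3 + 4 = 7, so the answer is 7. Done."
match = re.search(pattern, ans)
print(match.group(1) if match else None)  # prints: 7
```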
{evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/benchmark.py

@@ -17,12 +17,13 @@ class BenchmarkMeta:
     data_adapter: 'DataAdapter'
     model_adapter: BaseModelAdapter
     subset_list: List[str] = field(default_factory=list)
-    metric_list: List[
+    metric_list: List[str] = field(default_factory=list)
     few_shot_num: int = 0
     few_shot_random: bool = False
     train_split: Optional[str] = None
     eval_split: Optional[str] = None
     prompt_template: Optional[str] = None
+    system_prompt: Optional[str] = None
 
     def _update(self, args: dict):
         if args.get('local_path'):
@@ -40,7 +41,6 @@ class BenchmarkMeta:
         # cur_dict['metric_list'] = [metric['name'] for metric in self.metric_list]
         del cur_dict['data_adapter']
         del cur_dict['model_adapter']
-        del cur_dict['metric_list']
         return cur_dict
 
     def get_data_adapter(self, config: dict = {}) -> 'DataAdapter':
{evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/ceval/ceval_adapter.py

@@ -4,10 +4,9 @@ import os
 
 from evalscope.benchmarks import Benchmark, DataAdapter
 from evalscope.constants import EvalType
-from evalscope.metrics import
-from evalscope.metrics.metrics import exact_match, weighted_mean
+from evalscope.metrics.metrics import exact_match
 from evalscope.models import MultiChoiceModelAdapter
-from evalscope.utils import ResponseParser
+from evalscope.utils import ResponseParser
 from evalscope.utils.logger import get_logger
 
 # flake8: noqa
@@ -130,10 +129,11 @@ SUBJECT_MAPPING = {
     dataset_id='modelscope/ceval-exam',
     model_adapter=MultiChoiceModelAdapter,
     subset_list=SUBSET_LIST,
-    metric_list=[AverageAccuracy],
+    metric_list=['AverageAccuracy'],
     few_shot_num=0,
     train_split='dev',
     eval_split='val',
+    prompt_template='以下是中国关于{subset_name}考试的单项选择题,请选出其中的正确答案。\n{query}',
 )
 class CEVALAdapter(DataAdapter):
 
@@ -202,12 +202,12 @@ class CEVALAdapter(DataAdapter):
         else:
             context = ''
 
-
+        query: str = context.strip() + self._format_example(input_d=input_d, include_answer=False)
 
         subject_name: str = SUBJECT_MAPPING.get(subset_name)[1] if SUBJECT_MAPPING.get(subset_name) else subset_name
-        full_prompt =
+        full_prompt = self.prompt_template.format(subset_name=subject_name, query=query)
 
-        return {'data': [full_prompt], 'multi_choices': self.choices, 'system_prompt': self.
+        return {'data': [full_prompt], 'multi_choices': self.choices, 'system_prompt': self.system_prompt}
 
     def get_gold_answer(self, input_d: dict) -> str:
         # Get the gold choice
@@ -228,9 +228,9 @@ class CEVALAdapter(DataAdapter):
         if eval_type == EvalType.CHECKPOINT:
             return result
         elif eval_type == EvalType.SERVICE:
-            return ResponseParser.parse_first_option_with_choices(result, self.choices)
+            return ResponseParser.parse_first_option_with_choices(result, self.choices)
         elif eval_type == EvalType.CUSTOM:
-            return ResponseParser.parse_first_option_with_choices(result, self.choices)
+            return ResponseParser.parse_first_option_with_choices(result, self.choices)
         else:
             raise ValueError(f'Invalid eval_type: {eval_type}')
 
{evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/cmmlu/cmmlu_adapter.py

@@ -5,9 +5,9 @@ import os
 
 from evalscope.benchmarks import Benchmark, DataAdapter
 from evalscope.constants import EvalType
-from evalscope.metrics import
+from evalscope.metrics import exact_match
 from evalscope.models import MultiChoiceModelAdapter
-from evalscope.utils import ResponseParser
+from evalscope.utils import ResponseParser
 from evalscope.utils.logger import get_logger
 
 # flake8: noqa
@@ -106,10 +106,11 @@ SUBJECT_MAPPING = {
     dataset_id='modelscope/cmmlu',
     model_adapter=MultiChoiceModelAdapter,
     subset_list=SUBSET_LIST,
-    metric_list=[AverageAccuracy],
+    metric_list=['AverageAccuracy'],
     few_shot_num=5,
     train_split='dev',
     eval_split='test',
+    prompt_template='以下是关于{subset_name}的单项选择题,请直接给出正确答案的选项。\n{query}',
 )
 class CMMLUAdapter(DataAdapter):
 
@@ -165,16 +166,13 @@ class CMMLUAdapter(DataAdapter):
             {'data': [(context, continuation), ...]}
 
         """
-        prompt = '以下是关于{}的单项选择题。\n\n'.format(self._format_subject(subset_name))
         few_shot_prompts = [self._generate_prompt(input_d=sample, include_answer=True) for sample in few_shot_list]
-
-        context: str = '\n'.join(few_shot_prompts) + '\n'
+        context = '\n'.join(few_shot_prompts) + '\n'
         context += self._generate_prompt(input_d=input_d, include_answer=False)
-        context = prompt + context
 
-        full_prompt
+        full_prompt = self.prompt_template.format(subset_name=self._format_subject(subset_name), query=context.strip())
 
-        return {'data': [full_prompt], 'multi_choices': self.choices, 'system_prompt':
+        return {'data': [full_prompt], 'multi_choices': self.choices, 'system_prompt': self.system_prompt}
 
     def get_gold_answer(self, input_d: dict) -> str:
         # Get the gold choice
@@ -195,9 +193,9 @@ class CMMLUAdapter(DataAdapter):
         if eval_type == EvalType.CHECKPOINT:
             return result
         elif eval_type == EvalType.SERVICE:
-            return ResponseParser.parse_first_option_with_choices(result, self.choices)
+            return ResponseParser.parse_first_option_with_choices(result, self.choices)
         elif eval_type == EvalType.CUSTOM:
-            return ResponseParser.parse_first_option_with_choices(result, self.choices)
+            return ResponseParser.parse_first_option_with_choices(result, self.choices)
         else:
             raise ValueError(f'Invalid eval_type: {eval_type}')
 
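Both the C-Eval and CMMLU adapters now assemble the final question from a registered `prompt_template` instead of a hard-coded prefix; a quick illustration of that formatting step (the subject and query values are made up):

```python
# Stand-in values; only the str.format step used by the adapters is shown.
prompt_template = '以下是关于{subset_name}的单项选择题,请直接给出正确答案的选项。\n{query}'

query = '1 + 1 等于几?\nA. 1\nB. 2\nC. 3\nD. 4\n答案:'
print(prompt_template.format(subset_name='小学数学', query=query))
```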
{evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/competition_math/competition_math_adapter.py

@@ -3,10 +3,11 @@
 import glob
 import json
 import os
+from collections import defaultdict
 
 from evalscope.benchmarks import Benchmark, DataAdapter
-from evalscope.
-from evalscope.metrics.
+from evalscope.constants import AnswerKeys
+from evalscope.metrics.math_parser import extract_answer, math_equal, strip_answer_string
 from evalscope.models import ChatGenerationModelAdapter
 from evalscope.utils.logger import get_logger
 
@@ -19,12 +20,12 @@ logger = get_logger()
     name='competition_math',
     dataset_id='modelscope/competition_math',
     model_adapter=ChatGenerationModelAdapter,
-    subset_list=['
-    metric_list=[
+    subset_list=['Level 1', 'Level 2', 'Level 3', 'Level 4', 'Level 5'],
+    metric_list=['AveragePass@1'],
     few_shot_num=4,
     train_split='train',
     eval_split='test',
-    prompt_template='
+    prompt_template='{query}\nPlease reason step by step, and put your final answer within \\boxed{{}}.',
 )
 class CompetitionMathAdapter(DataAdapter):
     """ To be tested for all models. """
@@ -39,8 +40,13 @@ class CompetitionMathAdapter(DataAdapter):
 
         super().__init__(**kwargs)
 
+    def load(self, **kwargs):
+        # default load all levels
+        kwargs['subset_list'] = ['default']
+        return super().load(**kwargs)
+
     def load_from_disk(self, dataset_name_or_path, subset_list, work_dir, **kwargs) -> dict:
-        data_dict
+        data_dict = defaultdict(dict)
         for subset_name in subset_list:
             for split_name in [self.train_split, self.eval_split]:
                 if os.path.exists(dataset_name_or_path):
@@ -53,13 +59,25 @@ class CompetitionMathAdapter(DataAdapter):
                     if os.path.exists(file_path):
                         with open(file_path, 'r') as f:
                             split_data.append(json.load(f))
-
-                    data_dict[subset_name].update({split_name: split_data})
-                else:
-                    data_dict[subset_name] = {split_name: split_data}
+                data_dict[subset_name][split_name] = split_data
 
         return data_dict
 
+    def gen_prompts(self, data_dict: dict) -> dict:
+        res_dict: dict = defaultdict(list)
+
+        # use level as subset
+        for sub_name, sub_data_dict in data_dict.items():
+            for sample_d in sub_data_dict[self.eval_split]:
+                level = sample_d['level']
+                if level not in self.subset_list:
+                    continue
+                prompt_d = self.gen_prompt(input_d=sample_d, few_shot_list=None)
+                prompt_d[AnswerKeys.RAW_INPUT] = sample_d
+                res_dict[level].append(prompt_d)
+
+        return res_dict
+
     def gen_prompt(self, input_d: dict, few_shot_list: list, **kwargs) -> dict:
         """
         Generate the prompt for the model input.
@@ -75,13 +93,13 @@ class CompetitionMathAdapter(DataAdapter):
             {'data': [prompt]}
         """
         use_fewshot = self.few_shot_num > 0
-
-
-        return {'data': [full_prompt], 'system_prompt': self.
+        query = self._generate_prompt(input_d, use_fewshot=use_fewshot)
+        full_prompt = self.prompt_template.format(query=query)
+        return {'data': [full_prompt], 'system_prompt': self.system_prompt}
 
     def get_gold_answer(self, input_d: dict) -> str:
         # Extract the gold answer from the input dict.
-        return
+        return strip_answer_string(extract_answer(input_d['solution']))
 
     def parse_pred_result(self, result: str, raw_input_d: dict = None, eval_type: str = 'checkpoint') -> str:
         """
@@ -96,18 +114,11 @@ class CompetitionMathAdapter(DataAdapter):
             The parsed answer. Depending on the dataset. Usually a string for chat.
         """
         # Note: Use same extraction method for both of checkpoint/service/custom
-
-            result = remove_boxed(last_boxed_only_string(result))
-        except Exception:
-            return None
+        result = strip_answer_string(extract_answer(result))
         return result
 
     def match(self, gold: str, pred: str) -> float:
-
-        if is_equiv(pred, gold):
-            res = 1
-
-        return res
+        return math_equal(pred, gold)
 
     @classmethod
     def _generate_prompt(cls, input_d: dict, use_fewshot: bool = True) -> str:
{evalscope-0.10.1 → evalscope-0.11.0}/evalscope/benchmarks/data_adapter.py

@@ -2,10 +2,10 @@
 import os.path
 import random
 from abc import ABC, abstractmethod
-from typing import Any, List, Optional
+from typing import Any, List, Optional, Union
 
 from evalscope.constants import DEFAULT_DATASET_CACHE_DIR, AnswerKeys, EvalType, HubType
-from evalscope.metrics import
+from evalscope.metrics.named_metrics import metric_registry
 from evalscope.report import Report, ReportGenerator
 from evalscope.utils.logger import get_logger
 
@@ -16,12 +16,14 @@ class DataAdapter(ABC):
 
     def __init__(self,
                  name: str,
+                 dataset_id: str,
                  subset_list: list,
-                 metric_list: List[
+                 metric_list: List[str],
                  few_shot_num: Optional[int] = 0,
                  train_split: Optional[str] = None,
                  eval_split: Optional[str] = None,
                  prompt_template: Optional[str] = None,
+                 system_prompt: Optional[str] = None,
                  **kwargs):
         """
         Data Adapter for the benchmark. You need to implement the following methods:
@@ -31,6 +33,7 @@ class DataAdapter(ABC):
             - match
         Args:
             name: str, the name of the benchmark.
+            dataset_id: str, the dataset id on ModelScope or local path for the benchmark.
             subset_list: list of subset names for the dataset.
             metric_list: list, the metric list to evaluate the model on specific benchmark.
             few_shot_num: int, number of few-shot examples. Default: 0
@@ -41,17 +44,19 @@ class DataAdapter(ABC):
                 the form of A or B or C or D, do not output explanation:`
         """
         self.name = name
+        self.dataset_id = dataset_id
         self.subset_list = subset_list
         self.metric_list = metric_list
         self.few_shot_num = few_shot_num
         self.train_split = train_split
         self.eval_split = eval_split
         self.prompt_template = prompt_template
+        self.system_prompt = system_prompt
         self.config_kwargs = kwargs
         self.category_map = kwargs.get('category_map', {})
 
     def load(self,
-             dataset_name_or_path: str,
+             dataset_name_or_path: str = None,
              subset_list: list = None,
             work_dir: Optional[str] = DEFAULT_DATASET_CACHE_DIR,
             datasets_hub: str = HubType.MODELSCOPE,
@@ -64,7 +69,7 @@ class DataAdapter(ABC):
             train_dataset, test_dataset: Iterable dataset, object each item of which is a dict.
 
         """
-        dataset_name_or_path = os.path.expanduser(dataset_name_or_path)
+        dataset_name_or_path = os.path.expanduser(dataset_name_or_path or self.dataset_id)
         subset_list = subset_list or self.subset_list
 
         # Try to load dataset from local disk
@@ -156,7 +161,7 @@ class DataAdapter(ABC):
         else:
             return data_list[:k]
 
-    def compute_metric(self, review_res_list: list) -> List[dict]:
+    def compute_metric(self, review_res_list: Union[dict, list]) -> List[dict]:
         """
         Compute evaluation result by specific metrics.
 
@@ -170,14 +175,15 @@ class DataAdapter(ABC):
             raise ValueError('No metric list found for the benchmark.')
 
         res_list = []
-        for
+        for metric_str in self.metric_list:
+            metric = metric_registry.get(metric_str)
             metric_name = metric.name
             metric_func = metric.object
-
-
-
-
-            })
+            if isinstance(review_res_list, dict):
+                review_res = review_res_list.get(metric_name, [])
+            else:
+                review_res = review_res_list
+            res_list.append({'metric_name': metric_name, 'score': metric_func(review_res), 'num': len(review_res)})
        return res_list
 
     def gen_report(self, subset_score_map: dict, report_name: str = None, **kwargs) -> Report:
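`compute_metric` now resolves metric names through the string-keyed registry introduced in the new `evalscope/metrics/named_metrics.py`. A self-contained sketch of that lookup pattern, using stand-in registry contents rather than the real module:

```python
# Illustration only: mirrors the metric_str -> (name, object) lookup and the
# dict-or-list handling of review_res_list shown in the hunk above.
from collections import namedtuple

Metric = namedtuple('Metric', ['name', 'object'])

metric_registry = {
    'AverageAccuracy': Metric('AverageAccuracy', lambda xs: sum(xs) / len(xs)),
    'AveragePass@1': Metric('AveragePass@1', lambda xs: sum(xs) / len(xs)),
}

def compute_metric(metric_list, review_res_list):
    res_list = []
    for metric_str in metric_list:
        metric = metric_registry[metric_str]
        review_res = (review_res_list.get(metric.name, [])
                      if isinstance(review_res_list, dict) else review_res_list)
        res_list.append({'metric_name': metric.name,
                         'score': metric.object(review_res),
                         'num': len(review_res)})
    return res_list

print(compute_metric(['AveragePass@1'], [1.0, 0.0, 1.0]))
```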