evalscope 0.9.0__tar.gz → 0.10.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of evalscope might be problematic.

Files changed (312)
  1. {evalscope-0.9.0/evalscope.egg-info → evalscope-0.10.1}/PKG-INFO +84 -7
  2. {evalscope-0.9.0 → evalscope-0.10.1}/README.md +78 -6
  3. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/arguments.py +1 -0
  4. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/arc/arc_adapter.py +3 -5
  5. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/bbh_adapter.py +3 -3
  6. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/benchmark.py +1 -1
  7. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/ceval/ceval_adapter.py +5 -82
  8. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/cmmlu/cmmlu_adapter.py +5 -79
  9. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/competition_math/competition_math_adapter.py +4 -4
  10. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/data_adapter.py +69 -70
  11. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/general_qa/general_qa_adapter.py +10 -63
  12. evalscope-0.10.1/evalscope/benchmarks/gpqa/chain_of_thought.txt +81 -0
  13. evalscope-0.10.1/evalscope/benchmarks/gpqa/gpqa_adapter.py +103 -0
  14. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/gsm8k/gsm8k_adapter.py +4 -5
  15. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/hellaswag/hellaswag_adapter.py +12 -6
  16. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/humaneval/humaneval_adapter.py +3 -4
  17. evalscope-0.10.1/evalscope/benchmarks/ifeval/ifeval_adapter.py +56 -0
  18. evalscope-0.10.1/evalscope/benchmarks/ifeval/instructions.py +1477 -0
  19. evalscope-0.10.1/evalscope/benchmarks/ifeval/instructions_registry.py +188 -0
  20. evalscope-0.10.1/evalscope/benchmarks/ifeval/instructions_util.py +1670 -0
  21. evalscope-0.10.1/evalscope/benchmarks/ifeval/utils.py +134 -0
  22. evalscope-0.10.1/evalscope/benchmarks/iquiz/iquiz_adapter.py +63 -0
  23. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/mmlu/mmlu_adapter.py +8 -84
  24. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/mmlu_pro/mmlu_pro_adapter.py +2 -2
  25. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/race/race_adapter.py +4 -73
  26. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/trivia_qa/trivia_qa_adapter.py +3 -6
  27. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/truthful_qa/truthful_qa_adapter.py +8 -57
  28. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/cli/cli.py +2 -0
  29. evalscope-0.10.1/evalscope/cli/start_app.py +30 -0
  30. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/collections/evaluator.py +82 -62
  31. evalscope-0.10.1/evalscope/collections/sampler.py +138 -0
  32. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/collections/schema.py +14 -10
  33. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/constants.py +4 -0
  34. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/evaluator/evaluator.py +22 -13
  35. evalscope-0.10.1/evalscope/metrics/__init__.py +4 -0
  36. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/metrics/metrics.py +11 -2
  37. evalscope-0.10.1/evalscope/metrics/named_metrics.py +17 -0
  38. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/models/chat_adapter.py +2 -0
  39. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/models/server_adapter.py +11 -4
  40. evalscope-0.10.1/evalscope/perf/__init__.py +1 -0
  41. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/main.py +0 -1
  42. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/plugin/api/custom_api.py +1 -1
  43. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/plugin/api/openai_api.py +1 -1
  44. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/plugin/datasets/flickr8k.py +1 -1
  45. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/plugin/datasets/longalpaca.py +1 -1
  46. evalscope-0.10.1/evalscope/perf/utils/__init__.py +0 -0
  47. evalscope-0.10.1/evalscope/report/__init__.py +5 -0
  48. evalscope-0.10.1/evalscope/report/app.py +693 -0
  49. evalscope-0.10.1/evalscope/report/combinator.py +73 -0
  50. evalscope-0.10.1/evalscope/report/generator.py +80 -0
  51. evalscope-0.10.1/evalscope/report/utils.py +133 -0
  52. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/run.py +16 -11
  53. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/summarizer.py +1 -1
  54. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/utils/chat_service.py +1 -1
  55. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/utils/logger.py +1 -0
  56. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/utils/model_utils.py +5 -2
  57. evalscope-0.10.1/evalscope/version.py +4 -0
  58. {evalscope-0.9.0 → evalscope-0.10.1/evalscope.egg-info}/PKG-INFO +84 -7
  59. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope.egg-info/SOURCES.txt +20 -4
  60. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope.egg-info/requires.txt +6 -0
  61. evalscope-0.10.1/requirements/app.txt +2 -0
  62. {evalscope-0.9.0 → evalscope-0.10.1}/setup.py +2 -0
  63. {evalscope-0.9.0 → evalscope-0.10.1}/tests/cli/test_collection.py +11 -7
  64. {evalscope-0.9.0 → evalscope-0.10.1}/tests/cli/test_run.py +13 -4
  65. evalscope-0.10.1/tests/rag/__init__.py +0 -0
  66. evalscope-0.9.0/evalscope/collections/sampler.py +0 -132
  67. evalscope-0.9.0/evalscope/metrics/__init__.py +0 -7
  68. evalscope-0.9.0/evalscope/tools/combine_reports.py +0 -133
  69. evalscope-0.9.0/evalscope/tools/gen_mmlu_subject_mapping.py +0 -90
  70. evalscope-0.9.0/evalscope/version.py +0 -4
  71. evalscope-0.9.0/tests/vlm/__init__.py +0 -1
  72. {evalscope-0.9.0 → evalscope-0.10.1}/LICENSE +0 -0
  73. {evalscope-0.9.0 → evalscope-0.10.1}/MANIFEST.in +0 -0
  74. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/__init__.py +0 -0
  75. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/__init__.py +0 -0
  76. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/base.py +0 -0
  77. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/opencompass/__init__.py +0 -0
  78. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/opencompass/api_meta_template.py +0 -0
  79. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/opencompass/backend_manager.py +0 -0
  80. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/opencompass/tasks/__init__.py +0 -0
  81. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/opencompass/tasks/eval_api.py +0 -0
  82. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/opencompass/tasks/eval_datasets.py +0 -0
  83. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/__init__.py +0 -0
  84. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/backend_manager.py +0 -0
  85. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/clip_benchmark/__init__.py +0 -0
  86. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/clip_benchmark/arguments.py +0 -0
  87. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/clip_benchmark/dataset_builder.py +0 -0
  88. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/clip_benchmark/task_template.py +0 -0
  89. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/__init__.py +0 -0
  90. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/image_caption.py +0 -0
  91. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_classification.py +0 -0
  92. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_retrieval.py +0 -0
  93. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/clip_benchmark/utils/webdataset_convert.py +0 -0
  94. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/clip_benchmark/utils/webdatasets.txt +0 -0
  95. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/cmteb/__init__.py +0 -0
  96. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/cmteb/arguments.py +0 -0
  97. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/cmteb/base.py +0 -0
  98. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/cmteb/task_template.py +0 -0
  99. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/cmteb/tasks/Classification.py +0 -0
  100. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/cmteb/tasks/Clustering.py +0 -0
  101. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/cmteb/tasks/CustomTask.py +0 -0
  102. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/cmteb/tasks/PairClassification.py +0 -0
  103. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/cmteb/tasks/Reranking.py +0 -0
  104. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/cmteb/tasks/Retrieval.py +0 -0
  105. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/cmteb/tasks/STS.py +0 -0
  106. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/cmteb/tasks/__init__.py +0 -0
  107. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/ragas/__init__.py +0 -0
  108. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/ragas/arguments.py +0 -0
  109. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/ragas/prompts/persona_prompt.py +0 -0
  110. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/ragas/task_template.py +0 -0
  111. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/ragas/tasks/__init__.py +0 -0
  112. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/ragas/tasks/build_distribution.py +0 -0
  113. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/ragas/tasks/build_transform.py +0 -0
  114. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/ragas/tasks/testset_generation.py +0 -0
  115. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/ragas/tasks/translate_prompt.py +0 -0
  116. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/utils/__init__.py +0 -0
  117. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/utils/clip.py +0 -0
  118. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/utils/embedding.py +0 -0
  119. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/utils/llm.py +0 -0
  120. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/rag_eval/utils/tools.py +0 -0
  121. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/vlm_eval_kit/__init__.py +0 -0
  122. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/vlm_eval_kit/backend_manager.py +0 -0
  123. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/backend/vlm_eval_kit/custom_dataset.py +0 -0
  124. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/__init__.py +0 -0
  125. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/arc/__init__.py +0 -0
  126. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/arc/ai2_arc.py +0 -0
  127. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/__init__.py +0 -0
  128. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/boolean_expressions.txt +0 -0
  129. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/causal_judgement.txt +0 -0
  130. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/date_understanding.txt +0 -0
  131. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/disambiguation_qa.txt +0 -0
  132. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/dyck_languages.txt +0 -0
  133. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/formal_fallacies.txt +0 -0
  134. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/geometric_shapes.txt +0 -0
  135. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/hyperbaton.txt +0 -0
  136. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_five_objects.txt +0 -0
  137. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_seven_objects.txt +0 -0
  138. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_three_objects.txt +0 -0
  139. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/movie_recommendation.txt +0 -0
  140. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/multistep_arithmetic_two.txt +0 -0
  141. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/navigate.txt +0 -0
  142. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/object_counting.txt +0 -0
  143. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/penguins_in_a_table.txt +0 -0
  144. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/reasoning_about_colored_objects.txt +0 -0
  145. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/ruin_names.txt +0 -0
  146. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/salient_translation_error_detection.txt +0 -0
  147. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/snarks.txt +0 -0
  148. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/sports_understanding.txt +0 -0
  149. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/temporal_sequences.txt +0 -0
  150. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_five_objects.txt +0 -0
  151. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_seven_objects.txt +0 -0
  152. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_three_objects.txt +0 -0
  153. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/web_of_lies.txt +0 -0
  154. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/cot_prompts/word_sorting.txt +0 -0
  155. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/ceval/__init__.py +0 -0
  156. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/ceval/ceval_exam.py +0 -0
  157. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/ceval/samples.jsonl +0 -0
  158. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/cmmlu/__init__.py +0 -0
  159. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/cmmlu/cmmlu.py +0 -0
  160. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/cmmlu/samples.jsonl +0 -0
  161. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/competition_math/__init__.py +0 -0
  162. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/competition_math/competition_math.py +0 -0
  163. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/general_qa/__init__.py +0 -0
  164. {evalscope-0.9.0/evalscope/benchmarks/mmlu_pro → evalscope-0.10.1/evalscope/benchmarks/gpqa}/__init__.py +0 -0
  165. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/gsm8k/__init__.py +0 -0
  166. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/gsm8k/gsm8k.py +0 -0
  167. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/hellaswag/__init__.py +0 -0
  168. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/hellaswag/hellaswag.py +0 -0
  169. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/humaneval/__init__.py +0 -0
  170. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/humaneval/humaneval.py +0 -0
  171. {evalscope-0.9.0/evalscope/perf → evalscope-0.10.1/evalscope/benchmarks/ifeval}/__init__.py +0 -0
  172. {evalscope-0.9.0/evalscope/perf/utils → evalscope-0.10.1/evalscope/benchmarks/iquiz}/__init__.py +0 -0
  173. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/mmlu/__init__.py +0 -0
  174. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/mmlu/mmlu.py +0 -0
  175. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/mmlu/samples.jsonl +0 -0
  176. {evalscope-0.9.0/tests/rag → evalscope-0.10.1/evalscope/benchmarks/mmlu_pro}/__init__.py +0 -0
  177. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/race/__init__.py +0 -0
  178. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/race/race.py +0 -0
  179. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/race/samples.jsonl +0 -0
  180. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/trivia_qa/__init__.py +0 -0
  181. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/trivia_qa/samples.jsonl +0 -0
  182. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/trivia_qa/trivia_qa.py +0 -0
  183. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/truthful_qa/__init__.py +0 -0
  184. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/truthful_qa/truthful_qa.py +0 -0
  185. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/cli/__init__.py +0 -0
  186. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/cli/base.py +0 -0
  187. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/cli/start_eval.py +0 -0
  188. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/cli/start_perf.py +0 -0
  189. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/cli/start_server.py +0 -0
  190. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/collections/__init__.py +0 -0
  191. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/config.py +0 -0
  192. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/evaluator/__init__.py +0 -0
  193. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/evaluator/rating_eval.py +0 -0
  194. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/evaluator/reviewer/__init__.py +0 -0
  195. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/evaluator/reviewer/auto_reviewer.py +0 -0
  196. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/metrics/bundled_rouge_score/__init__.py +0 -0
  197. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/metrics/bundled_rouge_score/rouge_scorer.py +0 -0
  198. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/metrics/code_metric.py +0 -0
  199. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/metrics/math_accuracy.py +0 -0
  200. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/metrics/resources/gpt2-zhcn3-v4.bpe +0 -0
  201. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/metrics/resources/gpt2-zhcn3-v4.json +0 -0
  202. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/metrics/rouge_metric.py +0 -0
  203. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/models/__init__.py +0 -0
  204. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/models/base_adapter.py +0 -0
  205. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/models/choice_adapter.py +0 -0
  206. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/models/custom/__init__.py +0 -0
  207. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/models/custom/custom_model.py +0 -0
  208. /evalscope-0.9.0/evalscope/tools/rewrite_eval_results.py → /evalscope-0.10.1/evalscope/models/custom/dummy_model.py +0 -0
  209. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/models/custom_adapter.py +0 -0
  210. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/models/local_model.py +0 -0
  211. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/models/model.py +0 -0
  212. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/arguments.py +0 -0
  213. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/benchmark.py +0 -0
  214. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/http_client.py +0 -0
  215. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/plugin/__init__.py +0 -0
  216. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/plugin/api/__init__.py +0 -0
  217. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/plugin/api/base.py +0 -0
  218. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/plugin/api/dashscope_api.py +0 -0
  219. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/plugin/datasets/__init__.py +0 -0
  220. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/plugin/datasets/base.py +0 -0
  221. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/plugin/datasets/custom.py +0 -0
  222. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/plugin/datasets/line_by_line.py +0 -0
  223. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/plugin/datasets/openqa.py +0 -0
  224. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/plugin/datasets/speed_benchmark.py +0 -0
  225. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/plugin/registry.py +0 -0
  226. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/utils/analysis_result.py +0 -0
  227. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/utils/benchmark_util.py +0 -0
  228. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/utils/db_util.py +0 -0
  229. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/utils/handler.py +0 -0
  230. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/perf/utils/local_server.py +0 -0
  231. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/registry/__init__.py +0 -0
  232. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/registry/config/cfg_arena.yaml +0 -0
  233. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/registry/config/cfg_arena_zhihu.yaml +0 -0
  234. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/registry/config/cfg_pairwise_baseline.yaml +0 -0
  235. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/registry/config/cfg_single.yaml +0 -0
  236. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/registry/data/prompt_template/lmsys_v2.jsonl +0 -0
  237. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/registry/data/prompt_template/prompt_templates.jsonl +0 -0
  238. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/registry/data/qa_browser/battle.jsonl +0 -0
  239. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/registry/data/qa_browser/category_mapping.yaml +0 -0
  240. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/registry/data/question.jsonl +0 -0
  241. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/registry/tasks/arc.yaml +0 -0
  242. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/registry/tasks/bbh.yaml +0 -0
  243. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/registry/tasks/bbh_mini.yaml +0 -0
  244. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/registry/tasks/ceval.yaml +0 -0
  245. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/registry/tasks/ceval_mini.yaml +0 -0
  246. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/registry/tasks/cmmlu.yaml +0 -0
  247. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/registry/tasks/eval_qwen-7b-chat_v100.yaml +0 -0
  248. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/registry/tasks/general_qa.yaml +0 -0
  249. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/registry/tasks/gsm8k.yaml +0 -0
  250. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/registry/tasks/mmlu.yaml +0 -0
  251. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/registry/tasks/mmlu_mini.yaml +0 -0
  252. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/run_arena.py +0 -0
  253. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/__init__.py +0 -0
  254. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/longbench_write/README.md +0 -0
  255. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/longbench_write/__init__.py +0 -0
  256. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/longbench_write/default_task.json +0 -0
  257. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/longbench_write/default_task.yaml +0 -0
  258. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/longbench_write/eval.py +0 -0
  259. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/longbench_write/infer.py +0 -0
  260. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/longbench_write/longbench_write.py +0 -0
  261. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/longbench_write/resources/__init__.py +0 -0
  262. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/longbench_write/resources/judge.txt +0 -0
  263. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/longbench_write/resources/longbench_write.jsonl +0 -0
  264. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/longbench_write/resources/longbench_write_en.jsonl +0 -0
  265. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/longbench_write/resources/longwrite_ruler.jsonl +0 -0
  266. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/longbench_write/tools/__init__.py +0 -0
  267. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/longbench_write/tools/data_etl.py +0 -0
  268. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/longbench_write/tools/openai_api.py +0 -0
  269. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/longbench_write/utils.py +0 -0
  270. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/toolbench_static/README.md +0 -0
  271. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/toolbench_static/__init__.py +0 -0
  272. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/toolbench_static/config_default.json +0 -0
  273. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/toolbench_static/config_default.yaml +0 -0
  274. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/toolbench_static/eval.py +0 -0
  275. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/toolbench_static/infer.py +0 -0
  276. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/toolbench_static/llm/__init__.py +0 -0
  277. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/toolbench_static/llm/swift_infer.py +0 -0
  278. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/toolbench_static/requirements.txt +0 -0
  279. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/third_party/toolbench_static/toolbench_static.py +0 -0
  280. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/utils/__init__.py +0 -0
  281. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/utils/arena_utils.py +0 -0
  282. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/utils/completion_parsers.py +0 -0
  283. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/utils/io_utils.py +0 -0
  284. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope/utils/utils.py +0 -0
  285. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope.egg-info/dependency_links.txt +0 -0
  286. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope.egg-info/entry_points.txt +0 -0
  287. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope.egg-info/not-zip-safe +0 -0
  288. {evalscope-0.9.0 → evalscope-0.10.1}/evalscope.egg-info/top_level.txt +0 -0
  289. {evalscope-0.9.0 → evalscope-0.10.1}/requirements/docs.txt +0 -0
  290. {evalscope-0.9.0 → evalscope-0.10.1}/requirements/framework.txt +0 -0
  291. {evalscope-0.9.0 → evalscope-0.10.1}/requirements/inner.txt +0 -0
  292. {evalscope-0.9.0 → evalscope-0.10.1}/requirements/opencompass.txt +0 -0
  293. {evalscope-0.9.0 → evalscope-0.10.1}/requirements/perf.txt +0 -0
  294. {evalscope-0.9.0 → evalscope-0.10.1}/requirements/rag.txt +0 -0
  295. {evalscope-0.9.0 → evalscope-0.10.1}/requirements/tests.txt +0 -0
  296. {evalscope-0.9.0 → evalscope-0.10.1}/requirements/vlmeval.txt +0 -0
  297. {evalscope-0.9.0 → evalscope-0.10.1}/requirements.txt +0 -0
  298. {evalscope-0.9.0 → evalscope-0.10.1}/setup.cfg +0 -0
  299. {evalscope-0.9.0/evalscope/tools → evalscope-0.10.1/tests}/__init__.py +0 -0
  300. {evalscope-0.9.0/tests → evalscope-0.10.1/tests/cli}/__init__.py +0 -0
  301. {evalscope-0.9.0/tests/cli → evalscope-0.10.1/tests/perf}/__init__.py +0 -0
  302. {evalscope-0.9.0 → evalscope-0.10.1}/tests/perf/test_perf.py +0 -0
  303. {evalscope-0.9.0 → evalscope-0.10.1}/tests/rag/test_clip_benchmark.py +0 -0
  304. {evalscope-0.9.0 → evalscope-0.10.1}/tests/rag/test_mteb.py +0 -0
  305. {evalscope-0.9.0 → evalscope-0.10.1}/tests/rag/test_ragas.py +0 -0
  306. {evalscope-0.9.0/tests/perf → evalscope-0.10.1/tests/swift}/__init__.py +0 -0
  307. {evalscope-0.9.0 → evalscope-0.10.1}/tests/swift/test_run_swift_eval.py +0 -0
  308. {evalscope-0.9.0 → evalscope-0.10.1}/tests/swift/test_run_swift_vlm_eval.py +0 -0
  309. {evalscope-0.9.0 → evalscope-0.10.1}/tests/swift/test_run_swift_vlm_jugde_eval.py +0 -0
  310. {evalscope-0.9.0 → evalscope-0.10.1}/tests/test_run_all.py +0 -0
  311. {evalscope-0.9.0/tests/swift → evalscope-0.10.1/tests/vlm}/__init__.py +0 -0
  312. {evalscope-0.9.0 → evalscope-0.10.1}/tests/vlm/test_vlmeval.py +0 -0
{evalscope-0.9.0/evalscope.egg-info → evalscope-0.10.1}/PKG-INFO
@@ -1,6 +1,6 @@
  Metadata-Version: 2.1
  Name: evalscope
- Version: 0.9.0
+ Version: 0.10.1
  Summary: EvalScope: Lightweight LLMs Evaluation Framework
  Home-page: https://github.com/modelscope/evalscope
  Author: ModelScope team
@@ -63,6 +63,9 @@ Requires-Dist: numpy; extra == "perf"
  Requires-Dist: sse_starlette; extra == "perf"
  Requires-Dist: transformers; extra == "perf"
  Requires-Dist: unicorn; extra == "perf"
+ Provides-Extra: app
+ Requires-Dist: gradio>=5.4.0; extra == "app"
+ Requires-Dist: plotly>=5.23.0; extra == "app"
  Provides-Extra: inner
  Requires-Dist: absl-py; extra == "inner"
  Requires-Dist: accelerate; extra == "inner"
@@ -133,6 +136,8 @@ Requires-Dist: numpy; extra == "all"
  Requires-Dist: sse_starlette; extra == "all"
  Requires-Dist: transformers; extra == "all"
  Requires-Dist: unicorn; extra == "all"
+ Requires-Dist: gradio>=5.4.0; extra == "all"
+ Requires-Dist: plotly>=5.23.0; extra == "all"

  <p align="center">
  <br>
@@ -210,6 +215,8 @@ Please scan the QR code below to join our community groups:


  ## 🎉 News
+ - 🔥 **[2025.01.20]** Support for visualizing evaluation results, including single model evaluation results and multi-model comparison, refer to the [📖 Visualizing Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html) for more details; Added [`iquiz`](https://modelscope.cn/datasets/AI-ModelScope/IQuiz/summary) evaluation example, evaluating the IQ and EQ of the model.
+ - 🔥 **[2025.01.07]** Native backend: Support for model API evaluation is now available. Refer to the [📖 Model API Evaluation Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#api) for more details. Additionally, support for the `ifeval` evaluation benchmark has been added.
  - 🔥🔥 **[2024.12.31]** Support for adding benchmark evaluations, refer to the [📖 Benchmark Evaluation Addition Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/add_benchmark.html); support for custom mixed dataset evaluations, allowing for more comprehensive model evaluations with less data, refer to the [📖 Mixed Dataset Evaluation Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/collection/index.html).
  - 🔥 **[2024.12.13]** Model evaluation optimization: no need to pass the `--template-type` parameter anymore; supports starting evaluation with `evalscope eval --args`. Refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html) for more details.
  - 🔥 **[2024.11.26]** The model inference service performance evaluator has been completely refactored: it now supports local inference service startup and Speed Benchmark; asynchronous call error handling has been optimized. For more details, refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/index.html).
@@ -374,15 +381,85 @@ run_task(task_cfg="config.json")
  - `--limit`: Maximum amount of evaluation data for each dataset. If not specified, it defaults to evaluating all data. Can be used for quick validation

  ### Output Results
+ ```text
+ +-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
+ | Model Name | Dataset Name | Metric Name | Category Name | Subset Name | Num | Score |
+ +=======================+================+=================+=================+===============+=======+=========+
+ | Qwen2.5-0.5B-Instruct | gsm8k | AverageAccuracy | default | main | 5 | 0.4 |
+ +-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
+ | Qwen2.5-0.5B-Instruct | ai2_arc | AverageAccuracy | default | ARC-Easy | 5 | 0.8 |
+ +-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
+ | Qwen2.5-0.5B-Instruct | ai2_arc | AverageAccuracy | default | ARC-Challenge | 5 | 0.4 |
+ +-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
+ ```
+
+ ## 📈 Visualization of Evaluation Results
+
+ 1. Install the dependencies required for visualization, including gradio, plotly, etc.
+ ```bash
+ pip install 'evalscope[app]'
  ```
- +-----------------------+-------------------+-----------------+
- | Model | ai2_arc | gsm8k |
- +=======================+===================+=================+
- | Qwen2.5-0.5B-Instruct | (ai2_arc/acc) 0.6 | (gsm8k/acc) 0.6 |
- +-----------------------+-------------------+-----------------+
+
+ 2. Start the Visualization Service
+
+ Run the following command to start the visualization service.
+ ```bash
+ evalscope app
+ ```
+ You can access the visualization service in the browser if the following output appears.
+ ```text
+ * Running on local URL: http://127.0.0.1:7861
+
+ To create a public link, set `share=True` in `launch()`.
  ```

- ## ⚙️ Complex Evaluation
+ <table>
+ <tr>
+ <td style="text-align: center;">
+ <img src="docs/en/get_started/images/setting.png" alt="Setting" style="width: 75%;" />
+ <p>Setting Interface</p>
+ </td>
+ <td style="text-align: center;">
+ <img src="docs/en/get_started/images/model_compare.png" alt="Model Compare" style="width: 100%;" />
+ <p>Model Comparison</p>
+ </td>
+ </tr>
+ <tr>
+ <td style="text-align: center;">
+ <img src="docs/en/get_started/images/report_overview.png" alt="Report Overview" style="width: 100%;" />
+ <p>Report Overview</p>
+ </td>
+ <td style="text-align: center;">
+ <img src="docs/en/get_started/images/report_details.png" alt="Report Details" style="width: 80%;" />
+ <p>Report Details</p>
+ </td>
+ </tr>
+ </table>
+
+ For more details, refer to: [📖 Visualization of Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html)
+
+ ## 🌐 Evaluation of Specified Model API
+
+ Specify the model API service address (api_url) and API Key (api_key) to evaluate the deployed model API service. In this case, the `eval-type` parameter must be specified as `service`, for example:
+
+ For example, to launch a model service using [vLLM](https://github.com/vllm-project/vllm):
+
+ ```shell
+ export VLLM_USE_MODELSCOPE=True && python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-0.5B-Instruct --served-model-name qwen2.5 --trust_remote_code --port 8801
+ ```
+ Then, you can use the following command to evaluate the model API service:
+ ```shell
+ evalscope eval \
+ --model qwen2.5 \
+ --api-url http://127.0.0.1:8801/v1/chat/completions \
+ --api-key EMPTY \
+ --eval-type service \
+ --datasets gsm8k \
+ --limit 10
+ ```
+
+ ## ⚙️ Custom Parameter Evaluation
+
  For more customized evaluations, such as customizing model parameters or dataset parameters, you can use the following command. The evaluation startup method is the same as simple evaluation. Below shows how to start the evaluation using the `eval` command:

  ```shell
{evalscope-0.9.0 → evalscope-0.10.1}/README.md
@@ -74,6 +74,8 @@ Please scan the QR code below to join our community groups:


  ## 🎉 News
+ - 🔥 **[2025.01.20]** Support for visualizing evaluation results, including single model evaluation results and multi-model comparison, refer to the [📖 Visualizing Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html) for more details; Added [`iquiz`](https://modelscope.cn/datasets/AI-ModelScope/IQuiz/summary) evaluation example, evaluating the IQ and EQ of the model.
+ - 🔥 **[2025.01.07]** Native backend: Support for model API evaluation is now available. Refer to the [📖 Model API Evaluation Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#api) for more details. Additionally, support for the `ifeval` evaluation benchmark has been added.
  - 🔥🔥 **[2024.12.31]** Support for adding benchmark evaluations, refer to the [📖 Benchmark Evaluation Addition Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/add_benchmark.html); support for custom mixed dataset evaluations, allowing for more comprehensive model evaluations with less data, refer to the [📖 Mixed Dataset Evaluation Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/collection/index.html).
  - 🔥 **[2024.12.13]** Model evaluation optimization: no need to pass the `--template-type` parameter anymore; supports starting evaluation with `evalscope eval --args`. Refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html) for more details.
  - 🔥 **[2024.11.26]** The model inference service performance evaluator has been completely refactored: it now supports local inference service startup and Speed Benchmark; asynchronous call error handling has been optimized. For more details, refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/index.html).
@@ -238,15 +240,85 @@ run_task(task_cfg="config.json")
  - `--limit`: Maximum amount of evaluation data for each dataset. If not specified, it defaults to evaluating all data. Can be used for quick validation

  ### Output Results
+ ```text
+ +-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
+ | Model Name | Dataset Name | Metric Name | Category Name | Subset Name | Num | Score |
+ +=======================+================+=================+=================+===============+=======+=========+
+ | Qwen2.5-0.5B-Instruct | gsm8k | AverageAccuracy | default | main | 5 | 0.4 |
+ +-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
+ | Qwen2.5-0.5B-Instruct | ai2_arc | AverageAccuracy | default | ARC-Easy | 5 | 0.8 |
+ +-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
+ | Qwen2.5-0.5B-Instruct | ai2_arc | AverageAccuracy | default | ARC-Challenge | 5 | 0.4 |
+ +-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
+ ```
+
+ ## 📈 Visualization of Evaluation Results
+
+ 1. Install the dependencies required for visualization, including gradio, plotly, etc.
+ ```bash
+ pip install 'evalscope[app]'
  ```
- +-----------------------+-------------------+-----------------+
- | Model | ai2_arc | gsm8k |
- +=======================+===================+=================+
- | Qwen2.5-0.5B-Instruct | (ai2_arc/acc) 0.6 | (gsm8k/acc) 0.6 |
- +-----------------------+-------------------+-----------------+
+
+ 2. Start the Visualization Service
+
+ Run the following command to start the visualization service.
+ ```bash
+ evalscope app
+ ```
+ You can access the visualization service in the browser if the following output appears.
+ ```text
+ * Running on local URL: http://127.0.0.1:7861
+
+ To create a public link, set `share=True` in `launch()`.
  ```

- ## ⚙️ Complex Evaluation
+ <table>
+ <tr>
+ <td style="text-align: center;">
+ <img src="docs/en/get_started/images/setting.png" alt="Setting" style="width: 75%;" />
+ <p>Setting Interface</p>
+ </td>
+ <td style="text-align: center;">
+ <img src="docs/en/get_started/images/model_compare.png" alt="Model Compare" style="width: 100%;" />
+ <p>Model Comparison</p>
+ </td>
+ </tr>
+ <tr>
+ <td style="text-align: center;">
+ <img src="docs/en/get_started/images/report_overview.png" alt="Report Overview" style="width: 100%;" />
+ <p>Report Overview</p>
+ </td>
+ <td style="text-align: center;">
+ <img src="docs/en/get_started/images/report_details.png" alt="Report Details" style="width: 80%;" />
+ <p>Report Details</p>
+ </td>
+ </tr>
+ </table>
+
+ For more details, refer to: [📖 Visualization of Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html)
+
+ ## 🌐 Evaluation of Specified Model API
+
+ Specify the model API service address (api_url) and API Key (api_key) to evaluate the deployed model API service. In this case, the `eval-type` parameter must be specified as `service`, for example:
+
+ For example, to launch a model service using [vLLM](https://github.com/vllm-project/vllm):
+
+ ```shell
+ export VLLM_USE_MODELSCOPE=True && python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-0.5B-Instruct --served-model-name qwen2.5 --trust_remote_code --port 8801
+ ```
+ Then, you can use the following command to evaluate the model API service:
+ ```shell
+ evalscope eval \
+ --model qwen2.5 \
+ --api-url http://127.0.0.1:8801/v1/chat/completions \
+ --api-key EMPTY \
+ --eval-type service \
+ --datasets gsm8k \
+ --limit 10
+ ```
+
+ ## ⚙️ Custom Parameter Evaluation
+
  For more customized evaluations, such as customizing model parameters or dataset parameters, you can use the following command. The evaluation startup method is the same as simple evaluation. Below shows how to start the evaluation using the `eval` command:

  ```shell
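
The API-service evaluation shown in the README diff above can also be driven from Python through `run_task`, which the README already calls with a JSON config. Below is a minimal sketch under the assumption that `run_task` accepts a dict whose keys mirror the CLI flags (`api_url`, `api_key`, `eval_type`, `datasets`, `limit`); these key names are an assumption drawn from the flags, not something this diff confirms.

```python
# Hedged sketch: evaluate an OpenAI-compatible endpoint from Python instead of the CLI.
# Assumes run_task accepts a dict whose keys mirror the CLI flags shown above;
# check the evalscope documentation if the names differ.
from evalscope.run import run_task

task_cfg = {
    'model': 'qwen2.5',                                      # --model
    'api_url': 'http://127.0.0.1:8801/v1/chat/completions',  # --api-url
    'api_key': 'EMPTY',                                      # --api-key
    'eval_type': 'service',                                  # --eval-type
    'datasets': ['gsm8k'],                                   # --datasets
    'limit': 10,                                             # --limit
}

if __name__ == '__main__':
    run_task(task_cfg=task_cfg)
```

Keeping the config as data makes it easy to version alongside the `config.json` variant the README mentions.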
{evalscope-0.9.0 → evalscope-0.10.1}/evalscope/arguments.py
@@ -33,6 +33,7 @@ def add_argument(parser: argparse.ArgumentParser):
  # yapf: disable
  # Model-related arguments
  parser.add_argument('--model', type=str, required=False, help='The model id on modelscope, or local model dir.')
+ parser.add_argument('--model-id', type=str, required=False, help='The model id for model name in report.')
  parser.add_argument('--model-args', type=str, action=ParseStrArgsAction, help='The model args, should be a string.')

  # Template-related arguments
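
The single change here adds a `--model-id` flag so the name shown in reports can differ from the ModelScope id or local path passed via `--model`. The following self-contained sketch uses plain argparse, not evalscope's own parser, just to show how the hyphenated flag surfaces as `args.model_id`.

```python
# Illustrative only: standard argparse behaviour, not evalscope's full parser.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--model', type=str, required=False,
                    help='The model id on modelscope, or local model dir.')
parser.add_argument('--model-id', type=str, required=False,
                    help='The model id for model name in report.')

args = parser.parse_args(['--model', 'Qwen/Qwen2.5-0.5B-Instruct',
                          '--model-id', 'qwen2.5-0.5b'])
# argparse maps the hyphenated flag to an underscored attribute.
print(args.model)     # Qwen/Qwen2.5-0.5B-Instruct
print(args.model_id)  # qwen2.5-0.5b
```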
{evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/arc/arc_adapter.py
@@ -5,7 +5,7 @@ import os

  from evalscope.benchmarks import Benchmark, DataAdapter
  from evalscope.constants import EvalType
- from evalscope.metrics import WeightedAverageAccuracy, exact_match
+ from evalscope.metrics import AverageAccuracy, exact_match
  from evalscope.models import MultiChoiceModelAdapter
  from evalscope.utils import ResponseParser
  from evalscope.utils.logger import get_logger
@@ -20,7 +20,7 @@ logger = get_logger()
  dataset_id='modelscope/ai2_arc',
  model_adapter=MultiChoiceModelAdapter,
  subset_list=['ARC-Easy', 'ARC-Challenge'],
- metric_list=[WeightedAverageAccuracy],
+ metric_list=[AverageAccuracy],
  few_shot_num=0,
  train_split='train',
  eval_split='test',
@@ -109,12 +109,10 @@ class ARCAdapter(DataAdapter):
  few_shot_prompts = [self._generate_prompt(input_d=sample, include_answer=True) for sample in few_shot_list]
  context: str = '\n'.join(few_shot_prompts)

- context = f'{self.prompt_template}\n{context}' if self.prompt_template else context
-
  # context = f'The following are multiple choice questions, please output correct answer in the form of A or B or C or D, do not output explanation:\n {context}'
  full_prompt: str = context + self._generate_prompt(input_d=input_d, include_answer=False)

- return {'data': [full_prompt], 'multi_choices': self.choices}
+ return {'data': [full_prompt], 'multi_choices': self.choices, 'system_prompt': self.prompt_template}

  def get_gold_answer(self, input_d: dict) -> str:
  # Get the gold choice
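
Beyond the metric rename, the substantive change is the prompt contract: instead of prepending `prompt_template` to the few-shot context, the adapter now returns it separately under `system_prompt` and leaves the model adapter to place it in the system role. The sketch below is a toy version of that return shape; the function name and signature are illustrative, not the real `ARCAdapter.gen_prompt`.

```python
# Toy version of the 0.10.1 prompt contract for multiple-choice adapters:
# the few-shot context stays in 'data', while the benchmark's prompt_template
# is passed through untouched as 'system_prompt'. Not the real ARCAdapter.
from typing import List, Optional


def gen_prompt(question: str,
               few_shot_prompts: List[str],
               choices: List[str],
               prompt_template: Optional[str] = None) -> dict:
    context = '\n'.join(few_shot_prompts)
    full_prompt = f'{context}\nQuestion: {question}\nAnswer:'
    return {
        'data': [full_prompt],             # text the model completes
        'multi_choices': choices,          # e.g. ['A', 'B', 'C', 'D']
        'system_prompt': prompt_template,  # may be None; the model adapter decides how to use it
    }


print(gen_prompt('Which gas do plants absorb from the air?',
                 ['Question: ...\nAnswer: B'],
                 ['A', 'B', 'C', 'D'],
                 prompt_template='Answer with a single letter.'))
```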
{evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/bbh/bbh_adapter.py
@@ -7,7 +7,7 @@ import re

  from evalscope.benchmarks import Benchmark, DataAdapter
  from evalscope.constants import AnswerKeys
- from evalscope.metrics import WeightedAverageAccuracy, exact_match
+ from evalscope.metrics import AverageAccuracy, exact_match
  from evalscope.models.chat_adapter import ChatGenerationModelAdapter
  from evalscope.utils import ResponseParser
  from evalscope.utils.logger import get_logger
@@ -63,7 +63,7 @@ SUBSET_LIST = MULTIPLE_CHOICE_LIST + FREE_FORM_LIST
  dataset_id='modelscope/bbh',
  model_adapter=ChatGenerationModelAdapter,
  subset_list=SUBSET_LIST,
- metric_list=[WeightedAverageAccuracy],
+ metric_list=[AverageAccuracy],
  few_shot_num=3,
  train_split=None,
  eval_split='test',
@@ -122,7 +122,7 @@ class BBHAdapter(DataAdapter):
  cot_prompts: str = few_shot_list[0] if len(few_shot_list) > 0 else ''
  full_prompt: str = f"Follow the given examples and answer the question.\n{cot_prompts}\n\nQ: {input_d['input']}\nA: Let's think step by step."

- return {'data': [full_prompt]}
+ return {'data': [full_prompt], 'system_prompt': self.prompt_template}

  def gen_prompts(self, data_dict: dict) -> dict:
  """
{evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/benchmark.py
@@ -22,7 +22,7 @@ class BenchmarkMeta:
  few_shot_random: bool = False
  train_split: Optional[str] = None
  eval_split: Optional[str] = None
- prompt_template: str = ''
+ prompt_template: Optional[str] = None

  def _update(self, args: dict):
  if args.get('local_path'):
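
Typing `prompt_template` as `Optional[str] = None` lets adapters distinguish "no template configured" from a deliberately empty string, which matters now that the template is forwarded as a `system_prompt`. The sketch below shows only the changed field and the kind of check a consumer might perform; `BenchmarkMetaSketch` is a made-up name and the usage is assumed, not taken from this diff.

```python
# Sketch of the changed field only, not the full BenchmarkMeta dataclass.
from dataclasses import dataclass
from typing import Optional


@dataclass
class BenchmarkMetaSketch:
    prompt_template: Optional[str] = None  # was: prompt_template: str = ''


meta = BenchmarkMetaSketch()
if meta.prompt_template is not None:
    print('forward as system_prompt:', meta.prompt_template)
else:
    print('no system prompt configured')  # None is now distinguishable from ''
```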
{evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/ceval/ceval_adapter.py
@@ -4,7 +4,7 @@ import os

  from evalscope.benchmarks import Benchmark, DataAdapter
  from evalscope.constants import EvalType
- from evalscope.metrics import WeightedAverageAccuracy
+ from evalscope.metrics import AverageAccuracy
  from evalscope.metrics.metrics import exact_match, weighted_mean
  from evalscope.models import MultiChoiceModelAdapter
  from evalscope.utils import ResponseParser, normalize_score
@@ -130,7 +130,7 @@ SUBJECT_MAPPING = {
  dataset_id='modelscope/ceval-exam',
  model_adapter=MultiChoiceModelAdapter,
  subset_list=SUBSET_LIST,
- metric_list=[WeightedAverageAccuracy],
+ metric_list=[AverageAccuracy],
  few_shot_num=0,
  train_split='dev',
  eval_split='val',
@@ -145,9 +145,10 @@ class CEVALAdapter(DataAdapter):
  if few_shot_num > 5:
  logger.warning(f'few_shot_num <= 5 for C-Eval, but got {few_shot_num}. Use 5-shot by default.')
  kwargs['few_shot_num'] = 5
-
  super().__init__(**kwargs)

+ self.category_map = {k: v[-1] for k, v in SUBJECT_MAPPING.items()}
+
  def load_from_disk(self, dataset_name_or_path, subset_list, work_dir, **kwargs) -> dict:
  data_dict = {}
  for subset_name in subset_list:
@@ -206,7 +207,7 @@ class CEVALAdapter(DataAdapter):
  subject_name: str = SUBJECT_MAPPING.get(subset_name)[1] if SUBJECT_MAPPING.get(subset_name) else subset_name
  full_prompt = f'以下是中国关于{subject_name}考试的单项选择题,请选出其中的正确答案。\n' + full_prompt

- return {'data': [full_prompt], 'multi_choices': self.choices}
+ return {'data': [full_prompt], 'multi_choices': self.choices, 'system_prompt': self.prompt_template}

  def get_gold_answer(self, input_d: dict) -> str:
  # Get the gold choice
@@ -236,84 +237,6 @@ class CEVALAdapter(DataAdapter):
  def match(self, gold: str, pred: str) -> float:
  return exact_match(gold=gold, pred=pred)

- def gen_report(self, subset_score_map: dict, report_name: str = None) -> dict:
- """
- Generate report for the evaluation.
-
- Args:
- subset_score_map: The subset-score mapping. e.g. {subset_name: (score, num), ...}
- report_name: The user-defined report name.
-
- Returns:
- {
- "name":"C-Eval",
- "metric":"WeightedAverageAccuracy",
- "score":0.3389,
- "category":[
- {
- "name":"STEM",
- "score":0.2528,
- "subset":[
- {
- "name":"computer_network",
- "score":0.2632
- },
- {
- "name":"operating_system",
- "score":0.3157
- },
- {
- "name":"computer_architecture",
- "score":0.4285
- }
- ]
- }
- ],
- "total_num":59
- }
- """
- total_num: int = sum([num for _, num in subset_score_map.values()])
- weighted_avg_acc: float = sum([score * num for score, num in subset_score_map.values()]) / total_num
- weighted_avg_acc = normalize_score(score=weighted_avg_acc)
-
- # Get domain-subject mapping
- subject_review_map = {}
- for subset_name, (subset_score, num) in subset_score_map.items():
- domain_name: str = SUBJECT_MAPPING.get(subset_name)[2] if SUBJECT_MAPPING.get(subset_name) else 'DEFAULT'
- if domain_name in subject_review_map:
- subject_review_map[domain_name].append((subset_name, subset_score, num))
- else:
- subject_review_map[domain_name] = [(subset_name, subset_score, num)]
-
- # Get domain score
- category_list = []
- for domain_name, domain_res_list in subject_review_map.items():
- domain_weighted_avg_acc = sum([score * num for _, score, num in domain_res_list]) / \
- sum([num for _, _, num in domain_res_list])
- domain_weighted_avg_acc = normalize_score(score=domain_weighted_avg_acc)
- category_list.append({
- 'name':
- domain_name,
- 'score':
- domain_weighted_avg_acc,
- 'subset': [{
- 'name': subset_name,
- 'score': normalize_score(score=subset_score)
- } for subset_name, subset_score, _ in domain_res_list]
- })
-
- category_list = sorted(category_list, key=lambda x: x['name'])
-
- # Get final dict of report
- res_map = dict(
- name=report_name or 'ceval',
- metric=self.metric_list[0]['name'],
- score=weighted_avg_acc,
- category=category_list,
- total_num=total_num)
-
- return res_map
-
  @classmethod
  def _format_example(cls, input_d: dict, include_answer=True):
  example = '问题:' + input_d['question']
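
The C-Eval adapter drops its bespoke `gen_report` override; subjects are instead mapped to a domain via `self.category_map`, and category scores can be computed generically (the new `evalscope/report/generator.py` in the file list presumably takes over this role). The sketch below only re-creates the weighted-average grouping the removed code performed, expressed against such a `category_map`; `category_scores` is a hypothetical helper, not evalscope's report generator.

```python
# Re-creation of the arithmetic the removed gen_report performed, expressed
# against a category_map such as {subset: SUBJECT_MAPPING[subset][-1]}.
# Illustration only; evalscope's shared report generator is the real consumer.
from collections import defaultdict
from typing import Dict, Tuple


def category_scores(subset_score_map: Dict[str, Tuple[float, int]],
                    category_map: Dict[str, str]) -> Dict[str, float]:
    """subset_score_map maps subset -> (score, num); returns category -> weighted average."""
    totals = defaultdict(lambda: [0.0, 0])  # category -> [sum of score*num, sum of num]
    for subset, (score, num) in subset_score_map.items():
        category = category_map.get(subset, 'DEFAULT')
        totals[category][0] += score * num
        totals[category][1] += num
    return {cat: round(s / n, 4) for cat, (s, n) in totals.items() if n}


scores = {'computer_network': (0.25, 20), 'operating_system': (0.35, 20)}
cats = {'computer_network': 'STEM', 'operating_system': 'STEM'}
print(category_scores(scores, cats))  # {'STEM': 0.3}, weighted by item counts
```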
{evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/cmmlu/cmmlu_adapter.py
@@ -5,7 +5,7 @@ import os

  from evalscope.benchmarks import Benchmark, DataAdapter
  from evalscope.constants import EvalType
- from evalscope.metrics import WeightedAverageAccuracy, exact_match
+ from evalscope.metrics import AverageAccuracy, exact_match
  from evalscope.models import MultiChoiceModelAdapter
  from evalscope.utils import ResponseParser, normalize_score
  from evalscope.utils.logger import get_logger
@@ -106,7 +106,7 @@ SUBJECT_MAPPING = {
  dataset_id='modelscope/cmmlu',
  model_adapter=MultiChoiceModelAdapter,
  subset_list=SUBSET_LIST,
- metric_list=[WeightedAverageAccuracy],
+ metric_list=[AverageAccuracy],
  few_shot_num=5,
  train_split='dev',
  eval_split='test',
@@ -116,9 +116,10 @@ class CMMLUAdapter(DataAdapter):
  choices = ['A', 'B', 'C', 'D']

  def __init__(self, **kwargs):
-
  super().__init__(**kwargs)

+ self.category_map = {k: v[-1] for k, v in SUBJECT_MAPPING.items()}
+
  def load_from_disk(self, dataset_name_or_path, subset_list, work_dir, **kwargs) -> dict:
  data_dict = {}
  for subset_name in subset_list:
@@ -173,7 +174,7 @@ class CMMLUAdapter(DataAdapter):

  full_prompt: str = context.strip() + self._generate_prompt(input_d=input_d, include_answer=False)

- return {'data': [full_prompt], 'multi_choices': self.choices}
+ return {'data': [full_prompt], 'multi_choices': self.choices, 'system_prompt': prompt}

  def get_gold_answer(self, input_d: dict) -> str:
  # Get the gold choice
@@ -203,81 +204,6 @@ class CMMLUAdapter(DataAdapter):
  def match(self, gold: str, pred: str) -> float:
  return exact_match(gold=gold, pred=pred)

- def gen_report(self, subset_score_map: dict, report_name: str = None) -> dict:
- """
- Generate report for the evaluation.
-
- Args:
- subset_score_map: The subset-score mapping. e.g. {subset_name: (score, num), ...}
- report_name: the user-defined report name. Default: None
-
- Returns:
- {
- "name":"CMMLU",
- "metric":"WeightedAverageAccuracy",
- "score":0.3389,
- "category":[
- {
- "name":"STEM",
- "score":0.2528,
- "subset":[
- {
- "name":"computer_network",
- "score":0.2632
- },
- {
- "name":"operating_system",
- "score":0.3157
- },
- {
- "name":"computer_architecture",
- "score":0.4285
- }
- ]
- }
- ],
- "total_num":59
- }
- """
- total_num: int = sum([num for _, num in subset_score_map.values()])
- weighted_avg_acc: float = sum([score * num for score, num in subset_score_map.values()]) / total_num
-
- # Get domain-subject mapping
- subject_review_map = {}
- for subset_name, (subset_score, num) in subset_score_map.items():
- domain_name: str = SUBJECT_MAPPING.get(subset_name)[1] if SUBJECT_MAPPING.get(subset_name) else subset_name
- if domain_name in subject_review_map:
- subject_review_map[domain_name].append((subset_name, subset_score, num))
- else:
- subject_review_map[domain_name] = [(subset_name, subset_score, num)]
-
- # Get domain score
- category_list = []
- for domain_name, domain_res_list in subject_review_map.items():
- domain_weighted_avg_acc = sum([score * num for _, score, num in domain_res_list]) / \
- sum([num for _, _, num in domain_res_list])
- domain_weighted_avg_acc = normalize_score(score=domain_weighted_avg_acc)
- category_list.append({
- 'name':
- domain_name,
- 'score':
- domain_weighted_avg_acc,
- 'subset': [{
- 'name': subset_name,
- 'score': normalize_score(subset_score)
- } for subset_name, subset_score, _ in domain_res_list]
- })
-
- # Get final dict of report
- res_map = dict(
- name=report_name or 'cmmlu',
- metric=self.metric_list[0]['name'],
- score=weighted_avg_acc,
- category=category_list,
- total_num=total_num)
-
- return res_map
-
  @classmethod
  def _generate_prompt(cls, input_d: dict, include_answer=True) -> str:

{evalscope-0.9.0 → evalscope-0.10.1}/evalscope/benchmarks/competition_math/competition_math_adapter.py
@@ -5,7 +5,7 @@ import json
  import os

  from evalscope.benchmarks import Benchmark, DataAdapter
- from evalscope.metrics import WeightedAverageAccuracy
+ from evalscope.metrics import AverageAccuracy
  from evalscope.metrics.math_accuracy import is_equiv, last_boxed_only_string, remove_boxed
  from evalscope.models import ChatGenerationModelAdapter
  from evalscope.utils.logger import get_logger
@@ -20,11 +20,11 @@ logger = get_logger()
  dataset_id='modelscope/competition_math',
  model_adapter=ChatGenerationModelAdapter,
  subset_list=['default'],
- metric_list=[WeightedAverageAccuracy],
+ metric_list=[AverageAccuracy],
  few_shot_num=4,
  train_split='train',
  eval_split='test',
- prompt_template='',
+ prompt_template='Put the final answer in \\boxed{}.',
  )
  class CompetitionMathAdapter(DataAdapter):
  """ To be tested for all models. """
@@ -77,7 +77,7 @@ class CompetitionMathAdapter(DataAdapter):
  use_fewshot = self.few_shot_num > 0
  full_prompt = self._generate_prompt(input_d, use_fewshot=use_fewshot)

- return {'data': [full_prompt], 'system_prompt': 'Put the final answer in \\boxed{}.'}
+ return {'data': [full_prompt], 'system_prompt': self.prompt_template}

  def get_gold_answer(self, input_d: dict) -> str:
  # Extract the gold answer from the input dict.
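
With `prompt_template` now carrying the instruction "Put the final answer in \boxed{}.", grading still depends on pulling the boxed expression out of the completion; the diff references `last_boxed_only_string`, `remove_boxed` and `is_equiv` from `evalscope.metrics.math_accuracy` for that. The sketch below is a toy regex extractor that only illustrates the idea; `last_boxed_answer` is a hypothetical helper, and the real functions do proper LaTeX normalization and handle nesting.

```python
# Toy illustration of why the system prompt asks for \boxed{...}: the grader
# pulls the last boxed expression from the completion and compares it with the
# reference answer. Not the real evalscope helpers.
import re
from typing import Optional


def last_boxed_answer(completion: str) -> Optional[str]:
    matches = re.findall(r'\\boxed\{([^{}]*)\}', completion)  # no nested braces in this toy
    return matches[-1] if matches else None


pred = last_boxed_answer('... therefore the total is \\boxed{42}.')
print(pred == '42')  # True for this toy case; the real is_equiv compares normalized forms
```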