evalscope 0.14.0__py3-none-any.whl → 0.15.1__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release.
- evalscope/arguments.py +2 -1
- evalscope/benchmarks/__init__.py +2 -2
- evalscope/benchmarks/aigc/__init__.py +0 -0
- evalscope/benchmarks/aigc/t2i/__init__.py +0 -0
- evalscope/benchmarks/aigc/t2i/base.py +56 -0
- evalscope/benchmarks/aigc/t2i/evalmuse_adapter.py +77 -0
- evalscope/benchmarks/aigc/t2i/genai_bench_adapter.py +58 -0
- evalscope/benchmarks/aigc/t2i/general_t2i_adapter.py +58 -0
- evalscope/benchmarks/aigc/t2i/hpdv2_adapter.py +57 -0
- evalscope/benchmarks/aigc/t2i/tifa_adapter.py +37 -0
- evalscope/benchmarks/aime/aime24_adapter.py +1 -1
- evalscope/benchmarks/aime/aime25_adapter.py +4 -4
- evalscope/benchmarks/alpaca_eval/alpaca_eval_adapter.py +1 -2
- evalscope/benchmarks/arc/arc_adapter.py +1 -1
- evalscope/benchmarks/arena_hard/arena_hard_adapter.py +1 -3
- evalscope/benchmarks/ceval/ceval_adapter.py +2 -2
- evalscope/benchmarks/chinese_simple_qa/csimple_qa_adapter.py +1 -3
- evalscope/benchmarks/cmmlu/cmmlu_adapter.py +1 -1
- evalscope/benchmarks/competition_math/competition_math_adapter.py +1 -2
- evalscope/benchmarks/data_adapter.py +16 -9
- evalscope/benchmarks/data_collection/data_collection_adapter.py +6 -4
- evalscope/benchmarks/general_mcq/general_mcq_adapter.py +2 -2
- evalscope/benchmarks/general_qa/general_qa_adapter.py +3 -3
- evalscope/benchmarks/live_code_bench/evaluate_utils.py +16 -21
- evalscope/benchmarks/live_code_bench/live_code_bench_adapter.py +4 -1
- evalscope/benchmarks/live_code_bench/testing_util.py +6 -3
- evalscope/benchmarks/math_500/math_500_adapter.py +1 -1
- evalscope/benchmarks/mmlu/mmlu_adapter.py +3 -1
- evalscope/benchmarks/simple_qa/simple_qa_adapter.py +1 -2
- evalscope/benchmarks/utils.py +7 -16
- evalscope/cli/start_app.py +1 -1
- evalscope/collections/evaluator.py +16 -4
- evalscope/config.py +7 -3
- evalscope/constants.py +11 -0
- evalscope/evaluator/evaluator.py +9 -3
- evalscope/evaluator/reviewer/auto_reviewer.py +1 -1
- evalscope/metrics/__init__.py +49 -4
- evalscope/metrics/llm_judge.py +1 -1
- evalscope/metrics/named_metrics.py +13 -0
- evalscope/metrics/t2v_metrics/__init__.py +66 -0
- evalscope/metrics/t2v_metrics/clipscore.py +14 -0
- evalscope/metrics/t2v_metrics/constants.py +12 -0
- evalscope/metrics/t2v_metrics/itmscore.py +14 -0
- evalscope/metrics/t2v_metrics/models/__init__.py +0 -0
- evalscope/metrics/t2v_metrics/models/clipscore_models/__init__.py +30 -0
- evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model/__init__.py +0 -0
- evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model/base_model.py +6 -0
- evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model/clip_model.py +132 -0
- evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model/cross_modeling.py +286 -0
- evalscope/metrics/t2v_metrics/models/clipscore_models/clip_model.py +114 -0
- evalscope/metrics/t2v_metrics/models/clipscore_models/hpsv2_model.py +86 -0
- evalscope/metrics/t2v_metrics/models/clipscore_models/mps_model.py +85 -0
- evalscope/metrics/t2v_metrics/models/clipscore_models/pickscore_model.py +62 -0
- evalscope/metrics/t2v_metrics/models/itmscore_models/__init__.py +26 -0
- evalscope/metrics/t2v_metrics/models/itmscore_models/blip2_itm_model.py +84 -0
- evalscope/metrics/t2v_metrics/models/itmscore_models/fga_blip2_model.py +97 -0
- evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward/ImageReward.py +171 -0
- evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward/__init__.py +0 -0
- evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward/blip_pretrain.py +80 -0
- evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward_model.py +73 -0
- evalscope/metrics/t2v_metrics/models/model.py +45 -0
- evalscope/metrics/t2v_metrics/models/utils.py +25 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/__init__.py +22 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/__init__.py +0 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/__init__.py +1 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/language_model/clip_t5.py +300 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/multimodal_encoder/builder.py +12 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/multimodal_encoder/clip_encoder.py +82 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/multimodal_projector/builder.py +50 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5_model.py +218 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/gpt4v_model.py +150 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/__init__.py +26 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/config.py +465 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/dist_utils.py +141 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/gradcam.py +22 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/logger.py +188 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/optims.py +106 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/registry.py +307 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/utils.py +416 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/vqa_tools/__init__.py +8 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/vqa_tools/vqa.py +191 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/vqa_tools/vqa_eval.py +318 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/default.yaml +10 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_caption_flant5xl.yaml +42 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_caption_opt2.7b.yaml +42 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_caption_opt6.7b.yaml +42 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_coco.yaml +36 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_flant5xl.yaml +43 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_flant5xxl.yaml +43 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_vicuna13b.yaml +43 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_vicuna7b.yaml +43 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain.yaml +36 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl.yaml +42 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl_iter_80k_total_100k_no_prefix.yaml +42 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl_iter_80k_total_100k_prefix.yaml +42 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl_vitL.yaml +43 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xxl.yaml +42 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_opt2.7b.yaml +42 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_opt6.7b.yaml +42 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_vitL.yaml +37 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_vicuna13b.yaml +43 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_vicuna7b.yaml +43 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/med_config.json +21 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/med_config_albef.json +22 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/med_large_config.json +21 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/__init__.py +208 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/base_model.py +231 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/Qformer.py +1093 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/__init__.py +0 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2.py +211 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_image_text_matching.py +109 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_qformer.py +452 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_t5.py +364 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_t5_instruct.py +755 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/fga_blip2.py +273 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/modeling_llama.py +880 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/modeling_t5.py +1844 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/__init__.py +81 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip.py +56 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_caption.py +212 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_classification.py +164 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_feature_extractor.py +202 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_image_text_matching.py +185 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_nlvr.py +178 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_outputs.py +112 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_pretrain.py +371 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_vqa.py +344 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/nlvr_encoder.py +858 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/clip_vit.py +271 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/eva_vit.py +503 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/med.py +1270 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/vit.py +473 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/__init__.py +31 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/base_processor.py +27 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/blip_processors.py +233 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/randaugment.py +392 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/mm_utils.py +127 -0
- evalscope/metrics/t2v_metrics/models/vqascore_models/vqa_model.py +17 -0
- evalscope/metrics/t2v_metrics/score.py +78 -0
- evalscope/metrics/t2v_metrics/vqascore.py +14 -0
- evalscope/models/__init__.py +50 -14
- evalscope/models/adapters/__init__.py +17 -0
- evalscope/models/{base_adapter.py → adapters/base_adapter.py} +17 -17
- evalscope/models/{chat_adapter.py → adapters/chat_adapter.py} +10 -7
- evalscope/models/{choice_adapter.py → adapters/choice_adapter.py} +2 -6
- evalscope/models/{custom_adapter.py → adapters/custom_adapter.py} +2 -4
- evalscope/models/{server_adapter.py → adapters/server_adapter.py} +1 -3
- evalscope/models/adapters/t2i_adapter.py +76 -0
- evalscope/models/custom/__init__.py +2 -1
- evalscope/models/custom/dummy_model.py +11 -13
- evalscope/models/local_model.py +82 -33
- evalscope/models/model.py +2 -42
- evalscope/models/register.py +26 -0
- evalscope/perf/benchmark.py +4 -3
- evalscope/perf/main.py +4 -2
- evalscope/perf/plugin/datasets/flickr8k.py +2 -1
- evalscope/perf/utils/benchmark_util.py +2 -2
- evalscope/perf/utils/db_util.py +16 -8
- evalscope/report/__init__.py +1 -0
- evalscope/report/app.py +117 -67
- evalscope/report/app_arguments.py +11 -0
- evalscope/report/generator.py +1 -1
- evalscope/run.py +3 -3
- evalscope/third_party/thinkbench/eval.py +19 -7
- evalscope/utils/chat_service.py +2 -2
- evalscope/utils/import_utils.py +66 -0
- evalscope/utils/utils.py +12 -4
- evalscope/version.py +2 -2
- {evalscope-0.14.0.dist-info → evalscope-0.15.1.dist-info}/METADATA +20 -3
- {evalscope-0.14.0.dist-info → evalscope-0.15.1.dist-info}/RECORD +178 -66
- tests/aigc/__init__.py +1 -0
- tests/aigc/test_t2i.py +87 -0
- tests/cli/test_run.py +20 -7
- tests/perf/test_perf.py +6 -3
- evalscope/metrics/code_metric.py +0 -98
- evalscope/metrics/resources/gpt2-zhcn3-v4.bpe +0 -58485
- evalscope/metrics/resources/gpt2-zhcn3-v4.json +0 -1
- {evalscope-0.14.0.dist-info → evalscope-0.15.1.dist-info}/LICENSE +0 -0
- {evalscope-0.14.0.dist-info → evalscope-0.15.1.dist-info}/WHEEL +0 -0
- {evalscope-0.14.0.dist-info → evalscope-0.15.1.dist-info}/entry_points.txt +0 -0
- {evalscope-0.14.0.dist-info → evalscope-0.15.1.dist-info}/top_level.txt +0 -0
evalscope/report/app_arguments.py
ADDED
@@ -0,0 +1,11 @@
+import argparse
+
+
+def add_argument(parser: argparse.ArgumentParser):
+    parser.add_argument('--share', action='store_true', help='Share the app.')
+    parser.add_argument('--server-name', type=str, default='0.0.0.0', help='The server name.')
+    parser.add_argument('--server-port', type=int, default=None, help='The server port.')
+    parser.add_argument('--debug', action='store_true', help='Debug the app.')
+    parser.add_argument('--lang', type=str, default='zh', help='The locale.', choices=['zh', 'en'])
+    parser.add_argument('--outputs', type=str, default='./outputs', help='The outputs dir.')
+    parser.add_argument('--allowed-paths', nargs='+', default=['/'], help='The outputs dir.')
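The new module only registers the visualization app's CLI flags on an existing parser. A minimal sketch of how it could be wired up (the import path comes from this diff; the standalone parser below is an assumption for illustration):

import argparse

from evalscope.report.app_arguments import add_argument  # path taken from this diff

parser = argparse.ArgumentParser('evalscope-app')  # hypothetical program name
add_argument(parser)
args = parser.parse_args(['--lang', 'en', '--server-port', '7860'])
print(args.lang, args.server_port, args.outputs)  # -> en 7860 ./outputs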
evalscope/report/generator.py
CHANGED
@@ -48,7 +48,7 @@ class ReportGenerator:
         df = flatten_subset()
 
         metrics_list = []
-        for metric_name, group_metric in df.groupby('metric_name'):
+        for metric_name, group_metric in df.groupby('metric_name', sort=False):
             categories = []
             for category_name, group_category in group_metric.groupby('categories'):
                 subsets = []
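The `sort=False` change keeps metrics in their first-appearance order instead of pandas' default alphabetical group ordering. A small self-contained illustration of the difference (toy data, not evalscope's report frame):

import pandas as pd

df = pd.DataFrame({'metric_name': ['Pass@1', 'AverageAccuracy', 'Pass@1'], 'score': [0.5, 0.7, 0.6]})

print([name for name, _ in df.groupby('metric_name')])              # ['AverageAccuracy', 'Pass@1'] (sorted)
print([name for name, _ in df.groupby('metric_name', sort=False)])  # ['Pass@1', 'AverageAccuracy'] (original order)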
evalscope/run.py
CHANGED
@@ -153,10 +153,10 @@ def create_evaluator(task_cfg: TaskConfig, dataset_name: str, outputs: OutputsSt
         data_adapter = benchmark.get_data_adapter(config=task_cfg.dataset_args.get(dataset_name, {}))
         return EvaluatorCollection(task_cfg, data_adapter, outputs, base_model)
 
-    # Initialize
-    model_adapter = initialize_model_adapter(task_cfg, benchmark, base_model)
-    # Initialize data adapter
+    # Initialize data adapter first to update config
     data_adapter = benchmark.get_data_adapter(config=task_cfg.dataset_args.get(dataset_name, {}))
+    # Initialize model adapter
+    model_adapter = initialize_model_adapter(task_cfg, data_adapter, base_model)
 
     # update task_cfg.dataset_args
     task_cfg.dataset_args[dataset_name] = benchmark.to_string_dict()
evalscope/third_party/thinkbench/eval.py
CHANGED
@@ -357,7 +357,7 @@ judge_config = dict(
 )
 
 distill_qwen_config = dict(
-    report_path = '
+    report_path = '../eval-scope/outputs/20250218_180219',
     model_name = 'DeepSeek-R1-Distill-Qwen-7B',
     tokenizer_path = 'deepseek-ai/DeepSeek-R1-Distill-Qwen-7B',
     dataset_name = 'math_500',
@@ -367,7 +367,7 @@ distill_qwen_config = dict(
 )
 
 math_qwen_config = dict(
-    report_path = '
+    report_path = '../eval-scope/outputs/20250219_202358',
     model_name = 'Qwen2.5-Math-7B-Instruct',
     tokenizer_path = 'Qwen/Qwen2.5-Math-7B-Instruct',
     dataset_name = 'math_500',
@@ -377,7 +377,7 @@ math_qwen_config = dict(
 )
 
 r1_config = dict(
-    report_path = '
+    report_path = '../eval-scope/outputs/20250307_000404',
     model_name = 'deepseek-r1',
     tokenizer_path = 'deepseek-ai/DeepSeek-R1',
     dataset_name = 'math_500',
@@ -387,7 +387,7 @@ r1_config = dict(
 )
 
 qwq_preview_config = dict(
-    report_path = '
+    report_path = '../eval-scope/outputs/20250221_105911',
     model_name = 'qwq-32b-preview',
     tokenizer_path = 'Qwen/QwQ-32B-Preview',
     dataset_name = 'math_500',
@@ -397,7 +397,7 @@ qwq_preview_config = dict(
 )
 
 qwq_config = dict(
-    report_path = '
+    report_path = '../eval-scope/outputs/20250306_181550',
     model_name = 'QwQ-32B',
     tokenizer_path = 'Qwen/QwQ-32B',
     dataset_name = 'math_500',
@@ -407,7 +407,7 @@ qwq_config = dict(
 )
 
 distill_qwen_32b = dict(
-    report_path = '
+    report_path = '../eval-scope/outputs/20250306_235951',
     model_name = 'deepseek-r1-distill-qwen-32b',
     tokenizer_path = 'deepseek-ai/DeepSeek-R1-Distill-Qwen-32B',
     dataset_name = 'math_500',
@@ -416,14 +416,26 @@ distill_qwen_32b = dict(
     judge_config=judge_config
 )
 
+qwen3_32b_think = dict(
+    report_path = '../eval-scope/outputs/20250428_151817',
+    model_name = 'Qwen3-32B',
+    tokenizer_path = 'Qwen/Qwen3-32B',
+    dataset_name = 'math_500',
+    subsets = ['Level 1', 'Level 2', 'Level 3', 'Level 4', 'Level 5'],
+    split_strategies='separator',
+    judge_config=judge_config
+)
+
 if __name__ == '__main__':
     # run_task(distill_qwen_config, count=80)
     # run_task(math_qwen_config)
     # run_task(qwq_preview_config, max_tokens=20000, count=200, workers=128)
     # run_task(r1_config, max_tokens=20000, count=200, workers=128)
     # run_task(qwq_config, max_tokens=20000, count=200, workers=128)
+    run_task(qwen3_32b_think, max_tokens=20000, count=200, workers=128)
     # run_task(distill_qwen_32b, max_tokens=20000, count=200, workers=128)
 
     # combine_results([qwq_config, r1_config, qwq_preview_config, distill_qwen_32b], output_path='outputs/model_comparison_metrics.png')
     # combine_results([qwq_config, r1_config, distill_qwen_32b], output_path='outputs/model_comparison_metrics_3models.png')
-    combine_results([distill_qwen_config, math_qwen_config, qwq_config, r1_config, qwq_preview_config, distill_qwen_32b], output_path='outputs/model_comparison_metrics_6models.png')
+    # combine_results([distill_qwen_config, math_qwen_config, qwq_config, r1_config, qwq_preview_config, distill_qwen_32b], output_path='outputs/model_comparison_metrics_6models.png')
+    combine_results([qwq_config, r1_config, distill_qwen_32b, qwen3_32b_think], output_path='outputs/model_comparison_metrics_4models.png')
evalscope/utils/chat_service.py
CHANGED
@@ -64,10 +64,10 @@ class ChatCompletionResponseStreamChoice(BaseModel):
 
 class ChatCompletionResponse(BaseModel):
     model: str
-    object: Literal['chat.completion', 'chat.completion.chunk']
+    object: Literal['chat.completion', 'chat.completion.chunk', 'images.generations']
    choices: List[Union[ChatCompletionResponseChoice, ChatCompletionResponseStreamChoice, Any]]
     created: Optional[int] = Field(default_factory=lambda: int(time.time()))
-    usage: Optional[Usage]
+    usage: Optional[Usage] = None
 
 
 class TextCompletionRequest(BaseModel):
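The `= None` default matters because in Pydantic v2 an `Optional[...]` annotation without a default is still a required field, so responses without token usage (such as image generations) previously could not be constructed. A standalone sketch of the behavior with simplified stand-in models, not the full evalscope classes:

from typing import Optional

from pydantic import BaseModel


class Usage(BaseModel):
    prompt_tokens: int = 0
    completion_tokens: int = 0


class Strict(BaseModel):
    usage: Optional[Usage]          # required under Pydantic v2: Strict() raises a ValidationError


class Relaxed(BaseModel):
    usage: Optional[Usage] = None   # optional: Relaxed() is fine and usage defaults to None


print(Relaxed().usage)  # None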
evalscope/utils/import_utils.py
ADDED
@@ -0,0 +1,66 @@
+# Copyright (c) Alibaba, Inc. and its affiliates.
+# Copyright 2023-present the HuggingFace Inc. team.
+
+import importlib
+import os
+from itertools import chain
+from types import ModuleType
+from typing import Any
+
+from .logger import get_logger
+
+logger = get_logger()  # pylint: disable=invalid-name
+
+
+class _LazyModule(ModuleType):
+    """
+    Module class that surfaces all objects but only performs associated imports when the objects are requested.
+    """
+
+    # Very heavily inspired by optuna.integration._IntegrationModule
+    # https://github.com/optuna/optuna/blob/master/optuna/integration/__init__.py
+    def __init__(self, name, module_file, import_structure, module_spec=None, extra_objects=None):
+        super().__init__(name)
+        self._modules = set(import_structure.keys())
+        self._class_to_module = {}
+        for key, values in import_structure.items():
+            for value in values:
+                self._class_to_module[value] = key
+        # Needed for autocompletion in an IDE
+        self.__all__ = list(import_structure.keys()) + list(chain(*import_structure.values()))
+        self.__file__ = module_file
+        self.__spec__ = module_spec
+        self.__path__ = [os.path.dirname(module_file)]
+        self._objects = {} if extra_objects is None else extra_objects
+        self._name = name
+        self._import_structure = import_structure
+
+    # Needed for autocompletion in an IDE
+    def __dir__(self):
+        result = super().__dir__()
+        # The elements of self.__all__ that are submodules may or may not be in the dir already, depending on whether
+        # they have been accessed or not. So we only add the elements of self.__all__ that are not already in the dir.
+        for attr in self.__all__:
+            if attr not in result:
+                result.append(attr)
+        return result
+
+    def __getattr__(self, name: str) -> Any:
+        if name in self._objects:
+            return self._objects[name]
+        if name in self._modules:
+            value = self._get_module(name)
+        elif name in self._class_to_module.keys():
+            module = self._get_module(self._class_to_module[name])
+            value = getattr(module, name)
+        else:
+            raise AttributeError(f'module {self.__name__} has no attribute {name}')
+
+        setattr(self, name, value)
+        return value
+
+    def _get_module(self, module_name: str):
+        return importlib.import_module('.' + module_name, self.__name__)
+
+    def __reduce__(self):
+        return self.__class__, (self._name, self.__file__, self._import_structure)
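`_LazyModule` is the HuggingFace-style lazy import pattern: a package's `__init__.py` replaces itself with this module object, and a submodule is only imported when one of its names is first accessed. A sketch of how a package might adopt it; the `_import_structure` entries below are made up for illustration, the real mappings live in evalscope's own `__init__` files:

import sys

from evalscope.utils.import_utils import _LazyModule  # module added in this release

# hypothetical mapping: {submodule_name: [public names it provides]}
_import_structure = {
    'clipscore': ['CLIPScore'],
    'vqascore': ['VQAScore'],
}

sys.modules[__name__] = _LazyModule(__name__, __file__, _import_structure, module_spec=__spec__)
# Accessing package.CLIPScore later triggers importlib.import_module('.clipscore', __name__).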
evalscope/utils/utils.py
CHANGED
@@ -76,16 +76,16 @@ def dict_torch_dtype_to_str(d: Dict[str, Any]) -> dict:
 class ResponseParser:
 
     @staticmethod
-    def parse_first_capital(text: str) -> str:
+    def parse_first_capital(text: str, options: list[str]) -> str:
         for t in text:
-            if t.isupper():
+            if t.isupper() and (t in options):
                 return t
         return ''
 
     @staticmethod
-    def parse_last_capital(text: str) -> str:
+    def parse_last_capital(text: str, options: list[str]) -> str:
         for t in text[::-1]:
-            if t.isupper():
+            if t.isupper() and (t in options):
                 return t
         return ''
 
@@ -155,6 +155,10 @@ class ResponseParser:
         for i in options:
             if i in outputs:
                 return i
+        # If no match found, try to find the last capital letter in the text
+        last_capital = ResponseParser.parse_last_capital(text, options)
+        if last_capital:
+            return last_capital
         return 'No valid option found'
 
     @staticmethod
@@ -183,6 +187,10 @@
         matches = regex.search(text)
         if matches:
             return matches.group(1)
+        # If no match found, try to find the last capital letter in the text
+        last_capital = ResponseParser.parse_last_capital(text, options)
+        if last_capital:
+            return last_capital
         return 'No valid option found'
 
 
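The new `options` parameter restricts capital-letter parsing to the valid choice labels, so stray capitals in the response no longer win, and the option parsers fall back to the last matching capital when nothing else is found. A quick sketch of the changed behavior, assuming `ResponseParser` is importable from `evalscope.utils.utils` as in this diff:

from evalscope.utils.utils import ResponseParser

text = 'The answer is B.'
print(ResponseParser.parse_first_capital(text, options=['A', 'B', 'C', 'D']))  # 'B' (skips the leading 'T')
print(ResponseParser.parse_last_capital(text, options=['A', 'B', 'C', 'D']))   # 'B'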
evalscope/version.py
CHANGED

{evalscope-0.14.0.dist-info → evalscope-0.15.1.dist-info}/METADATA
CHANGED
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: evalscope
-Version: 0.
+Version: 0.15.1
 Summary: EvalScope: Lightweight LLMs Evaluation Framework
 Home-page: https://github.com/modelscope/evalscope
 Author: ModelScope team
@@ -28,8 +28,9 @@ Requires-Dist: modelscope[framework]
 Requires-Dist: nltk>=3.9
 Requires-Dist: openai
 Requires-Dist: pandas
+Requires-Dist: pillow
 Requires-Dist: pyarrow
-Requires-Dist: pyyaml
+Requires-Dist: pyyaml>=5.1
 Requires-Dist: requests
 Requires-Dist: rouge-chinese
 Requires-Dist: rouge-score>=0.1.0
@@ -39,9 +40,16 @@ Requires-Dist: seaborn
 Requires-Dist: sympy
 Requires-Dist: tabulate
 Requires-Dist: torch
+Requires-Dist: torchvision
 Requires-Dist: tqdm
 Requires-Dist: transformers>=4.33
 Requires-Dist: word2number
+Provides-Extra: aigc
+Requires-Dist: diffusers; extra == "aigc"
+Requires-Dist: iopath; extra == "aigc"
+Requires-Dist: omegaconf; extra == "aigc"
+Requires-Dist: open-clip-torch; extra == "aigc"
+Requires-Dist: opencv-python; extra == "aigc"
 Provides-Extra: all
 Requires-Dist: accelerate; extra == "all"
 Requires-Dist: datasets<=3.2.0,>=3.0.0; extra == "all"
@@ -55,8 +63,9 @@ Requires-Dist: modelscope[framework]; extra == "all"
 Requires-Dist: nltk>=3.9; extra == "all"
 Requires-Dist: openai; extra == "all"
 Requires-Dist: pandas; extra == "all"
+Requires-Dist: pillow; extra == "all"
 Requires-Dist: pyarrow; extra == "all"
-Requires-Dist: pyyaml; extra == "all"
+Requires-Dist: pyyaml>=5.1; extra == "all"
 Requires-Dist: requests; extra == "all"
 Requires-Dist: rouge-chinese; extra == "all"
 Requires-Dist: rouge-score>=0.1.0; extra == "all"
@@ -66,6 +75,7 @@ Requires-Dist: seaborn; extra == "all"
 Requires-Dist: sympy; extra == "all"
 Requires-Dist: tabulate; extra == "all"
 Requires-Dist: torch; extra == "all"
+Requires-Dist: torchvision; extra == "all"
 Requires-Dist: tqdm; extra == "all"
 Requires-Dist: transformers>=4.33; extra == "all"
 Requires-Dist: word2number; extra == "all"
@@ -86,6 +96,11 @@ Requires-Dist: transformers; extra == "all"
 Requires-Dist: unicorn; extra == "all"
 Requires-Dist: gradio==5.4.0; extra == "all"
 Requires-Dist: plotly<6.0.0,>=5.23.0; extra == "all"
+Requires-Dist: diffusers; extra == "all"
+Requires-Dist: iopath; extra == "all"
+Requires-Dist: omegaconf; extra == "all"
+Requires-Dist: open-clip-torch; extra == "all"
+Requires-Dist: opencv-python; extra == "all"
 Provides-Extra: app
 Requires-Dist: gradio==5.4.0; extra == "app"
 Requires-Dist: plotly<6.0.0,>=5.23.0; extra == "app"
@@ -199,6 +214,8 @@ Please scan the QR code below to join our community groups:
 
 ## 🎉 News
 
+- 🔥 **[2025.04.29]** Added Qwen3 Evaluation Best Practices, [welcome to read 📖](https://evalscope.readthedocs.io/en/latest/best_practice/qwen3.html)
+- 🔥 **[2025.04.27]** Support for text-to-image evaluation: Supports 8 metrics including MPS, HPSv2.1Score, etc., and evaluation benchmarks such as EvalMuse, GenAI-Bench. Refer to the [user documentation](https://evalscope.readthedocs.io/en/latest/user_guides/aigc/t2i.html) for more details.
 - 🔥 **[2025.04.10]** Model service stress testing tool now supports the `/v1/completions` endpoint (the default endpoint for vLLM benchmarking)
 - 🔥 **[2025.04.08]** Support for evaluating embedding model services compatible with the OpenAI API has been added. For more details, check the [user guide](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/mteb.html#configure-evaluation-parameters).
 - 🔥 **[2025.03.27]** Added support for [AlpacaEval](https://www.modelscope.cn/datasets/AI-ModelScope/alpaca_eval/dataPeview) and [ArenaHard](https://modelscope.cn/datasets/AI-ModelScope/arena-hard-auto-v0.1/summary) evaluation benchmarks. For usage notes, please refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset.html)