evalscope 0.16.2__py3-none-any.whl → 0.17.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release: this version of evalscope might be problematic.
Files changed (117)
  1. evalscope/app/app.py +9 -762
  2. evalscope/app/constants.py +1 -0
  3. evalscope/app/ui/__init__.py +20 -0
  4. evalscope/app/ui/app_ui.py +52 -0
  5. evalscope/app/ui/multi_model.py +323 -0
  6. evalscope/app/ui/sidebar.py +42 -0
  7. evalscope/app/ui/single_model.py +202 -0
  8. evalscope/app/ui/visualization.py +36 -0
  9. evalscope/app/utils/data_utils.py +178 -0
  10. evalscope/app/utils/localization.py +221 -0
  11. evalscope/app/utils/text_utils.py +119 -0
  12. evalscope/app/utils/visualization.py +91 -0
  13. evalscope/backend/opencompass/backend_manager.py +2 -1
  14. evalscope/backend/rag_eval/backend_manager.py +2 -1
  15. evalscope/backend/rag_eval/utils/embedding.py +1 -1
  16. evalscope/backend/vlm_eval_kit/backend_manager.py +4 -1
  17. evalscope/benchmarks/__init__.py +15 -1
  18. evalscope/benchmarks/aime/aime24_adapter.py +2 -1
  19. evalscope/benchmarks/aime/aime25_adapter.py +2 -1
  20. evalscope/benchmarks/alpaca_eval/alpaca_eval_adapter.py +1 -1
  21. evalscope/benchmarks/arc/arc_adapter.py +1 -1
  22. evalscope/benchmarks/arena_hard/arena_hard_adapter.py +1 -1
  23. evalscope/benchmarks/arena_hard/utils.py +0 -12
  24. evalscope/benchmarks/ceval/ceval_adapter.py +5 -16
  25. evalscope/benchmarks/cmmlu/cmmlu_adapter.py +9 -21
  26. evalscope/benchmarks/competition_math/competition_math_adapter.py +2 -1
  27. evalscope/benchmarks/data_adapter.py +20 -5
  28. evalscope/benchmarks/general_arena/__init__.py +0 -0
  29. evalscope/benchmarks/general_arena/general_arena_adapter.py +411 -0
  30. evalscope/benchmarks/general_arena/utils.py +226 -0
  31. evalscope/benchmarks/general_mcq/general_mcq_adapter.py +1 -1
  32. evalscope/benchmarks/general_qa/general_qa_adapter.py +42 -29
  33. evalscope/benchmarks/hellaswag/hellaswag_adapter.py +1 -1
  34. evalscope/benchmarks/ifeval/ifeval_adapter.py +2 -4
  35. evalscope/benchmarks/iquiz/iquiz_adapter.py +1 -1
  36. evalscope/benchmarks/live_code_bench/live_code_bench_adapter.py +0 -6
  37. evalscope/benchmarks/maritime_bench/maritime_bench_adapter.py +1 -1
  38. evalscope/benchmarks/math_500/math_500_adapter.py +2 -1
  39. evalscope/benchmarks/mmlu/mmlu_adapter.py +1 -1
  40. evalscope/benchmarks/mmlu_pro/mmlu_pro_adapter.py +1 -1
  41. evalscope/benchmarks/mmlu_redux/mmlu_redux_adapter.py +1 -1
  42. evalscope/benchmarks/musr/musr_adapter.py +1 -1
  43. evalscope/benchmarks/race/race_adapter.py +1 -1
  44. evalscope/benchmarks/trivia_qa/trivia_qa_adapter.py +9 -4
  45. evalscope/benchmarks/utils.py +1 -2
  46. evalscope/benchmarks/winogrande/winogrande_adapter.py +1 -1
  47. evalscope/config.py +8 -123
  48. evalscope/evaluator/evaluator.py +15 -12
  49. evalscope/metrics/__init__.py +6 -0
  50. evalscope/{utils/utils.py → metrics/completion_parsers.py} +68 -180
  51. evalscope/metrics/llm_judge.py +105 -20
  52. evalscope/metrics/metrics.py +1 -1
  53. evalscope/models/adapters/base_adapter.py +0 -2
  54. evalscope/models/adapters/server_adapter.py +2 -2
  55. evalscope/models/custom/dummy_model.py +3 -3
  56. evalscope/perf/arguments.py +2 -16
  57. evalscope/perf/main.py +1 -1
  58. evalscope/perf/utils/analysis_result.py +24 -23
  59. evalscope/perf/utils/benchmark_util.py +1 -1
  60. evalscope/report/__init__.py +1 -1
  61. evalscope/report/utils.py +34 -15
  62. evalscope/run.py +1 -1
  63. evalscope/summarizer.py +1 -2
  64. evalscope/utils/__init__.py +63 -2
  65. evalscope/utils/argument_utils.py +64 -0
  66. evalscope/utils/import_utils.py +16 -0
  67. evalscope/utils/io_utils.py +45 -4
  68. evalscope/utils/model_utils.py +37 -1
  69. evalscope/version.py +2 -2
  70. {evalscope-0.16.2.dist-info → evalscope-0.17.0.dist-info}/METADATA +55 -26
  71. {evalscope-0.16.2.dist-info → evalscope-0.17.0.dist-info}/RECORD +90 -101
  72. tests/aigc/test_t2i.py +1 -1
  73. tests/cli/test_all.py +50 -2
  74. tests/cli/test_collection.py +1 -1
  75. tests/cli/test_custom.py +261 -0
  76. tests/cli/test_run.py +13 -37
  77. tests/perf/test_perf.py +2 -2
  78. tests/rag/test_clip_benchmark.py +2 -1
  79. tests/rag/test_mteb.py +3 -1
  80. tests/rag/test_ragas.py +3 -1
  81. tests/swift/test_run_swift_eval.py +2 -1
  82. tests/swift/test_run_swift_vlm_eval.py +2 -1
  83. tests/swift/test_run_swift_vlm_jugde_eval.py +2 -1
  84. tests/utils.py +13 -0
  85. tests/vlm/test_vlmeval.py +8 -2
  86. evalscope/evaluator/rating_eval.py +0 -157
  87. evalscope/evaluator/reviewer/__init__.py +0 -1
  88. evalscope/evaluator/reviewer/auto_reviewer.py +0 -391
  89. evalscope/registry/__init__.py +0 -1
  90. evalscope/registry/config/cfg_arena.yaml +0 -77
  91. evalscope/registry/config/cfg_arena_zhihu.yaml +0 -63
  92. evalscope/registry/config/cfg_pairwise_baseline.yaml +0 -83
  93. evalscope/registry/config/cfg_single.yaml +0 -78
  94. evalscope/registry/data/prompt_template/lmsys_v2.jsonl +0 -8
  95. evalscope/registry/data/prompt_template/prompt_templates.jsonl +0 -8
  96. evalscope/registry/data/qa_browser/battle.jsonl +0 -634
  97. evalscope/registry/data/qa_browser/category_mapping.yaml +0 -10
  98. evalscope/registry/data/question.jsonl +0 -80
  99. evalscope/registry/tasks/arc.yaml +0 -28
  100. evalscope/registry/tasks/bbh.yaml +0 -26
  101. evalscope/registry/tasks/bbh_mini.yaml +0 -26
  102. evalscope/registry/tasks/ceval.yaml +0 -27
  103. evalscope/registry/tasks/ceval_mini.yaml +0 -26
  104. evalscope/registry/tasks/cmmlu.yaml +0 -27
  105. evalscope/registry/tasks/eval_qwen-7b-chat_v100.yaml +0 -28
  106. evalscope/registry/tasks/general_qa.yaml +0 -27
  107. evalscope/registry/tasks/gsm8k.yaml +0 -29
  108. evalscope/registry/tasks/mmlu.yaml +0 -29
  109. evalscope/registry/tasks/mmlu_mini.yaml +0 -27
  110. evalscope/run_arena.py +0 -202
  111. evalscope/utils/arena_utils.py +0 -217
  112. evalscope/utils/completion_parsers.py +0 -82
  113. evalscope/{utils → benchmarks}/filters.py +0 -0
  114. {evalscope-0.16.2.dist-info → evalscope-0.17.0.dist-info}/LICENSE +0 -0
  115. {evalscope-0.16.2.dist-info → evalscope-0.17.0.dist-info}/WHEEL +0 -0
  116. {evalscope-0.16.2.dist-info → evalscope-0.17.0.dist-info}/entry_points.txt +0 -0
  117. {evalscope-0.16.2.dist-info → evalscope-0.17.0.dist-info}/top_level.txt +0 -0
evalscope/report/utils.py CHANGED
@@ -3,14 +3,45 @@ import os
 import pandas as pd
 from collections import defaultdict
 from dataclasses import asdict, dataclass, field
-from typing import Any, Dict, List
+from typing import Any, Dict, List, Union

 from evalscope.metrics import macro_mean, micro_mean
-from evalscope.utils import normalize_score
-from evalscope.utils.logger import get_logger
+from evalscope.utils import get_logger

 logger = get_logger()

+ANALYSIS_PROMPT = """根据给出的json格式的模型评测结果,输出分析报告,要求如下:
+1. 报告分为 总体表现、关键指标分析、改进建议、结论 四部分
+2. 若模型有多种指标,将其分为低分、中分、高分三个部分,并列出markdown表格
+3. 只列出报告本身,不要有其他多余内容
+4. 输出报告语言为{language}
+
+```json
+{report_str}
+```
+"""
+
+
+def normalize_score(score: Union[float, dict], keep_num: int = 4) -> Union[float, dict]:
+    """
+    Normalize score.
+
+    Args:
+        score: input score, could be float or dict. e.g. 0.12345678 or {'acc': 0.12345678, 'f1': 0.12345678}
+        keep_num: number of digits to keep.
+
+    Returns:
+        Union[float, dict]: normalized score. e.g. 0.1234 or {'acc': 0.1234, 'f1': 0.1234}
+    """
+    if isinstance(score, float):
+        score = round(score, keep_num)
+    elif isinstance(score, dict):
+        score = {k: round(v, keep_num) for k, v in score.items()}
+    else:
+        logger.warning(f'Unknown score type: {type(score)}')
+
+    return score
+

 @dataclass
 class Subset:
@@ -74,18 +105,6 @@ class ReportKey:
     score = 'Score'


-ANALYSIS_PROMPT = """根据给出的json格式的模型评测结果,输出分析报告,要求如下:
-1. 报告分为 总体表现、关键指标分析、改进建议、结论 四部分
-2. 若模型有多种指标,将其分为低分、中分、高分三个部分,并列出markdown表格
-3. 只列出报告本身,不要有其他多余内容
-4. 输出报告语言为{language}
-
-```json
-{report_str}
-```
-"""
-
-
 @dataclass
 class Report:
     name: str = 'default_report'
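
For context, a minimal usage sketch of the `normalize_score` helper that moves into `evalscope/report/utils.py` in the hunk above. The values are illustrative; the import path simply assumes the module layout shown in this diff.

```python
# Illustrative only: exercises normalize_score as defined in the hunk above.
from evalscope.report.utils import normalize_score

print(normalize_score(0.4567891))                      # 0.4568 (4 digits kept by default)
print(normalize_score({'acc': 0.9876543, 'f1': 0.5}))  # {'acc': 0.9877, 'f1': 0.5}
print(normalize_score(0.4567891, keep_num=2))          # 0.46
```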
evalscope/run.py CHANGED
@@ -9,9 +9,9 @@ from typing import TYPE_CHECKING, List, Optional, Union

 from evalscope.config import TaskConfig, parse_task_config
 from evalscope.constants import DataCollection, EvalBackend
-from evalscope.utils import seed_everything
 from evalscope.utils.io_utils import OutputsStructure
 from evalscope.utils.logger import configure_logging, get_logger
+from evalscope.utils.model_utils import seed_everything

 if TYPE_CHECKING:
     from evalscope.models import LocalModel
evalscope/summarizer.py CHANGED
@@ -7,8 +7,7 @@ from typing import List, Union
 from evalscope.config import TaskConfig, parse_task_config
 from evalscope.constants import EvalBackend
 from evalscope.report import gen_table
-from evalscope.utils import csv_to_list, get_latest_folder_path
-from evalscope.utils.io_utils import OutputsStructure, json_to_dict, yaml_to_dict
+from evalscope.utils.io_utils import OutputsStructure, csv_to_list, get_latest_folder_path, json_to_dict, yaml_to_dict
 from evalscope.utils.logger import get_logger

 logger = get_logger()
evalscope/utils/__init__.py CHANGED
@@ -1,4 +1,65 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.

-from evalscope.utils.model_utils import EvalBackend
-from evalscope.utils.utils import *
+from typing import TYPE_CHECKING
+
+from .import_utils import _LazyModule
+
+if TYPE_CHECKING:
+    from .argument_utils import BaseArgument, get_supported_params, parse_int_or_float
+    from .deprecation_utils import deprecated
+    from .import_utils import get_module_path, is_module_installed
+    from .io_utils import (OutputsStructure, csv_to_jsonl, csv_to_list, dict_to_yaml, gen_hash, get_latest_folder_path,
+                           get_valid_list, json_to_dict, jsonl_to_csv, jsonl_to_list, yaml_to_dict)
+    from .logger import configure_logging, get_logger
+    from .model_utils import EvalBackend, dict_torch_dtype_to_str, fix_do_sample_warning, get_device, seed_everything
+
+else:
+    _import_structure = {
+        'argument_utils': [
+            'BaseArgument',
+            'parse_int_or_float',
+            'get_supported_params',
+        ],
+        'model_utils': [
+            'EvalBackend',
+            'get_device',
+            'seed_everything',
+            'dict_torch_dtype_to_str',
+            'fix_do_sample_warning',
+        ],
+        'import_utils': [
+            'is_module_installed',
+            'get_module_path',
+        ],
+        'io_utils': [
+            'OutputsStructure',
+            'csv_to_list',
+            'json_to_dict',
+            'yaml_to_dict',
+            'get_latest_folder_path',
+            'gen_hash',
+            'dict_to_yaml',
+            'csv_to_jsonl',
+            'jsonl_to_csv',
+            'jsonl_to_list',
+            'gen_hash',
+            'get_valid_list',
+        ],
+        'deprecation_utils': [
+            'deprecated',
+        ],
+        'logger': [
+            'get_logger',
+            'configure_logging',
+        ],
+    }
+
+    import sys
+
+    sys.modules[__name__] = _LazyModule(
+        __name__,
+        globals()['__file__'],
+        _import_structure,
+        module_spec=__spec__,
+        extra_objects={},
+    )
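
A small sketch of what the new lazy `__init__` enables, assuming evalscope 0.17.0 and its torch dependency are installed: names listed in `_import_structure` can be imported from the package root, and their submodules are only loaded on first access.

```python
# Sketch: both names appear in _import_structure above, so the package root
# re-exports them; the underlying submodules are imported lazily.
from evalscope.utils import get_logger, seed_everything

logger = get_logger()  # resolved from evalscope.utils.logger
seed_everything(42)    # resolved from evalscope.utils.model_utils
logger.info('random seeds fixed to 42')
```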
evalscope/utils/argument_utils.py ADDED
@@ -0,0 +1,64 @@
+import json
+from argparse import Namespace
+from inspect import signature
+
+from evalscope.utils.io_utils import json_to_dict, yaml_to_dict
+
+
+class BaseArgument:
+    """
+    BaseArgument is a base class designed to facilitate the creation and manipulation
+    of argument classes in the evalscope framework. It provides utility methods for
+    instantiating objects from various data formats and converting objects back into
+    dictionary representations.
+    """
+
+    @classmethod
+    def from_dict(cls, d: dict):
+        """Instantiate the class from a dictionary."""
+        return cls(**d)
+
+    @classmethod
+    def from_json(cls, json_file: str):
+        """Instantiate the class from a JSON file."""
+        return cls.from_dict(json_to_dict(json_file))
+
+    @classmethod
+    def from_yaml(cls, yaml_file: str):
+        """Instantiate the class from a YAML file."""
+        return cls.from_dict(yaml_to_dict(yaml_file))
+
+    @classmethod
+    def from_args(cls, args: Namespace):
+        """
+        Instantiate the class from an argparse.Namespace object.
+        Filters out None values and removes 'func' if present.
+        """
+        args_dict = {k: v for k, v in vars(args).items() if v is not None}
+
+        if 'func' in args_dict:
+            del args_dict['func']  # Note: compat CLI arguments
+
+        return cls.from_dict(args_dict)
+
+    def to_dict(self):
+        """Convert the instance to a dictionary."""
+        result = self.__dict__.copy()
+        return result
+
+    def __str__(self):
+        """Return a JSON-formatted string representation of the instance."""
+        return json.dumps(self.to_dict(), indent=4, default=str, ensure_ascii=False)
+
+
+def parse_int_or_float(num):
+    number = float(num)
+    if number.is_integer():
+        return int(number)
+    return number
+
+
+def get_supported_params(func):
+    """Get the supported parameters of a function."""
+    sig = signature(func)
+    return list(sig.parameters.keys())
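
A hypothetical subclass, sketched to show how `BaseArgument` is meant to be used; `MyArgs` and its fields are invented for illustration and are not part of evalscope.

```python
from dataclasses import dataclass

from evalscope.utils.argument_utils import BaseArgument, get_supported_params, parse_int_or_float


@dataclass
class MyArgs(BaseArgument):  # hypothetical argument container
    model: str = 'qwen2.5-7b'
    limit: float = 10.0


args = MyArgs.from_dict({'model': 'qwen2.5-0.5b', 'limit': 5})
print(args)                                      # JSON string via BaseArgument.__str__
print(parse_int_or_float('5.0'))                 # 5  (integral floats collapse to int)
print(get_supported_params(parse_int_or_float))  # ['num']
```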
evalscope/utils/import_utils.py CHANGED
@@ -64,3 +64,19 @@ class _LazyModule(ModuleType):

    def __reduce__(self):
        return self.__class__, (self._name, self.__file__, self._import_structure)
+
+
+def is_module_installed(module_name):
+    try:
+        importlib.import_module(module_name)
+        return True
+    except ImportError:
+        return False
+
+
+def get_module_path(module_name):
+    spec = importlib.util.find_spec(module_name)
+    if spec and spec.origin:
+        return os.path.abspath(spec.origin)
+    else:
+        raise ValueError(f'Cannot find module: {module_name}')
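
A brief sketch of the two new helpers; the module name `torch` is just an example.

```python
from evalscope.utils.import_utils import get_module_path, is_module_installed

if is_module_installed('torch'):
    # Absolute path of torch/__init__.py, resolved via importlib.util.find_spec
    print(get_module_path('torch'))
else:
    print('torch is not installed')
```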
evalscope/utils/io_utils.py CHANGED
@@ -1,7 +1,9 @@
 import csv
+import hashlib
 import json
 import jsonlines as jsonl
 import os
+import re
 import yaml

 from evalscope.constants import DumpMode
@@ -221,7 +223,46 @@ def dict_to_json(d: dict, json_file: str):
        json.dump(d, f, indent=4, ensure_ascii=False)


-if __name__ == '__main__':
-    csv_file = 'custom_eval/text/mcq/example_val.csv'
-    jsonl_file = 'custom_eval/text/mcq/example_val.jsonl'
-    csv_to_jsonl(csv_file, jsonl_file)
+def get_latest_folder_path(work_dir):
+    from datetime import datetime
+
+    # Get all subdirectories in the work_dir
+    folders = [f for f in os.listdir(work_dir) if os.path.isdir(os.path.join(work_dir, f))]
+
+    # Get the timestamp(YYYYMMDD_HHMMSS)
+    timestamp_pattern = re.compile(r'^\d{8}_\d{6}$')
+
+    # Filter out the folders
+    timestamped_folders = [f for f in folders if timestamp_pattern.match(f)]
+
+    if not timestamped_folders:
+        print(f'>> No timestamped folders found in {work_dir}!')
+        return None
+
+    # timestamp parser
+    def parse_timestamp(folder_name):
+        return datetime.strptime(folder_name, '%Y%m%d_%H%M%S')
+
+    # Find the latest folder
+    latest_folder = max(timestamped_folders, key=parse_timestamp)
+
+    return os.path.join(work_dir, latest_folder)
+
+
+def gen_hash(name: str, bits: int = 32):
+    return hashlib.md5(name.encode(encoding='UTF-8')).hexdigest()[:bits]
+
+
+def get_valid_list(input_list, candidate_list):
+    """
+    Get the valid and invalid list from input_list based on candidate_list.
+    Args:
+        input_list: The input list.
+        candidate_list: The candidate list.
+
+    Returns:
+        valid_list: The valid list.
+        invalid_list: The invalid list.
+    """
+    return [i for i in input_list if i in candidate_list], \
+           [i for i in input_list if i not in candidate_list]
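
A usage sketch for the helpers relocated into `io_utils.py`. The `./outputs` directory and its timestamped run folders are assumed to exist for the example; they are not part of this diff.

```python
from evalscope.utils.io_utils import gen_hash, get_latest_folder_path, get_valid_list

print(gen_hash('qwen2.5-7b', bits=8))       # first 8 hex chars of the MD5 digest
print(get_latest_folder_path('./outputs'))  # e.g. './outputs/20250704_170000', or None if no match

valid, invalid = get_valid_list(['mmlu', 'not_a_benchmark'], ['mmlu', 'gsm8k'])
print(valid, invalid)                       # ['mmlu'] ['not_a_benchmark']
```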
evalscope/utils/model_utils.py CHANGED
@@ -1,6 +1,9 @@
+import numpy as np
 import os
+import random
+import torch
 from enum import Enum
-from typing import TYPE_CHECKING, Optional, Tuple, Union
+from typing import TYPE_CHECKING, Any, Dict, Optional, Tuple, Union

 if TYPE_CHECKING:
     from transformers import GenerationConfig
@@ -38,3 +41,36 @@ def get_device() -> str:
        device = 'cpu'

    return device
+
+
+def dict_torch_dtype_to_str(d: Dict[str, Any]) -> dict:
+    """
+    Checks whether the passed dictionary and its nested dicts have a *torch_dtype* key and if it's not None,
+    converts torch.dtype to a string of just the type. For example, `torch.float32` get converted into *"float32"*
+    string, which can then be stored in the json format.
+
+    Refer to: https://github.com/huggingface/transformers/pull/16065/files for details.
+    """
+    if d.get('torch_dtype', None) is not None and not isinstance(d['torch_dtype'], str):
+        d['torch_dtype'] = str(d['torch_dtype']).split('.')[1]
+
+    for value in d.values():
+        if isinstance(value, dict):
+            dict_torch_dtype_to_str(value)
+
+    return d
+
+
+def seed_everything(seed: int):
+    """Set all random seeds to a fixed value for reproducibility.
+
+    Args:
+        seed (int): The seed value.
+    """
+    random.seed(seed)
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+    if torch.cuda.is_available():
+        torch.cuda.manual_seed_all(seed)
+    torch.backends.cudnn.deterministic = True
+    torch.backends.cudnn.benchmark = False
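
A short sketch of the new `model_utils` helpers; it assumes torch is installed, and the config dict is purely illustrative.

```python
import torch

from evalscope.utils.model_utils import dict_torch_dtype_to_str, seed_everything

# Seeds random, numpy and torch (including CUDA) and switches cuDNN to deterministic mode.
seed_everything(42)

cfg = {'torch_dtype': torch.float16, 'generation_config': {'torch_dtype': torch.bfloat16}}
print(dict_torch_dtype_to_str(cfg))
# {'torch_dtype': 'float16', 'generation_config': {'torch_dtype': 'bfloat16'}}
```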
evalscope/version.py CHANGED
@@ -1,4 +1,4 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.

-__version__ = '0.16.2'
-__release_datetime__ = '2025-06-23 20:00:00'
+__version__ = '0.17.0'
+__release_datetime__ = '2025-07-04 17:00:00'
{evalscope-0.16.2.dist-info → evalscope-0.17.0.dist-info}/METADATA CHANGED
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: evalscope
-Version: 0.16.2
+Version: 0.17.0
 Summary: EvalScope: Lightweight LLMs Evaluation Framework
 Home-page: https://github.com/modelscope/evalscope
 Author: ModelScope team
@@ -17,14 +17,14 @@ Requires-Python: >=3.8
 Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: accelerate
-Requires-Dist: datasets>=3.0
+Requires-Dist: datasets==3.2.0
 Requires-Dist: immutabledict
 Requires-Dist: jieba
 Requires-Dist: jsonlines
 Requires-Dist: langdetect
 Requires-Dist: latex2sympy2-extended
 Requires-Dist: matplotlib
-Requires-Dist: modelscope[framework]
+Requires-Dist: modelscope[framework]>=1.27
 Requires-Dist: nltk>=3.9
 Requires-Dist: openai
 Requires-Dist: pandas
@@ -52,14 +52,14 @@ Requires-Dist: opencv-python; extra == "aigc"
 Requires-Dist: torchvision; extra == "aigc"
 Provides-Extra: all
 Requires-Dist: accelerate; extra == "all"
-Requires-Dist: datasets>=3.0; extra == "all"
+Requires-Dist: datasets==3.2.0; extra == "all"
 Requires-Dist: immutabledict; extra == "all"
 Requires-Dist: jieba; extra == "all"
 Requires-Dist: jsonlines; extra == "all"
 Requires-Dist: langdetect; extra == "all"
 Requires-Dist: latex2sympy2-extended; extra == "all"
 Requires-Dist: matplotlib; extra == "all"
-Requires-Dist: modelscope[framework]; extra == "all"
+Requires-Dist: modelscope[framework]>=1.27; extra == "all"
 Requires-Dist: nltk>=3.9; extra == "all"
 Requires-Dist: openai; extra == "all"
 Requires-Dist: pandas; extra == "all"
@@ -94,7 +94,7 @@ Requires-Dist: rich; extra == "all"
 Requires-Dist: sse-starlette; extra == "all"
 Requires-Dist: transformers; extra == "all"
 Requires-Dist: uvicorn; extra == "all"
-Requires-Dist: gradio==5.4.0; extra == "all"
+Requires-Dist: gradio>=5.4.0; extra == "all"
 Requires-Dist: plotly<6.0.0,>=5.23.0; extra == "all"
 Requires-Dist: diffusers; extra == "all"
 Requires-Dist: iopath; extra == "all"
@@ -102,9 +102,20 @@ Requires-Dist: omegaconf; extra == "all"
 Requires-Dist: open-clip-torch; extra == "all"
 Requires-Dist: opencv-python; extra == "all"
 Requires-Dist: torchvision; extra == "all"
+Requires-Dist: bfcl-eval; extra == "all"
+Requires-Dist: dotenv; extra == "all"
+Requires-Dist: human-eval; extra == "all"
+Requires-Dist: pytest; extra == "all"
+Requires-Dist: pytest-cov; extra == "all"
 Provides-Extra: app
-Requires-Dist: gradio==5.4.0; extra == "app"
+Requires-Dist: gradio>=5.4.0; extra == "app"
 Requires-Dist: plotly<6.0.0,>=5.23.0; extra == "app"
+Provides-Extra: dev
+Requires-Dist: bfcl-eval; extra == "dev"
+Requires-Dist: dotenv; extra == "dev"
+Requires-Dist: human-eval; extra == "dev"
+Requires-Dist: pytest; extra == "dev"
+Requires-Dist: pytest-cov; extra == "dev"
 Provides-Extra: opencompass
 Requires-Dist: ms-opencompass>=0.1.6; extra == "opencompass"
 Provides-Extra: perf
@@ -198,24 +209,33 @@ EvalScope is not merely an evaluation tool; it is a valuable ally in your model
 Below is the overall architecture diagram of EvalScope:

 <p align="center">
-  <img src="docs/en/_static/images/evalscope_framework.png" width="70%">
+  <img src="https://sail-moe.oss-cn-hangzhou.aliyuncs.com/yunlin/images/evalscope/doc/EvalScope%E6%9E%B6%E6%9E%84%E5%9B%BE.png" width="70%">
   <br>EvalScope Framework.
 </p>

 <details><summary>Framework Description</summary>

 The architecture includes the following modules:
-1. **Model Adapter**: The model adapter is used to convert the outputs of specific models into the format required by the framework, supporting both API call models and locally run models.
-2. **Data Adapter**: The data adapter is responsible for converting and processing input data to meet various evaluation needs and formats.
-3. **Evaluation Backend**:
-    - **Native**: EvalScope’s own **default evaluation framework**, supporting various evaluation modes, including single model evaluation, arena mode, baseline model comparison mode, etc.
-    - **OpenCompass**: Supports [OpenCompass](https://github.com/open-compass/opencompass) as the evaluation backend, providing advanced encapsulation and task simplification, allowing you to submit tasks for evaluation more easily.
-    - **VLMEvalKit**: Supports [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) as the evaluation backend, enabling easy initiation of multi-modal evaluation tasks, supporting various multi-modal models and datasets.
-    - **RAGEval**: Supports RAG evaluation, supporting independent evaluation of embedding models and rerankers using [MTEB/CMTEB](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/mteb.html), as well as end-to-end evaluation using [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html).
-    - **ThirdParty**: Other third-party evaluation tasks, such as ToolBench.
-4. **Performance Evaluator**: Model performance evaluation, responsible for measuring model inference service performance, including performance testing, stress testing, performance report generation, and visualization.
-5. **Evaluation Report**: The final generated evaluation report summarizes the model's performance, which can be used for decision-making and further model optimization.
-6. **Visualization**: Visualization results help users intuitively understand evaluation results, facilitating analysis and comparison of different model performances.
+1. Input Layer
+   - **Model Sources**: API models (OpenAI API), local models (ModelScope)
+   - **Datasets**: Standard evaluation benchmarks (MMLU/GSM8k, etc.), custom data (MCQ/QA)
+
+2. Core Functions
+   - **Multi-backend Evaluation**
+     - Native backends: Unified evaluation for LLM/VLM/Embedding/T2I models
+     - Integrated frameworks: OpenCompass/MTEB/VLMEvalKit/RAGAS
+
+   - **Performance Monitoring**
+     - Model plugins: Supports various model service APIs
+     - Data plugins: Supports multiple data formats
+     - Metric tracking: TTFT/TPOP/Stability and other metrics
+
+   - **Tool Extensions**
+     - Integration: Tool-Bench/Needle-in-a-Haystack/BFCL-v3
+
+3. Output Layer
+   - **Structured Reports**: Supports JSON/Tables/Logs
+   - **Visualization Platforms**: Supports Gradio/Wandb/SwanLab

 </details>

@@ -230,7 +250,9 @@ Please scan the QR code below to join our community groups:

 ## 🎉 News

-- 🔥 **[2025.06.19]** Added support for the BFCL-v3 benchmark, designed to evaluate model function-calling capabilities across various scenarios. For more information, refer to the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/third_party/bfcl_v3.html).
+- 🔥 **[2025.07.03]** Refactored Arena Mode: now supports custom model battles, outputs a model leaderboard, and provides battle result visualization. See [reference](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html) for details.
+- 🔥 **[2025.06.28]** Optimized custom dataset evaluation: now supports evaluation without reference answers. Enhanced LLM judge usage, with built-in modes for "scoring directly without reference answers" and "checking answer consistency with reference answers". See [reference](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset/llm.html#qa) for details.
+- 🔥 **[2025.06.19]** Added support for the [BFCL-v3](https://modelscope.cn/datasets/AI-ModelScope/bfcl_v3) benchmark, designed to evaluate model function-calling capabilities across various scenarios. For more information, refer to the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/third_party/bfcl_v3.html).
 - 🔥 **[2025.06.02]** Added support for the Needle-in-a-Haystack test. Simply specify `needle_haystack` to conduct the test, and a corresponding heatmap will be generated in the `outputs/reports` folder, providing a visual representation of the model's performance. Refer to the [documentation](https://evalscope.readthedocs.io/en/latest/third_party/needle_haystack.html) for more details.
 - 🔥 **[2025.05.29]** Added support for two long document evaluation benchmarks: [DocMath](https://modelscope.cn/datasets/yale-nlp/DocMath-Eval/summary) and [FRAMES](https://modelscope.cn/datasets/iic/frames/summary). For usage guidelines, please refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset.html).
 - 🔥 **[2025.05.16]** Model service performance stress testing now supports setting various levels of concurrency and outputs a performance test report. [Reference example](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#id3).
@@ -248,12 +270,12 @@ Please scan the QR code below to join our community groups:
 - 🔥 **[2025.03.03]** Added support for evaluating the IQ and EQ of models. Refer to [📖 Best Practices for IQ and EQ Evaluation](https://evalscope.readthedocs.io/en/latest/best_practice/iquiz.html) to find out how smart your AI is!
 - 🔥 **[2025.02.27]** Added support for evaluating the reasoning efficiency of models. Refer to [📖 Best Practices for Evaluating Thinking Efficiency](https://evalscope.readthedocs.io/en/latest/best_practice/think_eval.html). This implementation is inspired by the works [Overthinking](https://doi.org/10.48550/arXiv.2412.21187) and [Underthinking](https://doi.org/10.48550/arXiv.2501.18585).
 - 🔥 **[2025.02.25]** Added support for two model inference-related evaluation benchmarks: [MuSR](https://modelscope.cn/datasets/AI-ModelScope/MuSR) and [ProcessBench](https://www.modelscope.cn/datasets/Qwen/ProcessBench/summary). To use them, simply specify `musr` and `process_bench` respectively in the datasets parameter.
+<details><summary>More</summary>
+
 - 🔥 **[2025.02.18]** Supports the AIME25 dataset, which contains 15 questions (Grok3 scored 93 on this dataset).
 - 🔥 **[2025.02.13]** Added support for evaluating DeepSeek distilled models, including AIME24, MATH-500, and GPQA-Diamond datasets,refer to [best practice](https://evalscope.readthedocs.io/en/latest/best_practice/deepseek_r1_distill.html); Added support for specifying the `eval_batch_size` parameter to accelerate model evaluation.
 - 🔥 **[2025.01.20]** Support for visualizing evaluation results, including single model evaluation results and multi-model comparison, refer to the [📖 Visualizing Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html) for more details; Added [`iquiz`](https://modelscope.cn/datasets/AI-ModelScope/IQuiz/summary) evaluation example, evaluating the IQ and EQ of the model.
 - 🔥 **[2025.01.07]** Native backend: Support for model API evaluation is now available. Refer to the [📖 Model API Evaluation Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#api) for more details. Additionally, support for the `ifeval` evaluation benchmark has been added.
-<details><summary>More</summary>
-
 - 🔥🔥 **[2024.12.31]** Support for adding benchmark evaluations, refer to the [📖 Benchmark Evaluation Addition Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/add_benchmark.html); support for custom mixed dataset evaluations, allowing for more comprehensive model evaluations with less data, refer to the [📖 Mixed Dataset Evaluation Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/collection/index.html).
 - 🔥 **[2024.12.13]** Model evaluation optimization: no need to pass the `--template-type` parameter anymore; supports starting evaluation with `evalscope eval --args`. Refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html) for more details.
 - 🔥 **[2024.11.26]** The model inference service performance evaluator has been completely refactored: it now supports local inference service startup and Speed Benchmark; asynchronous call error handling has been optimized. For more details, refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/index.html).
@@ -572,10 +594,17 @@ Speed Benchmark Results:
 EvalScope supports custom dataset evaluation. For detailed information, please refer to the Custom Dataset Evaluation [📖User Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset/index.html)


-## 🏟️ Arena Mode
-The Arena mode allows multiple candidate models to be evaluated through pairwise battles, and can choose to use the AI Enhanced Auto-Reviewer (AAR) automatic evaluation process or manual evaluation to obtain the evaluation report.
+## ⚔️ Arena Mode
+
+Arena mode allows you to configure multiple candidate models and specify a baseline model. Evaluation is performed by pairwise battles between each candidate model and the baseline model, with the final output including each model's win rate and ranking. This method is suitable for comparative evaluation among multiple models, providing an intuitive reflection of each model's strengths and weaknesses. Refer to: Arena Mode [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html)

-Refer to: Arena Mode [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html)
+```text
+Model         WinRate (%)    CI (%)
+------------  -------------  ---------------
+qwen2.5-72b   69.3           (-13.3 / +12.2)
+qwen2.5-7b    50             (+0.0 / +0.0)
+qwen2.5-0.5b  4.7            (-2.5 / +4.4)
+```

 ## 👷‍♂️ Contribution

@@ -601,7 +630,7 @@ EvalScope, as the official evaluation tool of [ModelScope](https://modelscope.cn
 - [ ] Distributed evaluating
 - [x] Multi-modal evaluation
 - [ ] Benchmarks
-  - [ ] GAIA
+  - [x] BFCL-v3
   - [x] GPQA
   - [x] MBPP