evalscope 0.7.1__py3-none-any.whl → 0.8.0__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- evalscope/__init__.py +1 -1
- evalscope/arguments.py +73 -0
- evalscope/backend/base.py +5 -1
- evalscope/backend/opencompass/api_meta_template.py +8 -14
- evalscope/backend/opencompass/backend_manager.py +24 -15
- evalscope/backend/opencompass/tasks/eval_api.py +1 -6
- evalscope/backend/opencompass/tasks/eval_datasets.py +26 -28
- evalscope/backend/rag_eval/__init__.py +3 -3
- evalscope/backend/rag_eval/backend_manager.py +21 -25
- evalscope/backend/rag_eval/clip_benchmark/__init__.py +1 -1
- evalscope/backend/rag_eval/clip_benchmark/arguments.py +6 -6
- evalscope/backend/rag_eval/clip_benchmark/dataset_builder.py +62 -79
- evalscope/backend/rag_eval/clip_benchmark/task_template.py +29 -43
- evalscope/backend/rag_eval/clip_benchmark/tasks/image_caption.py +20 -22
- evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_classification.py +16 -23
- evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_retrieval.py +14 -35
- evalscope/backend/rag_eval/clip_benchmark/utils/webdataset_convert.py +69 -90
- evalscope/backend/rag_eval/cmteb/__init__.py +3 -3
- evalscope/backend/rag_eval/cmteb/arguments.py +25 -27
- evalscope/backend/rag_eval/cmteb/base.py +22 -23
- evalscope/backend/rag_eval/cmteb/task_template.py +15 -17
- evalscope/backend/rag_eval/cmteb/tasks/Classification.py +98 -79
- evalscope/backend/rag_eval/cmteb/tasks/Clustering.py +17 -22
- evalscope/backend/rag_eval/cmteb/tasks/CustomTask.py +17 -19
- evalscope/backend/rag_eval/cmteb/tasks/PairClassification.py +35 -29
- evalscope/backend/rag_eval/cmteb/tasks/Reranking.py +18 -5
- evalscope/backend/rag_eval/cmteb/tasks/Retrieval.py +163 -163
- evalscope/backend/rag_eval/cmteb/tasks/STS.py +126 -104
- evalscope/backend/rag_eval/cmteb/tasks/__init__.py +33 -34
- evalscope/backend/rag_eval/ragas/__init__.py +2 -2
- evalscope/backend/rag_eval/ragas/arguments.py +3 -8
- evalscope/backend/rag_eval/ragas/prompts/chinese/AnswerCorrectness/correctness_prompt_chinese.json +9 -9
- evalscope/backend/rag_eval/ragas/prompts/chinese/AnswerCorrectness/long_form_answer_prompt_chinese.json +2 -2
- evalscope/backend/rag_eval/ragas/prompts/chinese/AnswerRelevancy/question_generation_chinese.json +3 -3
- evalscope/backend/rag_eval/ragas/prompts/chinese/ContextPrecision/context_precision_prompt_chinese.json +5 -5
- evalscope/backend/rag_eval/ragas/prompts/chinese/CustomNodeFilter/scoring_prompt_chinese.json +7 -0
- evalscope/backend/rag_eval/ragas/prompts/chinese/Faithfulness/nli_statements_message_chinese.json +8 -8
- evalscope/backend/rag_eval/ragas/prompts/chinese/Faithfulness/statement_prompt_chinese.json +5 -5
- evalscope/backend/rag_eval/ragas/prompts/chinese/HeadlinesExtractor/prompt_chinese.json +7 -5
- evalscope/backend/rag_eval/ragas/prompts/chinese/MultiHopAbstractQuerySynthesizer/concept_combination_prompt_chinese.json +2 -2
- evalscope/backend/rag_eval/ragas/prompts/chinese/MultiHopAbstractQuerySynthesizer/generate_query_reference_prompt_chinese.json +27 -4
- evalscope/backend/rag_eval/ragas/prompts/chinese/MultiHopAbstractQuerySynthesizer/theme_persona_matching_prompt_chinese.json +2 -2
- evalscope/backend/rag_eval/ragas/prompts/chinese/MultiHopSpecificQuerySynthesizer/generate_query_reference_prompt_chinese.json +27 -4
- evalscope/backend/rag_eval/ragas/prompts/chinese/MultiHopSpecificQuerySynthesizer/theme_persona_matching_prompt_chinese.json +2 -2
- evalscope/backend/rag_eval/ragas/prompts/chinese/MultiModalFaithfulness/faithfulness_prompt_chinese.json +2 -2
- evalscope/backend/rag_eval/ragas/prompts/chinese/MultiModalRelevance/relevance_prompt_chinese.json +5 -5
- evalscope/backend/rag_eval/ragas/prompts/chinese/NERExtractor/prompt_chinese.json +3 -3
- evalscope/backend/rag_eval/ragas/prompts/chinese/SingleHopSpecificQuerySynthesizer/generate_query_reference_prompt_chinese.json +21 -4
- evalscope/backend/rag_eval/ragas/prompts/chinese/SingleHopSpecificQuerySynthesizer/theme_persona_matching_prompt_chinese.json +3 -3
- evalscope/backend/rag_eval/ragas/prompts/chinese/SummaryExtractor/prompt_chinese.json +4 -4
- evalscope/backend/rag_eval/ragas/prompts/chinese/ThemesExtractor/prompt_chinese.json +2 -2
- evalscope/backend/rag_eval/ragas/prompts/persona_prompt.py +0 -1
- evalscope/backend/rag_eval/ragas/task_template.py +10 -15
- evalscope/backend/rag_eval/ragas/tasks/__init__.py +1 -1
- evalscope/backend/rag_eval/ragas/tasks/build_distribution.py +45 -0
- evalscope/backend/rag_eval/ragas/tasks/build_transform.py +135 -0
- evalscope/backend/rag_eval/ragas/tasks/testset_generation.py +17 -133
- evalscope/backend/rag_eval/ragas/tasks/translate_prompt.py +8 -18
- evalscope/backend/rag_eval/utils/clip.py +46 -50
- evalscope/backend/rag_eval/utils/embedding.py +12 -11
- evalscope/backend/rag_eval/utils/llm.py +8 -6
- evalscope/backend/rag_eval/utils/tools.py +12 -11
- evalscope/backend/vlm_eval_kit/__init__.py +1 -1
- evalscope/backend/vlm_eval_kit/custom_dataset.py +7 -8
- evalscope/benchmarks/arc/__init__.py +3 -2
- evalscope/benchmarks/arc/ai2_arc.py +19 -16
- evalscope/benchmarks/arc/arc_adapter.py +32 -24
- evalscope/benchmarks/bbh/__init__.py +1 -2
- evalscope/benchmarks/bbh/bbh_adapter.py +28 -25
- evalscope/benchmarks/bbh/cot_prompts/boolean_expressions.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/causal_judgement.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/date_understanding.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/disambiguation_qa.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/dyck_languages.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/formal_fallacies.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/geometric_shapes.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/hyperbaton.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/logical_deduction_five_objects.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/logical_deduction_seven_objects.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/logical_deduction_three_objects.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/movie_recommendation.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/multistep_arithmetic_two.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/navigate.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/object_counting.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/penguins_in_a_table.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/reasoning_about_colored_objects.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/ruin_names.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/salient_translation_error_detection.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/snarks.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/sports_understanding.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/temporal_sequences.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_five_objects.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_seven_objects.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_three_objects.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/web_of_lies.txt +1 -1
- evalscope/benchmarks/bbh/cot_prompts/word_sorting.txt +1 -1
- evalscope/benchmarks/benchmark.py +16 -16
- evalscope/benchmarks/ceval/__init__.py +3 -2
- evalscope/benchmarks/ceval/ceval_adapter.py +80 -69
- evalscope/benchmarks/ceval/ceval_exam.py +18 -31
- evalscope/benchmarks/cmmlu/__init__.py +3 -2
- evalscope/benchmarks/cmmlu/cmmlu.py +87 -92
- evalscope/benchmarks/cmmlu/cmmlu_adapter.py +109 -155
- evalscope/benchmarks/cmmlu/samples.jsonl +1 -1
- evalscope/benchmarks/competition_math/__init__.py +3 -2
- evalscope/benchmarks/competition_math/competition_math.py +7 -16
- evalscope/benchmarks/competition_math/competition_math_adapter.py +32 -34
- evalscope/benchmarks/data_adapter.py +24 -24
- evalscope/benchmarks/general_qa/__init__.py +3 -2
- evalscope/benchmarks/general_qa/general_qa_adapter.py +34 -38
- evalscope/benchmarks/gsm8k/__init__.py +1 -1
- evalscope/benchmarks/gsm8k/gsm8k.py +6 -12
- evalscope/benchmarks/gsm8k/gsm8k_adapter.py +26 -24
- evalscope/benchmarks/hellaswag/__init__.py +3 -2
- evalscope/benchmarks/hellaswag/hellaswag.py +15 -19
- evalscope/benchmarks/hellaswag/hellaswag_adapter.py +27 -23
- evalscope/benchmarks/humaneval/__init__.py +1 -1
- evalscope/benchmarks/humaneval/humaneval.py +15 -18
- evalscope/benchmarks/humaneval/humaneval_adapter.py +0 -1
- evalscope/benchmarks/mmlu/__init__.py +3 -2
- evalscope/benchmarks/mmlu/mmlu.py +15 -29
- evalscope/benchmarks/mmlu/mmlu_adapter.py +85 -77
- evalscope/benchmarks/race/__init__.py +3 -2
- evalscope/benchmarks/race/race.py +21 -35
- evalscope/benchmarks/race/race_adapter.py +32 -29
- evalscope/benchmarks/race/samples.jsonl +1 -1
- evalscope/benchmarks/trivia_qa/__init__.py +3 -2
- evalscope/benchmarks/trivia_qa/samples.jsonl +1 -1
- evalscope/benchmarks/trivia_qa/trivia_qa.py +19 -34
- evalscope/benchmarks/trivia_qa/trivia_qa_adapter.py +27 -22
- evalscope/benchmarks/truthful_qa/__init__.py +3 -2
- evalscope/benchmarks/truthful_qa/truthful_qa.py +25 -29
- evalscope/benchmarks/truthful_qa/truthful_qa_adapter.py +36 -37
- evalscope/cli/cli.py +6 -5
- evalscope/cli/start_eval.py +31 -0
- evalscope/cli/start_perf.py +0 -3
- evalscope/cli/start_server.py +27 -41
- evalscope/config.py +119 -95
- evalscope/constants.py +61 -29
- evalscope/evaluator/__init__.py +1 -0
- evalscope/evaluator/evaluator.py +96 -377
- evalscope/evaluator/humaneval_evaluator.py +158 -0
- evalscope/evaluator/rating_eval.py +12 -33
- evalscope/evaluator/reviewer/auto_reviewer.py +47 -76
- evalscope/metrics/bundled_rouge_score/rouge_scorer.py +10 -20
- evalscope/metrics/code_metric.py +3 -9
- evalscope/metrics/math_accuracy.py +3 -6
- evalscope/metrics/metrics.py +21 -21
- evalscope/metrics/rouge_metric.py +11 -25
- evalscope/models/__init__.py +1 -2
- evalscope/models/api/openai_api.py +40 -29
- evalscope/models/custom/__init__.py +0 -1
- evalscope/models/custom/custom_model.py +3 -3
- evalscope/models/dummy_chat_model.py +7 -8
- evalscope/models/model_adapter.py +89 -156
- evalscope/models/openai_model.py +20 -20
- evalscope/perf/arguments.py +15 -3
- evalscope/perf/benchmark.py +7 -9
- evalscope/perf/http_client.py +3 -8
- evalscope/perf/main.py +10 -0
- evalscope/perf/plugin/api/custom_api.py +1 -2
- evalscope/perf/plugin/api/dashscope_api.py +1 -2
- evalscope/perf/plugin/api/openai_api.py +3 -4
- evalscope/perf/plugin/datasets/base.py +1 -2
- evalscope/perf/plugin/datasets/flickr8k.py +1 -2
- evalscope/perf/plugin/datasets/longalpaca.py +1 -2
- evalscope/perf/plugin/datasets/openqa.py +1 -2
- evalscope/perf/utils/analysis_result.py +1 -2
- evalscope/perf/utils/benchmark_util.py +1 -2
- evalscope/perf/utils/db_util.py +11 -8
- evalscope/perf/utils/local_server.py +19 -13
- evalscope/registry/config/cfg_arena_zhihu.yaml +1 -1
- evalscope/registry/tasks/arc.yaml +2 -3
- evalscope/registry/tasks/bbh.yaml +3 -4
- evalscope/registry/tasks/bbh_mini.yaml +3 -4
- evalscope/registry/tasks/ceval.yaml +3 -3
- evalscope/registry/tasks/ceval_mini.yaml +3 -4
- evalscope/registry/tasks/cmmlu.yaml +3 -3
- evalscope/registry/tasks/eval_qwen-7b-chat_v100.yaml +1 -1
- evalscope/registry/tasks/general_qa.yaml +1 -1
- evalscope/registry/tasks/gsm8k.yaml +2 -2
- evalscope/registry/tasks/mmlu.yaml +3 -3
- evalscope/registry/tasks/mmlu_mini.yaml +3 -3
- evalscope/run.py +184 -375
- evalscope/run_arena.py +20 -25
- evalscope/summarizer.py +16 -17
- evalscope/third_party/longbench_write/README.md +99 -42
- evalscope/third_party/longbench_write/default_task.json +1 -1
- evalscope/third_party/longbench_write/default_task.yaml +8 -7
- evalscope/third_party/longbench_write/eval.py +29 -28
- evalscope/third_party/longbench_write/infer.py +16 -104
- evalscope/third_party/longbench_write/longbench_write.py +5 -5
- evalscope/third_party/longbench_write/resources/judge.txt +1 -1
- evalscope/third_party/longbench_write/tools/data_etl.py +4 -5
- evalscope/third_party/longbench_write/utils.py +0 -1
- evalscope/third_party/toolbench_static/eval.py +14 -15
- evalscope/third_party/toolbench_static/infer.py +48 -69
- evalscope/third_party/toolbench_static/llm/swift_infer.py +4 -12
- evalscope/third_party/toolbench_static/requirements.txt +1 -1
- evalscope/third_party/toolbench_static/toolbench_static.py +3 -3
- evalscope/tools/combine_reports.py +25 -30
- evalscope/tools/rewrite_eval_results.py +14 -46
- evalscope/utils/__init__.py +0 -1
- evalscope/utils/arena_utils.py +18 -48
- evalscope/{perf/utils → utils}/chat_service.py +3 -4
- evalscope/utils/completion_parsers.py +3 -8
- evalscope/utils/logger.py +9 -7
- evalscope/utils/model_utils.py +11 -0
- evalscope/utils/utils.py +12 -138
- evalscope/version.py +2 -2
- {evalscope-0.7.1.dist-info → evalscope-0.8.0.dist-info}/METADATA +125 -120
- evalscope-0.8.0.dist-info/RECORD +285 -0
- tests/cli/test_run.py +54 -15
- tests/perf/test_perf.py +4 -0
- tests/rag/test_clip_benchmark.py +38 -38
- tests/rag/test_mteb.py +3 -2
- tests/rag/test_ragas.py +5 -5
- tests/swift/test_run_swift_eval.py +2 -3
- tests/swift/test_run_swift_vlm_eval.py +2 -3
- tests/swift/test_run_swift_vlm_jugde_eval.py +2 -3
- evalscope/backend/rag_eval/ragas/metrics/__init__.py +0 -2
- evalscope/backend/rag_eval/ragas/metrics/multi_modal_faithfulness.py +0 -91
- evalscope/backend/rag_eval/ragas/metrics/multi_modal_relevance.py +0 -99
- evalscope/cache.py +0 -98
- evalscope/models/template.py +0 -1446
- evalscope/run_ms.py +0 -140
- evalscope/utils/task_cfg_parser.py +0 -10
- evalscope/utils/task_utils.py +0 -22
- evalscope-0.7.1.dist-info/RECORD +0 -286
- {evalscope-0.7.1.dist-info → evalscope-0.8.0.dist-info}/LICENSE +0 -0
- {evalscope-0.7.1.dist-info → evalscope-0.8.0.dist-info}/WHEEL +0 -0
- {evalscope-0.7.1.dist-info → evalscope-0.8.0.dist-info}/entry_points.txt +0 -0
- {evalscope-0.7.1.dist-info → evalscope-0.8.0.dist-info}/top_level.txt +0 -0
evalscope/utils/utils.py
CHANGED

@@ -5,20 +5,19 @@ import functools
 import hashlib
 import importlib
 import importlib.util
+import json
+import jsonlines as jsonl
+import numpy as np
 import os
 import random
 import re
 import sys
-from typing import Any, Dict, List, Tuple, Union
-
-import json
-import jsonlines as jsonl
-import numpy as np
 import torch
 import torch.nn.functional as F
 import yaml
+from typing import Any, Dict, List, Tuple, Union

-from evalscope.constants import DumpMode
+from evalscope.constants import DumpMode
 from evalscope.utils.logger import get_logger

 logger = get_logger()
@@ -86,13 +85,15 @@ def dump_jsonl_data(data_list, jsonl_file, dump_mode=DumpMode.OVERWRITE):

     jsonl_file = os.path.expanduser(jsonl_file)

+    if not isinstance(data_list, list):
+        data_list = [data_list]
+
     if dump_mode == DumpMode.OVERWRITE:
         dump_mode = 'w'
     elif dump_mode == DumpMode.APPEND:
         dump_mode = 'a'
     with jsonl.open(jsonl_file, mode=dump_mode) as writer:
         writer.write_all(data_list)
-    logger.info(f'Dump data to {jsonl_file} successfully.')


 def yaml_to_dict(yaml_file) -> dict:
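The practical effect of this hunk is that `dump_jsonl_data` now accepts a single record as well as a list, and no longer logs a message per write (the `logger.info` line was removed). A minimal sketch of the new behavior, with the file name and records purely illustrative and `DumpMode` replaced by plain mode strings for brevity:

```python
# Sketch of the 0.8.0 behavior shown above: a single record is wrapped into a
# list before writing, so both calls below produce valid JSON Lines output.
import jsonlines as jsonl


def dump_jsonl_data(data_list, jsonl_file, mode='w'):
    if not isinstance(data_list, list):
        data_list = [data_list]  # new in 0.8.0: single records are accepted
    with jsonl.open(jsonl_file, mode=mode) as writer:
        writer.write_all(data_list)


dump_jsonl_data({'id': 1, 'answer': '42'}, 'predictions.jsonl')         # single dict
dump_jsonl_data([{'id': 2}, {'id': 3}], 'predictions.jsonl', mode='a')  # list, appended
```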
@@ -115,7 +116,6 @@ def dict_to_yaml(d: dict, yaml_file: str):
     """
     with open(yaml_file, 'w') as f:
         yaml.dump(d, f, default_flow_style=False)
-    logger.info(f'Dump data to {yaml_file} successfully.')


 def json_to_dict(json_file) -> dict:
@@ -148,25 +148,13 @@ def get_obj_from_cfg(eval_class_ref: Any, *args, **kwargs) -> Any:
     return functools.partial(obj_cls, *args, **kwargs)


-def markdown_table(header_l, data_l):
-    md_str = f'| {" | ".join(header_l)} |'
-    md_str += f'\n| {" | ".join(["---"] * len(header_l))} |'
-    for data in data_l:
-        if isinstance(data, str):
-            data = [data]
-        assert len(data) <= len(header_l)
-        tmp = data + [''] * (len(header_l) - len(data))
-        md_str += f'\n| {" | ".join(tmp)} |'
-    return md_str
-
-
 def random_seeded_choice(seed: Union[int, str, float], choices, **kwargs):
     """Random choice with a (potentially string) seed."""
     return random.Random(seed).choices(choices, k=1, **kwargs)[0]


-def gen_hash(name: str):
-    return hashlib.md5(name.encode(encoding='UTF-8')).hexdigest()
+def gen_hash(name: str, bits: int = 32):
+    return hashlib.md5(name.encode(encoding='UTF-8')).hexdigest()[:bits]


 def dict_torch_dtype_to_str(d: Dict[str, Any]) -> dict:
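`gen_hash` gains an optional `bits` argument that truncates the returned hex digest. Despite the name, `bits` counts hex characters, so the default of 32 still returns the full MD5 digest and existing call sites are unaffected. A quick, self-contained illustration (the input strings are arbitrary examples):

```python
# Sketch of the new gen_hash signature from the hunk above.
import hashlib


def gen_hash(name: str, bits: int = 32) -> str:
    # The MD5 hex digest is 32 characters; slicing keeps the first `bits` of them.
    return hashlib.md5(name.encode(encoding='UTF-8')).hexdigest()[:bits]


print(gen_hash('Qwen/Qwen2.5-0.5B-Instruct'))     # full 32-character digest, as in 0.7.1
print(gen_hash('Qwen/Qwen2.5-0.5B-Instruct', 8))  # shortened 8-character prefix
```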
@@ -313,14 +301,6 @@ class ResponseParser:


 def make_outputs_dir(root_dir: str, datasets: list, model_id: str, model_revision: str):
-    # model_revision = model_revision if model_revision is not None else 'none'
-    # now = datetime.datetime.now()
-    # format_time = now.strftime('%Y%m%d_%H%M%S')
-    # outputs_name = format_time + '_' + 'default' + '_' + model_id.replace('/', '_') + '_' + model_revision
-    # outputs_dir = os.path.join(work_dir, outputs_name)
-    # dataset_name = dataset_id.replace('/', '_')
-    # outputs_dir = os.path.join(work_dir, dataset_name)
-
     if not model_id:
         model_id = 'default'
     model_id = model_id.replace('/', '_')
@@ -328,37 +308,11 @@ def make_outputs_dir(root_dir: str, datasets: list, model_id: str, model_revision: str):
     if not model_revision:
         model_revision = 'default'

-    outputs_dir = os.path.join(root_dir, f"eval_{'-'.join(datasets)}
+    outputs_dir = os.path.join(root_dir, model_id, model_revision, f"eval_{'-'.join(datasets)}")

     return outputs_dir


-def process_outputs_structure(outputs_dir: str, is_make: bool = True) -> dict:
-    logs_dir = os.path.join(outputs_dir, 'logs')
-    predictions_dir = os.path.join(outputs_dir, 'predictions')
-    reviews_dir = os.path.join(outputs_dir, 'reviews')
-    reports_dir = os.path.join(outputs_dir, 'reports')
-    configs_dir = os.path.join(outputs_dir, 'configs')
-
-    if is_make:
-        os.makedirs(outputs_dir, exist_ok=True)
-        os.makedirs(logs_dir, exist_ok=True)
-        os.makedirs(predictions_dir, exist_ok=True)
-        os.makedirs(reviews_dir, exist_ok=True)
-        os.makedirs(reports_dir, exist_ok=True)
-        os.makedirs(configs_dir, exist_ok=True)
-
-    outputs_structure = {
-        OutputsStructure.LOGS_DIR: logs_dir,
-        OutputsStructure.PREDICTIONS_DIR: predictions_dir,
-        OutputsStructure.REVIEWS_DIR: reviews_dir,
-        OutputsStructure.REPORTS_DIR: reports_dir,
-        OutputsStructure.CONFIGS_DIR: configs_dir,
-    }
-
-    return outputs_structure
-
-
 def import_module_util(import_path_prefix: str, module_name: str, members_to_import: list) -> dict:
     """
     Import module utility function.
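In effect, evaluation outputs move from a single `eval_<datasets>` folder under the root directory to a per-model, per-revision hierarchy, and the fixed `logs/predictions/reviews/reports/configs` layout formerly built by `process_outputs_structure` is no longer created here. A condensed sketch of the new path logic (the model and dataset names are just examples):

```python
# Sketch of the output-directory layout implied by the new make_outputs_dir above.
import os


def make_outputs_dir(root_dir: str, datasets: list, model_id: str, model_revision: str) -> str:
    model_id = (model_id or 'default').replace('/', '_')
    model_revision = model_revision or 'default'
    return os.path.join(root_dir, model_id, model_revision, f"eval_{'-'.join(datasets)}")


print(make_outputs_dir('outputs', ['gsm8k', 'arc'], 'Qwen/Qwen2.5-0.5B-Instruct', 'master'))
# -> outputs/Qwen_Qwen2.5-0.5B-Instruct/master/eval_gsm8k-arc (POSIX path separators)
```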
@@ -442,48 +396,6 @@ def split_str_parts_by(text: str, delimiters: List[str]):
     return text_list


-def calculate_loss_scale(response: str, use_loss_scale=False) -> Tuple[List[str], List[float]]:
-    """Calculate the loss scale by splitting the agent response.
-    This algorithm comes from paper: https://arxiv.org/pdf/2309.00986.pdf
-    Agent response format:
-    ```text
-    Thought: you should always think about what to do
-    Action: the action to take, should be one of the above tools[fire_recognition,
-        fire_alert, call_police, call_fireman]
-    Action Input: the input to the action
-    Observation: the result of the action
-    ... (this Thought/Action/Action Input/Observation can be repeated zero or more times)
-    Thought: I now know the final answer
-    Final Answer: the final answer to the original input question
-    ```
-    Args:
-        response: The response text
-        use_loss_scale: Use weighted loss. With this, some part of the loss will be enhanced to improve performance.
-    Returns:
-        A tuple of agent response parts and their weights.
-    """
-    if 'Action:' in response and 'Observation:' in response and use_loss_scale:
-        agent_keyword = ['Action:', 'Action Input:', 'Thought:', 'Final Answer:', 'Observation:']
-        agent_parts = split_str_parts_by(response, agent_keyword)
-        weights = []
-        agent_content = []
-        for c in agent_parts:
-            if c['key'] in ('Action:', 'Action Input:'):
-                weights += [2.0]
-                weights += [2.0]
-            elif c['key'] in ('Thought:', 'Final Answer:', ''):
-                weights += [1.0]
-                weights += [1.0]
-            elif c['key'] in ('Observation:', ):
-                weights += [2.0]
-                weights += [0.0]
-            agent_content.append(c['key'])
-            agent_content.append(c['content'])
-        return agent_content, weights
-    else:
-        return [response], [1.0]
-
-
 def get_bucket_sizes(max_length: int) -> List[int]:
     return [max_length // 4 * (i + 1) for i in range(4)]

@@ -504,45 +416,6 @@ def _get_closet_bucket(bucket_sizes, data_length):
     return cloest_length


-def pad_and_split_batch(padding_to, input_ids, attention_mask, labels, loss_scale, max_length, tokenizer, rank,
-                        world_size):
-    if padding_to is None:
-        longest_len = input_ids.shape[-1]
-        bucket_sizes = get_bucket_sizes(max_length)
-        bucket_data_length = _get_closet_bucket(bucket_sizes, longest_len)
-        padding_length = bucket_data_length - input_ids.shape[1]
-        input_ids = F.pad(input_ids, (0, padding_length), 'constant', tokenizer.pad_token_id)
-        attention_mask = F.pad(attention_mask, (0, padding_length), 'constant', 0)
-        if loss_scale:
-            loss_scale = F.pad(loss_scale, (0, padding_length), 'constant', 0.)
-        labels = F.pad(labels, (0, padding_length), 'constant', -100)
-
-    # manully split the batch to different DP rank.
-    batch_size = input_ids.shape[0] // world_size
-    if batch_size > 0:
-        start = rank * batch_size
-        end = (rank + 1) * batch_size
-        input_ids = input_ids[start:end, :]
-        attention_mask = attention_mask[start:end, :]
-        labels = labels[start:end, :]
-        if loss_scale:
-            loss_scale = loss_scale[start:end, :]
-    return input_ids, attention_mask, labels, loss_scale
-
-
-def get_dist_setting() -> Tuple[int, int, int, int]:
-    """return rank, local_rank, world_size, local_world_size"""
-    rank = int(os.getenv('RANK', -1))
-    local_rank = int(os.getenv('LOCAL_RANK', -1))
-    world_size = int(os.getenv('WORLD_SIZE', 1))
-    local_world_size = int(os.getenv('LOCAL_WORLD_SIZE', 1))
-    return rank, local_rank, world_size, local_world_size
-
-
-def use_torchacc() -> bool:
-    return os.getenv('USE_TORCHACC', '0') == '1'
-
-
 def is_module_installed(module_name):
     try:
         importlib.import_module(module_name)
@@ -576,6 +449,7 @@ def get_valid_list(input_list, candidate_list):

 def get_latest_folder_path(work_dir):
     from datetime import datetime
+
     # Get all subdirectories in the work_dir
     folders = [f for f in os.listdir(work_dir) if os.path.isdir(os.path.join(work_dir, f))]

{evalscope-0.7.1.dist-info → evalscope-0.8.0.dist-info}/METADATA
CHANGED
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: evalscope
-Version: 0.7.1
+Version: 0.8.0
 Summary: EvalScope: Lightweight LLMs Evaluation Framework
 Home-page: https://github.com/modelscope/evalscope
 Author: ModelScope team
@@ -29,7 +29,7 @@ Requires-Dist: nltk>=3.9
 Requires-Dist: openai
 Requires-Dist: pandas
 Requires-Dist: plotly
-Requires-Dist: pyarrow
+Requires-Dist: pyarrow
 Requires-Dist: pympler
 Requires-Dist: pyyaml
 Requires-Dist: regex
@@ -62,7 +62,7 @@ Requires-Dist: nltk>=3.9; extra == "all"
 Requires-Dist: openai; extra == "all"
 Requires-Dist: pandas; extra == "all"
 Requires-Dist: plotly; extra == "all"
-Requires-Dist: pyarrow
+Requires-Dist: pyarrow; extra == "all"
 Requires-Dist: pympler; extra == "all"
 Requires-Dist: pyyaml; extra == "all"
 Requires-Dist: regex; extra == "all"
@@ -84,7 +84,7 @@ Requires-Dist: transformers-stream-generator; extra == "all"
 Requires-Dist: ms-opencompass>=0.1.4; extra == "all"
 Requires-Dist: ms-vlmeval>=0.0.9; extra == "all"
 Requires-Dist: mteb==1.19.4; extra == "all"
-Requires-Dist: ragas==0.2.
+Requires-Dist: ragas==0.2.7; extra == "all"
 Requires-Dist: webdataset>0.2.0; extra == "all"
 Requires-Dist: aiohttp; extra == "all"
 Requires-Dist: fastapi; extra == "all"
@@ -129,48 +129,52 @@ Requires-Dist: transformers; extra == "perf"
 Requires-Dist: unicorn; extra == "perf"
 Provides-Extra: rag
 Requires-Dist: mteb==1.19.4; extra == "rag"
-Requires-Dist: ragas==0.2.
+Requires-Dist: ragas==0.2.7; extra == "rag"
 Requires-Dist: webdataset>0.2.0; extra == "rag"
 Provides-Extra: vlmeval
 Requires-Dist: ms-vlmeval>=0.0.9; extra == "vlmeval"

+<p align="center">
+<br>
+<img src="docs/en/_static/images/evalscope_logo.png"/>
+<br>
+<p>


-![]()
-
 <p align="center">
-
+<a href="README_zh.md">中文</a> &nbsp ｜ &nbsp English &nbsp
 </p>

 <p align="center">
-
-
-
-
-
-
-
-
-<a href="https://evalscope.readthedocs.io/en/latest/"
+<img src="https://img.shields.io/badge/python-%E2%89%A53.8-5be.svg">
+<a href="https://badge.fury.io/py/evalscope"><img src="https://badge.fury.io/py/evalscope.svg" alt="PyPI version" height="18"></a>
+<a href="https://pypi.org/project/evalscope"><img alt="PyPI - Downloads" src="https://static.pepy.tech/badge/evalscope"></a>
+<a href="https://github.com/modelscope/evalscope/pulls"><img src="https://img.shields.io/badge/PR-welcome-55EB99.svg"></a>
+<a href='https://evalscope.readthedocs.io/en/latest/?badge=latest'><img src='https://readthedocs.org/projects/evalscope/badge/?version=latest' alt='Documentation Status' /></a>
+<p>
+
+<p align="center">
+<a href="https://evalscope.readthedocs.io/zh-cn/latest/"> 📖 中文文档</a> &nbsp ｜ &nbsp <a href="https://evalscope.readthedocs.io/en/latest/"> 📖 English Documents</a>
 <p>

 > ⭐ If you like this project, please click the "Star" button at the top right to support us. Your support is our motivation to keep going!

-## 📋
+## 📋 Contents
 - [Introduction](#introduction)
 - [News](#News)
 - [Installation](#installation)
 - [Quick Start](#quick-start)
 - [Evaluation Backend](#evaluation-backend)
 - [Custom Dataset Evaluation](#custom-dataset-evaluation)
-- [Offline Evaluation](#offline-evaluation)
-- [Arena Mode](#arena-mode)
 - [Model Serving Performance Evaluation](#Model-Serving-Performance-Evaluation)
+- [Arena Mode](#arena-mode)


 ## 📝 Introduction

-EvalScope is
+EvalScope is [ModelScope](https://modelscope.cn/)'s official framework for model evaluation and benchmarking, designed for diverse assessment needs. It supports various model types including large language models, multimodal, embedding, reranker, and CLIP models.
+
+The framework accommodates multiple evaluation scenarios such as end-to-end RAG evaluation, arena mode, and inference performance testing. It features built-in benchmarks and metrics like MMLU, CMMLU, C-Eval, and GSM8K. Seamlessly integrated with the [ms-swift](https://github.com/modelscope/ms-swift) training framework, EvalScope enables one-click evaluations, offering comprehensive support for model training and assessment 🚀

 <p align="center">
 <img src="docs/en/_static/images/evalscope_framework.png" width="70%">
@@ -192,6 +196,7 @@ The architecture includes the following modules:


 ## 🎉 News
+- 🔥 **[2024.12.13]** Model evaluation optimization: no need to pass the `--template-type` parameter anymore; supports starting evaluation with `evalscope eval --args`. Refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html) for more details.
 - 🔥 **[2024.11.26]** The model inference service performance evaluator has been completely refactored: it now supports local inference service startup and Speed Benchmark; asynchronous call error handling has been optimized. For more details, refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/index.html).
 - 🔥 **[2024.10.31]** The best practice for evaluating Multimodal-RAG has been updated, please check the [📖 Blog](https://evalscope.readthedocs.io/zh-cn/latest/blog/RAG/multimodal_RAG.html#multimodal-rag) for more details.
 - 🔥 **[2024.10.23]** Supports multimodal RAG evaluation, including the assessment of image-text retrieval using [CLIP_Benchmark](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/clip_benchmark.html), and extends [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html) to support end-to-end multimodal metrics evaluation.
@@ -263,124 +268,129 @@ We recommend using conda to manage your environment and installing dependencies

 ## 🚀 Quick Start

-
-To evaluate a model using default settings on specified datasets, follow the process below:
+To evaluate a model on specified datasets using default configurations, this framework supports two ways to initiate evaluation tasks: using the command line or using Python code.

-
+### Method 1. Using Command Line

-
+Execute the `eval` command in any directory:
 ```bash
-
+evalscope eval \
  --model Qwen/Qwen2.5-0.5B-Instruct \
- --
- --
- --limit 10
+ --datasets gsm8k arc \
+ --limit 5
 ```

-
+### Method 2. Using Python Code

-
-```bash
-python evalscope/run.py \
- --model Qwen/Qwen2.5-0.5B-Instruct \
- --template-type qwen \
- --datasets gsm8k ceval \
- --limit 10
-```
+When using Python code for evaluation, you need to submit the evaluation task using the `run_task` function, passing a `TaskConfig` as a parameter. It can also be a Python dictionary, yaml file path, or json file path, for example:

-
+**Using Python Dictionary**

-
-
-
-
-
-
-
-
+```python
+from evalscope.run import run_task
+
+task_cfg = {
+    'model': 'Qwen/Qwen2.5-0.5B-Instruct',
+    'datasets': ['gsm8k', 'arc'],
+    'limit': 5
+}
+
+run_task(task_cfg=task_cfg)
 ```

+<details><summary>More Startup Methods</summary>

-
-- `--model`: Specifies the `model_id` of the model on [ModelScope](https://modelscope.cn/), allowing automatic download. For example, see the [Qwen2-0.5B-Instruct model link](https://modelscope.cn/models/qwen/Qwen2-0.5B-Instruct/summary); you can also use a local path, such as `/path/to/model`.
-- `--template-type`: Specifies the template type corresponding to the model. Refer to the `Default Template` field in the [template table](https://swift.readthedocs.io/en/latest/Instruction/Supported-models-datasets.html#llm) for filling in this field.
-- `--datasets`: The dataset name, allowing multiple datasets to be specified, separated by spaces; these datasets will be automatically downloaded. Refer to the [supported datasets list](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset.html) for available options.
-- `--limit`: Maximum number of evaluation samples per dataset; if not specified, all will be evaluated, which is useful for quick validation.
+**Using `TaskConfig`**

+```python
+from evalscope.run import run_task
+from evalscope.config import TaskConfig

-
-
+task_cfg = TaskConfig(
+    model='Qwen/Qwen2.5-0.5B-Instruct',
+    datasets=['gsm8k', 'arc'],
+    limit=5
+)

-
-```shell
-python evalscope/run.py \
- --model qwen/Qwen2-0.5B-Instruct \
- --template-type qwen \
- --model-args revision=master,precision=torch.float16,device_map=auto \
- --datasets gsm8k ceval \
- --use-cache true \
- --limit 10
+run_task(task_cfg=task_cfg)
 ```

-**
-
-
-
-
-
-
-
-
+**Using `yaml` file**
+
+`config.yaml`:
+```yaml
+model: Qwen/Qwen2.5-0.5B-Instruct
+datasets:
+  - gsm8k
+  - arc
+limit: 5
 ```

-#### Parameter Descriptions
-In addition to the three [basic parameters](#basic-parameter-descriptions), the other parameters are as follows:
-- `--model-args`: Model loading parameters, separated by commas, in `key=value` format.
-- `--generation-config`: Generation parameters, separated by commas, in `key=value` format.
-  - `do_sample`: Whether to use sampling, default is `false`.
-  - `max_new_tokens`: Maximum generation length, default is 1024.
-  - `temperature`: Sampling temperature.
-  - `top_p`: Sampling threshold.
-  - `top_k`: Sampling threshold.
-- `--use-cache`: Whether to use local cache, default is `false`. If set to `true`, previously evaluated model and dataset combinations will not be evaluated again, and will be read directly from the local cache.
-- `--dataset-args`: Evaluation dataset configuration parameters, provided in JSON format, where the key is the dataset name and the value is the parameter; note that these must correspond one-to-one with the values in `--datasets`.
-  - `--few_shot_num`: Number of few-shot examples.
-  - `--few_shot_random`: Whether to randomly sample few-shot data; if not specified, defaults to `true`.
-
-
-### 3. Use the run_task Function to Submit an Evaluation Task
-Using the `run_task` function to submit an evaluation task requires the same parameters as the command line. You need to pass a dictionary as the parameter, which includes the following fields:
-
-#### 1. Configuration Task Dictionary Parameters
 ```python
-import
-
-
-
-
-
-
-
-
-
-
-
-
-
-    'mem_cache': False,
-    'dataset_hub': 'ModelScope',
-    'dataset_dir': DEFAULT_ROOT_CACHE_DIR,
-    'limit': 10,
-    'debug': False
-}
+from evalscope.run import run_task
+
+run_task(task_cfg="config.yaml")
+```
+
+**Using `json` file**
+
+`config.json`:
+```json
+{
+    "model": "Qwen/Qwen2.5-0.5B-Instruct",
+    "datasets": ["gsm8k", "arc"],
+    "limit": 5
+}
 ```
-Here, `DEFAULT_ROOT_CACHE_DIR` is set to `'~/.cache/evalscope'`.

-#### 2. Execute Task with run_task
 ```python
 from evalscope.run import run_task
-
+
+run_task(task_cfg="config.json")
+```
+</details>
+
+### Basic Parameter
+- `--model`: Specifies the `model_id` of the model in [ModelScope](https://modelscope.cn/), which can be automatically downloaded, e.g., [Qwen/Qwen2.5-0.5B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-0.5B-Instruct/summary); or use the local path of the model, e.g., `/path/to/model`
+- `--datasets`: Dataset names, supports inputting multiple datasets separated by spaces. Datasets will be automatically downloaded from modelscope. For supported datasets, refer to the [Dataset List](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset.html)
+- `--limit`: Maximum amount of evaluation data for each dataset. If not specified, it defaults to evaluating all data. Can be used for quick validation
+
+### Output Results
 ```
++-----------------------+-------------------+-----------------+
+| Model                 | ai2_arc           | gsm8k           |
++=======================+===================+=================+
+| Qwen2.5-0.5B-Instruct | (ai2_arc/acc) 0.6 | (gsm8k/acc) 0.6 |
++-----------------------+-------------------+-----------------+
+```
+
+## ⚙️ Complex Evaluation
+For more customized evaluations, such as customizing model parameters or dataset parameters, you can use the following command. The evaluation startup method is the same as simple evaluation. Below shows how to start the evaluation using the `eval` command:
+
+```shell
+evalscope eval \
+ --model Qwen/Qwen2.5-0.5B-Instruct \
+ --model-args revision=master,precision=torch.float16,device_map=auto \
+ --generation-config do_sample=true,temperature=0.5 \
+ --dataset-args '{"gsm8k": {"few_shot_num": 0, "few_shot_random": false}}' \
+ --datasets gsm8k \
+ --limit 10
+```
+
+### Parameter
+- `--model-args`: Model loading parameters, separated by commas in `key=value` format. Default parameters:
+  - `revision`: Model version, default is `master`
+  - `precision`: Model precision, default is `auto`
+  - `device_map`: Model device allocation, default is `auto`
+- `--generation-config`: Generation parameters, separated by commas in `key=value` format. Default parameters:
+  - `do_sample`: Whether to use sampling, default is `false`
+  - `max_length`: Maximum length, default is 2048
+  - `max_new_tokens`: Maximum length of generation, default is 512
+- `--dataset-args`: Configuration parameters for evaluation datasets, passed in `json` format. The key is the dataset name, and the value is the parameters. Note that it needs to correspond one-to-one with the values in the `--datasets` parameter:
+  - `few_shot_num`: Number of few-shot examples
+  - `few_shot_random`: Whether to randomly sample few-shot data, if not set, defaults to `true`
+
+Reference: [Full Parameter Description](https://evalscope.readthedocs.io/en/latest/get_started/parameters.html)


 ## Evaluation Backend
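For readers following the new README, a hypothetical sketch of the "Complex Evaluation" command expressed through `TaskConfig` is shown below. The keyword names (`model_args`, `generation_config`, `dataset_args`) are assumed to mirror the CLI flags above; they are not confirmed by this diff, so treat this as an illustration rather than the documented API:

```python
# Hypothetical mapping of the CLI flags above onto TaskConfig fields; the
# keyword names are an assumption based on the flag names, not this diff.
from evalscope.run import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',
    datasets=['gsm8k'],
    model_args={'revision': 'master', 'precision': 'torch.float16', 'device_map': 'auto'},
    generation_config={'do_sample': True, 'temperature': 0.5},
    dataset_args={'gsm8k': {'few_shot_num': 0, 'few_shot_random': False}},
    limit=10,
)
run_task(task_cfg=task_cfg)
```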
@@ -418,12 +428,7 @@ Speed Benchmark Results:
 ```

 ## Custom Dataset Evaluation
-EvalScope supports custom dataset evaluation. For detailed information, please refer to the Custom Dataset Evaluation [📖User Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset.html)
-
-## Offline Evaluation
-You can use local dataset to evaluate the model without internet connection.
-
-Refer to: Offline Evaluation [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/offline_evaluation.html)
+EvalScope supports custom dataset evaluation. For detailed information, please refer to the Custom Dataset Evaluation [📖User Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset/index.html)


 ## Arena Mode