evalscope 0.16.0__tar.gz → 0.16.2__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {evalscope-0.16.0/evalscope.egg-info → evalscope-0.16.2}/PKG-INFO +16 -13
- {evalscope-0.16.0 → evalscope-0.16.2}/README.md +3 -0
- evalscope-0.16.2/evalscope/app/__init__.py +28 -0
- {evalscope-0.16.0/evalscope/report → evalscope-0.16.2/evalscope/app}/app.py +40 -30
- evalscope-0.16.2/evalscope/app/constants.py +21 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/arguments.py +2 -1
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/opencompass/backend_manager.py +2 -1
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/clip_benchmark/dataset_builder.py +23 -11
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/cmteb/arguments.py +4 -1
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/cmteb/task_template.py +19 -3
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/utils/embedding.py +77 -39
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/aigc/t2i/evalmuse_adapter.py +1 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/aime/aime24_adapter.py +3 -1
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/aime/aime25_adapter.py +3 -1
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/alpaca_eval/alpaca_eval_adapter.py +5 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/arc/arc_adapter.py +3 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/arena_hard/arena_hard_adapter.py +7 -3
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/bbh_adapter.py +3 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/benchmark.py +2 -0
- evalscope-0.16.2/evalscope/benchmarks/bfcl/bfcl_adapter.py +237 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/ceval/ceval_adapter.py +3 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/chinese_simple_qa/csimple_qa_adapter.py +4 -1
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/cmmlu/cmmlu_adapter.py +3 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/competition_math/competition_math_adapter.py +3 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/data_adapter.py +99 -16
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/data_collection/data_collection_adapter.py +1 -0
- evalscope-0.16.2/evalscope/benchmarks/docmath/docmath_adapter.py +85 -0
- evalscope-0.16.2/evalscope/benchmarks/docmath/utils.py +220 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/drop/drop_adapter.py +3 -0
- evalscope-0.16.2/evalscope/benchmarks/frames/frames_adapter.py +91 -0
- evalscope-0.16.2/evalscope/benchmarks/frames/utils.py +37 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/general_mcq/general_mcq_adapter.py +19 -23
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/general_qa/general_qa_adapter.py +3 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/gpqa/gpqa_adapter.py +3 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/gsm8k/gsm8k_adapter.py +3 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/hellaswag/hellaswag_adapter.py +3 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/humaneval/humaneval_adapter.py +3 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/ifeval/ifeval_adapter.py +3 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/iquiz/iquiz_adapter.py +3 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/live_code_bench/live_code_bench_adapter.py +4 -1
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/maritime_bench/maritime_bench_adapter.py +3 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/math_500/math_500_adapter.py +3 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/mmlu/mmlu_adapter.py +3 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/mmlu_pro/mmlu_pro_adapter.py +3 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/mmlu_redux/mmlu_redux_adapter.py +3 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/musr/musr_adapter.py +3 -0
- evalscope-0.16.2/evalscope/benchmarks/needle_haystack/needle_haystack_adapter.py +348 -0
- evalscope-0.16.2/evalscope/benchmarks/needle_haystack/utils.py +79 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/process_bench/process_bench_adapter.py +3 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/race/race_adapter.py +3 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/simple_qa/simple_qa_adapter.py +3 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/super_gpqa/five_shot_prompt.txt +1 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/super_gpqa/super_gpqa_adapter.py +21 -3
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/super_gpqa/zero_shot_prompt.txt +1 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/tool_bench/tool_bench_adapter.py +9 -1
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/tool_bench/utils.py +5 -4
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/trivia_qa/trivia_qa_adapter.py +3 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/truthful_qa/truthful_qa_adapter.py +3 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/utils.py +25 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/winogrande/winogrande_adapter.py +3 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/cli/start_app.py +2 -2
- evalscope-0.16.2/evalscope/collections/__init__.py +35 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/collections/evaluator.py +68 -34
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/config.py +8 -2
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/constants.py +1 -1
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/evaluator/evaluator.py +40 -28
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/__init__.py +3 -1
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/bundled_rouge_score/rouge_scorer.py +1 -1
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/llm_judge.py +12 -5
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/math_parser.py +1 -1
- evalscope-0.16.2/evalscope/metrics/t2v_metrics/__init__.py +52 -0
- {evalscope-0.16.0/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models → evalscope-0.16.2/evalscope/metrics/t2v_metrics/models}/__init__.py +0 -0
- {evalscope-0.16.0/tests/rag → evalscope-0.16.2/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models}/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/models/adapters/__init__.py +2 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/models/adapters/base_adapter.py +31 -27
- evalscope-0.16.2/evalscope/models/adapters/bfcl_adapter.py +244 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/models/adapters/server_adapter.py +80 -23
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/models/custom/custom_model.py +0 -3
- evalscope-0.16.2/evalscope/models/custom/dummy_model.py +99 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/models/local_model.py +1 -1
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/models/register.py +2 -1
- evalscope-0.16.2/evalscope/perf/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/arguments.py +4 -2
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/benchmark.py +16 -12
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/main.py +7 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/plugin/api/openai_api.py +2 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/plugin/datasets/custom.py +15 -0
- evalscope-0.16.2/evalscope/perf/utils/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/utils/benchmark_util.py +1 -1
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/utils/local_server.py +1 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/utils/log_utils.py +12 -5
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/utils/rich_display.py +1 -1
- evalscope-0.16.2/evalscope/report/__init__.py +38 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/report/combinator.py +40 -6
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/report/generator.py +33 -9
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/report/utils.py +84 -4
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/run.py +12 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/summarizer.py +1 -1
- evalscope-0.16.2/evalscope/third_party/thinkbench/tools/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/utils/io_utils.py +59 -2
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/utils/logger.py +1 -1
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/utils/utils.py +12 -0
- evalscope-0.16.2/evalscope/version.py +4 -0
- {evalscope-0.16.0 → evalscope-0.16.2/evalscope.egg-info}/PKG-INFO +16 -13
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope.egg-info/SOURCES.txt +16 -2
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope.egg-info/requires.txt +12 -12
- {evalscope-0.16.0 → evalscope-0.16.2}/requirements/aigc.txt +1 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/requirements/framework.txt +2 -3
- evalscope-0.16.2/requirements/opencompass.txt +1 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/requirements/rag.txt +1 -1
- evalscope-0.16.2/requirements/vlmeval.txt +1 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/tests/aigc/test_t2i.py +48 -11
- {evalscope-0.16.0 → evalscope-0.16.2}/tests/cli/test_all.py +14 -3
- {evalscope-0.16.0 → evalscope-0.16.2}/tests/cli/test_collection.py +6 -4
- {evalscope-0.16.0 → evalscope-0.16.2}/tests/cli/test_run.py +50 -25
- evalscope-0.16.2/tests/rag/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/tests/rag/test_clip_benchmark.py +5 -1
- {evalscope-0.16.0 → evalscope-0.16.2}/tests/rag/test_mteb.py +51 -7
- evalscope-0.16.0/evalscope/collections/__init__.py +0 -3
- evalscope-0.16.0/evalscope/metrics/t2v_metrics/__init__.py +0 -66
- evalscope-0.16.0/evalscope/models/custom/dummy_model.py +0 -61
- evalscope-0.16.0/evalscope/report/__init__.py +0 -6
- evalscope-0.16.0/evalscope/version.py +0 -4
- evalscope-0.16.0/requirements/opencompass.txt +0 -1
- evalscope-0.16.0/requirements/vlmeval.txt +0 -1
- {evalscope-0.16.0 → evalscope-0.16.2}/LICENSE +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/MANIFEST.in +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/__init__.py +0 -0
- evalscope-0.16.0/evalscope/report/app_arguments.py → evalscope-0.16.2/evalscope/app/arguments.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/base.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/opencompass/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/opencompass/api_meta_template.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/opencompass/tasks/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/opencompass/tasks/eval_api.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/opencompass/tasks/eval_datasets.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/backend_manager.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/clip_benchmark/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/clip_benchmark/arguments.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/clip_benchmark/task_template.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/clip_benchmark/tasks/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/clip_benchmark/tasks/image_caption.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_classification.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_retrieval.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/clip_benchmark/utils/webdataset_convert.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/clip_benchmark/utils/webdatasets.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/cmteb/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/cmteb/base.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/cmteb/tasks/Classification.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/cmteb/tasks/Clustering.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/cmteb/tasks/CustomTask.py +1 -1
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/cmteb/tasks/PairClassification.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/cmteb/tasks/Reranking.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/cmteb/tasks/Retrieval.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/cmteb/tasks/STS.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/cmteb/tasks/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/ragas/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/ragas/arguments.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/ragas/prompts/persona_prompt.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/ragas/task_template.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/ragas/tasks/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/ragas/tasks/build_distribution.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/ragas/tasks/build_transform.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/ragas/tasks/testset_generation.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/ragas/tasks/translate_prompt.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/utils/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/utils/clip.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/utils/llm.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/utils/tools.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/vlm_eval_kit/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/vlm_eval_kit/backend_manager.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/aigc/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/aigc/t2i/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/aigc/t2i/base.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/aigc/t2i/genai_bench_adapter.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/aigc/t2i/general_t2i_adapter.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/aigc/t2i/hpdv2_adapter.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/aigc/t2i/tifa_adapter.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/aime/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/alpaca_eval/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/arc/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/arc/ai2_arc.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/arena_hard/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/arena_hard/utils.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/boolean_expressions.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/causal_judgement.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/date_understanding.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/disambiguation_qa.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/dyck_languages.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/formal_fallacies.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/geometric_shapes.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/hyperbaton.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_five_objects.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_seven_objects.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_three_objects.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/movie_recommendation.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/multistep_arithmetic_two.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/navigate.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/object_counting.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/penguins_in_a_table.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/reasoning_about_colored_objects.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/ruin_names.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/salient_translation_error_detection.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/snarks.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/sports_understanding.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/temporal_sequences.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_five_objects.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_seven_objects.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_three_objects.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/web_of_lies.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/bbh/cot_prompts/word_sorting.txt +0 -0
- {evalscope-0.16.0/evalscope/benchmarks/chinese_simple_qa → evalscope-0.16.2/evalscope/benchmarks/bfcl}/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/ceval/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/ceval/ceval_exam.py +0 -0
- {evalscope-0.16.0/evalscope/benchmarks/data_collection → evalscope-0.16.2/evalscope/benchmarks/chinese_simple_qa}/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/cmmlu/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/cmmlu/cmmlu.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/cmmlu/samples.jsonl +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/competition_math/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/competition_math/competition_math.py +0 -0
- {evalscope-0.16.0/evalscope/benchmarks/drop → evalscope-0.16.2/evalscope/benchmarks/data_collection}/__init__.py +0 -0
- {evalscope-0.16.0/evalscope/benchmarks/general_mcq → evalscope-0.16.2/evalscope/benchmarks/docmath}/__init__.py +0 -0
- {evalscope-0.16.0/evalscope/benchmarks/gpqa → evalscope-0.16.2/evalscope/benchmarks/drop}/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/drop/utils.py +0 -0
- {evalscope-0.16.0/evalscope/benchmarks/ifeval → evalscope-0.16.2/evalscope/benchmarks/frames}/__init__.py +0 -0
- {evalscope-0.16.0/evalscope/benchmarks/iquiz → evalscope-0.16.2/evalscope/benchmarks/general_mcq}/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/general_qa/__init__.py +0 -0
- {evalscope-0.16.0/evalscope/benchmarks/live_code_bench → evalscope-0.16.2/evalscope/benchmarks/gpqa}/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/gpqa/chain_of_thought.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/gsm8k/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/gsm8k/gsm8k.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/hellaswag/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/hellaswag/hellaswag.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/humaneval/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/humaneval/humaneval.py +0 -0
- {evalscope-0.16.0/evalscope/benchmarks/maritime_bench → evalscope-0.16.2/evalscope/benchmarks/ifeval}/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/ifeval/instructions.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/ifeval/instructions_registry.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/ifeval/instructions_util.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/ifeval/utils.py +0 -0
- {evalscope-0.16.0/evalscope/benchmarks/math_500 → evalscope-0.16.2/evalscope/benchmarks/iquiz}/__init__.py +0 -0
- {evalscope-0.16.0/evalscope/benchmarks/mmlu_pro → evalscope-0.16.2/evalscope/benchmarks/live_code_bench}/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/live_code_bench/evaluate_utils.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/live_code_bench/extract_utils.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/live_code_bench/load_utils.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/live_code_bench/pass_k_utils.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/live_code_bench/prompts.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/live_code_bench/testing_util.py +0 -0
- {evalscope-0.16.0/evalscope/benchmarks/mmlu_redux → evalscope-0.16.2/evalscope/benchmarks/maritime_bench}/__init__.py +0 -0
- {evalscope-0.16.0/evalscope/benchmarks/musr → evalscope-0.16.2/evalscope/benchmarks/math_500}/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/mmlu/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/mmlu/mmlu.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/mmlu/samples.jsonl +0 -0
- {evalscope-0.16.0/evalscope/benchmarks/process_bench → evalscope-0.16.2/evalscope/benchmarks/mmlu_pro}/__init__.py +0 -0
- {evalscope-0.16.0/evalscope/benchmarks/simple_qa → evalscope-0.16.2/evalscope/benchmarks/mmlu_redux}/__init__.py +0 -0
- {evalscope-0.16.0/evalscope/benchmarks/super_gpqa → evalscope-0.16.2/evalscope/benchmarks/musr}/__init__.py +0 -0
- {evalscope-0.16.0/evalscope/benchmarks/tool_bench → evalscope-0.16.2/evalscope/benchmarks/needle_haystack}/__init__.py +0 -0
- {evalscope-0.16.0/evalscope/benchmarks/winogrande → evalscope-0.16.2/evalscope/benchmarks/process_bench}/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/process_bench/critique_template.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/race/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/race/race.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/race/samples.jsonl +0 -0
- {evalscope-0.16.0/evalscope/metrics/t2v_metrics/models → evalscope-0.16.2/evalscope/benchmarks/simple_qa}/__init__.py +0 -0
- {evalscope-0.16.0/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model → evalscope-0.16.2/evalscope/benchmarks/super_gpqa}/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/super_gpqa/utils.py +0 -0
- {evalscope-0.16.0/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward → evalscope-0.16.2/evalscope/benchmarks/tool_bench}/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/trivia_qa/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/trivia_qa/samples.jsonl +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/trivia_qa/trivia_qa.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/truthful_qa/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/benchmarks/truthful_qa/truthful_qa.py +0 -0
- {evalscope-0.16.0/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5 → evalscope-0.16.2/evalscope/benchmarks/winogrande}/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/cli/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/cli/base.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/cli/cli.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/cli/start_eval.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/cli/start_perf.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/cli/start_server.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/collections/sampler.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/collections/schema.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/evaluator/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/evaluator/rating_eval.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/evaluator/reviewer/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/evaluator/reviewer/auto_reviewer.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/bundled_rouge_score/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/metrics.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/named_metrics.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/rouge_metric.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/clipscore.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/constants.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/itmscore.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/clipscore_models/__init__.py +0 -0
- {evalscope-0.16.0/evalscope/perf → evalscope-0.16.2/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model}/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model/base_model.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model/clip_model.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model/cross_modeling.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/clipscore_models/clip_model.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/clipscore_models/hpsv2_model.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/clipscore_models/mps_model.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/clipscore_models/pickscore_model.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/itmscore_models/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/itmscore_models/blip2_itm_model.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/itmscore_models/fga_blip2_model.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward/ImageReward.py +0 -0
- {evalscope-0.16.0/evalscope/perf/utils → evalscope-0.16.2/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward}/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward/blip_pretrain.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward_model.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/model.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/utils.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/__init__.py +0 -0
- {evalscope-0.16.0/evalscope/third_party/thinkbench/tools → evalscope-0.16.2/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5}/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/language_model/clip_t5.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/multimodal_encoder/builder.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/multimodal_encoder/clip_encoder.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/multimodal_projector/builder.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5_model.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/gpt4v_model.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/config.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/dist_utils.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/gradcam.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/logger.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/optims.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/registry.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/utils.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/vqa_tools/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/vqa_tools/vqa.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/vqa_tools/vqa_eval.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/default.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_caption_flant5xl.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_caption_opt2.7b.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_caption_opt6.7b.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_coco.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_flant5xl.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_flant5xxl.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_vicuna13b.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_vicuna7b.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl_iter_80k_total_100k_no_prefix.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl_iter_80k_total_100k_prefix.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl_vitL.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xxl.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_opt2.7b.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_opt6.7b.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_vitL.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_vicuna13b.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_vicuna7b.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/med_config.json +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/med_config_albef.json +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/med_large_config.json +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/base_model.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/Qformer.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_image_text_matching.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_qformer.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_t5.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_t5_instruct.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/fga_blip2.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/modeling_llama.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/modeling_t5.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_caption.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_classification.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_feature_extractor.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_image_text_matching.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_nlvr.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_outputs.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_pretrain.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_vqa.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/nlvr_encoder.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/clip_vit.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/eva_vit.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/med.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/vit.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/base_processor.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/blip_processors.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/randaugment.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/mm_utils.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/models/vqascore_models/vqa_model.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/score.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/metrics/t2v_metrics/vqascore.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/models/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/models/adapters/chat_adapter.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/models/adapters/choice_adapter.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/models/adapters/custom_adapter.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/models/adapters/t2i_adapter.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/models/custom/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/models/model.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/http_client.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/plugin/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/plugin/api/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/plugin/api/base.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/plugin/api/custom_api.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/plugin/api/dashscope_api.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/plugin/datasets/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/plugin/datasets/base.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/plugin/datasets/flickr8k.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/plugin/datasets/line_by_line.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/plugin/datasets/longalpaca.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/plugin/datasets/openqa.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/plugin/datasets/random_dataset.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/plugin/datasets/speed_benchmark.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/plugin/registry.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/utils/analysis_result.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/utils/db_util.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/perf/utils/handler.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/registry/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/registry/config/cfg_arena.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/registry/config/cfg_arena_zhihu.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/registry/config/cfg_pairwise_baseline.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/registry/config/cfg_single.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/registry/data/prompt_template/lmsys_v2.jsonl +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/registry/data/prompt_template/prompt_templates.jsonl +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/registry/data/qa_browser/battle.jsonl +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/registry/data/qa_browser/category_mapping.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/registry/data/question.jsonl +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/registry/tasks/arc.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/registry/tasks/bbh.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/registry/tasks/bbh_mini.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/registry/tasks/ceval.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/registry/tasks/ceval_mini.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/registry/tasks/cmmlu.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/registry/tasks/eval_qwen-7b-chat_v100.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/registry/tasks/general_qa.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/registry/tasks/gsm8k.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/registry/tasks/mmlu.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/registry/tasks/mmlu_mini.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/run_arena.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/longbench_write/README.md +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/longbench_write/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/longbench_write/default_task.json +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/longbench_write/default_task.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/longbench_write/eval.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/longbench_write/infer.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/longbench_write/longbench_write.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/longbench_write/resources/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/longbench_write/resources/judge.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/longbench_write/resources/longbench_write.jsonl +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/longbench_write/resources/longbench_write_en.jsonl +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/longbench_write/resources/longwrite_ruler.jsonl +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/longbench_write/tools/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/longbench_write/tools/data_etl.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/longbench_write/tools/openai_api.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/longbench_write/utils.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/thinkbench/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/thinkbench/eval.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/thinkbench/infer.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/thinkbench/resources/critique_template.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/thinkbench/resources/reformat_template.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/thinkbench/tools/llm.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/thinkbench/tools/utils.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/toolbench_static/README.md +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/toolbench_static/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/toolbench_static/config_default.json +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/toolbench_static/config_default.yaml +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/toolbench_static/eval.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/toolbench_static/infer.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/toolbench_static/llm/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/toolbench_static/llm/swift_infer.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/toolbench_static/requirements.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/third_party/toolbench_static/toolbench_static.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/utils/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/utils/arena_utils.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/utils/chat_service.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/utils/completion_parsers.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/utils/deprecation_utils.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/utils/filters.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/utils/import_utils.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope/utils/model_utils.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope.egg-info/dependency_links.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope.egg-info/entry_points.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope.egg-info/not-zip-safe +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/evalscope.egg-info/top_level.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/requirements/app.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/requirements/docs.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/requirements/perf.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/requirements.txt +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/setup.cfg +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/setup.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/tests/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/tests/aigc/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/tests/cli/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/tests/perf/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/tests/perf/test_perf.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/tests/rag/test_ragas.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/tests/swift/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/tests/swift/test_run_swift_eval.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/tests/swift/test_run_swift_vlm_eval.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/tests/swift/test_run_swift_vlm_jugde_eval.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/tests/test_run_all.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/tests/vlm/__init__.py +0 -0
- {evalscope-0.16.0 → evalscope-0.16.2}/tests/vlm/test_vlmeval.py +0 -0
**{evalscope-0.16.0/evalscope.egg-info → evalscope-0.16.2}/PKG-INFO**

```diff
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: evalscope
-Version: 0.16.0
+Version: 0.16.2
 Summary: EvalScope: Lightweight LLMs Evaluation Framework
 Home-page: https://github.com/modelscope/evalscope
 Author: ModelScope team
@@ -17,12 +17,12 @@ Requires-Python: >=3.8
 Description-Content-Type: text/markdown
 License-File: LICENSE
 Requires-Dist: accelerate
-Requires-Dist: datasets
+Requires-Dist: datasets>=3.0
 Requires-Dist: immutabledict
 Requires-Dist: jieba
 Requires-Dist: jsonlines
 Requires-Dist: langdetect
-Requires-Dist:
+Requires-Dist: latex2sympy2_extended
 Requires-Dist: matplotlib
 Requires-Dist: modelscope[framework]
 Requires-Dist: nltk>=3.9
@@ -40,20 +40,19 @@ Requires-Dist: seaborn
 Requires-Dist: sympy
 Requires-Dist: tabulate
 Requires-Dist: torch
-Requires-Dist: torchvision
 Requires-Dist: tqdm
 Requires-Dist: transformers>=4.33
 Requires-Dist: word2number
 Provides-Extra: opencompass
-Requires-Dist: ms-opencompass>=0.1.
+Requires-Dist: ms-opencompass>=0.1.6; extra == "opencompass"
 Provides-Extra: vlmeval
-Requires-Dist: ms-vlmeval>=0.0.
+Requires-Dist: ms-vlmeval>=0.0.17; extra == "vlmeval"
 Provides-Extra: rag
 Requires-Dist: langchain<0.4.0,>=0.3.0; extra == "rag"
 Requires-Dist: langchain-community<0.4.0,>=0.3.0; extra == "rag"
 Requires-Dist: langchain-core<0.4.0,>=0.3.0; extra == "rag"
 Requires-Dist: langchain-openai<0.4.0,>=0.3.0; extra == "rag"
-Requires-Dist: mteb==1.
+Requires-Dist: mteb==1.38.20; extra == "rag"
 Requires-Dist: ragas==0.2.14; extra == "rag"
 Requires-Dist: webdataset>0.2.0; extra == "rag"
 Provides-Extra: perf
@@ -73,14 +72,15 @@ Requires-Dist: iopath; extra == "aigc"
 Requires-Dist: omegaconf; extra == "aigc"
 Requires-Dist: open_clip_torch; extra == "aigc"
 Requires-Dist: opencv-python; extra == "aigc"
+Requires-Dist: torchvision; extra == "aigc"
 Provides-Extra: all
 Requires-Dist: accelerate; extra == "all"
-Requires-Dist: datasets
+Requires-Dist: datasets>=3.0; extra == "all"
 Requires-Dist: immutabledict; extra == "all"
 Requires-Dist: jieba; extra == "all"
 Requires-Dist: jsonlines; extra == "all"
 Requires-Dist: langdetect; extra == "all"
-Requires-Dist:
+Requires-Dist: latex2sympy2_extended; extra == "all"
 Requires-Dist: matplotlib; extra == "all"
 Requires-Dist: modelscope[framework]; extra == "all"
 Requires-Dist: nltk>=3.9; extra == "all"
@@ -98,17 +98,16 @@ Requires-Dist: seaborn; extra == "all"
 Requires-Dist: sympy; extra == "all"
 Requires-Dist: tabulate; extra == "all"
 Requires-Dist: torch; extra == "all"
-Requires-Dist: torchvision; extra == "all"
 Requires-Dist: tqdm; extra == "all"
 Requires-Dist: transformers>=4.33; extra == "all"
 Requires-Dist: word2number; extra == "all"
-Requires-Dist: ms-opencompass>=0.1.
-Requires-Dist: ms-vlmeval>=0.0.
+Requires-Dist: ms-opencompass>=0.1.6; extra == "all"
+Requires-Dist: ms-vlmeval>=0.0.17; extra == "all"
 Requires-Dist: langchain<0.4.0,>=0.3.0; extra == "all"
 Requires-Dist: langchain-community<0.4.0,>=0.3.0; extra == "all"
 Requires-Dist: langchain-core<0.4.0,>=0.3.0; extra == "all"
 Requires-Dist: langchain-openai<0.4.0,>=0.3.0; extra == "all"
-Requires-Dist: mteb==1.
+Requires-Dist: mteb==1.38.20; extra == "all"
 Requires-Dist: ragas==0.2.14; extra == "all"
 Requires-Dist: webdataset>0.2.0; extra == "all"
 Requires-Dist: aiohttp; extra == "all"
@@ -125,6 +124,7 @@ Requires-Dist: iopath; extra == "all"
 Requires-Dist: omegaconf; extra == "all"
 Requires-Dist: open_clip_torch; extra == "all"
 Requires-Dist: opencv-python; extra == "all"
+Requires-Dist: torchvision; extra == "all"
 
 <p align="center">
     <br>
@@ -230,6 +230,9 @@ Please scan the QR code below to join our community groups:
 
 ## 🎉 News
 
+- 🔥 **[2025.06.19]** Added support for the BFCL-v3 benchmark, designed to evaluate model function-calling capabilities across various scenarios. For more information, refer to the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/third_party/bfcl_v3.html).
+- 🔥 **[2025.06.02]** Added support for the Needle-in-a-Haystack test. Simply specify `needle_haystack` to conduct the test, and a corresponding heatmap will be generated in the `outputs/reports` folder, providing a visual representation of the model's performance. Refer to the [documentation](https://evalscope.readthedocs.io/en/latest/third_party/needle_haystack.html) for more details.
+- 🔥 **[2025.05.29]** Added support for two long document evaluation benchmarks: [DocMath](https://modelscope.cn/datasets/yale-nlp/DocMath-Eval/summary) and [FRAMES](https://modelscope.cn/datasets/iic/frames/summary). For usage guidelines, please refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset.html).
 - 🔥 **[2025.05.16]** Model service performance stress testing now supports setting various levels of concurrency and outputs a performance test report. [Reference example](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#id3).
 - 🔥 **[2025.05.13]** Added support for the [ToolBench-Static](https://modelscope.cn/datasets/AI-ModelScope/ToolBench-Static) dataset to evaluate model's tool-calling capabilities. Refer to the [documentation](https://evalscope.readthedocs.io/en/latest/third_party/toolbench.html) for usage instructions. Also added support for the [DROP](https://modelscope.cn/datasets/AI-ModelScope/DROP/dataPeview) and [Winogrande](https://modelscope.cn/datasets/AI-ModelScope/winogrande_val) benchmarks to assess the reasoning capabilities of models.
 - 🔥 **[2025.04.29]** Added Qwen3 Evaluation Best Practices, [welcome to read 📖](https://evalscope.readthedocs.io/en/latest/best_practice/qwen3.html)
```
@@ -102,6 +102,9 @@ Please scan the QR code below to join our community groups:
 
 ## 🎉 News
 
+- 🔥 **[2025.06.19]** Added support for the BFCL-v3 benchmark, designed to evaluate model function-calling capabilities across various scenarios. For more information, refer to the [documentation](https://evalscope.readthedocs.io/zh-cn/latest/third_party/bfcl_v3.html).
+- 🔥 **[2025.06.02]** Added support for the Needle-in-a-Haystack test. Simply specify `needle_haystack` to conduct the test, and a corresponding heatmap will be generated in the `outputs/reports` folder, providing a visual representation of the model's performance. Refer to the [documentation](https://evalscope.readthedocs.io/en/latest/third_party/needle_haystack.html) for more details.
+- 🔥 **[2025.05.29]** Added support for two long document evaluation benchmarks: [DocMath](https://modelscope.cn/datasets/yale-nlp/DocMath-Eval/summary) and [FRAMES](https://modelscope.cn/datasets/iic/frames/summary). For usage guidelines, please refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset.html).
 - 🔥 **[2025.05.16]** Model service performance stress testing now supports setting various levels of concurrency and outputs a performance test report. [Reference example](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#id3).
 - 🔥 **[2025.05.13]** Added support for the [ToolBench-Static](https://modelscope.cn/datasets/AI-ModelScope/ToolBench-Static) dataset to evaluate model's tool-calling capabilities. Refer to the [documentation](https://evalscope.readthedocs.io/en/latest/third_party/toolbench.html) for usage instructions. Also added support for the [DROP](https://modelscope.cn/datasets/AI-ModelScope/DROP/dataPeview) and [Winogrande](https://modelscope.cn/datasets/AI-ModelScope/winogrande_val) benchmarks to assess the reasoning capabilities of models.
 - 🔥 **[2025.04.29]** Added Qwen3 Evaluation Best Practices, [welcome to read 📖](https://evalscope.readthedocs.io/en/latest/best_practice/qwen3.html)
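A hedged sketch of running one of the newly supported benchmarks through the `TaskConfig`/`run_task` Python API referenced in the EvalScope documentation; `needle_haystack` is named in the note above, while the model id and the `frames` dataset id are illustrative assumptions:

```python
# Sketch only: dataset registry ids other than `needle_haystack` and the model
# name are assumptions for illustration.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-7B-Instruct',        # hypothetical model to evaluate
    datasets=['needle_haystack', 'frames'],  # assumed benchmark ids for this release
    limit=10,                                # evaluate only a few samples while smoke-testing
)
run_task(task_cfg=task_cfg)
```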
@@ -0,0 +1,28 @@
+# Copyright (c) Alibaba, Inc. and its affiliates.
+from typing import TYPE_CHECKING
+
+from evalscope.utils.import_utils import _LazyModule
+
+if TYPE_CHECKING:
+    from .app import create_app
+    from .arguments import add_argument
+
+else:
+    _import_structure = {
+        'app': [
+            'create_app',
+        ],
+        'arguments': [
+            'add_argument',
+        ],
+    }
+
+    import sys
+
+    sys.modules[__name__] = _LazyModule(
+        __name__,
+        globals()['__file__'],
+        _import_structure,
+        module_spec=__spec__,
+        extra_objects={},
+    )
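The new `evalscope.app` package initializer wires the visualization app through `_LazyModule`, so the heavy Gradio imports are deferred until a name is first accessed. A hedged usage sketch (the `create_app` signature is not shown in this hunk, so the final call is an assumption):

```python
import argparse

# Both names resolve lazily through _LazyModule on first access.
from evalscope.app import add_argument, create_app

parser = argparse.ArgumentParser('evalscope dashboard')
add_argument(parser)          # registers the app's CLI options (see arguments.py)
args = parser.parse_args([])  # fall back to defaults for this sketch
create_app(args)              # assumed to build and launch the Gradio app with these args
```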
@@ -1,6 +1,7 @@
 import argparse
 import glob
 import gradio as gr
+import json
 import numpy as np
 import os
 import pandas as pd
@@ -11,35 +12,15 @@ from dataclasses import dataclass
 from typing import Any, List, Union
 
 from evalscope.constants import DataCollection
-from evalscope.report import Report, ReportKey,
+from evalscope.report import Report, ReportKey, get_data_frame, get_report_list
 from evalscope.utils.io_utils import OutputsStructure, yaml_to_dict
 from evalscope.utils.logger import configure_logging, get_logger
 from evalscope.version import __version__
+from .arguments import add_argument
+from .constants import DATASET_TOKEN, LATEX_DELIMITERS, MODEL_TOKEN, PLOTLY_THEME, REPORT_TOKEN
 
 logger = get_logger()
 
-PLOTLY_THEME = 'plotly_dark'
-REPORT_TOKEN = '@@'
-MODEL_TOKEN = '::'
-DATASET_TOKEN = ', '
-LATEX_DELIMITERS = [{
-    'left': '$$',
-    'right': '$$',
-    'display': True
-}, {
-    'left': '$',
-    'right': '$',
-    'display': False
-}, {
-    'left': '\\(',
-    'right': '\\)',
-    'display': False
-}, {
-    'left': '\\[',
-    'right': '\\]',
-    'display': True
-}]
-
 
 def scan_for_report_folders(root_path):
     """Scan for folders containing reports subdirectories"""
@@ -155,11 +136,11 @@ def plot_single_report_scores(df: pd.DataFrame):
 
 def plot_single_report_sunburst(report_list: List[Report]):
     if report_list[0].name == DataCollection.NAME:
-        df = get_data_frame(report_list)
+        df = get_data_frame(report_list=report_list)
         categories = sorted([i for i in df.columns if i.startswith(ReportKey.category_prefix)])
         path = categories + [ReportKey.subset_name]
     else:
-        df = get_data_frame(report_list, flatten_metrics=False)
+        df = get_data_frame(report_list=report_list, flatten_metrics=False)
         categories = sorted([i for i in df.columns if i.startswith(ReportKey.category_prefix)])
         path = [ReportKey.dataset_name] + categories + [ReportKey.subset_name]
     logger.debug(f'df: {df}')
@@ -185,6 +166,13 @@ def get_single_dataset_df(df: pd.DataFrame, dataset_name: str):
     return df, styler
 
 
+def get_report_analysis(report_list: List[Report], dataset_name: str) -> str:
+    for report in report_list:
+        if report.dataset_name == dataset_name:
+            return report.analysis
+    return 'N/A'
+
+
 def plot_single_dataset_scores(df: pd.DataFrame):
     # TODO: add metric radio and relace category name
     plot = px.bar(
@@ -246,7 +234,7 @@ def convert_html_tags(text):
 def process_string(string: str, max_length: int = 2048) -> str:
     string = convert_html_tags(string) # for display labels e.g.
     if max_length and len(string) > max_length:
-        return f'{string[:max_length // 2]}
+        return f'{string[:max_length // 2]}...[truncate]...{string[-max_length // 2:]}'
     return string
 
 
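For illustration, a stand-alone sketch of the truncation rule introduced above (re-implemented here because `convert_html_tags` is defined elsewhere in the module): keep the first and last `max_length // 2` characters and join them with an explicit marker.

```python
# Minimal re-implementation of the truncation rule from process_string.
def truncate_middle(string: str, max_length: int = 2048) -> str:
    if max_length and len(string) > max_length:
        return f'{string[:max_length // 2]}...[truncate]...{string[-max_length // 2:]}'
    return string

print(truncate_middle('x' * 10, max_length=8))  # xxxx...[truncate]...xxxx
```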
@@ -270,7 +258,7 @@ def dict_to_markdown(data) -> str:
     return '\n\n'.join(markdown_lines)
 
 
-def process_model_prediction(item: Any, max_length: int = 2048) -> str:
+def process_model_prediction_old(item: Any, max_length: int = 2048) -> str:
     """
     Process model prediction output into a formatted string.
 
@@ -294,6 +282,20 @@ def process_model_prediction(item: Any, max_length: int = 2048) -> str:
     return result
 
 
+def process_model_prediction(item: Any, max_length: int = 4096) -> str:
+    if isinstance(item, (dict, list)):
+        result = json.dumps(item, ensure_ascii=False, indent=2)
+        result = f'```json\n{result}\n```'
+    else:
+        result = str(item)
+
+    # Apply HTML tag conversion and truncation only at the final output
+    if max_length is not None:
+        return process_string(result, max_length)
+
+    return result
+
+
 def normalize_score(score):
     try:
         if isinstance(score, bool):
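A quick illustration of the new behaviour: dict and list predictions are rendered as a fenced JSON block, everything else is stringified. This sketch reproduces only the JSON branch, since `process_string` lives elsewhere in the module:

```python
import json
from typing import Any

FENCE = '`' * 3  # build the fence programmatically to avoid nesting literal fences here

# Minimal stand-in for the dict/list branch of process_model_prediction.
def render_prediction(item: Any) -> str:
    if isinstance(item, (dict, list)):
        body = json.dumps(item, ensure_ascii=False, indent=2)
        return f'{FENCE}json\n{body}\n{FENCE}'
    return str(item)

print(render_prediction({'answer': 'B', 'score': 1.0}))
```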
@@ -456,6 +458,10 @@ def create_single_model_tab(sidebar: SidebarComponents, lang: str):
             'zh': '数据集分数',
             'en': 'Dataset Scores'
         },
+        'report_analysis': {
+            'zh': '报告智能分析',
+            'en': 'Report Intelligent Analysis'
+        },
         'dataset_scores_table': {
             'zh': '数据集分数表',
             'en': 'Dataset Scores Table'
@@ -511,6 +517,9 @@ def create_single_model_tab(sidebar: SidebarComponents, lang: str):
         with gr.Tab(locale_dict['dataset_details'][lang]):
             dataset_radio = gr.Radio(
                 label=locale_dict['select_dataset'][lang], choices=[], show_label=True, interactive=True)
+            # show dataset details
+            with gr.Accordion(locale_dict['report_analysis'][lang], open=True):
+                report_analysis = gr.Markdown(value='N/A', show_copy_button=True)
             gr.Markdown(f'### {locale_dict["dataset_scores"][lang]}')
             dataset_plot = gr.Plot(value=None, scale=1, label=locale_dict['dataset_scores'][lang])
             gr.Markdown(f'### {locale_dict["dataset_scores_table"][lang]}')
@@ -586,15 +595,16 @@ def create_single_model_tab(sidebar: SidebarComponents, lang: str):
         @gr.on(
             triggers=[dataset_radio.change, report_list.change],
             inputs=[dataset_radio, report_list],
-            outputs=[dataset_plot, dataset_table, subset_select, data_review_df])
+            outputs=[dataset_plot, dataset_table, subset_select, data_review_df, report_analysis])
         def update_single_report_dataset(dataset_name, report_list):
             logger.debug(f'Updating single report dataset: {dataset_name}')
-            report_df = get_data_frame(report_list)
+            report_df = get_data_frame(report_list=report_list)
+            analysis = get_report_analysis(report_list, dataset_name)
             data_score_df, styler = get_single_dataset_df(report_df, dataset_name)
             data_score_plot = plot_single_dataset_scores(data_score_df)
             subsets = data_score_df[ReportKey.subset_name].unique().tolist()
             logger.debug(f'subsets: {subsets}')
-            return data_score_plot, styler, gr.update(choices=subsets, value=None), None
+            return data_score_plot, styler, gr.update(choices=subsets, value=None), None, analysis
 
         @gr.on(
             triggers=[subset_select.change],
@@ -0,0 +1,21 @@
+PLOTLY_THEME = 'plotly_dark'
+REPORT_TOKEN = '@@'
+MODEL_TOKEN = '::'
+DATASET_TOKEN = ', '
+LATEX_DELIMITERS = [{
+    'left': '$$',
+    'right': '$$',
+    'display': True
+}, {
+    'left': '$',
+    'right': '$',
+    'display': False
+}, {
+    'left': '\\(',
+    'right': '\\)',
+    'display': False
+}, {
+    'left': '\\[',
+    'right': '\\]',
+    'display': True
+}]
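A hedged sketch of how a delimiter list like `LATEX_DELIMITERS` is typically consumed, assuming the installed Gradio version supports the `latex_delimiters` argument on `gr.Markdown` (only a subset of the entries above is repeated here):

```python
import gradio as gr

# Subset of the delimiter configuration defined above; passing it to a
# Markdown component is an assumption about the Gradio API, not shown in this diff.
delimiters = [
    {'left': '$$', 'right': '$$', 'display': True},
    {'left': '$', 'right': '$', 'display': False},
]

with gr.Blocks() as demo:
    gr.Markdown(r'Euler: $e^{i\pi} + 1 = 0$', latex_delimiters=delimiters)

if __name__ == '__main__':
    demo.launch()
```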
@@ -67,7 +67,7 @@ def add_argument(parser: argparse.ArgumentParser):
     parser.add_argument('--eval-config', type=str, required=False, help='The eval task config file path for evaluation backend.') # noqa: E501
     parser.add_argument('--stage', type=str, default='all', help='The stage of evaluation pipeline.',
                         choices=[EvalStage.ALL, EvalStage.INFER, EvalStage.REVIEW])
-    parser.add_argument('--limit', type=
+    parser.add_argument('--limit', type=float, default=None, help='Max evaluation samples num for each subset.')
     parser.add_argument('--eval-batch-size', type=int, default=1, help='The batch size for evaluation.')
 
     # Cache and working directory arguments
@@ -89,6 +89,7 @@ def add_argument(parser: argparse.ArgumentParser):
     parser.add_argument('--judge-strategy', type=str, default=JudgeStrategy.AUTO, help='The judge strategy.')
     parser.add_argument('--judge-model-args', type=json.loads, default='{}', help='The judge model args, should be a json string.') # noqa: E501
     parser.add_argument('--judge-worker-num', type=int, default=1, help='The number of workers for the judge model.')
+    parser.add_argument('--analysis-report', action='store_true', default=False, help='Generate analysis report for the evaluation results using judge model.') # noqa: E501
     # yapf: enable
 
 
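A stand-alone sketch of the two options touched by these hunks, mirrored in a minimal parser to show how they are passed on the command line (the full `evalscope` parser defines many more options than shown here):

```python
import argparse

# Excerpt-only parser: reproduces just --limit and --analysis-report from the hunks above.
parser = argparse.ArgumentParser('evalscope (excerpt)')
parser.add_argument('--limit', type=float, default=None,
                    help='Max evaluation samples num for each subset.')
parser.add_argument('--analysis-report', action='store_true', default=False,
                    help='Generate analysis report for the evaluation results using judge model.')

args = parser.parse_args(['--limit', '100', '--analysis-report'])
print(args.limit, args.analysis_report)  # 100.0 True
```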
@@ -1,4 +1,5 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.
+import os
 import subprocess
 import tempfile
 from dataclasses import asdict
@@ -204,7 +205,7 @@ class OpenCompassBackendManager(BackendManager):
             model_d['meta_template'] = get_template(model_d['meta_template'])
 
             # set the 'abbr' as the 'path' if 'abbr' is not specified
-            model_d['abbr'] = model_d['path']
+            model_d['abbr'] = os.path.basename(model_d['path'])
 
             model_config = ApiModelConfig(**model_d)
             models.append(asdict(model_config))
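The effect of the change above, illustrated with a hypothetical model path (the concrete path is only an example):

```python
import os

# Before: the OpenCompass 'abbr' was the full path; after: only the last path component.
model_path = 'Qwen/Qwen2.5-7B-Instruct'  # hypothetical model id/path
print(os.path.basename(model_path))      # -> 'Qwen2.5-7B-Instruct'
```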
{evalscope-0.16.0 → evalscope-0.16.2}/evalscope/backend/rag_eval/clip_benchmark/dataset_builder.py
RENAMED
@@ -1,4 +1,5 @@
 import os
+import posixpath # For URL path handling
 import torch
 from torch.utils.data import DataLoader
 from torch.utils.data import Dataset as TorchDataset
@@ -186,42 +187,53 @@ def build_wds_dataset(dataset_name, transform, split='test', data_dir='root', ca
 
     Set `cache_dir` to a path to cache the dataset, otherwise, no caching will occur.
     """
+    import requests
     import webdataset as wds
 
     def read_txt(fname):
-        if '://'
-
-
-
-
+        if fname.startswith(('http://', 'https://')):
+            try:
+                response = requests.get(fname)
+                response.raise_for_status() # Ensure the HTTP request was successful
+                return response.text
+            except requests.exceptions.RequestException as e:
+                raise FileNotFoundError(f'Failed to read {fname}: {e}')
         else:
             with open(fname, 'r') as file:
-
-
+                return file.read()
+
+    def url_path_join(*parts):
+        """Join URL path parts with forward slashes regardless of platform"""
+        return posixpath.join(*parts)
 
     if not data_dir:
         data_dir = f'https://modelscope.cn/datasets/clip-benchmark/wds_{dataset_name}/resolve/master'
 
     # Git LFS files have a different file path to access the raw data than other files
-
+    is_url = data_dir.startswith(('http://', 'https://'))
+    if is_url and data_dir.startswith('https://modelscope.cn/datasets'):
         *split_url_head, _, url_path = data_dir.split('/', 7)
         url_head = '/'.join(split_url_head)
         metadata_dir = '/'.join([url_head, 'resolve', url_path])
         tardata_dir = '/'.join([url_head, 'resolve', url_path])
     else:
         metadata_dir = tardata_dir = data_dir
+
+    # Use appropriate path joining function based on whether we're dealing with a URL
+    path_join = url_path_join if is_url else os.path.join
+
     # Get number of shards
-    nshards_fname =
+    nshards_fname = path_join(metadata_dir, split, 'nshards.txt')
     nshards = int(read_txt(nshards_fname)) # Do not catch FileNotFound, nshards.txt should be mandatory
 
     # Get dataset type (classification or retrieval)
-    type_fname =
+    type_fname = path_join(metadata_dir, 'dataset_type.txt')
     try:
         dataset_type = read_txt(type_fname).strip().lower()
     except FileNotFoundError:
         dataset_type = 'classification'
 
-    filepattern =
+    filepattern = path_join(tardata_dir, split, '{0..%d}.tar' % (nshards - 1))
     # Load webdataset (support WEBP, PNG, and JPG for now)
     if not cache_dir or not isinstance(cache_dir, str):
         cache_dir = None
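The `url_path_join` helper exists because `os.path.join` would insert backslash separators on Windows, which breaks URLs. A quick stand-alone illustration (the dataset name in the URL is hypothetical):

```python
import os
import posixpath

shard_dir = 'https://modelscope.cn/datasets/clip-benchmark/wds_mnist/resolve/master'  # hypothetical dataset

# posixpath.join always uses '/', which is what a URL needs.
print(posixpath.join(shard_dir, 'test', 'nshards.txt'))
# https://modelscope.cn/datasets/clip-benchmark/wds_mnist/resolve/master/test/nshards.txt

# os.path.join uses the platform separator, so it is only safe for local paths.
print(os.path.join('local_data', 'test', 'nshards.txt'))
```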
@@ -11,7 +11,9 @@ class ModelArguments:
     pooling_mode: Optional[str] = None
     max_seq_length: int = 512 # max sequence length
     # prompt for llm based model
-    prompt: str =
+    prompt: Optional[str] = None
+    # prompts dictionary for different tasks, if prompt is not set
+    prompts: Optional[Dict[str, str]] = None
     # model kwargs
     model_kwargs: dict = field(default_factory=dict)
     # config kwargs
@@ -33,6 +35,7 @@ class ModelArguments:
             'pooling_mode': self.pooling_mode,
             'max_seq_length': self.max_seq_length,
             'prompt': self.prompt,
+            'prompts': self.prompts,
             'model_kwargs': self.model_kwargs,
             'config_kwargs': self.config_kwargs,
             'encode_kwargs': self.encode_kwargs,
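A hedged sketch of the fallback logic implied by the comments above (a single `prompt` wins; otherwise the per-task `prompts` mapping is consulted). The task names and instruction strings are illustrative only:

```python
from typing import Dict, Optional

# Illustration of the prompt/prompts precedence suggested by the new fields.
def resolve_prompt(task_name: str,
                   prompt: Optional[str],
                   prompts: Optional[Dict[str, str]]) -> Optional[str]:
    if prompt is not None:
        return prompt          # a global prompt applies to every task
    if prompts:
        return prompts.get(task_name)  # otherwise look up the task-specific prompt
    return None

print(resolve_prompt('Retrieval', None, {'Retrieval': 'Represent this query: '}))
```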
@@ -1,6 +1,6 @@
 import mteb
 import os
-from
+from tabulate import tabulate
 
 from evalscope.backend.rag_eval import EmbeddingModel, cmteb
 from evalscope.utils.logger import get_logger
@@ -12,14 +12,27 @@ def show_results(output_folder, model, results):
     model_name = model.mteb_model_meta.model_name_as_path()
     revision = model.mteb_model_meta.revision
 
-
+    data = []
+    for model_res in results:
+        main_res = model_res.only_main_score()
+        for split, score in main_res.scores.items():
+            for sub_score in score:
+                data.append({
+                    'Model': model_name.replace('eval__', ''),
+                    'Revision': revision,
+                    'Task Type': main_res.task_type,
+                    'Task': main_res.task_name,
+                    'Split': split,
+                    'Subset': sub_score['hf_subset'],
+                    'Main Score': sub_score['main_score'],
+                })
 
     save_path = os.path.join(
         output_folder,
         model_name,
         revision,
     )
-    logger.info(f'Evaluation results:\n{
+    logger.info(f'Evaluation results:\n{tabulate(data, headers="keys", tablefmt="grid")}')
     logger.info(f'Evaluation results saved in {os.path.abspath(save_path)}')
 
 
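`tabulate` accepts a list of dicts directly when `headers='keys'`; a small stand-alone illustration of the grid output used above (row values are made up):

```python
from tabulate import tabulate

# Rows in the same shape show_results builds; scores and names are illustrative.
data = [
    {'Model': 'my-embedding-model', 'Task': 'TNews', 'Split': 'test', 'Main Score': 0.512},
    {'Model': 'my-embedding-model', 'Task': 'T2Retrieval', 'Split': 'dev', 'Main Score': 0.731},
]
print(tabulate(data, headers='keys', tablefmt='grid'))
```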
@@ -34,6 +47,7 @@ def one_stage_eval(
     tasks = cmteb.TaskBase.get_tasks(task_names=eval_args['tasks'], dataset_path=custom_dataset_path)
     evaluation = mteb.MTEB(tasks=tasks)
 
+    eval_args['encode_kwargs'] = model_args.get('encode_kwargs', {})
     # run evaluation
     results = evaluation.run(model, **eval_args)
 
@@ -66,6 +80,7 @@ def two_stage_eval(
         overwrite_results=True,
         hub=eval_args['hub'],
         limits=eval_args['limits'],
+        encode_kwargs=model1_args.get('encode_kwargs', {}),
     )
     # stage 2: run cross encoder
     results = evaluation.run(
@@ -77,6 +92,7 @@ def two_stage_eval(
         overwrite_results=True,
         hub=eval_args['hub'],
         limits=eval_args['limits'],
+        encode_kwargs=model2_args.get('encode_kwargs', {}),
     )
 
     # save and log results