evalscope 1.0.1__tar.gz → 1.1.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.


Files changed (578)
  1. evalscope-1.1.0/MANIFEST.in +10 -0
  2. {evalscope-1.0.1/evalscope.egg-info → evalscope-1.1.0}/PKG-INFO +6 -3
  3. {evalscope-1.0.1 → evalscope-1.1.0}/README.md +3 -1
  4. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/benchmark/adapters/default_data_adapter.py +18 -4
  5. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/benchmark/adapters/multi_choice_adapter.py +5 -2
  6. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/benchmark/adapters/text2image_adapter.py +5 -4
  7. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/benchmark/adapters/vision_language_adapter.py +3 -1
  8. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/benchmark/benchmark.py +27 -2
  9. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/benchmark/meta.py +3 -0
  10. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/evaluator/evaluator.py +5 -0
  11. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/evaluator/state.py +5 -0
  12. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/messages/chat_message.py +6 -1
  13. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/mixin/__init__.py +1 -0
  14. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/mixin/llm_judge_mixin.py +2 -0
  15. evalscope-1.1.0/evalscope/api/mixin/sandbox_mixin.py +204 -0
  16. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/model/generate_config.py +0 -3
  17. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/model/model.py +1 -1
  18. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/tool/tool_info.py +1 -1
  19. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/app/ui/multi_model.py +6 -1
  20. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/app/ui/single_model.py +8 -2
  21. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/app/utils/data_utils.py +3 -2
  22. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/app/utils/visualization.py +2 -2
  23. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/arguments.py +6 -0
  24. evalscope-1.1.0/evalscope/benchmarks/ai2d/ai2d_adapter.py +54 -0
  25. evalscope-1.1.0/evalscope/benchmarks/amc/amc_adapter.py +46 -0
  26. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/bbh_adapter.py +43 -17
  27. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bfcl/bfcl_adapter.py +106 -2
  28. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bfcl/generation.py +7 -7
  29. evalscope-1.1.0/evalscope/benchmarks/blink/blink_adapter.py +61 -0
  30. evalscope-1.1.0/evalscope/benchmarks/chartqa/chartqa_adapter.py +80 -0
  31. evalscope-1.1.0/evalscope/benchmarks/chartqa/utils.py +38 -0
  32. evalscope-1.1.0/evalscope/benchmarks/docvqa/docvqa_adapter.py +67 -0
  33. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/drop/drop_adapter.py +1 -1
  34. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/general_arena/utils.py +2 -1
  35. evalscope-1.1.0/evalscope/benchmarks/healthbench/healthbench_adapter.py +282 -0
  36. evalscope-1.1.0/evalscope/benchmarks/healthbench/utils.py +102 -0
  37. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/hle/hle_adapter.py +3 -2
  38. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/humaneval/humaneval_adapter.py +19 -35
  39. evalscope-1.1.0/evalscope/benchmarks/humaneval/utils.py +235 -0
  40. evalscope-1.1.0/evalscope/benchmarks/infovqa/infovqa_adapter.py +66 -0
  41. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/live_code_bench/evaluate_utils.py +13 -6
  42. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/live_code_bench/live_code_bench_adapter.py +60 -37
  43. evalscope-1.1.0/evalscope/benchmarks/live_code_bench/sandbox_evaluate_utils.py +220 -0
  44. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/math_500/math_500_adapter.py +0 -1
  45. evalscope-1.1.0/evalscope/benchmarks/minerva_math/minerva_math_adapter.py +48 -0
  46. evalscope-1.1.0/evalscope/benchmarks/mm_bench/mm_bench_adapter.py +99 -0
  47. evalscope-1.1.0/evalscope/benchmarks/mm_star/mm_star_adapter.py +73 -0
  48. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/mmmu/mmmu_adapter.py +1 -1
  49. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/mmmu_pro/mmmu_pro_adapter.py +4 -9
  50. evalscope-1.1.0/evalscope/benchmarks/multi_if/ifeval.py +3354 -0
  51. evalscope-1.1.0/evalscope/benchmarks/multi_if/metrics.py +120 -0
  52. evalscope-1.1.0/evalscope/benchmarks/multi_if/multi_if_adapter.py +161 -0
  53. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/needle_haystack/needle_haystack_adapter.py +1 -4
  54. evalscope-1.1.0/evalscope/benchmarks/ocr_bench/ocr_bench_adapter.py +101 -0
  55. evalscope-1.1.0/evalscope/benchmarks/ocr_bench_v2/IoUscore_metric.py +87 -0
  56. evalscope-1.1.0/evalscope/benchmarks/ocr_bench_v2/TEDS_metric.py +963 -0
  57. {evalscope-1.0.1/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models → evalscope-1.1.0/evalscope/benchmarks/ocr_bench_v2}/__init__.py +0 -0
  58. evalscope-1.1.0/evalscope/benchmarks/ocr_bench_v2/ocr_bench_v2_adapter.py +161 -0
  59. evalscope-1.1.0/evalscope/benchmarks/ocr_bench_v2/page_ocr_metric.py +50 -0
  60. evalscope-1.1.0/evalscope/benchmarks/ocr_bench_v2/parallel.py +46 -0
  61. evalscope-1.1.0/evalscope/benchmarks/ocr_bench_v2/spotting_eval/readme.txt +26 -0
  62. evalscope-1.1.0/evalscope/benchmarks/ocr_bench_v2/spotting_eval/rrc_evaluation_funcs_1_1.py +537 -0
  63. evalscope-1.1.0/evalscope/benchmarks/ocr_bench_v2/spotting_eval/script.py +481 -0
  64. evalscope-1.1.0/evalscope/benchmarks/ocr_bench_v2/spotting_metric.py +179 -0
  65. evalscope-1.1.0/evalscope/benchmarks/ocr_bench_v2/utils.py +432 -0
  66. evalscope-1.1.0/evalscope/benchmarks/ocr_bench_v2/vqa_metric.py +254 -0
  67. evalscope-1.1.0/evalscope/benchmarks/olympiad_bench/olympiad_bench_adapter.py +163 -0
  68. evalscope-1.1.0/evalscope/benchmarks/olympiad_bench/utils.py +565 -0
  69. evalscope-1.1.0/evalscope/benchmarks/omni_bench/omni_bench_adapter.py +86 -0
  70. evalscope-1.1.0/evalscope/benchmarks/real_world_qa/__init__.py +0 -0
  71. evalscope-1.1.0/evalscope/benchmarks/real_world_qa/real_world_qa_adapter.py +64 -0
  72. evalscope-1.1.0/evalscope/benchmarks/simple_qa/__init__.py +0 -0
  73. evalscope-1.1.0/evalscope/benchmarks/super_gpqa/__init__.py +0 -0
  74. evalscope-1.1.0/evalscope/benchmarks/tau_bench/__init__.py +0 -0
  75. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/tau_bench/tau_bench_adapter.py +6 -1
  76. evalscope-1.1.0/evalscope/benchmarks/text2image/__init__.py +0 -0
  77. evalscope-1.1.0/evalscope/benchmarks/tool_bench/__init__.py +0 -0
  78. evalscope-1.1.0/evalscope/benchmarks/winogrande/__init__.py +0 -0
  79. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/config.py +24 -1
  80. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/constants.py +3 -0
  81. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/evaluator/evaluator.py +25 -7
  82. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/metric.py +78 -2
  83. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/metrics.py +16 -0
  84. evalscope-1.1.0/evalscope/metrics/t2v_metrics/__init__.py +0 -0
  85. evalscope-1.1.0/evalscope/metrics/t2v_metrics/models/__init__.py +0 -0
  86. evalscope-1.1.0/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model/__init__.py +0 -0
  87. evalscope-1.1.0/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward/__init__.py +0 -0
  88. evalscope-1.1.0/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/__init__.py +0 -0
  89. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/Qformer.py +2 -6
  90. evalscope-1.1.0/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/__init__.py +0 -0
  91. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/nlvr_encoder.py +2 -6
  92. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/med.py +2 -6
  93. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/models/model_apis.py +10 -8
  94. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/models/utils/openai.py +1 -2
  95. evalscope-1.1.0/evalscope/perf/__init__.py +0 -0
  96. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/arguments.py +2 -0
  97. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/plugin/api/base.py +2 -2
  98. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/plugin/api/default_api.py +7 -7
  99. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/plugin/api/openai_api.py +83 -19
  100. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/plugin/datasets/flickr8k.py +2 -2
  101. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/plugin/datasets/kontext_bench.py +2 -2
  102. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/plugin/datasets/random_vl_dataset.py +2 -2
  103. evalscope-1.1.0/evalscope/perf/utils/__init__.py +0 -0
  104. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/utils/benchmark_util.py +1 -2
  105. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/report/__init__.py +9 -1
  106. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/report/combinator.py +45 -20
  107. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/report/report.py +8 -4
  108. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/run.py +1 -1
  109. evalscope-1.1.0/evalscope/third_party/thinkbench/tools/__init__.py +0 -0
  110. evalscope-1.1.0/evalscope/utils/function_utils.py +70 -0
  111. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/utils/import_utils.py +63 -13
  112. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/utils/io_utils.py +19 -11
  113. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/utils/json_schema.py +25 -2
  114. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/utils/logger.py +19 -0
  115. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/utils/model_utils.py +1 -1
  116. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/utils/multi_choices.py +16 -1
  117. evalscope-1.1.0/evalscope/version.py +4 -0
  118. {evalscope-1.0.1 → evalscope-1.1.0/evalscope.egg-info}/PKG-INFO +6 -3
  119. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope.egg-info/SOURCES.txt +53 -38
  120. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope.egg-info/requires.txt +4 -37
  121. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope.egg-info/top_level.txt +0 -1
  122. evalscope-1.1.0/pyproject.toml +61 -0
  123. evalscope-1.1.0/setup.py +4 -0
  124. evalscope-1.0.1/MANIFEST.in +0 -4
  125. evalscope-1.0.1/evalscope/utils/function_utils.py +0 -29
  126. evalscope-1.0.1/evalscope/version.py +0 -4
  127. evalscope-1.0.1/evalscope.egg-info/not-zip-safe +0 -1
  128. evalscope-1.0.1/requirements/aigc.txt +0 -8
  129. evalscope-1.0.1/requirements/app.txt +0 -2
  130. evalscope-1.0.1/requirements/dev.txt +0 -5
  131. evalscope-1.0.1/requirements/docs.txt +0 -6
  132. evalscope-1.0.1/requirements/framework.txt +0 -29
  133. evalscope-1.0.1/requirements/opencompass.txt +0 -1
  134. evalscope-1.0.1/requirements/perf.txt +0 -10
  135. evalscope-1.0.1/requirements/rag.txt +0 -8
  136. evalscope-1.0.1/requirements/vlmeval.txt +0 -1
  137. evalscope-1.0.1/requirements.txt +0 -1
  138. evalscope-1.0.1/setup.py +0 -196
  139. evalscope-1.0.1/tests/__init__.py +0 -1
  140. evalscope-1.0.1/tests/benchmark/__init__.py +0 -1
  141. evalscope-1.0.1/tests/benchmark/test_eval.py +0 -385
  142. evalscope-1.0.1/tests/benchmark/test_image_edit.py +0 -65
  143. evalscope-1.0.1/tests/benchmark/test_t2i.py +0 -142
  144. evalscope-1.0.1/tests/benchmark/test_vlm.py +0 -80
  145. evalscope-1.0.1/tests/cli/__init__.py +0 -1
  146. evalscope-1.0.1/tests/cli/test_all.py +0 -269
  147. evalscope-1.0.1/tests/cli/test_collection.py +0 -99
  148. evalscope-1.0.1/tests/cli/test_custom.py +0 -268
  149. evalscope-1.0.1/tests/cli/test_reasoning.py +0 -81
  150. evalscope-1.0.1/tests/common.py +0 -73
  151. evalscope-1.0.1/tests/perf/__init__.py +0 -1
  152. evalscope-1.0.1/tests/perf/test_perf.py +0 -178
  153. evalscope-1.0.1/tests/rag/test_clip_benchmark.py +0 -87
  154. evalscope-1.0.1/tests/rag/test_mteb.py +0 -213
  155. evalscope-1.0.1/tests/rag/test_ragas.py +0 -128
  156. evalscope-1.0.1/tests/swift/__init__.py +0 -1
  157. evalscope-1.0.1/tests/swift/test_run_swift_eval.py +0 -146
  158. evalscope-1.0.1/tests/swift/test_run_swift_vlm_eval.py +0 -128
  159. evalscope-1.0.1/tests/swift/test_run_swift_vlm_jugde_eval.py +0 -157
  160. evalscope-1.0.1/tests/test_run_all.py +0 -12
  161. evalscope-1.0.1/tests/utils.py +0 -13
  162. evalscope-1.0.1/tests/vlm/__init__.py +0 -1
  163. evalscope-1.0.1/tests/vlm/test_vlmeval.py +0 -102
  164. {evalscope-1.0.1 → evalscope-1.1.0}/LICENSE +0 -0
  165. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/__init__.py +0 -0
  166. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/__init__.py +0 -0
  167. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/benchmark/__init__.py +0 -0
  168. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/benchmark/adapters/__init__.py +0 -0
  169. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/benchmark/adapters/image_edit_adapter.py +0 -0
  170. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/dataset/__init__.py +0 -0
  171. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/dataset/dataset.py +0 -0
  172. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/dataset/loader.py +0 -0
  173. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/dataset/utils.py +0 -0
  174. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/evaluator/__init__.py +0 -0
  175. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/evaluator/cache.py +0 -0
  176. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/filter/__init__.py +0 -0
  177. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/filter/filter.py +0 -0
  178. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/messages/__init__.py +0 -0
  179. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/messages/content.py +0 -0
  180. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/messages/utils.py +0 -0
  181. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/metric/__init__.py +0 -0
  182. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/metric/metric.py +0 -0
  183. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/metric/scorer.py +0 -0
  184. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/model/__init__.py +0 -0
  185. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/model/model_output.py +0 -0
  186. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/registry.py +0 -0
  187. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/tool/__init__.py +0 -0
  188. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/tool/tool_call.py +0 -0
  189. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/api/tool/utils.py +0 -0
  190. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/app/__init__.py +0 -0
  191. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/app/app.py +0 -0
  192. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/app/arguments.py +0 -0
  193. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/app/constants.py +0 -0
  194. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/app/ui/__init__.py +0 -0
  195. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/app/ui/app_ui.py +0 -0
  196. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/app/ui/sidebar.py +0 -0
  197. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/app/ui/visualization.py +0 -0
  198. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/app/utils/env_utils.py +0 -0
  199. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/app/utils/localization.py +0 -0
  200. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/app/utils/text_utils.py +0 -0
  201. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/__init__.py +0 -0
  202. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/base.py +0 -0
  203. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/opencompass/__init__.py +0 -0
  204. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/opencompass/api_meta_template.py +0 -0
  205. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/opencompass/backend_manager.py +0 -0
  206. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/opencompass/tasks/__init__.py +0 -0
  207. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/opencompass/tasks/eval_api.py +0 -0
  208. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/opencompass/tasks/eval_datasets.py +0 -0
  209. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/__init__.py +0 -0
  210. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/backend_manager.py +0 -0
  211. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/clip_benchmark/__init__.py +0 -0
  212. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/clip_benchmark/arguments.py +0 -0
  213. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/clip_benchmark/dataset_builder.py +0 -0
  214. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/clip_benchmark/task_template.py +0 -0
  215. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/clip_benchmark/tasks/__init__.py +0 -0
  216. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/clip_benchmark/tasks/image_caption.py +0 -0
  217. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_classification.py +0 -0
  218. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_retrieval.py +0 -0
  219. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/clip_benchmark/utils/webdataset_convert.py +0 -0
  220. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/clip_benchmark/utils/webdatasets.txt +0 -0
  221. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/cmteb/__init__.py +0 -0
  222. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/cmteb/arguments.py +0 -0
  223. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/cmteb/base.py +0 -0
  224. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/cmteb/task_template.py +0 -0
  225. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/cmteb/tasks/Classification.py +0 -0
  226. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/cmteb/tasks/Clustering.py +0 -0
  227. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/cmteb/tasks/CustomTask.py +0 -0
  228. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/cmteb/tasks/PairClassification.py +0 -0
  229. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/cmteb/tasks/Reranking.py +0 -0
  230. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/cmteb/tasks/Retrieval.py +0 -0
  231. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/cmteb/tasks/STS.py +0 -0
  232. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/cmteb/tasks/__init__.py +0 -0
  233. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/ragas/__init__.py +0 -0
  234. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/ragas/arguments.py +0 -0
  235. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/ragas/prompts/persona_prompt.py +0 -0
  236. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/ragas/task_template.py +0 -0
  237. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/ragas/tasks/__init__.py +0 -0
  238. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/ragas/tasks/build_distribution.py +0 -0
  239. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/ragas/tasks/build_transform.py +0 -0
  240. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/ragas/tasks/testset_generation.py +0 -0
  241. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/ragas/tasks/translate_prompt.py +0 -0
  242. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/utils/__init__.py +0 -0
  243. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/utils/clip.py +0 -0
  244. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/utils/embedding.py +0 -0
  245. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/utils/llm.py +0 -0
  246. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/rag_eval/utils/tools.py +0 -0
  247. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/vlm_eval_kit/__init__.py +0 -0
  248. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/backend/vlm_eval_kit/backend_manager.py +0 -0
  249. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/__init__.py +0 -0
  250. {evalscope-1.0.1/evalscope/benchmarks/aime → evalscope-1.1.0/evalscope/benchmarks/ai2d}/__init__.py +0 -0
  251. {evalscope-1.0.1/evalscope/benchmarks/alpaca_eval → evalscope-1.1.0/evalscope/benchmarks/aime}/__init__.py +0 -0
  252. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/aime/aime24_adapter.py +0 -0
  253. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/aime/aime25_adapter.py +0 -0
  254. {evalscope-1.0.1/evalscope/benchmarks/arena_hard → evalscope-1.1.0/evalscope/benchmarks/alpaca_eval}/__init__.py +0 -0
  255. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/alpaca_eval/alpaca_eval_adapter.py +0 -0
  256. {evalscope-1.0.1/evalscope/benchmarks/bfcl → evalscope-1.1.0/evalscope/benchmarks/amc}/__init__.py +0 -0
  257. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/arc/__init__.py +0 -0
  258. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/arc/arc_adapter.py +0 -0
  259. {evalscope-1.0.1/evalscope/benchmarks/chinese_simple_qa → evalscope-1.1.0/evalscope/benchmarks/arena_hard}/__init__.py +0 -0
  260. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/arena_hard/arena_hard_adapter.py +0 -0
  261. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/arena_hard/utils.py +0 -0
  262. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/__init__.py +0 -0
  263. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/boolean_expressions.txt +0 -0
  264. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/causal_judgement.txt +0 -0
  265. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/date_understanding.txt +0 -0
  266. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/disambiguation_qa.txt +0 -0
  267. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/dyck_languages.txt +0 -0
  268. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/formal_fallacies.txt +0 -0
  269. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/geometric_shapes.txt +0 -0
  270. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/hyperbaton.txt +0 -0
  271. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_five_objects.txt +0 -0
  272. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_seven_objects.txt +0 -0
  273. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_three_objects.txt +0 -0
  274. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/movie_recommendation.txt +0 -0
  275. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/multistep_arithmetic_two.txt +0 -0
  276. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/navigate.txt +0 -0
  277. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/object_counting.txt +0 -0
  278. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/penguins_in_a_table.txt +0 -0
  279. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/reasoning_about_colored_objects.txt +0 -0
  280. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/ruin_names.txt +0 -0
  281. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/salient_translation_error_detection.txt +0 -0
  282. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/snarks.txt +0 -0
  283. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/sports_understanding.txt +0 -0
  284. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/temporal_sequences.txt +0 -0
  285. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_five_objects.txt +0 -0
  286. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_seven_objects.txt +0 -0
  287. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_three_objects.txt +0 -0
  288. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/web_of_lies.txt +0 -0
  289. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/bbh/cot_prompts/word_sorting.txt +0 -0
  290. {evalscope-1.0.1/evalscope/benchmarks/data_collection → evalscope-1.1.0/evalscope/benchmarks/bfcl}/__init__.py +0 -0
  291. {evalscope-1.0.1/evalscope/benchmarks/docmath → evalscope-1.1.0/evalscope/benchmarks/blink}/__init__.py +0 -0
  292. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/ceval/__init__.py +0 -0
  293. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/ceval/ceval_adapter.py +0 -0
  294. {evalscope-1.0.1/evalscope/benchmarks/drop → evalscope-1.1.0/evalscope/benchmarks/chartqa}/__init__.py +0 -0
  295. {evalscope-1.0.1/evalscope/benchmarks/frames → evalscope-1.1.0/evalscope/benchmarks/chinese_simple_qa}/__init__.py +0 -0
  296. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/chinese_simple_qa/csimple_qa_adapter.py +0 -0
  297. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/cmmlu/__init__.py +0 -0
  298. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/cmmlu/cmmlu_adapter.py +0 -0
  299. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/competition_math/__init__.py +0 -0
  300. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/competition_math/competition_math_adapter.py +0 -0
  301. {evalscope-1.0.1/evalscope/benchmarks/general_arena → evalscope-1.1.0/evalscope/benchmarks/data_collection}/__init__.py +0 -0
  302. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/data_collection/data_collection_adapter.py +0 -0
  303. {evalscope-1.0.1/evalscope/benchmarks/general_mcq → evalscope-1.1.0/evalscope/benchmarks/docmath}/__init__.py +0 -0
  304. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/docmath/docmath_adapter.py +0 -0
  305. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/docmath/utils.py +0 -0
  306. {evalscope-1.0.1/evalscope/benchmarks/gpqa → evalscope-1.1.0/evalscope/benchmarks/docvqa}/__init__.py +0 -0
  307. {evalscope-1.0.1/evalscope/benchmarks/hle → evalscope-1.1.0/evalscope/benchmarks/drop}/__init__.py +0 -0
  308. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/drop/utils.py +0 -0
  309. {evalscope-1.0.1/evalscope/benchmarks/ifeval → evalscope-1.1.0/evalscope/benchmarks/frames}/__init__.py +0 -0
  310. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/frames/frames_adapter.py +0 -0
  311. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/frames/utils.py +0 -0
  312. {evalscope-1.0.1/evalscope/benchmarks/image_edit → evalscope-1.1.0/evalscope/benchmarks/general_arena}/__init__.py +0 -0
  313. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/general_arena/general_arena_adapter.py +0 -0
  314. {evalscope-1.0.1/evalscope/benchmarks/image_edit/gedit → evalscope-1.1.0/evalscope/benchmarks/general_mcq}/__init__.py +0 -0
  315. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/general_mcq/general_mcq_adapter.py +0 -0
  316. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/general_qa/__init__.py +0 -0
  317. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/general_qa/general_qa_adapter.py +0 -0
  318. {evalscope-1.0.1/evalscope/benchmarks/iquiz → evalscope-1.1.0/evalscope/benchmarks/gpqa}/__init__.py +0 -0
  319. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/gpqa/gpqa_adapter.py +0 -0
  320. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/gpqa/prompt.py +0 -0
  321. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/gsm8k/__init__.py +0 -0
  322. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/gsm8k/gsm8k_adapter.py +0 -0
  323. {evalscope-1.0.1/evalscope/benchmarks/live_code_bench → evalscope-1.1.0/evalscope/benchmarks/healthbench}/__init__.py +0 -0
  324. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/hellaswag/__init__.py +0 -0
  325. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/hellaswag/hellaswag_adapter.py +0 -0
  326. {evalscope-1.0.1/evalscope/benchmarks/maritime_bench → evalscope-1.1.0/evalscope/benchmarks/hle}/__init__.py +0 -0
  327. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/humaneval/__init__.py +0 -0
  328. {evalscope-1.0.1/evalscope/benchmarks/math_500 → evalscope-1.1.0/evalscope/benchmarks/ifeval}/__init__.py +0 -0
  329. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/ifeval/ifeval_adapter.py +0 -0
  330. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/ifeval/instructions.py +0 -0
  331. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/ifeval/instructions_registry.py +0 -0
  332. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/ifeval/instructions_util.py +0 -0
  333. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/ifeval/utils.py +0 -0
  334. {evalscope-1.0.1/evalscope/benchmarks/math_vista → evalscope-1.1.0/evalscope/benchmarks/image_edit}/__init__.py +0 -0
  335. {evalscope-1.0.1/evalscope/benchmarks/mmlu_pro → evalscope-1.1.0/evalscope/benchmarks/image_edit/gedit}/__init__.py +0 -0
  336. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/image_edit/gedit/gedit_adapter.py +0 -0
  337. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/image_edit/gedit/utils.py +0 -0
  338. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/image_edit/gedit/vie_prompts.py +0 -0
  339. {evalscope-1.0.1/evalscope/benchmarks/mmlu_redux → evalscope-1.1.0/evalscope/benchmarks/infovqa}/__init__.py +0 -0
  340. {evalscope-1.0.1/evalscope/benchmarks/mmmu → evalscope-1.1.0/evalscope/benchmarks/iquiz}/__init__.py +0 -0
  341. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/iquiz/iquiz_adapter.py +0 -0
  342. {evalscope-1.0.1/evalscope/benchmarks/mmmu_pro → evalscope-1.1.0/evalscope/benchmarks/live_code_bench}/__init__.py +0 -0
  343. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/live_code_bench/extract_utils.py +0 -0
  344. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/live_code_bench/load_utils.py +0 -0
  345. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/live_code_bench/pass_k_utils.py +0 -0
  346. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/live_code_bench/prompts.py +0 -0
  347. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/live_code_bench/testing_util.py +0 -0
  348. {evalscope-1.0.1/evalscope/benchmarks/musr → evalscope-1.1.0/evalscope/benchmarks/maritime_bench}/__init__.py +0 -0
  349. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/maritime_bench/maritime_bench_adapter.py +0 -0
  350. {evalscope-1.0.1/evalscope/benchmarks/needle_haystack → evalscope-1.1.0/evalscope/benchmarks/math_500}/__init__.py +0 -0
  351. {evalscope-1.0.1/evalscope/benchmarks/process_bench → evalscope-1.1.0/evalscope/benchmarks/math_vista}/__init__.py +0 -0
  352. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/math_vista/math_vista_adapter.py +0 -0
  353. {evalscope-1.0.1/evalscope/benchmarks/simple_qa → evalscope-1.1.0/evalscope/benchmarks/minerva_math}/__init__.py +0 -0
  354. {evalscope-1.0.1/evalscope/benchmarks/super_gpqa → evalscope-1.1.0/evalscope/benchmarks/mm_bench}/__init__.py +0 -0
  355. {evalscope-1.0.1/evalscope/benchmarks/tau_bench → evalscope-1.1.0/evalscope/benchmarks/mm_star}/__init__.py +0 -0
  356. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/mmlu/__init__.py +0 -0
  357. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/mmlu/mmlu_adapter.py +0 -0
  358. {evalscope-1.0.1/evalscope/benchmarks/text2image → evalscope-1.1.0/evalscope/benchmarks/mmlu_pro}/__init__.py +0 -0
  359. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/mmlu_pro/mmlu_pro_adapter.py +0 -0
  360. {evalscope-1.0.1/evalscope/benchmarks/tool_bench → evalscope-1.1.0/evalscope/benchmarks/mmlu_redux}/__init__.py +0 -0
  361. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/mmlu_redux/mmlu_redux_adapter.py +0 -0
  362. {evalscope-1.0.1/evalscope/benchmarks/winogrande → evalscope-1.1.0/evalscope/benchmarks/mmmu}/__init__.py +0 -0
  363. {evalscope-1.0.1/evalscope/metrics/t2v_metrics → evalscope-1.1.0/evalscope/benchmarks/mmmu_pro}/__init__.py +0 -0
  364. {evalscope-1.0.1/evalscope/metrics/t2v_metrics/models → evalscope-1.1.0/evalscope/benchmarks/multi_if}/__init__.py +0 -0
  365. {evalscope-1.0.1/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model → evalscope-1.1.0/evalscope/benchmarks/musr}/__init__.py +0 -0
  366. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/musr/musr_adapter.py +0 -0
  367. {evalscope-1.0.1/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward → evalscope-1.1.0/evalscope/benchmarks/needle_haystack}/__init__.py +0 -0
  368. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/needle_haystack/utils.py +0 -0
  369. {evalscope-1.0.1/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5 → evalscope-1.1.0/evalscope/benchmarks/ocr_bench}/__init__.py +0 -0
  370. {evalscope-1.0.1/evalscope/perf → evalscope-1.1.0/evalscope/benchmarks/ocr_bench_v2/spotting_eval}/__init__.py +0 -0
  371. {evalscope-1.0.1/evalscope/perf/utils → evalscope-1.1.0/evalscope/benchmarks/olympiad_bench}/__init__.py +0 -0
  372. {evalscope-1.0.1/evalscope/third_party/thinkbench/tools → evalscope-1.1.0/evalscope/benchmarks/omni_bench}/__init__.py +0 -0
  373. {evalscope-1.0.1/tests/rag → evalscope-1.1.0/evalscope/benchmarks/process_bench}/__init__.py +0 -0
  374. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/process_bench/process_bench_adapter.py +0 -0
  375. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/race/__init__.py +0 -0
  376. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/race/race_adapter.py +0 -0
  377. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/simple_qa/simple_qa_adapter.py +0 -0
  378. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/super_gpqa/prompt.py +0 -0
  379. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/super_gpqa/super_gpqa_adapter.py +0 -0
  380. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/super_gpqa/utils.py +0 -0
  381. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/tau_bench/generation.py +0 -0
  382. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/text2image/evalmuse_adapter.py +0 -0
  383. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/text2image/genai_bench_adapter.py +0 -0
  384. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/text2image/general_t2i_adapter.py +0 -0
  385. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/text2image/hpdv2_adapter.py +0 -0
  386. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/text2image/tifa_adapter.py +0 -0
  387. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/tool_bench/tool_bench_adapter.py +0 -0
  388. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/tool_bench/utils.py +0 -0
  389. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/trivia_qa/__init__.py +0 -0
  390. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/trivia_qa/samples.jsonl +0 -0
  391. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/trivia_qa/trivia_qa_adapter.py +0 -0
  392. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/truthful_qa/__init__.py +0 -0
  393. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/truthful_qa/truthful_qa_adapter.py +0 -0
  394. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/benchmarks/winogrande/winogrande_adapter.py +0 -0
  395. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/cli/__init__.py +0 -0
  396. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/cli/base.py +0 -0
  397. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/cli/cli.py +0 -0
  398. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/cli/start_app.py +0 -0
  399. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/cli/start_eval.py +0 -0
  400. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/cli/start_perf.py +0 -0
  401. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/cli/start_server.py +0 -0
  402. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/collections/__init__.py +0 -0
  403. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/collections/sampler.py +0 -0
  404. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/collections/schema.py +0 -0
  405. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/evaluator/__init__.py +0 -0
  406. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/filters/__init__.py +0 -0
  407. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/filters/extraction.py +0 -0
  408. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/filters/selection.py +0 -0
  409. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/__init__.py +0 -0
  410. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/bundled_rouge_score/__init__.py +0 -0
  411. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/bundled_rouge_score/rouge_scorer.py +0 -0
  412. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/llm_judge.py +0 -0
  413. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/math_parser.py +0 -0
  414. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/rouge_metric.py +0 -0
  415. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/clipscore.py +0 -0
  416. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/constants.py +0 -0
  417. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/itmscore.py +0 -0
  418. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/clipscore_models/__init__.py +0 -0
  419. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model/base_model.py +0 -0
  420. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model/clip_model.py +0 -0
  421. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/clipscore_models/build_mps_model/cross_modeling.py +0 -0
  422. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/clipscore_models/clip_model.py +0 -0
  423. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/clipscore_models/hpsv2_model.py +0 -0
  424. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/clipscore_models/mps_model.py +0 -0
  425. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/clipscore_models/pickscore_model.py +0 -0
  426. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/itmscore_models/__init__.py +0 -0
  427. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/itmscore_models/blip2_itm_model.py +0 -0
  428. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/itmscore_models/fga_blip2_model.py +0 -0
  429. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward/ImageReward.py +0 -0
  430. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward/blip_pretrain.py +0 -0
  431. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/itmscore_models/image_reward_model.py +0 -0
  432. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/model.py +0 -0
  433. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/utils.py +0 -0
  434. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/__init__.py +0 -0
  435. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/__init__.py +0 -0
  436. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/language_model/clip_t5.py +0 -0
  437. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/multimodal_encoder/builder.py +0 -0
  438. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/multimodal_encoder/clip_encoder.py +0 -0
  439. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5/model/multimodal_projector/builder.py +0 -0
  440. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/clip_t5_model.py +0 -0
  441. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/gpt4v_model.py +0 -0
  442. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/__init__.py +0 -0
  443. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/config.py +0 -0
  444. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/dist_utils.py +0 -0
  445. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/gradcam.py +0 -0
  446. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/logger.py +0 -0
  447. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/optims.py +0 -0
  448. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/registry.py +0 -0
  449. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/utils.py +0 -0
  450. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/vqa_tools/__init__.py +0 -0
  451. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/vqa_tools/vqa.py +0 -0
  452. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/common/vqa_tools/vqa_eval.py +0 -0
  453. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/default.yaml +0 -0
  454. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_caption_flant5xl.yaml +0 -0
  455. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_caption_opt2.7b.yaml +0 -0
  456. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_caption_opt6.7b.yaml +0 -0
  457. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_coco.yaml +0 -0
  458. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_flant5xl.yaml +0 -0
  459. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_flant5xxl.yaml +0 -0
  460. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_vicuna13b.yaml +0 -0
  461. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_instruct_vicuna7b.yaml +0 -0
  462. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain.yaml +0 -0
  463. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl.yaml +0 -0
  464. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl_iter_80k_total_100k_no_prefix.yaml +0 -0
  465. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl_iter_80k_total_100k_prefix.yaml +0 -0
  466. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xl_vitL.yaml +0 -0
  467. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_flant5xxl.yaml +0 -0
  468. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_opt2.7b.yaml +0 -0
  469. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_opt6.7b.yaml +0 -0
  470. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_pretrain_vitL.yaml +0 -0
  471. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_vicuna13b.yaml +0 -0
  472. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/blip2/blip2_vicuna7b.yaml +0 -0
  473. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/med_config.json +0 -0
  474. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/med_config_albef.json +0 -0
  475. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/configs/models/med_large_config.json +0 -0
  476. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/__init__.py +0 -0
  477. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/base_model.py +0 -0
  478. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2.py +0 -0
  479. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_image_text_matching.py +0 -0
  480. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_qformer.py +0 -0
  481. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_t5.py +0 -0
  482. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/blip2_t5_instruct.py +0 -0
  483. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/fga_blip2.py +0 -0
  484. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/modeling_llama.py +0 -0
  485. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip2_models/modeling_t5.py +0 -0
  486. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/__init__.py +0 -0
  487. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip.py +0 -0
  488. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_caption.py +0 -0
  489. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_classification.py +0 -0
  490. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_feature_extractor.py +0 -0
  491. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_image_text_matching.py +0 -0
  492. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_nlvr.py +0 -0
  493. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_outputs.py +0 -0
  494. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_pretrain.py +0 -0
  495. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/blip_models/blip_vqa.py +0 -0
  496. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/clip_vit.py +0 -0
  497. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/eva_vit.py +0 -0
  498. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/models/vit.py +0 -0
  499. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/__init__.py +0 -0
  500. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/base_processor.py +0 -0
  501. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/blip_processors.py +0 -0
  502. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/lavis/processors/randaugment.py +0 -0
  503. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/mm_utils.py +0 -0
  504. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/models/vqascore_models/vqa_model.py +0 -0
  505. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/score.py +0 -0
  506. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/metrics/t2v_metrics/vqascore.py +0 -0
  507. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/models/__init__.py +0 -0
  508. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/models/image_edit_model.py +0 -0
  509. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/models/mockllm.py +0 -0
  510. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/models/modelscope.py +0 -0
  511. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/models/openai_compatible.py +0 -0
  512. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/models/text2image_model.py +0 -0
  513. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/benchmark.py +0 -0
  514. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/http_client.py +0 -0
  515. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/main.py +0 -0
  516. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/plugin/__init__.py +0 -0
  517. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/plugin/api/__init__.py +0 -0
  518. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/plugin/api/custom_api.py +0 -0
  519. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/plugin/api/dashscope_api.py +0 -0
  520. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/plugin/datasets/__init__.py +0 -0
  521. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/plugin/datasets/base.py +0 -0
  522. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/plugin/datasets/custom.py +0 -0
  523. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/plugin/datasets/line_by_line.py +0 -0
  524. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/plugin/datasets/longalpaca.py +0 -0
  525. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/plugin/datasets/openqa.py +0 -0
  526. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/plugin/datasets/random_dataset.py +0 -0
  527. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/plugin/datasets/speed_benchmark.py +0 -0
  528. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/plugin/registry.py +0 -0
  529. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/utils/analysis_result.py +0 -0
  530. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/utils/db_util.py +0 -0
  531. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/utils/handler.py +0 -0
  532. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/utils/local_server.py +0 -0
  533. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/utils/log_utils.py +0 -0
  534. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/perf/utils/rich_display.py +0 -0
  535. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/report/generator.py +0 -0
  536. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/summarizer.py +0 -0
  537. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/__init__.py +0 -0
  538. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/longbench_write/README.md +0 -0
  539. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/longbench_write/__init__.py +0 -0
  540. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/longbench_write/default_task.json +0 -0
  541. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/longbench_write/default_task.yaml +0 -0
  542. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/longbench_write/eval.py +0 -0
  543. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/longbench_write/infer.py +0 -0
  544. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/longbench_write/longbench_write.py +0 -0
  545. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/longbench_write/resources/__init__.py +0 -0
  546. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/longbench_write/resources/judge.txt +0 -0
  547. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/longbench_write/resources/longbench_write.jsonl +0 -0
  548. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/longbench_write/resources/longbench_write_en.jsonl +0 -0
  549. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/longbench_write/resources/longwrite_ruler.jsonl +0 -0
  550. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/longbench_write/tools/__init__.py +0 -0
  551. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/longbench_write/tools/data_etl.py +0 -0
  552. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/longbench_write/tools/openai_api.py +0 -0
  553. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/longbench_write/utils.py +0 -0
  554. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/thinkbench/__init__.py +0 -0
  555. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/thinkbench/eval.py +0 -0
  556. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/thinkbench/infer.py +0 -0
  557. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/thinkbench/resources/critique_template.txt +0 -0
  558. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/thinkbench/resources/reformat_template.txt +0 -0
  559. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/thinkbench/tools/llm.py +0 -0
  560. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/thinkbench/tools/utils.py +0 -0
  561. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/toolbench_static/README.md +0 -0
  562. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/toolbench_static/__init__.py +0 -0
  563. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/toolbench_static/config_default.json +0 -0
  564. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/toolbench_static/config_default.yaml +0 -0
  565. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/toolbench_static/eval.py +0 -0
  566. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/toolbench_static/infer.py +0 -0
  567. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/toolbench_static/llm/__init__.py +0 -0
  568. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/toolbench_static/llm/swift_infer.py +0 -0
  569. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/toolbench_static/requirements.txt +0 -0
  570. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/third_party/toolbench_static/toolbench_static.py +0 -0
  571. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/utils/__init__.py +0 -0
  572. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/utils/argument_utils.py +0 -0
  573. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/utils/chat_service.py +0 -0
  574. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/utils/deprecation_utils.py +0 -0
  575. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope/utils/url_utils.py +0 -0
  576. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope.egg-info/dependency_links.txt +0 -0
  577. {evalscope-1.0.1 → evalscope-1.1.0}/evalscope.egg-info/entry_points.txt +0 -0
  578. {evalscope-1.0.1 → evalscope-1.1.0}/setup.cfg +0 -0
@@ -0,0 +1,10 @@
+ include README.md
+
+ # Include all resources (code + other files) inside the package
+ recursive-include evalscope *
+
+ # Exclude cache/compiled artifacts
+ global-exclude *.py[cod] __pycache__ *.so *.dylib
+
+ # If there are models/large files, you can add prune/exclude as needed
+ # Example: prune evalscope/models
@@ -1,11 +1,11 @@
  Metadata-Version: 2.1
  Name: evalscope
- Version: 1.0.1
+ Version: 1.1.0
  Summary: EvalScope: Lightweight LLMs Evaluation Framework
- Home-page: https://github.com/modelscope/evalscope
  Author: ModelScope team
  Author-email: contact@modelscope.cn
  License: Apache License 2.0
+ Project-URL: Homepage, https://github.com/modelscope/evalscope
  Keywords: python,llm,evaluation
  Classifier: Development Status :: 4 - Beta
  Classifier: Operating System :: OS Independent
@@ -14,6 +14,7 @@ Classifier: Programming Language :: Python :: 3.9
  Classifier: Programming Language :: Python :: 3.10
  Classifier: Programming Language :: Python :: 3.11
  Classifier: Programming Language :: Python :: 3.12
+ Classifier: License :: OSI Approved :: Apache Software License
  Requires-Python: >=3.9
  Description-Content-Type: text/markdown
  Provides-Extra: opencompass
@@ -145,7 +146,9 @@ Please scan the QR code below to join our community groups:
  > **Version 1.0 Refactoring**
  >
  > Version 1.0 introduces a major overhaul of the evaluation framework, establishing a new, more modular and extensible API layer under `evalscope/api`. Key improvements include standardized data models for benchmarks, samples, and results; a registry-based design for components such as benchmarks and metrics; and a rewritten core evaluator that orchestrates the new architecture. Existing benchmark adapters have been migrated to this API, resulting in cleaner, more consistent, and easier-to-maintain implementations.
-
+ - 🔥 **[2025.10.14]** Added support for OCRBench, OCRBench-v2, DocVQA, InfoVQA, ChartQA, and BLINK multimodal image-text evaluation benchmarks.
+ - 🔥 **[2025.09.22]** Code evaluation benchmarks (HumanEval, LiveCodeBench) now support running in a sandbox environment. To use this feature, please install [ms-enclave](https://github.com/modelscope/ms-enclave) first.
+ - 🔥 **[2025.09.19]** Added support for multimodal image-text evaluation benchmarks including RealWorldQA, AI2D, MMStar, MMBench, and OmniBench, as well as pure text evaluation benchmarks such as Multi-IF, HealthBench, and AMC.
  - 🔥 **[2025.09.05]** Added support for vision-language multimodal model evaluation tasks, such as MathVista and MMMU. For more supported datasets, please [refer to the documentation](https://evalscope.readthedocs.io/zh-cn/latest/get_started/supported_dataset/vlm.html).
  - 🔥 **[2025.09.04]** Added support for image editing task evaluation, including the [GEdit-Bench](https://modelscope.cn/datasets/stepfun-ai/GEdit-Bench) benchmark. For usage instructions, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/user_guides/aigc/image_edit.html).
  - 🔥 **[2025.08.22]** Version 1.0 Refactoring. Break changes, please [refer to](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#switching-to-version-v1-0).
@@ -116,7 +116,9 @@ Please scan the QR code below to join our community groups:
  > **Version 1.0 Refactoring**
  >
  > Version 1.0 introduces a major overhaul of the evaluation framework, establishing a new, more modular and extensible API layer under `evalscope/api`. Key improvements include standardized data models for benchmarks, samples, and results; a registry-based design for components such as benchmarks and metrics; and a rewritten core evaluator that orchestrates the new architecture. Existing benchmark adapters have been migrated to this API, resulting in cleaner, more consistent, and easier-to-maintain implementations.
-
+ - 🔥 **[2025.10.14]** Added support for OCRBench, OCRBench-v2, DocVQA, InfoVQA, ChartQA, and BLINK multimodal image-text evaluation benchmarks.
+ - 🔥 **[2025.09.22]** Code evaluation benchmarks (HumanEval, LiveCodeBench) now support running in a sandbox environment. To use this feature, please install [ms-enclave](https://github.com/modelscope/ms-enclave) first.
+ - 🔥 **[2025.09.19]** Added support for multimodal image-text evaluation benchmarks including RealWorldQA, AI2D, MMStar, MMBench, and OmniBench, as well as pure text evaluation benchmarks such as Multi-IF, HealthBench, and AMC.
  - 🔥 **[2025.09.05]** Added support for vision-language multimodal model evaluation tasks, such as MathVista and MMMU. For more supported datasets, please [refer to the documentation](https://evalscope.readthedocs.io/zh-cn/latest/get_started/supported_dataset/vlm.html).
  - 🔥 **[2025.09.04]** Added support for image editing task evaluation, including the [GEdit-Bench](https://modelscope.cn/datasets/stepfun-ai/GEdit-Bench) benchmark. For usage instructions, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/user_guides/aigc/image_edit.html).
  - 🔥 **[2025.08.22]** Version 1.0 Refactoring. Break changes, please [refer to](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#switching-to-version-v1-0).
@@ -128,6 +128,9 @@ class DefaultDataAdapter(DataAdapter):
  for sample in self.test_dataset[subset]:
  if isinstance(sample.input, str):
  sample.input = self.process_sample_str_input(sample, subset)
+ elif isinstance(sample.input, list):
+ # Handle list[ChatMessage] and add system prompt if needed
+ sample.input = self.process_sample_messages_input(sample, subset)

  def process_sample_str_input(self, sample: Sample, subset: str) -> List[ChatMessage]:
  """
@@ -142,6 +145,15 @@ class DefaultDataAdapter(DataAdapter):
  input_messages.insert(0, ChatMessageSystem(content=self.system_prompt))
  return input_messages

+ def process_sample_messages_input(self, sample: Sample, subset: str) -> List[ChatMessage]:
+ """
+ Normalize a sample's existing List[ChatMessage] input and ensure system prompt is set once.
+ """
+ messages = list(sample.input) # shallow copy to avoid in-place mutations
+ if self.system_prompt and not any(isinstance(m, ChatMessageSystem) for m in messages):
+ messages = [ChatMessageSystem(content=self.system_prompt)] + messages
+ return messages
+
  def process_sample_input(self, sample: Sample, subset: str) -> str:
  """
  Process a single sample's input by applying prompt templates and few-shot formatting.
@@ -642,9 +654,7 @@ class DefaultDataAdapter(DataAdapter):
  """
  pass

- def _on_generate_report(
- self, scores: Dict[str, List[AggScore]], model_name: str, add_aggregation_name: bool = True
- ) -> Report:
+ def _on_generate_report(self, scores: Dict[str, List[AggScore]], model_name: str) -> Report:
  """
  Hook method called during report generation.

@@ -660,7 +670,7 @@ class DefaultDataAdapter(DataAdapter):
  Report: The generated evaluation report
  """
  return ReportGenerator.generate_report(
- score_dict=scores, model_name=model_name, data_adapter=self, add_aggregation_name=add_aggregation_name
+ score_dict=scores, model_name=model_name, data_adapter=self, add_aggregation_name=self.add_aggregation_name
  )

  @override
@@ -682,3 +692,7 @@ class DefaultDataAdapter(DataAdapter):
  report = self._on_generate_report(scores, model_name=model_name)
  self._on_generate_report_end(report, output_dir, **kwargs)
  return report
+
+ def finalize(self, *args, **kwargs):
+ # Finalize the evaluation process
+ self.sandbox_finalize(*args, **kwargs)
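
For orientation, a minimal standalone sketch of the list-input normalization introduced above. The stand-in message classes are illustrative (not the real `evalscope.api.messages` types); only the prepend-once logic mirrors `process_sample_messages_input`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChatMessageSystem:  # illustrative stand-in for evalscope's ChatMessageSystem
    content: str


@dataclass
class ChatMessageUser:  # illustrative stand-in for a user message
    content: str


def ensure_system_prompt(messages: List[object], system_prompt: str) -> List[object]:
    """Prepend the system prompt once, mirroring process_sample_messages_input."""
    messages = list(messages)  # shallow copy so the sample's original list is not mutated
    if system_prompt and not any(isinstance(m, ChatMessageSystem) for m in messages):
        messages = [ChatMessageSystem(content=system_prompt)] + messages
    return messages


print(ensure_system_prompt([ChatMessageUser(content='What is 2 + 2?')], 'You are a helpful assistant.'))
```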
@@ -18,8 +18,11 @@ class MultiChoiceAdapter(DefaultDataAdapter):
  This adapter formats the input for multi-choice questions and handles few-shot examples.
  """

- multiple_correct: bool = False
- """Whether the benchmark allows multiple correct answers."""
+ def __init__(self, **kwargs):
+ super().__init__(**kwargs)
+
+ self.multiple_correct: bool = False
+ """Whether the benchmark allows multiple correct answers."""

  def format_prompt_template(self, sample: Sample) -> str:
  """
@@ -19,6 +19,11 @@ logger = get_logger()
  class Text2ImageAdapter(DefaultDataAdapter):
  """Text to Image Adapter for benchmarks."""

+ def __init__(self, **kwargs):
+ super().__init__(**kwargs)
+
+ self.add_aggregation_name = False # Do not add aggregation name in the report by default
+
  def load_from_disk(self, **kwargs):
  return super().load_from_disk(use_local_loader=True)

@@ -150,7 +155,3 @@ class Text2ImageAdapter(DefaultDataAdapter):
  score.metadata[metric_name] = f'error: {str(e)}'

  return score
-
- def _on_generate_report(self, scores, model_name, add_aggregation_name=True):
- # Don't add aggregation name for needle haystack adapter
- return super()._on_generate_report(scores, model_name, False)
@@ -3,4 +3,6 @@ from .default_data_adapter import DefaultDataAdapter

  class VisionLanguageAdapter(DefaultDataAdapter):
  """Adapter for vision-language benchmarks. e.g., image captioning, visual question answering, etc."""
- pass
+
+ def __init__(self, **kwargs):
+ super().__init__(**kwargs)
@@ -9,7 +9,7 @@ from evalscope.api.dataset import DatasetDict, Sample
  from evalscope.api.evaluator import TaskState
  from evalscope.api.filter import FilterEnsemble, build_filter_ensemble
  from evalscope.api.metric import AggScore, SampleScore
- from evalscope.api.mixin import LLMJudgeMixin
+ from evalscope.api.mixin import LLMJudgeMixin, SandboxMixin
  from evalscope.api.model import Model
  from evalscope.report import Report
  from evalscope.utils.logger import get_logger
@@ -21,7 +21,7 @@ if TYPE_CHECKING:
  logger = get_logger()


- class DataAdapter(LLMJudgeMixin, ABC):
+ class DataAdapter(LLMJudgeMixin, SandboxMixin, ABC):
  """
  Data Adapter for the benchmark.
  """
@@ -43,6 +43,12 @@ class DataAdapter(LLMJudgeMixin, ABC):
  self.save_metadata = True
  """Whether to save metadata in the review result"""

+ self.add_aggregation_name = True
+ """Whether to add aggregation name in the report"""
+
+ self.add_overall_metric = True
+ """Whether to add overall metric in the report"""
+
  self.category_map = {}
  """Category map for the benchmark"""

@@ -86,6 +92,11 @@ class DataAdapter(LLMJudgeMixin, ABC):
  """
  pass

+ @abstractmethod
+ def finalize(self, *args, **kwargs) -> None:
+ """Finalize the evaluation process."""
+ pass
+
  @property
  def name(self) -> str:
  """
@@ -334,6 +345,20 @@ class DataAdapter(LLMJudgeMixin, ABC):
  """
  self._benchmark_meta.shuffle_choices = value

+ @property
+ def review_timeout(self) -> Optional[float]:
+ """
+ Return the timeout for the review process.
+ """
+ return self._benchmark_meta.review_timeout
+
+ @review_timeout.setter
+ def review_timeout(self, value: float):
+ """
+ Set the timeout for the review process.
+ """
+ self._benchmark_meta.review_timeout = value
+
  @contextlib.contextmanager
  def _temporary_attribute(self, attr_name: str, new_value):
  """
@@ -79,6 +79,9 @@ class BenchmarkMeta:
  shuffle_choices: bool = False
  """Whether to shuffle the choices in multiple-choice datasets."""

+ review_timeout: Optional[float] = None
+ """ Timeout for review in seconds."""
+
  extra_params: Dict = field(default_factory=dict)
  """ Additional parameters for the benchmark."""

@@ -54,3 +54,8 @@ class Evaluator(abc.ABC):
  def get_report(self, *args, **kwargs) -> Report:
  """Get the evaluation report."""
  pass
+
+ @abc.abstractmethod
+ def finalize(self, *args, **kwargs) -> None:
+ """Finalize the evaluation process."""
+ pass
@@ -273,3 +273,8 @@ class TaskState:
  def target(self) -> str:
  """The scoring target for this `Sample`."""
  return self._target.text
+
+ @target.setter
+ def target(self, text: str) -> None:
+ """Set the target for review purposes."""
+ self._target = Target(text)
@@ -3,7 +3,7 @@ from pydantic import BaseModel, Field, JsonValue, model_validator
  from typing import Any, Dict, List, Literal, Optional, Type, Union

  from evalscope.api.tool import ToolCall, ToolCallError
- from .content import Content, ContentImage, ContentReasoning, ContentText
+ from .content import Content, ContentAudio, ContentImage, ContentReasoning, ContentText
  from .utils import parse_content_with_reasoning


@@ -225,6 +225,11 @@ def messages_to_markdown(messages: List[ChatMessage], max_length: Optional[int]
  if max_length and len(image_base64) > max_length:
  image_base64 = image_base64[:max_length]
  content_parts.append(f'![image]({image_base64})')
+ elif isinstance(content_item, ContentAudio):
+ audio_base64 = content_item.audio
+ if max_length and len(audio_base64) > max_length:
+ audio_base64 = audio_base64[:max_length]
+ content_parts.append(f"<audio controls src='{audio_base64}'></audio>")
  elif isinstance(content_item, ContentReasoning):
  content_parts.append(f'**Reasoning:** {content_item.reasoning}')

@@ -1 +1,2 @@
  from .llm_judge_mixin import LLMJudgeMixin
+ from .sandbox_mixin import SandboxMixin
@@ -24,6 +24,8 @@ class LLMJudgeMixin:

  self._llm_judge: Optional[LLMJudge] = None

+ super().__init__(task_config=task_config)
+
  @property
  def llm_judge(self) -> Optional[LLMJudge]:
  """Get LLM judge instance with lazy initialization."""
@@ -0,0 +1,204 @@
+ import asyncio
+ import threading
+ from typing import TYPE_CHECKING, Any, Dict, List, Optional
+
+ from evalscope.utils.logger import get_logger
+
+ if TYPE_CHECKING:
+ from ms_enclave.sandbox.manager import SandboxManager
+
+ from evalscope.config import TaskConfig
+
+ logger = get_logger()
+
+
+ class SandboxMixin:
+ """Sandbox mixin for sandboxed code execution."""
+
+ def __init__(self, task_config: 'TaskConfig'):
+ self._task_config = task_config
+
+ self._manager: Optional['SandboxManager'] = None
+ """Sandbox manager instance."""
+
+ self._sandbox_id: Optional[str] = None
+ """Sandbox ID."""
+
+ self._loop: Optional[asyncio.AbstractEventLoop] = None
+ """Event loop for async operations."""
+
+ # Initialize sandbox synchronously by running async methods
+ if self.use_sandbox:
+ self._loop = asyncio.new_event_loop()
+
+ # Start the loop in a separate thread
+ def run_loop():
+ asyncio.set_event_loop(self._loop)
+ self._loop.run_forever()
+
+ self._loop_thread = threading.Thread(target=run_loop, daemon=True)
+ self._loop_thread.start()
+
+ # Wait for initialization
+ future = asyncio.run_coroutine_threadsafe(self._async_init(), self._loop)
+ future.result()
+
+ super().__init__()
+
+ async def _async_init(self):
+ """Async initialization helper."""
+ await self.init_sandbox_manager_async()
+ await self.init_sandbox_async()
+
+ @property
+ def use_sandbox(self) -> bool:
+ """
+ Return whether to use sandbox for the benchmark.
+ """
+ if not self._task_config:
+ return False
+ else:
+ return self._task_config.use_sandbox
+
+ @property
+ def sandbox_manager(self) -> Optional['SandboxManager']:
+ """Get the sandbox manager instance."""
+ return self._manager
+
+ @property
+ def sandbox_id(self) -> Optional[str]:
+ """Get the sandbox ID."""
+ return self._sandbox_id
+
+ async def init_sandbox_manager_async(self) -> Optional['SandboxManager']:
+ """Initialize the sandbox manager asynchronously."""
+ if self._manager is not None:
+ return self._manager
+
+ if not self.use_sandbox:
+ return None
+
+ from ms_enclave.sandbox.manager import HttpSandboxManager, LocalSandboxManager
+
+ manager_config = self._task_config.sandbox_manager_config or {}
+ if manager_config.get('base_url'):
+ # Remote manager
+ self._manager = HttpSandboxManager(**manager_config)
+ else:
+ # Local manager
+ self._manager = LocalSandboxManager(**manager_config)
+
+ await self._manager.start()
+ logger.info('Sandbox manager initialized.')
+ return self._manager
+
+ def init_sandbox_manager(self) -> Optional['SandboxManager']:
+ """Initialize the sandbox manager."""
+ if self._manager is not None:
+ return self._manager
+
+ if not self.use_sandbox:
+ return None
+
+ # Use the dedicated loop if available
+ if self._loop and not self._loop.is_closed():
+ future = asyncio.run_coroutine_threadsafe(self.init_sandbox_manager_async(), self._loop)
+ return future.result()
+ else:
+ # Fallback for cases where no loop is available
+ return asyncio.run(self.init_sandbox_manager_async())
+
+ async def init_sandbox_async(self) -> Optional[str]:
+ """Initialize the sandbox instance asynchronously."""
+ if self._sandbox_id is not None:
+ return self._sandbox_id
+
+ if not self.use_sandbox:
+ return None
+
+ from ms_enclave.sandbox.model import DockerSandboxConfig, SandboxType
+
+ sandbox_config = self._task_config.sandbox_config or DockerSandboxConfig(
+ image='python:3.11-slim', tools_config={
+ 'shell_executor': {},
+ 'python_executor': {}
+ }
+ )
+ sandbox_type = self._task_config.sandbox_type or SandboxType.DOCKER
+
+ self._sandbox_id = await self._manager.create_sandbox(sandbox_type=sandbox_type, config=sandbox_config)
+
+ sandbox_info = await self._manager.get_sandbox_info(self._sandbox_id)
+
+ logger.info(f'Sandbox of type {sandbox_type} initialized. Info: {sandbox_info.model_dump(exclude_none=True)}')
+ return self._sandbox_id
+
+ def init_sandbox(self) -> Optional[str]:
+ """Initialize the sandbox instance."""
+ if self._sandbox_id is not None:
+ return self._sandbox_id
+
+ if not self.use_sandbox:
+ return None
+
+ # Use the dedicated loop if available
+ if self._loop and not self._loop.is_closed():
+ future = asyncio.run_coroutine_threadsafe(self.init_sandbox_async(), self._loop)
+ return future.result()
+ else:
+ # Fallback for cases where no loop is available
+ return asyncio.run(self.init_sandbox_async())
+
+ def execute_code_in_sandbox(self, code: str, timeout: int = 60, language: str = 'python') -> Dict[str, Any]:
+ """Execute code in the sandbox."""
+ if not self._sandbox_id or not self._manager:
+ logger.warning('Sandbox is not initialized.')
+ return {'error': 'Sandbox is not initialized.'}
+
+ from ms_enclave.sandbox.model import ExecutionStatus, ToolResult
+
+ async def _execute_async():
+ if language.lower() == 'python':
+ tool_name = 'python_executor'
+ parameters = {'code': code, 'timeout': timeout}
+ result = await self._manager.execute_tool(self._sandbox_id, tool_name, parameters)
+ elif language.lower() == 'shell':
+ tool_name = 'shell_executor'
+ parameters = {'command': code, 'timeout': timeout}
+ result = await self._manager.execute_tool(self._sandbox_id, tool_name, parameters)
+ else:
+ logger.warning(f"Unsupported language: {language}. Supported languages are 'python' and 'shell'.")
+ result = ToolResult(
+ status=ExecutionStatus.ERROR,
+ tool_name='code_executor',
+ output=f"Unsupported language: {language}. Supported languages are 'python' and 'shell'."
+ )
+ return result
+
+ # Use the dedicated loop if available
+ if self._loop and not self._loop.is_closed():
+ future = asyncio.run_coroutine_threadsafe(_execute_async(), self._loop)
+ result = future.result(timeout + 10) # Add some buffer to the timeout
+ else:
+ # Fallback for cases where no loop is available
+ result = asyncio.run(_execute_async())
+
+ return result.model_dump(exclude_none=True)
+
+ def sandbox_finalize(self, *args, **kwargs):
+ """Finalize the sandbox manager."""
+ if self._manager:
+ try:
+ if self._loop and not self._loop.is_closed():
+ # Stop the manager using the dedicated loop
+ future = asyncio.run_coroutine_threadsafe(self._manager.stop(), self._loop)
+ future.result(timeout=30)
+
+ # Stop the event loop
+ self._loop.call_soon_threadsafe(self._loop.stop)
+ if hasattr(self, '_loop_thread'):
+ self._loop_thread.join(timeout=5)
+
+ logger.info('Sandbox manager finalized.')
+ except Exception as e:
+ logger.warning(f'Error finalizing sandbox manager: {e}')
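
A hedged sketch of how a benchmark adapter might drive this mixin: only `execute_code_in_sandbox` and its return shape (a `ToolResult.model_dump()` dict) are taken from the code above; the helper name and exact result fields are assumptions.

```python
def run_candidate_with_tests(adapter, completion: str, test_code: str) -> bool:
    """Hypothetical helper: run a model completion plus its unit tests via SandboxMixin."""
    program = f'{completion}\n\n{test_code}'
    result = adapter.execute_code_in_sandbox(program, timeout=60, language='python')
    if 'error' in result:
        # Sandbox was never initialized (the mixin returns {'error': ...} in that case).
        return False
    # The dict mirrors ToolResult.model_dump(); ExecutionStatus.ERROR is the failure status
    # shown above, so treat anything that is not an error status as a pass. Field names
    # may differ in ms-enclave and should be checked against that library.
    return 'error' not in str(result.get('status', '')).lower()
```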
@@ -36,9 +36,6 @@ class GenerateConfig(BaseModel):
  stream: Optional[bool] = Field(default=None)
  """Whether to stream the response (default is model specific)."""

- system_message: Optional[str] = Field(default=None)
- """Override the default system message."""
-
  max_tokens: Optional[int] = Field(default=None)
  """The maximum number of tokens that can be generated in the completion (default is model specific)."""

@@ -365,7 +365,7 @@ def get_model(

  logger.info(
  f'Creating model {model} with eval_type={eval_type} '
- f'base_url={base_url}, api_key={api_key}, config={config}, model_args={model_args}'
+ f'base_url={base_url}, config={config.model_dump(exclude_none=True)}, model_args={model_args}'
  )

  # find a matching model type
@@ -1,7 +1,7 @@
  import inspect
  from dataclasses import dataclass
  from docstring_parser import Docstring, parse
- from pydantic import BaseModel, Field
+ from pydantic import BaseModel, Field, field_validator
  from typing import Any, Callable, Dict, List, Literal, Optional, TypeAlias, Union, get_args, get_type_hints

  from evalscope.utils.json_schema import JSONSchema, JSONType, json_schema, python_type_to_json_type
@@ -204,7 +204,12 @@ def create_multi_model_tab(sidebar: 'SidebarComponents', lang: str):
  data_score_df_b, _ = get_single_dataset_df(report_df_b, dataset_name)

  # Get subset choices - should be same for both models
- subsets = data_score_df_a[ReportKey.subset_name].unique().tolist()
+ # Only select the subsets that Cat.0 is not '-'
+ df_for_subsets = data_score_df_a.copy()
+ subsets = sorted(
+ df_for_subsets.loc[df_for_subsets[f'{ReportKey.category_prefix}0'].ne('-'),
+ ReportKey.subset_name].dropna().unique().tolist()
+ )

  return gr.update(choices=subsets, value=None), None

@@ -134,11 +134,17 @@ def create_single_model_tab(sidebar: 'SidebarComponents', lang: str):
  )
  def update_single_report_dataset(dataset_name, report_list):
  logger.debug(f'Updating single report dataset: {dataset_name}')
- report_df = get_data_frame(report_list=report_list)
+ report_df = get_data_frame(report_list=report_list, flatten_metrics=True, flatten_categories=True)
  analysis = get_report_analysis(report_list, dataset_name)
  data_score_df, styler = get_single_dataset_df(report_df, dataset_name)
  data_score_plot = plot_single_dataset_scores(data_score_df)
- subsets = data_score_df[ReportKey.subset_name].unique().tolist()
+ # Only select the subsets that Cat.0 is not '-'
+ df_for_subsets = data_score_df.copy()
+ subsets = sorted(
+ df_for_subsets.loc[df_for_subsets[f'{ReportKey.category_prefix}0'].ne('-'),
+ ReportKey.subset_name].dropna().unique().tolist()
+ )
+
  logger.debug(f'subsets: {subsets}')
  return data_score_plot, styler, gr.update(choices=subsets, value=None), None, analysis
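
The subset-filtering idiom added in the two UI hunks above (keep only rows whose top-level category is set), shown in isolation with made-up column names in place of the `ReportKey` constants:

```python
import pandas as pd

# 'Cat.0' and 'Subset' stand in for f'{ReportKey.category_prefix}0' and ReportKey.subset_name.
df = pd.DataFrame({
    'Cat.0': ['math', '-', 'code', '-'],
    'Subset': ['gsm8k', 'overall', 'humaneval', None],
})

# Keep rows with a real category, drop missing subset names, de-duplicate, and sort
# so the dropdown choices stay stable.
subsets = sorted(df.loc[df['Cat.0'].ne('-'), 'Subset'].dropna().unique().tolist())
print(subsets)  # ['gsm8k', 'humaneval']
```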
@@ -168,9 +168,10 @@ def get_model_prediction(work_dir: str, model_name: str, dataset_name: str, subs
  'Index': str(review_result.index),
  'Input': review_result.input.replace('\n', '\n\n'), # for markdown
  'Metadata': metadata,
- 'Generated': prediction,
+ 'Generated': prediction or '', # Ensure no None value
  'Gold': target,
- 'Pred': extracted_prediction if extracted_prediction != prediction else '*Same as Generated*',
+ 'Pred': (extracted_prediction if extracted_prediction != prediction else '*Same as Generated*')
+ or '', # Ensure no None value
  'Score': score.model_dump(exclude_none=True),
  'NScore': normalize_score(score.main_value)
  }
@@ -18,7 +18,7 @@ logger = get_logger()
  def plot_single_report_scores(df: pd.DataFrame):
  if df is None:
  return None
- logger.debug(f'df: {df}')
+ logger.debug(f'df: \n{df}')
  plot = px.bar(df, x=df[ReportKey.dataset_name], y=df[ReportKey.score], text=df[ReportKey.score])

  width = DEFAULT_BAR_WIDTH if len(df[ReportKey.dataset_name]) <= 5 else None
@@ -36,7 +36,7 @@ def plot_single_report_sunburst(report_list: List[Report]):
  df = get_data_frame(report_list=report_list, flatten_metrics=False)
  categories = sorted([i for i in df.columns if i.startswith(ReportKey.category_prefix)])
  path = [ReportKey.dataset_name] + categories + [ReportKey.subset_name]
- logger.debug(f'df: {df}')
+ logger.debug(f'df: \n{df}')
  df[categories] = df[categories].fillna('default') # NOTE: fillna for empty categories

  plot = px.sunburst(
@@ -87,6 +87,12 @@ def add_argument(parser: argparse.ArgumentParser):
  parser.add_argument('--judge-model-args', type=json.loads, default='{}', help='The judge model args, should be a json string.') # noqa: E501
  parser.add_argument('--judge-worker-num', type=int, default=1, help='The number of workers for the judge model.')
  parser.add_argument('--analysis-report', action='store_true', default=False, help='Generate analysis report for the evaluation results using judge model.') # noqa: E501
+
+ # Sandbox-related arguments
+ parser.add_argument('--use-sandbox', action='store_true', default=False, help='Whether to use sandbox for model evaluation.') # noqa: E501
+ parser.add_argument('--sandbox-type', type=str, default='docker', help='The sandbox type to use.') # noqa: E501
+ parser.add_argument('--sandbox-config', type=json.loads, default='{}', help='The sandbox config, should be a json string.') # noqa: E501
+ parser.add_argument('--sandbox-manager-config', type=json.loads, default='{}', help='The sandbox manager config, should be a json string.') # noqa: E501
  # yapf: enable

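
These new CLI flags map onto the `TaskConfig` fields read by `SandboxMixin` (`use_sandbox`, `sandbox_type`, `sandbox_config`, `sandbox_manager_config`). A rough Python-side equivalent, assuming evalscope's usual `TaskConfig` entry point and with the model id and dataset chosen purely for illustration:

```python
from evalscope.config import TaskConfig

# Sketch only: field names follow the new flags above; model/datasets values are illustrative,
# and the exact TaskConfig signature should be checked against the evalscope documentation.
task = TaskConfig(
    model='Qwen/Qwen2.5-Coder-7B-Instruct',
    datasets=['humaneval'],
    use_sandbox=True,                              # --use-sandbox
    sandbox_type='docker',                         # --sandbox-type
    sandbox_config={'image': 'python:3.11-slim'},  # --sandbox-config '{"image": "python:3.11-slim"}'
)
# The config would then be passed to the normal evaluation entry point (e.g. run_task).
```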