PyPI - opencompass - Versions diffs - 0.2.4__tar.gz → 0.2.5__tar.gz - Mend

opencompass 0.2.4__tar.gz → 0.2.5__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (414)

{opencompass-0.2.4 → opencompass-0.2.5}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: opencompass
-Version: 0.2.4
+Version: 0.2.5
 Summary: A comprehensive toolkit for large model evaluation
 Home-page: https://github.com/open-compass/opencompass
 Author: OpenCompass Contributors
@@ -78,6 +78,11 @@ Description: <div align="center">
         ## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
+        - **\[2024.05.08\]** We supported the evaluation of 4 MoE models: [Mixtral-8x22B-v0.1](configs/models/mixtral/hf_mixtral_8x22b_v0_1.py), [Mixtral-8x22B-Instruct-v0.1](configs/models/mixtral/hf_mixtral_8x22b_instruct_v0_1.py), [Qwen1.5-MoE-A2.7B](configs/models/qwen/hf_qwen1_5_moe_a2_7b.py), [Qwen1.5-MoE-A2.7B-Chat](configs/models/qwen/hf_qwen1_5_moe_a2_7b_chat.py). Try them out now!
+        - **\[2024.04.30\]** We supported evaluating a model's compression efficiency by calculating its Bits per Character (BPC) metric on an [external corpora](configs/datasets/llm_compression/README.md) ([official paper](https://github.com/hkust-nlp/llm-compression-intelligence)). Check out the [llm-compression](configs/eval_llm_compression.py) evaluation config now! 🔥🔥🔥
+        - **\[2024.04.29\]** We report the performance of several famous LLMs on the common benchmarks, welcome to [documentation](https://opencompass.readthedocs.io/en/latest/user_guides/corebench.html) for more information! 🔥🔥🔥.
+        - **\[2024.04.26\]** We deprecated the multi-modality evaluation function in OpenCompass; the related implementation has moved to [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). You are welcome to use it! 🔥🔥🔥
+        - **\[2024.04.26\]** We supported the evaluation of [ArenaHard](configs/eval_subjective_arena_hard.py)  welcome to try!🔥🔥🔥.
         - **\[2024.04.22\]** We supported the evaluation of [LLaMA3](configs/models/hf_llama/hf_llama3_8b.py) and [LLaMA3-Instruct](configs/models/hf_llama/hf_llama3_8b_instruct.py), welcome to try! 🔥🔥🔥
         - **\[2024.02.29\]** We supported the MT-Bench, AlpacalEval and AlignBench, more information can be found [here](https://opencompass.readthedocs.io/en/latest/advanced_guides/subjective_evaluation.html)
         - **\[2024.01.30\]** We release OpenCompass 2.0. Click  [CompassKit](https://github.com/open-compass), [CompassHub](https://hub.opencompass.org.cn/home), and [CompassRank](https://rank.opencompass.org.cn/home) for more information !
@@ -130,7 +135,7 @@ Description: <div align="center">
         git clone https://github.com/open-compass/opencompass opencompass
         cd opencompass
         pip install -e .
-        # also please install requiresments packages via `pip install -r requirements/api.txt` for API models if needed.
+        # also please install requirements packages via `pip install -r requirements/api.txt` for API models if needed.
         ```
         ### 📂 Data Preparation
@@ -165,19 +170,13 @@ Description: <div align="center">
         You can also evaluate other HuggingFace models via command line. Taking LLaMA-7b as an example:
         ```bash
-        python run.py --datasets ceval_ppl mmlu_ppl \
-        --hf-path huggyllama/llama-7b \  # HuggingFace model path
-        --model-kwargs device_map='auto' \  # Arguments for model construction
-        --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \  # Arguments for tokenizer construction
-        --max-out-len 100 \  # Maximum number of tokens generated
-        --max-seq-len 2048 \  # Maximum sequence length the model can accept
-        --batch-size 8 \  # Batch size
-        --no-batch-padding \  # Don't enable batch padding, infer through for loop to avoid performance loss
-        --num-gpus 1  # Number of minimum required GPUs
+        python run.py --datasets ceval_ppl mmlu_ppl --hf-type base --hf-path huggyllama/llama-7b
         ```
-        > **Note**<br />
-        > To run the command above, you will need to remove the comments starting from `# ` first.
+        > \[!TIP\]
+        >
+        > configuration with `_ppl` is designed for base model typically.
+        > configuration with `_gen` can be used for both base model and chat model.
         Through the command line or configuration files, OpenCompass also supports evaluating APIs or custom models, as well as more diversified evaluation strategies. Please read the [Quick Start](https://opencompass.readthedocs.io/en/latest/get_started/quick_start.html) to learn how to run an evaluation task.

{opencompass-0.2.4 → opencompass-0.2.5}/README.md RENAMED

@@ -70,6 +70,11 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
 ## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
+- **\[2024.05.08\]** We supported the evaluation of 4 MoE models: [Mixtral-8x22B-v0.1](configs/models/mixtral/hf_mixtral_8x22b_v0_1.py), [Mixtral-8x22B-Instruct-v0.1](configs/models/mixtral/hf_mixtral_8x22b_instruct_v0_1.py), [Qwen1.5-MoE-A2.7B](configs/models/qwen/hf_qwen1_5_moe_a2_7b.py), [Qwen1.5-MoE-A2.7B-Chat](configs/models/qwen/hf_qwen1_5_moe_a2_7b_chat.py). Try them out now!
+- **\[2024.04.30\]** We supported evaluating a model's compression efficiency by calculating its Bits per Character (BPC) metric on an [external corpora](configs/datasets/llm_compression/README.md) ([official paper](https://github.com/hkust-nlp/llm-compression-intelligence)). Check out the [llm-compression](configs/eval_llm_compression.py) evaluation config now! 🔥🔥🔥
+- **\[2024.04.29\]** We report the performance of several famous LLMs on the common benchmarks, welcome to [documentation](https://opencompass.readthedocs.io/en/latest/user_guides/corebench.html) for more information! 🔥🔥🔥.
+- **\[2024.04.26\]** We deprecated the multi-modality evaluation function in OpenCompass; the related implementation has moved to [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). You are welcome to use it! 🔥🔥🔥
+- **\[2024.04.26\]** We supported the evaluation of [ArenaHard](configs/eval_subjective_arena_hard.py)  welcome to try!🔥🔥🔥.
 - **\[2024.04.22\]** We supported the evaluation of [LLaMA3](configs/models/hf_llama/hf_llama3_8b.py) and [LLaMA3-Instruct](configs/models/hf_llama/hf_llama3_8b_instruct.py), welcome to try! 🔥🔥🔥
 - **\[2024.02.29\]** We supported the MT-Bench, AlpacalEval and AlignBench, more information can be found [here](https://opencompass.readthedocs.io/en/latest/advanced_guides/subjective_evaluation.html)
 - **\[2024.01.30\]** We release OpenCompass 2.0. Click  [CompassKit](https://github.com/open-compass), [CompassHub](https://hub.opencompass.org.cn/home), and [CompassRank](https://rank.opencompass.org.cn/home) for more information !
@@ -122,7 +127,7 @@ conda activate opencompass
 git clone https://github.com/open-compass/opencompass opencompass
 cd opencompass
 pip install -e .
-# also please install requiresments packages via `pip install -r requirements/api.txt` for API models if needed.
+# also please install requirements packages via `pip install -r requirements/api.txt` for API models if needed.
 ```
 ### 📂 Data Preparation
@@ -157,19 +162,13 @@ python tools/list_configs.py llama mmlu
 You can also evaluate other HuggingFace models via command line. Taking LLaMA-7b as an example:
 ```bash
-python run.py --datasets ceval_ppl mmlu_ppl \
---hf-path huggyllama/llama-7b \  # HuggingFace model path
---model-kwargs device_map='auto' \  # Arguments for model construction
---tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \  # Arguments for tokenizer construction
---max-out-len 100 \  # Maximum number of tokens generated
---max-seq-len 2048 \  # Maximum sequence length the model can accept
---batch-size 8 \  # Batch size
---no-batch-padding \  # Don't enable batch padding, infer through for loop to avoid performance loss
---num-gpus 1  # Number of minimum required GPUs
+python run.py --datasets ceval_ppl mmlu_ppl --hf-type base --hf-path huggyllama/llama-7b
 ```
-> **Note**<br />
-> To run the command above, you will need to remove the comments starting from `# ` first.
+> \[!TIP\]
+>
+> configuration with `_ppl` is designed for base model typically.
+> configuration with `_gen` can be used for both base model and chat model.
 Through the command line or configuration files, OpenCompass also supports evaluating APIs or custom models, as well as more diversified evaluation strategies. Please read the [Quick Start](https://opencompass.readthedocs.io/en/latest/get_started/quick_start.html) to learn how to run an evaluation task.

opencompass-0.2.5/opencompass/__init__.py ADDED

@@ -0,0 +1 @@
+__version__ = '0.2.5'

{opencompass-0.2.4 → opencompass-0.2.5}/opencompass/datasets/GaokaoBench.py RENAMED

@@ -91,34 +91,51 @@ class GaokaoBenchEvaluator(BaseEvaluator):
         ]:
             return {'score': 0}
         elif self.question_type == 'multi_choice':
+            details = {}
             correct_score, total_score = 0, 0
-            for pred, refr in zip(predictions, references):
+            for index, (pred, refr) in enumerate(zip(predictions, references)):
                 pred = self.do_predictions_postprocess(pred)
                 pred = self.ensure_same_length(pred, refr)
+                is_corrects = []
                 for p, r in zip(pred, refr):
                     if p == r:
                         correct_score += 2
+                        is_corrects.append(True)
                     else:
                         for i in p:
                             if i not in r:
                                 break
                         else:
                             correct_score += 1
+                        is_corrects.append(False)
                     total_score += 2
-            return {'score': correct_score / total_score * 100}
+                details[str(index)] = {
+                    'pred': pred,
+                    'refr': refr,
+                    'is_correct': all(is_corrects),
+                }
         else:
+            details = {}
             correct_score, total_score = 0, 0
-            for pred, refr in zip(predictions, references):
+            for index, (pred, refr) in enumerate(zip(predictions, references)):
                 if self.question_type == 'multi_question_choice':
                     pred = self.do_predictions_postprocess(pred, len(refr))
                 else:
                     pred = self.do_predictions_postprocess(pred)
                 pred = self.ensure_same_length(pred, refr)
+                is_corrects = []
                 for p, r in zip(pred, refr):
-                    if p == r:
-                        correct_score += 1
+                    is_correct = p == r
+                    correct_score += is_correct
                     total_score += 1
-            return {'score': correct_score / total_score * 100}
+                    is_corrects.append(is_correct)
+                details[str(index)] = {
+                    'pred': pred,
+                    'refr': refr,
+                    'is_correct': all(is_corrects),
+                }
+        return {'score': correct_score / total_score * 100, 'details': details}
 for question_type in valid_gaokao_bench_question_types:
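
The `multi_choice` branch in the hunk above awards partial credit and, as of 0.2.5, records per-sample details. A minimal standalone sketch of that scoring rule follows. It assumes predictions have already been post-processed and length-aligned, and the function name `score_multi_choice` is hypothetical, not part of OpenCompass.

```python
def score_multi_choice(predictions, references):
    """Partial-credit scoring in the style of the multi_choice branch.

    Each element of predictions/references is a list of answer strings,
    one string per sub-question (e.g. 'AB' for options A and B).
    """
    details = {}
    correct_score, total_score = 0, 0
    for index, (pred, refr) in enumerate(zip(predictions, references)):
        is_corrects = []
        for p, r in zip(pred, refr):
            if p == r:
                correct_score += 2  # exact match: full credit
                is_corrects.append(True)
            else:
                # Half credit when every chosen option appears in the
                # reference (equivalent to the for/else loop in the diff).
                if all(option in r for option in p):
                    correct_score += 1
                is_corrects.append(False)
            total_score += 2
        details[str(index)] = {
            'pred': pred,
            'refr': refr,
            'is_correct': all(is_corrects),
        }
    return {'score': correct_score / total_score * 100, 'details': details}


result = score_multi_choice(
    predictions=[['AB'], ['A'], ['C']],
    references=[['AB'], ['AB'], ['AB']],
)
print(result['score'])  # 50.0 (2 + 1 + 0 points out of 6)
```

Note that a sample with partial credit still has `is_correct` set to `False`, so the `details` entries are stricter than the aggregate score.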

opencompass-0.2.5/opencompass/datasets/MMLUArabic.py ADDED

@@ -0,0 +1,33 @@
+import csv
+import os.path as osp
+from datasets import Dataset, DatasetDict
+from opencompass.registry import LOAD_DATASET
+from .base import BaseDataset
+@LOAD_DATASET.register_module()
+class MMLUArabicDataset(BaseDataset):
+    @staticmethod
+    def load(path: str, name: str):
+        dataset = DatasetDict()
+        for split in ['dev', 'test']:
+            raw_data = []
+            filename = osp.join(path, split, f'{name}_{split}.csv')
+            with open(filename, encoding='utf-8') as f:
+                reader = csv.reader(f)
+                for row in reader:
+                    assert len(row) == 6
+                    raw_data.append({
+                        'input': row[0],
+                        'A': row[1],
+                        'B': row[2],
+                        'C': row[3],
+                        'D': row[4],
+                        'target': row[5],
+                    })
+            dataset[split] = Dataset.from_list(raw_data)
+        return dataset
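
The loader above expects files at `<path>/<split>/<name>_<split>.csv`, each row holding exactly six columns: question, four options, and the target letter. The sketch below illustrates that layout with the standard library only. The helper `read_rows` and the subject name `demo` are hypothetical, and the `datasets` dependency is left out on purpose to keep the example self-contained.

```python
import csv
import os
import os.path as osp
import tempfile

FIELDS = ['input', 'A', 'B', 'C', 'D', 'target']


def read_rows(path, name, split):
    """Read one split using the same six-column layout as the loader."""
    filename = osp.join(path, split, f'{name}_{split}.csv')
    rows = []
    with open(filename, encoding='utf-8', newline='') as f:
        for row in csv.reader(f):
            assert len(row) == 6  # question, four options, target
            rows.append(dict(zip(FIELDS, row)))
    return rows


# Build a tiny directory tree that matches the expected naming scheme.
root = tempfile.mkdtemp()
for split in ['dev', 'test']:
    os.makedirs(osp.join(root, split))
    csv_path = osp.join(root, split, f'demo_{split}.csv')
    with open(csv_path, 'w', encoding='utf-8', newline='') as f:
        csv.writer(f).writerow(['2 + 2 = ?', '3', '4', '5', '6', 'B'])

rows = read_rows(root, 'demo', 'test')
print(rows[0]['target'])  # B
```

Because the loader asserts `len(row) == 6`, a CSV with a header row of a different width or a trailing extra column would fail at load time.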

{opencompass-0.2.4 → opencompass-0.2.5}/opencompass/datasets/__init__.py RENAMED

@@ -12,6 +12,7 @@ from .bustum import *  # noqa: F401, F403
 from .c3 import *  # noqa: F401, F403
 from .cb import *  # noqa: F401, F403
 from .ceval import *  # noqa: F401, F403
+from .charm import *  # noqa: F401, F403
 from .chembench import *  # noqa: F401, F403
 from .chid import *  # noqa: F401, F403
 from .cibench import *  # noqa: F401, F403
@@ -33,10 +34,12 @@ from .custom import *  # noqa: F401, F403
 from .cvalues import *  # noqa: F401, F403
 from .drcd import *  # noqa: F401, F403
 from .drop import *  # noqa: F401, F403
+from .drop_simple_eval import *  # noqa: F401, F403
 from .ds1000 import *  # noqa: F401, F403
 from .ds1000_interpreter import *  # noqa: F401, F403
 from .eprstmt import *  # noqa: F401, F403
 from .FinanceIQ import *  # noqa: F401, F403
+from .flames import *  # noqa: F401, F403
 from .flores import *  # noqa: F401, F403
 from .game24 import *  # noqa: F401, F403
 from .GaokaoBench import *  # noqa: F401, F403
@@ -59,6 +62,7 @@ from .lambada import *  # noqa: F401, F403
 from .lawbench import *  # noqa: F401, F403
 from .lcsts import *  # noqa: F401, F403
 from .leval import *  # noqa: F401, F403
+from .llm_compression import LLMCompressionDataset  # noqa: F401, F403
 from .longbench import *  # noqa: F401, F403
 from .lveval import *  # noqa: F401, F403
 from .mastermath2024v1 import *  # noqa: F401, F403
@@ -68,7 +72,9 @@ from .math_intern import *  # noqa: F401, F403
 from .mathbench import *  # noqa: F401, F403
 from .mbpp import *  # noqa: F401, F403
 from .medbench import *  # noqa: F401, F403
+from .mgsm import *  # noqa: F401, F403
 from .mmlu import *  # noqa: F401, F403
+from .MMLUArabic import *  # noqa: F401, F403
 from .multirc import *  # noqa: F401, F403
 from .narrativeqa import *  # noqa: F401, F403
 from .natural_question import *  # noqa: F401, F403

opencompass-0.2.5/opencompass/datasets/charm.py ADDED

@@ -0,0 +1,55 @@
+import json
+import os.path as osp
+import re
+from datasets import Dataset
+from opencompass.openicl.icl_evaluator import BaseEvaluator
+from opencompass.registry import (ICL_EVALUATORS, LOAD_DATASET,
+                                  TEXT_POSTPROCESSORS)
+from .base import BaseDataset
+@TEXT_POSTPROCESSORS.register_module('charm-reason')
+def charm_reason_postprocess(text: str) -> str:
+    ans = text
+    ans_line = ans.split('answer is ')
+    if len(ans_line) != 1:
+        ans = ans_line[1].strip()
+    match = re.search(r'\(([A-Z])\)*', ans)
+    if match:
+        return match.group(1)
+    match = re.search(r'([A-Z])', ans)
+    if match:
+        return match.group(1)
+    return ans
+@ICL_EVALUATORS.register_module()
+class CharmReasonEvaluator(BaseEvaluator):
+    def score(self, predictions, references):
+        if len(predictions) != len(references):
+            return {'error': 'preds and refrs have different length'}
+        details = []
+        cnt = 0
+        for pred, ref in zip(predictions, references):
+            detail = {'pred': pred, 'answer': ref, 'correct': False}
+            if pred == ref:
+                cnt += 1
+                detail['correct'] = True
+            details.append(detail)
+        score = cnt / len(predictions) * 100
+        return {'score': score, 'details': details}
+@LOAD_DATASET.register_module()
+class CharmDataset(BaseDataset):
+    @staticmethod
+    def load(path: str, name: str):
+        with open(osp.join(path, f'{name}.json'), 'r') as f:
+            data = json.load(f)['examples']
+        dataset = Dataset.from_list(data)
+        return dataset
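
To see how the post-processor and evaluator above behave together, the sketch below copies the extraction logic without the registry decorators and runs it on a few made-up model outputs. The sample strings and the helper name `exact_match_score` are illustrative assumptions.

```python
import re


def charm_reason_postprocess(text: str) -> str:
    # Same logic as the function in the diff, minus the registry decorator.
    ans = text
    ans_line = ans.split('answer is ')
    if len(ans_line) != 1:
        ans = ans_line[1].strip()
    match = re.search(r'\(([A-Z])\)*', ans)  # prefer a parenthesised letter
    if match:
        return match.group(1)
    match = re.search(r'([A-Z])', ans)  # fall back to first capital letter
    if match:
        return match.group(1)
    return ans


def exact_match_score(predictions, references):
    # Exact-match accuracy in percent, mirroring CharmReasonEvaluator.score.
    correct = sum(pred == ref for pred, ref in zip(predictions, references))
    return correct / len(predictions) * 100


samples = [
    'So the answer is (B).',
    'The answer is C',
    'no idea',
    'I think it is b',
]
extracted = [charm_reason_postprocess(s) for s in samples]
print(extracted)  # ['B', 'C', 'no idea', 'I']
print(exact_match_score(extracted, ['B', 'C', 'A', 'B']))  # 50.0
```

The last sample shows a limitation of the fallback pattern: when no "answer is" phrase or parenthesised option is present, the first capital letter in the text is returned, here the pronoun "I".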
