PyPI - opencompass - Versions diffs - 0.2.3__tar.gz → 0.2.5__tar.gz - Mend

opencompass 0.2.3__tar.gz → 0.2.5__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (419)

{opencompass-0.2.3 → opencompass-0.2.5}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: opencompass
-Version: 0.2.3
+Version: 0.2.5
 Summary: A comprehensive toolkit for large model evaluation
 Home-page: https://github.com/open-compass/opencompass
 Author: OpenCompass Contributors
@@ -11,8 +11,13 @@ Description: <div align="center">
           <br />
           <br />
-        [![docs](https://readthedocs.org/projects/opencompass/badge)](https://opencompass.readthedocs.io/en)
-        [![license](https://img.shields.io/github/license/InternLM/opencompass.svg)](https://github.com/open-compass/opencompass/blob/main/LICENSE)
+        [![][github-release-shield]][github-release-link]
+        [![][github-releasedate-shield]][github-releasedate-link]
+        [![][github-contributors-shield]][github-contributors-link]<br>
+        [![][github-forks-shield]][github-forks-link]
+        [![][github-stars-shield]][github-stars-link]
+        [![][github-issues-shield]][github-issues-link]
+        [![][github-license-shield]][github-license-link]
         <!-- [![PyPI](https://badge.fury.io/py/opencompass.svg)](https://pypi.org/project/opencompass/) -->
@@ -25,12 +30,18 @@ Description: <div align="center">
         English | [简体中文](README_zh-CN.md)
+        [![][github-trending-shield]][github-trending-url]
         </div>
         <p align="center">
             👋 join us on <a href="https://discord.gg/KKwfEbFj7U" target="_blank">Discord</a> and <a href="https://r.vansin.top/?r=opencompass" target="_blank">WeChat</a>
         </p>
+        > \[!IMPORTANT\]
+        >
+        > **Star Us**, You will receive all release notifications from GitHub without any delay ~ ⭐️
         ## 📣 OpenCompass 2.0
         We are thrilled to introduce OpenCompass 2.0, an advanced suite featuring three key components: [CompassKit](https://github.com/open-compass), [CompassHub](https://hub.opencompass.org.cn/home), and [CompassRank](https://rank.opencompass.org.cn/home).
@@ -42,6 +53,14 @@ Description: <div align="center">
         **CompassKit** is a powerful collection of evaluation toolkits specifically tailored for Large Language Models and Large Vision-language Models. It provides an extensive set of tools to assess and measure the performance of these complex models effectively. Welcome to try our toolkits for in your research and products.
+        <details>
+          <summary><kbd>Star History</kbd></summary>
+          <picture>
+            <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=open-compass%2Fopencompass&theme=dark&type=Date">
+            <img width="100%" src="https://api.star-history.com/svg?repos=open-compass%2Fopencompass&type=Date">
+          </picture>
+        </details>
         ## 🧭	Welcome
         to **OpenCompass**!
@@ -59,12 +78,14 @@ Description: <div align="center">
         ## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
-        - **\[2024.02.29\]** We supported the MT-Bench, AlpacalEval and AlignBench, more information can be found [here](https://opencompass.readthedocs.io/en/latest/advanced_guides/subjective_evaluation.html) 🔥🔥🔥.
-        - **\[2024.01.30\]** We release OpenCompass 2.0. Click  [CompassKit](https://github.com/open-compass), [CompassHub](https://hub.opencompass.org.cn/home), and [CompassRank](https://rank.opencompass.org.cn/home) for more information ! 🔥🔥🔥.
-        - **\[2024.01.17\]** We supported the evaluation of [InternLM2](https://github.com/open-compass/opencompass/blob/main/configs/eval_internlm2_keyset.py) and [InternLM2-Chat](https://github.com/open-compass/opencompass/blob/main/configs/eval_internlm2_chat_keyset.py), InternLM2 showed extremely strong performance in these tests, welcome to try! 🔥🔥🔥.
-        - **\[2024.01.17\]** We supported the needle in a haystack test with multiple needles, more information can be found [here](https://opencompass.readthedocs.io/en/latest/advanced_guides/needleinahaystack_eval.html#id8) 🔥🔥🔥.
-        - **\[2023.12.28\]** We have enabled seamless evaluation of all models developed using [LLaMA2-Accessory](https://github.com/Alpha-VLLM/LLaMA2-Accessory), a powerful toolkit for comprehensive LLM development.
-        - **\[2023.12.22\]** We have released [T-Eval](https://github.com/open-compass/T-Eval), a step-by-step evaluation benchmark to gauge your LLMs on tool utilization. Welcome to our [Leaderboard](https://open-compass.github.io/T-Eval/leaderboard.html) for more details!
+        - **\[2024.05.08\]** We supported the evaluation of 4 MoE models: [Mixtral-8x22B-v0.1](configs/models/mixtral/hf_mixtral_8x22b_v0_1.py), [Mixtral-8x22B-Instruct-v0.1](configs/models/mixtral/hf_mixtral_8x22b_instruct_v0_1.py), [Qwen1.5-MoE-A2.7B](configs/models/qwen/hf_qwen1_5_moe_a2_7b.py), [Qwen1.5-MoE-A2.7B-Chat](configs/models/qwen/hf_qwen1_5_moe_a2_7b_chat.py). Try them out now!
+        - **\[2024.04.30\]** We supported evaluating a model's compression efficiency by calculating its Bits per Character (BPC) metric on an [external corpora](configs/datasets/llm_compression/README.md) ([official paper](https://github.com/hkust-nlp/llm-compression-intelligence)). Check out the [llm-compression](configs/eval_llm_compression.py) evaluation config now! 🔥🔥🔥
+        - **\[2024.04.29\]** We report the performance of several famous LLMs on the common benchmarks, welcome to [documentation](https://opencompass.readthedocs.io/en/latest/user_guides/corebench.html) for more information! 🔥🔥🔥.
+        - **\[2024.04.26\]** We deprecated the multi-madality evaluating function from OpenCompass, related implement has moved to [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), welcome to use! 🔥🔥🔥.
+        - **\[2024.04.26\]** We supported the evaluation of [ArenaHard](configs/eval_subjective_arena_hard.py)  welcome to try!🔥🔥🔥.
+        - **\[2024.04.22\]** We supported the evaluation of [LLaMA3](configs/models/hf_llama/hf_llama3_8b.py) 和 [LLaMA3-Instruct](configs/models/hf_llama/hf_llama3_8b_instruct.py), welcome to try! 🔥🔥🔥
+        - **\[2024.02.29\]** We supported the MT-Bench, AlpacalEval and AlignBench, more information can be found [here](https://opencompass.readthedocs.io/en/latest/advanced_guides/subjective_evaluation.html)
+        - **\[2024.01.30\]** We release OpenCompass 2.0. Click  [CompassKit](https://github.com/open-compass), [CompassHub](https://hub.opencompass.org.cn/home), and [CompassRank](https://rank.opencompass.org.cn/home) for more information !
         > [More](docs/en/notes/news.md)
@@ -114,7 +135,7 @@ Description: <div align="center">
         git clone https://github.com/open-compass/opencompass opencompass
         cd opencompass
         pip install -e .
-        # also please install requiresments packages via `pip install -r requirements/api.txt` for API models if needed.
+        # also please install requirements packages via `pip install -r requirements/api.txt` for API models if needed.
         ```
         ### 📂 Data Preparation
@@ -149,19 +170,13 @@ Description: <div align="center">
         You can also evaluate other HuggingFace models via command line. Taking LLaMA-7b as an example:
         ```bash
-        python run.py --datasets ceval_ppl mmlu_ppl \
-        --hf-path huggyllama/llama-7b \  # HuggingFace model path
-        --model-kwargs device_map='auto' \  # Arguments for model construction
-        --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \  # Arguments for tokenizer construction
-        --max-out-len 100 \  # Maximum number of tokens generated
-        --max-seq-len 2048 \  # Maximum sequence length the model can accept
-        --batch-size 8 \  # Batch size
-        --no-batch-padding \  # Don't enable batch padding, infer through for loop to avoid performance loss
-        --num-gpus 1  # Number of minimum required GPUs
+        python run.py --datasets ceval_ppl mmlu_ppl --hf-type base --hf-path huggyllama/llama-7b
         ```
-        > **Note**<br />
-        > To run the command above, you will need to remove the comments starting from `# ` first.
+        > \[!TIP\]
+        >
+        > configuration with `_ppl` is designed for base model typically.
+        > configuration with `_gen` can be used for both base model and chat model.
         Through the command line or configuration files, OpenCompass also supports evaluating APIs or custom models, as well as more diversified evaluation strategies. Please read the [Quick Start](https://opencompass.readthedocs.io/en/latest/get_started/quick_start.html) to learn how to run an evaluation task.
@@ -447,6 +462,7 @@ Description: <div align="center">
         - [InternLM](https://github.com/InternLM/InternLM)
         - [LLaMA](https://github.com/facebookresearch/llama)
+        - [LLaMA3](https://github.com/meta-llama/llama3)
         - [Vicuna](https://github.com/lm-sys/FastChat)
         - [Alpaca](https://github.com/tatsu-lab/stanford_alpaca)
         - [Baichuan](https://github.com/baichuan-inc)
@@ -505,6 +521,20 @@ Description: <div align="center">
         We appreciate all contributions to improving OpenCompass. Please refer to the [contributing guideline](https://opencompass.readthedocs.io/en/latest/notes/contribution_guide.html) for the best practice.
+        <!-- Copy-paste in your Readme.md file -->
+        <!-- Made with [OSS Insight](https://ossinsight.io/) -->
+        <a href="https://github.com/open-compass/opencompass/graphs/contributors" target="_blank">
+          <table>
+            <tr>
+              <th colspan="2">
+                <br><img src="https://contrib.rocks/image?repo=open-compass/opencompass"><br><br>
+              </th>
+            </tr>
+          </table>
+        </a>
         ## 🤝 Acknowledgements
         Some code in this project is cited and modified from [OpenICL](https://github.com/Shark-NLP/OpenICL).
@@ -524,6 +554,23 @@ Description: <div align="center">
         <p align="right"><a href="#top">🔝Back to top</a></p>
+        [github-contributors-link]: https://github.com/open-compass/opencompass/graphs/contributors
+        [github-contributors-shield]: https://img.shields.io/github/contributors/open-compass/opencompass?color=c4f042&labelColor=black&style=flat-square
+        [github-forks-link]: https://github.com/open-compass/opencompass/network/members
+        [github-forks-shield]: https://img.shields.io/github/forks/open-compass/opencompass?color=8ae8ff&labelColor=black&style=flat-square
+        [github-issues-link]: https://github.com/open-compass/opencompass/issues
+        [github-issues-shield]: https://img.shields.io/github/issues/open-compass/opencompass?color=ff80eb&labelColor=black&style=flat-square
+        [github-license-link]: https://github.com/open-compass/opencompass/blob/main/LICENSE
+        [github-license-shield]: https://img.shields.io/github/license/open-compass/opencompass?color=white&labelColor=black&style=flat-square
+        [github-release-link]: https://github.com/open-compass/opencompass/releases
+        [github-release-shield]: https://img.shields.io/github/v/release/open-compass/opencompass?color=369eff&labelColor=black&logo=github&style=flat-square
+        [github-releasedate-link]: https://github.com/open-compass/opencompass/releases
+        [github-releasedate-shield]: https://img.shields.io/github/release-date/open-compass/opencompass?labelColor=black&style=flat-square
+        [github-stars-link]: https://github.com/open-compass/opencompass/stargazers
+        [github-stars-shield]: https://img.shields.io/github/stars/open-compass/opencompass?color=ffcb47&labelColor=black&style=flat-square
+        [github-trending-shield]: https://trendshift.io/api/badge/repositories/6630
+        [github-trending-url]: https://trendshift.io/repositories/6630
 Keywords: AI,NLP,in-context learning,large language model,evaluation,benchmark,llm
 Platform: UNKNOWN
 Classifier: Programming Language :: Python :: 3.8

{opencompass-0.2.3 → opencompass-0.2.5}/README.md RENAMED

@@ -3,8 +3,13 @@
   <br />
   <br />
-[![docs](https://readthedocs.org/projects/opencompass/badge)](https://opencompass.readthedocs.io/en)
-[![license](https://img.shields.io/github/license/InternLM/opencompass.svg)](https://github.com/open-compass/opencompass/blob/main/LICENSE)
+[![][github-release-shield]][github-release-link]
+[![][github-releasedate-shield]][github-releasedate-link]
+[![][github-contributors-shield]][github-contributors-link]<br>
+[![][github-forks-shield]][github-forks-link]
+[![][github-stars-shield]][github-stars-link]
+[![][github-issues-shield]][github-issues-link]
+[![][github-license-shield]][github-license-link]
 <!-- [![PyPI](https://badge.fury.io/py/opencompass.svg)](https://pypi.org/project/opencompass/) -->
@@ -17,12 +22,18 @@
 English | [简体中文](README_zh-CN.md)
+[![][github-trending-shield]][github-trending-url]
 </div>
 <p align="center">
     👋 join us on <a href="https://discord.gg/KKwfEbFj7U" target="_blank">Discord</a> and <a href="https://r.vansin.top/?r=opencompass" target="_blank">WeChat</a>
 </p>
+> \[!IMPORTANT\]
+>
+> **Star Us**, You will receive all release notifications from GitHub without any delay ~ ⭐️
 ## 📣 OpenCompass 2.0
 We are thrilled to introduce OpenCompass 2.0, an advanced suite featuring three key components: [CompassKit](https://github.com/open-compass), [CompassHub](https://hub.opencompass.org.cn/home), and [CompassRank](https://rank.opencompass.org.cn/home).
@@ -34,6 +45,14 @@ We are thrilled to introduce OpenCompass 2.0, an advanced suite featuring three
 **CompassKit** is a powerful collection of evaluation toolkits specifically tailored for Large Language Models and Large Vision-language Models. It provides an extensive set of tools to assess and measure the performance of these complex models effectively. Welcome to try our toolkits for in your research and products.
+<details>
+  <summary><kbd>Star History</kbd></summary>
+  <picture>
+    <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=open-compass%2Fopencompass&theme=dark&type=Date">
+    <img width="100%" src="https://api.star-history.com/svg?repos=open-compass%2Fopencompass&type=Date">
+  </picture>
+</details>
 ## 🧭	Welcome
 to **OpenCompass**!
@@ -51,12 +70,14 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
 ## 🚀 What's New <a><img width="35" height="20" src="https://user-images.githubusercontent.com/12782558/212848161-5e783dd6-11e8-4fe0-bbba-39ffb77730be.png"></a>
-- **\[2024.02.29\]** We supported the MT-Bench, AlpacalEval and AlignBench, more information can be found [here](https://opencompass.readthedocs.io/en/latest/advanced_guides/subjective_evaluation.html) 🔥🔥🔥.
-- **\[2024.01.30\]** We release OpenCompass 2.0. Click  [CompassKit](https://github.com/open-compass), [CompassHub](https://hub.opencompass.org.cn/home), and [CompassRank](https://rank.opencompass.org.cn/home) for more information ! 🔥🔥🔥.
-- **\[2024.01.17\]** We supported the evaluation of [InternLM2](https://github.com/open-compass/opencompass/blob/main/configs/eval_internlm2_keyset.py) and [InternLM2-Chat](https://github.com/open-compass/opencompass/blob/main/configs/eval_internlm2_chat_keyset.py), InternLM2 showed extremely strong performance in these tests, welcome to try! 🔥🔥🔥.
-- **\[2024.01.17\]** We supported the needle in a haystack test with multiple needles, more information can be found [here](https://opencompass.readthedocs.io/en/latest/advanced_guides/needleinahaystack_eval.html#id8) 🔥🔥🔥.
-- **\[2023.12.28\]** We have enabled seamless evaluation of all models developed using [LLaMA2-Accessory](https://github.com/Alpha-VLLM/LLaMA2-Accessory), a powerful toolkit for comprehensive LLM development.
-- **\[2023.12.22\]** We have released [T-Eval](https://github.com/open-compass/T-Eval), a step-by-step evaluation benchmark to gauge your LLMs on tool utilization. Welcome to our [Leaderboard](https://open-compass.github.io/T-Eval/leaderboard.html) for more details!
+- **\[2024.05.08\]** We supported the evaluation of 4 MoE models: [Mixtral-8x22B-v0.1](configs/models/mixtral/hf_mixtral_8x22b_v0_1.py), [Mixtral-8x22B-Instruct-v0.1](configs/models/mixtral/hf_mixtral_8x22b_instruct_v0_1.py), [Qwen1.5-MoE-A2.7B](configs/models/qwen/hf_qwen1_5_moe_a2_7b.py), [Qwen1.5-MoE-A2.7B-Chat](configs/models/qwen/hf_qwen1_5_moe_a2_7b_chat.py). Try them out now!
+- **\[2024.04.30\]** We supported evaluating a model's compression efficiency by calculating its Bits per Character (BPC) metric on an [external corpora](configs/datasets/llm_compression/README.md) ([official paper](https://github.com/hkust-nlp/llm-compression-intelligence)). Check out the [llm-compression](configs/eval_llm_compression.py) evaluation config now! 🔥🔥🔥
+- **\[2024.04.29\]** We report the performance of several famous LLMs on the common benchmarks, welcome to [documentation](https://opencompass.readthedocs.io/en/latest/user_guides/corebench.html) for more information! 🔥🔥🔥.
+- **\[2024.04.26\]** We deprecated the multi-madality evaluating function from OpenCompass, related implement has moved to [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), welcome to use! 🔥🔥🔥.
+- **\[2024.04.26\]** We supported the evaluation of [ArenaHard](configs/eval_subjective_arena_hard.py)  welcome to try!🔥🔥🔥.
+- **\[2024.04.22\]** We supported the evaluation of [LLaMA3](configs/models/hf_llama/hf_llama3_8b.py) 和 [LLaMA3-Instruct](configs/models/hf_llama/hf_llama3_8b_instruct.py), welcome to try! 🔥🔥🔥
+- **\[2024.02.29\]** We supported the MT-Bench, AlpacalEval and AlignBench, more information can be found [here](https://opencompass.readthedocs.io/en/latest/advanced_guides/subjective_evaluation.html)
+- **\[2024.01.30\]** We release OpenCompass 2.0. Click  [CompassKit](https://github.com/open-compass), [CompassHub](https://hub.opencompass.org.cn/home), and [CompassRank](https://rank.opencompass.org.cn/home) for more information !
 > [More](docs/en/notes/news.md)
@@ -106,7 +127,7 @@ conda activate opencompass
 git clone https://github.com/open-compass/opencompass opencompass
 cd opencompass
 pip install -e .
-# also please install requiresments packages via `pip install -r requirements/api.txt` for API models if needed.
+# also please install requirements packages via `pip install -r requirements/api.txt` for API models if needed.
 ```
 ### 📂 Data Preparation
@@ -141,19 +162,13 @@ python tools/list_configs.py llama mmlu
 You can also evaluate other HuggingFace models via command line. Taking LLaMA-7b as an example:
 ```bash
-python run.py --datasets ceval_ppl mmlu_ppl \
---hf-path huggyllama/llama-7b \  # HuggingFace model path
---model-kwargs device_map='auto' \  # Arguments for model construction
---tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \  # Arguments for tokenizer construction
---max-out-len 100 \  # Maximum number of tokens generated
---max-seq-len 2048 \  # Maximum sequence length the model can accept
---batch-size 8 \  # Batch size
---no-batch-padding \  # Don't enable batch padding, infer through for loop to avoid performance loss
---num-gpus 1  # Number of minimum required GPUs
+python run.py --datasets ceval_ppl mmlu_ppl --hf-type base --hf-path huggyllama/llama-7b
 ```
-> **Note**<br />
-> To run the command above, you will need to remove the comments starting from `# ` first.
+> \[!TIP\]
+>
+> configuration with `_ppl` is designed for base model typically.
+> configuration with `_gen` can be used for both base model and chat model.
 Through the command line or configuration files, OpenCompass also supports evaluating APIs or custom models, as well as more diversified evaluation strategies. Please read the [Quick Start](https://opencompass.readthedocs.io/en/latest/get_started/quick_start.html) to learn how to run an evaluation task.
@@ -439,6 +454,7 @@ Through the command line or configuration files, OpenCompass also supports evalu
 - [InternLM](https://github.com/InternLM/InternLM)
 - [LLaMA](https://github.com/facebookresearch/llama)
+- [LLaMA3](https://github.com/meta-llama/llama3)
 - [Vicuna](https://github.com/lm-sys/FastChat)
 - [Alpaca](https://github.com/tatsu-lab/stanford_alpaca)
 - [Baichuan](https://github.com/baichuan-inc)
@@ -497,6 +513,20 @@ Through the command line or configuration files, OpenCompass also supports evalu
 We appreciate all contributions to improving OpenCompass. Please refer to the [contributing guideline](https://opencompass.readthedocs.io/en/latest/notes/contribution_guide.html) for the best practice.
+<!-- Copy-paste in your Readme.md file -->
+<!-- Made with [OSS Insight](https://ossinsight.io/) -->
+<a href="https://github.com/open-compass/opencompass/graphs/contributors" target="_blank">
+  <table>
+    <tr>
+      <th colspan="2">
+        <br><img src="https://contrib.rocks/image?repo=open-compass/opencompass"><br><br>
+      </th>
+    </tr>
+  </table>
+</a>
 ## 🤝 Acknowledgements
 Some code in this project is cited and modified from [OpenICL](https://github.com/Shark-NLP/OpenICL).
@@ -515,3 +545,20 @@ Some datasets and prompt implementations are modified from [chain-of-thought-hub
 ```
 <p align="right"><a href="#top">🔝Back to top</a></p>
+[github-contributors-link]: https://github.com/open-compass/opencompass/graphs/contributors
+[github-contributors-shield]: https://img.shields.io/github/contributors/open-compass/opencompass?color=c4f042&labelColor=black&style=flat-square
+[github-forks-link]: https://github.com/open-compass/opencompass/network/members
+[github-forks-shield]: https://img.shields.io/github/forks/open-compass/opencompass?color=8ae8ff&labelColor=black&style=flat-square
+[github-issues-link]: https://github.com/open-compass/opencompass/issues
+[github-issues-shield]: https://img.shields.io/github/issues/open-compass/opencompass?color=ff80eb&labelColor=black&style=flat-square
+[github-license-link]: https://github.com/open-compass/opencompass/blob/main/LICENSE
+[github-license-shield]: https://img.shields.io/github/license/open-compass/opencompass?color=white&labelColor=black&style=flat-square
+[github-release-link]: https://github.com/open-compass/opencompass/releases
+[github-release-shield]: https://img.shields.io/github/v/release/open-compass/opencompass?color=369eff&labelColor=black&logo=github&style=flat-square
+[github-releasedate-link]: https://github.com/open-compass/opencompass/releases
+[github-releasedate-shield]: https://img.shields.io/github/release-date/open-compass/opencompass?labelColor=black&style=flat-square
+[github-stars-link]: https://github.com/open-compass/opencompass/stargazers
+[github-stars-shield]: https://img.shields.io/github/stars/open-compass/opencompass?color=ffcb47&labelColor=black&style=flat-square
+[github-trending-shield]: https://trendshift.io/api/badge/repositories/6630
+[github-trending-url]: https://trendshift.io/repositories/6630

opencompass-0.2.5/opencompass/__init__.py ADDED

@@ -0,0 +1 @@
+__version__ = '0.2.5'

{opencompass-0.2.3 → opencompass-0.2.5}/opencompass/datasets/GaokaoBench.py RENAMED

@@ -91,34 +91,51 @@ class GaokaoBenchEvaluator(BaseEvaluator):
         ]:
             return {'score': 0}
         elif self.question_type == 'multi_choice':
+            details = {}
             correct_score, total_score = 0, 0
-            for pred, refr in zip(predictions, references):
+            for index, (pred, refr) in enumerate(zip(predictions, references)):
                 pred = self.do_predictions_postprocess(pred)
                 pred = self.ensure_same_length(pred, refr)
+                is_corrects = []
                 for p, r in zip(pred, refr):
                     if p == r:
                         correct_score += 2
+                        is_corrects.append(True)
                     else:
                         for i in p:
                             if i not in r:
                                 break
                         else:
                             correct_score += 1
+                        is_corrects.append(False)
                     total_score += 2
-            return {'score': correct_score / total_score * 100}
+                details[str(index)] = {
+                    'pred': pred,
+                    'refr': refr,
+                    'is_correct': all(is_corrects),
+                }
         else:
+            details = {}
             correct_score, total_score = 0, 0
-            for pred, refr in zip(predictions, references):
+            for index, (pred, refr) in enumerate(zip(predictions, references)):
                 if self.question_type == 'multi_question_choice':
                     pred = self.do_predictions_postprocess(pred, len(refr))
                 else:
                     pred = self.do_predictions_postprocess(pred)
                 pred = self.ensure_same_length(pred, refr)
+                is_corrects = []
                 for p, r in zip(pred, refr):
-                    if p == r:
-                        correct_score += 1
+                    is_correct = p == r
+                    correct_score += is_correct
                     total_score += 1
-            return {'score': correct_score / total_score * 100}
+                    is_corrects.append(is_correct)
+                details[str(index)] = {
+                    'pred': pred,
+                    'refr': refr,
+                    'is_correct': all(is_corrects),
+                }
+        return {'score': correct_score / total_score * 100, 'details': details}
 for question_type in valid_gaokao_bench_question_types:
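
The `multi_choice` branch in this hunk now records a per-sample verdict alongside the aggregate score. A minimal standalone sketch of that scoring rule follows. It assumes predictions are already post-processed and length-aligned (the `do_predictions_postprocess` and `ensure_same_length` steps are omitted), and `score_multi_choice` is a hypothetical name that does not exist in the package.

```python
def score_multi_choice(predictions, references):
    """Sketch of the 2/1/0 scoring rule with per-sample details.

    Each sample is a list of answer strings, e.g. ['AB', 'C'].
    An exact match earns 2 points; a prediction whose letters all
    appear in the reference earns 1 point; anything else earns 0.
    """
    details = {}
    correct_score, total_score = 0, 0
    for index, (pred, refr) in enumerate(zip(predictions, references)):
        is_corrects = []
        for p, r in zip(pred, refr):
            if p == r:
                correct_score += 2
                is_corrects.append(True)
            else:
                # Same effect as the for/else loop in the diff:
                # partial credit only if no letter falls outside the reference.
                if all(ch in r for ch in p):
                    correct_score += 1
                is_corrects.append(False)
            total_score += 2
        details[str(index)] = {
            'pred': pred,
            'refr': refr,
            # A sample counts as correct only if every sub-answer matched exactly.
            'is_correct': all(is_corrects),
        }
    return {'score': correct_score / total_score * 100, 'details': details}


result = score_multi_choice([['AB'], ['A'], ['C']], [['AB'], ['AB'], ['AB']])
```

Note that partial credit raises `score` but never sets `is_correct`, so the two can disagree for the same sample.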

opencompass-0.2.5/opencompass/datasets/MMLUArabic.py ADDED

@@ -0,0 +1,33 @@
+import csv
+import os.path as osp
+from datasets import Dataset, DatasetDict
+from opencompass.registry import LOAD_DATASET
+from .base import BaseDataset
+@LOAD_DATASET.register_module()
+class MMLUArabicDataset(BaseDataset):
+    @staticmethod
+    def load(path: str, name: str):
+        dataset = DatasetDict()
+        for split in ['dev', 'test']:
+            raw_data = []
+            filename = osp.join(path, split, f'{name}_{split}.csv')
+            with open(filename, encoding='utf-8') as f:
+                reader = csv.reader(f)
+                for row in reader:
+                    assert len(row) == 6
+                    raw_data.append({
+                        'input': row[0],
+                        'A': row[1],
+                        'B': row[2],
+                        'C': row[3],
+                        'D': row[4],
+                        'target': row[5],
+                    })
+            dataset[split] = Dataset.from_list(raw_data)
+        return dataset
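
The loader above expects one headerless CSV per split at `<path>/<split>/<name>_<split>.csv`, with exactly six columns per row. The sketch below reproduces that layout with the standard library only, so it runs without `datasets`. The helper name `read_split`, the subject name `demo_subject`, and the sample row are made up for illustration.

```python
import csv
import os
import tempfile

FIELDS = ['input', 'A', 'B', 'C', 'D', 'target']


def read_split(path, name, split):
    """Read <path>/<split>/<name>_<split>.csv into a list of dicts."""
    filename = os.path.join(path, split, f'{name}_{split}.csv')
    rows = []
    with open(filename, encoding='utf-8', newline='') as f:
        for row in csv.reader(f):
            # Mirrors the loader's check: question, four options, answer letter.
            assert len(row) == 6, f'expected 6 columns, got {len(row)}'
            rows.append(dict(zip(FIELDS, row)))
    return rows


# Demo: write a single hypothetical row, then read it back.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'dev'))
    sample = os.path.join(root, 'dev', 'demo_subject_dev.csv')
    with open(sample, 'w', encoding='utf-8', newline='') as f:
        csv.writer(f).writerow(['2 + 2 = ?', '3', '4', '5', '6', 'B'])
    demo_rows = read_split(root, 'demo_subject', 'dev')
```

Because there is no header row, a file that does carry one would be loaded as an ordinary question.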

{opencompass-0.2.3 → opencompass-0.2.5}/opencompass/datasets/NPHardEval/cmp_GCP_D.py RENAMED

@@ -1,6 +1,10 @@
 import ast
-import networkx as nx
+try:
+    import networkx as nx
+except ImportError:
+    nx = None
 from datasets import Dataset
 from opencompass.openicl.icl_evaluator import BaseEvaluator

{opencompass-0.2.3 → opencompass-0.2.5}/opencompass/datasets/NPHardEval/cmp_TSP_D.py RENAMED

@@ -1,7 +1,11 @@
 import ast
 import json
-import networkx as nx
+try:
+    import networkx as nx
+except ImportError:
+    nx = None
 import pandas as pd
 from datasets import Dataset

{opencompass-0.2.3 → opencompass-0.2.5}/opencompass/datasets/NPHardEval/p_SPP.py RENAMED

@@ -1,7 +1,11 @@
 import ast
 import json
-import networkx as nx
+try:
+    import networkx as nx
+except ImportError:
+    nx = None
 from datasets import Dataset
 from opencompass.openicl.icl_evaluator import BaseEvaluator
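
The three NPHardEval hunks make `networkx` optional by falling back to `nx = None` when the import fails. This diff does not show how the call sites react to that fallback, so the following is only one possible guard. The helper `require` is hypothetical and not part of the package.

```python
try:
    import networkx as nx
except ImportError:
    nx = None  # same fallback as in the hunks above


def require(module, name):
    """Return the module, or raise a readable error if the import fell back to None."""
    if module is None:
        raise ImportError(
            f'{name} is required for this evaluator; '
            f'install it with `pip install {name}`.')
    return module
```

A call site could then write `require(nx, 'networkx').Graph()`, which fails with an explicit message instead of an `AttributeError` on `None`.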

opencompass-0.2.5/opencompass/datasets/QuALITY.py ADDED

@@ -0,0 +1,59 @@
+import json
+from datasets import Dataset
+from opencompass.openicl.icl_evaluator import BaseEvaluator
+from opencompass.registry import LOAD_DATASET
+from .base import BaseDataset
+@LOAD_DATASET.register_module()
+class QuALITYDataset(BaseDataset):
+    @staticmethod
+    def load(path: str):
+        dataset_list = []
+        with open(path, 'r', encoding='utf-8') as f:
+            for line in f:
+                line = json.loads(line)
+                for question in line['questions']:
+                    dataset_list.append({
+                        'article':
+                        line['article'],
+                        'question':
+                        question['question'],
+                        'A':
+                        question['options'][0],
+                        'B':
+                        question['options'][1],
+                        'C':
+                        question['options'][2],
+                        'D':
+                        question['options'][3],
+                        'gold_label':
+                        'ABCD'[question['gold_label'] - 1],
+                        'difficult':
+                        question['difficult']
+                    })
+        return Dataset.from_list(dataset_list)
+class QuALITYEvaluator(BaseEvaluator):
+    def score(self, predictions, references, test_set):
+        assert len(predictions) == len(references)
+        easy, hard, all = [], [], []
+        for pred, refer, test in zip(predictions, references, test_set):
+            if pred == refer:
+                answer = True
+            else:
+                answer = False
+            all.append(answer)
+            if test['difficult'] == 0:
+                easy.append(answer)
+            else:
+                hard.append(answer)
+        return dict(easy_acc=sum(easy) / len(easy) * 100,
+                    hard_acc=sum(hard) / len(easy) * 100,
+                    all_acc=sum(all) / len(all) * 100)
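
In the added `QuALITYEvaluator.score`, `hard_acc` is divided by `len(easy)` rather than `len(hard)`, and the local name `all` shadows the builtin. With one easy item and three hard items of which two are right, the expression as written yields 200. The sketch below shows what the computation presumably intends. It is not the package's code: the function name `quality_accuracy` is hypothetical, and returning 0.0 for an empty bucket is an assumption made here to avoid a division by zero.

```python
def quality_accuracy(predictions, references, difficulties):
    """Accuracy (in percent) overall and per difficulty bucket.

    `difficulties` holds the `difficult` flag of each sample (0 = easy).
    """
    assert len(predictions) == len(references) == len(difficulties)
    easy, hard, overall = [], [], []
    for pred, refer, difficult in zip(predictions, references, difficulties):
        answer = pred == refer
        overall.append(answer)
        (easy if difficult == 0 else hard).append(answer)

    def pct(flags):
        # Assumption: report 0.0 for an empty bucket instead of raising.
        return sum(flags) / len(flags) * 100 if flags else 0.0

    return dict(easy_acc=pct(easy), hard_acc=pct(hard), all_acc=pct(overall))


acc = quality_accuracy(['A', 'B', 'C', 'D'], ['A', 'B', 'X', 'D'], [0, 1, 1, 1])
```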

opencompass-0.2.5/opencompass/datasets/TheoremQA/__init__.py ADDED

@@ -0,0 +1,4 @@
+from .legacy import (TheoremQA_postprocess, TheoremQA_postprocess_v2,
+                     TheoremQADataset)
+from .main import (TheoremQA_postprocess_v3, TheoremQADatasetV3,
+                   TheoremQAEvaluatorV3)

opencompass-0.2.3/opencompass/datasets/TheoremQA.py → opencompass-0.2.5/opencompass/datasets/TheoremQA/legacy.py RENAMED

@@ -4,7 +4,7 @@ from datasets import load_dataset
 from opencompass.registry import LOAD_DATASET, TEXT_POSTPROCESSORS
-from .base import BaseDataset
+from ..base import BaseDataset
 @LOAD_DATASET.register_module()

opencompass-0.2.5/opencompass/datasets/TheoremQA/main.py ADDED

@@ -0,0 +1,66 @@
+import re
+import json
+from datasets import Dataset, DatasetDict
+from opencompass.registry import LOAD_DATASET, TEXT_POSTPROCESSORS, ICL_EVALUATORS
+from opencompass.openicl.icl_evaluator import BaseEvaluator
+from ..base import BaseDataset
+from . import utils
+from tqdm import tqdm
+@LOAD_DATASET.register_module()
+class TheoremQADatasetV3(BaseDataset):
+    @staticmethod
+    def load(path: str):
+        with open(path, 'r') as f:
+            data = json.load(f)
+        for item in data:
+            item['Answer'] = str(item['Answer'])
+        dataset = Dataset.from_list(data)
+        return dataset
+def TheoremQA_postprocess_v3(text: str) -> str:
+    answer = utils.answer_clean(["The answer is:", "The answer is", "the answer is"], text)
+    return answer
+@ICL_EVALUATORS.register_module()
+class TheoremQAEvaluatorV3(BaseEvaluator):
+    def score(self, predictions, references, test_set):
+        if len(predictions) != len(references):
+            return {"error": "preds and refrs have different length"}
+        details = []
+        correct, wrong = 0, 0
+        for index in tqdm(range(len(predictions))):
+            answer = predictions[index]
+            groundtruth = references[index]
+            answer_type = test_set[index]['Answer_type']
+            if answer_type in ['float', 'integer', 'bool']:
+                groundtruth = [groundtruth, eval(groundtruth)]
+            else:
+                groundtruth = [groundtruth, None]
+            if utils.compare_answer_with_groundtruth(answer, *groundtruth):
+                correct += 1
+                is_correct = True
+            else:
+                wrong += 1
+                is_correct = False
+            details.append(
+                {
+                    # "question": question,
+                    # "solution": output,
+                    "correct": groundtruth,
+                    "pred": answer,
+                    "is_correct": is_correct,
+                }
+            )
+        score = correct / (correct + wrong) * 100
+        return {'score': score, 'details': details}
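
`TheoremQAEvaluatorV3.score` calls `eval(groundtruth)` on reference strings whose `Answer_type` is `float`, `integer`, or `bool`. As a possible alternative, not what the package does, `ast.literal_eval` parses the same literals without executing arbitrary expressions. The helper `parse_reference` is hypothetical, and falling back to `None` on a malformed string is an assumption that differs from `eval`, which would raise.

```python
import ast


def parse_reference(groundtruth: str, answer_type: str):
    """Return [raw_string, parsed_value] in the shape the evaluator builds."""
    if answer_type in ['float', 'integer', 'bool']:
        try:
            # Accepts literals such as '3.14', '7' or 'True' only.
            value = ast.literal_eval(groundtruth)
        except (ValueError, SyntaxError):
            value = None  # assumption: treat unparsable references as unparsed
        return [groundtruth, value]
    return [groundtruth, None]
```

For well-formed numeric and boolean references, the parsed value is the same one `eval` would produce.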
