evalscope 0.6.0__tar.gz → 0.6.1__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Potentially problematic release.
This version of evalscope might be problematic.
- {evalscope-0.6.0 → evalscope-0.6.1}/PKG-INFO +14 -13
- {evalscope-0.6.0 → evalscope-0.6.1}/README.md +6 -5
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/opencompass/tasks/eval_datasets.py +1 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/cmteb/tasks/Clustering.py +96 -96
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/cmteb/tasks/Reranking.py +70 -71
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/ragas/tasks/testset_generation.py +120 -100
- evalscope-0.6.1/evalscope/backend/rag_eval/utils/clip.py +149 -0
- evalscope-0.6.1/evalscope/backend/rag_eval/utils/embedding.py +183 -0
- evalscope-0.6.1/evalscope/backend/rag_eval/utils/llm.py +72 -0
- evalscope-0.6.1/evalscope/backend/rag_eval/utils/tools.py +63 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/metrics/bundled_rouge_score/rouge_scorer.py +1 -1
- evalscope-0.6.1/evalscope/preprocess/tokenizers/__init__.py +0 -0
- evalscope-0.6.1/evalscope/version.py +4 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope.egg-info/PKG-INFO +14 -13
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope.egg-info/SOURCES.txt +5 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope.egg-info/requires.txt +8 -8
- evalscope-0.6.0/evalscope/version.py +0 -4
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/base.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/opencompass/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/opencompass/api_meta_template.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/opencompass/backend_manager.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/opencompass/tasks/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/opencompass/tasks/eval_api.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/backend_manager.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/clip_benchmark/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/clip_benchmark/arguments.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/clip_benchmark/dataset_builder.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/clip_benchmark/task_template.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/image_caption.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_classification.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/clip_benchmark/tasks/zeroshot_retrieval.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/cmteb/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/cmteb/arguments.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/cmteb/base.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/cmteb/task_template.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/cmteb/tasks/Classification.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/cmteb/tasks/CustomTask.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/cmteb/tasks/PairClassification.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/cmteb/tasks/Retrieval.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/cmteb/tasks/STS.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/cmteb/tasks/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/ragas/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/ragas/arguments.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/ragas/metrics/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/ragas/metrics/multi_modal_faithfulness.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/ragas/metrics/multi_modal_relevance.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/ragas/task_template.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/ragas/tasks/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/ragas/tasks/translate_prompt.py +0 -0
- {evalscope-0.6.0/evalscope/perf → evalscope-0.6.1/evalscope/backend/rag_eval/utils}/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/vlm_eval_kit/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/vlm_eval_kit/backend_manager.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/vlm_eval_kit/custom_dataset.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/arc/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/arc/ai2_arc.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/arc/arc_adapter.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/bbh_adapter.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/boolean_expressions.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/causal_judgement.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/date_understanding.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/disambiguation_qa.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/dyck_languages.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/formal_fallacies.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/geometric_shapes.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/hyperbaton.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_five_objects.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_seven_objects.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/logical_deduction_three_objects.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/movie_recommendation.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/multistep_arithmetic_two.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/navigate.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/object_counting.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/penguins_in_a_table.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/reasoning_about_colored_objects.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/ruin_names.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/salient_translation_error_detection.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/snarks.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/sports_understanding.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/temporal_sequences.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_five_objects.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_seven_objects.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/tracking_shuffled_objects_three_objects.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/web_of_lies.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/bbh/cot_prompts/word_sorting.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/benchmark.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/ceval/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/ceval/ceval_adapter.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/ceval/ceval_exam.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/cmmlu/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/cmmlu/cmmlu.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/cmmlu/cmmlu_adapter.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/competition_math/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/competition_math/competition_math.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/competition_math/competition_math_adapter.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/data_adapter.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/general_qa/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/general_qa/general_qa_adapter.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/gsm8k/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/gsm8k/gsm8k.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/gsm8k/gsm8k_adapter.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/hellaswag/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/hellaswag/hellaswag.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/hellaswag/hellaswag_adapter.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/humaneval/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/humaneval/humaneval.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/humaneval/humaneval_adapter.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/mmlu/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/mmlu/mmlu.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/mmlu/mmlu_adapter.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/race/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/race/race.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/race/race_adapter.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/trivia_qa/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/trivia_qa/trivia_qa.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/trivia_qa/trivia_qa_adapter.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/truthful_qa/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/truthful_qa/truthful_qa.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/benchmarks/truthful_qa/truthful_qa_adapter.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/cache.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/cli/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/cli/base.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/cli/cli.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/cli/start_perf.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/cli/start_server.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/config.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/constants.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/evaluator/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/evaluator/evaluator.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/evaluator/rating_eval.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/evaluator/reviewer/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/evaluator/reviewer/auto_reviewer.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/metrics/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/metrics/bundled_rouge_score/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/metrics/code_metric.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/metrics/math_accuracy.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/metrics/metrics.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/metrics/rouge_metric.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/models/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/models/api/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/models/api/openai_api.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/models/custom/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/models/custom/custom_model.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/models/dummy_chat_model.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/models/model.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/models/model_adapter.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/models/openai_model.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/models/template.py +0 -0
- {evalscope-0.6.0/evalscope/perf/datasets → evalscope-0.6.1/evalscope/perf}/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/perf/_logging.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/perf/api_plugin_base.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/perf/custom_api.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/perf/dashscope_api.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/perf/dataset_plugin_base.py +0 -0
- {evalscope-0.6.0/evalscope/preprocess/tokenizers → evalscope-0.6.1/evalscope/perf/datasets}/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/perf/datasets/line_by_line.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/perf/datasets/longalpaca_12k.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/perf/datasets/openqa.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/perf/how_to_analysis_result.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/perf/http_client.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/perf/openai_api.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/perf/plugin_registry.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/perf/query_parameters.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/perf/server_sent_event.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/preprocess/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/preprocess/tokenizers/gpt2_tokenizer.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/registry/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/registry/tasks/arc.yaml +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/registry/tasks/bbh.yaml +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/registry/tasks/bbh_mini.yaml +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/registry/tasks/ceval.yaml +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/registry/tasks/ceval_mini.yaml +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/registry/tasks/cmmlu.yaml +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/registry/tasks/eval_qwen-7b-chat_v100.yaml +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/registry/tasks/general_qa.yaml +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/registry/tasks/gsm8k.yaml +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/registry/tasks/mmlu.yaml +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/registry/tasks/mmlu_mini.yaml +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/run.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/run_arena.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/run_ms.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/summarizer.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/third_party/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/third_party/longbench_write/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/third_party/longbench_write/eval.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/third_party/longbench_write/infer.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/third_party/longbench_write/longbench_write.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/third_party/longbench_write/resources/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/third_party/longbench_write/resources/judge.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/third_party/longbench_write/resources/longbench_write.jsonl +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/third_party/longbench_write/resources/longbench_write_en.jsonl +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/third_party/longbench_write/resources/longwrite_ruler.jsonl +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/third_party/longbench_write/tools/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/third_party/longbench_write/tools/data_etl.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/third_party/longbench_write/utils.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/third_party/toolbench_static/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/third_party/toolbench_static/eval.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/third_party/toolbench_static/infer.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/third_party/toolbench_static/llm/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/third_party/toolbench_static/llm/swift_infer.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/third_party/toolbench_static/toolbench_static.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/tools/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/tools/combine_reports.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/tools/gen_mmlu_subject_mapping.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/tools/rewrite_eval_results.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/utils/__init__.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/utils/arena_utils.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/utils/completion_parsers.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/utils/logger.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/utils/task_cfg_parser.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/utils/task_utils.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope/utils/utils.py +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope.egg-info/dependency_links.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope.egg-info/entry_points.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope.egg-info/not-zip-safe +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/evalscope.egg-info/top_level.txt +0 -0
- {evalscope-0.6.0 → evalscope-0.6.1}/setup.cfg +0 -0
{evalscope-0.6.0 → evalscope-0.6.1}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: evalscope
-Version: 0.6.0
+Version: 0.6.1
 Summary: EvalScope: Lightweight LLMs Evaluation Framework
 Home-page: https://github.com/modelscope/evalscope
 Author: ModelScope team
@@ -28,7 +28,7 @@ Requires-Dist: nltk>=3.9
 Requires-Dist: openai
 Requires-Dist: pandas
 Requires-Dist: plotly
-Requires-Dist: pyarrow
+Requires-Dist: pyarrow<=17.0.0
 Requires-Dist: pympler
 Requires-Dist: pyyaml
 Requires-Dist: regex
@@ -48,12 +48,12 @@ Requires-Dist: transformers_stream_generator
 Requires-Dist: jieba
 Requires-Dist: rouge-chinese
 Provides-Extra: opencompass
-Requires-Dist: ms-opencompass>=0.1.
+Requires-Dist: ms-opencompass>=0.1.3; extra == "opencompass"
 Provides-Extra: vlmeval
 Requires-Dist: ms-vlmeval>=0.0.5; extra == "vlmeval"
 Provides-Extra: rag
-Requires-Dist: mteb
-Requires-Dist: ragas
+Requires-Dist: mteb==1.19.4; extra == "rag"
+Requires-Dist: ragas==0.2.5; extra == "rag"
 Requires-Dist: webdataset>0.2.0; extra == "rag"
 Provides-Extra: inner
 Requires-Dist: absl-py; extra == "inner"
@@ -96,7 +96,7 @@ Requires-Dist: nltk>=3.9; extra == "all"
 Requires-Dist: openai; extra == "all"
 Requires-Dist: pandas; extra == "all"
 Requires-Dist: plotly; extra == "all"
-Requires-Dist: pyarrow; extra == "all"
+Requires-Dist: pyarrow<=17.0.0; extra == "all"
 Requires-Dist: pympler; extra == "all"
 Requires-Dist: pyyaml; extra == "all"
 Requires-Dist: regex; extra == "all"
@@ -115,10 +115,10 @@ Requires-Dist: transformers>=4.33; extra == "all"
 Requires-Dist: transformers_stream_generator; extra == "all"
 Requires-Dist: jieba; extra == "all"
 Requires-Dist: rouge-chinese; extra == "all"
-Requires-Dist: ms-opencompass>=0.1.
+Requires-Dist: ms-opencompass>=0.1.3; extra == "all"
 Requires-Dist: ms-vlmeval>=0.0.5; extra == "all"
-Requires-Dist: mteb
-Requires-Dist: ragas
+Requires-Dist: mteb==1.19.4; extra == "all"
+Requires-Dist: ragas==0.2.5; extra == "all"
 Requires-Dist: webdataset>0.2.0; extra == "all"


@@ -140,6 +140,7 @@ Requires-Dist: webdataset>0.2.0; extra == "all"
 <a href="https://evalscope.readthedocs.io/en/latest/">📖 Documents</a>
 <p>

+> ⭐ If you like this project, please click the "Star" button at the top right to support us. Your support is our motivation to keep going!

 ## 📋 Table of Contents
 - [Introduction](#introduction)
@@ -165,7 +166,7 @@ EvalScope is the official model evaluation and performance benchmarking framewor
 The architecture includes the following modules:
 1. **Model Adapter**: The model adapter is used to convert the outputs of specific models into the format required by the framework, supporting both API call models and locally run models.
 2. **Data Adapter**: The data adapter is responsible for converting and processing input data to meet various evaluation needs and formats.
-3. **Evaluation Backend**:
+3. **Evaluation Backend**:
    - **Native**: EvalScope’s own **default evaluation framework**, supporting various evaluation modes, including single model evaluation, arena mode, baseline model comparison mode, etc.
    - **OpenCompass**: Supports [OpenCompass](https://github.com/open-compass/opencompass) as the evaluation backend, providing advanced encapsulation and task simplification, allowing you to submit tasks for evaluation more easily.
    - **VLMEvalKit**: Supports [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) as the evaluation backend, enabling easy initiation of multi-modal evaluation tasks, supporting various multi-modal models and datasets.
@@ -252,7 +253,7 @@ You can execute this command from any directory:
 python -m evalscope.run \
  --model qwen/Qwen2-0.5B-Instruct \
  --template-type qwen \
- --datasets arc
+ --datasets arc
 ```

 #### Install from source
@@ -359,13 +360,13 @@ EvalScope supports using third-party evaluation frameworks to initiate evaluatio
 EvalScope supports custom dataset evaluation. For detailed information, please refer to the Custom Dataset Evaluation [📖User Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset.html)

 ## Offline Evaluation
-You can use local dataset to evaluate the model without internet connection.
+You can use local dataset to evaluate the model without internet connection.

 Refer to: Offline Evaluation [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/offline_evaluation.html)


 ## Arena Mode
-The Arena mode allows multiple candidate models to be evaluated through pairwise battles, and can choose to use the AI Enhanced Auto-Reviewer (AAR) automatic evaluation process or manual evaluation to obtain the evaluation report.
+The Arena mode allows multiple candidate models to be evaluated through pairwise battles, and can choose to use the AI Enhanced Auto-Reviewer (AAR) automatic evaluation process or manual evaluation to obtain the evaluation report.

 Refer to: Arena Mode [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html)

{evalscope-0.6.0 → evalscope-0.6.1}/README.md

@@ -17,6 +17,7 @@
 <a href="https://evalscope.readthedocs.io/en/latest/">📖 Documents</a>
 <p>

+> ⭐ If you like this project, please click the "Star" button at the top right to support us. Your support is our motivation to keep going!

 ## 📋 Table of Contents
 - [Introduction](#introduction)
@@ -42,7 +43,7 @@ EvalScope is the official model evaluation and performance benchmarking framewor
 The architecture includes the following modules:
 1. **Model Adapter**: The model adapter is used to convert the outputs of specific models into the format required by the framework, supporting both API call models and locally run models.
 2. **Data Adapter**: The data adapter is responsible for converting and processing input data to meet various evaluation needs and formats.
-3. **Evaluation Backend**:
+3. **Evaluation Backend**:
    - **Native**: EvalScope’s own **default evaluation framework**, supporting various evaluation modes, including single model evaluation, arena mode, baseline model comparison mode, etc.
    - **OpenCompass**: Supports [OpenCompass](https://github.com/open-compass/opencompass) as the evaluation backend, providing advanced encapsulation and task simplification, allowing you to submit tasks for evaluation more easily.
    - **VLMEvalKit**: Supports [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) as the evaluation backend, enabling easy initiation of multi-modal evaluation tasks, supporting various multi-modal models and datasets.
@@ -129,7 +130,7 @@ You can execute this command from any directory:
 python -m evalscope.run \
  --model qwen/Qwen2-0.5B-Instruct \
  --template-type qwen \
- --datasets arc
+ --datasets arc
 ```

 #### Install from source
@@ -236,13 +237,13 @@ EvalScope supports using third-party evaluation frameworks to initiate evaluatio
 EvalScope supports custom dataset evaluation. For detailed information, please refer to the Custom Dataset Evaluation [📖User Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset.html)

 ## Offline Evaluation
-You can use local dataset to evaluate the model without internet connection.
+You can use local dataset to evaluate the model without internet connection.

 Refer to: Offline Evaluation [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/offline_evaluation.html)


 ## Arena Mode
-The Arena mode allows multiple candidate models to be evaluated through pairwise battles, and can choose to use the AI Enhanced Auto-Reviewer (AAR) automatic evaluation process or manual evaluation to obtain the evaluation report.
+The Arena mode allows multiple candidate models to be evaluated through pairwise battles, and can choose to use the AI Enhanced Auto-Reviewer (AAR) automatic evaluation process or manual evaluation to obtain the evaluation report.

 Refer to: Arena Mode [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html)

@@ -270,4 +271,4 @@ Refer to : Model Serving Performance Evaluation [📖 User Guide](https://evalsc

 ## Star History

-[](https://star-history.com/#modelscope/evalscope&Date)
+[](https://star-history.com/#modelscope/evalscope&Date)
{evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/opencompass/tasks/eval_datasets.py

@@ -50,6 +50,7 @@ with read_base():
     from opencompass.configs.datasets.nq.nq_gen_c788f6 import nq_datasets
     from opencompass.configs.datasets.triviaqa.triviaqa_gen_2121ce import triviaqa_datasets
     from opencompass.configs.datasets.cmb.cmb_gen_dfb5c4 import cmb_datasets
+    from opencompass.configs.datasets.cmmlu.cmmlu_gen_c13365 import cmmlu_datasets

     # Note: to be supported
     # from opencompass.configs.datasets.flores.flores_gen_806ede import flores_datasets
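The new `cmmlu_gen_c13365` import above only brings the dataset list into scope; an OpenCompass-style task config normally aggregates every imported `*_datasets` list afterwards. A minimal sketch of that convention (the aggregation line itself is illustrative and not part of this diff):

```python
# Illustrative OpenCompass config idiom (assumed, not shown in the diff):
# collect every list imported in the read_base() block whose name ends with
# `_datasets`, so the newly added cmmlu_datasets is picked up automatically.
datasets = sum([v for k, v in locals().items() if k.endswith('_datasets')], [])
```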
{evalscope-0.6.0 → evalscope-0.6.1}/evalscope/backend/rag_eval/cmteb/tasks/Clustering.py

@@ -17,57 +17,57 @@ class CLSClusteringFastS2S(AbsTaskClusteringFast):
     max_fraction_of_documents_to_embed = None

     metadata = TaskMetadata(
-        name=
-        description=
-        reference=
+        name='CLSClusteringS2S',
+        description='Clustering of titles from CLS dataset. Clustering of 13 sets on the main category.',
+        reference='https://arxiv.org/abs/2209.05034',
         dataset={
-
-
+            'path': 'C-MTEB/CLSClusteringS2S',
+            'revision': 'e458b3f5414b62b7f9f83499ac1f5497ae2e869f',
         },
-        type=
-        category=
-        modalities=[
-        eval_splits=[
-        eval_langs=[
-        main_score=
-        date=(
-        domains=[
-        task_subtypes=[
-        license=
-        annotations_creators=
+        type='Clustering',
+        category='s2s',
+        modalities=['text'],
+        eval_splits=['test'],
+        eval_langs=['cmn-Hans'],
+        main_score='v_measure',
+        date=('2022-01-01', '2022-09-12'),
+        domains=['Academic', 'Written'],
+        task_subtypes=['Thematic clustering', 'Topic classification'],
+        license='apache-2.0',
+        annotations_creators='derived',
         dialect=[],
-        sample_creation=
+        sample_creation='found',
         bibtex_citation="""@misc{li2022csl,
-        title={CSL: A Large-scale Chinese Scientific Literature Dataset},
+        title={CSL: A Large-scale Chinese Scientific Literature Dataset},
         author={Yudong Li and Yuqing Zhang and Zhe Zhao and Linlin Shen and Weijie Liu and Weiquan Mao and Hui Zhang},
         year={2022},
         eprint={2209.05034},
         archivePrefix={arXiv},
         primaryClass={cs.CL}
-        }""",
+        }""", # noqa
         descriptive_stats={
-
-
+            'n_samples': {'test': NUM_SAMPLES},
+            'avg_character_length': {},
         },
     )

     def dataset_transform(self):
         ds = {}
         for split in self.metadata.eval_splits:
-            labels = list(itertools.chain.from_iterable(self.dataset[split][
+            labels = list(itertools.chain.from_iterable(self.dataset[split]['labels']))
             sentences = list(
-                itertools.chain.from_iterable(self.dataset[split][
+                itertools.chain.from_iterable(self.dataset[split]['sentences'])
             )

             check_label_distribution(self.dataset[split])

-            ds[split] = Dataset.from_dict({
+            ds[split] = Dataset.from_dict({'labels': labels, 'sentences': sentences})
         self.dataset = DatasetDict(ds)
         self.dataset = self.stratified_subsampling(
             self.dataset,
             self.seed,
             self.metadata.eval_splits,
-            label=
+            label='labels',
             n_samples=NUM_SAMPLES,
         )

@@ -77,57 +77,57 @@ class CLSClusteringFastP2P(AbsTaskClusteringFast):
     max_fraction_of_documents_to_embed = None

     metadata = TaskMetadata(
-        name=
-        description=
-        reference=
+        name='CLSClusteringP2P',
+        description='Clustering of titles + abstract from CLS dataset. Clustering of 13 sets on the main category.',
+        reference='https://arxiv.org/abs/2209.05034',
         dataset={
-
-
+            'path': 'C-MTEB/CLSClusteringP2P',
+            'revision': '4b6227591c6c1a73bc76b1055f3b7f3588e72476',
         },
-        type=
-        category=
-        modalities=[
-        eval_splits=[
-        eval_langs=[
-        main_score=
-        date=(
-        domains=[
-        task_subtypes=[
-        license=
-        annotations_creators=
+        type='Clustering',
+        category='p2p',
+        modalities=['text'],
+        eval_splits=['test'],
+        eval_langs=['cmn-Hans'],
+        main_score='v_measure',
+        date=('2022-01-01', '2022-09-12'),
+        domains=['Academic', 'Written'],
+        task_subtypes=['Thematic clustering', 'Topic classification'],
+        license='apache-2.0',
+        annotations_creators='derived',
         dialect=[],
-        sample_creation=
+        sample_creation='found',
         bibtex_citation="""@misc{li2022csl,
-        title={CSL: A Large-scale Chinese Scientific Literature Dataset},
+        title={CSL: A Large-scale Chinese Scientific Literature Dataset},
         author={Yudong Li and Yuqing Zhang and Zhe Zhao and Linlin Shen and Weijie Liu and Weiquan Mao and Hui Zhang},
         year={2022},
         eprint={2209.05034},
         archivePrefix={arXiv},
         primaryClass={cs.CL}
-        }""",
+        }""", # noqa
         descriptive_stats={
-
-
+            'n_samples': {'test': NUM_SAMPLES},
+            'avg_character_length': {},
         },
     )

     def dataset_transform(self):
         ds = {}
         for split in self.metadata.eval_splits:
-            labels = list(itertools.chain.from_iterable(self.dataset[split][
+            labels = list(itertools.chain.from_iterable(self.dataset[split]['labels']))
             sentences = list(
-                itertools.chain.from_iterable(self.dataset[split][
+                itertools.chain.from_iterable(self.dataset[split]['sentences'])
             )

             check_label_distribution(self.dataset[split])

-            ds[split] = Dataset.from_dict({
+            ds[split] = Dataset.from_dict({'labels': labels, 'sentences': sentences})
         self.dataset = DatasetDict(ds)
         self.dataset = self.stratified_subsampling(
             self.dataset,
             self.seed,
             self.metadata.eval_splits,
-            label=
+            label='labels',
             n_samples=NUM_SAMPLES,
         )

@@ -137,26 +137,26 @@ class ThuNewsClusteringFastS2S(AbsTaskClusteringFast):
     max_fraction_of_documents_to_embed = None

     metadata = TaskMetadata(
-        name=
+        name='ThuNewsClusteringS2S',
         dataset={
-
-
+            'path': 'C-MTEB/ThuNewsClusteringS2S',
+            'revision': '8a8b2caeda43f39e13c4bc5bea0f8a667896e10d',
         },
-        description=
-        reference=
-        type=
-        category=
-        modalities=[
-        eval_splits=[
-        eval_langs=[
-        main_score=
-        date=(
-        domains=[
-        task_subtypes=[
-        license=
-        annotations_creators=
+        description='Clustering of titles from the THUCNews dataset',
+        reference='http://thuctc.thunlp.org/',
+        type='Clustering',
+        category='s2s',
+        modalities=['text'],
+        eval_splits=['test'],
+        eval_langs=['cmn-Hans'],
+        main_score='v_measure',
+        date=('2006-01-01', '2007-01-01'),
+        domains=['News', 'Written'],
+        task_subtypes=['Thematic clustering', 'Topic classification'],
+        license='apache-2.0',
+        annotations_creators='derived',
         dialect=[],
-        sample_creation=
+        sample_creation='found',
         bibtex_citation="""@software{THUCTC,
         author = {Sun, M. and Li, J. and Guo, Z. and Yu, Z. and Zheng, Y. and Si, X. and Liu, Z.},
         title = {THUCTC: An Efficient Chinese Text Classifier},
@@ -166,28 +166,28 @@ class ThuNewsClusteringFastS2S(AbsTaskClusteringFast):
         url = {https://github.com/thunlp/THUCTC}
         }""",
         descriptive_stats={
-
-
+            'n_samples': {'test': NUM_SAMPLES},
+            'avg_character_length': {},
         },
     )

     def dataset_transform(self):
         ds = {}
         for split in self.metadata.eval_splits:
-            labels = list(itertools.chain.from_iterable(self.dataset[split][
+            labels = list(itertools.chain.from_iterable(self.dataset[split]['labels']))
             sentences = list(
-                itertools.chain.from_iterable(self.dataset[split][
+                itertools.chain.from_iterable(self.dataset[split]['sentences'])
             )

             check_label_distribution(self.dataset[split])

-            ds[split] = Dataset.from_dict({
+            ds[split] = Dataset.from_dict({'labels': labels, 'sentences': sentences})
         self.dataset = DatasetDict(ds)
         self.dataset = self.stratified_subsampling(
             self.dataset,
             self.seed,
             self.metadata.eval_splits,
-            label=
+            label='labels',
             n_samples=NUM_SAMPLES,
         )

@@ -197,26 +197,26 @@ class ThuNewsClusteringFastP2P(AbsTaskClusteringFast):
     max_fraction_of_documents_to_embed = None

     metadata = TaskMetadata(
-        name=
+        name='ThuNewsClusteringP2P',
         dataset={
-
-
+            'path': 'C-MTEB/ThuNewsClusteringP2P',
+            'revision': '5798586b105c0434e4f0fe5e767abe619442cf93',
         },
-        description=
-        reference=
-        type=
-        category=
-        modalities=[
-        eval_splits=[
-        eval_langs=[
-        main_score=
-        date=(
-        domains=[
-        task_subtypes=[
-        license=
-        annotations_creators=
+        description='Clustering of titles + abstracts from the THUCNews dataset',
+        reference='http://thuctc.thunlp.org/',
+        type='Clustering',
+        category='p2p',
+        modalities=['text'],
+        eval_splits=['test'],
+        eval_langs=['cmn-Hans'],
+        main_score='v_measure',
+        date=('2006-01-01', '2007-01-01'),
+        domains=['News', 'Written'],
+        task_subtypes=['Thematic clustering', 'Topic classification'],
+        license='apache-2.0',
+        annotations_creators='derived',
         dialect=[],
-        sample_creation=
+        sample_creation='found',
         bibtex_citation="""@software{THUCTC,
         author = {Sun, M. and Li, J. and Guo, Z. and Yu, Z. and Zheng, Y. and Si, X. and Liu, Z.},
         title = {THUCTC: An Efficient Chinese Text Classifier},
@@ -226,27 +226,27 @@ class ThuNewsClusteringFastP2P(AbsTaskClusteringFast):
         url = {https://github.com/thunlp/THUCTC}
         }""",
         descriptive_stats={
-
-
+            'n_samples': {'test': NUM_SAMPLES},
+            'avg_character_length': {},
         },
     )

     def dataset_transform(self):
         ds = {}
         for split in self.metadata.eval_splits:
-            labels = list(itertools.chain.from_iterable(self.dataset[split][
+            labels = list(itertools.chain.from_iterable(self.dataset[split]['labels']))
             sentences = list(
-                itertools.chain.from_iterable(self.dataset[split][
+                itertools.chain.from_iterable(self.dataset[split]['sentences'])
             )

             check_label_distribution(self.dataset[split])

-            ds[split] = Dataset.from_dict({
+            ds[split] = Dataset.from_dict({'labels': labels, 'sentences': sentences})
         self.dataset = DatasetDict(ds)
         self.dataset = self.stratified_subsampling(
             self.dataset,
             self.seed,
             self.metadata.eval_splits,
-            label=
+            label='labels',
             n_samples=NUM_SAMPLES,
         )
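The clustering tasks touched above subclass mteb's AbsTaskClusteringFast, so they can be exercised directly with any embedding model. A minimal, hypothetical usage sketch (the embedding model and output folder are assumptions, not part of this release):

```python
# Illustrative only: run one of the C-MTEB clustering tasks defined above via mteb.
import mteb
from sentence_transformers import SentenceTransformer

from evalscope.backend.rag_eval.cmteb.tasks.Clustering import CLSClusteringFastS2S

model = SentenceTransformer('BAAI/bge-small-zh-v1.5')   # assumed embedding model
evaluation = mteb.MTEB(tasks=[CLSClusteringFastS2S()])  # task class from the diff above
results = evaluation.run(model, output_folder='outputs/cmteb')  # assumed output path
```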