dingo-python 1.0__py3-none-any.whl

@@ -0,0 +1,221 @@
1
+ Metadata-Version: 2.1
2
+ Name: dingo-python
3
+ Version: 1.0
4
+ Summary: Language quality evaluation tool.
5
+ Home-page: https://github.com/APX103/Dingo/tree/main
6
+ Author: SH AI Lab
7
+ Author-email: shailab@pjlab.org.cn
8
+ Classifier: Programming Language :: Python :: 3
9
+ Classifier: Operating System :: OS Independent
10
+ Requires-Python: >=3.10
11
+ Description-Content-Type: text/markdown
12
+ License-File: LICENSE
13
+ Requires-Dist: hanziconv
14
+ Requires-Dist: jieba
15
+ Requires-Dist: jsonlines
16
+ Requires-Dist: langid
17
+ Requires-Dist: numpy
18
+ Requires-Dist: textstat
19
+ Requires-Dist: zhon
20
+ Requires-Dist: transformers
21
+ Requires-Dist: toml
22
+ Requires-Dist: pydantic
23
+ Requires-Dist: requests
24
+ Requires-Dist: Pillow ==9.4.0
25
+ Requires-Dist: opencv-python
26
+ Requires-Dist: nltk
27
+ Requires-Dist: fastapi
28
+ Requires-Dist: uvicorn
29
+
30
+ English | [简体中文](README_ZH.md)
31
+
32
+ # Introduction
33
+
34
+ Dingo is a data quality assessment tool that automatically detects data quality issues in your datasets. It provides a variety of built-in detection rules and model-based methods, and also supports custom detection methods. It supports commonly used NLP and multimodal datasets, including pre-training, fine-tuning, and evaluation datasets. Dingo can be used through a local CLI, an SDK, or a RESTful API, which makes it easy to integrate into different environments. The [OpenCompass](https://github.com/open-compass/opencompass) large model evaluation platform has integrated Dingo for subjective evaluation of content generated by language models.
35
+
36
+ ## Architecture of Dingo
37
+
38
+ ![Architecture of dingo](./docs/assets/architeture.png)
39
+
40
+ # QuickStart
41
+
42
+ Install `dingo`.
43
+ ```shell
44
+ pip install dingo-python
45
+ ```
46
+
47
+ Try the following demo code.
48
+ ```python
49
+ from dingo.model import Model
50
+ from dingo.io import RawInputModel
51
+ from dingo.exec import Executor
52
+
53
+ input_data = {
54
+ "eval_models": ["default"],
55
+ "input_path": "data/inputs/test_data1.json", # replace with your data path here
56
+ "data_type": "json",
57
+ "column_content": ["prediction"],
58
+ }
59
+
60
+ raw_input = RawInputModel(**input_data)
61
+ Model.apply_config(raw_input.custom_config_path)
62
+ executor = Executor.exec_map["local"](raw_input)
63
+ result = executor.evaluate()
64
+ print(result)
65
+ ```
66
+
67
+ # Tutorials
68
+
69
+ ## Config
70
+
71
+ `Dingo` lets users customize their data quality inspection methods, which can include heuristic rules,
72
+ third-party quality inspection tools or services, and large models. These are set up through a configuration file.
73
+ Specifically, users can pass a parameter named `custom_config_path` that points to the configuration file.
74
+ A template for this configuration file is available here: [template.json](data/inputs/config/template.json)
75
+
76
+ ### Rules
77
+
78
+ Heuristic rules are a common method for data processing and quality inspection. `Dingo` has implemented a series of
79
+ heuristic rules and grouped them into rule groups, such as `zh-all` and `en-all`, which represent the heuristic quality
80
+ inspection rule sets for Chinese and English respectively. In the template of the configuration file, the two configuration
81
+ items related to heuristic rule configuration are `custom_rule_list` and `rule_config`, which represent the rule set
82
+ and the configuration parameters for a specific rule, respectively. Below is a configuration example:
83
+ ``` json
84
+ {
85
+ "custom_rule_list": [],
86
+ "rule_config": {}
87
+ }
88
+ ```
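As a minimal sketch of how such a configuration file might be produced, the snippet below writes the two keys from the template to a JSON file and reads it back. It rests on assumptions: the rule names are borrowed from the Spark example later in this README, it is not confirmed here that `custom_rule_list` accepts them in this form, and the file name `my_config.json` is hypothetical. `rule_config` is left empty because per-rule parameters are not documented in this section.

```python
import json
import tempfile
from pathlib import Path

# Assumption: rule names as used in the Spark example of this README.
# rule_config stays empty because its per-rule parameters are not documented here.
config = {
    "custom_rule_list": ["CommonSpecialCharacter", "CommonColonEnd"],
    "rule_config": {},
}


def write_config(cfg: dict, directory: str) -> Path:
    """Serialize the config as JSON and return the path of the written file."""
    path = Path(directory) / "my_config.json"  # hypothetical file name
    path.write_text(json.dumps(cfg, indent=2), encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    config_path = write_config(config, tmp)
    # Reading the file back confirms that it is valid JSON.
    loaded = json.loads(config_path.read_text(encoding="utf-8"))
    print(loaded["custom_rule_list"])
```

The path of a file written this way would then be the value supplied as `custom_config_path`.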
89
+ ### Large Models
90
+
91
+ `Dingo` supports data quality inspection using large models. Before use, users need to configure `llm_config`.
92
+ For `OpenAI` models:
93
+ ``` json
94
+ {
95
+ "key": "YOUR_API_KEY"
96
+ }
97
+ ```
98
+ For HuggingFace models (currently only locally downloaded models are supported):
99
+ ``` json
100
+ {
101
+ "path": "your local model path"
102
+ }
103
+ ```
104
+
105
+ ## Execute
106
+
107
+ `Dingo` can be run locally or on a Spark cluster.
108
+
109
+ ### Local Mode
110
+
111
+ In addition to the SDK calls shown above, you can also run data evaluation locally with the CLI:
112
+
113
+ ```shell
114
+ python -m dingo.run.cli
115
+ ```
116
+
117
+ The CLI parameters are as follows.
118
+
119
+ | parameter name | description |
120
+ |-------------------------|-------------------------------------------------------------------|
121
+ | `-i` or `--input_path` | The path of data. It can be a file or a directory. |
122
+ | `-e` or `--eval_models` | The model used to evaluate data quality. |
123
+ | `--dataset_id` | The id of data input. |
124
+ | `--data_type` | The type of data. It can be json, jsonl, plaintext or list json. |
125
+ | `--output_path` | The path of result data. |
126
+ | `--column_id` | The column name of id in data. |
127
+ | `--column_prompt` | The column name of prompt in data. |
128
+ | `--column_content` | The column name of content in data. |
129
+ | `--custom_config_path` | The path of custom config file. |
130
+
131
+ More information can be obtained by running the following command: `python -m dingo.run.cli --help`.
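The flags in the table can be combined into a single invocation. The sketch below only assembles and prints such a command and does not run it. It assumes that each flag takes one string value, which the table does not state, and the helper name `build_cli_command` is hypothetical. The argument values are taken from the QuickStart example.

```python
import shlex
from typing import List, Optional


def build_cli_command(
    input_path: str,
    eval_models: str,
    data_type: str,
    column_content: str,
    output_path: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for `python -m dingo.run.cli`.

    Assumption: every flag takes exactly one string value.
    """
    cmd = [
        "python", "-m", "dingo.run.cli",
        "--input_path", input_path,
        "--eval_models", eval_models,
        "--data_type", data_type,
        "--column_content", column_content,
    ]
    if output_path is not None:
        cmd += ["--output_path", output_path]
    return cmd


cmd = build_cli_command(
    input_path="data/inputs/test_data1.json",
    eval_models="default",
    data_type="json",
    column_content="prediction",
)
# shlex.join quotes arguments only where the shell requires it.
print(shlex.join(cmd))
```

Keeping the command as a list means it can be handed to `subprocess.run` without shell quoting problems.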
132
+
133
+ ### Spark Mode
134
+
135
+ If the data is very large, you can use Spark to run the project.
136
+
137
+ First, create a `SparkExecutor` object and set its `SparkSession` and `DataFrame` instances.
138
+
139
+ ```python
140
+ from dingo.exec.spark import SparkExecutor
141
+
142
+ spark_exec = SparkExecutor()
143
+ spark_exec.set_spark(spark_session)
144
+ spark_exec.set_input_df(spark_data_frame)
145
+ ```
146
+
147
+ Then, convert the data and execute the rule list.
148
+
149
+ ```python
150
+ spark_exec.convert_data(column_id=['data_id'], column_prompt=['prompt'], column_content=['content'])
151
+ spark_exec.execute(["CommonSpecialCharacter", "CommonColonEnd"])
152
+
153
+ ```
154
+
155
+ Finally, summarize and get the result data.
156
+
157
+ ```python
158
+ spark_exec.summarize()
159
+ output_df = spark_exec.get_output_df()
160
+ summary = spark_exec.get_summary()
161
+ ```
162
+
163
+
164
+ ## Evaluation Results
165
+
166
+ ### Summary
167
+
168
+ The `summary.json` file contains overall information about the evaluation results. Here is an example:
169
+
170
+ ```
171
+ {
172
+ "dataset_id": "20240618",
173
+ "input_model": "default",
174
+ "input_path": "data/inputs/test_data1.json",
175
+ "output_path": "data/outputs/20240625_134409",
176
+ "score": 90.0,
177
+ "num_good": 90,
178
+ "num_bad": 10,
179
+ "total": 100,
180
+ "error_ratio": {...}
181
+ }
182
+ ```
183
+
184
+ The `error_ratio` field reports data quality signals along seven dimensions:
185
+ `EFFECTIVENESS`, `COMPLETENESS`, `UNDERSTANDABILITY`, `SIMILARITY`, `FLUENCY`, `RELEVANCE` and `SECURITY`.
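A summary like the one above can be sanity-checked with a few lines of Python. This is a minimal sketch under stated assumptions: the relation `score = 100 * num_good / total` is inferred from the example numbers rather than confirmed by the README, and `error_ratio` is an empty placeholder because the example elides its content.

```python
import json

# Values copied from the example summary; "error_ratio" is an empty
# placeholder because the README elides its content.
SUMMARY_JSON = """
{
    "dataset_id": "20240618",
    "input_model": "default",
    "input_path": "data/inputs/test_data1.json",
    "output_path": "data/outputs/20240625_134409",
    "score": 90.0,
    "num_good": 90,
    "num_bad": 10,
    "total": 100,
    "error_ratio": {}
}
"""


def recompute_score(summary: dict) -> float:
    """Recompute the score as the percentage of good items.

    Assumption: score = 100 * num_good / total, inferred from the example.
    """
    if summary["num_good"] + summary["num_bad"] != summary["total"]:
        raise ValueError("num_good + num_bad does not equal total")
    return 100.0 * summary["num_good"] / summary["total"]


summary = json.loads(SUMMARY_JSON)
recomputed = recompute_score(summary)
print(recomputed, summary["score"])
```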
186
+
187
+ ### Detailed Results
188
+
189
+ For more detail on the issues found in individual data items, `Dingo` creates files in directories named after the quality signals mentioned above.
190
+ For example, `CommonColonEnd.json` in the `QUALITY_SIGNAL_COMPLETENESS` directory looks as follows:
191
+
192
+ ```
193
+ {
194
+ "name": "CommonColonEnd", # rule name
195
+ "count": 1,
196
+ "ratio": 0.5,
197
+ "detail": [
198
+ {
199
+ "data_id": "0",
200
+ "prompt": "",
201
+ "content": "I am 8 years old. ^I love apple because:",
202
+ "error_reason": "Ends with a colon."
203
+ }
204
+ ]
205
+ }
206
+ ```
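To illustrate how a per-rule result like this could be consumed, the sketch below lists the flagged items with their error reasons. The dictionary mirrors the example above without the inline `# rule name` annotation, which is not valid JSON. The helper name `flagged_items` is hypothetical and not part of `Dingo`.

```python
# Per-rule result copied from the example above, minus the inline annotation.
rule_result = {
    "name": "CommonColonEnd",
    "count": 1,
    "ratio": 0.5,
    "detail": [
        {
            "data_id": "0",
            "prompt": "",
            "content": "I am 8 years old. ^I love apple because:",
            "error_reason": "Ends with a colon.",
        }
    ],
}


def flagged_items(result: dict) -> list:
    """Return (data_id, error_reason) pairs for every flagged item."""
    return [(item["data_id"], item["error_reason"]) for item in result["detail"]]


for data_id, reason in flagged_items(rule_result):
    print(f"{rule_result['name']}: item {data_id}: {reason}")
```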
207
+
208
+ We evaluated the quality of the following three datasets with `Dingo`.
209
+
210
+ | Dataset | Dataset Type | EFFECTIVENESS | COMPLETENESS | UNDERSTANDABILITY | SIMILARITY | FLUENCY | RELEVANCE | SECURITY |
211
+ |-----------------|--------------|---------------|--------------|-------------------|------------|---------|-----------|----------|
212
+ | SlimPajama-627B | Pretrain | 0 | 0.001797 | 0.011547 | 0.003563 | 0 | 0 | 0 |
213
+ | Stanford_alpaca | SFT | 0.0008 | 0.0004 | 0.0013 | 0.0002 | 0 | 0 | 0 |
214
+ | MMLU | Benchmark | 0.0064 | 0.0005 | 0.0113 | 0 | 0 | 0 | 0 |
215
+
216
+ # Contributing
217
+ We appreciate all contributions to `Dingo`. Please refer to [CONTRIBUTING.md](docs/en/CONTRIBUTING.md) for the contributing guidelines.
218
+
219
+ # License
220
+ This project is released under the [Apache 2.0 license](LICENSE).
221
+
@@ -0,0 +1,39 @@
1
+ dingo/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
2
+ dingo/config/__init__.py,sha256=8qPvpZTKUBcZqAqu2S5b2P-GRQGMV6VwGYl8bvQDLI0,45
3
+ dingo/config/config.py,sha256=2PqxEQBu1tc2FnBluXrFJ3qvUJCVrbP4Fm05Mh7UwMw,1520
4
+ dingo/convert/__init__.py,sha256=qXFGV1TQ8LJzeCkAuHo7aW7vMHHz4NzkeF16yP2Pwg8,134
5
+ dingo/convert/base.py,sha256=Sa9G78E-lX6mEZaPyke_zuurZIc2u6zjDh3-CrRDAVk,4485
6
+ dingo/exec/__init__.py,sha256=vW0w0u2-JktaJqKCKlQIJ4eyUv85ekU7sIClG1sIo7E,120
7
+ dingo/exec/base.py,sha256=MJooUK83hLGfadJl2jUioFJAMlDpyiP9NQGukvtk4ok,1272
8
+ dingo/exec/local.py,sha256=Vx-MtlA-V3wi6oa6PEuUReeTtub-4tuEnrDsHC4luoc,10410
9
+ dingo/exec/spark.py,sha256=J9nIsc_jSSoDymg1DVaRVcRNO0DtFjlPMuLQ09YvpGs,6182
10
+ dingo/io/__init__.py,sha256=g8o5SxzL00TFYNCELw-xYJsPnFxikoKptx11xfL2Kok,95
11
+ dingo/io/export.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
12
+ dingo/io/input.py,sha256=9hNF1dUWEi-Q8N_pTSu4uKxQKCBvRorCi3857tDpTM4,676
13
+ dingo/io/summary.py,sha256=sSJJsgt0CbRLddmBdu4tp6jNJNy9N4tNWy919XjpRjk,691
14
+ dingo/model/__init__.py,sha256=6qIJzuv1C3gh1LiNYU0-a3f3BaqUOCqsLdJVLllog64,55
15
+ dingo/model/model.py,sha256=nlU-nN9e-zjGbJnUK6cBpTk_SzpIKgUrg9FPR2mruuY,7367
16
+ dingo/model/llm/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
17
+ dingo/model/llm/base.py,sha256=mGU2Yqg1qObVkj-vp4Pz-k-aVz9L7P23iam64u3rHjw,259
18
+ dingo/model/llm/gpt.py,sha256=eVrauMhNcsjBieboXEiR-w7yrefdZGdE3iogWyqo6O4,2238
19
+ dingo/model/llm/llama3.py,sha256=MY_hwSQMrZFsTnH4fsjJIDhG_STiC4j8Bbm_rsuVqZg,3300
20
+ dingo/model/llm/perspective.py,sha256=ZvlvET54-JsnoyByw9qvkUdQdyXGKHgB0fu-l_0DkRQ,2131
21
+ dingo/model/llm/common/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
22
+ dingo/model/llm/common/base_llm.py,sha256=_--2KnEorLn2FrY1KQP4y5oiGjUsReZoafurO3GwmVg,15925
23
+ dingo/model/llm/common/base_llm_api.py,sha256=SEmdooDfmDXtNQ8TFzYQWNBvPNmmN8Z0iArrR50tnNM,15533
24
+ dingo/model/llm/common/openai_api.py,sha256=29NXR8cdY-ZEBTVW7OuSB2yu71hg7Vtgg0kp-_yiyX8,8833
25
+ dingo/model/llm/common/turbomind_api.py,sha256=WodFPEL1WXkFgLbCRiRYMw-Qbe-EPbvDlfa1VTKImw0,5568
26
+ dingo/model/rule/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
27
+ dingo/model/rule/base.py,sha256=CajHy0hXtV9jBLpNVzbG1D2Lbp3vT1x1VxxX4n6CsVg,271
28
+ dingo/model/rule/common_rule.py,sha256=yrRrUltn9Hk4hGaKS5bdb_XxtWmqSMwXPFpgqJ3gpUQ,19207
29
+ dingo/model/rule/image_rule.py,sha256=QWrmN38hKzy0piBFIG2FrgoOn_gt9WpN22vNU0IRSn8,2940
30
+ dingo/model/rule/prompt_rule.py,sha256=-5P8kFxT9eoVzI0nBwA5-bAkGOTo5l5xVmzjiuhL7vM,1381
31
+ dingo/model/rule/util.py,sha256=EDQWbQ9PWhVsUUanK2A8D68YQ8n4TGVV4j2IMinbqWw,11779
32
+ dingo/utils/__init__.py,sha256=masgEgU90tbPMKtZz5NF1oraNMrx1xLpHQ9B8QMPm9o,37
33
+ dingo/utils/log_util/__init__.py,sha256=B4SurbYC7MqlI9ILM2_gS4QPLYj_UbyPRQQSpcGccdI,721
34
+ dingo/utils/log_util/logger.py,sha256=jliGVit4mHB17nBeXOqbLHrlEWwuZJsNu_xBDmxr42I,1424
35
+ dingo_python-1.0.dist-info/LICENSE,sha256=QwcOLU5TJoTeUhuIXzhdCEEDDvorGiC6-3YTOl4TecE,11356
36
+ dingo_python-1.0.dist-info/METADATA,sha256=rAZckqEqqu91j5E_fEnMIA10FZP11t5Ykq0nImaemWs,7915
37
+ dingo_python-1.0.dist-info/WHEEL,sha256=GJ7t_kWBFywbagK5eo9IoUwLW6oyOeTKmQ-9iHFVNxQ,92
38
+ dingo_python-1.0.dist-info/top_level.txt,sha256=gSXQSLowu_WOQRi75wK3qyjbHxeN5PqsaA4ChGmJdek,6
39
+ dingo_python-1.0.dist-info/RECORD,,
@@ -0,0 +1,5 @@
1
+ Wheel-Version: 1.0
2
+ Generator: bdist_wheel (0.43.0)
3
+ Root-Is-Purelib: true
4
+ Tag: py3-none-any
5
+
@@ -0,0 +1 @@
1
+ dingo