dqm-ml 1.1.0__tar.gz → 1.1.1__tar.gz
- dqm_ml-1.1.1/PKG-INFO +358 -0
- dqm_ml-1.1.1/README.md +330 -0
- {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/diversity/diversity.py +21 -17
- {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/domain_gap/metrics.py +20 -19
- dqm_ml-1.1.1/dqm/main.py +326 -0
- {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/representativeness/metric.py +114 -103
- {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/representativeness/utils.py +5 -9
- {dqm_ml-1.1.0/dqm/representativeness → dqm_ml-1.1.1/dqm/utils}/twe_logger.py +21 -20
- dqm_ml-1.1.1/dqm_ml.egg-info/PKG-INFO +358 -0
- {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm_ml.egg-info/SOURCES.txt +5 -4
- dqm_ml-1.1.1/dqm_ml.egg-info/entry_points.txt +2 -0
- dqm_ml-1.1.0/requirements.txt → dqm_ml-1.1.1/dqm_ml.egg-info/requires.txt +2 -1
- {dqm_ml-1.1.0 → dqm_ml-1.1.1}/pyproject.toml +6 -3
- dqm_ml-1.1.0/dqm_ml.egg-info/requires.txt → dqm_ml-1.1.1/requirements.txt +1 -0
- dqm_ml-1.1.0/PKG-INFO +0 -180
- dqm_ml-1.1.0/README.md +0 -153
- dqm_ml-1.1.0/dqm/diversity/twe_logger.py +0 -105
- dqm_ml-1.1.0/dqm_ml.egg-info/PKG-INFO +0 -180
- {dqm_ml-1.1.0 → dqm_ml-1.1.1}/LICENSE +0 -0
- {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/__init__.py +0 -0
- {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/completeness/__init__.py +0 -0
- {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/completeness/metric.py +0 -0
- {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/diversity/__init__.py +0 -0
- {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/diversity/metric.py +0 -0
- {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/domain_gap/__init__.py +0 -0
- {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/domain_gap/custom_datasets.py +0 -0
- {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/domain_gap/utils.py +0 -0
- {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/representativeness/__init__.py +0 -0
- /dqm_ml-1.1.0/tests/test_main.py → /dqm_ml-1.1.1/dqm/utils/__init__.py +0 -0
- {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm_ml.egg-info/dependency_links.txt +0 -0
- {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm_ml.egg-info/top_level.txt +0 -0
- {dqm_ml-1.1.0 → dqm_ml-1.1.1}/setup.cfg +0 -0
dqm_ml-1.1.1/PKG-INFO
ADDED
@@ -0,0 +1,358 @@
|
|
1
|
+
Metadata-Version: 2.4
|
2
|
+
Name: dqm-ml
|
3
|
+
Version: 1.1.1
|
4
|
+
Summary: Python library designed to compute data quality metrics for Machine Learning
|
5
|
+
Author: IRT SystemX
|
6
|
+
Author-email: support@irt-systemx.fr
|
7
|
+
License-Expression: MPL-2.0
|
8
|
+
Project-URL: Homepage, https://irt-systemx.github.io/dqm-ml
|
9
|
+
Project-URL: Documentation, https://irt-systemx.github.io/dqm-ml
|
10
|
+
Project-URL: Repository, https://github.com/IRT-SystemX/dqm-ml
|
11
|
+
Requires-Python: >=3.10
|
12
|
+
Description-Content-Type: text/markdown
|
13
|
+
License-File: LICENSE
|
14
|
+
Requires-Dist: matplotlib
|
15
|
+
Requires-Dist: numpy
|
16
|
+
Requires-Dist: pandas
|
17
|
+
Requires-Dist: setuptools
|
18
|
+
Requires-Dist: tqdm
|
19
|
+
Requires-Dist: torch
|
20
|
+
Requires-Dist: torchvision
|
21
|
+
Requires-Dist: scipy
|
22
|
+
Requires-Dist: scikit-learn
|
23
|
+
Requires-Dist: POT
|
24
|
+
Requires-Dist: ipykernel
|
25
|
+
Requires-Dist: seaborn
|
26
|
+
Requires-Dist: pyaml
|
27
|
+
Dynamic: license-file
|
28
|
+
|
29
|
+
<div align="center">
|
30
|
+
<img src="_static/Logo_ConfianceAI.png" width="20%" alt="ConfianceAI Logo" />
|
31
|
+
<h1 style="font-size: large; font-weight: bold;">dqm-ml</h1>
|
32
|
+
</div>
|
33
|
+
|
34
|
+
<div align="center">
|
35
|
+
<a href="#">
|
36
|
+
<img src="https://img.shields.io/badge/Python-3.10-efefef">
|
37
|
+
</a>
|
38
|
+
<a href="#">
|
39
|
+
<img src="https://img.shields.io/badge/Python-3.11-efefef">
|
40
|
+
</a>
|
41
|
+
<a href="#">
|
42
|
+
<img src="https://img.shields.io/badge/Python-3.12-efefef">
|
43
|
+
</a>
|
44
|
+
<a href="#">
|
45
|
+
<img src="https://img.shields.io/badge/License-MPL-2">
|
46
|
+
</a>
|
47
|
+
<a href="_static/pylint/pylint.txt">
|
48
|
+
<img src="_static/pylint/pylint.svg" alt="Pylint Score">
|
49
|
+
</a>
|
50
|
+
<a href="_static/flake8/index.html">
|
51
|
+
<img src="_static/flake8/flake8.svg" alt="Flake8 Report">
|
52
|
+
</a>
|
53
|
+
<a href="_static/coverage/index.html">
|
54
|
+
<img src="_static/coverage/coverage.svg" alt="Coverage report">
|
55
|
+
</a>
|
56
|
+
|
57
|
+
</div>
|
58
|
+
|
59
|
+
<br>
|
60
|
+
<br>
|
61
|
+
|
62
|
+
# Data Quality Metrics
|
63
|
+
|
64
|
+
The current version of the Data Quality Metrics (called **dqm-ml**) computes three data inherent metrics and one data-model dependent metric.
|
65
|
+
|
66
|
+
The data inherent metrics are
|
67
|
+
- **Diversity** : Computes the presence in the dataset of all required information defined in the specification (requirements, Operational Design Domain (ODD), ...).
|
68
|
+
- **Representativeness** : is defined as the conformity of the distribution of the key characteristics of the dataset to a specification (requirements, ODD, ...).
|
69
|
+
- **Completeness** : is defined by the degree to which subject data associated with an entity has values for all expected attributes and related entity instances in a specific context of use.
|
70
|
+
|
71
|
+
The data-model dependent metric is:
|
72
|
+
- **Domain Gap** : In the context of a computer vision task, the Domain Gap (DG) refers to the difference in semantics, textures and shapes between two distributions of images. It can lead to poor performance when a model is trained on one distribution and then applied to another.
|
73
|
+
|
74
|
+
(Definitions from [Confiance.ai program](https://www.confiance.ai/))
|
75
|
+
|
76
|
+
[//]: # (- Coverage : The coverage of a couple "Dataset + ML Model" is the ability of the execution of the ML Model on this dataset to generate elements that match the expected space.)
|
77
|
+
|
78
|
+
For each metric, several approaches are developed to handle as many data types as possible. For more technical and scientific details, please refer to this [deliverable](https://catalog.confiance.ai/records/p46p6-1wt83/files/Scientific_Contribution_For_Data_quality_assessment_metrics_for_Machine_learning_process-v2.pdf?download=1).
|
79
|
+
|
80
|
+
## Project description
|
81
|
+
Several approaches are developed, as described in the figure below.
|
82
|
+
|
83
|
+
<img src="_static/library_view.png" width="1024"/>
|
84
|
+
|
85
|
+
In the current version, the available metrics are:
|
86
|
+
- Representativeness:
|
87
|
+
- $\chi^2$ Goodness of fit test for Uniform and Normal Distributions
|
88
|
+
- Kolmogorov Smirnov test for Uniform and Normal Distributions
|
89
|
+
- Granular and Relative Theoretical Entropy (GRTE), proposed and developed in the Confiance.ai Research Program
|
90
|
+
- Diversity:
|
91
|
+
- Relative Diversity developed and implemented in Confiance.ai Research Program
|
92
|
+
- Gini-Simpson and Simpson indices
|
93
|
+
- Completeness:
|
94
|
+
- Ratio of filled information
|
95
|
+
- Domain Gap:
|
96
|
+
- MMD
|
97
|
+
- CMD
|
98
|
+
- Wasserstein
|
99
|
+
- H-Divergence
|
100
|
+
- FID
|
101
|
+
- Kullback-Leibler Multivariate Normal Distribution
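As an illustration of the diversity indices listed above, here is a minimal sketch of the textbook Simpson and Gini-Simpson indices for one categorical column. It uses only the standard library; the function names are hypothetical and this is not the dqm-ml implementation, whose exact variant is not described here.

```python
from collections import Counter


def simpson_index(values):
    """Sum of squared category proportions.

    This is the probability that two draws with replacement from the
    column fall in the same category (textbook definition, assumed here).
    """
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute a diversity index on an empty column")
    return sum((n / total) ** 2 for n in counts.values())


def gini_simpson_index(values):
    """Complement of the Simpson index: higher means more diverse."""
    return 1.0 - simpson_index(values)


# Hypothetical column with two equally frequent categories.
column = ["car", "car", "truck", "truck"]
print(simpson_index(column), gini_simpson_index(column))
```

A column containing a single category gives a Gini-Simpson index of 0, and the index grows as the categories become more numerous and more balanced.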
|
102
|
+
|
103
|
+
[//]: # (- Coverage : )
|
104
|
+
|
105
|
+
[//]: # ( - Approches developed in Neural Coverage (NCL) given [here](https://github.com/Yuanyuan-Yuan/NeuraL-Coverage). )
|
106
|
+
|
107
|
+
# Getting started
|
108
|
+
|
109
|
+
## Set up a clean virtual environment
|
110
|
+
|
111
|
+
Linux setting:
|
112
|
+
|
113
|
+
```
|
114
|
+
pip install virtualenv
|
115
|
+
virtualenv myenv
|
116
|
+
source myenv/bin/activate
|
117
|
+
```
|
118
|
+
|
119
|
+
Windows setting:
|
120
|
+
|
121
|
+
```
|
122
|
+
pip install virtualenv
|
123
|
+
virtualenv myenv
|
124
|
+
.\myenv\Scripts\activate
|
125
|
+
```
|
126
|
+
|
127
|
+
## Install the library
|
128
|
+
You can install it directly from PyPI using the command:
|
129
|
+
|
130
|
+
````
|
131
|
+
pip install dqm-ml
|
132
|
+
````
|
133
|
+
|
134
|
+
Or you can install it from the source code by running the following command:
|
135
|
+
|
136
|
+
```
|
137
|
+
pip install .
|
138
|
+
```
|
139
|
+
|
140
|
+
## Usage
|
141
|
+
|
142
|
+
There are two ways to use the dqm library:
|
143
|
+
- Import dqm package and call the dqm functions within your python code
|
144
|
+
- In standalone mode, using the command line from a terminal or running the DQM-ML container
|
145
|
+
|
146
|
+
### Standalone mode
|
147
|
+
|
148
|
+
You can use the dqm-ml directly to evaluate your dataset, by using the "dqm-ml" command from your terminal.
|
149
|
+
|
150
|
+
The command line has the following form :
|
151
|
+
|
152
|
+
```dqm-ml --pipeline_config_path path_to_your_pipeline_file --result_file_path path_to_your_result_file```
|
153
|
+
|
154
|
+
This mode requires two user parameters:
|
155
|
+
- pipeline_config_path : A path to a yaml file that defines the evaluation pipeline you want to apply to your datasets
|
156
|
+
- result_file_path : The path to the yaml file in which the computed scores for each metric defined in your pipeline are stored
|
157
|
+
|
158
|
+
For example, if your pipeline file is located at ```examples/pipeline_example.yaml``` and you want your result file to be stored at ```examples/results_pipeline_example.yaml```, you will type in your terminal:
|
159
|
+
|
160
|
+
```dqm-ml --pipeline_config_path "examples/pipeline_example.yaml" --result_file_path "examples/results_pipeline_example.yaml"```
|
161
|
+
|
162
|
+
### Pipeline definition
|
163
|
+
|
164
|
+
A dqm-ml pipeline is a yaml file that contains the list of datasets you want to evaluate and the list of metrics you want to compute on each one.
|
165
|
+
This file has a primary key **pipeline_definition** containing a list of items where each item has the following required fields:
|
166
|
+
- dataset : The path to the dataset you want to evaluate .
|
167
|
+
- domain : The category of metric you want to apply
|
168
|
+
- metrics : The list of metrics to compute on the dataset. (This field is not used for the completeness domain.)
|
169
|
+
|
170
|
+
For the representativeness domain only, the following additional fields are required:
|
171
|
+
- bins :
|
172
|
+
- distribution :
|
173
|
+
|
174
|
+
You can use an optional field:
|
175
|
+
- columns_names : The list of columns of your dataset to which you want to restrict the metric computations. If this field is missing, the metrics are applied to all columns of the given dataset by default
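The field requirements listed above can be checked before launching a pipeline. The sketch below is one possible pre-flight check written for this document: the `REQUIRED_FIELDS` table and the `missing_fields` helper are hypothetical and only encode the rules stated in this README, not the validation that dqm-ml performs internally.

```python
# Required fields per domain, as described in this README (assumption:
# dqm-ml itself may validate pipeline items differently).
REQUIRED_FIELDS = {
    "completeness": ["dataset"],
    "diversity": ["dataset", "metrics"],
    "representativeness": ["dataset", "metrics", "bins", "distribution"],
    "domain_gap": ["metrics"],
}


def missing_fields(item):
    """Return the required fields that are absent from one pipeline item.

    An item whose domain is missing or unknown is reported as ["domain"].
    """
    domain = item.get("domain")
    if domain not in REQUIRED_FIELDS:
        return ["domain"]
    return [name for name in REQUIRED_FIELDS[domain] if name not in item]


# Hypothetical item that forgets the representativeness-specific fields.
item = {
    "domain": "representativeness",
    "dataset": "tests/sample_data/SMD_test_ds_sample.csv",
    "metrics": ["chi-square", "GRTE"],
}
print(missing_fields(item))
```

Running such a check on every entry of `pipeline_definition` reports incomplete items before any metric is computed.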
|
176
|
+
|
177
|
+
The ```dataset``` field can be a path to a single file or a path to a folder. If the path points to a single file, the file content is loaded directly and considered as the final dataset to evaluate. Supported file extensions are "csv, txt, xls, xlsx, pq and parquet". For a csv or txt file, you can set a ```separator``` field to indicate the separator to be used to parse the file.
|
178
|
+
|
179
|
+
If the defined path is a folder, all files within the folder are automatically concatenated along the rows axis to build the final dataset that is considered for the evaluation. For folders, you can use an additional ```extension``` field to concatenate only the files with the specified extension in the target folder. By default, the pipeline tries to concatenate all files present in the folder.
|
180
|
+
|
181
|
+
For example:
|
182
|
+
|
183
|
+
```
|
184
|
+
- domain : "representativeness"
|
185
|
+
extension: "txt"
|
186
|
+
metrics: ["chi-square","GRTE"]
|
187
|
+
bins : 10
|
188
|
+
distribution : "normal"
|
189
|
+
dataset: "tdata/my_data_folder"
|
190
|
+
columns_names : ["col_1", "col_5","col_9"]
|
191
|
+
```
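The file-or-folder rule described above can be sketched with the standard library only. The `load_rows` helper below is hypothetical and handles delimited text files only; dqm-ml's own loader is not shown in this README and very likely differs (for instance, it also supports Excel and parquet files).

```python
import csv
import tempfile
from pathlib import Path


def load_rows(dataset, extension=None, separator=","):
    """Load one delimited file, or concatenate the rows of every file in a folder.

    When `extension` is given, only the files with that suffix are read.
    """
    path = Path(dataset)
    if path.is_file():
        files = [path]
    else:
        files = sorted(p for p in path.iterdir() if p.is_file())
        if extension is not None:
            files = [p for p in files if p.suffix == f".{extension}"]
    rows = []
    for file in files:
        with file.open(newline="") as handle:
            rows.extend(csv.DictReader(handle, delimiter=separator))
    return rows


# Demonstration on a throwaway folder holding two csv files and one txt file.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "part_1.csv").write_text("col_1,col_5\n1,2\n")
    Path(folder, "part_2.csv").write_text("col_1,col_5\n3,4\n")
    Path(folder, "notes.txt").write_text("col_1,col_5\n5,6\n")
    csv_rows = load_rows(folder, extension="csv")
    all_rows = load_rows(folder)
    single_rows = load_rows(Path(folder, "part_1.csv"))

print(len(csv_rows), len(all_rows), len(single_rows))
```

Filtering by extension matters when a folder mixes data files with other content, since every file that is read is assumed to share the same columns.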
|
192
|
+
|
193
|
+
|
194
|
+
For domain gap, because the metrics apply only to image datasets, the definition is quite different. The item has the following fields:
|
195
|
+
- ```domain``` : The name of the domain, here "domain_gap"
|
196
|
+
- ```metrics``` : The list of metrics you want to compute. Each item has two fields:
|
197
|
+
- metric_name : The name of the metric to compute
|
198
|
+
- method_config : The user configuration of the metric. In this part you define the source and target datasets, the chosen models, and other user parameters
|
199
|
+
|
200
|
+
An example pipeline file defining the computation of several metrics from the four domains is given below:
|
201
|
+
```
|
202
|
+
pipeline_definition:
|
203
|
+
- domain : "completeness"
|
204
|
+
dataset : "tests/sample_data/completeness_sample_data.csv"
|
205
|
+
columns_names : ["column_1","column_3","column_6","column_9"]
|
206
|
+
|
207
|
+
- domain : "representativeness"
|
208
|
+
metrics: ["chi-square","GRTE"]
|
209
|
+
bins : 10
|
210
|
+
distribution : normal
|
211
|
+
dataset: "tests/sample_data/SMD_test_ds_sample.csv"
|
212
|
+
columns_names : ["column_2","column_4", "column_6"]
|
213
|
+
|
214
|
+
- domain : "diversity"
|
215
|
+
metrics: ["simpson","gini"]
|
216
|
+
dataset: "tests/sample_data/SMD_test_ds_sample.csv"
|
217
|
+
columns_names : ["column_2","column_4", "column_6"]
|
218
|
+
|
219
|
+
- domain: "domain_gap"
|
220
|
+
metrics:
|
221
|
+
- metric_name: wasserstein
|
222
|
+
method_config:
|
223
|
+
DATA:
|
224
|
+
batch_size: 32
|
225
|
+
height: 299
|
226
|
+
width: 299
|
227
|
+
norm_mean: [0.485,0.456,0.406]
|
228
|
+
norm_std: [0.229,0.224,0.225]
|
229
|
+
source: "tests/sample_data/image_test_ds/c20"
|
230
|
+
target: "tests/sample_data/image_test_ds/c33"
|
231
|
+
MODEL:
|
232
|
+
arch: "resnet18"
|
233
|
+
device: "cpu"
|
234
|
+
n_layer_feature: -2
|
235
|
+
METHOD:
|
236
|
+
name: "fid"
|
237
|
+
```
|
238
|
+
|
239
|
+
The result file produced at the end of this pipeline is a yaml file containing the content of the pipeline configuration file, augmented with a "scores" field in each item that contains the computed metric scores.
|
240
|
+
|
241
|
+
Example of result_score:
|
242
|
+
|
243
|
+
```
|
244
|
+
pipeline_definition:
|
245
|
+
- domain: completeness
|
246
|
+
dataset: tests/sample_data/completeness_sample_data.csv
|
247
|
+
columns_names:
|
248
|
+
- column_1
|
249
|
+
- column_3
|
250
|
+
- column_6
|
251
|
+
- column_9
|
252
|
+
scores:
|
253
|
+
overall_score: 0.61825
|
254
|
+
column_1: 1
|
255
|
+
column_3: 0.782
|
256
|
+
column_6: 0.48
|
257
|
+
column_9: 0.211
|
258
|
+
- domain: representativeness
|
259
|
+
metrics:
|
260
|
+
- chi-square
|
261
|
+
- GRTE
|
262
|
+
bins: 10
|
263
|
+
distribution: normal
|
264
|
+
dataset: tests/sample_data/SMD_test_ds_sample.csv
|
265
|
+
columns_names:
|
266
|
+
- column_2
|
267
|
+
- column_4
|
268
|
+
- column_6
|
269
|
+
scores:
|
270
|
+
chi-square:
|
271
|
+
column_2: 1.8740034461104008e-34
|
272
|
+
column_4: 2.7573644464553625e-86
|
273
|
+
column_6: 3.469236770038776e-64
|
274
|
+
GRTE:
|
275
|
+
column_2: 0.8421470393366073
|
276
|
+
column_4: 0.7615162001699769
|
277
|
+
column_6: 0.6955152215780268
|
278
|
+
```
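In the result example above, the completeness ```overall_score``` (0.61825) coincides with the arithmetic mean of the four column scores. The sketch below computes a ratio of filled values per column and aggregates it the same way. It is an illustration under assumptions: the `completeness_scores` helper is hypothetical, a cell is treated as missing only when it is `None` or an empty string, and the mean aggregation is inferred from this single example rather than from the dqm-ml code.

```python
from statistics import mean


def completeness_scores(rows, columns):
    """Ratio of filled cells per column, plus the mean over the columns.

    rows: list of dicts (one per record). A cell counts as missing when
    it is None or an empty string (assumption made for this sketch).
    """
    scores = {}
    for col in columns:
        filled = sum(1 for row in rows if row.get(col) not in (None, ""))
        scores[col] = filled / len(rows)
    scores["overall_score"] = mean(scores[col] for col in columns)
    return scores


# Column scores copied from the result example above.
example_scores = {"column_1": 1, "column_3": 0.782, "column_6": 0.48, "column_9": 0.211}
example_mean = mean(example_scores.values())

# Hypothetical two-row dataset with one empty cell.
rows = [{"a": 1, "b": ""}, {"a": 2, "b": 3}]
toy_scores = completeness_scores(rows, ["a", "b"])
print(example_mean, toy_scores)
```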
|
279
|
+
|
280
|
+
To create your own pipeline definition, it is advised to start from one of the existing pipeline examples in the ```examples/``` folder.
|
281
|
+
|
282
|
+
### Use the dockerized version
|
283
|
+
|
284
|
+
To build the docker image locally, use the following command from the root folder of the repository:
|
285
|
+
|
286
|
+
```docker build . -f dockerfile -t your_image_name:tag```
|
287
|
+
|
288
|
+
The command line to run the dqm container has the following form :
|
289
|
+
|
290
|
+
```docker run -e PIPELINE_CONFIG_PATH="path_to_your_pipeline_file" -e RESULT_FILE_PATH="path_to_the_result_file" irtsystemx/dqm-ml:1.1.1```
|
291
|
+
|
292
|
+
You need to mount the ```$PIPELINE_CONFIG_PATH``` path to ```/tmp/in/$PIPELINE_CONFIG_PATH``` and the ```$RESULT_FILE_PATH``` path to ```/tmp/out/$RESULT_FILE_PATH```.
|
293
|
+
Moreover, all dataset directories referenced in your pipeline file must be mounted in the container.
|
294
|
+
|
295
|
+
For example, if your pipeline file is stored at ```examples/pipeline_example_docker.yaml```, you want your result file to be stored at ```results_docker/result_file.yaml```,
|
296
|
+
and all the datasets used in your pipeline are stored locally in the ```/tests``` folder and referenced under ```data_storage/..``` in your pipeline file
|
297
|
+
|
298
|
+
The command would be :
|
299
|
+
|
300
|
+
```docker run -e PIPELINE_CONFIG_PATH="pipeline_example_docker.yaml" -e RESULT_FILE_PATH="result_file.yaml" -v ${PWD}/examples:/tmp/in -v ${PWD}/tests/:/data_storage/ -v ${PWD}/results_docker:/tmp/out irtsystemx/dqm-ml:1.1.1```
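The mount convention used in the command above (pipeline folder on ```/tmp/in```, result folder on ```/tmp/out```, file names passed through the two environment variables) can be assembled programmatically. The `build_docker_command` helper below is a hypothetical convenience written for this document, not part of dqm-ml; it only builds the argument list and does not start a container.

```python
import shlex
from pathlib import Path

IMAGE = "irtsystemx/dqm-ml:1.1.1"  # image tag quoted in this README


def build_docker_command(pipeline_file, result_file, dataset_mounts, image=IMAGE):
    """Build the `docker run` argument list for one pipeline file.

    dataset_mounts maps host folders to the container paths used inside
    the pipeline file (for example {"tests": "/data_storage/"}).
    """
    pipeline_file = Path(pipeline_file).absolute()
    result_file = Path(result_file).absolute()
    cmd = [
        "docker", "run",
        "-e", f"PIPELINE_CONFIG_PATH={pipeline_file.name}",
        "-e", f"RESULT_FILE_PATH={result_file.name}",
        "-v", f"{pipeline_file.parent}:/tmp/in",
    ]
    for host_dir, container_dir in dataset_mounts.items():
        cmd += ["-v", f"{Path(host_dir).absolute()}:{container_dir}"]
    cmd += ["-v", f"{result_file.parent}:/tmp/out", image]
    return cmd


# Hypothetical absolute host paths standing in for ${PWD}.
cmd = build_docker_command(
    "/work/examples/pipeline_example_docker.yaml",
    "/work/results_docker/result_file.yaml",
    {"/work/tests": "/data_storage/"},
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where Docker is installed.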
|
301
|
+
|
302
|
+
### Users behind a proxy server
|
303
|
+
|
304
|
+
The computation of domain gap metrics requires the use of pretrained models that are automatically downloaded by pytorch in a local cache directory during the first call of those metrics.
|
305
|
+
|
306
|
+
For users behind a proxy server, this download could fail. To overcome this issue, you can manually get those pretrained models by downloading the zip archive from this [link](http://minio-storage.apps.confianceai-public.irtsysx.fr/ml-models//dqm-ml_pretrained_models.zip) and extracting it into the following folder: ```your_user_directory/.cache/torch/hub/checkpoints/```
|
307
|
+
|
308
|
+
### Use the library within your python code
|
309
|
+
|
310
|
+
[//]: # (All validated and verified functions are detailed in the files **call_main.py**. )
|
311
|
+
|
312
|
+
To use a metric, import the corresponding module and class into your code.
|
313
|
+
For more information about each metric, refer to the specific README.md in ```dqm/<metric_name>``` subfolders
|
314
|
+
|
315
|
+
## Available examples
|
316
|
+
|
317
|
+
Many examples of DQM-ML applications are available in the folder ```/examples```
|
318
|
+
|
319
|
+
You will find :
|
320
|
+
|
321
|
+
2 jupyter_notebooks:
|
322
|
+
|
323
|
+
- **multiple_metrics_tests.ipynb** : A notebook applying completeness, diversity and representativeness metrics on an example dataset.
|
324
|
+
- **domain_gap.ipynb** : A notebook demonstrating an example of applying domain_gap metrics to a generated synthetic dataset.
|
325
|
+
|
326
|
+
4 python scripts:
|
327
|
+
|
328
|
+
These scripts, named **main_X.py**, give an example of computing the approaches implemented for metric <X> on samples.
|
329
|
+
|
330
|
+
The ```main_domain_gap.py``` script must be called with a config file passed as an argument using ```--cfg```.
|
331
|
+
|
332
|
+
For example:
|
333
|
+
|
334
|
+
``` python examples/main_domain_gap.py --cfg examples/domain_gap_cfg/cmd/cmd.json```
|
335
|
+
|
336
|
+
We provide in the folder ```/examples/domain_gap_cfg``` a set of config files for each domain_gap approach:
|
337
|
+
|
338
|
+
For some domain_gap examples, the **200_bird_dataset** will be required. It can be downloaded from this [link](http://minio-storage.apps.confianceai-public.irtsysx.fr/ml-models/200-birds-species.zip). The zip archive shall be extracted into the ```examples/datasets/``` folder.
|
339
|
+
|
340
|
+
1 pipeline example that instantiates every metric implemented in dqm-ml, named ```pipeline_example.yaml```, and its corresponding results ```results_pipeline_example.yaml```.
|
341
|
+
|
342
|
+
1 pipeline example similar to the previous one, but with different dataset paths, as shown in the example of how to use the containerized version.
|
343
|
+
|
344
|
+
## References
|
345
|
+
|
346
|
+
```
|
347
|
+
@inproceedings{chaouche2024dqm,
|
348
|
+
title={DQM: Data Quality Metrics for AI components in the industry},
|
349
|
+
author={Chaouche, Sabrina and Randon, Yoann and Adjed, Faouzi and Boudjani, Nadira and Khedher, Mohamed Ibn},
|
350
|
+
booktitle={Proceedings of the AAAI Symposium Series},
|
351
|
+
volume={4},
|
352
|
+
number={1},
|
353
|
+
pages={24--31},
|
354
|
+
year={2024}
|
355
|
+
}
|
356
|
+
```
|
357
|
+
|
358
|
+
[HAL link](https://hal.science/hal-04719346v1)
|
dqm_ml-1.1.1/README.md
ADDED
@@ -0,0 +1,330 @@
|
|
1
|
+
<div align="center">
|
2
|
+
<img src="_static/Logo_ConfianceAI.png" width="20%" alt="ConfianceAI Logo" />
|
3
|
+
<h1 style="font-size: large; font-weight: bold;">dqm-ml</h1>
|
4
|
+
</div>
|
5
|
+
|
6
|
+
<div align="center">
|
7
|
+
<a href="#">
|
8
|
+
<img src="https://img.shields.io/badge/Python-3.10-efefef">
|
9
|
+
</a>
|
10
|
+
<a href="#">
|
11
|
+
<img src="https://img.shields.io/badge/Python-3.11-efefef">
|
12
|
+
</a>
|
13
|
+
<a href="#">
|
14
|
+
<img src="https://img.shields.io/badge/Python-3.12-efefef">
|
15
|
+
</a>
|
16
|
+
<a href="#">
|
17
|
+
<img src="https://img.shields.io/badge/License-MPL-2">
|
18
|
+
</a>
|
19
|
+
<a href="_static/pylint/pylint.txt">
|
20
|
+
<img src="_static/pylint/pylint.svg" alt="Pylint Score">
|
21
|
+
</a>
|
22
|
+
<a href="_static/flake8/index.html">
|
23
|
+
<img src="_static/flake8/flake8.svg" alt="Flake8 Report">
|
24
|
+
</a>
|
25
|
+
<a href="_static/coverage/index.html">
|
26
|
+
<img src="_static/coverage/coverage.svg" alt="Coverage report">
|
27
|
+
</a>
|
28
|
+
|
29
|
+
</div>
|
30
|
+
|
31
|
+
<br>
|
32
|
+
<br>
|
33
|
+
|
34
|
+
# Data Quality Metrics
|
35
|
+
|
36
|
+
The current version of the Data Quality Metrics (called **dqm-ml**) computes three data inherent metrics and one data-model dependent metric.
|
37
|
+
|
38
|
+
The data inherent metrics are
|
39
|
+
- **Diversity** : Computes the presence in the dataset of all required information defined in the specification (requirements, Operational Design Domain (ODD), ...).
|
40
|
+
- **Representativeness** : is defined as the conformity of the distribution of the key characteristics of the dataset to a specification (requirements, ODD, ...).
|
41
|
+
- **Completeness** : is defined by the degree to which subject data associated with an entity has values for all expected attributes and related entity instances in a specific context of use.
|
42
|
+
|
43
|
+
The data-model dependent metric is:
|
44
|
+
- **Domain Gap** : In the context of a computer vision task, the Domain Gap (DG) refers to the difference in semantics, textures and shapes between two distributions of images. It can lead to poor performance when a model is trained on one distribution and then applied to another.
|
45
|
+
|
46
|
+
(Definitions from [Confiance.ai program](https://www.confiance.ai/))
|
47
|
+
|
48
|
+
[//]: # (- Coverage : The coverage of a couple "Dataset + ML Model" is the ability of the execution of the ML Model on this dataset to generate elements that match the expected space.)
|
49
|
+
|
50
|
+
For each metric, several approaches are developed to handle as many data types as possible. For more technical and scientific details, please refer to this [deliverable](https://catalog.confiance.ai/records/p46p6-1wt83/files/Scientific_Contribution_For_Data_quality_assessment_metrics_for_Machine_learning_process-v2.pdf?download=1).
|
51
|
+
|
52
|
+
## Project description
|
53
|
+
Several approaches are developed, as described in the figure below.
|
54
|
+
|
55
|
+
<img src="_static/library_view.png" width="1024"/>
|
56
|
+
|
57
|
+
In the current version, the available metrics are:
|
58
|
+
- Representativeness:
|
59
|
+
- $\chi^2$ Goodness of fit test for Uniform and Normal Distributions
|
60
|
+
- Kolmogorov Smirnov test for Uniform and Normal Distributions
|
61
|
+
- Granular and Relative Theoretical Entropy (GRTE), proposed and developed in the Confiance.ai Research Program
|
62
|
+
- Diversity:
|
63
|
+
- Relative Diversity developed and implemented in Confiance.ai Research Program
|
64
|
+
- Gini-Simpson and Simpson indices
|
65
|
+
- Completeness:
|
66
|
+
- Ratio of filled information
|
67
|
+
- Domain Gap:
|
68
|
+
- MMD
|
69
|
+
- CMD
|
70
|
+
- Wasserstein
|
71
|
+
- H-Divergence
|
72
|
+
- FID
|
73
|
+
- Kullback-Leibler Multivariate Normal Distribution
|
74
|
+
|
75
|
+
[//]: # (- Coverage : )
|
76
|
+
|
77
|
+
[//]: # ( - Approches developed in Neural Coverage (NCL) given [here](https://github.com/Yuanyuan-Yuan/NeuraL-Coverage). )
|
78
|
+
|
79
|
+
# Getting started
|
80
|
+
|
81
|
+
## Set up a clean virtual environment
|
82
|
+
|
83
|
+
Linux setting:
|
84
|
+
|
85
|
+
```
|
86
|
+
pip install virtualenv
|
87
|
+
virtualenv myenv
|
88
|
+
source myenv/bin/activate
|
89
|
+
```
|
90
|
+
|
91
|
+
Windows setting:
|
92
|
+
|
93
|
+
```
|
94
|
+
pip install virtualenv
|
95
|
+
virtualenv myenv
|
96
|
+
.\myenv\Scripts\activate
|
97
|
+
```
|
98
|
+
|
99
|
+
## Install the library
|
100
|
+
You can install it directly from PyPI using the command:
|
101
|
+
|
102
|
+
````
|
103
|
+
pip install dqm-ml
|
104
|
+
````
|
105
|
+
|
106
|
+
Or you can install it from the source code by running the following command:
|
107
|
+
|
108
|
+
```
|
109
|
+
pip install .
|
110
|
+
```
|
111
|
+
|
112
|
+
## Usage
|
113
|
+
|
114
|
+
There are two ways to use the dqm library:
|
115
|
+
- Import dqm package and call the dqm functions within your python code
|
116
|
+
- In standalone mode, using the command line from a terminal or running the DQM-ML container
|
117
|
+
|
118
|
+
### Standalone mode
|
119
|
+
|
120
|
+
You can use the dqm-ml directly to evaluate your dataset, by using the "dqm-ml" command from your terminal.
|
121
|
+
|
122
|
+
The command line has the following form :
|
123
|
+
|
124
|
+
```dqm-ml --pipeline_config_path path_to_your_pipeline_file --result_file_path path_to_your_result_file```
|
125
|
+
|
126
|
+
This mode requires two user parameters:
|
127
|
+
- pipeline_config_path : A path to a yaml file that defines the evaluation pipeline you want to apply to your datasets
|
128
|
+
- result_file_path : The path to the yaml file in which the computed scores for each metric defined in your pipeline are stored
|
129
|
+
|
130
|
+
For example, if your pipeline file is located at ```examples/pipeline_example.yaml``` and you want your result file to be stored at ```examples/results_pipeline_example.yaml```, you will type in your terminal:
|
131
|
+
|
132
|
+
```dqm-ml --pipeline_config_path "examples/pipeline_example.yaml" --result_file_path "examples/results_pipeline_example.yaml"```
|
133
|
+
|
134
|
+
### Pipeline definition
|
135
|
+
|
136
|
+
A dqm-ml pipeline is a yaml file that contains the list of datasets you want to evaluate and the list of metrics you want to compute on each one.
|
137
|
+
This file has a primary key **pipeline_definition** containing a list of items where each item has the following required fields:
|
138
|
+
- dataset : The path to the dataset you want to evaluate .
|
139
|
+
- domain : The category of metric you want to apply
|
140
|
+
- metrics : The list of metrics to compute on the dataset. (This field is not used for the completeness domain.)
|
141
|
+
|
142
|
+
For the representativeness domain only, the following additional fields are required:
|
143
|
+
- bins :
|
144
|
+
- distribution :
|
145
|
+
|
146
|
+
You can use an optional field:
|
147
|
+
- columns_names : The list of columns of your dataset to which you want to restrict the metric computations. If this field is missing, the metrics are applied to all columns of the given dataset by default
|
148
|
+
|
149
|
+
The ```dataset``` field can be a path to a single file or a path to a folder. If the path points to a single file, the file content is loaded directly and considered as the final dataset to evaluate. Supported file extensions are "csv, txt, xls, xlsx, pq and parquet". For a csv or txt file, you can set a ```separator``` field to indicate the separator to be used to parse the file.
|
150
|
+
|
151
|
+
If the defined path is a folder, all files within the folder are automatically concatenated along the rows axis to build the final dataset that is considered for the evaluation. For folders, you can use an additional ```extension``` field to concatenate only the files with the specified extension in the target folder. By default, the pipeline tries to concatenate all files present in the folder.
|
152
|
+
|
153
|
+
For example:
|
154
|
+
|
155
|
+
```
|
156
|
+
- domain : "representativeness"
|
157
|
+
extension: "txt"
|
158
|
+
metrics: ["chi-square","GRTE"]
|
159
|
+
bins : 10
|
160
|
+
distribution : "normal"
|
161
|
+
dataset: "tdata/my_data_folder"
|
162
|
+
columns_names : ["col_1", "col_5","col_9"]
|
163
|
+
```
|
164
|
+
|
165
|
+
|
166
|
+
For domain gap, because the metrics apply only to image datasets, the definition is quite different. The item has the following fields:
|
167
|
+
- ```domain``` : The name of the domain, here "domain_gap"
|
168
|
+
- ```metrics``` : The list of metrics you want to compute. Each item has two fields:
|
169
|
+
- metric_name : The name of the metric to compute
|
170
|
+
- method_config : The user configuration of the metric. In this part you define the source and target datasets, the chosen models, and other user parameters
|
171
|
+
|
172
|
+
An example pipeline file defining the computation of several metrics from the four domains is given below:
|
173
|
+
```
|
174
|
+
pipeline_definition:
|
175
|
+
- domain : "completeness"
|
176
|
+
dataset : "tests/sample_data/completeness_sample_data.csv"
|
177
|
+
columns_names : ["column_1","column_3","column_6","column_9"]
|
178
|
+
|
179
|
+
- domain : "representativeness"
|
180
|
+
metrics: ["chi-square","GRTE"]
|
181
|
+
bins : 10
|
182
|
+
distribution : normal
|
183
|
+
dataset: "tests/sample_data/SMD_test_ds_sample.csv"
|
184
|
+
columns_names : ["column_2","column_4", "column_6"]
|
185
|
+
|
186
|
+
- domain : "diversity"
|
187
|
+
metrics: ["simpson","gini"]
|
188
|
+
dataset: "tests/sample_data/SMD_test_ds_sample.csv"
|
189
|
+
columns_names : ["column_2","column_4", "column_6"]
|
190
|
+
|
191
|
+
- domain: "domain_gap"
|
192
|
+
metrics:
|
193
|
+
- metric_name: wasserstein
|
194
|
+
method_config:
|
195
|
+
DATA:
|
196
|
+
batch_size: 32
|
197
|
+
height: 299
|
198
|
+
width: 299
|
199
|
+
norm_mean: [0.485,0.456,0.406]
|
200
|
+
norm_std: [0.229,0.224,0.225]
|
201
|
+
source: "tests/sample_data/image_test_ds/c20"
|
202
|
+
target: "tests/sample_data/image_test_ds/c33"
|
203
|
+
MODEL:
|
204
|
+
arch: "resnet18"
|
205
|
+
device: "cpu"
|
206
|
+
n_layer_feature: -2
|
207
|
+
METHOD:
|
208
|
+
name: "fid"
|
209
|
+
```
|
210
|
+
|
211
|
+
The result file produced at the end of this pipeline is a yaml file containing the content of the pipeline configuration file, augmented with a "scores" field in each item that contains the computed metric scores.
|
212
|
+
|
213
|
+
Example of result_score:
|
214
|
+
|
215
|
+
```
|
216
|
+
pipeline_definition:
|
217
|
+
- domain: completeness
|
218
|
+
dataset: tests/sample_data/completeness_sample_data.csv
|
219
|
+
columns_names:
|
220
|
+
- column_1
|
221
|
+
- column_3
|
222
|
+
- column_6
|
223
|
+
- column_9
|
224
|
+
scores:
|
225
|
+
overall_score: 0.61825
|
226
|
+
column_1: 1
|
227
|
+
column_3: 0.782
|
228
|
+
column_6: 0.48
|
229
|
+
column_9: 0.211
|
230
|
+
- domain: representativeness
|
231
|
+
metrics:
|
232
|
+
- chi-square
|
233
|
+
- GRTE
|
234
|
+
bins: 10
|
235
|
+
distribution: normal
|
236
|
+
dataset: tests/sample_data/SMD_test_ds_sample.csv
|
237
|
+
columns_names:
|
238
|
+
- column_2
|
239
|
+
- column_4
|
240
|
+
- column_6
|
241
|
+
scores:
|
242
|
+
chi-square:
|
243
|
+
column_2: 1.8740034461104008e-34
|
244
|
+
column_4: 2.7573644464553625e-86
|
245
|
+
column_6: 3.469236770038776e-64
|
246
|
+
GRTE:
|
247
|
+
column_2: 0.8421470393366073
|
248
|
+
column_4: 0.7615162001699769
|
249
|
+
column_6: 0.6955152215780268
|
250
|
+
```
|
251
|
+
|
252
|
+
To create your own pipeline definition, it is advised to start from one of the existing pipeline examples in the ```examples/``` folder.
|
253
|
+
|
254
|
+
### Use the dockerized version
|
255
|
+
|
256
|
+
To build the docker image locally, use the following command from the root folder of the repository:
|
257
|
+
|
258
|
+
```docker build . -f dockerfile -t your_image_name:tag```
|
259
|
+
|
260
|
+
The command line to run the dqm container has the following form :
|
261
|
+
|
262
|
+
```docker run -e PIPELINE_CONFIG_PATH="path_to_your_pipeline_file" -e RESULT_FILE_PATH="path_to_the_result_file" irtsystemx/dqm-ml:1.1.1```
|
263
|
+
|
264
|
+
You need to mount the ```$PIPELINE_CONFIG_PATH``` path to ```/tmp/in/$PIPELINE_CONFIG_PATH``` and the ```$RESULT_FILE_PATH``` path to ```/tmp/out/$RESULT_FILE_PATH```.
|
265
|
+
Moreover, all dataset directories referenced in your pipeline file must be mounted in the container.
|
266
|
+
|
267
|
+
For example, if your pipeline file is stored at ```examples/pipeline_example_docker.yaml``` and you want your result file to be stored at ```results_docker/result_file.yaml```
|
268
|
+
and all the datasets used in your pipeline are stored locally in the ```/tests``` folder and referenced under ```data_storage/..``` in your pipeline file
|
269
|
+
|
270
|
+
The command would be:
|
271
|
+
|
272
|
+
```docker run -e PIPELINE_CONFIG_PATH="pipeline_example_docker.yaml" -e RESULT_FILE_PATH="result_file.yaml" -v ${PWD}/examples:/tmp/in -v ${PWD}/tests/:/data_storage/ -v ${PWD}/results_docker:/tmp/out irtsystemx/dqm-ml:1.1.1```
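The mapping between host paths and container paths in this command can also be assembled programmatically. The following is a minimal sketch, not something shipped with dqm-ml: `build_docker_command` is a hypothetical helper that derives the `-e` and `-v` arguments from the host locations of the pipeline file, the result file and the datasets folder, assuming the `/tmp/in`, `/tmp/out` and `/data_storage/` container paths used above.

```python
from pathlib import Path
from typing import List


def build_docker_command(
    config_file: Path,
    result_file: Path,
    datasets_dir: Path,
    datasets_mount: str = "/data_storage/",
    image: str = "irtsystemx/dqm-ml:1.1.1",
) -> List[str]:
    """Assemble the `docker run` argument list for the dqm-ml image."""
    # Docker bind mounts need absolute host paths.
    config_file = config_file.resolve()
    result_file = result_file.resolve()
    return [
        "docker", "run",
        # File names are given relative to the mounted /tmp/in and /tmp/out folders.
        "-e", f"PIPELINE_CONFIG_PATH={config_file.name}",
        "-e", f"RESULT_FILE_PATH={result_file.name}",
        "-v", f"{config_file.parent}:/tmp/in",
        "-v", f"{datasets_dir.resolve()}:{datasets_mount}",
        "-v", f"{result_file.parent}:/tmp/out",
        image,
    ]


cmd = build_docker_command(
    Path("examples/pipeline_example_docker.yaml"),
    Path("results_docker/result_file.yaml"),
    Path("tests"),
)
print(" ".join(cmd))
# To actually launch the container: subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting issues when a path contains spaces.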
|
273
|
+
|
274
|
+
### Users behind a proxy server
|
275
|
+
|
276
|
+
Computing domain gap metrics requires pretrained models, which PyTorch downloads automatically into a local cache directory the first time those metrics are called.
|
277
|
+
|
278
|
+
For users behind a proxy server, this download could fail. To overcome this issue, you can get those pretrained models manually by downloading the zip archive from this [link](http://minio-storage.apps.confianceai-public.irtsysx.fr/ml-models//dqm-ml_pretrained_models.zip) and extracting it into the following folder: ```your_user_directory/.cache/torch/hub/checkpoints/```
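The extraction step can be scripted. The sketch below is an assumption-laden illustration rather than a dqm-ml utility: `install_pretrained_models` is a hypothetical helper, the archive is assumed to have been downloaded beforehand, and the model files are assumed to sit at the top level of the zip (if they are in a subfolder, they would need to be moved into the checkpoints folder afterwards).

```python
import zipfile
from pathlib import Path
from typing import List, Optional


def install_pretrained_models(archive: Path, checkpoints_dir: Optional[Path] = None) -> List[str]:
    """Extract a manually downloaded model archive into the torch hub cache."""
    if checkpoints_dir is None:
        # Folder named in the text above, relative to the user's home directory.
        checkpoints_dir = Path.home() / ".cache" / "torch" / "hub" / "checkpoints"
    checkpoints_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(checkpoints_dir)
        return zf.namelist()


# Example call (the local archive location is hypothetical):
# install_pretrained_models(Path("dqm-ml_pretrained_models.zip"))
```

The function returns the names of the extracted entries so that the caller can check which model files were installed.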
|
279
|
+
|
280
|
+
### Use the library within your python code
|
281
|
+
|
282
|
+
[//]: # (All validated and verified functions are detailed in the files **call_main.py**. )
|
283
|
+
|
284
|
+
Each metric is used by importing the corresponding module and class into your code.
|
285
|
+
For more information about each metric, refer to the specific README.md in ```dqm/<metric_name>``` subfolders
|
286
|
+
|
287
|
+
## Available examples
|
288
|
+
|
289
|
+
Many examples of DQM-ML applications are available in the ```/examples``` folder.
|
290
|
+
|
291
|
+
You will find :
|
292
|
+
|
293
|
+
2 Jupyter notebooks:
|
294
|
+
|
295
|
+
- **multiple_metrics_tests.ipynb** : A notebook applying completeness, diversity and representativeness metrics on an example dataset.
|
296
|
+
- **domain_gap.ipynb** : A notebook demonstrating an example of applying domain_gap metrics to a generated synthetic dataset.
|
297
|
+
|
298
|
+
4 python scripts:
|
299
|
+
|
300
|
+
The scripts named **main_X.py** each give an example of computing the approaches implemented for metric <X> on sample data.
|
301
|
+
|
302
|
+
The ```main_domain_gap.py``` script must be called with a config file passed as an argument using ```--cfg```.
|
303
|
+
|
304
|
+
For example:
|
305
|
+
|
306
|
+
``` python examples/main_domain_gap.py --cfg examples/domain_gap_cfg/cmd/cmd.json```
|
307
|
+
|
308
|
+
We provide in the ```/examples/domain_gap_cfg``` folder a set of config files for each domain_gap approach.
|
309
|
+
|
310
|
+
For some domain_gap examples, the **200_bird_dataset** will be required. It can be downloaded from this [link](http://minio-storage.apps.confianceai-public.irtsysx.fr/ml-models/200-birds-species.zip). The zip archive shall be extracted into the ```examples/datasets/``` folder.
|
311
|
+
|
312
|
+
1 pipeline example, named ```pipeline_example.yaml```, that instantiates every metric implemented in dqm-ml, together with its corresponding results file ```results_pipeline_example.yaml```.
|
313
|
+
|
314
|
+
1 pipeline example similar to the previous one but with different dataset paths, as used in the example showing how to run the containerized version.
|
315
|
+
|
316
|
+
## References
|
317
|
+
|
318
|
+
```
|
319
|
+
@inproceedings{chaouche2024dqm,
|
320
|
+
title={DQM: Data Quality Metrics for AI components in the industry},
|
321
|
+
author={Chaouche, Sabrina and Randon, Yoann and Adjed, Faouzi and Boudjani, Nadira and Khedher, Mohamed Ibn},
|
322
|
+
booktitle={Proceedings of the AAAI Symposium Series},
|
323
|
+
volume={4},
|
324
|
+
number={1},
|
325
|
+
pages={24--31},
|
326
|
+
year={2024}
|
327
|
+
}
|
328
|
+
```
|
329
|
+
|
330
|
+
[HAL link](https://hal.science/hal-04719346v1)
|