dqm-ml 1.1.0__tar.gz → 1.1.1__tar.gz

Files changed (32)
  1. dqm_ml-1.1.1/PKG-INFO +358 -0
  2. dqm_ml-1.1.1/README.md +330 -0
  3. {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/diversity/diversity.py +21 -17
  4. {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/domain_gap/metrics.py +20 -19
  5. dqm_ml-1.1.1/dqm/main.py +326 -0
  6. {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/representativeness/metric.py +114 -103
  7. {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/representativeness/utils.py +5 -9
  8. {dqm_ml-1.1.0/dqm/representativeness → dqm_ml-1.1.1/dqm/utils}/twe_logger.py +21 -20
  9. dqm_ml-1.1.1/dqm_ml.egg-info/PKG-INFO +358 -0
  10. {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm_ml.egg-info/SOURCES.txt +5 -4
  11. dqm_ml-1.1.1/dqm_ml.egg-info/entry_points.txt +2 -0
  12. dqm_ml-1.1.0/requirements.txt → dqm_ml-1.1.1/dqm_ml.egg-info/requires.txt +2 -1
  13. {dqm_ml-1.1.0 → dqm_ml-1.1.1}/pyproject.toml +6 -3
  14. dqm_ml-1.1.0/dqm_ml.egg-info/requires.txt → dqm_ml-1.1.1/requirements.txt +1 -0
  15. dqm_ml-1.1.0/PKG-INFO +0 -180
  16. dqm_ml-1.1.0/README.md +0 -153
  17. dqm_ml-1.1.0/dqm/diversity/twe_logger.py +0 -105
  18. dqm_ml-1.1.0/dqm_ml.egg-info/PKG-INFO +0 -180
  19. {dqm_ml-1.1.0 → dqm_ml-1.1.1}/LICENSE +0 -0
  20. {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/__init__.py +0 -0
  21. {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/completeness/__init__.py +0 -0
  22. {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/completeness/metric.py +0 -0
  23. {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/diversity/__init__.py +0 -0
  24. {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/diversity/metric.py +0 -0
  25. {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/domain_gap/__init__.py +0 -0
  26. {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/domain_gap/custom_datasets.py +0 -0
  27. {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/domain_gap/utils.py +0 -0
  28. {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm/representativeness/__init__.py +0 -0
  29. /dqm_ml-1.1.0/tests/test_main.py → /dqm_ml-1.1.1/dqm/utils/__init__.py +0 -0
  30. {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm_ml.egg-info/dependency_links.txt +0 -0
  31. {dqm_ml-1.1.0 → dqm_ml-1.1.1}/dqm_ml.egg-info/top_level.txt +0 -0
  32. {dqm_ml-1.1.0 → dqm_ml-1.1.1}/setup.cfg +0 -0
dqm_ml-1.1.1/PKG-INFO ADDED
@@ -0,0 +1,358 @@
Metadata-Version: 2.4
Name: dqm-ml
Version: 1.1.1
Summary: Python library designed to compute data quality metrics for Machine Learning
Author: IRT SystemX
Author-email: support@irt-systemx.fr
License-Expression: MPL-2.0
Project-URL: Homepage, https://irt-systemx.github.io/dqm-ml
Project-URL: Documentation, https://irt-systemx.github.io/dqm-ml
Project-URL: Repository, https://github.com/IRT-SystemX/dqm-ml
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: matplotlib
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: setuptools
Requires-Dist: tqdm
Requires-Dist: torch
Requires-Dist: torchvision
Requires-Dist: scipy
Requires-Dist: scikit-learn
Requires-Dist: POT
Requires-Dist: ipykernel
Requires-Dist: seaborn
Requires-Dist: pyaml
Dynamic: license-file

<div align="center">
<img src="_static/Logo_ConfianceAI.png" width="20%" alt="ConfianceAI Logo" />
<h1 style="font-size: large; font-weight: bold;">dqm-ml</h1>
</div>

<div align="center">
<a href="#">
<img src="https://img.shields.io/badge/Python-3.10-efefef">
</a>
<a href="#">
<img src="https://img.shields.io/badge/Python-3.11-efefef">
</a>
<a href="#">
<img src="https://img.shields.io/badge/Python-3.12-efefef">
</a>
<a href="#">
<img src="https://img.shields.io/badge/License-MPL-2">
</a>
<a href="_static/pylint/pylint.txt">
<img src="_static/pylint/pylint.svg" alt="Pylint Score">
</a>
<a href="_static/flake8/index.html">
<img src="_static/flake8/flake8.svg" alt="Flake8 Report">
</a>
<a href="_static/coverage/index.html">
<img src="_static/coverage/coverage.svg" alt="Coverage report">
</a>

</div>

<br>
<br>

# Data Quality Metrics

The current version of Data Quality Metrics (**dqm-ml**) computes three data-inherent metrics and one data-model-dependent metric.

The data-inherent metrics are:
- **Diversity**: measures the presence in the dataset of all required information defined in the specification (requirements, Operational Design Domain (ODD), etc.).
- **Representativeness**: the conformity of the distribution of the key characteristics of the dataset to a specification (requirements, ODD, etc.).
- **Completeness**: the degree to which the data associated with an entity has values for all expected attributes and related entity instances in a specific context of use.

The data-model-dependent metric is:
- **Domain Gap**: in the context of a computer vision task, the domain gap (DG) refers to the difference in semantics, textures, and shapes between two distributions of images; it can lead to poor performance when a model trained on one distribution is applied to another.

(Definitions from the [Confiance.ai program](https://www.confiance.ai/))

[//]: # (- Coverage : The coverage of a couple "Dataset + ML Model" is the ability of the execution of the ML Model on this dataset to generate elements that match the expected space.)

For each metric, several approaches are implemented to handle as many data types as possible. For more technical and scientific details, please refer to this [deliverable](https://catalog.confiance.ai/records/p46p6-1wt83/files/Scientific_Contribution_For_Data_quality_assessment_metrics_for_Machine_learning_process-v2.pdf?download=1).

## Project description
Several approaches are implemented, as described in the figure below.

<img src="_static/library_view.png" width="1024"/>

In the current version, the available metrics are:
- Representativeness:
  - $\chi^2$ goodness-of-fit test for uniform and normal distributions
  - Kolmogorov-Smirnov test for uniform and normal distributions
  - Granular and Relative Theoretical Entropy (GRTE), proposed and developed in the Confiance.ai research program
- Diversity:
  - Relative diversity, developed and implemented in the Confiance.ai research program
  - Gini-Simpson and Simpson indices (see the sketch below)
- Completeness:
  - Ratio of filled information (see the sketch below)
- Domain Gap:
  - MMD
  - CMD
  - Wasserstein
  - H-Divergence
  - FID
  - Kullback-Leibler divergence for multivariate normal distributions

[//]: # (- Coverage : )

[//]: # ( - Approches developed in Neural Coverage &#40;NCL&#41; given [here]&#40;https://github.com/Yuanyuan-Yuan/NeuraL-Coverage&#41;. )
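To make the completeness and diversity notions above concrete, here is a minimal pandas/numpy sketch of the underlying formulas (an illustration only, not the dqm-ml API; the column names and data are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "column_1": [1.0, 2.0, 3.0, 4.0],    # fully filled
    "column_3": [1.0, None, 3.0, None],  # half filled
})

# Completeness as the ratio of filled information: per-column share of
# non-missing values, plus an overall average across columns.
per_column = df.notna().mean()
overall_score = per_column.mean()

# Simpson and Gini-Simpson indices on a categorical column: with class
# proportions p_i, Simpson = sum(p_i^2) and Gini-Simpson = 1 - Simpson.
labels = pd.Series(["a", "a", "b", "c"])
p = labels.value_counts(normalize=True).to_numpy()
simpson = float(np.sum(p ** 2))
gini_simpson = 1.0 - simpson

print(per_column.to_dict(), overall_score, simpson, gini_simpson)
```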

# Getting started

## Set up a clean virtual environment

Linux setup:

```
pip install virtualenv
virtualenv myenv
source myenv/bin/activate
```

Windows setup:

```
pip install virtualenv
virtualenv myenv
.\myenv\Scripts\activate
```

## Install the library
You can install it directly from PyPI using:

```
pip install dqm-ml
```

Or you can install it from source by running the following command from the repository root:

```
pip install .
```

## Usage

There are two ways to use the dqm-ml library:
- Import the dqm package and call its functions within your Python code
- In standalone mode, either from the command line or by running the dqm-ml container

### Standalone mode

You can use dqm-ml directly to evaluate your datasets with the `dqm-ml` command from your terminal.

The command has the following form:

```dqm-ml --pipeline_config_path path_to_your_pipeline_file --result_file_path path_to_your_result_file```

This mode requires two user parameters:
- pipeline_config_path: the path to a YAML file defining the evaluation pipeline to apply to your datasets
- result_file_path: the path of the YAML file that will contain the computed scores for each metric defined in your pipeline

For example, if your pipeline file is located at ```examples/pipeline_example.yaml``` and you want the result file to be stored at ```examples/results_pipeline_example.yaml```, type in your terminal:

```dqm-ml --pipeline_config_path "examples/pipeline_example.yaml" --result_file_path "examples/results_pipeline_example.yaml"```

### Pipeline definition

A dqm-ml pipeline is a YAML file that lists the datasets you want to evaluate and the metrics to compute on each one.
The file has a primary key **pipeline_definition** containing a list of items, where each item has the following required fields:
- dataset: the path to the dataset to evaluate
- domain: the category of metrics to apply
- metrics: the list of metrics to compute on the dataset (not used for the completeness domain)

For the representativeness domain only, the following additional parameter fields are required:
- bins: the number of bins used to discretize the data
- distribution: the reference distribution to test against ("uniform" or "normal")

You can also use an optional field:
- columns_names: the list of dataset columns to restrict the metric computations to. If this field is missing, the metrics are applied to all columns of the given dataset.

The ```dataset``` field can be a path to a single file or to a folder. If the path points to a single file, the file is loaded directly and used as the dataset to evaluate. Supported file extensions are csv, txt, xls, xlsx, pq, and parquet. For csv or txt files, you can set a ```separator``` field to indicate the separator used to parse the file.

If the path is a folder, all files within the folder are automatically concatenated along the row axis to build the final dataset used for the evaluation. For folders, you can use an additional ```extension``` field to concatenate only the files with the specified extension; by default, concatenation is attempted on all files present.

For example:

```
- domain: "representativeness"
  extension: "txt"
  metrics: ["chi-square","GRTE"]
  bins: 10
  distribution: "normal"
  dataset: "tdata/my_data_folder"
  columns_names: ["col_1", "col_5","col_9"]
```

For domain gap, because the metrics apply only to image datasets, the definition is slightly different; each item has the following fields:
- ```domain```: the name of the domain, here "domain_gap"
- ```metrics```: the list of metrics to compute, where each item has two fields:
  - metric_name: the name of the metric to compute
  - method_config: the user configuration of the metric, defining the source and target datasets, the chosen model, and other user parameters

An example pipeline file computing several metrics across the four domains is given below:
```
pipeline_definition:
  - domain: "completeness"
    dataset: "tests/sample_data/completeness_sample_data.csv"
    columns_names: ["column_1","column_3","column_6","column_9"]

  - domain: "representativeness"
    metrics: ["chi-square","GRTE"]
    bins: 10
    distribution: normal
    dataset: "tests/sample_data/SMD_test_ds_sample.csv"
    columns_names: ["column_2","column_4", "column_6"]

  - domain: "diversity"
    metrics: ["simpson","gini"]
    dataset: "tests/sample_data/SMD_test_ds_sample.csv"
    columns_names: ["column_2","column_4", "column_6"]

  - domain: "domain_gap"
    metrics:
      - metric_name: wasserstein
        method_config:
          DATA:
            batch_size: 32
            height: 299
            width: 299
            norm_mean: [0.485,0.456,0.406]
            norm_std: [0.229,0.224,0.225]
            source: "tests/sample_data/image_test_ds/c20"
            target: "tests/sample_data/image_test_ds/c33"
          MODEL:
            arch: "resnet18"
            device: "cpu"
            n_layer_feature: -2
          METHOD:
            name: "fid"
```
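The domain_gap metrics compare the distributions of features extracted from the source and target image sets. As intuition for the "fid" method configured above, here is a minimal numpy/scipy sketch of the Fréchet distance between Gaussians fitted to two feature matrices (a conceptual illustration with random stand-in features, not dqm-ml's implementation):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to (n_samples, n_dims) features."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b).real  # sqrtm can leave tiny imaginary parts
    return float(((mu_a - mu_b) ** 2).sum() + np.trace(cov_a + cov_b - 2.0 * covmean))

rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(256, 64))  # stand-in for source features
target = rng.normal(0.5, 1.2, size=(256, 64))  # stand-in for target features
print(frechet_distance(source, target))
```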

The result file produced at the end of the pipeline is a YAML file containing the pipeline configuration augmented with a "scores" field in each item, holding the computed scores for the metrics.

Example of a result file:
```
pipeline_definition:
- domain: completeness
  dataset: tests/sample_data/completeness_sample_data.csv
  columns_names:
  - column_1
  - column_3
  - column_6
  - column_9
  scores:
    overall_score: 0.61825
    column_1: 1
    column_3: 0.782
    column_6: 0.48
    column_9: 0.211
- domain: representativeness
  metrics:
  - chi-square
  - GRTE
  bins: 10
  distribution: normal
  dataset: tests/sample_data/SMD_test_ds_sample.csv
  columns_names:
  - column_2
  - column_4
  - column_6
  scores:
    chi-square:
      column_2: 1.8740034461104008e-34
      column_4: 2.7573644464553625e-86
      column_6: 3.469236770038776e-64
    GRTE:
      column_2: 0.8421470393366073
      column_4: 0.7615162001699769
      column_6: 0.6955152215780268
```
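
In the representativeness scores above, the chi-square entries are p-values of a goodness-of-fit test against the reference distribution. A minimal scipy sketch of this kind of test, with synthetic data (illustrative only, not dqm-ml's exact implementation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
column = rng.normal(loc=0.0, scale=1.0, size=1000)  # stand-in for a dataset column

bins = 10
counts, edges = np.histogram(column, bins=bins)

# Expected bin counts under a normal distribution fitted to the column.
mu, sigma = column.mean(), column.std(ddof=0)
cdf = stats.norm.cdf(edges, loc=mu, scale=sigma)
expected = np.diff(cdf) * counts.sum()
expected *= counts.sum() / expected.sum()  # renormalize so totals match

# ddof=2 accounts for the two estimated parameters (mu, sigma).
chi2_stat, p_value = stats.chisquare(counts, f_exp=expected, ddof=2)
print(p_value)  # a small p-value flags a poor fit to the reference distribution
```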

To create your own pipeline definition, it is advisable to start from one of the existing pipeline files in the ```examples/``` folder.

### Use the dockerized version

To build the Docker image locally, run the following command from the repository root:

```docker build . -f dockerfile -t your_image_name:tag```

The command to run the dqm-ml container has the following form:

```docker run -e PIPELINE_CONFIG_PATH="path_to_your_pipeline_file" -e RESULT_FILE_PATH="path_to_the_result_file" irtsystemx/dqm-ml:1.1.1```

You need to mount the ```$PIPELINE_CONFIG_PATH``` path to ```/tmp/in/$PIPELINE_CONFIG_PATH``` and the ```$RESULT_FILE_PATH``` path to ```/tmp/out/$RESULT_FILE_PATH```.
Moreover, all dataset directories referenced in your pipeline file must be mounted in the container.

For example, if your pipeline file is stored at ```examples/pipeline_example_docker.yaml```, you want the result file to be stored at ```results_docker/result_file.yaml```, and all the datasets used in your pipeline are stored locally in the ```tests/``` folder and referenced under ```data_storage/..``` in your pipeline file, the command would be:

```docker run -e PIPELINE_CONFIG_PATH="pipeline_example_docker.yaml" -e RESULT_FILE_PATH="result_file.yaml" -v ${PWD}/examples:/tmp/in -v ${PWD}/tests/:/data_storage/ -v ${PWD}/results_docker:/tmp/out irtsystemx/dqm-ml:1.1.1```

### Users behind a proxy server

The computation of domain gap metrics relies on pretrained models that PyTorch automatically downloads to a local cache directory the first time those metrics are called.

For users behind a proxy server, this download may fail. To work around this, you can fetch the pretrained models manually by downloading the zip archive from this [link](http://minio-storage.apps.confianceai-public.irtsysx.fr/ml-models//dqm-ml_pretrained_models.zip) and extracting it into ```your_user_directory/.cache/torch/hub/checkpoints/```.

### Use the library within your Python code

[//]: # (All validated and verified functions are detailed in the files **call_main.py**. )

Each metric is used by importing the corresponding module and class into your code.
For more information about each metric, refer to the README.md in the corresponding ```dqm/<metric_name>``` subfolder.
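As a starting point, here is a sketch of what such an import-and-compute flow can look like. The module path follows the dqm/ package layout, but the class and method names below are assumptions made for illustration; check the per-metric README for the actual names:

```python
import pandas as pd

# Hypothetical API: `CompletenessMetric` and its `compute` method are
# placeholders, not confirmed dqm-ml names.
from dqm.completeness.metric import CompletenessMetric

df = pd.read_csv("tests/sample_data/completeness_sample_data.csv")

metric = CompletenessMetric()  # hypothetical constructor
scores = metric.compute(df, columns_names=["column_1", "column_3"])  # hypothetical method
print(scores)
```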

## Available examples

Many examples of dqm-ml applications are available in the ```examples/``` folder.

You will find:

2 Jupyter notebooks:

- **multiple_metrics_tests.ipynb**: a notebook applying completeness, diversity, and representativeness metrics to an example dataset.
- **domain_gap.ipynb**: a notebook demonstrating the application of domain_gap metrics to a generated synthetic dataset.

4 Python scripts:

These scripts, named **main_<X>.py**, give examples of computing the approaches implemented for metric <X> on sample data.

The ```main_domain_gap.py``` script must be called with a config file passed as an argument using ```--cfg```.

For example:

```python examples/main_domain_gap.py --cfg examples/domain_gap_cfg/cmd/cmd.json```

The ```examples/domain_gap_cfg``` folder provides a set of config files for each domain_gap approach.

Some domain_gap examples require the **200_bird_dataset**, which can be downloaded from this [link](http://minio-storage.apps.confianceai-public.irtsysx.fr/ml-models/200-birds-species.zip). The zip archive must be extracted into the ```examples/datasets/``` folder.

1 pipeline example, ```pipeline_example.yaml```, that instantiates every metric implemented in dqm-ml, together with its corresponding results file ```results_pipeline_example.yaml```.

1 pipeline example similar to the previous one but with different dataset paths, matching the containerized-usage example above.

## References

```
@inproceedings{chaouche2024dqm,
  title={DQM: Data Quality Metrics for AI components in the industry},
  author={Chaouche, Sabrina and Randon, Yoann and Adjed, Faouzi and Boudjani, Nadira and Khedher, Mohamed Ibn},
  booktitle={Proceedings of the AAAI Symposium Series},
  volume={4},
  number={1},
  pages={24--31},
  year={2024}
}
```

[HAL link](https://hal.science/hal-04719346v1)
dqm_ml-1.1.1/README.md ADDED
@@ -0,0 +1,330 @@
<div align="center">
<img src="_static/Logo_ConfianceAI.png" width="20%" alt="ConfianceAI Logo" />
<h1 style="font-size: large; font-weight: bold;">dqm-ml</h1>
</div>

<div align="center">
<a href="#">
<img src="https://img.shields.io/badge/Python-3.10-efefef">
</a>
<a href="#">
<img src="https://img.shields.io/badge/Python-3.11-efefef">
</a>
<a href="#">
<img src="https://img.shields.io/badge/Python-3.12-efefef">
</a>
<a href="#">
<img src="https://img.shields.io/badge/License-MPL-2">
</a>
<a href="_static/pylint/pylint.txt">
<img src="_static/pylint/pylint.svg" alt="Pylint Score">
</a>
<a href="_static/flake8/index.html">
<img src="_static/flake8/flake8.svg" alt="Flake8 Report">
</a>
<a href="_static/coverage/index.html">
<img src="_static/coverage/coverage.svg" alt="Coverage report">
</a>

</div>

<br>
<br>

# Data Quality Metrics

The current version of Data Quality Metrics (**dqm-ml**) computes three data-inherent metrics and one data-model-dependent metric.

The data-inherent metrics are:
- **Diversity**: measures the presence in the dataset of all required information defined in the specification (requirements, Operational Design Domain (ODD), etc.).
- **Representativeness**: the conformity of the distribution of the key characteristics of the dataset to a specification (requirements, ODD, etc.).
- **Completeness**: the degree to which the data associated with an entity has values for all expected attributes and related entity instances in a specific context of use.

The data-model-dependent metric is:
- **Domain Gap**: in the context of a computer vision task, the domain gap (DG) refers to the difference in semantics, textures, and shapes between two distributions of images; it can lead to poor performance when a model trained on one distribution is applied to another.

(Definitions from the [Confiance.ai program](https://www.confiance.ai/))

[//]: # (- Coverage : The coverage of a couple "Dataset + ML Model" is the ability of the execution of the ML Model on this dataset to generate elements that match the expected space.)

For each metric, several approaches are implemented to handle as many data types as possible. For more technical and scientific details, please refer to this [deliverable](https://catalog.confiance.ai/records/p46p6-1wt83/files/Scientific_Contribution_For_Data_quality_assessment_metrics_for_Machine_learning_process-v2.pdf?download=1).

## Project description
Several approaches are implemented, as described in the figure below.

<img src="_static/library_view.png" width="1024"/>

In the current version, the available metrics are:
- Representativeness:
  - $\chi^2$ goodness-of-fit test for uniform and normal distributions
  - Kolmogorov-Smirnov test for uniform and normal distributions
  - Granular and Relative Theoretical Entropy (GRTE), proposed and developed in the Confiance.ai research program
- Diversity:
  - Relative diversity, developed and implemented in the Confiance.ai research program
  - Gini-Simpson and Simpson indices
- Completeness:
  - Ratio of filled information
- Domain Gap:
  - MMD
  - CMD
  - Wasserstein
  - H-Divergence
  - FID
  - Kullback-Leibler divergence for multivariate normal distributions (see the sketch below)

[//]: # (- Coverage : )

[//]: # ( - Approches developed in Neural Coverage &#40;NCL&#41; given [here]&#40;https://github.com/Yuanyuan-Yuan/NeuraL-Coverage&#41;. )
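As intuition for the Kullback-Leibler entry above, here is a minimal numpy sketch of the closed-form KL divergence between two multivariate normal distributions fitted to feature sets (a conceptual illustration with random stand-in data, not dqm-ml's implementation):

```python
import numpy as np

def kl_mvn(mu0, cov0, mu1, cov1):
    """Closed-form KL(N0 || N1) between two multivariate normals."""
    k = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (
        np.trace(cov1_inv @ cov0)
        + diff @ cov1_inv @ diff
        - k
        + np.log(np.linalg.det(cov1) / np.linalg.det(cov0))
    )

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(500, 4))  # stand-in feature sets
b = rng.normal(0.3, 1.1, size=(500, 4))
print(kl_mvn(a.mean(0), np.cov(a, rowvar=False), b.mean(0), np.cov(b, rowvar=False)))
```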

# Getting started

## Set up a clean virtual environment

Linux setup:

```
pip install virtualenv
virtualenv myenv
source myenv/bin/activate
```

Windows setup:

```
pip install virtualenv
virtualenv myenv
.\myenv\Scripts\activate
```

## Install the library
You can install it directly from PyPI using:

```
pip install dqm-ml
```

Or you can install it from source by running the following command from the repository root:

```
pip install .
```

## Usage

There are two ways to use the dqm-ml library:
- Import the dqm package and call its functions within your Python code
- In standalone mode, either from the command line or by running the dqm-ml container

### Standalone mode

You can use dqm-ml directly to evaluate your datasets with the `dqm-ml` command from your terminal.

The command has the following form:

```dqm-ml --pipeline_config_path path_to_your_pipeline_file --result_file_path path_to_your_result_file```

This mode requires two user parameters:
- pipeline_config_path: the path to a YAML file defining the evaluation pipeline to apply to your datasets
- result_file_path: the path of the YAML file that will contain the computed scores for each metric defined in your pipeline

For example, if your pipeline file is located at ```examples/pipeline_example.yaml``` and you want the result file to be stored at ```examples/results_pipeline_example.yaml```, type in your terminal:

```dqm-ml --pipeline_config_path "examples/pipeline_example.yaml" --result_file_path "examples/results_pipeline_example.yaml"```

### Pipeline definition

A dqm-ml pipeline is a YAML file that lists the datasets you want to evaluate and the metrics to compute on each one.
The file has a primary key **pipeline_definition** containing a list of items, where each item has the following required fields:
- dataset: the path to the dataset to evaluate
- domain: the category of metrics to apply
- metrics: the list of metrics to compute on the dataset (not used for the completeness domain)

For the representativeness domain only, the following additional parameter fields are required:
- bins: the number of bins used to discretize the data
- distribution: the reference distribution to test against ("uniform" or "normal")

You can also use an optional field:
- columns_names: the list of dataset columns to restrict the metric computations to. If this field is missing, the metrics are applied to all columns of the given dataset.

The ```dataset``` field can be a path to a single file or to a folder. If the path points to a single file, the file is loaded directly and used as the dataset to evaluate. Supported file extensions are csv, txt, xls, xlsx, pq, and parquet. For csv or txt files, you can set a ```separator``` field to indicate the separator used to parse the file.

If the path is a folder, all files within the folder are automatically concatenated along the row axis to build the final dataset used for the evaluation. For folders, you can use an additional ```extension``` field to concatenate only the files with the specified extension; by default, concatenation is attempted on all files present.

For example:

```
- domain: "representativeness"
  extension: "txt"
  metrics: ["chi-square","GRTE"]
  bins: 10
  distribution: "normal"
  dataset: "tdata/my_data_folder"
  columns_names: ["col_1", "col_5","col_9"]
```

For domain gap, because the metrics apply only to image datasets, the definition is slightly different; each item has the following fields:
- ```domain```: the name of the domain, here "domain_gap"
- ```metrics```: the list of metrics to compute, where each item has two fields:
  - metric_name: the name of the metric to compute
  - method_config: the user configuration of the metric, defining the source and target datasets, the chosen model, and other user parameters

An example pipeline file computing several metrics across the four domains is given below:
```
pipeline_definition:
  - domain: "completeness"
    dataset: "tests/sample_data/completeness_sample_data.csv"
    columns_names: ["column_1","column_3","column_6","column_9"]

  - domain: "representativeness"
    metrics: ["chi-square","GRTE"]
    bins: 10
    distribution: normal
    dataset: "tests/sample_data/SMD_test_ds_sample.csv"
    columns_names: ["column_2","column_4", "column_6"]

  - domain: "diversity"
    metrics: ["simpson","gini"]
    dataset: "tests/sample_data/SMD_test_ds_sample.csv"
    columns_names: ["column_2","column_4", "column_6"]

  - domain: "domain_gap"
    metrics:
      - metric_name: wasserstein
        method_config:
          DATA:
            batch_size: 32
            height: 299
            width: 299
            norm_mean: [0.485,0.456,0.406]
            norm_std: [0.229,0.224,0.225]
            source: "tests/sample_data/image_test_ds/c20"
            target: "tests/sample_data/image_test_ds/c33"
          MODEL:
            arch: "resnet18"
            device: "cpu"
            n_layer_feature: -2
          METHOD:
            name: "fid"
```
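Since the pipeline file is plain YAML, you can sanity-check one before running it; a small PyYAML sketch (illustrative only, dqm-ml performs its own parsing; `yaml` is available through the listed pyaml dependency):

```python
import yaml

with open("examples/pipeline_example.yaml", encoding="utf-8") as f:
    pipeline = yaml.safe_load(f)

# Each item declares at least a domain; domain_gap items define their
# datasets inside method_config instead of a top-level dataset field.
for item in pipeline["pipeline_definition"]:
    print(item["domain"], item.get("dataset", "<defined per metric>"))
```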

The result file produced at the end of the pipeline is a YAML file containing the pipeline configuration augmented with a "scores" field in each item, holding the computed scores for the metrics.

Example of a result file:
```
pipeline_definition:
- domain: completeness
  dataset: tests/sample_data/completeness_sample_data.csv
  columns_names:
  - column_1
  - column_3
  - column_6
  - column_9
  scores:
    overall_score: 0.61825
    column_1: 1
    column_3: 0.782
    column_6: 0.48
    column_9: 0.211
- domain: representativeness
  metrics:
  - chi-square
  - GRTE
  bins: 10
  distribution: normal
  dataset: tests/sample_data/SMD_test_ds_sample.csv
  columns_names:
  - column_2
  - column_4
  - column_6
  scores:
    chi-square:
      column_2: 1.8740034461104008e-34
      column_4: 2.7573644464553625e-86
      column_6: 3.469236770038776e-64
    GRTE:
      column_2: 0.8421470393366073
      column_4: 0.7615162001699769
      column_6: 0.6955152215780268
```
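
The result file is likewise plain YAML, so the scores can be consumed programmatically; a minimal sketch, assuming the structure shown above:

```python
import yaml

with open("examples/results_pipeline_example.yaml", encoding="utf-8") as f:
    results = yaml.safe_load(f)

# Each evaluated item is augmented with a "scores" mapping.
for item in results["pipeline_definition"]:
    print(item["domain"], item.get("scores"))
```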

To create your own pipeline definition, it is advisable to start from one of the existing pipeline files in the ```examples/``` folder.

### Use the dockerized version

To build the Docker image locally, run the following command from the repository root:

```docker build . -f dockerfile -t your_image_name:tag```

The command to run the dqm-ml container has the following form:

```docker run -e PIPELINE_CONFIG_PATH="path_to_your_pipeline_file" -e RESULT_FILE_PATH="path_to_the_result_file" irtsystemx/dqm-ml:1.1.1```

You need to mount the ```$PIPELINE_CONFIG_PATH``` path to ```/tmp/in/$PIPELINE_CONFIG_PATH``` and the ```$RESULT_FILE_PATH``` path to ```/tmp/out/$RESULT_FILE_PATH```.
Moreover, all dataset directories referenced in your pipeline file must be mounted in the container.

For example, if your pipeline file is stored at ```examples/pipeline_example_docker.yaml```, you want the result file to be stored at ```results_docker/result_file.yaml```, and all the datasets used in your pipeline are stored locally in the ```tests/``` folder and referenced under ```data_storage/..``` in your pipeline file, the command would be:

```docker run -e PIPELINE_CONFIG_PATH="pipeline_example_docker.yaml" -e RESULT_FILE_PATH="result_file.yaml" -v ${PWD}/examples:/tmp/in -v ${PWD}/tests/:/data_storage/ -v ${PWD}/results_docker:/tmp/out irtsystemx/dqm-ml:1.1.1```

### Users behind a proxy server

The computation of domain gap metrics relies on pretrained models that PyTorch automatically downloads to a local cache directory the first time those metrics are called.

For users behind a proxy server, this download may fail. To work around this, you can fetch the pretrained models manually by downloading the zip archive from this [link](http://minio-storage.apps.confianceai-public.irtsysx.fr/ml-models//dqm-ml_pretrained_models.zip) and extracting it into ```your_user_directory/.cache/torch/hub/checkpoints/```.

### Use the library within your Python code

[//]: # (All validated and verified functions are detailed in the files **call_main.py**. )

Each metric is used by importing the corresponding module and class into your code.
For more information about each metric, refer to the README.md in the corresponding ```dqm/<metric_name>``` subfolder.

## Available examples

Many examples of dqm-ml applications are available in the ```examples/``` folder.

You will find:

2 Jupyter notebooks:

- **multiple_metrics_tests.ipynb**: a notebook applying completeness, diversity, and representativeness metrics to an example dataset.
- **domain_gap.ipynb**: a notebook demonstrating the application of domain_gap metrics to a generated synthetic dataset.

4 Python scripts:

These scripts, named **main_<X>.py**, give examples of computing the approaches implemented for metric <X> on sample data.

The ```main_domain_gap.py``` script must be called with a config file passed as an argument using ```--cfg```.

For example:

```python examples/main_domain_gap.py --cfg examples/domain_gap_cfg/cmd/cmd.json```

The ```examples/domain_gap_cfg``` folder provides a set of config files for each domain_gap approach.

Some domain_gap examples require the **200_bird_dataset**, which can be downloaded from this [link](http://minio-storage.apps.confianceai-public.irtsysx.fr/ml-models/200-birds-species.zip). The zip archive must be extracted into the ```examples/datasets/``` folder.

1 pipeline example, ```pipeline_example.yaml```, that instantiates every metric implemented in dqm-ml, together with its corresponding results file ```results_pipeline_example.yaml```.

1 pipeline example similar to the previous one but with different dataset paths, matching the containerized-usage example above.

## References

```
@inproceedings{chaouche2024dqm,
  title={DQM: Data Quality Metrics for AI components in the industry},
  author={Chaouche, Sabrina and Randon, Yoann and Adjed, Faouzi and Boudjani, Nadira and Khedher, Mohamed Ibn},
  booktitle={Proceedings of the AAAI Symposium Series},
  volume={4},
  number={1},
  pages={24--31},
  year={2024}
}
```

[HAL link](https://hal.science/hal-04719346v1)