csv-detective 0.6.7__py3-none-any.whl → 0.9.3.dev2438__py3-none-any.whl
This diff compares the contents of two publicly released versions of the package, as they appear in their respective public registries, and is provided for informational purposes only.
- csv_detective/__init__.py +7 -1
- csv_detective/cli.py +33 -21
- csv_detective/{detect_fields/FR → detection}/__init__.py +0 -0
- csv_detective/detection/columns.py +89 -0
- csv_detective/detection/encoding.py +29 -0
- csv_detective/detection/engine.py +46 -0
- csv_detective/detection/formats.py +156 -0
- csv_detective/detection/headers.py +28 -0
- csv_detective/detection/rows.py +18 -0
- csv_detective/detection/separator.py +44 -0
- csv_detective/detection/variables.py +97 -0
- csv_detective/explore_csv.py +151 -377
- csv_detective/format.py +67 -0
- csv_detective/formats/__init__.py +9 -0
- csv_detective/formats/adresse.py +116 -0
- csv_detective/formats/binary.py +26 -0
- csv_detective/formats/booleen.py +35 -0
- csv_detective/formats/code_commune_insee.py +26 -0
- csv_detective/formats/code_csp_insee.py +36 -0
- csv_detective/formats/code_departement.py +29 -0
- csv_detective/formats/code_fantoir.py +21 -0
- csv_detective/formats/code_import.py +17 -0
- csv_detective/formats/code_postal.py +25 -0
- csv_detective/formats/code_region.py +22 -0
- csv_detective/formats/code_rna.py +29 -0
- csv_detective/formats/code_waldec.py +17 -0
- csv_detective/formats/commune.py +27 -0
- csv_detective/formats/csp_insee.py +31 -0
- csv_detective/{detect_fields/FR/other/insee_ape700 → formats/data}/insee_ape700.txt +0 -0
- csv_detective/formats/date.py +99 -0
- csv_detective/formats/date_fr.py +22 -0
- csv_detective/formats/datetime_aware.py +45 -0
- csv_detective/formats/datetime_naive.py +48 -0
- csv_detective/formats/datetime_rfc822.py +24 -0
- csv_detective/formats/departement.py +37 -0
- csv_detective/formats/email.py +28 -0
- csv_detective/formats/float.py +29 -0
- csv_detective/formats/geojson.py +36 -0
- csv_detective/formats/insee_ape700.py +31 -0
- csv_detective/formats/insee_canton.py +28 -0
- csv_detective/formats/int.py +23 -0
- csv_detective/formats/iso_country_code_alpha2.py +30 -0
- csv_detective/formats/iso_country_code_alpha3.py +30 -0
- csv_detective/formats/iso_country_code_numeric.py +31 -0
- csv_detective/formats/jour_de_la_semaine.py +41 -0
- csv_detective/formats/json.py +20 -0
- csv_detective/formats/latitude_l93.py +48 -0
- csv_detective/formats/latitude_wgs.py +42 -0
- csv_detective/formats/latitude_wgs_fr_metropole.py +42 -0
- csv_detective/formats/latlon_wgs.py +53 -0
- csv_detective/formats/longitude_l93.py +39 -0
- csv_detective/formats/longitude_wgs.py +32 -0
- csv_detective/formats/longitude_wgs_fr_metropole.py +32 -0
- csv_detective/formats/lonlat_wgs.py +36 -0
- csv_detective/formats/mois_de_lannee.py +48 -0
- csv_detective/formats/money.py +18 -0
- csv_detective/formats/mongo_object_id.py +14 -0
- csv_detective/formats/pays.py +35 -0
- csv_detective/formats/percent.py +16 -0
- csv_detective/formats/region.py +70 -0
- csv_detective/formats/sexe.py +17 -0
- csv_detective/formats/siren.py +37 -0
- csv_detective/{detect_fields/FR/other/siret/__init__.py → formats/siret.py} +47 -29
- csv_detective/formats/tel_fr.py +36 -0
- csv_detective/formats/uai.py +36 -0
- csv_detective/formats/url.py +46 -0
- csv_detective/formats/username.py +14 -0
- csv_detective/formats/uuid.py +16 -0
- csv_detective/formats/year.py +28 -0
- csv_detective/output/__init__.py +65 -0
- csv_detective/output/dataframe.py +96 -0
- csv_detective/output/example.py +250 -0
- csv_detective/output/profile.py +119 -0
- csv_detective/{schema_generation.py → output/schema.py} +268 -343
- csv_detective/output/utils.py +74 -0
- csv_detective/{detect_fields/FR/geo → parsing}/__init__.py +0 -0
- csv_detective/parsing/columns.py +235 -0
- csv_detective/parsing/compression.py +11 -0
- csv_detective/parsing/csv.py +56 -0
- csv_detective/parsing/excel.py +167 -0
- csv_detective/parsing/load.py +111 -0
- csv_detective/parsing/text.py +56 -0
- csv_detective/utils.py +23 -196
- csv_detective/validate.py +138 -0
- csv_detective-0.9.3.dev2438.dist-info/METADATA +267 -0
- csv_detective-0.9.3.dev2438.dist-info/RECORD +92 -0
- csv_detective-0.9.3.dev2438.dist-info/WHEEL +4 -0
- {csv_detective-0.6.7.dist-info → csv_detective-0.9.3.dev2438.dist-info}/entry_points.txt +1 -0
- csv_detective/all_packages.txt +0 -104
- csv_detective/detect_fields/FR/geo/adresse/__init__.py +0 -100
- csv_detective/detect_fields/FR/geo/code_commune_insee/__init__.py +0 -24
- csv_detective/detect_fields/FR/geo/code_commune_insee/code_commune_insee.txt +0 -37600
- csv_detective/detect_fields/FR/geo/code_departement/__init__.py +0 -11
- csv_detective/detect_fields/FR/geo/code_fantoir/__init__.py +0 -15
- csv_detective/detect_fields/FR/geo/code_fantoir/code_fantoir.txt +0 -26122
- csv_detective/detect_fields/FR/geo/code_postal/__init__.py +0 -19
- csv_detective/detect_fields/FR/geo/code_postal/code_postal.txt +0 -36822
- csv_detective/detect_fields/FR/geo/code_region/__init__.py +0 -27
- csv_detective/detect_fields/FR/geo/commune/__init__.py +0 -21
- csv_detective/detect_fields/FR/geo/commune/commune.txt +0 -36745
- csv_detective/detect_fields/FR/geo/departement/__init__.py +0 -19
- csv_detective/detect_fields/FR/geo/departement/departement.txt +0 -101
- csv_detective/detect_fields/FR/geo/insee_canton/__init__.py +0 -20
- csv_detective/detect_fields/FR/geo/insee_canton/canton2017.txt +0 -2055
- csv_detective/detect_fields/FR/geo/insee_canton/cantons.txt +0 -2055
- csv_detective/detect_fields/FR/geo/latitude_l93/__init__.py +0 -13
- csv_detective/detect_fields/FR/geo/latitude_wgs_fr_metropole/__init__.py +0 -13
- csv_detective/detect_fields/FR/geo/longitude_l93/__init__.py +0 -13
- csv_detective/detect_fields/FR/geo/longitude_wgs_fr_metropole/__init__.py +0 -13
- csv_detective/detect_fields/FR/geo/pays/__init__.py +0 -17
- csv_detective/detect_fields/FR/geo/pays/pays.txt +0 -248
- csv_detective/detect_fields/FR/geo/region/__init__.py +0 -16
- csv_detective/detect_fields/FR/geo/region/region.txt +0 -44
- csv_detective/detect_fields/FR/other/__init__.py +0 -0
- csv_detective/detect_fields/FR/other/code_csp_insee/__init__.py +0 -26
- csv_detective/detect_fields/FR/other/code_csp_insee/code_csp_insee.txt +0 -498
- csv_detective/detect_fields/FR/other/code_rna/__init__.py +0 -8
- csv_detective/detect_fields/FR/other/code_waldec/__init__.py +0 -12
- csv_detective/detect_fields/FR/other/csp_insee/__init__.py +0 -16
- csv_detective/detect_fields/FR/other/date_fr/__init__.py +0 -12
- csv_detective/detect_fields/FR/other/insee_ape700/__init__.py +0 -16
- csv_detective/detect_fields/FR/other/sexe/__init__.py +0 -9
- csv_detective/detect_fields/FR/other/siren/__init__.py +0 -18
- csv_detective/detect_fields/FR/other/tel_fr/__init__.py +0 -15
- csv_detective/detect_fields/FR/other/uai/__init__.py +0 -15
- csv_detective/detect_fields/FR/temp/__init__.py +0 -0
- csv_detective/detect_fields/FR/temp/jour_de_la_semaine/__init__.py +0 -23
- csv_detective/detect_fields/FR/temp/mois_de_annee/__init__.py +0 -37
- csv_detective/detect_fields/__init__.py +0 -57
- csv_detective/detect_fields/geo/__init__.py +0 -0
- csv_detective/detect_fields/geo/iso_country_code_alpha2/__init__.py +0 -15
- csv_detective/detect_fields/geo/iso_country_code_alpha3/__init__.py +0 -14
- csv_detective/detect_fields/geo/iso_country_code_numeric/__init__.py +0 -15
- csv_detective/detect_fields/geo/json_geojson/__init__.py +0 -22
- csv_detective/detect_fields/geo/latitude_wgs/__init__.py +0 -13
- csv_detective/detect_fields/geo/latlon_wgs/__init__.py +0 -15
- csv_detective/detect_fields/geo/longitude_wgs/__init__.py +0 -13
- csv_detective/detect_fields/other/__init__.py +0 -0
- csv_detective/detect_fields/other/booleen/__init__.py +0 -21
- csv_detective/detect_fields/other/email/__init__.py +0 -8
- csv_detective/detect_fields/other/float/__init__.py +0 -17
- csv_detective/detect_fields/other/int/__init__.py +0 -12
- csv_detective/detect_fields/other/json/__init__.py +0 -24
- csv_detective/detect_fields/other/mongo_object_id/__init__.py +0 -8
- csv_detective/detect_fields/other/twitter/__init__.py +0 -8
- csv_detective/detect_fields/other/url/__init__.py +0 -11
- csv_detective/detect_fields/other/uuid/__init__.py +0 -11
- csv_detective/detect_fields/temp/__init__.py +0 -0
- csv_detective/detect_fields/temp/date/__init__.py +0 -62
- csv_detective/detect_fields/temp/datetime_iso/__init__.py +0 -18
- csv_detective/detect_fields/temp/datetime_rfc822/__init__.py +0 -21
- csv_detective/detect_fields/temp/year/__init__.py +0 -10
- csv_detective/detect_labels/FR/__init__.py +0 -0
- csv_detective/detect_labels/FR/geo/__init__.py +0 -0
- csv_detective/detect_labels/FR/geo/adresse/__init__.py +0 -40
- csv_detective/detect_labels/FR/geo/code_commune_insee/__init__.py +0 -42
- csv_detective/detect_labels/FR/geo/code_departement/__init__.py +0 -33
- csv_detective/detect_labels/FR/geo/code_fantoir/__init__.py +0 -33
- csv_detective/detect_labels/FR/geo/code_postal/__init__.py +0 -41
- csv_detective/detect_labels/FR/geo/code_region/__init__.py +0 -33
- csv_detective/detect_labels/FR/geo/commune/__init__.py +0 -33
- csv_detective/detect_labels/FR/geo/departement/__init__.py +0 -47
- csv_detective/detect_labels/FR/geo/insee_canton/__init__.py +0 -33
- csv_detective/detect_labels/FR/geo/latitude_l93/__init__.py +0 -54
- csv_detective/detect_labels/FR/geo/latitude_wgs_fr_metropole/__init__.py +0 -55
- csv_detective/detect_labels/FR/geo/longitude_l93/__init__.py +0 -44
- csv_detective/detect_labels/FR/geo/longitude_wgs_fr_metropole/__init__.py +0 -45
- csv_detective/detect_labels/FR/geo/pays/__init__.py +0 -45
- csv_detective/detect_labels/FR/geo/region/__init__.py +0 -45
- csv_detective/detect_labels/FR/other/__init__.py +0 -0
- csv_detective/detect_labels/FR/other/code_csp_insee/__init__.py +0 -33
- csv_detective/detect_labels/FR/other/code_rna/__init__.py +0 -38
- csv_detective/detect_labels/FR/other/code_waldec/__init__.py +0 -33
- csv_detective/detect_labels/FR/other/csp_insee/__init__.py +0 -37
- csv_detective/detect_labels/FR/other/date_fr/__init__.py +0 -33
- csv_detective/detect_labels/FR/other/insee_ape700/__init__.py +0 -40
- csv_detective/detect_labels/FR/other/sexe/__init__.py +0 -33
- csv_detective/detect_labels/FR/other/siren/__init__.py +0 -41
- csv_detective/detect_labels/FR/other/siret/__init__.py +0 -40
- csv_detective/detect_labels/FR/other/tel_fr/__init__.py +0 -45
- csv_detective/detect_labels/FR/other/uai/__init__.py +0 -50
- csv_detective/detect_labels/FR/temp/__init__.py +0 -0
- csv_detective/detect_labels/FR/temp/jour_de_la_semaine/__init__.py +0 -41
- csv_detective/detect_labels/FR/temp/mois_de_annee/__init__.py +0 -33
- csv_detective/detect_labels/__init__.py +0 -43
- csv_detective/detect_labels/geo/__init__.py +0 -0
- csv_detective/detect_labels/geo/iso_country_code_alpha2/__init__.py +0 -41
- csv_detective/detect_labels/geo/iso_country_code_alpha3/__init__.py +0 -41
- csv_detective/detect_labels/geo/iso_country_code_numeric/__init__.py +0 -41
- csv_detective/detect_labels/geo/json_geojson/__init__.py +0 -42
- csv_detective/detect_labels/geo/latitude_wgs/__init__.py +0 -55
- csv_detective/detect_labels/geo/latlon_wgs/__init__.py +0 -67
- csv_detective/detect_labels/geo/longitude_wgs/__init__.py +0 -45
- csv_detective/detect_labels/other/__init__.py +0 -0
- csv_detective/detect_labels/other/booleen/__init__.py +0 -34
- csv_detective/detect_labels/other/email/__init__.py +0 -45
- csv_detective/detect_labels/other/float/__init__.py +0 -33
- csv_detective/detect_labels/other/int/__init__.py +0 -33
- csv_detective/detect_labels/other/money/__init__.py +0 -11
- csv_detective/detect_labels/other/money/check_col_name.py +0 -8
- csv_detective/detect_labels/other/mongo_object_id/__init__.py +0 -33
- csv_detective/detect_labels/other/twitter/__init__.py +0 -33
- csv_detective/detect_labels/other/url/__init__.py +0 -48
- csv_detective/detect_labels/other/uuid/__init__.py +0 -33
- csv_detective/detect_labels/temp/__init__.py +0 -0
- csv_detective/detect_labels/temp/date/__init__.py +0 -51
- csv_detective/detect_labels/temp/datetime_iso/__init__.py +0 -45
- csv_detective/detect_labels/temp/datetime_rfc822/__init__.py +0 -44
- csv_detective/detect_labels/temp/year/__init__.py +0 -44
- csv_detective/detection.py +0 -361
- csv_detective/process_text.py +0 -39
- csv_detective/s3_utils.py +0 -48
- csv_detective-0.6.7.data/data/share/csv_detective/CHANGELOG.md +0 -118
- csv_detective-0.6.7.data/data/share/csv_detective/LICENSE.AGPL.txt +0 -661
- csv_detective-0.6.7.data/data/share/csv_detective/README.md +0 -247
- csv_detective-0.6.7.dist-info/LICENSE.AGPL.txt +0 -661
- csv_detective-0.6.7.dist-info/METADATA +0 -23
- csv_detective-0.6.7.dist-info/RECORD +0 -150
- csv_detective-0.6.7.dist-info/WHEEL +0 -5
- csv_detective-0.6.7.dist-info/top_level.txt +0 -2
- tests/__init__.py +0 -0
- tests/test_fields.py +0 -360
- tests/test_file.py +0 -116
- tests/test_labels.py +0 -7
- /csv_detective/{detect_fields/FR/other/csp_insee → formats/data}/csp_insee.txt +0 -0
- /csv_detective/{detect_fields/geo/iso_country_code_alpha2 → formats/data}/iso_country_code_alpha2.txt +0 -0
- /csv_detective/{detect_fields/geo/iso_country_code_alpha3 → formats/data}/iso_country_code_alpha3.txt +0 -0
- /csv_detective/{detect_fields/geo/iso_country_code_numeric → formats/data}/iso_country_code_numeric.txt +0 -0
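The listing above shows the shape of the 0.9.x refactor: the nested detect_fields/ and detect_labels/ trees collapse into flat per-format modules under csv_detective/formats/, and the monolithic detection.py (deleted below) is split across csv_detective/detection/. For orientation, a minimal sketch of driving the analysis, assuming the historical `routine` entry point in explore_csv.py (mentioned in the 0.4.4 changelog entry further down) is still exposed after the refactor:

    import json

    # Hedged sketch: `routine` was the public entry point in 0.6.x; we assume
    # it survives the move to the detection/ and parsing/ subpackages.
    from csv_detective.explore_csv import routine

    # num_rows=-1 analyses the whole file rather than a sample (added in 0.4.7).
    report = routine("data.csv", num_rows=-1)
    print(json.dumps(report, indent=2, default=str))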
csv_detective/detection.py
DELETED
@@ -1,361 +0,0 @@
-import pandas as pd
-import numpy as np
-from cchardet import detect
-from ast import literal_eval
-import logging
-from time import time
-from csv_detective.utils import display_logs_depending_process_time
-from csv_detective.detect_fields.other.float import float_casting
-
-logging.basicConfig(level=logging.INFO)
-
-def detect_continuous_variable(table, continuous_th=0.9, verbose: bool = False):
-    """
-    Detects whether a column contains continuous variables. We consider a continuous column
-    one that contains
-    a considerable amount of float values.
-    We removed the integers as we then end up with postal codes, insee codes, and all sorts
-    of codes and types.
-    This is not optimal but it will do for now.
-    :param table:
-    :return:
-    """
-
-    def check_threshold(serie, continuous_th):
-        count = serie.value_counts().to_dict()
-        total_nb = len(serie)
-        if float in count:
-            nb_floats = count[float]
-        else:
-            return False
-        if nb_floats / total_nb >= continuous_th:
-            return True
-        else:
-            return False
-
-    def parses_to_integer(value):
-        try:
-            value = value.replace(",", ".")
-            value = literal_eval(value)
-            return type(value)
-        # flake8: noqa
-        except:
-            return False
-
-    if verbose:
-        start = time()
-        logging.info("Detecting continuous columns")
-    res = table.apply(
-        lambda serie: check_threshold(serie.apply(parses_to_integer), continuous_th)
-    )
-    if verbose:
-        display_logs_depending_process_time(
-            f"Detected {sum(res)} continuous columns in {round(time() - start, 3)}s",
-            time() - start
-        )
-    return res.index[res]
-
-
-def detetect_categorical_variable(
-    table, threshold_pct_categorical=0.05, max_number_categorical_values=25, verbose: bool = False
-):
-    """
-    Heuristically detects whether a table (df) contains categorical values according to
-    the number of unique values contained.
-    As the idea of detecting categorical values is to then try to learn models to predict
-    them, we limit categorical values to at most 25 different modes. Postal code, insee code,
-    code region and so on, may thus not be
-    considered categorical values.
-    :param table:
-    :param threshold_pct_categorical:
-    :param max_number_categorical_values:
-    :return:
-    """
-
-    def abs_number_different_values(column_values):
-        return column_values.nunique()
-
-    def rel_number_different_values(column_values):
-        return column_values.nunique() / len(column_values)
-
-    def detect_categorical(column_values):
-        abs_unique_values = abs_number_different_values(column_values)
-        rel_unique_values = rel_number_different_values(column_values)
-        if abs_unique_values < max_number_categorical_values:
-            if rel_unique_values < threshold_pct_categorical:
-                return True
-        return False
-
-    if verbose:
-        start = time()
-        logging.info("Detecting categorical columns")
-    res = table.apply(lambda serie: detect_categorical(serie))
-    if verbose:
-        display_logs_depending_process_time(
-            f"Detected {sum(res)} categorical columns out of {len(table.columns)} in {round(time() - start, 3)}s",
-            time() - start
-        )
-    return res.index[res], res
-
-
-def detect_separator(file, verbose: bool = False):
-    """Detects csv separator"""
-    # TODO: add a robust detection:
-    # if a semicolon appears in the text and "\t" is the separator,
-    # for now we return a semicolon
-    if verbose:
-        start = time()
-        logging.info("Detecting separator")
-    file.seek(0)
-    header = file.readline()
-    possible_separators = [";", ",", "|", "\t"]
-    sep_count = dict()
-    for sep in possible_separators:
-        sep_count[sep] = header.count(sep)
-    sep = max(sep_count, key=sep_count.get)
-    if verbose:
-        display_logs_depending_process_time(
-            f'Detected separator: "{sep}" in {round(time() - start, 3)}s',
-            time() - start
-        )
-    return sep
-
-
-def detect_encoding(the_file, verbose: bool = False):
-    """
-    Detects file encoding using faust-cchardet (forked from the original cchardet)
-    """
-    if verbose:
-        start = time()
-        logging.info("Detecting encoding")
-    encoding_dict = detect(the_file.read())
-    if verbose:
-        message = f'Detected encoding: "{encoding_dict["encoding"]}"'
-        message += f' in {round(time() - start, 3)}s (confidence: {round(encoding_dict["confidence"]*100)}%)'
-        display_logs_depending_process_time(
-            message,
-            time() - start
-        )
-    return encoding_dict['encoding']
-
-
-def parse_table(the_file, encoding, sep, num_rows, skiprows, random_state=42, verbose: bool = False):
-    # Takes care of some problems
-    if verbose:
-        start = time()
-        logging.info("Parsing table")
-    table = None
-
-    if not isinstance(the_file, str):
-        the_file.seek(0)
-
-    total_lines = None
-    for encoding in [encoding, "ISO-8859-1", "utf-8"]:
-        # TODO: systematic modification
-        if encoding is None:
-            continue
-
-        if "ISO-8859" in encoding:
-            encoding = "ISO-8859-1"
-        try:
-            table = pd.read_csv(
-                the_file, sep=sep, dtype="unicode", encoding=encoding, skiprows=skiprows
-            )
-            total_lines = len(table)
-            nb_duplicates = len(table.loc[table.duplicated()])
-            if num_rows > 0:
-                num_rows = min(num_rows - 1, total_lines)
-                table = table.sample(num_rows, random_state=random_state)
-            # else: table is unchanged
-            break
-        except TypeError:
-            print("Trying encoding: {encoding}".format(encoding=encoding))
-
-    if table is None:
-        logging.error(" >> encoding not found")
-        return table, "NA", "NA"
-    if verbose:
-        display_logs_depending_process_time(
-            f'Table parsed successfully in {round(time() - start, 3)}s',
-            time() - start
-        )
-    return table, total_lines, nb_duplicates
-
-
-def create_profile(table, dict_cols_fields, sep, encoding, num_rows, skiprows, verbose: bool = False):
-    if verbose:
-        start = time()
-        logging.info("Creating profile")
-    map_python_types = {
-        "string": str,
-        "int": float,
-        "float": float,
-    }
-
-    if num_rows > 0:
-        raise Exception("To create profiles num_rows has to be set to -1")
-    else:
-        safe_table = table.copy()
-        dtypes = {
-            k: map_python_types.get(v["python_type"], str)
-            for k, v in dict_cols_fields.items()
-        }
-        for c in safe_table.columns:
-            if dtypes[c] == float:
-                safe_table[c] = safe_table[c].apply(
-                    lambda s: float_casting(s) if isinstance(s, str) else s
-                )
-        profile = {}
-        for c in safe_table.columns:
-            profile[c] = {}
-            if map_python_types.get(dict_cols_fields[c]["python_type"], str) in [
-                float,
-                int,
-            ]:
-                profile[c].update(
-                    min=map_python_types.get(dict_cols_fields[c]["python_type"], str)(
-                        safe_table[c].min()
-                    ),
-                    max=map_python_types.get(dict_cols_fields[c]["python_type"], str)(
-                        safe_table[c].max()
-                    ),
-                    mean=map_python_types.get(dict_cols_fields[c]["python_type"], str)(
-                        safe_table[c].mean()
-                    ),
-                    std=map_python_types.get(dict_cols_fields[c]["python_type"], str)(
-                        safe_table[c].std()
-                    ),
-                )
-            tops_bruts = safe_table[safe_table[c].notna()][c] \
-                .value_counts(dropna=True) \
-                .reset_index() \
-                .iloc[:10] \
-                .to_dict(orient="records")
-            tops = []
-            for tb in tops_bruts:
-                top = {}
-                top["count"] = tb[c]
-                top["value"] = tb["index"]
-                tops.append(top)
-            profile[c].update(
-                tops=tops,
-                nb_distinct=safe_table[c].nunique(),
-                nb_missing_values=len(safe_table[c].loc[safe_table[c].isna()]),
-            )
-    if verbose:
-        display_logs_depending_process_time(
-            f"Created profile in {round(time() - start, 3)}s",
-            time() - start
-        )
-    return profile
-
-
-def detect_extra_columns(file, sep):
-    """Checks whether there are extra (useless) columns.
-    Warning: file must not contain any empty line"""
-    file.seek(0)
-    retour = False
-    nb_useless_col = 99999
-
-    for i in range(10):
-        line = file.readline()
-        # check whether lines end with a newline
-        if retour:
-            assert line[-1] == "\n"
-        if line[-1] == "\n":
-            retour = True
-
-        # count how many trailing columns are useless
-        deb = 0 + retour
-        line = line[::-1][deb:]
-        k = 0
-        for sign in line:
-            if sign != sep:
-                break
-            k += 1
-        if k == 0:
-            return 0, retour
-        nb_useless_col = min(k, nb_useless_col)
-    return nb_useless_col, retour
-
-
-def detect_headers(file, sep, verbose: bool = False):
-    """Tests 10 first rows for possible header (header not in 1st line)"""
-    if verbose:
-        start = time()
-        logging.info("Detecting headers")
-    file.seek(0)
-    for i in range(10):
-        header = file.readline()
-        position = file.tell()
-        chaine = [c for c in header.replace("\n", "").split(sep) if c]
-        if chaine[-1] not in ["", "\n"] and all(
-            [mot not in ["", "\n"] for mot in chaine[1:-1]]
-        ):
-            next_row = file.readline()
-            file.seek(position)
-            if header != next_row:
-                if verbose:
-                    display_logs_depending_process_time(
-                        f'Detected headers in {round(time() - start, 3)}s',
-                        time() - start
-                    )
-                return i, chaine
-    if verbose:
-        logging.info('No header detected')
-    return 0, None
-
-
-def detect_heading_columns(file, sep, verbose: bool = False):
-    """Tests first 10 lines to see if there are empty heading columns"""
-    if verbose:
-        start = time()
-        logging.info("Detecting heading columns")
-    file.seek(0)
-    return_int = float("Inf")
-    for i in range(10):
-        line = file.readline()
-        return_int = min(return_int, len(line) - len(line.strip(sep)))
-        if return_int == 0:
-            if verbose:
-                display_logs_depending_process_time(
-                    f'No heading column detected in {round(time() - start, 3)}s',
-                    time() - start
-                )
-            return 0
-    if verbose:
-        display_logs_depending_process_time(
-            f'{return_int} heading columns detected in {round(time() - start, 3)}s',
-            time() - start
-        )
-    return return_int
-
-
-def detect_trailing_columns(file, sep, heading_columns, verbose: bool = False):
-    """Tests first 10 lines to see if there are empty trailing columns"""
-    if verbose:
-        start = time()
-        logging.info("Detecting trailing columns")
-    file.seek(0)
-    return_int = float("Inf")
-    for i in range(10):
-        line = file.readline()
-        return_int = min(
-            return_int,
-            len(line.replace("\n", ""))
-            - len(line.replace("\n", "").strip(sep))
-            - heading_columns,
-        )
-        if return_int == 0:
-            if verbose:
-                display_logs_depending_process_time(
-                    f'No trailing column detected in {round(time() - start, 3)}s',
-                    time() - start
-                )
-            return 0
-    if verbose:
-        display_logs_depending_process_time(
-            f'{return_int} trailing columns detected in {round(time() - start, 3)}s',
-            time() - start
-        )
-    return return_int
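For context, the deleted helpers above were consumed roughly as follows — a minimal sketch against the 0.6.7 layout shown in this hunk (the 0.9.x tree redistributes them into csv_detective/detection/encoding.py, separator.py, headers.py, and so on):

    from csv_detective.detection import detect_encoding, detect_separator

    # detect_encoding feeds the raw bytes to cchardet; detect_separator counts
    # the candidate separators (";", ",", "|", "\t") in the first decoded line.
    with open("data.csv", "rb") as f:
        encoding = detect_encoding(f)
    with open("data.csv", encoding=encoding) as f:
        sep = detect_separator(f)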
csv_detective/process_text.py
DELETED
@@ -1,39 +0,0 @@
-from re import finditer
-
-
-def camel_case_split(identifier):
-    matches = finditer(
-        ".+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)", identifier
-    )
-    return " ".join([m.group(0) for m in matches])
-
-
-# Process text
-def _process_text(val):
-    """Processing of strings to standardise them.
-    Several alternatives were tested: .translate, unidecode.unidecode,
-    hybrid methods, but none proved more performant."""
-    val = camel_case_split(val)
-    val = val.lower()
-    val = val.replace("-", " ")
-    val = val.replace("_", " ")
-    val = val.replace("'", " ")
-    val = val.replace(",", " ")
-    val = val.replace("  ", " ")
-    val = val.replace("à", "a")
-    val = val.replace("â", "a")
-    val = val.replace("ç", "c")
-    val = val.replace("é", "e")
-    val = val.replace("é", "e")
-    val = val.replace("è", "e")
-    val = val.replace("ê", "e")
-    val = val.replace("î", "i")
-    val = val.replace("ï", "i")
-    val = val.replace("ô", "o")
-    val = val.replace("ö", "o")
-    val = val.replace("î", "i")
-    val = val.replace("û", "u")
-    val = val.replace("ù", "u")
-    val = val.replace("ü", "u")
-    val = val.strip()
-    return val
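The docstring above notes that .translate, unidecode and hybrid approaches were benchmarked against the hand-rolled replacement chain. For comparison only, a standard-library sketch of the same accent folding (not what the package ships):

    import unicodedata

    def fold_accents(val: str) -> str:
        # NFKD-decompose, then drop combining marks ("é" -> "e" + U+0301 -> "e").
        return "".join(
            c for c in unicodedata.normalize("NFKD", val) if not unicodedata.combining(c)
        )

    assert fold_accents("Département de l'Ardèche") == "Departement de l'Ardeche"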
csv_detective/s3_utils.py
DELETED
@@ -1,48 +0,0 @@
-import boto3
-import logging
-
-from botocore.client import Config
-from botocore.exceptions import ClientError
-
-
-def get_minio_url(netloc: str, bucket: str, key: str) -> str:
-    """Returns location of given resource in minio once it is saved"""
-    return netloc + "/" + bucket + "/" + key
-
-
-def get_s3_client(url: str, minio_user: str, minio_pwd: str) -> boto3.client:
-    return boto3.client(
-        "s3",
-        endpoint_url=url,
-        aws_access_key_id=minio_user,
-        aws_secret_access_key=minio_pwd,
-        config=Config(signature_version="s3v4"),
-    )
-
-
-def download_from_minio(
-    netloc: str, bucket: str, key: str, filepath: str, minio_user: str, minio_pwd: str
-) -> None:
-    logging.info("Downloading from minio")
-    s3 = get_s3_client(netloc, minio_user, minio_pwd)
-    try:
-        s3.download_file(bucket, key, filepath)
-        logging.info(
-            f"Resource downloaded from minio at {get_minio_url(netloc, bucket, key)}"
-        )
-    except ClientError as e:
-        logging.error(e)
-
-
-def upload_to_minio(
-    netloc: str, bucket: str, key: str, filepath: str, minio_user: str, minio_pwd: str
-) -> None:
-    logging.info("Saving to minio")
-    s3 = get_s3_client(netloc, minio_user, minio_pwd)
-    try:
-        s3.upload_file(filepath, bucket, key)
-        logging.info(
-            f"Resource saved into minio at {get_minio_url(netloc, bucket, key)}"
-        )
-    except ClientError as e:
-        logging.error(e)
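These deleted MinIO helpers thinly wrapped boto3's S3 client against a custom endpoint; a call looked like the following sketch (endpoint and credentials are placeholders):

    from csv_detective.s3_utils import download_from_minio  # 0.6.7 layout, removed in 0.9.x

    download_from_minio(
        netloc="https://minio.example.org",  # placeholder endpoint
        bucket="datasets",
        key="resources/data.csv",
        filepath="/tmp/data.csv",
        minio_user="user",      # placeholder credentials
        minio_pwd="secret",
    )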
csv_detective-0.6.7.data/data/share/csv_detective/CHANGELOG.md
DELETED
@@ -1,118 +0,0 @@
-# Changelog
-
-## 0.6.7 (2024-01-15)
-
-- Add logs for columns whose checks take too much time within a specific test
-- Refactor some tests to improve performance and make detection more accurate
-- Try alternative ways to clean text
-
-## 0.6.6 (2023-11-24)
-
-- Change setup.py to better convey dependencies
-
-## 0.6.5 (2023-11-17)
-
-- Change encoding detection to faust-cchardet (forked from cchardet) [#66](https://github.com/etalab/csv-detective/pull/66)
-
-## 0.6.4 (2023-10-18)
-
-- Better handling of ints and floats (no longer accepting blanks and "+" in strings) [#62](https://github.com/etalab/csv-detective/pull/62)
-
-## 0.6.3 (2023-03-23)
-
-- Faster routine [#59](https://github.com/etalab/csv-detective/pull/59)
-
-## 0.6.2 (2023-02-10)
-
-- Catch OverflowError for latitude and longitude checks [#58](https://github.com/etalab/csv-detective/pull/58)
-
-## 0.6.0 (2023-02-10)
-
-- Add CI and upgrade dependencies [#49](https://github.com/etalab/csv-detective/pull/49)
-- Shuffle data before analysis [#56](https://github.com/etalab/csv-detective/pull/56)
-- Better discrimination between `code_departement` and `code_region` [#56](https://github.com/etalab/csv-detective/pull/56)
-- Add schema to output analysis [#57](https://github.com/etalab/csv-detective/pull/57)
-
-## 0.4.7 [#51](https://github.com/etalab/csv-detective/pull/51)
-
-- Allow analyzing the entire file instead of a limited number of rows [#48](https://github.com/etalab/csv-detective/pull/48)
-- Better boolean detection [#42](https://github.com/etalab/csv-detective/issues/42)
-- Differentiate python types and formats for `date` and `datetime` [#43](https://github.com/etalab/csv-detective/issues/43)
-- Better `code_departement` and `code_commune_insee` detection [#44](https://github.com/etalab/csv-detective/issues/44)
-- Fix header line (`header_row_idx`) detection [#44](https://github.com/etalab/csv-detective/issues/44)
-- Allow getting a profile of the csv [#46](https://github.com/etalab/csv-detective/issues/46)
-
-## 0.4.6 [#39](https://github.com/etalab/csv-detective/pull/39)
-
-- Fix tests
-- Prioritise FR lat / lon detection over the more generic lat / lon.
-- To reduce false positives, prevent detection of the following if label detection is missing: `['code_departement', 'code_commune_insee', 'code_postal', 'latitude_wgs', 'longitude_wgs', 'latitude_wgs_fr_metropole', 'longitude_wgs_fr_metropole', 'latitude_l93', 'longitude_l93']`
-- Lower the threshold of label detection so that a relevant match in the label boosts the detection score.
-- Add ISO country alpha-3 and numeric detection
-- Include camel case parsing in the _process_text function
-- Support optional brackets in the latlon format
-
-## 0.4.5 [#29](https://github.com/etalab/csv-detective/pull/29)
-
-- Use `netloc` instead of `url` in the location dict
-
-## 0.4.4 [#24](https://github.com/etalab/csv-detective/pull/28)
-
-- Prevent crash on empty CSVs
-- Add optional encoding and sep arguments to the routine and routine_minio functions
-- Field detection improvements (code_csp_insee and datetime RFC 822)
-- Schema generation improvements with examples
-
-
-## 0.4.3 [#24](https://github.com/etalab/csv-detective/pull/24)
-
-- Add uuid and MongoID detection
-- Add a new function dedicated to interaction with minio data
-- Add automatic table schema generation (only on minio data)
-- Modify the calculated score (consider label detection as a boost for the score)
-
-## 0.4.2 [#22](https://github.com/etalab/csv-detective/pull/22)
-
-Add type detection by header name
-
-## 0.4.1 [#19](https://github.com/etalab/csv-detective/pull/19)
-
-Fix bug
-* num_rows was causing problems when set to any value other than the default - Fixed
-
-## 0.4.0 [#18](https://github.com/etalab/csv_detective/pull/18)
-
-Add detailed output possibility
-
-Details:
-* there are now two modes for the output report: "LIMITED" and "ALL"
-* the "ALL" option gives the user the detected proportion of each column type for each column
-
-## 0.3.0 [#15](https://github.com/etalab/csv_detective/pull/15)
-
-Fix bugs
-
-Details:
-* Facilitate ML integration
-* Add column type detection
-* Fix documentation
-
-## 0.2.1 - [#2](https://github.com/etalab/csv_detective/pull/2)
-
-Add continuous integration
-
-Details:
-* Add configuration for CircleCI
-* Add `CONTRIBUTING.md`
-* Push new versions to PyPI automatically
-* Use semantic versioning
-
-## 0.2 - [#1](https://github.com/etalab/csv_detective/pull/1)
-
-Port from python2 to python3
-
-Details:
-* Add AGPLv3 license
-* Update requirements
-
-## 0.1