csv-detective 0.6.7__py3-none-any.whl → 0.9.3.dev2438__py3-none-any.whl

This diff shows the contents of two publicly released versions of the package, as published to their respective public registries. It is provided for informational purposes only.
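A comparison like the one below can be reproduced locally by fetching both wheels and diffing their file lists. A minimal sketch using pip and the Python standard library (the package name and versions come from the header above; everything else is illustrative, and the `.dev` wheel may not be available on every index):

```python
import pathlib
import subprocess
import zipfile

def wheel_members(spec: str, dest: str) -> set:
    """Download a single wheel (no dependencies) and return its member paths."""
    subprocess.run(
        ["pip", "download", "--no-deps", "--only-binary=:all:", "-d", dest, spec],
        check=True,
    )
    wheel = next(pathlib.Path(dest).glob("*.whl"))
    return set(zipfile.ZipFile(wheel).namelist())

old = wheel_members("csv-detective==0.6.7", "old_wheel")
new = wheel_members("csv-detective==0.9.3.dev2438", "new_wheel")

# Removed and added files, roughly matching the "Files changed" list below.
for path in sorted(old - new):
    print("-", path)
for path in sorted(new - old):
    print("+", path)
```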
Files changed (228)
  1. csv_detective/__init__.py +7 -1
  2. csv_detective/cli.py +33 -21
  3. csv_detective/{detect_fields/FR → detection}/__init__.py +0 -0
  4. csv_detective/detection/columns.py +89 -0
  5. csv_detective/detection/encoding.py +29 -0
  6. csv_detective/detection/engine.py +46 -0
  7. csv_detective/detection/formats.py +156 -0
  8. csv_detective/detection/headers.py +28 -0
  9. csv_detective/detection/rows.py +18 -0
  10. csv_detective/detection/separator.py +44 -0
  11. csv_detective/detection/variables.py +97 -0
  12. csv_detective/explore_csv.py +151 -377
  13. csv_detective/format.py +67 -0
  14. csv_detective/formats/__init__.py +9 -0
  15. csv_detective/formats/adresse.py +116 -0
  16. csv_detective/formats/binary.py +26 -0
  17. csv_detective/formats/booleen.py +35 -0
  18. csv_detective/formats/code_commune_insee.py +26 -0
  19. csv_detective/formats/code_csp_insee.py +36 -0
  20. csv_detective/formats/code_departement.py +29 -0
  21. csv_detective/formats/code_fantoir.py +21 -0
  22. csv_detective/formats/code_import.py +17 -0
  23. csv_detective/formats/code_postal.py +25 -0
  24. csv_detective/formats/code_region.py +22 -0
  25. csv_detective/formats/code_rna.py +29 -0
  26. csv_detective/formats/code_waldec.py +17 -0
  27. csv_detective/formats/commune.py +27 -0
  28. csv_detective/formats/csp_insee.py +31 -0
  29. csv_detective/{detect_fields/FR/other/insee_ape700 → formats/data}/insee_ape700.txt +0 -0
  30. csv_detective/formats/date.py +99 -0
  31. csv_detective/formats/date_fr.py +22 -0
  32. csv_detective/formats/datetime_aware.py +45 -0
  33. csv_detective/formats/datetime_naive.py +48 -0
  34. csv_detective/formats/datetime_rfc822.py +24 -0
  35. csv_detective/formats/departement.py +37 -0
  36. csv_detective/formats/email.py +28 -0
  37. csv_detective/formats/float.py +29 -0
  38. csv_detective/formats/geojson.py +36 -0
  39. csv_detective/formats/insee_ape700.py +31 -0
  40. csv_detective/formats/insee_canton.py +28 -0
  41. csv_detective/formats/int.py +23 -0
  42. csv_detective/formats/iso_country_code_alpha2.py +30 -0
  43. csv_detective/formats/iso_country_code_alpha3.py +30 -0
  44. csv_detective/formats/iso_country_code_numeric.py +31 -0
  45. csv_detective/formats/jour_de_la_semaine.py +41 -0
  46. csv_detective/formats/json.py +20 -0
  47. csv_detective/formats/latitude_l93.py +48 -0
  48. csv_detective/formats/latitude_wgs.py +42 -0
  49. csv_detective/formats/latitude_wgs_fr_metropole.py +42 -0
  50. csv_detective/formats/latlon_wgs.py +53 -0
  51. csv_detective/formats/longitude_l93.py +39 -0
  52. csv_detective/formats/longitude_wgs.py +32 -0
  53. csv_detective/formats/longitude_wgs_fr_metropole.py +32 -0
  54. csv_detective/formats/lonlat_wgs.py +36 -0
  55. csv_detective/formats/mois_de_lannee.py +48 -0
  56. csv_detective/formats/money.py +18 -0
  57. csv_detective/formats/mongo_object_id.py +14 -0
  58. csv_detective/formats/pays.py +35 -0
  59. csv_detective/formats/percent.py +16 -0
  60. csv_detective/formats/region.py +70 -0
  61. csv_detective/formats/sexe.py +17 -0
  62. csv_detective/formats/siren.py +37 -0
  63. csv_detective/{detect_fields/FR/other/siret/__init__.py → formats/siret.py} +47 -29
  64. csv_detective/formats/tel_fr.py +36 -0
  65. csv_detective/formats/uai.py +36 -0
  66. csv_detective/formats/url.py +46 -0
  67. csv_detective/formats/username.py +14 -0
  68. csv_detective/formats/uuid.py +16 -0
  69. csv_detective/formats/year.py +28 -0
  70. csv_detective/output/__init__.py +65 -0
  71. csv_detective/output/dataframe.py +96 -0
  72. csv_detective/output/example.py +250 -0
  73. csv_detective/output/profile.py +119 -0
  74. csv_detective/{schema_generation.py → output/schema.py} +268 -343
  75. csv_detective/output/utils.py +74 -0
  76. csv_detective/{detect_fields/FR/geo → parsing}/__init__.py +0 -0
  77. csv_detective/parsing/columns.py +235 -0
  78. csv_detective/parsing/compression.py +11 -0
  79. csv_detective/parsing/csv.py +56 -0
  80. csv_detective/parsing/excel.py +167 -0
  81. csv_detective/parsing/load.py +111 -0
  82. csv_detective/parsing/text.py +56 -0
  83. csv_detective/utils.py +23 -196
  84. csv_detective/validate.py +138 -0
  85. csv_detective-0.9.3.dev2438.dist-info/METADATA +267 -0
  86. csv_detective-0.9.3.dev2438.dist-info/RECORD +92 -0
  87. csv_detective-0.9.3.dev2438.dist-info/WHEEL +4 -0
  88. {csv_detective-0.6.7.dist-info → csv_detective-0.9.3.dev2438.dist-info}/entry_points.txt +1 -0
  89. csv_detective/all_packages.txt +0 -104
  90. csv_detective/detect_fields/FR/geo/adresse/__init__.py +0 -100
  91. csv_detective/detect_fields/FR/geo/code_commune_insee/__init__.py +0 -24
  92. csv_detective/detect_fields/FR/geo/code_commune_insee/code_commune_insee.txt +0 -37600
  93. csv_detective/detect_fields/FR/geo/code_departement/__init__.py +0 -11
  94. csv_detective/detect_fields/FR/geo/code_fantoir/__init__.py +0 -15
  95. csv_detective/detect_fields/FR/geo/code_fantoir/code_fantoir.txt +0 -26122
  96. csv_detective/detect_fields/FR/geo/code_postal/__init__.py +0 -19
  97. csv_detective/detect_fields/FR/geo/code_postal/code_postal.txt +0 -36822
  98. csv_detective/detect_fields/FR/geo/code_region/__init__.py +0 -27
  99. csv_detective/detect_fields/FR/geo/commune/__init__.py +0 -21
  100. csv_detective/detect_fields/FR/geo/commune/commune.txt +0 -36745
  101. csv_detective/detect_fields/FR/geo/departement/__init__.py +0 -19
  102. csv_detective/detect_fields/FR/geo/departement/departement.txt +0 -101
  103. csv_detective/detect_fields/FR/geo/insee_canton/__init__.py +0 -20
  104. csv_detective/detect_fields/FR/geo/insee_canton/canton2017.txt +0 -2055
  105. csv_detective/detect_fields/FR/geo/insee_canton/cantons.txt +0 -2055
  106. csv_detective/detect_fields/FR/geo/latitude_l93/__init__.py +0 -13
  107. csv_detective/detect_fields/FR/geo/latitude_wgs_fr_metropole/__init__.py +0 -13
  108. csv_detective/detect_fields/FR/geo/longitude_l93/__init__.py +0 -13
  109. csv_detective/detect_fields/FR/geo/longitude_wgs_fr_metropole/__init__.py +0 -13
  110. csv_detective/detect_fields/FR/geo/pays/__init__.py +0 -17
  111. csv_detective/detect_fields/FR/geo/pays/pays.txt +0 -248
  112. csv_detective/detect_fields/FR/geo/region/__init__.py +0 -16
  113. csv_detective/detect_fields/FR/geo/region/region.txt +0 -44
  114. csv_detective/detect_fields/FR/other/__init__.py +0 -0
  115. csv_detective/detect_fields/FR/other/code_csp_insee/__init__.py +0 -26
  116. csv_detective/detect_fields/FR/other/code_csp_insee/code_csp_insee.txt +0 -498
  117. csv_detective/detect_fields/FR/other/code_rna/__init__.py +0 -8
  118. csv_detective/detect_fields/FR/other/code_waldec/__init__.py +0 -12
  119. csv_detective/detect_fields/FR/other/csp_insee/__init__.py +0 -16
  120. csv_detective/detect_fields/FR/other/date_fr/__init__.py +0 -12
  121. csv_detective/detect_fields/FR/other/insee_ape700/__init__.py +0 -16
  122. csv_detective/detect_fields/FR/other/sexe/__init__.py +0 -9
  123. csv_detective/detect_fields/FR/other/siren/__init__.py +0 -18
  124. csv_detective/detect_fields/FR/other/tel_fr/__init__.py +0 -15
  125. csv_detective/detect_fields/FR/other/uai/__init__.py +0 -15
  126. csv_detective/detect_fields/FR/temp/__init__.py +0 -0
  127. csv_detective/detect_fields/FR/temp/jour_de_la_semaine/__init__.py +0 -23
  128. csv_detective/detect_fields/FR/temp/mois_de_annee/__init__.py +0 -37
  129. csv_detective/detect_fields/__init__.py +0 -57
  130. csv_detective/detect_fields/geo/__init__.py +0 -0
  131. csv_detective/detect_fields/geo/iso_country_code_alpha2/__init__.py +0 -15
  132. csv_detective/detect_fields/geo/iso_country_code_alpha3/__init__.py +0 -14
  133. csv_detective/detect_fields/geo/iso_country_code_numeric/__init__.py +0 -15
  134. csv_detective/detect_fields/geo/json_geojson/__init__.py +0 -22
  135. csv_detective/detect_fields/geo/latitude_wgs/__init__.py +0 -13
  136. csv_detective/detect_fields/geo/latlon_wgs/__init__.py +0 -15
  137. csv_detective/detect_fields/geo/longitude_wgs/__init__.py +0 -13
  138. csv_detective/detect_fields/other/__init__.py +0 -0
  139. csv_detective/detect_fields/other/booleen/__init__.py +0 -21
  140. csv_detective/detect_fields/other/email/__init__.py +0 -8
  141. csv_detective/detect_fields/other/float/__init__.py +0 -17
  142. csv_detective/detect_fields/other/int/__init__.py +0 -12
  143. csv_detective/detect_fields/other/json/__init__.py +0 -24
  144. csv_detective/detect_fields/other/mongo_object_id/__init__.py +0 -8
  145. csv_detective/detect_fields/other/twitter/__init__.py +0 -8
  146. csv_detective/detect_fields/other/url/__init__.py +0 -11
  147. csv_detective/detect_fields/other/uuid/__init__.py +0 -11
  148. csv_detective/detect_fields/temp/__init__.py +0 -0
  149. csv_detective/detect_fields/temp/date/__init__.py +0 -62
  150. csv_detective/detect_fields/temp/datetime_iso/__init__.py +0 -18
  151. csv_detective/detect_fields/temp/datetime_rfc822/__init__.py +0 -21
  152. csv_detective/detect_fields/temp/year/__init__.py +0 -10
  153. csv_detective/detect_labels/FR/__init__.py +0 -0
  154. csv_detective/detect_labels/FR/geo/__init__.py +0 -0
  155. csv_detective/detect_labels/FR/geo/adresse/__init__.py +0 -40
  156. csv_detective/detect_labels/FR/geo/code_commune_insee/__init__.py +0 -42
  157. csv_detective/detect_labels/FR/geo/code_departement/__init__.py +0 -33
  158. csv_detective/detect_labels/FR/geo/code_fantoir/__init__.py +0 -33
  159. csv_detective/detect_labels/FR/geo/code_postal/__init__.py +0 -41
  160. csv_detective/detect_labels/FR/geo/code_region/__init__.py +0 -33
  161. csv_detective/detect_labels/FR/geo/commune/__init__.py +0 -33
  162. csv_detective/detect_labels/FR/geo/departement/__init__.py +0 -47
  163. csv_detective/detect_labels/FR/geo/insee_canton/__init__.py +0 -33
  164. csv_detective/detect_labels/FR/geo/latitude_l93/__init__.py +0 -54
  165. csv_detective/detect_labels/FR/geo/latitude_wgs_fr_metropole/__init__.py +0 -55
  166. csv_detective/detect_labels/FR/geo/longitude_l93/__init__.py +0 -44
  167. csv_detective/detect_labels/FR/geo/longitude_wgs_fr_metropole/__init__.py +0 -45
  168. csv_detective/detect_labels/FR/geo/pays/__init__.py +0 -45
  169. csv_detective/detect_labels/FR/geo/region/__init__.py +0 -45
  170. csv_detective/detect_labels/FR/other/__init__.py +0 -0
  171. csv_detective/detect_labels/FR/other/code_csp_insee/__init__.py +0 -33
  172. csv_detective/detect_labels/FR/other/code_rna/__init__.py +0 -38
  173. csv_detective/detect_labels/FR/other/code_waldec/__init__.py +0 -33
  174. csv_detective/detect_labels/FR/other/csp_insee/__init__.py +0 -37
  175. csv_detective/detect_labels/FR/other/date_fr/__init__.py +0 -33
  176. csv_detective/detect_labels/FR/other/insee_ape700/__init__.py +0 -40
  177. csv_detective/detect_labels/FR/other/sexe/__init__.py +0 -33
  178. csv_detective/detect_labels/FR/other/siren/__init__.py +0 -41
  179. csv_detective/detect_labels/FR/other/siret/__init__.py +0 -40
  180. csv_detective/detect_labels/FR/other/tel_fr/__init__.py +0 -45
  181. csv_detective/detect_labels/FR/other/uai/__init__.py +0 -50
  182. csv_detective/detect_labels/FR/temp/__init__.py +0 -0
  183. csv_detective/detect_labels/FR/temp/jour_de_la_semaine/__init__.py +0 -41
  184. csv_detective/detect_labels/FR/temp/mois_de_annee/__init__.py +0 -33
  185. csv_detective/detect_labels/__init__.py +0 -43
  186. csv_detective/detect_labels/geo/__init__.py +0 -0
  187. csv_detective/detect_labels/geo/iso_country_code_alpha2/__init__.py +0 -41
  188. csv_detective/detect_labels/geo/iso_country_code_alpha3/__init__.py +0 -41
  189. csv_detective/detect_labels/geo/iso_country_code_numeric/__init__.py +0 -41
  190. csv_detective/detect_labels/geo/json_geojson/__init__.py +0 -42
  191. csv_detective/detect_labels/geo/latitude_wgs/__init__.py +0 -55
  192. csv_detective/detect_labels/geo/latlon_wgs/__init__.py +0 -67
  193. csv_detective/detect_labels/geo/longitude_wgs/__init__.py +0 -45
  194. csv_detective/detect_labels/other/__init__.py +0 -0
  195. csv_detective/detect_labels/other/booleen/__init__.py +0 -34
  196. csv_detective/detect_labels/other/email/__init__.py +0 -45
  197. csv_detective/detect_labels/other/float/__init__.py +0 -33
  198. csv_detective/detect_labels/other/int/__init__.py +0 -33
  199. csv_detective/detect_labels/other/money/__init__.py +0 -11
  200. csv_detective/detect_labels/other/money/check_col_name.py +0 -8
  201. csv_detective/detect_labels/other/mongo_object_id/__init__.py +0 -33
  202. csv_detective/detect_labels/other/twitter/__init__.py +0 -33
  203. csv_detective/detect_labels/other/url/__init__.py +0 -48
  204. csv_detective/detect_labels/other/uuid/__init__.py +0 -33
  205. csv_detective/detect_labels/temp/__init__.py +0 -0
  206. csv_detective/detect_labels/temp/date/__init__.py +0 -51
  207. csv_detective/detect_labels/temp/datetime_iso/__init__.py +0 -45
  208. csv_detective/detect_labels/temp/datetime_rfc822/__init__.py +0 -44
  209. csv_detective/detect_labels/temp/year/__init__.py +0 -44
  210. csv_detective/detection.py +0 -361
  211. csv_detective/process_text.py +0 -39
  212. csv_detective/s3_utils.py +0 -48
  213. csv_detective-0.6.7.data/data/share/csv_detective/CHANGELOG.md +0 -118
  214. csv_detective-0.6.7.data/data/share/csv_detective/LICENSE.AGPL.txt +0 -661
  215. csv_detective-0.6.7.data/data/share/csv_detective/README.md +0 -247
  216. csv_detective-0.6.7.dist-info/LICENSE.AGPL.txt +0 -661
  217. csv_detective-0.6.7.dist-info/METADATA +0 -23
  218. csv_detective-0.6.7.dist-info/RECORD +0 -150
  219. csv_detective-0.6.7.dist-info/WHEEL +0 -5
  220. csv_detective-0.6.7.dist-info/top_level.txt +0 -2
  221. tests/__init__.py +0 -0
  222. tests/test_fields.py +0 -360
  223. tests/test_file.py +0 -116
  224. tests/test_labels.py +0 -7
  225. /csv_detective/{detect_fields/FR/other/csp_insee → formats/data}/csp_insee.txt +0 -0
  226. /csv_detective/{detect_fields/geo/iso_country_code_alpha2 → formats/data}/iso_country_code_alpha2.txt +0 -0
  227. /csv_detective/{detect_fields/geo/iso_country_code_alpha3 → formats/data}/iso_country_code_alpha3.txt +0 -0
  228. /csv_detective/{detect_fields/geo/iso_country_code_numeric → formats/data}/iso_country_code_numeric.txt +0 -0
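The dominant change in this list is structural: the nested `detect_fields/` and `detect_labels/` trees (one package per format, plus per-format `.txt` reference data) are removed in favour of one flat module per format under `csv_detective/formats/`, with shared reference data consolidated in `formats/data/`, while the monolithic `detection.py` is split into `detection/`, `parsing/` and `output/` submodules. A flat per-format layout like this is typically enumerated at import time; the sketch below is hypothetical (the real `formats/__init__.py` is only 9 lines and its contents are not shown in this diff):

```python
import importlib
import pkgutil

import csv_detective.formats as formats

# Hypothetical registry: import every flat format module under formats/.
# What each module actually exports is not visible in this diff.
FORMATS = {
    name: importlib.import_module(f"csv_detective.formats.{name}")
    for _, name, is_pkg in pkgutil.iter_modules(formats.__path__)
    if not is_pkg
}

print(sorted(FORMATS))  # adresse, binary, booleen, code_commune_insee, ...
```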
csv_detective/detection.py DELETED
@@ -1,361 +0,0 @@
- import pandas as pd
- import numpy as np
- from cchardet import detect
- from ast import literal_eval
- import logging
- from time import time
- from csv_detective.utils import display_logs_depending_process_time
- from csv_detective.detect_fields.other.float import float_casting
-
- logging.basicConfig(level=logging.INFO)
-
- def detect_continuous_variable(table, continuous_th=0.9, verbose: bool = False):
-     """
-     Detects whether a column contains continuous variables: a column qualifies
-     when a considerable proportion of its values are floats.
-     Integers are excluded, because keeping them ends up matching postal codes,
-     INSEE codes, and all sorts of other codes and identifiers.
-     This is not optimal but it will do for now.
-     :param table:
-     :return:
-     """
-
-     def check_threshold(serie, continuous_th):
-         count = serie.value_counts().to_dict()
-         total_nb = len(serie)
-         if float in count:
-             nb_floats = count[float]
-         else:
-             return False
-         if nb_floats / total_nb >= continuous_th:
-             return True
-         else:
-             return False
-
-     def parses_to_integer(value):
-         try:
-             value = value.replace(",", ".")
-             value = literal_eval(value)
-             return type(value)
-         # flake8: noqa
-         except:
-             return False
-
-     if verbose:
-         start = time()
-         logging.info("Detecting continuous columns")
-     res = table.apply(
-         lambda serie: check_threshold(serie.apply(parses_to_integer), continuous_th)
-     )
-     if verbose:
-         display_logs_depending_process_time(
-             f"Detected {sum(res)} continuous columns in {round(time() - start, 3)}s",
-             time() - start
-         )
-     return res.index[res]
-
-
- def detetect_categorical_variable(
-     table, threshold_pct_categorical=0.05, max_number_categorical_values=25, verbose: bool = False
- ):
-     """
-     Heuristically detects whether a table (df) contains categorical values,
-     based on the number of unique values it contains.
-     Since the point of detecting categorical values is to then try to learn
-     models that predict them, we limit categorical columns to at most 25
-     distinct modes. Postal codes, INSEE codes, region codes and the like may
-     therefore not be considered categorical.
-     :param table:
-     :param threshold_pct_categorical:
-     :param max_number_categorical_values:
-     :return:
-     """
-
-     def abs_number_different_values(column_values):
-         return column_values.nunique()
-
-     def rel_number_different_values(column_values):
-         return column_values.nunique() / len(column_values)
-
-     def detect_categorical(column_values):
-         abs_unique_values = abs_number_different_values(column_values)
-         rel_unique_values = rel_number_different_values(column_values)
-         if abs_unique_values < max_number_categorical_values:
-             if rel_unique_values < threshold_pct_categorical:
-                 return True
-         return False
-
-     if verbose:
-         start = time()
-         logging.info("Detecting categorical columns")
-     res = table.apply(lambda serie: detect_categorical(serie))
-     if verbose:
-         display_logs_depending_process_time(
-             f"Detected {sum(res)} categorical columns out of {len(table.columns)} in {round(time() - start, 3)}s",
-             time() - start
-         )
-     return res.index[res], res
-
-
- def detect_separator(file, verbose: bool = False):
-     """Detects csv separator"""
-     # TODO: add a robust detection:
-     # if a semicolon appears in the text and \t is the separator, for now we
-     # still return the semicolon
-     if verbose:
-         start = time()
-         logging.info("Detecting separator")
-     file.seek(0)
-     header = file.readline()
-     possible_separators = [";", ",", "|", "\t"]
-     sep_count = dict()
-     for sep in possible_separators:
-         sep_count[sep] = header.count(sep)
-     sep = max(sep_count, key=sep_count.get)
-     if verbose:
-         display_logs_depending_process_time(
-             f'Detected separator: "{sep}" in {round(time() - start, 3)}s',
-             time() - start
-         )
-     return sep
-
-
- def detect_encoding(the_file, verbose: bool = False):
-     """
-     Detects file encoding using faust-cchardet (forked from the original cchardet)
-     """
-     if verbose:
-         start = time()
-         logging.info("Detecting encoding")
-     encoding_dict = detect(the_file.read())
-     if verbose:
-         message = f'Detected encoding: "{encoding_dict["encoding"]}"'
-         message += f' in {round(time() - start, 3)}s (confidence: {round(encoding_dict["confidence"]*100)}%)'
-         display_logs_depending_process_time(
-             message,
-             time() - start
-         )
-     return encoding_dict['encoding']
-
-
- def parse_table(the_file, encoding, sep, num_rows, skiprows, random_state=42, verbose: bool = False):
-     # Takes care of some problems
-     if verbose:
-         start = time()
-         logging.info("Parsing table")
-     table = None
-
-     if not isinstance(the_file, str):
-         the_file.seek(0)
-
-     total_lines = None
-     for encoding in [encoding, "ISO-8859-1", "utf-8"]:
-         # TODO: systematic modification
-         if encoding is None:
-             continue
-
-         if "ISO-8859" in encoding:
-             encoding = "ISO-8859-1"
-         try:
-             table = pd.read_csv(
-                 the_file, sep=sep, dtype="unicode", encoding=encoding, skiprows=skiprows
-             )
-             total_lines = len(table)
-             nb_duplicates = len(table.loc[table.duplicated()])
-             if num_rows > 0:
-                 num_rows = min(num_rows - 1, total_lines)
-                 table = table.sample(num_rows, random_state=random_state)
-             # else: table is unchanged
-             break
-         except TypeError:
-             print("Trying encoding : {encoding}".format(encoding=encoding))
-
-     if table is None:
-         logging.error(" >> encoding not found")
-         return table, "NA", "NA"
-     if verbose:
-         display_logs_depending_process_time(
-             f'Table parsed successfully in {round(time() - start, 3)}s',
-             time() - start
-         )
-     return table, total_lines, nb_duplicates
-
-
- def create_profile(table, dict_cols_fields, sep, encoding, num_rows, skiprows, verbose: bool = False):
-     if verbose:
-         start = time()
-         logging.info("Creating profile")
-     map_python_types = {
-         "string": str,
-         "int": float,
-         "float": float,
-     }
-
-     if num_rows > 0:
-         raise Exception("To create profiles num_rows has to be set to -1")
-     else:
-         safe_table = table.copy()
-         dtypes = {
-             k: map_python_types.get(v["python_type"], str)
-             for k, v in dict_cols_fields.items()
-         }
-         for c in safe_table.columns:
-             if dtypes[c] == float:
-                 safe_table[c] = safe_table[c].apply(
-                     lambda s: float_casting(s) if isinstance(s, str) else s
-                 )
-         profile = {}
-         for c in safe_table.columns:
-             profile[c] = {}
-             if map_python_types.get(dict_cols_fields[c]["python_type"], str) in [
-                 float,
-                 int,
-             ]:
-                 profile[c].update(
-                     min=map_python_types.get(dict_cols_fields[c]["python_type"], str)(
-                         safe_table[c].min()
-                     ),
-                     max=map_python_types.get(dict_cols_fields[c]["python_type"], str)(
-                         safe_table[c].max()
-                     ),
-                     mean=map_python_types.get(dict_cols_fields[c]["python_type"], str)(
-                         safe_table[c].mean()
-                     ),
-                     std=map_python_types.get(dict_cols_fields[c]["python_type"], str)(
-                         safe_table[c].std()
-                     ),
-                 )
-             tops_bruts = safe_table[safe_table[c].notna()][c] \
-                 .value_counts(dropna=True) \
-                 .reset_index() \
-                 .iloc[:10] \
-                 .to_dict(orient="records")
-             tops = []
-             for tb in tops_bruts:
-                 top = {}
-                 top["count"] = tb[c]
-                 top["value"] = tb["index"]
-                 tops.append(top)
-             profile[c].update(
-                 tops=tops,
-                 nb_distinct=safe_table[c].nunique(),
-                 nb_missing_values=len(safe_table[c].loc[safe_table[c].isna()]),
-             )
-     if verbose:
-         display_logs_depending_process_time(
-             f"Created profile in {round(time() - start, 3)}s",
-             time() - start
-         )
-     return profile
-
-
- def detect_extra_columns(file, sep):
-     """Checks whether there are extra (useless) trailing columns.
-     Warning: the file must not contain any empty line"""
-     file.seek(0)
-     retour = False
-     nb_useless_col = 99999
-
-     for i in range(10):
-         line = file.readline()
-         # check whether lines end with a newline character
-         if retour:
-             assert line[-1] == "\n"
-         if line[-1] == "\n":
-             retour = True
-
-         # count the number of useless trailing columns
-         deb = 0 + retour
-         line = line[::-1][deb:]
-         k = 0
-         for sign in line:
-             if sign != sep:
-                 break
-             k += 1
-         if k == 0:
-             return 0, retour
-         nb_useless_col = min(k, nb_useless_col)
-     return nb_useless_col, retour
-
-
- def detect_headers(file, sep, verbose: bool = False):
-     """Tests the first 10 rows for a possible header (header not in 1st line)"""
-     if verbose:
-         start = time()
-         logging.info("Detecting headers")
-     file.seek(0)
-     for i in range(10):
-         header = file.readline()
-         position = file.tell()
-         chaine = [c for c in header.replace("\n", "").split(sep) if c]
-         if chaine[-1] not in ["", "\n"] and all(
-             [mot not in ["", "\n"] for mot in chaine[1:-1]]
-         ):
-             next_row = file.readline()
-             file.seek(position)
-             if header != next_row:
-                 if verbose:
-                     display_logs_depending_process_time(
-                         f'Detected headers in {round(time() - start, 3)}s',
-                         time() - start
-                     )
-                 return i, chaine
-     if verbose:
-         logging.info('No header detected')
-     return 0, None
-
-
- def detect_heading_columns(file, sep, verbose: bool = False):
-     """Tests first 10 lines to see if there are empty heading columns"""
-     if verbose:
-         start = time()
-         logging.info("Detecting heading columns")
-     file.seek(0)
-     return_int = float("Inf")
-     for i in range(10):
-         line = file.readline()
-         return_int = min(return_int, len(line) - len(line.strip(sep)))
-         if return_int == 0:
-             if verbose:
-                 display_logs_depending_process_time(
-                     f'No heading column detected in {round(time() - start, 3)}s',
-                     time() - start
-                 )
-             return 0
-     if verbose:
-         display_logs_depending_process_time(
-             f'{return_int} heading columns detected in {round(time() - start, 3)}s',
-             time() - start
-         )
-     return return_int
-
-
- def detect_trailing_columns(file, sep, heading_columns, verbose: bool = False):
-     """Tests first 10 lines to see if there are empty trailing columns"""
-     if verbose:
-         start = time()
-         logging.info("Detecting trailing columns")
-     file.seek(0)
-     return_int = float("Inf")
-     for i in range(10):
-         line = file.readline()
-         return_int = min(
-             return_int,
-             len(line.replace("\n", ""))
-             - len(line.replace("\n", "").strip(sep))
-             - heading_columns,
-         )
-         if return_int == 0:
-             if verbose:
-                 display_logs_depending_process_time(
-                     f'No trailing column detected in {round(time() - start, 3)}s',
-                     time() - start
-                 )
-             return 0
-     if verbose:
-         display_logs_depending_process_time(
-             f'{return_int} trailing columns detected in {round(time() - start, 3)}s',
-             time() - start
-         )
-     return return_int
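For reference, the categorical heuristic deleted above only flags a column when both thresholds hold: fewer than 25 distinct values in absolute terms, and a distinct-to-total ratio below 5%. A self-contained sketch of that rule (pandas assumed, as in the deleted module):

```python
import pandas as pd

def is_categorical(col: pd.Series,
                   threshold_pct_categorical: float = 0.05,
                   max_number_categorical_values: int = 25) -> bool:
    # Same two-threshold rule as the deleted detect_categorical helper.
    return (
        col.nunique() < max_number_categorical_values
        and col.nunique() / len(col) < threshold_pct_categorical
    )

# 1002 rows drawn from 3 modes: categorical.
print(is_categorical(pd.Series(["a", "b", "c"] * 334)))    # True
# 1000 distinct identifiers: not categorical.
print(is_categorical(pd.Series(range(1000)).astype(str)))  # False
```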
csv_detective/process_text.py DELETED
@@ -1,39 +0,0 @@
- from re import finditer
-
-
- def camel_case_split(identifier):
-     matches = finditer(
-         ".+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)", identifier
-     )
-     return " ".join([m.group(0) for m in matches])
-
-
- # Process text
- def _process_text(val):
-     """Standardises strings before matching.
-     Several alternatives were tested (.translate, unidecode.unidecode,
-     hybrid methods), but none proved more performant."""
-     val = camel_case_split(val)
-     val = val.lower()
-     val = val.replace("-", " ")
-     val = val.replace("_", " ")
-     val = val.replace("'", " ")
-     val = val.replace(",", " ")
-     val = val.replace("  ", " ")
-     val = val.replace("à", "a")
-     val = val.replace("â", "a")
-     val = val.replace("ç", "c")
-     val = val.replace("é", "e")
-     val = val.replace("é", "e")
-     val = val.replace("è", "e")
-     val = val.replace("ê", "e")
-     val = val.replace("î", "i")
-     val = val.replace("ï", "i")
-     val = val.replace("ô", "o")
-     val = val.replace("ö", "o")
-     val = val.replace("î", "i")
-     val = val.replace("û", "u")
-     val = val.replace("ù", "u")
-     val = val.replace("ü", "u")
-     val = val.strip()
-     return val
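The deleted `_process_text` normalises column labels before matching: camel-case splitting, lowercasing, separator stripping and accent removal. A compact equivalent using `unicodedata` for the accent step (the docstring above notes that such alternatives were tested and were not faster, so this is for illustration only):

```python
import unicodedata
from re import finditer

def process_label(val: str) -> str:
    # Split camelCase words, as camel_case_split does above.
    val = " ".join(
        m.group(0)
        for m in finditer(".+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)", val)
    )
    val = val.lower()
    for ch in "-_',":
        val = val.replace(ch, " ")
    # NFKD decomposition then ASCII re-encoding drops the accents in one pass.
    val = unicodedata.normalize("NFKD", val).encode("ascii", "ignore").decode()
    return " ".join(val.split())

print(process_label("Code_Postal-Émetteur"))  # -> "code postal emetteur"
```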
csv_detective/s3_utils.py DELETED
@@ -1,48 +0,0 @@
- import boto3
- import logging
-
- from botocore.client import Config
- from botocore.exceptions import ClientError
-
-
- def get_minio_url(netloc: str, bucket: str, key: str) -> str:
-     """Returns location of given resource in minio once it is saved"""
-     return netloc + "/" + bucket + "/" + key
-
-
- def get_s3_client(url: str, minio_user: str, minio_pwd: str) -> boto3.client:
-     return boto3.client(
-         "s3",
-         endpoint_url=url,
-         aws_access_key_id=minio_user,
-         aws_secret_access_key=minio_pwd,
-         config=Config(signature_version="s3v4"),
-     )
-
-
- def download_from_minio(
-     netloc: str, bucket: str, key: str, filepath: str, minio_user: str, minio_pwd: str
- ) -> None:
-     logging.info("Downloading from minio")
-     s3 = get_s3_client(netloc, minio_user, minio_pwd)
-     try:
-         s3.download_file(bucket, key, filepath)
-         logging.info(
-             f"Resource downloaded from minio at {get_minio_url(netloc, bucket, key)}"
-         )
-     except ClientError as e:
-         logging.error(e)
-
-
- def upload_to_minio(
-     netloc: str, bucket: str, key: str, filepath: str, minio_user: str, minio_pwd: str
- ) -> None:
-     logging.info("Saving to minio")
-     s3 = get_s3_client(netloc, minio_user, minio_pwd)
-     try:
-         s3.upload_file(filepath, bucket, key)
-         logging.info(
-             f"Resource saved into minio at {get_minio_url(netloc, bucket, key)}"
-         )
-     except ClientError as e:
-         logging.error(e)
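The deleted module wrapped boto3 for S3-compatible (MinIO) storage; the helpers take the endpoint, bucket, key, local path and credentials. Their call pattern in 0.6.7, with placeholder values (endpoint, bucket, keys and credentials below are not real):

```python
# 0.6.7 only: this module is removed in 0.9.3.dev2438.
from csv_detective.s3_utils import download_from_minio, upload_to_minio

download_from_minio(
    netloc="https://minio.example.org",
    bucket="datasets",
    key="input/data.csv",
    filepath="/tmp/data.csv",
    minio_user="ACCESS_KEY",
    minio_pwd="SECRET_KEY",
)
upload_to_minio(
    netloc="https://minio.example.org",
    bucket="datasets",
    key="output/analysis.json",
    filepath="/tmp/analysis.json",
    minio_user="ACCESS_KEY",
    minio_pwd="SECRET_KEY",
)
```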
csv_detective-0.6.7.data/data/share/csv_detective/CHANGELOG.md DELETED
@@ -1,118 +0,0 @@
- # Changelog
-
- ## 0.6.7 (2024-01-15)
-
- - Add logs for columns that take too long within a specific test
- - Refactor some tests to improve performance and make detection more accurate
- - Try alternative ways to clean text
-
- ## 0.6.6 (2023-11-24)
-
- - Change setup.py to better convey dependencies
-
- ## 0.6.5 (2023-11-17)
-
- - Change encoding detection to faust-cchardet (forked from cchardet) [#66](https://github.com/etalab/csv-detective/pull/66)
-
- ## 0.6.4 (2023-10-18)
-
- - Better handling of ints and floats (no longer accepting blanks and "+" in strings) [#62](https://github.com/etalab/csv-detective/pull/62)
-
- ## 0.6.3 (2023-03-23)
-
- - Faster routine [#59](https://github.com/etalab/csv-detective/pull/59)
-
- ## 0.6.2 (2023-02-10)
-
- - Catch OverflowError for latitude and longitude checks [#58](https://github.com/etalab/csv-detective/pull/58)
-
- ## 0.6.0 (2023-02-10)
-
- - Add CI and upgrade dependencies [#49](https://github.com/etalab/csv-detective/pull/49)
- - Shuffle data before analysis [#56](https://github.com/etalab/csv-detective/pull/56)
- - Better discrimination between `code_departement` and `code_region` [#56](https://github.com/etalab/csv-detective/pull/56)
- - Add schema in output analysis [#57](https://github.com/etalab/csv-detective/pull/57)
-
- ## 0.4.7 [#51](https://github.com/etalab/csv-detective/pull/51)
-
- - Allow analyzing the entire file instead of a limited number of rows [#48](https://github.com/etalab/csv-detective/pull/48)
- - Better boolean detection [#42](https://github.com/etalab/csv-detective/issues/42)
- - Differentiate python types and formats for `date` and `datetime` [#43](https://github.com/etalab/csv-detective/issues/43)
- - Better `code_departement` and `code_commune_insee` detection [#44](https://github.com/etalab/csv-detective/issues/44)
- - Fix header line (`header_row_idx`) detection [#44](https://github.com/etalab/csv-detective/issues/44)
- - Allow getting a profile of the csv [#46](https://github.com/etalab/csv-detective/issues/46)
-
- ## 0.4.6 [#39](https://github.com/etalab/csv-detective/pull/39)
-
- - Fix tests
- - Prioritise FR lat/lon detection over the more generic lat/lon
- - To reduce false positives, prevent detection of the following if label detection is missing: `['code_departement', 'code_commune_insee', 'code_postal', 'latitude_wgs', 'longitude_wgs', 'latitude_wgs_fr_metropole', 'longitude_wgs_fr_metropole', 'latitude_l93', 'longitude_l93']`
- - Lower the label-detection threshold so that a relevant match in the label boosts the detection score
- - Include camel case parsing in the _process_text function
- - Add ISO country alpha-3 and numeric detection
- - Support optional brackets in the latlon format
-
- ## 0.4.5 [#29](https://github.com/etalab/csv-detective/pull/29)
-
- - Use `netloc` instead of `url` in the location dict
-
- ## 0.4.4 [#28](https://github.com/etalab/csv-detective/pull/28)
-
- - Prevent crash on empty CSVs
- - Add optional encoding and sep arguments to the routine and routine_minio functions
- - Field detection improvements (code_csp_insee and datetime RFC 822)
- - Schema generation improvements with examples
-
- ## 0.4.3 [#24](https://github.com/etalab/csv-detective/pull/24)
-
- - Add uuid and MongoID detection
- - Add a new function dedicated to interacting with minio data
- - Add automatic table schema generation (only on minio data)
- - Modify the calculated score (label detection now acts as a score boost)
-
- ## 0.4.2 [#22](https://github.com/etalab/csv-detective/pull/22)
-
- Add type detection by header name
-
- ## 0.4.1 [#19](https://github.com/etalab/csv-detective/pull/19)
-
- Fix bug
- * num_rows caused problems when set to a value other than the default - fixed
-
- ## 0.4.0 [#18](https://github.com/etalab/csv_detective/pull/18)
-
- Add detailed output option
-
- Details:
- * there are now two output report modes: "LIMITED" and "ALL"
- * the "ALL" option gives the user the detected proportion of each column type, for each column
-
- ## 0.3.0 [#15](https://github.com/etalab/csv_detective/pull/15)
-
- Fix bugs
-
- Details:
- * Facilitate ML integration
- * Add column type detection
- * Fix documentation
-
- ## 0.2.1 - [#2](https://github.com/etalab/csv_detective/pull/2)
-
- Add continuous integration
-
- Details:
- * Add configuration for CircleCI
- * Add `CONTRIBUTING.md`
- * Push new versions to PyPI automatically
- * Use semantic versioning
-
- ## 0.2 - [#1](https://github.com/etalab/csv_detective/pull/1)
-
- Port from python2 to python3
-
- Details:
- * Add license AGPLv3
- * Update requirements
-
- ## 0.1