deepcsv 0.6.2b2__tar.gz → 0.6.3__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- deepcsv-0.6.3/CHANGELOG.md +21 -0
- {deepcsv-0.6.2b2/deepcsv.egg-info → deepcsv-0.6.3}/PKG-INFO +37 -18
- {deepcsv-0.6.2b2 → deepcsv-0.6.3}/README.md +24 -8
- {deepcsv-0.6.2b2 → deepcsv-0.6.3}/deepcsv/deepcsv.py +23 -10
- {deepcsv-0.6.2b2 → deepcsv-0.6.3}/deepcsv/utils.py +84 -3
- {deepcsv-0.6.2b2 → deepcsv-0.6.3/deepcsv.egg-info}/PKG-INFO +37 -18
- deepcsv-0.6.3/deepcsv.egg-info/SOURCES.txt +29 -0
- deepcsv-0.6.3/deepcsv.egg-info/dist/deepcsv-0.5.0-py3-none-any.whl +0 -0
- deepcsv-0.6.3/deepcsv.egg-info/dist/deepcsv-0.5.0.tar.gz +0 -0
- deepcsv-0.6.3/deepcsv.egg-info/dist/deepcsv-0.5.0b1-py3-none-any.whl +0 -0
- deepcsv-0.6.3/deepcsv.egg-info/dist/deepcsv-0.5.0b1.tar.gz +0 -0
- deepcsv-0.6.3/deepcsv.egg-info/dist/deepcsv-0.6.0-py3-none-any.whl +0 -0
- deepcsv-0.6.3/deepcsv.egg-info/dist/deepcsv-0.6.0.tar.gz +0 -0
- deepcsv-0.6.3/deepcsv.egg-info/dist/deepcsv-0.6.1-py3-none-any.whl +0 -0
- deepcsv-0.6.3/deepcsv.egg-info/dist/deepcsv-0.6.1.tar.gz +0 -0
- deepcsv-0.6.3/deepcsv.egg-info/dist/deepcsv-0.6.2-py3-none-any.whl +0 -0
- deepcsv-0.6.3/deepcsv.egg-info/dist/deepcsv-0.6.2.tar.gz +0 -0
- deepcsv-0.6.3/deepcsv.egg-info/dist/deepcsv-0.6.2b1-py3-none-any.whl +0 -0
- deepcsv-0.6.3/deepcsv.egg-info/dist/deepcsv-0.6.2b1.tar.gz +0 -0
- deepcsv-0.6.3/deepcsv.egg-info/dist/deepcsv-0.6.2b2-py3-none-any.whl +0 -0
- deepcsv-0.6.3/deepcsv.egg-info/dist/deepcsv-0.6.2b2.tar.gz +0 -0
- deepcsv-0.6.3/deepcsv.egg-info/dist/deepcsv-0.6.3b1-py3-none-any.whl +0 -0
- deepcsv-0.6.3/deepcsv.egg-info/dist/deepcsv-0.6.3b1.tar.gz +0 -0
- {deepcsv-0.6.2b2 → deepcsv-0.6.3}/setup.py +1 -1
- deepcsv-0.6.2b2/CHANGELOG.md +0 -18
- deepcsv-0.6.2b2/deepcsv.egg-info/SOURCES.txt +0 -13
- {deepcsv-0.6.2b2 → deepcsv-0.6.3}/LICENSE +0 -0
- {deepcsv-0.6.2b2 → deepcsv-0.6.3}/MANIFEST.in +0 -0
- {deepcsv-0.6.2b2 → deepcsv-0.6.3}/deepcsv/__init__.py +0 -0
- {deepcsv-0.6.2b2 → deepcsv-0.6.3}/deepcsv.egg-info/dependency_links.txt +0 -0
- {deepcsv-0.6.2b2 → deepcsv-0.6.3}/deepcsv.egg-info/requires.txt +0 -0
- {deepcsv-0.6.2b2 → deepcsv-0.6.3}/deepcsv.egg-info/top_level.txt +0 -0
- {deepcsv-0.6.2b2 → deepcsv-0.6.3}/setup.cfg +0 -0
deepcsv-0.6.3/CHANGELOG.md ADDED
@@ -0,0 +1,21 @@
+# Changelog
+
+---
+
+### Added
+
+- `process_file` — Added docs & examples for the new parameters
+- `process_all_files` — Added docs & examples for the new parameters
+- `process_file` & `process_all_files` — Added new `to_list` parameter to return real Python lists if you don't need arrays
+- Added `auto_fix` function for automatic data type correction in DataFrames with mixed dtypes.
+- Added a log to `auto_fix` to track the changes made to columns.
+- Added documentation for the `auto_fix` function.
+
+---
+
+### Changed
+
+- `process_file()` — Changed the `save_file_extension` parameter to `file_format`
+- `process_all_files()` — Changed the `file_extension` parameter to `file_format`
+
+---
{deepcsv-0.6.2b2/deepcsv.egg-info → deepcsv-0.6.3}/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: deepcsv
-Version: 0.6.2b2
+Version: 0.6.3
 Summary: Automatically processes data files in directories, converts array-like strings to NumPy arrays, detects and fixes data type issues, and saves results as optimized Parquet files and MORE!
 Home-page: https://github.com/abdubakr77/deepcsv
 Author: Abdullah Bakr
@@ -50,6 +50,7 @@ Your column has numbers — it secretly has 3 different data types.
 You have 200 CSV files across 40 folders — and you process them one by one.
 You load a file and spend 20 minutes just picking the right reader.
 You have nulls scattered everywhere with no clean way to handle them.
+You have a lot of mixed-dtype columns with no way to handle them one by one.
 
 This is the silent killer of every data pipeline.
 
@@ -65,7 +66,7 @@ This is the silent killer of every data pipeline.
 - Catches mixed-type columns and fixes them automatically
 - Saves everything in any format you choose — not just Parquet
 - Reads any file format with one function — no more picking the right reader
-- Cleans nulls with full control over columns, rows, indexes, values, and
+- Cleans nulls with full control over columns, rows, indexes, values, and fixes mixed dtypes
 
 ---
 
@@ -79,7 +80,7 @@ pip install deepcsv
 
 ## Functions
 
-### `process_file(data_input,
+### `process_file(data_input, file_format=str, to_list=False)`
 
 Reads a file or DataFrame, converts array-like strings to NumPy arrays, fixes mixed-type columns, and optionally saves the result in any format you choose.
 
@@ -90,17 +91,20 @@ import deepcsv
 df = deepcsv.process_file('path/to/file.csv')
 
 # Process and save as parquet
-df = deepcsv.process_file('path/to/file.csv',
+df = deepcsv.process_file('path/to/file.csv', file_format='parquet')
 
 # Process and save as Excel
-df = deepcsv.process_file('path/to/file.csv',
+df = deepcsv.process_file('path/to/file.csv', file_format='xlsx')
+
+# Process and convert array columns to real Python lists
+df = deepcsv.process_file('path/to/file.csv', to_list=True)
 ```
 
-**Supported
+**Supported file formats:** `.csv` `.tsv` `.txt` `.xlsx` `.json` `.parquet` `.pkl` `.feather` `.html` `.xml`
 
 ---
 
-### `process_all_files(directory_path, output_dir="All CSV Files is Converted Here",
+### `process_all_files(directory_path, output_dir="All CSV Files is Converted Here", file_format="parquet", to_list=False)`
 
 Walks through all folders and subfolders, applies `process_file` on every supported file, and saves results in the format you choose.
 
@@ -121,7 +125,7 @@ deepcsv.process_all_files('path/to/folder', file_extension='csv')
 
 ---
 
-### `read_any(file_path)`
+### `read_any(file_path)`
 
 Reads any supported file format and returns a pandas DataFrame — one function for everything.
 
@@ -138,7 +142,7 @@ df = read_any('local.db')
 
 ---
 
-### `clean_values(data_input, ...)`
+### `clean_values(data_input, ...)`
 
 Cleans a DataFrame by removing nulls, specific values, specific types, or rows by index — with full control over which columns to target and optional conditions.
 
@@ -182,6 +186,17 @@ df = clean_values('data.csv', all_cols_except=['id', 'name'])
 
 ---
 
+### `auto_fix(data_input)`
+
+Automatic data type correction for DataFrames with mixed dtypes, with a log that tracks the changes made to each column.
+
+```python
+from deepcsv import auto_fix
+
+df = auto_fix('My_Data')
+```
+---
+
 ## Function Signatures
 
 ```python
@@ -189,6 +204,7 @@ process_file(data_input: Union[str, pd.DataFrame], save_file_extension: str = No
 process_all_files(directory_path: str, output_dir: str = "All CSV Files is Converted Here", file_extension: str = "parquet") -> None
 read_any(file_path: str) -> pd.DataFrame
 clean_values(data_input, cols=None, ax_0=False, index=None, condition=None, all_cols_except=None, finding_value=None, finding_type=None) -> pd.DataFrame
+auto_fix(data_input: Union[str, pd.DataFrame])
 ```
 
 ---
@@ -228,16 +244,19 @@ clean_values(data_input, cols=None, ax_0=False, index=None, condition=None, all_
 ---
 
 ### Added
-
-- `
-- `
-- `
-- `
--
--
+
+- `process_file` — Added docs & examples for the new parameters
+- `process_all_files` — Added docs & examples for the new parameters
+- `process_file` & `process_all_files` — Added new `to_list` parameter to return real Python lists if you don't need arrays
+- Added `auto_fix` function for automatic data type correction in DataFrames with mixed dtypes.
+- Added a log to `auto_fix` to track the changes made to columns.
+- Added documentation for the `auto_fix` function.
+
+---
 
 ### Changed
-
-- `
+
+- `process_file()` — Changed the `save_file_extension` parameter to `file_format`
+- `process_all_files()` — Changed the `file_extension` parameter to `file_format`
 
 ---
{deepcsv-0.6.2b2 → deepcsv-0.6.3}/README.md
@@ -14,6 +14,7 @@ Your column has numbers — it secretly has 3 different data types.
 You have 200 CSV files across 40 folders — and you process them one by one.
 You load a file and spend 20 minutes just picking the right reader.
 You have nulls scattered everywhere with no clean way to handle them.
+You have a lot of mixed-dtype columns with no way to handle them one by one.
 
 This is the silent killer of every data pipeline.
 
@@ -29,7 +30,7 @@ This is the silent killer of every data pipeline.
 - Catches mixed-type columns and fixes them automatically
 - Saves everything in any format you choose — not just Parquet
 - Reads any file format with one function — no more picking the right reader
-- Cleans nulls with full control over columns, rows, indexes, values, and
+- Cleans nulls with full control over columns, rows, indexes, values, and fixes mixed dtypes
 
 ---
 
@@ -43,7 +44,7 @@ pip install deepcsv
 
 ## Functions
 
-### `process_file(data_input,
+### `process_file(data_input, file_format=str, to_list=False)`
 
 Reads a file or DataFrame, converts array-like strings to NumPy arrays, fixes mixed-type columns, and optionally saves the result in any format you choose.
 
@@ -54,17 +55,20 @@ import deepcsv
 df = deepcsv.process_file('path/to/file.csv')
 
 # Process and save as parquet
-df = deepcsv.process_file('path/to/file.csv',
+df = deepcsv.process_file('path/to/file.csv', file_format='parquet')
 
 # Process and save as Excel
-df = deepcsv.process_file('path/to/file.csv',
+df = deepcsv.process_file('path/to/file.csv', file_format='xlsx')
+
+# Process and convert array columns to real Python lists
+df = deepcsv.process_file('path/to/file.csv', to_list=True)
 ```
 
-**Supported
+**Supported file formats:** `.csv` `.tsv` `.txt` `.xlsx` `.json` `.parquet` `.pkl` `.feather` `.html` `.xml`
 
 ---
 
-### `process_all_files(directory_path, output_dir="All CSV Files is Converted Here",
+### `process_all_files(directory_path, output_dir="All CSV Files is Converted Here", file_format="parquet", to_list=False)`
 
 Walks through all folders and subfolders, applies `process_file` on every supported file, and saves results in the format you choose.
 
@@ -85,7 +89,7 @@ deepcsv.process_all_files('path/to/folder', file_extension='csv')
 
 ---
 
-### `read_any(file_path)`
+### `read_any(file_path)`
 
 Reads any supported file format and returns a pandas DataFrame — one function for everything.
 
@@ -102,7 +106,7 @@ df = read_any('local.db')
 
 ---
 
-### `clean_values(data_input, ...)`
+### `clean_values(data_input, ...)`
 
 Cleans a DataFrame by removing nulls, specific values, specific types, or rows by index — with full control over which columns to target and optional conditions.
 
@@ -146,6 +150,17 @@ df = clean_values('data.csv', all_cols_except=['id', 'name'])
 
 ---
 
+### `auto_fix(data_input)`
+
+Automatic data type correction for DataFrames with mixed dtypes, with a log that tracks the changes made to each column.
+
+```python
+from deepcsv import auto_fix
+
+df = auto_fix('My_Data')
+```
+---
+
 ## Function Signatures
 
 ```python
@@ -153,6 +168,7 @@ process_file(data_input: Union[str, pd.DataFrame], save_file_extension: str = No
 process_all_files(directory_path: str, output_dir: str = "All CSV Files is Converted Here", file_extension: str = "parquet") -> None
 read_any(file_path: str) -> pd.DataFrame
 clean_values(data_input, cols=None, ax_0=False, index=None, condition=None, all_cols_except=None, finding_value=None, finding_type=None) -> pd.DataFrame
+auto_fix(data_input: Union[str, pd.DataFrame])
 ```
 
 ---
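Worth noting for readers of the new `auto_fix` README entry: the example passes the bare string `'My_Data'`, but per the function's own docstring (see the utils.py diff below) `data_input` should be a readable file path or an existing DataFrame. A hedged usage sketch, with a hypothetical file name:

```python
import pandas as pd
from deepcsv import auto_fix

df = auto_fix("data/mixed_types.csv")                      # from a file path
df = auto_fix(pd.DataFrame({"price": [1.0, "2.5", 3.0]}))  # or an in-memory DataFrame
```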
{deepcsv-0.6.2b2 → deepcsv-0.6.3}/deepcsv/deepcsv.py
@@ -9,14 +9,18 @@ from os.path import join,relpath,dirname,isfile,isdir
 from warnings import filterwarnings
 filterwarnings("ignore")
 
-def process_file(data_input: Union[str, pd.DataFrame] ,
+def process_file(data_input: Union[str, pd.DataFrame] , file_format= str, to_list = False) -> pd.DataFrame:
     """
     Parses string representations of lists in DataFrame columns to actual NumPy arrays.
 
     Parameters
     ----------
-    data_input
-
+    data_input: str or pd.DataFrame
+        Path to the CSV/XLSX file or an existing DataFrame.
+    file_format: str
+        Saves a DataFrame to a file with the specified format.
+    to_list: False -> (Array) is better
+             True -> it will convert to list
 
     Returns
     -------
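One thing the new signature flags for review: `file_format` defaults to the `str` type object itself rather than a string value such as `None` or `""` (the hunk headers show the pre-change code had the same pattern with `save_file_extension=str`). If that default ever reaches the string methods used later in the function, it fails. A minimal sketch of the assumed code path:

```python
file_format = str  # the default from the signature above: a type, not a string

try:
    file_format.strip().lower()  # str.strip() has no instance to operate on
except TypeError as err:
    print(f"TypeError: {err}")   # raised before any format check can run
```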
@@ -28,6 +32,8 @@ def process_file(data_input: Union[str, pd.DataFrame] , save_file_extension=str)
     --------
     >>> df = process_file('path/to/file.csv')
     >>> df = process_file(my_dataframe)
+    >>> df = process_file(my_dataframe , save_format="parquet")
+    >>> df = process_file(my_dataframe , save_format="parquet", to_list = True)
     """
 
     try:
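Note that these new doctest lines pass `save_format=`, while the keyword actually added to the signature above is `file_format=`; run as-is they would raise `TypeError: ... got an unexpected keyword argument 'save_format'`. A corrected form, assuming the signature in this diff:

```python
df = process_file(my_dataframe, file_format="parquet")
df = process_file(my_dataframe, file_format="parquet", to_list=True)
```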
@@ -54,16 +60,20 @@ def process_file(data_input: Union[str, pd.DataFrame] , save_file_extension=str)
             print("System : Done!")
 
         elif isinstance(First_Value , str) and First_Value.strip().startswith("["):
-
-
+            if to_list:
+                data[f"{ColName.capitalize()}List"] = data[ColName].apply(lambda x : list(literal_eval(x)) if pd.notna(x) else nan)
+            else:
+                data[f"{ColName.capitalize()}List"] = data[ColName].apply(lambda x : array(literal_eval(x)) if pd.notna(x) else nan)
            data.drop(ColName,inplace=True,axis=1)
 
-    if
-        _save_as(data=data,ext=
+    if file_format.strip().lower() in ['csv','txt','tsv','xls','xlsx','json','parquet','pkl','feather','db','sqlite']:
+        _save_as(data=data,ext=file_format)
+
+
     return data
 
 
-def process_all_files(directory_path: str, output_dir="All CSV Files is Converted Here",
+def process_all_files(directory_path: str, output_dir="All CSV Files is Converted Here",file_format= "parquet",to_list = False) -> None:
     """
     Recursively processes all CSV and XLSX files in a directory,
     converts array strings to NumPy arrays, and saves as Parquet files.
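For context on what the new `to_list` branch above does: both paths parse an array-like string with `ast.literal_eval`, and the flag only decides the container type of the result. A self-contained sketch on a toy column (the DataFrame here is illustrative; `process_file` additionally renames and drops the source column as shown above):

```python
from ast import literal_eval

import numpy as np
import pandas as pd

df = pd.DataFrame({"scores": ["[1, 2, 3]", "[4, 5]", None]})

# to_list=False (default): each parsed value becomes a NumPy array
as_arrays = df["scores"].apply(lambda x: np.array(literal_eval(x)) if pd.notna(x) else np.nan)

# to_list=True: each parsed value stays a plain Python list
as_lists = df["scores"].apply(lambda x: list(literal_eval(x)) if pd.notna(x) else np.nan)

print(type(as_arrays[0]))  # <class 'numpy.ndarray'>
print(type(as_lists[0]))   # <class 'list'>
```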
@@ -74,6 +84,8 @@ def process_all_files(directory_path: str, output_dir="All CSV Files is Converte
         Root directory path to search for CSV/XLSX files.
     output_dir : str, default 'All CSV Files is Converted Here'
         Folder name where converted files will be saved.
+    file_format: str
+        Saves a DataFrame to a file with the specified format for every file.
 
     Returns
     -------
@@ -83,6 +95,7 @@ def process_all_files(directory_path: str, output_dir="All CSV Files is Converte
     --------
     >>> process_all_files('/path/to/directory')
     >>> process_all_files('/path/to/directory', output_dir="Converted Files")
+    >>> process_all_files('/path/to/directory', output_dir="Converted Files", file_format="tsv")
     """
 
     base_output = join(directory_path, output_dir)
@@ -113,8 +126,8 @@ def process_all_files(directory_path: str, output_dir="All CSV Files is Converte
                 print(Sub_Item_Path)
                 makedirs(dirname(output),exist_ok=True)
                 _save_as(data=df_converted,
-                         current_dir=output.replace(f".{Sub_Item_Path.split(".")[-1].strip().lower()}", f".{
-                         ext=
+                         current_dir=output.replace(f".{Sub_Item_Path.split(".")[-1].strip().lower()}", f".{file_format}"),
+                         ext=file_format,to_list=to_list)
 
         elif isdir(Sub_Item_Path):
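One caveat with the `current_dir=output.replace(...)` pattern above: `str.replace` substitutes every occurrence of the old extension anywhere in the path, not only the trailing suffix. A hedged sketch of the difference, using a deliberately awkward hypothetical path:

```python
from pathlib import Path

output = "converted/reports.csv/summary.csv"  # a folder name that also ends in .csv

# The diff's approach: every ".csv" in the path is rewritten
print(output.replace(".csv", ".parquet"))
# converted/reports.parquet/summary.parquet

# A suffix-only alternative
print(Path(output).with_suffix(".parquet"))
# converted/reports.csv/summary.parquet
```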
{deepcsv-0.6.2b2 → deepcsv-0.6.3}/deepcsv/utils.py
@@ -110,8 +110,8 @@ def _save_as(data: pd.DataFrame, current_dir = str(Path.cwd()), ext= str) -> Non
 
     Examples
     --------
-    >>>
-    >>>
+    >>> _save_as(df, "data/myfile", ".parquet")
+    >>> _save_as(df, "data/myfile", ".csv")
     """
     ext = ext.strip().lower()
     if not ext.startswith("."):
@@ -143,6 +143,14 @@ def _save_as(data: pd.DataFrame, current_dir = str(Path.cwd()), ext= str) -> Non
     print("-"*50)
 
 
+def _val_dtype(x,dtype):
+    if dtype == str:
+        return str(x)
+    elif dtype == float:
+        return float(x)
+    else:
+        return bool(x)
+
 # ──────────────────────────────────────────────
 # PUBLIC FUNCTIONS
 # ──────────────────────────────────────────────
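`_val_dtype` is the coercion primitive that the new `auto_fix` below leans on: it maps one value onto one of the three supported target types, and any failure (e.g. `float('abc')`) propagates up as the exception that triggers `auto_fix`'s fallback. A quick behavioral sketch; the helper is re-declared here so the snippet runs standalone:

```python
def _val_dtype(x, dtype):
    # Same logic as the helper added above
    if dtype == str:
        return str(x)
    elif dtype == float:
        return float(x)
    else:
        return bool(x)

print(_val_dtype("3.5", float))  # 3.5
print(_val_dtype(3.5, str))      # '3.5'
print(_val_dtype(0, bool))       # False

try:
    _val_dtype("abc", float)     # non-numeric string
except ValueError:
    print("float() failed; this is the case auto_fix's fallback handles")
```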
@@ -299,4 +307,77 @@ def clean_values(data_input: Union[str, pd.DataFrame],
     else:
         data.dropna(axis=1, inplace=True)
 
-    return data
+    return data
+
+
+
+def auto_fix(data_input: Union[str, pd.DataFrame]):
+    """
+    Automatically detects and fixes columns with mixed data types in a DataFrame.
+
+    This function scans each column for mixed data types (columns containing exactly
+    2 different Python types) and attempts to convert all values to the most common
+    type. If the primary conversion fails, it falls back to the secondary type.
+
+    Parameters
+    ----------
+    data_input : str or pd.DataFrame
+        File path to read from or an existing DataFrame to process.
+
+    Returns
+    -------
+    pd.DataFrame
+        DataFrame with mixed-type columns automatically converted to consistent types.
+
+    Notes
+    -----
+    - Only processes columns with exactly 2 different data types
+    - Attempts conversion to the most frequent type first
+    - Falls back to the less frequent type if primary conversion fails
+    - Prints progress messages for each column being processed
+    - Supported target types: str, float, bool
+
+    Examples
+    --------
+    >>> df = auto_fix('data/mixed_types.csv')
+    Found a column (price) Have mixed DTypes!
+    This Col Have These DTypes: [<class 'str'>, <class 'float'>]
+    NOW TRYING TO FIX!
+    Done!
+    -----------------------------------
+
+    >>> df = auto_fix(my_dataframe)
+    """
+
+    try:
+        df = read_any(data_input)
+    except Exception:
+        df = data_input
+
+
+    for ColName in df.columns:
+        if len(df[ColName].apply(type).unique()) == 2:
+            print(f"Found a column ({ColName}) Have mixed DTypes!")
+            print(f"This Col Have These DTypes: {df[ColName].apply(type).unique()}\nNOW TRYING TO FIX!")
+
+            dtype_dict = dict(df[ColName].apply(type).value_counts().to_dict())
+            dtype_values_list = [dtype_value for dtype_value in dtype_dict.values()]
+            dtype = None
+            try:
+                for dtype_name in dtype_dict.keys():
+                    if dtype_dict[dtype_name] == max(dtype_values_list):
+                        dtype = dtype_name
+
+                df[ColName] = df[ColName].apply(lambda x: _val_dtype(x,dtype))
+
+            except:
+                for dtype_name in dtype_dict.keys():
+                    if dtype_dict[dtype_name] == min(dtype_values_list):
+                        dtype = dtype_name
+
+                df[ColName] = df[ColName].apply(lambda x: _val_dtype(x,dtype))
+            print("Done!")
+            print("—"*35)
+    return df
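To make the majority-vote-with-fallback behavior concrete, here is a hedged, self-contained re-creation of `auto_fix`'s core loop on a single toy column (simplified: no file reading, no log output, and the package's bare `except:` narrowed to the exceptions type coercion actually raises):

```python
import pandas as pd

def _coerce(x, target):
    # The three target types the diff's _val_dtype supports
    if target == str:
        return str(x)
    if target == float:
        return float(x)
    return bool(x)

df = pd.DataFrame({"price": [1.0, 2.5, "3.75", 4.0]})  # 3 floats, 1 str

counts = df["price"].apply(type).value_counts()        # index holds the type objects
try:
    df["price"] = df["price"].apply(lambda x: _coerce(x, counts.idxmax()))
except (TypeError, ValueError):
    # Fallback: retry with the minority type, as auto_fix does
    df["price"] = df["price"].apply(lambda x: _coerce(x, counts.idxmin()))

print(df["price"].apply(type).unique())               # [<class 'float'>]
```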
{deepcsv-0.6.2b2 → deepcsv-0.6.3/deepcsv.egg-info}/PKG-INFO
Identical changes to the PKG-INFO diff above; the egg-info copy mirrors the top-level package metadata.
deepcsv-0.6.3/deepcsv.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,29 @@
+CHANGELOG.md
+LICENSE
+MANIFEST.in
+README.md
+setup.py
+deepcsv/__init__.py
+deepcsv/deepcsv.py
+deepcsv/utils.py
+deepcsv.egg-info/PKG-INFO
+deepcsv.egg-info/SOURCES.txt
+deepcsv.egg-info/dependency_links.txt
+deepcsv.egg-info/requires.txt
+deepcsv.egg-info/top_level.txt
+deepcsv.egg-info/dist/deepcsv-0.5.0-py3-none-any.whl
+deepcsv.egg-info/dist/deepcsv-0.5.0.tar.gz
+deepcsv.egg-info/dist/deepcsv-0.5.0b1-py3-none-any.whl
+deepcsv.egg-info/dist/deepcsv-0.5.0b1.tar.gz
+deepcsv.egg-info/dist/deepcsv-0.6.0-py3-none-any.whl
+deepcsv.egg-info/dist/deepcsv-0.6.0.tar.gz
+deepcsv.egg-info/dist/deepcsv-0.6.1-py3-none-any.whl
+deepcsv.egg-info/dist/deepcsv-0.6.1.tar.gz
+deepcsv.egg-info/dist/deepcsv-0.6.2-py3-none-any.whl
+deepcsv.egg-info/dist/deepcsv-0.6.2.tar.gz
+deepcsv.egg-info/dist/deepcsv-0.6.2b1-py3-none-any.whl
+deepcsv.egg-info/dist/deepcsv-0.6.2b1.tar.gz
+deepcsv.egg-info/dist/deepcsv-0.6.2b2-py3-none-any.whl
+deepcsv.egg-info/dist/deepcsv-0.6.2b2.tar.gz
+deepcsv.egg-info/dist/deepcsv-0.6.3b1-py3-none-any.whl
+deepcsv.egg-info/dist/deepcsv-0.6.3b1.tar.gz
deepcsv-0.6.3/deepcsv.egg-info/dist/* (the 16 bundled wheels and sdists listed above, deepcsv-0.5.0 through deepcsv-0.6.3b1): binary files, no diff shown.
{deepcsv-0.6.2b2 → deepcsv-0.6.3}/setup.py
@@ -8,7 +8,7 @@ changelog = (this_directory / "CHANGELOG.md").read_text(encoding="utf-8")
 
 setup(
     name="deepcsv",
-    version="0.6.2b2",
+    version="0.6.3",
     author="Abdullah Bakr",
     author_email="abdubakora1232@gmail.com",
     description="Automatically processes data files in directories, converts array-like strings to NumPy arrays, detects and fixes data type issues, and saves results as optimized Parquet files and MORE!",
deepcsv-0.6.2b2/CHANGELOG.md DELETED
@@ -1,18 +0,0 @@
-# Changelog
-
----
-
-### Added
-- `process_all_files` — Added option for user to customize the output folder name in
-- `read_any()` — Reads any supported file format and returns a pandas DataFrame automatically. Supports: `.csv`, `.txt`, `.tsv`, `.xls`, `.xlsx`, `.json`, `.parquet`, `.pkl`, `.feather`, `.db`, `.sqlite`
-- `clean_values()` — Cleans a DataFrame by removing nulls, specific values, specific types, or rows by index. Supports optional condition filtering with 6 operators
-- `_validate_cols()` — Internal helper: validates cols is a non-empty list and all columns exist in the DataFrame
-- `_validate_index()` — Internal helper: validates index is a non-empty list and all indexes exist in the DataFrame. Supports optional `reset_index` before validation
-- `_validate_condition()` — Internal helper: validates condition list and returns `(operator_func, value)`
-- `_parse_operator()` — Internal helper: converts operator string like `'>='` into its Python operator function
-
-### Changed
-- `process_file()` — Added `save_file_extension` parameter. Now supports saving the processed DataFrame in any format after conversion, not just returning it
-- `process_all_files()` — Added `file_extension` parameter. Now supports saving converted files in any format instead of always saving as Parquet. Also expanded supported input formats beyond `.csv` and `.xlsx` to cover all formats supported by `read_any()`
-
----
deepcsv-0.6.2b2/deepcsv.egg-info/SOURCES.txt DELETED
@@ -1,13 +0,0 @@
-CHANGELOG.md
-LICENSE
-MANIFEST.in
-README.md
-setup.py
-deepcsv/__init__.py
-deepcsv/deepcsv.py
-deepcsv/utils.py
-deepcsv.egg-info/PKG-INFO
-deepcsv.egg-info/SOURCES.txt
-deepcsv.egg-info/dependency_links.txt
-deepcsv.egg-info/requires.txt
-deepcsv.egg-info/top_level.txt
Files without changes: LICENSE, MANIFEST.in, deepcsv/__init__.py, deepcsv.egg-info/dependency_links.txt, deepcsv.egg-info/requires.txt, deepcsv.egg-info/top_level.txt, setup.cfg.