deepcsv 0.3.0__tar.gz → 0.5.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- deepcsv-0.5.0/PKG-INFO +128 -0
- deepcsv-0.5.0/README.md +92 -0
- deepcsv-0.5.0/deepcsv/__init__.py +15 -0
- deepcsv-0.5.0/deepcsv/deepcsv.py +126 -0
- deepcsv-0.5.0/deepcsv.egg-info/PKG-INFO +128 -0
- deepcsv-0.5.0/setup.py +35 -0
- deepcsv-0.3.0/PKG-INFO +0 -74
- deepcsv-0.3.0/README.md +0 -56
- deepcsv-0.3.0/deepcsv/__init__.py +0 -1
- deepcsv-0.3.0/deepcsv/deepcsv.py +0 -76
- deepcsv-0.3.0/deepcsv.egg-info/PKG-INFO +0 -74
- deepcsv-0.3.0/setup.py +0 -13
- {deepcsv-0.3.0 → deepcsv-0.5.0}/LICENSE +0 -0
- {deepcsv-0.3.0 → deepcsv-0.5.0}/deepcsv.egg-info/SOURCES.txt +0 -0
- {deepcsv-0.3.0 → deepcsv-0.5.0}/deepcsv.egg-info/dependency_links.txt +0 -0
- {deepcsv-0.3.0 → deepcsv-0.5.0}/deepcsv.egg-info/requires.txt +0 -0
- {deepcsv-0.3.0 → deepcsv-0.5.0}/deepcsv.egg-info/top_level.txt +0 -0
- {deepcsv-0.3.0 → deepcsv-0.5.0}/setup.cfg +0 -0
deepcsv-0.5.0/PKG-INFO
ADDED
@@ -0,0 +1,128 @@
+Metadata-Version: 2.4
+Name: deepcsv
+Version: 0.5.0
+Summary: Automatically processes data files in directories, converts array-like strings to NumPy arrays, detects and fixes data type issues, and saves results as optimized Parquet files.
+Home-page: https://github.com/abdubakr77/deepcsv
+Author: Abdullah Bakr
+Author-email: abdubakora1232@gmail.com
+License: MIT
+Project-URL: Source, https://github.com/abdubakr77/deepcsv
+Project-URL: Tracker, https://github.com/abdubakr77/deepcsv/issues
+Keywords: data-processing pandas numpy etl data-cleaning file-conversion automation parquet bulk-conversion
+Classifier: Development Status :: 5 - Production/Stable
+Classifier: Intended Audience :: Developers
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Classifier: Topic :: Scientific/Engineering :: Information Analysis
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Requires-Python: >=3.7
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: pandas
+Requires-Dist: pyarrow
+Dynamic: author
+Dynamic: author-email
+Dynamic: classifier
+Dynamic: description
+Dynamic: description-content-type
+Dynamic: home-page
+Dynamic: keywords
+Dynamic: license
+Dynamic: license-file
+Dynamic: project-url
+Dynamic: requires-dist
+Dynamic: requires-python
+Dynamic: summary
+
+# deepcsv (v0.5.0)
+
+Ever loaded a CSV file and found your carefully structured lists turned into useless strings?
+````python
+"['Action', 'Sci-Fi', 'Thriller']" # This is a string, not a list
+````
+
+deepcsv fixes this automatically.
+
+---
+
+## The Solution
+
+`deepcsv` handles these cases automatically:
+- Reads CSV/XLSX files or existing DataFrames
+- Converts any string value starting with `[` into real NumPy arrays (fast and lightweight)
+- Detects and fixes mixed-type columns by safely converting them to numeric (float)
+- Recursively processes all CSV/XLSX files in subdirectories
+- Saves results as Parquet format to preserve types and speed up analysis
+
+---
+
+## Installation
+```bash
+pip install deepcsv
+```
+
+## Usage
+
+### Single file processing (process_file)
+```python
+import deepcsv
+
+df = deepcsv.process_file('path/to/file.csv')
+```
+- Accepts `str` (file path) or `pd.DataFrame`
+- Returns `pd.DataFrame` with columns converted to arrays
+
+### Batch directory processing (process_all_files)
+```python
+import deepcsv
+
+deepcsv.process_all_files('path/to/folder')
+```
+- Processes all `.csv` and `.xlsx` files recursively
+- Saves converted files as Parquet in: `All CSV Files is Converted Here`
+
+---
+
+## What it does
+
+- Auto-detects files in directory and subdirectories
+- Converts values like:
+- `"['item1', 'item2']"` → `array(['item1', 'item2'])` (NumPy array)
+- Mixed numeric/string columns → single numeric type (float)
+- Handles NaN values without breaking
+- Stores results in Parquet format for type safety and performance
+
+---
+
+## Function Signatures
+
+- `process_file(data_input: Union[str, pd.DataFrame]) -> pd.DataFrame`
+- `process_all_files(directory_path: str) -> None`
+
+Output arrays are NumPy arrays for optimal performance in machine learning workflows.
+
+---
+
+## Key Features
+
+- Fast NumPy array conversion instead of slow Python lists
+- Mixed-type detection with automatic fixes
+- Parquet storage for data integrity
+- Recursive directory traversal
+- Warning messages for transparency
+
+---
+
+## Notes
+
+- Requires `pyarrow` for Parquet support
+- Only saves files that contain converted array columns
+
+---
+
+## Requirements
+
+- Python >= 3.7
+- pandas
+- pyarrow
+
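The description above states that mixed numeric/string columns are converted to a single float type. The following is a minimal sketch of that behavior, assuming pandas is installed. The helper name `coerce_mixed_column` and the sample values are hypothetical; only the `pd.to_numeric(..., errors='coerce')` call is taken from the package source shown later in this diff.

```python
import pandas as pd


def coerce_mixed_column(series: pd.Series) -> pd.Series:
    """Convert a column to a numeric dtype; entries that cannot be parsed become NaN."""
    return pd.to_numeric(series, errors="coerce")


# Hypothetical column mixing an int, numeric strings, and free text.
mixed = pd.Series([1, "2", "3.5", "n/a"])
fixed = coerce_mixed_column(mixed)

print(fixed.dtype)
print(fixed.tolist())
```

Note that `errors="coerce"` replaces non-numeric text with NaN rather than raising, so a caller who cares about those values may want to compare `fixed.isna()` against `mixed.isna()` before discarding the original column.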
deepcsv-0.5.0/README.md
ADDED
@@ -0,0 +1,92 @@
+# deepcsv (v0.5.0)
+
+Ever loaded a CSV file and found your carefully structured lists turned into useless strings?
+````python
+"['Action', 'Sci-Fi', 'Thriller']" # This is a string, not a list
+````
+
+deepcsv fixes this automatically.
+
+---
+
+## The Solution
+
+`deepcsv` handles these cases automatically:
+- Reads CSV/XLSX files or existing DataFrames
+- Converts any string value starting with `[` into real NumPy arrays (fast and lightweight)
+- Detects and fixes mixed-type columns by safely converting them to numeric (float)
+- Recursively processes all CSV/XLSX files in subdirectories
+- Saves results as Parquet format to preserve types and speed up analysis
+
+---
+
+## Installation
+```bash
+pip install deepcsv
+```
+
+## Usage
+
+### Single file processing (process_file)
+```python
+import deepcsv
+
+df = deepcsv.process_file('path/to/file.csv')
+```
+- Accepts `str` (file path) or `pd.DataFrame`
+- Returns `pd.DataFrame` with columns converted to arrays
+
+### Batch directory processing (process_all_files)
+```python
+import deepcsv
+
+deepcsv.process_all_files('path/to/folder')
+```
+- Processes all `.csv` and `.xlsx` files recursively
+- Saves converted files as Parquet in: `All CSV Files is Converted Here`
+
+---
+
+## What it does
+
+- Auto-detects files in directory and subdirectories
+- Converts values like:
+- `"['item1', 'item2']"` → `array(['item1', 'item2'])` (NumPy array)
+- Mixed numeric/string columns → single numeric type (float)
+- Handles NaN values without breaking
+- Stores results in Parquet format for type safety and performance
+
+---
+
+## Function Signatures
+
+- `process_file(data_input: Union[str, pd.DataFrame]) -> pd.DataFrame`
+- `process_all_files(directory_path: str) -> None`
+
+Output arrays are NumPy arrays for optimal performance in machine learning workflows.
+
+---
+
+## Key Features
+
+- Fast NumPy array conversion instead of slow Python lists
+- Mixed-type detection with automatic fixes
+- Parquet storage for data integrity
+- Recursive directory traversal
+- Warning messages for transparency
+
+---
+
+## Notes
+
+- Requires `pyarrow` for Parquet support
+- Only saves files that contain converted array columns
+
+---
+
+## Requirements
+
+- Python >= 3.7
+- pandas
+- pyarrow
+
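The README above says `process_file` accepts either a file path or a DataFrame and that CSV and XLSX inputs are supported. One way such input handling could be made explicit is sketched below, assuming pandas is installed. `load_table` is a hypothetical helper, not part of the package, and `pd.read_excel` additionally needs an Excel engine such as openpyxl at runtime.

```python
from pathlib import Path
from typing import Union

import pandas as pd


def load_table(data_input: Union[str, Path, pd.DataFrame]) -> pd.DataFrame:
    """Return a DataFrame from an existing DataFrame, a .csv path, or an .xlsx path."""
    if isinstance(data_input, pd.DataFrame):
        # Work on a copy so the caller's frame is not modified in place.
        return data_input.copy()

    path = Path(data_input)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix == ".xlsx":
        # Requires an optional Excel engine (for example openpyxl).
        return pd.read_excel(path)
    raise ValueError(f"Unsupported input: {data_input!r}")
```

Dispatching on the input type and file suffix, instead of a bare `try`/`except`, keeps genuine read errors (a missing file, a malformed CSV) visible to the caller.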
deepcsv-0.5.0/deepcsv/__init__.py
ADDED
@@ -0,0 +1,15 @@
+from .deepcsv import process_all_files, process_file
+from importlib.metadata import version
+import requests
+
+def _check_for_updates():
+    try:
+        response = requests.get("https://pypi.org/pypi/deepcsv/json")
+        latest = response.json()["info"]["version"]
+        current = version("deepcsv")
+        if latest != current:
+            print(f"DeepCSV: New version {latest} available! — run 'pip install -U deepcsv'")
+    except:
+        pass
+
+_check_for_updates()
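The 0.5.0 `__init__.py` hunk above imports `requests` and calls the PyPI JSON endpoint at import time, while the `install_requires` list in the 0.5.0 `setup.py` in this diff names only `pandas` and `pyarrow`. The sketch below shows one way an update check could avoid the extra dependency and bound the network wait, using only the standard library. The names `build_notice`, `fetch_latest`, and `update_message` and the 2-second timeout are assumptions for illustration, not the package's API. `importlib.metadata` requires Python 3.8 or later.

```python
import json
import urllib.request
from importlib.metadata import PackageNotFoundError, version
from typing import Callable, Optional

PYPI_URL = "https://pypi.org/pypi/{name}/json"


def build_notice(name: str, current: str, latest: str) -> Optional[str]:
    """Return an upgrade hint when the two version strings differ, else None."""
    if latest == current:
        return None
    return f"{name}: version {latest} is available (installed: {current})"


def fetch_latest(name: str, timeout: float = 2.0) -> str:
    """Read the latest released version from the PyPI JSON endpoint."""
    with urllib.request.urlopen(PYPI_URL.format(name=name), timeout=timeout) as resp:
        return json.load(resp)["info"]["version"]


def update_message(name: str, fetch: Callable[[str], str] = fetch_latest) -> Optional[str]:
    """Compare installed and latest versions; return None on any lookup failure."""
    try:
        current = version(name)
        latest = fetch(name)
    except (PackageNotFoundError, OSError, ValueError, KeyError):
        # Not installed, network error, bad JSON, or unexpected payload shape.
        return None
    return build_notice(name, current, latest)
```

Passing `fetch` as a parameter lets the comparison be exercised without network access, and calling `update_message` on demand rather than at import keeps `import deepcsv` free of side effects.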
deepcsv-0.5.0/deepcsv/deepcsv.py
ADDED
@@ -0,0 +1,126 @@
+import pyarrow
+import pandas as pd
+from numpy import nan,array
+from os import listdir,makedirs
+from os.path import join,relpath,dirname,isfile,isdir
+from warnings import filterwarnings
+from typing import Union
+filterwarnings("ignore")
+
+def process_file(data_input: Union[str, pd.DataFrame]) -> pd.DataFrame:
+    """
+    Parses string representations of lists in DataFrame columns to actual NumPy arrays.
+
+    (data_input: Union[str, pd.DataFrame]) -> pd.DataFrame
+
+    This function reads a CSV file or takes a DataFrame, scans each column for string values
+    starting with '[', and converts them to NumPy arrays for fast and lightweight processing.
+    It also detects mixed data types in columns and attempts to convert them to numeric (float) for consistency.
+
+    Parameters:
+    data_input (Union[str, pd.DataFrame]): Path to the CSV file or the DataFrame itself.
+
+    Returns:
+    pd.DataFrame: The processed DataFrame with list strings converted to NumPy arrays, mixed types fixed with a warning message, and original columns renamed if arrays are extracted.
+
+    Examples:
+    ### Process a CSV file
+    df = parse_lists('path/to/file.csv')
+
+    ### Process an existing DataFrame
+    df = parse_lists(my_dataframe)
+    """
+
+    print(f"Converting Now On {data_input}")
+    print("-"*50)
+
+    try:
+        data = pd.read_csv(data_input)
+    except:
+        # print("The Data It's Already Defined")
+        data = data_input
+
+    for ColName in data.columns:
+
+        First_Value = data[ColName].iloc[0]
+
+        if len(data[ColName].apply(type).unique()) >= 2:
+
+
+            sample = (data[data[ColName].apply(type) == str][ColName].head(2)).values
+
+            if len(sample) > 0 and isinstance(sample[0],str) and sample[0][0].strip().isnumeric():
+
+                print(f"WARNING:\nThis Dataset Name ({data_input.split("\\")[-1]}) Found {len(data[ColName].apply(type).unique())} Mixed DataType in a column called ({ColName})\nPath : {data_input}")
+                print(f"System : This column have These types: {data[ColName].apply(type).unique()}")
+                print(f"System : Trying to fix the column as a Float to be have only one datatype...")
+
+                data[ColName] = pd.to_numeric(data[ColName], errors='coerce')
+                print("System : Done!")
+
+        elif isinstance(First_Value , str) and First_Value.strip().startswith("["):
+
+            data[f"{ColName.capitalize()}List"] = data[ColName].apply(lambda x : array(x) if pd.notna(x) else nan)
+            data.drop(ColName,inplace=True,axis=1)
+
+    return data
+
+
+def process_all_files(directory_path: str) -> None:
+    """
+    Recursively processes all CSV and XLSX files in a directory, converts array strings to NumPy arrays,
+    and saves as Parquet files if array columns are present.
+
+    (directory_path: str) -> None
+
+    This function traverses the directory tree starting from directory_path, finds all .csv and .xlsx files,
+    applies parse_lists to each for array conversion and type fixing, and saves the result as .parquet files
+    in a subdirectory 'All CSV Files is Converted Here' only if the DataFrame contains new array columns.
+
+    Parameters:
+    directory_path (str): The root directory path to search for CSV/XLSX files.
+
+    Returns:
+    None: Files are saved to a subdirectory 'All CSV Files is Converted Here'.
+
+    Examples:
+    # Process all CSV files in a directory
+    auto_convert('/path/to/directory')
+    """
+
+    base_output = join(directory_path, "All CSV Files is Converted Here")
+    all_folders = [directory_path]
+
+    makedirs(base_output,exist_ok=True)
+
+    while True:
+
+        if all_folders:
+
+            Curr_Path = all_folders.pop(0)
+
+            for item_name in listdir(Curr_Path):
+
+                Sub_Item_Path = join(Curr_Path,item_name)
+
+                if isfile(Sub_Item_Path) and (Sub_Item_Path.endswith(".csv") or Sub_Item_Path.endswith(".xlsx")):
+
+                    df_converted = process_file(Sub_Item_Path)
+                    df_converted.reset_index(drop=True,inplace=True)
+
+                    rel_path = relpath(Sub_Item_Path, directory_path)
+                    output = join(base_output,rel_path)
+                    if "List" in df_converted.columns[-1]:
+                        print(Sub_Item_Path)
+                        makedirs(dirname(output),exist_ok=True)
+                        df_converted.to_parquet(output.replace(".csv", ".parquet"))
+
+                    print("-"*50)
+
+                elif isdir(Sub_Item_Path):
+
+                    all_folders.append(Sub_Item_Path)
+
+        else:
+            break
+
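As recorded in this diff, the 0.5.0 conversion line applies `array(x)` to each cell, where the 0.3.0 line applied `literal_eval(x)`. Assuming the cells are still plain strings at that point (as they are after `pd.read_csv`), the two calls behave differently: `numpy.array` does not parse a string, it wraps it. The sketch below illustrates the difference; `to_array` is a hypothetical helper, not part of the package.

```python
from ast import literal_eval

import numpy as np

cell = "['Action', 'Sci-Fi', 'Thriller']"

# Wrapping the raw string gives a 0-dimensional array holding the whole string.
wrapped = np.array(cell)

# Parsing first gives a 1-dimensional array with one element per list item.
parsed = np.array(literal_eval(cell))

print(wrapped.shape, parsed.shape)


def to_array(value):
    """Hypothetical helper: parse list-like strings, pass other values through."""
    if isinstance(value, str) and value.strip().startswith("["):
        return np.array(literal_eval(value))
    return value
```

`ast.literal_eval` only evaluates Python literals, so it is a reasonable parser for list-like cells from untrusted files, but it raises `ValueError` or `SyntaxError` on malformed input, which a caller may want to catch per cell.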
deepcsv-0.5.0/deepcsv.egg-info/PKG-INFO
ADDED
@@ -0,0 +1,128 @@
+Metadata-Version: 2.4
+Name: deepcsv
+Version: 0.5.0
+Summary: Automatically processes data files in directories, converts array-like strings to NumPy arrays, detects and fixes data type issues, and saves results as optimized Parquet files.
+Home-page: https://github.com/abdubakr77/deepcsv
+Author: Abdullah Bakr
+Author-email: abdubakora1232@gmail.com
+License: MIT
+Project-URL: Source, https://github.com/abdubakr77/deepcsv
+Project-URL: Tracker, https://github.com/abdubakr77/deepcsv/issues
+Keywords: data-processing pandas numpy etl data-cleaning file-conversion automation parquet bulk-conversion
+Classifier: Development Status :: 5 - Production/Stable
+Classifier: Intended Audience :: Developers
+Classifier: Topic :: Software Development :: Libraries :: Python Modules
+Classifier: Topic :: Scientific/Engineering :: Information Analysis
+Classifier: License :: OSI Approved :: MIT License
+Classifier: Programming Language :: Python :: 3
+Requires-Python: >=3.7
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: pandas
+Requires-Dist: pyarrow
+Dynamic: author
+Dynamic: author-email
+Dynamic: classifier
+Dynamic: description
+Dynamic: description-content-type
+Dynamic: home-page
+Dynamic: keywords
+Dynamic: license
+Dynamic: license-file
+Dynamic: project-url
+Dynamic: requires-dist
+Dynamic: requires-python
+Dynamic: summary
+
+# deepcsv (v0.5.0)
+
+Ever loaded a CSV file and found your carefully structured lists turned into useless strings?
+````python
+"['Action', 'Sci-Fi', 'Thriller']" # This is a string, not a list
+````
+
+deepcsv fixes this automatically.
+
+---
+
+## The Solution
+
+`deepcsv` handles these cases automatically:
+- Reads CSV/XLSX files or existing DataFrames
+- Converts any string value starting with `[` into real NumPy arrays (fast and lightweight)
+- Detects and fixes mixed-type columns by safely converting them to numeric (float)
+- Recursively processes all CSV/XLSX files in subdirectories
+- Saves results as Parquet format to preserve types and speed up analysis
+
+---
+
+## Installation
+```bash
+pip install deepcsv
+```
+
+## Usage
+
+### Single file processing (process_file)
+```python
+import deepcsv
+
+df = deepcsv.process_file('path/to/file.csv')
+```
+- Accepts `str` (file path) or `pd.DataFrame`
+- Returns `pd.DataFrame` with columns converted to arrays
+
+### Batch directory processing (process_all_files)
+```python
+import deepcsv
+
+deepcsv.process_all_files('path/to/folder')
+```
+- Processes all `.csv` and `.xlsx` files recursively
+- Saves converted files as Parquet in: `All CSV Files is Converted Here`
+
+---
+
+## What it does
+
+- Auto-detects files in directory and subdirectories
+- Converts values like:
+- `"['item1', 'item2']"` → `array(['item1', 'item2'])` (NumPy array)
+- Mixed numeric/string columns → single numeric type (float)
+- Handles NaN values without breaking
+- Stores results in Parquet format for type safety and performance
+
+---
+
+## Function Signatures
+
+- `process_file(data_input: Union[str, pd.DataFrame]) -> pd.DataFrame`
+- `process_all_files(directory_path: str) -> None`
+
+Output arrays are NumPy arrays for optimal performance in machine learning workflows.
+
+---
+
+## Key Features
+
+- Fast NumPy array conversion instead of slow Python lists
+- Mixed-type detection with automatic fixes
+- Parquet storage for data integrity
+- Recursive directory traversal
+- Warning messages for transparency
+
+---
+
+## Notes
+
+- Requires `pyarrow` for Parquet support
+- Only saves files that contain converted array columns
+
+---
+
+## Requirements
+
+- Python >= 3.7
+- pandas
+- pyarrow
+
deepcsv-0.5.0/setup.py
ADDED
@@ -0,0 +1,35 @@
+from setuptools import setup, find_packages
+from pathlib import Path
+
+this_directory = Path(__file__).parent
+
+readme = (this_directory / "README.md").read_text(encoding="utf-8")
+
+
+setup(
+    name="deepcsv",
+    version="0.5.0",
+    author="Abdullah Bakr",
+    author_email="abdubakora1232@gmail.com",
+    description="Automatically processes data files in directories, converts array-like strings to NumPy arrays, detects and fixes data type issues, and saves results as optimized Parquet files.",
+    long_description=readme,
+    long_description_content_type="text/markdown",
+    packages=find_packages(),
+    install_requires=["pandas", "pyarrow"],
+    python_requires=">=3.7",
+    url="https://github.com/abdubakr77/deepcsv",
+    license="MIT",
+    keywords="data-processing pandas numpy etl data-cleaning file-conversion automation parquet bulk-conversion",
+    classifiers=[
+        "Development Status :: 5 - Production/Stable",
+        "Intended Audience :: Developers",
+        "Topic :: Software Development :: Libraries :: Python Modules",
+        "Topic :: Scientific/Engineering :: Information Analysis",
+        "License :: OSI Approved :: MIT License",
+        "Programming Language :: Python :: 3",
+    ],
+    project_urls={
+        "Source": "https://github.com/abdubakr77/deepcsv",
+        "Tracker": "https://github.com/abdubakr77/deepcsv/issues",
+    },
+)
deepcsv-0.3.0/PKG-INFO
DELETED
@@ -1,74 +0,0 @@
-Metadata-Version: 2.4
-Name: deepcsv
-Version: 0.3.0
-Summary: Automatically walks through folders and subfolders, finds all CSV and XLSX files, detects and fixes data issues, and saves the results as Parquet files while keeping the exact same folder structure.
-Author: Abdullah Bakr
-Requires-Python: >=3.7
-Description-Content-Type: text/markdown
-License-File: LICENSE
-Requires-Dist: pandas
-Requires-Dist: pyarrow
-Dynamic: author
-Dynamic: description
-Dynamic: description-content-type
-Dynamic: license-file
-Dynamic: requires-dist
-Dynamic: requires-python
-Dynamic: summary
-
-# deepcsv
-
-A Python library that automatically walks through folders and subfolders, finds all CSV and XLSX files, detects and fixes data issues, and saves the results as Parquet files while keeping the exact same folder structure.
-
-## Installation
-```bash
-pip install deepcsv
-```
-
-## What it does
-
-- Walks through all folders and subfolders automatically
-- Finds every CSV and XLSX file
-- Detects columns that contain list strings like `"['item1', 'item2']"` and converts them into real Python arrays for faster performance
-- Detects columns with mixed data types and tries to fix them automatically
-- Warns you when a column has mixed types so you know what was changed
-- Saves the results as Parquet files to preserve the converted data types
-
-> **Why Parquet?**
-> CSV files cannot store arrays or preserve data types. Parquet solves this by keeping the exact types after conversion.
-
-> **Why arrays instead of Python lists?**
-> Arrays are significantly faster for numerical operations and machine learning workflows.
-
-## Functions
-
-### `ConvertListStrToList(file_path)`
-
-Reads a CSV file, converts list strings to arrays, fixes mixed-type columns, and returns a clean DataFrame.
-```python
-import deepcsv
-
-df = deepcsv.ConvertListStrToList("path/to/file.csv")
-```
-
-### `ReadAllCSVData(path)`
-
-Walks through all folders and subfolders, applies `ConvertListStrToList` on every CSV and XLSX file, and saves the results as Parquet files in a new folder called `All CSV Data is Converted Here`.
-```python
-import deepcsv
-
-deepcsv.ReadAllCSVData("path/to/folder")
-```
-
-## Notes
-
-- Only files that contain list string columns are saved as Parquet
-- Mixed-type columns are converted to float automatically when possible
-- Skips NaN values without breaking
-- Requires `pyarrow` for Parquet support
-
-## Requirements
-
-- Python >= 3.7
-- pandas
-- pyarrow
deepcsv-0.3.0/README.md
DELETED
@@ -1,56 +0,0 @@
-# deepcsv
-
-A Python library that automatically walks through folders and subfolders, finds all CSV and XLSX files, detects and fixes data issues, and saves the results as Parquet files while keeping the exact same folder structure.
-
-## Installation
-```bash
-pip install deepcsv
-```
-
-## What it does
-
-- Walks through all folders and subfolders automatically
-- Finds every CSV and XLSX file
-- Detects columns that contain list strings like `"['item1', 'item2']"` and converts them into real Python arrays for faster performance
-- Detects columns with mixed data types and tries to fix them automatically
-- Warns you when a column has mixed types so you know what was changed
-- Saves the results as Parquet files to preserve the converted data types
-
-> **Why Parquet?**
-> CSV files cannot store arrays or preserve data types. Parquet solves this by keeping the exact types after conversion.
-
-> **Why arrays instead of Python lists?**
-> Arrays are significantly faster for numerical operations and machine learning workflows.
-
-## Functions
-
-### `ConvertListStrToList(file_path)`
-
-Reads a CSV file, converts list strings to arrays, fixes mixed-type columns, and returns a clean DataFrame.
-```python
-import deepcsv
-
-df = deepcsv.ConvertListStrToList("path/to/file.csv")
-```
-
-### `ReadAllCSVData(path)`
-
-Walks through all folders and subfolders, applies `ConvertListStrToList` on every CSV and XLSX file, and saves the results as Parquet files in a new folder called `All CSV Data is Converted Here`.
-```python
-import deepcsv
-
-deepcsv.ReadAllCSVData("path/to/folder")
-```
-
-## Notes
-
-- Only files that contain list string columns are saved as Parquet
-- Mixed-type columns are converted to float automatically when possible
-- Skips NaN values without breaking
-- Requires `pyarrow` for Parquet support
-
-## Requirements
-
-- Python >= 3.7
-- pandas
-- pyarrow
deepcsv-0.3.0/deepcsv/__init__.py
DELETED
@@ -1 +0,0 @@
-from .deepcsv import ReadAllCSVData, ConvertListStrToList
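The two `__init__.py` hunks show that 0.3.0 exported `ReadAllCSVData` and `ConvertListStrToList`, while 0.5.0 exports `process_all_files` and `process_file`; no aliases for the old names appear in this diff, so code written against 0.3.0 would need updating. The sketch below shows one way a caller (or the package) could bridge the rename. `deprecated_alias` is a hypothetical helper, and the `process_file` defined here is only a stand-in so the example runs without deepcsv installed.

```python
import functools
import warnings


def deprecated_alias(new_func, old_name: str):
    """Wrap new_func so it can still be called under old_name, with a warning."""

    @functools.wraps(new_func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated; use {new_func.__name__} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return new_func(*args, **kwargs)

    wrapper.__name__ = old_name
    return wrapper


def process_file(data_input):
    """Stand-in for deepcsv.process_file, used only to keep this sketch runnable."""
    return data_input


# Old 0.3.0 name forwarding to the new 0.5.0 name.
ConvertListStrToList = deprecated_alias(process_file, "ConvertListStrToList")
```

With deepcsv 0.5.0 installed, the stand-in would be replaced by `from deepcsv import process_file`. Because the 0.5.0 module calls `filterwarnings("ignore")` at import, the deprecation warning would be suppressed unless the caller re-enables it.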
deepcsv-0.3.0/deepcsv/deepcsv.py
DELETED
@@ -1,76 +0,0 @@
-import pyarrow
-import os
-import warnings
-import numpy as np
-import pandas as pd
-from ast import literal_eval
-warnings.filterwarnings("ignore")
-
-def ConvertListStrToList(File_Path):
-
-    data = pd.read_csv(File_Path)
-    for ColName in data.columns:
-
-        First_Value = data[ColName].iloc[0]
-
-        if len(data[ColName].apply(type).unique()) >= 2:
-
-
-            sample = (data[data[ColName].apply(type) == str][ColName].head(2)).values
-
-            if len(sample) > 0 and isinstance(sample[0],str) and sample[0][0].strip().isnumeric():
-
-                print(f"WARNING:\nThis Dataset Name ({File_Path.split("\\")[-1]}) Found {len(data[ColName].apply(type).unique())} Mixed DataType in a column called ({ColName})\nPath : {File_Path}")
-                print(f"System : This column have These types: {data[ColName].apply(type).unique()}")
-                print(f"System : Trying to fix the column as a Float to be have only one datatype...")
-
-                data[ColName] = pd.to_numeric(data[ColName], errors='coerce')
-                print("System : Done!")
-
-        elif isinstance(First_Value , str) and First_Value.strip().startswith("["):
-
-            data[f"{ColName.capitalize()}List"] = data[ColName].apply(lambda x : literal_eval(x) if pd.notna(x) else np.nan)
-            data.drop(ColName,inplace=True,axis=1)
-
-    return data
-
-
-def ReadAllCSVData(WorkDirectoryPath):
-
-    base_output = os.path.join(WorkDirectoryPath, "All CSV Data is Converted Here")
-    all_folders = [WorkDirectoryPath]
-
-    os.makedirs(base_output,exist_ok=True)
-
-    while True:
-
-        if all_folders:
-
-            Curr_Path = all_folders.pop(0)
-
-            for item_name in os.listdir(Curr_Path):
-
-                Sub_Item_Path = os.path.join(Curr_Path,item_name)
-
-                if os.path.isfile(Sub_Item_Path) and (Sub_Item_Path.endswith(".csv") or Sub_Item_Path.endswith(".xlsx")):
-
-
-
-                    df_converted = ConvertListStrToList(Sub_Item_Path)
-                    df_converted.reset_index(drop=True,inplace=True)
-
-                    rel_path = os.path.relpath(Sub_Item_Path, WorkDirectoryPath)
-                    output = os.path.join(base_output,rel_path)
-                    if "List" in df_converted.columns[-1]:
-                        print(Sub_Item_Path)
-                        os.makedirs(os.path.dirname(output),exist_ok=True)
-                        df_converted.to_parquet(output.replace(".csv", ".parquet"))
-
-                    print("-"*50)
-
-                elif os.path.isdir(Sub_Item_Path):
-
-                    all_folders.append(Sub_Item_Path)
-
-        else:
-            break
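The directory traversal in this 0.3.0 hunk is carried into 0.5.0 essentially unchanged: both versions keep a manual folder queue, create the output folder inside the directory being walked, and derive the output name with `output.replace(".csv", ".parquet")`, which leaves an `.xlsx` name as it is. The sketch below shows an alternative way to plan input and output paths with `os.walk` and `Path.with_suffix`. `plan_outputs` is a hypothetical helper that writes nothing to disk, and `OUTPUT_DIR_NAME` uses the 0.5.0 folder name from this diff.

```python
import os
from pathlib import Path
from typing import Iterator, Tuple

OUTPUT_DIR_NAME = "All CSV Files is Converted Here"  # folder name used by 0.5.0


def plan_outputs(root: str) -> Iterator[Tuple[Path, Path]]:
    """Yield (source, destination) pairs for every .csv/.xlsx file under root."""
    root_path = Path(root)
    out_root = root_path / OUTPUT_DIR_NAME

    for current, dirs, files in os.walk(root_path):
        # Prune the output folder so earlier results are not picked up as inputs.
        dirs[:] = [d for d in dirs if Path(current, d) != out_root]

        for name in sorted(files):
            src = Path(current, name)
            if src.suffix.lower() in {".csv", ".xlsx"}:
                rel = src.relative_to(root_path)
                # with_suffix swaps the extension whatever it was (.csv or .xlsx).
                yield src, (out_root / rel).with_suffix(".parquet")
```

Each destination mirrors the source's relative path under the output folder, so the original folder structure is preserved, and the caller can decide per pair whether to convert and write.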
deepcsv-0.3.0/deepcsv.egg-info/PKG-INFO
DELETED
@@ -1,74 +0,0 @@
-Metadata-Version: 2.4
-Name: deepcsv
-Version: 0.3.0
-Summary: Automatically walks through folders and subfolders, finds all CSV and XLSX files, detects and fixes data issues, and saves the results as Parquet files while keeping the exact same folder structure.
-Author: Abdullah Bakr
-Requires-Python: >=3.7
-Description-Content-Type: text/markdown
-License-File: LICENSE
-Requires-Dist: pandas
-Requires-Dist: pyarrow
-Dynamic: author
-Dynamic: description
-Dynamic: description-content-type
-Dynamic: license-file
-Dynamic: requires-dist
-Dynamic: requires-python
-Dynamic: summary
-
-# deepcsv
-
-A Python library that automatically walks through folders and subfolders, finds all CSV and XLSX files, detects and fixes data issues, and saves the results as Parquet files while keeping the exact same folder structure.
-
-## Installation
-```bash
-pip install deepcsv
-```
-
-## What it does
-
-- Walks through all folders and subfolders automatically
-- Finds every CSV and XLSX file
-- Detects columns that contain list strings like `"['item1', 'item2']"` and converts them into real Python arrays for faster performance
-- Detects columns with mixed data types and tries to fix them automatically
-- Warns you when a column has mixed types so you know what was changed
-- Saves the results as Parquet files to preserve the converted data types
-
-> **Why Parquet?**
-> CSV files cannot store arrays or preserve data types. Parquet solves this by keeping the exact types after conversion.
-
-> **Why arrays instead of Python lists?**
-> Arrays are significantly faster for numerical operations and machine learning workflows.
-
-## Functions
-
-### `ConvertListStrToList(file_path)`
-
-Reads a CSV file, converts list strings to arrays, fixes mixed-type columns, and returns a clean DataFrame.
-```python
-import deepcsv
-
-df = deepcsv.ConvertListStrToList("path/to/file.csv")
-```
-
-### `ReadAllCSVData(path)`
-
-Walks through all folders and subfolders, applies `ConvertListStrToList` on every CSV and XLSX file, and saves the results as Parquet files in a new folder called `All CSV Data is Converted Here`.
-```python
-import deepcsv
-
-deepcsv.ReadAllCSVData("path/to/folder")
-```
-
-## Notes
-
-- Only files that contain list string columns are saved as Parquet
-- Mixed-type columns are converted to float automatically when possible
-- Skips NaN values without breaking
-- Requires `pyarrow` for Parquet support
-
-## Requirements
-
-- Python >= 3.7
-- pandas
-- pyarrow
deepcsv-0.3.0/setup.py
DELETED
@@ -1,13 +0,0 @@
-from setuptools import setup, find_packages
-
-setup(
-    name="deepcsv",
-    version="0.3.0",
-    author="Abdullah Bakr",
-    description="Automatically walks through folders and subfolders, finds all CSV and XLSX files, detects and fixes data issues, and saves the results as Parquet files while keeping the exact same folder structure.",
-    long_description=open("README.md").read(),
-    long_description_content_type="text/markdown",
-    packages=find_packages(),
-    install_requires=["pandas","pyarrow"],
-    python_requires=">=3.7",
-)
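{deepcsv-0.3.0 → deepcsv-0.5.0}/LICENSE
File without changes
{deepcsv-0.3.0 → deepcsv-0.5.0}/deepcsv.egg-info/SOURCES.txt
File without changes
{deepcsv-0.3.0 → deepcsv-0.5.0}/deepcsv.egg-info/dependency_links.txt
File without changes
{deepcsv-0.3.0 → deepcsv-0.5.0}/deepcsv.egg-info/requires.txt
File without changes
{deepcsv-0.3.0 → deepcsv-0.5.0}/deepcsv.egg-info/top_level.txt
File without changes
{deepcsv-0.3.0 → deepcsv-0.5.0}/setup.cfg
File without changes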