deepcsv 0.2.0__tar.gz → 0.4.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
deepcsv-0.4.0/PKG-INFO ADDED
@@ -0,0 +1,122 @@
1
+ Metadata-Version: 2.4
2
+ Name: deepcsv
3
+ Version: 0.4.0
4
+ Summary: Automatically walks through folders and subfolders, finds all CSV and XLSX files, detects and fixes data issues, and saves the results as Parquet files while keeping the exact same folder structure.
5
+ Home-page: https://github.com/abdubakr77/deepcsv
6
+ Author: Abdullah Bakr
7
+ Author-email: abdubakora1232@gmail.com
8
+ Project-URL: Source, https://github.com/abdubakr77/deepcsv
9
+ Project-URL: Tracker, https://github.com/abdubakr77/deepcsv/issues
10
+ Requires-Python: >=3.7
11
+ Description-Content-Type: text/markdown
12
+ License-File: LICENSE
13
+ Requires-Dist: pandas
14
+ Requires-Dist: pyarrow
15
+ Dynamic: author
16
+ Dynamic: author-email
17
+ Dynamic: description
18
+ Dynamic: description-content-type
19
+ Dynamic: home-page
20
+ Dynamic: license-file
21
+ Dynamic: project-url
22
+ Dynamic: requires-dist
23
+ Dynamic: requires-python
24
+ Dynamic: summary
25
+
26
+ # deepcsv
27
+
28
+ Stop losing your data types when working with CSV files.
29
+ deepcsv automatically cleans messy CSV/XLSX data and converts it into ML-ready Parquet format.
30
+
31
+ ## Installation
32
+ ```bash
33
+ pip install deepcsv
34
+ ```
35
+
36
+ ## Example
37
+
38
+ ### Before
39
+ ```python
40
+ # CSV column value
41
+ "['a', 'b', 'c']"
42
+ ```
43
+
44
+ ### After
45
+ ```python
46
+ # Automatically converted
47
+ ['a', 'b', 'c']
48
+ ```
49
+
50
+ ### Usage
51
+ ```python
52
+ import deepcsv
53
+
54
+ df = deepcsv.ConvertListStrToList("path/to/file.csv")
55
+ ```
56
+
57
+ ---
58
+
59
+ ## What it does
60
+
61
+ - Walks through all folders and subfolders automatically
62
+ - Finds every CSV and XLSX file
63
+ - Detects columns that contain list strings like `"['item1', 'item2']"` and converts them into real Python arrays for faster performance
64
+ - Detects columns with mixed data types and tries to fix them automatically
65
+ - Warns you when a column has mixed types so you know what was changed
66
+ - Saves the results as Parquet files to preserve the converted data types
67
+
68
+ ---
69
+
70
+ ## Why Parquet?
71
+
72
+ CSV files cannot store arrays or preserve data types.
73
+ Parquet solves this by keeping the exact types after conversion and is much faster for data processing workflows.
74
+
75
+ ---
76
+
77
+ ## Why arrays instead of Python lists?
78
+
79
+ Arrays are significantly faster for numerical operations and machine learning workflows, especially when working with large datasets.
80
+
81
+ ---
82
+
83
+ ## Functions
84
+
85
+ ### `ConvertListStrToList(file_path)`
86
+
87
+ Reads a CSV file, converts list strings to arrays, fixes mixed-type columns, and returns a clean DataFrame.
88
+
89
+ ```python
90
+ import deepcsv
91
+
92
+ df = deepcsv.ConvertListStrToList("path/to/file.csv")
93
+ ```
94
+
95
+ ---
96
+
97
+ ### `ReadAllCSVData(path)`
98
+
99
+ Walks through all folders and subfolders, applies `ConvertListStrToList` on every CSV and XLSX file, and saves the results as Parquet files in a new folder called `All CSV Data is Converted Here`.
100
+
101
+ ```python
102
+ import deepcsv
103
+
104
+ deepcsv.ReadAllCSVData("path/to/folder")
105
+ ```
106
+
107
+ ---
108
+
109
+ ## Notes
110
+
111
+ - Only files that contain list string columns are saved as Parquet
112
+ - Mixed-type columns are converted to float automatically when possible
113
+ - Skips NaN values without breaking
114
+ - Requires `pyarrow` for Parquet support
115
+
116
+ ---
117
+
118
+ ## Requirements
119
+
120
+ - Python >= 3.7
121
+ - pandas
122
+ - pyarrow
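
The Notes section above says mixed-type columns are converted to float when possible. The sketch below illustrates that idea using only the standard library. It is not the package's code: the 0.4.0 source calls `pd.to_numeric(..., errors='coerce')`, and appears to do so only when the first string value in the column starts with a digit. The helper names `has_mixed_types` and `coerce_to_float` are hypothetical.

```python
import math


def has_mixed_types(values):
    """Return True when the values contain two or more distinct Python types."""
    return len({type(v) for v in values}) >= 2


def coerce_to_float(values):
    """Convert each value to float; anything that cannot be parsed becomes NaN.

    This mirrors the spirit of errors='coerce', not pandas' exact parsing rules.
    """
    result = []
    for value in values:
        try:
            result.append(float(value))
        except (TypeError, ValueError):
            result.append(math.nan)
    return result


# A column as it might come out of a messy CSV: numbers, numeric strings, junk.
column = [1.5, "2", "n/a", 4.0]
if has_mixed_types(column):
    column = coerce_to_float(column)
```

Coercion is lossy: a value such as `"n/a"` silently becomes NaN, which is presumably why the package prints a warning before converting.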
@@ -0,0 +1,97 @@
1
+ # deepcsv
2
+
3
+ Stop losing your data types when working with CSV files.
4
+ deepcsv automatically cleans messy CSV/XLSX data and converts it into ML-ready Parquet format.
5
+
6
+ ## Installation
7
+ ```bash
8
+ pip install deepcsv
9
+ ```
10
+
11
+ ## Example
12
+
13
+ ### Before
14
+ ```python
15
+ # CSV column value
16
+ "['a', 'b', 'c']"
17
+ ```
18
+
19
+ ### After
20
+ ```python
21
+ # Automatically converted
22
+ ['a', 'b', 'c']
23
+ ```
24
+
25
+ ### Usage
26
+ ```python
27
+ import deepcsv
28
+
29
+ df = deepcsv.ConvertListStrToList("path/to/file.csv")
30
+ ```
31
+
32
+ ---
33
+
34
+ ## What it does
35
+
36
+ - Walks through all folders and subfolders automatically
37
+ - Finds every CSV and XLSX file
38
+ - Detects columns that contain list strings like `"['item1', 'item2']"` and converts them into real Python arrays for faster performance
39
+ - Detects columns with mixed data types and tries to fix them automatically
40
+ - Warns you when a column has mixed types so you know what was changed
41
+ - Saves the results as Parquet files to preserve the converted data types
42
+
43
+ ---
44
+
45
+ ## Why Parquet?
46
+
47
+ CSV files cannot store arrays or preserve data types.
48
+ Parquet solves this by keeping the exact types after conversion and is much faster for data processing workflows.
49
+
50
+ ---
51
+
52
+ ## Why arrays instead of Python lists?
53
+
54
+ Arrays are significantly faster for numerical operations and machine learning workflows, especially when working with large datasets.
55
+
56
+ ---
57
+
58
+ ## Functions
59
+
60
+ ### `ConvertListStrToList(file_path)`
61
+
62
+ Reads a CSV file, converts list strings to arrays, fixes mixed-type columns, and returns a clean DataFrame.
63
+
64
+ ```python
65
+ import deepcsv
66
+
67
+ df = deepcsv.ConvertListStrToList("path/to/file.csv")
68
+ ```
69
+
70
+ ---
71
+
72
+ ### `ReadAllCSVData(path)`
73
+
74
+ Walks through all folders and subfolders, applies `ConvertListStrToList` on every CSV and XLSX file, and saves the results as Parquet files in a new folder called `All CSV Data is Converted Here`.
75
+
76
+ ```python
77
+ import deepcsv
78
+
79
+ deepcsv.ReadAllCSVData("path/to/folder")
80
+ ```
81
+
82
+ ---
83
+
84
+ ## Notes
85
+
86
+ - Only files that contain list string columns are saved as Parquet
87
+ - Mixed-type columns are converted to float automatically when possible
88
+ - Skips NaN values without breaking
89
+ - Requires `pyarrow` for Parquet support
90
+
91
+ ---
92
+
93
+ ## Requirements
94
+
95
+ - Python >= 3.7
96
+ - pandas
97
+ - pyarrow
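
The Before/After example in this README relies on `ast.literal_eval` to turn a list string into a real list. The sketch below shows that step for a single cell. The helper name `parse_list_cell` is hypothetical, and returning `None` for malformed cells is an assumption made here; the released code calls `literal_eval` directly on every non-null cell, so a malformed cell would raise there.

```python
from ast import literal_eval


def parse_list_cell(cell):
    """Parse a cell like "['a', 'b']" into a list.

    Returns None for missing cells (NaN or None), for strings that do not
    look like a list, and for list strings that fail to parse.
    """
    if not isinstance(cell, str):
        return None  # empty CSV cells usually arrive as float NaN
    text = cell.strip()
    if not text.startswith("["):
        return None
    try:
        value = literal_eval(text)
    except (ValueError, SyntaxError):
        return None  # e.g. an unclosed bracket or a bare identifier
    return value if isinstance(value, list) else None


cells = ["['a', 'b', 'c']", float("nan"), "[1, 2"]
parsed = [parse_list_cell(c) for c in cells]
```

`literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for untrusted CSV content.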
@@ -0,0 +1,78 @@
1
+ from os import listdir,mkdir,makedirs
2
+ from os.path import join,relpath,dirname,isfile,isdir
3
+ import pandas as pd
4
+ from ast import literal_eval
5
+ from numpy import nan
6
+ import pyarrow
7
+ from warnings import filterwarnings
8
+ filterwarnings("ignore")
9
+
10
+ def ConvertListStrToList(File_Path):
11
+ print(File_Path)
12
+ print("-"*50)
13
+ data = pd.read_csv(File_Path)
14
+ for ColName in data.columns:
15
+
16
+ First_Value = data[ColName].iloc[0]
17
+
18
+ if len(data[ColName].apply(type).unique()) >= 2:
19
+
20
+
21
+ sample = (data[data[ColName].apply(type) == str][ColName].head(2)).values
22
+
23
+ if len(sample) > 0 and isinstance(sample[0],str) and sample[0][0].strip().isnumeric():
24
+
25
+                 print(f"WARNING:\nDataset ({File_Path.split(chr(92))[-1]}) has {len(data[ColName].apply(type).unique())} mixed data types in a column called ({ColName})\nPath : {File_Path}")
26
+                 print(f"System : This column has these types: {data[ColName].apply(type).unique()}")
27
+                 print("System : Trying to convert the column to float so it has only one data type...")
28
+
29
+ data[ColName] = pd.to_numeric(data[ColName], errors='coerce')
30
+ print("System : Done!")
31
+
32
+ elif isinstance(First_Value , str) and First_Value.strip().startswith("["):
33
+
34
+ data[f"{ColName.capitalize()}List"] = data[ColName].apply(lambda x : literal_eval(x) if pd.notna(x) else nan)
35
+ data.drop(ColName,inplace=True,axis=1)
36
+
37
+ return data
38
+
39
+
40
+ def ReadAllCSVData(WorkDirectoryPath):
41
+
42
+ base_output = join(WorkDirectoryPath, "All CSV Data is Converted Here")
43
+ all_folders = [WorkDirectoryPath]
44
+
45
+ makedirs(base_output,exist_ok=True)
46
+
47
+ while True:
48
+
49
+ if all_folders:
50
+
51
+ Curr_Path = all_folders.pop(0)
52
+
53
+ for item_name in listdir(Curr_Path):
54
+
55
+ Sub_Item_Path = join(Curr_Path,item_name)
56
+
57
+ if isfile(Sub_Item_Path) and (Sub_Item_Path.endswith(".csv") or Sub_Item_Path.endswith(".xlsx")):
58
+
59
+
60
+
61
+ df_converted = ConvertListStrToList(Sub_Item_Path)
62
+ df_converted.reset_index(drop=True,inplace=True)
63
+
64
+ rel_path = relpath(Sub_Item_Path, WorkDirectoryPath)
65
+ output = join(base_output,rel_path)
66
+ if "List" in df_converted.columns[-1]:
67
+ print(Sub_Item_Path)
68
+ makedirs(dirname(output),exist_ok=True)
69
+                         df_converted.to_parquet(output.rsplit(".", 1)[0] + ".parquet")
70
+
71
+ print("-"*50)
72
+
73
+ elif isdir(Sub_Item_Path):
74
+
75
+ all_folders.append(Sub_Item_Path)
76
+
77
+ else:
78
+ break
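
The folder walk in `ReadAllCSVData` above mirrors each input file's relative path under `All CSV Data is Converted Here`. The sketch below shows the same path planning with `os.walk` and `os.path.splitext`, so that both `.csv` and `.xlsx` inputs map to a `.parquet` name. It only computes paths and reads no data. The function name `plan_outputs` and the pruning of the output folder are assumptions added for illustration, not part of the package.

```python
import os

OUTPUT_DIR_NAME = "All CSV Data is Converted Here"  # folder name used by the package
SUPPORTED = (".csv", ".xlsx")


def plan_outputs(root):
    """Yield (source, destination) pairs mirroring root's layout under the output folder."""
    base_output = os.path.join(root, OUTPUT_DIR_NAME)
    for current, dirs, files in os.walk(root):
        # Prune the output folder so a later run never descends into its own results.
        dirs[:] = [d for d in dirs if os.path.join(current, d) != base_output]
        for name in sorted(files):
            stem, ext = os.path.splitext(name)
            if ext.lower() not in SUPPORTED:
                continue
            source = os.path.join(current, name)
            rel_dir = os.path.relpath(current, root)
            destination = os.path.normpath(
                os.path.join(base_output, rel_dir, stem + ".parquet")
            )
            yield source, destination
```

A caller would create `os.path.dirname(destination)` with `os.makedirs(..., exist_ok=True)` before writing each file, as the source above does.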
@@ -0,0 +1,122 @@
1
+ Metadata-Version: 2.4
2
+ Name: deepcsv
3
+ Version: 0.4.0
4
+ Summary: Automatically walks through folders and subfolders, finds all CSV and XLSX files, detects and fixes data issues, and saves the results as Parquet files while keeping the exact same folder structure.
5
+ Home-page: https://github.com/abdubakr77/deepcsv
6
+ Author: Abdullah Bakr
7
+ Author-email: abdubakora1232@gmail.com
8
+ Project-URL: Source, https://github.com/abdubakr77/deepcsv
9
+ Project-URL: Tracker, https://github.com/abdubakr77/deepcsv/issues
10
+ Requires-Python: >=3.7
11
+ Description-Content-Type: text/markdown
12
+ License-File: LICENSE
13
+ Requires-Dist: pandas
14
+ Requires-Dist: pyarrow
15
+ Dynamic: author
16
+ Dynamic: author-email
17
+ Dynamic: description
18
+ Dynamic: description-content-type
19
+ Dynamic: home-page
20
+ Dynamic: license-file
21
+ Dynamic: project-url
22
+ Dynamic: requires-dist
23
+ Dynamic: requires-python
24
+ Dynamic: summary
25
+
26
+ # deepcsv
27
+
28
+ Stop losing your data types when working with CSV files.
29
+ deepcsv automatically cleans messy CSV/XLSX data and converts it into ML-ready Parquet format.
30
+
31
+ ## Installation
32
+ ```bash
33
+ pip install deepcsv
34
+ ```
35
+
36
+ ## Example
37
+
38
+ ### Before
39
+ ```python
40
+ # CSV column value
41
+ "['a', 'b', 'c']"
42
+ ```
43
+
44
+ ### After
45
+ ```python
46
+ # Automatically converted
47
+ ['a', 'b', 'c']
48
+ ```
49
+
50
+ ### Usage
51
+ ```python
52
+ import deepcsv
53
+
54
+ df = deepcsv.ConvertListStrToList("path/to/file.csv")
55
+ ```
56
+
57
+ ---
58
+
59
+ ## What it does
60
+
61
+ - Walks through all folders and subfolders automatically
62
+ - Finds every CSV and XLSX file
63
+ - Detects columns that contain list strings like `"['item1', 'item2']"` and converts them into real Python arrays for faster performance
64
+ - Detects columns with mixed data types and tries to fix them automatically
65
+ - Warns you when a column has mixed types so you know what was changed
66
+ - Saves the results as Parquet files to preserve the converted data types
67
+
68
+ ---
69
+
70
+ ## Why Parquet?
71
+
72
+ CSV files cannot store arrays or preserve data types.
73
+ Parquet solves this by keeping the exact types after conversion and is much faster for data processing workflows.
74
+
75
+ ---
76
+
77
+ ## Why arrays instead of Python lists?
78
+
79
+ Arrays are significantly faster for numerical operations and machine learning workflows, especially when working with large datasets.
80
+
81
+ ---
82
+
83
+ ## Functions
84
+
85
+ ### `ConvertListStrToList(file_path)`
86
+
87
+ Reads a CSV file, converts list strings to arrays, fixes mixed-type columns, and returns a clean DataFrame.
88
+
89
+ ```python
90
+ import deepcsv
91
+
92
+ df = deepcsv.ConvertListStrToList("path/to/file.csv")
93
+ ```
94
+
95
+ ---
96
+
97
+ ### `ReadAllCSVData(path)`
98
+
99
+ Walks through all folders and subfolders, applies `ConvertListStrToList` on every CSV and XLSX file, and saves the results as Parquet files in a new folder called `All CSV Data is Converted Here`.
100
+
101
+ ```python
102
+ import deepcsv
103
+
104
+ deepcsv.ReadAllCSVData("path/to/folder")
105
+ ```
106
+
107
+ ---
108
+
109
+ ## Notes
110
+
111
+ - Only files that contain list string columns are saved as Parquet
112
+ - Mixed-type columns are converted to float automatically when possible
113
+ - Skips NaN values without breaking
114
+ - Requires `pyarrow` for Parquet support
115
+
116
+ ---
117
+
118
+ ## Requirements
119
+
120
+ - Python >= 3.7
121
+ - pandas
122
+ - pyarrow
@@ -0,0 +1,2 @@
1
+ pandas
2
+ pyarrow
deepcsv-0.4.0/setup.py ADDED
@@ -0,0 +1,19 @@
1
+ from setuptools import setup, find_packages
2
+
3
+ setup(
4
+ name="deepcsv",
5
+ version="0.4.0",
6
+ author="Abdullah Bakr",
7
+ author_email="abdubakora1232@gmail.com",
8
+ description="Automatically walks through folders and subfolders, finds all CSV and XLSX files, detects and fixes data issues, and saves the results as Parquet files while keeping the exact same folder structure.",
9
+ long_description=open("README.md", encoding="utf-8").read(),
10
+ long_description_content_type="text/markdown",
11
+ packages=find_packages(),
12
+ install_requires=["pandas", "pyarrow"],
13
+ python_requires=">=3.7",
14
+ url="https://github.com/abdubakr77/deepcsv",
15
+ project_urls={
16
+ "Source": "https://github.com/abdubakr77/deepcsv",
17
+ "Tracker": "https://github.com/abdubakr77/deepcsv/issues",
18
+ },
19
+ )
deepcsv-0.2.0/PKG-INFO DELETED
@@ -1,57 +0,0 @@
1
- Metadata-Version: 2.4
2
- Name: deepcsv
3
- Version: 0.2.0
4
- Summary: Automatically walks folders and converts list strings to lists in CSV/XLSX files
5
- Author: Abdullah Bakr
6
- Requires-Python: >=3.7
7
- Description-Content-Type: text/markdown
8
- License-File: LICENSE
9
- Requires-Dist: pandas
10
- Dynamic: author
11
- Dynamic: description
12
- Dynamic: description-content-type
13
- Dynamic: license-file
14
- Dynamic: requires-dist
15
- Dynamic: requires-python
16
- Dynamic: summary
17
-
18
- # deepcsv
19
-
20
- A Python library that automatically walks through folders and subfolders, finds all CSV and XLSX files, converts list strings into real Python lists, and saves the results in a new folder while keeping the exact same folder structure.
21
-
22
- ## Installation
23
- ```bash
24
- pip install deepcsv
25
- ```
26
-
27
- ## Functions
28
-
29
- ### `ReadAllCSVData(path)`
30
- Walks through all folders and subfolders, finds every CSV and XLSX file, converts list strings to real lists, and saves everything in a new folder called `All CSV Data is Converted Here` with the same structure.
31
- ```python
32
- import deepcsv
33
-
34
- deepcsv.ReadAllCSVData("C:/Users/Data")
35
- ```
36
-
37
- ### `ConvertListStrToList(df)`
38
- Takes a single DataFrame and converts any column that contains list strings into real Python lists. Skips `NaN` values automatically.
39
- ```python
40
- import deepcsv
41
- import pandas as pd
42
-
43
- df = pd.read_csv("file.csv")
44
- df_converted = deepcsv.ConvertListStrToList(df)
45
- ```
46
-
47
- ## Notes
48
-
49
- - Supports `.csv` and `.xlsx` files
50
- - Skips `NaN` values without breaking
51
- - Keeps the exact folder structure in the output folder
52
- - Works on any level of nested folders
53
-
54
- ## Requirements
55
-
56
- - Python >= 3.7
57
- - pandas
deepcsv-0.2.0/README.md DELETED
@@ -1,40 +0,0 @@
1
- # deepcsv
2
-
3
- A Python library that automatically walks through folders and subfolders, finds all CSV and XLSX files, converts list strings into real Python lists, and saves the results in a new folder while keeping the exact same folder structure.
4
-
5
- ## Installation
6
- ```bash
7
- pip install deepcsv
8
- ```
9
-
10
- ## Functions
11
-
12
- ### `ReadAllCSVData(path)`
13
- Walks through all folders and subfolders, finds every CSV and XLSX file, converts list strings to real lists, and saves everything in a new folder called `All CSV Data is Converted Here` with the same structure.
14
- ```python
15
- import deepcsv
16
-
17
- deepcsv.ReadAllCSVData("C:/Users/Data")
18
- ```
19
-
20
- ### `ConvertListStrToList(df)`
21
- Takes a single DataFrame and converts any column that contains list strings into real Python lists. Skips `NaN` values automatically.
22
- ```python
23
- import deepcsv
24
- import pandas as pd
25
-
26
- df = pd.read_csv("file.csv")
27
- df_converted = deepcsv.ConvertListStrToList(df)
28
- ```
29
-
30
- ## Notes
31
-
32
- - Supports `.csv` and `.xlsx` files
33
- - Skips `NaN` values without breaking
34
- - Keeps the exact folder structure in the output folder
35
- - Works on any level of nested folders
36
-
37
- ## Requirements
38
-
39
- - Python >= 3.7
40
- - pandas
@@ -1,49 +0,0 @@
1
- import os
2
- import pandas as pd
3
- from ast import literal_eval
4
-
5
- def ConvertListStrToList(data):
6
- Data_Col = data.columns
7
- for ColName in Data_Col:
8
- if type(data[ColName][0]) == str and data[ColName][0].startswith("["):
9
- data[f"{ColName.capitalize()}List"] = data[ColName].apply(lambda x : literal_eval(x) if pd.notna(x) else x )
10
- data.drop(ColName,inplace=True,axis=1)
11
-
12
- return data
13
-
14
-
15
- def ReadAllCSVData(WorkDirectoryPath):
16
-
17
- base_output = os.path.join(WorkDirectoryPath, "All CSV Data is Converted Here")
18
- all_folders = [WorkDirectoryPath]
19
-
20
- os.makedirs(base_output,exist_ok=True)
21
-
22
- while True:
23
-
24
- if all_folders:
25
-
26
- Curr_Path = all_folders.pop(0)
27
-
28
- for item_name in os.listdir(Curr_Path):
29
-
30
- Sub_Item_Path = os.path.join(Curr_Path,item_name)
31
-
32
- if os.path.isfile(Sub_Item_Path) and (Sub_Item_Path.endswith(".csv") or Sub_Item_Path.endswith(".xlsx")):
33
- print(Sub_Item_Path)
34
- df = pd.read_csv(Sub_Item_Path)
35
- df_converted = ConvertListStrToList(df)
36
-
37
- rel_path = os.path.relpath(Sub_Item_Path, WorkDirectoryPath)
38
- output = os.path.join(base_output,rel_path)
39
- os.makedirs(os.path.dirname(output),exist_ok=True)
40
-
41
- df_converted.to_csv(output)
42
-
43
-
44
- elif os.path.isdir(Sub_Item_Path):
45
-
46
- all_folders.append(Sub_Item_Path)
47
-
48
- else:
49
- break
@@ -1,57 +0,0 @@
1
- Metadata-Version: 2.4
2
- Name: deepcsv
3
- Version: 0.2.0
4
- Summary: Automatically walks folders and converts list strings to lists in CSV/XLSX files
5
- Author: Abdullah Bakr
6
- Requires-Python: >=3.7
7
- Description-Content-Type: text/markdown
8
- License-File: LICENSE
9
- Requires-Dist: pandas
10
- Dynamic: author
11
- Dynamic: description
12
- Dynamic: description-content-type
13
- Dynamic: license-file
14
- Dynamic: requires-dist
15
- Dynamic: requires-python
16
- Dynamic: summary
17
-
18
- # deepcsv
19
-
20
- A Python library that automatically walks through folders and subfolders, finds all CSV and XLSX files, converts list strings into real Python lists, and saves the results in a new folder while keeping the exact same folder structure.
21
-
22
- ## Installation
23
- ```bash
24
- pip install deepcsv
25
- ```
26
-
27
- ## Functions
28
-
29
- ### `ReadAllCSVData(path)`
30
- Walks through all folders and subfolders, finds every CSV and XLSX file, converts list strings to real lists, and saves everything in a new folder called `All CSV Data is Converted Here` with the same structure.
31
- ```python
32
- import deepcsv
33
-
34
- deepcsv.ReadAllCSVData("C:/Users/Data")
35
- ```
36
-
37
- ### `ConvertListStrToList(df)`
38
- Takes a single DataFrame and converts any column that contains list strings into real Python lists. Skips `NaN` values automatically.
39
- ```python
40
- import deepcsv
41
- import pandas as pd
42
-
43
- df = pd.read_csv("file.csv")
44
- df_converted = deepcsv.ConvertListStrToList(df)
45
- ```
46
-
47
- ## Notes
48
-
49
- - Supports `.csv` and `.xlsx` files
50
- - Skips `NaN` values without breaking
51
- - Keeps the exact folder structure in the output folder
52
- - Works on any level of nested folders
53
-
54
- ## Requirements
55
-
56
- - Python >= 3.7
57
- - pandas
@@ -1 +0,0 @@
1
- pandas
deepcsv-0.2.0/setup.py DELETED
@@ -1,13 +0,0 @@
1
- from setuptools import setup, find_packages
2
-
3
- setup(
4
- name="deepcsv",
5
- version="0.2.0",
6
- author="Abdullah Bakr",
7
- description="Automatically walks folders and converts list strings to lists in CSV/XLSX files",
8
- long_description=open("README.md").read(),
9
- long_description_content_type="text/markdown",
10
- packages=find_packages(),
11
- install_requires=["pandas"],
12
- python_requires=">=3.7",
13
- )
File without changes
File without changes
File without changes