deepcsv 0.3.0__tar.gz → 0.5.0__tar.gz

deepcsv-0.5.0/PKG-INFO ADDED
@@ -0,0 +1,128 @@
1
+ Metadata-Version: 2.4
2
+ Name: deepcsv
3
+ Version: 0.5.0
4
+ Summary: Automatically processes data files in directories, converts array-like strings to NumPy arrays, detects and fixes data type issues, and saves results as optimized Parquet files.
5
+ Home-page: https://github.com/abdubakr77/deepcsv
6
+ Author: Abdullah Bakr
7
+ Author-email: abdubakora1232@gmail.com
8
+ License: MIT
9
+ Project-URL: Source, https://github.com/abdubakr77/deepcsv
10
+ Project-URL: Tracker, https://github.com/abdubakr77/deepcsv/issues
11
+ Keywords: data-processing pandas numpy etl data-cleaning file-conversion automation parquet bulk-conversion
12
+ Classifier: Development Status :: 5 - Production/Stable
13
+ Classifier: Intended Audience :: Developers
14
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
15
+ Classifier: Topic :: Scientific/Engineering :: Information Analysis
16
+ Classifier: License :: OSI Approved :: MIT License
17
+ Classifier: Programming Language :: Python :: 3
18
+ Requires-Python: >=3.7
19
+ Description-Content-Type: text/markdown
20
+ License-File: LICENSE
21
+ Requires-Dist: pandas
22
+ Requires-Dist: pyarrow
23
+ Dynamic: author
24
+ Dynamic: author-email
25
+ Dynamic: classifier
26
+ Dynamic: description
27
+ Dynamic: description-content-type
28
+ Dynamic: home-page
29
+ Dynamic: keywords
30
+ Dynamic: license
31
+ Dynamic: license-file
32
+ Dynamic: project-url
33
+ Dynamic: requires-dist
34
+ Dynamic: requires-python
35
+ Dynamic: summary
36
+
37
+ # deepcsv (v0.5.0)
38
+
39
+ Ever loaded a CSV file and found your carefully structured lists turned into useless strings?
40
+ ````python
41
+ "['Action', 'Sci-Fi', 'Thriller']" # This is a string, not a list
42
+ ````
43
+
44
+ deepcsv fixes this automatically.
45
+
46
+ ---
47
+
48
+ ## The Solution
49
+
50
+ `deepcsv` handles these cases automatically:
51
+ - Reads CSV/XLSX files or existing DataFrames
52
+ - Converts any string value starting with `[` into real NumPy arrays (fast and lightweight)
53
+ - Detects and fixes mixed-type columns by safely converting them to numeric (float)
54
+ - Recursively processes all CSV/XLSX files in subdirectories
55
+ - Saves results in Parquet format to preserve types and speed up analysis
56
+
57
+ ---
58
+
59
+ ## Installation
60
+ ```bash
61
+ pip install deepcsv
62
+ ```
63
+
64
+ ## Usage
65
+
66
+ ### Single file processing (process_file)
67
+ ```python
68
+ import deepcsv
69
+
70
+ df = deepcsv.process_file('path/to/file.csv')
71
+ ```
72
+ - Accepts `str` (file path) or `pd.DataFrame`
73
+ - Returns `pd.DataFrame` with columns converted to arrays
74
+
75
+ ### Batch directory processing (process_all_files)
76
+ ```python
77
+ import deepcsv
78
+
79
+ deepcsv.process_all_files('path/to/folder')
80
+ ```
81
+ - Processes all `.csv` and `.xlsx` files recursively
82
+ - Saves converted files as Parquet in: `All CSV Files is Converted Here`
83
+
84
+ ---
85
+
86
+ ## What it does
87
+
88
+ - Auto-detects files in directory and subdirectories
89
+ - Converts values like:
90
+ - `"['item1', 'item2']"` → `array(['item1', 'item2'])` (NumPy array)
91
+ - Mixed numeric/string columns → single numeric type (float)
92
+ - Handles NaN values without breaking
93
+ - Stores results in Parquet format for type safety and performance
94
+
95
+ ---
96
+
97
+ ## Function Signatures
98
+
99
+ - `process_file(data_input: Union[str, pd.DataFrame]) -> pd.DataFrame`
100
+ - `process_all_files(directory_path: str) -> None`
101
+
102
+ Output arrays are NumPy arrays for optimal performance in machine learning workflows.
103
+
104
+ ---
105
+
106
+ ## Key Features
107
+
108
+ - Fast NumPy array conversion instead of slow Python lists
109
+ - Mixed-type detection with automatic fixes
110
+ - Parquet storage for data integrity
111
+ - Recursive directory traversal
112
+ - Warning messages for transparency
113
+
114
+ ---
115
+
116
+ ## Notes
117
+
118
+ - Requires `pyarrow` for Parquet support
119
+ - Only saves files that contain converted array columns
120
+
121
+ ---
122
+
123
+ ## Requirements
124
+
125
+ - Python >= 3.7
126
+ - pandas
127
+ - pyarrow
128
+
@@ -0,0 +1,92 @@
1
+ # deepcsv (v0.5.0)
2
+
3
+ Ever loaded a CSV file and found your carefully structured lists turned into useless strings?
4
+ ````python
5
+ "['Action', 'Sci-Fi', 'Thriller']" # This is a string, not a list
6
+ ````
7
+
8
+ deepcsv fixes this automatically.
9
+
10
+ ---
11
+
12
+ ## The Solution
13
+
14
+ `deepcsv` handles these cases automatically:
15
+ - Reads CSV/XLSX files or existing DataFrames
16
+ - Converts any string value starting with `[` into real NumPy arrays (fast and lightweight)
17
+ - Detects and fixes mixed-type columns by safely converting them to numeric (float)
18
+ - Recursively processes all CSV/XLSX files in subdirectories
19
+ - Saves results in Parquet format to preserve types and speed up analysis
20
+
21
+ ---
22
+
23
+ ## Installation
24
+ ```bash
25
+ pip install deepcsv
26
+ ```
27
+
28
+ ## Usage
29
+
30
+ ### Single file processing (process_file)
31
+ ```python
32
+ import deepcsv
33
+
34
+ df = deepcsv.process_file('path/to/file.csv')
35
+ ```
36
+ - Accepts `str` (file path) or `pd.DataFrame`
37
+ - Returns `pd.DataFrame` with columns converted to arrays
38
+
39
+ ### Batch directory processing (process_all_files)
40
+ ```python
41
+ import deepcsv
42
+
43
+ deepcsv.process_all_files('path/to/folder')
44
+ ```
45
+ - Processes all `.csv` and `.xlsx` files recursively
46
+ - Saves converted files as Parquet in: `All CSV Files is Converted Here`
47
+
48
+ ---
49
+
50
+ ## What it does
51
+
52
+ - Auto-detects files in directory and subdirectories
53
+ - Converts values like:
54
+ - `"['item1', 'item2']"` → `array(['item1', 'item2'])` (NumPy array)
55
+ - Mixed numeric/string columns → single numeric type (float)
56
+ - Handles NaN values without breaking
57
+ - Stores results in Parquet format for type safety and performance
58
+
59
+ ---
60
+
61
+ ## Function Signatures
62
+
63
+ - `process_file(data_input: Union[str, pd.DataFrame]) -> pd.DataFrame`
64
+ - `process_all_files(directory_path: str) -> None`
65
+
66
+ Output arrays are NumPy arrays for optimal performance in machine learning workflows.
67
+
68
+ ---
69
+
70
+ ## Key Features
71
+
72
+ - Fast NumPy array conversion instead of slow Python lists
73
+ - Mixed-type detection with automatic fixes
74
+ - Parquet storage for data integrity
75
+ - Recursive directory traversal
76
+ - Warning messages for transparency
77
+
78
+ ---
79
+
80
+ ## Notes
81
+
82
+ - Requires `pyarrow` for Parquet support
83
+ - Only saves files that contain converted array columns
84
+
85
+ ---
86
+
87
+ ## Requirements
88
+
89
+ - Python >= 3.7
90
+ - pandas
91
+ - pyarrow
92
+
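The README above describes turning cells such as `"['Action', 'Sci-Fi', 'Thriller']"` back into real sequences. A minimal sketch of that step follows, assuming the cell holds a valid Python list literal. The helper name `parse_list_cell` is hypothetical and not part of deepcsv. Note that `numpy.array` applied directly to a string wraps it in a zero-dimensional array rather than parsing it, so the literal has to be evaluated first; the resulting list can then be passed to `numpy.array`.

```python
import ast
from typing import Any, List, Optional


def parse_list_cell(cell: Optional[str]) -> Optional[List[Any]]:
    """Turn a cell such as "['a', 'b']" into a list; pass missing values through.

    Hypothetical helper for illustration only, not deepcsv's API.
    """
    if cell is None:
        return None
    text = cell.strip()
    if not text.startswith("["):
        raise ValueError(f"not a list literal: {cell!r}")
    # literal_eval only accepts Python literals, so code embedded in a cell is rejected.
    return ast.literal_eval(text)


genres = parse_list_cell("['Action', 'Sci-Fi', 'Thriller']")
print(genres)
```

Using `ast.literal_eval` instead of `eval` keeps untrusted CSV content from executing arbitrary expressions.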
@@ -0,0 +1,15 @@
1
+ from .deepcsv import process_all_files, process_file
2
+ from importlib.metadata import version
3
+ import requests
4
+
5
+ def _check_for_updates():
6
+     try:
7
+         response = requests.get("https://pypi.org/pypi/deepcsv/json", timeout=5)
8
+         latest = response.json()["info"]["version"]
9
+         current = version("deepcsv")
10
+         if latest != current:
11
+             print(f"DeepCSV: New version {latest} available! — run 'pip install -U deepcsv'")
12
+     except Exception:
13
+         pass
14
+
15
+ _check_for_updates()
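The hunk above imports `requests` at module level, while the metadata in this diff lists only `pandas` and `pyarrow` under `Requires-Dist`. As a minimal sketch of the same check using only the standard library, the following assumes the PyPI JSON endpoint shown in the hunk; the names `fetch_pypi_json` and `newer_version_notice` are hypothetical and not part of deepcsv.

```python
import json
from typing import Callable, Optional
from urllib.request import urlopen

PYPI_URL = "https://pypi.org/pypi/deepcsv/json"


def fetch_pypi_json(url: str = PYPI_URL, timeout: float = 3.0) -> str:
    """Download the package metadata as text (network access required)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def newer_version_notice(
    current: str, fetch: Callable[[], str] = fetch_pypi_json
) -> Optional[str]:
    """Return an upgrade hint when the published version differs, else None."""
    try:
        latest = json.loads(fetch())["info"]["version"]
    except Exception:
        # Offline or malformed response: stay silent rather than break the import.
        return None
    if latest != current:
        return f"DeepCSV: New version {latest} available! Run 'pip install -U deepcsv'"
    return None
```

Passing the fetch function as a parameter keeps the comparison logic testable without network access, and returning the message instead of printing it leaves the decision to notify with the caller.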
@@ -0,0 +1,126 @@
1
+ import pyarrow
2
+ import pandas as pd
3
+ from numpy import nan,array
4
+ from os import listdir,makedirs
5
+ from os.path import join,relpath,dirname,isfile,isdir
6
+ from warnings import filterwarnings
7
+ from typing import Union
8
+ filterwarnings("ignore")
9
+
10
+ def process_file(data_input: Union[str, pd.DataFrame]) -> pd.DataFrame:
11
+     """
12
+     Parses string representations of lists in DataFrame columns to actual NumPy arrays.
13
+
14
+     (data_input: Union[str, pd.DataFrame]) -> pd.DataFrame
15
+
16
+     This function reads a CSV file or takes a DataFrame, scans each column for string values
17
+     starting with '[', and converts them to NumPy arrays for fast and lightweight processing.
18
+     It also detects mixed data types in columns and attempts to convert them to numeric (float) for consistency.
19
+
20
+     Parameters:
21
+         data_input (Union[str, pd.DataFrame]): Path to the CSV file or the DataFrame itself.
22
+
23
+     Returns:
24
+         pd.DataFrame: The processed DataFrame with list strings converted to NumPy arrays, mixed types fixed with a warning message, and original columns renamed if arrays are extracted.
25
+
26
+     Examples:
27
+         ### Process a CSV file
28
+         df = process_file('path/to/file.csv')
29
+
30
+         ### Process an existing DataFrame
31
+         df = process_file(my_dataframe)
32
+     """
33
+     from ast import literal_eval  # local import: parses the list literal before it is wrapped in an array
34
+     print(f"Converting Now On {data_input}")
35
+     print("-"*50)
36
+
37
+     if isinstance(data_input, pd.DataFrame):
38
+         data = data_input
39
+     else:
40
+         # Anything that is not a DataFrame is treated as a path to a CSV file
41
+         data = pd.read_csv(data_input)
42
+
43
+     for ColName in data.columns:
44
+
45
+         First_Value = data[ColName].iloc[0]
46
+
47
+         if len(data[ColName].apply(type).unique()) >= 2:
48
+
49
+
50
+             sample = (data[data[ColName].apply(type) == str][ColName].head(2)).values
51
+
52
+             if len(sample) > 0 and isinstance(sample[0],str) and sample[0][0].strip().isnumeric():
53
+
54
+                 print(f"WARNING:\nThis Dataset Name ({data_input.split("\\")[-1]}) Found {len(data[ColName].apply(type).unique())} Mixed DataType in a column called ({ColName})\nPath : {data_input}")
55
+                 print(f"System : This column has these types: {data[ColName].apply(type).unique()}")
56
+                 print(f"System : Trying to convert the column to float so it has only one datatype...")
57
+
58
+                 data[ColName] = pd.to_numeric(data[ColName], errors='coerce')
59
+                 print("System : Done!")
60
+
61
+         elif isinstance(First_Value , str) and First_Value.strip().startswith("["):
62
+
63
+             data[f"{ColName.capitalize()}List"] = data[ColName].apply(lambda x : array(literal_eval(x)) if pd.notna(x) else nan)
64
+             data.drop(ColName,inplace=True,axis=1)
65
+
66
+     return data
67
+
68
+
69
+ def process_all_files(directory_path: str) -> None:
70
+     """
71
+     Recursively processes all CSV and XLSX files in a directory, converts array strings to NumPy arrays,
72
+     and saves as Parquet files if array columns are present.
73
+
74
+     (directory_path: str) -> None
75
+
76
+     This function traverses the directory tree starting from directory_path, finds all .csv and .xlsx files,
77
+     applies process_file to each for array conversion and type fixing, and saves the result as .parquet files
78
+     in a subdirectory 'All CSV Files is Converted Here' only if the DataFrame contains new array columns.
79
+
80
+     Parameters:
81
+         directory_path (str): The root directory path to search for CSV/XLSX files.
82
+
83
+     Returns:
84
+         None: Files are saved to a subdirectory 'All CSV Files is Converted Here'.
85
+
86
+     Examples:
87
+         # Process all CSV files in a directory
88
+         process_all_files('/path/to/directory')
89
+     """
90
+
91
+     base_output = join(directory_path, "All CSV Files is Converted Here")
92
+     all_folders = [directory_path]
93
+
94
+     makedirs(base_output,exist_ok=True)
95
+
96
+     while True:
97
+
98
+         if all_folders:
99
+
100
+             Curr_Path = all_folders.pop(0)
101
+
102
+             for item_name in listdir(Curr_Path):
103
+
104
+                 Sub_Item_Path = join(Curr_Path,item_name)
105
+
106
+                 if isfile(Sub_Item_Path) and (Sub_Item_Path.endswith(".csv") or Sub_Item_Path.endswith(".xlsx")):
107
+
108
+                     df_converted = process_file(Sub_Item_Path)
109
+                     df_converted.reset_index(drop=True,inplace=True)
110
+
111
+                     rel_path = relpath(Sub_Item_Path, directory_path)
112
+                     output = join(base_output,rel_path)
113
+                     if "List" in df_converted.columns[-1]:
114
+                         print(Sub_Item_Path)
115
+                         makedirs(dirname(output),exist_ok=True)
116
+                         df_converted.to_parquet(output.rsplit(".", 1)[0] + ".parquet")
117
+
118
+                     print("-"*50)
119
+
120
+                 elif isdir(Sub_Item_Path):
121
+
122
+                     all_folders.append(Sub_Item_Path)
123
+
124
+         else:
125
+             break
126
+
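The traversal in `process_all_files` walks the tree with an explicit queue and mirrors each source path under the output folder. As a rough sketch of the same idea using `os.walk`, the following only plans the source-to-output mapping and does not read or write any data. The function name `plan_outputs` is hypothetical and not part of deepcsv, and pruning the output folder from the walk is a design choice of this sketch.

```python
import os
from typing import Iterator, Tuple

OUTPUT_DIR_NAME = "All CSV Files is Converted Here"


def plan_outputs(root: str) -> Iterator[Tuple[str, str]]:
    """Yield (source_path, parquet_path) for every .csv/.xlsx file under root."""
    out_root = os.path.join(root, OUTPUT_DIR_NAME)
    for folder, subdirs, files in os.walk(root):
        # Do not descend into the output folder itself.
        subdirs[:] = [d for d in subdirs if os.path.join(folder, d) != out_root]
        for name in sorted(files):
            stem, ext = os.path.splitext(name)
            if ext.lower() not in (".csv", ".xlsx"):
                continue
            # Mirror the folder structure relative to root under the output folder.
            rel_dir = os.path.relpath(folder, root)
            target = os.path.normpath(os.path.join(out_root, rel_dir, stem + ".parquet"))
            yield os.path.join(folder, name), target
```

Splitting the extension with `os.path.splitext` gives both `.csv` and `.xlsx` inputs a `.parquet` output name.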
@@ -0,0 +1,128 @@
1
+ Metadata-Version: 2.4
2
+ Name: deepcsv
3
+ Version: 0.5.0
4
+ Summary: Automatically processes data files in directories, converts array-like strings to NumPy arrays, detects and fixes data type issues, and saves results as optimized Parquet files.
5
+ Home-page: https://github.com/abdubakr77/deepcsv
6
+ Author: Abdullah Bakr
7
+ Author-email: abdubakora1232@gmail.com
8
+ License: MIT
9
+ Project-URL: Source, https://github.com/abdubakr77/deepcsv
10
+ Project-URL: Tracker, https://github.com/abdubakr77/deepcsv/issues
11
+ Keywords: data-processing pandas numpy etl data-cleaning file-conversion automation parquet bulk-conversion
12
+ Classifier: Development Status :: 5 - Production/Stable
13
+ Classifier: Intended Audience :: Developers
14
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
15
+ Classifier: Topic :: Scientific/Engineering :: Information Analysis
16
+ Classifier: License :: OSI Approved :: MIT License
17
+ Classifier: Programming Language :: Python :: 3
18
+ Requires-Python: >=3.7
19
+ Description-Content-Type: text/markdown
20
+ License-File: LICENSE
21
+ Requires-Dist: pandas
22
+ Requires-Dist: pyarrow
23
+ Dynamic: author
24
+ Dynamic: author-email
25
+ Dynamic: classifier
26
+ Dynamic: description
27
+ Dynamic: description-content-type
28
+ Dynamic: home-page
29
+ Dynamic: keywords
30
+ Dynamic: license
31
+ Dynamic: license-file
32
+ Dynamic: project-url
33
+ Dynamic: requires-dist
34
+ Dynamic: requires-python
35
+ Dynamic: summary
36
+
37
+ # deepcsv (v0.5.0)
38
+
39
+ Ever loaded a CSV file and found your carefully structured lists turned into useless strings?
40
+ ````python
41
+ "['Action', 'Sci-Fi', 'Thriller']" # This is a string, not a list
42
+ ````
43
+
44
+ deepcsv fixes this automatically.
45
+
46
+ ---
47
+
48
+ ## The Solution
49
+
50
+ `deepcsv` handles these cases automatically:
51
+ - Reads CSV/XLSX files or existing DataFrames
52
+ - Converts any string value starting with `[` into real NumPy arrays (fast and lightweight)
53
+ - Detects and fixes mixed-type columns by safely converting them to numeric (float)
54
+ - Recursively processes all CSV/XLSX files in subdirectories
55
+ - Saves results in Parquet format to preserve types and speed up analysis
56
+
57
+ ---
58
+
59
+ ## Installation
60
+ ```bash
61
+ pip install deepcsv
62
+ ```
63
+
64
+ ## Usage
65
+
66
+ ### Single file processing (process_file)
67
+ ```python
68
+ import deepcsv
69
+
70
+ df = deepcsv.process_file('path/to/file.csv')
71
+ ```
72
+ - Accepts `str` (file path) or `pd.DataFrame`
73
+ - Returns `pd.DataFrame` with columns converted to arrays
74
+
75
+ ### Batch directory processing (process_all_files)
76
+ ```python
77
+ import deepcsv
78
+
79
+ deepcsv.process_all_files('path/to/folder')
80
+ ```
81
+ - Processes all `.csv` and `.xlsx` files recursively
82
+ - Saves converted files as Parquet in: `All CSV Files is Converted Here`
83
+
84
+ ---
85
+
86
+ ## What it does
87
+
88
+ - Auto-detects files in directory and subdirectories
89
+ - Converts values like:
90
+ - `"['item1', 'item2']"` → `array(['item1', 'item2'])` (NumPy array)
91
+ - Mixed numeric/string columns → single numeric type (float)
92
+ - Handles NaN values without breaking
93
+ - Stores results in Parquet format for type safety and performance
94
+
95
+ ---
96
+
97
+ ## Function Signatures
98
+
99
+ - `process_file(data_input: Union[str, pd.DataFrame]) -> pd.DataFrame`
100
+ - `process_all_files(directory_path: str) -> None`
101
+
102
+ Output arrays are NumPy arrays for optimal performance in machine learning workflows.
103
+
104
+ ---
105
+
106
+ ## Key Features
107
+
108
+ - Fast NumPy array conversion instead of slow Python lists
109
+ - Mixed-type detection with automatic fixes
110
+ - Parquet storage for data integrity
111
+ - Recursive directory traversal
112
+ - Warning messages for transparency
113
+
114
+ ---
115
+
116
+ ## Notes
117
+
118
+ - Requires `pyarrow` for Parquet support
119
+ - Only saves files that contain converted array columns
120
+
121
+ ---
122
+
123
+ ## Requirements
124
+
125
+ - Python >= 3.7
126
+ - pandas
127
+ - pyarrow
128
+
deepcsv-0.5.0/setup.py ADDED
@@ -0,0 +1,35 @@
1
+ from setuptools import setup, find_packages
2
+ from pathlib import Path
3
+
4
+ this_directory = Path(__file__).parent
5
+
6
+ readme = (this_directory / "README.md").read_text(encoding="utf-8")
7
+
8
+
9
+ setup(
10
+     name="deepcsv",
11
+     version="0.5.0",
12
+     author="Abdullah Bakr",
13
+     author_email="abdubakora1232@gmail.com",
14
+     description="Automatically processes data files in directories, converts array-like strings to NumPy arrays, detects and fixes data type issues, and saves results as optimized Parquet files.",
15
+     long_description=readme,
16
+     long_description_content_type="text/markdown",
17
+     packages=find_packages(),
18
+     install_requires=["pandas", "pyarrow"],
19
+     python_requires=">=3.7",
20
+     url="https://github.com/abdubakr77/deepcsv",
21
+     license="MIT",
22
+     keywords="data-processing pandas numpy etl data-cleaning file-conversion automation parquet bulk-conversion",
23
+     classifiers=[
24
+         "Development Status :: 5 - Production/Stable",
25
+         "Intended Audience :: Developers",
26
+         "Topic :: Software Development :: Libraries :: Python Modules",
27
+         "Topic :: Scientific/Engineering :: Information Analysis",
28
+         "License :: OSI Approved :: MIT License",
29
+         "Programming Language :: Python :: 3",
30
+     ],
31
+     project_urls={
32
+         "Source": "https://github.com/abdubakr77/deepcsv",
33
+         "Tracker": "https://github.com/abdubakr77/deepcsv/issues",
34
+     },
35
+ )
deepcsv-0.3.0/PKG-INFO DELETED
@@ -1,74 +0,0 @@
1
- Metadata-Version: 2.4
2
- Name: deepcsv
3
- Version: 0.3.0
4
- Summary: Automatically walks through folders and subfolders, finds all CSV and XLSX files, detects and fixes data issues, and saves the results as Parquet files while keeping the exact same folder structure.
5
- Author: Abdullah Bakr
6
- Requires-Python: >=3.7
7
- Description-Content-Type: text/markdown
8
- License-File: LICENSE
9
- Requires-Dist: pandas
10
- Requires-Dist: pyarrow
11
- Dynamic: author
12
- Dynamic: description
13
- Dynamic: description-content-type
14
- Dynamic: license-file
15
- Dynamic: requires-dist
16
- Dynamic: requires-python
17
- Dynamic: summary
18
-
19
- # deepcsv
20
-
21
- A Python library that automatically walks through folders and subfolders, finds all CSV and XLSX files, detects and fixes data issues, and saves the results as Parquet files while keeping the exact same folder structure.
22
-
23
- ## Installation
24
- ```bash
25
- pip install deepcsv
26
- ```
27
-
28
- ## What it does
29
-
30
- - Walks through all folders and subfolders automatically
31
- - Finds every CSV and XLSX file
32
- - Detects columns that contain list strings like `"['item1', 'item2']"` and converts them into real Python arrays for faster performance
33
- - Detects columns with mixed data types and tries to fix them automatically
34
- - Warns you when a column has mixed types so you know what was changed
35
- - Saves the results as Parquet files to preserve the converted data types
36
-
37
- > **Why Parquet?**
38
- > CSV files cannot store arrays or preserve data types. Parquet solves this by keeping the exact types after conversion.
39
-
40
- > **Why arrays instead of Python lists?**
41
- > Arrays are significantly faster for numerical operations and machine learning workflows.
42
-
43
- ## Functions
44
-
45
- ### `ConvertListStrToList(file_path)`
46
-
47
- Reads a CSV file, converts list strings to arrays, fixes mixed-type columns, and returns a clean DataFrame.
48
- ```python
49
- import deepcsv
50
-
51
- df = deepcsv.ConvertListStrToList("path/to/file.csv")
52
- ```
53
-
54
- ### `ReadAllCSVData(path)`
55
-
56
- Walks through all folders and subfolders, applies `ConvertListStrToList` on every CSV and XLSX file, and saves the results as Parquet files in a new folder called `All CSV Data is Converted Here`.
57
- ```python
58
- import deepcsv
59
-
60
- deepcsv.ReadAllCSVData("path/to/folder")
61
- ```
62
-
63
- ## Notes
64
-
65
- - Only files that contain list string columns are saved as Parquet
66
- - Mixed-type columns are converted to float automatically when possible
67
- - Skips NaN values without breaking
68
- - Requires `pyarrow` for Parquet support
69
-
70
- ## Requirements
71
-
72
- - Python >= 3.7
73
- - pandas
74
- - pyarrow
deepcsv-0.3.0/README.md DELETED
@@ -1,56 +0,0 @@
1
- # deepcsv
2
-
3
- A Python library that automatically walks through folders and subfolders, finds all CSV and XLSX files, detects and fixes data issues, and saves the results as Parquet files while keeping the exact same folder structure.
4
-
5
- ## Installation
6
- ```bash
7
- pip install deepcsv
8
- ```
9
-
10
- ## What it does
11
-
12
- - Walks through all folders and subfolders automatically
13
- - Finds every CSV and XLSX file
14
- - Detects columns that contain list strings like `"['item1', 'item2']"` and converts them into real Python arrays for faster performance
15
- - Detects columns with mixed data types and tries to fix them automatically
16
- - Warns you when a column has mixed types so you know what was changed
17
- - Saves the results as Parquet files to preserve the converted data types
18
-
19
- > **Why Parquet?**
20
- > CSV files cannot store arrays or preserve data types. Parquet solves this by keeping the exact types after conversion.
21
-
22
- > **Why arrays instead of Python lists?**
23
- > Arrays are significantly faster for numerical operations and machine learning workflows.
24
-
25
- ## Functions
26
-
27
- ### `ConvertListStrToList(file_path)`
28
-
29
- Reads a CSV file, converts list strings to arrays, fixes mixed-type columns, and returns a clean DataFrame.
30
- ```python
31
- import deepcsv
32
-
33
- df = deepcsv.ConvertListStrToList("path/to/file.csv")
34
- ```
35
-
36
- ### `ReadAllCSVData(path)`
37
-
38
- Walks through all folders and subfolders, applies `ConvertListStrToList` on every CSV and XLSX file, and saves the results as Parquet files in a new folder called `All CSV Data is Converted Here`.
39
- ```python
40
- import deepcsv
41
-
42
- deepcsv.ReadAllCSVData("path/to/folder")
43
- ```
44
-
45
- ## Notes
46
-
47
- - Only files that contain list string columns are saved as Parquet
48
- - Mixed-type columns are converted to float automatically when possible
49
- - Skips NaN values without breaking
50
- - Requires `pyarrow` for Parquet support
51
-
52
- ## Requirements
53
-
54
- - Python >= 3.7
55
- - pandas
56
- - pyarrow
@@ -1 +0,0 @@
1
- from .deepcsv import ReadAllCSVData, ConvertListStrToList
@@ -1,76 +0,0 @@
1
- import pyarrow
2
- import os
3
- import warnings
4
- import numpy as np
5
- import pandas as pd
6
- from ast import literal_eval
7
- warnings.filterwarnings("ignore")
8
-
9
- def ConvertListStrToList(File_Path):
10
-
11
-     data = pd.read_csv(File_Path)
12
-     for ColName in data.columns:
13
-
14
-         First_Value = data[ColName].iloc[0]
15
-
16
-         if len(data[ColName].apply(type).unique()) >= 2:
17
-
18
-
19
-             sample = (data[data[ColName].apply(type) == str][ColName].head(2)).values
20
-
21
-             if len(sample) > 0 and isinstance(sample[0],str) and sample[0][0].strip().isnumeric():
22
-
23
-                 print(f"WARNING:\nThis Dataset Name ({File_Path.split("\\")[-1]}) Found {len(data[ColName].apply(type).unique())} Mixed DataType in a column called ({ColName})\nPath : {File_Path}")
24
-                 print(f"System : This column have These types: {data[ColName].apply(type).unique()}")
25
-                 print(f"System : Trying to fix the column as a Float to be have only one datatype...")
26
-
27
-                 data[ColName] = pd.to_numeric(data[ColName], errors='coerce')
28
-                 print("System : Done!")
29
-
30
-         elif isinstance(First_Value , str) and First_Value.strip().startswith("["):
31
-
32
-             data[f"{ColName.capitalize()}List"] = data[ColName].apply(lambda x : literal_eval(x) if pd.notna(x) else np.nan)
33
-             data.drop(ColName,inplace=True,axis=1)
34
-
35
-     return data
36
-
37
-
38
- def ReadAllCSVData(WorkDirectoryPath):
39
-
40
-     base_output = os.path.join(WorkDirectoryPath, "All CSV Data is Converted Here")
41
-     all_folders = [WorkDirectoryPath]
42
-
43
-     os.makedirs(base_output,exist_ok=True)
44
-
45
-     while True:
46
-
47
-         if all_folders:
48
-
49
-             Curr_Path = all_folders.pop(0)
50
-
51
-             for item_name in os.listdir(Curr_Path):
52
-
53
-                 Sub_Item_Path = os.path.join(Curr_Path,item_name)
54
-
55
-                 if os.path.isfile(Sub_Item_Path) and (Sub_Item_Path.endswith(".csv") or Sub_Item_Path.endswith(".xlsx")):
56
-
57
-
58
-
59
-                     df_converted = ConvertListStrToList(Sub_Item_Path)
60
-                     df_converted.reset_index(drop=True,inplace=True)
61
-
62
-                     rel_path = os.path.relpath(Sub_Item_Path, WorkDirectoryPath)
63
-                     output = os.path.join(base_output,rel_path)
64
-                     if "List" in df_converted.columns[-1]:
65
-                         print(Sub_Item_Path)
66
-                         os.makedirs(os.path.dirname(output),exist_ok=True)
67
-                         df_converted.to_parquet(output.replace(".csv", ".parquet"))
68
-
69
-                     print("-"*50)
70
-
71
-                 elif os.path.isdir(Sub_Item_Path):
72
-
73
-                     all_folders.append(Sub_Item_Path)
74
-
75
-         else:
76
-             break
@@ -1,74 +0,0 @@
1
- Metadata-Version: 2.4
2
- Name: deepcsv
3
- Version: 0.3.0
4
- Summary: Automatically walks through folders and subfolders, finds all CSV and XLSX files, detects and fixes data issues, and saves the results as Parquet files while keeping the exact same folder structure.
5
- Author: Abdullah Bakr
6
- Requires-Python: >=3.7
7
- Description-Content-Type: text/markdown
8
- License-File: LICENSE
9
- Requires-Dist: pandas
10
- Requires-Dist: pyarrow
11
- Dynamic: author
12
- Dynamic: description
13
- Dynamic: description-content-type
14
- Dynamic: license-file
15
- Dynamic: requires-dist
16
- Dynamic: requires-python
17
- Dynamic: summary
18
-
19
- # deepcsv
20
-
21
- A Python library that automatically walks through folders and subfolders, finds all CSV and XLSX files, detects and fixes data issues, and saves the results as Parquet files while keeping the exact same folder structure.
22
-
23
- ## Installation
24
- ```bash
25
- pip install deepcsv
26
- ```
27
-
28
- ## What it does
29
-
30
- - Walks through all folders and subfolders automatically
31
- - Finds every CSV and XLSX file
32
- - Detects columns that contain list strings like `"['item1', 'item2']"` and converts them into real Python arrays for faster performance
33
- - Detects columns with mixed data types and tries to fix them automatically
34
- - Warns you when a column has mixed types so you know what was changed
35
- - Saves the results as Parquet files to preserve the converted data types
36
-
37
- > **Why Parquet?**
38
- > CSV files cannot store arrays or preserve data types. Parquet solves this by keeping the exact types after conversion.
39
-
40
- > **Why arrays instead of Python lists?**
41
- > Arrays are significantly faster for numerical operations and machine learning workflows.
42
-
43
- ## Functions
44
-
45
- ### `ConvertListStrToList(file_path)`
46
-
47
- Reads a CSV file, converts list strings to arrays, fixes mixed-type columns, and returns a clean DataFrame.
48
- ```python
49
- import deepcsv
50
-
51
- df = deepcsv.ConvertListStrToList("path/to/file.csv")
52
- ```
53
-
54
- ### `ReadAllCSVData(path)`
55
-
56
- Walks through all folders and subfolders, applies `ConvertListStrToList` on every CSV and XLSX file, and saves the results as Parquet files in a new folder called `All CSV Data is Converted Here`.
57
- ```python
58
- import deepcsv
59
-
60
- deepcsv.ReadAllCSVData("path/to/folder")
61
- ```
62
-
63
- ## Notes
64
-
65
- - Only files that contain list string columns are saved as Parquet
66
- - Mixed-type columns are converted to float automatically when possible
67
- - Skips NaN values without breaking
68
- - Requires `pyarrow` for Parquet support
69
-
70
- ## Requirements
71
-
72
- - Python >= 3.7
73
- - pandas
74
- - pyarrow
deepcsv-0.3.0/setup.py DELETED
@@ -1,13 +0,0 @@
1
- from setuptools import setup, find_packages
2
-
3
- setup(
4
-     name="deepcsv",
5
-     version="0.3.0",
6
-     author="Abdullah Bakr",
7
-     description="Automatically walks through folders and subfolders, finds all CSV and XLSX files, detects and fixes data issues, and saves the results as Parquet files while keeping the exact same folder structure.",
8
-     long_description=open("README.md").read(),
9
-     long_description_content_type="text/markdown",
10
-     packages=find_packages(),
11
-     install_requires=["pandas","pyarrow"],
12
-     python_requires=">=3.7",
13
- )
File without changes
File without changes