datascrub 1.0.0__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- datascrub-1.0.0/LICENSE.txt +21 -0
- datascrub-1.0.0/PKG-INFO +122 -0
- datascrub-1.0.0/README.md +105 -0
- datascrub-1.0.0/datascrub/__init__.py +3 -0
- datascrub-1.0.0/datascrub/datascrub.py +208 -0
- datascrub-1.0.0/datascrub.egg-info/PKG-INFO +122 -0
- datascrub-1.0.0/datascrub.egg-info/SOURCES.txt +10 -0
- datascrub-1.0.0/datascrub.egg-info/dependency_links.txt +1 -0
- datascrub-1.0.0/datascrub.egg-info/requires.txt +8 -0
- datascrub-1.0.0/datascrub.egg-info/top_level.txt +1 -0
- datascrub-1.0.0/setup.cfg +4 -0
- datascrub-1.0.0/setup.py +32 -0
datascrub-1.0.0/LICENSE.txt
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 Samuel Shine

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
datascrub-1.0.0/PKG-INFO
ADDED
@@ -0,0 +1,122 @@
Metadata-Version: 2.1
Name: datascrub
Version: 1.0.0
Summary: A Python package for cleaning and preprocessing data in pandas DataFrames
Home-page: https://github.com/samuelshine/cleanmydata
Author: ['Samuel Shine', 'Alex Benjamin']
Author-email: samuelshine112003@gmail.com
Classifier: Development Status :: 5 - Production/Stable
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Description-Content-Type: text/markdown
License-File: LICENSE.txt

## DataScrub

The `DataClean` class provides a set of methods to clean and process data in a pandas DataFrame. It includes functions for cleaning text data, handling missing values, performing scaling normalization, exploding data, parsing date columns, and translating text columns.

### Class Initialization

To create an instance of the `DataClean` class, provide the filepath of the data file (CSV or Excel) as an argument to the constructor. The class reads the data into a pandas DataFrame based on the file extension.

Example:
```python
cleaner = DataClean('data.csv')
```

### Method: clean_data

The `clean_data` method cleans text data in the DataFrame. Its `columns` parameter specifies which columns to clean: pass `'all'` to clean every text column, or a list of specific column names.

Example:
```python
cleaned_data = cleaner.clean_data(['text_column1', 'text_column2'])
```

### Method: handle_missing_values

The `handle_missing_values` method handles missing values in the DataFrame. Its `missing_values` parameter is a dictionary whose keys are column names and whose values are the operations to perform on each column's missing values.

Example:
```python
missing_values = {'column1': 'replace missing value with 0', 'column2': 'drop'}
processed_data = cleaner.handle_missing_values(missing_values)
```
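
These operation strings map onto ordinary pandas calls; a minimal standalone sketch of the two most common ones, independent of the package:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'column1': [1.0, np.nan, 3.0],
                   'column2': ['a', None, 'c']})

# 'replace missing value with 0' -> fill the gaps with a constant
filled = df['column1'].fillna(0)

# 'drop' -> remove rows where that column is missing
dropped = df.dropna(subset=['column2'])

print(filled.tolist())  # [1.0, 0.0, 3.0]
print(len(dropped))     # 2
```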

### Method: perform_scaling_normalization

The `perform_scaling_normalization` method performs scaling normalization on numerical columns in the DataFrame using the Box-Cox transformation. Currently, this method is marked as 'NOT COMPLETE' in the code and does not contain the complete implementation.
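
For context on what a completed implementation would involve: `scipy.stats.boxcox` expects strictly positive input and returns a pair of (transformed values, fitted lambda), not a single array. A minimal sketch, independent of the package:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 5.0, 10.0, 50.0])  # Box-Cox requires strictly positive values

# boxcox returns BOTH the transformed array and the fitted lambda.
transformed, lmbda = stats.boxcox(x)

print(transformed.shape)  # (5,)
```

Because the transformation is monotone, the transformed values preserve the ordering of the input.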
### Method: explode_data

The `explode_data` method splits and expands data in specified columns of the DataFrame. It takes a dictionary `explode` whose keys are column names and whose values are the separators to split on.

Example:
```python
explode_columns = {'column1': ',', 'column2': ';'}
exploded_data = cleaner.explode_data(explode_columns)
```
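
Under the hood this is pandas' split-then-explode pattern; a self-contained sketch (not package code):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'column1': ['a,b', 'c']})

# Split each cell on the separator, then give every list element its own row.
df['column1'] = df['column1'].str.split(',')
exploded = df.explode('column1')

print(exploded['column1'].tolist())  # ['a', 'b', 'c']
```

Note that `explode` repeats the other columns (here `id`) for every element produced by the split.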

### Method: dupli

The `dupli` method removes duplicate rows from the DataFrame.

Example:
```python
unique_data = cleaner.dupli()
```

### Method: parse_date_column

The `parse_date_column` method converts specified columns in the DataFrame to datetime format and formats them as 'YYYY-MM-DD'. It takes a list `date_columns` containing the names of the columns to be converted.

Example:
```python
date_columns = ['date_column1', 'date_column2']
parsed_data = cleaner.parse_date_column(date_columns)
```
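
The conversion rests on `pandas.to_datetime` with `errors='coerce'`, which turns unparseable values into `NaT` instead of raising; a standalone sketch of that behaviour (not package code):

```python
import pandas as pd

raw = pd.Series(['2023-01-05', '2023-02-10', 'not a date'])

# Unparseable entries become NaT rather than raising an exception.
dates = pd.to_datetime(raw, errors='coerce')
formatted = dates.dt.strftime('%Y-%m-%d')

print(formatted[0])           # 2023-01-05
print(dates.isna().tolist())  # [False, False, True]
```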

### Method: translate_columns

The `translate_columns` method translates text in specified columns of the DataFrame to English using Google Translate. It takes a dictionary `translations` whose keys are column names and whose values are overwrite booleans: if the value is True, the original column is overwritten; otherwise, a new column with the translated text is added.

Example:
```python
column_translations = {'text_column1': True, 'text_column2': False}
translated_data = cleaner.translate_columns(column_translations)
```

### Method: prep

The `prep` method is the main entry point for preparing and cleaning the DataFrame. It performs multiple cleaning and processing operations in a fixed order, controlled by the following parameters:

- `clean`: Columns to clean. Pass `'all'` to clean all columns, or provide a list of specific column names.
- `missing_values`: Actions to be taken on missing values. Pass a dictionary with column names as keys and operations as values.
- `perform_scaling_normalization_bool`: Boolean indicating whether to perform scaling normalization on numerical columns.
- `explode`: Columns to be exploded. Pass a dictionary with column names as keys and separators for splitting as values.
- `parse_date`: List of column names to be converted to datetime format.
- `translate_column_names`: Dictionary mapping column names to overwrite booleans for translation.

Example:
```python
cleaned_data = cleaner.prep(clean='all', missing_values={'column1': 'drop'}, perform_scaling_normalization_bool=True, explode={'column2': ','}, parse_date=['date_column1'], translate_column_names={'text_column1': True})
```

### Getting the Cleaned DataFrame

To obtain the cleaned and processed DataFrame, call the `prep` method and assign the returned DataFrame to a variable.

Example:
```python
cleaned_data = cleaner.prep(clean='all', missing_values={'column1': 'drop'})
```

The variable `cleaned_data` will contain the final cleaned and processed DataFrame.

Please note that some methods in the code are marked 'NOT COMPLETE' and require further implementation to work properly. You can modify and complete those methods as per your requirements.
datascrub-1.0.0/README.md
ADDED
@@ -0,0 +1,105 @@
(Identical to the markdown description embedded in PKG-INFO above.)
datascrub-1.0.0/datascrub/datascrub.py
ADDED
@@ -0,0 +1,208 @@
import pandas as pd
import numpy as np
import datetime
import charset_normalizer
import fuzzywuzzy
from fuzzywuzzy import process
import re
from scipy import stats
from pandas.api.types import is_numeric_dtype
import emoji
import string
from googletrans import Translator


class DataClean:
    def __init__(self, filepath):
        if filepath.endswith('.csv'):
            self.raw_data = pd.read_csv(filepath)
        elif filepath.endswith('.xlsx'):
            self.raw_data = pd.read_excel(filepath)
        else:
            # Without this branch an unsupported extension would leave raw_data unset.
            raise ValueError("Unsupported file type; expected a .csv or .xlsx file.")

    def clean_data(self, columns):
        """
        Cleans text data in a pandas DataFrame.

        Parameters:
            columns (str, list): Either 'all' to clean all columns or a list of column names to clean.

        Returns:
            pd.DataFrame: DataFrame with cleaned text data.
        """
        updated_data = self.raw_data.copy()

        if columns == 'all':
            text_columns = updated_data.select_dtypes(include='object').columns
        elif isinstance(columns, list):
            text_columns = [col for col in columns if col in updated_data.columns]
        else:
            raise ValueError("Invalid value for 'clean' parameter. It should be 'all' or a list of column names.")

        # Clean text columns: trim whitespace, lowercase, and spell out emoji.
        for column in text_columns:
            updated_data[column] = updated_data[column].str.strip()
            updated_data[column] = updated_data[column].str.lower()
            updated_data[column] = updated_data[column].apply(lambda x: emoji.demojize(str(x)))

        return updated_data

    def handle_missing_values(self, missing_values):
        """
        Handles missing values in a pandas DataFrame. Actions depend on the `missing_values` dictionary provided.

        Parameters:
            missing_values (dict): Dictionary with actions to be taken on missing values.
                Keys: Column names.
                Values: Operation to be performed on the corresponding column.
                    "replace missing value with <value>" to replace missing values with <value>.
                    "drop" to drop rows with missing values.
                    "fill with backward fill along rows" to fill missing values using backward fill along rows.
                    "fill with backward fill along columns" to fill missing values using backward fill along columns.

        Returns:
            pd.DataFrame: DataFrame with handled missing values.
        """
        for column, operation in missing_values.items():
            if operation.startswith("replace missing value with "):
                value = operation.split("replace missing value with ")[1]
                try:
                    value = float(value)
                except ValueError:
                    value = str(value)
                self.raw_data[column] = self.raw_data[column].fillna(value)
            elif operation == "drop":
                self.raw_data.dropna(subset=[column], inplace=True)
            elif operation == "fill with backward fill along rows":
                # A single column is a Series, which only has axis 0: fill down the rows.
                self.raw_data[column] = self.raw_data[column].bfill().fillna(0)
            elif operation == "fill with backward fill along columns":
                # Filling across columns has to happen on the whole frame.
                self.raw_data[column] = self.raw_data.bfill(axis=1)[column].fillna(0)
            else:
                raise ValueError(f"Invalid operation '{operation}' in 'missing_values' parameter.")
        return self.raw_data

    def perform_scaling_normalization(self):
        """
        NOT COMPLETE!!!

        Performs scaling normalization on numerical columns in a pandas DataFrame using Box-Cox transformation.

        Returns:
            pd.DataFrame: DataFrame with scaled and normalized numerical data.
        """
        updated_data = self.raw_data.copy()
        for column in updated_data.columns:
            if is_numeric_dtype(updated_data[column]):
                # Box-Cox requires strictly positive input.
                updated_data[column] = updated_data[column].apply(lambda x: abs(x) if x != 0 else 1)
                # boxcox returns (transformed_values, fitted_lambda).
                updated_data[column], _ = stats.boxcox(updated_data[column])
        return updated_data

    def explode_data(self, explode):
        """
        Splits and expands data in specified columns of a pandas DataFrame.

        Parameters:
            explode (dict): Dictionary with column names as keys and separators as values for splitting.

        Returns:
            pd.DataFrame: DataFrame with exploded data.
        """
        updated_data = self.raw_data.copy()
        for column, separator in explode.items():
            updated_data[column] = updated_data[column].str.split(separator)
            updated_data = updated_data.explode(column)
        return updated_data

    def dupli(self):
        """
        Removes duplicate rows from a pandas DataFrame.

        Returns:
            pd.DataFrame: DataFrame without duplicate rows.
        """
        return self.raw_data.drop_duplicates()

    def parse_date_column(self, date_columns):
        """
        Converts specified columns in a pandas DataFrame to datetime format and formats them as 'YYYY-MM-DD'.

        Parameters:
            date_columns (list): List of column names to be converted to datetime format.

        Returns:
            pd.DataFrame: DataFrame with parsed date columns.
        """
        updated_data = self.raw_data.copy()

        for column in date_columns:
            if column not in updated_data.columns:
                raise ValueError(f"Column '{column}' not found in the DataFrame.")

            if updated_data[column].dtype != 'datetime64[ns]':
                updated_data[column] = updated_data[column].astype(str)
                # errors='coerce' turns unparseable values into NaT instead of raising.
                updated_data[column] = pd.to_datetime(updated_data[column], errors='coerce').dt.strftime('%Y-%m-%d')

        return updated_data

    def translate_columns(self, translations):
        """
        Translates text in specified columns of a DataFrame to English using Google Translate.

        Parameters:
            translations (dict): Dictionary mapping column names to overwrite boolean values.
                The key is the column name, and the value is the overwrite boolean value.

        Returns:
            pd.DataFrame: DataFrame with translated columns.
        """
        translator = Translator()
        updated_data = self.raw_data.copy()

        for column, overwrite_value in translations.items():
            if column not in updated_data.columns:
                raise ValueError(f"Column '{column}' not found in the DataFrame.")

            translated_texts = []
            for text in updated_data[column]:
                translation = translator.translate(text)
                translated_texts.append(translation.text)

            if overwrite_value:
                updated_data[column] = translated_texts
            else:
                updated_data[column + '_translated'] = translated_texts

        return updated_data

    def prep(self, clean='all', missing_values={}, perform_scaling_normalization_bool=False, explode={}, parse_date=[], translate_column_names={}):
        """
        Main function to prepare and clean a pandas DataFrame. Can perform cleaning, handle missing values,
        perform scaling normalization, explode data, parse date columns, and translate columns based on the parameters passed.

        Parameters:
            clean (str, list): Columns to clean. Either 'all' for all columns or a list of specific column names.
            missing_values (dict): Actions to be taken on missing values. See `handle_missing_values` for details.
            perform_scaling_normalization_bool (bool): If True, performs scaling normalization on numerical columns.
            explode (dict): Columns to be exploded. Keys are column names and values are separators for splitting.
            parse_date (list): List of column names to be converted to datetime format.
            translate_column_names (dict): Dictionary mapping column names to overwrite boolean values for translation.

        Returns:
            pd.DataFrame: Updated DataFrame with cleaned, transformed, and translated columns.
        """
        # Each helper reads self.raw_data, so every intermediate result is written
        # back; otherwise later steps would silently discard earlier ones.
        self.raw_data = self.clean_data(clean)
        self.raw_data = self.handle_missing_values(missing_values)

        if perform_scaling_normalization_bool:
            self.raw_data = self.perform_scaling_normalization()

        self.raw_data = self.explode_data(explode)

        if parse_date:
            self.raw_data = self.parse_date_column(parse_date)

        self.raw_data = self.translate_columns(translate_column_names)

        return self.raw_data
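A dependency-light way to see the same flow end to end without installing the package: write a throwaway CSV and apply the pandas calls the class wraps (`read_csv` as in `__init__`, `clean_data`-style text cleanup, `dupli`-style `drop_duplicates`). The file path here is a temporary file created for the sketch, not a package fixture:

```python
import os
import tempfile
import pandas as pd

# Build a throwaway CSV like the files DataClean expects.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as f:
    f.write('name,city\n  Alice ,Paris\nBOB,Paris\nBOB,Paris\n')
    path = f.name

df = pd.read_csv(path)                              # what __init__ does for .csv
df['name'] = df['name'].str.strip().str.lower()     # clean_data-style text cleanup
df = df.drop_duplicates()                           # dupli()

print(df['name'].tolist())  # ['alice', 'bob']
os.remove(path)
```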
datascrub-1.0.0/datascrub.egg-info/PKG-INFO
ADDED
@@ -0,0 +1,122 @@
(Identical to datascrub-1.0.0/PKG-INFO above.)
datascrub-1.0.0/datascrub.egg-info/dependency_links.txt
ADDED
@@ -0,0 +1 @@

datascrub-1.0.0/datascrub.egg-info/top_level.txt
ADDED
@@ -0,0 +1 @@
datascrub
datascrub-1.0.0/setup.py
ADDED
@@ -0,0 +1,32 @@
from setuptools import setup, find_packages

setup(
    name='datascrub',
    version='1.0.0',
    # setuptools expects a string here; a list is rendered verbatim in the published metadata.
    author='Samuel Shine, Alex Benjamin',
    author_email='samuelshine112003@gmail.com',
    description='A Python package for cleaning and preprocessing data in pandas DataFrames',
    long_description=open('README.md', encoding='utf-8').read(),
    long_description_content_type='text/markdown',
    url='https://github.com/samuelshine/cleanmydata',
    packages=find_packages(),
    classifiers=[
        'Development Status :: 5 - Production/Stable',
        'License :: OSI Approved :: MIT License',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
        'Programming Language :: Python :: 3.8',
        'Programming Language :: Python :: 3.9',
    ],
    install_requires=[
        'pandas',
        'numpy',
        'charset_normalizer',
        'fuzzywuzzy',
        'scipy',
        'python-dateutil',  # the PyPI distribution name; plain 'dateutil' is not installable
        'emoji',
        'googletrans',
    ],
)