eda-toolkit 0.0.1a0__tar.gz

@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2024 [Leonid Shpaner, Oscar Gil]
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,69 @@
1
+ Metadata-Version: 2.1
2
+ Name: eda_toolkit
3
+ Version: 0.0.1a0
4
+ Summary: A Python library for EDA (directory management, some data preprocessing, reporting, visualizations, and more)
5
+ Author: Leonid Shpaner, Oscar Gil
6
+ Author-email: lshpaner@ucla.edu
7
+ Project-URL: Leonid Shpaner Website, https://www.leonshpaner.com
8
+ Project-URL: Oscar Gil Website, https://www.oscargildata.com
9
+ Project-URL: Source Code, https://github.com/lshpaner/eda_kit/
10
+ Classifier: Programming Language :: Python :: 3
11
+ Classifier: License :: OSI Approved :: MIT License
12
+ Classifier: Operating System :: OS Independent
13
+ Requires-Python: >=3.6
14
+ Description-Content-Type: text/markdown
15
+ License-File: LICENSE.md
16
+ Requires-Dist: jinja2>=3.1.4
17
+ Requires-Dist: numpy>=1.21.6
18
+ Requires-Dist: pandas>=1.3.5
19
+ Requires-Dist: matplotlib>=3.5.3
20
+ Requires-Dist: seaborn>=0.12.2
21
+ Requires-Dist: xlsxwriter>=3.2.0
22
+
23
+ # EDA Kit
24
+ Welcome to EDA Kit, a collection of utility functions designed to streamline your exploratory data analysis (EDA) tasks. This repository offers tools for directory management, data preprocessing, reporting, visualization, and more, helping you efficiently handle various aspects of data manipulation and analysis.
25
+
26
+
27
+ ## Prerequisites
28
+
29
+ Before you install `eda_kit`, ensure your system meets the following requirements:
30
+
31
+ - `Python`: Version `3.7.4` or higher is required to run `eda_kit`.
32
+
33
+
34
+ Additionally, `eda_kit` depends on the following packages, which will be automatically installed when you install `eda_kit`:
35
+
36
+ - `numpy`: Version 1.21.6 or higher
37
+ - `pandas`: Version 1.3.5 or higher
38
+ - `matplotlib`: Version 3.5.3 or higher
39
+ - `seaborn`: Version 0.12.2 or higher
40
+ - `jinja2`: Version 3.1.4 or higher
41
+ - `xlsxwriter`: Version 3.2.0 or higher
42
+
43
+
44
+ ## Installation
45
+
46
+ You can install `eda_kit` directly from PyPI, where it is published under the name `eda_toolkit`:
47
+
48
+ ```bash
49
+ pip install eda_toolkit
50
+ ```
51
+
52
+ ## 🌐 Authors' Websites
53
+
54
+ https://www.leonshpaner.com
55
+ https://www.oscargildata.com
56
+
57
+
58
+ ## ⚖️ License
59
+
60
+ `eda_kit` is distributed under the MIT License. See [LICENSE](https://github.com/lshpaner/eda_kit/blob/readme/LICENSE.md) for more information.
61
+
62
+
63
+ ## Support
64
+
65
+ If you have any questions or issues with `eda_kit`, please open an issue on this GitHub repository.
66
+
67
+
68
+
69
+
@@ -0,0 +1,250 @@
1
+ # EDA Kit
2
+ Welcome to EDA Kit, a collection of utility functions designed to streamline your exploratory data analysis (EDA) tasks. This repository offers tools for directory management, data preprocessing, reporting, visualization, and more, helping you efficiently handle various aspects of data manipulation and analysis.
3
+
4
+ ---
5
+ ## Table of Contents
6
+
7
+ 1. [Installation](#installation)
8
+ 2. [Overview](#overview)
9
+ 3. [Functions](#functions)
10
+ a. [Path Directories](#path-directories)
11
+ b. [Generate Random IDs](#generate-random-ids)
12
+ c. [Trailing Periods](#trailing-periods)
13
+ d. [Standardized Dates](#standardized-dates)
14
+ e. [Data Types Reports](#data-types-reports)
15
+ f. [DataFrame Columns Analysis](#dataframe-columns-analysis)
16
+ g. [Summarize All Combinations](#summarize-all-combinations)
17
+ h. [Save DataFrames to Excel](#save-dataframes-to-excel)
18
+ i. [Contingency Table](#contingency-table)
19
+ j. [Highlight DataFrame Tables](#highlight-dataframe-tables)
20
+ k. [KDE Distribution Plots](#kde-distribution-plots)
21
+ l. [Stacked Bar Plots with Crosstab Options](#stacked-bar-plots-with-crosstab-options)
22
+ m. [Filtered DataFrame Plots](#filtered-dataframe-plots)
23
+ n. [Metrics Box and Violin Plots](#metrics-box-and-violin-plots)
24
+ 4. [Usage](#usage)
25
+ 5. [Contributing](#contributing)
26
+ 6. [License](#license)
27
+
28
+ ---
29
+
30
+ ## Installation
31
+ Clone the repository and install the necessary dependencies:
32
+
33
+ ```bash
34
+ git clone https://github.com/lshpaner/eda_kit.git
35
+ cd eda_kit
36
+ pip install -r requirements.txt
37
+ ```
38
+
39
+ ## Overview
40
+ EDA Kit is a comprehensive toolkit for data analysts and data scientists alike. It provides a suite of functions to handle common EDA tasks, making your workflow more efficient and organized. The toolkit covers everything from directory management and ID generation to complex visualizations and data reporting.
41
+
42
+ ## Functions
43
+ ### Path Directories
44
+ `ensure_directory(path)`: ensures that the specified directory exists; if not, it creates it.
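The behavior described above can be approximated with the standard library alone. The following is a minimal sketch, not the package's implementation; the name `ensure_directory_sketch` and the example paths are hypothetical.

```python
import os
import tempfile


def ensure_directory_sketch(path):
    """Create `path` (and any missing parents) if it does not already exist."""
    # exist_ok=True turns the call into a no-op when the directory is already there
    os.makedirs(path, exist_ok=True)
    return path


# Usage: build a nested directory inside a throwaway temp folder
base = tempfile.mkdtemp()
target = os.path.join(base, "reports", "figures")
ensure_directory_sketch(target)
ensure_directory_sketch(target)  # calling it again does not raise
```

Returning the path is a convenience assumed here so the call can be chained; the README does not say what the real function returns.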
45
+
46
+ ### Generate Random IDs
47
+ `add_ids(df, column_name="Patient_ID", seed=None)`: adds a column of unique, 9-digit IDs to a DataFrame.
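One way to produce unique 9-digit IDs is to sample without replacement from the 9-digit integer range. This is a sketch under that assumption; `add_ids_sketch` is a hypothetical name, and the real `add_ids` may generate or store its IDs differently.

```python
import random

import pandas as pd


def add_ids_sketch(df, column_name="Patient_ID", seed=None):
    """Return a copy of `df` with a column of unique 9-digit ID strings."""
    rng = random.Random(seed)  # a local generator keeps the global random state untouched
    # Sampling without replacement from the 9-digit range guarantees uniqueness
    ids = rng.sample(range(100_000_000, 1_000_000_000), len(df))
    out = df.copy()
    out[column_name] = [str(i) for i in ids]
    return out


df = pd.DataFrame({"data": range(10)})
with_ids = add_ids_sketch(df, seed=42)
```

Passing the same `seed` yields the same IDs, which helps when a report has to be reproducible.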
48
+
49
+ ### Trailing Periods
50
+ `strip_trailing_period(df, column_name)`: strips trailing periods from floats in a specified column of a DataFrame.
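The README does not show what the input looks like, so the sketch below assumes the column holds float-like strings such as `"12."` that should become `12.0`. The name `strip_trailing_period_sketch` and the sample data are hypothetical.

```python
import pandas as pd


def strip_trailing_period_sketch(df, column_name):
    """Return a copy of `df` with trailing periods removed from one column."""

    def _fix(value):
        # Only strings that end in a period are touched
        if isinstance(value, str) and value.endswith("."):
            stripped = value.rstrip(".")
            try:
                return float(stripped)
            except ValueError:
                # Not numeric after stripping: keep the cleaned string
                return stripped
        return value

    out = df.copy()
    out[column_name] = out[column_name].apply(_fix)
    return out


raw = pd.DataFrame({"value": ["12.", "3.5", 7.0]})
cleaned = strip_trailing_period_sketch(raw, "value")
```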
51
+
52
+ ### Standardized Dates
53
+ `parse_date_with_rule(date_str)`: parses and standardizes date strings to the ISO 8601 format (YYYY-MM-DD).
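The "rule" itself is not spelled out in the README. As an illustration only, the sketch below assumes slash-separated dates and treats the first field as the day when it is greater than 12, and as the month otherwise. `parse_date_sketch` is a hypothetical name.

```python
from datetime import datetime


def parse_date_sketch(date_str):
    """Convert 'DD/MM/YYYY' or 'MM/DD/YYYY' to 'YYYY-MM-DD' (assumed rule)."""
    first_field = int(date_str.split("/")[0])
    # Assumed rule: a first field above 12 cannot be a month, so read it as the day
    fmt = "%d/%m/%Y" if first_field > 12 else "%m/%d/%Y"
    return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")


day_first = parse_date_sketch("25/12/2023")
month_first = parse_date_sketch("03/04/2023")
```

A rule like this cannot disambiguate dates where both fields are 12 or less; the sketch reads those as month-first.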
54
+
55
+ ### Data Types Reports
56
+ `data_types(df)`: provides a report on every column in the DataFrame, showing column names, data types, number of nulls, and percentage of nulls.
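A report of this shape can be assembled from a few pandas calls. This is a minimal sketch with hypothetical output column names, not the package's own code.

```python
import pandas as pd


def data_types_sketch(df):
    """Build a per-column report of dtype, null count, and null percentage."""
    return pd.DataFrame(
        {
            "column": df.columns,
            "dtype": df.dtypes.astype(str).values,
            "null_count": df.isna().sum().values,
            # isna().mean() is the fraction of nulls per column
            "null_pct": (df.isna().mean() * 100).round(2).values,
        }
    )


sample = pd.DataFrame({"a": [1, None, 3, 4], "b": ["x", "y", None, None]})
report = data_types_sketch(sample)
```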
57
+
58
+ ### DataFrame Columns Analysis
59
+ `dataframe_columns(df)`: analyzes DataFrame columns for dtype, null counts, max unique values, and their percentages.
60
+
61
+ ### Summarize All Combinations
62
+ `summarize_all_combinations(df, variables, data_path, data_name, min_length=2)`: generates summary tables for all possible combinations of specified variables in the DataFrame and saves them to an Excel file.
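The combination step can be sketched with `itertools.combinations` and a `groupby` count. The sketch below returns the tables in a dictionary and leaves out the Excel export; the function name, the `Count` column, and the sample data are assumptions.

```python
from itertools import combinations

import pandas as pd


def summarize_combinations_sketch(df, variables, min_length=2):
    """Count rows for every combination of `variables` with at least `min_length` columns."""
    tables = {}
    for size in range(min_length, len(variables) + 1):
        for combo in combinations(variables, size):
            # One row per distinct value combination, with its frequency
            tables[combo] = df.groupby(list(combo)).size().reset_index(name="Count")
    return tables


people = pd.DataFrame(
    {
        "sex": ["F", "M", "F", "F"],
        "smoker": ["y", "n", "n", "y"],
        "region": ["east", "east", "west", "east"],
    }
)
tables = summarize_combinations_sketch(people, ["sex", "smoker", "region"])
```

With three variables and `min_length=2` this produces four tables: three pairs and one triple.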
63
+
64
+ ### Save DataFrames to Excel
65
+ `save_dataframes_to_excel(file_path, df_dict, decimal_places=2)`: saves multiple DataFrames to separate sheets in an Excel file with customized formatting.
66
+
67
+ ### Contingency Table
68
+ `contingency_table(df, col1, col2, SortBy)`: creates a contingency table from one or two columns in a DataFrame, with sorting options.
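A comparable table can be built with `groupby`. In the sketch below the `Total` and `Percentage` column names and the boolean `sort_by_count` flag are assumptions; the README does not document which values the real `SortBy` argument accepts.

```python
import pandas as pd


def contingency_table_sketch(df, cols, sort_by_count=True):
    """Count rows per value combination of `cols` and add a percentage column."""
    table = df.groupby(cols).size().reset_index(name="Total")
    table["Percentage"] = (100 * table["Total"] / table["Total"].sum()).round(2)
    if sort_by_count:
        # Most frequent combinations first
        table = table.sort_values("Total", ascending=False).reset_index(drop=True)
    return table


survey = pd.DataFrame({"sex": ["F", "M", "F", "F"], "smoker": ["y", "n", "n", "y"]})
table = contingency_table_sketch(survey, ["sex", "smoker"])
```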
69
+
70
+ ### Highlight DataFrame Tables
71
+ `highlight_columns(df, columns, color="yellow")`: highlights specific columns in a DataFrame with a specified background color.
72
+
73
+ ### KDE Distribution Plots
74
+
75
+ ```python
76
+
77
+ kde_distributions(
78
+ df,
79
+ dist_list,
80
+ x,
81
+ y,
82
+ kde=True,
83
+ n_rows=1,
84
+ n_cols=1,
85
+ w_pad=1.0,
86
+ h_pad=1.0,
87
+ text_wrap=50,
88
+ image_path_png=None,
89
+ image_path_svg=None,
90
+ image_filename=None,
91
+ bbox_inches=None,
92
+ vars_of_interest=None,
93
+ single_var_image_path_png=None,
94
+ single_var_image_path_svg=None,
95
+ single_var_image_filename=None,
96
+ y_axis="count",
97
+ plot_type="both",
98
+ )
99
+
100
+ ```
101
+ Generates KDE or histogram distribution plots for specified columns in a DataFrame.
102
+
103
+ ### Stacked Bar Plots with Crosstab Options
104
+
105
+ ```python
106
+ stacked_crosstab_plot(
107
+ df,
108
+ col,
109
+ func_col,
110
+ legend_labels_list,
111
+ title,
112
+ kind="bar",
113
+ width=0.9,
114
+ rot=0,
115
+ custom_order=None,
116
+ image_path_png=None,
117
+ image_path_svg=None,
118
+ save_formats=None,
119
+ color=None,
120
+ output="both",
121
+ return_dict=False,
122
+ x=None,
123
+ y=None,
124
+ p=None,
125
+ file_prefix=None,
126
+ logscale=False,
127
+ plot_type="both",
128
+ show_legend=True,
129
+ )
130
+ ```
131
+
132
+ Generates stacked bar plots and crosstabs for specified columns.
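The crosstab half of this can be illustrated with plain pandas. This sketch does not call `stacked_crosstab_plot`; the toy data and column names are made up for illustration.

```python
import pandas as pd

# Toy data; the column names are hypothetical
visits = pd.DataFrame(
    {
        "age_group": ["18-29", "18-29", "30-49", "30-49", "30-49", "50+"],
        "outcome": ["yes", "no", "yes", "yes", "no", "no"],
    }
)

# Raw counts: one row per age group, one column per outcome
counts = pd.crosstab(visits["age_group"], visits["outcome"])

# Row-normalized shares, so each row sums to 1.0
shares = counts.div(counts.sum(axis=1), axis=0)

# A stacked bar chart of either table could then be drawn with
# counts.plot(kind="bar", stacked=True), which requires matplotlib
```

Plotting the `shares` table gives bars of equal height, which makes proportions easier to compare across groups than raw counts.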
133
+
134
+ ### Filtered DataFrame Plots
135
+ ```python
136
+
137
+ plot_filtered_dataframes(
138
+ df,
139
+ col,
140
+ func_col,
141
+ legend_labels_list,
142
+ title,
143
+ file_prefix,
144
+ condition_col=None,
145
+ condition_val=1,
146
+ x=12,
147
+ y=8,
148
+ p=10,
149
+ kind="bar",
150
+ width=0.9,
151
+ rot=0,
152
+ image_path_png=None,
153
+ image_path_svg=None,
154
+ save_formats=["png", "svg"],
155
+ color=None,
156
+ output="both",
157
+ return_dict=True,
158
+ logscale=True,
159
+ plot_type="both",
160
+ show_legend=True,
161
+ )
162
+
163
+ ```
164
+ Filters the DataFrame based on a specified condition and generates plots and crosstabs.
165
+
166
+ ### Metrics Box and Violin Plots
167
+
168
+ ```python
169
+ metrics_box_violin(
170
+ df,
171
+ metrics_list,
172
+ metrics_boxplot_comp,
173
+ n_rows,
174
+ n_cols,
175
+ image_path_png,
176
+ image_path_svg,
177
+ save_individual=True,
178
+ save_grid=True,
179
+ save_both=False,
180
+ show_legend=True,
181
+ plot_type="boxplot",
182
+ )
183
+ ```
184
+ Creates and saves individual boxplots or violin plots, or an entire grid of plots for given metrics and comparisons.
185
+
186
+ ## Usage
187
+
188
+ ### Import the module and functions
189
+
190
+ ```python
191
+ import pandas as pd
192
+ import numpy as np
193
+ import random
194
+ from itertools import combinations
195
+ import datetime
196
+ from IPython.display import display
197
+ import matplotlib.pyplot as plt
198
+ import seaborn as sns
199
+ import textwrap
200
+ import os
201
+ import warnings
202
+ # Import the utility functions from EDA Kit
203
+ from eda_kit import (
204
+ ensure_directory,
205
+ add_ids,
206
+ strip_trailing_period,
207
+ parse_date_with_rule,
208
+ data_types,
209
+ dataframe_columns,
210
+ summarize_all_combinations,
211
+ save_dataframes_to_excel,
212
+ contingency_table,
213
+ highlight_columns,
214
+ kde_distributions,
215
+ stacked_crosstab_plot,
216
+ plot_filtered_dataframes,
217
+ metrics_box_violin,
218
+ )
219
+ ```
220
+
221
+ ### Use the functions as needed in your data analysis workflow
222
+
223
+
224
+ ```python
225
+ # Example usage of ensure_directory function
226
+ ensure_directory('path/to/your/directory')
227
+ ```
228
+
229
+
230
+ ### Example usage of add_ids function
231
+ ```python
232
+ df = pd.DataFrame({'data': range(10)})
233
+ df_with_ids = add_ids(df)
234
+ print(df_with_ids)
235
+ ```
236
+
237
+ ## Contributing
238
+ We welcome contributions! If you have suggestions or improvements, please submit an issue or pull request. Follow the standard GitHub flow for contributing.
239
+
240
+ 1. Fork the repository
241
+ 2. Create a new branch (`git checkout -b feature-branch`)
242
+ 3. Make your changes
243
+ 4. Commit your changes (`git commit -am 'Add new feature'`)
244
+ 5. Push to the branch (`git push origin feature-branch`)
245
+ 6. Create a new Pull Request
246
+
247
+ ## License
248
+ This project is licensed under the MIT License. See the [LICENSE](https://github.com/lshpaner/eda_kit/blob/readme/LICENSE.md) file for details.
249
+
250
+ For more detailed documentation, refer to the docstrings within each function.
@@ -0,0 +1,4 @@
1
+ [egg_info]
2
+ tag_build =
3
+ tag_date = 0
4
+
@@ -0,0 +1,37 @@
1
+ from setuptools import setup, find_packages
2
+ import codecs
3
+
4
+ with codecs.open("README_min.md", "r", "utf-8") as fh:
5
+     long_description = fh.read()
6
+
7
+ setup(
8
+ name="eda_toolkit",
9
+ version="0.0.1a",
10
+ author="Leonid Shpaner, Oscar Gil",
11
+ author_email="lshpaner@ucla.edu",
12
+ description="A Python library for EDA (directory management, some data preprocessing, reporting, visualizations, and more)",
13
+ long_description=long_description,
14
+ long_description_content_type="text/markdown", # Type of the long description
15
+ package_dir={"": "src"}, # Directory where your package files are located
16
+ # Automatically find packages in the specified directory
17
+ packages=find_packages(where="src"),
18
+ project_urls={ # Optional
19
+ "Leonid Shpaner Website": "https://www.leonshpaner.com",
20
+ "Oscar Gil Website": "https://www.oscargildata.com",
21
+ "Source Code": "https://github.com/lshpaner/eda_kit/",
22
+ },
23
+ classifiers=[
24
+ "Programming Language :: Python :: 3",
25
+ "License :: OSI Approved :: MIT License",
26
+ "Operating System :: OS Independent",
27
+ ], # Classifiers for the package
28
+ python_requires=">=3.6", # Minimum version of Python required
29
+ install_requires=[
30
+ "jinja2>=3.1.4", # Minimum version of jinja2 required
31
+ "numpy>=1.21.6", # Minimum version of numpy required
32
+ "pandas>=1.3.5", # Minimum version of pandas required
33
+ "matplotlib>=3.5.3", # Minimum version of matplotlib required
34
+ "seaborn>=0.12.2", # Minimum version of seaborn required
35
+ "xlsxwriter>=3.2.0", # Minimum version of xlsxwriter required
36
+ ],
37
+ )
@@ -0,0 +1,69 @@
1
+ Metadata-Version: 2.1
2
+ Name: eda_toolkit
3
+ Version: 0.0.1a0
4
+ Summary: A Python library for EDA (directory management, some data preprocessing, reporting, visualizations, and more)
5
+ Author: Leonid Shpaner, Oscar Gil
6
+ Author-email: lshpaner@ucla.edu
7
+ Project-URL: Leonid Shpaner Website, https://www.leonshpaner.com
8
+ Project-URL: Oscar Gil Website, https://www.oscargildata.com
9
+ Project-URL: Source Code, https://github.com/lshpaner/eda_kit/
10
+ Classifier: Programming Language :: Python :: 3
11
+ Classifier: License :: OSI Approved :: MIT License
12
+ Classifier: Operating System :: OS Independent
13
+ Requires-Python: >=3.6
14
+ Description-Content-Type: text/markdown
15
+ License-File: LICENSE.md
16
+ Requires-Dist: jinja2>=3.1.4
17
+ Requires-Dist: numpy>=1.21.6
18
+ Requires-Dist: pandas>=1.3.5
19
+ Requires-Dist: matplotlib>=3.5.3
20
+ Requires-Dist: seaborn>=0.12.2
21
+ Requires-Dist: xlsxwriter>=3.2.0
22
+
23
+ # EDA Kit
24
+ Welcome to EDA Kit, a collection of utility functions designed to streamline your exploratory data analysis (EDA) tasks. This repository offers tools for directory management, data preprocessing, reporting, visualization, and more, helping you efficiently handle various aspects of data manipulation and analysis.
25
+
26
+
27
+ ## Prerequisites
28
+
29
+ Before you install `eda_kit`, ensure your system meets the following requirements:
30
+
31
+ - `Python`: Version `3.7.4` or higher is required to run `eda_kit`.
32
+
33
+
34
+ Additionally, `eda_kit` depends on the following packages, which will be automatically installed when you install `eda_kit`:
35
+
36
+ - `numpy`: Version 1.21.6 or higher
37
+ - `pandas`: Version 1.3.5 or higher
38
+ - `matplotlib`: Version 3.5.3 or higher
39
+ - `seaborn`: Version 0.12.2 or higher
40
+ - `jinja2`: Version 3.1.4 or higher
41
+ - `xlsxwriter`: Version 3.2.0 or higher
42
+
43
+
44
+ ## Installation
45
+
46
+ You can install `eda_kit` directly from PyPI, where it is published under the name `eda_toolkit`:
47
+
48
+ ```bash
49
+ pip install eda_toolkit
50
+ ```
51
+
52
+ ## 🌐 Authors' Websites
53
+
54
+ https://www.leonshpaner.com
55
+ https://www.oscargildata.com
56
+
57
+
58
+ ## ⚖️ License
59
+
60
+ `eda_kit` is distributed under the MIT License. See [LICENSE](https://github.com/lshpaner/eda_kit/blob/readme/LICENSE.md) for more information.
61
+
62
+
63
+ ## Support
64
+
65
+ If you have any questions or issues with `eda_kit`, please open an issue on this GitHub repository.
66
+
67
+
68
+
69
+
@@ -0,0 +1,8 @@
1
+ LICENSE.md
2
+ README.md
3
+ setup.py
4
+ src/eda_toolkit.egg-info/PKG-INFO
5
+ src/eda_toolkit.egg-info/SOURCES.txt
6
+ src/eda_toolkit.egg-info/dependency_links.txt
7
+ src/eda_toolkit.egg-info/requires.txt
8
+ src/eda_toolkit.egg-info/top_level.txt
@@ -0,0 +1,6 @@
1
+ jinja2>=3.1.4
2
+ numpy>=1.21.6
3
+ pandas>=1.3.5
4
+ matplotlib>=3.5.3
5
+ seaborn>=0.12.2
6
+ xlsxwriter>=3.2.0