pandas-checks 0.1.0__tar.gz

@@ -0,0 +1,31 @@
1
+ BSD 3-Clause License
2
+
3
+ Copyright (c) 2008-2011, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
4
+ All rights reserved.
5
+
6
+ Copyright (c) 2011-2024, Open source contributors.
7
+
8
+ Redistribution and use in source and binary forms, with or without
9
+ modification, are permitted provided that the following conditions are met:
10
+
11
+ * Redistributions of source code must retain the above copyright notice, this
12
+ list of conditions and the following disclaimer.
13
+
14
+ * Redistributions in binary form must reproduce the above copyright notice,
15
+ this list of conditions and the following disclaimer in the documentation
16
+ and/or other materials provided with the distribution.
17
+
18
+ * Neither the name of the copyright holder nor the names of its
19
+ contributors may be used to endorse or promote products derived from
20
+ this software without specific prior written permission.
21
+
22
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
23
+ AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
24
+ IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
25
+ DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
26
+ FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
27
+ DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
28
+ SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
29
+ CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
30
+ OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
31
+ OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
@@ -0,0 +1,197 @@
1
+ Metadata-Version: 2.1
2
+ Name: pandas-checks
3
+ Version: 0.1.0
4
+ Summary: Non-invasive health checks for Pandas method chains
5
+ Author: Chad Parmet
6
+ Author-email: cparmet@gmail.com
7
+ Requires-Python: >=3.8.1,<4.0
8
+ Classifier: Programming Language :: Python :: 3
9
+ Classifier: Programming Language :: Python :: 3.9
10
+ Classifier: Programming Language :: Python :: 3.10
11
+ Classifier: Programming Language :: Python :: 3.11
12
+ Classifier: Programming Language :: Python :: 3.12
13
+ Requires-Dist: Jinja2 (>=3.1.4)
14
+ Requires-Dist: emoji (>=2.12.1)
15
+ Requires-Dist: ipython (>=7.23.1)
16
+ Requires-Dist: matplotlib (>=3.5.3)
17
+ Requires-Dist: pandas (>=1.4,<3.0)
18
+ Requires-Dist: termcolor (>=2.3.0)
19
+ Description-Content-Type: text/markdown
20
+
21
+ # 🐼🩺 Pandas Checks
22
+
23
+ ## Introduction
24
+ **Pandas Checks** is a Python library for data science and data engineering. It adds non-invasive health checks for Pandas method chains.
25
+
26
+ It can inspect and validate your data at various points in your Pandas pipelines, without modifying the underlying data.
27
+
28
+ So you don't need to chop up a functional method chain, or create intermediate variables, every time you need to diagnose, treat, or prevent problems with data processing.
29
+
30
+ As Fleetwood Mac says, [you would never break the chain](https://www.youtube.com/watch?v=xwTPvcPYaOo).
31
+
32
+
33
+ > [!TIP]
34
+ > See the [full documentation](https://cparmet.github.io/pandas-checks/) for all the details on the what, why, and how of Pandas Checks.
35
+
36
+
37
+ ## Installation
38
+
39
+ ```bash
40
+ pip install pandas-checks
41
+ ```
42
+
43
+ ## Usage
44
+ After installing Pandas Checks, import it:
45
+
46
+ ```python
47
+ import pandas as pd
48
+ import pandas_checks
49
+ ```
50
+
51
+ Now you can use `.check` on your Pandas DataFrames and Series. You don't need to access `pandas_checks` directly, just work with Pandas as you normally would. The new Pandas Checks methods are available when you work with Pandas in Jupyter, IPython, and terminal environments.
52
+
53
+ Here's a basic example of using Pandas Checks:
54
+
55
+ ```python
56
+ iris = pd.read_csv('iris.csv')
57
+
58
+ iris_new = (
59
+ iris
60
+     .check.assert_data(lambda df: (df['sepal_width'] > 0).all(), fail_message="Sepal width must be positive") # Validate your data
61
+ # ... Do your data processing in here ...
62
+ .check.hist(column='petal_length') # Plot a distribution
63
+ .check.head(3) # Display the first few rows
64
+ )
65
+ ```
66
+
67
+ The `.check` methods will display the following results:
68
+
69
+ <img src="static/sample_output.jpg" alt="Sample output" width="350" style="display: block; margin-left: auto; margin-right: auto; width: 50%;"/>
70
+
71
+
72
+ > [!NOTE]
73
+ > These methods did not modify `iris`. That's the difference between Pandas `.head()` and Pandas Checks `.check.head()`.
74
+
75
+
76
+ ## Methods available
77
+ Here's what's in the doctor's bag.
78
+
79
+ * **Describe**
80
+ - Standard Pandas methods:
81
+ - `.check.columns()`
82
+ - `.check.dtypes()` (`.check.dtype` for Series)
83
+ - `.check.describe()`
84
+ - `.check.head()`
85
+ - `.check.info()`
86
+ - `.check.memory_usage()`
87
+ - `.check.nunique()`
88
+ - `.check.shape()`
89
+ - `.check.tail()`
90
+ - `.check.unique()`
91
+ - `.check.value_counts()`
92
+ - New functions in Pandas Checks:
93
+ - `.check.function()`: Apply an arbitrary lambda function to your data and see the result
94
+ - `.check.ncols()`
95
+ - `.check.ndups()`
96
+ - `.check.nnulls()`
97
+ - `.check.print()`: Print a string, a variable, or the current dataframe
98
+
99
+ * **Export interim files**
100
+ - `.check.write()`: Export the current data, inferring file format from the name
101
+
102
+ * **Time your code**
103
+ - `.check.print_time_elapsed(start_time)`: Print the execution time since you called `start_time = pdc.start_timer()`
104
+     - Tip: You can also use the stopwatch outside a method chain:
105
+ ```python
106
+ from pandas_checks import print_elapsed_time, start_timer
107
+
108
+ start_time = start_timer()
109
+ ...
110
+ print_elapsed_time(start_time, units="seconds")
111
+ ```
112
+
113
+ * **Turn off Pandas Checks**
114
+     - `.check.disable_checks()`: Don't run checks in this method chain, e.g. for production mode
115
+     - `.check.enable_checks()`: Turn checks back on
116
+
117
+ * **Validate** Perform assertions on your data in the middle of a chain using `.check.assert_data()`.
118
+
119
+ * **Visualize**
120
+ - `.check.hist()`: Histogram
121
+ - `.check.plot()`: An arbitrary plot
122
+
123
+ ## Customizing results
124
+
125
+ You can use Pandas Checks methods like the regular Pandas methods. They accept the same arguments. For example, you can pass:
126
+ * `.check.head(7)`
127
+ * `.check.value_counts(column="species", dropna=False, normalize=True)`
128
+ * `.check.plot(kind="scatter", x="sepal_width", y="sepal_length")`.
129
+
130
+ Also, most Pandas Checks methods accept 3 additional arguments:
131
+ 1. `check_name`: text to display before the result of the check
132
+ 2. `fn`: a lambda function that modifies the data displayed by the check
133
+ 3. `subset`: limit a check to certain columns
134
+
135
+ ```python
136
+ iris_new = (
137
+ iris
138
+ .check.value_counts(column='species', check_name="Varieties after data cleaning")
139
+ .assign(species=lambda df: df["species"].str.upper()) # Do your regular Pandas data processing, like upper-casing the species column
140
+ .check.head(n=2, fn=lambda df: df["petal_width"]*2) # Modify the data that gets displayed in the check only
141
+ .check.describe(subset=['sepal_width', 'sepal_length']) # Only check certain columns
142
+ )
143
+ ```
144
+ <img src="static/power_user_output.jpg" alt="Power user output" width="350" style="display: block; margin-left: auto; margin-right: auto; width: 50%;">
145
+
146
+
147
+ ## Global configuration
148
+ You can customize Pandas Checks:
149
+
150
+ ```python
151
+ import pandas_checks as pdc
152
+
153
+ # Set output precision and turn off the cute emojis
154
+ pdc.set_format(precision=3, use_emojis=False)
155
+
156
+ # Don't run any of the calls to Pandas Checks, globally. Useful when switching your code to production mode
157
+ pdc.disable_checks()
158
+ ```
159
+
160
+
161
+ > [!TIP]
162
+ > Run `pdc.describe_options()` to see the arguments you can pass to `.set_format()`.
163
+
164
+
165
+ You can also adjust settings within a method chain. Note that this changes the global configuration, so if you want the settings to apply only within the method chain, reset them at the end.
166
+
167
+ ```python
168
+ # Customize format
169
+ iris_new = (
170
+ iris
171
+ .check.set_format(precision=7, use_emojis=False)
172
+ ... # Any .check methods in here will use the new format
173
+ .check.reset_format() # Restore default format
174
+ )
175
+
176
+ # Turn off Pandas Checks
177
+ iris_new = (
178
+ iris
179
+ .check.disable_checks()
180
+ ... # Any .check methods in here will not be run
181
+ .check.enable_checks() # Turn it back on for the next code
182
+ )
183
+ ```
184
+
185
+ ## Giving feedback and contributing
186
+
187
+ If you run into trouble or have questions, I'd love to know. Please open an issue.
188
+
189
+ Contributions are appreciated! Please see [more details](https://cparmet.github.io/pandas-checks/#giving-feedback-and-contributing).
190
+
191
+
192
+ ## License
193
+
194
+ Pandas Checks is licensed under the [BSD 3-Clause License](LICENSE).
195
+
196
+ 🐼🩺
197
+
@@ -0,0 +1,176 @@
1
+ # 🐼🩺 Pandas Checks
2
+
3
+ ## Introduction
4
+ **Pandas Checks** is a Python library for data science and data engineering. It adds non-invasive health checks for Pandas method chains.
5
+
6
+ It can inspect and validate your data at various points in your Pandas pipelines, without modifying the underlying data.
7
+
8
+ So you don't need to chop up a functional method chain, or create intermediate variables, every time you need to diagnose, treat, or prevent problems with data processing.
9
+
10
+ As Fleetwood Mac says, [you would never break the chain](https://www.youtube.com/watch?v=xwTPvcPYaOo).
11
+
12
+
13
+ > [!TIP]
14
+ > See the [full documentation](https://cparmet.github.io/pandas-checks/) for all the details on the what, why, and how of Pandas Checks.
15
+
16
+
17
+ ## Installation
18
+
19
+ ```bash
20
+ pip install pandas-checks
21
+ ```
22
+
23
+ ## Usage
24
+ After installing Pandas Checks, import it:
25
+
26
+ ```python
27
+ import pandas as pd
28
+ import pandas_checks
29
+ ```
30
+
31
+ Now you can use `.check` on your Pandas DataFrames and Series. You don't need to access `pandas_checks` directly, just work with Pandas as you normally would. The new Pandas Checks methods are available when you work with Pandas in Jupyter, IPython, and terminal environments.
32
+
33
+ Here's a basic example of using Pandas Checks:
34
+
35
+ ```python
36
+ iris = pd.read_csv('iris.csv')
37
+
38
+ iris_new = (
39
+ iris
40
+     .check.assert_data(lambda df: (df['sepal_width'] > 0).all(), fail_message="Sepal width must be positive") # Validate your data
41
+ # ... Do your data processing in here ...
42
+ .check.hist(column='petal_length') # Plot a distribution
43
+ .check.head(3) # Display the first few rows
44
+ )
45
+ ```
46
+
47
+ The `.check` methods will display the following results:
48
+
49
+ <img src="static/sample_output.jpg" alt="Sample output" width="350" style="display: block; margin-left: auto; margin-right: auto; width: 50%;"/>
50
+
51
+
52
+ > [!NOTE]
53
+ > These methods did not modify `iris`. That's the difference between Pandas `.head()` and Pandas Checks `.check.head()`.
54
+
55
+
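The pass-through behavior described in the note can be sketched with pandas' own accessor registration. This is an illustration only: the source does not say how Pandas Checks implements `.check`, and the accessor name `peek` and the class `PeekAccessor` below are hypothetical.

```python
import pandas as pd


# Hypothetical accessor name, chosen so it cannot clash with the real `.check`.
@pd.api.extensions.register_dataframe_accessor("peek")
class PeekAccessor:
    def __init__(self, pandas_obj: pd.DataFrame) -> None:
        self._obj = pandas_obj

    def head(self, n: int = 5) -> pd.DataFrame:
        # Display the first rows as a side effect...
        print(self._obj.head(n))
        # ...then hand back the original DataFrame so the chain continues.
        return self._obj


df = pd.DataFrame({"a": [1, 2, 3]})

result = (
    df
    .peek.head(2)                        # shows two rows, returns all three
    .assign(b=lambda d: d["a"] * 2)      # regular Pandas processing
)
```

By contrast, a plain `df.head(2)` in the same position would pass only two rows to `.assign()`.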
56
+ ## Methods available
57
+ Here's what's in the doctor's bag.
58
+
59
+ * **Describe**
60
+ - Standard Pandas methods:
61
+ - `.check.columns()`
62
+ - `.check.dtypes()` (`.check.dtype` for Series)
63
+ - `.check.describe()`
64
+ - `.check.head()`
65
+ - `.check.info()`
66
+ - `.check.memory_usage()`
67
+ - `.check.nunique()`
68
+ - `.check.shape()`
69
+ - `.check.tail()`
70
+ - `.check.unique()`
71
+ - `.check.value_counts()`
72
+ - New functions in Pandas Checks:
73
+ - `.check.function()`: Apply an arbitrary lambda function to your data and see the result
74
+ - `.check.ncols()`
75
+ - `.check.ndups()`
76
+ - `.check.nnulls()`
77
+ - `.check.print()`: Print a string, a variable, or the current dataframe
78
+
79
+ * **Export interim files**
80
+ - `.check.write()`: Export the current data, inferring file format from the name
81
+
82
+ * **Time your code**
83
+ - `.check.print_time_elapsed(start_time)`: Print the execution time since you called `start_time = pdc.start_timer()`
84
+     - Tip: You can also use the stopwatch outside a method chain:
85
+ ```python
86
+ from pandas_checks import print_elapsed_time, start_timer
87
+
88
+ start_time = start_timer()
89
+ ...
90
+ print_elapsed_time(start_time, units="seconds")
91
+ ```
92
+
93
+ * **Turn off Pandas Checks**
94
+     - `.check.disable_checks()`: Don't run checks in this method chain, e.g. for production mode
95
+     - `.check.enable_checks()`: Turn checks back on
96
+
97
+ * **Validate** Perform assertions on your data in the middle of a chain using `.check.assert_data()`.
98
+
99
+ * **Visualize**
100
+ - `.check.hist()`: Histogram
101
+ - `.check.plot()`: An arbitrary plot
102
+
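The "inferring file format from the name" behavior listed for `.check.write()` can be pictured as a lookup on the file extension. The sketch below is an assumption about the general idea, not the library's code: the helper `write_and_pass`, the `_WRITERS` table, and the set of supported extensions are all hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical extension-to-writer table; the formats Pandas Checks
# actually supports may differ.
_WRITERS = {
    ".csv": lambda df, path: df.to_csv(path, index=False),
    ".tsv": lambda df, path: df.to_csv(path, sep="\t", index=False),
    ".json": lambda df, path: df.to_json(path),
    ".pkl": lambda df, path: df.to_pickle(path),
}


def write_and_pass(df: pd.DataFrame, path: str) -> pd.DataFrame:
    """Write `df` in the format implied by the extension, then return it."""
    suffix = Path(path).suffix.lower()
    if suffix not in _WRITERS:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    _WRITERS[suffix](df, path)
    return df  # the chain continues with the same data


df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "interim.csv"
    result = (
        df
        .pipe(write_and_pass, str(out_path))   # export an interim file
        .assign(c=lambda d: d["a"] + 1)        # keep processing
    )
    round_trip = pd.read_csv(out_path)
```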
103
+ ## Customizing results
104
+
105
+ You can use Pandas Checks methods like the regular Pandas methods. They accept the same arguments. For example, you can pass:
106
+ * `.check.head(7)`
107
+ * `.check.value_counts(column="species", dropna=False, normalize=True)`
108
+ * `.check.plot(kind="scatter", x="sepal_width", y="sepal_length")`.
109
+
110
+ Also, most Pandas Checks methods accept 3 additional arguments:
111
+ 1. `check_name`: text to display before the result of the check
112
+ 2. `fn`: a lambda function that modifies the data displayed by the check
113
+ 3. `subset`: limit a check to certain columns
114
+
115
+ ```python
116
+ iris_new = (
117
+ iris
118
+ .check.value_counts(column='species', check_name="Varieties after data cleaning")
119
+ .assign(species=lambda df: df["species"].str.upper()) # Do your regular Pandas data processing, like upper-casing the species column
120
+ .check.head(n=2, fn=lambda df: df["petal_width"]*2) # Modify the data that gets displayed in the check only
121
+ .check.describe(subset=['sepal_width', 'sepal_length']) # Only check certain columns
122
+ )
123
+ ```
124
+ <img src="static/power_user_output.jpg" alt="Power user output" width="350" style="display: block; margin-left: auto; margin-right: auto; width: 50%;">
125
+
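To make the roles of `check_name`, `fn`, and `subset` concrete, here is a minimal sketch of a check that transforms only what is displayed. The function `show_head` is a hypothetical stand-in, and applying `subset` before `fn` is an assumption; the source does not specify the order.

```python
from typing import Callable, List, Optional

import pandas as pd


def show_head(
    df: pd.DataFrame,
    n: int = 5,
    check_name: Optional[str] = None,
    fn: Optional[Callable] = None,
    subset: Optional[List[str]] = None,
) -> pd.DataFrame:
    """Hypothetical stand-in for a check: display a view, return the input."""
    view = df if subset is None else df[subset]  # assumed order: subset first
    if fn is not None:
        view = fn(view)  # only the displayed view is transformed
    if check_name:
        print(check_name)
    print(view.head(n))
    return df  # the chain continues with the unmodified data


flowers = pd.DataFrame(
    {"petal_width": [0.2, 0.4, 1.3], "species": ["setosa", "setosa", "versicolor"]}
)

result = (
    flowers
    .pipe(show_head, n=2, check_name="Doubled petal width",
          fn=lambda d: d["petal_width"] * 2)
    .assign(species=lambda d: d["species"].str.upper())
)
```

The doubled values appear only in the printed output; `result` still holds the original `petal_width` values.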
126
+
127
+ ## Global configuration
128
+ You can customize Pandas Checks:
129
+
130
+ ```python
131
+ import pandas_checks as pdc
132
+
133
+ # Set output precision and turn off the cute emojis
134
+ pdc.set_format(precision=3, use_emojis=False)
135
+
136
+ # Don't run any of the calls to Pandas Checks, globally. Useful when switching your code to production mode
137
+ pdc.disable_checks()
138
+ ```
139
+
140
+
141
+ > [!TIP]
142
+ > Run `pdc.describe_options()` to see the arguments you can pass to `.set_format()`.
143
+
144
+
145
+ You can also adjust settings within a method chain. Note that this changes the global configuration, so if you want the settings to apply only within the method chain, reset them at the end.
146
+
147
+ ```python
148
+ # Customize format
149
+ iris_new = (
150
+ iris
151
+ .check.set_format(precision=7, use_emojis=False)
152
+ ... # Any .check methods in here will use the new format
153
+ .check.reset_format() # Restore default format
154
+ )
155
+
156
+ # Turn off Pandas Checks
157
+ iris_new = (
158
+ iris
159
+ .check.disable_checks()
160
+ ... # Any .check methods in here will not be run
161
+ .check.enable_checks() # Turn it back on for the next code
162
+ )
163
+ ```
164
+
165
+ ## Giving feedback and contributing
166
+
167
+ If you run into trouble or have questions, I'd love to know. Please open an issue.
168
+
169
+ Contributions are appreciated! Please see [more details](https://cparmet.github.io/pandas-checks/#giving-feedback-and-contributing).
170
+
171
+
172
+ ## License
173
+
174
+ Pandas Checks is licensed under the [BSD 3-Clause License](LICENSE).
175
+
176
+ 🐼🩺