pandas-checks 0.1.0__tar.gz
- pandas_checks-0.1.0/LICENSE +31 -0
- pandas_checks-0.1.0/PKG-INFO +197 -0
- pandas_checks-0.1.0/README.md +176 -0
- pandas_checks-0.1.0/pandas_checks/DataFrameChecks.py +862 -0
- pandas_checks-0.1.0/pandas_checks/SeriesChecks.py +659 -0
- pandas_checks-0.1.0/pandas_checks/__init__.py +36 -0
- pandas_checks-0.1.0/pandas_checks/display.py +380 -0
- pandas_checks-0.1.0/pandas_checks/options.py +363 -0
- pandas_checks-0.1.0/pandas_checks/run_checks.py +65 -0
- pandas_checks-0.1.0/pandas_checks/timer.py +83 -0
- pandas_checks-0.1.0/pandas_checks/utils.py +24 -0
- pandas_checks-0.1.0/pyproject.toml +37 -0
@@ -0,0 +1,31 @@
BSD 3-Clause License

Copyright (c) 2008-2011, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
All rights reserved.

Copyright (c) 2011-2024, Open source contributors.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

@@ -0,0 +1,197 @@
Metadata-Version: 2.1
Name: pandas-checks
Version: 0.1.0
Summary: Non-invasive health checks for Pandas method chains
Author: Chad Parmet
Author-email: cparmet@gmail.com
Requires-Python: >=3.8.1,<4.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: Jinja2 (>=3.1.4)
Requires-Dist: emoji (>=2.12.1)
Requires-Dist: ipython (>=7.23.1)
Requires-Dist: matplotlib (>=3.5.3)
Requires-Dist: pandas (>=1.4,<3.0)
Requires-Dist: termcolor (>=2.3.0)
Description-Content-Type: text/markdown

# 🐼🩺 Pandas Checks

## Introduction
**Pandas Checks** is a Python library for data science and data engineering. It adds non-invasive health checks for Pandas method chains.

It can inspect and validate your data at various points in your Pandas pipelines, without modifying the underlying data.

So you don't need to chop up a functional method chain, or create intermediate variables, every time you need to diagnose, treat, or prevent problems with data processing.

As Fleetwood Mac says, [you would never break the chain](https://www.youtube.com/watch?v=xwTPvcPYaOo).


> [!TIP]
> See the [full documentation](https://cparmet.github.io/pandas-checks/) for all the details on the what, why, and how of Pandas Checks.


## Installation

```bash
pip install pandas-checks
```

## Usage
After installing Pandas Checks, import it:

```python
import pandas as pd
import pandas_checks
```

Now you can use `.check` on your Pandas DataFrames and Series. You don't need to access `pandas_checks` directly; just work with Pandas as you normally would. The new Pandas Checks methods are available when you work with Pandas in Jupyter, IPython, and terminal environments.

Here's a basic example of using Pandas Checks:

```python
iris = pd.read_csv('iris.csv')

iris_new = (
    iris
    .check.assert_data(lambda df: (df['sepal_width'] > 0).all(), fail_message="Sepal width must be positive")  # Validate your data
    # ... Do your data processing in here ...
    .check.hist(column='petal_length')  # Plot a distribution
    .check.head(3)  # Display the first few rows
)
```

The `.check` methods will display the following results:

<img src="static/sample_output.jpg" alt="Sample output" width="350" style="display: block; margin-left: auto; margin-right: auto; width: 50%;"/>


> [!NOTE]
> These methods did not modify `iris`. That's the difference between Pandas `.head()` and Pandas Checks `.check.head()`: the former passes only the first rows down the chain, while the latter displays them and leaves the data untouched.


## Methods available
Here's what's in the doctor's bag.

* **Describe**
    - Standard Pandas methods:
        - `.check.columns()`
        - `.check.dtypes()` (`.check.dtype` for Series)
        - `.check.describe()`
        - `.check.head()`
        - `.check.info()`
        - `.check.memory_usage()`
        - `.check.nunique()`
        - `.check.shape()`
        - `.check.tail()`
        - `.check.unique()`
        - `.check.value_counts()`
    - New functions in Pandas Checks:
        - `.check.function()`: Apply an arbitrary lambda function to your data and see the result
        - `.check.ncols()`
        - `.check.ndups()`
        - `.check.nnulls()`
        - `.check.print()`: Print a string, a variable, or the current dataframe

* **Export interim files**
    - `.check.write()`: Export the current data, inferring file format from the name

* **Time your code**
    - `.check.print_time_elapsed(start_time)`: Print the execution time since you called `start_time = pdc.start_timer()`
    - Tip: You can also use the stopwatch outside a method chain:
      ```python
      from pandas_checks import print_elapsed_time, start_timer

      start_time = start_timer()
      ...
      print_elapsed_time(start_time, units="seconds")
      ```

* **Turn off Pandas Checks**
    - `.check.disable_checks()`: Don't run checks in this method chain, e.g. for production mode
    - `.check.enable_checks()`: Turn checks back on

* **Validate**
    - `.check.assert_data()`: Perform assertions on your data in the middle of a chain

* **Visualize**
    - `.check.hist()`: Histogram
    - `.check.plot()`: An arbitrary plot

## Customizing results

You can use Pandas Checks methods like the regular Pandas methods. They accept the same arguments. For example, you can pass:
* `.check.head(7)`
* `.check.value_counts(column="species", dropna=False, normalize=True)`
* `.check.plot(kind="scatter", x="sepal_width", y="sepal_length")`

Also, most Pandas Checks methods accept three additional arguments:
1. `check_name`: text to display before the result of the check
2. `fn`: a lambda function that modifies the data displayed by the check
3. `subset`: limit a check to certain columns

```python
iris_new = (
    iris
    .check.value_counts(column='species', check_name="Varieties after data cleaning")
    .assign(species=lambda df: df["species"].str.upper())  # Do your regular Pandas data processing, like upper-casing the species column
    .check.head(n=2, fn=lambda df: df["petal_width"]*2)  # Modify the data that gets displayed in the check only
    .check.describe(subset=['sepal_width', 'sepal_length'])  # Only check certain columns
)
```
<img src="static/power_user_output.jpg" alt="Power user output" width="350" style="display: block; margin-left: auto; margin-right: auto; width: 50%;">


## Global configuration
You can customize Pandas Checks:

```python
import pandas_checks as pdc

# Set output precision and turn off the cute emojis
pdc.set_format(precision=3, use_emojis=False)

# Don't run any of the calls to Pandas Checks, globally. Useful when switching your code to production mode
pdc.disable_checks()
```


> [!TIP]
> Run `pdc.describe_options()` to see the arguments you can pass to `.set_format()`.


You can also adjust settings within a method chain. Note that this changes the global configuration, so if you want the settings to apply only within the chain, reset them at the end.

```python
# Customize format
iris_new = (
    iris
    .check.set_format(precision=7, use_emojis=False)
    # ... Any .check methods in here will use the new format
    .check.reset_format()  # Restore default format
)

# Turn off Pandas Checks
iris_new = (
    iris
    .check.disable_checks()
    # ... Any .check methods in here will not be run
    .check.enable_checks()  # Turn checks back on for subsequent code
)
```

## Giving feedback and contributing

If you run into trouble or have questions, I'd love to know. Please open an issue.

Contributions are appreciated! Please see [more details](https://cparmet.github.io/pandas-checks/#giving-feedback-and-contributing).


## License

Pandas Checks is licensed under the [BSD-3 License](LICENSE).

🐼🩺

@@ -0,0 +1,176 @@
# 🐼🩺 Pandas Checks

## Introduction
**Pandas Checks** is a Python library for data science and data engineering. It adds non-invasive health checks for Pandas method chains.

It can inspect and validate your data at various points in your Pandas pipelines, without modifying the underlying data.

So you don't need to chop up a functional method chain, or create intermediate variables, every time you need to diagnose, treat, or prevent problems with data processing.

As Fleetwood Mac says, [you would never break the chain](https://www.youtube.com/watch?v=xwTPvcPYaOo).


> [!TIP]
> See the [full documentation](https://cparmet.github.io/pandas-checks/) for all the details on the what, why, and how of Pandas Checks.


## Installation

```bash
pip install pandas-checks
```

## Usage
After installing Pandas Checks, import it:

```python
import pandas as pd
import pandas_checks
```

Now you can use `.check` on your Pandas DataFrames and Series. You don't need to access `pandas_checks` directly; just work with Pandas as you normally would. The new Pandas Checks methods are available when you work with Pandas in Jupyter, IPython, and terminal environments.

Here's a basic example of using Pandas Checks:

```python
iris = pd.read_csv('iris.csv')

iris_new = (
    iris
    .check.assert_data(lambda df: (df['sepal_width'] > 0).all(), fail_message="Sepal width must be positive")  # Validate your data
    # ... Do your data processing in here ...
    .check.hist(column='petal_length')  # Plot a distribution
    .check.head(3)  # Display the first few rows
)
```

The `.check` methods will display the following results:

<img src="static/sample_output.jpg" alt="Sample output" width="350" style="display: block; margin-left: auto; margin-right: auto; width: 50%;"/>


> [!NOTE]
> These methods did not modify `iris`. That's the difference between Pandas `.head()` and Pandas Checks `.check.head()`: the former passes only the first rows down the chain, while the latter displays them and leaves the data untouched.


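To make the difference between `.head()` and a non-invasive check concrete, here is a minimal sketch in plain Pandas. It does not use Pandas Checks at all: `peek` is a hypothetical helper written only for this illustration, and the small `iris` frame holds made-up values. It relies only on `DataFrame.pipe`, which calls a function with the frame and returns whatever that function returns.

```python
import pandas as pd


def peek(df: pd.DataFrame, show) -> pd.DataFrame:
    """Hypothetical helper: print a view of df, then hand df back unchanged."""
    print(show(df))
    return df


# Made-up sample values, for illustration only.
iris = pd.DataFrame(
    {
        "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
        "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
    }
)

# Plain .head() is a transformation: everything downstream sees only 3 rows.
truncated = iris.head(3).assign(doubled=lambda df: df["sepal_width"] * 2)

# A pass-through check displays the same 3 rows but forwards the full frame.
intact = (
    iris
    .pipe(peek, lambda df: df.head(3))
    .assign(doubled=lambda df: df["sepal_width"] * 2)
)
```

In this sketch `truncated` ends up with three rows while `intact` keeps all five, which is the behaviour the note above describes for `.check.head()`.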
## Methods available
Here's what's in the doctor's bag.

* **Describe**
    - Standard Pandas methods:
        - `.check.columns()`
        - `.check.dtypes()` (`.check.dtype` for Series)
        - `.check.describe()`
        - `.check.head()`
        - `.check.info()`
        - `.check.memory_usage()`
        - `.check.nunique()`
        - `.check.shape()`
        - `.check.tail()`
        - `.check.unique()`
        - `.check.value_counts()`
    - New functions in Pandas Checks:
        - `.check.function()`: Apply an arbitrary lambda function to your data and see the result
        - `.check.ncols()`
        - `.check.ndups()`
        - `.check.nnulls()`
        - `.check.print()`: Print a string, a variable, or the current dataframe

* **Export interim files**
    - `.check.write()`: Export the current data, inferring file format from the name

* **Time your code**
    - `.check.print_time_elapsed(start_time)`: Print the execution time since you called `start_time = pdc.start_timer()`
    - Tip: You can also use the stopwatch outside a method chain:
      ```python
      from pandas_checks import print_elapsed_time, start_timer

      start_time = start_timer()
      ...
      print_elapsed_time(start_time, units="seconds")
      ```

* **Turn off Pandas Checks**
    - `.check.disable_checks()`: Don't run checks in this method chain, e.g. for production mode
    - `.check.enable_checks()`: Turn checks back on

* **Validate**
    - `.check.assert_data()`: Perform assertions on your data in the middle of a chain

* **Visualize**
    - `.check.hist()`: Histogram
    - `.check.plot()`: An arbitrary plot

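The list above says `.check.write()` infers the file format from the file name. The sketch below shows one way such inference could work, using only the standard library. It is an assumption for illustration, not the library's implementation: the `WRITERS` table, the `infer_writer` function, and the set of supported suffixes are all hypothetical. The values in the table are names of real Pandas `DataFrame` writer methods.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the name of a Pandas writer method.
WRITERS = {
    ".csv": "to_csv",
    ".parquet": "to_parquet",
    ".xlsx": "to_excel",
    ".pkl": "to_pickle",
}


def infer_writer(path: str) -> str:
    """Return the writer method name implied by the file suffix (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    try:
        return WRITERS[suffix]
    except KeyError:
        raise ValueError(f"Cannot infer a file format from {path!r}") from None


writer_name = infer_writer("interim/iris_clean.CSV")
```

A caller could then dispatch with `getattr(df, writer_name)(path)`. Raising on an unknown suffix, rather than silently falling back to a default format, keeps an interim export from landing in an unexpected format.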
## Customizing results

You can use Pandas Checks methods like the regular Pandas methods. They accept the same arguments. For example, you can pass:
* `.check.head(7)`
* `.check.value_counts(column="species", dropna=False, normalize=True)`
* `.check.plot(kind="scatter", x="sepal_width", y="sepal_length")`

Also, most Pandas Checks methods accept three additional arguments:
1. `check_name`: text to display before the result of the check
2. `fn`: a lambda function that modifies the data displayed by the check
3. `subset`: limit a check to certain columns

```python
iris_new = (
    iris
    .check.value_counts(column='species', check_name="Varieties after data cleaning")
    .assign(species=lambda df: df["species"].str.upper())  # Do your regular Pandas data processing, like upper-casing the species column
    .check.head(n=2, fn=lambda df: df["petal_width"]*2)  # Modify the data that gets displayed in the check only
    .check.describe(subset=['sepal_width', 'sepal_length'])  # Only check certain columns
)
```
<img src="static/power_user_output.jpg" alt="Power user output" width="350" style="display: block; margin-left: auto; margin-right: auto; width: 50%;">


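To show how `check_name`, `fn`, and `subset` can shape what is displayed without touching the data, here is a rough stand-in written in plain Pandas. `check_head` is a hypothetical function, not part of Pandas Checks, and the order in which it applies `subset` and `fn` is an assumption. The sample values are made up.

```python
import pandas as pd


def check_head(df, n=5, check_name=None, fn=None, subset=None):
    """Hypothetical stand-in for a check: display a derived view, return df untouched."""
    view = df if subset is None else df[subset]  # assumption: subset is applied first
    if fn is not None:
        view = fn(view)  # fn only changes what gets displayed
    if check_name is not None:
        print(check_name)
    print(view.head(n))
    return df  # the caller always gets the original object back


# Made-up sample values, for illustration only.
iris = pd.DataFrame(
    {"petal_width": [0.2, 0.4, 0.3], "species": ["setosa", "setosa", "setosa"]}
)

result = (
    iris
    .pipe(check_head, n=2, check_name="Doubled petal width",
          fn=lambda df: df["petal_width"] * 2)
    .assign(species=lambda df: df["species"].str.upper())
    .pipe(check_head, n=2, subset=["species"])
)
```

The doubled values appear only in the printed output. `result` still carries the original `petal_width` column, and only the explicit `.assign()` step changes the data.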
## Global configuration
You can customize Pandas Checks:

```python
import pandas_checks as pdc

# Set output precision and turn off the cute emojis
pdc.set_format(precision=3, use_emojis=False)

# Don't run any of the calls to Pandas Checks, globally. Useful when switching your code to production mode
pdc.disable_checks()
```


> [!TIP]
> Run `pdc.describe_options()` to see the arguments you can pass to `.set_format()`.


You can also adjust settings within a method chain. Note that this changes the global configuration, so if you want the settings to apply only within the chain, reset them at the end.

```python
# Customize format
iris_new = (
    iris
    .check.set_format(precision=7, use_emojis=False)
    # ... Any .check methods in here will use the new format
    .check.reset_format()  # Restore default format
)

# Turn off Pandas Checks
iris_new = (
    iris
    .check.disable_checks()
    # ... Any .check methods in here will not be run
    .check.enable_checks()  # Turn checks back on for subsequent code
)
```

## Giving feedback and contributing

If you run into trouble or have questions, I'd love to know. Please open an issue.

Contributions are appreciated! Please see [more details](https://cparmet.github.io/pandas-checks/#giving-feedback-and-contributing).


## License

Pandas Checks is licensed under the [BSD-3 License](LICENSE).

🐼🩺