pointblank 0.9.0__py3-none-any.whl → 0.9.2__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- pointblank/_constants.py +29 -0
- pointblank/_constants_translations.py +216 -0
- pointblank/_interrogation.py +218 -0
- pointblank/_utils.py +2 -0
- pointblank/actions.py +2 -2
- pointblank/data/api-docs.txt +611 -7
- pointblank/thresholds.py +3 -2
- pointblank/validate.py +794 -18
- {pointblank-0.9.0.dist-info → pointblank-0.9.2.dist-info}/METADATA +1 -1
- {pointblank-0.9.0.dist-info → pointblank-0.9.2.dist-info}/RECORD +13 -13
- {pointblank-0.9.0.dist-info → pointblank-0.9.2.dist-info}/WHEEL +1 -1
- {pointblank-0.9.0.dist-info → pointblank-0.9.2.dist-info}/licenses/LICENSE +0 -0
- {pointblank-0.9.0.dist-info → pointblank-0.9.2.dist-info}/top_level.txt +0 -0
pointblank/data/api-docs.txt
CHANGED
@@ -708,8 +708,9 @@ FinalActions(*args)
 In this example, the `send_alert()` function is defined to check the validation summary for
 critical failures. If any are found, an alert message is printed to the console. The function is
 passed to the `FinalActions` class, which ensures it will be executed after all validation steps
-are complete. Note that we used the
-
+are complete. Note that we used the
+[`get_validation_summary()`](`pointblank.get_validation_summary`) function to retrieve the
+summary of the validation results to help craft the alert message.
 
 Multiple final actions can be provided in a sequence. They will be executed in the order they
 are specified after all validation steps have completed:
@@ -4367,6 +4368,192 @@ rows_distinct(self, columns_subset: 'str | list[str] | None' = None, pre: 'Calla
 others.
 
 
+rows_complete(self, columns_subset: 'str | list[str] | None' = None, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
+
+Validate whether row data are complete by having no missing values.
+
+The `rows_complete()` method checks whether rows in the table are complete. Completeness
+of a row means that there are no missing values within the row. This validation will operate
+over a number of test units equal to the number of rows in the table (determined
+after any `pre=` mutation has been applied). A subset of columns can be specified for the
+completeness check. If no subset is provided, all columns in the table will be used.
+
+Parameters
+----------
+columns_subset
+    A single column or a list of columns to use as a subset for the completeness check. If
+    `None` (the default), then all columns in the table will be used.
+pre
+    An optional preprocessing function or lambda to apply to the data table during
+    interrogation. This function should take a table as input and return a modified table.
+    Have a look at the *Preprocessing* section for more information on how to use this
+    argument.
+segments
+    An optional directive on segmentation, which serves to split a validation step into
+    multiple (one step per segment). Can be a single column name, a tuple that specifies a
+    column name and its corresponding values to segment on, or a combination of both
+    (provided as a list). Read the *Segmentation* section for usage information.
+thresholds
+    Set threshold failure levels for reporting and reacting to exceedances of the levels.
+    The thresholds are set at the step level and will override any global thresholds set in
+    `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will
+    be set locally and global thresholds (if any) will take effect. Look at the *Thresholds*
+    section for information on how to set threshold levels.
+actions
+    Optional actions to take when the validation step meets or exceeds any set threshold
+    levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to
+    define the actions.
+brief
+    An optional brief description of the validation step that will be displayed in the
+    reporting table. You can use templating elements like `"{step}"` to insert
+    the step number, or `"{auto}"` to include an automatically generated brief. If `True`,
+    the entire brief will be automatically generated. If `None` (the default), there
+    won't be a brief.
+active
+    A boolean value indicating whether the validation step should be active. Using `False`
+    will make the validation step inactive (still reporting its presence and keeping indexes
+    for the steps unchanged).
+
+Returns
+-------
+Validate
+    The `Validate` object with the added validation step.
+
+Preprocessing
+-------------
+The `pre=` argument allows for a preprocessing function or lambda to be applied to the data
+table during interrogation. This function should take a table as input and return a modified
+table. This is useful for performing any necessary transformations or filtering on the data
+before the validation step is applied.
+
+The preprocessing function can be any callable that takes a table as input and returns a
+modified table. For example, you could use a lambda function to filter the table based on
+certain criteria or to apply a transformation to the data. Note that you can refer to
+columns via `columns_subset=` that are expected to be present in the transformed table, but
+may not exist in the table before preprocessing. Regarding the lifetime of the transformed
+table, it only exists during the validation step and is not stored in the `Validate` object
+or used in subsequent validation steps.
+
+Segmentation
+------------
+The `segments=` argument allows for the segmentation of a validation step into multiple
+segments. This is useful for applying the same validation step to different subsets of the
+data. The segmentation can be done based on a single column or specific fields within a
+column.
+
+Providing a single column name will result in a separate validation step for each unique
+value in that column. For example, if you have a column called `"region"` with values
+`"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each
+region.
+
+Alternatively, you can provide a tuple that specifies a column name and its corresponding
+values to segment on. For example, if you have a column called `"date"` and you want to
+segment on only specific dates, you can provide a tuple like
+`("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded
+(i.e., no validation steps will be created for them).
+
+A list with a combination of column names and tuples can be provided as well. This allows
+for more complex segmentation scenarios. The following inputs are all valid:
+
+- `segments=["region", ("date", ["2023-01-01", "2023-01-02"])]`: segments on unique values
+  in the `"region"` column and specific dates in the `"date"` column
+- `segments=["region", "date"]`: segments on unique values in the `"region"` and `"date"`
+  columns
+
+The segmentation is performed during interrogation, and the resulting validation steps will
+be numbered sequentially. Each segment will have its own validation step, and the results
+will be reported separately. This allows for a more granular analysis of the data and helps
+identify issues within specific segments.
+
+Importantly, the segmentation process will be performed after any preprocessing of the data
+table. Because of this, one can conceivably use the `pre=` argument to generate a column
+that can be used for segmentation. For example, you could create a new column called
+`"segment"` through use of `pre=` and then use that column for segmentation.
+
+Thresholds
+----------
+The `thresholds=` parameter is used to set the failure-condition levels for the validation
+step. If they are set here at the step level, these thresholds will override any thresholds
+set at the global level in `Validate(thresholds=...)`.
+
+There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values
+can either be set as a proportion of all test units that fail (a value between `0` and `1`)
+or as the absolute number of failing test units (an integer that's `1` or greater).
+
+Thresholds can be defined using one of these input schemes:
+
+1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create
+   thresholds)
+2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is
+   the 'error' level, and position `2` is the 'critical' level
+3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and
+   'critical'
+4. provide a single integer/float value denoting the absolute number or fraction of failing
+   test units for the 'warning' level only
+
+If the number of failing test units exceeds set thresholds, the validation step will be
+marked as 'warning', 'error', or 'critical'. Not all of the threshold levels need to be
+set; you're free to set any combination of them.
+
+Aside from reporting failure conditions, thresholds can be used to determine the actions to
+take for each level of failure (using the `actions=` parameter).
+
+Examples
+--------
+For the examples here, we'll use a simple Polars DataFrame with three string columns
+(`col_1`, `col_2`, and `col_3`). The table is shown below:
+
+```python
+import pointblank as pb
+import polars as pl
+
+tbl = pl.DataFrame(
+    {
+        "col_1": ["a", None, "c", "d"],
+        "col_2": ["a", "a", "c", None],
+        "col_3": ["a", "a", "d", None],
+    }
+)
+
+pb.preview(tbl)
+```
+
+Let's validate that the rows in the table are complete with `rows_complete()`. We'll
+determine if this validation had any failing test units (there are four test units, one for
+each row). A failing test unit means that a given row is not complete (i.e., has at least
+one missing value).
+
+```python
+validation = (
+    pb.Validate(data=tbl)
+    .rows_complete()
+    .interrogate()
+)
+
+validation
+```
+
+From this validation table we see that there are two failing test units. This is because
+two rows in the table have at least one missing value (the second row and the last row).
+
+We can also use a subset of columns to determine completeness. Let's specify the subset
+using columns `col_2` and `col_3` for the next validation.
+
+```python
+validation = (
+    pb.Validate(data=tbl)
+    .rows_complete(columns_subset=["col_2", "col_3"])
+    .interrogate()
+)
+
+validation
+```
+
+The validation table reports a single failing test unit. The last row contains missing
+values in both the `col_2` and `col_3` columns.
+others.
+
+
 col_schema_match(self, schema: 'Schema', complete: 'bool' = True, in_order: 'bool' = True, case_sensitive_colnames: 'bool' = True, case_sensitive_dtypes: 'bool' = True, full_match_dtypes: 'bool' = True, pre: 'Callable | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
 
 Do columns in the table (and their types) match a predefined schema?
@@ -4991,6 +5178,306 @@ conjointly(self, *exprs: 'Callable', pre: 'Callable | None' = None, thresholds:
 information on how to use it with different table backends.
 
 
+specially(self, expr: 'Callable', pre: 'Callable | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
+
+Perform a specialized validation with customized logic.
+
+The `specially()` validation method allows for the creation of specialized validation
+expressions that can be used to validate specific conditions or logic in the data. This
+method provides maximum flexibility by accepting a custom callable that encapsulates
+your validation logic.
+
+The callable function can have one of two signatures:
+
+- a function accepting a single parameter (the data table): `def validate(data): ...`
+- a function with no parameters: `def validate(): ...`
+
+The second form is particularly useful for environment validations that don't need to
+inspect the data table.
+
+The callable function must ultimately return one of:
+
+1. a single boolean value or a list of booleans
+2. a table where the final column contains boolean values (the column name is unimportant)
+
+The validation will operate over a number of test units equal to the number of
+rows in the data table (if returning a table with boolean values). If returning a scalar
+boolean value, the validation will operate over a single test unit. For a return of a list
+of boolean values, the length of the list constitutes the number of test units.
+
+Parameters
+----------
+expr
+    A callable function that defines the specialized validation logic. This function should:
+    (1) accept the target data table as its single argument (though it may ignore it), or
+    (2) take no parameters at all (for environment validations). The function must
+    ultimately return boolean values representing validation results. Design your function
+    to incorporate any custom parameters directly within the function itself using closure
+    variables or default parameters.
+pre
+    An optional preprocessing function or lambda to apply to the data table during
+    interrogation. This function should take a table as input and return a modified table.
+    Have a look at the *Preprocessing* section for more information on how to use this
+    argument.
+thresholds
+    Set threshold failure levels for reporting and reacting to exceedances of the levels.
+    The thresholds are set at the step level and will override any global thresholds set in
+    `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will
+    be set locally and global thresholds (if any) will take effect. Look at the *Thresholds*
+    section for information on how to set threshold levels.
+actions
+    Optional actions to take when the validation step meets or exceeds any set threshold
+    levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to
+    define the actions.
+brief
+    An optional brief description of the validation step that will be displayed in the
+    reporting table. You can use templating elements like `"{step}"` to insert
+    the step number, or `"{auto}"` to include an automatically generated brief. If `True`,
+    the entire brief will be automatically generated. If `None` (the default), there
+    won't be a brief.
+active
+    A boolean value indicating whether the validation step should be active. Using `False`
+    will make the validation step inactive (still reporting its presence and keeping indexes
+    for the steps unchanged).
+
+Returns
+-------
+Validate
+    The `Validate` object with the added validation step.
+
+Preprocessing
+-------------
+The `pre=` argument allows for a preprocessing function or lambda to be applied to the data
+table during interrogation. This function should take a table as input and return a modified
+table. This is useful for performing any necessary transformations or filtering on the data
+before the validation step is applied.
+
+The preprocessing function can be any callable that takes a table as input and returns a
+modified table. For example, you could use a lambda function to filter the table based on
+certain criteria or to apply a transformation to the data. Regarding the lifetime of the
+transformed table, it only exists during the validation step and is not stored in the
+`Validate` object or used in subsequent validation steps.
+
+Thresholds
+----------
+The `thresholds=` parameter is used to set the failure-condition levels for the validation
+step. If they are set here at the step level, these thresholds will override any thresholds
+set at the global level in `Validate(thresholds=...)`.
+
+There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values
+can either be set as a proportion of all test units that fail (a value between `0` and `1`)
+or as the absolute number of failing test units (an integer that's `1` or greater).
+
+Thresholds can be defined using one of these input schemes:
+
+1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create
+   thresholds)
+2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is
+   the 'error' level, and position `2` is the 'critical' level
+3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and
+   'critical'
+4. provide a single integer/float value denoting the absolute number or fraction of failing
+   test units for the 'warning' level only
+
+If the number of failing test units exceeds set thresholds, the validation step will be
+marked as 'warning', 'error', or 'critical'. Not all of the threshold levels need to be
+set; you're free to set any combination of them.
+
+Aside from reporting failure conditions, thresholds can be used to determine the actions to
+take for each level of failure (using the `actions=` parameter).
+
+Examples
+--------
+The `specially()` method offers maximum flexibility for validation, allowing you to create
+custom validation logic that fits your specific needs. The following examples demonstrate
+different patterns and use cases for this powerful validation approach.
+
+### Simple validation with direct table access
+
+This example shows the most straightforward use case, where we create a function that
+directly checks if the sum of two columns is positive.
+
+```python
+import pointblank as pb
+import polars as pl
+
+simple_tbl = pl.DataFrame({
+    "a": [5, 7, 1, 3, 9, 4],
+    "b": [6, 3, 0, 5, 8, 2]
+})
+
+# Simple function that validates directly on the table
+def validate_sum_positive(data):
+    return data.select(pl.col("a") + pl.col("b") > 0)
+
+(
+    pb.Validate(data=simple_tbl)
+    .specially(expr=validate_sum_positive)
+    .interrogate()
+)
+```
+
+The function returns a Polars DataFrame with a single boolean column indicating whether
+the sum of columns `a` and `b` is positive for each row. Each row in the resulting DataFrame
+is a distinct test unit. This pattern works well for simple validations where you don't need
+configurable parameters.
+
+### Advanced validation with closure variables for parameters
+
+When you need to make your validation configurable, you can use the function factory pattern
+(also known as closures) to create parameterized validations:
+
+```python
+# Create a parameterized validation function using closures
+def make_column_ratio_validator(col1, col2, min_ratio):
+    def validate_column_ratio(data):
+        return data.select((pl.col(col1) / pl.col(col2)) > min_ratio)
+    return validate_column_ratio
+
+(
+    pb.Validate(data=simple_tbl)
+    .specially(
+        expr=make_column_ratio_validator(col1="a", col2="b", min_ratio=0.5)
+    )
+    .interrogate()
+)
+```
+
+This approach allows you to create reusable validation functions that can be configured with
+different parameters without modifying the function itself.
+
+### Validation function returning a list of booleans
+
+This example demonstrates how to create a validation function that returns a list of boolean
+values, where each element represents a separate test unit:
+
+```python
+import pointblank as pb
+import polars as pl
+
+# Create sample data
+transaction_tbl = pl.DataFrame({
+    "transaction_id": [f"TX{i:04d}" for i in range(1, 11)],
+    "amount": [120.50, 85.25, 50.00, 240.75, 35.20, 150.00, 85.25, 65.00, 210.75, 90.50],
+    "category": ["food", "shopping", "entertainment", "travel", "utilities",
+                 "food", "shopping", "entertainment", "travel", "utilities"]
+})
+
+# Define a validation function that returns a list of booleans
+def validate_transaction_rules(data):
+    # Create a list to store individual test results
+    test_results = []
+
+    # Check each row individually against multiple business rules
+    for row in data.iter_rows(named=True):
+        # Rule: transaction IDs must start with "TX" and be 6 chars long
+        valid_id = row["transaction_id"].startswith("TX") and len(row["transaction_id"]) == 6
+
+        # Rule: amounts must be appropriate for their category
+        valid_amount = True
+        if row["category"] == "food" and (row["amount"] < 10 or row["amount"] > 200):
+            valid_amount = False
+        elif row["category"] == "utilities" and (row["amount"] < 20 or row["amount"] > 300):
+            valid_amount = False
+        elif row["category"] == "entertainment" and row["amount"] > 100:
+            valid_amount = False
+
+        # A transaction passes if it satisfies both rules
+        test_results.append(valid_id and valid_amount)
+
+    return test_results
+
+(
+    pb.Validate(data=transaction_tbl)
+    .specially(
+        expr=validate_transaction_rules,
+        brief="Validate transaction IDs and amounts by category."
+    )
+    .interrogate()
+)
+```
+
+This example shows how to create a validation function that applies multiple business rules
+to each row and returns a list of boolean results. Each boolean in the list represents a
+separate test unit, and a test unit passes only if all rules are satisfied for a given row.
+
+The function iterates through each row in the data table, checking:
+
+1. if transaction IDs follow the required format
+2. if transaction amounts are appropriate for their respective categories
+
+This approach is powerful when you need to apply complex, conditional logic that can't be
+easily expressed using the built-in validation functions.
+
+### Table-level validation returning a single boolean
+
+Sometimes you need to validate properties of the entire table rather than row by row. In
+these cases, your function can return a single boolean value:
+
+```python
+def validate_table_properties(data):
+    # Check if table has at least one row with column 'a' > 10
+    has_large_values = data.filter(pl.col("a") > 10).height > 0
+
+    # Check if mean of column 'b' is positive
+    has_positive_mean = data.select(pl.mean("b")).item() > 0
+
+    # Return a single boolean for the entire table
+    return has_large_values and has_positive_mean
+
+(
+    pb.Validate(data=simple_tbl)
+    .specially(expr=validate_table_properties)
+    .interrogate()
+)
+```
+
+This example demonstrates how to perform multiple checks on the table as a whole and combine
+them into a single validation result.
+
+### Environment validation that doesn't use the data table
+
+The `specially()` validation method can even be used to validate aspects of your environment
+that are completely independent of the data:
+
+```python
+def validate_pointblank_version():
+    try:
+        import importlib.metadata
+        version = importlib.metadata.version("pointblank")
+        version_parts = version.split(".")
+
+        # Get major and minor components regardless of how many parts there are
+        major = int(version_parts[0])
+        minor = int(version_parts[1])
+
+        # Check both major and minor components for version `0.9+`
+        return (major > 0) or (major == 0 and minor >= 9)
+
+    except Exception as e:
+        # More specific error handling could be added here
+        print(f"Version check failed: {e}")
+        return False
+
+(
+    pb.Validate(data=simple_tbl)
+    .specially(
+        expr=validate_pointblank_version,
+        brief="Check Pointblank version `>=0.9.0`."
+    )
+    .interrogate()
+)
+```
+
+This pattern shows how to validate external dependencies or environment conditions as part
+of your validation workflow. Notice that the function doesn't take any parameters at all,
+which makes it cleaner when the validation doesn't need to access the data table.
+
+By combining these patterns, you can create sophisticated validation workflows that address
+virtually any data quality requirement in your organization.
+
+
 
 ## The Column Selection family
 
@@ -6614,6 +7101,7 @@ get_step_report(self, i: 'int', columns_subset: 'str | list[str] | Column | None
 - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`)
 - [`col_vals_null()`](`pointblank.Validate.col_vals_null`)
 - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`)
+- [`rows_complete()`](`pointblank.Validate.rows_complete`)
 - [`conjointly()`](`pointblank.Validate.conjointly`)
 
 The [`rows_distinct()`](`pointblank.Validate.rows_distinct`) validation step will produce a
@@ -6698,17 +7186,133 @@ get_json_report(self, use_fields: 'list[str] | None' = None, exclude_fields: 'li
 
 Get a report of the validation results as a JSON-formatted string.
 
+The `get_json_report()` method provides a machine-readable report of validation results in
+JSON format. This is particularly useful for programmatic processing, storing validation
+results, or integrating with other systems. The report includes detailed information about
+each validation step, such as assertion type, columns validated, threshold values, test
+results, and more.
+
+By default, all available validation information fields are included in the report. However,
+you can customize the fields to include or exclude using the `use_fields=` and
+`exclude_fields=` parameters.
+
Parameters
|
|
6702
7200
|
----------
|
|
6703
7201
|
use_fields
|
|
6704
|
-
|
|
7202
|
+
An optional list of specific fields to include in the report. If provided, only these
|
|
7203
|
+
fields will be included in the JSON output. If `None` (the default), all standard
|
|
7204
|
+
validation report fields are included. Have a look at the *Available Report Fields*
|
|
7205
|
+
section below for a list of fields that can be included in the report.
|
|
6705
7206
|
exclude_fields
|
|
6706
|
-
|
|
7207
|
+
An optional list of fields to exclude from the report. If provided, these fields will
|
|
7208
|
+
be omitted from the JSON output. If `None` (the default), no fields are excluded.
|
|
7209
|
+
This parameter cannot be used together with `use_fields=`. The *Available Report Fields*
|
|
7210
|
+
provides a listing of fields that can be excluded from the report.
|
|
6707
7211
|
|
|
 Returns
 -------
 str
-A JSON-formatted string representing the validation report
+A JSON-formatted string representing the validation report, with each validation step
+as an object in the report array.
+
+Available Report Fields
+-----------------------
+The JSON report can include any of the standard validation report fields, including:
+
+- `i`: the step number (1-indexed)
+- `i_o`: the original step index from the validation plan (pre-expansion)
+- `assertion_type`: the type of validation assertion (e.g., `"col_vals_gt"`, etc.)
+- `column`: the column being validated (or columns used in certain validations)
+- `values`: the comparison values or parameters used in the validation
+- `inclusive`: whether the comparison is inclusive (for range-based validations)
+- `na_pass`: whether `NA`/`Null` values are considered passing (for certain validations)
+- `pre`: preprocessing function applied before validation
+- `segments`: data segments to which the validation was applied
+- `thresholds`: threshold level statement that was used for the validation step
+- `label`: custom label for the validation step
+- `brief`: a brief description of the validation step
+- `active`: whether the validation step is active
+- `all_passed`: whether all test units passed in the step
+- `n`: total number of test units
+- `n_passed`, `n_failed`: number of test units that passed and failed
+- `f_passed`, `f_failed`: fraction of test units that passed and failed
+- `warning`, `error`, `critical`: whether the namesake threshold level was exceeded (is
+  `null` if threshold not set)
+- `time_processed`: when the validation step was processed (ISO 8601 format)
+- `proc_duration_s`: the processing duration in seconds
+
+Examples
+--------
+Let's create a validation plan with a few validation steps and generate a JSON report of the
+results:
+
+```python
+import pointblank as pb
+import polars as pl
+
+# Create a sample DataFrame
+tbl = pl.DataFrame({
+    "a": [5, 7, 8, 9],
+    "b": [3, 4, 2, 1]
+})
+
+# Create and execute a validation plan
+validation = (
+    pb.Validate(data=tbl)
+    .col_vals_gt(columns="a", value=6)
+    .col_vals_lt(columns="b", value=4)
+    .interrogate()
+)
+
+# Get the full JSON report
+json_report = validation.get_json_report()
+
+print(json_report)
+```
+
+You can also customize which fields to include:
+
+```python
+json_report = validation.get_json_report(
+    use_fields=["i", "assertion_type", "column", "n_passed", "n_failed"]
+)
+
+print(json_report)
+```
+
+Or which fields to exclude:
+
+```python
+json_report = validation.get_json_report(
+    exclude_fields=[
+        "i_o", "thresholds", "pre", "segments", "values",
+        "na_pass", "inclusive", "label", "brief", "active",
+        "time_processed", "proc_duration_s"
+    ]
+)
+
+print(json_report)
+```
+
+The JSON output can be further processed or analyzed programmatically:
+
+```python
+import json
+
+# Parse the JSON report
+report_data = json.loads(validation.get_json_report())
+
+# Extract and analyze validation results
+failing_steps = [step for step in report_data if step["n_failed"] > 0]
+print(f"Number of failing validation steps: {len(failing_steps)}")
+```
+
+See Also
+--------
+- [`get_tabular_report()`](`pointblank.Validate.get_tabular_report`): Get a formatted HTML
+  report as a GT table
+- [`get_data_extracts()`](`pointblank.Validate.get_data_extracts`): Get rows that
+  failed validation
 
 
 get_sundered_data(self, type='pass') -> 'FrameT'
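Editor's note: the `warning`/`error`/`critical` fields documented in the hunk above are tri-state (`true`, `false`, or `null` when no threshold was set), which is easy to mishandle when filtering a parsed report. The sketch below consumes a hand-written sample payload shaped like the documented fields; it is not real `get_json_report()` output, and needs only the standard library:

```python
import json

# Hand-written sample payload mimicking the documented report fields.
# This is an illustrative stand-in, not actual pointblank output.
sample_json = json.dumps([
    {"i": 1, "assertion_type": "col_vals_gt", "column": "a",
     "n": 4, "n_passed": 3, "n_failed": 1,
     "warning": None, "error": None, "critical": None},
    {"i": 2, "assertion_type": "col_vals_lt", "column": "b",
     "n": 4, "n_passed": 4, "n_failed": 0,
     "warning": False, "error": False, "critical": False},
])

report = json.loads(sample_json)

# Steps with any failing test units
failing = [s["i"] for s in report if s["n_failed"] > 0]

# `critical` is null (None) when no threshold was set, so compare
# against True explicitly rather than relying on truthiness.
exceeded = [s["i"] for s in report if s.get("critical") is True]

print(failing)   # → [1]
print(exceeded)  # → []
```

Step 1 fails one test unit but its `critical` flag is `null` (no threshold set), so it appears in `failing` yet not in `exceeded`.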
@@ -8857,7 +9461,7 @@ send_slack_notification(webhook_url: 'str | None' = None, step_msg: 'str | None'
 validation
 ```
 
-By placing the `notify_slack` function in the `Validate(actions=Actions(critical=))` argument,
+By placing the `notify_slack()` function in the `Validate(actions=Actions(critical=))` argument,
 you can ensure that the notification is sent whenever the 'critical' threshold is reached (as
 set here, when 15% or more of the test units fail). The notification will include information
 about the validation step that triggered the alert.
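Editor's note: the callback pattern described above can be sketched without pointblank itself. The `metadata` dict and its `"step"`/`"f_failed"` keys below are illustrative assumptions, not pointblank's actual action payload:

```python
# Stand-in for a Slack-notifying action triggered at the 'critical' level.
# The metadata shape here is an assumption for illustration only.
def notify_slack(metadata: dict) -> str:
    msg = (
        f"Critical failure at step {metadata['step']}: "
        f"{metadata['f_failed']:.0%} of test units failed"
    )
    # A real implementation would POST `msg` to a Slack webhook URL here.
    return msg

alert = notify_slack({"step": 2, "f_failed": 0.25})
print(alert)  # → Critical failure at step 2: 25% of test units failed
```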
@@ -8887,7 +9491,7 @@ send_slack_notification(webhook_url: 'str | None' = None, step_msg: 'str | None'
 )
 ```
 
-In this case, the same `notify_slack` function is used, but it is placed in
+In this case, the same `notify_slack()` function is used, but it is placed in
 `Validate(final_actions=FinalActions())`. This results in the summary notification being sent
 after all validation steps are completed, regardless of whether any steps failed or not.
 