pointblank 0.14.0__py3-none-any.whl → 0.16.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -289,6 +289,16 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
289
289
  - Vietnamese (`"vi"`)
290
290
  - Indonesian (`"id"`)
291
291
  - Ukrainian (`"uk"`)
292
+ - Bulgarian (`"bg"`)
293
+ - Croatian (`"hr"`)
294
+ - Estonian (`"et"`)
295
+ - Hungarian (`"hu"`)
296
+ - Irish (`"ga"`)
297
+ - Latvian (`"lv"`)
298
+ - Lithuanian (`"lt"`)
299
+ - Maltese (`"mt"`)
300
+ - Slovak (`"sk"`)
301
+ - Slovenian (`"sl"`)
292
302
  - Hebrew (`"he"`)
293
303
  - Thai (`"th"`)
294
304
  - Persian (`"fa"`)
@@ -1157,7 +1167,7 @@ Definition of a schema object.
1157
1167
  `Schema` object is used in a validation workflow.
1158
1168
 
1159
1169
 
1160
- DraftValidation(data: 'FrameT | Any', model: 'str', api_key: 'str | None' = None) -> None
1170
+ DraftValidation(data: 'FrameT | Any', model: 'str', api_key: 'str | None' = None, verify_ssl: 'bool' = True) -> None
1161
1171
 
1162
1172
  Draft a validation plan for a given table using an LLM.
1163
1173
 
@@ -1180,10 +1190,15 @@ DraftValidation(data: 'FrameT | Any', model: 'str', api_key: 'str | None' = None
1180
1190
  The data to be used for drafting a validation plan.
1181
1191
  model
1182
1192
  The model to be used. This should be in the form of `provider:model` (e.g.,
1183
- `"anthropic:claude-3-5-sonnet-latest"`). Supported providers are `"anthropic"`, `"openai"`,
1193
+ `"anthropic:claude-sonnet-4-5"`). Supported providers are `"anthropic"`, `"openai"`,
1184
1194
  `"ollama"`, and `"bedrock"`.
1185
1195
  api_key
1186
1196
  The API key to be used for the model.
1197
+ verify_ssl
1198
+ Whether to verify SSL certificates when making requests to the LLM provider. Set to `False`
1199
+ to disable SSL verification (e.g., when behind a corporate firewall with self-signed
1200
+ certificates). Defaults to `True`. Use with caution, as disabling SSL verification can pose
1201
+ security risks.
1187
1202
 
1188
1203
  Returns
1189
1204
  -------
@@ -1225,6 +1240,33 @@ DraftValidation(data: 'FrameT | Any', model: 'str', api_key: 'str | None' = None
1225
1240
  There's no need to have the `python-dotenv` package installed when using `.env` files in this
1226
1241
  way.
1227
1242
 
1243
+ Notes on SSL Certificate Verification
1244
+ --------------------------------------
1245
+ By default, SSL certificate verification is enabled for all requests to LLM providers. However,
1246
+ in certain network environments (such as corporate networks with self-signed certificates or
1247
+ firewall proxies), you may encounter SSL certificate verification errors.
1248
+
1249
+ To disable SSL verification, set the `verify_ssl` parameter to `False`:
1250
+
1251
+ ```python
1252
+ import pointblank as pb
1253
+
1254
+ data = pb.load_dataset(dataset="nycflights", tbl_type="duckdb")
1255
+
1256
+ # Disable SSL verification for networks with self-signed certificates
1257
+ pb.DraftValidation(
1258
+ data=data,
1259
+ model="anthropic:claude-sonnet-4-5",
1260
+ verify_ssl=False
1261
+ )
1262
+ ```
1263
+
1264
+ :::{.callout-warning}
1265
+ Disabling SSL verification (through `verify_ssl=False`) can expose your API keys and data to
1266
+ man-in-the-middle attacks. Only use this option in trusted network environments and when
1267
+ absolutely necessary.
1268
+ :::
1269
+
1228
1270
  Notes on Data Sent to the Model Provider
1229
1271
  ----------------------------------------
1230
1272
  The data sent to the model provider is a JSON summary of the table. This data summary is
@@ -1251,7 +1293,7 @@ DraftValidation(data: 'FrameT | Any', model: 'str', api_key: 'str | None' = None
1251
1293
  Let's look at how the `DraftValidation` class can be used to draft a validation plan for a
1252
1294
  table. The table to be used is `"nycflights"`, which is available here via the
1253
1295
  [`load_dataset()`](`pointblank.load_dataset`) function. The model to be used is
1254
- `"anthropic:claude-3-5-sonnet-latest"` (which performs very well compared to other LLMs). The
1296
+ `"anthropic:claude-sonnet-4-5"` (which performs very well compared to other LLMs). The
1255
1297
  example assumes that the API key is stored in an `.env` file as `ANTHROPIC_API_KEY`.
1256
1298
 
1257
1299
  ```python
@@ -1261,7 +1303,7 @@ DraftValidation(data: 'FrameT | Any', model: 'str', api_key: 'str | None' = None
1261
1303
  data = pb.load_dataset(dataset="nycflights", tbl_type="duckdb")
1262
1304
 
1263
1305
  # Draft a validation plan for the "nycflights" table
1264
- pb.DraftValidation(data=data, model="anthropic:claude-3-5-sonnet-latest")
1306
+ pb.DraftValidation(data=data, model="anthropic:claude-sonnet-4-5")
1265
1307
  ```
1266
1308
 
1267
1309
  The output will be a drafted validation plan for the `"nycflights"` table and this will appear
@@ -3645,6 +3687,262 @@ col_vals_not_in_set(self, columns: 'str | list[str] | Column | ColumnSelector |
3645
3687
  statuses in the `InvalidStatus` enum.
3646
3688
 
3647
3689
 
3690
+ col_vals_increasing(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', allow_stationary: 'bool' = False, decreasing_tol: 'float | None' = None, na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
3691
+
3692
+ Are column data increasing by row?
3693
+
3694
+ The `col_vals_increasing()` validation method checks whether column values in a table are
3695
+ increasing when moving down a table. There are options for allowing missing values in the
3696
+ target column, allowing stationary phases (where consecutive values don't change), and even
3697
+ one for allowing decreasing movements up to a certain threshold. This validation will
3698
+ operate over the number of test units that is equal to the number of rows in the table
3699
+ (determined after any `pre=` mutation has been applied).
3700
+
3701
+ Parameters
3702
+ ----------
3703
+ columns
3704
+ A single column or a list of columns to validate. Can also use
3705
+ [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If
3706
+ multiple columns are supplied or resolved, there will be a separate validation step
3707
+ generated for each column.
3708
+ allow_stationary
3709
+ An option to allow pauses in increasing values. For example, if the values for the test
3710
+ units are `[80, 82, 82, 85, 88]` then the third unit (`82`, appearing a second time)
3711
+ would be marked as failing when `allow_stationary` is `False`. Using
3712
+ `allow_stationary=True` will result in all the test units in `[80, 82, 82, 85, 88]`
3713
+ being marked as passing.
3714
+ decreasing_tol
3715
+ An optional threshold value that allows for movement of numerical values in the negative
3716
+ direction. By default this is `None`, but using a numerical value will set the absolute
3717
+ threshold of negative travel allowed across numerical test units. Note that setting a
3718
+ value here also has the effect of setting `allow_stationary` to `True`.
3719
+ na_pass
3720
+ Should any encountered None, NA, or Null values be considered as passing test units? By
3721
+ default, this is `False`. Set to `True` to pass test units with missing values.
3722
+ pre
3723
+ An optional preprocessing function or lambda to apply to the data table during
3724
+ interrogation. This function should take a table as input and return a modified table.
3725
+ Have a look at the *Preprocessing* section for more information on how to use this
3726
+ argument.
3727
+ segments
3728
+ An optional directive on segmentation, which serves to split a validation step into
3729
+ multiple (one step per segment). Can be a single column name, a tuple that specifies a
3730
+ column name and its corresponding values to segment on, or a combination of both
3731
+ (provided as a list). Read the *Segmentation* section for usage information.
3732
+ thresholds
3733
+ Set threshold failure levels for reporting and reacting to exceedences of the levels.
3734
+ The thresholds are set at the step level and will override any global thresholds set in
3735
+ `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will
3736
+ be set locally and global thresholds (if any) will take effect. Look at the *Thresholds*
3737
+ section for information on how to set threshold levels.
3738
+ actions
3739
+ Optional actions to take when the validation step(s) meets or exceeds any set threshold
3740
+ levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to
3741
+ define the actions.
3742
+ brief
3743
+ An optional brief description of the validation step that will be displayed in the
3744
+ reporting table. You can use the templating elements like `"{step}"` to insert
3745
+ the step number, or `"{auto}"` to include an automatically generated brief. If `True`
3746
+ the entire brief will be automatically generated. If `None` (the default) then there
3747
+ won't be a brief.
3748
+ active
3749
+ A boolean value indicating whether the validation step should be active. Using `False`
3750
+ will make the validation step inactive (still reporting its presence and keeping indexes
3751
+ for the steps unchanged).
3752
+
3753
+ Returns
3754
+ -------
3755
+ Validate
3756
+ The `Validate` object with the added validation step.
3757
+
3758
+ Examples
3759
+ --------
3760
+ For the examples here, we'll use a simple Polars DataFrame with three numeric columns (`a`, `b`, and `c`). The
3761
+ table is shown below:
3762
+
3763
+ ```python
3764
+ import pointblank as pb
3765
+ import polars as pl
3766
+
3767
+ tbl = pl.DataFrame(
3768
+ {
3769
+ "a": [1, 2, 3, 4, 5, 6],
3770
+ "b": [1, 2, 2, 3, 4, 5],
3771
+ "c": [1, 2, 1, 3, 4, 5],
3772
+ }
3773
+ )
3774
+
3775
+ pb.preview(tbl)
3776
+ ```
3777
+
3778
+ Let's validate that values in column `a` are increasing. We'll determine if this validation
3779
+ had any failing test units (there are six test units, one for each row).
3780
+
3781
+ ```python
3782
+ validation = (
3783
+ pb.Validate(data=tbl)
3784
+ .col_vals_increasing(columns="a")
3785
+ .interrogate()
3786
+ )
3787
+
3788
+ validation
3789
+ ```
3790
+
3791
+ The validation passed as all values in column `a` are increasing. Now let's check column
3792
+ `b`, which has a stationary value:
3793
+
3794
+ ```python
3795
+ validation = (
3796
+ pb.Validate(data=tbl)
3797
+ .col_vals_increasing(columns="b")
3798
+ .interrogate()
3799
+ )
3800
+
3801
+ validation
3802
+ ```
3803
+
3804
+ This validation fails at the third row because the value `2` is repeated. If we want to
3805
+ allow stationary values, we can use `allow_stationary=True`:
3806
+
3807
+ ```python
3808
+ validation = (
3809
+ pb.Validate(data=tbl)
3810
+ .col_vals_increasing(columns="b", allow_stationary=True)
3811
+ .interrogate()
3812
+ )
3813
+
3814
+ validation
3815
+ ```
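+
+ Column `c` contains a single decrease (from `2` down to `1`). As a minimal sketch of the
+ `decreasing_tol=` option (which also implies `allow_stationary=True`), a tolerance of `1`
+ should permit that small dip:
+
+ ```python
+ validation = (
+     pb.Validate(data=tbl)
+     .col_vals_increasing(columns="c", decreasing_tol=1)
+     .interrogate()
+ )
+
+ validation
+ ```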
3816
+
3817
+
3818
+ col_vals_decreasing(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', allow_stationary: 'bool' = False, increasing_tol: 'float | None' = None, na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
3819
+
3820
+ Are column data decreasing by row?
3821
+
3822
+ The `col_vals_decreasing()` validation method checks whether column values in a table are
3823
+ decreasing when moving down a table. There are options for allowing missing values in the
3824
+ target column, allowing stationary phases (where consecutive values don't change), and even
3825
+ one for allowing increasing movements up to a certain threshold. This validation will
3826
+ operate over the number of test units that is equal to the number of rows in the table
3827
+ (determined after any `pre=` mutation has been applied).
3828
+
3829
+ Parameters
3830
+ ----------
3831
+ columns
3832
+ A single column or a list of columns to validate. Can also use
3833
+ [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If
3834
+ multiple columns are supplied or resolved, there will be a separate validation step
3835
+ generated for each column.
3836
+ allow_stationary
3837
+ An option to allow pauses in decreasing values. For example, if the values for the test
3838
+ units are `[88, 85, 85, 82, 80]` then the third unit (`85`, appearing a second time)
3839
+ would be marked as failing when `allow_stationary` is `False`. Using
3840
+ `allow_stationary=True` will result in all the test units in `[88, 85, 85, 82, 80]`
3841
+ being marked as passing.
3842
+ increasing_tol
3843
+ An optional threshold value that allows for movement of numerical values in the positive
3844
+ direction. By default this is `None`, but using a numerical value will set the absolute
3845
+ threshold of positive travel allowed across numerical test units. Note that setting a
3846
+ value here also has the effect of setting `allow_stationary` to `True`.
3847
+ na_pass
3848
+ Should any encountered None, NA, or Null values be considered as passing test units? By
3849
+ default, this is `False`. Set to `True` to pass test units with missing values.
3850
+ pre
3851
+ An optional preprocessing function or lambda to apply to the data table during
3852
+ interrogation. This function should take a table as input and return a modified table.
3853
+ Have a look at the *Preprocessing* section for more information on how to use this
3854
+ argument.
3855
+ segments
3856
+ An optional directive on segmentation, which serves to split a validation step into
3857
+ multiple (one step per segment). Can be a single column name, a tuple that specifies a
3858
+ column name and its corresponding values to segment on, or a combination of both
3859
+ (provided as a list). Read the *Segmentation* section for usage information.
3860
+ thresholds
3861
+ Set threshold failure levels for reporting and reacting to exceedences of the levels.
3862
+ The thresholds are set at the step level and will override any global thresholds set in
3863
+ `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will
3864
+ be set locally and global thresholds (if any) will take effect. Look at the *Thresholds*
3865
+ section for information on how to set threshold levels.
3866
+ actions
3867
+ Optional actions to take when the validation step(s) meets or exceeds any set threshold
3868
+ levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to
3869
+ define the actions.
3870
+ brief
3871
+ An optional brief description of the validation step that will be displayed in the
3872
+ reporting table. You can use the templating elements like `"{step}"` to insert
3873
+ the step number, or `"{auto}"` to include an automatically generated brief. If `True`
3874
+ the entire brief will be automatically generated. If `None` (the default) then there
3875
+ won't be a brief.
3876
+ active
3877
+ A boolean value indicating whether the validation step should be active. Using `False`
3878
+ will make the validation step inactive (still reporting its presence and keeping indexes
3879
+ for the steps unchanged).
3880
+
3881
+ Returns
3882
+ -------
3883
+ Validate
3884
+ The `Validate` object with the added validation step.
3885
+
3886
+ Examples
3887
+ --------
3888
+ For the examples here, we'll use a simple Polars DataFrame with three numeric columns (`a`, `b`, and `c`). The
3889
+ table is shown below:
3890
+
3891
+ ```python
3892
+ import pointblank as pb
3893
+ import polars as pl
3894
+
3895
+ tbl = pl.DataFrame(
3896
+ {
3897
+ "a": [6, 5, 4, 3, 2, 1],
3898
+ "b": [5, 4, 4, 3, 2, 1],
3899
+ "c": [5, 4, 5, 3, 2, 1],
3900
+ }
3901
+ )
3902
+
3903
+ pb.preview(tbl)
3904
+ ```
3905
+
3906
+ Let's validate that values in column `a` are decreasing. We'll determine if this validation
3907
+ had any failing test units (there are six test units, one for each row).
3908
+
3909
+ ```python
3910
+ validation = (
3911
+ pb.Validate(data=tbl)
3912
+ .col_vals_decreasing(columns="a")
3913
+ .interrogate()
3914
+ )
3915
+
3916
+ validation
3917
+ ```
3918
+
3919
+ The validation passed as all values in column `a` are decreasing. Now let's check column
3920
+ `b`, which has a stationary value:
3921
+
3922
+ ```python
3923
+ validation = (
3924
+ pb.Validate(data=tbl)
3925
+ .col_vals_decreasing(columns="b")
3926
+ .interrogate()
3927
+ )
3928
+
3929
+ validation
3930
+ ```
3931
+
3932
+ This validation fails at the third row because the value `4` is repeated. If we want to
3933
+ allow stationary values, we can use `allow_stationary=True`:
3934
+
3935
+ ```python
3936
+ validation = (
3937
+ pb.Validate(data=tbl)
3938
+ .col_vals_decreasing(columns="b", allow_stationary=True)
3939
+ .interrogate()
3940
+ )
3941
+
3942
+ validation
3943
+ ```
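+
+ Column `c` likewise contains a single increase (from `4` up to `5`). As a minimal sketch of
+ the `increasing_tol=` option (which also implies `allow_stationary=True`), a tolerance of `1`
+ should permit that small rise:
+
+ ```python
+ validation = (
+     pb.Validate(data=tbl)
+     .col_vals_decreasing(columns="c", increasing_tol=1)
+     .interrogate()
+ )
+
+ validation
+ ```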
3944
+
3945
+
3648
3946
  col_vals_null(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
3649
3947
 
3650
3948
  Validate whether values in a column are Null.
@@ -4213,22 +4511,31 @@ col_vals_regex(self, columns: 'str | list[str] | Column | ColumnSelector | Colum
4213
4511
  string values of rows 1 and 2 in column `b`.
4214
4512
 
4215
4513
 
4216
- col_vals_expr(self, expr: 'any', pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
4514
+ col_vals_within_spec(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', spec: 'str', na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
4217
4515
 
4218
- Validate column values using a custom expression.
4516
+ Validate whether column values fit within a specification.
4219
4517
 
4220
- The `col_vals_expr()` validation method checks whether column values in a table satisfy a
4221
- custom `expr=` expression. This validation will operate over the number of test units that
4222
- is equal to the number of rows in the table (determined after any `pre=` mutation has been
4223
- applied).
4518
+ The `col_vals_within_spec()` validation method checks whether column values in a table
4519
+ correspond to a specification (`spec=`) type (details of which are available in the
4520
+ *Specifications* section). Specifications include common data types like email addresses,
4521
+ URLs, postal codes, vehicle identification numbers (VINs), International Bank Account
4522
+ Numbers (IBANs), and more. This validation will operate over the number of test units that
4523
+ is equal to the number of rows in the table.
4224
4524
 
4225
4525
  Parameters
4226
4526
  ----------
4227
- expr
4228
- A column expression that will evaluate each row in the table, returning a boolean value
4229
- per table row. If the target table is a Polars DataFrame, the expression should either
4230
- be a Polars column expression or a Narwhals one. For a Pandas DataFrame, the expression
4231
- should either be a lambda expression or a Narwhals column expression.
4527
+ columns
4528
+ A single column or a list of columns to validate. Can also use
4529
+ [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If
4530
+ multiple columns are supplied or resolved, there will be a separate validation step
4531
+ generated for each column.
4532
+ spec
4533
+ A specification string for defining the specification type. Examples are `"email"`,
4534
+ `"url"`, and `"postal_code[USA]"`. See the *Specifications* section for all available
4535
+ options.
4536
+ na_pass
4537
+ Should any encountered None, NA, or Null values be considered as passing test units? By
4538
+ default, this is `False`. Set to `True` to pass test units with missing values.
4232
4539
  pre
4233
4540
  An optional preprocessing function or lambda to apply to the data table during
4234
4541
  interrogation. This function should take a table as input and return a modified table.
@@ -4246,7 +4553,7 @@ col_vals_expr(self, expr: 'any', pre: 'Callable | None' = None, segments: 'Segme
4246
4553
  be set locally and global thresholds (if any) will take effect. Look at the *Thresholds*
4247
4554
  section for information on how to set threshold levels.
4248
4555
  actions
4249
- Optional actions to take when the validation step meets or exceeds any set threshold
4556
+ Optional actions to take when the validation step(s) meets or exceeds any set threshold
4250
4557
  levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to
4251
4558
  define the actions.
4252
4559
  brief
@@ -4265,6 +4572,40 @@ col_vals_expr(self, expr: 'any', pre: 'Callable | None' = None, segments: 'Segme
4265
4572
  Validate
4266
4573
  The `Validate` object with the added validation step.
4267
4574
 
4575
+ Specifications
4576
+ --------------
4577
+ A specification type must be used with the `spec=` argument. This is a string-based keyword
4578
+ that corresponds to the type of data in the specified columns. The following keywords can
4579
+ be used:
4580
+
4581
+ - `"isbn"`: The International Standard Book Number (ISBN) is a unique numerical identifier
4582
+ for books. This keyword validates both 10-digit and 13-digit ISBNs.
4583
+
4584
+ - `"vin"`: A vehicle identification number (VIN) is a unique code used by the automotive
4585
+ industry to identify individual motor vehicles.
4586
+
4587
+ - `"postal_code[<country_code>]"`: A postal code (also known as postcodes, PIN, or ZIP
4588
+ code) is a series of letters, digits, or both, included in a postal address. Because the
4589
+ coding varies by country, a country code in either the 2-letter (ISO 3166-1 alpha-2) or
4590
+ 3-letter (ISO 3166-1 alpha-3) format needs to be supplied (e.g., `"postal_code[US]"` or
4591
+ `"postal_code[USA]"`). The keyword alias `"zip"` can be used for US ZIP codes.
4592
+
4593
+ - `"credit_card"`: A credit card number can be validated across a variety of issuers. The
4594
+ validation uses the Luhn algorithm.
4595
+
4596
+ - `"iban[<country_code>]"`: The International Bank Account Number (IBAN) is a system of
4597
+ identifying bank accounts across countries. Because the length and coding varies by
4598
+ country, a country code needs to be supplied (e.g., `"iban[DE]"` or `"iban[DEU]"`).
4599
+
4600
+ - `"swift"`: Business Identifier Codes (also known as SWIFT-BIC, BIC, or SWIFT code) are
4601
+ unique identifiers for financial and non-financial institutions.
4602
+
4603
+ - `"phone"`, `"email"`, `"url"`, `"ipv4"`, `"ipv6"`, `"mac"`: Phone numbers, email
4604
+ addresses, Internet URLs, IPv4 or IPv6 addresses, and MAC addresses can be validated with
4605
+ their respective keywords.
4606
+
4607
+ Only a single `spec=` value should be provided per function call.
4608
+
4268
4609
  Preprocessing
4269
4610
  -------------
4270
4611
  The `pre=` argument allows for a preprocessing function or lambda to be applied to the data
@@ -4274,9 +4615,11 @@ col_vals_expr(self, expr: 'any', pre: 'Callable | None' = None, segments: 'Segme
4274
4615
 
4275
4616
  The preprocessing function can be any callable that takes a table as input and returns a
4276
4617
  modified table. For example, you could use a lambda function to filter the table based on
4277
- certain criteria or to apply a transformation to the data. Regarding the lifetime of the
4278
- transformed table, it only exists during the validation step and is not stored in the
4279
- `Validate` object or used in subsequent validation steps.
4618
+ certain criteria or to apply a transformation to the data. Note that you can refer to
4619
+ a column via `columns=` that is expected to be present in the transformed table, but may not
4620
+ exist in the table before preprocessing. Regarding the lifetime of the transformed table, it
4621
+ only exists during the validation step and is not stored in the `Validate` object or used in
4622
+ subsequent validation steps.
4280
4623
 
4281
4624
  Segmentation
4282
4625
  ------------
@@ -4348,8 +4691,8 @@ col_vals_expr(self, expr: 'any', pre: 'Callable | None' = None, segments: 'Segme
4348
4691
 
4349
4692
  Examples
4350
4693
  --------
4351
- For the examples here, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and
4352
- `c`). The table is shown below:
4694
+ For the examples here, we'll use a simple Polars DataFrame with an email column. The table
4695
+ is shown below:
4353
4696
 
4354
4697
  ```python
4355
4698
  import pointblank as pb
@@ -4357,48 +4700,61 @@ col_vals_expr(self, expr: 'any', pre: 'Callable | None' = None, segments: 'Segme
4357
4700
 
4358
4701
  tbl = pl.DataFrame(
4359
4702
  {
4360
- "a": [1, 2, 1, 7, 8, 6],
4361
- "b": [0, 0, 0, 1, 1, 1],
4362
- "c": [0.5, 0.3, 0.8, 1.4, 1.9, 1.2],
4703
+ "email": [
4704
+ "user@example.com",
4705
+ "admin@test.org",
4706
+ "invalid-email",
4707
+ "contact@company.co.uk",
4708
+ ],
4363
4709
  }
4364
4710
  )
4365
4711
 
4366
4712
  pb.preview(tbl)
4367
4713
  ```
4368
4714
 
4369
- Let's validate that the values in column `a` are all integers. We'll determine if this
4370
- validation had any failing test units (there are six test units, one for each row).
4715
+ Let's validate that all of the values in the `email` column are valid email addresses.
4716
+ We'll determine if this validation had any failing test units (there are four test units,
4717
+ one for each row).
4371
4718
 
4372
4719
  ```python
4373
4720
  validation = (
4374
4721
  pb.Validate(data=tbl)
4375
- .col_vals_expr(expr=pl.col("a") % 1 == 0)
4722
+ .col_vals_within_spec(columns="email", spec="email")
4376
4723
  .interrogate()
4377
4724
  )
4378
4725
 
4379
4726
  validation
4380
4727
  ```
4381
4728
 
4382
- Printing the `validation` object shows the validation table in an HTML viewing environment.
4383
- The validation table shows the single entry that corresponds to the validation step created
4384
- by using `col_vals_expr()`. All test units passed, with no failing test units.
4729
+ The validation table shows that one test unit failed (the invalid email address in row 3).
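+
+ Other specification keywords follow the same pattern. As a further sketch (using a
+ hypothetical column of US postal codes), the country-coded form of the `spec=` keyword can
+ be exercised like this (per the *Specifications* section, the `"zip"` alias would also work):
+
+ ```python
+ tbl_zip = pl.DataFrame({"zip": ["10001", "94110", "abcde", "60601"]})
+
+ validation = (
+     pb.Validate(data=tbl_zip)
+     .col_vals_within_spec(columns="zip", spec="postal_code[USA]")
+     .interrogate()
+ )
+
+ validation
+ ```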
4385
4730
 
4386
4731
 
4387
- col_exists(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
4732
+ col_vals_expr(self, expr: 'any', pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
4388
4733
 
4389
- Validate whether one or more columns exist in the table.
4734
+ Validate column values using a custom expression.
4390
4735
 
4391
- The `col_exists()` method checks whether one or more columns exist in the target table. The
4392
- only requirement is specification of the column names. Each validation step or expectation
4393
- will operate over a single test unit, which is whether the column exists or not.
4736
+ The `col_vals_expr()` validation method checks whether column values in a table satisfy a
4737
+ custom `expr=` expression. This validation will operate over the number of test units that
4738
+ is equal to the number of rows in the table (determined after any `pre=` mutation has been
4739
+ applied).
4394
4740
 
4395
4741
  Parameters
4396
4742
  ----------
4397
- columns
4398
- A single column or a list of columns to validate. Can also use
4399
- [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If
4400
- multiple columns are supplied or resolved, there will be a separate validation step
4401
- generated for each column.
4743
+ expr
4744
+ A column expression that will evaluate each row in the table, returning a boolean value
4745
+ per table row. If the target table is a Polars DataFrame, the expression should either
4746
+ be a Polars column expression or a Narwhals one. For a Pandas DataFrame, the expression
4747
+ should either be a lambda expression or a Narwhals column expression.
4748
+ pre
4749
+ An optional preprocessing function or lambda to apply to the data table during
4750
+ interrogation. This function should take a table as input and return a modified table.
4751
+ Have a look at the *Preprocessing* section for more information on how to use this
4752
+ argument.
4753
+ segments
4754
+ An optional directive on segmentation, which serves to split a validation step into
4755
+ multiple (one step per segment). Can be a single column name, a tuple that specifies a
4756
+ column name and its corresponding values to segment on, or a combination of both
4757
+ (provided as a list). Read the *Segmentation* section for usage information.
4402
4758
  thresholds
4403
4759
  Set threshold failure levels for reporting and reacting to exceedences of the levels.
4404
4760
  The thresholds are set at the step level and will override any global thresholds set in
@@ -4406,7 +4762,7 @@ col_exists(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSel
4406
4762
  be set locally and global thresholds (if any) will take effect. Look at the *Thresholds*
4407
4763
  section for information on how to set threshold levels.
4408
4764
  actions
4409
- Optional actions to take when the validation step(s) meets or exceeds any set threshold
4765
+ Optional actions to take when the validation step meets or exceeds any set threshold
4410
4766
  levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to
4411
4767
  define the actions.
4412
4768
  brief
@@ -4425,6 +4781,59 @@ col_exists(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSel
4425
4781
  Validate
4426
4782
  The `Validate` object with the added validation step.
4427
4783
 
4784
+ Preprocessing
4785
+ -------------
4786
+ The `pre=` argument allows for a preprocessing function or lambda to be applied to the data
4787
+ table during interrogation. This function should take a table as input and return a modified
4788
+ table. This is useful for performing any necessary transformations or filtering on the data
4789
+ before the validation step is applied.
4790
+
4791
+ The preprocessing function can be any callable that takes a table as input and returns a
4792
+ modified table. For example, you could use a lambda function to filter the table based on
4793
+ certain criteria or to apply a transformation to the data. Regarding the lifetime of the
4794
+ transformed table, it only exists during the validation step and is not stored in the
4795
+ `Validate` object or used in subsequent validation steps.
4796
+
4797
+ Segmentation
4798
+ ------------
4799
+ The `segments=` argument allows for the segmentation of a validation step into multiple
4800
+ segments. This is useful for applying the same validation step to different subsets of the
4801
+ data. The segmentation can be done based on a single column or specific fields within a
4802
+ column.
4803
+
4804
+ Providing a single column name will result in a separate validation step for each unique
4805
+ value in that column. For example, if you have a column called `"region"` with values
4806
+ `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each
4807
+ region.
4808
+
4809
+ Alternatively, you can provide a tuple that specifies a column name and its corresponding
4810
+ values to segment on. For example, if you have a column called `"date"` and you want to
4811
+ segment on only specific dates, you can provide a tuple like
4812
+ `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded
4813
+ (i.e., no validation steps will be created for them).
4814
+
4815
+ A list with a combination of column names and tuples can be provided as well. This allows
4816
+ for more complex segmentation scenarios. The following inputs are both valid:
4817
+
4818
+ ```
4819
+ # Segments from all unique values in the `region` column
4820
+ # and specific dates in the `date` column
4821
+ segments=["region", ("date", ["2023-01-01", "2023-01-02"])]
4822
+
4823
+ # Segments from all unique values in the `region` and `date` columns
4824
+ segments=["region", "date"]
4825
+ ```
4826
+
4827
+ The segmentation is performed during interrogation, and the resulting validation steps will
4828
+ be numbered sequentially. Each segment will have its own validation step, and the results
4829
+ will be reported separately. This allows for a more granular analysis of the data and helps
4830
+ identify issues within specific segments.
4831
+
4832
+ Importantly, the segmentation process will be performed after any preprocessing of the data
4833
+ table. Because of this, one can conceivably use the `pre=` argument to generate a column
4834
+ that can be used for segmentation. For example, you could create a new column called
4835
+ `"segment"` through use of `pre=` and then use that column for segmentation.
4836
+
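+ A minimal sketch of that idea (assuming a Polars table): a lambda passed to `pre=` derives
+ a `"segment"` column, and `segments=` then splits the step on it:
+
+ ```python
+ import pointblank as pb
+ import polars as pl
+
+ tbl = pl.DataFrame({"a": [1, 2, 7, 8], "b": [0, 0, 1, 1]})
+
+ validation = (
+     pb.Validate(data=tbl)
+     .col_vals_expr(
+         expr=pl.col("a") % 1 == 0,
+         # Derive a "segment" column during preprocessing, then segment on it
+         pre=lambda df: df.with_columns(
+             pl.when(pl.col("b") == 0)
+             .then(pl.lit("low"))
+             .otherwise(pl.lit("high"))
+             .alias("segment")
+         ),
+         segments="segment",
+     )
+     .interrogate()
+ )
+
+ validation
+ ```
+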
4428
4837
  Thresholds
4429
4838
  ----------
4430
4839
  The `thresholds=` parameter is used to set the failure-condition levels for the validation
@@ -4455,8 +4864,8 @@ col_exists(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSel
4455
4864
 
4456
4865
  Examples
4457
4866
  --------
4458
- For the examples here, we'll use a simple Polars DataFrame with a string columns (`a`) and a
4459
- numeric column (`b`). The table is shown below:
4867
+ For the examples here, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and
4868
+ `c`). The table is shown below:
4460
4869
 
4461
4870
  ```python
4462
4871
  import pointblank as pb
@@ -4464,21 +4873,22 @@ col_exists(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSel
4464
4873
 
4465
4874
  tbl = pl.DataFrame(
4466
4875
  {
4467
- "a": ["apple", "banana", "cherry", "date"],
4468
- "b": [1, 6, 3, 5],
4876
+ "a": [1, 2, 1, 7, 8, 6],
4877
+ "b": [0, 0, 0, 1, 1, 1],
4878
+ "c": [0.5, 0.3, 0.8, 1.4, 1.9, 1.2],
4469
4879
  }
4470
4880
  )
4471
4881
 
4472
4882
  pb.preview(tbl)
4473
4883
  ```
4474
4884
 
4475
- Let's validate that the columns `a` and `b` actually exist in the table. We'll determine if
4476
- this validation had any failing test units (each validation will have a single test unit).
4885
+ Let's validate that the values in column `a` are all integers. We'll determine if this
4886
+ validation had any failing test units (there are six test units, one for each row).
4477
4887
 
4478
4888
  ```python
4479
4889
  validation = (
4480
4890
  pb.Validate(data=tbl)
4481
- .col_exists(columns=["a", "b"])
4891
+ .col_vals_expr(expr=pl.col("a") % 1 == 0)
4482
4892
  .interrogate()
4483
4893
  )
4484
4894
 
@@ -4486,24 +4896,8 @@ col_exists(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSel
4486
4896
  ```
4487
4897
 
4488
4898
  Printing the `validation` object shows the validation table in an HTML viewing environment.
4489
- The validation table shows two entries (one check per column) generated by the
4490
- `col_exists()` validation step. Both steps passed since both columns provided in `columns=`
4491
- are present in the table.
4492
-
4493
- Now, let's check for the existence of a different set of columns.
4494
-
4495
- ```python
4496
- validation = (
4497
- pb.Validate(data=tbl)
4498
- .col_exists(columns=["b", "c"])
4499
- .interrogate()
4500
- )
4501
-
4502
- validation
4503
- ```
4504
-
4505
- The validation table reports one passing validation step (the check for column `b`) and one
4506
- failing validation step (the check for column `c`, which doesn't exist).
4899
+ The validation table shows the single entry that corresponds to the validation step created
4900
+ by using `col_vals_expr()`. All test units passed, with no failing test units.
4507
4901
 
4508
4902
 
4509
4903
  rows_distinct(self, columns_subset: 'str | list[str] | None' = None, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
@@ -4886,6 +5280,128 @@ rows_complete(self, columns_subset: 'str | list[str] | None' = None, pre: 'Calla
4886
5280
  others.
4887
5281
 
4888
5282
 
5283
+ col_exists(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
5284
+
5285
+ Validate whether one or more columns exist in the table.
5286
+
5287
+ The `col_exists()` method checks whether one or more columns exist in the target table. The
5288
+ only requirement is specification of the column names. Each validation step or expectation
5289
+ will operate over a single test unit, which is whether the column exists or not.
5290
+
5291
+ Parameters
5292
+ ----------
5293
+ columns
5294
+ A single column or a list of columns to validate. Can also use
5295
+ [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If
5296
+ multiple columns are supplied or resolved, there will be a separate validation step
5297
+ generated for each column.
5298
+ thresholds
5299
+ Set threshold failure levels for reporting and reacting to exceedences of the levels.
5300
+ The thresholds are set at the step level and will override any global thresholds set in
5301
+ `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will
5302
+ be set locally and global thresholds (if any) will take effect. Look at the *Thresholds*
5303
+ section for information on how to set threshold levels.
5304
+ actions
5305
+ Optional actions to take when the validation step(s) meets or exceeds any set threshold
5306
+ levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to
5307
+ define the actions.
5308
+ brief
5309
+ An optional brief description of the validation step that will be displayed in the
5310
+ reporting table. You can use the templating elements like `"{step}"` to insert
5311
+ the step number, or `"{auto}"` to include an automatically generated brief. If `True`
5312
+ the entire brief will be automatically generated. If `None` (the default) then there
5313
+ won't be a brief.
5314
+ active
5315
+ A boolean value indicating whether the validation step should be active. Using `False`
5316
+ will make the validation step inactive (still reporting its presence and keeping indexes
5317
+ for the steps unchanged).
5318
+
5319
+ Returns
5320
+ -------
5321
+ Validate
5322
+ The `Validate` object with the added validation step.
5323
+
5324
+ Thresholds
5325
+ ----------
5326
+ The `thresholds=` parameter is used to set the failure-condition levels for the validation
5327
+ step. If they are set here at the step level, these thresholds will override any thresholds
5328
+ set at the global level in `Validate(thresholds=...)`.
5329
+
5330
+ There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values
5331
+ can either be set as a proportion of all test units failing (a value between `0` and `1`),
5332
+ or the absolute number of failing test units (as an integer that's `1` or greater).
5333
+
5334
+ Thresholds can be defined using one of these input schemes:
5335
+
5336
+ 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create
5337
+ thresholds)
5338
+ 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is
5339
+ the 'error' level, and position `2` is the 'critical' level
5340
+ 3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and
5341
+ 'critical'
5342
+ 4. a single integer/float value denoting absolute number or fraction of failing test units
5343
+ for the 'warning' level only
5344
+
5345
+ If the number of failing test units exceeds set thresholds, the validation step will be
5346
+ marked as 'warning', 'error', or 'critical'. Not all of the threshold levels need to be
5347
+ set; you're free to set any combination of them.
5348
+
5349
+ Aside from reporting failure conditions, thresholds can be used to determine the actions to
5350
+ take for each level of failure (using the `actions=` parameter).
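+
+ As a minimal sketch of these input schemes, each of the following `thresholds=` values sets
+ the same 'warning' level of one failing test unit:
+
+ ```python
+ import pointblank as pb
+ import polars as pl
+
+ tbl = pl.DataFrame({"a": ["apple", "banana"], "b": [1, 6]})
+
+ validation = (
+     pb.Validate(data=tbl)
+     .col_exists(columns="a", thresholds=pb.Thresholds(warning=1))  # Thresholds class
+     .col_exists(columns="a", thresholds=(1,))                      # tuple form
+     .col_exists(columns="a", thresholds={"warning": 1})            # dictionary form
+     .col_exists(columns="a", thresholds=1)                         # single-value form
+     .interrogate()
+ )
+
+ validation
+ ```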
5351
+
5352
+ Examples
5353
+ --------
5354
+ For the examples here, we'll use a simple Polars DataFrame with a string column (`a`) and a
5355
+ numeric column (`b`). The table is shown below:
5356
+
5357
+ ```python
5358
+ import pointblank as pb
5359
+ import polars as pl
5360
+
5361
+ tbl = pl.DataFrame(
5362
+ {
5363
+ "a": ["apple", "banana", "cherry", "date"],
5364
+ "b": [1, 6, 3, 5],
5365
+ }
5366
+ )
5367
+
5368
+ pb.preview(tbl)
5369
+ ```
5370
+
5371
+ Let's validate that the columns `a` and `b` actually exist in the table. We'll determine if
5372
+ this validation had any failing test units (each validation will have a single test unit).
5373
+
5374
+ ```python
5375
+ validation = (
5376
+ pb.Validate(data=tbl)
5377
+ .col_exists(columns=["a", "b"])
5378
+ .interrogate()
5379
+ )
5380
+
5381
+ validation
5382
+ ```
5383
+
5384
+ Printing the `validation` object shows the validation table in an HTML viewing environment.
5385
+ The validation table shows two entries (one check per column) generated by the
5386
+ `col_exists()` validation step. Both steps passed since both columns provided in `columns=`
5387
+ are present in the table.
5388
+
5389
+ Now, let's check for the existence of a different set of columns.
5390
+
5391
+ ```python
5392
+ validation = (
5393
+ pb.Validate(data=tbl)
5394
+ .col_exists(columns=["b", "c"])
5395
+ .interrogate()
5396
+ )
5397
+
5398
+ validation
5399
+ ```
5400
+
5401
+ The validation table reports one passing validation step (the check for column `b`) and one
5402
+ failing validation step (the check for column `c`, which doesn't exist).
5403
+
5404
+
4889
5405
  col_schema_match(self, schema: 'Schema', complete: 'bool' = True, in_order: 'bool' = True, case_sensitive_colnames: 'bool' = True, case_sensitive_dtypes: 'bool' = True, full_match_dtypes: 'bool' = True, pre: 'Callable | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
4890
5406
 
4891
5407
  Do columns in the table (and their types) match a predefined schema?
@@ -5169,58 +5685,177 @@ row_count_match(self, count: 'int | FrameT | Any', tol: 'Tolerance' = 0, inverse
5169
5685
  rows in the target table. So, the single test unit passed.
5170
5686
 
5171
5687
 
5172
- Let's modify our example to show the different ways we can allow some tolerance to our validation
5173
- by using the `tol` argument.
5688
+ Let's modify our example to show the different ways we can allow some tolerance to our validation
5689
+ by using the `tol` argument.
5690
+
5691
+ ```python
5692
+ smaller_small_table = small_table.sample(n=12)  # within the lower bound
5693
+ validation = (
5694
+ pb.Validate(data=smaller_small_table)
5695
+ .row_count_match(count=13, tol=(2, 0))  # minus 2 but plus 0, i.e., 11-13
5696
+ .interrogate()
5697
+ )
5698
+
5699
+ validation
5700
+
5701
+ validation = (
5702
+ pb.Validate(data=smaller_small_table)
5703
+ .row_count_match(count=13, tol=0.05)  # 5% tolerance of 13
5704
+ .interrogate()
5705
+ )
5706
+
5707
+ even_smaller_table = small_table.sample(n=2)
5708
+ validation = (
5709
+ pb.Validate(data=even_smaller_table)
5710
+ .row_count_match(count=13, tol=5)  # plus or minus 5; this test will fail
5711
+ .interrogate()
5712
+ )
5713
+
5714
+ validation
5715
+ ```
5716
+
5717
+
5718
+
5719
+ col_count_match(self, count: 'int | FrameT | Any', inverse: 'bool' = False, pre: 'Callable | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
5720
+
5721
+ Validate whether the column count of the table matches a specified count.
5722
+
5723
+ The `col_count_match()` method checks whether the column count of the target table matches a
5724
+ specified count. This validation will operate over a single test unit, which is whether the
5725
+ column count matches the specified count.
5726
+
5727
+ We also have the option to invert the validation step by setting `inverse=True`. This will
5728
+ make the expectation that the column count of the target table *does not* match the
5729
+ specified count.
5730
+
5731
+ Parameters
5732
+ ----------
5733
+ count
5734
+ The expected column count of the table. This can be an integer value, a Polars or Pandas
5735
+ DataFrame object, or an Ibis backend table. If a DataFrame/table is provided, the column
5736
+ count of that object will be used as the expected count.
5737
+ inverse
5738
+ Should the validation step be inverted? If `True`, then the expectation is that the
5739
+ column count of the target table should not match the specified `count=` value.
5740
+ pre
5741
+ An optional preprocessing function or lambda to apply to the data table during
5742
+ interrogation. This function should take a table as input and return a modified table.
5743
+ Have a look at the *Preprocessing* section for more information on how to use this
5744
+ argument.
5745
+ thresholds
5746
+ Set threshold failure levels for reporting and reacting to exceedences of the levels.
5747
+ The thresholds are set at the step level and will override any global thresholds set in
5748
+ `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will
5749
+ be set locally and global thresholds (if any) will take effect. Look at the *Thresholds*
5750
+ section for information on how to set threshold levels.
5751
+ actions
5752
+ Optional actions to take when the validation step meets or exceeds any set threshold
5753
+ levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to
5754
+ define the actions.
5755
+ brief
5756
+ An optional brief description of the validation step that will be displayed in the
5757
+ reporting table. You can use the templating elements like `"{step}"` to insert
5758
+ the step number, or `"{auto}"` to include an automatically generated brief. If `True`
5759
+ the entire brief will be automatically generated. If `None` (the default) then there
5760
+ won't be a brief.
5761
+ active
5762
+ A boolean value indicating whether the validation step should be active. Using `False`
5763
+ will make the validation step inactive (still reporting its presence and keeping indexes
5764
+ for the steps unchanged).
5765
+
5766
+ Returns
5767
+ -------
5768
+ Validate
5769
+ The `Validate` object with the added validation step.
5770
+
5771
+ Preprocessing
5772
+ -------------
5773
+ The `pre=` argument allows for a preprocessing function or lambda to be applied to the data
5774
+ table during interrogation. This function should take a table as input and return a modified
5775
+ table. This is useful for performing any necessary transformations or filtering on the data
5776
+ before the validation step is applied.
5777
+
5778
+ The preprocessing function can be any callable that takes a table as input and returns a
5779
+ modified table. For example, you could use a lambda function to filter the table based on
5780
+ certain criteria or to apply a transformation to the data. Regarding the lifetime of the
5781
+ transformed table, it only exists during the validation step and is not stored in the
5782
+ `Validate` object or used in subsequent validation steps.
5783
+
5784
+ Thresholds
5785
+ ----------
5786
+ The `thresholds=` parameter is used to set the failure-condition levels for the validation
5787
+ step. If they are set here at the step level, these thresholds will override any thresholds
5788
+ set at the global level in `Validate(thresholds=...)`.
5789
+
5790
+ There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values
5791
+ can either be set as a proportion of all test units failing (a value between `0` and `1`),
5792
+ or the absolute number of failing test units (as an integer that's `1` or greater).
5793
+
5794
+ Thresholds can be defined using one of these input schemes:
5795
+
5796
+ 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create
5797
+ thresholds)
5798
+ 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is
5799
+ the 'error' level, and position `2` is the 'critical' level
5800
+ 3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and
5801
+ 'critical'
5802
+ 4. a single integer/float value denoting absolute number or fraction of failing test units
5803
+ for the 'warning' level only
5804
+
5805
+ If the number of failing test units exceeds set thresholds, the validation step will be
5806
+ marked as 'warning', 'error', or 'critical'. Not all of the threshold levels need to be
5807
+ set; you're free to set any combination of them.
5174
5808
 
5175
- ```python
5176
- smaller_small_table = small_table.sample(n = 12) # within the lower bound
5177
- validation = (
5178
- pb.Validate(data=smaller_small_table)
5179
- .row_count_match(count=13,tol=(2, 0)) # minus 2 but plus 0, ie. 11-13
5180
- .interrogate()
5181
- )
5809
+ Aside from reporting failure conditions, thresholds can be used to determine the actions to
5810
+ take for each level of failure (using the `actions=` parameter).
5182
5811
 
5183
- validation
5812
+ Examples
5813
+ --------
5814
+ For the examples here, we'll use the built-in dataset `"game_revenue"`. The table can be
5815
+ obtained by calling `load_dataset("game_revenue")`.
5184
5816
 
5185
- validation = (
5186
- pb.Validate(data=smaller_small_table)
5187
- .row_count_match(count=13,tol=.05) # .05% tolerance of 13
5188
- .interrogate()
5189
- )
5817
+ Let's validate that the number of columns in the table matches a fixed value. In this case,
5818
+ we will use the value `11` as the expected column count.
5190
5819
 
5191
- even_smaller_table = small_table.sample(n = 2)
5820
+ ```python
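+ import pointblank as pb
+
+ # Load the built-in "game_revenue" dataset mentioned above so the example is self-contained
+ game_revenue = pb.load_dataset(dataset="game_revenue")
+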
5192
5821
  validation = (
5193
- pb.Validate(data=even_smaller_table)
5194
- .row_count_match(count=13,tol=5) # plus or minus 5; this test will fail
5822
+ pb.Validate(data=game_revenue)
5823
+ .col_count_match(count=11)
5195
5824
  .interrogate()
5196
5825
  )
5197
5826
 
5198
5827
  validation
5199
5828
  ```
5200
5829
 
5830
+ The validation table shows that the expected value of `11` matches the actual count of
5831
+ columns in the target table. So, the single test unit passed.
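+
+ As a minimal sketch of the `inverse=` option, the same check can be flipped so that the
+ expectation becomes that the column count does *not* equal a given value (here, `10`):
+
+ ```python
+ validation = (
+     pb.Validate(data=game_revenue)
+     .col_count_match(count=10, inverse=True)
+     .interrogate()
+ )
+
+ validation
+ ```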
5201
5832
 
5202
5833
 
5203
- col_count_match(self, count: 'int | FrameT | Any', inverse: 'bool' = False, pre: 'Callable | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
5834
+ tbl_match(self, tbl_compare: 'FrameT | Any', pre: 'Callable | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
5204
5835
 
5205
- Validate whether the column count of the table matches a specified count.
5836
+ Validate whether the target table matches a comparison table.
5206
5837
 
5207
- The `col_count_match()` method checks whether the column count of the target table matches a
5208
- specified count. This validation will operate over a single test unit, which is whether the
5209
- column count matches the specified count.
5838
+ The `tbl_match()` method checks whether the target table's composition matches that of a
5839
+ comparison table. The validation performs a comprehensive comparison using progressively
5840
+ stricter checks (from least to most stringent):
5210
5841
 
5211
- We also have the option to invert the validation step by setting `inverse=True`. This will
5212
- make the expectation that column row count of the target table *does not* match the
5213
- specified count.
5842
+ 1. **Column count match**: both tables must have the same number of columns
5843
+ 2. **Row count match**: both tables must have the same number of rows
5844
+ 3. **Schema match (loose)**: column names and dtypes match (case-insensitive, any order)
5845
+ 4. **Schema match (order)**: columns in the correct order (case-insensitive names)
5846
+ 5. **Schema match (exact)**: column names match exactly (case-sensitive, correct order)
5847
+ 6. **Data match**: values in corresponding cells must be identical
5848
+
5849
+ This progressive approach helps identify exactly where tables differ. The validation will
5850
+ fail at the first check that doesn't pass, making it easier to diagnose mismatches. This
5851
+ validation operates over a single test unit (pass/fail for complete table match).
5214
5852
 
5215
5853
  Parameters
5216
5854
  ----------
5217
- count
5218
- The expected column count of the table. This can be an integer value, a Polars or Pandas
5219
- DataFrame object, or an Ibis backend table. If a DataFrame/table is provided, the column
5220
- count of that object will be used as the expected count.
5221
- inverse
5222
- Should the validation step be inverted? If `True`, then the expectation is that the
5223
- column count of the target table should not match the specified `count=` value.
5855
+ tbl_compare
5856
+ The comparison table to validate against. This can be a DataFrame object (Polars or
5857
+ Pandas), an Ibis table object, or a callable that returns a table. If a callable is
5858
+ provided, it will be executed during interrogation to obtain the comparison table.
5224
5859
  pre
5225
5860
  An optional preprocessing function or lambda to apply to the data table during
5226
5861
  interrogation. This function should take a table as input and return a modified table.
@@ -5261,9 +5896,10 @@ col_count_match(self, count: 'int | FrameT | Any', inverse: 'bool' = False, pre:
5261
5896
 
5262
5897
  The preprocessing function can be any callable that takes a table as input and returns a
5263
5898
  modified table. For example, you could use a lambda function to filter the table based on
5264
- certain criteria or to apply a transformation to the data. Regarding the lifetime of the
5265
- transformed table, it only exists during the validation step and is not stored in the
5266
- `Validate` object or used in subsequent validation steps.
5899
+ certain criteria or to apply a transformation to the data. Note that the same preprocessing
5900
+ is **not** applied to the comparison table; only the target table is preprocessed. Regarding
5901
+ the lifetime of the transformed table, it only exists during the validation step and is not
5902
+ stored in the `Validate` object or used in subsequent validation steps.
5267
5903
 
5268
5904
  Thresholds
5269
5905
  ----------
@@ -5293,26 +5929,129 @@ col_count_match(self, count: 'int | FrameT | Any', inverse: 'bool' = False, pre:
5293
5929
  Aside from reporting failure conditions, thresholds can be used to determine the actions to
5294
5930
  take for each level of failure (using the `actions=` parameter).
5295
5931
 
5932
+ Cross-Backend Validation
5933
+ ------------------------
5934
+ The `tbl_match()` method supports **automatic backend coercion** when comparing tables from
5935
+ different backends (e.g., comparing a Polars DataFrame against a Pandas DataFrame, or
5936
+ comparing database tables from DuckDB/SQLite against in-memory DataFrames). When tables with
5937
+ different backends are detected, the comparison table is automatically converted to match the
5938
+ data table's backend before validation proceeds.
5939
+
5940
+ **Certified Backend Combinations:**
5941
+
5942
+ All combinations of the following backends have been tested and certified to work (in both
5943
+ directions):
5944
+
5945
+ - Pandas DataFrame
5946
+ - Polars DataFrame
5947
+ - DuckDB (native)
5948
+ - DuckDB (as Ibis table)
5949
+ - SQLite (via Ibis)
5950
+
5951
+ Note that database backends (DuckDB, SQLite, PostgreSQL, MySQL, Snowflake, BigQuery) are
5952
+ automatically materialized during validation:
5953
+
5954
+ - if comparing **against Polars**: materialized to Polars
5955
+ - if comparing **against Pandas**: materialized to Pandas
5956
+ - if **both tables are database backends**: both materialized to Polars
5957
+
5958
+ This ensures optimal performance and type consistency.
5959
+
5960
+ **Data Types That Work Best in Cross-Backend Validation:**
5961
+
5962
+ - numeric types: int, float columns (including proper NaN handling)
5963
+ - string types: text columns with consistent encodings
5964
+ - boolean types: True/False values
5965
+ - null values: `None` and `NaN` are treated as equivalent across backends
5966
+ - list columns: nested list structures (with basic types)
5967
+
5968
+ **Known Limitations:**
5969
+
5970
+ While many data types work well in cross-backend validation, there are some known
5971
+ limitations to be aware of:
5972
+
5973
+ - date/datetime types: When converting between Polars and Pandas, date objects may be
5974
+ represented differently. For example, `datetime.date` objects in Pandas may become
5975
+ `pd.Timestamp` objects when converted from Polars, leading to false mismatches. To work
5976
+ around this, ensure both tables use the same datetime representation before comparison.
5977
+ - custom types: User-defined types or complex nested structures may not convert cleanly
5978
+ between backends and could cause unexpected comparison failures.
5979
+ - categorical types: Categorical/factor columns may have different internal
5980
+ representations across backends.
5981
+ - timezone-aware datetimes: Timezone handling differs between backends and may cause
5982
+ comparison issues.
5983
+
5984
+ Here are some ideas to overcome such limitations:
5985
+
5986
+ - for date/datetime columns, consider using `pre=` preprocessing to normalize representations
5987
+ before comparison (see the sketch after this list).
5988
+ - when working with custom types, manually convert tables to the same backend before using
5989
+ `tbl_match()`.
5990
+ - use the same datetime precision (e.g., milliseconds vs microseconds) in both tables.
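
The following is a rough sketch of the first and third ideas (the `event_ts` column and both
tables are illustrative, not from a built-in dataset). A module-level function normalizes the
datetime precision of the target table through `pre=`, and the same function is applied to the
comparison table up front, since `pre=` does not touch it:

```python
import pointblank as pb
import polars as pl
from datetime import datetime

def normalize_ts(df):
    # Cast the datetime column to a single precision (microseconds)
    return df.with_columns(pl.col("event_ts").dt.cast_time_unit("us"))

# Two otherwise identical tables whose datetime columns use different time units
tbl_us = pl.DataFrame({"event_ts": [datetime(2024, 1, 1), datetime(2024, 1, 2)]})
tbl_ms = tbl_us.with_columns(pl.col("event_ts").dt.cast_time_unit("ms"))

validation = (
    pb.Validate(data=tbl_us)
    .tbl_match(
        tbl_compare=normalize_ts(tbl_ms),  # normalize the comparison table manually
        pre=normalize_ts,                  # normalize the target table at interrogation time
    )
    .interrogate()
)

validation
```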
5991
+
5296
5992
  Examples
5297
5993
  --------
5298
- For the examples here, we'll use the built in dataset `"game_revenue"`. The table can be
5299
- obtained by calling `load_dataset("game_revenue")`.
5994
+ For the examples here, we'll create two simple tables to demonstrate the `tbl_match()`
5995
+ validation.
5300
5996
 
5301
- Let's validate that the number of columns in the table matches a fixed value. In this case,
5302
- we will use the value `11` as the expected column count.
5997
+ ```python
5998
+ import pointblank as pb
5999
+ import polars as pl
6000
+
6001
+ # Create the first table
6002
+ tbl_1 = pl.DataFrame({
6003
+ "a": [1, 2, 3, 4],
6004
+ "b": ["w", "x", "y", "z"],
6005
+ "c": [4.0, 5.0, 6.0, 7.0]
6006
+ })
6007
+
6008
+ # Create an identical table
6009
+ tbl_2 = pl.DataFrame({
6010
+ "a": [1, 2, 3, 4],
6011
+ "b": ["w", "x", "y", "z"],
6012
+ "c": [4.0, 5.0, 6.0, 7.0]
6013
+ })
6014
+
6015
+ pb.preview(tbl_1)
6016
+ ```
6017
+
6018
+ Let's validate that `tbl_1` matches `tbl_2`. Since these tables are identical, the
6019
+ validation should pass.
5303
6020
 
5304
6021
  ```python
5305
6022
  validation = (
5306
- pb.Validate(data=game_revenue)
5307
- .col_count_match(count=11)
6023
+ pb.Validate(data=tbl_1)
6024
+ .tbl_match(tbl_compare=tbl_2)
5308
6025
  .interrogate()
5309
6026
  )
5310
6027
 
5311
6028
  validation
5312
6029
  ```
5313
6030
 
5314
- The validation table shows that the expectation value of `11` matches the actual count of
5315
- columns in the target table. So, the single test unit passed.
6031
+ The validation table shows that the single test unit passed, indicating that the two tables
6032
+ match completely.
6033
+
6034
+ Now, let's create a table with a slight difference and see what happens.
6035
+
6036
+ ```python
6037
+ # Create a table with one different value
6038
+ tbl_3 = pl.DataFrame({
6039
+ "a": [1, 2, 3, 4],
6040
+ "b": ["w", "x", "y", "z"],
6041
+ "c": [4.0, 5.5, 6.0, 7.0] # Changed 5.0 to 5.5
6042
+ })
6043
+
6044
+ validation = (
6045
+ pb.Validate(data=tbl_1)
6046
+ .tbl_match(tbl_compare=tbl_3)
6047
+ .interrogate()
6048
+ )
6049
+
6050
+ validation
6051
+ ```
6052
+
6053
+ The validation table shows that the single test unit failed because the tables don't match
6054
+ (one value is different in column `c`).
5316
6055
 
5317
6056
 
5318
6057
  conjointly(self, *exprs: 'Callable', pre: 'Callable | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
@@ -5814,6 +6553,317 @@ specially(self, expr: 'Callable', pre: 'Callable | None' = None, thresholds: 'in
5814
6553
  virtually any data quality requirement in your organization.
5815
6554
 
5816
6555
 
6556
+ prompt(self, prompt: 'str', model: 'str', columns_subset: 'str | list[str] | None' = None, batch_size: 'int' = 1000, max_concurrent: 'int' = 3, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
6557
+
6558
+ Validate rows using AI/LLM-powered analysis.
6559
+
6560
+ The `prompt()` validation method uses Large Language Models (LLMs) to validate rows of data
6561
+ based on natural language criteria. Similar to other Pointblank validation methods, this
6562
+ generates binary test results (pass/fail) that integrate seamlessly with the standard
6563
+ reporting framework.
6564
+
6565
+ Like `col_vals_*()` methods, `prompt()` evaluates data against specific criteria, but
6566
+ instead of using programmatic rules, it uses natural language prompts interpreted by an LLM.
6567
+ Like `rows_distinct()` and `rows_complete()`, it operates at the row level and allows you to
6568
+ specify a subset of columns for evaluation using `columns_subset=`.
6569
+
6570
+ The system automatically combines your validation criteria from the `prompt=` parameter with
6571
+ the necessary technical context, data formatting instructions, and response structure
6572
+ requirements. This is all so you only need to focus on describing your validation logic in
6573
+ plain language.
6574
+
6575
+ Each row becomes a test unit that either passes or fails the validation criteria, producing
6576
+ the familiar True/False results that appear in Pointblank validation reports. This method
6577
+ is particularly useful for complex validation rules that are difficult to express with
6578
+ traditional validation methods, such as semantic checks, context-dependent validation, or
6579
+ subjective quality assessments.
6580
+
6581
+ Parameters
6582
+ ----------
6583
+ prompt
6584
+ A natural language description of the validation criteria. This prompt should clearly
6585
+ describe what constitutes valid vs invalid rows. Some examples:
6586
+ `"Each row should contain a valid email address and a realistic person name"`,
6587
+ `"Values should indicate positive sentiment"`,
6588
+ `"The description should mention a country name"`.
6589
+ columns_subset
6590
+ A single column or list of columns to include in the validation. If `None`, all columns
6591
+ will be included. Specifying fewer columns can improve performance and reduce API costs
6592
+ so try to include only the columns necessary for the validation.
6593
+ model
6594
+ The model to be used. This should be in the form of `provider:model` (e.g.,
6595
+ `"anthropic:claude-sonnet-4-5"`). Supported providers are `"anthropic"`, `"openai"`,
6596
+ `"ollama"`, and `"bedrock"`. The model name should be the specific model to be used from
6597
+ the provider. Model names are subject to change so consult the provider's documentation
6598
+ for the most up-to-date model names.
6599
+ batch_size
6600
+ Number of rows to process in each batch. Larger batches are more efficient but may hit
6601
+ API limits. Default is `1000`.
6602
+ max_concurrent
6603
+ Maximum number of concurrent API requests. Higher values speed up processing but may
6604
+ hit rate limits. Default is `3`.
6605
+ pre
6606
+ An optional preprocessing function or lambda to apply to the data table during
6607
+ interrogation. This function should take a table as input and return a modified table.
6608
+ segments
6609
+ An optional directive on segmentation, which serves to split a validation step into
6610
+ multiple (one step per segment). Can be a single column name, a tuple that specifies a
6611
+ column name and its corresponding values to segment on, or a combination of both
6612
+ (provided as a list).
6613
+ thresholds
6614
+ Set threshold failure levels for reporting and reacting to exceedances of the levels.
6615
+ The thresholds are set at the step level and will override any global thresholds set in
6616
+ `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will
6617
+ be set locally and global thresholds (if any) will take effect.
6618
+ actions
6619
+ Optional actions to take when the validation step meets or exceeds any set threshold
6620
+ levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to
6621
+ define the actions.
6622
+ brief
6623
+ An optional brief description of the validation step that will be displayed in the
6624
+ reporting table. You can use the templating elements like `"{step}"` to insert
6625
+ the step number, or `"{auto}"` to include an automatically generated brief. If `True`
6626
+ the entire brief will be automatically generated. If `None` (the default) then there
6627
+ won't be a brief.
6628
+ active
6629
+ A boolean value indicating whether the validation step should be active. Using `False`
6630
+ will make the validation step inactive (still reporting its presence and keeping indexes
6631
+ for the steps unchanged).
6632
+
6633
+ Returns
6634
+ -------
6635
+ Validate
6636
+ The `Validate` object with the added validation step.
6637
+
6638
+ Constructing the `model` Argument
6639
+ ---------------------------------
6640
+ The `model=` argument should be constructed using the provider and model name separated by a
6641
+ colon (`provider:model`). The provider text can be any of:
6642
+
6643
+ - `"anthropic"` (Anthropic)
6644
+ - `"openai"` (OpenAI)
6645
+ - `"ollama"` (Ollama)
6646
+ - `"bedrock"` (Amazon Bedrock)
6647
+
6648
+ The model name should be the specific model to be used from the provider. Model names are
6649
+ subject to change so consult the provider's documentation for the most up-to-date model
6650
+ names.
6651
+
6652
+ Notes on Authentication
6653
+ -----------------------
6654
+ API keys are automatically loaded from environment variables or `.env` files and are **not**
6655
+ stored in the validation object for security reasons. You should consider using a secure
6656
+ method for handling API keys.
6657
+
6658
+ One way to do this is to load the API key from an environment variable and retrieve it using
6659
+ the `os` module (specifically the `os.getenv()` function). Places to store the API key might
6660
+ include `.bashrc`, `.bash_profile`, `.zshrc`, or `.zsh_profile`.
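
For example (a minimal sketch; the variable name follows the provider conventions listed
below), you can confirm that a key is visible to the current Python session before kicking off
an AI validation:

```python
import os

# The AI validation machinery reads the key from the environment on its own;
# this check only confirms that the variable is set in this session
if os.getenv("ANTHROPIC_API_KEY") is None:
    raise RuntimeError("ANTHROPIC_API_KEY is not set in this environment")
```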
6661
+
6662
+ Another solution is to store one or more model provider API keys in an `.env` file (in the
6663
+ root of your project). If the API keys have correct names (e.g., `ANTHROPIC_API_KEY` or
6664
+ `OPENAI_API_KEY`) then the AI validation will automatically load the API key from the `.env`
6665
+ file. An `.env` file might look like this:
6666
+
6667
+ ```plaintext
6668
+ ANTHROPIC_API_KEY="your_anthropic_api_key_here"
6669
+ OPENAI_API_KEY="your_openai_api_key_here"
6670
+ ```
6671
+
6672
+ There's no need to have the `python-dotenv` package installed when using `.env` files in
6673
+ this way.
6674
+
6675
+ **Provider-specific setup**:
6676
+
6677
+ - **OpenAI**: set `OPENAI_API_KEY` environment variable or create `.env` file
6678
+ - **Anthropic**: set `ANTHROPIC_API_KEY` environment variable or create `.env` file
6679
+ - **Ollama**: no API key required, just ensure Ollama is running locally
6680
+ - **Bedrock**: configure AWS credentials through standard AWS methods
6681
+
6682
+ AI Validation Process
6683
+ ---------------------
6684
+ The AI validation process works as follows:
6685
+
6686
+ 1. data batching: the data is split into batches of the specified size
6687
+ 2. row deduplication: duplicate rows (based on selected columns) are identified and only
6688
+ unique combinations are sent to the LLM for analysis
6689
+ 3. json conversion: each batch of unique rows is converted to JSON format for the LLM
6690
+ 4. prompt construction: the user prompt is embedded in a structured system prompt
6691
+ 5. llm processing: each batch is sent to the LLM for analysis
6692
+ 6. response parsing: LLM responses are parsed to extract validation results
6693
+ 7. result projection: results are mapped back to all original rows using row signatures
6694
+ 8. result aggregation: results from all batches are combined
6695
+
6696
+ **Performance Optimization**: the process uses row signature memoization to avoid redundant
6697
+ LLM calls. When multiple rows have identical values in the selected columns, only one
6698
+ representative row is validated, and the result is applied to all matching rows. This can
6699
+ dramatically reduce API costs and processing time for datasets with repetitive patterns.
6700
+
6701
+ The LLM receives data in this JSON format:
6702
+
6703
+ ```json
6704
+ {
6705
+ "columns": ["col1", "col2", "col3"],
6706
+ "rows": [
6707
+ {"col1": "value1", "col2": "value2", "col3": "value3", "_pb_row_index": 0},
6708
+ {"col1": "value4", "col2": "value5", "col3": "value6", "_pb_row_index": 1}
6709
+ ]
6710
+ }
6711
+ ```
6712
+
6713
+ The LLM returns validation results in this format:
6714
+ ```json
6715
+ [
6716
+ {"index": 0, "result": true},
6717
+ {"index": 1, "result": false}
6718
+ ]
6719
+ ```
6720
+
6721
+ Prompt Design Tips
6722
+ ------------------
6723
+ For best results, design prompts that are:
6724
+
6725
+ - boolean-oriented: frame validation criteria to elicit clear valid/invalid responses
6726
+ - specific: clearly define what makes a row valid/invalid
6727
+ - unambiguous: avoid subjective language that could be interpreted differently
6728
+ - context-aware: include relevant business rules or domain knowledge
6729
+ - example-driven: consider providing examples in the prompt when helpful
6730
+
6731
+ **Critical**: Prompts must be designed so the LLM can determine whether each row passes or
6732
+ fails the validation criteria. The system expects binary validation responses, so avoid
6733
+ open-ended questions or prompts that might generate explanatory text instead of clear
6734
+ pass/fail judgments.
6735
+
6736
+ Good prompt examples:
6737
+
6738
+ - "Each row should contain a valid email address in the 'email' column and a non-empty name
6739
+ in the 'name' column"
6740
+ - "The 'sentiment' column should contain positive sentiment words (happy, good, excellent,
6741
+ etc.)"
6742
+ - "Product descriptions should mention at least one technical specification"
6743
+
6744
+ Poor prompt examples (avoid these):
6745
+
6746
+ - "What do you think about this data?" (too open-ended)
6747
+ - "Describe the quality of each row" (asks for description, not validation)
6748
+ - "How would you improve this data?" (asks for suggestions, not pass/fail)
6749
+
6750
+ Performance Considerations
6751
+ --------------------------
6752
+ AI validation is significantly slower than traditional validation methods due to API calls
6753
+ to LLM providers. However, performance varies dramatically based on data characteristics:
6754
+
6755
+ **High Memoization Scenarios** (seconds to minutes):
6756
+
6757
+ - data with many duplicate rows in the selected columns
6758
+ - low cardinality data (repeated patterns)
6759
+ - small number of unique row combinations
6760
+
6761
+ **Low Memoization Scenarios** (minutes to hours):
6762
+
6763
+ - high cardinality data with mostly unique rows
6764
+ - large datasets with few repeated patterns
6765
+ - all or most rows requiring individual LLM evaluation
6766
+
6767
+ The row signature memoization optimization can reduce processing time significantly when
6768
+ data has repetitive patterns. For datasets where every row is unique, expect longer
6769
+ processing times similar to validating each row individually.
6770
+
6771
+ **Strategies to Reduce Processing Time**:
6772
+
6773
+ - test on data slices: define a sampling function like `def sample_1000(df): return df.head(1000)`
6774
+ and use `pre=sample_1000` to validate on smaller samples (see the sketch after this list)
6775
+ - filter relevant data: define filter functions like `def active_only(df): return df.filter(df["status"] == "active")`
6776
+ and use `pre=active_only` to focus on a specific subset
6777
+ - optimize column selection: use `columns_subset=` to include only the columns necessary
6778
+ for validation
6779
+ - start with smaller batches: begin with `batch_size=100` for testing, then increase
6780
+ gradually
6781
+ - reduce concurrency: lower `max_concurrent=1` if hitting rate limits
6782
+ - use faster/cheaper models: consider using smaller or more efficient models for initial
6783
+ testing before switching to more capable models
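
Below is a rough sketch of the first strategy (the table, prompt wording, and column names are
placeholders, and an `OPENAI_API_KEY` must be available in the environment). A module-level
sampling function is passed to `pre=` so that only a small slice of the data is sent to the LLM:

```python
import pointblank as pb
import polars as pl

def sample_100(df):
    # Validate a small slice first to gauge cost and prompt quality
    return df.head(100)

reviews = pl.DataFrame({
    "review": ["Great product!", "Broke after one day", "Works as expected"],
    "rating": [5, 1, 4],
})

validation = (
    pb.Validate(data=reviews)
    .prompt(
        prompt="The review text should be consistent with the numeric rating",
        columns_subset=["review", "rating"],
        model="openai:gpt-4o-mini",
        batch_size=100,
        pre=sample_100,
    )
    .interrogate()
)
```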
6784
+
6785
+ Examples
6786
+ --------
6787
+ The following examples demonstrate how to use AI validation for different types of data
6788
+ quality checks. These examples show both basic usage and more advanced configurations with
6789
+ custom thresholds and actions.
6790
+
6791
+ **Basic AI validation example:**
6792
+
6793
+ This first example shows a simple validation scenario where we want to check that customer
6794
+ records have both valid email addresses and non-empty names. Notice how we use
6795
+ `columns_subset=` to focus only on the relevant columns, which improves both performance
6796
+ and cost-effectiveness.
6797
+
6798
+ ```python
6799
+ import pointblank as pb
6800
+ import polars as pl
6801
+
6802
+ # Sample data with email and name columns
6803
+ tbl = pl.DataFrame({
6804
+ "email": ["john@example.com", "invalid-email", "jane@test.org"],
6805
+ "name": ["John Doe", "", "Jane Smith"],
6806
+ "age": [25, 30, 35]
6807
+ })
6808
+
6809
+ # Validate using AI
6810
+ validation = (
6811
+ pb.Validate(data=tbl)
6812
+ .prompt(
6813
+ prompt="Each row should have a valid email address and a non-empty name",
6814
+ columns_subset=["email", "name"], # Only check these columns
6815
+ model="openai:gpt-4o-mini",
6816
+ )
6817
+ .interrogate()
6818
+ )
6819
+
6820
+ validation
6821
+ ```
6822
+
6823
+ In this example, the AI will identify that the second row fails validation because it has
6824
+ an invalid email format (`"invalid-email"`) and an empty name field, while the other two
6825
+ rows pass. The validation results will show 1 out of 3 rows failing the criteria.
6826
+
6827
+ **Advanced example with custom thresholds:**
6828
+
6829
+ This more sophisticated example demonstrates how to use AI validation with custom thresholds
6830
+ and actions. Here we're validating phone number formats to ensure they include area codes,
6831
+ which is a common data quality requirement for customer contact information.
6832
+
6833
+ ```python
6834
+ customer_data = pl.DataFrame({
6835
+ "customer_id": [1, 2, 3, 4, 5],
6836
+ "name": ["John Doe", "Jane Smith", "Bob Johnson", "Alice Brown", "Charlie Davis"],
6837
+ "phone_number": [
6838
+ "(555) 123-4567", # Valid with area code
6839
+ "555-987-6543", # Valid with area code
6840
+ "123-4567", # Missing area code
6841
+ "(800) 555-1234", # Valid with area code
6842
+ "987-6543" # Missing area code
6843
+ ]
6844
+ })
6845
+
6846
+ validation = (
6847
+ pb.Validate(data=customer_data)
6848
+ .prompt(
6849
+ prompt="Do all the phone numbers include an area code?",
6850
+ columns_subset="phone_number", # Only check the `phone_number` column
6851
+ model="openai:gpt-4o",
6852
+ batch_size=500,
6853
+ max_concurrent=5,
6854
+ thresholds=pb.Thresholds(warning=0.1, error=0.2, critical=0.3),
6855
+ actions=pb.Actions(error="Too many phone numbers missing area codes.")
6856
+ )
6857
+ .interrogate()
6858
+ )
6859
+ ```
6860
+
6861
+ This validation will identify that 2 out of 5 phone numbers (40%) are missing area codes,
6862
+ which exceeds all threshold levels. The validation will trigger the specified error action
6863
+ since the failure rate (40%) is above the error threshold (20%). The AI can recognize
6864
+ various phone number formats and determine whether they include area codes.
6865
+
6866
+
5817
6867
 
5818
6868
  ## The Column Selection family
5819
6869
 
@@ -5867,9 +6917,12 @@ col(exprs: 'str | ColumnSelector | ColumnSelectorNarwhals') -> 'Column | ColumnL
5867
6917
  - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`)
5868
6918
  - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`)
5869
6919
  - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`)
6920
+ - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`)
6921
+ - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`)
5870
6922
  - [`col_vals_null()`](`pointblank.Validate.col_vals_null`)
5871
6923
  - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`)
5872
6924
  - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`)
6925
+ - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`)
5873
6926
  - [`col_exists()`](`pointblank.Validate.col_exists`)
5874
6927
 
5875
6928
  If specifying a single column with certainty (you have the exact name), `col()` is not necessary
@@ -6151,9 +7204,12 @@ starts_with(text: 'str', case_sensitive: 'bool' = False) -> 'StartsWith'
6151
7204
  - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`)
6152
7205
  - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`)
6153
7206
  - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`)
7207
+ - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`)
7208
+ - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`)
6154
7209
  - [`col_vals_null()`](`pointblank.Validate.col_vals_null`)
6155
7210
  - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`)
6156
7211
  - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`)
7212
+ - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`)
6157
7213
  - [`col_exists()`](`pointblank.Validate.col_exists`)
6158
7214
 
6159
7215
  The `starts_with()` selector function doesn't need to be used in isolation. Read the next
@@ -6301,9 +7357,12 @@ ends_with(text: 'str', case_sensitive: 'bool' = False) -> 'EndsWith'
6301
7357
  - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`)
6302
7358
  - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`)
6303
7359
  - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`)
7360
+ - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`)
7361
+ - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`)
6304
7362
  - [`col_vals_null()`](`pointblank.Validate.col_vals_null`)
6305
7363
  - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`)
6306
7364
  - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`)
7365
+ - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`)
6307
7366
  - [`col_exists()`](`pointblank.Validate.col_exists`)
6308
7367
 
6309
7368
  The `ends_with()` selector function doesn't need to be used in isolation. Read the next section
@@ -6452,9 +7511,12 @@ contains(text: 'str', case_sensitive: 'bool' = False) -> 'Contains'
6452
7511
  - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`)
6453
7512
  - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`)
6454
7513
  - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`)
7514
+ - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`)
7515
+ - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`)
6455
7516
  - [`col_vals_null()`](`pointblank.Validate.col_vals_null`)
6456
7517
  - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`)
6457
7518
  - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`)
7519
+ - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`)
6458
7520
  - [`col_exists()`](`pointblank.Validate.col_exists`)
6459
7521
 
6460
7522
  The `contains()` selector function doesn't need to be used in isolation. Read the next section
@@ -6603,9 +7665,12 @@ matches(pattern: 'str', case_sensitive: 'bool' = False) -> 'Matches'
6603
7665
  - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`)
6604
7666
  - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`)
6605
7667
  - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`)
7668
+ - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`)
7669
+ - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`)
6606
7670
  - [`col_vals_null()`](`pointblank.Validate.col_vals_null`)
6607
7671
  - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`)
6608
7672
  - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`)
7673
+ - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`)
6609
7674
  - [`col_exists()`](`pointblank.Validate.col_exists`)
6610
7675
 
6611
7676
  The `matches()` selector function doesn't need to be used in isolation. Read the next section
@@ -6736,9 +7801,12 @@ everything() -> 'Everything'
6736
7801
  - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`)
6737
7802
  - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`)
6738
7803
  - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`)
7804
+ - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`)
7805
+ - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`)
6739
7806
  - [`col_vals_null()`](`pointblank.Validate.col_vals_null`)
6740
7807
  - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`)
6741
7808
  - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`)
7809
+ - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`)
6742
7810
  - [`col_exists()`](`pointblank.Validate.col_exists`)
6743
7811
 
6744
7812
  The `everything()` selector function doesn't need to be used in isolation. Read the next section
@@ -6879,9 +7947,12 @@ first_n(n: 'int', offset: 'int' = 0) -> 'FirstN'
6879
7947
  - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`)
6880
7948
  - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`)
6881
7949
  - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`)
7950
+ - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`)
7951
+ - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`)
6882
7952
  - [`col_vals_null()`](`pointblank.Validate.col_vals_null`)
6883
7953
  - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`)
6884
7954
  - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`)
7955
+ - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`)
6885
7956
  - [`col_exists()`](`pointblank.Validate.col_exists`)
6886
7957
 
6887
7958
  The `first_n()` selector function doesn't need to be used in isolation. Read the next section
@@ -7026,9 +8097,12 @@ last_n(n: 'int', offset: 'int' = 0) -> 'LastN'
7026
8097
  - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`)
7027
8098
  - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`)
7028
8099
  - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`)
8100
+ - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`)
8101
+ - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`)
7029
8102
  - [`col_vals_null()`](`pointblank.Validate.col_vals_null`)
7030
8103
  - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`)
7031
8104
  - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`)
8105
+ - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`)
7032
8106
  - [`col_exists()`](`pointblank.Validate.col_exists`)
7033
8107
 
7034
8108
  The `last_n()` selector function doesn't need to be used in isolation. Read the next section for
@@ -7392,32 +8466,134 @@ interrogate(self, collect_extracts: 'bool' = True, collect_tbl_checked: 'bool' =
7392
8466
  import pointblank as pb
7393
8467
  import polars as pl
7394
8468
 
7395
- validation = (
7396
- pb.Validate(data=pb.load_dataset(dataset="game_revenue"))
7397
- .col_vals_lt(columns="item_revenue", value=200)
7398
- .col_vals_gt(columns="item_revenue", value=0)
7399
- .col_vals_gt(columns="session_duration", value=5)
7400
- .col_vals_in_set(columns="item_type", set=["iap", "ad"])
7401
- .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}[0-9]{3}")
8469
+ validation = (
8470
+ pb.Validate(data=pb.load_dataset(dataset="game_revenue"))
8471
+ .col_vals_lt(columns="item_revenue", value=200)
8472
+ .col_vals_gt(columns="item_revenue", value=0)
8473
+ .col_vals_gt(columns="session_duration", value=5)
8474
+ .col_vals_in_set(columns="item_type", set=["iap", "ad"])
8475
+ .col_vals_regex(columns="player_id", pattern=r"[A-Z]{12}[0-9]{3}")
8476
+ )
8477
+
8478
+ validation.interrogate(get_first_n=10)
8479
+ ```
8480
+
8481
+ The validation table shows that step 3 (checking for `session_duration` greater than `5`)
8482
+ has 18 failing test units. This means that 18 rows in the table are problematic. We'd like
8483
+ to see the rows that failed this validation step and we can do that with the
8484
+ [`get_data_extracts()`](`pointblank.Validate.get_data_extracts`) method.
8485
+
8486
+ ```python
8487
+ pb.preview(validation.get_data_extracts(i=3, frame=True))
8488
+ ```
8489
+
8490
+ The [`get_data_extracts()`](`pointblank.Validate.get_data_extracts`) method will return a
8491
+ Polars DataFrame here with the first 10 rows that failed the validation step (we passed that
8492
+ into the [`preview()`](`pointblank.preview`) function for a better display). There are
8493
+ actually 18 rows that failed but we limited the collection of extracts with
8494
+ `get_first_n=10`.
8495
+
8496
+
8497
+ set_tbl(self, tbl: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None' = None) -> 'Validate'
8498
+
8499
+ Set or replace the table associated with the Validate object.
8500
+
8501
+ This method allows you to replace the table associated with a Validate object with a
8502
+ different (but presumably similar) table. This is useful when you want to apply the same
8503
+ validation plan to multiple tables or when you have a validation workflow defined but want
8504
+ to swap in a different data source.
8505
+
8506
+ Parameters
8507
+ ----------
8508
+ tbl
8509
+ The table to replace the existing table with. This can be any supported table type
8510
+ including DataFrame objects, Ibis table objects, CSV file paths, Parquet file paths,
8511
+ GitHub URLs, or database connection strings. The same table type constraints apply as in
8512
+ the `Validate` constructor.
8513
+ tbl_name
8514
+ An optional name to assign to the new input table object. If no value is provided, the
8515
+ existing table name will be retained.
8516
+ label
8517
+ An optional label for the validation plan. If no value is provided, the existing label
8518
+ will be retained.
8519
+
8520
+ Returns
8521
+ -------
8522
+ Validate
8523
+ A new `Validate` object with the replacement table.
8524
+
8525
+ When to Use
8526
+ -----------
8527
+ The `set_tbl()` method is particularly useful in scenarios where you have:
8528
+
8529
+ - multiple similar tables that need the same validation checks
8530
+ - a template validation workflow that should be applied to different data sources
8531
+ - YAML-defined validations where you want to override the table specified in the YAML
8532
+
8533
+ The `set_tbl()` method creates a copy of the validation object with the new table, so the
8534
+ original validation object remains unchanged. This allows you to reuse validation plans
8535
+ across multiple tables without interference.
8536
+
8537
+ Examples
8538
+ --------
8539
+ We will first create two similar tables for our future validation plans.
8540
+
8541
+ ```python
8542
+ import pointblank as pb
8543
+ import polars as pl
8544
+
8545
+ # Create two similar tables
8546
+ table_1 = pl.DataFrame({
8547
+ "x": [1, 2, 3, 4, 5],
8548
+ "y": [5, 4, 3, 2, 1],
8549
+ "z": ["a", "b", "c", "d", "e"]
8550
+ })
8551
+
8552
+ table_2 = pl.DataFrame({
8553
+ "x": [2, 4, 6, 8, 10],
8554
+ "y": [10, 8, 6, 4, 2],
8555
+ "z": ["f", "g", "h", "i", "j"]
8556
+ })
8557
+ ```
8558
+
8559
+ Create a validation plan with the first table.
8560
+
8561
+ ```python
8562
+ validation_table_1 = (
8563
+ pb.Validate(
8564
+ data=table_1,
8565
+ tbl_name="Table 1",
8566
+ label="Validation applied to the first table"
8567
+ )
8568
+ .col_vals_gt(columns="x", value=0)
8569
+ .col_vals_lt(columns="y", value=10)
7402
8570
  )
8571
+ ```
7403
8572
 
7404
- validation.interrogate(get_first_n=10)
8573
+ Now apply the same validation plan to the second table.
8574
+
8575
+ ```python
8576
+ validation_table_2 = (
8577
+ validation_table_1
8578
+ .set_tbl(
8579
+ tbl=table_2,
8580
+ tbl_name="Table 2",
8581
+ label="Validation applied to the second table"
8582
+ )
8583
+ )
7405
8584
  ```
7406
8585
 
7407
- The validation table shows that step 3 (checking for `session_duration` greater than `5`)
7408
- has 18 failing test units. This means that 18 rows in the table are problematic. We'd like
7409
- to see the rows that failed this validation step and we can do that with the
7410
- [`get_data_extracts()`](`pointblank.Validate.get_data_extracts`) method.
8586
+ Here is the interrogation of the first table:
7411
8587
 
7412
8588
  ```python
7413
- pb.preview(validation.get_data_extracts(i=3, frame=True))
8589
+ validation_table_1.interrogate()
7414
8590
  ```
7415
8591
 
7416
- The [`get_data_extracts()`](`pointblank.Validate.get_data_extracts`) method will return a
7417
- Polars DataFrame here with the first 10 rows that failed the validation step (we passed that
7418
- into the [`preview()`](`pointblank.preview`) function for a better display). There are
7419
- actually 18 rows that failed but we limited the collection of extracts with
7420
- `get_first_n=10`.
8592
+ And the second table:
8593
+
8594
+ ```python
8595
+ validation_table_2.interrogate()
8596
+ ```
7421
8597
 
7422
8598
 
7423
8599
  get_tabular_report(self, title: 'str | None' = ':default:', incl_header: 'bool' = None, incl_footer: 'bool' = None) -> 'GT'
@@ -7557,11 +8733,15 @@ get_step_report(self, i: 'int', columns_subset: 'str | list[str] | Column | None
7557
8733
  - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`)
7558
8734
  - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`)
7559
8735
  - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`)
8736
+ - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`)
8737
+ - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`)
7560
8738
  - [`col_vals_null()`](`pointblank.Validate.col_vals_null`)
7561
8739
  - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`)
7562
8740
  - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`)
8741
+ - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`)
7563
8742
  - [`col_vals_expr()`](`pointblank.Validate.col_vals_expr`)
7564
8743
  - [`conjointly()`](`pointblank.Validate.conjointly`)
8744
+ - [`prompt()`](`pointblank.Validate.prompt`)
7565
8745
  - [`rows_complete()`](`pointblank.Validate.rows_complete`)
7566
8746
 
7567
8747
  The [`rows_distinct()`](`pointblank.Validate.rows_distinct`) validation step will produce a
@@ -7898,11 +9078,15 @@ get_data_extracts(self, i: 'int | list[int] | None' = None, frame: 'bool' = Fals
7898
9078
  - [`col_vals_outside()`](`pointblank.Validate.col_vals_outside`)
7899
9079
  - [`col_vals_in_set()`](`pointblank.Validate.col_vals_in_set`)
7900
9080
  - [`col_vals_not_in_set()`](`pointblank.Validate.col_vals_not_in_set`)
9081
+ - [`col_vals_increasing()`](`pointblank.Validate.col_vals_increasing`)
9082
+ - [`col_vals_decreasing()`](`pointblank.Validate.col_vals_decreasing`)
7901
9083
  - [`col_vals_null()`](`pointblank.Validate.col_vals_null`)
7902
9084
  - [`col_vals_not_null()`](`pointblank.Validate.col_vals_not_null`)
7903
9085
  - [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`)
9086
+ - [`col_vals_within_spec()`](`pointblank.Validate.col_vals_within_spec`)
7904
9087
  - [`col_vals_expr()`](`pointblank.Validate.col_vals_expr`)
7905
9088
  - [`conjointly()`](`pointblank.Validate.conjointly`)
9089
+ - [`prompt()`](`pointblank.Validate.prompt`)
7906
9090
 
7907
9091
  An extracted row for these validation methods means that a test unit failed for that row in
7908
9092
  the validation step.
@@ -9514,7 +10698,7 @@ assistant(model: 'str', data: 'FrameT | Any | None' = None, tbl_name: 'str | Non
9514
10698
  ----------
9515
10699
  model
9516
10700
  The model to be used. This should be in the form of `provider:model` (e.g.,
9517
- `"anthropic:claude-3-5-sonnet-latest"`). Supported providers are `"anthropic"`, `"openai"`,
10701
+ `"anthropic:claude-sonnet-4-5"`). Supported providers are `"anthropic"`, `"openai"`,
9518
10702
  `"ollama"`, and `"bedrock"`.
9519
10703
  data
9520
10704
  An optional data table to focus on during discussion with the PbA, which could be a
@@ -9894,15 +11078,46 @@ connect_to_table(connection_string: 'str') -> 'Any'
9894
11078
  pip install 'ibis-framework[duckdb]' # for DuckDB
9895
11079
  pip install 'ibis-framework[postgres]' # for PostgreSQL
9896
11080
  ```
11081
+ See Also
11082
+ --------
11083
+ print_database_tables : List all available tables in a database for discovery
11084
+
11085
+
11086
+ print_database_tables(connection_string: 'str') -> 'list[str]'
11087
+
11088
+ List all tables in a database from a connection string.
11089
+
11090
+ The `print_database_tables()` function connects to a database and returns a list of all
11091
+ available tables. This is particularly useful for discovering what tables exist in a database
11092
+ before connecting to a specific table with `connect_to_table()`. The function automatically
11093
+ filters out temporary Ibis tables (memtables) to show only user tables. It supports all database
11094
+ backends available through Ibis, including DuckDB, SQLite, PostgreSQL, MySQL, BigQuery, and
11095
+ Snowflake.
11096
+
11097
+ Parameters
11098
+ ----------
11099
+ connection_string
11100
+ A database connection string *without* the `::table_name` suffix. Example:
11101
+ `"duckdb:///path/to/database.ddb"`.
11102
+
11103
+ Returns
11104
+ -------
11105
+ list[str]
11106
+ List of table names, excluding temporary Ibis tables.
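
Examples
--------
A minimal usage sketch (the file path and table names here are hypothetical):

```python
import pointblank as pb

# Discover which tables exist in a local DuckDB database
tables = pb.print_database_tables("duckdb:///path/to/database.ddb")
print(tables)  # e.g., ['sales', 'customers']

# Then connect to one of them by appending the `::table_name` suffix
sales_tbl = pb.connect_to_table("duckdb:///path/to/database.ddb::sales")
```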
11107
+
11108
+ See Also
11109
+ --------
11110
+ connect_to_table : Connect to a database table with full connection string documentation
9897
11111
 
9898
11112
 
9899
11113
 
9900
11114
  ## The YAML family
9901
11115
 
9902
11116
  The *YAML* group contains functions that allow for the use of YAML to orchestrate
9903
- validation workflows. The `yaml_interrogate()` function can be used to run a validation workflow from
9904
- YAML strings or files. The `validate_yaml()` function checks if the YAML configuration
9905
- passes its own validity checks.
11117
+ validation workflows. The `yaml_interrogate()` function can be used to run a validation workflow
11118
+ from YAML strings or files. The `validate_yaml()` function checks if the YAML configuration passes
11119
+ its own validity checks. The `yaml_to_python()` function converts YAML configuration to equivalent
11120
+ Python code.
9906
11121
 
9907
11122
  yaml_interrogate(yaml: 'Union[str, Path]', set_tbl: 'Union[FrameT, Any, None]' = None, namespaces: 'Optional[Union[Iterable[str], Mapping[str, str]]]' = None) -> 'Validate'
9908
11123
  Execute a YAML-based validation workflow.
@@ -10295,6 +11510,95 @@ Validate YAML configuration against the expected structure.
10295
11510
  yaml_interrogate : execute YAML-based validation workflows
10296
11511
 
10297
11512
 
11513
+ yaml_to_python(yaml: 'Union[str, Path]') -> 'str'
11514
+ Convert YAML validation configuration to equivalent Python code.
11515
+
11516
+ This function takes a YAML validation configuration and generates the equivalent Python code
11517
+ that would produce the same validation workflow. This is useful for documentation, code
11518
+ generation, or learning how to translate YAML workflows into programmatic workflows.
11519
+
11520
+ The generated Python code includes all necessary imports, data loading, validation steps,
11521
+ and interrogation execution, formatted as executable Python code.
11522
+
11523
+ Parameters
11524
+ ----------
11525
+ yaml
11526
+ YAML configuration as string or file path. Can be: (1) a YAML string containing the
11527
+ validation configuration, or (2) a Path object or string path to a YAML file.
11528
+
11529
+ Returns
11530
+ -------
11531
+ str
11532
+ A formatted Python code string enclosed in markdown code blocks that replicates the YAML
11533
+ workflow. The code includes import statements, data loading, validation method calls, and
11534
+ interrogation execution.
11535
+
11536
+ Raises
11537
+ ------
11538
+ YAMLValidationError
11539
+ If the YAML is invalid, malformed, or contains unknown validation methods.
11540
+
11541
+ Examples
11542
+ --------
11543
+ Convert a basic YAML configuration to Python code:
11544
+
11545
+ ```python
11546
+ import pointblank as pb
11547
+
11548
+ # Define a YAML validation workflow
11549
+ yaml_config = '''
11550
+ tbl: small_table
11551
+ tbl_name: Data Quality Check
11552
+ steps:
11553
+ - col_vals_not_null:
11554
+ columns: [a, b]
11555
+ - col_vals_gt:
11556
+ columns: [c]
11557
+ value: 0
11558
+ '''
11559
+
11560
+ # Generate equivalent Python code
11561
+ python_code = pb.yaml_to_python(yaml_config)
11562
+ print(python_code)
11563
+ ```
11564
+
11565
+ The generated Python code shows exactly how to replicate the YAML workflow programmatically.
11566
+ This is particularly useful when transitioning from YAML-based workflows to code-based
11567
+ workflows, or when generating documentation that shows both YAML and Python approaches.
11568
+
11569
+ For more complex workflows with thresholds and metadata:
11570
+
11571
+ ```python
11572
+ # Advanced YAML configuration
11573
+ yaml_config = '''
11574
+ tbl: small_table
11575
+ tbl_name: Advanced Validation
11576
+ label: Production data check
11577
+ thresholds:
11578
+ warning: 0.1
11579
+ error: 0.2
11580
+ steps:
11581
+ - col_vals_between:
11582
+ columns: [c]
11583
+ left: 1
11584
+ right: 10
11585
+ - col_vals_regex:
11586
+ columns: [b]
11587
+ pattern: '[0-9]-[a-z]{3}-[0-9]{3}'
11588
+ '''
11589
+
11590
+ # Generate the equivalent Python code
11591
+ python_code = pb.yaml_to_python(yaml_config)
11592
+ print(python_code)
11593
+ ```
11594
+
11595
+ The generated code includes all configuration parameters, thresholds, and maintains the exact
11596
+ same validation logic as the original YAML workflow.
11597
+
11598
+ This function is also useful for educational purposes, helping users understand how YAML
11599
+ configurations map to the underlying Python API calls.
11600
+
11601
+
10298
11602
 
10299
11603
  ## The Utility Functions family
10300
11604
 
@@ -10715,6 +12019,297 @@ Access validation summary information when authoring final actions.
10715
12019
  custom actions that are executed after all validation steps have been completed.
10716
12020
 
10717
12021
 
12022
+ write_file(validation: 'Validate', filename: 'str', path: 'str | None' = None, keep_tbl: 'bool' = False, keep_extracts: 'bool' = False, quiet: 'bool' = False) -> 'None'
12023
+
12024
+ Write a Validate object to disk as a serialized file.
12025
+
12026
+ Writing a validation object to disk with `write_file()` can be useful for keeping data
12027
+ validation results close at hand for later retrieval (with `read_file()`). By default, any data
12028
+ table that the validation object holds will be removed before writing to disk (not applicable if
12029
+ no data table is present). This behavior can be changed by setting `keep_tbl=True`, but this
12030
+ only works when the table is not of a database type (e.g., DuckDB, PostgreSQL, etc.), as
12031
+ database connections cannot be serialized.
12032
+
12033
+ Extract data from failing validation steps can also be preserved by setting
12034
+ `keep_extracts=True`, which is useful for later analysis of data quality issues.
12035
+
12036
+ The serialized file uses Python's pickle format for storage of the validation object state,
12037
+ including all validation results, metadata, and optionally the source data.
12038
+
12039
+ **Important note.** If your validation uses custom preprocessing functions (via the `pre=`
12040
+ parameter), these functions must be defined at the module level (not interactively or as lambda
12041
+ functions) to ensure they can be properly restored when loading the validation in a different
12042
+ Python session. Read the *Creating Serializable Validations* section below for more information.
12043
+
12044
+ :::{.callout-warning}
12045
+ The `write_file()` function is currently experimental. Please report any issues you encounter in
12046
+ the [Pointblank issue tracker](https://github.com/posit-dev/pointblank/issues).
12047
+ :::
12048
+
12049
+ Parameters
12050
+ ----------
12051
+ validation
12052
+ The `Validate` object to write to disk.
12053
+ filename
12054
+ The filename to create on disk for the validation object. Should not include the file
12055
+ extension as `.pkl` will be added automatically.
12056
+ path
12057
+ An optional directory path where the file should be saved. If not provided, the file will be
12058
+ saved in the current working directory. The directory will be created if it doesn't exist.
12059
+ keep_tbl
12060
+ An option to keep the data table that is associated with the validation object. The default
12061
+ is `False` where the data table is removed before writing to disk. For database tables
12062
+ (e.g., Ibis tables with database backends), the table is always removed even if
12063
+ `keep_tbl=True`, as database connections cannot be serialized.
12064
+ keep_extracts
12065
+ An option to keep any collected extract data for failing rows from validation steps. By
12066
+ default, this is `False` (i.e., extract data is removed to save space).
12067
+ quiet
12068
+ Should the function not inform when the file is written? By default, this is `False`, so a
12069
+ message will be printed when the file is successfully written.
12070
+
12071
+ Returns
12072
+ -------
12073
+ None
12074
+ This function doesn't return anything but saves the validation object to disk.
12075
+
12076
+ Creating Serializable Validations
12077
+ ---------------------------------
12078
+ To ensure your validations work reliably across different Python sessions, the recommended
12079
+ approach is to use module-level functions. To do this, create a separate Python file for your
12080
+ preprocessing functions:
12081
+
12082
+ ```python
12083
+ # preprocessing_functions.py
12084
+ import polars as pl
12085
+
12086
+ def multiply_by_100(df):
12087
+ return df.with_columns(pl.col("value") * 100)
12088
+
12089
+ def add_computed_column(df):
12090
+ return df.with_columns(computed=pl.col("value") * 2 + 10)
12091
+ ```
12092
+
12093
+ Then import and use them in your validation:
12094
+
12095
+ ```python
12096
+ # your_main_script.py
12097
+ import pointblank as pb
12098
+ from preprocessing_functions import multiply_by_100, add_computed_column
12099
+
12100
+ validation = (
12101
+ pb.Validate(data=my_data)
12102
+ .col_vals_gt(columns="value", value=500, pre=multiply_by_100)
12103
+ .col_vals_between(columns="computed", left=50, right=1000, pre=add_computed_column)
12104
+ .interrogate()
12105
+ )
12106
+
12107
+ # Save validation and it will work reliably across sessions
12108
+ pb.write_file(validation, "my_validation", keep_tbl=True)
12109
+ ```
12110
+
12111
+ ### Problematic Patterns to Avoid
12112
+
12113
+ Don't use lambda functions as they will cause immediate errors.
12114
+
12115
+ Don't use interactive function definitions (as they may fail when loading).
12116
+
12117
+ ```python
12118
+ def my_function(df): # Defined interactively in a notebook/REPL, not in an importable module
12119
+ return df.with_columns(pl.col("value") * 2)
12120
+ # A lambda passed to `pre=` is rejected with an error; this interactive definition may fail to restore on load
12121
+ validation = pb.Validate(data).col_vals_gt(
12122
+ columns="value", value=100, pre=my_function
12123
+ )
12124
+ ```
12125
+
12126
+ ### Automatic Analysis and Guidance
12127
+
12128
+ When you call `write_file()`, it automatically analyzes your validation and provides:
12129
+
12130
+ - confirmation when all functions will work reliably
12131
+ - warnings for functions that may cause cross-session issues
12132
+ - clear errors for unsupported patterns (lambda functions)
12133
+ - specific recommendations and code examples
12134
+ - loading instructions tailored to your validation
12135
+
12136
+ ### Loading Your Validation
12137
+
12138
+ To load a saved validation in a new Python session:
12139
+
12140
+ ```python
12141
+ # In a new Python session
12142
+ import pointblank as pb
12143
+
12144
+ # Import the same preprocessing functions used when creating the validation
12145
+ from preprocessing_functions import multiply_by_100, add_computed_column
12146
+
12147
+ # Upon loading the validation, functions will be automatically restored
12148
+ validation = pb.read_file("my_validation.pkl")
12149
+ ```
12150
+
12151
+ ### Testing Your Validation
12152
+
12153
+ To verify your validation works across sessions:
12154
+
12155
+ 1. save your validation in one Python session
12156
+ 2. start a fresh Python session (restart kernel/interpreter)
12157
+ 3. import required preprocessing functions
12158
+ 4. load the validation using `read_file()`
12159
+ 5. test that preprocessing functions work as expected
12160
+
12161
+ ### Performance and Storage
12162
+
12163
+ - use `keep_tbl=False` (default) to reduce file size when you don't need the original data
12164
+ - use `keep_extracts=False` (default) to save space by excluding extract data
12165
+ - set `quiet=True` to suppress guidance messages in automated scripts
12166
+ - files are saved using pickle's highest protocol for optimal performance
12167
+
12168
+ Examples
12169
+ --------
12170
+ Let's create a simple validation and save it to disk:
12171
+
12172
+ ```python
12173
+ import pointblank as pb
12174
+
12175
+ # Create a validation
12176
+ validation = (
12177
+ pb.Validate(data=pb.load_dataset("small_table"), label="My validation")
12178
+ .col_vals_gt(columns="d", value=100)
12179
+ .col_vals_regex(columns="b", pattern=r"[0-9]-[a-z]{3}-[0-9]{3}")
12180
+ .interrogate()
12181
+ )
12182
+
12183
+ # Save to disk (without the original table data)
12184
+ pb.write_file(validation, "my_validation")
12185
+ ```
12186
+
12187
+ To keep the original table data for later analysis:
12188
+
12189
+ ```python
12190
+ # Save with the original table data included
12191
+ pb.write_file(validation, "my_validation_with_data", keep_tbl=True)
12192
+ ```
12193
+
12194
+ You can also specify a custom directory and keep extract data:
12195
+
12196
+ ```python
12197
+ pb.write_file(
12198
+ validation,
12199
+ filename="detailed_validation",
12200
+ path="/path/to/validations",
12201
+ keep_tbl=True,
12202
+ keep_extracts=True
12203
+ )
12204
+ ```
12205
+
12206
+ ### Working with Preprocessing Functions
12207
+
12208
+ For validations that use preprocessing functions to be portable across sessions, define your
12209
+ functions in a separate `.py` file:
12210
+
12211
+ ```python
12212
+ # In `preprocessing_functions.py`
12213
+
12214
+ import polars as pl
12215
+
12216
+ def multiply_by_100(df):
12217
+ return df.with_columns(pl.col("value") * 100)
12218
+
12219
+ def add_computed_column(df):
12220
+ return df.with_columns(computed=pl.col("value") * 2 + 10)
12221
+ ```
12222
+
12223
+ Then import and use them in your validation:
12224
+
12225
+ ```python
12226
+ # In your main script
12227
+
12228
+ import pointblank as pb
12229
+ from preprocessing_functions import multiply_by_100, add_computed_column
12230
+
12231
+ validation = (
12232
+ pb.Validate(data=my_data)
12233
+ .col_vals_gt(columns="value", value=500, pre=multiply_by_100)
12234
+ .col_vals_between(columns="computed", left=50, right=1000, pre=add_computed_column)
12235
+ .interrogate()
12236
+ )
12237
+
12238
+ # This validation can now be saved and loaded reliably
12239
+ pb.write_file(validation, "my_validation", keep_tbl=True)
12240
+ ```
12241
+
12242
+ When you load this validation in a new session, simply import the preprocessing functions
12243
+ again and they will be automatically restored.
12244
+
12245
+ See Also
12246
+ --------
12247
+ Use the [`read_file()`](`pointblank.read_file`) function to load a validation object that was
12248
+ previously saved with `write_file()`.
12249
+
12250
+
12251
+ read_file(filepath: 'str | Path') -> 'Validate'
12252
+
12253
+ Read a Validate object from disk that was previously saved with `write_file()`.
12254
+
12255
+ This function loads a validation object that was previously serialized to disk using the
12256
+ `write_file()` function. The validation object will be restored with all its validation results,
12257
+ metadata, and optionally the source data (if it was saved with `keep_tbl=True`).
12258
+
12259
+ :::{.callout-warning}
12260
+ The `read_file()` function is currently experimental. Please report any issues you encounter in
12261
+ the [Pointblank issue tracker](https://github.com/posit-dev/pointblank/issues).
12262
+ :::
12263
+
12264
+ Parameters
12265
+ ----------
12266
+ filepath
12267
+ The path to the saved validation file. Can be a string or Path object.
12268
+
12269
+ Returns
12270
+ -------
12271
+ Validate
12272
+ The restored validation object with all its original state, validation results, and
12273
+ metadata.
12274
+
12275
+ Examples
12276
+ --------
12277
+ Load a validation object that was previously saved:
12278
+
12279
+ ```python
12280
+ import pointblank as pb
12281
+
12282
+ # Load a validation object from disk
12283
+ validation = pb.read_file("my_validation.pkl")
12284
+
12285
+ # View the validation results
12286
+ validation
12287
+ ```
12288
+
12289
+ You can also load using just the filename (without extension):
12290
+
12291
+ ```python
12292
+ # This will automatically look for "my_validation.pkl"
12293
+ validation = pb.read_file("my_validation")
12294
+ ```
12295
+
12296
+ The loaded validation object retains all its functionality:
12297
+
12298
+ ```python
12299
+ # Get validation summary
12300
+ summary = validation.get_json_report()
12301
+
12302
+ # Get sundered data (if original table was saved)
12303
+ if validation.data is not None:
12304
+ failing_rows = validation.get_sundered_data(type="fail")
12305
+ ```
12306
+
12307
+ See Also
12308
+ --------
12309
+ Use the [`write_file()`](`pointblank.write_file`) function to save a validation object
12310
+ to disk for later retrieval with this function.
12311
+
12312
+
10718
12313
  config(report_incl_header: 'bool' = True, report_incl_footer: 'bool' = True, preview_incl_header: 'bool' = True) -> 'PointblankConfig'
10719
12314
 
10720
12315
  Configuration settings for the Pointblank library.