pointblank 0.14.0__py3-none-any.whl → 0.15.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1157,7 +1157,7 @@ Definition of a schema object.
1157
1157
  `Schema` object is used in a validation workflow.
1158
1158
 
1159
1159
 
1160
- DraftValidation(data: 'FrameT | Any', model: 'str', api_key: 'str | None' = None) -> None
1160
+ DraftValidation(data: 'FrameT | Any', model: 'str', api_key: 'str | None' = None, verify_ssl: 'bool' = True) -> None
1161
1161
 
1162
1162
  Draft a validation plan for a given table using an LLM.
1163
1163
 
@@ -1180,10 +1180,15 @@ DraftValidation(data: 'FrameT | Any', model: 'str', api_key: 'str | None' = None
1180
1180
  The data to be used for drafting a validation plan.
1181
1181
  model
1182
1182
  The model to be used. This should be in the form of `provider:model` (e.g.,
1183
- `"anthropic:claude-3-5-sonnet-latest"`). Supported providers are `"anthropic"`, `"openai"`,
1183
+ `"anthropic:claude-sonnet-4-5"`). Supported providers are `"anthropic"`, `"openai"`,
1184
1184
  `"ollama"`, and `"bedrock"`.
1185
1185
  api_key
1186
1186
  The API key to be used for the model.
1187
+ verify_ssl
1188
+ Whether to verify SSL certificates when making requests to the LLM provider. Set to `False`
1189
+ to disable SSL verification (e.g., when behind a corporate firewall with self-signed
1190
+ certificates). Defaults to `True`. Use with caution as disabling SSL verification can pose
1191
+ security risks.
1187
1192
 
1188
1193
  Returns
1189
1194
  -------
@@ -1225,6 +1230,33 @@ DraftValidation(data: 'FrameT | Any', model: 'str', api_key: 'str | None' = None
1225
1230
  There's no need to have the `python-dotenv` package installed when using `.env` files in this
1226
1231
  way.
1227
1232
 
1233
+ Notes on SSL Certificate Verification
1234
+ --------------------------------------
1235
+ By default, SSL certificate verification is enabled for all requests to LLM providers. However,
1236
+ in certain network environments (such as corporate networks with self-signed certificates or
1237
+ firewall proxies), you may encounter SSL certificate verification errors.
1238
+
1239
+ To disable SSL verification, set the `verify_ssl` parameter to `False`:
1240
+
1241
+ ```python
1242
+ import pointblank as pb
1243
+
1244
+ data = pb.load_dataset(dataset="nycflights", tbl_type="duckdb")
1245
+
1246
+ # Disable SSL verification for networks with self-signed certificates
1247
+ pb.DraftValidation(
1248
+ data=data,
1249
+ model="anthropic:claude-sonnet-4-5",
1250
+ verify_ssl=False
1251
+ )
1252
+ ```
1253
+
1254
+ :::{.callout-warning}
1255
+ Disabling SSL verification (through `verify_ssl=False`) can expose your API keys and data to
1256
+ man-in-the-middle attacks. Only use this option in trusted network environments and when
1257
+ absolutely necessary.
1258
+ :::
1259
+
1228
1260
  Notes on Data Sent to the Model Provider
1229
1261
  ----------------------------------------
1230
1262
  The data sent to the model provider is a JSON summary of the table. This data summary is
@@ -1251,7 +1283,7 @@ DraftValidation(data: 'FrameT | Any', model: 'str', api_key: 'str | None' = None
1251
1283
  Let's look at how the `DraftValidation` class can be used to draft a validation plan for a
1252
1284
  table. The table to be used is `"nycflights"`, which is available here via the
1253
1285
  [`load_dataset()`](`pointblank.load_dataset`) function. The model to be used is
1254
- `"anthropic:claude-3-5-sonnet-latest"` (which performs very well compared to other LLMs). The
1286
+ `"anthropic:claude-sonnet-4-5"` (which performs very well compared to other LLMs). The
1255
1287
  example assumes that the API key is stored in an `.env` file as `ANTHROPIC_API_KEY`.
1256
1288
 
1257
1289
  ```python
@@ -1261,7 +1293,7 @@ DraftValidation(data: 'FrameT | Any', model: 'str', api_key: 'str | None' = None
1261
1293
  data = pb.load_dataset(dataset="nycflights", tbl_type="duckdb")
1262
1294
 
1263
1295
  # Draft a validation plan for the "nycflights" table
1264
- pb.DraftValidation(data=data, model="anthropic:claude-3-5-sonnet-latest")
1296
+ pb.DraftValidation(data=data, model="anthropic:claude-sonnet-4-5")
1265
1297
  ```
1266
1298
 
1267
1299
  The output will be a drafted validation plan for the `"nycflights"` table and this will appear
@@ -3645,6 +3677,262 @@ col_vals_not_in_set(self, columns: 'str | list[str] | Column | ColumnSelector |
3645
3677
  statuses in the `InvalidStatus` enum.
3646
3678
 
3647
3679
 
3680
+ col_vals_increasing(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', allow_stationary: 'bool' = False, decreasing_tol: 'float | None' = None, na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
3681
+
3682
+ Are column data increasing by row?
3683
+
3684
+ The `col_vals_increasing()` validation method checks whether column values in a table are
3685
+ increasing when moving down a table. There are options for allowing missing values in the
3686
+ target column, allowing stationary phases (where consecutive values don't change), and even
3687
+ one for allowing decreasing movements up to a certain threshold. This validation will
3688
+ operate over the number of test units that is equal to the number of rows in the table
3689
+ (determined after any `pre=` mutation has been applied).
3690
+
3691
+ Parameters
3692
+ ----------
3693
+ columns
3694
+ A single column or a list of columns to validate. Can also use
3695
+ [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If
3696
+ multiple columns are supplied or resolved, there will be a separate validation step
3697
+ generated for each column.
3698
+ allow_stationary
3699
+ An option to allow pauses in increasing values. For example, if the values for the test
3700
+ units are `[80, 82, 82, 85, 88]` then the third unit (`82`, appearing a second time)
3701
+ would be marked as failing when `allow_stationary` is `False`. Using
3702
+ `allow_stationary=True` will result in all of the test units in `[80, 82, 82, 85, 88]` being
3703
+ marked as passing.
3704
+ decreasing_tol
3705
+ An optional threshold value that allows for movement of numerical values in the negative
3706
+ direction. By default this is `None` but using a numerical value will set the absolute
3707
+ threshold of negative travel allowed across numerical test units. Note that setting a
3708
+ value here also has the effect of setting `allow_stationary` to `True`.
3709
+ na_pass
3710
+ Should any encountered None, NA, or Null values be considered as passing test units? By
3711
+ default, this is `False`. Set to `True` to pass test units with missing values.
3712
+ pre
3713
+ An optional preprocessing function or lambda to apply to the data table during
3714
+ interrogation. This function should take a table as input and return a modified table.
3715
+ Have a look at the *Preprocessing* section for more information on how to use this
3716
+ argument.
3717
+ segments
3718
+ An optional directive on segmentation, which serves to split a validation step into
3719
+ multiple (one step per segment). Can be a single column name, a tuple that specifies a
3720
+ column name and its corresponding values to segment on, or a combination of both
3721
+ (provided as a list). Read the *Segmentation* section for usage information.
3722
+ thresholds
3723
+ Set threshold failure levels for reporting and reacting to exceedances of the levels.
3724
+ The thresholds are set at the step level and will override any global thresholds set in
3725
+ `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will
3726
+ be set locally and global thresholds (if any) will take effect. Look at the *Thresholds*
3727
+ section for information on how to set threshold levels.
3728
+ actions
3729
+ Optional actions to take when the validation step(s) meets or exceeds any set threshold
3730
+ levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to
3731
+ define the actions.
3732
+ brief
3733
+ An optional brief description of the validation step that will be displayed in the
3734
+ reporting table. You can use the templating elements like `"{step}"` to insert
3735
+ the step number, or `"{auto}"` to include an automatically generated brief. If `True`
3736
+ the entire brief will be automatically generated. If `None` (the default) then there
3737
+ won't be a brief.
3738
+ active
3739
+ A boolean value indicating whether the validation step should be active. Using `False`
3740
+ will make the validation step inactive (still reporting its presence and keeping indexes
3741
+ for the steps unchanged).
3742
+
3743
+ Returns
3744
+ -------
3745
+ Validate
3746
+ The `Validate` object with the added validation step.
3747
+
3748
+ Examples
3749
+ --------
3750
+ For the examples here, we'll use a simple Polars DataFrame with three numeric columns (`a`,
3751
+ `b`, and `c`). The table is shown below:
3752
+
3753
+ ```python
3754
+ import pointblank as pb
3755
+ import polars as pl
3756
+
3757
+ tbl = pl.DataFrame(
3758
+ {
3759
+ "a": [1, 2, 3, 4, 5, 6],
3760
+ "b": [1, 2, 2, 3, 4, 5],
3761
+ "c": [1, 2, 1, 3, 4, 5],
3762
+ }
3763
+ )
3764
+
3765
+ pb.preview(tbl)
3766
+ ```
3767
+
3768
+ Let's validate that values in column `a` are increasing. We'll determine if this validation
3769
+ had any failing test units (there are six test units, one for each row).
3770
+
3771
+ ```python
3772
+ validation = (
3773
+ pb.Validate(data=tbl)
3774
+ .col_vals_increasing(columns="a")
3775
+ .interrogate()
3776
+ )
3777
+
3778
+ validation
3779
+ ```
3780
+
3781
+ The validation passed as all values in column `a` are increasing. Now let's check column
3782
+ `b`, which has a stationary value:
3783
+
3784
+ ```python
3785
+ validation = (
3786
+ pb.Validate(data=tbl)
3787
+ .col_vals_increasing(columns="b")
3788
+ .interrogate()
3789
+ )
3790
+
3791
+ validation
3792
+ ```
3793
+
3794
+ This validation fails at the third row because the value `2` is repeated. If we want to
3795
+ allow stationary values, we can use `allow_stationary=True`:
3796
+
3797
+ ```python
3798
+ validation = (
3799
+ pb.Validate(data=tbl)
3800
+ .col_vals_increasing(columns="b", allow_stationary=True)
3801
+ .interrogate()
3802
+ )
3803
+
3804
+ validation
3805
+ ```
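+
+ Column `c` contains a one-unit dip (`2` followed by `1`). As a sketch of how the
+ `decreasing_tol=` option described above could be applied (assuming a tolerance of `1` is
+ enough to cover that dip), a small tolerance lets the dip pass while still requiring an
+ overall increasing pattern:
+
+ ```python
+ validation = (
+     pb.Validate(data=tbl)
+     .col_vals_increasing(columns="c", decreasing_tol=1)
+     .interrogate()
+ )
+
+ validation
+ ```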
3806
+
3807
+
3808
+ col_vals_decreasing(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', allow_stationary: 'bool' = False, increasing_tol: 'float | None' = None, na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
3809
+
3810
+ Are column data decreasing by row?
3811
+
3812
+ The `col_vals_decreasing()` validation method checks whether column values in a table are
3813
+ decreasing when moving down a table. There are options for allowing missing values in the
3814
+ target column, allowing stationary phases (where consecutive values don't change), and even
3815
+ one for allowing increasing movements up to a certain threshold. This validation will
3816
+ operate over the number of test units that is equal to the number of rows in the table
3817
+ (determined after any `pre=` mutation has been applied).
3818
+
3819
+ Parameters
3820
+ ----------
3821
+ columns
3822
+ A single column or a list of columns to validate. Can also use
3823
+ [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If
3824
+ multiple columns are supplied or resolved, there will be a separate validation step
3825
+ generated for each column.
3826
+ allow_stationary
3827
+ An option to allow pauses in decreasing values. For example, if the values for the test
3828
+ units are `[88, 85, 85, 82, 80]` then the third unit (`85`, appearing a second time)
3829
+ would be marked as failing when `allow_stationary` is `False`. Using
3830
+ `allow_stationary=True` will result in all of the test units in `[88, 85, 85, 82, 80]` being
3831
+ marked as passing.
3832
+ increasing_tol
3833
+ An optional threshold value that allows for movement of numerical values in the positive
3834
+ direction. By default this is `None` but using a numerical value will set the absolute
3835
+ threshold of positive travel allowed across numerical test units. Note that setting a
3836
+ value here also has the effect of setting `allow_stationary` to `True`.
3837
+ na_pass
3838
+ Should any encountered None, NA, or Null values be considered as passing test units? By
3839
+ default, this is `False`. Set to `True` to pass test units with missing values.
3840
+ pre
3841
+ An optional preprocessing function or lambda to apply to the data table during
3842
+ interrogation. This function should take a table as input and return a modified table.
3843
+ Have a look at the *Preprocessing* section for more information on how to use this
3844
+ argument.
3845
+ segments
3846
+ An optional directive on segmentation, which serves to split a validation step into
3847
+ multiple (one step per segment). Can be a single column name, a tuple that specifies a
3848
+ column name and its corresponding values to segment on, or a combination of both
3849
+ (provided as a list). Read the *Segmentation* section for usage information.
3850
+ thresholds
3851
+ Set threshold failure levels for reporting and reacting to exceedances of the levels.
3852
+ The thresholds are set at the step level and will override any global thresholds set in
3853
+ `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will
3854
+ be set locally and global thresholds (if any) will take effect. Look at the *Thresholds*
3855
+ section for information on how to set threshold levels.
3856
+ actions
3857
+ Optional actions to take when the validation step(s) meets or exceeds any set threshold
3858
+ levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to
3859
+ define the actions.
3860
+ brief
3861
+ An optional brief description of the validation step that will be displayed in the
3862
+ reporting table. You can use the templating elements like `"{step}"` to insert
3863
+ the step number, or `"{auto}"` to include an automatically generated brief. If `True`
3864
+ the entire brief will be automatically generated. If `None` (the default) then there
3865
+ won't be a brief.
3866
+ active
3867
+ A boolean value indicating whether the validation step should be active. Using `False`
3868
+ will make the validation step inactive (still reporting its presence and keeping indexes
3869
+ for the steps unchanged).
3870
+
3871
+ Returns
3872
+ -------
3873
+ Validate
3874
+ The `Validate` object with the added validation step.
3875
+
3876
+ Examples
3877
+ --------
3878
+ For the examples here, we'll use a simple Polars DataFrame with three numeric columns (`a`,
3879
+ `b`, and `c`). The table is shown below:
3880
+
3881
+ ```python
3882
+ import pointblank as pb
3883
+ import polars as pl
3884
+
3885
+ tbl = pl.DataFrame(
3886
+ {
3887
+ "a": [6, 5, 4, 3, 2, 1],
3888
+ "b": [5, 4, 4, 3, 2, 1],
3889
+ "c": [5, 4, 5, 3, 2, 1],
3890
+ }
3891
+ )
3892
+
3893
+ pb.preview(tbl)
3894
+ ```
3895
+
3896
+ Let's validate that values in column `a` are decreasing. We'll determine if this validation
3897
+ had any failing test units (there are six test units, one for each row).
3898
+
3899
+ ```python
3900
+ validation = (
3901
+ pb.Validate(data=tbl)
3902
+ .col_vals_decreasing(columns="a")
3903
+ .interrogate()
3904
+ )
3905
+
3906
+ validation
3907
+ ```
3908
+
3909
+ The validation passed as all values in column `a` are decreasing. Now let's check column
3910
+ `b`, which has a stationary value:
3911
+
3912
+ ```python
3913
+ validation = (
3914
+ pb.Validate(data=tbl)
3915
+ .col_vals_decreasing(columns="b")
3916
+ .interrogate()
3917
+ )
3918
+
3919
+ validation
3920
+ ```
3921
+
3922
+ This validation fails at the third row because the value `4` is repeated. If we want to
3923
+ allow stationary values, we can use `allow_stationary=True`:
3924
+
3925
+ ```python
3926
+ validation = (
3927
+ pb.Validate(data=tbl)
3928
+ .col_vals_decreasing(columns="b", allow_stationary=True)
3929
+ .interrogate()
3930
+ )
3931
+
3932
+ validation
3933
+ ```
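+
+ Column `c` contains a one-unit rise (`4` followed by `5`). As a sketch of how the
+ `increasing_tol=` option described above could be applied (assuming a tolerance of `1` is
+ enough to cover that rise), a small tolerance lets the bump pass while still requiring an
+ overall decreasing pattern:
+
+ ```python
+ validation = (
+     pb.Validate(data=tbl)
+     .col_vals_decreasing(columns="c", increasing_tol=1)
+     .interrogate()
+ )
+
+ validation
+ ```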
3934
+
3935
+
3648
3936
  col_vals_null(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
3649
3937
 
3650
3938
  Validate whether values in a column are Null.
@@ -4213,22 +4501,31 @@ col_vals_regex(self, columns: 'str | list[str] | Column | ColumnSelector | Colum
4213
4501
  string values of rows 1 and 2 in column `b`.
4214
4502
 
4215
4503
 
4216
- col_vals_expr(self, expr: 'any', pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
4504
+ col_vals_within_spec(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', spec: 'str', na_pass: 'bool' = False, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
4217
4505
 
4218
- Validate column values using a custom expression.
4506
+ Validate whether column values fit within a specification.
4219
4507
 
4220
- The `col_vals_expr()` validation method checks whether column values in a table satisfy a
4221
- custom `expr=` expression. This validation will operate over the number of test units that
4222
- is equal to the number of rows in the table (determined after any `pre=` mutation has been
4223
- applied).
4508
+ The `col_vals_within_spec()` validation method checks whether column values in a table
4509
+ correspond to a specification (`spec=`) type (details of which are available in the
4510
+ *Specifications* section). Specifications include common data types like email addresses,
4511
+ URLs, postal codes, vehicle identification numbers (VINs), International Bank Account
4512
+ Numbers (IBANs), and more. This validation will operate over the number of test units that
4513
+ is equal to the number of rows in the table.
4224
4514
 
4225
4515
  Parameters
4226
4516
  ----------
4227
- expr
4228
- A column expression that will evaluate each row in the table, returning a boolean value
4229
- per table row. If the target table is a Polars DataFrame, the expression should either
4230
- be a Polars column expression or a Narwhals one. For a Pandas DataFrame, the expression
4231
- should either be a lambda expression or a Narwhals column expression.
4517
+ columns
4518
+ A single column or a list of columns to validate. Can also use
4519
+ [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If
4520
+ multiple columns are supplied or resolved, there will be a separate validation step
4521
+ generated for each column.
4522
+ spec
4523
+ A specification string for defining the specification type. Examples are `"email"`,
4524
+ `"url"`, and `"postal_code[USA]"`. See the *Specifications* section for all available
4525
+ options.
4526
+ na_pass
4527
+ Should any encountered None, NA, or Null values be considered as passing test units? By
4528
+ default, this is `False`. Set to `True` to pass test units with missing values.
4232
4529
  pre
4233
4530
  An optional preprocessing function or lambda to apply to the data table during
4234
4531
  interrogation. This function should take a table as input and return a modified table.
@@ -4246,7 +4543,7 @@ col_vals_expr(self, expr: 'any', pre: 'Callable | None' = None, segments: 'Segme
4246
4543
  be set locally and global thresholds (if any) will take effect. Look at the *Thresholds*
4247
4544
  section for information on how to set threshold levels.
4248
4545
  actions
4249
- Optional actions to take when the validation step meets or exceeds any set threshold
4546
+ Optional actions to take when the validation step(s) meets or exceeds any set threshold
4250
4547
  levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to
4251
4548
  define the actions.
4252
4549
  brief
@@ -4265,6 +4562,40 @@ col_vals_expr(self, expr: 'any', pre: 'Callable | None' = None, segments: 'Segme
4265
4562
  Validate
4266
4563
  The `Validate` object with the added validation step.
4267
4564
 
4565
+ Specifications
4566
+ --------------
4567
+ A specification type must be used with the `spec=` argument. This is a string-based keyword
4568
+ that corresponds to the type of data in the specified columns. The following keywords can
4569
+ be used:
4570
+
4571
+ - `"isbn"`: The International Standard Book Number (ISBN) is a unique numerical identifier
4572
+ for books. This keyword validates both 10-digit and 13-digit ISBNs.
4573
+
4574
+ - `"vin"`: A vehicle identification number (VIN) is a unique code used by the automotive
4575
+ industry to identify individual motor vehicles.
4576
+
4577
+ - `"postal_code[<country_code>]"`: A postal code (also known as postcodes, PIN, or ZIP
4578
+ code) is a series of letters, digits, or both included in a postal address. Because the
4579
+ coding varies by country, a country code in either the 2-letter (ISO 3166-1 alpha-2) or
4580
+ 3-letter (ISO 3166-1 alpha-3) format needs to be supplied (e.g., `"postal_code[US]"` or
4581
+ `"postal_code[USA]"`). The keyword alias `"zip"` can be used for US ZIP codes.
4582
+
4583
+ - `"credit_card"`: A credit card number can be validated across a variety of issuers. The
4584
+ validation uses the Luhn algorithm.
4585
+
4586
+ - `"iban[<country_code>]"`: The International Bank Account Number (IBAN) is a system of
4587
+ identifying bank accounts across countries. Because the length and coding varies by
4588
+ country, a country code needs to be supplied (e.g., `"iban[DE]"` or `"iban[DEU]"`).
4589
+
4590
+ - `"swift"`: Business Identifier Codes (also known as SWIFT-BIC, BIC, or SWIFT code) are
4591
+ unique identifiers for financial and non-financial institutions.
4592
+
4593
+ - `"phone"`, `"email"`, `"url"`, `"ipv4"`, `"ipv6"`, `"mac"`: Phone numbers, email
4594
+ addresses, Internet URLs, IPv4 or IPv6 addresses, and MAC addresses can be validated with
4595
+ their respective keywords.
4596
+
4597
+ Only a single `spec=` value should be provided per function call.
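+
+ As a brief illustrative sketch (the `postcode` column and table here are hypothetical and not
+ part of the examples below), a specification keyword is simply paired with the column it
+ should check:
+
+ ```python
+ import pointblank as pb
+ import polars as pl
+
+ tbl_zip = pl.DataFrame({"postcode": ["99501", "10001", "not-a-zip"]})
+
+ (
+     pb.Validate(data=tbl_zip)
+     .col_vals_within_spec(columns="postcode", spec="postal_code[USA]")
+     .interrogate()
+ )
+ ```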
4598
+
4268
4599
  Preprocessing
4269
4600
  -------------
4270
4601
  The `pre=` argument allows for a preprocessing function or lambda to be applied to the data
@@ -4274,9 +4605,11 @@ col_vals_expr(self, expr: 'any', pre: 'Callable | None' = None, segments: 'Segme
4274
4605
 
4275
4606
  The preprocessing function can be any callable that takes a table as input and returns a
4276
4607
  modified table. For example, you could use a lambda function to filter the table based on
4277
- certain criteria or to apply a transformation to the data. Regarding the lifetime of the
4278
- transformed table, it only exists during the validation step and is not stored in the
4279
- `Validate` object or used in subsequent validation steps.
4608
+ certain criteria or to apply a transformation to the data. Note that `columns=` can refer to
4609
+ a column that is expected to be present in the transformed table but may not
4610
+ exist in the table before preprocessing. Regarding the lifetime of the transformed table, it
4611
+ only exists during the validation step and is not stored in the `Validate` object or used in
4612
+ subsequent validation steps.
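+
+ For instance (a sketch in which the `raw_contact` column and the derived `email_clean` column
+ are hypothetical), the validated column can be one that only exists after `pre=` has run:
+
+ ```python
+ import pointblank as pb
+ import polars as pl
+
+ tbl_contacts = pl.DataFrame({"raw_contact": ["USER@EXAMPLE.COM", "admin@test.org"]})
+
+ (
+     pb.Validate(data=tbl_contacts)
+     .col_vals_within_spec(
+         columns="email_clean",  # created by `pre=`, not present in the original table
+         spec="email",
+         pre=lambda df: df.with_columns(email_clean=pl.col("raw_contact").str.to_lowercase()),
+     )
+     .interrogate()
+ )
+ ```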
4280
4613
 
4281
4614
  Segmentation
4282
4615
  ------------
@@ -4348,8 +4681,8 @@ col_vals_expr(self, expr: 'any', pre: 'Callable | None' = None, segments: 'Segme
4348
4681
 
4349
4682
  Examples
4350
4683
  --------
4351
- For the examples here, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and
4352
- `c`). The table is shown below:
4684
+ For the examples here, we'll use a simple Polars DataFrame with an email column. The table
4685
+ is shown below:
4353
4686
 
4354
4687
  ```python
4355
4688
  import pointblank as pb
@@ -4357,48 +4690,61 @@ col_vals_expr(self, expr: 'any', pre: 'Callable | None' = None, segments: 'Segme
4357
4690
 
4358
4691
  tbl = pl.DataFrame(
4359
4692
  {
4360
- "a": [1, 2, 1, 7, 8, 6],
4361
- "b": [0, 0, 0, 1, 1, 1],
4362
- "c": [0.5, 0.3, 0.8, 1.4, 1.9, 1.2],
4693
+ "email": [
4694
+ "user@example.com",
4695
+ "admin@test.org",
4696
+ "invalid-email",
4697
+ "contact@company.co.uk",
4698
+ ],
4363
4699
  }
4364
4700
  )
4365
4701
 
4366
4702
  pb.preview(tbl)
4367
4703
  ```
4368
4704
 
4369
- Let's validate that the values in column `a` are all integers. We'll determine if this
4370
- validation had any failing test units (there are six test units, one for each row).
4705
+ Let's validate that all of the values in the `email` column are valid email addresses.
4706
+ We'll determine if this validation had any failing test units (there are four test units,
4707
+ one for each row).
4371
4708
 
4372
4709
  ```python
4373
4710
  validation = (
4374
4711
  pb.Validate(data=tbl)
4375
- .col_vals_expr(expr=pl.col("a") % 1 == 0)
4712
+ .col_vals_within_spec(columns="email", spec="email")
4376
4713
  .interrogate()
4377
4714
  )
4378
4715
 
4379
4716
  validation
4380
4717
  ```
4381
4718
 
4382
- Printing the `validation` object shows the validation table in an HTML viewing environment.
4383
- The validation table shows the single entry that corresponds to the validation step created
4384
- by using `col_vals_expr()`. All test units passed, with no failing test units.
4719
+ The validation table shows that one test unit failed (the invalid email address in row 3).
4385
4720
 
4386
4721
 
4387
- col_exists(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
4722
+ col_vals_expr(self, expr: 'any', pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
4388
4723
 
4389
- Validate whether one or more columns exist in the table.
4724
+ Validate column values using a custom expression.
4390
4725
 
4391
- The `col_exists()` method checks whether one or more columns exist in the target table. The
4392
- only requirement is specification of the column names. Each validation step or expectation
4393
- will operate over a single test unit, which is whether the column exists or not.
4726
+ The `col_vals_expr()` validation method checks whether column values in a table satisfy a
4727
+ custom `expr=` expression. This validation will operate over the number of test units that
4728
+ is equal to the number of rows in the table (determined after any `pre=` mutation has been
4729
+ applied).
4394
4730
 
4395
4731
  Parameters
4396
4732
  ----------
4397
- columns
4398
- A single column or a list of columns to validate. Can also use
4399
- [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If
4400
- multiple columns are supplied or resolved, there will be a separate validation step
4401
- generated for each column.
4733
+ expr
4734
+ A column expression that will evaluate each row in the table, returning a boolean value
4735
+ per table row. If the target table is a Polars DataFrame, the expression should either
4736
+ be a Polars column expression or a Narwhals one. For a Pandas DataFrame, the expression
4737
+ should either be a lambda expression or a Narwhals column expression.
4738
+ pre
4739
+ An optional preprocessing function or lambda to apply to the data table during
4740
+ interrogation. This function should take a table as input and return a modified table.
4741
+ Have a look at the *Preprocessing* section for more information on how to use this
4742
+ argument.
4743
+ segments
4744
+ An optional directive on segmentation, which serves to split a validation step into
4745
+ multiple (one step per segment). Can be a single column name, a tuple that specifies a
4746
+ column name and its corresponding values to segment on, or a combination of both
4747
+ (provided as a list). Read the *Segmentation* section for usage information.
4402
4748
  thresholds
4403
4749
 Set threshold failure levels for reporting and reacting to exceedances of the levels.
4404
4750
  The thresholds are set at the step level and will override any global thresholds set in
@@ -4406,7 +4752,7 @@ col_exists(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSel
4406
4752
  be set locally and global thresholds (if any) will take effect. Look at the *Thresholds*
4407
4753
  section for information on how to set threshold levels.
4408
4754
  actions
4409
- Optional actions to take when the validation step(s) meets or exceeds any set threshold
4755
+ Optional actions to take when the validation step meets or exceeds any set threshold
4410
4756
  levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to
4411
4757
  define the actions.
4412
4758
  brief
@@ -4425,6 +4771,59 @@ col_exists(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSel
4425
4771
  Validate
4426
4772
  The `Validate` object with the added validation step.
4427
4773
 
4774
+ Preprocessing
4775
+ -------------
4776
+ The `pre=` argument allows for a preprocessing function or lambda to be applied to the data
4777
+ table during interrogation. This function should take a table as input and return a modified
4778
+ table. This is useful for performing any necessary transformations or filtering on the data
4779
+ before the validation step is applied.
4780
+
4781
+ The preprocessing function can be any callable that takes a table as input and returns a
4782
+ modified table. For example, you could use a lambda function to filter the table based on
4783
+ certain criteria or to apply a transformation to the data. Regarding the lifetime of the
4784
+ transformed table, it only exists during the validation step and is not stored in the
4785
+ `Validate` object or used in subsequent validation steps.
4786
+
4787
+ Segmentation
4788
+ ------------
4789
+ The `segments=` argument allows for the segmentation of a validation step into multiple
4790
+ segments. This is useful for applying the same validation step to different subsets of the
4791
+ data. The segmentation can be done based on a single column or specific fields within a
4792
+ column.
4793
+
4794
+ Providing a single column name will result in a separate validation step for each unique
4795
+ value in that column. For example, if you have a column called `"region"` with values
4796
+ `"North"`, `"South"`, and `"East"`, the validation step will be applied separately to each
4797
+ region.
4798
+
4799
+ Alternatively, you can provide a tuple that specifies a column name and its corresponding
4800
+ values to segment on. For example, if you have a column called `"date"` and you want to
4801
+ segment on only specific dates, you can provide a tuple like
4802
+ `("date", ["2023-01-01", "2023-01-02"])`. Any other values in the column will be disregarded
4803
+ (i.e., no validation steps will be created for them).
4804
+
4805
+ A list with a combination of column names and tuples can be provided as well. This allows
4806
+ for more complex segmentation scenarios. The following inputs are both valid:
4807
+
4808
+ ```
4809
+ # Segments from all unique values in the `region` column
4810
+ # and specific dates in the `date` column
4811
+ segments=["region", ("date", ["2023-01-01", "2023-01-02"])]
4812
+
4813
+ # Segments from all unique values in the `region` and `date` columns
4814
+ segments=["region", "date"]
4815
+ ```
4816
+
4817
+ The segmentation is performed during interrogation, and the resulting validation steps will
4818
+ be numbered sequentially. Each segment will have its own validation step, and the results
4819
+ will be reported separately. This allows for a more granular analysis of the data and helps
4820
+ identify issues within specific segments.
4821
+
4822
+ Importantly, the segmentation process will be performed after any preprocessing of the data
4823
+ table. Because of this, one can conceivably use the `pre=` argument to generate a column
4824
+ that can be used for segmentation. For example, you could create a new column called
4825
+ `"segment"` through use of `pre=` and then use that column for segmentation.
4826
+
4428
4827
  Thresholds
4429
4828
  ----------
4430
4829
  The `thresholds=` parameter is used to set the failure-condition levels for the validation
@@ -4455,8 +4854,8 @@ col_exists(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSel
4455
4854
 
4456
4855
  Examples
4457
4856
  --------
4458
- For the examples here, we'll use a simple Polars DataFrame with a string columns (`a`) and a
4459
- numeric column (`b`). The table is shown below:
4857
+ For the examples here, we'll use a simple Polars DataFrame with three columns (`a`, `b`, and
4858
+ `c`). The table is shown below:
4460
4859
 
4461
4860
  ```python
4462
4861
  import pointblank as pb
@@ -4464,21 +4863,22 @@ col_exists(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSel
4464
4863
 
4465
4864
  tbl = pl.DataFrame(
4466
4865
  {
4467
- "a": ["apple", "banana", "cherry", "date"],
4468
- "b": [1, 6, 3, 5],
4866
+ "a": [1, 2, 1, 7, 8, 6],
4867
+ "b": [0, 0, 0, 1, 1, 1],
4868
+ "c": [0.5, 0.3, 0.8, 1.4, 1.9, 1.2],
4469
4869
  }
4470
4870
  )
4471
4871
 
4472
4872
  pb.preview(tbl)
4473
4873
  ```
4474
4874
 
4475
- Let's validate that the columns `a` and `b` actually exist in the table. We'll determine if
4476
- this validation had any failing test units (each validation will have a single test unit).
4875
+ Let's validate that the values in column `a` are all integers. We'll determine if this
4876
+ validation had any failing test units (there are six test units, one for each row).
4477
4877
 
4478
4878
  ```python
4479
4879
  validation = (
4480
4880
  pb.Validate(data=tbl)
4481
- .col_exists(columns=["a", "b"])
4881
+ .col_vals_expr(expr=pl.col("a") % 1 == 0)
4482
4882
  .interrogate()
4483
4883
  )
4484
4884
 
@@ -4486,24 +4886,8 @@ col_exists(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSel
4486
4886
  ```
4487
4887
 
4488
4888
  Printing the `validation` object shows the validation table in an HTML viewing environment.
4489
- The validation table shows two entries (one check per column) generated by the
4490
- `col_exists()` validation step. Both steps passed since both columns provided in `columns=`
4491
- are present in the table.
4492
-
4493
- Now, let's check for the existence of a different set of columns.
4494
-
4495
- ```python
4496
- validation = (
4497
- pb.Validate(data=tbl)
4498
- .col_exists(columns=["b", "c"])
4499
- .interrogate()
4500
- )
4501
-
4502
- validation
4503
- ```
4504
-
4505
- The validation table reports one passing validation step (the check for column `b`) and one
4506
- failing validation step (the check for column `c`, which doesn't exist).
4889
+ The validation table shows the single entry that corresponds to the validation step created
4890
+ by using `col_vals_expr()`. All test units passed, with no failing test units.
4507
4891
 
4508
4892
 
4509
4893
  rows_distinct(self, columns_subset: 'str | list[str] | None' = None, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
@@ -4886,6 +5270,128 @@ rows_complete(self, columns_subset: 'str | list[str] | None' = None, pre: 'Calla
4886
5270
  others.
4887
5271
 
4888
5272
 
5273
+ col_exists(self, columns: 'str | list[str] | Column | ColumnSelector | ColumnSelectorNarwhals', thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
5274
+
5275
+ Validate whether one or more columns exist in the table.
5276
+
5277
+ The `col_exists()` method checks whether one or more columns exist in the target table. The
5278
+ only requirement is specification of the column names. Each validation step or expectation
5279
+ will operate over a single test unit, which is whether the column exists or not.
5280
+
5281
+ Parameters
5282
+ ----------
5283
+ columns
5284
+ A single column or a list of columns to validate. Can also use
5285
+ [`col()`](`pointblank.col`) with column selectors to specify one or more columns. If
5286
+ multiple columns are supplied or resolved, there will be a separate validation step
5287
+ generated for each column.
5288
+ thresholds
5289
+ Set threshold failure levels for reporting and reacting to exceedances of the levels.
5290
+ The thresholds are set at the step level and will override any global thresholds set in
5291
+ `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will
5292
+ be set locally and global thresholds (if any) will take effect. Look at the *Thresholds*
5293
+ section for information on how to set threshold levels.
5294
+ actions
5295
+ Optional actions to take when the validation step(s) meets or exceeds any set threshold
5296
+ levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to
5297
+ define the actions.
5298
+ brief
5299
+ An optional brief description of the validation step that will be displayed in the
5300
+ reporting table. You can use the templating elements like `"{step}"` to insert
5301
+ the step number, or `"{auto}"` to include an automatically generated brief. If `True`
5302
+ the entire brief will be automatically generated. If `None` (the default) then there
5303
+ won't be a brief.
5304
+ active
5305
+ A boolean value indicating whether the validation step should be active. Using `False`
5306
+ will make the validation step inactive (still reporting its presence and keeping indexes
5307
+ for the steps unchanged).
5308
+
5309
+ Returns
5310
+ -------
5311
+ Validate
5312
+ The `Validate` object with the added validation step.
5313
+
5314
+ Thresholds
5315
+ ----------
5316
+ The `thresholds=` parameter is used to set the failure-condition levels for the validation
5317
+ step. If they are set here at the step level, these thresholds will override any thresholds
5318
+ set at the global level in `Validate(thresholds=...)`.
5319
+
5320
+ There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values
5321
+ can either be set as a proportion of all test units that fail (a value between `0` and `1`)
5322
+ or as the absolute number of failing test units (an integer that's `1` or greater).
5323
+
5324
+ Thresholds can be defined using one of these input schemes:
5325
+
5326
+ 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create
5327
+ thresholds)
5328
+ 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is
5329
+ the 'error' level, and position `2` is the 'critical' level
5330
+ 3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and
5331
+ 'critical'
5332
+ 4. a single integer/float value denoting absolute number or fraction of failing test units
5333
+ for the 'warning' level only
5334
+
5335
+ If the number of failing test units exceeds set thresholds, the validation step will be
5336
+ marked as 'warning', 'error', or 'critical'. Not all of the threshold levels need to be
5337
+ set; you're free to set any combination of them.
5338
+
5339
+ Aside from reporting failure conditions, thresholds can be used to determine the actions to
5340
+ take for each level of failure (using the `actions=` parameter).
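+
+ As a sketch of the input schemes listed above (assuming the `Thresholds` class accepts
+ `warning=`, `error=`, and `critical=` keyword arguments), the following forms illustrate
+ each scheme (the first three set the same three levels; the fourth sets only 'warning'):
+
+ ```python
+ import pointblank as pb
+
+ # Scheme 1: the Thresholds class
+ thresholds = pb.Thresholds(warning=0.1, error=0.25, critical=0.35)
+
+ # Scheme 2: a tuple in (warning, error, critical) order
+ thresholds = (0.1, 0.25, 0.35)
+
+ # Scheme 3: a dictionary with any of the three keys
+ thresholds = {"warning": 0.1, "error": 0.25, "critical": 0.35}
+
+ # Scheme 4: a single value for the 'warning' level only
+ thresholds = 0.1
+ ```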
5341
+
5342
+ Examples
5343
+ --------
5344
+ For the examples here, we'll use a simple Polars DataFrame with a string column (`a`) and a
5345
+ numeric column (`b`). The table is shown below:
5346
+
5347
+ ```python
5348
+ import pointblank as pb
5349
+ import polars as pl
5350
+
5351
+ tbl = pl.DataFrame(
5352
+ {
5353
+ "a": ["apple", "banana", "cherry", "date"],
5354
+ "b": [1, 6, 3, 5],
5355
+ }
5356
+ )
5357
+
5358
+ pb.preview(tbl)
5359
+ ```
5360
+
5361
+ Let's validate that the columns `a` and `b` actually exist in the table. We'll determine if
5362
+ this validation had any failing test units (each validation will have a single test unit).
5363
+
5364
+ ```python
5365
+ validation = (
5366
+ pb.Validate(data=tbl)
5367
+ .col_exists(columns=["a", "b"])
5368
+ .interrogate()
5369
+ )
5370
+
5371
+ validation
5372
+ ```
5373
+
5374
+ Printing the `validation` object shows the validation table in an HTML viewing environment.
5375
+ The validation table shows two entries (one check per column) generated by the
5376
+ `col_exists()` validation step. Both steps passed since both columns provided in `columns=`
5377
+ are present in the table.
5378
+
5379
+ Now, let's check for the existence of a different set of columns.
5380
+
5381
+ ```python
5382
+ validation = (
5383
+ pb.Validate(data=tbl)
5384
+ .col_exists(columns=["b", "c"])
5385
+ .interrogate()
5386
+ )
5387
+
5388
+ validation
5389
+ ```
5390
+
5391
+ The validation table reports one passing validation step (the check for column `b`) and one
5392
+ failing validation step (the check for column `c`, which doesn't exist).
5393
+
5394
+
4889
5395
  col_schema_match(self, schema: 'Schema', complete: 'bool' = True, in_order: 'bool' = True, case_sensitive_colnames: 'bool' = True, case_sensitive_dtypes: 'bool' = True, full_match_dtypes: 'bool' = True, pre: 'Callable | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
4890
5396
 
4891
5397
  Do columns in the table (and their types) match a predefined schema?
@@ -5315,6 +5821,229 @@ col_count_match(self, count: 'int | FrameT | Any', inverse: 'bool' = False, pre:
5315
5821
  columns in the target table. So, the single test unit passed.
5316
5822
 
5317
5823
 
5824
+ tbl_match(self, tbl_compare: 'FrameT | Any', pre: 'Callable | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
5825
+
5826
+ Validate whether the target table matches a comparison table.
5827
+
5828
+ The `tbl_match()` method checks whether the target table's composition matches that of a
5829
+ comparison table. The validation performs a comprehensive comparison using progressively
5830
+ stricter checks (from least to most stringent):
5831
+
5832
+ 1. **Column count match**: both tables must have the same number of columns
5833
+ 2. **Row count match**: both tables must have the same number of rows
5834
+ 3. **Schema match (loose)**: column names and dtypes match (case-insensitive, any order)
5835
+ 4. **Schema match (order)**: columns in the correct order (case-insensitive names)
5836
+ 5. **Schema match (exact)**: column names match exactly (case-sensitive, correct order)
5837
+ 6. **Data match**: values in corresponding cells must be identical
5838
+
5839
+ This progressive approach helps identify exactly where tables differ. The validation will
5840
+ fail at the first check that doesn't pass, making it easier to diagnose mismatches. This
5841
+ validation operates over a single test unit (pass/fail for complete table match).
5842
+
5843
+ Parameters
5844
+ ----------
5845
+ tbl_compare
5846
+ The comparison table to validate against. This can be a DataFrame object (Polars or
5847
+ Pandas), an Ibis table object, or a callable that returns a table. If a callable is
5848
+ provided, it will be executed during interrogation to obtain the comparison table.
5849
+ pre
5850
+ An optional preprocessing function or lambda to apply to the data table during
5851
+ interrogation. This function should take a table as input and return a modified table.
5852
+ Have a look at the *Preprocessing* section for more information on how to use this
5853
+ argument.
5854
+ thresholds
5855
+ Set threshold failure levels for reporting and reacting to exceedances of the levels.
5856
+ The thresholds are set at the step level and will override any global thresholds set in
5857
+ `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will
5858
+ be set locally and global thresholds (if any) will take effect. Look at the *Thresholds*
5859
+ section for information on how to set threshold levels.
5860
+ actions
5861
+ Optional actions to take when the validation step meets or exceeds any set threshold
5862
+ levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to
5863
+ define the actions.
5864
+ brief
5865
+ An optional brief description of the validation step that will be displayed in the
5866
+ reporting table. You can use the templating elements like `"{step}"` to insert
5867
+ the step number, or `"{auto}"` to include an automatically generated brief. If `True`
5868
+ the entire brief will be automatically generated. If `None` (the default) then there
5869
+ won't be a brief.
5870
+ active
5871
+ A boolean value indicating whether the validation step should be active. Using `False`
5872
+ will make the validation step inactive (still reporting its presence and keeping indexes
5873
+ for the steps unchanged).
5874
+
5875
+ Returns
5876
+ -------
5877
+ Validate
5878
+ The `Validate` object with the added validation step.
5879
+
5880
+ Preprocessing
5881
+ -------------
5882
+ The `pre=` argument allows for a preprocessing function or lambda to be applied to the data
5883
+ table during interrogation. This function should take a table as input and return a modified
5884
+ table. This is useful for performing any necessary transformations or filtering on the data
5885
+ before the validation step is applied.
5886
+
5887
+ The preprocessing function can be any callable that takes a table as input and returns a
5888
+ modified table. For example, you could use a lambda function to filter the table based on
5889
+ certain criteria or to apply a transformation to the data. Note that the same preprocessing
5890
+ is **not** applied to the comparison table; only the target table is preprocessed. Regarding
5891
+ the lifetime of the transformed table, it only exists during the validation step and is not
5892
+ stored in the `Validate` object or used in subsequent validation steps.
5893
+
5894
+ Thresholds
5895
+ ----------
5896
+ The `thresholds=` parameter is used to set the failure-condition levels for the validation
5897
+ step. If they are set here at the step level, these thresholds will override any thresholds
5898
+ set at the global level in `Validate(thresholds=...)`.
5899
+
5900
+ There are three threshold levels: 'warning', 'error', and 'critical'. The threshold values
5901
+ can either be set as a proportion of all test units that fail (a value between `0` and `1`)
5902
+ or as the absolute number of failing test units (an integer that's `1` or greater).
5903
+
5904
+ Thresholds can be defined using one of these input schemes:
5905
+
5906
+ 1. use the [`Thresholds`](`pointblank.Thresholds`) class (the most direct way to create
5907
+ thresholds)
5908
+ 2. provide a tuple of 1-3 values, where position `0` is the 'warning' level, position `1` is
5909
+ the 'error' level, and position `2` is the 'critical' level
5910
+ 3. create a dictionary of 1-3 value entries; the valid keys are 'warning', 'error', and
5911
+ 'critical'
5912
+ 4. a single integer/float value denoting absolute number or fraction of failing test units
5913
+ for the 'warning' level only
5914
+
5915
+ If the number of failing test units exceeds set thresholds, the validation step will be
5916
+ marked as 'warning', 'error', or 'critical'. Not all of the threshold levels need to be
5917
+ set; you're free to set any combination of them.
5918
+
5919
+ Aside from reporting failure conditions, thresholds can be used to determine the actions to
5920
+ take for each level of failure (using the `actions=` parameter).
5921
+
5922
+ Cross-Backend Validation
5923
+ ------------------------
5924
+ The `tbl_match()` method supports **automatic backend coercion** when comparing tables from
5925
+ different backends (e.g., comparing a Polars DataFrame against a Pandas DataFrame, or
5926
+ comparing database tables from DuckDB/SQLite against in-memory DataFrames). When tables with
5927
+ different backends are detected, the comparison table is automatically converted to match the
5928
+ data table's backend before validation proceeds.
5929
+
5930
+ **Certified Backend Combinations:**
5931
+
5932
+ All combinations of the following backends have been tested and certified to work (in both
5933
+ directions):
5934
+
5935
+ - Pandas DataFrame
5936
+ - Polars DataFrame
5937
+ - DuckDB (native)
5938
+ - DuckDB (as Ibis table)
5939
+ - SQLite (via Ibis)
5940
+
5941
+ Note that database backends (DuckDB, SQLite, PostgreSQL, MySQL, Snowflake, BigQuery) are
5942
+ automatically materialized during validation:
5943
+
5944
+ - if comparing **against Polars**: materialized to Polars
5945
+ - if comparing **against Pandas**: materialized to Pandas
5946
+ - if **both tables are database backends**: both materialized to Polars
5947
+
5948
+ This ensures optimal performance and type consistency.
5949
+
5950
+ **Data Types That Work Best in Cross-Backend Validation:**
5951
+
5952
+ - numeric types: int, float columns (including proper NaN handling)
5953
+ - string types: text columns with consistent encodings
5954
+ - boolean types: True/False values
5955
+ - null values: `None` and `NaN` are treated as equivalent across backends
5956
+ - list columns: nested list structures (with basic types)
5957
+
5958
+ **Known Limitations:**
5959
+
5960
+ While many data types work well in cross-backend validation, there are some known
5961
+ limitations to be aware of:
5962
+
5963
+ - date/datetime types: When converting between Polars and Pandas, date objects may be
5964
+ represented differently. For example, `datetime.date` values from a Polars table may become
5965
+ `pd.Timestamp` objects once converted to Pandas, leading to false mismatches. To work
5966
+ around this, ensure both tables use the same datetime representation before comparison.
5967
+ - custom types: User-defined types or complex nested structures may not convert cleanly
5968
+ between backends and could cause unexpected comparison failures.
5969
+ - categorical types: Categorical/factor columns may have different internal
5970
+ representations across backends.
5971
+ - timezone-aware datetimes: Timezone handling differs between backends and may cause
5972
+ comparison issues.
5973
+
5974
+ Here are some ideas to overcome such limitations:
5975
+
5976
+ - for date/datetime columns, consider using `pre=` preprocessing to normalize representations
5977
+ before comparison.
5978
+ - when working with custom types, manually convert tables to the same backend before using
5979
+ `tbl_match()`.
5980
+ - use the same datetime precision (e.g., milliseconds vs microseconds) in both tables.
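+
+ As a rough sketch of the first idea (the tables and column names here are hypothetical, and
+ the exact precision required may vary), `pre=` can cast the target table's date column so
+ that both sides use a comparable datetime representation:
+
+ ```python
+ from datetime import date
+
+ import pandas as pd
+ import polars as pl
+ import pointblank as pb
+
+ tbl_pl = pl.DataFrame({"d": [date(2023, 1, 1), date(2023, 1, 2)], "x": [1, 2]})
+ tbl_pd = pd.DataFrame({"d": pd.to_datetime(["2023-01-01", "2023-01-02"]), "x": [1, 2]})
+
+ (
+     pb.Validate(data=tbl_pl)
+     .tbl_match(
+         tbl_compare=tbl_pd,
+         # Normalize the target's date column to a datetime dtype before comparing
+         pre=lambda df: df.with_columns(pl.col("d").cast(pl.Datetime("ns"))),
+     )
+     .interrogate()
+ )
+ ```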
5981
+
5982
+ Examples
5983
+ --------
5984
+ For the examples here, we'll create two simple tables to demonstrate the `tbl_match()`
5985
+ validation.
5986
+
5987
+ ```python
5988
+ import pointblank as pb
5989
+ import polars as pl
5990
+
5991
+ # Create the first table
5992
+ tbl_1 = pl.DataFrame({
5993
+ "a": [1, 2, 3, 4],
5994
+ "b": ["w", "x", "y", "z"],
5995
+ "c": [4.0, 5.0, 6.0, 7.0]
5996
+ })
5997
+
5998
+ # Create an identical table
5999
+ tbl_2 = pl.DataFrame({
6000
+ "a": [1, 2, 3, 4],
6001
+ "b": ["w", "x", "y", "z"],
6002
+ "c": [4.0, 5.0, 6.0, 7.0]
6003
+ })
6004
+
6005
+ pb.preview(tbl_1)
6006
+ ```
6007
+
6008
+ Let's validate that `tbl_1` matches `tbl_2`. Since these tables are identical, the
6009
+ validation should pass.
6010
+
6011
+ ```python
6012
+ validation = (
6013
+ pb.Validate(data=tbl_1)
6014
+ .tbl_match(tbl_compare=tbl_2)
6015
+ .interrogate()
6016
+ )
6017
+
6018
+ validation
6019
+ ```
6020
+
6021
+ The validation table shows that the single test unit passed, indicating that the two tables
6022
+ match completely.
6023
+
6024
+ Now, let's create a table with a slight difference and see what happens.
6025
+
6026
+ ```python
6027
+ # Create a table with one different value
6028
+ tbl_3 = pl.DataFrame({
6029
+ "a": [1, 2, 3, 4],
6030
+ "b": ["w", "x", "y", "z"],
6031
+ "c": [4.0, 5.5, 6.0, 7.0] # Changed 5.0 to 5.5
6032
+ })
6033
+
6034
+ validation = (
6035
+ pb.Validate(data=tbl_1)
6036
+ .tbl_match(tbl_compare=tbl_3)
6037
+ .interrogate()
6038
+ )
6039
+
6040
+ validation
6041
+ ```
6042
+
6043
+ The validation table shows that the single test unit failed because the tables don't match
6044
+ (one value is different in column `c`).
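+
+ The mismatch above is caught at the final data-match stage. As a sketch of an earlier stage
+ failing (`tbl_4` is a hypothetical table not used elsewhere), a comparison table that is
+ missing a column should fail at the column count check instead:
+
+ ```python
+ # A comparison table with only two columns
+ tbl_4 = pl.DataFrame({
+     "a": [1, 2, 3, 4],
+     "b": ["w", "x", "y", "z"]
+ })
+
+ validation = (
+     pb.Validate(data=tbl_1)
+     .tbl_match(tbl_compare=tbl_4)
+     .interrogate()
+ )
+
+ validation
+ ```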
6045
+
6046
+
5318
6047
  conjointly(self, *exprs: 'Callable', pre: 'Callable | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
5319
6048
 
5320
6049
  Perform multiple row-wise validations for joint validity.
@@ -5806,12 +6535,323 @@ specially(self, expr: 'Callable', pre: 'Callable | None' = None, thresholds: 'in
5806
6535
  )
5807
6536
  ```
5808
6537
 
5809
- This pattern shows how to validate external dependencies or environment conditions as part
5810
- of your validation workflow. Notice that the function doesn't take any parameters at all,
5811
- which makes it cleaner when the validation doesn't need to access the data table.
5812
-
5813
- By combining these patterns, you can create sophisticated validation workflows that address
5814
- virtually any data quality requirement in your organization.
6538
+ This pattern shows how to validate external dependencies or environment conditions as part
6539
+ of your validation workflow. Notice that the function doesn't take any parameters at all,
6540
+ which makes it cleaner when the validation doesn't need to access the data table.
6541
+
6542
+ By combining these patterns, you can create sophisticated validation workflows that address
6543
+ virtually any data quality requirement in your organization.
6544
+
6545
+
6546
+ prompt(self, prompt: 'str', model: 'str', columns_subset: 'str | list[str] | None' = None, batch_size: 'int' = 1000, max_concurrent: 'int' = 3, pre: 'Callable | None' = None, segments: 'SegmentSpec | None' = None, thresholds: 'int | float | bool | tuple | dict | Thresholds' = None, actions: 'Actions | None' = None, brief: 'str | bool | None' = None, active: 'bool' = True) -> 'Validate'
6547
+
6548
+ Validate rows using AI/LLM-powered analysis.
6549
+
6550
+ The `prompt()` validation method uses Large Language Models (LLMs) to validate rows of data
6551
+ based on natural language criteria. Similar to other Pointblank validation methods, this
6552
+ generates binary test results (pass/fail) that integrate seamlessly with the standard
6553
+ reporting framework.
6554
+
6555
+ Like `col_vals_*()` methods, `prompt()` evaluates data against specific criteria, but
6556
+ instead of using programmatic rules, it uses natural language prompts interpreted by an LLM.
6557
+ Like `rows_distinct()` and `rows_complete()`, it operates at the row level and allows you to
6558
+ specify a subset of columns for evaluation using `columns_subset=`.
6559
+
6560
+ The system automatically combines your validation criteria from the `prompt=` parameter with
6561
+ the necessary technical context, data formatting instructions, and response structure
6562
+ requirements. All of this is handled so that you only need to focus on describing your
6563
+ plain language.
6564
+
6565
+ Each row becomes a test unit that either passes or fails the validation criteria, producing
6566
+ the familiar True/False results that appear in Pointblank validation reports. This method
6567
+ is particularly useful for complex validation rules that are difficult to express with
6568
+ traditional validation methods, such as semantic checks, context-dependent validation, or
6569
+ subjective quality assessments.
6570
+
6571
+ Parameters
6572
+ ----------
6573
+ prompt
6574
+ A natural language description of the validation criteria. This prompt should clearly
6575
+ describe what constitutes valid vs invalid rows. Some examples:
6576
+ `"Each row should contain a valid email address and a realistic person name"`,
6577
+ `"Values should indicate positive sentiment"`,
6578
+ `"The description should mention a country name"`.
6579
+ columns_subset
6580
+ A single column or list of columns to include in the validation. If `None`, all columns
6581
+ will be included. Specifying fewer columns can improve performance and reduce API costs,
6582
+ so try to include only the columns necessary for the validation.
6583
+ model
6584
+ The model to be used. This should be in the form of `provider:model` (e.g.,
6585
+ `"anthropic:claude-sonnet-4-5"`). Supported providers are `"anthropic"`, `"openai"`,
6586
+ `"ollama"`, and `"bedrock"`. The model name should be the specific model to be used from
6587
+ the provider. Model names are subject to change so consult the provider's documentation
6588
+ for the most up-to-date model names.
6589
+ batch_size
6590
+ Number of rows to process in each batch. Larger batches are more efficient but may hit
6591
+ API limits. Default is `1000`.
6592
+ max_concurrent
6593
+ Maximum number of concurrent API requests. Higher values speed up processing but may
6594
+ hit rate limits. Default is `3`.
6595
+ pre
6596
+ An optional preprocessing function or lambda to apply to the data table during
6597
+ interrogation. This function should take a table as input and return a modified table.
6598
+ segments
6599
+ An optional directive on segmentation, which serves to split a validation step into
6600
+ multiple (one step per segment). Can be a single column name, a tuple that specifies a
6601
+ column name and its corresponding values to segment on, or a combination of both
6602
+ (provided as a list).
6603
+ thresholds
6604
+ Set threshold failure levels for reporting and reacting to exceedances of the levels.
6605
+ The thresholds are set at the step level and will override any global thresholds set in
6606
+ `Validate(thresholds=...)`. The default is `None`, which means that no thresholds will
6607
+ be set locally and global thresholds (if any) will take effect.
6608
+ actions
6609
+ Optional actions to take when the validation step meets or exceeds any set threshold
6610
+ levels. If provided, the [`Actions`](`pointblank.Actions`) class should be used to
6611
+ define the actions.
6612
+ brief
6613
+ An optional brief description of the validation step that will be displayed in the
6614
+ reporting table. You can use the templating elements like `"{step}"` to insert
6615
+ the step number, or `"{auto}"` to include an automatically generated brief. If `True`
6616
+ the entire brief will be automatically generated. If `None` (the default) then there
6617
+ won't be a brief.
6618
+ active
6619
+ A boolean value indicating whether the validation step should be active. Using `False`
6620
+ will make the validation step inactive (still reporting its presence and keeping indexes
6621
+ for the steps unchanged).
6622
+
6623
+ Returns
6624
+ -------
6625
+ Validate
6626
+ The `Validate` object with the added validation step.
6627
+
6628
+ Constructing the `model` Argument
6629
+ ---------------------------------
6630
+ The `model=` argument should be constructed using the provider and model name separated by a
6631
+ colon (`provider:model`). The provider text can be any of:
6632
+
6633
+ - `"anthropic"` (Anthropic)
6634
+ - `"openai"` (OpenAI)
6635
+ - `"ollama"` (Ollama)
6636
+ - `"bedrock"` (Amazon Bedrock)
6637
+
6638
+ The model name should be the specific model to be used from the provider. Model names are
6639
+ subject to change so consult the provider's documentation for the most up-to-date model
6640
+ names.
6641
+
6642
+ Notes on Authentication
6643
+ -----------------------
6644
+ API keys are automatically loaded from environment variables or `.env` files and are **not**
6645
+ stored in the validation object for security reasons. You should consider using a secure
6646
+ method for handling API keys.
6647
+
6648
+ One way to do this is to load the API key from an environment variable and retrieve it using
6649
+ the `os` module (specifically the `os.getenv()` function). Places to store the API key might
6650
+ include `.bashrc`, `.bash_profile`, `.zshrc`, or `.zsh_profile`.
6651
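+
+ As a minimal sketch (assuming the Anthropic provider and its standard `ANTHROPIC_API_KEY`
+ variable), you can confirm the key is visible to your process before interrogating; the key
+ itself is read from the environment rather than passed to `prompt()`:
+
+ ```python
+ import os
+
+ # Check that the provider key is available in this session (Anthropic assumed here)
+ if os.getenv("ANTHROPIC_API_KEY") is None:
+     raise RuntimeError("ANTHROPIC_API_KEY is not set; AI validation cannot authenticate.")
+ ```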
+
6652
+ Another solution is to store one or more model provider API keys in an `.env` file (in the
6653
+ root of your project). If the API keys have correct names (e.g., `ANTHROPIC_API_KEY` or
6654
+ `OPENAI_API_KEY`) then the AI validation will automatically load the API key from the `.env`
6655
+ file. An `.env` file might look like this:
6656
+
6657
+ ```plaintext
6658
+ ANTHROPIC_API_KEY="your_anthropic_api_key_here"
6659
+ OPENAI_API_KEY="your_openai_api_key_here"
6660
+ ```
6661
+
6662
+ There's no need to have the `python-dotenv` package installed when using `.env` files in
6663
+ this way.
6664
+
6665
+ **Provider-specific setup**:
6666
+
6667
+ - **OpenAI**: set `OPENAI_API_KEY` environment variable or create `.env` file
6668
+ - **Anthropic**: set `ANTHROPIC_API_KEY` environment variable or create `.env` file
6669
+ - **Ollama**: no API key required, just ensure Ollama is running locally
6670
+ - **Bedrock**: configure AWS credentials through standard AWS methods
6671
+
6672
+ AI Validation Process
6673
+ ---------------------
6674
+ The AI validation process works as follows:
6675
+
6676
+ 1. data batching: the data is split into batches of the specified size
6677
+ 2. row deduplication: duplicate rows (based on selected columns) are identified and only
6678
+ unique combinations are sent to the LLM for analysis
6679
+ 3. json conversion: each batch of unique rows is converted to JSON format for the LLM
6680
+ 4. prompt construction: the user prompt is embedded in a structured system prompt
6681
+ 5. llm processing: each batch is sent to the LLM for analysis
6682
+ 6. response parsing: LLM responses are parsed to extract validation results
6683
+ 7. result projection: results are mapped back to all original rows using row signatures
6684
+ 8. result aggregation: results from all batches are combined
6685
+
6686
+ **Performance Optimization**: the process uses row signature memoization to avoid redundant
6687
+ LLM calls. When multiple rows have identical values in the selected columns, only one
6688
+ representative row is validated, and the result is applied to all matching rows. This can
6689
+ dramatically reduce API costs and processing time for datasets with repetitive patterns.
6690
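+
+ The following is not the library's internal code, just a conceptual sketch of how
+ row-signature memoization lets rows that are identical (over the selected columns) share a
+ single verdict:
+
+ ```python
+ # Conceptual sketch only: rows with the same signature reuse one cached result
+ rows = [
+     {"email": "a@x.com", "name": "Ann"},
+     {"email": "a@x.com", "name": "Ann"},   # same signature as the first row
+     {"email": "invalid", "name": ""},
+ ]
+
+ cache = {}
+ results = []
+ for row in rows:
+     signature = tuple(row.values())
+     if signature not in cache:
+         # In the real method, a batch of unique rows is sent to the LLM here;
+         # a trivial check stands in purely for illustration
+         cache[signature] = "@" in row["email"] and bool(row["name"])
+     results.append(cache[signature])
+
+ print(results)  # [True, True, False]: only two unique signatures needed evaluation
+ ```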
+
6691
+ The LLM receives data in this JSON format:
6692
+
6693
+ ```json
6694
+ {
6695
+ "columns": ["col1", "col2", "col3"],
6696
+ "rows": [
6697
+ {"col1": "value1", "col2": "value2", "col3": "value3", "_pb_row_index": 0},
6698
+ {"col1": "value4", "col2": "value5", "col3": "value6", "_pb_row_index": 1}
6699
+ ]
6700
+ }
6701
+ ```
6702
+
6703
+ The LLM returns validation results in this format:
6704
+ ```json
6705
+ [
6706
+ {"index": 0, "result": true},
6707
+ {"index": 1, "result": false}
6708
+ ]
6709
+ ```
6710
+
6711
+ Prompt Design Tips
6712
+ ------------------
6713
+ For best results, design prompts that are:
6714
+
6715
+ - boolean-oriented: frame validation criteria to elicit clear valid/invalid responses
6716
+ - specific: clearly define what makes a row valid/invalid
6717
+ - unambiguous: avoid subjective language that could be interpreted differently
6718
+ - context-aware: include relevant business rules or domain knowledge
6719
+ - example-driven: consider providing examples in the prompt when helpful
6720
+
6721
+ **Critical**: Prompts must be designed so the LLM can determine whether each row passes or
6722
+ fails the validation criteria. The system expects binary validation responses, so avoid
6723
+ open-ended questions or prompts that might generate explanatory text instead of clear
6724
+ pass/fail judgments.
6725
+
6726
+ Good prompt examples:
6727
+
6728
+ - "Each row should contain a valid email address in the 'email' column and a non-empty name
6729
+ in the 'name' column"
6730
+ - "The 'sentiment' column should contain positive sentiment words (happy, good, excellent,
6731
+ etc.)"
6732
+ - "Product descriptions should mention at least one technical specification"
6733
+
6734
+ Poor prompt examples (avoid these):
6735
+
6736
+ - "What do you think about this data?" (too open-ended)
6737
+ - "Describe the quality of each row" (asks for description, not validation)
6738
+ - "How would you improve this data?" (asks for suggestions, not pass/fail)
6739
+
6740
+ Performance Considerations
6741
+ --------------------------
6742
+ AI validation is significantly slower than traditional validation methods due to API calls
6743
+ to LLM providers. However, performance varies dramatically based on data characteristics:
6744
+
6745
+ **High Memoization Scenarios** (seconds to minutes):
6746
+
6747
+ - data with many duplicate rows in the selected columns
6748
+ - low cardinality data (repeated patterns)
6749
+ - small number of unique row combinations
6750
+
6751
+ **Low Memoization Scenarios** (minutes to hours):
6752
+
6753
+ - high cardinality data with mostly unique rows
6754
+ - large datasets with few repeated patterns
6755
+ - all or most rows requiring individual LLM evaluation
6756
+
6757
+ The row signature memoization optimization can reduce processing time significantly when
6758
+ data has repetitive patterns. For datasets where every row is unique, expect longer
6759
+ processing times similar to validating each row individually.
6760
+
6761
+ **Strategies to Reduce Processing Time**:
6762
+
6763
+ - test on data slices: define a sampling function like `def sample_1000(df): return df.head(1000)`
6764
+ and use `pre=sample_1000` to validate on smaller samples (see the sketch after this list)
6765
+ - filter relevant data: define filter functions like `def active_only(df): return df.filter(df["status"] == "active")`
6766
+ and use `pre=active_only` to focus on a specific subset
6767
+ - optimize column selection: use `columns_subset=` to include only the columns necessary
6768
+ for validation
6769
+ - start with smaller batches: begin with `batch_size=100` for testing, then increase
6770
+ gradually
6771
+ - reduce concurrency: lower `max_concurrent` (e.g., to `1`) if hitting rate limits
6772
+ - use faster/cheaper models: consider using smaller or more efficient models for initial
6773
+ testing before switching to more capable models
6774
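+
+ As an illustrative sketch combining several of the strategies above (the small stand-in table
+ and the `sample_1000()` helper are assumptions for the example, not part of the API):
+
+ ```python
+ import pointblank as pb
+ import polars as pl
+
+ def sample_1000(df):
+     # Module-level sampling helper for a quick first pass over a large table
+     return df.head(1000)
+
+ # Stand-in data; in practice this would be a much larger table
+ tbl = pl.DataFrame({"email": ["john@example.com", "invalid-email", "jane@test.org"]})
+
+ validation = (
+     pb.Validate(data=tbl)
+     .prompt(
+         prompt="Each row should have a valid email address",
+         columns_subset="email",       # send only the relevant column
+         model="openai:gpt-4o-mini",   # a smaller model for initial testing
+         batch_size=100,               # start small, then increase gradually
+         max_concurrent=1,             # lower this if rate limits are hit
+         pre=sample_1000,              # validate a slice before the full table
+     )
+     .interrogate()
+ )
+ ```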
+
6775
+ Examples
6776
+ --------
6777
+ The following examples demonstrate how to use AI validation for different types of data
6778
+ quality checks. These examples show both basic usage and more advanced configurations with
6779
+ custom thresholds and actions.
6780
+
6781
+ **Basic AI validation example:**
6782
+
6783
+ This first example shows a simple validation scenario where we want to check that customer
6784
+ records have both valid email addresses and non-empty names. Notice how we use
6785
+ `columns_subset=` to focus only on the relevant columns, which improves both performance
6786
+ and cost-effectiveness.
6787
+
6788
+ ```python
6789
+ import pointblank as pb
6790
+ import polars as pl
6791
+
6792
+ # Sample data with email and name columns
6793
+ tbl = pl.DataFrame({
6794
+ "email": ["john@example.com", "invalid-email", "jane@test.org"],
6795
+ "name": ["John Doe", "", "Jane Smith"],
6796
+ "age": [25, 30, 35]
6797
+ })
6798
+
6799
+ # Validate using AI
6800
+ validation = (
6801
+ pb.Validate(data=tbl)
6802
+ .prompt(
6803
+ prompt="Each row should have a valid email address and a non-empty name",
6804
+ columns_subset=["email", "name"], # Only check these columns
6805
+ model="openai:gpt-4o-mini",
6806
+ )
6807
+ .interrogate()
6808
+ )
6809
+
6810
+ validation
6811
+ ```
6812
+
6813
+ In this example, the AI will identify that the second row fails validation because it has
6814
+ an invalid email format (`"invalid-email"`) and the third row also fails because it has an
6815
+ empty name field. The validation results will show 2 out of 3 rows failing the criteria.
6816
+
6817
+ **Advanced example with custom thresholds:**
6818
+
6819
+ This more sophisticated example demonstrates how to use AI validation with custom thresholds
6820
+ and actions. Here we're validating phone number formats to ensure they include area codes,
6821
+ which is a common data quality requirement for customer contact information.
6822
+
6823
+ ```python
6824
+ customer_data = pl.DataFrame({
6825
+ "customer_id": [1, 2, 3, 4, 5],
6826
+ "name": ["John Doe", "Jane Smith", "Bob Johnson", "Alice Brown", "Charlie Davis"],
6827
+ "phone_number": [
6828
+ "(555) 123-4567", # Valid with area code
6829
+ "555-987-6543", # Valid with area code
6830
+ "123-4567", # Missing area code
6831
+ "(800) 555-1234", # Valid with area code
6832
+ "987-6543" # Missing area code
6833
+ ]
6834
+ })
6835
+
6836
+ validation = (
6837
+ pb.Validate(data=customer_data)
6838
+ .prompt(
6839
+ prompt="Do all the phone numbers include an area code?",
6840
+ columns_subset="phone_number", # Only check the `phone_number` column
6841
+ model="openai:gpt-4o",
6842
+ batch_size=500,
6843
+ max_concurrent=5,
6844
+ thresholds=pb.Thresholds(warning=0.1, error=0.2, critical=0.3),
6845
+ actions=pb.Actions(error="Too many phone numbers missing area codes.")
6846
+ )
6847
+ .interrogate()
6848
+ )
6849
+ ```
6850
+
6851
+ This validation will identify that 2 out of 5 phone numbers (40%) are missing area codes,
6852
+ which exceeds all threshold levels. The validation will trigger the specified error action
6853
+ since the failure rate (40%) is above the error threshold (20%). The AI can recognize
6854
+ various phone number formats and determine whether they include area codes.
5815
6855
 
5816
6856
 
5817
6857
 
@@ -7420,6 +8460,108 @@ interrogate(self, collect_extracts: 'bool' = True, collect_tbl_checked: 'bool' =
7420
8460
  `get_first_n=10`.
7421
8461
 
7422
8462
 
8463
+ set_tbl(self, tbl: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None' = None) -> 'Validate'
8464
+
8465
+ Set or replace the table associated with the Validate object.
8466
+
8467
+ This method allows you to replace the table associated with a Validate object with a
8468
+ different (but presumably similar) table. This is useful when you want to apply the same
8469
+ validation plan to multiple tables or when you have a validation workflow defined but want
8470
+ to swap in a different data source.
8471
+
8472
+ Parameters
8473
+ ----------
8474
+ tbl
8475
+ The table to replace the existing table with. This can be any supported table type
8476
+ including DataFrame objects, Ibis table objects, CSV file paths, Parquet file paths,
8477
+ GitHub URLs, or database connection strings. The same table type constraints apply as in
8478
+ the `Validate` constructor.
8479
+ tbl_name
8480
+ An optional name to assign to the new input table object. If no value is provided, the
8481
+ existing table name will be retained.
8482
+ label
8483
+ An optional label for the validation plan. If no value is provided, the existing label
8484
+ will be retained.
8485
+
8486
+ Returns
8487
+ -------
8488
+ Validate
8489
+ A new `Validate` object with the replacement table.
8490
+
8491
+ When to Use
8492
+ -----------
8493
+ The `set_tbl()` method is particularly useful in scenarios where you have:
8494
+
8495
+ - multiple similar tables that need the same validation checks
8496
+ - a template validation workflow that should be applied to different data sources
8497
+ - YAML-defined validations where you want to override the table specified in the YAML
8498
+
8499
+ The `set_tbl()` method creates a copy of the validation object with the new table, so the
8500
+ original validation object remains unchanged. This allows you to reuse validation plans
8501
+ across multiple tables without interference.
8502
+
8503
+ Examples
8504
+ --------
8505
+ We will first create two similar tables for our future validation plans.
8506
+
8507
+ ```python
8508
+ import pointblank as pb
8509
+ import polars as pl
8510
+
8511
+ # Create two similar tables
8512
+ table_1 = pl.DataFrame({
8513
+ "x": [1, 2, 3, 4, 5],
8514
+ "y": [5, 4, 3, 2, 1],
8515
+ "z": ["a", "b", "c", "d", "e"]
8516
+ })
8517
+
8518
+ table_2 = pl.DataFrame({
8519
+ "x": [2, 4, 6, 8, 10],
8520
+ "y": [10, 8, 6, 4, 2],
8521
+ "z": ["f", "g", "h", "i", "j"]
8522
+ })
8523
+ ```
8524
+
8525
+ Create a validation plan with the first table.
8526
+
8527
+ ```python
8528
+ validation_table_1 = (
8529
+ pb.Validate(
8530
+ data=table_1,
8531
+ tbl_name="Table 1",
8532
+ label="Validation applied to the first table"
8533
+ )
8534
+ .col_vals_gt(columns="x", value=0)
8535
+ .col_vals_lt(columns="y", value=10)
8536
+ )
8537
+ ```
8538
+
8539
+ Now apply the same validation plan to the second table.
8540
+
8541
+ ```python
8542
+ validation_table_2 = (
8543
+ validation_table_1
8544
+ .set_tbl(
8545
+ tbl=table_2,
8546
+ tbl_name="Table 2",
8547
+ label="Validation applied to the second table"
8548
+ )
8549
+ )
8550
+ ```
8551
+
8552
+ Here is the interrogation of the first table:
8553
+
8554
+ ```python
8555
+ validation_table_1.interrogate()
8556
+ ```
8557
+
8558
+ And the second table:
8559
+
8560
+ ```python
8561
+ validation_table_2.interrogate()
8562
+ ```
8563
+
8564
+
7423
8565
  get_tabular_report(self, title: 'str | None' = ':default:', incl_header: 'bool' = None, incl_footer: 'bool' = None) -> 'GT'
7424
8566
 
7425
8567
  Validation report as a GT table.
@@ -9514,7 +10656,7 @@ assistant(model: 'str', data: 'FrameT | Any | None' = None, tbl_name: 'str | Non
9514
10656
  ----------
9515
10657
  model
9516
10658
  The model to be used. This should be in the form of `provider:model` (e.g.,
9517
- `"anthropic:claude-3-5-sonnet-latest"`). Supported providers are `"anthropic"`, `"openai"`,
10659
+ `"anthropic:claude-sonnet-4-5"`). Supported providers are `"anthropic"`, `"openai"`,
9518
10660
  `"ollama"`, and `"bedrock"`.
9519
10661
  data
9520
10662
  An optional data table to focus on during discussion with the PbA, which could be a
@@ -9900,9 +11042,10 @@ connect_to_table(connection_string: 'str') -> 'Any'
9900
11042
  ## The YAML family
9901
11043
 
9902
11044
  The *YAML* group contains functions that allow for the use of YAML to orchestrate
9903
- validation workflows. The `yaml_interrogate()` function can be used to run a validation workflow from
9904
- YAML strings or files. The `validate_yaml()` function checks if the YAML configuration
9905
- passes its own validity checks.
11045
+ validation workflows. The `yaml_interrogate()` function can be used to run a validation workflow
11046
+ from YAML strings or files. The `validate_yaml()` function checks if the YAML configuration passes
11047
+ its own validity checks. The `yaml_to_python()` function converts YAML configuration to equivalent
11048
+ Python code.
9906
11049
 
9907
11050
  yaml_interrogate(yaml: 'Union[str, Path]', set_tbl: 'Union[FrameT, Any, None]' = None, namespaces: 'Optional[Union[Iterable[str], Mapping[str, str]]]' = None) -> 'Validate'
9908
11051
  Execute a YAML-based validation workflow.
@@ -10295,6 +11438,95 @@ Validate YAML configuration against the expected structure.
10295
11438
  yaml_interrogate : execute YAML-based validation workflows
10296
11439
 
10297
11440
 
11441
+ yaml_to_python(yaml: 'Union[str, Path]') -> 'str'
11442
+ Convert YAML validation configuration to equivalent Python code.
11443
+
11444
+ This function takes a YAML validation configuration and generates the equivalent Python code
11445
+ that would produce the same validation workflow. This is useful for documentation, code
11446
+ generation, or learning how to translate YAML workflows into programmatic workflows.
11447
+
11448
+ The generated Python code includes all necessary imports, data loading, validation steps,
11449
+ and interrogation execution, formatted as executable Python code.
11450
+
11451
+ Parameters
11452
+ ----------
11453
+ yaml
11454
+ YAML configuration as string or file path. Can be: (1) a YAML string containing the
11455
+ validation configuration, or (2) a Path object or string path to a YAML file.
11456
+
11457
+ Returns
11458
+ -------
11459
+ str
11460
+ A formatted Python code string enclosed in markdown code blocks that replicates the YAML
11461
+ workflow. The code includes import statements, data loading, validation method calls, and
11462
+ interrogation execution.
11463
+
11464
+ Raises
11465
+ ------
11466
+ YAMLValidationError
11467
+ If the YAML is invalid, malformed, or contains unknown validation methods.
11468
+
11469
+ Examples
11470
+ --------
11471
+ Convert a basic YAML configuration to Python code:
11472
+
11473
+ ```python
11474
+ import pointblank as pb
11475
+
11476
+ # Define a YAML validation workflow
11477
+ yaml_config = '''
11478
+ tbl: small_table
11479
+ tbl_name: Data Quality Check
11480
+ steps:
11481
+ - col_vals_not_null:
11482
+ columns: [a, b]
11483
+ - col_vals_gt:
11484
+ columns: [c]
11485
+ value: 0
11486
+ '''
11487
+
11488
+ # Generate equivalent Python code
11489
+ python_code = pb.yaml_to_python(yaml_config)
11490
+ print(python_code)
11491
+ ```
11492
+
11493
+ The generated Python code shows exactly how to replicate the YAML workflow programmatically.
11494
+ This is particularly useful when transitioning from YAML-based workflows to code-based
11495
+ workflows, or when generating documentation that shows both YAML and Python approaches.
11496
+
11497
+ For more complex workflows with thresholds and metadata:
11498
+
11499
+ ```python
11500
+ # Advanced YAML configuration
11501
+ yaml_config = '''
11502
+ tbl: small_table
11503
+ tbl_name: Advanced Validation
11504
+ label: Production data check
11505
+ thresholds:
11506
+ warning: 0.1
11507
+ error: 0.2
11508
+ steps:
11509
+ - col_vals_between:
11510
+ columns: [c]
11511
+ left: 1
11512
+ right: 10
11513
+ - col_vals_regex:
11514
+ columns: [b]
11515
+ pattern: '[0-9]-[a-z]{3}-[0-9]{3}'
11516
+ '''
11517
+
11518
+ # Generate the equivalent Python code
11519
+ python_code = pb.yaml_to_python(yaml_config)
11520
+ print(python_code)
11521
+ ```
11522
+
11523
+ The generated code includes all configuration parameters, thresholds, and maintains the exact
11524
+ same validation logic as the original YAML workflow.
11525
+
11526
+ This function is also useful for educational purposes, helping users understand how YAML
11527
+ configurations map to the underlying Python API calls.
11528
+
11529
+
10298
11530
 
10299
11531
  ## The Utility Functions family
10300
11532
 
@@ -10715,6 +11947,297 @@ Access validation summary information when authoring final actions.
10715
11947
  custom actions that are executed after all validation steps have been completed.
10716
11948
 
10717
11949
 
11950
+ write_file(validation: 'Validate', filename: 'str', path: 'str | None' = None, keep_tbl: 'bool' = False, keep_extracts: 'bool' = False, quiet: 'bool' = False) -> 'None'
11951
+
11952
+ Write a Validate object to disk as a serialized file.
11953
+
11954
+ Writing a validation object to disk with `write_file()` can be useful for keeping data
11955
+ validation results close at hand for later retrieval (with `read_file()`). By default, any data
11956
+ table that the validation object holds will be removed before writing to disk (not applicable if
11957
+ no data table is present). This behavior can be changed by setting `keep_tbl=True`, but this
11958
+ only works when the table is not of a database type (e.g., DuckDB, PostgreSQL, etc.), as
11959
+ database connections cannot be serialized.
11960
+
11961
+ Extract data from failing validation steps can also be preserved by setting
11962
+ `keep_extracts=True`, which is useful for later analysis of data quality issues.
11963
+
11964
+ The serialized file uses Python's pickle format for storage of the validation object state,
11965
+ including all validation results, metadata, and optionally the source data.
11966
+
11967
+ **Important note.** If your validation uses custom preprocessing functions (via the `pre=`
11968
+ parameter), these functions must be defined at the module level (not interactively or as lambda
11969
+ functions) to ensure they can be properly restored when loading the validation in a different
11970
+ Python session. Read the *Creating Serializable Validations* section below for more information.
11971
+
11972
+ :::{.callout-warning}
11973
+ The `write_file()` function is currently experimental. Please report any issues you encounter in
11974
+ the [Pointblank issue tracker](https://github.com/posit-dev/pointblank/issues).
11975
+ :::
11976
+
11977
+ Parameters
11978
+ ----------
11979
+ validation
11980
+ The `Validate` object to write to disk.
11981
+ filename
11982
+ The filename to create on disk for the validation object. Should not include the file
11983
+ extension as `.pkl` will be added automatically.
11984
+ path
11985
+ An optional directory path where the file should be saved. If not provided, the file will be
11986
+ saved in the current working directory. The directory will be created if it doesn't exist.
11987
+ keep_tbl
11988
+ An option to keep the data table that is associated with the validation object. The default
11989
+ is `False` where the data table is removed before writing to disk. For database tables
11990
+ (e.g., Ibis tables with database backends), the table is always removed even if
11991
+ `keep_tbl=True`, as database connections cannot be serialized.
11992
+ keep_extracts
11993
+ An option to keep any collected extract data for failing rows from validation steps. By
11994
+ default, this is `False` (i.e., extract data is removed to save space).
11995
+ quiet
11996
+ Should the function not inform when the file is written? By default, this is `False`, so a
11997
+ message will be printed when the file is successfully written.
11998
+
11999
+ Returns
12000
+ -------
12001
+ None
12002
+ This function doesn't return anything but saves the validation object to disk.
12003
+
12004
+ Creating Serializable Validations
12005
+ ---------------------------------
12006
+ To ensure your validations work reliably across different Python sessions, the recommended
12007
+ approach is to use module-Level functions. So, create a separate Python file for your
12008
+ preprocessing functions:
12009
+
12010
+ ```python
12011
+ # preprocessing_functions.py
12012
+ import polars as pl
12013
+
12014
+ def multiply_by_100(df):
12015
+ return df.with_columns(pl.col("value") * 100)
12016
+
12017
+ def add_computed_column(df):
12018
+ return df.with_columns(computed=pl.col("value") * 2 + 10)
12019
+ ```
12020
+
12021
+ Then import and use them in your validation:
12022
+
12023
+ ```python
12024
+ # your_main_script.py
12025
+ import pointblank as pb
12026
+ from preprocessing_functions import multiply_by_100, add_computed_column
12027
+
12028
+ validation = (
12029
+ pb.Validate(data=my_data)
12030
+ .col_vals_gt(columns="value", value=500, pre=multiply_by_100)
12031
+ .col_vals_between(columns="computed", left=50, right=1000, pre=add_computed_column)
12032
+ .interrogate()
12033
+ )
12034
+
12035
+ # Save validation and it will work reliably across sessions
12036
+ pb.write_file(validation, "my_validation", keep_tbl=True)
12037
+ ```
12038
+
12039
+ ### Problematic Patterns to Avoid
12040
+
12041
+ Don't use lambda functions as they will cause immediate errors.
12042
+
12043
+ Don't use interactive function definitions (as they may fail when loading).
12044
+
12045
+ ```python
12046
+ def my_function(df): # Defined in notebook/REPL
12047
+ return df.with_columns(pl.col("value") * 2)
12048
+
12049
+ validation = pb.Validate(data).col_vals_gt(
12050
+ columns="value", value=100, pre=my_function
12051
+ )
12052
+ ```
12053
+
12054
+ ### Automatic Analysis and Guidance
12055
+
12056
+ When you call `write_file()`, it automatically analyzes your validation and provides:
12057
+
12058
+ - confirmation when all functions will work reliably
12059
+ - warnings for functions that may cause cross-session issues
12060
+ - clear errors for unsupported patterns (lambda functions)
12061
+ - specific recommendations and code examples
12062
+ - loading instructions tailored to your validation
12063
+
12064
+ ### Loading Your Validation
12065
+
12066
+ To load a saved validation in a new Python session:
12067
+
12068
+ ```python
12069
+ # In a new Python session
12070
+ import pointblank as pb
12071
+
12072
+ # Import the same preprocessing functions used when creating the validation
12073
+ from preprocessing_functions import multiply_by_100, add_computed_column
12074
+
12075
+ # Upon loading the validation, functions will be automatically restored
12076
+ validation = pb.read_file("my_validation.pkl")
12077
+ ```
12078
+
12079
+ ** Testing Your Validation:**
12080
+
12081
+ To verify your validation works across sessions:
12082
+
12083
+ 1. save your validation in one Python session
12084
+ 2. start a fresh Python session (restart kernel/interpreter)
12085
+ 3. import required preprocessing functions
12086
+ 4. load the validation using `read_file()`
12087
+ 5. test that preprocessing functions work as expected
12088
+
12089
+ ### Performance and Storage
12090
+
12091
+ - use `keep_tbl=False` (default) to reduce file size when you don't need the original data
12092
+ - use `keep_extracts=False` (default) to save space by excluding extract data
12093
+ - set `quiet=True` to suppress guidance messages in automated scripts
12094
+ - files are saved using pickle's highest protocol for optimal performance
12095
+
12096
+ Examples
12097
+ --------
12098
+ Let's create a simple validation and save it to disk:
12099
+
12100
+ ```python
12101
+ import pointblank as pb
12102
+
12103
+ # Create a validation
12104
+ validation = (
12105
+ pb.Validate(data=pb.load_dataset("small_table"), label="My validation")
12106
+ .col_vals_gt(columns="d", value=100)
12107
+ .col_vals_regex(columns="b", pattern=r"[0-9]-[a-z]{3}-[0-9]{3}")
12108
+ .interrogate()
12109
+ )
12110
+
12111
+ # Save to disk (without the original table data)
12112
+ pb.write_file(validation, "my_validation")
12113
+ ```
12114
+
12115
+ To keep the original table data for later analysis:
12116
+
12117
+ ```python
12118
+ # Save with the original table data included
12119
+ pb.write_file(validation, "my_validation_with_data", keep_tbl=True)
12120
+ ```
12121
+
12122
+ You can also specify a custom directory and keep extract data:
12123
+
12124
+ ```python
12125
+ pb.write_file(
12126
+ validation,
12127
+ filename="detailed_validation",
12128
+ path="/path/to/validations",
12129
+ keep_tbl=True,
12130
+ keep_extracts=True
12131
+ )
12132
+ ```
12133
+
12134
+ ### Working with Preprocessing Functions
12135
+
12136
+ For validations that use preprocessing functions to be portable across sessions, define your
12137
+ functions in a separate `.py` file:
12138
+
12139
+ ```python
12140
+ # In `preprocessing_functions.py`
12141
+
12142
+ import polars as pl
12143
+
12144
+ def multiply_by_100(df):
12145
+ return df.with_columns(pl.col("value") * 100)
12146
+
12147
+ def add_computed_column(df):
12148
+ return df.with_columns(computed=pl.col("value") * 2 + 10)
12149
+ ```
12150
+
12151
+ Then import and use them in your validation:
12152
+
12153
+ ```python
12154
+ # In your main script
12155
+
12156
+ import pointblank as pb
12157
+ from preprocessing_functions import multiply_by_100, add_computed_column
12158
+
12159
+ validation = (
12160
+ pb.Validate(data=my_data)
12161
+ .col_vals_gt(columns="value", value=500, pre=multiply_by_100)
12162
+ .col_vals_between(columns="computed", left=50, right=1000, pre=add_computed_column)
12163
+ .interrogate()
12164
+ )
12165
+
12166
+ # This validation can now be saved and loaded reliably
12167
+ pb.write_file(validation, "my_validation", keep_tbl=True)
12168
+ ```
12169
+
12170
+ When you load this validation in a new session, simply import the preprocessing functions
12171
+ again and they will be automatically restored.
12172
+
12173
+ See Also
12174
+ --------
12175
+ Use the [`read_file()`](`pointblank.read_file`) function to load a validation object that was
12176
+ previously saved with `write_file()`.
12177
+
12178
+
12179
+ read_file(filepath: 'str | Path') -> 'Validate'
12180
+
12181
+ Read a Validate object from disk that was previously saved with `write_file()`.
12182
+
12183
+ This function loads a validation object that was previously serialized to disk using the
12184
+ `write_file()` function. The validation object will be restored with all its validation results,
12185
+ metadata, and optionally the source data (if it was saved with `keep_tbl=True`).
12186
+
12187
+ :::{.callout-warning}
12188
+ The `read_file()` function is currently experimental. Please report any issues you encounter in
12189
+ the [Pointblank issue tracker](https://github.com/posit-dev/pointblank/issues).
12190
+ :::
12191
+
12192
+ Parameters
12193
+ ----------
12194
+ filepath
12195
+ The path to the saved validation file. Can be a string or Path object.
12196
+
12197
+ Returns
12198
+ -------
12199
+ Validate
12200
+ The restored validation object with all its original state, validation results, and
12201
+ metadata.
12202
+
12203
+ Examples
12204
+ --------
12205
+ Load a validation object that was previously saved:
12206
+
12207
+ ```python
12208
+ import pointblank as pb
12209
+
12210
+ # Load a validation object from disk
12211
+ validation = pb.read_file("my_validation.pkl")
12212
+
12213
+ # View the validation results
12214
+ validation
12215
+ ```
12216
+
12217
+ You can also load using just the filename (without extension):
12218
+
12219
+ ```python
12220
+ # This will automatically look for "my_validation.pkl"
12221
+ validation = pb.read_file("my_validation")
12222
+ ```
12223
+
12224
+ The loaded validation object retains all its functionality:
12225
+
12226
+ ```python
12227
+ # Get validation summary
12228
+ summary = validation.get_json_report()
12229
+
12230
+ # Get sundered data (if original table was saved)
12231
+ if validation.data is not None:
12232
+ failing_rows = validation.get_sundered_data(type="fail")
12233
+ ```
12234
+
12235
+ See Also
12236
+ --------
12237
+ Use the [`write_file()`](`pointblank.Validate.write_file`) method to save a validation object
12238
+ to disk for later retrieval with this function.
12239
+
12240
+
10718
12241
  config(report_incl_header: 'bool' = True, report_incl_footer: 'bool' = True, preview_incl_header: 'bool' = True) -> 'PointblankConfig'
10719
12242
 
10720
12243
  Configuration settings for the Pointblank library.