pointblank 0.11.5__py3-none-any.whl → 0.12.0__py3-none-any.whl

@@ -43,13 +43,15 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
43
43
  ----------
44
44
  data
45
45
  The table to validate, which could be a DataFrame object, an Ibis table object, a CSV
46
- file path, a Parquet file path, or a database connection string. When providing a CSV or
47
- Parquet file path (as a string or `pathlib.Path` object), the file will be automatically
48
- loaded using an available DataFrame library (Polars or Pandas). Parquet input also supports
49
- glob patterns, directories containing .parquet files, and Spark-style partitioned datasets.
50
- Connection strings enable direct database access via Ibis with optional table specification
51
- using the `::table_name` suffix. Read the *Supported Input Table Types* section for details
52
- on the supported table types.
46
+ file path, a Parquet file path, a GitHub URL pointing to a CSV or Parquet file, or a
47
+ database connection string. When providing a CSV or Parquet file path (as a string or
48
+ `pathlib.Path` object), the file will be automatically loaded using an available DataFrame
49
+ library (Polars or Pandas). Parquet input also supports glob patterns, directories
50
+ containing .parquet files, and Spark-style partitioned datasets. GitHub URLs are
51
+ automatically transformed to raw content URLs and downloaded. Connection strings enable
52
+ direct database access via Ibis with optional table specification using the `::table_name`
53
+ suffix. Read the *Supported Input Table Types* section for details on the supported table
54
+ types.
53
55
  tbl_name
54
56
  An optional name to assign to the input table object. If no value is provided, a name will
55
57
  be generated based on whatever information is available. This table name will be displayed
@@ -109,6 +111,7 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
109
111
 
110
112
  - Polars DataFrame (`"polars"`)
111
113
  - Pandas DataFrame (`"pandas"`)
114
+ - PySpark table (`"pyspark"`)
112
115
  - DuckDB table (`"duckdb"`)*
113
116
  - MySQL table (`"mysql"`)*
114
117
  - PostgreSQL table (`"postgresql"`)*
@@ -116,7 +119,6 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
116
119
  - Microsoft SQL Server table (`"mssql"`)*
117
120
  - Snowflake table (`"snowflake"`)*
118
121
  - Databricks table (`"databricks"`)*
119
- - PySpark table (`"pyspark"`)*
120
122
  - BigQuery table (`"bigquery"`)*
121
123
  - Parquet table (`"parquet"`)*
122
124
  - CSV files (string path or `pathlib.Path` object with `.csv` extension)
@@ -948,10 +950,15 @@ Definition of a schema object.
948
950
 
949
951
  - Polars DataFrame (`"polars"`)
950
952
  - Pandas DataFrame (`"pandas"`)
953
+ - PySpark table (`"pyspark"`)
951
954
  - DuckDB table (`"duckdb"`)*
952
955
  - MySQL table (`"mysql"`)*
953
956
  - PostgreSQL table (`"postgresql"`)*
954
957
  - SQLite table (`"sqlite"`)*
958
+ - Microsoft SQL Server table (`"mssql"`)*
959
+ - Snowflake table (`"snowflake"`)*
960
+ - Databricks table (`"databricks"`)*
961
+ - BigQuery table (`"bigquery"`)*
955
962
  - Parquet table (`"parquet"`)*
956
963
 
957
964
  The table types marked with an asterisk need to be prepared as Ibis tables (with type of
@@ -7079,6 +7086,129 @@ expr_col(column_name: 'str') -> 'ColumnExpression'
7079
7086
 
7080
7087
 
7081
7088
 
7089
+ ## The Segments family
7090
+
7091
+ Combine multiple values into a single segment using `seg_*()` helper functions.
7092
+
7093
+ seg_group(values: 'list[Any]') -> 'Segment'
7094
+
7095
+ Group together values for segmentation.
7096
+
7097
+ Many validation methods have a `segments=` argument that can be used to specify one or more
7098
+ columns, or certain values within a column, to create segments for validation (e.g.,
7099
+ [`col_vals_gt()`](`pointblank.Validate.col_vals_gt`),
7100
+ [`col_vals_regex()`](`pointblank.Validate.col_vals_regex`), etc.). When passing in a column, or
7101
+ a tuple with a column and certain values, a segment will be created for each individual value
7102
+ within the column or among the given values. The `seg_group()` helper lets several values be
7103
+ grouped together into a single segment. For example, if you were segmenting on a "region" column
7104
+ and investigating just the "North" and "South" regions, a typical segment specification would be:
7105
+
7106
+ `segments=("region", ["North", "South"])`
7107
+
7108
+ This would create two validation steps, one for each of the regions. If you wanted to group
7109
+ these two regions into a single segment, you could use the `seg_group()` function like this:
7110
+
7111
+ `segments=("region", pb.seg_group(["North", "South"]))`
7112
+
7113
+ You could create a second segment for "East" and "West" regions like this:
7114
+
7115
+ `segments=("region", pb.seg_group([["North", "South"], ["East", "West"]]))`
7116
+
7117
+ There will be a validation step created for every segment. Note that if there aren't any
7118
+ segments created using `seg_group()` (or any other segment expression), the validation step will
7119
+ fail to be evaluated during the interrogation process. Such a failure to evaluate will be
7120
+ reported in the validation results, but it won't affect the interrogation process overall
7121
+ (i.e., the process won't be halted).
7122
+
7123
+ Parameters
7124
+ ----------
7125
+ values
7126
+ A list of values to be grouped into a segment. This can be a single list or a list of lists.
7127
+
7128
+ Returns
7129
+ -------
7130
+ Segment
7131
+ A `Segment` object, which can be used to combine values into a segment.
7132
+
7133
+ Examples
7134
+ --------
7135
+ Let's say we're analyzing sales from our local bookstore and want to check that the number of books
7136
+ sold for the month exceeds a certain threshold. We could pass in the argument
7137
+ `segments="genre"`, which would return a segment for each unique genre in the datasets. We could
7138
+ also pass in `segments=("genre", ["Fantasy", "Science Fiction"])`, to only create segments for
7139
+ those two genres. However, if we wanted to group these two genres into a single segment, we
7140
+ could use the `seg_group()` function.
7141
+
7142
+ ```python
7143
+ import pointblank as pb
7144
+ import polars as pl
7145
+
7146
+ tbl = pl.DataFrame(
7147
+ {
7148
+ "title": [
7149
+ "The Hobbit",
7150
+ "Harry Potter and the Sorcerer's Stone",
7151
+ "The Lord of the Rings",
7152
+ "A Game of Thrones",
7153
+ "The Name of the Wind",
7154
+ "The Girl with the Dragon Tattoo",
7155
+ "The Da Vinci Code",
7156
+ "The Hitchhiker's Guide to the Galaxy",
7157
+ "The Martian",
7158
+ "Brave New World"
7159
+ ],
7160
+ "genre": [
7161
+ "Fantasy",
7162
+ "Fantasy",
7163
+ "Fantasy",
7164
+ "Fantasy",
7165
+ "Fantasy",
7166
+ "Mystery",
7167
+ "Mystery",
7168
+ "Science Fiction",
7169
+ "Science Fiction",
7170
+ "Science Fiction",
7171
+ ],
7172
+ "units_sold": [875, 932, 756, 623, 445, 389, 678, 534, 712, 598],
7173
+ }
7174
+ )
7175
+
7176
+ validation = (
7177
+ pb.Validate(data=tbl)
7178
+ .col_vals_gt(
7179
+ columns="units_sold",
7180
+ value=500,
7181
+ segments=("genre", pb.seg_group(["Fantasy", "Science Fiction"]))
7182
+ )
7183
+ .interrogate()
7184
+ )
7185
+
7186
+ validation
7187
+ ```
7188
+
7189
+ What's more, we can create multiple segments, combining the genres in different ways.
7190
+
7191
+ ```python
7192
+ validation = (
7193
+ pb.Validate(data=tbl)
7194
+ .col_vals_gt(
7195
+ columns="units_sold",
7196
+ value=500,
7197
+ segments=("genre", pb.seg_group([
7198
+ ["Fantasy", "Science Fiction"],
7199
+ ["Fantasy", "Mystery"],
7200
+ ["Mystery", "Science Fiction"]
7201
+ ]))
7202
+ )
7203
+ .interrogate()
7204
+ )
7205
+
7206
+ validation
7207
+ ```
7208
+
7209
+
7210
+
7211
+
7082
7212
  ## The Interrogation and Reporting family
7083
7213
 
7084
7214
  The validation plan is put into action when `interrogate()` is called.
@@ -8826,7 +8956,9 @@ DataScan(data: 'IntoFrameT', tbl_name: 'str | None' = None) -> 'None'
8826
8956
  Parameters
8827
8957
  ----------
8828
8958
  data
8829
- The data to scan and summarize.
8959
+ The data to scan and summarize. This could be a DataFrame object, an Ibis table object,
8960
+ a CSV file path, a Parquet file path, a GitHub URL pointing to a CSV or Parquet file,
8961
+ or a database connection string.
8830
8962
  tbl_name
8831
8963
  Optionally, the name of the table could be provided as `tbl_name`.
8832
8964
 
@@ -8958,6 +9090,7 @@ preview(data: 'FrameT | Any', columns_subset: 'str | list[str] | Column | None'
8958
9090
 
8959
9091
  - Polars DataFrame (`"polars"`)
8960
9092
  - Pandas DataFrame (`"pandas"`)
9093
+ - PySpark table (`"pyspark"`)
8961
9094
  - DuckDB table (`"duckdb"`)*
8962
9095
  - MySQL table (`"mysql"`)*
8963
9096
  - PostgreSQL table (`"postgresql"`)*
@@ -8965,7 +9098,6 @@ preview(data: 'FrameT | Any', columns_subset: 'str | list[str] | Column | None'
8965
9098
  - Microsoft SQL Server table (`"mssql"`)*
8966
9099
  - Snowflake table (`"snowflake"`)*
8967
9100
  - Databricks table (`"databricks"`)*
8968
- - PySpark table (`"pyspark"`)*
8969
9101
  - BigQuery table (`"bigquery"`)*
8970
9102
  - Parquet table (`"parquet"`)*
8971
9103
  - CSV files (string path or `pathlib.Path` object with `.csv` extension)
@@ -9116,8 +9248,9 @@ col_summary_tbl(data: 'FrameT | Any', tbl_name: 'str | None' = None) -> 'GT'
9116
9248
  Parameters
9117
9249
  ----------
9118
9250
  data
9119
- The table to summarize, which could be a DataFrame object or an Ibis table object. Read the
9120
- *Supported Input Table Types* section for details on the supported table types.
9251
+ The table to summarize, which could be a DataFrame object, an Ibis table object, a CSV
9252
+ file path, a Parquet file path, or a database connection string. Read the *Supported Input
9253
+ Table Types* section for details on the supported table types.
9121
9254
  tbl_name
9122
9255
  Optionally, the name of the table could be provided as `tbl_name=`.
9123
9256
 
@@ -9137,6 +9270,11 @@ col_summary_tbl(data: 'FrameT | Any', tbl_name: 'str | None' = None) -> 'GT'
9137
9270
  - PostgreSQL table (`"postgresql"`)*
9138
9271
  - SQLite table (`"sqlite"`)*
9139
9272
  - Parquet table (`"parquet"`)*
9273
+ - CSV files (string path or `pathlib.Path` object with `.csv` extension)
9274
+ - Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet`
9275
+ extension, or partitioned dataset)
9276
+ - GitHub URLs (direct links to CSV or Parquet files on GitHub)
9277
+ - Database connection strings (URI format with optional table specification)
9140
9278
 
9141
9279
  The table types marked with an asterisk need to be prepared as Ibis tables (with type of
9142
9280
  `ibis.expr.types.relations.Table`). Furthermore, using `col_summary_tbl()` with these types of
@@ -9171,9 +9309,9 @@ missing_vals_tbl(data: 'FrameT | Any') -> 'GT'
9171
9309
  Parameters
9172
9310
  ----------
9173
9311
  data
9174
- The table for which to display the missing values. This could be a DataFrame object or an
9175
- Ibis table object. Read the *Supported Input Table Types* section for details on the
9176
- supported table types.
9312
+ The table for which to display the missing values. This could be a DataFrame object, an
9313
+ Ibis table object, a CSV file path, a Parquet file path, or a database connection string.
9314
+ Read the *Supported Input Table Types* section for details on the supported table types.
9177
9315
 
9178
9316
  Returns
9179
9317
  -------
@@ -9186,6 +9324,7 @@ missing_vals_tbl(data: 'FrameT | Any') -> 'GT'
9186
9324
 
9187
9325
  - Polars DataFrame (`"polars"`)
9188
9326
  - Pandas DataFrame (`"pandas"`)
9327
+ - PySpark table (`"pyspark"`)
9189
9328
  - DuckDB table (`"duckdb"`)*
9190
9329
  - MySQL table (`"mysql"`)*
9191
9330
  - PostgreSQL table (`"postgresql"`)*
@@ -9193,9 +9332,12 @@ missing_vals_tbl(data: 'FrameT | Any') -> 'GT'
9193
9332
  - Microsoft SQL Server table (`"mssql"`)*
9194
9333
  - Snowflake table (`"snowflake"`)*
9195
9334
  - Databricks table (`"databricks"`)*
9196
- - PySpark table (`"pyspark"`)*
9197
9335
  - BigQuery table (`"bigquery"`)*
9198
9336
  - Parquet table (`"parquet"`)*
9337
+ - CSV files (string path or `pathlib.Path` object with `.csv` extension)
9338
+ - Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet`
9339
+ extension, or partitioned dataset)
9340
+ - Database connection strings (URI format with optional table specification)
9199
9341
 
9200
9342
  The table types marked with an asterisk need to be prepared as Ibis tables (with type of
9201
9343
  `ibis.expr.types.relations.Table`). Furthermore, using `missing_vals_tbl()` with these types of
@@ -9270,8 +9412,9 @@ assistant(model: 'str', data: 'FrameT | Any | None' = None, tbl_name: 'str | Non
9270
9412
  `"ollama"`, and `"bedrock"`.
9271
9413
  data
9272
9414
  An optional data table to focus on during discussion with the PbA, which could be a
9273
- DataFrame object or an Ibis table object. Read the *Supported Input Table Types* section for
9274
- details on the supported table types.
9415
+ DataFrame object, an Ibis table object, a CSV file path, a Parquet file path, or a database
9416
+ connection string. Read the *Supported Input Table Types* section for details on the
9417
+ supported table types.
9275
9418
  tbl_name : str, optional
9276
9419
  The name of the data table. This is optional and is only used to provide a more detailed
9277
9420
  prompt to the PbA.
@@ -9348,16 +9491,29 @@ assistant(model: 'str', data: 'FrameT | Any | None' = None, tbl_name: 'str | Non
9348
9491
 
9349
9492
  - Polars DataFrame (`"polars"`)
9350
9493
  - Pandas DataFrame (`"pandas"`)
9494
+ - PySpark table (`"pyspark"`)
9351
9495
  - DuckDB table (`"duckdb"`)*
9352
9496
  - MySQL table (`"mysql"`)*
9353
9497
  - PostgreSQL table (`"postgresql"`)*
9354
9498
  - SQLite table (`"sqlite"`)*
9499
+ - Microsoft SQL Server table (`"mssql"`)*
9500
+ - Snowflake table (`"snowflake"`)*
9501
+ - Databricks table (`"databricks"`)*
9502
+ - BigQuery table (`"bigquery"`)*
9355
9503
  - Parquet table (`"parquet"`)*
9504
+ - CSV files (string path or `pathlib.Path` object with `.csv` extension)
9505
+ - Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet`
9506
+ extension, or partitioned dataset)
9507
+ - Database connection strings (URI format with optional table specification)
9356
9508
 
9357
9509
  The table types marked with an asterisk need to be prepared as Ibis tables (with type of
9358
- `ibis.expr.types.relations.Table`). Furthermore, using `preview()` with these types of tables
9510
+ `ibis.expr.types.relations.Table`). Furthermore, using `assistant()` with these types of tables
9359
9511
  requires the Ibis library (`v9.5.0` or above) to be installed. If the input table is a Polars or
9360
9512
  Pandas DataFrame, the availability of Ibis is not needed.
9513
+
9514
+ To use a CSV file, ensure that a string or `pathlib.Path` object with a `.csv` extension is
9515
+ provided. The file will be automatically detected and loaded using the best available DataFrame
9516
+ library. The loading preference is Polars first, then Pandas as a fallback.
9361
9517
 
9362
9518
 
9363
9519
  load_dataset(dataset: "Literal['small_table', 'game_revenue', 'nycflights', 'global_sales']" = 'small_table', tbl_type: "Literal['polars', 'pandas', 'duckdb']" = 'polars') -> 'FrameT | Any'
@@ -9553,6 +9709,372 @@ get_data_path(dataset: "Literal['small_table', 'game_revenue', 'nycflights', 'gl
9553
9709
  [`load_dataset()`](`pointblank.load_dataset`) for loading datasets directly as table objects.
9554
9710
 
9555
9711
 
9712
+ connect_to_table(connection_string: 'str') -> 'Any'
9713
+
9714
+ Connect to a database table using a connection string.
9715
+
9716
+ This utility function tests whether a connection string leads to a valid table and returns
9717
+ the table object if successful. It provides helpful error messages when no table is specified
9718
+ or when backend dependencies are missing.
9719
+
9720
+ Parameters
9721
+ ----------
9722
+ connection_string
9723
+ A database connection string with a required table specification using the `::table_name`
9724
+ suffix. Supported formats are outlined in the *Supported Connection String Formats* section.
9725
+
9726
+ Returns
9727
+ -------
9728
+ Any
9729
+ An Ibis table object for the specified database table.
9730
+
9731
+ Supported Connection String Formats
9732
+ -----------------------------------
9733
+ The `connection_string` parameter must include a valid connection string with a table name
9734
+ specified using the `::` syntax. Here are some examples on how to format connection strings
9735
+ for various backends:
9736
+
9737
+ ```
9738
+ DuckDB: "duckdb:///path/to/database.ddb::table_name"
9739
+ SQLite: "sqlite:///path/to/database.db::table_name"
9740
+ PostgreSQL: "postgresql://user:password@localhost:5432/database::table_name"
9741
+ MySQL: "mysql://user:password@localhost:3306/database::table_name"
9742
+ BigQuery: "bigquery://project/dataset::table_name"
9743
+ Snowflake: "snowflake://user:password@account/database/schema::table_name"
9744
+ ```
9745
+
9746
+ If the connection string does not include a table name, the function will attempt to connect to
9747
+ the database and list available tables, providing guidance on how to specify a table.
9748
+
9749
+ Examples
9750
+ --------
9751
+ Connect to a DuckDB table:
9752
+
9753
+ ```python
9754
+ import pointblank as pb
9755
+
9756
+ # Get path to a DuckDB database file from package data
9757
+ duckdb_path = pb.get_data_path("game_revenue", "duckdb")
9758
+
9759
+ # Connect to the `game_revenue` table in the DuckDB database
9760
+ game_revenue = pb.connect_to_table(f"duckdb:///{duckdb_path}::game_revenue")
9761
+
9762
+ # Use with the `preview()` function
9763
+ pb.preview(game_revenue)
9764
+ ```
9765
+
9766
+ Here are some backend-specific connection examples:
9767
+
9768
+ ```python
9769
+ # PostgreSQL
9770
+ pg_table = pb.connect_to_table(
9771
+ "postgresql://user:password@localhost:5432/warehouse::customer_data"
9772
+ )
9773
+
9774
+ # SQLite
9775
+ sqlite_table = pb.connect_to_table("sqlite:///local_data.db::products")
9776
+
9777
+ # BigQuery
9778
+ bq_table = pb.connect_to_table("bigquery://my-project/analytics::daily_metrics")
9779
+ ```
9780
+
9781
+ This function requires the Ibis library with appropriate backend drivers:
9782
+
9783
+ ```bash
9784
+ # You can install a set of common backends:
9785
+ pip install 'ibis-framework[duckdb,postgres,mysql,sqlite]'
9786
+
9787
+ # ...or specific backends as needed:
9788
+ pip install 'ibis-framework[duckdb]' # for DuckDB
9789
+ pip install 'ibis-framework[postgres]' # for PostgreSQL
9790
+ ```
9791
+
9792
+
9793
+
9794
+ ## The YAML family
9795
+
9796
+ The *YAML* group contains functions for orchestrating validation workflows with YAML. The
9797
+ `yaml_interrogate()` function runs a validation workflow defined in a YAML string or file, while
9798
+ the `validate_yaml()` function checks whether a YAML configuration conforms to the expected
9799
+ structure before it is executed.
9800
+
9801
+ yaml_interrogate(yaml: 'Union[str, Path]') -> 'Validate'
9802
+ Execute a YAML-based validation workflow.
9803
+
9804
+ This is the main entry point for YAML-based validation workflows. It takes YAML configuration
9805
+ (as a string or file path) and returns a validated `Validate` object with interrogation results.
9806
+
9807
+ The YAML configuration defines the data source, validation steps, and optional settings like
9808
+ thresholds and labels. This function automatically loads the data, builds the validation plan,
9809
+ executes all validation steps, and returns the interrogated results.
9810
+
9811
+ Parameters
9812
+ ----------
9813
+ yaml
9814
+ YAML configuration as string or file path. Can be: (1) a YAML string containing the
9815
+ validation configuration, or (2) a Path object or string path to a YAML file.
9816
+
9817
+ Returns
9818
+ -------
9819
+ Validate
9820
+ An instance of the `Validate` class that has been configured based on the YAML input.
9821
+ This object contains the results of the validation steps defined in the YAML configuration.
9822
+ It includes metadata like table name, label, language, and thresholds if specified.
9823
+
9824
+ Raises
9825
+ ------
9826
+ YAMLValidationError
9827
+ If the YAML is invalid, malformed, or execution fails. This includes syntax errors, missing
9828
+ required fields, unknown validation methods, or data loading failures.
9829
+
9830
+ Examples
9831
+ --------
9832
+ For the examples here, we'll use YAML configurations to define validation workflows. Let's start
9833
+ with a basic YAML workflow that validates the built-in `small_table` dataset.
9834
+
9835
+ ```python
9836
+ import pointblank as pb
9837
+
9838
+ # Define a basic YAML validation workflow
9839
+ yaml_config = '''
9840
+ tbl: small_table
9841
+ steps:
9842
+ - rows_distinct
9843
+ - col_exists:
9844
+ columns: [date, a, b]
9845
+ '''
9846
+
9847
+ # Execute the validation workflow
9848
+ result = pb.yaml_interrogate(yaml_config)
9849
+ result
9850
+ ```
9851
+
9852
+ The validation table shows the results of our YAML-defined workflow. We can see that the
9853
+ `rows_distinct()` validation failed (because there are duplicate rows in the table), while the
9854
+ column existence checks passed.
9855
+
9856
+ Now let's create a more comprehensive validation workflow with thresholds and metadata:
9857
+
9858
+ ```python
9859
+ # Advanced YAML configuration with thresholds and metadata
9860
+ yaml_config = '''
9861
+ tbl: small_table
9862
+ tbl_name: small_table_demo
9863
+ label: Comprehensive data validation
9864
+ thresholds:
9865
+ warning: 0.1
9866
+ error: 0.25
9867
+ critical: 0.35
9868
+ steps:
9869
+ - col_vals_gt:
9870
+ columns: [d]
9871
+ value: 100
9872
+ - col_vals_regex:
9873
+ columns: [b]
9874
+ pattern: '[0-9]-[a-z]{3}-[0-9]{3}'
9875
+ - col_vals_not_null:
9876
+ columns: [date, a]
9877
+ '''
9878
+
9879
+ # Execute the validation workflow
9880
+ result = pb.yaml_interrogate(yaml_config)
9881
+ print(f"Table name: {result.tbl_name}")
9882
+ print(f"Label: {result.label}")
9883
+ print(f"Total validation steps: {len(result.validation_info)}")
9884
+ ```
9885
+
9886
+ The validation results now include our custom table name and label. The thresholds we defined
9887
+ will determine when validation steps are marked as warnings, errors, or critical failures.
9888
+
9889
+ You can also load YAML configurations from files. Here's how you would work with a YAML file:
9890
+
9891
+ ```python
9892
+ from pathlib import Path
9893
+ import tempfile
9894
+
9895
+ # Create a temporary YAML file for demonstration
9896
+ yaml_content = '''
9897
+ tbl: small_table
9898
+ tbl_name: File-based Validation
9899
+ steps:
9900
+ - col_vals_between:
9901
+ columns: [c]
9902
+ left: 1
9903
+ right: 10
9904
+ - col_vals_in_set:
9905
+ columns: [f]
9906
+ set: [low, mid, high]
9907
+ '''
9908
+
9909
+ with tempfile.NamedTemporaryFile(mode='w', suffix='.yaml', delete=False) as f:
9910
+ f.write(yaml_content)
9911
+ yaml_file_path = Path(f.name)
9912
+
9913
+ # Load and execute validation from file
9914
+ result = pb.yaml_interrogate(yaml_file_path)
9915
+ result
9916
+ ```
9917
+
9918
+ This approach is particularly useful for storing validation configurations as part of your data
9919
+ pipeline or version control system, allowing you to maintain validation rules alongside your
9920
+ code.
9921
+
9922
+
9923
+ validate_yaml(yaml: 'Union[str, Path]') -> 'None'
9924
+ Validate YAML configuration against the expected structure.
9925
+
9926
+ This function validates that a YAML configuration conforms to the expected structure for
9927
+ validation workflows. It checks for required fields, proper data types, and valid
9928
+ validation method names. This is useful for validating configurations before execution or
9929
+ for building configuration editors and validators.
9930
+
9931
+ The function performs comprehensive validation including:
9932
+
9933
+ - required fields ('tbl' and 'steps')
9934
+ - proper data types for all fields
9935
+ - valid threshold configurations
9936
+ - known validation method names
9937
+ - proper step configuration structure
9938
+
9939
+ Parameters
9940
+ ----------
9941
+ yaml
9942
+ YAML configuration as string or file path. Can be: (1) a YAML string containing the
9943
+ validation configuration, or (2) a Path object or string path to a YAML file.
9944
+
9945
+ Raises
9946
+ ------
9947
+ YAMLValidationError
9948
+ If the YAML is invalid, malformed, or execution fails. This includes syntax errors,
9949
+ missing required fields, unknown validation methods, or data loading failures.
9950
+
9951
+ Examples
9952
+ --------
9953
+ For the examples here, we'll demonstrate how to validate YAML configurations before using them
9954
+ with validation workflows. This is particularly useful for building robust data validation
9955
+ systems where you want to catch configuration errors early.
9956
+
9957
+ Let's start with validating a basic configuration:
9958
+
9959
+ ```python
9960
+ import pointblank as pb
9961
+
9962
+ # Define a basic YAML validation configuration
9963
+ yaml_config = '''
9964
+ tbl: small_table
9965
+ steps:
9966
+ - rows_distinct
9967
+ - col_exists:
9968
+ columns: [a, b]
9969
+ '''
9970
+
9971
+ # Validate the configuration: no exception means it's valid
9972
+ pb.validate_yaml(yaml_config)
9973
+ print("Basic YAML configuration is valid")
9974
+ ```
9975
+
9976
+ The function completed without raising an exception, which means our configuration is valid and
9977
+ follows the expected structure.
9978
+
9979
+ Now let's validate a more complex configuration with thresholds and metadata:
9980
+
9981
+ ```python
9982
+ # Complex YAML configuration with all optional fields
9983
+ yaml_config = '''
9984
+ tbl: small_table
9985
+ tbl_name: My Dataset
9986
+ label: Quality check
9987
+ lang: en
9988
+ locale: en
9989
+ thresholds:
9990
+ warning: 0.1
9991
+ error: 0.25
9992
+ critical: 0.35
9993
+ steps:
9994
+ - rows_distinct
9995
+ - col_vals_gt:
9996
+ columns: [d]
9997
+ value: 100
9998
+ - col_vals_regex:
9999
+ columns: [b]
10000
+ pattern: '[0-9]-[a-z]{3}-[0-9]{3}'
10001
+ '''
10002
+
10003
+ # Validate the configuration
10004
+ pb.validate_yaml(yaml_config)
10005
+ print("Complex YAML configuration is valid")
10006
+
10007
+ # Count the validation steps
10008
+ import pointblank.yaml as pby
10009
+ config = pby.load_yaml_config(yaml_config)
10010
+ print(f"Configuration has {len(config['steps'])} validation steps")
10011
+ ```
10012
+
10013
+ This configuration includes all the optional metadata fields and complex validation steps,
10014
+ demonstrating that the validation handles the full range of supported options.
10015
+
10016
+ Let's see what happens when we try to validate an invalid configuration:
10017
+
10018
+ ```python
10019
+ # Invalid YAML configuration: missing required 'tbl' field
10020
+ invalid_yaml = '''
10021
+ steps:
10022
+ - rows_distinct
10023
+ '''
10024
+
10025
+ try:
10026
+ pb.validate_yaml(invalid_yaml)
10027
+ except pb.yaml.YAMLValidationError as e:
10028
+ print(f"Validation failed: {e}")
10029
+ ```
10030
+
10031
+ The validation correctly identifies that our configuration is missing the required `'tbl'`
10032
+ field.
10033
+
10034
+ Here's a practical example of using validation in a workflow builder:
10035
+
10036
+ ```python
10037
+ def safe_yaml_interrogate(yaml_config):
10038
+ """Safely execute a YAML configuration after validation."""
10039
+ try:
10040
+ # Validate the YAML configuration first
10041
+ pb.validate_yaml(yaml_config)
10042
+ print("✓ YAML configuration is valid")
10043
+
10044
+ # Then execute the workflow
10045
+ result = pb.yaml_interrogate(yaml_config)
10046
+ print(f"Validation completed with {len(result.validation_info)} steps")
10047
+ return result
10048
+
10049
+ except pb.yaml.YAMLValidationError as e:
10050
+ print(f"Configuration error: {e}")
10051
+ return None
10052
+
10053
+ # Test with a valid YAML configuration
10054
+ test_yaml = '''
10055
+ tbl: small_table
10056
+ steps:
10057
+ - col_vals_between:
10058
+ columns: [c]
10059
+ left: 1
10060
+ right: 10
10061
+ '''
10062
+
10063
+ result = safe_yaml_interrogate(test_yaml)
10064
+ ```
10065
+
10066
+ This pattern of validating before executing helps build more reliable data validation pipelines
10067
+ by catching configuration errors early in the process.
10068
+
10069
+ Note that this function only validates the structure and does not check if the specified data
10070
+ source ('tbl') exists or is accessible. Data source validation occurs during execution with
10071
+ `yaml_interrogate()`.
10072
+
10073
+ See Also
10074
+ --------
10075
+ yaml_interrogate : execute YAML-based validation workflows
10076
+
10077
+
9556
10078
 
9557
10079
  ## The Utility Functions family
9558
10080
 
@@ -9568,14 +10090,15 @@ get_column_count(data: 'FrameT | Any') -> 'int'
9568
10090
 
9569
10091
  The `get_column_count()` function returns the number of columns in a table. The function works
9570
10092
  with any table that is supported by the `pointblank` library, including Pandas, Polars, and Ibis
9571
- backend tables (e.g., DuckDB, MySQL, PostgreSQL, SQLite, Parquet, etc.).
10093
+ backend tables (e.g., DuckDB, MySQL, PostgreSQL, SQLite, Parquet, etc.). It also supports
10094
+ direct input of CSV files, Parquet files, and database connection strings.
9572
10095
 
9573
10096
  Parameters
9574
10097
  ----------
9575
10098
  data
9576
- The table for which to get the column count, which could be a DataFrame object or an Ibis
9577
- table object. Read the *Supported Input Table Types* section for details on the supported
9578
- table types.
10099
+ The table for which to get the column count, which could be a DataFrame object, an Ibis
10100
+ table object, a CSV file path, a Parquet file path, or a database connection string.
10101
+ Read the *Supported Input Table Types* section for details on the supported table types.
9579
10102
 
9580
10103
  Returns
9581
10104
  -------
@@ -9588,6 +10111,7 @@ get_column_count(data: 'FrameT | Any') -> 'int'
9588
10111
 
9589
10112
  - Polars DataFrame (`"polars"`)
9590
10113
  - Pandas DataFrame (`"pandas"`)
10114
+ - PySpark table (`"pyspark"`)
9591
10115
  - DuckDB table (`"duckdb"`)*
9592
10116
  - MySQL table (`"mysql"`)*
9593
10117
  - PostgreSQL table (`"postgresql"`)*
@@ -9595,15 +10119,41 @@ get_column_count(data: 'FrameT | Any') -> 'int'
9595
10119
  - Microsoft SQL Server table (`"mssql"`)*
9596
10120
  - Snowflake table (`"snowflake"`)*
9597
10121
  - Databricks table (`"databricks"`)*
9598
- - PySpark table (`"pyspark"`)*
9599
10122
  - BigQuery table (`"bigquery"`)*
9600
10123
  - Parquet table (`"parquet"`)*
10124
+ - CSV files (string path or `pathlib.Path` object with `.csv` extension)
10125
+ - Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet`
10126
+ extension, or partitioned dataset)
10127
+ - Database connection strings (URI format with optional table specification)
9601
10128
 
9602
10129
  The table types marked with an asterisk need to be prepared as Ibis tables (with type of
9603
10130
  `ibis.expr.types.relations.Table`). Furthermore, using `get_column_count()` with these types of
9604
10131
  tables requires the Ibis library (`v9.5.0` or above) to be installed. If the input table is a
9605
10132
  Polars or Pandas DataFrame, the availability of Ibis is not needed.
9606
10133
 
10134
+ To use a CSV file, ensure that a string or `pathlib.Path` object with a `.csv` extension is
10135
+ provided. The file will be automatically detected and loaded using the best available DataFrame
10136
+ library. The loading preference is Polars first, then Pandas as a fallback.
10137
+
10138
+ GitHub URLs pointing to CSV or Parquet files are automatically detected and converted to raw
10139
+ content URLs for downloading. The URL format should be:
10140
+ `https://github.com/user/repo/blob/branch/path/file.csv` or
10141
+ `https://github.com/user/repo/blob/branch/path/file.parquet`
10142
+
10143
+ Connection strings follow database URL formats and must also specify a table using the
10144
+ `::table_name` suffix. Examples include:
10145
+
10146
+ ```
10147
+ "duckdb:///path/to/database.ddb::table_name"
10148
+ "sqlite:///path/to/database.db::table_name"
10149
+ "postgresql://user:password@localhost:5432/database::table_name"
10150
+ "mysql://user:password@localhost:3306/database::table_name"
10151
+ "bigquery://project/dataset::table_name"
10152
+ "snowflake://user:password@account/database/schema::table_name"
10153
+ ```
10154
+
10155
+ When using connection strings, the Ibis library with the appropriate backend driver is required.
10156
+
9607
10157
  Examples
9608
10158
  --------
9609
10159
  To get the number of columns in a table, we can use the `get_column_count()` function. Here's an
@@ -9620,6 +10170,31 @@ get_column_count(data: 'FrameT | Any') -> 'int'
9620
10170
  pb.get_column_count(small_table_duckdb)
9621
10171
  ```
9622
10172
 
10173
+ #### Working with CSV Files
10174
+
10175
+ The `get_column_count()` function can directly accept CSV file paths:
10176
+
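+ For instance (the file name below is purely illustrative; any local `.csv` path can be passed in
+ the same way):
+
+ ```python
+ # Hypothetical local CSV file; a string or `pathlib.Path` ending in .csv works
+ pb.get_column_count("inventory.csv")
+ ```
+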
10177
+ #### Working with Parquet Files
10178
+
10179
+ The function supports various Parquet input formats:
10180
+
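+ For a single file (the path shown here is illustrative):
+
+ ```python
+ # Single Parquet file (hypothetical path)
+ pb.get_column_count("data/sales.parquet")
+ ```
+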
10181
+ You can also use glob patterns and directories:
10182
+
10183
+ ```python
10184
+ # Multiple Parquet files with glob patterns
10185
+ pb.get_column_count("data/sales_*.parquet")
10186
+
10187
+ # Directory containing Parquet files
10188
+ pb.get_column_count("parquet_data/")
10189
+
10190
+ # Partitioned Parquet dataset
10191
+ pb.get_column_count("sales_data/") # Auto-discovers partition columns
10192
+ ```
10193
+
10194
+ #### Working with Database Connection Strings
10195
+
10196
+ The function supports database connection strings for direct access to database tables:
10197
+
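+ As a sketch, this reuses the DuckDB database bundled with the package via `get_data_path()`; any
+ supported backend works the same way as long as the `::table_name` suffix is included:
+
+ ```python
+ # Connection string pointing at the included DuckDB copy of `game_revenue`
+ duckdb_path = pb.get_data_path("game_revenue", "duckdb")
+ pb.get_column_count(f"duckdb:///{duckdb_path}::game_revenue")
+ ```
+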
9623
10198
  The function always returns the number of columns in the table as an integer value, which is
9624
10199
  `8` for the `small_table` dataset.
9625
10200
 
@@ -9630,14 +10205,15 @@ get_row_count(data: 'FrameT | Any') -> 'int'
9630
10205
 
9631
10206
  The `get_row_count()` function returns the number of rows in a table. The function works with
9632
10207
  any table that is supported by the `pointblank` library, including Pandas, Polars, and Ibis
9633
- backend tables (e.g., DuckDB, MySQL, PostgreSQL, SQLite, Parquet, etc.).
10208
+ backend tables (e.g., DuckDB, MySQL, PostgreSQL, SQLite, Parquet, etc.). It also supports
10209
+ direct input of CSV files, Parquet files, and database connection strings.
9634
10210
 
9635
10211
  Parameters
9636
10212
  ----------
9637
10213
  data
9638
- The table for which to get the row count, which could be a DataFrame object or an Ibis table
9639
- object. Read the *Supported Input Table Types* section for details on the supported table
9640
- types.
10214
+ The table for which to get the row count, which could be a DataFrame object, an Ibis table
10215
+ object, a CSV file path, a Parquet file path, or a database connection string.
10216
+ Read the *Supported Input Table Types* section for details on the supported table types.
9641
10217
 
9642
10218
  Returns
9643
10219
  -------
@@ -9650,6 +10226,7 @@ get_row_count(data: 'FrameT | Any') -> 'int'
9650
10226
 
9651
10227
  - Polars DataFrame (`"polars"`)
9652
10228
  - Pandas DataFrame (`"pandas"`)
10229
+ - PySpark table (`"pyspark"`)
9653
10230
  - DuckDB table (`"duckdb"`)*
9654
10231
  - MySQL table (`"mysql"`)*
9655
10232
  - PostgreSQL table (`"postgresql"`)*
@@ -9657,15 +10234,42 @@ get_row_count(data: 'FrameT | Any') -> 'int'
9657
10234
  - Microsoft SQL Server table (`"mssql"`)*
9658
10235
  - Snowflake table (`"snowflake"`)*
9659
10236
  - Databricks table (`"databricks"`)*
9660
- - PySpark table (`"pyspark"`)*
9661
10237
  - BigQuery table (`"bigquery"`)*
9662
10238
  - Parquet table (`"parquet"`)*
10239
+ - CSV files (string path or `pathlib.Path` object with `.csv` extension)
10240
+ - Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet`
10241
+ extension, or partitioned dataset)
10242
+ - GitHub URLs (direct links to CSV or Parquet files on GitHub)
10243
+ - Database connection strings (URI format with optional table specification)
9663
10244
 
9664
10245
  The table types marked with an asterisk need to be prepared as Ibis tables (with type of
9665
10246
  `ibis.expr.types.relations.Table`). Furthermore, using `get_row_count()` with these types of
9666
10247
  tables requires the Ibis library (`v9.5.0` or above) to be installed. If the input table is a
9667
10248
  Polars or Pandas DataFrame, the availability of Ibis is not needed.
9668
10249
 
10250
+ To use a CSV file, ensure that a string or `pathlib.Path` object with a `.csv` extension is
10251
+ provided. The file will be automatically detected and loaded using the best available DataFrame
10252
+ library. The loading preference is Polars first, then Pandas as a fallback.
10253
+
10254
+ GitHub URLs pointing to CSV or Parquet files are automatically detected and converted to raw
10255
+ content URLs for downloading. The URL format should be:
10256
+ `https://github.com/user/repo/blob/branch/path/file.csv` or
10257
+ `https://github.com/user/repo/blob/branch/path/file.parquet`
10258
+
10259
+ Connection strings follow database URL formats and must also specify a table using the
10260
+ `::table_name` suffix. Examples include:
10261
+
10262
+ ```
10263
+ "duckdb:///path/to/database.ddb::table_name"
10264
+ "sqlite:///path/to/database.db::table_name"
10265
+ "postgresql://user:password@localhost:5432/database::table_name"
10266
+ "mysql://user:password@localhost:3306/database::table_name"
10267
+ "bigquery://project/dataset::table_name"
10268
+ "snowflake://user:password@account/database/schema::table_name"
10269
+ ```
10270
+
10271
+ When using connection strings, the Ibis library with the appropriate backend driver is required.
10272
+
9669
10273
  Examples
9670
10274
  --------
9671
10275
  Getting the number of rows in a table is easily done by using the `get_row_count()` function.
@@ -9682,6 +10286,31 @@ get_row_count(data: 'FrameT | Any') -> 'int'
9682
10286
  pb.get_row_count(game_revenue_duckdb)
9683
10287
  ```
9684
10288
 
10289
+ #### Working with CSV Files
10290
+
10291
+ The `get_row_count()` function can directly accept CSV file paths:
10292
+
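+ For instance (the file name below is purely illustrative; any local `.csv` path can be passed in
+ the same way):
+
+ ```python
+ # Hypothetical local CSV file; a string or `pathlib.Path` ending in .csv works
+ pb.get_row_count("inventory.csv")
+ ```
+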
10293
+ #### Working with Parquet Files
10294
+
10295
+ The function supports various Parquet input formats:
10296
+
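+ For a single file (the path shown here is illustrative):
+
+ ```python
+ # Single Parquet file (hypothetical path)
+ pb.get_row_count("data/sales.parquet")
+ ```
+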
10297
+ You can also use glob patterns and directories:
10298
+
10299
+ ```python
10300
+ # Multiple Parquet files with glob patterns
10301
+ pb.get_row_count("data/sales_*.parquet")
10302
+
10303
+ # Directory containing Parquet files
10304
+ pb.get_row_count("parquet_data/")
10305
+
10306
+ # Partitioned Parquet dataset
10307
+ pb.get_row_count("sales_data/") # Auto-discovers partition columns
10308
+ ```
10309
+
10310
+ #### Working with Database Connection Strings
10311
+
10312
+ The function supports database connection strings for direct access to database tables:
10313
+
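+ As a sketch, this reuses the DuckDB database bundled with the package via `get_data_path()`; any
+ supported backend works the same way as long as the `::table_name` suffix is included:
+
+ ```python
+ # Connection string pointing at the included DuckDB copy of `game_revenue`
+ duckdb_path = pb.get_data_path("game_revenue", "duckdb")
+ pb.get_row_count(f"duckdb:///{duckdb_path}::game_revenue")
+ ```
+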
9685
10314
  The function always returns the number of rows in the table as an integer value, which is `2000`
9686
10315
  for the `game_revenue` dataset.
9687
10316