pointblank 0.11.6__py3-none-any.whl → 0.12.1__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- pointblank/__init__.py +2 -0
- pointblank/_constants.py +0 -1
- pointblank/_interrogation.py +244 -606
- pointblank/_utils.py +65 -3
- pointblank/assistant.py +9 -0
- pointblank/cli.py +39 -24
- pointblank/data/api-docs.txt +658 -29
- pointblank/schema.py +17 -0
- pointblank/segments.py +163 -0
- pointblank/validate.py +344 -92
- {pointblank-0.11.6.dist-info → pointblank-0.12.1.dist-info}/METADATA +59 -6
- {pointblank-0.11.6.dist-info → pointblank-0.12.1.dist-info}/RECORD +16 -15
- {pointblank-0.11.6.dist-info → pointblank-0.12.1.dist-info}/WHEEL +0 -0
- {pointblank-0.11.6.dist-info → pointblank-0.12.1.dist-info}/entry_points.txt +0 -0
- {pointblank-0.11.6.dist-info → pointblank-0.12.1.dist-info}/licenses/LICENSE +0 -0
- {pointblank-0.11.6.dist-info → pointblank-0.12.1.dist-info}/top_level.txt +0 -0
pointblank/data/api-docs.txt
CHANGED
@@ -43,13 +43,15 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
 ----------
 data
 The table to validate, which could be a DataFrame object, an Ibis table object, a CSV
-file path, a Parquet file path,
-
-
-
-
-
-
+file path, a Parquet file path, a GitHub URL pointing to a CSV or Parquet file, or a
+database connection string. When providing a CSV or Parquet file path (as a string or
+`pathlib.Path` object), the file will be automatically loaded using an available DataFrame
+library (Polars or Pandas). Parquet input also supports glob patterns, directories
+containing .parquet files, and Spark-style partitioned datasets. GitHub URLs are
+automatically transformed to raw content URLs and downloaded. Connection strings enable
+direct database access via Ibis with optional table specification using the `::table_name`
+suffix. Read the *Supported Input Table Types* section for details on the supported table
+types.
 tbl_name
 An optional name to assign to the input table object. If no value is provided, a name will
 be generated based on whatever information is available. This table name will be displayed
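The expanded `data=` inputs described in this hunk can be passed straight to the documented `Validate` entry point; a minimal sketch using placeholder file paths and table names (not files that ship with the package):

```python
import pointblank as pb

# A CSV or Parquet path is loaded with Polars, or Pandas as a fallback
v_csv = (
    pb.Validate(data="data/orders.csv")
    .col_vals_not_null(columns="order_id")
    .interrogate()
)

# A database connection string reaches the table via Ibis; the `::table_name`
# suffix picks the table inside the database
v_db = (
    pb.Validate(data="duckdb:///path/to/warehouse.ddb::orders")
    .col_vals_gt(columns="amount", value=0)
    .interrogate()
)
```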
@@ -109,6 +111,7 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None

 - Polars DataFrame (`"polars"`)
 - Pandas DataFrame (`"pandas"`)
+- PySpark table (`"pyspark"`)
 - DuckDB table (`"duckdb"`)*
 - MySQL table (`"mysql"`)*
 - PostgreSQL table (`"postgresql"`)*
@@ -116,7 +119,6 @@ Validate(data: 'FrameT | Any', tbl_name: 'str | None' = None, label: 'str | None
 - Microsoft SQL Server table (`"mssql"`)*
 - Snowflake table (`"snowflake"`)*
 - Databricks table (`"databricks"`)*
-- PySpark table (`"pyspark"`)*
 - BigQuery table (`"bigquery"`)*
 - Parquet table (`"parquet"`)*
 - CSV files (string path or `pathlib.Path` object with `.csv` extension)
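These two hunks move PySpark out of the asterisked group that requires Ibis preparation, so a PySpark DataFrame is now listed as a native input; a hedged sketch assuming a local Spark session (column names are illustrative):

```python
import pointblank as pb
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pb-demo").getOrCreate()
spark_df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["x", "y"])

# PySpark input now sits alongside Polars and Pandas in the supported types list
validation = (
    pb.Validate(data=spark_df)
    .col_vals_gt(columns="x", value=0)
    .interrogate()
)
```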
@@ -948,10 +950,15 @@ Definition of a schema object.

 - Polars DataFrame (`"polars"`)
 - Pandas DataFrame (`"pandas"`)
+- PySpark table (`"pyspark"`)
 - DuckDB table (`"duckdb"`)*
 - MySQL table (`"mysql"`)*
 - PostgreSQL table (`"postgresql"`)*
 - SQLite table (`"sqlite"`)*
+- Microsoft SQL Server table (`"mssql"`)*
+- Snowflake table (`"snowflake"`)*
+- Databricks table (`"databricks"`)*
+- BigQuery table (`"bigquery"`)*
 - Parquet table (`"parquet"`)*

 The table types marked with an asterisk need to be prepared as Ibis tables (with type of
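The schema docs now list the same broadened set of backends; a short sketch of deriving a `Schema` from a table, assuming the `tbl=` argument that `Schema` accepts for reading column definitions off an existing table (the DataFrame is illustrative):

```python
import pointblank as pb
import polars as pl

df = pl.DataFrame({"name": ["A", "B"], "value": [1.0, 2.0]})

# Capture the column names and dtypes of any supported table type
schema = pb.Schema(tbl=df)
print(schema)
```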
@@ -7079,6 +7086,129 @@ expr_col(column_name: 'str') -> 'ColumnExpression'



+## The Segments family
+
+Combine multiple values into a single segment using `seg_*()` helper functions.
+
+seg_group(values: 'list[Any]') -> 'Segment'
+
+Group together values for segmentation.
+
+Many validation methods have a `segments=` argument that can be used to specify one or more
+columns, or certain values within a column, to create segments for validation (e.g.,
+[`col_vals_gt()`](`pointblank.Validate.col_vals_gt`),
+[`col_vals_regex()`](`pointblank.Validate.col_vals_regex`), etc.). When passing in a column, or
+a tuple with a column and certain values, a segment will be created for each individual value
+within the column or given values. The `seg_group()` selector enables values to be grouped
+together into a segment. For example, if you were to create a segment for a column "region",
+investigating just "North" and "South" regions, a typical segment would look like:
+
+`segments=("region", ["North", "South"])`
+
+This would create two validation steps, one for each of the regions. If you wanted to group
+these two regions into a single segment, you could use the `seg_group()` function like this:
+
+`segments=("region", pb.seg_group(["North", "South"]))`
+
+You could create a second segment for "East" and "West" regions like this:
+
+`segments=("region", pb.seg_group([["North", "South"], ["East", "West"]]))`
+
+There will be a validation step created for every segment. Note that if there aren't any
+segments created using `seg_group()` (or any other segment expression), the validation step will
+fail to be evaluated during the interrogation process. Such a failure to evaluate will be
+reported in the validation results but it won't affect the interrogation process overall
+(i.e., the process won't be halted).
+
+Parameters
+----------
+values
+A list of values to be grouped into a segment. This can be a single list or a list of lists.
+
+Returns
+-------
+Segment
+A `Segment` object, which can be used to combine values into a segment.
+
+Examples
+--------
+Let's say we're analyzing sales from our local bookstore, and want to check that the number of
+books sold for the month exceeds a certain threshold. We could pass in the argument
+`segments="genre"`, which would return a segment for each unique genre in the dataset. We could
+also pass in `segments=("genre", ["Fantasy", "Science Fiction"])`, to only create segments for
+those two genres. However, if we wanted to group these two genres into a single segment, we
+could use the `seg_group()` function.
+
+```python
+import pointblank as pb
+import polars as pl
+
+tbl = pl.DataFrame(
+    {
+        "title": [
+            "The Hobbit",
+            "Harry Potter and the Sorcerer's Stone",
+            "The Lord of the Rings",
+            "A Game of Thrones",
+            "The Name of the Wind",
+            "The Girl with the Dragon Tattoo",
+            "The Da Vinci Code",
+            "The Hitchhiker's Guide to the Galaxy",
+            "The Martian",
+            "Brave New World"
+        ],
+        "genre": [
+            "Fantasy",
+            "Fantasy",
+            "Fantasy",
+            "Fantasy",
+            "Fantasy",
+            "Mystery",
+            "Mystery",
+            "Science Fiction",
+            "Science Fiction",
+            "Science Fiction",
+        ],
+        "units_sold": [875, 932, 756, 623, 445, 389, 678, 534, 712, 598],
+    }
+)
+
+validation = (
+    pb.Validate(data=tbl)
+    .col_vals_gt(
+        columns="units_sold",
+        value=500,
+        segments=("genre", pb.seg_group(["Fantasy", "Science Fiction"]))
+    )
+    .interrogate()
+)
+
+validation
+```
+
+What's more, we can create multiple segments, combining the genres in different ways.
+
+```python
+validation = (
+    pb.Validate(data=tbl)
+    .col_vals_gt(
+        columns="units_sold",
+        value=500,
+        segments=("genre", pb.seg_group([
+            ["Fantasy", "Science Fiction"],
+            ["Fantasy", "Mystery"],
+            ["Mystery", "Science Fiction"]
+        ]))
+    )
+    .interrogate()
+)
+
+validation
+```
+
+
+
+
 ## The Interrogation and Reporting family

 The validation plan is put into action when `interrogate()` is called.
@@ -8826,7 +8956,9 @@ DataScan(data: 'IntoFrameT', tbl_name: 'str | None' = None) -> 'None'
 Parameters
 ----------
 data
-The data to scan and summarize.
+The data to scan and summarize. This could be a DataFrame object, an Ibis table object,
+a CSV file path, a Parquet file path, a GitHub URL pointing to a CSV or Parquet file,
+or a database connection string.
 tbl_name
 Optionally, the name of the table could be provided as `tbl_name`.

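A minimal sketch of the broadened `DataScan` input described in this hunk (the CSV path and table name are placeholders):

```python
import pointblank as pb

# Scan a CSV file directly; it is loaded with Polars, or Pandas as a fallback
scan = pb.DataScan(data="data/metrics.csv", tbl_name="metrics")
```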
@@ -8958,6 +9090,7 @@ preview(data: 'FrameT | Any', columns_subset: 'str | list[str] | Column | None'

 - Polars DataFrame (`"polars"`)
 - Pandas DataFrame (`"pandas"`)
+- PySpark table (`"pyspark"`)
 - DuckDB table (`"duckdb"`)*
 - MySQL table (`"mysql"`)*
 - PostgreSQL table (`"postgresql"`)*
@@ -8965,7 +9098,6 @@ preview(data: 'FrameT | Any', columns_subset: 'str | list[str] | Column | None'
 - Microsoft SQL Server table (`"mssql"`)*
 - Snowflake table (`"snowflake"`)*
 - Databricks table (`"databricks"`)*
-- PySpark table (`"pyspark"`)*
 - BigQuery table (`"bigquery"`)*
 - Parquet table (`"parquet"`)*
 - CSV files (string path or `pathlib.Path` object with `.csv` extension)
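`preview()` accepts the same file-based inputs listed for `Validate`; a brief sketch (both the local path and the GitHub URL are placeholders following the documented `blob` format):

```python
import pointblank as pb

# Preview a local CSV file
pb.preview("data/orders.csv")

# Preview a CSV hosted on GitHub; blob URLs are rewritten to raw content URLs
pb.preview("https://github.com/user/repo/blob/main/data/orders.csv")
```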
@@ -9116,8 +9248,9 @@ col_summary_tbl(data: 'FrameT | Any', tbl_name: 'str | None' = None) -> 'GT'
 Parameters
 ----------
 data
-The table to summarize, which could be a DataFrame object
-
+The table to summarize, which could be a DataFrame object, an Ibis table object, a CSV
+file path, a Parquet file path, or a database connection string. Read the *Supported Input
+Table Types* section for details on the supported table types.
 tbl_name
 Optionally, the name of the table could be provided as `tbl_name=`.

@@ -9137,6 +9270,11 @@ col_summary_tbl(data: 'FrameT | Any', tbl_name: 'str | None' = None) -> 'GT'
 - PostgreSQL table (`"postgresql"`)*
 - SQLite table (`"sqlite"`)*
 - Parquet table (`"parquet"`)*
+- CSV files (string path or `pathlib.Path` object with `.csv` extension)
+- Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet`
+extension, or partitioned dataset)
+- GitHub URLs (direct links to CSV or Parquet files on GitHub)
+- Database connection strings (URI format with optional table specification)

 The table types marked with an asterisk need to be prepared as Ibis tables (with type of
 `ibis.expr.types.relations.Table`). Furthermore, using `col_summary_tbl()` with these types of
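A short sketch of `col_summary_tbl()` over the newly listed inputs (paths are placeholders):

```python
import pointblank as pb

# Column-level summary straight from Parquet files; glob patterns are supported
pb.col_summary_tbl("data/sales_*.parquet")

# ...or from a CSV file, with an optional display name
pb.col_summary_tbl("data/orders.csv", tbl_name="orders")
```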
@@ -9171,9 +9309,9 @@ missing_vals_tbl(data: 'FrameT | Any') -> 'GT'
 Parameters
 ----------
 data
-The table for which to display the missing values. This could be a DataFrame object
-Ibis table object
-supported table types.
+The table for which to display the missing values. This could be a DataFrame object, an
+Ibis table object, a CSV file path, a Parquet file path, or a database connection string.
+Read the *Supported Input Table Types* section for details on the supported table types.

 Returns
 -------
@@ -9186,6 +9324,7 @@ missing_vals_tbl(data: 'FrameT | Any') -> 'GT'

 - Polars DataFrame (`"polars"`)
 - Pandas DataFrame (`"pandas"`)
+- PySpark table (`"pyspark"`)
 - DuckDB table (`"duckdb"`)*
 - MySQL table (`"mysql"`)*
 - PostgreSQL table (`"postgresql"`)*
@@ -9193,9 +9332,12 @@ missing_vals_tbl(data: 'FrameT | Any') -> 'GT'
 - Microsoft SQL Server table (`"mssql"`)*
 - Snowflake table (`"snowflake"`)*
 - Databricks table (`"databricks"`)*
-- PySpark table (`"pyspark"`)*
 - BigQuery table (`"bigquery"`)*
 - Parquet table (`"parquet"`)*
+- CSV files (string path or `pathlib.Path` object with `.csv` extension)
+- Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet`
+extension, or partitioned dataset)
+- Database connection strings (URI format with optional table specification)

 The table types marked with an asterisk need to be prepared as Ibis tables (with type of
 `ibis.expr.types.relations.Table`). Furthermore, using `missing_vals_tbl()` with these types of
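A quick sketch of `missing_vals_tbl()` with the file-based and connection-string inputs listed above (the path and table name are placeholders):

```python
import pointblank as pb

# Missing-values report straight from a CSV file
pb.missing_vals_tbl("data/orders.csv")

# ...or from a database table, using the `::table_name` suffix
pb.missing_vals_tbl("duckdb:///path/to/warehouse.ddb::orders")
```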
@@ -9270,8 +9412,9 @@ assistant(model: 'str', data: 'FrameT | Any | None' = None, tbl_name: 'str | Non
 `"ollama"`, and `"bedrock"`.
 data
 An optional data table to focus on during discussion with the PbA, which could be a
-DataFrame object
-
+DataFrame object, an Ibis table object, a CSV file path, a Parquet file path, or a database
+connection string. Read the *Supported Input Table Types* section for details on the
+supported table types.
 tbl_name : str, optional
 The name of the data table. This is optional and is only used to provide a more detailed
 prompt to the PbA.
@@ -9348,16 +9491,29 @@ assistant(model: 'str', data: 'FrameT | Any | None' = None, tbl_name: 'str | Non

 - Polars DataFrame (`"polars"`)
 - Pandas DataFrame (`"pandas"`)
+- PySpark table (`"pyspark"`)
 - DuckDB table (`"duckdb"`)*
 - MySQL table (`"mysql"`)*
 - PostgreSQL table (`"postgresql"`)*
 - SQLite table (`"sqlite"`)*
+- Microsoft SQL Server table (`"mssql"`)*
+- Snowflake table (`"snowflake"`)*
+- Databricks table (`"databricks"`)*
+- BigQuery table (`"bigquery"`)*
 - Parquet table (`"parquet"`)*
+- CSV files (string path or `pathlib.Path` object with `.csv` extension)
+- Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet`
+extension, or partitioned dataset)
+- Database connection strings (URI format with optional table specification)

 The table types marked with an asterisk need to be prepared as Ibis tables (with type of
-`ibis.expr.types.relations.Table`). Furthermore, using `
+`ibis.expr.types.relations.Table`). Furthermore, using `assistant()` with these types of tables
 requires the Ibis library (`v9.5.0` or above) to be installed. If the input table is a Polars or
 Pandas DataFrame, the availability of Ibis is not needed.
+
+To use a CSV file, ensure that a string or `pathlib.Path` object with a `.csv` extension is
+provided. The file will be automatically detected and loaded using the best available DataFrame
+library. The loading preference is Polars first, then Pandas as a fallback.


 load_dataset(dataset: "Literal['small_table', 'game_revenue', 'nycflights', 'global_sales']" = 'small_table', tbl_type: "Literal['polars', 'pandas', 'duckdb']" = 'polars') -> 'FrameT | Any'
@@ -9553,6 +9709,372 @@ get_data_path(dataset: "Literal['small_table', 'game_revenue', 'nycflights', 'gl
 [`load_dataset()`](`pointblank.load_dataset`) for loading datasets directly as table objects.


+connect_to_table(connection_string: 'str') -> 'Any'
+
+Connect to a database table using a connection string.
+
+This utility function tests whether a connection string leads to a valid table and returns
+the table object if successful. It provides helpful error messages when no table is specified
+or when backend dependencies are missing.
+
+Parameters
+----------
+connection_string
+A database connection string with a required table specification using the `::table_name`
+suffix. Supported formats are outlined in the *Supported Connection String Formats* section.
+
+Returns
+-------
+Any
+An Ibis table object for the specified database table.
+
+Supported Connection String Formats
+-----------------------------------
+The `connection_string` parameter must include a valid connection string with a table name
+specified using the `::` syntax. Here are some examples on how to format connection strings
+for various backends:
+
+```
+DuckDB: "duckdb:///path/to/database.ddb::table_name"
+SQLite: "sqlite:///path/to/database.db::table_name"
+PostgreSQL: "postgresql://user:password@localhost:5432/database::table_name"
+MySQL: "mysql://user:password@localhost:3306/database::table_name"
+BigQuery: "bigquery://project/dataset::table_name"
+Snowflake: "snowflake://user:password@account/database/schema::table_name"
+```
+
+If the connection string does not include a table name, the function will attempt to connect to
+the database and list available tables, providing guidance on how to specify a table.
+
+Examples
+--------
+Connect to a DuckDB table:
+
+```python
+import pointblank as pb
+
+# Get path to a DuckDB database file from package data
+duckdb_path = pb.get_data_path("game_revenue", "duckdb")
+
+# Connect to the `game_revenue` table in the DuckDB database
+game_revenue = pb.connect_to_table(f"duckdb:///{duckdb_path}::game_revenue")
+
+# Use with the `preview()` function
+pb.preview(game_revenue)
+```
+
+Here are some backend-specific connection examples:
+
+```python
+# PostgreSQL
+pg_table = pb.connect_to_table(
+    "postgresql://user:password@localhost:5432/warehouse::customer_data"
+)
+
+# SQLite
+sqlite_table = pb.connect_to_table("sqlite:///local_data.db::products")
+
+# BigQuery
+bq_table = pb.connect_to_table("bigquery://my-project/analytics::daily_metrics")
+```
+
+This function requires the Ibis library with appropriate backend drivers:
+
+```bash
+# You can install a set of common backends:
+pip install 'ibis-framework[duckdb,postgres,mysql,sqlite]'
+
+# ...or specific backends as needed:
+pip install 'ibis-framework[duckdb]'    # for DuckDB
+pip install 'ibis-framework[postgres]'  # for PostgreSQL
+```
+
+
+
+## The YAML family
+
+The *YAML* group contains functions that allow for the use of YAML to orchestrate
+validation workflows. The `yaml_interrogate()` function can be used to run a validation workflow from
+YAML strings or files. The `validate_yaml()` function checks if the YAML configuration
+passes its own validity checks.
+
+yaml_interrogate(yaml: 'Union[str, Path]') -> 'Validate'
+Execute a YAML-based validation workflow.
+
+This is the main entry point for YAML-based validation workflows. It takes YAML configuration
+(as a string or file path) and returns a validated `Validate` object with interrogation results.
+
+The YAML configuration defines the data source, validation steps, and optional settings like
+thresholds and labels. This function automatically loads the data, builds the validation plan,
+executes all validation steps, and returns the interrogated results.
+
+Parameters
+----------
+yaml
+YAML configuration as string or file path. Can be: (1) a YAML string containing the
+validation configuration, or (2) a Path object or string path to a YAML file.
+
+Returns
+-------
+Validate
+An instance of the `Validate` class that has been configured based on the YAML input.
+This object contains the results of the validation steps defined in the YAML configuration.
+It includes metadata like table name, label, language, and thresholds if specified.
+
+Raises
+------
+YAMLValidationError
+If the YAML is invalid, malformed, or execution fails. This includes syntax errors, missing
+required fields, unknown validation methods, or data loading failures.
+
+Examples
+--------
+For the examples here, we'll use YAML configurations to define validation workflows. Let's start
+with a basic YAML workflow that validates the built-in `small_table` dataset.
+
+```python
+import pointblank as pb
+
+# Define a basic YAML validation workflow
+yaml_config = '''
+tbl: small_table
+steps:
+- rows_distinct
+- col_exists:
+    columns: [date, a, b]
+'''
+
+# Execute the validation workflow
+result = pb.yaml_interrogate(yaml_config)
+result
+```
+
+The validation table shows the results of our YAML-defined workflow. We can see that the
+`rows_distinct()` validation failed (because there are duplicate rows in the table), while the
+column existence checks passed.
+
+Now let's create a more comprehensive validation workflow with thresholds and metadata:
+
+```python
+# Advanced YAML configuration with thresholds and metadata
+yaml_config = '''
+tbl: small_table
+tbl_name: small_table_demo
+label: Comprehensive data validation
+thresholds:
+  warning: 0.1
+  error: 0.25
+  critical: 0.35
+steps:
+- col_vals_gt:
+    columns: [d]
+    value: 100
+- col_vals_regex:
+    columns: [b]
+    pattern: '[0-9]-[a-z]{3}-[0-9]{3}'
+- col_vals_not_null:
+    columns: [date, a]
+'''
+
+# Execute the validation workflow
+result = pb.yaml_interrogate(yaml_config)
+print(f"Table name: {result.tbl_name}")
+print(f"Label: {result.label}")
+print(f"Total validation steps: {len(result.validation_info)}")
+```
+
+The validation results now include our custom table name and label. The thresholds we defined
+will determine when validation steps are marked as warnings, errors, or critical failures.
+
+You can also load YAML configurations from files. Here's how you would work with a YAML file:
+
+```python
+from pathlib import Path
+import tempfile
+
+# Create a temporary YAML file for demonstration
+yaml_content = '''
+tbl: small_table
+tbl_name: File-based Validation
+steps:
+- col_vals_between:
+    columns: [c]
+    left: 1
+    right: 10
+- col_vals_in_set:
+    columns: [f]
+    set: [low, mid, high]
+'''
+
+with tempfile.NamedTemporaryFile(mode='w', suffix='.yaml', delete=False) as f:
+    f.write(yaml_content)
+    yaml_file_path = Path(f.name)
+
+# Load and execute validation from file
+result = pb.yaml_interrogate(yaml_file_path)
+result
+```
+
+This approach is particularly useful for storing validation configurations as part of your data
+pipeline or version control system, allowing you to maintain validation rules alongside your
+code.
+
+
+validate_yaml(yaml: 'Union[str, Path]') -> 'None'
+Validate YAML configuration against the expected structure.
+
+This function validates that a YAML configuration conforms to the expected structure for
+validation workflows. It checks for required fields, proper data types, and valid
+validation method names. This is useful for validating configurations before execution or
+for building configuration editors and validators.
+
+The function performs comprehensive validation including:
+
+- required fields ('tbl' and 'steps')
+- proper data types for all fields
+- valid threshold configurations
+- known validation method names
+- proper step configuration structure
+
+Parameters
+----------
+yaml
+YAML configuration as string or file path. Can be: (1) a YAML string containing the
+validation configuration, or (2) a Path object or string path to a YAML file.
+
+Raises
+------
+YAMLValidationError
+If the YAML is invalid, malformed, or execution fails. This includes syntax errors,
+missing required fields, unknown validation methods, or data loading failures.
+
+Examples
+--------
+For the examples here, we'll demonstrate how to validate YAML configurations before using them
+with validation workflows. This is particularly useful for building robust data validation
+systems where you want to catch configuration errors early.
+
+Let's start with validating a basic configuration:
+
+```python
+import pointblank as pb
+
+# Define a basic YAML validation configuration
+yaml_config = '''
+tbl: small_table
+steps:
+- rows_distinct
+- col_exists:
+    columns: [a, b]
+'''
+
+# Validate the configuration: no exception means it's valid
+pb.validate_yaml(yaml_config)
+print("Basic YAML configuration is valid")
+```
+
+The function completed without raising an exception, which means our configuration is valid and
+follows the expected structure.
+
+Now let's validate a more complex configuration with thresholds and metadata:
+
+```python
+# Complex YAML configuration with all optional fields
+yaml_config = '''
+tbl: small_table
+tbl_name: My Dataset
+label: Quality check
+lang: en
+locale: en
+thresholds:
+  warning: 0.1
+  error: 0.25
+  critical: 0.35
+steps:
+- rows_distinct
+- col_vals_gt:
+    columns: [d]
+    value: 100
+- col_vals_regex:
+    columns: [b]
+    pattern: '[0-9]-[a-z]{3}-[0-9]{3}'
+'''
+
+# Validate the configuration
+pb.validate_yaml(yaml_config)
+print("Complex YAML configuration is valid")
+
+# Count the validation steps
+import pointblank.yaml as pby
+config = pby.load_yaml_config(yaml_config)
+print(f"Configuration has {len(config['steps'])} validation steps")
+```
+
+This configuration includes all the optional metadata fields and complex validation steps,
+demonstrating that the validation handles the full range of supported options.
+
+Let's see what happens when we try to validate an invalid configuration:
+
+```python
+# Invalid YAML configuration: missing required 'tbl' field
+invalid_yaml = '''
+steps:
+- rows_distinct
+'''
+
+try:
+    pb.validate_yaml(invalid_yaml)
+except pb.yaml.YAMLValidationError as e:
+    print(f"Validation failed: {e}")
+```
+
+The validation correctly identifies that our configuration is missing the required `'tbl'`
+field.
+
+Here's a practical example of using validation in a workflow builder:
+
+```python
+def safe_yaml_interrogate(yaml_config):
+    """Safely execute a YAML configuration after validation."""
+    try:
+        # Validate the YAML configuration first
+        pb.validate_yaml(yaml_config)
+        print("✓ YAML configuration is valid")
+
+        # Then execute the workflow
+        result = pb.yaml_interrogate(yaml_config)
+        print(f"Validation completed with {len(result.validation_info)} steps")
+        return result
+
+    except pb.yaml.YAMLValidationError as e:
+        print(f"Configuration error: {e}")
+        return None
+
+# Test with a valid YAML configuration
+test_yaml = '''
+tbl: small_table
+steps:
+- col_vals_between:
+    columns: [c]
+    left: 1
+    right: 10
+'''
+
+result = safe_yaml_interrogate(test_yaml)
+```
+
+This pattern of validating before executing helps build more reliable data validation pipelines
+by catching configuration errors early in the process.
+
+Note that this function only validates the structure and does not check if the specified data
+source ('tbl') exists or is accessible. Data source validation occurs during execution with
+`yaml_interrogate()`.
+
+See Also
+--------
+yaml_interrogate : execute YAML-based validation workflows
+
+

 ## The Utility Functions family

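Beyond the `preview()` usage shown in the docstring above, the Ibis table returned by `connect_to_table()` can serve as the `data=` input for a validation plan; a sketch reusing the packaged DuckDB example (the `player_id` column is assumed from the `game_revenue` sample dataset):

```python
import pointblank as pb

duckdb_path = pb.get_data_path("game_revenue", "duckdb")
game_revenue = pb.connect_to_table(f"duckdb:///{duckdb_path}::game_revenue")

# The connected table slots into a validation plan like any other supported input
validation = (
    pb.Validate(data=game_revenue, tbl_name="game_revenue")
    .col_vals_not_null(columns="player_id")
    .interrogate()
)
```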
@@ -9568,14 +10090,15 @@ get_column_count(data: 'FrameT | Any') -> 'int'

 The `get_column_count()` function returns the number of columns in a table. The function works
 with any table that is supported by the `pointblank` library, including Pandas, Polars, and Ibis
-backend tables (e.g., DuckDB, MySQL, PostgreSQL, SQLite, Parquet, etc.).
+backend tables (e.g., DuckDB, MySQL, PostgreSQL, SQLite, Parquet, etc.). It also supports
+direct input of CSV files, Parquet files, and database connection strings.

 Parameters
 ----------
 data
-The table for which to get the column count, which could be a DataFrame object
-table object
-table types.
+The table for which to get the column count, which could be a DataFrame object, an Ibis
+table object, a CSV file path, a Parquet file path, or a database connection string.
+Read the *Supported Input Table Types* section for details on the supported table types.

 Returns
 -------
@@ -9588,6 +10111,7 @@ get_column_count(data: 'FrameT | Any') -> 'int'

 - Polars DataFrame (`"polars"`)
 - Pandas DataFrame (`"pandas"`)
+- PySpark table (`"pyspark"`)
 - DuckDB table (`"duckdb"`)*
 - MySQL table (`"mysql"`)*
 - PostgreSQL table (`"postgresql"`)*
@@ -9595,15 +10119,41 @@ get_column_count(data: 'FrameT | Any') -> 'int'
 - Microsoft SQL Server table (`"mssql"`)*
 - Snowflake table (`"snowflake"`)*
 - Databricks table (`"databricks"`)*
-- PySpark table (`"pyspark"`)*
 - BigQuery table (`"bigquery"`)*
 - Parquet table (`"parquet"`)*
+- CSV files (string path or `pathlib.Path` object with `.csv` extension)
+- Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet`
+extension, or partitioned dataset)
+- Database connection strings (URI format with optional table specification)

 The table types marked with an asterisk need to be prepared as Ibis tables (with type of
 `ibis.expr.types.relations.Table`). Furthermore, using `get_column_count()` with these types of
 tables requires the Ibis library (`v9.5.0` or above) to be installed. If the input table is a
 Polars or Pandas DataFrame, the availability of Ibis is not needed.

+To use a CSV file, ensure that a string or `pathlib.Path` object with a `.csv` extension is
+provided. The file will be automatically detected and loaded using the best available DataFrame
+library. The loading preference is Polars first, then Pandas as a fallback.
+
+GitHub URLs pointing to CSV or Parquet files are automatically detected and converted to raw
+content URLs for downloading. The URL format should be:
+`https://github.com/user/repo/blob/branch/path/file.csv` or
+`https://github.com/user/repo/blob/branch/path/file.parquet`
+
+Connection strings follow database URL formats and must also specify a table using the
+`::table_name` suffix. Examples include:
+
+```
+"duckdb:///path/to/database.ddb::table_name"
+"sqlite:///path/to/database.db::table_name"
+"postgresql://user:password@localhost:5432/database::table_name"
+"mysql://user:password@localhost:3306/database::table_name"
+"bigquery://project/dataset::table_name"
+"snowflake://user:password@account/database/schema::table_name"
+```
+
+When using connection strings, the Ibis library with the appropriate backend driver is required.
+
 Examples
 --------
 To get the number of columns in a table, we can use the `get_column_count()` function. Here's an
@@ -9620,6 +10170,31 @@ get_column_count(data: 'FrameT | Any') -> 'int'
 pb.get_column_count(small_table_duckdb)
 ```

+#### Working with CSV Files
+
+The `get_column_count()` function can directly accept CSV file paths:
+
+#### Working with Parquet Files
+
+The function supports various Parquet input formats:
+
+You can also use glob patterns and directories:
+
+```python
+# Multiple Parquet files with glob patterns
+pb.get_column_count("data/sales_*.parquet")
+
+# Directory containing Parquet files
+pb.get_column_count("parquet_data/")
+
+# Partitioned Parquet dataset
+pb.get_column_count("sales_data/")  # Auto-discovers partition columns
+```
+
+#### Working with Database Connection Strings
+
+The function supports database connection strings for direct access to database tables:
+
 The function always returns the number of columns in the table as an integer value, which is
 `8` for the `small_table` dataset.

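The *Working with CSV Files* and *Working with Database Connection Strings* subsections above introduce direct-path calls; a minimal sketch following the input formats documented earlier (paths and table names are placeholders):

```python
import pointblank as pb

# Column count straight from a CSV file
pb.get_column_count("data/orders.csv")

# Column count from a database table via a connection string
pb.get_column_count("duckdb:///path/to/warehouse.ddb::orders")
```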
@@ -9630,14 +10205,15 @@ get_row_count(data: 'FrameT | Any') -> 'int'

 The `get_row_count()` function returns the number of rows in a table. The function works with
 any table that is supported by the `pointblank` library, including Pandas, Polars, and Ibis
-backend tables (e.g., DuckDB, MySQL, PostgreSQL, SQLite, Parquet, etc.).
+backend tables (e.g., DuckDB, MySQL, PostgreSQL, SQLite, Parquet, etc.). It also supports
+direct input of CSV files, Parquet files, and database connection strings.

 Parameters
 ----------
 data
-The table for which to get the row count, which could be a DataFrame object
-object
-types.
+The table for which to get the row count, which could be a DataFrame object, an Ibis table
+object, a CSV file path, a Parquet file path, or a database connection string.
+Read the *Supported Input Table Types* section for details on the supported table types.

 Returns
 -------
@@ -9650,6 +10226,7 @@ get_row_count(data: 'FrameT | Any') -> 'int'

 - Polars DataFrame (`"polars"`)
 - Pandas DataFrame (`"pandas"`)
+- PySpark table (`"pyspark"`)
 - DuckDB table (`"duckdb"`)*
 - MySQL table (`"mysql"`)*
 - PostgreSQL table (`"postgresql"`)*
@@ -9657,15 +10234,42 @@ get_row_count(data: 'FrameT | Any') -> 'int'
 - Microsoft SQL Server table (`"mssql"`)*
 - Snowflake table (`"snowflake"`)*
 - Databricks table (`"databricks"`)*
-- PySpark table (`"pyspark"`)*
 - BigQuery table (`"bigquery"`)*
 - Parquet table (`"parquet"`)*
+- CSV files (string path or `pathlib.Path` object with `.csv` extension)
+- Parquet files (string path, `pathlib.Path` object, glob pattern, directory with `.parquet`
+extension, or partitioned dataset)
+- GitHub URLs (direct links to CSV or Parquet files on GitHub)
+- Database connection strings (URI format with optional table specification)

 The table types marked with an asterisk need to be prepared as Ibis tables (with type of
 `ibis.expr.types.relations.Table`). Furthermore, using `get_row_count()` with these types of
 tables requires the Ibis library (`v9.5.0` or above) to be installed. If the input table is a
 Polars or Pandas DataFrame, the availability of Ibis is not needed.

+To use a CSV file, ensure that a string or `pathlib.Path` object with a `.csv` extension is
+provided. The file will be automatically detected and loaded using the best available DataFrame
+library. The loading preference is Polars first, then Pandas as a fallback.
+
+GitHub URLs pointing to CSV or Parquet files are automatically detected and converted to raw
+content URLs for downloading. The URL format should be:
+`https://github.com/user/repo/blob/branch/path/file.csv` or
+`https://github.com/user/repo/blob/branch/path/file.parquet`
+
+Connection strings follow database URL formats and must also specify a table using the
+`::table_name` suffix. Examples include:
+
+```
+"duckdb:///path/to/database.ddb::table_name"
+"sqlite:///path/to/database.db::table_name"
+"postgresql://user:password@localhost:5432/database::table_name"
+"mysql://user:password@localhost:3306/database::table_name"
+"bigquery://project/dataset::table_name"
+"snowflake://user:password@account/database/schema::table_name"
+```
+
+When using connection strings, the Ibis library with the appropriate backend driver is required.
+
 Examples
 --------
 Getting the number of rows in a table is easily done by using the `get_row_count()` function.
@@ -9682,6 +10286,31 @@ get_row_count(data: 'FrameT | Any') -> 'int'
 pb.get_row_count(game_revenue_duckdb)
 ```

+#### Working with CSV Files
+
+The `get_row_count()` function can directly accept CSV file paths:
+
+#### Working with Parquet Files
+
+The function supports various Parquet input formats:
+
+You can also use glob patterns and directories:
+
+```python
+# Multiple Parquet files with glob patterns
+pb.get_row_count("data/sales_*.parquet")
+
+# Directory containing Parquet files
+pb.get_row_count("parquet_data/")
+
+# Partitioned Parquet dataset
+pb.get_row_count("sales_data/")  # Auto-discovers partition columns
+```
+
+#### Working with Database Connection Strings
+
+The function supports database connection strings for direct access to database tables:
+
 The function always returns the number of rows in the table as an integer value, which is `2000`
 for the `game_revenue` dataset.

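As with `get_column_count()`, a minimal sketch of the CSV and connection-string inputs for `get_row_count()` (paths and table names are placeholders):

```python
import pointblank as pb

# Row count straight from a CSV file
pb.get_row_count("data/orders.csv")

# Row count from a database table via a connection string
pb.get_row_count("duckdb:///path/to/warehouse.ddb::orders")
```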