duckguard 3.0.1__py3-none-any.whl → 3.2.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,1206 @@
1
+ Metadata-Version: 2.4
2
+ Name: duckguard
3
+ Version: 3.2.0
4
+ Summary: A Python-native data quality tool with AI superpowers, built on DuckDB for speed
5
+ Project-URL: Homepage, https://github.com/XDataHubAI/duckguard
6
+ Project-URL: Documentation, https://xdatahubai.github.io/duckguard/
7
+ Project-URL: Repository, https://github.com/XDataHubAI/duckguard
8
+ Author: DuckGuard Team
9
+ License-Expression: Apache-2.0
10
+ License-File: LICENSE
11
+ License-File: NOTICE
12
+ Keywords: airflow,anomaly-detection,data-catalog,data-contracts,data-engineering,data-governance,data-lineage,data-observability,data-pipeline,data-profiling,data-quality,data-testing,data-validation,dbt,duckdb,etl,great-expectations,pii-detection,pytest-plugin,schema-validation,soda,testing
13
+ Classifier: Development Status :: 4 - Beta
14
+ Classifier: Environment :: Console
15
+ Classifier: Framework :: Pytest
16
+ Classifier: Intended Audience :: Developers
17
+ Classifier: Intended Audience :: Information Technology
18
+ Classifier: Intended Audience :: Science/Research
19
+ Classifier: License :: OSI Approved :: Apache Software License
20
+ Classifier: Operating System :: OS Independent
21
+ Classifier: Programming Language :: Python :: 3
22
+ Classifier: Programming Language :: Python :: 3.10
23
+ Classifier: Programming Language :: Python :: 3.11
24
+ Classifier: Programming Language :: Python :: 3.12
25
+ Classifier: Topic :: Database
26
+ Classifier: Topic :: Scientific/Engineering :: Information Analysis
27
+ Classifier: Topic :: Software Development :: Quality Assurance
28
+ Classifier: Topic :: Software Development :: Testing
29
+ Classifier: Typing :: Typed
30
+ Requires-Python: >=3.10
31
+ Requires-Dist: duckdb>=1.0.0
32
+ Requires-Dist: packaging>=21.0
33
+ Requires-Dist: pyarrow>=14.0.0
34
+ Requires-Dist: pydantic>=2.0.0
35
+ Requires-Dist: pyyaml>=6.0.0
36
+ Requires-Dist: rich>=13.0.0
37
+ Requires-Dist: typer>=0.9.0
38
+ Provides-Extra: airflow
39
+ Requires-Dist: apache-airflow>=2.5.0; extra == 'airflow'
40
+ Provides-Extra: all
41
+ Requires-Dist: anthropic>=0.18.0; extra == 'all'
42
+ Requires-Dist: apache-airflow>=2.5.0; extra == 'all'
43
+ Requires-Dist: databricks-sql-connector>=2.0.0; extra == 'all'
44
+ Requires-Dist: google-cloud-bigquery>=3.0.0; extra == 'all'
45
+ Requires-Dist: jinja2>=3.0.0; extra == 'all'
46
+ Requires-Dist: kafka-python>=2.0.0; extra == 'all'
47
+ Requires-Dist: openai>=1.0.0; extra == 'all'
48
+ Requires-Dist: oracledb>=1.0.0; extra == 'all'
49
+ Requires-Dist: psycopg2-binary>=2.9.0; extra == 'all'
50
+ Requires-Dist: pymongo>=4.0.0; extra == 'all'
51
+ Requires-Dist: pymysql>=1.0.0; extra == 'all'
52
+ Requires-Dist: pyodbc>=4.0.0; extra == 'all'
53
+ Requires-Dist: redshift-connector>=2.0.0; extra == 'all'
54
+ Requires-Dist: scipy>=1.11.0; extra == 'all'
55
+ Requires-Dist: snowflake-connector-python>=3.0.0; extra == 'all'
56
+ Requires-Dist: weasyprint>=60.0; extra == 'all'
57
+ Provides-Extra: bigquery
58
+ Requires-Dist: google-cloud-bigquery>=3.0.0; extra == 'bigquery'
59
+ Provides-Extra: databases
60
+ Requires-Dist: databricks-sql-connector>=2.0.0; extra == 'databases'
61
+ Requires-Dist: google-cloud-bigquery>=3.0.0; extra == 'databases'
62
+ Requires-Dist: kafka-python>=2.0.0; extra == 'databases'
63
+ Requires-Dist: oracledb>=1.0.0; extra == 'databases'
64
+ Requires-Dist: psycopg2-binary>=2.9.0; extra == 'databases'
65
+ Requires-Dist: pymongo>=4.0.0; extra == 'databases'
66
+ Requires-Dist: pymysql>=1.0.0; extra == 'databases'
67
+ Requires-Dist: pyodbc>=4.0.0; extra == 'databases'
68
+ Requires-Dist: redshift-connector>=2.0.0; extra == 'databases'
69
+ Requires-Dist: snowflake-connector-python>=3.0.0; extra == 'databases'
70
+ Provides-Extra: databricks
71
+ Requires-Dist: databricks-sql-connector>=2.0.0; extra == 'databricks'
72
+ Provides-Extra: dev
73
+ Requires-Dist: black>=23.0.0; extra == 'dev'
74
+ Requires-Dist: jinja2>=3.0.0; extra == 'dev'
75
+ Requires-Dist: mypy>=1.0.0; extra == 'dev'
76
+ Requires-Dist: numpy>=1.24.0; extra == 'dev'
77
+ Requires-Dist: pandas>=2.0.0; extra == 'dev'
78
+ Requires-Dist: psutil>=5.9.0; extra == 'dev'
79
+ Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
80
+ Requires-Dist: pytest>=7.0.0; extra == 'dev'
81
+ Requires-Dist: ruff>=0.1.0; extra == 'dev'
82
+ Requires-Dist: scipy>=1.11.0; extra == 'dev'
83
+ Provides-Extra: kafka
84
+ Requires-Dist: kafka-python>=2.0.0; extra == 'kafka'
85
+ Provides-Extra: llm
86
+ Requires-Dist: anthropic>=0.18.0; extra == 'llm'
87
+ Requires-Dist: openai>=1.0.0; extra == 'llm'
88
+ Provides-Extra: mongodb
89
+ Requires-Dist: pymongo>=4.0.0; extra == 'mongodb'
90
+ Provides-Extra: mysql
91
+ Requires-Dist: pymysql>=1.0.0; extra == 'mysql'
92
+ Provides-Extra: oracle
93
+ Requires-Dist: oracledb>=1.0.0; extra == 'oracle'
94
+ Provides-Extra: postgres
95
+ Requires-Dist: psycopg2-binary>=2.9.0; extra == 'postgres'
96
+ Provides-Extra: redshift
97
+ Requires-Dist: redshift-connector>=2.0.0; extra == 'redshift'
98
+ Provides-Extra: reports
99
+ Requires-Dist: jinja2>=3.0.0; extra == 'reports'
100
+ Requires-Dist: weasyprint>=60.0; extra == 'reports'
101
+ Provides-Extra: snowflake
102
+ Requires-Dist: snowflake-connector-python>=3.0.0; extra == 'snowflake'
103
+ Provides-Extra: sqlserver
104
+ Requires-Dist: pyodbc>=4.0.0; extra == 'sqlserver'
105
+ Provides-Extra: statistics
106
+ Requires-Dist: scipy>=1.11.0; extra == 'statistics'
107
+ Description-Content-Type: text/markdown
108
+
109
+ <div align="center">
110
+ <img src="docs/assets/duckguard-logo.svg" alt="DuckGuard" width="420">
111
+
112
+ <h3>Data Quality That Just Works</h3>
113
+ <p><strong>3 lines of code</strong> &bull; <strong>10x faster</strong> &bull; <strong>20x less memory</strong></p>
114
+
115
+ <p><em>Stop wrestling with 50+ lines of boilerplate. Start validating data in seconds.</em></p>
116
+
117
+ [![PyPI version](https://img.shields.io/pypi/v/duckguard.svg)](https://pypi.org/project/duckguard/)
118
+ [![Downloads](https://static.pepy.tech/badge/duckguard)](https://pepy.tech/project/duckguard)
119
+ [![GitHub stars](https://img.shields.io/github/stars/XDataHubAI/duckguard?style=social)](https://github.com/XDataHubAI/duckguard/stargazers)
120
+ [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
121
+ [![License: Apache-2.0](https://img.shields.io/badge/License-Apache--2.0-blue.svg)](https://www.apache.org/licenses/LICENSE-2.0)
122
+ [![CI](https://github.com/XDataHubAI/duckguard/actions/workflows/ci.yml/badge.svg)](https://github.com/XDataHubAI/duckguard/actions/workflows/ci.yml)
123
+ [![Docs](https://img.shields.io/badge/docs-GitHub%20Pages-blue)](https://xdatahubai.github.io/duckguard/)
124
+
125
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/XDataHubAI/duckguard/blob/main/examples/getting_started.ipynb)
126
+ [![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/XDataHubAI/duckguard/blob/main/examples/getting_started.ipynb)
127
+ </div>
128
+
129
+ ---
130
+
131
+ ## From Zero to Validated in 30 Seconds
132
+
133
+ ```bash
134
+ pip install duckguard
135
+ ```
136
+
137
+ ```python
138
+ from duckguard import connect
139
+
140
+ orders = connect("orders.csv") # CSV, Parquet, JSON, S3, databases...
141
+ assert orders.customer_id.is_not_null() # Just like pytest!
142
+ assert orders.amount.between(0, 10000) # Readable validations
143
+ assert orders.status.isin(["pending", "shipped", "delivered"])
144
+
145
+ quality = orders.score()
146
+ print(f"Grade: {quality.grade}") # A, B, C, D, or F
147
+ ```
148
+
149
+ **That's it.** No context. No datasource. No validator. No expectation suite. Just data quality.
150
+
151
+ ---
152
+
153
+ ## Demo
154
+
155
+ <div align="center">
156
+ <img src="docs/assets/demo.svg" alt="DuckGuard Demo" width="750">
157
+ </div>
158
+
159
+ ---
160
+
161
+ ## Why DuckGuard?
162
+
163
+ Most data quality tools make you write **dozens of lines of boilerplate** before you can validate a single column. DuckGuard gives you a **pytest-like API** powered by **DuckDB's speed**.
164
+
165
+ <table>
166
+ <tr>
167
+ <td width="50%">
168
+
169
+ **Great Expectations**
170
+ ```python
171
+ # 50+ lines of setup required
172
+ from great_expectations import get_context
173
+
174
+ context = get_context()
175
+ datasource = context.sources.add_pandas("my_ds")
176
+ asset = datasource.add_dataframe_asset(
177
+ name="orders", dataframe=df
178
+ )
179
+ batch_request = asset.build_batch_request()
180
+ expectation_suite = context.add_expectation_suite(
181
+ "orders_suite"
182
+ )
183
+ validator = context.get_validator(
184
+ batch_request=batch_request,
185
+ expectation_suite_name="orders_suite"
186
+ )
187
+ validator.expect_column_values_to_not_be_null(
188
+ "customer_id"
189
+ )
190
+ validator.expect_column_values_to_be_between(
191
+ "amount", min_value=0, max_value=10000
192
+ )
193
+ # ... and more configuration
194
+ ```
195
+ **45 seconds | 4GB RAM | 20+ dependencies**
196
+
197
+ </td>
198
+ <td width="50%">
199
+
200
+ **DuckGuard**
201
+ ```python
202
+ from duckguard import connect
203
+
204
+ orders = connect("orders.csv")
205
+
206
+ assert orders.customer_id.is_not_null()
207
+ assert orders.amount.between(0, 10000)
208
+ ```
209
+
210
+ <br><br><br><br><br><br><br><br><br><br><br><br>
211
+
212
+ **4 seconds | 200MB RAM | 7 dependencies**
213
+
214
+ </td>
215
+ </tr>
216
+ </table>
217
+
218
+ | Feature | DuckGuard | Great Expectations | Soda Core | Pandera |
219
+ |---------|:---------:|:------------------:|:---------:|:-------:|
220
+ | **Lines of code to start** | 3 | 50+ | 10+ | 5+ |
221
+ | **Time for 1GB CSV*** | ~4 sec | ~45 sec | ~20 sec | ~15 sec |
222
+ | **Memory for 1GB CSV*** | ~200 MB | ~4 GB | ~1.5 GB | ~1.5 GB |
223
+ | **Learning curve** | Minutes | Days | Hours | Minutes |
224
+ | **Pytest-like API** | **Yes** | - | - | - |
225
+ | **DuckDB-powered** | **Yes** | - | Partial | - |
226
+ | **Cloud storage (S3/GCS/Azure)** | **Yes** | Yes | Yes | - |
227
+ | **Database connectors** | **11+** | Yes | Yes | - |
228
+ | **PII detection** | **Built-in** | - | - | - |
229
+ | **Anomaly detection (7 methods)** | **Built-in** | - | Partial | - |
230
+ | **Schema evolution tracking** | **Built-in** | - | Yes | - |
231
+ | **Freshness monitoring** | **Built-in** | - | Yes | - |
232
+ | **Data contracts** | **Yes** | - | Yes | Yes |
233
+ | **Row-level error details** | **Yes** | Yes | - | Yes |
234
+ | **Cross-dataset & FK checks** | **Built-in** | Partial | Yes | - |
235
+ | **Reconciliation** | **Built-in** | - | - | - |
236
+ | **Distribution drift** | **Built-in** | - | - | - |
237
+ | **Conditional checks** | **Built-in** | - | - | - |
238
+ | **Query-based checks** | **Built-in** | - | Yes | - |
239
+ | **YAML rules** | **Yes** | Yes | Yes | - |
240
+ | **dbt integration** | **Yes** | Yes | Yes | - |
241
+ | **Slack/Teams/Email alerts** | **Yes** | Yes | Yes | - |
242
+ | **HTML/PDF reports** | **Yes** | Yes | Yes | - |
243
+
244
+ <sub>*Performance varies by hardware and data characteristics. Based on typical usage patterns with DuckDB's columnar engine.</sub>
245
+
246
+ ---
247
+
248
+ ## Installation
249
+
250
+ ```bash
251
+ pip install duckguard
252
+
253
+ # With optional features
254
+ pip install duckguard[reports] # HTML/PDF reports
255
+ pip install duckguard[snowflake] # Snowflake connector
256
+ pip install duckguard[databricks] # Databricks connector
257
+ pip install duckguard[airflow] # Airflow integration
258
+ pip install duckguard[all] # Everything
259
+ ```
260
+
261
+ ---
262
+
263
+ ## Feature Overview
264
+
265
+ <table>
266
+ <tr>
267
+ <td align="center" width="25%">
268
+ <h3>&#127919;</h3>
269
+ <b>Quality Scoring</b><br>
270
+ <sub>A-F grades with 4 quality dimensions</sub>
271
+ </td>
272
+ <td align="center" width="25%">
273
+ <h3>&#128274;</h3>
274
+ <b>PII Detection</b><br>
275
+ <sub>Auto-detect emails, SSNs, phones</sub>
276
+ </td>
277
+ <td align="center" width="25%">
278
+ <h3>&#128200;</h3>
279
+ <b>Anomaly Detection</b><br>
280
+ <sub>Z-score, IQR, KS-test, ML baselines</sub>
281
+ </td>
282
+ <td align="center" width="25%">
283
+ <h3>&#128276;</h3>
284
+ <b>Alerts</b><br>
285
+ <sub>Slack, Teams, Email</sub>
286
+ </td>
287
+ </tr>
288
+ <tr>
289
+ <td align="center">
290
+ <h3>&#9200;</h3>
291
+ <b>Freshness Monitoring</b><br>
292
+ <sub>Detect stale data automatically</sub>
293
+ </td>
294
+ <td align="center">
295
+ <h3>&#128208;</h3>
296
+ <b>Schema Evolution</b><br>
297
+ <sub>Track and detect breaking changes</sub>
298
+ </td>
299
+ <td align="center">
300
+ <h3>&#128220;</h3>
301
+ <b>Data Contracts</b><br>
302
+ <sub>Schema + SLA enforcement</sub>
303
+ </td>
304
+ <td align="center">
305
+ <h3>&#128270;</h3>
306
+ <b>Row-Level Errors</b><br>
307
+ <sub>See exactly which rows failed</sub>
308
+ </td>
309
+ </tr>
310
+ <tr>
311
+ <td align="center">
312
+ <h3>&#128196;</h3>
313
+ <b>HTML/PDF Reports</b><br>
314
+ <sub>Beautiful shareable reports</sub>
315
+ </td>
316
+ <td align="center">
317
+ <h3>&#128200;</h3>
318
+ <b>Historical Tracking</b><br>
319
+ <sub>Quality trends over time</sub>
320
+ </td>
321
+ <td align="center">
322
+ <h3>&#128279;</h3>
323
+ <b>Cross-Dataset Checks</b><br>
324
+ <sub>FK, reconciliation, drift</sub>
325
+ </td>
326
+ <td align="center">
327
+ <h3>&#128640;</h3>
328
+ <b>CI/CD Ready</b><br>
329
+ <sub>dbt, Airflow, GitHub Actions</sub>
330
+ </td>
331
+ </tr>
332
+ <tr>
333
+ <td align="center">
334
+ <h3>&#128203;</h3>
335
+ <b>YAML Rules</b><br>
336
+ <sub>Declarative validation rules</sub>
337
+ </td>
338
+ <td align="center">
339
+ <h3>&#128269;</h3>
340
+ <b>Auto-Profiling</b><br>
341
+ <sub>Semantic types & rule suggestions</sub>
342
+ </td>
343
+ <td align="center">
344
+ <h3>&#9889;</h3>
345
+ <b>Conditional Checks</b><br>
346
+ <sub>Validate when conditions are met</sub>
347
+ </td>
348
+ <td align="center">
349
+ <h3>&#128202;</h3>
350
+ <b>Group-By Validation</b><br>
351
+ <sub>Segmented per-group checks</sub>
352
+ </td>
353
+ </tr>
354
+ </table>
355
+
356
+ ---
357
+
358
+ ## Connect to Anything
359
+
360
+ ```python
361
+ from duckguard import connect
362
+
363
+ # Files
364
+ orders = connect("orders.csv")
365
+ orders = connect("orders.parquet")
366
+ orders = connect("orders.json")
367
+
368
+ # Cloud Storage
369
+ orders = connect("s3://bucket/orders.parquet")
370
+ orders = connect("gs://bucket/orders.parquet")
371
+ orders = connect("az://container/orders.parquet")
372
+
373
+ # Databases
374
+ orders = connect("postgres://localhost/db", table="orders")
375
+ orders = connect("mysql://localhost/db", table="orders")
376
+ orders = connect("snowflake://account/db", table="orders")
377
+ orders = connect("bigquery://project/dataset", table="orders")
378
+ orders = connect("databricks://workspace/catalog/schema", table="orders")
379
+ orders = connect("redshift://cluster/db", table="orders")
380
+
381
+ # Modern Formats
382
+ orders = connect("delta://path/to/delta_table")
383
+ orders = connect("iceberg://path/to/iceberg_table")
384
+
385
+ # pandas DataFrame
386
+ import pandas as pd
387
+ orders = connect(pd.read_csv("orders.csv"))
388
+ ```
389
+
390
+ **Supported:** CSV, Parquet, JSON, Excel | S3, GCS, Azure Blob | PostgreSQL, MySQL, SQLite, Snowflake, BigQuery, Redshift, Databricks, SQL Server, Oracle, MongoDB | Delta Lake, Apache Iceberg | pandas DataFrames
391
+
392
+ ---
393
+
394
+ ## Cookbook
395
+
396
+ ### Column Validation
397
+
398
+ ```python
399
+ orders = connect("orders.csv")
400
+
401
+ # Null & uniqueness
402
+ orders.order_id.is_not_null() # No nulls allowed
403
+ orders.order_id.is_unique() # All values distinct
404
+ orders.order_id.has_no_duplicates() # Alias for is_unique
405
+
406
+ # Range & comparison
407
+ orders.amount.between(0, 10000) # Inclusive range
408
+ orders.amount.greater_than(0) # Minimum (exclusive)
409
+ orders.amount.less_than(100000) # Maximum (exclusive)
410
+
411
+ # Pattern & enum
412
+ orders.email.matches(r'^[\w.+-]+@[\w-]+\.[\w.]+$')
413
+ orders.status.isin(["pending", "shipped", "delivered"])
414
+
415
+ # String length
416
+ orders.order_id.value_lengths_between(5, 10)
417
+ ```
418
+
419
+ Every validation returns a `ValidationResult` with `.passed`, `.message`, `.summary()`, and `.failed_rows`.
420
+
421
+ ### Row-Level Error Debugging
422
+
423
+ ```python
424
+ result = orders.quantity.between(1, 100)
425
+
426
+ if not result.passed:
427
+ print(result.summary())
428
+ # Column 'quantity' has 3 values outside [1, 100]
429
+ #
430
+ # Sample of 3 failing rows (total: 3):
431
+ # Row 5: quantity=500 - Value outside range [1, 100]
432
+ # Row 23: quantity=-2 - Value outside range [1, 100]
433
+ # Row 29: quantity=0 - Value outside range [1, 100]
434
+
435
+ for row in result.failed_rows:
436
+ print(f"Row {row.row_number}: {row.value} ({row.reason})")
437
+
438
+ print(result.get_failed_values()) # [500, -2, 0]
439
+ print(result.get_failed_row_indices()) # [5, 23, 29]
440
+ ```
441
+
442
+ ### Quality Scoring
443
+
444
+ ```python
445
+ score = orders.score()
446
+
447
+ print(score.grade) # A, B, C, D, or F
448
+ print(score.overall) # 0-100 composite score
449
+ print(score.completeness) # % non-null across all columns
450
+ print(score.uniqueness) # % unique across key columns
451
+ print(score.validity) # % values passing type/range checks
452
+ print(score.consistency) # % consistent formatting
453
+ ```
454
+
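The grade is derived from the composite score. The exact cutoffs are internal to DuckGuard; a plausible mapping (the 90/80/70/60 thresholds below are assumptions for illustration, not DuckGuard's documented values) looks like:

```python
def grade_from_score(overall: float) -> str:
    """Map a 0-100 composite score to a letter grade.

    The 90/80/70/60 cutoffs are illustrative assumptions,
    not DuckGuard's documented thresholds.
    """
    for cutoff, letter in [(90, "A"), (80, "B"), (70, "C"), (60, "D")]:
        if overall >= cutoff:
            return letter
    return "F"
```

A common pattern is to gate a CI job on the result, e.g. fail the build whenever the grade is worse than B.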
455
+ ### Cross-Dataset Validation
456
+
457
+ ```python
458
+ orders = connect("orders.csv")
459
+ customers = connect("customers.csv")
460
+
461
+ # Foreign key check
462
+ result = orders.customer_id.exists_in(customers.customer_id)
463
+
464
+ # FK with null handling
465
+ result = orders.customer_id.references(customers.customer_id, allow_nulls=True)
466
+
467
+ # Get orphan values for debugging
468
+ orphans = orders.customer_id.find_orphans(customers.customer_id)
469
+ print(f"Invalid IDs: {orphans}")
470
+
471
+ # Compare value sets
472
+ result = orders.status.matches_values(lookup.code)
473
+
474
+ # Compare row counts with tolerance
475
+ result = orders.row_count_matches(backup, tolerance=10)
476
+ ```
477
+
478
+ ### Reconciliation
479
+
480
+ ```python
481
+ source = connect("orders_source.parquet")
482
+ target = connect("orders_migrated.parquet")
483
+
484
+ recon = source.reconcile(
485
+ target,
486
+ key_columns=["order_id"],
487
+ compare_columns=["amount", "status", "customer_id"],
488
+ )
489
+
490
+ print(recon.match_percentage) # 95.5
491
+ print(recon.missing_in_target) # 3
492
+ print(recon.extra_in_target) # 1
493
+ print(recon.value_mismatches) # {'amount': 5, 'status': 2}
494
+ print(recon.summary())
495
+ ```
496
+
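Conceptually, key-based reconciliation is a full outer join on the key columns followed by per-column comparison. A minimal pure-Python sketch of the same bookkeeping (illustrative only; DuckGuard runs the actual comparison as DuckDB SQL):

```python
def reconcile_rows(source, target, key, compare):
    # source/target: lists of dicts; key: join column; compare: columns to diff
    src = {row[key]: row for row in source}
    tgt = {row[key]: row for row in target}
    missing_in_target = sorted(src.keys() - tgt.keys())
    extra_in_target = sorted(tgt.keys() - src.keys())
    # Count value mismatches per compared column on the shared keys
    mismatches = {col: 0 for col in compare}
    for k in src.keys() & tgt.keys():
        for col in compare:
            if src[k][col] != tgt[k][col]:
                mismatches[col] += 1
    return missing_in_target, extra_in_target, mismatches
```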
497
+ ### Distribution Drift Detection
498
+
499
+ ```python
500
+ baseline = connect("orders_jan.parquet")
501
+ current = connect("orders_feb.parquet")
502
+
503
+ drift = current.amount.detect_drift(baseline.amount)
504
+
505
+ print(drift.is_drifted) # True/False
506
+ print(drift.p_value) # 0.0023
507
+ print(drift.statistic) # KS statistic
508
+ print(drift.message) # Human-readable summary
509
+ ```
510
+
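The reported statistic is the two-sample Kolmogorov-Smirnov statistic: the largest vertical distance between the two empirical CDFs. A self-contained sketch of that statistic (the real implementation also produces a p-value, typically via `scipy` or an asymptotic approximation):

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    # Largest vertical gap between the two empirical CDFs,
    # evaluated at every observed point
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap
```

Identical samples give a statistic of 0.0; samples with disjoint ranges give 1.0.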
511
+ ### Group-By Validation
512
+
513
+ ```python
514
+ grouped = orders.group_by("region")
515
+
516
+ print(grouped.groups) # [{'region': 'North'}, ...]
517
+ print(grouped.group_count) # 4
518
+
519
+ for stat in grouped.stats():
520
+ print(stat) # {'region': 'North', 'row_count': 150}
521
+
522
+ # Ensure every group has at least 10 rows
523
+ result = grouped.row_count_greater_than(10)
524
+ for g in result.get_failed_groups():
525
+ print(f"{g.key_string}: only {g.row_count} rows")
526
+ ```
527
+
528
+ ---
529
+
530
+ ## 🆕 What's New in 3.2
531
+
532
+ DuckGuard 3.2 adds **AI-powered data quality** — the first data quality library with native LLM integration.
533
+
534
+ ### AI Features (NEW)
535
+
536
+ ```bash
537
+ # Explain quality issues in plain English
538
+ duckguard explain orders.csv
539
+
540
+ # AI suggests validation rules based on your data
541
+ duckguard suggest orders.csv
542
+
543
+ # Get AI-powered fix suggestions for quality issues
544
+ duckguard fix orders.csv
545
+ ```
546
+
547
+ ```python
548
+ from duckguard.ai import explainer, rules_generator, fixer
549
+
550
+ # Natural language quality explanation
551
+ summary = explainer.explain(dataset)
552
+
553
+ # AI-generated validation rules
554
+ rules = rules_generator.suggest_rules(dataset)
555
+
556
+ # Suggest fixes for data quality issues
557
+ fixes = fixer.suggest_fixes(dataset, results)
558
+ ```
559
+
560
+ Supports **OpenAI**, **Anthropic**, and **Ollama** (local models). Configure via environment variables or `AIConfig`.
561
+
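For the hosted providers, the underlying OpenAI and Anthropic SDKs conventionally read `OPENAI_API_KEY` and `ANTHROPIC_API_KEY`; assuming DuckGuard's AI layer picks up the same variables (check the docs to confirm), a minimal setup is:

```shell
# Standard SDK environment variables (assumed to be what
# DuckGuard's AI layer reads -- confirm in the documentation)
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

# Ollama runs locally and needs no API key
```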
562
+ ### Also in 3.2
563
+
564
+ - 🔍 **Improved semantic type detection** — smarter column classification, fewer false positives
565
+ - 📄 **Apache 2.0 license** — OSI-approved, enterprise-friendly
566
+ - 🛡️ **SQL injection prevention** — multi-layer escaping in all string-based checks
567
+ - 📖 **Full documentation site** — [xdatahubai.github.io/duckguard](https://xdatahubai.github.io/duckguard/)
568
+ - 🔒 **PEP 561 typed** — `py.typed` marker for mypy/pyright
569
+
570
+ ---
571
+
572
+ ## What's New in 3.0
573
+
574
+ DuckGuard 3.0 introduces **conditional checks**, **multi-column validation**, **query-based expectations**, **distributional tests**, and **7 anomaly detection methods**.
575
+
576
+ ### Conditional Checks
577
+
578
+ Apply validation rules only when a SQL condition is met:
579
+
580
+ ```python
581
+ # Email required only for shipped orders
582
+ orders.email.not_null_when("status = 'shipped'")
583
+
584
+ # Quantity must be 1-100 for US orders
585
+ orders.quantity.between_when(1, 100, "country = 'US'")
586
+
587
+ # Status must be shipped or delivered for UK
588
+ orders.status.isin_when(["shipped", "delivered"], "country = 'UK'")
589
+
590
+ # Also: unique_when(), matches_when()
591
+ ```
592
+
593
+ ### Multi-Column Checks
594
+
595
+ Validate relationships across columns:
596
+
597
+ ```python
598
+ # Ship date must come after created date
599
+ orders.expect_column_pair_satisfy(
600
+ column_a="ship_date",
601
+ column_b="created_at",
602
+ expression="ship_date >= created_at",
603
+ )
604
+
605
+ # Composite key uniqueness
606
+ orders.expect_columns_unique(columns=["order_id", "customer_id"])
607
+
608
+ # Multi-column sum check
609
+ orders.expect_multicolumn_sum_to_equal(
610
+ columns=["subtotal", "tax", "shipping"],
611
+ expected_sum=59.50,
612
+ )
613
+ ```
614
+
615
+ ### Query-Based Checks
616
+
617
+ Run custom SQL for unlimited flexibility:
618
+
619
+ ```python
620
+ # No rows should have negative quantities
621
+ orders.expect_query_to_return_no_rows(
622
+ "SELECT * FROM table WHERE quantity < 0"
623
+ )
624
+
625
+ # Verify data exists
626
+ orders.expect_query_to_return_rows(
627
+ "SELECT * FROM table WHERE status = 'shipped'"
628
+ )
629
+
630
+ # Exact value check on aggregate
631
+ orders.expect_query_result_to_equal(
632
+ "SELECT COUNT(*) FROM table", expected=1000
633
+ )
634
+
635
+ # Range check on aggregate
636
+ orders.expect_query_result_to_be_between(
637
+ "SELECT AVG(amount) FROM table", min_value=50, max_value=500
638
+ )
639
+ ```
640
+
641
+ ### Distributional Checks
642
+
643
+ Statistical tests for distribution shape (requires `scipy`):
644
+
645
+ ```python
646
+ # Test for normal distribution
647
+ orders.amount.expect_distribution_normal(significance_level=0.05)
648
+
649
+ # Kolmogorov-Smirnov test
650
+ orders.quantity.expect_ks_test(distribution="norm")
651
+
652
+ # Chi-square goodness of fit
653
+ orders.status.expect_chi_square_test()
654
+ ```
655
+
656
+ ### Anomaly Detection (7 Methods)
657
+
658
+ ```python
659
+ from duckguard import detect_anomalies, AnomalyDetector
660
+ from duckguard.anomaly import BaselineMethod, KSTestMethod, SeasonalMethod
661
+
662
+ # High-level API: detect anomalies across columns
663
+ report = detect_anomalies(orders, method="zscore", columns=["quantity", "amount"])
664
+ print(report.has_anomalies, report.anomaly_count)
665
+ for a in report.anomalies:
666
+ print(f"{a.column}: score={a.score:.2f}, anomaly={a.is_anomaly}")
667
+
668
+ # AnomalyDetector with IQR
669
+ detector = AnomalyDetector(method="iqr", threshold=1.5)
670
+ report = detector.detect(orders, columns=["quantity"])
671
+
672
+ # ML Baseline: fit on historical data, score new values
673
+ baseline = BaselineMethod(sensitivity=2.0)
674
+ baseline.fit([100, 102, 98, 105, 97, 103])
675
+ print(baseline.baseline_mean, baseline.baseline_std)
676
+
677
+ score = baseline.score(250) # Single value
678
+ print(score.is_anomaly, score.score)
679
+
680
+ scores = baseline.score(orders.amount) # Entire column
681
+ print(max(scores))
682
+
683
+ # KS-Test: detect distribution drift
684
+ ks = KSTestMethod(p_value_threshold=0.05)
685
+ ks.fit([1, 2, 3, 4, 5])
686
+ comparison = ks.compare_distributions([10, 11, 12, 13, 14])
687
+ print(comparison.is_drift, comparison.p_value, comparison.message)
688
+
689
+ # Seasonal: time-aware anomaly detection
690
+ seasonal = SeasonalMethod(period="daily", sensitivity=2.0)
691
+ seasonal.fit([10, 12, 11, 13, 9, 14])
692
+ ```
693
+
694
+ **Available methods:** `zscore`, `iqr`, `modified_zscore`, `percent_change`, `baseline`, `ks_test`, `seasonal`
695
+
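For intuition, the simplest method, `zscore`, flags values more than a threshold number of standard deviations from the column mean. A pure-Python sketch of that rule (illustrative; DuckGuard evaluates it over the column inside DuckDB):

```python
from statistics import fmean, pstdev

def zscore_outliers(values, threshold=3.0):
    # Flag values whose |z| = |v - mean| / std exceeds the threshold
    mean, std = fmean(values), pstdev(values)
    if std == 0:
        return []  # constant column: nothing can be an outlier
    return [v for v in values if abs(v - mean) / std > threshold]
```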
696
+ ---
697
+
698
+ ## YAML Rules & Data Contracts
699
+
700
+ ### Declarative Rules
701
+
702
+ ```yaml
703
+ # duckguard.yaml
704
+ name: orders_validation
705
+ description: Quality checks for the orders dataset
706
+
707
+ checks:
708
+ order_id:
709
+ - not_null
710
+ - unique
711
+ quantity:
712
+ - between: [1, 1000]
713
+ status:
714
+ - allowed_values: [pending, shipped, delivered, cancelled, returned]
715
+ email:
716
+ - not_null:
717
+ severity: warning
718
+ ```
719
+
720
+ ```python
721
+ from duckguard import load_rules, execute_rules
722
+
723
+ rules = load_rules("duckguard.yaml")
724
+ result = execute_rules(rules, "orders.csv")
725
+
726
+ print(f"Passed: {result.passed_count}/{result.total_checks}")
727
+ for r in result.results:
728
+ tag = "PASS" if r.passed else "FAIL"
729
+ print(f" [{tag}] {r.message}")
730
+ ```
731
+
732
+ ### Auto-Discover Rules
733
+
734
+ ```python
735
+ from duckguard import connect, generate_rules
736
+
737
+ orders = connect("orders.csv")
738
+ yaml_rules = generate_rules(orders, dataset_name="orders")
739
+ print(yaml_rules) # Ready-to-use YAML
740
+ ```
741
+
742
+ ### Data Contracts
743
+
744
+ ```python
745
+ from duckguard import generate_contract, validate_contract, diff_contracts
746
+ from duckguard.contracts import contract_to_yaml
747
+
748
+ # Generate a contract from existing data
749
+ contract = generate_contract(orders, name="orders_v1", owner="data-team")
750
+ print(contract.name, contract.version, len(contract.schema))
751
+
752
+ # Validate data against a contract
753
+ validation = validate_contract(contract, "orders.csv")
754
+ print(validation.passed)
755
+
756
+ # Export to YAML
757
+ print(contract_to_yaml(contract))
758
+
759
+ # Detect breaking changes between versions
760
+ diff = diff_contracts(contract_v1, contract_v2)
761
+ if diff.has_breaking_changes:
762
+ for change in diff.changes:
763
+ print(change)
764
+ ```
765
+
766
+ ---
767
+
768
+ ## Auto-Profiling & Semantic Analysis
769
+
770
+ ```python
771
+ from duckguard import AutoProfiler, SemanticAnalyzer, detect_type, detect_types_for_dataset
772
+
773
+ # Profile entire dataset — quality scores, pattern detection, and rule suggestions included
774
+ profiler = AutoProfiler()
775
+ profile = profiler.profile(orders)
776
+ print(f"Columns: {profile.column_count}, Rows: {profile.row_count}")
777
+ print(f"Quality: {profile.overall_quality_grade} ({profile.overall_quality_score:.1f}/100)")
778
+
779
+ # Per-column quality grades and percentiles
780
+ for col in profile.columns:
781
+ print(f" {col.name}: grade={col.quality_grade}, nulls={col.null_percent:.1f}%")
782
+ if col.median_value is not None:
783
+ print(f" p25={col.p25_value}, median={col.median_value}, p75={col.p75_value}")
784
+
785
+ # Suggested rules (25+ pattern types: email, SSN, UUID, credit card, etc.)
786
+ print(f"Suggested rules: {len(profile.suggested_rules)}")
787
+ for rule in profile.suggested_rules[:5]:
788
+ print(f" {rule}")
789
+
790
+ # Deep profiling — distribution analysis + outlier detection (numeric columns)
791
+ deep_profiler = AutoProfiler(deep=True)
792
+ deep_profile = deep_profiler.profile(orders)
793
+ for col in deep_profile.columns:
794
+ if col.distribution_type:
795
+ print(f" {col.name}: {col.distribution_type}, skew={col.skewness:.2f}")
796
+ if col.outlier_count is not None:
797
+ print(f" outliers: {col.outlier_count} ({col.outlier_percentage:.1f}%)")
798
+
799
+ # Configurable thresholds
800
+ strict = AutoProfiler(null_threshold=0.0, unique_threshold=100.0, pattern_min_confidence=95.0)
801
+ strict_profile = strict.profile(orders)
802
+ ```
803
+
804
+ ```python
805
+ # Detect semantic type for a single column
806
+ print(detect_type(orders, "email")) # SemanticType.EMAIL
807
+ print(detect_type(orders, "country")) # SemanticType.COUNTRY_CODE
808
+
809
+ # Detect types for all columns at once
810
+ type_map = detect_types_for_dataset(orders)
811
+ for col, stype in type_map.items():
812
+ print(f" {col}: {stype}")
813
+
814
+ # Full PII analysis
815
+ analysis = SemanticAnalyzer().analyze(orders)
816
+ print(f"PII columns: {analysis.pii_columns}") # ['email', 'phone']
817
+ for col in analysis.columns:
818
+ if col.is_pii:
819
+ print(f" {col.name}: {col.semantic_type.value} (confidence: {col.confidence:.0%})")
820
+ ```
821
+
822
+ **Supported semantic types:** `email`, `phone`, `url`, `ip_address`, `ssn`, `credit_card`, `person_name`, `address`, `country`, `state`, `city`, `zipcode`, `latitude`, `longitude`, `date`, `datetime`, `currency`, `percentage`, `boolean`, `uuid`, `identifier`, and more.
823
+
824
+ ---
825
+
826
+ ## Freshness, Schema & History
827
+
828
+ ### Freshness Monitoring
829
+
830
+ ```python
831
+ from datetime import timedelta
832
+ from duckguard.freshness import FreshnessMonitor
833
+
834
+ # Quick check
835
+ print(orders.freshness.last_modified) # 2024-01-30 14:22:01
836
+ print(orders.freshness.age_human) # "2 hours ago"
837
+ print(orders.freshness.is_fresh) # True
838
+
839
+ # Custom threshold
840
+ print(orders.is_fresh(timedelta(hours=6)))
841
+
842
+ # Structured monitoring
843
+ monitor = FreshnessMonitor(threshold=timedelta(hours=1))
844
+ result = monitor.check(orders)
845
+ print(result.is_fresh, result.age_human)
846
+ ```
847
+
848
+ ### Schema Evolution
849
+
850
+ ```python
851
+ from duckguard.schema_history import SchemaTracker, SchemaChangeAnalyzer
852
+
853
+ # Capture a snapshot
854
+ tracker = SchemaTracker()
855
+ snapshot = tracker.capture(orders)
856
+ for col in snapshot.columns[:5]:
857
+ print(f" {col.name}: {col.dtype}")
858
+
859
+ # View history
860
+ history = tracker.get_history(orders.source)
861
+ print(f"Snapshots: {len(history)}")
862
+
863
+ # Detect breaking changes
864
+ analyzer = SchemaChangeAnalyzer()
865
+ report = analyzer.detect_changes(orders)
866
+ print(report.has_breaking_changes, len(report.changes))
867
+ ```
868
+
869
+ ### Historical Tracking & Trends
870
+
871
+ ```python
872
+ from duckguard.history import HistoryStorage, TrendAnalyzer
873
+
874
+ # Store validation results
875
+ storage = HistoryStorage()
876
+ storage.store(exec_result)
877
+
878
+ # Query past runs
879
+ runs = storage.get_runs("orders.csv", limit=10)
880
+ for run in runs:
881
+     print(f" {run.run_id}: passed={run.passed}, checks={run.total_checks}")
882
+
883
+ # Analyze quality trends
884
+ trends = TrendAnalyzer(storage).analyze("orders.csv", days=30)
885
+ print(trends.summary())
886
+ ```
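Underneath, a quality trend is just a series of pass rates over stored runs. A toy sketch of the computation (the run dicts below are an assumed shape, not the `HistoryStorage` record format):

```python
from statistics import mean

def pass_rate_trend(runs):
    """runs: list of dicts with 'passed_checks' and 'total_checks', oldest first.

    Returns the per-run pass rates plus the difference between the mean of
    the recent half and the mean of the older half (positive = improving).
    """
    rates = [r["passed_checks"] / r["total_checks"] for r in runs]
    mid = len(rates) // 2
    delta = mean(rates[mid:]) - mean(rates[:mid]) if mid else 0.0
    return rates, delta

runs = [
    {"passed_checks": 18, "total_checks": 20},
    {"passed_checks": 19, "total_checks": 20},
    {"passed_checks": 20, "total_checks": 20},
    {"passed_checks": 20, "total_checks": 20},
]
rates, delta = pass_rate_trend(runs)
print(f"quality {'improving' if delta > 0 else 'flat or degrading'}")  # quality improving
```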
887
+
888
+ ---
889
+
890
+ ## Reports & Notifications
891
+
892
+ DuckGuard generates self-contained HTML reports with dark mode, trend charts, collapsible sections, sortable tables, and search — all in a single file with zero JavaScript dependencies.
893
+
894
+ > **Live demos:**
895
+ > [Light / Auto Theme](https://htmlpreview.github.io/?https://github.com/XDataHubAI/duckguard/blob/main/examples/reports/demo_report.html)
896
+ > &bull;
897
+ > [Dark Theme](https://htmlpreview.github.io/?https://github.com/XDataHubAI/duckguard/blob/main/examples/reports/demo_report_dark.html)
898
+
899
+ ```python
900
+ from duckguard.reports import HTMLReporter, ReportConfig, generate_html_report
901
+
902
+ # Quick one-liner
903
+ generate_html_report(exec_result, "report.html")
904
+
905
+ # Full-featured report with trends and metadata
906
+ config = ReportConfig(
907
+ title="Orders Quality Report",
908
+ dark_mode="auto", # "auto", "light", or "dark"
909
+ include_trends=True,
910
+ include_metadata=True,
911
+ )
912
+ reporter = HTMLReporter(config=config)
913
+ reporter.generate(
914
+ exec_result,
915
+ "report.html",
916
+ trend_data=trend_data, # from HistoryStorage.get_trend()
917
+ row_count=dataset.row_count,
918
+ column_count=dataset.column_count,
919
+ )
920
+
921
+ # PDF export (requires weasyprint)
922
+ from duckguard.reports import generate_pdf_report
923
+ generate_pdf_report(exec_result, "report.pdf")
924
+ ```
925
+
926
+ ### Notifications
927
+
928
+ ```python
929
+ from duckguard.notifications import (
930
+ SlackNotifier, TeamsNotifier, EmailNotifier,
931
+ format_results_text, format_results_markdown,
932
+ )
933
+
934
+ slack = SlackNotifier(webhook_url="https://hooks.slack.com/services/XXX")
935
+ teams = TeamsNotifier(webhook_url="https://outlook.office.com/webhook/XXX")
936
+ email = EmailNotifier(
937
+ smtp_host="smtp.example.com", smtp_port=587,
938
+ smtp_user="user", smtp_password="pass",
939
+ to_addresses=["team@example.com"],
940
+ )
941
+
942
+ # Format for custom integrations
943
+ print(format_results_text(exec_result))
944
+ print(format_results_markdown(exec_result))
945
+ ```
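Slack-style incoming webhooks simply accept a JSON POST, so a custom notifier is mostly payload formatting. A minimal sketch of building such a payload from a validation outcome (the payload shape and function are illustrative assumptions, not DuckGuard's formatter):

```python
def slack_payload(passed, failed_count, source):
    """Format a validation outcome as a Slack incoming-webhook payload dict."""
    emoji = ":white_check_mark:" if passed else ":x:"
    text = f"{emoji} Data quality for `{source}`: " + (
        "all checks passed" if passed else f"{failed_count} check(s) failed"
    )
    return {"text": text}

print(slack_payload(False, 3, "orders.csv"))
```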
946
+
947
+ ---
948
+
949
+ ## Integrations
950
+
951
+ ### dbt
952
+
953
+ ```python
954
+ from duckguard import load_rules
+ from duckguard.integrations.dbt import rules_to_dbt_tests
954
+
955
+ rules = load_rules("duckguard.yaml")
+ dbt_tests = rules_to_dbt_tests(rules)
957
+ ```
958
+
959
+ ### Airflow
960
+
961
+ ```python
962
+ from airflow import DAG
963
+ from airflow.operators.python import PythonOperator
964
+
965
+ def validate_orders():
966
+     from duckguard import load_rules, execute_rules
967
+     rules = load_rules("duckguard.yaml")
968
+     result = execute_rules(rules, "s3://bucket/orders.parquet")
969
+     if not result.passed:
970
+         raise RuntimeError(f"Quality check failed: {result.failed_count} failures")
971
+
972
+ dag = DAG("data_quality", schedule_interval="@daily", ...)
973
+ PythonOperator(task_id="validate", python_callable=validate_orders, dag=dag)
974
+ ```
975
+
976
+ ### GitHub Actions
977
+
978
+ ```yaml
979
+ name: Data Quality
980
+ on: [push]
981
+ jobs:
982
+ quality-check:
983
+ runs-on: ubuntu-latest
984
+ steps:
985
+ - uses: actions/checkout@v4
986
+ - uses: actions/setup-python@v5
987
+ with: { python-version: "3.11" }
988
+ - run: pip install duckguard
989
+ - run: duckguard check data/orders.csv --rules duckguard.yaml
990
+ ```
991
+
992
+ ### pytest
993
+
994
+ ```python
995
+ # tests/test_data_quality.py
996
+ from duckguard import connect
997
+
998
+ def test_orders_quality():
999
+     orders = connect("data/orders.csv")
1000
+     assert orders.row_count > 0
1001
+     assert orders.order_id.is_not_null()
1002
+     assert orders.order_id.is_unique()
1003
+     assert orders.quantity.between(0, 10000)
1004
+     assert orders.status.isin(["pending", "shipped", "delivered", "cancelled"])
1005
+ ```
1006
+
1007
+ ---
1008
+
1009
+ ## CLI
1010
+
1011
+ ```bash
1012
+ # Validate data against rules
1013
+ duckguard check orders.csv --config duckguard.yaml
1014
+
1015
+ # Auto-discover rules from data
1016
+ duckguard discover orders.csv > duckguard.yaml
1017
+
1018
+ # Generate reports (with dark mode and trend charts)
1019
+ duckguard report orders.csv --output report.html --dark-mode auto --trends
1020
+
1021
+ # Anomaly detection
1022
+ duckguard anomaly orders.csv --method zscore
1023
+
1024
+ # Freshness check
1025
+ duckguard freshness orders.csv --max-age 6h
1026
+
1027
+ # Schema tracking
1028
+ duckguard schema orders.csv --action capture
1029
+ duckguard schema orders.csv --action changes
1030
+
1031
+ # Data contracts
1032
+ duckguard contract generate orders.csv
1033
+ duckguard contract validate orders.csv
1034
+
1035
+ # Dataset info
1036
+ duckguard info orders.csv
1037
+
1038
+ # Profile dataset with quality scoring
1039
+ duckguard profile orders.csv
1040
+ duckguard profile orders.csv --deep --format json
1041
+ ```
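The `--method zscore` option corresponds to the classic z-score rule: flag values that lie more than k standard deviations from the mean. A dependency-free sketch of that rule (not the CLI's internals):

```python
from statistics import mean, stdev

def zscore_outliers(values, k=3.0):
    """Return (index, value) pairs whose |z-score| exceeds k."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [(i, v) for i, v in enumerate(values) if abs((v - mu) / sigma) > k]

amounts = [10.0] * 50 + [10.5] * 49 + [5000.0]  # one wild order amount
print(zscore_outliers(amounts))  # [(99, 5000.0)]
```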
1042
+
1043
+ ---
1044
+
1045
+ ## Performance
1046
+
1047
+ Built on DuckDB for fast, memory-efficient validation:
1048
+
1049
+ | Dataset | Great Expectations | DuckGuard | Speedup |
1050
+ |---------|:------------------:|:---------:|:-------:|
1051
+ | 1GB CSV | 45 sec, 4GB RAM | **4 sec, 200MB RAM** | **10x faster** |
1052
+ | 10GB Parquet | 8 min, 32GB RAM | **45 sec, 2GB RAM** | **10x faster** |
1053
+ | 100M rows | Minutes | **Seconds** | **10x faster** |
1054
+
1055
+ ### Why So Fast?
1056
+
1057
+ - **DuckDB engine**: Columnar, vectorized, SIMD-optimized
1058
+ - **Zero copy**: Direct file access, no DataFrame conversion
1059
+ - **Lazy evaluation**: Only compute what's needed
1060
+ - **Memory efficient**: Stream large files without loading entirely
1061
+
1062
+ ### Scaling Guide
1063
+
1064
+ | Data Size | Recommendation |
1065
+ |-----------|----------------|
1066
+ | < 10M rows | DuckGuard directly |
1067
+ | 10-100M rows | Use Parquet, configure `memory_limit` |
1068
+ | 100GB+ | Use database connectors (Snowflake, BigQuery, Databricks) |
1069
+
1070
+ ```python
1071
+ from duckguard import DuckGuardEngine, connect
1072
+
1073
+ engine = DuckGuardEngine(memory_limit="8GB")
1074
+ dataset = connect("large_data.parquet", engine=engine)
1075
+ ```
1076
+
1077
+ ---
1078
+
1079
+ ## API Quick Reference
1080
+
1081
+ ### Column Properties
1082
+
1083
+ ```python
1084
+ col.null_count # Number of null values
1085
+ col.null_percent # Percentage of null values
1086
+ col.unique_count # Number of distinct values
1087
+ col.min, col.max # Min/max values (numeric)
1088
+ col.mean, col.median # Mean and median (numeric)
1089
+ col.stddev # Standard deviation (numeric)
1090
+ ```
1091
+
1092
+ ### Column Validation Methods
1093
+
1094
+ | Method | Description |
1095
+ |--------|-------------|
1096
+ | `col.is_not_null()` | No nulls allowed |
1097
+ | `col.is_unique()` | All values distinct |
1098
+ | `col.between(min, max)` | Range check (inclusive) |
1099
+ | `col.greater_than(val)` | Minimum (exclusive) |
1100
+ | `col.less_than(val)` | Maximum (exclusive) |
1101
+ | `col.matches(regex)` | Regex pattern check |
1102
+ | `col.isin(values)` | Allowed values |
1103
+ | `col.has_no_duplicates()` | No duplicate values |
1104
+ | `col.value_lengths_between(min, max)` | String length range |
1105
+ | `col.exists_in(ref_col)` | FK: values exist in reference |
1106
+ | `col.references(ref_col, allow_nulls)` | FK with null handling |
1107
+ | `col.find_orphans(ref_col)` | List orphan values |
1108
+ | `col.matches_values(other_col)` | Compare value sets |
1109
+ | `col.detect_drift(ref_col)` | KS-test drift detection |
1110
+ | `col.not_null_when(condition)` | Conditional not-null |
1111
+ | `col.unique_when(condition)` | Conditional uniqueness |
1112
+ | `col.between_when(min, max, condition)` | Conditional range |
1113
+ | `col.isin_when(values, condition)` | Conditional enum |
1114
+ | `col.matches_when(pattern, condition)` | Conditional pattern |
1115
+ | `col.expect_distribution_normal()` | Normality test |
1116
+ | `col.expect_ks_test(distribution)` | KS distribution test |
1117
+ | `col.expect_chi_square_test()` | Chi-square test |
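For intuition on the drift check above: a two-sample KS test is built around the largest gap between the two samples' empirical CDFs. A dependency-free sketch of that core statistic (illustrative only; a full test also computes a p-value):

```python
import bisect

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic:
    the maximum gap between the empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)

    def ecdf(xs, t):
        # fraction of xs that are <= t (xs is sorted)
        return bisect.bisect_right(xs, t) / len(xs)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, t) - ecdf(b, t)) for t in points)

print(ks_statistic([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))      # 0.0  (identical)
print(ks_statistic([1, 2, 3, 4, 5], [11, 12, 13, 14, 15]))  # 1.0  (fully drifted)
```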
1118
+
1119
+ ### Dataset Methods
1120
+
1121
+ | Method | Description |
1122
+ |--------|-------------|
1123
+ | `ds.score()` | Quality score (completeness, uniqueness, validity, consistency) |
1124
+ | `ds.reconcile(target, key_columns, compare_columns)` | Full reconciliation |
1125
+ | `ds.row_count_matches(other, tolerance)` | Row count comparison |
1126
+ | `ds.group_by(columns)` | Group-level validation |
1127
+ | `ds.expect_column_pair_satisfy(a, b, expr)` | Column pair check |
1128
+ | `ds.expect_columns_unique(columns)` | Composite key uniqueness |
1129
+ | `ds.expect_multicolumn_sum_to_equal(columns, sum)` | Multi-column sum |
1130
+ | `ds.expect_query_to_return_no_rows(sql)` | Custom SQL: no violations |
1131
+ | `ds.expect_query_to_return_rows(sql)` | Custom SQL: data exists |
1132
+ | `ds.expect_query_result_to_equal(sql, val)` | Custom SQL: exact value |
1133
+ | `ds.expect_query_result_to_be_between(sql, min, max)` | Custom SQL: range |
1134
+ | `ds.is_fresh(max_age)` | Data freshness check |
1135
+ | `ds.head(n)` | Preview first n rows |
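Of the dimensions `ds.score()` aggregates, completeness and uniqueness can be understood as simple ratios. A toy sketch of those two (not DuckGuard's scoring code; the real score combines more dimensions):

```python
def completeness(values):
    """Share of non-null values in a column."""
    return sum(v is not None for v in values) / len(values)

def uniqueness(values):
    """Share of distinct values among the non-null ones."""
    non_null = [v for v in values if v is not None]
    return len(set(non_null)) / len(non_null) if non_null else 1.0

col = ["a", "b", "b", None]
print(completeness(col), uniqueness(col))  # 0.75 and 2/3
```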
1136
+
1137
+ ---
1138
+
1139
+ ## Enhanced Error Messages
1140
+
1141
+ DuckGuard provides helpful, actionable error messages with suggestions:
1142
+
1143
+ ```python
1144
+ from duckguard import connect
+
+ orders = connect("data/orders.csv")
+
+ try:
1145
+     orders.nonexistent_column
1146
+ except ColumnNotFoundError as e:
1147
+     print(e)
1148
+     # Column 'nonexistent_column' not found.
1149
+     # Available columns: order_id, customer_id, product_name, ...
1150
+
1151
+ try:
1152
+     connect("ftp://data.example.com/file.xyz")
1153
+ except UnsupportedConnectorError as e:
1154
+     print(e)
1155
+     # No connector found for: ftp://data.example.com/file.xyz
1156
+     # Supported formats: CSV, Parquet, JSON, PostgreSQL, MySQL, ...
1157
+ ```
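"Did you mean" style suggestions like these are commonly built with fuzzy string matching. A sketch using only the standard library (illustrative, not DuckGuard's code; the cutoff is an assumed tuning value):

```python
from difflib import get_close_matches

def column_error_message(missing, available):
    """Build an error message with close-match suggestions for a missing column."""
    msg = f"Column '{missing}' not found."
    suggestions = get_close_matches(missing, available, n=3, cutoff=0.6)
    if suggestions:
        msg += f" Did you mean: {', '.join(suggestions)}?"
    return msg

cols = ["order_id", "customer_id", "product_name", "quantity"]
print(column_error_message("order_di", cols))  # suggests order_id
```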
1158
+
1159
+ ---
1160
+
1161
+ ## Community
1162
+
1163
+ We'd love to hear from you! Whether you have a question, idea, or want to share how you're using DuckGuard:
1164
+
1165
+ - **[GitHub Discussions](https://github.com/XDataHubAI/duckguard/discussions)** — Ask questions, share ideas, show what you've built
1166
+ - **[GitHub Issues](https://github.com/XDataHubAI/duckguard/issues)** — Report bugs or request features
1167
+ - **[Contributing Guide](CONTRIBUTING.md)** — Learn how to contribute code, tests, or docs
1168
+
1169
+ ---
1170
+
1171
+ ## Contributing
1172
+
1173
+ We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
1174
+
1175
+ ```bash
1176
+ git clone https://github.com/XDataHubAI/duckguard.git
1177
+ cd duckguard
1178
+ pip install -e ".[dev]"
1179
+
1180
+ pytest # Run tests
1181
+ black src tests # Format code
1182
+ ruff check src tests # Lint
1183
+ ```
1184
+
1185
+ ---
1186
+
1187
+ ## License
1188
+
1189
+ Apache License 2.0 - see [LICENSE](LICENSE)
1190
+
1191
+ ---
1192
+
1193
+ <div align="center">
1194
+ <p>
1195
+ <strong>Built with &#10084;&#65039; by the DuckGuard Team</strong>
1196
+ </p>
1197
+ <p>
1198
+ <a href="https://github.com/XDataHubAI/duckguard/discussions">Discussions</a>
1199
+ &middot;
1200
+ <a href="https://github.com/XDataHubAI/duckguard/issues">Report Bug</a>
1201
+ &middot;
1202
+ <a href="https://github.com/XDataHubAI/duckguard/issues">Request Feature</a>
1203
+ &middot;
1204
+ <a href="CONTRIBUTING.md">Contribute</a>
1205
+ </p>
1206
+ </div>