databricks-dq-framework 1.0.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,523 @@
1
+ Metadata-Version: 2.4
2
+ Name: databricks-dq-framework
3
+ Version: 1.0.0
4
+ Summary: Metadata-driven Data Quality Assessment Framework for Databricks / Delta Lake
5
+ Author: Nitin Mathew George
6
+ License-Expression: MIT
7
+ Keywords: databricks,delta-lake,data-quality,pyspark,dq-framework
8
+ Classifier: Programming Language :: Python :: 3
9
+ Classifier: Operating System :: OS Independent
10
+ Requires-Python: >=3.9
11
+ Description-Content-Type: text/markdown
12
+ Requires-Dist: pyspark>=3.3
13
+ Requires-Dist: delta-spark>=2.0
14
+ Requires-Dist: python-dateutil>=2.8
15
+ Provides-Extra: dev
16
+ Requires-Dist: pytest>=7; extra == "dev"
17
+ Requires-Dist: pytest-cov; extra == "dev"
18
+
19
+ # Data Quality Assessment Framework — Databricks / Delta Lake
20
+
21
+ **Open-source Python library** reimplementing the SQL Server DQ Assessment
22
+ Framework natively on Databricks with Delta tables, PySpark DataFrames, and
23
+ Python-based validation functions.
24
+
25
+ ---
26
+
27
+ ## What this Library Provides
28
+
29
+ A **metadata-driven, PySpark port of a SQL Server Data Quality Assessment Framework**
30
+ for Databricks/Delta Lake. It validates data in curated Delta tables using configurable
31
+ rules — no code changes needed to add new fields or update logic.
32
+
33
+ | Feature | SQL Server Implementation | Databricks Implementation |
34
+ |---|---|---|
35
+ | Storage | SQL Server tables | Delta Lake tables (Unity Catalog) |
36
+ | Pattern evaluation | T-SQL TVFs (generated dynamically) | Python closures (generated dynamically) |
37
+ | Row-level evaluation | `CROSS APPLY TVF` | PySpark UDF + `DeltaTable.merge()` |
38
+ | Custom rules | SQL expressions in `configCustomQuery` | Plain regex patterns or named validators |
39
+ | Email validation | Complex `LIKE` / `CHARINDEX` SQL | Native Python regex (enhanced) |
40
+ | Output | `SELECT` result sets | Spark DataFrames |
41
+ | Same field configuration | Script_02 SQL `INSERT` | `ConfigManager` Python API or raw SQL |
42
+ | Same pattern library | 118 patterns | 118 patterns (identical) |
43
+ | Same precedence logic | L0A `RANK()` CTE | Python ranking equivalent |
44
+ | Same validation levels | L01→L02→L03→L04→L99 | L01→L02→L03→L04→L99 (identical) |
45
+ | Same DQ columns | `DQEligible BIT`, `DQViolations`, `DQFields` | `DQEligible BOOLEAN`, `DQViolations STRING`, `DQFields STRING` |
46
+ | Same dual logging | `auditDQChecks` + `statDQChecks` | Same Delta tables |
47
+ | Same ExecutionID | `NEWID()` GUID | `uuid.uuid4()` |
48
+ | Same `1=1` scope filtering | `WHERE 1=1 AND (...)` | Python conditional filter |
49
+ | Same sticky DQ logic | `CASE WHEN DQEligible=0 THEN 0 ELSE Result END` | Spark `when()` equivalent |
50
+
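+ As a minimal illustration of the "sticky DQ logic" row above, this is how the SQL `CASE` expression can be mapped onto PySpark's `when()`. It is a sketch only, not the framework's internal code; `df` and `check_result` are illustrative names for the curated DataFrame and the current check's outcome column.
+
+ ```python
+ from pyspark.sql import functions as F
+
+ # Sticky DQ flag: once a row has been marked ineligible (failed), it stays failed,
+ # regardless of the outcome of the current check.
+ df = df.withColumn(
+     "DQEligible",
+     F.when(F.col("DQEligible") == F.lit(False), F.lit(False))
+      .otherwise(F.col("check_result")),
+ )
+ ```
+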
51
+ ---
52
+
53
+ ## Architecture at a Glance
54
+
55
+ ```
56
+ masterField (config) ──► configFieldAllowedPattern ──► mapDQChecks
57
+ │ │
58
+ ConfigManager API resolve_pattern_rules.py (L0A precedence)
59
+ │ │
60
+ └──────────────────────────┘
61
+
62
+ generate_rule_functions.py
63
+
64
+ FunctionRegistry (UDFs)
65
+
66
+ data_assessment_rules.py (DQRunner)
67
+
68
+ Curated Table: DQEligible | DQViolations | DQFields
69
+
70
+ ┌───────────────┴───────────────┐
71
+ auditDQChecks statDQChecks
72
+ └───────────────┬───────────────┘
73
+ Reporting Views
74
+ (v_auditDQChecks, v_statDQChecks)
75
+ ```
76
+
77
+ **Data flow:** Config tables → precedence resolution → Python closure generation →
78
+ UDF-based row assessment → DQ column updates + dual audit logging → reporting.
79
+
80
+ ---
81
+
82
+ ## Quick Start
83
+
84
+ ```python
85
+ from dq_framework import DQFramework
86
+
87
+ # 1. Initialise
88
+ dq = DQFramework(spark, catalog="main", schema="dq")
89
+ dq.setup()
90
+
91
+ # 2. Configure a field (ConfigManager API — no SQL required)
92
+ cfg = dq.config
93
+ cfg.register_field(1, "Source.SRC_ContactPoint.EMAIL_ADDRESS", data_category_type_id=4)
94
+ cfg.set_field_values(1, "Source.SRC_ContactPoint.EMAIL_ADDRESS", min_data_length=6, max_data_length=255)
95
+ cfg.block_category(1, "Source.SRC_ContactPoint.EMAIL_ADDRESS", "DataEmptiness")
96
+ cfg.allow_pattern(2, "Source.SRC_ContactPoint.EMAIL_ADDRESS", "Has At Sign")
97
+ cfg.add_custom_query_regex(1, "Source.SRC_ContactPoint.EMAIL_ADDRESS",
98
+ r"^[^@\s]+@[^@\s]+\.[^@\s]+$", must_match=True)
99
+ cfg.add_mapping(1, "Source.SRC_ContactPoint.EMAIL_ADDRESS",
100
+ target_schema_name="Curated", target_table_name="Entity_Email_Denorm",
101
+ target_field_name="EMAIL_ADDRESS", target_catalog_name="main")
102
+
103
+ # 3. Validate config FK integrity before generating functions
104
+ cfg.verify_config()
105
+
106
+ # 4. Generate field-level checker functions from config
107
+ dq.generate_rule_functions()
108
+
109
+ # 5. Run the DQ assessment
110
+ exec_id = dq.run_assessment(schema_name="Curated")
111
+
112
+ # 6. Query results
113
+ dq.violations(exec_id).display()
114
+ dq.quality_scores(exec_id).display()
115
+ dq.fields_below_threshold(threshold=80).display()
116
+ dq.summary_by_table(exec_id).display()
117
+ ```
118
+
119
+ See [notebooks/00_quickstart.py](notebooks/00_quickstart.py) for a fully commented step-by-step walkthrough.
120
+
121
+ ---
122
+
123
+ ## Prerequisite: DQ Columns on Curated Tables
124
+
125
+ Each curated Delta table being assessed must have these three columns **added
126
+ before running the assessment**:
127
+
128
+ ```sql
129
+ ALTER TABLE main.Curated.YourTable
130
+ ADD COLUMNS (
131
+ DQEligible BOOLEAN COMMENT 'true=all checks passed, false=at least one failed, NULL=not assessed',
132
+ DQViolations STRING COMMENT '[field: ViolationType], [field: ViolationType]',
133
+ DQFields STRING COMMENT '[field1], [field2] (all fields assessed on this row)'
134
+ );
135
+ ```
136
+
137
+ | Column | Value | Meaning |
138
+ |---|---|---|
139
+ | `DQEligible` | `None` / `NULL` | Row not yet assessed |
140
+ | `DQEligible` | `True` | All configured checks passed |
141
+ | `DQEligible` | `False` | At least one check failed |
142
+ | `DQViolations` | e.g. `[email_address: Data Type]` | Violated rules (sticky — accumulates across fields) |
143
+ | `DQFields` | e.g. `[email_address], [phone]` | All fields assessed on this row |
144
+
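+ Once an assessment has run, failed rows can be pulled straight from the curated table. A small sketch, reusing the table name from the `ALTER TABLE` example above:
+
+ ```python
+ # Rows that failed at least one configured check, with the accumulated violation list
+ spark.table("main.Curated.YourTable") \
+      .filter("DQEligible = false") \
+      .select("DQViolations", "DQFields") \
+      .display()
+ ```
+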
145
+ ---
146
+
147
+ ## Package Structure
148
+
149
+ ```
150
+ dq_framework/
151
+ ├── __init__.py # Exports: DQFramework, ConfigManager
152
+ ├── framework.py # DQFramework facade — top-level orchestration
153
+ ├── ddl_framework_tables.py # ≡ Script_00_DDL_Framework_Tables — Delta table DDL (9 tables)
154
+ ├── seed_master_data.py # ≡ Script_01_Master_Reference_Data — 27 categories + 118 patterns
155
+ ├── config.py # ConfigManager — Python API for all 5 config tables
156
+ ├── engine/
157
+ │ ├── pattern_checks.py # All 118 pattern check implementations
158
+ │ ├── resolve_pattern_rules.py # Pattern precedence (L0A CTE equivalent)
159
+ │ ├── generate_rule_functions.py # ≡ p_DQ_GenerateRuleFunctions — FunctionRegistry + closures
160
+ │ └── data_assessment_rules.py # ≡ p_DQ_DataAssessmentRules — DQRunner
161
+ ├── reporting/
162
+ │ └── views.py # v_auditDQChecks + v_statDQChecks equivalents
163
+ └── agents/
164
+ ├── __init__.py # Exports SYSTEM_PROMPT, DQ_TOOLS, suggest_config
165
+ ├── context.py # LLM system prompt + short context
166
+ ├── tools.py # OpenAI-compatible tool definitions + execute_tool_call()
167
+ └── suggest.py # Schema-based config auto-suggester (no LLM required)
168
+ notebooks/
169
+ ├── 00_quickstart.py # Step-by-step walkthrough: configure → assess → results
170
+ ├── 01_agent_assistant.py # AI agent integration: auto-suggest, tool-calling, system prompt
171
+ ├── 02_integration_test.py # Integration test suite (run on cluster before publishing)
172
+ └── 03_seed_config.py # Paste Excel-generated INSERT SQL here
173
+ tests/
174
+ ├── conftest.py # Path setup + test split explanation
175
+ └── unit/
176
+ ├── test_pattern_checks.py # All 118 pattern functions (no cluster needed)
177
+ ├── test_resolve_pattern_rules.py # Pattern precedence logic
178
+ ├── test_generate_rule_functions.py # Closure building + FunctionRegistry
179
+ ├── test_seed_master_data.py # Seed data integrity (counts, IDs, FK)
180
+ └── test_suggest.py # Schema-based config suggester
181
+ ```
182
+
183
+ ---
184
+
185
+ ## 9 Framework Tables
186
+
187
+ All tables live in the `dq` schema (configurable). Three categories:
188
+
189
+ | Category | Tables |
190
+ |---|---|
191
+ | **Framework-Seeded** (auto-populated by `setup()`) | `masterDataCategory`, `masterPattern` |
192
+ | **User-Managed** (populated via `ConfigManager` or SQL) | `masterField`, `configFieldValues`, `configFieldAllowedPattern`, `configCustomQuery`, `mapDQChecks` |
193
+ | **Results** (auto-populated by `run_assessment()`) | `auditDQChecks`, `statDQChecks` |
194
+
195
+ | Table | Purpose |
196
+ |---|---|
197
+ | `masterDataCategory` | 27 field type classifications |
198
+ | `masterPattern` | 118 built-in validation patterns. Custom patterns use IDs `>= 1000` (IDs 1–999 are reserved for built-ins) |
199
+ | `masterField` | Source fields registered for assessment |
200
+ | `configFieldValues` | Data length (L01) and value range (L04) boundaries |
201
+ | `configFieldAllowedPattern` | Pattern allow/block rules per field |
202
+ | `configCustomQuery` | Custom validation expressions — plain regex, Spark SQL, or named validators (L02) |
203
+ | `mapDQChecks` | Maps source field → curated table column |
204
+ | `auditDQChecks` | Row-level violation audit log (failures only, append-only) |
205
+ | `statDQChecks` | Aggregated pass/fail statistics per execution (append-only) |
206
+
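+ A quick way to confirm the framework-seeded tables after `setup()`, assuming the default `main` catalog and `dq` schema from the Quick Start:
+
+ ```python
+ # Expected counts after setup(): 27 data categories, 118 built-in patterns
+ print(spark.table("main.dq.masterDataCategory").count())
+ print(spark.table("main.dq.masterPattern").count())
+ ```
+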
207
+ ---
208
+
209
+ ## Validation Levels (in order, early-exit on first failure)
210
+
211
+ Each generated field checker implements the same 5-level hierarchy as the SQL TVF:
212
+
213
+ | Level | Check Type | Priority |
214
+ |---|---|---|
215
+ | **L01** | Data Length range (min/max character count) | Checked first |
216
+ | **L02** | Custom expression rules (regex, Spark SQL, or named validator) | Checked second |
217
+ | **L03** | 118-pattern checks (in PatternPriority order) | Checked third |
218
+ | **L04** | Data Value range (min/max value comparison) | Checked fourth |
219
+ | **L99** | Default PASS (if all prior checks pass) | Final |
220
+
221
+ **Early exit**: the checker returns on the first failing level, so at most one
222
+ violation per field is reported per run; re-run after fixing it to surface the next.
223
+
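+ A minimal sketch of that early-exit flow, loosely modelled on the email field from the Quick Start. The labels, helper shape, and `pattern_checks` parameter are illustrative; the real closures are built by `generate_rule_functions()` from the config tables.
+
+ ```python
+ import re
+
+ def check_email_address(value, pattern_checks=()):
+     """Return the first violation label, or None if all levels pass."""
+     v = value or ""
+     if not (6 <= len(v) <= 255):                          # L01: data length bounds
+         return "Data Length"
+     if not re.search(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", v):   # L02: custom expression rule
+         return "Custom Query"
+     for name, is_violation in pattern_checks:             # L03: resolved pattern rules
+         if is_violation(v):
+             return name
+     # L04 (value range) applies to numeric/date fields, skipped here
+     return None                                           # L99: default PASS
+ ```
+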
224
+ ---
225
+
226
+ ## 118 Patterns across 10 Categories
227
+
228
+ | Category | Count | Examples |
229
+ |---|---|---|
230
+ | `DataType1` | 3 | Is Fully Numeric, Is Fully Decimal, Is Fully Text |
231
+ | `DataType2` | 1 | Is AlphaNumeric |
232
+ | `DataType3` | 4 | Is Date, Is Time, Is Timestamp, Is Boolean |
233
+ | `SpecialCharacter` | 32 | Has Comma, Has At Sign, Has Hyphen, Has Pipe Symbol, … |
234
+ | `SpaceFound` | 1 | Has Space |
235
+ | `DataEmptiness` | 34 | Is Empty or NULL, Is Virtually Empty with Spaces, … |
236
+ | `InvalidKeyword` | 40 | Has Keyword-n/a, Has Keyword-null, Has Keyword-missing, … |
237
+ | `FullyDuplicatedCharacter` | 1 | Has Fully Duplicated Character (e.g. "aaaa") |
238
+ | `UnicodeCharacters` | 1 | Has Unicode Characters (outside printable ASCII 0x20–0x7E) |
239
+ | `CasingCheck` | 2 | Has Lowercase Character, Has Uppercase Character |
240
+
241
+ **Total: 118 built-in patterns.** Pattern IDs 1–999 are reserved for framework use.
242
+ To add custom patterns (e.g. domain-specific invalid keywords), use IDs `>= 1000`.
243
+
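+ For example, assuming `add_invalid_keyword()` (see "Adding a New Source Field" below) registers the keyword as a new pattern row, a domain-specific keyword could be added with an ID above the reserved range; the ID and keyword here are illustrative:
+
+ ```python
+ # Custom invalid-keyword pattern; custom IDs must be >= 1000
+ dq.add_invalid_keyword(1001, "donotuse")
+ ```
+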
244
+ ---
245
+
246
+ ## Pattern Resolution (L0A — three levels of granularity)
247
+
248
+ | Level | Populate in `configFieldAllowedPattern` | Effect |
249
+ |---|---|---|
250
+ | Category-level | Set `PatternCategory` only | Applies to ALL patterns in that category |
251
+ | SubCategory-level | Set `PatternSubCategory` only | Applies to ALL patterns in that sub-category |
252
+ | Pattern-level | Set `PatternName` only | Applies to that specific named pattern |
253
+
254
+ **Precedence**: More specific rule (PatternName) overrides broader rule (Category).
255
+ **Within the same specificity level**, Allowed overrides Not Allowed — meaning a
256
+ specific "Allowed" at pattern level always wins over a broad "Not Allowed" at
257
+ category level.
258
+
259
+ Example: *"Block ALL special characters (category rule), EXCEPT allow Hyphen,
260
+ Full Stop, and At Sign (pattern-level overrides)"*
261
+
262
+ ```python
263
+ cfg.block_category(1, "Source.T.FIELD", "SpecialCharacter") # block all
264
+ cfg.allow_pattern(2, "Source.T.FIELD", "Has Hyphen") # except hyphen
265
+ cfg.allow_pattern(3, "Source.T.FIELD", "Has Full Stop") # except full stop
266
+ cfg.allow_pattern(4, "Source.T.FIELD", "Has At Sign") # except @ sign
267
+ ```
268
+
269
+ ---
270
+
271
+ ## Custom Validation Rules (L02)
272
+
273
+ The `configCustomQuery` table stores a `CustomQueryExpression` for each field.
274
+ Three options are available.
275
+
276
+ ### Option 1: Plain regex (recommended for non-technical users)
277
+
278
+ ```python
279
+ # Must match: value must satisfy the regex to PASS
280
+ cfg.add_custom_query_regex(1, "Source.T.EMAIL_ADDRESS",
281
+ r"^[^@\s]+@[^@\s]+\.[^@\s]+$", must_match=True,
282
+ description="Basic email format")
283
+
284
+ # Must NOT match: if the regex matches, the row FAILS
285
+ cfg.add_custom_query_regex(2, "Source.T.EMAIL_ADDRESS",
286
+ r"noemaildress", must_match=False,
287
+ description="Reject placeholder values")
288
+ ```
289
+
290
+ The engine auto-detects regex patterns by the presence of metacharacters
291
+ (`^$+*?[({|\`) and applies `re.search()` automatically.
292
+
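+ A rough sketch of that heuristic (illustrative only, not the engine's exact code):
+
+ ```python
+ # Treat the expression as a regex if it contains any of the listed metacharacters
+ def looks_like_regex(expression: str) -> bool:
+     return any(ch in expression for ch in "^$+*?[({|\\")
+ ```
+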
293
+ ### Option 2: Spark SQL expression
294
+
295
+ ```python
296
+ # @InputValue is replaced with the actual column reference at assessment time
297
+ cfg.add_custom_query_sql(1, "Source.T.AMOUNT",
298
+ "CAST(@InputValue AS DOUBLE) > 0",
299
+ is_condition_allowed=True,
300
+ description="Amount must be positive")
301
+ ```
302
+
303
+ SQL expressions are applied at the DataFrame level via `F.expr()` — not inside a
304
+ Python UDF — so they support any Spark SQL built-in function.
305
+
306
+ ### Option 3: Named validator (for complex logic)
307
+
308
+ Register a Python function once, then reference it by name in any field config:
309
+
310
+ ```python
311
+ import re
+
+ # Register (once, at cluster startup or in a setup cell)
313
+ dq.register_validator("au_mobile", lambda v: bool(re.match(r"^04[0-9]{8}$", v or "")))
313
+
314
+ # Reference by name in config
315
+ cfg.add_custom_query(1, "Source.T.MOBILE_NUMBER", "au_mobile", is_condition_allowed=True)
316
+ ```
317
+
318
+ Built-in email validators are pre-registered:
319
+
320
+ | Validator Name | Description |
321
+ |---|---|
322
+ | `email_basic_format` | Regex: basic `x@y.z` format |
323
+ | `email_at_validation` | @ position, count, dot after @ |
324
+ | `email_domain_format` | 1–3 dots, no trailing dot, 2+ char labels |
325
+ | `email_local_part_dots` | Max 2 dots in local part |
326
+ | `email_no_placeholder` | Must not contain 'noemaildress' |
327
+ | `email_no_trailing_special` | Must not end with `.`, `-`, or `_` |
328
+
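+ These built-ins can be referenced by name in `configCustomQuery` just like user-registered validators. A hedged example (the ID and field name are illustrative):
+
+ ```python
+ # Require the built-in "@ position / count" check to pass for this field
+ cfg.add_custom_query(3, "Source.SRC_ContactPoint.EMAIL_ADDRESS", "email_at_validation",
+                      is_condition_allowed=True,
+                      description="@ position, count, dot after @")
+ ```
+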
329
+ ---
330
+
331
+ ## Scope Filtering (`1=1` Pattern Preserved)
332
+
333
+ ```python
334
+ # All schemas, all tables, all fields
335
+ dq.run_assessment()
336
+
337
+ # Specific schema
338
+ dq.run_assessment(schema_name="Curated")
339
+
340
+ # Specific table
341
+ dq.run_assessment(schema_name="Curated", table_name="Individual_Denorm")
342
+
343
+ # Single field
344
+ dq.run_assessment(schema_name="Curated",
345
+ table_name="Individual_Denorm",
346
+ field_name="FIRST_NAME")
347
+
348
+ # Reset previously-failed rows before re-assessment
349
+ dq.run_assessment(schema_name="Curated", reset_eligible_flag=True)
350
+ ```
351
+
352
+ ---
353
+
354
+ ## Reporting
355
+
356
+ Four helper methods return Spark DataFrames ready for `.display()` or further transformation:
357
+
358
+ ```python
359
+ exec_id = dq.run_assessment(schema_name="Curated")
360
+
361
+ dq.violations(exec_id).display() # Row-level violations (field, violation type, value)
362
+ dq.quality_scores(exec_id).display() # Pass/fail counts + % per field
363
+ dq.fields_below_threshold(threshold=80).display() # Fields with quality < 80%
364
+ dq.summary_by_table(exec_id).display() # Aggregated quality score per table
365
+ ```
366
+
367
+ Two persistent reporting views are also created in the `dq` schema by `setup()`:
368
+
369
+ | View | Purpose |
370
+ |---|---|
371
+ | `v_auditDQChecks` | Row-level violations joined with field metadata |
372
+ | `v_statDQChecks` | Aggregated pass/fail statistics with `PercentageQualified` |
373
+
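+ The views can also be queried directly, for example assuming the default `main` catalog and `dq` schema from the Quick Start:
+
+ ```python
+ # Fields whose aggregated quality score falls below 80%
+ spark.table("main.dq.v_statDQChecks").filter("PercentageQualified < 80").display()
+ ```
+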
374
+ ---
375
+
376
+ ## AI Agent Support
377
+
378
+ The `dq_framework.agents` module provides three ways to use AI to configure the framework.
379
+ **No agent framework (Semantic Kernel, LangChain, AutoGen) is required.**
380
+
381
+ ### Approach A — Auto-suggest from schema (no LLM needed)
382
+
383
+ ```python
384
+ from dq_framework.agents import suggest_config
385
+
386
+ schema = [
387
+ ("EMAIL_ADDRESS", "string"),
388
+ ("FIRST_NAME", "string"),
389
+ ("POSTAL_CODE", "string"),
390
+ ("CUSTOMER_ID", "int"),
391
+ ]
392
+
393
+ # Returns ready-to-run ConfigManager code
394
+ code = suggest_config(schema, source_schema="Source", source_table="SRC_Customer",
395
+ curated_schema="Curated", curated_table="Customer_Denorm",
396
+ catalog="main")
397
+ print(code)
398
+ ```
399
+
400
+ Or read schema directly from a live Delta table:
401
+
402
+ ```python
403
+ schema = spark.table("main.Source.SRC_Customer").dtypes
404
+ code = suggest_config(schema, source_schema="Source", source_table="SRC_Customer", ...)
405
+ ```
406
+
407
+ ### Approach B — Chat assistant with tool calling
408
+
409
+ The 14 framework operations are exposed as OpenAI-compatible JSON Schema tool definitions.
410
+ They work unchanged with OpenAI, Azure OpenAI, Anthropic Claude, Semantic Kernel, AutoGen,
411
+ and LangChain.
412
+
413
+ ```python
414
+ from dq_framework.agents import SYSTEM_PROMPT, DQ_TOOLS
415
+ from dq_framework.agents.tools import execute_tool_call
416
+ from openai import AzureOpenAI
417
+
418
+ client = AzureOpenAI(azure_endpoint="...", api_key="...", api_version="2024-05-01-preview")
419
+
420
+ response = client.chat.completions.create(
421
+ model="gpt-4o",
422
+ messages=[
423
+ {"role": "system", "content": SYSTEM_PROMPT},
424
+ {"role": "user", "content": "Configure DQ rules for Source.SRC_Party.EMAIL_ADDRESS"},
425
+ ],
426
+ tools=DQ_TOOLS,
427
+ )
428
+
429
+ for call in response.choices[0].message.tool_calls or []:
430
+ result = execute_tool_call(call.function.name, call.function.arguments, dq, cfg)
431
+ print(f"[Tool] {call.function.name} → {result}")
432
+ ```
433
+
434
+ See [notebooks/01_agent_assistant.py](notebooks/01_agent_assistant.py) for Azure OpenAI,
435
+ Anthropic Claude, and Semantic Kernel examples.
436
+
437
+ ### Approach C — Paste system prompt into ChatGPT / Copilot
438
+
439
+ ```python
440
+ from dq_framework.agents import SYSTEM_PROMPT
441
+ print(SYSTEM_PROMPT) # Copy and paste into any chat interface
442
+ ```
443
+
444
+ ---
445
+
446
+ ## Testing
447
+
448
+ ### Unit tests — no cluster needed
449
+
450
+ Run locally with Python and `pytest`. No Databricks, no Spark, no Delta Lake required.
451
+
452
+ ```bash
453
+ pip install databricks-dq-framework pytest
454
+ pytest tests/unit/ -v
455
+ ```
456
+
457
+ | Test file | What it covers |
458
+ |---|---|
459
+ | `test_pattern_checks.py` | All major pattern check functions (DataEmptiness, SpecialChar, InvalidKeyword, …) |
460
+ | `test_resolve_pattern_rules.py` | Pattern precedence — specific overrides broad, inactive excluded |
461
+ | `test_generate_rule_functions.py` | L01/L02/L04 closure building, `FunctionRegistry`, stale function removal |
462
+ | `test_seed_master_data.py` | Seed data integrity: 27 categories, 118 patterns, no duplicate IDs |
463
+ | `test_suggest.py` | Schema-based config suggester: column classification, code generation |
464
+
465
+ ### Integration tests — requires a Databricks cluster
466
+
467
+ Open [notebooks/02_integration_test.py](notebooks/02_integration_test.py) on a cluster.
468
+ It creates throwaway test schemas, runs the full pipeline end-to-end, and drops everything on completion.
469
+
470
+ | Test | What it verifies |
471
+ |---|---|
472
+ | Test 1 | Framework setup — 9 tables created, 27 categories, 118 patterns seeded |
473
+ | Test 2 | `setup()` is idempotent — no duplicate data on second call |
474
+ | Test 3 | Synthetic curated table created with known good/bad email values |
475
+ | Test 4 | `ConfigManager` — all 5 user-managed tables populated |
476
+ | Test 5 | `generate_rule_functions()` — checker built and directly invocable |
477
+ | Test 6 | `run_assessment()` — DQ columns written correctly, specific rows verified |
478
+ | Test 7 | Reporting views — violations count, quality scores, threshold filter, summary |
479
+ | Test 8 | `reset_eligible_flag` — failed rows cleared, re-assessment gives same result |
480
+ | Test 9 | Custom keyword extensibility via `add_invalid_keyword()` |
481
+ | Test 10 | `cfg.verify_config()` passes with no FK issues |
482
+
483
+ ---
484
+
485
+ ## Platform Requirements
486
+
487
+ | Requirement | Detail |
488
+ |---|---|
489
+ | Databricks Runtime | 12.0+ (includes PySpark 3.3+) |
490
+ | Delta Lake | 2.0+ |
491
+ | Python | 3.9+ |
492
+ | Unity Catalog | Recommended (pass `catalog=""` for legacy Hive metastore) |
493
+ | python-dateutil | Included in Databricks Runtime; required for full date/time validation |
494
+
495
+ ---
496
+
497
+ ## Adding a New Source Field
498
+
499
+ 1. Register the field: `cfg.register_field(id, "Schema.Table.Column", data_category_type_id=N)`
500
+ 2. Set length bounds: `cfg.set_field_values(id, ffn, min_data_length=M, max_data_length=N)`
501
+ 3. Add pattern rules: `cfg.block_category(...)`, `cfg.allow_pattern(...)`, `cfg.block_pattern(...)`
502
+ 4. Add custom expressions: `cfg.add_custom_query_regex(id, ffn, r"your_regex", must_match=True)`
503
+ 5. (Optional) Add custom invalid keywords: `dq.add_invalid_keyword(id, "your_keyword")`
504
+ 6. Map to curated column: `cfg.add_mapping(id, ffn, target_schema_name=..., target_table_name=..., ...)`
505
+ 7. Validate FK integrity: `cfg.verify_config()`
506
+ 8. Re-run: `dq.generate_rule_functions()`
507
+
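+ Putting the steps together for a hypothetical postal-code field. The IDs, category type, length bounds, and table/field names below are illustrative; the method signatures follow the Quick Start:
+
+ ```python
+ ffn = "Source.SRC_Customer.POSTAL_CODE"   # fully-qualified field name (illustrative)
+
+ cfg.register_field(10, ffn, data_category_type_id=4)   # category ID is illustrative
+ cfg.set_field_values(10, ffn, min_data_length=4, max_data_length=4)
+ cfg.block_category(11, ffn, "DataEmptiness")
+ cfg.add_custom_query_regex(10, ffn, r"^[0-9]{4}$", must_match=True,
+                            description="Four-digit postal code")
+ cfg.add_mapping(10, ffn,
+                 target_schema_name="Curated", target_table_name="Customer_Denorm",
+                 target_field_name="POSTAL_CODE", target_catalog_name="main")
+ cfg.verify_config()
+ dq.generate_rule_functions()
+ ```
+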
508
+ ---
509
+
510
+ ## Known Limitations
511
+
512
+ | # | Area | Description |
513
+ |---|---|---|
514
+ | 1 | **Bug — `add_custom_query()`** | `CustomQueryDescription` column is `NOT NULL` in the DDL but the method passes `null` when no `description` argument is provided. Always supply a `description` when calling `add_custom_query()`, `add_custom_query_regex()`, or `add_custom_query_sql()` to avoid an INSERT constraint violation. |
515
+ | 2 | **Silent pattern skip** | If a `PatternName` in `configFieldAllowedPattern` does not match any row in `masterPattern`, the rule is silently ignored (equivalent to SQL's `ELSE NULL`). A typo in a pattern name will go undetected unless `verify_config()` is run. |
516
+ | 3 | **No transaction guarantees** | The assessment writes DQ columns, audit rows, and stat rows in three separate Spark actions. A cluster failure between steps can leave partial results. Re-run the assessment with `reset_eligible_flag=True` to recover. |
517
+ | 4 | **Date validation outside Databricks** | If `python-dateutil` is not installed, date/time pattern checks fall back to three hard-coded regex formats. This is rarely an issue on Databricks Runtime but may affect local unit test runs that exercise the date validators. |
518
+ | 5 | **No parameterised SQL** | Config is written via f-string SQL with single-quote escaping. Field names and expressions provided by trusted developers are safe; do not pass untrusted user input directly into ConfigManager methods. |
519
+
520
+ ---
521
+
522
+ *Data Quality Non-Functional Assessment Framework — Databricks Edition*
523
+ *Generalised for open-source publication — no client-specific data included*
@@ -0,0 +1,20 @@
1
+ dq_framework/__init__.py,sha256=VBWj9LQP1ZwzgzFhXAtpLaB-l3buS1YJQgPbp_MtB8A,859
2
+ dq_framework/config.py,sha256=XQhJpdVgZCUOTIHXMAZQk6nAedm9wds3ASidOcIg01c,25791
3
+ dq_framework/ddl_framework_tables.py,sha256=YtWNOrcmbd7mAef81Vb_Gvju_9Tlm4kEcXvO9ptsmTs,21033
4
+ dq_framework/framework.py,sha256=QxPkt4klkURczRHdYtulZudMJw46AJFk57QukwPA7HU,16670
5
+ dq_framework/seed_master_data.py,sha256=-yMJ61QiNG3icQglI5xx6zF17Wbvz8KcEm8q6f49_J0,31169
6
+ dq_framework/agents/__init__.py,sha256=LxOj_aGMKidrb7I4Xju6-t1AK3eSIAtaZYDeAG46Sso,1715
7
+ dq_framework/agents/context.py,sha256=tyl4fooh6LfmD3Q3EOrfau-ckh41GU-yT_GZYibM-aE,10647
8
+ dq_framework/agents/suggest.py,sha256=DS9hjCV27oza6TAtNWwqazxKZBwNmpBKWXQWeImoQtc,14168
9
+ dq_framework/agents/tools.py,sha256=d9YgIm9THksKqMvR85wK0v5ShM5G6AqFs1dcGzifosg,26070
10
+ dq_framework/engine/__init__.py,sha256=dDC1UQ3Kqm8pjtGUbGi2HC1832cI6qgsWMi_au_0-3M,436
11
+ dq_framework/engine/data_assessment_rules.py,sha256=q4k1qC5KPwQzbmukbw1sv4ukvNnIO0Bcn2w1TRhyFRU,31440
12
+ dq_framework/engine/generate_rule_functions.py,sha256=9V6Ks28w83gADa41noVXkm6R7D0VuaSw5r9jN4NzQgA,23838
13
+ dq_framework/engine/pattern_checks.py,sha256=cBx9Kivw2_9dAtkc4W2nuCtakAxZMGXoMOfhUmAGhzw,31916
14
+ dq_framework/engine/resolve_pattern_rules.py,sha256=FwFuNcX2W8sk8ylnSgqHrB_w-fSJdjB5msrycSAtTvc,10586
15
+ dq_framework/reporting/__init__.py,sha256=lolbnQLlg_ijMsupqcrVRZ2tfeSzlr8p5LXAByZDEJY,102
16
+ dq_framework/reporting/views.py,sha256=lUMD-1CPBgY_Tx-Uadql_JlPacns4iTqhCBmsSa8alQ,9033
17
+ databricks_dq_framework-1.0.0.dist-info/METADATA,sha256=XCIZaGANLWfi0V4k8NBOHjWr9zpRmrINJMxHKd2uWjU,23388
18
+ databricks_dq_framework-1.0.0.dist-info/WHEEL,sha256=aeYiig01lYGDzBgS8HxWXOg3uV61G9ijOsup-k9o1sk,91
19
+ databricks_dq_framework-1.0.0.dist-info/top_level.txt,sha256=tL7jNAFg2sdBxE89vboz_q8bNoP-DJnKb9WP3I7VAOE,13
20
+ databricks_dq_framework-1.0.0.dist-info/RECORD,,
@@ -0,0 +1,5 @@
1
+ Wheel-Version: 1.0
2
+ Generator: setuptools (82.0.1)
3
+ Root-Is-Purelib: true
4
+ Tag: py3-none-any
5
+
@@ -0,0 +1 @@
1
+ dq_framework
@@ -0,0 +1,23 @@
1
+ """
2
+ Data Quality Non-Functional Assessment Framework for Databricks / Delta Lake
3
+ ============================================================================
4
+ Metadata-driven, dynamically-generated DQ assessment engine.
5
+
6
+ Equivalent to the SQL Server implementation but running natively on
7
+ Databricks with Delta tables, PySpark DataFrames, and Python UDFs.
8
+
9
+ Usage
10
+ -----
11
+ from dq_framework import DQFramework
12
+
13
+ dq = DQFramework(spark, catalog="main", schema="dq")
14
+ dq.setup() # create Delta tables + seed reference data
15
+ dq.generate_rule_functions() # build field-level checker functions from config
16
+ dq.run_assessment(schema_name="Curated") # run DQ assessment
17
+ """
18
+
19
+ from dq_framework.framework import DQFramework
20
+ from dq_framework.config import ConfigManager
21
+
22
+ __all__ = ["DQFramework", "ConfigManager"]
23
+ __version__ = "1.0.0"
@@ -0,0 +1,42 @@
1
+ """
2
+ DQ Framework — Agent Support
3
+ =============================
4
+ Framework-agnostic building blocks for LLM / AI-agent integration.
5
+
6
+ No dependency on any specific agent framework (Semantic Kernel, AutoGen,
7
+ LangChain, etc.). The three modules here produce plain Python objects
8
+ (strings, dicts) that plug into any LLM platform:
9
+
10
+ context.py — System prompt / instructions for an LLM assistant.
11
+ Gives the model enough knowledge to help users configure
12
+ the DQ framework conversationally.
13
+
14
+ tools.py — Tool / function definitions in JSON Schema format
15
+ (OpenAI-compatible; also accepted by Anthropic, Semantic
16
+ Kernel, AutoGen, and LangChain).
17
+
18
+ suggest.py — Reads a Delta table schema (column names + Spark types)
19
+ and returns a ready-to-run ConfigManager code snippet.
20
+ Use this as the output of an "auto-configure" agent step.
21
+
22
+ Quick start (any LLM SDK)
23
+ --------------------------
24
+ from dq_framework.agents import SYSTEM_PROMPT, DQ_TOOLS, suggest_config
25
+
26
+ # 1. Give the LLM framework knowledge
27
+ system = SYSTEM_PROMPT
28
+
29
+ # 2. Register tools so the agent can call framework operations
30
+ tools = DQ_TOOLS # list of OpenAI-style tool dicts
31
+
32
+ # 3. Auto-suggest config from a table schema
33
+ schema = [("EMAIL_ADDRESS", "string"), ("FIRST_NAME", "string"), ("POSTCODE", "string")]
34
+ code = suggest_config(schema, source_schema="Source", source_table="SRC_Customer")
35
+ print(code)
36
+ """
37
+
38
+ from dq_framework.agents.context import SYSTEM_PROMPT
39
+ from dq_framework.agents.tools import DQ_TOOLS
40
+ from dq_framework.agents.suggest import suggest_config
41
+
42
+ __all__ = ["SYSTEM_PROMPT", "DQ_TOOLS", "suggest_config"]