dataqe-framework 0.1.0__tar.gz

@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Khadar Shaik
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
@@ -0,0 +1,4 @@
1
+ include README.md
2
+ exclude LICENSE
3
+ recursive-exclude * *.pyc
4
+ recursive-exclude * __pycache__
@@ -0,0 +1,604 @@
1
+ Metadata-Version: 2.1
2
+ Name: dataqe-framework
3
+ Version: 0.1.0
4
+ Summary: Reusable Data Validation Framework for data migration, ETL validation, and cross-database reconciliation
5
+ Author-email: Khadar Shaik <khadarmohiddin.shaik@apree.health>
6
+ Project-URL: Homepage, https://github.com/ShaikKhadarmohiddin/dataqe-framework
7
+ Project-URL: Documentation, https://github.com/ShaikKhadarmohiddin/dataqe-framework#readme
8
+ Project-URL: Repository, https://github.com/ShaikKhadarmohiddin/dataqe-framework.git
9
+ Project-URL: Issues, https://github.com/ShaikKhadarmohiddin/dataqe-framework/issues
10
+ Keywords: data-validation,data-quality,testing,ETL,migration,mysql,bigquery
11
+ Classifier: Development Status :: 4 - Beta
12
+ Classifier: Intended Audience :: Developers
13
+ Classifier: Intended Audience :: System Administrators
14
+ Classifier: License :: OSI Approved :: MIT License
15
+ Classifier: Programming Language :: Python :: 3
16
+ Classifier: Programming Language :: Python :: 3.9
17
+ Classifier: Programming Language :: Python :: 3.10
18
+ Classifier: Programming Language :: Python :: 3.11
19
+ Classifier: Programming Language :: Python :: 3.12
20
+ Classifier: Topic :: Software Development :: Libraries :: Python Modules
21
+ Classifier: Topic :: Database
22
+ Classifier: Topic :: Software Development :: Testing
23
+ Requires-Python: >=3.9
24
+ Description-Content-Type: text/markdown
25
+ License-File: LICENSE.txt
26
+ Requires-Dist: google-cloud-bigquery>=3.0.0
27
+ Requires-Dist: pymysql>=1.0.0
28
+ Requires-Dist: pyyaml>=5.4
29
+ Requires-Dist: pandas>=1.3.0
30
+
31
+ # DataQE Framework - Data Quality and Equality Testing
32
+
33
+ A powerful Python framework for validating data quality and ensuring data consistency between source and target databases. Designed for data migration projects, ETL validation, and cross-database reconciliation.
34
+
35
+ **Version**: 0.1.0
36
+
37
+ ## Overview
38
+
39
+ DataQE Framework enables organizations to:
40
+ - **Validate data migration quality** between different database systems
41
+ - **Ensure data consistency** across source and target environments
42
+ - **Run comprehensive test suites** with flexible comparison modes
43
+ - **Generate detailed reports** for compliance and audit trails
44
+ - **Support dynamic dataset replacement** for multi-release environments
45
+
46
+ ## Key Features
47
+
48
+ ### Multi-Database Support
49
+ - **MySQL** - Relational database validation
50
+ - **Google BigQuery** - Cloud data warehouse validation
51
+ - Extensible connector architecture for adding more databases
52
+
53
+ ### Flexible Test Configuration
54
+ - YAML-based test definitions
55
+ - Single-source validation with expected conditions
56
+ - Source vs Target equality checks
57
+ - Threshold-based comparisons (percentage and absolute)
58
+ - Support for multiple test cases in a single execution
59
+
60
+ ### Dynamic Dataset Replacement
61
+ - Replace dataset placeholders with actual release names
62
+ - Centralized configuration for dataset mappings
63
+ - Support for multiple sources with different release versions
64
+
65
+ ### Comprehensive Reporting
66
+ - **ExecutionReport.html** - Full test results with detailed execution times
67
+ - **FailedExecutionReport.html** - Failed tests or confirmation of all tests passing
68
+ - **ExecutionReport.csv** - Structured test results for further analysis
69
+ - **AutomationData.csv** - CI/CD integration data
70
+ - Real-time console output with progress tracking
71
+
72
+ ### Enterprise Features
73
+ - PHI data protection with KMS encryption support
74
+ - Detailed execution timing metrics
75
+ - Environment-based configuration
76
+ - Flexible credential management
77
+
78
+ ## Installation
79
+
80
+ ### Prerequisites
81
+ - Python 3.9+
82
+ - pip
83
+
84
+ ### Install from Source
85
+
86
+ ```bash
87
+ git clone <repository-url>
88
+ cd dataqe-framework
89
+ pip install -e .
90
+ ```
91
+
92
+ ### Verify Installation
93
+
94
+ ```bash
95
+ dataqe-run --help
96
+ ```
97
+
98
+ ## Quick Start
99
+
100
+ ### 1. Create Configuration File
101
+
102
+ Create `config.yml`:
103
+
104
+ ```yaml
105
+ config_block_validation:
106
+   source:
107
+     database_type: mysql
108
+     mysql:
109
+       host: source-db.example.com
110
+       port: 3306
111
+       user: db_user
112
+       password: db_password
113
+       database: source_db
114
+
115
+   target:
116
+     database_type: gcpbq
117
+     gcp:
118
+       project_id: my-gcp-project
119
+       dataset_id: target_dataset
120
+       credentials_path: /path/to/credentials.json
121
+
122
+   other:
123
+     validation_script: test_suite.yml
124
+     preprocessor_queries: preprocessor_queries.yml
125
+ ```
126
+
127
+ ### 2. Create Test Suite
128
+
129
+ Create `test_suite.yml`:
130
+
131
+ ```yaml
132
+ - test_row_count:
133
+     severity: critical
134
+     source:
135
+       query: |
136
+         SELECT COUNT(*) as value FROM users
137
+     target:
138
+       query: |
139
+         SELECT COUNT(*) as value FROM users
140
+     comparisons:
141
+       comment: "User count must match between source and target"
142
+
143
+ - test_with_threshold:
144
+     severity: high
145
+     source:
146
+       query: |
147
+         SELECT SUM(amount) as value FROM transactions
148
+     target:
149
+       query: |
150
+         SELECT SUM(amount) as value FROM transactions
151
+     comparisons:
152
+       threshold:
153
+         value: percentage
154
+         limit: 1
155
+       comment: "Transaction amounts must match within 1%"
156
+ ```
157
+
158
+ ### 3. Run Validation
159
+
160
+ ```bash
161
+ dataqe-run --config config.yml
162
+ ```
163
+
164
+ Check the output directory for reports:
165
+ ```
166
+ ./output/ExecutionReport.html
167
+ ./output/ExecutionReport.csv
168
+ ./output/FailedExecutionReport.html
169
+ ```
170
+
171
+ ## Configuration
172
+
173
+ ### Config Block Structure
174
+
175
+ ```yaml
176
+ config_block_<name>:
177
+   source:
178
+     database_type: mysql|gcpbq
179
+     mysql: {...}
180
+     gcp: {...}
181
+     config_query_key: optional_query_key
182
+     source_name: optional_source_name
183
+
184
+   target:
185
+     database_type: mysql|gcpbq
186
+     mysql: {...}
187
+     gcp: {...}
188
+     config_query_key: optional_query_key
189
+     source_name: optional_source_name
190
+
191
+   other:
192
+     validation_script: path/to/test_suite.yml
193
+     preprocessor_queries: path/to/preprocessor_queries.yml
194
+ ```
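The block layout above can be sanity-checked before a run. The following is a minimal sketch, not the framework's own loader: `validate_config_block` and `SETTINGS_KEY` are hypothetical names, and the rules it enforces (a required `source`, an optional `target`, a required `validation_script`) are assumptions inferred from the examples in this README.

```python
# Assumed mapping from database_type to the key that holds its settings,
# inferred from the README examples (mysql -> mysql, gcpbq -> gcp).
SETTINGS_KEY = {"mysql": "mysql", "gcpbq": "gcp"}


def validate_config_block(name, block):
    """Return a list of problems found; an empty list means the block looks valid."""
    errors = []
    if not name.startswith("config_block_"):
        errors.append(f"{name}: block name should start with 'config_block_'")
    if "source" not in block:
        errors.append("source: section is required")  # assumption
    for side in ("source", "target"):
        section = block.get(side)
        if section is None:
            continue  # assumption: target may be omitted for single-source checks
        db_type = section.get("database_type")
        key = SETTINGS_KEY.get(db_type)
        if key is None:
            errors.append(f"{side}: unknown database_type {db_type!r}")
        elif key not in section:
            errors.append(f"{side}: missing '{key}' settings for database_type {db_type!r}")
    if "validation_script" not in (block.get("other") or {}):
        errors.append("other: validation_script is required")  # assumption
    return errors
```

The function takes an already-parsed mapping (for example, the result of `yaml.safe_load`), so it can be tested without touching the file system.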
195
+
196
+ ### Database Configuration
197
+
198
+ #### MySQL
199
+ ```yaml
200
+ mysql:
201
+   host: hostname
202
+   port: 3306
203
+   user: username
204
+   password: password
205
+   database: database_name
206
+ ```
207
+
208
+ #### Google BigQuery
209
+ ```yaml
210
+ gcp:
211
+   project_id: my-project
212
+   dataset_id: my-dataset
213
+   credentials_path: /path/to/service-account.json
214
+   location: us-central1
215
+   use_encryption: false
216
+ ```
217
+
218
+ See [CONFIGURATION.md](CONFIGURATION.md) for detailed configuration options.
219
+
220
+ ## Test Suite Definition
221
+
222
+ Each test case has the following structure:
223
+
224
+ ```yaml
225
+ - test_name:
226
+     severity: critical|high|medium|low
227
+
228
+     source:
229
+       query: |
230
+         SELECT COUNT(*) as value FROM table
231
+       config_query_key: optional_key
232
+       source_name: optional_source_name
233
+
234
+     target:
235
+       query: |
236
+         SELECT COUNT(*) as value FROM table
237
+       config_query_key: optional_key
238
+       source_name: optional_source_name
239
+
240
+     comparisons:
241
+       expected: optional_expected_value
242
+       threshold:
243
+         value: percentage|absolute
244
+         limit: number
245
+       comment: "Description of this test"
246
+ ```
247
+
248
+ ### Comparison Modes
249
+
250
+ #### 1. Source vs Target Equality
251
+ ```yaml
252
+ comparisons:
253
+   comment: "Values must match exactly"
254
+ ```
255
+
256
+ #### 2. Expected Value Check
257
+ ```yaml
258
+ comparisons:
259
+   expected: ">=1000"
260
+   comment: "Count must be at least 1000"
261
+ ```
262
+
263
+ #### 3. Percentage Threshold
264
+ ```yaml
265
+ comparisons:
266
+   threshold:
267
+     value: percentage
268
+     limit: 5
269
+   comment: "Target can vary up to 5% from source"
270
+ ```
271
+
272
+ #### 4. Absolute Difference
273
+ ```yaml
274
+ comparisons:
275
+   threshold:
276
+     value: absolute
277
+     limit: 100
278
+   comment: "Target can differ by max 100 units"
279
+ ```
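To make the four comparison modes concrete, here is a minimal sketch of how they could be evaluated. It is an illustration under stated assumptions, not the code in `comparator.py` or `threshold.py`: the function names are hypothetical, the percentage is assumed to be measured relative to the source value, and `expected` is assumed to accept an optional operator followed by a number.

```python
import operator
import re

# Operators assumed to be accepted in an `expected` string such as ">=1000".
_OPS = {
    ">=": operator.ge, "<=": operator.le, "==": operator.eq,
    "!=": operator.ne, ">": operator.gt, "<": operator.lt,
}


def check_expected(value, expected):
    """Mode 2: compare a single value against an expression like '>=1000'."""
    match = re.fullmatch(r"\s*(>=|<=|==|!=|>|<)?\s*(-?\d+(?:\.\d+)?)\s*", str(expected))
    if match is None:
        raise ValueError(f"unsupported expected expression: {expected!r}")
    compare_op = _OPS[match.group(1) or "=="]  # a bare number means equality
    return compare_op(float(value), float(match.group(2)))


def within_threshold(source, target, kind, limit):
    """Modes 3 and 4: percentage (relative to source) or absolute difference."""
    diff = abs(source - target)
    if kind == "absolute":
        return diff <= limit
    if kind == "percentage":
        if source == 0:
            return target == 0  # avoid dividing by zero
        return diff / abs(source) * 100 <= limit
    raise ValueError(f"unknown threshold kind: {kind!r}")


def compare(source, target=None, expected=None, threshold=None):
    """Pick the mode from whichever keys are present under `comparisons`."""
    if expected is not None:
        return check_expected(source, expected)
    if threshold is not None:
        return within_threshold(source, target, threshold["value"], threshold["limit"])
    return source == target  # Mode 1: exact source vs target equality
```

For example, `compare(100, 104, threshold={"value": "percentage", "limit": 5})` passes because the target is 4% away from the source, while a target of 106 would fail.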
280
+
281
+ ## Dynamic Dataset Replacement
282
+
283
+ Replace dataset placeholders with actual release names:
284
+
285
+ ### 1. Create Preprocessor Queries File
286
+
287
+ Create `preprocessor_queries.yml`:
288
+
289
+ ```yaml
290
+ get_releases: |
291
+   SELECT source, current_release, previous_release
292
+   FROM release_metadata
293
+   WHERE is_active = TRUE
294
+
295
+ get_bcbsa_releases: |
296
+   SELECT 'bcbsa' as source, 'bcbsa_export1' as current_release, 'bcbsa_export3' as previous_release
297
+ ```
298
+
299
+ ### 2. Update Configuration
300
+
301
+ Add to `config.yml`:
302
+
303
+ ```yaml
304
+ other:
305
+   validation_script: test_suite.yml
306
+   preprocessor_queries: preprocessor_queries.yml
307
+ ```
308
+
309
+ ### 3. Update Test Suite
310
+
311
+ Use placeholders in queries and specify the preprocessor key:
312
+
313
+ ```yaml
314
+ - test_current_release:
315
+     source:
316
+       query: |
317
+         SELECT COUNT(*) as value FROM BCBSA_CURR_WEEK.users
318
+       config_query_key: get_bcbsa_releases
319
+       source_name: bcbsa
320
+ ```
321
+
322
+ The framework will:
323
+ 1. Execute `get_bcbsa_releases` query
324
+ 2. Get current_release value (`bcbsa_export1`)
325
+ 3. Replace `BCBSA_CURR_WEEK` → `bcbsa_export1`
326
+ 4. Run the modified query
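The four steps above can be sketched as follows. This is an illustration only: `replace_placeholders` is a hypothetical helper, and the placeholder convention `<SOURCE_NAME>_CURR_WEEK` / `<SOURCE_NAME>_PREV_WEEK` is an assumption inferred from the examples in this README, not a documented rule. The rows passed in stand for the result of the preprocessor query.

```python
import re


def replace_placeholders(query, source_name, release_rows):
    """Swap release placeholders in `query` for the dataset names of `source_name`.

    `release_rows` is a list of dicts with the columns returned by the
    preprocessor query: source, current_release, previous_release.
    """
    row = next((r for r in release_rows if r["source"] == source_name), None)
    if row is None:
        raise KeyError(f"no release row for source {source_name!r}")

    prefix = source_name.upper()
    # Assumed convention: BCBSA_CURR_WEEK / BCBSA_PREV_WEEK for source 'bcbsa'.
    mapping = {
        f"{prefix}_CURR_WEEK": row["current_release"],
        f"{prefix}_PREV_WEEK": row["previous_release"],
    }
    # Match whole identifiers only, so longer names containing a placeholder are left alone.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], query)


rows = [{"source": "bcbsa",
         "current_release": "bcbsa_export1",
         "previous_release": "bcbsa_export3"}]
print(replace_placeholders(
    "SELECT COUNT(*) as value FROM BCBSA_CURR_WEEK.users", "bcbsa", rows))
```

Raising an error when no row matches `source_name` is a design choice for the sketch; it surfaces a misconfigured `config_query_key` early instead of running a query against a placeholder that was never replaced.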
327
+
328
+ See [PREPROCESSOR.md](PREPROCESSOR.md) for detailed examples.
329
+
330
+ ## Report Generation
331
+
332
+ ### ExecutionReport.html
333
+ Full test execution report with:
334
+ - Test results (PASS/FAIL)
335
+ - Source and target values
336
+ - Execution timestamps
337
+ - Query execution times
338
+ - Comparison methods
339
+
340
+ ### FailedExecutionReport.html
341
+ Lists the failed tests, or confirms that all tests passed.
342
+
343
+ ### ExecutionReport.csv
344
+ Structured test results for import into analysis tools:
345
+ - Test name
346
+ - Status
347
+ - Severity
348
+ - Source/Target values
349
+ - Execution time
350
+
351
+ ### AutomationData.csv
352
+ CI/CD integration data:
353
+ - App name
354
+ - Branch
355
+ - Platform
356
+ - Owner
357
+ - Test report path
358
+
359
+ ## Environment Variables
360
+
361
+ Configure the framework behavior using environment variables:
362
+
363
+ ```bash
364
+ # Output directory for reports (default: ./output)
365
+ export DATAQE_OUTPUT_DIR=/path/to/output
366
+
367
+ # CI/CD metadata (used in AutomationData.csv)
368
+ export DATAQE_APP_NAME=my-app
369
+ export DATAQE_BRANCH=main
370
+ export DATAQE_PLATFORM=kubernetes
371
+ export DATAQE_OWNER=team-name
372
+ ```
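A minimal sketch of how these variables could be read is shown below. `load_settings` is a hypothetical helper, not part of the framework's API. Only the `./output` default comes from this README; the empty-string defaults for the CI/CD fields are an assumption.

```python
import os
from pathlib import Path


def load_settings(env=None):
    """Collect DATAQE_* settings from the environment (or from a supplied mapping)."""
    env = os.environ if env is None else env
    return {
        # Default documented above: ./output
        "output_dir": Path(env.get("DATAQE_OUTPUT_DIR", "./output")),
        # Assumption: CI/CD metadata falls back to an empty string when unset.
        "app_name": env.get("DATAQE_APP_NAME", ""),
        "branch": env.get("DATAQE_BRANCH", ""),
        "platform": env.get("DATAQE_PLATFORM", ""),
        "owner": env.get("DATAQE_OWNER", ""),
    }
```

Accepting an optional mapping makes the helper easy to test: `load_settings({"DATAQE_BRANCH": "main"})` returns the documented default output directory together with the supplied branch.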
373
+
374
+ ## Command Line Usage
375
+
376
+ ### Basic Execution
377
+ ```bash
378
+ dataqe-run --config /path/to/config.yml
379
+ ```
380
+
381
+ ### With Custom Output Directory
382
+ ```bash
383
+ export DATAQE_OUTPUT_DIR=/custom/output
384
+ dataqe-run --config /path/to/config.yml
385
+ ```
386
+
387
+ ### CI/CD Integration
388
+ ```bash
389
+ export DATAQE_APP_NAME=ecommerce-platform
390
+ export DATAQE_BRANCH=feature-branch
391
+ export DATAQE_PLATFORM=kubernetes
392
+ export DATAQE_OWNER=data-team
393
+
394
+ dataqe-run --config /path/to/config.yml
395
+ ```
396
+
397
+ ## Project Structure
398
+
399
+ ```
400
+ dataqe-framework/
401
+ ├── src/dataqe_framework/
402
+ │   ├── __init__.py
403
+ │   ├── cli.py                    # Command-line interface
404
+ │   ├── config_loader.py          # YAML config loading
405
+ │   ├── executor.py               # Test execution engine
406
+ │   ├── preprocessor.py           # Query preprocessing
407
+ │   ├── reporter.py               # Report generation
408
+ │   ├── comparison/
409
+ │   │   ├── comparator.py         # Comparison logic
410
+ │   │   └── threshold.py          # Threshold calculations
411
+ │   └── connectors/
412
+ │       ├── base_connector.py     # Base connector interface
413
+ │       ├── mysql_connector.py    # MySQL implementation
414
+ │       └── bigquery_connector.py # BigQuery implementation
415
+ ├── example_preprocessor_config.yml
416
+ ├── example_preprocessor_queries.yml
417
+ ├── example_preprocessor_test_script.yml
418
+ ├── README.md
419
+ ├── CONFIGURATION.md
420
+ ├── PREPROCESSOR.md
421
+ └── pyproject.toml
422
+ ```
423
+
424
+ ## Examples
425
+
426
+ ### Example 1: Simple Row Count Validation
427
+
428
+ Test if row counts match between MySQL and BigQuery:
429
+
430
+ ```yaml
431
+ - users_row_count:
432
+     severity: critical
433
+     source:
434
+       query: SELECT COUNT(*) as value FROM users
435
+     target:
436
+       query: SELECT COUNT(*) as value FROM users
437
+     comparisons:
438
+       comment: "User count must match exactly"
439
+ ```
440
+
441
+ ### Example 2: Multi-Release Dataset Validation
442
+
443
+ Validate current and previous release datasets:
444
+
445
+ ```yaml
446
+ - current_release_sales:
447
+     severity: high
448
+     source:
449
+       query: |
450
+         SELECT SUM(amount) as value FROM BCBSA_CURR_WEEK.sales
451
+       config_query_key: get_bcbsa_releases
452
+       source_name: bcbsa
453
+
454
+ - previous_release_sales:
455
+     severity: medium
456
+     source:
457
+       query: |
458
+         SELECT SUM(amount) as value FROM BCBSA_PREV_WEEK.sales
459
+       config_query_key: get_bcbsa_releases
460
+       source_name: bcbsa
461
+ ```
462
+
463
+ ### Example 3: Threshold-Based Comparison
464
+
465
+ Allow data variations within acceptable ranges:
466
+
467
+ ```yaml
468
+ - transaction_amounts:
469
+     severity: high
470
+     source:
471
+       query: SELECT SUM(amount) as value FROM transactions
472
+     target:
473
+       query: SELECT SUM(amount) as value FROM transactions
474
+     comparisons:
475
+       threshold:
476
+         value: percentage
477
+         limit: 2
478
+       comment: "Amounts must match within 2%"
479
+ ```
480
+
481
+ ## Troubleshooting
482
+
483
+ ### Connection Issues
484
+
485
+ **MySQL Connection Refused**
486
+ ```bash
487
+ # Check connectivity (-p prompts for the password, keeping it out of shell history)
488
+ mysql -h <host> -u <user> -p <database>
489
+
490
+ # Verify in config.yml:
491
+ # - host is correct
492
+ # - port is 3306 (or custom port)
493
+ # - user/password are correct
494
+ ```
495
+
496
+ **BigQuery Authentication Failed**
497
+ ```bash
498
+ # Verify that Application Default Credentials can produce an access token
499
+ gcloud auth application-default print-access-token
500
+
501
+ # Check in config.yml:
502
+ # - credentials_path points to valid service account JSON
503
+ # - credentials file has BigQuery permissions
504
+ ```
505
+
506
+ ### Query Execution Issues
507
+
508
+ **Query Timeout**
509
+ - Increase database timeout settings
510
+ - Optimize query performance
511
+ - Check database load
512
+
513
+ **Dataset Not Found**
514
+ - For preprocessor queries: verify `config_query_key` matches a key in `preprocessor_queries.yml`
515
+ - For dynamic replacement: verify placeholder format matches expected convention
516
+
517
+ ### Report Generation Issues
518
+
519
+ **Output directory not writable**
520
+ ```bash
521
+ chmod -R 755 ./output
522
+ ```
523
+
524
+ **No output files generated**
525
+ - Check logs for errors
526
+ - Verify `DATAQE_OUTPUT_DIR` has write permissions
527
+ - Ensure test suite has valid queries
528
+
529
+ ## Performance Considerations
530
+
531
+ - **Large result sets**: Memory usage scales with query result size
532
+ - **Many tests**: Execution time is cumulative
533
+ - **Database load**: Run during off-peak hours for production databases
534
+ - **Network latency**: BigQuery queries may take longer than MySQL
535
+
536
+ ## Security
537
+
538
+ ### Sensitive Data Handling
539
+ - Never commit credentials files
540
+ - Use environment variables for secrets
541
+ - Enable KMS encryption for PHI data in BigQuery
542
+
543
+ ### Best Practices
544
+ - Use dedicated read-only database accounts
545
+ - Limit query timeout duration
546
+ - Monitor execution logs for suspicious patterns
547
+ - Review generated reports for sensitive data exposure
548
+
549
+ ## Contributing
550
+
551
+ For bug reports and feature requests, please open an issue on the repository.
552
+
553
+ ## Installation via pip
554
+
555
+ ### From PyPI (Coming Soon)
556
+
557
+ ```bash
558
+ pip install dataqe-framework
559
+ ```
560
+
561
+ ### From GitHub
562
+
563
+ ```bash
564
+ pip install git+https://github.com/ShaikKhadarmohiddin/dataqe-framework.git
565
+ ```
566
+
567
+ ### From Source
568
+
569
+ ```bash
570
+ git clone https://github.com/ShaikKhadarmohiddin/dataqe-framework.git
571
+ cd dataqe-framework
572
+ pip install -e .
573
+ ```
574
+
575
+ ## Author
576
+
577
+ **Khadar Shaik**
578
+ - Email: khadarmohiddin.shaik@apree.health
579
+ - GitHub: [@ShaikKhadarmohiddin](https://github.com/ShaikKhadarmohiddin)
580
+
581
+ ## License
582
+
583
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
584
+
585
+ MIT License - You are free to use this project for personal, educational, or commercial purposes.
586
+
587
+ ## Support
588
+
589
+ For support and questions:
590
+ - Check documentation in the project repository
591
+ - Open an issue on [GitHub Issues](https://github.com/ShaikKhadarmohiddin/dataqe-framework/issues)
592
+ - Review troubleshooting section in [GETTING_STARTED.md](GETTING_STARTED.md)
593
+ - Consult test output and logs for error details
594
+
595
+ ## Version History
596
+
597
+ ### 0.1.0 (Initial Release)
598
+ - Multi-database support (MySQL, BigQuery)
599
+ - YAML-based test configuration
600
+ - Flexible comparison modes
601
+ - Dynamic dataset replacement
602
+ - Comprehensive reporting
603
+ - PHI data protection
604
+ - CI/CD integration support