tech-hub-skills 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (133)
  1. package/LICENSE +21 -0
  2. package/README.md +250 -0
  3. package/bin/cli.js +241 -0
  4. package/bin/copilot.js +182 -0
  5. package/bin/postinstall.js +42 -0
  6. package/package.json +46 -0
  7. package/tech_hub_skills/roles/ai-engineer/skills/01-prompt-engineering/README.md +252 -0
  8. package/tech_hub_skills/roles/ai-engineer/skills/02-rag-pipeline/README.md +448 -0
  9. package/tech_hub_skills/roles/ai-engineer/skills/03-agent-orchestration/README.md +599 -0
  10. package/tech_hub_skills/roles/ai-engineer/skills/04-llm-guardrails/README.md +735 -0
  11. package/tech_hub_skills/roles/ai-engineer/skills/05-vector-embeddings/README.md +711 -0
  12. package/tech_hub_skills/roles/ai-engineer/skills/06-llm-evaluation/README.md +777 -0
  13. package/tech_hub_skills/roles/azure/skills/01-infrastructure-fundamentals/README.md +264 -0
  14. package/tech_hub_skills/roles/azure/skills/02-data-factory/README.md +264 -0
  15. package/tech_hub_skills/roles/azure/skills/03-synapse-analytics/README.md +264 -0
  16. package/tech_hub_skills/roles/azure/skills/04-databricks/README.md +264 -0
  17. package/tech_hub_skills/roles/azure/skills/05-functions/README.md +264 -0
  18. package/tech_hub_skills/roles/azure/skills/06-kubernetes-service/README.md +264 -0
  19. package/tech_hub_skills/roles/azure/skills/07-openai-service/README.md +264 -0
  20. package/tech_hub_skills/roles/azure/skills/08-machine-learning/README.md +264 -0
  21. package/tech_hub_skills/roles/azure/skills/09-storage-adls/README.md +264 -0
  22. package/tech_hub_skills/roles/azure/skills/10-networking/README.md +264 -0
  23. package/tech_hub_skills/roles/azure/skills/11-sql-cosmos/README.md +264 -0
  24. package/tech_hub_skills/roles/azure/skills/12-event-hubs/README.md +264 -0
  25. package/tech_hub_skills/roles/code-review/skills/01-automated-code-review/README.md +394 -0
  26. package/tech_hub_skills/roles/code-review/skills/02-pr-review-workflow/README.md +427 -0
  27. package/tech_hub_skills/roles/code-review/skills/03-code-quality-gates/README.md +518 -0
  28. package/tech_hub_skills/roles/code-review/skills/04-reviewer-assignment/README.md +504 -0
  29. package/tech_hub_skills/roles/code-review/skills/05-review-analytics/README.md +540 -0
  30. package/tech_hub_skills/roles/data-engineer/skills/01-lakehouse-architecture/README.md +550 -0
  31. package/tech_hub_skills/roles/data-engineer/skills/02-etl-pipeline/README.md +580 -0
  32. package/tech_hub_skills/roles/data-engineer/skills/03-data-quality/README.md +579 -0
  33. package/tech_hub_skills/roles/data-engineer/skills/04-streaming-pipelines/README.md +608 -0
  34. package/tech_hub_skills/roles/data-engineer/skills/05-performance-optimization/README.md +547 -0
  35. package/tech_hub_skills/roles/data-governance/skills/01-data-catalog/README.md +112 -0
  36. package/tech_hub_skills/roles/data-governance/skills/02-data-lineage/README.md +129 -0
  37. package/tech_hub_skills/roles/data-governance/skills/03-data-quality-framework/README.md +182 -0
  38. package/tech_hub_skills/roles/data-governance/skills/04-access-control/README.md +39 -0
  39. package/tech_hub_skills/roles/data-governance/skills/05-master-data-management/README.md +40 -0
  40. package/tech_hub_skills/roles/data-governance/skills/06-compliance-privacy/README.md +46 -0
  41. package/tech_hub_skills/roles/data-scientist/skills/01-eda-automation/README.md +230 -0
  42. package/tech_hub_skills/roles/data-scientist/skills/02-statistical-modeling/README.md +264 -0
  43. package/tech_hub_skills/roles/data-scientist/skills/03-feature-engineering/README.md +264 -0
  44. package/tech_hub_skills/roles/data-scientist/skills/04-predictive-modeling/README.md +264 -0
  45. package/tech_hub_skills/roles/data-scientist/skills/05-customer-analytics/README.md +264 -0
  46. package/tech_hub_skills/roles/data-scientist/skills/06-campaign-analysis/README.md +264 -0
  47. package/tech_hub_skills/roles/data-scientist/skills/07-experimentation/README.md +264 -0
  48. package/tech_hub_skills/roles/data-scientist/skills/08-data-visualization/README.md +264 -0
  49. package/tech_hub_skills/roles/devops/skills/01-cicd-pipeline/README.md +264 -0
  50. package/tech_hub_skills/roles/devops/skills/02-container-orchestration/README.md +264 -0
  51. package/tech_hub_skills/roles/devops/skills/03-infrastructure-as-code/README.md +264 -0
  52. package/tech_hub_skills/roles/devops/skills/04-gitops/README.md +264 -0
  53. package/tech_hub_skills/roles/devops/skills/05-environment-management/README.md +264 -0
  54. package/tech_hub_skills/roles/devops/skills/06-automated-testing/README.md +264 -0
  55. package/tech_hub_skills/roles/devops/skills/07-release-management/README.md +264 -0
  56. package/tech_hub_skills/roles/devops/skills/08-monitoring-alerting/README.md +264 -0
  57. package/tech_hub_skills/roles/devops/skills/09-devsecops/README.md +265 -0
  58. package/tech_hub_skills/roles/finops/skills/01-cost-visibility/README.md +264 -0
  59. package/tech_hub_skills/roles/finops/skills/02-resource-tagging/README.md +264 -0
  60. package/tech_hub_skills/roles/finops/skills/03-budget-management/README.md +264 -0
  61. package/tech_hub_skills/roles/finops/skills/04-reserved-instances/README.md +264 -0
  62. package/tech_hub_skills/roles/finops/skills/05-spot-optimization/README.md +264 -0
  63. package/tech_hub_skills/roles/finops/skills/06-storage-tiering/README.md +264 -0
  64. package/tech_hub_skills/roles/finops/skills/07-compute-rightsizing/README.md +264 -0
  65. package/tech_hub_skills/roles/finops/skills/08-chargeback/README.md +264 -0
  66. package/tech_hub_skills/roles/ml-engineer/skills/01-mlops-pipeline/README.md +566 -0
  67. package/tech_hub_skills/roles/ml-engineer/skills/02-feature-engineering/README.md +655 -0
  68. package/tech_hub_skills/roles/ml-engineer/skills/03-model-training/README.md +704 -0
  69. package/tech_hub_skills/roles/ml-engineer/skills/04-model-serving/README.md +845 -0
  70. package/tech_hub_skills/roles/ml-engineer/skills/05-model-monitoring/README.md +874 -0
  71. package/tech_hub_skills/roles/mlops/skills/01-ml-pipeline-orchestration/README.md +264 -0
  72. package/tech_hub_skills/roles/mlops/skills/02-experiment-tracking/README.md +264 -0
  73. package/tech_hub_skills/roles/mlops/skills/03-model-registry/README.md +264 -0
  74. package/tech_hub_skills/roles/mlops/skills/04-feature-store/README.md +264 -0
  75. package/tech_hub_skills/roles/mlops/skills/05-model-deployment/README.md +264 -0
  76. package/tech_hub_skills/roles/mlops/skills/06-model-observability/README.md +264 -0
  77. package/tech_hub_skills/roles/mlops/skills/07-data-versioning/README.md +264 -0
  78. package/tech_hub_skills/roles/mlops/skills/08-ab-testing/README.md +264 -0
  79. package/tech_hub_skills/roles/mlops/skills/09-automated-retraining/README.md +264 -0
  80. package/tech_hub_skills/roles/platform-engineer/skills/01-internal-developer-platform/README.md +153 -0
  81. package/tech_hub_skills/roles/platform-engineer/skills/02-self-service-infrastructure/README.md +57 -0
  82. package/tech_hub_skills/roles/platform-engineer/skills/03-slo-sli-management/README.md +59 -0
  83. package/tech_hub_skills/roles/platform-engineer/skills/04-developer-experience/README.md +57 -0
  84. package/tech_hub_skills/roles/platform-engineer/skills/05-incident-management/README.md +73 -0
  85. package/tech_hub_skills/roles/platform-engineer/skills/06-capacity-management/README.md +59 -0
  86. package/tech_hub_skills/roles/product-designer/skills/01-requirements-discovery/README.md +407 -0
  87. package/tech_hub_skills/roles/product-designer/skills/02-user-research/README.md +382 -0
  88. package/tech_hub_skills/roles/product-designer/skills/03-brainstorming-ideation/README.md +437 -0
  89. package/tech_hub_skills/roles/product-designer/skills/04-ux-design/README.md +496 -0
  90. package/tech_hub_skills/roles/product-designer/skills/05-product-market-fit/README.md +376 -0
  91. package/tech_hub_skills/roles/product-designer/skills/06-stakeholder-management/README.md +412 -0
  92. package/tech_hub_skills/roles/security-architect/skills/01-pii-detection/README.md +319 -0
  93. package/tech_hub_skills/roles/security-architect/skills/02-threat-modeling/README.md +264 -0
  94. package/tech_hub_skills/roles/security-architect/skills/03-infrastructure-security/README.md +264 -0
  95. package/tech_hub_skills/roles/security-architect/skills/04-iam/README.md +264 -0
  96. package/tech_hub_skills/roles/security-architect/skills/05-application-security/README.md +264 -0
  97. package/tech_hub_skills/roles/security-architect/skills/06-secrets-management/README.md +264 -0
  98. package/tech_hub_skills/roles/security-architect/skills/07-security-monitoring/README.md +264 -0
  99. package/tech_hub_skills/roles/system-design/skills/01-architecture-patterns/README.md +337 -0
  100. package/tech_hub_skills/roles/system-design/skills/02-requirements-engineering/README.md +264 -0
  101. package/tech_hub_skills/roles/system-design/skills/03-scalability/README.md +264 -0
  102. package/tech_hub_skills/roles/system-design/skills/04-high-availability/README.md +264 -0
  103. package/tech_hub_skills/roles/system-design/skills/05-cost-optimization-design/README.md +264 -0
  104. package/tech_hub_skills/roles/system-design/skills/06-api-design/README.md +264 -0
  105. package/tech_hub_skills/roles/system-design/skills/07-observability-architecture/README.md +264 -0
  106. package/tech_hub_skills/roles/system-design/skills/08-process-automation/PROCESS_TEMPLATE.md +336 -0
  107. package/tech_hub_skills/roles/system-design/skills/08-process-automation/README.md +521 -0
  108. package/tech_hub_skills/skills/README.md +336 -0
  109. package/tech_hub_skills/skills/ai-engineer.md +104 -0
  110. package/tech_hub_skills/skills/azure.md +149 -0
  111. package/tech_hub_skills/skills/code-review.md +399 -0
  112. package/tech_hub_skills/skills/compliance-automation.md +747 -0
  113. package/tech_hub_skills/skills/data-engineer.md +113 -0
  114. package/tech_hub_skills/skills/data-governance.md +102 -0
  115. package/tech_hub_skills/skills/data-scientist.md +123 -0
  116. package/tech_hub_skills/skills/devops.md +160 -0
  117. package/tech_hub_skills/skills/docker.md +160 -0
  118. package/tech_hub_skills/skills/enterprise-dashboard.md +613 -0
  119. package/tech_hub_skills/skills/finops.md +184 -0
  120. package/tech_hub_skills/skills/ml-engineer.md +115 -0
  121. package/tech_hub_skills/skills/mlops.md +187 -0
  122. package/tech_hub_skills/skills/optimization-advisor.md +329 -0
  123. package/tech_hub_skills/skills/orchestrator.md +497 -0
  124. package/tech_hub_skills/skills/platform-engineer.md +102 -0
  125. package/tech_hub_skills/skills/process-automation.md +226 -0
  126. package/tech_hub_skills/skills/process-changelog.md +184 -0
  127. package/tech_hub_skills/skills/process-documentation.md +484 -0
  128. package/tech_hub_skills/skills/process-kanban.md +324 -0
  129. package/tech_hub_skills/skills/process-versioning.md +214 -0
  130. package/tech_hub_skills/skills/product-designer.md +104 -0
  131. package/tech_hub_skills/skills/project-starter.md +443 -0
  132. package/tech_hub_skills/skills/security-architect.md +135 -0
  133. package/tech_hub_skills/skills/system-design.md +126 -0
@@ -0,0 +1,579 @@
# Skill 3: Data Quality & Validation

## 🎯 Overview
Implement comprehensive data quality frameworks with automated validation, monitoring, anomaly detection, and data profiling to ensure data reliability and trustworthiness.

## 🔗 Connections
- **Data Engineer**: Quality gates for lakehouse layers (de-01, de-02)
- **ML Engineer**: Feature quality validation (ml-02, ml-03)
- **MLOps**: Drift detection and data monitoring (mo-05, mo-06)
- **AI Engineer**: RAG data quality assurance (ai-02)
- **Data Scientist**: Clean data for analysis (ds-01, ds-02)
- **Security Architect**: PII detection and data governance (sa-01, sa-06)
- **FinOps**: Prevent costly data quality issues (fo-01)
- **DevOps**: Automated quality testing in CI/CD (do-01, do-02)

## 🛠️ Tools Included

### 1. `data_validator.py`
Comprehensive data validation rules engine with 50+ built-in validators.

### 2. `data_profiler.py`
Statistical profiling and summary statistics generation.

### 3. `anomaly_detector.py`
ML-based anomaly detection for data quality monitoring.

### 4. `quality_dashboard.py`
Real-time data quality dashboards and reporting.

### 5. `data_quality_rules.yaml`
Declarative quality rules configuration.

## 📊 Data Quality Dimensions

```
Completeness → Are all required fields present?
Accuracy     → Is the data correct?
Consistency  → Is data uniform across systems?
Validity     → Does data conform to rules?
Timeliness   → Is data fresh and up-to-date?
Uniqueness   → Are there duplicates?
```
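Each of these dimensions can be reduced to a score between 0 and 1. A minimal, framework-agnostic sketch in pure Python — the record layout, the `age` validity rule, and the function name are illustrative; the packaged `data_validator.py` and `data_profiler.py` are the real entry points:

```python
def dimension_scores(rows, required, key):
    """Score a few quality dimensions as fractions in [0, 1].

    rows: list of dicts; required: fields that must be non-null;
    key: column expected to be unique.
    """
    total = len(rows)
    # Completeness: share of rows with every required field populated
    complete = sum(all(r.get(c) is not None for c in required) for r in rows)
    # Uniqueness: share of rows carrying a first-seen key value
    seen, unique = set(), 0
    for r in rows:
        if r[key] not in seen:
            unique += 1
            seen.add(r[key])
    # Validity: example range rule on a single column
    valid = sum(0 <= r["age"] <= 120 for r in rows)
    return {
        "completeness": complete / total,
        "uniqueness": unique / total,
        "validity": valid / total,
    }

rows = [
    {"customer_id": 1, "email": "a@x.com", "age": 34},
    {"customer_id": 2, "email": None, "age": 29},
    {"customer_id": 2, "email": "c@x.com", "age": 150},
    {"customer_id": 4, "email": "d@x.com", "age": 41},
]
print(dimension_scores(rows, required=["customer_id", "email"], key="customer_id"))
# → {'completeness': 0.75, 'uniqueness': 0.75, 'validity': 0.75}
```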

## 🚀 Quick Start

```python
from data_validator import DataValidator
from data_profiler import DataProfiler

# Initialize validator
validator = DataValidator()

# Define quality rules
rules = {
    "completeness": [
        validator.not_null("customer_id"),
        validator.not_null("email"),
        validator.not_null("created_at")
    ],
    "validity": [
        validator.email_format("email"),
        validator.value_in_range("age", min_val=0, max_val=120),
        validator.matches_regex("phone", r"^\+?\d{10,15}$")
    ],
    "uniqueness": [
        validator.unique("customer_id")
    ],
    "consistency": [
        validator.referential_integrity(
            column="country_code",
            reference_table="countries",
            reference_column="code"
        )
    ]
}

# Validate data
df = spark.read.table("silver.customers")
validation_result = validator.validate(df, rules)

# Check results
if validation_result.passed:
    print("✓ All quality checks passed!")
else:
    print(f"✗ {validation_result.failed_count} checks failed")
    for failure in validation_result.failures:
        print(f"  - {failure.rule}: {failure.message}")
        print(f"    Failed records: {failure.failed_count}")

# Profile data
profiler = DataProfiler()
profile = profiler.profile(df)

print(f"Total records: {profile.row_count:,}")
print(f"Total columns: {profile.column_count}")
print(f"Missing values: {profile.missing_percentage:.2f}%")
print(f"Duplicate records: {profile.duplicate_count}")
```

## 📚 Best Practices

### Data Quality Framework (Data Engineer Integration)

1. **Implement Quality Gates**
   - Validate at each lakehouse layer (Bronze → Silver → Gold)
   - Block bad data from propagating downstream
   - Quarantine failed records for review
   - Alert on quality threshold violations
   - Reference: Data Engineer de-01 (Lakehouse)

2. **Automated Quality Checks**
   - Run validation on every pipeline execution
   - Schedule periodic data profiling
   - Monitor quality metrics over time
   - Integrate with CI/CD pipelines
   - Reference: DevOps do-01 (CI/CD), do-02 (Testing)

3. **Data Profiling Strategy**
   - Profile new data sources before ingestion
   - Generate statistical summaries
   - Detect schema drift automatically
   - Track data distributions over time
   - Reference: Data Engineer best practices

4. **Quality Metrics & KPIs**
   - Define quality SLAs for each dataset
   - Track quality trends over time
   - Dashboard for stakeholder visibility
   - Alert on quality degradation
   - Reference: DevOps do-08 (Monitoring)
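The quarantine pattern behind these quality gates can be sketched in a few lines: each record is checked against named rules, and the first failure diverts it into a quarantine set tagged with the reason for later review. The rule names and record shape below are hypothetical:

```python
def apply_quality_gate(records, rules):
    """Split records into (passed, quarantined) using predicate rules.

    rules: mapping of rule name -> predicate over a record. A record is
    quarantined on its first failing rule, tagged with that rule's name.
    """
    passed, quarantined = [], []
    for rec in records:
        failure = next((name for name, check in rules.items() if not check(rec)), None)
        if failure is None:
            passed.append(rec)
        else:
            quarantined.append({**rec, "_failed_rule": failure})
    return passed, quarantined

rules = {
    "id_not_null": lambda r: r.get("id") is not None,
    "amount_non_negative": lambda r: r.get("amount", 0) >= 0,
}
ok, bad = apply_quality_gate(
    [{"id": 1, "amount": 10}, {"id": None, "amount": 5}, {"id": 3, "amount": -2}],
    rules,
)
print(len(ok), [r["_failed_rule"] for r in bad])  # → 1 ['id_not_null', 'amount_non_negative']
```

Only the `passed` set flows downstream; the quarantined records go to a dead-letter table for review.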

### ML/AI Data Quality (ML Engineer & MLOps Integration)

5. **Feature Quality Validation**
   - Validate feature distributions
   - Check for feature drift
   - Detect data leakage
   - Monitor feature correlations
   - Reference: ML Engineer ml-02 (Feature Engineering), MLOps mo-05 (Drift)

6. **Training Data Quality**
   - Validate labels and targets
   - Check class balance
   - Detect outliers and anomalies
   - Ensure temporal consistency
   - Reference: ML Engineer ml-01 (Training)

7. **Data Drift Detection**
   - Monitor statistical properties over time
   - Detect distribution shifts
   - Alert on significant drift
   - Trigger retraining when needed
   - Reference: MLOps mo-05 (Drift Detection)
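A common statistic for detecting distribution shift is the Population Stability Index (PSI) over binned values. A stdlib-only sketch — the bin edges and the 0.2 alert threshold are conventional choices, not part of this package:

```python
import math

def psi(expected, actual, edges):
    """Population Stability Index between two samples over shared bin edges."""
    def fractions(sample):
        counts = [0] * (len(edges) - 1)
        for x in sample:
            for i in range(len(edges) - 1):
                if edges[i] <= x < edges[i + 1]:
                    counts[i] += 1
                    break
        # Smooth empty bins so the log term is defined
        return [max(c / len(sample), 1e-4) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # training-time distribution
today    = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  # shifted production batch
score = psi(baseline, today, edges=[0.0, 0.25, 0.5, 0.75, 1.01])
print(f"PSI = {score:.2f}, drift: {score > 0.2}")
```

A PSI above roughly 0.2 is the usual signal to alert and consider retraining.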

### Security & Governance (Security Architect Integration)

8. **PII Detection in Quality Checks**
   - Scan for unexpected PII in data
   - Validate PII masking/encryption
   - Alert on PII in non-compliant columns
   - Audit PII handling
   - Reference: Security Architect sa-01 (PII Detection)

9. **Data Governance Integration**
   - Enforce data lineage tracking
   - Validate data classification tags
   - Compliance checks (GDPR, CCPA)
   - Retention policy validation
   - Reference: Security Architect sa-06 (Data Governance)

10. **Access Control Validation**
    - Verify row-level security rules
    - Audit data access patterns
    - Detect unauthorized access attempts
    - Validate encryption status
    - Reference: Security Architect sa-02 (IAM)

### Cost Optimization (FinOps Integration)

11. **Prevent Costly Quality Issues**
    - Catch bad data before expensive processing
    - Reduce compute waste on invalid data
    - Minimize storage of duplicates
    - Avoid downstream rework costs
    - Reference: FinOps fo-01 (Cost Monitoring)

12. **Optimize Quality Check Costs**
    - Use sampling for large datasets
    - Cache validation results
    - Incremental quality checks
    - Right-size compute for validation jobs
    - Reference: FinOps fo-06 (Compute Optimization)

### Azure-Specific Best Practices

13. **Azure Purview Integration**
    - Auto-discover data assets
    - Classify data automatically
    - Track data lineage
    - Quality annotations in catalog
    - Reference: Azure best practices

14. **Azure Monitor Integration**
    - Log quality metrics to Application Insights
    - Create custom dashboards
    - Set up alert rules
    - Track quality trends over time
    - Reference: DevOps do-08 (Monitoring)

15. **Azure Data Factory Data Flows**
    - Use data preview for validation
    - Implement conditional splits for quality
    - Error row handling
    - Quality metrics in pipeline monitoring
    - Reference: Azure az-01 (Data Factory)

### Anomaly Detection

16. **Statistical Anomaly Detection**
    - Z-score for univariate outliers
    - Isolation Forest for multivariate anomalies
    - Time-series anomaly detection
    - Seasonal pattern analysis
    - Reference: Data Scientist ds-01 (EDA)

17. **ML-Based Quality Monitoring**
    - Train models on historical quality patterns
    - Predict quality issues before they occur
    - Adaptive thresholds based on trends
    - Automated root cause analysis
    - Reference: ML Engineer ml-01, MLOps mo-04
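The z-score check from item 16 is easy to sketch with the standard library: flag values more than k standard deviations from the mean. The row-count example and the k value are illustrative; `anomaly_detector.py` covers the multivariate and time-series variants:

```python
from statistics import mean, stdev

def zscore_outliers(values, k=3.0):
    """Return (index, value) pairs whose z-score magnitude exceeds k."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # constant series: nothing can be an outlier
    return [(i, v) for i, v in enumerate(values) if abs((v - mu) / sigma) > k]

# Daily row counts for a feed; the last load looks suspicious
daily_row_counts = [1000, 1020, 980, 1010, 995, 1005, 12000]
print(zscore_outliers(daily_row_counts, k=2.0))  # → [(6, 12000)]
```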

## 💰 Cost Optimization Examples

### Sampling Strategy for Large Datasets
```python
from data_validator import DataValidator
from sampling_strategies import SmartSampler

validator = DataValidator()
sampler = SmartSampler()

# Load large dataset
df = spark.read.table("bronze.events")  # 10TB table

# Smart sampling (stratified by key dimensions)
sample = sampler.stratified_sample(
    df,
    sample_size=1_000_000,  # 1M records instead of billions
    stratify_by=["event_type", "date"],
    confidence_level=0.95
)

print(f"Original size: {df.count():,} records")
print(f"Sample size: {sample.count():,} records")
print(f"Cost reduction: {sampler.cost_savings_percentage:.1f}%")

# Run validation on the sample (quality_rules defined as in Quick Start)
validation_result = validator.validate(sample, quality_rules)

# Extrapolate results to full dataset
estimated_failures = validation_result.extrapolate_to_population(
    population_size=df.count()
)

print(f"Estimated failures in full dataset: {estimated_failures:,}")

# If the sample passes, validate the full dataset incrementally
if validation_result.passed:
    # Only validate new/changed records since the last run
    incremental_result = validator.validate_incremental(
        df,
        watermark_column="ingestion_date",
        last_validated_date=last_run_date
    )
```

### Optimize Validation Compute Costs
```python
from data_validator import DataValidator
from finops_tracker import ValidationCostTracker

validator = DataValidator()
cost_tracker = ValidationCostTracker()

# Cache expensive validation results
@cost_tracker.track_validation_costs
def validate_with_caching(table_name: str, rules: dict):
    # Skip revalidation if the data hasn't changed
    current_hash = calculate_table_hash(table_name)
    cached_result = validation_cache.get(current_hash)

    if cached_result:
        print("✓ Using cached validation result")
        print(f"Cost saved: ${cached_result.cost_saved:.2f}")
        return cached_result

    # Run validation
    df = spark.read.table(table_name)
    result = validator.validate(df, rules)

    # Cache for future use
    validation_cache.set(current_hash, result, ttl=3600)  # 1 hour

    return result

# Incremental validation (only new data)
def validate_incremental_only(table_name: str, rules: dict):
    # Only validate records since the last check
    df = spark.read.table(table_name)
    new_records = df.filter(
        f"ingestion_date > '{last_validation_date}'"
    )

    print(f"Validating {new_records.count():,} new records")
    print(f"Skipping {df.count() - new_records.count():,} already validated")

    result = validator.validate(new_records, rules)
    return result

# Cost report
report = cost_tracker.generate_report(period="monthly")
print(f"Total validation costs: ${report.total_cost:.2f}")
print(f"Savings from caching: ${report.caching_savings:.2f}")
print(f"Savings from sampling: ${report.sampling_savings:.2f}")
```

## 🔒 Security Integration Examples

### PII Detection in Quality Checks
```python
from data_validator import DataValidator
from pii_detector import PIIDetector  # from sa-01

validator = DataValidator()
pii_detector = PIIDetector()

# Comprehensive quality + security validation
def validate_with_pii_check(df, table_name: str):
    # Standard quality checks
    quality_result = validator.validate(df, quality_rules)

    # PII detection
    pii_result = pii_detector.scan_dataframe(df)

    if pii_result.pii_found:
        print(f"⚠️ PII detected in {table_name}:")
        for finding in pii_result.findings:
            print(f"  - Column '{finding.column}': {finding.pii_type}")
            print(f"    Confidence: {finding.confidence:.2%}")
            print(f"    Sample count: {finding.count}")

    # Check whether PII is confined to expected columns
    unexpected_pii = [
        f for f in pii_result.findings
        if f.column not in ALLOWED_PII_COLUMNS
    ]

    if unexpected_pii:
        raise DataQualityError(
            f"PII found in unexpected columns: {unexpected_pii}"
        )

    # Combined result
    return {
        "quality_passed": quality_result.passed,
        "pii_compliant": len(unexpected_pii) == 0,
        "quality_failures": quality_result.failures,
        "pii_findings": pii_result.findings
    }
```

### Data Lineage in Quality Reports
```python
from datetime import datetime

from data_validator import DataValidator
from data_lineage import LineageTracker

validator = DataValidator()
lineage_tracker = LineageTracker()

def validate_with_lineage(table_name: str, rules: dict):
    df = spark.read.table(table_name)

    # Run validation
    result = validator.validate(df, rules)

    # Track quality in lineage
    lineage_tracker.log_quality_check(
        dataset=table_name,
        timestamp=datetime.now(),
        passed=result.passed,
        quality_score=result.quality_score,
        failed_rules=[f.rule for f in result.failures]
    )

    # If the quality check fails, trace back to the sources
    if not result.passed:
        lineage = lineage_tracker.get_lineage(table_name)
        print(f"Quality issue in {table_name}")
        print(f"Sources: {lineage.sources}")
        print(f"Transformations: {lineage.transformations}")

        # Check quality of upstream sources
        for source in lineage.sources:
            source_quality = validator.get_quality_history(source)
            print(f"  {source}: {source_quality.latest_score:.2f}")

    return result
```

## 📊 Enhanced Metrics & Monitoring

| Metric Category | Metric | Target | Tool |
|-----------------|--------|--------|------|
| **Completeness** | Null rate | <1% | Data validator |
| | Required field coverage | 100% | Data validator |
| **Accuracy** | Schema validation pass rate | >99% | Data validator |
| | Data type conformance | 100% | Data profiler |
| **Consistency** | Cross-table referential integrity | 100% | Data validator |
| | Format standardization | >98% | Data validator |
| **Validity** | Business rule compliance | >98% | Data validator |
| | Range constraint violations | <1% | Data validator |
| **Timeliness** | Data freshness (SLA) | <1 hour | Azure Monitor |
| | Staleness detection | 0 tables | Data profiler |
| **Uniqueness** | Duplicate rate | <0.1% | Data validator |
| | Primary key violations | 0 | Data validator |
| **Overall** | Composite quality score | >95% | Quality dashboard |
| | Quality SLA compliance | >99% | Quality dashboard |
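The composite quality score in the table is typically a weighted average of the per-dimension scores. A minimal sketch — the dimension weights here are illustrative, not shipped defaults:

```python
def composite_score(dimension_scores, weights=None):
    """Weighted average of per-dimension quality scores (each in [0, 1])."""
    weights = weights or {d: 1.0 for d in dimension_scores}  # default: equal weight
    total = sum(weights[d] for d in dimension_scores)
    return sum(dimension_scores[d] * weights[d] for d in dimension_scores) / total

scores = {"completeness": 0.99, "accuracy": 0.97, "validity": 0.98, "uniqueness": 1.0}
weights = {"completeness": 2.0, "accuracy": 2.0, "validity": 1.0, "uniqueness": 1.0}
overall = composite_score(scores, weights)
print(f"quality score: {overall:.3f}, SLA (>0.95) met: {overall > 0.95}")
```

Weighting lets critical dimensions (e.g. completeness on a billing table) dominate the SLA check.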

## 🚀 Data Quality Pipeline

### Automated Quality Framework
```python
from data_validator import DataValidator
from data_profiler import DataProfiler
from quality_dashboard import QualityDashboard
from alert_manager import AlertManager

class DataQualityFramework:
    def __init__(self):
        self.validator = DataValidator()
        self.dashboard = QualityDashboard()
        self.alerter = AlertManager()

    def validate_table(self, table_name: str, layer: str):
        """Validate table with layer-specific rules"""
        # Load rules for layer
        rules = self.load_rules(table_name, layer)

        # Read data
        df = spark.read.table(table_name)

        # Profile data
        profile = DataProfiler().profile(df)

        # Validate
        result = self.validator.validate(df, rules)

        # Calculate quality score
        quality_score = self.calculate_quality_score(result, profile)

        # Update dashboard
        self.dashboard.update(
            table_name=table_name,
            quality_score=quality_score,
            profile=profile,
            validation_result=result
        )

        # Check SLA
        sla_threshold = self.get_sla_threshold(table_name)
        if quality_score < sla_threshold:
            self.alerter.send_alert(
                severity="high",
                title=f"Quality SLA Breach: {table_name}",
                message=f"Quality score {quality_score:.2f} below threshold {sla_threshold:.2f}",
                failures=result.failures
            )

        # Quarantine if critical failures
        if result.has_critical_failures:
            self.quarantine_bad_data(table_name, result)

        return result

    def load_rules(self, table_name: str, layer: str):
        """Load validation rules from configuration"""
        # Bronze layer: basic schema validation
        if layer == "bronze":
            return {
                "schema_validation": True,
                "null_checks": ["id", "timestamp"],
                "type_checks": True
            }

        # Silver layer: comprehensive validation
        elif layer == "silver":
            return {
                "schema_validation": True,
                "completeness_checks": True,
                "validity_checks": True,
                "consistency_checks": True,
                "uniqueness_checks": True
            }

        # Gold layer: business rule validation
        elif layer == "gold":
            return {
                "business_rules": self.load_business_rules(table_name),
                "aggregation_checks": True,
                "referential_integrity": True
            }

        raise ValueError(f"Unknown layer: {layer}")

# Usage
framework = DataQualityFramework()

# Validate Bronze layer
framework.validate_table("bronze.events", layer="bronze")

# Validate Silver layer
framework.validate_table("silver.events_clean", layer="silver")

# Validate Gold layer
framework.validate_table("gold.daily_metrics", layer="gold")
```

## 🔄 Integration Workflow

### End-to-End Quality Process
```
1. Data Ingestion (de-02)

2. Bronze Layer Quality Check (de-03)
   - Schema validation
   - Basic completeness

3. Silver Layer Quality Check (de-03)
   - Comprehensive validation
   - PII detection (sa-01)
   - Data profiling

4. Quarantine Bad Data
   - Dead letter queue
   - Manual review process

5. Gold Layer Quality Check (de-03)
   - Business rule validation
   - Aggregation checks

6. Feature Quality Check (ml-02)
   - Feature distribution
   - Drift detection (mo-05)

7. Quality Monitoring (do-08)
   - Dashboard updates
   - Trend analysis
   - Alerting

8. Continuous Improvement
   - Root cause analysis
   - Rule optimization
   - SLA tuning
```

## 🎯 Quick Wins

1. **Implement schema validation** - Catch breaking changes immediately
2. **Add null checks on critical fields** - Prevent downstream failures
3. **Set up duplicate detection** - Reduce storage costs
4. **Enable data profiling** - Understand data distribution
5. **Create quality dashboard** - Visibility for stakeholders
6. **Implement PII scanning** - Prevent compliance violations
7. **Add anomaly detection** - Catch unusual patterns early
8. **Set up quality alerts** - Reduce MTTR for data issues
9. **Use sampling for validation** - 70-90% cost reduction
10. **Track quality trends** - Proactive quality improvement