@zigrivers/scaffold 3.8.0 → 3.9.0
- package/README.md +73 -8
- package/content/knowledge/browser-extension/browser-extension-architecture.md +195 -0
- package/content/knowledge/browser-extension/browser-extension-content-scripts.md +264 -0
- package/content/knowledge/browser-extension/browser-extension-conventions.md +156 -0
- package/content/knowledge/browser-extension/browser-extension-cross-browser.md +229 -0
- package/content/knowledge/browser-extension/browser-extension-dev-environment.md +247 -0
- package/content/knowledge/browser-extension/browser-extension-manifest.md +220 -0
- package/content/knowledge/browser-extension/browser-extension-project-structure.md +183 -0
- package/content/knowledge/browser-extension/browser-extension-requirements.md +107 -0
- package/content/knowledge/browser-extension/browser-extension-security.md +202 -0
- package/content/knowledge/browser-extension/browser-extension-service-workers.md +265 -0
- package/content/knowledge/browser-extension/browser-extension-store-submission.md +155 -0
- package/content/knowledge/browser-extension/browser-extension-testing.md +270 -0
- package/content/knowledge/data-pipeline/data-pipeline-architecture.md +175 -0
- package/content/knowledge/data-pipeline/data-pipeline-batch-patterns.md +263 -0
- package/content/knowledge/data-pipeline/data-pipeline-conventions.md +176 -0
- package/content/knowledge/data-pipeline/data-pipeline-dev-environment.md +350 -0
- package/content/knowledge/data-pipeline/data-pipeline-orchestration.md +291 -0
- package/content/knowledge/data-pipeline/data-pipeline-project-structure.md +257 -0
- package/content/knowledge/data-pipeline/data-pipeline-quality.md +324 -0
- package/content/knowledge/data-pipeline/data-pipeline-requirements.md +145 -0
- package/content/knowledge/data-pipeline/data-pipeline-schema-management.md +295 -0
- package/content/knowledge/data-pipeline/data-pipeline-security.md +326 -0
- package/content/knowledge/data-pipeline/data-pipeline-streaming-patterns.md +280 -0
- package/content/knowledge/data-pipeline/data-pipeline-testing.md +406 -0
- package/content/knowledge/ml/ml-architecture.md +172 -0
- package/content/knowledge/ml/ml-conventions.md +209 -0
- package/content/knowledge/ml/ml-dev-environment.md +299 -0
- package/content/knowledge/ml/ml-experiment-tracking.md +285 -0
- package/content/knowledge/ml/ml-model-evaluation.md +256 -0
- package/content/knowledge/ml/ml-observability.md +253 -0
- package/content/knowledge/ml/ml-project-structure.md +216 -0
- package/content/knowledge/ml/ml-requirements.md +138 -0
- package/content/knowledge/ml/ml-security.md +188 -0
- package/content/knowledge/ml/ml-serving-patterns.md +243 -0
- package/content/knowledge/ml/ml-testing.md +301 -0
- package/content/knowledge/ml/ml-training-patterns.md +269 -0
- package/content/methodology/browser-extension-overlay.yml +82 -0
- package/content/methodology/data-pipeline-overlay.yml +70 -0
- package/content/methodology/ml-overlay.yml +70 -0
- package/dist/cli/commands/init.d.ts +13 -0
- package/dist/cli/commands/init.d.ts.map +1 -1
- package/dist/cli/commands/init.js +122 -2
- package/dist/cli/commands/init.js.map +1 -1
- package/dist/cli/commands/init.test.js +120 -0
- package/dist/cli/commands/init.test.js.map +1 -1
- package/dist/config/schema.d.ts +864 -48
- package/dist/config/schema.d.ts.map +1 -1
- package/dist/config/schema.js +53 -0
- package/dist/config/schema.js.map +1 -1
- package/dist/config/schema.test.js +166 -3
- package/dist/config/schema.test.js.map +1 -1
- package/dist/core/assembly/overlay-loader.test.js +33 -0
- package/dist/core/assembly/overlay-loader.test.js.map +1 -1
- package/dist/e2e/project-type-overlays.test.d.ts +2 -2
- package/dist/e2e/project-type-overlays.test.js +499 -33
- package/dist/e2e/project-type-overlays.test.js.map +1 -1
- package/dist/types/config.d.ts +10 -1
- package/dist/types/config.d.ts.map +1 -1
- package/dist/wizard/questions.d.ts +17 -1
- package/dist/wizard/questions.d.ts.map +1 -1
- package/dist/wizard/questions.js +72 -1
- package/dist/wizard/questions.js.map +1 -1
- package/dist/wizard/questions.test.js +135 -0
- package/dist/wizard/questions.test.js.map +1 -1
- package/dist/wizard/wizard.d.ts +13 -0
- package/dist/wizard/wizard.d.ts.map +1 -1
- package/dist/wizard/wizard.js +17 -1
- package/dist/wizard/wizard.js.map +1 -1
- package/package.json +1 -1
|
@@ -0,0 +1,257 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: data-pipeline-project-structure
|
|
3
|
+
description: Canonical directory layout for data pipeline projects covering DAGs, transforms, sinks, quality checks, configuration, and tests
|
|
4
|
+
topics: [data-pipeline, project-structure, dags, transforms, sinks, quality, config, tests]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
A well-organized data pipeline project separates pipeline orchestration logic from transformation logic, data sinks, and quality checks. This separation allows each concern to evolve independently and makes the codebase navigable for engineers unfamiliar with the project. The structure below is opinionated but widely applicable to Airflow, Prefect, Dagster, and similar orchestration tools.
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
Organize pipeline projects into `src/dags/` for orchestration, `src/transforms/` for pure transformation functions, `src/sources/` for read connectors, `src/sinks/` for write adapters, `src/quality/` for validation logic, `config/` for environment-specific settings, and `tests/` mirroring the `src/` structure. Keep business logic in transforms, not in DAG files. Never hardcode configuration.
|
|
12
|
+
|
|
13
|
+
## Deep Guidance
|
|
14
|
+
|
|
15
|
+
### Root Directory Layout
|
|
16
|
+
|
|
17
|
+
```
my-pipeline/
├── src/
│   ├── dags/                    # Orchestration DAGs / pipeline definitions
│   ├── transforms/              # Pure transformation functions
│   ├── sources/                 # Source connectors and readers
│   ├── sinks/                   # Destination connectors and writers
│   ├── quality/                 # Data quality checks and validators
│   └── utils/                   # Shared utilities (logging, metrics, retry)
├── config/
│   ├── base.yaml                # Shared configuration across environments
│   ├── dev.yaml                 # Development overrides
│   ├── staging.yaml             # Staging overrides
│   └── prod.yaml                # Production overrides
├── tests/
│   ├── unit/                    # Unit tests for transforms, sources, sinks
│   ├── integration/             # Integration tests against real (or containerized) systems
│   ├── quality/                 # Data quality test definitions
│   └── fixtures/                # Static test data files
├── scripts/
│   ├── bootstrap.sh             # Local environment setup
│   ├── replay.sh                # DLQ and historical replay tooling
│   └── backfill.sh              # Backfill orchestration helper
├── docker/
│   ├── docker-compose.yml       # Local development stack
│   └── docker-compose.test.yml  # Test isolation stack
├── Makefile
├── pyproject.toml               # Python project metadata and dependencies
└── README.md
```
|
|
47
|
+
|
|
48
|
+
### `src/dags/` — Orchestration Layer
|
|
49
|
+
|
|
50
|
+
DAG files define pipeline topology, scheduling, and task dependencies. They must be thin:
|
|
51
|
+
|
|
52
|
+
```
src/dags/
├── payments/
│   ├── transactions_ingest.py
│   ├── transactions_enrich.py
│   └── revenue_aggregate.py
├── users/
│   ├── events_ingest.py
│   └── profiles_sync.py
└── shared/
    ├── base_dag.py   # Shared DAG defaults (retries, alerts, SLA)
    └── sensors.py    # Common sensors (S3 sensor, DB sensor)
```
|
|
65
|
+
|
|
66
|
+
**Rule**: DAG files contain no business logic. A DAG file defines what runs, in what order, with what dependencies. All data processing logic lives in `src/transforms/`. A DAG task calls a function from `src/transforms/` — it does not implement the transformation inline.
|
|
67
|
+
|
|
68
|
+
Bad (logic in DAG):
|
|
69
|
+
```python
@task
def process_transactions(**context):
    df = pd.read_parquet(...)
    df['amount_usd'] = df['amount'] / df['exchange_rate']  # business logic in DAG
    df.to_parquet(...)
```
|
|
76
|
+
|
|
77
|
+
Good (DAG calls transform):
|
|
78
|
+
```python
# `sources.payments` and `sinks.warehouse` are illustrative module names;
# the imports assume src/ is on the import path.
import sinks.warehouse
import sources.payments
from transforms.payments import normalize_transaction_amounts

@task
def process_transactions(**context):
    records = sources.payments.read(context['date'])
    normalized = normalize_transaction_amounts(records)
    sinks.warehouse.write(normalized, partition=context['date'])
```
|
|
87
|
+
|
|
88
|
+
### `src/transforms/` — Transformation Logic
|
|
89
|
+
|
|
90
|
+
Transforms are pure functions: they take data in, return data out, with no side effects. This makes them independently testable without any pipeline infrastructure.
|
|
91
|
+
|
|
92
|
+
```
src/transforms/
├── payments/
│   ├── __init__.py
│   ├── normalize.py    # Currency normalization, amount parsing
│   ├── enrich.py       # Join with reference data
│   └── aggregate.py    # Revenue rollups, daily summaries
├── users/
│   ├── __init__.py
│   ├── deduplicate.py  # Deduplication logic
│   └── classify.py     # User segment classification
└── shared/
    ├── timestamps.py   # Timestamp parsing and normalization
    ├── currencies.py   # Currency conversion utilities
    └── pii.py          # PII masking and hashing functions
```
|
|
108
|
+
|
|
109
|
+
Pure transform function pattern:
|
|
110
|
+
```python
from transforms.shared.currencies import convert_to_usd  # assumed location of the helper


def normalize_transaction_amounts(records: list[dict]) -> list[dict]:
    """Convert all transaction amounts to USD using embedded exchange rates.

    Args:
        records: Raw transaction records with 'amount' and 'currency' fields

    Returns:
        Records with added 'amount_usd' field. Input records unchanged.
    """
    result = []
    for record in records:
        normalized = {**record}  # shallow copy so the caller's record is not mutated
        normalized['amount_usd'] = convert_to_usd(record['amount'], record['currency'])
        result.append(normalized)
    return result
```
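Because the transform has no side effects, it can be exercised with plain in-memory records and no orchestrator. The following is a minimal sketch of such a unit test, not the project's actual test suite: `RATES_TO_USD` and this version of `convert_to_usd` are hypothetical stand-ins with made-up rates, and the transform is restated so the example is self-contained.

```python
from copy import deepcopy

# Hypothetical fixed rates for illustration only; not real market data.
RATES_TO_USD = {"USD": 1.0, "EUR": 2.0}


def convert_to_usd(amount: float, currency: str) -> float:
    """Stand-in for the project's currency helper (an assumption)."""
    return amount * RATES_TO_USD[currency]


def normalize_transaction_amounts(records: list[dict]) -> list[dict]:
    """Same shape as the transform above: copy each record, add 'amount_usd'."""
    return [
        {**record, "amount_usd": convert_to_usd(record["amount"], record["currency"])}
        for record in records
    ]


def test_normalize_adds_amount_usd_without_mutating_input() -> None:
    raw = [{"amount": 10.0, "currency": "EUR"}, {"amount": 5.0, "currency": "USD"}]
    snapshot = deepcopy(raw)

    result = normalize_transaction_amounts(raw)

    assert [r["amount_usd"] for r in result] == [20.0, 5.0]
    assert raw == snapshot  # purity: the input records are untouched


test_normalize_adds_amount_usd_without_mutating_input()
```

The second assertion is the one that guards purity: it fails if a future change starts mutating the caller's records in place.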
|
|
127
|
+
|
|
128
|
+
### `src/sources/` — Source Connectors
|
|
129
|
+
|
|
130
|
+
Sources abstract the read interface for each upstream system:
|
|
131
|
+
|
|
132
|
+
```
src/sources/
├── stripe/
│   ├── __init__.py
│   ├── charges.py      # Stripe Charges API reader
│   └── webhooks.py     # Stripe webhook event reader
├── postgres/
│   ├── __init__.py
│   └── cdc_reader.py   # CDC via logical replication
├── s3/
│   ├── __init__.py
│   └── parquet_reader.py
└── kafka/
    ├── __init__.py
    └── consumer.py
```
|
|
148
|
+
|
|
149
|
+
Sources implement a common interface:
|
|
150
|
+
```python
from datetime import datetime
from typing import Iterable, Protocol

class Source(Protocol):
    def read(self, start: datetime, end: datetime) -> Iterable[dict]: ...
    def count(self, start: datetime, end: datetime) -> int: ...
```
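To illustrate how pipeline code can depend on the protocol rather than on a concrete connector, here is a small sketch with a list-backed source. `InMemorySource`, `extract`, the `created_at` field, and the half-open `[start, end)` window are all assumptions made for the example; the original does not specify window semantics.

```python
from datetime import datetime
from typing import Iterable, Protocol


class Source(Protocol):
    def read(self, start: datetime, end: datetime) -> Iterable[dict]: ...
    def count(self, start: datetime, end: datetime) -> int: ...


class InMemorySource:
    """Hypothetical source backed by a list; handy as a test double."""

    def __init__(self, records: list[dict]) -> None:
        self._records = records

    def read(self, start: datetime, end: datetime) -> Iterable[dict]:
        # Half-open window [start, end) so adjacent windows never overlap.
        return (r for r in self._records if start <= r["created_at"] < end)

    def count(self, start: datetime, end: datetime) -> int:
        return sum(1 for _ in self.read(start, end))


def extract(source: Source, start: datetime, end: datetime) -> list[dict]:
    """Depends only on the protocol, so any conforming connector can be passed in."""
    return list(source.read(start, end))


events = InMemorySource([
    {"id": 1, "created_at": datetime(2024, 1, 1, 9)},
    {"id": 2, "created_at": datetime(2024, 1, 2, 9)},
])
window = (datetime(2024, 1, 1), datetime(2024, 1, 2))
print(extract(events, *window))  # only the record with id 1 falls in the window
print(events.count(*window))
```

Because `Protocol` uses structural typing, `InMemorySource` does not need to inherit from `Source`; matching method signatures are enough for a type checker to accept it.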
|
|
155
|
+
|
|
156
|
+
### `src/sinks/` — Destination Writers
|
|
157
|
+
|
|
158
|
+
Sinks abstract the write interface for each downstream system:
|
|
159
|
+
|
|
160
|
+
```
src/sinks/
├── bigquery/
│   ├── __init__.py
│   ├── writer.py        # Streaming insert and load job writer
│   └── partitioned.py   # Partitioned table writer
├── s3/
│   ├── __init__.py
│   └── parquet_writer.py
├── postgres/
│   ├── __init__.py
│   └── upsert_writer.py
└── dlq/
    ├── __init__.py
    └── writer.py        # Dead-letter queue writer
```
|
|
176
|
+
|
|
177
|
+
Sinks implement idempotent write semantics:
|
|
178
|
+
```python
from typing import Iterable, Protocol

class Sink(Protocol):  # WriteResult is a project-defined result type
    def write(self, records: Iterable[dict], partition: str) -> "WriteResult": ...
    def write_dlq(self, record: dict, error: Exception, context: dict) -> None: ...
```
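One common way to get idempotent writes is to make each write replace its whole partition, so a retried batch cannot double-count. The sketch below shows that idea with an in-memory stand-in; `InMemoryPartitionSink` and the fields of `WriteResult` are hypothetical, since the original does not define them.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class WriteResult:
    """Hypothetical result type; the original does not define its fields."""
    partition: str
    rows_written: int


class InMemoryPartitionSink:
    """Idempotent by construction: each write replaces the whole partition."""

    def __init__(self) -> None:
        self.partitions: dict[str, list[dict]] = {}
        self.dead_letters: list[dict] = []

    def write(self, records: Iterable[dict], partition: str) -> WriteResult:
        rows = list(records)
        self.partitions[partition] = rows  # overwrite, never append
        return WriteResult(partition=partition, rows_written=len(rows))

    def write_dlq(self, record: dict, error: Exception, context: dict) -> None:
        self.dead_letters.append(
            {"record": record, "error": repr(error), "context": context}
        )


sink = InMemoryPartitionSink()
batch = [{"id": 1}, {"id": 2}]
sink.write(batch, partition="2024-01-01")
sink.write(batch, partition="2024-01-01")  # retry of the same batch
print(len(sink.partitions["2024-01-01"]))  # still 2 rows, not 4
```

Real sinks would typically reach the same property with partition overwrite, a keyed upsert, or a deduplicating load job, depending on the destination.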
|
|
183
|
+
|
|
184
|
+
### `src/quality/` — Data Quality Checks
|
|
185
|
+
|
|
186
|
+
Quality checks are assertions about data that run inline in the pipeline:
|
|
187
|
+
|
|
188
|
+
```
src/quality/
├── rules/
│   ├── completeness.py   # Null rate checks
│   ├── accuracy.py       # Range and format checks
│   ├── consistency.py    # Cross-field and cross-table checks
│   └── timeliness.py     # Freshness and late-arrival checks
├── expectations/
│   ├── payments.json     # Great Expectations suite for payments
│   └── users.json        # Great Expectations suite for users
└── reconciliation/
    ├── row_count.py      # Source vs destination row count comparison
    └── checksum.py       # Aggregate checksum comparison
```
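As an illustration of what a rule in `rules/completeness.py` might look like, here is a minimal null-rate check. The function name, the `CheckResult` type, and the decision to fail on an empty batch are assumptions of this sketch rather than anything prescribed by the layout above.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Hypothetical result type for a single quality rule."""
    passed: bool
    observed: float
    threshold: float


def check_completeness(records: list[dict], column: str, threshold: float = 0.99) -> CheckResult:
    """Pass when the share of non-null values in `column` is at least `threshold`."""
    if not records:
        # Treat an empty batch as a failure so an upstream outage
        # is not mistaken for clean data.
        return CheckResult(passed=False, observed=0.0, threshold=threshold)
    non_null = sum(1 for r in records if r.get(column) is not None)
    observed = non_null / len(records)
    return CheckResult(passed=observed >= threshold, observed=observed, threshold=threshold)


rows = [{"user_id": "usr_000001"}, {"user_id": None}, {"user_id": "usr_000002"}, {}]
result = check_completeness(rows, "user_id", threshold=0.5)
print(result)  # 2 of 4 rows are populated, which meets a 0.5 threshold
```

Returning a result object instead of raising lets the calling task decide whether a failed rule should block the run or only raise an alert.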
|
|
202
|
+
|
|
203
|
+
### `config/` — Environment Configuration
|
|
204
|
+
|
|
205
|
+
All pipeline configuration (connection strings, batch sizes, SLA thresholds, feature flags) lives in config files, never hardcoded:
|
|
206
|
+
|
|
207
|
+
```yaml
# config/base.yaml
pipeline:
  batch_size: 10000
  max_retries: 3
  retry_delay_seconds: 30

quality:
  completeness_threshold: 0.99
  row_count_variance_pct: 0.01

sla:
  max_lag_minutes: 60
```
|
|
221
|
+
|
|
222
|
+
Environment-specific files override base values only for what differs. Use a config loader that merges base + environment:
|
|
223
|
+
|
|
224
|
+
```python
import yaml  # PyYAML

def load_config(env: str = "dev") -> Config:
    # Config and deep_merge are project helpers that are not shown here.
    with open("config/base.yaml") as f:
        base = yaml.safe_load(f) or {}  # an empty file loads as None
    with open(f"config/{env}.yaml") as f:
        override = yaml.safe_load(f) or {}
    return Config(**deep_merge(base, override))
```
|
|
230
|
+
|
|
231
|
+
### `tests/` — Test Structure Mirrors Source
|
|
232
|
+
|
|
233
|
+
```
tests/
├── unit/
│   ├── transforms/
│   │   ├── payments/
│   │   │   ├── test_normalize.py
│   │   │   └── test_aggregate.py
│   │   └── users/
│   │       └── test_deduplicate.py
│   ├── sources/
│   └── sinks/
├── integration/
│   ├── test_payments_pipeline.py
│   └── test_users_pipeline.py
├── quality/
│   └── test_data_quality_rules.py
└── fixtures/
    ├── payments/
    │   ├── raw_charges.json
    │   └── expected_normalized.json
    └── users/
        └── raw_events.json
```
|
|
256
|
+
|
|
257
|
+
The `fixtures/` directory contains static JSON/Parquet/CSV files representing realistic sample data. Fixtures must be small enough to run in CI (< 10MB total) but representative enough to cover edge cases.
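The size budget can be enforced mechanically. The sketch below is one possible guard, assuming the fixtures live under a single directory; the helper names and the interpretation of 10MB as 10 × 1024 × 1024 bytes are choices made for this example.

```python
from pathlib import Path

MAX_FIXTURE_BYTES = 10 * 1024 * 1024  # the 10MB budget, read here as 10 MiB


def total_fixture_bytes(fixtures_dir: Path) -> int:
    """Sum the sizes of all regular files under `fixtures_dir`, recursively."""
    return sum(p.stat().st_size for p in fixtures_dir.rglob("*") if p.is_file())


def fixtures_within_budget(fixtures_dir: Path, budget: int = MAX_FIXTURE_BYTES) -> bool:
    """True when the total size is strictly below the budget."""
    return total_fixture_bytes(fixtures_dir) < budget
```

A test such as `assert fixtures_within_budget(Path("tests/fixtures"))` would then fail the build as soon as someone commits an oversized sample file.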
|
|
@@ -0,0 +1,324 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: data-pipeline-quality
|
|
3
|
+
description: Schema validation, anomaly detection, data quality tests using Great Expectations and dbt, and reconciliation patterns
|
|
4
|
+
topics: [data-pipeline, data-quality, schema-validation, anomaly-detection, great-expectations, dbt, reconciliation, testing]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
Data quality failures are among the most operationally damaging data pipeline bugs because they often go undetected until a business user notices wrong numbers. Unlike pipeline failures, which are loud and immediately visible, quality degradation is silent: data flows and pipelines succeed, but the numbers are wrong. Building quality checks directly into the pipeline, rather than running them as a separate offline process, is the most reliable defense.
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
Validate schema and structural correctness at ingestion. Run statistical anomaly detection on all critical metrics. Implement Great Expectations suites or dbt tests on every table in the medallion stack (bronze, silver, gold). Reconcile record counts and aggregate checksums between pipeline stages. Alert on quality degradation before downstream consumers see it. Never silently discard records: route all failures to the dead-letter queue (DLQ) with diagnostic context.
|
|
12
|
+
|
|
13
|
+
## Deep Guidance
|
|
14
|
+
|
|
15
|
+
### Schema Validation at Ingestion
|
|
16
|
+
|
|
17
|
+
Validate incoming data against the expected schema as the first step of every pipeline. Fail fast on schema violations rather than propagating bad data downstream.
|
|
18
|
+
|
|
19
|
+
**JSON Schema validation**
|
|
20
|
+
|
|
21
|
+
```python
from jsonschema import FormatChecker, ValidationError, validate

TRANSACTION_SCHEMA = {
    "type": "object",
    "required": ["transaction_id", "amount", "currency", "user_id", "created_at"],
    "properties": {
        "transaction_id": {"type": "string", "minLength": 1},
        "amount": {"type": "number", "minimum": 0},
        "currency": {"type": "string", "enum": ["USD", "EUR", "GBP", "JPY", "CAD"]},
        "user_id": {"type": "string", "pattern": "^usr_[0-9]{6}$"},
        "created_at": {"type": "string", "format": "date-time"},
    },
    "additionalProperties": True,  # allow extra fields in bronze
}

def validate_and_route(record: dict, dlq) -> dict | None:
    """Validate record against schema. Route failures to DLQ, return valid records."""
    try:
        # "format" keywords are only enforced when a format_checker is passed;
        # date-time checking also needs jsonschema's optional format dependencies.
        validate(instance=record, schema=TRANSACTION_SCHEMA, format_checker=FormatChecker())
        return record
    except ValidationError as e:
        dlq.write(record, error=e, reason="schema_validation_failure")
        return None
```
|
|
46
|
+
|
|
47
|
+
**Avro schema enforcement**
|
|
48
|
+
|
|
49
|
+
For Kafka-based pipelines, use Schema Registry to enforce Avro schemas at produce and consume time:
|
|
50
|
+
|
|
51
|
+
```python
# confluent_kafka.avro is the legacy Avro API; newer client releases recommend
# DeserializingConsumer with AvroDeserializer instead.
from confluent_kafka.avro import AvroConsumer
from confluent_kafka.avro.serializer import SerializerError

consumer = AvroConsumer({
    "bootstrap.servers": "localhost:9092",
    "schema.registry.url": "http://localhost:8081",
    "group.id": "silver-processor",
})
# consumer.subscribe([...]) with the pipeline's input topic(s) is required before polling.

while True:
    try:
        # AvroConsumer.poll() deserializes the message, so schema and decoding
        # errors are raised here rather than by msg.value().
        msg = consumer.poll(1.0)
    except SerializerError as e:
        # No decoded payload is available at this point; record the error itself.
        dlq.write(None, error=e, reason="avro_deserialization_failure")
        continue
    if msg is None:
        continue
    if msg.error():
        handle_consumer_error(msg.error())
        continue
    process(msg.value())  # already deserialized against the registered schema
```
|
|
74
|
+
|
|
75
|
+
### Statistical Anomaly Detection
|
|
76
|
+
|
|
77
|
+
Structural validation catches format errors. Anomaly detection catches semantic errors — values that are structurally valid but statistically wrong.
|
|
78
|
+
|
|
79
|
+
**Volume anomaly detection**
|
|
80
|
+
|
|
81
|
+
```python
import statistics

def check_volume_anomaly(
    pipeline_id: str,
    actual_count: int,
    window_days: int = 14,
    z_score_threshold: float = 3.0,
) -> AnomalyResult:
    """Detect volume anomalies using z-score against rolling historical baseline."""
    historical = get_historical_counts(pipeline_id, days=window_days)
    if len(historical) < 2:
        # statistics.stdev needs at least two points; without a baseline there is no verdict.
        return AnomalyResult(anomaly=False)
    mean = statistics.mean(historical)
    stddev = statistics.stdev(historical)

    if stddev == 0:
        return AnomalyResult(anomaly=False)

    z_score = (actual_count - mean) / stddev
    is_anomaly = abs(z_score) > z_score_threshold

    return AnomalyResult(
        anomaly=is_anomaly,
        z_score=z_score,
        actual=actual_count,
        expected_mean=mean,
        expected_stddev=stddev,
        direction="high" if z_score > 0 else "low",
    )
```
|
|
108
|
+
|
|
109
|
+
**Metric anomaly detection**
|
|
110
|
+
|
|
111
|
+
Extend volume checks to key business metrics:
|
|
112
|
+
- Revenue sum per partition (detect 0-revenue or 100x-revenue outliers)
|
|
113
|
+
- Average transaction amount (detect extreme values indicating test data in prod)
|
|
114
|
+
- User count per partition (detect sudden drop suggesting source outage)
|
|
115
|
+
- Error rate (detect spike indicating upstream system degradation)
|
|
116
|
+
|
|
117
|
+
### Great Expectations Integration
|
|
118
|
+
|
|
119
|
+
Great Expectations (GX) provides a declarative assertion framework for data quality:
|
|
120
|
+
|
|
121
|
+
**Expectation suite definition**
|
|
122
|
+
|
|
123
|
+
```python
# src/quality/expectations/payments.py
# Written against the legacy (pre-1.0) Great Expectations API so that it matches
# a RuntimeBatchRequest-based runner; GX 1.x exposes a different interface.
from great_expectations.core import ExpectationConfiguration, ExpectationSuite


def _expect(expectation_type: str, meta: dict | None = None, **kwargs) -> ExpectationConfiguration:
    """Local helper that keeps the suite definition compact."""
    return ExpectationConfiguration(expectation_type=expectation_type, kwargs=kwargs, meta=meta)


def build_payments_suite() -> ExpectationSuite:
    suite = ExpectationSuite(expectation_suite_name="silver_transactions")

    # Completeness
    suite.add_expectation(_expect(
        "expect_column_values_to_not_be_null",
        column="transaction_id", meta={"severity": "critical"},
    ))
    suite.add_expectation(_expect(
        "expect_column_values_to_not_be_null",
        column="amount", meta={"severity": "critical"},
    ))
    suite.add_expectation(_expect(
        "expect_column_values_to_not_be_null",
        column="user_id", mostly=0.99, meta={"severity": "high"},
    ))

    # Accuracy
    suite.add_expectation(_expect(
        "expect_column_values_to_be_between",
        column="amount", min_value=0, max_value=1_000_000,
    ))
    suite.add_expectation(_expect(
        "expect_column_values_to_be_in_set",
        column="currency", value_set=["USD", "EUR", "GBP", "JPY", "CAD"],
    ))
    suite.add_expectation(_expect(
        "expect_column_values_to_match_regex",
        column="transaction_id", regex=r"^txn_[a-zA-Z0-9]{20}$",
    ))

    # Volume
    suite.add_expectation(_expect(
        "expect_table_row_count_to_be_between",
        min_value=1000, max_value=10_000_000,
    ))

    # Uniqueness
    suite.add_expectation(_expect(
        "expect_column_values_to_be_unique",
        column="transaction_id",
    ))

    return suite
```
|
|
164
|
+
|
|
165
|
+
**Running GX in the pipeline**
|
|
166
|
+
|
|
167
|
+
```python
from datetime import datetime, timezone

import great_expectations as gx
import pandas as pd
from great_expectations.core.batch import RuntimeBatchRequest

# ValidationResult, metrics, and DataQualityError are project-level names not shown here.
def run_quality_check(df: pd.DataFrame, suite_name: str) -> ValidationResult:
    context = gx.get_context()
    validator = context.get_validator(
        batch_request=RuntimeBatchRequest(
            datasource_name="runtime",
            data_connector_name="default",
            data_asset_name=suite_name,
            runtime_parameters={"batch_data": df},
            batch_identifiers={"run_id": datetime.now(timezone.utc).isoformat()},
        ),
        expectation_suite_name=suite_name,
    )
    result = validator.validate()

    metrics.gauge("data_quality_success_pct",
                  result.statistics["success_percent"],
                  tags={"suite": suite_name})

    if not result.success:
        failed = [e for e in result.results if not e.success]
        # Severity is stored in the expectation's own meta, not in the result's meta.
        critical_failures = [
            f for f in failed
            if f.expectation_config.meta.get("severity") == "critical"
        ]
        if critical_failures:
            raise DataQualityError(f"Critical quality failures: {critical_failures}")

    return result
```
|
|
194
|
+
|
|
195
|
+
### dbt Data Tests
|
|
196
|
+
|
|
197
|
+
For SQL-based pipelines using dbt, define tests directly in model YAML files:
|
|
198
|
+
|
|
199
|
+
```yaml
# models/silver/schema.yml
models:
  - name: silver_transactions
    description: Cleaned and validated transaction records
    columns:
      - name: transaction_id
        tests:
          - not_null
          - unique
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
      - name: currency
        tests:
          - not_null
          - accepted_values:
              values: ["USD", "EUR", "GBP", "JPY", "CAD"]
      - name: user_id
        tests:
          - not_null
          - relationships:
              to: ref('silver_users')
              field: user_id

    tests:
      - dbt_utils.recency:
          datepart: hour
          field: created_at
          interval: 2  # table should have data from last 2 hours
      - dbt_utils.equal_rowcount:
          compare_model: ref('bronze_transactions')
          # equal_rowcount requires an exact match, so compare against a bronze
          # model that already excludes known invalid records
```
|
|
236
|
+
|
|
237
|
+
**Custom dbt tests**
|
|
238
|
+
|
|
239
|
+
```sql
|
|
240
|
+
-- tests/assert_no_future_transactions.sql
|
|
241
|
+
-- Fails if any transactions are timestamped in the future
|
|
242
|
+
SELECT count(*) as future_count
|
|
243
|
+
FROM {{ ref('silver_transactions') }}
|
|
244
|
+
WHERE created_at > CURRENT_TIMESTAMP + INTERVAL '5 minutes'
|
|
245
|
+
HAVING count(*) > 0
|
|
246
|
+
```
|
|
247
|
+
|
|
248
|
+
### Reconciliation Patterns
|
|
249
|
+
|
|
250
|
+
Reconciliation compares record counts and aggregate values between pipeline stages to detect silent data loss.
|
|
251
|
+
|
|
252
|
+
**Row count reconciliation**
|
|
253
|
+
|
|
254
|
+
```python
def reconcile_row_counts(
    source_count: int,
    destination_count: int,
    pipeline_id: str,
    expected_loss_rate: float = 0.001,  # max 0.1% loss acceptable
) -> ReconciliationResult:
    """Compare source and destination record counts."""
    if source_count == 0:
        raise ReconciliationError(f"Source count is zero for {pipeline_id}: possible extraction failure")

    # A negative loss_rate means more rows at the destination (likely duplicates);
    # this check does not flag that case.
    loss_rate = (source_count - destination_count) / source_count

    # Emit the metric before the threshold check so failing runs are recorded too.
    metrics.gauge("pipeline_record_loss_rate", loss_rate, tags={"pipeline": pipeline_id})

    if loss_rate > expected_loss_rate:
        raise ReconciliationError(
            f"Record loss rate {loss_rate:.4%} exceeds threshold {expected_loss_rate:.4%}. "
            f"Source: {source_count}, Destination: {destination_count}"
        )

    return ReconciliationResult(passed=True, loss_rate=loss_rate)
```
|
|
276
|
+
|
|
277
|
+
**Aggregate checksum reconciliation**
|
|
278
|
+
|
|
279
|
+
```python
def reconcile_aggregates(
    source_df: pd.DataFrame,
    destination_df: pd.DataFrame,
    key_column: str,
    value_columns: list[str],
    tolerance_pct: float = 0.001,
) -> ReconciliationResult:
    """Compare aggregate sums between source and destination."""
    failures = []
    for col in value_columns:
        src_sum = source_df[col].sum()
        dst_sum = destination_df[col].sum()

        if src_sum == 0 and dst_sum == 0:
            continue

        variance = abs(src_sum - dst_sum) / max(abs(src_sum), 1e-10)
        if variance > tolerance_pct:
            failures.append(f"{col}: src={src_sum:.2f}, dst={dst_sum:.2f}, variance={variance:.4%}")

    if failures:
        raise ReconciliationError("Aggregate reconciliation failed:\n" + "\n".join(failures))

    return ReconciliationResult(passed=True)
```
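Comparing column sums can miss offsetting errors, for example one row too high and another too low by the same amount. A complementary technique, which is not part of the function above, is an order-independent row-hash checksum. The sketch below assumes rows are JSON-serializable dictionaries; `row_digest` and `table_checksum` are hypothetical names.

```python
import hashlib
import json


def row_digest(row: dict) -> int:
    """Stable 64-bit digest of one row; keys are sorted so field order is irrelevant."""
    payload = json.dumps(row, sort_keys=True, default=str).encode("utf-8")
    return int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")


def table_checksum(rows: list[dict]) -> int:
    """Order-independent checksum: sum of row digests modulo 2**64."""
    return sum(row_digest(r) for r in rows) % (1 << 64)


source = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5}]
destination = [{"amount": 5.5, "id": 2}, {"id": 1, "amount": 10.0}]  # reordered
print(table_checksum(source) == table_checksum(destination))
```

Because the digests are summed, row order and field order do not affect the result, but any change to a value does. Type drift between stages (for instance `10` versus `10.0`) also changes the digest, so both sides need the same serialization rules.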
|
|
305
|
+
|
|
306
|
+
### Quality Gate Integration
|
|
307
|
+
|
|
308
|
+
Quality checks must be integrated as pipeline gates, not optional post-processing steps:
|
|
309
|
+
|
|
310
|
+
```
Extract → Schema Validate → [pass: Transform] [fail: DLQ]
              ↓
          Statistical Check → [anomaly: alert + hold] [pass: continue]
              ↓
          Business Rules → [fail: DLQ] [pass: Load]
              ↓
          Reconciliation → [fail: alert + rollback] [pass: commit]
```
|
|
319
|
+
|
|
320
|
+
A quality gate failure at any stage must:
|
|
321
|
+
1. Route affected records to the DLQ with full diagnostic context
|
|
322
|
+
2. Emit a quality failure metric
|
|
323
|
+
3. Alert the on-call engineer if the failure rate exceeds the quality budget
|
|
324
|
+
4. Block pipeline progression for critical failures (structural, not statistical)
|