@zigrivers/scaffold 3.8.0 → 3.9.1

Files changed (70)
  1. package/README.md +73 -8
  2. package/content/knowledge/browser-extension/browser-extension-architecture.md +195 -0
  3. package/content/knowledge/browser-extension/browser-extension-content-scripts.md +264 -0
  4. package/content/knowledge/browser-extension/browser-extension-conventions.md +156 -0
  5. package/content/knowledge/browser-extension/browser-extension-cross-browser.md +229 -0
  6. package/content/knowledge/browser-extension/browser-extension-dev-environment.md +247 -0
  7. package/content/knowledge/browser-extension/browser-extension-manifest.md +220 -0
  8. package/content/knowledge/browser-extension/browser-extension-project-structure.md +183 -0
  9. package/content/knowledge/browser-extension/browser-extension-requirements.md +107 -0
  10. package/content/knowledge/browser-extension/browser-extension-security.md +202 -0
  11. package/content/knowledge/browser-extension/browser-extension-service-workers.md +265 -0
  12. package/content/knowledge/browser-extension/browser-extension-store-submission.md +155 -0
  13. package/content/knowledge/browser-extension/browser-extension-testing.md +270 -0
  14. package/content/knowledge/data-pipeline/data-pipeline-architecture.md +175 -0
  15. package/content/knowledge/data-pipeline/data-pipeline-batch-patterns.md +263 -0
  16. package/content/knowledge/data-pipeline/data-pipeline-conventions.md +176 -0
  17. package/content/knowledge/data-pipeline/data-pipeline-dev-environment.md +350 -0
  18. package/content/knowledge/data-pipeline/data-pipeline-orchestration.md +291 -0
  19. package/content/knowledge/data-pipeline/data-pipeline-project-structure.md +257 -0
  20. package/content/knowledge/data-pipeline/data-pipeline-quality.md +324 -0
  21. package/content/knowledge/data-pipeline/data-pipeline-requirements.md +145 -0
  22. package/content/knowledge/data-pipeline/data-pipeline-schema-management.md +295 -0
  23. package/content/knowledge/data-pipeline/data-pipeline-security.md +326 -0
  24. package/content/knowledge/data-pipeline/data-pipeline-streaming-patterns.md +280 -0
  25. package/content/knowledge/data-pipeline/data-pipeline-testing.md +406 -0
  26. package/content/knowledge/ml/ml-architecture.md +172 -0
  27. package/content/knowledge/ml/ml-conventions.md +209 -0
  28. package/content/knowledge/ml/ml-dev-environment.md +299 -0
  29. package/content/knowledge/ml/ml-experiment-tracking.md +285 -0
  30. package/content/knowledge/ml/ml-model-evaluation.md +256 -0
  31. package/content/knowledge/ml/ml-observability.md +253 -0
  32. package/content/knowledge/ml/ml-project-structure.md +216 -0
  33. package/content/knowledge/ml/ml-requirements.md +138 -0
  34. package/content/knowledge/ml/ml-security.md +188 -0
  35. package/content/knowledge/ml/ml-serving-patterns.md +243 -0
  36. package/content/knowledge/ml/ml-testing.md +301 -0
  37. package/content/knowledge/ml/ml-training-patterns.md +269 -0
  38. package/content/methodology/browser-extension-overlay.yml +82 -0
  39. package/content/methodology/data-pipeline-overlay.yml +70 -0
  40. package/content/methodology/ml-overlay.yml +70 -0
  41. package/dist/cli/commands/init.d.ts +13 -0
  42. package/dist/cli/commands/init.d.ts.map +1 -1
  43. package/dist/cli/commands/init.js +122 -2
  44. package/dist/cli/commands/init.js.map +1 -1
  45. package/dist/cli/commands/init.test.js +120 -0
  46. package/dist/cli/commands/init.test.js.map +1 -1
  47. package/dist/config/schema.d.ts +864 -48
  48. package/dist/config/schema.d.ts.map +1 -1
  49. package/dist/config/schema.js +53 -0
  50. package/dist/config/schema.js.map +1 -1
  51. package/dist/config/schema.test.js +166 -3
  52. package/dist/config/schema.test.js.map +1 -1
  53. package/dist/core/assembly/overlay-loader.test.js +33 -0
  54. package/dist/core/assembly/overlay-loader.test.js.map +1 -1
  55. package/dist/e2e/project-type-overlays.test.d.ts +2 -2
  56. package/dist/e2e/project-type-overlays.test.js +499 -33
  57. package/dist/e2e/project-type-overlays.test.js.map +1 -1
  58. package/dist/types/config.d.ts +10 -1
  59. package/dist/types/config.d.ts.map +1 -1
  60. package/dist/wizard/questions.d.ts +17 -1
  61. package/dist/wizard/questions.d.ts.map +1 -1
  62. package/dist/wizard/questions.js +75 -1
  63. package/dist/wizard/questions.js.map +1 -1
  64. package/dist/wizard/questions.test.js +167 -0
  65. package/dist/wizard/questions.test.js.map +1 -1
  66. package/dist/wizard/wizard.d.ts +13 -0
  67. package/dist/wizard/wizard.d.ts.map +1 -1
  68. package/dist/wizard/wizard.js +17 -1
  69. package/dist/wizard/wizard.js.map +1 -1
  70. package/package.json +1 -1
@@ -0,0 +1,257 @@
1
+ ---
2
+ name: data-pipeline-project-structure
3
+ description: Canonical directory layout for data pipeline projects covering DAGs, transforms, sinks, quality checks, configuration, and tests
4
+ topics: [data-pipeline, project-structure, dags, transforms, sinks, quality, config, tests]
5
+ ---
6
+
7
+ A well-organized data pipeline project separates pipeline orchestration logic from transformation logic, data sinks, and quality checks. This separation allows each concern to evolve independently and makes the codebase navigable for engineers unfamiliar with the project. The structure below is opinionated but widely applicable to Airflow, Prefect, Dagster, and similar orchestration tools.
8
+
9
+ ## Summary
10
+
11
+ Organize pipeline projects into `src/dags/` for orchestration, `src/transforms/` for pure transformation functions, `src/sinks/` for write adapters, `src/quality/` for validation logic, `config/` for environment-specific settings, and `tests/` mirroring the `src/` structure. Keep business logic in transforms — not in DAG files. Configuration is never hardcoded.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Root Directory Layout
16
+
+ ```
+ my-pipeline/
+ ├── src/
+ │   ├── dags/                    # Orchestration DAGs / pipeline definitions
+ │   ├── transforms/              # Pure transformation functions
+ │   ├── sources/                 # Source connectors and readers
+ │   ├── sinks/                   # Destination connectors and writers
+ │   ├── quality/                 # Data quality checks and validators
+ │   └── utils/                   # Shared utilities (logging, metrics, retry)
+ ├── config/
+ │   ├── base.yaml                # Shared configuration across environments
+ │   ├── dev.yaml                 # Development overrides
+ │   ├── staging.yaml             # Staging overrides
+ │   └── prod.yaml                # Production overrides
+ ├── tests/
+ │   ├── unit/                    # Unit tests for transforms, sources, sinks
+ │   ├── integration/             # Integration tests against real (or containerized) systems
+ │   ├── quality/                 # Data quality test definitions
+ │   └── fixtures/                # Static test data files
+ ├── scripts/
+ │   ├── bootstrap.sh             # Local environment setup
+ │   ├── replay.sh                # DLQ and historical replay tooling
+ │   └── backfill.sh              # Backfill orchestration helper
+ ├── docker/
+ │   ├── docker-compose.yml       # Local development stack
+ │   └── docker-compose.test.yml  # Test isolation stack
+ ├── Makefile
+ ├── pyproject.toml               # Python project metadata and dependencies
+ └── README.md
+ ```
47
+
48
+ ### `src/dags/` — Orchestration Layer
49
+
50
+ DAG files define pipeline topology, scheduling, and task dependencies. They must be thin:
51
+
+ ```
+ src/dags/
+ ├── payments/
+ │   ├── transactions_ingest.py
+ │   ├── transactions_enrich.py
+ │   └── revenue_aggregate.py
+ ├── users/
+ │   ├── events_ingest.py
+ │   └── profiles_sync.py
+ └── shared/
+     ├── base_dag.py   # Shared DAG defaults (retries, alerts, SLA)
+     └── sensors.py    # Common sensors (S3 sensor, DB sensor)
+ ```
65
+
66
+ **Rule**: DAG files contain no business logic. A DAG file defines what runs, in what order, with what dependencies. All data processing logic lives in `src/transforms/`. A DAG task calls a function from `src/transforms/` — it does not implement the transformation inline.
67
+
68
+ Bad (logic in DAG):
+ ```python
+ @task
+ def process_transactions(**context):
+     df = pd.read_parquet(...)
+     df['amount_usd'] = df['amount'] / df['exchange_rate']  # business logic in DAG
+     df.to_parquet(...)
+ ```
76
+
77
+ Good (DAG calls transform):
+ ```python
+ from transforms.payments import normalize_transaction_amounts
+
+ @task
+ def process_transactions(**context):
+     records = sources.payments.read(context['date'])
+     normalized = normalize_transaction_amounts(records)
+     sinks.warehouse.write(normalized, partition=context['date'])
+ ```
87
+
88
+ ### `src/transforms/` — Transformation Logic
89
+
90
+ Transforms are pure functions: they take data in, return data out, with no side effects. This makes them independently testable without any pipeline infrastructure.
91
+
+ ```
+ src/transforms/
+ ├── payments/
+ │   ├── __init__.py
+ │   ├── normalize.py    # Currency normalization, amount parsing
+ │   ├── enrich.py       # Join with reference data
+ │   └── aggregate.py    # Revenue rollups, daily summaries
+ ├── users/
+ │   ├── __init__.py
+ │   ├── deduplicate.py  # Deduplication logic
+ │   └── classify.py     # User segment classification
+ └── shared/
+     ├── timestamps.py   # Timestamp parsing and normalization
+     ├── currencies.py   # Currency conversion utilities
+     └── pii.py          # PII masking and hashing functions
+ ```
108
+
109
+ Pure transform function pattern:
+ ```python
+ def normalize_transaction_amounts(records: list[dict]) -> list[dict]:
+     """Convert all transaction amounts to USD using embedded exchange rates.
+
+     Args:
+         records: Raw transaction records with 'amount' and 'currency' fields
+
+     Returns:
+         Records with added 'amount_usd' field. Input records unchanged.
+     """
+     result = []
+     for record in records:
+         normalized = {**record}
+         normalized['amount_usd'] = convert_to_usd(record['amount'], record['currency'])
+         result.append(normalized)
+     return result
+ ```
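Because the transform is pure, a test needs nothing beyond in-memory records. The sketch below is one way to exercise it; `convert_to_usd` is not shown in the text above, so a hypothetical fixed-rate stand-in with made-up rates is used, and the transform is restated compactly so the example is self-contained.

```python
# Hypothetical stand-in for convert_to_usd; the rates are made-up test values.
TEST_RATES = {"USD": 1.0, "EUR": 2.0}


def convert_to_usd(amount: float, currency: str) -> float:
    return amount * TEST_RATES[currency]


def normalize_transaction_amounts(records: list[dict]) -> list[dict]:
    """Copy each record and add an 'amount_usd' field, as in the pattern above."""
    return [
        {**record, "amount_usd": convert_to_usd(record["amount"], record["currency"])}
        for record in records
    ]


def test_normalize_adds_usd_and_leaves_input_unchanged() -> None:
    raw = [{"amount": 10.0, "currency": "EUR"}]
    out = normalize_transaction_amounts(raw)
    assert out == [{"amount": 10.0, "currency": "EUR", "amount_usd": 20.0}]
    # The input list and its records must not be mutated by a pure transform.
    assert raw == [{"amount": 10.0, "currency": "EUR"}]
```

No orchestrator, database, or network access is involved, which is the practical payoff of keeping transforms free of side effects.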
127
+
128
+ ### `src/sources/` — Source Connectors
129
+
130
+ Sources abstract the read interface for each upstream system:
131
+
+ ```
+ src/sources/
+ ├── stripe/
+ │   ├── __init__.py
+ │   ├── charges.py     # Stripe Charges API reader
+ │   └── webhooks.py    # Stripe webhook event reader
+ ├── postgres/
+ │   ├── __init__.py
+ │   └── cdc_reader.py  # CDC via logical replication
+ ├── s3/
+ │   ├── __init__.py
+ │   └── parquet_reader.py
+ └── kafka/
+     ├── __init__.py
+     └── consumer.py
+ ```
148
+
149
+ Sources implement a common interface:
+ ```python
+ from datetime import datetime
+ from typing import Iterable, Protocol
+
+ class Source(Protocol):
+     def read(self, start: datetime, end: datetime) -> Iterable[dict]: ...
+     def count(self, start: datetime, end: datetime) -> int: ...
+ ```
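As an illustration of the common interface, the sketch below shows a list-backed source that satisfies `Source` structurally. `InMemorySource`, `extract`, the `created_at` field, and the half-open window convention are assumptions made for this example, not part of the package.

```python
from datetime import datetime
from typing import Iterable, Protocol


class Source(Protocol):
    def read(self, start: datetime, end: datetime) -> Iterable[dict]: ...
    def count(self, start: datetime, end: datetime) -> int: ...


class InMemorySource:
    """Hypothetical list-backed source, handy as a test double."""

    def __init__(self, records: list[dict]) -> None:
        self._records = records

    def read(self, start: datetime, end: datetime) -> Iterable[dict]:
        # Half-open window [start, end) so adjacent windows never overlap.
        return [r for r in self._records if start <= r["created_at"] < end]

    def count(self, start: datetime, end: datetime) -> int:
        return len(list(self.read(start, end)))


def extract(source: Source, start: datetime, end: datetime) -> list[dict]:
    """Pipeline code depends only on the protocol, not on a concrete connector."""
    return list(source.read(start, end))
```

Any connector with matching `read` and `count` methods can be passed to `extract` without inheriting from `Source`, since `Protocol` checks structure rather than ancestry.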
155
+
156
+ ### `src/sinks/` — Destination Writers
157
+
158
+ Sinks abstract the write interface for each downstream system:
159
+
+ ```
+ src/sinks/
+ ├── bigquery/
+ │   ├── __init__.py
+ │   ├── writer.py       # Streaming insert and load job writer
+ │   └── partitioned.py  # Partitioned table writer
+ ├── s3/
+ │   ├── __init__.py
+ │   └── parquet_writer.py
+ ├── postgres/
+ │   ├── __init__.py
+ │   └── upsert_writer.py
+ └── dlq/
+     ├── __init__.py
+     └── writer.py       # Dead-letter queue writer
+ ```
176
+
177
+ Sinks implement idempotent write semantics:
+ ```python
+ from typing import Iterable, Protocol
+
+ # WriteResult is a project-defined result type (not shown here)
+ class Sink(Protocol):
+     def write(self, records: Iterable[dict], partition: str) -> WriteResult: ...
+     def write_dlq(self, record: dict, error: Exception, context: dict) -> None: ...
+ ```
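The protocol itself does not show what makes a write idempotent. One common approach, sketched below under stated assumptions, is to overwrite the target partition instead of appending, so a retried task leaves the same final state. `WriteResult` and `InMemoryPartitionSink` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class WriteResult:
    """Hypothetical result type; the real one is project-defined."""
    partition: str
    rows_written: int


class InMemoryPartitionSink:
    """Illustrative sink: each write replaces the whole partition."""

    def __init__(self) -> None:
        self.partitions: dict[str, list[dict]] = {}
        self.dlq: list[dict] = []

    def write(self, records: Iterable[dict], partition: str) -> WriteResult:
        rows = list(records)
        # Overwrite rather than append: re-running the same task is a no-op.
        self.partitions[partition] = rows
        return WriteResult(partition=partition, rows_written=len(rows))

    def write_dlq(self, record: dict, error: Exception, context: dict) -> None:
        self.dlq.append({"record": record, "error": repr(error), "context": context})
```

Real sinks usually get the same effect with partition replacement or keyed upserts; the point of the sketch is only that a second identical write must not duplicate rows.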
183
+
184
+ ### `src/quality/` — Data Quality Checks
185
+
186
+ Quality checks are assertions about data that run inline in the pipeline:
187
+
+ ```
+ src/quality/
+ ├── rules/
+ │   ├── completeness.py  # Null rate checks
+ │   ├── accuracy.py      # Range and format checks
+ │   ├── consistency.py   # Cross-field and cross-table checks
+ │   └── timeliness.py    # Freshness and late-arrival checks
+ ├── expectations/
+ │   ├── payments.json    # Great Expectations suite for payments
+ │   └── users.json       # Great Expectations suite for users
+ └── reconciliation/
+     ├── row_count.py     # Source vs destination row count comparison
+     └── checksum.py      # Aggregate checksum comparison
+ ```
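To make the `rules/` layer concrete, a null-rate check such as the one `completeness.py` is described as holding might look like the sketch below. The function names and the treatment of empty batches are assumptions for illustration; the 0.99 default mirrors the `completeness_threshold` shown in `config/base.yaml`.

```python
def null_rate(records: list[dict], field: str) -> float:
    """Fraction of records where `field` is missing or None."""
    if not records:
        return 0.0  # assumption: an empty batch has no nulls to report
    missing = sum(1 for record in records if record.get(field) is None)
    return missing / len(records)


def check_completeness(records: list[dict], field: str, threshold: float = 0.99) -> bool:
    """Pass when at least `threshold` of the records carry a non-null value."""
    return (1.0 - null_rate(records, field)) >= threshold
```

Keeping the rule a plain function over records means it can be unit tested in the same way as a transform.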
202
+
203
+ ### `config/` — Environment Configuration
204
+
205
+ All pipeline configuration (connection strings, batch sizes, SLA thresholds, feature flags) lives in config files, never hardcoded:
206
+
+ ```yaml
+ # config/base.yaml
+ pipeline:
+   batch_size: 10000
+   max_retries: 3
+   retry_delay_seconds: 30
+
+ quality:
+   completeness_threshold: 0.99
+   row_count_variance_pct: 0.01
+
+ sla:
+   max_lag_minutes: 60
+ ```
221
+
222
+ Environment-specific files override base values only for what differs. Use a config loader that merges base + environment:
223
+
+ ```python
+ import yaml
+
+ def load_config(env: str = "dev") -> Config:
+     with open("config/base.yaml") as f:
+         base = yaml.safe_load(f)
+     with open(f"config/{env}.yaml") as f:
+         override = yaml.safe_load(f) or {}  # an empty override file loads as None
+     return Config(**deep_merge(base, override))
+ ```
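The loader relies on a `deep_merge` helper that is not shown. A minimal sketch of what such a helper could look like follows; the recursion rule (nested dictionaries merge key by key, any other value is replaced outright) is an assumption about the intended behavior.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where `override` wins, merging nested dicts key by key."""
    merged = dict(base)  # shallow copy so `base` itself is never modified
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys replace the base value wholesale.
            merged[key] = value
    return merged
```

With this rule, a `dev.yaml` containing only `pipeline: {batch_size: 500}` changes the batch size while `max_retries` and the `quality` and `sla` sections keep their base values.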
230
+
231
+ ### `tests/` — Test Structure Mirrors Source
232
+
+ ```
+ tests/
+ ├── unit/
+ │   ├── transforms/
+ │   │   ├── payments/
+ │   │   │   ├── test_normalize.py
+ │   │   │   └── test_aggregate.py
+ │   │   └── users/
+ │   │       └── test_deduplicate.py
+ │   ├── sources/
+ │   └── sinks/
+ ├── integration/
+ │   ├── test_payments_pipeline.py
+ │   └── test_users_pipeline.py
+ ├── quality/
+ │   └── test_data_quality_rules.py
+ └── fixtures/
+     ├── payments/
+     │   ├── raw_charges.json
+     │   └── expected_normalized.json
+     └── users/
+         └── raw_events.json
+ ```
256
+
257
+ The `fixtures/` directory contains static JSON/Parquet/CSV files representing realistic sample data. Fixtures must be small enough to run in CI (< 10MB total) but representative enough to cover edge cases.
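The size limit is easier to keep if it is checked automatically. The following is a small sketch of such a guard, assuming the 10 MB figure above is read as 10 × 1024 × 1024 bytes; `fixtures_total_bytes` and `fixtures_within_budget` are hypothetical helper names.

```python
from pathlib import Path

MAX_FIXTURE_BYTES = 10 * 1024 * 1024  # assumed reading of the 10MB budget


def fixtures_total_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under `root`, recursively."""
    return sum(path.stat().st_size for path in root.rglob("*") if path.is_file())


def fixtures_within_budget(root: Path, limit: int = MAX_FIXTURE_BYTES) -> bool:
    return fixtures_total_bytes(root) < limit
```

A CI test could call `fixtures_within_budget(Path("tests/fixtures"))` and fail the build when someone commits an oversized sample file.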
@@ -0,0 +1,324 @@
1
+ ---
2
+ name: data-pipeline-quality
3
+ description: Schema validation, anomaly detection, data quality tests using Great Expectations and dbt, and reconciliation patterns
4
+ topics: [data-pipeline, data-quality, schema-validation, anomaly-detection, great-expectations, dbt, reconciliation, testing]
5
+ ---
6
+
7
+ Data quality failures are the most operationally damaging class of data pipeline bugs because they often go undetected until a business user notices wrong numbers. Unlike pipeline failures (which are loud and immediately visible), quality degradation is silent — data flows, pipelines succeed, but the numbers are wrong. Building quality checks directly into the pipeline, not as a separate offline process, is the only reliable defense.
8
+
9
+ ## Summary
10
+
11
+ Validate schema and structural correctness at ingestion. Run statistical anomaly detection on all critical metrics. Implement Great Expectations suites or dbt tests on every table in the medallion stack. Reconcile record counts and aggregate checksums between pipeline stages. Alert on quality degradation before downstream consumers see it. Never silently discard records — route all failures to the DLQ with diagnostic context.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Schema Validation at Ingestion
16
+
17
+ Validate incoming data against the expected schema as the first step of every pipeline. Fail fast on schema violations rather than propagating bad data downstream.
18
+
19
+ **JSON Schema validation**
20
+
+ ```python
+ from jsonschema import FormatChecker, ValidationError, validate
+
+ TRANSACTION_SCHEMA = {
+     "type": "object",
+     "required": ["transaction_id", "amount", "currency", "user_id", "created_at"],
+     "properties": {
+         "transaction_id": {"type": "string", "minLength": 1},
+         "amount": {"type": "number", "minimum": 0},
+         "currency": {"type": "string", "enum": ["USD", "EUR", "GBP", "JPY", "CAD"]},
+         "user_id": {"type": "string", "pattern": "^usr_[0-9]{6}$"},
+         "created_at": {"type": "string", "format": "date-time"},
+     },
+     "additionalProperties": True,  # allow extra fields in bronze
+ }
+
+ def validate_and_route(record: dict, dlq) -> dict | None:
+     """Validate record against schema. Route failures to DLQ, return valid records."""
+     try:
+         # "format" keywords are only enforced when a format_checker is supplied
+         # (date-time also needs jsonschema's optional format dependencies installed)
+         validate(instance=record, schema=TRANSACTION_SCHEMA, format_checker=FormatChecker())
+         return record
+     except ValidationError as e:
+         dlq.write(record, error=e, reason="schema_validation_failure")
+         return None
+ ```
46
+
47
+ **Avro schema enforcement**
48
+
49
+ For Kafka-based pipelines, use Schema Registry to enforce Avro schemas at produce and consume time:
50
+
+ ```python
+ from confluent_kafka.avro import AvroConsumer
+ from confluent_kafka.avro.serializer import SerializerError
+
+ consumer = AvroConsumer({
+     "bootstrap.servers": "localhost:9092",
+     "schema.registry.url": "http://localhost:8081",
+     "group.id": "silver-processor",
+ })
+ consumer.subscribe(["transactions"])  # hypothetical topic name
+
+ while True:
+     try:
+         # AvroConsumer deserializes inside poll(), so SerializerError is raised here
+         msg = consumer.poll(1.0)
+     except SerializerError as e:
+         # the decoded payload is not available at this point; record the error itself
+         dlq.write(None, error=e, reason="avro_deserialization_failure")
+         continue
+     if msg is None:
+         continue
+     if msg.error():
+         handle_consumer_error(msg.error())
+         continue
+     process(msg.value())  # already deserialized against the registered schema
+ ```
74
+
75
+ ### Statistical Anomaly Detection
76
+
77
+ Structural validation catches format errors. Anomaly detection catches semantic errors — values that are structurally valid but statistically wrong.
78
+
79
+ **Volume anomaly detection**
80
+
+ ```python
+ import statistics
+
+ def check_volume_anomaly(
+     pipeline_id: str,
+     actual_count: int,
+     window_days: int = 14,
+     z_score_threshold: float = 3.0,
+ ) -> AnomalyResult:
+     """Detect volume anomalies using z-score against rolling historical baseline."""
+     historical = get_historical_counts(pipeline_id, days=window_days)
+     if len(historical) < 2:
+         # statistics.stdev needs at least two points; without a baseline, give no verdict
+         return AnomalyResult(anomaly=False)
+
+     mean = statistics.mean(historical)
+     stddev = statistics.stdev(historical)
+
+     if stddev == 0:
+         return AnomalyResult(anomaly=False)
+
+     z_score = (actual_count - mean) / stddev
+     is_anomaly = abs(z_score) > z_score_threshold
+
+     return AnomalyResult(
+         anomaly=is_anomaly,
+         z_score=z_score,
+         actual=actual_count,
+         expected_mean=mean,
+         expected_stddev=stddev,
+         direction="high" if z_score > 0 else "low",
+     )
+ ```
108
+
109
+ **Metric anomaly detection**
110
+
111
+ Extend volume checks to key business metrics:
112
+ - Revenue sum per partition (detect 0-revenue or 100x-revenue outliers)
113
+ - Average transaction amount (detect extreme values indicating test data in prod)
114
+ - User count per partition (detect sudden drop suggesting source outage)
115
+ - Error rate (detect spike indicating upstream system degradation)
116
+
117
+ ### Great Expectations Integration
118
+
119
+ Great Expectations (GX) provides a declarative assertion framework for data quality:
120
+
121
+ **Expectation suite definition**
122
+
+ ```python
+ # src/quality/expectations/payments.py
+ # Written against the legacy (pre-1.0) Great Expectations API, the same API
+ # generation as RuntimeBatchRequest / get_validator.
+ from great_expectations.core.expectation_configuration import ExpectationConfiguration
+ from great_expectations.core.expectation_suite import ExpectationSuite
+
+ def _expect(expectation_type: str, meta: dict | None = None, **kwargs) -> ExpectationConfiguration:
+     """Small local helper to keep the suite definition readable."""
+     return ExpectationConfiguration(expectation_type=expectation_type, kwargs=kwargs, meta=meta or {})
+
+ def build_payments_suite() -> ExpectationSuite:
+     suite = ExpectationSuite(expectation_suite_name="silver_transactions")
+
+     # Completeness
+     suite.add_expectation(_expect("expect_column_values_to_not_be_null",
+                                   column="transaction_id", meta={"severity": "critical"}))
+     suite.add_expectation(_expect("expect_column_values_to_not_be_null",
+                                   column="amount", meta={"severity": "critical"}))
+     suite.add_expectation(_expect("expect_column_values_to_not_be_null",
+                                   column="user_id", mostly=0.99, meta={"severity": "high"}))
+
+     # Accuracy
+     suite.add_expectation(_expect("expect_column_values_to_be_between",
+                                   column="amount", min_value=0, max_value=1_000_000))
+     suite.add_expectation(_expect("expect_column_values_to_be_in_set",
+                                   column="currency", value_set=["USD", "EUR", "GBP", "JPY", "CAD"]))
+     suite.add_expectation(_expect("expect_column_values_to_match_regex",
+                                   column="transaction_id", regex=r"^txn_[a-zA-Z0-9]{20}$"))
+
+     # Volume
+     suite.add_expectation(_expect("expect_table_row_count_to_be_between",
+                                   min_value=1000, max_value=10_000_000))
+
+     # Uniqueness
+     suite.add_expectation(_expect("expect_column_values_to_be_unique",
+                                   column="transaction_id"))
+
+     return suite
+ ```
164
+
165
+ **Running GX in the pipeline**
166
+
+ ```python
+ from datetime import datetime, timezone
+
+ import great_expectations as gx
+ import pandas as pd
+ from great_expectations.core.batch import RuntimeBatchRequest
+
+ def run_quality_check(df: pd.DataFrame, suite_name: str) -> ValidationResult:
+     context = gx.get_context()
+     validator = context.get_validator(
+         batch_request=RuntimeBatchRequest(
+             datasource_name="runtime",
+             data_connector_name="default",
+             data_asset_name=suite_name,
+             runtime_parameters={"batch_data": df},
+             batch_identifiers={"run_id": datetime.now(timezone.utc).isoformat()},
+         ),
+         expectation_suite_name=suite_name,
+     )
+     result = validator.validate()
+
+     metrics.gauge("data_quality_success_pct",
+                   result.statistics["success_percent"],
+                   tags={"suite": suite_name})
+
+     if not result.success:
+         failed = [e for e in result.results if not e.success]
+         # severity is stored in the meta of each expectation's configuration
+         critical_failures = [
+             f for f in failed
+             if f.expectation_config.meta.get("severity") == "critical"
+         ]
+         if critical_failures:
+             raise DataQualityError(f"Critical quality failures: {critical_failures}")
+
+     return result
+ ```
194
+
195
+ ### dbt Data Tests
196
+
197
+ For SQL-based pipelines using dbt, define tests directly in model YAML files:
198
+
+ ```yaml
+ # models/silver/schema.yml
+ models:
+   - name: silver_transactions
+     description: Cleaned and validated transaction records
+     columns:
+       - name: transaction_id
+         tests:
+           - not_null
+           - unique
+       - name: amount
+         tests:
+           - not_null
+           - dbt_utils.accepted_range:
+               min_value: 0
+               max_value: 1000000
+       - name: currency
+         tests:
+           - not_null
+           - accepted_values:
+               values: ["USD", "EUR", "GBP", "JPY", "CAD"]
+       - name: user_id
+         tests:
+           - not_null
+           - relationships:
+               to: ref('silver_users')
+               field: user_id
+
+     tests:
+       - dbt_utils.recency:
+           datepart: hour
+           field: created_at
+           interval: 2  # table should have data from last 2 hours
+       - dbt_utils.equal_rowcount:
+           compare_model: ref('bronze_transactions')
+           # passes only when silver and bronze row counts are equal;
+           # drop or replace this test if silver filters out invalid records
+ ```
236
+
237
+ **Custom dbt tests**
238
+
+ ```sql
+ -- tests/assert_no_future_transactions.sql
+ -- Fails if any transactions are timestamped in the future
+ SELECT count(*) as future_count
+ FROM {{ ref('silver_transactions') }}
+ WHERE created_at > CURRENT_TIMESTAMP + INTERVAL '5 minutes'
+ HAVING count(*) > 0
+ ```
247
+
248
+ ### Reconciliation Patterns
249
+
250
+ Reconciliation compares record counts and aggregate values between pipeline stages to detect silent data loss.
251
+
252
+ **Row count reconciliation**
253
+
+ ```python
+ def reconcile_row_counts(
+     source_count: int,
+     destination_count: int,
+     pipeline_id: str,
+     expected_loss_rate: float = 0.001,  # max 0.1% loss acceptable
+ ) -> ReconciliationResult:
+     """Compare source and destination record counts."""
+     if source_count == 0:
+         raise ReconciliationError(f"Source count is zero for {pipeline_id}; possible extraction failure")
+
+     loss_rate = (source_count - destination_count) / source_count
+     # Emit before raising so the failing value is recorded as well
+     metrics.gauge("pipeline_record_loss_rate", loss_rate, tags={"pipeline": pipeline_id})
+
+     # Note: a negative loss_rate (destination > source, e.g. duplicates) passes this check
+     if loss_rate > expected_loss_rate:
+         raise ReconciliationError(
+             f"Record loss rate {loss_rate:.4%} exceeds threshold {expected_loss_rate:.4%}. "
+             f"Source: {source_count}, Destination: {destination_count}"
+         )
+
+     return ReconciliationResult(passed=True, loss_rate=loss_rate)
+ ```
276
+
277
+ **Aggregate checksum reconciliation**
278
+
+ ```python
+ def reconcile_aggregates(
+     source_df: pd.DataFrame,
+     destination_df: pd.DataFrame,
+     key_column: str,
+     value_columns: list[str],
+     tolerance_pct: float = 0.001,
+ ) -> ReconciliationResult:
+     """Compare aggregate sums between source and destination."""
+     failures = []
+     for col in value_columns:
+         src_sum = source_df[col].sum()
+         dst_sum = destination_df[col].sum()
+
+         if src_sum == 0 and dst_sum == 0:
+             continue
+
+         variance = abs(src_sum - dst_sum) / max(abs(src_sum), 1e-10)
+         if variance > tolerance_pct:
+             failures.append(f"{col}: src={src_sum:.2f}, dst={dst_sum:.2f}, variance={variance:.4%}")
+
+     if failures:
+         raise ReconciliationError("Aggregate reconciliation failed:\n" + "\n".join(failures))
+
+     return ReconciliationResult(passed=True)
+ ```
305
+
306
+ ### Quality Gate Integration
307
+
308
+ Quality checks must be integrated as pipeline gates, not optional post-processing steps:
309
+
310
+ ```
311
+ Extract → Schema Validate → [pass: Transform] [fail: DLQ]
312
+
313
+ Statistical Check → [anomaly: alert + hold] [pass: continue]
314
+
315
+ Business Rules → [fail: DLQ] [pass: Load]
316
+
317
+ Reconciliation → [fail: alert + rollback] [pass: commit]
318
+ ```
319
+
320
+ A quality gate failure at any stage must:
321
+ 1. Route affected records to the DLQ with full diagnostic context
322
+ 2. Emit a quality failure metric
323
+ 3. Alert the on-call engineer if the failure rate exceeds the quality budget
324
+ 4. Block pipeline progression for critical failures (structural, not statistical)
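
The four obligations above can be sketched as a single gate function. This is a minimal illustration under stated assumptions: `GateOutcome` and `run_quality_gate` are hypothetical names, the DLQ is modeled as a plain list, and the returned `failure_rate` stands in for the emitted metric.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional


@dataclass
class GateOutcome:
    passed: list = field(default_factory=list)
    failure_rate: float = 0.0  # stand-in for the emitted quality metric (step 2)
    alert: bool = False        # step 3: failure rate exceeded the quality budget
    blocked: bool = False      # step 4: critical gate saw at least one failure


def run_quality_gate(
    records: Iterable[dict],
    check: Callable[[dict], Optional[str]],  # returns a failure reason or None
    dlq: list,
    stage: str,
    quality_budget: float = 0.01,
    critical: bool = False,
) -> GateOutcome:
    outcome = GateOutcome()
    failures = 0
    for record in records:
        reason = check(record)
        if reason is None:
            outcome.passed.append(record)
            continue
        failures += 1
        # Step 1: route the record to the DLQ with diagnostic context.
        dlq.append({"record": record, "reason": reason, "stage": stage})
    total = failures + len(outcome.passed)
    outcome.failure_rate = failures / total if total else 0.0
    outcome.alert = outcome.failure_rate > quality_budget
    outcome.blocked = critical and failures > 0
    return outcome
```

A caller would stop the pipeline when `blocked` is true, page the on-call engineer when `alert` is true, and otherwise continue with `outcome.passed`.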