@zigrivers/scaffold 3.8.0 → 3.9.0
- package/README.md +73 -8
- package/content/knowledge/browser-extension/browser-extension-architecture.md +195 -0
- package/content/knowledge/browser-extension/browser-extension-content-scripts.md +264 -0
- package/content/knowledge/browser-extension/browser-extension-conventions.md +156 -0
- package/content/knowledge/browser-extension/browser-extension-cross-browser.md +229 -0
- package/content/knowledge/browser-extension/browser-extension-dev-environment.md +247 -0
- package/content/knowledge/browser-extension/browser-extension-manifest.md +220 -0
- package/content/knowledge/browser-extension/browser-extension-project-structure.md +183 -0
- package/content/knowledge/browser-extension/browser-extension-requirements.md +107 -0
- package/content/knowledge/browser-extension/browser-extension-security.md +202 -0
- package/content/knowledge/browser-extension/browser-extension-service-workers.md +265 -0
- package/content/knowledge/browser-extension/browser-extension-store-submission.md +155 -0
- package/content/knowledge/browser-extension/browser-extension-testing.md +270 -0
- package/content/knowledge/data-pipeline/data-pipeline-architecture.md +175 -0
- package/content/knowledge/data-pipeline/data-pipeline-batch-patterns.md +263 -0
- package/content/knowledge/data-pipeline/data-pipeline-conventions.md +176 -0
- package/content/knowledge/data-pipeline/data-pipeline-dev-environment.md +350 -0
- package/content/knowledge/data-pipeline/data-pipeline-orchestration.md +291 -0
- package/content/knowledge/data-pipeline/data-pipeline-project-structure.md +257 -0
- package/content/knowledge/data-pipeline/data-pipeline-quality.md +324 -0
- package/content/knowledge/data-pipeline/data-pipeline-requirements.md +145 -0
- package/content/knowledge/data-pipeline/data-pipeline-schema-management.md +295 -0
- package/content/knowledge/data-pipeline/data-pipeline-security.md +326 -0
- package/content/knowledge/data-pipeline/data-pipeline-streaming-patterns.md +280 -0
- package/content/knowledge/data-pipeline/data-pipeline-testing.md +406 -0
- package/content/knowledge/ml/ml-architecture.md +172 -0
- package/content/knowledge/ml/ml-conventions.md +209 -0
- package/content/knowledge/ml/ml-dev-environment.md +299 -0
- package/content/knowledge/ml/ml-experiment-tracking.md +285 -0
- package/content/knowledge/ml/ml-model-evaluation.md +256 -0
- package/content/knowledge/ml/ml-observability.md +253 -0
- package/content/knowledge/ml/ml-project-structure.md +216 -0
- package/content/knowledge/ml/ml-requirements.md +138 -0
- package/content/knowledge/ml/ml-security.md +188 -0
- package/content/knowledge/ml/ml-serving-patterns.md +243 -0
- package/content/knowledge/ml/ml-testing.md +301 -0
- package/content/knowledge/ml/ml-training-patterns.md +269 -0
- package/content/methodology/browser-extension-overlay.yml +82 -0
- package/content/methodology/data-pipeline-overlay.yml +70 -0
- package/content/methodology/ml-overlay.yml +70 -0
- package/dist/cli/commands/init.d.ts +13 -0
- package/dist/cli/commands/init.d.ts.map +1 -1
- package/dist/cli/commands/init.js +122 -2
- package/dist/cli/commands/init.js.map +1 -1
- package/dist/cli/commands/init.test.js +120 -0
- package/dist/cli/commands/init.test.js.map +1 -1
- package/dist/config/schema.d.ts +864 -48
- package/dist/config/schema.d.ts.map +1 -1
- package/dist/config/schema.js +53 -0
- package/dist/config/schema.js.map +1 -1
- package/dist/config/schema.test.js +166 -3
- package/dist/config/schema.test.js.map +1 -1
- package/dist/core/assembly/overlay-loader.test.js +33 -0
- package/dist/core/assembly/overlay-loader.test.js.map +1 -1
- package/dist/e2e/project-type-overlays.test.d.ts +2 -2
- package/dist/e2e/project-type-overlays.test.js +499 -33
- package/dist/e2e/project-type-overlays.test.js.map +1 -1
- package/dist/types/config.d.ts +10 -1
- package/dist/types/config.d.ts.map +1 -1
- package/dist/wizard/questions.d.ts +17 -1
- package/dist/wizard/questions.d.ts.map +1 -1
- package/dist/wizard/questions.js +72 -1
- package/dist/wizard/questions.js.map +1 -1
- package/dist/wizard/questions.test.js +135 -0
- package/dist/wizard/questions.test.js.map +1 -1
- package/dist/wizard/wizard.d.ts +13 -0
- package/dist/wizard/wizard.d.ts.map +1 -1
- package/dist/wizard/wizard.js +17 -1
- package/dist/wizard/wizard.js.map +1 -1
- package/package.json +1 -1
|
@@ -0,0 +1,263 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: data-pipeline-batch-patterns
|
|
3
|
+
description: DAG design, partitioning strategies, incremental loads, backfill strategies, and dependency management for batch data pipelines
|
|
4
|
+
topics: [data-pipeline, batch, dag, partitioning, incremental-load, backfill, dependency-management, airflow]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
Batch pipeline design decisions — DAG structure, partition strategy, incremental load mechanics, and backfill tooling — determine whether a pipeline is operationally maintainable or a constant source of incidents. Most batch pipeline failures stem from four root causes: missing idempotency, no backfill support, poorly managed upstream dependencies, and partitioning strategies that don't match query access patterns. Address all four up front.
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
Design DAGs with explicit dependencies, clear task granularity, and idempotent operations. Partition data by the natural business time dimension (event date, not processing date). Implement incremental loads with explicit watermarks. Build backfill support into every pipeline from day one — it is always needed. Model upstream dependencies explicitly with sensors rather than hardcoded schedules.
|
|
12
|
+
|
|
13
|
+
## Deep Guidance
|
|
14
|
+
|
|
15
|
+
### DAG Design Principles
|
|
16
|
+
|
|
17
|
+
A well-designed DAG has four properties:
|
|
18
|
+
|
|
19
|
+
**1. Idempotent**: Running the DAG for a given execution date multiple times produces the same result as running it once. Every task must be idempotent.
|
|
20
|
+
|
|
21
|
+
**2. Atomic partitions**: Each DAG run produces a complete, self-consistent partition. Partial writes that leave a partition in an intermediate state corrupt downstream consumers. Use write-then-rename patterns: write to a temp location, verify completeness, then atomically rename/move to the final location.
|
|
22
|
+
|
|
23
|
+
**3. Explicit dependencies**: Task A depends on Task B only if there is a real data dependency. Do not add artificial dependencies to control parallelism (use pools/slots instead). False dependencies serialize work that could run in parallel.
|
|
24
|
+
|
|
25
|
+
**4. Catchup-capable**: DAGs must support backfill execution (running past execution dates) without manual intervention. This requires that all tasks use the DAG execution date as the partition identifier, not `NOW()` or `CURRENT_DATE`.
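
The write-then-rename pattern from property 2 can be sketched on a local filesystem as shown here. This is an illustration under assumptions, not part of the package: `write_partition_atomically` and the JSON-lines layout are hypothetical, and object stores without an atomic rename need a different commit mechanism.

```python
import json
import os
import tempfile
from pathlib import Path


def write_partition_atomically(records: list[dict], final_path: Path) -> int:
    """Write records to a temp file, verify it, then atomically publish it."""
    final_path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=final_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            for record in records:
                handle.write(json.dumps(record) + "\n")
        # Verify completeness before publishing: line count must match the input.
        with open(tmp_name, encoding="utf-8") as handle:
            written = sum(1 for _ in handle)
        if written != len(records):
            raise RuntimeError(f"expected {len(records)} rows, wrote {written}")
        os.replace(tmp_name, final_path)  # replaces any earlier version of the partition file
    except BaseException:
        # Never leave a half-written temp file behind for consumers to find.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return written
```

Because the final path is only ever replaced by a fully verified file, a retry of the same run leaves the partition in the same state as a single successful run.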
|
|
26
|
+
|
|
27
|
+
**DAG structure template**
|
|
28
|
+
|
|
29
|
+
```python
|
|
30
|
+
from datetime import datetime, timedelta

from airflow.decorators import dag, task

# `sources`, `sinks`, `quality`, `transforms`, and the staging helpers used below
# are project-specific placeholders, not Airflow APIs.


def alert_on_failure(context):
    """Placeholder failure callback; wire this to your alerting system."""


@dag(
    dag_id="payments_transactions_ingest",
    schedule_interval="0 * * * *",  # hourly
    start_date=datetime(2024, 1, 1),  # fixed example date; a relative start date shifts on every parse
    catchup=True,  # enable backfill
    max_active_runs=4,  # allow parallel backfill
    default_args={
        "retries": 3,
        "retry_delay": timedelta(minutes=5),
        "retry_exponential_backoff": True,
        "sla": timedelta(minutes=45),  # alert if not done in 45min
        "on_failure_callback": alert_on_failure,
    },
    tags=["payments", "ingestion", "hourly"],
)
def payments_transactions_ingest():

    @task
    def extract(execution_date: datetime = None) -> dict:
        """Extract transactions for the execution hour from source system."""
        window_start = execution_date
        window_end = execution_date + timedelta(hours=1)
        records = sources.payments.read(window_start, window_end)
        staging_path = write_to_staging(records, partition=execution_date)
        return {"staging_path": staging_path, "record_count": len(records)}

    @task
    def validate(extract_result: dict) -> dict:
        """Validate extracted records against quality rules."""
        records = read_from_staging(extract_result["staging_path"])
        quality_result = quality.check(records)
        if not quality_result.passed:
            raise ValueError(f"Quality check failed: {quality_result.failures}")
        return {**extract_result, "quality_passed": True}

    @task
    def transform(validate_result: dict) -> dict:
        """Apply transformations to validated records."""
        records = read_from_staging(validate_result["staging_path"])
        transformed = transforms.payments.normalize(records)
        output_path = write_transformed(transformed)
        return {"output_path": output_path, "record_count": len(transformed)}

    @task
    def load(transform_result: dict, execution_date: datetime = None) -> None:
        """Load transformed records to destination, overwriting the partition."""
        records = read_from_staging(transform_result["output_path"])
        sinks.warehouse.write(
            records,
            partition=execution_date.strftime("%Y-%m-%d/%H"),
            mode="overwrite",  # idempotent: overwrite entire partition on retry
        )

    extracted = extract()
    validated = validate(extracted)
    transformed = transform(validated)
    load(transformed)


dag = payments_transactions_ingest()
|
|
93
|
+
```
|
|
94
|
+
|
|
95
|
+
### Partitioning Strategies
|
|
96
|
+
|
|
97
|
+
Partitioning determines how data is physically organized in storage and how efficiently it can be queried. Misaligned partitioning causes full table scans where partition pruning should eliminate 99% of data.
|
|
98
|
+
|
|
99
|
+
**Partition by event time, not processing time**
|
|
100
|
+
|
|
101
|
+
Use the business event timestamp (when the event occurred) as the partition key, not when the pipeline processed it. Late-arriving events processed at 3am should land in yesterday's partition, not today's.
|
|
102
|
+
|
|
103
|
+
```python
|
|
104
|
+
# Correct: use event timestamp for partitioning
|
|
105
|
+
partition_key = record["event_timestamp"].date()
|
|
106
|
+
|
|
107
|
+
# Wrong: use processing time
|
|
108
|
+
partition_key = datetime.utcnow().date()
|
|
109
|
+
```
|
|
110
|
+
|
|
111
|
+
Exception: if downstream consumers query by processing time (e.g., "give me everything loaded since midnight"), partition by processing time instead and document this clearly.
|
|
112
|
+
|
|
113
|
+
**Partition granularity**
|
|
114
|
+
|
|
115
|
+
Choose partition granularity based on query patterns and data volume:
|
|
116
|
+
|
|
117
|
+
| Volume per day | Query access pattern | Recommended partitioning |
|----------------|----------------------|--------------------------|
| < 1 GB | Daily queries | Daily partitions |
| 1–100 GB | Daily queries | Daily partitions |
| 100 GB – 1 TB | Hourly queries | Hourly partitions |
| > 1 TB | Hourly queries | Hourly + secondary partition (country, region) |
|
|
123
|
+
|
|
124
|
+
Over-partitioning (e.g., per-minute partitions for low-volume data) causes file system metadata overhead and slow listing performance in object stores.
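
As a rough sketch, the volume thresholds from the granularity table can be encoded in a helper, together with a Hive-style partition path built from the event timestamp. The function names, the `event_date=` / `event_hour=` path layout, and the handling of the exact 100 GB and 1 TB boundaries are assumptions for illustration.

```python
from datetime import datetime


def recommend_granularity(daily_volume_gb: float) -> str:
    """Map daily data volume (GB) to a partition granularity."""
    if daily_volume_gb < 0:
        raise ValueError("daily_volume_gb must be non-negative")
    if daily_volume_gb < 100:
        return "daily"
    if daily_volume_gb <= 1000:
        return "hourly"
    return "hourly+secondary"  # add a secondary key such as country or region


def partition_path(event_timestamp: datetime, granularity: str) -> str:
    """Build a partition path from the event time, never from the processing time."""
    path = f"event_date={event_timestamp:%Y-%m-%d}"
    if granularity.startswith("hourly"):
        path += f"/event_hour={event_timestamp:%H}"
    return path
```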
|
|
125
|
+
|
|
126
|
+
**Partition pruning in query engines**
|
|
127
|
+
|
|
128
|
+
Partition columns must appear in `WHERE` clauses to benefit from pruning. Ensure the partition column name and data type match what query engines expect:
|
|
129
|
+
|
|
130
|
+
```sql
|
|
131
|
+
-- Efficient: partition pruning eliminates 364/365 partitions
|
|
132
|
+
SELECT * FROM silver_transactions
|
|
133
|
+
WHERE event_date = '2024-01-15'
|
|
134
|
+
AND country = 'US';
|
|
135
|
+
|
|
136
|
+
-- Inefficient: filtering on a function of event_timestamp instead of the event_date partition column prevents pruning
|
|
137
|
+
SELECT * FROM silver_transactions
|
|
138
|
+
WHERE DATE(event_timestamp) = '2024-01-15'; -- full scan
|
|
139
|
+
```
|
|
140
|
+
|
|
141
|
+
### Incremental Load Patterns
|
|
142
|
+
|
|
143
|
+
Full table reloads are expensive for large datasets. Incremental loads process only new or changed records since the last successful run.
|
|
144
|
+
|
|
145
|
+
**Watermark-based incremental load**
|
|
146
|
+
|
|
147
|
+
Maintain a persistent watermark (the processing boundary) in a metadata store:
|
|
148
|
+
|
|
149
|
+
```python
|
|
150
|
+
from datetime import datetime, timedelta

# `sinks` is a project-specific placeholder for the warehouse client.


class WatermarkStore:
    def get(self, pipeline_id: str) -> datetime:
        """Return the high-water mark for the pipeline (last successfully processed timestamp)."""

    def set(self, pipeline_id: str, watermark: datetime) -> None:
        """Update the high-water mark after successful processing."""


def incremental_extract(pipeline_id: str, source) -> tuple[list[dict], datetime]:
    store = WatermarkStore()
    low_watermark = store.get(pipeline_id)
    high_watermark = datetime.utcnow() - timedelta(minutes=5)  # lag buffer for late arrivals

    records = source.read(low_watermark, high_watermark)

    # Do not advance the watermark here; the caller persists it only after a successful write
    return records, high_watermark


def incremental_load(pipeline_id: str, records: list[dict], watermark: datetime) -> None:
    sinks.warehouse.write(records, mode="append")
    WatermarkStore().set(pipeline_id, watermark)
|
|
170
|
+
```
|
|
171
|
+
|
|
172
|
+
**Late arrival buffer**: Always include a lag buffer (5–60 minutes) when setting the high watermark. Events frequently arrive late at the source due to network delays, mobile client sync, or upstream processing delays. Processing too close to "now" produces incomplete partitions.
|
|
173
|
+
|
|
174
|
+
**Deduplication in incremental loads**: Incremental appends accumulate duplicates when pipelines retry or overlap windows. Deduplicate in the silver layer using the business key (`transaction_id`, `event_id`), not the pipeline-internal record ID.
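
A minimal sketch of business-key deduplication, assuming each record carries a `transaction_id` and a comparable `updated_at` field (the latter is an assumption; any monotonically increasing version field would do):

```python
def deduplicate(records: list[dict], key: str = "transaction_id",
                version_field: str = "updated_at") -> list[dict]:
    """Keep one record per business key, preferring the highest version_field value."""
    latest: dict = {}
    for record in records:
        current = latest.get(record[key])
        # On ties the later arrival wins, so re-delivered duplicates collapse to one row.
        if current is None or record[version_field] >= current[version_field]:
            latest[record[key]] = record
    return list(latest.values())
```

Running the function over its own output returns the same rows, so it is safe to apply on every incremental run.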
|
|
175
|
+
|
|
176
|
+
### Backfill Strategies
|
|
177
|
+
|
|
178
|
+
Every pipeline will need to be backfilled. Plan for it upfront.
|
|
179
|
+
|
|
180
|
+
**When backfills are needed**
|
|
181
|
+
- A bug is discovered in transformation logic; historical data must be reprocessed with the fix
|
|
182
|
+
- A new derived column is added to the schema; existing partitions lack the column
|
|
183
|
+
- A source system retroactively corrects historical data
|
|
184
|
+
- Initial data load when launching a new pipeline
|
|
185
|
+
|
|
186
|
+
**Airflow backfill execution**
|
|
187
|
+
|
|
188
|
+
```bash
|
|
189
|
+
# Backfill a specific date range
|
|
190
|
+
airflow dags backfill \
|
|
191
|
+
payments_transactions_ingest \
|
|
192
|
+
--start-date 2024-01-01 \
|
|
193
|
+
--end-date 2024-01-31 \
|
|
194
|
+
--reset-dagruns
|
|
195
|
+
|
|
196
|
+
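# Backfill with parallelism control (avoid overloading source systems).
# Check `airflow dags backfill --help` for your Airflow version before relying on this flag;
# in Airflow 2.x the DAG-level `max_active_runs` setting bounds concurrent backfill runs.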
|
|
197
|
+
airflow dags backfill \
|
|
198
|
+
payments_transactions_ingest \
|
|
199
|
+
--start-date 2024-01-01 \
|
|
200
|
+
--end-date 2024-01-31 \
|
|
201
|
+
--max-active-runs 2
|
|
202
|
+
```
|
|
203
|
+
|
|
204
|
+
**Backfill safety checklist**
|
|
205
|
+
1. Verify the pipeline is idempotent for the target partition range before starting
|
|
206
|
+
2. Check source system capacity — backfills can overload APIs or databases
|
|
207
|
+
3. Set `max_active_runs` to limit parallelism during backfill
|
|
208
|
+
4. Monitor DLQ during backfill for unexpected errors
|
|
209
|
+
5. Validate output record counts against expected ranges after completion
|
|
210
|
+
6. Notify downstream consumers if backfill will change data they have already read
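
Step 5 of the checklist can be automated with a small audit. The sketch below assumes daily partitions keyed by ISO date and a `counts` mapping gathered from the warehouse; both the helper names and the input shape are hypothetical.

```python
from datetime import date, timedelta


def expected_partitions(start: date, end: date) -> list[str]:
    """List the daily partition keys a backfill over [start, end] should produce."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def audit_backfill(counts: dict[str, int], start: date, end: date,
                   min_rows: int, max_rows: int) -> dict[str, str]:
    """Report missing partitions and partitions with out-of-range record counts."""
    problems = {}
    for partition in expected_partitions(start, end):
        if partition not in counts:
            problems[partition] = "missing"
        elif not min_rows <= counts[partition] <= max_rows:
            problems[partition] = f"count {counts[partition]} outside [{min_rows}, {max_rows}]"
    return problems
```

An empty result means every expected partition exists and falls inside the expected range.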
|
|
211
|
+
|
|
212
|
+
### Dependency Management
|
|
213
|
+
|
|
214
|
+
**Upstream data availability sensors**
|
|
215
|
+
|
|
216
|
+
Use sensors to wait for upstream data rather than hardcoding schedules:
|
|
217
|
+
|
|
218
|
+
```python
|
|
219
|
+
from airflow.sensors.external_task import ExternalTaskSensor
|
|
220
|
+
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
|
|
221
|
+
|
|
222
|
+
# Wait for upstream pipeline to complete
|
|
223
|
+
wait_for_upstream = ExternalTaskSensor(
|
|
224
|
+
task_id="wait_for_stripe_ingest",
|
|
225
|
+
external_dag_id="stripe_charges_ingest",
|
|
226
|
+
external_task_id="load",
|
|
227
|
+
timeout=3600, # fail after 1 hour
|
|
228
|
+
poke_interval=60, # check every minute
|
|
229
|
+
mode="reschedule", # release worker slot while waiting
|
|
230
|
+
)
|
|
231
|
+
|
|
232
|
+
# Wait for data file to appear
|
|
233
|
+
wait_for_file = S3KeySensor(
|
|
234
|
+
task_id="wait_for_source_file",
|
|
235
|
+
bucket_name="raw-data",
|
|
236
|
+
bucket_key="payments/{{ ds }}/transactions.parquet",
|
|
237
|
+
timeout=7200,
|
|
238
|
+
poke_interval=300,
|
|
239
|
+
mode="reschedule",
|
|
240
|
+
)
|
|
241
|
+
```
|
|
242
|
+
|
|
243
|
+
**Cross-DAG dependency mapping**
|
|
244
|
+
|
|
245
|
+
Maintain a dependency map document alongside DAG code:
|
|
246
|
+
|
|
247
|
+
```yaml
|
|
248
|
+
# config/dag-dependencies.yaml
payments_transactions_enrich:
  depends_on:
    - dag: payments_transactions_ingest
      task: load
      max_lag_hours: 2
    - dag: currency_rates_sync
      task: load
      max_lag_hours: 24
  sla_hours: 4
  consumers:
    - dag: revenue_reports_aggregate
    - dag: fraud_detection_features
|
|
261
|
+
```
|
|
262
|
+
|
|
263
|
+
This map enables automatic lineage tracking, impact analysis for schema changes, and SLA propagation calculations.
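
To illustrate the impact-analysis use case, the sketch below walks a dependency map to find every DAG affected by a change to one upstream DAG. `DEPENDS_ON` is a hypothetical in-memory form of the dependency file (loading the YAML is left out), and `downstream_of` is an illustrative helper, not an Airflow API.

```python
from collections import deque

# Hypothetical in-memory form of the dependency map: DAG id -> upstream DAG ids.
DEPENDS_ON = {
    "payments_transactions_enrich": ["payments_transactions_ingest", "currency_rates_sync"],
    "revenue_reports_aggregate": ["payments_transactions_enrich"],
    "fraud_detection_features": ["payments_transactions_enrich"],
}


def downstream_of(dag_id: str, depends_on: dict[str, list[str]]) -> list[str]:
    """Return every DAG that transitively consumes dag_id, in breadth-first order."""
    # Invert the map: upstream DAG id -> direct consumers.
    consumers: dict[str, list[str]] = {}
    for child, parents in depends_on.items():
        for parent in parents:
            consumers.setdefault(parent, []).append(child)
    seen, order, queue = {dag_id}, [], deque([dag_id])
    while queue:
        for child in consumers.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order
```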
|
|
@@ -0,0 +1,176 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: data-pipeline-conventions
|
|
3
|
+
description: Naming conventions, error handling patterns, idempotency, and dead-letter queue patterns for data pipelines
|
|
4
|
+
topics: [data-pipeline, conventions, naming, error-handling, idempotency, dead-letter, patterns]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
Data pipeline conventions are the shared language that makes a pipeline system maintainable across teams and time. Inconsistent naming, absent idempotency guarantees, and ad-hoc error handling are the top causes of data corruption and operational incidents. Establish and enforce these conventions from day one — retrofitting them onto running production pipelines is expensive and risky.
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
Use a consistent naming scheme (`{domain}_{entity}_{action}`, with a `_v{N}` suffix for breaking changes) for all pipeline artifacts. Implement idempotency at every processing step so reruns never produce duplicates. Route all unprocessable records to dead-letter queues (DLQs) rather than silently dropping them. Classify errors as retriable or non-retriable and handle each class explicitly. Document conventions in a shared standard and enforce them via linting and code review.
|
|
12
|
+
|
|
13
|
+
## Deep Guidance
|
|
14
|
+
|
|
15
|
+
### Naming Conventions
|
|
16
|
+
|
|
17
|
+
Pipeline naming consistency enables operational tooling (lineage, monitoring, search) and reduces cognitive overhead when debugging incidents. Apply a single naming scheme across all pipeline artifacts.
|
|
18
|
+
|
|
19
|
+
**Event and topic naming**
|
|
20
|
+
|
|
21
|
+
Use `{domain}_{entity}_{action}` in `snake_case`:
|
|
22
|
+
- `payments_transaction_created`
|
|
23
|
+
- `users_profile_updated`
|
|
24
|
+
- `inventory_item_depleted`
|
|
25
|
+
- `orders_shipment_dispatched`
|
|
26
|
+
|
|
27
|
+
Avoid generic names like `events`, `data`, or `messages`. The name must encode what the event represents without reading the schema.
|
|
28
|
+
|
|
29
|
+
**Table and dataset naming**
|
|
30
|
+
|
|
31
|
+
Apply the medallion layer as a prefix for warehouse tables:
|
|
32
|
+
- Bronze (raw): `bronze_{source}_{entity}` → `bronze_stripe_charges`
|
|
33
|
+
- Silver (cleaned): `silver_{entity}` → `silver_charges`
|
|
34
|
+
- Gold (aggregated): `gold_{domain}_{subject}` → `gold_revenue_daily`
|
|
35
|
+
|
|
36
|
+
For versioned schemas, append the version: `silver_charges_v2`. Never make a breaking schema change to a table in place; create a new versioned table and migrate consumers.
|
|
37
|
+
|
|
38
|
+
**Pipeline/DAG naming**
|
|
39
|
+
|
|
40
|
+
DAG IDs and pipeline names follow `{domain}_{entity}_{operation}`:
|
|
41
|
+
- `payments_transactions_ingest`
|
|
42
|
+
- `users_events_enrich`
|
|
43
|
+
- `revenue_reports_aggregate`
|
|
44
|
+
|
|
45
|
+
Append environment suffix in non-production: `payments_transactions_ingest_staging`. This prevents accidental cross-environment data writes.
|
|
46
|
+
|
|
47
|
+
**Job and task naming within pipelines**
|
|
48
|
+
|
|
49
|
+
Tasks inside a DAG are verbs: `extract`, `validate`, `transform`, `load`, `reconcile`. Use compound verbs when the scope requires it: `extract_from_stripe`, `validate_schema`, `load_to_bigquery`.
|
|
50
|
+
|
|
51
|
+
**Version suffixing**
|
|
52
|
+
|
|
53
|
+
Append `_v{N}` to pipeline names and table names when making breaking changes. Consumer-visible interfaces (schemas, API contracts) increment their version when backward compatibility breaks. Internal implementation changes do not require version increments.
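
A naming lint for event, topic, and DAG names might look like the following sketch. The regular expression is one possible reading of the `{domain}_{entity}_{action}` rule (at least three lowercase segments, optional `_v{N}` suffix); it is an assumption, and table names with layer prefixes would need their own pattern.

```python
import re

# Lowercase snake_case, at least three segments, optional _v{N} suffix at the end.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9]*(_[a-z][a-z0-9]*){2,}(_v[1-9][0-9]*)?$")
GENERIC_NAMES = {"events", "data", "messages"}


def check_name(name: str) -> list[str]:
    """Return a list of convention violations for an event, topic, or DAG name."""
    problems = []
    if name in GENERIC_NAMES:
        problems.append("generic name")
    if not NAME_PATTERN.fullmatch(name):
        problems.append("expected {domain}_{entity}_{action} in snake_case")
    return problems
```

A check like this can run in CI over DAG IDs and topic definitions so violations are caught before review.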
|
|
54
|
+
|
|
55
|
+
### Error Handling Patterns
|
|
56
|
+
|
|
57
|
+
Classify every error before handling it. The classification determines the recovery strategy:
|
|
58
|
+
|
|
59
|
+
**Retriable errors** — transient failures that will likely succeed on retry:
|
|
60
|
+
- Network timeouts and connection resets
|
|
61
|
+
- Rate limiting (HTTP 429, throttling exceptions)
|
|
62
|
+
- Transient database lock contention
|
|
63
|
+
- Infrastructure blips (temporary unavailability)
|
|
64
|
+
|
|
65
|
+
Handle retriable errors with exponential backoff and jitter:
|
|
66
|
+
```python
|
|
67
|
+
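import random
import time


class RetriableError(Exception):
    """Transient failure (timeout, throttling); define or import your own equivalent."""


def retry_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Call fn, retrying RetriableError with capped exponential backoff and jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RetriableError:
            if attempt == max_retries - 1:
                raise  # retries exhausted: surface the last error
            # Exponential backoff capped at max_delay, plus up to 1s of jitter
            delay = min(base_delay * (2 ** attempt), max_delay) + random.uniform(0, 1)
            time.sleep(delay)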
|
|
76
|
+
```
|
|
77
|
+
|
|
78
|
+
Cap maximum retry delay (e.g., 60 seconds) to avoid indefinite blocking. Cap total retry duration to fit within SLA budgets.
|
|
79
|
+
|
|
80
|
+
**Non-retriable errors** — permanent failures that require human intervention or alternative routing:
|
|
81
|
+
- Schema validation failures (malformed record structure)
|
|
82
|
+
- Business rule violations (negative quantities, birthdates centuries in the future)
|
|
83
|
+
- Missing required foreign key references
|
|
84
|
+
- Encoding errors and unparseable data
|
|
85
|
+
|
|
86
|
+
Do not retry non-retriable errors. Route them directly to the dead-letter queue with full context.
|
|
87
|
+
|
|
88
|
+
**Poison pill errors** — records that cause the pipeline to crash on every attempt:
|
|
89
|
+
- Records that trigger a segfault or OOM in the processing runtime
|
|
90
|
+
- Records that cause infinite loops in transformation logic
|
|
91
|
+
|
|
92
|
+
Detect poison pills by tracking per-record failure counts. After N failures for the same record ID, route to DLQ with a `poison_pill` classification tag and alert the on-call engineer.
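
The three error classes can be combined into one routing decision. The sketch below is illustrative: the exception classes and `FailureTracker` are hypothetical names, and the in-memory counter is an assumption. A real poison pill can crash the process, so the counts would need to live in an external store.

```python
from collections import Counter


class RetriableError(Exception):
    """Transient failure: retry with backoff."""


class NonRetriableError(Exception):
    """Permanent failure: route to the DLQ without retrying."""


class FailureTracker:
    """Count failures per record ID and flag poison pills after max_failures."""

    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures: Counter = Counter()

    def classify(self, record_id: str, error: Exception) -> str:
        """Return 'retry', 'non_retriable', or 'poison_pill' for this failure."""
        self.failures[record_id] += 1
        if self.failures[record_id] >= self.max_failures:
            return "poison_pill"  # route to DLQ with the tag and alert on-call
        if isinstance(error, RetriableError):
            return "retry"
        return "non_retriable"  # route straight to the DLQ
```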
|
|
93
|
+
|
|
94
|
+
### Idempotency
|
|
95
|
+
|
|
96
|
+
Every pipeline processing step must be idempotent: running the step multiple times with the same input produces the same output as running it once. Idempotency is required for:
|
|
97
|
+
- Safe retries without duplicating data
|
|
98
|
+
- Backfill operations (reprocessing historical data)
|
|
99
|
+
- Recovery from failures mid-run
|
|
100
|
+
- Zero-downtime deploys (brief overlap between old and new pipeline versions)
|
|
101
|
+
|
|
102
|
+
**Idempotency implementation patterns**
|
|
103
|
+
|
|
104
|
+
*Deduplication keys*: Assign a unique deterministic ID to every record based on its content or source identifiers. Use this key to detect and skip already-processed records.
|
|
105
|
+
|
|
106
|
+
```python
|
|
107
|
+
import hashlib


def compute_idempotency_key(record: dict) -> str:
    # Build key from stable source identifiers
    key_fields = (record["source_system"], record["source_id"], record["event_type"])
    return hashlib.sha256("|".join(str(f) for f in key_fields).encode()).hexdigest()
|
|
111
|
+
```
|
|
112
|
+
|
|
113
|
+
*Upsert patterns*: Write operations should use upsert semantics (INSERT OR REPLACE, MERGE, ON CONFLICT DO UPDATE) rather than plain INSERT. A re-run inserts any missing records and overwrites existing ones with identical values, so the final state is unchanged.
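
A small runnable sketch of upsert semantics using SQLite's `ON CONFLICT ... DO UPDATE` (available in SQLite 3.24 and later). The `silver_charges` columns shown here are assumptions for illustration; a warehouse would typically use `MERGE` instead.

```python
import sqlite3


def upsert_charges(conn: sqlite3.Connection, rows: list[tuple[str, int]]) -> None:
    """Insert new charges and overwrite existing ones, keyed by charge_id."""
    conn.executemany(
        """
        INSERT INTO silver_charges (charge_id, amount_cents)
        VALUES (?, ?)
        ON CONFLICT(charge_id) DO UPDATE SET amount_cents = excluded.amount_cents
        """,
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE silver_charges (charge_id TEXT PRIMARY KEY, amount_cents INTEGER)")
batch = [("ch_1", 1000), ("ch_2", 250)]
upsert_charges(conn, batch)
upsert_charges(conn, batch)  # re-run: same final state, no duplicate rows
```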
|
|
114
|
+
|
|
115
|
+
*Partition-based overwrite*: For batch pipelines writing to partitioned storage, overwrite the entire partition on each run. The result of running once or ten times is identical — the partition contains the correct data for that time window.
|
|
116
|
+
|
|
117
|
+
*Idempotency tokens*: When calling external APIs, pass an idempotency token (request UUID) so the upstream system treats duplicate calls as no-ops.
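
One way to make a token stable across retries is to derive it from the record rather than generating it at call time. The sketch below uses a name-based UUID (version 5). The namespace string and helper name are hypothetical, and how the token is transmitted (header or request field) depends on the API being called.

```python
import uuid

# Hypothetical namespace for this pipeline; any fixed UUID works as long as it never changes.
PIPELINE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "payments_transactions_ingest.example")


def idempotency_token(source_id: str, operation: str) -> str:
    """Derive a stable request token so retries of the same call reuse the same UUID."""
    return str(uuid.uuid5(PIPELINE_NAMESPACE, f"{operation}:{source_id}"))
```

Unlike `uuid.uuid4()`, the same inputs always yield the same token, so a rerun or backfill does not look like a new request to the upstream system.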
|
|
118
|
+
|
|
119
|
+
**What breaks idempotency**
|
|
120
|
+
|
|
121
|
+
- Appending to a table without checking for existing records
|
|
122
|
+
- Generating new random UUIDs on every run (generate once, store, and reuse, or derive the ID deterministically)
|
|
123
|
+
- Incrementing counters inside processing logic
|
|
124
|
+
- Using `NOW()` or `CURRENT_TIMESTAMP` as a data value rather than passing the source event timestamp through
|
|
125
|
+
|
|
126
|
+
### Dead-Letter Queue (DLQ) Patterns
|
|
127
|
+
|
|
128
|
+
A dead-letter queue is a separate storage location for records that failed processing and cannot be automatically recovered. DLQs prevent bad records from blocking pipeline progress while preserving the data for investigation and reprocessing.
|
|
129
|
+
|
|
130
|
+
**DLQ design principles**
|
|
131
|
+
|
|
132
|
+
Every pipeline must have a DLQ. Silently discarding unprocessable records is never acceptable — it creates invisible data loss that is discovered only when downstream reports are wrong.
|
|
133
|
+
|
|
134
|
+
**DLQ record structure**
|
|
135
|
+
|
|
136
|
+
Enrich every DLQ record with the failure context needed to investigate and fix it:
|
|
137
|
+
|
|
138
|
+
```json
|
|
139
|
+
{
|
|
140
|
+
"original_record": { ... },
|
|
141
|
+
"dlq_metadata": {
|
|
142
|
+
"pipeline_id": "payments_transactions_ingest",
|
|
143
|
+
"pipeline_version": "1.3.2",
|
|
144
|
+
"failure_timestamp": "2024-01-15T14:23:45Z",
|
|
145
|
+
"failure_reason": "schema_validation_error",
|
|
146
|
+
"error_message": "Field 'amount' expected number, got string: 'N/A'",
|
|
147
|
+
"error_class": "non_retriable",
|
|
148
|
+
"retry_count": 0,
|
|
149
|
+
"source_partition": "2024-01-15",
|
|
150
|
+
"source_offset": 1847392
|
|
151
|
+
}
|
|
152
|
+
}
|
|
153
|
+
```
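
A helper that wraps a failed record in this structure might look like the sketch below. `build_dlq_record` is a hypothetical name, using the exception class name as `failure_reason` is an assumption, and the source partition and offset fields are omitted because they depend on the source system.

```python
from datetime import datetime, timezone


def build_dlq_record(record: dict, error: Exception, *, pipeline_id: str,
                     pipeline_version: str, error_class: str, retry_count: int = 0) -> dict:
    """Wrap a failed record with the failure context needed to investigate it."""
    return {
        "original_record": record,  # preserved untouched for later replay
        "dlq_metadata": {
            "pipeline_id": pipeline_id,
            "pipeline_version": pipeline_version,
            "failure_timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            "failure_reason": type(error).__name__,
            "error_message": str(error),
            "error_class": error_class,
            "retry_count": retry_count,
        },
    }
```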
|
|
154
|
+
|
|
155
|
+
**DLQ operational workflow**
|
|
156
|
+
|
|
157
|
+
1. Alert when DLQ depth exceeds threshold (e.g., > 100 records within 1 hour)
|
|
158
|
+
2. Engineer investigates: is this a code bug (fix and reprocess) or a source data issue (escalate to producer)?
|
|
159
|
+
3. Fix the root cause (pipeline code or source data)
|
|
160
|
+
4. Replay DLQ records through the fixed pipeline
|
|
161
|
+
5. Verify replayed records processed successfully and DLQ is drained
|
|
162
|
+
|
|
163
|
+
**DLQ retention**: Retain DLQ records for at least 30 days (or the SLA for incident resolution). Purge older records only after confirming they were resolved or intentionally discarded.
|
|
164
|
+
|
|
165
|
+
**DLQ monitoring**: Track DLQ depth, record age, and failure reason distribution as first-class pipeline metrics. A growing DLQ is a production incident.
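
The metrics named above can be computed from DLQ records in a few lines. This sketch assumes records carry a `dlq_metadata` object with `failure_timestamp` (UTC, `YYYY-MM-DDTHH:MM:SSZ`) and `failure_reason`; the function name, window, and threshold defaults are illustrative.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def dlq_metrics(records: list[dict], now: datetime, window: timedelta = timedelta(hours=1),
                threshold: int = 100) -> dict:
    """Summarise DLQ depth, oldest record age, failure reasons, and alert state."""
    stamps = [
        datetime.strptime(r["dlq_metadata"]["failure_timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        .replace(tzinfo=timezone.utc)
        for r in records
    ]
    # Alert on the number of records that failed inside the recent window.
    recent = sum(1 for ts in stamps if now - ts <= window)
    return {
        "depth": len(records),
        "oldest_age_seconds": max(((now - ts).total_seconds() for ts in stamps), default=0.0),
        "reasons": Counter(r["dlq_metadata"]["failure_reason"] for r in records),
        "alert": recent > threshold,
    }
```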
|
|
166
|
+
|
|
167
|
+
### Schema and Contract Conventions
|
|
168
|
+
|
|
169
|
+
Document the contract each pipeline step exposes to its consumers:
|
|
170
|
+
|
|
171
|
+
- Input schema: the record structure this step accepts
|
|
172
|
+
- Output schema: the record structure this step produces
|
|
173
|
+
- Guarantees: ordering guarantees, completeness guarantees, latency SLA
|
|
174
|
+
- Breaking change policy: how many days notice before changing the contract
|
|
175
|
+
|
|
176
|
+
Encode contracts as schema files (Avro, JSON Schema, Protobuf) checked into version control alongside the pipeline code. Consumers reference a specific schema version, not "whatever the pipeline produces today."
|