@zigrivers/scaffold 3.8.0 → 3.9.0
- package/README.md +73 -8
- package/content/knowledge/browser-extension/browser-extension-architecture.md +195 -0
- package/content/knowledge/browser-extension/browser-extension-content-scripts.md +264 -0
- package/content/knowledge/browser-extension/browser-extension-conventions.md +156 -0
- package/content/knowledge/browser-extension/browser-extension-cross-browser.md +229 -0
- package/content/knowledge/browser-extension/browser-extension-dev-environment.md +247 -0
- package/content/knowledge/browser-extension/browser-extension-manifest.md +220 -0
- package/content/knowledge/browser-extension/browser-extension-project-structure.md +183 -0
- package/content/knowledge/browser-extension/browser-extension-requirements.md +107 -0
- package/content/knowledge/browser-extension/browser-extension-security.md +202 -0
- package/content/knowledge/browser-extension/browser-extension-service-workers.md +265 -0
- package/content/knowledge/browser-extension/browser-extension-store-submission.md +155 -0
- package/content/knowledge/browser-extension/browser-extension-testing.md +270 -0
- package/content/knowledge/data-pipeline/data-pipeline-architecture.md +175 -0
- package/content/knowledge/data-pipeline/data-pipeline-batch-patterns.md +263 -0
- package/content/knowledge/data-pipeline/data-pipeline-conventions.md +176 -0
- package/content/knowledge/data-pipeline/data-pipeline-dev-environment.md +350 -0
- package/content/knowledge/data-pipeline/data-pipeline-orchestration.md +291 -0
- package/content/knowledge/data-pipeline/data-pipeline-project-structure.md +257 -0
- package/content/knowledge/data-pipeline/data-pipeline-quality.md +324 -0
- package/content/knowledge/data-pipeline/data-pipeline-requirements.md +145 -0
- package/content/knowledge/data-pipeline/data-pipeline-schema-management.md +295 -0
- package/content/knowledge/data-pipeline/data-pipeline-security.md +326 -0
- package/content/knowledge/data-pipeline/data-pipeline-streaming-patterns.md +280 -0
- package/content/knowledge/data-pipeline/data-pipeline-testing.md +406 -0
- package/content/knowledge/ml/ml-architecture.md +172 -0
- package/content/knowledge/ml/ml-conventions.md +209 -0
- package/content/knowledge/ml/ml-dev-environment.md +299 -0
- package/content/knowledge/ml/ml-experiment-tracking.md +285 -0
- package/content/knowledge/ml/ml-model-evaluation.md +256 -0
- package/content/knowledge/ml/ml-observability.md +253 -0
- package/content/knowledge/ml/ml-project-structure.md +216 -0
- package/content/knowledge/ml/ml-requirements.md +138 -0
- package/content/knowledge/ml/ml-security.md +188 -0
- package/content/knowledge/ml/ml-serving-patterns.md +243 -0
- package/content/knowledge/ml/ml-testing.md +301 -0
- package/content/knowledge/ml/ml-training-patterns.md +269 -0
- package/content/methodology/browser-extension-overlay.yml +82 -0
- package/content/methodology/data-pipeline-overlay.yml +70 -0
- package/content/methodology/ml-overlay.yml +70 -0
- package/dist/cli/commands/init.d.ts +13 -0
- package/dist/cli/commands/init.d.ts.map +1 -1
- package/dist/cli/commands/init.js +122 -2
- package/dist/cli/commands/init.js.map +1 -1
- package/dist/cli/commands/init.test.js +120 -0
- package/dist/cli/commands/init.test.js.map +1 -1
- package/dist/config/schema.d.ts +864 -48
- package/dist/config/schema.d.ts.map +1 -1
- package/dist/config/schema.js +53 -0
- package/dist/config/schema.js.map +1 -1
- package/dist/config/schema.test.js +166 -3
- package/dist/config/schema.test.js.map +1 -1
- package/dist/core/assembly/overlay-loader.test.js +33 -0
- package/dist/core/assembly/overlay-loader.test.js.map +1 -1
- package/dist/e2e/project-type-overlays.test.d.ts +2 -2
- package/dist/e2e/project-type-overlays.test.js +499 -33
- package/dist/e2e/project-type-overlays.test.js.map +1 -1
- package/dist/types/config.d.ts +10 -1
- package/dist/types/config.d.ts.map +1 -1
- package/dist/wizard/questions.d.ts +17 -1
- package/dist/wizard/questions.d.ts.map +1 -1
- package/dist/wizard/questions.js +72 -1
- package/dist/wizard/questions.js.map +1 -1
- package/dist/wizard/questions.test.js +135 -0
- package/dist/wizard/questions.test.js.map +1 -1
- package/dist/wizard/wizard.d.ts +13 -0
- package/dist/wizard/wizard.d.ts.map +1 -1
- package/dist/wizard/wizard.js +17 -1
- package/dist/wizard/wizard.js.map +1 -1
- package/package.json +1 -1
|
@@ -0,0 +1,350 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: data-pipeline-dev-environment
|
|
3
|
+
description: Local development setup with Docker, sample data generation, replay tooling, and test fixtures for data pipelines
|
|
4
|
+
topics: [data-pipeline, dev-environment, docker, local-development, sample-data, replay, test-fixtures]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
A productive data pipeline development environment must run locally without requiring cloud accounts, production credentials, or access to real data. Engineers must be able to ingest, process, debug, and test the full pipeline on their laptop in under 10 minutes from a clean checkout. This requires containerized dependencies, synthetic sample data, and tooling to replay specific scenarios.
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
Use Docker Compose to run all pipeline dependencies locally (Kafka, Postgres, object storage, orchestrator UI). Generate synthetic sample data with realistic distributions that match production schemas. Build replay tooling so engineers can re-run any historical pipeline execution or simulate specific failure scenarios. Keep test fixtures small, version-controlled, and deterministic.
|
|
12
|
+
|
|
13
|
+
## Deep Guidance
|
|
14
|
+
|
|
15
|
+
### Docker Compose Local Stack
|
|
16
|
+
|
|
17
|
+
Define the complete local dependency stack in `docker/docker-compose.yml`. Every external system the pipeline depends on must run locally:
|
|
18
|
+
|
|
19
|
+
```yaml
# docker/docker-compose.yml
version: "3.9"

services:
  # Message broker
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # Two listeners: kafka:29092 for other containers, localhost:9092 for the host
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: "true"
    depends_on: [zookeeper]

  zookeeper:
    image: confluentinc/cp-zookeeper:7.5.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  # Schema registry
  schema-registry:
    image: confluentinc/cp-schema-registry:7.5.0
    ports:
      - "8081:8081"
    environment:
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: kafka:29092
    depends_on: [kafka]

  # Source database (CDC)
  postgres:
    image: postgres:15
    ports:
      - "5432:5432"
    environment:
      POSTGRES_USER: pipeline
      POSTGRES_PASSWORD: pipeline
      POSTGRES_DB: source_db
    command: >
      postgres
      -c wal_level=logical
      -c max_replication_slots=10
      -c max_wal_senders=10
    volumes:
      # Relative paths resolve against the directory containing this compose file (docker/)
      - ./postgres/init:/docker-entrypoint-initdb.d

  # Object storage (S3-compatible)
  minio:
    image: minio/minio:latest
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    command: server /data --console-address ":9001"
    volumes:
      - minio_data:/data

  # Pipeline orchestrator (Airflow)
  airflow-webserver:
    image: apache/airflow:2.8.0
    ports:
      - "8080:8080"
    environment:
      AIRFLOW__CORE__EXECUTOR: LocalExecutor
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@airflow-db/airflow
      AIRFLOW__CORE__LOAD_EXAMPLES: "false"
    volumes:
      - ../src/dags:/opt/airflow/dags
      - ../config:/opt/airflow/config
    depends_on: [airflow-db]

  airflow-db:
    image: postgres:15
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow

volumes:
  minio_data:
```
|
|
104
|
+
|
|
105
|
+
### Local Setup Script
|
|
106
|
+
|
|
107
|
+
The bootstrap script gets a clean checkout to a running local environment:
|
|
108
|
+
|
|
109
|
+
```bash
#!/bin/bash
# scripts/bootstrap.sh: set up local development environment

set -euo pipefail

# Address containers by Compose service name; container names depend on the project name
compose() { docker compose -f docker/docker-compose.yml "$@"; }

echo "==> Installing Python dependencies"
pip install -e ".[dev]"

echo "==> Starting Docker services"
compose up -d

echo "==> Waiting for services to be healthy"
./scripts/wait-for-services.sh

echo "==> Creating local Kafka topics"
compose exec -T kafka kafka-topics --create \
  --bootstrap-server localhost:9092 \
  --topic payments_transaction_created \
  --partitions 8 \
  --replication-factor 1 \
  --if-not-exists

echo "==> Creating MinIO buckets"
compose exec -T minio mc alias set local http://localhost:9000 minioadmin minioadmin
compose exec -T minio mc mb local/raw-data --ignore-existing
compose exec -T minio mc mb local/processed-data --ignore-existing

echo "==> Seeding database schemas"
psql postgresql://pipeline:pipeline@localhost:5432/source_db -f docker/postgres/schema.sql

echo "==> Generating sample data"
python scripts/generate-sample-data.py --records 10000 --output fixtures/

echo "==> Local environment ready!"
echo "    Airflow UI: http://localhost:8080 (admin/admin)"
echo "    MinIO console: http://localhost:9001 (minioadmin/minioadmin)"
echo "    Kafka: localhost:9092"
echo "    Postgres: localhost:5432"
```
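The bootstrap script calls `./scripts/wait-for-services.sh`, which is not shown. As a rough sketch of what such a readiness check could do, the Python below polls the published TCP ports until they accept connections. The function names and the `LOCAL_SERVICES` mapping are illustrative assumptions, and an open port only shows that the process is listening, not that the service is fully initialized.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


# Host ports published by the compose file (assumed mapping for illustration).
LOCAL_SERVICES = {"kafka": 9092, "postgres": 5432, "minio": 9000, "airflow": 8080}


def unreachable_services(services: dict[str, int], timeout_s: float = 120.0) -> list[str]:
    """Return the names of services that never became reachable."""
    return [
        name
        for name, port in services.items()
        if not wait_for_port("localhost", port, timeout_s=timeout_s)
    ]
```

A bootstrap step could call `unreachable_services(LOCAL_SERVICES)` and exit non-zero if the returned list is not empty.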
|
|
149
|
+
|
|
150
|
+
### Sample Data Generation
|
|
151
|
+
|
|
152
|
+
Synthetic data must match production schemas and include realistic distributions, edge cases, and known-bad records for testing error handling:
|
|
153
|
+
|
|
154
|
+
```python
# scripts/generate-sample-data.py
import json
import random
from datetime import datetime, timedelta, timezone
from typing import Iterator
import uuid

CURRENCIES = ["USD", "EUR", "GBP", "JPY", "CAD"]
CURRENCY_WEIGHTS = [0.60, 0.20, 0.10, 0.05, 0.05]
COUNTRIES = ["US", "GB", "DE", "FR", "JP", "CA", "AU"]

def generate_transaction(ts: datetime, inject_error: bool = False) -> dict:
    """Generate a synthetic payment transaction.

    Args:
        ts: Event timestamp
        inject_error: If True, inject a known-bad record for error handling tests
    """
    if inject_error:
        return {
            "transaction_id": str(uuid.uuid4()),
            "amount": "N/A",  # schema violation: string instead of number
            "currency": "INVALID",  # invalid enum
            "user_id": None,  # null violation
            "created_at": ts.isoformat(),
        }

    currency = random.choices(CURRENCIES, CURRENCY_WEIGHTS)[0]
    # Log-normal distribution matches real transaction amount distributions
    amount = round(random.lognormvariate(3.5, 1.2), 2)

    return {
        "transaction_id": str(uuid.uuid4()),
        "amount": amount,
        "currency": currency,
        "user_id": f"usr_{random.randint(1, 50000):06d}",
        "merchant_id": f"mer_{random.randint(1, 5000):05d}",
        "country": random.choice(COUNTRIES),
        "created_at": ts.isoformat(),
        "status": random.choices(["completed", "failed", "pending"], [0.92, 0.05, 0.03])[0],
    }

def generate_dataset(
    records: int,
    error_rate: float = 0.01,
    start_time: datetime | None = None,
) -> Iterator[dict]:
    """Generate a stream of synthetic transactions with configurable error rate."""
    # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
    start = start_time or datetime.now(timezone.utc) - timedelta(hours=24)
    for i in range(records):
        ts = start + timedelta(seconds=i * 0.1)
        inject_error = random.random() < error_rate
        yield generate_transaction(ts, inject_error)
```
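The generator above draws from the global `random` module, `uuid.uuid4()`, and the current time, so two runs produce different data. Where deterministic output matters (for example when regenerating checked-in fixtures), one option is to route all randomness through a seeded `random.Random` instance and a fixed start time. The sketch below is a simplified, self-contained illustration; the function name, field subset, and weights are assumptions rather than part of the original script.

```python
import random
import uuid
from datetime import datetime, timedelta, timezone


def seeded_transactions(records: int, seed: int = 42) -> list[dict]:
    """Generate the same synthetic transactions for the same seed."""
    rng = random.Random(seed)  # private RNG, independent of global state
    start = datetime(2024, 1, 15, tzinfo=timezone.utc)  # fixed start, not "now"
    rows = []
    for i in range(records):
        rows.append({
            # Derive the ID from the seeded RNG; uuid4() would differ on every run
            "transaction_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
            "amount": round(rng.lognormvariate(3.5, 1.2), 2),
            "currency": rng.choices(["USD", "EUR", "GBP"], [0.7, 0.2, 0.1])[0],
            "created_at": (start + timedelta(seconds=i)).isoformat(),
        })
    return rows
```

Calling `seeded_transactions(100, seed=7)` twice returns identical lists, which makes diffs of regenerated fixtures meaningful.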
|
|
209
|
+
|
|
210
|
+
### Replay Tooling
|
|
211
|
+
|
|
212
|
+
Replay tools allow engineers to re-run pipeline logic against specific historical data or DLQ records without running the full orchestration stack:
|
|
213
|
+
|
|
214
|
+
```bash
#!/bin/bash
# scripts/replay.sh: replay pipeline execution for a specific time window

set -euo pipefail

PIPELINE="${1:-}"
START_DATE="${2:-}"
END_DATE="${3:-}"

if [[ -z "$PIPELINE" || -z "$START_DATE" ]]; then
  echo "Usage: replay.sh <pipeline_id> <start_date> [end_date]"
  echo "  Example: replay.sh payments_transactions_ingest 2024-01-15"
  exit 1
fi

echo "==> Replaying ${PIPELINE} from ${START_DATE} to ${END_DATE:-${START_DATE}}"

python -m src.replay \
  --pipeline "${PIPELINE}" \
  --start "${START_DATE}" \
  --end "${END_DATE:-${START_DATE}}" \
  --env local \
  --dry-run false
```
|
|
239
|
+
|
|
240
|
+
Python replay runner:
|
|
241
|
+
```python
# src/replay.py
from datetime import date, timedelta

# ReplayResult, PartitionResult, load_config, load_pipeline, and date_range
# are project-local helpers that are not shown here.

def replay_pipeline(
    pipeline_id: str,
    start: date,
    end: date,
    env: str = "local",
    dry_run: bool = True,
) -> ReplayResult:
    """Replay a pipeline for a given date range.

    Runs the pipeline transformation logic directly against historical source data
    without going through the orchestrator. Safe to run against production sources
    in read-only mode (dry_run=True).
    """
    config = load_config(env)
    pipeline = load_pipeline(pipeline_id, config)

    results = []
    for partition_date in date_range(start, end):
        # Read one day of source data: [partition_date, partition_date + 1 day)
        records = pipeline.source.read(partition_date, partition_date + timedelta(days=1))
        transformed = pipeline.transform(records)
        quality_result = pipeline.quality.check(transformed)

        # Only write to the sink when this is not a dry run
        if not dry_run:
            pipeline.sink.write(transformed, partition=str(partition_date))

        results.append(PartitionResult(
            date=partition_date,
            records_read=len(records),
            records_written=len(transformed) if not dry_run else 0,
            quality_passed=quality_result.passed,
            quality_failures=quality_result.failures,
        ))

    return ReplayResult(partitions=results)
```
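The runner calls a `date_range` helper that is not shown. Because `replay.sh` defaults the end date to the start date for a single-day replay, the helper presumably includes both endpoints. A minimal sketch under that assumption:

```python
from datetime import date, timedelta
from typing import Iterator


def date_range(start: date, end: date) -> Iterator[date]:
    """Yield each date from start to end, inclusive of both endpoints."""
    if end < start:
        raise ValueError(f"end {end} is before start {start}")
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)
```

With inclusive bounds, `date_range(d, d)` yields exactly one partition, so a replay invoked with only a start date processes that single day.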
|
|
278
|
+
|
|
279
|
+
### DLQ Replay
|
|
280
|
+
|
|
281
|
+
A specific replay workflow for DLQ records:
|
|
282
|
+
|
|
283
|
+
```bash
#!/bin/bash
# scripts/replay-dlq.sh: replay records from dead-letter queue

set -euo pipefail

PIPELINE="${1:-}"
DLQ_DATE="${2:-$(date -u +%Y-%m-%d)}"

# Fail early when no pipeline is given instead of passing an empty value through
if [[ -z "$PIPELINE" ]]; then
  echo "Usage: replay-dlq.sh <pipeline_id> [date]"
  exit 1
fi

echo "==> Replaying DLQ for ${PIPELINE} on ${DLQ_DATE}"

python -m src.dlq_replay \
  --pipeline "${PIPELINE}" \
  --date "${DLQ_DATE}" \
  --env local
```
|
|
298
|
+
|
|
299
|
+
### Test Fixtures
|
|
300
|
+
|
|
301
|
+
Test fixtures are small, static datasets checked into version control:
|
|
302
|
+
|
|
303
|
+
```
tests/fixtures/
├── payments/
│   ├── raw/
│   │   ├── valid_transactions.json          # 50 valid records
│   │   ├── invalid_amount.json              # Records with bad amount values
│   │   ├── missing_user_id.json             # Records with null required fields
│   │   └── duplicate_transactions.json      # Duplicate records for dedup testing
│   └── expected/
│       ├── normalized_transactions.json     # Expected output after normalization
│       └── aggregated_revenue.json          # Expected output after aggregation
└── users/
    ├── raw/
    │   └── user_events.json
    └── expected/
        └── deduplicated_events.json
```
|
|
320
|
+
|
|
321
|
+
Fixture size guidelines:
|
|
322
|
+
- Minimum: 10 records (enough to test logic)
|
|
323
|
+
- Maximum: 500 records (keeps test runtime fast)
|
|
324
|
+
- Always include: at least one edge case per tested scenario (empty string, null, boundary value, duplicate)
|
|
325
|
+
- Never include: real production data, PII, or data containing actual user information
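These guidelines can be enforced mechanically, for instance in a test that runs over every fixture file. The sketch below checks the size bounds and flags field names that look like personal data. `FORBIDDEN_KEYS` is a hypothetical list for illustration; a real project would define its own.

```python
# Hypothetical field names treated as likely PII; adjust per project.
FORBIDDEN_KEYS = {"email", "phone", "ssn"}


def check_fixture(records: list[dict], min_records: int = 10, max_records: int = 500) -> list[str]:
    """Return a list of guideline violations; an empty list means the fixture passes."""
    problems = []
    count = len(records)
    if count < min_records:
        problems.append(f"too few records: {count} < {min_records}")
    if count > max_records:
        problems.append(f"too many records: {count} > {max_records}")
    leaked = sorted({key for record in records for key in record if key in FORBIDDEN_KEYS})
    if leaked:
        problems.append(f"possible PII fields present: {leaked}")
    return problems
```

A key-name check is only a heuristic: it cannot tell whether values were copied from production, so it complements code review rather than replacing it.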
|
|
326
|
+
|
|
327
|
+
Fixture loading utility:
|
|
328
|
+
```python
import json
from pathlib import Path

def load_fixture(category: str, name: str) -> list[dict]:
    """Load a test fixture from tests/fixtures/."""
    # Assumes this file lives one directory below tests/ (e.g. tests/helpers/)
    path = Path(__file__).parent.parent / "fixtures" / category / f"{name}.json"
    return json.loads(path.read_text())
```
|
|
334
|
+
|
|
335
|
+
### Environment Variable Management
|
|
336
|
+
|
|
337
|
+
Use `.env.example` committed to the repo and `.env` gitignored:
|
|
338
|
+
|
|
339
|
+
```bash
# .env.example: copy to .env and fill in values
PIPELINE_ENV=local
KAFKA_BOOTSTRAP_SERVERS=localhost:9092
SCHEMA_REGISTRY_URL=http://localhost:8081
POSTGRES_URL=postgresql://pipeline:pipeline@localhost:5432/source_db
S3_ENDPOINT=http://localhost:9000
S3_ACCESS_KEY=minioadmin
S3_SECRET_KEY=minioadmin
```
|
|
349
|
+
|
|
350
|
+
Never commit `.env` files. Never hardcode credentials in any source file.
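On the application side, one way to keep credentials out of source files is to read every setting from the environment at startup and fail fast when any is missing. The sketch below assumes the variable names from `.env.example`; the `Settings` class and `load_settings` function are illustrative names, not part of the original project.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED = (
    "PIPELINE_ENV",
    "KAFKA_BOOTSTRAP_SERVERS",
    "SCHEMA_REGISTRY_URL",
    "POSTGRES_URL",
    "S3_ENDPOINT",
    "S3_ACCESS_KEY",
    "S3_SECRET_KEY",
)


@dataclass(frozen=True)
class Settings:
    pipeline_env: str
    kafka_bootstrap_servers: str
    schema_registry_url: str
    postgres_url: str
    s3_endpoint: str
    s3_access_key: str
    s3_secret_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, raising if any are missing or empty."""
    missing = [key for key in REQUIRED if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # Field names are the lower-cased variable names
    return Settings(**{key.lower(): env[key] for key in REQUIRED})
```

Passing the mapping as a parameter keeps the loader testable: tests can supply a plain dictionary instead of mutating `os.environ`.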
|
|
@@ -0,0 +1,291 @@
|
|
|
1
|
+
---
|
|
2
|
+
name: data-pipeline-orchestration
|
|
3
|
+
description: DAG vs event-driven vs scheduled orchestration, retry policies, SLA monitoring, and lineage tracking for data pipelines
|
|
4
|
+
topics: [data-pipeline, orchestration, dag, event-driven, scheduling, retry-policies, sla-monitoring, lineage, airflow, prefect, dagster]
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
Pipeline orchestration is the control plane that schedules work, manages dependencies, handles failures, and provides visibility into pipeline health. The choice between schedule-driven, event-driven, and DAG-based orchestration determines operational complexity, latency characteristics, and how well the system degrades under load. Poor orchestration design — missing retries, no SLA monitoring, absent lineage — turns data pipelines into production support burdens.
|
|
8
|
+
|
|
9
|
+
## Summary
|
|
10
|
+
|
|
11
|
+
Use schedule-based orchestration for batch pipelines with predictable cadences. Use event-driven orchestration when downstream pipelines should trigger immediately after upstream completion rather than waiting for the next scheduled window. Define retry policies per error class — exponential backoff for transient failures, no retry for non-retriable errors. Monitor SLAs as first-class metrics. Track lineage from source to serving to enable impact analysis and root cause investigation.
|
|
12
|
+
|
|
13
|
+
## Deep Guidance
|
|
14
|
+
|
|
15
|
+
### Orchestration Models
|
|
16
|
+
|
|
17
|
+
**Schedule-based orchestration**
|
|
18
|
+
|
|
19
|
+
The orchestrator triggers pipelines on a cron schedule, regardless of whether upstream data is ready. This is the simplest model and works well when:
|
|
20
|
+
- Data arrives on a predictable schedule (file drops at 2am, database exports at midnight)
|
|
21
|
+
- Occasionally processing empty or incomplete datasets is acceptable (the pipeline handles this gracefully)
|
|
22
|
+
- Downstream consumers tolerate the scheduling delay
|
|
23
|
+
|
|
24
|
+
```python
from airflow.decorators import dag
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

# Airflow: hourly schedule (`schedule` replaces the deprecated `schedule_interval` in Airflow 2.4+)
@dag(schedule="0 * * * *", ...)
def hourly_pipeline():
    ...

# With data availability check: schedule + sensor hybrid
@dag(schedule="0 * * * *", ...)
def hourly_pipeline_with_sensor():
    wait = S3KeySensor(
        task_id="wait_for_data",
        bucket_name="raw-data",
        # Jinja has no ":02d" format spec; strftime('%H') yields the zero-padded hour
        bucket_key="events/{{ ds_nodash }}/{{ logical_date.strftime('%H') }}/data.parquet",
        timeout=3600,
        mode="reschedule",
    )
    process = PythonOperator(...)
    wait >> process
```
|
|
43
|
+
|
|
44
|
+
**Event-driven orchestration**
|
|
45
|
+
|
|
46
|
+
Pipelines trigger when upstream data becomes available, not on a fixed schedule. This reduces latency and eliminates the "waiting for the next schedule window" problem.
|
|
47
|
+
|
|
48
|
+
```python
# Dagster: asset-based event-driven orchestration
from dagster import DailyPartitionsDefinition, asset

@asset(
    partitions_def=DailyPartitionsDefinition(start_date="2024-01-01"),
    # The dependency on raw_transactions is inferred from the parameter name below.
    # Listing the same asset in `deps` as well is rejected by Dagster.
    # A dependency alone does not trigger runs: attach an automation policy or
    # sensor so this asset materializes when raw_transactions does.
)
def silver_transactions(context, raw_transactions):
    return transform(raw_transactions)

# Prefect: event-triggered flow
@flow
def silver_pipeline():
    ...

# Register event trigger
# Illustrative only: import paths and action classes differ between Prefect
# versions; verify them against the installed release before use.
from prefect.events.automations import Automation, EventTrigger
Automation(
    name="silver-on-bronze-complete",
    trigger=EventTrigger(
        match={"prefect.resource.id": "bronze-pipeline"},
        expect={"prefect.flow-run.Completed"},
    ),
    actions=[RunFlow(flow=silver_pipeline)],
).save()
```
|
|
73
|
+
|
|
74
|
+
Event-driven orchestration reduces end-to-end latency from "up to schedule interval" to "seconds after upstream completion." The tradeoff is more complex dependency management and potential cascade failures.
|
|
75
|
+
|
|
76
|
+
**DAG-based orchestration**
|
|
77
|
+
|
|
78
|
+
A Directed Acyclic Graph explicitly models task dependencies. The orchestrator executes tasks in topological order, respecting all declared dependencies. DAG-based orchestration is the standard for complex multi-step pipelines with non-trivial dependency graphs.
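To illustrate topological execution order without any orchestrator, the sketch below uses Python's standard-library `graphlib`. The task names and dependency graph are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each key maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "validate": {"extract"},
    "transform": {"validate"},
    "load_warehouse": {"transform"},
    "publish_metrics": {"transform"},
}


def execution_order(deps: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Every task appears after all of its dependencies. `load_warehouse` and `publish_metrics` have no ordering constraint between them, which is where an orchestrator is free to run tasks in parallel.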
|
|
79
|
+
|
|
80
|
+
Airflow, Prefect, and Dagster all implement DAG-based orchestration with different trade-offs:
|
|
81
|
+
|
|
82
|
+
| Feature | Airflow | Prefect | Dagster |
|---------|---------|---------|---------|
| Maturity | High | Medium | Medium |
| Dynamic DAGs | Limited | Native | Native |
| Local testing | Complex | Simple | Simple |
| Asset tracking | External | Limited | Native |
| Deployment model | Stateful scheduler | Cloud or self-hosted | Cloud or self-hosted |
| Best for | Large teams, stable DAGs | Dynamic workflows | Asset-centric pipelines |
|
|
90
|
+
|
|
91
|
+
### Retry Policy Design
|
|
92
|
+
|
|
93
|
+
Every pipeline task must have an explicit retry policy. The policy must distinguish between retriable and non-retriable failures.
|
|
94
|
+
|
|
95
|
+
**Retriable failures** — transient errors that resolve themselves:
|
|
96
|
+
- Network timeouts
|
|
97
|
+
- Rate limiting / throttling
|
|
98
|
+
- Temporary service unavailability (HTTP 503)
|
|
99
|
+
- Database lock contention
|
|
100
|
+
|
|
101
|
+
**Non-retriable failures** — permanent errors requiring human intervention:
|
|
102
|
+
- Schema validation errors
|
|
103
|
+
- Missing required configuration
|
|
104
|
+
- Data corruption detected
|
|
105
|
+
- Authentication/authorization failure
|
|
106
|
+
|
|
107
|
+
**Exponential backoff with jitter**
|
|
108
|
+
|
|
109
|
+
```python
from datetime import timedelta

# Airflow: task-level retry configuration
default_args = {
    "retries": 5,
    "retry_delay": timedelta(seconds=30),      # initial delay
    "retry_exponential_backoff": True,         # double delay each retry
    "max_retry_delay": timedelta(minutes=30),  # cap at 30 minutes
}

# Approximate retry schedule (Airflow applies its own jitter, so actual delays vary):
# Attempt 1: immediate
# Attempt 2: 30 seconds later
# Attempt 3: 60 seconds later
# Attempt 4: 120 seconds later
# Attempt 5: 240 seconds later
# Attempt 6 (final): 480 seconds later (still below the 30-minute cap)
```
|
|
128
|
+
|
|
129
|
+
Add jitter to prevent thundering herd — multiple failed tasks retrying at the same time:
|
|
130
|
+
|
|
131
|
+
```python
import random

def get_retry_delay(attempt: int, base_seconds: int = 30, max_seconds: int = 1800) -> float:
    exponential = base_seconds * (2 ** attempt)
    capped = min(exponential, max_seconds)
    jitter = random.uniform(0, capped * 0.1)  # 10% jitter
    return capped + jitter
```
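To see where the cap takes effect, it helps to compute the delays without the random component. The sketch below assumes `attempt` is zero-based, so the first retry waits `base_seconds`; `capped_backoff` is an illustrative helper, not an Airflow API.

```python
def capped_backoff(attempt: int, base_seconds: int = 30, max_seconds: int = 1800) -> int:
    """Delay in seconds before retry number `attempt` (0-based), without jitter."""
    return min(base_seconds * (2 ** attempt), max_seconds)


schedule = [capped_backoff(attempt) for attempt in range(7)]
print(schedule)  # [30, 60, 120, 240, 480, 960, 1800]
```

With a 30-second base, the 30-minute cap first applies at the seventh retry (30 × 2⁶ = 1920 seconds, capped to 1800). Because `get_retry_delay` adds jitter after capping, its result can exceed `max_seconds` by up to 10%.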
|
|
140
|
+
|
|
141
|
+
**Classifying and routing failures**
|
|
142
|
+
|
|
143
|
+
```python
from airflow.exceptions import AirflowException, AirflowFailException

def task_with_classified_retries():
    try:
        execute_pipeline_step()
    except RateLimitError as e:
        # Retriable: AirflowException lets the task retry with backoff
        raise AirflowException(f"Rate limited: {e}") from e
    except SchemaValidationError as e:
        # Non-retriable: send to DLQ and fail the task immediately (no retry).
        # AirflowSkipException would mark the task as skipped, not failed.
        dlq.write(current_record, e, context)
        raise AirflowFailException(f"Validation failed, sent to DLQ: {e}") from e
    except PoisonPillError as e:
        # Non-retriable: alert immediately and fail without retrying
        alert_oncall(f"Poison pill detected: {e}")
        raise AirflowFailException("Poison pill: manual intervention required") from e
```
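The same classification can be expressed without any orchestrator by grouping errors under two base classes and letting a small retry loop decide. This is a sketch under assumed names (`RetriableError`, `NonRetriableError`, `run_with_retries`); the concrete error classes mirror those used above but are defined here only for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetriableError(Exception):
    """Transient failure: retrying may succeed."""


class NonRetriableError(Exception):
    """Permanent failure: retrying cannot help."""


class RateLimitError(RetriableError):
    pass


class SchemaValidationError(NonRetriableError):
    pass


def run_with_retries(
    step: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run step, retrying only RetriableError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return step()
        except NonRetriableError:
            raise  # surface immediately, never retry
        except RetriableError:
            if attempt == max_attempts:
                raise  # retries exhausted
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise RuntimeError("unreachable: loop always returns or raises")
```

Classifying by base class keeps the retry loop unchanged as new error types are added: a new exception only needs to inherit from the right parent.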
|
|
159
|
+
|
|
160
|
+
### SLA Monitoring
|
|
161
|
+
|
|
162
|
+
SLA monitoring detects when pipelines are running late before downstream consumers notice stale data.
|
|
163
|
+
|
|
164
|
+
**Airflow SLA callbacks**
|
|
165
|
+
|
|
166
|
+
```python
from datetime import timedelta

from airflow.decorators import dag

def sla_miss_callback(dag, task_list, blocking_task_list, slas, blocking_tis):
    """Called when a task misses its SLA deadline."""
    # task_list and blocking_task_list are newline-joined strings;
    # slas and blocking_tis carry the objects with a task_id attribute.
    message = (
        f"SLA miss in {dag.dag_id}:\n"
        f"Tasks: {[sla.task_id for sla in slas]}\n"
        f"Blocking: {[ti.task_id for ti in blocking_tis]}"
    )
    pagerduty.trigger_alert(severity="warning", message=message)
    slack.post_message(channel="#data-alerts", text=message)

@dag(
    sla_miss_callback=sla_miss_callback,
    default_args={
        "sla": timedelta(hours=2),  # alert if task not done in 2 hours
    },
)
def payments_pipeline():
    ...
```
|
|
186
|
+
|
|
187
|
+
**Freshness SLA as a metric**
|
|
188
|
+
|
|
189
|
+
Emit a gauge metric tracking data age in the serving layer. Alert when data age exceeds the SLA:
|
|
190
|
+
|
|
191
|
+
```python
from datetime import datetime

def check_data_freshness(table: str, sla_minutes: int) -> None:
    """Check that a table has been updated within the SLA window."""
    # Assumes query_last_updated_timestamp returns a naive UTC datetime
    last_updated = query_last_updated_timestamp(table)
    age_minutes = (datetime.utcnow() - last_updated).total_seconds() / 60

    metrics.gauge("data_freshness_minutes", age_minutes, tags={"table": table})

    if age_minutes > sla_minutes:
        alert_oncall(
            f"SLA breach: {table} is {age_minutes:.0f} minutes old (SLA: {sla_minutes} min)"
        )
```
|
|
204
|
+
|
|
205
|
+
**SLA dashboard metrics to track**
|
|
206
|
+
|
|
207
|
+
- Time-to-complete per DAG run (p50, p95, p99)
|
|
208
|
+
- SLA miss count per pipeline per day
|
|
209
|
+
- Data freshness age per serving table
|
|
210
|
+
- Queue depth (pending tasks waiting to run)
|
|
211
|
+
- Failure rate per pipeline (failures per 100 runs)
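Two of these metrics are simple to derive from run history. The sketch below computes duration percentiles with the standard-library `statistics` module and a failure rate per 100 runs. The function names are illustrative, and the input is assumed to be a list of run durations in seconds with at least two entries.

```python
import statistics


def duration_percentiles(durations_s: list[float]) -> dict[str, float]:
    """Return p50/p95/p99 of run durations (requires at least two data points)."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile
    cuts = statistics.quantiles(durations_s, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def failures_per_100_runs(failed: int, total: int) -> float:
    """Failure rate normalized to 100 runs; 0.0 when there were no runs."""
    return 0.0 if total == 0 else 100.0 * failed / total
```

Tracking p95 and p99 alongside the median matters because SLA misses usually come from the slow tail, which an average hides.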
|
|
212
|
+
|
|
213
|
+
### Lineage Tracking
|
|
214
|
+
|
|
215
|
+
Data lineage records how data flows from sources through transformations to serving layers. Lineage enables:
|
|
216
|
+
- Impact analysis: "Which dashboards will break if I change this table's schema?"
|
|
217
|
+
- Root cause investigation: "Which source system produced this bad value?"
|
|
218
|
+
- Compliance: "Which pipelines process PII from this source?"
|
|
219
|
+
- Dependency-aware deployments: "Can I deploy this pipeline change safely?"
|
|
220
|
+
|
|
221
|
+
**Automated lineage via orchestration metadata**
|
|
222
|
+
|
|
223
|
+
Dagster natively tracks asset lineage:
|
|
224
|
+
|
|
225
|
+
```python
from dagster import asset

@asset
def raw_transactions():
    """Bronze: raw transactions from Stripe API."""
    return stripe.read_charges()

# Upstream assets are inferred from parameter names; declaring the same asset
# in both `deps` and the function signature is rejected by Dagster.
@asset
def silver_transactions(raw_transactions):
    """Silver: normalized and validated transactions."""
    return transforms.normalize(raw_transactions)

@asset
def daily_revenue(silver_transactions):
    """Gold: daily revenue aggregation."""
    return aggregations.daily_revenue(silver_transactions)
```
|
|
241
|
+
|
|
242
|
+
Dagster generates a lineage graph automatically from the dependencies declared on each asset, viewable in the UI.
|
|
243
|
+
|
|
244
|
+
**Custom lineage for Airflow**
|
|
245
|
+
|
|
246
|
+
When using Airflow without a native lineage tool, emit OpenLineage events:
|
|
247
|
+
|
|
248
|
+
```python
from airflow.decorators import dag

# On Airflow 2.7+, install the apache-airflow-providers-openlineage package and
# configure where events are sent. No special DAG class is needed: the provider
# emits START/COMPLETE events for each task run,
# consumed by Marquez, DataHub, or Apache Atlas
@dag(...)
def instrumented_pipeline():
    ...
```
|
|
258
|
+
|
|
259
|
+
**Lineage in incident response**
|
|
260
|
+
|
|
261
|
+
When a data quality incident occurs, the lineage graph drives the investigation:
|
|
262
|
+
1. Identify the affected output table
|
|
263
|
+
2. Traverse lineage backwards to find the upstream source
|
|
264
|
+
3. Check pipeline run history for failures or anomalies at each stage
|
|
265
|
+
4. Identify the transformation step where the bad data was introduced
|
|
266
|
+
5. Trace back to the source event or record that caused the issue
|
|
267
|
+
|
|
268
|
+
Without lineage, this investigation is manual, slow, and often incomplete. With lineage, it takes minutes.
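Step 2 of the investigation, traversing lineage backwards, amounts to a graph walk over upstream edges. The sketch below uses a hypothetical in-memory lineage map; a real deployment would query the lineage store instead.

```python
from collections import deque

# Hypothetical lineage edges: dataset -> datasets it is derived from.
UPSTREAM = {
    "daily_revenue": ["silver_transactions"],
    "silver_transactions": ["raw_transactions"],
    "raw_transactions": [],
}


def upstream_of(dataset: str, upstream: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk over upstream edges, nearest ancestors first."""
    seen = {dataset}
    order = []
    queue = deque([dataset])
    while queue:
        node = queue.popleft()
        for parent in upstream.get(node, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


print(upstream_of("daily_revenue", UPSTREAM))
# ['silver_transactions', 'raw_transactions']
```

Returning the nearest ancestors first matches the investigation order: check the closest stage before moving further toward the source.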
|
|
269
|
+
|
|
270
|
+
### Orchestrator Operational Practices
|
|
271
|
+
|
|
272
|
+
**Deployment without DAG downtime**
|
|
273
|
+
|
|
274
|
+
In Airflow, deploying a new DAG version while runs are in progress can cause orphaned tasks. Use versioned DAG IDs for breaking changes:
|
|
275
|
+
|
|
276
|
+
```python
# Before: payments_ingest_v1
# After breaking change: payments_ingest_v2
# Run both in parallel during transition; decommission v1 after drain
```
|
|
281
|
+
|
|
282
|
+
**Worker pool sizing**
|
|
283
|
+
|
|
284
|
+
Size worker pools based on peak pipeline parallelism:
|
|
285
|
+
- Calculate maximum simultaneous tasks across all DAGs at peak hours
|
|
286
|
+
- Add 20% headroom for backfills and reruns
|
|
287
|
+
- Separate pools for CPU-intensive tasks (Spark submit) vs. I/O-bound tasks (API calls)
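The first two bullets can be turned into a small calculation. Given the start and end times of tasks during a representative peak period, a sweep over the sorted boundaries gives the maximum number of tasks running at once, and the headroom is applied on top. The function names and the input format (a list of `(start, end)` pairs) are assumptions for illustration.

```python
import math


def peak_concurrency(task_windows: list[tuple[float, float]]) -> int:
    """Maximum number of overlapping (start, end) windows."""
    events = []
    for start, end in task_windows:
        events.append((start, 1))   # task begins
        events.append((end, -1))    # task finishes
    # At equal timestamps, -1 sorts before +1, so back-to-back tasks do not overlap
    events.sort()
    peak = running = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


def pool_size(task_windows: list[tuple[float, float]], headroom_pct: int = 20) -> int:
    """Peak concurrency plus headroom, rounded up to a whole worker slot."""
    return math.ceil(peak_concurrency(task_windows) * (100 + headroom_pct) / 100)
```

For example, three tasks running over `(0, 10)`, `(5, 15)`, and `(10, 20)` peak at two concurrent tasks, so a 20% headroom rounds up to three slots. Running the calculation separately for CPU-intensive and I/O-bound tasks yields one size per pool.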
|
|
288
|
+
|
|
289
|
+
**Scheduler high availability**
|
|
290
|
+
|
|
291
|
+
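Run multiple scheduler instances. Airflow 2.x supports this natively in an active-active configuration: schedulers coordinate through row-level locks in the metadata database rather than through leader election. Single-scheduler deployments are a production risk.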
|