@zigrivers/scaffold 3.8.0 → 3.9.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (70)
  1. package/README.md +73 -8
  2. package/content/knowledge/browser-extension/browser-extension-architecture.md +195 -0
  3. package/content/knowledge/browser-extension/browser-extension-content-scripts.md +264 -0
  4. package/content/knowledge/browser-extension/browser-extension-conventions.md +156 -0
  5. package/content/knowledge/browser-extension/browser-extension-cross-browser.md +229 -0
  6. package/content/knowledge/browser-extension/browser-extension-dev-environment.md +247 -0
  7. package/content/knowledge/browser-extension/browser-extension-manifest.md +220 -0
  8. package/content/knowledge/browser-extension/browser-extension-project-structure.md +183 -0
  9. package/content/knowledge/browser-extension/browser-extension-requirements.md +107 -0
  10. package/content/knowledge/browser-extension/browser-extension-security.md +202 -0
  11. package/content/knowledge/browser-extension/browser-extension-service-workers.md +265 -0
  12. package/content/knowledge/browser-extension/browser-extension-store-submission.md +155 -0
  13. package/content/knowledge/browser-extension/browser-extension-testing.md +270 -0
  14. package/content/knowledge/data-pipeline/data-pipeline-architecture.md +175 -0
  15. package/content/knowledge/data-pipeline/data-pipeline-batch-patterns.md +263 -0
  16. package/content/knowledge/data-pipeline/data-pipeline-conventions.md +176 -0
  17. package/content/knowledge/data-pipeline/data-pipeline-dev-environment.md +350 -0
  18. package/content/knowledge/data-pipeline/data-pipeline-orchestration.md +291 -0
  19. package/content/knowledge/data-pipeline/data-pipeline-project-structure.md +257 -0
  20. package/content/knowledge/data-pipeline/data-pipeline-quality.md +324 -0
  21. package/content/knowledge/data-pipeline/data-pipeline-requirements.md +145 -0
  22. package/content/knowledge/data-pipeline/data-pipeline-schema-management.md +295 -0
  23. package/content/knowledge/data-pipeline/data-pipeline-security.md +326 -0
  24. package/content/knowledge/data-pipeline/data-pipeline-streaming-patterns.md +280 -0
  25. package/content/knowledge/data-pipeline/data-pipeline-testing.md +406 -0
  26. package/content/knowledge/ml/ml-architecture.md +172 -0
  27. package/content/knowledge/ml/ml-conventions.md +209 -0
  28. package/content/knowledge/ml/ml-dev-environment.md +299 -0
  29. package/content/knowledge/ml/ml-experiment-tracking.md +285 -0
  30. package/content/knowledge/ml/ml-model-evaluation.md +256 -0
  31. package/content/knowledge/ml/ml-observability.md +253 -0
  32. package/content/knowledge/ml/ml-project-structure.md +216 -0
  33. package/content/knowledge/ml/ml-requirements.md +138 -0
  34. package/content/knowledge/ml/ml-security.md +188 -0
  35. package/content/knowledge/ml/ml-serving-patterns.md +243 -0
  36. package/content/knowledge/ml/ml-testing.md +301 -0
  37. package/content/knowledge/ml/ml-training-patterns.md +269 -0
  38. package/content/methodology/browser-extension-overlay.yml +82 -0
  39. package/content/methodology/data-pipeline-overlay.yml +70 -0
  40. package/content/methodology/ml-overlay.yml +70 -0
  41. package/dist/cli/commands/init.d.ts +13 -0
  42. package/dist/cli/commands/init.d.ts.map +1 -1
  43. package/dist/cli/commands/init.js +122 -2
  44. package/dist/cli/commands/init.js.map +1 -1
  45. package/dist/cli/commands/init.test.js +120 -0
  46. package/dist/cli/commands/init.test.js.map +1 -1
  47. package/dist/config/schema.d.ts +864 -48
  48. package/dist/config/schema.d.ts.map +1 -1
  49. package/dist/config/schema.js +53 -0
  50. package/dist/config/schema.js.map +1 -1
  51. package/dist/config/schema.test.js +166 -3
  52. package/dist/config/schema.test.js.map +1 -1
  53. package/dist/core/assembly/overlay-loader.test.js +33 -0
  54. package/dist/core/assembly/overlay-loader.test.js.map +1 -1
  55. package/dist/e2e/project-type-overlays.test.d.ts +2 -2
  56. package/dist/e2e/project-type-overlays.test.js +499 -33
  57. package/dist/e2e/project-type-overlays.test.js.map +1 -1
  58. package/dist/types/config.d.ts +10 -1
  59. package/dist/types/config.d.ts.map +1 -1
  60. package/dist/wizard/questions.d.ts +17 -1
  61. package/dist/wizard/questions.d.ts.map +1 -1
  62. package/dist/wizard/questions.js +72 -1
  63. package/dist/wizard/questions.js.map +1 -1
  64. package/dist/wizard/questions.test.js +135 -0
  65. package/dist/wizard/questions.test.js.map +1 -1
  66. package/dist/wizard/wizard.d.ts +13 -0
  67. package/dist/wizard/wizard.d.ts.map +1 -1
  68. package/dist/wizard/wizard.js +17 -1
  69. package/dist/wizard/wizard.js.map +1 -1
  70. package/package.json +1 -1
@@ -0,0 +1,350 @@
1
+ ---
2
+ name: data-pipeline-dev-environment
3
+ description: Local development setup with Docker, sample data generation, replay tooling, and test fixtures for data pipelines
4
+ topics: [data-pipeline, dev-environment, docker, local-development, sample-data, replay, test-fixtures]
5
+ ---
6
+
7
+ A productive data pipeline development environment must run locally without requiring cloud accounts, production credentials, or access to real data. Engineers must be able to ingest, process, debug, and test the full pipeline on their laptop in under 10 minutes from a clean checkout. This requires containerized dependencies, synthetic sample data, and tooling to replay specific scenarios.
8
+
9
+ ## Summary
10
+
11
+ Use Docker Compose to run all pipeline dependencies locally (Kafka, Postgres, object storage, orchestrator UI). Generate synthetic sample data with realistic distributions that match production schemas. Build replay tooling so engineers can re-run any historical pipeline execution or simulate specific failure scenarios. Keep test fixtures small, version-controlled, and deterministic.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Docker Compose Local Stack
16
+
17
+ Define the complete local dependency stack in `docker/docker-compose.yml`. Every external system the pipeline depends on must run locally:
18
+
19
+ ```yaml
+ # docker/docker-compose.yml
+ version: "3.9"
+
+ services:
+   # Message broker
+   kafka:
+     image: confluentinc/cp-kafka:7.5.0
+     ports:
+       - "9092:9092"
+     environment:
+       KAFKA_BROKER_ID: 1
+       KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
+       # Two listeners: kafka:29092 for other containers, localhost:9092 for the host
+       KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
+       KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
+       KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
+       # Single broker: internal topics cannot use the default replication factor of 3
+       KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
+       KAFKA_AUTO_CREATE_TOPICS_ENABLE: "true"
+     depends_on: [zookeeper]
+
+   zookeeper:
+     image: confluentinc/cp-zookeeper:7.5.0
+     environment:
+       ZOOKEEPER_CLIENT_PORT: 2181
+
+   # Schema registry
+   schema-registry:
+     image: confluentinc/cp-schema-registry:7.5.0
+     ports:
+       - "8081:8081"
+     environment:
+       SCHEMA_REGISTRY_HOST_NAME: schema-registry
+       SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: kafka:29092
+
+   # Source database (CDC)
+   postgres:
+     image: postgres:15
+     ports:
+       - "5432:5432"
+     environment:
+       POSTGRES_USER: pipeline
+       POSTGRES_PASSWORD: pipeline
+       POSTGRES_DB: source_db
+     command: >
+       postgres
+       -c wal_level=logical
+       -c max_replication_slots=10
+       -c max_wal_senders=10
+     volumes:
+       # Paths resolve relative to this compose file, which lives in docker/
+       - ./postgres/init:/docker-entrypoint-initdb.d
+
+   # Object storage (S3-compatible)
+   minio:
+     image: minio/minio:latest
+     ports:
+       - "9000:9000"
+       - "9001:9001"
+     environment:
+       MINIO_ROOT_USER: minioadmin
+       MINIO_ROOT_PASSWORD: minioadmin
+     command: server /data --console-address ":9001"
+     volumes:
+       - minio_data:/data
+
+   # Pipeline orchestrator (Airflow)
+   airflow-webserver:
+     image: apache/airflow:2.8.0
+     ports:
+       - "8080:8080"
+     environment:
+       AIRFLOW__CORE__EXECUTOR: LocalExecutor
+       AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@airflow-db/airflow
+       AIRFLOW__CORE__LOAD_EXAMPLES: "false"
+     volumes:
+       - ../src/dags:/opt/airflow/dags
+       - ../config:/opt/airflow/config
+     depends_on: [airflow-db]
+
+   airflow-db:
+     image: postgres:15
+     environment:
+       POSTGRES_USER: airflow
+       POSTGRES_PASSWORD: airflow
+       POSTGRES_DB: airflow
+
+ volumes:
+   minio_data:
+ ```
104
+
105
+ ### Local Setup Script
106
+
107
+ The bootstrap script gets a clean checkout to a running local environment:
108
+
109
+ ```bash
110
+ #!/bin/bash
111
+ # scripts/bootstrap.sh — Set up local development environment
112
+
113
+ set -euo pipefail
114
+
115
+ echo "==> Installing Python dependencies"
116
+ pip install -e ".[dev]"
117
+
118
+ echo "==> Starting Docker services"
119
+ docker compose -f docker/docker-compose.yml up -d
120
+
121
+ echo "==> Waiting for services to be healthy"
122
+ ./scripts/wait-for-services.sh
123
+
124
+ echo "==> Creating local Kafka topics"
125
+ # Use "docker compose exec" so the service name resolves regardless of the generated container name
+ docker compose -f docker/docker-compose.yml exec -T kafka kafka-topics --create \
+   --bootstrap-server localhost:9092 \
+   --topic payments_transaction_created \
+   --partitions 8 \
+   --replication-factor 1 \
+   --if-not-exists
+
+ echo "==> Creating MinIO buckets"
+ docker compose -f docker/docker-compose.yml exec -T minio mc alias set local http://localhost:9000 minioadmin minioadmin
+ docker compose -f docker/docker-compose.yml exec -T minio mc mb local/raw-data --ignore-existing
+ docker compose -f docker/docker-compose.yml exec -T minio mc mb local/processed-data --ignore-existing
136
+
137
+ echo "==> Seeding database schemas"
138
+ psql postgresql://pipeline:pipeline@localhost:5432/source_db -f docker/postgres/schema.sql
139
+
140
+ echo "==> Generating sample data"
141
+ python scripts/generate-sample-data.py --records 10000 --output fixtures/
142
+
143
+ echo "==> Local environment ready!"
144
+ echo " Airflow UI: http://localhost:8080 (admin/admin)"
145
+ echo " MinIO console: http://localhost:9001 (minioadmin/minioadmin)"
146
+ echo " Kafka: localhost:9092"
147
+ echo " Postgres: localhost:5432"
148
+ ```
149
+
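The bootstrap script calls `scripts/wait-for-services.sh`, which is not shown in this file. A minimal sketch of the same idea in Python, assuming that an open TCP port is a good enough readiness signal. The function names, the timeout values, and the `LOCAL_SERVICES` mapping are illustrative assumptions, not part of the package:

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while the port is closed or unreachable
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Hypothetical service list mirroring the ports published in docker-compose.yml
LOCAL_SERVICES = {
    "kafka": 9092,
    "schema-registry": 8081,
    "postgres": 5432,
    "minio": 9000,
    "airflow": 8080,
}


def wait_for_all(services: dict[str, int], host: str = "localhost", timeout: float = 60.0) -> list[str]:
    """Return the names of services that never became reachable."""
    return [name for name, port in services.items() if not wait_for_port(host, port, timeout)]
```

An open port only shows that the process is listening. A stricter check would query each service's own health endpoint or use Compose `healthcheck` definitions.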
150
+ ### Sample Data Generation
151
+
152
+ Synthetic data must match production schemas and include realistic distributions, edge cases, and known-bad records for testing error handling:
153
+
154
+ ```python
+ # scripts/generate-sample-data.py
+ import json
+ import random
+ from datetime import datetime, timedelta
+ from typing import Iterator
+ import uuid
+
+ CURRENCIES = ["USD", "EUR", "GBP", "JPY", "CAD"]
+ CURRENCY_WEIGHTS = [0.60, 0.20, 0.10, 0.05, 0.05]
+ COUNTRIES = ["US", "GB", "DE", "FR", "JP", "CA", "AU"]
+
+ def generate_transaction(ts: datetime, inject_error: bool = False) -> dict:
+     """Generate a synthetic payment transaction.
+
+     Args:
+         ts: Event timestamp
+         inject_error: If True, inject a known-bad record for error handling tests
+     """
+     if inject_error:
+         return {
+             "transaction_id": str(uuid.uuid4()),
+             "amount": "N/A",        # schema violation: string instead of number
+             "currency": "INVALID",  # invalid enum
+             "user_id": None,        # null violation
+             "created_at": ts.isoformat(),
+         }
+
+     currency = random.choices(CURRENCIES, CURRENCY_WEIGHTS)[0]
+     # Log-normal distribution matches real transaction amount distributions
+     amount = round(random.lognormvariate(3.5, 1.2), 2)
+
+     return {
+         "transaction_id": str(uuid.uuid4()),
+         "amount": amount,
+         "currency": currency,
+         "user_id": f"usr_{random.randint(1, 50000):06d}",
+         "merchant_id": f"mer_{random.randint(1, 5000):05d}",
+         "country": random.choice(COUNTRIES),
+         "created_at": ts.isoformat(),
+         "status": random.choices(["completed", "failed", "pending"], [0.92, 0.05, 0.03])[0],
+     }
+
+ def generate_dataset(
+     records: int,
+     error_rate: float = 0.01,
+     start_time: datetime | None = None,
+ ) -> Iterator[dict]:
+     """Generate a stream of synthetic transactions with configurable error rate."""
+     start = start_time or datetime.utcnow() - timedelta(hours=24)
+     for i in range(records):
+         ts = start + timedelta(seconds=i * 0.1)
+         inject_error = random.random() < error_rate
+         yield generate_transaction(ts, inject_error)
+ ```
209
+
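The summary asks for deterministic fixtures, but the generator above draws from the global `random` state and from `uuid.uuid4()`, neither of which is reproducible by default. A minimal sketch of one way to make generated output repeatable, assuming a private seeded RNG is acceptable. `seeded_records` and `write_jsonl` are hypothetical helpers, not part of the script, and the record shape is trimmed for brevity:

```python
import json
import random
import uuid
from datetime import datetime, timedelta
from pathlib import Path
from typing import Iterable, Iterator


def seeded_records(records: int, seed: int, start: datetime) -> Iterator[dict]:
    """Yield reproducible records by drawing everything from one seeded RNG."""
    rng = random.Random(seed)  # private RNG: unaffected by other users of `random`
    for i in range(records):
        yield {
            # uuid.uuid4() ignores seeding, so build the ID from the seeded RNG instead
            "transaction_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
            "amount": round(rng.lognormvariate(3.5, 1.2), 2),
            "created_at": (start + timedelta(seconds=i * 0.1)).isoformat(),
        }


def write_jsonl(records: Iterable[dict], path: Path) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, sort_keys=True) + "\n")
            count += 1
    return count
```

Passing a fixed `start` timestamp matters as much as the seed: a default based on the current time would change the output on every run.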
210
+ ### Replay Tooling
211
+
212
+ Replay tools allow engineers to re-run pipeline logic against specific historical data or DLQ records without running the full orchestration stack:
213
+
214
+ ```bash
215
+ #!/bin/bash
216
+ # scripts/replay.sh — Replay pipeline execution for a specific time window
217
+
218
+ set -euo pipefail
219
+
220
+ PIPELINE="${1:-}"
221
+ START_DATE="${2:-}"
222
+ END_DATE="${3:-}"
223
+
224
+ if [[ -z "$PIPELINE" || -z "$START_DATE" ]]; then
225
+ echo "Usage: replay.sh <pipeline_id> <start_date> [end_date]"
226
+ echo " Example: replay.sh payments_transactions_ingest 2024-01-15"
227
+ exit 1
228
+ fi
229
+
230
+ echo "==> Replaying ${PIPELINE} from ${START_DATE} to ${END_DATE:-${START_DATE}}"
231
+
232
+ python -m src.replay \
233
+ --pipeline "${PIPELINE}" \
234
+ --start "${START_DATE}" \
235
+ --end "${END_DATE:-${START_DATE}}" \
236
+ --env local \
237
+ --dry-run false
238
+ ```
239
+
240
+ Python replay runner:
241
+ ```python
+ # src/replay.py
+ from datetime import date, timedelta
+
+ # load_config, load_pipeline, date_range, PartitionResult, and ReplayResult
+ # are assumed to be defined elsewhere in the project.
+
+ def replay_pipeline(
+     pipeline_id: str,
+     start: date,
+     end: date,
+     env: str = "local",
+     dry_run: bool = True,
+ ) -> ReplayResult:
+     """Replay a pipeline for a given date range.
+
+     Runs the pipeline transformation logic directly against historical source data
+     without going through the orchestrator. Safe to run against production sources
+     in read-only mode (dry_run=True).
+     """
+     config = load_config(env)
+     pipeline = load_pipeline(pipeline_id, config)
+
+     results = []
+     for partition_date in date_range(start, end):
+         records = pipeline.source.read(partition_date, partition_date + timedelta(days=1))
+         transformed = pipeline.transform(records)
+         quality_result = pipeline.quality.check(transformed)
+
+         # Only write to the sink when this is not a dry run
+         if not dry_run:
+             pipeline.sink.write(transformed, partition=str(partition_date))
+
+         results.append(PartitionResult(
+             date=partition_date,
+             records_read=len(records),
+             records_written=len(transformed) if not dry_run else 0,
+             quality_passed=quality_result.passed,
+             quality_failures=quality_result.failures,
+         ))
+
+     return ReplayResult(partitions=results)
+ ```
278
+
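The replay runner iterates over `date_range(start, end)`, which is not defined in the snippet. A minimal sketch of such a helper, assuming both endpoints are inclusive (this matches `replay.sh`, which defaults the end date to the start date to replay a single day):

```python
from datetime import date, timedelta
from typing import Iterator


def date_range(start: date, end: date) -> Iterator[date]:
    """Yield each date from start to end, inclusive of both endpoints."""
    if end < start:
        raise ValueError(f"end ({end}) is before start ({start})")
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)
```

With inclusive bounds, `date_range(date(2024, 1, 15), date(2024, 1, 15))` yields exactly one partition date.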
279
+ ### DLQ Replay
280
+
281
+ A specific replay workflow for DLQ records:
282
+
283
+ ```bash
284
+ # scripts/replay-dlq.sh — Replay records from dead-letter queue
285
+
286
+ set -euo pipefail
287
+
288
+ PIPELINE="${1:-}"
289
+ DLQ_DATE="${2:-$(date -u +%Y-%m-%d)}"
290
+
291
+ echo "==> Replaying DLQ for ${PIPELINE} on ${DLQ_DATE}"
292
+
293
+ python -m src.dlq_replay \
294
+ --pipeline "${PIPELINE}" \
295
+ --date "${DLQ_DATE}" \
296
+ --env local
297
+ ```
298
+
299
+ ### Test Fixtures
300
+
301
+ Test fixtures are small, static datasets checked into version control:
302
+
303
+ ```
304
+ tests/fixtures/
305
+ ├── payments/
306
+ │ ├── raw/
307
+ │ │ ├── valid_transactions.json # 50 valid records
308
+ │ │ ├── invalid_amount.json # Records with bad amount values
309
+ │ │ ├── missing_user_id.json # Records with null required fields
310
+ │ │ └── duplicate_transactions.json # Duplicate records for dedup testing
311
+ │ └── expected/
312
+ │ ├── normalized_transactions.json # Expected output after normalization
313
+ │ └── aggregated_revenue.json # Expected output after aggregation
314
+ └── users/
315
+ ├── raw/
316
+ │ └── user_events.json
317
+ └── expected/
318
+ └── deduplicated_events.json
319
+ ```
320
+
321
+ Fixture size guidelines:
322
+ - Minimum: 10 records (enough to test logic)
323
+ - Maximum: 500 records (keeps test runtime fast)
324
+ - Always include: at least one edge case per tested scenario (empty string, null, boundary value, duplicate)
325
+ - Never include: real production data, PII, or data containing actual user information
326
+
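The size guidelines above can be enforced with a small check run in CI. This is a sketch under assumptions: `check_fixture` and `has_edge_case` are hypothetical names, and "edge case" is approximated as a null value, an empty string, or an exact duplicate record.

```python
import json

MIN_RECORDS = 10
MAX_RECORDS = 500


def has_edge_case(records: list[dict]) -> bool:
    """True if any record has a null or empty-string value, or repeats an earlier record."""
    seen: set[str] = set()
    for record in records:
        if any(value is None or value == "" for value in record.values()):
            return True
        key = json.dumps(record, sort_keys=True, default=str)
        if key in seen:
            return True
        seen.add(key)
    return False


def check_fixture(records: list[dict], expect_edge_case: bool = True) -> list[str]:
    """Return a list of guideline violations; an empty list means the fixture passes."""
    problems = []
    if len(records) < MIN_RECORDS:
        problems.append(f"too few records: {len(records)} < {MIN_RECORDS}")
    if len(records) > MAX_RECORDS:
        problems.append(f"too many records: {len(records)} > {MAX_RECORDS}")
    if expect_edge_case and not has_edge_case(records):
        problems.append("no edge case found (null, empty string, or duplicate)")
    return problems
```

Fixtures that are meant to be entirely valid, such as `valid_transactions.json`, would be checked with `expect_edge_case=False`.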
327
+ Fixture loading utility:
328
+ ```python
+ import json
+ from pathlib import Path
+
+ def load_fixture(category: str, name: str) -> list[dict]:
+     """Load a test fixture from tests/fixtures/."""
+     # Resolved relative to this file, which is assumed to sit one directory below tests/
+     path = Path(__file__).parent.parent / "fixtures" / category / f"{name}.json"
+     return json.loads(path.read_text())
+ ```
334
+
335
+ ### Environment Variable Management
336
+
337
+ Use `.env.example` committed to the repo and `.env` gitignored:
338
+
339
+ ```bash
340
+ # .env.example — copy to .env and fill in values
341
+ PIPELINE_ENV=local
342
+ KAFKA_BOOTSTRAP_SERVERS=localhost:9092
343
+ SCHEMA_REGISTRY_URL=http://localhost:8081
344
+ POSTGRES_URL=postgresql://pipeline:pipeline@localhost:5432/source_db
345
+ S3_ENDPOINT=http://localhost:9000
346
+ S3_ACCESS_KEY=minioadmin
347
+ S3_SECRET_KEY=minioadmin
348
+ ```
349
+
350
+ Never commit `.env` files. Never hardcode credentials in any source file.
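On the consuming side, pipeline code can read these variables once at startup and fail fast when any is missing. A minimal sketch, assuming the variable names from `.env.example` above. The `Settings` class and `load_settings` function are illustrative, not part of the package:

```python
import os
from dataclasses import dataclass, field
from typing import Mapping

REQUIRED_VARS = (
    "PIPELINE_ENV",
    "KAFKA_BOOTSTRAP_SERVERS",
    "SCHEMA_REGISTRY_URL",
    "POSTGRES_URL",
    "S3_ENDPOINT",
    "S3_ACCESS_KEY",
    "S3_SECRET_KEY",
)


@dataclass(frozen=True)
class Settings:
    pipeline_env: str
    kafka_bootstrap_servers: str
    schema_registry_url: str
    postgres_url: str = field(repr=False)  # URL embeds a password; keep it out of logs
    s3_endpoint: str = field(repr=True)
    s3_access_key: str = field(repr=False)
    s3_secret_key: str = field(repr=False)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from the environment, reporting every missing variable at once."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    return Settings(**{name.lower(): env[name] for name in REQUIRED_VARS})
```

Excluding secret-bearing fields from `repr` keeps them from leaking when a settings object is printed or logged.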
@@ -0,0 +1,291 @@
1
+ ---
2
+ name: data-pipeline-orchestration
3
+ description: DAG vs event-driven vs scheduled orchestration, retry policies, SLA monitoring, and lineage tracking for data pipelines
4
+ topics: [data-pipeline, orchestration, dag, event-driven, scheduling, retry-policies, sla-monitoring, lineage, airflow, prefect, dagster]
5
+ ---
6
+
7
+ Pipeline orchestration is the control plane that schedules work, manages dependencies, handles failures, and provides visibility into pipeline health. The choice between schedule-driven, event-driven, and DAG-based orchestration determines operational complexity, latency characteristics, and how well the system degrades under load. Poor orchestration design — missing retries, no SLA monitoring, absent lineage — turns data pipelines into production support burdens.
8
+
9
+ ## Summary
10
+
11
+ Use schedule-based orchestration for batch pipelines with predictable cadences. Use event-driven orchestration when downstream pipelines should trigger immediately after upstream completion rather than waiting for the next scheduled window. Define retry policies per error class — exponential backoff for transient failures, no retry for non-retriable errors. Monitor SLAs as first-class metrics. Track lineage from source to serving to enable impact analysis and root cause investigation.
12
+
13
+ ## Deep Guidance
14
+
15
+ ### Orchestration Models
16
+
17
+ **Schedule-based orchestration**
18
+
19
+ The orchestrator triggers pipelines on a cron schedule, regardless of whether upstream data is ready. This is the simplest model and works well when:
20
+ - Data arrives on a predictable schedule (file drops at 2am, database exports at midnight)
21
+ - Occasionally processing empty or incomplete datasets is acceptable (the pipeline handles them gracefully)
22
+ - Downstream consumers tolerate the scheduling delay
23
+
24
+ ```python
+ from airflow.decorators import dag
+ from airflow.operators.python import PythonOperator
+ from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
+
+ # Airflow: hourly schedule
+ @dag(schedule_interval="0 * * * *", ...)
+ def hourly_pipeline():
+     ...
+
+ # With data availability check: schedule + sensor hybrid
+ @dag(schedule_interval="0 * * * *", ...)
+ def hourly_pipeline_with_sensor():
+     wait = S3KeySensor(
+         task_id="wait_for_data",
+         bucket_name="raw-data",
+         # Jinja has no ":02d" format spec; strftime('%H') gives the zero-padded hour
+         bucket_key="events/{{ ds_nodash }}/{{ execution_date.strftime('%H') }}/data.parquet",
+         timeout=3600,
+         mode="reschedule",  # free the worker slot between checks
+     )
+     process = PythonOperator(...)
+     wait >> process
+ ```
43
+
44
+ **Event-driven orchestration**
45
+
46
+ Pipelines trigger when upstream data becomes available, not on a fixed schedule. This reduces latency and eliminates the "waiting for the next schedule window" problem.
47
+
48
+ ```python
+ # Dagster: asset-based event-driven orchestration
+ @asset(
+     partitions_def=DailyPartitionsDefinition(start_date="2024-01-01"),
+ )
+ def silver_transactions(context, raw_transactions):
+     # Taking raw_transactions as a parameter declares the upstream dependency,
+     # so a separate deps= entry is not needed. Pair the asset with an automation
+     # policy so that it runs when raw_transactions materializes.
+     return transform(raw_transactions)
+
+ # Prefect: event-triggered flow
+ @flow
+ def silver_pipeline():
+     ...
+
+ # Register event trigger (illustrative: automation import paths and action
+ # classes differ across Prefect versions, so check them against the installed release)
+ from prefect.events.automations import Automation, EventTrigger
+ Automation(
+     name="silver-on-bronze-complete",
+     trigger=EventTrigger(
+         match={"prefect.resource.id": "bronze-pipeline"},
+         expect={"prefect.flow-run.Completed"},
+     ),
+     actions=[RunFlow(flow=silver_pipeline)],
+ ).save()
+ ```
73
+
74
+ Event-driven orchestration reduces end-to-end latency from "up to schedule interval" to "seconds after upstream completion." The tradeoff is more complex dependency management and potential cascade failures.
75
+
76
+ **DAG-based orchestration**
77
+
78
+ A Directed Acyclic Graph explicitly models task dependencies. The orchestrator executes tasks in topological order, respecting all declared dependencies. DAG-based orchestration is the standard for complex multi-step pipelines with non-trivial dependency graphs.
79
+
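Topological ordering can be illustrated without any orchestrator. The sketch below uses the standard-library `graphlib` module; the task names are hypothetical, and `execution_batches` is an illustrative helper rather than how any of the orchestrators named here schedules work internally:

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each key maps to the set of tasks it depends on
DEPENDENCIES = {
    "extract_payments": set(),
    "extract_users": set(),
    "normalize_payments": {"extract_payments"},
    "join_users": {"normalize_payments", "extract_users"},
    "daily_revenue": {"join_users"},
}


def execution_batches(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into batches whose dependencies are all satisfied by earlier batches."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks in one batch could run in parallel
        batches.append(ready)
        sorter.done(*ready)
    return batches


print(execution_batches(DEPENDENCIES))
# [['extract_payments', 'extract_users'], ['normalize_payments'], ['join_users'], ['daily_revenue']]
```

The cycle check in `prepare()` is the "acyclic" guarantee in practice: a dependency loop is rejected before any task runs.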
80
+ Airflow, Prefect, and Dagster all implement DAG-based orchestration with different trade-offs:
81
+
82
+ | Feature | Airflow | Prefect | Dagster |
83
+ |---------|---------|---------|---------|
84
+ | Maturity | High | Medium | Medium |
85
+ | Dynamic DAGs | Limited | Native | Native |
86
+ | Local testing | Complex | Simple | Simple |
87
+ | Asset tracking | External | Limited | Native |
88
+ | Deployment model | Stateful scheduler | Cloud or self-hosted | Cloud or self-hosted |
89
+ | Best for | Large teams, stable DAGs | Dynamic workflows | Asset-centric pipelines |
90
+
91
+ ### Retry Policy Design
92
+
93
+ Every pipeline task must have an explicit retry policy. The policy must distinguish between retriable and non-retriable failures.
94
+
95
+ **Retriable failures** — transient errors that resolve themselves:
96
+ - Network timeouts
97
+ - Rate limiting / throttling
98
+ - Temporary service unavailability (HTTP 503)
99
+ - Database lock contention
100
+
101
+ **Non-retriable failures** — permanent errors requiring human intervention:
102
+ - Schema validation errors
103
+ - Missing required configuration
104
+ - Data corruption detected
105
+ - Authentication/authorization failure
106
+
107
+ **Exponential backoff with jitter**
108
+
109
+ ```python
110
+ from datetime import timedelta
111
+
112
+ # Airflow: task-level retry configuration
113
+ default_args = {
114
+ "retries": 5,
115
+ "retry_delay": timedelta(seconds=30), # initial delay
116
+ "retry_exponential_backoff": True, # double delay each retry
117
+ "max_retry_delay": timedelta(minutes=30), # cap at 30 minutes
118
+ }
119
+
120
+ # Nominal retry schedule (Airflow adds some jitter to each delay):
+ # Attempt 1: immediate
+ # Attempt 2: 30 seconds later
+ # Attempt 3: 60 seconds later
+ # Attempt 4: 120 seconds later
+ # Attempt 5: 240 seconds later
+ # Attempt 6 (final): 480 seconds later (still below the 30-minute cap)
127
+ ```
128
+
129
+ Add jitter to prevent thundering herd — multiple failed tasks retrying at the same time:
130
+
131
+ ```python
+ import random
+
+ def get_retry_delay(attempt: int, base_seconds: int = 30, max_seconds: int = 1800) -> float:
+     exponential = base_seconds * (2 ** attempt)
+     capped = min(exponential, max_seconds)
+     jitter = random.uniform(0, capped * 0.1)  # 10% jitter
+     return capped + jitter
+ ```
140
+
141
+ **Classifying and routing failures**
142
+
143
+ ```python
+ from airflow.exceptions import AirflowException, AirflowFailException
+
+ def task_with_classified_retries():
+     try:
+         execute_pipeline_step()
+     except RateLimitError as e:
+         # Retriable: AirflowException lets the task retry with backoff
+         raise AirflowException(f"Rate limited: {e}") from e
+     except SchemaValidationError as e:
+         # Non-retriable: send to DLQ and mark task as failed (no retry).
+         # AirflowFailException fails the task without consuming retries;
+         # AirflowSkipException would mark it skipped instead.
+         dlq.write(current_record, e, context)
+         raise AirflowFailException(f"Validation failed; sent to DLQ: {e}") from e
+     except PoisonPillError as e:
+         # Non-retriable: alert immediately and fail without retrying
+         alert_oncall(f"Poison pill detected: {e}")
+         raise AirflowFailException("Poison pill: manual intervention required") from e
+ ```
159
+
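Outside an orchestrator, the same classification and backoff rules can be combined in a small wrapper. This is a framework-agnostic sketch: `RetriableError`, `NonRetriableError`, and `run_with_retries` are hypothetical names, and the injectable `sleep` parameter exists only so the loop can be exercised without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetriableError(Exception):
    """Transient failure (timeout, throttling, 503) that may succeed on retry."""


class NonRetriableError(Exception):
    """Permanent failure (bad schema, bad credentials) that retrying cannot fix."""


def backoff_delay(attempt: int, base_seconds: float = 30.0, max_seconds: float = 1800.0) -> float:
    """Exponential delay for a zero-based attempt index, capped, plus up to 10% jitter."""
    capped = min(base_seconds * (2 ** attempt), max_seconds)
    return capped + random.uniform(0, capped * 0.1)


def run_with_retries(
    step: Callable[[], T],
    max_retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run step, retrying only RetriableError up to max_retries times."""
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    attempt = 0
    while True:
        try:
            return step()
        except NonRetriableError:
            raise  # permanent: surface immediately, never retry
        except RetriableError:
            if attempt >= max_retries:
                raise  # retries exhausted
            sleep(backoff_delay(attempt))
            attempt += 1
```

Any exception that is neither class propagates on the first occurrence, which keeps unclassified errors visible instead of silently retried.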
160
+ ### SLA Monitoring
161
+
162
+ SLA monitoring detects when pipelines are running late before downstream consumers notice stale data.
163
+
164
+ **Airflow SLA callbacks**
165
+
166
+ ```python
+ def sla_miss_callback(dag, task_list, blocking_task_list, slas, blocking_tis):
+     """Called when a task misses its SLA deadline."""
+     message = (
+         f"SLA miss in {dag.dag_id}:\n"
+         f"Tasks: {[t.task_id for t in task_list]}\n"
+         f"Blocking: {[t.task_id for t in blocking_task_list]}"
+     )
+     pagerduty.trigger_alert(severity="warning", message=message)
+     slack.post_message(channel="#data-alerts", text=message)
+
+ @dag(
+     sla_miss_callback=sla_miss_callback,
+     default_args={
+         "sla": timedelta(hours=2),  # alert if task not done in 2 hours
+     },
+ )
+ def payments_pipeline():
+     ...
+ ```
186
+
187
+ **Freshness SLA as a metric**
188
+
189
+ Emit a gauge metric tracking data age in the serving layer. Alert when data age exceeds the SLA:
190
+
191
+ ```python
+ def check_data_freshness(table: str, sla_minutes: int) -> None:
+     """Check that a table has been updated within the SLA window."""
+     last_updated = query_last_updated_timestamp(table)
+     # Assumes last_updated is a naive UTC datetime, matching datetime.utcnow()
+     age_minutes = (datetime.utcnow() - last_updated).total_seconds() / 60
+
+     metrics.gauge("data_freshness_minutes", age_minutes, tags={"table": table})
+
+     if age_minutes > sla_minutes:
+         alert_oncall(
+             f"SLA breach: {table} is {age_minutes:.0f} minutes old (SLA: {sla_minutes} min)"
+         )
+ ```
204
+
205
+ **SLA dashboard metrics to track**
206
+
207
+ - Time-to-complete per DAG run (p50, p95, p99)
208
+ - SLA miss count per pipeline per day
209
+ - Data freshness age per serving table
210
+ - Queue depth (pending tasks waiting to run)
211
+ - Failure rate per pipeline (failures per 100 runs)
212
+
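Two of the dashboard metrics above, completion-time percentiles and failure rate, can be computed from run history with the standard library. A minimal sketch, assuming run durations are available as a list of seconds; both function names are illustrative:

```python
import statistics


def duration_percentiles(durations_seconds: list[float]) -> dict[str, float]:
    """Return p50, p95, and p99 of DAG run durations."""
    if len(durations_seconds) < 2:
        raise ValueError("need at least two runs to compute percentiles")
    # n=100 yields 99 cut points; cut point k (1-based) is the k-th percentile
    cuts = statistics.quantiles(durations_seconds, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def failures_per_100_runs(failed_runs: int, total_runs: int) -> float:
    """Failure rate normalized to 100 runs; 0.0 when there is no history yet."""
    if total_runs == 0:
        return 0.0
    return 100.0 * failed_runs / total_runs
```

The `inclusive` method treats the observed runs as the whole population, so the percentiles never fall outside the observed minimum and maximum.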
213
+ ### Lineage Tracking
214
+
215
+ Data lineage records how data flows from sources through transformations to serving layers. Lineage enables:
216
+ - Impact analysis: "Which dashboards will break if I change this table's schema?"
217
+ - Root cause investigation: "Which source system produced this bad value?"
218
+ - Compliance: "Which pipelines process PII from this source?"
219
+ - Dependency-aware deployments: "Can I deploy this pipeline change safely?"
220
+
221
+ **Automated lineage via orchestration metadata**
222
+
223
+ Dagster natively tracks asset lineage:
224
+
225
+ ```python
+ @asset
+ def raw_transactions():
+     """Bronze: raw transactions from Stripe API."""
+     return stripe.read_charges()
+
+ # An upstream asset named as a function parameter is already a dependency,
+ # so it is not repeated in deps=
+ @asset
+ def silver_transactions(raw_transactions):
+     """Silver: normalized and validated transactions."""
+     return transforms.normalize(raw_transactions)
+
+ @asset
+ def daily_revenue(silver_transactions):
+     """Gold: daily revenue aggregation."""
+     return aggregations.daily_revenue(silver_transactions)
+ ```
+
+ Dagster generates a lineage graph automatically from these asset dependencies (here, the upstream assets named as function parameters), viewable in the UI.
243
+
244
+ **Custom lineage for Airflow**
245
+
246
+ When using Airflow without a native lineage tool, emit OpenLineage events:
247
+
248
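+ ```python
+ # On Airflow 2.x the OpenLineage integration (the openlineage-airflow package,
+ # or apache-airflow-providers-openlineage on Airflow 2.7+) registers itself as
+ # a plugin, so DAG code does not need a special DAG import.
+ from openlineage.client.facet import SchemaDatasetFacet  # optional: custom facets
+
+ # OpenLineage-instrumented DAG emits START/COMPLETE events
+ # consumed by Marquez, DataHub, or Apache Atlas
+ @dag(...)
+ def instrumented_pipeline():
+     ...
+ ```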
258
+
259
+ **Lineage in incident response**
260
+
261
+ When a data quality incident occurs, the lineage graph drives the investigation:
262
+ 1. Identify the affected output table
263
+ 2. Traverse lineage backwards to find the upstream source
264
+ 3. Check pipeline run history for failures or anomalies at each stage
265
+ 4. Identify the transformation step where the bad data was introduced
266
+ 5. Trace back to the source event or record that caused the issue
267
+
268
+ Without lineage, this investigation is manual, slow, and often incomplete. With lineage, it takes minutes.
269
+
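Steps 1 and 2 of the investigation amount to a graph walk. A minimal sketch over a hypothetical in-memory lineage map; a real deployment would query the lineage backend instead, and `UPSTREAM`, `upstream_of`, and `downstream_of` are illustrative names:

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the datasets it is derived from
UPSTREAM = {
    "revenue_dashboard": ["daily_revenue"],
    "daily_revenue": ["silver_transactions"],
    "silver_transactions": ["raw_transactions"],
    "raw_transactions": [],
}


def _walk(start: str, edges: dict[str, list[str]]) -> list[str]:
    """Breadth-first traversal returning reachable nodes, nearest first."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in edges.get(current, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


def upstream_of(dataset: str, upstream: dict[str, list[str]]) -> list[str]:
    """Root cause direction: everything the dataset was derived from."""
    return _walk(dataset, upstream)


def downstream_of(dataset: str, upstream: dict[str, list[str]]) -> list[str]:
    """Impact analysis direction: everything derived from the dataset."""
    downstream: dict[str, list[str]] = {}
    for child, parents in upstream.items():
        for parent in parents:
            downstream.setdefault(parent, []).append(child)
    return _walk(dataset, downstream)
```

Nearest-first ordering matches how an investigation proceeds: the closest upstream stage is checked before its own sources.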
270
+ ### Orchestrator Operational Practices
271
+
272
+ **Deployment without DAG downtime**
273
+
274
+ In Airflow, deploying a new DAG version while runs are in progress can cause orphaned tasks. Use versioned DAG IDs for breaking changes:
275
+
276
+ ```python
277
+ # Before: payments_ingest_v1
278
+ # After breaking change: payments_ingest_v2
279
+ # Run both in parallel during transition; decommission v1 after drain
280
+ ```
281
+
282
+ **Worker pool sizing**
283
+
284
+ Size worker pools based on peak pipeline parallelism:
285
+ - Calculate maximum simultaneous tasks across all DAGs at peak hours
286
+ - Add 20% headroom for backfills and reruns
287
+ - Separate pools for CPU-intensive tasks (Spark submit) vs. I/O-bound tasks (API calls)
288
+
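The sizing rule above can be worked through numerically. A sketch assuming historical task start and end times are available as `(start, end)` pairs in any consistent unit; `peak_concurrency` and `pool_size` are illustrative names:

```python
import math


def peak_concurrency(intervals: list[tuple[float, float]]) -> int:
    """Maximum number of tasks running at the same moment."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # task starts: one more running
        events.append((end, -1))    # task ends: one fewer running
    # At equal timestamps, process ends (-1) before starts (+1) so that
    # back-to-back tasks are not counted as overlapping
    events.sort()
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


def pool_size(intervals: list[tuple[float, float]], headroom_percent: int = 20) -> int:
    """Peak concurrency plus headroom, rounded up to a whole worker slot."""
    return math.ceil(peak_concurrency(intervals) * (100 + headroom_percent) / 100)


# Four tasks; three overlap between t=12 and t=14
tasks = [(0, 10), (5, 15), (10, 20), (12, 14)]
print(peak_concurrency(tasks), pool_size(tasks))  # 3 4
```

Running the calculation separately for CPU-intensive and I/O-bound tasks gives one size per pool, in line with the split recommended above.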
289
+ **Scheduler high availability**
290
+
291
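+ Run multiple scheduler instances so that losing one does not halt scheduling. Airflow 2.x supports running several schedulers concurrently out of the box; they coordinate through the metadata database rather than through leader election. Single-scheduler deployments are a production risk.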