@booklib/skills 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +105 -0
- package/animation-at-work/SKILL.md +246 -0
- package/animation-at-work/assets/example_asset.txt +1 -0
- package/animation-at-work/references/api_reference.md +369 -0
- package/animation-at-work/references/review-checklist.md +79 -0
- package/animation-at-work/scripts/example.py +1 -0
- package/bin/skills.js +85 -0
- package/clean-code-reviewer/SKILL.md +292 -0
- package/clean-code-reviewer/evals/evals.json +67 -0
- package/data-intensive-patterns/SKILL.md +204 -0
- package/data-intensive-patterns/assets/example_asset.txt +1 -0
- package/data-intensive-patterns/references/api_reference.md +34 -0
- package/data-intensive-patterns/references/patterns-catalog.md +551 -0
- package/data-intensive-patterns/references/review-checklist.md +193 -0
- package/data-intensive-patterns/scripts/example.py +1 -0
- package/data-pipelines/SKILL.md +252 -0
- package/data-pipelines/assets/example_asset.txt +1 -0
- package/data-pipelines/references/api_reference.md +301 -0
- package/data-pipelines/references/review-checklist.md +181 -0
- package/data-pipelines/scripts/example.py +1 -0
- package/design-patterns/SKILL.md +245 -0
- package/design-patterns/assets/example_asset.txt +1 -0
- package/design-patterns/references/api_reference.md +1 -0
- package/design-patterns/references/patterns-catalog.md +726 -0
- package/design-patterns/references/review-checklist.md +173 -0
- package/design-patterns/scripts/example.py +1 -0
- package/domain-driven-design/SKILL.md +221 -0
- package/domain-driven-design/assets/example_asset.txt +1 -0
- package/domain-driven-design/references/api_reference.md +1 -0
- package/domain-driven-design/references/patterns-catalog.md +545 -0
- package/domain-driven-design/references/review-checklist.md +158 -0
- package/domain-driven-design/scripts/example.py +1 -0
- package/effective-java/SKILL.md +195 -0
- package/effective-java/assets/example_asset.txt +1 -0
- package/effective-java/references/api_reference.md +1 -0
- package/effective-java/references/items-catalog.md +955 -0
- package/effective-java/references/review-checklist.md +216 -0
- package/effective-java/scripts/example.py +1 -0
- package/effective-kotlin/SKILL.md +225 -0
- package/effective-kotlin/assets/example_asset.txt +1 -0
- package/effective-kotlin/references/api_reference.md +1 -0
- package/effective-kotlin/references/practices-catalog.md +1228 -0
- package/effective-kotlin/references/review-checklist.md +126 -0
- package/effective-kotlin/scripts/example.py +1 -0
- package/kotlin-in-action/SKILL.md +251 -0
- package/kotlin-in-action/assets/example_asset.txt +1 -0
- package/kotlin-in-action/references/api_reference.md +1 -0
- package/kotlin-in-action/references/practices-catalog.md +436 -0
- package/kotlin-in-action/references/review-checklist.md +204 -0
- package/kotlin-in-action/scripts/example.py +1 -0
- package/lean-startup/SKILL.md +250 -0
- package/lean-startup/assets/example_asset.txt +1 -0
- package/lean-startup/references/api_reference.md +319 -0
- package/lean-startup/references/review-checklist.md +137 -0
- package/lean-startup/scripts/example.py +1 -0
- package/microservices-patterns/SKILL.md +179 -0
- package/microservices-patterns/references/patterns-catalog.md +391 -0
- package/microservices-patterns/references/review-checklist.md +169 -0
- package/package.json +17 -0
- package/refactoring-ui/SKILL.md +236 -0
- package/refactoring-ui/assets/example_asset.txt +1 -0
- package/refactoring-ui/references/api_reference.md +355 -0
- package/refactoring-ui/references/review-checklist.md +114 -0
- package/refactoring-ui/scripts/example.py +1 -0
- package/storytelling-with-data/SKILL.md +238 -0
- package/storytelling-with-data/assets/example_asset.txt +1 -0
- package/storytelling-with-data/references/api_reference.md +379 -0
- package/storytelling-with-data/references/review-checklist.md +111 -0
- package/storytelling-with-data/scripts/example.py +1 -0
- package/system-design-interview/SKILL.md +213 -0
- package/system-design-interview/assets/example_asset.txt +1 -0
- package/system-design-interview/references/api_reference.md +582 -0
- package/system-design-interview/references/review-checklist.md +201 -0
- package/system-design-interview/scripts/example.py +1 -0
- package/using-asyncio-python/SKILL.md +242 -0
- package/using-asyncio-python/assets/example_asset.txt +1 -0
- package/using-asyncio-python/references/api_reference.md +267 -0
- package/using-asyncio-python/references/review-checklist.md +149 -0
- package/using-asyncio-python/scripts/example.py +1 -0
- package/web-scraping-python/SKILL.md +259 -0
- package/web-scraping-python/assets/example_asset.txt +1 -0
- package/web-scraping-python/references/api_reference.md +393 -0
- package/web-scraping-python/references/review-checklist.md +163 -0
- package/web-scraping-python/scripts/example.py +1 -0
@@ -0,0 +1,301 @@
# Data Pipelines Pocket Reference — Practices Catalog

Chapter-by-chapter catalog of practices from *Data Pipelines Pocket Reference*
by James Densmore for pipeline building.

---

## Chapter 1–2: Introduction & Modern Data Infrastructure

### Infrastructure Choices
- **Data warehouse** — Columnar storage optimized for analytics queries (Redshift, BigQuery, Snowflake); choose when analytics is the primary use case
- **Data lake** — Object storage (S3, GCS, Azure Blob) for raw, unstructured, or semi-structured data; choose when data variety is high or schema is unknown
- **Hybrid** — Land raw data in the lake, load structured subsets into the warehouse; common modern pattern
- **Cloud-native** — Leverage managed services (serverless compute, auto-scaling storage); reduce operational burden

### Pipeline Types
- **Batch** — Process data in scheduled intervals (hourly, daily); suitable for most analytics use cases
- **Streaming** — Process data continuously as it arrives; required for real-time dashboards, alerts, event-driven systems
- **Micro-batch** — Small frequent batches (every few minutes); compromise between batch simplicity and near-real-time latency

---

## Chapter 3: Common Data Pipeline Patterns

### Extraction Patterns
- **Full extraction** — Extract the entire dataset each run; simple but expensive for large tables; use for small reference tables or initial loads
- **Incremental extraction** — Extract only new/changed records using a high-water mark (timestamp, auto-increment ID, or sequence); preferred for growing datasets
- **Change data capture (CDC)** — Capture changes from database transaction logs (MySQL binlog, PostgreSQL WAL); lowest latency, captures deletes; use for real-time sync

### Loading Patterns
- **Full refresh (truncate + load)** — Replace the entire destination table; simple, idempotent; use for small tables or when incremental is unreliable
- **Append** — Insert new records only; use for event/log data that is never updated
- **Upsert (MERGE)** — Insert new records, update existing ones based on a key; use for mutable dimension data
- **Delete + insert by partition** — Delete the partition, insert its replacement; idempotent and efficient for date-partitioned fact tables
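The delete + insert pattern takes only a few lines. A minimal sketch using the stdlib SQLite driver; the `fact_orders` table and its columns are hypothetical stand-ins for a date-partitioned warehouse fact table:

```python
import sqlite3

def load_partition(conn, day, rows):
    """Idempotently load one date partition: delete it, then insert."""
    with conn:  # one transaction: the delete and insert commit together
        conn.execute("DELETE FROM fact_orders WHERE order_date = ?", (day,))
        conn.executemany(
            "INSERT INTO fact_orders (order_id, order_date, amount) VALUES (?, ?, ?)",
            rows,
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (order_id INTEGER, order_date TEXT, amount REAL)")
rows = [(1, "2024-01-15", 9.99), (2, "2024-01-15", 24.50)]
load_partition(conn, "2024-01-15", rows)
load_partition(conn, "2024-01-15", rows)  # re-run: partition replaced, no duplicates
count = conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]
print(count)  # 2
```

Because the delete and insert share one transaction, a crashed run leaves the old partition intact, and a repeated run replaces it exactly once.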

### ETL vs ELT
- **ETL** — Transform before loading; use when the destination has limited compute or when data must be cleansed before storage
- **ELT** — Load raw data first, transform in the destination; preferred for modern cloud warehouses with cheap compute; enables raw data preservation and flexible re-transformation

---

## Chapter 4: Database Ingestion

### MySQL Extraction
- Use `SELECT ... WHERE updated_at > :last_run` for incremental extraction
- For full extraction: `SELECT *` with optional `LIMIT/OFFSET` for large tables (prefer streaming cursors)
- Use read replicas to avoid impacting production database performance
- Handle MySQL timezone conversions (store/compare in UTC)
- Use connection pooling for concurrent extraction from multiple tables
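The incremental query above needs high-water-mark bookkeeping around it. A sketch with SQLite standing in for MySQL and a hypothetical `orders` table; a real pipeline would persist the watermark in a job-metadata store rather than a local variable:

```python
import sqlite3

def extract_incremental(conn, last_run):
    """Pull only rows changed since the stored high-water mark."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_run,),
    ).fetchall()
    # Advance the watermark to the max updated_at actually seen this run
    new_mark = rows[-1][2] if rows else last_run
    return rows, new_mark

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "new", "2024-01-15T08:00:00"),
     (2, "paid", "2024-01-15T09:30:00"),
     (3, "new", "2024-01-15T11:45:00")],
)
rows, mark = extract_incremental(conn, "2024-01-15T09:00:00")
print(len(rows), mark)  # 2 2024-01-15T11:45:00
```

One caveat: rows committed with an older `updated_at` after the watermark advances can be missed; overlapping the extraction window slightly is a common mitigation.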

### PostgreSQL Extraction
- Similar incremental patterns using timestamp columns
- Use `COPY TO` for efficient bulk export to CSV/files
- Leverage PostgreSQL logical replication for CDC
- Handle PostgreSQL-specific types (arrays, JSON, custom types) during extraction

### MongoDB Extraction
- Use change streams for real-time CDC
- For incremental: query by `_id` (ObjectId contains a timestamp) or a custom timestamp field
- Handle nested documents: flatten or store as JSON in the warehouse
- Use `mongodump` for full extraction of large collections

### General Database Practices
- **Connection management** — Use connection pools; close connections promptly; handle timeouts
- **Query optimization** — Add indexes on extraction columns; limit selected columns; use WHERE clauses
- **Binary data** — Skip or store references to BLOBs; don't load binary into the analytics warehouse
- **Character encoding** — Ensure UTF-8 throughout the pipeline; handle encoding mismatches at extraction

---

## Chapter 5: File Ingestion

### CSV Files
- Handle header rows, quoting, escaping, delimiters (not always comma)
- Detect and handle encoding (UTF-8, Latin-1, Windows-1252)
- Validate column count per row; log and quarantine malformed rows
- Use streaming parsers for large files; avoid loading the entire file into memory
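The CSV bullets above combine into a single streaming pass. A sketch using the stdlib `csv` module; the data and expected column count are illustrative:

```python
import csv
import io

def ingest_csv(stream, expected_cols):
    """Stream a CSV, keeping well-formed rows and quarantining the rest."""
    good, quarantine = [], []
    reader = csv.reader(stream)
    next(reader)  # skip header row
    for lineno, row in enumerate(reader, start=2):
        if len(row) == expected_cols:
            good.append(row)
        else:
            # Quarantine with its line number for later inspection;
            # one bad row should not fail the whole file.
            quarantine.append((lineno, row))
    return good, quarantine

raw = "id,name,amount\n1,alice,9.99\n2,bob\n3,carol,12.50\n"
good, bad = ingest_csv(io.StringIO(raw), expected_cols=3)
print(len(good), len(bad))  # 2 1
```

The same loop works on a file object opened with an explicit `encoding=`, so the whole file never has to fit in memory.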

### JSON Files
- Handle nested structures: flatten for warehouse loading or store as a JSON column
- Use JSON Lines (newline-delimited JSON) for large datasets
- Validate against the expected schema; handle missing and extra fields
- Parse dates and timestamps consistently

### Cloud Storage Integration
- **S3** — Use `aws s3 cp/sync` or boto3; leverage S3 event notifications for trigger-based ingestion
- **GCS** — Use `gsutil` or the google-cloud-storage library; use Pub/Sub for event notifications
- **Azure Blob** — Use the Azure SDK; leverage Event Grid for notifications
- Use prefix/partition naming: `s3://bucket/table/year=2024/month=01/day=15/`
- Implement file manifests to track which files have been processed
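A manifest can be as simple as a set of processed object keys. In practice it would live in a database table or the warehouse itself, not in memory; the keys below are hypothetical:

```python
def unprocessed(listed_keys, manifest):
    """Return object keys not yet recorded in the processed-files manifest."""
    return [k for k in sorted(listed_keys) if k not in manifest]

manifest = {"raw/orders/year=2024/month=01/day=14/orders_001.csv"}
listed = [
    "raw/orders/year=2024/month=01/day=14/orders_001.csv",
    "raw/orders/year=2024/month=01/day=15/orders_001.csv",
]
todo = unprocessed(listed, manifest)
# Record keys only after their files are processed successfully,
# so a crashed run picks them up again next time.
manifest.update(todo)
print(todo)
```

Listing-plus-manifest is the pull-based counterpart to event notifications; it also makes reprocessing a matter of deleting manifest entries.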

### File Best Practices
- **Naming conventions** — Include date, source, and sequence in filenames: `orders_2024-01-15_001.csv`
- **Compression** — Use gzip or snappy for storage efficiency; most tools handle compressed files natively
- **Archiving** — Move processed files to an archive prefix/bucket; retain for reprocessing capability
- **Schema detection** — Infer schema from the first N rows; validate against the expected schema; alert on changes

---

## Chapter 6: API Ingestion

### REST API Patterns
- **Authentication** — Handle API keys, OAuth tokens, token refresh; store credentials securely
- **Pagination** — Implement cursor-based, offset-based, or link-header pagination; prefer cursor-based for consistency
- **Rate limiting** — Respect rate limit headers (`X-RateLimit-Remaining`, `Retry-After`); implement backoff
- **Retry logic** — Retry on 429 (rate limit) and 5xx (server error) with exponential backoff; don't retry on other 4xx (client error)
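The retry rules above can be sketched as follows. The endpoint is simulated and backoff delays are recorded rather than slept; production code would call `time.sleep` and also honor a `Retry-After` header when the server sends one:

```python
import random

def fetch_with_retry(call, max_attempts=5, base_delay=1.0):
    """Retry 429 and 5xx with exponential backoff; fail fast on other 4xx."""
    delays = []
    for attempt in range(max_attempts):
        status, body = call()
        if status < 400:
            return body, delays
        if status != 429 and status < 500:
            raise RuntimeError(f"client error {status}, not retrying")
        # Exponential backoff with jitter (recorded instead of sleeping)
        delays.append(base_delay * 2 ** attempt + random.uniform(0, 0.1))
    raise RuntimeError("retries exhausted")

# Simulated endpoint: two 429s, then success
responses = iter([(429, None), (429, None), (200, {"items": []})])
body, delays = fetch_with_retry(lambda: next(responses))
print(body, len(delays))  # {'items': []} 2
```

The jitter spreads out retries from many workers so they don't hammer the API in lockstep.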

### API Data Handling
- **JSON response parsing** — Extract relevant fields; handle nested objects and arrays
- **Incremental fetching** — Use `modified_since` parameters, cursor tokens, or date range filters
- **Schema changes** — Handle new fields gracefully; log and alert on missing expected fields
- **Large responses** — Stream responses for large payloads; paginate aggressively

### Webhook Ingestion
- Set up HTTP endpoints to receive push notifications
- Validate webhook signatures for security
- Acknowledge receipt quickly (200 OK); process asynchronously
- Implement idempotency using event IDs to handle duplicate deliveries

---

## Chapter 7: Streaming Data

### Apache Kafka
- **Producers** — Serialize events (Avro, JSON, Protobuf); use keys for partition ordering; configure acknowledgments
- **Consumers** — Use consumer groups for parallel processing; commit offsets after successful processing; handle rebalances
- **Topics** — Design topic schemas; set retention policies; partition for throughput and ordering requirements
- **Exactly-once** — Use idempotent producers plus transactions, or implement deduplication downstream

### Amazon Kinesis
- **Streams** — Configure shard count for throughput; use enhanced fan-out for multiple consumers
- **Firehose** — Direct-to-S3/Redshift delivery; configure buffering interval and size; transform with Lambda

### Stream Processing Patterns
- **Windowing** — Tumbling, sliding, and session windows for aggregation over time
- **Watermarks** — Handle late-arriving events; define allowed lateness
- **State management** — Use state stores for aggregations; handle checkpointing and recovery
- **Dead letter queues** — Route failed events for inspection and reprocessing; don't lose data silently
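Tumbling windows are the simplest windowing case: fixed, non-overlapping buckets. A pure-Python sketch that assigns each event to its window start; real stream processors (Kafka Streams, Flink) add watermarks, late-event handling, and state recovery on top of this idea:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_secs=60):
    """Count events per fixed (tumbling) window, keyed by window start time."""
    counts = defaultdict(int)
    for ts, _payload in events:
        window_start = ts - (ts % window_secs)  # floor to the window boundary
        counts[window_start] += 1
    return dict(counts)

# (timestamp in seconds, payload)
events = [(0, "a"), (10, "b"), (59, "c"), (60, "d"), (125, "e")]
print(tumbling_window_counts(events))  # {0: 3, 60: 1, 120: 1}
```

Sliding windows differ only in that one event lands in several overlapping buckets; session windows close after a gap of inactivity.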

---

## Chapter 8: Data Storage and Loading

### Amazon Redshift
- Use the `COPY` command for bulk loading from S3; much faster than `INSERT`
- Define `SORTKEY` for frequently filtered columns; `DISTKEY` for join columns
- Use `UNLOAD` for efficient export back to S3
- Vacuum and analyze tables after large loads

### Google BigQuery
- Use load jobs for bulk data; streaming inserts for real-time (more expensive)
- Partition tables by date (ingestion time or column-based) for cost and performance
- Cluster tables on frequently filtered columns
- Use external tables for querying data in GCS without loading

### Snowflake
- Use stages (internal or external) for file-based loading
- `COPY INTO` for bulk loads from stages; Snowpipe for continuous loading
- Use virtual warehouses sized appropriately for load workloads
- Leverage Time Travel for data recovery and auditing

### General Loading Practices
- **Staging tables** — Always load to staging first; validate before merging to production
- **Atomic swaps** — Use table rename or partition swap for atomic updates
- **Data types** — Map source types carefully; avoid implicit conversions; use appropriate precision
- **Compression** — Let the warehouse handle compression; load compressed files when supported
- **Partitioning** — Partition by date for time-series data; by key for lookup tables
- **Clustering** — Cluster on frequently filtered columns within partitions

---

## Chapter 9: Data Transformations

### SQL-Based Transforms
- Use CTEs (common table expressions) for readability
- Prefer window functions over self-joins for ranking and running totals
- Use CASE expressions for conditional logic
- Aggregate at the right grain; avoid fan-out joins that multiply rows
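A window function keeps a running total to a single scan where a self-join would rescan the table. A sketch runnable against SQLite (window functions need SQLite 3.25+); the `daily_sales` table is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [("2024-01-01", 10.0), ("2024-01-02", 5.0), ("2024-01-03", 7.5)],
)
# Running total via a window function: no self-join needed
rows = conn.execute("""
    SELECT day,
           SUM(amount) OVER (ORDER BY day) AS running_total
    FROM daily_sales
    ORDER BY day
""").fetchall()
print(rows)  # [('2024-01-01', 10.0), ('2024-01-02', 15.0), ('2024-01-03', 22.5)]
```

The same `SUM(...) OVER (ORDER BY ...)` shape works on Redshift, BigQuery, and Snowflake, usually with a `PARTITION BY` clause added per grouping key.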

### dbt (Data Build Tool)
- **Staging models** — 1:1 with source tables; rename columns, cast types, filter deleted records
- **Intermediate models** — Business-logic joins, complex calculations, deduplication
- **Mart models** — Final analytics-ready tables; optimized for dashboard queries
- **Incremental models** — Process only new/changed data; use `unique_key` for the merge strategy
- **Tests** — `not_null`, `unique`, `accepted_values`, `relationships` on key columns; custom data tests
- **Sources** — Define sources in YAML; use the `source()` macro for lineage tracking; configure freshness checks

### Python-Based Transforms
- Use pandas for small-to-medium datasets; PySpark or Dask for large-scale processing
- Write pure functions for transformations; make them testable
- Handle data types explicitly; don't rely on inference for production pipelines
- Use vectorized operations; avoid row-by-row iteration

### Transform Best Practices
- **Layered architecture** — Raw → Staging → Intermediate → Mart; each layer has a clear purpose
- **Single responsibility** — Each model/transform does one thing well
- **Documented logic** — Comment complex business rules; maintain a data dictionary
- **Version controlled** — All transformation code in git; review changes via PR

---

## Chapter 10: Data Validation and Testing

### Validation Types
- **Schema validation** — Verify column names, types, and count match expectations
- **Row count checks** — Compare source and destination row counts; alert on significant discrepancies
- **Null checks** — Assert key columns are not null; track null percentages for optional columns
- **Uniqueness checks** — Verify primary key uniqueness in destination tables
- **Referential integrity** — Check foreign key relationships between tables
- **Range checks** — Validate values fall within expected ranges (dates, amounts, percentages)
- **Freshness checks** — Verify data is not stale; alert when the max timestamp is older than a threshold
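Several of these checks can run in one pass over a loaded batch. A minimal sketch over rows as dicts; the column names and freshness floor are hypothetical, and a real run would feed failures to the alerting layer instead of printing:

```python
def validate_batch(rows, key="id", freshness_floor="2024-01-14"):
    """Row-count, null-key, uniqueness, and freshness checks in one pass."""
    failures = []
    if not rows:
        return ["row count is zero"]
    keys = [r.get(key) for r in rows]
    if any(k is None for k in keys):
        failures.append("null primary key")
    if len(set(keys)) != len(keys):
        failures.append("duplicate primary key")
    # ISO-8601 strings compare correctly as text
    if max(r["updated_at"] for r in rows) < freshness_floor:
        failures.append("stale: max updated_at below freshness floor")
    return failures

rows = [{"id": 1, "updated_at": "2024-01-15"},
        {"id": 1, "updated_at": "2024-01-16"}]  # duplicate id on purpose
print(validate_batch(rows))  # ['duplicate primary key']
```

Schema and referential-integrity checks follow the same shape but need the expected schema and the related table as extra inputs.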

### Great Expectations
- Define expectations as code; version control them
- Run validations as pipeline steps; fail the pipeline on critical expectation failures
- Generate data documentation from expectations
- Use checkpoints for scheduled validation runs

### Testing Practices
- **Unit tests** — Test individual transformation functions with known inputs/outputs
- **Integration tests** — Test the end-to-end pipeline with sample data
- **Regression tests** — Compare current results against known-good baselines
- **Data contracts** — Define and enforce schemas between producer and consumer teams

---

## Chapter 11: Orchestration

### Apache Airflow
- **DAGs** — Define pipelines as directed acyclic graphs; each node is a task
- **Operators** — Use appropriate operators: `PythonOperator`, `BashOperator`, provider operators (`BigQueryOperator`, `S3ToRedshiftOperator`)
- **Scheduling** — Use cron expressions or timedeltas; set `start_date` and `catchup` appropriately
- **Dependencies** — Use the `>>` operator or `set_upstream`/`set_downstream`; keep DAGs shallow and wide
- **XComs** — Pass small metadata between tasks (row counts, file paths); NOT large datasets
- **Sensors** — Wait for external conditions (file arrival, partition availability); use with a timeout
- **Variables and Connections** — Store config in Airflow Variables; credentials in Connections
- **Pools** — Limit concurrency for resource-constrained tasks (database connections, API rate limits)

### DAG Design Patterns
- **One pipeline per DAG** — Keep DAGs focused; avoid mega-DAGs that do everything
- **Idempotent tasks** — Every task can be re-run safely; use `execution_date` for parameterization
- **Task granularity** — Tasks should be atomic and independently retryable; not too fine (overhead) or too coarse (blast radius)
- **Error handling** — Use `on_failure_callback` for alerting; `retries` and `retry_delay` for transient failures
- **Backfilling** — Use `airflow backfill` for historical reprocessing; ensure tasks support date parameterization

---

## Chapter 12: Monitoring and Alerting

### Pipeline Health Metrics
- **Duration** — Track execution time; alert on runs significantly longer than the historical average
- **Row counts** — Track records processed per run; alert on zero rows or dramatic changes
- **Error rates** — Track failed records, retries, exceptions; alert on elevated error rates
- **Data freshness** — Track the max timestamp in the destination; alert when data is staler than the SLA
- **Resource usage** — Track CPU, memory, disk, network; alert on resource exhaustion

### Alerting Strategies
- **SLA-based** — Define delivery SLAs; alert when pipelines miss their windows
- **Anomaly-based** — Detect deviations from historical patterns (row counts, durations, values)
- **Threshold-based** — Alert on fixed thresholds (error rate > 5%, null rate > 10%)
- **Escalation** — Define severity levels; route alerts appropriately (Slack, PagerDuty, email)

### Logging
- Log at each pipeline stage: extraction start/end, row counts, load confirmation
- Include correlation IDs to trace records through the pipeline
- Store logs centrally for searchability (ELK, CloudWatch, Stackdriver)
- Retain logs for debugging and audit compliance

---

## Chapter 13: Best Practices

### Idempotency
- Use DELETE + INSERT by date partition for fact tables
- Use MERGE/upsert with natural keys for dimension tables
- Use staging tables as an intermediary; clean up on both success and failure
- Test by running the pipeline twice; verify no data duplication

### Backfilling
- Parameterize all pipelines by date range (`start_date`, `end_date`)
- Use Airflow's `execution_date` or an equivalent for date-aware runs
- Test the backfill on a small date range before running the full historical reprocess
- Monitor resource usage during backfill; you may need to throttle parallelism
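Date-range parameterization reduces a backfill to a loop of single-day, idempotent runs. A sketch with a placeholder pipeline function; an orchestrator like Airflow does the same thing, scheduling one run per date:

```python
from datetime import date, timedelta

def backfill_dates(start, end):
    """Yield every date in [start, end] so each run covers exactly one day."""
    d = start
    while d <= end:
        yield d
        d += timedelta(days=1)

def run_pipeline(day):
    # Placeholder for a date-parameterized, idempotent pipeline run;
    # because each day is idempotent, a failed backfill can simply resume.
    return f"processed {day.isoformat()}"

results = [run_pipeline(d) for d in backfill_dates(date(2024, 1, 13), date(2024, 1, 15))]
print(len(results), results[0])  # 3 processed 2024-01-13
```

Throttling is then just a cap on how many of these day-runs execute concurrently.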

### Error Handling
- Retry transient failures (network timeouts, rate limits) with exponential backoff
- Fail fast on permanent errors (authentication failure, missing source table)
- Quarantine bad records; don't let one bad row fail the entire pipeline
- Send alerts with actionable context (error message, affected table, run ID)

### Data Lineage and Documentation
- Track source-to-destination mappings for every table
- Document transformation logic, especially business rules
- Maintain a data dictionary with column descriptions and types
- Use tools like dbt docs, DataHub, or Amundsen for automated lineage

### Security
- Never hardcode credentials; use secrets managers (AWS Secrets Manager, HashiCorp Vault)
- Encrypt data in transit (TLS) and at rest (warehouse encryption)
- Use least-privilege IAM roles for pipeline service accounts
- Audit access to sensitive data; mask PII in non-production environments
@@ -0,0 +1,181 @@
|
|
|
1
|
+
# Data Pipelines Pocket Reference — Pipeline Review Checklist
|
|
2
|
+
|
|
3
|
+
Systematic checklist for reviewing data pipelines against the 13 chapters
|
|
4
|
+
from *Data Pipelines Pocket Reference* by James Densmore.
|
|
5
|
+
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
## 1. Architecture & Patterns (Chapters 1–3)
|
|
9
|
+
|
|
10
|
+
### Infrastructure
|
|
11
|
+
- [ ] **Ch 1-2 — Appropriate infrastructure** — Is the right storage chosen for the use case (warehouse for analytics, lake for raw/unstructured)?
|
|
12
|
+
- [ ] **Ch 2 — Cloud-native services** — Are managed services used where appropriate to reduce operational burden?
|
|
13
|
+
- [ ] **Ch 2 — Separation of storage and compute** — Is compute scaled independently from storage?
|
|
14
|
+
|
|
15
|
+
### Pipeline Patterns
|
|
16
|
+
- [ ] **Ch 3 — ETL vs ELT** — Is the right pattern chosen? Is ELT used for analytics workloads with modern warehouses?
|
|
17
|
+
- [ ] **Ch 3 — Full vs incremental** — Is incremental extraction used for growing datasets? Is full extraction justified for small tables?
|
|
18
|
+
- [ ] **Ch 3 — CDC where appropriate** — Is CDC used for real-time sync needs instead of polling?
|
|
19
|
+
- [ ] **Ch 3 — Loading strategy** — Is the right load pattern used (append for events, upsert for dimensions, full refresh for small lookups)?
|
|
20
|
+
|
|
21
|
+
---
|
|
22
|
+
|
|
23
|
+
## 2. Data Ingestion (Chapters 4–7)
|
|
24
|
+
|
|
25
|
+
### Database Ingestion (Ch 4)
|
|
26
|
+
- [ ] **Ch 4 — Read replica usage** — Are extractions running against read replicas, not production databases?
|
|
27
|
+
- [ ] **Ch 4 — Incremental column** — Is there a reliable timestamp or ID column for incremental extraction?
|
|
28
|
+
- [ ] **Ch 4 — Connection management** — Are connections pooled and properly closed? Are timeouts configured?
|
|
29
|
+
- [ ] **Ch 4 — Query efficiency** — Are only needed columns selected? Are WHERE clauses using indexed columns?
|
|
30
|
+
- [ ] **Ch 4 — Large table handling** — Are large extractions chunked or using streaming cursors?
|
|
31
|
+
|
|
32
|
+
### File Ingestion (Ch 5)
|
|
33
|
+
- [ ] **Ch 5 — Schema validation** — Are file schemas validated before processing? Are malformed rows handled?
|
|
34
|
+
- [ ] **Ch 5 — Encoding handling** — Is character encoding handled consistently (UTF-8)?
|
|
35
|
+
- [ ] **Ch 5 — Cloud storage patterns** — Are files organized with partitioned prefixes? Are processed files archived?
|
|
36
|
+
- [ ] **Ch 5 — File tracking** — Is there a mechanism to track which files have been processed to avoid reprocessing?
|
|
37
|
+
- [ ] **Ch 5 — Compression** — Are files compressed for storage and transfer efficiency?
|
|
38
|
+
|
|
39
|
+
### API Ingestion (Ch 6)
|
|
40
|
+
- [ ] **Ch 6 — Pagination** — Is pagination implemented correctly? Are all pages fetched? Is cursor-based preferred?
|
|
41
|
+
- [ ] **Ch 6 — Rate limiting** — Are rate limit headers respected? Is backoff implemented for 429 responses?
|
|
42
|
+
- [ ] **Ch 6 — Retry logic** — Are transient errors (5xx, timeouts) retried with exponential backoff? Are 4xx errors not retried?
|
|
43
|
+
- [ ] **Ch 6 — Authentication** — Are credentials stored securely? Are token refreshes handled?
|
|
44
|
+
- [ ] **Ch 6 — Incremental fetching** — Are date/cursor parameters used to fetch only new data?
|
|
45
|
+
|
|
46
|
+
### Streaming Ingestion (Ch 7)
|
|
47
|
+
- [ ] **Ch 7 — Consumer group design** — Are consumer groups configured for parallel processing? Are rebalances handled?
|
|
48
|
+
- [ ] **Ch 7 — Offset management** — Are offsets committed after successful processing, not before?
|
|
49
|
+
- [ ] **Ch 7 — Serialization** — Are events serialized with a schema (Avro, Protobuf) for evolution support?
|
|
50
|
+
- [ ] **Ch 7 — Dead letter queue** — Are failed events routed to a DLQ for inspection and reprocessing?
|
|
51
|
+
- [ ] **Ch 7 — Exactly-once semantics** — Is deduplication implemented downstream or are idempotent producers used?
|
|
52
|
+
- [ ] **Ch 7 — Backpressure** — Is backpressure handled when consumer can't keep up with producer?
|
|
53
|
+
|
|
54
|
+
---
|
|
55
|
+
|
|
56
|
+
## 3. Data Storage & Loading (Chapter 8)
|
|
57
|
+
|
|
58
|
+
### Loading Patterns
|
|
59
|
+
- [ ] **Ch 8 — Bulk loading** — Are bulk load commands used (COPY, load jobs) instead of row-by-row INSERT?
|
|
60
|
+
- [ ] **Ch 8 — Staging tables** — Is data loaded to staging first, validated, then merged to production?
|
|
61
|
+
- [ ] **Ch 8 — Atomic operations** — Are loads atomic? Is the destination never left in a partial state?
|
|
62
|
+
- [ ] **Ch 8 — Data type mapping** — Are source types mapped correctly to destination types? No implicit conversions?
|
|
63
|
+
|
|
64
|
+
### Table Design
|
|
65
|
+
- [ ] **Ch 8 — Partitioning** — Are large tables partitioned by date or key? Are queries leveraging partition pruning?
|
|
66
|
+
- [ ] **Ch 8 — Clustering** — Are frequently filtered columns used as cluster keys within partitions?
|
|
67
|
+
- [ ] **Ch 8 — Sort/distribution keys** — For Redshift: are SORTKEY and DISTKEY chosen based on query patterns?
|
|
68
|
+
|
|
69
|
+
### Warehouse-Specific
|
|
70
|
+
- [ ] **Ch 8 — Redshift COPY** — Is S3-based COPY used instead of INSERT for bulk loads?
|
|
71
|
+
- [ ] **Ch 8 — BigQuery load jobs** — Are load jobs preferred over streaming inserts for batch pipelines?
|
|
72
|
+
- [ ] **Ch 8 — Snowflake stages** — Are stages used for file-based loading? Is SNOWPIPE configured for continuous loads?
|
|
73
|
+
|
|
74
|
+
---
|
|
75
|
+
|
|
76
|
+
## 4. Transformations (Chapter 9)
|
|
77
|
+
|
|
78
|
+
### SQL Transforms
|
|
79
|
+
- [ ] **Ch 9 — CTEs for readability** — Are Common Table Expressions used instead of deeply nested subqueries?
|
|
80
|
+
- [ ] **Ch 9 — Window functions** — Are window functions used for ranking, running totals instead of self-joins?
|
|
81
|
+
- [ ] **Ch 9 — Grain awareness** — Do joins maintain the correct grain? No accidental fan-out producing duplicate rows?
|
|
82
|
+
- [ ] **Ch 9 — Deterministic logic** — Are transforms deterministic (same input → same output)? No reliance on row ordering?
|
|
83
|
+
|
|
84
|
+
### dbt Patterns
|
|
85
|
+
- [ ] **Ch 9 — Layer structure** — Are models organized into staging → intermediate → mart layers?
|
|
86
|
+
- [ ] **Ch 9 — Staging models** — Are staging models 1:1 with sources? Do they rename, cast, and filter only?
|
|
87
|
+
- [ ] **Ch 9 — Incremental models** — Are large models configured as incremental with proper `unique_key` and merge strategy?
|
|
88
|
+
- [ ] **Ch 9 — Source definitions** — Are sources defined in YAML with `source()` macro? Are freshness checks configured?
|
|
89
|
+
- [ ] **Ch 9 — Model materialization** — Are materializations appropriate (view for light transforms, table for heavy, incremental for large)?

### Python Transforms
- [ ] **Ch 9 — Appropriate tool** — Is Python used only when SQL is insufficient (ML, complex parsing, API calls)?
- [ ] **Ch 9 — Vectorized operations** — Are pandas/numpy vectorized operations used instead of row-by-row iteration?
- [ ] **Ch 9 — Memory management** — Are large datasets processed in chunks or with distributed frameworks (PySpark)?
- [ ] **Ch 9 — Pure functions** — Are transformation functions pure (no side effects) and independently testable?
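
The vectorization and chunking items, side by side, on a toy DataFrame (column names invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"qty": [2, 5, 3], "unit_price": [10.0, 4.0, 7.5]})

# Anti-pattern: row-by-row iteration.
slow = [row.qty * row.unit_price for row in df.itertuples()]

# Vectorized: one columnar operation, typically orders of magnitude faster.
df["revenue"] = df["qty"] * df["unit_price"]
assert df["revenue"].tolist() == slow

# For files too large for memory, process in bounded chunks instead:
# for chunk in pd.read_csv("big.csv", chunksize=100_000):
#     process(chunk)
```

In review, `iterrows()` or `apply(..., axis=1)` inside a hot transform is the usual red flag for the first item.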

---

## 5. Data Validation & Testing (Chapter 10)

### Validation Coverage
- [ ] **Ch 10 — Schema validation** — Are column names, types, and counts validated at ingestion?
- [ ] **Ch 10 — Row count reconciliation** — Are source and destination row counts compared? Is threshold alerting configured?
- [ ] **Ch 10 — Null checks** — Are NOT NULL constraints enforced on key columns? Are null percentages tracked?
- [ ] **Ch 10 — Uniqueness checks** — Is primary key uniqueness verified after loading?
- [ ] **Ch 10 — Referential integrity** — Are foreign key relationships validated between related tables?
- [ ] **Ch 10 — Range validation** — Are values checked against expected ranges (dates in the past, amounts positive, percentages 0–100)?
- [ ] **Ch 10 — Freshness checks** — Is data freshness monitored? Are alerts configured for stale data?
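
Several of these checks (nulls, uniqueness, ranges) can live in one small batch validator. A dependency-free sketch; the `order_id`/`amount`/`order_date` fields are hypothetical, and a real pipeline would route the failure list to quarantine and alerting rather than just return it:

```python
from datetime import date

def validate(rows, key="order_id"):
    """Return a list of validation failures for a batch of row dicts."""
    failures = []
    keys = [r.get(key) for r in rows]
    if any(k is None for k in keys):        # null check on the key column
        failures.append(f"null {key}")
    if len(keys) != len(set(keys)):         # uniqueness check
        failures.append(f"duplicate {key}")
    for r in rows:                          # range checks
        if r["amount"] < 0:
            failures.append(f"negative amount: {r}")
        if r["order_date"] > date.today():
            failures.append(f"future date: {r}")
    return failures

batch = [
    {"order_id": 1, "amount": 10.0, "order_date": date(2024, 1, 1)},
    {"order_id": 1, "amount": -5.0, "order_date": date(2024, 1, 2)},
]
print(validate(batch))  # duplicate key plus negative amount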

### Testing
- [ ] **Ch 10 — Unit tests** — Are individual transformation functions tested with known inputs and outputs?
- [ ] **Ch 10 — Integration tests** — Is the end-to-end pipeline tested with sample data?
- [ ] **Ch 10 — dbt tests** — Are dbt schema tests (`not_null`, `unique`, `relationships`) defined? Are custom data tests used?
- [ ] **Ch 10 — Regression tests** — Are current results compared against known-good baselines for critical tables?
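
The unit-test item presupposes the "pure functions" item above: a transform with no I/O or hidden state is testable with plain known-input/known-output assertions. A minimal illustration (the function is invented for the example):

```python
def normalize_email(raw: str) -> str:
    """Pure transform: trim whitespace and lowercase. No I/O, no globals."""
    return raw.strip().lower()

# Unit tests with known inputs and outputs:
assert normalize_email("  Alice@Example.COM ") == "alice@example.com"
assert normalize_email("bob@x.com") == "bob@x.com"
```

If a transform cannot be tested this way, that is usually a sign it mixes extraction or loading into the transformation step.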

---

## 6. Orchestration (Chapter 11)

### DAG Design
- [ ] **Ch 11 — One pipeline per DAG** — Are DAGs focused on a single pipeline? No mega-DAGs?
- [ ] **Ch 11 — Task granularity** — Are tasks atomic and independently retryable? Neither too fine nor too coarse?
- [ ] **Ch 11 — Shallow and wide** — Are DAGs shallow (few sequential steps) and wide (parallel where possible)?
- [ ] **Ch 11 — No hardcoded dates** — Are dates parameterized using `execution_date` or an equivalent? No hardcoded date strings?
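
The "no hardcoded dates" item in practice: every path and filter derives from the run's logical date, which the orchestrator supplies (in Airflow via templates such as `{{ ds }}`). A sketch with a hypothetical bucket layout:

```python
from datetime import date

def extract_partition_path(run_date: date, base: str = "s3://my-bucket/events") -> str:
    """Derive the input path from the run's logical date instead of a
    hardcoded string, so a backfill can replay any historical date."""
    return f"{base}/dt={run_date:%Y-%m-%d}/"

print(extract_partition_path(date(2024, 3, 1)))
# s3://my-bucket/events/dt=2024-03-01/
```

A hardcoded `"2024-03-01"` anywhere in task code means backfills and reruns silently reprocess the wrong day.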

### Airflow Specifics
- [ ] **Ch 11 — Idempotent tasks** — Can every task be safely re-run without data duplication or side effects?
- [ ] **Ch 11 — Appropriate operators** — Are provider operators used where available instead of a generic PythonOperator?
- [ ] **Ch 11 — XCom usage** — Are XComs used only for small metadata (file paths, row counts), not large data?
- [ ] **Ch 11 — Sensor timeouts** — Do sensors have timeouts to avoid indefinite waiting?
- [ ] **Ch 11 — Error callbacks** — Is `on_failure_callback` configured for alerting on task failures?
- [ ] **Ch 11 — Retries** — Are retries configured with an appropriate delay for transient failures?
- [ ] **Ch 11 — Pool limits** — Are pools used to limit concurrency for resource-constrained tasks?
- [ ] **Ch 11 — Catchup configuration** — Is `catchup` set appropriately (True for backfill-supporting DAGs, False otherwise)?
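
Airflow's `retries` and `retry_delay` task arguments give you the retry behavior declaratively; the underlying idea is exponential backoff on transient errors only. A pure-Python sketch with invented helper names, using an injectable `sleep` so the demo runs instantly:

```python
import time

def run_with_retries(task, retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry a task on transient errors, waiting base_delay * 2**attempt
    between attempts. Permanent errors propagate immediately."""
    for attempt in range(retries + 1):
        try:
            return task()
        except ConnectionError:            # transient: retry with backoff
            if attempt == retries:
                raise                      # retries exhausted: surface it
            sleep(base_delay * 2 ** attempt)

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

assert run_with_retries(flaky, sleep=lambda s: None) == "ok"
assert calls["n"] == 3  # succeeded on the third attempt
```

Note the exception filter: retrying a `ValueError` from bad data would just repeat the failure three times — permanent errors should fail fast.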

---

## 7. Monitoring & Operations (Chapters 12–13)

### Monitoring
- [ ] **Ch 12 — Duration tracking** — Is pipeline execution time tracked, with alerts on anomalies?
- [ ] **Ch 12 — Row count monitoring** — Are rows processed per run tracked? Alerts on zero or unusual counts?
- [ ] **Ch 12 — Error rate tracking** — Are failed records, retries, and exceptions monitored?
- [ ] **Ch 12 — Data freshness SLA** — Are freshness SLAs defined and monitored? Alerts on breaches?
- [ ] **Ch 12 — Resource monitoring** — Are CPU, memory, disk, and network usage tracked for pipeline infrastructure?
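
Row-count monitoring needs a notion of "unusual". One simple, dependency-free approach (a sketch, not a standard: the z-score threshold and the zero-row special case are assumptions to tune per pipeline) compares each run against a rolling baseline:

```python
from statistics import mean, stdev

def is_anomalous(history, current, z_threshold=3.0):
    """Flag a run whose row count deviates more than z_threshold
    standard deviations from the recent baseline."""
    if current == 0:
        return True                       # zero rows is always suspicious
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold

history = [1000, 1020, 980, 1010, 995]    # row counts from recent runs
print(is_anomalous(history, 1005))        # normal day: False
print(is_anomalous(history, 0))           # empty load: True
```

The same shape of check works for duration tracking — swap row counts for run times.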

### Alerting
- [ ] **Ch 12 — Actionable alerts** — Do alerts include context (error message, affected table, run ID, link to logs)?
- [ ] **Ch 12 — Severity levels** — Are alerts classified by severity with appropriate routing?
- [ ] **Ch 12 — Alert fatigue prevention** — Are thresholds tuned to avoid noisy alerts? Are alerts deduplicated?

### Operational Excellence
- [ ] **Ch 13 — Idempotency** — Are all pipelines idempotent? Can they be re-run without data corruption?
- [ ] **Ch 13 — Backfill support** — Are pipelines parameterized for date-range backfilling?
- [ ] **Ch 13 — Error handling** — Are transient errors retried? Do permanent errors fail fast? Are bad records quarantined?
- [ ] **Ch 13 — Credential security** — Are credentials kept in a secrets manager, not in code or config files?
- [ ] **Ch 13 — Data lineage** — Is source-to-destination mapping documented? Is transformation logic recorded?
- [ ] **Ch 13 — Documentation** — Is there a README, data dictionary, and runbook for each pipeline?
- [ ] **Ch 13 — Version control** — Is all pipeline code in git? Are changes reviewed via PR?
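
The idempotency item is most often satisfied by delete-then-insert (partition overwrite): re-running a date replaces its slice instead of appending duplicates. A sketch with SQLite standing in for the warehouse and an invented `fact_sales` table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact_sales (dt TEXT, amount REAL)")

def load_partition(dt, amounts):
    """Idempotent load: replace the date's partition atomically."""
    with con:  # one transaction, so the delete+insert swap is all-or-nothing
        con.execute("DELETE FROM fact_sales WHERE dt = ?", (dt,))
        con.executemany(
            "INSERT INTO fact_sales VALUES (?, ?)",
            [(dt, a) for a in amounts],
        )

load_partition("2024-01-01", [10.0, 20.0])
load_partition("2024-01-01", [10.0, 20.0])  # re-run: no duplicates

count = con.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print(count)  # still 2 rows after the re-run
```

An append-only load that fails this re-run test is the single most common Critical finding in the severity table below.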

---

## Quick Review Workflow

1. **Architecture pass** — Verify ETL/ELT choice, incremental vs full, infrastructure fit
2. **Ingestion pass** — Check source-specific best practices, error handling, incremental logic
3. **Loading pass** — Verify bulk loading, staging tables, partitioning, atomic operations
4. **Transform pass** — Check SQL quality, dbt patterns, layer structure, determinism
5. **Validation pass** — Verify data quality checks at each boundary, test coverage
6. **Orchestration pass** — Check DAG design, idempotency, task granularity, error handling
7. **Operations pass** — Verify monitoring, alerting, backfill support, documentation
8. **Prioritize findings** — Rank by severity: data loss risk > data quality > performance > best practices > style

## Severity Levels

| Severity | Description | Example |
|----------|-------------|---------|
| **Critical** | Data loss, corruption, or security risk | Non-idempotent pipeline causing duplicates; hardcoded credentials; no staging tables, risking partial loads; missing dead-letter queue losing events |
| **High** | Data quality or reliability issues | Missing validation; no error handling; full extraction on large tables; no monitoring or alerting; blocking on rate limits |
| **Medium** | Performance, maintainability, or operational gaps | Missing partitioning; monolithic DAGs; no backfill support; missing documentation; no incremental models for large tables |
| **Low** | Best-practice improvements, optimization opportunities | Missing compression; suboptimal clustering; verbose logging; minor naming inconsistencies |