@booklib/skills 1.0.0

Files changed (85)
  1. package/LICENSE +21 -0
  2. package/README.md +105 -0
  3. package/animation-at-work/SKILL.md +246 -0
  4. package/animation-at-work/assets/example_asset.txt +1 -0
  5. package/animation-at-work/references/api_reference.md +369 -0
  6. package/animation-at-work/references/review-checklist.md +79 -0
  7. package/animation-at-work/scripts/example.py +1 -0
  8. package/bin/skills.js +85 -0
  9. package/clean-code-reviewer/SKILL.md +292 -0
  10. package/clean-code-reviewer/evals/evals.json +67 -0
  11. package/data-intensive-patterns/SKILL.md +204 -0
  12. package/data-intensive-patterns/assets/example_asset.txt +1 -0
  13. package/data-intensive-patterns/references/api_reference.md +34 -0
  14. package/data-intensive-patterns/references/patterns-catalog.md +551 -0
  15. package/data-intensive-patterns/references/review-checklist.md +193 -0
  16. package/data-intensive-patterns/scripts/example.py +1 -0
  17. package/data-pipelines/SKILL.md +252 -0
  18. package/data-pipelines/assets/example_asset.txt +1 -0
  19. package/data-pipelines/references/api_reference.md +301 -0
  20. package/data-pipelines/references/review-checklist.md +181 -0
  21. package/data-pipelines/scripts/example.py +1 -0
  22. package/design-patterns/SKILL.md +245 -0
  23. package/design-patterns/assets/example_asset.txt +1 -0
  24. package/design-patterns/references/api_reference.md +1 -0
  25. package/design-patterns/references/patterns-catalog.md +726 -0
  26. package/design-patterns/references/review-checklist.md +173 -0
  27. package/design-patterns/scripts/example.py +1 -0
  28. package/domain-driven-design/SKILL.md +221 -0
  29. package/domain-driven-design/assets/example_asset.txt +1 -0
  30. package/domain-driven-design/references/api_reference.md +1 -0
  31. package/domain-driven-design/references/patterns-catalog.md +545 -0
  32. package/domain-driven-design/references/review-checklist.md +158 -0
  33. package/domain-driven-design/scripts/example.py +1 -0
  34. package/effective-java/SKILL.md +195 -0
  35. package/effective-java/assets/example_asset.txt +1 -0
  36. package/effective-java/references/api_reference.md +1 -0
  37. package/effective-java/references/items-catalog.md +955 -0
  38. package/effective-java/references/review-checklist.md +216 -0
  39. package/effective-java/scripts/example.py +1 -0
  40. package/effective-kotlin/SKILL.md +225 -0
  41. package/effective-kotlin/assets/example_asset.txt +1 -0
  42. package/effective-kotlin/references/api_reference.md +1 -0
  43. package/effective-kotlin/references/practices-catalog.md +1228 -0
  44. package/effective-kotlin/references/review-checklist.md +126 -0
  45. package/effective-kotlin/scripts/example.py +1 -0
  46. package/kotlin-in-action/SKILL.md +251 -0
  47. package/kotlin-in-action/assets/example_asset.txt +1 -0
  48. package/kotlin-in-action/references/api_reference.md +1 -0
  49. package/kotlin-in-action/references/practices-catalog.md +436 -0
  50. package/kotlin-in-action/references/review-checklist.md +204 -0
  51. package/kotlin-in-action/scripts/example.py +1 -0
  52. package/lean-startup/SKILL.md +250 -0
  53. package/lean-startup/assets/example_asset.txt +1 -0
  54. package/lean-startup/references/api_reference.md +319 -0
  55. package/lean-startup/references/review-checklist.md +137 -0
  56. package/lean-startup/scripts/example.py +1 -0
  57. package/microservices-patterns/SKILL.md +179 -0
  58. package/microservices-patterns/references/patterns-catalog.md +391 -0
  59. package/microservices-patterns/references/review-checklist.md +169 -0
  60. package/package.json +17 -0
  61. package/refactoring-ui/SKILL.md +236 -0
  62. package/refactoring-ui/assets/example_asset.txt +1 -0
  63. package/refactoring-ui/references/api_reference.md +355 -0
  64. package/refactoring-ui/references/review-checklist.md +114 -0
  65. package/refactoring-ui/scripts/example.py +1 -0
  66. package/storytelling-with-data/SKILL.md +238 -0
  67. package/storytelling-with-data/assets/example_asset.txt +1 -0
  68. package/storytelling-with-data/references/api_reference.md +379 -0
  69. package/storytelling-with-data/references/review-checklist.md +111 -0
  70. package/storytelling-with-data/scripts/example.py +1 -0
  71. package/system-design-interview/SKILL.md +213 -0
  72. package/system-design-interview/assets/example_asset.txt +1 -0
  73. package/system-design-interview/references/api_reference.md +582 -0
  74. package/system-design-interview/references/review-checklist.md +201 -0
  75. package/system-design-interview/scripts/example.py +1 -0
  76. package/using-asyncio-python/SKILL.md +242 -0
  77. package/using-asyncio-python/assets/example_asset.txt +1 -0
  78. package/using-asyncio-python/references/api_reference.md +267 -0
  79. package/using-asyncio-python/references/review-checklist.md +149 -0
  80. package/using-asyncio-python/scripts/example.py +1 -0
  81. package/web-scraping-python/SKILL.md +259 -0
  82. package/web-scraping-python/assets/example_asset.txt +1 -0
  83. package/web-scraping-python/references/api_reference.md +393 -0
  84. package/web-scraping-python/references/review-checklist.md +163 -0
  85. package/web-scraping-python/scripts/example.py +1 -0
package/data-pipelines/references/api_reference.md
@@ -0,0 +1,301 @@
# Data Pipelines Pocket Reference — Practices Catalog

A chapter-by-chapter catalog of pipeline-building practices from *Data Pipelines Pocket Reference* by James Densmore.

---
## Chapters 1–2: Introduction & Modern Data Infrastructure

### Infrastructure Choices
- **Data warehouse** — Columnar storage optimized for analytics queries (Redshift, BigQuery, Snowflake); choose when analytics is the primary use case
- **Data lake** — Object storage (S3, GCS, Azure Blob) for raw, unstructured, or semi-structured data; choose when data variety is high or the schema is unknown
- **Hybrid** — Land raw data in the lake, load structured subsets into the warehouse; a common modern pattern
- **Cloud-native** — Leverage managed services (serverless compute, auto-scaling storage) to reduce operational burden

### Pipeline Types
- **Batch** — Process data at scheduled intervals (hourly, daily); suitable for most analytics use cases
- **Streaming** — Process data continuously as it arrives; required for real-time dashboards, alerts, and event-driven systems
- **Micro-batch** — Small, frequent batches (every few minutes); a compromise between batch simplicity and near-real-time latency

---
## Chapter 3: Common Data Pipeline Patterns

### Extraction Patterns
- **Full extraction** — Extract the entire dataset each run; simple but expensive for large tables; use for small reference tables or initial loads
- **Incremental extraction** — Extract only new/changed records using a high-water mark (timestamp, auto-increment ID, or sequence); preferred for growing datasets
- **Change data capture (CDC)** — Capture changes from database transaction logs (MySQL binlog, PostgreSQL WAL); lowest latency, captures deletes; use for real-time sync

### Loading Patterns
- **Full refresh (truncate + load)** — Replace the entire destination table; simple and idempotent; use for small tables or when incremental is unreliable
- **Append** — Insert new records only; use for event/log data that is never updated
- **Upsert (MERGE)** — Insert new records, update existing ones based on a key; use for mutable dimension data
- **Delete + insert by partition** — Delete a partition, insert its replacement; idempotent and efficient for date-partitioned fact tables

### ETL vs ELT
- **ETL** — Transform before loading; use when the destination has limited compute or data must be cleansed before storage
- **ELT** — Load raw data first, transform in the destination; preferred for modern cloud warehouses with cheap compute; preserves raw data and allows flexible re-transformation

---
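The delete + insert by partition pattern can be sketched with an in-memory SQLite table (a minimal sketch; the `fact_orders` schema and table name are illustrative, and a warehouse would operate on a real date partition):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_orders (order_date TEXT, order_id INTEGER, amount REAL)")

def load_partition(conn, order_date, rows):
    """Idempotent delete+insert: replace one date partition in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM fact_orders WHERE order_date = ?", (order_date,))
        conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)

rows = [("2024-01-15", 1, 9.99), ("2024-01-15", 2, 24.50)]
load_partition(conn, "2024-01-15", rows)
load_partition(conn, "2024-01-15", rows)  # re-run safely: no duplicates

count = conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]
```

Because the delete and insert share one transaction, a re-run or a mid-load failure never leaves the partition half-written.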
## Chapter 4: Database Ingestion

### MySQL Extraction
- Use `SELECT ... WHERE updated_at > :last_run` for incremental extraction
- For full extraction: `SELECT *` with optional `LIMIT/OFFSET` for large tables (prefer streaming cursors)
- Use read replicas to avoid impacting production database performance
- Handle MySQL timezone conversions (store and compare in UTC)
- Use connection pooling for concurrent extraction from multiple tables

### PostgreSQL Extraction
- Apply similar incremental patterns using timestamp columns
- Use `COPY TO` for efficient bulk export to CSV/files
- Leverage PostgreSQL logical replication for CDC
- Handle PostgreSQL-specific types (arrays, JSON, custom types) during extraction

### MongoDB Extraction
- Use change streams for real-time CDC
- For incremental: query by `_id` (the ObjectId embeds a timestamp) or a custom timestamp field
- Handle nested documents: flatten or store as JSON in the warehouse
- Use `mongodump` for full extraction of large collections

### General Database Practices
- **Connection management** — Use connection pools; close connections promptly; handle timeouts
- **Query optimization** — Add indexes on extraction columns; limit selected columns; use WHERE clauses
- **Binary data** — Skip BLOBs or store references to them; don't load binary data into the analytics warehouse
- **Character encoding** — Ensure UTF-8 throughout the pipeline; handle encoding mismatches at extraction

---
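The high-water-mark query above can be sketched end to end against SQLite (the `orders` table and timestamp values are hypothetical; a MySQL or PostgreSQL extractor would run the same query shape over a pooled connection):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL, updated_at TEXT);
INSERT INTO orders VALUES
  (1, 10.0, '2024-01-15 08:00:00'),
  (2, 20.0, '2024-01-15 09:30:00'),
  (3, 30.0, '2024-01-15 11:45:00');
""")

def extract_incremental(conn, last_high_water_mark):
    """Pull only rows changed since the previous run; return rows and the new mark."""
    cur = conn.execute(
        "SELECT id, total, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_high_water_mark,),
    )
    rows = cur.fetchall()
    new_mark = rows[-1][2] if rows else last_high_water_mark
    return rows, new_mark

rows, mark = extract_incremental(conn, "2024-01-15 09:00:00")
# rows contains orders 2 and 3; mark advances to '2024-01-15 11:45:00'
```

Persisting `mark` between runs (in a metadata table or orchestrator variable) is what makes the extraction restartable.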
## Chapter 5: File Ingestion

### CSV Files
- Handle header rows, quoting, escaping, and delimiters (not always a comma)
- Detect and handle encoding (UTF-8, Latin-1, Windows-1252)
- Validate the column count per row; log and quarantine malformed rows
- Use streaming parsers for large files; avoid loading an entire file into memory

### JSON Files
- Handle nested structures: flatten for warehouse loading or store as a JSON column
- Use JSON Lines (newline-delimited JSON) for large datasets
- Validate against the expected schema; handle missing and extra fields
- Parse dates and timestamps consistently

### Cloud Storage Integration
- **S3** — Use `aws s3 cp/sync` or boto3; leverage S3 event notifications for trigger-based ingestion
- **GCS** — Use `gsutil` or the google-cloud-storage library; use Pub/Sub for event notifications
- **Azure Blob** — Use the Azure SDK; leverage Event Grid for notifications
- Use prefix/partition naming: `s3://bucket/table/year=2024/month=01/day=15/`
- Implement file manifests to track which files have been processed

### File Best Practices
- **Naming conventions** — Include date, source, and sequence in filenames: `orders_2024-01-15_001.csv`
- **Compression** — Use gzip or snappy for storage efficiency; most tools handle compressed files natively
- **Archiving** — Move processed files to an archive prefix/bucket; retain them for reprocessing
- **Schema detection** — Infer the schema from the first N rows; validate against the expected schema; alert on changes

---
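The quarantine-don't-crash rule for CSV parsing can be sketched with the stdlib `csv` module (the sample data and three-column schema are illustrative; a real pipeline would stream from a file object rather than a string):

```python
import csv
import io

RAW = "id,name,amount\n1,alice,10.5\n2,bob\n3,carol,7.25\n"  # line 3 is malformed

def parse_csv(stream, expected_cols=3):
    """Stream rows; keep well-formed ones, set malformed ones aside with their line number."""
    good, quarantined = [], []
    reader = csv.reader(stream)
    header = next(reader)
    for lineno, row in enumerate(reader, start=2):  # header is line 1
        if len(row) != expected_cols:
            quarantined.append((lineno, row))  # log and quarantine, don't fail the run
        else:
            good.append(dict(zip(header, row)))
    return good, quarantined

good, bad = parse_csv(io.StringIO(RAW))
# good has 2 rows; bad holds (3, ['2', 'bob']) for later inspection
```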
## Chapter 6: API Ingestion

### REST API Patterns
- **Authentication** — Handle API keys, OAuth tokens, and token refresh; store credentials securely
- **Pagination** — Implement cursor-based, offset-based, or link-header pagination; prefer cursor-based for consistency
- **Rate limiting** — Respect rate-limit headers (`X-RateLimit-Remaining`, `Retry-After`); implement backoff
- **Retry logic** — Retry on 429 (rate limit) and 5xx (server error) with exponential backoff; don't retry 4xx (client error)

### API Data Handling
- **JSON response parsing** — Extract relevant fields; handle nested objects and arrays
- **Incremental fetching** — Use `modified_since` parameters, cursor tokens, or date-range filters
- **Schema changes** — Handle new fields gracefully; log and alert on missing expected fields
- **Large responses** — Stream responses for large payloads; paginate aggressively

### Webhook Ingestion
- Set up HTTP endpoints to receive push notifications
- Validate webhook signatures for security
- Acknowledge receipt quickly (200 OK); process asynchronously
- Implement idempotency using event IDs to handle duplicate deliveries

---
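Cursor pagination plus retry-with-backoff can be sketched without any network access; `fetch_page` below is a hypothetical in-memory stand-in for the HTTP call (a real client would use something like `requests.get` and inspect status codes):

```python
import itertools
import time

# Simulated paginated endpoint: cursor -> page of items plus the next cursor.
PAGES = {
    None: {"items": [1, 2], "next_cursor": "c1"},
    "c1": {"items": [3, 4], "next_cursor": "c2"},
    "c2": {"items": [5], "next_cursor": None},
}

def fetch_page(cursor, _fail_once={"flaky": True}):
    """Stand-in for an HTTP GET; fails once on page 'c1' to simulate a 5xx/timeout."""
    if cursor == "c1" and _fail_once.pop("flaky", False):
        raise TimeoutError("transient network error")
    return PAGES[cursor]

def fetch_all(max_retries=3):
    items, cursor = [], None
    while True:
        for attempt in itertools.count():
            try:
                page = fetch_page(cursor)
                break
            except TimeoutError:
                if attempt + 1 >= max_retries:
                    raise  # permanent failure: give up and surface the error
                time.sleep(min(2 ** attempt * 0.01, 0.1))  # exponential backoff
        items.extend(page["items"])
        cursor = page["next_cursor"]
        if cursor is None:
            return items

result = fetch_all()  # all five items arrive despite one transient failure
```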
## Chapter 7: Streaming Data

### Apache Kafka
- **Producers** — Serialize events (Avro, JSON, Protobuf); use keys for partition ordering; configure acknowledgments
- **Consumers** — Use consumer groups for parallel processing; commit offsets after successful processing; handle rebalances
- **Topics** — Design topic schemas; set retention policies; partition for throughput and ordering requirements
- **Exactly-once** — Use idempotent producers plus transactional consumers, or implement deduplication downstream

### Amazon Kinesis
- **Streams** — Configure shard count for throughput; use enhanced fan-out for multiple consumers
- **Firehose** — Direct-to-S3/Redshift delivery; configure buffering interval and size; transform with Lambda

### Stream Processing Patterns
- **Windowing** — Tumbling, sliding, and session windows for aggregation over time
- **Watermarks** — Handle late-arriving events; define allowed lateness
- **State management** — Use state stores for aggregations; handle checkpointing and recovery
- **Dead letter queues** — Route failed events for inspection and reprocessing; don't lose data silently

---
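Tumbling-window aggregation can be sketched in plain Python (the event stream and 60-second width are illustrative; a stream processor like Flink or Kafka Streams applies the same bucketing logic with state stores and watermarks):

```python
from collections import defaultdict

# Hypothetical event stream: (epoch_seconds, value) pairs, possibly out of order.
events = [(0, 1), (12, 2), (61, 3), (65, 4), (59, 5), (130, 6)]

def tumbling_window_sums(events, width=60):
    """Assign each event to a fixed, non-overlapping window and sum its values."""
    sums = defaultdict(int)
    for ts, value in events:
        window_start = (ts // width) * width  # e.g. t=65 -> window [60, 120)
        sums[window_start] += value
    return dict(sums)

result = tumbling_window_sums(events)
# {0: 8, 60: 7, 120: 6} — t=0,12,59 land in [0,60); t=61,65 in [60,120)
```

Note the late event at t=59 still lands in the correct window here because everything is in memory; a real stream job needs watermarks to decide when a window can be closed.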
## Chapter 8: Data Storage and Loading

### Amazon Redshift
- Use the `COPY` command for bulk loading from S3; much faster than INSERT
- Define `SORTKEY` on frequently filtered columns and `DISTKEY` on join columns
- Use `UNLOAD` for efficient export back to S3
- Vacuum and analyze tables after large loads

### Google BigQuery
- Use load jobs for bulk data; streaming inserts for real-time (more expensive)
- Partition tables by date (ingestion time or column-based) for cost and performance
- Cluster tables on frequently filtered columns
- Use external tables to query data in GCS without loading it

### Snowflake
- Use stages (internal or external) for file-based loading
- Use `COPY INTO` for bulk loads from stages and Snowpipe for continuous loading
- Size virtual warehouses appropriately for load workloads
- Leverage Time Travel for data recovery and auditing

### General Loading Practices
- **Staging tables** — Always load to staging first; validate before merging to production
- **Atomic swaps** — Use a table rename or partition swap for atomic updates
- **Data types** — Map source types carefully; avoid implicit conversions; use appropriate precision
- **Compression** — Let the warehouse handle compression; load compressed files when supported
- **Partitioning** — Partition by date for time-series data, by key for lookup tables
- **Clustering** — Cluster on frequently filtered columns within partitions

---
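The staging-then-atomic-swap practice can be sketched with SQLite (table names are hypothetical; warehouses expose the same idea via `ALTER TABLE ... RENAME`, `CREATE OR REPLACE TABLE`, or partition exchange):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customers (id INTEGER, name TEXT)")
conn.execute("INSERT INTO dim_customers VALUES (1, 'stale')")

def full_refresh(conn, rows):
    """Load into staging, validate, then swap so readers never see a partial load."""
    with conn:
        conn.execute("DROP TABLE IF EXISTS dim_customers_staging")
        conn.execute("CREATE TABLE dim_customers_staging (id INTEGER, name TEXT)")
        conn.executemany("INSERT INTO dim_customers_staging VALUES (?, ?)", rows)
        # Validation gate before publishing
        n = conn.execute("SELECT COUNT(*) FROM dim_customers_staging").fetchone()[0]
        if n == 0:
            raise ValueError("refusing to publish an empty table")
        conn.execute("DROP TABLE dim_customers")
        conn.execute("ALTER TABLE dim_customers_staging RENAME TO dim_customers")

full_refresh(conn, [(1, "alice"), (2, "bob")])
names = [r[0] for r in conn.execute("SELECT name FROM dim_customers ORDER BY id")]
```

If validation fails, the transaction rolls back and consumers keep seeing the previous version of the table.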
## Chapter 9: Data Transformations

### SQL-Based Transforms
- Use CTEs (common table expressions) for readability
- Prefer window functions over self-joins for ranking and running totals
- Use CASE expressions for conditional logic
- Aggregate at the right grain; avoid fan-out joins that multiply rows

### dbt (Data Build Tool)
- **Staging models** — 1:1 with source tables; rename columns, cast types, filter deleted records
- **Intermediate models** — Business-logic joins, complex calculations, deduplication
- **Mart models** — Final analytics-ready tables, optimized for dashboard queries
- **Incremental models** — Process only new/changed data; use `unique_key` for the merge strategy
- **Tests** — `not_null`, `unique`, `accepted_values`, `relationships` on key columns; custom data tests
- **Sources** — Define sources in YAML; use the `source()` macro for lineage tracking; configure freshness checks

### Python-Based Transforms
- Use pandas for small-to-medium datasets; PySpark or Dask for large-scale processing
- Write pure functions for transformations; make them testable
- Handle data types explicitly; don't rely on inference in production pipelines
- Use vectorized operations; avoid row-by-row iteration

### Transform Best Practices
- **Layered architecture** — Raw → staging → intermediate → mart; each layer has a clear purpose
- **Single responsibility** — Each model/transform does one thing well
- **Documented logic** — Comment complex business rules; maintain a data dictionary
- **Version control** — Keep all transformation code in git; review changes via PR

---
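The CTE-plus-window-function advice can be shown with SQLite from Python, assuming an SQLite build with window-function support (3.25+, bundled with modern CPython); the `orders` data is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL);
INSERT INTO orders VALUES
  ('a', '2024-01-01', 10), ('a', '2024-01-05', 30),
  ('b', '2024-01-02', 20), ('b', '2024-01-03', 5);
""")

# A CTE for readability and ROW_NUMBER() instead of a self-join to find
# each customer's most recent order.
latest = conn.execute("""
    WITH ranked AS (
        SELECT customer, order_date, amount,
               ROW_NUMBER() OVER (
                   PARTITION BY customer ORDER BY order_date DESC
               ) AS rn
        FROM orders
    )
    SELECT customer, amount FROM ranked WHERE rn = 1 ORDER BY customer
""").fetchall()
# one row per customer: their latest order amount
```

The equivalent self-join (`orders` joined to a per-customer `MAX(order_date)` subquery) is both harder to read and prone to fan-out when two orders share a date.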
## Chapter 10: Data Validation and Testing

### Validation Types
- **Schema validation** — Verify column names, types, and count match expectations
- **Row count checks** — Compare source and destination row counts; alert on significant discrepancies
- **Null checks** — Assert key columns are not null; track null percentages for optional columns
- **Uniqueness checks** — Verify primary-key uniqueness in destination tables
- **Referential integrity** — Check foreign-key relationships between tables
- **Range checks** — Validate values fall within expected ranges (dates, amounts, percentages)
- **Freshness checks** — Verify data is not stale; alert when the max timestamp is older than a threshold

### Great Expectations
- Define expectations as code; version-control them
- Run validations as pipeline steps; fail the pipeline on critical expectation failures
- Generate data documentation from expectations
- Use checkpoints for scheduled validation runs

### Testing Practices
- **Unit tests** — Test individual transformation functions with known inputs/outputs
- **Integration tests** — Test the end-to-end pipeline with sample data
- **Regression tests** — Compare current results against known-good baselines
- **Data contracts** — Define and enforce schemas between producer and consumer teams

---
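A few of the validation types above can be sketched as small check functions over SQLite (the `customers` table and tolerance values are illustrative; tools like Great Expectations or dbt tests package the same checks declaratively):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER, email TEXT);
INSERT INTO customers VALUES (1, 'a@x.com'), (2, 'b@x.com'), (3, NULL);
""")

def check_not_null(conn, table, column):
    n = conn.execute(f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL").fetchone()[0]
    return n == 0

def check_unique(conn, table, column):
    dup = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT {column} FROM {table} "
        f"GROUP BY {column} HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    return dup == 0

def check_row_count(conn, table, expected, tolerance=0.1):
    """Pass when the count is within +/- tolerance of the expected value."""
    n = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return abs(n - expected) <= expected * tolerance

results = {
    "id_unique": check_unique(conn, "customers", "id"),           # passes
    "email_not_null": check_not_null(conn, "customers", "email"), # fails: row 3
    "row_count": check_row_count(conn, "customers", expected=3),  # passes
}
```

A pipeline step would run these after loading and fail (or alert) on any `False` result.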
## Chapter 11: Orchestration

### Apache Airflow
- **DAGs** — Define pipelines as directed acyclic graphs; each node is a task
- **Operators** — Use appropriate operators: PythonOperator, BashOperator, provider operators (BigQueryOperator, S3ToRedshiftOperator)
- **Scheduling** — Use cron expressions or timedeltas; set `start_date` and `catchup` appropriately
- **Dependencies** — Use the `>>` operator or `set_upstream`/`set_downstream`; keep DAGs shallow and wide
- **XComs** — Pass small metadata between tasks (row counts, file paths), NOT large datasets
- **Sensors** — Wait for external conditions (file arrival, partition availability); always use a timeout
- **Variables and Connections** — Store config in Airflow Variables and credentials in Connections
- **Pools** — Limit concurrency for resource-constrained tasks (database connections, API rate limits)

### DAG Design Patterns
- **One pipeline per DAG** — Keep DAGs focused; avoid mega-DAGs that do everything
- **Idempotent tasks** — Every task can be re-run safely; use `execution_date` for parameterization
- **Task granularity** — Tasks should be atomic and independently retryable; not too fine (overhead) or too coarse (blast radius)
- **Error handling** — Use `on_failure_callback` for alerting; `retries` and `retry_delay` for transient failures
- **Backfilling** — Use `airflow backfill` for historical reprocessing; ensure tasks support date parameterization

---
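The core DAG idea (tasks run in dependency order, independent branches may run in parallel) can be sketched with the stdlib `graphlib` module, Python 3.9+; the four-task pipeline below is hypothetical and mirrors what Airflow encodes with `extract >> [clean, enrich] >> load`:

```python
from graphlib import TopologicalSorter

# Task -> set of upstream tasks it depends on.
dag = {
    "extract": set(),
    "clean": {"extract"},
    "enrich": {"extract"},
    "load": {"clean", "enrich"},
}

# Trivial task bodies standing in for real operators.
tasks = {
    "extract": lambda: "raw",
    "clean": lambda: "cleaned",
    "enrich": lambda: "enriched",
    "load": lambda: "loaded",
}

def run(dag, tasks):
    """Execute tasks in dependency order; a scheduler could run siblings concurrently."""
    order, results = [], {}
    for name in TopologicalSorter(dag).static_order():
        results[name] = tasks[name]()
        order.append(name)
    return order, results

order, results = run(dag, tasks)
# 'extract' always runs first and 'load' last; clean/enrich order is arbitrary
```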
## Chapter 12: Monitoring and Alerting

### Pipeline Health Metrics
- **Duration** — Track execution time; alert on runs significantly longer than the historical average
- **Row counts** — Track records processed per run; alert on zero rows or dramatic changes
- **Error rates** — Track failed records, retries, and exceptions; alert on elevated error rates
- **Data freshness** — Track the max timestamp in the destination; alert when data is staler than the SLA
- **Resource usage** — Track CPU, memory, disk, and network; alert on resource exhaustion

### Alerting Strategies
- **SLA-based** — Define delivery SLAs; alert when pipelines miss their windows
- **Anomaly-based** — Detect deviations from historical patterns (row counts, durations, values)
- **Threshold-based** — Alert on fixed thresholds (error rate > 5%, null rate > 10%)
- **Escalation** — Define severity levels; route alerts appropriately (Slack, PagerDuty, email)

### Logging
- Log at each pipeline stage: extraction start/end, row counts, load confirmation
- Include correlation IDs to trace records through the pipeline
- Store logs centrally for searchability (ELK, CloudWatch, Stackdriver)
- Retain logs for debugging and audit compliance

---
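A freshness check with severity escalation can be sketched in a few lines (the 6-hour SLA and the severity rule are illustrative choices, not from the book):

```python
from datetime import datetime, timedelta, timezone

def freshness_alert(max_loaded_at, sla=timedelta(hours=6), now=None):
    """Return an alert dict when the newest record is older than the SLA, else None."""
    now = now or datetime.now(timezone.utc)
    lag = now - max_loaded_at
    if lag > sla:
        return {
            # Escalate when data is more than twice the SLA stale.
            "severity": "high" if lag > 2 * sla else "medium",
            "message": f"data is {lag} stale (SLA {sla})",
        }
    return None

now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
ok = freshness_alert(datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc), now=now)    # fresh
alert = freshness_alert(datetime(2024, 1, 14, 12, 0, tzinfo=timezone.utc), now=now)  # 24h stale
```

In practice `max_loaded_at` comes from a `SELECT MAX(loaded_at)` on the destination table, run on a schedule alongside the other health metrics.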
## Chapter 13: Best Practices

### Idempotency
- Use delete + insert by date partition for fact tables
- Use MERGE/upsert with natural keys for dimension tables
- Use staging tables as an intermediary; clean up on both success and failure
- Test by running the pipeline twice; verify no data duplication

### Backfilling
- Parameterize all pipelines by date range (start_date, end_date)
- Use Airflow's `execution_date` or an equivalent for date-aware runs
- Test a backfill on a small date range before running a full historical reprocess
- Monitor resource usage during backfills; you may need to throttle parallelism

### Error Handling
- Retry transient failures (network timeouts, rate limits) with exponential backoff
- Fail fast on permanent errors (authentication failure, missing source table)
- Quarantine bad records; don't let one bad row fail the entire pipeline
- Send alerts with actionable context (error message, affected table, run ID)

### Data Lineage and Documentation
- Track source-to-destination mappings for every table
- Document transformation logic, especially business rules
- Maintain a data dictionary with column descriptions and types
- Use tools like dbt docs, DataHub, or Amundsen for automated lineage

### Security
- Never hardcode credentials; use a secrets manager (AWS Secrets Manager, HashiCorp Vault)
- Encrypt data in transit (TLS) and at rest (warehouse encryption)
- Use least-privilege IAM roles for pipeline service accounts
- Audit access to sensitive data; mask PII in non-production environments
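The MERGE/upsert-on-a-natural-key rule, including the "run it twice" test, can be sketched with SQLite's `ON CONFLICT` clause (available in SQLite 3.24+; the `dim_product` schema is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_product (sku TEXT PRIMARY KEY, price REAL)")

def upsert(conn, rows):
    """Merge on the natural key so re-running a load never duplicates rows."""
    with conn:
        conn.executemany(
            "INSERT INTO dim_product (sku, price) VALUES (?, ?) "
            "ON CONFLICT(sku) DO UPDATE SET price = excluded.price",
            rows,
        )

upsert(conn, [("A1", 9.99), ("B2", 4.50)])
upsert(conn, [("A1", 11.00), ("B2", 4.50)])  # second run: updates in place

rows = conn.execute("SELECT sku, price FROM dim_product ORDER BY sku").fetchall()
# still two rows; A1's price is updated rather than duplicated
```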
package/data-pipelines/references/review-checklist.md
@@ -0,0 +1,181 @@
# Data Pipelines Pocket Reference — Pipeline Review Checklist

A systematic checklist for reviewing data pipelines against the 13 chapters of *Data Pipelines Pocket Reference* by James Densmore.

---
## 1. Architecture & Patterns (Chapters 1–3)

### Infrastructure
- [ ] **Ch 1–2 — Appropriate infrastructure** — Is the right storage chosen for the use case (warehouse for analytics, lake for raw/unstructured data)?
- [ ] **Ch 2 — Cloud-native services** — Are managed services used where appropriate to reduce operational burden?
- [ ] **Ch 2 — Separation of storage and compute** — Is compute scaled independently from storage?

### Pipeline Patterns
- [ ] **Ch 3 — ETL vs ELT** — Is the right pattern chosen? Is ELT used for analytics workloads on modern warehouses?
- [ ] **Ch 3 — Full vs incremental** — Is incremental extraction used for growing datasets? Is full extraction justified for small tables?
- [ ] **Ch 3 — CDC where appropriate** — Is CDC used for real-time sync needs instead of polling?
- [ ] **Ch 3 — Loading strategy** — Is the right load pattern used (append for events, upsert for dimensions, full refresh for small lookups)?

---
## 2. Data Ingestion (Chapters 4–7)

### Database Ingestion (Ch 4)
- [ ] **Ch 4 — Read replica usage** — Are extractions running against read replicas, not production databases?
- [ ] **Ch 4 — Incremental column** — Is there a reliable timestamp or ID column for incremental extraction?
- [ ] **Ch 4 — Connection management** — Are connections pooled and properly closed? Are timeouts configured?
- [ ] **Ch 4 — Query efficiency** — Are only needed columns selected? Do WHERE clauses use indexed columns?
- [ ] **Ch 4 — Large table handling** — Are large extractions chunked or using streaming cursors?

### File Ingestion (Ch 5)
- [ ] **Ch 5 — Schema validation** — Are file schemas validated before processing? Are malformed rows handled?
- [ ] **Ch 5 — Encoding handling** — Is character encoding handled consistently (UTF-8)?
- [ ] **Ch 5 — Cloud storage patterns** — Are files organized with partitioned prefixes? Are processed files archived?
- [ ] **Ch 5 — File tracking** — Is there a mechanism to track which files have been processed to avoid reprocessing?
- [ ] **Ch 5 — Compression** — Are files compressed for storage and transfer efficiency?

### API Ingestion (Ch 6)
- [ ] **Ch 6 — Pagination** — Is pagination implemented correctly? Are all pages fetched? Is cursor-based pagination preferred?
- [ ] **Ch 6 — Rate limiting** — Are rate-limit headers respected? Is backoff implemented for 429 responses?
- [ ] **Ch 6 — Retry logic** — Are transient errors (5xx, timeouts) retried with exponential backoff, and 4xx errors not retried?
- [ ] **Ch 6 — Authentication** — Are credentials stored securely? Are token refreshes handled?
- [ ] **Ch 6 — Incremental fetching** — Are date/cursor parameters used to fetch only new data?

### Streaming Ingestion (Ch 7)
- [ ] **Ch 7 — Consumer group design** — Are consumer groups configured for parallel processing? Are rebalances handled?
- [ ] **Ch 7 — Offset management** — Are offsets committed after successful processing, not before?
- [ ] **Ch 7 — Serialization** — Are events serialized with a schema (Avro, Protobuf) for evolution support?
- [ ] **Ch 7 — Dead letter queue** — Are failed events routed to a DLQ for inspection and reprocessing?
- [ ] **Ch 7 — Exactly-once semantics** — Is deduplication implemented downstream, or are idempotent producers used?
- [ ] **Ch 7 — Backpressure** — Is backpressure handled when the consumer can't keep up with the producer?

---
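One way to satisfy the exactly-once and webhook-idempotency checks above is downstream deduplication by event ID; a minimal sketch with illustrative names (a production consumer would persist the seen-ID set, e.g. in the destination table's keys or a key-value store, rather than keep it in memory):

```python
def process_stream(events, handler):
    """Apply handler to each event once, skipping redelivered duplicates by ID."""
    seen, results = set(), []
    for event in events:
        if event["id"] in seen:
            continue  # duplicate delivery from an at-least-once transport: skip
        seen.add(event["id"])
        results.append(handler(event))
    return results

events = [
    {"id": "e1", "value": 10},
    {"id": "e2", "value": 20},
    {"id": "e1", "value": 10},  # redelivered, e.g. after a consumer rebalance
]
totals = process_stream(events, lambda e: e["value"])
# the duplicate e1 is processed only once
```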
## 3. Data Storage & Loading (Chapter 8)

### Loading Patterns
- [ ] **Ch 8 — Bulk loading** — Are bulk load commands used (COPY, load jobs) instead of row-by-row INSERT?
- [ ] **Ch 8 — Staging tables** — Is data loaded to staging first, validated, then merged to production?
- [ ] **Ch 8 — Atomic operations** — Are loads atomic? Is the destination never left in a partial state?
- [ ] **Ch 8 — Data type mapping** — Are source types mapped correctly to destination types, with no implicit conversions?

### Table Design
- [ ] **Ch 8 — Partitioning** — Are large tables partitioned by date or key? Do queries leverage partition pruning?
- [ ] **Ch 8 — Clustering** — Are frequently filtered columns used as cluster keys within partitions?
- [ ] **Ch 8 — Sort/distribution keys** — For Redshift: are SORTKEY and DISTKEY chosen based on query patterns?

### Warehouse-Specific
- [ ] **Ch 8 — Redshift COPY** — Is S3-based COPY used instead of INSERT for bulk loads?
- [ ] **Ch 8 — BigQuery load jobs** — Are load jobs preferred over streaming inserts for batch pipelines?
- [ ] **Ch 8 — Snowflake stages** — Are stages used for file-based loading? Is Snowpipe configured for continuous loads?

---
## 4. Transformations (Chapter 9)

### SQL Transforms
- [ ] **Ch 9 — CTEs for readability** — Are common table expressions used instead of deeply nested subqueries?
- [ ] **Ch 9 — Window functions** — Are window functions used for ranking and running totals instead of self-joins?
- [ ] **Ch 9 — Grain awareness** — Do joins maintain the correct grain, with no accidental fan-out producing duplicate rows?
- [ ] **Ch 9 — Deterministic logic** — Are transforms deterministic (same input → same output), with no reliance on row ordering?

### dbt Patterns
- [ ] **Ch 9 — Layer structure** — Are models organized into staging → intermediate → mart layers?
- [ ] **Ch 9 — Staging models** — Are staging models 1:1 with sources? Do they rename, cast, and filter only?
- [ ] **Ch 9 — Incremental models** — Are large models configured as incremental with a proper `unique_key` and merge strategy?
- [ ] **Ch 9 — Source definitions** — Are sources defined in YAML and referenced via the `source()` macro? Are freshness checks configured?
- [ ] **Ch 9 — Model materialization** — Are materializations appropriate (view for light transforms, table for heavy, incremental for large)?

### Python Transforms
- [ ] **Ch 9 — Appropriate tool** — Is Python used only when SQL is insufficient (ML, complex parsing, API calls)?
- [ ] **Ch 9 — Vectorized operations** — Are pandas/numpy vectorized operations used instead of row-by-row iteration?
- [ ] **Ch 9 — Memory management** — Are large datasets processed in chunks or with distributed frameworks (PySpark)?
- [ ] **Ch 9 — Pure functions** — Are transformation functions pure (no side effects) and independently testable?

---
## 5. Data Validation & Testing (Chapter 10)

### Validation Coverage
- [ ] **Ch 10 — Schema validation** — Are column names, types, and count validated at ingestion?
- [ ] **Ch 10 — Row count reconciliation** — Are source and destination row counts compared? Is threshold alerting configured?
- [ ] **Ch 10 — Null checks** — Are NOT NULL constraints enforced on key columns? Are null percentages tracked?
- [ ] **Ch 10 — Uniqueness checks** — Is primary-key uniqueness verified after loading?
- [ ] **Ch 10 — Referential integrity** — Are foreign-key relationships validated between related tables?
- [ ] **Ch 10 — Range validation** — Are values checked against expected ranges (dates in the past, amounts positive, percentages 0–100)?
- [ ] **Ch 10 — Freshness checks** — Is data freshness monitored? Are alerts configured for stale data?

### Testing
- [ ] **Ch 10 — Unit tests** — Are individual transformation functions tested with known inputs/outputs?
- [ ] **Ch 10 — Integration tests** — Is the end-to-end pipeline tested with sample data?
- [ ] **Ch 10 — dbt tests** — Are dbt schema tests (not_null, unique, relationships) defined? Are custom data tests used?
- [ ] **Ch 10 — Regression tests** — Are current results compared against known-good baselines for critical tables?

---
118
+ ## 6. Orchestration (Chapter 11)
119
+
120
+ ### DAG Design
121
+ - [ ] **Ch 11 — One pipeline per DAG** — Are DAGs focused on a single pipeline? No mega-DAGs?
122
+ - [ ] **Ch 11 — Task granularity** — Are tasks atomic and independently retryable? Not too fine or too coarse?
123
+ - [ ] **Ch 11 — Shallow and wide** — Are DAGs shallow (few sequential steps) and wide (parallel where possible)?
124
+ - [ ] **Ch 11 — No hardcoded dates** — Are dates parameterized using execution_date or equivalent? No hardcoded date strings?
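The no-hardcoded-dates item is what makes backfills possible: the task takes the logical run date as a parameter (Airflow's `execution_date`/logical date plays this role) instead of reading the wall clock. A minimal sketch, with an illustrative function and table name:

```python
# Sketch: the partition processed depends only on the run-date parameter,
# never on "now", so the same task works for scheduled runs and backfills.
# extract_day and the orders table are illustrative assumptions.
from datetime import date

def extract_day(run_date: date) -> str:
    return f"SELECT * FROM orders WHERE order_date = '{run_date.isoformat()}'"
```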
+
+ ### Airflow Specifics
+ - [ ] **Ch 11 — Idempotent tasks** — Can every task be safely re-run without data duplication or side effects?
+ - [ ] **Ch 11 — Appropriate operators** — Are provider operators used where available instead of generic PythonOperator?
+ - [ ] **Ch 11 — XCom usage** — Are XComs used only for small metadata (file paths, row counts), not large data?
+ - [ ] **Ch 11 — Sensor timeouts** — Do sensors have timeouts to avoid indefinite waiting?
+ - [ ] **Ch 11 — Error callbacks** — Is an `on_failure_callback` configured for alerting on task failures?
+ - [ ] **Ch 11 — Retries** — Are retries configured with appropriate delay for transient failures?
+ - [ ] **Ch 11 — Pool limits** — Are pools used to limit concurrency for resource-constrained tasks?
+ - [ ] **Ch 11 — Catchup configuration** — Is `catchup` set appropriately (True for backfill-supporting DAGs, False otherwise)?
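Several of the items above (retries, retry delay, failure callbacks) are typically set once in Airflow's `default_args`. The sketch below is a plain dict so it runs without Airflow installed; `notify_on_failure` and its alert destination are hypothetical:

```python
# Airflow-style default_args covering retries, backoff delay, and a failure
# callback. Plain Python here; notify_on_failure is a hypothetical hook.
from datetime import timedelta

def notify_on_failure(context):
    # Would post run ID, task name, and a log link to the on-call channel.
    print(f"task failed: {context.get('task_instance')}")

default_args = {
    "retries": 3,                          # retry transient failures
    "retry_delay": timedelta(minutes=5),   # back off between attempts
    "on_failure_callback": notify_on_failure,
}
```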
+
+ ---
+
+ ## 7. Monitoring & Operations (Chapters 12–13)
+
+ ### Monitoring
+ - [ ] **Ch 12 — Duration tracking** — Is pipeline execution time tracked and alerted on anomalies?
+ - [ ] **Ch 12 — Row count monitoring** — Are rows processed per run tracked? Alerts on zero or unusual counts?
+ - [ ] **Ch 12 — Error rate tracking** — Are failed records, retries, and exceptions monitored?
+ - [ ] **Ch 12 — Data freshness SLA** — Are freshness SLAs defined and monitored? Alerts on breaches?
+ - [ ] **Ch 12 — Resource monitoring** — Are CPU, memory, disk, and network usage tracked for pipeline infrastructure?
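A row-count anomaly check can be as simple as comparing the latest run against the trailing average. The function name and the 50% tolerance below are illustrative, not a standard:

```python
# Sketch of a row-count anomaly check: flag zero rows and large deviations
# from the trailing mean. Tolerance value is an illustrative default.
def row_count_anomaly(history, latest, tolerance=0.5):
    """True if latest is zero or deviates from the trailing mean by > tolerance."""
    if latest == 0:
        return True  # zero rows is always suspicious
    if not history:
        return False  # nothing to compare against yet
    mean = sum(history) / len(history)
    return abs(latest - mean) > tolerance * mean
```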
+
+ ### Alerting
+ - [ ] **Ch 12 — Actionable alerts** — Do alerts include context (error message, affected table, run ID, link to logs)?
+ - [ ] **Ch 12 — Severity levels** — Are alerts classified by severity with appropriate routing?
+ - [ ] **Ch 12 — Alert fatigue prevention** — Are thresholds tuned to avoid noisy alerts? Are alerts deduplicated?
+
+ ### Operational Excellence
+ - [ ] **Ch 13 — Idempotency** — Are all pipelines idempotent? Can they be re-run without data corruption?
+ - [ ] **Ch 13 — Backfill support** — Are pipelines parameterized for date-range backfilling?
+ - [ ] **Ch 13 — Error handling** — Are transient errors retried? Do permanent errors fail fast? Are bad records quarantined?
+ - [ ] **Ch 13 — Credential security** — Are credentials in secrets managers, not in code or config files?
+ - [ ] **Ch 13 — Data lineage** — Is source-to-destination mapping documented? Is transformation logic recorded?
+ - [ ] **Ch 13 — Documentation** — Is there a README, data dictionary, and runbook for each pipeline?
+ - [ ] **Ch 13 — Version control** — Is all pipeline code in git? Are changes reviewed via PR?
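The idempotency item usually comes down to a delete-then-insert (or merge) of the partition being loaded, inside one transaction. A minimal sketch using the standard-library `sqlite3`; the `orders` table and columns are assumptions for the example:

```python
# Idempotent partition load: delete the run date's rows, then insert, in one
# transaction. Re-running the same load never duplicates data.
# Table and column names are illustrative.
import sqlite3

def load_partition(conn, run_date, order_ids):
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM orders WHERE order_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO orders (order_id, order_date) VALUES (?, ?)",
            [(oid, run_date) for oid in order_ids],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT)")
load_partition(conn, "2024-01-15", [1, 2, 3])
load_partition(conn, "2024-01-15", [1, 2, 3])  # safe re-run
```

On a warehouse the same shape appears as `DELETE`/`INSERT` on a staging table or a `MERGE` statement; the invariant is the same: one run, one partition, any number of retries.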
+
+ ---
+
+ ## Quick Review Workflow
+
+ 1. **Architecture pass** — Verify ETL/ELT choice, incremental vs full, infrastructure fit
+ 2. **Ingestion pass** — Check source-specific best practices, error handling, incremental logic
+ 3. **Loading pass** — Verify bulk loading, staging tables, partitioning, atomic operations
+ 4. **Transform pass** — Check SQL quality, dbt patterns, layer structure, determinism
+ 5. **Validation pass** — Verify data quality checks at each boundary, test coverage
+ 6. **Orchestration pass** — Check DAG design, idempotency, task granularity, error handling
+ 7. **Operations pass** — Verify monitoring, alerting, backfill support, documentation
+ 8. **Prioritize findings** — Rank by severity: data loss risk > data quality > performance > best practices > style
+
+
+ ## Severity Levels
+
+ | Severity | Description | Example |
+ |----------|-------------|---------|
+ | **Critical** | Data loss, corruption, or security risk | Non-idempotent pipeline causing duplicates, hardcoded credentials, no staging tables with partial load risk, missing dead letter queue losing events |
+ | **High** | Data quality or reliability issues | Missing validation, no error handling, full extraction on large tables, no monitoring or alerting, blocking on rate limits |
+ | **Medium** | Performance, maintainability, or operational gaps | Missing partitioning, monolithic DAGs, no backfill support, missing documentation, no incremental models for large tables |
+ | **Low** | Best practice improvements, optimization opportunities | Missing compression, suboptimal clustering, verbose logging, minor naming inconsistencies |
@@ -0,0 +1 @@
+