@booklib/core 2.0.0
This diff shows the content of publicly available package versions released to a supported registry. It is provided for informational purposes only and reflects the changes between package versions as published in their public registries.
- package/.cursor/rules/booklib-standards.mdc +40 -0
- package/.gemini/context.md +372 -0
- package/AGENTS.md +166 -0
- package/CHANGELOG.md +226 -0
- package/CLAUDE.md +81 -0
- package/CODE_OF_CONDUCT.md +31 -0
- package/CONTRIBUTING.md +304 -0
- package/LICENSE +21 -0
- package/PLAN.md +28 -0
- package/README.ja.md +198 -0
- package/README.ko.md +198 -0
- package/README.md +503 -0
- package/README.pt-BR.md +198 -0
- package/README.uk.md +241 -0
- package/README.zh-CN.md +198 -0
- package/SECURITY.md +9 -0
- package/agents/architecture-reviewer.md +136 -0
- package/agents/booklib-reviewer.md +90 -0
- package/agents/data-reviewer.md +107 -0
- package/agents/jvm-reviewer.md +146 -0
- package/agents/python-reviewer.md +128 -0
- package/agents/rust-reviewer.md +115 -0
- package/agents/ts-reviewer.md +110 -0
- package/agents/ui-reviewer.md +117 -0
- package/assets/logo.svg +36 -0
- package/bin/booklib-mcp.js +304 -0
- package/bin/booklib.js +1705 -0
- package/bin/skills.cjs +1292 -0
- package/booklib-router.mdc +36 -0
- package/booklib.config.json +19 -0
- package/commands/animation-at-work.md +10 -0
- package/commands/clean-code-reviewer.md +10 -0
- package/commands/data-intensive-patterns.md +10 -0
- package/commands/data-pipelines.md +10 -0
- package/commands/design-patterns.md +10 -0
- package/commands/domain-driven-design.md +10 -0
- package/commands/effective-java.md +10 -0
- package/commands/effective-kotlin.md +10 -0
- package/commands/effective-python.md +10 -0
- package/commands/effective-typescript.md +10 -0
- package/commands/kotlin-in-action.md +10 -0
- package/commands/lean-startup.md +10 -0
- package/commands/microservices-patterns.md +10 -0
- package/commands/programming-with-rust.md +10 -0
- package/commands/refactoring-ui.md +10 -0
- package/commands/rust-in-action.md +10 -0
- package/commands/skill-router.md +10 -0
- package/commands/spring-boot-in-action.md +10 -0
- package/commands/storytelling-with-data.md +10 -0
- package/commands/system-design-interview.md +10 -0
- package/commands/using-asyncio-python.md +10 -0
- package/commands/web-scraping-python.md +10 -0
- package/community/registry.json +1616 -0
- package/hooks/hooks.json +23 -0
- package/hooks/posttooluse-capture.mjs +67 -0
- package/hooks/suggest.js +153 -0
- package/lib/agent-behaviors.js +40 -0
- package/lib/agent-detector.js +96 -0
- package/lib/config-loader.js +39 -0
- package/lib/conflict-resolver.js +148 -0
- package/lib/context-builder.js +574 -0
- package/lib/discovery-engine.js +298 -0
- package/lib/doctor/hook-installer.js +83 -0
- package/lib/doctor/usage-tracker.js +87 -0
- package/lib/engine/ai-features.js +253 -0
- package/lib/engine/auditor.js +103 -0
- package/lib/engine/bm25-index.js +178 -0
- package/lib/engine/capture.js +120 -0
- package/lib/engine/corrections.js +198 -0
- package/lib/engine/doctor.js +195 -0
- package/lib/engine/graph-injector.js +137 -0
- package/lib/engine/graph.js +161 -0
- package/lib/engine/handoff.js +405 -0
- package/lib/engine/indexer.js +242 -0
- package/lib/engine/parser.js +53 -0
- package/lib/engine/query-expander.js +42 -0
- package/lib/engine/reranker.js +40 -0
- package/lib/engine/rrf.js +59 -0
- package/lib/engine/scanner.js +151 -0
- package/lib/engine/searcher.js +139 -0
- package/lib/engine/session-coordinator.js +306 -0
- package/lib/engine/session-manager.js +429 -0
- package/lib/engine/synthesizer.js +70 -0
- package/lib/installer.js +70 -0
- package/lib/instinct-block.js +33 -0
- package/lib/mcp-config-writer.js +88 -0
- package/lib/paths.js +57 -0
- package/lib/profiles/design.md +19 -0
- package/lib/profiles/general.md +16 -0
- package/lib/profiles/research-analysis.md +22 -0
- package/lib/profiles/software-development.md +23 -0
- package/lib/profiles/writing-content.md +19 -0
- package/lib/project-initializer.js +916 -0
- package/lib/registry/skills.js +102 -0
- package/lib/registry-searcher.js +99 -0
- package/lib/rules/rules-manager.js +169 -0
- package/lib/skill-fetcher.js +333 -0
- package/lib/well-known-builder.js +70 -0
- package/lib/wizard/index.js +404 -0
- package/lib/wizard/integration-detector.js +41 -0
- package/lib/wizard/project-detector.js +100 -0
- package/lib/wizard/prompt.js +156 -0
- package/lib/wizard/registry-embeddings.js +107 -0
- package/lib/wizard/skill-recommender.js +69 -0
- package/llms-full.txt +254 -0
- package/llms.txt +70 -0
- package/package.json +45 -0
- package/research-reports/2026-04-01-current-architecture.md +160 -0
- package/research-reports/IDEAS.md +93 -0
- package/rules/common/clean-code.md +42 -0
- package/rules/java/effective-java.md +42 -0
- package/rules/kotlin/effective-kotlin.md +37 -0
- package/rules/python/effective-python.md +38 -0
- package/rules/rust/rust.md +37 -0
- package/rules/typescript/effective-typescript.md +42 -0
- package/scripts/gen-llms-full.mjs +36 -0
- package/scripts/gen-og.mjs +142 -0
- package/scripts/validate-frontmatter.js +25 -0
- package/skills/animation-at-work/SKILL.md +270 -0
- package/skills/animation-at-work/assets/example_asset.txt +1 -0
- package/skills/animation-at-work/evals/evals.json +44 -0
- package/skills/animation-at-work/evals/results.json +13 -0
- package/skills/animation-at-work/examples/after.md +64 -0
- package/skills/animation-at-work/examples/before.md +35 -0
- package/skills/animation-at-work/references/api_reference.md +369 -0
- package/skills/animation-at-work/references/review-checklist.md +79 -0
- package/skills/animation-at-work/scripts/audit_animations.py +295 -0
- package/skills/animation-at-work/scripts/example.py +1 -0
- package/skills/clean-code-reviewer/SKILL.md +444 -0
- package/skills/clean-code-reviewer/audit.json +35 -0
- package/skills/clean-code-reviewer/evals/evals.json +185 -0
- package/skills/clean-code-reviewer/evals/results.json +13 -0
- package/skills/clean-code-reviewer/examples/after.md +48 -0
- package/skills/clean-code-reviewer/examples/before.md +33 -0
- package/skills/clean-code-reviewer/references/api_reference.md +158 -0
- package/skills/clean-code-reviewer/references/practices-catalog.md +282 -0
- package/skills/clean-code-reviewer/references/review-checklist.md +254 -0
- package/skills/clean-code-reviewer/scripts/pre-review.py +206 -0
- package/skills/data-intensive-patterns/SKILL.md +267 -0
- package/skills/data-intensive-patterns/assets/example_asset.txt +1 -0
- package/skills/data-intensive-patterns/evals/evals.json +54 -0
- package/skills/data-intensive-patterns/evals/results.json +13 -0
- package/skills/data-intensive-patterns/examples/after.md +61 -0
- package/skills/data-intensive-patterns/examples/before.md +38 -0
- package/skills/data-intensive-patterns/references/api_reference.md +34 -0
- package/skills/data-intensive-patterns/references/patterns-catalog.md +551 -0
- package/skills/data-intensive-patterns/references/review-checklist.md +193 -0
- package/skills/data-intensive-patterns/scripts/adr.py +213 -0
- package/skills/data-intensive-patterns/scripts/example.py +1 -0
- package/skills/data-pipelines/SKILL.md +259 -0
- package/skills/data-pipelines/assets/example_asset.txt +1 -0
- package/skills/data-pipelines/evals/evals.json +45 -0
- package/skills/data-pipelines/evals/results.json +13 -0
- package/skills/data-pipelines/examples/after.md +97 -0
- package/skills/data-pipelines/examples/before.md +37 -0
- package/skills/data-pipelines/references/api_reference.md +301 -0
- package/skills/data-pipelines/references/review-checklist.md +181 -0
- package/skills/data-pipelines/scripts/example.py +1 -0
- package/skills/data-pipelines/scripts/new_pipeline.py +444 -0
- package/skills/design-patterns/SKILL.md +271 -0
- package/skills/design-patterns/assets/example_asset.txt +1 -0
- package/skills/design-patterns/evals/evals.json +46 -0
- package/skills/design-patterns/evals/results.json +13 -0
- package/skills/design-patterns/examples/after.md +52 -0
- package/skills/design-patterns/examples/before.md +29 -0
- package/skills/design-patterns/references/api_reference.md +1 -0
- package/skills/design-patterns/references/patterns-catalog.md +726 -0
- package/skills/design-patterns/references/review-checklist.md +173 -0
- package/skills/design-patterns/scripts/example.py +1 -0
- package/skills/design-patterns/scripts/scaffold.py +807 -0
- package/skills/domain-driven-design/SKILL.md +142 -0
- package/skills/domain-driven-design/assets/example_asset.txt +1 -0
- package/skills/domain-driven-design/evals/evals.json +48 -0
- package/skills/domain-driven-design/evals/results.json +13 -0
- package/skills/domain-driven-design/examples/after.md +80 -0
- package/skills/domain-driven-design/examples/before.md +43 -0
- package/skills/domain-driven-design/references/api_reference.md +1 -0
- package/skills/domain-driven-design/references/patterns-catalog.md +545 -0
- package/skills/domain-driven-design/references/review-checklist.md +158 -0
- package/skills/domain-driven-design/scripts/example.py +1 -0
- package/skills/domain-driven-design/scripts/scaffold.py +421 -0
- package/skills/effective-java/SKILL.md +227 -0
- package/skills/effective-java/assets/example_asset.txt +1 -0
- package/skills/effective-java/evals/evals.json +46 -0
- package/skills/effective-java/evals/results.json +13 -0
- package/skills/effective-java/examples/after.md +83 -0
- package/skills/effective-java/examples/before.md +37 -0
- package/skills/effective-java/references/api_reference.md +1 -0
- package/skills/effective-java/references/items-catalog.md +955 -0
- package/skills/effective-java/references/review-checklist.md +216 -0
- package/skills/effective-java/scripts/checkstyle_setup.py +211 -0
- package/skills/effective-java/scripts/example.py +1 -0
- package/skills/effective-kotlin/SKILL.md +271 -0
- package/skills/effective-kotlin/assets/example_asset.txt +1 -0
- package/skills/effective-kotlin/audit.json +29 -0
- package/skills/effective-kotlin/evals/evals.json +45 -0
- package/skills/effective-kotlin/evals/results.json +13 -0
- package/skills/effective-kotlin/examples/after.md +36 -0
- package/skills/effective-kotlin/examples/before.md +38 -0
- package/skills/effective-kotlin/references/api_reference.md +1 -0
- package/skills/effective-kotlin/references/practices-catalog.md +1228 -0
- package/skills/effective-kotlin/references/review-checklist.md +126 -0
- package/skills/effective-kotlin/scripts/example.py +1 -0
- package/skills/effective-python/SKILL.md +441 -0
- package/skills/effective-python/evals/evals.json +44 -0
- package/skills/effective-python/evals/results.json +13 -0
- package/skills/effective-python/examples/after.md +56 -0
- package/skills/effective-python/examples/before.md +40 -0
- package/skills/effective-python/ref-01-pythonic-thinking.md +202 -0
- package/skills/effective-python/ref-02-lists-and-dicts.md +146 -0
- package/skills/effective-python/ref-03-functions.md +186 -0
- package/skills/effective-python/ref-04-comprehensions-generators.md +211 -0
- package/skills/effective-python/ref-05-classes-interfaces.md +188 -0
- package/skills/effective-python/ref-06-metaclasses-attributes.md +209 -0
- package/skills/effective-python/ref-07-concurrency.md +213 -0
- package/skills/effective-python/ref-08-robustness-performance.md +248 -0
- package/skills/effective-python/ref-09-testing-debugging.md +253 -0
- package/skills/effective-python/ref-10-collaboration.md +175 -0
- package/skills/effective-python/references/api_reference.md +218 -0
- package/skills/effective-python/references/practices-catalog.md +483 -0
- package/skills/effective-python/references/review-checklist.md +190 -0
- package/skills/effective-python/scripts/lint.py +173 -0
- package/skills/effective-typescript/SKILL.md +262 -0
- package/skills/effective-typescript/audit.json +29 -0
- package/skills/effective-typescript/evals/evals.json +37 -0
- package/skills/effective-typescript/evals/results.json +13 -0
- package/skills/effective-typescript/examples/after.md +70 -0
- package/skills/effective-typescript/examples/before.md +47 -0
- package/skills/effective-typescript/references/api_reference.md +118 -0
- package/skills/effective-typescript/references/practices-catalog.md +371 -0
- package/skills/effective-typescript/scripts/review.py +169 -0
- package/skills/kotlin-in-action/SKILL.md +261 -0
- package/skills/kotlin-in-action/assets/example_asset.txt +1 -0
- package/skills/kotlin-in-action/evals/evals.json +43 -0
- package/skills/kotlin-in-action/evals/results.json +13 -0
- package/skills/kotlin-in-action/examples/after.md +53 -0
- package/skills/kotlin-in-action/examples/before.md +39 -0
- package/skills/kotlin-in-action/references/api_reference.md +1 -0
- package/skills/kotlin-in-action/references/practices-catalog.md +436 -0
- package/skills/kotlin-in-action/references/review-checklist.md +204 -0
- package/skills/kotlin-in-action/scripts/example.py +1 -0
- package/skills/kotlin-in-action/scripts/setup_detekt.py +224 -0
- package/skills/lean-startup/SKILL.md +160 -0
- package/skills/lean-startup/assets/example_asset.txt +1 -0
- package/skills/lean-startup/evals/evals.json +43 -0
- package/skills/lean-startup/evals/results.json +13 -0
- package/skills/lean-startup/examples/after.md +80 -0
- package/skills/lean-startup/examples/before.md +34 -0
- package/skills/lean-startup/references/api_reference.md +319 -0
- package/skills/lean-startup/references/review-checklist.md +137 -0
- package/skills/lean-startup/scripts/example.py +1 -0
- package/skills/lean-startup/scripts/new_experiment.py +286 -0
- package/skills/microservices-patterns/SKILL.md +384 -0
- package/skills/microservices-patterns/evals/evals.json +45 -0
- package/skills/microservices-patterns/evals/results.json +13 -0
- package/skills/microservices-patterns/examples/after.md +69 -0
- package/skills/microservices-patterns/examples/before.md +40 -0
- package/skills/microservices-patterns/references/patterns-catalog.md +391 -0
- package/skills/microservices-patterns/references/review-checklist.md +169 -0
- package/skills/microservices-patterns/scripts/new_service.py +583 -0
- package/skills/programming-with-rust/SKILL.md +209 -0
- package/skills/programming-with-rust/evals/evals.json +37 -0
- package/skills/programming-with-rust/evals/results.json +13 -0
- package/skills/programming-with-rust/examples/after.md +107 -0
- package/skills/programming-with-rust/examples/before.md +59 -0
- package/skills/programming-with-rust/references/api_reference.md +152 -0
- package/skills/programming-with-rust/references/practices-catalog.md +335 -0
- package/skills/programming-with-rust/scripts/review.py +142 -0
- package/skills/refactoring-ui/SKILL.md +362 -0
- package/skills/refactoring-ui/assets/example_asset.txt +1 -0
- package/skills/refactoring-ui/evals/evals.json +45 -0
- package/skills/refactoring-ui/evals/results.json +13 -0
- package/skills/refactoring-ui/examples/after.md +85 -0
- package/skills/refactoring-ui/examples/before.md +58 -0
- package/skills/refactoring-ui/references/api_reference.md +355 -0
- package/skills/refactoring-ui/references/review-checklist.md +114 -0
- package/skills/refactoring-ui/scripts/audit_css.py +250 -0
- package/skills/refactoring-ui/scripts/example.py +1 -0
- package/skills/rust-in-action/SKILL.md +350 -0
- package/skills/rust-in-action/evals/evals.json +38 -0
- package/skills/rust-in-action/evals/results.json +13 -0
- package/skills/rust-in-action/examples/after.md +156 -0
- package/skills/rust-in-action/examples/before.md +56 -0
- package/skills/rust-in-action/references/practices-catalog.md +346 -0
- package/skills/rust-in-action/scripts/review.py +147 -0
- package/skills/skill-router/SKILL.md +186 -0
- package/skills/skill-router/evals/evals.json +38 -0
- package/skills/skill-router/evals/results.json +13 -0
- package/skills/skill-router/examples/after.md +63 -0
- package/skills/skill-router/examples/before.md +39 -0
- package/skills/skill-router/references/api_reference.md +24 -0
- package/skills/skill-router/references/routing-heuristics.md +89 -0
- package/skills/skill-router/references/skill-catalog.md +174 -0
- package/skills/skill-router/scripts/route.py +266 -0
- package/skills/spring-boot-in-action/SKILL.md +340 -0
- package/skills/spring-boot-in-action/evals/evals.json +39 -0
- package/skills/spring-boot-in-action/evals/results.json +13 -0
- package/skills/spring-boot-in-action/examples/after.md +185 -0
- package/skills/spring-boot-in-action/examples/before.md +84 -0
- package/skills/spring-boot-in-action/references/practices-catalog.md +403 -0
- package/skills/spring-boot-in-action/scripts/review.py +184 -0
- package/skills/storytelling-with-data/SKILL.md +241 -0
- package/skills/storytelling-with-data/assets/example_asset.txt +1 -0
- package/skills/storytelling-with-data/evals/evals.json +47 -0
- package/skills/storytelling-with-data/evals/results.json +13 -0
- package/skills/storytelling-with-data/examples/after.md +50 -0
- package/skills/storytelling-with-data/examples/before.md +33 -0
- package/skills/storytelling-with-data/references/api_reference.md +379 -0
- package/skills/storytelling-with-data/references/review-checklist.md +111 -0
- package/skills/storytelling-with-data/scripts/chart_review.py +301 -0
- package/skills/storytelling-with-data/scripts/example.py +1 -0
- package/skills/system-design-interview/SKILL.md +233 -0
- package/skills/system-design-interview/assets/example_asset.txt +1 -0
- package/skills/system-design-interview/evals/evals.json +46 -0
- package/skills/system-design-interview/evals/results.json +13 -0
- package/skills/system-design-interview/examples/after.md +94 -0
- package/skills/system-design-interview/examples/before.md +27 -0
- package/skills/system-design-interview/references/api_reference.md +582 -0
- package/skills/system-design-interview/references/review-checklist.md +201 -0
- package/skills/system-design-interview/scripts/example.py +1 -0
- package/skills/system-design-interview/scripts/new_design.py +421 -0
- package/skills/using-asyncio-python/SKILL.md +290 -0
- package/skills/using-asyncio-python/assets/example_asset.txt +1 -0
- package/skills/using-asyncio-python/evals/evals.json +43 -0
- package/skills/using-asyncio-python/evals/results.json +13 -0
- package/skills/using-asyncio-python/examples/after.md +68 -0
- package/skills/using-asyncio-python/examples/before.md +39 -0
- package/skills/using-asyncio-python/references/api_reference.md +267 -0
- package/skills/using-asyncio-python/references/review-checklist.md +149 -0
- package/skills/using-asyncio-python/scripts/check_blocking.py +270 -0
- package/skills/using-asyncio-python/scripts/example.py +1 -0
- package/skills/web-scraping-python/SKILL.md +280 -0
- package/skills/web-scraping-python/assets/example_asset.txt +1 -0
- package/skills/web-scraping-python/evals/evals.json +46 -0
- package/skills/web-scraping-python/evals/results.json +13 -0
- package/skills/web-scraping-python/examples/after.md +109 -0
- package/skills/web-scraping-python/examples/before.md +40 -0
- package/skills/web-scraping-python/references/api_reference.md +393 -0
- package/skills/web-scraping-python/references/review-checklist.md +163 -0
- package/skills/web-scraping-python/scripts/example.py +1 -0
- package/skills/web-scraping-python/scripts/new_scraper.py +231 -0
- package/skills/writing-plans/audit.json +34 -0
- package/tests/agent-detector.test.js +83 -0
- package/tests/corrections.test.js +245 -0
- package/tests/doctor/hook-installer.test.js +72 -0
- package/tests/doctor/usage-tracker.test.js +140 -0
- package/tests/engine/benchmark-eval.test.js +31 -0
- package/tests/engine/bm25-index.test.js +85 -0
- package/tests/engine/capture-command.test.js +35 -0
- package/tests/engine/capture.test.js +17 -0
- package/tests/engine/graph-augmented-search.test.js +107 -0
- package/tests/engine/graph-injector.test.js +44 -0
- package/tests/engine/graph.test.js +216 -0
- package/tests/engine/hybrid-searcher.test.js +74 -0
- package/tests/engine/indexer-bm25.test.js +37 -0
- package/tests/engine/mcp-tools.test.js +73 -0
- package/tests/engine/project-initializer-mcp.test.js +99 -0
- package/tests/engine/query-expander.test.js +36 -0
- package/tests/engine/reranker.test.js +51 -0
- package/tests/engine/rrf.test.js +49 -0
- package/tests/engine/srag-prefix.test.js +47 -0
- package/tests/instinct-block.test.js +23 -0
- package/tests/mcp-config-writer.test.js +60 -0
- package/tests/project-initializer-new-agents.test.js +48 -0
- package/tests/rules/rules-manager.test.js +230 -0
- package/tests/well-known-builder.test.js +40 -0
- package/tests/wizard/integration-detector.test.js +31 -0
- package/tests/wizard/project-detector.test.js +51 -0
- package/tests/wizard/prompt-session.test.js +61 -0
- package/tests/wizard/prompt.test.js +16 -0
- package/tests/wizard/registry-embeddings.test.js +35 -0
- package/tests/wizard/skill-recommender.test.js +34 -0
- package/tests/wizard/slot-count.test.js +25 -0
- package/vercel.json +21 -0
package/skills/data-pipelines/evals/evals.json

```diff
@@ -0,0 +1,45 @@
+{
+  "evals": [
+    {
+      "id": "eval-01-etl-no-error-handling-no-idempotency",
+      "prompt": "Review this ETL script:\n\n```python\nimport psycopg2\nimport requests\n\nSOURCE_DB = 'postgresql://user:pass@source-host/prod'\nDEST_DB = 'postgresql://user:pass@warehouse-host/warehouse'\n\ndef run():\n    src = psycopg2.connect(SOURCE_DB)\n    dst = psycopg2.connect(DEST_DB)\n\n    rows = src.cursor().execute(\n        'SELECT id, customer_id, amount, created_at FROM orders'\n    ).fetchall()\n\n    for row in rows:\n        order_id, customer_id, amount, created_at = row\n        resp = requests.get(f'https://api.exchange.io/rate?currency=EUR')\n        rate = resp.json()['rate']\n        amount_eur = amount * rate\n\n        dst.cursor().execute(\n            'INSERT INTO orders_eur VALUES (%s, %s, %s, %s)',\n            (order_id, customer_id, amount_eur, created_at)\n        )\n\n    dst.commit()\n    src.close()\n    dst.close()\n\nif __name__ == '__main__':\n    run()\n```",
+      "expectations": [
+        "Flags the full-table extraction `SELECT * FROM orders` with no timestamp filter as a non-incremental load that will re-process the entire table on every run; recommends incremental extraction using a watermark (Ch 3-4: incremental over full extraction)",
+        "Flags the absence of idempotency: re-running the script will insert duplicate rows into `orders_eur`; recommends an INSERT ... ON CONFLICT DO NOTHING or MERGE pattern (Ch 13: idempotency is non-negotiable)",
+        "Flags the external API call `requests.get` inside the per-row loop, which issues one HTTP request per order row — an N+1 pattern causing severe performance and rate-limit issues; recommends fetching the exchange rate once before the loop",
+        "Flags no error handling anywhere: if the API call fails, the loop crashes mid-run leaving the destination in a partially loaded state with no indication of progress (Ch 13: error handling and retry strategies)",
+        "Flags hardcoded credentials in source strings; recommends environment variables or a secrets manager (Ch 13: never hardcode credentials)",
+        "Flags no logging of rows processed, errors encountered, or run duration (Ch 12: monitoring and observability)",
+        "Flags the absence of a staging table: data is written directly to the production `orders_eur` table without validation (Ch 8: always load to staging first)"
+      ]
+    },
+    {
+      "id": "eval-02-mixed-transform-and-load",
+      "prompt": "Review this data pipeline script:\n\n```python\nimport pandas as pd\nimport sqlalchemy\n\ndef process_and_load(csv_path: str, db_url: str, table: str):\n    df = pd.read_csv(csv_path)\n\n    # Clean and transform\n    df['email'] = df['email'].str.lower().str.strip()\n    df['revenue'] = df['revenue'].fillna(0)\n    df['signup_date'] = pd.to_datetime(df['signup_date'])\n    df = df[df['revenue'] >= 0]\n    df['revenue_category'] = df['revenue'].apply(\n        lambda x: 'high' if x > 1000 else 'low'\n    )\n    df['country'] = df['country'].str.upper()\n\n    # Enrich with another file\n    regions = pd.read_csv('regions.csv')  # hardcoded path\n    df = df.merge(regions, on='country', how='left')\n\n    # Load directly into the final table\n    engine = sqlalchemy.create_engine(db_url)\n    df.to_sql(table, engine, if_exists='append', index=False)\n    print(f'Loaded {len(df)} rows')\n```",
+      "expectations": [
+        "Flags that transformation logic and loading logic are combined in a single function, violating separation of concerns; recommends splitting into separate extract, transform, and load functions (Ch 3: ETL pattern design, Ch 11: DAG-based task granularity)",
+        "Flags the hardcoded path `'regions.csv'` as a non-configurable dependency that breaks when the file moves; recommends externalizing all paths and inputs as parameters or config (Ch 13: configurable pipelines)",
+        "Flags `if_exists='append'` with no deduplication: re-running appends duplicate rows; recommends staging table + MERGE or using a unique constraint with INSERT OR IGNORE (Ch 13: idempotency)",
+        "Flags no data validation before loading: there is no check that the merge did not produce unexpected nulls in the region column or that row counts match expectations (Ch 10: validate at boundaries)",
+        "Flags no logging beyond a single print statement: recommends structured logging of row counts at each stage, null rates, and merge match rate (Ch 12: monitoring and observability)",
+        "Flags absence of data lineage tracking: no pipeline_run_id or audit column to identify which pipeline run produced each row, making debugging and reruns harder to trace (Ch 13: data lineage)",
+        "Recommends adding a schema validation step after reading the CSV to catch missing or mistyped columns before transformations run (Ch 10: schema validation at ingestion)"
+      ]
+    },
+    {
+      "id": "eval-03-clean-pipeline-with-retry-logging-separation",
+      "prompt": "Review this data pipeline implementation:\n\n```python\nimport logging\nimport time\nfrom datetime import datetime, timedelta\nfrom typing import Iterator\nimport psycopg2\nimport psycopg2.extras\n\nlogger = logging.getLogger(__name__)\n\nBATCH_SIZE = 1000\nMAX_RETRIES = 3\nBACKOFF_BASE = 2\n\n\ndef extract(conn, watermark: datetime) -> Iterator[list]:\n    \"\"\"Yield batches of new orders since the watermark.\"\"\"\n    with conn.cursor(cursor_factory=psycopg2.extras.DictCursor) as cur:\n        cur.execute(\n            'SELECT id, customer_id, amount, created_at '\n            'FROM orders WHERE created_at > %s ORDER BY created_at',\n            (watermark,)\n        )\n        while True:\n            rows = cur.fetchmany(BATCH_SIZE)\n            if not rows:\n                break\n            logger.info('Extracted batch of %d rows', len(rows))\n            yield [dict(r) for r in rows]\n\n\ndef transform(batch: list[dict]) -> list[dict]:\n    \"\"\"Apply business rules: normalize amounts, tag high-value orders.\"\"\"\n    result = []\n    for row in batch:\n        row['amount'] = round(float(row['amount']), 2)\n        row['is_high_value'] = row['amount'] > 500\n        result.append(row)\n    return result\n\n\ndef load(conn, rows: list[dict], run_id: str) -> int:\n    \"\"\"Upsert rows into orders_warehouse; return count of rows loaded.\"\"\"\n    with conn.cursor() as cur:\n        psycopg2.extras.execute_values(\n            cur,\n            '''\n            INSERT INTO orders_warehouse (id, customer_id, amount, is_high_value, created_at, pipeline_run_id)\n            VALUES %s\n            ON CONFLICT (id) DO UPDATE SET\n                amount = EXCLUDED.amount,\n                is_high_value = EXCLUDED.is_high_value,\n                pipeline_run_id = EXCLUDED.pipeline_run_id\n            ''',\n            [(r['id'], r['customer_id'], r['amount'], r['is_high_value'],\n              r['created_at'], run_id) for r in rows]\n        )\n        conn.commit()\n    return len(rows)\n\n\ndef run_with_retry(fn, *args, **kwargs):\n    \"\"\"Retry a function with exponential backoff on transient errors.\"\"\"\n    for attempt in range(1, MAX_RETRIES + 1):\n        try:\n            return fn(*args, **kwargs)\n        except psycopg2.OperationalError as e:\n            if attempt == MAX_RETRIES:\n                raise\n            delay = BACKOFF_BASE ** attempt\n            logger.warning('Attempt %d failed: %s. Retrying in %ds', attempt, e, delay)\n            time.sleep(delay)\n```",
+      "expectations": [
+        "Recognizes this is a well-designed pipeline and says so explicitly",
+        "Praises the clear separation of `extract`, `transform`, and `load` into distinct functions with single responsibilities (Ch 3: ETL pattern, Ch 11: task granularity)",
+        "Praises the watermark-based incremental extraction that avoids full-table scans on reruns (Ch 3-4: incremental extraction)",
+        "Praises the `ON CONFLICT DO UPDATE` upsert ensuring the pipeline is idempotent and safe to re-run (Ch 13: idempotency is non-negotiable)",
+        "Praises the generator-based `extract` function that yields batches, avoiding loading the full result set into memory (Ch 4: streaming extraction, memory efficiency)",
+        "Praises the `run_with_retry` wrapper with exponential backoff for transient database errors (Ch 13: error handling and retry strategies)",
+        "Praises structured logging at the batch level with row counts for observability (Ch 12: monitoring)",
+        "Praises the `pipeline_run_id` column in the load, enabling lineage tracking and debugging of which run produced which rows (Ch 13: data lineage)",
+        "Does NOT manufacture issues to appear thorough; any suggestions are framed as minor optional improvements"
+      ]
+    }
+  ]
+}
```
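The N+1 pattern called out in eval-01 has a fix worth sketching: fetch the exchange rate once, then convert every row in memory. A minimal illustration under that assumption (the function and variable names here are hypothetical, not from the package):

```python
def convert_orders(rows, rate):
    """Convert (order_id, customer_id, amount, created_at) tuples to EUR
    using one pre-fetched exchange rate instead of one HTTP call per row."""
    return [
        (order_id, customer_id, round(amount * rate, 2), created_at)
        for order_id, customer_id, amount, created_at in rows
    ]

# The rate is fetched once, before any looping, e.g.:
# rate = requests.get(RATE_URL, timeout=10).json()['rate']
orders = [(1, 101, 100.0, '2026-01-01'), (2, 102, 50.0, '2026-01-02')]
converted = convert_orders(orders, rate=0.92)
```

Besides eliminating per-row latency and rate-limit pressure, this makes the conversion a pure function that can be unit-tested without network access.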
package/skills/data-pipelines/evals/results.json

```diff
@@ -0,0 +1,13 @@
+{
+  "pass_rate": 0.957,
+  "passed": 22,
+  "total": 23,
+  "baseline_pass_rate": 0.304,
+  "baseline_passed": 7,
+  "baseline_total": 23,
+  "delta": 0.652,
+  "model": "default",
+  "evals_run": 3,
+  "date": "2026-03-28",
+  "non_standard_provider": true
+}
```
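The rounded fields in results.json are mutually consistent only if `delta` is computed from the raw counts rather than the rounded rates (0.957 - 0.304 would round to 0.653). A quick check of that reading:

```python
# Recompute the results.json fields from the raw pass counts.
passed, baseline_passed, total = 22, 7, 23

pass_rate = round(passed / total, 3)                    # 0.957
baseline_pass_rate = round(baseline_passed / total, 3)  # 0.304
delta = round((passed - baseline_passed) / total, 3)    # 0.652
```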
|
|
@@ -0,0 +1,97 @@
# After

A clean pipeline with separated extract/transform/load functions, idempotent upserts, retry logic, and proper error handling.

```python
import logging
import time
from dataclasses import dataclass
from datetime import datetime
from functools import wraps

import psycopg2
import requests
from requests.exceptions import RequestException

logger = logging.getLogger(__name__)


@dataclass
class SaleRecord:
    id: str
    sale_date: datetime
    revenue: float
    region: str


def with_retry(max_attempts: int = 3, backoff_seconds: float = 2.0):
    """Decorator: retry a function on transient failures with exponential backoff."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except (RequestException, psycopg2.OperationalError) as exc:
                    if attempt == max_attempts:
                        raise
                    wait = backoff_seconds ** attempt
                    logger.warning("Attempt %d/%d failed: %s — retrying in %.1fs",
                                   attempt, max_attempts, exc, wait)
                    time.sleep(wait)
        return wrapper
    return decorator


@with_retry(max_attempts=3)
def extract(api_url: str) -> list[dict]:
    """Fetch raw sales records from the partner API."""
    response = requests.get(api_url, timeout=30)
    response.raise_for_status()
    return response.json()["sales"]


def transform(raw_records: list[dict]) -> list[SaleRecord]:
    """Parse and normalise raw API records into typed SaleRecord objects."""
    return [
        SaleRecord(
            id=rec["id"],
            sale_date=datetime.fromisoformat(rec["date"]),
            revenue=float(rec["amount_usd"]),
            region=rec["region"].strip().upper(),
        )
        for rec in raw_records
    ]


def load(records: list[SaleRecord], dsn: str) -> int:
    """Upsert records into fact_sales. Idempotent: re-running is safe."""
    upsert_sql = """
        INSERT INTO fact_sales (sale_id, sale_date, revenue, region, loaded_at)
        VALUES (%(id)s, %(sale_date)s, %(revenue)s, %(region)s, NOW())
        ON CONFLICT (sale_id) DO UPDATE
        SET revenue = EXCLUDED.revenue,
            loaded_at = EXCLUDED.loaded_at
    """
    conn = psycopg2.connect(dsn)
    try:
        # The connection context manager commits on success and rolls back on
        # error; it does not close the connection, hence the explicit close().
        with conn, conn.cursor() as cur:
            cur.executemany(upsert_sql, [vars(r) for r in records])
    finally:
        conn.close()
    # cursor.rowcount is unreliable after executemany(), so report the input size
    loaded = len(records)
    logger.info("Upserted %d records into fact_sales", loaded)
    return loaded


def run_pipeline(api_url: str, warehouse_dsn: str) -> None:
    logger.info("Starting sales pipeline")
    raw = extract(api_url)
    records = transform(raw)
    loaded = load(records, warehouse_dsn)
    logger.info("Pipeline complete: %d records loaded", loaded)
```

Key improvements:
- Extract, transform, and load are separate functions with single responsibilities — each is independently testable and replaceable (Ch 13: Best Practices — separation of concerns)
- `ON CONFLICT (sale_id) DO UPDATE` makes the load idempotent — re-running the pipeline never creates duplicate rows (Ch 13: Idempotency)
- `@with_retry` decorator handles transient API and database failures with exponential backoff (Ch 6: API Ingestion — retry logic)
- `SaleRecord` dataclass replaces a raw dict, providing type safety and named field access in the transform step
- The psycopg2 connection context manager wraps the load in a transaction (commit on success, rollback on error); since it does not close the connection, `close()` is called explicitly in a `finally` block (Ch 4: Database Ingestion)
- Structured logging with `logger.info/warning` replaces bare `print` — output is filterable and includes context (Ch 12: Monitoring)

@@ -0,0 +1,37 @@
# Before

A Python ETL script that mixes extraction, transformation, and loading in one function with no error handling, no idempotency, and no retry logic.

```python
import psycopg2
import requests
from datetime import datetime

def run_pipeline():
    # Extract: fetch from API
    resp = requests.get("https://api.partner.com/sales/export")
    data = resp.json()

    # Connect to warehouse
    conn = psycopg2.connect("host=dw user=etl dbname=warehouse")
    cur = conn.cursor()

    # Transform + Load: all in one loop, no error handling
    for record in data["sales"]:
        sale_date = datetime.strptime(record["date"], "%Y-%m-%dT%H:%M:%S")
        revenue = float(record["amount_usd"])
        region = record["region"].strip().upper()

        # No upsert — re-running inserts duplicates
        cur.execute("""
            INSERT INTO fact_sales (sale_id, sale_date, revenue, region, loaded_at)
            VALUES (%s, %s, %s, %s, NOW())
        """, (record["id"], sale_date, revenue, region))

    conn.commit()
    cur.close()
    conn.close()
    print("done")

run_pipeline()
```

@@ -0,0 +1,301 @@
# Data Pipelines Pocket Reference — Practices Catalog

Chapter-by-chapter catalog of practices from *Data Pipelines Pocket Reference*
by James Densmore for pipeline building.

---

## Chapter 1–2: Introduction & Modern Data Infrastructure

### Infrastructure Choices
- **Data warehouse** — Columnar storage optimized for analytics queries (Redshift, BigQuery, Snowflake); choose when analytics is the primary use case
- **Data lake** — Object storage (S3, GCS, Azure Blob) for raw, unstructured, or semi-structured data; choose when data variety is high or schema is unknown
- **Hybrid** — Land raw data in the lake, load structured subsets into the warehouse; a common modern pattern
- **Cloud-native** — Leverage managed services (serverless compute, auto-scaling storage); reduce operational burden

### Pipeline Types
- **Batch** — Process data in scheduled intervals (hourly, daily); suitable for most analytics use cases
- **Streaming** — Process data continuously as it arrives; required for real-time dashboards, alerts, event-driven systems
- **Micro-batch** — Small frequent batches (every few minutes); a compromise between batch simplicity and near-real-time latency

---

## Chapter 3: Common Data Pipeline Patterns

### Extraction Patterns
- **Full extraction** — Extract the entire dataset each run; simple but expensive for large tables; use for small reference tables or initial loads
- **Incremental extraction** — Extract only new/changed records using a high-water mark (timestamp, auto-increment ID, or sequence); preferred for growing datasets
- **Change data capture (CDC)** — Capture changes from database transaction logs (MySQL binlog, PostgreSQL WAL); lowest latency, captures deletes; use for real-time sync
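
The high-water-mark bookkeeping behind incremental extraction can be sketched in Python (the record shape and `updated_at` field are illustrative, not from the book):

```python
from datetime import datetime

def extract_incremental(rows: list[dict], high_water_mark: datetime) -> tuple[list[dict], datetime]:
    """Keep only rows newer than the stored mark; return them with the new mark to persist."""
    new_rows = [r for r in rows if r["updated_at"] > high_water_mark]
    # Advance the mark to the newest extracted row; keep the old mark if nothing is new
    new_mark = max((r["updated_at"] for r in new_rows), default=high_water_mark)
    return new_rows, new_mark
```

Persisting `new_mark` between runs is what lets the next extraction pick up exactly where this one stopped.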

### Loading Patterns
- **Full refresh (truncate + load)** — Replace the entire destination table; simple, idempotent; use for small tables or when incremental is unreliable
- **Append** — Insert new records only; use for event/log data that is never updated
- **Upsert (MERGE)** — Insert new records, update existing ones based on a key; use for mutable dimension data
- **Delete + insert by partition** — Delete a partition, insert its replacement; idempotent and efficient for date-partitioned fact tables
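
SQLite (3.24+) supports the same `ON CONFLICT` upsert clause, so the pattern can be demonstrated end to end (the table and column names are invented for this sketch; a cloud warehouse would typically use `MERGE`):

```python
import sqlite3

def upsert_regions(conn: sqlite3.Connection, rows: list[tuple[str, float]]) -> None:
    """Insert new keys, update existing ones; safe to re-run."""
    conn.executemany(
        """
        INSERT INTO dim_region (region_code, revenue) VALUES (?, ?)
        ON CONFLICT (region_code) DO UPDATE SET revenue = excluded.revenue
        """,
        rows,
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_region (region_code TEXT PRIMARY KEY, revenue REAL)")
upsert_regions(conn, [("EU", 100.0), ("US", 200.0)])
upsert_regions(conn, [("EU", 150.0)])  # re-run updates in place, never duplicates
```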

### ETL vs ELT
- **ETL** — Transform before loading; use when the destination has limited compute or when data must be cleansed before storage
- **ELT** — Load raw data first, transform in the destination; preferred for modern cloud warehouses with cheap compute; enables raw data preservation and flexible re-transformation

---

## Chapter 4: Database Ingestion

### MySQL Extraction
- Use `SELECT ... WHERE updated_at > :last_run` for incremental extraction
- For full extraction: `SELECT *` with optional `LIMIT/OFFSET` for large tables (prefer streaming cursors)
- Use read replicas to avoid impacting production database performance
- Handle MySQL timezone conversions (store/compare in UTC)
- Use connection pooling for concurrent extraction from multiple tables

### PostgreSQL Extraction
- Similar incremental patterns using timestamp columns
- Use `COPY TO` for efficient bulk export to CSV/files
- Leverage PostgreSQL logical replication for CDC
- Handle PostgreSQL-specific types (arrays, JSON, custom types) during extraction

### MongoDB Extraction
- Use change streams for real-time CDC
- For incremental: query by `_id` (an ObjectId embeds a timestamp) or a custom timestamp field
- Handle nested documents: flatten or store as JSON in the warehouse
- Use `mongodump` for full extraction of large collections

### General Database Practices
- **Connection management** — Use connection pools; close connections promptly; handle timeouts
- **Query optimization** — Add indexes on extraction columns; limit selected columns; use WHERE clauses
- **Binary data** — Skip or store references to BLOBs; don't load binary data into an analytics warehouse
- **Character encoding** — Ensure UTF-8 throughout the pipeline; handle encoding mismatches at extraction

---

## Chapter 5: File Ingestion

### CSV Files
- Handle header rows, quoting, escaping, delimiters (not always a comma)
- Detect and handle encoding (UTF-8, Latin-1, Windows-1252)
- Validate column count per row; log and quarantine malformed rows
- Use streaming parsers for large files; avoid loading the entire file into memory
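
A minimal streaming parse with quarantine, using only the standard library (the column layout is illustrative):

```python
import csv
import io

def parse_csv_stream(fileobj, expected_cols: int):
    """Yield well-formed rows as dicts; collect malformed rows instead of failing."""
    good, quarantined = [], []
    reader = csv.reader(fileobj)
    header = next(reader)
    for line_no, row in enumerate(reader, start=2):
        if len(row) == expected_cols:
            good.append(dict(zip(header, row)))
        else:
            quarantined.append((line_no, row))  # keep the line number for debugging
    return good, quarantined

raw = io.StringIO("id,amount\n1,9.99\n2,3.50,EXTRA\n3,1.25\n")
good, bad = parse_csv_stream(raw, expected_cols=2)
```

One bad row lands in the quarantine list with its line number; the other rows still load.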

### JSON Files
- Handle nested structures: flatten for warehouse loading or store as a JSON column
- Use JSON Lines (newline-delimited JSON) for large datasets
- Validate against the expected schema; handle missing and extra fields
- Parse dates and timestamps consistently

### Cloud Storage Integration
- **S3** — Use `aws s3 cp/sync` or boto3; leverage S3 event notifications for trigger-based ingestion
- **GCS** — Use `gsutil` or the google-cloud-storage library; use Pub/Sub for event notifications
- **Azure Blob** — Use the Azure SDK; leverage Event Grid for notifications
- Use prefix/partition naming: `s3://bucket/table/year=2024/month=01/day=15/`
- Implement file manifests to track which files have been processed
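
At its core a manifest check is a set difference between listed object keys and already-processed keys; a toy sketch (key names invented for illustration):

```python
def select_unprocessed(listed_keys: set[str], manifest: set[str]) -> list[str]:
    """Return object keys that the manifest has not recorded as processed."""
    return sorted(listed_keys - manifest)

manifest = {"table/year=2024/month=01/day=14/part-0.csv"}
listed = {
    "table/year=2024/month=01/day=14/part-0.csv",
    "table/year=2024/month=01/day=15/part-0.csv",
}
to_process = select_unprocessed(listed, manifest)
manifest.update(to_process)  # record the keys once their loads succeed
```

In practice the manifest lives in a database or a control file, not in memory.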

### File Best Practices
- **Naming conventions** — Include date, source, and sequence in filenames: `orders_2024-01-15_001.csv`
- **Compression** — Use gzip or snappy for storage efficiency; most tools handle compressed files natively
- **Archiving** — Move processed files to an archive prefix/bucket; retain them for reprocessing
- **Schema detection** — Infer schema from the first N rows; validate against the expected schema; alert on changes

---

## Chapter 6: API Ingestion

### REST API Patterns
- **Authentication** — Handle API keys, OAuth tokens, token refresh; store credentials securely
- **Pagination** — Implement cursor-based, offset-based, or link-header pagination; prefer cursor-based for consistency
- **Rate limiting** — Respect rate-limit headers (`X-RateLimit-Remaining`, `Retry-After`); implement backoff
- **Retry logic** — Retry on 429 (rate limited) and 5xx (server error) with exponential backoff; don't retry on other 4xx (client error) responses
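
Cursor pagination plus retry and backoff can be sketched against an injected `fetch_page` callable, so no real HTTP is involved; the `(status, records, next_cursor)` return shape is an assumption of this sketch:

```python
import time

RETRYABLE = {429, 500, 502, 503, 504}

def fetch_all(fetch_page, max_attempts: int = 3, backoff: float = 0.01) -> list[dict]:
    """Follow cursor-based pagination, retrying retryable statuses with exponential backoff."""
    records, cursor = [], None
    while True:
        for attempt in range(1, max_attempts + 1):
            status, page, next_cursor = fetch_page(cursor)
            if status not in RETRYABLE:
                break
            if attempt == max_attempts:
                raise RuntimeError(f"giving up after {max_attempts} attempts (HTTP {status})")
            time.sleep(backoff * 2 ** attempt)  # exponential backoff between attempts
        if status != 200:
            raise RuntimeError(f"unexpected HTTP {status}")
        records.extend(page)
        if next_cursor is None:
            return records
        cursor = next_cursor

# A stubbed API: the first page fails once with 429, then succeeds; the second page ends pagination.
calls = {"n": 0}
def fake_fetch(cursor):
    calls["n"] += 1
    if cursor is None:
        if calls["n"] == 1:
            return 429, [], None
        return 200, [{"id": 1}], "cursor-2"
    return 200, [{"id": 2}], None

all_records = fetch_all(fake_fetch)
```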

### API Data Handling
- **JSON response parsing** — Extract relevant fields; handle nested objects and arrays
- **Incremental fetching** — Use `modified_since` parameters, cursor tokens, or date-range filters
- **Schema changes** — Handle new fields gracefully; log and alert on missing expected fields
- **Large responses** — Stream responses for large payloads; paginate aggressively

### Webhook Ingestion
- Set up HTTP endpoints to receive push notifications
- Validate webhook signatures for security
- Acknowledge receipt quickly (200 OK); process asynchronously
- Implement idempotency using event IDs to handle duplicate deliveries
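
A toy handler showing event-ID idempotency (a real deployment would store seen IDs durably and enqueue work asynchronously, not keep them in memory):

```python
def make_webhook_handler(process):
    """Acknowledge every delivery, but process each event ID only once."""
    seen: set[str] = set()

    def handle(event: dict) -> str:
        if event["event_id"] in seen:
            return "200 OK"  # duplicate delivery: ack without reprocessing
        seen.add(event["event_id"])
        process(event)  # in production, enqueue for async processing instead
        return "200 OK"

    return handle

processed = []
handle = make_webhook_handler(processed.append)
handle({"event_id": "evt-1", "type": "sale.created"})
handle({"event_id": "evt-1", "type": "sale.created"})  # redelivered by the sender
```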

---

## Chapter 7: Streaming Data

### Apache Kafka
- **Producers** — Serialize events (Avro, JSON, Protobuf); use keys for partition ordering; configure acknowledgments
- **Consumers** — Use consumer groups for parallel processing; commit offsets after successful processing; handle rebalances
- **Topics** — Design topic schemas; set retention policies; partition for throughput and ordering requirements
- **Exactly-once** — Use idempotent producers plus transactional consumers; or implement deduplication downstream

### Amazon Kinesis
- **Streams** — Configure shard count for throughput; use enhanced fan-out for multiple consumers
- **Firehose** — Direct-to-S3/Redshift delivery; configure buffering interval and size; transform with Lambda

### Stream Processing Patterns
- **Windowing** — Tumbling, sliding, session windows for aggregation over time
- **Watermarks** — Handle late-arriving events; define allowed lateness
- **State management** — Use state stores for aggregations; handle checkpointing and recovery
- **Dead-letter queues** — Route failed events for inspection and reprocessing; don't lose data silently
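
Tumbling-window aggregation reduces to bucketing timestamps by their window start; a pure-Python sketch over epoch-second timestamps:

```python
from collections import defaultdict

def tumbling_window_counts(event_times: list[int], window_seconds: int) -> dict[int, int]:
    """Count events per fixed, non-overlapping window, keyed by window start time."""
    counts: dict[int, int] = defaultdict(int)
    for ts in event_times:  # each event is an epoch-seconds timestamp
        counts[ts - ts % window_seconds] += 1
    return dict(counts)

counts = tumbling_window_counts([3, 12, 14, 31], window_seconds=10)
```

Sliding and session windows differ only in how a timestamp maps to (possibly several) buckets.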

---

## Chapter 8: Data Storage and Loading

### Amazon Redshift
- Use the `COPY` command for bulk loading from S3; much faster than INSERT
- Define `SORTKEY` for frequently filtered columns; `DISTKEY` for join columns
- Use `UNLOAD` for efficient export back to S3
- Vacuum and analyze tables after large loads

### Google BigQuery
- Use load jobs for bulk data; streaming inserts for real-time (more expensive)
- Partition tables by date (ingestion time or column-based) for cost and performance
- Cluster tables on frequently filtered columns
- Use external tables for querying data in GCS without loading

### Snowflake
- Use stages (internal or external) for file-based loading
- `COPY INTO` for bulk loads from stages; Snowpipe for continuous loading
- Size virtual warehouses appropriately for load workloads
- Leverage Time Travel for data recovery and auditing

### General Loading Practices
- **Staging tables** — Always load to staging first; validate before merging to production
- **Atomic swaps** — Use table rename or partition swap for atomic updates
- **Data types** — Map source types carefully; avoid implicit conversions; use appropriate precision
- **Compression** — Let the warehouse handle compression; load compressed files when supported
- **Partitioning** — Partition by date for time-series data; by key for lookup tables
- **Clustering** — Cluster on frequently filtered columns within partitions

---

## Chapter 9: Data Transformations

### SQL-Based Transforms
- Use CTEs (common table expressions) for readability
- Prefer window functions over self-joins for ranking, running totals
- Use CASE expressions for conditional logic
- Aggregate at the right grain; avoid fan-out joins that multiply rows
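
A running total via a window function, runnable against SQLite (window functions need SQLite 3.25+; the table name is invented for the sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sale_date TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?)",
    [("2024-01-01", 100.0), ("2024-01-02", 50.0), ("2024-01-03", 25.0)],
)
# The window function computes the running total without a self-join
rows = conn.execute(
    """
    SELECT sale_date,
           SUM(revenue) OVER (ORDER BY sale_date) AS running_total
    FROM daily_sales
    ORDER BY sale_date
    """
).fetchall()
```

The self-join equivalent would rescan the table once per row; the window form is both clearer and cheaper.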

### dbt (Data Build Tool)
- **Staging models** — 1:1 with source tables; rename columns, cast types, filter deleted records
- **Intermediate models** — Business-logic joins, complex calculations, deduplication
- **Mart models** — Final analytics-ready tables; optimized for dashboard queries
- **Incremental models** — Process only new/changed data; use `unique_key` for the merge strategy
- **Tests** — `not_null`, `unique`, `accepted_values`, `relationships` on key columns; custom data tests
- **Sources** — Define sources in YAML; use the `source()` macro for lineage tracking; freshness checks

### Python-Based Transforms
- Use pandas for small-to-medium datasets; PySpark or Dask for large-scale processing
- Write pure functions for transformations; make them testable
- Handle data types explicitly; don't rely on inference in production pipelines
- Use vectorized operations; avoid row-by-row iteration

### Transform Best Practices
- **Layered architecture** — Raw → Staging → Intermediate → Mart; each layer has a clear purpose
- **Single responsibility** — Each model/transform does one thing well
- **Documented logic** — Comment complex business rules; maintain a data dictionary
- **Version control** — Keep all transformation code in git; review changes via PR

---

## Chapter 10: Data Validation and Testing

### Validation Types
- **Schema validation** — Verify column names, types, and count match expectations
- **Row count checks** — Compare source and destination row counts; alert on significant discrepancies
- **Null checks** — Assert key columns are not null; track null percentages for optional columns
- **Uniqueness checks** — Verify primary-key uniqueness in destination tables
- **Referential integrity** — Check foreign-key relationships between tables
- **Range checks** — Validate values fall within expected ranges (dates, amounts, percentages)
- **Freshness checks** — Verify data is not stale; alert when the max timestamp is older than a threshold
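
Several of these checks can be sketched as one pure function over extracted rows (column names are illustrative):

```python
def validate(rows: list[dict], key: str, range_col: str, lo: float, hi: float) -> list[str]:
    """Run null, uniqueness, and range checks; return human-readable failure messages."""
    failures = []
    keys = [r.get(key) for r in rows]
    if any(k is None for k in keys):
        failures.append(f"null {key}")
    if len(set(keys)) != len(keys):
        failures.append(f"duplicate {key}")
    if any(not (lo <= r[range_col] <= hi) for r in rows):
        failures.append(f"{range_col} out of range")
    return failures

rows = [
    {"sale_id": "a", "revenue": 10.0},
    {"sale_id": "a", "revenue": -5.0},  # duplicate key and negative revenue
]
failures = validate(rows, key="sale_id", range_col="revenue", lo=0.0, hi=1e9)
```

A pipeline step would fail (or quarantine) when the returned list is non-empty.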

### Great Expectations
- Define expectations as code; version-control them
- Run validations as pipeline steps; fail the pipeline on critical expectation failures
- Generate data documentation from expectations
- Use checkpoints for scheduled validation runs

### Testing Practices
- **Unit tests** — Test individual transformation functions with known inputs/outputs
- **Integration tests** — Test the end-to-end pipeline with sample data
- **Regression tests** — Compare current results against known-good baselines
- **Data contracts** — Define and enforce schemas between producer and consumer teams

---

## Chapter 11: Orchestration

### Apache Airflow
- **DAGs** — Define pipelines as directed acyclic graphs; each node is a task
- **Operators** — Use appropriate operators: PythonOperator, BashOperator, provider operators (BigQueryOperator, S3ToRedshiftOperator)
- **Scheduling** — Use cron expressions or a timedelta; set `start_date` and `catchup` appropriately
- **Dependencies** — Use the `>>` operator or `set_upstream`/`set_downstream`; keep DAGs shallow and wide
- **XComs** — Pass small metadata between tasks (row counts, file paths); NOT large datasets
- **Sensors** — Wait for external conditions (file arrival, partition availability); use with a timeout
- **Variables and Connections** — Store config in Airflow Variables; credentials in Connections
- **Pools** — Limit concurrency for resource-constrained tasks (database connections, API rate limits)

### DAG Design Patterns
- **One pipeline per DAG** — Keep DAGs focused; avoid mega-DAGs that do everything
- **Idempotent tasks** — Every task can be re-run safely; use `execution_date` for parameterization
- **Task granularity** — Tasks should be atomic and independently retryable; not too fine (overhead) or too coarse (blast radius)
- **Error handling** — Use `on_failure_callback` for alerting; `retries` and `retry_delay` for transient failures
- **Backfilling** — Use `airflow backfill` for historical reprocessing; ensure tasks support date parameterization

---

## Chapter 12: Monitoring and Alerting

### Pipeline Health Metrics
- **Duration** — Track execution time; alert on runs significantly longer than the historical average
- **Row counts** — Track records processed per run; alert on zero rows or dramatic changes
- **Error rates** — Track failed records, retries, exceptions; alert on elevated error rates
- **Data freshness** — Track the max timestamp in the destination; alert when data is staler than the SLA
- **Resource usage** — Track CPU, memory, disk, network; alert on resource exhaustion
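
A freshness check is a one-line comparison once the SLA is expressed as a `timedelta`; a sketch with hypothetical values:

```python
from datetime import datetime, timedelta

def is_stale(max_loaded_at: datetime, sla: timedelta, now: datetime) -> bool:
    """True when the newest loaded record is older than the freshness SLA."""
    return now - max_loaded_at > sla

now = datetime(2024, 1, 15, 12, 0)
stale = is_stale(datetime(2024, 1, 15, 2, 0), timedelta(hours=6), now)      # 10h old
fresh = not is_stale(datetime(2024, 1, 15, 9, 0), timedelta(hours=6), now)  # 3h old
```

`max_loaded_at` would come from `SELECT MAX(loaded_at)` against the destination table; passing `now` explicitly keeps the check testable.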

### Alerting Strategies
- **SLA-based** — Define delivery SLAs; alert when pipelines miss their windows
- **Anomaly-based** — Detect deviations from historical patterns (row counts, durations, values)
- **Threshold-based** — Alert on fixed thresholds (error rate > 5%, null rate > 10%)
- **Escalation** — Define severity levels; route alerts appropriately (Slack, PagerDuty, email)

### Logging
- Log at each pipeline stage: extraction start/end, row counts, load confirmation
- Include correlation IDs to trace records through the pipeline
- Store logs centrally for searchability (ELK, CloudWatch, Stackdriver)
- Retain logs for debugging and audit compliance

---

## Chapter 13: Best Practices

### Idempotency
- Use DELETE + INSERT by date partition for fact tables
- Use MERGE/upsert with natural keys for dimension tables
- Use staging tables as an intermediary; clean up on both success and failure
- Test by running the pipeline twice; verify no data duplication
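
The DELETE + INSERT pattern can be demonstrated with SQLite (the schema is invented for this sketch); running the load twice leaves the partition unchanged:

```python
import sqlite3

def load_partition(conn: sqlite3.Connection, day: str, revenues: list[float]) -> None:
    """Replace one date partition inside a transaction; re-runnable with no duplicates."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM fact_sales WHERE sale_date = ?", (day,))
        conn.executemany(
            "INSERT INTO fact_sales (sale_date, revenue) VALUES (?, ?)",
            [(day, r) for r in revenues],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (sale_date TEXT, revenue REAL)")
load_partition(conn, "2024-01-15", [10.0, 20.0])
load_partition(conn, "2024-01-15", [10.0, 20.0])  # second run replaces, not appends
```

The transaction matters: if the INSERT fails after the DELETE, the rollback restores the old partition.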

### Backfilling
- Parameterize all pipelines by date range (`start_date`, `end_date`)
- Use Airflow `execution_date` or an equivalent for date-aware runs
- Test a backfill on a small date range before running a full historical reprocess
- Monitor resource usage during backfills; you may need to throttle parallelism
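
Date-range parameterization reduces to enumerating the daily runs; a sketch:

```python
from datetime import date, timedelta

def backfill_dates(start: date, end: date) -> list[date]:
    """Enumerate the daily run parameters for a backfill over [start, end], inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]

runs = backfill_dates(date(2024, 1, 13), date(2024, 1, 15))
```

Each returned date would be passed to the pipeline as its run parameter, exactly as the scheduler would do for a normal daily run.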

### Error Handling
- Retry transient failures (network timeouts, rate limits) with exponential backoff
- Fail fast on permanent errors (authentication failure, missing source table)
- Quarantine bad records; don't let one bad row fail the entire pipeline
- Send alerts with actionable context (error message, affected table, run ID)

### Data Lineage and Documentation
- Track source-to-destination mappings for every table
- Document transformation logic, especially business rules
- Maintain a data dictionary with column descriptions and types
- Use tools like dbt docs, DataHub, or Amundsen for automated lineage

### Security
- Never hardcode credentials; use secrets managers (AWS Secrets Manager, HashiCorp Vault)
- Encrypt data in transit (TLS) and at rest (warehouse encryption)
- Use least-privilege IAM roles for pipeline service accounts
- Audit access to sensitive data; mask PII in non-production environments