@booklib/core 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (374)
  1. package/.cursor/rules/booklib-standards.mdc +40 -0
  2. package/.gemini/context.md +372 -0
  3. package/AGENTS.md +166 -0
  4. package/CHANGELOG.md +226 -0
  5. package/CLAUDE.md +81 -0
  6. package/CODE_OF_CONDUCT.md +31 -0
  7. package/CONTRIBUTING.md +304 -0
  8. package/LICENSE +21 -0
  9. package/PLAN.md +28 -0
  10. package/README.ja.md +198 -0
  11. package/README.ko.md +198 -0
  12. package/README.md +503 -0
  13. package/README.pt-BR.md +198 -0
  14. package/README.uk.md +241 -0
  15. package/README.zh-CN.md +198 -0
  16. package/SECURITY.md +9 -0
  17. package/agents/architecture-reviewer.md +136 -0
  18. package/agents/booklib-reviewer.md +90 -0
  19. package/agents/data-reviewer.md +107 -0
  20. package/agents/jvm-reviewer.md +146 -0
  21. package/agents/python-reviewer.md +128 -0
  22. package/agents/rust-reviewer.md +115 -0
  23. package/agents/ts-reviewer.md +110 -0
  24. package/agents/ui-reviewer.md +117 -0
  25. package/assets/logo.svg +36 -0
  26. package/bin/booklib-mcp.js +304 -0
  27. package/bin/booklib.js +1705 -0
  28. package/bin/skills.cjs +1292 -0
  29. package/booklib-router.mdc +36 -0
  30. package/booklib.config.json +19 -0
  31. package/commands/animation-at-work.md +10 -0
  32. package/commands/clean-code-reviewer.md +10 -0
  33. package/commands/data-intensive-patterns.md +10 -0
  34. package/commands/data-pipelines.md +10 -0
  35. package/commands/design-patterns.md +10 -0
  36. package/commands/domain-driven-design.md +10 -0
  37. package/commands/effective-java.md +10 -0
  38. package/commands/effective-kotlin.md +10 -0
  39. package/commands/effective-python.md +10 -0
  40. package/commands/effective-typescript.md +10 -0
  41. package/commands/kotlin-in-action.md +10 -0
  42. package/commands/lean-startup.md +10 -0
  43. package/commands/microservices-patterns.md +10 -0
  44. package/commands/programming-with-rust.md +10 -0
  45. package/commands/refactoring-ui.md +10 -0
  46. package/commands/rust-in-action.md +10 -0
  47. package/commands/skill-router.md +10 -0
  48. package/commands/spring-boot-in-action.md +10 -0
  49. package/commands/storytelling-with-data.md +10 -0
  50. package/commands/system-design-interview.md +10 -0
  51. package/commands/using-asyncio-python.md +10 -0
  52. package/commands/web-scraping-python.md +10 -0
  53. package/community/registry.json +1616 -0
  54. package/hooks/hooks.json +23 -0
  55. package/hooks/posttooluse-capture.mjs +67 -0
  56. package/hooks/suggest.js +153 -0
  57. package/lib/agent-behaviors.js +40 -0
  58. package/lib/agent-detector.js +96 -0
  59. package/lib/config-loader.js +39 -0
  60. package/lib/conflict-resolver.js +148 -0
  61. package/lib/context-builder.js +574 -0
  62. package/lib/discovery-engine.js +298 -0
  63. package/lib/doctor/hook-installer.js +83 -0
  64. package/lib/doctor/usage-tracker.js +87 -0
  65. package/lib/engine/ai-features.js +253 -0
  66. package/lib/engine/auditor.js +103 -0
  67. package/lib/engine/bm25-index.js +178 -0
  68. package/lib/engine/capture.js +120 -0
  69. package/lib/engine/corrections.js +198 -0
  70. package/lib/engine/doctor.js +195 -0
  71. package/lib/engine/graph-injector.js +137 -0
  72. package/lib/engine/graph.js +161 -0
  73. package/lib/engine/handoff.js +405 -0
  74. package/lib/engine/indexer.js +242 -0
  75. package/lib/engine/parser.js +53 -0
  76. package/lib/engine/query-expander.js +42 -0
  77. package/lib/engine/reranker.js +40 -0
  78. package/lib/engine/rrf.js +59 -0
  79. package/lib/engine/scanner.js +151 -0
  80. package/lib/engine/searcher.js +139 -0
  81. package/lib/engine/session-coordinator.js +306 -0
  82. package/lib/engine/session-manager.js +429 -0
  83. package/lib/engine/synthesizer.js +70 -0
  84. package/lib/installer.js +70 -0
  85. package/lib/instinct-block.js +33 -0
  86. package/lib/mcp-config-writer.js +88 -0
  87. package/lib/paths.js +57 -0
  88. package/lib/profiles/design.md +19 -0
  89. package/lib/profiles/general.md +16 -0
  90. package/lib/profiles/research-analysis.md +22 -0
  91. package/lib/profiles/software-development.md +23 -0
  92. package/lib/profiles/writing-content.md +19 -0
  93. package/lib/project-initializer.js +916 -0
  94. package/lib/registry/skills.js +102 -0
  95. package/lib/registry-searcher.js +99 -0
  96. package/lib/rules/rules-manager.js +169 -0
  97. package/lib/skill-fetcher.js +333 -0
  98. package/lib/well-known-builder.js +70 -0
  99. package/lib/wizard/index.js +404 -0
  100. package/lib/wizard/integration-detector.js +41 -0
  101. package/lib/wizard/project-detector.js +100 -0
  102. package/lib/wizard/prompt.js +156 -0
  103. package/lib/wizard/registry-embeddings.js +107 -0
  104. package/lib/wizard/skill-recommender.js +69 -0
  105. package/llms-full.txt +254 -0
  106. package/llms.txt +70 -0
  107. package/package.json +45 -0
  108. package/research-reports/2026-04-01-current-architecture.md +160 -0
  109. package/research-reports/IDEAS.md +93 -0
  110. package/rules/common/clean-code.md +42 -0
  111. package/rules/java/effective-java.md +42 -0
  112. package/rules/kotlin/effective-kotlin.md +37 -0
  113. package/rules/python/effective-python.md +38 -0
  114. package/rules/rust/rust.md +37 -0
  115. package/rules/typescript/effective-typescript.md +42 -0
  116. package/scripts/gen-llms-full.mjs +36 -0
  117. package/scripts/gen-og.mjs +142 -0
  118. package/scripts/validate-frontmatter.js +25 -0
  119. package/skills/animation-at-work/SKILL.md +270 -0
  120. package/skills/animation-at-work/assets/example_asset.txt +1 -0
  121. package/skills/animation-at-work/evals/evals.json +44 -0
  122. package/skills/animation-at-work/evals/results.json +13 -0
  123. package/skills/animation-at-work/examples/after.md +64 -0
  124. package/skills/animation-at-work/examples/before.md +35 -0
  125. package/skills/animation-at-work/references/api_reference.md +369 -0
  126. package/skills/animation-at-work/references/review-checklist.md +79 -0
  127. package/skills/animation-at-work/scripts/audit_animations.py +295 -0
  128. package/skills/animation-at-work/scripts/example.py +1 -0
  129. package/skills/clean-code-reviewer/SKILL.md +444 -0
  130. package/skills/clean-code-reviewer/audit.json +35 -0
  131. package/skills/clean-code-reviewer/evals/evals.json +185 -0
  132. package/skills/clean-code-reviewer/evals/results.json +13 -0
  133. package/skills/clean-code-reviewer/examples/after.md +48 -0
  134. package/skills/clean-code-reviewer/examples/before.md +33 -0
  135. package/skills/clean-code-reviewer/references/api_reference.md +158 -0
  136. package/skills/clean-code-reviewer/references/practices-catalog.md +282 -0
  137. package/skills/clean-code-reviewer/references/review-checklist.md +254 -0
  138. package/skills/clean-code-reviewer/scripts/pre-review.py +206 -0
  139. package/skills/data-intensive-patterns/SKILL.md +267 -0
  140. package/skills/data-intensive-patterns/assets/example_asset.txt +1 -0
  141. package/skills/data-intensive-patterns/evals/evals.json +54 -0
  142. package/skills/data-intensive-patterns/evals/results.json +13 -0
  143. package/skills/data-intensive-patterns/examples/after.md +61 -0
  144. package/skills/data-intensive-patterns/examples/before.md +38 -0
  145. package/skills/data-intensive-patterns/references/api_reference.md +34 -0
  146. package/skills/data-intensive-patterns/references/patterns-catalog.md +551 -0
  147. package/skills/data-intensive-patterns/references/review-checklist.md +193 -0
  148. package/skills/data-intensive-patterns/scripts/adr.py +213 -0
  149. package/skills/data-intensive-patterns/scripts/example.py +1 -0
  150. package/skills/data-pipelines/SKILL.md +259 -0
  151. package/skills/data-pipelines/assets/example_asset.txt +1 -0
  152. package/skills/data-pipelines/evals/evals.json +45 -0
  153. package/skills/data-pipelines/evals/results.json +13 -0
  154. package/skills/data-pipelines/examples/after.md +97 -0
  155. package/skills/data-pipelines/examples/before.md +37 -0
  156. package/skills/data-pipelines/references/api_reference.md +301 -0
  157. package/skills/data-pipelines/references/review-checklist.md +181 -0
  158. package/skills/data-pipelines/scripts/example.py +1 -0
  159. package/skills/data-pipelines/scripts/new_pipeline.py +444 -0
  160. package/skills/design-patterns/SKILL.md +271 -0
  161. package/skills/design-patterns/assets/example_asset.txt +1 -0
  162. package/skills/design-patterns/evals/evals.json +46 -0
  163. package/skills/design-patterns/evals/results.json +13 -0
  164. package/skills/design-patterns/examples/after.md +52 -0
  165. package/skills/design-patterns/examples/before.md +29 -0
  166. package/skills/design-patterns/references/api_reference.md +1 -0
  167. package/skills/design-patterns/references/patterns-catalog.md +726 -0
  168. package/skills/design-patterns/references/review-checklist.md +173 -0
  169. package/skills/design-patterns/scripts/example.py +1 -0
  170. package/skills/design-patterns/scripts/scaffold.py +807 -0
  171. package/skills/domain-driven-design/SKILL.md +142 -0
  172. package/skills/domain-driven-design/assets/example_asset.txt +1 -0
  173. package/skills/domain-driven-design/evals/evals.json +48 -0
  174. package/skills/domain-driven-design/evals/results.json +13 -0
  175. package/skills/domain-driven-design/examples/after.md +80 -0
  176. package/skills/domain-driven-design/examples/before.md +43 -0
  177. package/skills/domain-driven-design/references/api_reference.md +1 -0
  178. package/skills/domain-driven-design/references/patterns-catalog.md +545 -0
  179. package/skills/domain-driven-design/references/review-checklist.md +158 -0
  180. package/skills/domain-driven-design/scripts/example.py +1 -0
  181. package/skills/domain-driven-design/scripts/scaffold.py +421 -0
  182. package/skills/effective-java/SKILL.md +227 -0
  183. package/skills/effective-java/assets/example_asset.txt +1 -0
  184. package/skills/effective-java/evals/evals.json +46 -0
  185. package/skills/effective-java/evals/results.json +13 -0
  186. package/skills/effective-java/examples/after.md +83 -0
  187. package/skills/effective-java/examples/before.md +37 -0
  188. package/skills/effective-java/references/api_reference.md +1 -0
  189. package/skills/effective-java/references/items-catalog.md +955 -0
  190. package/skills/effective-java/references/review-checklist.md +216 -0
  191. package/skills/effective-java/scripts/checkstyle_setup.py +211 -0
  192. package/skills/effective-java/scripts/example.py +1 -0
  193. package/skills/effective-kotlin/SKILL.md +271 -0
  194. package/skills/effective-kotlin/assets/example_asset.txt +1 -0
  195. package/skills/effective-kotlin/audit.json +29 -0
  196. package/skills/effective-kotlin/evals/evals.json +45 -0
  197. package/skills/effective-kotlin/evals/results.json +13 -0
  198. package/skills/effective-kotlin/examples/after.md +36 -0
  199. package/skills/effective-kotlin/examples/before.md +38 -0
  200. package/skills/effective-kotlin/references/api_reference.md +1 -0
  201. package/skills/effective-kotlin/references/practices-catalog.md +1228 -0
  202. package/skills/effective-kotlin/references/review-checklist.md +126 -0
  203. package/skills/effective-kotlin/scripts/example.py +1 -0
  204. package/skills/effective-python/SKILL.md +441 -0
  205. package/skills/effective-python/evals/evals.json +44 -0
  206. package/skills/effective-python/evals/results.json +13 -0
  207. package/skills/effective-python/examples/after.md +56 -0
  208. package/skills/effective-python/examples/before.md +40 -0
  209. package/skills/effective-python/ref-01-pythonic-thinking.md +202 -0
  210. package/skills/effective-python/ref-02-lists-and-dicts.md +146 -0
  211. package/skills/effective-python/ref-03-functions.md +186 -0
  212. package/skills/effective-python/ref-04-comprehensions-generators.md +211 -0
  213. package/skills/effective-python/ref-05-classes-interfaces.md +188 -0
  214. package/skills/effective-python/ref-06-metaclasses-attributes.md +209 -0
  215. package/skills/effective-python/ref-07-concurrency.md +213 -0
  216. package/skills/effective-python/ref-08-robustness-performance.md +248 -0
  217. package/skills/effective-python/ref-09-testing-debugging.md +253 -0
  218. package/skills/effective-python/ref-10-collaboration.md +175 -0
  219. package/skills/effective-python/references/api_reference.md +218 -0
  220. package/skills/effective-python/references/practices-catalog.md +483 -0
  221. package/skills/effective-python/references/review-checklist.md +190 -0
  222. package/skills/effective-python/scripts/lint.py +173 -0
  223. package/skills/effective-typescript/SKILL.md +262 -0
  224. package/skills/effective-typescript/audit.json +29 -0
  225. package/skills/effective-typescript/evals/evals.json +37 -0
  226. package/skills/effective-typescript/evals/results.json +13 -0
  227. package/skills/effective-typescript/examples/after.md +70 -0
  228. package/skills/effective-typescript/examples/before.md +47 -0
  229. package/skills/effective-typescript/references/api_reference.md +118 -0
  230. package/skills/effective-typescript/references/practices-catalog.md +371 -0
  231. package/skills/effective-typescript/scripts/review.py +169 -0
  232. package/skills/kotlin-in-action/SKILL.md +261 -0
  233. package/skills/kotlin-in-action/assets/example_asset.txt +1 -0
  234. package/skills/kotlin-in-action/evals/evals.json +43 -0
  235. package/skills/kotlin-in-action/evals/results.json +13 -0
  236. package/skills/kotlin-in-action/examples/after.md +53 -0
  237. package/skills/kotlin-in-action/examples/before.md +39 -0
  238. package/skills/kotlin-in-action/references/api_reference.md +1 -0
  239. package/skills/kotlin-in-action/references/practices-catalog.md +436 -0
  240. package/skills/kotlin-in-action/references/review-checklist.md +204 -0
  241. package/skills/kotlin-in-action/scripts/example.py +1 -0
  242. package/skills/kotlin-in-action/scripts/setup_detekt.py +224 -0
  243. package/skills/lean-startup/SKILL.md +160 -0
  244. package/skills/lean-startup/assets/example_asset.txt +1 -0
  245. package/skills/lean-startup/evals/evals.json +43 -0
  246. package/skills/lean-startup/evals/results.json +13 -0
  247. package/skills/lean-startup/examples/after.md +80 -0
  248. package/skills/lean-startup/examples/before.md +34 -0
  249. package/skills/lean-startup/references/api_reference.md +319 -0
  250. package/skills/lean-startup/references/review-checklist.md +137 -0
  251. package/skills/lean-startup/scripts/example.py +1 -0
  252. package/skills/lean-startup/scripts/new_experiment.py +286 -0
  253. package/skills/microservices-patterns/SKILL.md +384 -0
  254. package/skills/microservices-patterns/evals/evals.json +45 -0
  255. package/skills/microservices-patterns/evals/results.json +13 -0
  256. package/skills/microservices-patterns/examples/after.md +69 -0
  257. package/skills/microservices-patterns/examples/before.md +40 -0
  258. package/skills/microservices-patterns/references/patterns-catalog.md +391 -0
  259. package/skills/microservices-patterns/references/review-checklist.md +169 -0
  260. package/skills/microservices-patterns/scripts/new_service.py +583 -0
  261. package/skills/programming-with-rust/SKILL.md +209 -0
  262. package/skills/programming-with-rust/evals/evals.json +37 -0
  263. package/skills/programming-with-rust/evals/results.json +13 -0
  264. package/skills/programming-with-rust/examples/after.md +107 -0
  265. package/skills/programming-with-rust/examples/before.md +59 -0
  266. package/skills/programming-with-rust/references/api_reference.md +152 -0
  267. package/skills/programming-with-rust/references/practices-catalog.md +335 -0
  268. package/skills/programming-with-rust/scripts/review.py +142 -0
  269. package/skills/refactoring-ui/SKILL.md +362 -0
  270. package/skills/refactoring-ui/assets/example_asset.txt +1 -0
  271. package/skills/refactoring-ui/evals/evals.json +45 -0
  272. package/skills/refactoring-ui/evals/results.json +13 -0
  273. package/skills/refactoring-ui/examples/after.md +85 -0
  274. package/skills/refactoring-ui/examples/before.md +58 -0
  275. package/skills/refactoring-ui/references/api_reference.md +355 -0
  276. package/skills/refactoring-ui/references/review-checklist.md +114 -0
  277. package/skills/refactoring-ui/scripts/audit_css.py +250 -0
  278. package/skills/refactoring-ui/scripts/example.py +1 -0
  279. package/skills/rust-in-action/SKILL.md +350 -0
  280. package/skills/rust-in-action/evals/evals.json +38 -0
  281. package/skills/rust-in-action/evals/results.json +13 -0
  282. package/skills/rust-in-action/examples/after.md +156 -0
  283. package/skills/rust-in-action/examples/before.md +56 -0
  284. package/skills/rust-in-action/references/practices-catalog.md +346 -0
  285. package/skills/rust-in-action/scripts/review.py +147 -0
  286. package/skills/skill-router/SKILL.md +186 -0
  287. package/skills/skill-router/evals/evals.json +38 -0
  288. package/skills/skill-router/evals/results.json +13 -0
  289. package/skills/skill-router/examples/after.md +63 -0
  290. package/skills/skill-router/examples/before.md +39 -0
  291. package/skills/skill-router/references/api_reference.md +24 -0
  292. package/skills/skill-router/references/routing-heuristics.md +89 -0
  293. package/skills/skill-router/references/skill-catalog.md +174 -0
  294. package/skills/skill-router/scripts/route.py +266 -0
  295. package/skills/spring-boot-in-action/SKILL.md +340 -0
  296. package/skills/spring-boot-in-action/evals/evals.json +39 -0
  297. package/skills/spring-boot-in-action/evals/results.json +13 -0
  298. package/skills/spring-boot-in-action/examples/after.md +185 -0
  299. package/skills/spring-boot-in-action/examples/before.md +84 -0
  300. package/skills/spring-boot-in-action/references/practices-catalog.md +403 -0
  301. package/skills/spring-boot-in-action/scripts/review.py +184 -0
  302. package/skills/storytelling-with-data/SKILL.md +241 -0
  303. package/skills/storytelling-with-data/assets/example_asset.txt +1 -0
  304. package/skills/storytelling-with-data/evals/evals.json +47 -0
  305. package/skills/storytelling-with-data/evals/results.json +13 -0
  306. package/skills/storytelling-with-data/examples/after.md +50 -0
  307. package/skills/storytelling-with-data/examples/before.md +33 -0
  308. package/skills/storytelling-with-data/references/api_reference.md +379 -0
  309. package/skills/storytelling-with-data/references/review-checklist.md +111 -0
  310. package/skills/storytelling-with-data/scripts/chart_review.py +301 -0
  311. package/skills/storytelling-with-data/scripts/example.py +1 -0
  312. package/skills/system-design-interview/SKILL.md +233 -0
  313. package/skills/system-design-interview/assets/example_asset.txt +1 -0
  314. package/skills/system-design-interview/evals/evals.json +46 -0
  315. package/skills/system-design-interview/evals/results.json +13 -0
  316. package/skills/system-design-interview/examples/after.md +94 -0
  317. package/skills/system-design-interview/examples/before.md +27 -0
  318. package/skills/system-design-interview/references/api_reference.md +582 -0
  319. package/skills/system-design-interview/references/review-checklist.md +201 -0
  320. package/skills/system-design-interview/scripts/example.py +1 -0
  321. package/skills/system-design-interview/scripts/new_design.py +421 -0
  322. package/skills/using-asyncio-python/SKILL.md +290 -0
  323. package/skills/using-asyncio-python/assets/example_asset.txt +1 -0
  324. package/skills/using-asyncio-python/evals/evals.json +43 -0
  325. package/skills/using-asyncio-python/evals/results.json +13 -0
  326. package/skills/using-asyncio-python/examples/after.md +68 -0
  327. package/skills/using-asyncio-python/examples/before.md +39 -0
  328. package/skills/using-asyncio-python/references/api_reference.md +267 -0
  329. package/skills/using-asyncio-python/references/review-checklist.md +149 -0
  330. package/skills/using-asyncio-python/scripts/check_blocking.py +270 -0
  331. package/skills/using-asyncio-python/scripts/example.py +1 -0
  332. package/skills/web-scraping-python/SKILL.md +280 -0
  333. package/skills/web-scraping-python/assets/example_asset.txt +1 -0
  334. package/skills/web-scraping-python/evals/evals.json +46 -0
  335. package/skills/web-scraping-python/evals/results.json +13 -0
  336. package/skills/web-scraping-python/examples/after.md +109 -0
  337. package/skills/web-scraping-python/examples/before.md +40 -0
  338. package/skills/web-scraping-python/references/api_reference.md +393 -0
  339. package/skills/web-scraping-python/references/review-checklist.md +163 -0
  340. package/skills/web-scraping-python/scripts/example.py +1 -0
  341. package/skills/web-scraping-python/scripts/new_scraper.py +231 -0
  342. package/skills/writing-plans/audit.json +34 -0
  343. package/tests/agent-detector.test.js +83 -0
  344. package/tests/corrections.test.js +245 -0
  345. package/tests/doctor/hook-installer.test.js +72 -0
  346. package/tests/doctor/usage-tracker.test.js +140 -0
  347. package/tests/engine/benchmark-eval.test.js +31 -0
  348. package/tests/engine/bm25-index.test.js +85 -0
  349. package/tests/engine/capture-command.test.js +35 -0
  350. package/tests/engine/capture.test.js +17 -0
  351. package/tests/engine/graph-augmented-search.test.js +107 -0
  352. package/tests/engine/graph-injector.test.js +44 -0
  353. package/tests/engine/graph.test.js +216 -0
  354. package/tests/engine/hybrid-searcher.test.js +74 -0
  355. package/tests/engine/indexer-bm25.test.js +37 -0
  356. package/tests/engine/mcp-tools.test.js +73 -0
  357. package/tests/engine/project-initializer-mcp.test.js +99 -0
  358. package/tests/engine/query-expander.test.js +36 -0
  359. package/tests/engine/reranker.test.js +51 -0
  360. package/tests/engine/rrf.test.js +49 -0
  361. package/tests/engine/srag-prefix.test.js +47 -0
  362. package/tests/instinct-block.test.js +23 -0
  363. package/tests/mcp-config-writer.test.js +60 -0
  364. package/tests/project-initializer-new-agents.test.js +48 -0
  365. package/tests/rules/rules-manager.test.js +230 -0
  366. package/tests/well-known-builder.test.js +40 -0
  367. package/tests/wizard/integration-detector.test.js +31 -0
  368. package/tests/wizard/project-detector.test.js +51 -0
  369. package/tests/wizard/prompt-session.test.js +61 -0
  370. package/tests/wizard/prompt.test.js +16 -0
  371. package/tests/wizard/registry-embeddings.test.js +35 -0
  372. package/tests/wizard/skill-recommender.test.js +34 -0
  373. package/tests/wizard/slot-count.test.js +25 -0
  374. package/vercel.json +21 -0
@@ -0,0 +1,45 @@
+ {
+   "evals": [
+     {
+       "id": "eval-01-etl-no-error-handling-no-idempotency",
+       "prompt": "Review this ETL script:\n\n```python\nimport psycopg2\nimport requests\n\nSOURCE_DB = 'postgresql://user:pass@source-host/prod'\nDEST_DB = 'postgresql://user:pass@warehouse-host/warehouse'\n\ndef run():\n src = psycopg2.connect(SOURCE_DB)\n dst = psycopg2.connect(DEST_DB)\n\n rows = src.cursor().execute(\n 'SELECT id, customer_id, amount, created_at FROM orders'\n ).fetchall()\n\n for row in rows:\n order_id, customer_id, amount, created_at = row\n resp = requests.get(f'https://api.exchange.io/rate?currency=EUR')\n rate = resp.json()['rate']\n amount_eur = amount * rate\n\n dst.cursor().execute(\n 'INSERT INTO orders_eur VALUES (%s, %s, %s, %s)',\n (order_id, customer_id, amount_eur, created_at)\n )\n\n dst.commit()\n src.close()\n dst.close()\n\nif __name__ == '__main__':\n run()\n```",
+       "expectations": [
+         "Flags the full-table extraction `SELECT id, customer_id, amount, created_at FROM orders` with no timestamp filter as a non-incremental load that will re-process the entire table on every run; recommends incremental extraction using a watermark (Ch 3-4: incremental over full extraction)",
+         "Flags the absence of idempotency: re-running the script will insert duplicate rows into `orders_eur`; recommends an INSERT ... ON CONFLICT DO NOTHING or MERGE pattern (Ch 13: idempotency is non-negotiable)",
+         "Flags the external API call `requests.get` inside the per-row loop, which issues one HTTP request per order row — an N+1 pattern causing severe performance and rate-limit issues; recommends fetching the exchange rate once before the loop",
+         "Flags no error handling anywhere: if the API call fails, the loop crashes mid-run leaving the destination in a partially loaded state with no indication of progress (Ch 13: error handling and retry strategies)",
+         "Flags hardcoded credentials in source strings; recommends environment variables or a secrets manager (Ch 13: never hardcode credentials)",
+         "Flags no logging of rows processed, errors encountered, or run duration (Ch 12: monitoring and observability)",
+         "Flags the absence of a staging table: data is written directly to the production `orders_eur` table without validation (Ch 8: always load to staging first)"
+       ]
+     },
+     {
+       "id": "eval-02-mixed-transform-and-load",
+       "prompt": "Review this data pipeline script:\n\n```python\nimport pandas as pd\nimport sqlalchemy\n\ndef process_and_load(csv_path: str, db_url: str, table: str):\n df = pd.read_csv(csv_path)\n\n # Clean and transform\n df['email'] = df['email'].str.lower().str.strip()\n df['revenue'] = df['revenue'].fillna(0)\n df['signup_date'] = pd.to_datetime(df['signup_date'])\n df = df[df['revenue'] >= 0]\n df['revenue_category'] = df['revenue'].apply(\n lambda x: 'high' if x > 1000 else 'low'\n )\n df['country'] = df['country'].str.upper()\n\n # Enrich with another file\n regions = pd.read_csv('regions.csv') # hardcoded path\n df = df.merge(regions, on='country', how='left')\n\n # Load directly into the final table\n engine = sqlalchemy.create_engine(db_url)\n df.to_sql(table, engine, if_exists='append', index=False)\n print(f'Loaded {len(df)} rows')\n```",
+       "expectations": [
+         "Flags that transformation logic and loading logic are combined in a single function, violating separation of concerns; recommends splitting into separate extract, transform, and load functions (Ch 3: ETL pattern design, Ch 11: DAG-based task granularity)",
+         "Flags the hardcoded path `'regions.csv'` as a non-configurable dependency that breaks when the file moves; recommends externalizing all paths and inputs as parameters or config (Ch 13: configurable pipelines)",
+         "Flags `if_exists='append'` with no deduplication: re-running appends duplicate rows; recommends staging table + MERGE or using a unique constraint with INSERT OR IGNORE (Ch 13: idempotency)",
+         "Flags no data validation before loading: there is no check that the merge did not produce unexpected nulls in the region column or that row counts match expectations (Ch 10: validate at boundaries)",
+         "Flags no logging beyond a single print statement: recommends structured logging of row counts at each stage, null rates, and merge match rate (Ch 12: monitoring and observability)",
+         "Flags absence of data lineage tracking: no pipeline_run_id or audit column to identify which pipeline run produced each row, making debugging and reruns harder to trace (Ch 13: data lineage)",
+         "Recommends adding a schema validation step after reading the CSV to catch missing or mistyped columns before transformations run (Ch 10: schema validation at ingestion)"
+       ]
+     },
+     {
+       "id": "eval-03-clean-pipeline-with-retry-logging-separation",
+       "prompt": "Review this data pipeline implementation:\n\n```python\nimport logging\nimport time\nfrom datetime import datetime, timedelta\nfrom typing import Iterator\nimport psycopg2\nimport psycopg2.extras\n\nlogger = logging.getLogger(__name__)\n\nBATCH_SIZE = 1000\nMAX_RETRIES = 3\nBACKOFF_BASE = 2\n\n\ndef extract(conn, watermark: datetime) -> Iterator[list]:\n \"\"\"Yield batches of new orders since the watermark.\"\"\"\n with conn.cursor(cursor_factory=psycopg2.extras.DictCursor) as cur:\n cur.execute(\n 'SELECT id, customer_id, amount, created_at '\n 'FROM orders WHERE created_at > %s ORDER BY created_at',\n (watermark,)\n )\n while True:\n rows = cur.fetchmany(BATCH_SIZE)\n if not rows:\n break\n logger.info('Extracted batch of %d rows', len(rows))\n yield [dict(r) for r in rows]\n\n\ndef transform(batch: list[dict]) -> list[dict]:\n \"\"\"Apply business rules: normalize amounts, tag high-value orders.\"\"\"\n result = []\n for row in batch:\n row['amount'] = round(float(row['amount']), 2)\n row['is_high_value'] = row['amount'] > 500\n result.append(row)\n return result\n\n\ndef load(conn, rows: list[dict], run_id: str) -> int:\n \"\"\"Upsert rows into orders_warehouse; return count of rows loaded.\"\"\"\n with conn.cursor() as cur:\n psycopg2.extras.execute_values(\n cur,\n '''\n INSERT INTO orders_warehouse (id, customer_id, amount, is_high_value, created_at, pipeline_run_id)\n VALUES %s\n ON CONFLICT (id) DO UPDATE SET\n amount = EXCLUDED.amount,\n is_high_value = EXCLUDED.is_high_value,\n pipeline_run_id = EXCLUDED.pipeline_run_id\n ''',\n [(r['id'], r['customer_id'], r['amount'], r['is_high_value'],\n r['created_at'], run_id) for r in rows]\n )\n conn.commit()\n return len(rows)\n\n\ndef run_with_retry(fn, *args, **kwargs):\n \"\"\"Retry a function with exponential backoff on transient errors.\"\"\"\n for attempt in range(1, MAX_RETRIES + 1):\n try:\n return fn(*args, **kwargs)\n except psycopg2.OperationalError as e:\n if attempt == MAX_RETRIES:\n raise\n delay = BACKOFF_BASE ** attempt\n logger.warning('Attempt %d failed: %s. Retrying in %ds', attempt, e, delay)\n time.sleep(delay)\n```",
+       "expectations": [
+         "Recognizes this is a well-designed pipeline and says so explicitly",
+         "Praises the clear separation of `extract`, `transform`, and `load` into distinct functions with single responsibilities (Ch 3: ETL pattern, Ch 11: task granularity)",
+         "Praises the watermark-based incremental extraction that avoids full-table scans on reruns (Ch 3-4: incremental extraction)",
+         "Praises the `ON CONFLICT DO UPDATE` upsert ensuring the pipeline is idempotent and safe to re-run (Ch 13: idempotency is non-negotiable)",
+         "Praises the generator-based `extract` function that yields batches, avoiding loading the full result set into memory (Ch 4: streaming extraction, memory efficiency)",
+         "Praises the `run_with_retry` wrapper with exponential backoff for transient database errors (Ch 13: error handling and retry strategies)",
+         "Praises structured logging at the batch level with row counts for observability (Ch 12: monitoring)",
+         "Praises the `pipeline_run_id` column in the load, enabling lineage tracking and debugging of which run produced which rows (Ch 13: data lineage)",
+         "Does NOT manufacture issues to appear thorough; any suggestions are framed as minor optional improvements"
+       ]
+     }
+   ]
+ }
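Several expectations above hinge on the idempotent-upsert pattern (Ch 13): re-running a load must update existing rows rather than insert duplicates. A minimal, self-contained sketch of that pattern, using SQLite's upsert syntax as a stand-in for Postgres `ON CONFLICT` and the `orders_eur` table from eval-01 (column set simplified for illustration):

```python
import sqlite3

# In-memory database standing in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_eur (id INTEGER PRIMARY KEY, amount_eur REAL)")

def load(rows):
    # Upsert: a conflicting id updates the row instead of duplicating it,
    # so running load() twice with the same ids is safe.
    conn.executemany(
        "INSERT INTO orders_eur (id, amount_eur) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET amount_eur = excluded.amount_eur",
        rows,
    )
    conn.commit()

load([(1, 9.50), (2, 12.00)])
load([(1, 9.75), (2, 12.00)])  # re-run: id 1 is updated, nothing duplicated

count, = conn.execute("SELECT COUNT(*) FROM orders_eur").fetchone()
print(count)  # 2
```

The same shape in Postgres is `INSERT ... ON CONFLICT (id) DO UPDATE SET ... = EXCLUDED....`, as the eval-03 reference solution shows.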
@@ -0,0 +1,13 @@
+ {
+   "pass_rate": 0.957,
+   "passed": 22,
+   "total": 23,
+   "baseline_pass_rate": 0.304,
+   "baseline_passed": 7,
+   "baseline_total": 23,
+   "delta": 0.652,
+   "model": "default",
+   "evals_run": 3,
+   "date": "2026-03-28",
+   "non_standard_provider": true
+ }
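The rounded figures in this results file are internally consistent; note in particular that `delta` (0.652) is not `0.957 - 0.304 = 0.653`, because it is computed from the unrounded rates. A quick sketch using the raw counts from the JSON above:

```python
# Reproduce the rounded metrics from the raw pass counts.
passed, total = 22, 23
baseline_passed, baseline_total = 7, 23

pass_rate = round(passed / total, 3)                             # 0.957
baseline_pass_rate = round(baseline_passed / baseline_total, 3)  # 0.304
# delta is the difference of the *unrounded* rates, rounded once at the end
delta = round(passed / total - baseline_passed / baseline_total, 3)  # 0.652

print(pass_rate, baseline_pass_rate, delta)
```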
@@ -0,0 +1,97 @@
1
+ # After
2
+
3
+ A clean pipeline with separated extract/transform/load functions, idempotent upserts, retry logic, and proper error handling.
4
+
5
+ ```python
6
+ import logging
7
+ import time
8
+ from dataclasses import dataclass
9
+ from datetime import datetime
10
+ from functools import wraps
11
+
12
+ import psycopg2
13
+ import requests
14
+ from requests.exceptions import RequestException
15
+
16
+ logger = logging.getLogger(__name__)
17
+
18
+
19
+ @dataclass
20
+ class SaleRecord:
21
+ id: str
22
+ sale_date: datetime
23
+ revenue: float
24
+ region: str
25
+
26
+
27
+ def with_retry(max_attempts: int = 3, backoff_seconds: float = 2.0):
28
+ """Decorator: retry a function on transient failures with exponential backoff."""
29
+ def decorator(fn):
30
+ @wraps(fn)
31
+ def wrapper(*args, **kwargs):
32
+ for attempt in range(1, max_attempts + 1):
33
+ try:
34
+ return fn(*args, **kwargs)
35
+ except (RequestException, psycopg2.OperationalError) as exc:
36
+ if attempt == max_attempts:
37
+ raise
38
+ wait = backoff_seconds ** attempt
39
+ logger.warning("Attempt %d/%d failed: %s — retrying in %.1fs",
40
+ attempt, max_attempts, exc, wait)
41
+ time.sleep(wait)
42
+ return wrapper
43
+ return decorator
44
+
45
+
46
+ @with_retry(max_attempts=3)
47
+ def extract(api_url: str) -> list[dict]:
48
+ """Fetch raw sales records from the partner API."""
49
+ response = requests.get(api_url, timeout=30)
50
+ response.raise_for_status()
51
+ return response.json()["sales"]
52
+
53
+
54
+ def transform(raw_records: list[dict]) -> list[SaleRecord]:
55
+ """Parse and normalise raw API records into typed SaleRecord objects."""
56
+ return [
57
+ SaleRecord(
58
+ id=rec["id"],
59
+ sale_date=datetime.fromisoformat(rec["date"]),
60
+ revenue=float(rec["amount_usd"]),
61
+ region=rec["region"].strip().upper(),
62
+ )
63
+ for rec in raw_records
64
+ ]
65
+
66
+
67
+ def load(records: list[SaleRecord], dsn: str) -> int:
68
+ """Upsert records into fact_sales. Idempotent: re-running is safe."""
69
+ upsert_sql = """
70
+ INSERT INTO fact_sales (sale_id, sale_date, revenue, region, loaded_at)
71
+ VALUES (%(id)s, %(sale_date)s, %(revenue)s, %(region)s, NOW())
72
+ ON CONFLICT (sale_id) DO UPDATE
73
+ SET revenue = EXCLUDED.revenue,
74
+ loaded_at = EXCLUDED.loaded_at
75
+ """
76
+ with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
77
+ cur.executemany(upsert_sql, [vars(r) for r in records])
78
+ loaded = len(records)  # rowcount after executemany() only reflects the last statement (DB-API)
79
+ logger.info("Upserted %d records into fact_sales", loaded)
80
+ return loaded
81
+
82
+
83
+ def run_pipeline(api_url: str, warehouse_dsn: str) -> None:
84
+ logger.info("Starting sales pipeline")
85
+ raw = extract(api_url)
86
+ records = transform(raw)
87
+ loaded = load(records, warehouse_dsn)
88
+ logger.info("Pipeline complete: %d records loaded", loaded)
89
+ ```
90
+
91
+ Key improvements:
92
+ - Extract, transform, and load are separate functions with single responsibilities — each is independently testable and replaceable (Ch 13: Best Practices — separation of concerns)
93
+ - `ON CONFLICT (sale_id) DO UPDATE` makes the load idempotent — re-running the pipeline never creates duplicate rows (Ch 13: Idempotency)
94
+ - `@with_retry` decorator handles transient API and database failures with exponential backoff (Ch 6: API Ingestion — retry logic)
95
+ - `SaleRecord` dataclass replaces a raw dict, providing type safety and named field access in the transform step
96
+ - `psycopg2.connect` used as a context manager commits the transaction on success and rolls back on error; note that the `with` block does not close the connection itself, which the caller should close or return to a pool (Ch 4: Database Ingestion)
97
+ - Structured logging with `logger.info/warning` replaces bare `print` — output is filterable and includes context (Ch 12: Monitoring)
@@ -0,0 +1,37 @@
1
+ # Before
2
+
3
+ A Python ETL script that mixes extraction, transformation, and loading in one function with no error handling, no idempotency, and no retry logic.
4
+
5
+ ```python
6
+ import psycopg2
7
+ import requests
8
+ from datetime import datetime
9
+
10
+ def run_pipeline():
11
+ # Extract: fetch from API
12
+ resp = requests.get("https://api.partner.com/sales/export")
13
+ data = resp.json()
14
+
15
+ # Connect to warehouse
16
+ conn = psycopg2.connect("host=dw user=etl dbname=warehouse")
17
+ cur = conn.cursor()
18
+
19
+ # Transform + Load: all in one loop, no error handling
20
+ for record in data["sales"]:
21
+ sale_date = datetime.strptime(record["date"], "%Y-%m-%dT%H:%M:%S")
22
+ revenue = float(record["amount_usd"])
23
+ region = record["region"].strip().upper()
24
+
25
+ # No upsert — re-running inserts duplicates
26
+ cur.execute("""
27
+ INSERT INTO fact_sales (sale_id, sale_date, revenue, region, loaded_at)
28
+ VALUES (%s, %s, %s, %s, NOW())
29
+ """, (record["id"], sale_date, revenue, region))
30
+
31
+ conn.commit()
32
+ cur.close()
33
+ conn.close()
34
+ print("done")
35
+
36
+ run_pipeline()
37
+ ```
@@ -0,0 +1,301 @@
1
+ # Data Pipelines Pocket Reference — Practices Catalog
2
+
3
+ Chapter-by-chapter catalog of pipeline-building practices from
4
+ *Data Pipelines Pocket Reference* by James Densmore.
5
+
6
+ ---
7
+
8
+ ## Chapter 1–2: Introduction & Modern Data Infrastructure
9
+
10
+ ### Infrastructure Choices
11
+ - **Data warehouse** — Columnar storage optimized for analytics queries (Redshift, BigQuery, Snowflake); choose when analytics is primary use case
12
+ - **Data lake** — Object storage (S3, GCS, Azure Blob) for raw, unstructured, or semi-structured data; choose when data variety is high or schema is unknown
13
+ - **Hybrid** — Land raw data in lake, load structured subsets into warehouse; common modern pattern
14
+ - **Cloud-native** — Leverage managed services (serverless compute, auto-scaling storage); reduce operational burden
15
+
16
+ ### Pipeline Types
17
+ - **Batch** — Process data in scheduled intervals (hourly, daily); suitable for most analytics use cases
18
+ - **Streaming** — Process data continuously as it arrives; required for real-time dashboards, alerts, event-driven systems
19
+ - **Micro-batch** — Small frequent batches (every few minutes); compromise between batch simplicity and near-real-time latency
20
+
21
+ ---
22
+
23
+ ## Chapter 3: Common Data Pipeline Patterns
24
+
25
+ ### Extraction Patterns
26
+ - **Full extraction** — Extract entire dataset each run; simple but expensive for large tables; use for small reference tables or initial loads
27
+ - **Incremental extraction** — Extract only new/changed records using a high-water mark (timestamp, auto-increment ID, or sequence); preferred for growing datasets
28
+ - **Change data capture (CDC)** — Capture changes from database transaction logs (MySQL binlog, PostgreSQL WAL); lowest latency, captures deletes; use for real-time sync
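
The high-water-mark pattern behind incremental extraction fits in a few lines. A minimal sketch against SQLite; the `events` table, its columns, and the ISO-timestamp watermark are illustrative assumptions:

```python
import sqlite3

def extract_incremental(conn, last_watermark):
    """Fetch only rows changed since the previous run's high-water mark."""
    rows = conn.execute(
        "SELECT id, updated_at, payload FROM events "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Persist the new watermark only after the batch is safely loaded.
    new_watermark = rows[-1][1] if rows else last_watermark
    return rows, new_watermark

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, updated_at TEXT, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    (1, "2024-01-01T00:00:00", "a"),
    (2, "2024-01-02T00:00:00", "b"),
    (3, "2024-01-03T00:00:00", "c"),
])
batch, wm = extract_incremental(conn, "2024-01-01T00:00:00")
# batch contains only rows 2 and 3; wm advances to the newest updated_at seen
```

Advancing the watermark only after a successful load is what keeps reruns from skipping rows.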
29
+
30
+ ### Loading Patterns
31
+ - **Full refresh (truncate + load)** — Replace entire destination table; simple, idempotent; use for small tables or when incremental is unreliable
32
+ - **Append** — Insert new records only; use for event/log data that is never updated
33
+ - **Upsert (MERGE)** — Insert new records, update existing ones based on a key; use for mutable dimension data
34
+ - **Delete + Insert by partition** — Delete partition, insert replacement; idempotent and efficient for date-partitioned fact tables
35
+
36
+ ### ETL vs ELT
37
+ - **ETL** — Transform before loading; use when destination has limited compute or when data must be cleansed before storage
38
+ - **ELT** — Load raw data first, transform in destination; preferred for modern cloud warehouses with cheap compute; enables raw data preservation and flexible re-transformation
39
+
40
+ ---
41
+
42
+ ## Chapter 4: Database Ingestion
43
+
44
+ ### MySQL Extraction
45
+ - Use `SELECT ... WHERE updated_at > :last_run` for incremental extraction
46
+ - For full extraction: `SELECT *` with optional `LIMIT/OFFSET` for large tables (prefer streaming cursors)
47
+ - Use read replicas to avoid impacting production database performance
48
+ - Handle MySQL timezone conversions (store/compare in UTC)
49
+ - Connection pooling for concurrent extraction from multiple tables
50
+
51
+ ### PostgreSQL Extraction
52
+ - Similar incremental patterns using timestamp columns
53
+ - Use `COPY TO` for efficient bulk export to CSV/files
54
+ - Leverage PostgreSQL logical replication for CDC
55
+ - Handle PostgreSQL-specific types (arrays, JSON, custom types) during extraction
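
For the `COPY TO` bullet, a small helper can assemble the export statement; the psycopg2 usage is shown commented out since it needs a live connection, and the table and column names are placeholders:

```python
def build_copy_sql(table: str, columns: list[str]) -> str:
    """COPY streams rows server-side and is far faster than row-by-row fetches."""
    col_list = ", ".join(columns)
    return f"COPY (SELECT {col_list} FROM {table}) TO STDOUT WITH CSV HEADER"

sql = build_copy_sql("orders", ["id", "order_date", "total"])
# With a live psycopg2 connection this streams the table to a local file:
# with conn.cursor() as cur, open("orders.csv", "w") as f:
#     cur.copy_expert(sql, f)
```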
56
+
57
+ ### MongoDB Extraction
58
+ - Use change streams for real-time CDC
59
+ - For incremental: query by `_id` (ObjectId contains timestamp) or custom timestamp field
60
+ - Handle nested documents: flatten or store as JSON in warehouse
61
+ - Use `mongodump` for full extraction of large collections
62
+
63
+ ### General Database Practices
64
+ - **Connection management** — Use connection pools; close connections promptly; handle timeouts
65
+ - **Query optimization** — Add indexes on extraction columns; limit selected columns; use WHERE clauses
66
+ - **Binary data** — Skip or store references to BLOBs; don't load binary into analytics warehouse
67
+ - **Character encoding** — Ensure UTF-8 throughout the pipeline; handle encoding mismatches at extraction
68
+
69
+ ---
70
+
71
+ ## Chapter 5: File Ingestion
72
+
73
+ ### CSV Files
74
+ - Handle header rows, quoting, escaping, delimiters (not always comma)
75
+ - Detect and handle encoding (UTF-8, Latin-1, Windows-1252)
76
+ - Validate column count per row; log and quarantine malformed rows
77
+ - Use streaming parsers for large files; avoid loading entire file into memory
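
The validate-and-quarantine advice can be sketched with the standard `csv` module; the three-column layout is just an example:

```python
import csv
import io

def parse_csv(stream, expected_cols: int):
    """Stream rows, keeping well-formed ones and quarantining the rest."""
    good, quarantined = [], []
    reader = csv.reader(stream)
    header = next(reader)
    for lineno, row in enumerate(reader, start=2):
        if len(row) != expected_cols:
            quarantined.append((lineno, row))  # log and inspect later
        else:
            good.append(row)
    return good, quarantined

data = io.StringIO('id,region,amount\n1,EU,10.5\n2,US\n3,"APAC, JP",7.0\n')
good, bad = parse_csv(data, expected_cols=3)
# row 2 (two columns) is quarantined; the quoted comma in row 3 parses correctly
```

Because `csv.reader` consumes the stream lazily, the same loop works for multi-gigabyte files.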
78
+
79
+ ### JSON Files
80
+ - Handle nested structures: flatten for warehouse loading or store as JSON column
81
+ - Use JSON Lines (newline-delimited JSON) for large datasets
82
+ - Validate against expected schema; handle missing and extra fields
83
+ - Parse dates and timestamps consistently
84
+
85
+ ### Cloud Storage Integration
86
+ - **S3** — Use `aws s3 cp/sync` or boto3; leverage S3 event notifications for trigger-based ingestion
87
+ - **GCS** — Use `gsutil` or google-cloud-storage library; use Pub/Sub for event notifications
88
+ - **Azure Blob** — Use Azure SDK; leverage Event Grid for notifications
89
+ - Use prefix/partition naming: `s3://bucket/table/year=2024/month=01/day=15/`
90
+ - Implement file manifests to track which files have been processed
91
+
92
+ ### File Best Practices
93
+ - **Naming conventions** — Include date, source, and sequence in filenames: `orders_2024-01-15_001.csv`
94
+ - **Compression** — Use gzip or snappy for storage efficiency; most tools handle compressed files natively
95
+ - **Archiving** — Move processed files to archive prefix/bucket; retain for reprocessing capability
96
+ - **Schema detection** — Infer schema from first N rows; validate against expected schema; alert on changes
97
+
98
+ ---
99
+
100
+ ## Chapter 6: API Ingestion
101
+
102
+ ### REST API Patterns
103
+ - **Authentication** — Handle API keys, OAuth tokens, token refresh; store credentials securely
104
+ - **Pagination** — Implement cursor-based, offset-based, or link-header pagination; prefer cursor-based for consistency
105
+ - **Rate limiting** — Respect rate limit headers (X-RateLimit-Remaining, Retry-After); implement backoff
106
+ - **Retry logic** — Retry on 429 (rate limit) and 5xx (server error) with exponential backoff; don't retry on 4xx (client error)
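
Cursor pagination plus the 429/5xx retry rule can be expressed as one generator. Here `fetch` is an injected stand-in for an HTTP call such as `requests.get`, and the response shape (`records`, `next_cursor`) is an assumed API contract:

```python
import time

def paginate(fetch, url, max_attempts=3):
    """Yield records page by page, following cursors and retrying 429/5xx only."""
    cursor = None
    while True:
        for attempt in range(1, max_attempts + 1):
            status, body = fetch(url, cursor)
            if status == 200:
                break
            if status == 429 or status >= 500:
                if attempt == max_attempts:
                    raise RuntimeError(f"giving up after {attempt} attempts")
                time.sleep(0)  # real code: exponential backoff, honour Retry-After
            else:
                raise RuntimeError(f"client error {status}: do not retry")
        yield from body["records"]
        cursor = body.get("next_cursor")
        if cursor is None:
            return

# A fake API: two pages, with one transient 503 before the second page.
pages = [
    (200, {"records": [1, 2], "next_cursor": "abc"}),
    (503, {}),
    (200, {"records": [3], "next_cursor": None}),
]
def fake_fetch(url, cursor):
    return pages.pop(0)

records = list(paginate(fake_fetch, "https://api.example.com/sales"))
# → [1, 2, 3]
```

Injecting `fetch` keeps the pagination and retry control flow unit-testable without a network.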
107
+
108
+ ### API Data Handling
109
+ - **JSON response parsing** — Extract relevant fields; handle nested objects and arrays
110
+ - **Incremental fetching** — Use modified_since parameters, cursor tokens, or date range filters
111
+ - **Schema changes** — Handle new fields gracefully; log and alert on missing expected fields
112
+ - **Large responses** — Stream responses for large payloads; paginate aggressively
113
+
114
+ ### Webhook Ingestion
115
+ - Set up HTTP endpoints to receive push notifications
116
+ - Validate webhook signatures for security
117
+ - Acknowledge receipt quickly (200 OK); process asynchronously
118
+ - Implement idempotency using event IDs to handle duplicate deliveries
119
+
120
+ ---
121
+
122
+ ## Chapter 7: Streaming Data
123
+
124
+ ### Apache Kafka
125
+ - **Producers** — Serialize events (Avro, JSON, Protobuf); use keys for partition ordering; configure acknowledgments
126
+ - **Consumers** — Use consumer groups for parallel processing; commit offsets after successful processing; handle rebalances
127
+ - **Topics** — Design topic schemas; set retention policies; partition for throughput and ordering requirements
128
+ - **Exactly-once** — Use idempotent producers + transactional consumers; or implement deduplication downstream
129
+
130
+ ### Amazon Kinesis
131
+ - **Streams** — Configure shard count for throughput; use enhanced fan-out for multiple consumers
132
+ - **Firehose** — Direct-to-S3/Redshift delivery; configure buffering interval and size; transform with Lambda
133
+
134
+ ### Stream Processing Patterns
135
+ - **Windowing** — Tumbling, sliding, session windows for aggregation over time
136
+ - **Watermarks** — Handle late-arriving events; define allowed lateness
137
+ - **State management** — Use state stores for aggregations; handle checkpointing and recovery
138
+ - **Dead letter queues** — Route failed events for inspection and reprocessing; don't lose data silently
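
Tumbling windows are the simplest of the three: each event lands in exactly one fixed-size bucket. A minimal sketch over `(epoch_seconds, value)` events:

```python
from collections import defaultdict

def tumbling_window_sum(events, window_seconds):
    """Aggregate (epoch_ts, value) events into fixed, non-overlapping windows."""
    windows = defaultdict(float)
    for ts, value in events:
        window_start = ts - (ts % window_seconds)  # floor to the window boundary
        windows[window_start] += value
    return dict(windows)

events = [(0, 1.0), (30, 2.0), (61, 5.0), (125, 4.0)]
sums = tumbling_window_sum(events, window_seconds=60)
# → {0: 3.0, 60: 5.0, 120: 4.0}
```

Sliding and session windows differ only in how many buckets an event can belong to; a real stream processor also needs watermarks to decide when a bucket is final.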
139
+
140
+ ---
141
+
142
+ ## Chapter 8: Data Storage and Loading
143
+
144
+ ### Amazon Redshift
145
+ - Use `COPY` command for bulk loading from S3; much faster than INSERT
146
+ - Define `SORTKEY` for frequently filtered columns; `DISTKEY` for join columns
147
+ - Use `UNLOAD` for efficient export back to S3
148
+ - Vacuum and analyze tables after large loads
149
+
150
+ ### Google BigQuery
151
+ - Use load jobs for bulk data; streaming inserts for real-time (more expensive)
152
+ - Partition tables by date (ingestion time or column-based) for cost and performance
153
+ - Cluster tables on frequently filtered columns
154
+ - Use external tables for querying data in GCS without loading
155
+
156
+ ### Snowflake
157
+ - Use stages (internal or external) for file-based loading
158
+ - `COPY INTO` for bulk loads from stages; `SNOWPIPE` for continuous loading
159
+ - Use virtual warehouses sized appropriately for load workloads
160
+ - Leverage Time Travel for data recovery and auditing
161
+
162
+ ### General Loading Practices
163
+ - **Staging tables** — Always load to staging first; validate before merging to production
164
+ - **Atomic swaps** — Use table rename or partition swap for atomic updates
165
+ - **Data types** — Map source types carefully; avoid implicit conversions; use appropriate precision
166
+ - **Compression** — Let warehouse handle compression; load compressed files when supported
167
+ - **Partitioning** — Partition by date for time-series data; by key for lookup tables
168
+ - **Clustering** — Cluster on frequently filtered columns within partitions
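
The staging-then-atomic-swap pattern reduces to loading into a shadow table and renaming inside one transaction. A helper that emits the statements; names are illustrative and exact rename syntax varies by warehouse:

```python
def atomic_swap_statements(table: str) -> list[str]:
    """Swap a validated staging table into place in a single transaction."""
    return [
        "BEGIN",
        f"ALTER TABLE {table} RENAME TO {table}_old",
        f"ALTER TABLE {table}_staging RENAME TO {table}",
        f"DROP TABLE {table}_old",
        "COMMIT",
    ]

stmts = atomic_swap_statements("fact_sales")
```

Readers querying `fact_sales` see either the old table or the new one, never a half-loaded state.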
169
+
170
+ ---
171
+
172
+ ## Chapter 9: Data Transformations
173
+
174
+ ### SQL-Based Transforms
175
+ - Use CTEs (Common Table Expressions) for readability
176
+ - Prefer window functions over self-joins for ranking, running totals
177
+ - Use CASE expressions for conditional logic
178
+ - Aggregate at the right grain; avoid fan-out joins that multiply rows
179
+
180
+ ### dbt (Data Build Tool)
181
+ - **Staging models** — 1:1 with source tables; rename columns, cast types, filter deleted records
182
+ - **Intermediate models** — Business logic joins, complex calculations, deduplication
183
+ - **Mart models** — Final analytics-ready tables; optimized for dashboard queries
184
+ - **Incremental models** — Process only new/changed data; use `unique_key` for merge strategy
185
+ - **Tests** — `not_null`, `unique`, `accepted_values`, `relationships` on key columns; custom data tests
186
+ - **Sources** — Define sources in YAML; use `source()` macro for lineage tracking; freshness checks
187
+
188
+ ### Python-Based Transforms
189
+ - Use pandas for small-medium datasets; PySpark or Dask for large-scale processing
190
+ - Write pure functions for transformations; make them testable
191
+ - Handle data types explicitly; don't rely on inference for production pipelines
192
+ - Use vectorized operations; avoid row-by-row iteration
193
+
194
+ ### Transform Best Practices
195
+ - **Layered architecture** — Raw → Staging → Intermediate → Mart; each layer has clear purpose
196
+ - **Single responsibility** — Each model/transform does one thing well
197
+ - **Documented logic** — Comment complex business rules; maintain a data dictionary
198
+ - **Version controlled** — All transformation code in git; review changes via PR
199
+
200
+ ---
201
+
202
+ ## Chapter 10: Data Validation and Testing
203
+
204
+ ### Validation Types
205
+ - **Schema validation** — Verify column names, types, and count match expectations
206
+ - **Row count checks** — Compare source and destination row counts; alert on significant discrepancies
207
+ - **Null checks** — Assert key columns are not null; track null percentages for optional columns
208
+ - **Uniqueness checks** — Verify primary key uniqueness in destination tables
209
+ - **Referential integrity** — Check foreign key relationships between tables
210
+ - **Range checks** — Validate values fall within expected ranges (dates, amounts, percentages)
211
+ - **Freshness checks** — Verify data is not stale; alert when max timestamp is older than threshold
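
Two of the checks above take only a few lines each; sketched here over plain dict rows, with column names as examples:

```python
from datetime import datetime, timedelta

def null_rate(rows, column):
    """Fraction of rows where the column is missing or None."""
    if not rows:
        return 0.0
    nulls = sum(1 for r in rows if r.get(column) is None)
    return nulls / len(rows)

def is_fresh(rows, column, max_age, now):
    """True if the newest timestamp is within the freshness SLA."""
    newest = max(r[column] for r in rows)
    return now - newest <= max_age

rows = [
    {"id": 1, "region": "EU", "loaded_at": datetime(2024, 1, 15, 8, 0)},
    {"id": 2, "region": None, "loaded_at": datetime(2024, 1, 15, 9, 0)},
]
rate = null_rate(rows, "region")  # → 0.5
fresh = is_fresh(rows, "loaded_at", max_age=timedelta(hours=2),
                 now=datetime(2024, 1, 15, 10, 0))  # newest row is 1h old → True
```

In a real pipeline these run as a validation step whose failure blocks the downstream merge.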
212
+
213
+ ### Great Expectations
214
+ - Define expectations as code; version control them
215
+ - Run validations as pipeline steps; fail pipeline on critical expectation failures
216
+ - Generate data documentation from expectations
217
+ - Use checkpoints for scheduled validation runs
218
+
219
+ ### Testing Practices
220
+ - **Unit tests** — Test individual transformation functions with known inputs/outputs
221
+ - **Integration tests** — Test end-to-end pipeline with sample data
222
+ - **Regression tests** — Compare current results against known-good baselines
223
+ - **Data contracts** — Define and enforce schemas between producer and consumer teams
224
+
225
+ ---
226
+
227
+ ## Chapter 11: Orchestration
228
+
229
+ ### Apache Airflow
230
+ - **DAGs** — Define pipelines as Directed Acyclic Graphs; each node is a task
231
+ - **Operators** — Use appropriate operators: PythonOperator, BashOperator, provider operators (BigQueryOperator, S3ToRedshiftOperator)
232
+ - **Scheduling** — Use cron expressions or timedelta; set `start_date` and `catchup` appropriately
233
+ - **Dependencies** — Use `>>` operator or `set_upstream/downstream`; keep DAGs shallow and wide
234
+ - **XComs** — Pass small metadata between tasks (row counts, file paths); NOT large datasets
235
+ - **Sensors** — Wait for external conditions (file arrival, partition availability); use with timeout
236
+ - **Variables and Connections** — Store config in Airflow Variables; credentials in Connections
237
+ - **Pools** — Limit concurrency for resource-constrained tasks (database connections, API rate limits)
238
+
239
+ ### DAG Design Patterns
240
+ - **One pipeline per DAG** — Keep DAGs focused; avoid mega-DAGs that do everything
241
+ - **Idempotent tasks** — Every task can be re-run safely; use `execution_date` for parameterization
242
+ - **Task granularity** — Tasks should be atomic and independently retryable; not too fine (overhead) or coarse (blast radius)
243
+ - **Error handling** — Use `on_failure_callback` for alerting; `retries` and `retry_delay` for transient failures
244
+ - **Backfilling** — Use `airflow backfill` for historical reprocessing; ensure tasks support date parameterization
245
+
246
+ ---
247
+
248
+ ## Chapter 12: Monitoring and Alerting
249
+
250
+ ### Pipeline Health Metrics
251
+ - **Duration** — Track execution time; alert on runs significantly longer than historical average
252
+ - **Row counts** — Track records processed per run; alert on zero rows or dramatic changes
253
+ - **Error rates** — Track failed records, retries, exceptions; alert on elevated error rates
254
+ - **Data freshness** — Track max timestamp in destination; alert when data is staler than SLA
255
+ - **Resource usage** — Track CPU, memory, disk, network; alert on resource exhaustion
256
+
257
+ ### Alerting Strategies
258
+ - **SLA-based** — Define delivery SLAs; alert when pipelines miss their windows
259
+ - **Anomaly-based** — Detect deviations from historical patterns (row counts, durations, values)
260
+ - **Threshold-based** — Alert on fixed thresholds (error rate > 5%, null rate > 10%)
261
+ - **Escalation** — Define severity levels; route alerts appropriately (Slack, PagerDuty, email)
262
+
263
+ ### Logging
264
+ - Log at each pipeline stage: extraction start/end, row counts, load confirmation
265
+ - Include correlation IDs to trace records through the pipeline
266
+ - Store logs centrally for searchability (ELK, CloudWatch, Stackdriver)
267
+ - Retain logs for debugging and audit compliance
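
A correlation ID can be attached once with a `LoggerAdapter` rather than repeated at every call site; the run-ID format and list-capturing handler here are just for illustration:

```python
import logging

class RunContext(logging.LoggerAdapter):
    """Prefix every message with the pipeline run ID for end-to-end tracing."""
    def process(self, msg, kwargs):
        return f"[run={self.extra['run_id']}] {msg}", kwargs

class ListHandler(logging.Handler):
    """Capture formatted messages in a list (stands in for a real log sink)."""
    def __init__(self, sink):
        super().__init__()
        self.sink = sink
    def emit(self, record):
        self.sink.append(record.getMessage())

captured = []
base = logging.getLogger("pipeline")
base.addHandler(ListHandler(captured))
base.setLevel(logging.INFO)
base.propagate = False

log = RunContext(base, {"run_id": "2024-01-15T0800-a1b2"})
log.info("extracted %d rows", 1200)
# captured[0] == "[run=2024-01-15T0800-a1b2] extracted 1200 rows"
```

Every message emitted through the adapter now carries the run ID, so one grep reconstructs a whole run.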
268
+
269
+ ---
270
+
271
+ ## Chapter 13: Best Practices
272
+
273
+ ### Idempotency
274
+ - Use DELETE+INSERT by date partition for fact tables
275
+ - Use MERGE/upsert with natural keys for dimension tables
276
+ - Use staging tables as intermediary; clean up on both success and failure
277
+ - Test by running pipeline twice; verify no data duplication
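
The run-it-twice test is easy to demonstrate with an upsert-style load. SQLite syntax and the `dim_customer` table are stand-ins; MERGE plays the same role in a warehouse:

```python
import sqlite3

def load(conn, records):
    """Idempotent load: natural-key upsert, safe to re-run."""
    conn.executemany(
        "INSERT INTO dim_customer (customer_id, name) VALUES (?, ?) "
        "ON CONFLICT (customer_id) DO UPDATE SET name = excluded.name",
        records,
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT)"
)
records = [(1, "Ada"), (2, "Grace")]
load(conn, records)
load(conn, records)  # second run must not duplicate rows
count = conn.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]
# → 2, proving the re-run changed nothing
```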
278
+
279
+ ### Backfilling
280
+ - Parameterize all pipelines by date range (start_date, end_date)
281
+ - Use Airflow `execution_date` or equivalent for date-aware runs
282
+ - Test backfill on a small date range before running full historical reprocess
283
+ - Monitor resource usage during backfill; may need to throttle parallelism
284
+
285
+ ### Error Handling
286
+ - Retry transient failures (network timeouts, rate limits) with exponential backoff
287
+ - Fail fast on permanent errors (authentication failure, missing source table)
288
+ - Quarantine bad records; don't let one bad row fail the entire pipeline
289
+ - Send alerts with actionable context (error message, affected table, run ID)
290
+
291
+ ### Data Lineage and Documentation
292
+ - Track source-to-destination mappings for every table
293
+ - Document transformation logic, especially business rules
294
+ - Maintain a data dictionary with column descriptions and types
295
+ - Use tools like dbt docs, DataHub, or Amundsen for automated lineage
296
+
297
+ ### Security
298
+ - Never hardcode credentials; use secrets managers (AWS Secrets Manager, HashiCorp Vault)
299
+ - Encrypt data in transit (TLS) and at rest (warehouse encryption)
300
+ - Use least-privilege IAM roles for pipeline service accounts
301
+ - Audit access to sensitive data; mask PII in non-production environments