@booklib/core 2.0.0

Files changed (374)
  1. package/.cursor/rules/booklib-standards.mdc +40 -0
  2. package/.gemini/context.md +372 -0
  3. package/AGENTS.md +166 -0
  4. package/CHANGELOG.md +226 -0
  5. package/CLAUDE.md +81 -0
  6. package/CODE_OF_CONDUCT.md +31 -0
  7. package/CONTRIBUTING.md +304 -0
  8. package/LICENSE +21 -0
  9. package/PLAN.md +28 -0
  10. package/README.ja.md +198 -0
  11. package/README.ko.md +198 -0
  12. package/README.md +503 -0
  13. package/README.pt-BR.md +198 -0
  14. package/README.uk.md +241 -0
  15. package/README.zh-CN.md +198 -0
  16. package/SECURITY.md +9 -0
  17. package/agents/architecture-reviewer.md +136 -0
  18. package/agents/booklib-reviewer.md +90 -0
  19. package/agents/data-reviewer.md +107 -0
  20. package/agents/jvm-reviewer.md +146 -0
  21. package/agents/python-reviewer.md +128 -0
  22. package/agents/rust-reviewer.md +115 -0
  23. package/agents/ts-reviewer.md +110 -0
  24. package/agents/ui-reviewer.md +117 -0
  25. package/assets/logo.svg +36 -0
  26. package/bin/booklib-mcp.js +304 -0
  27. package/bin/booklib.js +1705 -0
  28. package/bin/skills.cjs +1292 -0
  29. package/booklib-router.mdc +36 -0
  30. package/booklib.config.json +19 -0
  31. package/commands/animation-at-work.md +10 -0
  32. package/commands/clean-code-reviewer.md +10 -0
  33. package/commands/data-intensive-patterns.md +10 -0
  34. package/commands/data-pipelines.md +10 -0
  35. package/commands/design-patterns.md +10 -0
  36. package/commands/domain-driven-design.md +10 -0
  37. package/commands/effective-java.md +10 -0
  38. package/commands/effective-kotlin.md +10 -0
  39. package/commands/effective-python.md +10 -0
  40. package/commands/effective-typescript.md +10 -0
  41. package/commands/kotlin-in-action.md +10 -0
  42. package/commands/lean-startup.md +10 -0
  43. package/commands/microservices-patterns.md +10 -0
  44. package/commands/programming-with-rust.md +10 -0
  45. package/commands/refactoring-ui.md +10 -0
  46. package/commands/rust-in-action.md +10 -0
  47. package/commands/skill-router.md +10 -0
  48. package/commands/spring-boot-in-action.md +10 -0
  49. package/commands/storytelling-with-data.md +10 -0
  50. package/commands/system-design-interview.md +10 -0
  51. package/commands/using-asyncio-python.md +10 -0
  52. package/commands/web-scraping-python.md +10 -0
  53. package/community/registry.json +1616 -0
  54. package/hooks/hooks.json +23 -0
  55. package/hooks/posttooluse-capture.mjs +67 -0
  56. package/hooks/suggest.js +153 -0
  57. package/lib/agent-behaviors.js +40 -0
  58. package/lib/agent-detector.js +96 -0
  59. package/lib/config-loader.js +39 -0
  60. package/lib/conflict-resolver.js +148 -0
  61. package/lib/context-builder.js +574 -0
  62. package/lib/discovery-engine.js +298 -0
  63. package/lib/doctor/hook-installer.js +83 -0
  64. package/lib/doctor/usage-tracker.js +87 -0
  65. package/lib/engine/ai-features.js +253 -0
  66. package/lib/engine/auditor.js +103 -0
  67. package/lib/engine/bm25-index.js +178 -0
  68. package/lib/engine/capture.js +120 -0
  69. package/lib/engine/corrections.js +198 -0
  70. package/lib/engine/doctor.js +195 -0
  71. package/lib/engine/graph-injector.js +137 -0
  72. package/lib/engine/graph.js +161 -0
  73. package/lib/engine/handoff.js +405 -0
  74. package/lib/engine/indexer.js +242 -0
  75. package/lib/engine/parser.js +53 -0
  76. package/lib/engine/query-expander.js +42 -0
  77. package/lib/engine/reranker.js +40 -0
  78. package/lib/engine/rrf.js +59 -0
  79. package/lib/engine/scanner.js +151 -0
  80. package/lib/engine/searcher.js +139 -0
  81. package/lib/engine/session-coordinator.js +306 -0
  82. package/lib/engine/session-manager.js +429 -0
  83. package/lib/engine/synthesizer.js +70 -0
  84. package/lib/installer.js +70 -0
  85. package/lib/instinct-block.js +33 -0
  86. package/lib/mcp-config-writer.js +88 -0
  87. package/lib/paths.js +57 -0
  88. package/lib/profiles/design.md +19 -0
  89. package/lib/profiles/general.md +16 -0
  90. package/lib/profiles/research-analysis.md +22 -0
  91. package/lib/profiles/software-development.md +23 -0
  92. package/lib/profiles/writing-content.md +19 -0
  93. package/lib/project-initializer.js +916 -0
  94. package/lib/registry/skills.js +102 -0
  95. package/lib/registry-searcher.js +99 -0
  96. package/lib/rules/rules-manager.js +169 -0
  97. package/lib/skill-fetcher.js +333 -0
  98. package/lib/well-known-builder.js +70 -0
  99. package/lib/wizard/index.js +404 -0
  100. package/lib/wizard/integration-detector.js +41 -0
  101. package/lib/wizard/project-detector.js +100 -0
  102. package/lib/wizard/prompt.js +156 -0
  103. package/lib/wizard/registry-embeddings.js +107 -0
  104. package/lib/wizard/skill-recommender.js +69 -0
  105. package/llms-full.txt +254 -0
  106. package/llms.txt +70 -0
  107. package/package.json +45 -0
  108. package/research-reports/2026-04-01-current-architecture.md +160 -0
  109. package/research-reports/IDEAS.md +93 -0
  110. package/rules/common/clean-code.md +42 -0
  111. package/rules/java/effective-java.md +42 -0
  112. package/rules/kotlin/effective-kotlin.md +37 -0
  113. package/rules/python/effective-python.md +38 -0
  114. package/rules/rust/rust.md +37 -0
  115. package/rules/typescript/effective-typescript.md +42 -0
  116. package/scripts/gen-llms-full.mjs +36 -0
  117. package/scripts/gen-og.mjs +142 -0
  118. package/scripts/validate-frontmatter.js +25 -0
  119. package/skills/animation-at-work/SKILL.md +270 -0
  120. package/skills/animation-at-work/assets/example_asset.txt +1 -0
  121. package/skills/animation-at-work/evals/evals.json +44 -0
  122. package/skills/animation-at-work/evals/results.json +13 -0
  123. package/skills/animation-at-work/examples/after.md +64 -0
  124. package/skills/animation-at-work/examples/before.md +35 -0
  125. package/skills/animation-at-work/references/api_reference.md +369 -0
  126. package/skills/animation-at-work/references/review-checklist.md +79 -0
  127. package/skills/animation-at-work/scripts/audit_animations.py +295 -0
  128. package/skills/animation-at-work/scripts/example.py +1 -0
  129. package/skills/clean-code-reviewer/SKILL.md +444 -0
  130. package/skills/clean-code-reviewer/audit.json +35 -0
  131. package/skills/clean-code-reviewer/evals/evals.json +185 -0
  132. package/skills/clean-code-reviewer/evals/results.json +13 -0
  133. package/skills/clean-code-reviewer/examples/after.md +48 -0
  134. package/skills/clean-code-reviewer/examples/before.md +33 -0
  135. package/skills/clean-code-reviewer/references/api_reference.md +158 -0
  136. package/skills/clean-code-reviewer/references/practices-catalog.md +282 -0
  137. package/skills/clean-code-reviewer/references/review-checklist.md +254 -0
  138. package/skills/clean-code-reviewer/scripts/pre-review.py +206 -0
  139. package/skills/data-intensive-patterns/SKILL.md +267 -0
  140. package/skills/data-intensive-patterns/assets/example_asset.txt +1 -0
  141. package/skills/data-intensive-patterns/evals/evals.json +54 -0
  142. package/skills/data-intensive-patterns/evals/results.json +13 -0
  143. package/skills/data-intensive-patterns/examples/after.md +61 -0
  144. package/skills/data-intensive-patterns/examples/before.md +38 -0
  145. package/skills/data-intensive-patterns/references/api_reference.md +34 -0
  146. package/skills/data-intensive-patterns/references/patterns-catalog.md +551 -0
  147. package/skills/data-intensive-patterns/references/review-checklist.md +193 -0
  148. package/skills/data-intensive-patterns/scripts/adr.py +213 -0
  149. package/skills/data-intensive-patterns/scripts/example.py +1 -0
  150. package/skills/data-pipelines/SKILL.md +259 -0
  151. package/skills/data-pipelines/assets/example_asset.txt +1 -0
  152. package/skills/data-pipelines/evals/evals.json +45 -0
  153. package/skills/data-pipelines/evals/results.json +13 -0
  154. package/skills/data-pipelines/examples/after.md +97 -0
  155. package/skills/data-pipelines/examples/before.md +37 -0
  156. package/skills/data-pipelines/references/api_reference.md +301 -0
  157. package/skills/data-pipelines/references/review-checklist.md +181 -0
  158. package/skills/data-pipelines/scripts/example.py +1 -0
  159. package/skills/data-pipelines/scripts/new_pipeline.py +444 -0
  160. package/skills/design-patterns/SKILL.md +271 -0
  161. package/skills/design-patterns/assets/example_asset.txt +1 -0
  162. package/skills/design-patterns/evals/evals.json +46 -0
  163. package/skills/design-patterns/evals/results.json +13 -0
  164. package/skills/design-patterns/examples/after.md +52 -0
  165. package/skills/design-patterns/examples/before.md +29 -0
  166. package/skills/design-patterns/references/api_reference.md +1 -0
  167. package/skills/design-patterns/references/patterns-catalog.md +726 -0
  168. package/skills/design-patterns/references/review-checklist.md +173 -0
  169. package/skills/design-patterns/scripts/example.py +1 -0
  170. package/skills/design-patterns/scripts/scaffold.py +807 -0
  171. package/skills/domain-driven-design/SKILL.md +142 -0
  172. package/skills/domain-driven-design/assets/example_asset.txt +1 -0
  173. package/skills/domain-driven-design/evals/evals.json +48 -0
  174. package/skills/domain-driven-design/evals/results.json +13 -0
  175. package/skills/domain-driven-design/examples/after.md +80 -0
  176. package/skills/domain-driven-design/examples/before.md +43 -0
  177. package/skills/domain-driven-design/references/api_reference.md +1 -0
  178. package/skills/domain-driven-design/references/patterns-catalog.md +545 -0
  179. package/skills/domain-driven-design/references/review-checklist.md +158 -0
  180. package/skills/domain-driven-design/scripts/example.py +1 -0
  181. package/skills/domain-driven-design/scripts/scaffold.py +421 -0
  182. package/skills/effective-java/SKILL.md +227 -0
  183. package/skills/effective-java/assets/example_asset.txt +1 -0
  184. package/skills/effective-java/evals/evals.json +46 -0
  185. package/skills/effective-java/evals/results.json +13 -0
  186. package/skills/effective-java/examples/after.md +83 -0
  187. package/skills/effective-java/examples/before.md +37 -0
  188. package/skills/effective-java/references/api_reference.md +1 -0
  189. package/skills/effective-java/references/items-catalog.md +955 -0
  190. package/skills/effective-java/references/review-checklist.md +216 -0
  191. package/skills/effective-java/scripts/checkstyle_setup.py +211 -0
  192. package/skills/effective-java/scripts/example.py +1 -0
  193. package/skills/effective-kotlin/SKILL.md +271 -0
  194. package/skills/effective-kotlin/assets/example_asset.txt +1 -0
  195. package/skills/effective-kotlin/audit.json +29 -0
  196. package/skills/effective-kotlin/evals/evals.json +45 -0
  197. package/skills/effective-kotlin/evals/results.json +13 -0
  198. package/skills/effective-kotlin/examples/after.md +36 -0
  199. package/skills/effective-kotlin/examples/before.md +38 -0
  200. package/skills/effective-kotlin/references/api_reference.md +1 -0
  201. package/skills/effective-kotlin/references/practices-catalog.md +1228 -0
  202. package/skills/effective-kotlin/references/review-checklist.md +126 -0
  203. package/skills/effective-kotlin/scripts/example.py +1 -0
  204. package/skills/effective-python/SKILL.md +441 -0
  205. package/skills/effective-python/evals/evals.json +44 -0
  206. package/skills/effective-python/evals/results.json +13 -0
  207. package/skills/effective-python/examples/after.md +56 -0
  208. package/skills/effective-python/examples/before.md +40 -0
  209. package/skills/effective-python/ref-01-pythonic-thinking.md +202 -0
  210. package/skills/effective-python/ref-02-lists-and-dicts.md +146 -0
  211. package/skills/effective-python/ref-03-functions.md +186 -0
  212. package/skills/effective-python/ref-04-comprehensions-generators.md +211 -0
  213. package/skills/effective-python/ref-05-classes-interfaces.md +188 -0
  214. package/skills/effective-python/ref-06-metaclasses-attributes.md +209 -0
  215. package/skills/effective-python/ref-07-concurrency.md +213 -0
  216. package/skills/effective-python/ref-08-robustness-performance.md +248 -0
  217. package/skills/effective-python/ref-09-testing-debugging.md +253 -0
  218. package/skills/effective-python/ref-10-collaboration.md +175 -0
  219. package/skills/effective-python/references/api_reference.md +218 -0
  220. package/skills/effective-python/references/practices-catalog.md +483 -0
  221. package/skills/effective-python/references/review-checklist.md +190 -0
  222. package/skills/effective-python/scripts/lint.py +173 -0
  223. package/skills/effective-typescript/SKILL.md +262 -0
  224. package/skills/effective-typescript/audit.json +29 -0
  225. package/skills/effective-typescript/evals/evals.json +37 -0
  226. package/skills/effective-typescript/evals/results.json +13 -0
  227. package/skills/effective-typescript/examples/after.md +70 -0
  228. package/skills/effective-typescript/examples/before.md +47 -0
  229. package/skills/effective-typescript/references/api_reference.md +118 -0
  230. package/skills/effective-typescript/references/practices-catalog.md +371 -0
  231. package/skills/effective-typescript/scripts/review.py +169 -0
  232. package/skills/kotlin-in-action/SKILL.md +261 -0
  233. package/skills/kotlin-in-action/assets/example_asset.txt +1 -0
  234. package/skills/kotlin-in-action/evals/evals.json +43 -0
  235. package/skills/kotlin-in-action/evals/results.json +13 -0
  236. package/skills/kotlin-in-action/examples/after.md +53 -0
  237. package/skills/kotlin-in-action/examples/before.md +39 -0
  238. package/skills/kotlin-in-action/references/api_reference.md +1 -0
  239. package/skills/kotlin-in-action/references/practices-catalog.md +436 -0
  240. package/skills/kotlin-in-action/references/review-checklist.md +204 -0
  241. package/skills/kotlin-in-action/scripts/example.py +1 -0
  242. package/skills/kotlin-in-action/scripts/setup_detekt.py +224 -0
  243. package/skills/lean-startup/SKILL.md +160 -0
  244. package/skills/lean-startup/assets/example_asset.txt +1 -0
  245. package/skills/lean-startup/evals/evals.json +43 -0
  246. package/skills/lean-startup/evals/results.json +13 -0
  247. package/skills/lean-startup/examples/after.md +80 -0
  248. package/skills/lean-startup/examples/before.md +34 -0
  249. package/skills/lean-startup/references/api_reference.md +319 -0
  250. package/skills/lean-startup/references/review-checklist.md +137 -0
  251. package/skills/lean-startup/scripts/example.py +1 -0
  252. package/skills/lean-startup/scripts/new_experiment.py +286 -0
  253. package/skills/microservices-patterns/SKILL.md +384 -0
  254. package/skills/microservices-patterns/evals/evals.json +45 -0
  255. package/skills/microservices-patterns/evals/results.json +13 -0
  256. package/skills/microservices-patterns/examples/after.md +69 -0
  257. package/skills/microservices-patterns/examples/before.md +40 -0
  258. package/skills/microservices-patterns/references/patterns-catalog.md +391 -0
  259. package/skills/microservices-patterns/references/review-checklist.md +169 -0
  260. package/skills/microservices-patterns/scripts/new_service.py +583 -0
  261. package/skills/programming-with-rust/SKILL.md +209 -0
  262. package/skills/programming-with-rust/evals/evals.json +37 -0
  263. package/skills/programming-with-rust/evals/results.json +13 -0
  264. package/skills/programming-with-rust/examples/after.md +107 -0
  265. package/skills/programming-with-rust/examples/before.md +59 -0
  266. package/skills/programming-with-rust/references/api_reference.md +152 -0
  267. package/skills/programming-with-rust/references/practices-catalog.md +335 -0
  268. package/skills/programming-with-rust/scripts/review.py +142 -0
  269. package/skills/refactoring-ui/SKILL.md +362 -0
  270. package/skills/refactoring-ui/assets/example_asset.txt +1 -0
  271. package/skills/refactoring-ui/evals/evals.json +45 -0
  272. package/skills/refactoring-ui/evals/results.json +13 -0
  273. package/skills/refactoring-ui/examples/after.md +85 -0
  274. package/skills/refactoring-ui/examples/before.md +58 -0
  275. package/skills/refactoring-ui/references/api_reference.md +355 -0
  276. package/skills/refactoring-ui/references/review-checklist.md +114 -0
  277. package/skills/refactoring-ui/scripts/audit_css.py +250 -0
  278. package/skills/refactoring-ui/scripts/example.py +1 -0
  279. package/skills/rust-in-action/SKILL.md +350 -0
  280. package/skills/rust-in-action/evals/evals.json +38 -0
  281. package/skills/rust-in-action/evals/results.json +13 -0
  282. package/skills/rust-in-action/examples/after.md +156 -0
  283. package/skills/rust-in-action/examples/before.md +56 -0
  284. package/skills/rust-in-action/references/practices-catalog.md +346 -0
  285. package/skills/rust-in-action/scripts/review.py +147 -0
  286. package/skills/skill-router/SKILL.md +186 -0
  287. package/skills/skill-router/evals/evals.json +38 -0
  288. package/skills/skill-router/evals/results.json +13 -0
  289. package/skills/skill-router/examples/after.md +63 -0
  290. package/skills/skill-router/examples/before.md +39 -0
  291. package/skills/skill-router/references/api_reference.md +24 -0
  292. package/skills/skill-router/references/routing-heuristics.md +89 -0
  293. package/skills/skill-router/references/skill-catalog.md +174 -0
  294. package/skills/skill-router/scripts/route.py +266 -0
  295. package/skills/spring-boot-in-action/SKILL.md +340 -0
  296. package/skills/spring-boot-in-action/evals/evals.json +39 -0
  297. package/skills/spring-boot-in-action/evals/results.json +13 -0
  298. package/skills/spring-boot-in-action/examples/after.md +185 -0
  299. package/skills/spring-boot-in-action/examples/before.md +84 -0
  300. package/skills/spring-boot-in-action/references/practices-catalog.md +403 -0
  301. package/skills/spring-boot-in-action/scripts/review.py +184 -0
  302. package/skills/storytelling-with-data/SKILL.md +241 -0
  303. package/skills/storytelling-with-data/assets/example_asset.txt +1 -0
  304. package/skills/storytelling-with-data/evals/evals.json +47 -0
  305. package/skills/storytelling-with-data/evals/results.json +13 -0
  306. package/skills/storytelling-with-data/examples/after.md +50 -0
  307. package/skills/storytelling-with-data/examples/before.md +33 -0
  308. package/skills/storytelling-with-data/references/api_reference.md +379 -0
  309. package/skills/storytelling-with-data/references/review-checklist.md +111 -0
  310. package/skills/storytelling-with-data/scripts/chart_review.py +301 -0
  311. package/skills/storytelling-with-data/scripts/example.py +1 -0
  312. package/skills/system-design-interview/SKILL.md +233 -0
  313. package/skills/system-design-interview/assets/example_asset.txt +1 -0
  314. package/skills/system-design-interview/evals/evals.json +46 -0
  315. package/skills/system-design-interview/evals/results.json +13 -0
  316. package/skills/system-design-interview/examples/after.md +94 -0
  317. package/skills/system-design-interview/examples/before.md +27 -0
  318. package/skills/system-design-interview/references/api_reference.md +582 -0
  319. package/skills/system-design-interview/references/review-checklist.md +201 -0
  320. package/skills/system-design-interview/scripts/example.py +1 -0
  321. package/skills/system-design-interview/scripts/new_design.py +421 -0
  322. package/skills/using-asyncio-python/SKILL.md +290 -0
  323. package/skills/using-asyncio-python/assets/example_asset.txt +1 -0
  324. package/skills/using-asyncio-python/evals/evals.json +43 -0
  325. package/skills/using-asyncio-python/evals/results.json +13 -0
  326. package/skills/using-asyncio-python/examples/after.md +68 -0
  327. package/skills/using-asyncio-python/examples/before.md +39 -0
  328. package/skills/using-asyncio-python/references/api_reference.md +267 -0
  329. package/skills/using-asyncio-python/references/review-checklist.md +149 -0
  330. package/skills/using-asyncio-python/scripts/check_blocking.py +270 -0
  331. package/skills/using-asyncio-python/scripts/example.py +1 -0
  332. package/skills/web-scraping-python/SKILL.md +280 -0
  333. package/skills/web-scraping-python/assets/example_asset.txt +1 -0
  334. package/skills/web-scraping-python/evals/evals.json +46 -0
  335. package/skills/web-scraping-python/evals/results.json +13 -0
  336. package/skills/web-scraping-python/examples/after.md +109 -0
  337. package/skills/web-scraping-python/examples/before.md +40 -0
  338. package/skills/web-scraping-python/references/api_reference.md +393 -0
  339. package/skills/web-scraping-python/references/review-checklist.md +163 -0
  340. package/skills/web-scraping-python/scripts/example.py +1 -0
  341. package/skills/web-scraping-python/scripts/new_scraper.py +231 -0
  342. package/skills/writing-plans/audit.json +34 -0
  343. package/tests/agent-detector.test.js +83 -0
  344. package/tests/corrections.test.js +245 -0
  345. package/tests/doctor/hook-installer.test.js +72 -0
  346. package/tests/doctor/usage-tracker.test.js +140 -0
  347. package/tests/engine/benchmark-eval.test.js +31 -0
  348. package/tests/engine/bm25-index.test.js +85 -0
  349. package/tests/engine/capture-command.test.js +35 -0
  350. package/tests/engine/capture.test.js +17 -0
  351. package/tests/engine/graph-augmented-search.test.js +107 -0
  352. package/tests/engine/graph-injector.test.js +44 -0
  353. package/tests/engine/graph.test.js +216 -0
  354. package/tests/engine/hybrid-searcher.test.js +74 -0
  355. package/tests/engine/indexer-bm25.test.js +37 -0
  356. package/tests/engine/mcp-tools.test.js +73 -0
  357. package/tests/engine/project-initializer-mcp.test.js +99 -0
  358. package/tests/engine/query-expander.test.js +36 -0
  359. package/tests/engine/reranker.test.js +51 -0
  360. package/tests/engine/rrf.test.js +49 -0
  361. package/tests/engine/srag-prefix.test.js +47 -0
  362. package/tests/instinct-block.test.js +23 -0
  363. package/tests/mcp-config-writer.test.js +60 -0
  364. package/tests/project-initializer-new-agents.test.js +48 -0
  365. package/tests/rules/rules-manager.test.js +230 -0
  366. package/tests/well-known-builder.test.js +40 -0
  367. package/tests/wizard/integration-detector.test.js +31 -0
  368. package/tests/wizard/project-detector.test.js +51 -0
  369. package/tests/wizard/prompt-session.test.js +61 -0
  370. package/tests/wizard/prompt.test.js +16 -0
  371. package/tests/wizard/registry-embeddings.test.js +35 -0
  372. package/tests/wizard/skill-recommender.test.js +34 -0
  373. package/tests/wizard/slot-count.test.js +25 -0
  374. package/vercel.json +21 -0
@@ -0,0 +1,193 @@
1
+ # Data-Intensive Applications Code Review Checklist
2
+
3
+ Use this checklist when reviewing data-intensive application code. Work through each section
4
+ and flag any violations. Not every section applies to every review — skip sections
5
+ that aren't relevant to the code under review.
6
+
7
+ ---
8
+
9
+ ## 1. Data Modeling
10
+
11
+ - [ ] Data model fits the application's access patterns (relational, document, graph, event log)
12
+ - [ ] Relationships are modeled appropriately (joins vs. embedding vs. references)
13
+ - [ ] Schema is explicit or schema-on-read strategy is intentional and documented
14
+ - [ ] No impedance mismatch — application objects map cleanly to storage model
15
+ - [ ] Normalization level is appropriate (not over-normalized for a document store, not under-normalized for relational)
16
+
17
+ **Red flags**: Forcing graph-like traversals through a relational model with recursive joins.
18
+ Storing deeply nested JSON in a relational column then parsing it in application code.
19
+ Document model with many-to-many relationships handled by manual application-side joins.
20
+
21
+ ---
22
+
23
+ ## 2. Storage Engine and Indexing
24
+
25
+ - [ ] Storage engine matches workload characteristics (write-heavy → LSM; read-heavy → B-tree)
26
+ - [ ] Indexes exist for common query patterns
27
+ - [ ] No unnecessary indexes (each index slows down writes)
28
+ - [ ] Column-oriented storage used for analytical/OLAP workloads
29
+ - [ ] Materialized views or data cubes used where pre-aggregation helps
30
+ - [ ] Compaction strategy is configured appropriately for LSM-based stores
31
+
32
+ **Red flags**: Full table scans on large tables due to missing indexes. Using a row-oriented
33
+ store for analytical queries scanning millions of rows. Write-heavy workload on a database
34
+ optimized for reads without considering LSM alternatives.
35
+
36
+ ---
37
+
38
+ ## 3. Encoding and Schema Evolution
39
+
40
+ - [ ] Serialization format supports forward and backward compatibility
41
+ - [ ] Schema registry is in place for Avro/Protobuf-encoded messages
42
+ - [ ] Field tags (Protobuf) or schema resolution (Avro) used for evolution
43
+ - [ ] Old and new code can run simultaneously during rolling deployments
44
+ - [ ] No required fields added in a non-backward-compatible way
45
+ - [ ] Deleted field tags/names are never reused
46
+
47
+ **Red flags**: Using plain JSON for inter-service communication without versioning.
48
+ Adding required fields to Protobuf definitions in production. Encoding changes that break
49
+ consumers during rolling deployments. No schema registry for Kafka topics.
50
+
51
+ ---
52
+
53
+ ## 4. Replication
54
+
55
+ - [ ] Replication topology matches consistency and availability requirements
56
+ - [ ] Failover procedure is tested and documented
57
+ - [ ] Replication lag is monitored and handled in application code
58
+ - [ ] Read-after-write consistency is provided where needed (e.g., read from leader after write)
59
+ - [ ] Split-brain protection exists (fencing tokens, epoch numbers)
60
+ - [ ] For multi-leader: conflict resolution strategy is defined and tested
61
+ - [ ] For leaderless: quorum parameters (w, r, n) are tuned for the workload
62
+
63
+ **Red flags**: Async replication with no monitoring of replication lag. No split-brain protection
64
+ during leader failover. Using LWW for conflict resolution in multi-leader setup where data loss
65
+ is unacceptable. Quorum condition not met (r + w ≤ n), allowing stale reads.
66
+
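The quorum item above can be checked mechanically. The following is a minimal sketch, assuming a leaderless store with `n` replicas, write quorum `w`, and read quorum `r`; the function name `quorum_overlaps` is hypothetical and not part of the package.

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """Return True when every read quorum must intersect every write quorum."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must each be between 1 and n")
    # With w + r > n, at least one replica in any read set has seen the latest write.
    return w + r > n


print(quorum_overlaps(n=3, w=2, r=2))  # True
print(quorum_overlaps(n=3, w=1, r=1))  # False
```

Overlap alone does not guarantee linearizability (concurrent writes and sloppy quorums still need handling), so treat this as a configuration sanity check only.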
67
+ ---
68
+
69
+ ## 5. Partitioning
70
+
71
+ - [ ] Partition key distributes load evenly (no hot partitions)
72
+ - [ ] Partition strategy matches access patterns (key-range for scans, hash for uniform)
73
+ - [ ] Cross-partition queries are minimized or explicitly handled
74
+ - [ ] Secondary index strategy is chosen (local vs global) with trade-offs understood
75
+ - [ ] Rebalancing approach is defined (fixed partitions, dynamic split, proportional to nodes)
76
+ - [ ] Request routing is in place (client-side, routing tier, or coordinator)
77
+
78
+ **Red flags**: Monotonically increasing keys (timestamps, auto-increment) used as key-range partition
79
+ key, so all writes go to the newest partition. Range queries across hash-partitioned data. No plan
80
+ for rebalancing when adding nodes. Scatter-gather queries hitting all partitions for every read.
81
+
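As an illustration of how partitioning strategy interacts with monotonically increasing keys, here is a small sketch. The partition count, range boundaries, and function names are assumptions chosen for the example, not taken from the package.

```python
import bisect
import hashlib
from collections import Counter

NUM_PARTITIONS = 4
# Hypothetical boundaries: partition i holds keys below RANGE_BOUNDS[i],
# and the last partition holds everything from 3_000 upward.
RANGE_BOUNDS = [1_000, 2_000, 3_000]


def range_partition(key: int) -> int:
    """Key-range partitioning: adjacent keys land in the same partition."""
    return bisect.bisect_right(RANGE_BOUNDS, key)


def hash_partition(key: int) -> int:
    """Hash partitioning: a stable hash spreads adjacent keys apart."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


recent_keys = range(3_000, 3_400)  # e.g. the latest auto-increment IDs
by_range = Counter(range_partition(k) for k in recent_keys)
by_hash = Counter(hash_partition(k) for k in recent_keys)

print(by_range)  # Counter({3: 400})
print(by_hash)   # counts spread across several partitions
```

The trade-off runs the other way for scans: the range scheme serves `3_000..3_399` from one partition, while the hash scheme has to query all of them.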
82
+ ---
83
+
84
+ ## 6. Transactions and Concurrency
85
+
86
+ - [ ] Isolation level is appropriate for the consistency requirements
87
+ - [ ] Write skew scenarios are identified and mitigated
88
+ - [ ] Phantom reads are prevented where needed (predicate/index-range locks or SSI)
89
+ - [ ] Long-running transactions are avoided (hold locks briefly)
90
+ - [ ] Deadlock detection or timeout is configured
91
+ - [ ] Optimistic concurrency (CAS, version numbers) used where appropriate
92
+
93
+ **Red flags**: Using READ COMMITTED where transactions read-then-write based on stale data
94
+ (lost updates, write skew). SERIALIZABLE isolation everywhere regardless of need (performance waste).
95
+ Missing `SELECT ... FOR UPDATE` where concurrent updates can violate business rules.
96
+ No retry logic for serialization failures under SSI.
97
+
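The optimistic-concurrency item can be sketched with a version column and a compare-and-set `UPDATE`. This is an illustrative example using an in-memory SQLite database; the `accounts` table and the `withdraw` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO accounts VALUES (1, 100, 0)")
conn.commit()


def withdraw(conn, account_id, amount, max_attempts=3):
    """Read, check the invariant, then write only if the version is unchanged."""
    for _ in range(max_attempts):
        balance, version = conn.execute(
            "SELECT balance, version FROM accounts WHERE id = ?", (account_id,)
        ).fetchone()
        if balance < amount:
            return False  # business rule: no overdraft
        cur = conn.execute(
            "UPDATE accounts SET balance = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (balance - amount, account_id, version),
        )
        conn.commit()
        if cur.rowcount == 1:
            return True  # nobody else modified the row in between
        # rowcount == 0 means a concurrent writer won; re-read and retry
    raise RuntimeError("too many concurrent modifications")
```

A pessimistic alternative on databases that support it is `SELECT ... FOR UPDATE`, which holds a row lock instead of retrying.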
98
+ ---
99
+
100
+ ## 7. Distributed Systems Resilience
101
+
102
+ - [ ] All remote calls have timeouts configured
103
+ - [ ] Retries use exponential backoff with jitter
104
+ - [ ] Retry operations are idempotent (idempotency keys present)
105
+ - [ ] Circuit breakers protect against cascading failures
106
+ - [ ] Fencing tokens used for distributed locks/leases
107
+ - [ ] No reliance on wall-clock timestamps for ordering across nodes
108
+ - [ ] Network partitions are handled gracefully (not ignored)
109
+ - [ ] Process pauses (GC, etc.) are accounted for in lease/lock design
110
+
111
+ **Red flags**: HTTP calls without timeouts. Immediate retries without backoff (thundering herd).
112
+ Using `System.currentTimeMillis()` for conflict resolution across nodes. Distributed locks
113
+ without fencing tokens. Assuming clocks are synchronized across nodes.
114
+
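One way to implement the retry items is exponential backoff with "full jitter". The sketch below assumes the wrapped operation is idempotent and signals transient failure with `ConnectionError`; `call_with_retries` and its parameters are hypothetical names.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call operation(), retrying transient failures with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random time up to the exponential ceiling,
            # so many clients do not retry in lockstep.
            ceiling = min(cap, base * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter is injectable so that tests can record delays instead of actually waiting.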
115
+ ---
116
+
117
+ ## 8. Consensus and Coordination
118
+
119
+ - [ ] Leader election uses a proper consensus protocol (not ad-hoc)
120
+ - [ ] Coordination services (ZooKeeper/etcd) used for leader election and configuration
121
+ - [ ] No hand-rolled consensus or distributed locking
122
+ - [ ] 2PC is avoided for cross-service transactions (use sagas instead)
123
+ - [ ] Uniqueness constraints across partitions use linearizable operations
124
+
125
+ **Red flags**: Home-grown leader election using database timestamps. Two-phase commit across
126
+ heterogeneous systems. Distributed lock implemented with Redis SET NX without fencing tokens
127
+ or proper expiration handling. Assumption that ZooKeeper watches are instantaneous.
128
+
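The fencing-token idea mentioned in this section and the previous one can be sketched as follows. It assumes the lock service hands out a monotonically increasing token with each lease and that the storage layer checks it on every write; `FencedStore` is a hypothetical class for illustration.

```python
class FencedStore:
    """Storage that rejects writes carrying an older token than one already seen."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> None:
        if token < self.highest_token:
            # The writer's lease expired and a newer holder has already written.
            raise PermissionError(f"stale fencing token {token}")
        self.highest_token = token
        self.data[key] = value


store = FencedStore()
store.write(33, "config", "from client A")
store.write(34, "config", "from client B")  # B acquired the lock after A's lease expired
```

If client A wakes up from a long pause and retries with token 33, the store refuses the write, which a lock with only an expiry cannot do.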
129
+ ---
130
+
131
+ ## 9. Batch and Stream Processing
132
+
133
+ - [ ] Batch jobs are idempotent (safe to re-run)
134
+ - [ ] Stream consumers are idempotent (safe to replay)
135
+ - [ ] Exactly-once semantics achieved via idempotency, not by assumption
136
+ - [ ] Processing output goes to a well-defined sink (not side effects scattered in operators)
137
+ - [ ] Backpressure mechanism exists (consumers can signal producers to slow down)
138
+ - [ ] Checkpointing or microbatching configured for stream fault tolerance
139
+ - [ ] Late events / out-of-order events are handled (watermarks, allowed lateness)
140
+ - [ ] Window semantics match business requirements (tumbling, hopping, sliding, session)
141
+
142
+ **Red flags**: Stream consumer that crashes and loses all progress (no checkpointing).
143
+ Batch job that partially writes output on failure (not atomic). Producer overwhelming consumer
144
+ with no flow control. Using processing time instead of event time for time-sensitive analytics.
145
+ No dead letter queue for malformed messages.
146
+
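The idempotency items above can be illustrated with a consumer that deduplicates by event ID. This is a minimal in-memory sketch; the event shape (`id`, `account`, `amount`) and the class name are assumptions, and a real consumer would persist the processed IDs atomically with the state change.

```python
class IdempotentConsumer:
    """Applies each event at most once, so replays do not double-count."""

    def __init__(self):
        self.processed_ids = set()
        self.totals = {}

    def handle(self, event: dict) -> bool:
        if event["id"] in self.processed_ids:
            return False  # duplicate delivery or replay: ignore
        account = event["account"]
        self.totals[account] = self.totals.get(account, 0) + event["amount"]
        self.processed_ids.add(event["id"])
        return True


consumer = IdempotentConsumer()
events = [
    {"id": "e1", "account": "a", "amount": 10},
    {"id": "e2", "account": "a", "amount": 5},
]
for event in events + events:  # simulate a full replay after a crash
    consumer.handle(event)
print(consumer.totals)  # {'a': 15}
```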
147
+ ---
148
+
149
+ ## 10. Derived Data and Integration
150
+
151
+ - [ ] Derived data (caches, indexes, views) is maintained via events or CDC — not dual writes
152
+ - [ ] Transactional outbox pattern used for reliable event publishing
153
+ - [ ] Change Data Capture configured for keeping systems in sync
154
+ - [ ] Event schema versioning strategy exists
155
+ - [ ] Event consumers can bootstrap from scratch (initial snapshot + streaming)
156
+ - [ ] Eventual consistency is acceptable and communicated to users appropriately
157
+
158
+ **Red flags**: Application code that updates both the primary database and Elasticsearch in
159
+ separate calls (dual write — can diverge on failure). No outbox pattern — events published after
160
+ transaction commit (can be lost on crash). CDC consumer with no mechanism for initial snapshot.
161
+ Derived views that can never be rebuilt from the event log.
162
+
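The transactional outbox item can be sketched with SQLite: the business row and the event row are written in one transaction, and a separate relay publishes pending events. Table names, the event payload, and the `publish` callback are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
    """
)


def place_order(conn, order_id, item):
    """Insert the order and its event atomically: both commit or neither does."""
    with conn:  # commits on success, rolls back on exception
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, item))
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "order_placed", "order_id": order_id}),),
        )


def relay_once(conn, publish):
    """Publish unpublished events in order, marking each one after it is sent."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

A crash between `publish` and the `UPDATE` causes a re-send on the next pass, so delivery is at-least-once and consumers still need to be idempotent.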
163
+ ---
164
+
165
+ ## 11. Operational Readiness
166
+
167
+ - [ ] Health check endpoints exist
168
+ - [ ] Key metrics exposed: request rate, latency percentiles (p50, p95, p99), error rate
169
+ - [ ] Distributed tracing instrumented (OpenTelemetry or equivalent)
170
+ - [ ] Structured logging with correlation IDs
171
+ - [ ] Alerts configured for critical failure conditions
172
+ - [ ] Capacity planning considers tail latency (p99, not just averages)
173
+ - [ ] Backpressure and graceful degradation strategies in place
174
+ - [ ] Runbooks exist for common failure scenarios
175
+
176
+ **Red flags**: Only monitoring averages (hides tail latency issues). No distributed tracing
177
+ across service boundaries. `console.log` as the only observability. No runbook for leader failover
178
+ or partition rebalancing. No capacity planning for data growth.
179
+
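The point about averages hiding tail latency can be shown with a short sketch. It assumes latencies are available as a list of milliseconds and uses the standard-library `statistics` module; the function name `latency_summary` and the sample data are made up for illustration.

```python
import statistics


def latency_summary(latencies_ms: list[float]) -> dict[str, float]:
    """Return the mean plus p50/p95/p99 for a batch of request latencies."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(latencies_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


# 980 fast requests and 20 slow ones: the mean stays low while p99 does not.
samples = [20.0] * 980 + [2000.0] * 20
summary = latency_summary(samples)
print(summary)
```

In production the percentiles would normally come from histogram metrics in the monitoring system rather than from raw samples; the sketch only shows why alerting on the mean alone misses the slowest 2% of requests.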
180
+ ---
181
+
182
+ ## Severity Classification
183
+
184
+ When reporting issues, classify them:
185
+
186
+ - **Critical**: Data loss risk, correctness issue, or security vulnerability
187
+ (e.g., dual writes without outbox, missing fencing tokens, no transaction isolation for invariants)
188
+ - **Major**: Reliability or scalability debt that will cause problems at scale
189
+ (e.g., hot partitions, 2PC across services, no idempotency on retries, wrong storage engine)
190
+ - **Minor**: Best practice deviation with limited immediate impact
191
+ (e.g., missing health check, no schema registry, suboptimal compaction settings)
192
+ - **Suggestion**: Improvement that would be nice but isn't urgent
193
+ (e.g., consider CQRS for complex queries, evaluate column store for analytics workload)
@@ -0,0 +1,213 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ adr.py - Architecture Decision Record generator for data-intensive systems.
4
+
5
+ Usage:
6
+ python adr.py <decision-title>
7
+ python adr.py # interactive mode
8
+
9
+ Generates:
10
+ adr-NNN-<slug>.md - Numbered ADR file with data-intensive-specific sections
11
+ ADR-INDEX.md - Running index of all ADRs (appended to)
12
+
13
+ The ADR includes standard sections plus four data-intensive-specific sections:
14
+ - Consistency model
15
+ - Failure mode
16
+ - Scalability impact
17
+ - Operability
18
+
19
+ Based on patterns from "Designing Data-Intensive Applications" by Martin Kleppmann.
20
+ """
21
+
22
+ import argparse
23
+ import datetime
24
+ import pathlib
25
+ import re
26
+ import sys
27
+
28
+
29
+ def slugify(title: str) -> str:
30
+     slug = title.lower()
31
+     slug = re.sub(r"[^a-z0-9]+", "-", slug)
32
+     slug = slug.strip("-")
33
+     return slug
34
+
35
+
36
+ def next_adr_number(adr_dir: pathlib.Path) -> int:
37
+     existing = list(adr_dir.glob("adr-[0-9][0-9][0-9]-*.md"))
38
+     if not existing:
39
+         return 1
40
+     numbers = []
41
+     for p in existing:
42
+         m = re.match(r"adr-(\d{3})-", p.name)
43
+         if m:
44
+             numbers.append(int(m.group(1)))
45
+     return max(numbers) + 1 if numbers else 1
46
+
47
+
48
+ def prompt(question: str, default: str = "") -> str:
49
+     suffix = f" [{default}]" if default else ""
50
+     try:
51
+         answer = input(f"{question}{suffix}: ").strip()
52
+     except (EOFError, KeyboardInterrupt):
53
+         print()
54
+         sys.exit(0)
55
+     return answer if answer else default
56
+
57
+
58
+ def collect_options() -> list[str]:
59
+     options = []
60
+     print("Enter up to 4 considered options (leave blank to stop):")
61
+     for i in range(1, 5):
62
+         opt = prompt(f"  Option {i}")
63
+         if not opt:
64
+             break
65
+         options.append(opt)
66
+     return options
67
+
68
+
69
+ def render_adr(
70
+     number: int,
71
+     title: str,
72
+     context: str,
73
+     options: list[str],
74
+     chosen: str,
75
+     consequences: str,
76
+     consistency_model: str,
77
+     failure_mode: str,
78
+     scalability_impact: str,
79
+     operability: str,
80
+     date: str,
81
+ ) -> str:
82
+     options_text = "\n".join(f"- {opt}" for opt in options) if options else "- (none listed)"
83
+     return f"""\
84
+ # ADR-{number:03d}: {title}
85
+
86
+ **Date:** {date}
87
+ **Status:** Proposed
88
+
89
+ ---
90
+
91
+ ## Context
92
+
93
+ {context}
94
+
95
+ ## Considered Options
96
+
97
+ {options_text}
98
+
99
+ ## Decision
100
+
101
+ {chosen}
102
+
103
+ ## Consequences
104
+
105
+ {consequences}
106
+
107
+ ---
108
+
109
+ ## Data-Intensive Considerations
110
+
111
+ ### Consistency Model
112
+
113
+ > What consistency guarantees does this choice provide?
114
+
115
+ {consistency_model}
116
+
117
+ ### Failure Mode
118
+
119
+ > What happens when this component fails?
120
+
121
+ {failure_mode}
122
+
123
+ ### Scalability Impact
124
+
125
+ > How does this scale with data volume?
126
+
127
+ {scalability_impact}
128
+
129
+ ### Operability
130
+
131
+ > How observable and maintainable is this choice?
132
+
133
+ {operability}
134
+ """
135
+
136
+
137
+ def append_to_index(index_path: pathlib.Path, number: int, title: str, filename: str, date: str) -> None:
138
+     header = "# ADR Index\n\n| # | Title | Date | File |\n|---|-------|------|------|\n"
139
+     entry = f"| {number:03d} | {title} | {date} | [{filename}]({filename}) |\n"
140
+     if not index_path.exists():
141
+         index_path.write_text(header + entry, encoding="utf-8")
142
+         print(f"Created: {index_path}")
143
+     else:
144
+         content = index_path.read_text(encoding="utf-8")
145
+         index_path.write_text(content + entry, encoding="utf-8")
146
+         print(f"Updated: {index_path}")
147
+
148
+
149
+ def main() -> None:
150
+     parser = argparse.ArgumentParser(
151
+         description="Generate an ADR for data-intensive systems."
152
+     )
153
+     parser.add_argument(
154
+         "title",
155
+         nargs="?",
156
+         default="",
157
+         help="Decision title (will prompt if omitted)",
158
+     )
159
+     parser.add_argument(
160
+         "--output-dir",
161
+         default=".",
162
+         help="Directory to write ADR files (default: ./)",
163
+     )
164
+     args = parser.parse_args()
165
+
166
+     output_dir = pathlib.Path(args.output_dir).resolve()
167
+     output_dir.mkdir(parents=True, exist_ok=True)
168
+
169
+     title = args.title.strip() or prompt("Decision title")
170
+     if not title:
171
+         print("ERROR: A title is required.", file=sys.stderr)
172
+         sys.exit(1)
173
+
174
+     print()
175
+     context = prompt("Context (why is this decision needed?)", default="Describe the situation and forces at play.")
176
+     options = collect_options()
177
+     chosen = prompt("Chosen option")
178
+     consequences = prompt("Consequences (trade-offs, risks, next steps)", default="To be determined.")
179
+     print()
180
+     print("-- Data-intensive sections --")
181
+     consistency_model = prompt("Consistency model", default="To be defined.")
182
+     failure_mode = prompt("Failure mode", default="To be defined.")
183
+     scalability_impact = prompt("Scalability impact", default="To be defined.")
184
+     operability = prompt("Operability", default="To be defined.")
185
+
186
+     number = next_adr_number(output_dir)
187
+     date = datetime.date.today().isoformat()
188
+     filename = f"adr-{number:03d}-{slugify(title)}.md"
189
+     adr_path = output_dir / filename
190
+
191
+     content = render_adr(
192
+         number=number,
193
+         title=title,
194
+         context=context,
195
+         options=options,
196
+         chosen=chosen,
197
+         consequences=consequences,
198
+         consistency_model=consistency_model,
199
+         failure_mode=failure_mode,
200
+         scalability_impact=scalability_impact,
201
+         operability=operability,
202
+         date=date,
203
+     )
204
+
205
+     adr_path.write_text(content, encoding="utf-8")
206
+     print(f"\nWrote: {adr_path}")
207
+
208
+     append_to_index(output_dir / "ADR-INDEX.md", number, title, filename, date)
209
+     print("\nDone.")
210
+
211
+
212
+ if __name__ == "__main__":
213
+     main()
@@ -0,0 +1,259 @@
1
+ ---
2
+ name: data-pipelines
3
+ version: "1.0"
4
+ license: MIT
5
+ tags: [data, etl, pipelines, python]
6
+ description: >
7
+ Apply Data Pipelines Pocket Reference practices (James Densmore). Covers
8
+ Infrastructure (Ch 1-2: warehouses, lakes, cloud), Patterns (Ch 3: ETL, ELT,
9
+ CDC), DB Ingestion (Ch 4: MySQL, PostgreSQL, MongoDB, full/incremental),
10
+ File Ingestion (Ch 5: CSV, JSON, cloud storage), API Ingestion (Ch 6: REST,
11
+ pagination, rate limiting), Streaming (Ch 7: Kafka, Kinesis, event-driven),
12
+ Storage (Ch 8: Redshift, BigQuery, Snowflake), Transforms (Ch 9: SQL, Python,
13
+ dbt), Validation (Ch 10: Great Expectations, schema checks), Orchestration
14
+ (Ch 11: Airflow, DAGs, scheduling), Monitoring (Ch 12: SLAs, alerting),
15
+ Best Practices (Ch 13: idempotency, backfilling, error handling). Trigger on
16
+ "data pipeline", "ETL", "ELT", "data ingestion", "Airflow", "dbt",
17
+ "data warehouse", "Kafka streaming", "CDC", "data orchestration".
18
+ ---
19
+
20
+ # Data Pipelines Pocket Reference Skill
21
+
22
+ You are an expert data engineer grounded in the 13 chapters from
23
+ *Data Pipelines Pocket Reference* (Moving and Processing Data for Analytics)
24
+ by James Densmore. You help developers and data engineers in two modes:
25
+
26
+ 1. **Pipeline Building** — Design and implement data pipelines with idiomatic, production-ready patterns
27
+ 2. **Pipeline Review** — Analyze existing pipelines against the book's practices and recommend improvements
28
+
29
+ ## How to Decide Which Mode
30
+
31
+ - If the user asks you to *build*, *create*, *design*, *implement*, *write*, or *set up* a pipeline → **Pipeline Building**
32
+ - If the user asks you to *review*, *audit*, *improve*, *troubleshoot*, *optimize*, or *analyze* a pipeline → **Pipeline Review**
33
+ - If ambiguous, ask briefly which mode they'd prefer
34
+
35
+ ---
36
+
37
+ ## Mode 1: Pipeline Building
38
+
39
+ When designing or building data pipelines, follow this decision flow:
40
+
41
+ ### Step 1 — Understand the Requirements
42
+
43
+ Ask (or infer from context):
44
+
45
+ - **What data source?** — Database (MySQL, PostgreSQL, MongoDB), files (CSV, JSON, cloud storage), API (REST), streaming (Kafka, Kinesis)?
46
+ - **What destination?** — Data warehouse (Redshift, BigQuery, Snowflake), data lake (S3, GCS), operational database?
47
+ - **What pattern?** — ETL, ELT, CDC, streaming, batch?
48
+ - **What scale?** — Volume, velocity, variety of data? SLA requirements?
49
+
50
+ ### Step 2 — Apply the Right Practices
51
+
52
+ Read `references/practices-catalog.md` for the full chapter-by-chapter catalog. Quick decision guide by concern:
53
+
54
+ | Concern | Chapters to Apply |
55
+ |---------|-------------------|
56
+ | Infrastructure and architecture | Ch 1-2: Pipeline types, data warehouses vs data lakes, cloud storage (S3, GCS, Azure Blob), choosing infrastructure |
57
+ | Pipeline patterns and design | Ch 3: ETL vs ELT, change data capture (CDC), full vs incremental extraction, append vs upsert loading |
58
+ | Database ingestion | Ch 4: MySQL/PostgreSQL/MongoDB extraction, full and incremental loads, connection pooling, binary log replication |
59
+ | File-based ingestion | Ch 5: CSV/JSON/flat file parsing, cloud storage integration, file naming conventions, schema detection |
60
+ | API ingestion | Ch 6: REST API extraction, pagination handling, rate limiting, authentication, retry logic, webhook ingestion |
61
+ | Streaming data | Ch 7: Kafka producers/consumers, Kinesis streams, event-driven pipelines, exactly-once semantics, stream processing |
62
+ | Data storage and loading | Ch 8: Warehouse loading patterns (Redshift COPY, BigQuery load, Snowflake stages), partitioning, clustering |
63
+ | Transformations | Ch 9: SQL-based transforms, Python transforms, dbt models, staging/intermediate/mart layers, incremental models |
64
+ | Data validation and testing | Ch 10: Schema validation, data quality checks, Great Expectations, row counts, null checks, referential integrity |
65
+ | Orchestration | Ch 11: Apache Airflow, DAG design, task dependencies, scheduling, sensors, XComs, idempotent tasks |
66
+ | Monitoring and alerting | Ch 12: Pipeline health metrics, SLA tracking, data freshness, logging, alerting strategies, anomaly detection |
67
+ | Best practices | Ch 13: Idempotency, backfilling, error handling, retry strategies, data lineage, documentation |
68
+
69
+ ### Step 3 — Follow Data Pipeline Principles
70
+
71
+ Every pipeline implementation should honor these principles:
72
+
73
+ 1. **Idempotency always** — Running a pipeline multiple times with the same input produces the same result; use DELETE+INSERT or MERGE patterns
74
+ 2. **Incremental over full** — Prefer incremental extraction using timestamps or CDC over full table scans when data volume grows
75
+ 3. **ELT over ETL for analytics** — Load raw data into the warehouse first, transform with SQL/dbt; leverage warehouse compute power
76
+ 4. **Schema evolution readiness** — Design pipelines to handle schema changes gracefully; use schema detection and validation
77
+ 5. **Atomicity in loading** — Use staging tables, transactions, and atomic swaps; never leave destinations in partial states
78
+ 6. **Orchestration for dependencies** — Use DAGs (Airflow) to manage task ordering, retries, and failure handling; avoid time-based chaining
79
+ 7. **Validate early and often** — Check data quality at ingestion, after transformation, and before serving; use automated assertion frameworks
80
+ 8. **Monitor everything** — Track row counts, data freshness, pipeline duration, error rates; alert on SLA breaches
81
+ 9. **Design for backfilling** — Parameterize pipelines by date range; make it easy to reprocess historical data
82
+ 10. **Document data lineage** — Track where data comes from, how it's transformed, and where it goes; maintain a data catalog
83
+
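Principles 1, 5, and 9 above (idempotency, atomic loading, backfill by date) can be sketched together. This is a minimal example that assumes a date-partitioned target table and uses the standard-library `sqlite3` module as a stand-in for a warehouse; the table `daily_sales` and the function `load_partition` are hypothetical names.

```python
import sqlite3


def load_partition(conn: sqlite3.Connection, run_date: str, rows: list[tuple[str, float]]) -> None:
    """Replace one date partition atomically so reruns never duplicate rows."""
    with conn:  # one transaction: the DELETE and INSERTs commit or roll back together
        conn.execute("DELETE FROM daily_sales WHERE sale_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO daily_sales (sale_date, sku, amount) VALUES (?, ?, ?)",
            [(run_date, sku, amount) for sku, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sale_date TEXT, sku TEXT, amount REAL)")

batch = [("sku-1", 10.0), ("sku-2", 5.5)]
load_partition(conn, "2024-01-01", batch)
load_partition(conn, "2024-01-01", batch)  # rerun or backfill of the same date
row_count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
print(row_count)
```

Passing `run_date` as a parameter is what makes backfilling straightforward: reprocessing a historical day is the same call with a different date. On a warehouse that supports it, a MERGE statement would be the equivalent alternative to DELETE+INSERT.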
84
+ ### Step 4 — Build the Pipeline
85
+
86
+ Follow these guidelines:
87
+
88
+ - **Production-ready** — Include error handling, retries, logging, monitoring from the start
89
+ - **Configurable** — Externalize connection strings, credentials, date ranges, batch sizes; use environment variables or config files
90
+ - **Testable** — Write unit tests for transformations, integration tests for end-to-end flows
91
+ - **Observable** — Include logging at each stage, metrics collection, alerting hooks
92
+ - **Documented** — README, data dictionary, DAG documentation, runbook for common failures
93
+
94
+ When building pipelines, produce:
95
+
96
+ 1. **Pattern identification** — Which chapters/concepts apply and why
97
+ 2. **Architecture diagram** — Source → Ingestion → Storage → Transform → Serve flow
98
+ 3. **Implementation** — Production-ready code with error handling
99
+ 4. **Configuration** — Connection configs, scheduling, environment setup
100
+ 5. **Monitoring setup** — What to track and alert on
101
+
102
+ ### Pipeline Building Examples
103
+
104
+ **Example 1 — Database to Warehouse ETL:**
105
+ ```
106
+ User: "Create a pipeline to sync MySQL orders to BigQuery"
107
+
108
+ Apply: Ch 3 (incremental extraction), Ch 4 (MySQL ingestion), Ch 8 (BigQuery loading),
109
+ Ch 11 (Airflow orchestration), Ch 13 (idempotency)
110
+
111
+ Generate:
112
+ - Incremental extraction using updated_at timestamp
113
+ - Staging table load with BigQuery load jobs
114
+ - MERGE/upsert into final table for idempotency
115
+ - Airflow DAG with proper scheduling and error handling
116
+ - Row count validation between source and destination
117
+ ```
118
+
119
+ **Example 2 — REST API Ingestion Pipeline:**
120
+ ```
121
+ User: "Build a pipeline to ingest data from a paginated REST API"
122
+
123
+ Apply: Ch 6 (API ingestion, pagination, rate limiting), Ch 5 (JSON handling),
124
+ Ch 8 (warehouse loading), Ch 10 (validation)
125
+
126
+ Generate:
127
+ - Paginated API client with retry logic and rate limiting
128
+ - JSON response parsing and flattening
129
+ - Incremental loading with cursor-based pagination
130
+ - Schema validation on ingested records
131
+ - Error handling for API failures and timeouts
132
+ ```
133
+
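The cursor-based pagination mentioned in Example 2 might look like the sketch below. The response shape (`items` plus `next_cursor`) is an assumption, since real APIs differ, and `fetch_page` is a hypothetical callable that would wrap the actual HTTP request, retry logic, and rate limiting.

```python
from typing import Callable, Iterator, Optional

Page = dict  # assumed shape: {"items": [...], "next_cursor": str | None}


def iter_records(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[dict]:
    """Follow next_cursor links until the API reports no further pages."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if not cursor:
            return


# In-memory stand-in for an HTTP call, so the sketch runs without a network.
FAKE_PAGES = {
    None: {"items": [{"id": 1}, {"id": 2}], "next_cursor": "c2"},
    "c2": {"items": [{"id": 3}], "next_cursor": None},
}

records = list(iter_records(lambda cursor: FAKE_PAGES[cursor]))
print(records)
```

Injecting `fetch_page` keeps the paging loop testable without network access and leaves a single place to add backoff when the API signals a rate limit.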
134
+ **Example 3 — Streaming Pipeline:**
135
+ ```
136
+ User: "Set up a Kafka-based streaming pipeline for event data"
137
+
138
+ Apply: Ch 7 (Kafka, event-driven, exactly-once semantics), Ch 8 (warehouse loading),
139
+ Ch 12 (monitoring), Ch 13 (idempotency, error handling)
140
+
141
+ Generate:
142
+ - Kafka consumer group configuration
143
+ - Event deserialization and validation
144
+ - Micro-batch or streaming sink to warehouse
145
+ - Dead letter queue for failed events
146
+ - Consumer lag monitoring and alerting
147
+ ```
148
+
149
+ **Example 4 — dbt Transformation Layer:**
150
+ ```
151
+ User: "Create a dbt project for transforming raw e-commerce data"
152
+
153
+ Apply: Ch 9 (dbt, SQL transforms, staging/mart layers, incremental models),
154
+ Ch 10 (data testing), Ch 13 (idempotency)
155
+
156
+ Generate:
157
+ - Staging models (1:1 with source, renamed/typed)
158
+ - Intermediate models (business logic joins)
159
+ - Mart models (final analytics tables)
160
+ - dbt tests (not_null, unique, relationships, custom)
161
+ - Incremental model configuration with merge strategy
162
+ ```
163
+
164
+ ---
165
+
166
+ ## Mode 2: Pipeline Review
167
+
168
+ When reviewing data pipelines, read `references/review-checklist.md` for the full checklist.
169
+
170
+ ### Review Process
171
+
172
+ 1. **Architecture scan** — Check Ch 1-3: pipeline pattern choice (ETL/ELT/CDC), infrastructure fit, data flow design
173
+ 2. **Ingestion scan** — Check Ch 4-7: extraction method, incremental vs full, error handling, source-specific best practices
174
+ 3. **Storage scan** — Check Ch 8: loading patterns, partitioning, clustering, staging table usage, atomic loads
175
+ 4. **Transform scan** — Check Ch 9: SQL vs Python choice, dbt patterns, layer structure, incremental models
176
+ 5. **Quality scan** — Check Ch 10: validation coverage, schema checks, data quality assertions, testing
177
+ 6. **Orchestration scan** — Check Ch 11: DAG design, task granularity, dependency management, idempotency
178
+ 7. **Operations scan** — Check Ch 12-13: monitoring, alerting, backfill capability, error handling, documentation
179
+
180
+ ### Calibrating Review Tone — Well-Designed vs. Problematic Pipelines
181
+
182
+ **Before listing issues, assess overall quality:**
183
+
184
+ - If the pipeline already implements idempotency, incremental extraction, separation of concerns, retry logic, structured logging, and lineage tracking — say so explicitly and lead with praise.
185
+ - **Do NOT manufacture problems** to appear thorough. If a pattern is correct, praise it. Only flag genuine gaps.
186
+ - Frame truly optional improvements as "minor" or "nice-to-have," not "Critical" or "will cause real pain in production."
187
+ - A well-designed pipeline deserves a review that opens with "This is a well-designed pipeline" and highlights what it does right before any suggestions.
188
+
189
+ **Specific patterns to recognize and praise when present:**
190
+
191
+ - **ETL function separation** — `extract`, `transform`, `load` as distinct single-responsibility functions (Ch 3: ETL pattern, Ch 11: task granularity) → Praise explicitly.
192
+ - **Generator/batch extraction** — `yield`-based extraction that streams rows in batches rather than fetching everything into memory (Ch 4: streaming extraction, memory efficiency) → Praise explicitly; do NOT suggest it is broken.
193
+ - **Watermark-based incremental extraction** — filtering by timestamp/cursor to avoid full-table scans on reruns (Ch 3-4) → Praise explicitly.
194
+ - **Upsert / ON CONFLICT DO UPDATE** — ensures idempotency and safe reruns (Ch 13) → Praise explicitly.
195
+ - **Retry with exponential backoff** — `run_with_retry` wrappers for transient errors (Ch 13) → Praise explicitly.
196
+ - **Structured logging with row counts** — batch-level `logger.info` with row counts already present (Ch 12: monitoring) → Praise it; do NOT suggest adding logging that already exists.
197
+ - **pipeline_run_id / audit column** — tracking which pipeline run produced each row (Ch 13: data lineage) → Praise explicitly.
198
+
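For reference, a `run_with_retry` wrapper of the kind named in the list above could look like the following sketch. The signature, default values, and the choice of which exceptions count as transient are assumptions for illustration; a real pipeline would pick the exception types raised by its own database or HTTP client.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(
    task: Callable[[], T],
    max_attempts: int = 4,
    base_delay_s: float = 0.5,
    retry_on: tuple[type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call task, retrying transient errors with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            delay = base_delay_s * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out retries
    raise ValueError("max_attempts must be at least 1")


attempts = {"n": 0}


def flaky_extract() -> str:
    """Fails twice with a transient error, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


result = run_with_retry(flaky_extract, sleep=lambda seconds: None)
print(result, attempts["n"])
```

Errors outside `retry_on` propagate immediately, which keeps permanent failures such as bad credentials or malformed queries from being retried pointlessly.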
199
+ ### Review Output Format
200
+
201
+ Structure your review as:
202
+
203
+ ```
204
+ ## Summary
205
+ One paragraph: overall pipeline quality, pattern adherence, main concerns.
206
+ If the pipeline is well-designed, say so clearly upfront.
207
+
208
+ ## Strengths
209
+ For each good pattern found:
210
+ - **Pattern**: name and chapter reference
211
+ - **Where**: location in the pipeline
212
+ - **Why it matters**: brief explanation
213
+
214
+ ## Issues
215
+ For each genuine issue found:
216
+ - **Topic**: chapter and concept
217
+ - **Location**: where in the pipeline
218
+ - **Problem**: what's wrong
219
+ - **Fix**: recommended change with code/config snippet
220
+ - **Severity**: Critical / High / Minor (only use Critical or High for real production risks)
221
+
222
+ ## Recommendations
223
+ Priority-ordered list. Frame genuinely minor items as "nice-to-have" or "minor."
224
+ Each recommendation references the specific chapter/concept.
225
+ If no significant issues exist, say so — a short list of minor suggestions is fine.
226
+ ```
227
+
228
+ ### Common Data Pipeline Anti-Patterns to Flag
229
+
230
+ - **Full extraction when incremental suffices** → Ch 3-4: Use timestamp/CDC-based incremental extraction for growing tables
231
+ - **No idempotency** → Ch 13: Pipelines should produce same results when re-run; use DELETE+INSERT or MERGE
232
+ - **Transforming before loading (unnecessary ETL)** → Ch 3: Use ELT pattern; load raw data first, transform in warehouse
233
+ - **No staging tables** → Ch 8: Always load to staging first, validate, then swap/merge to production
234
+ - **Hardcoded credentials** → Ch 13: Use environment variables, secrets managers, or config files
235
+ - **No error handling or retries** → Ch 6, 13: Implement retry logic with exponential backoff for transient failures
236
+ - **Time-based dependencies** → Ch 11: Use DAG-based orchestration (Airflow) instead of cron with time buffers
237
+ - **Missing data validation** → Ch 10: Add row count checks, null checks, schema validation, freshness checks
238
+ - **No monitoring or alerting** → Ch 12: Track pipeline duration, row counts, error rates; alert on SLA breaches
239
+ - **Monolithic pipelines** → Ch 11: Break into small, reusable, testable tasks in a DAG
240
+ - **No backfill support** → Ch 13: Parameterize pipelines by date range; make historical reprocessing easy
241
+ - **Ignoring schema evolution** → Ch 5, 10: Handle new columns, type changes, missing fields gracefully
242
+ - **Unpartitioned warehouse tables** → Ch 8: Partition by date/key for query performance and cost
243
+ - **No data lineage** → Ch 13: Document source-to-destination mappings and transformation logic
244
+ - **Blocking on API rate limits** → Ch 6: Implement rate limit awareness with backoff and queuing
245
+ - **Missing dead letter queues** → Ch 7: Capture failed events/records for inspection and reprocessing
246
+ - **Over-orchestrating** → Ch 11: Not every script needs Airflow; match orchestration complexity to pipeline needs
247
+
248
+ ---
249
+
250
+ ## General Guidelines
251
+
252
+ - **ELT for analytics, ETL for operational** — Use warehouse compute for analytics transforms; use ETL only when destination can't transform
253
+ - **Incremental by default** — Start with incremental extraction; fall back to full only when necessary
254
+ - **Idempotency is non-negotiable** — Every pipeline must be safely re-runnable without data duplication or corruption
255
+ - **Validate at boundaries** — Check data quality at ingestion, after transformation, and before serving
256
+ - **Orchestrate with DAGs** — Use Airflow or similar tools for dependency management, retries, and scheduling
257
+ - **Monitor proactively** — Don't wait for users to report stale data; alert on freshness, completeness, and accuracy
258
+ - For deeper practice details, read `references/practices-catalog.md` before building pipelines.
259
+ - For review checklists, read `references/review-checklist.md` before reviewing pipelines.
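The "incremental by default" guideline can be illustrated with a watermark-based extraction sketch. It assumes the source table has an `updated_at` column and that the previous run's watermark is stored somewhere durable; the table, columns, and function name are hypothetical, and `sqlite3` stands in for the source database.

```python
import sqlite3
from typing import Iterator


def extract_incremental(
    conn: sqlite3.Connection, last_watermark: str, batch_size: int = 1000
) -> Iterator[list[tuple]]:
    """Yield batches of rows changed after last_watermark, oldest first."""
    cursor = conn.execute(
        "SELECT id, status, updated_at FROM orders"
        " WHERE updated_at > ? ORDER BY updated_at, id",
        (last_watermark,),
    )
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        yield batch


source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "shipped", "2024-01-01T09:00:00"),
        (2, "placed", "2024-01-02T10:00:00"),
        (3, "placed", "2024-01-03T11:00:00"),
    ],
)

watermark = "2024-01-01T09:00:00"  # stored from the previous successful run
extracted = [row for batch in extract_incremental(source, watermark, batch_size=2) for row in batch]
# Advance the watermark only after the load has succeeded.
new_watermark = max((row[2] for row in extracted), default=watermark)
print(extracted, new_watermark)
```

A strict `>` comparison can miss rows that share the watermark timestamp but commit later, so some pipelines re-read a small overlap window and rely on an idempotent upsert in the destination to absorb the duplicates.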