@booklib/core 2.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (374)
  1. package/.cursor/rules/booklib-standards.mdc +40 -0
  2. package/.gemini/context.md +372 -0
  3. package/AGENTS.md +166 -0
  4. package/CHANGELOG.md +226 -0
  5. package/CLAUDE.md +81 -0
  6. package/CODE_OF_CONDUCT.md +31 -0
  7. package/CONTRIBUTING.md +304 -0
  8. package/LICENSE +21 -0
  9. package/PLAN.md +28 -0
  10. package/README.ja.md +198 -0
  11. package/README.ko.md +198 -0
  12. package/README.md +503 -0
  13. package/README.pt-BR.md +198 -0
  14. package/README.uk.md +241 -0
  15. package/README.zh-CN.md +198 -0
  16. package/SECURITY.md +9 -0
  17. package/agents/architecture-reviewer.md +136 -0
  18. package/agents/booklib-reviewer.md +90 -0
  19. package/agents/data-reviewer.md +107 -0
  20. package/agents/jvm-reviewer.md +146 -0
  21. package/agents/python-reviewer.md +128 -0
  22. package/agents/rust-reviewer.md +115 -0
  23. package/agents/ts-reviewer.md +110 -0
  24. package/agents/ui-reviewer.md +117 -0
  25. package/assets/logo.svg +36 -0
  26. package/bin/booklib-mcp.js +304 -0
  27. package/bin/booklib.js +1705 -0
  28. package/bin/skills.cjs +1292 -0
  29. package/booklib-router.mdc +36 -0
  30. package/booklib.config.json +19 -0
  31. package/commands/animation-at-work.md +10 -0
  32. package/commands/clean-code-reviewer.md +10 -0
  33. package/commands/data-intensive-patterns.md +10 -0
  34. package/commands/data-pipelines.md +10 -0
  35. package/commands/design-patterns.md +10 -0
  36. package/commands/domain-driven-design.md +10 -0
  37. package/commands/effective-java.md +10 -0
  38. package/commands/effective-kotlin.md +10 -0
  39. package/commands/effective-python.md +10 -0
  40. package/commands/effective-typescript.md +10 -0
  41. package/commands/kotlin-in-action.md +10 -0
  42. package/commands/lean-startup.md +10 -0
  43. package/commands/microservices-patterns.md +10 -0
  44. package/commands/programming-with-rust.md +10 -0
  45. package/commands/refactoring-ui.md +10 -0
  46. package/commands/rust-in-action.md +10 -0
  47. package/commands/skill-router.md +10 -0
  48. package/commands/spring-boot-in-action.md +10 -0
  49. package/commands/storytelling-with-data.md +10 -0
  50. package/commands/system-design-interview.md +10 -0
  51. package/commands/using-asyncio-python.md +10 -0
  52. package/commands/web-scraping-python.md +10 -0
  53. package/community/registry.json +1616 -0
  54. package/hooks/hooks.json +23 -0
  55. package/hooks/posttooluse-capture.mjs +67 -0
  56. package/hooks/suggest.js +153 -0
  57. package/lib/agent-behaviors.js +40 -0
  58. package/lib/agent-detector.js +96 -0
  59. package/lib/config-loader.js +39 -0
  60. package/lib/conflict-resolver.js +148 -0
  61. package/lib/context-builder.js +574 -0
  62. package/lib/discovery-engine.js +298 -0
  63. package/lib/doctor/hook-installer.js +83 -0
  64. package/lib/doctor/usage-tracker.js +87 -0
  65. package/lib/engine/ai-features.js +253 -0
  66. package/lib/engine/auditor.js +103 -0
  67. package/lib/engine/bm25-index.js +178 -0
  68. package/lib/engine/capture.js +120 -0
  69. package/lib/engine/corrections.js +198 -0
  70. package/lib/engine/doctor.js +195 -0
  71. package/lib/engine/graph-injector.js +137 -0
  72. package/lib/engine/graph.js +161 -0
  73. package/lib/engine/handoff.js +405 -0
  74. package/lib/engine/indexer.js +242 -0
  75. package/lib/engine/parser.js +53 -0
  76. package/lib/engine/query-expander.js +42 -0
  77. package/lib/engine/reranker.js +40 -0
  78. package/lib/engine/rrf.js +59 -0
  79. package/lib/engine/scanner.js +151 -0
  80. package/lib/engine/searcher.js +139 -0
  81. package/lib/engine/session-coordinator.js +306 -0
  82. package/lib/engine/session-manager.js +429 -0
  83. package/lib/engine/synthesizer.js +70 -0
  84. package/lib/installer.js +70 -0
  85. package/lib/instinct-block.js +33 -0
  86. package/lib/mcp-config-writer.js +88 -0
  87. package/lib/paths.js +57 -0
  88. package/lib/profiles/design.md +19 -0
  89. package/lib/profiles/general.md +16 -0
  90. package/lib/profiles/research-analysis.md +22 -0
  91. package/lib/profiles/software-development.md +23 -0
  92. package/lib/profiles/writing-content.md +19 -0
  93. package/lib/project-initializer.js +916 -0
  94. package/lib/registry/skills.js +102 -0
  95. package/lib/registry-searcher.js +99 -0
  96. package/lib/rules/rules-manager.js +169 -0
  97. package/lib/skill-fetcher.js +333 -0
  98. package/lib/well-known-builder.js +70 -0
  99. package/lib/wizard/index.js +404 -0
  100. package/lib/wizard/integration-detector.js +41 -0
  101. package/lib/wizard/project-detector.js +100 -0
  102. package/lib/wizard/prompt.js +156 -0
  103. package/lib/wizard/registry-embeddings.js +107 -0
  104. package/lib/wizard/skill-recommender.js +69 -0
  105. package/llms-full.txt +254 -0
  106. package/llms.txt +70 -0
  107. package/package.json +45 -0
  108. package/research-reports/2026-04-01-current-architecture.md +160 -0
  109. package/research-reports/IDEAS.md +93 -0
  110. package/rules/common/clean-code.md +42 -0
  111. package/rules/java/effective-java.md +42 -0
  112. package/rules/kotlin/effective-kotlin.md +37 -0
  113. package/rules/python/effective-python.md +38 -0
  114. package/rules/rust/rust.md +37 -0
  115. package/rules/typescript/effective-typescript.md +42 -0
  116. package/scripts/gen-llms-full.mjs +36 -0
  117. package/scripts/gen-og.mjs +142 -0
  118. package/scripts/validate-frontmatter.js +25 -0
  119. package/skills/animation-at-work/SKILL.md +270 -0
  120. package/skills/animation-at-work/assets/example_asset.txt +1 -0
  121. package/skills/animation-at-work/evals/evals.json +44 -0
  122. package/skills/animation-at-work/evals/results.json +13 -0
  123. package/skills/animation-at-work/examples/after.md +64 -0
  124. package/skills/animation-at-work/examples/before.md +35 -0
  125. package/skills/animation-at-work/references/api_reference.md +369 -0
  126. package/skills/animation-at-work/references/review-checklist.md +79 -0
  127. package/skills/animation-at-work/scripts/audit_animations.py +295 -0
  128. package/skills/animation-at-work/scripts/example.py +1 -0
  129. package/skills/clean-code-reviewer/SKILL.md +444 -0
  130. package/skills/clean-code-reviewer/audit.json +35 -0
  131. package/skills/clean-code-reviewer/evals/evals.json +185 -0
  132. package/skills/clean-code-reviewer/evals/results.json +13 -0
  133. package/skills/clean-code-reviewer/examples/after.md +48 -0
  134. package/skills/clean-code-reviewer/examples/before.md +33 -0
  135. package/skills/clean-code-reviewer/references/api_reference.md +158 -0
  136. package/skills/clean-code-reviewer/references/practices-catalog.md +282 -0
  137. package/skills/clean-code-reviewer/references/review-checklist.md +254 -0
  138. package/skills/clean-code-reviewer/scripts/pre-review.py +206 -0
  139. package/skills/data-intensive-patterns/SKILL.md +267 -0
  140. package/skills/data-intensive-patterns/assets/example_asset.txt +1 -0
  141. package/skills/data-intensive-patterns/evals/evals.json +54 -0
  142. package/skills/data-intensive-patterns/evals/results.json +13 -0
  143. package/skills/data-intensive-patterns/examples/after.md +61 -0
  144. package/skills/data-intensive-patterns/examples/before.md +38 -0
  145. package/skills/data-intensive-patterns/references/api_reference.md +34 -0
  146. package/skills/data-intensive-patterns/references/patterns-catalog.md +551 -0
  147. package/skills/data-intensive-patterns/references/review-checklist.md +193 -0
  148. package/skills/data-intensive-patterns/scripts/adr.py +213 -0
  149. package/skills/data-intensive-patterns/scripts/example.py +1 -0
  150. package/skills/data-pipelines/SKILL.md +259 -0
  151. package/skills/data-pipelines/assets/example_asset.txt +1 -0
  152. package/skills/data-pipelines/evals/evals.json +45 -0
  153. package/skills/data-pipelines/evals/results.json +13 -0
  154. package/skills/data-pipelines/examples/after.md +97 -0
  155. package/skills/data-pipelines/examples/before.md +37 -0
  156. package/skills/data-pipelines/references/api_reference.md +301 -0
  157. package/skills/data-pipelines/references/review-checklist.md +181 -0
  158. package/skills/data-pipelines/scripts/example.py +1 -0
  159. package/skills/data-pipelines/scripts/new_pipeline.py +444 -0
  160. package/skills/design-patterns/SKILL.md +271 -0
  161. package/skills/design-patterns/assets/example_asset.txt +1 -0
  162. package/skills/design-patterns/evals/evals.json +46 -0
  163. package/skills/design-patterns/evals/results.json +13 -0
  164. package/skills/design-patterns/examples/after.md +52 -0
  165. package/skills/design-patterns/examples/before.md +29 -0
  166. package/skills/design-patterns/references/api_reference.md +1 -0
  167. package/skills/design-patterns/references/patterns-catalog.md +726 -0
  168. package/skills/design-patterns/references/review-checklist.md +173 -0
  169. package/skills/design-patterns/scripts/example.py +1 -0
  170. package/skills/design-patterns/scripts/scaffold.py +807 -0
  171. package/skills/domain-driven-design/SKILL.md +142 -0
  172. package/skills/domain-driven-design/assets/example_asset.txt +1 -0
  173. package/skills/domain-driven-design/evals/evals.json +48 -0
  174. package/skills/domain-driven-design/evals/results.json +13 -0
  175. package/skills/domain-driven-design/examples/after.md +80 -0
  176. package/skills/domain-driven-design/examples/before.md +43 -0
  177. package/skills/domain-driven-design/references/api_reference.md +1 -0
  178. package/skills/domain-driven-design/references/patterns-catalog.md +545 -0
  179. package/skills/domain-driven-design/references/review-checklist.md +158 -0
  180. package/skills/domain-driven-design/scripts/example.py +1 -0
  181. package/skills/domain-driven-design/scripts/scaffold.py +421 -0
  182. package/skills/effective-java/SKILL.md +227 -0
  183. package/skills/effective-java/assets/example_asset.txt +1 -0
  184. package/skills/effective-java/evals/evals.json +46 -0
  185. package/skills/effective-java/evals/results.json +13 -0
  186. package/skills/effective-java/examples/after.md +83 -0
  187. package/skills/effective-java/examples/before.md +37 -0
  188. package/skills/effective-java/references/api_reference.md +1 -0
  189. package/skills/effective-java/references/items-catalog.md +955 -0
  190. package/skills/effective-java/references/review-checklist.md +216 -0
  191. package/skills/effective-java/scripts/checkstyle_setup.py +211 -0
  192. package/skills/effective-java/scripts/example.py +1 -0
  193. package/skills/effective-kotlin/SKILL.md +271 -0
  194. package/skills/effective-kotlin/assets/example_asset.txt +1 -0
  195. package/skills/effective-kotlin/audit.json +29 -0
  196. package/skills/effective-kotlin/evals/evals.json +45 -0
  197. package/skills/effective-kotlin/evals/results.json +13 -0
  198. package/skills/effective-kotlin/examples/after.md +36 -0
  199. package/skills/effective-kotlin/examples/before.md +38 -0
  200. package/skills/effective-kotlin/references/api_reference.md +1 -0
  201. package/skills/effective-kotlin/references/practices-catalog.md +1228 -0
  202. package/skills/effective-kotlin/references/review-checklist.md +126 -0
  203. package/skills/effective-kotlin/scripts/example.py +1 -0
  204. package/skills/effective-python/SKILL.md +441 -0
  205. package/skills/effective-python/evals/evals.json +44 -0
  206. package/skills/effective-python/evals/results.json +13 -0
  207. package/skills/effective-python/examples/after.md +56 -0
  208. package/skills/effective-python/examples/before.md +40 -0
  209. package/skills/effective-python/ref-01-pythonic-thinking.md +202 -0
  210. package/skills/effective-python/ref-02-lists-and-dicts.md +146 -0
  211. package/skills/effective-python/ref-03-functions.md +186 -0
  212. package/skills/effective-python/ref-04-comprehensions-generators.md +211 -0
  213. package/skills/effective-python/ref-05-classes-interfaces.md +188 -0
  214. package/skills/effective-python/ref-06-metaclasses-attributes.md +209 -0
  215. package/skills/effective-python/ref-07-concurrency.md +213 -0
  216. package/skills/effective-python/ref-08-robustness-performance.md +248 -0
  217. package/skills/effective-python/ref-09-testing-debugging.md +253 -0
  218. package/skills/effective-python/ref-10-collaboration.md +175 -0
  219. package/skills/effective-python/references/api_reference.md +218 -0
  220. package/skills/effective-python/references/practices-catalog.md +483 -0
  221. package/skills/effective-python/references/review-checklist.md +190 -0
  222. package/skills/effective-python/scripts/lint.py +173 -0
  223. package/skills/effective-typescript/SKILL.md +262 -0
  224. package/skills/effective-typescript/audit.json +29 -0
  225. package/skills/effective-typescript/evals/evals.json +37 -0
  226. package/skills/effective-typescript/evals/results.json +13 -0
  227. package/skills/effective-typescript/examples/after.md +70 -0
  228. package/skills/effective-typescript/examples/before.md +47 -0
  229. package/skills/effective-typescript/references/api_reference.md +118 -0
  230. package/skills/effective-typescript/references/practices-catalog.md +371 -0
  231. package/skills/effective-typescript/scripts/review.py +169 -0
  232. package/skills/kotlin-in-action/SKILL.md +261 -0
  233. package/skills/kotlin-in-action/assets/example_asset.txt +1 -0
  234. package/skills/kotlin-in-action/evals/evals.json +43 -0
  235. package/skills/kotlin-in-action/evals/results.json +13 -0
  236. package/skills/kotlin-in-action/examples/after.md +53 -0
  237. package/skills/kotlin-in-action/examples/before.md +39 -0
  238. package/skills/kotlin-in-action/references/api_reference.md +1 -0
  239. package/skills/kotlin-in-action/references/practices-catalog.md +436 -0
  240. package/skills/kotlin-in-action/references/review-checklist.md +204 -0
  241. package/skills/kotlin-in-action/scripts/example.py +1 -0
  242. package/skills/kotlin-in-action/scripts/setup_detekt.py +224 -0
  243. package/skills/lean-startup/SKILL.md +160 -0
  244. package/skills/lean-startup/assets/example_asset.txt +1 -0
  245. package/skills/lean-startup/evals/evals.json +43 -0
  246. package/skills/lean-startup/evals/results.json +13 -0
  247. package/skills/lean-startup/examples/after.md +80 -0
  248. package/skills/lean-startup/examples/before.md +34 -0
  249. package/skills/lean-startup/references/api_reference.md +319 -0
  250. package/skills/lean-startup/references/review-checklist.md +137 -0
  251. package/skills/lean-startup/scripts/example.py +1 -0
  252. package/skills/lean-startup/scripts/new_experiment.py +286 -0
  253. package/skills/microservices-patterns/SKILL.md +384 -0
  254. package/skills/microservices-patterns/evals/evals.json +45 -0
  255. package/skills/microservices-patterns/evals/results.json +13 -0
  256. package/skills/microservices-patterns/examples/after.md +69 -0
  257. package/skills/microservices-patterns/examples/before.md +40 -0
  258. package/skills/microservices-patterns/references/patterns-catalog.md +391 -0
  259. package/skills/microservices-patterns/references/review-checklist.md +169 -0
  260. package/skills/microservices-patterns/scripts/new_service.py +583 -0
  261. package/skills/programming-with-rust/SKILL.md +209 -0
  262. package/skills/programming-with-rust/evals/evals.json +37 -0
  263. package/skills/programming-with-rust/evals/results.json +13 -0
  264. package/skills/programming-with-rust/examples/after.md +107 -0
  265. package/skills/programming-with-rust/examples/before.md +59 -0
  266. package/skills/programming-with-rust/references/api_reference.md +152 -0
  267. package/skills/programming-with-rust/references/practices-catalog.md +335 -0
  268. package/skills/programming-with-rust/scripts/review.py +142 -0
  269. package/skills/refactoring-ui/SKILL.md +362 -0
  270. package/skills/refactoring-ui/assets/example_asset.txt +1 -0
  271. package/skills/refactoring-ui/evals/evals.json +45 -0
  272. package/skills/refactoring-ui/evals/results.json +13 -0
  273. package/skills/refactoring-ui/examples/after.md +85 -0
  274. package/skills/refactoring-ui/examples/before.md +58 -0
  275. package/skills/refactoring-ui/references/api_reference.md +355 -0
  276. package/skills/refactoring-ui/references/review-checklist.md +114 -0
  277. package/skills/refactoring-ui/scripts/audit_css.py +250 -0
  278. package/skills/refactoring-ui/scripts/example.py +1 -0
  279. package/skills/rust-in-action/SKILL.md +350 -0
  280. package/skills/rust-in-action/evals/evals.json +38 -0
  281. package/skills/rust-in-action/evals/results.json +13 -0
  282. package/skills/rust-in-action/examples/after.md +156 -0
  283. package/skills/rust-in-action/examples/before.md +56 -0
  284. package/skills/rust-in-action/references/practices-catalog.md +346 -0
  285. package/skills/rust-in-action/scripts/review.py +147 -0
  286. package/skills/skill-router/SKILL.md +186 -0
  287. package/skills/skill-router/evals/evals.json +38 -0
  288. package/skills/skill-router/evals/results.json +13 -0
  289. package/skills/skill-router/examples/after.md +63 -0
  290. package/skills/skill-router/examples/before.md +39 -0
  291. package/skills/skill-router/references/api_reference.md +24 -0
  292. package/skills/skill-router/references/routing-heuristics.md +89 -0
  293. package/skills/skill-router/references/skill-catalog.md +174 -0
  294. package/skills/skill-router/scripts/route.py +266 -0
  295. package/skills/spring-boot-in-action/SKILL.md +340 -0
  296. package/skills/spring-boot-in-action/evals/evals.json +39 -0
  297. package/skills/spring-boot-in-action/evals/results.json +13 -0
  298. package/skills/spring-boot-in-action/examples/after.md +185 -0
  299. package/skills/spring-boot-in-action/examples/before.md +84 -0
  300. package/skills/spring-boot-in-action/references/practices-catalog.md +403 -0
  301. package/skills/spring-boot-in-action/scripts/review.py +184 -0
  302. package/skills/storytelling-with-data/SKILL.md +241 -0
  303. package/skills/storytelling-with-data/assets/example_asset.txt +1 -0
  304. package/skills/storytelling-with-data/evals/evals.json +47 -0
  305. package/skills/storytelling-with-data/evals/results.json +13 -0
  306. package/skills/storytelling-with-data/examples/after.md +50 -0
  307. package/skills/storytelling-with-data/examples/before.md +33 -0
  308. package/skills/storytelling-with-data/references/api_reference.md +379 -0
  309. package/skills/storytelling-with-data/references/review-checklist.md +111 -0
  310. package/skills/storytelling-with-data/scripts/chart_review.py +301 -0
  311. package/skills/storytelling-with-data/scripts/example.py +1 -0
  312. package/skills/system-design-interview/SKILL.md +233 -0
  313. package/skills/system-design-interview/assets/example_asset.txt +1 -0
  314. package/skills/system-design-interview/evals/evals.json +46 -0
  315. package/skills/system-design-interview/evals/results.json +13 -0
  316. package/skills/system-design-interview/examples/after.md +94 -0
  317. package/skills/system-design-interview/examples/before.md +27 -0
  318. package/skills/system-design-interview/references/api_reference.md +582 -0
  319. package/skills/system-design-interview/references/review-checklist.md +201 -0
  320. package/skills/system-design-interview/scripts/example.py +1 -0
  321. package/skills/system-design-interview/scripts/new_design.py +421 -0
  322. package/skills/using-asyncio-python/SKILL.md +290 -0
  323. package/skills/using-asyncio-python/assets/example_asset.txt +1 -0
  324. package/skills/using-asyncio-python/evals/evals.json +43 -0
  325. package/skills/using-asyncio-python/evals/results.json +13 -0
  326. package/skills/using-asyncio-python/examples/after.md +68 -0
  327. package/skills/using-asyncio-python/examples/before.md +39 -0
  328. package/skills/using-asyncio-python/references/api_reference.md +267 -0
  329. package/skills/using-asyncio-python/references/review-checklist.md +149 -0
  330. package/skills/using-asyncio-python/scripts/check_blocking.py +270 -0
  331. package/skills/using-asyncio-python/scripts/example.py +1 -0
  332. package/skills/web-scraping-python/SKILL.md +280 -0
  333. package/skills/web-scraping-python/assets/example_asset.txt +1 -0
  334. package/skills/web-scraping-python/evals/evals.json +46 -0
  335. package/skills/web-scraping-python/evals/results.json +13 -0
  336. package/skills/web-scraping-python/examples/after.md +109 -0
  337. package/skills/web-scraping-python/examples/before.md +40 -0
  338. package/skills/web-scraping-python/references/api_reference.md +393 -0
  339. package/skills/web-scraping-python/references/review-checklist.md +163 -0
  340. package/skills/web-scraping-python/scripts/example.py +1 -0
  341. package/skills/web-scraping-python/scripts/new_scraper.py +231 -0
  342. package/skills/writing-plans/audit.json +34 -0
  343. package/tests/agent-detector.test.js +83 -0
  344. package/tests/corrections.test.js +245 -0
  345. package/tests/doctor/hook-installer.test.js +72 -0
  346. package/tests/doctor/usage-tracker.test.js +140 -0
  347. package/tests/engine/benchmark-eval.test.js +31 -0
  348. package/tests/engine/bm25-index.test.js +85 -0
  349. package/tests/engine/capture-command.test.js +35 -0
  350. package/tests/engine/capture.test.js +17 -0
  351. package/tests/engine/graph-augmented-search.test.js +107 -0
  352. package/tests/engine/graph-injector.test.js +44 -0
  353. package/tests/engine/graph.test.js +216 -0
  354. package/tests/engine/hybrid-searcher.test.js +74 -0
  355. package/tests/engine/indexer-bm25.test.js +37 -0
  356. package/tests/engine/mcp-tools.test.js +73 -0
  357. package/tests/engine/project-initializer-mcp.test.js +99 -0
  358. package/tests/engine/query-expander.test.js +36 -0
  359. package/tests/engine/reranker.test.js +51 -0
  360. package/tests/engine/rrf.test.js +49 -0
  361. package/tests/engine/srag-prefix.test.js +47 -0
  362. package/tests/instinct-block.test.js +23 -0
  363. package/tests/mcp-config-writer.test.js +60 -0
  364. package/tests/project-initializer-new-agents.test.js +48 -0
  365. package/tests/rules/rules-manager.test.js +230 -0
  366. package/tests/well-known-builder.test.js +40 -0
  367. package/tests/wizard/integration-detector.test.js +31 -0
  368. package/tests/wizard/project-detector.test.js +51 -0
  369. package/tests/wizard/prompt-session.test.js +61 -0
  370. package/tests/wizard/prompt.test.js +16 -0
  371. package/tests/wizard/registry-embeddings.test.js +35 -0
  372. package/tests/wizard/skill-recommender.test.js +34 -0
  373. package/tests/wizard/slot-count.test.js +25 -0
  374. package/vercel.json +21 -0
@@ -0,0 +1,109 @@
# After

A scraper using `requests.Session` for connection reuse, `BeautifulSoup` for HTML parsing, per-request retry logic, and polite rate limiting between pages.

```python
import logging
import time
from dataclasses import dataclass

import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

logger = logging.getLogger(__name__)

USER_AGENT = "JobResearchBot/1.0 (contact: scraping@mycompany.com)"
REQUEST_DELAY_SECONDS = 2.0


@dataclass
class JobListing:
    title: str
    company: str
    salary: str


def make_session() -> requests.Session:
    """Create a session with retry logic and a descriptive User-Agent."""
    session = requests.Session()
    session.headers.update({"User-Agent": USER_AGENT})

    retry_policy = Retry(
        total=3,
        backoff_factor=1.5,
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=["GET"],
    )
    adapter = HTTPAdapter(max_retries=retry_policy)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session


def parse_job_listings(html: str) -> list[JobListing]:
    """Extract job listings from a page of HTML using BeautifulSoup."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []

    for card in soup.select("article.job-card"):
        title_el = card.select_one("h2.job-title")
        company_el = card.select_one("span.company")
        salary_el = card.select_one("div.salary")

        if title_el is None:
            logger.debug("Skipping card with no title element")
            continue

        jobs.append(JobListing(
            title=title_el.get_text(strip=True),
            company=company_el.get_text(strip=True) if company_el else "",
            salary=salary_el.get_text(strip=True) if salary_el else "Not specified",
        ))

    return jobs


def scrape_jobs(base_url: str, num_pages: int) -> list[JobListing]:
    """Scrape job listings across multiple pages with rate limiting."""
    session = make_session()
    all_jobs: list[JobListing] = []

    for page in range(1, num_pages + 1):
        url = f"{base_url}?page={page}"
        logger.info("Fetching page %d: %s", page, url)

        try:
            response = session.get(url, timeout=15)
            response.raise_for_status()
        except requests.HTTPError as exc:
            logger.error("HTTP error on page %d: %s", page, exc)
            break
        except requests.RequestException as exc:
            logger.error("Request failed on page %d: %s — stopping", page, exc)
            break

        page_jobs = parse_job_listings(response.text)
        logger.info("Extracted %d listings from page %d", len(page_jobs), page)
        all_jobs.extend(page_jobs)

        if page < num_pages:
            time.sleep(REQUEST_DELAY_SECONDS)  # be polite

    return all_jobs


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    jobs = scrape_jobs("https://jobs.example.com/listings", num_pages=20)
    print(f"Total jobs scraped: {len(jobs)}")
```

Key improvements:
- `requests.Session` with `HTTPAdapter` reuses TCP connections and retries on transient server errors — one session for all pages instead of a new connection per request (Ch 1, 14: Session reuse and retry)
- `BeautifulSoup` with CSS selectors replaces regex HTML parsing — correct, readable, and resilient to attribute ordering changes (Ch 2: Use BeautifulSoup, not regex, for HTML)
- `parse_job_listings` is a pure function that takes an HTML string and returns typed `JobListing` dataclasses — easily unit-tested with saved HTML fixtures (Ch 15: Testing scrapers)
- `None` checks on each element before `.get_text()` prevent `AttributeError` when elements are missing (Ch 2: Defensive parsing)
- `time.sleep(REQUEST_DELAY_SECONDS)` between pages respects the server; `USER_AGENT` identifies the bot with a contact address (Ch 14, 18: Rate limiting and identification)
- Specific `requests.HTTPError` and `requests.RequestException` replace the bare `except` — errors are logged with page context and the crawl stops gracefully (Ch 1, 14: Error handling)
@@ -0,0 +1,40 @@
# Before

A scraper that hammers a job listings site with no delays, parses HTML with regex, swallows all errors, and creates a new TCP connection for every page.

```python
import urllib.request
import re

def scrape_jobs(base_url, num_pages):
    all_jobs = []

    for page in range(1, num_pages + 1):
        url = base_url + "?page=" + str(page)
        try:
            # New connection every request, no headers, no rate limiting
            response = urllib.request.urlopen(url)
            html = response.read().decode("utf-8")
        except:
            # Swallows every error — silent failures
            continue

        # Parsing HTML with regex — fragile and incorrect
        titles = re.findall(r'<h2 class="job-title">(.*?)</h2>', html)
        companies = re.findall(r'<span class="company">(.*?)</span>', html)
        salaries = re.findall(r'<div class="salary">(.*?)</div>', html)

        for i in range(len(titles)):
            job = {
                "title": titles[i],
                "company": companies[i] if i < len(companies) else "",
                "salary": salaries[i] if i < len(salaries) else "",
            }
            all_jobs.append(job)

    return all_jobs


jobs = scrape_jobs("https://jobs.example.com/listings", 20)
print(f"Scraped {len(jobs)} jobs")
```
@@ -0,0 +1,393 @@
# Web Scraping with Python — Practices Catalog

Chapter-by-chapter catalog of practices from *Web Scraping with Python* by Ryan Mitchell, for building scrapers.

---

## Chapter 1: Your First Web Scraper

### Basic Fetching
- **urllib.request** — `urlopen(url)` returns an HTTPResponse object; call `.read()` for the HTML bytes
- **requests library** — Preferred over urllib; `requests.get(url)` with headers, params, timeout support
- **Error handling** — Catch `HTTPError` (4xx/5xx), `URLError` (server not found), and connection timeouts
- **Response checking** — Always check `response.status_code`; handle 403 (forbidden), 404 (not found), 500 (server error)

### BeautifulSoup Basics
- **Creating soup** — `BeautifulSoup(html, 'html.parser')` or use `'lxml'` for speed
- **Direct tag access** — `soup.h1`, `soup.title` returns first matching tag
- **Tag attributes** — `tag.attrs` returns dict; `tag['href']` for specific attribute; `tag.get_text()` for text content
- **None checking** — Always check if `soup.find()` returns None before accessing attributes
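A minimal demonstration of these basics on an inline snippet (the markup is invented for the example):

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>Job Board</title></head>
<body>
  <h1>Openings</h1>
  <a href="/jobs/1" class="listing">Data Engineer</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.get_text()           # direct tag access: first <title>
link = soup.find("a", {"class": "listing"})

if link is not None:                         # always None-check find() results
    href = link["href"]                      # a single attribute
    attrs = link.attrs                       # the full attribute dict

missing = soup.find("table")                 # no <table> in the document: None
```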

---

## Chapter 2: Advanced HTML Parsing

### find and findAll
- **`find(tag, attributes, recursive, text, keywords)`** — Returns first matching element
- **`findAll(tag, attributes, recursive, text, limit, keywords)`** — Returns list of all matches
- **Attribute filtering** — `find('div', {'class': 'price'})`, `find('span', {'id': 'result'})`
- **Multiple tags** — `findAll(['h1', 'h2', 'h3'])` matches any of the listed tags
- **Text search** — `findAll(text='exact match')` or `findAll(text=re.compile('pattern'))`
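On a small invented snippet, the same calls look like this (`find_all` and `string=` are the current bs4 spellings of `findAll` and `text=`):

```python
import re

from bs4 import BeautifulSoup

html = """
<div>
  <h2>Summary</h2>
  <div class="price">$1,200</div>
  <h3>Details</h3>
  <span id="result">42 matches</span>
  <p>Sale: $99 today</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

price = soup.find("div", {"class": "price"})            # first match by attributes
result = soup.find("span", {"id": "result"})
headings = soup.find_all(["h2", "h3"])                  # match any listed tag
dollar_nodes = soup.find_all(string=re.compile(r"\$[\d,]+"))  # text-pattern search
```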

### CSS Selectors
- **`select(selector)`** — Use CSS selectors: `soup.select('div.content > p')`, `soup.select('#main .item')`
- **Common selectors** — `tag`, `.class`, `#id`, `tag.class`, `parent > child`, `ancestor descendant`, `tag[attr=val]`
- **Pseudo-selectors** — `:nth-of-type()`, `:first-child`, etc. for positional selection
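The same selector forms in action on an invented fragment:

```python
from bs4 import BeautifulSoup

html = """
<div id="main">
  <div class="content"><p>First</p><p>Second</p></div>
  <ul><li class="item">A</li><li class="item">B</li></ul>
  <a href="/wiki/Python">Python</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

paras = soup.select("div.content > p")          # direct children only
items = soup.select("#main .item")              # descendants of #main
wiki = soup.select('a[href="/wiki/Python"]')    # attribute selector
second = soup.select_one("div.content p:nth-of-type(2)")  # positional
```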

### Navigating the DOM Tree
- **Children** — `tag.children` (direct children iterator), `tag.descendants` (all descendants)
- **Siblings** — `tag.next_sibling`, `tag.previous_sibling`, `tag.next_siblings` (iterator)
- **Parents** — `tag.parent`, `tag.parents` (iterator up to document root)
- **Navigation tip** — NavigableString objects (text nodes) count as siblings; use `.find_next_sibling('tag')` to skip
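The text-node pitfall is easy to reproduce: whitespace between tags parses as a NavigableString sibling, so `.next_sibling` is often not the next tag.

```python
from bs4 import BeautifulSoup, Tag

html = "<tr><td>Name</td> <td>Role</td> <td>Team</td></tr>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("td")
raw_sibling = first.next_sibling              # the " " text node, not a tag
next_cell = first.find_next_sibling("td")     # skips text nodes: the Role cell
parent_name = first.parent.name               # "tr"
cells = [c for c in first.parent.children if isinstance(c, Tag)]  # tags only
```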

### Regular Expressions with BeautifulSoup
- **Regex in find** — `soup.find('img', {'src': re.compile(r'\.jpg$')})` matches pattern against attribute
- **Regex in findAll** — `soup.findAll('a', {'href': re.compile(r'^/wiki/')})` for link patterns
- **Text regex** — `soup.findAll(text=re.compile(r'\$[\d,]+'))` for finding price patterns
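For example, separating internal wiki links and image types on an invented fragment:

```python
import re

from bs4 import BeautifulSoup

html = """
<a href="/wiki/Web_scraping">internal</a>
<a href="https://example.com/page">external</a>
<img src="/img/logo.jpg"><img src="/img/banner.png">
"""
soup = BeautifulSoup(html, "html.parser")

# The regex is matched against the attribute value of each candidate tag
wiki_links = soup.find_all("a", {"href": re.compile(r"^/wiki/")})
jpgs = soup.find_all("img", {"src": re.compile(r"\.jpg$")})
```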

### Lambda Functions
- **Lambda filters** — `soup.find_all(lambda tag: len(tag.attrs) == 2)` for custom tag filtering
- **Complex conditions** — Combine tag name, attributes, text content in lambda for precise selection
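Any callable that takes a tag and returns a boolean works as a filter; for instance:

```python
from bs4 import BeautifulSoup

html = """
<p class="a" id="x">two attributes</p>
<p class="b">one attribute</p>
<span class="c" id="y">two attributes, span</span>
"""
soup = BeautifulSoup(html, "html.parser")

# Exactly two attributes, any tag name
two_attrs = soup.find_all(lambda tag: len(tag.attrs) == 2)

# Combine tag name and attribute count for a tighter match
two_attr_paras = soup.find_all(lambda tag: tag.name == "p" and len(tag.attrs) == 2)
```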
52
+
53
+ ---
54
+
55
+ ## Chapter 3: Writing Web Crawlers
56
+
57
+ ### Single-Domain Crawling
58
+ - **Internal link collection** — Find all `<a>` tags; filter for same-domain links using `urlparse`
59
+ - **URL normalization** — Resolve relative URLs with `urljoin`; strip fragments and query strings for dedup
60
+ - **Visited tracking** — Maintain a `set()` of visited URLs; check before fetching
61
+ - **Breadth-first** — Use a queue (`collections.deque`) for BFS traversal of the site
62
+ - **Depth-first** — Use a stack (list) for DFS; useful for deep hierarchical sites
63
+
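The steps above (internal-link filtering, URL normalization, visited set, BFS queue) fit together as a small crawler. This is a sketch, not a production crawler: link extraction uses a regex stand-in instead of BeautifulSoup so it is stdlib-only, and fetching is injected so the example runs against canned pages rather than the network. All URLs are hypothetical.

```python
import re
from collections import deque
from urllib.parse import urljoin, urldefrag, urlparse

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl of a single domain.

    fetch(url) -> HTML string; injected so the crawler can be tested
    against canned pages instead of the live network.
    """
    domain = urlparse(start_url).netloc
    queue = deque([start_url])
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        for href in re.findall(r'href="([^"]+)"', html):
            # Resolve relative URL and strip the fragment for dedup
            absolute, _ = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == domain and absolute not in visited:
                queue.append(absolute)
    return visited

# Canned site: two internal pages plus one external link
pages = {
    'http://example.com/': '<a href="/a">a</a> <a href="http://other.com/">x</a>',
    'http://example.com/a': '<a href="/#top">home</a>',
}
found = crawl('http://example.com/', lambda url: pages.get(url, ''))
```

Swapping `deque.popleft()` for `list.pop()` turns the same loop into the DFS variant.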
64
+ ### Building Robust Crawlers
65
+ - **Recursive crawling** — Function that fetches page, extracts links, recurses on unvisited links
66
+ - **Data extraction during crawl** — Extract target data while crawling; don't just collect URLs
67
+ - **Depth limiting** — Set maximum crawl depth to prevent infinite recursion
68
+ - **URL deduplication** — Normalize URLs before adding to visited set; handle trailing slashes, www prefix
69
+
70
+ ---
71
+
72
+ ## Chapter 4: Web Crawling Models
73
+
74
+ ### Planning a Crawl
75
+ - **Site mapping** — Understand site structure before coding; identify URL patterns, pagination, categories
76
+ - **Crawl scope** — Define which pages/sections to include or exclude
77
+ - **Data schema** — Define what to extract before building; normalize across different page layouts
78
+
79
+ ### Handling Different Layouts
80
+ - **Template detection** — Sites may use different templates for different content types
81
+ - **Conditional parsing** — Check page type (product vs category vs article) and apply appropriate parser
82
+ - **Data normalization** — Map different field names/formats from different layouts to a unified schema
83
+
84
+ ### Cross-Site Crawling
85
+ - **Multi-domain** — Maintain per-domain settings (delays, selectors, credentials)
86
+ - **Link following policies** — Decide which external links to follow; whitelist/blacklist domains
87
+ - **Politeness per domain** — Track per-domain request timing; respect each site's robots.txt
88
+
89
+ ---
90
+
91
+ ## Chapter 5: Scrapy
92
+
93
+ ### Scrapy Architecture
94
+ - **Spider** — Defines how to crawl and parse; subclass `scrapy.Spider`; implement `parse()` method
95
+ - **Items** — Structured data containers; define fields with `scrapy.Item` and `scrapy.Field()`
96
+ - **Pipelines** — Process items after extraction; validate, clean, store to database/file
97
+ - **Middleware** — Hook into request/response processing; add headers, proxy rotation, retry logic
98
+ - **Settings** — Configure concurrency (`CONCURRENT_REQUESTS`), delays (`DOWNLOAD_DELAY`), user agent, etc.
99
+
100
+ ### CrawlSpider
101
+ - **Rules** — Define `Rule(LinkExtractor(...), callback=...)` for automatic link following
102
+ - **LinkExtractor** — Filter links by `allow` (regex), `deny`, `restrict_css`, `restrict_xpaths`
103
+ - **Callback** — Assign parse methods to different URL patterns; `follow=True` for recursive crawling
104
+
105
+ ### Scrapy Best Practices
106
+ - **Item loaders** — Use `ItemLoader` for cleaner extraction with input/output processors
107
+ - **Logging** — Configure log levels (`LOG_LEVEL = 'INFO'`); log to file for production runs
108
+ - **Autothrottle** — Enable `AUTOTHROTTLE_ENABLED` for adaptive request pacing
109
+ - **Feed exports** — Built-in export to JSON, CSV, XML via `-o output.json`
110
+ - **Contracts** — Add docstring-based contracts for spider testing
111
+
112
+ ---
113
+
114
+ ## Chapter 6: Storing Data
115
+
116
+ ### File Storage
117
+ - **CSV** — Use `csv.writer` or `csv.DictWriter`; handle encoding with `encoding='utf-8'`
118
+ - **JSON** — Use `json.dump()` for structured data; JSON Lines for streaming/appending
119
+ - **Raw files** — Download images, PDFs with `urllib.request.urlretrieve()` (a legacy interface) or `requests.get(url, stream=True)` to avoid loading large files into memory
120
+
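A minimal sketch of both file formats, using an in-memory buffer in place of a real file (with real files, open CSVs with `newline=''` and `encoding='utf-8'`):

```python
import csv
import io
import json

rows = [
    {'title': 'Widget', 'price': 19.99},
    {'title': 'Gadget', 'price': 5.00},
]

# CSV with DictWriter: header row plus one row per dict
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['title', 'price'])
writer.writeheader()
writer.writerows(rows)

# JSON Lines: one object per line, append-friendly for streaming scrapes
jsonl = '\n'.join(json.dumps(r) for r in rows)
records = [json.loads(line) for line in jsonl.splitlines()]
```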
121
+ ### Database Storage
122
+ - **MySQL** — Use `pymysql` connector; parameterized queries to prevent SQL injection
123
+ - **PostgreSQL** — Use `psycopg2`; connection pooling for concurrent scrapers
124
+ - **SQLite** — Use built-in `sqlite3` for lightweight local storage; good for prototyping
125
+ - **Schema design** — Design tables to match extracted data; use appropriate types; add indexes on lookup columns
126
+
127
+ ### Email Integration
128
+ - **smtplib** — Send scraped data or alerts via email; useful for monitoring scraper results
129
+ - **Notifications** — Alert on scraper failures, unusual data patterns, or completion
130
+
131
+ ### Storage Best Practices
132
+ - **Idempotent storage** — Check for duplicates before inserting; use UPSERT patterns
133
+ - **Raw preservation** — Store raw HTML alongside extracted data for re-parsing capability
134
+ - **Batch operations** — Use bulk inserts for efficiency; commit in batches, not per-row
135
+ - **Connection management** — Use context managers; close connections properly; handle reconnection
136
+
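Idempotent storage, parameterized queries, and batched commits can be shown together with the built-in `sqlite3` module (the `ON CONFLICT` upsert syntax needs SQLite 3.24+, standard in modern Python builds); the table and URLs are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE products (
        url   TEXT PRIMARY KEY,   -- natural key makes re-scrapes idempotent
        title TEXT,
        price REAL
    )
""")

UPSERT = """INSERT INTO products (url, title, price) VALUES (?, ?, ?)
            ON CONFLICT(url) DO UPDATE SET title=excluded.title,
                                           price=excluded.price"""

batch = [
    ('http://example.com/widget', 'Widget', 19.99),
    ('http://example.com/gadget', 'Gadget', 5.00),
]
with conn:  # one transaction for the whole batch, not per-row commits
    conn.executemany(UPSERT, batch)

# Re-scraping the same URL with a new price updates rather than duplicates
with conn:
    conn.executemany(UPSERT, [('http://example.com/widget', 'Widget', 17.99)])

count = conn.execute('SELECT COUNT(*) FROM products').fetchone()[0]
price = conn.execute(
    "SELECT price FROM products WHERE url = 'http://example.com/widget'"
).fetchone()[0]
```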
137
+ ---
138
+
139
+ ## Chapter 7: Reading Documents
140
+
141
+ ### PDF Extraction
142
+ - **PDFMiner** — Extract text from PDFs; handle multi-column layouts and tables
143
+ - **Page-by-page** — Process PDFs page by page for memory efficiency
144
+ - **Tables in PDFs** — Use tabula-py or camelot for structured table extraction
145
+
146
+ ### Word Documents
147
+ - **python-docx** — Read `.docx` files; extract paragraphs, tables, headers
148
+ - **Older formats** — Handle `.doc` files with antiword or textract
149
+
150
+ ### Encoding
151
+ - **Character detection** — Use `chardet` to detect file encoding when unknown
152
+ - **UTF-8 normalization** — Convert all text to UTF-8; handle BOM (Byte Order Mark)
153
+ - **HTML encoding** — Read `<meta charset>` tag; handle entity references (`&amp;`, `&lt;`)
154
+
155
+ ---
156
+
157
+ ## Chapter 8: Cleaning Dirty Data
158
+
159
+ ### String Normalization
160
+ - **Whitespace** — Strip leading/trailing whitespace; normalize internal whitespace (multiple spaces to one)
161
+ - **Unicode normalization** — Use `unicodedata.normalize('NFKD', text)` for consistent Unicode representation
162
+ - **Case normalization** — Lowercase for comparison; preserve original for display
163
+
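The three normalization steps combine into one small stdlib-only helper (note that NFKD decomposes accented characters into base letter plus combining mark, and folds non-breaking spaces into plain spaces):

```python
import re
import unicodedata

def normalize(text):
    # NFKD folds compatibility characters (non-breaking spaces, ligatures)
    text = unicodedata.normalize('NFKD', text)
    # Collapse runs of internal whitespace and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text

raw = '\u00a0 Caf\u00e9\u00a0\u00a0Menu \n'
clean = normalize(raw)
key = clean.lower()   # lowercase copy for comparison; keep clean for display
```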
164
+ ### Regex Cleaning
165
+ - **Pattern extraction** — Use regex groups to extract structured data from messy text (prices, dates, phone numbers)
166
+ - **Substitution** — `re.sub()` to remove or replace unwanted characters and patterns
167
+ - **Compiled patterns** — Pre-compile frequently used patterns with `re.compile()` for performance
168
+
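A short sketch of pre-compiled extraction and substitution on an invented messy string:

```python
import re

# Pre-compiled once, reused on every record
PRICE = re.compile(r'\$([\d,]+(?:\.\d{2})?)')

messy = 'Sale! Was $1,299.00, now only $999.00 (save $300)'
prices = [float(m.replace(',', '')) for m in PRICE.findall(messy)]

# Substitution: strip everything except word characters and spaces
slug = re.sub(r'[^\w\s]', '', 'Hello, World!').strip()
```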
169
+ ### Data Normalization
170
+ - **Date formats** — Parse various date formats with `dateutil.parser`; store in ISO 8601
171
+ - **Number formats** — Handle commas, currency symbols, percentage signs; convert to numeric types
172
+ - **Address normalization** — Standardize address components; handle abbreviations
173
+
174
+ ### OpenRefine
175
+ - **Faceting** — Group similar values to find inconsistencies
176
+ - **Clustering** — Automatically find and merge similar values (fingerprint, n-gram, etc.)
177
+ - **GREL expressions** — Transform data with OpenRefine's expression language
178
+
179
+ ---
180
+
181
+ ## Chapter 9: Natural Language Processing
182
+
183
+ ### Text Analysis
184
+ - **N-grams** — Extract sequences of N words; useful for finding common phrases and patterns
185
+ - **Frequency analysis** — Count word/phrase frequencies; identify key topics in scraped text
186
+ - **Stop words** — Filter common words (the, is, at) to focus on meaningful content
187
+
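N-gram extraction, stop-word filtering, and frequency counting fit in a few lines with `collections.Counter`; the tiny stop-word set and sample sentence are placeholders (real analyses use a full list such as NLTK's):

```python
import re
from collections import Counter

STOP_WORDS = frozenset({'the', 'is', 'at', 'of', 'a'})  # toy subset

def ngrams(text, n=2):
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOP_WORDS]
    # Sliding window of n consecutive words, counted
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

text = 'the web page loads the web page fast'
common = ngrams(text).most_common(1)   # most frequent bigram
```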
188
+ ### Markov Models
189
+ - **Text generation** — Build Markov chains from scraped text; generate similar-style text
190
+ - **Chain order** — Higher order (2-gram, 3-gram) produces more coherent but less varied output
191
+
192
+ ### NLTK
193
+ - **Tokenization** — Split text into words and sentences with NLTK tokenizers
194
+ - **Part-of-speech tagging** — Tag words as nouns, verbs, etc. for structured extraction
195
+ - **Named entity recognition** — Extract names, organizations, locations from text
196
+ - **Stemming/lemmatization** — Reduce words to base forms for better matching and analysis
197
+
198
+ ---
199
+
200
+ ## Chapter 10: Crawling Through Forms and Logins
201
+
202
+ ### Form Submission
203
+ - **POST requests** — `requests.post(url, data={'field': 'value'})` for form submission
204
+ - **CSRF tokens** — Extract hidden CSRF token from form HTML; include in POST data
205
+ - **Form fields** — Inspect form with browser DevTools; identify all required fields including hidden ones
206
+ - **File uploads** — Use `files` parameter in `requests.post()` for multipart form data
207
+
208
+ ### Session Management
209
+ - **requests.Session()** — Maintains cookies across requests; handles redirects; connection pooling
210
+ - **Cookie persistence** — Session object automatically stores and sends cookies
211
+ - **Login flow** — GET login page → extract CSRF → POST credentials → use session for authenticated pages
212
+
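The CSRF step of the login flow can be sketched offline against a saved copy of a hypothetical login page; the field names and URL are invented, token extraction uses a stdlib regex (a real scraper might use BeautifulSoup), and the actual `requests.Session` calls are shown only as comments since they need a live site:

```python
import re

# A saved copy of a (hypothetical) login page with a hidden CSRF field
login_html = """
<form method="post" action="/login">
  <input type="hidden" name="csrf_token" value="a1b2c3d4">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

def extract_csrf(html, field='csrf_token'):
    match = re.search(r'name="%s"\s+value="([^"]+)"' % re.escape(field), html)
    return match.group(1) if match else None

token = extract_csrf(login_html)
payload = {'username': 'user', 'password': 'secret', 'csrf_token': token}

# With requests, the full flow would be:
#   session = requests.Session()
#   session.post('https://example.com/login', data=payload)
#   ...the session now carries the auth cookies for later requests.
```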
213
+ ### Authentication
214
+ - **HTTP Basic Auth** — `requests.get(url, auth=('user', 'pass'))` for Basic authentication
215
+ - **Token-based** — Extract auth token from login response; send in headers for subsequent requests
216
+ - **OAuth** — Use `requests-oauthlib` for OAuth-protected APIs
217
+ - **Session expiry** — Detect expired sessions (redirects to login); re-authenticate automatically
218
+
219
+ ---
220
+
221
+ ## Chapter 11: Scraping JavaScript
222
+
223
+ ### Selenium WebDriver
224
+ - **Setup** — `webdriver.Chrome()` or `webdriver.Firefox()`; needs a matching driver binary (Selenium 4.6+ can download one automatically via Selenium Manager)
225
+ - **Headless mode** — `options.add_argument('--headless')` for browser without GUI; essential for servers
226
+ - **Navigation** — `driver.get(url)`; `driver.find_element(By.CSS_SELECTOR, selector)`
227
+ - **Interaction** — `.click()`, `.send_keys()`, `.clear()` on elements; simulate user behavior
228
+
229
+ ### Waiting for Content
230
+ - **Implicit waits** — `driver.implicitly_wait(10)` sets default wait for element finding
231
+ - **Explicit waits** — `WebDriverWait(driver, 10).until(EC.presence_of_element_located(...))` for specific conditions
232
+ - **Expected conditions** — `element_to_be_clickable`, `visibility_of_element_located`, `text_to_be_present_in_element`
233
+ - **Custom waits** — Write lambda conditions for complex wait scenarios
234
+
235
+ ### JavaScript Execution
236
+ - **Execute script** — `driver.execute_script('return document.title')` runs JS in page context
237
+ - **Scroll page** — `driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')` for infinite scroll
238
+ - **Extract data** — Execute JS to extract data from page variables, localStorage, or DOM
239
+
240
+ ### Ajax Handling
241
+ - **Wait for Ajax** — Wait for specific elements that load asynchronously
242
+ - **Network monitoring** — Intercept XHR requests to find underlying API endpoints
243
+ - **Alternative approach** — If you can identify the API endpoint, use `requests` directly instead of Selenium
244
+
245
+ ---
246
+
247
+ ## Chapter 12: Crawling Through APIs
248
+
249
+ ### REST API Basics
250
+ - **HTTP methods** — GET (read), POST (create), PUT (update), DELETE (remove)
251
+ - **JSON responses** — `response.json()` for parsing; handle nested objects and arrays
252
+ - **Headers** — Set `Accept: application/json`, `Authorization: Bearer token`
253
+ - **Query parameters** — `requests.get(url, params={'key': 'value'})` for clean URL building
254
+
255
+ ### Undocumented APIs
256
+ - **Browser DevTools** — Use Network tab to discover API calls made by JavaScript
257
+ - **XHR filtering** — Filter network requests to XHR/Fetch to find data endpoints
258
+ - **Request replication** — Copy request headers, cookies, parameters from DevTools to Python
259
+ - **API reverse engineering** — Study request patterns to understand pagination, filtering, authentication
260
+
261
+ ### API Best Practices
262
+ - **Rate limiting** — Respect rate limit headers; implement backoff on 429 responses
263
+ - **Pagination** — Handle cursor-based, offset-based, and link-header pagination
264
+ - **Error handling** — Retry on 5xx errors with exponential backoff; don't retry on 4xx
265
+ - **Authentication** — Store API keys securely; handle token refresh for OAuth
266
+
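The retry policy above (backoff on 429/5xx, fail fast on other 4xx) can be sketched as a small wrapper; the request and the sleep are both injected so the policy is testable without a network, and the response shape is an assumption of this sketch:

```python
import time

def fetch_with_backoff(do_request, max_retries=4, base_delay=1.0,
                       sleep=time.sleep):
    """Retry on 429/5xx with exponential backoff; fail fast on other 4xx.

    do_request() -> (status_code, body); injected for testability.
    """
    for attempt in range(max_retries):
        status, body = do_request()
        if status < 400:
            return body
        if status == 429 or status >= 500:
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
            continue
        raise RuntimeError('client error %d, not retrying' % status)
    raise RuntimeError('gave up after %d attempts' % max_retries)

# Simulated endpoint: two 503s, then success
responses = iter([(503, ''), (503, ''), (200, '{"ok": true}')])
delays = []
body = fetch_with_backoff(lambda: next(responses), sleep=delays.append)
```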
267
+ ---
268
+
269
+ ## Chapter 13: Image Processing and OCR
270
+
271
+ ### Pillow (PIL)
272
+ - **Image loading** — `Image.open(path)` or from URL response content
273
+ - **Manipulation** — Resize, crop, rotate, filter for preprocessing before OCR
274
+ - **Thresholding** — Convert to grayscale; apply threshold for clean black/white text
275
+
276
+ ### Tesseract OCR
277
+ - **pytesseract** — `pytesseract.image_to_string(image)` for text extraction from images
278
+ - **Preprocessing** — Clean images before OCR: denoise, deskew, threshold, resize
279
+ - **Language support** — Specify language with `lang='eng'`; install language packs as needed
280
+ - **Confidence** — Use `image_to_data()` for per-word confidence scores; filter low confidence
281
+
282
+ ### CAPTCHA Handling
283
+ - **Simple CAPTCHAs** — Preprocessing + OCR may solve simple text CAPTCHAs
284
+ - **Complex CAPTCHAs** — Consider CAPTCHA-solving services or rethink approach (use API instead)
285
+ - **Ethical note** — CAPTCHAs exist to prevent automated access; respect their purpose
286
+
287
+ ---
288
+
289
+ ## Chapter 14: Avoiding Scraping Traps
290
+
291
+ ### Headers and Identity
292
+ - **User-Agent** — Set a realistic browser User-Agent string; rotate for large-scale scraping
293
+ - **Accept headers** — Include Accept, Accept-Language, Accept-Encoding to mimic real browsers
294
+ - **Referer** — Set appropriate Referer header when navigating between pages
295
+ - **Cookie handling** — Accept and send cookies; use sessions for automatic management
296
+
297
+ ### Behavioral Patterns
298
+ - **Request timing** — Add random delays between requests (1-5 seconds); avoid perfectly regular intervals
299
+ - **Navigation patterns** — Don't jump straight to data pages; mimic human browsing (home → category → product)
300
+ - **Click patterns** — With Selenium, click through pages naturally rather than jumping directly to URLs
301
+
302
+ ### Honeypot Detection
303
+ - **Hidden links** — Check for CSS `display:none` or `visibility:hidden` links; avoid following them
304
+ - **Hidden form fields** — Pre-filled hidden fields may be traps; don't submit unexpected values
305
+ - **Link patterns** — Suspicious URL patterns or link text may indicate honeypots
306
+
307
+ ### IP and Session Management
308
+ - **Proxy rotation** — Rotate IP addresses for large-scale scraping; use proxy services
309
+ - **Session rotation** — Create new sessions periodically; don't use same cookies indefinitely
310
+ - **Fingerprint diversity** — Vary headers, timing, and behavior to avoid fingerprinting
311
+
312
+ ---
313
+
314
+ ## Chapter 15: Testing Scrapers
315
+
316
+ ### Unit Testing
317
+ - **Parse function tests** — Test parsing functions with saved HTML files; verify extracted data
318
+ - **Fixture files** — Save representative HTML pages as test fixtures; don't hit live sites in tests
319
+ - **Edge cases** — Test with missing elements, empty pages, different layouts, malformed HTML
320
+
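A minimal illustration of fixture-based parse testing, assuming `beautifulsoup4`; the parse function, selectors, and fixture are all hypothetical stand-ins for whatever the scraper actually extracts:

```python
from bs4 import BeautifulSoup

def parse_product(html):
    """Extract fields from a product page; None for anything missing."""
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.select_one('h1.title')
    price = soup.select_one('span.price')
    return {
        'title': title.get_text(strip=True) if title else None,
        'price': price.get_text(strip=True) if price else None,
    }

# Fixture: a representative page saved to disk, never fetched live in tests
FIXTURE = '<h1 class="title"> Widget </h1><span class="price">$19.99</span>'

def test_parse_product():
    assert parse_product(FIXTURE) == {'title': 'Widget', 'price': '$19.99'}

def test_missing_price():  # edge case: element absent from the page
    assert parse_product('<h1 class="title">X</h1>')['price'] is None

test_parse_product()
test_missing_price()
```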
321
+ ### Integration Testing
322
+ - **End-to-end** — Test full scrape pipeline from fetch to storage with known target pages
323
+ - **Selenium tests** — Use Selenium for testing JavaScript-heavy scraping flows
324
+ - **Mock responses** — Use `responses` or `requests-mock` libraries for HTTP mocking in tests
325
+
326
+ ### Testing Best Practices
327
+ - **Site change detection** — Periodically check if site structure has changed; alert on selector failures
328
+ - **Regression testing** — Compare current results against known-good baselines
329
+ - **CI integration** — Run scraper tests in CI pipeline; catch issues before deployment
330
+
331
+ ---
332
+
333
+ ## Chapter 16: Parallel Web Scraping
334
+
335
+ ### Threading
336
+ - **threading module** — Use for I/O-bound scraping; the GIL is released while threads block on network I/O
337
+ - **Thread pool** — `concurrent.futures.ThreadPoolExecutor` for managed thread pools
338
+ - **Thread safety** — Use locks for shared state (counters, result lists); prefer queues for task distribution
339
+
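A thread-pool sketch of the pattern: here `fetch` just sleeps to stand in for a blocking HTTP request, and results are collected on the main thread so no lock is needed:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # Stand-in for an HTTP request: sleeping releases the GIL like I/O does
    time.sleep(0.01)
    return url, len(url)

urls = ['http://example.com/%d' % i for i in range(8)]

results = {}
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(fetch, u) for u in urls]
    for future in as_completed(futures):
        url, size = future.result()
        results[url] = size   # collected in the main thread: no lock needed
```

Swapping in `ProcessPoolExecutor` gives the CPU-bound variant with the same interface.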
340
+ ### Multiprocessing
341
+ - **multiprocessing module** — Use for CPU-bound processing (parsing, cleaning); bypasses GIL
342
+ - **Process pool** — `concurrent.futures.ProcessPoolExecutor` for managed process pools
343
+ - **Inter-process communication** — Use Queue for task distribution; Pipe for point-to-point
344
+
345
+ ### Queue-Based Architecture
346
+ - **Producer-consumer** — Producer adds URLs to queue; consumers fetch and parse in parallel
347
+ - **URL frontier** — Priority queue for managing which URLs to crawl next
348
+ - **Result aggregation** — Collect results from workers into shared storage
349
+
350
+ ### Parallel Best Practices
351
+ - **Per-domain limits** — Limit concurrent requests per domain even with parallel scraping
352
+ - **Graceful shutdown** — Handle KeyboardInterrupt; drain queues cleanly on shutdown
353
+ - **Error isolation** — One worker's failure shouldn't crash the entire scraping operation
354
+ - **Progress tracking** — Log completed/remaining tasks; monitor worker health
355
+
356
+ ---
357
+
358
+ ## Chapter 17: Remote Scraping
359
+
360
+ ### Tor
361
+ - **Tor proxy** — Route requests through Tor network for anonymity; `socks5://127.0.0.1:9150`
362
+ - **IP verification** — Check IP with a service like httpbin.org/ip to verify Tor is active
363
+ - **Performance** — Tor is slow; use only when anonymity is required
364
+ - **Circuit rotation** — Signal Tor to create new circuit for fresh IP; don't rotate too frequently
365
+
366
+ ### Proxy Services
367
+ - **Rotating proxies** — Commercial proxy services provide rotating IP pools
368
+ - **Proxy types** — HTTP/HTTPS proxies, SOCKS proxies; understand the difference
369
+ - **Proxy configuration** — `requests.get(url, proxies={'http': proxy_url})`; or configure in Scrapy settings
370
+
371
+ ### Cloud-Based Scraping
372
+ - **Headless instances** — Run scrapers on cloud VMs (AWS, GCP, DigitalOcean) for scale
373
+ - **Containerization** — Docker containers for consistent scraper environments
374
+ - **Scheduling** — Use cron, cloud schedulers, or orchestration tools for recurring scrapes
375
+ - **Cost management** — Right-size instances; use spot/preemptible instances for batch scraping
376
+
377
+ ---
378
+
379
+ ## Chapter 18: Legalities and Ethics
380
+
381
+ ### Legal Framework
382
+ - **robots.txt** — Machine-readable file at `/robots.txt`; specifies which paths are allowed/disallowed
383
+ - **Terms of Service** — Many sites prohibit scraping in ToS; understand the legal weight
384
+ - **CFAA** — Computer Fraud and Abuse Act (US); accessing computers "without authorization" is a federal crime
385
+ - **Copyright** — Scraped data may be copyrighted; fair use depends on purpose and amount
386
+ - **GDPR** — If scraping personal data of EU citizens, GDPR obligations apply
387
+
388
+ ### Ethical Scraping
389
+ - **Respect the site** — Don't overload servers; honor rate limits; scrape during off-peak hours
390
+ - **Identify yourself** — Use a descriptive User-Agent; provide contact email for site administrators
391
+ - **Minimize footprint** — Only scrape what you need; don't archive entire sites unnecessarily
392
+ - **Data handling** — Handle scraped personal data responsibly; minimize collection and storage
393
+ - **Give back** — If possible, contribute to the site or community; don't just extract value