@booklib/skills 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/LICENSE +21 -0
- package/README.md +105 -0
- package/animation-at-work/SKILL.md +246 -0
- package/animation-at-work/assets/example_asset.txt +1 -0
- package/animation-at-work/references/api_reference.md +369 -0
- package/animation-at-work/references/review-checklist.md +79 -0
- package/animation-at-work/scripts/example.py +1 -0
- package/bin/skills.js +85 -0
- package/clean-code-reviewer/SKILL.md +292 -0
- package/clean-code-reviewer/evals/evals.json +67 -0
- package/data-intensive-patterns/SKILL.md +204 -0
- package/data-intensive-patterns/assets/example_asset.txt +1 -0
- package/data-intensive-patterns/references/api_reference.md +34 -0
- package/data-intensive-patterns/references/patterns-catalog.md +551 -0
- package/data-intensive-patterns/references/review-checklist.md +193 -0
- package/data-intensive-patterns/scripts/example.py +1 -0
- package/data-pipelines/SKILL.md +252 -0
- package/data-pipelines/assets/example_asset.txt +1 -0
- package/data-pipelines/references/api_reference.md +301 -0
- package/data-pipelines/references/review-checklist.md +181 -0
- package/data-pipelines/scripts/example.py +1 -0
- package/design-patterns/SKILL.md +245 -0
- package/design-patterns/assets/example_asset.txt +1 -0
- package/design-patterns/references/api_reference.md +1 -0
- package/design-patterns/references/patterns-catalog.md +726 -0
- package/design-patterns/references/review-checklist.md +173 -0
- package/design-patterns/scripts/example.py +1 -0
- package/domain-driven-design/SKILL.md +221 -0
- package/domain-driven-design/assets/example_asset.txt +1 -0
- package/domain-driven-design/references/api_reference.md +1 -0
- package/domain-driven-design/references/patterns-catalog.md +545 -0
- package/domain-driven-design/references/review-checklist.md +158 -0
- package/domain-driven-design/scripts/example.py +1 -0
- package/effective-java/SKILL.md +195 -0
- package/effective-java/assets/example_asset.txt +1 -0
- package/effective-java/references/api_reference.md +1 -0
- package/effective-java/references/items-catalog.md +955 -0
- package/effective-java/references/review-checklist.md +216 -0
- package/effective-java/scripts/example.py +1 -0
- package/effective-kotlin/SKILL.md +225 -0
- package/effective-kotlin/assets/example_asset.txt +1 -0
- package/effective-kotlin/references/api_reference.md +1 -0
- package/effective-kotlin/references/practices-catalog.md +1228 -0
- package/effective-kotlin/references/review-checklist.md +126 -0
- package/effective-kotlin/scripts/example.py +1 -0
- package/kotlin-in-action/SKILL.md +251 -0
- package/kotlin-in-action/assets/example_asset.txt +1 -0
- package/kotlin-in-action/references/api_reference.md +1 -0
- package/kotlin-in-action/references/practices-catalog.md +436 -0
- package/kotlin-in-action/references/review-checklist.md +204 -0
- package/kotlin-in-action/scripts/example.py +1 -0
- package/lean-startup/SKILL.md +250 -0
- package/lean-startup/assets/example_asset.txt +1 -0
- package/lean-startup/references/api_reference.md +319 -0
- package/lean-startup/references/review-checklist.md +137 -0
- package/lean-startup/scripts/example.py +1 -0
- package/microservices-patterns/SKILL.md +179 -0
- package/microservices-patterns/references/patterns-catalog.md +391 -0
- package/microservices-patterns/references/review-checklist.md +169 -0
- package/package.json +17 -0
- package/refactoring-ui/SKILL.md +236 -0
- package/refactoring-ui/assets/example_asset.txt +1 -0
- package/refactoring-ui/references/api_reference.md +355 -0
- package/refactoring-ui/references/review-checklist.md +114 -0
- package/refactoring-ui/scripts/example.py +1 -0
- package/storytelling-with-data/SKILL.md +238 -0
- package/storytelling-with-data/assets/example_asset.txt +1 -0
- package/storytelling-with-data/references/api_reference.md +379 -0
- package/storytelling-with-data/references/review-checklist.md +111 -0
- package/storytelling-with-data/scripts/example.py +1 -0
- package/system-design-interview/SKILL.md +213 -0
- package/system-design-interview/assets/example_asset.txt +1 -0
- package/system-design-interview/references/api_reference.md +582 -0
- package/system-design-interview/references/review-checklist.md +201 -0
- package/system-design-interview/scripts/example.py +1 -0
- package/using-asyncio-python/SKILL.md +242 -0
- package/using-asyncio-python/assets/example_asset.txt +1 -0
- package/using-asyncio-python/references/api_reference.md +267 -0
- package/using-asyncio-python/references/review-checklist.md +149 -0
- package/using-asyncio-python/scripts/example.py +1 -0
- package/web-scraping-python/SKILL.md +259 -0
- package/web-scraping-python/assets/example_asset.txt +1 -0
- package/web-scraping-python/references/api_reference.md +393 -0
- package/web-scraping-python/references/review-checklist.md +163 -0
- package/web-scraping-python/scripts/example.py +1 -0
@@ -0,0 +1,163 @@
# Web Scraping with Python — Scraper Review Checklist

Systematic checklist for reviewing web scrapers against the 18 chapters
from *Web Scraping with Python* by Ryan Mitchell.

---

## 1. Fetching & Connection (Chapters 1, 10–11)

### HTTP Requests
- [ ] **Ch 1 — Error handling** — Are HTTP errors (4xx, 5xx), connection errors, and timeouts caught and handled?
- [ ] **Ch 1 — Response validation** — Is the status code checked before parsing? Are non-200 responses handled?
- [ ] **Ch 1 — Timeout configuration** — Are request timeouts set to avoid hanging on unresponsive servers?
- [ ] **Ch 10 — Session usage** — Is `requests.Session()` used for cookie persistence and connection pooling?
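
As a rough illustration of the first three items, the sketch below wraps a single request with a timeout, a status check, and explicit error handling. It uses the standard library's `urllib` rather than `requests` only to stay dependency-free; the `fetch` name and the injectable `opener` parameter are assumptions made for this example, not part of the checklist.

```python
import socket
import urllib.error
import urllib.request
from typing import Callable, Optional


def fetch(url: str, timeout: float = 10.0,
          opener: Callable = urllib.request.urlopen) -> Optional[bytes]:
    """Return the response body, or None if the request failed."""
    try:
        # The timeout keeps the scraper from hanging on an unresponsive server
        with opener(url, timeout=timeout) as response:
            if response.status != 200:
                return None  # treat any non-200 response as a failure
            return response.read()
    except urllib.error.HTTPError as exc:
        # 4xx and 5xx responses are raised as HTTPError by urlopen
        print(f"HTTP error {exc.code} for {url}")
    except urllib.error.URLError as exc:
        # DNS failures, refused connections, and similar problems
        print(f"Connection error for {url}: {exc.reason}")
    except (TimeoutError, socket.timeout):
        print(f"Timed out after {timeout}s: {url}")
    return None
```

Returning `None` on failure keeps the caller's control flow simple; a production scraper would more likely log the failure and decide whether to retry.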

### Authentication
- [ ] **Ch 10 — Login handling** — Is login implemented correctly with CSRF tokens and proper POST data?
- [ ] **Ch 10 — Session persistence** — Are cookies maintained across requests for authenticated scraping?
- [ ] **Ch 10 — Credential security** — Are login credentials stored in environment variables, not hardcoded?
- [ ] **Ch 10 — Session expiry** — Is session expiry detected and handled with automatic re-authentication?
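
For the credential-security item, one possible shape is to read the login details from the environment and fail loudly when they are missing. The variable names `SCRAPER_USERNAME` and `SCRAPER_PASSWORD` are hypothetical, chosen only for this sketch.

```python
import os
from typing import Mapping, Tuple


def load_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Read login credentials from the environment instead of source code."""
    try:
        # Hypothetical variable names; use whatever the deployment defines
        return env["SCRAPER_USERNAME"], env["SCRAPER_PASSWORD"]
    except KeyError as exc:
        raise RuntimeError(f"Missing environment variable: {exc.args[0]}") from None
```

Accepting the mapping as a parameter makes the function easy to test without touching the real environment.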

### JavaScript Rendering
- [ ] **Ch 11 — Rendering need** — Is JavaScript rendering actually needed, or does the data exist in raw HTML or an API?
- [ ] **Ch 11 — Headless mode** — Is the browser running headless for server/production use?
- [ ] **Ch 11 — Explicit waits** — Is `WebDriverWait` with `expected_conditions` used instead of `time.sleep()`?
- [ ] **Ch 11 — Resource cleanup** — Is `driver.quit()` called in a `finally` block or context manager?
- [ ] **Ch 11 — Page load strategy** — Is the page load strategy appropriate (normal, eager, none)?

---

## 2. Parsing & Extraction (Chapters 2, 7)

### HTML Parsing
- [ ] **Ch 2 — Parser choice** — Is an appropriate parser used (`html.parser`, `lxml`, `html5lib`)?
- [ ] **Ch 2 — Selector quality** — Are selectors specific enough to avoid false matches but flexible enough to survive minor changes?
- [ ] **Ch 2 — None checking** — Is the result of `find()` checked for `None` before accessing attributes or text?
- [ ] **Ch 2 — Multiple strategies** — Are fallback selectors used in case the primary selector fails?
- [ ] **Ch 2 — CSS selectors vs find** — Is `select()` used for complex hierarchical selection where appropriate?

### Data Extraction
- [ ] **Ch 2 — Attribute access** — Is `tag.get('href')` used instead of `tag['href']` to avoid KeyError?
- [ ] **Ch 2 — Text extraction** — Is `get_text(strip=True)` used for clean text content?
- [ ] **Ch 2 — Regex usage** — Are regex patterns compiled and used appropriately (not for HTML parsing)?
- [ ] **Ch 7 — Document handling** — Are non-HTML documents (PDF, Word) handled with appropriate libraries?
- [ ] **Ch 7 — Encoding** — Is character encoding handled correctly? Is UTF-8 enforced?
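
The attribute-access item is about tolerating missing attributes. The checklist phrases it in terms of BeautifulSoup's `tag.get('href')`; the sketch below shows the same defensive idea with the standard library's `html.parser` instead, purely so that it runs without third-party packages. `LinkCollector` and `extract_links` are illustrative names.

```python
from html.parser import HTMLParser
from typing import List


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, skipping anchors without one."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # dict.get returns None for a missing attribute instead of raising
        href = dict(attrs).get("href")
        if href:
            self.links.append(href.strip())


def extract_links(html: str) -> List[str]:
    """Return every non-empty href found in the given HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links
```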

---

## 3. Crawling & Navigation (Chapters 3–5)

### URL Management
- [ ] **Ch 3 — URL normalization** — Are URLs normalized (resolve relative, strip fragments, handle trailing slashes)?
- [ ] **Ch 3 — Deduplication** — Is a visited set maintained? Are URLs checked before being added to the queue?
- [ ] **Ch 3 — Scope control** — Is crawl scope defined (same domain, specific paths, depth limit)?
- [ ] **Ch 3 — Relative URL resolution** — Is `urljoin` used to resolve relative links against the base URL?
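
The four URL-management items can be combined into two small helpers. This is a minimal sketch, assuming that `/path` and `/path/` refer to the same page (true for many sites, but not all); `normalize_url` and `should_visit` are names invented for the example.

```python
from typing import Set
from urllib.parse import urljoin, urlparse, urlunparse


def normalize_url(base_url: str, link: str) -> str:
    """Resolve a link against its page URL, drop the fragment and trailing slash."""
    parts = urlparse(urljoin(base_url, link))
    # Assumption: a trailing slash does not change which page is served
    path = parts.path.rstrip("/") or "/"
    return urlunparse(parts._replace(path=path, fragment=""))


def should_visit(url: str, allowed_host: str, visited: Set[str]) -> bool:
    """True only for unseen URLs on the allowed host; records the URL as seen."""
    if urlparse(url).netloc != allowed_host or url in visited:
        return False
    visited.add(url)
    return True
```

Calling `should_visit` before enqueueing a URL, rather than after dequeueing it, keeps duplicates out of the queue entirely.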

### Crawl Strategy
- [ ] **Ch 3 — Traversal order** — Is the right traversal used (BFS for breadth, DFS for depth)?
- [ ] **Ch 4 — Layout handling** — Are different page layouts detected and parsed appropriately?
- [ ] **Ch 4 — Data normalization** — Is extracted data normalized to a consistent schema across pages?
- [ ] **Ch 3 — Pagination** — Is pagination handled correctly (next links, page numbers, cursor)?
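
To make the traversal-order item concrete, here is a small breadth-first crawl loop with a depth limit. The `get_links` callable stands in for "fetch the page and extract its links" and is an assumption of this sketch, which lets the example run against an in-memory link graph.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl_bfs(start: str, get_links: Callable[[str], Iterable[str]],
              max_depth: int = 2) -> List[str]:
    """Visit pages breadth-first up to max_depth, returning them in visit order."""
    visited = {start}
    order: List[str] = []
    queue = deque([(start, 0)])
    while queue:
        # popleft() gives breadth-first order; pop() would make it depth-first
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not expand pages at the depth limit
        for link in get_links(url):
            if link not in visited:  # deduplicate before enqueueing
                visited.add(link)
                queue.append((link, depth + 1))
    return order
```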

### Scrapy-Specific
- [ ] **Ch 5 — Item definitions** — Are Scrapy Items defined for structured data extraction?
- [ ] **Ch 5 — Pipeline usage** — Are item pipelines used for validation, cleaning, and storage?
- [ ] **Ch 5 — Rules configuration** — Are CrawlSpider rules properly configured with LinkExtractor?
- [ ] **Ch 5 — Settings tuning** — Are `CONCURRENT_REQUESTS`, `DOWNLOAD_DELAY`, and the `AUTOTHROTTLE_*` settings configured?
- [ ] **Ch 5 — Logging** — Is logging configured at the appropriate level for production?
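
For the settings-tuning and logging items, a project's `settings.py` might contain something like the fragment below. The setting names are Scrapy's own; the values are illustrative assumptions, not recommendations from the book, and should be tuned per target site.

```python
# settings.py fragment (values are illustrative assumptions)
CONCURRENT_REQUESTS = 8              # total requests in flight
CONCURRENT_REQUESTS_PER_DOMAIN = 2   # cap per target domain
DOWNLOAD_DELAY = 1.0                 # base delay in seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True      # vary the delay around DOWNLOAD_DELAY
AUTOTHROTTLE_ENABLED = True          # adapt the delay to observed latency
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
ROBOTSTXT_OBEY = True                # respect robots.txt
LOG_LEVEL = "INFO"                   # less noisy than DEBUG in production
```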

---

## 4. Data Storage (Chapter 6)

### Storage Patterns
- [ ] **Ch 6 — Format choice** — Is the right storage format used (CSV for simple, database for relational, JSON for nested)?
- [ ] **Ch 6 — Duplicate prevention** — Are duplicates detected and handled (UPSERT, unique constraints)?
- [ ] **Ch 6 — Batch operations** — Are database writes batched instead of per-row for efficiency?
- [ ] **Ch 6 — Connection management** — Are database connections properly opened, pooled, and closed?
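
Duplicate prevention and batching can be handled together. The sketch below uses SQLite from the standard library, with a hypothetical `pages` table keyed on URL; the `ON CONFLICT ... DO UPDATE` clause assumes SQLite 3.24 or newer.

```python
import sqlite3
from typing import Iterable, Tuple


def save_pages(conn: sqlite3.Connection, rows: Iterable[Tuple[str, str]]) -> None:
    """Insert (url, title) rows in one batch, updating rows whose URL exists."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
    )
    with conn:  # commits on success, rolls back if the batch fails
        conn.executemany(
            "INSERT INTO pages (url, title) VALUES (?, ?) "
            "ON CONFLICT(url) DO UPDATE SET title = excluded.title",
            rows,
        )
```

The primary key on `url` is what makes the upsert possible; without a unique constraint the database has no way to recognize a duplicate.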

### Data Integrity
- [ ] **Ch 6 — Schema enforcement** — Is extracted data validated against expected schema before storage?
- [ ] **Ch 6 — Raw preservation** — Is raw HTML/response stored alongside extracted data for re-parsing?
- [ ] **Ch 6 — Encoding handling** — Are files written with explicit UTF-8 encoding?
- [ ] **Ch 6 — Error on write** — Are storage errors caught and handled (disk full, DB connection lost)?
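
Schema enforcement can be as light as a field-and-type check run before each write. The schema below (`url`, `title`, `price`) is hypothetical and exists only to show the pattern.

```python
from typing import Any, Dict, List

# Hypothetical schema: field name mapped to its expected Python type
REQUIRED_FIELDS = {"url": str, "title": str, "price": float}


def validation_errors(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record is valid."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record or record[field] is None:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field} should be {expected_type.__name__}")
    return errors
```

Returning all problems at once, instead of raising on the first, makes it easier to log why a scraped record was rejected.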

---

## 5. Data Quality (Chapters 8, 9, 15)

### Cleaning
- [ ] **Ch 8 — Whitespace normalization** — Is whitespace stripped and normalized in extracted text?
- [ ] **Ch 8 — Unicode normalization** — Is Unicode text normalized (NFKD or NFC) for consistency?
- [ ] **Ch 8 — Type conversion** — Are strings converted to appropriate types (int, float, date) with error handling?
- [ ] **Ch 8 — Pattern cleaning** — Are regex patterns used to extract clean data from messy strings?
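
The cleaning items map onto a couple of short helpers. This sketch picks NFC normalization and assumes prices use a comma as the thousands separator and a period as the decimal point; both choices, and the function names, are assumptions for illustration.

```python
import re
import unicodedata
from typing import Optional


def clean_text(raw: str) -> str:
    """Normalize Unicode to NFC and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", raw)
    return re.sub(r"\s+", " ", text).strip()


def parse_price(raw: str) -> Optional[float]:
    """Extract a number such as '1,299.50' from a messy string, or None."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if match is None:
        return None
    try:
        return float(match.group().replace(",", ""))
    except ValueError:
        return None  # conversion failed; let the caller decide what to do
```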

### Testing
- [ ] **Ch 15 — Parser unit tests** — Are parsing functions tested with saved HTML fixtures?
- [ ] **Ch 15 — Edge case tests** — Are missing elements, empty pages, and malformed HTML tested?
- [ ] **Ch 15 — Integration tests** — Is the full pipeline tested end-to-end?
- [ ] **Ch 15 — Change detection** — Is there monitoring for when the target site changes structure?
- [ ] **Ch 15 — CI integration** — Are scraper tests automated in a CI pipeline?
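
A parser unit test with a fixture and two edge cases might look like the following. The fixture is inlined here for brevity; in a real project it would more likely be a saved `.html` file. `extract_title` is a hypothetical parsing function written with the standard library so the example is self-contained.

```python
import unittest
from html.parser import HTMLParser
from typing import Optional


class _TitleParser(HTMLParser):
    """Capture the text of the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self.in_title = False
        self.title: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and self.title is None:
            self.title = data.strip()


def extract_title(html: str) -> Optional[str]:
    parser = _TitleParser()
    parser.feed(html)
    return parser.title or None


# Stand-in for a saved HTML fixture file
FIXTURE = "<html><head><title> Widget 3000 </title></head><body></body></html>"


class ExtractTitleTests(unittest.TestCase):
    def test_title_from_fixture(self):
        self.assertEqual(extract_title(FIXTURE), "Widget 3000")

    def test_missing_title_returns_none(self):
        self.assertIsNone(extract_title("<html><body>No title</body></html>"))

    def test_empty_page_returns_none(self):
        self.assertIsNone(extract_title(""))
```

Because the tests never touch the network, they can run on every commit in CI without depending on the target site being up.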

---

## 6. Resilience & Performance (Chapters 14, 16)

### Anti-Detection
- [ ] **Ch 14 — User-Agent** — Is a realistic User-Agent header set? Is rotation implemented for scale?
- [ ] **Ch 14 — Request headers** — Are Accept, Accept-Language, and other standard headers included?
- [ ] **Ch 14 — Request delays** — Are random delays added between requests (not fixed intervals)?
- [ ] **Ch 14 — Cookie handling** — Are cookies accepted and maintained properly?
- [ ] **Ch 14 — Honeypot avoidance** — Are hidden links (display:none, visibility:hidden) detected and avoided?
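
The request-delays item asks for randomized rather than fixed pauses. A minimal sketch, with the 1 to 3 second default range chosen arbitrarily for illustration:

```python
import random
import time
from typing import Callable


def polite_sleep(min_seconds: float = 1.0, max_seconds: float = 3.0,
                 sleep: Callable[[float], None] = time.sleep) -> float:
    """Sleep for a random duration in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

The `sleep` parameter is injectable so that tests can record the delay instead of actually waiting.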

### Performance
- [ ] **Ch 16 — Parallelism** — Is parallel scraping used for large-scale jobs (threading or multiprocessing)?
- [ ] **Ch 16 — Thread safety** — Are shared data structures properly protected with locks or queues?
- [ ] **Ch 16 — Per-domain limits** — Are concurrent requests limited per domain even with parallel scraping?
- [ ] **Ch 16 — Graceful shutdown** — Can the scraper shut down cleanly, saving state for resumption?
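
One way to combine parallelism, thread safety, and per-domain limits is a thread pool with one semaphore per host. This is a sketch under assumptions: `scrape_all` and its `fetch` callable are hypothetical, and the worker counts are placeholders.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable
from urllib.parse import urlparse


def scrape_all(urls: Iterable[str], fetch: Callable[[str], str],
               max_workers: int = 8, per_domain: int = 2) -> Dict[str, str]:
    """Fetch URLs in parallel while capping simultaneous requests per host."""
    results: Dict[str, str] = {}
    results_lock = threading.Lock()
    # One semaphore per host caps simultaneous requests to that host
    domain_limits = defaultdict(lambda: threading.Semaphore(per_domain))
    limits_lock = threading.Lock()

    def worker(url: str) -> None:
        host = urlparse(url).netloc
        with limits_lock:  # creating a defaultdict entry is not atomic
            semaphore = domain_limits[host]
        with semaphore:
            try:
                body = fetch(url)
            except Exception as exc:  # one failed URL must not stop the others
                print(f"failed: {url}: {exc}")
                return
        with results_lock:  # guard the shared results dict
            results[url] = body

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(worker, urls))  # consume the iterator to run every task
    return results
```

Catching exceptions inside the worker also keeps a single failing URL from taking down the whole run.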

### Error Recovery
- [ ] **Ch 14 — Retry logic** — Are transient errors retried with backoff? Are permanent errors skipped?
- [ ] **Ch 14 — Block detection** — Are 403/captcha responses detected as potential blocks?
- [ ] **Ch 16 — Worker isolation** — Is a single worker's failure contained so that it cannot crash the entire scraper?
- [ ] **Ch 14 — State persistence** — Can the scraper resume from where it left off after a crash?
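
The retry-logic item distinguishes transient from permanent errors. In the sketch below, `TransientError` is a hypothetical marker exception that the fetching code would raise for timeouts or 503 responses; anything else propagates immediately and is treated as permanent.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, 503s)."""


def retry_with_backoff(action: Callable[[], T], max_attempts: int = 4,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call action(), retrying TransientError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return action()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: give up
            # Delay doubles each time: base, 2x base, 4x base, ...
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

Adding random jitter to each delay is a common refinement when many workers retry at once.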

---

## 7. Ethics & Legal (Chapters 17–18)

### Compliance
- [ ] **Ch 18 — robots.txt** — Is robots.txt fetched and respected before crawling?
- [ ] **Ch 18 — Terms of Service** — Has the target site's ToS been reviewed for scraping restrictions?
- [ ] **Ch 18 — Rate respect** — Is the scraping rate respectful of server resources?
- [ ] **Ch 18 — Data rights** — Is scraped data handled in compliance with copyright and privacy laws?
- [ ] **Ch 18 — GDPR compliance** — If scraping personal data, are GDPR obligations met?
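
Checking robots.txt can be done with the standard library's `urllib.robotparser`. The sketch parses the rules from a string so it runs offline; against a live site one would call `set_url()` and `read()` instead. The user-agent string is a hypothetical example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical scraper identity, including a contact address
USER_AGENT = "ExampleScraper/1.0 (+mailto:ops@example.com)"


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_robots(
    "User-agent: *\n"
    "Disallow: /private/\n"
    "Crawl-delay: 5\n"
)
allowed = rules.can_fetch(USER_AGENT, "https://example.com/public/page")
blocked = rules.can_fetch(USER_AGENT, "https://example.com/private/page")
delay = rules.crawl_delay(USER_AGENT)  # seconds requested by the site, or None
```

When `crawl_delay` returns a value, using it as the minimum pause between requests also addresses the rate-respect item.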

### Anonymity & Infrastructure
- [ ] **Ch 17 — Proxy usage** — Are proxies used appropriately when needed for scale or anonymity?
- [ ] **Ch 17 — Tor appropriateness** — Is Tor used only when genuinely needed, not as a default?
- [ ] **Ch 17 — IP verification** — Is proxy/Tor IP verified before scraping sensitive targets?
- [ ] **Ch 14 — Identification** — Does the User-Agent identify the scraper and provide contact info?

---

## Quick Review Workflow

1. **Fetching pass** — Verify request handling, error handling, session usage, JS rendering needs
2. **Parsing pass** — Check selector quality, None handling, defensive parsing, fallback strategies
3. **Crawling pass** — Verify URL management, deduplication, pagination, scope control
4. **Storage pass** — Check data format, duplicate handling, raw preservation, encoding
5. **Quality pass** — Verify data cleaning, testing coverage, change detection
6. **Resilience pass** — Check rate limiting, parallelism, retry logic, anti-detection
7. **Ethics pass** — Verify robots.txt compliance, legal awareness, respectful crawling
8. **Prioritize findings** — Rank by severity: legal risk > data loss > reliability > performance > best practices
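
The ranking in step 8 can be applied mechanically once each finding is tagged with a category. A small sketch, where the `(category, description)` tuple shape is an assumption of the example:

```python
from typing import List, Tuple

# Order taken from step 8: earlier entries are more severe
PRIORITY = ["legal risk", "data loss", "reliability", "performance", "best practices"]


def prioritize(findings: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Sort (category, description) pairs from most to least severe."""
    # An unknown category raises ValueError, which surfaces tagging mistakes
    return sorted(findings, key=lambda finding: PRIORITY.index(finding[0]))
```

Because `sorted` is stable, findings in the same category keep the order in which they were recorded.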

## Severity Levels

| Severity | Description | Example |
|----------|-------------|---------|
| **Critical** | Legal risk, data loss, or server harm | Ignoring robots.txt, no rate limiting (hammering server), hardcoded credentials, GDPR violations |
| **High** | Reliability or data quality issues | No error handling, missing None checks, no session management, no deduplication |
| **Medium** | Performance, maintainability, or operational gaps | No parallel scraping for large jobs, no testing, fixed delays instead of random, no logging |
| **Low** | Best practice improvements | Missing User-Agent rotation, no raw HTML storage, no change detection, minor code organization |
@@ -0,0 +1 @@