scrapetor 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (51)
  1. checksums.yaml +7 -0
  2. data/CHANGELOG.md +242 -0
  3. data/LICENSE +21 -0
  4. data/README.md +440 -0
  5. data/bin/scrapetor +190 -0
  6. data/bin/scrapetor-bench +5 -0
  7. data/ext/scrapetor/README.md +53 -0
  8. data/ext/scrapetor/native/extconf.rb +67 -0
  9. data/ext/scrapetor/native/scrapetor_dom.c +6346 -0
  10. data/ext/scrapetor/native/scrapetor_http.c +2591 -0
  11. data/ext/scrapetor/native/scrapetor_native.c +1156 -0
  12. data/lib/scrapetor/builder.rb +158 -0
  13. data/lib/scrapetor/cleaner.rb +10 -0
  14. data/lib/scrapetor/comment_node.rb +67 -0
  15. data/lib/scrapetor/document.rb +457 -0
  16. data/lib/scrapetor/dom/parser.rb +69 -0
  17. data/lib/scrapetor/dom/selectors.rb +208 -0
  18. data/lib/scrapetor/dom.rb +563 -0
  19. data/lib/scrapetor/encoding.rb +85 -0
  20. data/lib/scrapetor/entities.rb +90 -0
  21. data/lib/scrapetor/errors.rb +12 -0
  22. data/lib/scrapetor/extractor.rb +147 -0
  23. data/lib/scrapetor/fetcher.rb +390 -0
  24. data/lib/scrapetor/fingerprint.rb +29 -0
  25. data/lib/scrapetor/form.rb +141 -0
  26. data/lib/scrapetor/http.rb +114 -0
  27. data/lib/scrapetor/microdata.rb +132 -0
  28. data/lib/scrapetor/money.rb +30 -0
  29. data/lib/scrapetor/native.rb +291 -0
  30. data/lib/scrapetor/native_dom.rb +2258 -0
  31. data/lib/scrapetor/node.rb +539 -0
  32. data/lib/scrapetor/node_set.rb +301 -0
  33. data/lib/scrapetor/page_type.rb +95 -0
  34. data/lib/scrapetor/pagination.rb +109 -0
  35. data/lib/scrapetor/persistent_cache.rb +130 -0
  36. data/lib/scrapetor/robots.rb +159 -0
  37. data/lib/scrapetor/sax.rb +285 -0
  38. data/lib/scrapetor/schema.rb +144 -0
  39. data/lib/scrapetor/selector.rb +576 -0
  40. data/lib/scrapetor/session.rb +141 -0
  41. data/lib/scrapetor/sitemap.rb +52 -0
  42. data/lib/scrapetor/stream.rb +111 -0
  43. data/lib/scrapetor/structured_data.rb +74 -0
  44. data/lib/scrapetor/template_registry.rb +24 -0
  45. data/lib/scrapetor/text_node.rb +101 -0
  46. data/lib/scrapetor/url.rb +21 -0
  47. data/lib/scrapetor/version.rb +5 -0
  48. data/lib/scrapetor/xpath.rb +1603 -0
  49. data/lib/scrapetor.rb +167 -0
  50. data/scrapetor.gemspec +77 -0
  51. metadata +200 -0
checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: ff50f6b25b3a2892125706ae7b80c84f740bc9126bebf693fc0b6f84fb82e101
4
+ data.tar.gz: 7bd3443c06d2cca48d7d6fbdfbd3d68c3740b4fd07a62c12f68c8008d9929bad
5
+ SHA512:
6
+ metadata.gz: 7220950255cb9e9db8a59171cc78837e22d724cd091643ad84a8d833a866268410d59e5e9906e26cf3b903b5bbe7e9dd845dbb91ee594b1fc06bb3ef312717f0
7
+ data.tar.gz: aa291c1d94102e3d6e57d4f7ae6b8a98c8bc79e179e3394a60e9c7ca10417b98f21e494006f01719aee2a75608c52f2cf7f6218f9be54edb73b4c5cfc4e1de79
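Reviewer note: the digests above can be checked against the `metadata.gz` and `data.tar.gz` entries of the unpacked `.gem` archive. The following is a minimal sketch of that comparison, assuming the YAML layout shown in this diff; the helper name `checksum_ok?` and the demo payload are made up for illustration and are not part of the gem.

```ruby
require "digest"
require "yaml"

# Compare the SHA-256 digest of `bytes` with the value recorded under
# the "SHA256" key of a parsed checksums.yaml hash.
# `name` is an entry such as "data.tar.gz".
def checksum_ok?(checksums, name, bytes)
  expected = checksums.fetch("SHA256").fetch(name)
  Digest::SHA256.hexdigest(bytes) == expected
end

# Self-contained demonstration: made-up bytes stand in for the real archive.
payload  = "example payload"
manifest = YAML.safe_load(<<~MANIFEST)
  ---
  SHA256:
    data.tar.gz: "#{Digest::SHA256.hexdigest(payload)}"
MANIFEST

puts checksum_ok?(manifest, "data.tar.gz", payload)       # matching bytes
puts checksum_ok?(manifest, "data.tar.gz", payload + "x") # tampered bytes
```

For a real check, `bytes` would be the binary contents of the file extracted from the `.gem` tarball, and `manifest` would be loaded from the extracted `checksums.yaml`.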
data/CHANGELOG.md ADDED
@@ -0,0 +1,242 @@
1
+ # Changelog
2
+
3
+ All notable changes to Scrapetor are documented here. The format
4
+ follows [Keep a Changelog](https://keepachangelog.com/), and the
5
+ project adheres to [Semantic Versioning](https://semver.org/).
6
+
7
+ ## [0.2.0]
8
+
9
+ The 0.2 series turns Scrapetor from a parser-plus-Net::HTTP-helper
10
+ into a production-shaped scraping toolkit. Every load-bearing piece
11
+ is native C with the GVL released across network + decode + parse.
12
+
13
+ ### Added — HTTP layer (libcurl-backed, opt-in at build time)
14
+
15
+ - `Scrapetor::Fetcher.get` / `.post` / `.put` / `.patch` / `.delete` /
16
+ `.head`. HTTP/2 over TLS via ALPN with graceful 1.1 fallback;
17
+ per-thread libcurl handle cache; GVL released for the round-trip.
18
+ - Request body sugar: `:json`, `:form`, `:body`, `:multipart`. Multipart
19
+ parts can come from `:path` (file) or `:data` (bytes), with optional
20
+ `:filename` / `:content_type` overrides. Driven via `curl_mime`.
21
+ - In-process content-encoding decoders. Advertised in the
22
+ `Accept-Encoding` header and decoded by Scrapetor regardless of
23
+ whether the linked libcurl was built with the codec:
24
+ - gzip + deflate via zlib
25
+ - brotli via libbrotlidec
26
+ - zstd via libzstd
27
+ - Charset transcoding to UTF-8 via iconv. Parses `charset=...` out of
28
+ `Content-Type`, transcodes the body, rewrites the header. Latin-1,
29
+ Shift_JIS, etc. round-trip end-to-end. Opt-out via
30
+ `transcode_utf8: false`.
31
+ - Per-host throttle in native code. `:rate_limit_ms` honours the same
32
+ gate across single + parallel + multi paths via a global pthread-
33
+ mutexed host → next-allowed-time table.
34
+ - Retry + exponential backoff with full jitter. `:retry`, `:backoff`,
35
+ `:max_backoff`, `:retry_on`. Honours numeric `Retry-After` response
36
+ headers (capped at `max_backoff`).
37
+ - Auth: `:basic_auth` (`"user:pass"` → CURLAUTH_BASIC) and
38
+ `:bearer_token` (CURLAUTH_BEARER + CURLOPT_XOAUTH2_BEARER, with an
39
+ Authorization-header fallback on older libcurl).
40
+ - Proxy + custom CA: `:proxy`, `:ca_path`, `:insecure`.
41
+ - ETag / Last-Modified disk cache. `:cache_dir` opts in; second-and-
42
+ later requests send `If-None-Match` / `If-Modified-Since`; 304
43
+ swaps in the cached body and marks
44
+ `headers["x-scrapetor-cache"] = "hit"`.
45
+ - `Scrapetor::Fetcher.revalidate(urls, cache_dir:)` HEADs every URL
46
+ in one curl_multi sweep and classifies each as `:fresh` /
47
+ `:changed` / `:missing` / `:error`.
48
+
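Reviewer note: the retry entry above names "exponential backoff with full jitter" and a numeric `Retry-After` capped at `max_backoff`. As a rough illustration of that policy (not the gem's native implementation), the delay could be computed as below. The method name `retry_delay`, the unit (seconds), and the default values are assumptions made for this sketch.

```ruby
# Delay in seconds before retry number `attempt` (0-based).
# Full jitter: pick uniformly in [0, min(max_backoff, base * 2**attempt)].
# A numeric Retry-After value takes precedence but is capped at max_backoff.
# Non-numeric values (for example an HTTP date) are ignored in this sketch.
def retry_delay(attempt, base: 0.5, max_backoff: 30.0, retry_after: nil, rng: Random.new)
  hinted = retry_after && Float(retry_after, exception: false)
  return hinted.clamp(0.0, max_backoff) if hinted

  ceiling = [max_backoff, base * (2**attempt)].min
  rng.rand * ceiling
end

puts retry_delay(3)                       # somewhere in 0...4.0
puts retry_delay(0, retry_after: "120")   # server hint, capped at max_backoff
```

Full jitter spreads retries from many clients across the whole window instead of clustering them at the exponential ceiling, which is the usual reason to prefer it over a fixed exponential delay.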
49
+ ### Added — Bulk fetch concurrency models
50
+
51
+ - `Scrapetor::Fetcher.parallel_get` / `.parallel_fetch`: N pthread
52
+ workers each running blocking easy handles. Best when each
53
+ response has CPU work after the fetch (decode + parse), since the
54
+ GVL is released for the whole batch. `parallel_fetch` runs the
55
+ full pipeline — decode + transcode + dom_parse + index build —
56
+ inside the same no-GVL window.
57
+ - `Scrapetor::Fetcher.multi_get` / `.multi_fetch`: single driver
58
+ thread + `curl_multi`, N concurrent in-flight. Best for I/O-
59
+ fan-out (hundreds of URLs across many hosts). Same `parse: true`
60
+ / `cache_dir:` options as `parallel_fetch`.
61
+ - `Scrapetor::Fetcher.multi_each(urls) { |r| ... }`: streaming
62
+ variant. Yields each response in *completion* order (not input
63
+ order) as soon as it's done, so processing can start while later
64
+ transfers are still on the wire. Backed by a `MultiBatch`
65
+ typed-data iterator.
66
+ - `CURLSH` shared connection pool + DNS + TLS session cache across
67
+ every pthread worker and every multi handle. N workers hitting
68
+ one host now reuse one connection, one DNS lookup, and one TLS session.
69
+ `CURLOPT_PIPEWAIT` per-handle for HTTP/2 multiplexing.
70
+
71
+ ### Added — Stateful clients
72
+
73
+ - `Scrapetor::Session.new(cookies: …, user_agent:, rate_limit:,
74
+ retry:, headers:, basic_auth:, bearer_token:, proxy:, ca_path:)`.
75
+ Wraps Fetcher with persistent cookie jar (libcurl COOKIEJAR/
76
+ COOKIEFILE; ephemeral tempfile by default), default-header merge,
77
+ auto-applied auth, per-host throttle, default retry policy. Verbs:
78
+ `get`/`post`/`put`/`patch`/`delete`/`head`/`fetch` (parsed
79
+ Document)/`parallel_get`.
80
+
81
+ ### Added — Parser improvements
82
+
83
+ - Persistent disk-backed parse cache (`SCRAP_PERSISTENT_CACHE=1`).
84
+ Binary arena dump on disk, indexes rebuilt on load, content-
85
+ addressed via SHA-256.
86
+ - Tag-name interning. Static intern table → `uint16_t tag_id` per
87
+ node (free, slots into existing struct padding). Per-match
88
+ `==` replaces strncasecmp for the standard tag set.
89
+ - Ancestor bloom filter (64-bit per node, content-hashed). Chain
90
+ matcher fast-rejects candidates whose ancestor_bloom doesn't
91
+ cover the required mask.
92
+ - Selector specialisation. `c_atom` carries a function pointer to
93
+ one of `match_class_only` / `match_tag_only` / `match_tag_class` /
94
+ `match_id_only` chosen at plan compile, skipping the predicate
95
+ dispatch loop for the common shapes.
96
+ - NEON SIMD scanners on arm64: `dom_advance_ws`, `dom_advance_name`,
97
+ `dom_advance_attr_end`, NEON-accelerated `dom_streq_ci` for
98
+ index-lookup compares, NEON-driven class-attr tokeniser. Remaining
99
+ byte loops in `dom_parse` hoisted onto libc's SIMD `memchr`.
100
+ - Cold parse benchmark vs Nokolexbor: 0.40 ms vs 0.39 ms (was
101
+ 0.54 ms before this series).
102
+
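Reviewer note: the ancestor bloom filter entry above describes a 64-bit mask per node that lets the chain matcher reject candidates early. The sketch below illustrates the general idea in Ruby; it is not the C implementation. The `BloomNode` struct, the helper names, and the use of CRC32 as the hash are all assumptions made for this example.

```ruby
require "zlib"

# Map a tag or class name to one bit of a 64-bit filter.
# CRC32 is an arbitrary stand-in hash; names are assumed already normalised.
def bloom_bit(name)
  1 << (Zlib.crc32(name) % 64)
end

BloomNode = Struct.new(:tag, :classes, :parent, :ancestor_bloom)

# Build a node whose ancestor_bloom is the union of the bits of every
# ancestor's tag and class names (the node's own names are not included).
def make_node(tag, classes: [], parent: nil)
  bloom = 0
  if parent
    own = ([parent.tag] + parent.classes).reduce(0) { |acc, n| acc | bloom_bit(n) }
    bloom = parent.ancestor_bloom | own
  end
  BloomNode.new(tag, classes, parent, bloom)
end

# Fast reject: if any required bit is missing, no ancestor can match.
# A true result only means "maybe"; a full ancestor walk must still confirm it.
def may_have_ancestors?(node, required_names)
  mask = required_names.reduce(0) { |acc, n| acc | bloom_bit(n) }
  (node.ancestor_bloom & mask) == mask
end

html = make_node("html")
body = make_node("body", classes: ["main"], parent: html)
span = make_node("span", parent: make_node("div", parent: body))
puts may_have_ancestors?(span, ["body", "main"]) # all required bits are present
```

Because distinct names can hash to the same bit, the filter can produce false positives but never false negatives, which is what makes it safe as a pre-check.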
103
+ ### Added — Bounded-memory parsing
104
+
105
+ - `Scrapetor.parallel_parse(htmls, threads:)`. Real multi-core HTML
106
+ parsing on MRI: pthread workers + `rb_thread_call_without_gvl`.
107
+ - `Scrapetor.stream(io, outer:)`. Streaming row parser with depth-
108
+ tracked nested-same-tag balancing, `<script>` / `<style>` /
109
+ comment / CDATA skipping, all in C. Outer pattern accepts `tag`,
110
+ `tag.class`, `tag.cls1.cls2`, `tag#id`, `tag#id.cls`. Peak memory
111
+ is bounded to `max(chunk_size, longest_row_in_bytes)` regardless
112
+ of total document size.
113
+
114
+ ### Added — Crawl helpers
115
+
116
+ - `Scrapetor::Robots.fetch_for(host)` — robots.txt parser with
117
+ longest-match decision (RFC 9309 / Google's de-facto rule), `*`
118
+ and `$` patterns, case-insensitive UA prefix matching,
119
+ Crawl-delay + Sitemap directives.
120
+ - `Scrapetor::Sitemap.urls(source)` — streaming sitemap.xml
121
+ enumerator with recursion into `<sitemapindex>`. Accepts IO,
122
+ String XML, or URL String.
123
+
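Reviewer note: the Robots entry above refers to the longest-match decision from RFC 9309 with `*` and `$` patterns. A minimal sketch of that rule follows, assuming rules have already been grouped for the matching user agent. The method names and the `[kind, pattern]` rule shape are hypothetical and do not reflect `Scrapetor::Robots` internals.

```ruby
# Translate a robots.txt path pattern into an anchored Regexp.
# `*` matches any run of characters; a trailing `$` pins the end of the path.
def robots_pattern(pattern)
  anchored = pattern.end_with?("$")
  body     = anchored ? pattern[0...-1] : pattern
  source   = body.split("*", -1).map { |part| Regexp.escape(part) }.join(".*")
  source  += "\\z" if anchored
  Regexp.new("\\A" + source)
end

# rules: array of [:allow | :disallow, pattern].
# The longest matching pattern wins; on a tie the allow rule wins;
# when nothing matches, the path is allowed. Empty patterns are ignored.
def allowed?(rules, path)
  matches = rules.select { |_, pat| !pat.empty? && robots_pattern(pat).match?(path) }
  best = matches.max_by { |kind, pat| [pat.bytesize, kind == :allow ? 1 : 0] }
  best.nil? || best[0] == :allow
end

rules = [
  [:disallow, "/private/"],
  [:allow,    "/private/public/"],
  [:disallow, "/*.pdf$"]
]
puts allowed?(rules, "/private/public/page") # the longer allow rule wins
puts allowed?(rules, "/docs/report.pdf")     # blocked by the anchored pattern
```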
124
+ ### Added — Tests + benchmarks
125
+
126
+ - 74 new tests across 7 new files: `test_robots.rb`,
127
+ `test_sitemap.rb`, `test_stream_parser.rb`, `test_fetcher.rb`,
128
+ `test_session.rb`, `test_parallel_parse.rb`,
129
+ `test_persistent_cache.rb`. Fetcher + Session tests auto-skip on
130
+ builds without libcurl. 291 tests total (was 158), all green.
131
+ - `benchmark/end_to_end.rb` — fetch+parse+extract vs Net::HTTP +
132
+ Nokogiri and Net::HTTP + Nokolexbor on a local WEBrick server.
133
+ Reproducible without network access.
134
+
135
+ ### Changed
136
+
137
+ - `Scrapetor.parse` consults the persistent parse cache when
138
+ `SCRAP_PERSISTENT_CACHE=1` is set, otherwise behaviour is
139
+ unchanged.
140
+ - Persistent cache binary format magic bumped from `SCRAPV01` to
141
+ `SCRAPV02` because `dom_node_t` grew a `tag_id` field. Older
142
+ cache files are rejected and re-parsed.
143
+
144
+ ### Compatibility
145
+
146
+ - Continues to support Ruby 2.7, 3.0, 3.1, 3.2, 3.3 on Linux and
147
+ macOS. Net::HTTP-based `Scrapetor.fetch` is unchanged so legacy
148
+ callers don't break.
149
+ - libcurl is optional. When present at build time the gem links
150
+ against it (+ optional libbrotlidec, libzstd, iconv); when absent
151
+ the entire `Scrapetor::Fetcher` / `Session` surface stubs out via
152
+ `Scrapetor::Fetcher::AVAILABLE = false` and raises a clear
153
+ `NotAvailableError`. `SCRAP_NO_LIBCURL=1` / `SCRAP_NO_BROTLI=1` /
154
+ `SCRAP_NO_ZSTD=1` force the stubbed path at compile time.
155
+ - arm64 (Apple Silicon, AWS Graviton, Linux aarch64) gets the NEON
156
+ path; x86_64 falls back to scalar equivalents for the SIMD
157
+ helpers.
158
+
159
+ ### Added — post-release follow-ups
160
+
161
+ - Pagination helper: `Scrapetor::Pagination.each_page(start_url) { |doc| ... }`
162
+ with `<link rel=next>` / `a[rel~=next]` / custom-selector detection, self-loop
163
+ guard, `:max_pages` cap, `:delay` knob.
164
+ - Form helper: `Scrapetor::Form.new(form_node, base_url:)` with default-value
165
+ capture from every named control (text / hidden / checkbox / radio / select /
166
+ textarea), user overrides via `form[name]=` or `merge!`, dispatch through
167
+ GET / POST (form-encoded or multipart per `enctype`) / PUT / PATCH / DELETE.
168
+ - XPath subset: `Document#xpath` / `Node#xpath` cover descendant + child axes,
169
+ `@attr`, `text()`, `[N]` / `[@attr]` / `[@attr='v']` / `[contains(...)]` /
170
+ `[starts-with(...)]` / `[text()='v']`. Unsupported syntax raises
171
+ `Scrapetor::XPath::UnsupportedError` with the offending fragment.
172
+ - XPath axes: `following-sibling::`, `preceding-sibling::`, `ancestor::`,
173
+ `ancestor-or-self::`, plus the `comment()` node test. Sibling / ancestor
174
+ walks dispatch to new C primitives (`node_ancestor_ids`,
175
+ `node_following_sibling_ids`, `node_preceding_sibling_ids`); comments come
176
+ back as `Scrapetor::CommentNode` with `text` / `content` / `comment?`.
177
+ `//comment()` collapses to a single `node_descendant_comment_ids` C call
178
+ via the DFS range encoding the matcher already maintains.
179
+ - Full XPath 1.0 expression engine: all 13 axes (incl. `following::`,
180
+ `preceding::`, `attribute::`), every node test, full predicate grammar
181
+ (`and`, `or`, `not`, comparisons, arithmetic, union), and the standard
182
+ function library (`position`, `last`, `count`, `not`, `normalize-space`,
183
+ `substring`, `translate`, `concat`, `contains`, `starts-with`, `string`,
184
+ `number`, `floor`, `ceiling`, `round`, `lang`, etc.). Tokenizer + parser
185
+ + evaluator live in `lib/scrapetor/xpath.rb`; compiled ASTs are LRU-cached.
186
+ A translator detects CSS-compatible shapes (path with attr predicates,
187
+ positional `[N]`, `position() > N`, `following-sibling::name`, etc.) and
188
+ routes them through the existing native CSS chain; in head-to-head benchmarks
189
+ this path outperforms libxml-based engines on common scraping idioms.
190
+ - HTTP/3 + WebSocket capability detection in `Fetcher.features`; opt into
191
+ HTTP/3 via `http_version: "3"` when libcurl was built with it.
192
+ - mTLS / proxy auth / streaming download options on `Fetcher.get`:
193
+ `:ssl_cert` / `:ssl_key` / `:ssl_key_password` / `:ssl_cert_type` /
194
+ `:proxy_auth` / `:proxy_type` / `:download_to` / `:max_recv_bps` /
195
+ `:max_send_bps`.
196
+ - Documented the limits Scrapetor doesn't try to cover (JS execution, TLS
197
+ fingerprint impersonation) and the practical paths around them.
198
+
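Reviewer note: the pagination entry above mentions a self-loop guard and a `:max_pages` cap. The control flow can be sketched as below, under the assumption that "next page" detection is abstracted into a callable. The `walk_pages` name and the `next_url:` parameter are hypothetical and are not the `Scrapetor::Pagination` API.

```ruby
require "set"

# Walk a chain of pages, yielding each URL at most once.
# `next_url` is a hypothetical callable standing in for
# "fetch the page and read its rel=next link"; it returns nil at the end.
# Returns the number of pages visited.
def walk_pages(start_url, next_url:, max_pages: 50)
  seen = Set.new
  url  = start_url
  # Set#add? returns nil for a URL seen before, which ends the loop
  # on self-links and on longer cycles alike.
  while url && seen.size < max_pages && seen.add?(url)
    yield url
    url = next_url.call(url)
  end
  seen.size
end

links = { "/p1" => "/p2", "/p2" => "/p3", "/p3" => "/p3" } # last page links to itself
walk_pages("/p1", next_url: ->(u) { links[u] }) { |u| puts u }
```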
199
+ [0.2.0]: https://github.com/Alaa-abdulridha/scrapetor/releases/tag/v0.2.0
200
+
201
+ ## [0.1.0] — Initial release
202
+
203
+ ### Added
204
+
205
+ - Native C arena DOM (`ext/scrapetor/native/scrapetor_dom.c`). Single-pass
206
+ tokeniser with class, id, and tag indexes built during parsing. Zero-copy
207
+ text and attribute spans into the input buffer.
208
+ - Native streaming extraction engine (`ext/scrapetor/native/scrapetor_native.c`).
209
+ Schemas compile to a flat descriptor and execute during tokenisation —
210
+ no DOM is materialised on this path.
211
+ - Schema DSL with `field`, `repeated`, and type coercions
212
+ (`:text`, `:integer`, `:float`, `:money`, `:date`, `:url`, `:json`,
213
+ `:html`, `:list`, `:boolean`, `:array`).
214
+ - Field options: `clean`, `multi`, `normalize_url`, `default`, `required`,
215
+ `transform`, `delimiter`, and array-of-fallback selectors via `from: [..]`.
216
+ - CSS selector support on the native path for tag, `.class`, `tag.class`,
217
+ `#id`, attribute selectors with `=`, `*=`, `^=`, `$=`, `~=`, `|=`, and
218
+ descendant + child combinators.
219
+ - Encoding detection (BOM, `<meta charset>`, `http-equiv`) with transcoding
220
+ to UTF-8 before parsing.
221
+ - Structured-data extractors: `json_ld`, `opengraph`, `twitter_card`,
222
+ `schema_org(type:)`, `microdata`, `rdfa`.
223
+ - Page-type detection via JSON-LD, OpenGraph, and structural signals.
224
+ - Pure-Ruby HTML builder (`Scrapetor::Builder`) and SAX streaming
225
+ tokeniser (`Scrapetor::SAX`).
226
+ - HTTP fetcher built on `Net::HTTP` (no external gems).
227
+ - CLI binary (`scrapetor`) with `extract`, `info`, `jsonld`, `opengraph`,
228
+ `microdata`, `rdfa`, `schema-org`, `page-type`, `encoding`, and `sax`
229
+ subcommands.
230
+ - HTML5 named-entity decoder covering ~140 common entities plus numeric
231
+ references.
232
+ - Plan caching (`Schema#dump`, `Schema.load`) for cross-process reuse of
233
+ compiled extraction descriptors.
234
+ - 158 tests, four benchmark scripts comparing against Nokogiri and
235
+ Nokolexbor.
236
+
237
+ ### Compatibility
238
+
239
+ - Ruby 2.7, 3.0, 3.1, 3.2, 3.3 on Linux and macOS.
240
+ - No runtime gem dependencies.
241
+
242
+ [0.1.0]: https://github.com/Alaa-abdulridha/scrapetor/releases/tag/v0.1.0
data/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Alaa Abdulridha
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.