scrapetor 0.2.0
- checksums.yaml +7 -0
- data/CHANGELOG.md +242 -0
- data/LICENSE +21 -0
- data/README.md +440 -0
- data/bin/scrapetor +190 -0
- data/bin/scrapetor-bench +5 -0
- data/ext/scrapetor/README.md +53 -0
- data/ext/scrapetor/native/extconf.rb +67 -0
- data/ext/scrapetor/native/scrapetor_dom.c +6346 -0
- data/ext/scrapetor/native/scrapetor_http.c +2591 -0
- data/ext/scrapetor/native/scrapetor_native.c +1156 -0
- data/lib/scrapetor/builder.rb +158 -0
- data/lib/scrapetor/cleaner.rb +10 -0
- data/lib/scrapetor/comment_node.rb +67 -0
- data/lib/scrapetor/document.rb +457 -0
- data/lib/scrapetor/dom/parser.rb +69 -0
- data/lib/scrapetor/dom/selectors.rb +208 -0
- data/lib/scrapetor/dom.rb +563 -0
- data/lib/scrapetor/encoding.rb +85 -0
- data/lib/scrapetor/entities.rb +90 -0
- data/lib/scrapetor/errors.rb +12 -0
- data/lib/scrapetor/extractor.rb +147 -0
- data/lib/scrapetor/fetcher.rb +390 -0
- data/lib/scrapetor/fingerprint.rb +29 -0
- data/lib/scrapetor/form.rb +141 -0
- data/lib/scrapetor/http.rb +114 -0
- data/lib/scrapetor/microdata.rb +132 -0
- data/lib/scrapetor/money.rb +30 -0
- data/lib/scrapetor/native.rb +291 -0
- data/lib/scrapetor/native_dom.rb +2258 -0
- data/lib/scrapetor/node.rb +539 -0
- data/lib/scrapetor/node_set.rb +301 -0
- data/lib/scrapetor/page_type.rb +95 -0
- data/lib/scrapetor/pagination.rb +109 -0
- data/lib/scrapetor/persistent_cache.rb +130 -0
- data/lib/scrapetor/robots.rb +159 -0
- data/lib/scrapetor/sax.rb +285 -0
- data/lib/scrapetor/schema.rb +144 -0
- data/lib/scrapetor/selector.rb +576 -0
- data/lib/scrapetor/session.rb +141 -0
- data/lib/scrapetor/sitemap.rb +52 -0
- data/lib/scrapetor/stream.rb +111 -0
- data/lib/scrapetor/structured_data.rb +74 -0
- data/lib/scrapetor/template_registry.rb +24 -0
- data/lib/scrapetor/text_node.rb +101 -0
- data/lib/scrapetor/url.rb +21 -0
- data/lib/scrapetor/version.rb +5 -0
- data/lib/scrapetor/xpath.rb +1603 -0
- data/lib/scrapetor.rb +167 -0
- data/scrapetor.gemspec +77 -0
- metadata +200 -0
data/README.md
ADDED
@@ -0,0 +1,440 @@

# Scrapetor

[RubyGems](https://rubygems.org/gems/scrapetor)
[CI](https://github.com/Alaa-abdulridha/scrapetor/actions/workflows/ci.yml)
[License](LICENSE)

A Ruby HTML parsing and structured-extraction library. Scrapetor pairs a
native C arena DOM with a streaming extraction engine that compiles a
schema DSL into a single forward pass over the input - no DOM is
materialised, one Ruby boundary crossing per document.

The same gem also exposes a full read/mutate DOM API, encoding
detection, structured-data extractors (JSON-LD, OpenGraph, Schema.org,
Microdata, RDFa, Twitter Cards), a pure-Ruby builder and SAX streamer,
a CLI, and an HTTP fetcher built on `Net::HTTP`. There are no external
parser dependencies.

Project page: [scrapetor.org](https://scrapetor.org) ·
Source: [github.com/Alaa-abdulridha/scrapetor](https://github.com/Alaa-abdulridha/scrapetor)

## Requirements

- Ruby 2.7 or newer
- A C99-capable compiler (`clang` or `gcc`). The native extension is
  built automatically when the gem is installed.

There are no other runtime dependencies.

## Installation

Add to your `Gemfile`:

```ruby
gem "scrapetor"
```

Or install directly:

```
gem install scrapetor
```

## Quick start

```ruby
require "scrapetor"

doc = Scrapetor::HTML(html, base_url: "https://example.com/")

result = doc.extract do
  field :title, from: "h1.headline", clean: true

  repeated ".product-card", as: :products do
    field :title, from: ".title", clean: true
    field :price, from: ".price", type: :money
    field :url, from: "a", attr: :href, type: :url, normalize_url: true
    field :image, from: "img", attr: :src, type: :url, normalize_url: true
  end
end
```
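
The `base_url:` argument and the `normalize_url: true` option suggest that relative links are resolved against the page URL. The sketch below shows that resolution step with Ruby's standard `URI` library only. It assumes ordinary RFC 3986 reference resolution; `normalize_link` is a hypothetical helper for illustration and is not part of Scrapetor's API.

```ruby
require "uri"

# Hypothetical helper (not a Scrapetor method): resolve an href found in a
# page against the page's base URL. Returns nil for blank or unparsable input.
def normalize_link(base_url, href)
  return nil if href.nil? || href.strip.empty?

  URI.join(base_url, href.strip).to_s
rescue URI::InvalidURIError
  nil
end

normalize_link("https://example.com/shop/", "item/42")                  # path-relative
normalize_link("https://example.com/shop/", "/cart")                    # host-relative
normalize_link("https://example.com/shop/", "//cdn.example.com/a.png")  # scheme-relative
```

Returning `nil` rather than raising keeps one malformed link from aborting a whole extraction; whether Scrapetor behaves the same way is not stated here.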

Structured-data extractors are built in:

```ruby
doc.json_ld       # Array of parsed <script type="application/ld+json"> blocks
doc.opengraph     # Hash of og:* meta values
doc.twitter_card  # Hash of twitter:* meta values
doc.schema_org(type: "Product")
doc.microdata     # HTML5 itemscope/itemprop trees
doc.page_type     # :product_page | :product_listing | :article | …
```

A minimal HTTP fetcher (uses `Net::HTTP`; no extra gems):

```ruby
doc = Scrapetor.fetch("https://example.com/products")
data = Scrapetor.fetch_extract("https://example.com/products", schema)
```

## Production HTTP layer (libcurl, HTTP/2)

If `libcurl` is available at build time, Scrapetor ships an optional
`Scrapetor::Fetcher` backed by it. The whole pipeline (TLS,
HTTP/2 multiplexing, gzip/deflate/brotli/zstd decoding, charset
transcoding, retry, ETag cache, per-host throttle) runs in C, with
the GVL released across every network and CPU phase.

```ruby
# Single GET with HTTP/2 + connection share + retry.
resp = Scrapetor::Fetcher.get(url,
  retry: 3, backoff: 0.3, max_backoff: 10,
  bearer_token: ENV["TOKEN"],
  cache_dir: "~/.cache/scrapetor")
# => { status: 200, headers: {...}, body: "...", final_url: "...",
#      http_version: "2" }

# POST + JSON / form / multipart.
Scrapetor::Fetcher.post(url, json: {name: "alice"})
Scrapetor::Fetcher.post(url, form: {user: "x", pass: "y"})
Scrapetor::Fetcher.post(url,
  multipart: { name: "avatar",
               file: Scrapetor::Fetcher.upload_file("/tmp/pic.png") })
```

### Bulk fetch APIs

Three concurrency models pick different tradeoffs:

```ruby
# 1. pthread + easy: N workers, each blocking. Best when each
#    response has meaningful CPU work after the fetch (decode + parse)
#    since the GVL is released across the full batch.
docs = Scrapetor::Fetcher.parallel_fetch(urls, threads: 8)

# 2. curl_multi async: single driver thread, N concurrent in-flight.
#    Best for I/O-fan-out (hundreds of URLs across many hosts).
results = Scrapetor::Fetcher.multi_get(urls, max_concurrent: 32)

# 3. streaming multi_each: yields each response in completion order
#    so processing starts as soon as the first transfer lands.
Scrapetor::Fetcher.multi_each(urls) do |r|
  puts r[:final_url], r[:status]   # called as each completes
end
```
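
To make "completion order" concrete, here is a small pure-Ruby sketch of the idea behind the third model: workers push finished results onto a queue and the caller consumes them as they arrive, so a fast job is handled before a slow one that was submitted earlier. It uses only `Thread` and `Queue`. `each_completed` is a hypothetical name, and this is not how Scrapetor's C layer is implemented.

```ruby
# Hypothetical sketch: run `work` over `inputs` on a small thread pool and
# yield [input, result] pairs in the order the jobs finish.
def each_completed(inputs, work, threads: 4)
  pending = Queue.new
  inputs.each { |input| pending << input }
  pending.close # once drained, `pop` returns nil instead of blocking

  done = Queue.new
  workers = Array.new(threads) do
    Thread.new do
      while (input = pending.pop)
        done << [input, work.call(input)]
      end
    end
  end

  inputs.size.times { yield(*done.pop) }
  workers.each(&:join)
end

# Simulated transfers: "slow" is submitted first but finishes last.
delays = { "slow" => 0.3, "fast" => 0.01 }
order = []
each_completed(delays.keys, ->(name) { sleep(delays[name]); name.upcase }, threads: 2) do |name, _result|
  order << name
end
```

The sketch assumes `work` never raises; a production version would capture exceptions per job and push them onto `done` as error results.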

### Session: cookies, auth, throttle, retry

```ruby
session = Scrapetor::Session.new(
  cookies: true,                 # ephemeral cookie jar (path or true)
  user_agent: "MyBot/1.0",
  rate_limit: 0.5,               # min seconds between same-host calls
  retry: 3,                      # default retry for all calls
  headers: { "Accept-Language" => "en-US" },
  proxy: ENV["HTTP_PROXY"],
)
# Explicit hash values keep this example valid on Ruby 2.7 and 3.0.
session.post(login_url, form: { user: user, pass: pass })
doc = session.fetch(dashboard_url)   # cookies carry forward
```
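
The `rate_limit:` option is described as a minimum number of seconds between calls to the same host. A minimal sketch of that bookkeeping follows, assuming a per-host "next allowed time" table; `HostThrottle` and `wait_time` are hypothetical names, and the injectable clock exists only so the logic can be exercised without sleeping.

```ruby
require "uri"

# Hypothetical per-host throttle (illustrative, not Scrapetor's implementation).
class HostThrottle
  def initialize(min_interval, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @min_interval = min_interval
    @clock = clock
    @next_allowed = Hash.new(0.0) # host => earliest time the next call may start
  end

  # Returns how many seconds the caller should sleep before requesting `url`,
  # and reserves the following slot for that host.
  def wait_time(url)
    host = URI(url).host
    now = @clock.call
    start = [now, @next_allowed[host]].max
    @next_allowed[host] = start + @min_interval
    start - now
  end
end

fake_now = 100.0
throttle = HostThrottle.new(0.5, clock: -> { fake_now })
first  = throttle.wait_time("https://a.example/1") # first call to the host
second = throttle.wait_time("https://a.example/2") # same host, same instant
other  = throttle.wait_time("https://b.example/1") # different host
```

Reserving the slot inside `wait_time` means two back-to-back callers to the same host are spaced apart even if neither has slept yet.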

### HTTP cache with ETag / Last-Modified

```ruby
# Cold fetch: server returns 200 + ETag, response cached.
# Warm fetch: scrapetor sends If-None-Match. Server's 304 swaps in
# the cached body and marks headers["x-scrapetor-cache"] = "hit".
Scrapetor::Fetcher.get(url, cache_dir: "~/.cache/scrapetor")

# Bulk revalidation: HEAD every URL in one curl_multi sweep,
# classify each as :fresh / :changed / :missing / :error.
status = Scrapetor::Fetcher.revalidate(urls, cache_dir: "~/.cache/scrapetor")
stale = status.select { |_, v| v == :changed }.keys
```
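
The four revalidation states can be read as a function of the HTTP status and the validators. The mapping below is an assumption made for illustration (304 or a matching ETag means fresh, 404/410 means missing, anything else unexpected is an error); the README does not spell out the exact rules, and `classify_revalidation` is a hypothetical name.

```ruby
# Hypothetical classification of one revalidation response. The status-to-state
# mapping is an assumption, not documented Scrapetor behaviour.
def classify_revalidation(status, cached_etag:, response_etag: nil)
  case status
  when 304
    :fresh
  when 200
    # A HEAD response carries no body, so compare validators when both exist.
    cached_etag && response_etag == cached_etag ? :fresh : :changed
  when 404, 410
    :missing
  else
    :error
  end
end

classify_revalidation(304, cached_etag: '"abc"')
classify_revalidation(200, cached_etag: '"abc"', response_etag: '"def"')
classify_revalidation(404, cached_etag: '"abc"')
classify_revalidation(503, cached_etag: '"abc"')
```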

### Crawl helpers

```ruby
robots = Scrapetor::Robots.fetch_for("https://example.com")
robots.allowed?("https://example.com/private")   # => false
robots.crawl_delay                               # => 2.0
robots.sitemaps                                  # => [...]

Scrapetor::Sitemap.urls("https://example.com/sitemap.xml") do |url, meta|
  # streams large sitemaps without buffering in memory
  # recurses into <sitemapindex> automatically
  process(url, meta)
end
```
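
For readers unfamiliar with how `allowed?`, `crawl_delay`, and `sitemaps` relate to a robots.txt file, here is a deliberately simplified evaluator. It is a sketch under stated assumptions: prefix rules only (no `*` or `$` wildcards), the longest matching prefix wins, and each `User-agent` line starts a new group. `SimpleRobots` is hypothetical, takes a path rather than a full URL, and does not reflect how `Scrapetor::Robots` is written.

```ruby
# Hypothetical, simplified robots.txt evaluation for illustration only.
class SimpleRobots
  attr_reader :crawl_delay, :sitemaps

  def initialize(text, user_agent: "*")
    @rules = [] # [[:allow | :disallow, path_prefix], ...]
    @sitemaps = []
    @crawl_delay = nil
    applies = false

    text.each_line do |line|
      key, value = line.sub(/#.*/, "").strip.split(":", 2).map(&:strip)
      next if key.nil? || value.nil?

      case key.downcase
      when "user-agent"  then applies = (value == "*" || value.casecmp?(user_agent))
      when "sitemap"     then @sitemaps << value
      when "allow"       then @rules << [:allow, value] if applies && !value.empty?
      when "disallow"    then @rules << [:disallow, value] if applies && !value.empty?
      when "crawl-delay" then @crawl_delay = value.to_f if applies
      end
    end
  end

  # The longest matching prefix decides; no matching rule means allowed.
  def allowed?(path)
    kind, _prefix = @rules.select { |_, prefix| path.start_with?(prefix) }
                          .max_by { |_, prefix| prefix.length }
    kind != :disallow
  end
end

robots = SimpleRobots.new(<<~TXT)
  User-agent: *
  Disallow: /private
  Allow: /private/press
  Crawl-delay: 2
  Sitemap: https://example.com/sitemap.xml
TXT
```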

### Streaming HTML parser

Bounded-memory parser for huge documents:

```ruby
Scrapetor.stream(io, outer: "div.result") do |row_doc|
  # one row at a time; peak memory ~= max(chunk, longest_row)
  puts row_doc.at_css(".title").text, row_doc.at_css(".price").text
end
```

The `outer:` selector accepts `tag`, `tag.class`, `tag.cls1.cls2`, `tag#id`, and combinations.
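
The selector forms listed above are all "compound" selectors: an optional tag followed by any mix of `#id` and `.class` parts. The sketch below shows one way such a string could be split into its pieces. `parse_simple_selector` and the regular expression are hypothetical and only illustrate the accepted shapes; the library's real parser may differ.

```ruby
# Hypothetical parser for simple compound selectors such as "div.result",
# "tr.odd.selected" or "table#prices". Combinators are rejected.
SIMPLE_SELECTOR = /\A(?<tag>[a-zA-Z][\w-]*)?(?<rest>(?:[.#][\w-]+)*)\z/.freeze

def parse_simple_selector(selector)
  match = SIMPLE_SELECTOR.match(selector)
  raise ArgumentError, "unsupported selector: #{selector.inspect}" if match.nil? || selector.empty?

  parts = match[:rest].scan(/([.#])([\w-]+)/)
  {
    tag: match[:tag],
    id: parts.find { |sigil, _| sigil == "#" }&.last,
    classes: parts.select { |sigil, _| sigil == "." }.map(&:last)
  }
end

parse_simple_selector("div.result")
parse_simple_selector("tr#row-1.odd.selected")
```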

### Parallel parse for offline corpora

```ruby
htmls = paths.map { |p| File.read(p) }
docs = Scrapetor.parallel_parse(htmls, threads: 8)
# Real multi-core HTML parsing under one GVL release.
```

## Limits: what Scrapetor does NOT do

Worth being explicit about what's out of scope so you can pick the
right tool for the rest of the pipeline.

### No JavaScript execution

Scrapetor reads HTML as the server sent it. Pages that build their
content client-side via React / Vue / Angular / etc. will look empty
to the parser. There's no embedded JS engine and there won't be;
that's a different class of tool (headless browser).

Practical paths if you need rendered HTML:

- **Pre-render upstream**: many SPA hosts can pre-render for crawlers
  (`?_escaped_fragment_=`, prerender.io, Cloudflare's HTML Rewriter,
  Vercel/Netlify ISR). Cheapest if available.
- **Headless browser layer**: drive Playwright / Puppeteer / Selenium
  from Ruby (ferrum, playwright-ruby-client). Have it spit out
  rendered HTML, then hand that to Scrapetor for fast extract.
- **Per-site API mining**: most JS-heavy apps load data from a JSON
  API that Scrapetor can hit directly via `Fetcher.get`.

### No TLS fingerprint impersonation

Scrapetor's HTTP layer is plain libcurl. Sites that fingerprint TLS
handshakes (Cloudflare, Akamai, DataDome, Imperva) will identify the
client as libcurl and may block / challenge accordingly. Scrapetor
won't impersonate Chrome's JA3, JA4, HTTP/2 SETTINGS frame order, or
header capitalisation.

If you need impersonation:

- **[curl-impersonate](https://github.com/lwthiker/curl-impersonate)**
  is a fork of libcurl patched to match Chrome / Firefox / Edge
  fingerprints exactly. You can build it locally and Scrapetor
  will link against it transparently; the gem's HTTP options
  are unchanged.
- **Reach the JSON API directly** with browser-mimicking headers
  (`Accept`, `User-Agent`, `Sec-*`). Many sites only fingerprint
  the HTML route; the API is more permissive.
- **Use a residential / mobile proxy** with a real browser at the
  other end. Scrapetor's `:proxy` + `:proxy_auth` options handle
  the proxy plumbing; the impersonation happens upstream.

The HTTP layer DOES support the rest of the production-scraping
surface: HTTP/2 multiplexing, retries with full-jitter backoff,
per-host throttle, cookie jar + auth, ETag cache, bulk revalidation,
multi-handle concurrency. Treat fingerprint impersonation as the
one externality you may need to bring yourself.
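
"Full-jitter backoff" usually means each retry sleeps for a uniformly random time between zero and an exponentially growing ceiling. The sketch below follows that common definition and reuses the `backoff: 0.3` and `max_backoff: 10` values shown in the `Fetcher.get` example; that `backoff` is the base delay and `max_backoff` the ceiling is an assumption, and `full_jitter_delay` is a hypothetical helper rather than library code.

```ruby
# Hypothetical full-jitter delay for a 0-based retry `attempt`:
# a uniform random value in [0, min(max_backoff, backoff * 2**attempt)).
def full_jitter_delay(attempt, backoff: 0.3, max_backoff: 10, rng: Random.new)
  ceiling = [max_backoff, backoff * (2**attempt)].min
  rng.rand * ceiling
end

# Ceilings for the first seven attempts: the cap takes over once the
# exponential term exceeds max_backoff.
ceilings = (0..6).map { |n| [10, 0.3 * (2**n)].min }
```

Randomising the whole interval, rather than adding a small jitter to a fixed delay, spreads out clients that all failed at the same moment.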

### XPath 1.0: full expression language

`Document#xpath` / `Node#xpath` evaluate the complete XPath 1.0 grammar:
all 13 axes (`child`, `descendant`, `descendant-or-self`, `parent`,
`self`, `following-sibling`, `preceding-sibling`, `following`,
`preceding`, `ancestor`, `ancestor-or-self`, `attribute`, `namespace`),
all node tests (`node()`, `text()`, `comment()`, `processing-instruction()`,
named, `*`, qualified-with-prefix), every operator (`=`, `!=`, `<`,
`<=`, `>`, `>=`, `+`, `-`, `*`, `div`, `mod`, `and`, `or`, `|`), and
the full standard function library (`not`, `last`, `position`, `count`,
`local-name`, `name`, `string`, `concat`, `starts-with`, `contains`,
`substring`, `substring-before`, `substring-after`, `string-length`,
`normalize-space`, `translate`, `boolean`, `true`, `false`, `lang`,
`number`, `sum`, `floor`, `ceiling`, `round`, `id`).

Compiled ASTs cache per unique expression string. The compiler also
detects expressions that map cleanly onto the native CSS chain
(`//div[@class='x']`, `//ul/li[1]`, `//dt/following-sibling::dd`,
`//div[position() > 50]`, etc.) and dispatches them directly to the
arena's C selector matcher, the same hot path the rest of the library
rides. Sibling, ancestor, and following/preceding axis walks all run
through dedicated C primitives over the DFS-range-encoded arena, so
no Ruby per-step traversal is involved.
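
A DFS-range encoding gives every node the preorder index at which it was entered plus the last preorder index inside its subtree, which turns ancestor and `following` checks into integer comparisons. The Ruby sketch below illustrates the general technique only; the `Node` struct, `assign_ranges`, `ancestor?`, and `following?` are hypothetical names, and the actual C arena layout is not described in this README.

```ruby
# Hypothetical DFS-range encoding. `pre` is the preorder entry index and
# `last` is the largest preorder index found in the node's subtree.
Node = Struct.new(:name, :children, :pre, :last)

def assign_ranges(node, counter = [0])
  node.pre = counter[0]
  counter[0] += 1
  node.children.each { |child| assign_ranges(child, counter) }
  node.last = counter[0] - 1
  node
end

# `a` is a proper ancestor of `b` when b's index falls inside a's range.
def ancestor?(a, b)
  a.pre < b.pre && b.pre <= a.last
end

# XPath `following` axis: nodes entered after the context subtree has ended.
def following?(context, other)
  other.pre > context.last
end

li1  = Node.new("li", [])
li2  = Node.new("li", [])
ul   = Node.new("ul", [li1, li2])
para = Node.new("p", [])
root = assign_ranges(Node.new("body", [ul, para]))
```

With this layout, collecting an axis is a contiguous scan over an index range instead of a pointer-chasing walk, which is what makes it a natural fit for a flat arena.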

### HTTP/3 and WebSocket: capability-detected

`Scrapetor::Fetcher.features` reports whether the linked libcurl
exposes HTTP/3 and WebSocket support. Pass `http_version: "3"` to
opt into HTTP/3 when available; otherwise the default is HTTP/2
over TLS with HTTP/1.1 fallback. WebSocket frames go through
libcurl 7.86+'s `curl_ws_send` / `curl_ws_recv`; the gem doesn't
yet ship a friendly Ruby API for these, so patches are welcome.

## Command-line interface

```
$ scrapetor extract page.html --schema schema.rb
$ scrapetor info page.html
$ scrapetor jsonld page.html
$ scrapetor opengraph page.html
$ scrapetor microdata page.html
$ scrapetor page-type page.html
```

## Performance

Benchmarks live in `benchmark/comprehensive.rb`. Every workload asserts
that all three engines produce equivalent output before timing. Numbers
below were measured on Apple Silicon (Ruby 2.7.8); they're reproducible
from this repository.

**Parse throughput (build the DOM tree, MB/s).**

| Document        | Scrapetor | Nokolexbor | Nokogiri |
|-----------------|----------:|-----------:|---------:|
| small (170 B)   |        37 |         18 |       11 |
| article (2 KB)  |       279 |        116 |       31 |
| product (3 KB)  |       426 |        140 |       31 |
| listing (36 KB) |     3,933 |        154 |       31 |
| large (2.5 MB)  |    18,434 |        136 |       31 |

**CSS selector evaluation (one selector against a pre-parsed document, iter/sec).**

| Selector                                | Scrapetor | Nokolexbor |   Speedup |
|-----------------------------------------|----------:|-----------:|----------:|
| `#main` (single id)                     | 1,272,698 |     65,170 |    19.53x |
| `article` (tag)                         | 1,244,279 |     65,122 |    19.11x |
| `.product-card` (class)                 | 1,226,004 |     68,065 |    18.01x |
| `#main article` (id descendant)         | 1,086,604 |     35,228 |    30.85x |
| `img.product-image` (tag.class)         |   875,901 |     65,707 |    13.33x |
| `.product-grid > .product-card` (child) |   754,486 |     60,924 |    12.38x |
| `[data-sku="SKU0001"]` (attr)           |   516,157 |     78,101 |     6.61x |
| `.product-card .price` (descendant)     |   405,062 |     42,870 |     9.45x |

Pseudo-classes (`:has`, `:not`, `:is`, `:nth-child`, `:first-child`,
`:last-child`, `:nth-of-type`, etc.) and pseudo-elements (`::text`,
`::attr(name)`) run natively in the same C engine; see the
[Selector support](#selector-support) section below for the full list.

**End-to-end extraction (parse plus run an extraction schema, iter/sec).**

| Workload                          | Scrapetor | Nokolexbor | Nokogiri |
|-----------------------------------|----------:|-----------:|---------:|
| listing (50 cards x 4 fields)     |     9,360 |        573 |      171 |
| product detail (top + 3 reviews)  |    30,101 |     11,636 |    2,022 |
| article (top + tags + sections)   |    53,837 |     25,553 |    6,338 |

**Allocations per extraction call (live Ruby objects, lower is better).**

| Workload                         | Scrapetor | Nokolexbor | Nokogiri |
|----------------------------------|----------:|-----------:|---------:|
| listing (50 cards x 4 fields)    |       363 |      4,710 |    9,501 |
| product detail (top + 3 reviews) |        96 |        140 |      596 |

The full report - including the article workload, selector micro-benchmarks
for every supported selector form, and per-document MB/s figures - is
written to `benchmark/RESULTS.md` whenever you run `ruby -Ilib
benchmark/comprehensive.rb`.

## Architecture

```
HTML in ─┬─► Native streaming extract (C)
         │     `doc.extract(schema)`
         │     Schema runs during tokenisation. No DOM is built.
         │     One Ruby/C boundary crossing per document.
         │
         └─► Native arena DOM (C)
               `doc.css(...)`, `doc.at(...)`, mutation, traversal.
               Class/id/tag indexes built during the parse pass.
               Zero-copy text and attribute spans into the input buffer.
```

The same C extension (`ext/scrapetor/native/`) provides both paths. A
pure-Ruby DOM is kept as a fallback for environments where the
extension can't be loaded.

## Selector support

The native engine runs the following CSS forms in C:

- tag, `.class`, `tag.class`, `#id`, `tag#id`, universal `*`
- `[attr]`, `[attr=value]` and the `*=`, `^=`, `$=`, `~=`, `|=` variants
- descendant (`A B`) and child (`A > B`) combinators
- structural pseudo-classes: `:first-child`, `:last-child`, `:only-child`,
  `:first-of-type`, `:last-of-type`, `:only-of-type`, `:nth-child(...)`,
  `:nth-last-child(...)`, `:nth-of-type(...)`, `:nth-last-of-type(...)`,
  `:empty`, `:root`, `:scope`
- boolean-attribute pseudos: `:checked`, `:disabled`, `:enabled`,
  `:required`, `:optional`, `:read-only`, `:read-write`, `:any-link`,
  `:link`
- logical pseudos: `:not(...)`, `:is(...)`, `:matches(...)`, `:where(...)`,
  `:has(...)` (each runs natively when its inner selector is a single
  atom, typically a class, id, tag, or attribute predicate)
- Scrapy/Parsel-style pseudo-elements: `::text` and `::attr(name)`;
  the engine emits strings directly via a bulk C path so a 100-item
  result is one boundary crossing, not 100

Sibling combinators (`+`, `~`) and inner selectors more complex than a
single atom (for example `:not(div > .x)` or `:has(.x .y)`) transparently
fall back to a pure-Ruby implementation that mirrors Nokogiri's output.
`Selector.compile` never raises on a syntactically valid CSS selector;
the fallback is automatic.
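
The `:nth-child(...)` family takes an `an+b` argument: a sibling at 1-based position `i` matches when `i = a*n + b` for some integer `n >= 0`. The check below is a generic sketch of that rule from the CSS Selectors specification; `nth_match?` is a hypothetical helper and not the engine's C code.

```ruby
# Does the 1-based sibling position `index` satisfy `an+b`?
def nth_match?(index, a, b)
  return index == b if a.zero?

  diff = index - b
  (diff % a).zero? && (diff / a) >= 0
end

(1..6).select { |i| nth_match?(i, 2, 1) }   # :nth-child(2n+1), i.e. odd  => [1, 3, 5]
(1..6).select { |i| nth_match?(i, -1, 3) }  # :nth-child(-n+3), first 3   => [1, 2, 3]
(1..6).select { |i| nth_match?(i, 0, 4) }   # :nth-child(4)               => [4]
```

The same test serves `:nth-last-child` and the `-of-type` variants once `index` is counted from the appropriate end or among same-tag siblings.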

## API reference

See the project documentation at [scrapetor.org/docs](https://scrapetor.org/docs).
The main entry points are:

- `Scrapetor.parse(html, base_url:)` - returns a `Scrapetor::Document`
- `Scrapetor::HTML(html, base_url)` - same, Nokogiri-style alias
- `Scrapetor.schema { … }` - schema DSL
- `Scrapetor.extract(html, schema)` - parse and extract
- `Scrapetor.fetch(url)` - HTTP GET and parse
- `Scrapetor.fetch_extract(url, schema)`
- `Scrapetor::Builder.build { |b| b.html { … } }`
- `Scrapetor::SAX::Parser.new(handler).parse(html)`

Document and Node expose the standard read/mutate API: `css`, `at_css`,
`xpath`, `text`, `content`, `inner_html`, `outer_html`, `[]`, `[]=`,
`attributes`, `children`, `parent`, `add_child`, `before`, `after`,
`replace`, `remove`, `add_class`, `remove_class`, and so on.

## Compatibility

Scrapetor is tested on Ruby 2.7, 3.0, 3.1, 3.2, and 3.3 on Linux and
macOS (see `.github/workflows/ci.yml`). The native extension uses only
the stable public Ruby C API.

## Development

```
git clone https://github.com/Alaa-abdulridha/scrapetor
cd scrapetor
bundle install
rake compile
rake test
```

To run the benchmarks:

```
ruby -Ilib benchmark/comprehensive.rb   # full report → benchmark/RESULTS.md
ruby -Ilib benchmark/parse_extract.rb   # listing workload only
ruby -Ilib benchmark/product_detail.rb  # product detail workload only
```

## Contributing

Issues, bug reports, and pull requests are welcome on GitHub at
<https://github.com/Alaa-abdulridha/scrapetor>. Please read
[`CONTRIBUTING.md`](CONTRIBUTING.md) before submitting a pull request.

To report a security vulnerability, follow the process in
[`SECURITY.md`](SECURITY.md).

## License

Scrapetor is released under the MIT License. See [`LICENSE`](LICENSE).
data/bin/scrapetor
ADDED
@@ -0,0 +1,190 @@
#!/usr/bin/env ruby
# frozen_string_literal: true
#
# Scrapetor CLI.
#
# Subcommands:
#   scrapetor extract FILE --schema SCHEMA.rb [--base-url URL] [--format json|yaml|pp]
#   scrapetor info FILE                # page-type, structured data, encoding
#   scrapetor microdata FILE           # microdata + RDFa
#   scrapetor jsonld FILE              # JSON-LD only
#   scrapetor opengraph FILE           # OG + Twitter Card
#   scrapetor schema-org FILE [--type TYPE]
#   scrapetor sax FILE                 # stream SAX events to stdout
#   scrapetor version

$LOAD_PATH.unshift File.expand_path("../lib", __dir__)
require "scrapetor"
require "json"
require "optparse"

class ScrapetorCLI
  def initialize(argv)
    @argv = argv
  end

  def run
    cmd = @argv.shift
    case cmd
    when nil, "-h", "--help", "help" then print_help
    when "version", "-v", "--version" then puts Scrapetor::VERSION
    when "extract" then cmd_extract
    when "info" then cmd_info
    when "microdata" then cmd_microdata
    when "rdfa" then cmd_rdfa
    when "jsonld", "json-ld" then cmd_jsonld
    when "opengraph", "og" then cmd_opengraph
    when "schema-org", "schema_org" then cmd_schema_org
    when "sax" then cmd_sax
    when "page-type", "page_type" then cmd_page_type
    when "encoding" then cmd_encoding
    else
      warn "unknown command: #{cmd}"
      print_help
      exit 1
    end
  end

  private

  def print_help
    puts <<~HELP
      Scrapetor #{Scrapetor::VERSION}

      Usage: scrapetor SUBCOMMAND [args]

      Subcommands:
        extract FILE --schema FILE   Run an extraction schema against an HTML file.
            [--base-url URL] [--format json|yaml|pp]
        info FILE                    Print page-type, encoding, OG, JSON-LD summary.
        page-type FILE               Print the detected page type.
        encoding FILE                Print the detected encoding.
        jsonld FILE                  Print parsed JSON-LD as JSON.
        opengraph FILE               Print OpenGraph + Twitter Card metadata.
        schema-org FILE [--type T]   Print Schema.org typed items (filter optional).
        microdata FILE               Print HTML5 microdata.
        rdfa FILE                    Print RDFa typed data.
        sax FILE                     Stream SAX events to stdout.
        version                      Print the gem version.

      Examples:
        scrapetor extract page.html --schema product_schema.rb
        scrapetor info page.html
        cat page.html | scrapetor jsonld -
    HELP
  end

  # Read HTML from a file, or from stdin when the path is "-".
  def read_input(path)
    return $stdin.read if path == "-"
    File.read(path)
  end

  # Print `value` in the requested format; unknown formats fall back to JSON.
  def format_output(value, fmt)
    case fmt
    when "json" then puts JSON.pretty_generate(value)
    when "pp" then require "pp"; pp value
    when "yaml" then require "yaml"; puts value.to_yaml
    else puts JSON.pretty_generate(value)
    end
  end

  def cmd_extract
    schema_path = nil
    base_url = nil
    fmt = "json"
    OptionParser.new do |o|
      o.on("--schema FILE") { |v| schema_path = v }
      o.on("--base-url URL") { |v| base_url = v }
      o.on("--format FMT") { |v| fmt = v }
    end.parse!(@argv)
    file = @argv.shift || abort("missing FILE")
    abort("missing --schema") unless schema_path

    html = read_input(file)
    schema = load_schema(schema_path)
    result = Scrapetor.parse(html, base_url: base_url).extract(schema)
    format_output(result, fmt)
  end

  def load_schema(path)
    src = File.read(path)
    # The schema file is evaluated with this method's binding (not an
    # isolated one), so it can see `src` and `path`. It MUST return
    # a Scrapetor::Schema. The simplest convention: the file's last
    # expression is `Scrapetor.schema do ... end`.
    schema = eval(src, binding, path) # rubocop:disable Security/Eval
    abort("schema file did not return a Scrapetor::Schema") unless schema.is_a?(Scrapetor::Schema)
    schema
  end

  def cmd_info
    file = @argv.shift || abort("missing FILE")
    html = read_input(file)
    doc = Scrapetor.parse(html)
    info = {
      encoding: doc.encoding,
      page_type: doc.page_type,
      title: doc.title,
      json_ld: doc.json_ld.size,
      opengraph: doc.opengraph.size,
      microdata: doc.microdata.size,
      stats: doc.stats
    }
    puts JSON.pretty_generate(info)
  end

  def cmd_microdata
    file = @argv.shift || abort("missing FILE")
    puts JSON.pretty_generate(Scrapetor.parse(read_input(file)).microdata)
  end

  def cmd_rdfa
    file = @argv.shift || abort("missing FILE")
    puts JSON.pretty_generate(Scrapetor.parse(read_input(file)).rdfa)
  end

  def cmd_jsonld
    file = @argv.shift || abort("missing FILE")
    puts JSON.pretty_generate(Scrapetor.parse(read_input(file)).json_ld)
  end

  def cmd_opengraph
    file = @argv.shift || abort("missing FILE")
    doc = Scrapetor.parse(read_input(file))
    puts JSON.pretty_generate(opengraph: doc.opengraph, twitter_card: doc.twitter_card)
  end

  def cmd_schema_org
    type = nil
    OptionParser.new do |o|
      o.on("--type T") { |v| type = v }
    end.parse!(@argv)
    file = @argv.shift || abort("missing FILE")
    puts JSON.pretty_generate(Scrapetor.parse(read_input(file)).schema_org(type: type))
  end

  def cmd_sax
    file = @argv.shift || abort("missing FILE")
    Scrapetor::SAX::Tokenizer.new(read_input(file)).each_event do |event|
      type, *args = event
      case type
      when :start then puts "START #{args[0]} #{args[1].inspect}"
      when :end then puts "END #{args[0]}"
      when :text then puts "TEXT #{args[0].inspect}"
      when :comment then puts "COMM #{args[0].inspect}"
      when :doctype then puts "DOC #{args[0]}"
      end
    end
  end

  def cmd_page_type
    file = @argv.shift || abort("missing FILE")
    puts Scrapetor.parse(read_input(file)).page_type
  end

  def cmd_encoding
    file = @argv.shift || abort("missing FILE")
    puts Scrapetor.parse(read_input(file)).encoding
  end
end

ScrapetorCLI.new(ARGV).run
data/ext/scrapetor/README.md
ADDED
@@ -0,0 +1,53 @@

# Native extension

This directory contains the C source for Scrapetor's native extension.
It builds via `mkmf` and is bundled into `scrapetor_native.bundle`
(macOS) or `scrapetor_native.so` (Linux). The Ruby side loads it from
`lib/scrapetor/native.rb`.

```
ext/scrapetor/native/
├── extconf.rb           mkmf build script
├── scrapetor_native.c   streaming extraction engine
└── scrapetor_dom.c      arena DOM with class/id/tag indexes
```

## Building

The extension is built automatically when the gem is installed. For
local development:

```
rake compile
```

This produces `lib/scrapetor/scrapetor_native.{bundle,so}`. To clean:

```
rake clean
```

## Requirements

- A C99-capable compiler (`clang` on macOS, `gcc` on Linux)
- Ruby development headers (provided by your Ruby installation)

No external libraries are linked.

## Two execution paths

The extension exposes two distinct paths to Ruby:

1. **`Scrapetor::Native.extract(html, descriptor, base_url)`**:
   the streaming extraction engine. The schema descriptor is consumed
   inline with the HTML tokenisation; no DOM is materialised.
   Implemented in `scrapetor_native.c`.

2. **`Scrapetor::Native::Document.parse(html)`**: the arena DOM. One
   forward pass over the HTML builds a tree of fixed-width nodes, and
   the class / id / tag indexes are populated during the same pass.
   Implemented in `scrapetor_dom.c`. Exposes node-id-based accessors
   that the lazy `Scrapetor::Native::Element` Ruby wrapper consumes.

Both paths share `enc_utf8` and a handful of helpers but otherwise
operate independently.
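
To illustrate what "arena DOM with indexes populated during the same pass" can look like, here is a toy Ruby model: nodes live in parallel flat arrays, refer to each other by integer id, and are registered in tag and class lookup tables at the moment they are appended. The `Arena` class and its method names are hypothetical; the real structure lives in C and its layout is not documented in this file.

```ruby
# Hypothetical arena sketch: node ids are array positions, and the tag and
# class indexes are filled while nodes are appended (one pass, no re-scan).
class Arena
  attr_reader :tags, :parents

  def initialize
    @tags = []     # node id => tag name
    @parents = []  # node id => parent node id (nil for the root)
    @by_tag = Hash.new { |hash, key| hash[key] = [] }
    @by_class = Hash.new { |hash, key| hash[key] = [] }
  end

  # Append a node, index it, and return its id.
  def add(tag, parent: nil, classes: [])
    id = @tags.length
    @tags << tag
    @parents << parent
    @by_tag[tag] << id
    classes.each { |name| @by_class[name] << id }
    id
  end

  def ids_by_tag(tag)
    @by_tag.fetch(tag, [])
  end

  def ids_by_class(name)
    @by_class.fetch(name, [])
  end
end

arena = Arena.new
body = arena.add("body")
card = arena.add("div", parent: body, classes: ["product-card"])
arena.add("span", parent: card, classes: ["price"])
arena.add("div", parent: body, classes: ["product-card", "featured"])
arena.ids_by_class("product-card") # => [1, 3]
```

A selector such as `.product-card` then becomes a single hash lookup that returns node ids in document order, with no tree traversal.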