scrapetor 0.2.0

Files changed (51)
  1. checksums.yaml +7 -0
  2. data/CHANGELOG.md +242 -0
  3. data/LICENSE +21 -0
  4. data/README.md +440 -0
  5. data/bin/scrapetor +190 -0
  6. data/bin/scrapetor-bench +5 -0
  7. data/ext/scrapetor/README.md +53 -0
  8. data/ext/scrapetor/native/extconf.rb +67 -0
  9. data/ext/scrapetor/native/scrapetor_dom.c +6346 -0
  10. data/ext/scrapetor/native/scrapetor_http.c +2591 -0
  11. data/ext/scrapetor/native/scrapetor_native.c +1156 -0
  12. data/lib/scrapetor/builder.rb +158 -0
  13. data/lib/scrapetor/cleaner.rb +10 -0
  14. data/lib/scrapetor/comment_node.rb +67 -0
  15. data/lib/scrapetor/document.rb +457 -0
  16. data/lib/scrapetor/dom/parser.rb +69 -0
  17. data/lib/scrapetor/dom/selectors.rb +208 -0
  18. data/lib/scrapetor/dom.rb +563 -0
  19. data/lib/scrapetor/encoding.rb +85 -0
  20. data/lib/scrapetor/entities.rb +90 -0
  21. data/lib/scrapetor/errors.rb +12 -0
  22. data/lib/scrapetor/extractor.rb +147 -0
  23. data/lib/scrapetor/fetcher.rb +390 -0
  24. data/lib/scrapetor/fingerprint.rb +29 -0
  25. data/lib/scrapetor/form.rb +141 -0
  26. data/lib/scrapetor/http.rb +114 -0
  27. data/lib/scrapetor/microdata.rb +132 -0
  28. data/lib/scrapetor/money.rb +30 -0
  29. data/lib/scrapetor/native.rb +291 -0
  30. data/lib/scrapetor/native_dom.rb +2258 -0
  31. data/lib/scrapetor/node.rb +539 -0
  32. data/lib/scrapetor/node_set.rb +301 -0
  33. data/lib/scrapetor/page_type.rb +95 -0
  34. data/lib/scrapetor/pagination.rb +109 -0
  35. data/lib/scrapetor/persistent_cache.rb +130 -0
  36. data/lib/scrapetor/robots.rb +159 -0
  37. data/lib/scrapetor/sax.rb +285 -0
  38. data/lib/scrapetor/schema.rb +144 -0
  39. data/lib/scrapetor/selector.rb +576 -0
  40. data/lib/scrapetor/session.rb +141 -0
  41. data/lib/scrapetor/sitemap.rb +52 -0
  42. data/lib/scrapetor/stream.rb +111 -0
  43. data/lib/scrapetor/structured_data.rb +74 -0
  44. data/lib/scrapetor/template_registry.rb +24 -0
  45. data/lib/scrapetor/text_node.rb +101 -0
  46. data/lib/scrapetor/url.rb +21 -0
  47. data/lib/scrapetor/version.rb +5 -0
  48. data/lib/scrapetor/xpath.rb +1603 -0
  49. data/lib/scrapetor.rb +167 -0
  50. data/scrapetor.gemspec +77 -0
  51. metadata +200 -0
data/README.md ADDED
@@ -0,0 +1,440 @@
1
+ # Scrapetor
2
+
3
+ [![Gem Version](https://img.shields.io/gem/v/scrapetor.svg)](https://rubygems.org/gems/scrapetor)
4
+ [![CI](https://github.com/Alaa-abdulridha/scrapetor/actions/workflows/ci.yml/badge.svg)](https://github.com/Alaa-abdulridha/scrapetor/actions/workflows/ci.yml)
5
+ [![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
6
+
7
+ A Ruby HTML parsing and structured-extraction library. Scrapetor pairs a
8
+ native C arena DOM with a streaming extraction engine that compiles a
9
+ schema DSL into a single forward pass over the input - no DOM is
10
+ materialised, one Ruby boundary crossing per document.
11
+
12
+ The same gem also exposes a full read/mutate DOM API, encoding
13
+ detection, structured-data extractors (JSON-LD, OpenGraph, Schema.org,
14
+ Microdata, RDFa, Twitter Cards), a pure-Ruby builder and SAX streamer,
15
+ a CLI, and an HTTP fetcher built on `Net::HTTP`. There are no external
16
+ parser dependencies.
17
+
18
+ Project page: [scrapetor.org](https://scrapetor.org) ·
19
+ Source: [github.com/Alaa-abdulridha/scrapetor](https://github.com/Alaa-abdulridha/scrapetor)
20
+
21
+ ## Requirements
22
+
23
+ - Ruby 2.7 or newer
24
+ - A C99-capable compiler (`clang` or `gcc`). The native extension is
25
+ built automatically when the gem is installed.
26
+
27
+ There are no other runtime dependencies.
28
+
29
+ ## Installation
30
+
31
+ Add to your `Gemfile`:
32
+
33
+ ```ruby
34
+ gem "scrapetor"
35
+ ```
36
+
37
+ Or install directly:
38
+
39
+ ```
40
+ gem install scrapetor
41
+ ```
42
+
43
+ ## Quick start
44
+
45
+ ```ruby
46
+ require "scrapetor"
47
+
48
+ doc = Scrapetor::HTML(html, base_url: "https://example.com/")
49
+
50
+ result = doc.extract do
51
+ field :title, from: "h1.headline", clean: true
52
+
53
+ repeated ".product-card", as: :products do
54
+ field :title, from: ".title", clean: true
55
+ field :price, from: ".price", type: :money
56
+ field :url, from: "a", attr: :href, type: :url, normalize_url: true
57
+ field :image, from: "img", attr: :src, type: :url, normalize_url: true
58
+ end
59
+ end
60
+ ```
61
+
62
+ Structured-data extractors are built in:
63
+
64
+ ```ruby
65
+ doc.json_ld # Array of parsed <script type="application/ld+json"> blocks
66
+ doc.opengraph # Hash of og:* meta values
67
+ doc.twitter_card # Hash of twitter:* meta values
68
+ doc.schema_org(type: "Product")
69
+ doc.microdata # HTML5 itemscope/itemprop trees
70
+ doc.page_type # :product_page | :product_listing | :article | …
71
+ ```
72
+
73
+ A minimal HTTP fetcher (uses `Net::HTTP`; no extra gems):
74
+
75
+ ```ruby
76
+ doc = Scrapetor.fetch("https://example.com/products")
77
+ data = Scrapetor.fetch_extract("https://example.com/products", schema)
78
+ ```
79
+
80
+ ## Production HTTP layer (libcurl, HTTP/2)
81
+
82
+ If `libcurl` is available at build time, Scrapetor ships an optional
83
+ `Scrapetor::Fetcher` backed by it. The whole pipeline — TLS,
84
+ HTTP/2 multiplexing, gzip/deflate/brotli/zstd decoding, charset
85
+ transcoding, retry, ETag cache, per-host throttle — runs in C, with
86
+ the GVL released across every network and CPU phase.
87
+
88
+ ```ruby
89
+ # Single GET with HTTP/2 + connection share + retry.
90
+ resp = Scrapetor::Fetcher.get(url,
91
+ retry: 3, backoff: 0.3, max_backoff: 10,
92
+ bearer_token: ENV["TOKEN"],
93
+ cache_dir: "~/.cache/scrapetor")
94
+ # => { status: 200, headers: {...}, body: "...", final_url: "...",
95
+ # http_version: "2" }
96
+
97
+ # POST + JSON / form / multipart.
98
+ Scrapetor::Fetcher.post(url, json: {name: "alice"})
99
+ Scrapetor::Fetcher.post(url, form: {user: "x", pass: "y"})
100
+ Scrapetor::Fetcher.post(url,
101
+ multipart: { name: "avatar",
102
+ file: Scrapetor::Fetcher.upload_file("/tmp/pic.png") })
103
+ ```
104
+
105
+ ### Bulk fetch APIs
106
+
107
+ Three concurrency models offer different tradeoffs:
108
+
109
+ ```ruby
110
+ # 1. pthread + easy: N workers, each blocking. Best when each
111
+ # response has meaningful CPU work after the fetch (decode + parse)
112
+ # since the GVL is released across the full batch.
113
+ docs = Scrapetor::Fetcher.parallel_fetch(urls, threads: 8)
114
+
115
+ # 2. curl_multi async: single driver thread, N concurrent in-flight.
116
+ # Best for I/O-fan-out (hundreds of URLs across many hosts).
117
+ results = Scrapetor::Fetcher.multi_get(urls, max_concurrent: 32)
118
+
119
+ # 3. streaming multi_each: yields each response in completion order
120
+ # so processing starts as soon as the first transfer lands.
121
+ Scrapetor::Fetcher.multi_each(urls) do |r|
122
+ puts r[:final_url], r[:status] # called as each completes
123
+ end
124
+ ```
125
+
126
+ ### Session: cookies, auth, throttle, retry
127
+
128
+ ```ruby
129
+ session = Scrapetor::Session.new(
130
+ cookies: true, # ephemeral cookie jar (path or true)
131
+ user_agent: "MyBot/1.0",
132
+ rate_limit: 0.5, # min seconds between same-host calls
133
+ retry: 3, # default retry for all calls
134
+ headers: { "Accept-Language" => "en-US" },
135
+ proxy: ENV["HTTP_PROXY"],
136
+ )
137
+ session.post(login_url, form: {user: user, pass: pass}) # explicit values: shorthand `{user:, pass:}` needs Ruby 3.1+
138
+ doc = session.fetch(dashboard_url) # cookies carry forward
139
+ ```
140
+
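The `rate_limit:` option above is described as a minimum number of seconds between same-host calls. A minimal sketch of that idea in plain Ruby follows. `HostThrottle` and its injectable `clock:` are hypothetical names for illustration, and the bookkeeping is an assumption about how such a throttle could work, not a description of Scrapetor's C implementation.

```ruby
require "uri"

# Hypothetical helper (not part of Scrapetor): remembers when the last
# request to each host was allowed to start and reports how long the next
# caller should wait so that same-host calls stay `min_interval` apart.
class HostThrottle
  def initialize(min_interval, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @min_interval = min_interval
    @clock = clock          # injectable so the logic can be tested without sleeping
    @last_start = {}        # host => time the previous request was allowed to start
  end

  # Returns the number of seconds to sleep before requesting `url`
  # and records the start time that request is granted.
  def wait_for(url)
    host = URI(url).host
    now = @clock.call
    earliest = @last_start.key?(host) ? @last_start[host] + @min_interval : now
    start_at = [now, earliest].max
    @last_start[host] = start_at
    start_at - now
  end
end

throttle = HostThrottle.new(0.5)
delay = throttle.wait_for("https://example.com/products")
sleep(delay) if delay.positive?
```

Keeping the clock injectable means the waiting logic can be checked deterministically; different hosts never delay each other because the table is keyed by host.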
141
+ ### HTTP cache with ETag / Last-Modified
142
+
143
+ ```ruby
144
+ # Cold fetch: server returns 200 + ETag, response cached.
145
+ # Warm fetch: scrapetor sends If-None-Match. Server's 304 swaps in
146
+ # the cached body and marks headers["x-scrapetor-cache"] = "hit".
147
+ Scrapetor::Fetcher.get(url, cache_dir: "~/.cache/scrapetor")
148
+
149
+ # Bulk revalidation: HEAD every URL in one curl_multi sweep,
150
+ # classify each as :fresh / :changed / :missing / :error.
151
+ status = Scrapetor::Fetcher.revalidate(urls, cache_dir: "~/.cache/scrapetor")
152
+ stale = status.select { |_, v| v == :changed }.keys
153
+ ```
154
+
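A minimal sketch of the conditional-request flow described in the comments above, built with standard-library `Net::HTTP` request objects and sending nothing over the network. `CacheEntry`, `conditional_get`, and `resolve_body` are hypothetical names for illustration, and the `"miss"` label is an assumption (the README only documents `"hit"`). It is not Scrapetor's own cache code.

```ruby
require "net/http"
require "uri"

# Hypothetical cache record: what a client might keep after a 200 response.
CacheEntry = Struct.new(:etag, :last_modified, :body)

# Build a GET request, adding validators when a cached entry exists.
def conditional_get(url, entry)
  req = Net::HTTP::Get.new(URI(url))
  if entry
    req["If-None-Match"] = entry.etag if entry.etag
    req["If-Modified-Since"] = entry.last_modified if entry.last_modified
  end
  req
end

# Decide which body to hand back: a 304 means the cached copy is still valid.
def resolve_body(status, fresh_body, entry)
  return [entry.body, "hit"] if status == 304 && entry
  [fresh_body, "miss"]
end

entry = CacheEntry.new('"abc123"', "Wed, 01 Jan 2025 00:00:00 GMT", "<html>cached</html>")
req = conditional_get("https://example.com/products", entry)
body, cache_state = resolve_body(304, nil, entry)
```

On a warm fetch the server can answer `304 Not Modified` with no body, so the client reuses the stored body and only pays for the headers.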
155
+ ### Crawl helpers
156
+
157
+ ```ruby
158
+ robots = Scrapetor::Robots.fetch_for("https://example.com")
159
+ robots.allowed?("https://example.com/private") # => false
160
+ robots.crawl_delay # => 2.0
161
+ robots.sitemaps # => [...]
162
+
163
+ Scrapetor::Sitemap.urls("https://example.com/sitemap.xml") do |url, meta|
164
+ # streams large sitemaps without buffering in memory
165
+ # recurses into <sitemapindex> automatically
166
+ process(url, meta)
167
+ end
168
+ ```
169
+
170
+ ### Streaming HTML parser
171
+
172
+ Bounded-memory parser for huge documents:
173
+
174
+ ```ruby
175
+ Scrapetor.stream(io, outer: "div.result") do |row_doc|
176
+ # one row at a time; peak memory ~= max(chunk, longest_row)
177
+ puts row_doc.at_css(".title").text, row_doc.at_css(".price").text
178
+ end
179
+ ```
180
+
181
+ Accepts `tag`, `tag.class`, `tag.cls1.cls2`, `tag#id`, and combinations.
182
+
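The `outer:` forms listed above (`tag`, `tag.class`, `tag.cls1.cls2`, `tag#id`) are simple enough to split with a regular expression. The sketch below shows one way such a selector could be decomposed and matched. `SimpleSelector`, `parse_simple_selector`, and `simple_match?` are hypothetical helpers, and requiring a leading tag name is an assumption drawn from the listed forms; Scrapetor's actual parser lives in C and may differ.

```ruby
# Hypothetical representation of a `tag`, `tag.class`, or `tag#id` selector.
SimpleSelector = Struct.new(:tag, :id, :classes)

# Split a selector such as "div.result" or "section#main.wide" into parts.
# Anything with combinators, attributes, or pseudo-classes is rejected.
def parse_simple_selector(selector)
  m = /\A([a-zA-Z][\w-]*)((?:[.#][\w-]+)*)\z/.match(selector)
  raise ArgumentError, "unsupported selector: #{selector.inspect}" if m.nil?

  id = nil
  classes = []
  m[2].scan(/([.#])([\w-]+)/) do |sigil, name|
    if sigil == "#"
      id = name
    else
      classes << name
    end
  end
  SimpleSelector.new(m[1], id, classes)
end

# True when an element with the given tag, id, and classes satisfies `sel`.
def simple_match?(sel, tag:, id: nil, classes: [])
  sel.tag == tag &&
    (sel.id.nil? || sel.id == id) &&
    (sel.classes - classes).empty?
end

sel = parse_simple_selector("div.result")
simple_match?(sel, tag: "div", classes: %w[result highlighted])
```

A matcher this small can be evaluated per start tag while streaming, which is what makes it possible to emit one row at a time without building the whole document.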
183
+ ### Parallel parse for offline corpora
184
+
185
+ ```ruby
186
+ htmls = paths.map { |p| File.read(p) }
187
+ docs = Scrapetor.parallel_parse(htmls, threads: 8)
188
+ # Real multi-core HTML parsing under one GVL release.
189
+ ```
190
+
191
+ ## Limits — what Scrapetor does NOT do
192
+
193
+ Worth being explicit about what's out of scope so you can pick the
194
+ right tool for the rest of the pipeline.
195
+
196
+ ### No JavaScript execution
197
+
198
+ Scrapetor reads HTML as the server sent it. Pages that build their
199
+ content client-side via React / Vue / Angular / etc. will look empty
200
+ to the parser. There's no embedded JS engine and there won't be —
201
+ that's a different class of tool (headless browser).
202
+
203
+ Practical paths if you need rendered HTML:
204
+
205
+ - **Pre-render upstream**: many SPA hosts can pre-render for crawlers
206
+ (`?_escaped_fragment_=`, prerender.io, Cloudflare's HTML Rewriter,
207
+ Vercel/Netlify ISR). Cheapest if available.
208
+ - **Headless browser layer**: drive Playwright / Puppeteer / Selenium
209
+ from Ruby (ferrum, playwright-ruby-client). Have it spit out
210
+ rendered HTML, then hand that to Scrapetor for fast extract.
211
+ - **Per-site API mining**: most JS-heavy apps load data from a JSON
212
+ API that Scrapetor can hit directly via `Fetcher.get`.
213
+
214
+ ### No TLS fingerprint impersonation
215
+
216
+ Scrapetor's HTTP layer is plain libcurl. Sites that fingerprint TLS
217
+ handshakes (Cloudflare, Akamai, DataDome, Imperva) will identify the
218
+ client as libcurl and may block / challenge accordingly. Scrapetor
219
+ won't impersonate Chrome's JA3, JA4, HTTP/2 SETTINGS frame order, or
220
+ header capitalisation.
221
+
222
+ If you need impersonation:
223
+
224
+ - **[curl-impersonate](https://github.com/lwthiker/curl-impersonate)**
225
+ is a fork of libcurl patched to match Chrome / Firefox / Edge
226
+ fingerprints exactly. You can build it locally and Scrapetor
227
+ will link against it transparently — the gem's HTTP options
228
+ are unchanged.
229
+ - **Reach the JSON API directly** with browser-mimicking headers
230
+ (`Accept`, `User-Agent`, `Sec-*`). Many sites only fingerprint
231
+ the HTML route; the API is more permissive.
232
+ - **Use a residential / mobile proxy** with a real browser at the
233
+ other end. Scrapetor's `:proxy` + `:proxy_auth` options handle
234
+ the proxy plumbing; the impersonation happens upstream.
235
+
236
+ The HTTP layer DOES support the rest of the production-scraping
237
+ surface: HTTP/2 multiplexing, retries with full-jitter backoff,
238
+ per-host throttle, cookie jar + auth, ETag cache, bulk revalidation,
239
+ multi-handle concurrency. Treat fingerprint impersonation as the
240
+ one externality you may need to bring yourself.
241
+
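The paragraph above mentions retries with full-jitter backoff, and the earlier `Fetcher.get` example passes `retry: 3, backoff: 0.3, max_backoff: 10`. A minimal sketch of that strategy follows, assuming the common definition where each delay is drawn uniformly from zero up to `min(cap, base * 2**attempt)`. `full_jitter_delay` and `with_retries` are hypothetical helpers, not Scrapetor's API, and how the gem maps its options onto the formula is an assumption.

```ruby
# Hypothetical helper: full-jitter delay for a zero-based retry attempt.
# The upper bound doubles each attempt and is clamped to `cap`.
def full_jitter_delay(attempt, base: 0.3, cap: 10.0, rng: Random.new)
  ceiling = [cap, base * (2**attempt)].min
  rng.rand * ceiling
end

# Hypothetical helper: run the block, retrying up to `max_retries` times
# on StandardError and sleeping a jittered delay between attempts.
def with_retries(max_retries, base: 0.3, cap: 10.0, sleeper: ->(s) { sleep(s) })
  attempt = 0
  begin
    yield attempt
  rescue StandardError
    raise if attempt >= max_retries  # out of retries: re-raise the last error

    sleeper.call(full_jitter_delay(attempt, base: base, cap: cap))
    attempt += 1
    retry
  end
end

result = with_retries(3, sleeper: ->(_s) {}) do |attempt|
  raise IOError, "transient failure" if attempt.zero?
  "ok after #{attempt} retry"
end
```

Randomising the whole interval, rather than adding a small jitter to a fixed delay, spreads retries from many clients so they do not hit a recovering server at the same moment.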
242
+ ### XPath 1.0: full expression language
243
+
244
+ `Document#xpath` / `Node#xpath` evaluate the complete XPath 1.0 grammar:
245
+ all 13 axes (`child`, `descendant`, `descendant-or-self`, `parent`,
246
+ `self`, `following-sibling`, `preceding-sibling`, `following`,
247
+ `preceding`, `ancestor`, `ancestor-or-self`, `attribute`, `namespace`),
248
+ all node tests (`node()`, `text()`, `comment()`, `processing-instruction()`,
249
+ named, `*`, qualified-with-prefix), every operator (`=`, `!=`, `<`,
250
+ `<=`, `>`, `>=`, `+`, `-`, `*`, `div`, `mod`, `and`, `or`, `|`), and
251
+ the full standard function library (`not`, `last`, `position`, `count`,
252
+ `local-name`, `name`, `string`, `concat`, `starts-with`, `contains`,
253
+ `substring`, `substring-before`, `substring-after`, `string-length`,
254
+ `normalize-space`, `translate`, `boolean`, `true`, `false`, `lang`,
255
+ `number`, `sum`, `floor`, `ceiling`, `round`, `id`).
256
+
257
+ Compiled ASTs cache per unique expression string. The compiler also
258
+ detects expressions that map cleanly onto the native CSS chain
259
+ (`//div[@class='x']`, `//ul/li[1]`, `//dt/following-sibling::dd`,
260
+ `//div[position() > 50]`, etc.) and dispatches them directly to the
261
+ arena's C selector matcher — same hot path the rest of the library
262
+ rides. Sibling, ancestor, and following/preceding axis walks all run
263
+ through dedicated C primitives over the DFS-range-encoded arena, so
264
+ no Ruby per-step traversal is involved.
265
+
266
+ ### HTTP/3 and WebSocket: capability-detected
267
+
268
+ `Scrapetor::Fetcher.features` reports whether the linked libcurl
269
+ exposes HTTP/3 and WebSocket support. Pass `http_version: "3"` to
270
+ opt into HTTP/3 when available; otherwise the default is HTTP/2
271
+ over TLS with HTTP/1.1 fallback. WebSocket frames go through
272
+ libcurl 7.86+'s `curl_ws_send` / `curl_ws_recv`; the gem doesn't
273
+ yet ship a friendly Ruby API for these — patches welcome.
274
+
275
+ ## Command-line interface
276
+
277
+ ```
278
+ $ scrapetor extract page.html --schema schema.rb
279
+ $ scrapetor info page.html
280
+ $ scrapetor jsonld page.html
281
+ $ scrapetor opengraph page.html
282
+ $ scrapetor microdata page.html
283
+ $ scrapetor page-type page.html
284
+ ```
285
+
286
+ ## Performance
287
+
288
+ Benchmarks live in `benchmark/comprehensive.rb`. Every workload asserts
289
+ that all three engines produce equivalent output before timing. Numbers
290
+ below were measured on Apple Silicon (Ruby 2.7.8); they're reproducible
291
+ from this repository.
292
+
293
+ **Parse throughput (build the DOM tree, MB/s).**
294
+
295
+ | Document | Scrapetor | Nokolexbor | Nokogiri |
296
+ |-----------------|----------:|-----------:|---------:|
297
+ | small (170 B) | 37 | 18 | 11 |
298
+ | article (2 KB) | 279 | 116 | 31 |
299
+ | product (3 KB) | 426 | 140 | 31 |
300
+ | listing (36 KB) | 3,933 | 154 | 31 |
301
+ | large (2.5 MB) | 18,434 | 136 | 31 |
302
+
303
+ **CSS selector evaluation (one selector against a pre-parsed document, iter/sec).**
304
+
305
+ | Selector | Scrapetor | Nokolexbor | Speedup |
306
+ |-----------------------------------------|----------:|-----------:|----------:|
307
+ | `#main` (single id) | 1,272,698 | 65,170 | 19.53x |
308
+ | `article` (tag) | 1,244,279 | 65,122 | 19.11x |
309
+ | `.product-card` (class) | 1,226,004 | 68,065 | 18.01x |
310
+ | `#main article` (id descendant) | 1,086,604 | 35,228 | 30.85x |
311
+ | `img.product-image` (tag.class) | 875,901 | 65,707 | 13.33x |
312
+ | `.product-grid > .product-card` (child) | 754,486 | 60,924 | 12.38x |
313
+ | `[data-sku="SKU0001"]` (attr) | 516,157 | 78,101 | 6.61x |
314
+ | `.product-card .price` (descendant) | 405,062 | 42,870 | 9.45x |
315
+
316
+ Pseudo-classes (`:has`, `:not`, `:is`, `:nth-child`, `:first-child`,
317
+ `:last-child`, `:nth-of-type`, etc.) and pseudo-elements (`::text`,
318
+ `::attr(name)`) run natively in the same C engine — see the
319
+ [Selector support](#selector-support) section below for the full list.
320
+
321
+ **End-to-end extraction (parse plus run an extraction schema, iter/sec).**
322
+
323
+ | Workload | Scrapetor | Nokolexbor | Nokogiri |
324
+ |-----------------------------------|----------:|-----------:|---------:|
325
+ | listing (50 cards x 4 fields) | 9,360 | 573 | 171 |
326
+ | product detail (top + 3 reviews) | 30,101 | 11,636 | 2,022 |
327
+ | article (top + tags + sections) | 53,837 | 25,553 | 6,338 |
328
+
329
+ **Allocations per extraction call (live Ruby objects, lower is better).**
330
+
331
+ | Workload | Scrapetor | Nokolexbor | Nokogiri |
332
+ |----------------------------------|----------:|-----------:|---------:|
333
+ | listing (50 cards x 4 fields) | 363 | 4,710 | 9,501 |
334
+ | product detail (top + 3 reviews) | 96 | 140 | 596 |
335
+
336
+ The full report - including the article workload, selector micro-benchmarks
337
+ for every supported selector form, and per-document MB/s figures - is
338
+ written to `benchmark/RESULTS.md` whenever you run `ruby -Ilib
339
+ benchmark/comprehensive.rb`.
340
+
341
+ ## Architecture
342
+
343
+ ```
344
+ HTML in ─┬─► Native streaming extract (C)
345
+ │ `doc.extract(schema)`
346
+ │ Schema runs during tokenisation. No DOM is built.
347
+ │ One Ruby/C boundary crossing per document.
348
+
349
+ └─► Native arena DOM (C)
350
+ `doc.css(...)`, `doc.at(...)`, mutation, traversal.
351
+ Class/id/tag indexes built during the parse pass.
352
+ Zero-copy text and attribute spans into the input buffer.
353
+ ```
354
+
355
+ The same C extension (`ext/scrapetor/native/`) provides both paths. A
356
+ pure-Ruby DOM is kept as a fallback for environments where the
357
+ extension can't be loaded.
358
+
359
+ ## Selector support
360
+
361
+ The native engine runs the following CSS forms in C:
362
+
363
+ - tag, `.class`, `tag.class`, `#id`, `tag#id`, universal `*`
364
+ - `[attr]`, `[attr=value]` and the `*=`, `^=`, `$=`, `~=`, `|=` variants
365
+ - descendant (`A B`) and child (`A > B`) combinators
366
+ - structural pseudo-classes: `:first-child`, `:last-child`, `:only-child`,
367
+ `:first-of-type`, `:last-of-type`, `:only-of-type`, `:nth-child(...)`,
368
+ `:nth-last-child(...)`, `:nth-of-type(...)`, `:nth-last-of-type(...)`,
369
+ `:empty`, `:root`, `:scope`
370
+ - boolean-attribute pseudos: `:checked`, `:disabled`, `:enabled`,
371
+ `:required`, `:optional`, `:read-only`, `:read-write`, `:any-link`,
372
+ `:link`
373
+ - logical pseudos: `:not(...)`, `:is(...)`, `:matches(...)`, `:where(...)`,
374
+ `:has(...)` (each runs natively when its inner selector is a single
375
+ atom — typically a class, id, tag, or attribute predicate)
376
+ - Scrapy/Parsel-style pseudo-elements: `::text` and `::attr(name)` —
377
+ the engine emits strings directly via a bulk C path so a 100-item
378
+ result is one boundary crossing, not 100
379
+
380
+ Sibling combinators (`+`, `~`) and inner selectors more complex than a
381
+ single atom — for example `:not(div > .x)` or `:has(.x .y)` — transparently
382
+ fall back to a pure-Ruby implementation that mirrors Nokogiri's output.
383
+ `Selector.compile` never raises on a syntactically valid CSS selector;
384
+ the fallback is automatic.
385
+
386
+ ## API reference
387
+
388
+ See the project documentation at [scrapetor.org/docs](https://scrapetor.org/docs).
389
+ The main entry points are:
390
+
391
+ - `Scrapetor.parse(html, base_url:)` - returns a `Scrapetor::Document`
392
+ - `Scrapetor::HTML(html, base_url)` - same, Nokogiri-style alias
393
+ - `Scrapetor.schema { … }` - schema DSL
394
+ - `Scrapetor.extract(html, schema)` - parse and extract
395
+ - `Scrapetor.fetch(url)` - HTTP GET and parse
396
+ - `Scrapetor.fetch_extract(url, schema)`
397
+ - `Scrapetor::Builder.build { |b| b.html { … } }`
398
+ - `Scrapetor::SAX::Parser.new(handler).parse(html)`
399
+
400
+ Document and Node expose the standard read/mutate API: `css`, `at_css`,
401
+ `xpath`, `text`, `content`, `inner_html`, `outer_html`, `[]`, `[]=`,
402
+ `attributes`, `children`, `parent`, `add_child`, `before`, `after`,
403
+ `replace`, `remove`, `add_class`, `remove_class`, and so on.
404
+
405
+ ## Compatibility
406
+
407
+ Scrapetor is tested on Ruby 2.7, 3.0, 3.1, 3.2, and 3.3 on Linux and
408
+ macOS (see `.github/workflows/ci.yml`). The native extension uses only
409
+ the stable public Ruby C API.
410
+
411
+ ## Development
412
+
413
+ ```
414
+ git clone https://github.com/Alaa-abdulridha/scrapetor
415
+ cd scrapetor
416
+ bundle install
417
+ rake compile
418
+ rake test
419
+ ```
420
+
421
+ To run the benchmarks:
422
+
423
+ ```
424
+ ruby -Ilib benchmark/comprehensive.rb # full report → benchmark/RESULTS.md
425
+ ruby -Ilib benchmark/parse_extract.rb # listing workload only
426
+ ruby -Ilib benchmark/product_detail.rb # product detail workload only
427
+ ```
428
+
429
+ ## Contributing
430
+
431
+ Issues, bug reports, and pull requests are welcome on GitHub at
432
+ <https://github.com/Alaa-abdulridha/scrapetor>. Please read
433
+ [`CONTRIBUTING.md`](CONTRIBUTING.md) before submitting a pull request.
434
+
435
+ To report a security vulnerability, follow the process in
436
+ [`SECURITY.md`](SECURITY.md).
437
+
438
+ ## License
439
+
440
+ Scrapetor is released under the MIT License. See [`LICENSE`](LICENSE).
data/bin/scrapetor ADDED
@@ -0,0 +1,190 @@
1
+ #!/usr/bin/env ruby
2
+ # frozen_string_literal: true
3
+ #
4
+ # Scrapetor CLI.
5
+ #
6
+ # Subcommands:
7
+ # scrapetor extract FILE --schema SCHEMA.rb [--base-url URL] [--format json|yaml|pp]
8
+ # scrapetor info FILE # page-type, structured data, encoding
9
+ # scrapetor microdata FILE # microdata + RDFa
10
+ # scrapetor jsonld FILE # JSON-LD only
11
+ # scrapetor opengraph FILE # OG + Twitter Card
12
+ # scrapetor schema-org FILE [--type TYPE]
13
+ # scrapetor sax FILE # stream SAX events to stdout
14
+ # scrapetor version
15
+
16
+ $LOAD_PATH.unshift File.expand_path("../lib", __dir__)
17
+ require "scrapetor"
18
+ require "json"
19
+ require "optparse"
20
+
21
+ class ScrapetorCLI
22
+ def initialize(argv)
23
+ @argv = argv
24
+ end
25
+
26
+ def run
27
+ cmd = @argv.shift
28
+ case cmd
29
+ when nil, "-h", "--help", "help" then print_help
30
+ when "version", "-v", "--version" then puts Scrapetor::VERSION
31
+ when "extract" then cmd_extract
32
+ when "info" then cmd_info
33
+ when "microdata" then cmd_microdata
34
+ when "rdfa" then cmd_rdfa
35
+ when "jsonld", "json-ld" then cmd_jsonld
36
+ when "opengraph", "og" then cmd_opengraph
37
+ when "schema-org", "schema_org" then cmd_schema_org
38
+ when "sax" then cmd_sax
39
+ when "page-type", "page_type" then cmd_page_type
40
+ when "encoding" then cmd_encoding
41
+ else
42
+ warn "unknown command: #{cmd}"
43
+ print_help
44
+ exit 1
45
+ end
46
+ end
47
+
48
+ private
49
+
50
+ def print_help
51
+ puts <<~HELP
52
+ Scrapetor #{Scrapetor::VERSION}
53
+
54
+ Usage: scrapetor SUBCOMMAND [args]
55
+
56
+ Subcommands:
57
+ extract FILE --schema FILE Run an extraction schema against an HTML file.
58
+ [--base-url URL] [--format json|yaml|pp]
59
+ info FILE Print page-type, encoding, OG, JSON-LD summary.
60
+ page-type FILE Print the detected page type.
61
+ encoding FILE Print the detected encoding.
62
+ jsonld FILE Print parsed JSON-LD as JSON.
63
+ opengraph FILE Print OpenGraph + Twitter Card metadata.
64
+ schema-org FILE [--type T] Print Schema.org typed items (filter optional).
65
+ microdata FILE Print HTML5 microdata.
66
+ rdfa FILE Print RDFa typed data.
67
+ sax FILE Stream SAX events to stdout.
68
+ version Print the gem version.
69
+
70
+ Examples:
71
+ scrapetor extract page.html --schema product_schema.rb
72
+ scrapetor info page.html
73
+ cat page.html | scrapetor jsonld -
74
+ HELP
75
+ end
76
+
77
+ def read_input(path)
78
+ return $stdin.read if path == "-"
79
+ File.read(path)
80
+ end
81
+
82
+ def format_output(value, fmt)
83
+ case fmt
84
+ when "json" then puts JSON.pretty_generate(value)
85
+ when "pp" then require "pp"; pp value
86
+ when "yaml" then require "yaml"; puts value.to_yaml
87
+ else puts JSON.pretty_generate(value)
88
+ end
89
+ end
90
+
91
+ def cmd_extract
92
+ schema_path = nil
93
+ base_url = nil
94
+ fmt = "json"
95
+ OptionParser.new do |o|
96
+ o.on("--schema FILE") { |v| schema_path = v }
97
+ o.on("--base-url URL") { |v| base_url = v }
98
+ o.on("--format FMT") { |v| fmt = v }
99
+ end.parse!(@argv)
100
+ file = @argv.shift || abort("missing FILE")
101
+ abort("missing --schema") unless schema_path
102
+
103
+ html = read_input(file)
104
+ schema = load_schema(schema_path)
105
+ result = Scrapetor.parse(html, base_url: base_url).extract(schema)
106
+ format_output(result, fmt)
107
+ end
108
+
109
+ def load_schema(path)
110
+ src = File.read(path)
111
+ # The schema file is evaluated in this method's binding. It MUST return
112
+ # a Scrapetor::Schema. The simplest convention: the file's last
113
+ # expression is `Scrapetor.schema do ... end`.
114
+ schema = eval(src, binding, path) # rubocop:disable Security/Eval
115
+ abort("schema file did not return a Scrapetor::Schema") unless schema.is_a?(Scrapetor::Schema)
116
+ schema
117
+ end
118
+
119
+ def cmd_info
120
+ file = @argv.shift || abort("missing FILE")
121
+ html = read_input(file)
122
+ doc = Scrapetor.parse(html)
123
+ info = {
124
+ encoding: doc.encoding,
125
+ page_type: doc.page_type,
126
+ title: doc.title,
127
+ json_ld: doc.json_ld.size,
128
+ opengraph: doc.opengraph.size,
129
+ microdata: doc.microdata.size,
130
+ stats: doc.stats
131
+ }
132
+ puts JSON.pretty_generate(info)
133
+ end
134
+
135
+ def cmd_microdata
136
+ file = @argv.shift || abort("missing FILE")
137
+ puts JSON.pretty_generate(Scrapetor.parse(read_input(file)).microdata)
138
+ end
139
+
140
+ def cmd_rdfa
141
+ file = @argv.shift || abort("missing FILE")
142
+ puts JSON.pretty_generate(Scrapetor.parse(read_input(file)).rdfa)
143
+ end
144
+
145
+ def cmd_jsonld
146
+ file = @argv.shift || abort("missing FILE")
147
+ puts JSON.pretty_generate(Scrapetor.parse(read_input(file)).json_ld)
148
+ end
149
+
150
+ def cmd_opengraph
151
+ file = @argv.shift || abort("missing FILE")
152
+ doc = Scrapetor.parse(read_input(file))
153
+ puts JSON.pretty_generate(opengraph: doc.opengraph, twitter_card: doc.twitter_card)
154
+ end
155
+
156
+ def cmd_schema_org
157
+ type = nil
158
+ OptionParser.new do |o|
159
+ o.on("--type T") { |v| type = v }
160
+ end.parse!(@argv)
161
+ file = @argv.shift || abort("missing FILE")
162
+ puts JSON.pretty_generate(Scrapetor.parse(read_input(file)).schema_org(type: type))
163
+ end
164
+
165
+ def cmd_sax
166
+ file = @argv.shift || abort("missing FILE")
167
+ Scrapetor::SAX::Tokenizer.new(read_input(file)).each_event do |event|
168
+ type, *args = event
169
+ case type
170
+ when :start then puts "START #{args[0]} #{args[1].inspect}"
171
+ when :end then puts "END #{args[0]}"
172
+ when :text then puts "TEXT #{args[0].inspect}"
173
+ when :comment then puts "COMM #{args[0].inspect}"
174
+ when :doctype then puts "DOC #{args[0]}"
175
+ end
176
+ end
177
+ end
178
+
179
+ def cmd_page_type
180
+ file = @argv.shift || abort("missing FILE")
181
+ puts Scrapetor.parse(read_input(file)).page_type
182
+ end
183
+
184
+ def cmd_encoding
185
+ file = @argv.shift || abort("missing FILE")
186
+ puts Scrapetor.parse(read_input(file)).encoding
187
+ end
188
+ end
189
+
190
+ ScrapetorCLI.new(ARGV).run
data/bin/scrapetor-bench ADDED
@@ -0,0 +1,5 @@
1
+ #!/usr/bin/env ruby
2
+ # frozen_string_literal: true
3
+
4
+ $LOAD_PATH.unshift File.expand_path("../lib", __dir__)
5
+ load File.expand_path("../benchmark/parse_extract.rb", __dir__)
data/ext/scrapetor/README.md ADDED
@@ -0,0 +1,53 @@
1
+ # Native extension
2
+
3
+ This directory contains the C source for Scrapetor's native extension.
4
+ It builds via `mkmf` and is bundled into `scrapetor_native.bundle`
5
+ (macOS) or `scrapetor_native.so` (Linux). The Ruby side loads it from
6
+ `lib/scrapetor/native.rb`.
7
+
8
+ ```
9
+ ext/scrapetor/native/
10
+ ├── extconf.rb mkmf build script
11
+ ├── scrapetor_native.c streaming extraction engine
12
+ └── scrapetor_dom.c arena DOM with class/id/tag indexes
13
+ ```
14
+
15
+ ## Building
16
+
17
+ The extension is built automatically when the gem is installed. For
18
+ local development:
19
+
20
+ ```
21
+ rake compile
22
+ ```
23
+
24
+ This produces `lib/scrapetor/scrapetor_native.{bundle,so}`. To clean:
25
+
26
+ ```
27
+ rake clean
28
+ ```
29
+
30
+ ## Requirements
31
+
32
+ - A C99-capable compiler (`clang` on macOS, `gcc` on Linux)
33
+ - Ruby development headers (provided by your Ruby installation)
34
+
35
+ No external libraries are required. libcurl is linked only when it is available at build time, for the optional HTTP layer.
36
+
37
+ ## Two execution paths
38
+
39
+ The extension exposes two distinct paths to Ruby:
40
+
41
+ 1. **`Scrapetor::Native.extract(html, descriptor, base_url)`** —
42
+ the streaming extraction engine. The schema descriptor is consumed
43
+ inline with the HTML tokenisation; no DOM is materialised.
44
+ Implemented in `scrapetor_native.c`.
45
+
46
+ 2. **`Scrapetor::Native::Document.parse(html)`** — the arena DOM. One
47
+ forward pass over the HTML builds a tree of fixed-width nodes,
48
+ class / id / tag indexes are populated during the same pass.
49
+ Implemented in `scrapetor_dom.c`. Exposes node-id-based accessors
50
+ that the lazy `Scrapetor::Native::Element` Ruby wrapper consumes.
51
+
52
+ Both paths share `enc_utf8` and a handful of helpers but otherwise
53
+ operate independently.