rubycrawl 0.1.4 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.md CHANGED
@@ -3,46 +3,46 @@
  [![Gem Version](https://badge.fury.io/rb/rubycrawl.svg)](https://rubygems.org/gems/rubycrawl)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
  [![Ruby](https://img.shields.io/badge/ruby-%3E%3D%203.0-red.svg)](https://www.ruby-lang.org/)
- [![Node.js](https://img.shields.io/badge/node.js-18%2B-green.svg)](https://nodejs.org/)
 
- **Production-ready web crawler for Ruby powered by Playwright** — Bringing the power of modern browser automation to the Ruby ecosystem with first-class Rails support.
+ **Production-ready web crawler for Ruby powered by Ferrum** — Full JavaScript rendering via Chrome DevTools Protocol, with first-class Rails support and no Node.js dependency.
 
- RubyCrawl provides **accurate, JavaScript-enabled web scraping** using Playwright's battle-tested browser automation, wrapped in a clean Ruby API. Perfect for extracting content from modern SPAs, dynamic websites, and building RAG knowledge bases.
+ RubyCrawl provides **accurate, JavaScript-enabled web scraping** using a pure Ruby browser automation stack. Perfect for extracting content from modern SPAs, dynamic websites, and building RAG knowledge bases.
 
  **Why RubyCrawl?**
 
  - ✅ **Real browser** — Handles JavaScript, AJAX, and SPAs correctly
- - ✅ **Zero config** — Works out of the box, no Playwright knowledge needed
+ - ✅ **Pure Ruby** — No Node.js, no npm, no external processes to manage
+ - ✅ **Zero config** — Works out of the box, no Ferrum knowledge needed
  - ✅ **Production-ready** — Auto-retry, error handling, resource optimization
  - ✅ **Multi-page crawling** — BFS algorithm with smart URL deduplication
  - ✅ **Rails-friendly** — Generators, initializers, and ActiveJob integration
- - ✅ **Modular architecture** — Clean, testable, maintainable codebase
+ - ✅ **Readability-powered** — Mozilla Readability.js for article-quality extraction, heuristic fallback for all other pages
 
  ```ruby
  # One line to crawl any JavaScript-heavy site
  result = RubyCrawl.crawl("https://docs.example.com")
 
  result.html # Full HTML with JS rendered
- result.links # All links with metadata
+ result.clean_text # Noise-stripped plain text (no nav/footer/ads)
+ result.clean_markdown # Markdown ready for RAG pipelines
+ result.links # All links with url, text, title, rel
  result.metadata # Title, description, OG tags, etc.
  ```
 
  ## Features
 
- - **🎭 Playwright-powered**: Real browser automation for JavaScript-heavy sites and SPAs
- - **🚀 Production-ready**: Designed for Rails apps and production environments with auto-retry and error handling
- - **🎯 Simple API**: Clean, minimal Ruby interface — zero Playwright or Node.js knowledge required
- - **⚡ Resource optimization**: Built-in resource blocking for 2-3x faster crawls
- - **🔄 Auto-managed browsers**: Browser process reuse and automatic lifecycle management
- - **📄 Content extraction**: HTML, plain text, links (with metadata), and **clean markdown** via HTML conversion
- - **🌐 Multi-page crawling**: BFS (breadth-first search) crawler with configurable depth limits and URL deduplication
- - **🛡️ Smart URL handling**: Automatic normalization, tracking parameter removal, and same-host filtering
- - **🔧 Rails integration**: First-class Rails support with generators and initializers
- - **💎 Modular design**: Clean separation of concerns with focused, testable modules
+ - **Pure Ruby**: Ferrum drives Chromium directly via CDP; no Node.js or npm required
+ - **Production-ready**: Designed for Rails apps with auto-retry and exponential backoff
+ - **Simple API**: Clean Ruby interface — zero Ferrum or CDP knowledge required
+ - **Resource optimization**: Built-in resource blocking for 2-3x faster crawls
+ - **Auto-managed browsers**: Lazy Chrome singleton, isolated page per crawl
+ - **Content extraction**: Mozilla Readability.js (primary) with a link-density heuristic (fallback) for article-quality `clean_html`, `clean_text`, `clean_markdown`, links, and metadata
+ - **Multi-page crawling**: BFS crawler with configurable depth limits and URL deduplication
+ - **Smart URL handling**: Automatic normalization, tracking parameter removal, same-host filtering
+ - **Rails integration**: First-class Rails support with generators and initializers
 
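The tracking-parameter removal and normalization listed under smart URL handling can be sketched in plain Ruby. This is an illustrative sketch, not the gem's actual implementation, and the parameter list is a hypothetical subset:

```ruby
require "uri"

# Hypothetical sketch of tracking-parameter stripping and URL
# normalization; not RubyCrawl's actual source.
TRACKING_PARAMS = %w[utm_source utm_medium utm_campaign utm_term utm_content fbclid gclid].freeze

def normalize_url(url)
  uri = URI.parse(url)
  uri.fragment = nil # "#section" never names a different page
  if uri.query
    kept = URI.decode_www_form(uri.query).reject { |k, _| TRACKING_PARAMS.include?(k) }
    uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  end
  uri.host = uri.host.downcase
  uri.to_s
end

normalize_url("https://Example.com/docs?utm_source=x&page=2#intro")
# => "https://example.com/docs?page=2"
```

Two URLs that normalize to the same string can then be deduplicated with an ordinary `Set`.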
  ## Table of Contents
 
- - [Features](#features)
  - [Installation](#installation)
  - [Quick Start](#quick-start)
  - [Use Cases](#use-cases)
@@ -57,18 +57,15 @@ result.metadata # Title, description, OG tags, etc.
  - [Architecture](#architecture)
  - [Performance](#performance)
  - [Development](#development)
- - [Project Structure](#project-structure)
  - [Contributing](#contributing)
- - [Why Choose RubyCrawl?](#why-choose-rubycrawl)
  - [License](#license)
- - [Support](#support)
 
  ## Installation
 
  ### Requirements
 
  - **Ruby** >= 3.0
- - **Node.js** LTS (v18+ recommended) required for the bundled Playwright service
+ - **Chrome or Chromium**, managed automatically by Ferrum (downloaded on first use)
 
  ### Add to Gemfile
 
@@ -82,9 +79,9 @@ Then install:
  bundle install
  ```
 
- ### Install Playwright browsers
+ ### Install Chrome
 
- After bundling, install the Playwright browsers:
+ Ferrum manages Chrome automatically. Run the install task to verify Chrome is available and generate a Rails initializer:
 
  ```bash
  bundle exec rake rubycrawl:install
@@ -92,24 +89,10 @@ bundle exec rake rubycrawl:install
 
  This command:
 
- - ✅ Installs Node.js dependencies in the bundled `node/` directory
- - ✅ Downloads Playwright browsers (Chromium, Firefox, WebKit) — ~300MB download
+ - ✅ Checks for Chrome/Chromium in your PATH
  - ✅ Creates a Rails initializer (if using Rails)
 
- **Note:** You only need to run this once. The installation task is idempotent and safe to run multiple times.
-
- **Troubleshooting installation:**
-
- ```bash
- # If installation fails, check Node.js version
- node --version # Should be v18+ LTS
-
- # Enable verbose logging
- RUBYCRAWL_NODE_LOG=/tmp/rubycrawl.log bundle exec rake rubycrawl:install
-
- # Check installation status
- cd node && npm list
- ```
+ **Note:** If Chrome is not in your PATH, install it via your system package manager or download from [google.com/chrome](https://www.google.com/chrome/).
 
  ## Quick Start
 
@@ -120,37 +103,38 @@ require "rubycrawl"
  result = RubyCrawl.crawl("https://example.com")
 
  # Access extracted content
- result.final_url # Final URL after redirects
- result.text # Plain text content (via innerText)
- result.html # Raw HTML content
- result.links # Extracted links with metadata
- result.metadata # Title, description, OG tags, etc.
+ result.final_url # Final URL after redirects
+ result.clean_text # Noise-stripped plain text (no nav/footer/ads)
+ result.clean_html # Noise-stripped HTML (same noise removed as clean_text)
+ result.raw_text # Full body.innerText (unfiltered)
+ result.html # Full raw HTML content
+ result.links # Extracted links with url, text, title, rel
+ result.metadata # Title, description, OG tags, etc.
+ result.metadata['extractor'] # "readability" or "heuristic" — which extractor ran
+ result.clean_markdown # Markdown converted from clean_html (lazy — first access only)
  ```
 
  ## Use Cases
 
  RubyCrawl is perfect for:
 
- - **📊 Data aggregation**: Crawl product catalogs, job listings, or news articles
- - **🤖 RAG applications**: Build knowledge bases for LLM/AI applications by crawling documentation sites
- - **🔍 SEO analysis**: Extract metadata, links, and content structure
- - **📱 Content migration**: Convert existing sites to Markdown for static site generators
- - **🧪 Testing**: Verify deployed site structure and content
- - **📚 Documentation scraping**: Create local copies of documentation with preserved links
+ - **RAG applications**: Build knowledge bases for LLM/AI applications by crawling documentation sites
+ - **Data aggregation**: Crawl product catalogs, job listings, or news articles
+ - **SEO analysis**: Extract metadata, links, and content structure
+ - **Content migration**: Convert existing sites to Markdown for static site generators
+ - **Documentation scraping**: Create local copies of documentation with preserved links
 
  ## Usage
 
  ### Basic Crawling
 
- The simplest way to crawl a URL:
-
  ```ruby
  result = RubyCrawl.crawl("https://example.com")
 
- # Access the results
- result.html # => "<html>...</html>"
- result.text # => "Example Domain\nThis domain is..." (plain text via innerText)
- result.metadata # => { "status" => 200, "final_url" => "https://example.com" }
+ result.html # => "<html>...</html>"
+ result.clean_text # => "Example Domain\n\nThis domain is..." (no nav/ads)
+ result.raw_text # => "Example Domain\nThis domain is..." (full body text)
+ result.metadata # => { "final_url" => "https://example.com", "title" => "..." }
  ```
 
  ### Multi-Page Crawling
@@ -165,10 +149,10 @@ RubyCrawl.crawl_site("https://example.com", max_pages: 100, max_depth: 3) do |pa
 
    # Save to database
    Page.create!(
-     url: page.url,
-     html: page.html,
+     url:      page.url,
+     html:     page.html,
      markdown: page.clean_markdown,
-     depth: page.depth
+     depth:    page.depth
    )
  end
  ```
@@ -176,7 +160,6 @@ end
  **Real-world example: Building a RAG knowledge base**
 
  ```ruby
- # Crawl documentation site for AI/RAG application
  require "rubycrawl"
 
  RubyCrawl.configure(
@@ -190,21 +173,18 @@ pages_crawled = RubyCrawl.crawl_site(
    max_depth: 5,
    same_host_only: true
  ) do |page|
-   # Store in vector database for RAG
    VectorDB.upsert(
-     id: Digest::SHA256.hexdigest(page.url),
-     content: page.clean_markdown, # Clean markdown for better embeddings
+     id:      Digest::SHA256.hexdigest(page.url),
+     content: page.clean_markdown,
      metadata: {
-       url: page.url,
+       url:   page.url,
        title: page.metadata["title"],
        depth: page.depth
      }
    )
-
-   puts "✓ Indexed: #{page.metadata['title']} (#{page.depth} levels deep)"
  end
 
- puts "Crawled #{pages_crawled} pages into knowledge base"
+ puts "Indexed #{pages_crawled} pages"
  ```
 
  #### Multi-Page Options
@@ -223,10 +203,13 @@ The block receives a `PageResult` with:
 
  ```ruby
  page.url # String: Final URL after redirects
- page.html # String: Full HTML content
- page.clean_markdown # String: Lazy-converted Markdown
+ page.html # String: Full raw HTML content
+ page.clean_html # String: Noise-stripped HTML (no nav/header/footer/ads)
+ page.clean_text # String: Noise-stripped plain text (derived from clean_html)
+ page.raw_text # String: Full body.innerText (unfiltered)
+ page.clean_markdown # String: Lazy-converted Markdown from clean_html
  page.links # Array: URLs extracted from page
- page.metadata # Hash: HTTP status, final URL, etc.
+ page.metadata # Hash: final_url, title, OG tags, etc.
  page.depth # Integer: Link depth from start URL
  ```
 
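The BFS traversal with URL deduplication that `crawl_site` performs can be sketched in plain Ruby. Illustrative only: `links_for` is a hypothetical stand-in for fetching a page and extracting its links:

```ruby
require "set"

# BFS crawl sketch with a visited set for URL deduplication.
# Not the gem's actual source; links_for is a hypothetical callable.
def bfs_crawl(start_url, max_depth:, links_for:)
  visited = Set.new([start_url])
  queue = [[start_url, 0]]
  order = []
  until queue.empty?
    url, depth = queue.shift
    order << [url, depth]
    next if depth >= max_depth # stop expanding at the depth limit
    links_for.call(url).each do |link|
      # Set#add? returns nil when the URL was already seen
      queue << [link, depth + 1] if visited.add?(link)
    end
  end
  order
end

graph = { "a" => ["b", "c"], "b" => ["c", "d"], "c" => [], "d" => [] }
bfs_crawl("a", max_depth: 2, links_for: ->(u) { graph.fetch(u, []) })
# => [["a", 0], ["b", 1], ["c", 1], ["d", 2]]
```

Note that "c" is visited only once even though both "a" and "b" link to it; that is the deduplication at work.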
@@ -234,12 +217,12 @@ page.depth # Integer: Link depth from start URL
 
  #### Global Configuration
 
- Set default options that apply to all crawls:
-
  ```ruby
  RubyCrawl.configure(
-   wait_until: "networkidle", # Wait until network is idle
-   block_resources: true # Block images, fonts, CSS for speed
+   wait_until: "networkidle",
+   block_resources: true,
+   timeout: 60,
+   headless: true
  )
 
  # All subsequent crawls use these defaults
@@ -248,8 +231,6 @@ result = RubyCrawl.crawl("https://example.com")
 
  #### Per-Request Options
 
- Override defaults for specific requests:
-
  ```ruby
  # Use global defaults
  result = RubyCrawl.crawl("https://example.com")
@@ -257,192 +238,132 @@ result = RubyCrawl.crawl("https://example.com")
  # Override for this request only
  result = RubyCrawl.crawl(
    "https://example.com",
-   wait_until: "domcontentloaded",
+   wait_until:      "domcontentloaded",
    block_resources: false
  )
  ```
 
  #### Configuration Options
 
- | Option | Values | Default | Description |
- | ----------------- | ---------------------------------------------------------------------- | -------- | ------------------------------------------------- |
- | `wait_until` | `"load"`, `"domcontentloaded"`, `"networkidle"`, `"commit"` | `"load"` | When to consider page loaded |
- | `block_resources` | `true`, `false` | `true` | Block images, fonts, CSS, media for faster crawls |
- | `max_attempts` | Integer | `3` | Total number of attempts (including the first) |
+ | Option | Values | Default | Description |
+ | ----------------- | ----------------------------------------------------------- | ------- | --------------------------------------------------- |
+ | `wait_until` | `"load"`, `"domcontentloaded"`, `"networkidle"`, `"commit"` | `nil` | When to consider page loaded (nil = Ferrum default) |
+ | `block_resources` | `true`, `false` | `nil` | Block images, fonts, CSS, media for faster crawls |
+ | `max_attempts` | Integer | `3` | Total number of attempts (including the first) |
+ | `timeout` | Integer (seconds) | `30` | Browser navigation timeout |
+ | `headless` | `true`, `false` | `true` | Run Chrome headlessly |
 
  **Wait strategies explained:**
 
- - `load` — Wait for the load event (fastest, good for static sites)
- - `domcontentloaded` — Wait for DOM ready (medium speed)
- - `networkidle` — Wait until no network requests for 500ms (slowest, best for SPAs)
- - `commit` — Wait until the first response bytes are received (fastest possible)
-
- ### Advanced Usage
-
- #### Session-Based Crawling
-
- Sessions allow reusing browser contexts for better performance when crawling multiple pages. They're automatically used by `crawl_site`, but you can manage them manually for advanced use cases:
-
- ```ruby
- # Create a session (reusable browser context)
- session_id = RubyCrawl.create_session
-
- begin
-   # All crawls with this session_id share the same browser context
-   result1 = RubyCrawl.crawl("https://example.com/page1", session_id: session_id)
-   result2 = RubyCrawl.crawl("https://example.com/page2", session_id: session_id)
-   # Browser state (cookies, localStorage) persists between crawls
- ensure
-   # Always destroy session when done
-   RubyCrawl.destroy_session(session_id)
- end
- ```
-
- **When to use sessions:**
-
- - Multiple sequential crawls to the same domain (better performance)
- - Preserving cookies/state set by the site between page visits
- - Avoiding browser context creation overhead
-
- **Important:** Sessions are for **performance optimization only**. RubyCrawl is designed for crawling **public websites**. It does not provide authentication or login functionality for protected content.
-
- **Note:** `crawl_site` automatically creates and manages a session internally, so you don't need manual session management for multi-page crawling.
-
- **Session lifecycle:**
-
- - Sessions automatically expire after 30 minutes of inactivity
- - Sessions are cleaned up every 5 minutes
- - Always call `destroy_session` when done to free resources immediately
+ - `load` — Wait for the load event (good for static sites)
+ - `domcontentloaded` — Wait for DOM ready (faster)
+ - `networkidle` — Wait until no network requests for 500ms (best for SPAs)
+ - `commit` — Wait until the first response bytes are received (fastest)
 
  ### Result Object
 
- The crawl result is a `RubyCrawl::Result` object with these attributes:
-
  ```ruby
  result = RubyCrawl.crawl("https://example.com")
 
- result.html # String: Raw HTML content from page
- result.text # String: Plain text via document.body.innerText
- result.clean_markdown # String: Markdown conversion (lazy-loaded on first access)
- result.links # Array: Extracted links with url and text
- result.metadata # Hash: Comprehensive metadata (see below)
+ result.html # String: Full raw HTML
+ result.clean_html # String: Noise-stripped HTML (nav/header/footer/ads removed)
+ result.clean_text # String: Plain text derived from clean_html — ideal for RAG
+ result.raw_text # String: Full body.innerText (unfiltered)
+ result.clean_markdown # String: Markdown from clean_html (lazy — computed on first access)
+ result.links # Array: Extracted links with url/text/title/rel
+ result.metadata # Hash: See below
+ result.final_url # String: Shortcut for metadata['final_url']
  ```
 
  #### Links Format
 
- Links are extracted with full metadata:
-
  ```ruby
  result.links
  # => [
- #   {
- #     "url" => "https://example.com/about",
- #     "text" => "About Us",
- #     "title" => "Learn more about us", # <a title="...">
- #     "rel" => nil                      # <a rel="nofollow">
- #   },
- #   {
- #     "url" => "https://example.com/contact",
- #     "text" => "Contact",
- #     "title" => null,
- #     "rel" => "nofollow"
- #   },
- #   ...
+ #   { "url" => "https://example.com/about", "text" => "About", "title" => nil, "rel" => nil },
+ #   { "url" => "https://example.com/contact", "text" => "Contact", "title" => nil, "rel" => "nofollow" },
  # ]
  ```
 
- **Note:** URLs are automatically converted to absolute URLs by the browser, so relative links like `/about` become `https://example.com/about`.
+ URLs are automatically resolved to absolute form by the browser.
 
  #### Markdown Conversion
 
- Markdown is **lazy-loaded** — conversion only happens when you access `.clean_markdown`:
+ Markdown is **lazy** — conversion only happens on first access of `.clean_markdown`:
 
  ```ruby
- result = RubyCrawl.crawl(url)
- result.html # No overhead
- result.clean_markdown # ⬅️ Conversion happens here (first call only)
- result.clean_markdown # ✅ Cached, instant
+ result.clean_html # Already available, no overhead
+ result.clean_markdown # Converts clean_html → Markdown here (first call only)
+ result.clean_markdown # Cached, instant on subsequent calls
  ```
 
  Uses [reverse_markdown](https://github.com/xijo/reverse_markdown) with GitHub-flavored output.
 
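The lazy caching behaves like ordinary Ruby `||=` memoization; a minimal sketch, not the gem's actual source (`convert_to_markdown` is a hypothetical stand-in for the reverse_markdown call):

```ruby
# Minimal sketch of lazy, memoized Markdown conversion.
class Result
  def initialize(clean_html)
    @clean_html = clean_html
  end

  attr_reader :clean_html

  def clean_markdown
    # ||= runs the conversion once, then returns the cached string
    @clean_markdown ||= convert_to_markdown(clean_html)
  end

  private

  # Hypothetical stand-in for the real HTML-to-Markdown conversion
  def convert_to_markdown(html)
    html.gsub(%r{</?p>}, "\n").strip
  end
end

r = Result.new("<p>Hello</p>")
r.clean_markdown # => "Hello"
```

Because the result is cached on the instance, repeated access is free.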
  #### Metadata Fields
 
- The `metadata` hash includes HTTP and HTML metadata:
-
  ```ruby
  result.metadata
  # => {
- #   "status" => 200,                  # HTTP status code
- #   "final_url" => "https://...",     # Final URL after redirects
- #   "title" => "Page Title",          # <title> tag
- #   "description" => "...",           # Meta description
- #   "keywords" => "ruby, web",        # Meta keywords
- #   "author" => "Author Name",        # Meta author
- #   "og_title" => "...",              # Open Graph title
- #   "og_description" => "...",        # Open Graph description
- #   "og_image" => "https://...",      # Open Graph image
- #   "og_url" => "https://...",        # Open Graph URL
- #   "og_type" => "website",           # Open Graph type
- #   "twitter_card" => "summary",      # Twitter card type
- #   "twitter_title" => "...",         # Twitter title
- #   "twitter_description" => "...",   # Twitter description
- #   "twitter_image" => "https://...", # Twitter image
- #   "canonical" => "https://...",     # Canonical URL
- #   "lang" => "en",                   # Page language
- #   "charset" => "UTF-8"              # Character encoding
+ #   "final_url" => "https://example.com",
+ #   "title" => "Page Title",
+ #   "description" => "...",
+ #   "keywords" => "ruby, web",
+ #   "author" => "Author Name",
+ #   "og_title" => "...",
+ #   "og_description" => "...",
+ #   "og_image" => "https://...",
+ #   "og_url" => "https://...",
+ #   "og_type" => "website",
+ #   "twitter_card" => "summary",
+ #   "twitter_title" => "...",
+ #   "twitter_description" => "...",
+ #   "twitter_image" => "https://...",
+ #   "canonical" => "https://...",
+ #   "lang" => "en",
+ #   "charset" => "UTF-8",
+ #   "extractor" => "readability"      # or "heuristic"
  # }
  ```
 
- Note: All HTML metadata fields may be `null` if not present on the page.
-
  ### Error Handling
 
- RubyCrawl provides specific exception classes for different error scenarios:
-
  ```ruby
  begin
    result = RubyCrawl.crawl(url)
  rescue RubyCrawl::ConfigurationError => e
-   # Invalid URL or configuration
-   puts "Configuration error: #{e.message}"
+   # Invalid URL or option value
  rescue RubyCrawl::TimeoutError => e
-   # Page load timeout or network timeout
-   puts "Timeout: #{e.message}"
+   # Page load timed out
  rescue RubyCrawl::NavigationError => e
-   # Page navigation failed (404, DNS error, SSL error, etc.)
-   puts "Navigation failed: #{e.message}"
+   # Navigation failed (404, DNS error, SSL error)
  rescue RubyCrawl::ServiceError => e
-   # Node service unavailable or crashed
-   puts "Service error: #{e.message}"
+   # Browser failed to start or crashed
  rescue RubyCrawl::Error => e
    # Catch-all for any RubyCrawl error
-   puts "Crawl error: #{e.message}"
  end
  ```
 
  **Exception Hierarchy:**
 
- - `RubyCrawl::Error` (base class)
-   - `RubyCrawl::ConfigurationError` - Invalid URL or configuration
-   - `RubyCrawl::TimeoutError` - Timeout during crawl
-   - `RubyCrawl::NavigationError` - Page navigation failed
-   - `RubyCrawl::ServiceError` - Node service issues
+ ```
+ RubyCrawl::Error
+ ├── ConfigurationError — invalid URL or option value
+ ├── TimeoutError — page load timed out
+ ├── NavigationError — navigation failed (HTTP error, DNS, SSL)
+ └── ServiceError — browser failed to start or crashed
+ ```
 
- **Automatic Retry:** RubyCrawl automatically retries transient failures (service errors, timeouts) with exponential backoff. The default `max_attempts: 3` means 3 total attempts (2 retries). Configure with:
+ **Automatic Retry:** `ServiceError` and `TimeoutError` are retried with exponential backoff. `NavigationError` and `ConfigurationError` are not retried (they won't succeed on retry).
 
  ```ruby
- RubyCrawl.configure(max_attempts: 5)
- # or per-request
- RubyCrawl.crawl(url, max_attempts: 1) # No retries
+ RubyCrawl.configure(max_attempts: 5)  # 5 total attempts
+ RubyCrawl.crawl(url, max_attempts: 1) # Disable retries
  ```
 
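The retry-with-exponential-backoff behavior described above can be sketched as a plain loop. Illustrative only, not RubyCrawl's actual source; the delay schedule (1s, 2s, 4s, ...) is an assumption:

```ruby
# Retry a block up to max_attempts times, doubling the delay each time.
# base_delay: 0 disables sleeping (handy for tests).
def with_retries(max_attempts: 3, base_delay: 1)
  attempt = 0
  begin
    attempt += 1
    yield
  rescue StandardError
    raise if attempt >= max_attempts      # out of attempts: re-raise
    sleep(base_delay * 2**(attempt - 1)) if base_delay.positive?
    retry
  end
end

calls = 0
with_retries(max_attempts: 3, base_delay: 0) do
  calls += 1
  raise "transient" if calls < 3
  :ok
end
# succeeds on the third attempt
```

A real implementation would rescue only the retryable error classes (`ServiceError`, `TimeoutError`) rather than `StandardError`.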
  ## Rails Integration
 
  ### Installation
 
- Run the installer in your Rails app:
-
  ```bash
  bundle exec rake rubycrawl:install
  ```
@@ -450,173 +371,54 @@ bundle exec rake rubycrawl:install
  This creates `config/initializers/rubycrawl.rb`:
 
  ```ruby
- # frozen_string_literal: true
-
- # rubycrawl default configuration
  RubyCrawl.configure(
-   wait_until: "load",
+   wait_until:      "load",
    block_resources: true
  )
  ```
 
  ### Usage in Rails
 
- #### Basic Usage in Controllers
-
- ```ruby
- class PagesController < ApplicationController
-   def show
-     result = RubyCrawl.crawl(params[:url])
-
-     @page = Page.create!(
-       url: result.final_url,
-       title: result.metadata['title'],
-       html: result.html,
-       text: result.text,
-       markdown: result.clean_markdown
-     )
-
-     redirect_to @page
-   end
- end
- ```
-
  #### Background Jobs with ActiveJob
 
- **Simple Crawl Job:**
-
  ```ruby
  class CrawlPageJob < ApplicationJob
    queue_as :crawlers
 
-   # Automatic retry with exponential backoff for transient failures
    retry_on RubyCrawl::ServiceError, wait: :exponentially_longer, attempts: 5
    retry_on RubyCrawl::TimeoutError, wait: :exponentially_longer, attempts: 3
-
-   # Don't retry on configuration errors (bad URLs)
    discard_on RubyCrawl::ConfigurationError
 
-   def perform(url, user_id: nil)
+   def perform(url)
      result = RubyCrawl.crawl(url)
 
      Page.create!(
-       url: result.final_url,
-       title: result.metadata['title'],
-       text: result.text,
-       html: result.html,
-       user_id: user_id,
+       url:        result.final_url,
+       title:      result.metadata['title'],
+       content:    result.clean_text,
+       markdown:   result.clean_markdown,
        crawled_at: Time.current
      )
-   rescue RubyCrawl::NavigationError => e
-     # Page not found or failed to load
-     Rails.logger.warn "Failed to crawl #{url}: #{e.message}"
-     FailedCrawl.create!(url: url, error: e.message, user_id: user_id)
-   end
- end
-
- # Enqueue from anywhere
- CrawlPageJob.perform_later("https://example.com", user_id: current_user.id)
- ```
-
- **Multi-Page Site Crawler Job:**
-
- ```ruby
- class CrawlSiteJob < ApplicationJob
-   queue_as :crawlers
-
-   def perform(start_url, max_pages: 50)
-     pages_crawled = RubyCrawl.crawl_site(
-       start_url,
-       max_pages: max_pages,
-       max_depth: 3,
-       same_host_only: true
-     ) do |page|
-       Page.create!(
-         url: page.url,
-         title: page.metadata['title'],
-         text: page.clean_markdown, # Store markdown for RAG applications
-         depth: page.depth,
-         crawled_at: Time.current
-       )
-     end
-
-     Rails.logger.info "Crawled #{pages_crawled} pages from #{start_url}"
-   end
- end
- ```
-
- **Batch Crawling Pattern:**
-
- ```ruby
- class BatchCrawlJob < ApplicationJob
-   queue_as :crawlers
-
-   def perform(urls)
-     # Create session for better performance
-     session_id = RubyCrawl.create_session
-
-     begin
-       urls.each do |url|
-         result = RubyCrawl.crawl(url, session_id: session_id)
-
-         Page.create!(
-           url: result.final_url,
-           html: result.html,
-           text: result.text
-         )
-       end
-     ensure
-       # Always destroy session when done
-       RubyCrawl.destroy_session(session_id)
-     end
    end
  end
-
- # Enqueue batch
- BatchCrawlJob.perform_later(["https://example.com", "https://example.com/about"])
  ```
 
- **Periodic Crawling with Sidekiq-Cron:**
-
- ```ruby
- # config/schedule.yml (for sidekiq-cron)
- crawl_news_sites:
-   cron: "0 */6 * * *" # Every 6 hours
-   class: "CrawlNewsSitesJob"
-
- # app/jobs/crawl_news_sites_job.rb
- class CrawlNewsSitesJob < ApplicationJob
-   queue_as :scheduled_crawlers
-
-   def perform
-     Site.where(active: true).find_each do |site|
-       CrawlSiteJob.perform_later(site.url, max_pages: site.max_pages)
-     end
-   end
- end
- ```
-
- **RAG/AI Knowledge Base Pattern:**
+ **Multi-page RAG knowledge base:**
 
  ```ruby
  class BuildKnowledgeBaseJob < ApplicationJob
    queue_as :crawlers
 
    def perform(documentation_url)
-     RubyCrawl.crawl_site(
-       documentation_url,
-       max_pages: 500,
-       max_depth: 5
-     ) do |page|
-       # Store in vector database for RAG
+     RubyCrawl.crawl_site(documentation_url, max_pages: 500, max_depth: 5) do |page|
        embedding = OpenAI.embed(page.clean_markdown)
 
        Document.create!(
-         url: page.url,
-         title: page.metadata['title'],
-         content: page.clean_markdown,
+         url:       page.url,
+         title:     page.metadata['title'],
+         content:   page.clean_markdown,
          embedding: embedding,
-         depth: page.depth
+         depth:     page.depth
        )
      end
    end
@@ -625,156 +427,106 @@ end
 
  #### Best Practices
 
- 1. **Use background jobs** for crawling to avoid blocking web requests
- 2. **Configure retry logic** based on error types (retry ServiceError, discard ConfigurationError)
- 3. **Use sessions** for batch crawling to improve performance
- 4. **Monitor job failures** and set up alerts for repeated errors
- 5. **Rate limit** external crawling to be respectful (use job throttling)
- 6. **Store both HTML and text** for flexibility in data processing
+ 1. **Use background jobs** to avoid blocking web requests
+ 2. **Configure retry logic** based on error type
+ 3. **Store `clean_markdown`** for RAG applications (preserves heading structure for chunking)
+ 4. **Rate limit** external crawling to be respectful
 
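Why heading structure helps (best practice 3): Markdown keeps `#`/`##` markers that make natural chunk boundaries for embeddings. A minimal, hypothetical chunker, not part of the gem:

```ruby
# Split Markdown into chunks at level-1 and level-2 headings.
# Illustrative sketch; real RAG pipelines also cap chunk size.
def chunk_by_headings(markdown)
  chunks = []
  current = []
  markdown.each_line do |line|
    if line.match?(/\A##?\s/) && !current.empty?
      chunks << current.join # close the previous chunk at each heading
      current = []
    end
    current << line
  end
  chunks << current.join unless current.empty?
  chunks
end

doc = "# Intro\ntext one\n## Setup\ntext two\n"
chunk_by_headings(doc)
# => ["# Intro\ntext one\n", "## Setup\ntext two\n"]
```

`clean_text` has no such markers, which is why `clean_markdown` chunks more cleanly.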
  ## Production Deployment
 
  ### Pre-deployment Checklist
 
- 1. **Install Node.js** on your production servers (LTS version recommended)
+ 1. **Ensure Chrome is installed** on your production servers
  2. **Run installer** during deployment:
     ```bash
     bundle exec rake rubycrawl:install
     ```
- 3. **Set environment variables** (optional):
-    ```bash
-    export RUBYCRAWL_NODE_BIN=/usr/bin/node          # Custom Node.js path
-    export RUBYCRAWL_NODE_LOG=/var/log/rubycrawl.log # Service logs
-    ```
 
  ### Docker Example
 
  ```dockerfile
  FROM ruby:3.2
 
- # Install Node.js LTS
- RUN curl -fsSL https://deb.nodesource.com/setup_lts.x | bash - \
-     && apt-get install -y nodejs
-
- # Install system dependencies for Playwright
- RUN npx playwright install-deps
+ # Install Chrome
+ RUN apt-get update && apt-get install -y \
+     chromium \
+     --no-install-recommends \
+     && rm -rf /var/lib/apt/lists/*
 
  WORKDIR /app
  COPY Gemfile* ./
  RUN bundle install
 
- # Install Playwright browsers
- RUN bundle exec rake rubycrawl:install
-
  COPY . .
  CMD ["rails", "server"]
  ```
 
673
- ### Heroku Deployment
464
+ Ferrum will detect `chromium` automatically. To specify a custom path:
674
465
 
675
- Add the Node.js buildpack:
676
-
677
- ```bash
678
- heroku buildpacks:add heroku/nodejs
679
- heroku buildpacks:add heroku/ruby
680
- ```
681
-
682
- Add to `package.json` in your Rails root:
683
-
684
- ```json
685
- {
686
- "engines": {
687
- "node": "18.x"
688
- }
689
- }
466
+ ```ruby
467
+ RubyCrawl.configure(
468
+ browser_options: { "browser-path": "/usr/bin/chromium" }
469
+ )
690
470
  ```
 
- ## How It Works
+ ## Architecture
 
- RubyCrawl uses a simple architecture:
+ RubyCrawl uses a single-process architecture:
 
- - **Ruby Gem** provides the public API and handles orchestration
- - **Node.js Service** (bundled, auto-started) manages Playwright browsers
- - Communication via HTTP/JSON on localhost
+ ```
+ RubyCrawl (public API)
+
+ Browser (lib/rubycrawl/browser.rb)  ← Ferrum wrapper
+
+ Ferrum::Browser                     ← Chrome DevTools Protocol (pure Ruby)
+
+ Chromium                            ← headless browser
+
+ Readability.js → heuristic fallback ← content extraction (inside browser)
+ ```
 
- This design keeps things stable and easy to debug. The browser runs in a separate process, so crashes won't affect your Ruby application.
+ - Chrome launches once lazily and is reused across all crawls
+ - Each crawl gets an isolated page context (own cookies/storage)
+ - Content extraction runs inside the browser via `page.evaluate()`:
+   - **Primary**: Mozilla Readability.js — article-quality extraction for blogs, docs, news
+   - **Fallback**: link-density heuristic — covers marketing pages, homepages, SPAs
+   - `result.metadata['extractor']` tells you which path was used (`"readability"` or `"heuristic"`)
+ - No separate processes, no HTTP boundary, no Node.js
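Since `result.metadata['extractor']` reports which extraction path ran, callers can branch on it. A minimal sketch — the `'extractor'` key and its two values come from the notes above, while the result object is stubbed as a plain hash and the quality labels are this sketch's own convention:

```ruby
# Map the extractor name to a rough confidence level for downstream use.
def extraction_quality(metadata)
  case metadata["extractor"]
  when "readability" then :article     # Readability.js succeeded: article-style content
  when "heuristic"   then :best_effort # link-density fallback: treat with more care
  else :unknown
  end
end

# In real use the hash would be result.metadata from RubyCrawl.crawl(url);
# here it is stubbed so the sketch stands alone.
quality = extraction_quality({ "extractor" => "heuristic" })
```

A RAG pipeline might, for example, chunk `:article` pages aggressively but route `:best_effort` pages through an extra cleaning pass.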
 
- ## Performance Tips
+ ## Performance
 
- - **Resource blocking**: Keep `block_resources: true` (default) for 2-3x faster crawls when you don't need images/CSS
+ - **Resource blocking**: Keep `block_resources` enabled (the default) to skip images/fonts/CSS for 2-3x faster crawls
 - **Wait strategy**: Use `wait_until: "load"` for static sites, `"networkidle"` for SPAs
- - **Concurrency**: Use background jobs (Sidekiq, etc.) for parallel crawling
- - **Browser reuse**: The first crawl is slower (~2s) due to browser launch; subsequent crawls are much faster (~500ms)
+ - **Concurrency**: Use background jobs (Sidekiq, GoodJob, etc.) for parallel crawling
+ - **Browser reuse**: The first crawl is slower (~2s) due to Chrome launch; subsequent crawls are much faster (~200-500ms)
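To see what resource blocking buys, here is the kind of request filter `block_resources` implies — a sketch, not the gem's actual implementation. The blocked set is assumed from the bullet above (images, fonts, CSS); `"Image"`, `"Stylesheet"`, and `"Font"` are how the Chrome DevTools Protocol labels those request types:

```ruby
# Resource types to skip when blocking is on (assumed set per the note above).
BLOCKED_TYPES = %w[Image Stylesheet Font].freeze

def block_request?(resource_type, block_resources: true)
  block_resources && BLOCKED_TYPES.include?(resource_type)
end
```

Documents and XHR/fetch requests still load, which is why JavaScript-rendered content survives blocking.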
 
 ## Development
 
- Want to contribute? Check out the [contributor guidelines](.github/copilot-instructions.md).
-
 ```bash
- # Setup
 git clone git@github.com:craft-wise/rubycrawl.git
 cd rubycrawl
 bin/setup
 
- # Run tests
+ # Run unit tests (no browser required)
 bundle exec rspec
 
+ # Run integration tests (requires Chrome)
+ INTEGRATION=1 bundle exec rspec
+
 # Manual testing
 bin/console
 > RubyCrawl.crawl("https://example.com")
+ > RubyCrawl.crawl("https://example.com").clean_text
+ > RubyCrawl.crawl("https://example.com").clean_markdown
 ```
 
 ## Contributing
 
 Contributions are welcome! Please read our [contribution guidelines](.github/copilot-instructions.md) first.
 
- ### Development Philosophy
-
 - **Simplicity over cleverness**: Prefer clear, explicit code
 - **Stability over speed**: Correctness first, optimization second
- - **Ruby-first**: Hide Node.js/Playwright complexity from users
- - **No vendor lock-in**: Pure open source, no SaaS dependencies
-
- ## Why Choose RubyCrawl?
-
- RubyCrawl stands out in the Ruby ecosystem with its unique combination of features:
-
- ### 🎯 **Built for Ruby Developers**
-
- - **Idiomatic Ruby API** — Feels natural to Rubyists, no need to learn Playwright
- - **Rails-first design** — Generators, initializers, and ActiveJob integration out of the box
- - **Modular architecture** — Clean, testable code following Ruby best practices
-
- ### 🚀 **Production-Grade Reliability**
-
- - **Automatic retry** with exponential backoff for transient failures
- - **Smart error handling** with custom exception hierarchy
- - **Process isolation** — Browser crashes don't affect your Ruby application
- - **Battle-tested** — Built on Playwright's proven browser automation
-
- ### 💎 **Developer Experience**
-
- - **Zero configuration** — Works immediately after installation
- - **Lazy loading** — Markdown conversion only when you need it
- - **Smart URL handling** — Automatic normalization and deduplication
- - **Comprehensive docs** — Clear examples for common use cases
-
- ### 🌐 **Rich Feature Set**
-
- - ✅ JavaScript-enabled crawling (SPAs, AJAX, dynamic content)
- - ✅ Multi-page crawling with BFS algorithm
- - ✅ Link extraction with metadata (url, text, title, rel)
- - ✅ Markdown conversion (GitHub-flavored)
- - ✅ Metadata extraction (OG tags, Twitter cards, etc.)
- - ✅ Resource blocking for 2-3x performance boost
-
- ### 📊 **Perfect for Modern Use Cases**
-
- - **RAG applications** — Build AI knowledge bases from documentation
- - **Data aggregation** — Extract structured data from multiple pages
- - **Content migration** — Convert sites to Markdown for static generators
- - **SEO analysis** — Extract metadata and link structures
- - **Testing** — Verify deployed site content and structure
+ - **Hide complexity**: Users should never need to know Ferrum exists
 
 ## License
 
@@ -782,21 +534,14 @@ The gem is available as open source under the terms of the [MIT License](LICENSE
 
 ## Credits
 
- Built with [Playwright](https://playwright.dev/) by Microsoft the industry-standard browser automation framework.
+ Built with [Ferrum](https://github.com/rubycdp/ferrum), a pure Ruby Chrome DevTools Protocol client.
+
+ Content extraction powered by [Mozilla Readability.js](https://github.com/mozilla/readability) — the algorithm behind Firefox Reader View.
 
- Powered by [reverse_markdown](https://github.com/xijo/reverse_markdown) for GitHub-flavored Markdown conversion.
+ Markdown conversion powered by [reverse_markdown](https://github.com/xijo/reverse_markdown) for GitHub-flavored output.
 
 ## Support
 
 - **Issues**: [GitHub Issues](https://github.com/craft-wise/rubycrawl/issues)
 - **Discussions**: [GitHub Discussions](https://github.com/craft-wise/rubycrawl/discussions)
 - **Email**: ganesh.navale@zohomail.in
-
- ## Acknowledgments
-
- Special thanks to:
-
- - [Microsoft Playwright](https://playwright.dev/) team for the robust, production-grade browser automation framework
- - The Ruby community for building an ecosystem that values developer happiness and code clarity
- - The Node.js community for excellent tooling and libraries that make cross-language integration seamless
- - Open source contributors worldwide who make projects like this possible