rubycrawl 0.1.4 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.md CHANGED
@@ -3,46 +3,45 @@
  [![Gem Version](https://badge.fury.io/rb/rubycrawl.svg)](https://rubygems.org/gems/rubycrawl)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
  [![Ruby](https://img.shields.io/badge/ruby-%3E%3D%203.0-red.svg)](https://www.ruby-lang.org/)
- [![Node.js](https://img.shields.io/badge/node.js-18%2B-green.svg)](https://nodejs.org/)

- **Production-ready web crawler for Ruby powered by Playwright** — Bringing the power of modern browser automation to the Ruby ecosystem with first-class Rails support.
+ **Production-ready web crawler for Ruby powered by Ferrum** — Full JavaScript rendering via Chrome DevTools Protocol, with first-class Rails support and no Node.js dependency.

- RubyCrawl provides **accurate, JavaScript-enabled web scraping** using Playwright's battle-tested browser automation, wrapped in a clean Ruby API. Perfect for extracting content from modern SPAs, dynamic websites, and building RAG knowledge bases.
+ RubyCrawl provides **accurate, JavaScript-enabled web scraping** using a pure Ruby browser automation stack. Perfect for extracting content from modern SPAs, dynamic websites, and building RAG knowledge bases.

  **Why RubyCrawl?**

  - ✅ **Real browser** — Handles JavaScript, AJAX, and SPAs correctly
- - ✅ **Zero config** — Works out of the box, no Playwright knowledge needed
+ - ✅ **Pure Ruby** — No Node.js, no npm, no external processes to manage
+ - ✅ **Zero config** — Works out of the box, no Ferrum knowledge needed
  - ✅ **Production-ready** — Auto-retry, error handling, resource optimization
  - ✅ **Multi-page crawling** — BFS algorithm with smart URL deduplication
  - ✅ **Rails-friendly** — Generators, initializers, and ActiveJob integration
- - ✅ **Modular architecture** — Clean, testable, maintainable codebase

  ```ruby
  # One line to crawl any JavaScript-heavy site
  result = RubyCrawl.crawl("https://docs.example.com")

  result.html           # Full HTML with JS rendered
- result.links          # All links with metadata
+ result.clean_text     # Noise-stripped plain text (no nav/footer/ads)
+ result.clean_markdown # Markdown ready for RAG pipelines
+ result.links          # All links with url, text, title, rel
  result.metadata       # Title, description, OG tags, etc.
  ```

  ## Features

- - **🎭 Playwright-powered**: Real browser automation for JavaScript-heavy sites and SPAs
- - **🚀 Production-ready**: Designed for Rails apps and production environments with auto-retry and error handling
- - **🎯 Simple API**: Clean, minimal Ruby interface — zero Playwright or Node.js knowledge required
- - **⚡ Resource optimization**: Built-in resource blocking for 2-3x faster crawls
- - **🔄 Auto-managed browsers**: Browser process reuse and automatic lifecycle management
- - **📄 Content extraction**: HTML, plain text, links (with metadata), and **clean markdown** via HTML conversion
- - **🌐 Multi-page crawling**: BFS (breadth-first search) crawler with configurable depth limits and URL deduplication
- - **🛡️ Smart URL handling**: Automatic normalization, tracking parameter removal, and same-host filtering
- - **🔧 Rails integration**: First-class Rails support with generators and initializers
- - **💎 Modular design**: Clean separation of concerns with focused, testable modules
+ - **Pure Ruby**: Ferrum drives Chromium directly via CDP, with no Node.js or npm required
+ - **Production-ready**: Designed for Rails apps with auto-retry and exponential backoff
+ - **Simple API**: Clean Ruby interface — zero Ferrum or CDP knowledge required
+ - **Resource optimization**: Built-in resource blocking for 2-3x faster crawls
+ - **Auto-managed browsers**: Lazy Chrome singleton, isolated page per crawl
+ - **Content extraction**: HTML, plain text, clean HTML, Markdown (lazy), links, metadata
+ - **Multi-page crawling**: BFS crawler with configurable depth limits and URL deduplication
+ - **Smart URL handling**: Automatic normalization, tracking parameter removal, same-host filtering
+ - **Rails integration**: First-class Rails support with generators and initializers
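The "smart URL handling" feature can be pictured in plain Ruby. This is an illustrative sketch of tracking-parameter removal and same-host filtering using only the standard library, not RubyCrawl's actual implementation:

```ruby
require "uri"

# Illustrative sketch only, not RubyCrawl's internal code.
TRACKING_PARAMS = %w[utm_source utm_medium utm_campaign utm_term utm_content fbclid gclid].freeze

# Strip tracking parameters and fragments so duplicate URLs collapse.
def normalize(url)
  uri = URI.parse(url)
  if uri.query
    kept = URI.decode_www_form(uri.query).reject { |key, _| TRACKING_PARAMS.include?(key) }
    uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  end
  uri.fragment = nil
  uri.to_s
end

# Keep the crawl on the starting host.
def same_host?(url, start_url)
  URI.parse(url).host == URI.parse(start_url).host
end

normalize("https://example.com/a?utm_source=x&page=2#top")
# => "https://example.com/a?page=2"
```

Deduplicating on the normalized form is what keeps `utm_*` variants of the same page from being crawled twice.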
 
  ## Table of Contents

- - [Features](#features)
  - [Installation](#installation)
  - [Quick Start](#quick-start)
  - [Use Cases](#use-cases)
@@ -57,18 +56,15 @@ result.metadata # Title, description, OG tags, etc.
  - [Architecture](#architecture)
  - [Performance](#performance)
  - [Development](#development)
- - [Project Structure](#project-structure)
  - [Contributing](#contributing)
- - [Why Choose RubyCrawl?](#why-choose-rubycrawl)
  - [License](#license)
- - [Support](#support)

  ## Installation

  ### Requirements

  - **Ruby** >= 3.0
- - **Node.js** LTS (v18+ recommended) required for the bundled Playwright service
+ - **Chrome or Chromium**, detected automatically by Ferrum (install via your system package manager if missing)

  ### Add to Gemfile

@@ -82,9 +78,9 @@ Then install:
  bundle install
  ```

- ### Install Playwright browsers
+ ### Install Chrome

- After bundling, install the Playwright browsers:
+ Ferrum detects an installed Chrome automatically. Run the install task to verify Chrome is available and generate a Rails initializer:

  ```bash
  bundle exec rake rubycrawl:install
@@ -92,24 +88,10 @@ bundle exec rake rubycrawl:install

  This command:

- - ✅ Installs Node.js dependencies in the bundled `node/` directory
- - ✅ Downloads Playwright browsers (Chromium, Firefox, WebKit) — ~300MB download
+ - ✅ Checks for Chrome/Chromium in your PATH
  - ✅ Creates a Rails initializer (if using Rails)

- **Note:** You only need to run this once. The installation task is idempotent and safe to run multiple times.
-
- **Troubleshooting installation:**
-
- ```bash
- # If installation fails, check Node.js version
- node --version # Should be v18+ LTS
-
- # Enable verbose logging
- RUBYCRAWL_NODE_LOG=/tmp/rubycrawl.log bundle exec rake rubycrawl:install
-
- # Check installation status
- cd node && npm list
- ```
+ **Note:** If Chrome is not in your PATH, install it via your system package manager or download from [google.com/chrome](https://www.google.com/chrome/).

  ## Quick Start

@@ -120,37 +102,37 @@ require "rubycrawl"
  result = RubyCrawl.crawl("https://example.com")

  # Access extracted content
- result.final_url # Final URL after redirects
- result.text      # Plain text content (via innerText)
- result.html      # Raw HTML content
- result.links     # Extracted links with metadata
- result.metadata  # Title, description, OG tags, etc.
+ result.final_url      # Final URL after redirects
+ result.clean_text     # Noise-stripped plain text (no nav/footer/ads)
+ result.clean_html     # Noise-stripped HTML (same noise removed as clean_text)
+ result.raw_text       # Full body.innerText (unfiltered)
+ result.html           # Full raw HTML content
+ result.links          # Extracted links with url, text, title, rel
+ result.metadata       # Title, description, OG tags, etc.
+ result.clean_markdown # Markdown converted from clean_html (lazy — first access only)
  ```

  ## Use Cases

  RubyCrawl is perfect for:

- - **📊 Data aggregation**: Crawl product catalogs, job listings, or news articles
- - **🤖 RAG applications**: Build knowledge bases for LLM/AI applications by crawling documentation sites
- - **🔍 SEO analysis**: Extract metadata, links, and content structure
- - **📱 Content migration**: Convert existing sites to Markdown for static site generators
- - **🧪 Testing**: Verify deployed site structure and content
- - **📚 Documentation scraping**: Create local copies of documentation with preserved links
+ - **RAG applications**: Build knowledge bases for LLM/AI applications by crawling documentation sites
+ - **Data aggregation**: Crawl product catalogs, job listings, or news articles
+ - **SEO analysis**: Extract metadata, links, and content structure
+ - **Content migration**: Convert existing sites to Markdown for static site generators
+ - **Documentation scraping**: Create local copies of documentation with preserved links

  ## Usage

  ### Basic Crawling

- The simplest way to crawl a URL:
-
  ```ruby
  result = RubyCrawl.crawl("https://example.com")

- # Access the results
- result.html     # => "<html>...</html>"
- result.text     # => "Example Domain\nThis domain is..." (plain text via innerText)
- result.metadata # => { "status" => 200, "final_url" => "https://example.com" }
+ result.html       # => "<html>...</html>"
+ result.clean_text # => "Example Domain\n\nThis domain is..." (no nav/ads)
+ result.raw_text   # => "Example Domain\nThis domain is..." (full body text)
+ result.metadata   # => { "final_url" => "https://example.com", "title" => "..." }
  ```

  ### Multi-Page Crawling
@@ -165,10 +147,10 @@ RubyCrawl.crawl_site("https://example.com", max_pages: 100, max_depth: 3) do |pa

    # Save to database
    Page.create!(
-     url: page.url,
-     html: page.html,
+     url: page.url,
+     html: page.html,
      markdown: page.clean_markdown,
-     depth: page.depth
+     depth: page.depth
    )
  end
  ```
@@ -176,7 +158,6 @@ end
  **Real-world example: Building a RAG knowledge base**

  ```ruby
- # Crawl documentation site for AI/RAG application
  require "rubycrawl"

  RubyCrawl.configure(
@@ -190,21 +171,18 @@ pages_crawled = RubyCrawl.crawl_site(
    max_depth: 5,
    same_host_only: true
  ) do |page|
-   # Store in vector database for RAG
    VectorDB.upsert(
-     id: Digest::SHA256.hexdigest(page.url),
-     content: page.clean_markdown, # Clean markdown for better embeddings
+     id: Digest::SHA256.hexdigest(page.url),
+     content: page.clean_markdown,
      metadata: {
-       url: page.url,
+       url: page.url,
        title: page.metadata["title"],
        depth: page.depth
      }
    )
-
-   puts "✓ Indexed: #{page.metadata['title']} (#{page.depth} levels deep)"
  end

- puts "Crawled #{pages_crawled} pages into knowledge base"
+ puts "Indexed #{pages_crawled} pages"
  ```

  #### Multi-Page Options
@@ -223,10 +201,13 @@ The block receives a `PageResult` with:

  ```ruby
  page.url            # String: Final URL after redirects
- page.html           # String: Full HTML content
- page.clean_markdown # String: Lazy-converted Markdown
+ page.html           # String: Full raw HTML content
+ page.clean_html     # String: Noise-stripped HTML (no nav/header/footer/ads)
+ page.clean_text     # String: Noise-stripped plain text (derived from clean_html)
+ page.raw_text       # String: Full body.innerText (unfiltered)
+ page.clean_markdown # String: Lazy-converted Markdown from clean_html
  page.links          # Array: URLs extracted from page
- page.metadata       # Hash: HTTP status, final URL, etc.
+ page.metadata       # Hash: final_url, title, OG tags, etc.
  page.depth          # Integer: Link depth from start URL
  ```
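Conceptually, `crawl_site` performs a breadth-first traversal with a visited set enforcing `max_pages` and `max_depth`. A minimal pure-Ruby sketch of that algorithm, with fetching and link extraction stubbed out as a hash (illustrative only, not the gem's code):

```ruby
# BFS crawl sketch: `graph` stands in for fetching a page and extracting links.
def bfs_crawl(start, graph, max_pages: 10, max_depth: 3)
  visited = {}
  queue = [[start, 0]]
  until queue.empty? || visited.size >= max_pages
    url, depth = queue.shift
    next if visited.key?(url) || depth > max_depth
    visited[url] = depth
    yield url, depth if block_given?
    (graph[url] || []).each { |link| queue << [link, depth + 1] }
  end
  visited
end

graph = {
  "/"  => ["/a", "/b"],
  "/a" => ["/b", "/c"],
  "/b" => [],
  "/c" => ["/d"]
}
bfs_crawl("/", graph, max_depth: 2)
# => {"/"=>0, "/a"=>1, "/b"=>1, "/c"=>2}
```

Deduplication falls out of the `visited` hash: each URL is yielded at most once, at the shallowest depth it was reached.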
 
@@ -234,12 +215,12 @@ page.depth # Integer: Link depth from start URL

  #### Global Configuration

- Set default options that apply to all crawls:
-
  ```ruby
  RubyCrawl.configure(
-   wait_until: "networkidle", # Wait until network is idle
-   block_resources: true      # Block images, fonts, CSS for speed
+   wait_until: "networkidle",
+   block_resources: true,
+   timeout: 60,
+   headless: true
  )

  # All subsequent crawls use these defaults
@@ -248,8 +229,6 @@ result = RubyCrawl.crawl("https://example.com")

  #### Per-Request Options

- Override defaults for specific requests:
-
  ```ruby
  # Use global defaults
  result = RubyCrawl.crawl("https://example.com")
@@ -257,192 +236,131 @@ result = RubyCrawl.crawl("https://example.com")
  # Override for this request only
  result = RubyCrawl.crawl(
    "https://example.com",
-   wait_until: "domcontentloaded",
+   wait_until: "domcontentloaded",
    block_resources: false
  )
  ```

  #### Configuration Options

- | Option            | Values                                                      | Default  | Description                                       |
- | ----------------- | ----------------------------------------------------------- | -------- | ------------------------------------------------- |
- | `wait_until`      | `"load"`, `"domcontentloaded"`, `"networkidle"`, `"commit"` | `"load"` | When to consider page loaded                      |
- | `block_resources` | `true`, `false`                                             | `true`   | Block images, fonts, CSS, media for faster crawls |
- | `max_attempts`    | Integer                                                     | `3`      | Total number of attempts (including the first)    |
+ | Option            | Values                                                      | Default | Description                                         |
+ | ----------------- | ----------------------------------------------------------- | ------- | --------------------------------------------------- |
+ | `wait_until`      | `"load"`, `"domcontentloaded"`, `"networkidle"`, `"commit"` | `nil`   | When to consider page loaded (nil = Ferrum default) |
+ | `block_resources` | `true`, `false`                                             | `nil`   | Block images, fonts, CSS, media for faster crawls   |
+ | `max_attempts`    | Integer                                                     | `3`     | Total number of attempts (including the first)      |
+ | `timeout`         | Integer (seconds)                                           | `30`    | Browser navigation timeout                          |
+ | `headless`        | `true`, `false`                                             | `true`  | Run Chrome headlessly                               |

  **Wait strategies explained:**

- - load — Wait for the load event (fastest, good for static sites)
- - domcontentloaded — Wait for DOM ready (medium speed)
- - networkidle — Wait until no network requests for 500ms (slowest, best for SPAs)
- - commit — Wait until the first response bytes are received (fastest possible)
-
- ### Advanced Usage
-
- #### Session-Based Crawling
-
- Sessions allow reusing browser contexts for better performance when crawling multiple pages. They're automatically used by `crawl_site`, but you can manage them manually for advanced use cases:
-
- ```ruby
- # Create a session (reusable browser context)
- session_id = RubyCrawl.create_session
-
- begin
-   # All crawls with this session_id share the same browser context
-   result1 = RubyCrawl.crawl("https://example.com/page1", session_id: session_id)
-   result2 = RubyCrawl.crawl("https://example.com/page2", session_id: session_id)
-   # Browser state (cookies, localStorage) persists between crawls
- ensure
-   # Always destroy session when done
-   RubyCrawl.destroy_session(session_id)
- end
- ```
-
- **When to use sessions:**
-
- - Multiple sequential crawls to the same domain (better performance)
- - Preserving cookies/state set by the site between page visits
- - Avoiding browser context creation overhead
-
- **Important:** Sessions are for **performance optimization only**. RubyCrawl is designed for crawling **public websites**. It does not provide authentication or login functionality for protected content.
-
- **Note:** `crawl_site` automatically creates and manages a session internally, so you don't need manual session management for multi-page crawling.
-
- **Session lifecycle:**
-
- - Sessions automatically expire after 30 minutes of inactivity
- - Sessions are cleaned up every 5 minutes
- - Always call `destroy_session` when done to free resources immediately
+ - `load` — Wait for the load event (good for static sites)
+ - `domcontentloaded` — Wait for DOM ready (faster)
+ - `networkidle` — Wait until no network requests for 500ms (best for SPAs)
+ - `commit` — Wait until the first response bytes are received (fastest)

  ### Result Object

- The crawl result is a `RubyCrawl::Result` object with these attributes:
-
  ```ruby
  result = RubyCrawl.crawl("https://example.com")

- result.html           # String: Raw HTML content from page
- result.text           # String: Plain text via document.body.innerText
- result.clean_markdown # String: Markdown conversion (lazy-loaded on first access)
- result.links          # Array: Extracted links with url and text
- result.metadata       # Hash: Comprehensive metadata (see below)
+ result.html           # String: Full raw HTML
+ result.clean_html     # String: Noise-stripped HTML (nav/header/footer/ads removed)
+ result.clean_text     # String: Plain text derived from clean_html — ideal for RAG
+ result.raw_text       # String: Full body.innerText (unfiltered)
+ result.clean_markdown # String: Markdown from clean_html (lazy — computed on first access)
+ result.links          # Array: Extracted links with url/text/title/rel
+ result.metadata       # Hash: See below
+ result.final_url      # String: Shortcut for metadata['final_url']
  ```

  #### Links Format

- Links are extracted with full metadata:
-
  ```ruby
  result.links
  # => [
- #   {
- #     "url" => "https://example.com/about",
- #     "text" => "About Us",
- #     "title" => "Learn more about us", # <a title="...">
- #     "rel" => nil                      # <a rel="nofollow">
- #   },
- #   {
- #     "url" => "https://example.com/contact",
- #     "text" => "Contact",
- #     "title" => nil,
- #     "rel" => "nofollow"
- #   },
- #   ...
+ #   { "url" => "https://example.com/about", "text" => "About", "title" => nil, "rel" => nil },
+ #   { "url" => "https://example.com/contact", "text" => "Contact", "title" => nil, "rel" => "nofollow" },
  # ]
  ```

- **Note:** URLs are automatically converted to absolute URLs by the browser, so relative links like `/about` become `https://example.com/about`.
+ URLs are automatically resolved to absolute form by the browser.

  #### Markdown Conversion

- Markdown is **lazy-loaded** — conversion only happens when you access `.clean_markdown`:
+ Markdown is **lazy** — conversion only happens on first access of `.clean_markdown`:

  ```ruby
- result = RubyCrawl.crawl(url)
- result.html           # No overhead
- result.clean_markdown # ⬅️ Conversion happens here (first call only)
- result.clean_markdown # ✅ Cached, instant
+ result.clean_html     # Already available, no overhead
+ result.clean_markdown # Converts clean_html → Markdown here (first call only)
+ result.clean_markdown # Cached, instant on subsequent calls
  ```

  Uses [reverse_markdown](https://github.com/xijo/reverse_markdown) with GitHub-flavored output.
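The lazy behavior is ordinary Ruby `||=` memoization. A sketch of the pattern with a hypothetical `CrawlResult` class and the conversion stubbed out (the real conversion uses the reverse_markdown gem):

```ruby
# Memoization sketch: hypothetical class, conversion stubbed for illustration.
class CrawlResult
  attr_reader :conversions

  def initialize(clean_html)
    @clean_html = clean_html
    @conversions = 0
  end

  def clean_markdown
    @clean_markdown ||= convert(@clean_html) # first call converts, later calls hit the cache
  end

  private

  def convert(html)
    @conversions += 1
    html.gsub(%r{</?p>}, "") # stand-in for ReverseMarkdown.convert(html, github_flavored: true)
  end
end

result = CrawlResult.new("<p>hello</p>")
result.clean_markdown # => "hello"
result.clean_markdown # cached, convert is not called again
result.conversions    # => 1
```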

  #### Metadata Fields

- The `metadata` hash includes HTTP and HTML metadata:
-
  ```ruby
  result.metadata
  # => {
- #   "status" => 200,                  # HTTP status code
- #   "final_url" => "https://...",     # Final URL after redirects
- #   "title" => "Page Title",          # <title> tag
- #   "description" => "...",           # Meta description
- #   "keywords" => "ruby, web",        # Meta keywords
- #   "author" => "Author Name",        # Meta author
- #   "og_title" => "...",              # Open Graph title
- #   "og_description" => "...",        # Open Graph description
- #   "og_image" => "https://...",      # Open Graph image
- #   "og_url" => "https://...",        # Open Graph URL
- #   "og_type" => "website",           # Open Graph type
- #   "twitter_card" => "summary",      # Twitter card type
- #   "twitter_title" => "...",         # Twitter title
- #   "twitter_description" => "...",   # Twitter description
- #   "twitter_image" => "https://...", # Twitter image
- #   "canonical" => "https://...",     # Canonical URL
- #   "lang" => "en",                   # Page language
- #   "charset" => "UTF-8"              # Character encoding
+ #   "final_url" => "https://example.com",
+ #   "title" => "Page Title",
+ #   "description" => "...",
+ #   "keywords" => "ruby, web",
+ #   "author" => "Author Name",
+ #   "og_title" => "...",
+ #   "og_description" => "...",
+ #   "og_image" => "https://...",
+ #   "og_url" => "https://...",
+ #   "og_type" => "website",
+ #   "twitter_card" => "summary",
+ #   "twitter_title" => "...",
+ #   "twitter_description" => "...",
+ #   "twitter_image" => "https://...",
+ #   "canonical" => "https://...",
+ #   "lang" => "en",
+ #   "charset" => "UTF-8"
  # }
  ```

- Note: All HTML metadata fields may be `nil` if not present on the page.
-
  ### Error Handling

- RubyCrawl provides specific exception classes for different error scenarios:
-
  ```ruby
  begin
    result = RubyCrawl.crawl(url)
  rescue RubyCrawl::ConfigurationError => e
-   # Invalid URL or configuration
-   puts "Configuration error: #{e.message}"
+   # Invalid URL or option value
  rescue RubyCrawl::TimeoutError => e
-   # Page load timeout or network timeout
-   puts "Timeout: #{e.message}"
+   # Page load timed out
  rescue RubyCrawl::NavigationError => e
-   # Page navigation failed (404, DNS error, SSL error, etc.)
-   puts "Navigation failed: #{e.message}"
+   # Navigation failed (404, DNS error, SSL error)
  rescue RubyCrawl::ServiceError => e
-   # Node service unavailable or crashed
-   puts "Service error: #{e.message}"
+   # Browser failed to start or crashed
  rescue RubyCrawl::Error => e
    # Catch-all for any RubyCrawl error
-   puts "Crawl error: #{e.message}"
  end
  ```

  **Exception Hierarchy:**

- - `RubyCrawl::Error` (base class)
-   - `RubyCrawl::ConfigurationError` - Invalid URL or configuration
-   - `RubyCrawl::TimeoutError` - Timeout during crawl
-   - `RubyCrawl::NavigationError` - Page navigation failed
-   - `RubyCrawl::ServiceError` - Node service issues
+ ```
+ RubyCrawl::Error
+ ├── ConfigurationError — invalid URL or option value
+ ├── TimeoutError — page load timed out
+ ├── NavigationError — navigation failed (HTTP error, DNS, SSL)
+ └── ServiceError — browser failed to start or crashed
+ ```

- **Automatic Retry:** RubyCrawl automatically retries transient failures (service errors, timeouts) with exponential backoff. The default `max_attempts: 3` means 3 total attempts (2 retries). Configure with:
+ **Automatic Retry:** `ServiceError` and `TimeoutError` are retried with exponential backoff. `NavigationError` and `ConfigurationError` are not retried (they won't succeed on retry).

  ```ruby
- RubyCrawl.configure(max_attempts: 5)
- # or per-request
- RubyCrawl.crawl(url, max_attempts: 1) # No retries
+ RubyCrawl.configure(max_attempts: 5)  # 5 total attempts
+ RubyCrawl.crawl(url, max_attempts: 1) # Disable retries
  ```
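The retry behavior follows the standard exponential-backoff shape. An illustrative sketch of the pattern (the gem's exact delays are not documented here, so the numbers below are assumptions):

```ruby
# Exponential-backoff retry sketch: illustrative, not the gem's internals.
class TransientError < StandardError; end

def with_retries(max_attempts: 3, base_delay: 0.5)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue TransientError
    raise if attempts >= max_attempts
    sleep(base_delay * (2**(attempts - 1))) # 0.5s, 1s, 2s, ...
    retry
  end
end

attempts_seen = []
with_retries(max_attempts: 3, base_delay: 0) do |attempt|
  attempts_seen << attempt
  raise TransientError if attempt < 3
  :ok
end
attempts_seen # => [1, 2, 3]
```

Note that `max_attempts` counts the first try, matching the table above: three attempts means two retries.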

  ## Rails Integration

  ### Installation

- Run the installer in your Rails app:
-
  ```bash
  bundle exec rake rubycrawl:install
  ```
@@ -450,173 +368,54 @@ bundle exec rake rubycrawl:install
  This creates `config/initializers/rubycrawl.rb`:

  ```ruby
- # frozen_string_literal: true
-
- # rubycrawl default configuration
  RubyCrawl.configure(
-   wait_until: "load",
+   wait_until: "load",
    block_resources: true
  )
  ```

  ### Usage in Rails

- #### Basic Usage in Controllers
-
- ```ruby
- class PagesController < ApplicationController
-   def show
-     result = RubyCrawl.crawl(params[:url])
-
-     @page = Page.create!(
-       url: result.final_url,
-       title: result.metadata['title'],
-       html: result.html,
-       text: result.text,
-       markdown: result.clean_markdown
-     )
-
-     redirect_to @page
-   end
- end
- ```
-
  #### Background Jobs with ActiveJob

- **Simple Crawl Job:**
-
  ```ruby
  class CrawlPageJob < ApplicationJob
    queue_as :crawlers

-   # Automatic retry with exponential backoff for transient failures
    retry_on RubyCrawl::ServiceError, wait: :exponentially_longer, attempts: 5
    retry_on RubyCrawl::TimeoutError, wait: :exponentially_longer, attempts: 3
-
-   # Don't retry on configuration errors (bad URLs)
    discard_on RubyCrawl::ConfigurationError

-   def perform(url, user_id: nil)
+   def perform(url)
      result = RubyCrawl.crawl(url)

      Page.create!(
-       url: result.final_url,
-       title: result.metadata['title'],
-       text: result.text,
-       html: result.html,
-       user_id: user_id,
+       url: result.final_url,
+       title: result.metadata['title'],
+       content: result.clean_text,
+       markdown: result.clean_markdown,
        crawled_at: Time.current
      )
-   rescue RubyCrawl::NavigationError => e
-     # Page not found or failed to load
-     Rails.logger.warn "Failed to crawl #{url}: #{e.message}"
-     FailedCrawl.create!(url: url, error: e.message, user_id: user_id)
-   end
- end
-
- # Enqueue from anywhere
- CrawlPageJob.perform_later("https://example.com", user_id: current_user.id)
- ```
-
- **Multi-Page Site Crawler Job:**
-
- ```ruby
- class CrawlSiteJob < ApplicationJob
-   queue_as :crawlers
-
-   def perform(start_url, max_pages: 50)
-     pages_crawled = RubyCrawl.crawl_site(
-       start_url,
-       max_pages: max_pages,
-       max_depth: 3,
-       same_host_only: true
-     ) do |page|
-       Page.create!(
-         url: page.url,
-         title: page.metadata['title'],
-         text: page.clean_markdown, # Store markdown for RAG applications
-         depth: page.depth,
-         crawled_at: Time.current
-       )
-     end
-
-     Rails.logger.info "Crawled #{pages_crawled} pages from #{start_url}"
-   end
- end
- ```
-
- **Batch Crawling Pattern:**
-
- ```ruby
- class BatchCrawlJob < ApplicationJob
-   queue_as :crawlers
-
-   def perform(urls)
-     # Create session for better performance
-     session_id = RubyCrawl.create_session
-
-     begin
-       urls.each do |url|
-         result = RubyCrawl.crawl(url, session_id: session_id)
-
-         Page.create!(
-           url: result.final_url,
-           html: result.html,
-           text: result.text
-         )
-       end
-     ensure
-       # Always destroy session when done
-       RubyCrawl.destroy_session(session_id)
-     end
    end
  end
-
- # Enqueue batch
- BatchCrawlJob.perform_later(["https://example.com", "https://example.com/about"])
  ```

- **Periodic Crawling with Sidekiq-Cron:**
-
- ```ruby
- # config/schedule.yml (for sidekiq-cron)
- crawl_news_sites:
-   cron: "0 */6 * * *" # Every 6 hours
-   class: "CrawlNewsSitesJob"
-
- # app/jobs/crawl_news_sites_job.rb
- class CrawlNewsSitesJob < ApplicationJob
-   queue_as :scheduled_crawlers
-
-   def perform
-     Site.where(active: true).find_each do |site|
-       CrawlSiteJob.perform_later(site.url, max_pages: site.max_pages)
-     end
-   end
- end
- ```
-
- **RAG/AI Knowledge Base Pattern:**
+ **Multi-page RAG knowledge base:**

  ```ruby
  class BuildKnowledgeBaseJob < ApplicationJob
    queue_as :crawlers

    def perform(documentation_url)
-     RubyCrawl.crawl_site(
-       documentation_url,
-       max_pages: 500,
-       max_depth: 5
-     ) do |page|
-       # Store in vector database for RAG
+     RubyCrawl.crawl_site(documentation_url, max_pages: 500, max_depth: 5) do |page|
        embedding = OpenAI.embed(page.clean_markdown)

        Document.create!(
-         url: page.url,
-         title: page.metadata['title'],
-         content: page.clean_markdown,
+         url: page.url,
+         title: page.metadata['title'],
+         content: page.clean_markdown,
          embedding: embedding,
-         depth: page.depth
+         depth: page.depth
        )
      end
    end
@@ -625,156 +424,101 @@ end

  #### Best Practices

- 1. **Use background jobs** for crawling to avoid blocking web requests
- 2. **Configure retry logic** based on error types (retry ServiceError, discard ConfigurationError)
- 3. **Use sessions** for batch crawling to improve performance
- 4. **Monitor job failures** and set up alerts for repeated errors
- 5. **Rate limit** external crawling to be respectful (use job throttling)
- 6. **Store both HTML and text** for flexibility in data processing
+ 1. **Use background jobs** to avoid blocking web requests
+ 2. **Configure retry logic** based on error type
+ 3. **Store `clean_markdown`** for RAG applications (preserves heading structure for chunking)
+ 4. **Rate limit** external crawling to be respectful
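For the rate-limiting point, enforcing a minimum interval between requests is often enough. A pure-Ruby sketch (the one-second interval is an arbitrary example; tune it per target site):

```ruby
# Minimum-interval rate limiter sketch for polite crawling.
class RateLimiter
  def initialize(min_interval:)
    @min_interval = min_interval
    @last_at = nil
  end

  # Blocks until at least min_interval has elapsed since the previous call.
  def wait
    if @last_at
      elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - @last_at
      sleep(@min_interval - elapsed) if elapsed < @min_interval
    end
    @last_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

limiter = RateLimiter.new(min_interval: 1.0) # roughly one request per second
%w[https://example.com/a https://example.com/b].each do |url|
  limiter.wait
  # result = RubyCrawl.crawl(url)
end
```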
 
635
432
  ## Production Deployment
636
433
 
637
434
  ### Pre-deployment Checklist
638
435
 
639
- 1. **Install Node.js** on your production servers (LTS version recommended)
436
+ 1. **Ensure Chrome is installed** on your production servers
640
437
  2. **Run installer** during deployment:
641
438
  ```bash
642
439
  bundle exec rake rubycrawl:install
643
440
  ```
644
- 3. **Set environment variables** (optional):
645
- ```bash
646
- export RUBYCRAWL_NODE_BIN=/usr/bin/node # Custom Node.js path
647
- export RUBYCRAWL_NODE_LOG=/var/log/rubycrawl.log # Service logs
648
- ```
649
441
 
650
442
  ### Docker Example
651
443
 
652
444
  ```dockerfile
653
445
  FROM ruby:3.2
654
446
 
655
- # Install Node.js LTS
656
- RUN curl -fsSL https://deb.nodesource.com/setup_lts.x | bash - \
657
- && apt-get install -y nodejs
658
-
659
- # Install system dependencies for Playwright
660
- RUN npx playwright install-deps
447
+ # Install Chrome
448
+ RUN apt-get update && apt-get install -y \
449
+ chromium \
450
+ --no-install-recommends \
451
+ && rm -rf /var/lib/apt/lists/*
661
452
 
662
453
  WORKDIR /app
663
454
  COPY Gemfile* ./
664
455
  RUN bundle install
665
456
 
666
- # Install Playwright browsers
667
- RUN bundle exec rake rubycrawl:install
668
-
669
457
  COPY . .
670
458
  CMD ["rails", "server"]
671
459
  ```
672
460
 
673
- ### Heroku Deployment
674
-
675
- Add the Node.js buildpack:
461
+ Ferrum will detect `chromium` automatically. To specify a custom path:
676
462
 
677
- ```bash
678
- heroku buildpacks:add heroku/nodejs
679
- heroku buildpacks:add heroku/ruby
680
- ```
681
-
682
- Add to `package.json` in your Rails root:
683
-
684
- ```json
685
- {
686
- "engines": {
687
- "node": "18.x"
688
- }
689
- }
463
+ ```ruby
464
+ RubyCrawl.configure(
465
+ browser_options: { "browser-path": "/usr/bin/chromium" }
466
+ )
690
467
  ```
691
468
 
692
- ## How It Works
469
+ ## Architecture
693
470
 
694
- RubyCrawl uses a simple architecture:
471
+ RubyCrawl uses a single-process architecture:
695
472
 
696
- - **Ruby Gem** provides the public API and handles orchestration
697
- - **Node.js Service** (bundled, auto-started) manages Playwright browsers
698
- - Communication via HTTP/JSON on localhost
473
+ ```
474
+ RubyCrawl (public API)
475
+
476
+ Browser (lib/rubycrawl/browser.rb) ← Ferrum wrapper
477
+
478
+ Ferrum::Browser ← Chrome DevTools Protocol (pure Ruby)
479
+
480
+ Chromium ← headless browser
481
+ ```
699
482
 
700
- This design keeps things stable and easy to debug. The browser runs in a separate process, so crashes won't affect your Ruby application.
483
+ - Chrome launches once lazily and is reused across all crawls
484
+ - Each crawl gets an isolated page context (own cookies/storage)
485
+ - JS extraction runs inside the browser via `page.evaluate()`
486
+ - No separate processes, no HTTP boundary, no Node.js
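The launch-once, reuse-everywhere pattern described above can be sketched generically. In this sketch a plain Ruby wrapper with an injectable `launcher` stands in for the real Ferrum-backed `Browser` class; the names are illustrative, not the gem's actual internals:

```ruby
# Sketch of the lazy launch-once, reuse pattern.
# `launcher` stands in for something like Ferrum::Browser.new;
# it is only invoked on first use, then the result is reused.
class BrowserPool
  def initialize(&launcher)
    @launcher = launcher
    @mutex = Mutex.new
    @browser = nil
  end

  # Lazily launches the browser on first access, then reuses it.
  def browser
    @mutex.synchronize { @browser ||= @launcher.call }
  end

  # Each crawl gets a fresh page; the page is always cleaned up,
  # even if the block raises.
  def with_page
    page = browser.create_page
    yield page
  ensure
    page&.close
  end
end
```

Because the browser is memoized behind a mutex, concurrent callers share one Chrome process while each crawl still gets its own page.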
701
487
 
702
- ## Performance Tips
488
+ ## Performance
703
489
 
704
- - **Resource blocking**: Keep `block_resources: true` (default) for 2-3x faster crawls when you don't need images/CSS
490
+ - **Resource blocking**: Enable `block_resources: true` to skip images/fonts/CSS for 2-3x faster crawls
705
491
  - **Wait strategy**: Use `wait_until: "load"` for static sites, `"networkidle"` for SPAs
706
- - **Concurrency**: Use background jobs (Sidekiq, etc.) for parallel crawling
707
- - **Browser reuse**: The first crawl is slower (~2s) due to browser launch; subsequent crawls are much faster (~500ms)
492
+ - **Concurrency**: Use background jobs (Sidekiq, GoodJob, etc.) for parallel crawling
493
+ - **Browser reuse**: The first crawl is slower (~2s) due to Chrome launch; subsequent crawls are much faster (~200-500ms)
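The background-job fan-out recommended above can be illustrated with a stdlib thread pool. This is a sketch only; in production you would enqueue one job per URL in Sidekiq or GoodJob rather than managing threads yourself, and `crawl` here is any callable, e.g. `->(url) { RubyCrawl.crawl(url) }`:

```ruby
# Crawl a list of URLs with a fixed number of worker threads.
def crawl_in_parallel(urls, workers: 4, crawl:)
  queue = Queue.new
  urls.each { |u| queue << u }
  results = Queue.new

  threads = Array.new(workers) do
    Thread.new do
      loop do
        url = begin
          queue.pop(true) # non-blocking; raises ThreadError when empty
        rescue ThreadError
          break
        end
        results << [url, crawl.call(url)]
      end
    end
  end

  threads.each(&:join)
  Array.new(results.size) { results.pop }.to_h
end
```

A dedicated job queue gives you retries, scheduling, and visibility for free, which is why the README recommends it over ad-hoc threads.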
708
494
 
709
495
  ## Development
710
496
 
711
- Want to contribute? Check out the [contributor guidelines](.github/copilot-instructions.md).
712
-
713
497
  ```bash
714
- # Setup
715
498
  git clone git@github.com:craft-wise/rubycrawl.git
716
499
  cd rubycrawl
717
500
  bin/setup
718
501
 
719
- # Run tests
502
+ # Run unit tests (no browser required)
720
503
  bundle exec rspec
721
504
 
505
+ # Run integration tests (requires Chrome)
506
+ INTEGRATION=1 bundle exec rspec
507
+
722
508
  # Manual testing
723
509
  bin/console
724
510
  > RubyCrawl.crawl("https://example.com")
511
+ > RubyCrawl.crawl("https://example.com").clean_text
512
+ > RubyCrawl.crawl("https://example.com").clean_markdown
725
513
  ```
726
514
 
727
515
  ## Contributing
728
516
 
729
517
  Contributions are welcome! Please read our [contribution guidelines](.github/copilot-instructions.md) first.
730
518
 
731
- ### Development Philosophy
732
-
733
519
  - **Simplicity over cleverness**: Prefer clear, explicit code
734
520
  - **Stability over speed**: Correctness first, optimization second
735
- - **Ruby-first**: Hide Node.js/Playwright complexity from users
736
- - **No vendor lock-in**: Pure open source, no SaaS dependencies
737
-
738
- ## Why Choose RubyCrawl?
739
-
740
- RubyCrawl stands out in the Ruby ecosystem with its unique combination of features:
741
-
742
- ### 🎯 **Built for Ruby Developers**
743
-
744
- - **Idiomatic Ruby API** — Feels natural to Rubyists, no need to learn Playwright
745
- - **Rails-first design** — Generators, initializers, and ActiveJob integration out of the box
746
- - **Modular architecture** — Clean, testable code following Ruby best practices
747
-
748
- ### 🚀 **Production-Grade Reliability**
749
-
750
- - **Automatic retry** with exponential backoff for transient failures
751
- - **Smart error handling** with custom exception hierarchy
752
- - **Process isolation** — Browser crashes don't affect your Ruby application
753
- - **Battle-tested** — Built on Playwright's proven browser automation
754
-
755
- ### 💎 **Developer Experience**
756
-
757
- - **Zero configuration** — Works immediately after installation
758
- - **Lazy loading** — Markdown conversion only when you need it
759
- - **Smart URL handling** — Automatic normalization and deduplication
760
- - **Comprehensive docs** — Clear examples for common use cases
761
-
762
- ### 🌐 **Rich Feature Set**
763
-
764
- - ✅ JavaScript-enabled crawling (SPAs, AJAX, dynamic content)
765
- - ✅ Multi-page crawling with BFS algorithm
766
- - ✅ Link extraction with metadata (url, text, title, rel)
767
- - ✅ Markdown conversion (GitHub-flavored)
768
- - ✅ Metadata extraction (OG tags, Twitter cards, etc.)
769
- - ✅ Resource blocking for 2-3x performance boost
770
-
771
- ### 📊 **Perfect for Modern Use Cases**
772
-
773
- - **RAG applications** — Build AI knowledge bases from documentation
774
- - **Data aggregation** — Extract structured data from multiple pages
775
- - **Content migration** — Convert sites to Markdown for static generators
776
- - **SEO analysis** — Extract metadata and link structures
777
- - **Testing** — Verify deployed site content and structure
521
+ - **Hide complexity**: Users should never need to know Ferrum exists
778
522
 
779
523
  ## License
780
524
 
@@ -782,7 +526,7 @@ The gem is available as open source under the terms of the [MIT License](LICENSE
782
526
 
783
527
  ## Credits
784
528
 
785
- Built with [Playwright](https://playwright.dev/) by Microsoft the industry-standard browser automation framework.
529
+ Built with [Ferrum](https://github.com/rubycdp/ferrum) pure Ruby Chrome DevTools Protocol client.
786
530
 
787
531
  Powered by [reverse_markdown](https://github.com/xijo/reverse_markdown) for GitHub-flavored Markdown conversion.
788
532
 
@@ -791,12 +535,3 @@ Powered by [reverse_markdown](https://github.com/xijo/reverse_markdown) for GitH
791
535
  - **Issues**: [GitHub Issues](https://github.com/craft-wise/rubycrawl/issues)
792
536
  - **Discussions**: [GitHub Discussions](https://github.com/craft-wise/rubycrawl/discussions)
793
537
  - **Email**: ganesh.navale@zohomail.in
794
-
795
- ## Acknowledgments
796
-
797
- Special thanks to:
798
-
799
- - [Microsoft Playwright](https://playwright.dev/) team for the robust, production-grade browser automation framework
800
- - The Ruby community for building an ecosystem that values developer happiness and code clarity
801
- - The Node.js community for excellent tooling and libraries that make cross-language integration seamless
802
- - Open source contributors worldwide who make projects like this possible