rubycrawl 0.1.3 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.md CHANGED
@@ -1,32 +1,56 @@
1
- # rubycrawl
1
+ # RubyCrawl 🎭
2
2
 
3
- [![Gem Version](https://badge.fury.io/rb/rubycrawl.svg)](https://badge.fury.io/rb/rubycrawl)
3
+ [![Gem Version](https://badge.fury.io/rb/rubycrawl.svg)](https://rubygems.org/gems/rubycrawl)
4
4
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
5
+ [![Ruby](https://img.shields.io/badge/ruby-%3E%3D%203.0-red.svg)](https://www.ruby-lang.org/)
5
6
 
6
- **Playwright-based web crawler for Ruby** — Inspired by [crawl4ai](https://github.com/unclecode/crawl4ai) (Python), designed idiomatically for Ruby with production-ready features.
7
+ **Production-ready web crawler for Ruby powered by Ferrum** — Full JavaScript rendering via Chrome DevTools Protocol, with first-class Rails support and no Node.js dependency.
7
8
 
8
- RubyCrawl provides accurate, JavaScript-enabled web scraping using Playwright's battle-tested browser automation, wrapped in a clean Ruby API. Perfect for extracting content from modern SPAs and dynamic websites.
9
+ RubyCrawl provides **accurate, JavaScript-enabled web scraping** using a pure Ruby browser automation stack. Perfect for extracting content from modern SPAs, dynamic websites, and building RAG knowledge bases.
10
+
11
+ **Why RubyCrawl?**
12
+
13
+ - ✅ **Real browser** — Handles JavaScript, AJAX, and SPAs correctly
14
+ - ✅ **Pure Ruby** — No Node.js, no npm, no external processes to manage
15
+ - ✅ **Zero config** — Works out of the box, no Ferrum knowledge needed
16
+ - ✅ **Production-ready** — Auto-retry, error handling, resource optimization
17
+ - ✅ **Multi-page crawling** — BFS algorithm with smart URL deduplication
18
+ - ✅ **Rails-friendly** — Generators, initializers, and ActiveJob integration
19
+
20
+ ```ruby
21
+ # One line to crawl any JavaScript-heavy site
22
+ result = RubyCrawl.crawl("https://docs.example.com")
23
+
24
+ result.html # Full HTML with JS rendered
25
+ result.clean_text # Noise-stripped plain text (no nav/footer/ads)
26
+ result.clean_markdown # Markdown ready for RAG pipelines
27
+ result.links # All links with url, text, title, rel
28
+ result.metadata # Title, description, OG tags, etc.
29
+ ```
9
30
 
10
31
  ## Features
11
32
 
12
- - **Playwright-powered**: Real browser automation for JavaScript-heavy sites
13
- - **Production-ready**: Designed for Rails apps and production environments
14
- - **Simple API**: Clean, minimal Ruby interface — zero Playwright knowledge required
15
- - **Resource optimization**: Built-in resource blocking for faster crawls
16
- - **Auto-managed browsers**: Browser process reuse and automatic lifecycle management
17
- - **Content extraction**: HTML, links, and Markdown conversion
18
- - **Multi-page crawling**: BFS crawler with depth limits and deduplication
33
+ - **Pure Ruby**: Ferrum drives Chromium directly via CDP — no Node.js or npm required
34
+ - **Production-ready**: Designed for Rails apps with auto-retry and exponential backoff
35
+ - **Simple API**: Clean Ruby interface — zero Ferrum or CDP knowledge required
36
+ - **Resource optimization**: Built-in resource blocking for 2-3x faster crawls
37
+ - **Auto-managed browsers**: Lazy Chrome singleton, isolated page per crawl
38
+ - **Content extraction**: HTML, plain text, clean HTML, Markdown (lazy), links, metadata
39
+ - **Multi-page crawling**: BFS crawler with configurable depth limits and URL deduplication
40
+ - **Smart URL handling**: Automatic normalization, tracking parameter removal, same-host filtering
19
41
  - **Rails integration**: First-class Rails support with generators and initializers
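The "smart URL handling" above (normalization, tracking-parameter removal) can be sketched with Ruby's stdlib `URI` — a simplified illustration, not the gem's actual implementation:

```ruby
require "uri"

TRACKING_PARAMS = %w[utm_source utm_medium utm_campaign utm_term utm_content fbclid gclid].freeze

# Normalize a URL so trivially different forms deduplicate to the same key.
def normalize_url(raw)
  uri = URI.parse(raw)
  uri.fragment = nil                  # "#section" never changes the page
  uri.host = uri.host&.downcase
  if uri.query
    # Drop common tracking parameters; keep the rest in a stable order
    kept = URI.decode_www_form(uri.query).reject { |k, _| TRACKING_PARAMS.include?(k) }
    uri.query = kept.empty? ? nil : URI.encode_www_form(kept.sort)
  end
  # Remove trailing slashes on the path (but keep the root "/")
  uri.path = uri.path.sub(%r{/+\z}, "") unless uri.path == "/"
  uri.to_s
end
```

With this, `https://Example.com/page/?utm_source=x&a=1#top` and `https://example.com/page?a=1` collapse to the same visited-set entry.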
20
42
 
21
43
  ## Table of Contents
22
44
 
23
45
  - [Installation](#installation)
24
46
  - [Quick Start](#quick-start)
47
+ - [Use Cases](#use-cases)
25
48
  - [Usage](#usage)
26
49
  - [Basic Crawling](#basic-crawling)
27
50
  - [Multi-Page Crawling](#multi-page-crawling)
28
51
  - [Configuration](#configuration)
29
52
  - [Result Object](#result-object)
53
+ - [Error Handling](#error-handling)
30
54
  - [Rails Integration](#rails-integration)
31
55
  - [Production Deployment](#production-deployment)
32
56
  - [Architecture](#architecture)
@@ -40,7 +64,7 @@ RubyCrawl provides accurate, JavaScript-enabled web scraping using Playwright's
40
64
  ### Requirements
41
65
 
42
66
  - **Ruby** >= 3.0
43
- - **Node.js** LTS (v18+ recommended) required for the bundled Playwright service
67
+ - **Chrome or Chromium** detected automatically by Ferrum (must be installed on the system)
44
68
 
45
69
  ### Add to Gemfile
46
70
 
@@ -54,9 +78,9 @@ Then install:
54
78
  bundle install
55
79
  ```
56
80
 
57
- ### Install Playwright browsers
81
+ ### Install Chrome
58
82
 
59
- After bundling, install the Playwright browsers:
83
+ Ferrum detects an installed Chrome automatically. Run the install task to verify Chrome is available and generate a Rails initializer:
60
84
 
61
85
  ```bash
62
86
  bundle exec rake rubycrawl:install
@@ -64,9 +88,10 @@ bundle exec rake rubycrawl:install
64
88
 
65
89
  This command:
66
90
 
67
- - Installs Node.js dependencies in the bundled `node/` directory
68
- - Downloads Playwright browsers (Chromium, Firefox, WebKit)
69
- - Creates a Rails initializer (if using Rails)
91
+ - Checks for Chrome/Chromium in your PATH
92
+ - Creates a Rails initializer (if using Rails)
93
+
94
+ **Note:** If Chrome is not in your PATH, install it via your system package manager or download from [google.com/chrome](https://www.google.com/chrome/).
70
95
 
71
96
  ## Quick Start
72
97
 
@@ -77,27 +102,37 @@ require "rubycrawl"
77
102
  result = RubyCrawl.crawl("https://example.com")
78
103
 
79
104
  # Access extracted content
80
- puts result.html # Raw HTML content
81
- puts result.markdown # Converted to Markdown
82
- puts result.links # Extracted links from the page
83
- puts result.metadata # Status code, final URL, etc.
105
+ result.final_url # Final URL after redirects
106
+ result.clean_text # Noise-stripped plain text (no nav/footer/ads)
107
+ result.clean_html # Noise-stripped HTML (same noise removed as clean_text)
108
+ result.raw_text # Full body.innerText (unfiltered)
109
+ result.html # Full raw HTML content
110
+ result.links # Extracted links with url, text, title, rel
111
+ result.metadata # Title, description, OG tags, etc.
112
+ result.clean_markdown # Markdown converted from clean_html (lazy — first access only)
84
113
  ```
85
114
 
115
+ ## Use Cases
116
+
117
+ RubyCrawl is perfect for:
118
+
119
+ - **RAG applications**: Build knowledge bases for LLM/AI applications by crawling documentation sites
120
+ - **Data aggregation**: Crawl product catalogs, job listings, or news articles
121
+ - **SEO analysis**: Extract metadata, links, and content structure
122
+ - **Content migration**: Convert existing sites to Markdown for static site generators
123
+ - **Documentation scraping**: Create local copies of documentation with preserved links
124
+
86
125
  ## Usage
87
126
 
88
127
  ### Basic Crawling
89
128
 
90
- The simplest way to crawl a URL:
91
-
92
129
  ```ruby
93
130
  result = RubyCrawl.crawl("https://example.com")
94
131
 
95
- # Access the results
96
- result.html # => "<html>...</html>"
97
- result.markdown # => "# Example Domain\n\nThis domain is..." (lazy-loaded)
98
- result.links # => [{ "url" => "https://...", "text" => "More info" }, ...]
99
- result.metadata # => { "status" => 200, "final_url" => "https://example.com" }
100
- result.text # => "" (coming soon)
132
+ result.html # => "<html>...</html>"
133
+ result.clean_text # => "Example Domain\n\nThis domain is..." (no nav/ads)
134
+ result.raw_text # => "Example Domain\nThis domain is..." (full body text)
135
+ result.metadata # => { "final_url" => "https://example.com", "title" => "..." }
101
136
  ```
102
137
 
103
138
  ### Multi-Page Crawling
@@ -109,50 +144,83 @@ Crawl an entire site following links with BFS (breadth-first search):
109
144
  RubyCrawl.crawl_site("https://example.com", max_pages: 100, max_depth: 3) do |page|
110
145
  # Each page is yielded as it's crawled (streaming)
111
146
  puts "Crawled: #{page.url} (depth: #{page.depth})"
112
-
147
+
113
148
  # Save to database
114
149
  Page.create!(
115
- url: page.url,
116
- html: page.html,
117
- markdown: page.markdown,
118
- depth: page.depth
150
+ url: page.url,
151
+ html: page.html,
152
+ markdown: page.clean_markdown,
153
+ depth: page.depth
119
154
  )
120
155
  end
121
156
  ```
122
157
 
158
+ **Real-world example: Building a RAG knowledge base**
159
+
160
+ ```ruby
161
+ require "rubycrawl"
162
+
163
+ RubyCrawl.configure(
164
+ wait_until: "networkidle", # Ensure JS content loads
165
+ block_resources: true # Skip images/fonts for speed
166
+ )
167
+
168
+ pages_crawled = RubyCrawl.crawl_site(
169
+ "https://docs.example.com",
170
+ max_pages: 500,
171
+ max_depth: 5,
172
+ same_host_only: true
173
+ ) do |page|
174
+ VectorDB.upsert(
175
+ id: Digest::SHA256.hexdigest(page.url),
176
+ content: page.clean_markdown,
177
+ metadata: {
178
+ url: page.url,
179
+ title: page.metadata["title"],
180
+ depth: page.depth
181
+ }
182
+ )
183
+ end
184
+
185
+ puts "Indexed #{pages_crawled} pages"
186
+ ```
187
+
123
188
  #### Multi-Page Options
124
189
 
125
- | Option | Default | Description |
126
- |--------|---------|-------------|
127
- | `max_pages` | 50 | Maximum number of pages to crawl |
128
- | `max_depth` | 3 | Maximum link depth from start URL |
129
- | `same_host_only` | true | Only follow links on the same domain |
130
- | `wait_until` | inherited | Page load strategy |
131
- | `block_resources` | inherited | Block images/fonts/CSS |
190
+ | Option | Default | Description |
191
+ | ----------------- | --------- | ------------------------------------ |
192
+ | `max_pages` | 50 | Maximum number of pages to crawl |
193
+ | `max_depth` | 3 | Maximum link depth from start URL |
194
+ | `same_host_only` | true | Only follow links on the same domain |
195
+ | `wait_until` | inherited | Page load strategy |
196
+ | `block_resources` | inherited | Block images/fonts/CSS |
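Conceptually, the BFS traversal behind `crawl_site` is a queue of `[url, depth]` pairs plus a visited set — a simplified model, where `fetch_links` stands in for the real "crawl page, extract links" step:

```ruby
require "set"

# Breadth-first crawl: visit pages level by level, never the same URL twice.
def bfs_crawl(start_url, max_pages:, max_depth:, fetch_links:)
  queue   = [[start_url, 0]]
  visited = Set.new([start_url])
  crawled = []

  until queue.empty? || crawled.size >= max_pages
    url, depth = queue.shift
    crawled << url
    next if depth >= max_depth          # don't follow links past max_depth

    fetch_links.call(url).each do |link|
      next if visited.include?(link)    # deduplication
      visited << link
      queue << [link, depth + 1]
    end
  end

  crawled
end
```

Because the queue is FIFO, shallow pages are always crawled before deeper ones, so `max_pages` spends its budget near the start URL first.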
132
197
 
133
198
  #### Page Result Object
134
199
 
135
200
  The block receives a `PageResult` with:
136
201
 
137
202
  ```ruby
138
- page.url # String: Final URL after redirects
139
- page.html # String: Full HTML content
140
- page.markdown # String: Lazy-converted Markdown
141
- page.links # Array: URLs extracted from page
142
- page.metadata # Hash: HTTP status, final URL, etc.
143
- page.depth # Integer: Link depth from start URL
203
+ page.url # String: Final URL after redirects
204
+ page.html # String: Full raw HTML content
205
+ page.clean_html # String: Noise-stripped HTML (no nav/header/footer/ads)
206
+ page.clean_text # String: Noise-stripped plain text (derived from clean_html)
207
+ page.raw_text # String: Full body.innerText (unfiltered)
208
+ page.clean_markdown # String: Lazy-converted Markdown from clean_html
209
+ page.links # Array: URLs extracted from page
210
+ page.metadata # Hash: final_url, title, OG tags, etc.
211
+ page.depth # Integer: Link depth from start URL
144
212
  ```
145
213
 
146
214
  ### Configuration
147
215
 
148
216
  #### Global Configuration
149
217
 
150
- Set default options that apply to all crawls:
151
-
152
218
  ```ruby
153
219
  RubyCrawl.configure(
154
- wait_until: "networkidle", # Wait until network is idle
155
- block_resources: true # Block images, fonts, CSS for speed
220
+ wait_until: "networkidle",
221
+ block_resources: true,
222
+ timeout: 60,
223
+ headless: true
156
224
  )
157
225
 
158
226
  # All subsequent crawls use these defaults
@@ -161,8 +229,6 @@ result = RubyCrawl.crawl("https://example.com")
161
229
 
162
230
  #### Per-Request Options
163
231
 
164
- Override defaults for specific requests:
165
-
166
232
  ```ruby
167
233
  # Use global defaults
168
234
  result = RubyCrawl.crawl("https://example.com")
@@ -170,36 +236,41 @@ result = RubyCrawl.crawl("https://example.com")
170
236
  # Override for this request only
171
237
  result = RubyCrawl.crawl(
172
238
  "https://example.com",
173
- wait_until: "domcontentloaded",
239
+ wait_until: "domcontentloaded",
174
240
  block_resources: false
175
241
  )
176
242
  ```
177
243
 
178
244
  #### Configuration Options
179
245
 
180
- | Option | Values | Default | Description |
181
- | ----------------- | ----------------------------------------------- | -------- | ------------------------------------------------- |
182
- | `wait_until` | `"load"`, `"domcontentloaded"`, `"networkidle"` | `"load"` | When to consider page loaded |
183
- | `block_resources` | `true`, `false` | `true` | Block images, fonts, CSS, media for faster crawls |
246
+ | Option | Values | Default | Description |
247
+ | ----------------- | ----------------------------------------------------------- | ------- | --------------------------------------------------- |
248
+ | `wait_until` | `"load"`, `"domcontentloaded"`, `"networkidle"`, `"commit"` | `nil` | When to consider page loaded (nil = Ferrum default) |
249
+ | `block_resources` | `true`, `false` | `nil` | Block images, fonts, CSS, media for faster crawls |
250
+ | `max_attempts` | Integer | `3` | Total number of attempts (including the first) |
251
+ | `timeout` | Integer (seconds) | `30` | Browser navigation timeout |
252
+ | `headless` | `true`, `false` | `true` | Run Chrome headlessly |
184
253
 
185
254
  **Wait strategies explained:**
186
255
 
187
- - `load` — Wait for the load event (fastest, good for static sites)
188
- - `domcontentloaded` — Wait for DOM ready (medium speed)
189
- - `networkidle` — Wait until no network requests for 500ms (slowest, best for SPAs)
256
+ - `load` — Wait for the load event (good for static sites)
257
+ - `domcontentloaded` — Wait for DOM ready (faster)
258
+ - `networkidle` — Wait until no network requests for 500ms (best for SPAs)
259
+ - `commit` — Wait until the first response bytes are received (fastest)
190
260
 
191
261
  ### Result Object
192
262
 
193
- The crawl result is a `RubyCrawl::Result` object with these attributes:
194
-
195
263
  ```ruby
196
264
  result = RubyCrawl.crawl("https://example.com")
197
265
 
198
- result.html # String: Raw HTML content from page
199
- result.markdown # String: Markdown conversion (lazy-loaded on first access)
200
- result.links # Array: Extracted links with url and text
201
- result.text # String: Plain text (coming soon)
202
- result.metadata # Hash: Comprehensive metadata (see below)
266
+ result.html # String: Full raw HTML
267
+ result.clean_html # String: Noise-stripped HTML (nav/header/footer/ads removed)
268
+ result.clean_text # String: Plain text derived from clean_html — ideal for RAG
269
+ result.raw_text # String: Full body.innerText (unfiltered)
270
+ result.clean_markdown # String: Markdown from clean_html (lazy — computed on first access)
271
+ result.links # Array: Extracted links with url/text/title/rel
272
+ result.metadata # Hash: See below
273
+ result.final_url # String: Shortcut for metadata['final_url']
203
274
  ```
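Conceptually, `clean_html` removes boilerplate elements before any text or Markdown is derived. A crude regex-based illustration — the real implementation would use a proper HTML parser:

```ruby
NOISE_TAGS = %w[nav header footer aside script style].freeze

# Remove boilerplate elements from an HTML string.
# Regex is for illustration only; nested or malformed markup
# needs a real HTML parser (e.g. Nokogiri).
def crude_clean_html(html)
  NOISE_TAGS.reduce(html) do |doc, tag|
    doc.gsub(%r{<#{tag}\b[^>]*>.*?</#{tag}>}mi, "")
  end
end
```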
204
275
 
205
276
  #### Links Format
@@ -207,101 +278,89 @@ result.metadata # Hash: Comprehensive metadata (see below)
207
278
  ```ruby
208
279
  result.links
209
280
  # => [
210
- # { "url" => "https://example.com/about", "text" => "About Us" },
211
- # { "url" => "https://example.com/contact", "text" => "Contact" },
212
- # ...
281
+ # { "url" => "https://example.com/about", "text" => "About", "title" => nil, "rel" => nil },
282
+ # { "url" => "https://example.com/contact", "text" => "Contact", "title" => nil, "rel" => "nofollow" },
213
283
  # ]
214
284
  ```
215
285
 
286
+ URLs are automatically resolved to absolute form by the browser.
287
+
216
288
  #### Markdown Conversion
217
289
 
218
- Markdown is **lazy-loaded** — conversion only happens when you access `.markdown`:
290
+ Markdown is **lazy** — conversion only happens on first access of `.clean_markdown`:
219
291
 
220
292
  ```ruby
221
- result = RubyCrawl.crawl(url)
222
- result.html # No overhead
223
- result.markdown # ⬅️ Conversion happens here (first call only)
224
- result.markdown # ✅ Cached, instant
293
+ result.clean_html # Already available, no overhead
294
+ result.clean_markdown # Converts clean_html → Markdown here (first call only)
295
+ result.clean_markdown # Cached, instant on subsequent calls
225
296
  ```
226
297
 
227
298
  Uses [reverse_markdown](https://github.com/xijo/reverse_markdown) with GitHub-flavored output.
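The lazy behaviour is plain Ruby memoization — roughly the following sketch, where a stand-in converter replaces the real reverse_markdown call so the caching is visible:

```ruby
# Stand-in for RubyCrawl's result object: conversion runs once, on first access.
class LazyResult
  def initialize(clean_html, converter:)
    @clean_html = clean_html
    @converter  = converter
  end

  def clean_markdown
    # ||= caches the converted Markdown after the first call
    @clean_markdown ||= @converter.call(@clean_html)
  end
end
```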
228
299
 
229
300
  #### Metadata Fields
230
301
 
231
- The `metadata` hash includes HTTP and HTML metadata:
232
-
233
302
  ```ruby
234
303
  result.metadata
235
304
  # => {
236
- # "status" => 200, # HTTP status code
237
- # "final_url" => "https://...", # Final URL after redirects
238
- # "title" => "Page Title", # <title> tag
239
- # "description" => "...", # Meta description
240
- # "keywords" => "ruby, web", # Meta keywords
241
- # "author" => "Author Name", # Meta author
242
- # "og_title" => "...", # Open Graph title
243
- # "og_description" => "...", # Open Graph description
244
- # "og_image" => "https://...", # Open Graph image
245
- # "og_url" => "https://...", # Open Graph URL
246
- # "og_type" => "website", # Open Graph type
247
- # "twitter_card" => "summary", # Twitter card type
248
- # "twitter_title" => "...", # Twitter title
249
- # "twitter_description" => "...", # Twitter description
250
- # "twitter_image" => "https://...",# Twitter image
251
- # "canonical" => "https://...", # Canonical URL
252
- # "lang" => "en", # Page language
253
- # "charset" => "UTF-8" # Character encoding
305
+ # "final_url" => "https://example.com",
306
+ # "title" => "Page Title",
307
+ # "description" => "...",
308
+ # "keywords" => "ruby, web",
309
+ # "author" => "Author Name",
310
+ # "og_title" => "...",
311
+ # "og_description" => "...",
312
+ # "og_image" => "https://...",
313
+ # "og_url" => "https://...",
314
+ # "og_type" => "website",
315
+ # "twitter_card" => "summary",
316
+ # "twitter_title" => "...",
317
+ # "twitter_description" => "...",
318
+ # "twitter_image" => "https://...",
319
+ # "canonical" => "https://...",
320
+ # "lang" => "en",
321
+ # "charset" => "UTF-8"
254
322
  # }
255
323
  ```
256
324
 
257
- Note: All HTML metadata fields may be `null` if not present on the page.
258
-
259
325
  ### Error Handling
260
326
 
261
- RubyCrawl provides specific exception classes for different error scenarios:
262
-
263
327
  ```ruby
264
328
  begin
265
329
  result = RubyCrawl.crawl(url)
266
330
  rescue RubyCrawl::ConfigurationError => e
267
- # Invalid URL or configuration
268
- puts "Configuration error: #{e.message}"
331
+ # Invalid URL or option value
269
332
  rescue RubyCrawl::TimeoutError => e
270
- # Page load timeout or network timeout
271
- puts "Timeout: #{e.message}"
333
+ # Page load timed out
272
334
  rescue RubyCrawl::NavigationError => e
273
- # Page navigation failed (404, DNS error, SSL error, etc.)
274
- puts "Navigation failed: #{e.message}"
335
+ # Navigation failed (404, DNS error, SSL error)
275
336
  rescue RubyCrawl::ServiceError => e
276
- # Node service unavailable or crashed
277
- puts "Service error: #{e.message}"
337
+ # Browser failed to start or crashed
278
338
  rescue RubyCrawl::Error => e
279
339
  # Catch-all for any RubyCrawl error
280
- puts "Crawl error: #{e.message}"
281
340
  end
282
341
  ```
283
342
 
284
343
  **Exception Hierarchy:**
285
- - `RubyCrawl::Error` (base class)
286
- - `RubyCrawl::ConfigurationError` - Invalid URL or configuration
287
- - `RubyCrawl::TimeoutError` - Timeout during crawl
288
- - `RubyCrawl::NavigationError` - Page navigation failed
289
- - `RubyCrawl::ServiceError` - Node service issues
290
344
 
291
- **Automatic Retry:** RubyCrawl automatically retries transient failures (service errors, timeouts) up to 3 times with exponential backoff (2s, 4s, 8s). Configure with:
345
+ ```
346
+ RubyCrawl::Error
347
+ ├── ConfigurationError — invalid URL or option value
348
+ ├── TimeoutError — page load timed out
349
+ ├── NavigationError — navigation failed (HTTP error, DNS, SSL)
350
+ └── ServiceError — browser failed to start or crashed
351
+ ```
352
+
353
+ **Automatic Retry:** `ServiceError` and `TimeoutError` are retried with exponential backoff. `NavigationError` and `ConfigurationError` are not retried (they won't succeed on retry).
292
354
 
293
355
  ```ruby
294
- RubyCrawl.configure(max_retries: 5)
295
- # or per-request
296
- RubyCrawl.crawl(url, retries: 1) # Disable retry
356
+ RubyCrawl.configure(max_attempts: 5) # 5 total attempts
357
+ RubyCrawl.crawl(url, max_attempts: 1) # Disable retries
297
358
  ```
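Under the hood, that retry policy amounts to a loop like the sketch below (names are illustrative, not the gem's internals); the sleeper is injectable so the 2s/4s/8s delays can be observed without waiting:

```ruby
ServiceError = Class.new(StandardError) # stand-in for RubyCrawl::ServiceError

# Retry the block with exponential backoff between attempts (2s, 4s, 8s, ...).
def with_retries(max_attempts: 3, sleeper: method(:sleep))
  attempt = 0
  begin
    attempt += 1
    yield
  rescue ServiceError
    raise if attempt >= max_attempts  # out of attempts: surface the error
    sleeper.call(2**attempt)          # 2, 4, 8... seconds
    retry
  end
end
```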
298
359
 
299
360
  ## Rails Integration
300
361
 
301
362
  ### Installation
302
363
 
303
- Run the installer in your Rails app:
304
-
305
364
  ```bash
306
365
  bundle exec rake rubycrawl:install
307
366
  ```
@@ -309,264 +368,157 @@ bundle exec rake rubycrawl:install
309
368
  This creates `config/initializers/rubycrawl.rb`:
310
369
 
311
370
  ```ruby
312
- # frozen_string_literal: true
313
-
314
- # rubycrawl default configuration
315
371
  RubyCrawl.configure(
316
- wait_until: "load",
372
+ wait_until: "load",
317
373
  block_resources: true
318
374
  )
319
375
  ```
320
376
 
321
377
  ### Usage in Rails
322
378
 
379
+ #### Background Jobs with ActiveJob
380
+
323
381
  ```ruby
324
- # In a controller, service, or background job
325
- class ContentScraperJob < ApplicationJob
382
+ class CrawlPageJob < ApplicationJob
383
+ queue_as :crawlers
384
+
385
+ retry_on RubyCrawl::ServiceError, wait: :exponentially_longer, attempts: 5
386
+ retry_on RubyCrawl::TimeoutError, wait: :exponentially_longer, attempts: 3
387
+ discard_on RubyCrawl::ConfigurationError
388
+
326
389
  def perform(url)
327
390
  result = RubyCrawl.crawl(url)
328
391
 
329
- # Save to database
330
- ScrapedContent.create!(
331
- url: url,
332
- html: result.html,
333
- status: result.metadata[:status]
392
+ Page.create!(
393
+ url: result.final_url,
394
+ title: result.metadata['title'],
395
+ content: result.clean_text,
396
+ markdown: result.clean_markdown,
397
+ crawled_at: Time.current
334
398
  )
335
399
  end
336
400
  end
337
401
  ```
338
402
 
403
+ **Multi-page RAG knowledge base:**
404
+
405
+ ```ruby
406
+ class BuildKnowledgeBaseJob < ApplicationJob
407
+ queue_as :crawlers
408
+
409
+ def perform(documentation_url)
410
+ RubyCrawl.crawl_site(documentation_url, max_pages: 500, max_depth: 5) do |page|
411
+ embedding = OpenAI.embed(page.clean_markdown)
412
+
413
+ Document.create!(
414
+ url: page.url,
415
+ title: page.metadata['title'],
416
+ content: page.clean_markdown,
417
+ embedding: embedding,
418
+ depth: page.depth
419
+ )
420
+ end
421
+ end
422
+ end
423
+ ```
424
+
425
+ #### Best Practices
426
+
427
+ 1. **Use background jobs** to avoid blocking web requests
428
+ 2. **Configure retry logic** based on error type
429
+ 3. **Store `clean_markdown`** for RAG applications (preserves heading structure for chunking)
430
+ 4. **Rate limit** external crawling to be respectful
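Rate limiting (point 4) can be as simple as enforcing a minimum gap between requests — a hedged sketch, with clock and sleeper injectable so it is testable without real waiting:

```ruby
# Minimal throttle: guarantees at least `min_interval` seconds between calls.
class Throttle
  def initialize(min_interval:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: method(:sleep))
    @min_interval = min_interval
    @clock        = clock
    @sleeper      = sleeper
    @last         = nil
  end

  def wait!
    if @last
      elapsed = @clock.call - @last
      @sleeper.call(@min_interval - elapsed) if elapsed < @min_interval
    end
    @last = @clock.call
  end
end
```

Usage would be `throttle.wait!` immediately before each `RubyCrawl.crawl(url)` call in a job.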
431
+
339
432
  ## Production Deployment
340
433
 
341
434
  ### Pre-deployment Checklist
342
435
 
343
- 1. **Install Node.js** on your production servers (LTS version recommended)
436
+ 1. **Ensure Chrome is installed** on your production servers
344
437
  2. **Run installer** during deployment:
345
438
  ```bash
346
439
  bundle exec rake rubycrawl:install
347
440
  ```
348
- 3. **Set environment variables** (optional):
349
- ```bash
350
- export RUBYCRAWL_NODE_BIN=/usr/bin/node # Custom Node.js path
351
- export RUBYCRAWL_NODE_LOG=/var/log/rubycrawl.log # Service logs
352
- ```
353
441
 
354
442
  ### Docker Example
355
443
 
356
444
  ```dockerfile
357
445
  FROM ruby:3.2
358
446
 
359
- # Install Node.js LTS
360
- RUN curl -fsSL https://deb.nodesource.com/setup_lts.x | bash - \
361
- && apt-get install -y nodejs
362
-
363
- # Install system dependencies for Playwright
364
- RUN npx playwright install-deps
447
+ # Install Chrome
448
+ RUN apt-get update && apt-get install -y \
449
+ chromium \
450
+ --no-install-recommends \
451
+ && rm -rf /var/lib/apt/lists/*
365
452
 
366
453
  WORKDIR /app
367
454
  COPY Gemfile* ./
368
455
  RUN bundle install
369
456
 
370
- # Install Playwright browsers
371
- RUN bundle exec rake rubycrawl:install
372
-
373
457
  COPY . .
374
458
  CMD ["rails", "server"]
375
459
  ```
376
460
 
377
- ### Heroku Deployment
378
-
379
- Add the Node.js buildpack:
380
-
381
- ```bash
382
- heroku buildpacks:add heroku/nodejs
383
- heroku buildpacks:add heroku/ruby
384
- ```
385
-
386
- Add to `package.json` in your Rails root:
461
+ Ferrum will detect `chromium` automatically. To specify a custom path:
387
462
 
388
- ```json
389
- {
390
- "engines": {
391
- "node": "18.x"
392
- }
393
- }
463
+ ```ruby
464
+ RubyCrawl.configure(
465
+ browser_options: { "browser-path": "/usr/bin/chromium" }
466
+ )
394
467
  ```
395
468
 
396
- ### Performance Tips
397
-
398
- - **Reuse instances**: Use the class-level `RubyCrawl.crawl` method (recommended) rather than creating new instances
399
- - **Resource blocking**: Keep `block_resources: true` for 2-3x faster crawls when you don't need images/CSS
400
- - **Concurrency**: Use background jobs (Sidekiq, etc.) for parallel crawling
401
- - **Browser reuse**: The first crawl is slower due to browser launch; subsequent crawls reuse the process
402
-
403
469
  ## Architecture
404
470
 
405
- RubyCrawl uses a **dual-process architecture**:
471
+ RubyCrawl uses a single-process architecture:
406
472
 
407
473
  ```
408
- ┌─────────────────────────────────────────────┐
409
- Ruby Process (Your Application) │
410
- ┌─────────────────────────────────────┐ │
411
- │ RubyCrawl Gem │ │
412
- │ │ • Public API │ │
413
- │ • Result normalization │ │
414
- │ │ • Error handling │ │
415
- │ └────────────┬────────────────────────┘ │
416
- └───────────────┼─────────────────────────────┘
417
- │ HTTP/JSON (localhost:3344)
418
- ┌───────────────┼─────────────────────────────┐
419
- │ Node.js Process (Auto-started) │
420
- │ ┌────────────┴────────────────────────┐ │
421
- │ │ Playwright Service │ │
422
- │ │ • Browser management │ │
423
- │ │ • Page navigation │ │
424
- │ │ • HTML extraction │ │
425
- │ │ • Resource blocking │ │
426
- │ └─────────────────────────────────────┘ │
427
- └─────────────────────────────────────────────┘
474
+ RubyCrawl (public API)
475
+     ↓
476
+ Browser (lib/rubycrawl/browser.rb) ← Ferrum wrapper
477
+     ↓
478
+ Ferrum::Browser ← Chrome DevTools Protocol (pure Ruby)
479
+     ↓
480
+ Chromium ← headless browser
428
481
  ```
429
482
 
430
- **Why this architecture?**
431
-
432
- - **Separation of concerns**: Ruby handles orchestration, Node handles browsers
433
- - **Stability**: Playwright's official Node.js bindings are most reliable
434
- - **Performance**: Long-running browser process, reused across requests
435
- - **Simplicity**: No C extensions, pure Ruby + bundled Node service
436
-
437
- See [.github/copilot-instructions.md](.github/copilot-instructions.md) for detailed architecture documentation.
483
+ - Chrome launches once lazily and is reused across all crawls
484
+ - Each crawl gets an isolated page context (own cookies/storage)
485
+ - JS extraction runs inside the browser via `page.evaluate()`
486
+ - No separate processes, no HTTP boundary, no Node.js
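The "launches once lazily" behaviour is a classic mutex-guarded singleton — a sketch with a hypothetical `launcher` standing in for "start Chrome via Ferrum":

```ruby
# Thread-safe lazy singleton: the expensive browser launch happens once,
# on first use, no matter how many threads ask for it.
class BrowserManager
  def initialize(launcher:)
    @launcher = launcher   # stand-in for the real Ferrum launch
    @mutex    = Mutex.new
    @browser  = nil
  end

  def browser
    @mutex.synchronize { @browser ||= @launcher.call }
  end
end
```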
438
487
 
439
488
  ## Performance
440
489
 
441
- ### Benchmarks
442
-
443
- Typical crawl times (M1 Mac, fast network):
444
-
445
- | Page Type | First Crawl | Subsequent | Config |
446
- | ----------- | ----------- | ---------- | --------------------------- |
447
- | Static HTML | ~2s | ~500ms | `block_resources: true` |
448
- | SPA (React) | ~3s | ~1.2s | `wait_until: "networkidle"` |
449
- | Heavy site | ~4s | ~2s | `block_resources: false` |
450
-
451
- **Note**: First crawl includes browser launch time (~1.5s). Subsequent crawls reuse the browser.
452
-
453
- ### Optimization Tips
454
-
455
- 1. **Enable resource blocking** for content-only extraction:
456
-
457
- ```ruby
458
- RubyCrawl.configure(block_resources: true)
459
- ```
460
-
461
- 2. **Use appropriate wait strategy**:
462
- - Static sites: `wait_until: "load"`
463
- - SPAs: `wait_until: "networkidle"`
464
-
465
- 3. **Batch processing**: Use background jobs for concurrent crawling:
466
- ```ruby
467
- urls.each { |url| CrawlJob.perform_later(url) }
468
- ```
490
+ - **Resource blocking**: Set `block_resources: true` (unset by default) to skip images/fonts/CSS for 2-3x faster crawls
491
+ - **Wait strategy**: Use `wait_until: "load"` for static sites, `"networkidle"` for SPAs
492
+ - **Concurrency**: Use background jobs (Sidekiq, GoodJob, etc.) for parallel crawling
493
+ - **Browser reuse**: The first crawl is slower (~2s) due to Chrome launch; subsequent crawls are much faster (~200-500ms)
469
494
 
470
495
  ## Development
471
496
 
472
- ### Setup
473
-
474
497
  ```bash
475
498
  git clone git@github.com:craft-wise/rubycrawl.git
476
499
  cd rubycrawl
477
- bin/setup # Installs dependencies and sets up Node service
478
- ```
479
-
480
- ### Running Tests
500
+ bin/setup
481
501
 
482
- ```bash
502
+ # Run unit tests (no browser required)
483
503
  bundle exec rspec
484
- ```
485
-
486
- ### Manual Testing
487
504
 
488
- ```bash
489
- # Terminal 1: Start Node service manually (optional)
490
- cd node
491
- npm start
505
+ # Run integration tests (requires Chrome)
506
+ INTEGRATION=1 bundle exec rspec
492
507
 
493
- # Terminal 2: Ruby console
508
+ # Manual testing
494
509
  bin/console
495
- > result = RubyCrawl.crawl("https://example.com")
496
- > puts result.html
510
+ > result = RubyCrawl.crawl("https://example.com")
511
+ > result.clean_text
512
+ > result.clean_markdown
497
513
  ```
498
514
 
499
- ### Project Structure
500
-
501
- ```
502
- rubycrawl/
503
- ├── lib/
504
- │ ├── rubycrawl.rb # Main gem entry point
505
- │ ├── rubycrawl/
506
- │ │ ├── version.rb # Gem version
507
- │ │ ├── railtie.rb # Rails integration
508
- │ │ └── tasks/
509
- │ │ └── install.rake # Installation task
510
- ├── node/
511
- │ ├── src/
512
- │ │ └── index.js # Playwright HTTP service
513
- │ ├── package.json
514
- │ └── README.md
515
- ├── spec/ # RSpec tests
516
- ├── .github/
517
- │ └── copilot-instructions.md # GitHub Copilot guidelines
518
- ├── CLAUDE.md # Claude AI guidelines
519
- └── README.md
520
- ```
521
-
522
- ## Roadmap
523
-
524
- ### Current (v0.1.0)
525
-
526
- - [x] HTML extraction
527
- - [x] Link extraction
528
- - [x] Markdown conversion (lazy-loaded)
529
- - [x] Multi-page crawling with BFS
530
- - [x] URL normalization and deduplication
531
- - [x] Basic metadata (status, final URL)
532
- - [x] Resource blocking
533
- - [x] Rails integration
534
-
535
- ### Coming Soon
536
-
537
- - [ ] Plain text extraction
538
- - [ ] Screenshot capture
539
- - [ ] Custom JavaScript execution
540
- - [ ] Session/cookie support
541
- - [ ] Proxy support
542
- - [ ] Robots.txt support
543
-
544
515
  ## Contributing
545
516
 
546
517
  Contributions are welcome! Please read our [contribution guidelines](.github/copilot-instructions.md) first.
547
518
 
548
- ### Development Philosophy
549
-
550
519
  - **Simplicity over cleverness**: Prefer clear, explicit code
551
520
  - **Stability over speed**: Correctness first, optimization second
552
- - **Ruby-first**: Hide Node.js/Playwright complexity from users
553
- - **No vendor lock-in**: Pure open source, no SaaS dependencies
554
-
555
- ## Comparison with crawl4ai
556
-
557
- | Feature | crawl4ai (Python) | rubycrawl (Ruby) |
558
- | ------------------- | ----------------- | ---------------- |
559
- | Browser automation | Playwright | Playwright |
560
- | Language | Python | Ruby |
561
- | LLM extraction | ✅ | Planned |
562
- | Markdown extraction | ✅ | ✅ |
563
- | Link extraction | ✅ | ✅ |
564
- | Multi-page crawling | ✅ | ✅ |
565
- | Rails integration | N/A | ✅ |
566
- | Resource blocking | ✅ | ✅ |
567
- | Session management | ✅ | Planned |
568
-
569
- RubyCrawl aims to bring the same level of accuracy and reliability to the Ruby ecosystem.
521
+ - **Hide complexity**: Users should never need to know Ferrum exists
570
522
 
571
523
  ## License
572
524
 
@@ -574,12 +526,12 @@ The gem is available as open source under the terms of the [MIT License](LICENSE
574
526
 
575
527
  ## Credits
576
528
 
577
- Inspired by [crawl4ai](https://github.com/unclecode/crawl4ai) by @unclecode.
529
+ Built with [Ferrum](https://github.com/rubycdp/ferrum), a pure Ruby Chrome DevTools Protocol client.
578
530
 
579
- Built with [Playwright](https://playwright.dev/) by Microsoft.
531
+ Powered by [reverse_markdown](https://github.com/xijo/reverse_markdown) for GitHub-flavored Markdown conversion.
580
532
 
581
533
  ## Support
582
534
 
583
535
  - **Issues**: [GitHub Issues](https://github.com/craft-wise/rubycrawl/issues)
584
- - **Discussions**: [GitHub Discussions](https://github.com/your-org/rubycrawl/discussions)
536
+ - **Discussions**: [GitHub Discussions](https://github.com/craft-wise/rubycrawl/discussions)
585
537
  - **Email**: ganesh.navale@zohomail.in