rubycrawl 0.1.4 → 0.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +167 -432
- data/lib/rubycrawl/browser/extraction.rb +106 -0
- data/lib/rubycrawl/browser.rb +106 -0
- data/lib/rubycrawl/errors.rb +1 -1
- data/lib/rubycrawl/helpers.rb +8 -44
- data/lib/rubycrawl/markdown_converter.rb +2 -2
- data/lib/rubycrawl/result.rb +49 -18
- data/lib/rubycrawl/site_crawler.rb +40 -22
- data/lib/rubycrawl/tasks/install.rake +17 -56
- data/lib/rubycrawl/url_normalizer.rb +5 -1
- data/lib/rubycrawl/version.rb +1 -1
- data/lib/rubycrawl.rb +35 -90
- data/rubycrawl.gemspec +3 -4
- metadata +19 -10
- data/lib/rubycrawl/service_client.rb +0 -108
- data/node/.gitignore +0 -2
- data/node/.npmrc +0 -1
- data/node/README.md +0 -19
- data/node/package-lock.json +0 -72
- data/node/package.json +0 -14
- data/node/src/index.js +0 -389
data/README.md
CHANGED
|
@@ -3,46 +3,45 @@
|
|
|
3
3
|
[](https://rubygems.org/gems/rubycrawl)
|
|
4
4
|
[](https://opensource.org/licenses/MIT)
|
|
5
5
|
[](https://www.ruby-lang.org/)
|
|
6
|
-
[](https://nodejs.org/)
|
|
7
6
|
|
|
8
|
-
**Production-ready web crawler for Ruby powered by
|
|
7
|
+
**Production-ready web crawler for Ruby powered by Ferrum** — Full JavaScript rendering via Chrome DevTools Protocol, with first-class Rails support and no Node.js dependency.
|
|
9
8
|
|
|
10
|
-
RubyCrawl provides **accurate, JavaScript-enabled web scraping** using
|
|
9
|
+
RubyCrawl provides **accurate, JavaScript-enabled web scraping** using a pure Ruby browser automation stack. Perfect for extracting content from modern SPAs, dynamic websites, and building RAG knowledge bases.
|
|
11
10
|
|
|
12
11
|
**Why RubyCrawl?**
|
|
13
12
|
|
|
14
13
|
- ✅ **Real browser** — Handles JavaScript, AJAX, and SPAs correctly
|
|
15
|
-
- ✅ **
|
|
14
|
+
- ✅ **Pure Ruby** — No Node.js, no npm, no external processes to manage
|
|
15
|
+
- ✅ **Zero config** — Works out of the box, no Ferrum knowledge needed
|
|
16
16
|
- ✅ **Production-ready** — Auto-retry, error handling, resource optimization
|
|
17
17
|
- ✅ **Multi-page crawling** — BFS algorithm with smart URL deduplication
|
|
18
18
|
- ✅ **Rails-friendly** — Generators, initializers, and ActiveJob integration
|
|
19
|
-
- ✅ **Modular architecture** — Clean, testable, maintainable codebase
|
|
20
19
|
|
|
21
20
|
```ruby
|
|
22
21
|
# One line to crawl any JavaScript-heavy site
|
|
23
22
|
result = RubyCrawl.crawl("https://docs.example.com")
|
|
24
23
|
|
|
25
24
|
result.html # Full HTML with JS rendered
|
|
26
|
-
result.
|
|
25
|
+
result.clean_text # Noise-stripped plain text (no nav/footer/ads)
|
|
26
|
+
result.clean_markdown # Markdown ready for RAG pipelines
|
|
27
|
+
result.links # All links with url, text, title, rel
|
|
27
28
|
result.metadata # Title, description, OG tags, etc.
|
|
28
29
|
```
|
|
29
30
|
|
|
30
31
|
## Features
|
|
31
32
|
|
|
32
|
-
-
|
|
33
|
-
-
|
|
34
|
-
-
|
|
35
|
-
-
|
|
36
|
-
-
|
|
37
|
-
-
|
|
38
|
-
-
|
|
39
|
-
-
|
|
40
|
-
-
|
|
41
|
-
- **💎 Modular design**: Clean separation of concerns with focused, testable modules
|
|
33
|
+
- **Pure Ruby**: Ferrum drives Chromium directly via CDP — no Node.js or npm required
|
|
34
|
+
- **Production-ready**: Designed for Rails apps with auto-retry and exponential backoff
|
|
35
|
+
- **Simple API**: Clean Ruby interface — zero Ferrum or CDP knowledge required
|
|
36
|
+
- **Resource optimization**: Built-in resource blocking for 2-3x faster crawls
|
|
37
|
+
- **Auto-managed browsers**: Lazy Chrome singleton, isolated page per crawl
|
|
38
|
+
- **Content extraction**: HTML, plain text, clean HTML, Markdown (lazy), links, metadata
|
|
39
|
+
- **Multi-page crawling**: BFS crawler with configurable depth limits and URL deduplication
|
|
40
|
+
- **Smart URL handling**: Automatic normalization, tracking parameter removal, same-host filtering
|
|
41
|
+
- **Rails integration**: First-class Rails support with generators and initializers
|
|
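The "smart URL handling" bullet above (normalization plus tracking-parameter removal) can be sketched in plain Ruby. This is a hypothetical illustration, not the gem's actual code — RubyCrawl's real logic lives in `lib/rubycrawl/url_normalizer.rb`, and the `TRACKING_PARAMS` list here is an assumption:

```ruby
require "uri"

# Hypothetical sketch of URL normalization: drop fragments, strip common
# tracking parameters, and canonicalize the empty path. The parameter
# list is illustrative; the gem's url_normalizer.rb may differ.
TRACKING_PARAMS = %w[utm_source utm_medium utm_campaign utm_term utm_content fbclid gclid].freeze

def normalize_url(url)
  uri = URI.parse(url)
  uri.fragment = nil # "#section" anchors never change page content
  if uri.query
    kept = URI.decode_www_form(uri.query).reject { |k, _| TRACKING_PARAMS.include?(k) }
    uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  end
  uri.path = "/" if uri.path.empty? # treat example.com and example.com/ as one URL
  uri.to_s
end
```

Normalizing before deduplication is what keeps `https://example.com/page?utm_source=x` and `https://example.com/page` from being crawled twice.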
42
42
|
|
|
43
43
|
## Table of Contents
|
|
44
44
|
|
|
45
|
-
- [Features](#features)
|
|
46
45
|
- [Installation](#installation)
|
|
47
46
|
- [Quick Start](#quick-start)
|
|
48
47
|
- [Use Cases](#use-cases)
|
|
@@ -57,18 +56,15 @@ result.metadata # Title, description, OG tags, etc.
|
|
|
57
56
|
- [Architecture](#architecture)
|
|
58
57
|
- [Performance](#performance)
|
|
59
58
|
- [Development](#development)
|
|
60
|
-
- [Project Structure](#project-structure)
|
|
61
59
|
- [Contributing](#contributing)
|
|
62
|
-
- [Why Choose RubyCrawl?](#why-choose-rubycrawl)
|
|
63
60
|
- [License](#license)
|
|
64
|
-
- [Support](#support)
|
|
65
61
|
|
|
66
62
|
## Installation
|
|
67
63
|
|
|
68
64
|
### Requirements
|
|
69
65
|
|
|
70
66
|
- **Ruby** >= 3.0
|
|
71
|
-
- **
|
|
67
|
+
- **Chrome or Chromium** — managed automatically by Ferrum (downloaded on first use)
|
|
72
68
|
|
|
73
69
|
### Add to Gemfile
|
|
74
70
|
|
|
@@ -82,9 +78,9 @@ Then install:
|
|
|
82
78
|
bundle install
|
|
83
79
|
```
|
|
84
80
|
|
|
85
|
-
### Install
|
|
81
|
+
### Install Chrome
|
|
86
82
|
|
|
87
|
-
|
|
83
|
+
Ferrum manages Chrome automatically. Run the install task to verify Chrome is available and generate a Rails initializer:
|
|
88
84
|
|
|
89
85
|
```bash
|
|
90
86
|
bundle exec rake rubycrawl:install
|
|
@@ -92,24 +88,10 @@ bundle exec rake rubycrawl:install
|
|
|
92
88
|
|
|
93
89
|
This command:
|
|
94
90
|
|
|
95
|
-
- ✅
|
|
96
|
-
- ✅ Downloads Playwright browsers (Chromium, Firefox, WebKit) — ~300MB download
|
|
91
|
+
- ✅ Checks for Chrome/Chromium in your PATH
|
|
97
92
|
- ✅ Creates a Rails initializer (if using Rails)
|
|
98
93
|
|
|
99
|
-
**Note:**
|
|
100
|
-
|
|
101
|
-
**Troubleshooting installation:**
|
|
102
|
-
|
|
103
|
-
```bash
|
|
104
|
-
# If installation fails, check Node.js version
|
|
105
|
-
node --version # Should be v18+ LTS
|
|
106
|
-
|
|
107
|
-
# Enable verbose logging
|
|
108
|
-
RUBYCRAWL_NODE_LOG=/tmp/rubycrawl.log bundle exec rake rubycrawl:install
|
|
109
|
-
|
|
110
|
-
# Check installation status
|
|
111
|
-
cd node && npm list
|
|
112
|
-
```
|
|
94
|
+
**Note:** If Chrome is not in your PATH, install it via your system package manager or download from [google.com/chrome](https://www.google.com/chrome/).
|
|
113
95
|
|
|
114
96
|
## Quick Start
|
|
115
97
|
|
|
@@ -120,37 +102,37 @@ require "rubycrawl"
|
|
|
120
102
|
result = RubyCrawl.crawl("https://example.com")
|
|
121
103
|
|
|
122
104
|
# Access extracted content
|
|
123
|
-
result.final_url
|
|
124
|
-
result.
|
|
125
|
-
result.
|
|
126
|
-
result.
|
|
127
|
-
result.
|
|
105
|
+
result.final_url # Final URL after redirects
|
|
106
|
+
result.clean_text # Noise-stripped plain text (no nav/footer/ads)
|
|
107
|
+
result.clean_html # Noise-stripped HTML (same noise removed as clean_text)
|
|
108
|
+
result.raw_text # Full body.innerText (unfiltered)
|
|
109
|
+
result.html # Full raw HTML content
|
|
110
|
+
result.links # Extracted links with url, text, title, rel
|
|
111
|
+
result.metadata # Title, description, OG tags, etc.
|
|
112
|
+
result.clean_markdown # Markdown converted from clean_html (lazy — first access only)
|
|
128
113
|
```
|
|
129
114
|
|
|
130
115
|
## Use Cases
|
|
131
116
|
|
|
132
117
|
RubyCrawl is perfect for:
|
|
133
118
|
|
|
134
|
-
-
|
|
135
|
-
-
|
|
136
|
-
-
|
|
137
|
-
-
|
|
138
|
-
-
|
|
139
|
-
- **📚 Documentation scraping**: Create local copies of documentation with preserved links
|
|
119
|
+
- **RAG applications**: Build knowledge bases for LLM/AI applications by crawling documentation sites
|
|
120
|
+
- **Data aggregation**: Crawl product catalogs, job listings, or news articles
|
|
121
|
+
- **SEO analysis**: Extract metadata, links, and content structure
|
|
122
|
+
- **Content migration**: Convert existing sites to Markdown for static site generators
|
|
123
|
+
- **Documentation scraping**: Create local copies of documentation with preserved links
|
|
140
124
|
|
|
141
125
|
## Usage
|
|
142
126
|
|
|
143
127
|
### Basic Crawling
|
|
144
128
|
|
|
145
|
-
The simplest way to crawl a URL:
|
|
146
|
-
|
|
147
129
|
```ruby
|
|
148
130
|
result = RubyCrawl.crawl("https://example.com")
|
|
149
131
|
|
|
150
|
-
#
|
|
151
|
-
result.
|
|
152
|
-
result.
|
|
153
|
-
result.metadata
|
|
132
|
+
result.html # => "<html>...</html>"
|
|
133
|
+
result.clean_text # => "Example Domain\n\nThis domain is..." (no nav/ads)
|
|
134
|
+
result.raw_text # => "Example Domain\nThis domain is..." (full body text)
|
|
135
|
+
result.metadata # => { "final_url" => "https://example.com", "title" => "..." }
|
|
154
136
|
```
|
|
155
137
|
|
|
156
138
|
### Multi-Page Crawling
|
|
@@ -165,10 +147,10 @@ RubyCrawl.crawl_site("https://example.com", max_pages: 100, max_depth: 3) do |pa
|
|
|
165
147
|
|
|
166
148
|
# Save to database
|
|
167
149
|
Page.create!(
|
|
168
|
-
url:
|
|
169
|
-
html:
|
|
150
|
+
url: page.url,
|
|
151
|
+
html: page.html,
|
|
170
152
|
markdown: page.clean_markdown,
|
|
171
|
-
depth:
|
|
153
|
+
depth: page.depth
|
|
172
154
|
)
|
|
173
155
|
end
|
|
174
156
|
```
|
|
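The BFS traversal with depth limits and URL deduplication that `crawl_site` performs can be sketched generically. This is a simplified model, not the gem's implementation; `fetch_links` stands in for the real browser-backed page fetch:

```ruby
require "set"

# Generic BFS sketch of crawl_site's traversal: a FIFO queue of
# [url, depth] pairs, a visited set for deduplication, and stops at
# max_pages / max_depth. fetch_links is a stand-in for the browser.
def bfs_crawl(start_url, max_pages:, max_depth:, fetch_links:)
  visited = Set.new
  queue = [[start_url, 0]]
  crawled = 0
  until queue.empty? || crawled >= max_pages
    url, depth = queue.shift
    next if visited.include?(url) # dedupe: each URL crawled at most once
    visited << url
    links = fetch_links.call(url)
    crawled += 1
    yield url, depth if block_given?
    next if depth >= max_depth # don't enqueue children past the depth limit
    links.each { |l| queue << [l, depth + 1] unless visited.include?(l) }
  end
  crawled
end
```

Because the queue is FIFO, pages are visited in breadth-first order: everything at depth 1 before anything at depth 2.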
@@ -176,7 +158,6 @@ end
|
|
|
176
158
|
**Real-world example: Building a RAG knowledge base**
|
|
177
159
|
|
|
178
160
|
```ruby
|
|
179
|
-
# Crawl documentation site for AI/RAG application
|
|
180
161
|
require "rubycrawl"
|
|
181
162
|
|
|
182
163
|
RubyCrawl.configure(
|
|
@@ -190,21 +171,18 @@ pages_crawled = RubyCrawl.crawl_site(
|
|
|
190
171
|
max_depth: 5,
|
|
191
172
|
same_host_only: true
|
|
192
173
|
) do |page|
|
|
193
|
-
# Store in vector database for RAG
|
|
194
174
|
VectorDB.upsert(
|
|
195
|
-
id:
|
|
196
|
-
content:
|
|
175
|
+
id: Digest::SHA256.hexdigest(page.url),
|
|
176
|
+
content: page.clean_markdown,
|
|
197
177
|
metadata: {
|
|
198
|
-
url:
|
|
178
|
+
url: page.url,
|
|
199
179
|
title: page.metadata["title"],
|
|
200
180
|
depth: page.depth
|
|
201
181
|
}
|
|
202
182
|
)
|
|
203
|
-
|
|
204
|
-
puts "✓ Indexed: #{page.metadata['title']} (#{page.depth} levels deep)"
|
|
205
183
|
end
|
|
206
184
|
|
|
207
|
-
puts "
|
|
185
|
+
puts "Indexed #{pages_crawled} pages"
|
|
208
186
|
```
|
|
209
187
|
|
|
210
188
|
#### Multi-Page Options
|
|
@@ -223,10 +201,13 @@ The block receives a `PageResult` with:
|
|
|
223
201
|
|
|
224
202
|
```ruby
|
|
225
203
|
page.url # String: Final URL after redirects
|
|
226
|
-
page.html # String: Full HTML content
|
|
227
|
-
page.
|
|
204
|
+
page.html # String: Full raw HTML content
|
|
205
|
+
page.clean_html # String: Noise-stripped HTML (no nav/header/footer/ads)
|
|
206
|
+
page.clean_text # String: Noise-stripped plain text (derived from clean_html)
|
|
207
|
+
page.raw_text # String: Full body.innerText (unfiltered)
|
|
208
|
+
page.clean_markdown # String: Lazy-converted Markdown from clean_html
|
|
228
209
|
page.links # Array: URLs extracted from page
|
|
229
|
-
page.metadata # Hash:
|
|
210
|
+
page.metadata # Hash: final_url, title, OG tags, etc.
|
|
230
211
|
page.depth # Integer: Link depth from start URL
|
|
231
212
|
```
|
|
232
213
|
|
|
@@ -234,12 +215,12 @@ page.depth # Integer: Link depth from start URL
|
|
|
234
215
|
|
|
235
216
|
#### Global Configuration
|
|
236
217
|
|
|
237
|
-
Set default options that apply to all crawls:
|
|
238
|
-
|
|
239
218
|
```ruby
|
|
240
219
|
RubyCrawl.configure(
|
|
241
|
-
wait_until:
|
|
242
|
-
block_resources: true
|
|
220
|
+
wait_until: "networkidle",
|
|
221
|
+
block_resources: true,
|
|
222
|
+
timeout: 60,
|
|
223
|
+
headless: true
|
|
243
224
|
)
|
|
244
225
|
|
|
245
226
|
# All subsequent crawls use these defaults
|
|
@@ -248,8 +229,6 @@ result = RubyCrawl.crawl("https://example.com")
|
|
|
248
229
|
|
|
249
230
|
#### Per-Request Options
|
|
250
231
|
|
|
251
|
-
Override defaults for specific requests:
|
|
252
|
-
|
|
253
232
|
```ruby
|
|
254
233
|
# Use global defaults
|
|
255
234
|
result = RubyCrawl.crawl("https://example.com")
|
|
@@ -257,192 +236,131 @@ result = RubyCrawl.crawl("https://example.com")
|
|
|
257
236
|
# Override for this request only
|
|
258
237
|
result = RubyCrawl.crawl(
|
|
259
238
|
"https://example.com",
|
|
260
|
-
wait_until:
|
|
239
|
+
wait_until: "domcontentloaded",
|
|
261
240
|
block_resources: false
|
|
262
241
|
)
|
|
263
242
|
```
|
|
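The precedence described above (per-request options over global configuration over built-in defaults) is ordinary hash merging. A minimal sketch, assuming the defaults from the options table; the gem's actual `configure`/`crawl` plumbing may differ:

```ruby
# Sketch of option resolution: per-request keys win over globals,
# which win over built-in defaults. Default values assume the
# configuration table above.
DEFAULTS = { wait_until: nil, block_resources: nil, max_attempts: 3, timeout: 30, headless: true }.freeze

def resolve_options(global, per_request)
  DEFAULTS.merge(global).merge(per_request)
end
```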
264
243
|
|
|
265
244
|
#### Configuration Options
|
|
266
245
|
|
|
267
|
-
| Option | Values
|
|
268
|
-
| ----------------- |
|
|
269
|
-
| `wait_until` | `"load"`, `"domcontentloaded"`, `"networkidle"`, `"commit"`
|
|
270
|
-
| `block_resources` | `true`, `false`
|
|
271
|
-
| `max_attempts` | Integer
|
|
246
|
+
| Option | Values | Default | Description |
|
|
247
|
+
| ----------------- | ----------------------------------------------------------- | ------- | --------------------------------------------------- |
|
|
248
|
+
| `wait_until` | `"load"`, `"domcontentloaded"`, `"networkidle"`, `"commit"` | `nil` | When to consider page loaded (nil = Ferrum default) |
|
|
249
|
+
| `block_resources` | `true`, `false` | `nil` | Block images, fonts, CSS, media for faster crawls |
|
|
250
|
+
| `max_attempts` | Integer | `3` | Total number of attempts (including the first) |
|
|
251
|
+
| `timeout` | Integer (seconds) | `30` | Browser navigation timeout |
|
|
252
|
+
| `headless` | `true`, `false` | `true` | Run Chrome headlessly |
|
|
272
253
|
|
|
273
254
|
**Wait strategies explained:**
|
|
274
255
|
|
|
275
|
-
- `load` — Wait for the load event (
|
|
276
|
-
- `domcontentloaded` — Wait for DOM ready (
|
|
277
|
-
- `networkidle` — Wait until no network requests for 500ms (
|
|
278
|
-
- `commit` — Wait until the first response bytes are received (fastest
|
|
279
|
-
|
|
280
|
-
### Advanced Usage
|
|
281
|
-
|
|
282
|
-
#### Session-Based Crawling
|
|
283
|
-
|
|
284
|
-
Sessions allow reusing browser contexts for better performance when crawling multiple pages. They're automatically used by `crawl_site`, but you can manage them manually for advanced use cases:
|
|
285
|
-
|
|
286
|
-
```ruby
|
|
287
|
-
# Create a session (reusable browser context)
|
|
288
|
-
session_id = RubyCrawl.create_session
|
|
289
|
-
|
|
290
|
-
begin
|
|
291
|
-
# All crawls with this session_id share the same browser context
|
|
292
|
-
result1 = RubyCrawl.crawl("https://example.com/page1", session_id: session_id)
|
|
293
|
-
result2 = RubyCrawl.crawl("https://example.com/page2", session_id: session_id)
|
|
294
|
-
# Browser state (cookies, localStorage) persists between crawls
|
|
295
|
-
ensure
|
|
296
|
-
# Always destroy session when done
|
|
297
|
-
RubyCrawl.destroy_session(session_id)
|
|
298
|
-
end
|
|
299
|
-
```
|
|
300
|
-
|
|
301
|
-
**When to use sessions:**
|
|
302
|
-
|
|
303
|
-
- Multiple sequential crawls to the same domain (better performance)
|
|
304
|
-
- Preserving cookies/state set by the site between page visits
|
|
305
|
-
- Avoiding browser context creation overhead
|
|
306
|
-
|
|
307
|
-
**Important:** Sessions are for **performance optimization only**. RubyCrawl is designed for crawling **public websites**. It does not provide authentication or login functionality for protected content.
|
|
308
|
-
|
|
309
|
-
**Note:** `crawl_site` automatically creates and manages a session internally, so you don't need manual session management for multi-page crawling.
|
|
310
|
-
|
|
311
|
-
**Session lifecycle:**
|
|
312
|
-
|
|
313
|
-
- Sessions automatically expire after 30 minutes of inactivity
|
|
314
|
-
- Sessions are cleaned up every 5 minutes
|
|
315
|
-
- Always call `destroy_session` when done to free resources immediately
|
|
256
|
+
- `load` — Wait for the load event (good for static sites)
|
|
257
|
+
- `domcontentloaded` — Wait for DOM ready (faster)
|
|
258
|
+
- `networkidle` — Wait until no network requests for 500ms (best for SPAs)
|
|
259
|
+
- `commit` — Wait until the first response bytes are received (fastest)
|
|
316
260
|
|
|
317
261
|
### Result Object
|
|
318
262
|
|
|
319
|
-
The crawl result is a `RubyCrawl::Result` object with these attributes:
|
|
320
|
-
|
|
321
263
|
```ruby
|
|
322
264
|
result = RubyCrawl.crawl("https://example.com")
|
|
323
265
|
|
|
324
|
-
result.html # String:
|
|
325
|
-
result.
|
|
326
|
-
result.
|
|
327
|
-
result.
|
|
328
|
-
result.
|
|
266
|
+
result.html # String: Full raw HTML
|
|
267
|
+
result.clean_html # String: Noise-stripped HTML (nav/header/footer/ads removed)
|
|
268
|
+
result.clean_text # String: Plain text derived from clean_html — ideal for RAG
|
|
269
|
+
result.raw_text # String: Full body.innerText (unfiltered)
|
|
270
|
+
result.clean_markdown # String: Markdown from clean_html (lazy — computed on first access)
|
|
271
|
+
result.links # Array: Extracted links with url/text/title/rel
|
|
272
|
+
result.metadata # Hash: See below
|
|
273
|
+
result.final_url # String: Shortcut for metadata['final_url']
|
|
329
274
|
```
|
|
330
275
|
|
|
331
276
|
#### Links Format
|
|
332
277
|
|
|
333
|
-
Links are extracted with full metadata:
|
|
334
|
-
|
|
335
278
|
```ruby
|
|
336
279
|
result.links
|
|
337
280
|
# => [
|
|
338
|
-
# {
|
|
339
|
-
#
|
|
340
|
-
# "text" => "About Us",
|
|
341
|
-
# "title" => "Learn more about us", # <a title="...">
|
|
342
|
-
# "rel" => nil # <a rel="nofollow">
|
|
343
|
-
# },
|
|
344
|
-
# {
|
|
345
|
-
# "url" => "https://example.com/contact",
|
|
346
|
-
# "text" => "Contact",
|
|
347
|
-
# "title" => null,
|
|
348
|
-
# "rel" => "nofollow"
|
|
349
|
-
# },
|
|
350
|
-
# ...
|
|
281
|
+
# { "url" => "https://example.com/about", "text" => "About", "title" => nil, "rel" => nil },
|
|
282
|
+
# { "url" => "https://example.com/contact", "text" => "Contact", "title" => nil, "rel" => "nofollow" },
|
|
351
283
|
# ]
|
|
352
284
|
```
|
|
353
285
|
|
|
354
|
-
|
|
286
|
+
URLs are automatically resolved to absolute form by the browser.
|
|
355
287
|
|
|
356
288
|
#### Markdown Conversion
|
|
357
289
|
|
|
358
|
-
Markdown is **lazy
|
|
290
|
+
Markdown is **lazy** — conversion only happens on first access of `.clean_markdown`:
|
|
359
291
|
|
|
360
292
|
```ruby
|
|
361
|
-
result
|
|
362
|
-
result.
|
|
363
|
-
result.clean_markdown #
|
|
364
|
-
result.clean_markdown # ✅ Cached, instant
|
|
293
|
+
result.clean_html # ✅ Already available, no overhead
|
|
294
|
+
result.clean_markdown # Converts clean_html → Markdown here (first call only)
|
|
295
|
+
result.clean_markdown # ✅ Cached, instant on subsequent calls
|
|
365
296
|
```
|
|
366
297
|
|
|
367
298
|
Uses [reverse_markdown](https://github.com/xijo/reverse_markdown) with GitHub-flavored output.
|
|
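The lazy, cached conversion shown above is the standard Ruby memoization pattern. A sketch of the idea (the real `Result` class calls reverse_markdown; the lambda converter here is a stand-in):

```ruby
# Sketch of lazy clean_markdown: convert on first access, then return
# the cached value. Note ||= only caches truthy results, which is fine
# here since the converter always returns a String.
class LazyResult
  def initialize(clean_html, converter)
    @clean_html = clean_html
    @converter = converter
  end

  def clean_markdown
    @clean_markdown ||= @converter.call(@clean_html) # runs once, then cached
  end
end
```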
368
299
|
|
|
369
300
|
#### Metadata Fields
|
|
370
301
|
|
|
371
|
-
The `metadata` hash includes HTTP and HTML metadata:
|
|
372
|
-
|
|
373
302
|
```ruby
|
|
374
303
|
result.metadata
|
|
375
304
|
# => {
|
|
376
|
-
# "
|
|
377
|
-
# "
|
|
378
|
-
# "
|
|
379
|
-
# "
|
|
380
|
-
# "
|
|
381
|
-
# "
|
|
382
|
-
# "
|
|
383
|
-
# "
|
|
384
|
-
# "
|
|
385
|
-
# "
|
|
386
|
-
# "
|
|
387
|
-
# "
|
|
388
|
-
# "
|
|
389
|
-
# "
|
|
390
|
-
# "
|
|
391
|
-
# "
|
|
392
|
-
# "
|
|
393
|
-
# "charset" => "UTF-8" # Character encoding
|
|
305
|
+
# "final_url" => "https://example.com",
|
|
306
|
+
# "title" => "Page Title",
|
|
307
|
+
# "description" => "...",
|
|
308
|
+
# "keywords" => "ruby, web",
|
|
309
|
+
# "author" => "Author Name",
|
|
310
|
+
# "og_title" => "...",
|
|
311
|
+
# "og_description" => "...",
|
|
312
|
+
# "og_image" => "https://...",
|
|
313
|
+
# "og_url" => "https://...",
|
|
314
|
+
# "og_type" => "website",
|
|
315
|
+
# "twitter_card" => "summary",
|
|
316
|
+
# "twitter_title" => "...",
|
|
317
|
+
# "twitter_description" => "...",
|
|
318
|
+
# "twitter_image" => "https://...",
|
|
319
|
+
# "canonical" => "https://...",
|
|
320
|
+
# "lang" => "en",
|
|
321
|
+
# "charset" => "UTF-8"
|
|
394
322
|
# }
|
|
395
323
|
```
|
|
396
324
|
|
|
397
|
-
Note: All HTML metadata fields may be `null` if not present on the page.
|
|
398
|
-
|
|
399
325
|
### Error Handling
|
|
400
326
|
|
|
401
|
-
RubyCrawl provides specific exception classes for different error scenarios:
|
|
402
|
-
|
|
403
327
|
```ruby
|
|
404
328
|
begin
|
|
405
329
|
result = RubyCrawl.crawl(url)
|
|
406
330
|
rescue RubyCrawl::ConfigurationError => e
|
|
407
|
-
# Invalid URL or
|
|
408
|
-
puts "Configuration error: #{e.message}"
|
|
331
|
+
# Invalid URL or option value
|
|
409
332
|
rescue RubyCrawl::TimeoutError => e
|
|
410
|
-
# Page load
|
|
411
|
-
puts "Timeout: #{e.message}"
|
|
333
|
+
# Page load timed out
|
|
412
334
|
rescue RubyCrawl::NavigationError => e
|
|
413
|
-
#
|
|
414
|
-
puts "Navigation failed: #{e.message}"
|
|
335
|
+
# Navigation failed (404, DNS error, SSL error)
|
|
415
336
|
rescue RubyCrawl::ServiceError => e
|
|
416
|
-
#
|
|
417
|
-
puts "Service error: #{e.message}"
|
|
337
|
+
# Browser failed to start or crashed
|
|
418
338
|
rescue RubyCrawl::Error => e
|
|
419
339
|
# Catch-all for any RubyCrawl error
|
|
420
|
-
puts "Crawl error: #{e.message}"
|
|
421
340
|
end
|
|
422
341
|
```
|
|
423
342
|
|
|
424
343
|
**Exception Hierarchy:**
|
|
425
344
|
|
|
426
|
-
|
|
427
|
-
|
|
428
|
-
|
|
429
|
-
|
|
430
|
-
|
|
345
|
+
```
|
|
346
|
+
RubyCrawl::Error
|
|
347
|
+
├── ConfigurationError — invalid URL or option value
|
|
348
|
+
├── TimeoutError — page load timed out
|
|
349
|
+
├── NavigationError — navigation failed (HTTP error, DNS, SSL)
|
|
350
|
+
└── ServiceError — browser failed to start or crashed
|
|
351
|
+
```
|
|
431
352
|
|
|
432
|
-
**Automatic Retry:**
|
|
353
|
+
**Automatic Retry:** `ServiceError` and `TimeoutError` are retried with exponential backoff. `NavigationError` and `ConfigurationError` are not retried (they won't succeed on retry).
|
|
433
354
|
|
|
434
355
|
```ruby
|
|
435
|
-
RubyCrawl.configure(max_attempts: 5)
|
|
436
|
-
#
|
|
437
|
-
RubyCrawl.crawl(url, max_attempts: 1) # No retries
|
|
356
|
+
RubyCrawl.configure(max_attempts: 5) # 5 total attempts
|
|
357
|
+
RubyCrawl.crawl(url, max_attempts: 1) # Disable retries
|
|
438
358
|
```
|
|
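The retry policy described above — retryable error classes get up to `max_attempts` total tries with exponential backoff, everything else re-raises immediately — can be sketched like this. The delay values are illustrative assumptions, not the gem's actual timings:

```ruby
# Sketch of retry-with-exponential-backoff: only errors listed in
# `retryable` are retried; others propagate on the first failure.
# Backoff doubles per attempt (0.5s, 1s, 2s, ...) — illustrative values.
def with_retries(max_attempts:, retryable:, base_delay: 0.5)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue *retryable => e
    raise e if attempts >= max_attempts # budget exhausted, give up
    sleep(base_delay * (2**(attempts - 1)))
    retry
  end
end
```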
439
359
|
|
|
440
360
|
## Rails Integration
|
|
441
361
|
|
|
442
362
|
### Installation
|
|
443
363
|
|
|
444
|
-
Run the installer in your Rails app:
|
|
445
|
-
|
|
446
364
|
```bash
|
|
447
365
|
bundle exec rake rubycrawl:install
|
|
448
366
|
```
|
|
@@ -450,173 +368,54 @@ bundle exec rake rubycrawl:install
|
|
|
450
368
|
This creates `config/initializers/rubycrawl.rb`:
|
|
451
369
|
|
|
452
370
|
```ruby
|
|
453
|
-
# frozen_string_literal: true
|
|
454
|
-
|
|
455
|
-
# rubycrawl default configuration
|
|
456
371
|
RubyCrawl.configure(
|
|
457
|
-
wait_until:
|
|
372
|
+
wait_until: "load",
|
|
458
373
|
block_resources: true
|
|
459
374
|
)
|
|
460
375
|
```
|
|
461
376
|
|
|
462
377
|
### Usage in Rails
|
|
463
378
|
|
|
464
|
-
#### Basic Usage in Controllers
|
|
465
|
-
|
|
466
|
-
```ruby
|
|
467
|
-
class PagesController < ApplicationController
|
|
468
|
-
def show
|
|
469
|
-
result = RubyCrawl.crawl(params[:url])
|
|
470
|
-
|
|
471
|
-
@page = Page.create!(
|
|
472
|
-
url: result.final_url,
|
|
473
|
-
title: result.metadata['title'],
|
|
474
|
-
html: result.html,
|
|
475
|
-
text: result.text,
|
|
476
|
-
markdown: result.clean_markdown
|
|
477
|
-
)
|
|
478
|
-
|
|
479
|
-
redirect_to @page
|
|
480
|
-
end
|
|
481
|
-
end
|
|
482
|
-
```
|
|
483
|
-
|
|
484
379
|
#### Background Jobs with ActiveJob
|
|
485
380
|
|
|
486
|
-
**Simple Crawl Job:**
|
|
487
|
-
|
|
488
381
|
```ruby
|
|
489
382
|
class CrawlPageJob < ApplicationJob
|
|
490
383
|
queue_as :crawlers
|
|
491
384
|
|
|
492
|
-
# Automatic retry with exponential backoff for transient failures
|
|
493
385
|
retry_on RubyCrawl::ServiceError, wait: :exponentially_longer, attempts: 5
|
|
494
386
|
retry_on RubyCrawl::TimeoutError, wait: :exponentially_longer, attempts: 3
|
|
495
|
-
|
|
496
|
-
# Don't retry on configuration errors (bad URLs)
|
|
497
387
|
discard_on RubyCrawl::ConfigurationError
|
|
498
388
|
|
|
499
|
-
def perform(url
|
|
389
|
+
def perform(url)
|
|
500
390
|
result = RubyCrawl.crawl(url)
|
|
501
391
|
|
|
502
392
|
Page.create!(
|
|
503
|
-
url:
|
|
504
|
-
title:
|
|
505
|
-
|
|
506
|
-
|
|
507
|
-
user_id: user_id,
|
|
393
|
+
url: result.final_url,
|
|
394
|
+
title: result.metadata['title'],
|
|
395
|
+
content: result.clean_text,
|
|
396
|
+
markdown: result.clean_markdown,
|
|
508
397
|
crawled_at: Time.current
|
|
509
398
|
)
|
|
510
|
-
rescue RubyCrawl::NavigationError => e
|
|
511
|
-
# Page not found or failed to load
|
|
512
|
-
Rails.logger.warn "Failed to crawl #{url}: #{e.message}"
|
|
513
|
-
FailedCrawl.create!(url: url, error: e.message, user_id: user_id)
|
|
514
|
-
end
|
|
515
|
-
end
|
|
516
|
-
|
|
517
|
-
# Enqueue from anywhere
|
|
518
|
-
CrawlPageJob.perform_later("https://example.com", user_id: current_user.id)
|
|
519
|
-
```
|
|
520
|
-
|
|
521
|
-
**Multi-Page Site Crawler Job:**
|
|
522
|
-
|
|
523
|
-
```ruby
|
|
524
|
-
class CrawlSiteJob < ApplicationJob
|
|
525
|
-
queue_as :crawlers
|
|
526
|
-
|
|
527
|
-
def perform(start_url, max_pages: 50)
|
|
528
|
-
pages_crawled = RubyCrawl.crawl_site(
|
|
529
|
-
start_url,
|
|
530
|
-
max_pages: max_pages,
|
|
531
|
-
max_depth: 3,
|
|
532
|
-
same_host_only: true
|
|
533
|
-
) do |page|
|
|
534
|
-
Page.create!(
|
|
535
|
-
url: page.url,
|
|
536
|
-
title: page.metadata['title'],
|
|
537
|
-
text: page.clean_markdown, # Store markdown for RAG applications
|
|
538
|
-
depth: page.depth,
|
|
539
|
-
crawled_at: Time.current
|
|
540
|
-
)
|
|
541
|
-
end
|
|
542
|
-
|
|
543
|
-
Rails.logger.info "Crawled #{pages_crawled} pages from #{start_url}"
|
|
544
|
-
end
|
|
545
|
-
end
|
|
546
|
-
```
|
|
547
|
-
|
|
548
|
-
**Batch Crawling Pattern:**
|
|
549
|
-
|
|
550
|
-
```ruby
|
|
551
|
-
class BatchCrawlJob < ApplicationJob
|
|
552
|
-
queue_as :crawlers
|
|
553
|
-
|
|
554
|
-
def perform(urls)
|
|
555
|
-
# Create session for better performance
|
|
556
|
-
session_id = RubyCrawl.create_session
|
|
557
|
-
|
|
558
|
-
begin
|
|
559
|
-
urls.each do |url|
|
|
560
|
-
result = RubyCrawl.crawl(url, session_id: session_id)
|
|
561
|
-
|
|
562
|
-
Page.create!(
|
|
563
|
-
url: result.final_url,
|
|
564
|
-
html: result.html,
|
|
565
|
-
text: result.text
|
|
566
|
-
)
|
|
567
|
-
end
|
|
568
|
-
ensure
|
|
569
|
-
# Always destroy session when done
|
|
570
|
-
RubyCrawl.destroy_session(session_id)
|
|
571
|
-
end
|
|
572
399
|
end
|
|
573
400
|
end
|
|
574
|
-
|
|
575
|
-
# Enqueue batch
|
|
576
|
-
BatchCrawlJob.perform_later(["https://example.com", "https://example.com/about"])
|
|
577
401
|
```
|
|
578
402
|
|
|
579
|
-
**
|
|
580
|
-
|
|
581
|
-
```ruby
|
|
582
|
-
# config/schedule.yml (for sidekiq-cron)
|
|
583
|
-
crawl_news_sites:
|
|
584
|
-
cron: "0 */6 * * *" # Every 6 hours
|
|
585
|
-
class: "CrawlNewsSitesJob"
|
|
586
|
-
|
|
587
|
-
# app/jobs/crawl_news_sites_job.rb
|
|
588
|
-
class CrawlNewsSitesJob < ApplicationJob
|
|
589
|
-
queue_as :scheduled_crawlers
|
|
590
|
-
|
|
591
|
-
def perform
|
|
592
|
-
Site.where(active: true).find_each do |site|
|
|
593
|
-
CrawlSiteJob.perform_later(site.url, max_pages: site.max_pages)
|
|
594
|
-
end
|
|
595
|
-
end
|
|
596
|
-
end
|
|
597
|
-
```
|
|
598
|
-
|
|
599
|
-
**RAG/AI Knowledge Base Pattern:**
|
|
403
|
+
**Multi-page RAG knowledge base:**
|
|
600
404
|
|
|
601
405
|
```ruby
|
|
602
406
|
class BuildKnowledgeBaseJob < ApplicationJob
|
|
603
407
|
queue_as :crawlers
|
|
604
408
|
|
|
605
409
|
def perform(documentation_url)
|
|
606
|
-
RubyCrawl.crawl_site(
|
|
607
|
-
documentation_url,
|
|
608
|
-
max_pages: 500,
|
|
609
|
-
max_depth: 5
|
|
610
|
-
) do |page|
|
|
611
|
-
# Store in vector database for RAG
|
|
410
|
+
RubyCrawl.crawl_site(documentation_url, max_pages: 500, max_depth: 5) do |page|
|
|
612
411
|
embedding = OpenAI.embed(page.clean_markdown)
|
|
613
412
|
|
|
614
413
|
Document.create!(
|
|
615
|
-
url:
|
|
616
|
-
title:
|
|
617
|
-
content:
|
|
414
|
+
url: page.url,
|
|
415
|
+
title: page.metadata['title'],
|
|
416
|
+
content: page.clean_markdown,
|
|
618
417
|
embedding: embedding,
|
|
619
|
-
depth:
|
|
418
|
+
depth: page.depth
|
|
620
419
|
)
|
|
621
420
|
end
|
|
622
421
|
end
|
|
@@ -625,156 +424,101 @@ end
|
|
|
625
424
|
|
|
626
425
|
#### Best Practices
|
|
627
426
|
|
|
628
|
-
1. **Use background jobs**
|
|
629
|
-
2. **Configure retry logic** based on error
|
|
630
|
-
3. **
|
|
631
|
-
4. **
|
|
632
|
-
5. **Rate limit** external crawling to be respectful (use job throttling)
|
|
633
|
-
6. **Store both HTML and text** for flexibility in data processing
|
|
427
|
+
1. **Use background jobs** to avoid blocking web requests
|
|
428
|
+
2. **Configure retry logic** based on error type
|
|
429
|
+
3. **Store `clean_markdown`** for RAG applications (preserves heading structure for chunking)
|
|
430
|
+
4. **Rate limit** external crawling to be respectful
|
|
634
431
|
|
|
635
432
|
## Production Deployment
|
|
636
433
|
|
|
637
434
|
### Pre-deployment Checklist
|
|
638
435
|
|
|
639
|
-
1. **
|
|
436
|
+
1. **Ensure Chrome is installed** on your production servers
|
|
640
437
|
2. **Run installer** during deployment:
|
|
641
438
|
```bash
|
|
642
439
|
bundle exec rake rubycrawl:install
|
|
643
440
|
```
|
|
644
|
-
3. **Set environment variables** (optional):
|
|
645
|
-
```bash
|
|
646
|
-
export RUBYCRAWL_NODE_BIN=/usr/bin/node # Custom Node.js path
|
|
647
|
-
export RUBYCRAWL_NODE_LOG=/var/log/rubycrawl.log # Service logs
|
|
648
|
-
```
|
|
649
441
|
|
|
650
442
|
### Docker Example
|
|
651
443
|
|
|
652
444
|
```dockerfile
|
|
653
445
|
FROM ruby:3.2
|
|
654
446
|
|
|
655
|
-
# Install
|
|
656
|
-
RUN
|
|
657
|
-
|
|
658
|
-
|
|
659
|
-
|
|
660
|
-
RUN npx playwright install-deps
|
|
447
|
+
# Install Chrome
|
|
448
|
+
RUN apt-get update && apt-get install -y \
|
|
449
|
+
chromium \
|
|
450
|
+
--no-install-recommends \
|
|
451
|
+
&& rm -rf /var/lib/apt/lists/*
|
|
661
452
|
|
|
662
453
|
WORKDIR /app
|
|
663
454
|
COPY Gemfile* ./
|
|
664
455
|
RUN bundle install
|
|
665
456
|
|
|
666
|
-
# Install Playwright browsers
|
|
667
|
-
RUN bundle exec rake rubycrawl:install
|
|
668
|
-
|
|
669
457
|
COPY . .
|
|
670
458
|
CMD ["rails", "server"]
|
|
671
459
|
```
|
|
672
460
|
|
|
673
|
-
|
|
674
|
-
|
|
675
|
-
Add the Node.js buildpack:
|
|
461
|
+
Ferrum will detect `chromium` automatically. To specify a custom path:
|
|
676
462
|
|
|
677
|
-
```
|
|
678
|
-
|
|
679
|
-
|
|
680
|
-
|
|
681
|
-
|
|
682
|
-
Add to `package.json` in your Rails root:
|
|
683
|
-
|
|
684
|
-
```json
|
|
685
|
-
{
|
|
686
|
-
"engines": {
|
|
687
|
-
"node": "18.x"
|
|
688
|
-
}
|
|
689
|
-
}
|
|
463
|
+
```ruby
|
|
464
|
+
RubyCrawl.configure(
|
|
465
|
+
browser_options: { "browser-path": "/usr/bin/chromium" }
|
|
466
|
+
)
|
|
690
467
|
```
|
|
691
468
|
|
|
692
|
-
##
|
|
469
|
+
## Architecture
|
|
693
470
|
|
|
694
|
-
RubyCrawl uses a
|
|
471
|
+
RubyCrawl uses a single-process architecture:
|
|
695
472
|
|
|
696
|
-
|
|
697
|
-
|
|
698
|
-
|
|
473
|
+
```
|
|
474
|
+
RubyCrawl (public API)
|
|
475
|
+
↓
|
|
476
|
+
Browser (lib/rubycrawl/browser.rb) ← Ferrum wrapper
|
|
477
|
+
↓
|
|
478
|
+
Ferrum::Browser ← Chrome DevTools Protocol (pure Ruby)
|
|
479
|
+
↓
|
|
480
|
+
Chromium ← headless browser
|
|
481
|
+
```
|
|
699
482
|
|
|
700
|
-
|
|
483
|
+
- Chrome launches once lazily and is reused across all crawls
|
|
484
|
+
- Each crawl gets an isolated page context (own cookies/storage)
|
|
485
|
+
- JS extraction runs inside the browser via `page.evaluate()`
|
|
486
|
+
- No separate processes, no HTTP boundary, no Node.js
|
|
701
487
|
|
|
702
|
-
## Performance
|
|
488
|
+
## Performance
|
|
703
489
|
|
|
704
|
-
- **Resource blocking**: Keep `block_resources: true` (default) for 2-3x faster crawls
|
|
490
|
+
- **Resource blocking**: Set `block_resources: true` to skip images, fonts, CSS, and media for 2-3x faster crawls
|
|
705
491
|
- **Wait strategy**: Use `wait_until: "load"` for static sites, `"networkidle"` for SPAs
|
|
706
|
-
- **Concurrency**: Use background jobs (Sidekiq, etc.) for parallel crawling
|
|
707
|
-
- **Browser reuse**: The first crawl is slower (~2s) due to
|
|
492
|
+
- **Concurrency**: Use background jobs (Sidekiq, GoodJob, etc.) for parallel crawling
|
|
493
|
+
- **Browser reuse**: The first crawl is slower (~2s) due to Chrome launch; subsequent crawls are much faster (~200-500ms)
|
|
708
494
|
|
|
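The concurrency tip above recommends background jobs; for a quick sense of the fan-out shape without a job backend, a thread pool works too. This is a plain-Ruby sketch, not the README's recommended approach, and the `fetch` block stands in for a real `RubyCrawl.crawl` call:

```ruby
# Worker-pool sketch: crawl a list of URLs with a fixed number of threads.
def parallel_crawl(urls, workers: 4, &fetch)
  queue = Queue.new
  urls.each { |url| queue << url }
  results = Queue.new

  threads = Array.new(workers) do
    Thread.new do
      loop do
        url = begin
          queue.pop(true)        # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break                  # no work left; let this worker exit
        end
        results << [url, fetch.call(url)]
      end
    end
  end

  threads.each(&:join)
  Array.new(results.size) { results.pop }.to_h
end
```

In a Rails app, replace the thread pool with one Sidekiq/GoodJob job per URL so retries and backpressure come for free.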
 ## Development
 
-Want to contribute? Check out the [contributor guidelines](.github/copilot-instructions.md).
-
 ```bash
-# Setup
 git clone git@github.com:craft-wise/rubycrawl.git
 cd rubycrawl
 bin/setup
 
-# Run tests
+# Run unit tests (no browser required)
 bundle exec rspec
 
+# Run integration tests (requires Chrome)
+INTEGRATION=1 bundle exec rspec
+
 # Manual testing
 bin/console
 > RubyCrawl.crawl("https://example.com")
+> RubyCrawl.crawl("https://example.com").clean_text
+> RubyCrawl.crawl("https://example.com").clean_markdown
 ```
 
 ## Contributing
 
 Contributions are welcome! Please read our [contribution guidelines](.github/copilot-instructions.md) first.
 
-### Development Philosophy
-
 - **Simplicity over cleverness**: Prefer clear, explicit code
 - **Stability over speed**: Correctness first, optimization second
-- **
-- **No vendor lock-in**: Pure open source, no SaaS dependencies
-
-## Why Choose RubyCrawl?
-
-RubyCrawl stands out in the Ruby ecosystem with its unique combination of features:
-
-### 🎯 **Built for Ruby Developers**
-
-- **Idiomatic Ruby API** — Feels natural to Rubyists, no need to learn Playwright
-- **Rails-first design** — Generators, initializers, and ActiveJob integration out of the box
-- **Modular architecture** — Clean, testable code following Ruby best practices
-
-### 🚀 **Production-Grade Reliability**
-
-- **Automatic retry** with exponential backoff for transient failures
-- **Smart error handling** with custom exception hierarchy
-- **Process isolation** — Browser crashes don't affect your Ruby application
-- **Battle-tested** — Built on Playwright's proven browser automation
-
-### 💎 **Developer Experience**
-
-- **Zero configuration** — Works immediately after installation
-- **Lazy loading** — Markdown conversion only when you need it
-- **Smart URL handling** — Automatic normalization and deduplication
-- **Comprehensive docs** — Clear examples for common use cases
-
-### 🌐 **Rich Feature Set**
-
-- ✅ JavaScript-enabled crawling (SPAs, AJAX, dynamic content)
-- ✅ Multi-page crawling with BFS algorithm
-- ✅ Link extraction with metadata (url, text, title, rel)
-- ✅ Markdown conversion (GitHub-flavored)
-- ✅ Metadata extraction (OG tags, Twitter cards, etc.)
-- ✅ Resource blocking for 2-3x performance boost
-
-### 📊 **Perfect for Modern Use Cases**
-
-- **RAG applications** — Build AI knowledge bases from documentation
-- **Data aggregation** — Extract structured data from multiple pages
-- **Content migration** — Convert sites to Markdown for static generators
-- **SEO analysis** — Extract metadata and link structures
-- **Testing** — Verify deployed site content and structure
+- **Hide complexity**: Users should never need to know Ferrum exists
 
 ## License
 
@@ -782,7 +526,7 @@ The gem is available as open source under the terms of the [MIT License](LICENSE
 
 ## Credits
 
-Built with [
+Built with [Ferrum](https://github.com/rubycdp/ferrum) — pure Ruby Chrome DevTools Protocol client.
 
 Powered by [reverse_markdown](https://github.com/xijo/reverse_markdown) for GitHub-flavored Markdown conversion.
 
@@ -791,12 +535,3 @@ Powered by [reverse_markdown](https://github.com/xijo/reverse_markdown) for GitH
 - **Issues**: [GitHub Issues](https://github.com/craft-wise/rubycrawl/issues)
 - **Discussions**: [GitHub Discussions](https://github.com/craft-wise/rubycrawl/discussions)
 - **Email**: ganesh.navale@zohomail.in
-
-## Acknowledgments
-
-Special thanks to:
-
-- [Microsoft Playwright](https://playwright.dev/) team for the robust, production-grade browser automation framework
-- The Ruby community for building an ecosystem that values developer happiness and code clarity
-- The Node.js community for excellent tooling and libraries that make cross-language integration seamless
-- Open source contributors worldwide who make projects like this possible