rubycrawl 0.1.3 → 0.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +263 -311
- data/lib/rubycrawl/browser/extraction.rb +106 -0
- data/lib/rubycrawl/browser.rb +106 -0
- data/lib/rubycrawl/errors.rb +1 -1
- data/lib/rubycrawl/helpers.rb +9 -41
- data/lib/rubycrawl/markdown_converter.rb +5 -5
- data/lib/rubycrawl/result.rb +55 -25
- data/lib/rubycrawl/site_crawler.rb +46 -20
- data/lib/rubycrawl/tasks/install.rake +17 -56
- data/lib/rubycrawl/url_normalizer.rb +5 -1
- data/lib/rubycrawl/version.rb +1 -1
- data/lib/rubycrawl.rb +37 -66
- data/rubycrawl.gemspec +5 -5
- metadata +20 -6
- data/Gemfile +0 -11
- data/lib/rubycrawl/service_client.rb +0 -86
data/README.md
CHANGED
|
@@ -1,32 +1,56 @@
|
|
|
1
|
-
#
|
|
1
|
+
# RubyCrawl 🎭
|
|
2
2
|
|
|
3
|
-
[](https://
|
|
3
|
+
[](https://rubygems.org/gems/rubycrawl)
|
|
4
4
|
[](https://opensource.org/licenses/MIT)
|
|
5
|
+
[](https://www.ruby-lang.org/)
|
|
5
6
|
|
|
6
|
-
**
|
|
7
|
+
**Production-ready web crawler for Ruby powered by Ferrum** — Full JavaScript rendering via Chrome DevTools Protocol, with first-class Rails support and no Node.js dependency.
|
|
7
8
|
|
|
8
|
-
RubyCrawl provides accurate, JavaScript-enabled web scraping using
|
|
9
|
+
RubyCrawl provides **accurate, JavaScript-enabled web scraping** using a pure Ruby browser automation stack. Perfect for extracting content from modern SPAs, dynamic websites, and building RAG knowledge bases.
|
|
10
|
+
|
|
11
|
+
**Why RubyCrawl?**
|
|
12
|
+
|
|
13
|
+
- ✅ **Real browser** — Handles JavaScript, AJAX, and SPAs correctly
|
|
14
|
+
- ✅ **Pure Ruby** — No Node.js, no npm, no external processes to manage
|
|
15
|
+
- ✅ **Zero config** — Works out of the box, no Ferrum knowledge needed
|
|
16
|
+
- ✅ **Production-ready** — Auto-retry, error handling, resource optimization
|
|
17
|
+
- ✅ **Multi-page crawling** — BFS algorithm with smart URL deduplication
|
|
18
|
+
- ✅ **Rails-friendly** — Generators, initializers, and ActiveJob integration
|
|
19
|
+
|
|
20
|
+
```ruby
|
|
21
|
+
# One line to crawl any JavaScript-heavy site
|
|
22
|
+
result = RubyCrawl.crawl("https://docs.example.com")
|
|
23
|
+
|
|
24
|
+
result.html # Full HTML with JS rendered
|
|
25
|
+
result.clean_text # Noise-stripped plain text (no nav/footer/ads)
|
|
26
|
+
result.clean_markdown # Markdown ready for RAG pipelines
|
|
27
|
+
result.links # All links with url, text, title, rel
|
|
28
|
+
result.metadata # Title, description, OG tags, etc.
|
|
29
|
+
```
|
|
9
30
|
|
|
10
31
|
## Features
|
|
11
32
|
|
|
12
|
-
- **
|
|
13
|
-
- **Production-ready**: Designed for Rails apps and
|
|
14
|
-
- **Simple API**: Clean
|
|
15
|
-
- **Resource optimization**: Built-in resource blocking for faster crawls
|
|
16
|
-
- **Auto-managed browsers**:
|
|
17
|
-
- **Content extraction**: HTML,
|
|
18
|
-
- **Multi-page crawling**: BFS crawler with depth limits and deduplication
|
|
33
|
+
- **Pure Ruby**: Ferrum drives Chromium directly via CDP — no Node.js or npm required
|
|
34
|
+
- **Production-ready**: Designed for Rails apps with auto-retry and exponential backoff
|
|
35
|
+
- **Simple API**: Clean Ruby interface — zero Ferrum or CDP knowledge required
|
|
36
|
+
- **Resource optimization**: Built-in resource blocking for 2-3x faster crawls
|
|
37
|
+
- **Auto-managed browsers**: Lazy Chrome singleton, isolated page per crawl
|
|
38
|
+
- **Content extraction**: HTML, plain text, clean HTML, Markdown (lazy), links, metadata
|
|
39
|
+
- **Multi-page crawling**: BFS crawler with configurable depth limits and URL deduplication
|
|
40
|
+
- **Smart URL handling**: Automatic normalization, tracking parameter removal, same-host filtering
|
|
19
41
|
- **Rails integration**: First-class Rails support with generators and initializers
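The "smart URL handling" above can be sketched in plain Ruby. This is an illustration of the idea, not the gem's actual `UrlNormalizer` implementation, and the tracking-parameter list is an assumption:

```ruby
require "uri"

# Illustrative URL normalization: lowercase the host, drop fragments and
# common tracking parameters, and strip trailing slashes so that
# variants of the same page deduplicate to one canonical URL.
# The parameter list below is an assumption, not RubyCrawl's actual list.
TRACKING_PARAMS = %w[utm_source utm_medium utm_campaign utm_term utm_content fbclid gclid].freeze

def normalize_url(url)
  uri = URI.parse(url)
  uri.host = uri.host&.downcase
  uri.fragment = nil
  if uri.query
    kept = URI.decode_www_form(uri.query).reject { |k, _| TRACKING_PARAMS.include?(k) }
    uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  end
  uri.path = uri.path.chomp("/") unless uri.path == "/"
  uri.to_s
end

normalize_url("https://Example.com/docs/?utm_source=x#intro")
# => "https://example.com/docs"
```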
|
|
20
42
|
|
|
21
43
|
## Table of Contents
|
|
22
44
|
|
|
23
45
|
- [Installation](#installation)
|
|
24
46
|
- [Quick Start](#quick-start)
|
|
47
|
+
- [Use Cases](#use-cases)
|
|
25
48
|
- [Usage](#usage)
|
|
26
49
|
- [Basic Crawling](#basic-crawling)
|
|
27
50
|
- [Multi-Page Crawling](#multi-page-crawling)
|
|
28
51
|
- [Configuration](#configuration)
|
|
29
52
|
- [Result Object](#result-object)
|
|
53
|
+
- [Error Handling](#error-handling)
|
|
30
54
|
- [Rails Integration](#rails-integration)
|
|
31
55
|
- [Production Deployment](#production-deployment)
|
|
32
56
|
- [Architecture](#architecture)
|
|
@@ -40,7 +64,7 @@ RubyCrawl provides accurate, JavaScript-enabled web scraping using Playwright's
|
|
|
40
64
|
### Requirements
|
|
41
65
|
|
|
42
66
|
- **Ruby** >= 3.0
|
|
43
|
-
- **
|
|
67
|
+
- **Chrome or Chromium** — detected automatically by Ferrum (install via your system package manager if missing)
|
|
44
68
|
|
|
45
69
|
### Add to Gemfile
|
|
46
70
|
|
|
@@ -54,9 +78,9 @@ Then install:
|
|
|
54
78
|
bundle install
|
|
55
79
|
```
|
|
56
80
|
|
|
57
|
-
### Install
|
|
81
|
+
### Install Chrome
|
|
58
82
|
|
|
59
|
-
|
|
83
|
+
Ferrum manages Chrome automatically. Run the install task to verify Chrome is available and generate a Rails initializer:
|
|
60
84
|
|
|
61
85
|
```bash
|
|
62
86
|
bundle exec rake rubycrawl:install
|
|
@@ -64,9 +88,10 @@ bundle exec rake rubycrawl:install
|
|
|
64
88
|
|
|
65
89
|
This command:
|
|
66
90
|
|
|
67
|
-
-
|
|
68
|
-
-
|
|
69
|
-
|
|
91
|
+
- ✅ Checks for Chrome/Chromium in your PATH
|
|
92
|
+
- ✅ Creates a Rails initializer (if using Rails)
|
|
93
|
+
|
|
94
|
+
**Note:** If Chrome is not in your PATH, install it via your system package manager or download from [google.com/chrome](https://www.google.com/chrome/).
|
|
70
95
|
|
|
71
96
|
## Quick Start
|
|
72
97
|
|
|
@@ -77,27 +102,37 @@ require "rubycrawl"
|
|
|
77
102
|
result = RubyCrawl.crawl("https://example.com")
|
|
78
103
|
|
|
79
104
|
# Access extracted content
|
|
80
|
-
|
|
81
|
-
|
|
82
|
-
|
|
83
|
-
|
|
105
|
+
result.final_url # Final URL after redirects
|
|
106
|
+
result.clean_text # Noise-stripped plain text (no nav/footer/ads)
|
|
107
|
+
result.clean_html # Noise-stripped HTML (same noise removed as clean_text)
|
|
108
|
+
result.raw_text # Full body.innerText (unfiltered)
|
|
109
|
+
result.html # Full raw HTML content
|
|
110
|
+
result.links # Extracted links with url, text, title, rel
|
|
111
|
+
result.metadata # Title, description, OG tags, etc.
|
|
112
|
+
result.clean_markdown # Markdown converted from clean_html (lazy — first access only)
|
|
84
113
|
```
|
|
85
114
|
|
|
115
|
+
## Use Cases
|
|
116
|
+
|
|
117
|
+
RubyCrawl is perfect for:
|
|
118
|
+
|
|
119
|
+
- **RAG applications**: Build knowledge bases for LLM/AI applications by crawling documentation sites
|
|
120
|
+
- **Data aggregation**: Crawl product catalogs, job listings, or news articles
|
|
121
|
+
- **SEO analysis**: Extract metadata, links, and content structure
|
|
122
|
+
- **Content migration**: Convert existing sites to Markdown for static site generators
|
|
123
|
+
- **Documentation scraping**: Create local copies of documentation with preserved links
|
|
124
|
+
|
|
86
125
|
## Usage
|
|
87
126
|
|
|
88
127
|
### Basic Crawling
|
|
89
128
|
|
|
90
|
-
The simplest way to crawl a URL:
|
|
91
|
-
|
|
92
129
|
```ruby
|
|
93
130
|
result = RubyCrawl.crawl("https://example.com")
|
|
94
131
|
|
|
95
|
-
#
|
|
96
|
-
result.
|
|
97
|
-
result.
|
|
98
|
-
result.
|
|
99
|
-
result.metadata # => { "status" => 200, "final_url" => "https://example.com" }
|
|
100
|
-
result.text # => "" (coming soon)
|
|
132
|
+
result.html # => "<html>...</html>"
|
|
133
|
+
result.clean_text # => "Example Domain\n\nThis domain is..." (no nav/ads)
|
|
134
|
+
result.raw_text # => "Example Domain\nThis domain is..." (full body text)
|
|
135
|
+
result.metadata # => { "final_url" => "https://example.com", "title" => "..." }
|
|
101
136
|
```
|
|
102
137
|
|
|
103
138
|
### Multi-Page Crawling
|
|
@@ -109,50 +144,83 @@ Crawl an entire site following links with BFS (breadth-first search):
|
|
|
109
144
|
RubyCrawl.crawl_site("https://example.com", max_pages: 100, max_depth: 3) do |page|
|
|
110
145
|
# Each page is yielded as it's crawled (streaming)
|
|
111
146
|
puts "Crawled: #{page.url} (depth: #{page.depth})"
|
|
112
|
-
|
|
147
|
+
|
|
113
148
|
# Save to database
|
|
114
149
|
Page.create!(
|
|
115
|
-
url:
|
|
116
|
-
html:
|
|
117
|
-
markdown: page.
|
|
118
|
-
depth:
|
|
150
|
+
url: page.url,
|
|
151
|
+
html: page.html,
|
|
152
|
+
markdown: page.clean_markdown,
|
|
153
|
+
depth: page.depth
|
|
119
154
|
)
|
|
120
155
|
end
|
|
121
156
|
```
|
|
122
157
|
|
|
158
|
+
**Real-world example: Building a RAG knowledge base**
|
|
159
|
+
|
|
160
|
+
```ruby
|
|
161
|
+
require "rubycrawl"
|
|
162
|
+
|
|
163
|
+
RubyCrawl.configure(
|
|
164
|
+
wait_until: "networkidle", # Ensure JS content loads
|
|
165
|
+
block_resources: true # Skip images/fonts for speed
|
|
166
|
+
)
|
|
167
|
+
|
|
168
|
+
pages_crawled = RubyCrawl.crawl_site(
|
|
169
|
+
"https://docs.example.com",
|
|
170
|
+
max_pages: 500,
|
|
171
|
+
max_depth: 5,
|
|
172
|
+
same_host_only: true
|
|
173
|
+
) do |page|
|
|
174
|
+
VectorDB.upsert(
|
|
175
|
+
id: Digest::SHA256.hexdigest(page.url),
|
|
176
|
+
content: page.clean_markdown,
|
|
177
|
+
metadata: {
|
|
178
|
+
url: page.url,
|
|
179
|
+
title: page.metadata["title"],
|
|
180
|
+
depth: page.depth
|
|
181
|
+
}
|
|
182
|
+
)
|
|
183
|
+
end
|
|
184
|
+
|
|
185
|
+
puts "Indexed #{pages_crawled} pages"
|
|
186
|
+
```
|
|
187
|
+
|
|
123
188
|
#### Multi-Page Options
|
|
124
189
|
|
|
125
|
-
| Option
|
|
126
|
-
|
|
127
|
-
| `max_pages`
|
|
128
|
-
| `max_depth`
|
|
129
|
-
| `same_host_only`
|
|
130
|
-
| `wait_until`
|
|
131
|
-
| `block_resources` | inherited | Block images/fonts/CSS
|
|
190
|
+
| Option | Default | Description |
|
|
191
|
+
| ----------------- | --------- | ------------------------------------ |
|
|
192
|
+
| `max_pages` | 50 | Maximum number of pages to crawl |
|
|
193
|
+
| `max_depth` | 3 | Maximum link depth from start URL |
|
|
194
|
+
| `same_host_only` | true | Only follow links on the same domain |
|
|
195
|
+
| `wait_until` | inherited | Page load strategy |
|
|
196
|
+
| `block_resources` | inherited | Block images/fonts/CSS |
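The options above map directly onto a standard breadth-first traversal. A minimal sketch of the algorithm (using an in-memory link graph in place of live pages; this is not the gem's actual `SiteCrawler` code):

```ruby
require "set"

# BFS crawl sketch: visit pages breadth-first, honoring max_pages and
# max_depth, and deduplicate URLs with a visited set.
# `graph` stands in for "fetch a page and extract its links".
def bfs_crawl(graph, start_url, max_pages: 50, max_depth: 3)
  queue   = [[start_url, 0]]
  visited = Set.new([start_url])
  crawled = []

  until queue.empty? || crawled.size >= max_pages
    url, depth = queue.shift
    crawled << url
    yield url, depth if block_given?
    next if depth >= max_depth # don't expand links beyond max_depth

    (graph[url] || []).each do |link|
      next if visited.include?(link) # deduplication
      visited << link
      queue << [link, depth + 1]
    end
  end
  crawled
end

GRAPH = {
  "/"  => ["/a", "/b"],
  "/a" => ["/b", "/c"], # "/b" is deduplicated here
  "/b" => [],
  "/c" => []
}.freeze

bfs_crawl(GRAPH, "/", max_pages: 10, max_depth: 2)
# => ["/", "/a", "/b", "/c"]
```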
|
|
132
197
|
|
|
133
198
|
#### Page Result Object
|
|
134
199
|
|
|
135
200
|
The block receives a `PageResult` with:
|
|
136
201
|
|
|
137
202
|
```ruby
|
|
138
|
-
page.url
|
|
139
|
-
page.html
|
|
140
|
-
page.
|
|
141
|
-
page.
|
|
142
|
-
page.
|
|
143
|
-
page.
|
|
203
|
+
page.url # String: Final URL after redirects
|
|
204
|
+
page.html # String: Full raw HTML content
|
|
205
|
+
page.clean_html # String: Noise-stripped HTML (no nav/header/footer/ads)
|
|
206
|
+
page.clean_text # String: Noise-stripped plain text (derived from clean_html)
|
|
207
|
+
page.raw_text # String: Full body.innerText (unfiltered)
|
|
208
|
+
page.clean_markdown # String: Lazy-converted Markdown from clean_html
|
|
209
|
+
page.links # Array: URLs extracted from page
|
|
210
|
+
page.metadata # Hash: final_url, title, OG tags, etc.
|
|
211
|
+
page.depth # Integer: Link depth from start URL
|
|
144
212
|
```
|
|
145
213
|
|
|
146
214
|
### Configuration
|
|
147
215
|
|
|
148
216
|
#### Global Configuration
|
|
149
217
|
|
|
150
|
-
Set default options that apply to all crawls:
|
|
151
|
-
|
|
152
218
|
```ruby
|
|
153
219
|
RubyCrawl.configure(
|
|
154
|
-
wait_until:
|
|
155
|
-
block_resources: true
|
|
220
|
+
wait_until: "networkidle",
|
|
221
|
+
block_resources: true,
|
|
222
|
+
timeout: 60,
|
|
223
|
+
headless: true
|
|
156
224
|
)
|
|
157
225
|
|
|
158
226
|
# All subsequent crawls use these defaults
|
|
@@ -161,8 +229,6 @@ result = RubyCrawl.crawl("https://example.com")
|
|
|
161
229
|
|
|
162
230
|
#### Per-Request Options
|
|
163
231
|
|
|
164
|
-
Override defaults for specific requests:
|
|
165
|
-
|
|
166
232
|
```ruby
|
|
167
233
|
# Use global defaults
|
|
168
234
|
result = RubyCrawl.crawl("https://example.com")
|
|
@@ -170,36 +236,41 @@ result = RubyCrawl.crawl("https://example.com")
|
|
|
170
236
|
# Override for this request only
|
|
171
237
|
result = RubyCrawl.crawl(
|
|
172
238
|
"https://example.com",
|
|
173
|
-
wait_until:
|
|
239
|
+
wait_until: "domcontentloaded",
|
|
174
240
|
block_resources: false
|
|
175
241
|
)
|
|
176
242
|
```
|
|
177
243
|
|
|
178
244
|
#### Configuration Options
|
|
179
245
|
|
|
180
|
-
| Option | Values
|
|
181
|
-
| ----------------- |
|
|
182
|
-
| `wait_until` | `"load"`, `"domcontentloaded"`, `"networkidle"
|
|
183
|
-
| `block_resources` | `true`, `false`
|
|
246
|
+
| Option | Values | Default | Description |
|
|
247
|
+
| ----------------- | ----------------------------------------------------------- | ------- | --------------------------------------------------- |
|
|
248
|
+
| `wait_until` | `"load"`, `"domcontentloaded"`, `"networkidle"`, `"commit"` | `nil` | When to consider page loaded (nil = Ferrum default) |
|
|
249
|
+
| `block_resources` | `true`, `false` | `nil` | Block images, fonts, CSS, media for faster crawls |
|
|
250
|
+
| `max_attempts` | Integer | `3` | Total number of attempts (including the first) |
|
|
251
|
+
| `timeout` | Integer (seconds) | `30` | Browser navigation timeout |
|
|
252
|
+
| `headless` | `true`, `false` | `true` | Run Chrome headlessly |
|
|
184
253
|
|
|
185
254
|
**Wait strategies explained:**
|
|
186
255
|
|
|
187
|
-
- `load` — Wait for the load event (
|
|
188
|
-
- `domcontentloaded` — Wait for DOM ready (
|
|
189
|
-
- `networkidle` — Wait until no network requests for 500ms (
|
|
256
|
+
- `load` — Wait for the load event (good for static sites)
|
|
257
|
+
- `domcontentloaded` — Wait for DOM ready (faster)
|
|
258
|
+
- `networkidle` — Wait until no network requests for 500ms (best for SPAs)
|
|
259
|
+
- `commit` — Wait until the first response bytes are received (fastest)
|
|
190
260
|
|
|
191
261
|
### Result Object
|
|
192
262
|
|
|
193
|
-
The crawl result is a `RubyCrawl::Result` object with these attributes:
|
|
194
|
-
|
|
195
263
|
```ruby
|
|
196
264
|
result = RubyCrawl.crawl("https://example.com")
|
|
197
265
|
|
|
198
|
-
result.html
|
|
199
|
-
result.
|
|
200
|
-
result.
|
|
201
|
-
result.
|
|
202
|
-
result.
|
|
266
|
+
result.html # String: Full raw HTML
|
|
267
|
+
result.clean_html # String: Noise-stripped HTML (nav/header/footer/ads removed)
|
|
268
|
+
result.clean_text # String: Plain text derived from clean_html — ideal for RAG
|
|
269
|
+
result.raw_text # String: Full body.innerText (unfiltered)
|
|
270
|
+
result.clean_markdown # String: Markdown from clean_html (lazy — computed on first access)
|
|
271
|
+
result.links # Array: Extracted links with url/text/title/rel
|
|
272
|
+
result.metadata # Hash: See below
|
|
273
|
+
result.final_url # String: Shortcut for metadata['final_url']
|
|
203
274
|
```
|
|
204
275
|
|
|
205
276
|
#### Links Format
|
|
@@ -207,101 +278,89 @@ result.metadata # Hash: Comprehensive metadata (see below)
|
|
|
207
278
|
```ruby
|
|
208
279
|
result.links
|
|
209
280
|
# => [
|
|
210
|
-
# { "url" => "https://example.com/about", "text" => "About
|
|
211
|
-
# { "url" => "https://example.com/contact", "text" => "Contact" },
|
|
212
|
-
# ...
|
|
281
|
+
# { "url" => "https://example.com/about", "text" => "About", "title" => nil, "rel" => nil },
|
|
282
|
+
# { "url" => "https://example.com/contact", "text" => "Contact", "title" => nil, "rel" => "nofollow" },
|
|
213
283
|
# ]
|
|
214
284
|
```
|
|
215
285
|
|
|
286
|
+
URLs are automatically resolved to absolute form by the browser.
|
|
287
|
+
|
|
216
288
|
#### Markdown Conversion
|
|
217
289
|
|
|
218
|
-
Markdown is **lazy
|
|
290
|
+
Markdown is **lazy** — conversion only happens on first access of `.clean_markdown`:
|
|
219
291
|
|
|
220
292
|
```ruby
|
|
221
|
-
result
|
|
222
|
-
result.
|
|
223
|
-
result.
|
|
224
|
-
result.markdown # ✅ Cached, instant
|
|
293
|
+
result.clean_html # ✅ Already available, no overhead
|
|
294
|
+
result.clean_markdown # Converts clean_html → Markdown here (first call only)
|
|
295
|
+
result.clean_markdown # ✅ Cached, instant on subsequent calls
|
|
225
296
|
```
|
|
226
297
|
|
|
227
298
|
Uses [reverse_markdown](https://github.com/xijo/reverse_markdown) with GitHub-flavored output.
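The lazy/cached behavior is ordinary Ruby `||=` memoization. A sketch with a stand-in converter so it runs without the gem (not the actual `Result` class):

```ruby
# Memoization sketch: the expensive conversion runs once on first access,
# then the cached string is returned. The toy `convert` stands in for
# ReverseMarkdown.convert.
class ResultSketch
  attr_reader :conversions

  def initialize(clean_html)
    @clean_html = clean_html
    @conversions = 0
  end

  def clean_markdown
    @clean_markdown ||= convert(@clean_html) # computed once, then cached
  end

  private

  def convert(html)
    @conversions += 1
    html.gsub("<h1>", "# ").gsub("</h1>", "")
  end
end

r = ResultSketch.new("<h1>Title</h1>")
r.clean_markdown # conversion happens here
r.clean_markdown # cached, instant
r.conversions    # => 1
```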
|
|
228
299
|
|
|
229
300
|
#### Metadata Fields
|
|
230
301
|
|
|
231
|
-
The `metadata` hash includes HTTP and HTML metadata:
|
|
232
|
-
|
|
233
302
|
```ruby
|
|
234
303
|
result.metadata
|
|
235
304
|
# => {
|
|
236
|
-
# "
|
|
237
|
-
# "
|
|
238
|
-
# "
|
|
239
|
-
# "
|
|
240
|
-
# "
|
|
241
|
-
# "
|
|
242
|
-
# "
|
|
243
|
-
# "
|
|
244
|
-
# "
|
|
245
|
-
# "
|
|
246
|
-
# "
|
|
247
|
-
# "
|
|
248
|
-
# "
|
|
249
|
-
# "
|
|
250
|
-
# "
|
|
251
|
-
# "
|
|
252
|
-
# "
|
|
253
|
-
# "charset" => "UTF-8" # Character encoding
|
|
305
|
+
# "final_url" => "https://example.com",
|
|
306
|
+
# "title" => "Page Title",
|
|
307
|
+
# "description" => "...",
|
|
308
|
+
# "keywords" => "ruby, web",
|
|
309
|
+
# "author" => "Author Name",
|
|
310
|
+
# "og_title" => "...",
|
|
311
|
+
# "og_description" => "...",
|
|
312
|
+
# "og_image" => "https://...",
|
|
313
|
+
# "og_url" => "https://...",
|
|
314
|
+
# "og_type" => "website",
|
|
315
|
+
# "twitter_card" => "summary",
|
|
316
|
+
# "twitter_title" => "...",
|
|
317
|
+
# "twitter_description" => "...",
|
|
318
|
+
# "twitter_image" => "https://...",
|
|
319
|
+
# "canonical" => "https://...",
|
|
320
|
+
# "lang" => "en",
|
|
321
|
+
# "charset" => "UTF-8"
|
|
254
322
|
# }
|
|
255
323
|
```
|
|
256
324
|
|
|
257
|
-
Note: All HTML metadata fields may be `null` if not present on the page.
|
|
258
|
-
|
|
259
325
|
### Error Handling
|
|
260
326
|
|
|
261
|
-
RubyCrawl provides specific exception classes for different error scenarios:
|
|
262
|
-
|
|
263
327
|
```ruby
|
|
264
328
|
begin
|
|
265
329
|
result = RubyCrawl.crawl(url)
|
|
266
330
|
rescue RubyCrawl::ConfigurationError => e
|
|
267
|
-
# Invalid URL or
|
|
268
|
-
puts "Configuration error: #{e.message}"
|
|
331
|
+
# Invalid URL or option value
|
|
269
332
|
rescue RubyCrawl::TimeoutError => e
|
|
270
|
-
# Page load
|
|
271
|
-
puts "Timeout: #{e.message}"
|
|
333
|
+
# Page load timed out
|
|
272
334
|
rescue RubyCrawl::NavigationError => e
|
|
273
|
-
#
|
|
274
|
-
puts "Navigation failed: #{e.message}"
|
|
335
|
+
# Navigation failed (404, DNS error, SSL error)
|
|
275
336
|
rescue RubyCrawl::ServiceError => e
|
|
276
|
-
#
|
|
277
|
-
puts "Service error: #{e.message}"
|
|
337
|
+
# Browser failed to start or crashed
|
|
278
338
|
rescue RubyCrawl::Error => e
|
|
279
339
|
# Catch-all for any RubyCrawl error
|
|
280
|
-
puts "Crawl error: #{e.message}"
|
|
281
340
|
end
|
|
282
341
|
```
|
|
283
342
|
|
|
284
343
|
**Exception Hierarchy:**
|
|
285
|
-
- `RubyCrawl::Error` (base class)
|
|
286
|
-
- `RubyCrawl::ConfigurationError` - Invalid URL or configuration
|
|
287
|
-
- `RubyCrawl::TimeoutError` - Timeout during crawl
|
|
288
|
-
- `RubyCrawl::NavigationError` - Page navigation failed
|
|
289
|
-
- `RubyCrawl::ServiceError` - Node service issues
|
|
290
344
|
|
|
291
|
-
|
|
345
|
+
```
|
|
346
|
+
RubyCrawl::Error
|
|
347
|
+
├── ConfigurationError — invalid URL or option value
|
|
348
|
+
├── TimeoutError — page load timed out
|
|
349
|
+
├── NavigationError — navigation failed (HTTP error, DNS, SSL)
|
|
350
|
+
└── ServiceError — browser failed to start or crashed
|
|
351
|
+
```
|
|
352
|
+
|
|
353
|
+
**Automatic Retry:** `ServiceError` and `TimeoutError` are retried with exponential backoff. `NavigationError` and `ConfigurationError` are not retried (they won't succeed on retry).
|
|
292
354
|
|
|
293
355
|
```ruby
|
|
294
|
-
RubyCrawl.configure(
|
|
295
|
-
#
|
|
296
|
-
RubyCrawl.crawl(url, retries: 1) # Disable retry
|
|
356
|
+
RubyCrawl.configure(max_attempts: 5) # 5 total attempts
|
|
357
|
+
RubyCrawl.crawl(url, max_attempts: 1) # Disable retries
|
|
297
358
|
```
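Conceptually, the retry behavior is a sleep-and-double loop: retryable errors are reattempted with growing delays, everything else propagates immediately. A simplified sketch (`RetryableError` stands in for `ServiceError`/`TimeoutError`, and the base delay is illustrative, not the gem's exact value):

```ruby
# Retry sketch: retryable errors are reattempted with exponential backoff
# (delay doubles each attempt); non-retryable errors propagate at once.
class RetryableError < StandardError; end

RETRYABLE = [RetryableError].freeze # stand-in for ServiceError/TimeoutError

def with_retries(max_attempts: 3, base_delay: 0.5)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue *RETRYABLE
    raise if attempts >= max_attempts
    sleep(base_delay * (2**(attempts - 1))) # 0.5s, 1s, 2s, ...
    retry
  end
end

# Fails twice, succeeds on the third attempt:
result = with_retries(max_attempts: 3, base_delay: 0) do |attempt|
  raise RetryableError if attempt < 3
  "ok on attempt #{attempt}"
end
result # => "ok on attempt 3"
```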
|
|
298
359
|
|
|
299
360
|
## Rails Integration
|
|
300
361
|
|
|
301
362
|
### Installation
|
|
302
363
|
|
|
303
|
-
Run the installer in your Rails app:
|
|
304
|
-
|
|
305
364
|
```bash
|
|
306
365
|
bundle exec rake rubycrawl:install
|
|
307
366
|
```
|
|
@@ -309,264 +368,157 @@ bundle exec rake rubycrawl:install
|
|
|
309
368
|
This creates `config/initializers/rubycrawl.rb`:
|
|
310
369
|
|
|
311
370
|
```ruby
|
|
312
|
-
# frozen_string_literal: true
|
|
313
|
-
|
|
314
|
-
# rubycrawl default configuration
|
|
315
371
|
RubyCrawl.configure(
|
|
316
|
-
wait_until:
|
|
372
|
+
wait_until: "load",
|
|
317
373
|
block_resources: true
|
|
318
374
|
)
|
|
319
375
|
```
|
|
320
376
|
|
|
321
377
|
### Usage in Rails
|
|
322
378
|
|
|
379
|
+
#### Background Jobs with ActiveJob
|
|
380
|
+
|
|
323
381
|
```ruby
|
|
324
|
-
|
|
325
|
-
|
|
382
|
+
class CrawlPageJob < ApplicationJob
|
|
383
|
+
queue_as :crawlers
|
|
384
|
+
|
|
385
|
+
retry_on RubyCrawl::ServiceError, wait: :exponentially_longer, attempts: 5
|
|
386
|
+
retry_on RubyCrawl::TimeoutError, wait: :exponentially_longer, attempts: 3
|
|
387
|
+
discard_on RubyCrawl::ConfigurationError
|
|
388
|
+
|
|
326
389
|
def perform(url)
|
|
327
390
|
result = RubyCrawl.crawl(url)
|
|
328
391
|
|
|
329
|
-
|
|
330
|
-
|
|
331
|
-
|
|
332
|
-
|
|
333
|
-
|
|
392
|
+
Page.create!(
|
|
393
|
+
url: result.final_url,
|
|
394
|
+
title: result.metadata['title'],
|
|
395
|
+
content: result.clean_text,
|
|
396
|
+
markdown: result.clean_markdown,
|
|
397
|
+
crawled_at: Time.current
|
|
334
398
|
)
|
|
335
399
|
end
|
|
336
400
|
end
|
|
337
401
|
```
|
|
338
402
|
|
|
403
|
+
**Multi-page RAG knowledge base:**
|
|
404
|
+
|
|
405
|
+
```ruby
|
|
406
|
+
class BuildKnowledgeBaseJob < ApplicationJob
|
|
407
|
+
queue_as :crawlers
|
|
408
|
+
|
|
409
|
+
def perform(documentation_url)
|
|
410
|
+
RubyCrawl.crawl_site(documentation_url, max_pages: 500, max_depth: 5) do |page|
|
|
411
|
+
embedding = OpenAI.embed(page.clean_markdown)
|
|
412
|
+
|
|
413
|
+
Document.create!(
|
|
414
|
+
url: page.url,
|
|
415
|
+
title: page.metadata['title'],
|
|
416
|
+
content: page.clean_markdown,
|
|
417
|
+
embedding: embedding,
|
|
418
|
+
depth: page.depth
|
|
419
|
+
)
|
|
420
|
+
end
|
|
421
|
+
end
|
|
422
|
+
end
|
|
423
|
+
```
|
|
424
|
+
|
|
425
|
+
#### Best Practices
|
|
426
|
+
|
|
427
|
+
1. **Use background jobs** to avoid blocking web requests
|
|
428
|
+
2. **Configure retry logic** based on error type
|
|
429
|
+
3. **Store `clean_markdown`** for RAG applications (preserves heading structure for chunking)
|
|
430
|
+
4. **Rate limit** external crawling to be respectful
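For the last point, a minimal throttle that enforces a minimum interval between requests is usually enough. A sketch (the 1-second interval is an assumption; RubyCrawl itself does not ship this helper):

```ruby
# Throttle sketch: enforces a minimum interval between successive calls
# by sleeping for whatever time remains from the previous call.
class Throttle
  def initialize(min_interval)
    @min_interval = min_interval
    @last = nil
  end

  # Runs the block, sleeping first if needed. Returns seconds slept.
  def call(now: Time.now)
    wait = @last ? [@min_interval - (now - @last), 0].max : 0
    sleep(wait)
    @last = now + wait
    yield if block_given?
    wait
  end
end

throttle = Throttle.new(1.0)
# Inside a crawl_site block you would call:
#   throttle.call { Document.upsert(url: page.url, content: page.clean_markdown) }
```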
|
|
431
|
+
|
|
339
432
|
## Production Deployment
|
|
340
433
|
|
|
341
434
|
### Pre-deployment Checklist
|
|
342
435
|
|
|
343
|
-
1. **
|
|
436
|
+
1. **Ensure Chrome is installed** on your production servers
|
|
344
437
|
2. **Run installer** during deployment:
|
|
345
438
|
```bash
|
|
346
439
|
bundle exec rake rubycrawl:install
|
|
347
440
|
```
|
|
348
|
-
3. **Set environment variables** (optional):
|
|
349
|
-
```bash
|
|
350
|
-
export RUBYCRAWL_NODE_BIN=/usr/bin/node # Custom Node.js path
|
|
351
|
-
export RUBYCRAWL_NODE_LOG=/var/log/rubycrawl.log # Service logs
|
|
352
|
-
```
|
|
353
441
|
|
|
354
442
|
### Docker Example
|
|
355
443
|
|
|
356
444
|
```dockerfile
|
|
357
445
|
FROM ruby:3.2
|
|
358
446
|
|
|
359
|
-
# Install
|
|
360
|
-
RUN
|
|
361
|
-
|
|
362
|
-
|
|
363
|
-
|
|
364
|
-
RUN npx playwright install-deps
|
|
447
|
+
# Install Chrome
|
|
448
|
+
RUN apt-get update && apt-get install -y \
|
|
449
|
+
chromium \
|
|
450
|
+
--no-install-recommends \
|
|
451
|
+
&& rm -rf /var/lib/apt/lists/*
|
|
365
452
|
|
|
366
453
|
WORKDIR /app
|
|
367
454
|
COPY Gemfile* ./
|
|
368
455
|
RUN bundle install
|
|
369
456
|
|
|
370
|
-
# Install Playwright browsers
|
|
371
|
-
RUN bundle exec rake rubycrawl:install
|
|
372
|
-
|
|
373
457
|
COPY . .
|
|
374
458
|
CMD ["rails", "server"]
|
|
375
459
|
```
|
|
376
460
|
|
|
377
|
-
|
|
378
|
-
|
|
379
|
-
Add the Node.js buildpack:
|
|
380
|
-
|
|
381
|
-
```bash
|
|
382
|
-
heroku buildpacks:add heroku/nodejs
|
|
383
|
-
heroku buildpacks:add heroku/ruby
|
|
384
|
-
```
|
|
385
|
-
|
|
386
|
-
Add to `package.json` in your Rails root:
|
|
461
|
+
Ferrum will detect `chromium` automatically. To specify a custom path:
|
|
387
462
|
|
|
388
|
-
```
|
|
389
|
-
|
|
390
|
-
"
|
|
391
|
-
|
|
392
|
-
}
|
|
393
|
-
}
|
|
463
|
+
```ruby
|
|
464
|
+
RubyCrawl.configure(
|
|
465
|
+
browser_options: { "browser-path": "/usr/bin/chromium" }
|
|
466
|
+
)
|
|
394
467
|
```
|
|
395
468
|
|
|
396
|
-
### Performance Tips
|
|
397
|
-
|
|
398
|
-
- **Reuse instances**: Use the class-level `RubyCrawl.crawl` method (recommended) rather than creating new instances
|
|
399
|
-
- **Resource blocking**: Keep `block_resources: true` for 2-3x faster crawls when you don't need images/CSS
|
|
400
|
-
- **Concurrency**: Use background jobs (Sidekiq, etc.) for parallel crawling
|
|
401
|
-
- **Browser reuse**: The first crawl is slower due to browser launch; subsequent crawls reuse the process
|
|
402
|
-
|
|
403
469
|
## Architecture
|
|
404
470
|
|
|
405
|
-
RubyCrawl uses a
|
|
471
|
+
RubyCrawl uses a single-process architecture:
|
|
406
472
|
|
|
407
473
|
```
|
|
408
|
-
|
|
409
|
-
|
|
410
|
-
|
|
411
|
-
|
|
412
|
-
|
|
413
|
-
|
|
414
|
-
|
|
415
|
-
│ └────────────┬────────────────────────┘ │
|
|
416
|
-
└───────────────┼─────────────────────────────┘
|
|
417
|
-
│ HTTP/JSON (localhost:3344)
|
|
418
|
-
┌───────────────┼─────────────────────────────┐
|
|
419
|
-
│ Node.js Process (Auto-started) │
|
|
420
|
-
│ ┌────────────┴────────────────────────┐ │
|
|
421
|
-
│ │ Playwright Service │ │
|
|
422
|
-
│ │ • Browser management │ │
|
|
423
|
-
│ │ • Page navigation │ │
|
|
424
|
-
│ │ • HTML extraction │ │
|
|
425
|
-
│ │ • Resource blocking │ │
|
|
426
|
-
│ └─────────────────────────────────────┘ │
|
|
427
|
-
└─────────────────────────────────────────────┘
|
|
474
|
+
RubyCrawl (public API)
|
|
475
|
+
↓
|
|
476
|
+
Browser (lib/rubycrawl/browser.rb) ← Ferrum wrapper
|
|
477
|
+
↓
|
|
478
|
+
Ferrum::Browser ← Chrome DevTools Protocol (pure Ruby)
|
|
479
|
+
↓
|
|
480
|
+
Chromium ← headless browser
|
|
428
481
|
```
|
|
429
482
|
|
|
430
|
-
|
|
431
|
-
|
|
432
|
-
-
|
|
433
|
-
-
|
|
434
|
-
- **Performance**: Long-running browser process, reused across requests
|
|
435
|
-
- **Simplicity**: No C extensions, pure Ruby + bundled Node service
|
|
436
|
-
|
|
437
|
-
See [.github/copilot-instructions.md](.github/copilot-instructions.md) for detailed architecture documentation.
|
|
483
|
+
- Chrome launches once lazily and is reused across all crawls
|
|
484
|
+
- Each crawl gets an isolated page context (own cookies/storage)
|
|
485
|
+
- JS extraction runs inside the browser via `page.evaluate()`
|
|
486
|
+
- No separate processes, no HTTP boundary, no Node.js
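The lazy, reused browser in the first bullet is the classic mutex-guarded memoized singleton. A pattern sketch (`FakeBrowser` stands in for `Ferrum::Browser` so the example runs without Chrome; this is not the gem's actual `Browser` class):

```ruby
# Lazy singleton sketch: the browser launches once on first use and is
# reused for all later crawls; a mutex guards concurrent first launches.
class FakeBrowser
  @@launches = 0

  def self.launches
    @@launches
  end

  def initialize
    @@launches += 1 # in the real gem: Ferrum::Browser.new(headless: true)
  end
end

module BrowserManager
  MUTEX = Mutex.new

  def self.instance
    MUTEX.synchronize { @instance ||= FakeBrowser.new }
  end
end

BrowserManager.instance # first call launches the browser
BrowserManager.instance # later calls reuse the same instance
FakeBrowser.launches    # => 1
```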
|
|
438
487
|
|
|
439
488
|
## Performance
|
|
440
489
|
|
|
441
|
-
|
|
442
|
-
|
|
443
|
-
|
|
444
|
-
|
|
445
|
-
| Page Type | First Crawl | Subsequent | Config |
|
|
446
|
-
| ----------- | ----------- | ---------- | --------------------------- |
|
|
447
|
-
| Static HTML | ~2s | ~500ms | `block_resources: true` |
|
|
448
|
-
| SPA (React) | ~3s | ~1.2s | `wait_until: "networkidle"` |
|
|
449
|
-
| Heavy site | ~4s | ~2s | `block_resources: false` |
|
|
450
|
-
|
|
451
|
-
**Note**: First crawl includes browser launch time (~1.5s). Subsequent crawls reuse the browser.
|
|
452
|
-
|
|
453
|
-
### Optimization Tips
|
|
454
|
-
|
|
455
|
-
1. **Enable resource blocking** for content-only extraction:
|
|
456
|
-
|
|
457
|
-
```ruby
|
|
458
|
-
RubyCrawl.configure(block_resources: true)
|
|
459
|
-
```
|
|
460
|
-
|
|
461
|
-
2. **Use appropriate wait strategy**:
|
|
462
|
-
- Static sites: `wait_until: "load"`
|
|
463
|
-
- SPAs: `wait_until: "networkidle"`
|
|
464
|
-
|
|
465
|
-
3. **Batch processing**: Use background jobs for concurrent crawling:
|
|
466
|
-
```ruby
|
|
467
|
-
urls.each { |url| CrawlJob.perform_later(url) }
|
|
468
|
-
```
|
|
490
|
+
- **Resource blocking**: Set `block_resources: true` (default: `nil`) to skip images/fonts/CSS for 2-3x faster crawls
|
|
491
|
+
- **Wait strategy**: Use `wait_until: "load"` for static sites, `"networkidle"` for SPAs
|
|
492
|
+
- **Concurrency**: Use background jobs (Sidekiq, GoodJob, etc.) for parallel crawling
|
|
493
|
+
- **Browser reuse**: The first crawl is slower (~2s) due to Chrome launch; subsequent crawls are much faster (~200-500ms)
|
|
469
494
|
|
|
470
495
|
## Development
|
|
471
496
|
|
|
472
|
-
### Setup
|
|
473
|
-
|
|
474
497
|
```bash
|
|
475
498
|
git clone git@github.com:craft-wise/rubycrawl.git
|
|
476
499
|
cd rubycrawl
|
|
477
|
-
bin/setup
|
|
478
|
-
```
|
|
479
|
-
|
|
480
|
-
### Running Tests
|
|
500
|
+
bin/setup
|
|
481
501
|
|
|
482
|
-
|
|
502
|
+
# Run unit tests (no browser required)
|
|
483
503
|
bundle exec rspec
|
|
484
|
-
```
|
|
485
|
-
|
|
486
|
-
### Manual Testing
|
|
487
504
|
|
|
488
|
-
|
|
489
|
-
|
|
490
|
-
cd node
|
|
491
|
-
npm start
|
|
505
|
+
# Run integration tests (requires Chrome)
|
|
506
|
+
INTEGRATION=1 bundle exec rspec
|
|
492
507
|
|
|
493
|
-
#
|
|
508
|
+
# Manual testing
|
|
494
509
|
bin/console
|
|
495
|
-
>
|
|
496
|
-
>
|
|
510
|
+
> RubyCrawl.crawl("https://example.com")
|
|
511
|
+
> RubyCrawl.crawl("https://example.com").clean_text
|
|
512
|
+
> RubyCrawl.crawl("https://example.com").clean_markdown
|
|
497
513
|
```
|
|
498
514
|
|
|
499
|
-
### Project Structure
|
|
500
|
-
|
|
501
|
-
```
|
|
502
|
-
rubycrawl/
|
|
503
|
-
├── lib/
|
|
504
|
-
│ ├── rubycrawl.rb # Main gem entry point
|
|
505
|
-
│ ├── rubycrawl/
|
|
506
|
-
│ │ ├── version.rb # Gem version
|
|
507
|
-
│ │ ├── railtie.rb # Rails integration
|
|
508
|
-
│ │ └── tasks/
|
|
509
|
-
│ │ └── install.rake # Installation task
|
|
510
|
-
├── node/
|
|
511
|
-
│ ├── src/
|
|
512
|
-
│ │ └── index.js # Playwright HTTP service
|
|
513
|
-
│ ├── package.json
|
|
514
|
-
│ └── README.md
|
|
515
|
-
├── spec/ # RSpec tests
|
|
516
|
-
├── .github/
|
|
517
|
-
│ └── copilot-instructions.md # GitHub Copilot guidelines
|
|
518
|
-
├── CLAUDE.md # Claude AI guidelines
|
|
519
|
-
└── README.md
|
|
520
|
-
```
|
|
521
|
-
|
|
522
|
-
## Roadmap
|
|
523
|
-
|
|
524
|
-
### Current (v0.1.0)
|
|
525
|
-
|
|
526
|
-
- [x] HTML extraction
|
|
527
|
-
- [x] Link extraction
|
|
528
|
-
- [x] Markdown conversion (lazy-loaded)
|
|
529
|
-
- [x] Multi-page crawling with BFS
|
|
530
|
-
- [x] URL normalization and deduplication
|
|
531
|
-
- [x] Basic metadata (status, final URL)
|
|
532
|
-
- [x] Resource blocking
|
|
533
|
-
- [x] Rails integration
|
|
534
|
-
|
|
535
|
-
### Coming Soon
|
|
536
|
-
|
|
537
|
-
- [ ] Plain text extraction
|
|
538
|
-
- [ ] Screenshot capture
|
|
539
|
-
- [ ] Custom JavaScript execution
|
|
540
|
-
- [ ] Session/cookie support
|
|
541
|
-
- [ ] Proxy support
|
|
542
|
-
- [ ] Robots.txt support
|
|
543
|
-
|
|
544
515
|
## Contributing
|
|
545
516
|
|
|
546
517
|
Contributions are welcome! Please read our [contribution guidelines](.github/copilot-instructions.md) first.
|
|
547
518
|
|
|
548
|
-
### Development Philosophy
|
|
549
|
-
|
|
550
519
|
- **Simplicity over cleverness**: Prefer clear, explicit code
|
|
551
520
|
- **Stability over speed**: Correctness first, optimization second
|
|
552
|
-
- **
|
|
553
|
-
- **No vendor lock-in**: Pure open source, no SaaS dependencies
|
|
554
|
-
|
|
555
|
-
## Comparison with crawl4ai
|
|
556
|
-
|
|
557
|
-
| Feature | crawl4ai (Python) | rubycrawl (Ruby) |
|
|
558
|
-
| ------------------- | ----------------- | ---------------- |
|
|
559
|
-
| Browser automation | Playwright | Playwright |
|
|
560
|
-
| Language | Python | Ruby |
|
|
561
|
-
| LLM extraction | ✅ | Planned |
|
|
562
|
-
| Markdown extraction | ✅ | ✅ |
|
|
563
|
-
| Link extraction | ✅ | ✅ |
|
|
564
|
-
| Multi-page crawling | ✅ | ✅ |
|
|
565
|
-
| Rails integration | N/A | ✅ |
|
|
566
|
-
| Resource blocking | ✅ | ✅ |
|
|
567
|
-
| Session management | ✅ | Planned |
|
|
568
|
-
|
|
569
|
-
RubyCrawl aims to bring the same level of accuracy and reliability to the Ruby ecosystem.
|
|
521
|
+
- **Hide complexity**: Users should never need to know Ferrum exists
|
|
570
522
|
|
|
571
523
|
## License
|
|
572
524
|
|
|
@@ -574,12 +526,12 @@ The gem is available as open source under the terms of the [MIT License](LICENSE
|
|
|
574
526
|
|
|
575
527
|
## Credits
|
|
576
528
|
|
|
577
|
-
|
|
529
|
+
Built with [Ferrum](https://github.com/rubycdp/ferrum) — pure Ruby Chrome DevTools Protocol client.
|
|
578
530
|
|
|
579
|
-
|
|
531
|
+
Powered by [reverse_markdown](https://github.com/xijo/reverse_markdown) for GitHub-flavored Markdown conversion.
|
|
580
532
|
|
|
581
533
|
## Support
|
|
582
534
|
|
|
583
535
|
- **Issues**: [GitHub Issues](https://github.com/craft-wise/rubycrawl/issues)
|
|
584
|
-
- **Discussions**: [GitHub Discussions](https://github.com/
|
|
536
|
+
- **Discussions**: [GitHub Discussions](https://github.com/craft-wise/rubycrawl/discussions)
|
|
585
537
|
- **Email**: ganesh.navale@zohomail.in
|