rubycrawl 0.1.3 → 0.1.4
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +427 -210
- data/lib/rubycrawl/helpers.rb +15 -11
- data/lib/rubycrawl/markdown_converter.rb +3 -3
- data/lib/rubycrawl/result.rb +10 -11
- data/lib/rubycrawl/service_client.rb +25 -3
- data/lib/rubycrawl/site_crawler.rb +14 -6
- data/lib/rubycrawl/version.rb +1 -1
- data/lib/rubycrawl.rb +33 -7
- data/node/.gitignore +2 -0
- data/node/.npmrc +1 -0
- data/node/README.md +19 -0
- data/node/package-lock.json +72 -0
- data/node/package.json +14 -0
- data/node/src/index.js +389 -0
- data/rubycrawl.gemspec +3 -2
- metadata +8 -3
- data/Gemfile +0 -11
data/README.md
CHANGED
|
@@ -1,39 +1,67 @@
-#
+# RubyCrawl 🎭

-[](https://
+[](https://rubygems.org/gems/rubycrawl)
 [](https://opensource.org/licenses/MIT)
+[](https://www.ruby-lang.org/)
+[](https://nodejs.org/)

-**
+**Production-ready web crawler for Ruby powered by Playwright** — Bringing the power of modern browser automation to the Ruby ecosystem with first-class Rails support.

-RubyCrawl provides accurate, JavaScript-enabled web scraping using Playwright's battle-tested browser automation, wrapped in a clean Ruby API. Perfect for extracting content from modern SPAs and
+RubyCrawl provides **accurate, JavaScript-enabled web scraping** using Playwright's battle-tested browser automation, wrapped in a clean Ruby API. Perfect for extracting content from modern SPAs, dynamic websites, and building RAG knowledge bases.
+
+**Why RubyCrawl?**
+
+- ✅ **Real browser** — Handles JavaScript, AJAX, and SPAs correctly
+- ✅ **Zero config** — Works out of the box, no Playwright knowledge needed
+- ✅ **Production-ready** — Auto-retry, error handling, resource optimization
+- ✅ **Multi-page crawling** — BFS algorithm with smart URL deduplication
+- ✅ **Rails-friendly** — Generators, initializers, and ActiveJob integration
+- ✅ **Modular architecture** — Clean, testable, maintainable codebase
+
+```ruby
+# One line to crawl any JavaScript-heavy site
+result = RubyCrawl.crawl("https://docs.example.com")
+
+result.html     # Full HTML with JS rendered
+result.links    # All links with metadata
+result.metadata # Title, description, OG tags, etc.
+```

 ## Features

--
--
--
--
--
--
--
--
+- **🎭 Playwright-powered**: Real browser automation for JavaScript-heavy sites and SPAs
+- **🚀 Production-ready**: Designed for Rails apps and production environments with auto-retry and error handling
+- **🎯 Simple API**: Clean, minimal Ruby interface — zero Playwright or Node.js knowledge required
+- **⚡ Resource optimization**: Built-in resource blocking for 2-3x faster crawls
+- **🔄 Auto-managed browsers**: Browser process reuse and automatic lifecycle management
+- **📄 Content extraction**: HTML, plain text, links (with metadata), and **clean markdown** via HTML conversion
+- **🌐 Multi-page crawling**: BFS (breadth-first search) crawler with configurable depth limits and URL deduplication
+- **🛡️ Smart URL handling**: Automatic normalization, tracking parameter removal, and same-host filtering
+- **🔧 Rails integration**: First-class Rails support with generators and initializers
+- **💎 Modular design**: Clean separation of concerns with focused, testable modules

 ## Table of Contents

+- [Features](#features)
 - [Installation](#installation)
 - [Quick Start](#quick-start)
+- [Use Cases](#use-cases)
 - [Usage](#usage)
   - [Basic Crawling](#basic-crawling)
   - [Multi-Page Crawling](#multi-page-crawling)
   - [Configuration](#configuration)
   - [Result Object](#result-object)
+  - [Error Handling](#error-handling)
 - [Rails Integration](#rails-integration)
 - [Production Deployment](#production-deployment)
 - [Architecture](#architecture)
 - [Performance](#performance)
 - [Development](#development)
+- [Project Structure](#project-structure)
 - [Contributing](#contributing)
+- [Why Choose RubyCrawl?](#why-choose-rubycrawl)
 - [License](#license)
+- [Support](#support)

 ## Installation

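The "smart URL handling" feature added above (normalization plus tracking-parameter removal before deduplication) can be sketched with Ruby's standard `uri` library. This is an illustration of the idea only, not the gem's actual implementation; the parameter list is an assumption:

```ruby
require "uri"

# Illustrative normalizer: downcases the host, drops fragments, and strips
# common tracking parameters so equivalent URLs compare equal for dedup.
TRACKING_PARAMS = %w[utm_source utm_medium utm_campaign utm_term utm_content fbclid gclid].freeze

def normalize_url(raw)
  uri = URI.parse(raw)
  uri.host = uri.host&.downcase
  uri.fragment = nil
  if uri.query
    kept = URI.decode_www_form(uri.query).reject { |k, _| TRACKING_PARAMS.include?(k) }
    uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  end
  uri.to_s
end

normalize_url("https://Example.com/a?utm_source=x&id=7#top")
# => "https://example.com/a?id=7"
```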
@@ -64,9 +92,24 @@ bundle exec rake rubycrawl:install

 This command:

-- Installs Node.js dependencies in the bundled `node/` directory
-- Downloads Playwright browsers (Chromium, Firefox, WebKit)
-- Creates a Rails initializer (if using Rails)
+- ✅ Installs Node.js dependencies in the bundled `node/` directory
+- ✅ Downloads Playwright browsers (Chromium, Firefox, WebKit) — ~300MB download
+- ✅ Creates a Rails initializer (if using Rails)
+
+**Note:** You only need to run this once. The installation task is idempotent and safe to run multiple times.
+
+**Troubleshooting installation:**
+
+```bash
+# If installation fails, check Node.js version
+node --version # Should be v18+ LTS
+
+# Enable verbose logging
+RUBYCRAWL_NODE_LOG=/tmp/rubycrawl.log bundle exec rake rubycrawl:install
+
+# Check installation status
+cd node && npm list
+```

 ## Quick Start

@@ -77,12 +120,24 @@ require "rubycrawl"
 result = RubyCrawl.crawl("https://example.com")

 # Access extracted content
-
-
-
-
+result.final_url # Final URL after redirects
+result.text      # Plain text content (via innerText)
+result.html      # Raw HTML content
+result.links     # Extracted links with metadata
+result.metadata  # Title, description, OG tags, etc.
 ```

+## Use Cases
+
+RubyCrawl is perfect for:
+
+- **📊 Data aggregation**: Crawl product catalogs, job listings, or news articles
+- **🤖 RAG applications**: Build knowledge bases for LLM/AI applications by crawling documentation sites
+- **🔍 SEO analysis**: Extract metadata, links, and content structure
+- **📱 Content migration**: Convert existing sites to Markdown for static site generators
+- **🧪 Testing**: Verify deployed site structure and content
+- **📚 Documentation scraping**: Create local copies of documentation with preserved links
+
 ## Usage

 ### Basic Crawling
@@ -93,11 +148,9 @@ The simplest way to crawl a URL:
 result = RubyCrawl.crawl("https://example.com")

 # Access the results
-result.html
-result.
-result.
-result.metadata # => { "status" => 200, "final_url" => "https://example.com" }
-result.text # => "" (coming soon)
+result.html     # => "<html>...</html>"
+result.text     # => "Example Domain\nThis domain is..." (plain text via innerText)
+result.metadata # => { "status" => 200, "final_url" => "https://example.com" }
 ```

 ### Multi-Page Crawling
@@ -109,38 +162,72 @@ Crawl an entire site following links with BFS (breadth-first search):
 RubyCrawl.crawl_site("https://example.com", max_pages: 100, max_depth: 3) do |page|
   # Each page is yielded as it's crawled (streaming)
   puts "Crawled: #{page.url} (depth: #{page.depth})"
-
+
   # Save to database
   Page.create!(
     url: page.url,
     html: page.html,
-    markdown: page.
+    markdown: page.clean_markdown,
     depth: page.depth
   )
 end
 ```

+**Real-world example: Building a RAG knowledge base**
+
+```ruby
+# Crawl documentation site for AI/RAG application
+require "rubycrawl"
+
+RubyCrawl.configure(
+  wait_until: "networkidle", # Ensure JS content loads
+  block_resources: true      # Skip images/fonts for speed
+)
+
+pages_crawled = RubyCrawl.crawl_site(
+  "https://docs.example.com",
+  max_pages: 500,
+  max_depth: 5,
+  same_host_only: true
+) do |page|
+  # Store in vector database for RAG
+  VectorDB.upsert(
+    id: Digest::SHA256.hexdigest(page.url),
+    content: page.clean_markdown, # Clean markdown for better embeddings
+    metadata: {
+      url: page.url,
+      title: page.metadata["title"],
+      depth: page.depth
+    }
+  )
+
+  puts "✓ Indexed: #{page.metadata['title']} (#{page.depth} levels deep)"
+end
+
+puts "Crawled #{pages_crawled} pages into knowledge base"
+```
+
 #### Multi-Page Options

-| Option
-
-| `max_pages`
-| `max_depth`
-| `same_host_only`
-| `wait_until`
-| `block_resources` | inherited | Block images/fonts/CSS
+| Option            | Default   | Description                          |
+| ----------------- | --------- | ------------------------------------ |
+| `max_pages`       | 50        | Maximum number of pages to crawl     |
+| `max_depth`       | 3         | Maximum link depth from start URL    |
+| `same_host_only`  | true      | Only follow links on the same domain |
+| `wait_until`      | inherited | Page load strategy                   |
+| `block_resources` | inherited | Block images/fonts/CSS               |

 #### Page Result Object

 The block receives a `PageResult` with:

 ```ruby
-page.url
-page.html
-page.
-page.links
-page.metadata
-page.depth
+page.url            # String: Final URL after redirects
+page.html           # String: Full HTML content
+page.clean_markdown # String: Lazy-converted Markdown
+page.links          # Array: URLs extracted from page
+page.metadata       # Hash: HTTP status, final URL, etc.
+page.depth          # Integer: Link depth from start URL
 ```

 ### Configuration

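For readers new to the approach, the BFS traversal with URL deduplication that `crawl_site` describes can be sketched in plain Ruby. This illustrates the algorithm only — the block standing in for a page fetch is hypothetical, not part of the gem's API or internals:

```ruby
require "set"

# Breadth-first crawl: visit pages level by level, never revisiting a URL,
# stopping at the max_pages / max_depth limits — the shape crawl_site uses.
def bfs_crawl(start_url, max_pages:, max_depth:, &fetch_links)
  visited = Set.new([start_url])
  queue = [[start_url, 0]]
  order = []

  until queue.empty? || order.size >= max_pages
    url, depth = queue.shift
    order << [url, depth]
    next if depth >= max_depth

    # Set#add? returns nil for already-seen URLs, giving cheap deduplication
    fetch_links.call(url).each do |link|
      queue << [link, depth + 1] if visited.add?(link)
    end
  end
  order
end

# Toy link graph standing in for real pages
graph = {
  "/"  => ["/a", "/b"],
  "/a" => ["/b", "/c"],
  "/b" => [],
  "/c" => []
}
bfs_crawl("/", max_pages: 10, max_depth: 2) { |u| graph.fetch(u, []) }
# => [["/", 0], ["/a", 1], ["/b", 1], ["/c", 2]]
```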
@@ -177,16 +264,55 @@ result = RubyCrawl.crawl(

 #### Configuration Options

-| Option | Values
-| ----------------- |
-| `wait_until` | `"load"`, `"domcontentloaded"`, `"networkidle"`
-| `block_resources` | `true`, `false`
+| Option | Values | Default | Description |
+| ----------------- | ---------------------------------------------------------------------- | -------- | ------------------------------------------------- |
+| `wait_until` | `"load"`, `"domcontentloaded"`, `"networkidle"`, `"commit"` | `"load"` | When to consider page loaded |
+| `block_resources` | `true`, `false` | `true` | Block images, fonts, CSS, media for faster crawls |
+| `max_attempts` | Integer | `3` | Total number of attempts (including the first) |

 **Wait strategies explained:**

 - `load` — Wait for the load event (fastest, good for static sites)
 - `domcontentloaded` — Wait for DOM ready (medium speed)
 - `networkidle` — Wait until no network requests for 500ms (slowest, best for SPAs)
+- `commit` — Wait until the first response bytes are received (fastest possible)
+
+### Advanced Usage
+
+#### Session-Based Crawling
+
+Sessions allow reusing browser contexts for better performance when crawling multiple pages. They're automatically used by `crawl_site`, but you can manage them manually for advanced use cases:
+
+```ruby
+# Create a session (reusable browser context)
+session_id = RubyCrawl.create_session
+
+begin
+  # All crawls with this session_id share the same browser context
+  result1 = RubyCrawl.crawl("https://example.com/page1", session_id: session_id)
+  result2 = RubyCrawl.crawl("https://example.com/page2", session_id: session_id)
+  # Browser state (cookies, localStorage) persists between crawls
+ensure
+  # Always destroy session when done
+  RubyCrawl.destroy_session(session_id)
+end
+```
+
+**When to use sessions:**
+
+- Multiple sequential crawls to the same domain (better performance)
+- Preserving cookies/state set by the site between page visits
+- Avoiding browser context creation overhead
+
+**Important:** Sessions are for **performance optimization only**. RubyCrawl is designed for crawling **public websites**. It does not provide authentication or login functionality for protected content.
+
+**Note:** `crawl_site` automatically creates and manages a session internally, so you don't need manual session management for multi-page crawling.
+
+**Session lifecycle:**
+
+- Sessions automatically expire after 30 minutes of inactivity
+- Sessions are cleaned up every 5 minutes
+- Always call `destroy_session` when done to free resources immediately

 ### Result Object

@@ -195,33 +321,47 @@ The crawl result is a `RubyCrawl::Result` object with these attributes:
 ```ruby
 result = RubyCrawl.crawl("https://example.com")

-result.html
-result.
-result.
-result.
-result.metadata
+result.html           # String: Raw HTML content from page
+result.text           # String: Plain text via document.body.innerText
+result.clean_markdown # String: Markdown conversion (lazy-loaded on first access)
+result.links          # Array: Extracted links with url and text
+result.metadata       # Hash: Comprehensive metadata (see below)
 ```

 #### Links Format

+Links are extracted with full metadata:
+
 ```ruby
 result.links
 # => [
-#   {
-#
+#   {
+#     "url" => "https://example.com/about",
+#     "text" => "About Us",
+#     "title" => "Learn more about us", # <a title="...">
+#     "rel" => nil # <a rel="nofollow">
+#   },
+#   {
+#     "url" => "https://example.com/contact",
+#     "text" => "Contact",
+#     "title" => nil,
+#     "rel" => "nofollow"
+#   },
 #   ...
 # ]
 ```

+**Note:** URLs are automatically converted to absolute URLs by the browser, so relative links like `/about` become `https://example.com/about`.
+
 #### Markdown Conversion

-Markdown is **lazy-loaded** — conversion only happens when you access `.
+Markdown is **lazy-loaded** — conversion only happens when you access `.clean_markdown`:

 ```ruby
 result = RubyCrawl.crawl(url)
-result.html
-result.
-result.
+result.html           # ✅ No overhead
+result.clean_markdown # ⬅️ Conversion happens here (first call only)
+result.clean_markdown # ✅ Cached, instant
 ```

 Uses [reverse_markdown](https://github.com/xijo/reverse_markdown) with GitHub-flavored output.
@@ -282,18 +422,19 @@ end
 ```

 **Exception Hierarchy:**
+
 - `RubyCrawl::Error` (base class)
   - `RubyCrawl::ConfigurationError` - Invalid URL or configuration
   - `RubyCrawl::TimeoutError` - Timeout during crawl
   - `RubyCrawl::NavigationError` - Page navigation failed
   - `RubyCrawl::ServiceError` - Node service issues

-**Automatic Retry:** RubyCrawl automatically retries transient failures (service errors, timeouts)
+**Automatic Retry:** RubyCrawl automatically retries transient failures (service errors, timeouts) with exponential backoff. The default `max_attempts: 3` means 3 total attempts (2 retries). Configure with:

 ```ruby
-RubyCrawl.configure(
+RubyCrawl.configure(max_attempts: 5)
 # or per-request
-RubyCrawl.crawl(url,
+RubyCrawl.crawl(url, max_attempts: 1) # No retries
 ```

 ## Rails Integration
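The retry-with-exponential-backoff behavior this hunk describes is a standard pattern that can be sketched in a few lines of Ruby. This is an illustration only, not the gem's internals; the `base_delay * 2**(attempt - 1)` sleep schedule is an assumption:

```ruby
# Generic retry helper: re-run the block on the given errors, sleeping
# exponentially longer between attempts, up to max_attempts total tries.
def with_retries(max_attempts: 3, errors: [StandardError], base_delay: 1)
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue *errors
    raise if attempt >= max_attempts # out of attempts: re-raise the error
    sleep(base_delay * (2**(attempt - 1)))
    retry
  end
end

calls = 0
with_retries(max_attempts: 3, base_delay: 0) do
  calls += 1
  raise "transient" if calls < 3
  :ok
end
# => :ok (succeeds on the third attempt)
```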
@@ -320,22 +461,177 @@ RubyCrawl.configure(

 ### Usage in Rails

+#### Basic Usage in Controllers
+
+```ruby
+class PagesController < ApplicationController
+  def show
+    result = RubyCrawl.crawl(params[:url])
+
+    @page = Page.create!(
+      url: result.final_url,
+      title: result.metadata['title'],
+      html: result.html,
+      text: result.text,
+      markdown: result.clean_markdown
+    )
+
+    redirect_to @page
+  end
+end
+```
+
+#### Background Jobs with ActiveJob
+
+**Simple Crawl Job:**
+
 ```ruby
-
-
-
+class CrawlPageJob < ApplicationJob
+  queue_as :crawlers
+
+  # Automatic retry with exponential backoff for transient failures
+  retry_on RubyCrawl::ServiceError, wait: :exponentially_longer, attempts: 5
+  retry_on RubyCrawl::TimeoutError, wait: :exponentially_longer, attempts: 3
+
+  # Don't retry on configuration errors (bad URLs)
+  discard_on RubyCrawl::ConfigurationError
+
+  def perform(url, user_id: nil)
     result = RubyCrawl.crawl(url)

-
-
-
+    Page.create!(
+      url: result.final_url,
+      title: result.metadata['title'],
+      text: result.text,
       html: result.html,
-
+      user_id: user_id,
+      crawled_at: Time.current
     )
+  rescue RubyCrawl::NavigationError => e
+    # Page not found or failed to load
+    Rails.logger.warn "Failed to crawl #{url}: #{e.message}"
+    FailedCrawl.create!(url: url, error: e.message, user_id: user_id)
   end
 end
+
+# Enqueue from anywhere
+CrawlPageJob.perform_later("https://example.com", user_id: current_user.id)
 ```

+**Multi-Page Site Crawler Job:**
+
+```ruby
+class CrawlSiteJob < ApplicationJob
+  queue_as :crawlers
+
+  def perform(start_url, max_pages: 50)
+    pages_crawled = RubyCrawl.crawl_site(
+      start_url,
+      max_pages: max_pages,
+      max_depth: 3,
+      same_host_only: true
+    ) do |page|
+      Page.create!(
+        url: page.url,
+        title: page.metadata['title'],
+        text: page.clean_markdown, # Store markdown for RAG applications
+        depth: page.depth,
+        crawled_at: Time.current
+      )
+    end
+
+    Rails.logger.info "Crawled #{pages_crawled} pages from #{start_url}"
+  end
+end
+```
+
+**Batch Crawling Pattern:**
+
+```ruby
+class BatchCrawlJob < ApplicationJob
+  queue_as :crawlers
+
+  def perform(urls)
+    # Create session for better performance
+    session_id = RubyCrawl.create_session
+
+    begin
+      urls.each do |url|
+        result = RubyCrawl.crawl(url, session_id: session_id)
+
+        Page.create!(
+          url: result.final_url,
+          html: result.html,
+          text: result.text
+        )
+      end
+    ensure
+      # Always destroy session when done
+      RubyCrawl.destroy_session(session_id)
+    end
+  end
+end
+
+# Enqueue batch
+BatchCrawlJob.perform_later(["https://example.com", "https://example.com/about"])
+```
+
+**Periodic Crawling with Sidekiq-Cron:**
+
+```ruby
+# config/schedule.yml (for sidekiq-cron)
+crawl_news_sites:
+  cron: "0 */6 * * *" # Every 6 hours
+  class: "CrawlNewsSitesJob"
+
+# app/jobs/crawl_news_sites_job.rb
+class CrawlNewsSitesJob < ApplicationJob
+  queue_as :scheduled_crawlers
+
+  def perform
+    Site.where(active: true).find_each do |site|
+      CrawlSiteJob.perform_later(site.url, max_pages: site.max_pages)
+    end
+  end
+end
+```
+
+**RAG/AI Knowledge Base Pattern:**
+
+```ruby
+class BuildKnowledgeBaseJob < ApplicationJob
+  queue_as :crawlers
+
+  def perform(documentation_url)
+    RubyCrawl.crawl_site(
+      documentation_url,
+      max_pages: 500,
+      max_depth: 5
+    ) do |page|
+      # Store in vector database for RAG
+      embedding = OpenAI.embed(page.clean_markdown)
+
+      Document.create!(
+        url: page.url,
+        title: page.metadata['title'],
+        content: page.clean_markdown,
+        embedding: embedding,
+        depth: page.depth
+      )
+    end
+  end
+end
+```
+
+#### Best Practices
+
+1. **Use background jobs** for crawling to avoid blocking web requests
+2. **Configure retry logic** based on error types (retry ServiceError, discard ConfigurationError)
+3. **Use sessions** for batch crawling to improve performance
+4. **Monitor job failures** and set up alerts for repeated errors
+5. **Rate limit** external crawling to be respectful (use job throttling)
+6. **Store both HTML and text** for flexibility in data processing

 ## Production Deployment

 ### Pre-deployment Checklist
@@ -393,154 +689,41 @@ Add to `package.json` in your Rails root:
 }
 ```

-
-
-- **Reuse instances**: Use the class-level `RubyCrawl.crawl` method (recommended) rather than creating new instances
-- **Resource blocking**: Keep `block_resources: true` for 2-3x faster crawls when you don't need images/CSS
-- **Concurrency**: Use background jobs (Sidekiq, etc.) for parallel crawling
-- **Browser reuse**: The first crawl is slower due to browser launch; subsequent crawls reuse the process
-
-## Architecture
-
-RubyCrawl uses a **dual-process architecture**:
-
-```
-┌─────────────────────────────────────────────┐
-│ Ruby Process (Your Application) │
-│ ┌─────────────────────────────────────┐ │
-│ │ RubyCrawl Gem │ │
-│ │ • Public API │ │
-│ │ • Result normalization │ │
-│ │ • Error handling │ │
-│ └────────────┬────────────────────────┘ │
-└───────────────┼─────────────────────────────┘
-│ HTTP/JSON (localhost:3344)
-┌───────────────┼─────────────────────────────┐
-│ Node.js Process (Auto-started) │
-│ ┌────────────┴────────────────────────┐ │
-│ │ Playwright Service │ │
-│ │ • Browser management │ │
-│ │ • Page navigation │ │
-│ │ • HTML extraction │ │
-│ │ • Resource blocking │ │
-│ └─────────────────────────────────────┘ │
-└─────────────────────────────────────────────┘
-```
-
-**Why this architecture?**
-
-- **Separation of concerns**: Ruby handles orchestration, Node handles browsers
-- **Stability**: Playwright's official Node.js bindings are most reliable
-- **Performance**: Long-running browser process, reused across requests
-- **Simplicity**: No C extensions, pure Ruby + bundled Node service
-
-See [.github/copilot-instructions.md](.github/copilot-instructions.md) for detailed architecture documentation.
-
-## Performance
-
-### Benchmarks
+## How It Works

-
+RubyCrawl uses a simple architecture:

-
-
-
-| SPA (React) | ~3s | ~1.2s | `wait_until: "networkidle"` |
-| Heavy site | ~4s | ~2s | `block_resources: false` |
+- **Ruby Gem** provides the public API and handles orchestration
+- **Node.js Service** (bundled, auto-started) manages Playwright browsers
+- Communication via HTTP/JSON on localhost

-
+This design keeps things stable and easy to debug. The browser runs in a separate process, so crashes won't affect your Ruby application.

-
+## Performance Tips

-
-
-
-
-```
-
-2. **Use appropriate wait strategy**:
-   - Static sites: `wait_until: "load"`
-   - SPAs: `wait_until: "networkidle"`
-
-3. **Batch processing**: Use background jobs for concurrent crawling:
-   ```ruby
-   urls.each { |url| CrawlJob.perform_later(url) }
-   ```
+- **Resource blocking**: Keep `block_resources: true` (default) for 2-3x faster crawls when you don't need images/CSS
+- **Wait strategy**: Use `wait_until: "load"` for static sites, `"networkidle"` for SPAs
+- **Concurrency**: Use background jobs (Sidekiq, etc.) for parallel crawling
+- **Browser reuse**: The first crawl is slower (~2s) due to browser launch; subsequent crawls are much faster (~500ms)

 ## Development

-
+Want to contribute? Check out the [contributor guidelines](.github/copilot-instructions.md).

 ```bash
+# Setup
 git clone git@github.com:craft-wise/rubycrawl.git
 cd rubycrawl
-bin/setup
-```
+bin/setup

-
-
-```bash
+# Run tests
 bundle exec rspec
-```
-
-### Manual Testing
-
-```bash
-# Terminal 1: Start Node service manually (optional)
-cd node
-npm start

-#
+# Manual testing
 bin/console
->
-> puts result.html
-```
-
-### Project Structure
-
-```
-rubycrawl/
-├── lib/
-│   ├── rubycrawl.rb # Main gem entry point
-│   ├── rubycrawl/
-│   │   ├── version.rb # Gem version
-│   │   ├── railtie.rb # Rails integration
-│   │   └── tasks/
-│   │       └── install.rake # Installation task
-├── node/
-│   ├── src/
-│   │   └── index.js # Playwright HTTP service
-│   ├── package.json
-│   └── README.md
-├── spec/ # RSpec tests
-├── .github/
-│   └── copilot-instructions.md # GitHub Copilot guidelines
-├── CLAUDE.md # Claude AI guidelines
-└── README.md
+> RubyCrawl.crawl("https://example.com")
 ```

-## Roadmap
-
-### Current (v0.1.0)
-
-- [x] HTML extraction
-- [x] Link extraction
-- [x] Markdown conversion (lazy-loaded)
-- [x] Multi-page crawling with BFS
-- [x] URL normalization and deduplication
-- [x] Basic metadata (status, final URL)
-- [x] Resource blocking
-- [x] Rails integration
-
-### Coming Soon
-
-- [ ] Plain text extraction
-- [ ] Screenshot capture
-- [ ] Custom JavaScript execution
-- [ ] Session/cookie support
-- [ ] Proxy support
-- [ ] Robots.txt support
-
 ## Contributing

 Contributions are welcome! Please read our [contribution guidelines](.github/copilot-instructions.md) first.
@@ -552,21 +735,46 @@ Contributions are welcome! Please read our [contribution guidelines](.github/cop
 - **Ruby-first**: Hide Node.js/Playwright complexity from users
 - **No vendor lock-in**: Pure open source, no SaaS dependencies

-##
+## Why Choose RubyCrawl?
+
+RubyCrawl stands out in the Ruby ecosystem with its unique combination of features:
+
+### 🎯 **Built for Ruby Developers**
+
+- **Idiomatic Ruby API** — Feels natural to Rubyists, no need to learn Playwright
+- **Rails-first design** — Generators, initializers, and ActiveJob integration out of the box
+- **Modular architecture** — Clean, testable code following Ruby best practices
+
+### 🚀 **Production-Grade Reliability**

-
-
-
-
-| LLM extraction | ✅ | Planned |
-| Markdown extraction | ✅ | ✅ |
-| Link extraction | ✅ | ✅ |
-| Multi-page crawling | ✅ | ✅ |
-| Rails integration | N/A | ✅ |
-| Resource blocking | ✅ | ✅ |
-| Session management | ✅ | Planned |
+- **Automatic retry** with exponential backoff for transient failures
+- **Smart error handling** with custom exception hierarchy
+- **Process isolation** — Browser crashes don't affect your Ruby application
+- **Battle-tested** — Built on Playwright's proven browser automation

-
+### 💎 **Developer Experience**
+
+- **Zero configuration** — Works immediately after installation
+- **Lazy loading** — Markdown conversion only when you need it
+- **Smart URL handling** — Automatic normalization and deduplication
+- **Comprehensive docs** — Clear examples for common use cases
+
+### 🌐 **Rich Feature Set**
+
+- ✅ JavaScript-enabled crawling (SPAs, AJAX, dynamic content)
+- ✅ Multi-page crawling with BFS algorithm
+- ✅ Link extraction with metadata (url, text, title, rel)
+- ✅ Markdown conversion (GitHub-flavored)
+- ✅ Metadata extraction (OG tags, Twitter cards, etc.)
+- ✅ Resource blocking for 2-3x performance boost
+
+### 📊 **Perfect for Modern Use Cases**
+
+- **RAG applications** — Build AI knowledge bases from documentation
+- **Data aggregation** — Extract structured data from multiple pages
+- **Content migration** — Convert sites to Markdown for static generators
+- **SEO analysis** — Extract metadata and link structures
+- **Testing** — Verify deployed site content and structure

 ## License

@@ -574,12 +782,21 @@ The gem is available as open source under the terms of the [MIT License](LICENSE

 ## Credits

-
+Built with [Playwright](https://playwright.dev/) by Microsoft — the industry-standard browser automation framework.

-
+Powered by [reverse_markdown](https://github.com/xijo/reverse_markdown) for GitHub-flavored Markdown conversion.

 ## Support

 - **Issues**: [GitHub Issues](https://github.com/craft-wise/rubycrawl/issues)
-- **Discussions**: [GitHub Discussions](https://github.com/
+- **Discussions**: [GitHub Discussions](https://github.com/craft-wise/rubycrawl/discussions)
 - **Email**: ganesh.navale@zohomail.in
+
+## Acknowledgments
+
+Special thanks to:
+
+- [Microsoft Playwright](https://playwright.dev/) team for the robust, production-grade browser automation framework
+- The Ruby community for building an ecosystem that values developer happiness and code clarity
+- The Node.js community for excellent tooling and libraries that make cross-language integration seamless
+- Open source contributors worldwide who make projects like this possible