rubycrawl 0.1.4 → 0.3.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +178 -433
- data/lib/rubycrawl/browser/extraction.rb +128 -0
- data/lib/rubycrawl/browser/readability.js +2786 -0
- data/lib/rubycrawl/browser.rb +106 -0
- data/lib/rubycrawl/errors.rb +1 -1
- data/lib/rubycrawl/helpers.rb +8 -44
- data/lib/rubycrawl/markdown_converter.rb +2 -2
- data/lib/rubycrawl/result.rb +49 -18
- data/lib/rubycrawl/site_crawler.rb +40 -22
- data/lib/rubycrawl/tasks/install.rake +17 -56
- data/lib/rubycrawl/url_normalizer.rb +5 -1
- data/lib/rubycrawl/version.rb +1 -1
- data/lib/rubycrawl.rb +35 -90
- data/rubycrawl.gemspec +3 -4
- metadata +21 -11
- data/lib/rubycrawl/service_client.rb +0 -108
- data/node/.gitignore +0 -2
- data/node/.npmrc +0 -1
- data/node/README.md +0 -19
- data/node/package-lock.json +0 -72
- data/node/package.json +0 -14
- data/node/src/index.js +0 -389
data/README.md
CHANGED
|
@@ -3,46 +3,46 @@
|
|
|
3
3
|
[](https://rubygems.org/gems/rubycrawl)
|
|
4
4
|
[](https://opensource.org/licenses/MIT)
|
|
5
5
|
[](https://www.ruby-lang.org/)
|
|
6
|
-
[](https://nodejs.org/)
|
|
7
6
|
|
|
8
|
-
**Production-ready web crawler for Ruby powered by
|
|
7
|
+
**Production-ready web crawler for Ruby powered by Ferrum** — Full JavaScript rendering via Chrome DevTools Protocol, with first-class Rails support and no Node.js dependency.
|
|
9
8
|
|
|
10
|
-
RubyCrawl provides **accurate, JavaScript-enabled web scraping** using
|
|
9
|
+
RubyCrawl provides **accurate, JavaScript-enabled web scraping** using a pure Ruby browser automation stack. Perfect for extracting content from modern SPAs, dynamic websites, and building RAG knowledge bases.
|
|
11
10
|
|
|
12
11
|
**Why RubyCrawl?**
|
|
13
12
|
|
|
14
13
|
- ✅ **Real browser** — Handles JavaScript, AJAX, and SPAs correctly
|
|
15
|
-
- ✅ **
|
|
14
|
+
- ✅ **Pure Ruby** — No Node.js, no npm, no external processes to manage
|
|
15
|
+
- ✅ **Zero config** — Works out of the box, no Ferrum knowledge needed
|
|
16
16
|
- ✅ **Production-ready** — Auto-retry, error handling, resource optimization
|
|
17
17
|
- ✅ **Multi-page crawling** — BFS algorithm with smart URL deduplication
|
|
18
18
|
- ✅ **Rails-friendly** — Generators, initializers, and ActiveJob integration
|
|
19
|
-
- ✅ **
|
|
19
|
+
- ✅ **Readability-powered** — Mozilla Readability.js for article-quality extraction, heuristic fallback for all other pages
|
|
20
20
|
|
|
21
21
|
```ruby
|
|
22
22
|
# One line to crawl any JavaScript-heavy site
|
|
23
23
|
result = RubyCrawl.crawl("https://docs.example.com")
|
|
24
24
|
|
|
25
25
|
result.html # Full HTML with JS rendered
|
|
26
|
-
result.
|
|
26
|
+
result.clean_text # Noise-stripped plain text (no nav/footer/ads)
|
|
27
|
+
result.clean_markdown # Markdown ready for RAG pipelines
|
|
28
|
+
result.links # All links with url, text, title, rel
|
|
27
29
|
result.metadata # Title, description, OG tags, etc.
|
|
28
30
|
```
|
|
29
31
|
|
|
30
32
|
## Features
|
|
31
33
|
|
|
32
|
-
-
|
|
33
|
-
-
|
|
34
|
-
-
|
|
35
|
-
-
|
|
36
|
-
-
|
|
37
|
-
-
|
|
38
|
-
-
|
|
39
|
-
-
|
|
40
|
-
-
|
|
41
|
-
- **💎 Modular design**: Clean separation of concerns with focused, testable modules
|
|
34
|
+
- **Pure Ruby**: Ferrum drives Chromium directly via CDP — no Node.js or npm required
|
|
35
|
+
- **Production-ready**: Designed for Rails apps with auto-retry and exponential backoff
|
|
36
|
+
- **Simple API**: Clean Ruby interface — zero Ferrum or CDP knowledge required
|
|
37
|
+
- **Resource optimization**: Built-in resource blocking for 2-3x faster crawls
|
|
38
|
+
- **Auto-managed browsers**: Lazy Chrome singleton, isolated page per crawl
|
|
39
|
+
- **Content extraction**: Mozilla Readability.js (primary) + link-density heuristic (fallback) — article-quality `clean_html`, `clean_text`, `clean_markdown`, links, metadata
|
|
40
|
+
- **Multi-page crawling**: BFS crawler with configurable depth limits and URL deduplication
|
|
41
|
+
- **Smart URL handling**: Automatic normalization, tracking parameter removal, same-host filtering
|
|
42
|
+
- **Rails integration**: First-class Rails support with generators and initializers
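The "Smart URL handling" bullet above can be sketched without the gem; this is an illustrative normalizer with a hypothetical tracking-parameter list (the gem's actual list may differ):

```ruby
require "uri"

# Illustrative set of tracking parameters to strip (assumption, not the gem's list)
TRACKING = %w[utm_source utm_medium utm_campaign fbclid gclid]

def strip_tracking(url)
  uri = URI.parse(url)
  return url unless uri.query

  # Drop tracking params, keep everything else in order
  kept = URI.decode_www_form(uri.query).reject { |k, _| TRACKING.include?(k) }
  uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  uri.to_s
end

strip_tracking("https://example.com/page?utm_source=x&id=1")
# => "https://example.com/page?id=1"
```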
|
|
42
43
|
|
|
43
44
|
## Table of Contents
|
|
44
45
|
|
|
45
|
-
- [Features](#features)
|
|
46
46
|
- [Installation](#installation)
|
|
47
47
|
- [Quick Start](#quick-start)
|
|
48
48
|
- [Use Cases](#use-cases)
|
|
@@ -57,18 +57,15 @@ result.metadata # Title, description, OG tags, etc.
|
|
|
57
57
|
- [Architecture](#architecture)
|
|
58
58
|
- [Performance](#performance)
|
|
59
59
|
- [Development](#development)
|
|
60
|
-
- [Project Structure](#project-structure)
|
|
61
60
|
- [Contributing](#contributing)
|
|
62
|
-
- [Why Choose RubyCrawl?](#why-choose-rubycrawl)
|
|
63
61
|
- [License](#license)
|
|
64
|
-
- [Support](#support)
|
|
65
62
|
|
|
66
63
|
## Installation
|
|
67
64
|
|
|
68
65
|
### Requirements
|
|
69
66
|
|
|
70
67
|
- **Ruby** >= 3.0
|
|
71
|
-
- **
|
|
68
|
+
- **Chrome or Chromium** — managed automatically by Ferrum (downloaded on first use)
|
|
72
69
|
|
|
73
70
|
### Add to Gemfile
|
|
74
71
|
|
|
@@ -82,9 +79,9 @@ Then install:
|
|
|
82
79
|
bundle install
|
|
83
80
|
```
|
|
84
81
|
|
|
85
|
-
### Install
|
|
82
|
+
### Install Chrome
|
|
86
83
|
|
|
87
|
-
|
|
84
|
+
Ferrum manages Chrome automatically. Run the install task to verify Chrome is available and generate a Rails initializer:
|
|
88
85
|
|
|
89
86
|
```bash
|
|
90
87
|
bundle exec rake rubycrawl:install
|
|
@@ -92,24 +89,10 @@ bundle exec rake rubycrawl:install
|
|
|
92
89
|
|
|
93
90
|
This command:
|
|
94
91
|
|
|
95
|
-
- ✅
|
|
96
|
-
- ✅ Downloads Playwright browsers (Chromium, Firefox, WebKit) — ~300MB download
|
|
92
|
+
- ✅ Checks for Chrome/Chromium in your PATH
|
|
97
93
|
- ✅ Creates a Rails initializer (if using Rails)
|
|
98
94
|
|
|
99
|
-
**Note:**
|
|
100
|
-
|
|
101
|
-
**Troubleshooting installation:**
|
|
102
|
-
|
|
103
|
-
```bash
|
|
104
|
-
# If installation fails, check Node.js version
|
|
105
|
-
node --version # Should be v18+ LTS
|
|
106
|
-
|
|
107
|
-
# Enable verbose logging
|
|
108
|
-
RUBYCRAWL_NODE_LOG=/tmp/rubycrawl.log bundle exec rake rubycrawl:install
|
|
109
|
-
|
|
110
|
-
# Check installation status
|
|
111
|
-
cd node && npm list
|
|
112
|
-
```
|
|
95
|
+
**Note:** If Chrome is not in your PATH, install it via your system package manager or download from [google.com/chrome](https://www.google.com/chrome/).
|
|
113
96
|
|
|
114
97
|
## Quick Start
|
|
115
98
|
|
|
@@ -120,37 +103,38 @@ require "rubycrawl"
|
|
|
120
103
|
result = RubyCrawl.crawl("https://example.com")
|
|
121
104
|
|
|
122
105
|
# Access extracted content
|
|
123
|
-
result.final_url
|
|
124
|
-
result.
|
|
125
|
-
result.
|
|
126
|
-
result.
|
|
127
|
-
result.
|
|
106
|
+
result.final_url # Final URL after redirects
|
|
107
|
+
result.clean_text # Noise-stripped plain text (no nav/footer/ads)
|
|
108
|
+
result.clean_html # Noise-stripped HTML (same noise removed as clean_text)
|
|
109
|
+
result.raw_text # Full body.innerText (unfiltered)
|
|
110
|
+
result.html # Full raw HTML content
|
|
111
|
+
result.links # Extracted links with url, text, title, rel
|
|
112
|
+
result.metadata # Title, description, OG tags, etc.
|
|
113
|
+
result.metadata['extractor'] # "readability" or "heuristic" — which extractor ran
|
|
114
|
+
result.clean_markdown # Markdown converted from clean_html (lazy — first access only)
|
|
128
115
|
```
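A common next step after a crawl is persisting the accessors above; this sketch serializes one result as a JSONL record, using a `Struct` stand-in for `RubyCrawl::Result` so it runs standalone:

```ruby
require "json"

# Stand-in for RubyCrawl::Result — same accessor names, no browser needed
CrawlResult = Struct.new(:final_url, :clean_markdown, :metadata)
result = CrawlResult.new("https://example.com", "# Example\n\nBody text.", { "title" => "Example" })

# One line per page, ready to append to a corpus file
record = JSON.generate(
  url: result.final_url,
  title: result.metadata["title"],
  markdown: result.clean_markdown
)
```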
|
|
129
116
|
|
|
130
117
|
## Use Cases
|
|
131
118
|
|
|
132
119
|
RubyCrawl is perfect for:
|
|
133
120
|
|
|
134
|
-
-
|
|
135
|
-
-
|
|
136
|
-
-
|
|
137
|
-
-
|
|
138
|
-
-
|
|
139
|
-
- **📚 Documentation scraping**: Create local copies of documentation with preserved links
|
|
121
|
+
- **RAG applications**: Build knowledge bases for LLM/AI applications by crawling documentation sites
|
|
122
|
+
- **Data aggregation**: Crawl product catalogs, job listings, or news articles
|
|
123
|
+
- **SEO analysis**: Extract metadata, links, and content structure
|
|
124
|
+
- **Content migration**: Convert existing sites to Markdown for static site generators
|
|
125
|
+
- **Documentation scraping**: Create local copies of documentation with preserved links
|
|
140
126
|
|
|
141
127
|
## Usage
|
|
142
128
|
|
|
143
129
|
### Basic Crawling
|
|
144
130
|
|
|
145
|
-
The simplest way to crawl a URL:
|
|
146
|
-
|
|
147
131
|
```ruby
|
|
148
132
|
result = RubyCrawl.crawl("https://example.com")
|
|
149
133
|
|
|
150
|
-
#
|
|
151
|
-
result.
|
|
152
|
-
result.
|
|
153
|
-
result.metadata
|
|
134
|
+
result.html # => "<html>...</html>"
|
|
135
|
+
result.clean_text # => "Example Domain\n\nThis domain is..." (no nav/ads)
|
|
136
|
+
result.raw_text # => "Example Domain\nThis domain is..." (full body text)
|
|
137
|
+
result.metadata # => { "final_url" => "https://example.com", "title" => "..." }
|
|
154
138
|
```
|
|
155
139
|
|
|
156
140
|
### Multi-Page Crawling
|
|
@@ -165,10 +149,10 @@ RubyCrawl.crawl_site("https://example.com", max_pages: 100, max_depth: 3) do |pa
|
|
|
165
149
|
|
|
166
150
|
# Save to database
|
|
167
151
|
Page.create!(
|
|
168
|
-
url:
|
|
169
|
-
html:
|
|
152
|
+
url: page.url,
|
|
153
|
+
html: page.html,
|
|
170
154
|
markdown: page.clean_markdown,
|
|
171
|
-
depth:
|
|
155
|
+
depth: page.depth
|
|
172
156
|
)
|
|
173
157
|
end
|
|
174
158
|
```
|
|
@@ -176,7 +160,6 @@ end
|
|
|
176
160
|
**Real-world example: Building a RAG knowledge base**
|
|
177
161
|
|
|
178
162
|
```ruby
|
|
179
|
-
# Crawl documentation site for AI/RAG application
|
|
180
163
|
require "rubycrawl"
|
|
181
164
|
|
|
182
165
|
RubyCrawl.configure(
|
|
@@ -190,21 +173,18 @@ pages_crawled = RubyCrawl.crawl_site(
|
|
|
190
173
|
max_depth: 5,
|
|
191
174
|
same_host_only: true
|
|
192
175
|
) do |page|
|
|
193
|
-
# Store in vector database for RAG
|
|
194
176
|
VectorDB.upsert(
|
|
195
|
-
id:
|
|
196
|
-
content:
|
|
177
|
+
id: Digest::SHA256.hexdigest(page.url),
|
|
178
|
+
content: page.clean_markdown,
|
|
197
179
|
metadata: {
|
|
198
|
-
url:
|
|
180
|
+
url: page.url,
|
|
199
181
|
title: page.metadata["title"],
|
|
200
182
|
depth: page.depth
|
|
201
183
|
}
|
|
202
184
|
)
|
|
203
|
-
|
|
204
|
-
puts "✓ Indexed: #{page.metadata['title']} (#{page.depth} levels deep)"
|
|
205
185
|
end
|
|
206
186
|
|
|
207
|
-
puts "
|
|
187
|
+
puts "Indexed #{pages_crawled} pages"
|
|
208
188
|
```
|
|
209
189
|
|
|
210
190
|
#### Multi-Page Options
|
|
@@ -223,10 +203,13 @@ The block receives a `PageResult` with:
|
|
|
223
203
|
|
|
224
204
|
```ruby
|
|
225
205
|
page.url # String: Final URL after redirects
|
|
226
|
-
page.html # String: Full HTML content
|
|
227
|
-
page.
|
|
206
|
+
page.html # String: Full raw HTML content
|
|
207
|
+
page.clean_html # String: Noise-stripped HTML (no nav/header/footer/ads)
|
|
208
|
+
page.clean_text # String: Noise-stripped plain text (derived from clean_html)
|
|
209
|
+
page.raw_text # String: Full body.innerText (unfiltered)
|
|
210
|
+
page.clean_markdown # String: Lazy-converted Markdown from clean_html
|
|
228
211
|
page.links # Array: URLs extracted from page
|
|
229
|
-
page.metadata # Hash:
|
|
212
|
+
page.metadata # Hash: final_url, title, OG tags, etc.
|
|
230
213
|
page.depth # Integer: Link depth from start URL
|
|
231
214
|
```
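The BFS-with-deduplication behavior described for `crawl_site` can be sketched in isolation (simplified: a link graph stands in for real pages, and no browser is involved):

```ruby
require "set"

# Breadth-first traversal with URL deduplication and a depth cap,
# mirroring the crawl_site semantics described above (illustrative only).
def bfs_order(graph, start, max_depth:)
  seen  = Set[start]
  queue = [[start, 0]]
  order = []
  until queue.empty?
    url, depth = queue.shift
    order << url
    next if depth >= max_depth
    (graph[url] || []).each do |link|
      # Set#add? returns nil for duplicates, so each URL is enqueued once
      queue << [link, depth + 1] if seen.add?(link)
    end
  end
  order
end

graph = {
  "/"  => ["/a", "/b"],
  "/a" => ["/b", "/c"]
}
bfs_order(graph, "/", max_depth: 2)
# => ["/", "/a", "/b", "/c"]
```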
|
|
232
215
|
|
|
@@ -234,12 +217,12 @@ page.depth # Integer: Link depth from start URL
|
|
|
234
217
|
|
|
235
218
|
#### Global Configuration
|
|
236
219
|
|
|
237
|
-
Set default options that apply to all crawls:
|
|
238
|
-
|
|
239
220
|
```ruby
|
|
240
221
|
RubyCrawl.configure(
|
|
241
|
-
wait_until:
|
|
242
|
-
block_resources: true
|
|
222
|
+
wait_until: "networkidle",
|
|
223
|
+
block_resources: true,
|
|
224
|
+
timeout: 60,
|
|
225
|
+
headless: true
|
|
243
226
|
)
|
|
244
227
|
|
|
245
228
|
# All subsequent crawls use these defaults
|
|
@@ -248,8 +231,6 @@ result = RubyCrawl.crawl("https://example.com")
|
|
|
248
231
|
|
|
249
232
|
#### Per-Request Options
|
|
250
233
|
|
|
251
|
-
Override defaults for specific requests:
|
|
252
|
-
|
|
253
234
|
```ruby
|
|
254
235
|
# Use global defaults
|
|
255
236
|
result = RubyCrawl.crawl("https://example.com")
|
|
@@ -257,192 +238,132 @@ result = RubyCrawl.crawl("https://example.com")
|
|
|
257
238
|
# Override for this request only
|
|
258
239
|
result = RubyCrawl.crawl(
|
|
259
240
|
"https://example.com",
|
|
260
|
-
wait_until:
|
|
241
|
+
wait_until: "domcontentloaded",
|
|
261
242
|
block_resources: false
|
|
262
243
|
)
|
|
263
244
|
```
|
|
264
245
|
|
|
265
246
|
#### Configuration Options
|
|
266
247
|
|
|
267
|
-
| Option | Values
|
|
268
|
-
| ----------------- |
|
|
269
|
-
| `wait_until` | `"load"`, `"domcontentloaded"`, `"networkidle"`, `"commit"`
|
|
270
|
-
| `block_resources` | `true`, `false`
|
|
271
|
-
| `max_attempts` | Integer
|
|
248
|
+
| Option | Values | Default | Description |
|
|
249
|
+
| ----------------- | ----------------------------------------------------------- | ------- | --------------------------------------------------- |
|
|
250
|
+
| `wait_until` | `"load"`, `"domcontentloaded"`, `"networkidle"`, `"commit"` | `nil` | When to consider page loaded (nil = Ferrum default) |
|
|
251
|
+
| `block_resources` | `true`, `false` | `nil` | Block images, fonts, CSS, media for faster crawls |
|
|
252
|
+
| `max_attempts` | Integer | `3` | Total number of attempts (including the first) |
|
|
253
|
+
| `timeout` | Integer (seconds) | `30` | Browser navigation timeout |
|
|
254
|
+
| `headless` | `true`, `false` | `true` | Run Chrome headlessly |
|
|
272
255
|
|
|
273
256
|
**Wait strategies explained:**
|
|
274
257
|
|
|
275
|
-
- `load` — Wait for the load event (
|
|
276
|
-
- `domcontentloaded` — Wait for DOM ready (
|
|
277
|
-
- `networkidle` — Wait until no network requests for 500ms (
|
|
278
|
-
- `commit` — Wait until the first response bytes are received (fastest
|
|
279
|
-
|
|
280
|
-
### Advanced Usage
|
|
281
|
-
|
|
282
|
-
#### Session-Based Crawling
|
|
283
|
-
|
|
284
|
-
Sessions allow reusing browser contexts for better performance when crawling multiple pages. They're automatically used by `crawl_site`, but you can manage them manually for advanced use cases:
|
|
285
|
-
|
|
286
|
-
```ruby
|
|
287
|
-
# Create a session (reusable browser context)
|
|
288
|
-
session_id = RubyCrawl.create_session
|
|
289
|
-
|
|
290
|
-
begin
|
|
291
|
-
# All crawls with this session_id share the same browser context
|
|
292
|
-
result1 = RubyCrawl.crawl("https://example.com/page1", session_id: session_id)
|
|
293
|
-
result2 = RubyCrawl.crawl("https://example.com/page2", session_id: session_id)
|
|
294
|
-
# Browser state (cookies, localStorage) persists between crawls
|
|
295
|
-
ensure
|
|
296
|
-
# Always destroy session when done
|
|
297
|
-
RubyCrawl.destroy_session(session_id)
|
|
298
|
-
end
|
|
299
|
-
```
|
|
300
|
-
|
|
301
|
-
**When to use sessions:**
|
|
302
|
-
|
|
303
|
-
- Multiple sequential crawls to the same domain (better performance)
|
|
304
|
-
- Preserving cookies/state set by the site between page visits
|
|
305
|
-
- Avoiding browser context creation overhead
|
|
306
|
-
|
|
307
|
-
**Important:** Sessions are for **performance optimization only**. RubyCrawl is designed for crawling **public websites**. It does not provide authentication or login functionality for protected content.
|
|
308
|
-
|
|
309
|
-
**Note:** `crawl_site` automatically creates and manages a session internally, so you don't need manual session management for multi-page crawling.
|
|
310
|
-
|
|
311
|
-
**Session lifecycle:**
|
|
312
|
-
|
|
313
|
-
- Sessions automatically expire after 30 minutes of inactivity
|
|
314
|
-
- Sessions are cleaned up every 5 minutes
|
|
315
|
-
- Always call `destroy_session` when done to free resources immediately
|
|
258
|
+
- `load` — Wait for the load event (good for static sites)
|
|
259
|
+
- `domcontentloaded` — Wait for DOM ready (faster)
|
|
260
|
+
- `networkidle` — Wait until no network requests for 500ms (best for SPAs)
|
|
261
|
+
- `commit` — Wait until the first response bytes are received (fastest)
|
|
316
262
|
|
|
317
263
|
### Result Object
|
|
318
264
|
|
|
319
|
-
The crawl result is a `RubyCrawl::Result` object with these attributes:
|
|
320
|
-
|
|
321
265
|
```ruby
|
|
322
266
|
result = RubyCrawl.crawl("https://example.com")
|
|
323
267
|
|
|
324
|
-
result.html # String:
|
|
325
|
-
result.
|
|
326
|
-
result.
|
|
327
|
-
result.
|
|
328
|
-
result.
|
|
268
|
+
result.html # String: Full raw HTML
|
|
269
|
+
result.clean_html # String: Noise-stripped HTML (nav/header/footer/ads removed)
|
|
270
|
+
result.clean_text # String: Plain text derived from clean_html — ideal for RAG
|
|
271
|
+
result.raw_text # String: Full body.innerText (unfiltered)
|
|
272
|
+
result.clean_markdown # String: Markdown from clean_html (lazy — computed on first access)
|
|
273
|
+
result.links # Array: Extracted links with url/text/title/rel
|
|
274
|
+
result.metadata # Hash: See below
|
|
275
|
+
result.final_url # String: Shortcut for metadata['final_url']
|
|
329
276
|
```
|
|
330
277
|
|
|
331
278
|
#### Links Format
|
|
332
279
|
|
|
333
|
-
Links are extracted with full metadata:
|
|
334
|
-
|
|
335
280
|
```ruby
|
|
336
281
|
result.links
|
|
337
282
|
# => [
|
|
338
|
-
# {
|
|
339
|
-
#
|
|
340
|
-
# "text" => "About Us",
|
|
341
|
-
# "title" => "Learn more about us", # <a title="...">
|
|
342
|
-
# "rel" => nil # <a rel="nofollow">
|
|
343
|
-
# },
|
|
344
|
-
# {
|
|
345
|
-
# "url" => "https://example.com/contact",
|
|
346
|
-
# "text" => "Contact",
|
|
347
|
-
# "title" => null,
|
|
348
|
-
# "rel" => "nofollow"
|
|
349
|
-
# },
|
|
350
|
-
# ...
|
|
283
|
+
# { "url" => "https://example.com/about", "text" => "About", "title" => nil, "rel" => nil },
|
|
284
|
+
# { "url" => "https://example.com/contact", "text" => "Contact", "title" => nil, "rel" => "nofollow" },
|
|
351
285
|
# ]
|
|
352
286
|
```
|
|
353
287
|
|
|
354
|
-
|
|
288
|
+
URLs are automatically resolved to absolute form by the browser.
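The links shape documented above can be post-processed directly; a minimal sketch (the array literal stands in for `result.links` from a real crawl):

```ruby
# Stand-in for result.links — same keys as the documented format
links = [
  { "url" => "https://example.com/about",   "text" => "About",   "title" => nil, "rel" => nil },
  { "url" => "https://example.com/contact", "text" => "Contact", "title" => nil, "rel" => "nofollow" }
]

# Keep only followable links (drop rel="nofollow")
followable = links.reject { |l| l["rel"].to_s.include?("nofollow") }
followable.map { |l| l["url"] }
# => ["https://example.com/about"]
```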
|
|
355
289
|
|
|
356
290
|
#### Markdown Conversion
|
|
357
291
|
|
|
358
|
-
Markdown is **lazy
|
|
292
|
+
Markdown is **lazy** — conversion only happens on first access of `.clean_markdown`:
|
|
359
293
|
|
|
360
294
|
```ruby
|
|
361
|
-
result
|
|
362
|
-
result.
|
|
363
|
-
result.clean_markdown #
|
|
364
|
-
result.clean_markdown # ✅ Cached, instant
|
|
295
|
+
result.clean_html # ✅ Already available, no overhead
|
|
296
|
+
result.clean_markdown # Converts clean_html → Markdown here (first call only)
|
|
297
|
+
result.clean_markdown # ✅ Cached, instant on subsequent calls
|
|
365
298
|
```
|
|
366
299
|
|
|
367
300
|
Uses [reverse_markdown](https://github.com/xijo/reverse_markdown) with GitHub-flavored output.
|
|
368
301
|
|
|
369
302
|
#### Metadata Fields
|
|
370
303
|
|
|
371
|
-
The `metadata` hash includes HTTP and HTML metadata:
|
|
372
|
-
|
|
373
304
|
```ruby
|
|
374
305
|
result.metadata
|
|
375
306
|
# => {
|
|
376
|
-
# "
|
|
377
|
-
# "
|
|
378
|
-
# "
|
|
379
|
-
# "
|
|
380
|
-
# "
|
|
381
|
-
# "
|
|
382
|
-
# "
|
|
383
|
-
# "
|
|
384
|
-
# "
|
|
385
|
-
# "
|
|
386
|
-
# "
|
|
387
|
-
# "
|
|
388
|
-
# "
|
|
389
|
-
# "
|
|
390
|
-
# "
|
|
391
|
-
# "
|
|
392
|
-
# "
|
|
393
|
-
# "
|
|
307
|
+
# "final_url" => "https://example.com",
|
|
308
|
+
# "title" => "Page Title",
|
|
309
|
+
# "description" => "...",
|
|
310
|
+
# "keywords" => "ruby, web",
|
|
311
|
+
# "author" => "Author Name",
|
|
312
|
+
# "og_title" => "...",
|
|
313
|
+
# "og_description" => "...",
|
|
314
|
+
# "og_image" => "https://...",
|
|
315
|
+
# "og_url" => "https://...",
|
|
316
|
+
# "og_type" => "website",
|
|
317
|
+
# "twitter_card" => "summary",
|
|
318
|
+
# "twitter_title" => "...",
|
|
319
|
+
# "twitter_description" => "...",
|
|
320
|
+
# "twitter_image" => "https://...",
|
|
321
|
+
# "canonical" => "https://...",
|
|
322
|
+
# "lang" => "en",
|
|
323
|
+
# "charset" => "UTF-8",
|
|
324
|
+
# "extractor" => "readability" # or "heuristic"
|
|
394
325
|
# }
|
|
395
326
|
```
|
|
396
327
|
|
|
397
|
-
Note: All HTML metadata fields may be `null` if not present on the page.
|
|
398
|
-
|
|
399
328
|
### Error Handling
|
|
400
329
|
|
|
401
|
-
RubyCrawl provides specific exception classes for different error scenarios:
|
|
402
|
-
|
|
403
330
|
```ruby
|
|
404
331
|
begin
|
|
405
332
|
result = RubyCrawl.crawl(url)
|
|
406
333
|
rescue RubyCrawl::ConfigurationError => e
|
|
407
|
-
# Invalid URL or
|
|
408
|
-
puts "Configuration error: #{e.message}"
|
|
334
|
+
# Invalid URL or option value
|
|
409
335
|
rescue RubyCrawl::TimeoutError => e
|
|
410
|
-
# Page load
|
|
411
|
-
puts "Timeout: #{e.message}"
|
|
336
|
+
# Page load timed out
|
|
412
337
|
rescue RubyCrawl::NavigationError => e
|
|
413
|
-
#
|
|
414
|
-
puts "Navigation failed: #{e.message}"
|
|
338
|
+
# Navigation failed (404, DNS error, SSL error)
|
|
415
339
|
rescue RubyCrawl::ServiceError => e
|
|
416
|
-
#
|
|
417
|
-
puts "Service error: #{e.message}"
|
|
340
|
+
# Browser failed to start or crashed
|
|
418
341
|
rescue RubyCrawl::Error => e
|
|
419
342
|
# Catch-all for any RubyCrawl error
|
|
420
|
-
puts "Crawl error: #{e.message}"
|
|
421
343
|
end
|
|
422
344
|
```
|
|
423
345
|
|
|
424
346
|
**Exception Hierarchy:**
|
|
425
347
|
|
|
426
|
-
|
|
427
|
-
|
|
428
|
-
|
|
429
|
-
|
|
430
|
-
|
|
348
|
+
```
|
|
349
|
+
RubyCrawl::Error
|
|
350
|
+
├── ConfigurationError — invalid URL or option value
|
|
351
|
+
├── TimeoutError — page load timed out
|
|
352
|
+
├── NavigationError — navigation failed (HTTP error, DNS, SSL)
|
|
353
|
+
└── ServiceError — browser failed to start or crashed
|
|
354
|
+
```
|
|
431
355
|
|
|
432
|
-
**Automatic Retry:**
|
|
356
|
+
**Automatic Retry:** `ServiceError` and `TimeoutError` are retried with exponential backoff. `NavigationError` and `ConfigurationError` are not retried (they won't succeed on retry).
|
|
433
357
|
|
|
434
358
|
```ruby
|
|
435
|
-
RubyCrawl.configure(max_attempts: 5)
|
|
436
|
-
#
|
|
437
|
-
RubyCrawl.crawl(url, max_attempts: 1) # No retries
|
|
359
|
+
RubyCrawl.configure(max_attempts: 5) # 5 total attempts
|
|
360
|
+
RubyCrawl.crawl(url, max_attempts: 1) # Disable retries
|
|
438
361
|
```
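For intuition, the retry schedule implied by `max_attempts` looks like this, assuming a 1s base delay that doubles per retry (the gem's actual backoff constants are not documented here and may differ):

```ruby
max_attempts = 5

# With 5 total attempts there are 4 retries after the first attempt
retry_delays = (1...max_attempts).map { |n| 2**(n - 1) }
# retry_delays == [1, 2, 4, 8] (seconds, under the assumed constants)
```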
|
|
439
362
|
|
|
440
363
|
## Rails Integration
|
|
441
364
|
|
|
442
365
|
### Installation
|
|
443
366
|
|
|
444
|
-
Run the installer in your Rails app:
|
|
445
|
-
|
|
446
367
|
```bash
|
|
447
368
|
bundle exec rake rubycrawl:install
|
|
448
369
|
```
|
|
@@ -450,173 +371,54 @@ bundle exec rake rubycrawl:install
|
|
|
450
371
|
This creates `config/initializers/rubycrawl.rb`:
|
|
451
372
|
|
|
452
373
|
```ruby
|
|
453
|
-
# frozen_string_literal: true
|
|
454
|
-
|
|
455
|
-
# rubycrawl default configuration
|
|
456
374
|
RubyCrawl.configure(
|
|
457
|
-
wait_until:
|
|
375
|
+
wait_until: "load",
|
|
458
376
|
block_resources: true
|
|
459
377
|
)
|
|
460
378
|
```
|
|
461
379
|
|
|
462
380
|
### Usage in Rails
|
|
463
381
|
|
|
464
|
-
#### Basic Usage in Controllers
|
|
465
|
-
|
|
466
|
-
```ruby
|
|
467
|
-
class PagesController < ApplicationController
|
|
468
|
-
def show
|
|
469
|
-
result = RubyCrawl.crawl(params[:url])
|
|
470
|
-
|
|
471
|
-
@page = Page.create!(
|
|
472
|
-
url: result.final_url,
|
|
473
|
-
title: result.metadata['title'],
|
|
474
|
-
html: result.html,
|
|
475
|
-
text: result.text,
|
|
476
|
-
markdown: result.clean_markdown
|
|
477
|
-
)
|
|
478
|
-
|
|
479
|
-
redirect_to @page
|
|
480
|
-
end
|
|
481
|
-
end
|
|
482
|
-
```
|
|
483
|
-
|
|
484
382
|
#### Background Jobs with ActiveJob
|
|
485
383
|
|
|
486
|
-
**Simple Crawl Job:**
|
|
487
|
-
|
|
488
384
|
```ruby
|
|
489
385
|
class CrawlPageJob < ApplicationJob
|
|
490
386
|
queue_as :crawlers
|
|
491
387
|
|
|
492
|
-
# Automatic retry with exponential backoff for transient failures
|
|
493
388
|
retry_on RubyCrawl::ServiceError, wait: :exponentially_longer, attempts: 5
|
|
494
389
|
retry_on RubyCrawl::TimeoutError, wait: :exponentially_longer, attempts: 3
|
|
495
|
-
|
|
496
|
-
# Don't retry on configuration errors (bad URLs)
|
|
497
390
|
discard_on RubyCrawl::ConfigurationError
|
|
498
391
|
|
|
499
|
-
def perform(url
|
|
392
|
+
def perform(url)
|
|
500
393
|
result = RubyCrawl.crawl(url)
|
|
501
394
|
|
|
502
395
|
Page.create!(
|
|
503
|
-
url:
|
|
504
|
-
title:
|
|
505
|
-
|
|
506
|
-
|
|
507
|
-
user_id: user_id,
|
|
396
|
+
url: result.final_url,
|
|
397
|
+
title: result.metadata['title'],
|
|
398
|
+
content: result.clean_text,
|
|
399
|
+
markdown: result.clean_markdown,
|
|
508
400
|
crawled_at: Time.current
|
|
509
401
|
)
|
|
510
|
-
rescue RubyCrawl::NavigationError => e
|
|
511
|
-
# Page not found or failed to load
|
|
512
|
-
Rails.logger.warn "Failed to crawl #{url}: #{e.message}"
|
|
513
|
-
FailedCrawl.create!(url: url, error: e.message, user_id: user_id)
|
|
514
|
-
end
|
|
515
|
-
end
|
|
516
|
-
|
|
517
|
-
# Enqueue from anywhere
|
|
518
|
-
CrawlPageJob.perform_later("https://example.com", user_id: current_user.id)
|
|
519
|
-
```
|
|
520
|
-
|
|
521
|
-
**Multi-Page Site Crawler Job:**
|
|
522
|
-
|
|
523
|
-
```ruby
|
|
524
|
-
class CrawlSiteJob < ApplicationJob
|
|
525
|
-
queue_as :crawlers
|
|
526
|
-
|
|
527
|
-
def perform(start_url, max_pages: 50)
|
|
528
|
-
pages_crawled = RubyCrawl.crawl_site(
|
|
529
|
-
start_url,
|
|
530
|
-
max_pages: max_pages,
|
|
531
|
-
max_depth: 3,
|
|
532
|
-
same_host_only: true
|
|
533
|
-
) do |page|
|
|
534
|
-
Page.create!(
|
|
535
|
-
url: page.url,
|
|
536
|
-
title: page.metadata['title'],
|
|
537
|
-
text: page.clean_markdown, # Store markdown for RAG applications
|
|
538
|
-
depth: page.depth,
|
|
539
|
-
crawled_at: Time.current
|
|
540
|
-
)
|
|
541
|
-
end
|
|
542
|
-
|
|
543
|
-
Rails.logger.info "Crawled #{pages_crawled} pages from #{start_url}"
|
|
544
|
-
end
|
|
545
|
-
end
|
|
546
|
-
```
|
|
547
|
-
|
|
548
|
-
**Batch Crawling Pattern:**
|
|
549
|
-
|
|
550
|
-
```ruby
|
|
551
|
-
class BatchCrawlJob < ApplicationJob
|
|
552
|
-
queue_as :crawlers
|
|
553
|
-
|
|
554
|
-
def perform(urls)
|
|
555
|
-
# Create session for better performance
|
|
556
|
-
session_id = RubyCrawl.create_session
|
|
557
|
-
|
|
558
|
-
begin
|
|
559
|
-
urls.each do |url|
|
|
560
|
-
result = RubyCrawl.crawl(url, session_id: session_id)
|
|
561
|
-
|
|
562
|
-
Page.create!(
|
|
563
|
-
url: result.final_url,
|
|
564
|
-
html: result.html,
|
|
565
|
-
text: result.text
|
|
566
|
-
)
|
|
567
|
-
end
|
|
568
|
-
ensure
|
|
569
|
-
# Always destroy session when done
|
|
570
|
-
RubyCrawl.destroy_session(session_id)
|
|
571
|
-
end
|
|
572
402
|
end
|
|
573
403
|
end
|
|
574
|
-
|
|
575
|
-
# Enqueue batch
|
|
576
|
-
BatchCrawlJob.perform_later(["https://example.com", "https://example.com/about"])
|
|
577
404
|
```
|
|
578
405
|
|
|
579
|
-
**
|
|
580
|
-
|
|
581
|
-
```ruby
|
|
582
|
-
# config/schedule.yml (for sidekiq-cron)
|
|
583
|
-
crawl_news_sites:
|
|
584
|
-
cron: "0 */6 * * *" # Every 6 hours
|
|
585
|
-
class: "CrawlNewsSitesJob"
|
|
586
|
-
|
|
587
|
-
# app/jobs/crawl_news_sites_job.rb
|
|
588
|
-
class CrawlNewsSitesJob < ApplicationJob
|
|
589
|
-
queue_as :scheduled_crawlers
|
|
590
|
-
|
|
591
|
-
def perform
|
|
592
|
-
Site.where(active: true).find_each do |site|
|
|
593
|
-
CrawlSiteJob.perform_later(site.url, max_pages: site.max_pages)
|
|
594
|
-
end
|
|
595
|
-
end
|
|
596
|
-
end
|
|
597
|
-
```
|
|
598
|
-
|
|
599
|
-
**RAG/AI Knowledge Base Pattern:**
|
|
406
|
+
**Multi-page RAG knowledge base:**
|
|
600
407
|
|
|
601
408
|
```ruby
|
|
602
409
|
class BuildKnowledgeBaseJob < ApplicationJob
|
|
603
410
|
queue_as :crawlers
|
|
604
411
|
|
|
605
412
|
def perform(documentation_url)
|
|
606
|
-
RubyCrawl.crawl_site(
|
|
607
|
-
documentation_url,
|
|
608
|
-
max_pages: 500,
|
|
609
|
-
max_depth: 5
|
|
610
|
-
) do |page|
|
|
611
|
-
# Store in vector database for RAG
|
|
413
|
+
RubyCrawl.crawl_site(documentation_url, max_pages: 500, max_depth: 5) do |page|
|
|
612
414
|
embedding = OpenAI.embed(page.clean_markdown)
|
|
613
415
|
|
|
614
416
|
Document.create!(
|
|
615
|
-
url:
|
|
616
|
-
title:
|
|
617
|
-
content:
|
|
417
|
+
url: page.url,
|
|
418
|
+
title: page.metadata['title'],
|
|
419
|
+
content: page.clean_markdown,
|
|
618
420
|
embedding: embedding,
|
|
619
|
-
depth:
|
|
421
|
+
depth: page.depth
|
|
620
422
|
)
|
|
621
423
|
end
|
|
622
424
|
end
|
|
@@ -625,156 +427,106 @@ end
|
|
|
625
427
|
|
|
626
428
|
#### Best Practices
|
|
627
429
|
|
|
628
|
-
1. **Use background jobs**
|
|
629
|
-
2. **Configure retry logic** based on error
|
|
630
|
-
3. **
|
|
631
|
-
4. **
|
|
632
|
-
5. **Rate limit** external crawling to be respectful (use job throttling)
|
|
633
|
-
6. **Store both HTML and text** for flexibility in data processing
|
|
430
|
+
1. **Use background jobs** to avoid blocking web requests
|
|
431
|
+
2. **Configure retry logic** based on error type
|
|
432
|
+
3. **Store `clean_markdown`** for RAG applications (preserves heading structure for chunking)
|
|
433
|
+
4. **Rate limit** external crawling to be respectful
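Practice 3 above works because `clean_markdown` keeps headings; a minimal chunking sketch (the heredoc stands in for `page.clean_markdown`):

```ruby
# Stand-in for page.clean_markdown
markdown = <<~MD
  # Getting Started
  Install the gem and require it.

  ## Configuration
  Set wait_until and block_resources.
MD

# Split at line starts that begin a heading, so each chunk carries its heading
chunks = markdown.split(/^(?=#)/).map(&:strip).reject(&:empty?)
# chunks.length == 2; each chunk starts with its own heading line
```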
|
|
634
434
|
|
|
635
435
|
## Production Deployment
|
|
636
436
|
|
|
637
437
|
### Pre-deployment Checklist
|
|
638
438
|
|
|
639
|
-
1. **
|
|
439
|
+
1. **Ensure Chrome is installed** on your production servers
|
|
640
440
|
2. **Run installer** during deployment:
|
|
641
441
|
```bash
|
|
642
442
|
bundle exec rake rubycrawl:install
|
|
643
443
|
```
|
|
644
|
-
3. **Set environment variables** (optional):
|
|
645
|
-
```bash
|
|
646
|
-
export RUBYCRAWL_NODE_BIN=/usr/bin/node # Custom Node.js path
|
|
647
|
-
export RUBYCRAWL_NODE_LOG=/var/log/rubycrawl.log # Service logs
|
|
648
|
-
```
|
|
649
444
|
|
|
650
445
|
### Docker Example
|
|
651
446
|
|
|
652
447
|
```dockerfile
|
|
653
448
|
FROM ruby:3.2
|
|
654
449
|
|
|
655
|
-
# Install
|
|
656
|
-
RUN
|
|
657
|
-
|
|
658
|
-
|
|
659
|
-
|
|
660
|
-
RUN npx playwright install-deps
|
|
450
|
+
# Install Chrome
|
|
451
|
+
RUN apt-get update && apt-get install -y \
|
|
452
|
+
chromium \
|
|
453
|
+
--no-install-recommends \
|
|
454
|
+
&& rm -rf /var/lib/apt/lists/*
|
|
661
455
|
|
|
662
456
|
WORKDIR /app
|
|
663
457
|
COPY Gemfile* ./
|
|
664
458
|
RUN bundle install
|
|
665
459
|
|
|
666
|
-
# Install Playwright browsers
|
|
667
|
-
RUN bundle exec rake rubycrawl:install
|
|
668
|
-
|
|
669
460
|
COPY . .
|
|
670
461
|
CMD ["rails", "server"]
|
|
671
462
|
```
|
|
672
463
|
|
|
673
|
-
|
|
464
|
+
Ferrum will detect `chromium` automatically. To specify a custom path:
|
|
674
465
|
|
|
675
|
-
|
|
676
|
-
|
|
677
|
-
|
|
678
|
-
|
|
679
|
-
heroku buildpacks:add heroku/ruby
|
|
680
|
-
```
|
|
681
|
-
|
|
682
|
-
Add to `package.json` in your Rails root:
|
|
683
|
-
|
|
684
|
-
```json
|
|
685
|
-
{
|
|
686
|
-
"engines": {
|
|
687
|
-
"node": "18.x"
|
|
688
|
-
}
|
|
689
|
-
}
|
|
466
|
+
```ruby
|
|
467
|
+
RubyCrawl.configure(
|
|
468
|
+
browser_options: { "browser-path": "/usr/bin/chromium" }
|
|
469
|
+
)
|
|
690
470
|
```
|
|
691
471
|
|
|
692
|
-
##
|
|
472
|
+
## Architecture
|
|
693
473
|
|
|
694
|
-
RubyCrawl uses a
|
|
474
|
+
RubyCrawl uses a single-process architecture:
|
|
695
475
|
|
|
696
|
-
|
|
697
|
-
|
|
698
|
-
|
|
476
|
+
```
|
|
477
|
+
RubyCrawl (public API)
|
|
478
|
+
↓
|
|
479
|
+
Browser (lib/rubycrawl/browser.rb) ← Ferrum wrapper
|
|
480
|
+
↓
|
|
481
|
+
Ferrum::Browser ← Chrome DevTools Protocol (pure Ruby)
|
|
482
|
+
↓
|
|
483
|
+
Chromium ← headless browser
|
|
484
|
+
↓
|
|
485
|
+
Readability.js (primary) / link-density heuristic (fallback) ← content extraction (inside browser)
|
|
486
|
+
```
|
|
699
487
|
|
|
700
|
-
|
|
488
|
+
- Chrome launches once lazily and is reused across all crawls
|
|
489
|
+
- Each crawl gets an isolated page context (own cookies/storage)
|
|
490
|
+
- Content extraction runs inside the browser via `page.evaluate()`:
|
|
491
|
+
- **Primary**: Mozilla Readability.js — article-quality extraction for blogs, docs, news
|
|
492
|
+
- **Fallback**: link-density heuristic — covers marketing pages, homepages, SPAs
|
|
493
|
+
- `result.metadata['extractor']` tells you which path was used (`"readability"` or `"heuristic"`)
|
|
494
|
+
- No separate processes, no HTTP boundary, no Node.js
|
|
701
495
|
|
|
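The fallback heuristic above scores blocks by how much of their text is link text. RubyCrawl's real implementation runs as JavaScript inside the browser; the following standalone Ruby sketch only illustrates the idea, not the actual code:

```ruby
# Fraction of a fragment's visible text that lives inside <a> tags.
# Illustrative only: regex-based tag stripping, not real HTML parsing.
def link_density(html)
  text_len = html.gsub(/<[^>]+>/, "").strip.length.to_f
  return 0.0 if text_len.zero?

  link_text_len = html.scan(%r{<a\b[^>]*>(.*?)</a>}m)
                      .sum { |(inner)| inner.gsub(/<[^>]+>/, "").length }
  link_text_len / text_len
end

# Navigation-style fragments are mostly link text; article prose is not.
nav     = '<a href="/">Home</a> <a href="/docs">Docs</a> <a href="/blog">Blog</a>'
article = '<p>Ferrum drives Chrome over CDP. See the <a href="/docs">docs</a> page.</p>'

puts format("nav: %.2f",     link_density(nav))      # high ratio → likely boilerplate
puts format("article: %.2f", link_density(article))  # low ratio → likely main content
```

Blocks with high link density (menus, footers, sidebars) get discarded, and what remains is treated as main content.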
## Performance

- **Resource blocking**: Set `block_resources: true` (default: `nil`) to skip images/fonts/CSS for 2-3x faster crawls
- **Wait strategy**: Use `wait_until: "load"` for static sites, `"networkidle"` for SPAs
- **Concurrency**: Use background jobs (Sidekiq, GoodJob, etc.) for parallel crawling
- **Browser reuse**: The first crawl is slower (~2s) due to Chrome launch; subsequent crawls are much faster (~200-500ms)

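The wait-strategy and resource-blocking knobs can be combined per crawl. A hedged sketch; the option names come from the bullets above, but the exact call signature may differ between versions:

```ruby
require "rubycrawl"

# Static page: "load" fires early, so the crawl returns sooner.
static = RubyCrawl.crawl("https://example.com", wait_until: "load")

# Client-rendered SPA: wait for the network to go idle so the rendered
# DOM is populated before extraction, and skip heavy assets for speed.
spa = RubyCrawl.crawl("https://app.example.com",
                      wait_until: "networkidle",
                      block_resources: true)
```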
## Development

```bash
git clone git@github.com:craft-wise/rubycrawl.git
cd rubycrawl
bin/setup

# Run unit tests (no browser required)
bundle exec rspec

# Run integration tests (requires Chrome)
INTEGRATION=1 bundle exec rspec

# Manual testing
bin/console
> RubyCrawl.crawl("https://example.com")
> RubyCrawl.crawl("https://example.com").clean_text
> RubyCrawl.crawl("https://example.com").clean_markdown
```

## Contributing

Contributions are welcome! Please read our [contribution guidelines](.github/copilot-instructions.md) first.

- **Simplicity over cleverness**: Prefer clear, explicit code
- **Stability over speed**: Correctness first, optimization second
- **Hide complexity**: Users should never need to know Ferrum exists

## License

The gem is available as open source under the terms of the [MIT License](LICENSE).

## Credits

Built with [Ferrum](https://github.com/rubycdp/ferrum) — pure Ruby Chrome DevTools Protocol client.

Content extraction powered by [Mozilla Readability.js](https://github.com/mozilla/readability) — the algorithm behind Firefox Reader View.

Markdown conversion powered by [reverse_markdown](https://github.com/xijo/reverse_markdown) for GitHub-flavored output.

## Support

- **Issues**: [GitHub Issues](https://github.com/craft-wise/rubycrawl/issues)
- **Discussions**: [GitHub Discussions](https://github.com/craft-wise/rubycrawl/discussions)
- **Email**: ganesh.navale@zohomail.in