rubycrawl 0.1.3 → 0.1.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.md CHANGED
@@ -1,39 +1,67 @@
1
- # rubycrawl
1
+ # RubyCrawl 🎭
2
2
 
3
- [![Gem Version](https://badge.fury.io/rb/rubycrawl.svg)](https://badge.fury.io/rb/rubycrawl)
3
+ [![Gem Version](https://badge.fury.io/rb/rubycrawl.svg)](https://rubygems.org/gems/rubycrawl)
4
4
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
5
+ [![Ruby](https://img.shields.io/badge/ruby-%3E%3D%203.0-red.svg)](https://www.ruby-lang.org/)
6
+ [![Node.js](https://img.shields.io/badge/node.js-18%2B-green.svg)](https://nodejs.org/)
5
7
 
6
- **Playwright-based web crawler for Ruby** — Inspired by [crawl4ai](https://github.com/unclecode/crawl4ai) (Python), designed idiomatically for Ruby with production-ready features.
8
+ **Production-ready web crawler for Ruby powered by Playwright** — Bringing the power of modern browser automation to the Ruby ecosystem with first-class Rails support.
7
9
 
8
- RubyCrawl provides accurate, JavaScript-enabled web scraping using Playwright's battle-tested browser automation, wrapped in a clean Ruby API. Perfect for extracting content from modern SPAs and dynamic websites.
10
+ RubyCrawl provides **accurate, JavaScript-enabled web scraping** using Playwright's battle-tested browser automation, wrapped in a clean Ruby API. Perfect for extracting content from modern SPAs, dynamic websites, and building RAG knowledge bases.
11
+
12
+ **Why RubyCrawl?**
13
+
14
+ - ✅ **Real browser** — Handles JavaScript, AJAX, and SPAs correctly
15
+ - ✅ **Zero config** — Works out of the box, no Playwright knowledge needed
16
+ - ✅ **Production-ready** — Auto-retry, error handling, resource optimization
17
+ - ✅ **Multi-page crawling** — BFS algorithm with smart URL deduplication
18
+ - ✅ **Rails-friendly** — Generators, initializers, and ActiveJob integration
19
+ - ✅ **Modular architecture** — Clean, testable, maintainable codebase
20
+
21
+ ```ruby
22
+ # One line to crawl any JavaScript-heavy site
23
+ result = RubyCrawl.crawl("https://docs.example.com")
24
+
25
+ result.html # Full HTML with JS rendered
26
+ result.links # All links with metadata
27
+ result.metadata # Title, description, OG tags, etc.
28
+ ```
9
29
 
10
30
  ## Features
11
31
 
12
- - **Playwright-powered**: Real browser automation for JavaScript-heavy sites
13
- - **Production-ready**: Designed for Rails apps and production environments
14
- - **Simple API**: Clean, minimal Ruby interface — zero Playwright knowledge required
15
- - **Resource optimization**: Built-in resource blocking for faster crawls
16
- - **Auto-managed browsers**: Browser process reuse and automatic lifecycle management
17
- - **Content extraction**: HTML, links, and Markdown conversion
18
- - **Multi-page crawling**: BFS crawler with depth limits and deduplication
19
- - **Rails integration**: First-class Rails support with generators and initializers
32
+ - **🎭 Playwright-powered**: Real browser automation for JavaScript-heavy sites and SPAs
33
+ - **🚀 Production-ready**: Designed for Rails apps and production environments with auto-retry and error handling
34
+ - **🎯 Simple API**: Clean, minimal Ruby interface — zero Playwright or Node.js knowledge required
35
+ - **⚡ Resource optimization**: Built-in resource blocking for 2-3x faster crawls
36
+ - **🔄 Auto-managed browsers**: Browser process reuse and automatic lifecycle management
37
+ - **📄 Content extraction**: HTML, plain text, links (with metadata), and **clean markdown** via HTML-to-Markdown conversion
38
+ - **🌐 Multi-page crawling**: BFS (breadth-first search) crawler with configurable depth limits and URL deduplication
39
+ - **🛡️ Smart URL handling**: Automatic normalization, tracking parameter removal, and same-host filtering
40
+ - **🔧 Rails integration**: First-class Rails support with generators and initializers
41
+ - **💎 Modular design**: Clean separation of concerns with focused, testable modules
20
42
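As a rough illustration of what the "smart URL handling" above involves, here is a minimal normalization sketch. This is a hypothetical example under stated assumptions — `TRACKING_PARAMS` and `normalize_url` are illustrative names, not RubyCrawl's actual implementation:

```ruby
require "uri"

# Hypothetical sketch: strip common tracking parameters and fragments,
# and downcase the host, so near-duplicate URLs dedupe to one entry.
TRACKING_PARAMS = %w[utm_source utm_medium utm_campaign utm_term utm_content fbclid gclid].freeze

def normalize_url(url)
  uri = URI.parse(url)
  uri.host = uri.host&.downcase
  uri.fragment = nil # "/page" and "/page#top" are the same document
  if uri.query
    kept = URI.decode_www_form(uri.query).reject { |k, _| TRACKING_PARAMS.include?(k) }
    uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  end
  uri.to_s
end

normalize_url("https://Example.com/page?utm_source=news&id=1#top")
# => "https://example.com/page?id=1"
```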
 
21
43
  ## Table of Contents
22
44
 
45
+ - [Features](#features)
23
46
  - [Installation](#installation)
24
47
  - [Quick Start](#quick-start)
48
+ - [Use Cases](#use-cases)
25
49
  - [Usage](#usage)
26
50
  - [Basic Crawling](#basic-crawling)
27
51
  - [Multi-Page Crawling](#multi-page-crawling)
28
52
  - [Configuration](#configuration)
29
53
  - [Result Object](#result-object)
54
+ - [Error Handling](#error-handling)
30
55
  - [Rails Integration](#rails-integration)
31
56
  - [Production Deployment](#production-deployment)
32
57
  - [Architecture](#architecture)
33
58
  - [Performance](#performance)
34
59
  - [Development](#development)
60
+ - [Project Structure](#project-structure)
35
61
  - [Contributing](#contributing)
62
+ - [Why Choose RubyCrawl?](#why-choose-rubycrawl)
36
63
  - [License](#license)
64
+ - [Support](#support)
37
65
 
38
66
  ## Installation
39
67
 
@@ -64,9 +92,24 @@ bundle exec rake rubycrawl:install
64
92
 
65
93
  This command:
66
94
 
67
- - Installs Node.js dependencies in the bundled `node/` directory
68
- - Downloads Playwright browsers (Chromium, Firefox, WebKit)
69
- - Creates a Rails initializer (if using Rails)
95
+ - Installs Node.js dependencies in the bundled `node/` directory
96
+ - Downloads Playwright browsers (Chromium, Firefox, WebKit) — ~300MB download
97
+ - Creates a Rails initializer (if using Rails)
98
+
99
+ **Note:** You only need to run this once. The installation task is idempotent and safe to run multiple times.
100
+
101
+ **Troubleshooting installation:**
102
+
103
+ ```bash
104
+ # If installation fails, check Node.js version
105
+ node --version # Should be v18+ LTS
106
+
107
+ # Enable verbose logging
108
+ RUBYCRAWL_NODE_LOG=/tmp/rubycrawl.log bundle exec rake rubycrawl:install
109
+
110
+ # Check installation status
111
+ cd node && npm list
112
+ ```
70
113
 
71
114
  ## Quick Start
72
115
 
@@ -77,12 +120,24 @@ require "rubycrawl"
77
120
  result = RubyCrawl.crawl("https://example.com")
78
121
 
79
122
  # Access extracted content
80
- puts result.html # Raw HTML content
81
- puts result.markdown # Converted to Markdown
82
- puts result.links # Extracted links from the page
83
- puts result.metadata # Status code, final URL, etc.
123
+ result.final_url # Final URL after redirects
124
+ result.text # Plain text content (via innerText)
125
+ result.html # Raw HTML content
126
+ result.links # Extracted links with metadata
127
+ result.metadata # Title, description, OG tags, etc.
84
128
  ```
85
129
 
130
+ ## Use Cases
131
+
132
+ RubyCrawl is perfect for:
133
+
134
+ - **📊 Data aggregation**: Crawl product catalogs, job listings, or news articles
135
+ - **🤖 RAG applications**: Build knowledge bases for LLM/AI applications by crawling documentation sites
136
+ - **🔍 SEO analysis**: Extract metadata, links, and content structure
137
+ - **📱 Content migration**: Convert existing sites to Markdown for static site generators
138
+ - **🧪 Testing**: Verify deployed site structure and content
139
+ - **📚 Documentation scraping**: Create local copies of documentation with preserved links
140
+
86
141
  ## Usage
87
142
 
88
143
  ### Basic Crawling
@@ -93,11 +148,9 @@ The simplest way to crawl a URL:
93
148
  result = RubyCrawl.crawl("https://example.com")
94
149
 
95
150
  # Access the results
96
- result.html # => "<html>...</html>"
97
- result.markdown # => "# Example Domain\n\nThis domain is..." (lazy-loaded)
98
- result.links # => [{ "url" => "https://...", "text" => "More info" }, ...]
99
- result.metadata # => { "status" => 200, "final_url" => "https://example.com" }
100
- result.text # => "" (coming soon)
151
+ result.html # => "<html>...</html>"
152
+ result.text # => "Example Domain\nThis domain is..." (plain text via innerText)
153
+ result.metadata # => { "status" => 200, "final_url" => "https://example.com" }
101
154
  ```
102
155
 
103
156
  ### Multi-Page Crawling
@@ -109,38 +162,72 @@ Crawl an entire site following links with BFS (breadth-first search):
109
162
  RubyCrawl.crawl_site("https://example.com", max_pages: 100, max_depth: 3) do |page|
110
163
  # Each page is yielded as it's crawled (streaming)
111
164
  puts "Crawled: #{page.url} (depth: #{page.depth})"
112
-
165
+
113
166
  # Save to database
114
167
  Page.create!(
115
168
  url: page.url,
116
169
  html: page.html,
117
- markdown: page.markdown,
170
+ markdown: page.clean_markdown,
118
171
  depth: page.depth
119
172
  )
120
173
  end
121
174
  ```
122
175
 
176
+ **Real-world example: Building a RAG knowledge base**
177
+
178
+ ```ruby
179
+ # Crawl documentation site for AI/RAG application
180
+ require "rubycrawl"
181
+
182
+ RubyCrawl.configure(
183
+ wait_until: "networkidle", # Ensure JS content loads
184
+ block_resources: true # Skip images/fonts for speed
185
+ )
186
+
187
+ pages_crawled = RubyCrawl.crawl_site(
188
+ "https://docs.example.com",
189
+ max_pages: 500,
190
+ max_depth: 5,
191
+ same_host_only: true
192
+ ) do |page|
193
+ # Store in vector database for RAG
194
+ VectorDB.upsert(
195
+ id: Digest::SHA256.hexdigest(page.url),
196
+ content: page.clean_markdown, # Clean markdown for better embeddings
197
+ metadata: {
198
+ url: page.url,
199
+ title: page.metadata["title"],
200
+ depth: page.depth
201
+ }
202
+ )
203
+
204
+ puts "✓ Indexed: #{page.metadata['title']} (#{page.depth} levels deep)"
205
+ end
206
+
207
+ puts "Crawled #{pages_crawled} pages into knowledge base"
208
+ ```
209
+
123
210
  #### Multi-Page Options
124
211
 
125
- | Option | Default | Description |
126
- |--------|---------|-------------|
127
- | `max_pages` | 50 | Maximum number of pages to crawl |
128
- | `max_depth` | 3 | Maximum link depth from start URL |
129
- | `same_host_only` | true | Only follow links on the same domain |
130
- | `wait_until` | inherited | Page load strategy |
131
- | `block_resources` | inherited | Block images/fonts/CSS |
212
+ | Option | Default | Description |
213
+ | ----------------- | --------- | ------------------------------------ |
214
+ | `max_pages` | 50 | Maximum number of pages to crawl |
215
+ | `max_depth` | 3 | Maximum link depth from start URL |
216
+ | `same_host_only` | true | Only follow links on the same domain |
217
+ | `wait_until` | inherited | Page load strategy |
218
+ | `block_resources` | inherited | Block images/fonts/CSS |
132
219
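Conceptually, the breadth-first crawl behind these options can be sketched as follows. This is a simplified illustration, not RubyCrawl's internal code; the caller's block stands in for fetching a page and extracting its links:

```ruby
require "set"

# Simplified breadth-first crawl with URL deduplication and depth limits.
# The block "crawls" a page and returns the links found on it.
def bfs_crawl(start_url, max_pages:, max_depth:)
  queue   = [[start_url, 0]]     # FIFO queue of [url, depth]
  visited = Set.new([start_url]) # dedupe: never enqueue a URL twice
  crawled = 0

  until queue.empty? || crawled >= max_pages
    url, depth = queue.shift
    links = yield(url, depth)    # fetch the page, get its links
    crawled += 1
    next if depth >= max_depth   # don't follow links past max_depth

    links.each do |link|
      queue << [link, depth + 1] if visited.add?(link) # add? is nil if already seen
    end
  end
  crawled
end
```

`crawl_site` exposes the same knobs (`max_pages`, `max_depth`) over this kind of traversal.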
 
133
220
  #### Page Result Object
134
221
 
135
222
  The block receives a `PageResult` with:
136
223
 
137
224
  ```ruby
138
- page.url # String: Final URL after redirects
139
- page.html # String: Full HTML content
140
- page.markdown # String: Lazy-converted Markdown
141
- page.links # Array: URLs extracted from page
142
- page.metadata # Hash: HTTP status, final URL, etc.
143
- page.depth # Integer: Link depth from start URL
225
+ page.url # String: Final URL after redirects
226
+ page.html # String: Full HTML content
227
+ page.clean_markdown # String: Lazy-converted Markdown
228
+ page.links # Array: URLs extracted from page
229
+ page.metadata # Hash: HTTP status, final URL, etc.
230
+ page.depth # Integer: Link depth from start URL
144
231
  ```
145
232
 
146
233
  ### Configuration
@@ -177,16 +264,55 @@ result = RubyCrawl.crawl(
177
264
 
178
265
  #### Configuration Options
179
266
 
180
- | Option | Values | Default | Description |
181
- | ----------------- | ----------------------------------------------- | -------- | ------------------------------------------------- |
182
- | `wait_until` | `"load"`, `"domcontentloaded"`, `"networkidle"` | `"load"` | When to consider page loaded |
183
- | `block_resources` | `true`, `false` | `true` | Block images, fonts, CSS, media for faster crawls |
267
+ | Option | Values | Default | Description |
268
+ | ----------------- | ---------------------------------------------------------------------- | -------- | ------------------------------------------------- |
269
+ | `wait_until` | `"load"`, `"domcontentloaded"`, `"networkidle"`, `"commit"` | `"load"` | When to consider page loaded |
270
+ | `block_resources` | `true`, `false` | `true` | Block images, fonts, CSS, media for faster crawls |
271
+ | `max_attempts` | Integer | `3` | Total number of attempts (including the first) |
184
272
 
185
273
  **Wait strategies explained:**
186
274
 
187
275
  - `load` — Wait for the load event (fastest, good for static sites)
188
276
  - `domcontentloaded` — Wait for DOM ready (medium speed)
189
277
  - `networkidle` — Wait until no network requests for 500ms (slowest, best for SPAs)
278
+ - `commit` — Wait only until the first response bytes are received (fastest, but page content may not have loaded yet)
279
+
280
+ ### Advanced Usage
281
+
282
+ #### Session-Based Crawling
283
+
284
+ Sessions allow reusing browser contexts for better performance when crawling multiple pages. They're automatically used by `crawl_site`, but you can manage them manually for advanced use cases:
285
+
286
+ ```ruby
287
+ # Create a session (reusable browser context)
288
+ session_id = RubyCrawl.create_session
289
+
290
+ begin
291
+ # All crawls with this session_id share the same browser context
292
+ result1 = RubyCrawl.crawl("https://example.com/page1", session_id: session_id)
293
+ result2 = RubyCrawl.crawl("https://example.com/page2", session_id: session_id)
294
+ # Browser state (cookies, localStorage) persists between crawls
295
+ ensure
296
+ # Always destroy session when done
297
+ RubyCrawl.destroy_session(session_id)
298
+ end
299
+ ```
300
+
301
+ **When to use sessions:**
302
+
303
+ - Multiple sequential crawls to the same domain (better performance)
304
+ - Preserving cookies/state set by the site between page visits
305
+ - Avoiding browser context creation overhead
306
+
307
+ **Important:** Sessions are for **performance optimization only**. RubyCrawl is designed for crawling **public websites**. It does not provide authentication or login functionality for protected content.
308
+
309
+ **Note:** `crawl_site` automatically creates and manages a session internally, so you don't need manual session management for multi-page crawling.
310
+
311
+ **Session lifecycle:**
312
+
313
+ - Sessions automatically expire after 30 minutes of inactivity
314
+ - Sessions are cleaned up every 5 minutes
315
+ - Always call `destroy_session` when done to free resources immediately
190
316
 
191
317
  ### Result Object
192
318
 
@@ -195,33 +321,47 @@ The crawl result is a `RubyCrawl::Result` object with these attributes:
195
321
  ```ruby
196
322
  result = RubyCrawl.crawl("https://example.com")
197
323
 
198
- result.html # String: Raw HTML content from page
199
- result.markdown # String: Markdown conversion (lazy-loaded on first access)
200
- result.links # Array: Extracted links with url and text
201
- result.text # String: Plain text (coming soon)
202
- result.metadata # Hash: Comprehensive metadata (see below)
324
+ result.html # String: Raw HTML content from page
325
+ result.text # String: Plain text via document.body.innerText
326
+ result.clean_markdown # String: Markdown conversion (lazy-loaded on first access)
327
+ result.links # Array: Extracted links with url and text
328
+ result.metadata # Hash: Comprehensive metadata (see below)
203
329
  ```
204
330
 
205
331
  #### Links Format
206
332
 
333
+ Links are extracted with full metadata:
334
+
207
335
  ```ruby
208
336
  result.links
209
337
  # => [
210
- # { "url" => "https://example.com/about", "text" => "About Us" },
211
- # { "url" => "https://example.com/contact", "text" => "Contact" },
338
+ # {
339
+ # "url" => "https://example.com/about",
340
+ # "text" => "About Us",
341
+ # "title" => "Learn more about us", # <a title="...">
342
+ # "rel" => nil # <a rel="nofollow">
343
+ # },
344
+ # {
345
+ # "url" => "https://example.com/contact",
346
+ # "text" => "Contact",
347
+ # "title" => nil,
348
+ # "rel" => "nofollow"
349
+ # },
212
350
  # ...
213
351
  # ]
214
352
  ```
215
353
 
354
+ **Note:** URLs are automatically converted to absolute URLs by the browser, so relative links like `/about` become `https://example.com/about`.
355
+
216
356
  #### Markdown Conversion
217
357
 
218
- Markdown is **lazy-loaded** — conversion only happens when you access `.markdown`:
358
+ Markdown is **lazy-loaded** — conversion only happens when you access `.clean_markdown`:
219
359
 
220
360
  ```ruby
221
361
  result = RubyCrawl.crawl(url)
222
- result.html # ✅ No overhead
223
- result.markdown # ⬅️ Conversion happens here (first call only)
224
- result.markdown # ✅ Cached, instant
362
+ result.html # ✅ No overhead
363
+ result.clean_markdown # ⬅️ Conversion happens here (first call only)
364
+ result.clean_markdown # ✅ Cached, instant
225
365
  ```
226
366
 
227
367
  Uses [reverse_markdown](https://github.com/xijo/reverse_markdown) with GitHub-flavored output.
@@ -282,18 +422,19 @@ end
282
422
  ```
283
423
 
284
424
  **Exception Hierarchy:**
425
+
285
426
  - `RubyCrawl::Error` (base class)
286
427
  - `RubyCrawl::ConfigurationError` - Invalid URL or configuration
287
428
  - `RubyCrawl::TimeoutError` - Timeout during crawl
288
429
  - `RubyCrawl::NavigationError` - Page navigation failed
289
430
  - `RubyCrawl::ServiceError` - Node service issues
290
431
 
291
- **Automatic Retry:** RubyCrawl automatically retries transient failures (service errors, timeouts) up to 3 times with exponential backoff (2s, 4s, 8s). Configure with:
432
+ **Automatic Retry:** RubyCrawl automatically retries transient failures (service errors, timeouts) with exponential backoff. The default `max_attempts: 3` means 3 total attempts (2 retries). Configure with:
292
433
 
293
434
  ```ruby
294
- RubyCrawl.configure(max_retries: 5)
435
+ RubyCrawl.configure(max_attempts: 5)
295
436
  # or per-request
296
- RubyCrawl.crawl(url, retries: 1) # Disable retry
437
+ RubyCrawl.crawl(url, max_attempts: 1) # No retries
297
438
  ```
298
439
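To make the retry semantics concrete, here is a sketch of what `max_attempts` with exponential backoff amounts to. It is an illustration under stated assumptions (doubling delays of 2s, 4s, 8s), not RubyCrawl's actual code; `sleeper` is injectable only so the sketch is testable:

```ruby
# Hypothetical sketch: up to max_attempts total tries, sleeping
# 2s, 4s, 8s, ... between consecutive failures.
def with_retries(max_attempts: 3, sleeper: method(:sleep))
  attempt = 0
  begin
    attempt += 1
    yield
  rescue StandardError
    raise if attempt >= max_attempts # out of attempts: re-raise
    sleeper.call(2**attempt)         # 2s after 1st failure, 4s after 2nd, ...
    retry
  end
end

calls = 0
with_retries(max_attempts: 3, sleeper: ->(_s) {}) do
  calls += 1
  raise "transient" if calls < 3
end
calls # => 3 (succeeded on the final attempt)
```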
 
299
440
  ## Rails Integration
@@ -320,22 +461,177 @@ RubyCrawl.configure(
320
461
 
321
462
  ### Usage in Rails
322
463
 
464
+ #### Basic Usage in Controllers
465
+
466
+ ```ruby
467
+ class PagesController < ApplicationController
468
+ def show
469
+ result = RubyCrawl.crawl(params[:url])
470
+
471
+ @page = Page.create!(
472
+ url: result.final_url,
473
+ title: result.metadata['title'],
474
+ html: result.html,
475
+ text: result.text,
476
+ markdown: result.clean_markdown
477
+ )
478
+
479
+ redirect_to @page
480
+ end
481
+ end
482
+ ```
483
+
484
+ #### Background Jobs with ActiveJob
485
+
486
+ **Simple Crawl Job:**
487
+
323
488
  ```ruby
324
- # In a controller, service, or background job
325
- class ContentScraperJob < ApplicationJob
326
- def perform(url)
489
+ class CrawlPageJob < ApplicationJob
490
+ queue_as :crawlers
491
+
492
+ # Automatic retry with exponential backoff for transient failures
493
+ retry_on RubyCrawl::ServiceError, wait: :exponentially_longer, attempts: 5
494
+ retry_on RubyCrawl::TimeoutError, wait: :exponentially_longer, attempts: 3
495
+
496
+ # Don't retry on configuration errors (bad URLs)
497
+ discard_on RubyCrawl::ConfigurationError
498
+
499
+ def perform(url, user_id: nil)
327
500
  result = RubyCrawl.crawl(url)
328
501
 
329
- # Save to database
330
- ScrapedContent.create!(
331
- url: url,
502
+ Page.create!(
503
+ url: result.final_url,
504
+ title: result.metadata['title'],
505
+ text: result.text,
332
506
  html: result.html,
333
- status: result.metadata[:status]
507
+ user_id: user_id,
508
+ crawled_at: Time.current
334
509
  )
510
+ rescue RubyCrawl::NavigationError => e
511
+ # Page not found or failed to load
512
+ Rails.logger.warn "Failed to crawl #{url}: #{e.message}"
513
+ FailedCrawl.create!(url: url, error: e.message, user_id: user_id)
335
514
  end
336
515
  end
516
+
517
+ # Enqueue from anywhere
518
+ CrawlPageJob.perform_later("https://example.com", user_id: current_user.id)
337
519
  ```
338
520
 
521
+ **Multi-Page Site Crawler Job:**
522
+
523
+ ```ruby
524
+ class CrawlSiteJob < ApplicationJob
525
+ queue_as :crawlers
526
+
527
+ def perform(start_url, max_pages: 50)
528
+ pages_crawled = RubyCrawl.crawl_site(
529
+ start_url,
530
+ max_pages: max_pages,
531
+ max_depth: 3,
532
+ same_host_only: true
533
+ ) do |page|
534
+ Page.create!(
535
+ url: page.url,
536
+ title: page.metadata['title'],
537
+ text: page.clean_markdown, # Store markdown for RAG applications
538
+ depth: page.depth,
539
+ crawled_at: Time.current
540
+ )
541
+ end
542
+
543
+ Rails.logger.info "Crawled #{pages_crawled} pages from #{start_url}"
544
+ end
545
+ end
546
+ ```
547
+
548
+ **Batch Crawling Pattern:**
549
+
550
+ ```ruby
551
+ class BatchCrawlJob < ApplicationJob
552
+ queue_as :crawlers
553
+
554
+ def perform(urls)
555
+ # Create session for better performance
556
+ session_id = RubyCrawl.create_session
557
+
558
+ begin
559
+ urls.each do |url|
560
+ result = RubyCrawl.crawl(url, session_id: session_id)
561
+
562
+ Page.create!(
563
+ url: result.final_url,
564
+ html: result.html,
565
+ text: result.text
566
+ )
567
+ end
568
+ ensure
569
+ # Always destroy session when done
570
+ RubyCrawl.destroy_session(session_id)
571
+ end
572
+ end
573
+ end
574
+
575
+ # Enqueue batch
576
+ BatchCrawlJob.perform_later(["https://example.com", "https://example.com/about"])
577
+ ```
578
+
579
+ **Periodic Crawling with Sidekiq-Cron:**
580
+
581
+ ```ruby
582
+ # config/schedule.yml (for sidekiq-cron)
583
+ crawl_news_sites:
584
+ cron: "0 */6 * * *" # Every 6 hours
585
+ class: "CrawlNewsSitesJob"
586
+
587
+ # app/jobs/crawl_news_sites_job.rb
588
+ class CrawlNewsSitesJob < ApplicationJob
589
+ queue_as :scheduled_crawlers
590
+
591
+ def perform
592
+ Site.where(active: true).find_each do |site|
593
+ CrawlSiteJob.perform_later(site.url, max_pages: site.max_pages)
594
+ end
595
+ end
596
+ end
597
+ ```
598
+
599
+ **RAG/AI Knowledge Base Pattern:**
600
+
601
+ ```ruby
602
+ class BuildKnowledgeBaseJob < ApplicationJob
603
+ queue_as :crawlers
604
+
605
+ def perform(documentation_url)
606
+ RubyCrawl.crawl_site(
607
+ documentation_url,
608
+ max_pages: 500,
609
+ max_depth: 5
610
+ ) do |page|
611
+ # Store in vector database for RAG
612
+ embedding = OpenAI.embed(page.clean_markdown) # pseudo-code: swap in your embeddings client
613
+
614
+ Document.create!(
615
+ url: page.url,
616
+ title: page.metadata['title'],
617
+ content: page.clean_markdown,
618
+ embedding: embedding,
619
+ depth: page.depth
620
+ )
621
+ end
622
+ end
623
+ end
624
+ ```
625
+
626
+ #### Best Practices
627
+
628
+ 1. **Use background jobs** for crawling to avoid blocking web requests
629
+ 2. **Configure retry logic** based on error types (retry ServiceError, discard ConfigurationError)
630
+ 3. **Use sessions** for batch crawling to improve performance
631
+ 4. **Monitor job failures** and set up alerts for repeated errors
632
+ 5. **Rate limit** external crawling to be respectful (use job throttling)
633
+ 6. **Store both HTML and text** for flexibility in data processing
634
+
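For the rate-limiting point above, one simple approach is a minimum-interval throttle between requests. The `Throttle` class below is a hypothetical sketch, not part of RubyCrawl; in a real app you might instead lean on your job backend's throttling features:

```ruby
# Hypothetical helper: enforce a minimum interval between requests so a
# crawler never hits a host faster than, say, once per second.
class Throttle
  def initialize(min_interval)
    @min_interval = min_interval
    @last_at = nil
  end

  # Sleep just long enough to keep calls at least min_interval apart.
  def wait
    now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    if @last_at && (elapsed = now - @last_at) < @min_interval
      sleep(@min_interval - elapsed)
    end
    @last_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

throttle = Throttle.new(1.0) # at most ~1 request per second
# urls.each do |url|
#   throttle.wait
#   RubyCrawl.crawl(url)
# end
```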
339
635
  ## Production Deployment
340
636
 
341
637
  ### Pre-deployment Checklist
@@ -393,154 +689,41 @@ Add to `package.json` in your Rails root:
393
689
  }
394
690
  ```
395
691
 
396
- ### Performance Tips
397
-
398
- - **Reuse instances**: Use the class-level `RubyCrawl.crawl` method (recommended) rather than creating new instances
399
- - **Resource blocking**: Keep `block_resources: true` for 2-3x faster crawls when you don't need images/CSS
400
- - **Concurrency**: Use background jobs (Sidekiq, etc.) for parallel crawling
401
- - **Browser reuse**: The first crawl is slower due to browser launch; subsequent crawls reuse the process
402
-
403
- ## Architecture
404
-
405
- RubyCrawl uses a **dual-process architecture**:
406
-
407
- ```
408
- ┌─────────────────────────────────────────────┐
409
- │ Ruby Process (Your Application) │
410
- │ ┌─────────────────────────────────────┐ │
411
- │ │ RubyCrawl Gem │ │
412
- │ │ • Public API │ │
413
- │ │ • Result normalization │ │
414
- │ │ • Error handling │ │
415
- │ └────────────┬────────────────────────┘ │
416
- └───────────────┼─────────────────────────────┘
417
- │ HTTP/JSON (localhost:3344)
418
- ┌───────────────┼─────────────────────────────┐
419
- │ Node.js Process (Auto-started) │
420
- │ ┌────────────┴────────────────────────┐ │
421
- │ │ Playwright Service │ │
422
- │ │ • Browser management │ │
423
- │ │ • Page navigation │ │
424
- │ │ • HTML extraction │ │
425
- │ │ • Resource blocking │ │
426
- │ └─────────────────────────────────────┘ │
427
- └─────────────────────────────────────────────┘
428
- ```
429
-
430
- **Why this architecture?**
431
-
432
- - **Separation of concerns**: Ruby handles orchestration, Node handles browsers
433
- - **Stability**: Playwright's official Node.js bindings are most reliable
434
- - **Performance**: Long-running browser process, reused across requests
435
- - **Simplicity**: No C extensions, pure Ruby + bundled Node service
436
-
437
- See [.github/copilot-instructions.md](.github/copilot-instructions.md) for detailed architecture documentation.
438
-
439
- ## Performance
440
-
441
- ### Benchmarks
692
+ ## How It Works
442
693
 
443
- Typical crawl times (M1 Mac, fast network):
694
+ RubyCrawl uses a simple architecture:
444
695
 
445
- | Page Type | First Crawl | Subsequent | Config |
446
- | ----------- | ----------- | ---------- | --------------------------- |
447
- | Static HTML | ~2s | ~500ms | `block_resources: true` |
448
- | SPA (React) | ~3s | ~1.2s | `wait_until: "networkidle"` |
449
- | Heavy site | ~4s | ~2s | `block_resources: false` |
696
+ - **Ruby Gem** provides the public API and handles orchestration
697
+ - **Node.js Service** (bundled, auto-started) manages Playwright browsers
698
+ - Communication via HTTP/JSON on localhost
450
699
 
451
- **Note**: First crawl includes browser launch time (~1.5s). Subsequent crawls reuse the browser.
700
+ This design keeps things stable and easy to debug. The browser runs in a separate process, so crashes won't affect your Ruby application.
452
701
 
453
- ### Optimization Tips
702
+ ## Performance Tips
454
703
 
455
- 1. **Enable resource blocking** for content-only extraction:
456
-
457
- ```ruby
458
- RubyCrawl.configure(block_resources: true)
459
- ```
460
-
461
- 2. **Use appropriate wait strategy**:
462
- - Static sites: `wait_until: "load"`
463
- - SPAs: `wait_until: "networkidle"`
464
-
465
- 3. **Batch processing**: Use background jobs for concurrent crawling:
466
- ```ruby
467
- urls.each { |url| CrawlJob.perform_later(url) }
468
- ```
704
+ - **Resource blocking**: Keep `block_resources: true` (default) for 2-3x faster crawls when you don't need images/CSS
705
+ - **Wait strategy**: Use `wait_until: "load"` for static sites, `"networkidle"` for SPAs
706
+ - **Concurrency**: Use background jobs (Sidekiq, etc.) for parallel crawling
707
+ - **Browser reuse**: The first crawl is slower (~2s) due to browser launch; subsequent crawls are much faster (~500ms)
469
708
 
470
709
  ## Development
471
710
 
472
- ### Setup
711
+ Want to contribute? Check out the [contributor guidelines](.github/copilot-instructions.md).
473
712
 
474
713
  ```bash
714
+ # Setup
475
715
  git clone git@github.com:craft-wise/rubycrawl.git
476
716
  cd rubycrawl
477
- bin/setup # Installs dependencies and sets up Node service
478
- ```
717
+ bin/setup
479
718
 
480
- ### Running Tests
481
-
482
- ```bash
719
+ # Run tests
483
720
  bundle exec rspec
484
- ```
485
-
486
- ### Manual Testing
487
-
488
- ```bash
489
- # Terminal 1: Start Node service manually (optional)
490
- cd node
491
- npm start
492
721
 
493
- # Terminal 2: Ruby console
722
+ # Manual testing
494
723
  bin/console
495
- > result = RubyCrawl.crawl("https://example.com")
496
- > puts result.html
497
- ```
498
-
499
- ### Project Structure
500
-
501
- ```
502
- rubycrawl/
503
- ├── lib/
504
- │ ├── rubycrawl.rb # Main gem entry point
505
- │ ├── rubycrawl/
506
- │ │ ├── version.rb # Gem version
507
- │ │ ├── railtie.rb # Rails integration
508
- │ │ └── tasks/
509
- │ │ └── install.rake # Installation task
510
- ├── node/
511
- │ ├── src/
512
- │ │ └── index.js # Playwright HTTP service
513
- │ ├── package.json
514
- │ └── README.md
515
- ├── spec/ # RSpec tests
516
- ├── .github/
517
- │ └── copilot-instructions.md # GitHub Copilot guidelines
518
- ├── CLAUDE.md # Claude AI guidelines
519
- └── README.md
724
+ > RubyCrawl.crawl("https://example.com")
520
725
  ```
521
726
 
522
- ## Roadmap
523
-
524
- ### Current (v0.1.0)
525
-
526
- - [x] HTML extraction
527
- - [x] Link extraction
528
- - [x] Markdown conversion (lazy-loaded)
529
- - [x] Multi-page crawling with BFS
530
- - [x] URL normalization and deduplication
531
- - [x] Basic metadata (status, final URL)
532
- - [x] Resource blocking
533
- - [x] Rails integration
534
-
535
- ### Coming Soon
536
-
537
- - [ ] Plain text extraction
538
- - [ ] Screenshot capture
539
- - [ ] Custom JavaScript execution
540
- - [ ] Session/cookie support
541
- - [ ] Proxy support
542
- - [ ] Robots.txt support
543
-
544
727
  ## Contributing
545
728
 
546
729
  Contributions are welcome! Please read our [contribution guidelines](.github/copilot-instructions.md) first.
@@ -552,21 +735,46 @@ Contributions are welcome! Please read our [contribution guidelines](.github/cop
552
735
  - **Ruby-first**: Hide Node.js/Playwright complexity from users
553
736
  - **No vendor lock-in**: Pure open source, no SaaS dependencies
554
737
 
555
- ## Comparison with crawl4ai
738
+ ## Why Choose RubyCrawl?
739
+
740
+ RubyCrawl stands out in the Ruby ecosystem with its unique combination of features:
741
+
742
+ ### 🎯 **Built for Ruby Developers**
743
+
744
+ - **Idiomatic Ruby API** — Feels natural to Rubyists, no need to learn Playwright
745
+ - **Rails-first design** — Generators, initializers, and ActiveJob integration out of the box
746
+ - **Modular architecture** — Clean, testable code following Ruby best practices
747
+
748
+ ### 🚀 **Production-Grade Reliability**
556
749
 
557
- | Feature | crawl4ai (Python) | rubycrawl (Ruby) |
558
- | ------------------- | ----------------- | ---------------- |
559
- | Browser automation | Playwright | Playwright |
560
- | Language | Python | Ruby |
561
- | LLM extraction | ✅ | Planned |
562
- | Markdown extraction | ✅ | ✅ |
563
- | Link extraction | ✅ | ✅ |
564
- | Multi-page crawling | ✅ | ✅ |
565
- | Rails integration | N/A | ✅ |
566
- | Resource blocking | ✅ | ✅ |
567
- | Session management | ✅ | Planned |
750
+ - **Automatic retry** with exponential backoff for transient failures
751
+ - **Smart error handling** with custom exception hierarchy
752
+ - **Process isolation** — Browser crashes don't affect your Ruby application
753
+ - **Battle-tested** — Built on Playwright's proven browser automation
568
754
 
569
- RubyCrawl aims to bring the same level of accuracy and reliability to the Ruby ecosystem.
755
+ ### 💎 **Developer Experience**
756
+
757
+ - **Zero configuration** — Works immediately after installation
758
+ - **Lazy loading** — Markdown conversion only when you need it
759
+ - **Smart URL handling** — Automatic normalization and deduplication
760
+ - **Comprehensive docs** — Clear examples for common use cases
761
+
762
+ ### 🌐 **Rich Feature Set**
763
+
764
+ - ✅ JavaScript-enabled crawling (SPAs, AJAX, dynamic content)
765
+ - ✅ Multi-page crawling with BFS algorithm
766
+ - ✅ Link extraction with metadata (url, text, title, rel)
767
+ - ✅ Markdown conversion (GitHub-flavored)
768
+ - ✅ Metadata extraction (OG tags, Twitter cards, etc.)
769
+ - ✅ Resource blocking for 2-3x performance boost
770
+
771
+ ### 📊 **Perfect for Modern Use Cases**
772
+
773
+ - **RAG applications** — Build AI knowledge bases from documentation
774
+ - **Data aggregation** — Extract structured data from multiple pages
775
+ - **Content migration** — Convert sites to Markdown for static generators
776
+ - **SEO analysis** — Extract metadata and link structures
777
+ - **Testing** — Verify deployed site content and structure
570
778
 
571
779
  ## License
572
780
 
@@ -574,12 +782,21 @@ The gem is available as open source under the terms of the [MIT License](LICENSE
574
782
 
575
783
  ## Credits
576
784
 
577
- Inspired by [crawl4ai](https://github.com/unclecode/crawl4ai) by @unclecode.
785
+ Built with [Playwright](https://playwright.dev/) by Microsoft — the industry-standard browser automation framework.
578
786
 
579
- Built with [Playwright](https://playwright.dev/) by Microsoft.
787
+ Powered by [reverse_markdown](https://github.com/xijo/reverse_markdown) for GitHub-flavored Markdown conversion.
580
788
 
581
789
  ## Support
582
790
 
583
791
  - **Issues**: [GitHub Issues](https://github.com/craft-wise/rubycrawl/issues)
584
- - **Discussions**: [GitHub Discussions](https://github.com/your-org/rubycrawl/discussions)
792
+ - **Discussions**: [GitHub Discussions](https://github.com/craft-wise/rubycrawl/discussions)
585
793
  - **Email**: ganesh.navale@zohomail.in
794
+
795
+ ## Acknowledgments
796
+
797
+ Special thanks to:
798
+
799
+ - [Microsoft Playwright](https://playwright.dev/) team for the robust, production-grade browser automation framework
800
+ - The Ruby community for building an ecosystem that values developer happiness and code clarity
801
+ - The Node.js community for excellent tooling and libraries that make cross-language integration seamless
802
+ - Open source contributors worldwide who make projects like this possible