rubycrawl 0.3.0 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 56d56f2c264e3febc0f1b22badabb739393332e0948a0f4e4ceb534a68127604
-   data.tar.gz: ebcadd14ba65b12870f6069f898658240073ba7233eab63468e0502871a7a408
+   metadata.gz: c38e6b7b377a04d6baec4756a7bdf749580e5391d42483b9f6f7e50ee0cbd25f
+   data.tar.gz: 8323d9dbe93915b2f81fb6adbd6056b0007ef3ac58a828feb3492d14e02b7423
  SHA512:
-   metadata.gz: 182e8c771358324d256b38a42a236f634a113b18e16c716da891543ddb43a90ea68242bbc1655639781485e05802a203b64d7ea874eb4cb98900c2e771b85ec0
-   data.tar.gz: f150a6394fb2279b1f872c4074ef9b9df489f19266a7a35886b1e9fbd57e3d4d3761e0519a015270a5259c698d163e2ca223cf0106b6db6471c7911d65c12a29
+   metadata.gz: 2905355938f1f18c747c83bdcc1360f88c887026d8b2242c00a87727cb32ab9954927a24ea12975d8c89a1f0e358ffab22ed458b2cac1d8a9acfb9537bb03eca
+   data.tar.gz: 556b1d58707d72698a8e537dc41e8a0b4d47656b501b1cbdcff9db1532180d4d28b4f443f505789454da381efa52adf722c501eb85212564809b6571d504ee55
data/README.md CHANGED
@@ -1,6 +1,7 @@
  # RubyCrawl 🎭

  [![Gem Version](https://badge.fury.io/rb/rubycrawl.svg)](https://rubygems.org/gems/rubycrawl)
+ [![CI](https://github.com/craft-wise/rubycrawl/actions/workflows/ci.yml/badge.svg)](https://github.com/craft-wise/rubycrawl/actions/workflows/ci.yml)
  [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
  [![Ruby](https://img.shields.io/badge/ruby-%3E%3D%203.0-red.svg)](https://www.ruby-lang.org/)

@@ -189,13 +190,38 @@ puts "Indexed #{pages_crawled} pages"

  #### Multi-Page Options

- | Option            | Default   | Description                          |
- | ----------------- | --------- | ------------------------------------ |
- | `max_pages`       | 50        | Maximum number of pages to crawl     |
- | `max_depth`       | 3         | Maximum link depth from start URL    |
- | `same_host_only`  | true      | Only follow links on the same domain |
- | `wait_until`      | inherited | Page load strategy                   |
- | `block_resources` | inherited | Block images/fonts/CSS               |
+ | Option                 | Default   | Description                                           |
+ | ---------------------- | --------- | ----------------------------------------------------- |
+ | `max_pages`            | 50        | Maximum number of pages to crawl                      |
+ | `max_depth`            | 3         | Maximum link depth from start URL                     |
+ | `same_host_only`       | true      | Only follow links on the same domain                  |
+ | `wait_until`           | inherited | Page load strategy                                    |
+ | `block_resources`      | inherited | Block images/fonts/CSS                                |
+ | `respect_robots_txt`   | false     | Honour robots.txt rules and auto-sleep `Crawl-delay`  |
+
+ #### robots.txt Support
+
+ When `respect_robots_txt: true`, RubyCrawl fetches `robots.txt` once at the start of the crawl and:
+
+ - Skips any URL disallowed for `User-agent: *`
+ - Automatically sleeps the `Crawl-delay` specified in robots.txt between pages
+
+ ```ruby
+ RubyCrawl.crawl_site("https://example.com",
+   respect_robots_txt: true,
+   max_pages: 100
+ ) do |page|
+   puts page.url
+ end
+ ```
+
+ Or enable globally:
+
+ ```ruby
+ RubyCrawl.configure(respect_robots_txt: true)
+ ```
+
+ If robots.txt is unreachable or missing, crawling proceeds normally (fail open).
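To make the behaviour above concrete, here is a hedged editorial sketch (not part of the shipped README). The robots.txt content for example.com is hypothetical; the skip warning and the Crawl-delay sleep come from the crawler code later in this diff.

```ruby
# Suppose https://example.com/robots.txt (hypothetically) contains:
#
#   User-agent: *
#   Disallow: /admin
#   Crawl-delay: 2
#
# A crawl with respect_robots_txt: true then skips any /admin/... link,
# prints a warning like
#   [rubycrawl] Skipping https://example.com/admin/users (disallowed by robots.txt)
# and sleeps 2 seconds between page loads.
RubyCrawl.crawl_site("https://example.com", respect_robots_txt: true) do |page|
  puts page.url # disallowed URLs never reach this block
end
```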

  #### Page Result Object

@@ -247,11 +273,12 @@ result = RubyCrawl.crawl(

  | Option            | Values                                                       | Default | Description                                          |
  | ----------------- | ------------------------------------------------------------ | ------- | ---------------------------------------------------- |
- | `wait_until`      | `"load"`, `"domcontentloaded"`, `"networkidle"`, `"commit"`  | `nil`   | When to consider page loaded (nil = Ferrum default)  |
- | `block_resources` | `true`, `false`                                              | `nil`   | Block images, fonts, CSS, media for faster crawls    |
- | `max_attempts`    | Integer                                                      | `3`     | Total number of attempts (including the first)       |
- | `timeout`         | Integer (seconds)                                            | `30`    | Browser navigation timeout                           |
- | `headless`        | `true`, `false`                                              | `true`  | Run Chrome headlessly                                |
+ | `wait_until`         | `"load"`, `"domcontentloaded"`, `"networkidle"`, `"commit"` | `nil`   | When to consider page loaded (nil = Ferrum default) |
+ | `block_resources`    | `true`, `false`                                             | `nil`   | Block images, fonts, CSS, media for faster crawls   |
+ | `max_attempts`       | Integer                                                     | `3`     | Total number of attempts (including the first)      |
+ | `timeout`            | Integer (seconds)                                           | `30`    | Browser navigation timeout                          |
+ | `headless`           | `true`, `false`                                             | `true`  | Run Chrome headlessly                               |
+ | `respect_robots_txt` | `true`, `false`                                             | `false` | Honour robots.txt rules and auto-sleep Crawl-delay  |

  **Wait strategies explained:**

@@ -497,9 +524,24 @@ Readability.js → heuristic fallback ← content extraction (inside browse

  - **Resource blocking**: Keep `block_resources: true` (default: nil) to skip images/fonts/CSS for 2-3x faster crawls
  - **Wait strategy**: Use `wait_until: "load"` for static sites, `"networkidle"` for SPAs
- - **Concurrency**: Use background jobs (Sidekiq, GoodJob, etc.) for parallel crawling
  - **Browser reuse**: The first crawl is slower (~2s) due to Chrome launch; subsequent crawls are much faster (~200-500ms)

+ ### Parallelism
+
+ RubyCrawl does not support parallel page loading within a single process — Ferrum uses one Chrome instance and concurrent access is not thread-safe.
+
+ The recommended pattern is **job-level parallelism**: each background job gets its own `RubyCrawl` instance and Chrome process, with natural rate limiting via your job queue's concurrency setting:
+
+ ```ruby
+ # Enqueue independent crawls — each job runs its own Chrome
+ urls.each { |url| CrawlJob.perform_later(url) }
+
+ # Control concurrency via your queue worker config (Sidekiq, GoodJob, etc.)
+ # e.g. Sidekiq concurrency: 3 → 3 Chrome processes crawling in parallel
+ ```
+
+ This also works naturally with `respect_robots_txt: true` — each job respects Crawl-delay independently.
+
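As an editorial aside: `CrawlJob` in the README snippet above is not defined by the gem. A minimal sketch of what it might look like in a Rails app with ActiveJob follows; the class name, queue name, and logging call are assumptions, not part of rubycrawl.

```ruby
# Hypothetical job definition to pair with CrawlJob.perform_later(url) above.
class CrawlJob < ApplicationJob
  queue_as :crawling # assumed queue name; tune worker concurrency per queue

  def perform(start_url)
    # Each job gets its own Chrome process via its own crawl call.
    RubyCrawl.crawl_site(start_url, respect_robots_txt: true, max_pages: 100) do |page|
      # Persist or index each page here; `page.url` is the attribute shown
      # in the README examples above.
      Rails.logger.info("[CrawlJob] crawled #{page.url}")
    end
  end
end
```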

  ## Development

  ```bash
@@ -507,12 +549,9 @@ git clone git@github.com:craft-wise/rubycrawl.git
  cd rubycrawl
  bin/setup

- # Run unit tests (no browser required)
+ # Run all tests (Chrome required — installed as a gem dependency)
  bundle exec rspec

- # Run integration tests (requires Chrome)
- INTEGRATION=1 bundle exec rspec
-
  # Manual testing
  bin/console
  > RubyCrawl.crawl("https://example.com")
data/lib/rubycrawl/robots_parser.rb ADDED
@@ -0,0 +1,86 @@
+ # frozen_string_literal: true
+
+ require 'net/http'
+ require 'uri'
+
+ class RubyCrawl
+   # Fetches and parses robots.txt for a given site.
+   # Supports User-agent: *, Disallow, Allow, and Crawl-delay directives.
+   # Fails open — any fetch/parse error allows all URLs.
+   class RobotsParser
+     # Fetch robots.txt from base_url and return a parser instance.
+     # Returns a permissive (allow-all) instance on any network error.
+     def self.fetch(base_url)
+       uri = URI.join(base_url, '/robots.txt')
+       response = Net::HTTP.start(uri.host, uri.port,
+                                  use_ssl: uri.scheme == 'https',
+                                  open_timeout: 5,
+                                  read_timeout: 5) do |http|
+         http.get(uri.request_uri)
+       end
+       new(response.is_a?(Net::HTTPOK) ? response.body : '')
+     rescue StandardError
+       new('') # network error or invalid URL → allow everything
+     end
+
+     def initialize(content)
+       @rules = parse(content.to_s)
+     end
+
+     # Returns true if the given URL is allowed to be crawled.
+     def allowed?(url)
+       path = URI.parse(url).path
+       path = '/' if path.nil? || path.empty?
+
+       # Allow rules take precedence over Disallow when both match.
+       return true if @rules[:allow].any? { |rule| path_matches?(path, rule) }
+       return false if @rules[:disallow].any? { |rule| path_matches?(path, rule) }
+
+       true
+     rescue URI::InvalidURIError
+       true
+     end
+
+     # Returns the Crawl-delay value in seconds, or nil if not specified.
+     def crawl_delay
+       @rules[:crawl_delay]
+     end
+
+     private
+
+     def parse(content)
+       rules = { allow: [], disallow: [], crawl_delay: nil }
+       in_relevant_section = false
+
+       content.each_line do |raw_line|
+         line = raw_line.strip.sub(/#.*$/, '').strip
+         next if line.empty?
+
+         key, value = line.split(':', 2).map(&:strip)
+         next unless key && value
+
+         case key.downcase
+         when 'user-agent'
+           in_relevant_section = (value == '*')
+         when 'disallow'
+           rules[:disallow] << value if in_relevant_section && !value.empty?
+         when 'allow'
+           rules[:allow] << value if in_relevant_section && !value.empty?
+         when 'crawl-delay'
+           rules[:crawl_delay] = value.to_f if in_relevant_section && value.match?(/\A\d+(\.\d+)?\z/)
+         end
+       end
+
+       rules
+     end
+
+     # Matches a URL path against a robots.txt rule pattern.
+     # Supports * (wildcard) and $ (end-of-string anchor).
+     def path_matches?(path, rule)
+       return false if rule.empty?
+
+       pattern = Regexp.escape(rule).gsub('\*', '.*').gsub('\$', '\z')
+       path.match?(/\A#{pattern}/)
+     end
+   end
+ end
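A short usage sketch of the class above (editorial only, not shipped with the gem), exercising the `*` and `$` pattern handling in `path_matches?`. The robots.txt content is made up; the require path assumes the gem's standard `lib/` layout shown in the metadata below.

```ruby
require 'rubycrawl/robots_parser'

# Hypothetical robots.txt exercising wildcard and end-anchor rules.
robots = RubyCrawl::RobotsParser.new(<<~ROBOTS)
  User-agent: *
  Disallow: /search*
  Disallow: /*.pdf$
  Crawl-delay: 1.5
ROBOTS

robots.allowed?("https://example.com/search?q=ruby")   # => false ("/search*" prefix match)
robots.allowed?("https://example.com/docs/guide.pdf")  # => false ("/*.pdf$" anchored match)
robots.allowed?("https://example.com/docs/guide.html") # => true
robots.crawl_delay                                     # => 1.5
```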
data/lib/rubycrawl/site_crawler.rb CHANGED
@@ -1,6 +1,7 @@
  # frozen_string_literal: true

  require 'set'
+ require_relative 'robots_parser'

  class RubyCrawl
    # BFS crawler that follows links with deduplication.
@@ -46,7 +47,8 @@ class RubyCrawl
        @same_host_only = options.fetch(:same_host_only, true)
        @wait_until = options.fetch(:wait_until, nil)
        @block_resources = options.fetch(:block_resources, nil)
-       @max_attempts = options.fetch(:max_attempts, nil)
+       @max_attempts = options.fetch(:max_attempts, nil)
+       @respect_robots_txt = options.fetch(:respect_robots_txt, false)
        @visited = Set.new
        @queue = []
      end
@@ -58,6 +60,7 @@ class RubyCrawl
        raise ConfigurationError, "Invalid start URL: #{start_url}" unless normalized

        @base_url = normalized
+       @robots = @respect_robots_txt ? RobotsParser.fetch(@base_url) : nil
        enqueue(normalized, 0)
        process_queue(&block)
      end
@@ -71,6 +74,8 @@ class RubyCrawl
          url, depth = item
          next if @visited.include?(url)

+         sleep(@robots.crawl_delay) if @robots&.crawl_delay && pages_crawled.positive?
+
          result = process_page(url, depth)
          next unless result

@@ -130,11 +135,20 @@ class RubyCrawl
          next unless normalized
          next if @visited.include?(normalized)
          next if @same_host_only && !UrlNormalizer.same_host?(normalized, @base_url)
+         next if robots_disallowed?(normalized)

          enqueue(normalized, depth)
        end
      end

+     def robots_disallowed?(url)
+       return false unless @robots
+       return false if @robots.allowed?(url)
+
+       warn "[rubycrawl] Skipping #{url} (disallowed by robots.txt)"
+       true
+     end
+
      def enqueue(url, depth)
        return if @visited.include?(url)

data/lib/rubycrawl/tasks/install.rake CHANGED
@@ -27,11 +27,12 @@ namespace :rubycrawl do

  # RubyCrawl Configuration
  RubyCrawl.configure(
-   # wait_until: "load",        # "load", "domcontentloaded", "networkidle"
-   # block_resources: true,     # block images/fonts/CSS/media for speed
-   # max_attempts: 3,           # retry count with exponential backoff
-   # timeout: 30,               # browser navigation timeout in seconds
-   # headless: true,            # set false to see the browser (debugging)
+   # wait_until: "load",         # "load", "domcontentloaded", "networkidle"
+   # block_resources: true,      # block images/fonts/CSS/media for speed
+   # max_attempts: 3,            # retry count with exponential backoff
+   # timeout: 30,                # browser navigation timeout in seconds
+   # headless: true,             # set false to see the browser (debugging)
+   # respect_robots_txt: false,  # set true to honour robots.txt and Crawl-delay
  )
  RUBY

data/lib/rubycrawl/version.rb CHANGED
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  class RubyCrawl
-   VERSION = '0.3.0'
+   VERSION = '0.4.0'
  end
data/lib/rubycrawl.rb CHANGED
@@ -81,7 +81,8 @@ class RubyCrawl
      @max_attempts = options.fetch(:max_attempts, 3)
      @timeout = options.fetch(:timeout, 30)
      @headless = options.fetch(:headless, true)
-     @browser_options = options.fetch(:browser_options, {})
+     @browser_options = options.fetch(:browser_options, {})
+     @respect_robots_txt = options.fetch(:respect_robots_txt, false)
    end

    def with_retries(max_attempts)
@@ -101,12 +102,13 @@ class RubyCrawl

    def build_crawler_options(options)
      {
-       max_pages: options.fetch(:max_pages, 50),
-       max_depth: options.fetch(:max_depth, 3),
-       same_host_only: options.fetch(:same_host_only, true),
-       wait_until: options.fetch(:wait_until, @wait_until),
-       block_resources: options.fetch(:block_resources, @block_resources),
-       max_attempts: options.fetch(:max_attempts, @max_attempts)
+       max_pages: options.fetch(:max_pages, 50),
+       max_depth: options.fetch(:max_depth, 3),
+       same_host_only: options.fetch(:same_host_only, true),
+       wait_until: options.fetch(:wait_until, @wait_until),
+       block_resources: options.fetch(:block_resources, @block_resources),
+       max_attempts: options.fetch(:max_attempts, @max_attempts),
+       respect_robots_txt: options.fetch(:respect_robots_txt, @respect_robots_txt)
      }
    end
  end
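Editorial note: the `options.fetch(..., @respect_robots_txt)` fallback above means a per-call option overrides the global `configure` value. A small sketch of that precedence, using only the public API shown in the README:

```ruby
RubyCrawl.configure(respect_robots_txt: true)

# Uses the global default: robots.txt is honoured.
RubyCrawl.crawl_site("https://example.com") { |page| puts page.url }

# Per-call option overrides the global setting for this crawl only.
RubyCrawl.crawl_site("https://example.com", respect_robots_txt: false) { |page| puts page.url }
```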
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: rubycrawl
  version: !ruby/object:Gem::Version
-   version: 0.3.0
+   version: 0.4.0
  platform: ruby
  authors:
  - RubyCrawl contributors
@@ -58,6 +58,7 @@ files:
  - lib/rubycrawl/markdown_converter.rb
  - lib/rubycrawl/railtie.rb
  - lib/rubycrawl/result.rb
+ - lib/rubycrawl/robots_parser.rb
  - lib/rubycrawl/site_crawler.rb
  - lib/rubycrawl/tasks/install.rake
  - lib/rubycrawl/url_normalizer.rb