rubycrawl 0.3.0 → 0.4.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +56 -17
- data/lib/rubycrawl/robots_parser.rb +86 -0
- data/lib/rubycrawl/site_crawler.rb +15 -1
- data/lib/rubycrawl/tasks/install.rake +6 -5
- data/lib/rubycrawl/version.rb +1 -1
- data/lib/rubycrawl.rb +9 -7
- metadata +2 -1
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: c38e6b7b377a04d6baec4756a7bdf749580e5391d42483b9f6f7e50ee0cbd25f
+  data.tar.gz: 8323d9dbe93915b2f81fb6adbd6056b0007ef3ac58a828feb3492d14e02b7423
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 2905355938f1f18c747c83bdcc1360f88c887026d8b2242c00a87727cb32ab9954927a24ea12975d8c89a1f0e358ffab22ed458b2cac1d8a9acfb9537bb03eca
+  data.tar.gz: 556b1d58707d72698a8e537dc41e8a0b4d47656b501b1cbdcff9db1532180d4d28b4f443f505789454da381efa52adf722c501eb85212564809b6571d504ee55
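The values above are plain hex digests that Ruby's stdlib can recompute. A minimal sketch, using a stand-in string rather than a real unpacked gem member:

```ruby
require 'digest'

# The checksums above are SHA256/SHA512 hex digests of the gem's members
# (metadata.gz and data.tar.gz). Shown here on a stand-in string; for a real
# check you would unpack the .gem (a tar archive) and hash the member file,
# e.g. Digest::SHA256.file('data.tar.gz').hexdigest
sha256 = Digest::SHA256.hexdigest('stand-in for data.tar.gz bytes')
sha512 = Digest::SHA512.hexdigest('stand-in for data.tar.gz bytes')
```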
data/README.md
CHANGED
@@ -1,6 +1,7 @@
 # RubyCrawl 🎭
 
 [](https://rubygems.org/gems/rubycrawl)
+[](https://github.com/craft-wise/rubycrawl/actions/workflows/ci.yml)
 [](https://opensource.org/licenses/MIT)
 [](https://www.ruby-lang.org/)
 
@@ -189,13 +190,38 @@ puts "Indexed #{pages_crawled} pages"
 
 #### Multi-Page Options
 
-| Option
-|
-| `max_pages`
-| `max_depth`
-| `same_host_only`
-| `wait_until`
-| `block_resources`
+| Option | Default | Description |
+| ---------------------- | --------- | --------------------------------------------------- |
+| `max_pages` | 50 | Maximum number of pages to crawl |
+| `max_depth` | 3 | Maximum link depth from start URL |
+| `same_host_only` | true | Only follow links on the same domain |
+| `wait_until` | inherited | Page load strategy |
+| `block_resources` | inherited | Block images/fonts/CSS |
+| `respect_robots_txt` | false | Honour robots.txt rules and auto-sleep `Crawl-delay` |
+
+#### robots.txt Support
+
+When `respect_robots_txt: true`, RubyCrawl fetches `robots.txt` once at the start of the crawl and:
+
+- Skips any URL disallowed for `User-agent: *`
+- Automatically sleeps the `Crawl-delay` specified in robots.txt between pages
+
+```ruby
+RubyCrawl.crawl_site("https://example.com",
+  respect_robots_txt: true,
+  max_pages: 100
+) do |page|
+  puts page.url
+end
+```
+
+Or enable globally:
+
+```ruby
+RubyCrawl.configure(respect_robots_txt: true)
+```
+
+If robots.txt is unreachable or missing, crawling proceeds normally (fail open).
 
 #### Page Result Object
 
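The `Disallow`/`Crawl-delay` semantics the new README section describes can be sketched standalone; this is an illustration of how such lines are read, not the gem's internals, and the robots.txt content is made up:

```ruby
# Standalone sketch of the robots.txt semantics described above
# (illustration only; sample file content is hypothetical).
robots = <<~TXT
  User-agent: *
  Disallow: /admin/
  Crawl-delay: 2
TXT

# Paths matching a Disallow rule are skipped during the crawl
disallowed = robots.lines
                   .grep(/\Adisallow:/i)
                   .map { |l| l.split(':', 2).last.strip }

# The crawler sleeps this many seconds between pages
delay = robots[/^crawl-delay:\s*([\d.]+)/i, 1]&.to_f
```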
@@ -247,11 +273,12 @@ result = RubyCrawl.crawl(
 
 | Option | Values | Default | Description |
 | ----------------- | ----------------------------------------------------------- | ------- | --------------------------------------------------- |
-| `wait_until`
-| `block_resources`
-| `max_attempts`
-| `timeout`
-| `headless`
+| `wait_until` | `"load"`, `"domcontentloaded"`, `"networkidle"`, `"commit"` | `nil` | When to consider page loaded (nil = Ferrum default) |
+| `block_resources` | `true`, `false` | `nil` | Block images, fonts, CSS, media for faster crawls |
+| `max_attempts` | Integer | `3` | Total number of attempts (including the first) |
+| `timeout` | Integer (seconds) | `30` | Browser navigation timeout |
+| `headless` | `true`, `false` | `true` | Run Chrome headlessly |
+| `respect_robots_txt` | `true`, `false` | `false` | Honour robots.txt rules and auto-sleep Crawl-delay |
 
 **Wait strategies explained:**
 
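The `max_attempts` row counts the first try as an attempt. That counting rule can be shown with a small retry sketch (an illustration of the documented semantics, not the gem's `with_retries` implementation):

```ruby
# Illustration of "total attempts include the first try" semantics.
def attempt(max_attempts)
  tries = 0
  begin
    tries += 1
    yield
  rescue StandardError
    retry if tries < max_attempts
    raise
  end
end

calls = 0
begin
  attempt(3) { calls += 1; raise 'always fails' }
rescue StandardError
  # gives up after exactly 3 attempts
end
# calls == 3
```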
@@ -497,9 +524,24 @@ Readability.js → heuristic fallback ← content extraction (inside browse
 
 - **Resource blocking**: Keep `block_resources: true` (default: nil) to skip images/fonts/CSS for 2-3x faster crawls
 - **Wait strategy**: Use `wait_until: "load"` for static sites, `"networkidle"` for SPAs
-- **Concurrency**: Use background jobs (Sidekiq, GoodJob, etc.) for parallel crawling
 - **Browser reuse**: The first crawl is slower (~2s) due to Chrome launch; subsequent crawls are much faster (~200-500ms)
 
+### Parallelism
+
+RubyCrawl does not support parallel page loading within a single process — Ferrum uses one Chrome instance and concurrent access is not thread-safe.
+
+The recommended pattern is **job-level parallelism**: each background job gets its own `RubyCrawl` instance and Chrome process, with natural rate limiting via your job queue's concurrency setting:
+
+```ruby
+# Enqueue independent crawls — each job runs its own Chrome
+urls.each { |url| CrawlJob.perform_later(url) }
+
+# Control concurrency via your queue worker config (Sidekiq, GoodJob, etc.)
+# e.g. Sidekiq concurrency: 3 → 3 Chrome processes crawling in parallel
+```
+
+This also works naturally with `respect_robots_txt: true` — each job respects Crawl-delay independently.
+
 ## Development
 
 ```bash
@@ -507,12 +549,9 @@ git clone git@github.com:craft-wise/rubycrawl.git
 cd rubycrawl
 bin/setup
 
-# Run
+# Run all tests (Chrome required — installed as a gem dependency)
 bundle exec rspec
 
-# Run integration tests (requires Chrome)
-INTEGRATION=1 bundle exec rspec
-
 # Manual testing
 bin/console
 > RubyCrawl.crawl("https://example.com")
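The "concurrency: 3 → 3 Chrome processes" idea from the README's new Parallelism section can be sketched with a plain thread pool. This is only an illustration of capping concurrency over a URL queue; in the documented setup each worker would be a separate job process with its own crawler, and the URLs here are made up:

```ruby
# Thread-pool illustration of queue-capped concurrency (not the gem's code).
CONCURRENCY = 3
queue = Queue.new
(1..9).each { |i| queue << "https://example.com/page#{i}" }
queue.close

done = Queue.new
workers = Array.new(CONCURRENCY) do
  Thread.new do
    # Queue#pop returns nil once the queue is closed and drained
    while (url = queue.pop)
      # a real worker would run its own RubyCrawl/Chrome instance here
      done << url
    end
  end
end
workers.each(&:join)
# done.size == 9
```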
data/lib/rubycrawl/robots_parser.rb
ADDED
@@ -0,0 +1,86 @@
+# frozen_string_literal: true
+
+require 'net/http'
+require 'uri'
+
+class RubyCrawl
+  # Fetches and parses robots.txt for a given site.
+  # Supports User-agent: *, Disallow, Allow, and Crawl-delay directives.
+  # Fails open — any fetch/parse error allows all URLs.
+  class RobotsParser
+    # Fetch robots.txt from base_url and return a parser instance.
+    # Returns a permissive (allow-all) instance on any network error.
+    def self.fetch(base_url)
+      uri = URI.join(base_url, '/robots.txt')
+      response = Net::HTTP.start(uri.host, uri.port,
+                                 use_ssl: uri.scheme == 'https',
+                                 open_timeout: 5,
+                                 read_timeout: 5) do |http|
+        http.get(uri.request_uri)
+      end
+      new(response.is_a?(Net::HTTPOK) ? response.body : '')
+    rescue StandardError
+      new('') # network error or invalid URL → allow everything
+    end
+
+    def initialize(content)
+      @rules = parse(content.to_s)
+    end
+
+    # Returns true if the given URL is allowed to be crawled.
+    def allowed?(url)
+      path = URI.parse(url).path
+      path = '/' if path.nil? || path.empty?
+
+      # Allow rules take precedence over Disallow when both match.
+      return true if @rules[:allow].any? { |rule| path_matches?(path, rule) }
+      return false if @rules[:disallow].any? { |rule| path_matches?(path, rule) }
+
+      true
+    rescue URI::InvalidURIError
+      true
+    end
+
+    # Returns the Crawl-delay value in seconds, or nil if not specified.
+    def crawl_delay
+      @rules[:crawl_delay]
+    end
+
+    private
+
+    def parse(content)
+      rules = { allow: [], disallow: [], crawl_delay: nil }
+      in_relevant_section = false
+
+      content.each_line do |raw_line|
+        line = raw_line.strip.sub(/#.*$/, '').strip
+        next if line.empty?
+
+        key, value = line.split(':', 2).map(&:strip)
+        next unless key && value
+
+        case key.downcase
+        when 'user-agent'
+          in_relevant_section = (value == '*')
+        when 'disallow'
+          rules[:disallow] << value if in_relevant_section && !value.empty?
+        when 'allow'
+          rules[:allow] << value if in_relevant_section && !value.empty?
+        when 'crawl-delay'
+          rules[:crawl_delay] = value.to_f if in_relevant_section && value.match?(/\A\d+(\.\d+)?\z/)
+        end
+      end
+
+      rules
+    end
+
+    # Matches a URL path against a robots.txt rule pattern.
+    # Supports * (wildcard) and $ (end-of-string anchor).
+    def path_matches?(path, rule)
+      return false if rule.empty?
+
+      pattern = Regexp.escape(rule).gsub('\*', '.*').gsub('\$', '\z')
+      path.match?(/\A#{pattern}/)
+    end
+  end
+end
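The wildcard translation in `path_matches?` above can be exercised in isolation. A standalone copy of the matching rule, renamed to make clear it is an extract for illustration:

```ruby
# Standalone copy of the pattern translation used by RobotsParser#path_matches?:
# * becomes .*, a trailing $ becomes an end-of-string anchor (\z), and rules
# always match from the start of the path.
def robots_match?(path, rule)
  return false if rule.empty?

  pattern = Regexp.escape(rule).gsub('\*', '.*').gsub('\$', '\z')
  path.match?(/\A#{pattern}/)
end

robots_match?('/private/data.html', '/private/') # => true  (prefix match)
robots_match?('/files/report.pdf', '/*.pdf$')    # => true  (wildcard + anchor)
robots_match?('/files/report.pdfx', '/*.pdf$')   # => false ($ anchors the end)
```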
data/lib/rubycrawl/site_crawler.rb
CHANGED
@@ -1,6 +1,7 @@
 # frozen_string_literal: true
 
 require 'set'
+require_relative 'robots_parser'
 
 class RubyCrawl
   # BFS crawler that follows links with deduplication.
@@ -46,7 +47,8 @@ class RubyCrawl
     @same_host_only = options.fetch(:same_host_only, true)
     @wait_until = options.fetch(:wait_until, nil)
     @block_resources = options.fetch(:block_resources, nil)
-    @max_attempts
+    @max_attempts = options.fetch(:max_attempts, nil)
+    @respect_robots_txt = options.fetch(:respect_robots_txt, false)
     @visited = Set.new
     @queue = []
   end
@@ -58,6 +60,7 @@ class RubyCrawl
     raise ConfigurationError, "Invalid start URL: #{start_url}" unless normalized
 
     @base_url = normalized
+    @robots = @respect_robots_txt ? RobotsParser.fetch(@base_url) : nil
     enqueue(normalized, 0)
     process_queue(&block)
   end
@@ -71,6 +74,8 @@ class RubyCrawl
       url, depth = item
       next if @visited.include?(url)
 
+      sleep(@robots.crawl_delay) if @robots&.crawl_delay && pages_crawled.positive?
+
       result = process_page(url, depth)
       next unless result
 
@@ -130,11 +135,20 @@ class RubyCrawl
       next unless normalized
       next if @visited.include?(normalized)
       next if @same_host_only && !UrlNormalizer.same_host?(normalized, @base_url)
+      next if robots_disallowed?(normalized)
 
       enqueue(normalized, depth)
     end
   end
 
+  def robots_disallowed?(url)
+    return false unless @robots
+    return false if @robots.allowed?(url)
+
+    warn "[rubycrawl] Skipping #{url} (disallowed by robots.txt)"
+    true
+  end
+
   def enqueue(url, depth)
     return if @visited.include?(url)
 
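The class comment above calls this a BFS crawler with deduplication, and the changes hook into that visited-set/queue bookkeeping. A standalone sketch of the traversal (the link graph is made up, and this is not the gem's code):

```ruby
require 'set'

# Standalone sketch of the crawler's BFS bookkeeping: a visited Set plus a
# FIFO queue of [url, depth] pairs. The link graph below is hypothetical.
links = {
  'https://example.com/'  => ['https://example.com/a', 'https://example.com/b'],
  'https://example.com/a' => ['https://example.com/'], # cycle back to the root
  'https://example.com/b' => []
}

visited = Set.new
queue = [['https://example.com/', 0]]
order = []

until queue.empty?
  url, depth = queue.shift
  next if visited.include?(url) # dedup: each URL is processed once

  visited << url
  order << url
  links.fetch(url, []).each do |link|
    queue << [link, depth + 1] unless visited.include?(link)
  end
end
# order == ["https://example.com/", "https://example.com/a", "https://example.com/b"]
```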
data/lib/rubycrawl/tasks/install.rake
CHANGED
@@ -27,11 +27,12 @@ namespace :rubycrawl do
 
     # RubyCrawl Configuration
     RubyCrawl.configure(
-      # wait_until: "load",
-      # block_resources: true,
-      # max_attempts: 3,
-      # timeout: 30,
-      # headless: true,
+      # wait_until: "load",          # "load", "domcontentloaded", "networkidle"
+      # block_resources: true,       # block images/fonts/CSS/media for speed
+      # max_attempts: 3,             # retry count with exponential backoff
+      # timeout: 30,                 # browser navigation timeout in seconds
+      # headless: true,              # set false to see the browser (debugging)
+      # respect_robots_txt: false,   # set true to honour robots.txt and Crawl-delay
     )
   RUBY
 
data/lib/rubycrawl/version.rb
CHANGED
data/lib/rubycrawl.rb
CHANGED
@@ -81,7 +81,8 @@ class RubyCrawl
     @max_attempts = options.fetch(:max_attempts, 3)
     @timeout = options.fetch(:timeout, 30)
     @headless = options.fetch(:headless, true)
-    @browser_options
+    @browser_options = options.fetch(:browser_options, {})
+    @respect_robots_txt = options.fetch(:respect_robots_txt, false)
   end
 
   def with_retries(max_attempts)
@@ -101,12 +102,13 @@ class RubyCrawl
 
   def build_crawler_options(options)
     {
-      max_pages:
-      max_depth:
-      same_host_only:
-      wait_until:
-      block_resources:
-      max_attempts:
+      max_pages: options.fetch(:max_pages, 50),
+      max_depth: options.fetch(:max_depth, 3),
+      same_host_only: options.fetch(:same_host_only, true),
+      wait_until: options.fetch(:wait_until, @wait_until),
+      block_resources: options.fetch(:block_resources, @block_resources),
+      max_attempts: options.fetch(:max_attempts, @max_attempts),
+      respect_robots_txt: options.fetch(:respect_robots_txt, @respect_robots_txt)
     }
   end
 end
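The rewritten `build_crawler_options` resolves each option as the per-call value first, then the configured instance default, via `Hash#fetch` with a fallback. That precedence can be shown in isolation:

```ruby
# Illustration of the options.fetch precedence used in build_crawler_options:
# a per-call value wins; otherwise the configured default applies.
configured_default = 5 # stands in for an instance default such as @max_attempts

per_call = { max_attempts: 1 }.fetch(:max_attempts, configured_default)
# per_call == 1  (explicit per-call value wins)

fallback = {}.fetch(:max_attempts, configured_default)
# fallback == 5  (no per-call value, configured default applies)
```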
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: rubycrawl
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.4.0
 platform: ruby
 authors:
 - RubyCrawl contributors
@@ -58,6 +58,7 @@ files:
 - lib/rubycrawl/markdown_converter.rb
 - lib/rubycrawl/railtie.rb
 - lib/rubycrawl/result.rb
+- lib/rubycrawl/robots_parser.rb
 - lib/rubycrawl/site_crawler.rb
 - lib/rubycrawl/tasks/install.rake
 - lib/rubycrawl/url_normalizer.rb