medusa-crawler 1.0.0.pre.2 → 1.0.0
- checksums.yaml +4 -4
- checksums.yaml.gz.sig +0 -0
- data.tar.gz.sig +0 -0
- data/CHANGELOG.md +33 -6
- data/CONTRIBUTORS.md +1 -1
- data/README.rdoc +32 -2
- data/Rakefile +0 -5
- data/VERSION +1 -1
- data/lib/medusa/core.rb +2 -4
- data/lib/medusa/http.rb +4 -14
- data/lib/medusa/page_store.rb +0 -53
- data/lib/medusa/storage/base.rb +0 -1
- data/lib/medusa/version.rb +1 -1
- data/spec/fakeweb_helper.rb +11 -6
- data/spec/medusa_spec.rb +2 -1
- metadata +38 -8
- metadata.gz.sig +0 -0
checksums.yaml
CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: f2398745018a9c162fa8587b634753fe5aa93b46c541c6a45018ab243ab738ec
+  data.tar.gz: 792d6308a0b24958de72506e45acba5f2814e3f26982787bba913233285c5c08
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e773494251c93ba4f1b9b33a9bc778e605f13d16368520ab38f757ff341489a495a265882ad49da7fcf17b809fdfbe23bd46289732386971619b957bcf36e3d6
+  data.tar.gz: e9c98d600dffed81b9bd3fc82ced0479913b1b6c42ff27bda9bb0607166d8eba1dbd7a34f63108ce04759d0720ed408ffcde615ec0d11d7de00242698a2a601d
checksums.yaml.gz.sig
CHANGED
Binary file

data.tar.gz.sig
CHANGED
Binary file
data/CHANGELOG.md
CHANGED

@@ -1,8 +1,36 @@
+## Release v1.0.0 (2020-08-17)
+Features:
+- Remove `PageStore#pages_linking_to`, `PageStore#urls_linking_to`
+- Remove `verbose` setting
 
-
+Changes:
+- Add an examples section to the [README](https://github.com/brutuscat/medusa-crawler/blob/main/README.md) file
+- Update the [CONTRIBUTORS](https://github.com/brutuscat/medusa-crawler/blob/main/CONTRIBUTORS.mdd) file
+- Update the [CHANGELOG](https://github.com/brutuscat/medusa-crawler/blob/main/CHANGELOG.md) file
 
+## Pre-release v1.0.0.pre.2
 Features:
+- Remove CLI bins
+- Remove `PageStore#shortest_paths!`
+
+Fixes
+- Skip link regex filter to consider the full URI [#1](https://github.com/brutuscat/medusa-crawler/issues/1)
+
+## Pre-release v1.0.0.pre.1
+Features:
+- Switch to use `Moneta` instead of custom storage provider adapters
+
+Fixes
+- Fix link skip regex to include the full URI [#1](https://github.com/brutuscat/medusa-crawler/issues/1)
 
+Dev
+- Use webmock gem for testing
+
+Changes:
+- Rename Medusa to medusa-crawler gem
+
+## Anemone forked into Medusa (2014-12-13)
+Features:
 - Switch to use `OpenURI` instead of `net/http`, gaining out of the box support for:
 - Http basic auth options
 - Proxy configuration options
@@ -11,10 +39,9 @@ Features:
 - Ability to control the RETRY_LIMIT upon connection errors
 
 Changes:
-
 - Renamed Anemone to Medusa
-- Revamped the [README](https://github.com/brutuscat/medusa/blob/
-- Revamped the [CHANGELOG](https://github.com/brutuscat/medusa/blob/
-- Revamped the [CONTRIBUTORS](https://github.com/brutuscat/medusa/blob/
+- Revamped the [README](https://github.com/brutuscat/medusa-crawler/blob/main/README.md) file
+- Revamped the [CHANGELOG](https://github.com/brutuscat/medusa-crawler/blob/main/CHANGELOG.md) file
+- Revamped the [CONTRIBUTORS](https://github.com/brutuscat/medusa-crawler/blob/main/CONTRIBUTORS.mdd) file
 
-> Refer to the [Anemone changelog](https://github.com/chriskite/anemone/blob/next/CHANGELOG.rdoc)
+> Refer to the [Anemone changelog](https://github.com/chriskite/anemone/blob/next/CHANGELOG.rdoc) to go back to the past.
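The "Skip link regex filter to consider the full URI" fix above can be sketched in plain Ruby. This is an illustrative stand-in, not Medusa's actual implementation: the hypothetical `skip_link?` helper shows the difference the fix makes, namely that skip patterns match against the whole URI string (scheme, host, path, query) instead of only part of it.

```ruby
require 'uri'

# Hypothetical helper: test every skip pattern against the full URI string,
# so a pattern can match the host or query just as well as the path.
def skip_link?(url, patterns)
  full = URI(url).to_s
  patterns.any? { |re| full.match?(re) }
end

patterns = [/private/]

skip_link?('https://example.com/private/report', patterns) # => true  (path matches)
skip_link?('https://private.example.com/about', patterns)  # => true  (host matches too)
skip_link?('https://example.com/public', patterns)         # => false
```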
data/CONTRIBUTORS.md
CHANGED
data/README.rdoc
CHANGED

@@ -1,4 +1,4 @@
-== Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://
+== Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://rubygems.org/gems/medusa-crawler] rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push
 
 Medusa is a framework for the ruby language to crawl and collect useful information about the pages
 it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
@@ -9,7 +9,6 @@ it visits. It is versatile, allowing you to write your own specialized tasks qui
 * Multi-threaded design for high performance
 * Tracks +301+ HTTP redirects
 * Allows exclusion of URLs based on regular expressions
-* HTTPS support
 * Records response time for each page
 * Obey _robots.txt_ directives (optional, but recommended)
 * In-memory or persistent storage of pages during crawl, provided by Moneta[https://github.com/moneta-rb/moneta]
@@ -17,6 +16,37 @@ it visits. It is versatile, allowing you to write your own specialized tasks qui
 
 <b>Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]</b>
 
+=== Examples
+
+Medusa is versatile and to be used programatically, you can start with one or multiple URIs:
+
+  require 'medusa'
+
+  Medusa.crawl('https://www.example.com', depth_limit: 2)
+
+Or you can pass a block and it will yield the crawler back, to manage configuration or drive its crawling focus:
+
+  require 'medusa'
+
+  Medusa.crawl('https://www.example.com', depth_limit: 2) do |crawler|
+    crawler.discard_page_bodies = some_flag
+
+    # Persist all the pages state across crawl-runs.
+    crawler.clear_on_startup = false
+    crawler.storage = Medusa::Storage.Moneta(:Redis, 'redis://redis.host.name:6379/0')
+
+    crawler.skip_links_like(/private/)
+
+    crawler.on_pages_like(/public/) do |page|
+      logger.debug "[public page] #{page.url} took #{page.response_time} found #{page.links.count}"
+    end
+
+    # Use an arbitrary logic, page by page, to continue customize the crawling.
+    crawler.focus_crawl(/public/) do |page|
+      page.links.first
+    end
+  end
+
 ---
 
 === Requirements
data/Rakefile
CHANGED

@@ -7,11 +7,6 @@ RSpec::Core::RakeTask.new(:rspec) do |spec|
   spec.pattern = 'spec/**/*_spec.rb'
 end
 
-RSpec::Core::RakeTask.new(:rcov) do |spec|
-  spec.pattern = 'spec/**/*_spec.rb'
-  spec.rcov = true
-end
-
 task :default => :rspec
 
 Rake::RDocTask.new(:rdoc) do |rdoc|
data/VERSION
CHANGED

@@ -1 +1 @@
-1.0.0
+1.0.0
data/lib/medusa/core.rb
CHANGED

@@ -26,8 +26,6 @@ module Medusa
     DEFAULT_OPTS = {
       # run 4 Tentacle threads to fetch pages
       :threads => 4,
-      # disable verbose output
-      :verbose => false,
       # don't throw away the page response body after scanning it for links
       :discard_page_bodies => false,
       # identify self as Medusa/VERSION
@@ -40,7 +38,7 @@ module Medusa
       :depth_limit => false,
       # number of times HTTP redirects will be followed
       :redirect_limit => 5,
-      # storage engine defaults to
+      # storage engine defaults to In-memory store in +process_options+ if none specified
       :storage => nil,
       # cleanups of the storage on every startup of the crawler
       :clear_on_startup => true,
@@ -148,6 +146,7 @@ module Medusa
     # Perform the crawl
     #
     def run
+
       process_options
 
       @urls.delete_if { |url| !visit_link?(url) }
@@ -165,7 +164,6 @@ module Medusa
       loop do
        page = page_queue.deq
        @pages.touch_key page.url
-        puts "#{page.url} Queue: #{link_queue.size}" if @opts[:verbose]
        do_page_blocks page
        page.discard_doc! if @opts[:discard_page_bodies]
 
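The `DEFAULT_OPTS` hash above feeds the usual defaults-plus-overrides pattern, where options passed by the caller win over the defaults. A minimal stand-alone sketch of that pattern (this `process_options` is a simplified stand-in, not the gem's actual method body, and the option values are abridged from the diff):

```ruby
# Abridged defaults, mirroring a few entries of Medusa's DEFAULT_OPTS.
DEFAULT_OPTS = {
  threads: 4,
  discard_page_bodies: false,
  redirect_limit: 5,
  storage: nil,
  clear_on_startup: true,
}.freeze

# Simplified stand-in: merge user-supplied options over the defaults,
# so user keys win and unspecified keys keep their default values.
def process_options(user_opts)
  DEFAULT_OPTS.merge(user_opts)
end

opts = process_options(threads: 8, depth_limit: 2)
opts[:threads]        # => 8 (overridden by the caller)
opts[:redirect_limit] # => 5 (default kept)
```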
data/lib/medusa/http.rb
CHANGED

@@ -29,9 +29,9 @@ module Medusa
     # including redirects
     #
     def fetch_pages(url, referer = nil, depth = nil)
+      pages = []
       begin
         url = URI(url) unless url.is_a?(URI)
-        pages = []
         get(url, referer) do |response, headers, code, location, redirect_to, response_time|
           pages << Page.new(location, :body => response,
                             :headers => headers,
@@ -43,13 +43,8 @@ module Medusa
         end
 
         return pages
-      rescue
-
-        puts e.inspect
-        puts e.backtrace
-      end
-      pages ||= []
-      return pages << Page.new(url, :error => e)
+      rescue StandardError => e
+        return pages << Page.new(url, error: e)
       end
     end
 
@@ -180,18 +175,13 @@ module Medusa
 
     rescue Timeout::Error, EOFError, Errno::ECONNREFUSED, Errno::ETIMEDOUT, Errno::ECONNRESET => e
       retries += 1
-      puts "[medusa] Retrying ##{retries} on url #{url} because of: #{e.inspect}" if verbose?
       sleep(3 ^ retries)
       retry unless retries > RETRY_LIMIT
     ensure
-      resource
+      resource&.close unless resource&.closed?
     end
   end
 
-    def verbose?
-      @opts[:verbose]
-    end
-
     #
     # Allowed to connect to the requested url?
     #
data/lib/medusa/page_store.rb
CHANGED

@@ -65,58 +65,5 @@ module Medusa
       each_value { |page| delete page.url if page.redirect? }
       self
     end
-
-    #
-    # If given a single URL (as a String or URI), returns an Array of Pages which link to that URL
-    # If given an Array of URLs, returns a Hash (URI => [Page, Page...]) of Pages linking to those URLs
-    #
-    def pages_linking_to(urls)
-      unless urls.is_a?(Array)
-        urls = [urls]
-        single = true
-      end
-
-      urls.map! do |url|
-        unless url.is_a?(URI)
-          URI(url) rescue nil
-        else
-          url
-        end
-      end
-      urls.compact
-
-      links = {}
-      urls.each { |url| links[url] = [] }
-      values.each do |page|
-        urls.each { |url| links[url] << page if page.links.include?(url) }
-      end
-
-      if single and !links.empty?
-        return links[urls.first]
-      else
-        return links
-      end
-    end
-
-    #
-    # If given a single URL (as a String or URI), returns an Array of URLs which link to that URL
-    # If given an Array of URLs, returns a Hash (URI => [URI, URI...]) of URLs linking to those URLs
-    #
-    def urls_linking_to(urls)
-      unless urls.is_a?(Array)
-        urls = [urls] unless urls.is_a?(Array)
-        single = true
-      end
-
-      links = pages_linking_to(urls)
-      links.each { |url, pages| links[url] = pages.map{|p| p.url} }
-
-      if single and !links.empty?
-        return links[urls.first]
-      else
-        return links
-      end
-    end
-
   end
 end
data/lib/medusa/storage/base.rb
CHANGED
data/lib/medusa/version.rb
CHANGED
data/spec/fakeweb_helper.rb
CHANGED

@@ -29,6 +29,8 @@ module Medusa
       @base = options[:base] if options.has_key?(:base)
       @content_type = options[:content_type] || "text/html"
       @body = options[:body]
+      @status = options[:status] || [200, 'OK']
+      @exception = options[:exception]
 
       create_body unless @body
       add_to_fakeweb
@@ -56,7 +58,7 @@ module Medusa
     end
 
     def add_to_fakeweb
-      options = {body: @body, status:
+      options = {body: @body, status: @status, headers: {'Content-Type' => @content_type}}
 
       if @redirect
         options[:status] = [301, 'Moved Permanently']
@@ -66,7 +68,7 @@ module Medusa
         options[:headers]['Location'] = redirect_url
 
         # register the page this one redirects to
-        WebMock.stub_request(:get, redirect_url).to_return(body: '', status:
+        WebMock.stub_request(:get, redirect_url).to_return(body: '', status: @status, headers: {'Content-Type' => @content_type})
       end
 
       if @auth
@@ -75,11 +77,14 @@ module Medusa
         WebMock.stub_request(:get, url).to_return(unautorized_options)
         WebMock.stub_request(:get, url).with(basic_auth: AUTH).to_return(options)
       else
-        WebMock.stub_request(:get, url).
+        WebMock.stub_request(:get, url).tap do |req|
+          if @exception
+            req.to_raise(@exception)
+          else
+            req.to_return(options)
+          end
+        end
       end
     end
   end
 end
-
-#default root
-Medusa::FakePage.new
data/spec/medusa_spec.rb
CHANGED

@@ -1,3 +1,4 @@
+require 'fakeweb_helper'
 
 RSpec.describe Medusa do
 
@@ -6,9 +7,9 @@ RSpec.describe Medusa do
   end
 
   it "should return a Medusa::Core from the crawl, which has a PageStore" do
+    Medusa::FakePage.new
     result = Medusa.crawl(SPEC_DOMAIN)
     expect(result).to be_an_instance_of(Medusa::Core)
     expect(result.pages).to be_an_instance_of(Medusa::PageStore)
   end
-
 end
metadata
CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: medusa-crawler
 version: !ruby/object:Gem::Version
-  version: 1.0.0
+  version: 1.0.0
 platform: ruby
 authors:
 - Mauro Asprea
@@ -35,7 +35,7 @@ cert_chain:
   g4G6EZGbKCMwJDC0Wtmrygr7+THZVQlBs0ljTdrN8GXsuI9W52VlZctZQXEuoboH
   mpXw1d3WewNciml1VaOG782DKqZvT0i19V5LnZzoGzmU2q3ZJw7jCw==
   -----END CERTIFICATE-----
-date: 2020-08-
+date: 2020-08-17 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: moneta
@@ -98,7 +98,7 @@ dependencies:
   - !ruby/object:Gem::Version
     version: 1.0.0
 description: |+
-  == Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://
+  == Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://rubygems.org/gems/medusa-crawler] rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push
 
   Medusa is a framework for the ruby language to crawl and collect useful information about the pages
   it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
@@ -109,7 +109,6 @@ description: |+
   * Multi-threaded design for high performance
   * Tracks +301+ HTTP redirects
   * Allows exclusion of URLs based on regular expressions
-  * HTTPS support
   * Records response time for each page
   * Obey _robots.txt_ directives (optional, but recommended)
   * In-memory or persistent storage of pages during crawl, provided by Moneta[https://github.com/moneta-rb/moneta]
@@ -117,6 +116,37 @@ description: |+
 
   <b>Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]</b>
 
+  === Examples
+
+  Medusa is versatile and to be used programatically, you can start with one or multiple URIs:
+
+    require 'medusa'
+
+    Medusa.crawl('https://www.example.com', depth_limit: 2)
+
+  Or you can pass a block and it will yield the crawler back, to manage configuration or drive its crawling focus:
+
+    require 'medusa'
+
+    Medusa.crawl('https://www.example.com', depth_limit: 2) do |crawler|
+      crawler.discard_page_bodies = some_flag
+
+      # Persist all the pages state across crawl-runs.
+      crawler.clear_on_startup = false
+      crawler.storage = Medusa::Storage.Moneta(:Redis, 'redis://redis.host.name:6379/0')
+
+      crawler.skip_links_like(/private/)
+
+      crawler.on_pages_like(/public/) do |page|
+        logger.debug "[public page] #{page.url} took #{page.response_time} found #{page.links.count}"
+      end
+
+      # Use an arbitrary logic, page by page, to continue customize the crawling.
+      crawler.focus_crawl(/public/) do |page|
+        page.links.first
+      end
+    end
+
 email: mauroasprea@gmail.com
 executables: []
 extensions: []
@@ -151,7 +181,7 @@ licenses:
 - MIT
 metadata:
   bug_tracker_uri: https://github.com/brutuscat/medusa-crawler/issues
-  source_code_uri: https://github.com/brutuscat/medusa-crawler/tree/v1.0.0
+  source_code_uri: https://github.com/brutuscat/medusa-crawler/tree/v1.0.0
 description_markup_format: rdoc
 post_install_message:
 rdoc_options:
@@ -168,11 +198,11 @@ required_ruby_version: !ruby/object:Gem::Requirement
     version: 2.3.0
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
-  - - "
+  - - ">="
   - !ruby/object:Gem::Version
-    version:
+    version: '0'
 requirements: []
-rubygems_version: 3.1.
+rubygems_version: 3.1.4
 signing_key:
 specification_version: 4
 summary: Medusa is a ruby crawler framework
metadata.gz.sig
CHANGED
Binary file