medusa-crawler 1.0.0.pre.2 → 1.0.0
This diff shows the changes between publicly released versions of the package, as published to their respective public registries. It is provided for informational purposes only.
- checksums.yaml +4 -4
- checksums.yaml.gz.sig +0 -0
- data.tar.gz.sig +0 -0
- data/CHANGELOG.md +33 -6
- data/CONTRIBUTORS.md +1 -1
- data/README.rdoc +32 -2
- data/Rakefile +0 -5
- data/VERSION +1 -1
- data/lib/medusa/core.rb +2 -4
- data/lib/medusa/http.rb +4 -14
- data/lib/medusa/page_store.rb +0 -53
- data/lib/medusa/storage/base.rb +0 -1
- data/lib/medusa/version.rb +1 -1
- data/spec/fakeweb_helper.rb +11 -6
- data/spec/medusa_spec.rb +2 -1
- metadata +38 -8
- metadata.gz.sig +0 -0
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: f2398745018a9c162fa8587b634753fe5aa93b46c541c6a45018ab243ab738ec
+  data.tar.gz: 792d6308a0b24958de72506e45acba5f2814e3f26982787bba913233285c5c08
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: e773494251c93ba4f1b9b33a9bc778e605f13d16368520ab38f757ff341489a495a265882ad49da7fcf17b809fdfbe23bd46289732386971619b957bcf36e3d6
+  data.tar.gz: e9c98d600dffed81b9bd3fc82ced0479913b1b6c42ff27bda9bb0607166d8eba1dbd7a34f63108ce04759d0720ed408ffcde615ec0d11d7de00242698a2a601d
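A `.gem` file is a plain tar archive containing `metadata.gz`, `data.tar.gz`, and `checksums.yaml.gz`, so the new SHA256 values above can be checked locally after fetching and unpacking the gem (e.g. `gem fetch medusa-crawler -v 1.0.0` followed by `tar -xf medusa-crawler-1.0.0.gem`). A minimal Ruby sketch of the comparison:

  require 'digest'

  # Print the digest of each unpacked archive member; compare against
  # the published SHA256 values in checksums.yaml above.
  %w[metadata.gz data.tar.gz].each do |member|
    puts "#{member}: #{Digest::SHA256.file(member).hexdigest}"
  end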
checksums.yaml.gz.sig
CHANGED
Binary file

data.tar.gz.sig
CHANGED
Binary file
data/CHANGELOG.md
CHANGED
@@ -1,8 +1,36 @@
+## Release v1.0.0 (2020-08-17)
+Features:
+- Remove `PageStore#pages_linking_to`, `PageStore#urls_linking_to`
+- Remove `verbose` setting
 
-
+Changes:
+- Add an examples section to the [README](https://github.com/brutuscat/medusa-crawler/blob/main/README.md) file
+- Update the [CONTRIBUTORS](https://github.com/brutuscat/medusa-crawler/blob/main/CONTRIBUTORS.mdd) file
+- Update the [CHANGELOG](https://github.com/brutuscat/medusa-crawler/blob/main/CHANGELOG.md) file
 
+## Pre-release v1.0.0.pre.2
 Features:
+- Remove CLI bins
+- Remove `PageStore#shortest_paths!`
+
+Fixes
+- Skip link regex filter to consider the full URI [#1](https://github.com/brutuscat/medusa-crawler/issues/1)
+
+## Pre-release v1.0.0.pre.1
+Features:
+- Switch to use `Moneta` instead of custom storage provider adapters
+
+Fixes
+- Fix link skip regex to include the full URI [#1](https://github.com/brutuscat/medusa-crawler/issues/1)
 
+Dev
+- Use webmock gem for testing
+
+Changes:
+- Rename Medusa to medusa-crawler gem
+
+## Anemone forked into Medusa (2014-12-13)
+Features:
 - Switch to use `OpenURI` instead of `net/http`, gaining out of the box support for:
 - Http basic auth options
 - Proxy configuration options
@@ -11,10 +39,9 @@ Features:
 - Ability to control the RETRY_LIMIT upon connection errors
 
 Changes:
-
 - Renamed Anemone to Medusa
-- Revamped the [README](https://github.com/brutuscat/medusa/blob/
-- Revamped the [CHANGELOG](https://github.com/brutuscat/medusa/blob/
-- Revamped the [CONTRIBUTORS](https://github.com/brutuscat/medusa/blob/
+- Revamped the [README](https://github.com/brutuscat/medusa-crawler/blob/main/README.md) file
+- Revamped the [CHANGELOG](https://github.com/brutuscat/medusa-crawler/blob/main/CHANGELOG.md) file
+- Revamped the [CONTRIBUTORS](https://github.com/brutuscat/medusa-crawler/blob/main/CONTRIBUTORS.mdd) file
 
-> Refer to the [Anemone changelog](https://github.com/chriskite/anemone/blob/next/CHANGELOG.rdoc)
+> Refer to the [Anemone changelog](https://github.com/chriskite/anemone/blob/next/CHANGELOG.rdoc) to go back to the past.
data/CONTRIBUTORS.md
CHANGED
data/README.rdoc
CHANGED
@@ -1,4 +1,4 @@
-== Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://
+== Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://rubygems.org/gems/medusa-crawler] rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push
 
 Medusa is a framework for the ruby language to crawl and collect useful information about the pages
 it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
@@ -9,7 +9,6 @@ it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
 * Multi-threaded design for high performance
 * Tracks +301+ HTTP redirects
 * Allows exclusion of URLs based on regular expressions
-* HTTPS support
 * Records response time for each page
 * Obey _robots.txt_ directives (optional, but recommended)
 * In-memory or persistent storage of pages during crawl, provided by Moneta[https://github.com/moneta-rb/moneta]
@@ -17,6 +16,37 @@ it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
 
 <b>Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]</b>
 
+=== Examples
+
+Medusa is versatile and to be used programatically, you can start with one or multiple URIs:
+
+  require 'medusa'
+
+  Medusa.crawl('https://www.example.com', depth_limit: 2)
+
+Or you can pass a block and it will yield the crawler back, to manage configuration or drive its crawling focus:
+
+  require 'medusa'
+
+  Medusa.crawl('https://www.example.com', depth_limit: 2) do |crawler|
+    crawler.discard_page_bodies = some_flag
+
+    # Persist all the pages state across crawl-runs.
+    crawler.clear_on_startup = false
+    crawler.storage = Medusa::Storage.Moneta(:Redis, 'redis://redis.host.name:6379/0')
+
+    crawler.skip_links_like(/private/)
+
+    crawler.on_pages_like(/public/) do |page|
+      logger.debug "[public page] #{page.url} took #{page.response_time} found #{page.links.count}"
+    end
+
+    # Use an arbitrary logic, page by page, to continue customize the crawling.
+    crawler.focus_crawl(/public/) do |page|
+      page.links.first
+    end
+  end
+
 ---
 
 === Requirements
data/Rakefile
CHANGED
@@ -7,11 +7,6 @@ RSpec::Core::RakeTask.new(:rspec) do |spec|
   spec.pattern = 'spec/**/*_spec.rb'
 end
 
-RSpec::Core::RakeTask.new(:rcov) do |spec|
-  spec.pattern = 'spec/**/*_spec.rb'
-  spec.rcov = true
-end
-
 task :default => :rspec
 
 Rake::RDocTask.new(:rdoc) do |rdoc|
data/VERSION
CHANGED
@@ -1 +1 @@
-1.0.0.pre.2
+1.0.0
data/lib/medusa/core.rb
CHANGED
@@ -26,8 +26,6 @@ module Medusa
     DEFAULT_OPTS = {
       # run 4 Tentacle threads to fetch pages
       :threads => 4,
-      # disable verbose output
-      :verbose => false,
       # don't throw away the page response body after scanning it for links
       :discard_page_bodies => false,
       # identify self as Medusa/VERSION
@@ -40,7 +38,7 @@ module Medusa
       :depth_limit => false,
       # number of times HTTP redirects will be followed
       :redirect_limit => 5,
-      # storage engine defaults to
+      # storage engine defaults to In-memory store in +process_options+ if none specified
       :storage => nil,
       # cleanups of the storage on every startup of the crawler
       :clear_on_startup => true,
@@ -148,6 +146,7 @@ module Medusa
     # Perform the crawl
     #
     def run
+
       process_options
 
       @urls.delete_if { |url| !visit_link?(url) }
@@ -165,7 +164,6 @@ module Medusa
       loop do
         page = page_queue.deq
         @pages.touch_key page.url
-        puts "#{page.url} Queue: #{link_queue.size}" if @opts[:verbose]
         do_page_blocks page
         page.discard_doc! if @opts[:discard_page_bodies]
 
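With `:verbose` gone from `DEFAULT_OPTS` and the `puts` removed from the crawl loop, callers that relied on that output can recover equivalent progress logging through the page callbacks the gem already exposes. A minimal sketch, assuming (as the README examples suggest) that `on_pages_like` matches its pattern against the page URL; the logger is my addition, not part of the gem:

  require 'logger'
  require 'medusa'

  logger = Logger.new($stdout)

  Medusa.crawl('https://www.example.com') do |crawler|
    # Match every page, approximating the removed :verbose output.
    crawler.on_pages_like(/./) do |page|
      logger.info "fetched #{page.url} (response time #{page.response_time})"
    end
  end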
data/lib/medusa/http.rb
CHANGED
@@ -29,9 +29,9 @@ module Medusa
     # including redirects
     #
     def fetch_pages(url, referer = nil, depth = nil)
+      pages = []
       begin
         url = URI(url) unless url.is_a?(URI)
-        pages = []
         get(url, referer) do |response, headers, code, location, redirect_to, response_time|
           pages << Page.new(location, :body => response,
                             :headers => headers,
@@ -43,13 +43,8 @@ module Medusa
         end
 
         return pages
-      rescue
-
-        puts e.inspect
-        puts e.backtrace
-      end
-      pages ||= []
-      return pages << Page.new(url, :error => e)
+      rescue StandardError => e
+        return pages << Page.new(url, error: e)
       end
     end
 
@@ -180,18 +175,13 @@ module Medusa
 
     rescue Timeout::Error, EOFError, Errno::ECONNREFUSED, Errno::ETIMEDOUT, Errno::ECONNRESET => e
       retries += 1
-      puts "[medusa] Retrying ##{retries} on url #{url} because of: #{e.inspect}" if verbose?
       sleep(3 ^ retries)
       retry unless retries > RETRY_LIMIT
     ensure
-      resource
+      resource&.close unless resource&.closed?
     end
   end
 
-    def verbose?
-      @opts[:verbose]
-    end
-
     #
     # Allowed to connect to the requested url?
     #
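Hoisting `pages = []` above the `begin` is what makes the new two-line rescue work: if the assignment stays inside the `begin`, a failure before it leaves the variable unset, which is why the old code needed the `pages ||= []` fallback. A standalone sketch of the pattern, not tied to Medusa's internals:

  require 'open-uri'

  def fetch_all(urls)
    pages = []                   # hoisted: defined even when the first read raises
    urls.each { |url| pages << URI.parse(url).read }
    pages
  rescue StandardError => e      # explicit StandardError, as in the new code
    pages << e                   # pages gathered before the failure are preserved
  end

Separately, note that the retained back-off line `sleep(3 ^ retries)` appears to use Ruby's integer XOR (2, 1, 0 seconds for retries 1, 2, 3) rather than an exponential `3 ** retries`.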
data/lib/medusa/page_store.rb
CHANGED
@@ -65,58 +65,5 @@ module Medusa
       each_value { |page| delete page.url if page.redirect? }
       self
     end
-
-    #
-    # If given a single URL (as a String or URI), returns an Array of Pages which link to that URL
-    # If given an Array of URLs, returns a Hash (URI => [Page, Page...]) of Pages linking to those URLs
-    #
-    def pages_linking_to(urls)
-      unless urls.is_a?(Array)
-        urls = [urls]
-        single = true
-      end
-
-      urls.map! do |url|
-        unless url.is_a?(URI)
-          URI(url) rescue nil
-        else
-          url
-        end
-      end
-      urls.compact
-
-      links = {}
-      urls.each { |url| links[url] = [] }
-      values.each do |page|
-        urls.each { |url| links[url] << page if page.links.include?(url) }
-      end
-
-      if single and !links.empty?
-        return links[urls.first]
-      else
-        return links
-      end
-    end
-
-    #
-    # If given a single URL (as a String or URI), returns an Array of URLs which link to that URL
-    # If given an Array of URLs, returns a Hash (URI => [URI, URI...]) of URLs linking to those URLs
-    #
-    def urls_linking_to(urls)
-      unless urls.is_a?(Array)
-        urls = [urls] unless urls.is_a?(Array)
-        single = true
-      end
-
-      links = pages_linking_to(urls)
-      links.each { |url, pages| links[url] = pages.map{|p| p.url} }
-
-      if single and !links.empty?
-        return links[urls.first]
-      else
-        return links
-      end
-    end
-
   end
 end
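The deleted queries only used `values` and `Page#links`, so callers that depended on them can rebuild the lookup outside the library. A sketch based on the deleted implementation, assuming `PageStore#values` remains public as its internal use above suggests (method and parameter names here are mine):

  # Pages in the store that link to a given URL.
  def pages_linking_to(page_store, url)
    url = URI(url) unless url.is_a?(URI)
    page_store.values.select { |page| page.links.include?(url) }
  end

  # The same query, reduced to the URLs of the linking pages.
  def urls_linking_to(page_store, url)
    pages_linking_to(page_store, url).map(&:url)
  end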
data/lib/medusa/storage/base.rb
CHANGED
data/lib/medusa/version.rb
CHANGED
data/spec/fakeweb_helper.rb
CHANGED
@@ -29,6 +29,8 @@ module Medusa
       @base = options[:base] if options.has_key?(:base)
       @content_type = options[:content_type] || "text/html"
       @body = options[:body]
+      @status = options[:status] || [200, 'OK']
+      @exception = options[:exception]
 
       create_body unless @body
       add_to_fakeweb
@@ -56,7 +58,7 @@ module Medusa
     end
 
     def add_to_fakeweb
-      options = {body: @body, status:
+      options = {body: @body, status: @status, headers: {'Content-Type' => @content_type}}
 
       if @redirect
         options[:status] = [301, 'Moved Permanently']
@@ -66,7 +68,7 @@ module Medusa
         options[:headers]['Location'] = redirect_url
 
         # register the page this one redirects to
-        WebMock.stub_request(:get, redirect_url).to_return(body: '', status:
+        WebMock.stub_request(:get, redirect_url).to_return(body: '', status: @status, headers: {'Content-Type' => @content_type})
       end
 
       if @auth
@@ -75,11 +77,14 @@ module Medusa
         WebMock.stub_request(:get, url).to_return(unautorized_options)
         WebMock.stub_request(:get, url).with(basic_auth: AUTH).to_return(options)
       else
-        WebMock.stub_request(:get, url).
+        WebMock.stub_request(:get, url).tap do |req|
+          if @exception
+            req.to_raise(@exception)
+          else
+            req.to_return(options)
+          end
+        end
       end
     end
   end
 end
-
-#default root
-Medusa::FakePage.new
data/spec/medusa_spec.rb
CHANGED
@@ -1,3 +1,4 @@
+require 'fakeweb_helper'
 
 RSpec.describe Medusa do
 
@@ -6,9 +7,9 @@ RSpec.describe Medusa do
   end
 
   it "should return a Medusa::Core from the crawl, which has a PageStore" do
+    Medusa::FakePage.new
     result = Medusa.crawl(SPEC_DOMAIN)
     expect(result).to be_an_instance_of(Medusa::Core)
     expect(result.pages).to be_an_instance_of(Medusa::PageStore)
   end
-
 end
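Moving `Medusa::FakePage.new` from the helper's file scope into the example keeps stub registration next to the test that uses it, and survives suites that reset WebMock between examples. For instance, under a typical RSpec setup like the following (an assumption about this suite's configuration, not shown in the diff), a stub registered at load time would be wiped before the first example runs:

  RSpec.configure do |config|
    config.before(:each) do
      WebMock.reset!
    end
  end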
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: medusa-crawler
 version: !ruby/object:Gem::Version
-  version: 1.0.0.pre.2
+  version: 1.0.0
 platform: ruby
 authors:
 - Mauro Asprea
@@ -35,7 +35,7 @@ cert_chain:
   g4G6EZGbKCMwJDC0Wtmrygr7+THZVQlBs0ljTdrN8GXsuI9W52VlZctZQXEuoboH
   mpXw1d3WewNciml1VaOG782DKqZvT0i19V5LnZzoGzmU2q3ZJw7jCw==
   -----END CERTIFICATE-----
-date: 2020-08-
+date: 2020-08-17 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: moneta
@@ -98,7 +98,7 @@ dependencies:
   - !ruby/object:Gem::Version
     version: 1.0.0
 description: |+
-  == Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://
+  == Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://rubygems.org/gems/medusa-crawler] rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push
 
   Medusa is a framework for the ruby language to crawl and collect useful information about the pages
   it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
@@ -109,7 +109,6 @@ description: |+
   * Multi-threaded design for high performance
   * Tracks +301+ HTTP redirects
   * Allows exclusion of URLs based on regular expressions
-  * HTTPS support
   * Records response time for each page
   * Obey _robots.txt_ directives (optional, but recommended)
   * In-memory or persistent storage of pages during crawl, provided by Moneta[https://github.com/moneta-rb/moneta]
@@ -117,6 +116,37 @@ description: |+
 
   <b>Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]</b>
 
+  === Examples
+
+  Medusa is versatile and to be used programatically, you can start with one or multiple URIs:
+
+    require 'medusa'
+
+    Medusa.crawl('https://www.example.com', depth_limit: 2)
+
+  Or you can pass a block and it will yield the crawler back, to manage configuration or drive its crawling focus:
+
+    require 'medusa'
+
+    Medusa.crawl('https://www.example.com', depth_limit: 2) do |crawler|
+      crawler.discard_page_bodies = some_flag
+
+      # Persist all the pages state across crawl-runs.
+      crawler.clear_on_startup = false
+      crawler.storage = Medusa::Storage.Moneta(:Redis, 'redis://redis.host.name:6379/0')
+
+      crawler.skip_links_like(/private/)
+
+      crawler.on_pages_like(/public/) do |page|
+        logger.debug "[public page] #{page.url} took #{page.response_time} found #{page.links.count}"
+      end
+
+      # Use an arbitrary logic, page by page, to continue customize the crawling.
+      crawler.focus_crawl(/public/) do |page|
+        page.links.first
+      end
+    end
+
 email: mauroasprea@gmail.com
 executables: []
 extensions: []
@@ -151,7 +181,7 @@ licenses:
 - MIT
 metadata:
   bug_tracker_uri: https://github.com/brutuscat/medusa-crawler/issues
-  source_code_uri: https://github.com/brutuscat/medusa-crawler/tree/v1.0.0.pre.2
+  source_code_uri: https://github.com/brutuscat/medusa-crawler/tree/v1.0.0
 description_markup_format: rdoc
 post_install_message:
 rdoc_options:
@@ -168,11 +198,11 @@ required_ruby_version: !ruby/object:Gem::Requirement
     version: 2.3.0
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
-  - - "
+  - - ">="
   - !ruby/object:Gem::Version
-    version:
+    version: '0'
 requirements: []
-rubygems_version: 3.1.
+rubygems_version: 3.1.4
 signing_key:
 specification_version: 4
 summary: Medusa is a ruby crawler framework
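With the version finalized and `required_rubygems_version` relaxed to `>= 0`, the gem can now be pulled in without a prerelease constraint; for example, in a Gemfile (the pessimistic constraint is a suggestion, not from the diff):

  gem 'medusa-crawler', '~> 1.0'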
metadata.gz.sig
CHANGED
Binary file