medusa-crawler 1.0.0.pre.2 → 1.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 9daf1076b9f0528f797128f639affadc99c61aa7980f4de84f91b520bda7b305
- data.tar.gz: c95e174b0215473befb1d9437865bc5e20696efb245151ebd5e6b462fbadb099
+ metadata.gz: f2398745018a9c162fa8587b634753fe5aa93b46c541c6a45018ab243ab738ec
+ data.tar.gz: 792d6308a0b24958de72506e45acba5f2814e3f26982787bba913233285c5c08
  SHA512:
- metadata.gz: e0afb9e4ac4cc5fdfd0a7c9bb3f01672bbdf41e711005eefffbd5d1b54a601e52868ad093e9985f49078942656551acca73de26d4b675e1442dec981e618bee3
- data.tar.gz: 1098323c312714b7d1d72abd762d49a97dc233dfd8fb38cd574ab5611112cdc73859050a6f8fefcda35315274fad939c36750ce8277c5560b72769e5e5051a5b
+ metadata.gz: e773494251c93ba4f1b9b33a9bc778e605f13d16368520ab38f757ff341489a495a265882ad49da7fcf17b809fdfbe23bd46289732386971619b957bcf36e3d6
+ data.tar.gz: e9c98d600dffed81b9bd3fc82ced0479913b1b6c42ff27bda9bb0607166d8eba1dbd7a34f63108ce04759d0720ed408ffcde615ec0d11d7de00242698a2a601d
Binary file
data.tar.gz.sig CHANGED
Binary file
@@ -1,8 +1,36 @@
+ ## Release v1.0.0 (2020-08-17)
+ Features:
+ - Remove `PageStore#pages_linking_to`, `PageStore#urls_linking_to`
+ - Remove `verbose` setting

- ## Anemone forked into Medusa (2014-12-13)
+ Changes:
+ - Add an examples section to the [README](https://github.com/brutuscat/medusa-crawler/blob/main/README.md) file
+ - Update the [CONTRIBUTORS](https://github.com/brutuscat/medusa-crawler/blob/main/CONTRIBUTORS.mdd) file
+ - Update the [CHANGELOG](https://github.com/brutuscat/medusa-crawler/blob/main/CHANGELOG.md) file

+ ## Pre-release v1.0.0.pre.2
  Features:
+ - Remove CLI bins
+ - Remove `PageStore#shortest_paths!`
+
+ Fixes:
+ - Fix the skip-link regex filter to consider the full URI [#1](https://github.com/brutuscat/medusa-crawler/issues/1)
+
+ ## Pre-release v1.0.0.pre.1
+ Features:
+ - Switch to use `Moneta` instead of custom storage provider adapters
+
+ Fixes:
+ - Fix the link skip regex to include the full URI [#1](https://github.com/brutuscat/medusa-crawler/issues/1)

+ Dev:
+ - Use webmock gem for testing
+
+ Changes:
+ - Rename Medusa to medusa-crawler gem
+
+ ## Anemone forked into Medusa (2014-12-13)
+ Features:
  - Switch to use `OpenURI` instead of `net/http`, gaining out of the box support for:
  - Http basic auth options
  - Proxy configuration options
@@ -11,10 +39,9 @@ Features:
  - Ability to control the RETRY_LIMIT upon connection errors

  Changes:
-
  - Renamed Anemone to Medusa
- - Revamped the [README](https://github.com/brutuscat/medusa/blob/master/README.md) file
- - Revamped the [CHANGELOG](https://github.com/brutuscat/medusa/blob/master/CHANGELOG.md) file
- - Revamped the [CONTRIBUTORS](https://github.com/brutuscat/medusa/blob/master/CONTRIBUTORS.mdd) file
+ - Revamped the [README](https://github.com/brutuscat/medusa-crawler/blob/main/README.md) file
+ - Revamped the [CHANGELOG](https://github.com/brutuscat/medusa-crawler/blob/main/CHANGELOG.md) file
+ - Revamped the [CONTRIBUTORS](https://github.com/brutuscat/medusa-crawler/blob/main/CONTRIBUTORS.mdd) file

- > Refer to the [Anemone changelog](https://github.com/chriskite/anemone/blob/next/CHANGELOG.rdoc) for a travel in time.
+ > Refer to the [Anemone changelog](https://github.com/chriskite/anemone/blob/next/CHANGELOG.rdoc) for the earlier history.
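The pre-release notes above mention switching to `Moneta` for page storage. As a minimal sketch of what that change means for callers, reusing only the `crawler.storage`, `clear_on_startup`, and `Medusa::Storage.Moneta` calls that appear in the README example later in this diff (the Redis URL is a placeholder):

    require 'medusa'

    Medusa.crawl('https://www.example.com', depth_limit: 2) do |crawler|
      # Keep crawled pages between runs in a Moneta-backed Redis store
      # instead of the default in-memory store.
      crawler.clear_on_startup = false
      crawler.storage = Medusa::Storage.Moneta(:Redis, 'redis://localhost:6379/0')
    end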
@@ -1,6 +1,6 @@
  # Contributors

- Many thanks to the following folks who have contributed code to Medusa (a fork of Anemone).
+ Many thanks to the following people who have contributed code to Medusa (a fork of Anemone).

  In no particular order:

@@ -1,4 +1,4 @@
- == Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://badge.fury.io/rb/medusa-crawler] rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push
+ == Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://rubygems.org/gems/medusa-crawler] rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push

  Medusa is a framework for the ruby language to crawl and collect useful information about the pages
  it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
@@ -9,7 +9,6 @@ it visits. It is versatile, allowing you to write your own specialized tasks qui
  * Multi-threaded design for high performance
  * Tracks +301+ HTTP redirects
  * Allows exclusion of URLs based on regular expressions
- * HTTPS support
  * Records response time for each page
  * Obey _robots.txt_ directives (optional, but recommended)
  * In-memory or persistent storage of pages during crawl, provided by Moneta[https://github.com/moneta-rb/moneta]
@@ -17,6 +16,37 @@ it visits. It is versatile, allowing you to write your own specialized tasks qui

  <b>Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]</b>

+ === Examples
+
+ Medusa is versatile and meant to be used programmatically. You can start a crawl with one or multiple URIs:
+
+   require 'medusa'
+
+   Medusa.crawl('https://www.example.com', depth_limit: 2)
+
+ Or you can pass a block, and it will yield the crawler back so you can manage its configuration or drive its crawling focus:
+
+   require 'medusa'
+
+   Medusa.crawl('https://www.example.com', depth_limit: 2) do |crawler|
+     crawler.discard_page_bodies = some_flag
+
+     # Persist all of the pages' state across crawl runs.
+     crawler.clear_on_startup = false
+     crawler.storage = Medusa::Storage.Moneta(:Redis, 'redis://redis.host.name:6379/0')
+
+     crawler.skip_links_like(/private/)
+
+     crawler.on_pages_like(/public/) do |page|
+       logger.debug "[public page] #{page.url} took #{page.response_time} found #{page.links.count}"
+     end
+
+     # Use arbitrary logic, page by page, to customize how the crawl continues.
+     crawler.focus_crawl(/public/) do |page|
+       page.links.first
+     end
+   end
+
  ---

  === Requirements
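For what a crawl returns: the spec further down asserts that `Medusa.crawl` returns a `Medusa::Core` whose `pages` is a `Medusa::PageStore`, and the page_store hunk shows that store responding to `each_value`. A small sketch of walking the collected pages after a crawl, using only accessors that appear elsewhere in this diff:

    require 'medusa'

    core = Medusa.crawl('https://www.example.com', depth_limit: 1)

    # PageStore exposes each_value (see the page_store hunk below).
    core.pages.each_value do |page|
      puts "#{page.url} took #{page.response_time} and has #{page.links.count} links"
    end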
data/Rakefile CHANGED
@@ -7,11 +7,6 @@ RSpec::Core::RakeTask.new(:rspec) do |spec|
  spec.pattern = 'spec/**/*_spec.rb'
  end

- RSpec::Core::RakeTask.new(:rcov) do |spec|
-   spec.pattern = 'spec/**/*_spec.rb'
-   spec.rcov = true
- end
-
  task :default => :rspec

  Rake::RDocTask.new(:rdoc) do |rdoc|
data/VERSION CHANGED
@@ -1 +1 @@
- 1.0.0.pre.2
+ 1.0.0
@@ -26,8 +26,6 @@ module Medusa
  DEFAULT_OPTS = {
    # run 4 Tentacle threads to fetch pages
    :threads => 4,
-   # disable verbose output
-   :verbose => false,
    # don't throw away the page response body after scanning it for links
    :discard_page_bodies => false,
    # identify self as Medusa/VERSION
@@ -40,7 +38,7 @@ module Medusa
    :depth_limit => false,
    # number of times HTTP redirects will be followed
    :redirect_limit => 5,
-   # storage engine defaults to Hash in +process_options+ if none specified
+   # storage engine defaults to an in-memory store in +process_options+ if none specified
    :storage => nil,
    # cleanups of the storage on every startup of the crawler
    :clear_on_startup => true,
@@ -148,6 +146,7 @@ module Medusa
  # Perform the crawl
  #
  def run
+
    process_options

    @urls.delete_if { |url| !visit_link?(url) }
@@ -165,7 +164,6 @@ module Medusa
  loop do
    page = page_queue.deq
    @pages.touch_key page.url
-   puts "#{page.url} Queue: #{link_queue.size}" if @opts[:verbose]
    do_page_blocks page
    page.discard_doc! if @opts[:discard_page_bodies]

@@ -29,9 +29,9 @@ module Medusa
  # including redirects
  #
  def fetch_pages(url, referer = nil, depth = nil)
+   pages = []
    begin
      url = URI(url) unless url.is_a?(URI)
-     pages = []
      get(url, referer) do |response, headers, code, location, redirect_to, response_time|
        pages << Page.new(location, :body => response,
                          :headers => headers,
@@ -43,13 +43,8 @@ module Medusa
      end

      return pages
-   rescue Exception => e
-     if verbose?
-       puts e.inspect
-       puts e.backtrace
-     end
-     pages ||= []
-     return pages << Page.new(url, :error => e)
+   rescue StandardError => e
+     return pages << Page.new(url, error: e)
    end
  end

@@ -180,18 +175,13 @@ module Medusa

  rescue Timeout::Error, EOFError, Errno::ECONNREFUSED, Errno::ETIMEDOUT, Errno::ECONNRESET => e
    retries += 1
-   puts "[medusa] Retrying ##{retries} on url #{url} because of: #{e.inspect}" if verbose?
    sleep(3 ^ retries)
    retry unless retries > RETRY_LIMIT
  ensure
-   resource.close if !resource.nil? && !resource.closed?
+   resource&.close unless resource&.closed?
  end
  end

- def verbose?
-   @opts[:verbose]
- end
-
  #
  # Allowed to connect to the requested url?
  #
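With the rescue narrowed from `Exception` to `StandardError`, a failed fetch still returns a `Page` built with `error: e` instead of printing to stdout. A hedged sketch of surfacing those pages after a crawl; it assumes `Page` exposes the error it was constructed with and that errored pages end up in the PageStore, neither of which is shown in this diff:

    require 'medusa'

    core = Medusa.crawl('https://www.example.com')

    core.pages.each_value do |page|
      # `page.error` is an assumption; the diff only shows Page.new(url, error: e).
      warn "fetch failed for #{page.url}: #{page.error.inspect}" if page.error
    end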
@@ -65,58 +65,5 @@ module Medusa
  each_value { |page| delete page.url if page.redirect? }
  self
  end
-
- #
- # If given a single URL (as a String or URI), returns an Array of Pages which link to that URL
- # If given an Array of URLs, returns a Hash (URI => [Page, Page...]) of Pages linking to those URLs
- #
- def pages_linking_to(urls)
-   unless urls.is_a?(Array)
-     urls = [urls]
-     single = true
-   end
-
-   urls.map! do |url|
-     unless url.is_a?(URI)
-       URI(url) rescue nil
-     else
-       url
-     end
-   end
-   urls.compact
-
-   links = {}
-   urls.each { |url| links[url] = [] }
-   values.each do |page|
-     urls.each { |url| links[url] << page if page.links.include?(url) }
-   end
-
-   if single and !links.empty?
-     return links[urls.first]
-   else
-     return links
-   end
- end
-
- #
- # If given a single URL (as a String or URI), returns an Array of URLs which link to that URL
- # If given an Array of URLs, returns a Hash (URI => [URI, URI...]) of URLs linking to those URLs
- #
- def urls_linking_to(urls)
-   unless urls.is_a?(Array)
-     urls = [urls] unless urls.is_a?(Array)
-     single = true
-   end
-
-   links = pages_linking_to(urls)
-   links.each { |url, pages| links[url] = pages.map{|p| p.url} }
-
-   if single and !links.empty?
-     return links[urls.first]
-   else
-     return links
-   end
- end
-
  end
  end
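Since `pages_linking_to` and `urls_linking_to` are removed in 1.0.0, callers that still need a backlink lookup can rebuild it on top of the remaining store. A rough sketch for the single-URL case, mirroring the removed implementation (like that code, it assumes the store exposes `values` and that `Page#links` contains URIs); `backlinks_to` is a hypothetical helper, not part of the gem:

    require 'uri'
    require 'medusa'

    # Return the Pages in a PageStore whose links include the given URL.
    def backlinks_to(page_store, url)
      url = URI(url) unless url.is_a?(URI)
      page_store.values.select { |page| page.links.include?(url) }
    end

    core = Medusa.crawl('https://www.example.com', depth_limit: 1)
    backlinks_to(core.pages, 'https://www.example.com/about').each { |page| puts page.url }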
@@ -18,7 +18,6 @@ module Medusa
  def [](key)
    @adap[key]
  rescue
-   puts key
    raise RetrievalError
  end

@@ -1,3 +1,3 @@
  module Medusa
-   VERSION = '1.0.0.pre.2'
+   VERSION = '1.0.0'
  end
@@ -29,6 +29,8 @@ module Medusa
  @base = options[:base] if options.has_key?(:base)
  @content_type = options[:content_type] || "text/html"
  @body = options[:body]
+ @status = options[:status] || [200, 'OK']
+ @exception = options[:exception]

  create_body unless @body
  add_to_fakeweb
@@ -56,7 +58,7 @@ module Medusa
  end

  def add_to_fakeweb
-   options = {body: @body, status: [200, 'OK'], headers: {'Content-Type' => @content_type}}
+   options = {body: @body, status: @status, headers: {'Content-Type' => @content_type}}

    if @redirect
      options[:status] = [301, 'Moved Permanently']
@@ -66,7 +68,7 @@ module Medusa
    options[:headers]['Location'] = redirect_url

    # register the page this one redirects to
-   WebMock.stub_request(:get, redirect_url).to_return(body: '', status: [200, 'OK'], headers: {'Content-Type' => @content_type})
+   WebMock.stub_request(:get, redirect_url).to_return(body: '', status: @status, headers: {'Content-Type' => @content_type})
  end

  if @auth
@@ -75,11 +77,14 @@ module Medusa
    WebMock.stub_request(:get, url).to_return(unautorized_options)
    WebMock.stub_request(:get, url).with(basic_auth: AUTH).to_return(options)
  else
-   WebMock.stub_request(:get, url).to_return(options)
+   WebMock.stub_request(:get, url).tap do |req|
+     if @exception
+       req.to_raise(@exception)
+     else
+       req.to_return(options)
+     end
+   end
  end
  end
  end
  end
-
- #default root
- Medusa::FakePage.new
@@ -1,3 +1,4 @@
+ require 'fakeweb_helper'

  RSpec.describe Medusa do

@@ -6,9 +7,9 @@ RSpec.describe Medusa do
  end

  it "should return a Medusa::Core from the crawl, which has a PageStore" do
+   Medusa::FakePage.new
    result = Medusa.crawl(SPEC_DOMAIN)
    expect(result).to be_an_instance_of(Medusa::Core)
    expect(result.pages).to be_an_instance_of(Medusa::PageStore)
  end
-
  end
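The spec helper now takes `:status` and `:exception` options and, when an exception is given, stubs the request with WebMock's `to_raise`. A hedged sketch of a spec using them; the positional page-name argument and the SPEC_DOMAIN-based URL layout are assumptions carried over from the Anemone-style helper, while `exception:` and `status:` are the options this diff adds:

    RSpec.describe 'error handling' do
      it 'still returns a Medusa::Core when a stubbed request raises' do
        Medusa::FakePage.new('boom', exception: Errno::ECONNRESET)    # stubbed to raise
        Medusa::FakePage.new('teapot', status: [418, "I'm a teapot"]) # stubbed status

        result = Medusa.crawl(SPEC_DOMAIN)
        expect(result).to be_an_instance_of(Medusa::Core)
      end
    end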
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: medusa-crawler
  version: !ruby/object:Gem::Version
- version: 1.0.0.pre.2
+ version: 1.0.0
  platform: ruby
  authors:
  - Mauro Asprea
@@ -35,7 +35,7 @@ cert_chain:
  g4G6EZGbKCMwJDC0Wtmrygr7+THZVQlBs0ljTdrN8GXsuI9W52VlZctZQXEuoboH
  mpXw1d3WewNciml1VaOG782DKqZvT0i19V5LnZzoGzmU2q3ZJw7jCw==
  -----END CERTIFICATE-----
- date: 2020-08-14 00:00:00.000000000 Z
+ date: 2020-08-17 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: moneta
@@ -98,7 +98,7 @@ dependencies:
  - !ruby/object:Gem::Version
  version: 1.0.0
  description: |+
- == Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://badge.fury.io/rb/medusa-crawler] rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push
+ == Medusa: a ruby crawler framework {rdoc-image:https://badge.fury.io/rb/medusa-crawler.svg}[https://rubygems.org/gems/medusa-crawler] rdoc-image:https://github.com/brutuscat/medusa-crawler/workflows/Ruby/badge.svg?event=push

  Medusa is a framework for the ruby language to crawl and collect useful information about the pages
  it visits. It is versatile, allowing you to write your own specialized tasks quickly and easily.
@@ -109,7 +109,6 @@ description: |+
  * Multi-threaded design for high performance
  * Tracks +301+ HTTP redirects
  * Allows exclusion of URLs based on regular expressions
- * HTTPS support
  * Records response time for each page
  * Obey _robots.txt_ directives (optional, but recommended)
  * In-memory or persistent storage of pages during crawl, provided by Moneta[https://github.com/moneta-rb/moneta]
@@ -117,6 +116,37 @@ description: |+

  <b>Do you have an idea or a suggestion? {Open an issue and talk about it}[https://github.com/brutuscat/medusa-crawler/issues/new]</b>

+ === Examples
+
+ Medusa is versatile and meant to be used programmatically. You can start a crawl with one or multiple URIs:
+
+   require 'medusa'
+
+   Medusa.crawl('https://www.example.com', depth_limit: 2)
+
+ Or you can pass a block, and it will yield the crawler back so you can manage its configuration or drive its crawling focus:
+
+   require 'medusa'
+
+   Medusa.crawl('https://www.example.com', depth_limit: 2) do |crawler|
+     crawler.discard_page_bodies = some_flag
+
+     # Persist all of the pages' state across crawl runs.
+     crawler.clear_on_startup = false
+     crawler.storage = Medusa::Storage.Moneta(:Redis, 'redis://redis.host.name:6379/0')
+
+     crawler.skip_links_like(/private/)
+
+     crawler.on_pages_like(/public/) do |page|
+       logger.debug "[public page] #{page.url} took #{page.response_time} found #{page.links.count}"
+     end
+
+     # Use arbitrary logic, page by page, to customize how the crawl continues.
+     crawler.focus_crawl(/public/) do |page|
+       page.links.first
+     end
+   end
+
  email: mauroasprea@gmail.com
  executables: []
  extensions: []
@@ -151,7 +181,7 @@ licenses:
  - MIT
  metadata:
  bug_tracker_uri: https://github.com/brutuscat/medusa-crawler/issues
- source_code_uri: https://github.com/brutuscat/medusa-crawler/tree/v1.0.0.pre.2
+ source_code_uri: https://github.com/brutuscat/medusa-crawler/tree/v1.0.0
  description_markup_format: rdoc
  post_install_message:
  rdoc_options:
@@ -168,11 +198,11 @@ required_ruby_version: !ruby/object:Gem::Requirement
  version: 2.3.0
  required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
- - - ">"
+ - - ">="
  - !ruby/object:Gem::Version
- version: 1.3.1
+ version: '0'
  requirements: []
- rubygems_version: 3.1.2
+ rubygems_version: 3.1.4
  signing_key:
  specification_version: 4
  summary: Medusa is a ruby crawler framework
metadata.gz.sig CHANGED
Binary file