grell 1.3.0

@@ -0,0 +1,7 @@
+ ---
+ SHA1:
+ metadata.gz: 85e9d00d79051e54ba1c5361e7f1edfd8abf8af6
+ data.tar.gz: 25b1f2db2ef87e61294158843fe6e396ed15b046
+ SHA512:
+ metadata.gz: b00755884c96ddf04e6954d5761903e28cbd46207f45ba5a97f1f1b727b70aad2361386a35e3c96ecb8b65089f1412ac4d7a1bdc8a18d70b87009198883205c3
+ data.tar.gz: 927714b4cd7e7b520d75755e993eff343cebd06a8c36993bcd81abb7453e1ace497e05766f0b8d1ae4bed2e7c00447a1ded267c20028d6d118f32d661c36685e
data/.rspec ADDED
@@ -0,0 +1,2 @@
+ --color
+ --require spec_helper
@@ -0,0 +1,32 @@
+ * Version 1.3
+ The Crawler object allows you to provide an external logger object.
+ Clearer semantics when an error happens: special headers are returned so the user can inspect the error.
+
+ Caveats:
+ - The 'debug' option in the crawler does not have any effect anymore. Provide an external logger with 'logger' instead.
+ - The errors provided in the headers by grell have changed from 'grell_status' to 'grellStatus'.
+ - The 'visited' property in the page was never supposed to be accessible. Use 'visited?' instead.
+
+ * Version 1.2.1
+ Solve bug: URLs are case insensitive
+
+ * Version 1.2
+ Grell will now consider two links to point to the same page only when the whole URL is exactly the same.
+ Previous versions considered two links to be the same when they shared the path.
+
+ * Version 1.1.2
+ Solve bug where we were adding links in the head as if they were normal links in the body
+
+ * Version 1.1.1
+ Solve bug with the new data-href functionality
+
+ * Version 1.1
+ Solve problem with a randomly failing spec
+ Search for elements with 'href' or 'data-href' to find links
+
+ * Version 1.0.1
+ Rescuing JavaScript errors
+
+ * Version 1.0
+ Initial implementation
+ Basic support for crawling pages.
data/Gemfile ADDED
@@ -0,0 +1,3 @@
+ source 'https://rubygems.org'
+
+ gemspec
@@ -0,0 +1,22 @@
+ Copyright (c) 2015 Medidata Solutions Worldwide
+
+ MIT License
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ "Software"), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,104 @@
+ # Grell
+
+ Grell is a generic crawler for the web written in Ruby.
+ It can be used to gather data, test pages in a given domain, etc.
+
+ ## Installation
+
+ Add this line to your application's Gemfile:
+
+ ```ruby
+ gem 'grell'
+ ```
+
+ And then execute:
+
+     $ bundle
+
+ Or install it yourself as:
+
+     $ gem install grell
+
+ Grell uses PhantomJS; you will need to download and install it on your
+ system. See http://phantomjs.org/ for instructions.
+ Grell has been tested with PhantomJS v1.9.x
+
+ ## Usage
+
+
+ ### Crawling an entire site
+
+ The main entry point of the library is Grell::Crawler#start_crawling.
+ Grell will yield to your code with each page it finds:
+
+ ```ruby
+ require 'grell'
+
+ crawler = Grell::Crawler.new
+ crawler.start_crawling('http://www.google.com') do |page|
+   # Grell will keep yielding to this block with each unique page it finds
+   puts "yes we crawled #{page.url}"
+   puts "status: #{page.status}"
+   puts "headers: #{page.headers}"
+   puts "body: #{page.body}"
+   puts "We crawled it at #{page.timestamp}"
+   puts "We found #{page.links.size} links"
+   puts "page id and parent_id #{page.id}, #{page.parent_id}"
+ end
+
+ ```
+
+ Grell keeps a list of pages previously crawled and does not visit the same page twice.
+ This list is indexed by the complete URL, including query parameters.
+
+ ### Pages' id
+
+ Each page has a unique id, accessed by the property 'id'. Each page also stores the id of the page on which we found it, accessed by the property 'parent_id'.
+ The page object generated by accessing the first URL passed to start_crawling (the root) has a 'parent_id' equal to 'nil' and an 'id' equal to 0.
+ Using this information it is possible to construct a directed graph.
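+
+ For example, you can collect the pages from the block and build an adjacency list of ids (a minimal sketch; the URL is a placeholder and 'pages' is just an array we fill in ourselves):
+
+ ```ruby
+ require 'grell'
+
+ crawler = Grell::Crawler.new
+ pages = []
+ crawler.start_crawling('http://www.example.com') do |page|
+   pages << page
+ end
+
+ # Adjacency list: parent id => the pages discovered on that parent page
+ graph = pages.group_by(&:parent_id)
+ root_children = (graph[0] || []).map(&:id) # ids of the pages found on the root page (id 0)
+ ```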
+
+
+ ### Errors
+ When there is an error in the page or an internal error in the crawler (JavaScript crashed the browser, etc.), Grell will return status 404 and the headers will contain the following keys, as shown in the example below:
+ - grellStatus: 'Error'
+ - errorClass: The class of the error which broke this page.
+ - errorMessage: A descriptive message with the information Grell could gather about the error.
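+
+ For example, you can single out errored pages inside the crawling block (a short sketch using the keys above):
+
+ ```ruby
+ crawler.start_crawling('http://www.example.com') do |page|
+   if page.headers[:grellStatus] == 'Error'
+     puts "#{page.url} failed: #{page.headers[:errorClass]}: #{page.headers[:errorMessage]}"
+   end
+ end
+ ```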
+
+
+ ## Tests
+
+ Run the tests with:
+ ```
+ bundle exec rake ci
+ ```
+
+ ## Contributors
+ Grell is (c) Medidata Solutions Worldwide and owned by its major contributors:
+ * [Teruhide Hoshikawa](https://github.com/thoshikawa-mdsol)
+ * [Jordi Polo Carres](https://github.com/jcarres-mdsol)
@@ -0,0 +1,2 @@
+ require 'kender/tasks'
+
@@ -0,0 +1,30 @@
+ # coding: utf-8
+ lib = File.expand_path('../lib', __FILE__)
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
+ require 'grell/version'
+
+ Gem::Specification.new do |spec|
+   spec.name          = "grell"
+   spec.version       = Grell::VERSION
+   spec.authors       = ["Jordi Polo Carres"]
+   spec.email         = ["jcarres@mdsol.com"]
+   spec.summary       = %q{Ruby web crawler}
+   spec.description   = %q{Ruby web crawler using PhantomJS}
+   spec.homepage      = "https://github.com/mdsol/grell"
+
+   spec.files         = `git ls-files -z`.split("\x0")
+   spec.executables   = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
+   spec.test_files    = spec.files.grep(%r{^(test|spec|features)/})
+   spec.require_paths = ["lib"]
+
+   spec.add_dependency 'capybara', '~> 2.2'
+   spec.add_dependency 'poltergeist', '~> 1.5'
+
+   spec.add_development_dependency "bundler", "~> 1.6"
+   spec.add_development_dependency "byebug", "~> 4.0"
+   spec.add_development_dependency "kender"
+   spec.add_development_dependency "rake"
+   spec.add_development_dependency "webmock", '~> 1.18'
+   spec.add_development_dependency 'rspec', '~> 3.0'
+   spec.add_development_dependency 'puffing-billy', '~> 0.5'
+ end
@@ -0,0 +1,10 @@
+ require 'capybara/poltergeist'
+ require 'capybara/dsl'
+
+ require 'grell/grell_logger'
+ require 'grell/capybara_driver'
+ require 'grell/crawler'
+ require 'grell/rawpage'
+ require 'grell/page'
+ require 'grell/page_collection'
+ require 'grell/reader'
@@ -0,0 +1,34 @@
+
+ module Grell
+
+   # The driver for Capybara. It uses Poltergeist to control PhantomJS
+   class CapybaraDriver
+     include Capybara::DSL
+
+     USER_AGENT = "Mozilla/5.0 (Grell Crawler)"
+
+     def self.setup(options)
+       new.setup_capybara unless options[:external_driver]
+     end
+
+     def setup_capybara
+       Capybara.register_driver :poltergeist_crawler do |app|
+         Capybara::Poltergeist::Driver.new(app, {
+           js_errors: false,
+           inspector: false,
+           phantomjs_logger: open('/dev/null'),
+           phantomjs_options: ['--debug=no', '--load-images=no', '--ignore-ssl-errors=yes', '--ssl-protocol=TLSv1']
+         })
+       end
+
+       Capybara.default_wait_time = 3
+       Capybara.run_server = false
+       Capybara.default_driver = :poltergeist_crawler
+       page.driver.headers = {
+         "DNT" => 1,
+         "User-Agent" => USER_AGENT
+       }
+     end
+   end
+
+ end
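A note on the `external_driver` option above: when the host application registers its own Capybara driver, Grell's setup can be skipped entirely. A minimal sketch of both modes (the `:my_driver` name is hypothetical, registered elsewhere by the host app):

```ruby
require 'grell'

# Default mode: Grell registers and selects its bundled :poltergeist_crawler driver
Grell::CapybaraDriver.setup({})

# External mode: the host app has already configured Capybara, so Grell leaves it alone
Capybara.default_driver = :my_driver # hypothetical driver registered by the host app
Grell::CapybaraDriver.setup(external_driver: true)
```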
@@ -0,0 +1,44 @@
+
+ module Grell
+
+   # This is the class that starts and controls the crawling
+   class Crawler
+     attr_reader :collection
+
+     def initialize(options = {})
+       CapybaraDriver.setup(options)
+
+       if options[:logger]
+         Grell.logger = options[:logger]
+       else
+         Grell.logger = Logger.new(STDOUT)
+       end
+
+       @collection = PageCollection.new
+     end
+
+
+     def start_crawling(url, &block)
+       Grell.logger.info "GRELL Started crawling"
+       @collection = PageCollection.new
+       @collection.create_page(url, nil)
+       until @collection.discovered_pages.empty?
+         crawl(@collection.next_page, block)
+       end
+       Grell.logger.info "GRELL finished crawling"
+     end
+
+     def crawl(site, block)
+       Grell.logger.info "Visiting #{site.url}, visited_links: #{@collection.visited_pages.size}, discovered #{@collection.discovered_pages.size}"
+       site.navigate
+
+       block.call(site) if block
+
+       site.links.each do |url|
+         @collection.create_page(url, site.id)
+       end
+     end
+
+   end
+
+ end
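Since `collection` is exposed as a reader, the results of a crawl can also be inspected after `start_crawling` returns, without collecting pages in the block. A usage sketch (the URL is a placeholder, and it assumes `PageCollection#visited_pages` returns the visited Page objects, as its use in the log line above suggests):

```ruby
require 'grell'

crawler = Grell::Crawler.new(logger: Logger.new('grell.log'))
crawler.start_crawling('http://www.example.com')

# Walk the pages the crawler visited, with their parent ids
crawler.collection.visited_pages.each do |page|
  puts "#{page.id} (parent: #{page.parent_id.inspect}) #{page.url} -> #{page.status}"
end
```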
@@ -0,0 +1,10 @@
+ require 'logger'
+
+ # Very simple global logger for our crawler.
+ module Grell
+   class << self
+     attr_accessor :logger
+   end
+ end
+
+ Grell.logger = Logger.new(STDOUT)
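Because the logger is a plain module-level accessor, it can also be replaced or tuned at any time, independently of the Crawler's 'logger' option; for example:

```ruby
require 'grell'

Grell.logger = Logger.new('crawl.log') # send crawler output to a file instead of STDOUT
Grell.logger.level = Logger::WARN      # report only warnings and errors
```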
@@ -0,0 +1,230 @@
+ require 'forwardable'
+
+ module Grell
+   # This class contains the logic related to working with each page we crawl. It is also the interface
+   # we use to access the information of each page.
+   # This information comes from the private result classes below.
+   class Page
+     extend Forwardable
+
+     WAIT_TIME = 10
+     WAIT_INTERVAL = 0.5
+
+     attr_reader :url, :timestamp, :id, :parent_id, :rawpage
+     # Most of the interesting information of this class is accessed through the methods below
+     def_delegators :@result_page, :headers, :body, :status, :links, :has_selector?, :host, :visited?
+
+     def initialize(url, id, parent_id)
+       @rawpage = RawPage.new
+       @url = url
+       @id = id
+       @parent_id = parent_id
+       @timestamp = nil
+       @result_page = UnvisitedPage.new
+     end
+
+     def navigate
+       # We wait a maximum of WAIT_TIME seconds to get an HTML page. We try our best to work around inconsistencies in Poltergeist
+       Reader.wait_for(-> { @rawpage.navigate(url) }, WAIT_TIME, WAIT_INTERVAL) do
+         @rawpage.status && !@rawpage.headers.empty? &&
+           @rawpage.headers["Content-Type"] && @rawpage.headers["Content-Type"].include?('text/html')
+       end
+       @result_page = VisitedPage.new(@rawpage)
+       @timestamp = Time.now
+     rescue Capybara::Poltergeist::JavascriptError => e
+       unavailable_page(404, e)
+     rescue Capybara::Poltergeist::BrowserError => e # This may happen internally in Poltergeist; they claim it is a bug.
+       unavailable_page(404, e)
+     rescue URI::InvalidURIError => e # An invalid URL means we report an error
+       unavailable_page(404, e)
+     rescue Capybara::Poltergeist::TimeoutError => e # Poltergeist has its own timeout, similar to Chrome's.
+       unavailable_page(404, e)
+     rescue Capybara::Poltergeist::StatusFailError => e
+       unavailable_page(404, e)
+     end
+
+     private
+     def unavailable_page(status, exception)
+       Grell.logger.warn "The page with the URL #{@url} was not available. Exception #{exception}"
+       @result_page = ErroredPage.new(status, exception)
+       @timestamp = Time.now
+     end
+
+     # Private class.
+     # This is a result page when it has not been visited yet. Essentially empty of information
+     #
+     class UnvisitedPage
+       def status
+         nil
+       end
+
+       def body
+         ''
+       end
+
+       def headers
+         { grellStatus: 'NotVisited' }
+       end
+
+       def links
+         []
+       end
+
+       def host
+         ''
+       end
+
+       def visited?
+         false
+       end
+
+       def has_selector?(selector)
+         false
+       end
+
+     end
+
+     # Private class.
+     # This is a result page when some error happened. It provides some information about the error.
+     #
+     class ErroredPage
+       def initialize(error_code, exception)
+         @error_code = error_code
+         @exception = exception
+       end
+
+       def status
+         @error_code
+       end
+
+       def body
+         ''
+       end
+
+       def headers
+         message = begin
+           @exception.message
+         rescue StandardError
+           "Error message cannot be accessed" # Poltergeist may try to access a nil object when building the message
+         end
+
+         {
+           grellStatus: 'Error',
+           errorClass: @exception.class.to_s,
+           errorMessage: message
+         }
+       end
+
+       def links
+         []
+       end
+
+       def host
+         ''
+       end
+
+       def visited?
+         true
+       end
+
+       def has_selector?(selector)
+         false
+       end
+
+     end
+
+
+     # Private class.
+     # This is a result page when we successfully got some information back after visiting the page.
+     # It delegates most of the information to the @rawpage Capybara page, but any transformation or logic is here
+     #
+     class VisitedPage
+       def initialize(rawpage)
+         @rawpage = rawpage
+       end
+
+       def status
+         @rawpage.status
+       end
+
+       def body
+         @rawpage.body
+       end
+
+       def headers
+         @rawpage.headers
+       rescue Capybara::Poltergeist::BrowserError => e # This may happen internally in Poltergeist; they claim it is a bug.
+         {
+           grellStatus: 'Error',
+           errorClass: e.class.to_s,
+           errorMessage: e.message
+         }
+       end
+
+       def links
+         @links ||= all_links
+       end
+
+       def host
+         @rawpage.host
+       end
+
+       def visited?
+         true
+       end
+
+       def has_selector?(selector)
+         @rawpage.has_selector?(selector)
+       end
+
+       private
+       def all_links
+         # <link> can only be used in the <head> as of: https://developer.mozilla.org/en/docs/Web/HTML/Element/link
+         anchors_in_body = @rawpage.all_anchors.reject { |anchor| anchor.tag_name == 'link' }
+
+         candidate_links = anchors_in_body.map do |anchor|
+           anchor['href'] || anchor['data-href']
+         end.compact
+
+         candidate_links.map { |link| link_to_url(link) }.uniq.compact
+
+       rescue Capybara::Poltergeist::ObsoleteNode
+         Grell.logger.warn "We found an obsolete node in #{host}. Ignoring all links"
+         # Sometimes JavaScript and timing may get in the way and we lose these links.
+         # TODO: Can we do something more intelligent here?
+         []
+       end
+
+       # We only accept links on this same host that start with a path.
+       # Everything else returns nil.
+       def link_to_url(link)
+         uri = URI.parse(link)
+         if uri.absolute?
+           if uri.host != URI.parse(host).host
+             Grell.logger.debug "GRELL does not follow links to external hosts: #{link}"
+             nil
+           else
+             link # Absolute link to our own host
+           end
+         else
+           if uri.path.nil?
+             Grell.logger.debug "GRELL does not follow links without a path: #{uri}"
+             nil
+           elsif uri.path.start_with?('/')
+             host + link # convert to a full URL
+           else # links like href="google.com"; the browser would navigate to "http://#{link}"
+             Grell.logger.debug "GRELL Badly formatted link: #{link}, assuming external"
+             nil
+           end
+         end
+
+       rescue URI::InvalidURIError # Invalid links propagate until we navigate to them
+         link
+       end
+     end
+
+
+
+   end
+
+ end
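The `Reader.wait_for` helper called in `Page#navigate` is required by the gem but its source is not part of this diff. From the call site alone, a minimal polling implementation consistent with it could look like this (an illustrative assumption, not the gem's actual code):

```ruby
module Grell
  # Runs the action once, then polls the given block every sleeping_time seconds
  # until it returns true or roughly max_waiting seconds have passed.
  class Reader
    def self.wait_for(action, max_waiting, sleeping_time)
      action.call
      return if yield
      (max_waiting / sleeping_time).ceil.times do
        sleep(sleeping_time)
        return if yield
      end
    end
  end
end
```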