wgit 0.0.14 → 0.0.15
- checksums.yaml +4 -4
- data/README.md +14 -13
- data/lib/wgit/crawler.rb +29 -9
- data/lib/wgit/document.rb +1 -1
- data/lib/wgit/url.rb +13 -5
- data/lib/wgit/version.rb +1 -1
- metadata +15 -3
- data/LICENSE.txt +0 -21
- data/TODO.txt +0 -34
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: b83cdb3dd0deec7e706c77a8302eb9a5d52d9ec930fbfab43dc2790334521160
+  data.tar.gz: a1ac1ae6d151a64db95aaee386457fddfce2b76f3d0b824b360926e689ceb64a
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 7b3831ba9f3d60810507ebf49ec19e2ca533da188af84657552371823b1924c2357bdf2581a34e3526f4c160aff6dc876e2d5eba927ff793b5b4a0d6d9ff4221
+  data.tar.gz: 2e6b4fc6f36f97f6d400488f52df69323666186d03fab596a8f1895cdd76b1a9d7df573fdc1336c4df759259a1aa309f9e55e5488c2118fb9fb8c0ab307c5dc7
data/README.md
CHANGED
@@ -61,7 +61,7 @@ Wgit::Document.instance_methods(false).sort # => [
 # :external_urls, :html, :internal_full_links, :internal_links,
 # :internal_links_without_anchors, :keywords, :links, :relative_full_links,
 # :relative_full_urls, :relative_links, :relative_urls, :score, :search,
-# :search!, :size, :stats, :text, :title, :to_h, :
+# :search!, :size, :stats, :text, :title, :to_h, :to_json, :url,
 # :xpath
 #]
 
@@ -315,8 +315,9 @@ tables.first.class # => Nokogiri::XML::Element
 
 Below are some points to keep in mind when using Wgit:
 
-- All absolute `Wgit::Url`'s must be prefixed with an appropiate protocol e.g. `https://`
+- All absolute `Wgit::Url`'s must be prefixed with an appropiate protocol e.g. `https://` etc.
 - By default, up to 5 URL redirects will be followed; this is configurable however.
+- IRI's (URL's containing non ASCII characters) are supported and will be normalised/escaped prior to being crawled.
 
 ## Executable
 
@@ -328,24 +329,24 @@ This executable will be very similar in nature to `./bin/console` which is curre
 
 ## Change Log
 
-See the [CHANGELOG.md](https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md) for differences between
+See the [CHANGELOG.md](https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md) for differences (including any breaking changes) between releases of Wgit.
 
-##
+## License
 
-The
+The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
 
-
+## Contributing
 
-
+Bug reports and feature requests are welcome on [GitHub](https://github.com/michaeltelford/wgit/issues). Just raise an issue, checking it doesn't already exist.
 
-
+The current road map is rudimentally listed in the [TODO.txt](https://github.com/michaeltelford/wgit/blob/master/TODO.txt) file. Maybe your feature request is already there?
 
-
+## Development
 
-
+For a full list of available Rake tasks, run `bundle exec rake help`. The most commonly used tasks are listed below...
 
-
+After checking out the repo, run `./bin/setup` to install dependencies (requires `bundler`). Then, run `bundle exec rake test` to run the tests. You can also run `./bin/console` for an interactive (`pry`) REPL that will allow you to experiment with the code.
 
-
+To generate code documentation run `bundle exec yard doc`. To browse the generated documentation run `bundle exec yard server -r`.
 
-
+To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, see the *Gem Publishing Checklist* section of the `TODO.txt` file.
data/lib/wgit/crawler.rb
CHANGED
@@ -91,13 +91,15 @@ module Wgit
     # Crawl the url and return the response document or nil.
     #
     # @param url [Wgit::Document] The URL to crawl.
+    # @param follow_external_redirects [Boolean] Whether or not to follow
+    # external redirects. False will return nil for such a crawl.
     # @yield [Wgit::Document] The crawled HTML Document regardless if the
     # crawl was successful or not. Therefore, the Document#url can be used.
     # @return [Wgit::Document, nil] The crawled HTML Document or nil if the
     # crawl was unsuccessful.
-    def crawl_url(url = @urls.first)
+    def crawl_url(url = @urls.first, follow_external_redirects: true)
       assert_type(url, Wgit::Url)
-      markup = fetch(url)
+      markup = fetch(url, follow_external_redirects: follow_external_redirects)
       url.crawled = true
       doc = Wgit::Document.new(url, markup)
       yield(doc) if block_given?
@@ -116,7 +118,7 @@ module Wgit
     def crawl_site(base_url = @urls.first, &block)
       assert_type(base_url, Wgit::Url)
 
-      doc = crawl_url(base_url, &block)
+      doc = crawl_url(base_url, follow_external_redirects: false, &block)
       return nil if doc.nil?
 
       path = base_url.path.nil? ? '/' : base_url.path
@@ -133,9 +135,15 @@ module Wgit
         break if links.empty?
 
         links.each do |link|
-          doc = crawl_url(
+          doc = crawl_url(
+            Wgit::Url.concat(base_url.to_base, link),
+            follow_external_redirects: false,
+            &block
+          )
+
           crawled_urls << link
           next if doc.nil?
+
           internal_urls.concat(get_internal_links(doc))
           external_urls.concat(doc.external_links)
         end
@@ -160,8 +168,8 @@ module Wgit
     # The fetch method performs a HTTP GET to obtain the HTML document.
     # Invalid urls or any HTTP response that doesn't return a HTML body will be
     # ignored and nil will be returned. Otherwise, the HTML is returned.
-    def fetch(url)
-      response = resolve(url)
+    def fetch(url, follow_external_redirects: true)
+      response = resolve(url, follow_external_redirects: follow_external_redirects)
       @last_response = response
       response.body.empty? ? nil : response.body
     rescue Exception => ex
@@ -176,18 +184,30 @@ module Wgit
     # A certain amount of redirects will be followed by default before raising
     # an exception. Redirects can be disabled by setting `redirect_limit: 0`.
     # The Net::HTTPResponse will be returned.
-    def resolve(
+    def resolve(
+      url,
+      redirect_limit: Wgit::Crawler.default_redirect_limit,
+      follow_external_redirects: true
+    )
+      raise 'url must respond to :normalise' unless url.respond_to?(:normalise)
       redirect_count = -1
+
       begin
-        raise
+        raise 'Too many redirects' if redirect_count >= redirect_limit
         redirect_count += 1
 
-        response = Net::HTTP.get_response(
+        response = Net::HTTP.get_response(url.normalise.to_uri)
         location = Wgit::Url.new(response.fetch('location', ''))
+
         if not location.empty?
+          if !follow_external_redirects and !location.is_relative?
+            raise 'External redirect encountered but not allowed'
+          end
+
           url = location.is_relative? ? url.to_base.concat(location) : location
         end
       end while response.is_a?(Net::HTTPRedirection)
+
       response
     end
 
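The redirect policy added to `resolve` above can be sketched without any network access. The block below is an illustration only, not Wgit's code: `resolve_sketch` and the `FAKE_RESPONSES` table (with its invented hosts and paths) are hypothetical names, and the standard library's `URI` stands in for `Wgit::Url`. As in the diff, any absolute `Location` header is treated as external, and a relative one is resolved against the current URL.

```ruby
require 'uri'

# Hypothetical in-memory "web": maps a URL to [status, location-or-nil].
# The hosts and paths are invented for illustration only.
FAKE_RESPONSES = {
  'http://example.com/a'   => [301, '/b'],                     # relative redirect
  'http://example.com/b'   => [200, nil],
  'http://example.com/out' => [302, 'http://other.test/home'], # absolute redirect
  'http://other.test/home' => [200, nil]
}.freeze

# Follows redirects starting at url and returns the final URL.
# Raises once redirect_limit is exhausted or, when follow_external_redirects
# is false, as soon as an absolute Location is encountered.
def resolve_sketch(url, redirect_limit: 5, follow_external_redirects: true)
  redirect_count = -1

  loop do
    raise 'Too many redirects' if redirect_count >= redirect_limit
    redirect_count += 1

    status, location = FAKE_RESPONSES.fetch(url)
    return url unless (300..399).cover?(status)

    # A Location without a host is relative to the current URL.
    relative = URI(location).host.nil?
    if !follow_external_redirects && !relative
      raise 'External redirect encountered but not allowed'
    end

    url = relative ? URI.join(url, location).to_s : location
  end
end
```

With `follow_external_redirects: false`, the sketch raises on `http://example.com/out`; in the diff, `fetch` has a `rescue` clause around `resolve`, which is presumably how such a crawl ends up returning nil as the new `@param` comment describes.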
data/lib/wgit/document.rb
CHANGED
@@ -364,6 +364,7 @@ module Wgit
     # xpath value to be derived on Document initialisation (instead of when
     # the extension is defined). The call method must return a valid xpath
     # String.
+    # @param options [Hash] The options to define an extension with.
     # @option options [Boolean] :singleton The singleton option determines
     # whether or not the result(s) should be in an Array. If multiple
     # results are found and singleton is true then the first result will be
@@ -501,7 +502,6 @@ module Wgit
       end
     end
 
-    alias :to_hash :to_h
     alias :relative_links :internal_links
     alias :relative_urls :internal_links
     alias :relative_full_links :internal_full_links
data/lib/wgit/url.rb
CHANGED
@@ -1,13 +1,14 @@
 require_relative 'utils'
 require_relative 'assertable'
 require 'uri'
+require 'addressable/uri'
 
 module Wgit
 
   # Class modeling a web based URL.
   # Can be an internal/relative link e.g. "about.html" or a full URL
-  # e.g. "http://www.google.co.uk". Is a subclass of String and uses
-  # internally.
+  # e.g. "http://www.google.co.uk". Is a subclass of String and uses
+  # 'addressable/uri' internally.
   class Url < String
     include Assertable
 
@@ -43,7 +44,7 @@ module Wgit
         date_crawled = obj["date_crawled"]
       end
 
-      @uri = URI(url)
+      @uri = Addressable::URI.parse(url)
       @crawled = crawled
       @date_crawled = date_crawled
 
@@ -169,9 +170,16 @@ module Wgit
       @date_crawled = bool ? Wgit::Utils.time_stamp : nil
     end
 
+    # Normalises/encodes self and returns a new Wgit::Url.
+    #
+    # @return [Wgit::Url] An encoded version of self.
+    def normalise
+      Wgit::Url.new(@uri.normalize.to_s)
+    end
+
     # Returns the @uri instance var of this URL.
     #
-    # @return [
+    # @return [Addressable::URI] The URI object of self.
     def to_uri
       @uri
     end
@@ -337,7 +345,6 @@ module Wgit
       Hash[h.to_a.insert(0, ["url", self])] # Insert url at position 0.
     end
 
-    alias :to_hash :to_h
     alias :uri :to_uri
     alias :url :to_url
     alias :scheme :to_scheme
@@ -358,5 +365,6 @@ module Wgit
     alias :is_relative? :relative_link?
     alias :is_internal? :relative_link?
     alias :crawled? :crawled
+    alias :normalize :normalise
   end
 end
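The new `normalise` method delegates to `Addressable::URI#normalize`, which is what lets IRIs be escaped before crawling. As a rough illustration of the escaping part only, the sketch below percent-encodes non-ASCII characters using plain Ruby. `escape_non_ascii` is a hypothetical helper, not part of Wgit or Addressable, and it does not attempt the other normalisation Addressable performs (such as handling non-ASCII host names).

```ruby
require 'uri'

# Percent-encodes every non-ASCII character in str as its UTF-8 bytes.
# ASCII characters, including reserved ones such as "/" and "?", are left
# untouched, so the overall URL structure is preserved.
def escape_non_ascii(str)
  str.gsub(/[^[:ascii:]]/) do |char|
    char.bytes.map { |byte| format('%%%02X', byte) }.join
  end
end

escaped = escape_non_ascii("https://example.com/caf\u00E9?q=1")
uri = URI(escaped) # The escaped form is ASCII only, so stdlib URI accepts it.
```

Here `escaped` contains `caf%C3%A9` in place of the accented path segment, while the scheme, host and query string are unchanged.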
data/lib/wgit/version.rb
CHANGED
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: wgit
 version: !ruby/object:Gem::Version
-  version: 0.0.
+  version: 0.0.15
 platform: ruby
 authors:
 - Michael Telford
@@ -142,6 +142,20 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '2.0'
+- !ruby/object:Gem::Dependency
+  name: addressable
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 2.6.0
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 2.6.0
 - !ruby/object:Gem::Dependency
   name: nokogiri
   requirement: !ruby/object:Gem::Requirement
@@ -196,9 +210,7 @@ files:
 - "./lib/wgit/url.rb"
 - "./lib/wgit/utils.rb"
 - "./lib/wgit/version.rb"
-- LICENSE.txt
 - README.md
-- TODO.txt
 homepage: https://github.com/michaeltelford/wgit
 licenses:
 - MIT
data/LICENSE.txt
DELETED
@@ -1,21 +0,0 @@
-The MIT License (MIT)
-
-Copyright (c) 2019 Michael Telford
-
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
-
-The above copyright notice and this permission notice shall be included in
-all copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
-THE SOFTWARE.
data/TODO.txt
DELETED
@@ -1,34 +0,0 @@
-
-Primary
--------
-- Add <base> support for link processing.
-- Update Database#search & Document#search to have optional case sensitivity.
-- Have the ability to crawl sub sections of a site only e.g. https://www.honda.co.uk/motorcycles.html as the base url and crawl any links containing this as a prefix. For example, https://www.honda.co.uk/cars.html would not be crawled but https://www.honda.co.uk/motorcycles/africa-twin.html would be.
-- Create an executable based on the ./bin/console shipped as `wpry` or `wgit`.
-
-Secondary
----------
-- Think about how we handle invalid url's on crawled documents. Setup tests and implement logic for this scenario.
-- Think about ignoring non html documents/urls e.g. http://server/image.jpg etc. by implementing MIME types (defaulting to only HTML).
-- Check if Document::TEXT_ELEMENTS is expansive enough.
-- Possibly use refine instead of core-ext?
-- Think about potentially using DB._update's update_many func.
-
-Refactoring
------------
-- Refactor the 3 main classes and their tests (where needed): Url, Document & Crawler.
-- Think about reducing the amount of method aliases, pick the best for the method def and remove the aliases?
-- Replace method params with named parameters where applicable.
-
-Gem Publishing Checklist
-------------------------
-- Ensure a clean branch of master and create a 'release' branch.
-- Update standalone files (if necessary): README.md, TODO.txt, wgit.gemspec etc.
-- Increment the version number (in version.rb) and update the CHANGELOG.md.
-- Run 'bundle install' to update deps.
-- Run 'bundle exec rake compile' and ensure acceptable warnings/errors.
-- Run 'bundle exec rake test' and ensure all tests are passing.
-- Run `bundle exec rake install` to build and install the gem locally, then test it manually from outside this repo.
-- Run `bundle exec yard doc` to update documentation - should be very high percentage.
-- Commit, merge to master & push any changes made from the above steps.
-- Run `bundle exec rake RELEASE[origin]` to tag, build and push everything to github.com and rubygems.org.