wgit 0.0.14 → 0.0.15

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: b2f1d98c88dcd1bd2a12b463732b08373f6b5db4d1c75661530293b0ed129c47
-  data.tar.gz: ed5ff0f5aa5e2427909ced8f67c8189dd1f7aec259b76947cd4cd73a1f8783d3
+  metadata.gz: b83cdb3dd0deec7e706c77a8302eb9a5d52d9ec930fbfab43dc2790334521160
+  data.tar.gz: a1ac1ae6d151a64db95aaee386457fddfce2b76f3d0b824b360926e689ceb64a
 SHA512:
-  metadata.gz: 6270ceff57af1936e7b0ae1fb3748def307981b26dc604d2f190a9e751e0952f103b4a7c227e2cbbc2d00f26bc7c4b1f46f44fa423e13add7e8bd4eb9fe046f4
-  data.tar.gz: 96b0eecd2713cbf55f7433c145a541fdae34d56b9e3bedd1b91b6bb30b463d06b7f07de59457ad899fd32bd095bbe92ee01d94c04ef923924fbc4a110bbe98f2
+  metadata.gz: 7b3831ba9f3d60810507ebf49ec19e2ca533da188af84657552371823b1924c2357bdf2581a34e3526f4c160aff6dc876e2d5eba927ff793b5b4a0d6d9ff4221
+  data.tar.gz: 2e6b4fc6f36f97f6d400488f52df69323666186d03fab596a8f1895cdd76b1a9d7df573fdc1336c4df759259a1aa309f9e55e5488c2118fb9fb8c0ab307c5dc7
data/README.md CHANGED
@@ -61,7 +61,7 @@ Wgit::Document.instance_methods(false).sort # => [
 # :external_urls, :html, :internal_full_links, :internal_links,
 # :internal_links_without_anchors, :keywords, :links, :relative_full_links,
 # :relative_full_urls, :relative_links, :relative_urls, :score, :search,
-# :search!, :size, :stats, :text, :title, :to_h, :to_hash, :to_json, :url,
+# :search!, :size, :stats, :text, :title, :to_h, :to_json, :url,
 # :xpath
 #]
 
@@ -315,8 +315,9 @@ tables.first.class # => Nokogiri::XML::Element
 
 Below are some points to keep in mind when using Wgit:
 
-- All absolute `Wgit::Url`'s must be prefixed with an appropiate protocol e.g. `https://`
+- All absolute `Wgit::Url`'s must be prefixed with an appropiate protocol e.g. `https://` etc.
 - By default, up to 5 URL redirects will be followed; this is configurable however.
+- IRI's (URL's containing non ASCII characters) are supported and will be normalised/escaped prior to being crawled.
 
 ## Executable
 
@@ -328,24 +329,24 @@ This executable will be very similar in nature to `./bin/console` which is curre
 
 ## Change Log
 
-See the [CHANGELOG.md](https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md) for differences between versions of Wgit.
+See the [CHANGELOG.md](https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md) for differences (including any breaking changes) between releases of Wgit.
 
-## Development
+## License
 
-The current road map is rudimentally listed in the [TODO.txt](https://github.com/michaeltelford/wgit/blob/master/TODO.txt) file.
+The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
 
-For a full list of available Rake tasks, run `bundle exec rake help`. The most commonly used tasks are listed below...
+## Contributing
 
-After checking out the repo, run `./bin/setup` to install dependencies (requires `bundler`). Then, run `bundle exec rake test` to run the tests. You can also run `./bin/console` for an interactive REPL that will allow you to experiment with the code.
+Bug reports and feature requests are welcome on [GitHub](https://github.com/michaeltelford/wgit/issues). Just raise an issue, checking it doesn't already exist.
 
-To generate code documentation run `bundle exec yard doc`. To browse the generated documentation run `bundle exec yard server -r`.
+The current road map is rudimentally listed in the [TODO.txt](https://github.com/michaeltelford/wgit/blob/master/TODO.txt) file. Maybe your feature request is already there?
 
-To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, see the *Gem Publishing Checklist* section of the `TODO.txt` file.
+## Development
 
-## Contributing
+For a full list of available Rake tasks, run `bundle exec rake help`. The most commonly used tasks are listed below...
 
-Bug reports and pull requests are welcome on [GitHub](https://github.com/michaeltelford/wgit).
+After checking out the repo, run `./bin/setup` to install dependencies (requires `bundler`). Then, run `bundle exec rake test` to run the tests. You can also run `./bin/console` for an interactive (`pry`) REPL that will allow you to experiment with the code.
 
-## License
+To generate code documentation run `bundle exec yard doc`. To browse the generated documentation run `bundle exec yard server -r`.
 
-The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
+To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, see the *Gem Publishing Checklist* section of the `TODO.txt` file.
data/lib/wgit/crawler.rb CHANGED
@@ -91,13 +91,15 @@ module Wgit
     # Crawl the url and return the response document or nil.
     #
     # @param url [Wgit::Document] The URL to crawl.
+    # @param follow_external_redirects [Boolean] Whether or not to follow
+    # external redirects. False will return nil for such a crawl.
     # @yield [Wgit::Document] The crawled HTML Document regardless if the
     # crawl was successful or not. Therefore, the Document#url can be used.
     # @return [Wgit::Document, nil] The crawled HTML Document or nil if the
     # crawl was unsuccessful.
-    def crawl_url(url = @urls.first)
+    def crawl_url(url = @urls.first, follow_external_redirects: true)
       assert_type(url, Wgit::Url)
-      markup = fetch(url)
+      markup = fetch(url, follow_external_redirects: follow_external_redirects)
       url.crawled = true
       doc = Wgit::Document.new(url, markup)
       yield(doc) if block_given?
@@ -116,7 +118,7 @@ module Wgit
     def crawl_site(base_url = @urls.first, &block)
       assert_type(base_url, Wgit::Url)
 
-      doc = crawl_url(base_url, &block)
+      doc = crawl_url(base_url, follow_external_redirects: false, &block)
       return nil if doc.nil?
 
       path = base_url.path.nil? ? '/' : base_url.path
@@ -133,9 +135,15 @@ module Wgit
         break if links.empty?
 
         links.each do |link|
-          doc = crawl_url(Wgit::Url.concat(base_url.to_base, link), &block)
+          doc = crawl_url(
+            Wgit::Url.concat(base_url.to_base, link),
+            follow_external_redirects: false,
+            &block
+          )
+
           crawled_urls << link
           next if doc.nil?
+
           internal_urls.concat(get_internal_links(doc))
           external_urls.concat(doc.external_links)
         end
@@ -160,8 +168,8 @@ module Wgit
     # The fetch method performs a HTTP GET to obtain the HTML document.
     # Invalid urls or any HTTP response that doesn't return a HTML body will be
     # ignored and nil will be returned. Otherwise, the HTML is returned.
-    def fetch(url)
-      response = resolve(url)
+    def fetch(url, follow_external_redirects: true)
+      response = resolve(url, follow_external_redirects: follow_external_redirects)
       @last_response = response
       response.body.empty? ? nil : response.body
     rescue Exception => ex
@@ -176,18 +184,30 @@ module Wgit
     # A certain amount of redirects will be followed by default before raising
     # an exception. Redirects can be disabled by setting `redirect_limit: 0`.
     # The Net::HTTPResponse will be returned.
-    def resolve(url, redirect_limit: Wgit::Crawler.default_redirect_limit)
+    def resolve(
+      url,
+      redirect_limit: Wgit::Crawler.default_redirect_limit,
+      follow_external_redirects: true
+    )
+      raise 'url must respond to :normalise' unless url.respond_to?(:normalise)
       redirect_count = -1
+
       begin
-        raise "Too many redirects" if redirect_count >= redirect_limit
+        raise 'Too many redirects' if redirect_count >= redirect_limit
         redirect_count += 1
 
-        response = Net::HTTP.get_response(URI(url))
+        response = Net::HTTP.get_response(url.normalise.to_uri)
         location = Wgit::Url.new(response.fetch('location', ''))
+
         if not location.empty?
+          if !follow_external_redirects and !location.is_relative?
+            raise 'External redirect encountered but not allowed'
+          end
+
           url = location.is_relative? ? url.to_base.concat(location) : location
         end
       end while response.is_a?(Net::HTTPRedirection)
+
       response
     end
 
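
The `resolve` hunk above adds two guards to the redirect loop: the existing redirect limit, and a refusal to follow a non-relative `Location` when `follow_external_redirects` is false. A minimal, network-free sketch of that control flow is shown here. The `REDIRECTS` table, the example URLs, the standalone `resolve` function and the hard-coded limit of 5 are assumptions made for illustration and are not Wgit's API; stdlib `URI` stands in for `Wgit::Url`.

```ruby
require 'uri'

# Hypothetical stand-in for HTTP responses: maps a URL to the value of its
# 'Location' header. URLs absent from the table are treated as final pages.
REDIRECTS = {
  'http://example.com/a' => '/b',
  'http://example.com/b' => 'http://other.example.org/c'
}.freeze

# Follows redirects starting at url and returns the final URL.
# Raises if the limit is exceeded or a disallowed external redirect is met.
def resolve(url, redirect_limit: 5, follow_external_redirects: true)
  redirect_count = -1

  loop do
    raise 'Too many redirects' if redirect_count >= redirect_limit
    redirect_count += 1

    location = REDIRECTS[url]
    return url if location.nil? # No redirect, so this is the final URL.

    relative = URI(location).relative?
    if !follow_external_redirects && !relative
      raise 'External redirect encountered but not allowed'
    end

    # Relative locations are resolved against the current URL.
    url = relative ? URI.join(url, location).to_s : location
  end
end

puts resolve('http://example.com/a')

begin
  resolve('http://example.com/a', follow_external_redirects: false)
rescue RuntimeError => e
  puts "Refused: #{e.message}"
end
```

As the diff reads, any non-relative `Location` counts as external, even one pointing back at the same host; the sketch mirrors that behaviour rather than comparing hosts.
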
data/lib/wgit/document.rb CHANGED
@@ -364,6 +364,7 @@ module Wgit
     # xpath value to be derived on Document initialisation (instead of when
     # the extension is defined). The call method must return a valid xpath
     # String.
+    # @param options [Hash] The options to define an extension with.
     # @option options [Boolean] :singleton The singleton option determines
     # whether or not the result(s) should be in an Array. If multiple
     # results are found and singleton is true then the first result will be
@@ -501,7 +502,6 @@ module Wgit
       end
     end
 
-    alias :to_hash :to_h
     alias :relative_links :internal_links
     alias :relative_urls :internal_links
     alias :relative_full_links :internal_full_links
data/lib/wgit/url.rb CHANGED
@@ -1,13 +1,14 @@
 require_relative 'utils'
 require_relative 'assertable'
 require 'uri'
+require 'addressable/uri'
 
 module Wgit
 
   # Class modeling a web based URL.
   # Can be an internal/relative link e.g. "about.html" or a full URL
-  # e.g. "http://www.google.co.uk". Is a subclass of String and uses 'uri'
-  # internally.
+  # e.g. "http://www.google.co.uk". Is a subclass of String and uses
+  # 'addressable/uri' internally.
   class Url < String
     include Assertable
 
@@ -43,7 +44,7 @@ module Wgit
         date_crawled = obj["date_crawled"]
       end
 
-      @uri = URI(url)
+      @uri = Addressable::URI.parse(url)
       @crawled = crawled
       @date_crawled = date_crawled
 
@@ -169,9 +170,16 @@ module Wgit
       @date_crawled = bool ? Wgit::Utils.time_stamp : nil
     end
 
+    # Normalises/encodes self and returns a new Wgit::Url.
+    #
+    # @return [Wgit::Url] An encoded version of self.
+    def normalise
+      Wgit::Url.new(@uri.normalize.to_s)
+    end
+
     # Returns the @uri instance var of this URL.
     #
-    # @return [URI::HTTP, URI::HTTPS] The URI object of self.
+    # @return [Addressable::URI] The URI object of self.
     def to_uri
       @uri
     end
@@ -337,7 +345,6 @@ module Wgit
       Hash[h.to_a.insert(0, ["url", self])] # Insert url at position 0.
     end
 
-    alias :to_hash :to_h
     alias :uri :to_uri
     alias :url :to_url
     alias :scheme :to_scheme
@@ -358,5 +365,6 @@ module Wgit
     alias :is_relative? :relative_link?
     alias :is_internal? :relative_link?
     alias :crawled? :crawled
+    alias :normalize :normalise
   end
 end
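
The move from `URI(url)` to `Addressable::URI` plus `normalise` matters because Ruby's stdlib `URI` rejects non-ASCII input, so an IRI has to be escaped before it can be requested. The sketch below illustrates that idea using only the standard library. `escape_non_ascii` is a hypothetical helper that percent-encodes non-ASCII characters and nothing else; it is a simplified stand-in, not what Addressable does internally, and Addressable's normalisation likely covers more cases (such as internationalised host names).

```ruby
require 'uri'

# Hypothetical helper (not part of Wgit or Addressable): percent-encodes the
# UTF-8 bytes of each non-ASCII character and leaves ASCII untouched.
def escape_non_ascii(str)
  str.gsub(/[^[:ascii:]]/) do |char|
    char.bytes.map { |byte| format('%%%02X', byte) }.join
  end
end

iri = 'https://example.com/café'

begin
  URI(iri) # The stdlib parser raises for non-ASCII input.
rescue URI::InvalidURIError => e
  puts "stdlib URI rejected the IRI: #{e.class}"
end

# Once escaped, the same address parses without error.
uri = URI(escape_non_ascii(iri))
puts uri.path
```
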
data/lib/wgit/version.rb CHANGED
@@ -3,5 +3,5 @@
 # @author Michael Telford
 module Wgit
   # The current gem version of Wgit.
-  VERSION = "0.0.14".freeze
+  VERSION = "0.0.15".freeze
 end
metadata CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: wgit
 version: !ruby/object:Gem::Version
-  version: 0.0.14
+  version: 0.0.15
 platform: ruby
 authors:
 - Michael Telford
@@ -142,6 +142,20 @@ dependencies:
     - - "~>"
       - !ruby/object:Gem::Version
         version: '2.0'
+- !ruby/object:Gem::Dependency
+  name: addressable
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 2.6.0
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: 2.6.0
 - !ruby/object:Gem::Dependency
   name: nokogiri
   requirement: !ruby/object:Gem::Requirement
@@ -196,9 +210,7 @@ files:
 - "./lib/wgit/url.rb"
 - "./lib/wgit/utils.rb"
 - "./lib/wgit/version.rb"
-- LICENSE.txt
 - README.md
-- TODO.txt
 homepage: https://github.com/michaeltelford/wgit
 licenses:
 - MIT
data/LICENSE.txt DELETED
@@ -1,21 +0,0 @@
-The MIT License (MIT)
-
-Copyright (c) 2019 Michael Telford
-
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
-
-The above copyright notice and this permission notice shall be included in
-all copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
-THE SOFTWARE.
data/TODO.txt DELETED
@@ -1,34 +0,0 @@
-
-Primary
---------
-- Add <base> support for link processing.
-- Update Database#search & Document#search to have optional case sensitivity.
-- Have the ability to crawl sub sections of a site only e.g. https://www.honda.co.uk/motorcycles.html as the base url and crawl any links containing this as a prefix. For example, https://www.honda.co.uk/cars.html would not be crawled but https://www.honda.co.uk/motorcycles/africa-twin.html would be.
-- Create an executable based on the ./bin/console shipped as `wpry` or `wgit`.
-
-Secondary
-----------
-- Think about how we handle invalid url's on crawled documents. Setup tests and implement logic for this scenario.
-- Think about ignoring non html documents/urls e.g. http://server/image.jpg etc. by implementing MIME types (defaulting to only HTML).
-- Check if Document::TEXT_ELEMENTS is expansive enough.
-- Possibly use refine instead of core-ext?
-- Think about potentially using DB._update's update_many func.
-
-Refactoring
-------------
-- Refactor the 3 main classes and their tests (where needed): Url, Document & Crawler.
-- Think about reducing the amount of method aliases, pick the best for the method def and remove the aliases?
-- Replace method params with named parameters where applicable.
-
-Gem Publishing Checklist
--------------------------
-- Ensure a clean branch of master and create a 'release' branch.
-- Update standalone files (if necessary): README.md, TODO.txt, wgit.gemspec etc.
-- Increment the version number (in version.rb) and update the CHANGELOG.md.
-- Run 'bundle install' to update deps.
-- Run 'bundle exec rake compile' and ensure acceptable warnings/errors.
-- Run 'bundle exec rake test' and ensure all tests are passing.
-- Run `bundle exec rake install` to build and install the gem locally, then test it manually from outside this repo.
-- Run `bundle exec yard doc` to update documentation - should be very high percentage.
-- Commit, merge to master & push any changes made from the above steps.
-- Run `bundle exec rake RELEASE[origin]` to tag, build and push everything to github.com and rubygems.org.