wgit 0.9.0 → 0.10.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 07e1146e7ddcbb35abb813ae1461520e581576181750d4b9dc654de3f3375d4c
-   data.tar.gz: 6f43949fcdf13c731362d242110348dd43c5183c10130605c2e022e15cbe8cdb
+   metadata.gz: b6719bb2015379133ef2c9b417cada1826deab254f6fa1adaa093314f8fece99
+   data.tar.gz: 5ced648c0dff501bf0191aebfc0188d535f4ee657a072e1dbccd68ebbc6ac881
  SHA512:
-   metadata.gz: 7288c42fe7b8598572e8b4c8013f8614bd60caa048474a039d8c9a1f4ae231695148158293730998ac78b1f36a4ccd52c9664be1df0c49e218d740fd881d64c4
-   data.tar.gz: 0e36ea8f76aa41f5576044902cdc3e92c3affeb742c179a2fa5ba2b404ad057dede949b5e767bc09eb771b47bc153cf9462e56d9e5a393a63cb9e120bae870a9
+   metadata.gz: 4a7782b4ccf6fa69fad9bb63d7d421fa548603ad5a35304db554bdcdf6deafe305395aba1ac9f35bcd095bc6cf4049ce70e56645faf1457e2e1313d48d1eb7f8
+   data.tar.gz: 8b8bb1454a131201e262eda060c6ae8490266a7675910026a0dd6ae0b2b55f2accf140d473edf135078f68cbe1048c4bb86f2dc5a6d4cf08a006f8fc20ac49b5
data/CHANGELOG.md CHANGED
@@ -9,6 +9,15 @@
  - ...
  ---
 
+ ## v0.10.0
+ ### Added
+ - `Wgit::Url#scheme_relative?` method.
+ ### Changed/Removed
+ - Breaking change: Changed method signature of `Wgit::Url#prefix_scheme` by making the previously named parameter a defaulted positional parameter. Remove the `protocol` named parameter for the old behaviour.
+ ### Fixed
+ - [Scheme-relative bug](https://github.com/michaeltelford/wgit/issues/10) by adding support for scheme-relative URL's.
+ ---
+
  ## v0.9.0
  This release is a big one with the introduction of a `Wgit::DSL` and Javascript parse support. The `README` has been revamped as a result with new usage examples. And all of the wiki articles have been updated to reflect the latest code base.
  ### Added
@@ -112,7 +121,7 @@ This release is a big one with the introduction of a `Wgit::DSL` and Javascript
  - `Wgit::Response` class containing adapter agnostic HTTP response logic.
  ### Changed/Removed
  - Breaking changes: Removed `Wgit::Document#date_crawled` and `#crawl_duration` because both of these methods exist on the `Wgit::Document#url`. Instead, use `doc.url.date_crawled` etc.
- - Breaking changes: Added to and moved `Document.define_extension` block params, it's now `|value, source, type|`. The `source` is not what it used to be; it's now `type` - of either `:document` or `:object`. Confused? See the [docs](https://www.rubydoc.info/github/michaeltelford/wgit/master).
+ - Breaking changes: Added to and moved `Document.define_extension` block params, it's now `|value, source, type|`. The `source` is not what it used to be; it's now `type` - of either `:document` or `:object`. Confused? See the [docs](https://www.rubydoc.info/gems/wgit).
  - Breaking changes: Changed `Wgit::Url#prefix_protocol` so that it no longer modifies the receiver.
  - Breaking changes: Updated `Wgit::Url#to_anchor` and `#to_query` logic to align with that of `Addressable::URI` e.g. the anchor value no longer contains `#` prefix; and the query value no longer contains `?` prefix.
  - Breaking changes: Renamed `Wgit::Url` methods containing `anchor` to now be named `fragment` e.g. `to_anchor` is now called `to_fragment` and `without_anchor` is `without_fragment` etc.
@@ -160,7 +169,7 @@ This release is a big one with the introduction of a `Wgit::DSL` and Javascript
  ---
 
  ## v0.2.0
- This version of Wgit see's a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes are below including how to fix any breakages; but if you're having issues with the upgrade see the documentation at: https://www.rubydoc.info/github/michaeltelford/wgit/master
+ This version of Wgit see's a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes are below including how to fix any breakages; but if you're having issues with the upgrade see the documentation at: https://www.rubydoc.info/gems/wgit
  ### Added
  - `Wgit::Url#absolute?` method.
  - `Wgit::Url#relative? base: url` support.
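
Reviewer note on the `prefix_scheme` signature change recorded under v0.10.0 above. This is a minimal sketch, assuming a stub that mirrors only the new signature and its validation; the stub body and the variable names are hypothetical and are not the gem's implementation. It suggests why a 0.9.0-style keyword call should fail loudly after upgrading: Ruby passes keywords to a keyword-less method as a single positional Hash, which then fails the `:http`/`:https` check.

```ruby
# Hypothetical stub: mirrors only the 0.10.0 signature and validation of
# Wgit::Url#prefix_scheme, and returns the accepted scheme for inspection.
def prefix_scheme(scheme = :http)
  unless %i[http https].include?(scheme)
    raise "scheme must be :http or :https, not :#{scheme}"
  end

  scheme
end

default_scheme  = prefix_scheme          # no argument, falls back to :http
explicit_scheme = prefix_scheme(:https)  # 0.10.0 style: positional argument

# 0.9.0 style: the keywords arrive as one positional Hash and fail validation.
old_style_error =
  begin
    prefix_scheme(protocol: :https)
    nil
  rescue RuntimeError => e
    e.message
  end

puts default_scheme, explicit_scheme, old_style_error
```

Under these assumptions, migrating a call site means dropping the `protocol:` keyword and passing the Symbol directly.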
data/README.md CHANGED
@@ -10,7 +10,7 @@
 
  Wgit is a HTML web crawler, written in Ruby, that allows you to programmatically extract the data you want from the web.
 
- Wgit was primarily designed to crawl static HTML websites to index and search their content - providing the basis of any search engine; but Wgit is suitable for many application domains including:
+ Wgit was primarily designed to crawl static HTML websites to index and search their content - providing the basis of any search engine; but Wgit is suitable for many application domains including:
 
  - URL parsing
  - Document content extraction (data mining)
@@ -62,31 +62,6 @@ end
  puts JSON.generate(quotes)
  ```
 
- The [DSL](https://github.com/michaeltelford/wgit/wiki/How-To-Use-The-DSL) makes it easy to write scripts for experimenting with. Wgit's DSL is simply a wrapper around the underlying classes however. For comparison, here is the above example written using the Wgit API *instead of* the DSL:
-
- ```ruby
- require 'wgit'
- require 'json'
-
- crawler = Wgit::Crawler.new
- url = Wgit::Url.new('http://quotes.toscrape.com/tag/humor/')
- quotes = []
-
- Wgit::Document.define_extractor(:quotes, "//div[@class='quote']/span[@class='text']", singleton: false)
- Wgit::Document.define_extractor(:authors, "//div[@class='quote']/span/small", singleton: false)
-
- crawler.crawl_site(url, follow: "//li[@class='next']/a/@href") do |doc|
-   doc.quotes.zip(doc.authors).each do |arr|
-     quotes << {
-       quote: arr.first,
-       author: arr.last
-     }
-   end
- end
-
- puts JSON.generate(quotes)
- ```
-
  But what if we want to crawl and store the content in a database, so that it can be searched? Wgit makes it easy to index and search HTML using [MongoDB](https://www.mongodb.com/):
 
  ```ruby
@@ -97,14 +72,13 @@ include Wgit::DSL
  Wgit.logger.level = Logger::WARN
 
  connection_string 'mongodb://user:password@localhost/crawler'
- clear_db!
-
- extract :quotes, "//div[@class='quote']/span[@class='text']", singleton: false
- extract :authors, "//div[@class='quote']/span/small", singleton: false
 
  start 'http://quotes.toscrape.com/tag/humor/'
  follow "//li[@class='next']/a/@href"
 
+ extract :quotes, "//div[@class='quote']/span[@class='text']", singleton: false
+ extract :authors, "//div[@class='quote']/span/small", singleton: false
+
  index_site
  search 'prejudice'
  ```
@@ -117,10 +91,35 @@ Quotes to Scrape
  http://quotes.toscrape.com/tag/humor/page/2/
  ```
 
- Using a Mongo DB [client](https://robomongo.org/), we can see that the two webpages have been indexed, along with their extracted *quotes* and *authors*:
+ Using a MongoDB [client](https://robomongo.org/), we can see that the two web pages have been indexed, along with their extracted *quotes* and *authors*:
 
  ![MongoDBClient](https://raw.githubusercontent.com/michaeltelford/wgit/assets/assets/wgit_mongo_index.png)
 
+ The [DSL](https://github.com/michaeltelford/wgit/wiki/How-To-Use-The-DSL) makes it easy to write scripts for experimenting with. Wgit's DSL is simply a wrapper around the underlying classes however. For comparison, here is the above example written using the Wgit API *instead of* the DSL:
+
+ ```ruby
+ require 'wgit'
+ require 'json'
+
+ crawler = Wgit::Crawler.new
+ url = Wgit::Url.new('http://quotes.toscrape.com/tag/humor/')
+ quotes = []
+
+ Wgit::Document.define_extractor(:quotes, "//div[@class='quote']/span[@class='text']", singleton: false)
+ Wgit::Document.define_extractor(:authors, "//div[@class='quote']/span/small", singleton: false)
+
+ crawler.crawl_site(url, follow: "//li[@class='next']/a/@href") do |doc|
+   doc.quotes.zip(doc.authors).each do |arr|
+     quotes << {
+       quote: arr.first,
+       author: arr.last
+     }
+   end
+ end
+
+ puts JSON.generate(quotes)
+ ```
+
  ## Why Wgit?
 
  There are many [other HTML crawlers](https://awesome-ruby.com/#-web-crawling) out there so why use Wgit?
@@ -161,33 +160,27 @@ Only MRI Ruby is tested and supported, but Wgit may work with other Ruby impleme
 
  Currently, the required MRI Ruby version is:
 
- `~> 2.5` a.k.a. `>= 2.5 && < 3`
+ `~> 2.5` (a.k.a.) `>= 2.5 && < 3`
 
  ### Using Bundler
 
- Add this line to your application's `Gemfile`:
-
- ```ruby
- gem 'wgit'
- ```
-
- And then execute:
-
-     $ bundle
+     $ bundle add wgit
 
  ### Using RubyGems
 
      $ gem install wgit
 
- Verify the install by using the executable (to start an REPL session):
+ ### Verify
 
      $ wgit
 
+ Calling the installed executable will start an REPL session.
+
  ## Documentation
 
  - [Getting Started](https://github.com/michaeltelford/wgit/wiki/Getting-Started)
  - [Wiki](https://github.com/michaeltelford/wgit/wiki)
- - [Yardocs](https://www.rubydoc.info/github/michaeltelford/wgit/master)
+ - [API Yardocs](https://www.rubydoc.info/gems/wgit)
  - [CHANGELOG](https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md)
 
  ## Executable
@@ -186,7 +186,7 @@ module Wgit
        data_hash = model.merge(Wgit::Model.common_update_data)
        result = @client[collection].replace_one(query, data_hash, upsert: true)
 
-       result.matched_count == 0
+       result.matched_count.zero?
      end
 
      ### Retrieve Data ###
data/lib/wgit/document.rb CHANGED
@@ -413,6 +413,13 @@ be relative"
        return [] if @links.empty?
 
        links = @links
+         .map do |link|
+           if link.scheme_relative?
+             link.prefix_scheme(@url.to_scheme.to_sym)
+           else
+             link
+           end
+         end
          .reject { |link| link.relative?(host: @url.to_origin) }
          .map(&:omit_trailing_slash)
 
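
Reviewer note on the hunk above: the new `.map` step gives each scheme-relative link the crawled page's scheme before the host-relative filter runs, so a link such as `//cdn.example.org/lib.js` is no longer treated as relative and dropped. A rough sketch of that pipeline, assuming plain Strings and Ruby's stdlib `URI` in place of `Wgit::Url`; the helper name `external_links` and the example URLs are hypothetical:

```ruby
require 'uri'

# Hypothetical helper approximating the pipeline in the hunk above:
# 1. give scheme-relative links ("//host/path") the page's scheme,
# 2. drop links that have no host or that share the page's host,
# 3. strip a trailing slash.
def external_links(page_url, links)
  page = URI.parse(page_url)

  links
    .map { |link| link.start_with?('//') ? "#{page.scheme}:#{link}" : link }
    .select do |link|
      host = URI.parse(link).host
      host && host != page.host
    end
    .map { |link| link.chomp('/') }
end

found = external_links(
  'https://example.com/index.html',
  ['//cdn.example.org/lib.js', '/about', 'https://example.com/contact', 'http://other.test/']
)
puts found
```

Without the first step, the scheme-relative link would carry no scheme and this sketch would have no reliable way to compare its host with the page's.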
data/lib/wgit/url.rb CHANGED
@@ -162,6 +162,7 @@ Addressable::URI::InvalidURIError")
        opts = defaults.merge(opts)
        raise 'Url (self) cannot be empty' if empty?
 
+       return false if scheme_relative?
        return true if @uri.relative?
 
        # Self is absolute but may be relative to the opts param e.g. host.
@@ -266,26 +267,28 @@ protocol scheme and domain (e.g. http://example.com): #{url}"
      # @return [Wgit::Url] Self in absolute form.
      def make_absolute(doc)
        assert_type(doc, Wgit::Document)
+       raise 'Cannot make absolute when Document @url is not valid' \
+       unless doc.url.valid?
+
+       return prefix_scheme(doc.url.to_scheme&.to_sym) if scheme_relative?
 
        absolute? ? self : doc.base_url(link: self).concat(self)
      end
 
-     # Returns self having prefixed a protocol scheme. Doesn't modify receiver.
+     # Returns self having prefixed a scheme/protocol. Doesn't modify receiver.
      # Returns self even if absolute (with scheme); therefore is idempotent.
      #
-     # @param protocol [Symbol] Either :http or :https.
-     # @return [Wgit::Url] Self with a protocol scheme prefix.
-     def prefix_scheme(protocol: :http)
-       return self if absolute?
-
-       case protocol
-       when :http
-         Wgit::Url.new("http://#{url}")
-       when :https
-         Wgit::Url.new("https://#{url}")
-       else
-         raise "protocol must be :http or :https, not :#{protocol}"
+     # @param scheme [Symbol] Either :http or :https.
+     # @return [Wgit::Url] Self with a scheme prefix.
+     def prefix_scheme(scheme = :http)
+       unless %i[http https].include?(scheme)
+         raise "scheme must be :http or :https, not :#{scheme}"
        end
+
+       return self if absolute? && !scheme_relative?
+
+       separator = scheme_relative? ? '' : '//'
+       Wgit::Url.new("#{scheme}:#{separator}#{self}")
      end
 
      # Returns a Hash containing this Url's instance vars excluding @uri.
@@ -624,31 +627,40 @@ protocol scheme and domain (e.g. http://example.com): #{url}"
        self == '/'
      end
 
-     alias + concat
-     alias crawled? crawled
-     alias is_relative? relative?
-     alias is_absolute? absolute?
-     alias is_valid? valid?
-     alias is_query? query?
-     alias is_fragment? fragment?
-     alias is_index? index?
-     alias uri to_uri
-     alias url to_url
-     alias scheme to_scheme
-     alias host to_host
-     alias port to_port
-     alias domain to_domain
-     alias brand to_brand
-     alias base to_base
-     alias origin to_origin
-     alias path to_path
-     alias endpoint to_endpoint
-     alias query to_query
-     alias query_hash to_query_hash
-     alias fragment to_fragment
-     alias extension to_extension
-     alias user to_user
-     alias password to_password
-     alias sub_domain to_sub_domain
+     # Returns true if self starts with '//' a.k.a a scheme/protocol relative
+     # path.
+     #
+     # @return [Boolean] True if self starts with '//', false otherwise.
+     def scheme_relative?
+       start_with?('//')
+     end
+
+     alias + concat
+     alias crawled? crawled
+     alias is_relative? relative?
+     alias is_absolute? absolute?
+     alias is_valid? valid?
+     alias is_query? query?
+     alias is_fragment? fragment?
+     alias is_index? index?
+     alias is_scheme_relative? scheme_relative?
+     alias uri to_uri
+     alias url to_url
+     alias scheme to_scheme
+     alias host to_host
+     alias port to_port
+     alias domain to_domain
+     alias brand to_brand
+     alias base to_base
+     alias origin to_origin
+     alias path to_path
+     alias endpoint to_endpoint
+     alias query to_query
+     alias query_hash to_query_hash
+     alias fragment to_fragment
+     alias extension to_extension
+     alias user to_user
+     alias password to_password
+     alias sub_domain to_sub_domain
    end
  end
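
Reviewer note on the `url.rb` changes above: the rewritten `prefix_scheme` validates the scheme first, returns the receiver untouched when it already carries a scheme, and otherwise prepends `scheme:` plus `//` only when the URL does not already start with `//`. A minimal standalone sketch of that flow, assuming plain Strings in place of `Wgit::Url`; the regex-based `absolute?` is a simplification introduced here, not the gem's Addressable-backed check:

```ruby
# Hypothetical String-based stand-ins for the Wgit::Url methods in this diff.
def scheme_relative?(url)
  url.start_with?('//')
end

# Assumption: "absolute" simply means the String begins with "<scheme>://".
def absolute?(url)
  url.match?(%r{\A[a-z][a-z0-9+.-]*://}i)
end

def prefix_scheme(url, scheme = :http)
  unless %i[http https].include?(scheme)
    raise "scheme must be :http or :https, not :#{scheme}"
  end

  # Already has a scheme: return unchanged, which keeps the call idempotent.
  return url if absolute?(url)

  # Scheme-relative URLs already start with '//', so only "scheme:" is added.
  separator = scheme_relative?(url) ? '' : '//'
  "#{scheme}:#{separator}#{url}"
end

puts prefix_scheme('//example.com/a', :https)
puts prefix_scheme('example.com')
puts prefix_scheme('https://example.com', :http)
```

In the diff itself the early return is guarded with `&& !scheme_relative?`, presumably because `relative?` now returns false for scheme-relative URLs; the sketch avoids that interaction by defining `absolute?` purely in terms of a leading scheme.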
data/lib/wgit/version.rb CHANGED
@@ -6,7 +6,7 @@
  # @author Michael Telford
  module Wgit
    # The current gem version of Wgit.
-   VERSION = '0.9.0'
+   VERSION = '0.10.0'
 
    # Returns the current gem version of Wgit as a String.
    def self.version
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: wgit
  version: !ruby/object:Gem::Version
-   version: 0.9.0
+   version: 0.10.0
  platform: ruby
  authors:
  - Michael Telford
- autorequire:
+ autorequire:
  bindir: bin
  cert_chain: []
- date: 2020-07-31 00:00:00.000000000 Z
+ date: 2021-04-20 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: addressable
@@ -241,7 +241,7 @@ metadata:
    source_code_uri: https://github.com/michaeltelford/wgit
    changelog_uri: https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md
    bug_tracker_uri: https://github.com/michaeltelford/wgit/issues
-   documentation_uri: https://www.rubydoc.info/github/michaeltelford/wgit/master
+   documentation_uri: https://www.rubydoc.info/gems/wgit
    allowed_push_host: https://rubygems.org
  post_install_message: Added the 'wgit' executable to $PATH
  rdoc_options: []
@@ -259,7 +259,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
        version: '0'
  requirements: []
  rubygems_version: 3.1.2
- signing_key:
+ signing_key:
  specification_version: 4
  summary: Wgit is a HTML web crawler, written in Ruby, that allows you to programmatically
    extract the data you want from the web.