wgit 0.9.0 → 0.10.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 07e1146e7ddcbb35abb813ae1461520e581576181750d4b9dc654de3f3375d4c
-   data.tar.gz: 6f43949fcdf13c731362d242110348dd43c5183c10130605c2e022e15cbe8cdb
+   metadata.gz: b6719bb2015379133ef2c9b417cada1826deab254f6fa1adaa093314f8fece99
+   data.tar.gz: 5ced648c0dff501bf0191aebfc0188d535f4ee657a072e1dbccd68ebbc6ac881
  SHA512:
-   metadata.gz: 7288c42fe7b8598572e8b4c8013f8614bd60caa048474a039d8c9a1f4ae231695148158293730998ac78b1f36a4ccd52c9664be1df0c49e218d740fd881d64c4
-   data.tar.gz: 0e36ea8f76aa41f5576044902cdc3e92c3affeb742c179a2fa5ba2b404ad057dede949b5e767bc09eb771b47bc153cf9462e56d9e5a393a63cb9e120bae870a9
+   metadata.gz: 4a7782b4ccf6fa69fad9bb63d7d421fa548603ad5a35304db554bdcdf6deafe305395aba1ac9f35bcd095bc6cf4049ce70e56645faf1457e2e1313d48d1eb7f8
+   data.tar.gz: 8b8bb1454a131201e262eda060c6ae8490266a7675910026a0dd6ae0b2b55f2accf140d473edf135078f68cbe1048c4bb86f2dc5a6d4cf08a006f8fc20ac49b5
data/CHANGELOG.md CHANGED
@@ -9,6 +9,15 @@
  - ...
  ---

+ ## v0.10.0
+ ### Added
+ - `Wgit::Url#scheme_relative?` method.
+ ### Changed/Removed
+ - Breaking change: Changed method signature of `Wgit::Url#prefix_scheme` by making the previously named parameter a defaulted positional parameter. Remove the `protocol` named parameter for the old behaviour.
+ ### Fixed
+ - [Scheme-relative bug](https://github.com/michaeltelford/wgit/issues/10) by adding support for scheme-relative URL's.
+ ---
+
  ## v0.9.0
  This release is a big one with the introduction of a `Wgit::DSL` and Javascript parse support. The `README` has been revamped as a result with new usage examples. And all of the wiki articles have been updated to reflect the latest code base.
  ### Added
@@ -112,7 +121,7 @@ This release is a big one with the introduction of a `Wgit::DSL` and Javascript
  - `Wgit::Response` class containing adapter agnostic HTTP response logic.
  ### Changed/Removed
  - Breaking changes: Removed `Wgit::Document#date_crawled` and `#crawl_duration` because both of these methods exist on the `Wgit::Document#url`. Instead, use `doc.url.date_crawled` etc.
- - Breaking changes: Added to and moved `Document.define_extension` block params, it's now `|value, source, type|`. The `source` is not what it used to be; it's now `type` - of either `:document` or `:object`. Confused? See the [docs](https://www.rubydoc.info/github/michaeltelford/wgit/master).
+ - Breaking changes: Added to and moved `Document.define_extension` block params, it's now `|value, source, type|`. The `source` is not what it used to be; it's now `type` - of either `:document` or `:object`. Confused? See the [docs](https://www.rubydoc.info/gems/wgit).
  - Breaking changes: Changed `Wgit::Url#prefix_protocol` so that it no longer modifies the receiver.
  - Breaking changes: Updated `Wgit::Url#to_anchor` and `#to_query` logic to align with that of `Addressable::URI` e.g. the anchor value no longer contains `#` prefix; and the query value no longer contains `?` prefix.
  - Breaking changes: Renamed `Wgit::Url` methods containing `anchor` to now be named `fragment` e.g. `to_anchor` is now called `to_fragment` and `without_anchor` is `without_fragment` etc.
@@ -160,7 +169,7 @@ This release is a big one with the introduction of a `Wgit::DSL` and Javascript
  ---

  ## v0.2.0
- This version of Wgit see's a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes are below including how to fix any breakages; but if you're having issues with the upgrade see the documentation at: https://www.rubydoc.info/github/michaeltelford/wgit/master
+ This version of Wgit see's a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes are below including how to fix any breakages; but if you're having issues with the upgrade see the documentation at: https://www.rubydoc.info/gems/wgit
  ### Added
  - `Wgit::Url#absolute?` method.
  - `Wgit::Url#relative? base: url` support.
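
The v0.10.0 entry above describes a breaking change to `Wgit::Url#prefix_scheme`: the named `protocol:` parameter became a defaulted positional parameter. The following is a minimal sketch of what that means for call sites. `UrlSketch`, `prefix_scheme_old` and `prefix_scheme_new` are hypothetical stand-ins written for illustration only; they are not part of Wgit, and the sketch raises `ArgumentError` purely so the rejection is easy to rescue.

```ruby
# Hypothetical stand-in for Wgit::Url; only the two signatures matter here.
class UrlSketch < String
  # v0.9.0-style signature: named parameter.
  def prefix_scheme_old(protocol: :http)
    "#{protocol}://#{self}"
  end

  # v0.10.0-style signature: defaulted positional parameter.
  def prefix_scheme_new(scheme = :http)
    unless %i[http https].include?(scheme)
      raise ArgumentError, "scheme must be :http or :https, not #{scheme.inspect}"
    end

    "#{scheme}://#{self}"
  end
end

url = UrlSketch.new('example.com')

puts url.prefix_scheme_old(protocol: :https) # old call style
puts url.prefix_scheme_new(:https)           # new call style
puts url.prefix_scheme_new                   # the default is still :http

# An un-migrated call passes a Hash as the positional argument, which the
# scheme check rejects.
begin
  url.prefix_scheme_new(protocol: :https)
rescue ArgumentError => e
  puts "rejected: #{e.class}"
end
```

In other words, migrating should only require dropping the `protocol:` label at each call site; calls that rely on the default are unaffected.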
data/README.md CHANGED
@@ -10,7 +10,7 @@

  Wgit is a HTML web crawler, written in Ruby, that allows you to programmatically extract the data you want from the web.

- Wgit was primarily designed to crawl static HTML websites to index and search their content - providing the basis of any search engine; but Wgit is suitable for many application domains including:
+ Wgit was primarily designed to crawl static HTML websites to index and search their content - providing the basis of any search engine; but Wgit is suitable for many application domains including:

  - URL parsing
  - Document content extraction (data mining)
@@ -62,31 +62,6 @@ end
  puts JSON.generate(quotes)
  ```

- The [DSL](https://github.com/michaeltelford/wgit/wiki/How-To-Use-The-DSL) makes it easy to write scripts for experimenting with. Wgit's DSL is simply a wrapper around the underlying classes however. For comparison, here is the above example written using the Wgit API *instead of* the DSL:
-
- ```ruby
- require 'wgit'
- require 'json'
-
- crawler = Wgit::Crawler.new
- url = Wgit::Url.new('http://quotes.toscrape.com/tag/humor/')
- quotes = []
-
- Wgit::Document.define_extractor(:quotes, "//div[@class='quote']/span[@class='text']", singleton: false)
- Wgit::Document.define_extractor(:authors, "//div[@class='quote']/span/small", singleton: false)
-
- crawler.crawl_site(url, follow: "//li[@class='next']/a/@href") do |doc|
-   doc.quotes.zip(doc.authors).each do |arr|
-     quotes << {
-       quote: arr.first,
-       author: arr.last
-     }
-   end
- end
-
- puts JSON.generate(quotes)
- ```
-
  But what if we want to crawl and store the content in a database, so that it can be searched? Wgit makes it easy to index and search HTML using [MongoDB](https://www.mongodb.com/):

  ```ruby
@@ -97,14 +72,13 @@ include Wgit::DSL
  Wgit.logger.level = Logger::WARN

  connection_string 'mongodb://user:password@localhost/crawler'
- clear_db!
-
- extract :quotes, "//div[@class='quote']/span[@class='text']", singleton: false
- extract :authors, "//div[@class='quote']/span/small", singleton: false

  start 'http://quotes.toscrape.com/tag/humor/'
  follow "//li[@class='next']/a/@href"

+ extract :quotes, "//div[@class='quote']/span[@class='text']", singleton: false
+ extract :authors, "//div[@class='quote']/span/small", singleton: false
+
  index_site
  search 'prejudice'
  ```
@@ -117,10 +91,35 @@ Quotes to Scrape
  http://quotes.toscrape.com/tag/humor/page/2/
  ```

- Using a Mongo DB [client](https://robomongo.org/), we can see that the two webpages have been indexed, along with their extracted *quotes* and *authors*:
+ Using a MongoDB [client](https://robomongo.org/), we can see that the two web pages have been indexed, along with their extracted *quotes* and *authors*:

  ![MongoDBClient](https://raw.githubusercontent.com/michaeltelford/wgit/assets/assets/wgit_mongo_index.png)

+ The [DSL](https://github.com/michaeltelford/wgit/wiki/How-To-Use-The-DSL) makes it easy to write scripts for experimenting with. Wgit's DSL is simply a wrapper around the underlying classes however. For comparison, here is the above example written using the Wgit API *instead of* the DSL:
+
+ ```ruby
+ require 'wgit'
+ require 'json'
+
+ crawler = Wgit::Crawler.new
+ url = Wgit::Url.new('http://quotes.toscrape.com/tag/humor/')
+ quotes = []
+
+ Wgit::Document.define_extractor(:quotes, "//div[@class='quote']/span[@class='text']", singleton: false)
+ Wgit::Document.define_extractor(:authors, "//div[@class='quote']/span/small", singleton: false)
+
+ crawler.crawl_site(url, follow: "//li[@class='next']/a/@href") do |doc|
+   doc.quotes.zip(doc.authors).each do |arr|
+     quotes << {
+       quote: arr.first,
+       author: arr.last
+     }
+   end
+ end
+
+ puts JSON.generate(quotes)
+ ```
+
  ## Why Wgit?

  There are many [other HTML crawlers](https://awesome-ruby.com/#-web-crawling) out there so why use Wgit?
@@ -161,33 +160,27 @@ Only MRI Ruby is tested and supported, but Wgit may work with other Ruby impleme

  Currently, the required MRI Ruby version is:

- `~> 2.5` a.k.a. `>= 2.5 && < 3`
+ `~> 2.5` (a.k.a.) `>= 2.5 && < 3`

  ### Using Bundler

- Add this line to your application's `Gemfile`:
-
- ```ruby
- gem 'wgit'
- ```
-
- And then execute:
-
-     $ bundle
+     $ bundle add wgit

  ### Using RubyGems

      $ gem install wgit

- Verify the install by using the executable (to start an REPL session):
+ ### Verify

      $ wgit

+ Calling the installed executable will start an REPL session.
+
  ## Documentation

  - [Getting Started](https://github.com/michaeltelford/wgit/wiki/Getting-Started)
  - [Wiki](https://github.com/michaeltelford/wgit/wiki)
- - [Yardocs](https://www.rubydoc.info/github/michaeltelford/wgit/master)
+ - [API Yardocs](https://www.rubydoc.info/gems/wgit)
  - [CHANGELOG](https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md)

  ## Executable
@@ -186,7 +186,7 @@ module Wgit
        data_hash = model.merge(Wgit::Model.common_update_data)
        result = @client[collection].replace_one(query, data_hash, upsert: true)

-       result.matched_count == 0
+       result.matched_count.zero?
      end

      ### Retrieve Data ###
data/lib/wgit/document.rb CHANGED
@@ -413,6 +413,13 @@ be relative"
        return [] if @links.empty?

        links = @links
+         .map do |link|
+           if link.scheme_relative?
+             link.prefix_scheme(@url.to_scheme.to_sym)
+           else
+             link
+           end
+         end
          .reject { |link| link.relative?(host: @url.to_origin) }
          .map(&:omit_trailing_slash)

data/lib/wgit/url.rb CHANGED
@@ -162,6 +162,7 @@ Addressable::URI::InvalidURIError")
        opts = defaults.merge(opts)
        raise 'Url (self) cannot be empty' if empty?

+       return false if scheme_relative?
        return true if @uri.relative?

        # Self is absolute but may be relative to the opts param e.g. host.
@@ -266,26 +267,28 @@ protocol scheme and domain (e.g. http://example.com): #{url}"
      # @return [Wgit::Url] Self in absolute form.
      def make_absolute(doc)
        assert_type(doc, Wgit::Document)
+       raise 'Cannot make absolute when Document @url is not valid' \
+         unless doc.url.valid?
+
+       return prefix_scheme(doc.url.to_scheme&.to_sym) if scheme_relative?

        absolute? ? self : doc.base_url(link: self).concat(self)
      end

-     # Returns self having prefixed a protocol scheme. Doesn't modify receiver.
+     # Returns self having prefixed a scheme/protocol. Doesn't modify receiver.
      # Returns self even if absolute (with scheme); therefore is idempotent.
      #
-     # @param protocol [Symbol] Either :http or :https.
-     # @return [Wgit::Url] Self with a protocol scheme prefix.
-     def prefix_scheme(protocol: :http)
-       return self if absolute?
-
-       case protocol
-       when :http
-         Wgit::Url.new("http://#{url}")
-       when :https
-         Wgit::Url.new("https://#{url}")
-       else
-         raise "protocol must be :http or :https, not :#{protocol}"
+     # @param scheme [Symbol] Either :http or :https.
+     # @return [Wgit::Url] Self with a scheme prefix.
+     def prefix_scheme(scheme = :http)
+       unless %i[http https].include?(scheme)
+         raise "scheme must be :http or :https, not :#{scheme}"
        end
+
+       return self if absolute? && !scheme_relative?
+
+       separator = scheme_relative? ? '' : '//'
+       Wgit::Url.new("#{scheme}:#{separator}#{self}")
      end

      # Returns a Hash containing this Url's instance vars excluding @uri.
@@ -624,31 +627,40 @@ protocol scheme and domain (e.g. http://example.com): #{url}"
        self == '/'
      end

-     alias + concat
-     alias crawled? crawled
-     alias is_relative? relative?
-     alias is_absolute? absolute?
-     alias is_valid? valid?
-     alias is_query? query?
-     alias is_fragment? fragment?
-     alias is_index? index?
-     alias uri to_uri
-     alias url to_url
-     alias scheme to_scheme
-     alias host to_host
-     alias port to_port
-     alias domain to_domain
-     alias brand to_brand
-     alias base to_base
-     alias origin to_origin
-     alias path to_path
-     alias endpoint to_endpoint
-     alias query to_query
-     alias query_hash to_query_hash
-     alias fragment to_fragment
-     alias extension to_extension
-     alias user to_user
-     alias password to_password
-     alias sub_domain to_sub_domain
+     # Returns true if self starts with '//' a.k.a a scheme/protocol relative
+     # path.
+     #
+     # @return [Boolean] True if self starts with '//', false otherwise.
+     def scheme_relative?
+       start_with?('//')
+     end
+
+     alias + concat
+     alias crawled? crawled
+     alias is_relative? relative?
+     alias is_absolute? absolute?
+     alias is_valid? valid?
+     alias is_query? query?
+     alias is_fragment? fragment?
+     alias is_index? index?
+     alias is_scheme_relative? scheme_relative?
+     alias uri to_uri
+     alias url to_url
+     alias scheme to_scheme
+     alias host to_host
+     alias port to_port
+     alias domain to_domain
+     alias brand to_brand
+     alias base to_base
+     alias origin to_origin
+     alias path to_path
+     alias endpoint to_endpoint
+     alias query to_query
+     alias query_hash to_query_hash
+     alias fragment to_fragment
+     alias extension to_extension
+     alias user to_user
+     alias password to_password
+     alias sub_domain to_sub_domain
    end
  end
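
The `url.rb` and `document.rb` hunks above work together: a link starting with `//` is detected as scheme-relative and given the crawled page's scheme before external links are filtered. Below is a rough, stand-alone sketch of that flow on plain Strings. The helper names (including `normalise_links`) and the regular expression used to detect an existing scheme are assumptions made for illustration; Wgit itself implements this on `Wgit::Url` instances and uses its own `absolute?` check.

```ruby
# Hypothetical String-based helpers mirroring the diff; not Wgit's API.
def scheme_relative?(link)
  link.start_with?('//')
end

def prefix_scheme(link, scheme = :http)
  unless %i[http https].include?(scheme)
    raise ArgumentError, "scheme must be :http or :https, not #{scheme.inspect}"
  end

  # A link that already carries a scheme is returned unchanged (idempotent).
  # Assumption: a simple regex stands in for Wgit's absolute? check.
  return link if link.match?(%r{\A[a-z][a-z0-9+.-]*://}i)

  # '//host/path' only needs 'scheme:', whereas 'host/path' needs 'scheme://'.
  separator = scheme_relative?(link) ? '' : '//'
  "#{scheme}:#{separator}#{link}"
end

# Resolve scheme-relative links against the scheme of the crawled page,
# leaving every other link untouched.
def normalise_links(links, page_scheme)
  links.map do |link|
    scheme_relative?(link) ? prefix_scheme(link, page_scheme) : link
  end
end

links = ['//cdn.example.com/app.js', 'https://other.com/', '/about']
puts normalise_links(links, :https).inspect
puts prefix_scheme('example.com')
```

The separator choice is the key detail: because a scheme-relative link already begins with `//`, prefixing `https://` would produce four slashes, so only `https:` is prepended.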
data/lib/wgit/version.rb CHANGED
@@ -6,7 +6,7 @@
  # @author Michael Telford
  module Wgit
    # The current gem version of Wgit.
-   VERSION = '0.9.0'
+   VERSION = '0.10.0'

    # Returns the current gem version of Wgit as a String.
    def self.version
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: wgit
  version: !ruby/object:Gem::Version
-   version: 0.9.0
+   version: 0.10.0
  platform: ruby
  authors:
  - Michael Telford
- autorequire:
+ autorequire:
  bindir: bin
  cert_chain: []
- date: 2020-07-31 00:00:00.000000000 Z
+ date: 2021-04-20 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: addressable
@@ -241,7 +241,7 @@ metadata:
    source_code_uri: https://github.com/michaeltelford/wgit
    changelog_uri: https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md
    bug_tracker_uri: https://github.com/michaeltelford/wgit/issues
-   documentation_uri: https://www.rubydoc.info/github/michaeltelford/wgit/master
+   documentation_uri: https://www.rubydoc.info/gems/wgit
    allowed_push_host: https://rubygems.org
  post_install_message: Added the 'wgit' executable to $PATH
  rdoc_options: []
@@ -259,7 +259,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
        version: '0'
  requirements: []
  rubygems_version: 3.1.2
- signing_key:
+ signing_key:
  specification_version: 4
  summary: Wgit is a HTML web crawler, written in Ruby, that allows you to programmatically
    extract the data you want from the web.