wgit 0.9.0 → 0.10.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 07e1146e7ddcbb35abb813ae1461520e581576181750d4b9dc654de3f3375d4c
-   data.tar.gz: 6f43949fcdf13c731362d242110348dd43c5183c10130605c2e022e15cbe8cdb
+   metadata.gz: 720cf6b84698fbd54c109319f05557ee2e29bdbda59ec23278422dc5ddc77f2f
+   data.tar.gz: d4304bce849b404b9d2d7faa4d9a3f7969784f649a83152605b51b2e0bd21ac4
  SHA512:
-   metadata.gz: 7288c42fe7b8598572e8b4c8013f8614bd60caa048474a039d8c9a1f4ae231695148158293730998ac78b1f36a4ccd52c9664be1df0c49e218d740fd881d64c4
-   data.tar.gz: 0e36ea8f76aa41f5576044902cdc3e92c3affeb742c179a2fa5ba2b404ad057dede949b5e767bc09eb771b47bc153cf9462e56d9e5a393a63cb9e120bae870a9
+   metadata.gz: a8743ec17b3caaa9b6c5dd5c9b9b18902561927dfd992003f25db88334cc2b4364a4c6ce2dea34629f801d5d7dbe9761b15e7f2f034e00ba526db36ce828dcaf
+   data.tar.gz: 00cf954a86c8b0d96f2e694359c1c75e3193e0e6d146ffba19b3857bef4c15ca93d25f1310ebebf815de8da93ede1b97e325dc54aade699219b9ab35f2976e49
data/CHANGELOG.md CHANGED
@@ -9,6 +9,42 @@
  - ...
  ---

+ ## v0.10.3
+ ### Added
+ - ...
+ ### Changed/Removed
+ - Changed `Database#create_collections` and `#create_unique_indexes` by removing `rescue nil` from their database operations. Now any underlying errors with the database client are not masked.
+ ### Fixed
+ - ...
+ ---
+
+ ## v0.10.2
+ ### Added
+ - `Wgit::Base#setup` and `#teardown` methods (lifecycle hooks) that can be overridden by subclasses.
+ ### Changed/Removed
+ - ...
+ ### Fixed
+ - ...
+ ---
+
+ ## v0.10.1
+ ### Added
+ - Support for Ruby 3.
+ ### Changed/Removed
+ - Removed support for Ruby 2.5 (as it's too old).
+ ### Fixed
+ - ...
+ ---
+
+ ## v0.10.0
+ ### Added
+ - `Wgit::Url#scheme_relative?` method.
+ ### Changed/Removed
+ - Breaking change: Changed method signature of `Wgit::Url#prefix_scheme` by making the previously named parameter a defaulted positional parameter. Remove the `protocol` named parameter for the old behaviour.
+ ### Fixed
+ - [Scheme-relative bug](https://github.com/michaeltelford/wgit/issues/10) by adding support for scheme-relative URL's.
+ ---
+
  ## v0.9.0
  This release is a big one with the introduction of a `Wgit::DSL` and Javascript parse support. The `README` has been revamped as a result with new usage examples. And all of the wiki articles have been updated to reflect the latest code base.
  ### Added
@@ -112,7 +148,7 @@ This release is a big one with the introduction of a `Wgit::DSL` and Javascript
  - `Wgit::Response` class containing adapter agnostic HTTP response logic.
  ### Changed/Removed
  - Breaking changes: Removed `Wgit::Document#date_crawled` and `#crawl_duration` because both of these methods exist on the `Wgit::Document#url`. Instead, use `doc.url.date_crawled` etc.
- - Breaking changes: Added to and moved `Document.define_extension` block params, it's now `|value, source, type|`. The `source` is not what it used to be; it's now `type` - of either `:document` or `:object`. Confused? See the [docs](https://www.rubydoc.info/github/michaeltelford/wgit/master).
+ - Breaking changes: Added to and moved `Document.define_extension` block params, it's now `|value, source, type|`. The `source` is not what it used to be; it's now `type` - of either `:document` or `:object`. Confused? See the [docs](https://www.rubydoc.info/gems/wgit).
  - Breaking changes: Changed `Wgit::Url#prefix_protocol` so that it no longer modifies the receiver.
  - Breaking changes: Updated `Wgit::Url#to_anchor` and `#to_query` logic to align with that of `Addressable::URI` e.g. the anchor value no longer contains `#` prefix; and the query value no longer contains `?` prefix.
  - Breaking changes: Renamed `Wgit::Url` methods containing `anchor` to now be named `fragment` e.g. `to_anchor` is now called `to_fragment` and `without_anchor` is `without_fragment` etc.
@@ -160,7 +196,7 @@ This release is a big one with the introduction of a `Wgit::DSL` and Javascript
  ---

  ## v0.2.0
- This version of Wgit see's a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes are below including how to fix any breakages; but if you're having issues with the upgrade see the documentation at: https://www.rubydoc.info/github/michaeltelford/wgit/master
+ This version of Wgit see's a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes are below including how to fix any breakages; but if you're having issues with the upgrade see the documentation at: https://www.rubydoc.info/gems/wgit
  ### Added
  - `Wgit::Url#absolute?` method.
  - `Wgit::Url#relative? base: url` support.
data/README.md CHANGED
@@ -10,7 +10,7 @@

  Wgit is a HTML web crawler, written in Ruby, that allows you to programmatically extract the data you want from the web.

- Wgit was primarily designed to crawl static HTML websites to index and search their content - providing the basis of any search engine; but Wgit is suitable for many application domains including:
+ Wgit was primarily designed to crawl static HTML websites to index and search their content - providing the basis of any search engine; but Wgit is suitable for many application domains including:

  - URL parsing
  - Document content extraction (data mining)
@@ -62,31 +62,6 @@ end
  puts JSON.generate(quotes)
  ```

- The [DSL](https://github.com/michaeltelford/wgit/wiki/How-To-Use-The-DSL) makes it easy to write scripts for experimenting with. Wgit's DSL is simply a wrapper around the underlying classes however. For comparison, here is the above example written using the Wgit API *instead of* the DSL:
-
- ```ruby
- require 'wgit'
- require 'json'
-
- crawler = Wgit::Crawler.new
- url = Wgit::Url.new('http://quotes.toscrape.com/tag/humor/')
- quotes = []
-
- Wgit::Document.define_extractor(:quotes, "//div[@class='quote']/span[@class='text']", singleton: false)
- Wgit::Document.define_extractor(:authors, "//div[@class='quote']/span/small", singleton: false)
-
- crawler.crawl_site(url, follow: "//li[@class='next']/a/@href") do |doc|
-   doc.quotes.zip(doc.authors).each do |arr|
-     quotes << {
-       quote: arr.first,
-       author: arr.last
-     }
-   end
- end
-
- puts JSON.generate(quotes)
- ```
-
  But what if we want to crawl and store the content in a database, so that it can be searched? Wgit makes it easy to index and search HTML using [MongoDB](https://www.mongodb.com/):

  ```ruby
@@ -97,14 +72,13 @@ include Wgit::DSL
  Wgit.logger.level = Logger::WARN

  connection_string 'mongodb://user:password@localhost/crawler'
- clear_db!
-
- extract :quotes, "//div[@class='quote']/span[@class='text']", singleton: false
- extract :authors, "//div[@class='quote']/span/small", singleton: false

  start 'http://quotes.toscrape.com/tag/humor/'
  follow "//li[@class='next']/a/@href"

+ extract :quotes, "//div[@class='quote']/span[@class='text']", singleton: false
+ extract :authors, "//div[@class='quote']/span/small", singleton: false
+
  index_site
  search 'prejudice'
  ```
@@ -117,10 +91,35 @@ Quotes to Scrape
  http://quotes.toscrape.com/tag/humor/page/2/
  ```

- Using a Mongo DB [client](https://robomongo.org/), we can see that the two webpages have been indexed, along with their extracted *quotes* and *authors*:
+ Using a MongoDB [client](https://robomongo.org/), we can see that the two web pages have been indexed, along with their extracted *quotes* and *authors*:

  ![MongoDBClient](https://raw.githubusercontent.com/michaeltelford/wgit/assets/assets/wgit_mongo_index.png)

+ The [DSL](https://github.com/michaeltelford/wgit/wiki/How-To-Use-The-DSL) makes it easy to write scripts for experimenting with. Wgit's DSL is simply a wrapper around the underlying classes however. For comparison, here is the above example written using the Wgit API *instead of* the DSL:
+
+ ```ruby
+ require 'wgit'
+ require 'json'
+
+ crawler = Wgit::Crawler.new
+ url = Wgit::Url.new('http://quotes.toscrape.com/tag/humor/')
+ quotes = []
+
+ Wgit::Document.define_extractor(:quotes, "//div[@class='quote']/span[@class='text']", singleton: false)
+ Wgit::Document.define_extractor(:authors, "//div[@class='quote']/span/small", singleton: false)
+
+ crawler.crawl_site(url, follow: "//li[@class='next']/a/@href") do |doc|
+   doc.quotes.zip(doc.authors).each do |arr|
+     quotes << {
+       quote: arr.first,
+       author: arr.last
+     }
+   end
+ end
+
+ puts JSON.generate(quotes)
+ ```
+
  ## Why Wgit?

  There are many [other HTML crawlers](https://awesome-ruby.com/#-web-crawling) out there so why use Wgit?
@@ -161,33 +160,27 @@ Only MRI Ruby is tested and supported, but Wgit may work with other Ruby impleme

  Currently, the required MRI Ruby version is:

- `~> 2.5` a.k.a. `>= 2.5 && < 3`
+ `ruby '>= 2.6', '< 4'`

  ### Using Bundler

- Add this line to your application's `Gemfile`:
-
- ```ruby
- gem 'wgit'
- ```
-
- And then execute:
-
- $ bundle
+ $ bundle add wgit

  ### Using RubyGems

  $ gem install wgit

- Verify the install by using the executable (to start an REPL session):
+ ### Verify

  $ wgit

+ Calling the installed executable will start an REPL session.
+
  ## Documentation

  - [Getting Started](https://github.com/michaeltelford/wgit/wiki/Getting-Started)
  - [Wiki](https://github.com/michaeltelford/wgit/wiki)
- - [Yardocs](https://www.rubydoc.info/github/michaeltelford/wgit/master)
+ - [API Yardocs](https://www.rubydoc.info/gems/wgit)
  - [CHANGELOG](https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md)

  ## Executable
data/lib/wgit/base.rb CHANGED
@@ -4,16 +4,25 @@ module Wgit
    class Base
      extend Wgit::DSL

+     # Runs once before the crawl/index is run. Override as needed.
+     def setup; end
+
+     # Runs once after the crawl/index is complete. Override as needed.
+     def teardown; end
+
      # Runs the crawl/index passing each crawled `Wgit::Document` and the given
      # block to the subclass's `#parse` method.
      def self.run(&block)
+       crawl_method = @method || :crawl
        obj = new
+
        unless obj.respond_to?(:parse)
          raise "#{obj.class} must respond_to? #parse(doc, &block)"
        end

-       crawl_method = @method || :crawl
+       obj.setup
        send(crawl_method) { |doc| obj.parse(doc, &block) }
+       obj.teardown

        obj
      end
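
The hook ordering introduced above (`setup`, then one `parse` call per crawled document, then `teardown`) can be sketched without a network connection. This is a minimal illustration only: `SketchBase` and `QuoteCounter` are hypothetical stand-ins, and the `docs` argument replaces the real crawl, which `Wgit::Base.run` performs via `send(crawl_method)`.

```ruby
# Hypothetical stand-in for Wgit::Base, reduced to the hook ordering shown in
# the diff. The real class crawls the web; this one iterates over fake docs.
class SketchBase
  # No-op hooks that subclasses may override.
  def setup; end
  def teardown; end

  def self.run(docs, &block)
    obj = new

    unless obj.respond_to?(:parse)
      raise "#{obj.class} must respond_to? #parse(doc, &block)"
    end

    obj.setup
    docs.each { |doc| obj.parse(doc, &block) }
    obj.teardown

    obj
  end
end

# Hypothetical subclass that records the order in which it is called.
class QuoteCounter < SketchBase
  attr_reader :events

  def setup
    @events = [:setup]
  end

  def parse(doc)
    @events << doc
  end

  def teardown
    @events << :teardown
  end
end

counter = QuoteCounter.run(%w[page1 page2])
p counter.events # [:setup, "page1", "page2", :teardown]
```

Because the hooks default to no-ops, existing subclasses that define only `#parse` should keep working unchanged.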
@@ -91,29 +91,27 @@ module Wgit

      ### DDL ###

-     # Creates the urls and documents collections if they don't already exist.
-     # This method is therefore idempotent.
+     # Creates the 'urls' and 'documents' collections.
      #
      # @return [nil] Always returns nil.
      def create_collections
-       db.client[URLS_COLLECTION].create rescue nil
-       db.client[DOCUMENTS_COLLECTION].create rescue nil
+       @client[URLS_COLLECTION].create
+       @client[DOCUMENTS_COLLECTION].create

        nil
      end

-     # Creates the urls and documents unique 'url' indexes if they don't already
-     # exist. This method is therefore idempotent.
+     # Creates the urls and documents unique 'url' indexes.
      #
      # @return [nil] Always returns nil.
      def create_unique_indexes
        @client[URLS_COLLECTION].indexes.create_one(
          { url: 1 }, name: UNIQUE_INDEX, unique: true
-       ) rescue nil
+       )

        @client[DOCUMENTS_COLLECTION].indexes.create_one(
          { 'url.url' => 1 }, name: UNIQUE_INDEX, unique: true
-       ) rescue nil
+       )

        nil
      end
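
To see why dropping `rescue nil` matters, the following sketch contrasts the two styles. `FailingCollection` is a hypothetical stand-in for a database collection whose server is unreachable; it is not part of Wgit or the Mongo driver.

```ruby
# Hypothetical collection whose #create always fails, standing in for a
# database client that cannot reach its server.
class FailingCollection
  def create
    raise IOError, 'connection refused'
  end
end

collection = FailingCollection.new

# Old style: the rescue modifier swallows any StandardError and yields nil,
# so the caller cannot tell a failure from a success.
masked = (collection.create rescue nil)

# New style: the error propagates, so the caller decides how to handle it.
surfaced =
  begin
    collection.create
  rescue IOError => e
    e.message
  end

p masked   # nil
p surfaced # "connection refused"
```

One consequence of the change is that the two DDL methods are no longer documented as idempotent, so callers that relied on calling them repeatedly may now need their own error handling.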
@@ -186,7 +184,7 @@ module Wgit
        data_hash = model.merge(Wgit::Model.common_update_data)
        result = @client[collection].replace_one(query, data_hash, upsert: true)

-       result.matched_count == 0
+       result.matched_count.zero?
      end

      ### Retrieve Data ###
data/lib/wgit/document.rb CHANGED
@@ -413,6 +413,13 @@ be relative"
        return [] if @links.empty?

        links = @links
+         .map do |link|
+           if link.scheme_relative?
+             link.prefix_scheme(@url.to_scheme.to_sym)
+           else
+             link
+           end
+         end
          .reject { |link| link.relative?(host: @url.to_origin) }
          .map(&:omit_trailing_slash)

data/lib/wgit/indexer.rb CHANGED
@@ -80,8 +80,8 @@ database capacity, exiting.")
          urls_count += write_urls_to_db(ext_links)
        end

-       Wgit.logger.info("Crawled and indexed docs for #{docs_count} url(s) \
- overall for this iteration.")
+       Wgit.logger.info("Crawled and indexed documents for #{docs_count} \
+ url(s) overall for this iteration.")
        Wgit.logger.info("Found and saved #{urls_count} external url(s) for \
  the next iteration.")

@@ -136,8 +136,8 @@ the next iteration.")
          Wgit.logger.info("Found and saved #{num_inserted_urls} external url(s)")
        end

-       Wgit.logger.info("Crawled and indexed #{total_pages_indexed} docs for \
- the site: #{url}")
+       Wgit.logger.info("Crawled and indexed #{total_pages_indexed} documents \
+ for the site: #{url}")

        total_pages_indexed
      end
data/lib/wgit/url.rb CHANGED
@@ -162,6 +162,7 @@ Addressable::URI::InvalidURIError")
        opts = defaults.merge(opts)
        raise 'Url (self) cannot be empty' if empty?

+       return false if scheme_relative?
        return true if @uri.relative?

        # Self is absolute but may be relative to the opts param e.g. host.
@@ -266,26 +267,28 @@ protocol scheme and domain (e.g. http://example.com): #{url}"
      # @return [Wgit::Url] Self in absolute form.
      def make_absolute(doc)
        assert_type(doc, Wgit::Document)
+       raise 'Cannot make absolute when Document @url is not valid' \
+         unless doc.url.valid?
+
+       return prefix_scheme(doc.url.to_scheme&.to_sym) if scheme_relative?

        absolute? ? self : doc.base_url(link: self).concat(self)
      end

-     # Returns self having prefixed a protocol scheme. Doesn't modify receiver.
+     # Returns self having prefixed a scheme/protocol. Doesn't modify receiver.
      # Returns self even if absolute (with scheme); therefore is idempotent.
      #
-     # @param protocol [Symbol] Either :http or :https.
-     # @return [Wgit::Url] Self with a protocol scheme prefix.
-     def prefix_scheme(protocol: :http)
-       return self if absolute?
-
-       case protocol
-       when :http
-         Wgit::Url.new("http://#{url}")
-       when :https
-         Wgit::Url.new("https://#{url}")
-       else
-         raise "protocol must be :http or :https, not :#{protocol}"
+     # @param scheme [Symbol] Either :http or :https.
+     # @return [Wgit::Url] Self with a scheme prefix.
+     def prefix_scheme(scheme = :http)
+       unless %i[http https].include?(scheme)
+         raise "scheme must be :http or :https, not :#{scheme}"
        end
+
+       return self if absolute? && !scheme_relative?
+
+       separator = scheme_relative? ? '' : '//'
+       Wgit::Url.new("#{scheme}:#{separator}#{self}")
      end

      # Returns a Hash containing this Url's instance vars excluding @uri.
@@ -624,31 +627,40 @@ protocol scheme and domain (e.g. http://example.com): #{url}"
        self == '/'
      end

-     alias + concat
-     alias crawled? crawled
-     alias is_relative? relative?
-     alias is_absolute? absolute?
-     alias is_valid? valid?
-     alias is_query? query?
-     alias is_fragment? fragment?
-     alias is_index? index?
-     alias uri to_uri
-     alias url to_url
-     alias scheme to_scheme
-     alias host to_host
-     alias port to_port
-     alias domain to_domain
-     alias brand to_brand
-     alias base to_base
-     alias origin to_origin
-     alias path to_path
-     alias endpoint to_endpoint
-     alias query to_query
-     alias query_hash to_query_hash
-     alias fragment to_fragment
-     alias extension to_extension
-     alias user to_user
-     alias password to_password
-     alias sub_domain to_sub_domain
+     # Returns true if self starts with '//' a.k.a a scheme/protocol relative
+     # path.
+     #
+     # @return [Boolean] True if self starts with '//', false otherwise.
+     def scheme_relative?
+       start_with?('//')
+     end
+
+     alias + concat
+     alias crawled? crawled
+     alias is_relative? relative?
+     alias is_absolute? absolute?
+     alias is_valid? valid?
+     alias is_query? query?
+     alias is_fragment? fragment?
+     alias is_index? index?
+     alias is_scheme_relative? scheme_relative?
+     alias uri to_uri
+     alias url to_url
+     alias scheme to_scheme
+     alias host to_host
+     alias port to_port
+     alias domain to_domain
+     alias brand to_brand
+     alias base to_base
+     alias origin to_origin
+     alias path to_path
+     alias endpoint to_endpoint
+     alias query to_query
+     alias query_hash to_query_hash
+     alias fragment to_fragment
+     alias extension to_extension
+     alias user to_user
+     alias password to_password
+     alias sub_domain to_sub_domain
    end
  end
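
The new `prefix_scheme` logic can be approximated with plain Strings to show how scheme-relative URLs (those starting with `//`) are handled. The two helpers below are hypothetical free functions, not the `Wgit::Url` methods themselves, and the regular expression used as an "already absolute" check is an assumption; the real class relies on `Addressable::URI`.

```ruby
# Plain-String sketch of the new behaviour. These helpers are hypothetical and
# only approximate Wgit::Url, which uses Addressable for real URL parsing.
def scheme_relative?(url)
  url.start_with?('//')
end

def prefix_scheme(url, scheme = :http)
  unless %i[http https].include?(scheme)
    raise ArgumentError, "scheme must be :http or :https, not :#{scheme}"
  end

  # Crude absolute check (an assumption): the URL already carries a scheme.
  return url if url.match?(%r{\A[a-z][a-z0-9+.-]*://}i)

  # A scheme-relative URL already has its '//', so only 'scheme:' is added.
  separator = scheme_relative?(url) ? '' : '//'
  "#{scheme}:#{separator}#{url}"
end

p prefix_scheme('example.com')           # "http://example.com"
p prefix_scheme('//example.com', :https) # "https://example.com"
p prefix_scheme('https://example.com')   # "https://example.com"
```

Note the signature change that the CHANGELOG flags as breaking: a call written as `prefix_scheme(protocol: :https)` against 0.9.0 becomes `prefix_scheme(:https)` from 0.10.0 onwards.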
data/lib/wgit/version.rb CHANGED
@@ -6,7 +6,7 @@
  # @author Michael Telford
  module Wgit
    # The current gem version of Wgit.
-   VERSION = '0.9.0'
+   VERSION = '0.10.3'

    # Returns the current gem version of Wgit as a String.
    def self.version
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: wgit
  version: !ruby/object:Gem::Version
-   version: 0.9.0
+   version: 0.10.3
  platform: ruby
  authors:
  - Michael Telford
- autorequire:
+ autorequire:
  bindir: bin
  cert_chain: []
- date: 2020-07-31 00:00:00.000000000 Z
+ date: 2021-11-25 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: addressable
@@ -241,7 +241,7 @@ metadata:
    source_code_uri: https://github.com/michaeltelford/wgit
    changelog_uri: https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md
    bug_tracker_uri: https://github.com/michaeltelford/wgit/issues
-   documentation_uri: https://www.rubydoc.info/github/michaeltelford/wgit/master
+   documentation_uri: https://www.rubydoc.info/gems/wgit
  allowed_push_host: https://rubygems.org
  post_install_message: Added the 'wgit' executable to $PATH
  rdoc_options: []
@@ -249,17 +249,20 @@ require_paths:
  - lib
  required_ruby_version: !ruby/object:Gem::Requirement
    requirements:
-   - - "~>"
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: '2.6'
+   - - "<"
      - !ruby/object:Gem::Version
-       version: '2.5'
+       version: '4'
  required_rubygems_version: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
      - !ruby/object:Gem::Version
        version: '0'
  requirements: []
- rubygems_version: 3.1.2
- signing_key:
+ rubygems_version: 3.2.22
+ signing_key:
  specification_version: 4
  summary: Wgit is a HTML web crawler, written in Ruby, that allows you to programmatically
    extract the data you want from the web.
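
The effect of the `required_ruby_version` change can be checked with RubyGems' own `Gem::Requirement` class. The version numbers tested below are arbitrary examples chosen for illustration, not versions named in the gemspec.

```ruby
require 'rubygems'

# Constraint in 0.9.0 versus the constraint in 0.10.3.
old_req = Gem::Requirement.new('~> 2.5')
new_req = Gem::Requirement.new('>= 2.6', '< 4')

# Example Ruby versions (illustrative only).
%w[2.5.9 2.6.0 3.0.2 4.0.0].each do |v|
  version = Gem::Version.new(v)
  puts format('%-6s old: %-5s new: %s', v,
              old_req.satisfied_by?(version), new_req.satisfied_by?(version))
end
```

Under these constraints a 2.5.x Ruby satisfies only the old requirement, a 3.x Ruby satisfies only the new one, and 2.6 through 2.7 satisfy both.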