RubyGems - wgit - Versions diffs - 0.9.0 → 0.10.3 - Mend

wgit 0.9.0 → 0.10.3

Files changed (10)

checksums.yaml +4 -4
data/CHANGELOG.md +38 -2
data/README.md +36 -43
data/lib/wgit/base.rb +10 -1
data/lib/wgit/database/database.rb +7 -9
data/lib/wgit/document.rb +7 -0
data/lib/wgit/indexer.rb +4 -4
data/lib/wgit/url.rb +51 -39
data/lib/wgit/version.rb +1 -1
metadata +11 -8

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 07e1146e7ddcbb35abb813ae1461520e581576181750d4b9dc654de3f3375d4c
-  data.tar.gz: 6f43949fcdf13c731362d242110348dd43c5183c10130605c2e022e15cbe8cdb
+  metadata.gz: 720cf6b84698fbd54c109319f05557ee2e29bdbda59ec23278422dc5ddc77f2f
+  data.tar.gz: d4304bce849b404b9d2d7faa4d9a3f7969784f649a83152605b51b2e0bd21ac4
 SHA512:
-  metadata.gz: 7288c42fe7b8598572e8b4c8013f8614bd60caa048474a039d8c9a1f4ae231695148158293730998ac78b1f36a4ccd52c9664be1df0c49e218d740fd881d64c4
-  data.tar.gz: 0e36ea8f76aa41f5576044902cdc3e92c3affeb742c179a2fa5ba2b404ad057dede949b5e767bc09eb771b47bc153cf9462e56d9e5a393a63cb9e120bae870a9
+  metadata.gz: a8743ec17b3caaa9b6c5dd5c9b9b18902561927dfd992003f25db88334cc2b4364a4c6ce2dea34629f801d5d7dbe9761b15e7f2f034e00ba526db36ce828dcaf
+  data.tar.gz: 00cf954a86c8b0d96f2e694359c1c75e3193e0e6d146ffba19b3857bef4c15ca93d25f1310ebebf815de8da93ede1b97e325dc54aade699219b9ab35f2976e49

data/CHANGELOG.md CHANGED

@@ -9,6 +9,42 @@
 - ...
 ---
+## v0.10.3
+### Added
+- ...
+### Changed/Removed
+- Changed `Database#create_collections` and `#create_unique_indexes` by removing `rescue nil` from their database operations. Now any underlying errors with the database client are not masked.
+### Fixed
+- ...
+---
+## v0.10.2
+### Added
+- `Wgit::Base#setup` and `#teardown` methods (lifecycle hooks) that can be overridden by subclasses.
+### Changed/Removed
+- ...
+### Fixed
+- ...
+---
+## v0.10.1
+### Added
+- Support for Ruby 3.
+### Changed/Removed
+- Removed support for Ruby 2.5 (as it's too old).
+### Fixed
+- ...
+---
+## v0.10.0
+### Added
+- `Wgit::Url#scheme_relative?` method.
+### Changed/Removed
+- Breaking change: Changed method signature of `Wgit::Url#prefix_scheme` by making the previously named parameter a defaulted positional parameter. Remove the `protocol` named parameter for the old behaviour.
+### Fixed
+- [Scheme-relative bug](https://github.com/michaeltelford/wgit/issues/10) by adding support for scheme-relative URL's.
+---
 ## v0.9.0
 This release is a big one with the introduction of a `Wgit::DSL` and Javascript parse support. The `README` has been revamped as a result with new usage examples. And all of the wiki articles have been updated to reflect the latest code base.
 ### Added
@@ -112,7 +148,7 @@ This release is a big one with the introduction of a `Wgit::DSL` and Javascript
 - `Wgit::Response` class containing adapter agnostic HTTP response logic.
 ### Changed/Removed
 - Breaking changes: Removed `Wgit::Document#date_crawled` and `#crawl_duration` because both of these methods exist on the `Wgit::Document#url`. Instead, use `doc.url.date_crawled` etc.
-- Breaking changes: Added to and moved `Document.define_extension` block params, it's now `|value, source, type|`. The `source` is not what it used to be; it's now `type` - of either `:document` or `:object`. Confused? See the [docs](https://www.rubydoc.info/github/michaeltelford/wgit/master).
+- Breaking changes: Added to and moved `Document.define_extension` block params, it's now `|value, source, type|`. The `source` is not what it used to be; it's now `type` - of either `:document` or `:object`. Confused? See the [docs](https://www.rubydoc.info/gems/wgit).
 - Breaking changes: Changed `Wgit::Url#prefix_protocol` so that it no longer modifies the receiver.
 - Breaking changes: Updated `Wgit::Url#to_anchor` and `#to_query` logic to align with that of `Addressable::URI` e.g. the anchor value no longer contains `#` prefix; and the query value no longer contains `?` prefix.
 - Breaking changes: Renamed `Wgit::Url` methods containing `anchor` to now be named `fragment` e.g. `to_anchor` is now called `to_fragment` and `without_anchor` is `without_fragment` etc.
@@ -160,7 +196,7 @@ This release is a big one with the introduction of a `Wgit::DSL` and Javascript
 ---
 ## v0.2.0
-This version of Wgit see's a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes are below including how to fix any breakages; but if you're having issues with the upgrade see the documentation at: https://www.rubydoc.info/github/michaeltelford/wgit/master
+This version of Wgit see's a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes are below including how to fix any breakages; but if you're having issues with the upgrade see the documentation at: https://www.rubydoc.info/gems/wgit
 ### Added
 - `Wgit::Url#absolute?` method.
 - `Wgit::Url#relative? base: url` support.

data/README.md CHANGED

@@ -10,7 +10,7 @@
 Wgit is a HTML web crawler, written in Ruby, that allows you to programmatically extract the data you want from the web.
-Wgit was primarily designed to crawl static HTML websites to index and search their content - providing the basis of any search engine; but Wgit is suitable for many application domains including:
+Wgit was primarily designed to crawl static HTML websites to index and  search their content - providing the basis of any search engine; but Wgit is suitable for many application domains including:
 - URL parsing
 - Document content extraction (data mining)
@@ -62,31 +62,6 @@ end
 puts JSON.generate(quotes)
 ```
-The [DSL](https://github.com/michaeltelford/wgit/wiki/How-To-Use-The-DSL) makes it easy to write scripts for experimenting with. Wgit's DSL is simply a wrapper around the underlying classes however. For comparison, here is the above example written using the Wgit API *instead of* the DSL:
-```ruby
-require 'wgit'
-require 'json'
-crawler = Wgit::Crawler.new
-url     = Wgit::Url.new('http://quotes.toscrape.com/tag/humor/')
-quotes  = []
-Wgit::Document.define_extractor(:quotes,  "//div[@class='quote']/span[@class='text']", singleton: false)
-Wgit::Document.define_extractor(:authors, "//div[@class='quote']/span/small",          singleton: false)
-crawler.crawl_site(url, follow: "//li[@class='next']/a/@href") do |doc|
-  doc.quotes.zip(doc.authors).each do |arr|
-    quotes << {
-      quote:  arr.first,
-      author: arr.last
-    }
-  end
-end
-puts JSON.generate(quotes)
-```
 But what if we want to crawl and store the content in a database, so that it can be searched? Wgit makes it easy to index and search HTML using [MongoDB](https://www.mongodb.com/):
 ```ruby
@@ -97,14 +72,13 @@ include Wgit::DSL
 Wgit.logger.level = Logger::WARN
 connection_string 'mongodb://user:password@localhost/crawler'
-clear_db!
-extract :quotes,  "//div[@class='quote']/span[@class='text']", singleton: false
-extract :authors, "//div[@class='quote']/span/small",          singleton: false
 start  'http://quotes.toscrape.com/tag/humor/'
 follow "//li[@class='next']/a/@href"
+extract :quotes,  "//div[@class='quote']/span[@class='text']", singleton: false
+extract :authors, "//div[@class='quote']/span/small",          singleton: false
 index_site
 search 'prejudice'
 ```
@@ -117,10 +91,35 @@ Quotes to Scrape
 http://quotes.toscrape.com/tag/humor/page/2/
 ```
-Using a Mongo DB [client](https://robomongo.org/), we can see that the two webpages have been indexed, along with their extracted *quotes* and *authors*:
+Using a MongoDB [client](https://robomongo.org/), we can see that the two web pages have been indexed, along with their extracted *quotes* and *authors*:
 ![MongoDBClient](https://raw.githubusercontent.com/michaeltelford/wgit/assets/assets/wgit_mongo_index.png)
+The [DSL](https://github.com/michaeltelford/wgit/wiki/How-To-Use-The-DSL) makes it easy to write scripts for experimenting with. Wgit's DSL is simply a wrapper around the underlying classes however. For comparison, here is the above example written using the Wgit API *instead of* the DSL:
+```ruby
+require 'wgit'
+require 'json'
+crawler = Wgit::Crawler.new
+url     = Wgit::Url.new('http://quotes.toscrape.com/tag/humor/')
+quotes  = []
+Wgit::Document.define_extractor(:quotes,  "//div[@class='quote']/span[@class='text']", singleton: false)
+Wgit::Document.define_extractor(:authors, "//div[@class='quote']/span/small",          singleton: false)
+crawler.crawl_site(url, follow: "//li[@class='next']/a/@href") do |doc|
+  doc.quotes.zip(doc.authors).each do |arr|
+    quotes << {
+      quote:  arr.first,
+      author: arr.last
+    }
+  end
+end
+puts JSON.generate(quotes)
+```
 ## Why Wgit?
 There are many [other HTML crawlers](https://awesome-ruby.com/#-web-crawling) out there so why use Wgit?
@@ -161,33 +160,27 @@ Only MRI Ruby is tested and supported, but Wgit may work with other Ruby impleme
 Currently, the required MRI Ruby version is:
-`~> 2.5` a.k.a. `>= 2.5 && < 3`
+`ruby '>= 2.6', '< 4'`
 ### Using Bundler
-Add this line to your application's `Gemfile`:
-```ruby
-gem 'wgit'
-```
-And then execute:
-    $ bundle
+    $ bundle add wgit
 ### Using RubyGems
     $ gem install wgit
-Verify the install by using the executable (to start an REPL session):
+### Verify
     $ wgit
+Calling the installed executable will start an REPL session.
 ## Documentation
 - [Getting Started](https://github.com/michaeltelford/wgit/wiki/Getting-Started)
 - [Wiki](https://github.com/michaeltelford/wgit/wiki)
-- [Yardocs](https://www.rubydoc.info/github/michaeltelford/wgit/master)
+- [API Yardocs](https://www.rubydoc.info/gems/wgit)
 - [CHANGELOG](https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md)
 ## Executable

data/lib/wgit/base.rb CHANGED

@@ -4,16 +4,25 @@ module Wgit
   class Base
     extend Wgit::DSL
+    # Runs once before the crawl/index is run. Override as needed.
+    def setup; end
+    # Runs once after the crawl/index is complete. Override as needed.
+    def teardown; end
     # Runs the crawl/index passing each crawled `Wgit::Document` and the given
     # block to the subclass's `#parse` method.
     def self.run(&block)
+      crawl_method = @method || :crawl
       obj = new
       unless obj.respond_to?(:parse)
         raise "#{obj.class} must respond_to? #parse(doc, &block)"
       end
-      crawl_method = @method || :crawl
+      obj.setup
       send(crawl_method) { |doc| obj.parse(doc, &block) }
+      obj.teardown
       obj
     end
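
The hook ordering introduced in this hunk can be sketched without the gem. The snippet below is a minimal stand-in, not Wgit's implementation: `SketchBase`, `QuoteCounter` and the `docs` array are hypothetical names used only for illustration, and no crawling takes place. It shows `setup` running once before the per-document `parse` calls and `teardown` running once after them.

```ruby
# Hypothetical stand-in for Wgit::Base: no crawling, only the hook ordering.
class SketchBase
  # No-op hooks; subclasses override them as needed.
  def setup; end
  def teardown; end

  # Mirrors the order shown in the diff: setup, one parse per document, teardown.
  def self.run(docs)
    obj = new
    raise "#{obj.class} must respond_to? #parse(doc)" unless obj.respond_to?(:parse)

    obj.setup
    docs.each { |doc| obj.parse(doc) }
    obj.teardown
    obj
  end
end

# Hypothetical subclass that records the order in which it is called.
class QuoteCounter < SketchBase
  attr_reader :events

  def setup
    @events = [:setup]
  end

  def parse(doc)
    @events << "parse #{doc}"
  end

  def teardown
    @events << :teardown
  end
end

counter = QuoteCounter.run(%w[page1 page2])
p counter.events
```

Since the hunk shows no `ensure` around the crawl, `teardown` appears to be skipped when the crawl or `parse` raises; the sketch behaves the same way.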

data/lib/wgit/database/database.rb CHANGED

@@ -91,29 +91,27 @@ module Wgit
     ### DDL ###
-    # Creates the urls and documents collections if they don't already exist.
-    # This method is therefore idempotent.
+    # Creates the 'urls' and 'documents' collections.
     #
     # @return [nil] Always returns nil.
     def create_collections
-      db.client[URLS_COLLECTION].create rescue nil
-      db.client[DOCUMENTS_COLLECTION].create rescue nil
+      @client[URLS_COLLECTION].create
+      @client[DOCUMENTS_COLLECTION].create
       nil
     end
-    # Creates the urls and documents unique 'url' indexes if they don't already
-    # exist. This method is therefore idempotent.
+    # Creates the urls and documents unique 'url' indexes.
     #
     # @return [nil] Always returns nil.
     def create_unique_indexes
       @client[URLS_COLLECTION].indexes.create_one(
         { url: 1 }, name: UNIQUE_INDEX, unique: true
-      ) rescue nil
+      )
       @client[DOCUMENTS_COLLECTION].indexes.create_one(
         { 'url.url' => 1 }, name: UNIQUE_INDEX, unique: true
-      ) rescue nil
+      )
       nil
     end
@@ -186,7 +184,7 @@ module Wgit
       data_hash = model.merge(Wgit::Model.common_update_data)
       result = @client[collection].replace_one(query, data_hash, upsert: true)
-      result.matched_count == 0
+      result.matched_count.zero?
     end
     ### Retrieve Data ###
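
The effect of dropping the `rescue nil` modifier can be illustrated in plain Ruby. This is a sketch under assumptions: `create_collection` is a hypothetical method that always fails, standing in for a database client call, and `IOError` is an arbitrary choice of error class.

```ruby
# Hypothetical stand-in for a database call that fails.
def create_collection
  raise IOError, 'connection refused'
end

# Old style: the rescue modifier swallows any StandardError and yields nil.
masked = (create_collection rescue nil)

# New style: the error propagates, so the caller decides how to handle it.
surfaced =
  begin
    create_collection
  rescue IOError => e
    e.message
  end

p masked
p surfaced
```

With the modifier gone, callers of `create_collections` and `create_unique_indexes` presumably need their own error handling, which matches the removal of the "idempotent" wording from the method comments.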

data/lib/wgit/document.rb CHANGED

@@ -413,6 +413,13 @@ be relative"
       return [] if @links.empty?
       links = @links
+              .map do |link|
+                if link.scheme_relative?
+                  link.prefix_scheme(@url.to_scheme.to_sym)
+                else
+                  link
+                end
+              end
               .reject { |link| link.relative?(host: @url.to_origin) }
               .map(&:omit_trailing_slash)

data/lib/wgit/indexer.rb CHANGED

@@ -80,8 +80,8 @@ database capacity, exiting.")
           urls_count += write_urls_to_db(ext_links)
         end
-        Wgit.logger.info("Crawled and indexed docs for #{docs_count} url(s) \
-overall for this iteration.")
+        Wgit.logger.info("Crawled and indexed documents for #{docs_count} \
+url(s) overall for this iteration.")
         Wgit.logger.info("Found and saved #{urls_count} external url(s) for \
 the next iteration.")
@@ -136,8 +136,8 @@ the next iteration.")
         Wgit.logger.info("Found and saved #{num_inserted_urls} external url(s)")
       end
-      Wgit.logger.info("Crawled and indexed #{total_pages_indexed} docs for \
-the site: #{url}")
+      Wgit.logger.info("Crawled and indexed #{total_pages_indexed} documents \
+for the site: #{url}")
       total_pages_indexed
     end

data/lib/wgit/url.rb CHANGED

@@ -162,6 +162,7 @@ Addressable::URI::InvalidURIError")
       opts = defaults.merge(opts)
       raise 'Url (self) cannot be empty' if empty?
+      return false if scheme_relative?
       return true if @uri.relative?
       # Self is absolute but may be relative to the opts param e.g. host.
@@ -266,26 +267,28 @@ protocol scheme and domain (e.g. http://example.com): #{url}"
     # @return [Wgit::Url] Self in absolute form.
     def make_absolute(doc)
       assert_type(doc, Wgit::Document)
+      raise 'Cannot make absolute when Document @url is not valid' \
+      unless doc.url.valid?
+      return prefix_scheme(doc.url.to_scheme&.to_sym) if scheme_relative?
       absolute? ? self : doc.base_url(link: self).concat(self)
     end
-    # Returns self having prefixed a protocol scheme. Doesn't modify receiver.
+    # Returns self having prefixed a scheme/protocol. Doesn't modify receiver.
     # Returns self even if absolute (with scheme); therefore is idempotent.
     #
-    # @param protocol [Symbol] Either :http or :https.
-    # @return [Wgit::Url] Self with a protocol scheme prefix.
-    def prefix_scheme(protocol: :http)
-      return self if absolute?
-      case protocol
-      when :http
-        Wgit::Url.new("http://#{url}")
-      when :https
-        Wgit::Url.new("https://#{url}")
-      else
-        raise "protocol must be :http or :https, not :#{protocol}"
+    # @param scheme [Symbol] Either :http or :https.
+    # @return [Wgit::Url] Self with a scheme prefix.
+    def prefix_scheme(scheme = :http)
+      unless %i[http https].include?(scheme)
+        raise "scheme must be :http or :https, not :#{scheme}"
       end
+      return self if absolute? && !scheme_relative?
+      separator = scheme_relative? ? '' : '//'
+      Wgit::Url.new("#{scheme}:#{separator}#{self}")
     end
     # Returns a Hash containing this Url's instance vars excluding @uri.
@@ -624,31 +627,40 @@ protocol scheme and domain (e.g. http://example.com): #{url}"
       self == '/'
     end
-    alias +            concat
-    alias crawled?     crawled
-    alias is_relative? relative?
-    alias is_absolute? absolute?
-    alias is_valid?    valid?
-    alias is_query?    query?
-    alias is_fragment? fragment?
-    alias is_index?    index?
-    alias uri          to_uri
-    alias url          to_url
-    alias scheme       to_scheme
-    alias host         to_host
-    alias port         to_port
-    alias domain       to_domain
-    alias brand        to_brand
-    alias base         to_base
-    alias origin       to_origin
-    alias path         to_path
-    alias endpoint     to_endpoint
-    alias query        to_query
-    alias query_hash   to_query_hash
-    alias fragment     to_fragment
-    alias extension    to_extension
-    alias user         to_user
-    alias password     to_password
-    alias sub_domain   to_sub_domain
+    # Returns true if self starts with '//' a.k.a a scheme/protocol relative
+    # path.
+    #
+    # @return [Boolean] True if self starts with '//', false otherwise.
+    def scheme_relative?
+      start_with?('//')
+    end
+    alias +                   concat
+    alias crawled?            crawled
+    alias is_relative?        relative?
+    alias is_absolute?        absolute?
+    alias is_valid?           valid?
+    alias is_query?           query?
+    alias is_fragment?        fragment?
+    alias is_index?           index?
+    alias is_scheme_relative? scheme_relative?
+    alias uri                 to_uri
+    alias url                 to_url
+    alias scheme              to_scheme
+    alias host                to_host
+    alias port                to_port
+    alias domain              to_domain
+    alias brand               to_brand
+    alias base                to_base
+    alias origin              to_origin
+    alias path                to_path
+    alias endpoint            to_endpoint
+    alias query               to_query
+    alias query_hash          to_query_hash
+    alias fragment            to_fragment
+    alias extension           to_extension
+    alias user                to_user
+    alias password            to_password
+    alias sub_domain          to_sub_domain
   end
 end
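
The new `prefix_scheme` logic can be approximated on plain Strings. This is a rough sketch, not `Wgit::Url` itself: the top-level `scheme_relative?` and `prefix_scheme` helpers are hypothetical, and the regular expression used to detect an existing scheme is a simplifying assumption in place of the gem's `absolute?` check.

```ruby
# Plain-String sketch of the logic in the hunk above; not Wgit::Url itself.
def scheme_relative?(url)
  url.start_with?('//')
end

def prefix_scheme(url, scheme = :http)
  unless %i[http https].include?(scheme)
    raise "scheme must be :http or :https, not :#{scheme}"
  end
  # Assumption: anything that already carries a scheme counts as absolute.
  return url if url.match?(%r{\A[a-z][a-z0-9+.-]*://}i)

  # A scheme-relative URL already starts with '//', so only 'scheme:' is added.
  separator = scheme_relative?(url) ? '' : '//'
  "#{scheme}:#{separator}#{url}"
end

puts prefix_scheme('//example.com/about', :https)
puts prefix_scheme('example.com')
puts prefix_scheme('https://example.com', :http)
```

Against the real gem, the breaking change in v0.10.0 means a call written as `prefix_scheme(protocol: :https)` becomes `prefix_scheme(:https)`.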

data/lib/wgit/version.rb CHANGED

@@ -6,7 +6,7 @@
 # @author Michael Telford
 module Wgit
   # The current gem version of Wgit.
-  VERSION = '0.9.0'
+  VERSION = '0.10.3'
   # Returns the current gem version of Wgit as a String.
   def self.version

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: wgit
 version: !ruby/object:Gem::Version
-  version: 0.9.0
+  version: 0.10.3
 platform: ruby
 authors:
 - Michael Telford
-autorequire:
+autorequire:
 bindir: bin
 cert_chain: []
-date: 2020-07-31 00:00:00.000000000 Z
+date: 2021-11-25 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: addressable
@@ -241,7 +241,7 @@ metadata:
   source_code_uri: https://github.com/michaeltelford/wgit
   changelog_uri: https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md
   bug_tracker_uri: https://github.com/michaeltelford/wgit/issues
-  documentation_uri: https://www.rubydoc.info/github/michaeltelford/wgit/master
+  documentation_uri: https://www.rubydoc.info/gems/wgit
   allowed_push_host: https://rubygems.org
 post_install_message: Added the 'wgit' executable to $PATH
 rdoc_options: []
@@ -249,17 +249,20 @@ require_paths:
 - lib
 required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
-  - - "~>"
+  - - ">="
+    - !ruby/object:Gem::Version
+      version: '2.6'
+  - - "<"
     - !ruby/object:Gem::Version
-      version: '2.5'
+      version: '4'
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.1.2
-signing_key:
+rubygems_version: 3.2.22
+signing_key:
 specification_version: 4
 summary: Wgit is a HTML web crawler, written in Ruby, that allows you to programmatically
   extract the data you want from the web.
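
The change to `required_ruby_version` can be checked with RubyGems' own `Gem::Requirement` class. The Ruby version numbers below are arbitrary samples picked for illustration; only the two requirement strings come from the diff.

```ruby
require 'rubygems'

old_req = Gem::Requirement.new('~> 2.5')        # 0.9.0: >= 2.5 and < 3.0
new_req = Gem::Requirement.new('>= 2.6', '< 4') # 0.10.3

# Sample Ruby versions, chosen only to illustrate the boundaries.
results = %w[2.5.9 2.7.4 3.0.2].map do |v|
  version = Gem::Version.new(v)
  [v, old_req.satisfied_by?(version), new_req.satisfied_by?(version)]
end

results.each do |v, old_ok, new_ok|
  puts "#{v}: old=#{old_ok} new=#{new_ok}"
end
```

The results line up with the CHANGELOG entries for v0.10.1: Ruby 2.5 no longer satisfies the requirement, while Ruby 3 now does.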