wgit 0.10.7 → 0.11.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +44 -1
- data/CONTRIBUTING.md +1 -1
- data/README.md +22 -2
- data/bin/wgit +3 -1
- data/lib/wgit/assertable.rb +2 -2
- data/lib/wgit/crawler.rb +56 -34
- data/lib/wgit/database/database.rb +64 -52
- data/lib/wgit/document.rb +67 -39
- data/lib/wgit/document_extractors.rb +15 -1
- data/lib/wgit/dsl.rb +16 -20
- data/lib/wgit/indexer.rb +157 -63
- data/lib/wgit/logger.rb +1 -1
- data/lib/wgit/response.rb +21 -6
- data/lib/wgit/robots_parser.rb +193 -0
- data/lib/wgit/url.rb +118 -51
- data/lib/wgit/utils.rb +81 -28
- data/lib/wgit/version.rb +1 -1
- data/lib/wgit.rb +1 -0
- metadata +33 -38
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 5e80ab519d9f55f759df4a97873a3cb33be37f3270e408ad8f3f1cf96bd762bc
+  data.tar.gz: baf78c4fe1e30d49847dd44f1c4f3a05104db3ee57886cc2da329e3ad17ddd4f
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 4701b737d24d38b3a9cc27b4524556487c22ce135bfb63b162e75c8f08175b6558b2c96eca7a022bf368aef3a8d37d9d072ebbf75bdd80a29bf7c996795a406c
+  data.tar.gz: 026206a073b0e3465778db5e5a430e2830c64204d6d18ff38e049bfa3668dd62330b031367fb447e7554b0336e41893ea96eca52a844cdd7a5e94845b4e3a3b8
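A `.gem` file is itself a tar archive whose `metadata.gz` and `data.tar.gz` entries are what `checksums.yaml` above digests (the old hash values were not preserved in this diff). A minimal Ruby sketch to recompute the SHA256 values for comparison; the local filename is illustrative:

```ruby
require 'digest'
require 'rubygems/package'

# Recompute the SHA256 of each entry inside the downloaded gem and
# compare against the values listed in checksums.yaml above.
File.open('wgit-0.11.0.gem', 'rb') do |io|
  Gem::Package::TarReader.new(io).each do |entry|
    next unless %w[metadata.gz data.tar.gz].include?(entry.full_name)

    puts "#{entry.full_name}: #{Digest::SHA256.hexdigest(entry.read)}"
  end
end
```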
data/CHANGELOG.md
CHANGED
@@ -1,6 +1,6 @@
 # Wgit Change Log
 
-## v0.0.0 (TEMPLATE - DO NOT EDIT)
+## v0.0.0 [- BREAKING CHANGES] (TEMPLATE - DO NOT EDIT)
 ### Added
 - ...
 ### Changed/Removed
@@ -9,6 +9,49 @@
 - ...
 ---
 
+## v0.11.0 - BREAKING CHANGES
+This release is a biggie, with the main headline being the introduction of robots.txt support (see below). This release introduces several breaking changes, so take care when updating your current version of Wgit.
+### Added
+- Ability to prevent indexing via `robots.txt`, `noindex` values in HTML `meta` elements and the HTTP response header `X-Robots-Tag`. See the new class `Wgit::RobotsParser` and the updated `Wgit::Indexer#index_*` methods. Also see the [wiki article](https://github.com/michaeltelford/wgit/wiki/How-To-Prevent-Indexing) on the subject.
+- `Wgit::RobotsParser` class for parsing `robots.txt` files.
+- `Wgit::Response#no_index?` and `Wgit::Document#no_index?` methods (see the wiki article above).
+- Two new default extractors which extract robots meta elements for use in `Wgit::Document#no_index?`.
+- `Wgit::Document.to_h_ignore_vars` Array for user manipulation.
+- `Wgit::Utils.pprint` method to aid debugging.
+- `Wgit::Utils.sanitize_url` method.
+- `Wgit::Indexer#index_www(max_urls_per_iteration:, ...)` param.
+- `Wgit::Url#redirects` and `#redirects=` methods.
+- `Wgit::Url#redirects_journey`, used by `Wgit::Indexer` to insert a Url and its redirects.
+- `Wgit::Database#bulk_upsert`, which `Wgit::Indexer` now uses where possible. This reduces the total number of database calls made during an index operation.
+### Changed/Removed
+- Updated `Wgit::Indexer#index_*` methods to honour index prevention methods (see the [wiki article](https://github.com/michaeltelford/wgit/wiki/How-To-Prevent-Indexing)).
+- Updated `Wgit::Utils.sanitize*` methods so they no longer modify the receiver.
+- Updated `Wgit::Crawler#crawl_url` to always return the crawled `Wgit::Document`. If relying on `nil` in your code, you should now use `doc.empty?` instead.
+- Updated `Wgit::Indexer` method logs.
+- Updated/added custom class `#inspect` methods.
+- Renamed `Wgit::Utils.printf_search_results` to `pprint_search_results`.
+- Renamed `Wgit::Url#concat` to `#join`. The `#concat` method is now `String#concat`.
+- Updated `Wgit::Indexer` methods to now write external Urls to the Database as `doc.external_urls.map(&:to_origin)`, meaning `http://example.com/about` becomes `http://example.com`.
+- Updated the following methods to no longer omit trailing slashes from Urls: `Wgit::Url` - `#to_path`, `#omit_base`, `#omit_origin` and `Wgit::Document` - `#internal_links`, `#internal_absolute_links`, `#external_links`. For an average website, this results in ~30% fewer network requests when crawling.
+- Updated Ruby version to `3.3.0`.
+- Updated all bundle dependencies to their latest versions; see `Gemfile.lock` for exact versions.
+### Fixed
+- `Wgit::Crawler#crawl_site` now internally records all redirects for a given Url.
+- `Wgit::Crawler#crawl_site` infinite loop when using Wgit on a Ruby version > `3.0.2`.
+- Various other minor fixes/improvements throughout the code base.
+---
+
+## v0.10.8
+### Added
+- Custom `#inspect` methods for the `Wgit::Url` and `Wgit::Document` classes.
+- `Document.remove_extractors` method, which removes all default and defined extractors.
+
+### Changed/Removed
+- ...
+### Fixed
+- ...
+---
+
 ## v0.10.7
 ### Added
 - ...
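Of the additions above, the robots.txt support is the headline. A rough sketch of how it might be used directly; the `Wgit::RobotsParser` accessor names below are assumptions, since the class itself isn't shown in this diff:

```ruby
require 'wgit'

url     = Wgit::Url.new('http://quotes.toscrape.com')
crawler = Wgit::Crawler.new

# Fetch and parse the site's robots.txt manually. Note that the
# Wgit::Indexer#index_* methods now do all of this for you.
robots = crawler.crawl_url(url.join('/robots.txt'))
parser = Wgit::RobotsParser.new(robots.html) # Accessor names assumed.

unless parser.no_index?
  crawler.crawl_site(url, disallow_paths: parser.disallow_paths) do |doc|
    puts doc.url unless doc.empty? || doc.no_index?
  end
end
```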
data/CONTRIBUTING.md
CHANGED
@@ -12,7 +12,7 @@ Before you make a contribution, reach out to michael.telford@live.com about what
 - Write some code
 - Re-run the tests (which now hopefully pass)
 - Push your branch to your `origin` remote
-- Open a GitHub Pull Request (with the target branch
+- Open a GitHub Pull Request (with the target branch as wgit's (upstream) `master`)
 - Apply any requested changes
 - Wait for your PR to be merged
 
data/README.md
CHANGED
@@ -62,7 +62,23 @@ end
 puts JSON.generate(quotes)
 ```
 
-
+Which outputs:
+
+```text
+[
+  {
+    "quote": "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”",
+    "author": "Jane Austen"
+  },
+  {
+    "quote": "“A day without sunshine is like, you know, night.”",
+    "author": "Steve Martin"
+  },
+  ...
+]
+```
+
 Great! But what if we want to crawl and store the content in a database, so that it can be searched? Wgit makes it easy to index and search HTML using [MongoDB](https://www.mongodb.com/):
 
 ```ruby
 require 'wgit'
@@ -89,6 +105,8 @@ The `search` call (on the last line) will return and output the results:
 Quotes to Scrape
 “I am free of all prejudice. I hate everyone equally. ”
 http://quotes.toscrape.com/tag/humor/page/2/
+
+...
 ```
 
 Using a MongoDB [client](https://robomongo.org/), we can see that the two web pages have been indexed, along with their extracted *quotes* and *authors*:
@@ -146,6 +164,8 @@ indexer = Wgit::Indexer.new
 indexer.index_site(wiki, **opts)
 ```
 
+- Wgit's built-in indexing methods will, by default, honour a site's `robots.txt` rules. There's also a handy robots.txt parser that you can use in your own code.
+
 ## Why Not Wgit?
 
 So why might you not use Wgit, I hear you ask?
@@ -219,7 +239,7 @@ And you're good to go!
 
 ### Tooling
 
-Wgit uses the [`toys`](https://github.com/dazuma/toys) gem (instead of Rake) for task invocation. For a full list of available tasks a.k.a. tools, run `toys --tools`. You can search for a tool using `toys -s tool_name`. The most commonly used tools are listed below...
+Wgit uses the [`toys`](https://github.com/dazuma/toys) gem (instead of Rake) for task invocation. Always run `toys` as `bundle exec toys`. For a full list of available tasks a.k.a. tools, run `toys --tools`. You can search for a tool using `toys -s tool_name`. The most commonly used tools are listed below...
 
 Run `toys db` to see a list of database related tools, enabling you to run a Mongo DB instance locally using Docker. Run `toys test` to execute the tests.
 
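Tying the README's indexing and searching together, a rough end-to-end sketch. It assumes a local MongoDB instance and the `WGIT_CONNECTION_STRING` env var that wgit's docs describe; the `#db` accessor on the indexer is also an assumption, not shown in this diff:

```ruby
require 'wgit'

# Assumes MongoDB is running locally, e.g. started via the `toys db` tooling.
ENV['WGIT_CONNECTION_STRING'] ||= 'mongodb://localhost:27017/wgit'

indexer = Wgit::Indexer.new
indexer.index_site(Wgit::Url.new('http://quotes.toscrape.com'))

# Search the indexed pages, printing each matching document's URL.
indexer.db.search('prejudice') { |doc| puts doc.url }
```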
data/bin/wgit
CHANGED
@@ -5,6 +5,7 @@ require 'wgit'
 # Eval .wgit.rb file (if it exists somewhere).
 def eval_wgit(filepath = nil)
   puts 'Searching for .wgit.rb file in local and home directories...'
+  success = false
 
   [filepath, Dir.pwd, Dir.home].each do |dir|
     path = "#{dir}/.wgit.rb"
@@ -13,11 +14,12 @@ def eval_wgit(filepath = nil)
     puts "Eval'ing #{path}"
     puts 'Call `eval_wgit` after changes to re-eval the file'
     eval(File.read(path))
+    success = true
 
     break
   end
 
-
+  success
 end
 
 eval_wgit
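For context, `.wgit.rb` is an ordinary Ruby file that the executable above evals on startup (now reporting, via the boolean return value, whether one was found). A minimal illustrative example of such a file; its contents are entirely hypothetical:

```ruby
# ~/.wgit.rb - eval'd by the `wgit` executable on startup.
# Define any helpers you want available in your wgit session.
def my_crawler
  @my_crawler ||= Wgit::Crawler.new(timeout: 10)
end

puts 'Loaded custom wgit helpers'
```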
data/lib/wgit/assertable.rb
CHANGED
data/lib/wgit/crawler.rb
CHANGED
@@ -5,7 +5,6 @@ require_relative 'document'
 require_relative 'utils'
 require_relative 'assertable'
 require_relative 'response'
-require 'set'
 require 'benchmark'
 require 'typhoeus'
 require 'ferrum'
@@ -70,6 +69,8 @@ module Wgit
     # @param parse_javascript [Boolean] Whether or not to parse the Javascript
     #   of the crawled document. Parsing requires Chrome/Chromium to be
     #   installed and in $PATH.
+    # @param parse_javascript_delay [Integer] The delay time given to a page's
+    #   JS to update the DOM. After the delay, the HTML is crawled.
     def initialize(redirect_limit: 5, timeout: 5, encode: true,
                    parse_javascript: false, parse_javascript_delay: 1)
       @redirect_limit = redirect_limit
@@ -86,8 +87,6 @@ module Wgit
     #
     # Use the allow and disallow paths params to partially and selectively
     # crawl a site; the glob syntax is fully supported e.g. `'wiki/\*'` etc.
-    # Note that each path must NOT start with a slash; the only exception being
-    # a `/` on its own with no other characters, referring to the index page.
     #
     # Only redirects to the same host are followed. For example, the Url
     # 'http://www.example.co.uk/how' has a host of 'www.example.co.uk' meaning
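A short illustration of the allow/disallow glob filtering described in the doc comments above; the site and glob patterns are made up:

```ruby
require 'wgit'

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://github.com/michaeltelford/wgit/wiki')

# Only follow links matching the allow glob; drop any matching the disallow.
externals = crawler.crawl_site(
  url,
  allow_paths: 'michaeltelford/wgit/wiki/*',
  disallow_paths: '*/_history'
) do |doc|
  puts doc.url unless doc.empty?
end

puts "#{externals.size} external link(s) found"
```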
@@ -118,37 +117,35 @@
       url, follow: :default, allow_paths: nil, disallow_paths: nil, &block
     )
       doc = crawl_url(url, &block)
-      return nil if doc.nil?
+      return nil if doc.empty?
 
-      link_opts = {
-        xpath: follow,
-        allow_paths: allow_paths,
-        disallow_paths: disallow_paths
-      }
-      alt_url = url.end_with?('/') ? url.chop : url + '/'
+      total_pages = 1
+      link_opts = { xpath: follow, allow_paths:, disallow_paths: }
 
-      crawled   = Set.new([url, alt_url])
+      crawled   = Set.new(url.redirects_journey)
       externals = Set.new(doc.external_links)
       internals = Set.new(next_internal_links(doc, **link_opts))
 
       return externals.to_a if internals.empty?
 
       loop do
-        links = internals - crawled
+        links = subtract_links(internals, crawled)
         break if links.empty?
 
         links.each do |link|
-          orig_link = link.dup
           doc = crawl_url(link, follow_redirects: :host, &block)
 
-          crawled += [orig_link, link]
-          next if doc.nil?
+          crawled += link.redirects_journey
+          next if doc.empty?
 
-          internals += next_internal_links(doc, **link_opts)
-          externals += doc.external_links
+          total_pages += 1
+          internals += next_internal_links(doc, **link_opts)
+          externals += doc.external_links
        end
      end
 
+      Wgit.logger.debug("Crawled #{total_pages} documents for the site: #{url}")
+
       externals.to_a
     end
 
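To see what the reworked `#crawl_site` gives a caller, a hedged sketch; the printed shape of `#redirects` is an assumption, as that method isn't shown in this diff:

```ruby
require 'wgit'

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('http://txti.es') # Illustrative site.

crawler.crawl_site(url) do |doc|
  # Every page is yielded; #empty? is true when its crawl failed.
  status = doc.empty? ? 'FAIL' : 'OK'
  puts "#{status} #{doc.url}"
end

# Each crawled Url now records its redirects, e.g. http -> https hops.
puts url.redirects.inspect # Assumed to be a Hash of from => to redirects.
```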
@@ -169,7 +166,7 @@ module Wgit
     def crawl_urls(*urls, follow_redirects: true, &block)
       raise 'You must provide at least one Url' if urls.empty?
 
-      opts = { follow_redirects: follow_redirects }
+      opts = { follow_redirects: }
       doc = nil
 
       Wgit::Utils.each(urls) { |url| doc = crawl_url(url, **opts, &block) }
@@ -189,19 +186,19 @@ module Wgit
     # @yield [doc] The crawled HTML page (Wgit::Document) regardless if the
     #   crawl was successful or not. Therefore, Document#url etc. can be used.
     #   Use `doc.empty?` to determine if the page is valid.
-    # @return [Wgit::Document, nil] The crawled HTML Document or nil if the
-    #   crawl was unsuccessful.
+    # @return [Wgit::Document] The crawled HTML Document. Check if the crawl
+    #   was successful with doc.empty? (true if unsuccessful).
     def crawl_url(url, follow_redirects: true)
       # A String url isn't allowed because it's passed by value not reference,
       # meaning a redirect isn't reflected; A Wgit::Url is passed by reference.
       assert_type(url, Wgit::Url)
 
-      html = fetch(url, follow_redirects: follow_redirects)
+      html = fetch(url, follow_redirects:)
       doc = Wgit::Document.new(url, html, encode: @encode)
 
       yield(doc) if block_given?
 
-      doc if html
+      doc
     end
 
     protected
@@ -226,7 +223,7 @@ module Wgit
       response = Wgit::Response.new
       raise "Invalid url: #{url}" if url.invalid?
 
-      resolve(url, response, follow_redirects: follow_redirects)
+      resolve(url, response, follow_redirects:)
       get_browser_response(url, response) if @parse_javascript
 
       response.body_or_nil
@@ -238,6 +235,9 @@ module Wgit
       url.crawled = true # Sets date_crawled underneath.
       url.crawl_duration = response.total_time
 
+      # Don't override previous url.redirects if response is fully resolved.
+      url.redirects = response.redirects unless response.redirects.empty?
+
       @last_response = response
     end
 
@@ -253,7 +253,7 @@ module Wgit
     #   :origin, :host, :domain or :brand. See Wgit::Url#relative? opts param.
     # @raise [StandardError] If a redirect isn't allowed etc.
     def resolve(url, response, follow_redirects: true)
-      origin = url.
+      origin = url.to_origin # Record the origin before any redirects.
       follow_redirects, within = redirect?(follow_redirects)
 
       loop do
@@ -277,7 +277,7 @@ module Wgit
         if response.redirect_count >= @redirect_limit
 
         # Process the location to be crawled next.
-        location = url.to_origin.concat(location) if location.relative?
+        location = url.to_origin.join(location) if location.relative?
         response.redirections[url.to_s] = location.to_s
         url.replace(location) # Update the url on redirect.
       end
@@ -420,6 +420,27 @@ module Wgit
 
     private
 
+    # Manually does the following: `links = internals - crawled`.
+    # This is needed due to an apparent bug in Set<Url> (when upgrading from
+    # Ruby v3.0.2 to v3.3.0) causing an infinite crawl loop in #crawl_site.
+    # TODO: Check in future Ruby versions and remove this method when fixed.
+    def subtract_links(internals, crawled)
+      links = Set.new
+
+      internals.each do |internal_url|
+        already_crawled = false
+
+        crawled.each do |crawled_url|
+          already_crawled = internal_url == crawled_url
+          break if already_crawled
+        end
+
+        links.add(internal_url) unless already_crawled
+      end
+
+      links
+    end
+
     # Returns the next links used to continue crawling a site. The xpath value
     # is used to obtain the links. Any valid URL Strings will be converted into
     # absolute Wgit::Urls. Invalid URLs will be silently dropped. Any link not
@@ -431,7 +452,8 @@ module Wgit
           .compact
       end
 
-      if links.any? { |link| link.to_domain != doc.url.to_domain }
+      doc_domain = doc.url.to_domain
+      if links.any? { |link| link.to_domain != doc_domain }
         raise 'The links to follow must be within the site domain'
       end
 
@@ -458,12 +480,12 @@ module Wgit
 
     # Validate and filter by the given URL paths.
     def process_paths(links, allow_paths, disallow_paths)
-      if allow_paths
+      if allow_paths && !allow_paths.empty?
         paths = validate_paths(allow_paths)
         filter_links(links, :select!, paths)
       end
 
-      if disallow_paths
+      if disallow_paths && !disallow_paths.empty?
         paths = validate_paths(disallow_paths)
         filter_links(links, :reject!, paths)
       end
@@ -477,7 +499,7 @@ module Wgit
       raise 'The provided paths must all be Strings' \
         unless paths.all? { |path| path.is_a?(String) }
 
-      Wgit::Utils.sanitize(paths, encode: false)
+      paths = Wgit::Utils.sanitize(paths, encode: false)
       raise 'The provided paths cannot be empty' if paths.empty?
 
       paths.map do |path|
@@ -491,7 +513,7 @@ module Wgit
     def filter_links(links, filter_method, paths)
       links.send(filter_method) do |link|
         # Turn http://example.com into / meaning index.
-        link = link.to_endpoint.index? ? '/' : link.omit_base
+        link = link.to_endpoint.index? ? '/' : link.omit_base.omit_trailing_slash
 
         match = false
         paths.each do |pattern|
@@ -532,9 +554,9 @@ module Wgit
     )
     end
 
-    alias crawl crawl_urls
-    alias crawl_pages crawl_urls
-    alias crawl_page crawl_url
-    alias crawl_r crawl_site
+    alias_method :crawl, :crawl_urls
+    alias_method :crawl_pages, :crawl_urls
+    alias_method :crawl_page, :crawl_url
+    alias_method :crawl_r, :crawl_site
   end
 end
data/lib/wgit/database/database.rb
CHANGED
@@ -162,20 +162,20 @@ module Wgit
     #   Wgit::Document>] The records to insert/create.
     # @raise [StandardError] If data isn't valid.
     def insert(data)
-      data = data.dup # Avoid modifying by reference.
       collection = nil
+      request_obj = nil
 
-      if data.respond_to?(:map!)
-        data.map! do |obj|
+      if data.respond_to?(:map)
+        request_obj = data.map do |obj|
           collection, _, model = get_type_info(obj)
           model
         end
       else
         collection, _, model = get_type_info(data)
-        data = model
+        request_obj = model
       end
 
-      create(collection, data)
+      create(collection, request_obj)
     end
 
     # Inserts or updates the object in the database.
@@ -183,7 +183,7 @@ module Wgit
     # @param obj [Wgit::Url, Wgit::Document] The obj/record to insert/update.
     # @return [Boolean] True if inserted, false if updated.
     def upsert(obj)
-      collection, query, model = get_type_info(obj.dup)
+      collection, query, model = get_type_info(obj)
       data_hash = model.merge(Wgit::Model.common_update_data)
       result = @client[collection].replace_one(query, data_hash, upsert: true)
 
@@ -192,6 +192,36 @@ module Wgit
       @last_result = result
     end
 
+    # Bulk upserts the objects in the database collection.
+    # You cannot mix collection objs types, all must be Urls or Documents.
+    #
+    # @param objs [Array<Wgit::Url>, Array<Wgit::Document>] The objs to be
+    #   inserted/updated.
+    # @return [Integer] The total number of upserted objects.
+    def bulk_upsert(objs)
+      assert_arr_types(objs, [Wgit::Url, Wgit::Document])
+      raise 'objs is empty' if objs.empty?
+
+      collection = nil
+      request_objs = objs.map do |obj|
+        collection, query, model = get_type_info(obj)
+        data_hash = model.merge(Wgit::Model.common_update_data)
+
+        {
+          update_many: {
+            filter: query,
+            update: { '$set' => data_hash },
+            upsert: true
+          }
+        }
+      end
+
+      result = @client[collection].bulk_write(request_objs)
+      result.upserted_count + result.modified_count
+    ensure
+      @last_result = result
+    end
+
     ### Retrieve Data ###
 
     # Returns all Document records from the DB. Use #search to filter based on
@@ -205,14 +235,14 @@ module Wgit
     # @yield [doc] Given each Document object (Wgit::Document) returned from
     #   the DB.
     # @return [Array<Wgit::Document>] The Documents obtained from the DB.
-    def docs(limit: 0, skip: 0)
+    def docs(limit: 0, skip: 0, &block)
       results = retrieve(DOCUMENTS_COLLECTION, {},
-                         sort: { date_added: 1 }, limit: limit, skip: skip)
+                         sort: { date_added: 1 }, limit:, skip:)
       return [] if results.count < 1 # results#empty? doesn't exist.
 
       # results.respond_to? :map! is false so we use map and overwrite the var.
       results = results.map { |doc_hash| Wgit::Document.new(doc_hash) }
-      results.each { |doc| yield(doc) } if block_given?
+      results.each(&block) if block_given?
 
       results
     end
@@ -227,17 +257,16 @@ module Wgit
     # @param skip [Integer] Skip n amount of Url's.
     # @yield [url] Given each Url object (Wgit::Url) returned from the DB.
     # @return [Array<Wgit::Url>] The Urls obtained from the DB.
-    def urls(crawled: nil, limit: 0, skip: 0)
-      query = crawled.nil? ? {} : { crawled: crawled }
+    def urls(crawled: nil, limit: 0, skip: 0, &block)
+      query = crawled.nil? ? {} : { crawled: }
       sort = { date_added: 1 }
 
-      results = retrieve(URLS_COLLECTION, query,
-                         sort: sort, limit: limit, skip: skip)
+      results = retrieve(URLS_COLLECTION, query, sort:, limit:, skip:)
      return [] if results.count < 1 # results#empty? doesn't exist.
 
       # results.respond_to? :map! is false so we use map and overwrite the var.
       results = results.map { |url_doc| Wgit::Url.new(url_doc) }
-      results.each { |url| yield(url) } if block_given?
+      results.each(&block) if block_given?
 
       results
     end
@@ -249,7 +278,7 @@ module Wgit
     # @yield [url] Given each Url object (Wgit::Url) returned from the DB.
     # @return [Array<Wgit::Url>] The crawled Urls obtained from the DB.
     def crawled_urls(limit: 0, skip: 0, &block)
-      urls(crawled: true, limit: limit, skip: skip, &block)
+      urls(crawled: true, limit:, skip:, &block)
     end
 
     # Returned Url records that haven't yet been crawled.
@@ -259,7 +288,7 @@ module Wgit
     # @yield [url] Given each Url object (Wgit::Url) returned from the DB.
     # @return [Array<Wgit::Url>] The uncrawled Urls obtained from the DB.
     def uncrawled_urls(limit: 0, skip: 0, &block)
-      urls(crawled: false, limit: limit, skip: skip, &block)
+      urls(crawled: false, limit:, skip:, &block)
     end
 
     # Searches the database's Documents for the given query.
@@ -286,19 +315,21 @@ module Wgit
       query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0
     )
       query = query.to_s.strip
-      query.replace(
+      query.replace("\"#{query}\"") if whole_sentence
 
       # Sort based on the most search hits (aka "textScore").
       # We use the sort_proj hash as both a sort and a projection below.
       sort_proj = { score: { :$meta => 'textScore' } }
-      query = {
-        :$text => { :$search => query, :$caseSensitive => case_sensitive }
-      }
+      query = {
+        :$text => {
+          :$search => query,
+          :$caseSensitive => case_sensitive
+        }
+      }
 
       results = retrieve(DOCUMENTS_COLLECTION, query,
                          sort: sort_proj, projection: sort_proj,
-                         limit: limit, skip: skip)
+                         limit:, skip:)
 
       results.map do |mongo_doc|
         doc = Wgit::Document.new(mongo_doc)
@@ -328,21 +359,10 @@ module Wgit
       query, case_sensitive: false, whole_sentence: true,
       limit: 10, skip: 0, sentence_limit: 80
     )
-      results = search(
-        query,
-        case_sensitive: case_sensitive,
-        whole_sentence: whole_sentence,
-        limit: limit,
-        skip: skip
-      )
+      results = search(query, case_sensitive:, whole_sentence:, limit:, skip:)
 
       results.each do |doc|
-        doc.search!(
-          query,
-          case_sensitive: case_sensitive,
-          whole_sentence: whole_sentence,
-          sentence_limit: sentence_limit
-        )
+        doc.search!(query, case_sensitive:, whole_sentence:, sentence_limit:)
         yield(doc) if block_given?
       end
 
@@ -373,26 +393,16 @@ module Wgit
       query, case_sensitive: false, whole_sentence: true,
       limit: 10, skip: 0, sentence_limit: 80, top_result_only: false
     )
-      results = search(
-        query,
-        case_sensitive: case_sensitive,
-        whole_sentence: whole_sentence,
-        limit: limit,
-        skip: skip
-      )
+      results = search(query, case_sensitive:, whole_sentence:, limit:, skip:)
 
       results
         .map do |doc|
          yield(doc) if block_given?
 
+          # Only return result if its text has a match - compact is called below.
          results = doc.search(
-            query,
-            case_sensitive: case_sensitive,
-            whole_sentence: whole_sentence,
-            sentence_limit: sentence_limit
+            query, case_sensitive:, whole_sentence:, sentence_limit:
          )
-
-          # Only return result if its text has a match - compact is called below.
          next nil if results.empty?
 
          [doc.url, (top_result_only ? results.first : results)]
@@ -443,7 +453,7 @@ module Wgit
     # @return [Boolean] True if url exists, otherwise false.
     def url?(url)
       assert_type(url, String) # This includes Wgit::Url's.
-      query = { url: url }
+      query = { url: }
       retrieve(URLS_COLLECTION, query, limit: 1).any?
     end
 
@@ -490,7 +500,7 @@ module Wgit
     # @raise [StandardError] If the obj is not valid.
     # @return [Integer] The number of updated records/objects.
     def update(obj)
-      collection, query, model = get_type_info(obj.dup)
+      collection, query, model = get_type_info(obj)
       data_hash = model.merge(Wgit::Model.common_update_data)
 
       mutate(collection, query, { '$set' => data_hash })
@@ -554,6 +564,8 @@ module Wgit
     # @return [Array<Symbol, Hash>] The collection type, query to get
     #   the record/obj from the database (if it exists) and the model of obj.
     def get_type_info(obj)
+      obj = obj.dup
+
       case obj
       when Wgit::Url
         collection = URLS_COLLECTION
@@ -661,7 +673,7 @@ module Wgit
       @last_result = result
     end
 
-    alias num_objects num_records
-    alias clear_db! clear_db
+    alias_method :num_objects, :num_records
+    alias_method :clear_db!, :clear_db
   end
 end
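Finally, a rough sketch of the new `Database#bulk_upsert` from the hunk above; the connection string and Urls are illustrative:

```ruby
require 'wgit'

db   = Wgit::Database.new('mongodb://localhost:27017/wgit')
urls = %w[https://example.com https://example.org].map { |u| Wgit::Url.new(u) }

# A single bulk_write round trip replaces N separate #upsert calls,
# which is how Wgit::Indexer now cuts down its database calls.
count = db.bulk_upsert(urls)
puts "Upserted #{count} url(s)"
```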