wgit 0.10.8 → 0.11.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 66e8b435303d07b2f81d260badc96662936599c9782916f7f014b74a7c617499
-  data.tar.gz: 7b55890c66ec09efd8d5749bd66605a4cb43d5091416f072f8fcc5aaaa85fbe7
+  metadata.gz: 5e80ab519d9f55f759df4a97873a3cb33be37f3270e408ad8f3f1cf96bd762bc
+  data.tar.gz: baf78c4fe1e30d49847dd44f1c4f3a05104db3ee57886cc2da329e3ad17ddd4f
 SHA512:
-  metadata.gz: fe1b605224f6682ac504f17b55ab83518556f1320f0410741af8f95bf3a669918c69b48832fb413ca1f78482fdbb7e0d2e7d6f57841c6a562b7f926f7511cdd7
-  data.tar.gz: 856be2111709bc96488b7d43abbc49c563a9a56330344adb4b9ec40fc263cb91e63465c3c3dab317c0d8930965a609a43102d53d80bbc2001e6165a15cb905fa
+  metadata.gz: 4701b737d24d38b3a9cc27b4524556487c22ce135bfb63b162e75c8f08175b6558b2c96eca7a022bf368aef3a8d37d9d072ebbf75bdd80a29bf7c996795a406c
+  data.tar.gz: 026206a073b0e3465778db5e5a430e2830c64204d6d18ff38e049bfa3668dd62330b031367fb447e7554b0336e41893ea96eca52a844cdd7a5e94845b4e3a3b8
data/CHANGELOG.md CHANGED
@@ -1,6 +1,6 @@
 # Wgit Change Log
 
-## v0.0.0 (TEMPLATE - DO NOT EDIT)
+## v0.0.0 [- BREAKING CHANGES] (TEMPLATE - DO NOT EDIT)
 ### Added
 - ...
 ### Changed/Removed
@@ -9,6 +9,38 @@
 - ...
 ---
 
+## v0.11.0 - BREAKING CHANGES
+This is a big release, headlined by the introduction of robots.txt support (see below). It includes several breaking changes, so take care when updating your current version of Wgit.
+### Added
+- Ability to prevent indexing via `robots.txt`, `noindex` values in HTML `meta` elements and the HTTP response header `X-Robots-Tag`. See the new class `Wgit::RobotsParser` and the updated `Wgit::Indexer#index_*` methods. Also see the [wiki article](https://github.com/michaeltelford/wgit/wiki/How-To-Prevent-Indexing) on the subject.
+- `Wgit::RobotsParser` class for parsing `robots.txt` files.
+- `Wgit::Response#no_index?` and `Wgit::Document#no_index?` methods (see the wiki article above).
+- Two new default extractors which extract robots meta elements for use in `Wgit::Document#no_index?`.
+- `Wgit::Document.to_h_ignore_vars` Array for user manipulation.
+- `Wgit::Utils.pprint` method to aid debugging.
+- `Wgit::Utils.sanitize_url` method.
+- `Wgit::Indexer#index_www(max_urls_per_iteration:, ...)` param.
+- `Wgit::Url#redirects` and `#redirects=` methods.
+- `Wgit::Url#redirects_journey`, used by `Wgit::Indexer` to insert a Url and its redirects.
+- `Wgit::Database#bulk_upsert`, which `Wgit::Indexer` now uses where possible. This reduces the total number of database calls made during an index operation.
### Changed/Removed
+### Changed/Removed
+- Updated `Wgit::Indexer#index_*` methods to honour index prevention methods (see the [wiki article](https://github.com/michaeltelford/wgit/wiki/How-To-Prevent-Indexing)).
+- Updated `Wgit::Utils.sanitize*` methods so they no longer modify the receiver.
+- Updated `Wgit::Crawler#crawl_url` to always return the crawled `Wgit::Document`. If your code relied on `nil`, use `doc.empty?` instead.
+- Updated `Wgit::Indexer` method logs.
+- Updated/added custom class `#inspect` methods.
+- Renamed `Wgit::Utils.printf_search_results` to `pprint_search_results`.
+- Renamed `Wgit::Url#concat` to `#join`. The `#concat` method is now `String#concat`.
+- Updated `Wgit::Indexer` methods to write external Urls to the Database as `doc.external_urls.map(&:to_origin)`, meaning `http://example.com/about` becomes `http://example.com`.
+- Updated the following methods to no longer omit trailing slashes from Urls: `Wgit::Url` - `#to_path`, `#omit_base`, `#omit_origin` and `Wgit::Document` - `#internal_links`, `#internal_absolute_links`, `#external_links`. For an average website, this results in ~30% fewer network requests when crawling.
+- Updated Ruby version to `3.3.0`.
+- Updated all bundle dependencies to their latest versions; see `Gemfile.lock` for exact versions.
+### Fixed
+- `Wgit::Crawler#crawl_site` now internally records all redirects for a given Url.
+- `Wgit::Crawler#crawl_site` infinite loop when using Wgit on a Ruby version > `3.0.2`.
+- Various other minor fixes/improvements throughout the code base.
+---
+
 ## v0.10.8
 ### Added
 - Custom `#inspect` methods to `Wgit::Url` and `Wgit::Document` classes.
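To illustrate the headline change, here is a minimal sketch of robots.txt-aware indexing; the `Wgit::RobotsParser` constructor argument and the exact `no_index?` behaviour shown are assumptions drawn from the changelog entries above, not confirmed API:

```ruby
require 'wgit'

# Assumed: the parser takes the robots.txt file contents as a String.
parser = Wgit::RobotsParser.new("User-agent: *\nDisallow: /accounts")

# The index_* methods now consult robots.txt and noindex rules themselves.
indexer = Wgit::Indexer.new
indexer.index_site(Wgit::Url.new('https://example.com'))

# no_index? exists on both the response (X-Robots-Tag header) and the
# document (robots meta elements, via the two new default extractors).
crawler = Wgit::Crawler.new
doc = crawler.crawl_url(Wgit::Url.new('https://example.com'))
puts 'skipping' if doc.no_index?
```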
data/CONTRIBUTING.md CHANGED
@@ -12,7 +12,7 @@ Before you make a contribution, reach out to michael.telford@live.com about what
 - Write some code
 - Re-run the tests (which now hopefully pass)
 - Push your branch to your `origin` remote
-- Open a GitHub Pull Request (with the target branch being wgit's `origin/master`)
+- Open a GitHub Pull Request (with the target branch as wgit's (upstream) `master`)
 - Apply any requested changes
 - Wait for your PR to be merged
 
data/README.md CHANGED
@@ -164,6 +164,8 @@ indexer = Wgit::Indexer.new
 indexer.index_site(wiki, **opts)
 ```
 
+- Wgit's built-in indexing methods honour a site's `robots.txt` rules by default. There's also a handy robots.txt parser that you can use in your own code.
+
 ## Why Not Wgit?
 
 So why might you not use Wgit, I hear you ask?
@@ -237,7 +239,7 @@ And you're good to go!
 
 ### Tooling
 
-Wgit uses the [`toys`](https://github.com/dazuma/toys) gem (instead of Rake) for task invocation. For a full list of available tasks a.k.a. tools, run `toys --tools`. You can search for a tool using `toys -s tool_name`. The most commonly used tools are listed below...
+Wgit uses the [`toys`](https://github.com/dazuma/toys) gem (instead of Rake) for task invocation. Always run `toys` as `bundle exec toys`. For a full list of available tasks a.k.a. tools, run `toys --tools`. You can search for a tool using `toys -s tool_name`. The most commonly used tools are listed below...
 
 Run `toys db` to see a list of database related tools, enabling you to run a Mongo DB instance locally using Docker. Run `toys test` to execute the tests.
 
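As the new README bullet above says, the parser is usable standalone. A minimal sketch, assuming `#disallow_paths` is the accessor exposed by `Wgit::RobotsParser`:

```ruby
require 'wgit'

# Hypothetical robots.txt contents, parsed directly.
parser = Wgit::RobotsParser.new("User-agent: *\nDisallow: /admin/")

# Feed the parsed rules into a partial site crawl (glob syntax is supported).
crawler = Wgit::Crawler.new
crawler.crawl_site(
  Wgit::Url.new('https://example.com'),
  disallow_paths: parser.disallow_paths # assumed accessor
) { |doc| puts doc.url }
```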
data/bin/wgit CHANGED
@@ -5,6 +5,7 @@ require 'wgit'
 # Eval .wgit.rb file (if it exists somewhere).
 def eval_wgit(filepath = nil)
   puts 'Searching for .wgit.rb file in local and home directories...'
+  success = false
 
   [filepath, Dir.pwd, Dir.home].each do |dir|
     path = "#{dir}/.wgit.rb"
@@ -13,11 +14,12 @@ def eval_wgit(filepath = nil)
     puts "Eval'ing #{path}"
     puts 'Call `eval_wgit` after changes to re-eval the file'
     eval(File.read(path))
+    success = true
 
     break
   end
 
-  nil
+  success
 end
 
 eval_wgit
@@ -95,7 +95,7 @@ present: %s"
       obj
     end
 
-    alias assert_type assert_types
-    alias assert_arr_type assert_arr_types
+    alias_method :assert_type, :assert_types
+    alias_method :assert_arr_type, :assert_arr_types
   end
 end
data/lib/wgit/crawler.rb CHANGED
@@ -5,7 +5,6 @@ require_relative 'document'
 require_relative 'utils'
 require_relative 'assertable'
 require_relative 'response'
-require 'set'
 require 'benchmark'
 require 'typhoeus'
 require 'ferrum'
@@ -70,6 +69,8 @@ module Wgit
     # @param parse_javascript [Boolean] Whether or not to parse the Javascript
     #   of the crawled document. Parsing requires Chrome/Chromium to be
     #   installed and in $PATH.
+    # @param parse_javascript_delay [Integer] The delay time given to a page's
+    #   JS to update the DOM. After the delay, the HTML is crawled.
     def initialize(redirect_limit: 5, timeout: 5, encode: true,
                    parse_javascript: false, parse_javascript_delay: 1)
       @redirect_limit = redirect_limit
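The new param doc translates to usage like this sketch (values illustrative):

```ruby
# Give client-side JS two seconds to update the DOM before the HTML is read.
crawler = Wgit::Crawler.new(parse_javascript: true, parse_javascript_delay: 2)
```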
@@ -86,8 +87,6 @@ module Wgit
     #
     # Use the allow and disallow paths params to partially and selectively
     # crawl a site; the glob syntax is fully supported e.g. `'wiki/\*'` etc.
-    # Note that each path must NOT start with a slash; the only exception being
-    # a `/` on its own with no other characters, referring to the index page.
     #
     # Only redirects to the same host are followed. For example, the Url
     # 'http://www.example.co.uk/how' has a host of 'www.example.co.uk' meaning
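The glob behaviour described in the doc comment looks like this in practice; the site URL and paths are illustrative:

```ruby
# Crawl only the wiki section of a site, skipping its edit pages.
crawler.crawl_site(
  Wgit::Url.new('https://example.com'),
  allow_paths: 'wiki/*',
  disallow_paths: 'wiki/edit/*'
) { |doc| puts doc.url }
```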
@@ -118,37 +117,35 @@
       url, follow: :default, allow_paths: nil, disallow_paths: nil, &block
     )
       doc = crawl_url(url, &block)
-      return nil if doc.nil?
+      return nil if doc.empty?
 
-      link_opts = {
-        xpath: follow,
-        allow_paths: allow_paths,
-        disallow_paths: disallow_paths
-      }
-      alt_url = url.end_with?('/') ? url.chop : url + '/'
+      total_pages = 1
+      link_opts = { xpath: follow, allow_paths:, disallow_paths: }
 
-      crawled = Set.new([url, alt_url])
+      crawled = Set.new(url.redirects_journey)
       externals = Set.new(doc.external_links)
       internals = Set.new(next_internal_links(doc, **link_opts))
 
       return externals.to_a if internals.empty?
 
       loop do
-        links = internals - crawled
+        links = subtract_links(internals, crawled)
         break if links.empty?
 
         links.each do |link|
-          orig_link = link.dup
           doc = crawl_url(link, follow_redirects: :host, &block)
 
-          crawled += [orig_link, link] # Push both links in case of redirects.
-          next if doc.nil?
+          crawled += link.redirects_journey
+          next if doc.empty?
 
-          internals += next_internal_links(doc, **link_opts)
-          externals += doc.external_links
+          total_pages += 1
+          internals += next_internal_links(doc, **link_opts)
+          externals += doc.external_links
         end
       end
 
+      Wgit.logger.debug("Crawled #{total_pages} documents for the site: #{url}")
+
       externals.to_a
     end
 
@@ -169,7 +166,7 @@
     def crawl_urls(*urls, follow_redirects: true, &block)
       raise 'You must provide at least one Url' if urls.empty?
 
-      opts = { follow_redirects: follow_redirects }
+      opts = { follow_redirects: }
       doc = nil
 
       Wgit::Utils.each(urls) { |url| doc = crawl_url(url, **opts, &block) }
@@ -189,19 +186,19 @@
     # @yield [doc] The crawled HTML page (Wgit::Document) regardless if the
     #   crawl was successful or not. Therefore, Document#url etc. can be used.
     #   Use `doc.empty?` to determine if the page is valid.
-    # @return [Wgit::Document, nil] The crawled HTML Document or nil if the
-    #   crawl was unsuccessful.
+    # @return [Wgit::Document] The crawled HTML Document. Check if the crawl
+    #   was successful with doc.empty? (true if unsuccessful).
     def crawl_url(url, follow_redirects: true)
       # A String url isn't allowed because it's passed by value not reference,
       # meaning a redirect isn't reflected; A Wgit::Url is passed by reference.
       assert_type(url, Wgit::Url)
 
-      html = fetch(url, follow_redirects: follow_redirects)
+      html = fetch(url, follow_redirects:)
       doc = Wgit::Document.new(url, html, encode: @encode)
 
       yield(doc) if block_given?
 
-      doc.empty? ? nil : doc
+      doc
     end
 
     protected
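A small migration sketch for the breaking `#crawl_url` return-value change (`index_page` is a placeholder for your own handling):

```ruby
doc = crawler.crawl_url(Wgit::Url.new('https://example.com/404'))

# Before v0.11.0 a failed crawl returned nil:
#   index_page(doc) if doc
# Now a Wgit::Document is always returned, so check emptiness instead:
index_page(doc) unless doc.empty?
```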
@@ -226,7 +223,7 @@
       response = Wgit::Response.new
       raise "Invalid url: #{url}" if url.invalid?
 
-      resolve(url, response, follow_redirects: follow_redirects)
+      resolve(url, response, follow_redirects:)
       get_browser_response(url, response) if @parse_javascript
 
       response.body_or_nil
@@ -238,6 +235,9 @@
       url.crawled = true # Sets date_crawled underneath.
       url.crawl_duration = response.total_time
 
+      # Don't override previous url.redirects if response is fully resolved.
+      url.redirects = response.redirects unless response.redirects.empty?
+
       @last_response = response
     end
 
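Given the hunk above, a resolved crawl now records its redirects on the Url itself. A sketch; the exact shape of `#redirects` is an assumption (the code suggests a from => to mapping, mirroring `response.redirections`):

```ruby
url = Wgit::Url.new('http://example.com')
crawler.crawl_url(url) # follows e.g. http -> https

url.redirects         # => { "http://example.com" => "https://example.com" } (assumed shape)
url.redirects_journey # => the Urls visited en route; #crawl_site uses this to de-dupe
```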
@@ -253,7 +253,7 @@
     #   :origin, :host, :domain or :brand. See Wgit::Url#relative? opts param.
     # @raise [StandardError] If a redirect isn't allowed etc.
     def resolve(url, response, follow_redirects: true)
-      origin = url.to_url.to_origin # Recorded before any redirects.
+      origin = url.to_origin # Record the origin before any redirects.
       follow_redirects, within = redirect?(follow_redirects)
 
       loop do
@@ -277,7 +277,7 @@
           if response.redirect_count >= @redirect_limit
 
         # Process the location to be crawled next.
-        location = url.to_origin.concat(location) if location.relative?
+        location = url.to_origin.join(location) if location.relative?
         response.redirections[url.to_s] = location.to_s
         url.replace(location) # Update the url on redirect.
       end
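The `#concat` to `#join` rename shown here is the same one called out in the changelog; migrating is mechanical:

```ruby
origin = Wgit::Url.new('https://example.com')

origin.join('/about')   # => "https://example.com/about" (URL-aware join)
origin.concat('/about') # now plain String#concat: a byte-wise append, no URL semantics
```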
@@ -420,6 +420,27 @@
 
     private
 
+    # Manually does the following: `links = internals - crawled`.
+    # This is needed due to an apparent bug in Set<Url> (when upgrading from
+    # Ruby v3.0.2 to v3.3.0) causing an infinite crawl loop in #crawl_site.
+    # TODO: Check in future Ruby versions and remove this method when fixed.
+    def subtract_links(internals, crawled)
+      links = Set.new
+
+      internals.each do |internal_url|
+        already_crawled = false
+
+        crawled.each do |crawled_url|
+          already_crawled = internal_url == crawled_url
+          break if already_crawled
+        end
+
+        links.add(internal_url) unless already_crawled
+      end
+
+      links
+    end
+
     # Returns the next links used to continue crawling a site. The xpath value
     # is used to obtain the links. Any valid URL Strings will be converted into
     # absolute Wgit::Urls. Invalid URLs will be silently dropped. Any link not
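Why the manual loop: `Set` membership is decided via `#hash`/`#eql?`, whereas `subtract_links` compares candidates with `==`, sidestepping whatever the `Set<Url>` lookup issue is on Ruby > 3.0.2. Semantically it is still a set difference; calling the private method here is for illustration only:

```ruby
internals = Set.new([Wgit::Url.new('https://example.com/a'),
                     Wgit::Url.new('https://example.com/b')])
crawled   = Set.new([Wgit::Url.new('https://example.com/a')])

crawler.send(:subtract_links, internals, crawled)
# => #<Set: {"https://example.com/b"}>, i.e. internals - crawled
```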
@@ -431,7 +452,8 @@
           .compact
       end
 
-      if links.any? { |link| link.to_domain != doc.url.to_domain }
+      doc_domain = doc.url.to_domain
+      if links.any? { |link| link.to_domain != doc_domain }
         raise 'The links to follow must be within the site domain'
       end
 
@@ -458,12 +480,12 @@
 
     # Validate and filter by the given URL paths.
     def process_paths(links, allow_paths, disallow_paths)
-      if allow_paths
+      if allow_paths && !allow_paths.empty?
         paths = validate_paths(allow_paths)
         filter_links(links, :select!, paths)
       end
 
-      if disallow_paths
+      if disallow_paths && !disallow_paths.empty?
         paths = validate_paths(disallow_paths)
         filter_links(links, :reject!, paths)
       end
@@ -477,7 +499,7 @@
       raise 'The provided paths must all be Strings' \
         unless paths.all? { |path| path.is_a?(String) }
 
-      Wgit::Utils.sanitize(paths, encode: false)
+      paths = Wgit::Utils.sanitize(paths, encode: false)
       raise 'The provided paths cannot be empty' if paths.empty?
 
       paths.map do |path|
@@ -491,7 +513,7 @@
     def filter_links(links, filter_method, paths)
       links.send(filter_method) do |link|
         # Turn http://example.com into / meaning index.
-        link = link.to_endpoint.index? ? '/' : link.omit_base
+        link = link.to_endpoint.index? ? '/' : link.omit_base.omit_trailing_slash
 
         match = false
         paths.each do |pattern|
@@ -532,9 +554,9 @@
       )
     end
 
-    alias crawl crawl_urls
-    alias crawl_pages crawl_urls
-    alias crawl_page crawl_url
-    alias crawl_r crawl_site
+    alias_method :crawl, :crawl_urls
+    alias_method :crawl_pages, :crawl_urls
+    alias_method :crawl_page, :crawl_url
+    alias_method :crawl_r, :crawl_site
   end
 end
@@ -162,20 +162,20 @@
     #   Wgit::Document>] The records to insert/create.
     # @raise [StandardError] If data isn't valid.
     def insert(data)
-      data = data.dup # Avoid modifying by reference.
       collection = nil
+      request_obj = nil
 
-      if data.respond_to?(:map!)
-        data.map! do |obj|
+      if data.respond_to?(:map)
+        request_obj = data.map do |obj|
           collection, _, model = get_type_info(obj)
           model
         end
       else
         collection, _, model = get_type_info(data)
-        data = model
+        request_obj = model
       end
 
-      create(collection, data)
+      create(collection, request_obj)
     end
 
     # Inserts or updates the object in the database.
@@ -183,7 +183,7 @@
     # @param obj [Wgit::Url, Wgit::Document] The obj/record to insert/update.
     # @return [Boolean] True if inserted, false if updated.
     def upsert(obj)
-      collection, query, model = get_type_info(obj.dup)
+      collection, query, model = get_type_info(obj)
       data_hash = model.merge(Wgit::Model.common_update_data)
       result = @client[collection].replace_one(query, data_hash, upsert: true)
 
@@ -192,6 +192,36 @@
       @last_result = result
     end
 
+    # Bulk upserts the objects in the database collection.
+    # You cannot mix collection object types; all objs must be Urls or
+    # Documents.
+    #
+    # @param objs [Array<Wgit::Url>, Array<Wgit::Document>] The objs to be
+    #   inserted/updated.
+    # @return [Integer] The total number of upserted objects.
+    def bulk_upsert(objs)
+      assert_arr_types(objs, [Wgit::Url, Wgit::Document])
+      raise 'objs is empty' if objs.empty?
+
+      collection = nil
+      request_objs = objs.map do |obj|
+        collection, query, model = get_type_info(obj)
+        data_hash = model.merge(Wgit::Model.common_update_data)
+
+        {
+          update_many: {
+            filter: query,
+            update: { '$set' => data_hash },
+            upsert: true
+          }
+        }
+      end
+
+      result = @client[collection].bulk_write(request_objs)
+      result.upserted_count + result.modified_count
+    ensure
+      @last_result = result
+    end
+
     ### Retrieve Data ###
 
     # Returns all Document records from the DB. Use #search to filter based on
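A sketch of calling the new method; the connection-string setup is the usual `Wgit::Database.new` flow (reads `ENV['WGIT_CONNECTION_STRING']`):

```ruby
db = Wgit::Database.new
urls = [
  Wgit::Url.new('https://example.com'),
  Wgit::Url.new('https://example.org')
]

# One bulk_write round trip instead of N single upserts. Mixing Urls and
# Documents in the same call raises (see assert_arr_types above).
db.bulk_upsert(urls) # => 2
```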
@@ -205,14 +235,14 @@
     # @yield [doc] Given each Document object (Wgit::Document) returned from
     #   the DB.
     # @return [Array<Wgit::Document>] The Documents obtained from the DB.
-    def docs(limit: 0, skip: 0)
+    def docs(limit: 0, skip: 0, &block)
       results = retrieve(DOCUMENTS_COLLECTION, {},
-                         sort: { date_added: 1 }, limit: limit, skip: skip)
+                         sort: { date_added: 1 }, limit:, skip:)
       return [] if results.count < 1 # results#empty? doesn't exist.
 
       # results.respond_to? :map! is false so we use map and overwrite the var.
       results = results.map { |doc_hash| Wgit::Document.new(doc_hash) }
-      results.each { |doc| yield(doc) } if block_given?
+      results.each(&block) if block_given?
 
       results
     end
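The `limit:, skip:` syntax appearing throughout these hunks is Ruby 3.1+ hash/keyword-argument value omission, one reason the gem now targets Ruby `3.3.0`. A standalone sketch:

```ruby
def fetch_opts(limit: 0, skip: 0)
  # `limit:` is shorthand for `limit: limit` when a same-named local is in scope.
  { limit:, skip: }
end

fetch_opts(limit: 5) # => { limit: 5, skip: 0 }
```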
@@ -227,17 +257,16 @@
     # @param skip [Integer] Skip the first n Urls.
     # @yield [url] Given each Url object (Wgit::Url) returned from the DB.
     # @return [Array<Wgit::Url>] The Urls obtained from the DB.
-    def urls(crawled: nil, limit: 0, skip: 0)
-      query = crawled.nil? ? {} : { crawled: crawled }
+    def urls(crawled: nil, limit: 0, skip: 0, &block)
+      query = crawled.nil? ? {} : { crawled: }
       sort = { date_added: 1 }
 
-      results = retrieve(URLS_COLLECTION, query,
-                         sort: sort, limit: limit, skip: skip)
+      results = retrieve(URLS_COLLECTION, query, sort:, limit:, skip:)
       return [] if results.count < 1 # results#empty? doesn't exist.
 
       # results.respond_to? :map! is false so we use map and overwrite the var.
       results = results.map { |url_doc| Wgit::Url.new(url_doc) }
-      results.each { |url| yield(url) } if block_given?
+      results.each(&block) if block_given?
 
       results
     end
@@ -249,7 +278,7 @@
     # @yield [url] Given each Url object (Wgit::Url) returned from the DB.
     # @return [Array<Wgit::Url>] The crawled Urls obtained from the DB.
     def crawled_urls(limit: 0, skip: 0, &block)
-      urls(crawled: true, limit: limit, skip: skip, &block)
+      urls(crawled: true, limit:, skip:, &block)
     end
 
     # Returns Url records that haven't yet been crawled.
@@ -259,7 +288,7 @@
     # @yield [url] Given each Url object (Wgit::Url) returned from the DB.
     # @return [Array<Wgit::Url>] The uncrawled Urls obtained from the DB.
     def uncrawled_urls(limit: 0, skip: 0, &block)
-      urls(crawled: false, limit: limit, skip: skip, &block)
+      urls(crawled: false, limit:, skip:, &block)
     end
 
     # Searches the database's Documents for the given query.
@@ -286,19 +315,21 @@
       query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0
     )
       query = query.to_s.strip
-      query.replace('"' + query + '"') if whole_sentence
+      query.replace("\"#{query}\"") if whole_sentence
 
       # Sort based on the most search hits (aka "textScore").
       # We use the sort_proj hash as both a sort and a projection below.
       sort_proj = { score: { :$meta => 'textScore' } }
-      query = { :$text => {
-        :$search => query,
-        :$caseSensitive => case_sensitive
-      } }
+      query = {
+        :$text => {
+          :$search => query,
+          :$caseSensitive => case_sensitive
+        }
+      }
 
       results = retrieve(DOCUMENTS_COLLECTION, query,
                          sort: sort_proj, projection: sort_proj,
-                         limit: limit, skip: skip)
+                         limit:, skip:)
 
       results.map do |mongo_doc|
         doc = Wgit::Document.new(mongo_doc)
@@ -328,21 +359,10 @@
       query, case_sensitive: false, whole_sentence: true,
       limit: 10, skip: 0, sentence_limit: 80
     )
-      results = search(
-        query,
-        case_sensitive: case_sensitive,
-        whole_sentence: whole_sentence,
-        limit: limit,
-        skip: skip
-      )
+      results = search(query, case_sensitive:, whole_sentence:, limit:, skip:)
 
       results.each do |doc|
-        doc.search!(
-          query,
-          case_sensitive: case_sensitive,
-          whole_sentence: whole_sentence,
-          sentence_limit: sentence_limit
-        )
+        doc.search!(query, case_sensitive:, whole_sentence:, sentence_limit:)
         yield(doc) if block_given?
       end
 
@@ -373,26 +393,16 @@
       query, case_sensitive: false, whole_sentence: true,
       limit: 10, skip: 0, sentence_limit: 80, top_result_only: false
     )
-      results = search(
-        query,
-        case_sensitive: case_sensitive,
-        whole_sentence: whole_sentence,
-        limit: limit,
-        skip: skip
-      )
+      results = search(query, case_sensitive:, whole_sentence:, limit:, skip:)
 
       results
         .map do |doc|
           yield(doc) if block_given?
 
+          # Only return result if its text has a match - compact is called below.
           results = doc.search(
-            query,
-            case_sensitive: case_sensitive,
-            whole_sentence: whole_sentence,
-            sentence_limit: sentence_limit
+            query, case_sensitive:, whole_sentence:, sentence_limit:
          )
-
-          # Only return result if its text has a match - compact is called below.
          next nil if results.empty?
 
          [doc.url, (top_result_only ? results.first : results)]
@@ -443,7 +453,7 @@
     # @return [Boolean] True if url exists, otherwise false.
     def url?(url)
       assert_type(url, String) # This includes Wgit::Url's.
-      query = { url: url }
+      query = { url: }
       retrieve(URLS_COLLECTION, query, limit: 1).any?
     end
 
@@ -490,7 +500,7 @@
     # @raise [StandardError] If the obj is not valid.
     # @return [Integer] The number of updated records/objects.
     def update(obj)
-      collection, query, model = get_type_info(obj.dup)
+      collection, query, model = get_type_info(obj)
       data_hash = model.merge(Wgit::Model.common_update_data)
 
       mutate(collection, query, { '$set' => data_hash })
@@ -554,6 +564,8 @@
     # @return [Array<Symbol, Hash>] The collection type, query to get
     #   the record/obj from the database (if it exists) and the model of obj.
     def get_type_info(obj)
+      obj = obj.dup
+
       case obj
       when Wgit::Url
         collection = URLS_COLLECTION
@@ -661,7 +673,7 @@
       @last_result = result
     end
 
-    alias num_objects num_records
-    alias clear_db! clear_db
+    alias_method :num_objects, :num_records
+    alias_method :clear_db!, :clear_db
   end
 end