wgit 0.10.6 → 0.10.8

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 4598dcfc047ce3915ba5a871837be5efc54201d61b4967cf53070bec2af4dd52
- data.tar.gz: 604010011024af6f2d4dfcc87e6c4c1d73f8e4811938281119fccb79792818c1
+ metadata.gz: 66e8b435303d07b2f81d260badc96662936599c9782916f7f014b74a7c617499
+ data.tar.gz: 7b55890c66ec09efd8d5749bd66605a4cb43d5091416f072f8fcc5aaaa85fbe7
  SHA512:
- metadata.gz: 44b098e2a97191801787386e9d2060dcdeacc625c3453976679fc276a73b2bf0614713764a55f7074073018e898f2e43dc1a7f4f803339a86158052f59dcabcb
- data.tar.gz: 8645c7095bb14590cf83c21905c9f5ed524e1047254e6526b8fe46a53f3989395472300d27fb65f899951a5f4b80ee9928accd23164b10e1a834975bf045db47
+ metadata.gz: fe1b605224f6682ac504f17b55ab83518556f1320f0410741af8f95bf3a669918c69b48832fb413ca1f78482fdbb7e0d2e7d6f57841c6a562b7f926f7511cdd7
+ data.tar.gz: 856be2111709bc96488b7d43abbc49c563a9a56330344adb4b9ec40fc263cb91e63465c3c3dab317c0d8930965a609a43102d53d80bbc2001e6165a15cb905fa
data/CHANGELOG.md CHANGED
@@ -9,6 +9,26 @@
  - ...
  ---
 
+ ## v0.10.8
+ ### Added
+ - Custom `#inspect` methods to `Wgit::Url` and `Wgit::Document` classes.
+ - `Document.remove_extractors` method, which removes all default and defined extractors.
+
+ ### Changed/Removed
+ - ...
+ ### Fixed
+ - ...
+ ---
+
+ ## v0.10.7
+ ### Added
+ - ...
+ ### Changed/Removed
+ - ...
+ ### Fixed
+ - Security vulnerabilities by updating gem dependencies.
+ ---
+
  ## v0.10.6
  ### Added
  - `Wgit::DSL` method `#crawl_url` (aliased to `#crawl`).
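
The v0.10.8 changelog entry mentions custom `#inspect` methods. As a rough illustration of the pattern, here is a minimal standalone sketch. `DemoUrl` and its `crawled:` keyword are hypothetical stand-ins invented for this example, not the gem's actual `Wgit::Url` class or constructor:

```ruby
# DemoUrl is a hypothetical stand-in for a String subclass such as Wgit::Url.
# It is NOT the gem's class; the constructor signature is an assumption.
class DemoUrl < String
  def initialize(url, crawled: false)
    super(url)          # Store the URL text as the String value.
    @crawled = crawled  # Extra state that a plain String cannot show.
  end

  # Custom #inspect so that `p url` and IRB output reveal the extra state
  # instead of looking identical to a plain String.
  def inspect
    "#<DemoUrl url=\"#{self}\" crawled=#{@crawled}>"
  end
end

url = DemoUrl.new('http://example.com')
puts url.inspect # prints: #<DemoUrl url="http://example.com" crawled=false>
```

Because `DemoUrl` still inherits from `String`, string comparisons and interpolation behave as before; only the debugging output changes.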
data/README.md CHANGED
@@ -18,7 +18,7 @@ Wgit was primarily designed to crawl static HTML websites to index and search t
 
  Wgit provides a high level, easy-to-use API and DSL that you can use in your own applications and scripts.
 
- Check out this [demo search engine](https://search-engine-rb.herokuapp.com) - [built](https://github.com/michaeltelford/search_engine) using Wgit and Sinatra - deployed to [Heroku](https://www.heroku.com/). Heroku's free tier is used so the initial page load may be slow. Try searching for "Matz" or something else that's Ruby related.
+ Check out this [demo search engine](https://wgit-search-engine.fly.dev) - [built](https://github.com/michaeltelford/search_engine) using Wgit and Sinatra - deployed to [fly.io](https://fly.io). Try searching for something that's Ruby related like "Matz" or "Rails".
 
  ## Table Of Contents
 
@@ -62,7 +62,23 @@ end
  puts JSON.generate(quotes)
  ```
 
- But what if we want to crawl and store the content in a database, so that it can be searched? Wgit makes it easy to index and search HTML using [MongoDB](https://www.mongodb.com/):
+ Which outputs:
+
+ ```text
+ [
+ {
+ "quote": "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”",
+ "author": "Jane Austen"
+ },
+ {
+ "quote": "“A day without sunshine is like, you know, night.”",
+ "author": "Steve Martin"
+ },
+ ...
+ ]
+ ```
+
+ Great! But what if we want to crawl and store the content in a database, so that it can be searched? Wgit makes it easy to index and search HTML using [MongoDB](https://www.mongodb.com/):
 
  ```ruby
  require 'wgit'
@@ -89,6 +105,8 @@ The `search` call (on the last line) will return and output the results:
  Quotes to Scrape
  “I am free of all prejudice. I hate everyone equally. ”
  http://quotes.toscrape.com/tag/humor/page/2/
+
+ ...
  ```
 
  Using a MongoDB [client](https://robomongo.org/), we can see that the two web pages have been indexed, along with their extracted *quotes* and *authors*:
data/lib/wgit/document.rb CHANGED
@@ -89,9 +89,9 @@ module Wgit
  #
  # @return [String] An xpath String to obtain a webpage's text elements.
  def self.text_elements_xpath
- Wgit::Document.text_elements.each_with_index.reduce("") do |xpath, (el, i)|
- xpath += " | " unless i.zero?
- xpath += format("//%s/text()", el)
+ Wgit::Document.text_elements.each_with_index.reduce('') do |xpath, (el, i)|
+ xpath += ' | ' unless i.zero?
+ xpath += format('//%s/text()', el)
  end
  end
 
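
The hunk above only swaps double quotes for single quotes, so behaviour should be unchanged. As a rough standalone illustration of the string that this `reduce` builds, the sketch below uses a hypothetical three-element list in place of `Wgit::Document.text_elements` (the real list is not shown in this diff):

```ruby
# Hypothetical element list; an assumption standing in for
# Wgit::Document.text_elements, whose contents are not shown in this diff.
text_elements = %i[p h1 a]

# Same shape as the method body in the hunk: join one
# "//<element>/text()" expression per element with " | " (xpath union).
xpath = text_elements.each_with_index.reduce('') do |acc, (el, i)|
  acc += ' | ' unless i.zero?
  acc += format('//%s/text()', el)
end

# An equivalent formulation using map and join, shown for comparison only.
joined = text_elements.map { |el| "//#{el}/text()" }.join(' | ')

puts xpath # prints: //p/text() | //h1/text() | //a/text()
```

Both forms produce the same string; the `map`/`join` variant is offered only as a reading aid, not as a suggested change to the gem.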
@@ -192,13 +192,27 @@ module Wgit
  Document.send(:remove_method, "init_#{var}_from_object")
 
  @extractors.delete(var.to_sym)
+
  true
  rescue NameError
  false
  end
 
+ # Removes all default and defined extractors by calling
+ # `Document.remove_extractor` underneath. See its documentation.
+ def self.remove_extractors
+ @extractors.each { |var| remove_extractor(var) }
+ end
+
  ### Document Instance Methods ###
 
+ # Overrides String#inspect to shorten the printed output of a Document.
+ #
+ # @return [String] A short textual representation of this Document.
+ def inspect
+ "#<Wgit::Document url=\"#{@url}\" html=#{size} bytes>"
+ end
+
  # Determines if both the url and html match. Use
  # doc.object_id == other.object_id for exact object comparison.
  #
@@ -505,7 +519,7 @@ be relative"
  # parameter.
  #
  # @param xpath [String, #call] Used to find the value/object in @html.
- # @param singleton [Boolean] singleton ? results.first (single Object) :
+ # @param singleton [Boolean] singleton ? results.first (single Object) :
  # results (Enumerable).
  # @param text_content_only [Boolean] text_content_only ? result.content
  # (String) : result (Nokogiri Object).
@@ -546,7 +560,7 @@ be relative"
  # parameter.
  #
  # @param xpath [String, #call] Used to find the value/object in @html.
- # @param singleton [Boolean] singleton ? results.first (single Object) :
+ # @param singleton [Boolean] singleton ? results.first (single Object) :
  # results (Enumerable).
  # @param text_content_only [Boolean] text_content_only ? result.content
  # (String) : result (Nokogiri Object).
data/lib/wgit/url.rb CHANGED
@@ -117,6 +117,13 @@ Addressable::URI::InvalidURIError")
  @date_crawled = bool ? Wgit::Utils.time_stamp : nil
  end
 
+ # Overrides String#inspect to distinguish this Url from a String.
+ #
+ # @return [String] A short textual representation of this Url.
+ def inspect
+ "#<Wgit::Url url=\"#{self}\" crawled=#{@crawled}>"
+ end
+
  # Overrides String#replace setting the new_url @uri and String value.
  #
  # @param new_url [Wgit::Url, String] The new URL value.
data/lib/wgit/version.rb CHANGED
@@ -6,7 +6,7 @@
  # @author Michael Telford
  module Wgit
  # The current gem version of Wgit.
- VERSION = '0.10.6'
+ VERSION = '0.10.8'
 
  # Returns the current gem version of Wgit as a String.
  def self.version
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: wgit
  version: !ruby/object:Gem::Version
- version: 0.10.6
+ version: 0.10.8
  platform: ruby
  authors:
  - Michael Telford
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2022-07-27 00:00:00.000000000 Z
+ date: 2023-08-18 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: addressable