wgit 0.10.6 → 0.10.8
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +20 -0
- data/README.md +20 -2
- data/lib/wgit/document.rb +19 -5
- data/lib/wgit/url.rb +7 -0
- data/lib/wgit/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 66e8b435303d07b2f81d260badc96662936599c9782916f7f014b74a7c617499
+  data.tar.gz: 7b55890c66ec09efd8d5749bd66605a4cb43d5091416f072f8fcc5aaaa85fbe7
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: fe1b605224f6682ac504f17b55ab83518556f1320f0410741af8f95bf3a669918c69b48832fb413ca1f78482fdbb7e0d2e7d6f57841c6a562b7f926f7511cdd7
+  data.tar.gz: 856be2111709bc96488b7d43abbc49c563a9a56330344adb4b9ec40fc263cb91e63465c3c3dab317c0d8930965a609a43102d53d80bbc2001e6165a15cb905fa
data/CHANGELOG.md
CHANGED
@@ -9,6 +9,26 @@
 - ...
 ---
 
+## v0.10.8
+### Added
+- Custom `#inspect` methods to `Wgit::Url` and `Wgit::Document` classes.
+- `Document.remove_extractors` method, which removes all default and defined extractors.
+
+### Changed/Removed
+- ...
+### Fixed
+- ...
+---
+
+## v0.10.7
+### Added
+- ...
+### Changed/Removed
+- ...
+### Fixed
+- Security vulnerabilities by updating gem dependencies.
+---
+
 ## v0.10.6
 ### Added
 - `Wgit::DSL` method `#crawl_url` (aliased to `#crawl`).
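The `remove_extractors` entry above describes a bulk removal that delegates to the existing single-item removal. A minimal sketch of that pattern follows. `DemoDocument` and its extractor names are hypothetical stand-ins, not Wgit's actual class or registry, and the sketch iterates over a copy of the set as a defensive choice that the gem itself may not make.

```ruby
require 'set'

# Hypothetical stand-in for a class-level extractor registry (not Wgit itself).
class DemoDocument
  # Class-level instance variable holding the registered extractor names.
  @extractors = Set.new(%i[title author keywords])

  class << self
    attr_reader :extractors

    # Removes a single extractor; returns true if it was registered.
    def remove_extractor(var)
      !@extractors.delete?(var.to_sym).nil?
    end

    # Removes every extractor by delegating to remove_extractor.
    # Iterates over a copy so the Set is not modified while being traversed.
    def remove_extractors
      @extractors.to_a.each { |var| remove_extractor(var) }
      nil
    end
  end
end

DemoDocument.remove_extractors
puts DemoDocument.extractors.empty?
```

Delegating to the single-item method keeps any per-extractor cleanup in one place, which appears to be the intent of the documented behaviour.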
data/README.md
CHANGED
@@ -18,7 +18,7 @@ Wgit was primarily designed to crawl static HTML websites to index and search t
 
 Wgit provides a high level, easy-to-use API and DSL that you can use in your own applications and scripts.
 
-Check out this [demo search engine](https://search-engine
+Check out this [demo search engine](https://wgit-search-engine.fly.dev) - [built](https://github.com/michaeltelford/search_engine) using Wgit and Sinatra - deployed to [fly.io](https://fly.io). Try searching for something that's Ruby related like "Matz" or "Rails".
 
 ## Table Of Contents
 
@@ -62,7 +62,23 @@ end
 puts JSON.generate(quotes)
 ```
 
-
+Which outputs:
+
+```text
+[
+  {
+    "quote": "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”",
+    "author": "Jane Austen"
+  },
+  {
+    "quote": "“A day without sunshine is like, you know, night.”",
+    "author": "Steve Martin"
+  },
+  ...
+]
+```
+
+Great! But what if we want to crawl and store the content in a database, so that it can be searched? Wgit makes it easy to index and search HTML using [MongoDB](https://www.mongodb.com/):
 
 ```ruby
 require 'wgit'
@@ -89,6 +105,8 @@ The `search` call (on the last line) will return and output the results:
 Quotes to Scrape
 “I am free of all prejudice. I hate everyone equally. ”
 http://quotes.toscrape.com/tag/humor/page/2/
+
+...
 ```
 
 Using a MongoDB [client](https://robomongo.org/), we can see that the two web pages have been indexed, along with their extracted *quotes* and *authors*:
data/lib/wgit/document.rb
CHANGED
@@ -89,9 +89,9 @@ module Wgit
 #
 # @return [String] An xpath String to obtain a webpage's text elements.
 def self.text_elements_xpath
-  Wgit::Document.text_elements.each_with_index.reduce(
-    xpath +=
-    xpath += format(
+  Wgit::Document.text_elements.each_with_index.reduce('') do |xpath, (el, i)|
+    xpath += ' | ' unless i.zero?
+    xpath += format('//%s/text()', el)
   end
 end
 
@@ -192,13 +192,27 @@ module Wgit
   Document.send(:remove_method, "init_#{var}_from_object")
 
   @extractors.delete(var.to_sym)
+
   true
 rescue NameError
   false
 end
 
+# Removes all default and defined extractors by calling
+# `Document.remove_extractor` underneath. See its documentation.
+def self.remove_extractors
+  @extractors.each { |var| remove_extractor(var) }
+end
+
 ### Document Instance Methods ###
 
+# Overrides String#inspect to shorten the printed output of a Document.
+#
+# @return [String] A short textual representation of this Document.
+def inspect
+  "#<Wgit::Document url=\"#{@url}\" html=#{size} bytes>"
+end
+
 # Determines if both the url and html match. Use
 # doc.object_id == other.object_id for exact object comparison.
 #
@@ -505,7 +519,7 @@ be relative"
 # parameter.
 #
 # @param xpath [String, #call] Used to find the value/object in @html.
-# @param singleton [Boolean] singleton ? results.first (single Object) :
+# @param singleton [Boolean] singleton ? results.first (single Object) :
 # results (Enumerable).
 # @param text_content_only [Boolean] text_content_only ? result.content
 # (String) : result (Nokogiri Object).
@@ -546,7 +560,7 @@ be relative"
 # parameter.
 #
 # @param xpath [String, #call] Used to find the value/object in @html.
-# @param singleton [Boolean] singleton ? results.first (single Object) :
+# @param singleton [Boolean] singleton ? results.first (single Object) :
 # results (Enumerable).
 # @param text_content_only [Boolean] text_content_only ? result.content
 # (String) : result (Nokogiri Object).
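The first hunk above rewrites `text_elements_xpath` as a block-form `reduce` that joins one `//element/text()` expression per text element with ` | `. The standalone sketch below reproduces that accumulation outside the gem. `TEXT_ELEMENTS` is a hypothetical three-item list chosen for illustration (Wgit's real list is not shown in this diff), and the `map`/`join` variant is an assumed equivalent, not code from the package.

```ruby
# Hypothetical subset of text elements, for illustration only.
TEXT_ELEMENTS = %i[p h1 li].freeze

# Mirrors the reduce-based accumulation shown in the hunk:
# a ' | ' separator is added before every element except the first.
def text_elements_xpath(elements = TEXT_ELEMENTS)
  elements.each_with_index.reduce('') do |xpath, (el, i)|
    xpath += ' | ' unless i.zero?
    xpath + format('//%s/text()', el)
  end
end

# An equivalent, more compact form using map and join.
def text_elements_xpath_joined(elements = TEXT_ELEMENTS)
  elements.map { |el| format('//%s/text()', el) }.join(' | ')
end

puts text_elements_xpath
# prints: //p/text() | //h1/text() | //li/text()
```

Both forms produce the same string. The `reduce` version needs the explicit `''` seed, because without it the first `[element, index]` pair would become the accumulator.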
data/lib/wgit/url.rb
CHANGED
@@ -117,6 +117,13 @@ Addressable::URI::InvalidURIError")
   @date_crawled = bool ? Wgit::Utils.time_stamp : nil
 end
 
+# Overrides String#inspect to distingiush this Url from a String.
+#
+# @return [String] A short textual representation of this Url.
+def inspect
+  "#<Wgit::Url url=\"#{self}\" crawled=#{@crawled}>"
+end
+
 # Overrides String#replace setting the new_url @uri and String value.
 #
 # @param new_url [Wgit::Url, String] The new URL value.
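The `inspect` override added above makes a String subclass print differently from a plain String in consoles and debug output. The sketch below shows the same pattern in isolation. `DemoUrl` and its `crawled:` keyword are hypothetical and only mimic the shape of the change; they are not `Wgit::Url`'s real constructor.

```ruby
# Hypothetical String subclass illustrating the inspect override pattern.
class DemoUrl < String
  attr_accessor :crawled

  def initialize(url, crawled: false)
    super(url) # Initialise the underlying String value.
    @crawled = crawled
  end

  # Overrides String#inspect so the object is not mistaken for a plain String.
  def inspect
    "#<DemoUrl url=\"#{self}\" crawled=#{@crawled}>"
  end
end

url = DemoUrl.new('http://example.com')
puts url.inspect
# prints: #<DemoUrl url="http://example.com" crawled=false>
```

String equality is unaffected, so `url == 'http://example.com'` still holds. Only the debugging representation changes.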
data/lib/wgit/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: wgit
 version: !ruby/object:Gem::Version
-  version: 0.10.
+  version: 0.10.8
 platform: ruby
 authors:
 - Michael Telford
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2023-08-18 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: addressable