wgit 0.10.6 → 0.10.8
- checksums.yaml +4 -4
- data/CHANGELOG.md +20 -0
- data/README.md +20 -2
- data/lib/wgit/document.rb +19 -5
- data/lib/wgit/url.rb +7 -0
- data/lib/wgit/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 66e8b435303d07b2f81d260badc96662936599c9782916f7f014b74a7c617499
+  data.tar.gz: 7b55890c66ec09efd8d5749bd66605a4cb43d5091416f072f8fcc5aaaa85fbe7
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: fe1b605224f6682ac504f17b55ab83518556f1320f0410741af8f95bf3a669918c69b48832fb413ca1f78482fdbb7e0d2e7d6f57841c6a562b7f926f7511cdd7
+  data.tar.gz: 856be2111709bc96488b7d43abbc49c563a9a56330344adb4b9ec40fc263cb91e63465c3c3dab317c0d8930965a609a43102d53d80bbc2001e6165a15cb905fa
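The SHA256 and SHA512 entries above are hex digests of the gem's `metadata.gz` and `data.tar.gz` members. As a rough sketch of how such a digest could be checked with Ruby's standard `digest` library (the `digest_matches?` helper and the sample input are hypothetical and are not part of RubyGems or wgit):

```ruby
require 'digest'

# Hypothetical helper: returns true when the SHA256 hex digest of +bytes+
# equals +expected_hex+ (compared case-insensitively).
def digest_matches?(bytes, expected_hex)
  Digest::SHA256.hexdigest(bytes) == expected_hex.downcase
end

# SHA256 of the empty string, used here only as a well-known test vector.
EMPTY_SHA256 = 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'

puts digest_matches?('', EMPTY_SHA256)         # true
puts digest_matches?('tampered', EMPTY_SHA256) # false
```

To check a real gem, one would presumably unpack the `.gem` archive and pass `File.binread('data.tar.gz')` together with the value listed in `checksums.yaml`.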
data/CHANGELOG.md
CHANGED
@@ -9,6 +9,26 @@
 - ...
 ---
 
+## v0.10.8
+### Added
+- Custom `#inspect` methods to `Wgit::Url` and `Wgit::Document` classes.
+- `Document.remove_extractors` method, which removes all default and defined extractors.
+
+### Changed/Removed
+- ...
+### Fixed
+- ...
+---
+
+## v0.10.7
+### Added
+- ...
+### Changed/Removed
+- ...
+### Fixed
+- Security vulnerabilities by updating gem dependencies.
+---
+
 ## v0.10.6
 ### Added
 - `Wgit::DSL` method `#crawl_url` (aliased to `#crawl`).
data/README.md
CHANGED
@@ -18,7 +18,7 @@ Wgit was primarily designed to crawl static HTML websites to index and search t
 
 Wgit provides a high level, easy-to-use API and DSL that you can use in your own applications and scripts.
 
-Check out this [demo search engine](https://search-engine
+Check out this [demo search engine](https://wgit-search-engine.fly.dev) - [built](https://github.com/michaeltelford/search_engine) using Wgit and Sinatra - deployed to [fly.io](https://fly.io). Try searching for something that's Ruby related like "Matz" or "Rails".
 
 ## Table Of Contents
 
@@ -62,7 +62,23 @@ end
 puts JSON.generate(quotes)
 ```
 
-
+Which outputs:
+
+```text
+[
+  {
+    "quote": "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”",
+    "author": "Jane Austen"
+  },
+  {
+    "quote": "“A day without sunshine is like, you know, night.”",
+    "author": "Steve Martin"
+  },
+  ...
+]
+```
+
+Great! But what if we want to crawl and store the content in a database, so that it can be searched? Wgit makes it easy to index and search HTML using [MongoDB](https://www.mongodb.com/):
 
 ```ruby
 require 'wgit'
@@ -89,6 +105,8 @@ The `search` call (on the last line) will return and output the results:
 Quotes to Scrape
 “I am free of all prejudice. I hate everyone equally. ”
 http://quotes.toscrape.com/tag/humor/page/2/
+
+...
 ```
 
 Using a MongoDB [client](https://robomongo.org/), we can see that the two web pages have been indexed, along with their extracted *quotes* and *authors*:
data/lib/wgit/document.rb
CHANGED
@@ -89,9 +89,9 @@ module Wgit
     #
     # @return [String] An xpath String to obtain a webpage's text elements.
     def self.text_elements_xpath
-      Wgit::Document.text_elements.each_with_index.reduce(
-        xpath +=
-        xpath += format(
+      Wgit::Document.text_elements.each_with_index.reduce('') do |xpath, (el, i)|
+        xpath += ' | ' unless i.zero?
+        xpath += format('//%s/text()', el)
       end
     end
 
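The added lines build a single XPath union by folding over the text elements and inserting ` | ` before every element except the first. A minimal standalone sketch of that pattern, assuming a hypothetical three-element list in place of `Wgit::Document.text_elements`, together with an equivalent `map`/`join` formulation for comparison:

```ruby
# Hypothetical element list; the real one lives in Wgit::Document.text_elements.
TEXT_ELEMENTS = %i[h1 p a].freeze

# Same shape as the reduce in the diff: accumulate a String, adding ' | '
# before every element except the first.
def xpath_with_reduce(elements)
  elements.each_with_index.reduce('') do |xpath, (el, i)|
    xpath += ' | ' unless i.zero?
    xpath + format('//%s/text()', el)
  end
end

# Equivalent alternative using map and join.
def xpath_with_join(elements)
  elements.map { |el| format('//%s/text()', el) }.join(' | ')
end

puts xpath_with_reduce(TEXT_ELEMENTS)
# //h1/text() | //p/text() | //a/text()
```

Both helpers return an empty String for an empty list, so either form could back a method documented as returning a String.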
@@ -192,13 +192,27 @@ module Wgit
       Document.send(:remove_method, "init_#{var}_from_object")
 
       @extractors.delete(var.to_sym)
+
       true
     rescue NameError
       false
     end
 
+    # Removes all default and defined extractors by calling
+    # `Document.remove_extractor` underneath. See its documentation.
+    def self.remove_extractors
+      @extractors.each { |var| remove_extractor(var) }
+    end
+
     ### Document Instance Methods ###
 
+    # Overrides String#inspect to shorten the printed output of a Document.
+    #
+    # @return [String] A short textual representation of this Document.
+    def inspect
+      "#<Wgit::Document url=\"#{@url}\" html=#{size} bytes>"
+    end
+
     # Determines if both the url and html match. Use
     # doc.object_id == other.object_id for exact object comparison.
     #
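This hunk adds a class-level `remove_extractors` and a shortened `#inspect`. The following is only a sketch of the two ideas on a hypothetical `MiniDocument` class; it does not reproduce the real `Wgit::Document` internals. Iterating over a copy of the registry is a defensive choice of this sketch, not something the diff does.

```ruby
require 'set'

# Hypothetical stand-in for a class that keeps a class-level registry of
# extractor names; it is not the real Wgit::Document.
class MiniDocument
  @extractors = Set.new(%i[title author])

  class << self
    attr_reader :extractors

    # Removes one extractor; true if it was registered, false otherwise.
    def remove_extractor(var)
      !@extractors.delete?(var.to_sym).nil?
    end

    # Removes every registered extractor. Iterates over a copy (to_a) so the
    # Set is never mutated while it is being traversed.
    def remove_extractors
      @extractors.to_a.each { |var| remove_extractor(var) }
      @extractors.empty?
    end
  end

  def initialize(url, html)
    @url = url
    @html = html
  end

  # Short representation: the URL and the HTML size rather than the full HTML.
  def inspect
    "#<MiniDocument url=\"#{@url}\" html=#{@html.bytesize} bytes>"
  end
end

puts MiniDocument.remove_extractors # true
doc = MiniDocument.new('http://example.com', '<p>Hi</p>')
puts doc.inspect # #<MiniDocument url="http://example.com" html=9 bytes>
```

Keeping `inspect` short matters mostly in a REPL or in logs, where dumping a page's full HTML would drown out everything else.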
@@ -505,7 +519,7 @@ be relative"
     # parameter.
     #
     # @param xpath [String, #call] Used to find the value/object in @html.
-    # @param singleton [Boolean] singleton ? results.first (single Object) :
+    # @param singleton [Boolean] singleton ? results.first (single Object) :
     # results (Enumerable).
     # @param text_content_only [Boolean] text_content_only ? result.content
     # (String) : result (Nokogiri Object).
@@ -546,7 +560,7 @@ be relative"
     # parameter.
     #
     # @param xpath [String, #call] Used to find the value/object in @html.
-    # @param singleton [Boolean] singleton ? results.first (single Object) :
+    # @param singleton [Boolean] singleton ? results.first (single Object) :
     # results (Enumerable).
     # @param text_content_only [Boolean] text_content_only ? result.content
     # (String) : result (Nokogiri Object).
data/lib/wgit/url.rb
CHANGED
@@ -117,6 +117,13 @@ Addressable::URI::InvalidURIError")
       @date_crawled = bool ? Wgit::Utils.time_stamp : nil
     end
 
+    # Overrides String#inspect to distinguish this Url from a String.
+    #
+    # @return [String] A short textual representation of this Url.
+    def inspect
+      "#<Wgit::Url url=\"#{self}\" crawled=#{@crawled}>"
+    end
+
     # Overrides String#replace setting the new_url @uri and String value.
     #
     # @param new_url [Wgit::Url, String] The new URL value.
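The new `#inspect` makes a `Wgit::Url` visibly different from a plain String when printed. A small sketch of the same idea, assuming a hypothetical `MiniUrl` String subclass that tracks only a `crawled` flag:

```ruby
# Hypothetical String subclass standing in for Wgit::Url.
class MiniUrl < String
  attr_accessor :crawled

  def initialize(url, crawled: false)
    super(url)
    @crawled = crawled
  end

  # Without this override, inspect would return just the quoted String.
  def inspect
    "#<MiniUrl url=\"#{self}\" crawled=#{@crawled}>"
  end
end

url = MiniUrl.new('http://example.com')
puts url.inspect                  # #<MiniUrl url="http://example.com" crawled=false>
puts url == 'http://example.com'  # true
```

Only the printed form changes; String equality and the other String methods behave as before.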
data/lib/wgit/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: wgit
 version: !ruby/object:Gem::Version
-  version: 0.10.
+  version: 0.10.8
 platform: ruby
 authors:
 - Michael Telford
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2023-08-18 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: addressable