wgit 0.10.7 → 0.10.8
- checksums.yaml +4 -4
- data/CHANGELOG.md +11 -0
- data/README.md +19 -1
- data/lib/wgit/document.rb +14 -0
- data/lib/wgit/url.rb +7 -0
- data/lib/wgit/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 66e8b435303d07b2f81d260badc96662936599c9782916f7f014b74a7c617499
+  data.tar.gz: 7b55890c66ec09efd8d5749bd66605a4cb43d5091416f072f8fcc5aaaa85fbe7
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: fe1b605224f6682ac504f17b55ab83518556f1320f0410741af8f95bf3a669918c69b48832fb413ca1f78482fdbb7e0d2e7d6f57841c6a562b7f926f7511cdd7
+  data.tar.gz: 856be2111709bc96488b7d43abbc49c563a9a56330344adb4b9ec40fc263cb91e63465c3c3dab317c0d8930965a609a43102d53d80bbc2001e6165a15cb905fa
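The digests recorded in checksums.yaml can be compared against locally computed ones. The following is a minimal sketch, assuming the `.gem` archive has already been unpacked so that `metadata.gz` and `data.tar.gz` exist on disk; `archive_digests` is a hypothetical helper name, and the temporary file only stands in for a real archive.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper: returns the hex SHA256 and SHA512 digests of the
# file at +path+, keyed the same way as the sections of checksums.yaml.
def archive_digests(path)
  {
    'SHA256' => Digest::SHA256.file(path).hexdigest,
    'SHA512' => Digest::SHA512.file(path).hexdigest
  }
end

# Stand-in for an unpacked data.tar.gz; replace with a real path to compare
# against the values listed in checksums.yaml.
Tempfile.create('data.tar.gz') do |file|
  file.write('placeholder bytes')
  file.flush

  archive_digests(file.path).each do |algorithm, hex|
    puts "#{algorithm}: #{hex}"
  end
end
```

A SHA256 hex digest is 64 characters long and a SHA512 hex digest is 128, which matches the lengths of the values in the hunk above.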
data/CHANGELOG.md
CHANGED
@@ -9,6 +9,17 @@
 - ...
 ---
 
+## v0.10.8
+### Added
+- Custom `#inspect` methods to `Wgit::Url` and `Wgit::Document` classes.
+- `Document.remove_extractors` method, which removes all default and defined extractors.
+
+### Changed/Removed
+- ...
+### Fixed
+- ...
+---
+
 ## v0.10.7
 ### Added
 - ...
data/README.md
CHANGED
@@ -62,7 +62,23 @@ end
 puts JSON.generate(quotes)
 ```
 
-
+Which outputs:
+
+```text
+[
+  {
+    "quote": "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”",
+    "author": "Jane Austen"
+  },
+  {
+    "quote": "“A day without sunshine is like, you know, night.”",
+    "author": "Steve Martin"
+  },
+  ...
+]
+```
+
+Great! But what if we want to crawl and store the content in a database, so that it can be searched? Wgit makes it easy to index and search HTML using [MongoDB](https://www.mongodb.com/):
 
 ```ruby
 require 'wgit'
@@ -89,6 +105,8 @@ The `search` call (on the last line) will return and output the results:
 Quotes to Scrape
 “I am free of all prejudice. I hate everyone equally. ”
 http://quotes.toscrape.com/tag/humor/page/2/
+
+...
 ```
 
 Using a MongoDB [client](https://robomongo.org/), we can see that the two web pages have been indexed, along with their extracted *quotes* and *authors*:
data/lib/wgit/document.rb
CHANGED
@@ -192,13 +192,27 @@ module Wgit
       Document.send(:remove_method, "init_#{var}_from_object")
 
       @extractors.delete(var.to_sym)
+
       true
     rescue NameError
       false
     end
 
+    # Removes all default and defined extractors by calling
+    # `Document.remove_extractor` underneath. See its documentation.
+    def self.remove_extractors
+      @extractors.each { |var| remove_extractor(var) }
+    end
+
     ### Document Instance Methods ###
 
+    # Overrides String#inspect to shorten the printed output of a Document.
+    #
+    # @return [String] A short textual representation of this Document.
+    def inspect
+      "#<Wgit::Document url=\"#{@url}\" html=#{size} bytes>"
+    end
+
     # Determines if both the url and html match. Use
     # doc.object_id == other.object_id for exact object comparison.
     #
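The two additions in this hunk follow a common Ruby pattern: a class-level registry that can be emptied in one call, and a compact `#inspect`. The sketch below illustrates that pattern without loading the gem. `SketchDocument`, `define_extractor`, and the byte-size definition of `size` are assumptions made for illustration, not Wgit's actual implementation. Iterating over a copy of the set (`to_a`) is a choice of this sketch rather than a claim about the gem.

```ruby
require 'set'

# Hypothetical stand-in for Wgit::Document; it does not load the gem.
class SketchDocument
  @extractors = Set.new

  class << self
    attr_reader :extractors

    # Registers an extractor name and defines a matching reader method.
    def define_extractor(var)
      define_method(var) { "value of #{var}" }
      @extractors << var.to_sym
    end

    # Removes one extractor; returns false if its method was never defined.
    def remove_extractor(var)
      remove_method(var)
      @extractors.delete(var.to_sym)
      true
    rescue NameError
      false
    end

    # Removes every registered extractor, iterating over a copy of the set
    # so that it is not mutated while being traversed.
    def remove_extractors
      @extractors.to_a.each { |var| remove_extractor(var) }
    end
  end

  def initialize(url, html)
    @url = url
    @html = html
  end

  # Assumed for this sketch: size is the HTML length in bytes.
  def size
    @html.bytesize
  end

  # Short representation instead of printing the full HTML.
  def inspect
    "#<SketchDocument url=\"#{@url}\" html=#{size} bytes>"
  end
end

SketchDocument.define_extractor(:title)
SketchDocument.define_extractor(:author)

doc = SketchDocument.new('http://example.com', '<html><p>Hello</p></html>')
p doc # prints #<SketchDocument url="http://example.com" html=25 bytes>

SketchDocument.remove_extractors
p SketchDocument.extractors.empty? # prints true
```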
data/lib/wgit/url.rb
CHANGED
@@ -117,6 +117,13 @@ Addressable::URI::InvalidURIError")
       @date_crawled = bool ? Wgit::Utils.time_stamp : nil
     end
 
+    # Overrides String#inspect to distingiush this Url from a String.
+    #
+    # @return [String] A short textual representation of this Url.
+    def inspect
+      "#<Wgit::Url url=\"#{self}\" crawled=#{@crawled}>"
+    end
+
     # Overrides String#replace setting the new_url @uri and String value.
     #
     # @param new_url [Wgit::Url, String] The new URL value.
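To see why a String subclass benefits from its own `#inspect`, the following minimal sketch contrasts the default String output with an overridden one. `SketchUrl` and its `crawled:` keyword are hypothetical stand-ins introduced for illustration and are not taken from the gem.

```ruby
# Hypothetical stand-in for a String subclass such as Wgit::Url.
class SketchUrl < String
  attr_accessor :crawled

  def initialize(url, crawled: false)
    super(url)
    @crawled = crawled
  end

  # Without this override, inspect would print only the quoted string,
  # making the object indistinguishable from a plain String in logs.
  def inspect
    "#<SketchUrl url=\"#{self}\" crawled=#{@crawled}>"
  end
end

url = SketchUrl.new('http://example.com')

puts 'http://example.com'.inspect # prints "http://example.com"
puts url.inspect                  # prints #<SketchUrl url="http://example.com" crawled=false>
```

Because `p` and IRB both call `#inspect`, the override changes what is shown when debugging while leaving `to_s` and string comparisons untouched.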
data/lib/wgit/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: wgit
 version: !ruby/object:Gem::Version
-  version: 0.10.
+  version: 0.10.8
 platform: ruby
 authors:
 - Michael Telford
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2023-08-18 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: addressable