wgit 0.10.7 → 0.10.8
- checksums.yaml +4 -4
- data/CHANGELOG.md +11 -0
- data/README.md +19 -1
- data/lib/wgit/document.rb +14 -0
- data/lib/wgit/url.rb +7 -0
- data/lib/wgit/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 66e8b435303d07b2f81d260badc96662936599c9782916f7f014b74a7c617499
+  data.tar.gz: 7b55890c66ec09efd8d5749bd66605a4cb43d5091416f072f8fcc5aaaa85fbe7
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: fe1b605224f6682ac504f17b55ab83518556f1320f0410741af8f95bf3a669918c69b48832fb413ca1f78482fdbb7e0d2e7d6f57841c6a562b7f926f7511cdd7
+  data.tar.gz: 856be2111709bc96488b7d43abbc49c563a9a56330344adb4b9ec40fc263cb91e63465c3c3dab317c0d8930965a609a43102d53d80bbc2001e6165a15cb905fa
data/CHANGELOG.md
CHANGED
@@ -9,6 +9,17 @@
 - ...
 ---
 
+## v0.10.8
+### Added
+- Custom `#inspect` methods to `Wgit::Url` and `Wgit::Document` classes.
+- `Document.remove_extractors` method, which removes all default and defined extractors.
+
+### Changed/Removed
+- ...
+### Fixed
+- ...
+---
+
 ## v0.10.7
 ### Added
 - ...
data/README.md
CHANGED
@@ -62,7 +62,23 @@ end
 puts JSON.generate(quotes)
 ```
 
-
+Which outputs:
+
+```text
+[
+  {
+    "quote": "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”",
+    "author": "Jane Austen"
+  },
+  {
+    "quote": "“A day without sunshine is like, you know, night.”",
+    "author": "Steve Martin"
+  },
+  ...
+]
+```
+
+Great! But what if we want to crawl and store the content in a database, so that it can be searched? Wgit makes it easy to index and search HTML using [MongoDB](https://www.mongodb.com/):
 
 ```ruby
 require 'wgit'
@@ -89,6 +105,8 @@ The `search` call (on the last line) will return and output the results:
 Quotes to Scrape
 “I am free of all prejudice. I hate everyone equally. ”
 http://quotes.toscrape.com/tag/humor/page/2/
+
+...
 ```
 
 Using a MongoDB [client](https://robomongo.org/), we can see that the two web pages have been indexed, along with their extracted *quotes* and *authors*:
data/lib/wgit/document.rb
CHANGED
@@ -192,13 +192,27 @@ module Wgit
       Document.send(:remove_method, "init_#{var}_from_object")
 
       @extractors.delete(var.to_sym)
+
       true
     rescue NameError
       false
     end
 
+    # Removes all default and defined extractors by calling
+    # `Document.remove_extractor` underneath. See its documentation.
+    def self.remove_extractors
+      @extractors.each { |var| remove_extractor(var) }
+    end
+
     ### Document Instance Methods ###
 
+    # Overrides String#inspect to shorten the printed output of a Document.
+    #
+    # @return [String] A short textual representation of this Document.
+    def inspect
+      "#<Wgit::Document url=\"#{@url}\" html=#{size} bytes>"
+    end
+
     # Determines if both the url and html match. Use
     # doc.object_id == other.object_id for exact object comparison.
     #
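The "remove every registered extractor" pattern added above can be tried without installing the gem. The following is a minimal sketch only: `DemoDocument`, `define_extractor` and the generated reader methods are hypothetical stand-ins, not Wgit's actual implementation. It also iterates over a copy of the registry (`to_a`), which is a defensive choice for this sketch; the release iterates `@extractors` directly.

```ruby
require 'set'

# Hypothetical stand-in for a class-level extractor registry.
# This is NOT Wgit::Document; names and behaviour are assumptions.
class DemoDocument
  @extractors = Set.new

  class << self
    attr_reader :extractors

    # Registers an extractor name and defines a reader method for it.
    def define_extractor(var)
      define_method(var) { "value of #{var}" }
      @extractors << var.to_sym
      var.to_sym
    end

    # Removes a single extractor; returns false if it was never defined,
    # because remove_method raises NameError for unknown methods.
    def remove_extractor(var)
      remove_method(var)
      @extractors.delete(var.to_sym)
      true
    rescue NameError
      false
    end

    # Removes every registered extractor. Iterating over a copy means
    # deleting from the Set inside the loop cannot disturb the iteration.
    def remove_extractors
      @extractors.to_a.each { |var| remove_extractor(var) }
    end
  end
end

DemoDocument.define_extractor(:title)
DemoDocument.define_extractor(:author)
DemoDocument.remove_extractors

puts DemoDocument.extractors.empty?       # prints "true"
puts DemoDocument.new.respond_to?(:title) # prints "false"
```

Delegating to the single-item remover keeps the method removal and the registry bookkeeping in one place, which appears to be the same design choice as in the hunk above.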
data/lib/wgit/url.rb
CHANGED
@@ -117,6 +117,13 @@ Addressable::URI::InvalidURIError")
       @date_crawled = bool ? Wgit::Utils.time_stamp : nil
     end
 
+    # Overrides String#inspect to distinguish this Url from a String.
+    #
+    # @return [String] A short textual representation of this Url.
+    def inspect
+      "#<Wgit::Url url=\"#{self}\" crawled=#{@crawled}>"
+    end
+
     # Overrides String#replace setting the new_url @uri and String value.
     #
     # @param new_url [Wgit::Url, String] The new URL value.
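To illustrate why a custom `#inspect` helps on a `String` subclass, here is a small self-contained sketch. `DemoUrl` is a hypothetical class written for this example (it is not `Wgit::Url`); it only mirrors the shape of the method added above, so that a subclass instance no longer prints like a plain `String` in `irb` or `p` output.

```ruby
# Hypothetical stand-in for a String subclass that carries extra state.
# This is NOT Wgit::Url; it only mirrors the #inspect format shown above.
class DemoUrl < String
  attr_reader :crawled

  def initialize(str, crawled: false)
    super(str)
    @crawled = crawled
  end

  # Without this override, inspect would return the quoted string,
  # making a DemoUrl indistinguishable from a String when printed.
  def inspect
    "#<DemoUrl url=\"#{self}\" crawled=#{@crawled}>"
  end
end

plain = 'http://example.com'
url   = DemoUrl.new('http://example.com', crawled: true)

puts plain.inspect # prints "http://example.com" (including the quotes)
puts url.inspect   # prints #<DemoUrl url="http://example.com" crawled=true>
```

Only `#inspect` changes here: `to_s`, equality and every other `String` method behave as before, so the override affects debugging output alone.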
data/lib/wgit/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: wgit
 version: !ruby/object:Gem::Version
-  version: 0.10.
+  version: 0.10.8
 platform: ruby
 authors:
 - Michael Telford
 autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2023-08-18 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: addressable