wgit 0.9.0 → 0.10.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +11 -2
- data/README.md +36 -43
- data/lib/wgit/database/database.rb +1 -1
- data/lib/wgit/document.rb +7 -0
- data/lib/wgit/url.rb +51 -39
- data/lib/wgit/version.rb +1 -1
- metadata +5 -5
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: b6719bb2015379133ef2c9b417cada1826deab254f6fa1adaa093314f8fece99
|
4
|
+
data.tar.gz: 5ced648c0dff501bf0191aebfc0188d535f4ee657a072e1dbccd68ebbc6ac881
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 4a7782b4ccf6fa69fad9bb63d7d421fa548603ad5a35304db554bdcdf6deafe305395aba1ac9f35bcd095bc6cf4049ce70e56645faf1457e2e1313d48d1eb7f8
|
7
|
+
data.tar.gz: 8b8bb1454a131201e262eda060c6ae8490266a7675910026a0dd6ae0b2b55f2accf140d473edf135078f68cbe1048c4bb86f2dc5a6d4cf08a006f8fc20ac49b5
|
data/CHANGELOG.md
CHANGED
@@ -9,6 +9,15 @@
|
|
9
9
|
- ...
|
10
10
|
---
|
11
11
|
|
12
|
+
## v0.10.0
|
13
|
+
### Added
|
14
|
+
- `Wgit::Url#scheme_relative?` method.
|
15
|
+
### Changed/Removed
|
16
|
+
- Breaking change: Changed method signature of `Wgit::Url#prefix_scheme` by making the previously named parameter a defaulted positional parameter. Remove the `protocol` named parameter for the old behaviour.
|
17
|
+
### Fixed
|
18
|
+
- [Scheme-relative bug](https://github.com/michaeltelford/wgit/issues/10) by adding support for scheme-relative URL's.
|
19
|
+
---
|
20
|
+
|
12
21
|
## v0.9.0
|
13
22
|
This release is a big one with the introduction of a `Wgit::DSL` and Javascript parse support. The `README` has been revamped as a result with new usage examples. And all of the wiki articles have been updated to reflect the latest code base.
|
14
23
|
### Added
|
@@ -112,7 +121,7 @@ This release is a big one with the introduction of a `Wgit::DSL` and Javascript
|
|
112
121
|
- `Wgit::Response` class containing adapter agnostic HTTP response logic.
|
113
122
|
### Changed/Removed
|
114
123
|
- Breaking changes: Removed `Wgit::Document#date_crawled` and `#crawl_duration` because both of these methods exist on the `Wgit::Document#url`. Instead, use `doc.url.date_crawled` etc.
|
115
|
-
- Breaking changes: Added to and moved `Document.define_extension` block params, it's now `|value, source, type|`. The `source` is not what it used to be; it's now `type` - of either `:document` or `:object`. Confused? See the [docs](https://www.rubydoc.info/
|
124
|
+
- Breaking changes: Added to and moved `Document.define_extension` block params, it's now `|value, source, type|`. The `source` is not what it used to be; it's now `type` - of either `:document` or `:object`. Confused? See the [docs](https://www.rubydoc.info/gems/wgit).
|
116
125
|
- Breaking changes: Changed `Wgit::Url#prefix_protocol` so that it no longer modifies the receiver.
|
117
126
|
- Breaking changes: Updated `Wgit::Url#to_anchor` and `#to_query` logic to align with that of `Addressable::URI` e.g. the anchor value no longer contains `#` prefix; and the query value no longer contains `?` prefix.
|
118
127
|
- Breaking changes: Renamed `Wgit::Url` methods containing `anchor` to now be named `fragment` e.g. `to_anchor` is now called `to_fragment` and `without_anchor` is `without_fragment` etc.
|
@@ -160,7 +169,7 @@ This release is a big one with the introduction of a `Wgit::DSL` and Javascript
|
|
160
169
|
---
|
161
170
|
|
162
171
|
## v0.2.0
|
163
|
-
This version of Wgit see's a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes are below including how to fix any breakages; but if you're having issues with the upgrade see the documentation at: https://www.rubydoc.info/
|
172
|
+
This version of Wgit see's a major refactor of the code base involving multiple changes to method names and their signatures (optional parameters turned into named parameters in most cases). A list of the breaking changes are below including how to fix any breakages; but if you're having issues with the upgrade see the documentation at: https://www.rubydoc.info/gems/wgit
|
164
173
|
### Added
|
165
174
|
- `Wgit::Url#absolute?` method.
|
166
175
|
- `Wgit::Url#relative? base: url` support.
|
data/README.md
CHANGED
@@ -10,7 +10,7 @@
|
|
10
10
|
|
11
11
|
Wgit is a HTML web crawler, written in Ruby, that allows you to programmatically extract the data you want from the web.
|
12
12
|
|
13
|
-
Wgit was primarily designed to crawl static HTML websites to index and
|
13
|
+
Wgit was primarily designed to crawl static HTML websites to index and search their content - providing the basis of any search engine; but Wgit is suitable for many application domains including:
|
14
14
|
|
15
15
|
- URL parsing
|
16
16
|
- Document content extraction (data mining)
|
@@ -62,31 +62,6 @@ end
|
|
62
62
|
puts JSON.generate(quotes)
|
63
63
|
```
|
64
64
|
|
65
|
-
The [DSL](https://github.com/michaeltelford/wgit/wiki/How-To-Use-The-DSL) makes it easy to write scripts for experimenting with. Wgit's DSL is simply a wrapper around the underlying classes however. For comparison, here is the above example written using the Wgit API *instead of* the DSL:
|
66
|
-
|
67
|
-
```ruby
|
68
|
-
require 'wgit'
|
69
|
-
require 'json'
|
70
|
-
|
71
|
-
crawler = Wgit::Crawler.new
|
72
|
-
url = Wgit::Url.new('http://quotes.toscrape.com/tag/humor/')
|
73
|
-
quotes = []
|
74
|
-
|
75
|
-
Wgit::Document.define_extractor(:quotes, "//div[@class='quote']/span[@class='text']", singleton: false)
|
76
|
-
Wgit::Document.define_extractor(:authors, "//div[@class='quote']/span/small", singleton: false)
|
77
|
-
|
78
|
-
crawler.crawl_site(url, follow: "//li[@class='next']/a/@href") do |doc|
|
79
|
-
doc.quotes.zip(doc.authors).each do |arr|
|
80
|
-
quotes << {
|
81
|
-
quote: arr.first,
|
82
|
-
author: arr.last
|
83
|
-
}
|
84
|
-
end
|
85
|
-
end
|
86
|
-
|
87
|
-
puts JSON.generate(quotes)
|
88
|
-
```
|
89
|
-
|
90
65
|
But what if we want to crawl and store the content in a database, so that it can be searched? Wgit makes it easy to index and search HTML using [MongoDB](https://www.mongodb.com/):
|
91
66
|
|
92
67
|
```ruby
|
@@ -97,14 +72,13 @@ include Wgit::DSL
|
|
97
72
|
Wgit.logger.level = Logger::WARN
|
98
73
|
|
99
74
|
connection_string 'mongodb://user:password@localhost/crawler'
|
100
|
-
clear_db!
|
101
|
-
|
102
|
-
extract :quotes, "//div[@class='quote']/span[@class='text']", singleton: false
|
103
|
-
extract :authors, "//div[@class='quote']/span/small", singleton: false
|
104
75
|
|
105
76
|
start 'http://quotes.toscrape.com/tag/humor/'
|
106
77
|
follow "//li[@class='next']/a/@href"
|
107
78
|
|
79
|
+
extract :quotes, "//div[@class='quote']/span[@class='text']", singleton: false
|
80
|
+
extract :authors, "//div[@class='quote']/span/small", singleton: false
|
81
|
+
|
108
82
|
index_site
|
109
83
|
search 'prejudice'
|
110
84
|
```
|
@@ -117,10 +91,35 @@ Quotes to Scrape
|
|
117
91
|
http://quotes.toscrape.com/tag/humor/page/2/
|
118
92
|
```
|
119
93
|
|
120
|
-
Using a
|
94
|
+
Using a MongoDB [client](https://robomongo.org/), we can see that the two web pages have been indexed, along with their extracted *quotes* and *authors*:
|
121
95
|
|
122
96
|
![MongoDBClient](https://raw.githubusercontent.com/michaeltelford/wgit/assets/assets/wgit_mongo_index.png)
|
123
97
|
|
98
|
+
The [DSL](https://github.com/michaeltelford/wgit/wiki/How-To-Use-The-DSL) makes it easy to write scripts for experimenting with. Wgit's DSL is simply a wrapper around the underlying classes however. For comparison, here is the above example written using the Wgit API *instead of* the DSL:
|
99
|
+
|
100
|
+
```ruby
|
101
|
+
require 'wgit'
|
102
|
+
require 'json'
|
103
|
+
|
104
|
+
crawler = Wgit::Crawler.new
|
105
|
+
url = Wgit::Url.new('http://quotes.toscrape.com/tag/humor/')
|
106
|
+
quotes = []
|
107
|
+
|
108
|
+
Wgit::Document.define_extractor(:quotes, "//div[@class='quote']/span[@class='text']", singleton: false)
|
109
|
+
Wgit::Document.define_extractor(:authors, "//div[@class='quote']/span/small", singleton: false)
|
110
|
+
|
111
|
+
crawler.crawl_site(url, follow: "//li[@class='next']/a/@href") do |doc|
|
112
|
+
doc.quotes.zip(doc.authors).each do |arr|
|
113
|
+
quotes << {
|
114
|
+
quote: arr.first,
|
115
|
+
author: arr.last
|
116
|
+
}
|
117
|
+
end
|
118
|
+
end
|
119
|
+
|
120
|
+
puts JSON.generate(quotes)
|
121
|
+
```
|
122
|
+
|
124
123
|
## Why Wgit?
|
125
124
|
|
126
125
|
There are many [other HTML crawlers](https://awesome-ruby.com/#-web-crawling) out there so why use Wgit?
|
@@ -161,33 +160,27 @@ Only MRI Ruby is tested and supported, but Wgit may work with other Ruby impleme
|
|
161
160
|
|
162
161
|
Currently, the required MRI Ruby version is:
|
163
162
|
|
164
|
-
`~> 2.5` a.k.a. `>= 2.5 && < 3`
|
163
|
+
`~> 2.5` (a.k.a.) `>= 2.5 && < 3`
|
165
164
|
|
166
165
|
### Using Bundler
|
167
166
|
|
168
|
-
|
169
|
-
|
170
|
-
```ruby
|
171
|
-
gem 'wgit'
|
172
|
-
```
|
173
|
-
|
174
|
-
And then execute:
|
175
|
-
|
176
|
-
$ bundle
|
167
|
+
$ bundle add wgit
|
177
168
|
|
178
169
|
### Using RubyGems
|
179
170
|
|
180
171
|
$ gem install wgit
|
181
172
|
|
182
|
-
Verify
|
173
|
+
### Verify
|
183
174
|
|
184
175
|
$ wgit
|
185
176
|
|
177
|
+
Calling the installed executable will start an REPL session.
|
178
|
+
|
186
179
|
## Documentation
|
187
180
|
|
188
181
|
- [Getting Started](https://github.com/michaeltelford/wgit/wiki/Getting-Started)
|
189
182
|
- [Wiki](https://github.com/michaeltelford/wgit/wiki)
|
190
|
-
- [Yardocs](https://www.rubydoc.info/
|
183
|
+
- [API Yardocs](https://www.rubydoc.info/gems/wgit)
|
191
184
|
- [CHANGELOG](https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md)
|
192
185
|
|
193
186
|
## Executable
|
data/lib/wgit/document.rb
CHANGED
@@ -413,6 +413,13 @@ be relative"
|
|
413
413
|
return [] if @links.empty?
|
414
414
|
|
415
415
|
links = @links
|
416
|
+
.map do |link|
|
417
|
+
if link.scheme_relative?
|
418
|
+
link.prefix_scheme(@url.to_scheme.to_sym)
|
419
|
+
else
|
420
|
+
link
|
421
|
+
end
|
422
|
+
end
|
416
423
|
.reject { |link| link.relative?(host: @url.to_origin) }
|
417
424
|
.map(&:omit_trailing_slash)
|
418
425
|
|
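The hunk above prefixes each scheme-relative link with the scheme of the document's own URL before the relative links are rejected. A minimal standalone sketch of that first step, using plain Strings and Ruby's stdlib `URI` rather than `Wgit::Url`; the helper name `absolutise_scheme_relative` and the example URLs are assumptions made for illustration only:

```ruby
require 'uri'

# Hypothetical helper (not part of Wgit): gives every link that starts
# with '//' the scheme of the page it was found on. Other links are
# returned unchanged.
def absolutise_scheme_relative(links, page_url)
  scheme = URI.parse(page_url).scheme

  links.map do |link|
    link.start_with?('//') ? "#{scheme}:#{link}" : link
  end
end

links = ['//cdn.example.com/app.js', '/about', 'https://other.com/']
p absolutise_scheme_relative(links, 'https://example.com/index.html')
# => ["https://cdn.example.com/app.js", "/about", "https://other.com/"]
```

Only the scheme-relative link changes; the path-relative and already-absolute links pass through for later filtering.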
data/lib/wgit/url.rb
CHANGED
@@ -162,6 +162,7 @@ Addressable::URI::InvalidURIError")
|
|
162
162
|
opts = defaults.merge(opts)
|
163
163
|
raise 'Url (self) cannot be empty' if empty?
|
164
164
|
|
165
|
+
return false if scheme_relative?
|
165
166
|
return true if @uri.relative?
|
166
167
|
|
167
168
|
# Self is absolute but may be relative to the opts param e.g. host.
|
@@ -266,26 +267,28 @@ protocol scheme and domain (e.g. http://example.com): #{url}"
|
|
266
267
|
# @return [Wgit::Url] Self in absolute form.
|
267
268
|
def make_absolute(doc)
|
268
269
|
assert_type(doc, Wgit::Document)
|
270
|
+
raise 'Cannot make absolute when Document @url is not valid' \
|
271
|
+
unless doc.url.valid?
|
272
|
+
|
273
|
+
return prefix_scheme(doc.url.to_scheme&.to_sym) if scheme_relative?
|
269
274
|
|
270
275
|
absolute? ? self : doc.base_url(link: self).concat(self)
|
271
276
|
end
|
272
277
|
|
273
|
-
# Returns self having prefixed a protocol
|
278
|
+
# Returns self having prefixed a scheme/protocol. Doesn't modify receiver.
|
274
279
|
# Returns self even if absolute (with scheme); therefore is idempotent.
|
275
280
|
#
|
276
|
-
# @param
|
277
|
-
# @return [Wgit::Url] Self with a
|
278
|
-
def prefix_scheme(
|
279
|
-
|
280
|
-
|
281
|
-
case protocol
|
282
|
-
when :http
|
283
|
-
Wgit::Url.new("http://#{url}")
|
284
|
-
when :https
|
285
|
-
Wgit::Url.new("https://#{url}")
|
286
|
-
else
|
287
|
-
raise "protocol must be :http or :https, not :#{protocol}"
|
281
|
+
# @param scheme [Symbol] Either :http or :https.
|
282
|
+
# @return [Wgit::Url] Self with a scheme prefix.
|
283
|
+
def prefix_scheme(scheme = :http)
|
284
|
+
unless %i[http https].include?(scheme)
|
285
|
+
raise "scheme must be :http or :https, not :#{scheme}"
|
288
286
|
end
|
287
|
+
|
288
|
+
return self if absolute? && !scheme_relative?
|
289
|
+
|
290
|
+
separator = scheme_relative? ? '' : '//'
|
291
|
+
Wgit::Url.new("#{scheme}:#{separator}#{self}")
|
289
292
|
end
|
290
293
|
|
291
294
|
# Returns a Hash containing this Url's instance vars excluding @uri.
|
@@ -624,31 +627,40 @@ protocol scheme and domain (e.g. http://example.com): #{url}"
|
|
624
627
|
self == '/'
|
625
628
|
end
|
626
629
|
|
627
|
-
|
628
|
-
|
629
|
-
|
630
|
-
|
631
|
-
|
632
|
-
|
633
|
-
|
634
|
-
|
635
|
-
alias
|
636
|
-
alias
|
637
|
-
alias
|
638
|
-
alias
|
639
|
-
alias
|
640
|
-
alias
|
641
|
-
alias
|
642
|
-
alias
|
643
|
-
alias
|
644
|
-
alias
|
645
|
-
alias
|
646
|
-
alias
|
647
|
-
alias
|
648
|
-
alias
|
649
|
-
alias
|
650
|
-
alias
|
651
|
-
alias
|
652
|
-
alias
|
630
|
+
# Returns true if self starts with '//' a.k.a a scheme/protocol relative
|
631
|
+
# path.
|
632
|
+
#
|
633
|
+
# @return [Boolean] True if self starts with '//', false otherwise.
|
634
|
+
def scheme_relative?
|
635
|
+
start_with?('//')
|
636
|
+
end
|
637
|
+
|
638
|
+
alias + concat
|
639
|
+
alias crawled? crawled
|
640
|
+
alias is_relative? relative?
|
641
|
+
alias is_absolute? absolute?
|
642
|
+
alias is_valid? valid?
|
643
|
+
alias is_query? query?
|
644
|
+
alias is_fragment? fragment?
|
645
|
+
alias is_index? index?
|
646
|
+
alias is_scheme_relative? scheme_relative?
|
647
|
+
alias uri to_uri
|
648
|
+
alias url to_url
|
649
|
+
alias scheme to_scheme
|
650
|
+
alias host to_host
|
651
|
+
alias port to_port
|
652
|
+
alias domain to_domain
|
653
|
+
alias brand to_brand
|
654
|
+
alias base to_base
|
655
|
+
alias origin to_origin
|
656
|
+
alias path to_path
|
657
|
+
alias endpoint to_endpoint
|
658
|
+
alias query to_query
|
659
|
+
alias query_hash to_query_hash
|
660
|
+
alias fragment to_fragment
|
661
|
+
alias extension to_extension
|
662
|
+
alias user to_user
|
663
|
+
alias password to_password
|
664
|
+
alias sub_domain to_sub_domain
|
653
665
|
end
|
654
666
|
end
|
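The reworked `prefix_scheme` takes the scheme as a defaulted positional parameter, returns an already absolute URL untouched, and omits the `//` separator when the URL is scheme-relative. The rules can be sketched on plain Strings as follows; these free-standing helpers and the regular expression in `absolute?` are assumptions for illustration, not Wgit's implementation:

```ruby
# Standalone sketch of the prefixing rules; operates on Strings,
# not on Wgit::Url instances.

# True when the url starts with '//' (a scheme-relative reference).
def scheme_relative?(url)
  url.start_with?('//')
end

# Simplified absolute check (an assumption): the url starts with 'scheme://'.
def absolute?(url)
  url.match?(%r{\A[a-z][a-z0-9+.-]*://}i)
end

# Returns url with a scheme prefix; idempotent for absolute urls.
def prefix_scheme(url, scheme = :http)
  unless %i[http https].include?(scheme)
    raise ArgumentError, "scheme must be :http or :https, not :#{scheme}"
  end

  return url if absolute?(url)

  # A scheme-relative url already carries the '//' separator.
  separator = scheme_relative?(url) ? '' : '//'
  "#{scheme}:#{separator}#{url}"
end

puts prefix_scheme('example.com')            # http://example.com
puts prefix_scheme('//example.com', :https)  # https://example.com
puts prefix_scheme('https://example.com')    # https://example.com
```

With the positional default, a caller that previously passed the scheme as a named argument would now pass it directly, e.g. `prefix_scheme(url, :https)` in this sketch.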
data/lib/wgit/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: wgit
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.
|
4
|
+
version: 0.10.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Michael Telford
|
8
|
-
autorequire:
|
8
|
+
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date:
|
11
|
+
date: 2021-04-20 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: addressable
|
@@ -241,7 +241,7 @@ metadata:
|
|
241
241
|
source_code_uri: https://github.com/michaeltelford/wgit
|
242
242
|
changelog_uri: https://github.com/michaeltelford/wgit/blob/master/CHANGELOG.md
|
243
243
|
bug_tracker_uri: https://github.com/michaeltelford/wgit/issues
|
244
|
-
documentation_uri: https://www.rubydoc.info/
|
244
|
+
documentation_uri: https://www.rubydoc.info/gems/wgit
|
245
245
|
allowed_push_host: https://rubygems.org
|
246
246
|
post_install_message: Added the 'wgit' executable to $PATH
|
247
247
|
rdoc_options: []
|
@@ -259,7 +259,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
|
|
259
259
|
version: '0'
|
260
260
|
requirements: []
|
261
261
|
rubygems_version: 3.1.2
|
262
|
-
signing_key:
|
262
|
+
signing_key:
|
263
263
|
specification_version: 4
|
264
264
|
summary: Wgit is a HTML web crawler, written in Ruby, that allows you to programmatically
|
265
265
|
extract the data you want from the web.
|