wgit 0.12.0 → 0.12.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +9 -0
- data/README.md +4 -3
- data/lib/wgit/crawler.rb +49 -21
- data/lib/wgit/url.rb +1 -1
- data/lib/wgit/version.rb +1 -1
- metadata +51 -25
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 4210033f192994609b9a21f1b3e61292247cf94aa415441b22823c32dcc6a214
+  data.tar.gz: 26ca37a3a20998b7ce313c7bd499b89653ba86fac8f09a4b5e699cf6d57bd57c
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 3f824772e8f0633d540237fb7f99064327eb54a3efbacad58317e3b47d3e45c08f9b9bee542656e4fc5315ded9539ced1283e83207127b948107900d7acc9882
+  data.tar.gz: 0f4b7787ed881d7693226466785e70efd0bff7f0971ba65cebb7553886ead143d40cbeb8a4e72e65ef0c249e689024254668da817dfe347c2076153878e605b0
data/CHANGELOG.md
CHANGED
@@ -9,6 +9,15 @@
 - ...
 ---
 
+## v0.12.1
+### Added
+- `Wgit::Crawler.new typhoeus_opts:` param which passes the `Hash` directly to `Typhoeus.get`. See the Typhoeus documentation for more info on what can be passed.
+### Changed/Removed
+- ...
+### Fixed
+- ...
+---
+
 ## v0.12.0 - BREAKING CHANGES
 A big release with several breaking changes, not all of which can be listed below. The headline features for this release are the introduction of a database adapter, allowing Wgit to work with practically any underlying database system; and a custom in-house text extractor.
 ### Added
data/README.md
CHANGED
@@ -144,6 +144,7 @@ There are many [other HTML crawlers](https://awesome-ruby.com/#-web-crawling) ou
 
 - Wgit has excellent unit testing, 100% documentation coverage and follows [semantic versioning](https://semver.org/) rules.
 - Wgit excels at crawling an entire website's HTML out of the box. Many alternative crawlers require you to provide the `xpath` needed to *follow* the next URLs to crawl. Wgit by default, crawls the entire site by extracting its internal links pointing to the same host - no `xpath` needed.
+- Wgit can crawl authenticated content, providing you can login on a web browser and export your session cookies.
 - Wgit allows you to define content *extractors* that will fire on every subsequent crawl; be it a single URL or an entire website. This enables you to focus on the content you want.
 - Wgit can index (crawl and save) HTML to a database making it a breeze to build custom search engines. You can also specify which page content gets searched, making the search more meaningful. For example, here's a script that will index the Wgit [wiki](https://github.com/michaeltelford/wgit/wiki) articles:
 
@@ -171,9 +172,9 @@ indexer.index_site(wiki, **opts)
 
 So why might you not use Wgit, I hear you ask?
 
-- Wgit doesn't allow for webpage interaction e.g. signing in as a user. There are better gems out there for that.
-- Wgit can parse a crawled page's Javascript, but it doesn't do so by default. If your crawls are JS heavy then you might best consider a
-- Wgit while fast (using `libcurl` for HTTP etc.), isn't multi-threaded; so each URL gets crawled sequentially. You could hand each crawled document to a worker thread for processing - but if you need concurrent
+- Wgit doesn't allow for webpage interaction e.g. signing in as a user. There are better gems out there for that. If however, you simply want to crawl webpages requiring authentication, then it's entirely achievable using Wgit.
+- Wgit *can* parse a crawled page's Javascript, but it doesn't do so by default. If your crawls are JS heavy then you might best consider using a purely browser-based crawler instead.
+- Wgit while fast (using `libcurl` for HTTP etc.), isn't multi-threaded; so each URL gets crawled sequentially. You could hand each crawled document to a worker thread for processing - but if you need concurrent requests then you should consider something else.
 
 ## Installation
 
data/lib/wgit/crawler.rb
CHANGED
@@ -54,12 +54,16 @@ module Wgit
     # The value should balance between a good UX and enough JS parse time.
     attr_accessor :parse_javascript_delay
 
+    # The opts Hash passed directly to the Typhoeus#get request.
+    attr_accessor :typhoeus_opts
+
     # The opts Hash passed directly to the ferrum Chrome browser when
     # `parse_javascript: true`.
-    # See https://github.com/rubycdp/ferrum for
+    # See https://github.com/rubycdp/ferrum for more info.
     attr_accessor :ferrum_opts
 
     # The Wgit::Response of the most recently crawled URL.
+    # See https://rubydoc.info/gems/typhoeus for more info.
     attr_reader :last_response
 
     # Initializes and returns a Wgit::Crawler instance.
@@ -76,14 +80,17 @@ module Wgit
     # installed and in $PATH.
     # @param parse_javascript_delay [Integer] The delay time given to a page's
     # JS to update the DOM. After the delay, the HTML is crawled.
+    # @param typhoeus_opts [Hash] The options to pass to Typhoeus.
+    # @param ferrum_opts [Hash] The options to pass to Ferrum.
     def initialize(redirect_limit: 5, timeout: 5, encode: true,
                    parse_javascript: false, parse_javascript_delay: 1,
-                   ferrum_opts: {})
+                   typhoeus_opts: {}, ferrum_opts: {})
       assert_type(redirect_limit, Integer)
       assert_type(timeout, [Integer, Float])
       assert_type(encode, [TrueClass, FalseClass])
       assert_type(parse_javascript, [TrueClass, FalseClass])
       assert_type(parse_javascript_delay, Integer)
+      assert_type(typhoeus_opts, Hash)
       assert_type(ferrum_opts, Hash)
 
       @redirect_limit = redirect_limit
@@ -91,14 +98,15 @@ module Wgit
       @encode = encode
       @parse_javascript = parse_javascript
       @parse_javascript_delay = parse_javascript_delay
-      @
+      @typhoeus_opts = merge_typhoeus_opts(typhoeus_opts)
+      @ferrum_opts = merge_ferrum_opts(ferrum_opts)
     end
 
     # Overrides String#inspect to shorten the printed output of a Crawler.
     #
     # @return [String] A short textual representation of this Crawler.
     def inspect
-      "#<Wgit::Crawler timeout=#{@timeout} redirect_limit=#{@redirect_limit} encode=#{@encode} parse_javascript=#{@parse_javascript} parse_javascript_delay=#{@parse_javascript_delay} ferrum_opts=#{@ferrum_opts}>"
+      "#<Wgit::Crawler timeout=#{@timeout} redirect_limit=#{@redirect_limit} encode=#{@encode} parse_javascript=#{@parse_javascript} parse_javascript_delay=#{@parse_javascript_delay} typhoeus_opts=#{@typhoeus_opts} ferrum_opts=#{@ferrum_opts}>"
     end
 
     # Crawls an entire website's HTML pages by recursively going through
@@ -268,7 +276,7 @@ module Wgit
       url.crawl_duration = response.total_time
 
       # Don't override previous url.redirects if response is fully resolved.
-      url.redirects
+      url.redirects = response.redirects unless response.redirects.empty?
 
       @last_response = response
     end
@@ -377,32 +385,25 @@ module Wgit
     end
 
     # Performs a HTTP GET request and returns the response.
+    # See https://rubydoc.info/gems/typhoeus for more info.
     #
     # @param url [String] The url to GET.
-    # @return [Typhoeus::Response] The HTTP response object.
+    # @return [Typhoeus::Response] The Typhoeus HTTP response object.
     def http_get(url)
-
-        followlocation: false,
-        timeout: @timeout,
-        accept_encoding: 'gzip',
-        headers: {
-          'User-Agent' => "wgit/#{Wgit::VERSION}",
-          'Accept' => 'text/html'
-        }
-      }
-
-      # See https://rubydoc.info/gems/typhoeus for more info.
-      Typhoeus.get(url, **opts)
+      Typhoeus.get(url, **@typhoeus_opts)
     end
 
-    # Performs a HTTP GET request in a web browser
-    # before returning the HTML body of the fully rendered webpage.
-    # Javascript (SPA apps etc.) to generate HTML dynamically.
+    # Performs a HTTP GET request in a web browser allowing the response JS to
+    # execute before returning the HTML body of the fully rendered webpage.
+    # This allows Javascript (SPA apps etc.) to generate HTML dynamically.
+    # See https://github.com/rubycdp/ferrum for more info.
     #
     # @param url [String] The url to browse to.
     # @return [Ferrum::Browser] The browser response object.
     def browser_get(url)
       @browser ||= Ferrum::Browser.new(**@ferrum_opts)
+
+      # Navigate to the url and start parsing the JS on the page.
       @browser.goto(url)
 
       # Wait for the page's JS to finish dynamically manipulating the DOM.
@@ -452,6 +453,20 @@ module Wgit
 
     private
 
+    # The default opts which are merged with the user's typhoeus_opts: and then
+    # passed directly to the Typhoeus#get request.
+    def default_typhoeus_opts
+      {
+        followlocation: false,
+        timeout: @timeout,
+        accept_encoding: 'gzip',
+        headers: {
+          'User-Agent' => "wgit/#{Wgit::VERSION}",
+          'Accept' => 'text/html'
+        }
+      }
+    end
+
     # The default opts which are merged with the user's ferrum_opts: and then
     # passed directly to the ferrum Chrome browser.
     def default_ferrum_opts
@@ -462,6 +477,19 @@ module Wgit
       }
     end
 
+    # Merges the default Typhoeus options with user-provided options.
+    # Performs a separate merge of headers to allow user customization.
+    def merge_typhoeus_opts(typhoeus_opts)
+      default_typhoeus_opts.merge(typhoeus_opts) do |key, oldval, newval|
+        key == :headers ? oldval.merge(newval) : newval
+      end
+    end
+
+    # Merges the default Ferrum options with user-provided options.
+    def merge_ferrum_opts(ferrum_opts)
+      default_ferrum_opts.merge(ferrum_opts)
+    end
+
     # Manually does the following: `links = internals - crawled`.
     # This is needed due to an apparent bug in Set<Url> (when upgrading from
     # Ruby v3.0.2 to v3.3.0) causing an infinite crawl loop in #crawl_site.
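The header-aware merge added in this release can be tried in isolation. The following is a minimal sketch, not the gem's own code: it copies the defaults and the merge block from the diff above into standalone methods so it runs without Wgit or Typhoeus installed. The names `SKETCH_VERSION`, `sketch_default_opts` and `sketch_merge_opts`, as well as the cookie value, are hypothetical and exist only for illustration.

```ruby
# Stand-in for Wgit::VERSION so the sketch runs without the gem (assumption).
SKETCH_VERSION = '0.12.1'

# Mirrors the defaults shown in default_typhoeus_opts. The timeout is passed
# in as an argument here instead of being read from an instance variable.
def sketch_default_opts(timeout)
  {
    followlocation: false,
    timeout: timeout,
    accept_encoding: 'gzip',
    headers: {
      'User-Agent' => "wgit/#{SKETCH_VERSION}",
      'Accept' => 'text/html'
    }
  }
end

# Same block-based Hash#merge as merge_typhoeus_opts: for keys present in
# both hashes the user's value wins, except :headers, where the two header
# hashes are merged so the default headers survive.
def sketch_merge_opts(user_opts, timeout: 5)
  sketch_default_opts(timeout).merge(user_opts) do |key, oldval, newval|
    key == :headers ? oldval.merge(newval) : newval
  end
end

# Hypothetical user options: a longer timeout plus an exported session cookie.
merged = sketch_merge_opts(
  { timeout: 10, headers: { 'Cookie' => 'session=abc123' } }
)
```

With a plain `Hash#merge` and no block, supplying any `headers:` value would replace the default `User-Agent` and `Accept` headers entirely; the block is what lets a single extra header, such as a cookie, be added on top of the defaults.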
data/lib/wgit/url.rb
CHANGED
@@ -98,7 +98,7 @@ module Wgit
     # Returns a Wgit::Url instance from Wgit::Url.parse, or nil if obj cannot
     # be parsed successfully e.g. the String is invalid.
     #
-    # Use this method when you can't
+    # Use this method when you can't guarantee that obj is parsable as a URL.
     # See Wgit::Url.parse for more information.
     #
     # @param obj [Object] The object to parse, which #is_a?(String).
data/lib/wgit/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,13 @@
 --- !ruby/object:Gem::Specification
 name: wgit
 version: !ruby/object:Gem::Version
-  version: 0.12.
+  version: 0.12.1
 platform: ruby
 authors:
 - Michael Telford
-autorequire:
 bindir: bin
 cert_chain: []
-date:
+date: 2025-08-18 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: addressable
@@ -30,56 +29,84 @@ dependencies:
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '0.
+        version: '0.3'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '0.
+        version: '0.3'
+- !ruby/object:Gem::Dependency
+  name: benchmark
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.4'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.4'
 - !ruby/object:Gem::Dependency
   name: ferrum
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '0.
+        version: '0.17'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.17'
+- !ruby/object:Gem::Dependency
+  name: logger
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.7'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '
+        version: '1.7'
 - !ruby/object:Gem::Dependency
   name: mongo
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '2.
+        version: '2.21'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '2.
+        version: '2.21'
 - !ruby/object:Gem::Dependency
   name: nokogiri
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.
+        version: '1.18'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.
+        version: '1.18'
 - !ruby/object:Gem::Dependency
   name: typhoeus
   requirement: !ruby/object:Gem::Requirement
@@ -100,70 +127,70 @@ dependencies:
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '
+        version: '12.0'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '
+        version: '12.0'
 - !ruby/object:Gem::Dependency
   name: dotenv
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '
+        version: '3.1'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '
+        version: '3.1'
 - !ruby/object:Gem::Dependency
   name: maxitest
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '
+        version: '6.0'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '
+        version: '6.0'
 - !ruby/object:Gem::Dependency
   name: pry
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '0.
+        version: '0.15'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '0.
+        version: '0.15'
 - !ruby/object:Gem::Dependency
   name: rubocop
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.
+        version: '1.79'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.
+        version: '1.79'
 - !ruby/object:Gem::Dependency
   name: toys
   requirement: !ruby/object:Gem::Requirement
@@ -184,14 +211,14 @@ dependencies:
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '3.
+        version: '3.25'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '3.
+        version: '3.25'
 - !ruby/object:Gem::Dependency
   name: yard
   requirement: !ruby/object:Gem::Requirement
@@ -274,8 +301,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
-rubygems_version: 3.
-signing_key:
+rubygems_version: 3.6.7
 specification_version: 4
 summary: Wgit is a HTML web crawler, written in Ruby, that allows you to programmatically
   extract the data you want from the web.