wgit 0.12.0 → 0.12.1

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 4dee43af6274102c9bc6ad4f32c8811f57c5dc2833e923e038aac8f7f2072385
- data.tar.gz: 9463768a40c78ab9c91ac34dd1c3a0fd1e7b990440b70766feebb9e2f0f99bd4
+ metadata.gz: 4210033f192994609b9a21f1b3e61292247cf94aa415441b22823c32dcc6a214
+ data.tar.gz: 26ca37a3a20998b7ce313c7bd499b89653ba86fac8f09a4b5e699cf6d57bd57c
  SHA512:
- metadata.gz: 5c94fcae3a56254a6c0d9d67597f1a2125439c1ee3d7d68e22fb70fa59298735d76b0fb8e77bc44b0850f6ba561fe11df3a867973d5f0533adddda9d2c6f2002
- data.tar.gz: 779bf20dc1eaa29cc926a836d5a5a155c2270db1be0d22bc52ba9893cbd8d3aaec2cee7810b660fb9014020f5de2290af2b23680cc040a9c682ca82431d6f50d
+ metadata.gz: 3f824772e8f0633d540237fb7f99064327eb54a3efbacad58317e3b47d3e45c08f9b9bee542656e4fc5315ded9539ced1283e83207127b948107900d7acc9882
+ data.tar.gz: 0f4b7787ed881d7693226466785e70efd0bff7f0971ba65cebb7553886ead143d40cbeb8a4e72e65ef0c249e689024254668da817dfe347c2076153878e605b0
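The digests above can be checked against locally extracted copies of `metadata.gz` and `data.tar.gz`. The following is a minimal sketch, not part of the package: the helper name `sha256_matches?` is hypothetical, and the demonstration runs on a throwaway file because the gem's members are not to hand here.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper: true when the file's SHA256 hex digest equals the
# expected value taken from checksums.yaml.
def sha256_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.downcase
end

# Demonstration on a temporary file standing in for an extracted member.
Tempfile.create('member') do |f|
  f.write('example bytes')
  f.flush
  expected = Digest::SHA256.hexdigest('example bytes')
  puts sha256_matches?(f.path, expected)
end
```

With a real download, the path would point at the extracted `data.tar.gz` and the expected value would be the `+ data.tar.gz` SHA256 entry listed above.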
data/CHANGELOG.md CHANGED
@@ -9,6 +9,15 @@
  - ...
  ---
 
+ ## v0.12.1
+ ### Added
+ - `Wgit::Crawler.new typhoeus_opts:` param which passes the `Hash` directly to `Typhoeus.get`. See the Typhoeus documentation for more info on what can be passed.
+ ### Changed/Removed
+ - ...
+ ### Fixed
+ - ...
+ ---
+
  ## v0.12.0 - BREAKING CHANGES
  A big release with several breaking changes, not all of which can be listed below. The headline features for this release are the introduction of a database adapter, allowing Wgit to work with practically any underlying database system; and a custom in-house text extractor.
  ### Added
data/README.md CHANGED
@@ -144,6 +144,7 @@ There are many [other HTML crawlers](https://awesome-ruby.com/#-web-crawling) ou
 
  - Wgit has excellent unit testing, 100% documentation coverage and follows [semantic versioning](https://semver.org/) rules.
  - Wgit excels at crawling an entire website's HTML out of the box. Many alternative crawlers require you to provide the `xpath` needed to *follow* the next URLs to crawl. Wgit by default, crawls the entire site by extracting its internal links pointing to the same host - no `xpath` needed.
+ - Wgit can crawl authenticated content, providing you can login on a web browser and export your session cookies.
  - Wgit allows you to define content *extractors* that will fire on every subsequent crawl; be it a single URL or an entire website. This enables you to focus on the content you want.
  - Wgit can index (crawl and save) HTML to a database making it a breeze to build custom search engines. You can also specify which page content gets searched, making the search more meaningful. For example, here's a script that will index the Wgit [wiki](https://github.com/michaeltelford/wgit/wiki) articles:
 
@@ -171,9 +172,9 @@ indexer.index_site(wiki, **opts)
 
  So why might you not use Wgit, I hear you ask?
 
- - Wgit doesn't allow for webpage interaction e.g. signing in as a user. There are better gems out there for that.
- - Wgit can parse a crawled page's Javascript, but it doesn't do so by default. If your crawls are JS heavy then you might best consider a pure browser-based crawler instead.
- - Wgit while fast (using `libcurl` for HTTP etc.), isn't multi-threaded; so each URL gets crawled sequentially. You could hand each crawled document to a worker thread for processing - but if you need concurrent crawling then you should consider something else.
+ - Wgit doesn't allow for webpage interaction e.g. signing in as a user. There are better gems out there for that. If however, you simply want to crawl webpages requiring authentication, then it's entirely achievable using Wgit.
+ - Wgit *can* parse a crawled page's Javascript, but it doesn't do so by default. If your crawls are JS heavy then you might best consider using a purely browser-based crawler instead.
+ - Wgit while fast (using `libcurl` for HTTP etc.), isn't multi-threaded; so each URL gets crawled sequentially. You could hand each crawled document to a worker thread for processing - but if you need concurrent requests then you should consider something else.
 
  ## Installation
 
data/lib/wgit/crawler.rb CHANGED
@@ -54,12 +54,16 @@ module Wgit
  # The value should balance between a good UX and enough JS parse time.
  attr_accessor :parse_javascript_delay
 
+ # The opts Hash passed directly to the Typhoeus#get request.
+ attr_accessor :typhoeus_opts
+
  # The opts Hash passed directly to the ferrum Chrome browser when
  # `parse_javascript: true`.
- # See https://github.com/rubycdp/ferrum for details.
+ # See https://github.com/rubycdp/ferrum for more info.
  attr_accessor :ferrum_opts
 
  # The Wgit::Response of the most recently crawled URL.
+ # See https://rubydoc.info/gems/typhoeus for more info.
  attr_reader :last_response
 
  # Initializes and returns a Wgit::Crawler instance.
@@ -76,14 +80,17 @@ module Wgit
  # installed and in $PATH.
  # @param parse_javascript_delay [Integer] The delay time given to a page's
  # JS to update the DOM. After the delay, the HTML is crawled.
+ # @param typhoeus_opts [Hash] The options to pass to Typhoeus.
+ # @param ferrum_opts [Hash] The options to pass to Ferrum.
  def initialize(redirect_limit: 5, timeout: 5, encode: true,
  parse_javascript: false, parse_javascript_delay: 1,
- ferrum_opts: {})
+ typhoeus_opts: {}, ferrum_opts: {})
  assert_type(redirect_limit, Integer)
  assert_type(timeout, [Integer, Float])
  assert_type(encode, [TrueClass, FalseClass])
  assert_type(parse_javascript, [TrueClass, FalseClass])
  assert_type(parse_javascript_delay, Integer)
+ assert_type(typhoeus_opts, Hash)
  assert_type(ferrum_opts, Hash)
 
  @redirect_limit = redirect_limit
@@ -91,14 +98,15 @@ module Wgit
  @encode = encode
  @parse_javascript = parse_javascript
  @parse_javascript_delay = parse_javascript_delay
- @ferrum_opts = default_ferrum_opts.merge(ferrum_opts)
+ @typhoeus_opts = merge_typhoeus_opts(typhoeus_opts)
+ @ferrum_opts = merge_ferrum_opts(ferrum_opts)
  end
 
  # Overrides String#inspect to shorten the printed output of a Crawler.
  #
  # @return [String] A short textual representation of this Crawler.
  def inspect
- "#<Wgit::Crawler timeout=#{@timeout} redirect_limit=#{@redirect_limit} encode=#{@encode} parse_javascript=#{@parse_javascript} parse_javascript_delay=#{@parse_javascript_delay} ferrum_opts=#{@ferrum_opts}>"
+ "#<Wgit::Crawler timeout=#{@timeout} redirect_limit=#{@redirect_limit} encode=#{@encode} parse_javascript=#{@parse_javascript} parse_javascript_delay=#{@parse_javascript_delay} typhoeus_opts=#{@typhoeus_opts} ferrum_opts=#{@ferrum_opts}>"
  end
 
  # Crawls an entire website's HTML pages by recursively going through
@@ -268,7 +276,7 @@ module Wgit
  url.crawl_duration = response.total_time
 
  # Don't override previous url.redirects if response is fully resolved.
- url.redirects = response.redirects unless response.redirects.empty?
+ url.redirects = response.redirects unless response.redirects.empty?
 
  @last_response = response
  end
@@ -377,32 +385,25 @@ module Wgit
  end
 
  # Performs a HTTP GET request and returns the response.
+ # See https://rubydoc.info/gems/typhoeus for more info.
  #
  # @param url [String] The url to GET.
- # @return [Typhoeus::Response] The HTTP response object.
+ # @return [Typhoeus::Response] The Typhoeus HTTP response object.
  def http_get(url)
- opts = {
- followlocation: false,
- timeout: @timeout,
- accept_encoding: 'gzip',
- headers: {
- 'User-Agent' => "wgit/#{Wgit::VERSION}",
- 'Accept' => 'text/html'
- }
- }
-
- # See https://rubydoc.info/gems/typhoeus for more info.
- Typhoeus.get(url, **opts)
+ Typhoeus.get(url, **@typhoeus_opts)
  end
 
- # Performs a HTTP GET request in a web browser and parses the response JS
- # before returning the HTML body of the fully rendered webpage. This allows
- # Javascript (SPA apps etc.) to generate HTML dynamically.
+ # Performs a HTTP GET request in a web browser allowing the response JS to
+ # execute before returning the HTML body of the fully rendered webpage.
+ # This allows Javascript (SPA apps etc.) to generate HTML dynamically.
+ # See https://github.com/rubycdp/ferrum for more info.
  #
  # @param url [String] The url to browse to.
  # @return [Ferrum::Browser] The browser response object.
  def browser_get(url)
  @browser ||= Ferrum::Browser.new(**@ferrum_opts)
+
+ # Navigate to the url and start parsing the JS on the page.
  @browser.goto(url)
 
  # Wait for the page's JS to finish dynamically manipulating the DOM.
@@ -452,6 +453,20 @@ module Wgit
 
  private
 
+ # The default opts which are merged with the user's typhoeus_opts: and then
+ # passed directly to the Typhoeus#get request.
+ def default_typhoeus_opts
+ {
+ followlocation: false,
+ timeout: @timeout,
+ accept_encoding: 'gzip',
+ headers: {
+ 'User-Agent' => "wgit/#{Wgit::VERSION}",
+ 'Accept' => 'text/html'
+ }
+ }
+ end
+
  # The default opts which are merged with the user's ferrum_opts: and then
  # passed directly to the ferrum Chrome browser.
  def default_ferrum_opts
@@ -462,6 +477,19 @@ module Wgit
  }
  end
 
+ # Merges the default Typhoeus options with user-provided options.
+ # Performs a separate merge of headers to allow user customization.
+ def merge_typhoeus_opts(typhoeus_opts)
+ default_typhoeus_opts.merge(typhoeus_opts) do |key, oldval, newval|
+ key == :headers ? oldval.merge(newval) : newval
+ end
+ end
+
+ # Merges the default Ferrum options with user-provided options.
+ def merge_ferrum_opts(ferrum_opts)
+ default_ferrum_opts.merge(ferrum_opts)
+ end
+
  # Manually does the following: `links = internals - crawled`.
  # This is needed due to an apparent bug in Set<Url> (when upgrading from
  # Ruby v3.0.2 to v3.3.0) causing an infinite crawl loop in #crawl_site.
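The header-aware merge added in `merge_typhoeus_opts` can be tried in isolation. The sketch below reproduces that logic in plain Ruby so it runs without the gem. `SKETCH_VERSION`, the explicit `timeout` argument, and the cookie value are stand-ins introduced here for illustration; sending session cookies via a `Cookie` header is an assumption about usage, not something the diff prescribes.

```ruby
# Stand-in for Wgit::VERSION so the sketch runs without the wgit gem.
SKETCH_VERSION = '0.12.1'

# Mirrors default_typhoeus_opts from the diff; the timeout is passed in
# here rather than read from an instance variable.
def default_typhoeus_opts(timeout)
  {
    followlocation: false,
    timeout: timeout,
    accept_encoding: 'gzip',
    headers: {
      'User-Agent' => "wgit/#{SKETCH_VERSION}",
      'Accept' => 'text/html'
    }
  }
end

# Mirrors merge_typhoeus_opts: user values win on conflict, except for
# :headers, which is merged key by key so the default headers survive.
def merge_typhoeus_opts(user_opts, timeout = 5)
  default_typhoeus_opts(timeout).merge(user_opts) do |key, oldval, newval|
    key == :headers ? oldval.merge(newval) : newval
  end
end

# Placeholder cookie value; a real one would come from a browser session.
merged = merge_typhoeus_opts({
  timeout: 10,
  headers: { 'Cookie' => 'session=PLACEHOLDER' }
})

puts merged.inspect
```

In this sketch the user's `timeout` replaces the default, while the `Cookie` header is added alongside the default `User-Agent` and `Accept` headers. With the gem installed, the corresponding call would presumably be `Wgit::Crawler.new(typhoeus_opts: { headers: { 'Cookie' => '...' } })`, based on the constructor signature in the hunk above.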
data/lib/wgit/url.rb CHANGED
@@ -98,7 +98,7 @@ module Wgit
  # Returns a Wgit::Url instance from Wgit::Url.parse, or nil if obj cannot
  # be parsed successfully e.g. the String is invalid.
  #
- # Use this method when you can't gaurentee that obj is parsable as a URL.
+ # Use this method when you can't guarantee that obj is parsable as a URL.
  # See Wgit::Url.parse for more information.
  #
  # @param obj [Object] The object to parse, which #is_a?(String).
data/lib/wgit/version.rb CHANGED
@@ -6,7 +6,7 @@
  # @author Michael Telford
  module Wgit
  # The current gem version of Wgit.
- VERSION = "0.12.0"
+ VERSION = "0.12.1"
 
  # Returns the current gem version of Wgit as a String.
  def self.version
metadata CHANGED
@@ -1,14 +1,13 @@
  --- !ruby/object:Gem::Specification
  name: wgit
  version: !ruby/object:Gem::Version
- version: 0.12.0
+ version: 0.12.1
  platform: ruby
  authors:
  - Michael Telford
- autorequire:
  bindir: bin
  cert_chain: []
- date: 2024-10-30 00:00:00.000000000 Z
+ date: 2025-08-18 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: addressable
@@ -30,56 +29,84 @@ dependencies:
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '0.2'
+ version: '0.3'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '0.2'
+ version: '0.3'
+ - !ruby/object:Gem::Dependency
+ name: benchmark
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '0.4'
+ type: :runtime
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '0.4'
  - !ruby/object:Gem::Dependency
  name: ferrum
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '0.14'
+ version: '0.17'
+ type: :runtime
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '0.17'
+ - !ruby/object:Gem::Dependency
+ name: logger
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '1.7'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '0.14'
+ version: '1.7'
  - !ruby/object:Gem::Dependency
  name: mongo
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '2.19'
+ version: '2.21'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '2.19'
+ version: '2.21'
  - !ruby/object:Gem::Dependency
  name: nokogiri
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '1.15'
+ version: '1.18'
  type: :runtime
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '1.15'
+ version: '1.18'
  - !ruby/object:Gem::Dependency
  name: typhoeus
  requirement: !ruby/object:Gem::Requirement
@@ -100,70 +127,70 @@ dependencies:
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '11.1'
+ version: '12.0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '11.1'
+ version: '12.0'
  - !ruby/object:Gem::Dependency
  name: dotenv
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '2.8'
+ version: '3.1'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '2.8'
+ version: '3.1'
  - !ruby/object:Gem::Dependency
  name: maxitest
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '5.4'
+ version: '6.0'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '5.4'
+ version: '6.0'
  - !ruby/object:Gem::Dependency
  name: pry
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '0.14'
+ version: '0.15'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '0.14'
+ version: '0.15'
  - !ruby/object:Gem::Dependency
  name: rubocop
  requirement: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '1.57'
+ version: '1.79'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '1.57'
+ version: '1.79'
  - !ruby/object:Gem::Dependency
  name: toys
  requirement: !ruby/object:Gem::Requirement
@@ -184,14 +211,14 @@ dependencies:
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '3.19'
+ version: '3.25'
  type: :development
  prerelease: false
  version_requirements: !ruby/object:Gem::Requirement
  requirements:
  - - "~>"
  - !ruby/object:Gem::Version
- version: '3.19'
+ version: '3.25'
  - !ruby/object:Gem::Dependency
  name: yard
  requirement: !ruby/object:Gem::Requirement
@@ -274,8 +301,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
  - !ruby/object:Gem::Version
  version: '0'
  requirements: []
- rubygems_version: 3.5.22
- signing_key:
+ rubygems_version: 3.6.7
  specification_version: 4
  summary: Wgit is a HTML web crawler, written in Ruby, that allows you to programmatically
  extract the data you want from the web.
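The dependency changes in the metadata all use the pessimistic `~>` operator, so each bump moves the lower bound of the accepted range. As a rough illustration using RubyGems' own classes, the sketch below compares the old and new ferrum constraints from the hunks above. The sample version strings are arbitrary and are not claimed to be real ferrum releases.

```ruby
require 'rubygems'

# Constraints copied from the metadata diff for the ferrum dependency.
old_req = Gem::Requirement.new('~> 0.14')
new_req = Gem::Requirement.new('~> 0.17')

# Arbitrary sample versions chosen to straddle both bounds.
samples = %w[0.14.0 0.16.9 0.17.0 0.17.1 1.0.0]

results = samples.map do |v|
  version = Gem::Version.new(v)
  [v, old_req.satisfied_by?(version), new_req.satisfied_by?(version)]
end

results.each do |v, old_ok, new_ok|
  puts "#{v}: old=#{old_ok} new=#{new_ok}"
end
```

A two-segment constraint such as `~> 0.17` accepts anything from 0.17 up to, but not including, 1.0, so installs resolving to a 0.14 through 0.16 release of ferrum would no longer satisfy the 0.12.1 gemspec.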