Dhalang 0.7.0 → 0.7.2
- checksums.yaml +4 -4
- data/Gemfile.lock +7 -7
- data/README.md +16 -19
- data/lib/Dhalang/configuration.rb +4 -1
- data/lib/Dhalang/url_utils.rb +5 -4
- data/lib/Dhalang/version.rb +1 -1
- data/lib/js/dhalang.js +16 -6
- data/lib/js/html-scraper.js +5 -2
- data/lib/js/pdf-generator.js +5 -2
- data/lib/js/screenshot-generator.js +5 -2
- data/package-lock.json +254 -201
- data/package.json +2 -2
- data/renovate.json +15 -0
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: f8181be50a21d8c3b688b3ba8aa7f683f1bd6101acc36b6623b2948ffd0462ca
|
4
|
+
data.tar.gz: 7f04e3437befb446f2e2a45b90f46ffe17e97c56d570299ac1852c4d3fd2ed9a
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 5a45db0cb7cbf08828e99ebd1a86fdc8e3dc8ac1fb01d10d3c521b2bbe38168fdd93e79271fe766504a1868d9cec124390ec3bf65cf513354ee6c974f0e50e37
|
7
|
+
data.tar.gz: 6f56ec00f075b58242fdf55c5d49ca2cb545fcf3b249942c1c4a0f1b102623ce7d633b7db25bb5f04450b25b599580fd6a81c502e434710e282cfcd6ed1af8f0
|
data/Gemfile.lock
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
PATH
|
2
2
|
remote: .
|
3
3
|
specs:
|
4
|
-
Dhalang (0.7.
|
4
|
+
Dhalang (0.7.2)
|
5
5
|
|
6
6
|
GEM
|
7
7
|
remote: https://rubygems.org/
|
8
8
|
specs:
|
9
|
-
Ascii85 (1.1.
|
9
|
+
Ascii85 (1.1.1)
|
10
10
|
afm (0.2.2)
|
11
|
-
bigdecimal (3.1.
|
11
|
+
bigdecimal (3.1.8)
|
12
12
|
diff-lcs (1.5.1)
|
13
13
|
fastimage (2.2.7)
|
14
14
|
hashery (2.1.2)
|
@@ -23,15 +23,15 @@ GEM
|
|
23
23
|
rspec-core (~> 3.13.0)
|
24
24
|
rspec-expectations (~> 3.13.0)
|
25
25
|
rspec-mocks (~> 3.13.0)
|
26
|
-
rspec-core (3.13.
|
26
|
+
rspec-core (3.13.2)
|
27
27
|
rspec-support (~> 3.13.0)
|
28
|
-
rspec-expectations (3.13.
|
28
|
+
rspec-expectations (3.13.3)
|
29
29
|
diff-lcs (>= 1.2.0, < 2.0)
|
30
30
|
rspec-support (~> 3.13.0)
|
31
|
-
rspec-mocks (3.13.
|
31
|
+
rspec-mocks (3.13.2)
|
32
32
|
diff-lcs (>= 1.2.0, < 2.0)
|
33
33
|
rspec-support (~> 3.13.0)
|
34
|
-
rspec-support (3.13.
|
34
|
+
rspec-support (3.13.2)
|
35
35
|
ruby-rc4 (0.1.5)
|
36
36
|
ttfunk (1.8.0)
|
37
37
|
bigdecimal (~> 3.1)
|
data/README.md
CHANGED
@@ -1,4 +1,4 @@
|
|
1
|
-
# Dhalang [](https://github.com/NielsSteensma/Dhalang/actions/workflows/build.yml)
|
1
|
+
# Dhalang [](https://github.com/NielsSteensma/Dhalang/actions/workflows/build.yml) [](https://badge.fury.io/rb/Dhalang)
|
2
2
|
|
3
3
|
> Dhalang is a Ruby wrapper for Google's Puppeteer.
|
4
4
|
|
@@ -11,7 +11,11 @@
|
|
11
11
|
* Scrape HTML from webpages
|
12
12
|
|
13
13
|
|
14
|
-
|
14
|
+
## Prerequisites
|
15
|
+
* Node ≥ 18
|
16
|
+
* Puppeteer ≥ 22
|
17
|
+
* Unix shell ( Dhalang will not work on Windows shells )
|
18
|
+
|
15
19
|
## Installation
|
16
20
|
Add this line to your application's Gemfile:
|
17
21
|
|
@@ -21,11 +25,12 @@ And then execute:
|
|
21
25
|
|
22
26
|
$ bundle update
|
23
27
|
|
24
|
-
Install puppeteer in your application's root directory:
|
28
|
+
Install puppeteer or puppeteer-core in your application's root directory:
|
25
29
|
|
26
|
-
$ npm install puppeteer
|
30
|
+
$ npm install puppeteer
|
31
|
+
or
|
32
|
+
$ npm install puppeteer-core
|
27
33
|
|
28
|
-
<sub>Dhalang and Puppeteer require Node ≥ 18 and Puppeteer ≥ 22</sub>
|
29
34
|
## Usage
|
30
35
|
__PDF of a website url__
|
31
36
|
```ruby
|
@@ -75,10 +80,9 @@ For example to only take a screenshot of the visible part of the page:
|
|
75
80
|
Dhalang::Screenshot.get_from_url("https://www.google.com", :webp, {fullPage: false})
|
76
81
|
```
|
77
82
|
|
78
|
-
A list of all possible PDF options that can be set, can be found at: https://github.com/puppeteer/puppeteer/blob/main/docs/api.md
|
79
|
-
|
80
|
-
A list of all possible screenshot options that can be set, can be found at: https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pagescreenshotoptions
|
83
|
+
A list of all possible PDF options that can be set, can be found at: https://github.com/puppeteer/puppeteer/blob/main/docs/api/puppeteer.pdfoptions.md
|
81
84
|
|
85
|
+
A list of all possible screenshot options that can be set, can be found at: https://github.com/puppeteer/puppeteer/blob/main/docs/api/puppeteer.screenshotoptions.md
|
82
86
|
> The default Puppeteer options contain the options `headerTemplate` and `footerTemplate`. Puppeteer expects these to be HTML strings. By default, the Dhalang
|
83
87
|
> gem passes all options as arguments in a `node ...` shell command. In case the HTML strings are too long they might surpass the maximum
|
84
88
|
> argument length of the host. For example, on Linux the `MAX_ARG_LEN` is 128kB. Therefore, you can also pass the headers and footers as file path using the
|
@@ -86,24 +90,17 @@ A list of all possible screenshot options that can be set, can be found at: http
|
|
86
90
|
>
|
87
91
|
> For example: `Dhalang::PDF.get_from_url("https://www.google.com", {headerTemplateFile: '/tmp/header.html', footerTemplateFile: '/tmp/footer.html'})`
|
88
92
|
|
89
|
-
|
90
|
-
## Custom user options
|
91
|
-
You may want to change the way Dhalang interacts with Puppeteer in general. User options can be set by providing them in a hash as last argument to any calls you make to the library. Are you setting both custom PDF and user options? Then they should be passed as a single hash.
|
92
|
-
|
93
|
-
For example to set a custom navigation timeout:
|
94
|
-
```ruby
|
95
|
-
Dhalang::Screenshot.get_from_url("https://www.google.com", :jpeg, {navigationTimeout: 20000})
|
96
|
-
```
|
97
|
-
|
98
|
-
Below table lists all possible configuration parameters that can be set:
|
93
|
+
Below table lists more configuration parameters that can be set:
|
99
94
|
| Key | Description | Default |
|
100
95
|
|--------------------|-----------------------------------------------------------------------------------------|---------------------------------|
|
96
|
+
| isHeadless | Indicates if Chromium should be launched headless (useful for debugging) | true |
|
97
|
+
| slowMo | Amount of milliseconds to slow down Puppeteer operations (useful for debugging) | 0 |
|
98
|
+
| browserWebsocketUrl | Websocket url of remote chromium browser to use | None |
|
101
99
|
| navigationTimeout | Amount of milliseconds until Puppeteer while timeout when navigating to the given page | 10000 |
|
102
100
|
| printToPDFTimeout | Amount of milliseconds until Puppeteer while timeout when calling Page.printToPDF | 0 (unlimited) |
|
103
101
|
| navigationWaitForSelector | If set, Dhalang will wait for the specified selector to appear before creating the screenshot or PDF | None |
|
104
102
|
| navigationWaitForXPath | If set, Dhalang will wait for the specified XPath to appear before creating the screenshot or PDF | None |
|
105
103
|
| userAgent | User agent to send with the request | Default Puppeteer one |
|
106
|
-
| isHeadless | Indicates if Chromium should be launched headless | true |
|
107
104
|
| isAutoHeight | When set to true the height of generated PDFs will be based on the scrollHeight property of the document body | false |
|
108
105
|
| viewPort | Custom viewport to use for the request | Default Puppeteer one |
|
109
106
|
| httpAuthenticationCredentials | Custom HTTP authentication credentials to use for the request | None |
|
data/lib/Dhalang/configuration.rb
CHANGED
@@ -3,6 +3,7 @@ module Dhalang
|
|
3
3
|
class Configuration
|
4
4
|
NODE_MODULES_PATH = Dir.pwd + '/node_modules/'.freeze
|
5
5
|
USER_OPTIONS = {
|
6
|
+
browserWebsocketUrl: '',
|
6
7
|
navigationTimeout: 10000,
|
7
8
|
printToPDFTimeout: 0, # unlimited
|
8
9
|
navigationWaitUntil: 'load',
|
@@ -13,7 +14,8 @@ module Dhalang
|
|
13
14
|
viewPort: '',
|
14
15
|
httpAuthenticationCredentials: '',
|
15
16
|
isAutoHeight: false,
|
16
|
-
chromeOptions: []
|
17
|
+
chromeOptions: [],
|
18
|
+
slowMo: 0
|
17
19
|
}.freeze
|
18
20
|
DEFAULT_PDF_OPTIONS = {
|
19
21
|
scale: 1,
|
@@ -48,6 +50,7 @@ module Dhalang
|
|
48
50
|
private_constant :DEFAULT_JPEG_OPTIONS
|
49
51
|
|
50
52
|
private attr_accessor :page_url
|
53
|
+
private attr_accessor :browser_websocket_url
|
51
54
|
private attr_accessor :temp_file_path
|
52
55
|
private attr_accessor :temp_file_extension
|
53
56
|
private attr_accessor :user_options
|
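The hunks above add `browserWebsocketUrl` and `slowMo` to the frozen `USER_OPTIONS` defaults. As a rough sketch of how such defaults could be combined with caller-supplied values, the snippet below overlays a custom hash on the defaults and checks whether a remote browser was requested. The names `DEFAULT_USER_OPTIONS`, `resolve_user_options`, and `remote_browser?` are hypothetical and not part of the gem; only the keys visible in the diff are listed, and the gem's real merging logic is not shown in this diff and may differ.

```ruby
# Defaults copied from the USER_OPTIONS hunk above.
# Keys hidden between the two hunks are intentionally left out.
DEFAULT_USER_OPTIONS = {
  browserWebsocketUrl: '',
  navigationTimeout: 10000,
  printToPDFTimeout: 0, # unlimited
  navigationWaitUntil: 'load',
  viewPort: '',
  httpAuthenticationCredentials: '',
  isAutoHeight: false,
  chromeOptions: [],
  slowMo: 0
}.freeze

# Hypothetical helper (not part of Dhalang): overlays caller-supplied values
# on the defaults and drops keys that are not known user options.
def resolve_user_options(custom = {})
  known = custom.select { |key, _value| DEFAULT_USER_OPTIONS.key?(key) }
  DEFAULT_USER_OPTIONS.merge(known)
end

# Hypothetical helper: an empty string means "launch a local browser",
# mirroring the `!== ""` check in the JavaScript side of this diff.
def remote_browser?(options)
  options[:browserWebsocketUrl] != ''
end

options = resolve_user_options(slowMo: 250, browserWebsocketUrl: 'ws://127.0.0.1:3000')
puts options[:slowMo]
puts remote_browser?(options)
```

Using the empty string as the "not set" sentinel keeps the Ruby defaults and the JavaScript check in agreement, at the cost of not distinguishing "unset" from "explicitly empty".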
data/lib/Dhalang/url_utils.rb
CHANGED
@@ -6,9 +6,10 @@ module Dhalang
|
|
6
6
|
#
|
7
7
|
# @param [String] url The url to validate
|
8
8
|
def self.validate(url)
|
9
|
-
|
10
|
-
|
11
|
-
|
9
|
+
parsed = URI.parse(url) # Raise URI::InvalidURIError on invalid URLs
|
10
|
+
return true if parsed.absolute?
|
11
|
+
|
12
|
+
raise URI::InvalidURIError, 'The given url was invalid, use format http://www.example.com'
|
12
13
|
end
|
13
14
|
end
|
14
|
-
end
|
15
|
+
end
|
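The rewritten `validate` relies on two behaviours of Ruby's standard `uri` library: `URI.parse` raises `URI::InvalidURIError` for malformed input, and `absolute?` is true only when the parsed URI has a scheme. The sketch below reproduces the added lines in a standalone module so the behaviour can be tried without the gem. `UrlCheck` and `valid_url?` are hypothetical names used for illustration only.

```ruby
require 'uri'

# Standalone copy of the validation logic from the hunk above.
# `UrlCheck` is a hypothetical stand-in for the gem's own module.
module UrlCheck
  def self.validate(url)
    parsed = URI.parse(url) # raises URI::InvalidURIError on malformed input
    return true if parsed.absolute?

    raise URI::InvalidURIError, 'The given url was invalid, use format http://www.example.com'
  end
end

# Hypothetical convenience wrapper that turns the exception into a boolean.
def valid_url?(url)
  UrlCheck.validate(url)
rescue URI::InvalidURIError
  false
end

puts valid_url?('https://www.google.com') # has a scheme, so it is absolute
puts valid_url?('www.google.com')         # no scheme, so it is rejected
puts valid_url?('http://exa mple.com')    # the space makes URI.parse raise
```

Because `absolute?` only checks for a scheme, non-HTTP URLs such as `file:///tmp/page.html` also pass this check.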
data/lib/Dhalang/version.rb
CHANGED
data/lib/js/dhalang.js
CHANGED
@@ -14,6 +14,7 @@ const fs = require('fs')
|
|
14
14
|
|
15
15
|
/**
|
16
16
|
* @typedef {Object} UserOptions
|
17
|
+
* @property {string} browserWebsocketUrl - The websocket url of remote Chromium browser to use.
|
17
18
|
* @property {number} navigationTimeout - Maximum in milliseconds until navigation times out, we use a default of 10 seconds as timeout.
|
18
19
|
* @property {string} navigationWaitUntil - Determines when the navigation was finished, we wait here until the Window.load event is fired ( meaning all images, stylesheet, etc was loaded ).
|
19
20
|
* @property {string} navigationWaitForSelector - If set, specifies the selector Puppeteer should wait for to appear before continuing.
|
@@ -23,6 +24,7 @@ const fs = require('fs')
|
|
23
24
|
* @property {Object} viewPort - The view port to use.
|
24
25
|
* @property {Object} httpAuthenticationCredentials - The credentials to use for HTTP authentication.
|
25
26
|
* @property {boolean} isAutoHeight - The height is automatically set
|
27
|
+
* @property {number} slowMo - Amount of milliseconds to slow down Puppeteer operations.
|
26
28
|
*/
|
27
29
|
|
28
30
|
/**
|
@@ -47,7 +49,7 @@ exports.getConfiguration = function () {
|
|
47
49
|
|
48
50
|
/**
|
49
51
|
* Launches Puppeteer and returns its instance.
|
50
|
-
* @param {
|
52
|
+
* @param {Configuration} configuration - The configuration to use.
|
51
53
|
* @returns {Promise<Object>}
|
52
54
|
* The launched instance of Puppeteer.
|
53
55
|
*/
|
@@ -55,10 +57,18 @@ exports.launchPuppeteer = async function (configuration) {
|
|
55
57
|
module.paths.push(configuration.puppeteerPath);
|
56
58
|
const puppeteer = require('puppeteer');
|
57
59
|
const launchArgs = ['--no-sandbox', '--disable-setuid-sandbox'].concat(configuration.userOptions.chromeOptions).filter((item, index, self) => self.indexOf(item) === index);
|
58
|
-
|
59
|
-
|
60
|
-
|
61
|
-
|
60
|
+
|
61
|
+
if (configuration.userOptions['browserWebsocketUrl'] !== "") {
|
62
|
+
return await puppeteer.connect( {
|
63
|
+
"browserWSEndpoint": configuration.userOptions.browserWebsocketUrl
|
64
|
+
})
|
65
|
+
} else {
|
66
|
+
return await puppeteer.launch({
|
67
|
+
args: launchArgs,
|
68
|
+
headless: configuration.userOptions.isHeadless,
|
69
|
+
slowMo: configuration.userOptions.slowMo
|
70
|
+
});
|
71
|
+
}
|
62
72
|
}
|
63
73
|
|
64
74
|
/**
|
@@ -148,7 +158,7 @@ exports.getConfiguredPdfOptions = async function (page, configuration) {
|
|
148
158
|
exports.getNavigationParameters = function (configuration) {
|
149
159
|
return {
|
150
160
|
timeout: configuration.userOptions.navigationTimeout,
|
151
|
-
|
161
|
+
waitUntil: configuration.userOptions.navigationWaitUntil
|
152
162
|
}
|
153
163
|
}
|
154
164
|
|
data/lib/js/html-scraper.js
CHANGED
@@ -6,9 +6,10 @@ const scrapeHtml = async () => {
|
|
6
6
|
const configuration = dhalang.getConfiguration();
|
7
7
|
|
8
8
|
let browser;
|
9
|
+
let page;
|
9
10
|
try {
|
10
11
|
browser = await dhalang.launchPuppeteer(configuration);
|
11
|
-
|
12
|
+
page = await browser.newPage();
|
12
13
|
await dhalang.configure(page, configuration.userOptions);
|
13
14
|
await dhalang.navigate(page, configuration);
|
14
15
|
const html = await page.content();
|
@@ -17,8 +18,10 @@ const scrapeHtml = async () => {
|
|
17
18
|
console.error(error.message);
|
18
19
|
process.exit(1);
|
19
20
|
} finally {
|
20
|
-
if (browser) {
|
21
|
+
if (browser && configuration.userOptions['browserWebsocketUrl'] === "") {
|
21
22
|
browser.close();
|
23
|
+
} else {
|
24
|
+
page.close();
|
22
25
|
}
|
23
26
|
process.exit(0);
|
24
27
|
}
|
data/lib/js/pdf-generator.js
CHANGED
@@ -5,9 +5,10 @@ const createPdf = async () => {
|
|
5
5
|
const configuration = dhalang.getConfiguration();
|
6
6
|
|
7
7
|
let browser;
|
8
|
+
let page;
|
8
9
|
try {
|
9
10
|
browser = await dhalang.launchPuppeteer(configuration);
|
10
|
-
|
11
|
+
page = await browser.newPage();
|
11
12
|
await dhalang.configure(page, configuration.userOptions);
|
12
13
|
await dhalang.navigate(page, configuration);
|
13
14
|
const pdfOptions = await dhalang.getConfiguredPdfOptions(page, configuration);
|
@@ -21,8 +22,10 @@ const createPdf = async () => {
|
|
21
22
|
console.error(error.message);
|
22
23
|
process.exit(1);
|
23
24
|
} finally {
|
24
|
-
if (browser) {
|
25
|
+
if (browser && configuration.userOptions['browserWebsocketUrl'] === "") {
|
25
26
|
browser.close();
|
27
|
+
} else {
|
28
|
+
page.close();
|
26
29
|
}
|
27
30
|
process.exit();
|
28
31
|
}
|
data/lib/js/screenshot-generator.js
CHANGED
@@ -5,9 +5,10 @@ const createScreenshot = async () => {
|
|
5
5
|
const configuration = dhalang.getConfiguration();
|
6
6
|
|
7
7
|
let browser;
|
8
|
+
let page;
|
8
9
|
try {
|
9
10
|
browser = await dhalang.launchPuppeteer(configuration);
|
10
|
-
|
11
|
+
page = await browser.newPage();
|
11
12
|
await dhalang.configure(page, configuration.userOptions);
|
12
13
|
await dhalang.navigate(page, configuration);
|
13
14
|
|
@@ -23,8 +24,10 @@ const createScreenshot = async () => {
|
|
23
24
|
console.error(error.message);
|
24
25
|
process.exit(1);
|
25
26
|
} finally {
|
26
|
-
if (browser) {
|
27
|
+
if (browser && configuration.userOptions['browserWebsocketUrl'] === "") {
|
27
28
|
browser.close();
|
29
|
+
} else {
|
30
|
+
page.close();
|
28
31
|
}
|
29
32
|
process.exit();
|
30
33
|
}
|